
© Michael Creel
Version 0.70, September, 2005

Dept. of Economics and Economic History, Universitat Autònoma de Barcelona
michael.creel@uab.es, http://pareto.uab.es/mcreel

Contents

List of Figures
List of Tables
Chapter 1. About this document
  1.1. License
  1.2. Obtaining the materials
  1.3. An easy way to use LyX and Octave today
  1.4. Known Bugs
Chapter 2. Introduction: Economic and econometric models
Chapter 3. Ordinary Least Squares
  3.1. The Linear Model
  3.2. Estimation by least squares
  3.3. Geometric interpretation of least squares estimation
  3.4. Influential observations and outliers
  3.5. Goodness of fit
  3.6. The classical linear regression model
  3.7. Small sample statistical properties of the least squares estimator
  3.8. Example: The Nerlove model
  Exercises
Chapter 4. Maximum likelihood estimation
  4.1. The likelihood function
  4.2. Consistency of MLE
  4.3. The score function
  4.4. Asymptotic normality of MLE
  4.6. The information matrix equality
  4.7. The Cramér-Rao lower bound
  Exercises
Chapter 5. Asymptotic properties of the least squares estimator
  5.1. Consistency
  5.2. Asymptotic normality
  5.3. Asymptotic efficiency
Chapter 6. Restrictions and hypothesis tests
  6.1. Exact linear restrictions
  6.2. Testing
  6.3. The asymptotic equivalence of the LR, Wald and score tests
  6.4. Interpretation of test statistics
  6.5. Confidence intervals
  6.6. Bootstrapping
  6.7. Testing nonlinear restrictions, and the Delta Method
  6.8. Example: the Nerlove data
Chapter 7. Generalized least squares
  7.1. Effects of nonspherical disturbances on the OLS estimator
  7.2. The GLS estimator
  7.3. Feasible GLS
  7.4. Heteroscedasticity
  7.5. Autocorrelation
  Exercises
Chapter 8. Stochastic regressors
  8.1. Case 1
  8.2. Case 2
  8.3. Case 3
  8.4. When are the assumptions reasonable?
  Exercises
Chapter 9. Data problems
  9.1. Collinearity
  9.2. Measurement error
  9.3. Missing observations
  Exercises
Chapter 10. Functional form and nonnested tests
  10.1. Flexible functional forms
  10.2. Testing nonnested hypotheses
Chapter 11. Exogeneity and simultaneity
  11.1. Simultaneous equations
  11.2. Exogeneity
  11.3. Reduced form
  11.4. IV estimation
  11.5. Identification by exclusion restrictions
  11.6. 2SLS
  11.7. Testing the overidentifying restrictions
  11.8. System methods of estimation
  11.9. Example: 2SLS and Klein's Model 1
Chapter 12. Introduction to the second half
Chapter 13. Numeric optimization methods
  13.1. Search
  13.2. Derivative-based methods
  13.3. Simulated Annealing
  13.4. Examples
  13.5. Duration data and the Weibull model
  13.6. Numeric optimization: pitfalls
  Exercises
Chapter 14. Asymptotic properties of extremum estimators
  14.1. Extremum estimators
  14.2. Consistency
  14.3. Example: Consistency of Least Squares
  14.4. Asymptotic Normality
  14.5. Examples
  14.6. Example: Linearization of a nonlinear model
  Exercises
Chapter 15. Generalized method of moments (GMM)
  15.1. Definition
  15.2. Consistency
  15.3. Asymptotic normality
  15.4. Choosing the weighting matrix
  15.5. Estimation of the variance-covariance matrix
  15.6. Estimation using conditional moments
  15.7. Estimation using dynamic moment conditions
  15.8. A specification test
  15.9. Other estimators interpreted as GMM estimators
  15.10. Example: The Hausman Test
  15.11. Application: Nonlinear rational expectations
  15.12. Empirical example: a portfolio model
  Exercises
Chapter 16. Quasi-ML
Chapter 17. Nonlinear least squares (NLS)
  17.1. Introduction and definition
  17.2. Identification
  17.3. Consistency
  17.4. Asymptotic normality
  17.5. Example: The Poisson model for count data
  17.6. The Gauss-Newton algorithm
  17.7. Application: Limited dependent variables and sample selection
Chapter 18. Nonparametric inference
  18.1. Possible pitfalls of parametric inference: estimation
  18.2. Possible pitfalls of parametric inference: hypothesis testing
  18.3. The Fourier functional form
  18.4. Kernel regression estimators
  18.5. Kernel density estimation
  18.6. Semi-nonparametric maximum likelihood
  18.7. Examples
Chapter 19. Simulation-based estimation
  19.1. Motivation
  19.2. Simulated maximum likelihood (SML)
  19.3. Method of simulated moments (MSM)
  19.4. Efficient method of moments (EMM)
  19.5. Example: estimation of stochastic differential equations
Chapter 20. Parallel programming for econometrics
Chapter 21. Introduction to Octave
  21.1. Getting started
  21.2. A short introduction
  21.3. If you're running a Linux installation...
Chapter 22. Notation and Review
  22.1. Notation for differentiation of vectors and matrices
  22.2. Convergence modes
  22.3. Rates of convergence and asymptotic equality
  Exercises
Chapter 23. The GPL
Chapter 24. The attic
  24.1. MEPS data: more on count models
  24.2. Hurdle models
  24.3. Models for time series data
Bibliography
Index

List of Figures

- LyX
- Octave
- Typical data, Classical Model
- Example OLS Fit
- The fit in observation space
- Detection of influential observations
- Uncentered R²
- Unbiasedness of OLS under classical assumptions
- Biasedness of OLS when an assumption fails
- Gauss-Markov Result: The OLS estimator
- Gauss-Markov Result: The split sample estimator
- Joint and Individual Confidence Regions
- RTS as a function of firm size
- Residuals, Nerlove model, sorted by firm size
- Autocorrelation induced by misspecification
- Durbin-Watson critical values
- Residuals of simple Nerlove model
- OLS residuals, Klein consumption equation
- When there is no collinearity
- When there is collinearity
- Sample selection bias
- The search method
- Increasing directions of search
- Newton-Raphson method
- Using MuPAD to get analytic derivatives
- Life expectancy of mongooses, Weibull model
- Life expectancy of mongooses, mixed Weibull model
- A foggy mountain
- OLS and IV estimators when regressors and errors are correlated
- Running an Octave program

List of Tables

1. Marginal Variances, Sample and Estimated (Poisson)
2. Marginal Variances, Sample and Estimated (NB-II)
3. Actual and Poisson fitted frequencies
4. Actual and Hurdle Poisson fitted frequencies
5. Information Criteria, OBDV

CHAPTER 1. About this document

This document integrates lecture notes for a one year graduate level course with computer programs that illustrate and apply the methods that are studied. The immediate availability of executable (and modifiable) example programs when using the PDF¹ version of the document is one of the advantages of the system that has been used. On the other hand, when viewed in printed form, the document is a somewhat terse approximation to a textbook. These notes are not intended to be a perfect substitute for a printed textbook. If you are a student of mine, please note that last sentence carefully. There are many good textbooks available. A few of my favorites are listed in the bibliography.

With respect to contents, the emphasis is on estimation and inference within the world of stationary data, with a bias toward microeconometrics. The second half is somewhat more polished than the first half, since I have taught that course more often.

If you take a moment to read the licensing information in the next section, you'll see that you are free to copy and modify the document. If anyone would like to contribute material that expands the contents, it would be very welcome. Error corrections and other additions are also welcome. As an example of a project that has made use of these notes, see these very nice lecture slides.

¹It is possible to have the program links open up in an editor, ready to run using keyboard macros. To do this with the PDF version you need to do some setup work. See the bootable CD described below.

1.1. License

All materials are copyrighted by Michael Creel with the date that appears above. They are provided under the terms of the GNU General Public License, which forms Section 23 of the notes. The main thing you need to know is that you are free to modify and distribute these materials in any way you like, as long as you do so under the terms of the GPL. In particular, you must make available the source files, in editable form, for your modified version of the materials.

1.2. Obtaining the materials

The materials are available on my web page, at pareto.uab.es/mcreel/Econometrics/, in a variety of forms including PDF and the editable sources. In addition to the final product, which you're looking at in some form now, you can obtain the editable sources, which will allow you to create your own version, if you like, or send error corrections and contributions.

The main document was prepared using LyX (www.lyx.org) and Octave (www.octave.org). LyX is a free² "what you see is what you mean" word processor, basically working as a graphical frontend to LaTeX. It (with help from other applications) can export your work in LaTeX, HTML, PDF and several other forms. It will run on Linux, Windows, and MacOS systems. Figure 1.1 shows LyX editing this document.

GNU Octave has been used for the example programs, which are scattered through the document. This choice is motivated by several factors. The first is the high quality of the Octave environment for doing applied econometrics. The fundamental tools exist and are implemented in a way that makes extending them fairly easy.

²"Free" is used in the sense of "freedom".

Secondly, Octave's licensing philosophy fits in with the goals of this project. Thirdly, it runs on Linux, Windows and MacOS. The example programs included here may convince you of this point. Figure 1.2 shows an Octave program being edited by NEdit, and the result of running the program in a shell window.

1.3. An easy way to use LyX and Octave today

The example programs are available as links to files on my web page in the PDF version, and here. Support files needed to run these are available here. The files won't run properly from your browser, since there are dependencies

between files; they are only illustrative when browsing. To see how to use these files (edit and run them), you should go to the home page of this document, since you will probably want to download the PDF version together with all the support files and examples. Then set the base URL of the PDF file to point to wherever the Octave files are installed.

All of this may sound a bit complicated, because it is. An easier solution is available: the file pareto.uab.es/mcreel/Econometrics/econometrics.iso is an ISO image file that may be burnt to CDROM. It contains a bootable-from-CD GNU/Linux

system that has all of the tools needed to edit this document, run the Octave example programs, etc. It is based upon the Knoppix GNU/Linux distribution; see the Knoppix web page for more information. The CD automatically detects the hardware of your computer, and will not touch your hard disk unless you explicitly tell it to do so. In particular, it will allow you to cut out small portions of the notes and edit them, with some material removed and other added, and send them to me as LyX (or TeX) files for inclusion in future versions. Think error corrections, additions, etcetera. Additionally, you can use it to install Debian GNU/Linux on your computer (run knoppix-installer as the root user). The versions of programs on the CD may be quite out of date, possibly with security problems that have not been fixed, so if you do a hard disk installation you should do apt-get update, apt-get upgrade toot sweet.

1.4. Known Bugs

This section is a reminder to myself to try to fix a few things. The PDF version has hyperlinks to figures that jump to the wrong figure. The numbers are correct, but the links are not. ps2pdf bugs?

CHAPTER 2. Introduction: Economic and econometric models

Economic theory tells us that the demand function for a good is something like:

$x_i = x_i(p_i, m_i, z_i)$

where
- $x_i$ is the quantity demanded,
- $p_i$ is the vector of prices of the good and its substitutes and complements,
- $m_i$ is income, and
- $z_i$ is a vector of other variables such as individual characteristics that affect preferences.

Suppose we have a sample consisting of one observation on $n$ individuals' demands at time period $t$ (this is a cross section, where $i = 1, 2, \ldots, n$ indexes the individuals in the sample). The individual demand functions are $x_i = x_i(p_i, m_i, z_i)$. The model is not estimable as it stands, since:
- The form of the demand function is different for all $i$.
- Some components of $z_i$ may not be observable to an outside modeler. For example, people don't eat the same lunch every day, and you can't tell what they will order just by looking at them. Suppose we can

break $z_i$ into the observable components $w_i$ and a single unobservable component $\varepsilon_i$. A step toward an estimable econometric model is to suppose that the model may be written as

$x_i = \beta_1 + p_i'\beta_p + m_i \beta_m + w_i'\beta_w + \varepsilon_i$

We have imposed a number of restrictions on the theoretical model:
- The functions $x_i(\cdot)$, which in principle may differ for all $i$, have been restricted to all belong to the same parametric family.
- Of all parametric families of functions, we have restricted the model to the class of linear in the variables functions.
- The parameters are constant across individuals.
- There is a single unobservable component, and we assume it is additive.

If we assume nothing about the error term $\varepsilon$, we can always write the last equation. But in order for the coefficients $\beta$ to have an economic meaning, and in order to be able to estimate them from sample data, we need to make additional assumptions. These additional assumptions have no theoretical basis: they are assumptions on top of those needed to prove the existence of a demand function. The validity of any results we obtain using this model will be contingent on these additional restrictions being at least approximately correct. For this reason, specification testing will be needed, to check that the model seems to be reasonable. Only when we are convinced that the model is at least approximately correct should we use it for economic analysis.

Hopefully the above example makes it clear that there are many possible sources of misspecification of econometric models. When testing a hypothesis using an econometric model, three factors can cause a statistical test to reject the null hypothesis:
(1) the hypothesis is false
(2) a type I error has occurred
(3) the econometric model is not correctly specified, so the test does not have the assumed distribution
We would like to ensure that the third reason is not contributing to rejections, so that rejection will be due to either the first or second reasons. In the next few sections we will obtain results supposing that the econometric model is entirely correctly specified. Later we will examine the consequences of misspecification and see some methods for determining if a model is correctly specified. Later on, econometric methods that seek to minimize maintained assumptions are introduced.

CHAPTER 3. Ordinary Least Squares

3.1. The Linear Model

Suppose that we want to use data to try to determine the best linear approximation to a variable $y$ using the variables $x$. The data $\{(y_t, x_t)\}$, $t = 1, 2, \ldots, n$ are obtained by some form of sampling¹. An individual observation is thus

$y_t = x_t'\beta + \varepsilon_t$

The dependent variable $y_t$ is a scalar random variable, and $x_t$ is a $k$-vector of explanatory variables. We can consider a model that is a linear approximation:

Linearity: the model is a linear function of the parameter vector $\beta^0$:

$y_t = \beta_1^0 x_{t1} + \beta_2^0 x_{t2} + \cdots + \beta_k^0 x_{tk} + \varepsilon_t$

or, using vector notation:

$y_t = x_t'\beta^0 + \varepsilon_t$

The superscript "0" in $\beta^0$ means this is the "true value" of the unknown parameter. It will be defined more precisely later, and usually suppressed when it's not necessary for clarity.

¹For example, cross-sectional data may be obtained by random sampling. Time series data accumulate historically.

The $n$ observations can be written in matrix form as

$y = X\beta + \varepsilon$    (3.1.1)

where $y = (y_1 \; y_2 \; \cdots \; y_n)'$ is $n \times 1$ and $X = (x_1 \; x_2 \; \cdots \; x_n)'$.

Linear models are more general than they might first appear, since one can employ nonlinear transformations of the variables:

$\varphi_0(z_t) = \left[\varphi_1(w_t) \;\; \varphi_2(w_t) \;\; \cdots \;\; \varphi_p(w_t)\right]\beta^0 + \varepsilon_t$

where the $\varphi_i()$ are known functions. Defining $y_t = \varphi_0(z_t)$, $x_{t1} = \varphi_1(w_t)$, etc. leads to a model in the form of equation 3.6.1. For example, the Cobb-Douglas model

$z = A w_2^{\beta_2} w_3^{\beta_3} \exp(\varepsilon)$

can be transformed logarithmically to obtain

$\ln z = \ln A + \beta_2 \ln w_2 + \beta_3 \ln w_3 + \varepsilon$

If we define $y = \ln z$, $\beta_1 = \ln A$, etc., we can put the model in the form needed. The approximation is linear in the parameters, but not necessarily linear in the variables.
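The logarithmic transformation can be checked numerically. The notes' examples use Octave; the following is an equivalent sketch in Python/NumPy, with made-up parameter values ($A = 2$, $\beta_2 = 0.6$, $\beta_3 = 0.3$) chosen only for illustration. After taking logs, the model is linear in the parameters and ordinary least squares recovers them:

```python
import numpy as np

# Hypothetical values for illustration only (A = 2, beta2 = 0.6, beta3 = 0.3).
rng = np.random.default_rng(0)
n = 500
A, b2, b3 = 2.0, 0.6, 0.3
w2 = rng.uniform(1.0, 10.0, n)
w3 = rng.uniform(1.0, 10.0, n)
eps = rng.normal(0.0, 0.05, n)
z = A * w2**b2 * w3**b3 * np.exp(eps)   # Cobb-Douglas: nonlinear in the variables

# After taking logs the model is linear in the parameters (ln A, beta2, beta3).
X = np.column_stack([np.ones(n), np.log(w2), np.log(w3)])
y = np.log(z)
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)   # close to [ln 2, 0.6, 0.3]
```

The nonlinearity is entirely in the variables; the estimation problem itself is the linear one of equation 3.1.1.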

3.2. Estimation by least squares

[Figure 3.2.1. Typical data, Classical Model: red crosses are the data points, the green line is the true regression line, for X between 0 and 20.]

Figure 3.2.1, obtained by running TypicalData.m, shows some data that follows the linear model $y_t = \beta_1 + \beta_2 x_{t2} + \varepsilon_t$. The green line is the "true" regression line $\beta_1 + \beta_2 x_{t2}$, and the red crosses are the data points $(x_{t2}, y_t)$, where $\varepsilon_t$ is a random error that has mean zero and is independent of $x_{t2}$. Exactly how the green line is defined will become clear later. In practice, we only have the data, and

we don't know where the green line lies. We need to gain information about the straight line that best fits the data points. The ordinary least squares (OLS) estimator is defined as the value that minimizes the sum of the squared errors:

$\hat\beta = \arg\min_\beta s(\beta)$

where

$s(\beta) = \sum_{t=1}^{n}\left(y_t - x_t'\beta\right)^2 = (y - X\beta)'(y - X\beta) = y'y - 2y'X\beta + \beta'X'X\beta = \lVert y - X\beta \rVert^2$

This last expression makes it clear how the OLS estimator is defined: it minimizes the Euclidean distance between $y$ and $X\beta$. The fitted OLS coefficients will define the best linear approximation to $y$ using the columns of $X$ as basis functions, where "best" means minimum Euclidean distance. One could think of other estimators based upon other metrics. For example, the minimum absolute distance (MAD) estimator minimizes $\sum_t |y_t - x_t'\beta|$. Later, we will see that which estimator is best in terms of their statistical properties, rather than in terms of the metrics that define them, depends upon the properties of $\varepsilon$, about which we have as yet made no assumptions.

To minimize the criterion $s(\beta)$, find the derivative with respect to $\beta$ and set it to zero:

$D_\beta s(\hat\beta) = -2X'y + 2X'X\hat\beta \equiv 0$

so

$\hat\beta = (X'X)^{-1}X'y$

To verify that this is a minimum, check the s.o.s.c.:

$D_\beta^2 s(\hat\beta) = 2X'X$

Since $\rho(X) = K$, this matrix is positive definite, since it's a quadratic form in a p.d. matrix (identity matrix of order $n$), so $\hat\beta$ is in fact a minimizer.

The fitted values are in the vector $\hat y = X\hat\beta$. The residuals are in the vector $\hat\varepsilon = y - X\hat\beta$. Note that

$y = X\beta + \varepsilon = X\hat\beta + \hat\varepsilon$

Also, the first order conditions can be written as

$X'y - X'X\hat\beta = 0$
$X'(y - X\hat\beta) = 0$
$X'\hat\varepsilon = 0$

which is to say, the OLS residuals are orthogonal to $X$. Let's look at this more carefully.
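The first order conditions are easy to verify numerically. A minimal Python/NumPy sketch (the data here are made up; the notes' own programs are in Octave) computes $\hat\beta$ from the normal equations and checks the orthogonality of the residuals to $X$:

```python
import numpy as np

# Made-up data; any full-rank X works for this check.
rng = np.random.default_rng(1)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # solves the normal equations X'X b = X'y
e_hat = y - X @ beta_hat                       # residual vector
print(np.max(np.abs(X.T @ e_hat)))             # zero up to rounding error
```

Since the first column of $X$ is a column of ones, $X'\hat\varepsilon = 0$ also implies that the residuals sum to zero, a fact used again in Section 3.5.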

3.3. Geometric interpretation of least squares estimation

3.3.1. In X, Y Space. Figure 3.3.1 shows a typical fit to data, along with the true regression line. Note that the true line and the estimated line are different. This figure was created by running the Octave program OlsFit.m. You can experiment with changing the parameter values to see how this affects the fit, and to see how the fitted line will sometimes be close to the true line, and sometimes rather far away.

3.3.2. In Observation Space. If we want to plot in observation space, we'll need to use only two or three observations, or we'll encounter some limitations of the blackboard. Let's use two. With only two observations, we can't have $K > 1$.

[Figure 3.3.1. Example OLS Fit: data points, fitted line, and true line, for X between 0 and 20.]

[Figure 3.3.2. The fit in observation space: with two observations, y is decomposed into its projection x*beta = P_x y in the span S(x) and the residual e = M_x y, which is orthogonal to S(x).]

We can decompose $y$ into two components: the orthogonal projection onto the $K$-dimensional space spanned by $X$, $X\hat\beta$, and the component that is the orthogonal projection onto the $n - K$ dimensional subspace that is orthogonal to the span of $X$, $\hat\varepsilon$.

Since $\hat\beta$ is chosen to make $\hat\varepsilon$ as short as possible, $\hat\varepsilon$ will be orthogonal to the space spanned by $X$. Since $X\hat\beta$ is in this space, $X'\hat\varepsilon = 0$. Note that the f.o.c. that define the least squares estimator imply that this is so.

3.3.3. Projection Matrices. $X\hat\beta$ is the projection of $y$ onto the span of $X$, or

$X\hat\beta = X(X'X)^{-1}X'y$

Therefore, the matrix that projects $y$ onto the span of $X$ is

$P_X = X(X'X)^{-1}X'$

since $X\hat\beta = P_X y$. $\hat\varepsilon$ is the projection of $y$ onto the $n - K$ dimensional space that is orthogonal to the span of $X$. We have that

$\hat\varepsilon = y - X\hat\beta = y - X(X'X)^{-1}X'y = \left[I_n - X(X'X)^{-1}X'\right]y$

So the matrix that projects $y$ onto the space orthogonal to the span of $X$ is

$M_X = I_n - X(X'X)^{-1}X' = I_n - P_X$

We have $\hat\varepsilon = M_X y$. Therefore

$y = P_X y + M_X y = X\hat\beta + \hat\varepsilon$

These two projection matrices decompose the $n$-dimensional vector $y$ into two orthogonal components: the portion that lies in the $K$-dimensional space defined by $X$, and the portion that lies in the orthogonal $n - K$ dimensional space. Note that both $P_X$ and $M_X$ are symmetric and idempotent.
- A symmetric matrix $A$ is one such that $A = A'$.
- An idempotent matrix $A$ is one such that $A = AA$.
- The only nonsingular idempotent matrix is the identity matrix.

3.4. Influential observations and outliers

The OLS estimator of the $i$-th element of the vector $\beta^0$ is simply

$\hat\beta_i = \left[(X'X)^{-1}X'\right]_{i\cdot} y = c_i'y$
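The algebraic properties of $P_X$ and $M_X$ can be confirmed on any data set. A small numerical check, sketched in Python/NumPy with random illustrative data:

```python
import numpy as np

# Random illustrative data; the checks below hold for any full-rank X.
rng = np.random.default_rng(2)
n, k = 20, 3
X = rng.normal(size=(n, k))
y = rng.normal(size=n)

P = X @ np.linalg.inv(X.T @ X) @ X.T   # projects onto the span of X
M = np.eye(n) - P                      # projects onto the orthogonal complement

assert np.allclose(P, P.T) and np.allclose(M, M.T)       # symmetric
assert np.allclose(P @ P, P) and np.allclose(M @ M, M)   # idempotent
assert np.allclose(P @ y + M @ y, y)                     # y = P_X y + M_X y
assert abs((P @ y) @ (M @ y)) < 1e-8                     # orthogonal components
```

Idempotency is what one expects of a projection: projecting a second time changes nothing.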


This is how we define a linear estimator: it's a linear function of the dependent variable. Since it's a linear combination of the observations on the dependent variable, where the weights are determined by the observations on the regressors, some observations may have more influence than others. To investigate this, let $e_t$ be an $n$-vector of zeros with a $1$ in the $t$-th position. Define

$h_t = (P_X)_{tt} = e_t'P_X e_t$

so $h_t$ is the $t$-th element on the main diagonal of $P_X$. Since $\lVert P_X e_t \rVert \le \lVert e_t \rVert = 1$, we have $h_t \le 1$, and since $\operatorname{Tr} P_X = K$, the average of the $h_t$ is $K/n$. So, on average, the weight on the $y_t$'s is $K/n$. If the weight is much higher, then the observation has the potential to affect the fit importantly. The weight $h_t$ is referred to as the leverage of the observation. However, an observation may also be influential due to the value of $y_t$, rather than the weight it is multiplied by, which only depends on the $x_t$'s.

To account for this, consider estimation of $\beta$ without using the $t$-th observation (designate this estimator as $\hat\beta^{(t)}$). One can show (see Davidson and MacKinnon, pp. 32-5 for proof) that

$\hat\beta^{(t)} = \hat\beta - \left(\frac{1}{1 - h_t}\right)(X'X)^{-1}x_t\hat\varepsilon_t$
3.4. INFLUENTIAL OBSERVATIONS AND OUTLIERS

30

**F IGURE 3.4.1. Detection of inﬂuential observations
**

14 12 Data points fitted Leverage Influence

10

8

6

4

2

0

-2

0

0.5

1

1.5 X

2

2.5

3

so the change in the $t$-th observation's fitted value is

$x_t'\hat\beta - x_t'\hat\beta^{(t)} = \left(\frac{h_t}{1 - h_t}\right)\hat\varepsilon_t$

While an observation may be influential if it doesn't affect its own fitted value, it certainly is influential if it does. A fast means of identifying influential observations is to plot $\left(\frac{h_t}{1 - h_t}\right)\hat\varepsilon_t$ (which I will refer to as the own influence of the observation) as a function of $t$. Figure 3.4.1 gives an example plot of data, fit, leverage and influence. The Octave program is InfluentialObservation.m. If you re-run the program you will see that the leverage of the last observation (an outlying value of x) is always high, and the influence is sometimes high.
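The delete-one-observation formula above can be verified directly. The sketch below (Python/NumPy with made-up data; the notes' own examples use Octave) computes the leverages as the diagonal of $P_X$ and compares the formula with a regression that actually drops observation $t$:

```python
import numpy as np

# Made-up data for the check.
rng = np.random.default_rng(3)
n = 30
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
e_hat = y - X @ beta_hat
h = np.einsum('tj,jk,tk->t', X, XtX_inv, X)   # leverages: diagonal of P_X

t = 0
# beta^{(t)} = beta_hat - (1/(1-h_t)) (X'X)^{-1} x_t e_hat_t
via_formula = beta_hat - XtX_inv @ X[t] * (e_hat[t] / (1.0 - h[t]))
keep = np.arange(n) != t
direct = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
print(np.allclose(via_formula, direct))   # True
```

This avoids refitting $n$ separate regressions when screening every observation: one pass over $h$ and $\hat\varepsilon$ suffices.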

After influential observations are detected, one needs to determine why they are influential. Possible causes include:
- data entry error, which can easily be corrected once detected. Data entry errors are very common.
- special economic factors that affect some observations. These would need to be identified and incorporated in the model. This is the idea behind structural change: the parameters may not be constant across all observations.
- pure randomness may have caused us to sample a low-probability observation.
There exist robust estimation methods that downweight outliers.

3.5. Goodness of fit

The fitted model is

$y = X\hat\beta + \hat\varepsilon$

Take the inner product:

$y'y = \hat\beta'X'X\hat\beta + 2\hat\beta'X'\hat\varepsilon + \hat\varepsilon'\hat\varepsilon$

But the middle term of the RHS is zero since $X'\hat\varepsilon = 0$, so

$y'y = \hat\beta'X'X\hat\beta + \hat\varepsilon'\hat\varepsilon$    (3.5.1)

The uncentered $R_u^2$ is defined as

$R_u^2 = 1 - \frac{\hat\varepsilon'\hat\varepsilon}{y'y} = \frac{\hat\beta'X'X\hat\beta}{y'y} = \frac{\lVert P_X y \rVert^2}{\lVert y \rVert^2} = \cos^2(\phi)$

where $\phi$ is the angle between $y$ and the span of $X$

(see Figure 3.5.1, Uncentered R²: the yellow vector is a constant, since it's on the 45 degree line in observation space). The uncentered $R^2$ changes if we add a constant to $y$, since this changes $\phi$. Another, more common definition measures the contribution of the variables, other than the constant term, to explaining the variation in $y$. Thus it measures the ability of the model to explain the variation of $y$ about its unconditional sample mean.

Let $\iota = (1, 1, \ldots, 1)'$, an $n$-vector, and let

$M_\iota = I_n - \iota(\iota'\iota)^{-1}\iota' = I_n - \iota\iota'/n$

$M_\iota y$ just returns the vector of deviations from the mean. Supposing that a column of ones is in the space spanned by $X$ (i.e., there is a constant term), one can show that $M_\iota \hat\varepsilon = \hat\varepsilon$, since $X'\hat\varepsilon = 0$ implies $\iota'\hat\varepsilon = 0$. In this case, in terms of deviations from the mean, equation 3.5.1 becomes

$y'M_\iota y = \hat\beta'X'M_\iota X\hat\beta + \hat\varepsilon'\hat\varepsilon$

The centered $R_c^2$ is defined as

$R_c^2 = 1 - \frac{\hat\varepsilon'\hat\varepsilon}{y'M_\iota y}$
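The difference between the two definitions is easy to see numerically: when the model contains a constant, adding a constant to $y$ changes the uncentered $R^2$ but leaves the centered $R^2$ unchanged. A Python/NumPy sketch with illustrative data:

```python
import numpy as np

# Illustrative data; the model includes a constant term.
rng = np.random.default_rng(4)
n = 40
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)

def r2_pair(y, X):
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    e = y - X @ b
    uncentered = 1.0 - (e @ e) / (y @ y)
    d = y - y.mean()                    # M_iota y: deviations from the mean
    centered = 1.0 - (e @ e) / (d @ d)
    return uncentered, centered

u1, c1 = r2_pair(y, X)
u2, c2 = r2_pair(y + 10.0, X)   # add a constant to y
print(u1, u2, c1, c2)           # u changes; c does not
```

The shift in $y$ is absorbed entirely by the estimated constant, so the residuals and the centered measure are unaffected.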

3.6. The classical linear regression model

Up to this point the model is empty of content beyond the definition of a best linear approximation to $y$ and some geometrical properties. There is no economic content to the model, and the regression parameters have no economic interpretation. For example, what is the partial derivative of $y$ with respect to $x_j$? The linear approximation is

$y = \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon$

The partial derivative is

$\frac{\partial y}{\partial x_j} = \beta_j + \frac{\partial \varepsilon}{\partial x_j}$

Up to now, there's no guarantee that $\partial\varepsilon/\partial x_j = 0$. For the $\beta$ to have an economic meaning, we need to make additional assumptions. The assumptions that are appropriate to make depend on the data under consideration. We'll start with the classical linear regression model, which incorporates some assumptions that are clearly not realistic for economic data. This is to be able to explain some concepts with a minimum of confusion and notational clutter. Later we'll adapt the results to what we can get with more realistic assumptions.

Linearity: the model is a linear function of the parameter vector $\beta^0$:

$y_t = \beta_1^0 x_{t1} + \beta_2^0 x_{t2} + \cdots + \beta_k^0 x_{tk} + \varepsilon_t$    (3.6.1)

or, using vector notation: $y_t = x_t'\beta^0 + \varepsilon_t$.

Nonstochastic linearly independent regressors: $X$ is a fixed $n \times K$ matrix of constants, it has rank $K$, its number of columns, and

$\lim_{n\to\infty} \frac{1}{n}X'X = Q_X$    (3.6.2)

where $Q_X$ is a finite positive definite matrix. This is needed to be able to identify the individual effects of the explanatory variables.

Independently and identically distributed errors:

$\varepsilon \sim \mathrm{IID}(0, \sigma^2 I_n)$    (3.6.3)

This implies the following two properties:

Homoscedastic errors: $V(\varepsilon_t) = \sigma_0^2, \; \forall t$    (3.6.5)

Nonautocorrelated errors: $E(\varepsilon_t \varepsilon_s) = 0, \; \forall t \ne s$    (3.6.6)

Optionally, we will sometimes assume that the errors are normally distributed, so that $\varepsilon$ is jointly distributed IIN:

Normally distributed errors: $\varepsilon \sim N(0, \sigma^2 I_n)$    (3.6.4)

3.7. Small sample statistical properties of the least squares estimator

Up to now, we have only examined numeric properties of the OLS estimator, that always hold. Now we will examine statistical properties. The statistical properties depend upon the assumptions we can make.

3.7.1. Unbiasedness. We have $\hat\beta = (X'X)^{-1}X'y$. By linearity,

$\hat\beta = (X'X)^{-1}X'(X\beta + \varepsilon) = \beta + (X'X)^{-1}X'\varepsilon$

By 3.6.2 and 3.6.3,

$E(X'X)^{-1}X'\varepsilon = (X'X)^{-1}X'E\varepsilon = 0$

so the OLS estimator is unbiased under the assumptions of the classical model. Note that $X$ is fixed across samples. Figure 3.7.1 shows the results of a small Monte Carlo experiment where the OLS estimator was calculated for 10000 samples from the classical model. We can see that the estimator appears to be estimated without bias. The program that generates the plot is Unbiased.m, if you would like to experiment with this.

With time series data, the OLS estimator will often be biased. Figure 3.7.2 shows the results of a small Monte Carlo experiment where the OLS estimator was calculated for 1000 samples from an AR(1) model. In this case, assumption 3.6.2 does not hold: the regressors are stochastic.
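A Monte Carlo in the spirit of Unbiased.m can be sketched in Python/NumPy. The parameter values below ($\beta = (1, 2)'$, $n = 20$, $\sigma_\varepsilon = 3$) are made up for illustration, not taken from the original program. With regressors held fixed across samples and zero-mean errors, the average of the OLS estimates across replications is close to the true $\beta$:

```python
import numpy as np

# Made-up design: x is drawn once and held fixed across replications.
rng = np.random.default_rng(5)
n, reps = 20, 10000
x = rng.uniform(0.0, 10.0, n)
X = np.column_stack([np.ones(n), x])
beta = np.array([1.0, 2.0])
A = np.linalg.inv(X.T @ X) @ X.T    # OLS weights, fixed across samples

draws = np.empty((reps, 2))
for r in range(reps):
    y = X @ beta + rng.normal(0.0, 3.0, n)   # classical (IID, zero-mean) errors
    draws[r] = A @ y
print(draws.mean(axis=0))   # close to [1, 2]
```

Each replication's estimate is noisy, but the distribution of the estimates is centered on the truth, which is exactly what unbiasedness asserts.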

04 0.3. which implies strong exogeneity). In Figure 3.02 0 -3 -2 -1 0 1 2 3 about -0.2.7.12 0.7. 3. We can see that the bias in the estimation of is ó © R ¥ §ð ç ¢p è I 6Xixòñui§ G . The program that generates the plot is Biased. With the linearity assumption. Unbiasedness of OLS under classical assumptions Beta hat .6. the 8 RG I X R a§ § © ¥ P ¢§ regressors are stochastic.1.08 0.6. SMALL SAMPLE STATISTICAL PROPERTIES OF THE LEAST SQUARES ESTIMATOR 37 F IGURE 3.2. Normality. since the DGP (see the Octave program) has normal errors. then since a linear function of a normal random vector is also normally distributed. Even when the data may be taken to be IID. Adding the assumption of normality (3.1 0.7.1 you can see that the estimator appears to be normally distributed. It in fact is normally distributed.7.Beta true 0.m .06 0. if you would like to experiment with this. we have This is a linear function of .

assumption of normality is often questionable or simply untenable. Many variables in economics can take on only nonnegative values, which, strictly speaking, rules out normality.² For example, if the dependent variable is the number of automobile trips per week, it is a count variable with a discrete distribution, and is thus not normally distributed.

[Figure 3.7.2. Biasedness of OLS when an assumption fails: histogram of the estimate minus the true beta across the Monte Carlo replications, centered below zero.]

²Normality may be a good model nonetheless, as long as the probability of a negative value occurring is negligible under the model. This depends upon the mean being large enough in relation to the variance.

3.7.3. The variance of the OLS estimator and the Gauss-Markov theorem. Now let's make all the classical assumptions except the assumption of

Now let's make all the classical assumptions except the assumption of normality. We still have $\hat\beta = \beta + (X'X)^{-1}X'\varepsilon$ and $E(\hat\beta) = \beta$, so the variance of the OLS estimator is
\[ Var(\hat\beta) = E\left[(\hat\beta - \beta)(\hat\beta - \beta)'\right] = (X'X)^{-1}X'\,E(\varepsilon\varepsilon')\,X(X'X)^{-1} = (X'X)^{-1}\sigma_0^2 . \]

The OLS estimator is a linear estimator, which means that it is a linear function of the dependent variable:
\[ \hat\beta = \left[(X'X)^{-1}X'\right] y = Cy , \]
where $C$ is a function of the explanatory variables only, not the dependent variable. It is also unbiased under the present assumptions, as we proved above. One could consider other weights $W$ that are a function of $X$ and that define some other linear estimator $\tilde\beta = Wy$. Note that since $W$ is a function of $X$, it is nonstochastic, too. We'll still insist upon unbiasedness. If the estimator is unbiased, then we must have $WX = I_K$:
\[ E(Wy) = E(WX\beta_0 + W\varepsilon) = WX\beta_0 = \beta_0 \quad \Rightarrow \quad WX = I_K . \]

The variance of $\tilde\beta = Wy$ is $Var(\tilde\beta) = WW'\sigma_0^2$. Define $D = W - (X'X)^{-1}X'$, so $W = D + (X'X)^{-1}X'$. Since $WX = I_K$, we have $DX = 0$, so
\[ Var(\tilde\beta) = \left(D + (X'X)^{-1}X'\right)\left(D' + X(X'X)^{-1}\right)\sigma_0^2 = \left(DD' + (X'X)^{-1}\right)\sigma_0^2 . \]
So $Var(\tilde\beta) \ge Var(\hat\beta)$, since $DD'$ is a positive semi-definite matrix. The inequality is a shorthand means of expressing, more formally, that $Var(\tilde\beta) - Var(\hat\beta)$ is a positive semi-definite matrix. This is a proof of the Gauss-Markov Theorem: the OLS estimator is the "best linear unbiased estimator" (BLUE).

It is worth emphasizing again that we have not used the normality assumption in any way to prove the Gauss-Markov theorem, so it is valid if the errors are not normally distributed, as long as the other assumptions hold.

To illustrate the Gauss-Markov result, consider the estimator that results from splitting the sample into $p$ equally-sized parts, estimating using each part of the data separately by OLS, then averaging the $p$ resulting estimators. You should be able to show that this estimator is unbiased, but inefficient with respect to the OLS estimator.
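A sketch of such a comparison in Python (the book's version, Efficiency.m, is in Octave; the sample size, parameter values and number of replications below are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

def compare(n=30, reps=3000, splits=3):
    """Monte Carlo comparison of full-sample OLS with the estimator that
    averages OLS slope estimates from `splits` equally-sized subsamples."""
    x = rng.normal(size=n)                    # regressors held fixed
    X = np.column_stack([np.ones(n), x])
    beta = np.array([1.0, 2.0])               # arbitrary true values
    parts = np.array_split(np.arange(n), splits)
    ols = np.empty(reps)
    split = np.empty(reps)
    for rep in range(reps):
        y = X @ beta + rng.normal(size=n)
        ols[rep] = np.linalg.lstsq(X, y, rcond=None)[0][1]
        split[rep] = np.mean([np.linalg.lstsq(X[p], y[p], rcond=None)[0][1]
                              for p in parts])
    return ols, split

ols, split = compare()
print(ols.var() < split.var())  # OLS is the more efficient of the two
```

Both estimators are centered on the true slope, but the split-sample estimator's sampling variance is larger, which is what the Gauss-Markov theorem predicts for any non-OLS linear unbiased estimator.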

The program Efficiency.m illustrates this using a small Monte Carlo experiment, which compares the OLS estimator and a 3-way split sample estimator. The data generating process follows the classical model. In Figures 3.7.3 and 3.7.4 we can see that the OLS estimator is more efficient, since the tails of its histogram are more narrow.

[Figure 3.7.3: Gauss-Markov Result: The OLS estimator. Histogram of the OLS estimates of $\beta_2$.]

We have $Var(\hat\beta) = (X'X)^{-1}\sigma_0^2$, but we still need to estimate the variance of $\varepsilon$, $\sigma_0^2$, in order to have an idea of the precision of the estimates of $\beta$. A commonly used estimator of $\sigma_0^2$ is
\[ \widehat{\sigma_0^2} = \frac{1}{n - K}\,\hat\varepsilon'\hat\varepsilon . \]
This estimator is unbiased:

[Figure 3.7.4: Gauss-Markov Result: The split sample estimator. Histogram of the split sample estimates of $\beta_2$, with fatter tails than in Figure 3.7.3.]

\[ \widehat{\sigma_0^2} = \frac{1}{n - K}\,\hat\varepsilon'\hat\varepsilon = \frac{1}{n - K}\,\varepsilon' M \varepsilon , \]
so
\begin{align*}
E\left(\widehat{\sigma_0^2}\right) &= \frac{1}{n-K}\, E\left(\operatorname{tr} \varepsilon' M \varepsilon\right) = \frac{1}{n-K}\, E\left(\operatorname{tr} M \varepsilon\varepsilon'\right) \\
&= \frac{1}{n-K}\, \operatorname{tr}\left( M\, E(\varepsilon\varepsilon') \right) = \frac{1}{n-K}\, \sigma_0^2\, \operatorname{tr} M = \frac{1}{n-K}\, \sigma_0^2\, (n - K) = \sigma_0^2 ,
\end{align*}
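A quick simulation check of this unbiasedness result (a sketch; the choices of $n$, $K$ and $\sigma^2$ below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)

n, k, sigma2, reps = 25, 3, 4.0, 5000
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
M = np.eye(n) - X @ np.linalg.inv(X.T @ X) @ X.T   # residual maker, tr(M) = n - k

vals = np.empty(reps)
for rep in range(reps):
    eps = rng.normal(scale=np.sqrt(sigma2), size=n)
    e_hat = M @ eps                                 # OLS residuals
    vals[rep] = (e_hat @ e_hat) / (n - k)
print(round(vals.mean(), 2))  # close to sigma2 = 4.0
```

Dividing by $n$ instead of $n - K$ would produce an average visibly below 4.0, which is the small-sample bias the degrees-of-freedom correction removes.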

where we use the fact that $\operatorname{tr}(AB) = \operatorname{tr}(BA)$ when both products are conformable. Thus, this estimator is also unbiased under these assumptions.

3.8. Example: The Nerlove model

3.8.1. Theoretical background. For a firm that takes input prices $w$ and the output level $q$ as given, the cost minimization problem is to choose the quantities of inputs $x$ to solve the problem
\[ \min_x\ w'x \]
subject to the restriction $f(x) = q$. The solution is the vector of factor demands $x(w, q)$. The cost function is obtained by substituting the factor demands into the criterion function: $C(w, q) = w'x(w, q)$.

Monotonicity. Increasing factor prices cannot decrease cost, so $D_w C(w, q) \ge 0$. Remember that these derivatives give the conditional factor demands (Shephard's Lemma).

Homogeneity. The cost function is homogeneous of degree 1 in input prices: $C(tw, q) = tC(w, q)$, where $t$ is a scalar constant.

This is because the factor demands are homogeneous of degree zero in factor prices: they only depend upon relative prices.

Returns to scale. The returns to scale parameter $\gamma$ is defined as the inverse of the elasticity of cost with respect to output:
\[ \gamma = \left( \frac{\partial C(w, q)}{\partial q}\, \frac{q}{C(w, q)} \right)^{-1} . \]
Constant returns to scale is the case where increasing production $q$ implies that cost increases in the proportion 1:1. If this is the case, then $\gamma = 1$.

3.8.2. Cobb-Douglas functional form. The Cobb-Douglas functional form is linear in the logarithms of the regressors and the dependent variable. For a cost function, if there are $g$ factors, the Cobb-Douglas cost function has the form
\[ C = A\, w_1^{\beta_1} \cdots w_g^{\beta_g}\, q^{\beta_q}\, e^{\varepsilon} . \]
What is the elasticity of $C$ with respect to $w_j$?
\[ e_{C,w_j} = \frac{\partial C}{\partial w_j}\, \frac{w_j}{C} = \beta_j\, \frac{C}{w_j}\, \frac{w_j}{C} = \beta_j . \]
This is one of the reasons the Cobb-Douglas form is popular: the coefficients are easy to interpret, since they are the elasticities of the dependent variable with respect to the explanatory variables.

Note that in this case, by Shephard's Lemma,
\[ e_{C,w_j} = \frac{\partial C}{\partial w_j}\, \frac{w_j}{C} = \frac{w_j\, x_j(w, q)}{C} = s_j , \]
the cost share of the $j$-th input. So with a Cobb-Douglas cost function, the cost shares are constants.

Note that after a logarithmic transformation we obtain
\[ \ln C = \alpha + \beta_1 \ln w_1 + \cdots + \beta_g \ln w_g + \beta_q \ln q + \varepsilon , \]
where $\alpha = \ln A$. So we see that the transformed model is linear in the logs of the data. One can verify that the property of HOD1 implies that $\sum_{i=1}^{g} \beta_i = 1$. In other words, the cost shares add up to 1. Likewise, the hypothesis that the technology exhibits CRTS implies that $\gamma = 1/\beta_q = 1$, so $\beta_q = 1$, and monotonicity implies that the coefficients $\beta_i \ge 0$, $i = 1, \ldots, g$.

3.8.3. The Nerlove data and OLS. The file nerlove.data contains data on 145 electric utility companies' cost of production, output and input prices. The data are for the U.S., and were collected by M. Nerlove. The observations are by row, and the columns are COMPANY, COST, OUTPUT, PRICE OF LABOR, PRICE OF FUEL and PRICE OF CAPITAL. Note that the data are sorted by output level (the third column).

We will estimate the Cobb-Douglas model

(3.8.1)    $\ln C = \beta_1 + \beta_2 \ln Q + \beta_3 \ln P_L + \beta_4 \ln P_F + \beta_5 \ln P_K + \varepsilon$

using OLS. To do this yourself, you need the data file mentioned above, as well as Nerlove.m (the estimation program) and the library of Octave functions mentioned in the introduction to Octave that forms section 21 of this document.³ The results are

*********************************************************
OLS estimation results
Observations 145
R-squared 0.925955
Sigma-squared 0.153943

Results (Ordinary var-cov estimator)

            estimate   st.err.   t-stat.   p-value
constant    -3.527     1.774     -1.987    0.049
output       0.720     0.017     41.244    0.000
labor        0.436     0.291      1.499    0.136
fuel         0.427     0.100      4.249    0.000
capital     -0.220     0.339     -0.648    0.518
*********************************************************

Do the theoretical restrictions hold? Does the model fit well? What do you think about RTS?

³If you are running the bootable CD, you have all of this installed and ready to run.
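If you don't have the Nerlove data at hand, the mechanics of the estimation can still be sketched on simulated data. In the Python sketch below, the coefficient values, sample size and noise level are invented for illustration (they are not the Nerlove estimates); the input price coefficients are chosen to sum to one, so HOD1 holds in the DGP.

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated Cobb-Douglas cost data:
#   ln C = b1 + bq*ln q + b3*ln w1 + b4*ln w2 + b5*ln w3 + eps
n = 145
ln_q = rng.normal(3.0, 1.0, n)
ln_w = rng.normal(0.0, 0.3, (n, 3))           # three log input prices
beta = np.array([1.0, 0.7, 0.5, 0.3, 0.2])    # hypothetical true values
X = np.column_stack([np.ones(n), ln_q, ln_w])
ln_C = X @ beta + 0.1 * rng.normal(size=n)

b = np.linalg.lstsq(X, ln_C, rcond=None)[0]
rts = 1.0 / b[1]           # returns to scale: inverse of the output elasticity
shares_sum = b[2:].sum()   # near 1 when HOD1 holds
print(b.round(2), round(rts, 2), round(shares_sum, 2))
```

The two derived quantities are the ones the theory section emphasizes: $\gamma = 1/\beta_q$ for returns to scale, and the sum of the input price coefficients as a check on homogeneity of degree one.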

While we will use Octave programs as examples in this document, since following the programming statements is a useful way of learning how theory is put into practice, you may be interested in a more "user-friendly" environment for doing econometrics. I heartily recommend Gretl, the Gnu Regression, Econometrics, and Time-Series Library. This is an easy to use program, available in English, French, and Spanish, and it comes with a lot of data ready to use. It even has an option to save output as LaTeX fragments, so that I can just include the results into this document, no muss, no fuss. Here are the results of the Nerlove model from GRETL:

Model 2: OLS estimates using the 145 observations 1-145
Dependent variable: l_cost

Variable    Coefficient   Std. Error   t-statistic   p-value
const         -3.527        1.774        -1.987       0.049
l_output       0.720        0.017        41.244       0.000
l_labor        0.436        0.291         1.499       0.136
l_fuel         0.427        0.100         4.249       0.000
l_capita      -0.220        0.339        -0.648       0.518

The summary statistics reported include the mean and S.D. of the dependent variable, the sum of squared residuals, the standard error of the residuals ($\hat\sigma$), unadjusted $R^2$ and adjusted $\bar R^2$, and the Akaike and Schwarz Bayesian information criteria.

Fortunately, Gretl and my OLS program agree upon the results. Gretl is included in the bootable CD mentioned in the introduction. I recommend using GRETL to repeat the examples that are done using Octave.

The previous properties hold for finite sample sizes. Before considering the asymptotic properties of the OLS estimator it is useful to review the MLE estimator, since under the assumption of normal errors the two estimators coincide.

Exercises

(1) Prove that the split sample estimator used to generate figure 3.7.4 is unbiased.
(2) Calculate the OLS estimates of the Nerlove model using Octave and GRETL, and provide printouts of the results. Interpret the results.
(3) Do an analysis of whether or not there are influential observations for OLS estimation of the Nerlove model. Discuss.
(4) Using GRETL, examine the residuals after OLS estimation and tell me whether or not you believe that the assumption of independent identically distributed normal errors is warranted. No need to do formal tests, just look at the plots. Print out any that you think are relevant, and interpret them.
(5) For a random vector $X \sim N(\mu_x, \Sigma)$, what is the distribution of $AX + b$, where $A$ and $b$ are conformable matrices of constants?
(6) Using Octave, write a little program that verifies that $\operatorname{tr}(AB) = \operatorname{tr}(BA)$ for $A$ and $B$ 4x4 matrices of random numbers. Note: there is an Octave function trace.
(7) For the model with a constant and a single regressor, $y_t = \beta_1 + \beta_2 x_t + \varepsilon_t$, which satisfies the classical assumptions, prove that the variance of the OLS estimator declines to zero as the sample size increases.

CHAPTER 4

Maximum likelihood estimation

The maximum likelihood estimator is important since it is asymptotically efficient, as is shown below. For the classical linear model with normal errors, the ML and OLS estimators of $\beta$ are the same, so the following theory is presented without examples. In the second half of the course, nonlinear models with nonnormal errors are introduced, and examples may be found there.

4.1. The likelihood function

Suppose we have a sample of size $n$ of the random vectors $y$ and $z$. Suppose the joint density of $Y = (y_1, \ldots, y_n)$ and $Z = (z_1, \ldots, z_n)$ is characterized by a parameter vector $\psi_0$:
\[ f_{YZ}(Y, Z, \psi_0) . \]
This is the joint density of the sample. The likelihood function is just this density evaluated at other values $\psi$:
\[ L(Y, Z, \psi) = f(Y, Z, \psi), \quad \psi \in \Psi , \]
where $\Psi$ is a parameter space. The maximum likelihood estimator of $\psi_0$ is the value of $\psi$ that maximizes the likelihood function. This density can be factored as
\[ f_{YZ}(Y, Z, \psi_0) = f_{Y|Z}(Y|Z, \theta_0)\, f_Z(Z, \rho_0) .\]

Note that if $\theta_0$ and $\rho_0$ share no elements, then the maximizer of the conditional likelihood function $f_{Y|Z}(Y|Z, \theta)$ with respect to $\theta$ is the same as the maximizer of the overall likelihood function $f_{YZ}(Y, Z, \psi)$, for the elements of $\psi$ that correspond to $\theta$. In this case, the variables $Z$ are said to be exogenous for estimation of $\theta$, and we may more conveniently work with the conditional likelihood function $f_{Y|Z}(Y|Z, \theta)$ for the purposes of estimating $\theta_0$.

DEFINITION 4.1.1. The maximum likelihood estimator of $\theta_0$ is $\hat\theta = \arg\max f_{Y|Z}(Y|Z, \theta)$.

If the $n$ observations are independent, the likelihood function can be written as
\[ L(Y|Z, \theta) = \prod_{t=1}^{n} f(y_t|z_t, \theta) , \]
where the $f_t$ are possibly of different form. If this is not possible, we can always factor the likelihood into contributions of observations, by using the fact that a joint density can be factored into the product of a marginal and conditional (doing this iteratively):
\[ L(Y|Z, \theta) = f(y_1|z_1, \theta)\, f(y_2|y_1, z_2, \theta)\, f(y_3|y_1, y_2, z_3, \theta) \cdots f(y_n|y_1, \ldots, y_{n-1}, z_n, \theta) . \]
To simplify notation, define $x_t = (y_1, \ldots, y_{t-1}, z_t)$; it contains exogenous and predetermined endogenous variables. Now the likelihood function can be written as
\[ L(Y, \theta) = \prod_{t=1}^{n} f(y_t|x_t, \theta) . \]

The criterion function can be defined as the average log-likelihood function:
\[ s_n(\theta) = \frac{1}{n} \ln L(Y, \theta) = \frac{1}{n} \sum_{t=1}^{n} \ln f(y_t|x_t, \theta) . \]
The maximum likelihood estimator may thus be defined equivalently as
\[ \hat\theta = \arg\max_{\Theta}\, s_n(\theta) , \]
where the set maximized over is defined below. Since $\ln(\cdot)$ is a monotonic increasing function, $\ln L$ and $L$ maximize at the same value of $\theta$. Dividing by $n$ has no effect on $\hat\theta$.

Example: Bernoulli trial. Suppose that we are flipping a coin that may be biased, so that the probability of a heads may not be 0.5. Maybe we're interested in estimating the probability of a heads. Let $y = 1(\text{heads})$ be a binary variable that indicates whether or not a heads is observed. The outcome of a toss is a Bernoulli random variable:
\[ f_Y(y; p_0) = p_0^{\,y}\,(1 - p_0)^{1-y}, \quad y \in \{0, 1\} . \]
So a representative term that enters the likelihood function is
\[ f_Y(y; p) = p^{\,y}\,(1 - p)^{1-y} \]
and
\[ \ln f_Y(y; p) = y \ln p + (1 - y)\ln(1 - p) . \]

The derivative of this is
\[ \frac{\partial \ln f_Y(y; p)}{\partial p} = \frac{y}{p} - \frac{1 - y}{1 - p} = \frac{y - p}{p(1 - p)} . \]
Averaging this over a sample of size $n$ gives
\[ \frac{\partial s_n(p)}{\partial p} = \frac{1}{n} \sum_{i=1}^{n} \frac{y_i - p}{p(1 - p)} . \]
Setting to zero and solving gives $\hat p = \bar y$. So it's easy to calculate the MLE of $p_0$ in this case.

Now imagine that we had a bag full of bent coins, each bent around a sphere of a different radius (with the head pointing to the outside of the sphere). We might suspect that the probability of a heads could depend upon the radius. Suppose that
\[ p_i \equiv p(x_i, \beta) = \left(1 + \exp(-x_i'\beta)\right)^{-1} , \]
where $x_i = (1, r_i)'$, so that $\beta$ is a $2 \times 1$ vector. Now
\[ \frac{\partial p_i(\beta)}{\partial \beta} = p_i\,(1 - p_i)\, x_i , \]
so
\[ \frac{\partial \ln f_Y(y_i; \beta)}{\partial \beta} = \frac{y_i - p_i}{p_i(1 - p_i)}\, p_i(1 - p_i)\, x_i = \left( y_i - p(x_i, \beta) \right) x_i . \]
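The simple coin-flip result $\hat p = \bar y$ can be checked numerically; in this Python sketch the true probability 0.3 and the sample size are arbitrary assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)
y = rng.binomial(1, 0.3, size=5000)   # tosses of a coin with p0 = 0.3 (assumed)

def avg_loglik(p):
    # s_n(p) = (1/n) * sum[ y*ln(p) + (1-y)*ln(1-p) ]
    return np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

grid = np.linspace(0.01, 0.99, 99)
p_grid = grid[np.argmax([avg_loglik(p) for p in grid])]
print(abs(p_grid - y.mean()) < 0.01)  # grid maximizer agrees with p_hat = ybar
```

Because the average log-likelihood is concave in $p$, the crude grid search lands on (essentially) the same value the first order condition delivers analytically.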

So the derivative of the average log-likelihood function is now
\[ \frac{\partial s_n(\beta)}{\partial \beta} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - p(x_i, \beta) \right) x_i . \]
This is a set of 2 nonlinear equations in the two unknown elements in $\beta$. There is no explicit solution for the two elements that set the equations to zero. This is common with ML estimators: they are often nonlinear, and finding their values often requires use of numeric methods to find solutions to the first order conditions.

4.2. Consistency of MLE

To show consistency of the MLE, we need to make explicit some assumptions.

Compact parameter space: $\theta \in \Theta$, an open bounded subset of $\mathbb{R}^K$. Maximization is over $\bar\Theta$, which is compact. This implies that $\theta$ is an interior point of the parameter space $\bar\Theta$.

Uniform convergence:
\[ s_n(\theta) \overset{u.a.s.}{\to} \lim_{n\to\infty} E_{\theta_0}\, s_n(\theta) \equiv s_\infty(\theta, \theta_0), \quad \forall \theta \in \bar\Theta . \]
We have suppressed $Y$ here for simplicity. This requires that almost sure convergence holds for all possible parameter values. For a given parameter value, an ordinary Law of Large Numbers will usually imply almost sure convergence to the limit of the expectation. Convergence for a single element of the parameter space, combined with the assumption of a compact parameter space, ensures uniform convergence.
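The bent-coin first order conditions above can be solved with Newton's method, which is exactly the kind of numeric solution the text refers to. In this Python sketch the data are simulated and the "true" parameter values are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 2000
r = rng.uniform(0.5, 2.0, n)                 # coin radii
X = np.column_stack([np.ones(n), r])
beta_true = np.array([1.0, -1.5])            # hypothetical, for simulation only
p = 1.0 / (1.0 + np.exp(-X @ beta_true))
y = rng.binomial(1, p)

beta = np.zeros(2)
for _ in range(25):                          # Newton-Raphson on the FOC
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    score = X.T @ (y - p) / n                # the 2 nonlinear equations
    hess = -(X * (p * (1 - p))[:, None]).T @ X / n
    step = np.linalg.solve(hess, score)
    beta = beta - step
    if np.max(np.abs(step)) < 1e-10:
        break
print(beta.round(2))                         # close to beta_true
```

The average log-likelihood for this logit-type model is globally concave, so Newton's method converges quickly from the zero starting vector.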

Continuity: $s_n(\theta)$ is continuous in $\theta$, $\theta \in \bar\Theta$. This implies that $s_\infty(\theta, \theta_0)$ is continuous in $\theta$.

Identification: $s_\infty(\theta, \theta_0)$ has a unique maximum in its first argument.

We will use these assumptions to show that $\hat\theta \overset{a.s.}{\to} \theta_0$.

First, $\hat\theta$ certainly exists, since a continuous function has a maximum on a compact set.

Second, for any $\theta \ne \theta_0$,
\[ E\left[ \ln\!\left( \frac{L(\theta)}{L(\theta_0)} \right) \right] \le \ln E\left[ \frac{L(\theta)}{L(\theta_0)} \right] \]
by Jensen's inequality ($\ln(\cdot)$ is a concave function). Now, the expectation on the RHS is
\[ E\left[ \frac{L(\theta)}{L(\theta_0)} \right] = \int \frac{L(\theta)}{L(\theta_0)}\, L(\theta_0)\, dy = 1 , \]
since $L(\theta_0)$ is the density function of the observations, and since the integral of any density is 1. Therefore, since $\ln 1 = 0$,
\[ E\left[ \ln\!\left( \frac{L(\theta)}{L(\theta_0)} \right) \right] \le 0 , \]
or
\[ E\left( s_n(\theta) \right) - E\left( s_n(\theta_0) \right) \le 0 . \]
Taking limits, this is
\[ s_\infty(\theta, \theta_0) - s_\infty(\theta_0, \theta_0) \le 0 \]
except on a set of zero probability (by the uniform convergence assumption).

By the identification assumption there is a unique maximizer, so the inequality is strict if $\theta \ne \theta_0$:
\[ s_\infty(\theta, \theta_0) - s_\infty(\theta_0, \theta_0) < 0, \quad \forall \theta \ne \theta_0, \ a.s. \]
Suppose that $\theta^*$ is a limit point of $\hat\theta$ (any sequence from a compact set has at least one limit point). Since $\hat\theta$ is a maximizer, independent of $n$, we must have
\[ s_\infty(\theta^*, \theta_0) - s_\infty(\theta_0, \theta_0) \ge 0 . \]
These last two inequalities imply that $\theta^* = \theta_0$, a.s. Thus there is only one limit point, and it is equal to the true parameter value with probability one. This completes the proof of strong consistency of the MLE. One can use weaker assumptions to prove weak consistency (convergence in probability to $\theta_0$) of the MLE. This is omitted here. Note that almost sure convergence implies convergence in probability.

4.3. The score function

Differentiability: Assume that $s_n(\theta)$ is twice continuously differentiable in a neighborhood $N(\theta_0)$ of $\theta_0$, at least when $n$ is large enough.

To maximize the log-likelihood function, take derivatives:
\[ g_n(Y, \theta) = D_\theta\, s_n(\theta) = \frac{1}{n} \sum_{t=1}^{n} D_\theta \ln f(y_t|x_t, \theta) \equiv \frac{1}{n} \sum_{t=1}^{n} g_t(\theta) . \]
This is the score vector (with dimension $K \times 1$). Note that the score function has $Y$ as an argument, which implies that it is a random function. $Y$ (and any exogenous variables) will often be suppressed for clarity, but one should not forget that they are still there.

The ML estimator $\hat\theta$ sets the derivatives to zero:
\[ g_n(\hat\theta) = \frac{1}{n} \sum_{t=1}^{n} g_t(\hat\theta) \equiv 0 . \]
We will show that $E_\theta\left[ g_t(\theta) \right] = 0, \ \forall t$. This is the expectation taken with respect to the density $f(\theta)$, not necessarily $f(\theta_0)$.
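This claim can be checked numerically before the formal argument; a minimal sketch using the score for the mean of a $N(\mu, 1)$ density, which is simply $(y - \mu)$:

```python
import numpy as np

rng = np.random.default_rng(9)

# For a N(mu, 1) density, the score for mu is (y - mu).  Its expectation,
# taken under the same mu used to evaluate the score, is zero for any mu.
for mu in (-1.0, 0.0, 2.5):
    y = rng.normal(mu, 1.0, size=200_000)
    print(round(float(np.mean(y - mu)), 3))  # approximately 0 each time
```

The key point, mirrored in the derivation that follows, is that the expectation is taken under the same parameter value at which the score is evaluated.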

\begin{align*}
E_\theta\left[ g_t(\theta) \right] &= \int \left[ D_\theta \ln f(y_t|x_t, \theta) \right] f(y_t|x_t, \theta)\, dy_t \\
&= \int \frac{1}{f(y_t|x_t, \theta)} \left[ D_\theta f(y_t|x_t, \theta) \right] f(y_t|x_t, \theta)\, dy_t = \int D_\theta f(y_t|x_t, \theta)\, dy_t .
\end{align*}
Given some regularity conditions on boundedness of $D_\theta f$, we can switch the order of integration and differentiation, by the dominated convergence theorem. This gives
\[ E_\theta\left[ g_t(\theta) \right] = D_\theta \int f(y_t|x_t, \theta)\, dy_t = D_\theta\, 1 = 0 , \]
where we use the fact that the integral of the density is 1. So the expectation of the score vector is zero. This holds for all $t$, so it implies that $E_\theta\, g_n(Y, \theta) = 0$.

4.4. Asymptotic normality of MLE

Recall that we assume that $s_n(\theta)$ is twice continuously differentiable. Take a first order Taylor's series expansion of $g(Y, \hat\theta)$ about the true value $\theta_0$:
\[ 0 \equiv g(\hat\theta) = g(\theta_0) + \left( D_{\theta'}\, g(\theta^*) \right)\left( \hat\theta - \theta_0 \right) , \]
or, with appropriate definitions,
\[ H(\theta^*)\left( \hat\theta - \theta_0 \right) = -g(\theta_0) , \]
where $\theta^* = \lambda\hat\theta + (1 - \lambda)\theta_0$, $0 < \lambda < 1$. Assume $H(\theta^*)$ is invertible (we'll justify this in a minute). So
\[ \sqrt{n}\left( \hat\theta - \theta_0 \right) = -H(\theta^*)^{-1}\, \sqrt{n}\, g(\theta_0) . \]

Now consider $H(\theta^*)$, the matrix of second derivatives of the average log-likelihood function:
\[ H(\theta^*) = D^2_\theta\, s_n(\theta^*) = \frac{1}{n} \sum_{t=1}^{n} D^2_\theta \ln f_t(\theta^*) . \]
Given that this is an average of terms, it should usually be the case that this satisfies a strong law of large numbers (SLLN). Regularity conditions are a set of assumptions that guarantee that this will happen. There are different sets of assumptions that can be used to justify appeal to different SLLN's. For example, the $D^2_\theta \ln f_t(\theta^*)$ must not be too strongly dependent over time, and their variances must not become infinite. We don't assume any particular set here, since the appropriate assumptions will depend upon the particularities of a given model. However, we assume that a SLLN applies.

Also, since we know that $\hat\theta$ is consistent, and since $\theta^* = \lambda\hat\theta + (1 - \lambda)\theta_0$, we have that $\theta^* \overset{a.s.}{\to} \theta_0$. Also, by the above differentiability assumption, $H(\theta)$ is continuous in $\theta$. Given this, $H(\theta^*)$ converges to the limit of its expectation:
\[ H(\theta^*) \overset{a.s.}{\to} \lim_{n\to\infty} E\left( H(\theta_0) \right) = H_\infty(\theta_0) < \infty . \]
This matrix converges to a finite limit.

Re-arranging orders of limits and differentiation, which is legitimate given regularity conditions, we get
\[ H_\infty(\theta_0) = D^2_\theta\, s_\infty(\theta_0, \theta_0) . \]
We've already seen that $\theta_0$ maximizes the limiting objective function. Since there is a unique maximizer, and by the assumption that $s_n(\theta)$ is twice continuously differentiable (which holds in the limit), then $H_\infty(\theta_0)$ must be negative definite, and therefore of full rank. Therefore the previous inversion is justified, asymptotically, and we have

(4.4.1)    $\sqrt{n}\left( \hat\theta - \theta_0 \right) \overset{a.s.}{\to} -H_\infty(\theta_0)^{-1}\, \sqrt{n}\, g(\theta_0) .$

Now consider $\sqrt{n}\, g(\theta_0)$. This is
\[ \sqrt{n}\, g_n(\theta_0) = \sqrt{n}\, D_\theta\, s_n(\theta_0) = \frac{\sqrt{n}}{n} \sum_{t=1}^{n} D_\theta \ln f(y_t|x_t, \theta_0) = \frac{1}{\sqrt{n}} \sum_{t=1}^{n} g_t(\theta_0) . \]
We've already seen that $E_\theta\left[ g_t(\theta) \right] = 0$. As such, it is reasonable to assume that a CLT applies. Note that $g_n(\theta_0) \overset{a.s.}{\to} 0$, by consistency. To avoid this collapse to a degenerate r.v. (a constant vector) we need to scale by $\sqrt{n}$.

A generic CLT states that, for $X_n$ a random vector that satisfies certain conditions,
\[ X_n - E(X_n) \overset{d}{\to} N\left( 0,\ \lim V(X_n) \right) . \]
The "certain conditions" that $X_n$ must satisfy depend on the case at hand. Usually, $X_n$ will be of the form of an average, scaled by $\sqrt{n}$:
\[ X_n = \sqrt{n}\, \frac{\sum_{t=1}^{n} x_t}{n} . \]
This is the case for $\sqrt{n}\, g(\theta_0)$, for example. Then the properties of $X_n$ depend on the properties of the $x_t$. For example, if the $x_t$ have finite variances and are not too strongly dependent, then a CLT for dependent processes will apply. Supposing that a CLT applies, and noting that $E\left( \sqrt{n}\, g_n(\theta_0) \right) = 0$, we get

(4.4.2)    $\sqrt{n}\, g_n(\theta_0) \overset{d}{\to} N\left( 0,\ I_\infty(\theta_0) \right) ,$

where
\[ I_\infty(\theta_0) = \lim_{n\to\infty} E_{\theta_0}\left( n\, g_n(\theta_0)\, g_n(\theta_0)' \right) \]
is known as the information matrix. Combining [4.4.1] and [4.4.2], we get
\[ \sqrt{n}\left( \hat\theta - \theta_0 \right) \overset{a}{\sim} N\left( 0,\ H_\infty(\theta_0)^{-1}\, I_\infty(\theta_0)\, H_\infty(\theta_0)^{-1} \right) . \]
The MLE estimator is asymptotically normally distributed.

DEFINITION 1 (CAN). An estimator $\hat\theta$ of a parameter $\theta_0$ is $\sqrt{n}$-consistent and asymptotically normally distributed if

(4.4.3)    $\sqrt{n}\left( \hat\theta - \theta_0 \right) \overset{d}{\to} N\left( 0, V_\infty \right) ,$

where $V_\infty$ is a finite positive definite matrix.

There do exist, in special cases, estimators that are consistent such that $\sqrt{n}(\hat\theta - \theta_0) \overset{p}{\to} 0$. These are known as superconsistent estimators, since normally, $\sqrt{n}$ is the highest factor that we can multiply by and still get convergence to a stable limiting distribution.

DEFINITION 2 (Asymptotic unbiasedness). An estimator $\hat\theta$ of a parameter $\theta_0$ is asymptotically unbiased if

(4.4.4)    $\lim_{n\to\infty} E_\theta(\hat\theta) = \theta .$

Estimators that are CAN are asymptotically unbiased, though not all consistent estimators are asymptotically unbiased. Such cases are unusual, though. An example is

EXERCISE 4.5. Consider an estimator $\hat\theta$ with density
\[ f(\hat\theta) = \begin{cases} 1 - \dfrac{1}{n}, & \hat\theta = \theta_0 \\[4pt] \dfrac{1}{n}, & \hat\theta = n \end{cases} \]
Show that this estimator is consistent but asymptotically biased. Also ask yourself how you could define an estimator that would have this density.

4.6. The information matrix equality

We will show that $H_\infty(\theta) = -I_\infty(\theta)$. Let $f_t(\theta)$ be short for $f(y_t|x_t, \theta)$. We have
\[ 1 = \int f_t(\theta)\, dy , \]
so
\[ 0 = \int D_\theta f_t(\theta)\, dy = \int \left[ D_\theta \ln f_t(\theta) \right] f_t(\theta)\, dy . \]
Now differentiate again:
\begin{align*}
0 &= \int \left[ D^2_\theta \ln f_t(\theta) \right] f_t(\theta)\, dy + \int \left[ D_\theta \ln f_t(\theta) \right]\left[ D_{\theta'} \ln f_t(\theta) \right] f_t(\theta)\, dy \\
&= E_\theta\left[ D^2_\theta \ln f_t(\theta) \right] + E_\theta\left[ g_t(\theta)\, g_t(\theta)' \right] ,
\end{align*}
so we obtain

(4.6.1)    $E_\theta\left[ H_t(\theta) \right] = -E_\theta\left[ g_t(\theta)\, g_t(\theta)' \right] .$

Now sum over $n$ and multiply by $\frac{1}{n}$:
\[ E_\theta\left[ \frac{1}{n} \sum_{t=1}^{n} H_t(\theta) \right] = -E_\theta\left[ \frac{1}{n} \sum_{t=1}^{n} g_t(\theta)\, g_t(\theta)' \right] . \]
The scores $g_t$ and $g_s$ are uncorrelated for $t \ne s$, since for $t > s$, $f(y_t|y_1, \ldots, y_{t-1}, z_t, \theta)$ has conditioned on prior information, so what was random in $s$ is fixed in $t$. (This forms the basis for a specification test proposed by White: if the scores appear to be correlated one may question the specification of the model.) Since all cross products between different periods have expectation zero, this allows us to write
\[ E_\theta\left[ H_n(\theta) \right] = -E_\theta\left[ n\, g_n(\theta)\, g_n(\theta)' \right] . \]

Finally, taking limits, we get

(4.6.2)    $H_\infty(\theta) = -I_\infty(\theta) .$

This holds for all $\theta$, in particular, for $\theta_0$. Using this,
\[ \sqrt{n}\left( \hat\theta - \theta_0 \right) \overset{a}{\sim} N\left( 0,\ H_\infty(\theta_0)^{-1}\, I_\infty(\theta_0)\, H_\infty(\theta_0)^{-1} \right) \]
simplifies to

(4.6.3)    $\sqrt{n}\left( \hat\theta - \theta_0 \right) \overset{a}{\sim} N\left( 0,\ I_\infty(\theta_0)^{-1} \right) .$

To estimate the asymptotic variance, we need estimators of $H_\infty(\theta_0)$ and $I_\infty(\theta_0)$. We can use
\[ \widehat{I_\infty(\theta_0)} = \frac{1}{n} \sum_{t=1}^{n} g_t(\hat\theta)\, g_t(\hat\theta)', \qquad \widehat{H_\infty(\theta_0)} = H(\hat\theta) . \]
Note, one can't use
\[ \widehat{I_\infty(\theta_0)} = n\, g_n(\hat\theta)\, g_n(\hat\theta)' \]
to estimate the information matrix. Why not? From this we see that there are alternative ways to estimate $V_\infty(\theta_0)$ that are all valid. These include
\begin{align*}
\widehat{V_\infty(\theta_0)} &= -\widehat{H_\infty(\theta_0)}^{-1} \\
\widehat{V_\infty(\theta_0)} &= \widehat{I_\infty(\theta_0)}^{-1} \\
\widehat{V_\infty(\theta_0)} &= \widehat{H_\infty(\theta_0)}^{-1}\, \widehat{I_\infty(\theta_0)}\, \widehat{H_\infty(\theta_0)}^{-1} .
\end{align*}
These are known as the inverse Hessian, outer product of the gradient (OPG) and sandwich estimators, respectively. The sandwich form is the most robust, since it coincides with the covariance estimator of the quasi-ML estimator.
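For a scalar example, the three estimators can be computed explicitly. The following Python sketch uses an exponential model with rate $\lambda_0 = 2$ (the model and parameter value are assumptions made purely for illustration); all three estimators should be close to the true asymptotic variance $\lambda_0^2 = 4$:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 5000
lam = 2.0
y = rng.exponential(1.0 / lam, n)   # Exponential(lam): ln f = ln(lam) - lam*y

lam_hat = 1.0 / y.mean()            # the MLE
g = 1.0 / lam_hat - y               # per-observation scores at lam_hat
H = -1.0 / lam_hat**2               # per-observation Hessian (nonstochastic here)

V_hessian  = -1.0 / H                                 # inverse (negative) Hessian
V_opg      = 1.0 / np.mean(g * g)                     # outer product of the gradient
V_sandwich = (1.0 / H) * np.mean(g * g) * (1.0 / H)   # H^{-1} I H^{-1}

print(V_hessian, V_opg, V_sandwich)  # all approximately lam^2 = 4
```

When the model is correctly specified, as here, the three agree asymptotically (the information matrix equality); the sandwich form remains valid under some forms of misspecification, which is why it is called the most robust.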

4.7. The Cramér-Rao lower bound

THEOREM 3 (Cramér-Rao Lower Bound). The limiting variance of a CAN estimator of $\theta_0$, say $\tilde\theta$, minus the inverse of the information matrix is a positive semidefinite matrix.

Proof: Since the estimator is CAN, it is asymptotically unbiased, so
\[ \lim_{n\to\infty} E_\theta\left( \tilde\theta - \theta \right) = 0 . \]
Differentiate wrt $\theta'$:
\[ D_{\theta'} \lim_{n\to\infty} E_\theta\left( \tilde\theta - \theta \right) = \lim_{n\to\infty} \int D_{\theta'}\left[ f(Y, \theta)\left( \tilde\theta - \theta \right) \right] dy = 0 \]
(this is a $K \times K$ matrix of zeros). Noting that $D_{\theta'} f(Y, \theta) = f(\theta)\, D_{\theta'} \ln f(\theta)$, we can write
\[ \lim_{n\to\infty} \int \left( \tilde\theta - \theta \right) f(\theta)\, D_{\theta'} \ln f(\theta)\, dy + \lim_{n\to\infty} \int f(Y, \theta)\, D_{\theta'}\left( \tilde\theta - \theta \right) dy = 0 . \]
Now note that $D_{\theta'}(\tilde\theta - \theta) = -I_K$, and $\int f(Y, \theta)(-I_K)\, dy = -I_K$. With this we have
\[ \lim_{n\to\infty} \int \left( \tilde\theta - \theta \right) f(\theta)\, D_{\theta'} \ln f(\theta)\, dy = I_K . \]
Playing with powers of $n$ we get
\[ \lim_{n\to\infty} \int \sqrt{n}\left( \tilde\theta - \theta \right) \left[ \sqrt{n}\, \frac{1}{n}\, D_{\theta'} \ln f(\theta) \right] f(\theta)\, dy = I_K . \]
Note that the bracketed part is just the transpose of the score vector $g(\theta)$, so we can write
\[ \lim_{n\to\infty} E_\theta\left[ \sqrt{n}\left( \tilde\theta - \theta \right) \sqrt{n}\, g(\theta)' \right] = I_K . \]
This means that the covariance of the score function with $\sqrt{n}(\tilde\theta - \theta)$, for $\tilde\theta$ any CAN estimator, is an identity matrix. Using this, suppose the variance of $\sqrt{n}(\tilde\theta - \theta)$ tends to $V_\infty(\tilde\theta)$. Therefore,
\[ V_\infty \begin{bmatrix} \sqrt{n}\left( \tilde\theta - \theta \right) \\ \sqrt{n}\, g(\theta) \end{bmatrix} = \begin{bmatrix} V_\infty(\tilde\theta) & I_K \\ I_K & I_\infty(\theta) \end{bmatrix} . \]
Since this is a covariance matrix, it is positive semi-definite. Therefore, for any $K$-vector $\alpha$,
\[ \begin{bmatrix} \alpha' & -\alpha' I_\infty(\theta)^{-1} \end{bmatrix} \begin{bmatrix} V_\infty(\tilde\theta) & I_K \\ I_K & I_\infty(\theta) \end{bmatrix} \begin{bmatrix} \alpha \\ -I_\infty(\theta)^{-1}\alpha \end{bmatrix} \ge 0 . \]
This simplifies to
\[ \alpha' \left[ V_\infty(\tilde\theta) - I_\infty(\theta)^{-1} \right] \alpha \ge 0 . \]
Since $\alpha$ is arbitrary, $V_\infty(\tilde\theta) - I_\infty(\theta)^{-1}$ is positive semidefinite. This concludes the proof. This means that $I_\infty(\theta)^{-1}$ is a lower bound for the asymptotic variance of a CAN estimator.

DEFINITION 4.7.1 (Asymptotic efficiency). Given two CAN estimators of a parameter $\theta_0$, say $\tilde\theta$ and $\hat\theta$, $\hat\theta$ is asymptotically efficient with respect to $\tilde\theta$ if $V_\infty(\tilde\theta) - V_\infty(\hat\theta)$ is a positive semidefinite matrix.

A direct proof of asymptotic efficiency of an estimator is infeasible, but if one can show that the asymptotic variance is equal to the inverse of the information matrix, then the estimator is asymptotically efficient. In particular, the MLE is asymptotically efficient.

Summary of MLE:
Consistent
Asymptotically normal (CAN)
Asymptotically efficient
Asymptotically unbiased

This is for general MLE: we haven't specified the distribution or the linearity/nonlinearity of the estimator.
Exercises

(1) Consider coin tossing with a single possibly biased coin. The density function for the random variable $y = 1(\text{heads})$ is
\[ f_Y(y; p_0) = p_0^{\,y}\left( 1 - p_0 \right)^{1-y}, \quad y \in \{0, 1\} . \]
Suppose that we have a sample of size $n$. We know from above that the ML estimator is $\widehat{p_0} = \bar y$. We also know from the theory above that
\[ \sqrt{n}\left( \bar y - p_0 \right) \overset{a}{\sim} N\left( 0,\ H_\infty(p_0)^{-1}\, I_\infty(p_0)\, H_\infty(p_0)^{-1} \right) . \]
a) Find the analytical expressions for $H_\infty(p_0)$ and $I_\infty(p_0)$ for this problem.
b) Write an Octave program that does a Monte Carlo study that shows that $\sqrt{n}(\bar y - p_0)$ is approximately normally distributed when $n$ is large. Please give me histograms that show the sampling frequency of $\sqrt{n}(\bar y - p_0)$ for several values of $n$.

(2) Consider the model $y_t = x_t'\beta + \alpha\varepsilon_t$ where the errors follow the Cauchy (Student-t with 1 degree of freedom) density
\[ f(\varepsilon) = \frac{1}{\pi\left( 1 + \varepsilon^2 \right)}, \quad -\infty < \varepsilon < \infty . \]
The Cauchy density has a shape similar to a normal density, but with much thicker tails. Thus, extremely small and large errors occur much more frequently with this density than would happen if the errors were normally distributed. Find the score function $g_n(\theta)$, where $\theta = (\beta', \alpha)'$.

(3) Consider the classical linear regression model $y_t = x_t'\beta + \sigma\varepsilon_t$ where $\varepsilon_t \sim \text{IIN}(0, 1)$. Find the score function $g_n(\theta)$, where $\theta = (\beta', \sigma)'$.

(4) Compare the first order conditions that define the ML estimators of problems 2 and 3 and interpret the differences. Why are the first order conditions that define an efficient estimator different in the two cases?

CHAPTER 5

Asymptotic properties of the least squares estimator

The OLS estimator under the classical assumptions is unbiased and BLUE, for all sample sizes. Now let’s see what happens when the sample size tends to inﬁnity.

5.1. Consistency

since the inverse of a nonsingular matrix is a continuous function of the

elements of the matrix. Considering

Each

has expectation zero, so

70

Consider the last two terms. By assumption

ó I yf ð « «

I @ 1 G ¬ ) f f îy « f Á « æ ó y f ð tf A « « 1 É 1 GR I R Æ P G% I 6XiP R © R ¥

©GVP§"¥tRi I 6Qi¬¥ © R © Rf I Q R ¬¥ Ée1 Ù RG Æ p§ p§ 1 RG

§ t¬G I « æ

2. If the error distribution is unknown.2. The consistency proof does not use the normality assumption. as long as terms of an average have ﬁnite variances and are not too strongly dependent. but the the other classical 8 ¢ è R © e E¨ ¥ é . one will be able to ﬁnd a LLN or CLT to apply. we of course don’t know the distribution of the estimator. of which there are very many to choose from. ASYMPTOTIC NORMALITY 71 The variance of each term is As long as these are ﬁnite. However. and given a technical condition1. Basically.5. so is unknown. the Kolmogorov This implies that This is the property of strong consistency: the estimator converges in almost surely to the true value. Assuming the distribution of 8 ì ¬ G 8p§ ì § I@ 1 f ) SLLN applies. G sults. Remember that almost sure convergence implies convergence in probability. I’m going to avoid the technicalities. we can get asymptotic re- assumptions hold: 1 For application of LLN’s and CLT’s. 5. Asymptotic normality We’ve seen that the OLS estimator is normally distributed under the assumption of normal errors.

5.3. Asymptotic efficiency

The least squares objective function is

$$s(\beta) = \sum_{t=1}^n \left(y_t - x_t'\beta\right)^2.$$

Supposing that $\varepsilon$ is normally distributed, the model is

$$y = X\beta_0 + \varepsilon,\qquad \varepsilon \sim N\left(0, \sigma_0^2 I_n\right).$$

The joint density for $y$ can be constructed using a change of variables. We have $\varepsilon = y - X\beta$, so $\partial\varepsilon/\partial y' = I_n$ and $\left|\partial\varepsilon/\partial y'\right| = 1$, so

$$f(y) = \prod_{t=1}^n \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(y_t - x_t'\beta)^2}{2\sigma^2}\right).$$

Taking logs,

$$\ln L(\beta,\sigma) = -n\ln\sqrt{2\pi} - n\ln\sigma - \sum_{t=1}^n \frac{(y_t - x_t'\beta)^2}{2\sigma^2}.$$

It's clear that the fonc for the MLE of $\beta_0$ are the same as the fonc for OLS (up to multiplication by a constant), so the estimators are the same, under the present assumptions. Therefore their properties are the same. In particular, under the classical assumptions with normality, the OLS estimator $\hat\beta$ is asymptotically efficient.

As we'll see later, it will be possible to use (iterated) linear estimation methods and still achieve asymptotic efficiency even if $Var(\varepsilon) \neq \sigma^2 I_n$, as long as $\varepsilon$ is still normally distributed. This is not the case if $\varepsilon$ is nonnormal. In general with nonnormal errors it will be necessary to use nonlinear estimation methods to achieve asymptotically efficient estimation.
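The coincidence of OLS and the Gaussian MLE can be checked numerically: the score of the normal log-likelihood with respect to $\beta$, evaluated at the OLS estimate, is exactly zero. A minimal sketch with invented data:

```python
# With normal errors, the fonc for the MLE of beta are proportional to the OLS fonc,
# so the score X'(y - X b)/sigma^2 vanishes at the OLS estimate. Simulated data.
import numpy as np

rng = np.random.default_rng(1)
n = 60
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([2.0, -1.0]) + rng.normal(size=n)

bhat = np.linalg.solve(X.T @ X, X.T @ y)      # OLS estimate
sig2_mle = np.sum((y - X @ bhat) ** 2) / n    # MLE of sigma^2 given beta = bhat

score_beta = X.T @ (y - X @ bhat) / sig2_mle  # d lnL / d beta at the OLS estimate
max_abs_score = np.abs(score_beta).max()
```

Since the OLS fonc are $X'(y - X\hat\beta) = 0$, the score is zero to floating-point precision.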

CHAPTER 6

Restrictions and hypothesis tests

6.1. Exact linear restrictions

In many cases, economic theory suggests restrictions on the parameters of a model. For example, a demand function is supposed to be homogeneous of degree zero in prices and income. If we have a Cobb-Douglas (log-linear) model,

$$\ln q = \beta_0 + \beta_1\ln p_1 + \beta_2\ln p_2 + \beta_3\ln m + \varepsilon,$$

then we need that

$$\ln q = \beta_0 + \beta_1\ln kp_1 + \beta_2\ln kp_2 + \beta_3\ln km + \varepsilon = \beta_0 + (\beta_1 + \beta_2 + \beta_3)\ln k + \beta_1\ln p_1 + \beta_2\ln p_2 + \beta_3\ln m + \varepsilon.$$

The only way to guarantee this for arbitrary $k$ is to set $\beta_1 + \beta_2 + \beta_3 = 0$, which is a parameter restriction. In particular, this is a linear equality restriction, which is probably the most commonly encountered case.

6.1.1. Imposition. The general formulation of linear equality restrictions is the model

$$y = X\beta + \varepsilon,\qquad R\beta_0 = r,$$

where $R$ is a $Q\times K$ matrix, $Q < K$, and $r$ is a $Q\times 1$ vector of constants. We assume $R$ is of rank $Q$, so that there are no redundant restrictions. We also assume that $\exists\,\beta$ that satisfies the restrictions: they aren't infeasible.

Let's consider how to estimate $\beta$ subject to the restrictions $R\beta = r$. The most obvious approach is to set up the Lagrangean

$$\min_\beta\; s(\beta) = (y - X\beta)'(y - X\beta) + 2\lambda'(R\beta - r).$$

The Lagrange multipliers are scaled by 2, which makes things less messy. The fonc are

$$D_\beta\, s(\hat\beta_R, \hat\lambda) = -2X'y + 2X'X\hat\beta_R + 2R'\hat\lambda \equiv 0$$
$$D_\lambda\, s(\hat\beta_R, \hat\lambda) = R\hat\beta_R - r \equiv 0,$$

which can be written as

$$\begin{bmatrix} X'X & R' \\ R & 0 \end{bmatrix}\begin{bmatrix} \hat\beta_R \\ \hat\lambda \end{bmatrix} = \begin{bmatrix} X'y \\ r \end{bmatrix}.$$

We get

$$\begin{bmatrix} \hat\beta_R \\ \hat\lambda \end{bmatrix} = \begin{bmatrix} X'X & R' \\ R & 0 \end{bmatrix}^{-1}\begin{bmatrix} X'y \\ r \end{bmatrix}.$$

For the masochists: stepwise inversion. Define

$$A \equiv \begin{bmatrix} (X'X)^{-1} & 0 \\ -R(X'X)^{-1} & I_Q \end{bmatrix},\qquad B \equiv \begin{bmatrix} X'X & R' \\ R & 0 \end{bmatrix},$$

so

$$AB = \begin{bmatrix} I_K & (X'X)^{-1}R' \\ 0 & -R(X'X)^{-1}R' \end{bmatrix} \equiv C,$$

and note that $P \equiv R(X'X)^{-1}R'$ is nonsingular, since $R$ has full row rank. Next define

$$D \equiv \begin{bmatrix} I_K & (X'X)^{-1}R'P^{-1} \\ 0 & -P^{-1} \end{bmatrix},$$

so $DC = I_{K+Q}$, and therefore $DAB = I$, i.e., $B^{-1} = DA$:

$$B^{-1} = \begin{bmatrix} (X'X)^{-1} - (X'X)^{-1}R'P^{-1}R(X'X)^{-1} & (X'X)^{-1}R'P^{-1} \\ P^{-1}R(X'X)^{-1} & -P^{-1} \end{bmatrix}.$$
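The partitioned-inverse algebra can be verified numerically: solving the stacked fonc system directly and applying the resulting correction form $\hat\beta_R = \hat\beta - (X'X)^{-1}R'P^{-1}(R\hat\beta - r)$ must give the same restricted estimate. A sketch with invented data and a single restriction:

```python
# Restricted least squares two ways: (a) solve the stacked first-order-condition
# system for (beta_R, lambda); (b) correct the unrestricted OLS estimate.
# Data and the restriction beta_2 + beta_3 = 1 are invented for the check.
import numpy as np

rng = np.random.default_rng(2)
n, K = 50, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
y = X @ np.array([1.0, 0.5, 0.5, -1.0]) + rng.normal(size=n)
R = np.array([[0.0, 1.0, 1.0, 0.0]])
r = np.array([1.0])

XtX, Xty = X.T @ X, X.T @ y
bhat = np.linalg.solve(XtX, Xty)

# (a) [X'X R'; R 0] [b; lam] = [X'y; r]
B = np.block([[XtX, R.T], [R, np.zeros((1, 1))]])
b_stacked = np.linalg.solve(B, np.concatenate([Xty, r]))[:K]

# (b) correction formula with P = R (X'X)^{-1} R'
XtXi = np.linalg.inv(XtX)
P = R @ XtXi @ R.T
b_corr = bhat - XtXi @ R.T @ np.linalg.solve(P, R @ bhat - r)
```

Both routes satisfy the restriction exactly and agree to floating-point precision.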

so (everyone should start paying attention again)

$$\hat\beta_R = \hat\beta - (X'X)^{-1}R'P^{-1}\left(R\hat\beta - r\right),\qquad \hat\lambda = P^{-1}\left(R\hat\beta - r\right),$$

where $P \equiv R(X'X)^{-1}R'$. The fact that $\hat\beta_R$ and $\hat\lambda$ are linear functions of $\hat\beta$ makes it easy to determine their distributions, since the distribution of $\hat\beta$ is already known. Recall that for $Q$ a random vector, and for $A$ and $b$ a matrix and vector of constants, respectively, $Var(AQ + b) = A\,Var(Q)\,A'$.

Though this is the obvious way to go about finding the restricted estimator, an easier way, if the number of restrictions is small, is to impose them by substitution. Write

$$y = X_1\beta_1 + X_2\beta_2 + \varepsilon$$
$$\begin{bmatrix} R_1 & R_2 \end{bmatrix}\begin{bmatrix} \beta_1 \\ \beta_2 \end{bmatrix} = r,$$

where $R_1$ is $Q\times Q$ nonsingular. Supposing the $Q$ restrictions are linearly independent, one can always make $R_1$ nonsingular by reorganizing the columns of $X$. Then

$$\beta_1 = R_1^{-1}r - R_1^{-1}R_2\beta_2.$$

Substitute this into the model:

$$y = X_1 R_1^{-1}r - X_1 R_1^{-1}R_2\beta_2 + X_2\beta_2 + \varepsilon$$
$$y - X_1 R_1^{-1}r = \left[X_2 - X_1 R_1^{-1}R_2\right]\beta_2 + \varepsilon,$$

or with the appropriate definitions,

$$y_R = X_R\beta_2 + \varepsilon.$$

This model satisfies the classical assumptions, supposing the restriction is true. One can estimate by OLS, and the estimator is

$$\hat\beta_2 = \left(X_R'X_R\right)^{-1}X_R'y_R.$$

The variance of $\hat\beta_2$ is as before:

$$V(\hat\beta_2) = \left(X_R'X_R\right)^{-1}\sigma_0^2,$$

where one estimates $\sigma_0^2$ in the normal way, using the restricted model. To recover $\hat\beta_1$, use the restriction. To find the variance of $\hat\beta_1$, use the fact that it is a linear function of $\hat\beta_2$, so

$$V(\hat\beta_1) = R_1^{-1}R_2\,V(\hat\beta_2)\,R_2'\left(R_1^{-1}\right)' = R_1^{-1}R_2\left(X_R'X_R\right)^{-1}R_2'\left(R_1^{-1}\right)'\sigma_0^2.$$

6.1.2. Properties of the restricted estimator. We have that

$$\begin{aligned}
\hat\beta_R &= \hat\beta - (X'X)^{-1}R'P^{-1}\left(R\hat\beta - r\right)\\
&= \beta_0 + (X'X)^{-1}X'\varepsilon + (X'X)^{-1}R'P^{-1}\left[r - R\beta_0\right] - (X'X)^{-1}R'P^{-1}R(X'X)^{-1}X'\varepsilon.
\end{aligned}$$

Mean squared error is

$$MSE(\hat\beta_R) = E\left[\left(\hat\beta_R - \beta_0\right)\left(\hat\beta_R - \beta_0\right)'\right].$$

Noting that the crosses between the second term and the other terms expect to zero, and that the cross of the first and third has a cancellation with the square of the third, we obtain

$$MSE(\hat\beta_R) = (X'X)^{-1}\sigma^2 + (X'X)^{-1}R'P^{-1}\left[r - R\beta_0\right]\left[r - R\beta_0\right]'P^{-1}R(X'X)^{-1} - (X'X)^{-1}R'P^{-1}R(X'X)^{-1}\sigma^2.$$

So:

(1) The first term is the OLS covariance. The second term is PSD, and the third term is NSD.
(2) If the restriction is true, the second term is 0, so we are better off. True restrictions improve efficiency of estimation.
(3) If the restriction is false, we may be better or worse off, in terms of MSE, depending on the magnitudes of $r - R\beta_0$ and $\sigma^2$.
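The PSD/NSD decomposition can be illustrated by simulation: when the restriction is true the restricted estimator has smaller MSE, and when it is badly false, larger. All numbers below are invented:

```python
# Monte Carlo: MSE of restricted vs. unrestricted LS under a true and a false restriction.
# Regressors are held fixed across replications; the design is invented.
import numpy as np

rng = np.random.default_rng(4)
n, reps = 30, 2000
beta0 = np.array([1.0, 0.4, 0.6])
R = np.array([[0.0, 1.0, 1.0]])       # restricts beta_2 + beta_3
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
XtXi = np.linalg.inv(X.T @ X)

def mse_pair(r_val):
    """(MSE of OLS, MSE of restricted LS) when imposing R beta = r_val."""
    acc_u = acc_r = 0.0
    for _ in range(reps):
        y = X @ beta0 + rng.normal(size=n)
        bhat = XtXi @ (X.T @ y)
        adj = XtXi @ R.T @ np.linalg.solve(R @ XtXi @ R.T, R @ bhat - [r_val])
        acc_u += np.sum((bhat - beta0) ** 2)
        acc_r += np.sum((bhat - adj - beta0) ** 2)
    return acc_u / reps, acc_r / reps

mse_u_true, mse_r_true = mse_pair(1.0)    # true restriction: 0.4 + 0.6 = 1
mse_u_false, mse_r_false = mse_pair(3.0)  # false by a wide margin
```

Imposing the true restriction lowers the MSE; imposing the badly false one raises it, exactly as points (2) and (3) predict.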

6.2. Testing

In many cases, one wishes to test economic theories. If theory suggests parameter restrictions, as in the above homogeneity example, one can test theory by testing parameter restrictions. A number of tests are available.

6.2.1. t-test. Suppose one has the model $y = X\beta_0 + \varepsilon$ and one wishes to test the single restriction $H_0: R\beta_0 = r$ vs. $H_A: R\beta_0 \neq r$. Under $H_0$, with normality of the errors,

$$R\hat\beta - r \sim N\left(0, R(X'X)^{-1}R'\sigma_0^2\right),$$

so

$$\frac{R\hat\beta - r}{\sqrt{R(X'X)^{-1}R'\sigma_0^2}} \sim N(0,1).$$

The problem is that $\sigma_0^2$ is unknown. One could use the consistent estimator $\hat\sigma_0^2$ in place of $\sigma_0^2$, but the test would only be valid asymptotically in this case. We need a few results on the $\chi^2$ distribution.

PROPOSITION 4. $\dfrac{N(0,1)}{\sqrt{\chi^2(q)/q}} \sim t(q)$, as long as the $N(0,1)$ and the $\chi^2(q)$ are independent.

PROPOSITION 5. If $x \sim N(\mu, I_n)$ is a vector of $n$ independent r.v.'s, then $x'x \sim \chi^2(n,\lambda)$, where $\lambda = \sum_i \mu_i^2 = \mu'\mu$ is the noncentrality parameter.

When a $\chi^2$ r.v. has the noncentrality parameter equal to zero, it is referred to as a central $\chi^2$ r.v., and its distribution is written as $\chi^2(n)$, suppressing the noncentrality parameter.

PROPOSITION 6. If the $n$-dimensional random vector $x \sim N(0,V)$, then $x'V^{-1}x \sim \chi^2(n)$.

We'll prove this one as an indication of how the following unproven propositions could be proved. Proof: Factor $V^{-1}$ as $P'P$ (this is the Cholesky factorization). Then consider $y = Px$. We have $y \sim N(0, PVP')$, but $VP'P = I_n$, so $PVP'P = P$, and thus $PVP' = I_n$, so $y \sim N(0, I_n)$. Thus $y'y = x'P'Px = x'V^{-1}x \sim \chi^2(n)$, and we get the result we wanted.

A more general proposition which implies this result is

PROPOSITION 7. If the $n$-dimensional random vector $x \sim N(0,V)$, then $x'Bx \sim \chi^2(\rho(B))$ if and only if $BV$ is idempotent.

An immediate consequence is

PROPOSITION 8. If the random vector (of dimension $n$) $x \sim N(0,I)$, and $B$ is idempotent with rank $r$, then $x'Bx \sim \chi^2(r)$.

Consider the random variable

$$\frac{\hat\varepsilon'\hat\varepsilon}{\sigma_0^2} = \frac{\varepsilon'M_X\varepsilon}{\sigma_0^2} = \left(\frac{\varepsilon}{\sigma_0}\right)'M_X\left(\frac{\varepsilon}{\sigma_0}\right) \sim \chi^2(n-K).$$

PROPOSITION 9. If the random vector (of dimension $n$) $x \sim N(0,I)$, then $Ax$ and $x'Bx$ are independent if $AB = 0$.

Now consider (remember that we have only one restriction in this case)

$$\frac{\dfrac{R\hat\beta - r}{\sigma_0\sqrt{R(X'X)^{-1}R'}}}{\sqrt{\dfrac{\hat\varepsilon'\hat\varepsilon}{(n-K)\sigma_0^2}}} = \frac{R\hat\beta - r}{\hat\sigma_0\sqrt{R(X'X)^{-1}R'}}.$$

This will have the $t(n-K)$ distribution if $\hat\beta$ and $\hat\varepsilon'\hat\varepsilon$ are independent. But $\hat\beta = \beta_0 + (X'X)^{-1}X'\varepsilon$ and

$$(X'X)^{-1}X'M_X = 0,$$

so

$$\frac{R\hat\beta - r}{\hat\sigma_0\sqrt{R(X'X)^{-1}R'}} = \frac{R\hat\beta - r}{\hat\sigma_{R\hat\beta}} \sim t(n-K).$$

In particular, for the commonly encountered test of significance of an individual coefficient, for which $H_0: \beta_i = 0$ vs. $H_A: \beta_i \neq 0$, the test statistic is

$$\frac{\hat\beta_i}{\hat\sigma_{\hat\beta_i}} \sim t(n-K).$$

Note: the t-test is strictly valid only if the errors are actually normally distributed. If one has nonnormal errors, one could use the above asymptotic result to justify taking critical values from the $N(0,1)$ distribution, since $t(n-K) \overset{d}{\to} N(0,1)$ as $n\to\infty$. In practice, a conservative procedure is to take critical values from the $t$ distribution if nonnormality is suspected. This will reject less often since the $t$ distribution is fatter-tailed than is the normal.

6.2.2. F test. The F test allows testing multiple restrictions jointly.

PROPOSITION 10. If $x \sim \chi^2(r)$ and $y \sim \chi^2(s)$, then $\dfrac{x/r}{y/s} \sim F(r,s)$, provided that $x$ and $y$ are independent.

PROPOSITION 11. If the random vector (of dimension $n$) $x \sim N(0,I)$, then $x'Ax$ and $x'Bx$ are independent if $AB = 0$.

Using these results, and previous results on the $\chi^2$ distribution, it is simple to show that the following statistic has the $F$ distribution:

$$F = \frac{\left(R\hat\beta - r\right)'\left(R(X'X)^{-1}R'\right)^{-1}\left(R\hat\beta - r\right)}{q\,\hat\sigma_0^2} \sim F(q, n-K).$$

A numerically equivalent expression is

$$\frac{\left(ESS_R - ESS_U\right)/q}{ESS_U/(n-K)} \sim F(q, n-K).$$

Note: The F test is strictly valid only if the errors are truly normally distributed.
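The numerical equivalence of the two F formulas can be checked directly (invented data; the restriction sets two coefficients to zero, so the restricted model simply drops those columns):

```python
# F statistic two ways: quadratic form in (R bhat - r) vs. the ESS_R / ESS_U form.
# Simulated data; the null beta_2 = beta_3 = 0 is true by construction.
import numpy as np

rng = np.random.default_rng(5)
n, K, q = 100, 4, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
y = X @ np.array([1.0, 0.0, 0.0, 0.5]) + rng.normal(size=n)
R = np.array([[0.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.0]])
r = np.zeros(2)

XtXi = np.linalg.inv(X.T @ X)
bhat = XtXi @ (X.T @ y)
e_u = y - X @ bhat
sig2 = e_u @ e_u / (n - K)

d = R @ bhat - r
F_quad = d @ np.linalg.solve(R @ XtXi @ R.T, d) / (q * sig2)

XR = X[:, [0, 3]]                       # restricted model drops columns 2 and 3
e_r = y - XR @ np.linalg.solve(XR.T @ XR, XR.T @ y)
F_ess = ((e_r @ e_r - e_u @ e_u) / q) / (e_u @ e_u / (n - K))
```

The identity holds exactly, not just asymptotically, for linear restrictions on a linear model.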

The following tests will be appropriate when one cannot assume normally distributed errors.

6.2.3. Wald-type tests. The Wald principle is based on the idea that if a restriction is true, the unrestricted model should "approximately" satisfy the restriction. Given that the least squares estimator is asymptotically normally distributed:

$$\sqrt{n}\left(\hat\beta - \beta_0\right) \overset{d}{\to} N\left(0, \sigma_0^2 Q_X^{-1}\right),$$

then under $H_0: R\beta_0 = r$, we have

$$\sqrt{n}\left(R\hat\beta - r\right) \overset{d}{\to} N\left(0, \sigma_0^2 R Q_X^{-1} R'\right),$$

so by Proposition [6]

$$n\left(R\hat\beta - r\right)'\left(\sigma_0^2 R Q_X^{-1} R'\right)^{-1}\left(R\hat\beta - r\right) \overset{d}{\to} \chi^2(q).$$

Note that $Q_X^{-1}$ and $\sigma_0^2$ are not observable. The test statistic we use substitutes the consistent estimators. Use $(X'X/n)^{-1}$ as the consistent estimator of $Q_X^{-1}$. With this, there is a cancellation of $n$'s, and the statistic to use is

$$\left(R\hat\beta - r\right)'\left(\hat\sigma_0^2 R (X'X)^{-1} R'\right)^{-1}\left(R\hat\beta - r\right) \overset{d}{\to} \chi^2(q).$$

The Wald test is a simple way to test restrictions without having to estimate the restricted model. Note that this formula is similar to one of the formulae provided for the F test.
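A minimal sketch of the Wald statistic, using only the unrestricted fit (simulated data; the 5% critical value 3.84 for $\chi^2(1)$ is from standard tables):

```python
# Wald statistic for a linear restriction, computed from the unrestricted fit only.
# Simulated data; the null beta_2 = beta_3 is true by construction.
import numpy as np

rng = np.random.default_rng(6)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 0.5, 0.5]) + rng.normal(size=n)
R = np.array([[0.0, 1.0, -1.0]])
r = np.array([0.0])

XtXi = np.linalg.inv(X.T @ X)
bhat = XtXi @ (X.T @ y)
e = y - X @ bhat
sig2 = e @ e / n                         # any consistent estimator of sigma^2 will do

d = R @ bhat - r
W = d @ np.linalg.solve(sig2 * (R @ XtXi @ R.T), d)
reject_5pct = bool(W > 3.84)             # compare with the chi^2(1) critical value
```

Note that the restricted model is never estimated; only $\hat\beta$ and its estimated covariance appear.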

6.2.4. Score-type tests (Rao tests, Lagrange multiplier tests). In some cases, an unrestricted model may be nonlinear in the parameters, but the model is linear in the parameters under the null hypothesis. Estimation of nonlinear models is a bit more complicated, so one might prefer to have a test based upon the restricted, linear model. The score test is useful in this situation.

Score-type tests are based upon the general principle that the gradient vector of the unrestricted model, evaluated at the restricted estimate, should be asymptotically normally distributed with mean zero, if the restrictions are true. The original development was for ML estimation, but the principle is valid for a wide variety of estimation methods.

We have seen that

$$\hat\lambda = \left(R(X'X)^{-1}R'\right)^{-1}\left(R\hat\beta - r\right),$$

so

$$\sqrt{n}\,R(X'X)^{-1}R'\,\hat\lambda = \sqrt{n}\left(R\hat\beta - r\right).$$

Given that $\sqrt{n}\left(R\hat\beta - r\right) \overset{d}{\to} N\left(0, \sigma_0^2 R Q_X^{-1} R'\right)$ under the null hypothesis, the quadratic form

$$n\,\hat\lambda'\,R(X'X)^{-1}R'\left(\sigma_0^2 R Q_X^{-1}R'\right)^{-1}R(X'X)^{-1}R'\,\hat\lambda \overset{d}{\to} \chi^2(q).$$

Since $n\,R(X'X)^{-1}R' \to R Q_X^{-1}R'$, the $n$'s cancel, and inserting the limit of a matrix of constants changes nothing, so

$$\frac{\hat\lambda'\,R(X'X)^{-1}R'\,\hat\lambda}{\sigma_0^2} \overset{d}{\to} \chi^2(q).$$

To get a usable test statistic, substitute a consistent estimator of $\sigma_0^2$.

It may seem that one needs the actual Lagrange multipliers to calculate this. However, we can use the fonc for the restricted estimator,

$$-X'y + X'X\hat\beta_R + R'\hat\lambda = 0,$$

to get that

$$R'\hat\lambda = X'\left(y - X\hat\beta_R\right) = X'\hat\varepsilon_R.$$

So there is a cancellation and we get

$$\frac{\hat\varepsilon_R'\,X(X'X)^{-1}X'\,\hat\varepsilon_R}{\hat\sigma_0^2} \overset{d}{\to} \chi^2(q).$$

This makes it clear why the test is sometimes referred to as a Lagrange multiplier test.

To see why the test is also known as a score test, note that the fonc for restricted least squares give us

$$R'\hat\lambda = X'y - X'X\hat\beta_R,$$

and the rhs is simply the gradient (score) of the unrestricted model, evaluated at the restricted estimator. The scores evaluated at the unrestricted estimate are identically zero. The logic behind the score test is that the scores evaluated at the restricted estimate should be approximately zero, if the restriction is true. The test is also known as a Rao test, since Rao first proposed it in 1948.
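A sketch of the LM statistic using only restricted-model quantities; it also verifies the familiar equivalence with $nR^2$ from regressing the restricted residuals on the full regressor matrix (exact here because the restricted model contains the intercept). Data are invented:

```python
# Score / Lagrange multiplier test from restricted-model quantities only:
# LM = e_R' X (X'X)^{-1} X' e_R / (e_R'e_R / n), which equals n * R^2 from the
# auxiliary regression of the restricted residuals e_R on the full X. Simulated data.
import numpy as np

rng = np.random.default_rng(7)
n = 150
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 0.0, 0.7]) + rng.normal(size=n)

XR = X[:, [0, 2]]                        # restricted model imposes beta_2 = 0
eR = y - XR @ np.linalg.solve(XR.T @ XR, XR.T @ y)
sig2R = eR @ eR / n

PeR = X @ np.linalg.solve(X.T @ X, X.T @ eR)   # projection of e_R onto span(X)
LM = eR @ PeR / sig2R

R2_aux = eR @ PeR / (eR @ eR)            # R^2 of the auxiliary regression (mean(e_R) = 0)
LM_nR2 = n * R2_aux
```

Only the restricted fit had to be computed; the unrestricted model never needs to be estimated.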

6.2.5. Likelihood ratio-type tests. The likelihood ratio statistic is

$$LR = 2\left(\ln L(\hat\theta) - \ln L(\tilde\theta)\right),$$

where $\hat\theta$ is the unrestricted estimate and $\tilde\theta$ is the restricted estimate. To show that it is asymptotically $\chi^2$, take a second order Taylor's series expansion of $\ln L(\tilde\theta)$ about $\hat\theta$:

$$\ln L(\tilde\theta) \simeq \ln L(\hat\theta) + \frac{n}{2}\left(\tilde\theta - \hat\theta\right)'H(\hat\theta)\left(\tilde\theta - \hat\theta\right)$$

(note, the first order term drops out since $D_\theta\ln L(\hat\theta) \equiv 0$ by the fonc, and we need to multiply the second-order term by $n$ since $H(\theta)$ is defined in terms of $\frac{1}{n}\ln L(\theta)$), so

$$LR \simeq -n\left(\tilde\theta - \hat\theta\right)'H(\hat\theta)\left(\tilde\theta - \hat\theta\right).$$

As $n\to\infty$, $H(\hat\theta) \to -I_\infty(\theta_0)$ by the information matrix equality, so

$$LR \simeq n\left(\tilde\theta - \hat\theta\right)'I_\infty(\theta_0)\left(\tilde\theta - \hat\theta\right).$$

We also have, from the theory of maximum likelihood estimation, that

$$\sqrt{n}\left(\hat\theta - \theta_0\right) \simeq I_\infty(\theta_0)^{-1}\,n^{1/2}g(\theta_0).$$

An analogous result for the restricted estimator is (this is unproven here; to prove this, set up the Lagrangean for MLE subject to $R\theta = r$, and manipulate the first order conditions):

$$\sqrt{n}\left(\tilde\theta - \theta_0\right) \simeq I_\infty(\theta_0)^{-1}\left(I - R'\left(RI_\infty(\theta_0)^{-1}R'\right)^{-1}RI_\infty(\theta_0)^{-1}\right)n^{1/2}g(\theta_0).$$

Combining the last two equations,

$$\sqrt{n}\left(\tilde\theta - \hat\theta\right) \simeq -n^{1/2}\,I_\infty(\theta_0)^{-1}R'\left(RI_\infty(\theta_0)^{-1}R'\right)^{-1}RI_\infty(\theta_0)^{-1}g(\theta_0),$$

so, substituting into the expression for $LR$ above,

$$LR \simeq n\,g(\theta_0)'\,I_\infty(\theta_0)^{-1}R'\left(RI_\infty(\theta_0)^{-1}R'\right)^{-1}RI_\infty(\theta_0)^{-1}\,g(\theta_0).$$

But since

$$n^{1/2}g(\theta_0) \overset{d}{\to} N\left(0, I_\infty(\theta_0)\right),$$

the linear function

$$RI_\infty(\theta_0)^{-1}\,n^{1/2}g(\theta_0) \overset{d}{\to} N\left(0, RI_\infty(\theta_0)^{-1}R'\right).$$

We can see that $LR$ is a quadratic form of this rv, with the inverse of its variance in the middle, so

$$LR \overset{d}{\to} \chi^2(q).$$

6.3. The asymptotic equivalence of the LR, Wald and score tests

We have seen that the three tests all converge to $\chi^2$ random variables. In fact, they all converge to the same rv, under the null hypothesis. We'll show that the Wald and LR tests are asymptotically equivalent. We have seen that the Wald test is asymptotically equivalent to

$$W \simeq n\left(R\hat\beta - r\right)'\left(\sigma_0^2 R Q_X^{-1}R'\right)^{-1}\left(R\hat\beta - r\right) \overset{d}{\to} \chi^2(q).$$

Using $\hat\beta - \beta_0 = (X'X)^{-1}X'\varepsilon$ and $R\hat\beta - r = R\left(\hat\beta - \beta_0\right)$ (under the null), we get

$$\sqrt{n}\,R\left(\hat\beta - \beta_0\right) = \sqrt{n}\,R(X'X)^{-1}X'\varepsilon = R\left(\frac{X'X}{n}\right)^{-1}\frac{X'\varepsilon}{\sqrt{n}}.$$

Substitute this into the expression for $W$ to get

$$\begin{aligned}
W &\simeq \varepsilon'X(X'X)^{-1}R'\left(\sigma_0^2\,R(X'X)^{-1}R'\right)^{-1}R(X'X)^{-1}X'\varepsilon\\
&\simeq \frac{\varepsilon'A\left(A'A\right)^{-1}A'\varepsilon}{\sigma_0^2}\\
&\simeq \frac{\varepsilon'P_R\,\varepsilon}{\sigma_0^2},
\end{aligned}$$

where $P_R$ is the projection matrix formed by the matrix $A \equiv X(X'X)^{-1}R'$. Note that this matrix is idempotent and $A$ has $q$ columns, so the projection matrix has rank $q$.

Now consider the likelihood ratio statistic

$$LR \simeq n^{1/2}g(\theta_0)'\,I(\theta_0)^{-1}R'\left(RI(\theta_0)^{-1}R'\right)^{-1}RI(\theta_0)^{-1}\,n^{1/2}g(\theta_0).$$

Under normality, we have seen that the likelihood function is

$$\ln L(\beta,\sigma) = -n\ln\sqrt{2\pi} - n\ln\sigma - \frac{(y - X\beta)'(y - X\beta)}{2\sigma^2},$$

so

$$g(\beta_0) \equiv D_\beta\frac{1}{n}\ln L(\beta,\sigma) = \frac{X'(y - X\beta_0)}{n\sigma^2} = \frac{X'\varepsilon}{n\sigma^2}.$$

Also, by the information matrix equality:

$$I(\theta_0) = -H_\infty(\theta_0) = \lim\left(-D_{\beta'}\,g(\beta_0)\right) = \lim\frac{X'X}{n\sigma^2} = \frac{Q_X}{\sigma^2},$$

so

$$I(\theta_0)^{-1} = \sigma^2 Q_X^{-1}.$$

Substituting these last expressions into the formula for $LR$, we get

$$LR \simeq \varepsilon'X(X'X)^{-1}R'\left(\sigma_0^2\,R(X'X)^{-1}R'\right)^{-1}R(X'X)^{-1}X'\varepsilon \simeq \frac{\varepsilon'P_R\,\varepsilon}{\sigma_0^2} \simeq W.$$

This completes the proof that the Wald and LR tests are asymptotically equivalent. Similarly, one can show that, under the null hypothesis,

$$qF \simeq W \simeq LM \simeq LR.$$

The proof for the statistics except for $LR$ does not depend upon normality of the errors, as can be verified by examining the expressions for the statistics. The $LR$ statistic is based upon distributional assumptions, since one can't write the likelihood function without them. However, due to the close relationship between the statistics $qF$ and $LR$, supposing normality, the $qF$ statistic can be thought of as a pseudo-$LR$ statistic, in that it's like a LR statistic in that it uses the value of the objective functions of the restricted and unrestricted models, but it doesn't require distributional assumptions.

The presentation of the score and Wald tests has been done in the context of the linear model. This is readily generalizable to nonlinear models and/or other estimation methods.

Though the four statistics are asymptotically equivalent, they are numerically different in small samples. The numeric values of the tests also depend upon how $\sigma^2$ is estimated, and we've already seen that there are several ways to do this. For example, all of the following are consistent for $\sigma^2$ under $H_0$:

$$\frac{\hat\varepsilon'\hat\varepsilon}{n-K},\qquad \frac{\hat\varepsilon'\hat\varepsilon}{n},\qquad \frac{\hat\varepsilon_R'\hat\varepsilon_R}{n-K+Q},\qquad \frac{\hat\varepsilon_R'\hat\varepsilon_R}{n},$$

and in general the denominator can be replaced with any quantity $a$ such that $\lim a/n = 1$.
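The choice of $\sigma^2$ estimator matters numerically. With $\hat\varepsilon'\hat\varepsilon/n$ in the Wald statistic and $\hat\varepsilon_R'\hat\varepsilon_R/n$ in the score statistic, the three statistics reduce, for linear restrictions on a linear model, to $W = n(ESS_R - ESS_U)/ESS_U$, $LR = n\ln(ESS_R/ESS_U)$ and $LM = n(ESS_R - ESS_U)/ESS_R$, so the ordering $W \ge LR \ge LM$ follows from $x \ge \ln(1+x) \ge x/(1+x)$. A quick numerical check with invented data:

```python
# W >= LR >= LM in the linear model, using ESS-based forms of the three statistics.
# Simulated data; the restricted model imposes beta_3 = 0.
import numpy as np

rng = np.random.default_rng(8)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 0.3, 0.0]) + rng.normal(size=n)

e_u = y - X @ np.linalg.solve(X.T @ X, X.T @ y)
XR = X[:, :2]
e_r = y - XR @ np.linalg.solve(XR.T @ XR, XR.T @ y)
ess_u, ess_r = e_u @ e_u, e_r @ e_r

W  = n * (ess_r - ess_u) / ess_u   # Wald, with sigma^2 = ess_u / n
LR = n * np.log(ess_r / ess_u)     # likelihood ratio under normality
LM = n * (ess_r - ess_u) / ess_r   # score, with sigma^2 = ess_r / n
```

The inequality holds in every sample, not just on average, for these particular $\sigma^2$ estimators.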

6.4. Interpretation of test statistics

Now that we have a menu of test statistics, we need to know how to use them. The small sample behavior of the tests can be quite different. The true size (probability of rejection of the null when the null is true) of the Wald test is often dramatically higher than the nominal size associated with the asymptotic distribution. Likewise, the true size of the score test is often smaller than the nominal size.

It can be shown, for linear regression models subject to linear restrictions, and if $\hat\varepsilon'\hat\varepsilon/n$ is used to calculate the Wald test and $\hat\varepsilon_R'\hat\varepsilon_R/n$ is used for the score test, that

$$W > LR > LM.$$

For this reason, the Wald test will always reject if the LR test rejects, and in turn the LR test rejects if the LM test rejects. This is a bit problematic: there is the possibility that by careful choice of the statistic used, one can manipulate reported results to favor or disfavor a hypothesis. A conservative/honest approach would be to report all three test statistics when they are available. In the case of linear models with normal errors the F test is to be preferred, since asymptotic approximations are not an issue.

6.5. Confidence intervals

Confidence intervals for single coefficients are generated in the normal manner. Given the $t$ statistic

$$t(\beta) = \frac{\hat\beta - \beta}{\hat\sigma_{\hat\beta}},$$

a $100(1-\alpha)\%$ confidence interval for $\beta_0$ is defined by the bounds of the set of $\beta$ such that $t(\beta)$ does not reject $H_0: \beta_0 = \beta$, using an $\alpha$ significance level:

$$C(\alpha) = \left\{\beta : -c_{\alpha/2} < \frac{\hat\beta - \beta}{\hat\sigma_{\hat\beta}} < c_{\alpha/2}\right\}.$$

The set of such $\beta$ is the interval

$$\hat\beta \pm \hat\sigma_{\hat\beta}\,c_{\alpha/2}.$$

A confidence ellipse for two coefficients jointly would be, analogously, the set of $\{\beta_1, \beta_2\}$ such that the $F$ (or some other test statistic) doesn't reject at the specified critical value. This generates an ellipse, if the estimators are correlated. From the picture (Figure 6.6.1) we can see that:

- Rejection of hypotheses individually does not imply that the joint test will reject.
- Joint rejection does not imply that individual tests will reject.

The region is an ellipse, since the CI for an individual coefficient defines an (infinitely long) rectangle with total prob. mass $1-\alpha$, since the other coefficient is marginalized (e.g., can take on any value). Since the ellipse is bounded in both dimensions but also contains mass $1-\alpha$, it must extend beyond the bounds of the individual CI.

6.6. Bootstrapping

When we rely on asymptotic theory to use the normal distribution-based tests and confidence intervals, we're often at serious risk of making important errors. If the sample size is small and errors are highly nonnormal, the small sample distribution of $\sqrt{n}\left(\hat\beta - \beta_0\right)$ may be very different than its large sample distribution.

FIGURE 6.6.1. Joint and Individual Confidence Regions

Also, the distributions of test statistics may not resemble their limiting distributions at all. A means of trying to gain information on the small sample distribution of test statistics and estimators is the bootstrap. We'll consider a simple example, just to get the main idea. Suppose that

$$y = X\beta_0 + \varepsilon,\qquad \varepsilon \sim IID(0, \sigma_0^2),$$

and that $X$ is nonstochastic. Given that the distribution of $\varepsilon$ is unknown, the distribution of $\hat\beta$ will be unknown in small samples. However, since we have random sampling, we could generate artificial data. The steps are:

(1) Draw $n$ observations from $\hat\varepsilon$ with replacement. Call this vector $\tilde\varepsilon^j$ (it's an $n\times 1$ vector).
(2) Then generate the data by $\tilde y^j = X\hat\beta + \tilde\varepsilon^j$.
(3) Now take this and estimate $\tilde\beta^j = (X'X)^{-1}X'\tilde y^j$.
(4) Save $\tilde\beta^j$.
(5) Repeat steps 1-4, until we have a large number, $J$, of $\tilde\beta^j$.

With this, we can use the replications to calculate the empirical distribution of $\tilde\beta^j$. One way to form a $100(1-\alpha)\%$ confidence interval for $\beta_0$ would be to order the $\tilde\beta^j$ from smallest to largest, drop the first and last $J\alpha/2$ of the replications, and use the remaining endpoints as the limits of the CI. Note that this will not give the shortest CI if the empirical distribution is skewed.
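Steps 1-5 can be sketched in a few lines (Python/NumPy here rather than the Octave used elsewhere in these notes; the data and the skewed error distribution are invented):

```python
# Residual bootstrap for the slope of a small linear model with skewed errors,
# following steps (1)-(5) above, plus the percentile confidence interval.
import numpy as np

rng = np.random.default_rng(9)
n, J = 40, 999
X = np.column_stack([np.ones(n), rng.uniform(size=n)])
eps = rng.exponential(1.0, size=n) - 1.0          # skewed, mean-zero errors
y = X @ np.array([1.0, 2.0]) + eps

bhat = np.linalg.solve(X.T @ X, X.T @ y)
ehat = y - X @ bhat

slopes = np.empty(J)
for j in range(J):
    e_star = rng.choice(ehat, size=n, replace=True)   # (1) resample residuals
    y_star = X @ bhat + e_star                        # (2) regenerate data
    b_star = np.linalg.solve(X.T @ X, X.T @ y_star)   # (3) re-estimate
    slopes[j] = b_star[1]                             # (4) save; (5) repeat

ci_lo, ci_hi = np.percentile(slopes, [2.5, 97.5])     # 95% percentile interval
```

Because the errors are skewed, the resulting interval is generally not symmetric about the point estimate, which is exactly the small-sample information the bootstrap is meant to capture.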

Suppose one was interested in the distribution of some function of $\hat\beta$, for example a test statistic. Simple: just calculate the transformation for each $j$, and work with the empirical distribution of the transformation. If the assumption of iid errors is too strong (for example if there is heteroscedasticity or autocorrelation, see below) one can work with a bootstrap defined by sampling from $(y_t, x_t)$ with replacement.

How to choose $J$: $J$ should be large enough that the results don't change with repetition of the entire bootstrap. This is easy to check. If you find the results change a lot, increase $J$ and try again.

The bootstrap is based fundamentally on the idea that the empirical distribution of the sample data converges to the actual sampling distribution as $n$ becomes large, so statistics based on sampling from the empirical distribution should converge in distribution to statistics based on sampling from the actual sampling distribution. In finite samples, this doesn't hold. At a minimum, the bootstrap is a good way to check if asymptotic theory results offer a decent approximation to the small sample distribution.

6.7. Testing nonlinear restrictions, and the Delta Method

Testing nonlinear restrictions of a linear model is not much more difficult, at least when the model is linear. Since estimation subject to nonlinear restrictions requires nonlinear estimation methods, which are beyond the scope of this course, we'll just consider the Wald test for nonlinear restrictions on a linear model.

Consider the $q$ nonlinear restrictions

$$r(\beta_0) = 0,$$

where $r(\cdot)$ is a $q$-vector valued function. Write the derivative of the restriction evaluated at $\beta$ as

$$D_{\beta'}\,r(\beta)\big|_\beta = R(\beta).$$

We suppose that the restrictions are not redundant in a neighborhood of $\beta_0$, so that $R(\beta)$ has rank $q$ in a neighborhood of $\beta_0$. Take a first order Taylor's series expansion of $r(\hat\beta)$ about $\beta_0$:

$$r(\hat\beta) = r(\beta_0) + R(\beta^*)\left(\hat\beta - \beta_0\right),$$

where $\beta^*$ is a convex combination of $\hat\beta$ and $\beta_0$. Under the null hypothesis we have

$$r(\hat\beta) = R(\beta^*)\left(\hat\beta - \beta_0\right).$$

Due to consistency of $\hat\beta$ we can replace $\beta^*$ by $\beta_0$, asymptotically, so

$$\sqrt{n}\,r(\hat\beta) = \sqrt{n}\,R(\beta_0)\left(\hat\beta - \beta_0\right).$$

We've already seen the distribution of $\sqrt{n}\left(\hat\beta - \beta_0\right)$. Using this we get

$$\sqrt{n}\,r(\hat\beta) \overset{d}{\to} N\left(0, R(\beta_0)Q_X^{-1}R(\beta_0)'\sigma_0^2\right).$$

Considering the quadratic form

$$\frac{n\,r(\hat\beta)'\left(R(\beta_0)Q_X^{-1}R(\beta_0)'\right)^{-1}r(\hat\beta)}{\sigma_0^2} \overset{d}{\to} \chi^2(q)$$

under the null hypothesis. Substituting consistent estimators for $\beta_0$, $Q_X$ and $\sigma_0^2$, the resulting statistic is

$$\frac{r(\hat\beta)'\left(R(\hat\beta)(X'X)^{-1}R(\hat\beta)'\right)^{-1}r(\hat\beta)}{\hat\sigma_0^2} \overset{d}{\to} \chi^2(q)$$

under the null hypothesis.

This is known in the literature as the Delta method, or as Klein's approximation. Since this is a Wald test, it will tend to over-reject in finite samples. The score and LR tests are also possibilities, but they require estimation methods for nonlinear models, which aren't in the scope of this course.

Note that this also gives a convenient way to estimate nonlinear functions and associated asymptotic confidence intervals. If the nonlinear function $r(\beta_0)$ is not hypothesized to be zero, we just have

$$\sqrt{n}\left(r(\hat\beta) - r(\beta_0)\right) \overset{d}{\to} N\left(0, R(\beta_0)Q_X^{-1}R(\beta_0)'\sigma_0^2\right),$$

so an approximation to the distribution of the function of the estimator is

$$r(\hat\beta) \approx N\left(r(\beta_0),\; R(\beta_0)(X'X)^{-1}R(\beta_0)'\sigma_0^2\right).$$

For example, the vector of elasticities of a function $f(x)$ is

$$\eta(x) = \frac{\partial f(x)}{\partial x}\odot\frac{x}{f(x)},$$

where $\odot$ means element-by-element multiplication. Suppose we estimate a linear function

$$y = x'\beta + \varepsilon.$$

The elasticities of $y$ w.r.t. $x$ are

$$\eta(x) = \beta\odot\frac{x}{x'\beta}$$

(note that this is the entire vector of elasticities). The estimated elasticities are

$$\hat\eta(x) = \hat\beta\odot\frac{x}{x'\hat\beta}.$$

To calculate the estimated standard errors of all the elasticities, use

$$R(\beta) = \frac{\partial\eta(x)}{\partial\beta'}.$$

To get a consistent estimator just substitute in $\hat\beta$. Note that the elasticity and the standard error are functions of $x$. The program ExampleDeltaMethod.m shows how this can be done.
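A sketch comparing the delta-method standard error of a nonlinear function of $\hat\beta$ — here $1/\hat\beta_2$, the form of estimated returns to scale used later for the Nerlove model — against its Monte Carlo standard deviation. The design is invented and $\sigma_0^2 = 1$:

```python
# Delta method check: sd of g(bhat) = 1/bhat_2 from the gradient formula
# grad' (X'X)^{-1} grad * sigma^2 vs. the Monte Carlo standard deviation.
import numpy as np

rng = np.random.default_rng(10)
n, reps = 200, 3000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta0 = np.array([1.0, 2.0])
XtXi = np.linalg.inv(X.T @ X)

g = np.empty(reps)
for i in range(reps):
    y = X @ beta0 + rng.normal(size=n)         # sigma_0^2 = 1
    b = XtXi @ (X.T @ y)
    g[i] = 1.0 / b[1]                          # nonlinear function of the estimate

grad = np.array([0.0, -1.0 / beta0[1] ** 2])   # gradient of 1/b_2 at beta0
sd_delta = np.sqrt(grad @ XtXi @ grad)
sd_mc = g.std()
rel_err = abs(sd_mc - sd_delta) / sd_delta
```

With $\hat\beta_2$ tightly concentrated around 2, the linearization is accurate and the two standard deviations agree closely; for functions with strong curvature, or noisier estimates, the approximation degrades.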

In many cases, nonlinear restrictions can also involve the data, not just the parameters. For example, consider a model of expenditure shares. Let $x(p,m)$ be a demand function, where $p$ is prices and $m$ is income. An expenditure share system for $G$ goods is

$$s_i(p,m) = \frac{p_i x_i(p,m)}{m},\qquad i = 1,2,\ldots,G.$$

Now demand must be positive, and we assume that expenditures sum to income, so we have the restrictions

$$0 \le s_i(p,m) \le 1,\ \forall i,\qquad \sum_{i=1}^G s_i(p,m) = 1.$$

Suppose we postulate a linear model for the expenditure shares. It is fairly easy to write restrictions such that the shares sum to one, but the restriction that the shares lie in the $[0,1]$ interval depends on both the parameters and the values of $p$ and $m$. It is impossible to impose the restriction that $0 \le s_i(p,m) \le 1$ for all possible $p$ and $m$. In such cases, one might consider whether or not a linear model is a reasonable specification.

6.8. Example: the Nerlove data

Remember that in a previous example (section 3.8.3) the OLS results for the Nerlove model were

*********************************************************
OLS estimation results
Observations 145
R-squared 0.925955
Sigma-squared 0.153943

Results (Ordinary var-cov estimator)

           estimate   st.err.   t-stat.   p-value
constant     -3.527     1.774    -1.987     0.049
output        0.720     0.017    41.244     0.000

labor         0.436     0.291     1.499     0.136
fuel          0.427     0.100     4.249     0.000
capital      -0.220     0.339    -0.648     0.518
*********************************************************

Remember that if we have constant returns to scale, then $\beta_Q = 1$, and if there is homogeneity of degree 1 then $\beta_L + \beta_F + \beta_K = 1$. We can test these hypotheses either separately or jointly. NerloveRestrictions.m imposes and tests CRTS and then HOD1. From it we obtain the results that follow:

Imposing and testing HOD1

*******************************************************
Restricted LS estimation results
Observations 145
R-squared 0.925652
Sigma-squared 0.155686

           estimate   st.err.   t-stat.   p-value
constant     -4.691     0.891    -5.263     0.000
output        0.721     0.018    41.040     0.000
labor         0.593     0.206     2.878     0.005
fuel          0.414     0.100     4.159     0.000
capital      -0.007     0.192    -0.038     0.969
*******************************************************

          Value    p-value
F         0.574      0.450
Wald      0.594      0.441
LR        0.593      0.441
Score     0.592      0.442

Imposing and testing CRTS

*******************************************************
Restricted LS estimation results
Observations 145
R-squared 0.790420
Sigma-squared 0.438861

           estimate   st.err.   t-stat.   p-value
constant     -7.530     2.966    -2.539     0.012
output        1.000     0.000       Inf     0.000
labor         0.020     0.489     0.040     0.968
fuel          0.715     0.167     4.289     0.000
capital       0.076     0.572     0.132     0.895
*******************************************************

          Value    p-value
F       256.262      0.000
Wald    265.414      0.000
LR      150.863      0.000

Score    93.771      0.000

Notice that the input price coefficients in fact sum to 1 when HOD1 is imposed. HOD1 is not rejected at usual significance levels (e.g., $\alpha = 0.10$). Also, $R^2$ does not drop much when the restriction is imposed, compared to the unrestricted results. For CRTS, the hypothesis that $\beta_Q = 1$ is rejected by the test statistics at all reasonable significance levels. Note that $R^2$ drops quite a bit when the restriction is imposed. If you look at the unrestricted estimation results, you can see that a t-test for $\beta_Q = 1$ also rejects, and that a confidence interval for $\beta_Q$ does not overlap 1.

From the point of view of neoclassical economic theory, these results are not anomalous: HOD1 is an implication of the theory, but CRTS is not.

EXERCISE 12. Modify the NerloveRestrictions.m program to impose and test the restrictions jointly.

The Chow test. Since CRTS is rejected, let's examine the possibilities more carefully. Recall that the data is sorted by output (the third column). Define 5 subsamples of firms, with the first group being the 29 firms with the lowest output levels, then the next 29 firms, etc. The five subsamples can be indexed by $j = 1, 2, \ldots, 5$, where $j = 1$ for $t = 1, 2, \ldots, 29$, $j = 2$ for $t = 30, \ldots, 58$, etc. Define a piecewise linear model

(6.8.1)    $\ln C_t = \beta_1^j + \beta_2^j\ln Q_t + \beta_3^j\ln P_{Lt} + \beta_4^j\ln P_{Ft} + \beta_5^j\ln P_{Kt} + \varepsilon_t,$

where $j$ is a superscript (not a power) that indicates that the coefficients may be different according to the subsample in which the observation falls. That is, the coefficients depend upon $j$, which in turn depends upon $t$.

The new model may be written as

(6.8.2)    $\begin{bmatrix} y^1 \\ y^2 \\ \vdots \\ y^5 \end{bmatrix} = \begin{bmatrix} X^1 & 0 & \cdots & 0 \\ 0 & X^2 & & \\ & & \ddots & \\ 0 & & & X^5 \end{bmatrix}\begin{bmatrix} \beta^1 \\ \beta^2 \\ \vdots \\ \beta^5 \end{bmatrix} + \begin{bmatrix} \varepsilon^1 \\ \varepsilon^2 \\ \vdots \\ \varepsilon^5 \end{bmatrix},$

where $y^1$ is $29\times 1$, $X^1$ is $29\times 5$, $\beta^j$ is the $5\times 1$ vector of coefficients for the $j$-th subsample, and $\varepsilon^j$ is the vector of errors for the $j$-th subsample. The first column of nerlove.data indicates this way of breaking up the sample.

The Octave program Restrictions/ChowTest.m estimates the above model. It also tests the hypothesis that the five subsamples share the same parameter vector, or in other words, that there is coefficient stability across the five subsamples. The null to test is that the parameter vectors for the separate groups are all the same, that is,

$$\beta^1 = \beta^2 = \beta^3 = \beta^4 = \beta^5.$$

This type of test, that parameters are constant across different sets of data, is sometimes referred to as a Chow test. There are 20 restrictions. If that's not clear to you, look at the Octave program. The restrictions are rejected at all conventional significance levels. Since the restrictions are rejected, we should probably use the unrestricted model for analysis.
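A minimal sketch of a Chow-type test with two groups instead of five (invented data with a genuine slope break, so the test should reject):

```python
# Chow test sketch: unrestricted model with separate coefficients per subsample
# (block-diagonal regressor matrix) vs. restricted pooled model; F as in section 6.2.
# Two groups of 30 observations with different slopes; all numbers invented.
import numpy as np

rng = np.random.default_rng(11)
n, n1 = 60, 30
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
slope = np.where(np.arange(n) < n1, 1.0, 3.0)          # slope break at obs 30
y = 1.0 + slope * x + rng.normal(size=n)

G = np.zeros((n, 4))                                   # block-diagonal design
G[:n1, :2] = X[:n1]
G[n1:, 2:] = X[n1:]
e_u = y - G @ np.linalg.solve(G.T @ G, G.T @ y)        # unrestricted fit

e_r = y - X @ np.linalg.solve(X.T @ X, X.T @ y)        # pooled (restricted) fit

q, K = 2, 4                                            # 2 restrictions, 4 coefficients
F = ((e_r @ e_r - e_u @ e_u) / q) / ((e_u @ e_u) / (n - K))
```

With five groups and five coefficients per group, as in the Nerlove application, the block-diagonal matrix simply has five blocks and the null imposes 20 restrictions.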

What is the pattern of RTS as a function of the output group (small to large)? Figure 6.8.1 plots RTS. We can see that there is increasing RTS for small firms, but that RTS is approximately constant for large firms.

[Figure 6.8.1: RTS as a function of firm size. Estimated RTS (vertical axis, roughly 1 to 2.6) plotted against output group (horizontal axis, 1 = smallest firms to 5 = largest), declining with firm size.]

Exercises

(1) Using the Chow test on the Nerlove model, we reject that there is coefficient stability across the 5 groups. But perhaps we could restrict the input price coefficients to be the same but let the constant and output coefficients vary by group size. This new model is

(6.8.3)   ln C_i = β₁^j + β₂^j ln Q_i + β₃ ln P_{Li} + β₄ ln P_{Fi} + β₅ ln P_{Ki} + ε_i

(a) Estimate this model by OLS, giving R², estimated standard errors for the coefficients, t-statistics for tests of significance, and the associated p-values. Interpret the results in detail.
(b) Test the restrictions implied by this model using the F, Wald, score and likelihood ratio tests. Comment on the results.
(c) Plot the estimated RTS parameters as a function of firm size. Compare the plot to that given in the notes for the unrestricted model. Comment on the results.

(2) For the simple Nerlove model, estimated returns to scale is RTS-hat = 1/β̂₂. Apply the delta method to calculate the estimated standard error for estimated RTS. Directly test H₀: RTS = 1 versus H_A: RTS ≠ 1, rather than testing H₀: β₂ = 1 versus H_A: β₂ ≠ 1. Comment on the results.

(3) Perform a Monte Carlo study that generates data from the model

y = β₁ + x₂ + x₃ + ε,

where the sample size is 30, x₂ and x₃ are independently uniformly distributed, and ε ∼ IIN(0, 1).
(a) Compare the means and standard errors of the estimated coefficients using OLS and restricted OLS, imposing the restriction that β₂ + β₃ = 2.

(b) Compare the means and standard errors of the estimated coefficients using OLS and restricted OLS, imposing the restriction that β₂ + β₃ = 1.
(c) Discuss the results.

(4) Get the Octave scripts bootstrap_example1.m, bootstrap.m, bootstrap_resample_iid.m and myols.m, figure out what they do, run them, and interpret the results.

CHAPTER 7

Generalized least squares

One of the assumptions we've made up to now is that

ε_t ∼ IID(0, σ²),

or occasionally

ε_t ∼ IIN(0, σ²).

Now we'll investigate the consequences of nonidentically and/or dependently distributed errors. We'll assume fixed regressors for now, relaxing this admittedly unrealistic assumption later. The model is

y = Xβ + ε,  E(ε) = 0,  V(ε) = Σ,

where Σ is a general symmetric positive definite matrix (we'll write β in place of β₀ to simplify the typing of these notes).

- The case where Σ is a diagonal matrix gives uncorrelated, nonidentically distributed errors. This is known as heteroscedasticity.
- The case where Σ has the same number on the main diagonal but nonzero elements off the main diagonal gives identically (assuming higher moments are also the same) dependently distributed errors. This is known as autocorrelation.

- The general case combines heteroscedasticity and autocorrelation. This is known as "nonspherical" disturbances, though why this term is used, I have no idea. Perhaps it's because, under the classical assumptions, a joint confidence region for ε would be an n-dimensional hypersphere.

7.1. Effects of nonspherical disturbances on the OLS estimator

The least squares estimator is

β̂ = (X'X)⁻¹X'y = β + (X'X)⁻¹X'ε.

We have unbiasedness, as before. The variance of β̂ is

(7.1.1)   E[(β̂ − β)(β̂ − β)'] = E[(X'X)⁻¹X'εε'X(X'X)⁻¹] = (X'X)⁻¹X'ΣX(X'X)⁻¹.

Due to this, any test statistic that is based upon σ̂² or the probability limit of σ̂² is invalid. In particular, the formulas for the t, F, and χ² based tests given above do not lead to statistics with these distributions.

- β̂ is still consistent, following exactly the same argument given before.
- If ε is normally distributed, then β̂ ∼ N(β, (X'X)⁻¹X'ΣX(X'X)⁻¹). The problem is that Σ is unknown in general, so this distribution won't be useful for testing hypotheses.

Without normality, and unconditional on X, we still have

√n(β̂ − β) = √n(X'X)⁻¹X'ε = (X'X/n)⁻¹ n^{−1/2} X'ε.

Define the limiting variance of n^{−1/2}X'ε (supposing a CLT applies) as

lim_{n→∞} E(X'εε'X/n) = Ω,

so we obtain √n(β̂ − β) →d N(0, Q_X⁻¹ Ω Q_X⁻¹).

Summary: OLS with heteroscedasticity and/or autocorrelation is:

- unbiased in the same circumstances in which the estimator is unbiased with iid errors;
- has a different variance than before, so the previous test statistics aren't valid;
- consistent;
- asymptotically normally distributed, but with a different limiting covariance matrix. Previous test statistics aren't valid in this case for this reason;
- inefficient, as is shown below.

7.2. The GLS estimator

Suppose Σ were known. Then one could form the Cholesky decomposition

P'P = Σ⁻¹.

Consider the model

Py = PXβ + Pε,

or, making the obvious definitions,

y* = X*β + ε*.

The variance of ε* = Pε is E(Pεε'P') = PΣP' = I_n (to see this, note that Σ = (P'P)⁻¹ = P⁻¹(P')⁻¹, so PΣP' = PP⁻¹(P')⁻¹P' = I_n). Therefore, the model

y* = X*β + ε*,  E(ε*) = 0,  V(ε*) = I_n

satisfies the classical assumptions. The GLS estimator is simply OLS applied to the transformed model:

β̂_GLS = (X*'X*)⁻¹X*'y* = (X'P'PX)⁻¹X'P'Py = (X'Σ⁻¹X)⁻¹X'Σ⁻¹y.

The GLS estimator is unbiased in the same circumstances under which the OLS estimator is unbiased. For example, assuming X is nonstochastic,

E(β̂_GLS) = E[(X'Σ⁻¹X)⁻¹X'Σ⁻¹y] = E[β + (X'Σ⁻¹X)⁻¹X'Σ⁻¹ε] = β.

The variance of the estimator, conditional on X, can be calculated using

β̂_GLS = (X*'X*)⁻¹X*'y* = β + (X*'X*)⁻¹X*'ε*,

so

E[(β̂_GLS − β)(β̂_GLS − β)'] = E[(X*'X*)⁻¹X*'ε*ε*'X*(X*'X*)⁻¹] = (X*'X*)⁻¹ = (X'Σ⁻¹X)⁻¹.

Either of these last formulas can be used.

- All the previous results regarding the desirable properties of the least squares estimator hold, when dealing with the transformed model, since the transformed model satisfies the classical assumptions.
- Tests are valid, using the previous formulas, as long as we substitute X* in place of X. Furthermore, any test that involves σ² can set it to 1. This is preferable to re-deriving the appropriate formulas.
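For the special case of a diagonal Σ, the equivalence between the GLS formula and OLS on the transformed model can be checked numerically. A minimal single-regressor sketch in Python (the data are hypothetical):

```python
def gls_diagonal(y, x, sig2):
    # GLS slope for y_t = beta*x_t + eps_t with V(eps_t) = sig2[t]:
    # minimizes sum_t (y_t - beta*x_t)^2 / sig2[t].
    num = sum(xt * yt / s for xt, yt, s in zip(x, y, sig2))
    den = sum(xt * xt / s for xt, s in zip(x, sig2))
    return num / den

def ols_transformed(y, x, sig2):
    # OLS after dividing each observation by its standard deviation.
    ys = [yt / s ** 0.5 for yt, s in zip(y, sig2)]
    xs = [xt / s ** 0.5 for xt, s in zip(x, sig2)]
    return sum(a * b for a, b in zip(xs, ys)) / sum(a * a for a in xs)
```

Both functions return the identical estimate, which is the point of the transformation.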

The GLS estimator is more efficient than the OLS estimator. This is a consequence of the Gauss-Markov theorem, since the GLS estimator is based on a model that satisfies the classical assumptions but the OLS estimator is not. To see this directly, note that

Var(β̂) − Var(β̂_GLS) = (X'X)⁻¹X'ΣX(X'X)⁻¹ − (X'Σ⁻¹X)⁻¹ = AΣA',

where A = (X'X)⁻¹X' − (X'Σ⁻¹X)⁻¹X'Σ⁻¹. Noting that this is a quadratic form in a positive definite matrix, we conclude that the difference is positive semi-definite. This may not seem obvious, but it is true, as you can verify for yourself. This shows that GLS is efficient relative to OLS, not that GLS is efficient in an absolute sense (the following needs to be completed).

- As one can verify by calculating first order conditions, the GLS estimator is the solution to the minimization problem

β̂_GLS = argmin_β (y − Xβ)'Σ⁻¹(y − Xβ),

so the metric Σ⁻¹ is used to weight the residuals.

7.3. Feasible GLS

The problem is that Σ isn't known usually, so this estimator isn't available.

- Consider the dimension of Σ: it's an n×n matrix with (n² − n)/2 + n = (n² + n)/2 unique elements.

- The number of parameters to estimate is larger than n and increases faster than n. There's no way to devise an estimator that satisfies a LLN without adding restrictions.
- The feasible GLS estimator is based upon making sufficient assumptions regarding the form of Σ so that a consistent estimator can be devised.

Suppose that we parameterize Σ as a function of X and θ, so that

Σ = Σ(X, θ),

where θ may include β as well as other parameters, and θ is of fixed dimension. If we can consistently estimate θ, we can consistently estimate Σ, as long as Σ(X, θ) is a continuous function of θ (by the Slutsky theorem). In this case,

Σ̂ = Σ(X, θ̂) →p Σ(X, θ).

If we replace Σ in the formulas for the GLS estimator with Σ̂, we obtain the FGLS estimator. The FGLS estimator shares the same asymptotic properties as GLS. These are:

(1) Consistency.
(2) Asymptotic normality.
(3) Asymptotic efficiency if the errors are normally distributed (Cramér-Rao).
(4) Test procedures are asymptotically valid.

In practice, the usual way to proceed is:

(1) Define a consistent estimator of θ. This is a case-by-case proposition, depending on the parameterization Σ(θ). We'll see examples below.

(2) Form Σ̂ = Σ(X, θ̂).
(3) Calculate the Cholesky factorization P̂ = Chol(Σ̂⁻¹).
(4) Transform the model using P̂y = P̂Xβ + P̂ε.
(5) Estimate using OLS on the transformed model.

7.4. Heteroscedasticity

Heteroscedasticity is the case where

E(εε') = Σ

is a diagonal matrix, so that the errors are uncorrelated, but have different variances. Heteroscedasticity is usually thought of as associated with cross sectional data, though there is absolutely no reason why time series data cannot also be heteroscedastic. Actually, the popular ARCH (autoregressive conditionally heteroscedastic) models explicitly assume that a time series is heteroscedastic.

Consider a supply function

q_i = β₁ + β_p P_i + β_s S_i + ε_i,

where P_i is price and S_i is some measure of size of the ith firm. One might suppose that unobservable factors (e.g., talent of managers, degree of coordination between production units, etc.) account for the error term ε_i. If there is more variability in these factors for large firms than for small firms, then ε_i may have a higher variance when S_i is high than when it is low.

Another example, individual demand:

q_i = β₁ + β_p P_i + β_m M_i + ε_i,

where P is price and M is income. In this case, ε_i can reflect variations in preferences. There are more possibilities for expression of preferences when one is rich, so it is possible that the variance of ε_i could be higher when M is high.

Add example of group means.

7.4.1. OLS with heteroscedastic consistent varcov estimation. Eicker (1967) and White (1980) showed how to modify test statistics to account for heteroscedasticity of unknown form. The OLS estimator has asymptotic distribution

√n(β̂ − β) →d N(0, Q_X⁻¹ Ω Q_X⁻¹),

as we've already seen. Recall that we defined

lim_{n→∞} E(X'εε'X/n) = Ω.

This matrix has dimension K×K and can be consistently estimated, even if we can't estimate Σ consistently. The consistent estimator, under heteroscedasticity but no autocorrelation, is

Ω̂ = (1/n) Σ_t x_t x_t' ε̂_t².

One can then modify the previous test statistics to obtain tests that are valid when there is heteroscedasticity of unknown form.
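With a single regressor and no intercept, the sandwich calculation reduces to scalars, which makes it easy to sketch. A minimal Python version (the data in the test of the idea are hypothetical):

```python
def hc_slope_variance(x, resid):
    # Eicker-White variance estimate for the OLS slope in a no-intercept,
    # single-regressor model: (X'X)^-1 * (sum_t x_t^2 e_t^2) * (X'X)^-1.
    sxx = sum(xt * xt for xt in x)
    meat = sum((xt * et) ** 2 for xt, et in zip(x, resid))
    return meat / (sxx * sxx)
```

With K regressors the same formula holds with (X'X)⁻¹ a matrix and the middle term a sum of outer products; only the bookkeeping changes.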

For example, the Wald test for H₀: Rβ − r = 0 becomes

n(Rβ̂ − r)' [R (X'X/n)⁻¹ Ω̂ (X'X/n)⁻¹ R']⁻¹ (Rβ̂ − r) →d χ²(q),

where q is the number of restrictions.

7.4.2. Detection. There exist many tests for the presence of heteroscedasticity. We'll discuss three methods.

Goldfeld-Quandt. The sample is divided in to three parts, with n₁, n₂ and n₃ observations, where n = n₁ + n₂ + n₃. The model is estimated using the first and third parts of the sample, separately, so that β̂¹ and β̂³ will be independent. Then we have

ε̂¹'ε̂¹/σ² = ε¹'M¹ε¹/σ² →d χ²(n₁ − K)

and

ε̂³'ε̂³/σ² = ε³'M³ε³/σ² →d χ²(n₃ − K),

so

[ε̂¹'ε̂¹/(n₁ − K)] / [ε̂³'ε̂³/(n₃ − K)] →d F(n₁ − K, n₃ − K).

The distributional result is exact if the errors are normally distributed. This test is a two-tailed test. Alternatively, and probably more conventionally, if one has prior ideas about the possible magnitudes of the variances of the observations, one could order the observations accordingly, from largest to smallest. In this case, one would use a conventional one-tailed F-test. Draw picture.

- Ordering the observations is an important step if the test is to have any power.
- The motive for dropping the middle observations is to increase the difference between the average variance in the subsamples, supposing that there exists heteroscedasticity. This can increase the power of the test.
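Given the two residual vectors, the statistic is a simple variance ratio. A minimal Python sketch (k is the number of regressors; the residual vectors here are hypothetical):

```python
def goldfeld_quandt(resid_first, resid_third, k):
    # Ratio of the two variance estimates. Under homoscedastic normal errors
    # this is distributed F(n1 - k, n3 - k).
    s2_first = sum(e * e for e in resid_first) / (len(resid_first) - k)
    s2_third = sum(e * e for e in resid_third) / (len(resid_third) - k)
    return s2_first / s2_third
```

A large (or, for the two-tailed version, also a very small) ratio is evidence against homoscedasticity.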

On the other hand, dropping too many observations will substantially increase the variance of the statistics ε̂¹'ε̂¹ and ε̂³'ε̂³. A rule of thumb, based on Monte Carlo experiments, is to drop around 25% of the observations.

- If one doesn't have any ideas about the form of the heteroscedasticity, the test will probably have low power, since a sensible data ordering isn't available.

White's test. When one has little idea if there exists heteroscedasticity, and no idea of its potential form, the White test is a possibility. The idea is that if there is homoscedasticity, then

E(ε_t² | x_t) = σ², ∀t,

so that x_t or functions of x_t shouldn't help to explain E(ε_t²). The test works as follows:

(1) Since ε_t isn't available, use the consistent estimator ε̂_t instead.
(2) Regress

ε̂_t² = σ² + z_t'γ + v_t,

where z_t is a P-vector. z_t may include some or all of the variables in x_t, as well as other variables. White's original suggestion was to use x_t, plus the set of all unique squares and cross products of variables in x_t.
(3) Test the hypothesis that γ = 0.
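A one-variable sketch of the auxiliary regression and the asymptotically equivalent nR² form of the statistic, in Python (in practice z_t is a vector with P elements; the data in the test are hypothetical):

```python
def white_nr2(e2, z):
    # Regress squared residuals on a constant and a single z_t, then return
    # n*R^2, which is asymptotically chi^2(1) under homoscedasticity.
    n = len(e2)
    zbar = sum(z) / n
    ebar = sum(e2) / n
    szz = sum((zt - zbar) ** 2 for zt in z)
    sze = sum((zt - zbar) * (et - ebar) for zt, et in zip(z, e2))
    see = sum((et - ebar) ** 2 for et in e2)
    if szz == 0.0 or see == 0.0:
        return 0.0  # degenerate regression: no evidence of heteroscedasticity
    r2 = (sze * sze) / (szz * see)
    return n * r2
```

With P auxiliary regressors the statistic is compared to χ²(P) critical values.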

The qF statistic in this case is

qF = (ESS_R − ESS_U) / [ESS_U/(n − P − 1)].

Note that ESS_R = TSS_U, so dividing both numerator and denominator by this we get

qF = (n − P − 1) R²/(1 − R²).

Note that this is the R² of the artificial regression used to test for heteroscedasticity, not the R² of the original model. An asymptotically equivalent statistic, under the null of no heteroscedasticity (so that R² should tend to zero), is

nR² ∼a χ²(P).

This doesn't require normality of the errors, though it does assume that the fourth moment of ε_t is constant, under the null. Question: why is this necessary?

- The White test has the disadvantage that it may not be very powerful unless the z_t vector is chosen well, and this is hard to do without knowledge of the form of heteroscedasticity.
- It also has the problem that specification errors other than heteroscedasticity may lead to rejection.
- Note: the null hypothesis of this test may be interpreted as θ = 0 for the variance model V(ε_t²) = h(z_t'θ), where h(·) is an arbitrary function of unknown form. The test is more general than it may appear from the regression that is used.

Plotting the residuals. A very simple method is to simply plot the residuals (or their squares). Draw pictures here. Like the Goldfeld-Quandt test, this will

be more informative if the observations are ordered according to the suspected form of the heteroscedasticity.

7.4.3. Correction. Correcting for heteroscedasticity requires that a parametric form for Σ(θ) be supplied, and that a means for estimating θ consistently be determined. The estimation method will be specific to the parametric form supplied for Σ(θ). We'll consider two examples. Before this, let's consider the general nature of GLS when there is heteroscedasticity. Since Σ is diagonal with tth element σ_t², the transformation amounts to dividing observation t by the standard deviation:

y_t/σ_t = (x_t'β)/σ_t + ε_t/σ_t,

which gives a homoscedastic model.

Multiplicative heteroscedasticity. Suppose the model is

y_t = x_t'β + ε_t,
σ_t² = E(ε_t²) = (z_t'γ)^δ,

but the other classical assumptions hold. In this case,

ε_t² = (z_t'γ)^δ + v_t,

and v_t has mean zero. Nonlinear least squares could be used to estimate γ and δ consistently, were ε_t observable. The solution is to substitute the squared OLS residuals ε̂_t² in place of ε_t², since ε̂_t² is consistent by the Slutsky theorem. Once we have γ̂ and δ̂, we can estimate σ_t² consistently using

σ̂_t² = (z_t'γ̂)^δ̂ →p σ_t².

In the second step, we transform the model by dividing by the estimated standard deviation:

y_t/σ̂_t = (x_t'β)/σ̂_t + ε_t/σ̂_t.

Asymptotically, this model satisfies the classical assumptions.

- This model is a bit complex in that NLS is required to estimate the model of the variance. A simpler version would be

y_t = x_t'β + ε_t,
σ_t² = E(ε_t²) = σ² z_t^δ,

where z_t is a single variable. There are still two parameters to be estimated, and the model of the variance is still nonlinear in the parameters. However, the search method can be used in this case to reduce the estimation problem to repeated applications of OLS.
- First, we define an interval of reasonable values for δ, e.g., δ ∈ [0, 3].
- Partition this interval into M equally spaced values, e.g., {0, .1, .2, ..., 2.9, 3}.
- For each of these values, calculate the variable z_t^{δ_m}. The regression

ε̂_t² = σ² z_t^{δ_m} + v_t

is linear in the parameters, conditional on δ_m, so one can estimate σ² by OLS.
- Save the pairs (σ²_m, δ_m), and the corresponding ESS_m. Choose the pair with the minimum ESS_m as the estimate.
- Next, divide the model by the estimated standard deviations.
- Can refine. Draw picture.
- Works well when the parameter to be searched over is low dimensional, as in this case.
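The steps above can be sketched in a few lines of Python rather than Octave (the residuals and grid in the test are hypothetical):

```python
def search_variance_model(e2, z, deltas):
    # For each candidate delta, the regression e2_t = s2 * z_t**delta + v_t is
    # linear in s2, so s2 comes from no-intercept OLS; keep the minimum-ESS pair.
    best = None
    for d in deltas:
        w = [zt ** d for zt in z]
        s2 = sum(wt * et for wt, et in zip(w, e2)) / sum(wt * wt for wt in w)
        ess = sum((et - s2 * wt) ** 2 for et, wt in zip(e2, w))
        if best is None or ess < best[2]:
            best = (d, s2, ess)
    return best  # (delta, sigma^2, ESS) at the minimum
```

The grid can then be refined around the winning value, exactly as the text suggests.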


Groupwise heteroscedasticity. A common case is where we have repeated observations on each of a number of economic agents: e.g., 10 years of macroeconomic data on each of a set of countries or regions, or daily observations of transactions of 200 banks. This sort of data is a pooled cross-section time-series model. It may be reasonable to presume that the variance is constant over time within the cross-sectional units, but that it differs across them (e.g., firms or countries of different sizes...). The model is

y_it = x_it'β + ε_it,
E(ε_it²) = σ_i², ∀t,

where i = 1, 2, ..., G are the agents, and t = 1, 2, ..., n are the observations on each agent.

- The other classical assumptions are presumed to hold.
- In this case, the variance σ_i² is specific to each agent, but constant over the n observations for that agent.
- In this model, we assume that E(ε_it ε_is) = 0, t ≠ s. This is a strong assumption that we'll relax later.

To correct for heteroscedasticity, just estimate each σ_i² using the natural estimator:

σ̂_i² = (1/n) Σ_{t=1}^n ε̂_it².

- Note that we use 1/n here since it's possible that there are more than n regressors, so n − K could be negative. Asymptotically the difference is unimportant.

[Figure 7.4.1: Residuals, Nerlove model, sorted by firm size. The regression residuals (roughly −1.5 to 1.5) are plotted against observation number (0 to 160).]
With each of these, transform the model as usual:

y_it/σ̂_i = (x_it'β)/σ̂_i + ε_it/σ̂_i.

Do this for each cross-sectional group. This transformed model satisﬁes the classical assumptions, asymptotically.
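The correction can be sketched in Python (the residuals and group labels below are hypothetical):

```python
def group_variances(resid, group):
    # Natural estimator: within-group mean of squared residuals (1/n_i rather
    # than 1/(n_i - K); the difference is asymptotically unimportant).
    sums, counts = {}, {}
    for e, g in zip(resid, group):
        sums[g] = sums.get(g, 0.0) + e * e
        counts[g] = counts.get(g, 0) + 1
    return {g: sums[g] / counts[g] for g in sums}

def transform_by_group(v, sig2, group):
    # Divide each observation by its group's estimated standard deviation.
    return [vi / sig2[g] ** 0.5 for vi, g in zip(v, group)]
```

Applying `transform_by_group` to y and to each column of X, then running OLS, gives the groupwise FGLS estimator.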

7.4.4. Example: the Nerlove model (again!). Let's check the Nerlove data for evidence of heteroscedasticity. In what follows, we're going to use the model with the constant and output coefficient varying across 5 groups, but with the input price coefficients fixed (see Equation 6.8.3 for the rationale behind this). Figure 7.4.1, which is generated by the Octave program GLS/NerloveResiduals.m, plots the residuals. We can see pretty clearly that the error variance is larger for small firms than for larger firms.



Now let's try out some tests to formally check for heteroscedasticity. The Octave program GLS/HetTests.m performs the White and Goldfeld-Quandt tests, using the above model. The results are

              Value    p-value
White's test  61.903    0.000
GQ test       10.886    0.000

All in all, it is very clear that the data are heteroscedastic. That means that OLS estimation is not efficient, and tests of restrictions that ignore heteroscedasticity are not valid. The previous tests (CRTS, HOD1 and the Chow test) were calculated assuming homoscedasticity. The Octave program GLS/NerloveRestrictions-Het.m uses the Wald test to check for CRTS and HOD1, but using a heteroscedastic-consistent covariance estimator.¹ The results are

Testing HOD1
           Value   p-value
Wald test  6.161    0.013

Testing CRTS
           Value   p-value
Wald test  20.169   0.001


By the way, notice that GLS/NerloveResiduals.m and GLS/HetTests.m use the restricted LS estimator directly to restrict the fully general model with all coefﬁcients varying to the model with only the constant and the output coefﬁcient varying. But GLS/NerloveRestrictions-Het.m estimates the model by substituting the restrictions into the model. The methods are equivalent, but the second is more convenient and easier to understand.


We see that the previous conclusions are altered - both CRTS and HOD1 are rejected at the 5% level. Maybe the rejection of HOD1 is due to the Wald test's tendency to over-reject?

From the previous plot, it seems that the variance of ε is a decreasing function of output. Suppose that the 5 size groups have different error variances (heteroscedasticity by groups):

Var(ε_i) = σ_j²,

where j = 1 if i = 1, 2, ..., 29, etc., as before. The Octave program GLS/NerloveGLS.m estimates the model using GLS (through a transformation of the model so that OLS can be applied). The estimation results are

*********************************************************
OLS estimation results
Observations 145
R-squared 0.958822
Sigma-squared 0.090800

Results (Het. consistent var-cov estimator)

           estimate   st.err.   t-stat.   p-value
constant1    -1.046     1.276    -0.820     0.414
constant2    -1.977     1.364    -1.450     0.149
constant3    -3.616     1.656    -2.184     0.031
constant4    -4.052     1.462    -2.771     0.006
constant5    -5.308     1.586    -3.346     0.001
output1       0.391     0.090     4.363     0.000
output2       0.649     0.090     7.184     0.000
output3       0.897     0.134     6.688     0.000
output4       0.962     0.112     8.612     0.000
output5       1.101     0.090    12.237     0.000
labor         0.007     0.208     0.032     0.975
fuel          0.498     0.081     6.149     0.000
capital      -0.460     0.253    -1.818     0.071

*********************************************************

*********************************************************
OLS estimation results
Observations 145
R-squared 0.987429
Sigma-squared 1.092393

Results (Het. consistent var-cov estimator)

           estimate   st.err.   t-stat.   p-value
constant1    -1.580     0.917    -1.723     0.087
constant2    -2.497     0.988    -2.528     0.013
constant3    -4.108     1.327    -3.097     0.002
constant4    -4.494     1.180    -3.808     0.000
constant5    -5.765     1.274    -4.525     0.000
output1       0.392     0.090     4.346     0.000
output2       0.648     0.094     6.917     0.000
output3       0.892     0.138     6.474     0.000
output4       0.951     0.109     8.755     0.000
output5       1.093     0.086    12.684     0.000
labor         0.103     0.141     0.733     0.465
fuel          0.492     0.044    11.294     0.000
capital      -0.366     0.165    -2.217     0.028

*********************************************************

Testing HOD1
           Value   p-value
Wald test  9.312    0.002

The first panel of output are the OLS estimation results, which are used to consistently estimate the σ_j². The second panel of results are the GLS estimation results. Some comments:

- The R² measures are not comparable - the dependent variables are not the same. The measure for the GLS results uses the transformed dependent variable. One could calculate a comparable R² measure, but I have not done so.
- The differences in estimated standard errors (smaller in general for GLS) can be interpreted as evidence of improved efficiency of GLS, since the OLS standard errors are calculated using the Huber-White estimator. They would not be comparable if the ordinary (inconsistent) estimator had been used.


- Note that the previously noted pattern in the output coefficients persists. The nonconstant CRTS result is robust.
- The coefficient on capital is now negative and significant at the 3% level. That seems to indicate some kind of problem with the model or the data, or economic theory.
- Note that HOD1 is now rejected. Problem of Wald test over-rejecting? Specification error in model?

7.5. Autocorrelation

Autocorrelation, which is the serial correlation of the error term, is a problem that is usually associated with time series data, but also can affect cross-sectional data. For example, a shock to oil prices will simultaneously affect all countries, so one could expect contemporaneous correlation of macroeconomic variables across countries.

7.5.1. Causes. Autocorrelation is the existence of correlation across the error term:

E(ε_t ε_s) ≠ 0, t ≠ s.

Why might this occur? Plausible explanations include:

(1) Lags in adjustment to shocks. In a model such as

y_t = x_t'β + ε_t,

one could interpret x_t'β as the equilibrium value. Suppose x_t is constant over a number of observations. One can interpret ε_t as a shock that moves the system away from equilibrium. If the time needed to return to equilibrium is long with respect to the observation frequency, one could expect ε_{t+1} to be positive, conditional on ε_t positive, which induces a correlation.


(2) Unobserved factors that are correlated over time. The error term is often assumed to correspond to unobservable factors. If these factors are correlated, there will be autocorrelation.

(3) Misspecification of the model. Suppose that the DGP is

y_t = β₀ + β₁x_t + β₂x_t² + ε_t,

but we estimate

y_t = β₀ + β₁x_t + ε_t.

The effects are illustrated in Figure 7.5.1.

7.5.2. Effects on the OLS estimator. The variance of the OLS estimator is the same as in the case of heteroscedasticity - the standard formula does not apply. The correct formula is given in equation 7.1.1. Next we discuss two GLS corrections for OLS. These will potentially induce inconsistency when the regressors are stochastic (see Chapter 8) and should either not be used in that case (which is usually the relevant case) or used with caution. The more recommended procedure is discussed in section 7.5.5.

7.5.3. AR(1). There are many types of autocorrelation. We'll consider two examples. The first is the most commonly encountered case: autoregressive


[Figure 7.5.1: Autocorrelation induced by misspecification.]

order 1, or AR(1), errors. The model is

y_t = x_t'β + ε_t,
ε_t = ρε_{t−1} + u_t,
u_t ∼ iid(0, σ_u²),
E(ε_t u_s) = 0, t < s.

We assume that the model satisfies the other classical assumptions.

- We need a stationarity assumption: |ρ| < 1. Otherwise the variance of ε_t explodes as t increases, so standard asymptotics will not apply.

- By recursive substitution we obtain

ε_t = ρε_{t−1} + u_t
    = ρ(ρε_{t−2} + u_{t−1}) + u_t
    = ρ²ε_{t−2} + ρu_{t−1} + u_t
    = ρ²(ρε_{t−3} + u_{t−2}) + ρu_{t−1} + u_t.

In the limit the lagged ε drops out, since |ρ| < 1, so we obtain

ε_t = Σ_{m=0}^∞ ρ^m u_{t−m}.

With this, the variance of ε_t is found as

E(ε_t²) = σ_u² Σ_{m=0}^∞ ρ^{2m} = σ_u²/(1 − ρ²).

- If we had directly assumed that ε_t were covariance stationary, we could obtain this using

V(ε_t) = ρ² E(ε_{t−1}²) + 2ρ E(ε_{t−1} u_t) + E(u_t²) = ρ² V(ε_t) + σ_u²,

so V(ε_t) = σ_u²/(1 − ρ²).

- The variance is the 0th order autocovariance: γ₀ = V(ε_t). Note that the variance does not depend on t.

Likewise, the first order autocovariance γ₁ is

Cov(ε_t, ε_{t−1}) = γ₁ = E[(ρε_{t−1} + u_t) ε_{t−1}] = ρ V(ε_t) = ρσ_u²/(1 − ρ²).

Using the same method, we find that for s < t,

γ_s = Cov(ε_t, ε_{t−s}) = ρ^s σ_u²/(1 − ρ²).

The autocovariances don't depend on t: the process {ε_t} is covariance stationary. The correlation (in general, for r.v.'s x and y) is defined as

corr(x, y) = cov(x, y)/[se(x) se(y)],

but in this case, the two standard errors are the same, so the s-order autocorrelation is ρ_s = ρ^s. All this means that the overall matrix Σ has the form

Σ = [σ_u²/(1 − ρ²)] ×
    [ 1        ρ        ρ²   ...  ρ^{n−1} ]
    [ ρ        1        ρ    ...          ]
    [ ...                     ...         ]
    [ ρ^{n−1}  ...                  1     ]

where the scalar factor is the variance and the matrix is the correlation matrix.

So we have homoscedasticity, but elements off the main diagonal are not zero. All of this depends only on two parameters, ρ and σ_u². If we can estimate these consistently, we can apply FGLS. It turns out that it's easy to estimate these consistently. The steps are:

(1) Estimate the model y_t = x_t'β + ε_t by OLS.
(2) Take the residuals, and estimate the model

ε̂_t = ρε̂_{t−1} + v_t.

Since ε̂_t →p ε_t, this regression is asymptotically equivalent to the regression

ε_t = ρε_{t−1} + u_t,

which satisfies the classical assumptions. Therefore, ρ̂ obtained by applying OLS to ε̂_t = ρε̂_{t−1} + v_t is consistent. Also, since v_t →p u_t, the estimator

σ̂_u² = (1/n) Σ_{t=2}^n v̂_t² →p σ_u².

(3) With the consistent estimators σ̂_u² and ρ̂, form Σ̂ = Σ(σ̂_u², ρ̂) using the previous structure of Σ, and estimate by FGLS. Actually, one can omit the factor σ̂_u²/(1 − ρ̂²), since it cancels out in the formula

β̂_FGLS = (X'Σ̂⁻¹X)⁻¹ X'Σ̂⁻¹y.

- One can iterate the process, by taking the first FGLS estimator of β, re-estimating ρ and σ_u², etc. If one iterates to convergence, it's equivalent to MLE (supposing normal errors).
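Step (2) and the quasi-differencing it leads to can be sketched in Python (the residual series below is hypothetical):

```python
def ar1_rho(e):
    # Step (2): no-intercept OLS of the residual on its own lag.
    num = sum(e[t] * e[t - 1] for t in range(1, len(e)))
    den = sum(e[t - 1] ** 2 for t in range(1, len(e)))
    return num / den

def quasi_difference(v, rho):
    # Transform a series as v_t - rho * v_{t-1}, dropping the first observation.
    return [v[t] - rho * v[t - 1] for t in range(1, len(v))]
```

Applying `quasi_difference` to y and to each regressor, then running OLS, gives the transformed-model estimator discussed next.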

- An asymptotically equivalent approach is to simply estimate the transformed model

y_t − ρ̂y_{t−1} = (x_t − ρ̂x_{t−1})'β + u_t*,

using n − 1 observations (since y₀ and x₀ aren't available). This is the method of Cochrane and Orcutt. Dropping the first observation is asymptotically irrelevant, but it can be very important in small samples. One can recuperate the first observation by putting

y₁* = y₁ √(1 − ρ̂²),
x₁* = x₁ √(1 − ρ̂²).

This somewhat odd-looking result is related to the Cholesky factorization of Σ⁻¹. See Davidson and MacKinnon, pg. 348-49, for more discussion. Note that the variance of y₁* is σ_u², asymptotically, so we see that the transformed model will be homoscedastic (and nonautocorrelated, since the u's are uncorrelated with the y's, in different time periods).

7.5.4. MA(1). The linear regression model with moving average order 1 errors is

y_t = x_t'β + ε_t,
ε_t = u_t + φu_{t−1},
u_t ∼ iid(0, σ_u²),
E(ε_t u_s) = 0, t < s.

In this case,

γ₀ = V(ε_t) = E[(u_t + φu_{t−1})²] = σ_u² + φ²σ_u² = σ_u²(1 + φ²).

Similarly,

γ₁ = E[(u_t + φu_{t−1})(u_{t−1} + φu_{t−2})] = φσ_u²,

and

γ₂ = E[(u_t + φu_{t−1})(u_{t−2} + φu_{t−3})] = 0,

and likewise for higher-order autocovariances, so in this case

Σ = σ_u² ×
    [ 1+φ²   φ      0    ...   0    ]
    [ φ      1+φ²   φ    ...        ]
    [ 0      φ      1+φ² ...        ]
    [ ...                      φ    ]
    [ 0      ...          φ    1+φ² ]

Note that the first order autocorrelation is

ρ₁ = φσ_u² / [σ_u²(1 + φ²)] = γ₁/γ₀ = φ/(1 + φ²).

This achieves a maximum at φ = 1 and a minimum at φ = −1, and the maximal and minimal autocorrelations are 1/2 and −1/2. Therefore, series that are more strongly autocorrelated can't be MA(1) processes. Again, the covariance matrix has a simple structure that depends on only two parameters. The problem in this case is that one can't estimate φ using OLS on

ε̂_t = u_t + φu_{t−1},

because the u_t are unobservable and they can't be estimated consistently. However, there is a simple way to estimate the parameters.

- Since the model is homoscedastic, we can estimate

V(ε_t) = σ_ε² = σ_u²(1 + φ²)

using the typical estimator σ̂_ε² = (1/n) Σ_t ε̂_t².
- By the Slutsky theorem, we can interpret this as defining an (unidentified) estimator of both σ_u² and φ, e.g., use this as

σ̂_u²(1 + φ̂²) = (1/n) Σ_t ε̂_t².

However, this isn't sufficient to define consistent estimators of the parameters, since it's unidentified.
- To solve this problem, estimate the covariance of ε_t and ε_{t−1} using

Cov-hat(ε_t, ε_{t−1}) = φ̂σ̂_u² = (1/n) Σ_{t=2}^n ε̂_t ε̂_{t−1}.

This is a consistent estimator, following a LLN (and given that the epsilon hats are consistent for the epsilons). As above, this can be interpreted as defining an unidentified estimator.
- Now solve these two equations to obtain identified (and therefore consistent) estimators of both φ and σ_u². Define the consistent estimator

Σ̂ = Σ(φ̂, σ̂_u²)

following the form we've seen above, and transform the model using the Cholesky decomposition. The transformed model satisfies the classical assumptions asymptotically.

7.5.5. Asymptotically valid inferences with autocorrelation of unknown form. See Hamilton Ch. 10, pp. 261-2 and 280-84. When the form of autocorrelation is unknown, one may decide to use the OLS estimator, without correction. We've seen that this estimator has the limiting distribution

√n(β̂ − β) →d N(0, Q_X⁻¹ Ω Q_X⁻¹),

where, as before, Ω = lim_{n→∞} E(X'εε'X/n).

We need a consistent estimate of Ω. Define

m_t = x_t ε_t

(recall that x_t is defined as a K×1 vector). Note that

X'ε = Σ_{t=1}^n m_t,

so that

Ω = lim_{n→∞} (1/n) E[(Σ_t m_t)(Σ_t m_t')].

We assume that m_t is covariance stationary (so that the covariance between m_t and m_{t−s} does not depend on t). Define the v-th autocovariance of m_t as

Γ_v = E(m_t m_{t−v}').

Note that E(m_t m_{t+v}') = Γ_v' (show this with an example). In general, we expect that:

- m_t will be autocorrelated, since ε_t is potentially autocorrelated:

Γ_v = E(m_t m_{t−v}') ≠ 0.

Note that this autocovariance does not depend on t, due to covariance stationarity.

Also, the elements of $m_t = x_t\varepsilon_t$ will in general be heteroscedastic and contemporaneously correlated, since the regressors will have different variances and will in general be correlated (more on this later). We have (show that the following is true, by expanding the sum and shifting rows to the left):
$$\Omega = \lim_{n\to\infty}\left[\Gamma_0 + \frac{n-1}{n}\left(\Gamma_1+\Gamma_1'\right) + \frac{n-2}{n}\left(\Gamma_2+\Gamma_2'\right) + \cdots + \frac{1}{n}\left(\Gamma_{n-1}+\Gamma_{n-1}'\right)\right].$$
While one could estimate $\Omega$ parametrically, we in general have little information upon which to base a parametric specification. Recent research has focused on consistent nonparametric estimators of $\Omega$. Now define
$$\Omega_n = \Gamma_0 + \frac{n-1}{n}\left(\Gamma_1+\Gamma_1'\right) + \cdots + \frac{1}{n}\left(\Gamma_{n-1}+\Gamma_{n-1}'\right).$$
The natural, consistent estimator of $\Gamma_v$ is
$$\hat\Gamma_v = \frac{1}{n}\sum_{t=v+1}^{n}\hat m_t \hat m_{t-v}',$$
where $\hat m_t = x_t\hat\varepsilon_t$ (note: one could put $\frac{1}{n-v}$ instead of $\frac{1}{n}$ here). This is consistent, following a LLN (and given that the $\hat\varepsilon_t$ are consistent for the $\varepsilon_t$). So, a natural, but inconsistent, estimator of $\Omega$ would be
$$\hat\Omega = \hat\Gamma_0 + \frac{n-1}{n}\left(\hat\Gamma_1+\hat\Gamma_1'\right) + \cdots + \frac{1}{n}\left(\hat\Gamma_{n-1}+\hat\Gamma_{n-1}'\right).$$

This estimator is inconsistent in general, since the number of autocovariances that are estimated grows at the same rate as the sample size, so information does not build up as $n \to \infty$. On the other hand, supposing that $\Gamma_v$ tends to zero sufficiently rapidly as $v$ tends to $\infty$, a modified estimator
$$\hat\Omega = \hat\Gamma_0 + \sum_{v=1}^{q(n)}\left(\hat\Gamma_v+\hat\Gamma_v'\right),$$
where $q(n) \to \infty$ as $n \to \infty$, will be consistent, provided $q(n)$ grows sufficiently slowly. The term $\frac{n-v}{n}$ can be dropped because it tends to one for $v < q(n)$, given that $q(n)$ increases slowly relative to $n$. The assumption that autocorrelations die off is reasonable in many cases. For example, the AR(1) model with $|\rho| < 1$ has autocorrelations that die off. A disadvantage of this estimator is that it may not be positive definite. This could cause one to calculate a negative $\chi^2$ statistic, for example! Newey and West proposed an estimator (Econometrica, 1987) that solves the problem of possible nonpositive definiteness of the above estimator. Their estimator is
$$\hat\Omega = \hat\Gamma_0 + \sum_{v=1}^{q(n)}\left(1 - \frac{v}{q+1}\right)\left(\hat\Gamma_v+\hat\Gamma_v'\right).$$
This estimator is p.d. by construction. The condition for consistency is that $n^{-1/4}q(n) \to 0$. Note that this is a very slow rate of growth for $q$. This estimator is nonparametric: we've placed no parametric restrictions on the form of $\Omega$. It is an example of a kernel estimator.
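To see why the Bartlett-type weights matter, consider a scalar series. The sketch below (plain Python, our own illustration, not code from the text) contrasts the truncated estimator with the weighted one on a strongly negatively autocorrelated series, where the truncated estimate of the long-run variance comes out negative:

```python
def autocov(u, v):
    # v-th sample autocovariance of a mean-zero series, with divisor n
    n = len(u)
    return sum(u[t] * u[t - v] for t in range(v, n)) / n

def lrv(u, q, newey_west=True):
    # Long-run variance estimate: gamma_0 + 2 * sum_{v=1}^{q} w_v * gamma_v,
    # with Bartlett weights w_v = 1 - v/(q+1) (Newey-West style) or w_v = 1
    # (truncated, which is not guaranteed to be positive).
    total = autocov(u, 0)
    for v in range(1, q + 1):
        w = 1.0 - v / (q + 1.0) if newey_west else 1.0
        total += 2.0 * w * autocov(u, v)
    return total

u = [1.0, -1.0] * 5  # strongly negatively autocorrelated series
print(round(lrv(u, 1, newey_west=False), 3))  # -0.8: truncated estimate is negative
print(round(lrv(u, 1, newey_west=True), 3))   # 0.1: weighted estimate stays positive
```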

Finally, since $\hat\Omega$ has $\Omega$ as its limit, $\hat\Omega \overset{p}{\to} \Omega$. We can now use $\hat\Omega$ and $\hat Q_X = \frac{1}{n}X'X$ to consistently estimate the limiting distribution of the OLS estimator under heteroscedasticity and autocorrelation of unknown form. With this, asymptotically valid tests are constructed in the usual way.

Testing for autocorrelation. Durbin-Watson test. The Durbin-Watson test statistic is
$$DW = \frac{\sum_{t=2}^{n}\left(\hat\varepsilon_t - \hat\varepsilon_{t-1}\right)^2}{\sum_{t=1}^{n}\hat\varepsilon_t^2} = \frac{\sum_{t=2}^{n}\left(\hat\varepsilon_t^2 - 2\hat\varepsilon_t\hat\varepsilon_{t-1} + \hat\varepsilon_{t-1}^2\right)}{\sum_{t=1}^{n}\hat\varepsilon_t^2}.$$
The null hypothesis is that the first-order autocorrelation of the errors is zero: $H_0\colon \rho_1 = 0$. The alternative is of course $H_A\colon \rho_1 \neq 0$. Note that the alternative is not that the errors are AR(1), since many general patterns of autocorrelation will have the first-order autocorrelation different than zero. For this reason the test is useful for detecting autocorrelation in general. For the same reason, one shouldn't just assume that an AR(1) model is appropriate when the DW test rejects the null. Under the null, the middle term tends to zero, and the other two tend to one, so $DW \overset{p}{\to} 2$. Supposing that we had an AR(1) error process with $\rho = 1$, the middle term tends to $-2$, so $DW \overset{p}{\to} 0$. Supposing that we had an AR(1) error process with $\rho = -1$, the middle term tends to $2$, so $DW \overset{p}{\to} 4$. These are the extremes: $DW$ always lies between 0 and 4.
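A quick sketch of the statistic and its limiting values (plain Python, our own illustration, not code from the text):

```python
def durbin_watson(e):
    # DW = sum_{t=2}^{n} (e_t - e_{t-1})^2 / sum_{t=1}^{n} e_t^2,
    # approximately 2*(1 - rho_hat_1) in large samples.
    num = sum((e[t] - e[t - 1]) ** 2 for t in range(1, len(e)))
    den = sum(et ** 2 for et in e)
    return num / den

smooth = [1.0] * 20             # extreme positive autocorrelation
alternating = [1.0, -1.0] * 10  # extreme negative autocorrelation

print(durbin_watson(smooth))                 # 0.0
print(round(durbin_watson(alternating), 2))  # 3.8, near the upper limit of 4
```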

Durbin-Watson critical values. The distribution of the test statistic depends on the matrix of regressors, $X$, so tables can't give exact critical values. The tables give upper and lower bounds, which correspond to the extremes that are possible. See Figure 7.5.2. There are means of determining exact critical values conditional on $X$.

[Figure 7.5.2: Durbin-Watson critical values.]

Note that DW can be used to test for nonlinearity (add discussion). The DW test is based upon the assumption that the matrix $X$ is fixed in repeated samples. This is often unreasonable in the context of economic time series, which is precisely the context where the test would have application. It is possible to relate the DW test to other test statistics which are valid without strict exogeneity.

Breusch-Godfrey test. This test uses an auxiliary regression, as does the White test for heteroscedasticity. The regression is
$$\hat\varepsilon_t = x_t'\delta + \gamma_1\hat\varepsilon_{t-1} + \gamma_2\hat\varepsilon_{t-2} + \cdots + \gamma_P\hat\varepsilon_{t-P} + v_t,$$
and the test statistic is the $nR^2$ statistic, just as in the White test. There are $P$ restrictions, so the test statistic is asymptotically distributed as a $\chi^2(P)$. The intuition is that the lagged errors shouldn't contribute to explaining the current error if there is no autocorrelation. $x_t$ is included as a regressor to account for the fact that the $\hat\varepsilon_t$ are not independent even if the $\varepsilon_t$ are. This is a technicality that we won't go into here. Note that the alternative is not that the model is an AR(P), following the argument above. The alternative is simply that some or all of the first $P$ autocorrelations are different from zero. This is compatible with many specific forms of autocorrelation. This test is valid even if the regressors are stochastic and contain lagged dependent variables, so it is considerably more useful than the DW test for typical time series data.

Lagged dependent variables and autocorrelation. We've seen that the OLS estimator is consistent under autocorrelation, as long as $\operatorname{plim}\frac{X'\varepsilon}{n} = 0$. This will be the case when $E(X'\varepsilon) = 0$, following a LLN. An important exception is the case where $X$ contains lagged $y$'s and the errors are autocorrelated. A simple example is the case of a single lag of the dependent variable with
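The following stripped-down sketch (plain Python, our own code, not from the text) conveys the $nR^2$ idea using a single lagged residual and no other regressors; the actual auxiliary regression also includes $x_t$, so this is an illustrative simplification only:

```python
import random

def bg_stat_one_lag(e):
    # n*R^2 from regressing e_t on a constant and e_{t-1}.
    # (Illustrative simplification: the full auxiliary regression includes x_t.)
    y, x = e[1:], e[:-1]
    n = len(y)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return n * sxy ** 2 / (sxx * syy)  # asymptotically chi^2(1) under the null

random.seed(4)
white = [random.gauss(0, 1) for _ in range(400)]  # no autocorrelation
ar1 = [0.0]
for _ in range(399):                              # rho = 0.9: strong autocorrelation
    ar1.append(0.9 * ar1[-1] + random.gauss(0, 1))

print(bg_stat_one_lag(ar1) > 3.84)                    # True: rejects at the 5% level
print(bg_stat_one_lag(white) < bg_stat_one_lag(ar1))  # True
```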

AR(1) errors. The model is
$$y_t = x_t'\beta + \gamma y_{t-1} + \varepsilon_t,$$
$$\varepsilon_t = \rho\varepsilon_{t-1} + u_t.$$
Now we can write
$$E\left(y_{t-1}\varepsilon_t\right) = E\left[\left(x_{t-1}'\beta + \gamma y_{t-2} + \varepsilon_{t-1}\right)\left(\rho\varepsilon_{t-1} + u_t\right)\right] \neq 0,$$
since one of the terms is $\rho E\left(\varepsilon_{t-1}^2\right)$, which is clearly nonzero. In this case $E(X'\varepsilon) \neq 0$, and therefore $\operatorname{plim}\frac{X'\varepsilon}{n} \neq 0$, so the OLS estimator is inconsistent in this case. One needs to estimate by instrumental variables (IV), which we'll get to later.

Examples. Nerlove model, yet again. The Nerlove model uses cross-sectional data, so one may not think of performing tests for autocorrelation. However, specification error can induce autocorrelated errors. Consider the simple Nerlove model
$$\ln C = \beta_1 + \beta_2\ln Q + \beta_3\ln P_L + \beta_4\ln P_F + \beta_5\ln P_K + \varepsilon$$
and the extended Nerlove model, which allows the intercept and the coefficient on $\ln Q$ to differ across the groups of firms sorted by output.

[Figure 7.6.1: Residuals of the simple Nerlove model, with a quadratic fit to the residuals.]

We have seen evidence that the extended model is preferred. So if it is in fact the proper model, the simple model is misspecified. Let's check whether this misspecification might induce autocorrelated errors. The Octave program GLS/NerloveAR.m estimates the simple Nerlove model, plots the residuals as a function of $\ln Q$, and calculates a Breusch-Godfrey test statistic. The residual plot is in Figure 7.6.1, and the test results are:

                           Value     p-value
    Breusch-Godfrey test   34.930    0.000

Clearly, there is a problem of autocorrelated residuals: the simple model is misspecified.

EXERCISE 7.1. Repeat the autocorrelation tests using the extended Nerlove model (Equation ??) to see whether the problem is solved.

Klein model. Klein's Model I is a simple macroeconometric model. One of the equations in the model explains consumption ($C$) as a function of profits ($P$), both current and lagged, as well as the sum of wages in the private sector ($W^p$) and wages in the government sector ($W^g$). Consider the model
$$C_t = \alpha_0 + \alpha_1 P_t + \alpha_2 P_{t-1} + \alpha_3\left(W_t^p + W_t^g\right) + \varepsilon_t.$$
The Octave program GLS/Klein.m estimates this model by OLS, plots the residuals, and performs the Breusch-Godfrey test, using 1 lag of the residuals. Have a look at the README file for this data set. This gives the variable names and other information. The estimation and test results are:

    *********************************************************
    OLS estimation results
    Observations 21
    R-squared 0.981008
    Sigma-squared 1.051732

    Results (Ordinary var-cov estimator)

                     estimate   st.err.   t-stat.   p-value
    Constant           16.237     1.303    12.464     0.000
    Profits             0.193     0.091     2.115     0.049
    Lagged Profits      0.090     0.091     0.992     0.335
    Wages               0.796     0.040    19.933     0.000
    *********************************************************

[Figure 7.6.2: Klein consumption equation, regression residuals from OLS estimation.]

                           Value    p-value
    Breusch-Godfrey test   1.539    0.215

The test does not reject the null of nonautocorrelated errors, but we should remember that we have only 21 observations, so power is likely to be fairly low. The residual plot is in Figure 7.6.2. It leads me to suspect that there may be autocorrelation: there are some significant runs below and above the x-axis. Your opinion may differ. Since it seems that there may be autocorrelation, let's try an AR(1) correction. The Octave program GLS/KleinAR1.m estimates the Klein consumption equation assuming that the errors follow the AR(1) pattern. The results, with the Breusch-Godfrey test for remaining autocorrelation, are:

    *********************************************************
    OLS estimation results
    Observations 21
    R-squared 0.967090
    Sigma-squared 0.983171

    Results (Ordinary var-cov estimator)

                     estimate   st.err.   t-stat.   p-value
    Constant           16.992     1.492    11.388     0.000
    Profits             0.215     0.096     2.232     0.039
    Lagged Profits      0.076     0.094     0.806     0.431
    Wages               0.774     0.048    16.234     0.000
    *********************************************************

                           Value    p-value
    Breusch-Godfrey test   2.129    0.345

The test is farther away from the rejection region than before, and the residual plot is a bit more favorable for the hypothesis of nonautocorrelated residuals, IMHO. For this reason, it seems that the AR(1) correction might have improved the estimation. Nevertheless, there has not been much of an effect on the estimated coefficients nor on their estimated standard errors. This is probably because the estimated AR(1) coefficient is not very large (around 0.2).

The existence or not of autocorrelation in this model will be important later, in the section on simultaneous equations.

Exercises

(1) Comparing the variances of the OLS and GLS estimators, I claimed that the following holds:
$$\operatorname{Var}(\hat\beta) - \operatorname{Var}(\hat\beta_{GLS}) = A\Sigma A',$$
where $A = \left(X'X\right)^{-1}X' - \left(X'\Sigma^{-1}X\right)^{-1}X'\Sigma^{-1}$.
(2) Verify that this is true.
(3) Show that the GLS estimator can be defined as
$$\hat\beta_{GLS} = \arg\min_\beta\,\left(y - X\beta\right)'\Sigma^{-1}\left(y - X\beta\right).$$
(4) The limiting distribution of the OLS estimator with heteroscedasticity of unknown form is
$$\sqrt n\left(\hat\beta - \beta_0\right) \overset{d}{\to} N\left(0,\, Q_X^{-1}\Omega Q_X^{-1}\right),$$
where
$$\Omega = \lim_{n\to\infty}E\left(\frac{X'\varepsilon\varepsilon' X}{n}\right).$$
Explain why
$$\hat\Omega = \frac{1}{n}\sum_{t=1}^{n}x_t x_t'\hat\varepsilon_t^2$$
is a consistent estimator of this matrix.
(5) Define the $v$-th autocovariance of a covariance stationary process $m_t$, where $E(m_t) = 0$, as $\Gamma_v = E\left(m_t m_{t-v}'\right)$. Show that $E\left(m_t m_{t+v}'\right) = \Gamma_v'$.
(6) For the Nerlove model
$$\ln C = \beta_1 + \beta_2\ln Q + \beta_3\ln P_L + \beta_4\ln P_F + \beta_5\ln P_K + \varepsilon,$$
assume that the variance of $\varepsilon$ follows a heteroscedastic specification that depends on the regressors.
(a) Calculate the FGLS estimator and interpret the estimation results.
(b) Test the transformed model to check whether it appears to satisfy homoscedasticity.

CHAPTER 8

Stochastic regressors

Up to now we have treated the regressors as fixed, which is clearly unrealistic. Now we will assume they are random. There are several ways to think of the problem. First, if we are interested in an analysis conditional on the explanatory variables, then it is irrelevant if they are stochastic or not, since conditional on the values the regressors take on, they are nonstochastic, which is the case already considered. In cross-sectional analysis it is usually reasonable to make the analysis conditional on the regressors. In dynamic models, where $y_t$ may depend on $y_{t-1}$, a conditional analysis is not sufficiently general, since we may want to predict into the future many periods out, so we need to consider the behavior of $\hat\beta$ and the relevant test statistics unconditional on $X$. The model we'll deal with involves a combination of the following assumptions.

Linearity: the model is a linear function of the parameter vector $\beta_0$:
$$y_t = x_t'\beta_0 + \varepsilon_t,$$
or in matrix form,
$$y = X\beta_0 + \varepsilon,$$
where $y$ is $n\times 1$, $X = \left(x_1, x_2, \ldots, x_n\right)'$, where $x_t$ is $K\times 1$, and $\beta_0$ and $\varepsilon$ are conformable.

Stochastic, linearly independent regressors: $X$ has rank $K$ with probability 1; $X$ is stochastic; and
$$\lim_{n\to\infty}\frac{1}{n}X'X = Q_X \text{ a.s.},$$
where $Q_X$ is a finite positive definite matrix.

Central limit theorem:
$$n^{-1/2}X'\varepsilon \overset{d}{\to} N\left(0, Q_X\sigma_0^2\right).$$

Normality (Optional): $\varepsilon$ is normally distributed.

Strongly exogenous regressors:
$$E(\varepsilon_t|X) = 0, \quad \forall t.$$

Weakly exogenous regressors:
$$E(\varepsilon_t|x_t) = 0, \quad \forall t.$$

In both cases, $x_t'\beta$ is the conditional mean of $y_t$ given $x_t$: $E(y_t|x_t) = x_t'\beta$.

8.1. Case 1

Normality of $\varepsilon$, strongly exogenous regressors. In this case,
$$\hat\beta = \beta_0 + \left(X'X\right)^{-1}X'\varepsilon,$$
so
$$E(\hat\beta|X) = \beta_0,$$
and since this holds for all $X$, $E(\hat\beta) = \beta_0$, unconditional on $X$. Likewise,
$$\hat\beta\,|\,X \sim N\left(\beta_0,\,\left(X'X\right)^{-1}\sigma_0^2\right).$$

If the density of $X$ is $d\mu(X)$, the marginal density of $\hat\beta$ is obtained by multiplying the conditional density by $d\mu(X)$ and integrating over $X$. Doing this leads to a nonnormal density for $\hat\beta$, in small samples. However, conditional on $X$, the usual test statistics have the $t$, $F$ and $\chi^2$ distributions. Importantly, these distributions don't depend on $X$, so when marginalizing to obtain the unconditional distribution, nothing changes. The tests are valid in small samples.

Summary: When $X$ is stochastic but strongly exogenous and $\varepsilon$ is normally distributed:
(1) $\hat\beta$ is unbiased.
(2) $\hat\beta$ is nonnormally distributed.
(3) The usual test statistics have the same distribution as with nonstochastic $X$.
(4) The Gauss-Markov theorem still holds, since it holds conditionally on $X$, and this is true for all $X$.
(5) Asymptotic properties are treated in the next section.

8.2. Case 2

$\varepsilon$ nonnormally distributed, strongly exogenous regressors. The unbiasedness of $\hat\beta$ carries through as before. However, the argument regarding test statistics doesn't hold, due to nonnormality of $\varepsilon$. Still, we have
$$\hat\beta = \beta_0 + \left(X'X\right)^{-1}X'\varepsilon = \beta_0 + \left(\frac{X'X}{n}\right)^{-1}\frac{X'\varepsilon}{n}.$$

Asymptotic normality of the estimator still holds. CASE 2 157 since the numerator converges to a goes to inﬁnity. all the previous asymptotic results on test statistics are also valid in this case.8.v. and É 1 I « æ T R Æ I r. (4) Asymptotic normality § nonnormal. (1) Unbiasedness (2) Consistency (3) Gauss-Markov theorem holds. has the properties: G Summary: Under strongly exogenous regressors. with RG ¢ 1 É 1 Æ p eI I R 1 É 1 Æ R ²1 RG I © ¢p è I « æ j ¥ ©¢ gDè ¥ « æ j 1 1 T GR ¢ I 1 GR 8p§ p e m h by assumption. and the denominator still h Now T § hp § 0÷§ 1 p§ h ²% § 1 h h normal or . so. the estimator is consistent: Considering the asymptotic distribution so directly following the assumptions. Since the asymptotic results on all test statistics only require this.2. since it holds in the previous case and doesn’t depend on normality. We have unbiasedness and the variance disappearing.

(5) Tests are asymptotically valid, but are not valid in small samples.

8.3. Case 3

Weakly exogenous regressors. An important class of models are dynamic models, where lagged dependent variables have an impact on the current value. A simple version of these models that captures the important points is
$$y_t = z_t'\alpha + \sum_{s=1}^{p}\gamma_s y_{t-s} + \varepsilon_t = x_t'\beta + \varepsilon_t,$$
where now $x_t$ contains lagged dependent variables. Clearly, even with $E(\varepsilon_t|x_t) = 0$, $X$ and $\varepsilon$ are not uncorrelated, so one can't show unbiasedness. For example,
$$E\left(\varepsilon_{t-1}x_t\right) \neq 0,$$
since $x_t$ contains $y_{t-1}$ (which is a function of $\varepsilon_{t-1}$) as an element. This is a case of weakly exogenous regressors, and we see that the OLS estimator is biased in this case. Recall Figure 3.7. This fact implies that all of the small sample properties such as unbiasedness, the Gauss-Markov theorem, and small sample validity of test statistics do not hold in this case. Nevertheless, under the above assumptions, all asymptotic properties continue to hold, using the same arguments as before.

8.4. When are the assumptions reasonable?

The two assumptions we've added are:

(1) $\lim_{n\to\infty}\frac{1}{n}X'X = Q_X$ a.s., $Q_X$ a finite positive definite matrix.
(2) $n^{-1/2}X'\varepsilon \overset{d}{\to} N\left(0, Q_X\sigma_0^2\right)$.

The most complicated case is that of dynamic models, since the other cases can be treated as nested in this case. There exist a number of central limit theorems for dependent processes, many of which are fairly technical. We won't enter into details (see Hamilton, Chapter 7 if you're interested). A main requirement for use of standard asymptotics for a dependent sequence
$$\{s_n\} = \left\{\frac{1}{n}\sum_{t=1}^{n} z_t\right\}$$
to converge in probability to a finite limit is that $z_t$ be stationary, in some sense.

Strong stationarity requires that the joint distribution of the set $\{z_t, z_{t+s}, z_{t-q}, \ldots\}$ not depend on $t$.
Covariance (weak) stationarity requires that the first and second moments of this set not depend on $t$.

An example of a sequence that doesn't satisfy this is an AR(1) process with a unit root (a random walk):
$$x_t = x_{t-1} + \varepsilon_t, \quad \varepsilon_t \sim IIN\left(0, \sigma^2\right).$$
One can show that the variance of $x_t$ depends upon $t$ in this case. Stationarity prevents the process from trending off to plus or minus infinity, and prevents cyclical behavior which would allow correlations between far removed $z_t$ and $z_s$ to be high. Draw a picture here.

In summary, the assumptions are reasonable when the stochastic conditioning variables have variances that are finite, and are not too strongly dependent. The AR(1) model with unit root is an example of a case where the dependence is too strong for standard asymptotics to apply; the standard asymptotics don't apply in that case. The econometrics of nonstationary processes has been an active area of research in the last two decades, but it isn't in the scope of this course.

Exercises

(1) Show that for two random variables $A$ and $B$, if $E(A|B) = 0$, then $E\left(A f(B)\right) = 0$, where $f(\cdot)$ is a function of $B$. How is this used in the Gauss-Markov theorem?
(2) Is it possible for an AR(1) model for time series data, e.g., $y_t = \alpha + \rho y_{t-1} + \varepsilon_t$, to satisfy weak exogeneity? Strong exogeneity? Discuss.

CHAPTER 9

Data problems

In this section we'll consider problems associated with the regressor matrix: collinearity, missing observations and measurement error.

9.1. Collinearity

Collinearity is the existence of linear relationships amongst the regressors. We can always write
$$\lambda_1 x_1 + \lambda_2 x_2 + \cdots + \lambda_K x_K + v = 0,$$
where $x_i$ is the $i$-th column of the regressor matrix $X$, and $v$ is an $n\times 1$ vector. In the case that there exists collinearity, the variation in $v$ is relatively small, so that there is an approximately exact linear relation between the regressors. "Relative" and "approximate" are imprecise, so it's difficult to define when collinearity exists. In the extreme, if there are exact linear relationships (every element of $v$ equal to zero), then $\rho(X) < K$, so $\rho(X'X) < K$, $X'X$ is not invertible and the OLS estimator is not uniquely defined. For example, if the model is
$$y_t = \beta_1 + \beta_2 x_{2t} + \beta_3 x_{3t} + \varepsilon_t,$$
$$x_{3t} = \alpha_1 + \alpha_2 x_{2t},$$

then we can write
$$y_t = \beta_1 + \beta_2 x_{2t} + \beta_3\left(\alpha_1 + \alpha_2 x_{2t}\right) + \varepsilon_t = \left(\beta_1 + \beta_3\alpha_1\right) + \left(\beta_2 + \beta_3\alpha_2\right)x_{2t} + \varepsilon_t = \gamma_1 + \gamma_2 x_{2t} + \varepsilon_t.$$
The $\gamma$'s can be consistently estimated, but since they define two equations in the three $\beta$'s, the $\beta$'s can't be consistently estimated (there are multiple values of $\beta$ that solve the first order conditions). The $\beta$'s are unidentified in the case of perfect collinearity. Perfect collinearity is unusual, except in the case of an error in construction of the regressor matrix, such as including the same regressor twice. Another case where perfect collinearity may be encountered is with models with dummy variables, if one is not careful. Consider a model of the rental price ($y_i$) of an apartment. This could depend on factors such as size, quality etc., collected in $x_i$, as well as on the location of the apartment. Let $B_i = 1$ if the $i$-th apartment is in Barcelona, $B_i = 0$ otherwise. Similarly, define $G_i$, $T_i$ and $L_i$ for Girona, Tarragona and Lleida. One could use a model such as
$$y_i = \beta_1 + \beta_2 B_i + \beta_3 G_i + \beta_4 T_i + \beta_5 L_i + x_i'\gamma + \varepsilon_i.$$
In this model, $B_i + G_i + T_i + L_i = 1$, $\forall i$, so there is an exact relationship between these variables and the column of ones corresponding to the constant. One must either drop the constant, or one of the qualitative variables.

9.1.1. A brief aside on dummy variables. Introduce a brief discussion of dummy variables here.

9.1.2. Back to collinearity. The more common case, if one doesn't make mistakes such as these, is the existence of inexact linear relationships, i.e., correlations between the regressors that are less than one in absolute value, but not zero. With economic data, collinearity is commonly encountered, and is often a severe problem. The basic problem is that when two (or more) variables move together, it is difficult to determine their separate influences. This is reflected in imprecise estimates, i.e., estimates with high variances. When there is collinearity, the minimizing point of the objective function that defines the OLS estimator (the sum of squared errors) is relatively poorly defined. This is seen in Figures 9.1.1 and 9.1.2.

[Figure 9.1.1: The OLS objective function $s(\beta_1, \beta_2)$ when there is no collinearity.]

[Figure 9.1.2: The OLS objective function $s(\beta_1, \beta_2)$ when there is collinearity.]

To see the effect of collinearity on variances, partition the regressor matrix as
$$X = \begin{bmatrix} x & W \end{bmatrix},$$
where $x$ is the first column of $X$ (note: we can interchange the columns of $X$ if we like, so there's no loss of generality in considering the first column). Now, the variance of $\hat\beta$, under the classical assumptions, is
$$V(\hat\beta) = \left(X'X\right)^{-1}\sigma^2.$$
Using the partition,
$$X'X = \begin{bmatrix} x'x & x'W \\ W'x & W'W \end{bmatrix},$$

and following a rule for partitioned inversion,
$$\left[\left(X'X\right)^{-1}\right]_{1,1} = \left(x'x - x'W\left(W'W\right)^{-1}W'x\right)^{-1} = \left(x'M_W x\right)^{-1} = \left(ESS_{x|W}\right)^{-1},$$
where $M_W = I - W(W'W)^{-1}W'$ and by $ESS_{x|W}$ we mean the error sum of squares obtained from the regression
$$x = W\lambda + v.$$
Since $R^2 = 1 - ESS/TSS$, we have $ESS = TSS\left(1 - R^2\right)$, so the variance of the coefficient corresponding to $x$ is
$$V(\hat\beta_x) = \frac{\sigma^2}{TSS_x\left(1 - R^2_{x|W}\right)}.$$
We see three factors influence the variance of this coefficient. It will be high if:
(1) $\sigma^2$ is large;
(2) there is little variation in $x$;
(3) there is a strong linear relationship between $x$ and the other regressors, so that $W$ can explain the movement in $x$ well. In this case, $R^2_{x|W}$ will be close to 1, and as $R^2_{x|W} \to 1$, $V(\hat\beta_x) \to \infty$.
The last of these cases is collinearity. Draw a picture here.

Intuitively, when there are strong linear relations between the regressors, it is difficult to determine the separate influence of the regressors on the dependent variable. This can be seen by comparing the OLS objective function in the case of no correlation between regressors with the objective function with correlation between the regressors. See the figures nocollin.ps (no correlation) and collin.ps (correlation), available on the web site.

9.1.3. Detection of collinearity. The best way is simply to regress each explanatory variable in turn on the remaining regressors. If any of these auxiliary regressions has a high $R^2$, there is a problem of collinearity. Furthermore, this procedure identifies which parameters are affected. Sometimes, we're only interested in certain parameters. Collinearity isn't a problem if it doesn't affect what we're interested in estimating. An alternative is to examine the matrix of correlations between the regressors. High correlations are sufficient but not necessary for severe collinearity. Also indicative of collinearity is that the model fits well (high $R^2$), but none of the variables is significantly different from zero (e.g., their separate influences aren't well determined). In summary, the artificial regressions are the best approach if one wants to be careful.

9.1.4. Dealing with collinearity. More information. Collinearity is a problem of an uninformative sample. The first question is: is all the available information being used? Is more data available? Are there coefficient restrictions that have been neglected? Picture illustrating how a restriction can solve problem of perfect collinearity.

Stochastic restrictions and ridge regression.
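With two regressors, the auxiliary regression of one on the other has $R^2$ equal to the squared correlation, so the variance inflation $1/(1-R^2)$ can be computed directly. A plain-Python sketch (our own code, for illustration only):

```python
def variance_inflation(x1, x2):
    # Auxiliary-regression diagnostic for two regressors:
    # R^2 of x1 on x2 equals corr(x1, x2)^2, and the variance of the
    # coefficient on x1 is inflated by the factor 1/(1 - R^2).
    n = len(x1)
    m1, m2 = sum(x1) / n, sum(x2) / n
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    r2 = s12 ** 2 / (s11 * s22)
    return 1.0 / (1.0 - r2)

x1 = [float(t) for t in range(20)]                     # a trending regressor
near_copy = [t + 0.01 * (-1) ** t for t in range(20)]  # x1 plus a tiny wiggle
unrelated = [1.0, -1.0] * 10                           # nearly uncorrelated with x1

print(variance_inflation(x1, near_copy) > 1000)     # True: severe collinearity
print(round(variance_inflation(x1, unrelated), 1))  # 1.0: essentially no inflation
```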

Supposing that there is no more data and no neglected restrictions, one possibility is to change perspectives, to Bayesian econometrics. One can express prior beliefs regarding the coefficients using stochastic restrictions. A stochastic linear restriction would be something of the form
$$R\beta = r + v,$$
where $R$ and $r$ are as in the case of exact linear restrictions, but $v$ is a random vector. For example, the model could be
$$y = X\beta + \varepsilon$$
$$R\beta = r + v$$
$$\begin{bmatrix}\varepsilon \\ v\end{bmatrix} \sim N\left(0, \begin{bmatrix}\sigma_\varepsilon^2 I_n & 0 \\ 0 & \sigma_v^2 I_q\end{bmatrix}\right).$$
This sort of model isn't in line with the classical interpretation of parameters as constants: according to this interpretation the left hand side of $R\beta = r + v$ is constant but the right is random. This model does fit the Bayesian perspective: we combine information coming from the model and the data, summarized in
$$y = X\beta + \varepsilon, \quad \varepsilon \sim N\left(0, \sigma_\varepsilon^2 I_n\right),$$
with prior beliefs regarding the distribution of the parameter, summarized in
$$R\beta \sim N\left(r, \sigma_v^2 I_q\right).$$
Since the sample is random it is reasonable to suppose that $E(\varepsilon v') = 0$, which is the last piece of information in the specification. How can you estimate using this model?

The solution is to treat the restrictions as artificial data. Write
$$\begin{bmatrix} y \\ r \end{bmatrix} = \begin{bmatrix} X \\ R \end{bmatrix}\beta + \begin{bmatrix} \varepsilon \\ -v \end{bmatrix}.$$
This model is heteroscedastic, since $\sigma_\varepsilon^2 \neq \sigma_v^2$. Define the prior precision $k = \sigma_\varepsilon/\sigma_v$. This expresses the degree of belief in the restriction relative to the variability of the data. Supposing that we specify $k$, then the model
$$\begin{bmatrix} y \\ kr \end{bmatrix} = \begin{bmatrix} X \\ kR \end{bmatrix}\beta + \begin{bmatrix} \varepsilon \\ -kv \end{bmatrix}$$
is homoscedastic and can be estimated by OLS. Note that this estimator is biased. It is consistent, however, given that $k$ is a fixed constant, even if the restriction is false (this is in contrast to the case of false exact restrictions). To see this, note that there are $Q$ restrictions, where $Q$ is the number of rows of $R$. As $n \to \infty$, these $Q$ artificial observations have no weight in the objective function, so the estimator has the same limiting objective function as the OLS estimator, and is therefore consistent. To motivate the use of stochastic restrictions, consider the expectation of the squared length of $\hat\beta$:
$$E\left(\hat\beta'\hat\beta\right) = E\left[\left(\beta + \left(X'X\right)^{-1}X'\varepsilon\right)'\left(\beta + \left(X'X\right)^{-1}X'\varepsilon\right)\right] = \beta'\beta + E\left[\varepsilon'X\left(X'X\right)^{-1}\left(X'X\right)^{-1}X'\varepsilon\right],$$

so
$$E\left(\hat\beta'\hat\beta\right) = \beta'\beta + \sigma^2\operatorname{tr}\left(X'X\right)^{-1} = \beta'\beta + \sigma^2\sum_{i=1}^{K}\lambda_i^{-1}$$
(the trace is the sum of the eigenvalues), so
$$E\left(\hat\beta'\hat\beta\right) > \beta'\beta + \frac{\sigma^2}{\lambda_{\min(X'X)}}$$
(the eigenvalues are all positive, since $X'X$ is p.d.), where $\lambda_{\min(X'X)}$ is the minimum eigenvalue of $X'X$ (which is the inverse of the maximum eigenvalue of $(X'X)^{-1}$). As collinearity becomes worse and worse, $X'X$ becomes more nearly singular, so $\lambda_{\min}$ tends to zero (recall that the determinant is the product of the eigenvalues) and $E(\hat\beta'\hat\beta)$ tends to infinity. On the other hand, $\beta'\beta$ is finite. Now considering the restriction $I_K\beta = 0 + v$: with this restriction the model becomes
$$\begin{bmatrix} y \\ 0 \end{bmatrix} = \begin{bmatrix} X \\ kI_K \end{bmatrix}\beta + \begin{bmatrix} \varepsilon \\ -kv \end{bmatrix},$$
and the estimator is
$$\hat\beta_{ridge} = \left(\begin{bmatrix} X' & kI_K \end{bmatrix}\begin{bmatrix} X \\ kI_K \end{bmatrix}\right)^{-1}\begin{bmatrix} X' & kI_K \end{bmatrix}\begin{bmatrix} y \\ 0 \end{bmatrix} = \left(X'X + k^2 I_K\right)^{-1}X'y.$$
This is the ordinary ridge regression estimator. The ridge regression estimator can be seen to add $k^2 I_K$, which is nonsingular, to $X'X$, which is more and more nearly singular as collinearity becomes worse and worse. As $k \to \infty$, the restrictions tend to $\beta = 0$, that is, the coefficients are shrunken toward zero. Also, the estimator tends to
$$\hat\beta_{ridge} = \left(X'X + k^2 I_K\right)^{-1}X'y \to \frac{X'y}{k^2} \to 0,$$
so $\hat\beta_{ridge} \to 0$. This is clearly a false restriction in the limit, if our original model is at all sensible.
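With a single regressor the ridge formula collapses to $\hat\beta_{ridge} = x'y/(x'x + k^2)$, which makes the shrinkage toward zero easy to see. A plain-Python sketch (our own illustration, not code from the text):

```python
def ridge_scalar(x, y, k2):
    # One-regressor ridge estimator: beta_hat = x'y / (x'x + k^2).
    # k2 = 0 gives OLS; as k2 grows, the estimate is shrunken toward zero.
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    return sxy / (sxx + k2)

x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 7.8]  # roughly y = 2x

for k2 in (0.0, 10.0, 1000.0):
    print(round(ridge_scalar(x, y, k2), 3))  # shrinks toward zero as k^2 grows
```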

There should be some amount of shrinkage that is in fact a true restriction. The problem is to determine the $k$ such that the restriction is correct. The interest in ridge regression centers on the fact that it can be shown that there exists a $k$ such that $MSE(\hat\beta_{ridge}) < MSE(\hat\beta_{OLS})$. The problem is that this $k$ depends on $\beta$ and $\sigma^2$, which are unknown. The ridge trace method plots $\hat\beta_{ridge}'\hat\beta_{ridge}$ as a function of $k$, and chooses the value of $k$ that "artistically" seems appropriate (e.g., where the effect of increasing $k$ dies off). Draw picture here. This means of choosing $k$ is obviously subjective. This is not a problem from the Bayesian perspective: the choice of $k$ reflects prior beliefs about the length of $\beta$. In summary, the ridge estimator offers some hope, but it is impossible to guarantee that it will outperform the OLS estimator. Collinearity is a fact of life in econometrics, and there is no clear solution to the problem.

9.2. Measurement error

Measurement error is exactly what it says: either the dependent variable or the regressors are measured with error. Thinking about the way economic data are reported, measurement error is probably quite prevalent. For example, estimates of growth of GDP, inflation, etc. are commonly revised several times. Why should the last revision necessarily be correct? Measurement errors in the dependent variable and the regressors have important differences.

9.2.1. Error of measurement of the dependent variable.

First consider error in measurement of the dependent variable. The data generating process is presumed to be
$$y_t^* = x_t'\beta + \varepsilon_t,$$
$$y_t = y_t^* + v_t,$$
with $v_t \sim iid\left(0, \sigma_v^2\right)$, where $y_t^*$ is the unobservable true dependent variable, and $y_t$ is what is observed. We assume that $\varepsilon$ and $v$ are independent and that $y_t^* = x_t'\beta + \varepsilon_t$ satisfies the classical assumptions. Given this, we have
$$y_t = x_t'\beta + \varepsilon_t + v_t = x_t'\beta + \omega_t,$$
where $\omega_t = \varepsilon_t + v_t$, with variance $\sigma_\omega^2 = \sigma_\varepsilon^2 + \sigma_v^2$. As long as $v$ is uncorrelated with $x_t$, this model satisfies the classical assumptions and can be estimated by OLS. This type of measurement error isn't a problem, then.

9.2.2. Error of measurement of the regressors. The situation isn't so good in this case. The DGP is
$$y_t = x_t^{*\prime}\beta + \varepsilon_t,$$
$$x_t = x_t^* + v_t,$$
with $v_t \sim iid\left(0, \Sigma_v\right)$, where $\Sigma_v$ is a $K\times K$ matrix. Now $x_t^*$ contains the true, unobserved regressors, and $x_t$ is what is observed. Again assume that $v_t$ is independent of $\varepsilon_t$, and that the model $y = X^*\beta + \varepsilon$ satisfies the classical assumptions. Now we have
$$y_t = \left(x_t - v_t\right)'\beta + \varepsilon_t = x_t'\beta - v_t'\beta + \varepsilon_t = x_t'\beta + \omega_t.$$
The problem is that now there is a correlation between $x_t$ and $\omega_t$, since
$$E\left(x_t\omega_t\right) = E\left[\left(x_t^* + v_t\right)\left(-v_t'\beta + \varepsilon_t\right)\right] = -\Sigma_v\beta,$$
which is different from zero in general. Because of this correlation, the OLS estimator is biased and inconsistent, just as in the case of autocorrelated errors with lagged dependent variables. In matrix notation, write the estimated model as
$$y = X\beta + \omega.$$

We have that
$$\hat\beta = \left(\frac{X'X}{n}\right)^{-1}\left(\frac{X'y}{n}\right),$$
and
$$\operatorname{plim}\left(\frac{X'X}{n}\right) = \operatorname{plim}\left(\frac{\left(X^* + V\right)'\left(X^* + V\right)}{n}\right) = Q_{X^*} + \Sigma_v,$$
since $X^*$ and $V$ are independent, and
$$\operatorname{plim}\left(\frac{V'V}{n}\right) = \lim_{n\to\infty} E\left(\frac{1}{n}\sum_{t=1}^{n}v_t v_t'\right) = \Sigma_v.$$
Likewise,
$$\operatorname{plim}\left(\frac{X'y}{n}\right) = \operatorname{plim}\left(\frac{\left(X^* + V\right)'\left(X^*\beta + \varepsilon\right)}{n}\right) = Q_{X^*}\beta,$$
so
$$\operatorname{plim}\hat\beta = \left(Q_{X^*} + \Sigma_v\right)^{-1}Q_{X^*}\beta.$$
So we see that the least squares estimator is inconsistent when the regressors are measured with error. A potential solution to this problem is the instrumental variables (IV) estimator, which we'll discuss shortly.
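The attenuation can be seen in a quick simulation (plain Python, our own code, seeded for reproducibility; not one of the text's Octave programs). With $\beta = 2$ and equal unit variances for the true regressor and the measurement error, $\operatorname{plim}\hat\beta = \beta\sigma_{x^*}^2/(\sigma_{x^*}^2 + \sigma_v^2) = 1$:

```python
import random

random.seed(42)
beta, n = 2.0, 20000
x_true = [random.gauss(0, 1) for _ in range(n)]      # unobserved x*
y = [beta * a + random.gauss(0, 1) for a in x_true]  # y = 2 x* + eps
x_obs = [a + random.gauss(0, 1) for a in x_true]     # observed x = x* + v

# OLS slope of y on the error-ridden regressor (no intercept; means are zero)
b = sum(a * c for a, c in zip(x_obs, y)) / sum(a * a for a in x_obs)

print(b < 1.5)  # True: the estimate is pulled well below the true beta = 2
print(b > 0.5)  # True: biased toward zero, but not all the way
```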

9.3. Missing observations

Missing observations occur quite frequently: time series data may not be gathered in a certain year, or respondents to a survey may not answer all questions. We'll consider two cases: missing observations on the dependent variable and missing observations on the regressors.

9.3.1. Missing observations on the dependent variable. In this case, we have
$$\begin{bmatrix} y_1 \\ y_2 \end{bmatrix} = \begin{bmatrix} X_1 \\ X_2 \end{bmatrix}\beta + \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \end{bmatrix},$$
or
$$y = X\beta + \varepsilon,$$
where $y_2$ is not observed. Otherwise, we assume the classical assumptions hold. A clear alternative is to simply estimate using the complete observations
$$y_1 = X_1\beta + \varepsilon_1.$$
Since these observations satisfy the classical assumptions, one could estimate by OLS. The question remains whether or not one could somehow replace the unobserved $y_2$ by a predictor, and improve over OLS in some sense. Let $\hat y_2$ be the predictor of $y_2$. Now
$$\hat\beta = \left(\begin{bmatrix} X_1 \\ X_2 \end{bmatrix}'\begin{bmatrix} X_1 \\ X_2 \end{bmatrix}\right)^{-1}\begin{bmatrix} X_1 \\ X_2 \end{bmatrix}'\begin{bmatrix} y_1 \\ \hat y_2 \end{bmatrix} = \left(X_1'X_1 + X_2'X_2\right)^{-1}\left(X_1'y_1 + X_2'\hat y_2\right).$$

Recall that the OLS first order necessary conditions are $X'X\hat\beta = X'y$, so if we regressed using only the first (complete) observations, we would have
\[ X_1'X_1\hat\beta_1 = X_1'y_1. \]
Likewise, an OLS regression using only the second (filled in) observations would give
\[ X_2'X_2\hat\beta_2 = X_2'\hat y_2. \]
Substituting these into the equation for the overall combined estimator gives
\[ \hat\beta = \left[X_1'X_1 + X_2'X_2\right]^{-1}\left[X_1'X_1\hat\beta_1 + X_2'X_2\hat\beta_2\right] \equiv A\hat\beta_1 + (I - A)\hat\beta_2, \]
where
\[ A \equiv \left[X_1'X_1 + X_2'X_2\right]^{-1}X_1'X_1, \]
and we use
\[ \left[X_1'X_1 + X_2'X_2\right]^{-1}X_2'X_2 = I - A. \]
Now,
\[ E(\hat\beta) = A\beta + (I - A)E(\hat\beta_2), \]
and this will be unbiased only if $E(\hat\beta_2) = \beta$.

The conclusion is that the filled-in observations alone would need to define an unbiased estimator. This will be the case only if
\[ \hat y_2 = X_2\beta + \hat\varepsilon_2, \]
where $\hat\varepsilon_2$ has mean zero. Clearly, it is difficult to satisfy this condition without knowledge of $\beta$. Note that putting $\hat y_2 = \bar y_1$ leads to a biased estimator.

EXERCISE 13. Formally prove this last statement.

One possibility that has been suggested (see Greene, page 275) is to estimate $\beta$ using a first round estimation using only the complete observations,
\[ \hat\beta_1 = (X_1'X_1)^{-1}X_1'y_1, \]
then use this estimate to predict $y_2$:
\[ \hat y_2 = X_2\hat\beta_1 = X_2(X_1'X_1)^{-1}X_1'y_1. \]
Now, the overall estimate is a weighted average of $\hat\beta_1$ and $\hat\beta_2$, just as above, but here
\[ \hat\beta_2 = (X_2'X_2)^{-1}X_2'\hat y_2 = (X_2'X_2)^{-1}X_2'X_2\hat\beta_1 = \hat\beta_1. \]
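The algebra can be confirmed numerically. The following check (illustrative, not from the text) fills in the missing $y_2$ with $X_2\hat\beta_1$ and verifies that OLS on the augmented sample returns exactly $\hat\beta_1$ again.

```python
import numpy as np

# Numerical check: filling in y2 with X2 @ b1, where b1 is OLS on the
# complete observations, leaves the combined OLS estimate unchanged.
# The simulated design is an assumption chosen only for illustration.
rng = np.random.default_rng(1)
X1 = rng.normal(size=(30, 3))
X2 = rng.normal(size=(10, 3))
beta = np.array([1.0, -2.0, 0.5])
y1 = X1 @ beta + rng.normal(size=30)

b1 = np.linalg.solve(X1.T @ X1, X1.T @ y1)     # complete-observations OLS
y2_hat = X2 @ b1                               # "filled in" dependent variable
X = np.vstack([X1, X2])
y = np.concatenate([y1, y2_hat])
b_combined = np.linalg.solve(X.T @ X, X.T @ y) # OLS on the augmented sample

print(np.allclose(b1, b_combined))             # True
```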

This shows that this suggestion is completely empty of content: the final estimator is the same as the OLS estimator using only the complete observations.

The sample selection problem. In the above discussion we assumed that the missing observations are random. The sample selection problem is a case where the missing observations are not random. Consider the model
\[ y_t^* = x_t'\beta + \varepsilon_t, \]
which is assumed to satisfy the classical assumptions. However, $y_t^*$ is not always observed. What is observed is $y_t$, defined as
\[ y_t = y_t^* \quad \text{if } y_t^* \ge 0. \]
In other words, $y_t^*$ is missing when it is less than zero. The difference in this case is that the missing values are not random: they are correlated with the $x_t$. Consider the case
\[ y^* = x + \varepsilon, \qquad \varepsilon \sim N(0, 25), \]
but using only the observations for which $y^* \ge 0$ to estimate. Figure 9.1 illustrates the bias. The Octave program is sampsel.m.
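A Python/numpy analogue of this experiment (the notes use an Octave program, sampsel.m, whose exact details are assumed here) shows the direction of the bias: the fitted line is too flat and the intercept too high, because low draws of $y^*$ are systematically discarded.

```python
import numpy as np

# Illustrative sketch of the sample selection setup: y* = x + eps with
# eps ~ N(0, 25), and y* observed only when it is non-negative.
# OLS on the selected sample is biased toward a flatter line.
rng = np.random.default_rng(2)
n = 20_000
x = rng.uniform(0.0, 10.0, n)
y_star = x + rng.normal(0.0, 5.0, n)    # true intercept 0, true slope 1
keep = y_star >= 0                      # non-random selection rule

xs, ys = x[keep], y_star[keep]
X = np.column_stack([np.ones(xs.size), xs])
b = np.linalg.lstsq(X, ys, rcond=None)[0]

print(b)  # slope noticeably below 1, intercept well above 0
```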

FIGURE 9.1. Sample selection bias (the plot shows the data, the true line, and the fitted line).

9.3.2. Missing observations on the regressors. Again the model is
\[ \begin{bmatrix} y_1 \\ y_2 \end{bmatrix} = \begin{bmatrix} X_1 \\ X_2 \end{bmatrix}\beta + \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \end{bmatrix}, \]
but we assume now that each row of $X_2$ has an unobserved component(s). Again, one could just estimate using the complete observations, but it may seem frustrating to have to drop observations simply because of a single missing variable. In general, if the unobserved $X_2$ is replaced by some prediction $X_2^*$, then we are in the case of errors of observation. As before, this means that the OLS estimator is biased when $X_2^*$ is used instead of $X_2$. Consistency is salvaged, however, as long as the number of missing observations doesn't increase with $n$.

Including observations that have missing values replaced by ad hoc values can be interpreted as introducing false stochastic restrictions. In general, this introduces bias. It is difficult to determine whether MSE increases or decreases. Monte Carlo studies suggest that it is dangerous to simply substitute the mean, for example.

In the case that there is only one regressor other than the constant, substitution of $\bar x$ for the missing $x_t$ does not lead to bias. This is a special case that doesn't hold for $K > 2$.

EXERCISE 14. Prove this last statement.

In summary, if one is strongly concerned with bias, it is best to drop observations that have missing components. There is potential for reduction of MSE through filling in missing elements with intelligent guesses, but this could also increase MSE.

Exercises

(1) Consider the Nerlove model
\[ \ln C = \beta_1 + \beta_2\ln Q + \beta_3\ln P_L + \beta_4\ln P_F + \beta_5\ln P_K + \varepsilon. \]
When this model is estimated by OLS, some coefficients are not significant. This may be due to collinearity.
(a) Calculate the correlation matrix of the regressors.
(b) Perform artificial regressions to see if collinearity is a problem.
(c) Apply the ridge regression estimator.
    (i) Plot the ridge trace diagram.
    (ii) Check what happens as the ridge parameter goes to zero, and as it becomes very large.

CHAPTER 10

Functional form and nonnested tests

Though theory often suggests which conditioning variables should be included, and suggests the signs of certain derivatives, it is usually silent regarding the functional form of the relationship between the dependent variable and the regressors. For example, considering a cost function, one could have a Cobb-Douglas model
\[ C = A w_1^{\beta_1} w_2^{\beta_2} q^{\beta_q} e^{\varepsilon}. \]
This model, after taking logarithms, gives
\[ \ln C = \beta_0 + \beta_1\ln w_1 + \beta_2\ln w_2 + \beta_q\ln q + \varepsilon, \]
where $\beta_0 = \ln A$. Theory suggests that $A > 0$ and that each of the slope parameters is positive. This model isn't compatible with a fixed cost of production, since $C = 0$ when $q = 0$. Homogeneity of degree one in input prices suggests that $\beta_1 + \beta_2 = 1$, while constant returns to scale implies $\beta_q = 1$.

While this model may be reasonable in some cases, an alternative such as
\[ \sqrt{C} = \beta_0 + \beta_1\sqrt{w_1} + \beta_2\sqrt{w_2} + \beta_q\sqrt{q} + \varepsilon \]
may be just as plausible. Note that $x$ and $\ln x$ look quite alike, for certain values of the regressors and up to a linear transformation, so it may be difficult to choose between these models.

The basic point is that many functional forms are compatible with the linear-in-parameters model, since this model can incorporate a wide variety of nonlinear transformations of the dependent variable and the regressors. For example, suppose that $g(\cdot)$ is a real valued function and that $x(\cdot)$ is a vector-valued function. The following model is linear in the parameters but nonlinear in the variables:
\[ x_t = x(z_t), \qquad y_t = x_t'\beta + \varepsilon_t. \]
There may be $P$ fundamental conditioning variables $z_t$, but there may be $K$ regressors, where $K$ may be smaller than, equal to, or larger than $P$. For example, $x_t$ could include squares and cross products of the conditioning variables in $z_t$.

10.1. Flexible functional forms

Given that the functional form of the relationship between the dependent variable and the regressors is in general unknown, one might wonder if there exist parametric models that can closely approximate a wide variety of functional relationships. A "Diewert-Flexible" functional form is defined as one such that the function, the vector of first derivatives and the matrix of second derivatives can take on an arbitrary value at a single data point. Flexibility in this sense clearly requires that there be at least
\[ K = 1 + P + \frac{P^2 - P}{2} + P \]
free parameters: one for each independent effect that we wish to model.

Suppose that the model is
\[ y = g(x) + \varepsilon. \]
A second-order Taylor's series expansion (with remainder term) of the function $g(x)$ about the point $x = 0$ is
\[ g(x) = g(0) + x' D_x g(0) + \frac{x' D_x^2 g(0)\, x}{2} + R. \]
Use the approximation, which simply drops the remainder term, as an approximation to $g(x)$:
\[ g(x) \simeq g_K(x) = \alpha + x'\beta + \frac{x'\Gamma x}{2}. \]
As $x \to 0$, the approximation becomes more and more exact. For $x = 0$, the approximation is exact, up to the second order. The idea behind many flexible functional forms is to note that $g(0)$, $D_x g(0)$ and $D_x^2 g(0)$ are all constants. If we treat them as parameters, the approximation will have exactly enough free parameters to approximate the function $g(x)$, which is of unknown form, exactly, up to second order, at the point $x = 0$. The regression model to fit is
\[ y = \alpha + x'\beta + \frac{x'\Gamma x}{2} + \nu. \]
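A quick numerical illustration (not from the text) of how the truncated expansion behaves: take the scalar function $g(x) = e^x$ expanded about zero, so $\alpha = \beta = \gamma = 1$ and $g_K(x) = 1 + x + x^2/2$.

```python
import numpy as np

# Second-order approximation of g(x) = exp(x) about 0: g_K(x) = 1 + x + x^2/2.
# Near the expansion point the remainder is tiny; far away it is large.
g = np.exp
gK = lambda x: 1.0 + x + 0.5 * x ** 2

x_small, x_large = 0.1, 2.0
err_small = abs(g(x_small) - gK(x_small))   # small near the expansion point
err_large = abs(g(x_large) - gK(x_large))   # remainder grows away from it

print(err_small, err_large)
```

This is exactly the remainder term that, as discussed next, ends up in the regression error when the true parameters are taken to be the derivatives at the expansion point.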

While the regression model has enough free parameters to be Diewert-flexible, the question remains: is $\operatorname{plim}\hat\alpha = g(0)$? Is $\operatorname{plim}\hat\beta = D_x g(0)$? Is $\operatorname{plim}\hat\Gamma = D_x^2 g(0)$? The answer is no, in general. The reason is that if we treat the true values of the parameters as these derivatives, then $\nu$ is forced to play the part of the remainder term, which is a function of $x$, so that $x$ and $\nu$ are correlated in this case. As before, the estimator is biased in this case. A simpler example would be to consider a first-order Taylor series approximation to a quadratic function; a picture makes the bias obvious. The conclusion is that "flexible functional forms" aren't really flexible in a useful statistical sense, in that neither the function itself nor its derivatives are consistently estimated, unless the function belongs to the parametric family of the specified functional form. In order to lead to consistent inferences, the regression model must be correctly specified.

10.1.1. The translog form. In spite of the fact that FFF's aren't really as flexible as they were originally claimed to be, they are useful, and they are certainly subject to less bias due to misspecification of the functional form than are many popular forms, such as the Cobb-Douglas or the simple linear-in-the-variables model. The translog model is probably the most widely used FFF. This model is as above, except that the variables are subjected to a logarithmic transformation. Also, the expansion point is usually taken to be the sample mean of the data, after the logarithmic transformation.

The model is defined by
\[ y = \ln C, \qquad x = \ln(z/\bar z) = \ln z - \ln\bar z, \]
\[ y = \alpha + x'\beta + \frac{x'\Gamma x}{2} + \varepsilon. \]
In this presentation, the $t$ subscript that distinguishes observations is suppressed for simplicity. Note that
\[ \frac{\partial y}{\partial x} = \beta + \Gamma x = \frac{\partial\ln C}{\partial\ln z} \quad \text{(the other part of } x \text{ is constant)} = \frac{\partial C}{\partial z}\frac{z}{C}, \]
which is the elasticity of $C$ with respect to $z$. This is a convenient feature of the translog model. Note that at the means of the conditioning variables, $z = \bar z$, so $x = 0$, and therefore
\[ \left.\frac{\partial y}{\partial x}\right|_{z=\bar z} = \beta, \]
so the $\beta$ are the first-order elasticities, at the means of the data.

To illustrate, consider that $y$ is the cost of production:
\[ y = c(w, q), \]
where $w$ is a vector of input prices and $q$ is output. We could add other variables by extending $q$ in the obvious manner, but this is suppressed for simplicity.

By Shephard's lemma, the conditional factor demands are
\[ x = \frac{\partial c(w, q)}{\partial w}, \]
and the cost shares of the factors are therefore
\[ s = \frac{wx}{c} = \frac{\partial c(w, q)}{\partial w}\frac{w}{c}, \]
which is simply the vector of elasticities of cost with respect to input prices. If the cost function is modeled using a translog function, then, writing $x = \ln(w/\bar w)$ and $z = \ln(q/\bar q)$, we have
\[ \ln c = \alpha + x'\beta + z\delta + \frac{1}{2}\begin{bmatrix} x' & z \end{bmatrix}\begin{bmatrix} \Gamma_{11} & \Gamma_{12} \\ \Gamma_{12}' & \Gamma_{22} \end{bmatrix}\begin{bmatrix} x \\ z \end{bmatrix}, \]
where
\[ \Gamma_{11} = \begin{bmatrix} \gamma_{11} & \gamma_{12} \\ \gamma_{12} & \gamma_{22} \end{bmatrix}, \qquad \Gamma_{12} = \begin{bmatrix} \gamma_{13} \\ \gamma_{23} \end{bmatrix}, \qquad \Gamma_{22} = \gamma_{33}. \]
Note that symmetry of the second derivatives has been imposed. Then the share equations are just
\[ s = \beta + \begin{bmatrix} \Gamma_{11} & \Gamma_{12} \end{bmatrix}\begin{bmatrix} x \\ z \end{bmatrix}. \]

Therefore, the share equations and the cost equation have parameters in common. By pooling the equations together and imposing the (true) restriction that the parameters of the equations be the same, we can gain efficiency.

To illustrate in more detail, consider the case of two inputs, so $x = (x_1, x_2)'$. In this case the translog model of the logarithmic cost function is
\[ \ln c = \alpha + \beta_1 x_1 + \beta_2 x_2 + \delta z + \frac{\gamma_{11}}{2}x_1^2 + \frac{\gamma_{22}}{2}x_2^2 + \frac{\gamma_{33}}{2}z^2 + \gamma_{12}x_1 x_2 + \gamma_{13}x_1 z + \gamma_{23}x_2 z. \]
The two cost shares of the inputs are the derivatives of $\ln c$ with respect to $x_1$ and $x_2$:
\[ s_1 = \beta_1 + \gamma_{11}x_1 + \gamma_{12}x_2 + \gamma_{13}z \]
\[ s_2 = \beta_2 + \gamma_{12}x_1 + \gamma_{22}x_2 + \gamma_{23}z. \]
Note that the share equations and the cost equation have parameters in common. One can do a pooled estimation of the three equations at once, imposing that the parameters are the same. In this way we're using more observations, and therefore more information, which will lead to improved efficiency. Note that this does assume that the cost equation is correctly specified (i.e., not an approximation), since otherwise the derivatives would not be the true derivatives of the log cost function, and would then be misspecified in the share equations. To pool the equations, write the model in matrix form (adding in error terms):

\[ \begin{bmatrix} \ln c \\ s_1 \\ s_2 \end{bmatrix} = \begin{bmatrix} 1 & x_1 & x_2 & z & \frac{x_1^2}{2} & \frac{x_2^2}{2} & \frac{z^2}{2} & x_1 x_2 & x_1 z & x_2 z \\ 0 & 1 & 0 & 0 & x_1 & 0 & 0 & x_2 & z & 0 \\ 0 & 0 & 1 & 0 & 0 & x_2 & 0 & x_1 & 0 & z \end{bmatrix}\theta + \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \varepsilon_3 \end{bmatrix}, \]
where $\theta = \begin{bmatrix} \alpha & \beta_1 & \beta_2 & \delta & \gamma_{11} & \gamma_{22} & \gamma_{33} & \gamma_{12} & \gamma_{13} & \gamma_{23} \end{bmatrix}'$. This is one observation on the three equations. With the appropriate notation, a single observation can be written as
\[ y_t = X_t\theta + \varepsilon_t. \]
The overall model would stack $n$ observations on the three equations for a total of $3n$ observations:
\[ \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} X_1 \\ X_2 \\ \vdots \\ X_n \end{bmatrix}\theta + \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix}. \]

Next we need to consider the errors. For observation $t$ the errors can be placed in a vector
\[ \varepsilon_t = \begin{bmatrix} \varepsilon_{1t} \\ \varepsilon_{2t} \\ \varepsilon_{3t} \end{bmatrix}. \]
First consider the covariance matrix of this vector: the shares are certainly correlated since they must sum to one. (In fact, with 2 shares the variances are equal and the covariance is $-1$ times the variance. General notation is used here to allow easy extension to the case of more than 2 inputs.) Also, it's likely that the shares and the cost equation have different variances. Supposing that the model is covariance stationary, the variance of $\varepsilon_t$ won't depend upon $t$:
\[ \operatorname{Var}(\varepsilon_t) = \Sigma_0 = \begin{bmatrix} \sigma_{11} & \sigma_{12} & \sigma_{13} \\ \cdot & \sigma_{22} & \sigma_{23} \\ \cdot & \cdot & \sigma_{33} \end{bmatrix}. \]
Note that this matrix is singular, since the shares sum to 1. Assuming that there is no autocorrelation, the overall covariance matrix has the seemingly unrelated regressions (SUR) structure:

\[ \operatorname{Var}\begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix} = \Sigma = \begin{bmatrix} \Sigma_0 & 0 & \cdots & 0 \\ 0 & \Sigma_0 & & \vdots \\ \vdots & & \ddots & 0 \\ 0 & \cdots & 0 & \Sigma_0 \end{bmatrix} = I_n \otimes \Sigma_0, \]
where the symbol $\otimes$ indicates the Kronecker product. The Kronecker product of two matrices $A$ and $B$ is
\[ A \otimes B = \begin{bmatrix} a_{11}B & a_{12}B & \cdots & a_{1q}B \\ \vdots & & & \vdots \\ a_{p1}B & \cdots & & a_{pq}B \end{bmatrix}. \]
(Personally, I can never keep straight the roles of $A$ and $B$.)

10.1.2. FGLS estimation of a translog model. So, in the stacked form, this model has heteroscedastic and mutually correlated errors, so OLS won't be efficient. The next question is: how do we estimate efficiently using FGLS? FGLS is based upon inverting the estimated error covariance $\hat\Sigma$, so we need to estimate $\Sigma$. An asymptotically efficient procedure is the following (supposing normality of the errors):

(1) Estimate each equation by OLS.
(2) Estimate $\Sigma_0$ using
\[ \hat\Sigma_0 = \frac{1}{n}\sum_{t=1}^{n}\hat\varepsilon_t\hat\varepsilon_t'. \]
(3) Next we need to account for the singularity of $\Sigma_0$. It can be shown that $\hat\Sigma_0$ will be singular when the shares sum to one, so FGLS won't work. The solution is to drop one of the share equations, for example the second. The model becomes
\[ \begin{bmatrix} \ln c \\ s_1 \end{bmatrix} = \begin{bmatrix} 1 & x_1 & x_2 & z & \frac{x_1^2}{2} & \frac{x_2^2}{2} & \frac{z^2}{2} & x_1 x_2 & x_1 z & x_2 z \\ 0 & 1 & 0 & 0 & x_1 & 0 & 0 & x_2 & z & 0 \end{bmatrix}\theta + \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \end{bmatrix}, \]
or, in matrix notation for one observation,
\[ y_t^* = X_t^*\theta + \varepsilon_t^*. \]

In stacked notation for all observations we have the $2n$ observations
\[ \begin{bmatrix} y_1^* \\ y_2^* \\ \vdots \\ y_n^* \end{bmatrix} = \begin{bmatrix} X_1^* \\ X_2^* \\ \vdots \\ X_n^* \end{bmatrix}\theta + \begin{bmatrix} \varepsilon_1^* \\ \varepsilon_2^* \\ \vdots \\ \varepsilon_n^* \end{bmatrix}, \]
or, finally, in matrix notation for all observations:
\[ y^* = X^*\theta + \varepsilon^*. \]
Considering the error covariance, we can define
\[ \Sigma_0^* = \operatorname{Var}(\varepsilon_t^*), \qquad \Sigma^* = I_n \otimes \Sigma_0^*. \]
Define $\hat\Sigma_0^*$ as the leading $2\times 2$ block of $\hat\Sigma_0$, and form
\[ \hat\Sigma^* = I_n \otimes \hat\Sigma_0^*. \]
This is a consistent estimator, following the consistency of OLS and applying a LLN.
(4) Next compute the Cholesky factorization
\[ \hat P_0 = \operatorname{Chol}\left[(\hat\Sigma_0^*)^{-1}\right], \]
so that $\hat P_0\hat P_0' = (\hat\Sigma_0^*)^{-1}$, and the corresponding factorization of the inverse of the overall covariance matrix of the 2 equation model, which can be calculated as
\[ \hat P = I_n \otimes \hat P_0. \]

(5) Finally, the FGLS estimator can be calculated by applying OLS to the transformed model
\[ \hat P' y^* = \hat P' X^*\theta + \hat P'\varepsilon^*, \]
or by directly using the GLS formula
\[ \hat\theta_{FGLS} = \left[X^{*\prime}(\hat\Sigma^*)^{-1}X^*\right]^{-1}X^{*\prime}(\hat\Sigma^*)^{-1}y^*. \]
It is equivalent to transform each observation individually,
\[ \hat P_0' y_t^* = \hat P_0' X_t^*\theta + \hat P_0'\varepsilon_t^*, \]
and then apply OLS. This is probably the simplest approach.

A few last comments.
(1) We have assumed no autocorrelation across time. This is clearly restrictive. It is relatively simple to relax this, but we won't go into it here.
(2) Also, we have only imposed symmetry of the second derivatives. Another restriction that the model should satisfy is that the estimated shares should sum to 1. This can be accomplished by imposing
\[ \beta_1 + \beta_2 = 1, \qquad \gamma_{11} + \gamma_{12} = 0, \qquad \gamma_{12} + \gamma_{22} = 0, \qquad \gamma_{13} + \gamma_{23} = 0. \]
These are linear parameter restrictions, so they are easy to impose and will improve efficiency if they are true.

(3) The estimation procedure outlined above can be iterated. That is, estimate $\hat\theta_{FGLS}$ as above, then re-estimate $\Sigma_0^*$ using errors calculated as
\[ \hat\varepsilon^* = y^* - X^*\hat\theta_{FGLS}. \]
These might be expected to lead to a better estimate than the estimator based on $\hat\theta_{OLS}$, since FGLS is asymptotically more efficient. Then re-estimate $\theta$ using the new estimated error covariance. It can be shown that if this is repeated until the estimates don't change (i.e., iterated to convergence) then the resulting estimator is the MLE. At any rate, the asymptotic properties of the iterated and uniterated estimators are the same, since both are based upon a consistent estimator of the error covariance.

10.2. Testing nonnested hypotheses

Given that the choice of functional form isn't perfectly clear, in that many possibilities exist, how can one choose between forms? When one form is a parametric restriction of another, the previously studied tests, such as the Wald, LR and score tests, are all possibilities. For example, the Cobb-Douglas model is a parametric restriction of the translog: the translog is
\[ y_t = \alpha + x_t'\beta + \frac{x_t'\Gamma x_t}{2} + \varepsilon_t, \]
where the variables are in logarithms, while the Cobb-Douglas is
\[ y_t = \alpha + x_t'\beta + \varepsilon_t, \]
so a test of the Cobb-Douglas versus the translog is simply a test that $\Gamma = 0$.

The situation is more complicated when we want to test non-nested hypotheses. If the two functional forms are linear in the parameters, and use the same transformation of the dependent variable, then they may be written as
\[ M_1:\ y = X\beta + \varepsilon, \qquad \varepsilon_t \sim \text{iid}(0, \sigma_\varepsilon^2) \]
\[ M_2:\ y = Z\gamma + \eta, \qquad \eta_t \sim \text{iid}(0, \sigma_\eta^2). \]
We wish to test hypotheses of the form: $H_0$: $M_i$ is correctly specified, versus $H_A$: $M_i$ is misspecified, for $i = 1, 2$. One could account for non-iid errors, but we'll suppress this for simplicity.

There are a number of ways to proceed. We'll consider the J test, proposed by Davidson and MacKinnon, Econometrica (1981). The idea is to artificially nest the two models, e.g.,
\[ y = (1 - \alpha)X\beta + \alpha(Z\gamma) + \omega. \]
If the first model is correctly specified, then the true value of $\alpha$ is zero. On the other hand, if the second model is correctly specified, then $\alpha = 1$.

The problem is that this model is not identified in general. For example, if the models share some regressors, as in
\[ M_1:\ y_t = \beta_1 + \beta_2 x_{2t} + \beta_3 x_{3t} + \varepsilon_t \]
\[ M_2:\ y_t = \gamma_1 + \gamma_2 x_{2t} + \gamma_3 x_{4t} + \eta_t, \]

then the composite model is
\[ y_t = (1-\alpha)\beta_1 + (1-\alpha)\beta_2 x_{2t} + (1-\alpha)\beta_3 x_{3t} + \alpha\gamma_1 + \alpha\gamma_2 x_{2t} + \alpha\gamma_3 x_{4t} + \omega_t. \]
Combining terms we get
\[ y_t = \left[(1-\alpha)\beta_1 + \alpha\gamma_1\right] + \left[(1-\alpha)\beta_2 + \alpha\gamma_2\right]x_{2t} + (1-\alpha)\beta_3 x_{3t} + \alpha\gamma_3 x_{4t} + \omega_t = \delta_1 + \delta_2 x_{2t} + \delta_3 x_{3t} + \delta_4 x_{4t} + \omega_t. \]
The four $\delta$'s are consistently estimable, but $\alpha$ is not, since we have four equations in 7 unknowns, so one can't test the hypothesis that $\alpha = 0$.

The idea of the J test is to substitute $\hat\gamma$ in place of $\gamma$. This is a consistent estimator supposing that the second model is correctly specified, and it will tend to a finite probability limit even if the second model is misspecified. Then estimate the model
\[ y = (1-\alpha)X\beta + \alpha(Z\hat\gamma) + \omega = X\theta + \alpha\hat y + \omega, \]
where $\hat y = Z(Z'Z)^{-1}Z'y$. In this model, $\alpha$ is consistently estimable, and one can show that, under the hypothesis that the first model is correct, $\hat\alpha$ tends in probability to zero and the ordinary $t$-statistic for $\alpha = 0$ is asymptotically standard normal:
\[ t = \frac{\hat\alpha}{\hat\sigma_{\hat\alpha}} \overset{a}{\sim} N(0, 1). \]
If the second model is correctly specified, then $\hat\alpha$ tends in probability to 1, while its estimated standard error tends to zero. Thus the test will always reject the false null model, asymptotically, since the statistic will eventually exceed any critical value with probability one.
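A minimal numpy sketch of the J test (the simulated design and all variable names are illustrative assumptions): here the data are generated by $M_2$, so testing the false model $M_1$ against $M_2$ should produce a large statistic.

```python
import numpy as np

# J test sketch: fit M2, form yhat2 = Z (Z'Z)^{-1} Z' y, add it as a
# regressor to M1, and use the t-statistic on its coefficient alpha.
rng = np.random.default_rng(4)
n = 2_000
x2, x3, x4 = rng.normal(size=(3, n))
y = 1.0 + 0.5 * x2 + 0.5 * x4 + rng.normal(size=n)   # M2 is the true model

X = np.column_stack([np.ones(n), x2, x3])            # M1 regressors (false)
Z = np.column_stack([np.ones(n), x2, x4])            # M2 regressors (true)
yhat2 = Z @ np.linalg.lstsq(Z, y, rcond=None)[0]

XA = np.column_stack([X, yhat2])                     # artificial nesting
b = np.linalg.lstsq(XA, y, rcond=None)[0]
u = y - XA @ b
s2 = u @ u / (n - XA.shape[1])
V = s2 * np.linalg.inv(XA.T @ XA)
t_alpha = b[-1] / np.sqrt(V[-1, -1])                 # ~ N(0,1) if M1 is true

print(t_alpha)   # large here, since M1 is misspecified
```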

We can reverse the roles of the models, testing the second against the first. It may be the case that neither model is correctly specified. In this case, the test will still reject the null hypothesis, asymptotically, if we use critical values from the $N(0, 1)$ distribution, since as long as $\hat\alpha$ tends to something different from zero, $|t|$ diverges. Of course, when we switch the roles of the models, the other model will also be rejected asymptotically.

In summary, there are 4 possible outcomes when we test two models, each against the other: both may be rejected, neither may be rejected, or one of the two may be rejected.

There are other tests available for non-nested models. The J test is simple to apply when both models are linear in the parameters. The P test is similar, but easier to apply when the models are nonlinear.

The above presentation assumes that the same transformation of the dependent variable is used by both models. MacKinnon, White and Davidson, Journal of Econometrics (1983), shows how to deal with the case of different transformations.

Monte Carlo evidence shows that these tests often over-reject a correctly specified model. One can use bootstrap critical values to get better-performing tests.

CHAPTER 11

Exogeneity and simultaneity

Several times we've encountered cases where correlation between regressors and the error term leads to biasedness and inconsistency of the OLS estimator. Cases include autocorrelation with lagged dependent variables and measurement error in the regressors. Another important case is that of simultaneous equations. The cause is different, but the effect is the same.

11.1. Simultaneous equations

Up until now, our model is
\[ y = X\beta + \varepsilon, \]
where, for purposes of estimation, we can treat $X$ as fixed. This means that when estimating $\beta$ we condition on $X$. When analyzing dynamic models, we're not interested in conditioning on $X$, as we saw in the section on stochastic regressors. Nevertheless, the OLS estimator obtained by treating $X$ as fixed continues to have desirable asymptotic properties even in that case.

Simultaneous equations is a different prospect. An example of a simultaneous equation system is a simple supply-demand system:
\[ \text{Demand:}\quad q_t = \alpha_1 + \alpha_2 p_t + \alpha_3 y_t + \varepsilon_{1t} \]
\[ \text{Supply:}\quad q_t = \beta_1 + \beta_2 p_t + \varepsilon_{2t} \]
\[ E\left(\begin{bmatrix} \varepsilon_{1t} \\ \varepsilon_{2t} \end{bmatrix}\begin{bmatrix} \varepsilon_{1t} & \varepsilon_{2t} \end{bmatrix}\right) = \begin{bmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{12} & \sigma_{22} \end{bmatrix} \equiv \Sigma,\ \forall t. \]
The presumption is that $q_t$ and $p_t$ are jointly determined at the same time by the intersection of these equations. We'll assume that $y_t$ is determined by some unrelated process. It's easy to see that we have correlation between regressors and errors. Solving for $p_t$:
\[ \alpha_1 + \alpha_2 p_t + \alpha_3 y_t + \varepsilon_{1t} = \beta_1 + \beta_2 p_t + \varepsilon_{2t} \]
\[ p_t = \frac{\beta_1 - \alpha_1}{\alpha_2 - \beta_2} - \frac{\alpha_3 y_t}{\alpha_2 - \beta_2} + \frac{\varepsilon_{2t} - \varepsilon_{1t}}{\alpha_2 - \beta_2}. \]
Now consider whether $p_t$ is uncorrelated with $\varepsilon_{1t}$:
\[ E(p_t\varepsilon_{1t}) = E\left[\frac{(\varepsilon_{2t} - \varepsilon_{1t})\varepsilon_{1t}}{\alpha_2 - \beta_2}\right] = \frac{\sigma_{12} - \sigma_{11}}{\alpha_2 - \beta_2}, \]
which is nonzero in general. Because of this correlation, OLS estimation of the demand equation will be biased and inconsistent. The same applies to the supply equation, for the same reason.
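A short simulation (parameter values are assumptions chosen for illustration) shows the simultaneity bias: OLS on the supply equation does not converge to the true supply slope, because equilibrium price is correlated with the supply error.

```python
import numpy as np

# Simulate the supply-demand system above and regress q on p (supply eq.).
rng = np.random.default_rng(5)
n = 50_000
a1, a2, a3 = 10.0, -1.0, 1.0     # demand: q = a1 + a2 p + a3 y + eps1
b1, b2 = 1.0, 1.0                # supply: q = b1 + b2 p + eps2
y_inc = rng.normal(0.0, 2.0, n)  # exogenous income
e1 = rng.normal(0.0, 1.0, n)
e2 = rng.normal(0.0, 1.0, n)

# equilibrium: equate the two equations and solve for p, then q
p = (a1 - b1 + a3 * y_inc + e1 - e2) / (b2 - a2)
q = b1 + b2 * p + e2

Z = np.column_stack([np.ones(n), p])
b_ols = np.linalg.lstsq(Z, q, rcond=None)[0]

print(b_ols[1])   # does not converge to the true supply slope b2 = 1
```

With these values, $\operatorname{Cov}(p_t, \varepsilon_{2t}) = -\sigma_{22}/2$ and $\operatorname{Var}(p_t) = 3/2$, so the OLS slope converges to $1 - (1/2)/(3/2) = 2/3$ rather than to 1.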

In this model, $q_t$ and $p_t$ are the endogenous variables (endogs), that are determined within the system. $y_t$ is an exogenous variable (exog). These concepts are a bit tricky, and we'll return to them in a minute. First, some notation. Suppose we group together current endogs in the vector $Y_t$. If there are $G$ endogs, $Y_t$ is $G\times 1$. Group current and lagged exogs, as well as lagged endogs, in the vector $X_t$, which is $K\times 1$. Stack the errors of the $G$ equations into the error vector $E_t$. The model, with additional assumptions, can be written as
\[ Y_t'\Gamma = X_t'B + E_t' \]
\[ E_t \sim N(0, \Sigma),\ \forall t, \qquad E(E_t E_s') = 0,\ t \neq s. \]
We can stack all $n$ observations and write the model as
\[ Y\Gamma = XB + E, \]
where
\[ Y = \begin{bmatrix} Y_1' \\ Y_2' \\ \vdots \\ Y_n' \end{bmatrix}, \qquad X = \begin{bmatrix} X_1' \\ X_2' \\ \vdots \\ X_n' \end{bmatrix}, \qquad E = \begin{bmatrix} E_1' \\ E_2' \\ \vdots \\ E_n' \end{bmatrix}, \]
with $E(X'E) = 0_{K\times G}$ and $\operatorname{vec}(E) \sim N(0, \Psi)$. Here $Y$ is $n\times G$, $X$ is $n\times K$, and $E$ is $n\times G$. This system is complete, in that there are as many equations as endogs. There is a normality assumption. This isn't necessary, but it allows us to consider the relationship between least squares and ML estimators.

Since there is no autocorrelation of the $E_t$'s, and since the columns of $E$ are individually homoscedastic,
\[ \Psi = \begin{bmatrix} \sigma_{11}I_n & \sigma_{12}I_n & \cdots & \sigma_{1G}I_n \\ & \sigma_{22}I_n & & \vdots \\ & & \ddots & \\ \cdot & & & \sigma_{GG}I_n \end{bmatrix} = \Sigma \otimes I_n. \]
$X$ may contain lagged endogenous and exogenous variables. These variables are predetermined. We need to define what is meant by "endogenous" and "exogenous" when classifying the current period variables.

11.2. Exogeneity

The model defines a data generating process. The model involves two sets of variables, $Y_t$ and $X_t$, as well as a parameter vector
\[ \theta = \left[\operatorname{vec}(\Gamma)'\ \operatorname{vec}(B)'\ \operatorname{vec}^*(\Sigma)'\right]', \]
where $\operatorname{vec}^*(\Sigma)$ collects the unique elements of $\Sigma$. In general, without additional restrictions, $\theta$ is a $G^2 + GK + (G^2 - G)/2 + G$ dimensional vector. This is the parameter vector that we're interested in estimating.

In principle, there exists a joint density function for $Y_t$ and $X_t$, which depends on a parameter vector $\phi$. Write this density as
\[ f_t(Y_t, X_t \mid \phi, I_t), \]

where $I_t$ is the information set in period $t$. This includes lagged $Y_t$'s and lagged $X_t$'s, of course. The joint density can be factored into the density of $Y_t$ conditional on $X_t$ times the marginal density of $X_t$:
\[ f_t(Y_t, X_t \mid \phi, I_t) = f_t(Y_t \mid X_t, \phi, I_t)\, f_t(X_t \mid \phi, I_t). \]
This is a general factorization, but it may very well be the case that not all parameters in $\phi$ affect both factors. So use $\phi_1$ to indicate elements of $\phi$ that enter into the conditional density, and write $\phi_2$ for parameters that enter into the marginal. In general, $\phi_1$ and $\phi_2$ may share elements, of course. Recall that the model is
\[ Y_t'\Gamma = X_t'B + E_t', \qquad E_t \sim N(0, \Sigma),\ \forall t, \qquad E(E_t E_s') = 0,\ t \neq s. \]
Normality and lack of correlation over time imply that the observations are independent of one another, so we can write the log-likelihood function as the sum of likelihood contributions of each observation:

\[ \ln L(Y \mid \theta, I_t) = \sum_{t=1}^{n}\ln f_t(Y_t, X_t \mid \phi, I_t) = \sum_{t=1}^{n}\ln\left[f_t(Y_t \mid X_t, \phi, I_t)\, f_t(X_t \mid \phi, I_t)\right] = \sum_{t=1}^{n}\ln f_t(Y_t \mid X_t, \phi_1, I_t) + \sum_{t=1}^{n}\ln f_t(X_t \mid \phi_2, I_t). \]

DEFINITION 15 (Weak Exogeneity). $X_t$ is weakly exogenous for $\theta$ (the original parameter vector) if there is a mapping from $\phi$ to $\theta$ that is invariant to $\phi_2$. More formally, for an arbitrary $(\phi_1, \phi_2)$, $\theta(\phi) = \theta(\phi_1)$.

This implies that $\phi_1$ and $\phi_2$ cannot share elements if $X_t$ is weakly exogenous, since $\phi_1$ would change as $\phi_2$ changes, which prevents consideration of arbitrary combinations of $(\phi_1, \phi_2)$.

Supposing that $X_t$ is weakly exogenous, the MLE of $\phi_1$ using the joint density is the same as the MLE using only the conditional density
\[ \ln L(Y \mid X, \theta, I_t) = \sum_{t=1}^{n}\ln f_t(Y_t \mid X_t, \phi_1, I_t), \]
since the conditional likelihood doesn't depend on $\phi_2$. In other words, the joint and conditional log-likelihoods maximize at the same value of $\phi_1$.

With weak exogeneity, knowledge of the DGP of $X_t$ is irrelevant for inference on $\phi_1$, and knowledge of $\phi_1$ is sufficient to recover the parameter of interest, $\theta$. Since the DGP of $X_t$ is irrelevant, we can treat $X_t$ as fixed in inference. By the invariance property of MLE, the MLE of $\theta$ is $\theta(\hat\phi_1)$, and this mapping is assumed to exist in the definition of weak exogeneity.

Of course, we'll need to figure out just what this mapping is to recover $\hat\theta$ from $\hat\phi_1$. This is the famous identification problem.

With lack of weak exogeneity, the joint and conditional likelihood functions maximize in different places. For this reason, we can't treat $X_t$ as fixed in inference. The joint MLE is valid, but the conditional MLE is not.

In summary, we require the variables in $X_t$ to be weakly exogenous if we are to be able to treat them as fixed in estimation. Lagged $Y_t$ satisfy the definition, since they are in the conditioning information set, e.g., $Y_{t-1} \in I_t$. Lagged $Y_t$ aren't exogenous in the normal usage of the word, since their values are determined within the model, just earlier on. Weakly exogenous variables include exogenous (in the normal sense) variables as well as all predetermined variables.

11.3. Reduced form

Recall that the model is
\[ Y_t'\Gamma = X_t'B + E_t', \qquad \operatorname{Var}(E_t) = \Sigma. \]
This is the model in structural form.

DEFINITION 16 (Structural form). An equation is in structural form when more than one current period endogenous variable is included.

The solution for the current period endogs is easy to find. It is
\[ Y_t' = X_t'B\Gamma^{-1} + E_t'\Gamma^{-1} = X_t'\Pi + V_t'. \]
Now only one current period endog appears in each equation. This is the reduced form.

DEFINITION 17 (Reduced form). An equation is in reduced form if only one current period endog is included.

An example is our supply/demand system. The reduced form for quantity is obtained by solving the supply equation for price and substituting into demand:
\[ q_t = \alpha_1 + \alpha_2\frac{q_t - \beta_1 - \varepsilon_{2t}}{\beta_2} + \alpha_3 y_t + \varepsilon_{1t} \]
\[ \beta_2 q_t - \alpha_2 q_t = \beta_2\alpha_1 - \alpha_2(\beta_1 + \varepsilon_{2t}) + \beta_2\alpha_3 y_t + \beta_2\varepsilon_{1t} \]
\[ q_t = \frac{\beta_2\alpha_1 - \alpha_2\beta_1}{\beta_2 - \alpha_2} + \frac{\beta_2\alpha_3}{\beta_2 - \alpha_2}y_t + \frac{\beta_2\varepsilon_{1t} - \alpha_2\varepsilon_{2t}}{\beta_2 - \alpha_2} = \pi_{11} + \pi_{21}y_t + V_{1t}. \]
Similarly, the reduced form for price is obtained by equating supply and demand and solving for $p_t$:
\[ \beta_1 + \beta_2 p_t + \varepsilon_{2t} = \alpha_1 + \alpha_2 p_t + \alpha_3 y_t + \varepsilon_{1t} \]
\[ p_t = \frac{\alpha_1 - \beta_1}{\beta_2 - \alpha_2} + \frac{\alpha_3}{\beta_2 - \alpha_2}y_t + \frac{\varepsilon_{1t} - \varepsilon_{2t}}{\beta_2 - \alpha_2} = \pi_{12} + \pi_{22}y_t + V_{2t}. \]

The interesting thing about the rf is that the equations individually satisfy the classical assumptions, since $y_t$ is uncorrelated with $\varepsilon_{1t}$ and $\varepsilon_{2t}$ by assumption, and therefore $E(y_t V_{it}) = 0$, $i = 1, 2$, $\forall t$. The errors of the rf are
\[ V_{1t} = \frac{\beta_2\varepsilon_{1t} - \alpha_2\varepsilon_{2t}}{\beta_2 - \alpha_2}, \qquad V_{2t} = \frac{\varepsilon_{1t} - \varepsilon_{2t}}{\beta_2 - \alpha_2}. \]
The variance of $V_{1t}$ is
\[ \operatorname{Var}(V_{1t}) = E\left[\left(\frac{\beta_2\varepsilon_{1t} - \alpha_2\varepsilon_{2t}}{\beta_2 - \alpha_2}\right)^2\right] = \frac{\beta_2^2\sigma_{11} - 2\alpha_2\beta_2\sigma_{12} + \alpha_2^2\sigma_{22}}{(\beta_2 - \alpha_2)^2}. \]
This is constant over time, so the first rf equation is homoscedastic. Likewise, since the $\varepsilon_t$ are independent over time, so are the $V_t$. The variance of the second rf error is
\[ \operatorname{Var}(V_{2t}) = E\left[\left(\frac{\varepsilon_{1t} - \varepsilon_{2t}}{\beta_2 - \alpha_2}\right)^2\right] = \frac{\sigma_{11} - 2\sigma_{12} + \sigma_{22}}{(\beta_2 - \alpha_2)^2}, \]
and the contemporaneous covariance of the errors across equations is
\[ E(V_{1t}V_{2t}) = \frac{\beta_2\sigma_{11} - (\alpha_2 + \beta_2)\sigma_{12} + \alpha_2\sigma_{22}}{(\beta_2 - \alpha_2)^2}. \]
In summary, the rf equations individually satisfy the classical assumptions, under the assumptions we've made, but they are contemporaneously correlated.
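These moment formulas are easy to verify by simulation; the following check (parameter values are assumptions) compares the sample variances of the rf errors against the analytic expressions above.

```python
import numpy as np

# Monte Carlo check of the reduced-form error variances:
# V1 = (b2*e1 - a2*e2)/(b2 - a2), V2 = (e1 - e2)/(b2 - a2),
# with Var(e1) = s11, Var(e2) = s22, Cov(e1, e2) = s12.
rng = np.random.default_rng(7)
n = 200_000
a2, b2 = -1.0, 1.0
s11, s12, s22 = 1.0, 0.3, 2.0
L = np.linalg.cholesky(np.array([[s11, s12], [s12, s22]]))
e = rng.normal(size=(n, 2)) @ L.T
e1, e2 = e[:, 0], e[:, 1]

V1 = (b2 * e1 - a2 * e2) / (b2 - a2)
V2 = (e1 - e2) / (b2 - a2)

var_V1 = (b2**2 * s11 - 2 * a2 * b2 * s12 + a2**2 * s22) / (b2 - a2) ** 2
var_V2 = (s11 - 2 * s12 + s22) / (b2 - a2) ** 2

print(V1.var(), var_V1)   # simulated vs. analytic
print(V2.var(), var_V2)
```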

The general form of the rf is
\[ Y_t' = X_t'B\Gamma^{-1} + E_t'\Gamma^{-1} = X_t'\Pi + V_t', \]
so we have that
\[ V_t = (\Gamma^{-1})'E_t \sim N\left(0, (\Gamma^{-1})'\Sigma\Gamma^{-1}\right),\ \forall t, \]
and that the $V_t$ are timewise independent (note that this wouldn't be the case if the $E_t$ were autocorrelated).

11.4. IV estimation

The IV estimator may appear a bit unusual at first, but it will grow on you over time. The simultaneous equations model is
\[ Y\Gamma = XB + E. \]
Considering the first equation (this is without loss of generality, since we can always reorder the equations), we can partition the $Y$ matrix as
\[ Y = \begin{bmatrix} y & Y_1 & Y_2 \end{bmatrix}, \]
where $y$ is the first column, $Y_1$ are the other endogenous variables that enter the first equation, and $Y_2$ are endogs that are excluded from this equation. Similarly, partition $X$ as
\[ X = \begin{bmatrix} X_1 & X_2 \end{bmatrix}, \]

where $X_1$ are the included exogs, and $X_2$ are the excluded exogs. Finally, partition the error matrix as
\[ E = \begin{bmatrix} \varepsilon & E_{12} \end{bmatrix}. \]
Assume that $\Gamma$ has ones on the main diagonal. These are normalization restrictions that simply scale the remaining coefficients on each equation, and which scale the variances of the error terms. Given this scaling and our partitioning, the coefficient matrices can be written as
\[ \Gamma = \begin{bmatrix} 1 & \Gamma_{12} \\ -\gamma_1 & \Gamma_{22} \\ 0 & \Gamma_{32} \end{bmatrix}, \qquad B = \begin{bmatrix} \beta_1 & B_{12} \\ 0 & B_{22} \end{bmatrix}. \]
With this, the first equation can be written as
\[ y = Y_1\gamma_1 + X_1\beta_1 + \varepsilon = Z\delta + \varepsilon. \]
The problem, as we've seen, is that $Z$ is correlated with $\varepsilon$, since $Y_1$ is formed of endogs.

Now, let's consider the general problem of a linear regression model with correlation between regressors and the error term:
\[ y = X\beta + \varepsilon, \qquad \varepsilon \sim \text{iid}(0, I_n\sigma^2), \qquad E(X'\varepsilon) \neq 0. \]
The present case of a structural equation from a system of equations fits into this notation, but so do other problems, such as measurement error or lagged dependent variables with autocorrelated errors. Consider some matrix $W$ which is formed of variables uncorrelated with $\varepsilon$. This matrix defines a projection matrix
\[ P_W = W(W'W)^{-1}W', \]
so that anything that is projected onto the space spanned by $W$ will be uncorrelated with $\varepsilon$, by the definition of $W$. Transforming the model with this projection matrix, we get
\[ P_W y = P_W X\beta + P_W\varepsilon, \]
or
\[ y^* = X^*\beta + \varepsilon^*. \]
Now we have that $\varepsilon^*$ and $X^*$ are uncorrelated, since this is simply
\[ E(X^{*\prime}\varepsilon^*) = E(X'P_W'P_W\varepsilon) = E(X'P_W\varepsilon), \]
and
\[ P_W X = W(W'W)^{-1}W'X \]
is the fitted value from a regression of $X$ on $W$. This is a linear combination of the columns of $W$, so it must be uncorrelated with $\varepsilon$. This implies that applying OLS to the model
\[ y^* = X^*\beta + \varepsilon^* \]
will lead to a consistent estimator, given a few more assumptions. This is the generalized instrumental variables estimator. $W$ is known as the matrix of instruments.

and
\[ P_W X = W(W'W)^{-1}W'X \]
is the fitted value from a regression of $X$ on $W$. This is a linear combination of the columns of $W$, so it must be uncorrelated with $\varepsilon$. This implies that applying OLS to the model
\[ P_W y = P_W X\beta + P_W\varepsilon \]
will lead to a consistent estimator, given a few more assumptions. This is the generalized instrumental variables estimator. $W$ is known as the matrix of instruments. The estimator is
\[ \hat{\beta}_{IV} = (X'P_W X)^{-1}X'P_W y, \]
from which we obtain
\[ \hat{\beta}_{IV} = (X'P_W X)^{-1}X'P_W (X\beta + \varepsilon), \]
so
\[ \hat{\beta}_{IV} - \beta = (X'P_W X)^{-1}X'P_W\varepsilon
= \left[X'W(W'W)^{-1}W'X\right]^{-1} X'W(W'W)^{-1}W'\varepsilon. \]
Now we can introduce factors of $n$ to get
\[ \hat{\beta}_{IV} - \beta =
\left[\left(\frac{X'W}{n}\right)\left(\frac{W'W}{n}\right)^{-1}\left(\frac{W'X}{n}\right)\right]^{-1}
\left(\frac{X'W}{n}\right)\left(\frac{W'W}{n}\right)^{-1}\frac{W'\varepsilon}{n}. \]
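As a quick numerical sketch of the estimator just defined (this simulation is not from the notes; the data-generating process, in which a common shock u creates the correlation between the regressor and the error, is purely hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical d.g.p.: the shock u makes x correlated with eps,
# while the instrument w moves x but is independent of eps.
u = rng.standard_normal(n)
w = rng.standard_normal(n)
x = 0.8 * w + u + 0.3 * rng.standard_normal(n)
eps = u + 0.5 * rng.standard_normal(n)
beta_true = 2.0
y = beta_true * x + eps

X = x.reshape(-1, 1)
W = w.reshape(-1, 1)

def iv(y, X, W):
    """Generalized IV: (X' P_W X)^{-1} X' P_W y with P_W = W (W'W)^{-1} W'."""
    PWX = W @ np.linalg.solve(W.T @ W, W.T @ X)
    return np.linalg.solve(PWX.T @ X, PWX.T @ y)

beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
beta_iv = iv(y, X, W)
print(beta_ols[0], beta_iv[0])  # OLS is pulled away from 2.0; IV is close to it
```

Here OLS converges to roughly $\beta + \operatorname{cov}(x,\varepsilon)/\operatorname{var}(x) \approx 2.58$, while the IV estimate is close to the true value of 2.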

Assuming that each of the terms with an $n$ in the denominator satisfies a LLN, so that

- $\frac{W'W}{n} \xrightarrow{p} Q_{WW}$, a finite pd matrix
- $\frac{X'W}{n} \xrightarrow{p} Q_{XW}$, a finite matrix with rank $K$ (= cols$(X)$)
- $\frac{W'\varepsilon}{n} \xrightarrow{p} 0,$

then the plim of the rhs is zero. This last term has plim 0 since we assume that $W$ and $\varepsilon$ are uncorrelated, e.g.,
\[ \mathcal{E}\left(W_t'\varepsilon_t\right) = 0. \]
Given these assumptions the IV estimator is consistent:
\[ \hat{\beta}_{IV} \xrightarrow{p} \beta. \]
Furthermore, scaling by $\sqrt{n}$, we have
\[ \sqrt{n}\left(\hat{\beta}_{IV} - \beta\right) =
\left[\left(\frac{X'W}{n}\right)\left(\frac{W'W}{n}\right)^{-1}\left(\frac{W'X}{n}\right)\right]^{-1}
\left(\frac{X'W}{n}\right)\left(\frac{W'W}{n}\right)^{-1}\frac{W'\varepsilon}{\sqrt{n}}. \]
Assuming that the far right term satisfies a CLT, so that
\[ \frac{W'\varepsilon}{\sqrt{n}} \xrightarrow{d} N\left(0, Q_{WW}\sigma^2\right), \]
then we get
\[ \sqrt{n}\left(\hat{\beta}_{IV} - \beta\right) \xrightarrow{d}
N\left(0, \left(Q_{XW}Q_{WW}^{-1}Q_{XW}'\right)^{-1}\sigma^2\right). \]
The estimators for $Q_{XW}$ and $Q_{WW}$ are the obvious ones. An estimator for $\sigma^2$ is
\[ \widehat{\sigma^2_{IV}} = \frac{1}{n}\left(y - X\hat{\beta}_{IV}\right)'\left(y - X\hat{\beta}_{IV}\right). \]
This estimator is consistent following the proof of consistency of the OLS estimator of $\sigma^2$ when the classical assumptions hold.
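The consistency and asymptotic-normality claims can be checked by simulation. The sketch below (a hypothetical single-instrument d.g.p., not from the notes) draws the finite-sample distribution of $\sqrt{n}(\hat{\beta}_{IV} - \beta)$ and compares its spread to the theoretical variance $\sigma^2 Q_{WW}/Q_{XW}^2 = 1.25$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps, beta = 200, 2000, 1.0

draws = np.empty(reps)
for r in range(reps):
    u = rng.standard_normal(n)
    w = rng.standard_normal(n)              # instrument, independent of eps
    x = w + u                               # Q_XW = cov(w, x) = 1
    eps = 0.5 * u + rng.standard_normal(n)  # sigma^2 = 0.25 + 1 = 1.25
    y = beta * x + eps
    b_iv = np.dot(w, y) / np.dot(w, x)      # single-instrument IV estimator
    draws[r] = np.sqrt(n) * (b_iv - beta)

# Limit distribution: N(0, 1.25), i.e., standard deviation about 1.118.
print(draws.mean(), draws.std())
```

The simulated mean should be near zero and the simulated standard deviation near $\sqrt{1.25} \approx 1.118$.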

The formula used to estimate the variance of $\hat{\beta}_{IV}$ is
\[ \hat{V}\left(\hat{\beta}_{IV}\right) = \left[(X'W)(W'W)^{-1}(W'X)\right]^{-1}\widehat{\sigma^2_{IV}}. \]
The IV estimator is

(1) Consistent
(2) Asymptotically normally distributed
(3) Biased in general, since even though $\mathcal{E}(X'P_W\varepsilon) = 0$, $\mathcal{E}\left[(X'P_W X)^{-1}X'P_W\varepsilon\right]$ may not be zero, since $(X'P_W X)^{-1}$ and $X'P_W\varepsilon$ are not independent.

An important point is that the asymptotic distribution of $\hat{\beta}_{IV}$ depends upon $Q_{XW}$ and $Q_{WW}$, and these depend upon the choice of $W$. The choice of instruments influences the efficiency of the estimator.

- When we have two sets of instruments, $W_1$ and $W_2$, such that $W_1 \subset W_2$, then the IV estimator using $W_2$ is at least as efficient asymptotically as the estimator that used $W_1$. More instruments leads to more asymptotically efficient estimation, in general.
- There are special cases where there is no gain (simultaneous equations is an example of this, as we'll see).
- The penalty for indiscriminate use of instruments is that the small sample bias of the IV estimator rises as the number of instruments increases. The reason for this is that $P_W X$ becomes closer and closer to $X$ itself as the number of instruments increases.

IV estimation can clearly be used in the case of simultaneous equations. The only issue is which instruments to use.

11.5. Identification by exclusion restrictions

The identification problem in simultaneous equations is in fact of the same nature as the identification problem in any estimation setting: does the limiting objective function have the proper curvature so that there is a unique global minimum or maximum at the true parameter value? In the context of IV estimation, this is the case if the limiting covariance of the IV estimator is positive definite and
\[ \operatorname{plim} \frac{1}{n}W'\varepsilon = 0. \]
This matrix is
\[ V_\infty\left(\hat{\beta}_{IV}\right) = \left(Q_{XW}Q_{WW}^{-1}Q_{XW}'\right)^{-1}\sigma^2. \]

- The necessary and sufficient condition for identification is simply that this matrix be positive definite, and that the instruments be (asymptotically) uncorrelated with $\varepsilon$.
- For this matrix to be positive definite, we need that the conditions noted above hold: $Q_{WW}$ must be positive definite and $Q_{XW}$ must be of full rank ($K$).
- These identification conditions are not that intuitive nor is it very obvious how to check them.

11.5.1. Necessary conditions. If we use IV estimation for a single equation of the system, the equation can be written as
\[ y = Z\delta + \varepsilon, \]
where
\[ Z = \begin{bmatrix} Y_1 & X_1 \end{bmatrix}. \]
Notation:

- Let $K$ be the total number of weakly exogenous variables.

- Let $K^*$ be the number of included exogs, and let $K^{**} = K - K^*$ be the number of excluded exogs (in this equation).
- Let $G^*$ be the total number of included endogs, and let $G^{**} = G - G^*$ be the number of excluded endogs.

Using this notation, consider the selection of instruments.

- Now the $X_1$ are weakly exogenous and can serve as their own instruments.
- It turns out that $X$ exhausts the set of possible instruments, in that if the variables in $X$ don't lead to an identified model then no other instruments will identify the model either. Assuming this is true (we'll prove it in a moment), then a necessary condition for identification is
\[ \operatorname{cols}(X_2) \geq \operatorname{cols}(Y_1), \]
since if not then at least one instrument must be used twice, so $W$ will not have full column rank:
\[ \rho(W) < K^* + G^* - 1 \;\Rightarrow\; \rho(Q_{XW}) < K^* + G^* - 1. \]
This is the order condition for identification in a set of simultaneous equations. When the only identifying information is exclusion restrictions on the variables that enter an equation, then the number of excluded exogs must be greater than or equal to the number of included endogs, minus 1 (the normalized lhs endog), e.g.,
\[ K - K^* \geq G^* - 1. \]
- To show that this is in fact a necessary condition, consider some arbitrary set of instruments $W$. A necessary condition for identification is that
\[ \rho\left(\operatorname{plim}\frac{1}{n}W'Z\right) = K^* + G^* - 1. \]
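Before continuing with the proof, note that the order condition is purely a counting rule, so it is easy to mechanize. A small helper (hypothetical, not part of the notes) that classifies an equation by the counts above:

```python
# Hypothetical helper implementing the counting rule K** >= G* - 1.
def order_condition(n_excluded_exogs, n_included_endogs):
    """Classify an equation by the order condition alone."""
    surplus = n_excluded_exogs - (n_included_endogs - 1)
    if surplus < 0:
        return "not identified"
    if surplus == 0:
        return "exactly identified"
    return "overidentified"

# Klein's consumption equation (analyzed below): G* = 3, K** = 5.
print(order_condition(5, 3))  # overidentified
```

For Klein's consumption equation this reproduces the counting result derived later in the chapter, $K^{**} - (G^* - 1) = 3$ surplus restrictions.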

Recall that we've partitioned the model
\[ Y\Gamma = XB + E \]
as
\[ Y = \begin{bmatrix} y & Y_1 & Y_2 \end{bmatrix}, \qquad X = \begin{bmatrix} X_1 & X_2 \end{bmatrix}. \]
Given the reduced form
\[ Y = X\Pi + V, \]
we can write the reduced form using the same partition:
\[
\begin{bmatrix} y & Y_1 & Y_2 \end{bmatrix}
= \begin{bmatrix} X_1 & X_2 \end{bmatrix}
\begin{bmatrix} \pi_{11} & \Pi_{12} & \Pi_{13} \\ \pi_{21} & \Pi_{22} & \Pi_{23} \end{bmatrix}
+ \begin{bmatrix} v & V_1 & V_2 \end{bmatrix},
\]
so we have that
\[ Y_1 = X_1\Pi_{12} + X_2\Pi_{22} + V_1. \]
Because the $X$'s and $V$'s are uncorrelated, by assumption, the cross terms between $W$ and $V_1$ converge in probability to zero, so
\[ \operatorname{plim}\frac{1}{n}W'Y_1 = \operatorname{plim}\frac{1}{n}W'\left(X_1\Pi_{12} + X_2\Pi_{22}\right), \]
and therefore, as $n \to \infty$,
\[ \operatorname{plim}\frac{1}{n}W'Z = \operatorname{plim}\frac{1}{n}W'\begin{bmatrix} X_1\Pi_{12} + X_2\Pi_{22} & X_1 \end{bmatrix}. \]

Since the far rhs term is formed only of linear combinations of columns of $X$, the rank of this matrix can never be greater than $K$, regardless of the choice of instruments. If $Z$ has more than $K$ columns, then it is not of full column rank. When $Z$ has more than $K$ columns we have
\[ G^* - 1 + K^* > K \]
or, noting that $K^{**} = K - K^*$,
\[ G^* - 1 > K^{**}. \]
In this case, the limiting matrix is not of full column rank, and the identification condition fails.

11.5.2. Sufficient conditions. Identification essentially requires that the structural parameters be recoverable from the data. This won't be the case, in general, unless the structural model is subject to some restrictions. We've already identified necessary conditions. Turning to sufficient conditions (again, we're only considering identification through zero restrictions on the parameters, for the moment), the model is
\[ Y_t'\Gamma = X_t'B + E_t', \qquad V(E_t) = \Sigma. \]

This leads to the reduced form
\[ Y_t' = X_t'B\Gamma^{-1} + E_t'\Gamma^{-1} = X_t'\Pi + V_t', \qquad V(V_t) = \left(\Gamma^{-1}\right)'\Sigma\Gamma^{-1} = \Omega. \]
The reduced form parameters are consistently estimable, but none of them are known a priori, and there are no restrictions on their values. The problem is that more than one structural form has the same reduced form, so knowledge of the reduced form parameters alone isn't enough to determine the structural parameters. To see this, consider the model
\[ Y_t'\Gamma F = X_t'BF + E_t'F, \]
where $F$ is some arbitrary nonsingular matrix. The rf of this new model is
\begin{align*}
Y_t' &= X_t'BF(\Gamma F)^{-1} + E_t'F(\Gamma F)^{-1} \\
&= X_t'BFF^{-1}\Gamma^{-1} + E_t'FF^{-1}\Gamma^{-1} \\
&= X_t'B\Gamma^{-1} + E_t'\Gamma^{-1} \\
&= X_t'\Pi + V_t'.
\end{align*}

Likewise, the covariance of the rf of the transformed model is
\[ V\left(E_t'F(\Gamma F)^{-1}\right) = V\left(E_t'\Gamma^{-1}\right) = \Omega. \]
Since the two structural forms lead to the same rf, and the rf is all that is directly estimable, the models are said to be observationally equivalent. What we need for identification are restrictions on $\Gamma$ and $B$ such that the only admissible $F$ is an identity matrix (if all of the equations are to be identified). Take the coefficient matrices as partitioned before:
\[
\begin{bmatrix} \Gamma \\ B \end{bmatrix}
= \begin{bmatrix}
1 & \Gamma_{12} \\
-\gamma_1 & \Gamma_{22} \\
0 & \Gamma_{32} \\
\beta_1 & B_{12} \\
0 & B_{22}
\end{bmatrix}.
\]
The coefficients of the first equation of the transformed model are simply these coefficients multiplied by the first column of $F$. This gives
\[
\begin{bmatrix} \Gamma \\ B \end{bmatrix}
\begin{bmatrix} f_{11} \\ F_2 \end{bmatrix}
= \begin{bmatrix}
1 & \Gamma_{12} \\
-\gamma_1 & \Gamma_{22} \\
0 & \Gamma_{32} \\
\beta_1 & B_{12} \\
0 & B_{22}
\end{bmatrix}
\begin{bmatrix} f_{11} \\ F_2 \end{bmatrix}.
\]

For identification of the first equation we need that there be enough restrictions so that the only admissible $\begin{bmatrix} f_{11} \\ F_2 \end{bmatrix}$ be the leading column of an identity matrix, so that
\[
\begin{bmatrix}
1 & \Gamma_{12} \\
-\gamma_1 & \Gamma_{22} \\
0 & \Gamma_{32} \\
\beta_1 & B_{12} \\
0 & B_{22}
\end{bmatrix}
\begin{bmatrix} f_{11} \\ F_2 \end{bmatrix}
= \begin{bmatrix} 1 \\ -\gamma_1 \\ 0 \\ \beta_1 \\ 0 \end{bmatrix}.
\]
Note that the third and fifth rows are
\[ \begin{bmatrix} \Gamma_{32} \\ B_{22} \end{bmatrix} F_2 = \begin{bmatrix} 0 \\ 0 \end{bmatrix}. \]
Supposing that the leading matrix is of full column rank,
\[ \rho\left(\begin{bmatrix} \Gamma_{32} \\ B_{22} \end{bmatrix}\right)
= \operatorname{cols}\left(\begin{bmatrix} \Gamma_{32} \\ B_{22} \end{bmatrix}\right) = G - 1, \]
then the only way this can hold, without additional restrictions on the model's parameters, is if $F_2$ is a vector of zeros. Given that $F_2$ is a vector of zeros, the first row then gives
\[ 1 \cdot f_{11} + \Gamma_{12} \cdot 0 = 1 \;\Rightarrow\; f_{11} = 1. \]
Therefore, as long as
\[ \rho\left(\begin{bmatrix} \Gamma_{32} \\ B_{22} \end{bmatrix}\right) = G - 1, \]

the only admissible first column of $F$ is
\[ \begin{bmatrix} f_{11} \\ F_2 \end{bmatrix} = \begin{bmatrix} 1 \\ 0 \end{bmatrix}. \]
The first equation is identified in this case, so the condition is sufficient for identification. It is also necessary, since the condition implies that this submatrix must be of full column rank, so it must have at least $G - 1$ rows. Since this matrix has $G^{**} + K^{**}$ rows, we obtain
\[ G^{**} + K^{**} \geq G - 1 \]
or
\[ K^{**} \geq G - 1 - G^{**} = G^* - 1, \]
which is the previously derived necessary condition.

The above result is fairly intuitive (draw picture here). The necessary condition ensures that there are enough variables not in the equation of interest to potentially move the other equations, so as to trace out the equation of interest. The sufficient condition ensures that those other equations in fact do move around as the variables change their values. Some points:

- When an equation has $K^{**} = G^* - 1$, it is exactly identified, in that omission of an identifying restriction is not possible without losing consistency.
- When $K^{**} > G^* - 1$, the equation is overidentified, since one could drop a restriction and still retain consistency. Overidentifying restrictions are therefore testable. When an equation is overidentified we

have more instruments than are strictly necessary for consistent estimation. Since estimation by IV with more instruments is more efficient asymptotically, one should employ overidentifying restrictions if one is confident that they're true.

We can repeat this partition for each equation in the system, to see which equations are identified and which aren't. These results are valid assuming that the only identifying information comes from knowing which variables appear in which equations, e.g., by exclusion restrictions, and through the use of a normalization. There are other sorts of identifying information that can be used. These include

(1) Cross equation restrictions
(2) Additional restrictions on parameters within equations (as in the Klein model discussed below)
(3) Restrictions on the covariance matrix of the errors
(4) Nonlinearities in variables

When these sorts of information are available, the above conditions aren't necessary for identification, though they are of course still sufficient.

To give an example of how other information can be used, consider the model
\[ Y_t'\Gamma = X_t'B + E_t', \]
where $\Gamma$ is an upper triangular matrix with 1's on the main diagonal. This is a triangular system of equations. In this case, the first equation is
\[ y_{1t} = X_t'\beta_1 + \varepsilon_{1t}. \]

Since only exogs appear on the rhs, this equation is identified. The second equation is
\[ y_{2t} = -\gamma_{21}y_{1t} + X_t'\beta_2 + \varepsilon_{2t}. \]
This equation has $K^{**} = 0$ excluded exogs and $G^* = 2$ included endogs, so it fails the order (necessary) condition for identification. However, suppose that we have the restriction $\Sigma_{21} = 0$, so that the first and second structural errors are uncorrelated. In this case
\[ \mathcal{E}\left(y_{1t}\varepsilon_{2t}\right)
= \mathcal{E}\left\{\left(X_t'\beta_1 + \varepsilon_{1t}\right)\varepsilon_{2t}\right\} = 0, \]
so there's no problem of simultaneity. If the entire $\Sigma$ matrix is diagonal, then following the same logic, all of the equations are identified. This is known as a fully recursive model.

11.5.3. Example: Klein's Model 1. To give an example of determining identification status, consider the following macro model (this is the widely known

Klein's Model 1):

Consumption: $C_t = \alpha_0 + \alpha_1 P_t + \alpha_2 P_{t-1} + \alpha_3\left(W_t^p + W_t^g\right) + \varepsilon_{1t}$
Investment: $I_t = \beta_0 + \beta_1 P_t + \beta_2 P_{t-1} + \beta_3 K_{t-1} + \varepsilon_{2t}$
Private Wages: $W_t^p = \gamma_0 + \gamma_1 X_t + \gamma_2 X_{t-1} + \gamma_3 A_t + \varepsilon_{3t}$
Output: $X_t = C_t + I_t + G_t$
Profits: $P_t = X_t - T_t - W_t^p$
Capital Stock: $K_t = K_{t-1} + I_t$

The other variables are the government wage bill, $W_t^g$, taxes, $T_t$, government nonwage spending, $G_t$, and a time trend, $A_t$. The endogenous variables are the lhs variables,
\[ Y_t' = \begin{bmatrix} C_t & I_t & W_t^p & X_t & P_t & K_t \end{bmatrix}, \]
and the predetermined variables are all others:
\[ X_t' = \begin{bmatrix} 1 & W_t^g & G_t & T_t & A_t & P_{t-1} & K_{t-1} & X_{t-1} \end{bmatrix}. \]

The model assumes that the errors of the equations are contemporaneously correlated, but nonautocorrelated. The model written as $Y\Gamma = XB + E$ gives
\[
\Gamma = \begin{bmatrix}
1 & 0 & 0 & -1 & 0 & 0 \\
0 & 1 & 0 & -1 & 0 & -1 \\
-\alpha_3 & 0 & 1 & 0 & 1 & 0 \\
0 & 0 & -\gamma_1 & 1 & -1 & 0 \\
-\alpha_1 & -\beta_1 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 1
\end{bmatrix},
\qquad
B = \begin{bmatrix}
\alpha_0 & \beta_0 & \gamma_0 & 0 & 0 & 0 \\
\alpha_3 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & -1 & 0 \\
0 & 0 & \gamma_3 & 0 & 0 & 0 \\
\alpha_2 & \beta_2 & 0 & 0 & 0 & 0 \\
0 & \beta_3 & 0 & 0 & 0 & 1 \\
0 & 0 & \gamma_2 & 0 & 0 & 0
\end{bmatrix}.
\]
To check the identification of the consumption equation, we need to extract $\Gamma_{32}$ and $B_{22}$, the submatrices of coefficients of endogs and exogs that don't appear in this equation. These are the rows that have zeros in the first column, and we need to drop the first column.

We get
\[
\begin{bmatrix} \Gamma_{32} \\ B_{22} \end{bmatrix}
= \begin{bmatrix}
1 & 0 & -1 & 0 & -1 \\
0 & -\gamma_1 & 1 & -1 & 0 \\
0 & 0 & 0 & 0 & 1 \\
0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & -1 & 0 \\
0 & \gamma_3 & 0 & 0 & 0 \\
\beta_3 & 0 & 0 & 0 & 1 \\
0 & \gamma_2 & 0 & 0 & 0
\end{bmatrix}.
\]
We need to find a set of 5 rows of this matrix that gives a full-rank $5 \times 5$ matrix. For example, selecting rows 3, 4, 5, 6 and 7 we obtain the matrix
\[
A = \begin{bmatrix}
0 & 0 & 0 & 0 & 1 \\
0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & -1 & 0 \\
0 & \gamma_3 & 0 & 0 & 0 \\
\beta_3 & 0 & 0 & 0 & 1
\end{bmatrix}.
\]
This matrix is of full rank, so the sufficient condition for identification is met.

Counting included endogs, $G^* = 3$, and counting excluded exogs, $K^{**} = 5$, so
\[ K^{**} - \left(G^* - 1\right) = 5 - 2 = 3. \]

- The equation is over-identified by three restrictions, according to the counting rules, which are correct when the only identifying information are the exclusion restrictions. However, there is additional information in this case. Both $W_t^p$ and $W_t^g$ enter the consumption equation, and their coefficients are restricted to be the same. For this reason the consumption equation is in fact overidentified by four restrictions.

11.6. 2SLS

When we have no information regarding cross-equation restrictions or the structure of the error covariance matrix, one can estimate the parameters of a single equation of the system without regard to the other equations.

- This isn't always efficient, as we'll see, but it has the advantage that misspecifications in other equations will not affect the consistency of the estimator of the parameters of the equation of interest.
- Also, estimation of the equation won't be affected by identification problems in other equations.

The 2SLS estimator is very simple: in the first stage, each column of $Y_1$ is regressed on all the weakly exogenous variables in the system, e.g., the entire $X$ matrix. The fitted values are
\[ \hat{Y}_1 = X(X'X)^{-1}X'Y_1 = P_X Y_1 = X\hat{\Pi}_1. \]
Since these fitted values are the projection of $Y_1$ on the space spanned by $X$, and since any vector in this space is uncorrelated with $\varepsilon$ by assumption, $\hat{Y}_1$ is uncorrelated with $\varepsilon$.

Since $\hat{Y}_1$ is simply the reduced-form prediction, it is correlated with $Y_1$. The only other requirement is that the instruments be linearly independent. This should be the case when the order condition is satisfied, since there are more columns in $X_2$ than in $Y_1$ in this case.

The second stage substitutes $\hat{Y}_1$ in place of $Y_1$, and estimates by OLS. The original model is
\[ y = Y_1\gamma_1 + X_1\beta_1 + \varepsilon = Z\delta + \varepsilon, \]
and the second stage model is
\[ y = \hat{Y}_1\gamma_1 + X_1\beta_1 + \varepsilon. \]
Since $X_1$ is in the space spanned by $X$, $P_X X_1 = X_1$, so we can write the second stage model as
\[ y = P_X Y_1\gamma_1 + P_X X_1\beta_1 + \varepsilon \equiv P_X Z\delta + \varepsilon. \]
The OLS estimator applied to this model is
\[ \hat{\delta} = \left(Z'P_X Z\right)^{-1}Z'P_X y, \]
which is exactly what we get if we estimate using IV, with the reduced form predictions of the endogs used as instruments.
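The numerical equivalence of the two-stage procedure and the IV formula can be verified directly. The small simultaneous system simulated below is hypothetical (one rhs endog, one included exog, two excluded exogs serving as extra instruments):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500

# Hypothetical system: X = [x1 x2a x2b] is the full set of weakly exogenous vars.
x1, x2a, x2b = rng.standard_normal((3, n))
u = rng.standard_normal(n)
y1 = x1 + x2a + 0.5 * x2b + u               # endogenous regressor
eps = 0.7 * u + rng.standard_normal(n)      # correlated with y1 through u
y = 1.0 * y1 + 0.5 * x1 + eps

Z = np.column_stack([y1, x1])
X = np.column_stack([x1, x2a, x2b])

# First stage: fitted values P_X Z; second stage: OLS of y on them.
Zhat = X @ np.linalg.lstsq(X, Z, rcond=None)[0]
delta_2sls = np.linalg.lstsq(Zhat, y, rcond=None)[0]
# IV with Zhat as instruments: (Zhat' Z)^{-1} Zhat' y.
delta_iv = np.linalg.solve(Zhat.T @ Z, Zhat.T @ y)
print(delta_2sls, delta_iv)  # identical, and near the true (1.0, 0.5)
```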

Note that if we define
\[ \hat{Z} = P_X Z = \begin{bmatrix} \hat{Y}_1 & X_1 \end{bmatrix}, \]
so that $\hat{Z}$ are the instruments for $Z$, then we can write
\[ \hat{\delta} = \left(\hat{Z}'Z\right)^{-1}\hat{Z}'y. \]
Important note: OLS on the transformed model can be used to calculate the 2SLS estimate of $\delta$, since we see that it's equivalent to IV using a particular set of instruments. However the OLS covariance formula is not valid. We need to apply the IV covariance formula already seen above.

Actually, there is also a simplification of the general IV variance formula. The IV covariance estimator would ordinarily be
\[ \hat{V}\left(\hat{\delta}\right) = \left(\hat{Z}'Z\right)^{-1}\left(\hat{Z}'\hat{Z}\right)\left(Z'\hat{Z}\right)^{-1}\widehat{\sigma^2_{IV}}. \]
However, looking at the first term in brackets,
\[ \hat{Z}'Z = \left(P_X Z\right)'Z, \]

but since $P_X$ is idempotent, we can write
\[ \left(P_X Z\right)'Z = Z'P_X P_X Z = \left(P_X Z\right)'P_X Z = \hat{Z}'\hat{Z}. \]
Therefore, the second and last terms in the variance formula cancel, so the 2SLS varcov estimator simplifies to
\[ \hat{V}\left(\hat{\delta}\right) = \left(\hat{Z}'Z\right)^{-1}\widehat{\sigma^2_{IV}}, \]
which, following some algebra similar to the above, can also be written as
\[ \hat{V}\left(\hat{\delta}\right) = \left(\hat{Z}'\hat{Z}\right)^{-1}\widehat{\sigma^2_{IV}}. \]
Finally, recall that though this is presented in terms of the first equation, it is general since any equation can be placed first.

Properties of 2SLS:

(1) Consistent
(2) Asymptotically normal
(3) Biased when the mean exists (the existence of moments is a technical issue we won't go into here)
(4) Asymptotically inefficient, except in special circumstances (more on this later).
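The algebraic step driving this simplification, $\hat{Z}'Z = \hat{Z}'\hat{Z}$, is easy to confirm numerically (arbitrary simulated data, chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 200, 4
X = rng.standard_normal((n, k))                 # all weakly exogenous variables
endog = X @ rng.standard_normal(k) + rng.standard_normal(n)
Z = np.column_stack([endog, X[:, 0]])           # [rhs endog, included exog]

PX = X @ np.linalg.solve(X.T @ X, X.T)          # projection onto span(X)
Zhat = PX @ Z

# Idempotency of P_X gives Zhat'Z = Z'P_X P_X Z = Zhat'Zhat:
print(np.allclose(Zhat.T @ Z, Zhat.T @ Zhat))  # True
```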

11.7. Testing the overidentifying restrictions

The selection of which variables are endogs and which are exogs is part of the specification of the model. As such, there is room for error here: one might erroneously classify a variable as exog when it is in fact correlated with the error term. A general test for the specification of the model can be formulated as follows.

The IV estimator can be calculated by applying OLS to the transformed model, so the IV objective function at the minimized value is
\[ s\left(\hat{\beta}_{IV}\right) = \left(y - X\hat{\beta}_{IV}\right)'P_W\left(y - X\hat{\beta}_{IV}\right), \]
but
\begin{align*}
\hat{\varepsilon}_{IV} &= y - X\hat{\beta}_{IV} \\
&= y - X(X'P_W X)^{-1}X'P_W y \\
&= \left(I - X(X'P_W X)^{-1}X'P_W\right)y \\
&= \left(I - X(X'P_W X)^{-1}X'P_W\right)\left(X\beta + \varepsilon\right) \\
&= A\left(X\beta + \varepsilon\right),
\end{align*}
where
\[ A \equiv I - X(X'P_W X)^{-1}X'P_W, \]
so
\[ s\left(\hat{\beta}_{IV}\right) = \left(\varepsilon' + \beta'X'\right)A'P_W A\left(X\beta + \varepsilon\right). \]

Moreover, $A'P_W A$ is idempotent, as can be verified by multiplication:
\begin{align*}
A'P_W A &= \left(I - P_W X(X'P_W X)^{-1}X'\right)P_W\left(I - X(X'P_W X)^{-1}X'P_W\right) \\
&= \left(P_W - P_W X(X'P_W X)^{-1}X'P_W\right)\left(I - X(X'P_W X)^{-1}X'P_W\right) \\
&= P_W - P_W X(X'P_W X)^{-1}X'P_W.
\end{align*}
Furthermore, $A$ is orthogonal to $X$:
\[ AX = \left(I - X(X'P_W X)^{-1}X'P_W\right)X = X - X = 0, \]
so
\[ s\left(\hat{\beta}_{IV}\right) = \varepsilon'A'P_W A\varepsilon. \]
Supposing the $\varepsilon$ are normally distributed, with variance $\sigma^2$, then the random variable
\[ \frac{s\left(\hat{\beta}_{IV}\right)}{\sigma^2} = \frac{\varepsilon'A'P_W A\varepsilon}{\sigma^2} \]
is a quadratic form of a $N(0,1)$ random variable with an idempotent matrix in the middle, so
\[ \frac{s\left(\hat{\beta}_{IV}\right)}{\sigma^2} \sim \chi^2\left(\rho\left(A'P_W A\right)\right). \]
This isn't available, since we need to estimate $\sigma^2$. Substituting a consistent estimator,

\[ \frac{s\left(\hat{\beta}_{IV}\right)}{\widehat{\sigma^2_{IV}}} \overset{a}{\sim} \chi^2\left(\rho\left(A'P_W A\right)\right). \]
Even if the $\varepsilon$ aren't normally distributed, the asymptotic result still holds. The last thing we need to determine is the rank of the idempotent matrix. We have
\[ A'P_W A = P_W - P_W X(X'P_W X)^{-1}X'P_W, \]
so
\begin{align*}
\rho\left(A'P_W A\right) &= \operatorname{Tr}\left(P_W - P_W X(X'P_W X)^{-1}X'P_W\right) \\
&= \operatorname{Tr} P_W - \operatorname{Tr}\left(X'P_W P_W X(X'P_W X)^{-1}\right) \\
&= \operatorname{Tr}\left(W(W'W)^{-1}W'\right) - K_X \\
&= \operatorname{Tr}\left((W'W)^{-1}W'W\right) - K_X \\
&= K_W - K_X,
\end{align*}
where $K_W$ is the number of columns of $W$ and $K_X$ is the number of columns of $X$. The degrees of freedom of the test is simply the number of overidentifying restrictions: the number of instruments we have beyond the number that is strictly necessary for consistent estimation.

- This test is an overall specification test: the joint null hypothesis is that the model is correctly specified and that the $W$ form valid instruments (e.g., that the variables classified as exogs really are uncorrelated with $\varepsilon$). Rejection can mean that either the model $y = Z\delta + \varepsilon$ is misspecified, or that there is correlation between $X$ and $\varepsilon$.
- This is a particular case of the GMM criterion test, which is covered in the second half of the course. See Section 15.8.

Note that since
\[ \hat{\varepsilon}_{IV} = A\varepsilon \]
and
\[ s\left(\hat{\beta}_{IV}\right) = \hat{\varepsilon}_{IV}'P_W\hat{\varepsilon}_{IV}, \]
we can write
\[
\frac{s\left(\hat{\beta}_{IV}\right)}{\widehat{\sigma^2_{IV}}}
= \frac{\hat{\varepsilon}'W(W'W)^{-1}W'\hat{\varepsilon}}{\hat{\varepsilon}'\hat{\varepsilon}/n}
= n\,\frac{\hat{\varepsilon}'P_W\hat{\varepsilon}}{\hat{\varepsilon}'\hat{\varepsilon}}
= nR_u^2,
\]
where $R_u^2$ is the uncentered $R^2$ from a regression of the IV residuals on all of the instruments $W$. This is a convenient way to calculate the test statistic.

On an aside, consider IV estimation of a just-identified model, using the standard notation
\[ y = X\beta + \varepsilon, \]
where $W$ is the matrix of instruments. If we have exact identification then $\operatorname{cols}(W) = \operatorname{cols}(X)$, so $W'X$ is a square matrix. The transformed model is
\[ P_W y = P_W X\beta + P_W\varepsilon \]
and the fonc are
\[ X'P_W\left(y - X\hat{\beta}_{IV}\right) = 0. \]
The IV estimator is
\[ \hat{\beta}_{IV} = \left(X'P_W X\right)^{-1}X'P_W y. \]
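The $nR_u^2$ statistic can be sketched as follows. The d.g.p. (three valid instruments, one regressor) is hypothetical, so under the null the statistic should be an ordinary $\chi^2(2)$ draw; with a single instrument (exact identification, as in the aside) the statistic is identically zero:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 1_000

W = rng.standard_normal((n, 3))                 # three valid instruments
u = rng.standard_normal(n)
x = W @ np.array([1.0, 0.5, 0.5]) + u + rng.standard_normal(n)
eps = 0.6 * u + rng.standard_normal(n)          # independent of W
y = 2.0 * x + eps
X = x.reshape(-1, 1)

def iv_resid(y, X, W):
    """IV residuals y - X beta_IV for instruments W."""
    PWX = W @ np.linalg.solve(W.T @ W, W.T @ X)
    beta = np.linalg.solve(PWX.T @ X, PWX.T @ y)
    return y - X @ beta

def n_r2u(ehat, W, n):
    """n times the uncentered R^2 of a regression of ehat on W."""
    fitted = W @ np.linalg.lstsq(W, ehat, rcond=None)[0]
    return n * (fitted @ fitted) / (ehat @ ehat)

stat = n_r2u(iv_resid(y, X, W), W, n)           # ~ chi^2(2) under the null
df = W.shape[1] - X.shape[1]                    # 3 - 1 = 2 overidentifying restrictions

# Exact identification: one instrument, statistic numerically zero.
W1 = W[:, :1]
stat_just = n_r2u(iv_resid(y, X, W1), W1, n)
print(stat, df, stat_just)
```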

Considering the inverse here,
\begin{align*}
\left(X'P_W X\right)^{-1} &= \left(X'W(W'W)^{-1}W'X\right)^{-1} \\
&= (W'X)^{-1}\left(X'W(W'W)^{-1}\right)^{-1} \\
&= (W'X)^{-1}(W'W)(X'W)^{-1},
\end{align*}
using the fact that $W'X$ is square and invertible under exact identification. Now multiplying this by $X'P_W y$, we obtain
\begin{align*}
\hat{\beta}_{IV} &= (W'X)^{-1}(W'W)(X'W)^{-1}X'W(W'W)^{-1}W'y \\
&= (W'X)^{-1}W'y.
\end{align*}
The objective function for the generalized IV estimator is
\begin{align*}
s\left(\hat{\beta}_{IV}\right)
&= \left(y - X\hat{\beta}_{IV}\right)'P_W\left(y - X\hat{\beta}_{IV}\right) \\
&= y'P_W\left(y - X\hat{\beta}_{IV}\right) - \hat{\beta}_{IV}'X'P_W\left(y - X\hat{\beta}_{IV}\right) \\
&= y'P_W\left(y - X\hat{\beta}_{IV}\right) - \hat{\beta}_{IV}'\left(X'P_W y - X'P_W X\hat{\beta}_{IV}\right) \\
&= y'P_W\left(y - X\hat{\beta}_{IV}\right)
\end{align*}

by the fonc for generalized IV. However, when we're in the just identified case, this is
\begin{align*}
s\left(\hat{\beta}_{IV}\right)
&= y'P_W\left(y - X(W'X)^{-1}W'y\right) \\
&= y'P_W\left(I - X(W'X)^{-1}W'\right)y \\
&= y'\left(W(W'W)^{-1}W' - W(W'W)^{-1}W'X(W'X)^{-1}W'\right)y \\
&= 0.
\end{align*}
The value of the objective function of the IV estimator is zero in the just identified case. This makes sense, since we've already shown that the objective function after dividing by $\sigma^2$ is asymptotically $\chi^2$ with degrees of freedom equal to the number of overidentifying restrictions. In the present case, there are no overidentifying restrictions, so we have a $\chi^2(0)$ rv, which has mean 0 and variance 0, e.g., it's simply 0. This means we're not able to test the identifying restrictions in the case of exact identification.

11.8. System methods of estimation

2SLS is a single equation method of estimation, as noted above. The advantage of a single equation method is that it's unaffected by the other equations of the system, so they don't need to be specified (except for defining what are the exogs, so 2SLS can use the complete set of instruments). The disadvantage of 2SLS is that it's inefficient, in general.

- Recall that overidentification improves efficiency of estimation, since an overidentified equation can use more instruments than are necessary for consistent estimation.
- Secondly,

information about one equation is implicitly information about all equations. Therefore, overidentification restrictions in any equation improve efficiency for all equations, even the just identified equations.

- Single equation methods can't use these types of information, and are therefore inefficient (in general).
- When equations are correlated with one another, estimation should account for the correlation in order to obtain efficiency.

Since there is no autocorrelation of the $E_t$'s, and since the columns of $E$ are individually homoscedastic, the covariance of the stacked structural errors is
\[
V\begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_G \end{pmatrix}
= \begin{bmatrix}
\sigma_{11}I_n & \sigma_{12}I_n & \cdots & \sigma_{1G}I_n \\
\sigma_{21}I_n & \sigma_{22}I_n & & \vdots \\
\vdots & & \ddots & \vdots \\
\sigma_{G1}I_n & \cdots & \cdots & \sigma_{GG}I_n
\end{bmatrix}
= \Sigma \otimes I_n.
\]
This means that the structural equations are heteroscedastic and correlated with one another. In general, ignoring this will lead to inefficient estimation, following the section on GLS.

11.8.1. 3SLS. Note: It is easier and more practical to treat the 3SLS estimator as a generalized method of moments estimator (see Chapter 15). I no longer teach the following section, but it is retained for its possible historical interest. Another alternative is to use FIML (Subsection 11.8.2), if you are willing to make distributional assumptions on the errors. This is computationally feasible with modern computers.

Following our above notation, each structural equation can be written as
\[ y_i = Y_i\gamma_i + X_i\beta_i + \varepsilon_i = Z_i\delta_i + \varepsilon_i. \]
Grouping the equations together we get
\[
\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_G \end{bmatrix}
= \begin{bmatrix}
Z_1 & 0 & \cdots & 0 \\
0 & Z_2 & & \vdots \\
\vdots & & \ddots & 0 \\
0 & \cdots & 0 & Z_G
\end{bmatrix}
\begin{bmatrix} \delta_1 \\ \delta_2 \\ \vdots \\ \delta_G \end{bmatrix}
+ \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_G \end{bmatrix}
\]
or
\[ y = Z\delta + \varepsilon, \]
where we already have that
\[ \mathcal{E}\left(\varepsilon\varepsilon'\right) = \Sigma \otimes I_n. \]

The 3SLS estimator is just 2SLS combined with a GLS correction that takes advantage of the structure of $\mathcal{E}(\varepsilon\varepsilon')$. Define $\hat{Z}$ as
\[
\hat{Z} = \begin{bmatrix}
X(X'X)^{-1}X'Z_1 & 0 & \cdots & 0 \\
0 & X(X'X)^{-1}X'Z_2 & & \vdots \\
\vdots & & \ddots & 0 \\
0 & \cdots & 0 & X(X'X)^{-1}X'Z_G
\end{bmatrix}
= \begin{bmatrix}
\begin{bmatrix} \hat{Y}_1 & X_1 \end{bmatrix} & 0 & \cdots & 0 \\
0 & \begin{bmatrix} \hat{Y}_2 & X_2 \end{bmatrix} & & \vdots \\
\vdots & & \ddots & 0 \\
0 & \cdots & 0 & \begin{bmatrix} \hat{Y}_G & X_G \end{bmatrix}
\end{bmatrix}.
\]
These instruments are simply the unrestricted rf predictions of the endogs, combined with the exogs. The distinction is that if the model is overidentified, then
\[ \Pi = B\Gamma^{-1} \]
may be subject to some zero restrictions, depending on the restrictions on $\Gamma$ and $B$, and $\hat{\Pi}$ does not impose these restrictions. Also, note that $\hat{\Pi}$ is calculated using OLS equation by equation. More on this later.

The 2SLS estimator would be
\[ \hat{\delta} = \left(\hat{Z}'Z\right)^{-1}\hat{Z}'y, \]
as can be verified by simple multiplication, and noting that the inverse of a block-diagonal matrix is just the matrix with the inverses of the blocks on the main diagonal. This IV estimator still ignores the covariance information. The natural extension is to add the GLS transformation, putting the inverse of the

error covariance into the formula, which gives the 3SLS estimator
\[ \hat{\delta}_{3SLS} = \left[\hat{Z}'\left(\Sigma \otimes I_n\right)^{-1}Z\right]^{-1}\hat{Z}'\left(\Sigma \otimes I_n\right)^{-1}y. \]
This estimator requires knowledge of $\Sigma$. The solution is to define a feasible estimator using a consistent estimator of $\Sigma$. The obvious solution is to use an estimator based on the 2SLS residuals:
\[ \hat{\varepsilon}_i = y_i - Z_i\hat{\delta}_{i,2SLS} \]
(IMPORTANT NOTE: this is calculated using $Z_i$, not $\hat{Z}_i$). Then the element $i,j$ of $\Sigma$ is estimated by
\[ \hat{\sigma}_{ij} = \frac{\hat{\varepsilon}_i'\hat{\varepsilon}_j}{n}. \]
Substitute $\hat{\Sigma}$ into the formula above to get the feasible 3SLS estimator.

Analogously to what we did in the case of 2SLS, the asymptotic distribution of the 3SLS estimator can be shown to be
\[
\sqrt{n}\left(\hat{\delta}_{3SLS} - \delta\right) \xrightarrow{d}
N\left(0, \lim_{n\to\infty}\mathcal{E}\left[\left(\frac{\hat{Z}'\left(\Sigma \otimes I_n\right)^{-1}\hat{Z}}{n}\right)^{-1}\right]\right).
\]
A formula for estimating the variance of the 3SLS estimator in finite samples (cancelling out the powers of $n$) is
\[ \hat{V}\left(\hat{\delta}_{3SLS}\right) = \left(\hat{Z}'\left(\hat{\Sigma} \otimes I_n\right)^{-1}\hat{Z}\right)^{-1}. \]
This is analogous to the 2SLS formula in equation (??), combined with the GLS correction.

In the case that all equations are just identified, 3SLS is numerically equivalent to 2SLS. Proving this is easiest if we use a GMM interpretation of 2SLS and 3SLS. GMM is presented in the next econometrics course. For now, take it on faith.

The 3SLS estimator is based upon the rf parameter estimator $\hat{\Pi}$, calculated equation by equation using OLS:
\[ \hat{\Pi} = (X'X)^{-1}X'Y, \]
which is simply
\[ \hat{\Pi} = (X'X)^{-1}X'\begin{bmatrix} y_1 & y_2 & \cdots & y_G \end{bmatrix}, \]
that is, OLS equation by equation using all the exogs in the estimation of each column of $\Pi$.

It may seem odd that we use OLS on the reduced form, since the rf equations are correlated:
\[ Y_t' = X_t'B\Gamma^{-1} + E_t'\Gamma^{-1} = X_t'\Pi + V_t' \]
and
\[ V_t = \left(\Gamma^{-1}\right)'E_t \sim N\left(0, \left(\Gamma^{-1}\right)'\Sigma\Gamma^{-1}\right), \ \forall t. \]
Let this var-cov matrix be indicated by
\[ \Xi \equiv \left(\Gamma^{-1}\right)'\Sigma\Gamma^{-1}. \]

OLS equation by equation to get the rf is equivalent to
\[
\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_G \end{bmatrix}
= \begin{bmatrix}
X & 0 & \cdots & 0 \\
0 & X & & \vdots \\
\vdots & & \ddots & 0 \\
0 & \cdots & 0 & X
\end{bmatrix}
\begin{bmatrix} \pi_1 \\ \pi_2 \\ \vdots \\ \pi_G \end{bmatrix}
+ \begin{bmatrix} v_1 \\ v_2 \\ \vdots \\ v_G \end{bmatrix},
\]
where $y_i$ is the $n \times 1$ vector of observations of the $i$th endog, $X$ is the entire $n \times K$ matrix of exogs, $\pi_i$ is the $i$th column of $\Pi$, and $v_i$ is the $i$th column of $V$. Use the notation
\[ y = \mathbf{X}\pi + v \]
to indicate the pooled model. Following this notation, the error covariance matrix is
\[ V(v) = \Xi \otimes I_n. \]

- This is a special case of a type of model known as a set of seemingly unrelated equations (SUR), since the parameter vector of each equation is different. The equations are contemporaneously correlated, however. The general case would have a different $X_i$ for each equation.
- Note that each equation of the system individually satisfies the classical assumptions. However, pooled estimation using the GLS correction is more efficient, since equation-by-equation estimation is equivalent to pooled estimation, since $\mathbf{X}$ is block diagonal, but ignoring the covariance information.
- The model is estimated by GLS, where $\Xi$ is estimated using the OLS residuals from equation-by-equation estimation, which are consistent.
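A numerical check of this setup (arbitrary error covariance playing the role of $\Xi$, arbitrary data): with identical regressors in every equation, the GLS correction turns out to change nothing, which is the SUR $\equiv$ OLS result:

```python
import numpy as np

rng = np.random.default_rng(6)
G, n, k = 3, 40, 2
X = rng.standard_normal((n, k))
Xi = np.array([[1.0, 0.4, 0.2],        # arbitrary pd error covariance (Xi)
               [0.4, 1.0, 0.3],
               [0.2, 0.3, 1.0]])
y = rng.standard_normal(G * n)         # stacked lhs; contents irrelevant here

Xbig = np.kron(np.eye(G), X)           # identical regressors in every equation
Om_inv = np.kron(np.linalg.inv(Xi), np.eye(n))

pi_gls = np.linalg.solve(Xbig.T @ Om_inv @ Xbig, Xbig.T @ Om_inv @ y)
pi_ols = np.linalg.solve(Xbig.T @ Xbig, Xbig.T @ y)
print(np.allclose(pi_gls, pi_ols))  # True: GLS collapses to OLS
```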

- In the special case that all the $X_i$ are the same, which is true in the present case of estimation of the rf parameters, SUR $\equiv$ OLS. To show this note that in this case $\mathbf{X} = I_G \otimes X$. Using the rules
(1) $(A \otimes B)^{-1} = A^{-1} \otimes B^{-1}$
(2) $(A \otimes B)' = A' \otimes B'$ and
(3) $(A \otimes B)(C \otimes D) = AC \otimes BD$,
we get
\begin{align*}
\hat{\pi}_{SUR}
&= \left[\left(I_G \otimes X\right)'\left(\Xi \otimes I_n\right)^{-1}\left(I_G \otimes X\right)\right]^{-1}\left(I_G \otimes X\right)'\left(\Xi \otimes I_n\right)^{-1}y \\
&= \left[\left(\Xi^{-1} \otimes X'\right)\left(I_G \otimes X\right)\right]^{-1}\left(\Xi^{-1} \otimes X'\right)y \\
&= \left(\Xi \otimes (X'X)^{-1}\right)\left(\Xi^{-1} \otimes X'\right)y \\
&= \left(I_G \otimes (X'X)^{-1}X'\right)y \\
&= \begin{bmatrix} (X'X)^{-1}X'y_1 \\ (X'X)^{-1}X'y_2 \\ \vdots \\ (X'X)^{-1}X'y_G \end{bmatrix}.
\end{align*}
- So the unrestricted rf coefficients can be estimated efficiently (assuming normality) by OLS, even if the equations are correlated.
- We have ignored any potential zeros in the matrix $\Pi$, which if they exist could potentially increase the efficiency of estimation of the rf.
- Another example where SUR $\equiv$ OLS is in estimation of vector autoregressions. See two sections ahead.

11.8.2. FIML. Full information maximum likelihood is an alternative estimation method. FIML will be asymptotically efficient, since ML estimators based on a given information set are asymptotically efficient w.r.t. all other estimators that use the same information set, and in the case of the

full-information ML estimator we use the entire information set. The 2SLS and 3SLS estimators don't require distributional assumptions, while FIML of course does. Our model is, recall,
\[ Y_t'\Gamma = X_t'B + E_t', \qquad E_t \sim N(0, \Sigma),\ \forall t, \]
with $E_t$ and $E_s$ independent for $t \neq s$. The joint normality of $E_t$ means that the density for $E_t$ is the multivariate normal, which is
\[ (2\pi)^{-G/2}\left(\det\Sigma^{-1}\right)^{1/2}\exp\left(-\frac{1}{2}E_t'\Sigma^{-1}E_t\right). \]
The transformation from $E_t$ to $Y_t$ requires the Jacobian
\[ \left|\det\frac{dE_t}{dY_t'}\right| = \left|\det\Gamma\right|, \]
so the density for $Y_t$ is
\[
(2\pi)^{-G/2}\left|\det\Gamma\right|\left(\det\Sigma^{-1}\right)^{1/2}
\exp\left(-\frac{1}{2}\left(Y_t'\Gamma - X_t'B\right)\Sigma^{-1}\left(Y_t'\Gamma - X_t'B\right)'\right).
\]
Given the assumption of independence over time, the joint log-likelihood function is
\[
\ln L(B, \Gamma, \Sigma) = -\frac{nG}{2}\ln(2\pi) + n\ln\left|\det\Gamma\right|
+ \frac{n}{2}\ln\det\Sigma^{-1}
- \frac{1}{2}\sum_{t=1}^{n}\left(Y_t'\Gamma - X_t'B\right)\Sigma^{-1}\left(Y_t'\Gamma - X_t'B\right)'.
\]
This is a nonlinear in the parameters objective function. Maximization of this can be done using iterative numeric methods. We'll see how to do this in the next section.

It turns out that the asymptotic distribution of 3SLS and FIML are the same, assuming normality of the errors.

One can calculate the FIML estimator by iterating the 3SLS estimator, thus avoiding the use of a nonlinear optimizer. The steps are

(1) Calculate $\hat{\Gamma}_{3SLS}$ and $\hat{B}_{3SLS}$ as normal.
(2) Calculate $\hat{\Pi} = \hat{B}_{3SLS}\hat{\Gamma}_{3SLS}^{-1}$. This is new, we didn't estimate $\Pi$ in this way before. This estimator may have some zeros in it. When Greene says iterated 3SLS doesn't lead to FIML, he means this for a procedure that doesn't update $\hat{\Pi}$, but only updates $\hat{\Sigma}$, $\hat{B}$ and $\hat{\Gamma}$. If you update $\hat{\Pi}$ you do converge to FIML.
(3) Calculate the instruments $\hat{Y} = X\hat{\Pi}$ and calculate $\hat{\Sigma}$ using $\hat{\Gamma}$ and $\hat{B}$ to get the estimated errors, applying the usual estimator.
(4) Apply 3SLS using these new instruments and the estimate of $\Sigma$.
(5) Repeat steps 2-4 until there is no change in the parameters.

- FIML is fully efficient, since it's an ML estimator that uses all information. This implies that 3SLS is fully efficient when the errors are normally distributed. Also, if each equation is just identified and the errors are normal, then 2SLS will be fully efficient, since in this case 2SLS $\equiv$ 3SLS.
- When the errors aren't normally distributed, the likelihood function is of course different than what's written above.

11.9. Example: 2SLS and Klein's Model 1

The Octave program Simeq/Klein.m performs 2SLS estimation for the 3 equations of Klein's model 1, assuming nonautocorrelated errors, so that lagged endogenous variables can be used as instruments. The results are:

CONSUMPTION EQUATION

*******************************************************
2SLS estimation results
Observations 21
R-squared 0.976711
Sigma-squared 1.044059

                 estimate   st.err.   t-stat.   p-value
Constant           16.555     1.321    12.534     0.000
Profits             0.017     0.118     0.147     0.884
Lagged Profits      0.216     0.107     2.016     0.060
Wages               0.810     0.040    20.129     0.000
*******************************************************

INVESTMENT EQUATION

*******************************************************
2SLS estimation results
Observations 21
R-squared 0.884884
Sigma-squared 1.383184

                 estimate   st.err.   t-stat.   p-value
Constant           20.278     7.543     2.688     0.016
Profits             0.150     0.173     0.867     0.398
Lagged Profits      0.616     0.163     3.784     0.001

Lagged Capital     -0.158     0.036    -4.368     0.000
*******************************************************

WAGES EQUATION

*******************************************************
2SLS estimation results
Observations 21
R-squared 0.987414
Sigma-squared 0.476427

                 estimate   st.err.   t-stat.   p-value
Constant            1.500     1.148     1.307     0.209
Output              0.439     0.036    12.316     0.000
Lagged Output       0.147     0.039     3.777     0.002
Trend               0.130     0.029     4.475     0.000
*******************************************************

The above results are not valid (specifically, they are inconsistent) if the errors are autocorrelated, since lagged endogenous variables will not be valid instruments in that case. You might consider eliminating the lagged endogenous variables as instruments, and re-estimating by 2SLS, to obtain consistent parameter estimates in this more complex case. Standard errors will still be estimated inconsistently, unless one uses a Newey-West type covariance estimator. Food for thought...
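The mechanics of a single-equation 2SLS fit of the kind Klein.m applies to each equation are easy to write out. The following is an illustrative Python/NumPy sketch on simulated data (the notes' programs are in Octave, and the data-generating process below is made up for the example, not Klein's model): the endogenous regressor is correlated with the error, OLS is inconsistent, and instrumenting repairs the estimate.

```python
import numpy as np

def two_sls(y, X, W):
    """2SLS: project X on the instruments W to get Xhat = P_W X,
    then regress y on Xhat."""
    Xhat = W @ np.linalg.solve(W.T @ W, W.T @ X)
    return np.linalg.solve(Xhat.T @ Xhat, Xhat.T @ y)

rng = np.random.default_rng(0)
n = 10_000
w = rng.normal(size=n)                 # a valid instrument
u = rng.normal(size=n)                 # structural error
x = w + 0.5 * u + rng.normal(size=n)   # endogenous: correlated with u
y = 1.0 + 2.0 * x + u                  # true slope is 2.0

X = np.column_stack([np.ones(n), x])
W = np.column_stack([np.ones(n), w])

beta_2sls = two_sls(y, X, W)                      # consistent
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)      # biased upward here
```

With this design the OLS slope converges to about 2.22 rather than 2, while the 2SLS slope is consistent.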

CHAPTER 12

Introduction to the second half

We'll begin with study of extremum estimators in general. Let $Z_n$ be the available data, based on a sample of size $n$.

DEFINITION 12.0.1. [Extremum estimator] An extremum estimator $\hat\theta$ is the optimizing element of an objective function $s_n(Z_n,\theta)$ over a set $\Theta$.

We'll usually write the objective function suppressing the dependence on $Z_n$.

Example: Least squares. Let the d.g.p. be $y_t = x_t'\theta^0 + \varepsilon_t$, $t = 1,2,\dots,n$, $\theta^0 \in \Theta$. Stacking observations vertically, $y_n = X_n\theta^0 + \varepsilon_n$, where $X_n = \left(x_1, x_2, \dots, x_n\right)'$. The least squares estimator is defined as
$$\hat\theta \equiv \arg\min_\Theta s_n(\theta) = \frac{1}{n}\left[y_n - X_n\theta\right]'\left[y_n - X_n\theta\right].$$
We readily find that $\hat\theta = (X'X)^{-1}X'y$.

Example: Maximum likelihood. Suppose that the continuous random variable $y_t \sim N(\theta^0, 1)$. The maximum likelihood estimator is defined as
$$\hat\theta \equiv \arg\max_\Theta L_n(\theta) = \prod_{t=1}^{n}\frac{1}{\sqrt{2\pi}}\exp\left(-\frac{\left(y_t - \theta\right)^2}{2}\right).$$
Because the logarithmic function is strictly increasing on $(0,\infty)$, maximization of the average logarithm of the likelihood function is achieved at the same $\hat\theta$ as maximization of the likelihood function itself.
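The least squares example can be checked numerically: the closed form $(X'X)^{-1}X'y$ should minimize $s_n(\theta)$, so perturbing it in any direction must raise the objective. A Python sketch on simulated data (the true values below are arbitrary choices for the illustration):

```python
import numpy as np

# simulated data from the d.g.p. y = X theta0 + eps
rng = np.random.default_rng(1)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
theta0 = np.array([1.0, -2.0])
y = X @ theta0 + rng.normal(size=n)

def s_n(theta):
    """the least squares objective (1/n)[y - X theta]'[y - X theta]"""
    e = y - X @ theta
    return e @ e / n

theta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # the analytic argmin
```

Since $X'X$ is positive definite, $s_n(\hat\theta + d)$ exceeds $s_n(\hat\theta)$ for any nonzero direction $d$.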

So we may work with the average log-likelihood,
$$s_n(\theta) = \frac{1}{n}\ln L_n(\theta) = -\frac{1}{2}\ln 2\pi - \frac{1}{2n}\sum_{t=1}^{n}\left(y_t - \theta\right)^2.$$
Solution of the f.o.c. leads to the familiar result that $\hat\theta = \bar y$.

MLE estimators are asymptotically efficient (Cramér-Rao lower bound, Theorem 3), supposing the strong distributional assumptions upon which they are based are true. One can investigate the properties of an "ML" estimator supposing that the distributional assumptions are incorrect. This gives a quasi-ML estimator, which we'll study later. The strong distributional assumptions of MLE may be questionable in many cases. It is possible to estimate using weaker distributional assumptions based only on some of the moments of a random variable(s).

Example: Method of moments. Suppose we draw a random sample of $y_t$ from the $\chi^2(\theta^0)$ distribution. Here, $\theta^0$ is the parameter of interest. The first moment (expectation), $\mu_1$, of a random variable will in general be a function of the parameters of the distribution. In this example, the relationship is the identity function, $\mu_1(\theta^0) = \theta^0$, though in general the relationship may be more complicated. The sample first moment is
$$\hat\mu_1 = \frac{1}{n}\sum_{t=1}^{n} y_t.$$
Define $m_1(\theta) = \mu_1(\theta) - \hat\mu_1$. The method of moments principle is to choose the estimator of the parameter to set the estimate of the population moment equal to the sample moment.

That is, $m_1(\hat\theta) = 0$: the moment-parameter equation is inverted to solve for the parameter estimate. In this case,
$$\hat\theta = \frac{1}{n}\sum_{t=1}^{n} y_t.$$
Since $\frac{1}{n}\sum_t y_t \stackrel{p}{\to} \theta^0$ by the LLN, the estimator is consistent.

More on the method of moments. Continuing with the above example, the variance of a $\chi^2(\theta^0)$ r.v. is
$$V(y_t) = E\left(y_t - \theta^0\right)^2 = 2\theta^0.$$
Define
$$m_2(\theta) = 2\theta - \frac{1}{n}\sum_{t=1}^{n}\left(y_t - \bar y\right)^2.$$
The MM estimator would set $m_2(\hat\theta) = 0$. Again, by the LLN, the sample variance is consistent for the true variance, that is,
$$\frac{1}{n}\sum_{t=1}^{n}\left(y_t - \bar y\right)^2 \stackrel{p}{\to} 2\theta^0.$$
So
$$\hat\theta = \frac{1}{2n}\sum_{t=1}^{n}\left(y_t - \bar y\right)^2,$$
which is obtained by inverting the moment-parameter equation, is consistent.
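The two method-of-moments estimators can be compared on simulated $\chi^2$ data; here is a Python sketch with an arbitrarily chosen true value $\theta^0 = 3$ (my choice, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(2)
theta0 = 3.0
n = 100_000
y = rng.chisquare(theta0, size=n)

# invert mu_1(theta) = theta:  estimate by the sample mean
theta_hat_mean = y.mean()
# invert V(y) = 2 theta:  estimate by half the sample variance
theta_hat_var = ((y - y.mean()) ** 2).mean() / 2.0
```

Both estimates are close to 3, but they differ in any finite sample, which motivates combining them (GMM, below).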

Example: Generalized method of moments (GMM). The previous two examples give two estimators of $\theta^0$, both of which are consistent. With a given sample, the estimators will be different in general. With two moment-parameter equations and only one parameter, we have overidentification, which means that we have more information than is strictly necessary for consistent estimation of the parameter. The GMM combines information from the two moment-parameter equations to form a new estimator which will be more efficient, in general (proof of this below).

From the first example, define $m_{1t}(\theta) = \theta - y_t$. We already have that $m_1(\theta)$ is the sample average of the $m_{1t}(\theta)$, i.e.,
$$m_1(\theta) = \frac{1}{n}\sum_{t=1}^{n} m_{1t}(\theta) = \theta - \frac{1}{n}\sum_{t=1}^{n} y_t.$$
Clearly, when evaluated at the true parameter value $\theta^0$, both $E\left[m_{1t}(\theta^0)\right] = 0$ and $E\left[m_1(\theta^0)\right] = 0$.

From the second example we define additional moment conditions
$$m_{2t}(\theta) = 2\theta - \left(y_t - \bar y\right)^2$$
and
$$m_2(\theta) = 2\theta - \frac{1}{n}\sum_{t=1}^{n}\left(y_t - \bar y\right)^2.$$
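One way to combine the two moment conditions is to minimize a quadratic form in the stacked moments, $m(\theta)'Am(\theta)$, with $A$ positive definite. A Python sketch using the identity matrix for $A$ and the same simulated $\chi^2$ setup ($\theta^0 = 3$ is again an arbitrary choice); since there is a single parameter, a fine grid suffices for the minimization:

```python
import numpy as np

rng = np.random.default_rng(3)
theta0 = 3.0
n = 10_000
y = rng.chisquare(theta0, size=n)
ybar = y.mean()
svar = ((y - ybar) ** 2).mean()

def m(theta):
    """stacked moment conditions: m1 = theta - ybar, m2 = 2 theta - sample variance"""
    return np.array([theta - ybar, 2.0 * theta - svar])

def d(theta, A=np.eye(2)):
    """the distance measure d(m) = m' A m"""
    mm = m(theta)
    return mm @ A @ mm

grid = np.linspace(1.0, 5.0, 4001)
theta_gmm = grid[np.argmin([d(t) for t in grid])]
```

With $A = I$ the minimizer works out to $(\bar y + 2\hat v)/5$, a weighted combination of the two MM estimators.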

Again, it is clear from the LLN that $m_2(\theta^0) \stackrel{a.s.}{\to} 0$. The MM estimator would choose $\hat\theta$ to set either $m_1(\hat\theta) = 0$ or $m_2(\hat\theta) = 0$. In general, no single value of $\theta$ will solve the two equations simultaneously.

The GMM estimator is based on defining a measure of distance $d(m(\theta))$, where $m(\theta) = \left(m_1(\theta), m_2(\theta)\right)'$, and choosing
$$\hat\theta = \arg\min_\Theta s_n(\theta) = d\left(m(\theta)\right).$$
An example would be to choose $d(m) = m'Am$, where $A$ is a positive definite matrix. While it's clear that the MM gives consistent estimates if there is a one-to-one relationship between parameters and moments, it's not immediately obvious that the GMM estimator is consistent. (We'll see later that it is.)

These examples show that these widely used estimators may all be interpreted as the solution of an optimization problem, which is why the study of extremum estimators is useful for its generality. We will see that the general results extend smoothly to the more specialized results available for specific estimators.

After studying extremum estimators in general, we will study the GMM estimator, then QML and NLS. The reason we study GMM first is that LS, IV, NLS, MLE, QML and other well-known parametric estimators may all be interpreted as special cases of the GMM estimator, so the general results on GMM can simplify and unify the treatment of these other estimators. Nevertheless, there are some special results on QML and NLS, and both are important in empirical research, which makes focus on them useful.

One of the focal points of the course will be nonlinear models. This is not to suggest that linear models aren't useful.

12. ﬁts this form. theory that applies to nonlinear models also applies to linear models. INTRODUCTION TO THE SECOND HALF 253 they might ﬁrst appear. No linear in the parameters model & Roy’s Identity states that the quantity demanded of the r © ¥ bb © © ¥ G t0P p d ¯ $T m xxb ¥ ¢ m vI m n g¨¥ p m © f ¶ 3 G P tVC ¢ I » P I¢ of goods is 8 f B à f ¥ Â © !d¸ B f ¥ Âà © !d¸ à DC à ÂB f CAB ) B ¤ B I ¯ i P § SI uP $ B¤ & è f . since one can employ nonlinear transformations of the variables: For example. The important point is that the model is linear in the parameters but not necessarily linear in the variables. Example: Binary limited dependent variable B¤ B C for or with a parameter space that is deﬁned independent of the data can ) ò} 'g B ¤ so necessarily and . situations often arise which simply can not be convincingly represented by linear in the parameters models. so one may as well start off with the general case. Example: Expenditure shares An expenditure share is guarantee that either of these conditions holds. which calls into question their appropriateness in cases of this sort. Also. In spite of this generality. These constraints will often be violated by estimated linear models.

The referendum contingent valuation (CV) method of inferring the social value of a project provides a simple example. Individuals are asked if they would pay an amount $A$ for provision of a project. Indirect utility in the base case (no project) is $v^0(m, z) + \varepsilon^0$, where $m$ is income and $z$ is a vector of other variables such as prices, personal characteristics, etc. After provision, utility is $v^1(m, z) + \varepsilon^1$. The random terms $\varepsilon^i$, $i = 0, 1$, reflect variations of preferences in the population. With this, an individual agrees¹ to pay $A$ if
$$\varepsilon^0 - \varepsilon^1 < v^1(m - A, z) - v^0(m, z).$$
Define $\varepsilon = \varepsilon^0 - \varepsilon^1$, let $w$ collect $m$ and $z$, and let $\Delta v(w, A) = v^1(m - A, z) - v^0(m, z)$. Define $y = 1$ if the consumer agrees to pay $A$ for the change, $y = 0$ otherwise. The probability of agreement is

(12.0.1)   $\Pr(y = 1) = F_\varepsilon\left[\Delta v(w, A)\right].$

To make the example specific, suppose that utility depends only on income, preferences in both states are homothetic, a specific distributional assumption is made on the distribution of preferences in the population, and $\varepsilon^0$ and $\varepsilon^1$ are i.i.d. extreme value random variables. This example is a special case of more general discrete choice (or binary response) models.

¹We assume here that responses are truthful, that is, there is no strategic behavior and that individuals are able to order their preferences in this hypothetical situation.

With these assumptions (the details are unimportant here; see articles by D. McFadden if you're interested) it can be shown that
$$\Pr(y = 1) = \Lambda\left(\bar w'\theta\right),$$
where $\Lambda(\cdot)$ is the logistic distribution function
$$\Lambda(z) = \left(1 + e^{-z}\right)^{-1}.$$
This is the simple logit model: the choice probability is the logit function of a linear-in-parameters function.

Now, $y$ is either 0 or 1, and the expected value of $y$ is $\Lambda\left(\bar w'\theta\right)$. Thus, we can write
$$y = \Lambda\left(\bar w'\theta\right) + \eta, \qquad E(\eta) = 0.$$
One could estimate this by (nonlinear) least squares,
$$\hat\theta = \arg\min_\theta \frac{1}{n}\sum_t \left(y_t - \Lambda\left(\bar w_t'\theta\right)\right)^2.$$
The main point is that it is impossible that $\Lambda\left(\bar w'\theta\right)$ can be written as a linear-in-the-parameters model, in the sense that, for arbitrary $\bar w$, there are no $\gamma$ and $\varphi(\bar w)$ such that
$$\Lambda\left(\bar w'\theta\right) = \varphi(\bar w)'\gamma, \ \forall \bar w,$$
where $\varphi(\bar w)$ is a $p$-vector valued function of $\bar w$ and $\gamma$ is a $p$-dimensional parameter. This is because for any $\gamma$ we can always find a $\bar w$ such that $\varphi(\bar w)'\gamma$ will be negative or greater than 1, which is illogical, since it is the expectation of a 0/1 binary random variable. Since this sort of problem occurs often in empirical work, it is useful to study NLS and other nonlinear models.

After discussing these estimation methods for parametric models we'll briefly introduce nonparametric estimation methods. These methods allow one, for example, to estimate $f(x_t)$ consistently when we are not willing to assume that a model of the form
$$y_t = f(x_t) + \varepsilon_t$$
can be restricted to a parametric form
$$y_t = f(x_t, \theta) + \varepsilon_t, \qquad \Pr(\varepsilon_t < z) = F_\varepsilon(z \mid \phi, x_t), \qquad \theta \in \Theta, \ \phi \in \Phi,$$
where $f(\cdot)$ and perhaps $F_\varepsilon(\cdot)$ are of known functional form. This is important since economic theory gives us general information about functions and the signs of their derivatives, but not about their specific form.

Then we'll look at simulation-based methods in econometrics. These methods allow us to substitute computer power for mental power. Since computer power is becoming relatively cheap compared to mental effort, any econometrician who lives by the principles of economic theory should be interested in these techniques.

Finally, we'll look at how econometric computations can be done in parallel on a cluster of computers. This allows us to harness more computational power to work with more complex models than can be dealt with using a desktop computer.

CHAPTER 13

Numeric optimization methods

Readings: Hamilton, ch. 5, section 7 (pp. 133-139); Gourieroux and Monfort, Vol. 1, ch. 13, pp. 443-60; Goffe, et al. (1994).

If we're going to be applying extremum estimators, we'll need to know how to find an extremum. This section gives a very brief introduction to what is a large literature on numeric optimization methods. We'll consider a few well-known techniques, and one fairly new technique that may allow one to solve difficult problems. The main objective is to become familiar with the issues, and to learn how to use the BFGS algorithm at the practical level.

The general problem we consider is how to find the maximizing element $\hat\theta$ (a $K$-vector) of a function $s(\theta)$. This function may not be continuous, and it may not be differentiable. Even if it is twice continuously differentiable, it may not be globally concave, so local maxima, minima and saddlepoints may all exist. Supposing $s(\theta)$ were a quadratic function of $\theta$, e.g.,
$$s(\theta) = a + b'\theta + \frac{1}{2}\theta' C\theta,$$
the first order conditions would be linear:
$$D_\theta s(\theta) = b + C\theta,$$
so the maximizing (minimizing) element would be $\hat\theta = -C^{-1}b$. This is the sort of problem we have with linear models estimated by OLS.

The same is true for feasible GLS, since conditional on the estimate of the varcov matrix, we have a quadratic objective function in the remaining parameters. More general problems will not have linear f.o.c., and we will not be able to solve for the maximizer analytically. This is when we need a numeric optimization method.

13.1. Search

The idea is to create a grid over the parameter space and evaluate the function at each point on the grid. Select the best point. Then refine the grid in the neighborhood of the best point, and continue until the accuracy is "good enough". See Figure 13.2.1. One has to be careful that the grid is fine enough in relationship to the irregularity of the function to ensure that sharp peaks are not missed entirely.

To check $q$ values in each dimension of a $K$-dimensional parameter space, we need to check $q^K$ points. For example, if $q = 100$ and $K = 10$, there would be $100^{10} = 10^{20}$ points to check. If 1000 points can be checked in a second, it would take over $3 \times 10^9$ years to perform the calculations, which is approximately the age of the earth. The search method is a very reasonable choice if $K$ is small, but it quickly becomes infeasible if $K$ is moderate or large.

13.2. Derivative-based methods

13.2.1. Introduction. Derivative-based methods are defined by

(1) the method for choosing the initial value, $\theta^1$;
(2) the iteration method for choosing $\theta^{k+1}$ given $\theta^k$ (based upon derivatives);
(3) the stopping criterion.
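The search method (grid, pick the best point, refine around it, repeat) can be sketched in a few lines; here is a one-dimensional Python illustration with a made-up objective that has a sharp peak:

```python
import numpy as np

def grid_search(s, lo, hi, q=100, rounds=5):
    """Maximize s on [lo, hi]: evaluate a q-point grid, then refine
    the grid in the neighborhood of the best point, `rounds` times."""
    for _ in range(rounds):
        grid = np.linspace(lo, hi, q)
        best = grid[np.argmax(s(grid))]
        width = (hi - lo) / q
        lo, hi = best - width, best + width   # zoom in around the best point
    return best

# an objective with a sharp (non-differentiable) peak at 0.7
s = lambda x: -np.abs(x - 0.7) ** 0.5
x_star = grid_search(s, 0.0, 1.0)
```

Each refinement shrinks the bracketing interval by roughly a factor of $q/2$, so a handful of rounds gives high precision in one dimension; the $q^K$ cost is what kills the method in higher dimensions.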

FIGURE 13.2.1. The search method

The iteration method can be broken into two problems: choosing the stepsize $a^k$ (a scalar) and choosing the direction of movement, $d^k$, which is of the same dimension as $\theta$, so that
$$\theta^{(k+1)} = \theta^{(k)} + a^k d^k.$$
A locally increasing direction of search $d$ is a direction such that
$$\frac{\partial s(\theta + ad)}{\partial a} > 0$$
for $a$ positive but small. That is, if we go in direction $d$, we will improve on the objective function, at least if we don't go too far in that direction.

As long as the gradient at $\theta$ is not zero there exist increasing directions, and they can all be represented as $Q^k g(\theta^k)$, where $Q^k$ is a symmetric positive definite matrix and $g(\theta) = D_\theta s(\theta)$ is the gradient at $\theta$. To see this, take a Taylor series expansion around $a^0 = 0$:
$$s(\theta + ad) = s(\theta + 0d) + (a - 0)\, g(\theta + 0d)'d + o(1) = s(\theta) + a\, g(\theta)'d + o(1).$$
For small enough $a$ the $o(1)$ term can be ignored. If $d$ is to be an increasing direction, we need $g(\theta)'d > 0$. Defining $d = Qg(\theta)$, where $Q$ is positive definite, we guarantee that
$$g(\theta)'d = g(\theta)'Qg(\theta) > 0$$
unless $g(\theta) = 0$. Every increasing direction can be represented in this way (p.d. matrices are those such that the angle between $g$ and $Qg$ is less than 90 degrees). See Figure 13.2.2.

With this, the iteration rule becomes
$$\theta^{(k+1)} = \theta^{(k)} + a^k Q^k g(\theta^k),$$
and we keep going until the gradient becomes zero. The problem is how to choose $a$ and $Q$. Conditional on $Q$, choosing $a$ is fairly straightforward: a simple line search is an attractive possibility, since $a$ is a scalar. The remaining problem is how to choose $Q$. Note also that this gives no guarantees to find a global maximum.

FIGURE 13.2.2. Increasing directions of search

13.2.2. Steepest descent. Steepest descent (ascent if we're maximizing) just sets $Q$ to an identity matrix, since the gradient provides the direction of maximum rate of change of the objective function.

Advantages: fast; doesn't require anything more than first derivatives.

Disadvantages: this doesn't always work too well, however (draw picture of banana function).

13.2.3. Newton-Raphson. The Newton-Raphson method uses information about the slope and curvature of the objective function to determine which direction and how far to move from an initial point. Supposing we're trying to maximize $s_n(\theta)$, take a second order Taylor series approximation of $s_n(\theta)$ about $\theta^k$ (an initial guess):
$$s_n(\theta) \approx s_n(\theta^k) + g(\theta^k)'\left(\theta - \theta^k\right) + \frac{1}{2}\left(\theta - \theta^k\right)' H(\theta^k)\left(\theta - \theta^k\right).$$
To attempt to maximize $s_n(\theta)$, we can maximize the portion of the right-hand side that depends on $\theta$, i.e., we can maximize
$$\tilde s(\theta) = g(\theta^k)'\theta + \frac{1}{2}\left(\theta - \theta^k\right)' H(\theta^k)\left(\theta - \theta^k\right)$$
with respect to $\theta$. This is a much easier problem, since it is a quadratic function in $\theta$, so it has linear first order conditions. These are
$$D_\theta \tilde s(\theta) = g(\theta^k) + H(\theta^k)\left(\theta - \theta^k\right) = 0.$$
So the solution for the next round estimate is
$$\theta^{k+1} = \theta^k - H(\theta^k)^{-1} g(\theta^k).$$
This is illustrated in Figure 13.2.3. However, it's good to include a stepsize, since the approximation to $s_n(\theta)$ may be bad far away from the maximizer $\hat\theta$, so the actual iteration formula is
$$\theta^{k+1} = \theta^k - a^k H(\theta^k)^{-1} g(\theta^k).$$
A potential problem is that the Hessian may not be negative definite when we're far from the maximizing point, so that $-H(\theta^k)^{-1}$ may not be positive definite, and $-H(\theta^k)^{-1} g(\theta^k)$ may not define an increasing direction of search.

This can happen when the objective function has flat regions, in which case the Hessian matrix is very ill-conditioned (e.g., is nearly singular), or when we're in the vicinity of a local minimum, where $H(\theta^k)$ is positive definite, so that our direction is a decreasing direction of search. Matrix inverses computed by computers are subject to large errors when the matrix is ill-conditioned. Also, we certainly don't want to go in the direction of a minimum when we're maximizing. To solve this problem, quasi-Newton methods simply add a positive definite component to $H(\theta)$ to ensure that the resulting matrix is positive definite, e.g., $Q = -H(\theta) + bI$, where $b$ is chosen large enough so that $Q$ is well-conditioned and positive definite.

FIGURE 13.2.3. Newton-Raphson method
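The Newton-Raphson iteration $\theta^{k+1} = \theta^k - H^{-1}g$ is easy to demonstrate on a globally concave scalar example, where none of the safeguards above are needed. Here is an illustrative Python sketch, not from the notes: the average log-likelihood of Poisson data with $\lambda = e^\theta$, whose analytic maximizer is $\hat\theta = \ln\bar y$.

```python
import numpy as np

rng = np.random.default_rng(4)
y = rng.poisson(5.0, size=500)
ybar = y.mean()

def g(theta):
    """gradient of s_n(theta) = (1/n) sum(-exp(theta) + y_t * theta)"""
    return ybar - np.exp(theta)

def H(theta):
    """Hessian; negative everywhere, so s_n is globally concave"""
    return -np.exp(theta)

theta = 0.0                       # initial guess
for _ in range(50):
    theta = theta - g(theta) / H(theta)   # theta^{k+1} = theta^k - H^{-1} g
    if abs(g(theta)) < 1e-12:             # gradient-based stopping criterion
        break
```

Since the objective is globally concave, a unit stepsize is safe here; with non-concave objectives one would add the stepsize $a^k$ and the positive definite correction discussed above.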

This has the benefit that improvement in the objective function is guaranteed. Another variation of quasi-Newton methods is to approximate the Hessian by using successive gradient evaluations. This avoids actual calculation of the Hessian, which is an order of magnitude (in the dimension of the parameter vector) more costly than calculation of the gradient. These updates can be done in a way that ensures that the approximation is p.d. DFP and BFGS are two well-known examples.

Stopping criteria. The last thing we need is to decide when to stop. A digital computer is subject to limited machine precision and round-off errors. For these reasons, it is unreasonable to hope that a program can exactly find the point that maximizes a function. We need to define acceptable tolerances. Some stopping criteria are:

- Negligible change in parameters: $\left|\theta_j^k - \theta_j^{k-1}\right| < \varepsilon_1, \ \forall j$
- Negligible relative change: $\left|\dfrac{\theta_j^k - \theta_j^{k-1}}{\theta_j^{k-1}}\right| < \varepsilon_2, \ \forall j$
- Negligible change of function: $\left|s(\theta^k) - s(\theta^{k-1})\right| < \varepsilon_3$
- Gradient negligibly different from zero: $\left|g_j(\theta^k)\right| < \varepsilon_4, \ \forall j$

Or, even better, check all of these. Also, if we're maximizing, it's good to check that the last round (real, not approximate) Hessian is negative definite.

Starting values. The Newton-Raphson and related algorithms work well if the objective function is concave (when maximizing), but not so well if there are convex regions and local minima or multiple local maxima. The algorithm may converge to a local minimum or to a local maximum that is not optimal. The algorithm may also have difficulties converging at all. The usual way to "ensure" that a global maximum has been found is to use many different starting values, and choose the solution that returns the highest objective function value. THIS IS IMPORTANT in practice. More on this later.

Calculating derivatives. The Newton-Raphson algorithm requires first and second derivatives. It is often difficult to calculate derivatives (especially the Hessian) analytically if the function $s_n(\cdot)$ is complicated. Possible solutions are to calculate derivatives numerically, or to use programs such as MuPAD¹ or Mathematica to calculate analytic derivatives. For example, Figure 13.2.4 shows MuPAD calculating a derivative that I didn't know off the top of my head, and one that I did know.

¹MuPAD is not a freely distributable program, so it's not on the CD. You can download it from http://www.mupad.de/download.shtml

Numeric derivatives are less accurate than analytic derivatives, and are usually more costly to evaluate. Both factors usually cause optimization programs to be less successful when numeric derivatives are used.
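A finite-difference gradient is simple to code and handy for checking hand-coded analytic derivatives; here is an illustrative Python sketch (the function being differentiated is arbitrary):

```python
import numpy as np

def numeric_gradient(f, theta, h=1e-6):
    """central finite-difference gradient, one parameter at a time"""
    theta = np.asarray(theta, dtype=float)
    grad = np.zeros_like(theta)
    for j in range(theta.size):
        e = np.zeros_like(theta)
        e[j] = h
        grad[j] = (f(theta + e) - f(theta - e)) / (2.0 * h)
    return grad

# compare a hand-coded analytic gradient against the numeric one
f = lambda t: np.sin(t[0]) * np.exp(t[1])
analytic = lambda t: np.array([np.cos(t[0]) * np.exp(t[1]),
                               np.sin(t[0]) * np.exp(t[1])])

theta = np.array([0.5, 1.0])
err = np.max(np.abs(numeric_gradient(f, theta) - analytic(theta)))
```

The central difference is accurate to roughly $O(h^2)$ plus rounding error, so agreement to several digits is expected when the analytic code is correct; a large discrepancy almost always signals a bug in the analytic derivative.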

FIGURE 13.2.4. Using MuPAD to get analytic derivatives

One advantage of numeric derivatives is that you don't have to worry about having made an error in calculating the analytic derivative. When programming analytic derivatives it's a good idea to check that they are correct by using numeric derivatives. This is a lesson I learned the hard way when writing my thesis.

Numeric second derivatives are much more accurate if the data are scaled so that the elements of the gradient are of the same order of magnitude. Example: if the model is $y_t = h(\alpha x_t + \beta z_t) + \varepsilon_t$, and estimation is by NLS, suppose that $D_\alpha s_n(\cdot) = 1000$ and $D_\beta s_n(\cdot) = 0.001$. One could define $\alpha^* = \alpha/1000$, $x_t^* = 1000 x_t$, $\beta^* = 1000\beta$ and $z_t^* = z_t/1000$.

In this case, the gradients $D_{\alpha^*} s_n(\cdot)$ and $D_{\beta^*} s_n(\cdot)$ will both be 1. In general, estimation programs always work better if data is scaled in this way, since roundoff errors are less likely to become important. This is important in practice.

There are algorithms (such as BFGS and DFP) that use the sequential gradient evaluations to build up an approximation to the Hessian. The iterations are faster for this reason, since the actual Hessian isn't calculated, but more iterations usually are required for convergence. Switching between algorithms during iterations is sometimes useful.

13.3. Simulated Annealing

Simulated annealing is an algorithm which can find an optimum in the presence of nonconcavities, discontinuities and multiple local minima/maxima. Basically, the algorithm randomly selects evaluation points, accepts all points that yield an increase in the objective function, but also accepts some points that decrease the objective function. This allows the algorithm to escape from local minima. As more and more points are tried, periodically the algorithm focuses on the best point so far, and reduces the range over which random points are generated. Also, the probability that a negative move is accepted reduces. The algorithm relies on many evaluations, as in the search method, but it focuses in on promising areas, which reduces function evaluations with respect to the search method. It does not require derivatives to be evaluated. I have a program to do this if you're interested.
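The idea can be sketched compactly. The following Python toy (my own minimal sketch, not Goffe et al.'s algorithm or the program mentioned above) accepts all uphill moves, accepts downhill moves with a probability that falls as the "temperature" drops, and shrinks the range of the random candidate draws over time; the test objective has a global maximum at 0 and local maxima near multiples of $\pi$:

```python
import numpy as np

def simanneal(s, theta, width=5.0, T=1.0, cooling=0.99, iters=5000, seed=0):
    """Minimal simulated annealing for maximizing a scalar function s."""
    rng = np.random.default_rng(seed)
    val = s(theta)
    best, best_val = theta, val
    for _ in range(iters):
        cand = theta + rng.uniform(-width, width)   # random evaluation point
        cand_val = s(cand)
        dv = cand_val - val
        # accept all uphill moves; downhill moves with probability exp(dv/T)
        if dv > 0 or rng.uniform() < np.exp(dv / T):
            theta, val = cand, cand_val
        if val > best_val:
            best, best_val = theta, val             # track the best point so far
        T *= cooling        # negative moves become less likely over time
        width *= cooling    # the range of candidate draws shrinks over time
    return best

# multimodal objective: global maximum at 0, local maxima near multiples of pi
s = lambda x: np.cos(2.0 * x) - 0.1 * x**2
x_star = simanneal(s, theta=4.0)
```

Started at 4.0, in the basin of a local maximum, the random jumps let the search escape to the global one; a derivative-based method started there would typically get stuck.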

13.4. Examples

This section gives a few examples of how some nonlinear models may be estimated using maximum likelihood.

13.4.1. Discrete Choice: The logit model. In this section we will consider maximum likelihood estimation of the logit model for binary 0/1 dependent variables. We will use the BFGS algorithm to find the MLE.

We saw an example of a binary choice model in equation 12.0.1. A more general representation is
$$y^* = g(x) - \varepsilon$$
$$y = 1(y^* > 0)$$
$$\Pr(y = 1) = F_\varepsilon\left[g(x)\right] \equiv p(x, \theta).$$
The log-likelihood function is
$$s_n(\theta) = \frac{1}{n}\sum_{i=1}^{n}\left(y_i \ln p(x_i, \theta) + (1 - y_i)\ln\left[1 - p(x_i, \theta)\right]\right).$$
For the logit model (see the contingent valuation example above), the probability has the specific form
$$p(x, \theta) = \frac{1}{1 + e^{-x'\theta}}.$$
You should download and examine LogitDGP.m, which generates data according to the logit model, logit.m, which calculates the loglikelihood, and EstimateLogit.m, which sets things up and calls the estimation routine, which uses the BFGS algorithm.
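Those programs are in Octave; as a hedged stand-in, here is the same average log-likelihood in Python, maximized by Newton iterations rather than BFGS (the logit objective is globally concave, so plain Newton is safe here), with the same true parameter $\theta = (0, 1)'$ used in the results below:

```python
import numpy as np

# generate data according to the logit model (the role of LogitDGP.m)
rng = np.random.default_rng(5)
n = 1000
theta_true = np.array([0.0, 1.0])
X = np.column_stack([np.ones(n), rng.normal(size=n)])
p = 1.0 / (1.0 + np.exp(-X @ theta_true))
y = (rng.uniform(size=n) < p).astype(float)

def avg_loglike(theta):
    p = 1.0 / (1.0 + np.exp(-X @ theta))
    return np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

# Newton iterations to find the MLE
theta = np.zeros(2)
for _ in range(25):
    p = 1.0 / (1.0 + np.exp(-X @ theta))
    g = X.T @ (y - p) / n                          # score vector
    H = -(X * (p * (1 - p))[:, None]).T @ X / n    # Hessian: negative definite
    theta = theta - np.linalg.solve(H, g)
```

The in-sample average log-likelihood at the MLE is at least as large as at the true parameter, and the estimates should be within sampling error of $(0, 1)'$.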

Here are some estimation results with $n = 100$ and the true $\theta = (0, 1)'$:

***********************************************
Trial of MLE estimation of Logit model

MLE Estimation Results
BFGS convergence: Normal convergence

Average Log-L: -0.607063
Observations: 100

            estimate   st. err   t-stat    p-value
constant    0.5400     0.2229    2.4224    0.0154
slope       0.7566     0.2374    3.1863    0.0014

Information Criteria
CAIC : 132.6230
BIC  : 130.6230
AIC  : 125.4127
***********************************************

The estimation program is calling mle_results(), which in turn calls a number of other routines. These functions are part of the octave-forge repository.

13.4.2. Count Data: The Poisson model. Demand for health care is usually thought of as a derived demand: health care is an input to a home production function that produces health, and health is an argument of the utility function. Grossman (1972), for example, models health as a capital stock that is subject to depreciation (e.g., the effects of ageing). Health care visits restore the stock, or deal with negative shocks to the stock in the form of accidents or illnesses. Under the home production framework, individuals decide when to make health care visits to maintain their health stock.

As such, individual demand will be a function of the parameters of the individuals' utility functions.

The MEPS health data file, meps1996.data, contains 4564 observations on six measures of health care usage. The data is from the 1996 Medical Expenditure Panel Survey (MEPS). You can get more information at http://www.meps.ahrq.gov/. The six measures of use are office-based visits (OBDV), outpatient visits (OPV), inpatient visits (IPV), emergency room visits (ERV), dental visits (VDV), and number of prescription drugs taken (PRESCR). These form columns 1-6 of meps1996.data. The conditioning variables are public insurance (PUBLIC), private insurance (PRIV), sex (SEX), age (AGE), years of education (EDUC), and income (INCOME). These form columns 7-12 of the file, in the order given here. PRIV and PUBLIC are 0/1 binary variables, where a 1 indicates that the person has access to public or private insurance coverage. SEX is also 0/1, where 1 indicates that the person is female. This data will be used in examples fairly extensively in what follows. The program ExploreMEPS.m shows how the data may be read in, and gives some descriptive information about variables.

All of the measures of use are count data, which means that they take on the values $0, 1, 2, \dots$. It might be reasonable to try to use this information by specifying the density as a count data density. One of the simplest count data densities is the Poisson density, which is
$$f_Y(y) = \frac{e^{-\lambda}\lambda^y}{y!}.$$
The Poisson average log-likelihood function is
$$s_n(\theta) = \frac{1}{n}\sum_{i=1}^{n}\left(-\lambda_i + y_i\ln\lambda_i - \ln y_i!\right).$$
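This average log-likelihood can be maximized by Newton steps; an illustrative Python sketch on simulated data with made-up coefficients (not the MEPS data), using the standard exponential mean function $\lambda_i = \exp(x_i'\beta)$ to keep $\lambda_i$ positive, and dropping the $\ln y_i!$ term since it doesn't involve $\beta$:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 2000
beta_true = np.array([0.5, 0.3])   # arbitrary illustrative values
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = rng.poisson(np.exp(X @ beta_true))

# Newton iterations on s_n(beta) = (1/n) sum(-exp(x'b) + y x'b)
beta = np.zeros(2)
for _ in range(50):
    lam = np.exp(X @ beta)
    g = X.T @ (y - lam) / n              # score: (1/n) sum (y_i - lam_i) x_i
    H = -(X * lam[:, None]).T @ X / n    # Hessian: negative definite
    beta = beta - np.linalg.solve(H, g)
```

At convergence the score is numerically zero and the estimates are within sampling error of the true values.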

We will parameterize the model as
$$\lambda_i = \exp(x_i'\beta),$$
$$x_i = \left[1 \ \ PUBLIC \ \ PRIV \ \ SEX \ \ AGE \ \ EDUC \ \ INC\right]'.$$
This ensures that the mean is positive, as is required for the Poisson model. Note that for this parameterization
$$\beta_j = \frac{\partial\lambda/\partial x_j}{\lambda},$$
so $\beta_j x_j$ is the elasticity of the conditional mean of $y$ with respect to the $j$th conditioning variable.

The program EstimatePoisson.m estimates a Poisson model using the full data set. The results of the estimation, using OBDV as the dependent variable, are here:

MPITB extensions found

OBDV

******************************************************
Poisson model, MEPS 1996 full data set

MLE Estimation Results

BFGS convergence: Normal convergence

Average Log-L: -3.671090
Observations: 4564

             estimate   st. err   t-stat    p-value
constant     -0.791     0.149     -5.290    0.000
pub. ins.     0.848     0.076     11.093    0.000
priv. ins.    0.294     0.071      4.137    0.000
sex           0.487     0.055      8.797    0.000
age           0.024     0.002     11.471    0.000
edu           0.029     0.010      3.061    0.002
inc          -0.000     0.000     -0.978    0.328

Information Criteria
CAIC : 33575.6881    Avg. CAIC: 7.3566
BIC  : 33568.6881    Avg. BIC:  7.3551
AIC  : 33523.3452    Avg. AIC:  7.3452
******************************************************

13.5. Duration data and the Weibull model

In some cases the dependent variable may be the time that passes between the occurrence of two events. For example, it may be the duration of a strike, or the time needed to find a job once one is unemployed. Such variables take on values on the positive real line, and are referred to as duration data.

A spell is the period of time between the occurrence of an initial event and the concluding event. For example, the initial event could be the loss of a job, and the concluding event the finding of a new job: the spell is the period of unemployment. Let $t^0$ be the time the initial event occurs, and $t^1$ be the time the concluding event occurs. For simplicity, assume that time is measured in years. The random variable $D$ is the duration of the spell, $D = t^1 - t^0$. Define the density $f_D(t)$, with distribution function $F_D(t) = \Pr(D < t)$.

Several questions may be of interest. For example, one might wish to know the expected time one has to wait to find a job given that one has already waited $s$ years. The probability that a spell lasts $s$ years is
$$\Pr(D > s) = 1 - \Pr(D \le s) = 1 - F_D(s).$$
The density of $D$ conditional on the spell already having lasted $s$ years is
$$f_D(t \mid D > s) = \frac{f_D(t)}{1 - F_D(s)}.$$
The expected additional time required for the spell to end, given that it has already lasted $s$ years, is the expectation of $D$ with respect to this density, minus $s$:
$$\mathcal{E} = \mathcal{E}(D \mid D > s) - s = \left[\int_s^\infty z\, \frac{f_D(z)}{1 - F_D(s)}\, dz\right] - s.$$
To estimate this function, one needs to specify the density $f_D(t)$ as a parametric density, then estimate by maximum likelihood. There are a number of possibilities, including the exponential density, the lognormal, etc. A reasonably flexible model that is a generalization of the exponential density is the Weibull density,
$$f_D(t \mid \theta) = e^{-(\lambda t)^\gamma}\,\lambda\gamma\left(\lambda t\right)^{\gamma - 1}.$$
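The conditional expectation above is easy to compute by numeric integration. Here is an illustrative Python sketch (the value $\lambda = 0.5$ is arbitrary), checked against the special case $\gamma = 1$, where the Weibull reduces to the exponential distribution; the exponential is memoryless, so the expected remaining duration is $1/\lambda$ at any age $s$:

```python
import numpy as np

def weibull_pdf(t, lam, gam):
    """f_D(t) = exp(-(lam t)^gam) * lam * gam * (lam t)^(gam - 1)"""
    return np.exp(-(lam * t) ** gam) * lam * gam * (lam * t) ** (gam - 1.0)

def expected_remaining(s, lam, gam, t_max=200.0, npts=400_000):
    """E(D | D > s) - s, by midpoint-rule integration of the conditional density"""
    dt = (t_max - s) / npts
    t = s + dt * (np.arange(npts) + 0.5)      # midpoint grid on (s, t_max)
    f = weibull_pdf(t, lam, gam)
    surv = f.sum() * dt                       # 1 - F_D(s), computed numerically
    return ((t - s) * f).sum() * dt / surv

e0 = expected_remaining(0.0, lam=0.5, gam=1.0)   # exponential: should be 1/0.5 = 2
e3 = expected_remaining(3.0, lam=0.5, gam=1.0)   # memoryless: also 2
```

With $\gamma > 1$ the hazard rate is increasing, so the expected remaining duration falls with age, which is the kind of shape fit to the data below.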

According to this model, $\mathcal{E}(D) = \lambda^{-1}\Gamma\left(1 + 1/\gamma\right)$, where $\Gamma(\cdot)$ is the gamma function. The log-likelihood is just the sum of the log densities.

To illustrate application of this model, 402 observations on the lifespan of mongooses in Serengeti National Park (Tanzania) were used to fit a Weibull model. The "spell" in this case is the lifetime of an individual mongoose. The parameter estimates and standard errors are $\hat\lambda = \dots$ and $\hat\gamma = \dots$, and the log-likelihood value is $\dots$. Figure 13.5.1 presents fitted life expectancy (expected additional years of life) as a function of age, with 95% confidence bands. The plot is accompanied by a nonparametric Kaplan-Meier estimate of life-expectancy. This nonparametric estimator simply averages all spell lengths greater than $t$, and then subtracts $t$. This is consistent by the LLN.

In the figure one can see that the model doesn't fit the data well. Mongooses that are between 2 and 6 years old seem to have a lower life expectancy than is predicted by the Weibull model, whereas young mongooses that survive beyond infancy have a higher life expectancy, up to a bit beyond 2 years. For ages 4-6, the nonparametric estimate is outside the confidence interval that results from the parametric model, which casts doubt upon the parametric model. Due to the dramatic change in the death rate as a function of $t$, one might specify $f_D(t)$ as a mixture of two Weibull densities,
$$f_D(t \mid \theta) = \delta\left(e^{-(\lambda_1 t)^{\gamma_1}}\,\lambda_1\gamma_1\left(\lambda_1 t\right)^{\gamma_1 - 1}\right) + (1 - \delta)\left(e^{-(\lambda_2 t)^{\gamma_2}}\,\lambda_2\gamma_2\left(\lambda_2 t\right)^{\gamma_2 - 1}\right).$$

Figure 13.1. Life expectancy of mongooses, Weibull model.

The parameters $\lambda_i$ and $\gamma_i$, $i = 1, 2$, are the parameters of the two Weibull densities, and $\delta$ is the parameter that mixes the two.

With the same data, $\theta$ can be estimated using the mixed model. The results are a log-likelihood $= -623.17$. Note that a standard likelihood ratio test cannot be used to choose between the two models, since under the null that $\delta = 1$ (single density), the parameters $\lambda_2$ and $\gamma_2$ of the second density are not identified. It is possible to take this into account, but this topic is out of the scope of this course. Nevertheless, the improvement in the likelihood function is considerable.
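To make the mixture specification concrete, here is a sketch of the mixed density and its log-likelihood in Python (the parameter values used in the test are placeholders, not the estimates from the mongoose data):

```python
import math

def weibull_pdf(t, lam, gam):
    return math.exp(-(lam * t) ** gam) * lam * gam * (lam * t) ** (gam - 1)

def mixture_pdf(t, lam1, gam1, lam2, gam2, delta):
    """delta-weighted mixture of two Weibull densities, 0 <= delta <= 1."""
    return delta * weibull_pdf(t, lam1, gam1) + (1 - delta) * weibull_pdf(t, lam2, gam2)

def loglik(durations, params):
    """Log-likelihood: the sum of the log mixture densities."""
    lam1, gam1, lam2, gam2, delta = params
    return sum(math.log(mixture_pdf(t, lam1, gam1, lam2, gam2, delta))
               for t in durations)
```

Since the mixture is a convex combination of two proper densities, it integrates to one for any admissible $\delta$, which is the property that makes mixtures a convenient way to add flexibility without leaving the class of densities.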

The parameter estimates are:

Parameter   Estimate   St. Error
$\lambda_1$   0.233      0.016
$\gamma_1$    1.722      0.166
$\lambda_2$   0.731      0.101
$\gamma_2$    1.522      0.096
$\delta$      0.428      0.035

Note that the mixture parameter is highly significant. This model leads to the fit in Figure 13.2. Note that the parametric and nonparametric fits are quite close to one another, up to around 6 years. The disagreement after this point is not too important, since less than 5% of mongooses live more than 6 years, which implies that the Kaplan-Meier nonparametric estimate has a high variance there (since it's an average of a small number of observations).

Mixture models are often an effective way to model complex responses, though they can suffer from overparameterization.

13.6. Numeric optimization: pitfalls

In this section we'll examine two common problems that can be encountered when doing numeric optimization of nonlinear models, and some solutions.

13.6.1. Poor scaling of the data. When the data is scaled so that the magnitudes of the first and second derivatives are of different orders, problems can easily result. If we uncomment the appropriate line in EstimatePoisson.m, the data will not be scaled, and the estimation program will have difficulty converging (it seems to take an infinite amount of time). With unscaled data, the elements of the score vector have very different magnitudes at the initial value of $\theta$ (all zeros).
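The scaling problem is easy to reproduce. The sketch below (Python; the course's EstimatePoisson.m and CheckScore.m are Octave programs, and the data here are made up for illustration) computes the Poisson score at the all-zeros starting vector for a badly scaled and a rescaled regressor:

```python
import math

def poisson_score(X, y, beta):
    """Score of the average Poisson log-likelihood:
    (1/n) * sum_i x_i * (y_i - exp(x_i'beta))."""
    n, k = len(X), len(X[0])
    g = [0.0] * k
    for xi, yi in zip(X, y):
        lam = math.exp(sum(b * x for b, x in zip(beta, xi)))
        for j in range(k):
            g[j] += xi[j] * (yi - lam)
    return [gj / n for gj in g]

# hypothetical data: an intercept plus a regressor measured in "thousands"
raw = [0.0, 250.0, 500.0, 750.0, 1000.0]
y = [1, 2, 3, 4, 5]
X_raw = [[1.0, x] for x in raw]
X_scaled = [[1.0, x / 1000.0] for x in raw]  # rescale the regressor to [0, 1]

g_raw = poisson_score(X_raw, y, [0.0, 0.0])
g_scaled = poisson_score(X_scaled, y, [0.0, 0.0])
```

With the raw regressor, the two score elements differ by nearly three orders of magnitude; after scaling they are comparable, which is much friendlier to gradient-based optimizers such as BFGS.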

With unscaled data, one element of the gradient is very large, and the maximum and minimum elements are 5 orders of magnitude apart. This causes convergence problems due to serious numerical inaccuracy when doing inversions to calculate the BFGS direction of search. To see this, run CheckScore.m. With scaled data, none of the elements of the gradient are very large, and the maximum difference in orders of magnitude is 3. Convergence is quick. Poor scaling of the data, and some solutions, will come up again; alternatives will be discussed later.

Figure 13.2. Life expectancy of mongooses, mixed Weibull model.

13.6.2. Multiple optima. Multiple optima (one global, others local) can complicate life, since we have limited means of determining if there is a higher maximum than the one we're at.

Think of climbing a mountain in an unknown range, in a very foggy place (Figure 13.3). You can go up until there's nowhere else to go up, but since you're in the fog you don't know if the true summit is across the gap that's at your feet. Do you claim victory and go home, or do you trudge down the gap and explore the other side?

Figure 13.3. A foggy mountain.

The best way to avoid stopping at a local maximum is to use many starting values, for example on a grid, or randomly generated. Or perhaps one might have priors about possible values for the parameters (e.g., from previous studies of similar data).
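A minimal sketch of the grid/random multistart strategy, using a crude numerical-gradient descent as the local optimizer and a made-up one-dimensional objective with two local minima (this is not the foggy mountain function used in the course programs):

```python
import random

def f(x):
    # two local minima: near x = -1 (global) and near x = +1 (local)
    return (x * x - 1.0) ** 2 + 0.3 * x

def local_min(x, step=1e-3, iters=20000, h=1e-6):
    """Crude gradient descent with a central-difference numerical derivative."""
    for _ in range(iters):
        grad = (f(x + h) - f(x - h)) / (2 * h)
        x -= step * grad
    return x

# a single run from a bad start value stops at the local minimum near +1
x_bad = local_min(0.9)

# multistart: many random start values, keep the best local minimum found
random.seed(1)
candidates = [local_min(random.uniform(-2.0, 2.0)) for _ in range(20)]
x_best = min(candidates, key=f)
```

The single run gets trapped in the wrong basin of attraction, while the battery of random starts finds the global minimum, which is exactly the behavior illustrated with BFGS and simulated annealing below.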

The program FoggyMountain.m shows that poor start values can lead to problems. It uses SA, which finds the true global minimum, and it shows that BFGS using a battery of random start values can also find the global minimum. Let's try to find the true minimizer of minus 1 times the foggy mountain function (since the algorithms are set up to minimize). From the picture, you can see the minimizer is close to $(0.037, 0)$, but let's pretend there is fog, and that we don't know that. The output of one run is here:

MPITB extensions found
======================================================
BFGSMIN final results

Used numeric gradient
------------------------------------------------------
STRONG CONVERGENCE
Function conv 1  Param conv 1  Gradient conv 1
------------------------------------------------------
Objective function value -0.0130329
Stepsize 0.102833
43 iterations
------------------------------------------------------

param      gradient    change
15.9999    -0.0000     0.0000
-28.8119    0.0000     0.0000

================================================
SAMIN final results
NORMAL CONVERGENCE

Func. tol. 1.000000e-10    Param. tol. 1.000000e-03
Obj. fn. value -0.100023

parameter    search width
 0.037419    0.000051
-0.000000    0.000018
================================================

The result with poor start values:

ans =

   16.000  -28.812

The true maximizer is near $(0.037, 0)$. Now try a battery of random start values and a short BFGS on each, then iterate to convergence. The result using 20 random start values:

ans =

   3.7417e-02   2.7628e-07

In that run, the single BFGS run with bad start values converged to a point far from the true minimizer, while simulated annealing and BFGS using a battery of random start values both found the true maximizer. The moral of the story is: be cautious, and don't publish your results too quickly.

Exercises

(1) In Octave, type "help bfgsmin_example", to find out the location of the file. Edit the file to examine it and learn how to call bfgsmin. Run it, and examine the output.
(2) In Octave, type "help samin_example", to find out the location of the file. Edit the file to examine it and learn how to call samin. Run it, and examine the output.
(3) Using logit.m and EstimateLogit.m as templates, write a function to calculate the probit log-likelihood, and a script to estimate a probit model. Run it using data that actually follows a logit model (you can generate it in the same way that is done in the logit example).
(4) Study mle_results.m to see what it does. Examine the functions that mle_results.m calls, and in turn the functions that those functions call. Write a complete description of how the whole chain works.
(5) Look at the Poisson estimation results for the OBDV measure of health care use and give an economic interpretation. Estimate Poisson models for the other 5 measures of health care usage.

Ch. Vol. Ch. 24 Amemiya. Given the model 1Â) 1Â) © d f W¥ f ¢eC¤ the are -vectors and is ﬁnite. deﬁne ! 8 8 R © BR B ¨¥ HB # f 9# © d f £¥ f 47Ä¤C¤ . “Large Sample Estimation and Hypothesis Testing. Extremum estimators d In Deﬁnition 12. Ch.1. pp. 36. Ch. 591-96.” in Handbook of Econometrics. Gallant. Newey and McFadden (1994).1 . ' X1 depend upon a random matrix f# bb ¢ I f R r uxxb # g# n ¢W © d¥ f 4eS9¤ element of an objective function over a set .0. 2.CHAPTER 14 Asymptotic properties of extremum estimators Readings: Gourieroux and Monfort (1995). 4. 14. Davidson and MacKinnon. 4 section 4. Vol. Let the objective function where with observations.1 we deﬁned an extremum estimator x as the optimizing The OLS estimator minimizes 283 8 YW ¤ where and are deﬁned similarly to 1 d ¢ pj¤ I f B ¢ ©d BR B ¥ f B G Þd BR kÈB f P E XAMPLE 18. 3.

14.2. Consistency

The following theorem is patterned on a proof in Gallant (1987) (the article, ref. later), which we'll see in its original form later in the course. It is interesting to compare the following proof with Amemiya's Theorem 4.1.1, which is done in terms of convergence in probability.

THEOREM 19. [Consistency of extremum estimators] Suppose that $\hat\theta_n$ is obtained by maximizing $s_n(\theta)$ over $\Theta$. Assume

(1) Compactness: The parameter space $\Theta$ is an open subset of Euclidean space $\mathbb{R}^K$. The closure of $\Theta$, $\bar\Theta$, is compact.
(2) Uniform Convergence: There is a nonstochastic function $s_\infty(\theta)$ that is continuous in $\theta$ on $\bar\Theta$ such that
$$\lim_{n\to\infty}\sup_{\theta\in\bar\Theta}\left| s_n(\theta) - s_\infty(\theta)\right| = 0, \text{ a.s.}$$
(3) Identification: $s_\infty(\cdot)$ has a unique global maximum at $\theta^0\in\Theta$, i.e., $s_\infty(\theta^0) > s_\infty(\theta)$, $\forall\theta\neq\theta^0$, $\theta\in\bar\Theta$.

Then $\hat\theta_n \stackrel{a.s.}{\longrightarrow} \theta^0$.

Proof: Select a $\omega\in\Omega$ and hold it fixed. Then $\{s_n(\omega,\theta)\}$ is a fixed sequence of functions. Suppose that $\omega$ is such that $s_n(\theta)$ converges uniformly to $s_\infty(\theta)$. This happens with probability one by assumption (2). The sequence $\{\hat\theta_n\}$ lies in the compact set $\bar\Theta$, by assumption (1) and the fact that maximization is over $\Theta\subset\bar\Theta$. Since every sequence from a compact set has at least one limit point (Davidson, Thm. 2.12), say that $\hat\theta$ is a limit point of $\{\hat\theta_n\}$. There is a subsequence $\{\hat\theta_{n_m}\}$ ($\{n_m\}$ is simply a sequence of increasing integers) with $\lim_{m\to\infty}\hat\theta_{n_m} = \hat\theta$.

By uniform convergence and continuity,
$$\lim_{m\to\infty} s_{n_m}\left(\hat\theta_{n_m}\right) = s_\infty\left(\hat\theta\right).$$
To see this, first of all, select an element $\hat\theta_t$ from the sequence $\{\hat\theta_{n_m}\}$. Then uniform convergence implies
$$\lim_{m\to\infty} s_{n_m}\left(\hat\theta_t\right) = s_\infty\left(\hat\theta_t\right).$$
Continuity of $s_\infty(\cdot)$ implies that
$$\lim_{t\to\infty} s_\infty\left(\hat\theta_t\right) = s_\infty\left(\hat\theta\right),$$
since the limit as $t\to\infty$ of $\{\hat\theta_t\}$ is $\hat\theta$. So the above claim is true.

Next, by maximization,
$$s_{n_m}\left(\hat\theta_{n_m}\right) \geq s_{n_m}\left(\theta^0\right),$$
which holds in the limit, so
$$\lim_{m\to\infty} s_{n_m}\left(\hat\theta_{n_m}\right) \geq \lim_{m\to\infty} s_{n_m}\left(\theta^0\right).$$
However,
$$\lim_{m\to\infty} s_{n_m}\left(\hat\theta_{n_m}\right) = s_\infty\left(\hat\theta\right),$$
as seen above, and
$$\lim_{m\to\infty} s_{n_m}\left(\theta^0\right) = s_\infty\left(\theta^0\right)$$
by uniform convergence, so
$$s_\infty\left(\hat\theta\right) \geq s_\infty\left(\theta^0\right).$$

14. though the identiﬁcation assumption 8 © d¥ f He¼x¤ We assume that is in fact a global maximum of may be a non- v p d d © p Ra ¤ d ¥ ¡ © d¥ e¾G¤ x (c) Identiﬁcation: Any point in d f d 8p xgd maximum at An equivalent way to state this is with must have It is not re- The ² ÷ û p d 8 © û ¥ ª ² g Ý all . x tion assumption) has not been used to prove consistency. since so far we have held 8p d we must have and Finally. Therefore has only one limit point. so we could reason that we assume it’s in the interior here is that this is necessary for subsequent proof of asymptotic normality. CONSISTENCY 286 with Discussion of the proof: This proof relies on the identiﬁcation assumption of a unique global which matches the way we will write the assumption in the section on nonparametric inference. for clarity. there is a unique global maximum of at so d d¥ g© p ea ¤ 4d a ¤ © ¥ f d . Ý almost surely. all of the above limits hold ﬁxed. requires that the limiting objective function have a unique maximizing argument.1. and I’d like to maintain a minimal set of simple assumptions. but now we need to consider except on a set p 9d © d¥ 4RaI¤ But by assumption (3).4 for a case where discontinuity leads to breakdown of consistency. The next section on numeric optimization methods will trivial problem. See Amemiya’s Example 4. Parameters on the boundary of the parameter set cause theoretical difﬁculties that we 8 ¾x p d directly assume that p d The assumption that is in the interior of (part of the identiﬁca- is simply an element of a compact set © d¥ f ey9¤ show that actually ﬁnding the global maximum of 1 quired to be unique for ﬁnite.2.

will not deal with in this course. Just note that conventional hypothesis testing methods do not apply in this case.

- Note that $s_n(\theta)$ is not required to be continuous, though $s_\infty(\theta)$ is.
- The following figures illustrate why uniform convergence is important. With uniform convergence, the maximum of the sample objective function eventually must be in the neighborhood of the maximum of the limiting objective function.

With pointwise convergence, the sample objective function may have its maximum far away from that of the limiting objective function.

We need a uniform strong law of large numbers in order to verify assumption (2) of Theorem 19. The following theorem is from Davidson, pg. 337.

THEOREM 20. [Uniform Strong LLN] Let $\{G_n(\theta)\}$ be a sequence of stochastic real-valued functions on a totally-bounded metric space $(\Theta,\rho)$. Then
$$\sup_{\theta\in\Theta}\left| G_n(\theta)\right| \stackrel{a.s.}{\longrightarrow} 0$$
if and only if
(a) $G_n(\theta) \stackrel{a.s.}{\longrightarrow} 0$ for each $\theta\in\Theta_0$, where $\Theta_0$ is a dense subset of $\Theta$, and
(b) $\{G_n(\theta)\}$ is strongly stochastically equicontinuous.

- The metric space we are interested in now is simply $\Theta\subset\mathbb{R}^K$, using the Euclidean norm.
- The pointwise almost sure convergence needed for assumption (a) comes from one of the usual SLLN's.

- Stronger assumptions that imply those of the theorem are:
  - the parameter space is compact (this has already been assumed)
  - the objective function is continuous and bounded with probability one on the entire parameter space
  - a standard SLLN can be shown to apply to some point in the parameter space
- These are reasonable conditions in many cases, and henceforth when dealing with specific estimators we'll simply assume that pointwise almost sure convergence can be extended to uniform almost sure convergence in this way.
- The more general theorem is useful in the case that the limiting objective function can be continuous in $\theta$ even if $s_n(\theta)$ is discontinuous. This can happen because discontinuities may be smoothed out as we take expectations over the data. In the section on simulation-based estimation we will see a case of a discontinuous objective function.

14.3. Example: Consistency of Least Squares

We suppose that data is generated by random sampling of $(y, w)$, where $y_t = \alpha^0 + \beta^0 w_t + \varepsilon_t$. $(w_t,\varepsilon_t)$ has the common distribution function $\mu_w\mu_\varepsilon$ ($w$ and $\varepsilon$ are independent) with support $\mathcal{W}\times\mathcal{E}$. Suppose that the variances $\sigma_w^2$ and $\sigma_\varepsilon^2$ are finite. Let $\theta^0 = (\alpha^0,\beta^0)'\in\Theta$, for which $\Theta$ is compact. Let $x_t = (1, w_t)'$, so we can write $y_t = x_t'\theta^0 + \varepsilon_t$. The sample objective function for a sample size $n$ is

$$(14.3.1)\qquad s_n(\theta) = \frac{1}{n}\sum_{t=1}^n\left(y_t - x_t'\theta\right)^2 = \frac{1}{n}\sum_{t=1}^n\left(x_t'\theta^0 + \varepsilon_t - x_t'\theta\right)^2
= \frac{1}{n}\sum_{t=1}^n\left(x_t'\theta^0 - x_t'\theta\right)^2 + \frac{2}{n}\sum_{t=1}^n\left(x_t'\theta^0 - x_t'\theta\right)\varepsilon_t + \frac{1}{n}\sum_{t=1}^n\varepsilon_t^2.$$

- Considering the last term, by the SLLN,
$$\frac{1}{n}\sum_{t=1}^n\varepsilon_t^2 \stackrel{a.s.}{\longrightarrow} \int_{\mathcal{E}}\varepsilon^2 d\mu_\varepsilon = \sigma_\varepsilon^2.$$
- Considering the second term, since $E(\varepsilon) = 0$ and $w$ and $\varepsilon$ are independent, the SLLN implies that it converges to zero.
- Finally, for the first term, for a given $\theta$, we assume that a SLLN applies so that
$$\frac{1}{n}\sum_{t=1}^n\left(x_t'\theta^0 - x_t'\theta\right)^2 \stackrel{a.s.}{\longrightarrow} \left(\alpha^0 - \alpha\right)^2 + 2\left(\alpha^0-\alpha\right)\left(\beta^0-\beta\right)\int_{\mathcal{W}} w\, d\mu_w + \left(\beta^0-\beta\right)^2\int_{\mathcal{W}} w^2 d\mu_w.$$

Thus, the objective function converges to
$$s_\infty(\theta) = \left(\alpha^0-\alpha\right)^2 + 2\left(\alpha^0-\alpha\right)\left(\beta^0-\beta\right)E(w) + \left(\beta^0-\beta\right)^2 E\left(w^2\right) + \sigma_\varepsilon^2.$$
A minimizer of this is clearly $\alpha = \alpha^0$, $\beta = \beta^0$.

EXERCISE 21. Show that in order for the above solution to be unique it is necessary that $\int_{\mathcal{W}} w^2 d\mu_w \neq \left(\int_{\mathcal{W}} w\, d\mu_w\right)^2$. Discuss the relationship between this condition and the problem of colinearity of regressors.

Finally, the objective function is clearly continuous, and the parameter space is assumed to be compact, so the convergence is also uniform.
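A quick Monte Carlo check of this consistency result, as a pure-Python sketch with arbitrary true values $\alpha^0 = 1$, $\beta^0 = 2$ and uniform regressors (the course would do this in Octave):

```python
import random

random.seed(42)
alpha0, beta0 = 1.0, 2.0
n = 100000
w = [random.uniform(0.0, 1.0) for _ in range(n)]
eps = [random.gauss(0.0, 1.0) for _ in range(n)]
y = [alpha0 + beta0 * wi + ei for wi, ei in zip(w, eps)]

# closed-form OLS for the simple regression
wbar = sum(w) / n
ybar = sum(y) / n
beta_hat = sum((wi - wbar) * (yi - ybar) for wi, yi in zip(w, y)) / \
           sum((wi - wbar) ** 2 for wi in w)
alpha_hat = ybar - beta_hat * wbar
```

With $n = 100000$ the estimates land within a few hundredths of the true values; the sampling error shrinks at rate $1/\sqrt{n}$, which anticipates the next section on asymptotic normality.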

This example shows that Theorem 19 can be used to prove strong consistency of the OLS estimator. There are easier ways to show this, of course; this is only an example of application of the theorem.

14.4. Asymptotic Normality

A consistent estimator is oftentimes not very useful unless we know how fast it is likely to be converging to the true value, and the probability that it is far away from the true value. Establishment of asymptotic normality with a known scaling factor solves these two problems. The following theorem is similar to Amemiya's Theorem 4.1.3 (pg. 111).

THEOREM 22. [Asymptotic normality of extremum estimators] In addition to the assumptions of Theorem 19, assume

(a) $\mathcal{J}_n(\theta) \equiv D^2_\theta s_n(\theta)$ exists and is continuous in an open, convex neighborhood of $\theta^0$.
(b) $\left\{\mathcal{J}_n(\theta_n)\right\} \stackrel{a.s.}{\longrightarrow} \mathcal{J}_\infty(\theta^0)$, a finite negative definite matrix, for any sequence $\{\theta_n\}$ that converges almost surely to $\theta^0$.
(c) $\sqrt{n} D_\theta s_n(\theta^0) \stackrel{d}{\longrightarrow} N\left[0, \mathcal{I}_\infty(\theta^0)\right]$, where $\mathcal{I}_\infty(\theta^0) = \lim_{n\to\infty}\mathrm{Var}\,\sqrt{n} D_\theta s_n(\theta^0)$.

Then
$$\sqrt{n}\left(\hat\theta - \theta^0\right) \stackrel{d}{\longrightarrow} N\left[0, \mathcal{J}_\infty(\theta^0)^{-1}\mathcal{I}_\infty(\theta^0)\mathcal{J}_\infty(\theta^0)^{-1}\right].$$

Proof: By Taylor expansion:
$$D_\theta s_n\left(\hat\theta_n\right) = D_\theta s_n\left(\theta^0\right) + D^2_\theta s_n\left(\theta^*\right)\left(\hat\theta - \theta^0\right),$$
where $\theta^* = \lambda\hat\theta + (1-\lambda)\theta^0$, $0 \leq \lambda \leq 1$.

- Note that $\hat\theta$ will be in the neighborhood where $D^2_\theta s_n(\theta)$ exists with probability one as $n$ becomes large, by consistency.

- Now the l.h.s. of this equation is zero, at least asymptotically, since $\hat\theta_n$ is a maximizer and the f.o.c. must hold exactly since the limiting objective function is strictly concave in a neighborhood of $\theta^0$.
- Also, since $\theta^*$ is between $\hat\theta_n$ and $\theta^0$, and since $\hat\theta_n \stackrel{a.s.}{\longrightarrow}\theta^0$, assumption (b) gives $D^2_\theta s_n(\theta^*) \stackrel{a.s.}{\longrightarrow} \mathcal{J}_\infty(\theta^0)$.

So
$$0 = D_\theta s_n\left(\theta^0\right) + \left[\mathcal{J}_\infty(\theta^0) + o_p(1)\right]\left(\hat\theta - \theta^0\right)$$
and
$$0 = \sqrt{n}\, D_\theta s_n\left(\theta^0\right) + \left[\mathcal{J}_\infty(\theta^0) + o_p(1)\right]\sqrt{n}\left(\hat\theta - \theta^0\right),$$
so
$$\sqrt{n}\left(\hat\theta - \theta^0\right) = -\left[\mathcal{J}_\infty(\theta^0) + o_p(1)\right]^{-1}\sqrt{n}\, D_\theta s_n\left(\theta^0\right).$$
Because of assumption (c), the $o_p(1)$ term is asymptotically irrelevant next to $\mathcal{J}_\infty(\theta^0)$, so
$$\sqrt{n}\left(\hat\theta - \theta^0\right) \stackrel{d}{\longrightarrow} -\mathcal{J}_\infty(\theta^0)^{-1}\sqrt{n}\, D_\theta s_n\left(\theta^0\right),$$
and, by assumption (c) and the formula for the variance of a linear combination of r.v.'s,
$$\sqrt{n}\left(\hat\theta - \theta^0\right) \stackrel{d}{\longrightarrow} N\left[0, \mathcal{J}_\infty(\theta^0)^{-1}\mathcal{I}_\infty(\theta^0)\mathcal{J}_\infty(\theta^0)^{-1}\right].$$

- Assumption (b) is not implied by the Slutsky theorem. The Slutsky theorem says that $g(x_n) \to g(x)$ if $x_n \to x$ and $g(\cdot)$ is continuous at $x$. However, in our case $\mathcal{J}_n(\cdot)$ is a function of $n$. A theorem which applies (Amemiya, Ch. 4) is

THEOREM 23. If $g_n(\theta)$ converges uniformly almost surely to a nonstochastic function $g_\infty(\theta)$ on an open neighborhood of $\theta^0$, then $g_n(\hat\theta) \stackrel{a.s.}{\longrightarrow} g_\infty(\theta^0)$ if $g_\infty(\cdot)$ is continuous at $\theta^0$ and $\hat\theta \stackrel{a.s.}{\longrightarrow} \theta^0$.

- To apply this to the second derivatives, sufficient conditions would be that the second derivatives be strongly stochastically equicontinuous on a neighborhood of $\theta^0$, and that an ordinary LLN applies to the derivatives when evaluated at $\theta\in N(\theta^0)$.
- Stronger conditions that imply this are as above: continuous and bounded second derivatives in a neighborhood of $\theta^0$.
- Skip this in lecture. A note on the order of these matrices: supposing that $s_n(\theta)$ is representable as an average of $n$ terms, which is the case for all estimators we consider, $D^2_\theta s_n(\theta)$ is also an average of $n$ matrices, the elements of which are not centered (they do not have zero expectation). Supposing a SLLN applies, the almost sure limit of $D^2_\theta s_n(\theta^0)$ is $\mathcal{J}_\infty(\theta^0) = O(1)$, as we saw in Example 51. On the other hand, assumption (c), $\sqrt{n} D_\theta s_n(\theta^0) \stackrel{d}{\longrightarrow} N\left[0,\mathcal{I}_\infty(\theta^0)\right]$, means that
$$\sqrt{n} D_\theta s_n\left(\theta^0\right) = O_p(1),$$
where we use the result of Example 49. If we were to omit the $\sqrt{n}$, we'd have
$$D_\theta s_n\left(\theta^0\right) = n^{-1/2} O_p(1) = O_p\left(n^{-1/2}\right),$$
where we use the fact that $O_p(n^r)O_p(n^q) = O_p(n^{r+q})$. The sequence $D_\theta s_n(\theta^0)$ is centered, so we need to scale by $\sqrt{n}$ to avoid convergence to zero.

14.5. Examples

14.5.1. Binary response models. Binary response models arise in a variety of contexts. We've already seen a logit model. Another simple example is a probit threshold-crossing model. Assume that
$$y^* = x'\beta - \varepsilon$$
$$y = 1(y^* > 0)$$
$$\varepsilon \sim N(0,1).$$
Here, $y^*$ is an unobserved (latent) continuous variable, and $y$ is a binary variable that indicates whether $y^*$ is negative or positive. Then the probit model results, where
$$\Pr(y = 1\mid x) = \Pr(\varepsilon < x'\beta) = \Phi(x'\beta),$$
where $\Phi(\cdot)$ is the standard normal distribution function.

In general, a binary response model will require that the choice probability be parameterized in some form. For a vector of explanatory variables $x$, the response probability will be parameterized in some manner:
$$\Pr(y = 1\mid x) = p(x,\theta).$$
If we have a logit model, $p(x,\theta) = \Lambda(x'\theta) = \left[1 + \exp(-x'\theta)\right]^{-1}$. If we have a probit model, $p(x,\theta) = \Phi(x'\theta)$.

Regardless of the parameterization, we are dealing with a Bernoulli density,
$$f_{Y_i}(y_i\mid x_i) = p(x_i,\theta)^{y_i}\left[1 - p(x_i,\theta)\right]^{1-y_i},$$
so as long as the observations are independent, the maximum likelihood (ML) estimator, $\hat\theta$, is the maximizer of

$$(14.5.1)\qquad s_n(\theta) = \frac{1}{n}\sum_{i=1}^n\left( y_i\ln p(x_i,\theta) + (1-y_i)\ln\left[1 - p(x_i,\theta)\right]\right) \equiv \frac{1}{n}\sum_{i=1}^n s(y_i,x_i,\theta).$$

Following the above theoretical results, $\hat\theta$ tends in probability to the $\theta^0$ that maximizes the uniform almost sure limit of $s_n(\theta)$. Noting that $E(y_i\mid x_i) = p(x_i,\theta^0)$, and following a SLLN for i.i.d. processes, $s_n(\theta)$ converges almost surely to the expectation of a representative term $s(y,x,\theta)$. First one can take the expectation conditional on $x$:
$$E_{y|x}\left\{ y\ln p(x,\theta) + (1-y)\ln\left[1 - p(x,\theta)\right]\right\} = p(x,\theta^0)\ln p(x,\theta) + \left[1 - p(x,\theta^0)\right]\ln\left[1 - p(x,\theta)\right].$$
Next taking expectation over $x$ we get the limiting objective function

$$(14.5.2)\qquad s_\infty(\theta) = \int_{\mathcal{X}}\left\{ p(x,\theta^0)\ln p(x,\theta) + \left[1 - p(x,\theta^0)\right]\ln\left[1 - p(x,\theta)\right]\right\}\mu(x)dx,$$

where $\mu(x)$ is the (joint; the integral is understood to be multiple, and $\mathcal{X}$ is the support of $x$) density function of the explanatory variables $x$. This is clearly continuous in $\theta$, as long as $p(x,\theta)$ is continuous, and if the parameter space is compact we therefore have uniform almost sure convergence. Note that $p(x,\theta)$ is continuous for the logit and probit models, for example.
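A compact sketch of the ML estimator for a one-coefficient logit model, maximizing the log-likelihood by Newton's method on the score (toy hypothetical data, pure Python; the course estimates these models in Octave):

```python
import math

def Lam(z):
    """Logistic CDF."""
    return 1.0 / (1.0 + math.exp(-z))

def logit_ml(x, y, iters=50):
    """Newton's method on the scalar logit score sum_i x_i*(y_i - Lam(b*x_i))."""
    b = 0.0
    for _ in range(iters):
        score = sum(xi * (yi - Lam(b * xi)) for xi, yi in zip(x, y))
        hess = -sum(xi * xi * Lam(b * xi) * (1.0 - Lam(b * xi)) for xi in x)
        b -= score / hess
    return b

# toy data, not perfectly separated (otherwise the MLE diverges)
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [0, 1, 0, 1, 1]
b_hat = logit_ml(x, y)
```

The log-likelihood is globally concave in the coefficient for the logit model, so Newton's method converges reliably from the zero start value, and the score is (numerically) zero at $\hat\theta$, which is the first order condition exploited in the consistency argument below.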

Question: what's needed to ensure that the solution is unique?

The maximizing element of $s_\infty(\theta)$, $\theta^*$, solves the first order conditions
$$\int_{\mathcal{X}}\left\{\frac{p(x,\theta^0)}{p(x,\theta^*)}\frac{\partial}{\partial\theta}p(x,\theta^*) - \frac{1 - p(x,\theta^0)}{1 - p(x,\theta^*)}\frac{\partial}{\partial\theta}p(x,\theta^*)\right\}\mu(x)dx = 0.$$
This is clearly solved by $\theta^* = \theta^0$. Provided the solution is unique, $\hat\theta$ is consistent.

The asymptotic normality theorem tells us that
$$\sqrt{n}\left(\hat\theta - \theta^0\right)\stackrel{d}{\longrightarrow} N\left[0, \mathcal{J}_\infty(\theta^0)^{-1}\mathcal{I}_\infty(\theta^0)\mathcal{J}_\infty(\theta^0)^{-1}\right].$$
In the case of i.i.d. observations, $\mathcal{I}_\infty(\theta^0) = \lim_{n\to\infty}\mathrm{Var}\,\sqrt{n}D_\theta s_n(\theta^0)$ is simply the expectation of a typical element of the outer product of the gradient.

- There's no need to subtract the mean, since it's zero, following the f.o.c. in the consistency proof above and the fact that observations are i.i.d.
- The terms in $n$ also drop out by the same argument:
$$\lim_{n\to\infty}\mathrm{Var}\,\sqrt{n}\,D_\theta s_n(\theta^0) = \lim_{n\to\infty}\mathrm{Var}\,\sqrt{n}\,\frac{1}{n}\sum_t D_\theta s(y_t,x_t,\theta^0) = \lim_{n\to\infty}\frac{1}{n}\mathrm{Var}\sum_t D_\theta s(y_t,x_t,\theta^0) = \mathrm{Var}\, D_\theta s(y,x,\theta^0).$$

So we get
$$\mathcal{I}_\infty(\theta^0) = E\left\{\frac{\partial}{\partial\theta}s(y,x,\theta^0)\,\frac{\partial}{\partial\theta'}s(y,x,\theta^0)\right\}.$$
Likewise,
$$\mathcal{J}_\infty(\theta^0) = E\left\{\frac{\partial^2}{\partial\theta\partial\theta'}s(y,x,\theta^0)\right\}.$$
Expectations are jointly over $y$ and $x$, or equivalently, first over $y$ conditional on $x$, then over $x$.

Now suppose that we are dealing with a correctly specified logit model:
$$p(x,\theta) = \left[1 + \exp(-x'\theta)\right]^{-1}.$$
We can simplify the above results in this case. We have that

$$(14.5.3)\qquad \frac{\partial}{\partial\theta}p(x,\theta) = \left[1+\exp(-x'\theta)\right]^{-2}\exp(-x'\theta)\,x = p(x,\theta)\left[1 - p(x,\theta)\right]x = \left[p(x,\theta) - p(x,\theta)^2\right]x.$$
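Because $y$ is Bernoulli, the information matrix equality derived below can be verified exactly at the level of a single observation: the conditional expectation of the squared score equals minus the expected Hessian. A sketch, with scalar $x$ and $\theta$ for simplicity (the input values are arbitrary):

```python
import math

def Lam(z):
    return 1.0 / (1.0 + math.exp(-z))

def check_info_equality(x, theta):
    """E_y[score^2 | x] vs -E_y[hessian | x] for a scalar logit model."""
    p = Lam(x * theta)
    # score of ln f(y|x) is x*(y - p); average over y in {0,1} with probs (1-p, p)
    e_score_sq = (1 - p) * (x * (0 - p)) ** 2 + p * (x * (1 - p)) ** 2
    # hessian is -x^2 * p * (1-p), which does not depend on y
    minus_e_hess = x * x * p * (1 - p)
    return e_score_sq, minus_e_hess

a, b = check_info_equality(1.7, 0.5)
```

Algebraically, $(1-p)p^2 + p(1-p)^2 = p(1-p)$, so both quantities equal $x^2 p(1-p)$, and the two return values agree to machine precision for any $x$ and $\theta$.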

With this,
$$(14.5.4)\qquad \frac{\partial}{\partial\theta}s(y,x,\theta) = \left[y - p(x,\theta)\right]x$$
$$(14.5.5)\qquad \frac{\partial^2}{\partial\theta\partial\theta'}s(y,x,\theta) = -\left[p(x,\theta) - p(x,\theta)^2\right]xx'.$$
Taking expectations over $y$ then $x$ gives
$$\mathcal{I}_\infty(\theta^0) = E_x\left\{\left[p(x,\theta^0) - p(x,\theta^0)^2\right]xx'\right\},$$
where we use the fact that $E_{y|x}\left[y - p(x,\theta^0)\right]^2 = p(x,\theta^0)\left[1 - p(x,\theta^0)\right]$. Likewise,
$$(14.5.6)\qquad \mathcal{J}_\infty(\theta^0) = -E_x\left\{\left[p(x,\theta^0) - p(x,\theta^0)^2\right]xx'\right\}.$$
Note that we arrive at the expected result: the information matrix equality holds (that is, $\mathcal{J}_\infty(\theta^0) = -\mathcal{I}_\infty(\theta^0)$). With this,
$$\sqrt{n}\left(\hat\theta - \theta^0\right) \stackrel{d}{\longrightarrow} N\left[0, \mathcal{J}_\infty(\theta^0)^{-1}\mathcal{I}_\infty(\theta^0)\mathcal{J}_\infty(\theta^0)^{-1}\right]$$
simplifies to
$$\sqrt{n}\left(\hat\theta - \theta^0\right) \stackrel{d}{\longrightarrow} N\left[0, -\mathcal{J}_\infty(\theta^0)^{-1}\right],$$
which can also be expressed as
$$\sqrt{n}\left(\hat\theta - \theta^0\right) \stackrel{d}{\longrightarrow} N\left[0, \mathcal{I}_\infty(\theta^0)^{-1}\right].$$

On a final note, the logit and standard normal CDF's are very similar; the logit distribution is a bit more fat-tailed. While coefficients will vary slightly between the two models, functions of interest such as estimated probabilities $p(x,\hat\theta)$ will be virtually identical for the two models.

14.6. Example: Linearization of a nonlinear model

Ref. Gourieroux and Monfort, section 8.3. White, Intn'l Econ. Rev. 1980 is an earlier reference.

Suppose we have a nonlinear model
$$y_i = h(x_i,\theta^0) + \varepsilon_i,$$
where $\varepsilon_i \sim iid(0,\sigma^2)$. The nonlinear least squares estimator solves
$$\hat\theta_n = \arg\min \frac{1}{n}\sum_{i=1}^n\left(y_i - h(x_i,\theta)\right)^2.$$
We'll study this more later, but for now it is clear that the foc for minimization will require solving a set of nonlinear equations. A common approach to the problem seeks to avoid this difficulty by linearizing the model. A first order Taylor's series expansion about the point $x_0$ with remainder gives
$$y_i = h(x_0,\theta^0) + (x_i - x_0)'\frac{\partial h(x_0,\theta^0)}{\partial x} + \nu_i,$$
where $\nu_i$ encompasses both $\varepsilon_i$ and the Taylor's series remainder. Note that $\nu_i$ is no longer a classical error; its mean is not zero. We should expect problems.

Define
$$\alpha^* = h(x_0,\theta^0) - x_0'\frac{\partial h(x_0,\theta^0)}{\partial x}$$
$$\beta^* = \frac{\partial h(x_0,\theta^0)}{\partial x}.$$
Given this, one might try to estimate $\alpha^*$ and $\beta^*$ by applying OLS to
$$y_i = \alpha + \beta x_i + \nu_i.$$

- Question: will $\hat\alpha$ and $\hat\beta$ be consistent for $\alpha^*$ and $\beta^*$?

- The answer is no, as one can see by interpreting $\hat\alpha$ and $\hat\beta$ as extremum estimators. Let $\gamma = (\alpha,\beta')'$. The objective function
$$s_n(\gamma) = \frac{1}{n}\sum_{i=1}^n\left(y_i - \alpha - \beta x_i\right)^2$$
converges to its expectation,
$$s_n(\gamma) \stackrel{u.a.s.}{\longrightarrow} s_\infty(\gamma) = E_x E_{y|x}\left(y - \alpha - \beta x\right)^2,$$
and $\hat\gamma$ converges a.s. to the $\gamma^0$ that minimizes $s_\infty(\gamma)$:
$$\gamma^0 = \arg\min E_x E_{y|x}\left(y - \alpha - \beta x\right)^2.$$
Noting that, since cross products involving $\varepsilon$ drop out,
$$E_x E_{y|x}\left(y - \alpha - \beta x\right)^2 = E_x E_{y|x}\left(h(x,\theta^0) + \varepsilon - \alpha - \beta x\right)^2 = \sigma^2 + E_x\left[h(x,\theta^0) - \alpha - \beta x\right]^2,$$
we see that $\alpha^0$ and $\beta^0$ correspond to the hyperplane that is closest to the true regression function $h(x,\theta^0)$ according to the mean squared error criterion. This depends on both the shape of $h(\cdot)$ and the density function of the conditioning variables.
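The divergence between the tangent line and the best linear predictor can be computed directly. The sketch below uses a hypothetical $h(x) = e^x$ with $x$ uniform on $(0,1)$ (this function is an illustration, not a model from the text), computing the population moments by simple quadrature:

```python
import math

def trapz(g, a, b, n=100000):
    """Trapezoid-rule integral of g on [a, b]."""
    h = (b - a) / n
    s = 0.5 * (g(a) + g(b))
    for i in range(1, n):
        s += g(a + i * h)
    return s * h

h_fun = math.exp          # hypothetical nonlinear regression function
Ex  = trapz(lambda x: x, 0.0, 1.0)              # E(x)     = 1/2
Ex2 = trapz(lambda x: x * x, 0.0, 1.0)          # E(x^2)   = 1/3
Eh  = trapz(h_fun, 0.0, 1.0)                    # E h(x)   = e - 1
Exh = trapz(lambda x: x * h_fun(x), 0.0, 1.0)   # E x*h(x) = 1

# best linear predictor: the plim of the OLS coefficients
beta_star = (Exh - Ex * Eh) / (Ex2 - Ex ** 2)
alpha_star = Eh - beta_star * Ex

# tangent line at the approximation point x0 = 1/2
x0 = 0.5
beta_tan = h_fun(x0)                    # slope exp(x0)
alpha_tan = h_fun(x0) - beta_tan * x0   # intercept
```

The plim of the OLS slope is about 1.69, while the tangent slope at $x_0$ is about 1.65: OLS converges to the MSE-closest hyperplane under the density of $x$, not to the Taylor approximation.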

[Figure: inconsistency of the linear approximation. The plot shows the true function $h(x,\theta)$, the tangent line at the approximation point $x_0$, and the fitted (OLS) line.]

It is clear that the tangent line does not minimize MSE, since, for example, if $h(x,\theta)$ is concave, all errors between the tangent line and the true function are negative.

- Note that the true underlying parameter $\theta^0$ is not estimated consistently, either (it may be of a different dimension than the dimension of the parameter of the approximating model, which is 2 in this example).
- Second order and higher-order approximations suffer from exactly the same problem, though to a less severe degree, of course. For this reason, translog, Generalized Leontiev and other "flexible functional forms" based upon second-order approximations in general suffer from bias and inconsistency. The bias may not be too important for analysis of conditional means, but it can be very important for analyzing first and second derivatives. In production and consumer analysis, first and second derivatives (e.g., elasticities of substitution) are often

of interest, so in this case, one should be cautious of unthinking application of models that impose strong restrictions on second derivatives.

- This sort of linearization about a long run equilibrium is a common practice in dynamic macroeconomic models. It is justified for the purposes of theoretical analysis of a model given the model's parameters, but it is not justifiable for the estimation of the parameters of the model using data. The section on simulation-based methods offers a means of obtaining consistent estimators of the parameters of dynamic macro models that are too complex for standard methods of analysis.

Exercises

(1) Suppose that $x_t \sim$ uniform(0,1), and $y_t = 1 - x_t^2 + \varepsilon_t$, where $\varepsilon_t$ is iid(0,$\sigma^2$) and is independent of $x_t$. Suppose we estimate the misspecified model $y_t = \alpha + \beta x_t + \eta_t$ by OLS. Find the numeric values of $\alpha^0$ and $\beta^0$ that are the probability limits of $\hat\alpha$ and $\hat\beta$.
(2) Verify your results using Octave by generating data that follows the above model, and calculating the OLS estimator. When the sample size is very large the estimator should be very close to the analytical results you obtained in question 1.
(3) Use the asymptotic normality theorem to find the asymptotic distribution of the ML estimator of $\beta^0$ for the model $y = x\beta^0 + \varepsilon$, where $\varepsilon \sim N(0,1)$ and is independent of $x$. This means finding $\frac{\partial^2}{\partial\beta^2}s_n(\beta)$, $\mathcal{J}(\beta^0)$, $\frac{\partial s_n(\beta)}{\partial\beta}$, and $\mathcal{I}(\beta^0)$.

CHAPTER 15

Generalized method of moments (GMM)

Readings: Hamilton Ch. 14; Davidson and MacKinnon, Ch. 17 (see pg. 587 for refs. to applications); Newey and McFadden (1994), "Large Sample Estimation and Hypothesis Testing," in Handbook of Econometrics, Vol. 4, Ch. 36.

15.1. Definition

We've already seen one example of GMM in the introduction, based upon the t-distribution. Consider the following example based upon the same distribution. The density function of a t-distributed r.v. $y_t$ is
$$f_{y_t}\left(y_t,\theta^0\right) = \frac{\Gamma\left[\left(\theta^0+1\right)/2\right]}{\left(\pi\theta^0\right)^{1/2}\Gamma\left(\theta^0/2\right)}\left[1 + \left(y_t^2/\theta^0\right)\right]^{-\left(\theta^0+1\right)/2}.$$
Given an iid sample of size $n$, one could estimate $\theta^0$ by maximizing the log-likelihood function
$$\hat\theta \equiv \arg\max_\Theta \ln L_n(\theta) = \sum_{t=1}^n \ln f_{y_t}(y_t,\theta).$$

- This approach is attractive since ML estimators are asymptotically efficient. This is because the ML estimator uses all of the available information (e.g., the distribution is fully specified up to a parameter). Recalling that a distribution is completely characterized by its moments, the ML estimator is interpretable as a GMM estimator that uses all of

the moments. The method of moments estimator uses only $K$ moments to estimate a $K$-dimensional parameter. Since information is discarded, in general, by the MM estimator, efficiency is lost relative to the ML estimator.

Continuing with the example, a t-distributed r.v. with density $f_{y_t}(y_t,\theta^0)$ has mean zero and variance $V(y_t) = \theta^0/\left(\theta^0 - 2\right)$ (for $\theta^0 > 2$).

Using the notation introduced previously, define a moment condition $m_{1t}(\theta) = \theta/(\theta - 2) - y_t^2$ and $m_1(\theta) = \frac{1}{n}\sum_{t=1}^n m_{1t}(\theta) = \theta/(\theta - 2) - \frac{1}{n}\sum_{t=1}^n y_t^2$. As before, when evaluated at the true parameter value $\theta^0$, both $E_{\theta^0}\left[m_{1t}(\theta^0)\right] = 0$ and $E_{\theta^0}\left[m_1(\theta^0)\right] = 0$.

Choosing $\hat\theta$ to set $m_1(\hat\theta) = 0$ yields a MM estimator:

$$(15.1.1)\qquad \hat\theta = \frac{2}{1 - \frac{n}{\sum_t y_t^2}}.$$

This estimator is based on only one moment of the distribution; it uses less information than the ML estimator, so it is intuitively clear that the MM estimator will be inefficient relative to the ML estimator.

An alternative MM estimator could be based upon the fourth moment of the t-distribution. The fourth moment of a t-distributed r.v. is
$$\mu_4 \equiv E\left(y_t^4\right) = \frac{3\left(\theta^0\right)^2}{\left(\theta^0-2\right)\left(\theta^0-4\right)},$$
provided $\theta^0 > 4$. We can define a second moment condition
$$m_2(\theta) = \frac{3\theta^2}{(\theta-2)(\theta-4)} - \frac{1}{n}\sum_{t=1}^n y_t^4.$$
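The two just-identified MM estimators are one-liners. A sketch (Python), with the fourth-moment version solved by bisection since it has no convenient closed form (the function names are illustrative, not from the course materials):

```python
def mm_from_variance(v_hat):
    """Solve theta/(theta-2) = v_hat for theta (requires v_hat > 1)."""
    return 2.0 * v_hat / (v_hat - 1.0)

def mm_from_fourth_moment(m4_hat, lo=4.001, hi=1000.0, tol=1e-10):
    """Solve 3*theta^2/((theta-2)*(theta-4)) = m4_hat by bisection."""
    g = lambda th: 3.0 * th * th / ((th - 2.0) * (th - 4.0)) - m4_hat
    # g is strictly decreasing in theta on (4, inf), so plain bisection works
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

Applied to the same sample, `mm_from_variance` on the sample second moment and `mm_from_fourth_moment` on the sample fourth moment will generally disagree, which is exactly the overidentification that motivates the GMM estimator defined next.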

A second, different MM estimator chooses $\hat\theta$ to set $m_2(\hat\theta) = 0$. If you solve this you'll see that the estimate is different from that in equation 15.1.1.

This estimator isn't efficient either, since it uses only one moment. A GMM estimator would use the two moment conditions together to estimate the single parameter. The GMM estimator is overidentified, which leads to an estimator which is efficient relative to the just identified MM estimators (more on efficiency later).

- As before, set $m_n(\theta) = \left(m_1(\theta), m_2(\theta)\right)'$. The $n$ subscript is used to indicate the sample size. Note that $m_n(\theta^0) = O_p\left(n^{-1/2}\right)$, since it is an average of centered random variables, whereas $m_n(\theta) = O_p(1)$, $\theta \neq \theta^0$, where expectations are taken using the true distribution with parameter $\theta^0$. This is the fundamental reason that GMM is consistent.
- A GMM estimator requires defining a measure of distance, $d\left(m_n(\theta)\right)$. A popular choice (for reasons noted below) is to set $d\left(m_n(\theta)\right) = m_n(\theta)'W_n m_n(\theta)$, and we minimize $s_n(\theta) = m_n(\theta)'W_n m_n(\theta)$. We assume $W_n$ converges to a finite positive definite matrix.
- In general, assume we have $g$ moment conditions, so $m_n(\theta)$ is a $g$-vector and $W_n$ is a $g\times g$ matrix.

For the purposes of this course, the following definition of the GMM estimator is sufficiently general:

DEFINITION 24. The GMM estimator of the $K$-dimensional parameter vector $\theta^0$ is $\hat\theta \equiv \arg\min_\Theta s_n(\theta) \equiv m_n(\theta)'W_n m_n(\theta)$, where $m_n(\theta) = \frac{1}{n}\sum_{t=1}^n m(y_t,\theta)$ is a $g$-vector, $g \geq K$, with $E_{\theta^0}\, m(y_t,\theta^0) = 0$, and $W_n$ converges almost surely to a finite $g\times g$ symmetric positive definite matrix $W_\infty$.
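Definition 24 is easy to operationalize for the t-distribution example: stack the two moment conditions, choose $W = I$, and minimize. A sketch using a crude grid search (the "sample" moments below are hypothetical, set to their population values at $\theta^0 = 5$, so the minimizer should be close to 5):

```python
def m_n(theta, m2_hat, m4_hat):
    """Stacked moment conditions for the t-distribution example (g = 2, K = 1)."""
    return [theta / (theta - 2.0) - m2_hat,
            3.0 * theta ** 2 / ((theta - 2.0) * (theta - 4.0)) - m4_hat]

def s_n(theta, m2_hat, m4_hat):
    """GMM objective m_n' W m_n with W = identity."""
    m = m_n(theta, m2_hat, m4_hat)
    return m[0] ** 2 + m[1] ** 2

# hypothetical sample moments, equal to the population moments at theta0 = 5
m2_hat = 5.0 / 3.0   # theta/(theta-2) at theta = 5
m4_hat = 25.0        # 3*theta^2/((theta-2)(theta-4)) at theta = 5

grid = [4.2 + i * 0.001 for i in range(25801)]  # theta in [4.2, 30]
theta_hat = min(grid, key=lambda th: s_n(th, m2_hat, m4_hat))
```

With $g = 2$ moment conditions and $K = 1$ parameter the model is overidentified, and in a real sample $s_n(\hat\theta)$ would generally be strictly positive; here it is essentially zero only because the sample moments were set to their population values.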

What's the reason for using GMM if MLE is asymptotically efficient?

- Robustness: GMM is based upon a limited set of moment conditions. For consistency, only these moment conditions need to be correctly specified, whereas MLE in effect requires correct specification of every conceivable moment condition. GMM is robust with respect to distributional misspecification. The price for robustness is loss of efficiency with respect to the MLE estimator. Keep in mind that the true distribution is not known, so if we erroneously specify a distribution and estimate by MLE, the estimator will be inconsistent in general (not always).
- Feasibility: in some cases the MLE estimator is not available, because we are not able to deduce the likelihood function. The GMM estimator may still be feasible even though MLE is not possible. More on this in the section on simulation-based estimation.

15.2. Consistency

We simply assume that the assumptions of Theorem 19 hold, so the GMM estimator is strongly consistent. The only assumption that warrants additional comment is that of identification. In Theorem 19, the third assumption reads:

(c) Identification: $s_\infty(\cdot)$ has a unique global maximum at $\theta_0$, i.e., $s_\infty(\theta_0) > s_\infty(\theta)$, $\forall \theta \neq \theta_0$.

Taking the case of the quadratic objective function $s_n(\theta) = m_n(\theta)' W_n m_n(\theta)$, first consider $m_n(\theta)$. Applying a uniform law of large numbers, we get $m_n(\theta) \stackrel{a.s.}{\to} m_\infty(\theta)$. Since $E_{\theta_0} m_n(\theta_0) = 0$ by assumption, $m_\infty(\theta_0) = 0$, and therefore $s_\infty(\theta_0) = m_\infty(\theta_0)' W_\infty m_\infty(\theta_0) = 0$. So, for asymptotic identification, we need that $m_\infty(\theta) \neq 0$ for $\theta \neq \theta_0$, for at least some element of the vector.

This and the assumption that $W_n \stackrel{a.s.}{\to} W_\infty$, a finite positive definite matrix, guarantee that $\theta_0$ is asymptotically identified. Note that asymptotic identification does not rule out the possibility of lack of identification for a given data set — there may be multiple minimizing solutions in finite samples.

15.3. Asymptotic normality

We also simply assume that the conditions of Theorem 22 hold, so we will have asymptotic normality. However, we do need to find the structure of the asymptotic variance-covariance matrix of the estimator. From Theorem 22, we have
$$\sqrt{n}\left(\hat\theta - \theta_0\right) \stackrel{d}{\to} N\left[0,\; J_\infty(\theta_0)^{-1}\, I_\infty(\theta_0)\, J_\infty(\theta_0)^{-1}\right],$$
where $J_\infty(\theta_0)$ is the almost sure limit of $\frac{\partial^2}{\partial\theta\,\partial\theta'} s_n(\theta)$ evaluated at $\theta_0$, and $I_\infty(\theta_0) = \lim_{n\to\infty} \mathrm{Var}\, \sqrt{n}\, \frac{\partial}{\partial\theta} s_n(\theta_0)$. We need to determine the form of these matrices given the objective function $s_n(\theta) = m_n(\theta)' W_n m_n(\theta)$.

Now, using the product rule from the introduction,
$$\frac{\partial}{\partial\theta} s_n(\theta) = 2\left[\frac{\partial}{\partial\theta} m_n'(\theta)\right] W_n\, m_n(\theta).$$
Define the $K \times g$ matrix $D_n(\theta) \equiv \frac{\partial}{\partial\theta} m_n'(\theta)$, so
$$(15.3.1) \qquad \frac{\partial}{\partial\theta} s(\theta) = 2\, D\, W\, m(\theta).$$
(Note that $s_n(\theta)$, $D_n$, $W_n$ and $m_n$ all depend on the sample size $n$, but the subscript is omitted to unclutter the notation.)

To take second derivatives, let $D_i$ be the $i$-th row of $D(\theta)$. Using the product rule, we have
$$\frac{\partial}{\partial\theta'}\frac{\partial}{\partial\theta_i} s(\theta) = 2\, D_i\, W\, D' + 2\, m'\, W \left[\frac{\partial}{\partial\theta'} D_i'\right].$$
When evaluating the term $2\, m(\theta)'\, W \left[\frac{\partial}{\partial\theta'} D_i'\right]$ at $\theta_0$, assume that $\frac{\partial}{\partial\theta'} D_i'$ satisfies a LLN, so that it converges almost surely to a finite limit. In this case, we have
$$2\, m(\theta_0)'\, W \left[\frac{\partial}{\partial\theta'} D_i'\right] \stackrel{a.s.}{\to} 0,$$
since $m(\theta_0) \stackrel{a.s.}{\to} 0$ and $W \stackrel{a.s.}{\to} W_\infty$. Stacking these results over the $K$ rows of $D$, we get
$$\lim_{n\to\infty} \frac{\partial^2}{\partial\theta\,\partial\theta'} s_n(\theta_0) = J_\infty(\theta_0) = 2\, D_\infty\, W_\infty\, D_\infty' \quad (a.s.),$$
where we define $\lim D_n = D_\infty$, a.s., and $\lim W_n = W_\infty$, a.s. (we assume a LLN holds).

With regard to $I_\infty(\theta_0)$, following equation 15.3.1, and noting that the scores $D_n W_n m(\theta_0)$ have mean zero at $\theta_0$ (since $E\, m(\theta_0) = 0$ by assumption), we have
$$I_\infty(\theta_0) = \lim_{n\to\infty} \mathrm{Var}\, \sqrt{n}\, \frac{\partial}{\partial\theta} s_n(\theta_0) = \lim_{n\to\infty} E\left[4 n\, D_n W_n\, m(\theta_0)\, m(\theta_0)'\, W_n D_n'\right].$$

Now, given that $m(\theta_0)$ is an average of centered (mean-zero) quantities, it is reasonable to expect a CLT to apply, after multiplication by $\sqrt{n}$. Assuming this,
$$\sqrt{n}\, m(\theta_0) \stackrel{d}{\to} N(0, \Omega_\infty),$$
where $\Omega_\infty = \lim_{n\to\infty} E\left[n\, m(\theta_0)\, m(\theta_0)'\right]$. Using this, and the last equation, we get
$$I_\infty(\theta_0) = 4\, D_\infty\, W_\infty\, \Omega_\infty\, W_\infty\, D_\infty'.$$
Using these results, the asymptotic normality theorem gives us the asymptotic distribution of the GMM estimator for an arbitrary weighting matrix:
$$\sqrt{n}\left(\hat\theta - \theta_0\right) \stackrel{d}{\to} N\left[0,\; \left(D_\infty W_\infty D_\infty'\right)^{-1} D_\infty W_\infty \Omega_\infty W_\infty D_\infty' \left(D_\infty W_\infty D_\infty'\right)^{-1}\right],$$
where $D_\infty$ must have full row rank, $\rho(D_\infty) = K$.

15.4. Choosing the weighting matrix

$W$ is a weighting matrix, which determines the relative importance of violations of the individual moment conditions. For example, if we are much more sure of the first moment condition, which is based upon the variance, than of the second, which is based upon the fourth moment, we could set
$$W = \begin{bmatrix} a & 0 \\ 0 & b \end{bmatrix}$$

ü already inefﬁcient w. 8 xU ª ô with much larger than In this case.t. consider the linear model 8I ªR ª I G ª § ª f ª P 8 g© ² j ¾G ¥ ç © d¥ f eS deﬁned by .g.4. matrix . Since moments are not independent. errors in the second moment condition may be a random. so it may not data dependent matrix. we might like to choose the ü We have already seen that the choice of will inﬂuence the asymp- ü be desirable to set the off-diagonal elements to 0.r. totic distribution of the GMM estimator. in general. we should expect that there be a correlation between the moment conditions. since (Note: we use for both nonsingular). Since the GMM estimator is to make the GMM estimator efﬁcient within the class of GMM estimators where lation. the optimal weighting matrix is seen to be the in- © § ¨¥ G q§ ¯¨¥ © f jective function Interpreting I © 8w ¥ ¥ © G¥ © ¥ R ª é ª 4G ª é minimizes the ob- ² ' 8© § ¨¥ I ² R § ¨¥ f © f G ª i§ ª f ª P w I wI ¥ ¥ 8 f ± R ª I R© ª ¥ I ¯ª R ª I © ª R ª ¥ ª R ª ² ª ª ² be the Cholesky factorization of e. This result carries over to GMM estimation. The OLS estimator of the model as moment conditions (note that they do have zero expectation when verse of the covariance matrix of the moment conditions.15. This means that the transformed model is efﬁcient. he have heteroscedasticity and autocorre- Then the model satisﬁes the classical assumptions of homoscedasticity and nonautocorrelation. (Note: this presentation of GLS is not p § evaluated at ). Let That is. MLE. G P v 7Ö§ R ¦ f To provide a little intuition. CHOOSING THE WEIGHTING MATRIX 311 have less weight in the objective function.

THEOREM 25. If $\hat\theta$ is a GMM estimator that minimizes $m_n(\theta)' W_n m_n(\theta)$, the asymptotic variance of $\hat\theta$ will be minimized by choosing $W_n$ so that $W_n \stackrel{a.s.}{\to} W_\infty = \Omega_\infty^{-1}$, where $\Omega_\infty = \lim_{n\to\infty} E\left[n\, m(\theta_0)\, m(\theta_0)'\right]$.

Proof: For $W_\infty = \Omega_\infty^{-1}$, the asymptotic variance
$$\left(D_\infty W_\infty D_\infty'\right)^{-1} D_\infty W_\infty \Omega_\infty W_\infty D_\infty' \left(D_\infty W_\infty D_\infty'\right)^{-1}$$
simplifies to $\left(D_\infty \Omega_\infty^{-1} D_\infty'\right)^{-1}$. Now, consider the difference of the inverses of the variances when $W = \Omega^{-1}$ versus when $W$ is some arbitrary positive definite matrix:
$$D_\infty \Omega_\infty^{-1} D_\infty' - \left(D_\infty W_\infty D_\infty'\right)\left[D_\infty W_\infty \Omega_\infty W_\infty D_\infty'\right]^{-1}\left(D_\infty W_\infty D_\infty'\right)$$
$$= D_\infty \Omega_\infty^{-1/2}\left[I - \Omega_\infty^{1/2} W_\infty D_\infty'\left(D_\infty W_\infty \Omega_\infty W_\infty D_\infty'\right)^{-1} D_\infty W_\infty \Omega_\infty^{1/2}\right]\Omega_\infty^{-1/2} D_\infty'.$$
The term in brackets is idempotent, as can be verified by multiplication, and is therefore positive semidefinite. A quadratic form in a positive semidefinite matrix is also positive semidefinite, so the difference of the inverses of the variances is positive semidefinite. This implies that the difference of the variances is negative semidefinite, which proves the theorem.

The result of the theorem allows us to treat
$$\hat\theta \approx N\left(\theta_0,\; \frac{\left(D_\infty \Omega_\infty^{-1} D_\infty'\right)^{-1}}{n}\right),$$
where the $\approx$ means "approximately distributed as." To operationalize this we need estimators of $D_\infty$ and $\Omega_\infty$.

- The obvious estimator of $D_\infty$ is simply $\frac{\partial}{\partial\theta} m_n'(\hat\theta)$, which is consistent by the consistency of $\hat\theta$, assuming that $\frac{\partial}{\partial\theta} m_n'$ is continuous in $\theta$. Stochastic equicontinuity results can give us this result even if $\frac{\partial}{\partial\theta} m_n'$ is not continuous. We now turn to estimation of $\Omega_\infty$.

15.5. Estimation of the variance-covariance matrix

(See Hamilton Ch. 10, pp. 261-2 and 280-84.)

In the case that we wish to use the optimal weighting matrix, we need an estimate of $\Omega_\infty$, the limiting variance-covariance matrix of $\sqrt{n}\, m_n(\theta_0)$. In general, we expect that:

- $m_t$ will be autocorrelated ($\Gamma_{ts} = E(m_t m_{t-s}') \neq 0$), since the individual moment conditions will not in general be independent of one another. Note that this autocovariance will not depend on $t$ if the moment conditions are covariance stationary.
- the moment conditions will be contemporaneously correlated ($E(m_{it} m_{jt}) \neq 0$);
- and they will have different variances ($E(m_{it}^2) = \sigma_{it}^2$).

While one could estimate $\Omega_\infty$ parametrically, we in general have little information upon which to base a parametric specification.

Since we would need to estimate so many components if we were to take the parametric approach, it is unlikely that we would arrive at a correct parametric specification. For this reason, recent research has focused on consistent nonparametric estimators of $\Omega_\infty$.

Henceforth we assume that $m_t$ is covariance stationary (the covariance between $m_t$ and $m_{t-s}$ does not depend on $t$). Define the $v$-th autocovariance of the moment conditions $\Gamma_v = E(m_t m_{t-v}')$. Note that $E(m_t m_{t+v}') = \Gamma_v'$. Recall that $m_t$ and $m_n$ are functions of $\theta$, so for now assume that we have some consistent estimator of $\theta_0$, so that $\hat m_t = m_t(\hat\theta)$. Now
$$\Omega_n = E\left[n\, m(\theta_0)\, m(\theta_0)'\right] = E\left[\frac{1}{n}\left(\sum_{t=1}^n m_t\right)\left(\sum_{t=1}^n m_t'\right)\right] = \Gamma_0 + \frac{n-1}{n}\left(\Gamma_1 + \Gamma_1'\right) + \frac{n-2}{n}\left(\Gamma_2 + \Gamma_2'\right) + \cdots + \frac{1}{n}\left(\Gamma_{n-1} + \Gamma_{n-1}'\right).$$
A natural, consistent estimator of $\Gamma_v$ is
$$\hat\Gamma_v = \frac{1}{n}\sum_{t=v+1}^n \hat m_t\, \hat m_{t-v}'$$
(you might use $n - v$ in the denominator instead). So, a natural, but inconsistent, estimator of $\Omega_\infty$ would be
$$\hat\Omega = \hat\Gamma_0 + \frac{n-1}{n}\left(\hat\Gamma_1 + \hat\Gamma_1'\right) + \cdots + \frac{1}{n}\left(\hat\Gamma_{n-1} + \hat\Gamma_{n-1}'\right).$$

This estimator is inconsistent in general, since the number of autocovariances to estimate grows more rapidly than $n$, so information does not build up as $n \to \infty$.

On the other hand, supposing that $\Gamma_v$ tends to zero sufficiently rapidly as $v$ tends to $\infty$, a modified estimator
$$\hat\Omega = \hat\Gamma_0 + \sum_{v=1}^{q(n)}\left(\hat\Gamma_v + \hat\Gamma_v'\right),$$
where $q(n) \to \infty$ as $n \to \infty$, will be consistent, provided $q(n)$ grows sufficiently slowly. The term $\frac{n-v}{n}$ can be dropped because $q(n)$ must be $o(n)$. This allows information to accumulate at a rate that satisfies a LLN.

- A disadvantage of this estimator is that it may not be positive definite. This could cause one to calculate a negative $\chi^2$ statistic, for example!
- Note: the formula for $\hat\Omega$ requires an estimate of $m(\theta_0)$, which in turn requires an estimate of $\theta$, which is based upon an estimate of $\Omega$! The solution to this circularity is to set the weighting matrix $W$ arbitrarily (for example to an identity matrix), obtain a first consistent but inefficient estimate of $\theta_0$, then use this estimate to form $\hat\Omega$, then re-estimate $\theta_0$. The process can be iterated until neither $\hat\Omega$ nor $\hat\theta$ change appreciably between iterations.

15.5.1. Newey-West covariance estimator. The Newey-West estimator (Econometrica, 1987) solves the problem of possible nonpositive definiteness of the above estimator. Their estimator is
$$\hat\Omega = \hat\Gamma_0 + \sum_{v=1}^{q(n)}\left(1 - \frac{v}{q+1}\right)\left(\hat\Gamma_v + \hat\Gamma_v'\right).$$
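In code, the estimator is a short loop over lagged cross-products. Below is a minimal Python/NumPy sketch (the notes' own programs are in Octave; the function name `newey_west` and its interface are my choices):

```python
import numpy as np

def newey_west(m, q):
    """Newey-West estimate of Omega from an n x g matrix m whose rows are
    the estimated moment contributions m_t(theta_hat), using lag length q."""
    n, g = m.shape
    omega = m.T @ m / n                 # Gamma_0 hat
    for v in range(1, q + 1):
        gamma_v = m[v:].T @ m[:-v] / n  # Gamma_v hat
        w = 1.0 - v / (q + 1.0)         # Bartlett kernel weight
        omega += w * (gamma_v + gamma_v.T)
    return omega
```

With `q = 0` the function returns $\hat\Gamma_0$ alone, the estimator appropriate when the moment conditions are serially uncorrelated.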

The Newey-West estimator is p.d. by construction. The condition for consistency is that $n^{-1/4}\, q(n) \to 0$. Note that this is a very slow rate of growth for $q$. This estimator is nonparametric — we've placed no parametric restrictions on the form of $\Omega$. It is an example of a kernel estimator.

In a more recent paper, Newey and West (Review of Economic Studies, 1994) use pre-whitening before applying the kernel estimator. The idea is to fit a VAR model to the moment conditions. It is expected that the residuals of the VAR model will be more nearly white noise, so that the Newey-West covariance estimator might perform better with short lag lengths. The VAR model is
$$\hat m_t = \Theta_1 \hat m_{t-1} + \cdots + \Theta_p \hat m_{t-p} + u_t.$$
This is estimated, giving the residuals $\hat u_t$. Then the Newey-West covariance estimator is applied to these pre-whitened residuals, and the covariance $\Omega$ is estimated combining the fitted VAR with the kernel estimate of the covariance of the $u_t$. See Newey-West for details. I have a program that does this if you're interested.

15.6. Estimation using conditional moments

If the above VAR model does succeed in removing unmodeled heteroscedasticity and autocorrelation, might this imply that this information is not being used efficiently in estimation?

In other words, since the performance of GMM depends on which moment conditions are used, if the set of selected moment conditions exhibits heteroscedasticity and autocorrelation, can't we use this information, a la GLS, to guide us in selecting a better set of moment conditions to improve efficiency? The answer to this may not be so clear when moments are defined unconditionally, but it can be analyzed more carefully when the moments used in estimation are derived from conditional moments.

So far, the moment conditions have been presented as unconditional expectations. One common way of defining unconditional moment conditions is based upon conditional moment conditions. Suppose that a random variable $Y$ has zero expectation conditional on the random variable $X$:
$$E_{Y|X}\, Y = \int Y\, f(Y|X)\, dY = 0.$$
Then the unconditional expectation of the product of $Y$ and a function $g(X)$ of $X$ is also zero. The unconditional expectation is
$$E\left(Y g(X)\right) = \int_X\left(\int_Y Y g(X)\, f(Y, X)\, dY\right)dX.$$
This can be factored into a conditional expectation and an expectation w.r.t. the marginal density of $X$:
$$E\left(Y g(X)\right) = \int_X\left(\int_Y Y g(X)\, f(Y|X)\, dY\right)f(X)\, dX.$$
Since $g(X)$ doesn't depend on $Y$ it can be pulled out of the inner integral:
$$E\left(Y g(X)\right) = \int_X\left(\int_Y Y\, f(Y|X)\, dY\right)g(X)\, f(X)\, dX.$$
But the term in parentheses on the rhs is zero by assumption, so $E\left(Y g(X)\right) = 0$,

as claimed. This is important econometrically, since models often imply restrictions on conditional moments. Suppose a model tells us that the function $K(y_t, x_t)$ has expectation, conditional on the information set $I_t$, equal to $k(x_t, \theta)$:
$$E_\theta\left[K(y_t, x_t) \mid I_t\right] = k(x_t, \theta).$$
For example, in the context of the classical linear model $y_t = x_t'\beta + \varepsilon_t$, we can set $K(y_t, x_t) = y_t$, so that $k(x_t, \theta) = x_t'\beta$.

With this, the function
$$h_t(\theta) = K(y_t, x_t) - k(x_t, \theta)$$
has conditional expectation equal to zero:
$$E_\theta\left[h_t(\theta) \mid I_t\right] = 0.$$
This is a scalar moment condition, which wouldn't be sufficient to identify a $K$-dimensional parameter $\theta$ (for $K > 1$). However, the above result allows us to form various unconditional expectations:
$$m_t(\theta) = Z(w_t)\, h_t(\theta),$$
where $Z(w_t)$ is a $g \times 1$-vector valued function of $w_t$, and $w_t$ is a set of variables drawn from the information set $I_t$. The $Z(w_t)$ are instrumental variables. We now have $g$ moment conditions, so as long as $g \geq K$ the necessary condition for identification holds.

One can form the $n \times g$ matrix
$$Z_n = \begin{bmatrix} Z_1(w_1) & Z_2(w_1) & \cdots & Z_g(w_1) \\ Z_1(w_2) & Z_2(w_2) & \cdots & Z_g(w_2) \\ \vdots & & & \vdots \\ Z_1(w_n) & Z_2(w_n) & \cdots & Z_g(w_n) \end{bmatrix} = \begin{bmatrix} Z_1' \\ Z_2' \\ \vdots \\ Z_n' \end{bmatrix},$$
where $Z_t'$ is the $t$-th row of $Z_n$. With this we can form the $g$-vector of moment conditions
$$m_n(\theta) = \frac{1}{n}\, Z_n'\, h_n(\theta),$$
where $h_n(\theta)$ is the $n$-vector with $t$-th element $h_t(\theta)$, so that $m_t(\theta) = Z_t\, h_t(\theta)$. This fits the previous treatment. An interesting question that arises is how one should choose the instrumental variables $Z(w_t)$ to achieve maximum efficiency.

Note that with this choice of moment conditions, $D_n(\theta) \equiv \frac{\partial}{\partial\theta} m_n'(\theta)$ (a $K \times g$ matrix) is
$$D_n(\theta) = \frac{1}{n}\, \frac{\partial}{\partial\theta} h_n'(\theta)\, Z_n \equiv \frac{1}{n}\, H_n\, Z_n,$$
where $H_n \equiv \frac{\partial}{\partial\theta} h_n'(\theta)$ is a $K \times n$ matrix that has the derivatives of the individual moment conditions as its columns. Likewise, define the var-cov of the moment conditions:
$$\Omega_n = E\left[n\, m_n(\theta_0)\, m_n(\theta_0)'\right] = E\left[\frac{1}{n}\, Z_n'\, h_n(\theta_0)\, h_n(\theta_0)'\, Z_n\right] = Z_n'\, \frac{\Phi_n}{n}\, Z_n,$$
where $\Phi_n \equiv E\left[h_n(\theta_0)\, h_n(\theta_0)'\right]$. Note that the $n \times n$ matrix $\Phi_n$ is growing with the sample size and is not consistently estimable without additional assumptions.

The asymptotic normality theorem above says that the GMM estimator using the optimal weighting matrix is distributed as
$$\sqrt{n}\left(\hat\theta - \theta_0\right) \stackrel{d}{\to} N(0, V_\infty),$$
where
$$(15.6.1) \qquad V_\infty = \lim_{n\to\infty}\left[\left(\frac{H_n Z_n}{n}\right)\left(\frac{Z_n'\, \Phi_n\, Z_n}{n}\right)^{-1}\left(\frac{Z_n'\, H_n'}{n}\right)\right]^{-1}.$$

6.6.2) which we should write more properly as 8 · 8 É f f 1 Æ t f I R I gf d R f I f d f ¢W n é (15. struments. Note that both this. so without restrictions 1 ' XV1 f Estimation of ýd where is some initial consistent estimator based on non-optimal in- may not be possible. ESTIMATION USING CONDITIONAL MOMENTS d d 321 where ¶ d Using an argument similar to that used to prove that d is the efﬁcient weight- ing matrix. As above.one just uses d p since it depends on and must be consistently estimated to apply matrix. (To prove this.15.1) I I ² É 1 É 1 É 1 R f R f W Æ I ¢C R f W Æ ÄDf Æ f Wf f W f . estimation of is straightforward .6. you can show that the difference is positive d d semi-deﬁnite). so it has d g© p R¥ f f n é (15. q1 more unique elements than the sample size. examine the difference of the inverses of the var-cov matrices with the optimal intruments and with nonoptimal instruments. It is an h ý d Rf d ¹ à à f ¢ d Usually. this matrix is smaller that the limiting var-cov for any other choice of instrumental variables. we can show that putting causes the above var-cov matrix to simplify to and furthermore.

on the parameters it can't be estimated consistently. Basically, you need to provide a parametric specification of the covariances of the $h_t(\theta)$ in order to be able to use optimal instruments. A solution is to approximate this matrix parametrically to define the instruments. Note that the simplified var-cov matrix in equation 15.6.2 will not apply if approximately optimal instruments are used — it will be necessary to use an estimator based upon equation 15.6.1.

15.7. Estimation using dynamic moment conditions

Note that dynamic moment conditions simplify the var-cov matrix, but are often harder to formulate. Details will be added in future editions. For now, the Hansen application below is enough.

15.8. A specification test

The first order conditions for minimization, using an estimate of the optimal weighting matrix, are
$$\frac{\partial}{\partial\theta} s(\hat\theta) = 2\left[\frac{\partial}{\partial\theta} m_n'(\hat\theta)\right]\hat\Omega^{-1}\, m_n(\hat\theta) \equiv 0,$$
or
$$D(\hat\theta)\, \hat\Omega^{-1}\, m_n(\hat\theta) \equiv 0.$$

Consider a Taylor expansion of $m(\hat\theta)$ about $\theta_0$:
$$(15.8.1) \qquad m(\hat\theta) = m_n(\theta_0) + D_n'(\theta_0)\left(\hat\theta - \theta_0\right) + o_p(1).$$
Multiplying by $D(\hat\theta)\, \hat\Omega^{-1}$ we obtain
$$D(\hat\theta)\, \hat\Omega^{-1}\, m(\hat\theta) = D(\hat\theta)\, \hat\Omega^{-1}\, m_n(\theta_0) + D(\hat\theta)\, \hat\Omega^{-1}\, D(\theta_0)'\left(\hat\theta - \theta_0\right) + o_p(1).$$
The lhs is zero by the first order conditions, and since $\hat\theta$ tends to $\theta_0$ and $\hat\Omega$ tends to $\Omega_\infty$, we can write
$$D_\infty\, \Omega_\infty^{-1}\, m_n(\theta_0) \cong -D_\infty\, \Omega_\infty^{-1}\, D_\infty'\left(\hat\theta - \theta_0\right),$$
or
$$\sqrt{n}\left(\hat\theta - \theta_0\right) \cong -\sqrt{n}\left(D_\infty\, \Omega_\infty^{-1}\, D_\infty'\right)^{-1} D_\infty\, \Omega_\infty^{-1}\, m_n(\theta_0).$$
With this, and taking into account the original expansion (equation 15.8.1), we get
$$\sqrt{n}\, m(\hat\theta) \cong \sqrt{n}\, m_n(\theta_0) - \sqrt{n}\, D_\infty'\left(D_\infty\, \Omega_\infty^{-1}\, D_\infty'\right)^{-1} D_\infty\, \Omega_\infty^{-1}\, m_n(\theta_0).$$
This last can be written as
$$\sqrt{n}\, \Omega_\infty^{-1/2}\, m(\hat\theta) \cong \left[I_g - \Omega_\infty^{-1/2}\, D_\infty'\left(D_\infty\, \Omega_\infty^{-1}\, D_\infty'\right)^{-1} D_\infty\, \Omega_\infty^{-1/2}\right]\sqrt{n}\, \Omega_\infty^{-1/2}\, m_n(\theta_0).$$
Now $\sqrt{n}\, \Omega_\infty^{-1/2}\, m_n(\theta_0) \stackrel{d}{\to} N(0, I_g)$,

and one can easily verify that the matrix in brackets,
$$P = I_g - \Omega_\infty^{-1/2}\, D_\infty'\left(D_\infty\, \Omega_\infty^{-1}\, D_\infty'\right)^{-1} D_\infty\, \Omega_\infty^{-1/2},$$
is idempotent of rank $g - K$ (recall that the rank of an idempotent matrix is equal to its trace). So, supposing the model is correctly specified,
$$n\, m(\hat\theta)'\, \Omega_\infty^{-1}\, m(\hat\theta) \stackrel{d}{\to} \chi^2(g - K).$$
Since $\hat\Omega$ converges to $\Omega_\infty$, we also have
$$n\, m(\hat\theta)'\, \hat\Omega^{-1}\, m(\hat\theta) \stackrel{d}{\to} \chi^2(g - K),$$
or
$$n \cdot s_n(\hat\theta) \stackrel{d}{\to} \chi^2(g - K),$$
supposing the model is correctly specified. This is a convenient test, since we just multiply the optimized value of the objective function by $n$ and compare with a $\chi^2(g - K)$ critical value. The test is a general test of whether or not the moments used to estimate are correctly specified.

- This won't work when the estimator is just identified. The f.o.c. are $D\, \hat\Omega^{-1}\, m(\hat\theta) \equiv 0$, but with exact identification both $D$ and $\hat\Omega$ are square and invertible (at least asymptotically, assuming consistency and asymptotic normality hold), so $m(\hat\theta) \equiv 0$. The moment conditions are then identically zero regardless of the weighting matrix used — we might as well use an identity matrix and save trouble — and $s_n(\hat\theta) = 0$, so the test breaks down.
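As an aside, in the overidentified case the test is trivial to compute once the two-step estimate is in hand. Continuing the earlier t-distribution example in a Python/NumPy sketch (the notes' programs are in Octave; the variable names and the use of SciPy here are my choices):

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import chi2

rng = np.random.default_rng(3)
y = rng.standard_t(df=10.0, size=20_000)   # true theta_0 = 10, model correct

def m_t(theta, y):
    # n x 2 matrix of individual moment contributions
    m1 = theta / (theta - 2.0) - y**2
    m2 = 3.0 * theta**2 / ((theta - 2.0) * (theta - 4.0)) - y**4
    return np.column_stack([m1, m2])

def s_n(theta, y, W):
    m = m_t(theta, y).mean(axis=0)
    return m @ W @ m

# step 1: W = I gives a consistent but inefficient estimate
th1 = minimize_scalar(s_n, bounds=(4.5, 60.0), args=(y, np.eye(2)),
                      method="bounded").x
# step 2: W = inverse of estimated Omega (iid data, so Gamma_0 suffices)
mt = m_t(th1, y)
omega_hat = mt.T @ mt / len(y)
res2 = minimize_scalar(s_n, bounds=(4.5, 60.0),
                       args=(y, np.linalg.inv(omega_hat)), method="bounded")

J = len(y) * res2.fun           # n * s_n(theta_hat), df = g - K = 2 - 1 = 1
crit = chi2.ppf(0.95, df=1)     # 5% critical value, approximately 3.84
```

Under correct specification, $J$ should fall below `crit` most of the time; a large $J$ signals that the moment conditions are mutually inconsistent.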

A note: this sort of test often over-rejects in finite samples. If the sample size is small, it might be better to use bootstrap critical values. That is, draw $R$ artificial samples of size $n$ by sampling from the data with replacement. For each bootstrap sample $r = 1, 2, \ldots, R$, optimize and calculate the test statistic $n \cdot s_n(\hat\theta_r)$. Define the bootstrap critical value as the value such that $\alpha$ percent of the bootstrap statistics exceed it. Of course, $R$ must be a very large number if $g - K$ is large, in order to determine the critical value with precision. This sort of bootstrapped test has been found to have quite good small sample properties.

15.9. Other estimators interpreted as GMM estimators

15.9.1. OLS with heteroscedasticity of unknown form.

EXAMPLE 26. White's heteroscedastic consistent varcov estimator for OLS.

Suppose $y = X\beta_0 + \varepsilon$, where $\varepsilon \sim N(0, \Sigma)$, $\Sigma$ a diagonal matrix.

- The typical approach is to parameterize $\Sigma = \Sigma(\sigma)$, where $\sigma$ is a finite dimensional parameter vector, and to estimate $\beta$ and $\sigma$ jointly (feasible GLS). This will work well if the parameterization of $\Sigma$ is correct.
- If we're not confident about parameterizing $\Sigma$, we can still estimate $\beta$ consistently by OLS. However, the typical covariance estimator $V(\hat\beta) = (X'X)^{-1}\hat\sigma^2$ will be biased and inconsistent, and will lead to invalid inferences.

By exogeneity of the regressors $x_t$ (a $K \times 1$ column vector) we have $E(x_t \varepsilon_t) = 0$, which suggests the moment condition
$$m_t(\beta) = x_t\left(y_t - x_t'\beta\right).$$
In this case, we have exact identification ($K$ parameters and $K$ moment conditions). We have
$$m(\beta) = \frac{1}{n}\sum_t m_t(\beta) = \frac{1}{n}\sum_t x_t y_t - \frac{1}{n}\sum_t x_t x_t'\, \beta.$$
For any choice of $W$, $m(\hat\beta)$ will be identically zero at the minimum, due to exact identification: since the number of moment conditions is identical to the number of parameters, the foc imply that $m(\hat\beta) \equiv 0$ regardless of $W$. There is no need to use the "optimal" weighting matrix in this case — an identity matrix works just as well for the purpose of estimation. Therefore
$$\hat\beta = \left(\frac{1}{n}\sum_t x_t x_t'\right)^{-1}\frac{1}{n}\sum_t x_t y_t = \left(X'X\right)^{-1}X'y,$$
which is the usual OLS estimator.

The GMM estimator of the asymptotic varcov matrix is $\left(\hat D_\infty\, \hat\Omega^{-1}\, \hat D_\infty'\right)^{-1}$. Recall that $\hat D_\infty$ is simply $\frac{\partial}{\partial\theta} m_n'(\hat\theta)$. In this case,
$$\hat D_\infty = -\frac{1}{n}\sum_t x_t x_t' = -\frac{X'X}{n}.$$
Recall that a possible estimator of $\Omega$ is
$$\hat\Omega = \hat\Gamma_0 + \sum_{v=1}^{n-1}\left(\hat\Gamma_v + \hat\Gamma_v'\right).$$

15. so information will accumulate. but in the present case of nonautocorrelation. it simpliﬁes to ¥ Ã ² which has a constant number of elements to estimate. If there is autocorrelation. This estimator is consistent under heteroscedasticity of an unknown form. estimator. is This is the varcov estimator that White (1980) arrived at in an inﬂuential article. and consistency obtains. - I É 1 å· Æ I R stÉ 1 å· Æ R I 2 2 É 1 qR ¶ I É 1 1 R ¶ R q 1 Æ R Æ r h § Dh j% § 1 é 1 ' %X1 where is an h diagonal matrix with ¢ G 1 qR I @ w ¢ G R v v f v 1 Â ) I @ v w ¢ h § R v f R v v l1 Â ) f I @ p · R f ¶ 1 Â ) Ã¥ p Ë ² q ¢ ² in the position . the Newey-West estimator can be used to estimate the rest is the same. In the present case Therefore. . the GMM varcov. which is consistent. OTHER ESTIMATORS INTERPRETED AS GMM ESTIMATORS 327 This is in general inconsistent.9.

15.9.2. Weighted Least Squares. Consider the previous example of a linear model with heteroscedasticity of unknown form:
$$y = X\beta_0 + \varepsilon, \qquad \varepsilon \sim N(0, \Sigma),$$
where $\Sigma$ is a diagonal matrix. Now, suppose that the form of $\Sigma$ is known, so that $\Sigma(\theta_0)$ is a correct parametric specification (which may also depend upon $X$). In this case, the GLS estimator is
$$\tilde\beta = \left(X'\Sigma^{-1}X\right)^{-1}X'\Sigma^{-1}y.$$
This estimator can be interpreted as the solution to the $K$ moment conditions
$$m(\tilde\beta) = \frac{1}{n}\sum_t \frac{x_t y_t}{\sigma_t^2} - \frac{1}{n}\sum_t \frac{x_t x_t'}{\sigma_t^2}\, \tilde\beta \equiv 0.$$
That is, the GLS estimator in this case has an obvious representation as a GMM estimator. With autocorrelation, the representation exists but it is a little more complicated. Nevertheless, the idea is the same. There are a few points:

- The (feasible) GLS estimator is known to be asymptotically efficient in the class of linear asymptotically unbiased estimators (Gauss-Markov).
- This means that it is more efficient than the above example of OLS with White's heteroscedastic consistent covariance, which is an alternative GMM estimator.
- This means that the choice of the moment conditions is important to achieve efficiency.

2SLS. e.i.3. regardless of R 4x§4 § Since we have parameters and 8© R§ f § 4wg¨¥ 4 § moment conditions. must .9.d. the GMM so we 1 ) ¨X © §¥ Â © § f © §¥ § R vg¥ 4 § ¨E 8 G be uncorrelated with ¶ § £ Since is a linear combination of the exogenous variables This suggests the -dimensional moment con- 4 § v £ u £ R I © R ¥ £ £ R I © R ¥ ¡ £ £ Deﬁne v that is as the vector of predictions of when regressed upon G exogenous and predetermined variables that are uncorrelated with v both endogenous and exogenous variables.9. We use the exogenous variables and the reduced form predictions of the endogenous variables as instruments. Suppose that is the vector of all (suppose x# this equation is one of a system of simultaneous equations.g. and ©RÄ£ h u£Ä£ d ¬4 § ¥ R © f I I have · Öü estimator will set identically equal to zero.. OTHER ESTIMATORS INTERPRETED AS GMM ESTIMATORS 329 15. Consider the linear model or dition and so This is the standard formula for 2SLS. Suppose that contains 8©) gtj' À . where )' § G P § VÄ£ is G P R t0§ g# gf and is i. so that gG using the usual construction.15.

See Hamilton pp. 420-21 for the varcov formula (which is the standard formula for 2SLS), and for how to deal with $\varepsilon_t$ heterogeneous and dependent (basically, just use the Newey-West or some other consistent estimator of $\Omega$, and apply the usual formula). Note that $\varepsilon_t$ dependent causes lagged endogenous variables to lose their status as legitimate instruments.

15.9.4. Nonlinear simultaneous equations. GMM provides a convenient way to estimate nonlinear systems of simultaneous equations. We have a system of equations of the form
$$y_{1t} = f_1(z_t, \theta_0^1) + \varepsilon_{1t}$$
$$y_{2t} = f_2(z_t, \theta_0^2) + \varepsilon_{2t}$$
$$\vdots$$
$$y_{Gt} = f_G(z_t, \theta_0^G) + \varepsilon_{Gt},$$
or, in compact notation,
$$y_t = f(z_t, \theta_0) + \varepsilon_t,$$
where $f(\cdot)$ is a $G$-vector valued function and $\theta_0 = \left(\theta_0^{1\prime}, \theta_0^{2\prime}, \ldots, \theta_0^{G\prime}\right)'$.

We need to find an $A_i$-vector of instruments $x_{it}$, for each equation, that are uncorrelated with $\varepsilon_{it}$. Typical instruments would be low order monomials in the exogenous variables in $z_t$, with their lagged values. Then we can define the $\left(\sum_{i=1}^G A_i\right) \times 1$ orthogonality conditions
$$m_t(\theta) = \begin{bmatrix} \left(y_{1t} - f_1(z_t, \theta^1)\right)x_{1t} \\ \left(y_{2t} - f_2(z_t, \theta^2)\right)x_{2t} \\ \vdots \\ \left(y_{Gt} - f_G(z_t, \theta^G)\right)x_{Gt} \end{bmatrix}.$$

- A note on identification: selection of instruments that ensure identification is a non-trivial problem.
- A note on efficiency: the selected set of instruments has important effects on the efficiency of estimation. Unfortunately there is little theory offering guidance on what is the optimal set. More on this later.

15.9.5. Maximum likelihood. In the introduction we argued that ML will in general be more efficient than GMM, since ML implicitly uses all of the moments of the distribution while GMM uses a limited number of moments. Actually, a distribution with $P$ parameters can be uniquely characterized by $P$ moment conditions. However, some sets of $P$ moment conditions may contain more information than others, since the moment conditions could be highly correlated. A GMM estimator that chose an optimal set of moment conditions would be fully efficient. Here we'll see that the optimal moment conditions are simply the scores of the ML estimator.

Let $y_t$ be a $G$-vector of variables, and let $Y_t = (y_1', y_2', \ldots, y_t')'$. Then at time $t$, $Y_{t-1}$ has been observed (refer to it as the information set, since we assume the conditioning variables have been selected to take advantage of all useful information). The likelihood function is the joint density of the sample:
$$L(\theta) = f(y_n, y_{n-1}, \ldots, y_1; \theta),$$
which can be factored as
$$L(\theta) = f(y_n \mid Y_{n-1}; \theta)\, f(Y_{n-1}; \theta),$$
and we can repeat this to get
$$L(\theta) = f(y_n \mid Y_{n-1}; \theta)\, f(y_{n-1} \mid Y_{n-2}; \theta) \cdots f(y_1).$$

The log-likelihood function is therefore
$$\ln L(\theta) = \sum_{t=1}^n \ln f(y_t \mid Y_{t-1}; \theta).$$
Define
$$m_t(Y_t, \theta) \equiv D_\theta \ln f(y_t \mid Y_{t-1}; \theta)$$
as the score of the $t$-th observation. It can be shown, under the regularity conditions (see the chapter on maximum likelihood estimation), that the scores have conditional mean zero when evaluated at $\theta_0$:
$$E\left[m_t(Y_t, \theta_0) \mid Y_{t-1}\right] = 0,$$
so one could interpret these as moment conditions to use to define a just-identified GMM estimator (if there are $K$ parameters there are $K$ score equations). The GMM estimator sets
$$\frac{1}{n}\sum_{t=1}^n m_t(Y_t, \hat\theta) = \frac{1}{n}\sum_{t=1}^n D_\theta \ln f(y_t \mid Y_{t-1}; \hat\theta) = 0,$$
which are precisely the first order conditions of MLE. Therefore, MLE can be interpreted as a GMM estimator. The GMM varcov formula is $V_\infty = \left(D_\infty\, \Omega^{-1}\, D_\infty'\right)^{-1}$.

Consistent estimates of the variance components are as follows.

- It is important to note that $m_t$ and $m_{t-s}$, $s > 0$, are both conditionally and unconditionally uncorrelated. Conditional uncorrelation follows from the fact that $m_{t-s}$ is a function of $Y_{t-s}$, which is in the information set at time $t$.

Unconditional uncorrelation follows from the fact that conditional uncorrelation holds regardless of the realization of $Y_{t-1}$ (see the section on ML estimation, above), so marginalizing with respect to $Y_{t-1}$ preserves uncorrelation. The fact that the scores are serially uncorrelated implies that $\Omega$ can be estimated by the estimator of the 0th autocovariance of the moment conditions:
$$\hat\Omega = \frac{1}{n}\sum_{t=1}^n m_t(Y_t, \hat\theta)\, m_t(Y_t, \hat\theta)' = \frac{1}{n}\sum_{t=1}^n \left[D_\theta \ln f(y_t \mid Y_{t-1}; \hat\theta)\right]\left[D_\theta \ln f(y_t \mid Y_{t-1}; \hat\theta)\right]'.$$
Recall from the study of ML estimation that the information matrix equality (equation ??) states that
$$E\left\{\left[D_\theta \ln f(y_t \mid Y_{t-1}; \theta_0)\right]\left[D_\theta \ln f(y_t \mid Y_{t-1}; \theta_0)\right]'\right\} = -E\left[D_\theta^2 \ln f(y_t \mid Y_{t-1}; \theta_0)\right].$$
This result implies the well known (and already seen) result that we can estimate $V_\infty$ in any of three ways:

- The sandwich version:
$$\hat V_\infty = n\left\{\left[\sum_{t=1}^n D_\theta^2 \ln f(y_t \mid Y_{t-1}; \hat\theta)\right]\left[\sum_{t=1}^n \left(D_\theta \ln f_t\right)\left(D_\theta \ln f_t\right)'\right]^{-1}\left[\sum_{t=1}^n D_\theta^2 \ln f(y_t \mid Y_{t-1}; \hat\theta)\right]\right\}^{-1};$$
- or the inverse of the negative of the Hessian (since the middle and last terms cancel, except for a minus sign):
$$\hat V_\infty = \left[-\frac{1}{n}\sum_{t=1}^n D_\theta^2 \ln f(y_t \mid Y_{t-1}; \hat\theta)\right]^{-1};$$
- or the inverse of the outer product of the gradient (since the middle and last terms cancel except for a minus sign, and the first term converges to minus the inverse of the middle term, which is still inside the overall inverse):

$$\hat V_\infty = \left\{\frac{1}{n}\sum_{t=1}^n \left[D_\theta \ln f(y_t \mid Y_{t-1}; \hat\theta)\right]\left[D_\theta \ln f(y_t \mid Y_{t-1}; \hat\theta)\right]'\right\}^{-1}.$$
This simplification is a special result for the MLE estimator — it doesn't apply to GMM estimators in general. Asymptotically, if the model is correctly specified, all of these forms converge to the same limit. In small samples they will differ. In particular, there is evidence that the outer product of the gradient formula does not perform very well in small samples (see Davidson and MacKinnon, pg. 477). White's information matrix test (Econometrica, 1982) is based upon comparing the two ways to estimate the information matrix: the outer product of the gradient and the negative of the Hessian. If they differ by too much, this is evidence of misspecification of the model.

15.10. Example: The Hausman Test

This section discusses the Hausman test, which was originally presented in Hausman, J.A. (1978), "Specification tests in econometrics," Econometrica, 46, 1251-71. Consider the simple linear regression model $y_t = x_t'\beta + \varepsilon_t$. We assume that the functional form and the choice of regressors are correct, but that some of the regressors may be correlated with the error term, which, as you know, will produce inconsistency of $\hat\beta$. For example, this is a problem if:

- some regressors are endogenous;
- some regressors are measured with error;
- lagged values of the dependent variable are used as regressors and $\varepsilon_t$ is autocorrelated.

To illustrate, the Octave program biased.m performs a Monte Carlo experiment where errors are correlated with regressors, and estimation is by OLS and IV. Figure 15.10.1 shows that the OLS estimator is quite biased, while the IV estimator is on average much closer to the true value. If you play with the program, increasing the sample size, you can see evidence that the OLS estimator is asymptotically biased, while the IV estimator is consistent.

[Figure 15.10.1: OLS and IV estimators when regressors and errors are correlated]

We have seen that a pair of consistent estimators converge to the same probability limit, while if one is consistent and the other is not they converge to different limits. If we accept that one is consistent (e.g., the IV estimator), but we are doubting if the other is consistent (e.g., the OLS estimator), we might try to check if the difference between the estimators is significantly different from zero. This is the idea behind the Hausman test.
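The biased.m program is written in Octave; a minimal Python sketch of the same kind of Monte Carlo design (the parameter values and error structure here are invented, not those of the notes' program) is:

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps, beta_true = 100, 500, 2.0
ols, iv = [], []
for _ in range(reps):
    z = rng.normal(size=n)              # instrument: correlated with x, not with eps
    u = rng.normal(size=n)              # common shock creating endogeneity
    x = z + u + rng.normal(size=n)
    eps = u + rng.normal(size=n)        # error correlated with x through u
    y = beta_true * x + eps
    ols.append((x @ y) / (x @ x))       # OLS slope (no intercept)
    iv.append((z @ y) / (z @ x))        # simple IV slope
print(np.mean(ols), np.mean(iv))
```

Over the replications the OLS estimates center well above the true value of 2, while the IV estimates center close to it, mirroring the figure.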

If we're doubting about the consistency of OLS (or QML, etc.), why not just use the IV estimator? Because the OLS estimator is more efficient when the regressors are exogenous and the other classical assumptions (including normality of the errors) hold. When we have a more efficient estimator that relies on stronger assumptions (such as exogeneity) than the IV estimator, we might prefer to use it, unless we have evidence that the assumptions are false. So, we might be interested in testing whether the stronger assumptions are compatible with the data.

To see how the test works, let's consider the covariance between the MLE estimator $\hat\theta$ (or any other fully efficient estimator) and some other CAN estimator, say $\tilde\theta$. Now, let's recall some results from MLE. Equation 4.1 gives the first order asymptotic expansion of the MLE,

\[ \sqrt{n}\left(\hat\theta - \theta_0\right) \stackrel{a}{=} \mathcal{I}(\theta_0)^{-1}\, \sqrt{n}\, g(\theta_0), \]

where $g(\theta_0)$ is the average score, and equation 4.2 tells us that the asymptotic covariance between any CAN estimator and the MLE score vector is

\[ \lim_{n\to\infty} E\left[ \sqrt{n}\left(\tilde\theta - \theta_0\right) \sqrt{n}\, g(\theta_0)' \right] = I_K. \]

Combining these two equations, we get

\[ \lim_{n\to\infty} E\left[ \sqrt{n}\left(\tilde\theta - \theta_0\right) \sqrt{n}\left(\hat\theta - \theta_0\right)' \right] = \mathcal{I}(\theta_0)^{-1}. \]

Now, the asymptotic covariance between the MLE and any other CAN estimator is equal to the MLE asymptotic variance (the inverse of the information matrix). Suppose we wish to test whether the two estimators are in fact both converging to $\theta_0$, versus the alternative hypothesis that the "MLE" estimator is not in fact consistent (the consistency of $\tilde\theta$ is a maintained hypothesis). Under the null hypothesis that they are, consider the difference $\tilde\theta - \hat\theta$, with both estimators converging to $\theta_0$. The asymptotic covariance of $\sqrt{n}\left(\tilde\theta - \hat\theta\right)$ is

\[ V_\infty\left(\tilde\theta - \hat\theta\right) = V_\infty\left(\tilde\theta\right) + V_\infty\left(\hat\theta\right) - 2\, C_\infty\left(\tilde\theta, \hat\theta\right), \]

which, using the result that the asymptotic covariance between the MLE and any CAN estimator equals the MLE asymptotic variance, we might write as

\[ V_\infty\left(\tilde\theta - \hat\theta\right) = V_\infty\left(\tilde\theta\right) + V_\infty\left(\hat\theta\right) - 2\, V_\infty\left(\hat\theta\right) = V_\infty\left(\tilde\theta\right) - V_\infty\left(\hat\theta\right). \]

So, under the null hypothesis, $\sqrt{n}\left(\tilde\theta - \hat\theta\right)$ will be asymptotically normally distributed as

\[ \sqrt{n}\left(\tilde\theta - \hat\theta\right) \stackrel{d}{\to} N\left[0,\; V_\infty\left(\tilde\theta\right) - V_\infty\left(\hat\theta\right)\right]. \]

So, a statistic that has the same asymptotic distribution is

\[ H = n\left(\tilde\theta - \hat\theta\right)' \left[ \hat V\left(\tilde\theta\right) - \hat V\left(\hat\theta\right) \right]^{-1} \left(\tilde\theta - \hat\theta\right) \stackrel{a}{\sim} \chi^2(\rho), \]

where $\rho$ is the rank of the difference of the asymptotic variances, and the inverse is understood as a generalized inverse when the difference is singular. This is the Hausman test statistic, in its original form. The reason that this test has power under the alternative hypothesis is that in that case the "MLE" estimator will not be consistent, and will converge to $\theta_A$, say, where $\theta_A \neq \theta_0$. Then the mean of the asymptotic distribution of the vector $\sqrt{n}\left(\tilde\theta - \hat\theta\right)$ will be $\theta_0 - \theta_A$, a non-zero vector, so the test statistic will eventually reject, regardless of how small a significance level is used. Some things to note:

- The rank, $\rho$, of the difference of the asymptotic variances is often less than the dimension of the matrices, and it may be difficult to determine what the true rank is. This may occur, for example, when the consistent but inefficient estimator is not identified for all the parameters of the model.
- Note: if the test is based on a sub-vector of the entire parameter vector of the MLE, it is possible that the inconsistency of the MLE will not show up in the portion of the vector that has been used. If this is the case, the test may not have power to detect the inconsistency.
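As a numerical illustration (a sketch with made-up data, not code from these notes), the statistic can be computed with a generalized inverse, using OLS as the efficient estimator and IV as the consistent one. Here the regressors are in fact exogenous, so the null hypothesis is true and the statistic should be an unremarkable draw from the $\chi^2(\rho)$ distribution:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
z = rng.normal(size=(n, 2))                      # instruments
x = z + 0.5 * rng.normal(size=(n, 2))            # regressors (here exogenous)
beta = np.array([1.0, -0.5])
y = x @ beta + rng.normal(size=n)                # H0 true: OLS consistent, efficient

b_ols = np.linalg.solve(x.T @ x, x.T @ y)
xhat = z @ np.linalg.solve(z.T @ z, z.T @ x)     # first-stage fitted values
b_iv = np.linalg.solve(xhat.T @ x, xhat.T @ y)

s2 = np.mean((y - x @ b_ols) ** 2)
v_ols = s2 * np.linalg.inv(x.T @ x)              # efficient variance estimate
v_iv = s2 * np.linalg.inv(xhat.T @ xhat)         # IV variance estimate

d = b_iv - b_ols
dV = v_iv - v_ols                                # difference of variance estimates
H_stat = d @ np.linalg.pinv(dV) @ d              # Hausman statistic
rho = np.linalg.matrix_rank(dV)                  # degrees of freedom
print(H_stat, rho)                               # compare to chi^2(rho) critical values
```

The difference of the variance estimates is positive semidefinite here because the IV estimator uses only part of the variation in $x$, so the pseudo-inverse gives a nonnegative statistic.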

- If the true rank is lower than what is taken to be true, the test will be biased against rejection of the null hypothesis. The contrary holds if we underestimate the rank.
- A solution to this problem is to use a rank 1 test, by comparing only a single coefficient. For example, if a variable is suspected of possibly being endogenous, that variable's coefficients may be compared.
- This simple formula only holds when the estimator that is being tested for consistency is fully efficient under the null hypothesis. This means that it must be a ML estimator or a fully efficient estimator that has the same asymptotic distribution as the ML estimator. This is quite restrictive since modern estimators such as GMM and QML are not in general fully efficient.

Following up on this last point, let's think of two not necessarily efficient estimators, $\hat\theta_1$ and $\hat\theta_2$, where one is assumed to be consistent, but the other may not be. We assume for expositional simplicity that both $\hat\theta_1$ and $\hat\theta_2$ belong to the same parameter space, and that they can be expressed as generalized method of moments (GMM) estimators. The estimators are defined (suppressing the dependence upon data) by

(15.10.1)
\[ \hat\theta_i = \arg\min_{\theta_i \in \Theta} \; m_i(\theta_i)'\, W_i\, m_i(\theta_i), \quad i = 1, 2, \]

where $m_i(\theta_i)$ is a $g_i \times 1$ vector of moment conditions and $W_i$ is a $g_i \times g_i$ positive definite weighting matrix. Consider the omnibus GMM estimator

\[ \left(\hat\theta_1, \hat\theta_2\right) = \arg\min
\begin{bmatrix} m_1(\theta_1)' & m_2(\theta_2)' \end{bmatrix}
\begin{bmatrix} W_1 & 0_{g_1 \times g_2} \\ 0_{g_2 \times g_1} & W_2 \end{bmatrix}
\begin{bmatrix} m_1(\theta_1) \\ m_2(\theta_2) \end{bmatrix}. \]

Suppose that the asymptotic covariance of the omnibus moment vector is

(15.10.2)
\[ \Sigma = \lim_{n\to\infty} Var\left\{ \sqrt{n} \begin{bmatrix} m_1(\theta_1) \\ m_2(\theta_2) \end{bmatrix} \right\}
\equiv \begin{bmatrix} \Sigma_1 & \Sigma_{12} \\ \cdot & \Sigma_2 \end{bmatrix}, \]

with the covariance of the moment conditions estimated as

\[ \hat\Sigma = \begin{bmatrix} \hat\Sigma_1 & 0 \\ 0 & \hat\Sigma_2 \end{bmatrix}. \]

While this is clearly an inconsistent estimator in general, the omitted $\Sigma_{12}$ term cancels out of the test statistic when one of the estimators is asymptotically efficient, as we have seen above, and thus it need not be estimated.

The standard Hausman test is equivalent to a Wald test of the equality of $\theta_1$ and $\theta_2$ (or subvectors of the two) applied to the omnibus GMM estimator. The general solution when neither of the estimators is efficient is clear: the entire $\Sigma$ matrix must be estimated consistently, since the $\Sigma_{12}$ term will not cancel out. Methods for consistently estimating the asymptotic covariance of a vector of moment conditions are well-known, e.g., the Newey-West estimator discussed previously. The Hausman test using a proper estimator of the overall covariance matrix will now have an asymptotic $\chi^2$ distribution when neither estimator is efficient. However, the test suffers from a loss of power due to the fact that the omnibus GMM estimator of equation 15.10.1 is defined using an inefficient weight matrix. A new test can be defined by using an alternative

omnibus GMM estimator

(15.10.3)
\[ \left(\hat\theta_1, \hat\theta_2\right) = \arg\min
\begin{bmatrix} m_1(\theta_1)' & m_2(\theta_2)' \end{bmatrix}
\tilde\Sigma^{-1}
\begin{bmatrix} m_1(\theta_1) \\ m_2(\theta_2) \end{bmatrix}, \]

where $\tilde\Sigma$ is a consistent estimator of the overall covariance matrix $\Sigma$ of equation 15.10.2. By standard arguments, this is a more efficient estimator than that defined by equation 15.10.1, so the Wald test using this alternative is more powerful. See my article in Applied Economics, 2004, for more details, including simulation results.

15.11. Application: Nonlinear rational expectations

Readings: Hansen and Singleton, 1982; Tauchen, 1986

Though GMM estimation has many applications, application to rational expectations models is elegant, since theory directly suggests the moment conditions. Hansen and Singleton's 1982 paper is also a classic worth studying in itself. Though I strongly recommend reading the paper, I'll use a simplified model with similar notation to Hamilton's.

We assume a representative consumer maximizes expected discounted utility over an infinite horizon. Utility is temporally additive, and the expected utility hypothesis holds. The future consumption stream is the stochastic sequence $\{c_s\}_{s=t}^{\infty}$. The objective function at time $t$ is the discounted expected utility

(15.11.1)
\[ \sum_{s=0}^{\infty} \beta^s E\left[ u(c_{t+s}) \mid I_t \right]. \]

- The parameter $\beta$ is between 0 and 1, and reflects discounting.

- $I_t$ is the information set at time $t$, and includes all realizations of random variables indexed $t$ and earlier.
- The choice variable is $c_t$: current consumption, which is constrained to be less than or equal to current wealth $w_t$.
- Suppose the consumer can invest in a risky asset. A dollar invested in the asset yields a gross return

\[ \left(1 + r_{t+1}\right) = \frac{p_{t+1} + d_{t+1}}{p_t}, \]

where $p_t$ is the price and $d_t$ is the dividend in period $t$. The price of $c_t$ is normalized to 1.
- Current wealth $w_t = (1 + r_t)\, i_{t-1}$, where $i_{t-1}$ is investment in period $t-1$. So the problem is to allocate current wealth between current consumption and investment to finance future consumption: $w_t = c_t + i_t$.
- Future net rates of return $r_{t+s},\, s > 0$, are not known in period $t$: the asset is risky.

A partial set of necessary conditions for utility maximization have the form:

(15.11.2)
\[ u'(c_t) = \beta E\left[ \left(1 + r_{t+1}\right) u'(c_{t+1}) \mid I_t \right]. \]

To see that the condition is necessary, suppose that the lhs < rhs. Then reducing current consumption marginally would cause equation 15.11.1, the objective function, to drop by $u'(c_t)$, since there is no discounting of the current period. At the same time, the marginal reduction in consumption finances investment, which has gross return $\left(1 + r_{t+1}\right)$, which could finance consumption in period $t+1$. This increase in consumption would cause the objective function to increase by $\beta E\left[ \left(1 + r_{t+1}\right) u'(c_{t+1}) \mid I_t \right]$. Therefore,

unless the condition holds, the expected discounted utility function is not maximized.

To use this we need to choose the functional form of utility. A constant relative risk aversion form is

\[ u(c_t) = \frac{c_t^{1-\gamma} - 1}{1 - \gamma}, \]

where $\gamma$ is the coefficient of relative risk aversion. With this form,

\[ u'(c_t) = c_t^{-\gamma}, \]

so the foc are

\[ c_t^{-\gamma} = \beta E\left[ \left(1 + r_{t+1}\right) c_{t+1}^{-\gamma} \mid I_t \right]. \]

While consumption is measured in real terms, it is unlikely that the level $c_t$ is stationary, and our theory requires stationarity. To solve this, divide through by $c_t^{-\gamma}$ (note that $c_t^{-\gamma}$ can be passed through the conditional expectation, since $c_t$ is chosen based only upon information available in time $t$):

\[ 1 = E\left[ \beta \left(1 + r_{t+1}\right) \left(\frac{c_{t+1}}{c_t}\right)^{-\gamma} \,\middle|\, I_t \right], \]

so that the condition now involves the consumption growth ratio $c_{t+1}/c_t$, which is plausibly stationary, and

\[ E\left[ \beta \left(1 + r_{t+1}\right) \left(\frac{c_{t+1}}{c_t}\right)^{-\gamma} - 1 \,\middle|\, I_t \right] = 0, \]

so that we could use this to define moment conditions. Suppose that $x_t$ is a vector of variables drawn from the information set $I_t$. We can use the necessary conditions to form the expressions

\[ \left[ 1 - \beta \left(1 + r_{t+1}\right) \left(\frac{c_{t+1}}{c_t}\right)^{-\gamma} \right] x_t \equiv m_t(\theta), \]

where $\theta = (\beta, \gamma)$.
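These expressions are easy to compute. The sketch below (Python; the data-generating choices are invented so that the Euler equation holds exactly in expectation at $\beta = 0.95$, $\gamma = 2$, with only a constant instrument) evaluates the moment contributions at the true and at a wrong parameter value:

```python
import numpy as np

def euler_moments(theta, growth, r, inst):
    """Contributions m_t = [1 - beta*(1+r_{t+1})*(c_{t+1}/c_t)^(-gamma)] * x_t."""
    beta, gamma = theta
    resid = 1.0 - beta * (1.0 + r) * growth ** (-gamma)
    return resid[:, None] * inst          # one row per period, one column per instrument

rng = np.random.default_rng(3)
T = 2000
growth = 1.02 + 0.05 * rng.normal(size=T)        # consumption growth c_{t+1}/c_t
e = 0.1 * rng.normal(size=T)                     # mean-zero expectational error
r = growth ** 2.0 / 0.95 - 1.0 + e               # returns built to satisfy the Euler eq.
inst = np.ones((T, 1))                           # constant instrument only

m_true = euler_moments((0.95, 2.0), growth, r, inst)
m_bad = euler_moments((0.80, 2.0), growth, r, inst)
print(np.abs(m_true.mean(axis=0)), np.abs(m_bad.mean(axis=0)))
```

The sample moment is near zero at the true parameters and clearly nonzero at the wrong ones. In practice `inst` would also contain lagged consumption growth and lagged returns, which are elements of the information set.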

Therefore, the above expression may be interpreted as a moment condition which can be used for GMM estimation of the parameters $\theta = (\beta, \gamma)$.

- Note that at time $t$, $m_{t-s}$ has been observed for $s \geq 1$, and is therefore an element of the information set. By rational expectations, the autocovariances of the moment conditions other than the 0th should be zero. The optimal weighting matrix is therefore the inverse of the variance of the moment conditions:

\[ \Omega_\infty = \lim_{n\to\infty} E\left[ n\, \bar m(\theta_0)\, \bar m(\theta_0)' \right], \]

which can be consistently estimated by

\[ \hat\Omega = \frac{1}{n} \sum_{t=1}^{n} m_t(\hat\theta)\, m_t(\hat\theta)'. \]

As before, this estimate depends on an initial consistent estimate of $\theta$, which can be obtained by setting the weighting matrix $W$ arbitrarily (to an identity matrix, for example). After obtaining $\hat\theta$, we then minimize

\[ s(\theta) = \bar m(\theta)'\, \hat\Omega^{-1}\, \bar m(\theta). \]

This process can be iterated, e.g., use the new estimate to re-estimate $\Omega$, use this to estimate $\theta$, and repeat until the estimates don't change.

- This whole approach relies on the very strong assumption that equation 15.11.2 holds without error. Supposing agents were heterogeneous, this wouldn't be reasonable. If there were an error term here, it could potentially be autocorrelated, which would no longer allow any variable in the information set to be used as an instrument.
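The iteration just described (start with an arbitrary $W$, estimate $\theta$, estimate $\Omega$ from the fitted moment contributions, re-minimize, repeat) can be sketched as follows. For concreteness this sketch uses an invented linear moment condition $E\left[z_t(y_t - x_t\theta)\right] = 0$, for which each step has a closed form, rather than the Euler-equation moments:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 400
z = rng.normal(size=(n, 3))                      # three instruments, one parameter
x = z @ np.array([1.0, 0.5, 0.2]) + rng.normal(size=n)
y = 2.0 * x + (1.0 + 0.5 * np.abs(z[:, 0])) * rng.normal(size=n)

def gmm(W):
    """Closed-form GMM for moments (1/n) z'(y - x*theta) with weight matrix W."""
    a = z.T @ x / n
    b = z.T @ y / n
    return (a @ W @ b) / (a @ W @ a)

theta = gmm(np.eye(3))                           # step 1: arbitrary (identity) weight
for _ in range(50):
    u = (y - theta * x)[:, None] * z             # moment contributions m_t(theta)
    Omega = u.T @ u / n                          # 0th autocovariance estimate
    theta_new = gmm(np.linalg.inv(Omega))        # step 2: efficient weight matrix
    if abs(theta_new - theta) < 1e-12:
        break
    theta = theta_new
print(theta)
```

After a few passes the estimate stops changing, which is the convergence criterion described above.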

In principle, we could use a very large number of moment conditions in estimation, since any current or lagged variable could be used in $x_t$. Since use of more moment conditions will lead to a more (asymptotically) efficient estimator, one might be tempted to use many instrumental variables. We will do a computer lab that will show that this may not be a good idea with finite samples. This issue has been studied using Monte Carlos (Tauchen, JBES, 1986). The reason for poor performance when using many instruments is that the estimate of $\Omega$ becomes very imprecise.

Empirical papers that use this approach often have serious problems in obtaining precise estimates of the parameters. Note that we are basing everything on a single partial first order condition. Probably this f.o.c. is simply not informative enough. Simulation-based estimation methods (discussed below) are one means of trying to use more informative moment conditions to estimate this sort of model.

15.12. Empirical example: a portfolio model

The Octave program portfolio.m performs GMM estimation of a portfolio model, using the data file tauchen.data. The columns of this data file are $c$, $p$, and $d$, in that order. There are 95 observations (source: Tauchen, 1986). As instruments we use 2 lags of $c$ and $r$. The estimation results are

***********************************************
Example of GMM estimation of rational expectations model

GMM Estimation Results
BFGS convergence: Normal convergence

Objective function value: 0.071872
Observations: 93

              Value        df     p-value
X^2 test     6.6841    5.0000      0.2452

           estimate   st. err    t-stat   p-value
beta         0.8723    0.0220   39.6079    0.0000
gamma        3.1555    0.2854   11.0580    0.0000
***********************************************

- Experiment with the program using lags of 1, 2, 3 and 4 periods to define instruments. Iterate the estimation of $\theta = (\beta, \gamma)$ and $\Omega$ to convergence. Comment on the results:
  - Are the results sensitive to the set of instruments used? (Look at $\hat\Omega$ as well as $\hat\theta$.)
  - Are these good instruments? Are the instruments highly correlated with one another?

Exercises

(1) Show how to cast the generalized IV estimator presented in section 11.4 as a GMM estimator. Identify what are the moment conditions, $m_t(\theta)$; what is the form of the matrix $D_n$; what is the efficient weight matrix; and show that the covariance matrix formula given previously corresponds to the GMM covariance matrix formula.
(2) Verify the missing steps needed to show that $n \cdot \bar m(\hat\theta)'\, \hat\Omega^{-1}\, \bar m(\hat\theta)$ has a $\chi^2(g - K)$ distribution. That is, show that the monster matrix is idempotent and has trace equal to $g - K$.
(3) Using Octave, generate data from the logit dgp. Recall that $E(y_t|x_t) = p(x_t, \theta) = \left[1 + \exp(-x_t'\theta)\right]^{-1}$. Consider the moment conditions (exactly identified): $m_t(\theta) = \left[y_t - p(x_t, \theta)\right] x_t$.
(a) Estimate by GMM, using these moments.
(b) Estimate by MLE.
(c) The two estimators should coincide. Prove analytically that the estimators coincide.

CHAPTER 16

Quasi-ML

Quasi-ML is the estimator one obtains when a misspecified probability model is used to calculate an "ML" estimator.

Given a sample of size $n$ of a random vector $y$ and a vector of conditioning variables $x$, suppose the joint density of $Y = (y_1, \ldots, y_n)$ conditional on $X = (x_1, \ldots, x_n)$ is a member of the parametric family $p_Y(Y|X, \rho)$, $\rho \in \Xi$. The true joint density is associated with the vector $\rho_0$:

\[ p_Y(Y|X, \rho_0). \]

As long as the marginal density of $X$ doesn't depend on $\rho_0$, this conditional density fully characterizes the random characteristics of samples: e.g., it fully describes the probabilistically important features of the d.g.p. The likelihood function is just this density evaluated at other values $\rho$:

\[ L(Y|X, \rho) = p_Y(Y|X, \rho), \quad \rho \in \Xi. \]

- Let $Y_{t-1} = (y_1, \ldots, y_{t-1})$, $Y_0 = 0$, and let $X_t = (x_1, \ldots, x_t)$. The likelihood function, taking into account possible dependence of observations, can be written as

\[ L(Y|X, \rho) = \prod_{t=1}^{n} p\left(y_t \mid Y_{t-1}, X_t, \rho\right) \equiv \prod_{t=1}^{n} p_t(\rho). \]

- The average log-likelihood function is:

\[ s_n(\rho) = \frac{1}{n} \ln L(Y|X, \rho) = \frac{1}{n} \sum_{t=1}^{n} \ln p_t(\rho). \]

- Suppose that we do not have knowledge of the family of densities $p_t(\rho)$. Mistakenly, we may assume that the conditional density of $y_t$ is a member of the family $f\left(y_t \mid Y_{t-1}, X_t, \theta\right)$, $\theta \in \Theta$, where there is no $\theta_0$ such that $f\left(y_t \mid Y_{t-1}, X_t, \theta_0\right) = p\left(y_t \mid Y_{t-1}, X_t, \rho_0\right), \forall t$ (this is what we mean by "misspecified").
- This setup allows for heterogeneous time series data, with dynamic misspecification.

The QML estimator is the argument that maximizes the misspecified average log-likelihood, which we refer to as the quasi-log likelihood function. This objective function is

\[ s_n(\theta) = \frac{1}{n} \sum_{t=1}^{n} \ln f\left(y_t \mid Y_{t-1}, X_t, \theta\right) \equiv \frac{1}{n} \sum_{t=1}^{n} \ln f_t(\theta), \]

and the QML is

\[ \hat\theta_n = \arg\max_{\Theta} s_n(\theta). \]
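A concrete instance (a Python sketch, not code from these notes; all parameter values are invented): fit a Poisson "likelihood" by Newton's method when the counts are actually overdispersed (a gamma-mixed Poisson). The conditional mean is correctly specified, and Poisson QML is known to estimate a correctly specified conditional mean consistently even though the assumed density is wrong:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 2000
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
mu = np.exp(1.0 + 0.5 * x)                        # true conditional mean
y = rng.poisson(mu * rng.gamma(2.0, 0.5, size=n))  # overdispersed: NOT Poisson given x

beta = np.array([np.log(y.mean() + 1.0), 0.0])    # crude starting values
for _ in range(100):                              # Newton steps on the quasi-loglik
    m = np.exp(np.clip(X @ beta, -30.0, 30.0))
    grad = X.T @ (y - m)                          # Poisson score
    step = np.linalg.solve((X * m[:, None]).T @ X, grad)
    beta = beta + step
    if np.max(np.abs(step)) < 1e-12:
        break
print(beta)                                       # close to the true mean parameters
```

The QML point estimates are near (1.0, 0.5) despite the density misspecification; what the misspecification breaks is the usual ML covariance formula, which is the subject of the next subsection.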

A SLLN for dependent sequences applies (we assume), so that

\[ s_n(\theta) \stackrel{a.s.}{\to} \lim_{n\to\infty} E\, s_n(\theta) \equiv \bar s_\infty(\theta). \]

We assume that this can be strengthened to uniform convergence, a.s., following the previous arguments. The "pseudo-true" value of $\theta$ is the value that maximizes $\bar s_\infty(\theta)$:

\[ \theta_0 = \arg\max_{\Theta} \bar s_\infty(\theta). \]

Given assumptions so that theorem 19 is applicable, we obtain

\[ \lim_{n\to\infty} \hat\theta_n = \theta_0, \text{ a.s.} \]

- An example of sufficient conditions for consistency are:
  - $\Theta$ is compact;
  - $s_n(\theta)$ is continuous and converges pointwise almost surely to $\bar s_\infty(\theta)$ (this means that $\bar s_\infty(\theta)$ will be continuous, and this combined with compactness of $\Theta$ means $\bar s_\infty(\theta)$ is uniformly continuous);
  - $\theta_0$ is a unique global maximizer.
- A stronger version of this assumption that allows for asymptotic normality is that $D^2_\theta\, \bar s_\infty(\theta)$ exists and is negative definite in a neighborhood of $\theta_0$.

Applying the asymptotic normality theorem,

\[ \sqrt{n}\left(\hat\theta - \theta_0\right) \stackrel{d}{\to} N\left[0,\; \mathcal{J}_\infty(\theta_0)^{-1}\, \mathcal{I}_\infty(\theta_0)\, \mathcal{J}_\infty(\theta_0)^{-1}\right], \]

where

\[ \mathcal{J}_\infty(\theta_0) = \lim_{n\to\infty} E\, D^2_\theta\, s_n(\theta_0) \]

and

\[ \mathcal{I}_\infty(\theta_0) = \lim_{n\to\infty} Var\, \sqrt{n}\, D_\theta\, s_n(\theta_0). \]

- Note that asymptotic normality only requires that the additional assumptions regarding $\mathcal{J}$ and $\mathcal{I}$ hold in a neighborhood of $\theta_0$ for $\mathcal{J}$ and at $\theta_0$ for $\mathcal{I}$, not throughout $\Theta$. In this sense, asymptotic normality is a local property.

16.1. Consistent Estimation of Variance Components

Consistent estimation of $\mathcal{J}_\infty(\theta_0)$ is straightforward. Assumption (b) of Theorem 22 implies that

\[ \mathcal{J}_n(\hat\theta_n) = \frac{1}{n} \sum_{t=1}^{n} D^2_\theta \ln f_t(\hat\theta_n) \stackrel{a.s.}{\to} \lim_{n\to\infty} E\, \frac{1}{n} \sum_{t=1}^{n} D^2_\theta \ln f_t(\theta_0) = \mathcal{J}_\infty(\theta_0). \]

That is, just calculate the Hessian using the estimate $\hat\theta_n$ in place of $\theta_0$.

Consistent estimation of $\mathcal{I}_\infty(\theta_0)$ is more difficult, and may be impossible.

- Notation: Let $g_t \equiv D_\theta \ln f_t(\theta_0)$.

We need to estimate

\[ \mathcal{I}_\infty(\theta_0) = \lim_{n\to\infty} Var\, \sqrt{n}\, \frac{1}{n} \sum_{t=1}^{n} D_\theta \ln f_t(\theta_0) = \lim_{n\to\infty} Var\, \frac{1}{\sqrt{n}} \sum_{t=1}^{n} g_t. \]

This is going to contain a term

\[ \lim_{n\to\infty} \frac{1}{n} \sum_{t=1}^{n} \left( E\, g_t \right) \left( E\, g_t \right)', \]

which will not tend to zero, in general. This term is not consistently estimable since it requires calculating an expectation using the true density under the d.g.p., which is unknown.

- There are important cases where $\mathcal{I}_\infty(\theta_0)$ is consistently estimable. For example, suppose that the data come from a random sample (i.e., they are iid). This would be the case with cross sectional data, for example. (Note: we have that the joint distribution of $(y_t, x_t)$ is identical. This does not imply that the conditional density $f(y_t|x_t)$ is identical.)
- With random sampling, the limiting objective function is simply

\[ \bar s_\infty(\theta_0) = E_x\, E_{y|x}\, \ln f(y|x, \theta_0), \]

where $E_{y|x}$ means expectation of $y|x$, and $E_x$ means expectation with respect to the marginal density of $x$.
- By the requirement that the limiting objective function be maximized at $\theta_0$, we have

\[ D_\theta\, E_x\, E_{y|x}\, \ln f(y|x, \theta_0) = D_\theta\, \bar s_\infty(\theta_0) = 0. \]

- The dominated convergence theorem allows switching the order of expectation and differentiation, so

\[ D_\theta\, \bar s_\infty(\theta_0) = E_x\, E_{y|x}\, D_\theta \ln f(y|x, \theta_0) = 0. \]

Since the observations are iid, this means that the individual score contributions all have expectation zero.

The CLT implies that

\[ \frac{1}{\sqrt{n}} \sum_{t=1}^{n} D_\theta \ln f(y_t|x_t, \theta_0) \stackrel{d}{\to} N\left[0,\; \mathcal{I}_\infty(\theta_0)\right]. \]

That is, it's not necessary to subtract the individual means, since they are zero. Given this, and due to independent observations, a consistent estimator is

\[ \hat{\mathcal{I}} = \frac{1}{n} \sum_{t=1}^{n} \left[ D_\theta \ln f(y_t|x_t, \hat\theta) \right] \left[ D_\theta \ln f(y_t|x_t, \hat\theta) \right]'. \]

This is an important case where consistent estimation of the covariance matrix is possible. Other cases exist, even for dynamically misspecified time series models.
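In this iid case the full sandwich covariance is therefore estimable. A sketch (again with invented data: a Poisson quasi-likelihood fit to overdispersed counts) compares the QML sandwich standard errors to the naive inverse-Hessian ones, which would be valid only under correct specification:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 2000
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
y = rng.poisson(np.exp(1.0 + 0.5 * x) * rng.gamma(2.0, 0.5, size=n))

beta = np.array([np.log(y.mean() + 1.0), 0.0])
for _ in range(100):                              # Poisson QML via Newton
    m = np.exp(np.clip(X @ beta, -30.0, 30.0))
    step = np.linalg.solve((X * m[:, None]).T @ X, X.T @ (y - m))
    beta = beta + step
    if np.max(np.abs(step)) < 1e-12:
        break

m = np.exp(X @ beta)
g = (y - m)[:, None] * X                          # score contributions (means ~ 0)
J = -(X * m[:, None]).T @ X / n                   # Hessian component
I = g.T @ g / n                                   # outer-product component
Ji = np.linalg.inv(J)
v_sandwich = Ji @ I @ Ji / n                      # robust QML covariance
v_naive = -Ji / n                                 # correct only if the model is true
print(np.sqrt(np.diag(v_sandwich)), np.sqrt(np.diag(v_naive)))
```

Because the data are overdispersed relative to the Poisson, the sandwich standard errors are larger than the naive ones; using the naive formula here would overstate precision.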

CHAPTER 17

Nonlinear least squares (NLS)

Readings: Davidson and MacKinnon, Ch. 2 and 5; Gallant, Ch. 1

17.1. Introduction and definition

Nonlinear least squares (NLS) is a means of estimating the parameter of the model

\[ y_t = f(x_t, \theta_0) + \varepsilon_t. \]

- In general, $\varepsilon_t$ will be heteroscedastic and autocorrelated, and possibly nonnormally distributed. However, dealing with this is exactly as in the case of linear models, so we'll just treat the iid case here:

\[ \varepsilon_t \sim iid(0, \sigma^2). \]

If we stack the observations vertically, defining

\[ y = (y_1, y_2, \ldots, y_n)', \quad f = \left( f(x_1,\theta), f(x_2,\theta), \ldots, f(x_n,\theta) \right)', \quad \varepsilon = (\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n)', \]

we can write the $n$ observations as

\[ y = f(\theta) + \varepsilon. \]

Using this notation, the NLS estimator can be defined as

\[ \hat\theta \equiv \arg\min_{\theta \in \Theta} s_n(\theta) = \frac{1}{n} \left[ y - f(\theta) \right]' \left[ y - f(\theta) \right] = \frac{1}{n} \left\| y - f(\theta) \right\|^2. \]

- The estimator minimizes the weighted sum of squared errors, which is the same as minimizing the Euclidean distance between $y$ and $f(\theta)$.

The objective function can be written as

\[ s_n(\theta) = \frac{1}{n} \left[ y'y - 2 y' f(\theta) + f(\theta)' f(\theta) \right], \]

which gives the first order conditions

\[ -\left[ \frac{\partial}{\partial\theta} f(\hat\theta)' \right] y + \left[ \frac{\partial}{\partial\theta} f(\hat\theta)' \right] f(\hat\theta) \equiv 0. \]

Define the $n \times K$ matrix

(17.1.1)
\[ F(\hat\theta) \equiv D_{\theta'}\, f(\hat\theta). \]

In shorthand, use $\hat F$ in place of $F(\hat\theta)$. Using this, the first order conditions can be written as

\[ -\hat F' y + \hat F' f(\hat\theta) \equiv 0, \]

or

(17.1.2)
\[ \hat F' \left[ y - f(\hat\theta) \right] \equiv 0. \]

This bears a good deal of similarity to the f.o.c. for the linear model: the derivative of the prediction is orthogonal to the prediction error. If $f(\theta) = X\theta$, then $\hat F$ is simply $X$, so the f.o.c. (with spherical errors) simplify to the usual OLS f.o.c. $X'(y - X\beta) = 0$.
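To make the definition concrete, here is a minimal Python sketch (the model and numbers are invented) that minimizes $s_n(\theta)$ by brute-force grid search for a one-parameter exponential regression, and then checks that the first order condition $F(\hat\theta)'\left[y - f(\hat\theta)\right] \approx 0$ holds at the minimizer:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200
x = rng.normal(size=n)
y = np.exp(0.7 * x) + 0.1 * rng.normal(size=n)   # y_t = f(x_t, theta_0) + eps_t

def s_n(theta):                                   # the NLS objective function
    return np.mean((y - np.exp(theta * x)) ** 2)

grid = np.linspace(-2.0, 2.0, 4001)
theta_hat = grid[np.argmin([s_n(t) for t in grid])]

f = np.exp(theta_hat * x)
F = x * f                                         # derivative of f with respect to theta
foc = F @ (y - f) / n                             # should be approximately zero
print(theta_hat, foc)
```

Grid search is only practical in very low dimensions; it is used here simply to show that the minimizer of the objective satisfies the orthogonality condition.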

We can interpret this geometrically: INSERT drawings of geometrical depiction of OLS and NLS (see Davidson and MacKinnon, pgs. 8, 13 and 46).

17.2. Identification

As before, identification can be considered conditional on the sample, and asymptotically. The condition for asymptotic identification is that $s_n(\theta)$ tend to a limiting function $\bar s_\infty(\theta)$ such that $\bar s_\infty(\theta_0) < \bar s_\infty(\theta)$, $\forall \theta \neq \theta_0$. This will be the case if $\bar s_\infty(\theta)$ is strictly convex at $\theta_0$, which requires that $D^2_\theta\, \bar s_\infty(\theta_0)$ be positive definite.

- Note that the nonlinearity of the manifold leads to potential multiple local maxima, minima and saddlepoints: the objective function $s_n(\theta)$ is not necessarily well-behaved and may be difficult to minimize.

Consider the objective function:

\[ s_n(\theta) = \frac{1}{n} \sum_{t=1}^{n} \left[ y_t - f(x_t,\theta) \right]^2 = \frac{1}{n} \sum_{t=1}^{n} \left[ f(x_t,\theta_0) + \varepsilon_t - f(x_t,\theta) \right]^2 \]
\[ = \frac{1}{n} \sum_{t=1}^{n} \left[ f(x_t,\theta_0) - f(x_t,\theta) \right]^2 + \frac{2}{n} \sum_{t=1}^{n} \left[ f(x_t,\theta_0) - f(x_t,\theta) \right] \varepsilon_t + \frac{1}{n} \sum_{t=1}^{n} \varepsilon_t^2. \]

- As in the example of Chapter 14, which illustrated the consistency of extremum estimators using OLS, a LLN can be applied to the second (cross) term to conclude that it converges pointwise to 0, as long as $f(\theta)$ and $\varepsilon$ are uncorrelated.
- We conclude that the third term will converge to a constant which does not depend upon $\theta$.

- Next, considering the first term, there are a number of possible assumptions one could use. Here, we'll assume a pointwise law of large numbers applies, so

(17.2.1)
\[ \frac{1}{n} \sum_{t=1}^{n} \left[ f(x_t,\theta_0) - f(x_t,\theta) \right]^2 \stackrel{a.s.}{\to} \int \left[ f(x,\theta_0) - f(x,\theta) \right]^2 d\mu(x), \]

where $\mu(x)$ is the distribution function of $x$. In many cases, $f(x,\theta)$ will be bounded and continuous, for all $\theta \in \Theta$, so strengthening to uniform almost sure convergence is immediate. For example if $f(x_t,\theta) = \left[1 + \exp(-x_t'\theta)\right]^{-1}$, then $f : \Re^K \to (0,1)$, a bounded range, and the function is continuous in $\theta$.

Given these results, it is clear that a minimizer is $\theta_0$. When considering identification (asymptotic), the question is whether or not there may be some other minimizer. A local condition for identification is that

\[ \frac{\partial^2}{\partial\theta\,\partial\theta'}\, \bar s_\infty(\theta) = \frac{\partial^2}{\partial\theta\,\partial\theta'} \int \left[ f(x,\theta_0) - f(x,\theta) \right]^2 d\mu(x) \]

be positive definite at $\theta_0$. Evaluating this derivative, we obtain (after a little work)

\[ \left. \frac{\partial^2}{\partial\theta\,\partial\theta'} \int \left[ f(x,\theta_0) - f(x,\theta) \right]^2 d\mu(x) \right|_{\theta_0} = 2 \int \left[ D_\theta f(x,\theta_0) \right] \left[ D_\theta f(x,\theta_0) \right]' d\mu(x), \]

the expectation of the outer product of the gradient of the regression function evaluated at $\theta_0$. (Note: the uniform boundedness we have already assumed allows passing the derivative through the integral, by the dominated convergence theorem.)

This matrix will be positive definite (wp1) as long as the gradient vector is of full rank (wp1). The tangent space to the regression manifold must span a $K$-dimensional space if we are to consistently estimate a $K$-dimensional parameter vector.

- This is analogous to the requirement that there be no perfect collinearity in a linear model. This is a necessary condition for identification. Note that the LLN implies that the above expectation is equal to

\[ \mathcal{J}_\infty(\theta_0) = 2 \lim E\, \frac{F'F}{n}. \]

17.3. Consistency

We simply assume that the conditions of Theorem 19 hold, so the estimator is consistent. Given that the strong stochastic equicontinuity conditions hold, as discussed above, and given the above identification conditions and a compact estimation space (the closure of the parameter space $\Theta$), the consistency proof's assumptions are satisfied.

17.4. Asymptotic normality

As in the case of GMM, we also simply assume that the conditions for asymptotic normality as in Theorem 22 hold. The only remaining problem is to determine the form of the asymptotic variance-covariance matrix. Recall that the result of the asymptotic normality theorem is

\[ \sqrt{n}\left(\hat\theta - \theta_0\right) \stackrel{d}{\to} N\left[0,\; \mathcal{J}_\infty(\theta_0)^{-1}\, \mathcal{I}_\infty(\theta_0)\, \mathcal{J}_\infty(\theta_0)^{-1}\right], \]

where $\mathcal{J}_\infty(\theta_0)$ is the almost sure limit of $\frac{\partial^2}{\partial\theta\,\partial\theta'} s_n(\theta)$ evaluated at $\theta_0$, and

\[ \mathcal{I}_\infty(\theta_0) = \lim Var\, \sqrt{n}\, D_\theta\, s_n(\theta_0). \]

The objective function is

\[ s_n(\theta) = \frac{1}{n} \sum_{t=1}^{n} \left[ y_t - f(x_t,\theta) \right]^2, \]

so

\[ D_\theta\, s_n(\theta) = -\frac{2}{n} \sum_{t=1}^{n} \left[ y_t - f(x_t,\theta) \right] D_\theta f(x_t,\theta). \]

Evaluating at $\theta_0$,

\[ D_\theta\, s_n(\theta_0) = -\frac{2}{n} \sum_{t=1}^{n} \varepsilon_t\, D_\theta f(x_t,\theta_0). \]

Noting that the sum can be written compactly using the matrix $F \equiv F(\theta_0)$, we can write the above as

\[ D_\theta\, s_n(\theta_0) = -\frac{2}{n} F'\varepsilon. \]

With this we obtain

\[ \sqrt{n}\, D_\theta\, s_n(\theta_0) = -2\, \frac{F'\varepsilon}{\sqrt{n}}. \]

This converges in distribution, following a CLT, to a normal vector with covariance matrix

\[ \mathcal{I}_\infty(\theta_0) = \lim Var\, \sqrt{n}\, D_\theta\, s_n(\theta_0) = 4\sigma^2 \lim E\, \frac{F'F}{n}. \]

We've already seen that

\[ \mathcal{J}_\infty(\theta_0) = 2 \lim E\, \frac{F'F}{n}. \]

Combining these expressions for $\mathcal{J}_\infty(\theta_0)$ and $\mathcal{I}_\infty(\theta_0)$, and the result of the asymptotic normality theorem, we get

\[ \sqrt{n}\left(\hat\theta - \theta_0\right) \stackrel{d}{\to} N\left[0,\; \left( \lim E\, \frac{F'F}{n} \right)^{-1} \sigma^2 \right]. \]

We can consistently estimate the variance covariance matrix using

(17.4.1)
\[ \left( \frac{\hat F'\hat F}{n} \right)^{-1} \hat\sigma^2, \]

where $\hat F$ is defined as in equation 17.1.1 and

\[ \hat\sigma^2 = \frac{\left[ y - f(\hat\theta) \right]'\left[ y - f(\hat\theta) \right]}{n}, \]

the obvious estimator. Note the close correspondence to the results for the linear model.

17.5. Example: The Poisson model for count data

A Poisson random variable is a count data variable, which means it can take the values {0, 1, 2, ...}. This sort of model has been used to study visits to doctors per year, number of patents registered by businesses per year, etc. The Poisson density is

\[ f(y_t) = \frac{\exp(-\lambda_t)\, \lambda_t^{y_t}}{y_t!}, \quad y_t \in \{0, 1, 2, \ldots\}. \]

The mean of $y_t$ is $\lambda_t$, as is the variance. Note that $\lambda_t$ must be positive. Suppose that the true mean is

\[ \lambda_t^0 = \exp(x_t'\beta_0), \]

which enforces the positivity of $\lambda_t$. Suppose we estimate $\beta_0$ by nonlinear least squares:

\[ \hat\beta = \arg\min s_n(\beta) = \frac{1}{n} \sum_{t=1}^{n} \left[ y_t - \exp(x_t'\beta) \right]^2. \]

We can write

\[ s_n(\beta) = \frac{1}{n} \sum_{t=1}^{n} \left[ \exp(x_t'\beta_0) + v_t - \exp(x_t'\beta) \right]^2 \]
\[ = \frac{1}{n} \sum_{t=1}^{n} \left[ \exp(x_t'\beta_0) - \exp(x_t'\beta) \right]^2 + \frac{2}{n} \sum_{t=1}^{n} \left[ \exp(x_t'\beta_0) - \exp(x_t'\beta) \right] v_t + \frac{1}{n} \sum_{t=1}^{n} v_t^2, \]

where $v_t = y_t - \exp(x_t'\beta_0)$. The middle term has expectation zero, since the assumption that $E(y_t|x_t) = \exp(x_t'\beta_0)$ implies that $v_t$ and functions of $x_t$ are uncorrelated. Applying a strong LLN, the objective function converges almost surely to

\[ \bar s_\infty(\beta) = E_x \left[ \exp(x'\beta_0) - \exp(x'\beta) \right]^2 + E_x \exp(x'\beta_0), \]

where the last term comes from the fact that the conditional variance of $y$ is the same as the conditional mean, $\exp(x'\beta_0)$, for the Poisson distribution. This function is clearly minimized at $\beta = \beta_0$, so the NLS estimator is consistent as long as identification holds.

EXERCISE 27. Determine the limiting distribution of $\sqrt{n}\left(\hat\beta - \beta_0\right)$. This means finding the specific forms of $\frac{\partial^2}{\partial\beta\,\partial\beta'} s_n(\beta)$, $\mathcal{J}(\beta_0)$, $\left. \frac{\partial s_n(\beta)}{\partial\beta} \right|_{\beta_0}$, and $\mathcal{I}(\beta_0)$. Again, use a CLT as needed, no need to verify that it can be applied.

17.6. The Gauss-Newton algorithm

Readings: Davidson and MacKinnon, Chapter 6, pgs. 201-207.

The Gauss-Newton optimization technique is specifically designed for nonlinear least squares. The idea is to linearize the nonlinear model, rather than

This can be written as a Ý Id © I R¥ y d ' 1 P d¥ ¢ © I Rs § . d process. take a new Taylor’s series expansion around p d Given we calculate a new round estimate of as 8 I dlPU ¢ d h d Üe¥ $ U The other new element here is 8 g© I U $ § Similarly. Note that one could esti- d d g© I R¥ i Note that is known. mate simply by performing OLS on the above equation. not equal to a Q 8G P d V© p e¥ we have P ©d s4R¥ d and the error due Take a With . is the and matrix of derivatives of the plus approximation error from is d ﬁrst order Taylor’s series approximation around a point and repeat the 8p d r I P a ÞQP ó I ipsð ³ ó I ð y ° © I R¥ d d d P d Ý U d to evaluating the regression function at rather than the true value G © I es d ¥ U U a where is a combination of the fundamental error term p d At some in the parameter space.17. given 8I $ where. THE GAUSS-NEWTON ALGORITHM 362 the objective function.6. which is also known. Stop when (to within a speciﬁed tolerance). ¢ d this. evaluated at the truncated Taylor’s series. regression function. as above. The model is approximationerror.

To see why this might work, consider the above approximation, but evaluated at the NLS estimator:

\[ y = f(\hat\theta) + F(\hat\theta)\left(\theta - \hat\theta\right) + \omega. \]

The OLS estimate of $b \equiv \theta - \hat\theta$ is

\[ \hat b = \left( \hat F'\hat F \right)^{-1} \hat F' \left[ y - f(\hat\theta) \right]. \]

This must be zero, since

\[ \hat F' \left[ y - f(\hat\theta) \right] \equiv 0 \]

by definition of the NLS estimator (these are the normal equations as in equation 17.1.2). So when we evaluate at $\hat\theta$, updating would stop.

- The Gauss-Newton method doesn't require second derivatives, as does the Newton-Raphson method, so it's faster.
- The varcov estimator, as in equation 17.4.1, is simple to calculate, since we have $\hat F$ as a by-product of the estimation process (i.e., it's just the last round "regressor matrix"). In fact, a normal OLS program will give the NLS varcov estimator directly, since it's just the OLS varcov estimator from the last iteration.
- The method can suffer from convergence problems since $F(\theta)'F(\theta)$ may be very nearly singular, even with an asymptotically identified model, especially if $\theta$ is very far from $\hat\theta$. Consider the example

\[ y_t = \beta_1 + \beta_2 x_t^{\beta_3} + \epsilon_t. \]

When evaluated at $\beta_2 \approx 0$, $\beta_3$ has virtually no effect on the NLS objective function, so $F$ will have rank that is "essentially" 2, rather than 3. In this case, $F'F$ will be nearly singular, so $(F'F)^{-1}$ will be subject to large roundoff errors.

17.7. Application: Limited dependent variables and sample selection

Readings: Davidson and MacKinnon, Ch. 15 (a quick reading is sufficient), also J. Heckman, "Sample Selection Bias as a Specification Error", Econometrica, 1979. (This is a classic article, not required for reading, and which is a bit out-dated. Nevertheless it's a good place to start if you encounter sample selection problems in your research.)

Sample selection is a common problem in applied research. The problem occurs when observations used in estimation are sampled non-randomly, according to some selection scheme.

17.7.1. Example: Labor Supply. Labor supply of a person is a positive number of hours per unit time, supposing the offer wage is higher than the reservation wage, which is the wage at which the person prefers not to work. The model (very simple, with $t$ subscripts suppressed):
- Characteristics of individual: $x$
- Latent labor supply: $s^* = x'\beta + \omega$
- Offer wage: $w^o = z'\gamma + \nu$
- Reservation wage: $w^r = q'\delta + \eta$

Write the wage differential as

$w^* = \left(z'\gamma + \nu\right) - \left(q'\delta + \eta\right) \equiv r'\theta + \varepsilon.$

We have the set of equations

$s^* = x'\beta + \omega, \qquad w^* = r'\theta + \varepsilon.$

Assume that $\omega$ and $\varepsilon$ are jointly normally distributed, with mean zero, $V(\omega)=\sigma^2$, $V(\varepsilon)=1$, and $\mathrm{Cov}(\omega,\varepsilon)=\rho\sigma$. We assume that the offer wage and the reservation wage, as well as the latent variable $s^*$, are unobservable. What is observed is

$w = \mathbf{1}\left[w^* > 0\right], \qquad s = w\,s^*.$

In other words, we observe whether or not a person is working. If the person is working, we observe labor supply, which is equal to latent labor supply, $s^*$. Otherwise, $s = 0 \neq s^*$. Note that we are using a simplifying assumption that individuals can freely choose their weekly hours of work.

Suppose we estimated the model

$s^* = x'\beta + \text{residual}$

using only observations for which $s > 0$. The problem is that these observations are those for which $w^* > 0$, or equivalently, $-\varepsilon < r'\theta$, and

$\mathcal{E}\left[\omega \mid -\varepsilon < r'\theta\right] \neq 0,$

since $\varepsilon$ and $\omega$ are dependent. Furthermore, this expectation will in general depend on $x$, since elements of $x$ can enter in $r$. Because of these two facts, least squares estimation is biased and inconsistent.

Consider more carefully $\mathcal{E}\left[\omega \mid -\varepsilon < r'\theta\right]$. Given the joint normality of $\omega$ and $\varepsilon$, we can write (see for example Spanos, Statistical Foundations of Econometric Modelling, pg. 122)

$\omega = \rho\sigma\varepsilon + \upsilon,$

where $\upsilon$ has mean zero and is independent of $\varepsilon$. With this we can write

$s^* = x'\beta + \rho\sigma\varepsilon + \upsilon.$

If we condition this equation on $w^* > 0$ we get

$s = x'\beta + \rho\sigma\,\mathcal{E}\left[\varepsilon \mid -\varepsilon < r'\theta\right] + \upsilon.$

A useful result is that for $z \sim N(0,1)$,

$E\left(z \mid z > z^*\right) = \frac{\phi(z^*)}{\Phi(-z^*)},$

where $\phi(\cdot)$ and $\Phi(\cdot)$ are the standard normal density and distribution function, respectively. The quantity on the RHS above is known as the inverse Mill's ratio.

With this we can write

$s = x'\beta + \rho\sigma\,\dfrac{\phi(r'\theta)}{\Phi(r'\theta)} + \upsilon$   (17.7.1)

$\;\;\equiv \begin{bmatrix} x' & \dfrac{\phi(r'\theta)}{\Phi(r'\theta)} \end{bmatrix}\begin{bmatrix} \beta \\ \zeta \end{bmatrix} + \upsilon,$   (17.7.2)

where $\zeta = \rho\sigma$. The error term $\upsilon$ has conditional mean zero, and is uncorrelated with the regressors $x$, $\phi(r'\theta)/\Phi(r'\theta)$. At this point, we can estimate the equation by NLS.
- Heckman showed how one can estimate this in a two step procedure where first $\theta$ is estimated, then equation 17.7.2 is estimated by least squares using the estimated value of $\theta$ to form the regressors. This is inefficient, and estimation of the covariance is a tricky issue. It is probably easier (and more efficient) just to do MLE.
- The model presented above depends strongly on joint normality. There exist many alternative models which weaken the maintained assumptions. It is possible to estimate consistently without distributional assumptions. See Ahn and Powell, Journal of Econometrics, 1994.
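A small simulation makes the selection bias concrete. This Python sketch (all parameter values are invented, and the true $\theta$ is plugged in where Heckman's first step would estimate it by probit) compares naive OLS on the selected sample with the regression that adds the inverse Mill's ratio regressor of equation 17.7.2:

```python
import math, random

def imr(z):
    # inverse Mill's ratio: phi(z) / Phi(z)
    phi = math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)
    Phi = 0.5 * (1.0 + math.erf(z / math.sqrt(2)))
    return phi / Phi

def ols2(u, v, y):
    # OLS of y on two regressors (u, v): solve the 2x2 normal equations
    a11 = sum(a * a for a in u); a22 = sum(b * b for b in v)
    a12 = sum(a * b for a, b in zip(u, v))
    c1 = sum(a * b for a, b in zip(u, y)); c2 = sum(a * b for a, b in zip(v, y))
    det = a11 * a22 - a12 * a12
    return (a22 * c1 - a12 * c2) / det, (a11 * c2 - a12 * c1) / det

random.seed(3)
beta, theta, rho, sigma = 1.0, 1.0, 0.8, 1.0        # so zeta = rho*sigma = 0.8
xs, ss = [], []
for _ in range(20000):
    x = random.gauss(0, 1)                           # here r = x (overlap)
    eps = random.gauss(0, 1)
    omega = rho * sigma * eps + random.gauss(0, sigma * math.sqrt(1 - rho**2))
    if x * theta + eps > 0:                          # observed only when w* > 0
        xs.append(x); ss.append(x * beta + omega)

_, b_naive = ols2([1.0] * len(xs), xs, ss)           # OLS with intercept: biased
b_hat, zeta_hat = ols2(xs, [imr(x * theta) for x in xs], ss)  # add IMR regressor
```

With $\rho\sigma = 0.8$, the naive slope is biased well below 1, while `b_hat` recovers $\beta$ and `zeta_hat` recovers $\rho\sigma$.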

CHAPTER 18

Nonparametric inference

18.1. Possible pitfalls of parametric inference: estimation

Readings: H. White (1980), "Using Least Squares to Approximate Unknown Regression Functions," International Economic Review, pp. 149-70.

In this section we consider a simple example, which illustrates why nonparametric methods may in some cases be preferred to parametric methods. We suppose that data is generated by random sampling of $(y, x)$, where $y = f(x) + \varepsilon$, $x$ is uniformly distributed on $(0, 2\pi)$ and $\varepsilon$ is a classical error. Suppose that

$f(x) = 1 + \dfrac{3x}{2\pi} - \left(\dfrac{x}{2\pi}\right)^2.$

The problem of interest is to estimate the elasticity of $f(x)$ with respect to $x$, throughout the range of $x$. In general, the functional form of $f(x)$ is unknown. One idea is to take a Taylor's series approximation to $f(x)$ about some point $x^0$. Flexible functional forms such as the transcendental logarithmic (usually known as the translog) can be interpreted as second order Taylor's series approximations. We'll work with a first order approximation, for simplicity. Approximating about $x^0$:

$h(x) = f(x^0) + D_x f(x^0)\left(x - x^0\right).$

If the approximation point is $x^0 = 0$, we can write

$h(x) = a + bx.$

The coefficient $a$ is the value of the function at $x^0 = 0$, and the slope $b$ is the value of the derivative at $x^0 = 0$. These are of course not known. One might try estimation by ordinary least squares. The objective function is

$s_n(a, b) = \frac{1}{n}\sum_{t=1}^n \left(y_t - h(x_t)\right)^2.$

The limiting objective function, following the argument we used to get equations 14.1 and 17.2, is

$s_\infty(a, b) = \int_0^{2\pi} \left(f(x) - h(x)\right)^2 dx.$

The theorem regarding the consistency of extremum estimators (Theorem 19) tells us that $\hat a$ and $\hat b$ will converge almost surely to the values that minimize the limiting objective function. Solving the first order conditions¹ reveals that $s_\infty(a, b)$ obtains its minimum at $\left\{a^0 = \frac{7}{6},\; b^0 = \frac{1}{\pi}\right\}$. The estimated approximating function $\hat h(x)$ therefore tends almost surely to

$h_\infty(x) = \frac{7}{6} + \frac{x}{\pi}.$

We may plot the true function and the limit of the approximation to see the asymptotic bias as a function of $x$. (In the figure, the approximating model is the straight line, while the true model has curvature.) Note that the approximating model is in general inconsistent, even at the approximation point. This shows that "flexible functional forms" based upon Taylor's series approximations do not in general allow consistent estimation.

¹All calculations were done using Scientific Workplace.

The mathematical properties of the Taylor's series do not carry over when coefficients are estimated.

The approximating model seems to fit the true model fairly well, asymptotically. However, we are interested in the elasticity of the function. Recall that an elasticity is the marginal function divided by the average function:

$\varepsilon(x) = \dfrac{x f'(x)}{f(x)}.$

Good approximation of the elasticity over the range of $x$ will require a good approximation of both $f(x)$ and $f'(x)$ over the range of $x$. The approximating elasticity is

$\eta(x) = \dfrac{x h'(x)}{h(x)}.$

Plotting the true elasticity and the elasticity obtained from the limiting approximating model (the true elasticity is the line that has negative slope for large $x$), visually we see that the elasticity is not approximated so well. Root mean squared error in the approximation of the elasticity is

$\left(\int_0^{2\pi}\left(\varepsilon(x) - \eta(x)\right)^2 dx\right)^{1/2}.$

Now suppose we use the leading terms of a trigonometric series as the approximating model. The reason for using a trigonometric series as an approximating model is motivated by the asymptotic properties of the Fourier flexible functional form (Gallant, 1981, 1982), which we will study in more detail below. Normally, with this type of model the number of basis functions is an increasing function of the sample size.

Here we hold the set of basis functions fixed. We will consider the asymptotic behavior of a fixed model, which we interpret as an approximation to the estimator's behavior in finite samples. Consider the set of basis functions:

$Z(x) = \begin{bmatrix} 1 & x & \cos(x) & \sin(x) & \cos(2x) & \sin(2x) \end{bmatrix}.$

The approximating model is

$g_K(x) = Z(x)\alpha.$

Maintaining these basis functions as the sample size increases, we find that the limiting objective function is minimized at

$\left\{ \alpha_1 = \frac{7}{6},\; \alpha_2 = \frac{1}{\pi},\; \alpha_3 = -\frac{1}{\pi^2},\; \alpha_4 = 0,\; \alpha_5 = -\frac{1}{4\pi^2},\; \alpha_6 = 0 \right\}.$

Substituting these values into $g_K(x)$, we obtain the almost sure limit of the approximation

$g_\infty(x) = \frac{7}{6} + \frac{x}{\pi} - \frac{1}{\pi^2}\cos(x) - \frac{1}{4\pi^2}\cos(2x).$   (18.1.1)

Plotting the approximation and the true function, clearly the truncated trigonometric series model offers a better approximation, asymptotically, than does the linear model. Plotting elasticities: on average, the fit is better, though there is some implausible wavyness in the estimate.
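These limiting values can be checked numerically. The Python sketch below approximates the limiting objective function integrals on a grid (grid size and tolerances are arbitrary choices), recovering $\{7/6, 1/\pi\}$ for the linear limit and the cosine coefficients of (18.1.1), and confirming that the trigonometric limit approximates the elasticity better than the linear one:

```python
import math

PI, N = math.pi, 20000
xs = [(t + 0.5) * 2 * PI / N for t in range(N)]           # midpoint grid on (0, 2*pi)

def f(x):  return 1 + 3 * x / (2 * PI) - (x / (2 * PI)) ** 2
def fp(x): return 3 / (2 * PI) - x / (2 * PI ** 2)

fx = [f(x) for x in xs]

# grid least squares of a + b*x against f approximates argmin of s_infty(a, b)
sx, sxx = sum(xs), sum(x * x for x in xs)
sf, sxf = sum(fx), sum(x * y for x, y in zip(xs, fx))
det = N * sxx - sx * sx
a = (sxx * sf - sx * sxf) / det            # should approach 7/6
b = (N * sxf - sx * sf) / det              # should approach 1/pi

# cos(jx) is orthogonal to 1, x and sin(jx) on (0, 2*pi), so the remaining
# coefficients are ordinary Fourier cosine coefficients of f
c = [sum(y * math.cos(j * x) for x, y in zip(xs, fx)) /
     sum(math.cos(j * x) ** 2 for x in xs) for j in (1, 2)]

# RMSE of the two limiting elasticity approximations
def eps(x):  return x * fp(x) / f(x)                       # true elasticity
def lin(x):  return x * b / (a + b * x)                    # linear limit
def trig(x):
    g  = a + b * x + c[0] * math.cos(x) + c[1] * math.cos(2 * x)
    gp = b - c[0] * math.sin(x) - 2 * c[1] * math.sin(2 * x)
    return x * gp / g

dx = 2 * PI / N
rmse_lin  = math.sqrt(sum((eps(x) - lin(x)) ** 2  for x in xs) * dx)
rmse_trig = math.sqrt(sum((eps(x) - trig(x)) ** 2 for x in xs) * dx)
```

The grid least squares stands in for the analytic first order conditions; only the relative sizes of the two RMSEs matter for the argument in the text.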

Root mean squared error in the approximation of the elasticity is about half that of the RMSE when the first order approximation is used. If the trigonometric series contained infinite terms, this error measure would be driven to zero, as we shall see.

18.2. Possible pitfalls of parametric inference: hypothesis testing

What do we mean by the term "nonparametric inference"? Simply, this means inferences that are possible without restricting the functions of interest to belong to a parametric family. Consider means of testing for the hypothesis that consumers maximize utility. A consequence of utility maximization is that the Slutsky matrix $D^2_p h(p, U)$, where $h(p, U)$ are the compensated demand functions, must be negative semi-definite. One approach to testing for utility maximization would estimate a set of normal demand functions $x(p, m)$. Estimation of these functions by normal parametric methods requires specification of the functional form of demand, for example

$x(p, m) = x(p, m, \theta^0) + \varepsilon, \qquad \theta^0 \in \Theta^0,$

where $x(p, m, \theta^0)$ is a function of known form and $\Theta^0$ is a finite dimensional parameter. After estimation, we could use $\hat x = x(p, m, \hat\theta)$ to calculate (by solving the integrability problem, which is non-trivial) $\hat D^2_p h(p, U)$.

If we can statistically reject that the matrix is negative semi-definite, we might conclude that consumers don't maximize utility. The problem with this is that the reason for rejection of the theoretical proposition may be that our choice of functional form is incorrect. In the introductory section we saw that functional form misspecification leads to inconsistent estimation of the function and its derivatives.

Testing using parametric models always means we are testing a compound hypothesis. The hypothesis that is tested is 1) the economic proposition we wish to test, and 2) the model is correctly specified. Failure of either 1) or 2) can lead to rejection. This is known as the "model-induced augmenting hypothesis." Varian's WARP allows one to test for utility maximization without specifying the form of the demand functions. The only assumptions used in the test are those directly implied by theory, so rejection of the hypothesis calls into question the theory. Nonparametric inference allows direct testing of economic propositions, without the "model-induced augmenting hypothesis."

18.3. The Fourier functional form

Readings: Gallant, 1987, "Identification and consistency in semi-nonparametric regression," in Advances in Econometrics, Fifth World Congress, V. 1, Truman Bewley, ed., Cambridge.

Suppose we have a multivariate model

$y = f(x) + \varepsilon,$

where $f(x)$ is of unknown form and $x$ is a $P$-dimensional vector. For simplicity, assume that $\varepsilon$ is a classical error. Let us take the estimation of the vector of elasticities with typical element

$\xi_{x_i} = \dfrac{x_i}{f(x)}\dfrac{\partial f(x)}{\partial x_i},$

at an arbitrary point $x_i$. The Fourier form, following Gallant (1982), but with a somewhat different parameterization, may be written as

$g_K(x \mid \theta_K) = \alpha + x'\beta + \tfrac{1}{2}x'Cx + \sum_{\alpha=1}^{A}\sum_{j=1}^{J}\left(u_{j\alpha}\cos\left(jk_\alpha'x\right) - v_{j\alpha}\sin\left(jk_\alpha'x\right)\right),$   (18.3.1)

where the $K$-dimensional parameter vector is

$\theta_K = \left\{\alpha, \beta', \operatorname{vec}^*(C)', u_{11}, v_{11}, \ldots, u_{JA}, v_{JA}\right\}'.$   (18.3.2)

We assume that the conditioning variables $x$ have each been transformed to lie in an interval that is shorter than $2\pi$. This is required to avoid periodic behavior of the approximation, which is desirable since economic functions aren't periodic. For example, subtract sample means, divide by the maxima of the conditioning variables, and multiply by $2\pi - \epsilon$, where $\epsilon$ is some positive number less than $2\pi$ in value.

The $k_\alpha$ are "elementary multi-indices", which are simply $P$-vectors formed of integers (negative, positive and zero). The $k_\alpha$, $\alpha = 1, 2, \ldots, A$, are required to be linearly independent.

We follow the convention that the first non-zero element of a multi-index be positive. For example, $[0\;\; 1\;\; {-1}\;\; 0\;\; 1]'$ is a potential multi-index to be used, but $[0\;\; {-1}\;\; {-1}\;\; 0\;\; 1]'$ is not, since its first nonzero element is negative. Nor is $[0\;\; 2\;\; {-2}\;\; 0\;\; 2]'$ a multi-index we would use, since it is a scalar multiple of the original multi-index.
- We parameterize the matrix $C$ differently than does Gallant, because it simplifies things in practice. The cost of this is that we are no longer able to test a quadratic specification using nested testing.

The vector of first partial derivatives is

$D_x g_K(x \mid \theta_K) = \beta + Cx + \sum_{\alpha=1}^{A}\sum_{j=1}^{J}\left[\left(-u_{j\alpha}\sin\left(jk_\alpha'x\right) - v_{j\alpha}\cos\left(jk_\alpha'x\right)\right)jk_\alpha\right],$   (18.3.3)

and the matrix of second partial derivatives is

$D^2_x g_K(x \mid \theta_K) = C + \sum_{\alpha=1}^{A}\sum_{j=1}^{J}\left[\left(-u_{j\alpha}\cos\left(jk_\alpha'x\right) + v_{j\alpha}\sin\left(jk_\alpha'x\right)\right)j^2 k_\alpha k_\alpha'\right].$   (18.3.4)
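For concreteness, here is one way the regressor vector corresponding to (18.3.1) might be built in code, a hedged Python sketch (the term ordering and the layout of the quadratic terms are illustrative choices, not Gallant's exact parameterization):

```python
import math

def fourier_row(x, multi_indices, J):
    """One row of regressors for the Fourier form: the model is linear in
    theta_K, with regressors [1, x', quadratic terms, cos(j k'x), -sin(j k'x), ...].
    The -sin sign matches the (u, v) parameterization of (18.3.1)."""
    row = [1.0] + list(x)
    P = len(x)
    # quadratic terms x_i * x_j, i <= j (one possible vec* layout for x'Cx/2)
    for i in range(P):
        for j in range(i, P):
            row.append(x[i] * x[j])
    # trigonometric terms for each elementary multi-index k and each j
    for k in multi_indices:
        kx = sum(ki * xi for ki, xi in zip(k, x))
        for j in range(1, J + 1):
            row.append(math.cos(j * kx))
            row.append(-math.sin(j * kx))
    return row

# two linearly independent multi-indices, first nonzero element positive
multi_indices = [(1, 0), (1, -1)]
row = fourier_row((0.5, 1.2), multi_indices, J=2)
```

Stacking such rows over observations gives the regressor matrix for OLS estimation of $\theta_K$; rows for derivatives would use the terms of (18.3.3) and (18.3.4) instead.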

To define a compact notation for partial derivatives, let $\lambda$ be an $N$-dimensional multi-index with no negative elements, and define $|\lambda|^*$ as the sum of the elements of $\lambda$. If we have $N$ arguments $x$ of the (arbitrary) function $h(x)$, use $D^\lambda h(x)$ to indicate a certain partial derivative:

$D^\lambda h(x) \equiv \dfrac{\partial^{|\lambda|^*}}{\partial x_1^{\lambda_1}\partial x_2^{\lambda_2}\cdots\partial x_N^{\lambda_N}}\,h(x).$

When $\lambda$ is the zero vector, $D^\lambda h(x) \equiv h(x)$. Taking this definition and the last few equations into account, we see that it is possible to define $(1 \times K)$ vector functions $z^\lambda(x)$ so that

$D^\lambda g_K(x \mid \theta_K) = z^\lambda(x)'\theta_K.$   (18.3.5)

- Both the approximating model and the derivatives of the approximating model are linear in the parameters.
- For the approximating model to the function (not derivatives), write $g_K(x \mid \theta_K) = z'\theta_K$ for simplicity.

The following theorem can be used to prove the consistency of the Fourier form.

THEOREM 28. [Gallant and Nychka, 1987] Suppose that $\hat h_n$ is obtained by maximizing a sample objective function $s_n(h)$ over $\mathcal{H}_{K_n}$, where $\mathcal{H}_K$ is a subset of some function space $\mathcal{H}$ on which is defined a norm $\|h\|$. Consider the following conditions:
(a) Compactness: The closure of $\mathcal{H}$ with respect to $\|h\|$ is compact in the relative topology defined by $\|h\|$.
(b) Denseness: $\cup_K \mathcal{H}_K$, $K = 1, 2, 3, \ldots$, is a dense subset of the closure of $\mathcal{H}$ with respect to $\|h\|$, and $\mathcal{H}_K \subset \mathcal{H}_{K+1}$.

(c) Uniform convergence: There is a point $h^*$ in $\mathcal{H}$, and there is a function $s_\infty(h, h^*)$ that is continuous in $h$ with respect to $\|h\|$, such that

$\lim_{n\to\infty}\sup_{\mathcal{H}}\left|s_n(h) - s_\infty(h, h^*)\right| = 0$

almost surely.
(d) Identification: Any point $h$ in the closure of $\mathcal{H}$ with $s_\infty(h, h^*) \geq s_\infty(h^*, h^*)$ must have $\|h - h^*\| = 0$.

Under these conditions $\lim_{n\to\infty}\|h^* - \hat h_n\| = 0$, almost surely, provided that $\lim_{n\to\infty}K_n = \infty$ almost surely.

The modification of the original statement of the theorem that has been made is to set the parameter space $\Theta$ in Gallant and Nychka's (1987) Theorem to a single point, and to state the theorem in terms of maximization rather than minimization.

This theorem is very similar in form to Theorem 19. The main differences are:
(1) A generic norm $\|h\|$ is used in place of the Euclidean norm. This norm may be stronger than the Euclidean norm, so that convergence with respect to $\|h\|$ implies convergence w.r.t. the Euclidean norm. Typically we will want to make sure that the norm is strong enough to imply convergence of all functions of interest.
(2) The "estimation space" $\mathcal{H}$ is a function space. It plays the role of the parameter space $\Theta$ in our discussion of parametric estimators. There is no restriction to a parametric family, only a restriction to a space of functions that satisfy certain conditions. This formulation is much less restrictive than the restriction to a parametric family.

(3) There is a denseness assumption that was not present in the other theorem.

We will not prove this theorem (the proof is quite similar to the proof of Theorem 19; see Gallant, 1987), but we will discuss its assumptions, in relation to the Fourier form as the approximating model.

18.3.1. Sobolev norm. Since all of the assumptions involve the norm $\|h\|$, we need to make explicit what norm we wish to use. We need a norm that guarantees that the errors in approximation of the functions we are interested in are accounted for. Since we are interested in first-order elasticities in the present case, we need close approximation of both the function $f(x)$ and its first derivative $f'(x)$, throughout the range of $x$. Let $\mathcal{X}$ be an open set that contains all values of $x$ that we're interested in. The Sobolev norm is appropriate in this case. It is defined, making use of our notation for partial derivatives, as:

$\|h\|_{m,\mathcal{X}} = \max_{|\lambda^*|\leq m}\,\sup_{\mathcal{X}}\left|D^\lambda h(x)\right|.$

To see whether or not the function $f(x)$ is well approximated by an approximating model $g_K(x \mid \theta_K)$, we would evaluate

$\left\|f(x) - g_K(x \mid \theta_K)\right\|_{m,\mathcal{X}}.$

We see that this norm takes into account errors in approximating the function and partial derivatives up to order $m$. If we want to estimate first order elasticities, as is the case in this example, the relevant $m$ would be $m = 1$. Furthermore, since we examine the $\sup$ over $\mathcal{X}$, convergence w.r.t. the Sobolev norm means uniform convergence, so that we obtain consistent estimates of $f(x)$ and $f'(x)$ for all values of $x$ in $\mathcal{X}$.

18.3.2. Compactness. Verifying compactness with respect to this norm is quite technical and unenlightening. It is proven by Elbadawi, Gallant and Souza, Econometrica, 1983. The basic requirement is that if we need consistency w.r.t. $\|h\|_{m,\mathcal{X}}$, then the functions of interest must belong to a Sobolev space which takes into account derivatives of order $m+1$. A Sobolev space is the set of functions

$\mathcal{W}_{m+1,\mathcal{X}}(D) = \left\{h(x) : \|h(x)\|_{m+1,\mathcal{X}} < D\right\},$

where $D$ is a finite constant. In plain words, the functions must have bounded partial derivatives of one order higher than the derivatives we seek to estimate.

18.3.3. The estimation space and the estimation subspace. Since in our case we're interested in consistent estimation of first-order elasticities, we'll define the estimation space as follows:

DEFINITION 29. [Estimation space] The estimation space $\mathcal{H} = \mathcal{W}_{2,\mathcal{X}}(D)$. The estimation space is an open set, and we presume that $h^* \in \mathcal{H}$.

So we are assuming that the function to be estimated has bounded second derivatives throughout $\mathcal{X}$.

With semi-nonparametric estimators, we don't actually optimize over the estimation space. Rather, we optimize over a subspace, $\mathcal{H}_{K_n}$, defined as:

DEFINITION 30. [Estimation subspace] The estimation subspace $\mathcal{H}_K$ is defined as

$\mathcal{H}_K = \left\{g_K(x \mid \theta_K) : g_K(x \mid \theta_K) \in \mathcal{W}_{2,\mathcal{X}}(D),\; \theta_K \in \mathbb{R}^K\right\},$

where $g_K(x \mid \theta_K)$ is the Fourier form approximation as defined in Equation 18.3.1.

18.3.4. Denseness. The important point here is that $\mathcal{H}_K$ is a space of functions that is indexed by a finite dimensional parameter ($\theta_K$ has $K$ elements, as in Equation 18.3.2). With $n$ observations, $n > K$, this parameter is estimable. Note that the true function $h^*$ is not necessarily an element of $\mathcal{H}_K$, so optimization over $\mathcal{H}_K$ may not lead to a consistent estimator. In order for optimization over $\mathcal{H}_K$ to be equivalent to optimization over $\mathcal{H}$, at least asymptotically, we need that:
(1) The dimension of the parameter vector, $\dim\theta_{K_n} \to \infty$ as $n \to \infty$. This is achieved by making $A$ and $J$ in Equation 18.3.1 increasing functions of $n$, the sample size. It is clear that $K$ will have to grow more slowly than $n$.
(2) We need that the $\mathcal{H}_K$ be dense subsets of $\mathcal{H}$.

The estimation subspace $\mathcal{H}_K$, defined above, is a subset of the closure of the estimation space, $\mathcal{H}$. A set of subsets $\mathcal{A}_a$ of a set $\mathcal{A}$ is "dense" if the closure of the countable union of the subsets is equal to the closure of $\mathcal{A}$:

$\overline{\cup_{a=1}^{\infty}\mathcal{A}_a} = \overline{\mathcal{A}}.$

(Use a picture here.) The rest of the discussion of denseness is provided just for completeness: there's no need to study it in detail. To show that $\cup_K \mathcal{H}_K$ is a dense subset of $\mathcal{H}$ with respect to $\|h\|_{1,\mathcal{X}}$, it is useful to apply Theorem 1 of Gallant (1982), who in turn cites Edmunds and Moscatelli (1977). We reproduce the theorem as presented by Gallant, with minor notational changes, for convenience of reference:

THEOREM 31. [Edmunds and Moscatelli, 1977] Let the real-valued function $h^*(x)$ be continuously differentiable up to order $m$ on an open set containing the closure of $\mathcal{X}$.

Then it is possible to choose a triangular array of coefficients $\theta_1, \theta_2, \ldots, \theta_K, \ldots$, such that for every $q$ with $0 \leq q < m$, and every $\epsilon > 0$,

$\left\|h^*(x) - h_K(x \mid \theta_K)\right\|_{q,\mathcal{X}} = o\left(K^{-m+q+\epsilon}\right) \text{ as } K \to \infty.$

In the present application, $q = 1$ and $m = 2$. By definition of the estimation space, the elements of $\mathcal{H}$ are once continuously differentiable on $\mathcal{X}$, which is open and contains the closure of $\mathcal{X}$, so the theorem is applicable. Closely following Gallant and Nychka (1987), $\cup_K \mathcal{H}_K$ is the countable union of the $\mathcal{H}_K$. The implication of Theorem 31 is that there is a sequence of $\{h_K\}$ from $\cup_K \mathcal{H}_K$ such that

$\lim_{K\to\infty}\left\|h^* - h_K\right\|_{1,\mathcal{X}} = 0,$

for all $h^* \in \mathcal{H}$. Therefore $\mathcal{H}$ is a subset of the closure of $\cup_K \mathcal{H}_K$, so $\cup_K \mathcal{H}_K$ is a dense subset of $\mathcal{H}$, with respect to the norm $\|h\|_{1,\mathcal{X}}$.

18.3.5. Uniform convergence. We now turn to the limiting objective function. We estimate by OLS. The sample objective function stated in terms of maximization is

$s_n(\theta_K) = -\frac{1}{n}\sum_{t=1}^{n}\left(y_t - g_K(x_t \mid \theta_K)\right)^2.$

With random sampling, as in the case of Equations 14.1 and 17.2, the limiting objective function is

$s_\infty(g, f) = -\int_{\mathcal{X}}\left(f(x) - g(x)\right)^2 d\mu x - \sigma_\varepsilon^2,$   (18.3.6)

where the true function $f(x)$ takes the place of the generic function $h^*$ in the presentation of the theorem. Both $g(x)$ and $f(x)$ are elements of the closure of $\cup_K \mathcal{H}_K$.

The pointwise convergence of the objective function needs to be strengthened to uniform convergence. We will simply assume that this holds, since the way to verify this depends upon the specific application. We also have continuity of the objective function in $g$, with respect to the norm $\|h\|_{1,\mathcal{X}}$, since

$\lim_{\|g^1 - g^0\|_{1,\mathcal{X}}\to 0}\left[s_\infty\left(g^1, f\right) - s_\infty\left(g^0, f\right)\right] = \lim_{\|g^1 - g^0\|_{1,\mathcal{X}}\to 0}\int_{\mathcal{X}}\left[\left(g^0(x) - f(x)\right)^2 - \left(g^1(x) - f(x)\right)^2\right]d\mu x.$

By the dominated convergence theorem (which applies since the finite bound $D$ used to define $\mathcal{W}_{2,\mathcal{X}}(D)$ is dominated by an integrable function), the limit and the integral can be interchanged, so by inspection, the limit is zero.

18.3.6. Identification. The identification condition requires that for any point $(g, f)$ in $\mathcal{H}\times\mathcal{H}$,

$s_\infty(g, f) \geq s_\infty(f, f) \Rightarrow \|g - f\|_{1,\mathcal{X}} = 0.$

This condition is clearly satisfied given that $g$ and $f$ are once continuously differentiable (by the assumption that defines the estimation space).

18.3.7. Review of concepts. For the example of estimation of first-order elasticities, the relevant concepts are:

- Estimation space $\mathcal{H} = \mathcal{W}_{2,\mathcal{X}}(D)$: the function space in which the true function must lie.
- Consistency norm $\|h\|_{1,\mathcal{X}}$. The closure of $\mathcal{H}$ with respect to this norm is compact.
- Estimation subspace $\mathcal{H}_K$: the subset of $\mathcal{H}$ that is representable by a Fourier form with parameter $\theta_K$. These are dense subsets of $\mathcal{H}$.
- Sample objective function $s_n(\theta_K)$, the negative of the sum of squares. By standard arguments this converges uniformly to the
- Limiting objective function $s_\infty(g, f)$, which is continuous in $g$ and has a global maximum in its first argument, over the closure of the infinite union of the estimation subspaces, at $g = f$.
- As a result of this, first order elasticities

$\dfrac{x_i}{f(x)}\dfrac{\partial f(x)}{\partial x_i}$

are consistently estimated for all $x \in \mathcal{X}$.

18.3.8. Discussion. Consistency requires that the number of parameters used in the expansion increase with the sample size, tending to infinity. If parameters are added at a high rate, the bias tends relatively rapidly to zero. A basic problem is that a high rate of inclusion of additional parameters causes the variance to tend more slowly to zero. The issue of how to choose the rate at which parameters are added, and which to add first, is fairly complex. A problem is that the allowable rates for asymptotic normality to obtain (Andrews 1991; Gallant and Souza, 1991) are very strict.

Supposing we stick to these rates, our approximating model is

$g_K(x \mid \theta_K) = z'\theta_K.$

Define $Z_K$ as the $n \times K$ matrix of regressors obtained by stacking observations. The LS estimator is

$\hat\theta_K = \left(Z_K'Z_K\right)^{+}Z_K'y,$

where $(\cdot)^{+}$ is the Moore-Penrose generalized inverse.
- This is used since $Z_K'Z_K$ may be singular, as would be the case for $K(n)$ large enough when some dummy variables are included.

The prediction, $z'\hat\theta_K$, of the unknown function $f(x)$ is asymptotically normally distributed:

$\sqrt{n}\left(z'\hat\theta_K - f(x)\right)\overset{d}{\to} N\left(0, AV\right),$

where

$AV = \lim_{n\to\infty}\mathcal{E}\left[z'\left(\dfrac{Z_K'Z_K}{n}\right)^{+}z\,\hat\sigma^2\right].$

Formally, this is exactly the same as if we were dealing with a parametric linear model. I emphasize, though, that this is only valid if $K$ grows very slowly as $n$ grows. If we can't stick to acceptable rates, we should probably use some other method of approximating the small sample distribution. Bootstrapping is a possibility. We'll discuss this in the section on simulation.

18.4. Kernel regression estimators

Readings: Bierens, 1987, "Kernel estimators of regression functions," in Advances in Econometrics, Fifth World Congress, V. 1, Truman Bewley, ed., Cambridge.

An alternative method to the semi-nonparametric method is a fully nonparametric method of estimation. Kernel regression estimation is an example (others are splines, nearest neighbor, etc.). We'll consider the Nadaraya-Watson kernel regression estimator in a simple case. Suppose we have an iid sample from the joint density $f(x, y)$, where $x$ is $k$-dimensional. The model is

$y_t = g(x_t) + \varepsilon_t,$

where

$\mathcal{E}\left(\varepsilon_t \mid x_t\right) = 0.$

By definition of the conditional expectation, we have

$g(x) = \int y\,\dfrac{f(x, y)}{h(x)}\,dy = \dfrac{1}{h(x)}\int y\,f(x, y)\,dy,$

where $h(x)$ is the marginal density of $x$:

$h(x) = \int f(x, y)\,dy.$

This suggests that we could estimate $g(x)$ by estimating $h(x)$ and $\int y\,f(x, y)\,dy$.

18.4.1. Estimation of the denominator. A kernel estimator for $h(x)$ has the form

$\hat h(x) = \dfrac{1}{n}\sum_{t=1}^{n}\dfrac{K\left[(x - x_t)/\gamma_n\right]}{\gamma_n^k},$

where $n$ is the sample size and $k$ is the dimension of $x$. The function $K(\cdot)$ (the kernel) is absolutely integrable:

$\int |K(x)|\,dx < \infty,$

and $K(\cdot)$ integrates to $1$:

$\int K(x)\,dx = 1.$

In this respect, $K(\cdot)$ is like a density function, but we do not necessarily restrict $K(\cdot)$ to be nonnegative. The window width parameter, $\gamma_n$, is a sequence of positive numbers that satisfies

$\lim_{n\to\infty}\gamma_n = 0, \qquad \lim_{n\to\infty}n\gamma_n^k = \infty.$

So, the window width must tend to zero, but not too quickly. To show pointwise consistency of $\hat h(x)$ for $h(x)$, first consider the expectation of the estimator (since the estimator is an average of iid terms, we only need to consider the expectation of a representative term):

$\mathcal{E}\left[\hat h(x)\right] = \int \gamma_n^{-k}K\left[(x - z)/\gamma_n\right]h(z)\,dz.$

Change variables as $z^* = (x - z)/\gamma_n$, so $z = x - \gamma_n z^*$ and $\left|dz/dz^{*\prime}\right| = \gamma_n^k$; we obtain

$\mathcal{E}\left[\hat h(x)\right] = \int \gamma_n^{-k}K(z^*)\,h(x - \gamma_n z^*)\,\gamma_n^k\,dz^* = \int K(z^*)\,h(x - \gamma_n z^*)\,dz^*.$

Now, asymptotically,

$\lim_{n\to\infty}\mathcal{E}\left[\hat h(x)\right] = \int K(z^*)\,h(x)\,dz^* = h(x),$

since $\gamma_n \to 0$ and $\int K(z^*)\,dz^* = 1$. (Note: that we can pass the limit through the integral is a result of the dominated convergence theorem. For this to hold we need that $h(\cdot)$ be dominated by an absolutely integrable function.)

Next, considering the variance of $\hat h(x)$, we have, due to the iid assumption,

$n\gamma_n^k V\left[\hat h(x)\right] = n\gamma_n^k\,\dfrac{1}{n^2}\sum_{t=1}^{n}V\left\{\dfrac{K\left[(x - x_t)/\gamma_n\right]}{\gamma_n^k}\right\}.$

By the representative term argument, this is

$n\gamma_n^k V\left[\hat h(x)\right] = \gamma_n^k\,V\left\{\dfrac{K\left[(x - z)/\gamma_n\right]}{\gamma_n^k}\right\}.$

Also, since $V(x) = \mathcal{E}(x^2) - \mathcal{E}(x)^2$, we have

$n\gamma_n^k V\left[\hat h(x)\right] = \int \gamma_n^{-k}K\left[(x - z)/\gamma_n\right]^2 h(z)\,dz - \gamma_n^k\left\{\mathcal{E}\left[\hat h(x)\right]\right\}^2.$

The second term converges to zero:

$\gamma_n^k\left\{\mathcal{E}\left[\hat h(x)\right]\right\}^2 \to 0,$

by the previous result regarding the expectation and the fact that $\gamma_n \to 0$. Using exactly the same change of variables as before, the first term can be shown to converge, so that

$\lim_{n\to\infty}n\gamma_n^k V\left[\hat h(x)\right] = h(x)\int K(z^*)^2\,dz^*.$

Since both $h(x)$ and $\int K(z^*)^2\,dz^*$ are bounded, this is bounded, and since $n\gamma_n^k \to \infty$ by assumption, we have that

$V\left[\hat h(x)\right] \to 0.$

Since the bias and the variance both go to zero, we have pointwise consistency (convergence in quadratic mean implies convergence in probability).
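The bias/variance argument can be illustrated by simulation. In this Python sketch (standard normal data and kernel, $\gamma_n = n^{-1/5}$; all choices arbitrary), the estimate at $x = 0$ tightens around $h(0) = 1/\sqrt{2\pi}$ as $n$ grows:

```python
import math, random

def h_hat(x0, xs, gamma):
    # kernel density estimator with standard normal kernel, k = 1
    c = 1.0 / math.sqrt(2 * math.pi)
    return sum(c * math.exp(-0.5 * ((x0 - xt) / gamma) ** 2) for xt in xs) \
        / (len(xs) * gamma)

random.seed(7)
errs = []
for n in (100, 10000):
    xs = [random.gauss(0, 1) for _ in range(n)]
    gamma = n ** (-1.0 / 5.0)     # gamma_n -> 0 while n * gamma_n -> infinity
    errs.append(abs(h_hat(0.0, xs, gamma) - 1.0 / math.sqrt(2 * math.pi)))
```

The $n^{-1/5}$ rate is the usual textbook choice for $k = 1$; any rate satisfying the two window-width conditions would deliver consistency.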

18.4.2. Estimation of the numerator. To estimate $\int y\,f(x, y)\,dy$, we need an estimator of $f(x, y)$. The estimator has the same form as the estimator for $h(x)$, only with one dimension more:

$\hat f(x, y) = \dfrac{1}{n}\sum_{t=1}^{n}\dfrac{K_*\left[(y - y_t)/\gamma_n,\,(x - x_t)/\gamma_n\right]}{\gamma_n^{k+1}}.$

The kernel $K_*(\cdot)$ is required to have mean zero:

$\int y\,K_*(y, x)\,dy = 0,$

and to marginalize to the previous kernel for $h(x)$:

$\int K_*(y, x)\,dy = K(x).$

With this kernel, we have

$\int y\,\hat f(x, y)\,dy = \dfrac{1}{n}\sum_{t=1}^{n}y_t\,\dfrac{K\left[(x - x_t)/\gamma_n\right]}{\gamma_n^k},$

by marginalization of the kernel, so we obtain

$\hat g(x) = \dfrac{1}{\hat h(x)}\int y\,\hat f(x, y)\,dy = \dfrac{\sum_{t=1}^{n}y_t\,K\left[(x - x_t)/\gamma_n\right]}{\sum_{t=1}^{n}K\left[(x - x_t)/\gamma_n\right]}.$

This is the Nadaraya-Watson kernel regression estimator.
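In code, the estimator is just a weighted average: the $\gamma_n^k$ factors cancel in the ratio, so only the kernel weights are needed. A Python sketch with a standard normal kernel (the DGP and window width are invented for illustration):

```python
import math, random

def nadaraya_watson(x0, xs, ys, gamma):
    """Weighted average of y_t, with weights from a standard normal kernel.
    Normalizing constants and gamma^k cancel between numerator and denominator."""
    w = [math.exp(-0.5 * ((x0 - xt) / gamma) ** 2) for xt in xs]
    return sum(wi * yi for wi, yi in zip(w, ys)) / sum(w)

random.seed(5)
xs = [random.uniform(0, 2 * math.pi) for _ in range(2000)]
ys = [math.sin(x) + random.gauss(0, 0.2) for x in xs]   # illustrative DGP
ghat = nadaraya_watson(math.pi / 2, xs, ys, gamma=0.3)  # true value is sin(pi/2) = 1
```

Shrinking `gamma` trades less smoothing bias for more variance, exactly the tension discussed next.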

18.4.3. Discussion.
- The kernel regression estimator for $g(x_t)$ is a weighted average of the $y_j$, $j = 1, 2, \ldots, n$, where higher weights are associated with points that are closer to $x_t$. The weights sum to 1.
- The window width parameter $\gamma_n$ imposes smoothness. The estimator is increasingly flat as $\gamma_n \to \infty$, since in this case each weight tends to $1/n$.
- A large window width reduces the variance (strong imposition of flatness), but increases the bias.
- A small window width reduces the bias, but makes very little use of information except points that are in a small neighborhood of $x_t$. Since relatively little information is used, the variance is large when the window width is small.
- The standard normal density is a popular choice for $K(\cdot)$ and $K_*(y, x)$, though there are possibly better alternatives.

18.4.4. Choice of the window width: Cross-validation. The selection of an appropriate window width is important. One popular method is cross validation. This consists of splitting the sample into two parts (e.g., 50%-50%). The first part is the "in sample" data, which is used for estimation, and the second part is the "out of sample" data, used for evaluation of the fit through RMSE or some other criterion. The steps are:
(1) Split the data. The out of sample data is $y^{out}$ and $x^{out}$.
(2) Choose a window width $\gamma$.
(3) With the in sample data, fit $\hat y_t^{out}$ corresponding to each $x_t^{out}$. This fitted value is a function of the in sample data, as well as the evaluation point $x_t^{out}$, but it does not involve $y_t^{out}$.

(4) Repeat for all out of sample points.
(5) Calculate RMSE($\gamma$).
(6) Go to step (2), or to the next step if enough window widths have been tried.
(7) Select the $\gamma$ that minimizes RMSE($\gamma$) (verify that a minimum has been found, for example by plotting RMSE as a function of $\gamma$).
(8) Re-estimate using the best $\gamma$ and all of the data.

This same principle can be used to choose $A$ and $J$ in a Fourier form model.

18.5. Kernel density estimation

The previous discussion suggests that a kernel density estimator may easily be constructed. We have already seen how joint densities may be estimated. If we were interested in a conditional density, for example of $y$ conditional on $x$, then the kernel estimate of the conditional density is simply

$\hat f_{y\mid x} = \dfrac{\hat f(x, y)}{\hat h(x)} = \dfrac{\frac{1}{n}\sum_{t=1}^{n}K_*\left[(y - y_t)/\gamma_n,\,(x - x_t)/\gamma_n\right]/\gamma_n^{k+1}}{\frac{1}{n}\sum_{t=1}^{n}K\left[(x - x_t)/\gamma_n\right]/\gamma_n^k},$

where we obtain the expressions for the joint and marginal densities from the section on kernel regression.

18.6. Semi-nonparametric maximum likelihood

Readings: Gallant and Nychka, Econometrica, 1987.

1997. Suppose that the density is a reasonable starting approxi- f Suppose we’re interested in the density of conditional on (both may be . mation to the true density. 12. Journal of Applied Econometrics. Is is possible to obtain the beneﬁts of MLE when we’re not so conﬁdent about the speciﬁcation? In part. V. The normalization factor d one.18. The new density is © t¥ H©T p to impose a normalization: is set to 1. See also Cameron and Johansson. SEMI-NONPARAMETRIC MAXIMUM LIKELIHOOD 392 this link .6. MLE is the estimation method of choice when we are conﬁdent about specifying the density. This density can be reshaped by multiplying it by a squared polynomial. yes. Because © t ¥ © f D ¦T IÂ H ¨¥ ¢ T ¹ © t ¥ H© T and is a normalizing factor to make the density integrate (sum) to is a homogenous function of it is necessary is f h h p h H T © f¥ ¹ T where © t H©s ¥ T © t f¥ T H© ©t f © f H ¥ ¨ H ¨¥ ¢ T ¹ ©t f H ¥ ¨ vectors).

Gallant and Nychka (1987) give conditions under which such a density may be treated as correctly specified, asymptotically. Basically, the order of the polynomial must increase as the sample size increases. However, there are technicalities.

Similarly to Cameron and Johansson (1997), we may develop a negative binomial polynomial (NBP) density for count data. The negative binomial baseline density may be written

$$f_Y(y|\phi) = \frac{\Gamma(y + \psi)}{\Gamma(y + 1)\Gamma(\psi)} \left(\frac{\psi}{\psi + \lambda}\right)^{\psi} \left(\frac{\lambda}{\psi + \lambda}\right)^{y}.$$

The normalization factor is calculated (following Cameron and Johansson) using the raw moments of the baseline density. Expanding the squared polynomial, we get that the normalizing factor is

(18.6.1) $$\eta_p(\phi, \gamma) = \sum_{k=0}^{p} \sum_{l=0}^{p} \gamma_k \gamma_l\, m_{k+l}$$

where $m_r = E(Y^r)$ are the raw moments of the baseline density. Recall that $\gamma_0$ is set to 1 to achieve identification.

To get the normalization factor, we need the moment generating function of the baseline density:

(18.6.2) $$M_Y(t) = \psi^{\psi} \left(\lambda - e^t \lambda + \psi\right)^{-\psi}.$$

To illustrate, here are the first through fourth raw moments of the NB density, calculated using MuPAD, which is a Computer Algebra System that is free for personal use, and then programmed in Ox. These are the moments you would need to use a second order polynomial.

The usual means of incorporating conditioning variables $x$ is the parameterization $\lambda = e^{x'\beta}$. When $\psi = \lambda/\alpha$ we have the negative binomial-I model (NB-I), with $V(Y) = \lambda + \alpha\lambda$. When $\psi = 1/\alpha$ we have the negative binomial-II (NB-II) model, with $V(Y) = \lambda + \alpha\lambda^2$. For both forms, $E(Y) = \lambda$.

In Ox, the first two raw moments are

    if(k_gam >= 1)
    {
        m[][0] = lambda;
        m[][1] = (lambda .* (lambda + psi + lambda .* psi)) ./ psi;
    }

The third and fourth raw moments, needed for the second order polynomial, are

    if(k_gam >= 2)
    {
        m[][2] = (lambda .* (lambda .^ 2 .* (psi .^ 2 + 3 .* psi + 2)
            + 3 .* lambda .* psi .* (1 + psi) + psi .^ 2)) ./ psi .^ 2;
        m[][3] = (lambda .* (lambda .^ 3 .* (psi .^ 3 + 6 .* psi .^ 2 + 11 .* psi + 6)
            + 6 .* lambda .^ 2 .* psi .* (psi .^ 2 + 3 .* psi + 2)
            + 7 .* lambda .* psi .^ 2 .* (1 + psi) + psi .^ 3)) ./ psi .^ 3;
    }

After calculating the raw moments, the normalization factor is calculated using equation 18.6.1, again with the help of MuPAD:

    if(k_gam == 1)
    {
        norm_factor = 1 + gam[0][] .* (2 .* m[][0] + gam[0][] .* m[][1]);
    }
    else if(k_gam == 2)
    {
        norm_factor = 1 + gam[0][] .* (2 .* m[][0] + gam[0][] .* m[][1])
            + gam[1][] .* (2 .* m[][1] + 2 .* gam[0][] .* m[][2]
            + gam[1][] .* m[][3]);
    }
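As a numerical cross-check on these expressions (a Python sketch; the values of λ, ψ and the γ coefficients below are arbitrary test values), the raw moments can be rebuilt from the factorial moments of the negative binomial density, and the reshaped density should then sum to exactly one when divided by the normalization factor of equation 18.6.1:

```python
from math import lgamma, exp, log

def nb_pmf(y, lam, psi):
    """Negative binomial pmf with mean lam; the variance is lam + lam^2/psi."""
    return exp(lgamma(psi + y) - lgamma(y + 1) - lgamma(psi)
               + psi * log(psi / (psi + lam)) + y * log(lam / (psi + lam)))

def nb_raw_moments(lam, psi, order):
    """Raw moments m_r = E(Y^r), r = 0..order, built from the factorial
    moments E[Y(Y-1)...(Y-k+1)] = psi(psi+1)...(psi+k-1)(lam/psi)^k
    combined with Stirling numbers of the second kind."""
    S = [[0.0] * (order + 1) for _ in range(order + 1)]
    S[0][0] = 1.0
    for n in range(1, order + 1):
        for k in range(1, n + 1):
            S[n][k] = k * S[n - 1][k] + S[n - 1][k - 1]
    fact = [1.0]
    for k in range(1, order + 1):
        rising = 1.0
        for j in range(k):
            rising *= psi + j
        fact.append(rising * (lam / psi) ** k)
    return [sum(S[n][k] * fact[k] for k in range(n + 1)) for n in range(order + 1)]

lam, psi, p = 1.7, 2.3, 2          # arbitrary test values
gam = [1.0, 0.2, -0.05]            # gamma_0 normalized to 1
m = nb_raw_moments(lam, psi, 2 * p)
eta = sum(gam[k] * gam[l] * m[k + l] for k in range(p + 1) for l in range(p + 1))
# reshaped density h_p(y)^2 f_Y(y) / eta: should sum to one
total = sum(sum(g * y ** k for k, g in enumerate(gam)) ** 2 * nb_pmf(y, lam, psi)
            for y in range(400)) / eta
```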

This is an example of a model that would be difficult to formulate without the help of a program like MuPAD. For higher order polynomials the analogous formulae are impressively (i.e., several pages) long.

Gallant and Nychka, Econometrica, 1987 prove that this sort of density can approximate a wide variety of densities arbitrarily well as the degree of the polynomial increases with the sample size. This approach is not without its drawbacks: the sample objective function can have an extremely large number of local maxima that can lead to numeric difficulties. If someone could figure out how to do this in a way such that the sample objective function was nice and smooth, they would probably get the paper published in a good journal. Any ideas?

It is possible that there is conditional heterogeneity, such that the appropriate reshaping should be more local. This can be accommodated by allowing the $\gamma_k$ parameters to depend upon the conditioning variables, for example using polynomials.

Here's a plot of the true densities and the limiting SNP approximations (with the order of the polynomial fixed) to four different count data densities, which variously exhibit over and underdispersion, as well as excess zeros. The baseline model is a negative binomial density.

[Figure: Figures/SNP.eps]

18.7. Examples

18.7.1. Fourier form estimation. The first DGP generates data with a nonlinear mean, plus errors (with the mean subtracted out). You need to get the file FFF.ox, which sets up the data matrix for Fourier form estimation. There is no need to specify multi-indices with a univariate regressor (as is the case here, to keep the graphics simple). Then the program fourierform.ox allows you to experiment with different sample sizes and values of $A$. Here are several plots with different $A$, for a fixed sample size.

This first plot shows an underparameterized fit:

[Figure: Nonparametric-I/fff_2.eps]

This next one looks pretty good:

[Figure: Nonparametric-I/fff_4.eps]

Here's an example of an overfitted model: we are starting to chase the error term too much.

[Figure: Nonparametric-I/fff_10.eps]

18.7.2. Kernel regression estimation. You need to get the file KernelLib.ox, which contains the routines for kernel regression and density estimation. The program kernelreg1.ox allows you to experiment with different sample sizes and window widths. We will use the same data generating process as for the above examples of Fourier form models. Here are several plots with different window widths, for a fixed sample size. Note that too small a window width (ww = 0.1) leads to a very irregular fit, while setting the window width too high leads to too flat a fit.

[Figure: Nonparametric-I/undersmoothed.eps]

[Figure: Nonparametric-I/oversmoothed.eps]

18.7.3. Cross validation. The leave-one-out method of cross validation consists of doing an out-of-sample fit to each data point in turn, and calculating the MSE. This is repeated for various window widths, and the minimum MSE window width may then be chosen. The program kernelreg2.ox does this, and allows you to experiment using different sample sizes, kernels, and window widths. The results are:

[Figure: Nonparametric-I/justright.eps]

[Figure: Nonparametric-I/crossvalidated.eps]

[Figure: Nonparametric-I/cvscores.eps]

18.7.4. Kernel density estimation. The second DGP generates $\chi^2$ random variables. The program kerneldens.ox then estimates their density using kernel density estimation.
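The leave-one-out method, and a kernel density fit with the Epanechnikov kernel, can be sketched as follows (Python; the DGPs are hypothetical stand-ins for the ones used by the Ox programs):

```python
import numpy as np

def loo_cv_mse(x, y, h):
    """Leave-one-out cross validation for kernel regression: each point
    is fit from all the others; returns the mean squared error."""
    u = (x[:, None] - x[None, :]) / h
    w = np.exp(-0.5 * u ** 2)
    np.fill_diagonal(w, 0.0)          # leave the own observation out
    yhat = (w @ y) / w.sum(axis=1)
    return np.mean((y - yhat) ** 2)

def epanechnikov_kde(pts, data, h):
    """Kernel density estimate with the Epanechnikov kernel
    K(u) = 0.75 (1 - u^2) on |u| <= 1."""
    u = (pts[:, None] - data[None, :]) / h
    k = np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u ** 2), 0.0)
    return k.sum(axis=1) / (len(data) * h)

rng = np.random.default_rng(3)
x = rng.uniform(0.0, 2 * np.pi, 150)
y = np.sin(x) + 0.25 * rng.standard_normal(150)
widths = [0.05, 0.1, 0.3, 1.0, 4.0]
scores = [loo_cv_mse(x, y, h) for h in widths]
best = widths[int(np.argmin(scores))]

# density of chi-square(3) draws
draws = rng.chisquare(3, 5000)
grid = np.linspace(0.0, 15.0, 301)
dens = epanechnikov_kde(grid, draws, h=0.5)
```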

The following figure shows an Epanechnikov kernel fit using different window widths. To change kernels you need to selectively (un)comment lines in the KernelLib.ox file.

[Figure: Nonparametric-I/kerneldensfit.eps]

18.7.5. Seminonparametric density estimation and MuPAD. Following the lecture notes, an SNP density for count data may be obtained by reshaping a negative binomial density using a squared polynomial:

(18.7.1) $$g_p(y|\phi, \gamma) = \frac{h_p^2(y|\gamma)\, f_Y(y|\phi)}{\eta_p(\phi, \gamma)}$$

(18.7.2) $$h_p(y|\gamma) = \sum_{k=0}^{p} \gamma_k y^k$$

(18.7.3) $$f_Y(y|\phi) = \frac{\Gamma(y + \psi)}{\Gamma(y + 1)\Gamma(\psi)} \left(\frac{\psi}{\psi + \lambda}\right)^{\psi} \left(\frac{\lambda}{\psi + \lambda}\right)^{y}$$

The normalization factor is $\eta_p(\phi, \gamma) = \sum_{k=0}^{p}\sum_{l=0}^{p} \gamma_k \gamma_l\, m_{k+l}$, where the $m_r$ are the raw moments of the baseline density, so to implement this using a polynomial of order $p$ we need the raw moments of the negative binomial density up to order $2p$. I couldn't find the NB moment generating function anywhere, so a solution is to calculate it using a Computer Algebra System (CAS). Rather than using one of the expensive alternatives, we can try out MuPAD, which can be downloaded and is free (in the sense of free beer) for personal use.

It is installed on the Linux machines in the computer room, and if you like you can install the Windows version, too. The file negbinSNP.mpd, if run using the command mupad negbinSNP.mpd, will give you the output that follows:

[MuPAD 2.5.1, The Open Computer Algebra System. Copyright (c) 1997 - 2002 by SciFace Software, All rights reserved. Licensed to: Dr. Michael Creel]

    Negative Binomial SNP Density

    First define the NB density

        gamma(a + y) (a/(a + b))^a (b/(a + b))^y
        ----------------------------------------
                gamma(a) gamma(y + 1)

    Verify that it sums to 1

        1

    Define the MGF

        (a/(a + b))^a / ((a + b - b exp(t))/(a + b))^a

    Print the MGF in TeX format

        "\\frac{\\frac{a}{\\left(a + b\\right)}^a}
         {\\frac{\\left(a + b - b\\,\\exp\\left(t\\right)\\right)}{\\left(a + b\\right)}^a}"

    Find the first moment (which we know is b (lambda))

        b

    Find the fifth moment (which we probably don't know)

        (24 b^5 + 60 a b^4 + a^4 b + 50 a b^5 + 50 a^2 b^3 + 15 a^3 b^2
         + 110 a^2 b^4 + 75 a^3 b^3 + 15 a^4 b^2 + 35 a^2 b^5 + 60 a^3 b^4
         + 25 a^4 b^3 + 10 a^3 b^5 + 10 a^4 b^4 + a^4 b^5) / a^4

    Print the fifth moment in fortran form, to program the log-likelihood

        t3 = a**(-4)*(24.0D0*b**5+60.0D0*a*b**4+a**4*b+50.0D0*a*b**5
           +50.0D0*(a*a)*b**3+15.0D0*a**3*(b*b)+110.0D0*(a*a)*b**4
           +75.0D0*a**3*b**3+15.0D0*a**4*(b*b)+35.0D0*(a*a)*b**5
           +60.0D0*a**3*b**4+25.0D0*a**4*b**3+10.0D0*a**3*b**5
           +10.0D0*a**4*b**4+a**4*b**5)

    Print the fifth moment in TeX form

        "\\frac{24\\,b^5 + 60\\,a\\,b^4 + a^4\\,b + 50\\,a\\,b^5 + 50\\,a^2\\,b^3
          + 15\\,a^3\\,b^2 + 110\\,a^2\\,b^4 + 75\\,a^3\\,b^3 + 15\\,a^4\\,b^2
          + 35\\,a^2\\,b^5 + 60\\,a^3\\,b^4 + 25\\,a^4\\,b^3 + 10\\,a^3\\,b^5
          + 10\\,a^4\\,b^4 + a^4\\,b^5}{a^4}"
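The CAS output can be spot-checked numerically (a Python check, with a = ψ and b = λ as in the MuPAD session): the fifth-moment polynomial should agree with a brute-force sum over the pmf.

```python
from math import lgamma, exp, log

def nb_pmf(y, a, b):
    """The NB density defined at the start of the MuPAD session:
    gamma(a+y)/(gamma(a) gamma(y+1)) (a/(a+b))^a (b/(a+b))^y."""
    return exp(lgamma(a + y) - lgamma(a) - lgamma(y + 1)
               + a * log(a / (a + b)) + y * log(b / (a + b)))

def fifth_moment(a, b):
    """Fifth raw moment as reported by the CAS, collected over a^4."""
    num = (24 * b**5 + 60 * a * b**4 + a**4 * b + 50 * a * b**5
           + 50 * a**2 * b**3 + 15 * a**3 * b**2 + 110 * a**2 * b**4
           + 75 * a**3 * b**3 + 15 * a**4 * b**2 + 35 * a**2 * b**5
           + 60 * a**3 * b**4 + 25 * a**4 * b**3 + 10 * a**3 * b**5
           + 10 * a**4 * b**4 + a**4 * b**5)
    return num / a**4

a, b = 2.0, 1.5                     # arbitrary test values
brute = sum(y**5 * nb_pmf(y, a, b) for y in range(400))
```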

    To get the normalizing factor, we need expressions of the form of the following

        a(0) b(0) m(0) + a(0) b(1) m(1) + b(0) a(1) m(1) + a(0) b(2) m(2) +
        b(0) a(2) m(2) + a(1) b(1) m(2) + a(0) b(3) m(3) + b(0) a(3) m(3) +
        a(1) b(2) m(3) + a(2) b(1) m(3) + a(1) b(3) m(4) + a(2) b(2) m(4) +
        b(1) a(3) m(4) + a(2) b(3) m(5) + a(3) b(2) m(5) + a(3) b(3) m(6)

    >> quit

Once you get expressions for the moments and the double sums, you can use these to program a log-likelihood function in Ox, without too much trouble. The file NegBinSNP.ox implements this. The file EstimateNBSNP.ox will let you estimate NegBinSNP models for the MEPS data. The estimation results for OBDV, using Ox version 3.20 (Linux) (C) J.A. Doornik, 1994-2002, and a NB-I baseline model, are:

    ***********************************************************************
    MEPS data, OBDV
    negbin_snp_obj results
    Strong convergence
    Observations = 500
    Avg. Log Likelihood -2.2426

    [Table of parameter estimates for constant, pub_ins, priv_ins, sex, age,
    educ, inc and ln_alpha, with se(OPG), se(Hess) and se(Sand.) standard
    errors and the corresponding t-statistics t(OPG), t(Hess) and t(Sand.)]

    Information Criteria
    CAIC 2314.7    BIC 2304.7    AIC 2262.7
    ***********************************************************************

Note that the CAIC and BIC are lower for this model than for the ordinary NB-I model.

NOTE: density functions formed in this way may have MANY local maxima, so you need to be careful before accepting the results of a casual run. To guard against having converged to a local maximum, one can try using multiple starting values, or one could try simulated annealing as an optimization method. To do this, copy maxsa.ox and maxsa.h into your working directory, and then use the program EstimateNBSNP2.ox to see how to implement SA estimation of the reshaped negative binomial model.

For more details on the Ox implementation of SA, see Charles Bos' page. Note: in my own experience, using a gradient-based method such as BFGS with many starting values is as successful as SA, and is usually faster. Perhaps I'm not using SA as well as is possible. YMMV.

CHAPTER 19

Simulation-based estimation

Readings: In addition to the book mentioned previously, articles include Gallant and Tauchen (1996), "Which Moments to Match?", Econometric Theory, Vol. 12, 1996, pages 657-681; Gourieroux, Monfort and Renault (1993), "Indirect Inference," J. Apl. Econometrics; Pakes and Pollard (1989), Econometrica; McFadden (1989), Econometrica.

19.1. Motivation

Simulation methods are of interest when the DGP is fully characterized by a parameter vector, but the likelihood function is not calculable. If it were available, we would simply estimate by MLE, which is asymptotically fully efficient.

19.1.1. Example: Multinomial and/or dynamic discrete response models. Let $y_i^*$ be a latent random vector of dimension $m$. Suppose that

(19.1.1) $$y_i^* = X_i \beta + \varepsilon_i$$

where $X_i$ is $m \times K$ and $\varepsilon_i \sim N(0, \Omega)$. Henceforth drop the $i$ subscript when it is not needed for clarity.

$y^*$ is not observed. Rather, we observe a many-to-one mapping $y = \tau(y^*)$. This mapping is such that each element of $y$ is either zero or one (in some cases only one element will be one).

Suppose random sampling of $(y_i, X_i)$. In this case the elements of $y_i$ may not be independent of one another (and clearly are not if $\Omega$ is not diagonal). However, $y_i$ is independent of $y_j$, $i \neq j$.

Let $\theta = (\beta', (vech\,\Omega)')'$ be the vector of parameters of the model. The contribution of the $i$-th observation to the likelihood function is

$$p_i(\theta) = \int_{A_i} n(y_i^* - X_i\beta, \Omega)\, dy_i^*$$

where $A_i = \{y^* : y_i = \tau(y^*)\}$ and

$$n(\varepsilon, \Omega) = (2\pi)^{-m/2} |\Omega|^{-1/2} \exp\left(-\frac{\varepsilon' \Omega^{-1} \varepsilon}{2}\right)$$

is the multivariate normal density of an $m$-dimensional random vector. The log-likelihood function is

$$\ln L(\theta) = \frac{1}{n} \sum_{i=1}^{n} \ln p_i(\theta)$$

and the MLE $\hat\theta$ solves the score equations

$$\frac{1}{n} \sum_{i=1}^{n} \frac{D_\theta\, p_i(\hat\theta)}{p_i(\hat\theta)} \equiv 0.$$

The problem is that evaluation of $p_i(\theta)$ and its derivative w.r.t. $\theta$ by standard methods of numeric integration such as quadrature is computationally infeasible when $m$ (the dimension of $y$) is higher than 3 or 4 (as long as there are no restrictions on $\Omega$).

The mapping $\tau(\cdot)$ has not been made specific so far. This setup is quite general: for different choices of $\tau(\cdot)$ it nests the case of dynamic binary discrete choice models as well as the case of multinomial discrete choice (the choice of one out of a finite set of alternatives).

– Multinomial discrete choice is illustrated by a (very simple) job search model. We have cross sectional data on individuals' matching to a set of $m$ jobs that are available (one of which is unemployment). The utility of alternative $j$ is

$$u_j = X_j \beta + \varepsilon_j,$$

with the utilities of the jobs stacked in the vector $u_i$.

– Dynamic discrete choice is illustrated by repeated choices over time between two alternatives.

For the job search model, the mapping is: we observe the vector formed of elements

$$y_j = 1\left[u_j > u_k,\ \forall k \neq j\right].$$

Only one of these elements is different than zero. For the dynamic model, the mapping is (element-by-element)

$$y_t = 1\left[X_t \beta + \varepsilon_t > 0\right],$$

that is, $y_t = 1$ if the individual chooses the second alternative in period $t$, and zero otherwise.

19.1.2. Example: Marginalization of latent variables. Economic data often presents substantial heterogeneity that may be difficult to model. A possibility is to introduce latent random variables. This can cause the problem that there may be no known closed form for the distribution of observable variables after marginalizing out the unobservable latent variables. For example, count data (that takes values $0, 1, 2, 3, \ldots$) is often modeled using the Poisson distribution

$$\Pr(y = i) = \frac{\exp(-\lambda)\,\lambda^i}{i!}.$$

The mean and variance of the Poisson distribution are both equal to $\lambda$. Often, one parameterizes the conditional mean as

$$\lambda_i = \exp(X_i \beta).$$

This ensures that the mean is positive (as it must be). Estimation by ML is straightforward.

Often, count data exhibits "overdispersion", which simply means that $V(y) > E(y)$. If this is the case, a solution is to use the negative binomial distribution rather than the Poisson. An alternative is to introduce a latent variable that reflects heterogeneity into the specification:

$$\lambda_i = \exp(X_i \beta + \eta_i)$$

where $\eta_i$ has some specified density with support $S$ (this density may depend on additional parameters). In some cases, the marginal density of $y$,

$$\Pr(y = y_i) = \int_S \frac{\exp(-\lambda_i)\,\lambda_i^{y_i}}{y_i!}\, f_\eta(\eta)\, d\eta,$$

will have a closed-form solution (one can derive the negative binomial distribution in this way if $\eta$ has an exponential distribution), but often this will not be possible. In this case, simulation is a means of calculating $\Pr(y = y_i)$, which is then used to do ML estimation. This would be an example of Simulated Maximum Likelihood (SML) estimation. In this case, since there is only one latent variable, quadrature is probably a better choice.
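A numeric sketch of the idea (Python; the lognormal heterogeneity specification is a hypothetical example): average the Poisson probability over draws of η, and compare with brute-force quadrature over η.

```python
import numpy as np
from math import lgamma

def pois_pmf(y, lam):
    """Poisson pmf, vectorized over lam."""
    return np.exp(-lam + y * np.log(lam) - lgamma(y + 1))

def p_sim(y, mu, sd, H, rng):
    """Unbiased simulator of the marginal Pr(y): average the Poisson
    pmf over H draws of eta ~ N(0,1), with lambda = exp(mu + sd*eta)."""
    lam = np.exp(mu + sd * rng.standard_normal(H))
    return pois_pmf(y, lam).mean()

def p_quad(y, mu, sd):
    """The same marginal probability by brute-force quadrature over eta."""
    eta = np.linspace(-8.0, 8.0, 4001)
    w = np.exp(-0.5 * eta ** 2) / np.sqrt(2 * np.pi)
    lam = np.exp(mu + sd * eta)
    return np.sum(pois_pmf(y, lam) * w) * (eta[1] - eta[0])

rng = np.random.default_rng(4)
approx = p_sim(2, 0.5, 0.7, 200_000, rng)
exact = p_quad(2, 0.5, 0.7)
```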

However, a more flexible model with heterogeneity would allow all parameters (not just the constant) to vary. For example, marginalizing over the density of a random coefficient vector $\beta_i$ entails a $K$-dimensional integral, which will not be evaluable by quadrature when $K$ gets large.

19.1.3. Estimation of models specified in terms of stochastic differential equations. It is often convenient to formulate models in terms of continuous time using differential equations. A realistic model should account for exogenous shocks to the system, which can be done by assuming a random component. This leads to a model that is expressed as a system of stochastic differential equations. Consider the process

$$dp_t = g(\theta, p_t)\, dt + h(\theta, p_t)\, dW_t,$$

which is assumed to be stationary. $\{W_t\}$ is a standard Brownian motion (Wiener process), such that

$$W(T) = \int_0^T dW_t.$$

Brownian motion is a continuous-time stochastic process such that

– $W(0) = 0$;
– $[W(s) - W(t)] \sim N(0, s - t)$;
– non-overlapping segments are independent: $[W(s) - W(t)]$ and $[W(j) - W(k)]$ are independent for $s > t > j > k$.

One can think of Brownian motion as the accumulation of independent normally distributed shocks with infinitesimal variance.

The function $g(\theta, p_t)$ is the deterministic part; $h(\theta, p_t)$ determines the variance of the shocks. To estimate a model of this sort, we typically have data that are assumed to be observations of $p_t$ in discrete points $p_1, p_2, \ldots, p_T$: that is, though $p_t$ is a continuous process, it is observed in discrete time.

To perform inference on $\theta$, direct ML or GMM estimation is not usually feasible, because one cannot, in general, deduce the transition density $f(p_t | p_{t-1}, \theta)$. This density is necessary to evaluate the likelihood function or to evaluate moment conditions (which are based upon expectations with respect to this density).

A typical solution is to "discretize" the model, by which we mean to find a discrete time approximation to the model. The discretized version of the model is

$$p_t - p_{t-1} = g(\phi, p_{t-1}) + h(\phi, p_{t-1})\, \varepsilon_t, \qquad \varepsilon_t \sim N(0, 1).$$

The discretization induces a new parameter, $\phi$ (that is, the $\phi^0$ which defines the best approximation of the discretization to the actual (unknown) discrete time version of the model is not equal to $\theta^0$, the true parameter value). This is an approximation, and as such "ML" estimation of $\phi$ (which is actually quasi-maximum likelihood, QML) based upon this equation is in general biased and inconsistent for the original parameter, $\theta$. Nevertheless, the approximation shouldn't be too bad, which will be useful, as we will see.

The important point about these three examples is that computational difficulties prevent direct application of ML, GMM, etc.
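The discretization can be sketched as follows (Python; the Ornstein-Uhlenbeck drift and diffusion are hypothetical choices of g and h, convenient because the stationary moments are known):

```python
import numpy as np

def euler_path(g, h, p0, dt, n, rng):
    """Euler discretization of dp = g(p)dt + h(p)dW:
    p_{t+1} = p_t + g(p_t)*dt + h(p_t)*sqrt(dt)*eps_t, eps_t ~ N(0,1)."""
    p = np.empty(n + 1)
    p[0] = p0
    eps = rng.standard_normal(n)
    sdt = np.sqrt(dt)
    for t in range(n):
        p[t + 1] = p[t] + g(p[t]) * dt + h(p[t]) * sdt * eps[t]
    return p

# Ornstein-Uhlenbeck: g(p) = kappa*(mu - p), h(p) = sigma;
# the stationary distribution is N(mu, sigma^2 / (2*kappa))
kappa, mu, sigma, dt = 0.5, 1.0, 0.2, 0.01
path = euler_path(lambda p: kappa * (mu - p), lambda p: sigma,
                  mu, dt, 200_000, np.random.default_rng(0))
```

With a small step the simulated stationary moments are close to, but not exactly equal to, the continuous-time ones, which is precisely the discretization bias discussed above.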

Nevertheless, in each case the model is fully specified in probabilistic terms up to a parameter vector. This means that the model is simulable, conditional on the parameter vector.

19.2. Simulated maximum likelihood (SML)

For simplicity, consider cross-sectional data. An ML estimator solves

$$\hat\theta_{ML} = \arg\max\, s_n(\theta) = \frac{1}{n} \sum_{t=1}^{n} \ln p(y_t | X_t, \theta)$$

where $p(y_t | X_t, \theta)$ is the density function of the $t$-th observation. When $p(y_t | X_t, \theta)$ does not have a known closed form, $\hat\theta_{ML}$ is an infeasible estimator. However, it may be possible to define a random function $f(\nu, y_t, X_t, \theta)$ such that

$$E_\nu\, f(\nu, y_t, X_t, \theta) = p(y_t | X_t, \theta),$$

where the density of $\nu$ is known. If this is the case, the simulator

$$\tilde p(y_t, X_t, \theta) = \frac{1}{H} \sum_{s=1}^{H} f(\nu_{ts}, y_t, X_t, \theta)$$

is unbiased for $p(y_t | X_t, \theta)$. The SML estimator simply substitutes $\tilde p$ in place of $p$ in the log-likelihood function, that is

$$\hat\theta_{SML} = \arg\max\, s_n(\theta) = \frac{1}{n} \sum_{t=1}^{n} \ln \tilde p(y_t, X_t, \theta).$$

19.2.1. Example: multinomial probit. Recall that the utility of alternative $j$ is

$$u_j = X_j \beta + \varepsilon_j$$

and the vector $y$ is formed of elements

$$y_j = 1\left[u_j > u_k,\ k \in \{1, \ldots, m\},\ k \neq j\right].$$

The problem is that the response probability $\Pr(y_j = 1)$ can't be calculated when $m$ is larger than 4 or 5. However, it is easy to simulate this probability.

– Draw $\tilde\varepsilon_i$ from the distribution $N(0, \Omega)$.
– Calculate $\tilde u_i = X_i \beta + \tilde\varepsilon_i$ (where $X_i$ is the matrix formed by stacking the $X_{ij}$).
– Define $\tilde y_{ij} = 1\left[\tilde u_{ij} > \tilde u_{ik},\ \forall k \neq j\right]$.
– Repeat this $H$ times and define

$$\tilde\pi_{ij} = \frac{\sum_{h=1}^{H} \tilde y_{ijh}}{H}.$$

– Define $\tilde\pi_i$ as the $m$-vector formed of the $\tilde\pi_{ij}$. Each element of $\tilde\pi_i$ is between 0 and 1, and the elements sum to one.
– Now $\tilde\pi_i(\beta, \Omega)$ is a function of $\beta$ and $\Omega$. The SML multinomial probit log-likelihood function is

$$\ln L(\beta, \Omega) = \frac{1}{n} \sum_{i=1}^{n} y_i' \ln \tilde\pi_i(\beta, \Omega).$$

This is to be maximized w.r.t. $\beta$ and $\Omega$.

Notes:
– The $H$ draws of $\tilde\varepsilon_i$ are drawn only once and are used repeatedly during the iterations used to find $\hat\beta$ and $\hat\Omega$. The draws are different for each $i$. If the $\tilde\varepsilon_i$ are re-drawn at every iteration the estimator will not converge.
– The log-likelihood function with this simulator is a discontinuous function of $\beta$ and $\Omega$.

This does not cause problems from a theoretical point of view, since it can be shown that $\ln L(\beta, \Omega)$ is stochastically equicontinuous. However, it does cause problems if one attempts to use a gradient-based optimization method such as Newton-Raphson. Solutions to the discontinuity:

– 1) Use an estimation method that doesn't require a continuous and differentiable objective function, for example, simulated annealing. This is computationally costly.
– 2) Smooth the simulated probabilities so that they are continuous functions of the parameters. For example, apply a kernel transformation such as the logistic smoother

$$\tilde y_{ij} = \left[1 + \exp\left(-A\left(\tilde u_{ij} - \max_{k \neq j} \tilde u_{ik}\right)\right)\right]^{-1}$$

where $A$ is a large positive number. This approximates a step function such that $\tilde y_{ij}$ is very close to zero if $\tilde u_{ij}$ is not the maximum, and close to 1 if it is the maximum. This makes $\tilde y_{ij}$, and therefore $\tilde\pi_{ij}$ and $\ln L(\beta, \Omega)$, continuous and differentiable functions of the parameters. Consistency requires that $A$ tend to infinity with the sample size, so that the approximation to a step function becomes arbitrarily close as the sample size increases.

– The log(0) problem: it may be the case, particularly if few simulations $H$ are used, that some elements of $\tilde\pi_i$ are zero. If the corresponding element of $y_i$ is equal to 1, there will be a problem. One possibility is to search the web for the slog function; also, increase $H$ if this is a serious problem.
– There are alternative methods (e.g., Gibbs sampling) that may work better, but this is too technical to discuss here.
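The frequency simulator and its smoothed version can be sketched as follows (Python; the utilities and covariance matrix are arbitrary illustrative values):

```python
import numpy as np

def freq_sim(V, L, H, rng):
    """Frequency simulator: draw eps ~ N(0, Omega) via the Cholesky
    factor L, form utilities, and count how often each alternative wins."""
    u = V[None, :] + rng.standard_normal((H, len(V))) @ L.T
    return np.bincount(u.argmax(axis=1), minlength=len(V)) / H

def smooth_sim(V, L, H, A, rng):
    """Smoothed simulator: replace the argmax indicator by logistic
    weights exp(A*u_j)/sum_k exp(A*u_k), which approach the indicator
    as the scale A grows, giving a differentiable simulated probability."""
    u = A * (V[None, :] + rng.standard_normal((H, len(V))) @ L.T)
    u -= u.max(axis=1, keepdims=True)        # for numerical stability
    p = np.exp(u)
    return (p / p.sum(axis=1, keepdims=True)).mean(axis=0)

V = np.array([1.0, 0.5, 0.0])                # deterministic utilities X_j beta
Omega = np.array([[1.0, 0.3, 0.0],
                  [0.3, 1.0, 0.2],
                  [0.0, 0.2, 1.0]])
L = np.linalg.cholesky(Omega)
pi_freq = freq_sim(V, L, 20_000, np.random.default_rng(42))
pi_smooth = smooth_sim(V, L, 20_000, 50.0, np.random.default_rng(43))
```

Both simulated probability vectors sum to one, and for a large smoothing scale the two agree closely, but only the smoothed version is differentiable in the parameters.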

19.2.2. Properties. The properties of the SML estimator depend on how $H$ is set. The following is taken from Lee (1995) "Asymptotic Bias in Simulated Maximum Likelihood Estimation of Discrete Choice Models," Econometric Theory, 11, pp. 437-83.

THEOREM 32. [Lee] 1) If $\lim_{n \to \infty} n^{1/2}/H = 0$, then

$$\sqrt{n}\left(\hat\theta_{SML} - \theta^0\right) \overset{d}{\to} N\left(0, \mathcal{I}^{-1}(\theta^0)\right).$$

2) If $\lim_{n \to \infty} n^{1/2}/H = \lambda$, $\lambda$ a finite constant, then

$$\sqrt{n}\left(\hat\theta_{SML} - \theta^0\right) \overset{d}{\to} N\left(B, \mathcal{I}^{-1}(\theta^0)\right)$$

where $B$ is a finite vector of constants.

– This means that the SML estimator is asymptotically biased if $H$ doesn't grow faster than $n^{1/2}$.
– The varcov is the typical inverse of the information matrix, so that as long as $H$ grows fast enough the estimator is consistent and fully asymptotically efficient.

19.3. Method of simulated moments (MSM)

Suppose we have a DGP$(y|x, \theta)$ which is simulable given $\theta$, but is such that the density of $y$ is not calculable. One could, in principle, base a GMM estimator upon the moment conditions

(19.3.1) $$m_t(\theta) = \left[K(y_t, x_t) - k(x_t, \theta)\right] z_t$$

where

$$k(x_t, \theta) = \int K(y_t, x_t)\, p(y | x_t, \theta)\, dy,$$

$z_t$ is a vector of instruments in the information set and $p(y | x_t, \theta)$ is the density of $y$ conditional on $x_t$. The problem is that this density is not available.

– However, $k(x_t, \theta)$ is readily simulated using

$$\tilde k(x_t, \theta) = \frac{1}{H} \sum_{h=1}^{H} K(\tilde y_t^h, x_t).$$

– By the law of large numbers, $\tilde k(x_t, \theta) \overset{a.s.}{\to} k(x_t, \theta)$ as $H \to \infty$, which provides a clear intuitive basis for the estimator, though in fact we obtain consistency even for $H$ finite, since a law of large numbers is also operating across the $n$ observations of real data, so errors introduced by simulation cancel themselves out.
– This allows us to form the moment conditions

(19.3.2) $$\tilde m_t(\theta) = \left[K(y_t, x_t) - \frac{1}{H} \sum_{h=1}^{H} K(\tilde y_t^h, x_t)\right] z_t,$$

where $\tilde y_t^h$ is drawn from the DGP at $\theta$, conditional on the information set. With this we form the GMM criterion and estimate as usual. Note that the unbiased simulator appears linearly within the sums.

19.3.1. Properties. Suppose that the optimal weighting matrix is used. McFadden (ref. above) and Pakes and Pollard (refs. above) show that the asymptotic distribution of the MSM estimator is very similar to that of the infeasible GMM estimator. In particular, assuming that the optimal weighting matrix is used,

(19.3.3) $$\sqrt{n}\left(\hat\theta_{MSM} - \theta^0\right) \overset{d}{\to} N\left[0, \left(1 + \frac{1}{H}\right)\left(D_\infty \Omega^{-1} D_\infty'\right)^{-1}\right],$$

where $\left(D_\infty \Omega^{-1} D_\infty'\right)^{-1}$ is the asymptotic variance of the infeasible GMM estimator.

– That is, the asymptotic variance is inflated by a factor $1 + 1/H$. For this reason the MSM estimator is not fully asymptotically efficient relative to the infeasible GMM estimator, for $H$ finite, but the efficiency loss is small and controllable, by setting $H$ reasonably large.
– The estimator is asymptotically unbiased even for $H = 1$. This is an advantage relative to SML.
– If one doesn't use the optimal weighting matrix, the asymptotic varcov is just the ordinary GMM varcov, inflated by $1 + 1/H$.
– The above presentation is in terms of a specific moment condition based upon the conditional mean. Simulated GMM can be applied to moment conditions of any form.

19.3.2. Comments. Why is SML inconsistent if $H$ is finite, while MSM is consistent? The reason is that SML is based upon an average of logarithms of an unbiased simulator (i.e., the densities of the observations). To use the multinomial probit model as an example, the log-likelihood function is

$$\ln L(\beta, \Omega) = \frac{1}{n} \sum_{i=1}^{n} y_i' \ln \pi_i(\beta, \Omega).$$

The SML version is

$$\ln L(\beta, \Omega) = \frac{1}{n} \sum_{i=1}^{n} y_i' \ln \tilde\pi_i(\beta, \Omega).$$

The problem is that, in spite of the fact that $\tilde\pi_i(\beta, \Omega)$ is unbiased for $\pi_i(\beta, \Omega)$,

$$E \ln\left(\tilde\pi_i(\beta, \Omega)\right) \neq \ln\left(E\, \tilde\pi_i(\beta, \Omega)\right),$$

due to the fact that $\ln(\cdot)$ is a nonlinear transformation. The only way for the two to be equal (in the limit) is if $H$ tends to infinity, so that $\tilde\pi(\cdot)$ tends to $\pi(\cdot)$.

The reason that MSM does not suffer from this problem is that in this case the unbiased simulator appears linearly within every sum of terms, and it appears within a sum over $n$ (see equation 19.3.2). Therefore the SLLN applies to cancel out simulation errors, from which we get consistency. That is, using simple notation for the random sampling case, the moment conditions

(19.3.4) $$\tilde m_n(\theta) = \frac{1}{n} \sum_{t=1}^{n} \left[K(y_t, x_t) - \frac{1}{H} \sum_{h=1}^{H} K(\tilde y_t^h, x_t)\right] z_t$$

converge almost surely to their expectation,

(19.3.5) $$\tilde m_\infty(\theta) = E\left\{\left[k(x, \theta^0) - k(x, \theta)\right] z\right\}$$

(note: $\tilde y_t^h$ is drawn from the DGP at $\theta$, and $z$ is assumed to be made up of functions of $x$). The objective function converges to

$$s_\infty(\theta) = \tilde m_\infty(\theta)' \Omega_\infty^{-1}\, \tilde m_\infty(\theta),$$

which obviously has a minimum at $\theta^0$; hence consistency. If you look at equation 19.3.4 a bit, you will see why the variance inflation factor is $(1 + 1/H)$.
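Two small numeric illustrations of these points (Python; the DGP in the second part is hypothetical): first, averaging logarithms of an unbiased simulator is biased downward; second, an MSM criterion in which the simulator enters linearly recovers the parameter even with small H.

```python
import numpy as np

# (a) E[ln p~] < ln p even though E[p~] = p
rng = np.random.default_rng(0)
p, H = 0.3, 5
ptil = rng.binomial(H, p, size=200_000) / H     # unbiased frequency simulator
ptil = np.clip(ptil, 1e-3, None)                # avoid log(0), cf. the slog fix
bias_level = ptil.mean() - p                    # approximately zero
bias_log = np.log(ptil).mean() - np.log(p)      # strictly negative (Jensen)

# (b) MSM with a simulable DGP whose density is awkward:
# y ~ Poisson(exp(x*beta + eta)), eta ~ N(0, 0.25)
def draw_y(x, beta, H, rng):
    lam = np.exp(x[:, None] * beta + 0.5 * rng.standard_normal((len(x), H)))
    return rng.poisson(lam)

def msm_crit(beta, y, x, H, seed):
    """Moments [y_t - H^-1 sum_h y~_th] z_t with z_t = (1, x_t) and
    identity weighting; draws are held fixed across beta via the seed."""
    e = y - draw_y(x, beta, H, np.random.default_rng(seed)).mean(axis=1)
    m = np.array([e.mean(), (e * x).mean()])
    return m @ m

rng2 = np.random.default_rng(7)
x = rng2.uniform(-1.0, 1.0, 2000)
y = draw_y(x, 0.7, 1, rng2)[:, 0]               # data with true beta = 0.7
grid = np.linspace(0.3, 1.1, 41)
beta_hat = grid[int(np.argmin([msm_crit(b, y, x, 10, seed=11) for b in grid]))]
```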

19.4. Efficient method of moments (EMM)

The choice of which moments upon which to base a GMM estimator can have very pronounced effects upon the efficiency of the estimator.

– A poor choice of moment conditions may lead to very inefficient estimators, and can even cause identification problems (as we've seen with the GMM problem set).
– The drawback of the above approach (MSM) is that the moment conditions used in estimation are selected arbitrarily. The asymptotic efficiency of the estimator may be low.
– The asymptotically optimal choice of moments would be the score vector of the likelihood function. As before, this choice is unavailable.

The efficient method of moments (EMM) (see Gallant and Tauchen (1996), "Which Moments to Match?", Econometric Theory, Vol. 12, 1996, pages 657-681) seeks to provide moment conditions that closely mimic the score vector. If the approximation is very good, the resulting estimator will be very nearly fully efficient.

The DGP is characterized by random sampling from the density

$$p(y_t | x_t, \theta^0) \equiv p_t(\theta^0).$$

We can define an auxiliary model, called the "score generator", which simply provides a (misspecified) parametric density

$$f(y | x_t, \lambda) \equiv f_t(\lambda).$$

– This density is known up to a parameter $\lambda$. We assume that this density function is calculable, so quasi-ML estimation is possible:

$$\hat\lambda = \arg\max\, s_n(\lambda) = \frac{1}{n} \sum_{t=1}^{n} \ln f_t(\lambda).$$

– After determining $\hat\lambda$ we can calculate the score functions $D_\lambda \ln f(y_t | x_t, \hat\lambda)$.
– The important point is that even if the density is misspecified, there is a pseudo-true $\lambda^0$ for which the true expectation of the score, taken with respect to the true but unknown density of $y$, $p(y | x, \theta^0)$, and then marginalized over $x$, is zero:

$$E_x E_{y|x}\left[D_\lambda \ln f(y | x, \lambda^0)\right] = \int_x \int_y D_\lambda \ln f(y | x, \lambda^0)\, p(y | x, \theta^0)\, dy\, d\mu(x) = 0.$$

– We have seen in the section on QML that $\hat\lambda$ converges to $\lambda^0$; this suggests using the moment conditions

(19.4.1) $$m_n(\theta, \hat\lambda) = \frac{1}{n} \sum_{t=1}^{n} \int D_\lambda \ln f_t(\hat\lambda)\, p_t(\theta)\, dy.$$

– These moment conditions are not calculable, since $p_t(\theta)$ is not available, but they are simulable using

$$\tilde m_n(\theta, \hat\lambda) = \frac{1}{n} \sum_{t=1}^{n} \frac{1}{H} \sum_{h=1}^{H} D_\lambda \ln f(\tilde y_t^h | x_t, \hat\lambda),$$

where $\tilde y_t^h$ is a draw from the DGP at $\theta$, holding $\hat\lambda$ fixed. By the LLN and the fact that $\hat\lambda$ converges to $\lambda^0$,

$$\tilde m_\infty(\theta^0, \lambda^0) = 0.$$

This is not the case for other values of $\theta$, assuming that $\theta$ is identified.

– The advantage of this procedure is that if $f(y|x, \lambda)$ closely approximates $p(y|x, \theta)$, then $\tilde m_n(\theta, \hat\lambda)$ will closely approximate the optimal moment conditions which characterize maximum likelihood estimation, which is fully efficient.
– If one has prior information that a certain density approximates the data well, it would be a good choice for $f(\cdot)$.
– If one has no density in mind, there exist good ways of approximating unknown distributions parametrically: Philips' ERA's (Econometrica, 1983) and Gallant and Nychka's (Econometrica, 1987) SNP density estimator which we saw before. Since the SNP density is consistent, the efficiency of the indirect estimator is the same as the infeasible ML estimator.
– Gallant and Tauchen give the theory for the case of $H$ so large that it may be treated as infinite (the difference being irrelevant given the numerical precision of a computer). Here I will present the theory for $H$ finite, and possibly small, because it is sometimes impractical to estimate with $H$ very large. The theory for infinite $H$ follows directly from the results presented here.

19.4.1. Optimal weighting matrix.

The moment condition $\widetilde m_n(\theta,\hat\lambda)$ depends on the pseudo-ML estimate $\hat\lambda$. We can apply Theorem 22 to conclude that

(19.2)  $$\sqrt n \left(\hat\lambda - \lambda^0\right) \stackrel{d}{\to} N\left[0,\ \mathcal J(\lambda^0)^{-1}\mathcal I(\lambda^0)\mathcal J(\lambda^0)^{-1}\right].$$

If the density $f(y_t|x_t,\hat\lambda)$ were in fact the true density $p(y|x_t,\theta)$, then $\hat\lambda$ would be the maximum likelihood estimator, and $\mathcal J(\lambda^0)^{-1}\mathcal I(\lambda^0)$ would be an identity matrix, due to the information matrix equality. However, in the present case we assume that $f(y_t|x_t,\hat\lambda)$ is only an approximation to $p(y|x_t,\theta)$, so there is no cancellation.

Recall that $\mathcal J(\lambda^0) \equiv \operatorname{plim} \frac{\partial^2}{\partial\lambda\,\partial\lambda'} s_n(\lambda^0)$. Comparing the definition of $s_n(\lambda)$ with the definition of the moment condition in Equation 19.1, we see that

$$\mathcal J(\lambda^0) = D_{\lambda'}\, m(\theta^0,\lambda^0).$$

As in Theorem 22,

$$\mathcal I(\lambda^0) = \lim_{n\to\infty} E\left[ n\, \frac{\partial s_n(\lambda)}{\partial\lambda}\bigg|_{\lambda^0} \frac{\partial s_n(\lambda)}{\partial\lambda'}\bigg|_{\lambda^0} \right].$$

In this case, this is simply the asymptotic variance covariance matrix of the moment conditions. Now take a first order Taylor's series approximation to $\sqrt n\, \widetilde m_n(\theta^0,\hat\lambda)$ about $\lambda^0$:

$$\sqrt n\, \widetilde m_n(\theta^0,\hat\lambda) = \sqrt n\, \widetilde m_n(\theta^0,\lambda^0) + \sqrt n\, D_{\lambda'}\widetilde m(\theta^0,\lambda^0)\left(\hat\lambda - \lambda^0\right) + o_p(1).$$

First consider $\sqrt n\, \widetilde m_n(\theta^0,\lambda^0)$. It is straightforward but somewhat tedious to show that the asymptotic variance of this term is $\frac{1}{H}\mathcal I_\infty(\lambda^0)$.

Next consider the second term $\sqrt n\, D_{\lambda'}\widetilde m(\theta^0,\lambda^0)\left(\hat\lambda - \lambda^0\right)$. Note that $D_{\lambda'}\widetilde m_n(\theta^0,\lambda^0) \stackrel{a.s.}{\to} \mathcal J(\lambda^0)$, so we have

$$\sqrt n\, D_{\lambda'}\widetilde m(\theta^0,\lambda^0)\left(\hat\lambda - \lambda^0\right) = \sqrt n\, \mathcal J(\lambda^0)\left(\hat\lambda - \lambda^0\right), \text{ a.s.}$$

But noting Equation 19.2,

$$\sqrt n\, \mathcal J(\lambda^0)\left(\hat\lambda - \lambda^0\right) \stackrel{a}{\sim} N\left[0,\ \mathcal I(\lambda^0)\right].$$

Now, combining the results for the first and second terms, since the two are uncorrelated,

$$\sqrt n\, \widetilde m_n(\theta^0,\hat\lambda) \stackrel{a}{\sim} N\left[0,\ \left(1 + \frac{1}{H}\right)\mathcal I(\lambda^0)\right].$$

Suppose that $\widehat{\mathcal I(\lambda^0)}$ is a consistent estimator of the asymptotic variance-covariance matrix of the moment conditions. This may be complicated if the score generator is a poor approximator, since the individual score contributions may not have mean zero in this case (see the section on QML). Even if this is the case, the individual means can be calculated by simulation, so it is always possible to consistently estimate $\mathcal I(\lambda^0)$ when the model is simulable. On the other hand, if the score generator is taken to be correctly specified, the ordinary estimator of the information matrix is consistent. Combining this with the result on the efficient GMM weighting matrix in Theorem 25, we see that defining $\hat\theta$ as

$$\hat\theta = \arg\min_{\Theta}\ \widetilde m_n(\theta,\hat\lambda)' \left[\left(1 + \frac{1}{H}\right)\widehat{\mathcal I(\lambda^0)}\right]^{-1} \widetilde m_n(\theta,\hat\lambda)$$

is the GMM estimator with the efficient choice of weighting matrix. If one has used the Gallant-Nychka ML estimator as the auxiliary model, the appropriate weighting matrix is simply the information matrix of the auxiliary model, since the scores are uncorrelated.

In that case, the EMM estimator is as efficient as ML estimation asymptotically, since the score generator can approximate the unknown density arbitrarily well.

19.4.2. Asymptotic distribution. Since we use the optimal weighting matrix, the asymptotic distribution is the standard one for efficient GMM (see the chapter on GMM), so we have:

$$\sqrt n \left(\hat\theta - \theta^0\right) \stackrel{d}{\to} N\left[0,\ \left(D_\infty \left[\left(1 + \frac{1}{H}\right)\mathcal I(\lambda^0)\right]^{-1} D_\infty'\right)^{-1}\right],$$

where

$$D_\infty = \lim_{n\to\infty} E\left[D_\theta\, \widetilde m_n'(\theta^0,\lambda^0)\right].$$

This can be consistently estimated using

$$\hat D = D_\theta\, \widetilde m_n'(\hat\theta,\hat\lambda).$$

19.4.3. Diagnostic testing. The fact that

$$\sqrt n\, \widetilde m_n(\theta^0,\hat\lambda) \stackrel{a}{\sim} N\left[0,\ \left(1 + \frac{1}{H}\right)\mathcal I(\lambda^0)\right]$$

implies that

$$n\, \widetilde m_n(\hat\theta,\hat\lambda)' \left[\left(1 + \frac{1}{H}\right)\mathcal I(\hat\lambda)\right]^{-1} \widetilde m_n(\hat\theta,\hat\lambda) \stackrel{a}{\sim} \chi^2(q),$$

where $q = \dim(\lambda) - \dim(\theta)$, since without $\dim(\lambda) > \dim(\theta)$ moment conditions the model is not identified, so testing is impossible. One test of the model is simply based on this statistic: if it exceeds the $\chi^2(q)$ critical point, something may be wrong (the small sample performance of this sort of test would be a topic worth investigating).
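The building block behind this test is the standard fact that a quadratic form in an asymptotically normal vector, weighted by the inverse of its covariance, is $\chi^2$ distributed: if $z \sim N(0,\Sigma)$ with $\Sigma$ a $q \times q$ covariance, then $z'\Sigma^{-1}z \sim \chi^2(q)$. A quick Monte Carlo check of this fact (illustrative Python, not part of the original notes):

```python
import numpy as np

rng = np.random.default_rng(1)
q = 3

# An arbitrary positive definite q x q covariance matrix
A = rng.normal(size=(q, q))
Sigma = A @ A.T + q * np.eye(q)
Sigma_inv = np.linalg.inv(Sigma)
L = np.linalg.cholesky(Sigma)

# Draw z ~ N(0, Sigma) and form the quadratic form z' Sigma^{-1} z
reps = 20000
z = (L @ rng.normal(size=(q, reps))).T          # reps x q
stats = np.einsum('ij,jk,ik->i', z, Sigma_inv, z)

# A chi-square(q) variate has mean q and variance 2q
print(stats.mean())   # near 3
print(stats.var())    # near 6
```

The diagnostic statistic above has exactly this form, with the simulated moments playing the role of $z$ and the estimated $(1+1/H)\mathcal I(\hat\lambda)$ playing the role of $\Sigma$.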

Information about what is wrong can be gotten from the pseudo-t statistics:

$$\left\{ \operatorname{diag}\left[\left(1+\frac{1}{H}\right)\mathcal I(\hat\lambda)\right]\right\}^{-1/2} \sqrt n\, \widetilde m_n(\hat\theta,\hat\lambda),$$

which can be used to test which moments are not well modeled. Since these moments are related to parameters of the score generator, which are usually related to certain features of the model, this information can be used to revise the model. These aren't actually distributed as $N(0,1)$, since $\sqrt n\, \widetilde m_n(\theta^0,\hat\lambda)$ and $\sqrt n\, \widetilde m_n(\hat\theta,\hat\lambda)$ have different distributions (that of $\sqrt n\, \widetilde m_n(\hat\theta,\hat\lambda)$ is somewhat more complicated). It can be shown that the pseudo-t statistics are biased toward nonrejection. See Gourieroux et. al. or Gallant and Long, 1995, for more details.

19.5. Example: estimation of stochastic differential equations

It is often convenient to formulate theoretical models in terms of differential equations, and when the observation frequency is high (e.g. weekly, daily, hourly or real-time) it may be more natural to adopt this framework for econometric models of time series.

The most common approach to estimation of stochastic differential equations is to "discretize" the model, as above, and estimate using the discretized version. However, since the discretization is only an approximation to the true discrete-time version of the model (which is not calculable), the resulting estimator is in general biased and inconsistent.

An alternative is to use indirect inference: the discretized model is used as the score generator. That is, one estimates by QML to obtain the scores of the discretized approximation:

$$y_t - y_{t-1} = g(\phi, y_{t-1}) + h(\phi, y_{t-1})\varepsilon_t, \quad \varepsilon_t \sim N(0,1).$$

Indicate these scores by $m_n(\hat\phi)$. Then the system of stochastic differential equations

$$dy_t = g(\theta, y_t)\,dt + h(\theta, y_t)\,dW_t$$

is simulated over $\theta$, and the scores are calculated and averaged over the simulations:

$$\widetilde m_N(\theta, \hat\phi) = \frac{1}{N}\sum_{i=1}^{N} m_i(\theta, \hat\phi).$$

$\hat\theta$ is chosen to set the simulated scores to zero:

$$\widetilde m_N(\hat\theta, \hat\phi) \equiv 0$$

(since $\phi$ and $\theta$ are of the same dimension).

This method requires simulating the stochastic differential equation. There are many ways of doing this. Basically, they involve doing very fine discretizations:

$$y_{t+\tau} = y_t + g(\theta, y_t)\,\tau + h(\theta, y_t)\,\tau^{1/2}\nu_t, \quad \nu_t \sim N(0,1).$$

By setting $\tau$ very small, the discretized process approximates a Brownian motion fairly well.
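The fine discretization can be sketched as follows, for a hypothetical mean-reverting model $dy = -\kappa(y-\mu)\,dt + \sigma\,dW$ (an Ornstein-Uhlenbeck process; the model and parameter values are illustrative assumptions, and Python is used rather than Octave):

```python
import numpy as np

def simulate_sde(theta, y0=0.0, T=1000, tau=0.01, seed=0):
    """Euler approximation to dy = -kappa*(y - mu)*dt + sigma*dW.
    A very small step tau approximates the Brownian motion well; only
    every (1/tau)-th point is kept as an 'observation'."""
    kappa, mu, sigma = theta
    rng = np.random.default_rng(seed)
    steps_per_obs = int(round(1.0 / tau))
    y = y0
    obs = np.empty(T)
    for t in range(T):
        for _ in range(steps_per_obs):
            y += -kappa * (y - mu) * tau + sigma * np.sqrt(tau) * rng.normal()
        obs[t] = y
    return obs

y = simulate_sde((0.5, 1.0, 0.2))
print(y.mean())   # near the long-run mean mu = 1.0
print(y.std())    # near sigma / sqrt(2*kappa) = 0.2
```

In an indirect inference loop, a path like this would be simulated for each trial value of $\theta$, and the discretized model's scores would be evaluated on the simulated observations.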

This is only one method of using indirect inference for estimation of differential equations. There are others (see Gallant and Long, 1995 and Gourieroux et. al., 1995). Use of a series approximation to the transitional density as in Gallant and Long is an interesting possibility, since the score generator may have a higher dimensional parameter than the model, which allows for diagnostic testing. In the method described above the score generator's parameter $\phi$ is of the same dimension as $\theta$, so diagnostic testing is not possible.

CHAPTER 20

Parallel programming for econometrics

In this chapter we'll see how commonly used computations in econometrics can be done in parallel on a cluster of computers.
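As a flavor of the idea, many econometric computations (Monte Carlo studies, bootstrapping, cross-validation) are embarrassingly parallel: independent replications can be farmed out to workers and combined at the end. A generic sketch using Python's multiprocessing on a single multicore machine (the chapter's actual cluster tools may differ; this is only an assumption-laden illustration):

```python
from multiprocessing import Pool
import random

def one_replication(seed):
    """One independent Monte Carlo replication: the mean of 100
    uniform(0,1) draws, using its own seeded generator."""
    rng = random.Random(seed)
    return sum(rng.random() for _ in range(100)) / 100.0

if __name__ == "__main__":
    # Farm the replications out to 4 worker processes, then combine
    with Pool(4) as pool:
        results = pool.map(one_replication, range(1000))
    estimate = sum(results) / len(results)
    print(estimate)   # near 0.5, the mean of a uniform(0,1) variate
```

Giving each replication its own seed keeps the draws reproducible and independent across workers, which is the main bookkeeping issue in parallel simulation.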

CHAPTER 21

Introduction to Octave

Why is Octave being used here, since it's not that well-known by econometricians? Well, because it is a high quality environment that is easily extensible, uses well-tested and high performance numerical libraries, it is licensed under the GNU GPL, so you can get it for free and modify it if you like, and it runs on GNU/Linux, Mac OSX and Windows systems. It's also quite easy to learn.

21.1. Getting started

Get the bootable CD, as was described in Section 1.3. Then burn the image, and boot your computer with it. This will give you this same PDF file, but with all of the example programs ready to run. The editor is configured with a macro to execute the programs using Octave, which is of course installed. From this point, I assume you are running the CD (or sitting in the computer room across the hall from my office), or that you have configured your computer to be able to run the *.m files mentioned below.

21.2. A short introduction

The objective of this introduction is to learn just the basics of Octave. There are other ways to use Octave, which I encourage you to explore. These are just some rudiments. After this, you can look at the example programs scattered throughout the document (and edit them, and run them) to learn more about how Octave can be used to do econometrics. Students of mine: your problem sets will include exercises that can be done by modifying the example programs in relatively minor ways. So study the examples!

Octave can be used interactively, or it can be used to run programs that are written using a text editor. We'll use this second method, preparing programs with NEdit, and calling Octave from within the editor. The program first.m gets us started. To run this, open it up with NEdit (by finding the correct file inside the /home/knoppix/Desktop/Econometrics folder and clicking on the icon) and then type CTRL-ALT-o, or use the Octave item in the Shell menu (see Figure 21.2.1).

[Figure 21.2.1: Running an Octave program]

Note that the output is not formatted in a pleasing way. That's because printf() doesn't automatically start a new line. Edit first.m so that the 8th line reads "printf("hello world\n");" and re-run the program.

We need to know how to load and save data. The program second.m shows how. Once you have run this, you will find the file "x" in the directory Econometrics/Include/OctaveIntro/. You might have a look at it with NEdit to see Octave's default format for saving data. Basically, if you have data in an ASCII text file, named for example "myfile.data", formed of numbers separated by spaces, just use the command "load myfile.data". After having done so, the matrix "myfile" (without extension) will contain the data.

Now that we're done with the basics, have a look at the Octave programs that are included as examples. Please have a look at CommonOperations.m for examples of how to do some basic things in Octave. If you are looking at the browsable PDF version of this document, then you should be able to click on links to open them. If not, the example programs are available here and the support files needed to run these are available here. Those pages will allow you to examine individual files, out of context. To actually use these files (edit and run them), you should go to the home page of this document, since you will probably want to download the pdf version together with all the support files and examples. Or get the bootable CD.

There are some other resources for doing econometrics with Octave. You might like to check the article Econometrics with Octave and the Econometrics Toolbox, which is for Matlab, but much of which could be easily used with Octave.

21.3. If you're running a Linux installation...

Then to get the same behavior as found on the CD, you need to:

Get the collection of support programs and the examples, from the document home page. Put them somewhere, and tell Octave how to find them, e.g., by putting a link to the MyOctaveFiles directory in /usr/local/share/octave/site-m

Make sure nedit is installed and configured to run Octave and use syntax highlighting. Copy the file /home/econometrics/.nedit from the CD to do this. Or, get the file NeditConfiguration and save it in your $HOME directory with the name ".nedit". Not to put too fine a point on it, please note that there is a period in that name.

Associate *.m files with NEdit so that they open up in the editor when you click on them.

That should do it.

CHAPTER 22

Notation and Review

All vectors will be column vectors, unless they have a transpose symbol (or I forget to apply this rule - your help catching typos and errors is much appreciated). For example, if $x_t$ is a $p \times 1$ vector, $x_t'$ is a $1 \times p$ vector. When I refer to a $p$-vector, I mean a column vector.

22.1. Notation for differentiation of vectors and matrices

[3, Chapter 1]

Let $s(\cdot): \Re^p \to \Re$ be a real valued function of the $p$-vector $\theta$. Then $\frac{\partial s(\theta)}{\partial\theta}$ is organized as a $p$-vector,

$$\frac{\partial s(\theta)}{\partial\theta} = \begin{bmatrix} \frac{\partial s(\theta)}{\partial\theta_1} \\ \frac{\partial s(\theta)}{\partial\theta_2} \\ \vdots \\ \frac{\partial s(\theta)}{\partial\theta_p} \end{bmatrix}.$$

Following this convention, $\frac{\partial s(\theta)}{\partial\theta'}$ is a $1 \times p$ vector, and $\frac{\partial^2 s(\theta)}{\partial\theta\,\partial\theta'}$ is a $p \times p$ matrix. Also,

$$\frac{\partial^2 s(\theta)}{\partial\theta\,\partial\theta'} = \frac{\partial}{\partial\theta}\left(\frac{\partial s(\theta)}{\partial\theta'}\right) = \frac{\partial}{\partial\theta'}\left(\frac{\partial s(\theta)}{\partial\theta}\right).$$

Exercise 33. For $a$ and $x$ both $p$-vectors, show that $\frac{\partial a'x}{\partial x} = a$.

Let $f(\theta): \Re^p \to \Re^n$ be an $n$-vector valued function of the $p$-vector $\theta$. Let $f(\theta)'$ be the $1 \times n$ valued transpose of $f$. Then

$$\left(\frac{\partial}{\partial\theta} f(\theta)'\right)' = \frac{\partial}{\partial\theta'} f(\theta).$$
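Rules like Exercise 33's are easy to verify numerically with finite differences; the sketch below checks it, along with the quadratic and exponential rules that appear shortly as Exercises 34 and 35. It is written in Python for illustration (Exercise 5 at the end of this chapter asks for the same checks in Octave via numgradient):

```python
import numpy as np

def num_grad(f, x, h=1e-6):
    """Forward-difference approximation to the gradient of scalar f at x."""
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e) - f(x)) / h
    return g

rng = np.random.default_rng(0)
p = 4
a = rng.normal(size=p)
x = rng.normal(size=p)
A = rng.normal(size=(p, p))
beta = rng.normal(size=p)

# d(a'x)/dx = a
print(np.max(np.abs(num_grad(lambda v: a @ v, x) - a)))

# d(x'Ax)/dx = (A + A')x
print(np.max(np.abs(num_grad(lambda v: v @ A @ v, x) - (A + A.T) @ x)))

# d exp(x'beta)/dbeta = exp(x'beta) x
print(np.max(np.abs(num_grad(lambda b: np.exp(x @ b), beta)
                    - np.exp(x @ beta) * x)))
```

All three printed discrepancies are tiny, limited only by the finite-difference step.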

Product rule: Let $f(\theta): \Re^p \to \Re^n$ and $h(\theta): \Re^p \to \Re^n$ be $n$-vector valued functions of the $p$-vector $\theta$. Then

$$\frac{\partial}{\partial\theta'} h(\theta)' f(\theta) = h' \frac{\partial f}{\partial\theta'} + f' \frac{\partial h}{\partial\theta'}$$

has dimension $1 \times p$. Applying the transposition rule we get

$$\frac{\partial}{\partial\theta} h(\theta)' f(\theta) = \frac{\partial f'}{\partial\theta}\, h + \frac{\partial h'}{\partial\theta}\, f,$$

which has dimension $p \times 1$.

Exercise 34. For $A$ a $p \times p$ matrix and $x$ a $p \times 1$ vector, show that $\frac{\partial x'Ax}{\partial x} = (A + A')x$.

Chain rule: Let $f(\cdot): \Re^p \to \Re^n$ be an $n$-vector valued function of a $p$-vector argument, and let $g(): \Re^r \to \Re^p$ be a $p$-vector valued function of an $r$-vector valued argument $\rho$. Then

$$\frac{\partial}{\partial\rho'} f\left[g(\rho)\right] = \frac{\partial f}{\partial\theta'}\bigg|_{\theta = g(\rho)} \frac{\partial g}{\partial\rho'}$$

has dimension $n \times r$.

Exercise 35. For $x$ and $\beta$ both $p \times 1$ vectors, show that $\frac{\partial \exp(x'\beta)}{\partial\beta} = \exp(x'\beta)\, x$.

22.2. Convergence modes

Readings: [1, Chapter 4]; [4, Chapter 4].

We will consider several modes of convergence. The first three modes discussed are simply for background. The stochastic modes are those which will be used later in the course.

Definition 36. A sequence is a mapping from the natural numbers $\{1, 2, \ldots\}$ to some other set, so that the set is ordered according to the natural numbers associated with its elements.

Real-valued sequences:

Definition 37. [Convergence] A real-valued sequence of vectors $\{a_n\}$ converges to the vector $a$ if for any $\varepsilon > 0$ there exists an integer $N_\varepsilon$ such that for all $n > N_\varepsilon$, $\|a_n - a\| < \varepsilon$. $a$ is the limit of $a_n$, written $a_n \to a$.

Deterministic real-valued functions. Consider a sequence of functions $\{f_n(\omega)\}$ where $f_n: \Omega \to T \subseteq \Re$. $\Omega$ may be an arbitrary set.

Definition 38. [Pointwise convergence] A sequence of functions $\{f_n(\omega)\}$ converges pointwise on $\Omega$ to the function $f(\omega)$ if for all $\varepsilon > 0$ and $\omega \in \Omega$ there exists an integer $N_{\varepsilon\omega}$ such that $|f_n(\omega) - f(\omega)| < \varepsilon$, $\forall n > N_{\varepsilon\omega}$.

It's important to note that $N_{\varepsilon\omega}$ depends upon $\omega$, so that convergence may be much more rapid for certain $\omega$ than for others. Uniform convergence requires a similar rate of convergence throughout $\Omega$.

Definition 39. [Uniform convergence] A sequence of functions $\{f_n(\omega)\}$ converges uniformly on $\Omega$ to the function $f(\omega)$ if for any $\varepsilon > 0$ there exists an integer $N$ such that $\sup_{\omega \in \Omega} |f_n(\omega) - f(\omega)| < \varepsilon$, $\forall n > N$.

(Insert a diagram here showing the envelope around $f(\omega)$ in which $f_n(\omega)$ must lie.)

Stochastic sequences. In econometrics, we typically deal with stochastic sequences. Given a probability space $(\Omega, \mathcal F, P)$, recall that a random variable maps the sample space to the real line. A sequence of random variables $\{X_n(\omega)\}$ is a collection of such mappings, i.e., each $X_n(\omega)$ is a random variable with respect to the probability space $(\Omega, \mathcal F, P)$. For example, given the model $y = X\beta^0 + \varepsilon$, the OLS estimator $\hat\beta_n = (X'X)^{-1}X'y$, where $n$ is the sample size, can be used to form a sequence of random vectors $\{\hat\beta_n\}$. A number of modes of convergence are in use when dealing with sequences of random variables. Several such modes of convergence should already be familiar:

Definition 40. [Convergence in probability] Let $X_n(\omega)$ be a sequence of random variables, and let $X(\omega)$ be a random variable. Let $A_n = \{\omega : |X_n(\omega) - X(\omega)| > \varepsilon\}$. Then $X_n(\omega)$ converges in probability to $X(\omega)$ if $\lim_{n\to\infty} P(A_n) = 0$, $\forall \varepsilon > 0$. Convergence in probability is written as $X_n \stackrel{p}{\to} X$, or plim $X_n = X$.

Definition 41. [Almost sure convergence] Let $X_n(\omega)$ be a sequence of random variables, and let $X(\omega)$ be a random variable. Then $X_n(\omega)$ converges almost surely to $X(\omega)$ if

$$P\left(\lim_{n\to\infty} X_n(\omega) = X(\omega)\right) = 1.$$

In other words, $X_n(\omega) \to X(\omega)$ (ordinary convergence of the two functions) except on a set $C \subset \Omega$ such that $P(C) = 0$. Almost sure convergence is written as $X_n \stackrel{a.s.}{\to} X$, or $X_n \to X$, a.s.
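A small simulation makes convergence in probability concrete: for the sample mean of iid draws, the probability of a deviation larger than a fixed $\varepsilon$ shrinks toward zero as $n$ grows. A sketch (Python, illustrative; the choice of $\varepsilon$ and sample sizes is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)
mu, eps, reps = 0.0, 0.1, 2000

def deviation_prob(n):
    """Monte Carlo estimate of P(|x_bar_n - mu| > eps) for the mean
    of n iid N(mu, 1) draws."""
    means = rng.normal(mu, 1.0, size=(reps, n)).mean(axis=1)
    return np.mean(np.abs(means - mu) > eps)

for n in (10, 100, 1000):
    print(n, deviation_prob(n))   # the probability shrinks as n grows
```

Any fixed $\varepsilon$ would do: the defining property is that the deviation probability goes to zero for every $\varepsilon > 0$.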

Definition 42. [Convergence in distribution] Let the r.v. $X_n$ have distribution function $F_n$ and the r.v. $X$ have distribution function $F$. If $F_n \to F$ at every continuity point of $F$, then $X_n$ converges in distribution to $X$.

Convergence in distribution is written as $X_n \stackrel{d}{\to} X$. It can be shown that convergence in probability implies convergence in distribution.

Stochastic functions. Simple laws of large numbers (LLN's) allow us to directly conclude that $\hat\beta_n \stackrel{a.s.}{\to} \beta^0$ in the OLS example, since

$$\hat\beta_n = \beta^0 + \left(\frac{X'X}{n}\right)^{-1} \left(\frac{X'\varepsilon}{n}\right),$$

and $\frac{X'\varepsilon}{n} \stackrel{a.s.}{\to} 0$ by a SLLN. Note that this term is not a function of the parameter $\beta$. This easy proof is a result of the linearity of the model, which allows us to express the estimator in a way that separates parameters from random functions. In general, this is not possible. We often deal with the more complicated situation where the stochastic sequence depends on parameters in a manner that is not reducible to a simple sequence of random variables. In this case, we have a sequence of random functions that depend on $\theta$: $\{X_n(\omega, \theta)\}$, where each $X_n(\omega, \theta)$ is a random variable with respect to a probability space $(\Omega, \mathcal F, P)$ and the parameter $\theta$ belongs to a parameter space $\theta \in \Theta$.
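Returning to the OLS example above, the decomposition can be checked directly by simulation: the random term $(X'X)^{-1}X'\varepsilon$ shrinks as $n$ grows, so $\hat\beta_n$ piles up near $\beta^0$. A sketch (Python, with made-up parameter values):

```python
import numpy as np

rng = np.random.default_rng(7)
beta0 = np.array([1.0, -2.0])   # made-up true coefficients

def ols_error(n):
    """||beta_hat - beta0|| for one simulated sample of size n."""
    X = np.column_stack([np.ones(n), rng.normal(size=n)])
    eps = rng.normal(size=n)
    y = X @ beta0 + eps
    # beta_hat = beta0 + (X'X)^{-1} X'eps, computed the stable way
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    return np.linalg.norm(beta_hat - beta0)

print(ols_error(100))     # moderate
print(ols_error(10000))   # much smaller: the random term has died out
```

Running this for increasing $n$ shows the error shrinking roughly like $n^{-1/2}$, which anticipates the rate results of the next section.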

Definition 43. [Uniform almost sure convergence] $\{X_n(\omega, \theta)\}$ converges uniformly almost surely in $\Theta$ to $X(\omega, \theta)$ if

$$\lim_{n\to\infty} \sup_{\theta \in \Theta} \left| X_n(\omega, \theta) - X(\omega, \theta) \right| = 0 \ \text{(a.s.)}$$

Implicit is the assumption that all $X_n(\omega, \theta)$ and $X(\omega, \theta)$ are random variables w.r.t. $(\Omega, \mathcal F, P)$ for all $\theta \in \Theta$. We'll indicate uniform almost sure convergence by $\stackrel{u.a.s.}{\to}$ and uniform convergence in probability by $\stackrel{u.p.}{\to}$.

An equivalent definition, based on the fact that "almost sure" means "with probability one", is

$$\Pr\left(\lim_{n\to\infty} \sup_{\theta \in \Theta} \left| X_n(\omega, \theta) - X(\omega, \theta) \right| = 0\right) = 1.$$

This has a form similar to that of the definition of a.s. convergence; the essential difference is the addition of the sup.

22.3. Rates of convergence and asymptotic equality

It's often useful to have notation for the relative magnitudes of quantities. Quantities that are small relative to others can often be ignored, which simplifies analysis.

Definition 44. [Little-o] Let $f(n)$ and $g(n)$ be two real-valued functions. The notation $f(n) = o(g(n))$ means $\lim_{n\to\infty} \frac{f(n)}{g(n)} = 0$.

Definition 45. [Big-O] Let $f(n)$ and $g(n)$ be two real-valued functions. The notation $f(n) = O(g(n))$ means there exists some $N$ such that for $n > N$, $\left| \frac{f(n)}{g(n)} \right| < K$, where $K$ is a finite constant.

This definition doesn't require that $\frac{f(n)}{g(n)}$ have a limit (it may fluctuate boundedly).

If $f(n)$ and $g(n)$ are sequences of random variables, analogous definitions are:

Definition 46. The notation $f(n) = o_p(g(n))$ means $\frac{f(n)}{g(n)} \stackrel{p}{\to} 0$.

Example 47. The least squares estimator

$$\hat\theta = (X'X)^{-1}X'Y = (X'X)^{-1}X'\left(X\theta^0 + \varepsilon\right) = \theta^0 + (X'X)^{-1}X'\varepsilon.$$

Since plim $(X'X)^{-1}X'\varepsilon = 0$, we can write $(X'X)^{-1}X'\varepsilon = o_p(1)$ and $\hat\theta = \theta^0 + o_p(1)$. Asymptotically, the term $o_p(1)$ is negligible. This is just a way of indicating that the LS estimator is consistent.

Definition 48. The notation $f(n) = O_p(g(n))$ means there exists some $N_\varepsilon$ such that for $\varepsilon > 0$ and all $n > N_\varepsilon$,

$$P\left( \left| \frac{f(n)}{g(n)} \right| < K_\varepsilon \right) > 1 - \varepsilon,$$

where $K_\varepsilon$ is a finite constant.

Example 49. Consider a random sample of iid r.v.'s with mean 0 and variance $\sigma^2$. The estimator of the mean $\hat\theta = \frac{1}{n}\sum_{i=1}^{n} x_i$ is asymptotically normally distributed, e.g., $n^{1/2}\hat\theta \stackrel{a}{\sim} N(0, \sigma^2)$. So $n^{1/2}\hat\theta = O_p(1)$, so $\hat\theta = O_p(n^{-1/2})$. Before we had $\hat\theta = o_p(1)$; now we have the stronger result that relates the rate of convergence to the sample size.

Useful rules:

Example 50. $O_p(n^p)\, O_p(n^q) = O_p(n^{p+q})$; $o_p(n^p)\, o_p(n^q) = o_p(n^{p+q})$.
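Example 49's rate is easy to see in simulation: the spread of $\hat\theta$ across replications halves when $n$ quadruples, while the spread of $n^{1/2}\hat\theta$ is stable. A sketch (Python, illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
reps = 2000

def spread_of_mean(n):
    """Standard deviation, across replications, of the mean of n
    iid N(0,1) draws."""
    return rng.normal(size=(reps, n)).mean(axis=1).std()

for n in (100, 400, 1600):
    s = spread_of_mean(n)
    # s halves as n quadruples, while sqrt(n)*s stays near sigma = 1
    print(n, s, np.sqrt(n) * s)
```

This is exactly what $\hat\theta = O_p(n^{-1/2})$ asserts: rescaling by $n^{1/2}$ stabilizes the estimator's dispersion.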

Example 51. Now consider a random sample of iid r.v.'s with mean $\mu$ and variance $\sigma^2$. The estimator of the mean $\hat\theta = \frac{1}{n}\sum_{i=1}^{n} x_i$ is asymptotically normally distributed, e.g., $n^{1/2}\left(\hat\theta - \mu\right) \stackrel{a}{\sim} N(0, \sigma^2)$. So $n^{1/2}\left(\hat\theta - \mu\right) = O_p(1)$, so $\hat\theta - \mu = O_p(n^{-1/2})$, so $\hat\theta = O_p(1)$.

These two examples show that averages of centered (mean zero) quantities typically have plim 0, while averages of uncentered quantities have finite nonzero plims. Note that the definition of $O_p$ does not mean that $f$ and $g$ are of the same order. Asymptotic equality ensures that this is the case.

Definition 52. [Asymptotic equality] Two sequences of random variables $\{f_n\}$ and $\{g_n\}$ are asymptotically equal (written $f_n \stackrel{a}{=} g_n$) if

$$\operatorname{plim} \frac{f(n)}{g(n)} = 1.$$

Finally, analogous almost sure versions of $o_p$ and $O_p$ are defined in the obvious way.

Exercises

(1) For $a$ and $x$ both $p \times 1$ vectors, show that $\frac{\partial a'x}{\partial x} = a$.

(2) For $A$ a $p \times p$ matrix and $x$ a $p \times 1$ vector, show that $\frac{\partial x'Ax}{\partial x} = (A + A')x$.

(3) For $x$ and $\beta$ both $p \times 1$ vectors, show that $\frac{\partial \exp(x'\beta)}{\partial\beta} = \exp(x'\beta)\, x$.

(4) For $x$ and $\beta$ both $p \times 1$ vectors, find the analytic expression for $\frac{\partial^2 \exp(x'\beta)}{\partial\beta\,\partial\beta'}$.

(5) Write an Octave program that verifies each of the previous results by taking numeric derivatives. For a hint, type help numgradient and help numhessian inside octave.

CHAPTER 23

The GPL

This document and the associated examples and materials are copyright Michael Creel, under the terms of the GNU General Public License. This license follows:

GNU GENERAL PUBLIC LICENSE Version 2, June 1991

Copyright (C) 1989, 1991 Free Software Foundation, Inc. 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed.

Preamble

The licenses for most software are designed to take away your freedom to share and change it. By contrast, the GNU General Public License is intended to guarantee your freedom to share and change free software--to make sure the software is free for all its users. This General Public License applies to most of the Free Software Foundation's software and to any other program whose authors commit to using it. (Some other Free Software Foundation software is covered by the GNU Library General Public License instead.) You can apply it to your programs, too.

When we speak of free software, we are referring to freedom, not price. Our General Public Licenses are designed to make sure that you have the freedom to distribute copies of free software (and charge for this service if you wish), that you receive source code or can get it if you want it, that you can change

the software or use pieces of it in new free programs; and that you know you can do these things.

To protect your rights, we need to make restrictions that forbid anyone to deny you these rights or to ask you to surrender the rights. These restrictions translate to certain responsibilities for you if you distribute copies of the software, or if you modify it.

For example, if you distribute copies of such a program, whether gratis or for a fee, you must give the recipients all the rights that you have. You must make sure that they, too, receive or can get the source code. And you must show them these terms so they know their rights.

We protect your rights with two steps: (1) copyright the software, and (2) offer you this license which gives you legal permission to copy, distribute and/or modify the software.

Also, for each author's protection and ours, we want to make certain that everyone understands that there is no warranty for this free software. If the software is modified by someone else and passed on, we want its recipients to know that what they have is not the original, so that any problems introduced by others will not reflect on the original authors' reputations.

Finally, any free program is threatened constantly by software patents. We wish to avoid the danger that redistributors of a free program will individually obtain patent licenses, in effect making the program proprietary. To prevent this, we have made it clear that any patent must be licensed for everyone's free use or not licensed at all.

The precise terms and conditions for copying, distribution and modification follow.

GNU GENERAL PUBLIC LICENSE TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION

0. This License applies to any program or other work which contains a notice placed by the copyright holder saying it may be distributed under the terms of this General Public License. The "Program", below, refers to any such program or work, and a "work based on the Program" means either the Program or any derivative work under copyright law: that is to say, a work containing the Program or a portion of it, either verbatim or with modifications and/or translated into another language. (Hereinafter, translation is included without limitation in the term "modification".) Each licensee is addressed as "you".

Activities other than copying, distribution and modification are not covered by this License; they are outside its scope. The act of running the Program is not restricted, and the output from the Program is covered only if its contents constitute a work based on the Program (independent of having been made by running the Program). Whether that is true depends on what the Program does.

1. You may copy and distribute verbatim copies of the Program's source code as you receive it, in any medium, provided that you conspicuously and appropriately publish on each copy an appropriate copyright notice and disclaimer of warranty; keep intact all the notices that refer to this License and to the absence of any warranty; and give any other recipients of the Program a copy of this License along with the Program.

You may charge a fee for the physical act of transferring a copy, and you may at your option offer warranty protection in exchange for a fee.

2. You may modify your copy or copies of the Program or any portion of it, thus forming a work based on the Program, and copy and distribute such modifications or work under the terms of Section 1 above, provided that you also meet all of these conditions:

a) You must cause the modified files to carry prominent notices stating that you changed the files and the date of any change.

b) You must cause any work that you distribute or publish, that in whole or in part contains or is derived from the Program or any part thereof, to be licensed as a whole at no charge to all third parties under the terms of this License.

c) If the modified program normally reads commands interactively when run, you must cause it, when started running for such interactive use in the most ordinary way, to print or display an announcement including an appropriate copyright notice and a notice that there is no warranty (or else, saying that you provide a warranty) and that users may redistribute the program under these conditions, and telling the user how to view a copy of this License. (Exception: if the Program itself is interactive but does not normally print such an announcement, your work based on the Program is not required to print an announcement.)

These requirements apply to the modified work as a whole. If identifiable sections of that work are not derived from the Program, and can be reasonably considered independent and separate works in themselves, then this License, and its terms, do not apply to those sections when you distribute them as separate works. But when you distribute the same sections as part of a whole which is a work based on the Program, the distribution of the whole must be

on the terms of this License, whose permissions for other licensees extend to the entire whole, and thus to each and every part regardless of who wrote it.

Thus, it is not the intent of this section to claim rights or contest your rights to work written entirely by you; rather, the intent is to exercise the right to control the distribution of derivative or collective works based on the Program.

In addition, mere aggregation of another work not based on the Program with the Program (or with a work based on the Program) on a volume of a storage or distribution medium does not bring the other work under the scope of this License.

3. You may copy and distribute the Program (or a work based on it, under Section 2) in object code or executable form under the terms of Sections 1 and 2 above provided that you also do one of the following:

a) Accompany it with the complete corresponding machine-readable source code, which must be distributed under the terms of Sections 1 and 2 above on a medium customarily used for software interchange; or,

b) Accompany it with a written offer, valid for at least three years, to give any third party, for a charge no more than your cost of physically performing source distribution, a complete machine-readable copy of the corresponding source code, to be distributed under the terms of Sections 1 and 2 above on a medium customarily used for software interchange; or,

c) Accompany it with the information you received as to the offer to distribute corresponding source code. (This alternative is allowed only for noncommercial distribution and only if you received the program in object code or executable form with such an offer, in accord with Subsection b above.)

The source code for a work means the preferred form of the work for making modifications to it. For an executable work, complete source code means

all the source code for all modules it contains, plus any associated interface definition files, plus the scripts used to control compilation and installation of the executable. However, as a special exception, the source code distributed need not include anything that is normally distributed (in either source or binary form) with the major components (compiler, kernel, and so on) of the operating system on which the executable runs, unless that component itself accompanies the executable.

If distribution of executable or object code is made by offering access to copy from a designated place, then offering equivalent access to copy the source code from the same place counts as distribution of the source code, even though third parties are not compelled to copy the source along with the object code.

4. You may not copy, modify, sublicense, or distribute the Program except as expressly provided under this License. Any attempt otherwise to copy, modify, sublicense or distribute the Program is void, and will automatically terminate your rights under this License. However, parties who have received copies, or rights, from you under this License will not have their licenses terminated so long as such parties remain in full compliance.

5. You are not required to accept this License, since you have not signed it. However, nothing else grants you permission to modify or distribute the Program or its derivative works. These actions are prohibited by law if you do not accept this License. Therefore, by modifying or distributing the Program (or any work based on the Program), you indicate your acceptance of this License to do so, and all its terms and conditions for copying, distributing or modifying the Program or works based on it.

6. Each time you redistribute the Program (or any work based on the Program), the recipient automatically receives a license from the original licensor to copy, distribute or modify the Program subject to these terms and conditions. You may not impose any further restrictions on the recipients' exercise of the rights granted herein. You are not responsible for enforcing compliance by third parties to this License.

7. If, as a consequence of a court judgment or allegation of patent infringement or for any other reason (not limited to patent issues), conditions are imposed on you (whether by court order, agreement or otherwise) that contradict the conditions of this License, they do not excuse you from the conditions of this License. If you cannot distribute so as to satisfy simultaneously your obligations under this License and any other pertinent obligations, then as a consequence you may not distribute the Program at all. For example, if a patent license would not permit royalty-free redistribution of the Program by all those who receive copies directly or indirectly through you, then the only way you could satisfy both it and this License would be to refrain entirely from distribution of the Program.

If any portion of this section is held invalid or unenforceable under any particular circumstance, the balance of the section is intended to apply and the section as a whole is intended to apply in other circumstances.

It is not the purpose of this section to induce you to infringe any patents or other property right claims or to contest validity of any such claims; this section has the sole purpose of protecting the integrity of the free software distribution system, which is implemented by public license practices. Many people have made generous contributions to the wide range of software distributed through that system in reliance on consistent application of that system; it is

up to the author/donor to decide if he or she is willing to distribute software through any other system and a licensee cannot impose that choice.

This section is intended to make thoroughly clear what is believed to be a consequence of the rest of this License.

8. If the distribution and/or use of the Program is restricted in certain countries either by patents or by copyrighted interfaces, the original copyright holder who places the Program under this License may add an explicit geographical distribution limitation excluding those countries, so that distribution is permitted only in or among countries not thus excluded. In such case, this License incorporates the limitation as if written in the body of this License.

9. The Free Software Foundation may publish revised and/or new versions of the General Public License from time to time. Such new versions will be similar in spirit to the present version, but may differ in detail to address new problems or concerns.

Each version is given a distinguishing version number. If the Program specifies a version number of this License which applies to it and "any later version", you have the option of following the terms and conditions either of that version or of any later version published by the Free Software Foundation. If the Program does not specify a version number of this License, you may choose any version ever published by the Free Software Foundation.

10. If you wish to incorporate parts of the Program into other free programs whose distribution conditions are different, write to the author to ask for permission. For software which is copyrighted by the Free Software Foundation, write to the Free Software Foundation; we sometimes make exceptions for this. Our decision will be guided by the two goals of preserving the free

status of all derivatives of our free software and of promoting the sharing and reuse of software generally.

NO WARRANTY

11. BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, REPAIR OR CORRECTION.

12. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR REDISTRIBUTE THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

END OF TERMS AND CONDITIONS


How to Apply These Terms to Your New Programs

If you develop a new program, and you want it to be of the greatest possible use to the public, the best way to achieve this is to make it free software which everyone can redistribute and change under these terms.

To do so, attach the following notices to the program. It is safest to attach them to the start of each source file to most effectively convey the exclusion of warranty; and each file should have at least the "copyright" line and a pointer to where the full notice is found.

<one line to give the program's name and a brief idea of what it does.>
Copyright (C) 19yy <name of author>

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA

Also add information on how to contact you by electronic and paper mail.

If the program is interactive, make it output a short notice like this when it starts in an interactive mode:

Gnomovision version 69, Copyright (C) 19yy name of author
Gnomovision comes with ABSOLUTELY NO WARRANTY; for details type `show w'. This is


free software, and you are welcome to redistribute it under certain conditions; type `show c' for details.

The hypothetical commands `show w' and `show c' should show the appropriate parts of the General Public License. Of course, the commands you use may be called something other than `show w' and `show c'; they could even be mouse-clicks or menu items--whatever suits your program.

You should also get your employer (if you work as a programmer) or your school, if any, to sign a "copyright disclaimer" for the program, if necessary. Here is a sample; alter the names:

Yoyodyne, Inc., hereby disclaims all copyright interest in the program `Gnomovision' (which makes passes at compilers) written by James Hacker.

<signature of Ty Coon>, 1 April 1989
Ty Coon, President of Vice

This General Public License does not permit incorporating your program into proprietary programs. If your program is a subroutine library, you may consider it more useful to permit linking proprietary applications with the library. If this is what you want to do, use the GNU Library General Public License instead of this License.

CHAPTER 24

The attic

This holds material that is not really ready to be incorporated into the main body, but that I don't want to lose. Basically, ignore it, unless you'd like to help get it ready for inclusion.

The GMM estimator, briefly

The OLS estimator can be thought of as a method of moments estimator. With weak exogeneity, $E(x_t \varepsilon_t) = 0$, so, likewise,

$$E\left[x_t\left(y_t - x_t'\beta\right)\right] = 0.$$

The idea of the MM estimator is to choose the estimator to make the sample counterpart hold:

$$\frac{1}{n}\sum_{t=1}^{n} x_t\left(y_t - x_t'\hat\beta\right) = 0 \quad\Rightarrow\quad \hat\beta = (X'X)^{-1}X'y.$$

This means of deriving the formula requires no calculus. It provides another interpretation of how the OLS estimator is defined. We can perhaps think of other variables that are not correlated with $\varepsilon_t$, say $w_t$, that satisfy

$$E\left[w_t\left(y_t - x_t'\beta\right)\right] = 0.$$

This may be needed if the weak exogeneity assumption fails for $x_t$. Let us assume that we have $G$ instruments $w_t$. If the dimension $G$ of $w_t$ is greater than $K$, the dimension of $x_t$, then we have more moment conditions than parameters, and the sample moment conditions cannot in general all be satisfied exactly.
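As a concrete illustration (not part of the original notes, and in Python/NumPy rather than the Octave scripts the notes use), here is a minimal sketch of the OLS estimator obtained as a method of moments estimator, by solving the sample moment conditions directly:

```python
import numpy as np

# Minimal sketch: OLS as a method of moments estimator.
# We solve the sample moment conditions (1/n) X'(y - X b) = 0 directly.
rng = np.random.default_rng(0)
n = 1000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([1.0, 2.0])
y = X @ beta_true + rng.normal(size=n)

# Setting the sample moments to zero gives X'X b = X'y, the OLS formula.
b_mm = np.linalg.solve(X.T @ X, X.T @ y)

# The sample moment conditions hold (numerically) at the estimate.
moments = X.T @ (y - X @ b_mm) / n
print(b_mm, np.abs(moments).max())
```

With valid instruments $w_t$ replacing $x_t$ in the moment conditions (and $G = K$), the same `solve` call with `W.T @ X` and `W.T @ y` would give the simple IV estimator.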


24.1. MEPS data: more on count models

Note to self: this chapter is yet to be converted to use Octave.

To check the plausibility of the Poisson model, we can compare the sample unconditional variance with the estimated unconditional variance according to the Poisson model: $\widehat{V(y)} = \frac{1}{n}\sum_{t=1}^{n}\hat\lambda_t$. For OBDV and ERV, we get

TABLE 1. Marginal Variances, Sample and Estimated (Poisson)

            OBDV     ERV
Sample      37.446   0.30614
Estimated   3.4540   0.19060

We see that even after conditioning, the overdispersion is not captured in either case. There is a huge problem with OBDV, and a significant problem with ERV. In both cases the Poisson model does not appear to be plausible.

24.1.1. Infinite mixture models. Reference: Cameron and Trivedi (1998), Regression Analysis of Count Data, chapter 4.

The two measures seem to exhibit extra-Poisson variation. To capture unobserved heterogeneity, a possibility is the random parameters approach. Consider the possibility that the constant term in a Poisson model were random:

$$f_Y(y|x,\varepsilon) = \frac{\exp(-\theta)\theta^y}{y!}, \qquad \theta = \exp(x'\beta + \varepsilon) = \exp(x'\beta)\exp(\varepsilon) = \lambda\nu,$$


where $\lambda = \exp(x'\beta)$ and $\nu = \exp(\varepsilon)$. Now $\nu$ captures the randomness in the constant. The problem is that we don't observe $\nu$, so we will need to marginalize it to get a usable density:

$$f_Y(y|x) = \int_0^{\infty} \frac{\exp(-\theta)\theta^y}{y!}\, f_\nu(z)\, dz$$

This density can be used directly, perhaps using numerical integration to evaluate the likelihood function. In some cases, though, the integral will have an analytic solution. For example, if $\nu$ follows a certain one parameter gamma density, then

(24.1.1)
$$f_Y(y|x) = \frac{\Gamma(y+\psi)}{\Gamma(y+1)\Gamma(\psi)} \left(\frac{\psi}{\psi+\lambda}\right)^{\psi} \left(\frac{\lambda}{\psi+\lambda}\right)^{y}$$

where $\psi$ appears since it is the parameter of the gamma density. For this density, $E(y|x) = \lambda$, which we have parameterized $\lambda = \exp(x'\beta)$. The variance depends upon how $\psi$ is parameterized.

– If $\psi = \lambda/\alpha$, where $\alpha > 0$, then $V(y|x) = \lambda(1+\alpha)$. Note that $\lambda$ is a function of $x$, so that the variance is too. This is referred to as the NB-I model.
– If $\psi = 1/\alpha$, where $\alpha > 0$, then $V(y|x) = \lambda(1+\alpha\lambda)$. This is referred to as the NB-II model.

So both forms of the NB model allow for overdispersion, with the NB-II model allowing for a more radical form.

Testing reduction of a NB model to a Poisson model cannot be done by testing $\alpha = 0$ using standard Wald or LR procedures. The critical values need to be adjusted to account for the fact that $\alpha = 0$ is on the boundary of the parameter space. Without getting into details, suppose that the data were in fact Poisson, so there is equidispersion and the true $\alpha = 0$. Then about half the time the sample data will be underdispersed, and about half the time overdispersed. When the data is underdispersed, the MLE of $\alpha$ will be $\hat\alpha = 0$. Thus, under the null, there will be a probability spike in the asymptotic distribution of $\sqrt{n}(\hat\alpha - \alpha)$ at 0, so standard testing methods will not be valid.
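The gamma mixture of Poissons can be checked by simulation. The following sketch (hypothetical parameter values, in Python/NumPy rather than the Octave the notes use) draws from a Poisson whose constant is random, with a mean-one gamma mixing density, and confirms the NB-II style overdispersion:

```python
import numpy as np

# Sketch: a Poisson model with a random constant. With
# nu ~ Gamma(shape=1/alpha, scale=alpha) (mean 1, variance alpha) and
# y | nu ~ Poisson(lam * nu), y has E(y) = lam and
# V(y) = lam * (1 + alpha * lam), i.e. NB-II style overdispersion.
rng = np.random.default_rng(1)
lam, alpha, n = 3.0, 0.5, 200_000
nu = rng.gamma(shape=1.0 / alpha, scale=alpha, size=n)
y = rng.poisson(lam * nu)

print(y.mean())  # close to lam = 3.0
print(y.var())   # close to lam * (1 + alpha * lam) = 7.5
```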

Here are NB-I estimation results for OBDV, obtained using this estimation program.

MEPS data, OBDV
negbin results
Strong convergence
Observations = 500
Function value     -2.2656

t-Stats
             params      t(OPG)     t(Sand.)   t(Hess)
constant    -0.055766   -0.16793   -0.17418   -0.17215
pub_ins      0.47936     2.9406     2.8296     2.9122
priv_ins     0.20673     1.3847     1.4201     1.4086
sex          0.34916     3.2466     3.4148     3.3434
age          0.015116    3.3569     3.8055     3.5974
educ         0.014637    0.78661    0.67910    0.73757
inc          0.012581    0.60022    0.93782    0.76330
ln_alpha     1.7389     23.669     11.295     16.660

Information Criteria
Consistent Akaike  2323.3
Schwartz           2315.3
Hannan-Quinn       2294.8
Akaike             2281.6


Here are NB-II results for OBDV:

*********************************************************************
MEPS data, OBDV
negbin results
Strong convergence
Observations = 500
Function value     -2.2616

t-Stats
             params      t(OPG)     t(Sand.)   t(Hess)
constant    -0.65981    -1.8913    -1.4717    -1.6977
pub_ins      0.68928     2.9991     3.1825     3.1436
priv_ins     0.22171     1.1515     1.2057     1.1917
sex          0.44610     3.8752     2.9768     3.5164
age          0.024221    3.8193     4.5236     4.3239
educ         0.020608    0.94844    0.74627    0.86004
inc          0.020040    0.87374    0.72569    0.86579
ln_alpha     0.47421     5.6622     4.6278     5.6281

Information Criteria
Consistent Akaike  2319.3
Schwartz           2311.3
Hannan-Quinn       2290.8
Akaike             2277.6
*********************************************************************

For the OBDV model, the NB-II model does a better job, in terms of the average log-likelihood and the information criteria. Note that both versions of the NB model fit much better than does the Poisson model. The t-statistics are now similar for all three ways of calculating them, which might indicate that the serious specification problems of the Poisson model for the OBDV data are partially solved by moving to the NB model. The estimated $\hat\alpha$ is highly significant.

To check the plausibility of the NB-II model, we can compare the sample unconditional variance with the estimated unconditional variance according to the NB-II model: $\widehat{V(y)} = \frac{1}{n}\sum_{t=1}^{n}\hat\lambda_t\left(1 + \hat\alpha\hat\lambda_t\right)$. For OBDV and ERV (estimation results not reported), we get

TABLE 2. Marginal Variances, Sample and Estimated (NB-II)

            OBDV     ERV
Sample      37.446   0.30614
Estimated   26.962   0.27620

The overdispersion problem is significantly better than in the Poisson case, but there is still some overdispersion that is not captured, for both OBDV and ERV.

Returning to the Poisson model, let's look at actual and fitted count probabilities. Actual relative frequencies are $p(y=j) = \sum_t 1(y_t = j)/n$ and fitted frequencies are $\hat p(y=j) = \sum_{t=1}^{n} f_Y(j|x_t, \hat\theta)/n$.

24.2. Hurdle models

TABLE 3. Actual and Poisson fitted frequencies

         OBDV               ERV
Count    Actual   Fitted    Actual   Fitted
0        0.32     0.06      0.86     0.83
1        0.18     0.15      0.10     0.14
2        0.11     0.19      0.02     0.02
3        0.10     0.18      0.004    0.002
4        0.052    0.15      0.002    0.0002
5        0.032    0.10      0        2.4e-5

We see that for the OBDV measure, there are somewhat more actual zeros than fitted. For ERV, there are many more actual zeros than predicted. Why might OBDV not fit the zeros well? What if people made the decision to contact the doctor for a first visit because, for example, they are sick, and then the doctor decides on whether or not follow-up visits are needed? This is a principal/agent type situation, where the total number of visits depends upon the decision of both the patient and the doctor. Since different parameters may govern the two decision-makers' choices, we might expect that different parameters govern the probability of zeros versus the other counts. Let $\lambda_p$ be the parameters of the patient's demand for visits, and let $\lambda_d$ be the parameter of the doctor's "demand" for visits. The patient will initiate visits according to a discrete choice model, for example, a logit model:

$$\Pr(y = 0) = f_Y(0|\lambda_p) = \frac{1}{1 + \exp(\lambda_p)}, \qquad \Pr(y > 0) = \frac{\exp(\lambda_p)}{1 + \exp(\lambda_p)}.$$

The above probabilities are used to estimate the binary 0/1 hurdle process. Then, for the observations where visits are positive, a truncated Poisson density is estimated. This density is

$$f_Y(y|x, y > 0) = \frac{f_Y(y|\lambda_d)}{\Pr(y > 0)} = \frac{f_Y(y|\lambda_d)}{1 - \exp(-\lambda_d)},$$

since, according to the Poisson model with the doctor's parameters, $\Pr(y = 0) = \exp(-\lambda_d)$. Since the hurdle and truncated components of the overall density for $y$ share no parameters, they may be estimated separately, which is computationally more efficient than estimating the overall model. (Recall that the BFGS algorithm, for example, will have to invert the approximated Hessian; the computational overhead is of order $K^2$, where $K$ is the number of parameters to be estimated.)
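A sketch of the truncated part of this separate estimation, with simulated rather than MEPS data and using Python/SciPy instead of the notes' estimation programs, looks like this:

```python
import numpy as np
from scipy.stats import poisson
from scipy.optimize import minimize_scalar

# Sketch (hypothetical data, not MEPS): MLE of the truncated component of
# a hurdle model. For y > 0, the zero-truncated Poisson density is
# f(y|lam) / (1 - exp(-lam)).
rng = np.random.default_rng(2)
lam_true = 2.5
y = rng.poisson(lam_true, size=5000)
y_pos = y[y > 0]                      # only positive counts enter this MLE

def neg_loglik(lam):
    # log zero-truncated Poisson density, summed over the positive counts
    return -np.sum(poisson.logpmf(y_pos, lam) - np.log1p(-np.exp(-lam)))

res = minimize_scalar(neg_loglik, bounds=(0.01, 20.0), method="bounded")
print(res.x)  # close to lam_true
```

The binary hurdle (logit) part would be fit separately on the 0/1 indicator, exactly because the two components share no parameters.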

Here are hurdle Poisson estimation results for OBDV, obtained from this estimation program.

*********************************************************************
MEPS data, OBDV
logit results
Strong convergence
Observations = 500

Estimates and t-statistics (OPG, sandwich and Hessian forms) are reported for the params: constant, pub_ins, priv_ins, sex, age, educ, inc.

Information Criteria
Consistent Akaike  639.89
Schwartz           632.89
Hannan-Quinn       614.96
Akaike             603.39
*********************************************************************

The results for the truncated part:

*********************************************************************
MEPS data, OBDV
tpoisson results
Strong convergence
Observations = 500

Estimates and t-statistics (OPG, sandwich and Hessian forms) are reported for the params: constant, pub_ins, priv_ins, sex, age, educ, inc.

Information Criteria
Consistent Akaike  2754.7
Schwartz           2747.7
Hannan-Quinn       2729.8
Akaike             2718.2
*********************************************************************

Fitted and actual probabilities (NB-II fits are provided as well) are given in TABLE 4, Actual and Hurdle Poisson fitted frequencies (columns: Count; OBDV Actual, Fitted HP, Fitted NB-II; ERV Actual, Fitted HP, Fitted NB-II).

For the Hurdle Poisson models, the ERV fit is very accurate. Zeros are exact, but 1's and 2's are underestimated, and higher counts are overestimated. The OBDV fit is not so good. For the NB-II fits, performance is at least as good as the hurdle Poisson model, and one should recall that many fewer parameters are used. Hurdle versions of the negative binomial model are also widely used.

24.2.1. Finite mixture models. The finite mixture approach to fitting health care demand was introduced by Deb and Trivedi (1997). The mixture approach has the intuitive appeal of allowing for subgroups of the population with different health status. If individuals are classified as healthy or unhealthy then two subgroups are defined. A finer classification scheme would lead to more subgroups. Many studies have incorporated objective and/or subjective indicators of health status in an effort to capture this heterogeneity. The available objective measures, such as limitations on activity, are not necessarily very informative about a person's overall health status. Subjective, self-reported measures may suffer from the same problem, and may also not be exogenous.

Finite mixture models are conceptually simple. The density is

$$f_Y(y, \phi_1, \dots, \phi_p, \pi_1, \dots, \pi_{p-1}) = \sum_{i=1}^{p-1} \pi_i f_Y^{(i)}(y, \phi_i) + \pi_p f_Y^{(p)}(y, \phi_p),$$

where $\pi_i > 0$, $i = 1, \dots, p-1$, $\pi_p = 1 - \sum_{i=1}^{p-1}\pi_i$, and $\sum_{i=1}^{p}\pi_i = 1$. Identification requires that the $\pi_i$ are ordered in some way, for example $\pi_1 \ge \pi_2 \ge \cdots \ge \pi_p$, and $\phi_i \ne \phi_j$, $i \ne j$. This is simple to accomplish post-estimation by rearrangement and possible elimination of redundant component densities.

The properties of the mixture density follow in a straightforward way from those of the components. In particular, the moment generating function is the same mixture of the moment generating functions of the component densities, so, for example, $E(Y|x) = \sum_i \pi_i \mu_i(x)$, where $\mu_i(x)$ is the mean of the $i$th component density.

Mixture densities may suffer from overparameterization, since the total number of parameters grows rapidly with the number of component densities. It is possible to constrain parameters across the mixtures.

Testing for the number of component densities is a tricky issue. For example, testing for $p = 1$ (a single component, which is to say, no mixture) versus $p = 2$ (a mixture of two components) involves the restriction $\pi_1 = 1$, which is on the boundary of the parameter space. Note that when $\pi_1 = 1$, the parameters of the second component can take on any value without affecting the density. Usual methods such as the likelihood ratio test are not applicable when parameters are on the boundary under the null hypothesis. Information criteria means of choosing the model (see below) are valid.
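The "mixture of moments" property can be illustrated with a short simulation (hypothetical parameters, Python/NumPy rather than the notes' Octave):

```python
import numpy as np

# Sketch: a two-component finite mixture of Poissons. Moments of the
# mixture are the same mixture of the component moments, e.g.
# E(y) = pi1 * mu1 + (1 - pi1) * mu2.
rng = np.random.default_rng(3)
pi1, mu1, mu2, n = 0.7, 1.0, 6.0, 200_000
component = rng.random(n) < pi1              # latent subgroup indicator
y = np.where(component, rng.poisson(mu1, n), rng.poisson(mu2, n))

mix_mean = pi1 * mu1 + (1 - pi1) * mu2       # = 2.5
print(y.mean())  # close to mix_mean
```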

The following are results for a mixture of 2 negative binomial (NB-I) models, for the OBDV data, which you can replicate using this estimation program.

*********************************************************************
MEPS data, OBDV
mixnegbin results
Strong convergence
Observations = 500

Estimates and t-statistics (OPG, sandwich and Hessian forms) are reported for the params of both components (constant, pub_ins, priv_ins, sex, age, educ, inc, ln_alpha) and for logit_inv_mix.

Information Criteria
Consistent Akaike  2353.8
Schwartz           2336.8
Hannan-Quinn       2293.3
Akaike             2265.2
*********************************************************************

Delta method for mix parameter st. err.
mix      0.70096
se_mix   0.12043

The 95% confidence interval for the mix parameter is perilously close to 1, which suggests that there may really be only one component density, rather than a mixture. Again, this is not the way to test this -- it is merely suggestive. A larger sample could help clarify things.

Education is interesting. For the subpopulation that is "healthy", i.e., that makes relatively few visits, education seems to have a positive effect on visits. For the "unhealthy" group, education has a negative effect on visits. The other results are more mixed.

The following are results for a 2 component constrained mixture negative binomial model, where all the slope parameters in $\lambda_j = e^{x'\beta_j}$ are the same across the two components. The constants and the overdispersion parameters $\alpha_j$ are allowed to differ for the two components.

*********************************************************************
MEPS data, OBDV
cmixnegbin results
Strong convergence
Observations = 500

Estimates and t-statistics (OPG, sandwich and Hessian forms) are reported for the params: constant, pub_ins, priv_ins, sex, age, educ, inc, ln_alpha, const_2, lnalpha_2, and logit_inv_mix.

Information Criteria
Consistent Akaike  2323.5
Schwartz           2312.5
Hannan-Quinn       2284.3
Akaike             2266.1
*********************************************************************

Delta method for mix parameter st. err.
mix      0.92335
se_mix   0.047318

The slope parameter estimates are pretty close to what we got with the NB-I model. Now the mixture parameter is even closer to 1. Testing for collapse of a finite mixture to a mixture of fewer components has the same boundary problem: just as a Poisson model can't be tested (using standard methods) as a restriction of a negative binomial model, the parameter of interest lies on the boundary of the parameter space under the null.

24.2.2. Comparing models using information criteria. How can we determine which of competing models is the best? The information criteria approach is one possibility. Information criteria are functions of the log-likelihood, with a penalty for the number of parameters used. Three popular information criteria are the Akaike (AIC), Bayes (BIC) and consistent Akaike (CAIC). The formulae are

$$\mathrm{CAIC} = -2\ln L(\hat\theta) + k(\ln n + 1)$$
$$\mathrm{BIC} = -2\ln L(\hat\theta) + k\ln n$$
$$\mathrm{AIC} = -2\ln L(\hat\theta) + 2k$$

It can be shown that the CAIC and BIC will select the correctly specified model from a group of models, asymptotically. This doesn't mean, of course, that the correct model is necessarily in the group.
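As a quick sketch of these criteria (assuming the definitions AIC = -2 ln L + 2k, BIC = -2 ln L + k ln n, CAIC = -2 ln L + k(ln n + 1)), the NB-I values reported above can be reproduced from the average log-likelihood of -2.2656, 500 observations and 8 parameters:

```python
import numpy as np

# Sketch: information criteria from a maximized log-likelihood lnL,
# number of parameters k, and sample size n.
def info_criteria(lnL, k, n):
    aic = -2.0 * lnL + 2.0 * k
    bic = -2.0 * lnL + k * np.log(n)
    caic = -2.0 * lnL + k * (np.log(n) + 1.0)
    return aic, bic, caic

# NB-I for OBDV: average log-likelihood -2.2656 over n = 500, k = 8
aic, bic, caic = info_criteria(lnL=-2.2656 * 500, k=8, n=500)
print(aic, bic, caic)  # approximately 2281.6, 2315.3, 2323.3
```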

The AIC is not consistent, and will asymptotically favor an over-parameterized model over the correctly specified model. Here are information criteria values for the models we've seen, for OBDV.

TABLE 5. Information Criteria, OBDV

Model            AIC    BIC    CAIC
Poisson          3822   3911   3918
NB-I             2282   2315   2323
Hurdle Poisson   3333   3381   3395
MNB-I            2265   2337   2354
CMNB-I           2266   2312   2323

According to the AIC, the best is the MNB-I, which has relatively many parameters. The best according to the BIC is CMNB-I, and according to CAIC, the best is NB-I. The Poisson-based models do not do well.

24.3. Models for time series data

This section can be ignored in its present form. Just left in to form a basis for completion (by someone else?!) at some point. This is very incomplete and contributions would be very welcome. Hamilton, Time Series Analysis is a good reference for this section.

Up to now we've considered the behavior of the dependent variable $y_t$ as a function of other variables $x_t$. These variables can of course contain lagged dependent variables, e.g., $x_t = (w_t, y_{t-1}, \dots, y_{t-j})$. Pure time series methods consider the behavior of $y_t$ as a function only of its own lagged values, unconditional on other observable variables. One can think of this as modeling the behavior of $y_t$ after marginalizing out all other variables. While it's not immediately clear why a model that has other explanatory variables should marginalize to a linear in the parameters time series model, most time series

work is done with linear models, though nonlinear time series is also a large and growing field. We'll stick with linear time series models.

24.3.1. Basic concepts.

DEFINITION 53 (Stochastic process). A stochastic process is a sequence of random variables, indexed by time:

(24.3.1)
$$\left\{ Y_t \right\}_{t=-\infty}^{\infty}$$

DEFINITION 54 (Time series). A time series is one observation of a stochastic process, over a specific interval:

(24.3.2)
$$\left\{ y_t \right\}_{t=1}^{n}$$

So a time series is a sample of size $n$ from a stochastic process. It's important to keep in mind that conceptually, one could draw another sample, and that the values would be different.

DEFINITION 55 (Autocovariance). The $j$th autocovariance of a stochastic process is

(24.3.3)
$$\gamma_{jt} = E\left[(y_t - \mu_t)(y_{t-j} - \mu_{t-j})\right]$$

where $\mu_t = E(y_t)$.

DEFINITION 56 (Covariance (weak) stationarity). A stochastic process is covariance stationary if it has time constant mean and autocovariances of all orders:

$$\mu_t = \mu, \;\forall t \qquad \gamma_{jt} = \gamma_j, \;\forall t$$

As we've seen, this implies that $\gamma_j = \gamma_{-j}$: the autocovariances depend only on the interval between observations, but not the time of the observations.

DEFINITION 57 (Strong stationarity). A stochastic process is strongly stationary if the joint distribution of an arbitrary collection of the $\{Y_t\}$ doesn't depend on $t$.

Since moments are determined by the distribution, strong stationarity implies weak stationarity.

What is the mean of $Y_t$? The time series is one sample from the stochastic process. One could think of $M$ repeated samples from the stochastic process, e.g., $\{y_t^m\}$. By a LLN, we would expect that

$$\frac{1}{M}\sum_{m=1}^{M} y_t^m \xrightarrow{p} E(Y_t)$$

The problem is, we have only one sample to work with, since we can't go back in time and collect another. How can $E(Y_t)$ be estimated then? It turns out that ergodicity is the needed property.

DEFINITION 58 (Ergodicity). A stationary stochastic process is ergodic (for the mean) if the time average converges to the mean:

(24.3.4)
$$\frac{1}{n}\sum_{t=1}^{n} y_t \xrightarrow{p} \mu$$
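Under ergodicity, time averages estimate the population moments. A minimal sketch (simulated white noise, Python/NumPy rather than the notes' Octave) of the sample mean and autocovariances computed this way:

```python
import numpy as np

# Sketch: estimating mean and autocovariances from a single time series,
# relying on ergodicity (time averages converge to population moments).
def autocov(y, j):
    # sample j-th autocovariance: mean of (y_t - ybar)(y_{t-j} - ybar)
    y = np.asarray(y, dtype=float)
    ybar = y.mean()
    return np.mean((y[j:] - ybar) * (y[: len(y) - j] - ybar))

rng = np.random.default_rng(4)
y = rng.normal(size=100_000)   # white noise: gamma_0 = 1, gamma_j = 0, j > 0
print(autocov(y, 0), autocov(y, 1))
```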

A sufficient condition for ergodicity is that the autocovariances be absolutely summable:

$$\sum_{j=0}^{\infty} |\gamma_j| < \infty$$

This implies that the autocovariances die off, so that the $y_t$ are not so strongly dependent that they don't satisfy a LLN.

DEFINITION 59 (Autocorrelation). The $j$th autocorrelation, $\rho_j$, is just the $j$th autocovariance divided by the variance:

(24.3.5)
$$\rho_j = \frac{\gamma_j}{\gamma_0}$$

DEFINITION 60 (White noise). White noise is just the time series literature term for a classical error: $\varepsilon_t$ is white noise if i) $E(\varepsilon_t) = 0, \forall t$; ii) $V(\varepsilon_t) = \sigma^2, \forall t$; and iii) $\varepsilon_t$ and $\varepsilon_s$ are independent, $t \ne s$. Gaussian white noise just adds a normality assumption.

24.3.2. ARMA models. With these concepts, we can discuss ARMA models. These are closely related to the AR and MA error processes that we've already discussed. The main difference is that the lhs variable is observed directly now.

24.3.2.1. MA(q) processes. A $q$th order moving average (MA) process is

$$y_t = \mu + \varepsilon_t + \theta_1\varepsilon_{t-1} + \theta_2\varepsilon_{t-2} + \cdots + \theta_q\varepsilon_{t-q}$$

where $\varepsilon_t$ is white noise.

The variance is

$$\gamma_0 = E\left[(y_t - \mu)^2\right] = \sigma^2\left(1 + \theta_1^2 + \theta_2^2 + \cdots + \theta_q^2\right)$$

Similarly, the autocovariances are

$$\gamma_j = \begin{cases} \sigma^2\left(\theta_j + \theta_{j+1}\theta_1 + \theta_{j+2}\theta_2 + \cdots + \theta_q\theta_{q-j}\right), & j \le q \\ 0, & j > q \end{cases}$$

Therefore an MA(q) process is necessarily covariance stationary and ergodic, as long as $\sigma^2$ and all of the $\theta_j$ are finite.

24.3.2.2. AR(p) processes. An AR(p) process can be represented as

$$y_t = c + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \cdots + \phi_p y_{t-p} + \varepsilon_t$$

The dynamic behavior of an AR(p) process can be studied by writing this $p$th order difference equation as a vector first order difference equation:

$$\begin{bmatrix} y_t \\ y_{t-1} \\ \vdots \\ y_{t-p+1} \end{bmatrix} = \begin{bmatrix} c \\ 0 \\ \vdots \\ 0 \end{bmatrix} + \begin{bmatrix} \phi_1 & \phi_2 & \cdots & \phi_p \\ 1 & 0 & \cdots & 0 \\ 0 & 1 & \ddots & \vdots \\ 0 & \cdots & 1 & 0 \end{bmatrix} \begin{bmatrix} y_{t-1} \\ y_{t-2} \\ \vdots \\ y_{t-p} \end{bmatrix} + \begin{bmatrix} \varepsilon_t \\ 0 \\ \vdots \\ 0 \end{bmatrix}$$

or

$$Y_t = C + F Y_{t-1} + E_t$$
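The MA moment formulas above are easy to check by simulation. A sketch for an MA(1) with $\sigma^2 = 1$ (hypothetical $\theta$, Python/NumPy):

```python
import numpy as np

# Sketch: for MA(1), y_t = eps_t + theta * eps_{t-1}, with sigma^2 = 1:
# gamma_0 = 1 + theta^2, gamma_1 = theta, gamma_j = 0 for j > q = 1.
rng = np.random.default_rng(5)
theta, n = 0.6, 500_000
eps = rng.normal(size=n + 1)
y = eps[1:] + theta * eps[:-1]

def autocov(y, j):
    ybar = y.mean()
    return np.mean((y[j:] - ybar) * (y[: len(y) - j] - ybar))

print(autocov(y, 0))  # close to 1 + theta**2 = 1.36
print(autocov(y, 1))  # close to theta = 0.6
print(autocov(y, 2))  # close to 0, since j > q
```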

With this, we can recursively work forward in time:

$$Y_{t+1} = C + F Y_t + E_{t+1} = C + F\left(C + F Y_{t-1} + E_t\right) + E_{t+1} = \left(I_p + F\right)C + F^2 Y_{t-1} + F E_t + E_{t+1}$$

and

$$Y_{t+2} = C + F Y_{t+1} + E_{t+2} = \left(I_p + F + F^2\right)C + F^3 Y_{t-1} + F^2 E_t + F E_{t+1} + E_{t+2}$$

or in general

$$Y_{t+j} = \left(I_p + F + \cdots + F^j\right)C + F^{j+1} Y_{t-1} + F^j E_t + F^{j-1} E_{t+1} + \cdots + E_{t+j}$$

Consider the impact of a shock in period $t$ on $y_{t+j}$. This is simply

$$\frac{\partial Y_{t+j}}{\partial E_t'} = F^j$$

If the system is to be stationary, then as we move forward in time this impact must die off. Otherwise a shock causes a permanent change in the mean of $y_t$. Therefore, stationarity requires that

$$\lim_{j \to \infty} F^j = 0$$

Save this result, we'll need it in a minute.

Consider the eigenvalues of the matrix $F$. These are the $\lambda$ such that

$$\left| F - \lambda I_p \right| = 0$$

The determinant here can be expressed as a polynomial. For example, for $p = 1$, the matrix $F$ is simply $F = \phi_1$, so

$$\left| \phi_1 - \lambda \right| = 0$$

can be written as

$$\phi_1 - \lambda = 0.$$

When $p = 2$, the matrix $F$ is

$$F = \begin{bmatrix} \phi_1 & \phi_2 \\ 1 & 0 \end{bmatrix}$$

so

$$F - \lambda I_p = \begin{bmatrix} \phi_1 - \lambda & \phi_2 \\ 1 & -\lambda \end{bmatrix}$$

and

$$\left| F - \lambda I_p \right| = \lambda^2 - \lambda\phi_1 - \phi_2$$

So the eigenvalues are the roots of the polynomial

$$\lambda^2 - \lambda\phi_1 - \phi_2 = 0$$

which can be found using the quadratic equation. This generalizes. For a $p$th order AR process, the eigenvalues are the roots of

$$\lambda^p - \lambda^{p-1}\phi_1 - \lambda^{p-2}\phi_2 - \cdots - \lambda\phi_{p-1} - \phi_p = 0$$

Supposing that all of the roots of this polynomial are distinct, then the matrix $F$ can be factored as

$$F = T \Lambda T^{-1}$$

where $T$ is the matrix which has as its columns the eigenvectors of $F$, and $\Lambda$ is a diagonal matrix with the eigenvalues on the main diagonal. Using this decomposition, we can write

$$F^j = \left(T \Lambda T^{-1}\right)\left(T \Lambda T^{-1}\right)\cdots\left(T \Lambda T^{-1}\right) = T \Lambda^j T^{-1}$$

where $T \Lambda T^{-1}$ is repeated $j$ times, and

$$\Lambda^j = \begin{bmatrix} \lambda_1^j & 0 & \cdots & 0 \\ 0 & \lambda_2^j & & \vdots \\ \vdots & & \ddots & 0 \\ 0 & \cdots & 0 & \lambda_p^j \end{bmatrix}$$

Supposing that the $\lambda_i$ are all real valued, it is clear that

$$\lim_{j \to \infty} F^j = 0$$

requires that

$$\left|\lambda_i\right| < 1, \quad i = 1, 2, \dots, p.$$
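This eigenvalue condition is easy to check numerically. A sketch (hypothetical AR(2) coefficients, Python/NumPy) that builds the companion matrix $F$ and tests whether all eigenvalues are less than one in modulus:

```python
import numpy as np

# Sketch: stationarity check for an AR(p) via the companion matrix F.
def companion(phi):
    p = len(phi)
    F = np.zeros((p, p))
    F[0, :] = phi                  # first row: phi_1 ... phi_p
    F[1:, :-1] = np.eye(p - 1)     # shifted identity below
    return F

def is_stationary(phi):
    # all eigenvalues of F must be less than one in modulus
    return np.all(np.abs(np.linalg.eigvals(companion(phi))) < 1.0)

print(is_stationary([0.5, 0.3]))   # stationary AR(2)
print(is_stationary([0.9, 0.3]))   # not stationary: an eigenvalue exceeds 1
```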

e.g., the eigenvalues must be less than one in absolute value. It may be the case that some eigenvalues are complex-valued. The previous result generalizes to the requirement that the eigenvalues be less than one in modulus, where the modulus of a complex number $a + bi$ is

$$\operatorname{mod}(a + bi) = \sqrt{a^2 + b^2}$$

This leads to the famous statement that "stationarity requires the roots of the determinantal polynomial to lie inside the complex unit circle." Draw picture here.

When there are roots on the unit circle (unit roots) or outside the unit circle, we leave the world of stationary processes.

Dynamic multipliers: $\partial y_{t+j} / \partial \varepsilon_t = F^j_{(1,1)}$ is a dynamic multiplier or an impulse-response function. Real eigenvalues lead to steady movements, whereas complex eigenvalues lead to oscillatory behavior. Of course, when there are multiple eigenvalues the overall effect can be a mixture. Pictures.

Invertibility of AR process. To begin with, define the lag operator $L$:

$$L y_t = y_{t-1}$$

The lag operator is defined to behave just as an algebraic quantity, e.g.,

$$L^2 y_t = L(L y_t) = L y_{t-1} = y_{t-2}$$

Invertibility of AR process. To begin with, define the lag operator $L$ by

$$L y_t = y_{t-1}.$$

The lag operator is defined to behave just as an algebraic quantity, e.g.,

$$L^2 y_t = L (L y_t) = L y_{t-1} = y_{t-2}$$

or

$$(1 - L)(1 + L) y_t = y_t + L y_t - L y_t - L^2 y_t = y_t - y_{t-2}.$$

A mean-zero AR(p) process can be written as

$$y_t - \phi_1 y_{t-1} - \phi_2 y_{t-2} - \cdots - \phi_p y_{t-p} = \varepsilon_t$$

or

$$\left( 1 - \phi_1 L - \phi_2 L^2 - \cdots - \phi_p L^p \right) y_t = \varepsilon_t.$$

Factor this polynomial as

$$1 - \phi_1 L - \phi_2 L^2 - \cdots - \phi_p L^p = (1 - \lambda_1 L)(1 - \lambda_2 L) \cdots (1 - \lambda_p L).$$

For the moment, just assume that the $\lambda_i$ are coefficients to be determined. Since $L$ is defined to operate as an algebraic quantity, determination of the $\lambda_i$ is the same as determination of the $\lambda_i$ such that the following two expressions are the same for all $z$:

$$1 - \phi_1 z - \phi_2 z^2 - \cdots - \phi_p z^p = (1 - \lambda_1 z)(1 - \lambda_2 z) \cdots (1 - \lambda_p z).$$

Multiply both sides by $z^{-p}$:

$$z^{-p} - \phi_1 z^{1-p} - \phi_2 z^{2-p} - \cdots - \phi_{p-1} z^{-1} - \phi_p = (z^{-1} - \lambda_1)(z^{-1} - \lambda_2) \cdots (z^{-1} - \lambda_p)$$

and now define $\lambda = z^{-1}$, so we get

$$\lambda^p - \phi_1 \lambda^{p-1} - \phi_2 \lambda^{p-2} - \cdots - \phi_{p-1} \lambda - \phi_p = (\lambda - \lambda_1)(\lambda - \lambda_2) \cdots (\lambda - \lambda_p).$$

as above. MODELS FOR TIME SERIES DATA since gf I f I 4 t ©ß ß P Sgf I f I t f ß ©ß 484 . the Stationarity. implies that 8 4) ½ t 8 P B Ce 8 P so Multiply both sides by that are the coefﬁcients of the factorization are simply 24. multiplying the polynomials on th LHS. we get ß fß f t ) G Ô© f VÈc¥ Now consider a different stationary process the eigenvalues of the matrix The LHS is precisely the determinantal polynomial that gives the eigenvalues of Therefore.3.t wP ) ð ç gf Now as k % tG ó tG ó ß fß ß fß tG ó f Ët@8@8¡P ¢ f ¢ ËP f t P 8 t ß ß 4) ½ t t P 8 wtA8@8¡P ¢ f ¢ wP f w) ð t t P so t P 8 wA8@8¡P ¢ f ¢ wP f Ë) ð gf ó I f I uÈ) ð t t P t ©ß ß and with cancellations we have G t7© f Ët@8@8¡P ¢ f ¢ ËP f Ëc¥ s t P 8 t t P ) ß ß gf © I f I u f u4@8A88 ¢ f ¢ 0 f V f wt@8A8P ¢ f ¢ ËP f ËÔ¥ t t t t t P 8 t t P ) ß ©ß ß ß ß ß tG ó f tËPx8A@88¡P ¢ f ¢ twP f twP ) ð Ô© f VÈc¥ ó f Ët@8A8P ¢ f ¢ ËP f Ë) ð f t ) t P 8 t t P ß ß ß ß t P 8 P ¢t Ë@8A8¢ f wP f w) t P to get or.

The approximation becomes better and better as $j$ increases. However, we started with

$$(1 - \phi L) y_t = \varepsilon_t.$$

Substituting this into the above equation we have

$$y_t \cong \left( 1 + \phi L + \phi^2 L^2 + \cdots + \phi^j L^j \right) (1 - \phi L) y_t$$

and the approximation becomes arbitrarily good as $j$ increases arbitrarily. Therefore, for $|\phi| < 1$, define

$$(1 - \phi L)^{-1} = \sum_{j=0}^{\infty} \phi^j L^j.$$

Recall that our mean zero AR(p) process can be written using the factorization

$$(1 - \lambda_1 L)(1 - \lambda_2 L) \cdots (1 - \lambda_p L) y_t = \varepsilon_t$$

where the $\lambda_i$ are the eigenvalues of $F$, and given stationarity, all the $|\lambda_i| < 1$. Therefore, we can invert each first order polynomial on the LHS to get

$$y_t = \left( \sum_{j=0}^{\infty} \lambda_1^j L^j \right) \left( \sum_{j=0}^{\infty} \lambda_2^j L^j \right) \cdots \left( \sum_{j=0}^{\infty} \lambda_p^j L^j \right) \varepsilon_t.$$

The RHS is a product of infinite-order polynomials in $L$, which can be represented as

$$y_t = \left( 1 + \psi_1 L + \psi_2 L^2 + \cdots \right) \varepsilon_t.$$

The coefficients $\psi_i$ are real-valued and absolutely summable. They are formed of products of powers of the $\lambda_i$, which are in turn functions of the $\phi_i$. The $\psi_i$ are real-valued because any complex-valued $\lambda_i$ always occur in conjugate pairs. This means that if $a + bi$ is an eigenvalue of $F$, then so is $a - bi$. In multiplication

$$(a + bi)(a - bi) = a^2 - abi + abi - b^2 i^2 = a^2 + b^2,$$

which is real-valued. This shows that an AR(p) process is representable as an infinite-order MA process.

Recall before that by recursive substitution, an AR(p) process can be written as

$$Y_{t+j} = C + FC + \cdots + F^j C + F^{j+1} Y_{t-1} + F^j E_t + F^{j-1} E_{t+1} + \cdots + F E_{t+j-1} + E_{t+j}.$$

If the process is mean zero, then everything with a $C$ drops out. Take this and lag it by $j$ periods to get

$$Y_t = F^{j+1} Y_{t-j-1} + F^j E_{t-j} + F^{j-1} E_{t-j+1} + \cdots + F E_{t-1} + E_t.$$

As $j \to \infty$, the lagged $Y$ on the RHS drops out. The $E_{t-s}$ are vectors of zeros except for their first element, so we see that the first equation here, in the limit, is just

$$y_t = \sum_{j=0}^{\infty} \left( F^j \right)_{1,1} \varepsilon_{t-j}.$$
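That the MA coefficients can be read off the powers of $F$ can be cross-checked against the product-of-geometric-series form: for an AR(2), the $(1,1)$ element of $F^j$ should match the convolution of the sequences $\{\lambda_1^j\}$ and $\{\lambda_2^j\}$. A NumPy sketch with illustrative coefficients:

```python
import numpy as np

F = np.array([[0.5, 0.3],
              [1.0, 0.0]])               # illustrative AR(2) companion matrix
lam = np.linalg.eigvals(F)               # both real and inside the unit circle here

J = 20
# psi_j from powers of the companion matrix
psi_F = np.array([np.linalg.matrix_power(F, j)[0, 0] for j in range(J)])
# psi_j from multiplying the inverted first-order polynomials:
# (sum_j lam1^j L^j)(sum_j lam2^j L^j), a convolution of geometric sequences
g1 = lam[0] ** np.arange(J)
g2 = lam[1] ** np.arange(J)
psi_prod = np.convolve(g1, g2)[:J]
```

The two routes give the same coefficients, with $\psi_0 = 1$ and $\psi_1 = \phi_1$ as the derivation implies.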

Thus the $\psi_j$ are simply the $(1,1)$ elements of the powers $F^j$, which makes explicit the relationship between the $\psi_i$ and the $\phi_i$ (and the $\lambda_i$ as well).

Moments of AR(p) process. The AR(p) process is

$$y_t = c + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \cdots + \phi_p y_{t-p} + \varepsilon_t.$$

Assuming stationarity, $E(y_t) = \mu$, $\forall t$, so

$$\mu = c + \phi_1 \mu + \phi_2 \mu + \cdots + \phi_p \mu$$

so

$$\mu = \frac{c}{1 - \phi_1 - \phi_2 - \cdots - \phi_p}$$

and

$$c = \mu - \phi_1 \mu - \cdots - \phi_p \mu,$$

so

$$y_t - \mu = \phi_1 (y_{t-1} - \mu) + \phi_2 (y_{t-2} - \mu) + \cdots + \phi_p (y_{t-p} - \mu) + \varepsilon_t.$$

With this, the second moments are easy to find: the variance is

$$\gamma_0 = \phi_1 \gamma_1 + \phi_2 \gamma_2 + \cdots + \phi_p \gamma_p + \sigma^2.$$

The autocovariances of orders $j \geq 1$ follow the rule

$$\gamma_j = E \left[ (y_t - \mu)(y_{t-j} - \mu) \right] = \phi_1 \gamma_{j-1} + \phi_2 \gamma_{j-2} + \cdots + \phi_p \gamma_{j-p}.$$
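The variance equation and the autocovariance rule form a small linear system (the Yule-Walker equations). For an AR(2) it can be solved directly, and higher-order autocovariances then follow from the recursion. A NumPy sketch (the parameter values are illustrative, not from the text):

```python
import numpy as np

phi1, phi2, sigma2 = 0.5, 0.3, 1.0          # illustrative AR(2) parameters

# Rearranged system:
#   gamma_0 - phi1*gamma_1 - phi2*gamma_2 = sigma2
#  -phi1*gamma_0 + (1 - phi2)*gamma_1     = 0
#  -phi2*gamma_0 - phi1*gamma_1 + gamma_2 = 0
A = np.array([[1.0,   -phi1,       -phi2],
              [-phi1,  1.0 - phi2,  0.0],
              [-phi2, -phi1,        1.0]])
b = np.array([sigma2, 0.0, 0.0])
gamma = np.linalg.solve(A, b)                # gamma_0, gamma_1, gamma_2

# autocovariances of order j > p follow recursively
g = list(gamma)
for j in range(3, 10):
    g.append(phi1 * g[j - 1] + phi2 * g[j - 2])
```

The solved values satisfy all three moment equations, and each further $\gamma_j$ is generated by the same rule as the AR coefficients themselves.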

Using the fact that $\gamma_{-j} = \gamma_j$, one can take the $p + 1$ equations for $j = 0, 1, \ldots, p$, which have $p + 1$ unknowns ($\sigma^2$, $\gamma_0$, $\gamma_1$, $\ldots$, $\gamma_p$), and solve for the unknowns. With these, the $\gamma_j$ for $j > p$ can be solved for recursively.

Invertibility of MA(q) process. An MA(q) can be written as

$$y_t - \mu = \left( 1 + \theta_1 L + \cdots + \theta_q L^q \right) \varepsilon_t.$$

As before, the polynomial on the RHS can be factored as

$$1 + \theta_1 L + \cdots + \theta_q L^q = (1 - \eta_1 L)(1 - \eta_2 L) \cdots (1 - \eta_q L)$$

and each of the $(1 - \eta_i L)$ can be inverted as long as $|\eta_i| < 1$. If this is the case, then we can write

$$\left( 1 + \theta_1 L + \cdots + \theta_q L^q \right)^{-1} (y_t - \mu) = \varepsilon_t$$

where $\left( 1 + \theta_1 L + \cdots + \theta_q L^q \right)^{-1}$ will be an infinite-order polynomial in $L$, so that

$$\sum_{j=0}^{\infty} -\delta_j L^j (y_{t-j} - \mu) = \varepsilon_t$$

with $\delta_0 = -1$; or

$$y_t - \mu = \delta_1 (y_{t-1} - \mu) + \delta_2 (y_{t-2} - \mu) + \cdots + \varepsilon_t.$$

So we see that an MA(q) has an infinite AR representation, as long as the $|\eta_i| < 1$, $i = 1, 2, \ldots, q$.

It turns out that one can always manipulate the parameters of an MA(q) process to find an invertible representation. For example, the two MA(1) processes

$$y_t - \mu = (1 + \theta L) \varepsilon_t$$

and

$$y_t^* - \mu = \left( 1 + \theta^{-1} L \right) \varepsilon_t^*$$

have exactly the same moments if

$$\sigma_{\varepsilon^*}^2 = \sigma_{\varepsilon}^2 \theta^2.$$

For example, we've seen that

$$\gamma_0 = \sigma^2 \left( 1 + \theta^2 \right).$$

Given the above relationships amongst the parameters,

$$\gamma_0^* = \sigma^2 \theta^2 \left( 1 + \theta^{-2} \right) = \sigma^2 \left( 1 + \theta^2 \right),$$

so the variances are the same. It turns out that all the autocovariances will be the same, as is easily checked. This means that the two MA processes are observationally equivalent. As before, it's impossible to distinguish between observationally equivalent processes on the basis of data.
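The observational equivalence is easy to verify for the first two moments. A NumPy sketch (the values of $\theta$ and $\sigma^2$ are illustrative):

```python
import numpy as np

def ma1_autocov(theta, sigma2):
    """gamma_0 and gamma_1 of y_t - mu = (1 + theta*L) eps_t with Var(eps_t) = sigma2."""
    return sigma2 * (1.0 + theta ** 2), sigma2 * theta

theta, sigma2 = 2.0, 1.0                         # |theta| > 1: not invertible
g0, g1 = ma1_autocov(theta, sigma2)
# the invertible twin: theta* = 1/theta, sigma2* = sigma2 * theta^2
g0_star, g1_star = ma1_autocov(1.0 / theta, sigma2 * theta ** 2)
```

Both parameterizations deliver identical $\gamma_0$ and $\gamma_1$, but only the second has its root inside the unit circle.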

It's important to find an invertible representation, since it's the only representation that allows one to represent $\varepsilon_t$ as a function of past $y$'s. The other representations express $\varepsilon_t$ as a function of future $y$'s.

Why is invertibility important? The most important reason is that it provides a justification for the use of parsimonious models. Since an AR(1) process has an MA($\infty$) representation, one can reverse the argument and note that at least some MA($\infty$) processes have an AR(1) representation. At the time of estimation, it's a lot easier to estimate the single AR(1) coefficient rather than the infinite number of coefficients associated with the MA representation. This is the reason that ARMA models are popular. Combining low-order AR and MA models can usually offer a satisfactory representation of univariate time series data with a reasonable number of parameters.

Stationarity and invertibility of ARMA models is similar to what we've seen for AR and MA models; we won't go into the details. Likewise, calculating moments is similar.

EXERCISE 61. Calculate the autocovariances of an ARMA(1,1) model:

$$(1 + \phi L) y_t = c + (1 + \theta L) \varepsilon_t$$

Bibliography

[1] Davidson, R. and J.G. MacKinnon (1993) Estimation and Inference in Econometrics, Oxford Univ. Press.
[2] Davidson, R. and J.G. MacKinnon (2004) Econometric Theory and Methods, Oxford Univ. Press.
[3] Gallant, A.R. (1985) Nonlinear Statistical Models, Wiley.
[4] Gallant, A.R. (1997) An Introduction to Econometric Theory, Princeton Univ. Press.
[5] Hamilton, J. (1994) Time Series Analysis, Princeton Univ. Press.
[6] Hayashi, F. (2000) Econometrics, Princeton Univ. Press.
[7] Wooldridge (2003) Introductory Econometrics, Thomson (undergraduate level, for supplementary use only).

Index

asymptotic equality
Chain rule
Cobb-Douglas model
convergence, almost sure
convergence, in distribution
convergence, in probability
convergence, pointwise
convergence, uniform
convergence, uniform almost sure
cross section
estimator, OLS
extremum estimator
least squares, ordinary
leverage
likelihood function
matrix, idempotent
matrix, projection
matrix, symmetric
observations, influential
outliers
own influence
parameter space
Product rule
R-squared, centered
R-squared, uncentered
