
Michael Creel
Department of Economics and Economic History
Universitat Autònoma de Barcelona
version 0.98, July 2011

Contents

1 About this document
   1.1 Prerequisites
   1.2 Contents
   1.3 Licenses
   1.4 Obtaining the materials
   1.5 An easy way to use LyX and Octave today
2 Introduction: Economic and econometric models
3 Ordinary Least Squares
   3.1 The Linear Model
   3.2 Estimation by least squares
   3.3 Geometric interpretation of least squares estimation
   3.4 Influential observations and outliers
   3.5 Goodness of fit
   3.6 The classical linear regression model
   3.7 Small sample statistical properties of the least squares estimator
   3.8 Example: The Nerlove model
   3.9 Exercises
4 Asymptotic properties of the least squares estimator
   4.1 Consistency
   4.2 Asymptotic normality
   4.3 Asymptotic efficiency
   4.4 Exercises
5 Restrictions and hypothesis tests
   5.1 Exact linear restrictions
   5.2 Testing
   5.3 The asymptotic equivalence of the LR, Wald and score tests
   5.4 Interpretation of test statistics
   5.5 Confidence intervals
   5.6 Bootstrapping
   5.7 Wald test for nonlinear restrictions: the delta method
   5.8 Example: the Nerlove data
   5.9 Exercises
6 Stochastic regressors
   6.1 Case 1
   6.2 Case 2
   6.3 Case 3
   6.4 When are the assumptions reasonable?
   6.5 Exercises
7 Data problems
   7.1 Collinearity
   7.2 Measurement error
   7.3 Missing observations
   7.4 Exercises
8 Functional form and nonnested tests
   8.1 Flexible functional forms
   8.2 Testing nonnested hypotheses
9 Generalized least squares
   9.1 Effects of nonspherical disturbances on the OLS estimator
   9.2 The GLS estimator
   9.3 Feasible GLS
   9.4 Heteroscedasticity
   9.5 Autocorrelation
   9.6 Exercises
10 Endogeneity and simultaneity
   10.1 Simultaneous equations
   10.2 Reduced form
   10.3 Bias and inconsistency of OLS estimation of a structural equation
   10.4 Note about the rest of this chapter
   10.5 Identification by exclusion restrictions
   10.6 2SLS
   10.7 Testing the overidentifying restrictions
   10.8 System methods of estimation
   10.9 Example: 2SLS and Klein's Model 1
11 Numeric optimization methods
   11.1 Search
   11.2 Derivative-based methods
   11.3 Simulated Annealing
   11.4 Examples of nonlinear optimization
   11.5 Numeric optimization: pitfalls
   11.6 Exercises
12 Asymptotic properties of extremum estimators
   12.1 Extremum estimators
   12.2 Existence
   12.3 Consistency
   12.4 Example: Consistency of Least Squares
   12.5 Example: Inconsistency of Misspecified Least Squares
   12.6 Example: Linearization of a nonlinear model
   12.7 Asymptotic Normality
   12.8 Example: Classical linear model
   12.9 Exercises
13 Maximum likelihood estimation
   13.1 The likelihood function
   13.2 Consistency of MLE
   13.3 The score function
   13.4 Asymptotic normality of MLE
   13.5 The information matrix equality
   13.6 The Cramér-Rao lower bound
   13.7 Likelihood ratio-type tests
   13.8 Example: Binary response models
   13.9 Examples
   13.10 Exercises
14 Generalized method of moments
   14.1 Motivation
   14.2 Definition of GMM estimator
   14.3 Consistency
   14.4 Asymptotic normality
   14.5 Choosing the weighting matrix
   14.6 Estimation of the variance-covariance matrix
   14.7 Estimation using conditional moments
   14.8 Estimation using dynamic moment conditions
   14.9 A specification test
   14.10 Example: Generalized instrumental variables estimator
   14.11 Nonlinear simultaneous equations
   14.12 Maximum likelihood
   14.13 Example: OLS as a GMM estimator - the Nerlove model again
   14.14 Example: The MEPS data
   14.15 Example: The Hausman Test
   14.16 Application: Nonlinear rational expectations
   14.17 Empirical example: a portfolio model
   14.18 Exercises
15 Introduction to panel data
   15.1 Generalities
   15.2 Static issues and panel data
   15.3 Estimation of the simple linear panel model
   15.4 Dynamic panel data
   15.5 Exercises
16 Quasi-ML
   16.1 Consistent Estimation of Variance Components
   16.2 Example: the MEPS Data
   16.3 Exercises
17 Nonlinear least squares (NLS)
   17.1 Introduction and definition
   17.2 Identification
   17.3 Consistency
   17.4 Asymptotic normality
   17.5 Example: The Poisson model for count data
   17.6 The Gauss-Newton algorithm
   17.7 Application: Limited dependent variables and sample selection
18 Nonparametric inference
   18.1 Possible pitfalls of parametric inference: estimation
   18.2 Possible pitfalls of parametric inference: hypothesis testing
   18.3 Estimation of regression functions
   18.4 Density function estimation
   18.5 Examples
   18.6 Exercises
19 Simulation-based estimation
   19.1 Motivation
   19.2 Simulated maximum likelihood (SML)
   19.3 Method of simulated moments (MSM)
   19.4 Efficient method of moments (EMM)
   19.5 Examples
   19.6 Exercises
20 Parallel programming for econometrics
   20.1 Example problems
21 Final project: econometric estimation of a RBC model
   21.1 Data
   21.2 An RBC Model
   21.3 A reduced form model
   21.4 Results (I): The score generator
   21.5 Solving the structural model
22 Introduction to Octave
   22.1 Getting started
   22.2 A short introduction
   22.3 If you're running a Linux installation...
23 Notation and Review
   23.1 Notation for differentiation of vectors and matrices
   23.2 Convergence modes
   23.3 Rates of convergence and asymptotic equality
24 Licenses
   24.1 The GPL
   24.2 Creative Commons
25 The attic
   25.1 Hurdle models
   25.2 Models for time series data

List of Figures

1.1 Octave
1.2 LyX
3.1 Typical data, Classical Model
3.2 Example OLS Fit
3.3 The fit in observation space
3.4 Detection of influential observations
3.5 Uncentered R²
3.6 Unbiasedness of OLS under classical assumptions
3.7 Biasedness of OLS when an assumption fails
3.8 Gauss-Markov Result: The OLS estimator
3.9 Gauss-Markov Result: The split sample estimator
5.1 Joint and Individual Confidence Regions
5.2 RTS as a function of firm size
7.1 s(β) when there is no collinearity
7.2 s(β) when there is collinearity
7.3 Collinearity: Monte Carlo results
7.4 ρ̂ − ρ with and without measurement error
7.5 Sample selection bias
9.1 Rejection frequency of 10% t-test, H0 is true
9.2 Motivation for GLS correction when there is HET
9.3 Residuals, Nerlove model, sorted by firm size
9.4 Residuals from time trend for CO2 data
9.5 Autocorrelation induced by misspecification
9.6 Efficiency of OLS and FGLS, AR1 errors
9.7 Durbin-Watson critical values
9.8 Dynamic model with MA(1) errors
9.9 Residuals of simple Nerlove model
9.10 OLS residuals, Klein consumption equation
11.1 Search method
11.2 Increasing directions of search
11.3 Newton iteration
11.4 Using Sage to get analytic derivatives
11.5 Dwarf mongooses
11.6 Life expectancy of mongooses, Weibull model
11.7 Life expectancy of mongooses, mixed Weibull model
11.8 A foggy mountain
14.1 Asymptotic Normality of GMM estimator, χ² example
14.2 Inefficient and Efficient GMM estimators, χ² data
14.3 GIV estimation results for ρ̂ − ρ, dynamic model with measurement error
14.4 OLS
14.5 IV
14.6 Incorrect rank and the Hausman test
18.1 True and simple approximating functions
18.2 True and approximating elasticities
18.3 True function and more flexible approximation
18.4 True elasticity and more flexible approximation
18.5 Negative binomial raw moments
18.6 Kernel fitted OBDV usage versus AGE
18.7 Dollar-Euro
18.8 Dollar-Yen
18.9 Kernel regression fitted conditional second moments, Yen/Dollar and Euro/Dollar
20.1 Speedups from parallelization
21.1 Consumption and Investment, Levels
21.2 Consumption and Investment, Growth Rates
21.3 Consumption and Investment, Bandpass Filtered
22.1 Running an Octave program

List of Tables

15.1 Dynamic panel data model. Bias. Source for ML and II is Gouriéroux, Phillips and Yu, 2010, Table 2. SBIL, SMIL and II are exactly identified, using the ML auxiliary statistic. SBIL(OI) and SMIL(OI) are overidentified, using both the naive and ML auxiliary statistics.
15.2 Dynamic panel data model. RMSE. Source for ML and II is Gouriéroux, Phillips and Yu, 2010, Table 2. SBIL, SMIL and II are exactly identified, using the ML auxiliary statistic. SBIL(OI) and SMIL(OI) are overidentified, using both the naive and ML auxiliary statistics.
16.1 Marginal Variances, Sample and Estimated (Poisson)
16.2 Marginal Variances, Sample and Estimated (NB-II)
16.3 Information Criteria, OBDV
25.1 Actual and Poisson fitted frequencies
25.2 Actual and Hurdle Poisson fitted frequencies

Chapter 1
About this document

1.1 Prerequisites

These notes have been prepared under the assumption that the reader understands basic statistics, linear algebra, and mathematical optimization. There are many sources for this material; one is the set of appendices to Introductory Econometrics: A Modern Approach by Jeffrey Wooldridge. It is the student's responsibility to get up to speed on this material; it will not be covered in class.

This document integrates lecture notes for a one-year graduate level course with computer programs that illustrate and apply the methods that are studied. The immediate availability of executable (and modifiable) example programs when using the PDF version of the document is a distinguishing feature of these notes. If printed, the document is a somewhat terse approximation to a textbook. These notes are not intended to be a perfect substitute for a printed textbook. If you are a student of mine, please note that last sentence carefully. There are many good textbooks available. Students taking my courses should read the appropriate sections from at least one of the following books (or other textbooks with similar level and content):

• Cameron, A.C. and Trivedi, P.K., Microeconometrics - Methods and Applications
• Davidson, R. and MacKinnon, J.G., Econometric Theory and Methods
• Gallant, A.R., An Introduction to Econometric Theory
• Hamilton, J.D., Time Series Analysis
• Hayashi, F., Econometrics

A more introductory-level reference is Introductory Econometrics: A Modern Approach by Jeffrey Wooldridge.

1.2 Contents

With respect to contents, the emphasis is on estimation and inference within the world of stationary data, with a bias toward microeconometrics. If you take a moment to read the licensing information in the next section, you'll see that you are free to copy and modify the document. If anyone would like to contribute material that expands the contents, it would be very welcome. Error corrections and other additions are also welcome.

The integrated examples (they are on-line here and the support files are here) are an important part of these notes. GNU Octave (www.octave.org) has been used for most of the example programs, which are scattered through the document. This choice is motivated by several factors. The first is the high quality of the Octave environment for doing applied econometrics. Octave is similar to the commercial package Matlab (Matlab is a trademark of The Mathworks, Inc.), and will run pure Matlab scripts without modification. If a Matlab script calls an extension, such as a toolbox function, then it is necessary to make a similar extension available to Octave. The examples discussed in this document call a number of functions, such as a BFGS minimizer, a program for ML estimation, etc. All of this code is provided with the examples, as well as on the PelicanHPC live CD image. The fundamental tools (manipulation of matrices, statistical functions, minimization, etc.) exist and are implemented in a way that makes extending them fairly easy. Second, an advantage of free software is that you don't have to pay for it. This can be an important consideration if you are at a university with a tight budget or if you need to run many copies, as can be the case if you do parallel computing (discussed in Chapter 20). Third, Octave runs on GNU/Linux, Windows and MacOS. Figure 1.1 shows a sample GNU/Linux work environment, with an Octave script being edited, and the results visible in an embedded shell window. As of 2011, some examples are being added using Gretl, the Gnu Regression, Econometrics, and Time-Series Library. This is an easy to use program, available in a number of languages, and it comes with a lot of data ready to use.

The main document was prepared using LyX (www.lyx.org). LyX is a free "what you see is what you mean" word processor, basically working as a graphical frontend to LaTeX. ("Free" is used here in the sense of "freedom", but LyX is also free of charge, free as in "free beer".) It (with help from other applications) can export your work in LaTeX, HTML, PDF and several other forms. It will run on Linux, Windows, and MacOS systems. Figure 1.2 shows LyX editing this document.

Figure 1.1: Octave

1.3 Licenses

All materials are copyrighted by Michael Creel with the date that appears above. They are provided under the terms of the GNU General Public License, ver. 2, which forms Section 24.1 of the notes, or, at your option, under the Creative Commons Attribution-Share Alike 2.5 license, which forms Section 24.2 of the notes. The main thing you need to know is that you are free to modify and distribute these materials in any way you like, as long as you share your contributions in the same way the materials are made available to you. In particular, you must make available the source files, in editable form, for your modified version of the materials.

Figure 1.2: LyX

1.4 Obtaining the materials

The materials are available on my web page. In addition to the final product, which you're probably looking at in some form now, you can obtain the editable LyX sources, which will allow you to create your own version, if you like, or to send error corrections and contributions.

1.5 An easy way to use LyX and Octave today

The example programs are available as links to files on my web page in the PDF version, and here. Support files needed to run these are available here. The files won't run properly from your browser, since there are dependencies between files - they are only illustrative when browsing. To see how to use these files (edit and run them), you should go to the home page of this document, since you will probably want to download the PDF version together with all the support files and examples. Then set the base URL of the PDF file to point to wherever the Octave files are installed, and install Octave and the support files. All of this may sound a bit complicated, because it is.

An easier solution is available: the PelicanHPC distribution of Linux is an ISO image file that may be burnt to CDROM. It contains a bootable-from-CD GNU/Linux system. These notes, in source form and as a PDF, together with all of the examples and the software needed to run them, are available on PelicanHPC. PelicanHPC is a "live CD" image: you can burn it to a CD and use it to boot your computer, and when you shut down and reboot, you will return to your normal operating system. The need to reboot to use PelicanHPC can be somewhat inconvenient. It is also possible to use PelicanHPC while running your normal operating system, by using a virtualization platform such as VirtualBox (free software, GPL v2). There are a number of similar products available. It is possible to run PelicanHPC as a virtual machine, and to communicate with the installed operating system using a private network. Learning how to do this is not too difficult, and it is very convenient.

The reason why these notes are integrated into a Linux distribution for parallel computing will be apparent if you get to Chapter 20. That, and the fact that it works very well, is the reason it is recommended here. If you don't get that far or you're not interested in parallel computing, please just ignore the stuff on the CD that's not related to econometrics. If you happen to be interested in parallel computing but not econometrics, just skip ahead to Chapter 20.

Chapter 2
Introduction: Economic and econometric models

Here's some data: 100 observations on 3 economic variables. Let's do some exploratory analysis using Gretl:

• histograms
• correlations
• x-y scatterplots

So, what can we say? Correlations? Yes. Causality? Who knows? This is economic data, generated by economic agents following their own beliefs, technologies and preferences. It is not experimental data generated under controlled conditions. How can we determine causality if we don't have experimental data?

Without a model, we can't distinguish correlation from causality. It turns out that the variables we're looking at are QUANTITY (q), PRICE (p), and INCOME (m). Economic theory tells us that the quantity of a good that consumers will purchase (the demand function) is something like:

q = f(p, m, z)

• q is the quantity demanded
• p is the price of the good
• m is income
• z is a vector of other variables that may affect demand

The supply of the good to the market is the aggregation of the firms' supply functions. The market supply function is something like

q = g(p, z)

Suppose we have a sample consisting of a number of observations on q, p and m at different time periods t = 1, 2, ..., n. Supply and demand in each period is

qt = f(pt, mt, zt)
qt = g(pt, zt)

(draw some graphs showing the roles of m and z). This is the basic economic model of supply and demand: q and p are determined in the market equilibrium, given by the intersection of the two curves. These two variables are determined jointly by the model, and are the endogenous variables. Income (m) is not determined by this model; its value is determined independently of q and p by some other process. m is an exogenous variable. So, m causes q, through the demand function, and m also causes p. p and q do not cause m, according to this theoretical model. q and p have a joint causal relationship.

• Economic theory can help us to determine the causality relationships between correlated variables. This is the reason we need a theoretical model to help us distinguish correlation and causality.
• If we had experimental data, we could control certain variables and observe the outcomes for other variables. If we see that variable x changes as the controlled value of variable y is changed, then we know that y causes x. With economic data, we are unable to control the values of the variables: for example, in supply and demand, if price changes, then quantity changes, but quantity also affects price. We can't control the market price, because the market price changes as quantity adjusts.

The model is essentially a theoretical construct up to now:

• We don't know the forms of the functions f and g.

• Some components of zt may not be observable. For example, people don't eat the same lunch every day, and you can't tell what they will order just by looking at them. There are unobservable components to supply and demand, and we can model them as random variables. Suppose we can break zt into two unobservable components εt1 and εt2.

An econometric model attempts to quantify the relationship more precisely. A step toward an estimable econometric model is to suppose that the model may be written as

qt = α1 + α2 pt + α3 mt + εt1
qt = β1 + β2 pt + εt2

We have imposed a number of restrictions on the theoretical model:

• The functions f and g have been specified to be linear functions.
• The parameters (α1, β2, etc.) are constant over time.
• There is a single unobservable component in each equation, and we assume it is additive.

If we assume nothing about the error terms εt1 and εt2, we can always write the last two equations, as the errors simply make up the difference between the true demand and supply functions and the assumed forms. But in order for the coefficients to exist in a sense that has economic meaning, and in order to be able to use sample data to make reliable inferences about their values, we need to make additional assumptions. Such assumptions might be something like:

• E(εtj) = 0, j = 1, 2
• E(pt εtj) = 0, j = 1, 2
• E(mt εtj) = 0, j = 1, 2

These are assertions that the errors are uncorrelated with the variables, and such assertions may or may not be reasonable. Later we will see how such assumptions may be used and/or tested.

All of the last six bulleted points have no theoretical basis, in that the theory of supply and demand doesn't imply these conditions. The validity of any results we obtain using this model will be contingent on these additional restrictions being at least approximately correct. For this reason, specification testing will be needed, to check that the model seems to be reasonable. Only when we are convinced that the model is at least approximately correct should we use it for economic analysis.

When testing a hypothesis using an econometric model, at least three factors can cause a statistical test to reject the null hypothesis:

1. the hypothesis is false
2. a type I error has occurred
3. the econometric model is not correctly specified, and thus the test does not have the assumed distribution

To be able to make scientific progress, we would like to ensure that the third reason is not contributing in a major way to rejections, so that rejection will be most likely due to either the first or second reasons. Hopefully the above example makes it clear that econometric models are necessarily more detailed than what we can obtain from economic theory, and that this additional detail introduces many possible sources of misspecification of econometric models. In the next few sections we will obtain results supposing that the econometric model is entirely correctly specified. Later we will examine the consequences of misspecification and see some methods for determining if a model is correctly specified. Later on, econometric methods that seek to minimize maintained assumptions are introduced.

Chapter 3
Ordinary Least Squares

3.1 The Linear Model

Consider approximating a variable y using the variables x1, x2, ..., xk. We can consider a model that is a linear approximation:

Linearity: the model is a linear function of the parameter vector β0:

y = β1⁰ x1 + β2⁰ x2 + ... + βk⁰ xk + ε

or, using vector notation:

y = x'β0 + ε

The dependent variable y is a scalar random variable, x = (x1 x2 ... xk)' is a k-vector of explanatory variables, and β0 = (β1⁰ β2⁰ ... βk⁰)'. The superscript "0" in β0 means this is the "true value" of the unknown parameter. It will be defined more precisely later, and is usually suppressed when it's not necessary for clarity.

Suppose that we want to use data to try to determine the best linear approximation to y using the variables x. The data {(yt, xt)}, t = 1, 2, ..., n are obtained by some form of sampling (for example, cross-sectional data may be obtained by random sampling; time series data accumulate historically). An individual observation is

yt = xt β + εt

The n observations can be written in matrix form as

y = Xβ + ε,    (3.1)

where y = (y1 y2 ... yn)' is n × 1 and X = (x1 x2 ... xn)'.

Linear models are more general than they might first appear, since one can employ nonlinear transformations of the variables:

ϕ0(z) = [ϕ1(w) ϕ2(w) ... ϕp(w)] β + ε

where the ϕi() are known functions. Defining y = ϕ0(z), x1 = ϕ1(w), etc. leads to a model in the form of equation 3.4. For example, the Cobb-Douglas model

z = A w2^β2 w3^β3 exp(ε)

can be transformed logarithmically to obtain

ln z = ln A + β2 ln w2 + β3 ln w3 + ε.

If we define y = ln z, β1 = ln A, etc., we can put the model in the form needed. The approximation is linear in the parameters, but not necessarily linear in the variables.

3.2 Estimation by least squares

Figure 3.1, obtained by running TypicalData.m, shows some data that follow the linear model yt = β1 + β2 xt2 + εt, where εt is a random error that has mean zero and is independent of xt2. The green line is the "true" regression line β1 + β2 xt2, and the red crosses are the data points (xt2, yt). Exactly how the green line is defined will become clear later. In practice, we only have the data, and we don't know where the green line lies. We need to gain information about the straight line that best fits the data points.

Figure 3.1: Typical data, Classical Model

The ordinary least squares (OLS) estimator is defined as the value that minimizes the sum of the squared errors:

β̂ = arg min s(β)

where

s(β) = Σt=1..n (yt − xt β)²    (3.2)
     = (y − Xβ)'(y − Xβ)
     = y'y − 2y'Xβ + β'X'Xβ
     = ‖y − Xβ‖²

This last expression makes it clear how the OLS estimator is defined: it minimizes the Euclidean distance between y and Xβ. The fitted OLS coefficients are those that give the best linear approximation to y using x as basis functions, where "best" means minimum Euclidean distance. One could think of other estimators based upon other metrics. For example, the minimum absolute distance (MAD) estimator minimizes Σt=1..n |yt − xt β|. Later, we will see that which estimator is best in terms of its statistical properties, rather than in terms of the metric that defines it, depends upon the properties of ε, about which we have as yet made no assumptions.

• To minimize the criterion s(β), find the derivative with respect to β:

Dβ s(β) = −2X'y + 2X'Xβ

Then setting it to zero gives

Dβ s(β̂) = −2X'y + 2X'Xβ̂ ≡ 0

so

β̂ = (X'X)⁻¹X'y.

• To verify that this is a minimum, check the second order sufficient condition:

Dβ² s(β̂) = 2X'X

Since ρ(X) = K, this matrix is positive definite, since it's a quadratic form in a p.d. matrix (the identity matrix of order n), so β̂ is in fact a minimizer.

• The fitted values are the vector ŷ = Xβ̂.
• The residuals are the vector ε̂ = y − Xβ̂.
• Note that

y = Xβ + ε = Xβ̂ + ε̂

• Also, the first order conditions can be written as

X'y − X'Xβ̂ = 0
X'(y − Xβ̂) = 0
X'ε̂ = 0
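As a concrete illustration (this is only a minimal sketch, not one of the example programs distributed with these notes; the data are simulated and the parameter values are arbitrary choices of mine), the following Octave lines compute β̂ from the normal equations and verify the orthogonality condition X'ε̂ = 0:

    # simulate data from a linear model and compute the OLS estimator
    n = 100;
    X = [ones(n,1) randn(n,2)];       # regressors, including a constant
    beta = [1; 2; -1];                # "true" coefficients (arbitrary)
    y = X*beta + randn(n,1);          # dependent variable

    betahat = (X'*X)\(X'*y);          # OLS: solves the normal equations
    yhat = X*betahat;                 # fitted values
    ehat = y - yhat;                  # residuals

    disp(betahat');                   # estimates should be near the true values
    disp((X'*ehat)');                 # orthogonality: numerically zero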

which is to say, the OLS residuals are orthogonal to X. Let's look at this more carefully.

3.3 Geometric interpretation of least squares estimation

In X, Y Space

Figure 3.2 shows a typical fit to data, along with the true regression line. Note that the true line and the estimated line are different. This figure was created by running the Octave program OlsFit.m. You can experiment with changing the parameter values to see how this affects the fit, and to see how the fitted line will sometimes be close to the true line, and sometimes rather far away.

Figure 3.2: Example OLS Fit

In Observation Space

If we want to plot in observation space, we'll need to use only two or three observations, or we'll encounter some limitations of the blackboard. If we try to use 3, we'll encounter the limits of my artistic ability, so let's use two. With only two observations, we can't have K > 1.

• We can decompose y into two components: the orthogonal projection onto the K-dimensional space spanned by X, Xβ̂, and the component that is the orthogonal projection onto the n − K dimensional subspace that is orthogonal to the span of X, ε̂.
• Since β̂ is chosen to make ε̂ as short as possible, ε̂ will be orthogonal to the space spanned by X. Since X is in this space, X'ε̂ = 0. Note that the f.o.c. that define the least squares estimator imply that this is so.

Figure 3.3: The fit in observation space

Projection Matrices

Xβ̂ is the projection of y onto the span of X, or

Xβ̂ = X(X'X)⁻¹X'y

Therefore, the matrix that projects y onto the span of X is

PX = X(X'X)⁻¹X'

since

Xβ̂ = PX y.

ε̂ is the projection of y onto the n − K dimensional space that is orthogonal to the span of X. We have that

ε̂ = y − Xβ̂ = y − X(X'X)⁻¹X'y = [In − X(X'X)⁻¹X'] y.

So the matrix that projects y onto the space orthogonal to the span of X is

MX = In − X(X'X)⁻¹X' = In − PX.

We have

ε̂ = MX y.

Therefore

y = PX y + MX y = Xβ̂ + ε̂.

These two projection matrices decompose the n dimensional vector y into two orthogonal components - the portion that lies in the K dimensional space defined by X, and the portion that lies in the orthogonal n − K dimensional space.

• Note that both PX and MX are symmetric and idempotent.
  - A symmetric matrix A is one such that A = A'.
  - An idempotent matrix A is one such that A = AA.
  - The only nonsingular idempotent matrix is the identity matrix.
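A quick numerical check of these properties can be done in a few lines of Octave (a sketch only, using simulated data; for a sample this small, forming PX explicitly is fine, though one would not do so for large n):

    # projection onto the span of X and onto its orthogonal complement
    n = 50;
    X = [ones(n,1) randn(n,2)];
    y = X*[1; 2; -1] + randn(n,1);

    P = X*inv(X'*X)*X';               # projects onto the span of X
    M = eye(n) - P;                   # projects onto the orthogonal complement

    disp(norm(P - P'));               # symmetry: zero up to rounding
    disp(norm(P - P*P));              # idempotency: zero up to rounding
    disp(norm(y - (P*y + M*y)));      # decomposition y = P*y + M*y
    disp(trace(P));                   # equals K, the number of regressors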

3.4 Influential observations and outliers

The OLS estimator of the ith element of the vector β0 is simply

β̂i = [(X'X)⁻¹X']i· y = ci' y

This is how we define a linear estimator - it's a linear function of the dependent variable, where the weights are determined by the observations on the regressors. Since it's a linear combination of the observations on the dependent variable, some observations may have more influence than others.

To investigate this, let et be an n-vector of zeros with a 1 in the tth position, i.e., it's the tth column of the matrix In. Define

ht = (PX)tt = et' PX et

so ht is the tth element on the main diagonal of PX. Note that

ht = ‖PX et‖²

so

ht ≤ ‖et‖² = 1

So 0 < ht < 1. Also, TrPX = K ⇒ h̄ = K/n, so the average of the ht is K/n. The value ht is referred to as the leverage of the observation. If the leverage is much higher than average, the observation has the potential to affect the OLS fit importantly. However, an observation may also be influential due to the value of yt, rather than the weight it is multiplied by, which only depends on the xt's.

To account for this, consider estimation of β without using the tth observation (designate this estimator as β̂(t)). One can show (see Davidson and MacKinnon, pp. 32-5 for proof) that

β̂(t) = β̂ − [1/(1 − ht)] (X'X)⁻¹xt' ε̂t

so the change in the tth observation's fitted value is

xt β̂ − xt β̂(t) = [ht/(1 − ht)] ε̂t

While an observation may be influential if it doesn't affect its own fitted value, it certainly is influential if it does. A fast means of identifying influential observations is to plot [ht/(1 − ht)] ε̂t (which I will refer to as the own influence of the observation) as a function of t. Figure 3.4 gives an example plot of data, fit, leverage and influence. The Octave program is InfluentialObservation.m. (Note to self when lecturing: load the data ./OLS/influencedata into Gretl and reproduce this.) If you re-run the program you will see that the leverage of the last observation (an outlying value of x) is always high, and the influence is sometimes high.

Figure 3.4: Detection of influential observations

After influential observations are detected, one needs to determine why they are influential. Possible causes include:

• data entry error, which can easily be corrected once detected. Data entry errors are very common.
• special economic factors that affect some observations. These would need to be identified and incorporated in the model. This is the idea behind structural change: the parameters may not be constant across all observations.
• pure randomness may have caused us to sample a low-probability observation.

There exist robust estimation methods that downweight outliers.
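The distributed program InfluentialObservation.m produces Figure 3.4; the following bare-bones sketch (simulated data, my own variable names) computes the same leverage and own-influence measures for a sample in which the last x value is an outlier:

    # leverage and "own influence" of each observation
    n = 25;
    x = [randn(n-1,1); 10];           # last observation has an outlying x value
    X = [ones(n,1) x];
    y = X*[1; 0.5] + randn(n,1);
    betahat = X\y;
    ehat = y - X*betahat;
    h = diag(X*inv(X'*X)*X');         # leverage: diagonal elements of P_X
    own = (h./(1 - h)).*ehat;         # change in own fitted value if obs. dropped
    printf("average leverage %f (K/n = %f)\n", mean(h), columns(X)/n);
    disp([h own]);                    # the last row shows the high leverage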

3.5 Goodness of fit

The fitted model is

y = Xβ̂ + ε̂

Take the inner product:

y'y = β̂'X'Xβ̂ + 2β̂'X'ε̂ + ε̂'ε̂

But the middle term of the RHS is zero since X'ε̂ = 0, so

y'y = β̂'X'Xβ̂ + ε̂'ε̂    (3.3)

The uncentered Ru² is defined as

Ru² = 1 − ε̂'ε̂ / y'y = β̂'X'Xβ̂ / y'y = ‖PX y‖² / ‖y‖² = cos²(φ),

where φ is the angle between y and the span of X.

• The uncentered R² changes if we add a constant to y, since this changes φ (see Figure 3.5 - the yellow vector is a constant, since it's on the 45 degree line in observation space).

Figure 3.5: Uncentered R²

Another, more common definition measures the contribution of the variables, other than the constant term, to explaining the variation in y. Thus it measures the ability of the model to explain the variation of y about its unconditional sample mean.

Let ι = (1, 1, ..., 1)', an n-vector. So

Mι = In − ι(ι'ι)⁻¹ι' = In − ιι'/n

Mι y just returns the vector of deviations from the mean. In terms of deviations from the mean, equation 3.3 becomes

y'Mι y = β̂'X'Mι Xβ̂ + ε̂'Mι ε̂

The centered Rc² is defined as

Rc² = 1 − ε̂'ε̂ / (y'Mι y) = 1 − ESS/TSS

where ESS = ε̂'ε̂ and TSS = y'Mι y = Σt (yt − ȳ)².

Supposing that X contains a column of ones (i.e., there is a constant term),

X'ε̂ = 0 ⇒ Σt ε̂t = 0

so Mι ε̂ = ε̂. In this case

y'Mι y = β̂'X'Mι Xβ̂ + ε̂'ε̂

So

Rc² = RSS/TSS

where RSS = β̂'X'Mι Xβ̂.

• Supposing that a column of ones is in the space spanned by X (PX ι = ι), then one can show that 0 ≤ Rc² ≤ 1.

3.6 The classical linear regression model

Up to this point the model is empty of content beyond the definition of a best linear approximation to y and some geometrical properties. There is no economic content to the model, and the regression parameters have no economic interpretation. For example, what is the partial derivative of y with respect to xj? The linear approximation is

y = β1 x1 + β2 x2 + ... + βk xk + ε

The partial derivative is

∂y/∂xj = βj + ∂ε/∂xj

Up to now, there's no guarantee that ∂ε/∂xj = 0. For the β to have an economic meaning, we need to make additional assumptions. The assumptions that are appropriate to make depend on the data under consideration. We'll start with the classical linear regression model, which incorporates some assumptions that are clearly not realistic for economic data. This is to be able to explain some concepts with a minimum of confusion and notational clutter. Later we'll adapt the results to what we can get with more realistic assumptions.

Linearity: the model is a linear function of the parameter vector β0:

y = β1⁰ x1 + β2⁰ x2 + ... + βk⁰ xk + ε    (3.4)

or, using vector notation: y = x'β0 + ε.

Nonstochastic linearly independent regressors: X is a fixed matrix of constants, it has rank K equal to its number of columns, and

lim (1/n) X'X = QX    (3.5)

where QX is a finite positive definite matrix. This is needed to be able to identify the individual effects of the explanatory variables.

Independently and identically distributed errors:

ε ∼ IID(0, σ²In)    (3.6)

ε is jointly distributed IID. This implies the following two properties:

Homoscedastic errors: V(εt) = σ0², ∀t    (3.7)

Nonautocorrelated errors: E(εt εs) = 0, ∀t ≠ s    (3.8)

Optionally, we will sometimes assume that the errors are normally distributed:

Normally distributed errors: ε ∼ N(0, σ²In)    (3.9)

3.7 Small sample statistical properties of the least squares estimator

Up to now, we have only examined numeric properties of the OLS estimator, properties that always hold. Now we will examine statistical properties. The statistical properties depend upon the assumptions we make.

Unbiasedness

We have β̂ = (X'X)⁻¹X'y. By linearity,

β̂ = (X'X)⁻¹X'(Xβ + ε) = β + (X'X)⁻¹X'ε

By assumptions 3.5 and 3.6,

E[(X'X)⁻¹X'ε] = (X'X)⁻¹X'E(ε) = 0

so the OLS estimator is unbiased under the assumptions of the classical model.

Figure 3.6 shows the results of a small Monte Carlo experiment where the OLS estimator was calculated for 10000 samples from the classical model with y = 1 + 2x + ε, where n = 20, σε² = 9, and x is fixed across samples. We can see that β2 appears to be estimated without bias. The program that generates the plot is Unbiased.m, if you would like to experiment with this.

Figure 3.6: Unbiasedness of OLS under classical assumptions

With time series data, the OLS estimator will often be biased. Figure 3.7 shows the results of a small Monte Carlo experiment where the OLS estimator was calculated for 1000 samples from the AR(1) model yt = 0 + 0.9 yt−1 + εt, where n = 20 and σε² = 1. In this case, assumption 3.5 does not hold: the regressors are stochastic. We can see that the bias in the estimation of β2 is about -0.2. The program that generates the plot is Biased.m, if you would like to experiment with this.

Figure 3.7: Biasedness of OLS when an assumption fails

Normality

With the linearity assumption, we have β̂ = β + (X'X)⁻¹X'ε. This is a linear function of ε. Adding the assumption of normality (3.9, which implies strong exogeneity), then

β̂ ∼ N(β, (X'X)⁻¹σ0²)

since a linear function of a normal random vector is also normally distributed. In Figure 3.6 you can see that the estimator appears to be normally distributed. It in fact is normally distributed, since the DGP (see the Octave program) has normal errors. Even when the data may be taken to be IID, the assumption of normality is often questionable or simply untenable. Many variables in economics can take on only nonnegative values, which, strictly speaking, rules out normality. For example, if the dependent variable is the number of automobile trips per week, it is a count variable with a discrete distribution, and is thus not normally distributed. (Normality may be a good model nonetheless, as long as the probability of a negative value occurring is negligible under the model. This depends upon the mean being large enough in relation to the variance.)
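The distributed program Unbiased.m generates Figure 3.6; a stripped-down sketch of the same kind of experiment (my own simplified code, not the distributed program) is:

    # Monte Carlo check of unbiasedness under the classical assumptions
    n = 20;  R = 10000;
    x = rand(n,1)*20;                 # fixed regressor, held constant across samples
    X = [ones(n,1) x];
    b2 = zeros(R,1);
    for r = 1:R
      y = 1 + 2*x + 3*randn(n,1);     # true model: beta = (1,2)', sigma_eps = 3
      betahat = X\y;
      b2(r) = betahat(2);
    end
    printf("mean of estimates of beta2: %f (true value 2)\n", mean(b2));
    hist(b2 - 2, 30);                 # centered histogram, as in Figure 3.6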

The variance of the OLS estimator and the Gauss-Markov theorem

Now let's make all the classical assumptions except the assumption of normality. We have β̂ = β + (X'X)⁻¹X'ε and we know that E(β̂) = β. So

Var(β̂) = E[(β̂ − β)(β̂ − β)'] = E[(X'X)⁻¹X'εε'X(X'X)⁻¹] = (X'X)⁻¹σ0²

The OLS estimator is a linear estimator, which means that it is a linear function of the dependent variable, y:

β̂ = (X'X)⁻¹X'y = Cy

where C is a function of the explanatory variables only, not the dependent variable. It is also unbiased under the present assumptions, as we proved above. One could consider other weights W that are a function of X that define some other linear estimator. We'll still insist upon unbiasedness. Consider β̃ = Wy, where W = W(X) is some k × n matrix function of X. Note that since W is a function of X, it is nonstochastic, too. If the estimator is unbiased, then we must have WX = IK:

E(Wy) = E(WXβ0 + Wε) = WXβ0 = β0 ⇒ WX = IK

The variance of β̃ is

V(β̃) = WW'σ0².

Define D = W − (X'X)⁻¹X', so W = D + (X'X)⁻¹X'. Since WX = IK, DX = 0, so

V(β̃) = [D + (X'X)⁻¹X'][D + (X'X)⁻¹X']' σ0² = DD'σ0² + (X'X)⁻¹σ0²

So

V(β̃) ≥ V(β̂)

The inequality is a shorthand means of expressing, more formally, that V(β̃) − V(β̂) is a positive semi-definite matrix. This is a proof of the Gauss-Markov Theorem: the OLS estimator is the "best linear unbiased estimator" (BLUE).

• It is worth emphasizing again that we have not used the normality assumption in any way to prove the Gauss-Markov theorem, so it is valid if the errors are not normally distributed, as long as the other assumptions hold.

To illustrate the Gauss-Markov result, consider the estimator that results from splitting the sample into p equally-sized parts, estimating using each part of the data separately by OLS, then averaging the p resulting estimators. You should be able to show that this estimator is unbiased, but inefficient with respect to the OLS estimator. The program Efficiency.m illustrates this using a small Monte Carlo experiment, which compares the OLS estimator and a 3-way split sample estimator. The data generating process follows the classical model, with n = 21. The true parameter value is β = 2. In Figures 3.8 and 3.9 we can see that the OLS estimator is more efficient, since the tails of its histogram are more narrow.

Figure 3.8: Gauss-Markov Result: The OLS estimator

Figure 3.9: Gauss-Markov Result: The split sample estimator

We have that E(β̂) = β and Var(β̂) = (X'X)⁻¹σ0², but we still need to estimate the variance of ε, σ0², in order to have an idea of the precision of the estimates of β. A commonly used estimator of σ0² is

σ̂0² = ε̂'ε̂ / (n − K)

This estimator is unbiased:

E(σ̂0²) = E[ ε̂'ε̂ / (n − K) ]
        = E[ ε'Mε / (n − K) ]
        = E[ Tr(ε'Mε) ] / (n − K)
        = E[ Tr(Mεε') ] / (n − K)
        = Tr[ E(Mεε') ] / (n − K)
        = σ0² TrM / (n − K)
        = σ0² (n − K) / (n − K)
        = σ0²

where we use the fact that Tr(AB) = Tr(BA) when both products are conformable. Thus, this estimator is also unbiased under these assumptions.
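A two-line check in Octave (a sketch with simulated data; the true σ0² is a value I chose for the example) confirms that the degrees-of-freedom corrected estimator is on target:

    # the degrees-of-freedom corrected estimator of the error variance
    n = 100;  K = 3;
    X = [ones(n,1) randn(n,K-1)];
    sigma2 = 4;
    y = X*ones(K,1) + sqrt(sigma2)*randn(n,1);

    ehat = y - X*(X\y);
    sigma2hat = (ehat'*ehat)/(n - K);   # unbiased under the classical assumptions
    printf("sigma2hat = %f (true value %f)\n", sigma2hat, sigma2);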

3.8 Example: The Nerlove model

Theoretical background

For a firm that takes input prices w and the output level q as given, the cost minimization problem is to choose the quantities of inputs x to solve the problem

min_x w'x

subject to the restriction

f(x) = q.

The solution is the vector of factor demands x(w, q). The cost function is obtained by substituting the factor demands into the criterion function: C(w, q) = w'x(w, q).

• Monotonicity Increasing factor prices cannot decrease cost, so

∂C(w, q)/∂w ≥ 0

Remember that these derivatives give the conditional factor demands (Shephard's Lemma).

• Homogeneity The cost function is homogeneous of degree 1 in input prices: C(tw, q) = tC(w, q), where t is a scalar constant. This is because the factor demands are homogeneous of degree zero in factor prices - they only depend upon relative prices.

• Returns to scale The returns to scale parameter γ is defined as the inverse of the elasticity of cost with respect to output:

γ = [ (∂C(w, q)/∂q) (q/C(w, q)) ]⁻¹

Constant returns to scale is the case where increasing production q implies that cost increases in the proportion 1:1. If this is the case, then γ = 1.

Cobb-Douglas functional form

The Cobb-Douglas functional form is linear in the logarithms of the regressors and the dependent variable. For a cost function, if there are g factors, the Cobb-Douglas cost function has the form

C = A w1^β1 ... wg^βg q^βq e^ε

What is the elasticity of C with respect to wj?

eC_wj = (∂C/∂wj)(wj/C)
      = βj A w1^β1 ... wj^(βj − 1) ... wg^βg q^βq e^ε · wj / (A w1^β1 ... wg^βg q^βq e^ε)
      = βj

This is one of the reasons the Cobb-Douglas form is popular - the coefficients are easy to interpret, since they are the elasticities of the dependent variable with respect to the explanatory variable. Note that in this case,

eC_wj = (∂C/∂wj)(wj/C) = xj(w, q) wj / C ≡ sj(w, q),

the cost share of the jth input. So with a Cobb-Douglas cost function, βj = sj(w, q). The cost shares are constants.

Note that after a logarithmic transformation we obtain

ln C = α + β1 ln w1 + ... + βg ln wg + βq ln q + ε

where α = ln A. So we see that the transformed model is linear in the logs of the data.

One can verify that the property of HOD1 implies that

Σi=1..g βi = 1

In other words, the cost shares add up to 1. The hypothesis that the technology exhibits CRTS implies that

γ = 1/βq = 1

so βq = 1. Likewise, monotonicity implies that the coefficients βi ≥ 0, i = 1, ..., g.

The Nerlove data and OLS

The file nerlove.data contains data on 145 electric utility companies' cost of production, output and input prices. The data are for the U.S., and were collected by M. Nerlove. The observations are by row, and the columns are COMPANY, COST (C), OUTPUT (Q), PRICE OF LABOR (PL), PRICE OF FUEL (PF) and PRICE OF CAPITAL (PK). Note that the data are sorted by output level (the third column). We will estimate the Cobb-Douglas model

ln C = β1 + β2 ln Q + β3 ln PL + β4 ln PF + β5 ln PK + ε    (3.10)

using OLS. To do this yourself, you need the data file mentioned above, as well as Nerlove.m (the estimation program), and the library of Octave functions mentioned in the introduction to Octave that forms Section 22 of this document. If you are running the bootable CD, you have all of this installed and ready to run.
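The estimation is done by the distributed program Nerlove.m; the following is only a rough sketch of the same calculation, and it assumes (my assumption, not a statement about the distributed file) that nerlove.data is a plain numeric matrix with the six columns in the order listed above. The last two lines are quick numerical checks of the homogeneity restriction and of returns to scale, which are discussed just below.

    # sketch of OLS estimation of the Cobb-Douglas cost function (3.10)
    data = load("nerlove.data");
    lncost = log(data(:,2));
    X = [ones(rows(data),1) log(data(:,3:6))];       # constant, ln Q, ln PL, ln PF, ln PK
    betahat = X\lncost;                               # OLS estimates
    ehat = lncost - X*betahat;
    sig2hat = (ehat'*ehat)/(rows(X) - columns(X));
    se = sqrt(diag(sig2hat*inv(X'*X)));
    disp([betahat se]);                               # compare with the output below
    printf("beta_L + beta_F + beta_K (HOD1 says 1): %f\n", sum(betahat(3:5)));
    printf("estimated returns to scale, 1/beta_Q:   %f\n", 1/betahat(2));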

The results are:

*********************************************************
OLS estimation results
Observations 145
R-squared 0.925955
Sigma-squared 0.153943

Results (Ordinary var-cov estimator)

             estimate   st.err.   t-stat.   p-value
constant      -3.527     1.774    -1.987     0.049
output         0.720     0.017    41.244     0.000
labor          0.436     0.291     1.499     0.136
fuel           0.427     0.100     4.249     0.000
capital       -0.220     0.339    -0.648     0.518
*********************************************************

• Do the theoretical restrictions hold?
• Does the model fit well?
• What do you think about RTS?

While we will most often use Octave programs as examples in this document, since following the programming statements is a useful way of learning how theory is put into practice, you may be interested in a more "user-friendly" environment for doing econometrics. I heartily recommend Gretl, the Gnu Regression, Econometrics, and Time-Series Library. This is an easy to use program, available in English, French, and Spanish, and it comes with a lot of data ready to use. It even has an option to save output as LaTeX fragments, so that I can just include the results into this document, no muss, no fuss.

Here are the results of the Nerlove model from GRETL:

Model 2: OLS estimates using the 145 observations 1-145
Dependent variable: l_cost

  Variable     Coefficient    Std. Error    t-statistic    p-value
  const         -3.5265        1.77437       -1.9875        0.0488
  l_output       0.720394      0.0174664     41.2445        0.0000
  l_labor        0.436341      0.291048       1.4992        0.1361
  l_fuel         0.426517      0.100369       4.2495        0.0000
  l_capital     -0.219888      0.339429      -0.6478        0.5182

  Mean of dependent variable           1.72466
  S.D. of dependent variable           1.42172
  Sum of squared residuals            21.5520
  Standard error of residuals (σ̂)      0.392356
  Unadjusted R²                        0.925955
  Adjusted R²                          0.923840
  F(4, 140)                          437.686
  Akaike information criterion       145.084
  Schwarz Bayesian criterion         159.967

Fortunately, Gretl and my OLS program agree upon the results. Gretl is included in the bootable CD mentioned in the introduction. I recommend using GRETL to repeat the examples that are done using Octave.

The previous properties hold for finite sample sizes. Before considering the asymptotic properties of the OLS estimator it is useful to review the MLE estimator, since under the assumption of normal errors the two estimators coincide.

3.9 Exercises

1. Prove that the split sample estimator used to generate Figure 3.9 is unbiased.

2. Calculate the OLS estimates of the Nerlove model using Octave and GRETL, and provide printouts of the results. Interpret the results.

3. Do an analysis of whether or not there are influential observations for OLS estimation of the Nerlove model. Discuss.

4. Using GRETL, examine the residuals after OLS estimation and tell me whether or not you believe that the assumption of independent identically distributed normal errors is warranted. No need to do formal tests, just look at the plots. Print out any that you think are relevant, and interpret them.

5. For a random vector X ∼ N(µx, Σ), what is the distribution of AX + b, where A and b are conformable matrices of constants?

6. Using Octave, write a little program that verifies that Tr(AB) = Tr(BA) for A and B 4x4 matrices of random numbers. Note: there is an Octave function trace.

7. For the model with a constant and a single regressor, yt = β1 + β2 xt + εt, which satisfies the classical assumptions, prove that the variance of the OLS estimator declines to zero as the sample size increases.


Chapter 4
Asymptotic properties of the least squares estimator

The OLS estimator under the classical assumptions is BLUE (best linear unbiased estimator, if I haven't defined it before) for all sample sizes. Now let's see what happens when the sample size tends to infinity.

4.1 Consistency

β̂ = (X'X)⁻¹X'y
  = (X'X)⁻¹X'(Xβ + ε)
  = β0 + (X'X)⁻¹X'ε
  = β0 + (X'X/n)⁻¹ (X'ε/n)

Consider the last two terms. By assumption, limn→∞ X'X/n = QX, which implies limn→∞ (X'X/n)⁻¹ = QX⁻¹, since the inverse of a nonsingular matrix is a continuous function of the elements of the matrix. Considering X'ε/n,

X'ε/n = (1/n) Σt=1..n xt εt

Each xt εt has expectation zero, so

E(X'ε/n) = 0

The variance of each term is

V(xt εt) = xt xt' σ².

As long as these are finite, and given a technical condition (for application of LLNs and CLTs, of which there are very many to choose from, I'm going to avoid the technicalities; basically, as long as the terms that make up an average have finite variances and are not too strongly dependent, one will be able to find a LLN or CLT to apply - which one it is doesn't matter, we only need the result), the Kolmogorov SLLN applies, so

(1/n) Σt=1..n xt εt →a.s. 0.

This implies that

β̂ →a.s. β0.

This is the property of strong consistency: the estimator converges almost surely to the true value.

• Remember that almost sure convergence implies convergence in probability.
• The consistency proof does not use the normality assumption.
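To see consistency at work numerically, here is a small Octave sketch (an illustration only, with simulated data and parameter values of my choosing) in which the slope estimate concentrates around its true value of 2 as n grows:

    # the OLS estimator concentrates around beta0 as n grows
    beta0 = [1; 2];
    for n = [20 200 2000 20000]
      x = rand(n,1);
      X = [ones(n,1) x];
      y = X*beta0 + randn(n,1);
      betahat = X\y;
      printf("n = %6d   betahat(2) = %f\n", n, betahat(2));
    end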

4.2 Asymptotic normality

We've seen that the OLS estimator is normally distributed under the assumption of normal errors. If the error distribution is unknown, we of course don't know the distribution of the estimator. However, we can get asymptotic results. Assuming the distribution of ε is unknown, but the other classical assumptions hold:

β̂ = β0 + (X'X)⁻¹X'ε
β̂ − β0 = (X'X)⁻¹X'ε
√n (β̂ − β0) = (X'X/n)⁻¹ (X'ε/√n)

• Now as before, (X'X/n)⁻¹ → QX⁻¹.
• Considering X'ε/√n, the limit of its variance is

lim n→∞ V(X'ε/√n) = lim n→∞ E(X'εε'X/n) = σ0² QX

The mean is of course zero. To get asymptotic normality, we need to apply a CLT. We assume one (for instance, the Lindeberg-Feller CLT) holds, so

X'ε/√n →d N(0, σ0² QX)

Therefore,

√n (β̂ − β0) →d N(0, σ0² QX⁻¹)    (4.1)

• In summary, the OLS estimator is normally distributed in small and large samples if ε is normally distributed. If ε is not normally distributed, β̂ is asymptotically normally distributed when a CLT can be applied.

4.3 Asymptotic efficiency

The least squares objective function is

s(β) = Σt=1..n (yt − xt β)²

Supposing that ε is normally distributed, the model is

y = Xβ0 + ε,  ε ∼ N(0, σ0² In),

so

f(ε) = Πt=1..n (1/√(2πσ²)) exp(−εt²/(2σ²))

The joint density for y can be constructed using a change of variables. We have ε = y − Xβ, so ∂ε/∂y = In and |∂ε/∂y| = 1, so

f(y) = Πt=1..n (1/√(2πσ²)) exp(−(yt − xt β)²/(2σ²)).

Taking logs,

ln L(β, σ) = −n ln √(2π) − n ln σ − Σt=1..n (yt − xt β)²/(2σ²).

Maximizing this function with respect to β and σ gives what is known as the maximum likelihood (ML) estimator. It's clear that the first order conditions for the MLE of β0 are the same as the first order conditions that define the OLS estimator (up to multiplication by a constant), so the OLS estimator of β is also the ML estimator. The estimators are the same, under the present assumptions. Therefore, their properties are the same. In particular, under the classical assumptions with normality, the OLS estimator β̂ is asymptotically efficient. It turns out that ML estimators are asymptotically efficient, a concept that will be explained in detail later.

Note that one needs to make an assumption about the distribution of the errors to compute the ML estimator. If the errors had a distribution other than the normal, then the OLS estimator and the ML estimator would not coincide.

As we'll see later, it will be possible to use (iterated) linear estimation methods and still achieve asymptotic efficiency even if the assumption that Var(ε) = σ²In is violated, as long as ε is still normally distributed. This is not the case if ε is nonnormal. In general, with nonnormal errors it will be necessary to use nonlinear estimation methods to achieve asymptotically efficient estimation.

4.4 Exercises

1. Write an Octave program that generates a histogram for R Monte Carlo replications of √n (β̂_j − β_j), where β̂ is the OLS estimator and β_j is one of the k slope parameters. R should be a large number, at least 1000. The model used to generate data should follow the classical assumptions, except that the errors should not be normally distributed (try U(−a, a), t(p), χ²(p) − p, etc). Generate histograms for n ∈ {20, 50, 100, 1000}. Do you observe evidence of asymptotic normality? Comment.

Chapter 5

Restrictions and hypothesis tests

5.1 Exact linear restrictions

In many cases, economic theory suggests restrictions on the parameters of a model. For example, a demand function is supposed to be homogeneous of degree zero in prices and income. If we have a Cobb-Douglas (log-linear) model,

ln q = β_0 + β_1 ln p_1 + β_2 ln p_2 + β_3 ln m + ε,

then we need that

k^0 ln q = β_0 + β_1 ln kp_1 + β_2 ln kp_2 + β_3 ln km + ε,

so

β_1 ln p_1 + β_2 ln p_2 + β_3 ln m = β_1 ln kp_1 + β_2 ln kp_2 + β_3 ln km
                                   = (ln k)(β_1 + β_2 + β_3) + β_1 ln p_1 + β_2 ln p_2 + β_3 ln m.

The only way to guarantee this for arbitrary k is to set β_1 + β_2 + β_3 = 0, which is a parameter restriction. In particular, this is a linear equality restriction, which is probably the most commonly encountered case.

Imposition

The general formulation of linear equality restrictions is the model

y = Xβ + ε
Rβ = r

where R is a Q×K matrix, Q < K, and r is a Q×1 vector of constants.

• We assume R is of rank Q, so that there are no redundant restrictions.

• We also assume that there exists a β that satisfies the restrictions: they aren't infeasible.

Let's consider how to estimate β subject to the restrictions Rβ = r. The most obvious approach is to set up the Lagrangean

min_β s(β) = (1/n)(y − Xβ)'(y − Xβ) + 2λ'(Rβ − r).

The Lagrange multipliers are scaled by 2, which makes things less messy. The fonc are

D_β s(β̂, λ̂) = −2X'y + 2X'Xβ̂_R + 2R'λ̂ ≡ 0
D_λ s(β̂, λ̂) = Rβ̂_R − r ≡ 0,

which can be written as

[ X'X  R' ] [ β̂_R ]   [ X'y ]
[ R    0  ] [ λ̂   ] = [ r   ].

We get ˆ βR ˆ λ help you with that: Note that (X X)−1 −R (X X) −1 = XX R R 0 −1 Xy r . Maybe you’re curious about how to invert a partitioned matrix? I can 0 IQ XX R R 0 ≡ AB = ≡ IK (X X)−1 R 0 −R (X X)−1 R IK (X X)−1 R 0 −P ≡ C. .

so DAB = IK+Q DA = B −1 IK (X X)−1R P −1 −1 B = 0 −P −1 = P −1 (X X)−1 0 −R (X X)−1 IQ −1 (X X)−1 − (X X)−1R P −1R (X X)−1 (X X)−1R P −1 R (X X) −P −1 .and IK (X X)−1R P −1 0 −P −1 IK (X X)−1 R 0 −P ≡ DC = IK+Q. .

please start paying attention again. Also, note that we have made the definition P = R(X'X)^{-1}R', so

β̂_R = β̂ − (X'X)^{-1}R' P^{-1} (Rβ̂ − r)
λ̂   = P^{-1} (Rβ̂ − r).

The fact that β̂_R and λ̂ are linear functions of β̂ makes it easy to determine their distributions, since the distribution of β̂ is already known. Recall that for x a random vector, and for A and b a matrix and vector of constants, respectively, Var(Ax + b) = A Var(x) A'.
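As a concrete illustration, here is a minimal Octave sketch (not one of the programs distributed with these notes; the data and the single restriction are hypothetical) that computes the restricted estimator directly from the formula above.

% Sketch: restricted least squares via
%   betaR = beta - inv(X'X)*R' * inv(P) * (R*beta - r),  P = R*inv(X'X)*R'
n = 100; X = [ones(n,1) randn(n,2)];          % hypothetical regressors
y = X*[1; 2; -1] + randn(n,1);                % hypothetical DGP
R = [0 1 1]; r = 1;                           % restriction: beta2 + beta3 = 1
b   = X\y;                                    % unrestricted OLS
XXi = inv(X'*X);
P   = R*XXi*R';
bR  = b - XXi*R'*(P\(R*b - r));               % restricted estimator
lam = P\(R*b - r);                            % estimated Lagrange multiplier
disp([b bR])                                  % compare unrestricted and restricted estimates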

Though this is the obvious way to go about finding the restricted estimator, an easier way, if the number of restrictions is small, is to impose them by substitution. Write

y = X_1 β_1 + X_2 β_2 + ε
[ R_1  R_2 ] [ β_1 ; β_2 ] = r

where R_1 is Q×Q nonsingular. Supposing the Q restrictions are linearly independent, one can always make R_1 nonsingular by reorganizing the columns of X. Then

β_1 = R_1^{-1} r − R_1^{-1} R_2 β_2.

Substitute this into the model:

y = X_1 R_1^{-1} r − X_1 R_1^{-1} R_2 β_2 + X_2 β_2 + ε
y − X_1 R_1^{-1} r = ( X_2 − X_1 R_1^{-1} R_2 ) β_2 + ε

or, with the appropriate definitions,

y_R = X_R β_2 + ε.

This model satisfies the classical assumptions, supposing the restriction is true. One can estimate by OLS. The variance of β̂_2 is as before,

V(β̂_2) = (X_R' X_R)^{-1} σ_0²,

and the estimator is

V̂(β̂_2) = (X_R' X_R)^{-1} σ̂²,

where one estimates σ_0² in the normal way, using the restricted model, i.e.,

σ̂_0² = ( y_R − X_R β̂_2 )' ( y_R − X_R β̂_2 ) / ( n − (K − Q) ).
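The substitution approach takes only a few lines in Octave. The following is a hedged toy example (hypothetical data, not a program from the notes): solve the restriction for one coefficient, transform y and X, estimate the short model, then recover the full coefficient vector.

% Sketch: imposing beta2 + beta3 = 1 by substitution (beta2 = 1 - beta3), so that
%   y - x2 = beta1 + beta3*(x3 - x2) + e
n = 100; x2 = randn(n,1); x3 = randn(n,1);
y  = 1 + 0.4*x2 + 0.6*x3 + randn(n,1);        % hypothetical DGP satisfying the restriction
yR = y - x2;                                  % transformed dependent variable
XR = [ones(n,1) (x3 - x2)];                   % transformed regressors
bR = XR\yR;                                   % estimates of (beta1, beta3)
beta = [bR(1); 1 - bR(2); bR(2)];             % recover beta2 from the restriction
disp(beta')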

ˆ ˆ To recover β1. To ﬁnd the variance of β1. so −1 −1 ˆ ˆ V (β1) = R1 R2V (β2)R2 R1 −1 = R1 R2 (X2X2) −1 −1 2 R2 R1 σ0 Properties of the restricted estimator We have that ˆ ˆ ˆ βR = β − (X X)−1R P −1 Rβ − r ˆ = β + (X X)−1R P −1r − (X X)−1R P −1R(X X)−1X y ˆ βR − β = (X X)−1X ε + (X X)−1R P −1 [r − Rβ] − (X X)−1R P −1R(X X)−1X ε = β + (X X)−1X ε + (X X)−1R P −1 [r − Rβ] − (X X)−1R P −1R(X X)− . use the ˆ fact that it is a linear function of β2. use the restriction.

Mean squared error is ˆ ˆ ˆ M SE(βR) = E(βR − β)(βR − β) Noting that the crosses between the second term and the other terms expect to zero. and that the cross of the ﬁrst and third has a cancellation with the square of the third. The second term is PSD. the ﬁrst term is the OLS covariance. we obtain ˆ M SE(βR) = (X X)−1σ 2 + (X X)−1R P −1 [r − Rβ] [r − Rβ] P −1R(X X)−1 − (X X)−1R P −1R(X X)−1σ 2 So. the second term is 0. so we are better off. . and the third term is NSD. • If the restriction is true. True restrictions improve efﬁciency of estimation.

5.2 Testing

In many cases, one wishes to test economic theories. If theory suggests parameter restrictions, as in the above homogeneity example, one can test theory by testing parameter restrictions. A number of tests are available. The first two (t and F) have known small sample distributions, when the errors are normally distributed. The third and fourth (Wald and score) do not require normality of the errors, but their distributions are known only approximately, so that they are not exactly valid with finite samples.

t-test

Suppose one has the model

y = Xβ + ε

and one wishes to test the single restriction H_0: Rβ = r vs. H_A: Rβ ≠ r. Under H_0, with normality of the errors,

Rβ̂ − r ∼ N( 0, R(X'X)^{-1}R' σ_0² ),

so

(Rβ̂ − r) / ( σ_0 √( R(X'X)^{-1}R' ) ) ∼ N(0, 1).

The problem is that σ_0² is unknown. One could use the consistent estimator σ̂_0² in place of σ_0², but the test would only be valid asymptotically in this case.

. then x V −1x ∼ χ2(n). then x x ∼ χ2(n.v.v. 1) and the χ2(q) are independent. it is referred to as a central χ2 r. .. suppressing the noncentrality parameter.1) χ2 (q) q ∼ t(q) as long as the N (0.v. We need a few results on the χ2 distribution. In) is a vector of n independent r.’s. N (0.Proposition 1. Proposition 3. and it’s distribution is written as χ2(n). If the n dimensional random vector x ∼ N (0. V ). λ) where λ = 2 i µi = µ µ is the noncentrality parameter. If x ∼ N (µ. has the noncentrality parameter equal to zero. We’ll prove this one as an indication of how the following unproven propositions could be proved. Proposition 2. When a χ2 r.

then . If the n dimensional random vector x ∼ N (0. A more general proposition which implies this result is Proposition 4. V ). Then consider y = P x. We have y ∼ N (0.Proof: Factor V −1 as P P (this is the Cholesky factorization. P V P ) but V P P = In PV P P = P so P V P = In and thus y ∼ N (0. In). where P is deﬁned to be upper triangular). Thus y y ∼ χ2(n) but y y = x P P x = xV −1x and we get the result we wanted.

Now consider (remember that we have only one restriction in this case) . Consider the random variable ˆˆ ε MX ε εε 2 = 2 σ0 σ0 ε = MX σ0 ∼ χ2(n − K) ε σ0 Proposition 6. and B is idempotent with rank r. If the random vector (of dimension n) x ∼ N (0. An immediate consequence is Proposition 5. If the random vector (of dimension n) x ∼ N (0.x Bx ∼ χ2(ρ(B)) if and only if BV is idempotent. then x Bx ∼ χ2(r). I). I). then Ax and x Bx are independent if AB = 0.

so σ0 ˆ Rβ − r = ∼ t(n − K) −1 R ˆ ˆ σR β R(X X) ˆ Rβ − r In particular. H0 : βi = 0 . for the commonly encountered test of signiﬁcance of an individual coefﬁcient.√ Rβ−r σ0 ˆ R(X X)−1 R εε ˆˆ 2 (n−K)σ0 = ˆ Rβ − r σ0 R(X X)−1R ˆ ˆˆ This will have the t(n − K) distribution if β and ε ε are independent. the test statistic is ˆ βi ∼ t(n − K) ˆˆ σβi . ˆ But β = β + (X X)−1X ε and (X X)−1X MX = 0. for which H0 : βi = 0 vs.

• Note: the t-test is strictly valid only if the errors are actually normally distributed. If one has nonnormal errors, one could use the above asymptotic result to justify taking critical values from the N(0, 1) distribution, since t(n − K) →_d N(0, 1) as n → ∞. In practice, a conservative procedure is to take critical values from the t distribution if nonnormality is suspected. This will reject H_0 less often, since the t distribution is fatter-tailed than the normal.

F test

The F test allows testing multiple restrictions jointly.

Proposition 7. If x ∼ χ²(r) and y ∼ χ²(s), and x and y are independent, then (x/r)/(y/s) ∼ F(r, s).

Proposition 8. If the random vector (of dimension n) x ∼ N(0, I), then x'Ax and x'Bx are independent if AB = 0.

Using these results, and previous results on the χ² distribution, it is simple to show that the following statistic has the F distribution:

F = (Rβ̂ − r)' [ R(X'X)^{-1}R' ]^{-1} (Rβ̂ − r) / ( q σ̂² ) ∼ F(q, n − K).

A numerically equivalent expression is

( (ESS_R − ESS_U)/q ) / ( ESS_U/(n − K) ) ∼ F(q, n − K).

• Note: The F test is strictly valid only if the errors are truly normally distributed. The following tests will be appropriate when one cannot assume normally distributed errors.

Wald-type tests

The t and F tests require normality of the errors. The Wald test does not, but it is an asymptotic test: it is only approximately valid in finite samples. The Wald principle is based on the idea that if a restriction is true, the unrestricted model should "approximately" satisfy the restriction. Given that the least squares estimator is asymptotically normally distributed,

√n (β̂ − β_0) →_d N(0, σ_0² Q_X^{-1}),

then under H_0: Rβ_0 = r, we have

√n (Rβ̂ − r) →_d N(0, σ_0² R Q_X^{-1} R'),

so by Proposition 3

n (Rβ̂ − r)' [ σ_0² R Q_X^{-1} R' ]^{-1} (Rβ̂ − r) →_d χ²(q).

Note that Q_X^{-1} and σ_0² are not observable. The test statistic we use substitutes the consistent estimators. Use (X'X/n)^{-1} as the consistent estimator of Q_X^{-1}. With this, there is a cancellation of n's, and the statistic to use is

(Rβ̂ − r)' [ σ̂_0² R(X'X)^{-1}R' ]^{-1} (Rβ̂ − r) →_d χ²(q).

• The Wald test is a simple way to test restrictions without having to estimate the restricted model.

• Note that this formula is similar to one of the formulae provided for the F test.
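A minimal Octave sketch of the qF and Wald statistics for a single linear restriction follows. The data, the restriction and the choice of σ² estimator are illustrative assumptions; critical values would come from the F(q, n−K) and χ²(q) tables, which are not computed here.

% Sketch: F statistic (via restricted and unrestricted ESS) and Wald statistic.
n = 100; K = 3; q = 1;
X = [ones(n,1) randn(n,K-1)];
y = X*[1; 0.5; 0.5] + randn(n,1);              % hypothetical DGP
R = [0 1 1]; r = 1;                            % H0: beta2 + beta3 = 1

b    = X\y;            eU = y - X*b;    ESSU = eU'*eU;
% restricted fit by substitution of the single restriction
yR   = y - X(:,2);     XR = [X(:,1) X(:,3)-X(:,2)];
bR   = XR\yR;          eR = yR - XR*bR; ESSR = eR'*eR;

F    = ((ESSR - ESSU)/q) / (ESSU/(n-K));       % compare with F(q, n-K) critical value
sig2 = ESSU/(n-K);
W    = (R*b - r)' * inv(sig2 * R*inv(X'*X)*R') * (R*b - r);  % compare with chi2(q)
printf("qF = %6.3f   Wald = %6.3f\n", q*F, W)
% With sig2 = ESSU/(n-K), the Wald statistic coincides exactly with q*F here.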

Score-type tests (Rao tests, Lagrange multiplier tests)

The score test is another asymptotically valid test that does not require normality of the errors. In some cases, an unrestricted model may be nonlinear in the parameters, but the model is linear in the parameters under the null hypothesis. For example, the model

y = (Xβ)^γ + ε

is nonlinear in β and γ, but is linear in β under H_0: γ = 1. Estimation of nonlinear models is a bit more complicated, so one might prefer to have a test based upon the restricted, linear model. The score test is useful in this situation.

• Score-type tests are based upon the general principle that the gradient vector of the unrestricted model, evaluated at the restricted estimate, should be asymptotically normally distributed with mean zero, if the restrictions are true. The original development was for ML estimation, but the principle is valid for a wide variety of estimation methods.

σ0 RQ−1R X So √ √ nPˆλ −1 2 σ0 RQ−1R X d nPˆλ → χ2(q) . σ0 RQ−1R X under the null hypothesis. we obtain √ d 2 nPˆλ → N 0.We have seen that ˆ λ = R(X X)−1R ˆ = P −1 Rβ − r so √ √ −1 ˆ Rβ − r nPˆλ = ˆ n Rβ − r Given that √ d 2 ˆ n Rβ − r → N 0.

Noting that lim nP = RQ−1R . Note that the test can be written as ˆ ˆ R λ (X X)−1R λ 2 σ0 → χ2(q) d . • This makes it clear why the test is sometimes referred to as a Lagrange multiplier test. we obtain. It may seem that one needs the actual Lagrange multipliers to calculate this. If we impose the restrictions by substitution. X ˆ λ R(X X)−1R 2 σ0 ˆ d λ → χ2(q) since the powers of n cancel. these are not available. To get a usable test statistic substitute a 2 consistent estimator of σ0 .

2 σ0 To see why the test is also known as a score test. note that the fonc . we get εRX(X X)−1X εR d 2 ˆ ˆ → χ (q) 2 σ0 but this is simply ˆ εR PX d ˆ εR → χ2(q). we can use the fonc for the restricted estimator: ˆ ˆ −X y + X X βR + R λ to get that ˆ ˆ R λ = X (y − X βR) ˆ = X εR Substituting this into the above.However.

The test is also known as a Rao test. The scores evaluated at the unrestricted estimate are identically zero. since P. The logic behind the score test is that the scores evaluated at the restricted estimate should be approximately zero. . evaluated at the restricted estimator. if the restriction is true.for restricted least squares ˆ ˆ −X y + X X βR + R λ give us ˆ ˆ R λ = X y − X X βR and the rhs is simply the gradient (score) of the unrestricted model. Rao ﬁrst proposed it in 1948.

We have seen that the Wald test is asymptotically equivalent to ˆ W = n Rβ − r a 2 σ0 RQ−1R X −1 ˆ Rβ − r → χ2(q) d (5.1) . We’ll show that the Wald and LR tests are asymptotically equivalent. they all converge to the same χ2 rv.5. We have seen that the three tests all converge to χ2 random variables. but I’m leaving it here for reference. In fact. under the null hypothesis. I no longer teach the material in this section. Wald and score tests Note: the discussion of the LR test has been moved forward in these notes.3 The asymptotic equivalence of the LR.

Using ˆ β − β0 = (X X)−1X ε and ˆ ˆ Rβ − r = R(β − β0) we get √ √ ˆ − β0) = nR(X X)−1X ε nR(β −1 XX n−1/2X ε = R n .

1] to get 2 W = n−1ε XQ−1R σ0 RQ−1R X X a a −1 RQ−1X ε X −1 2 = ε X(X X)−1R σ0 R(X X)−1R −1 a ε A(A A) A ε = 2 σ0 a ε PR ε = 2 σ0 R(X X)−1X ε where PR is the projection matrix formed by the matrix X(X X)−1R . Now consider the likelihood ratio statistic LR = n1/2g(θ0) I(θ0)−1R RI(θ0)−1R a −1 RI(θ0)−1n1/2g(θ0) (5. • Note that this matrix is idempotent and has q columns.2) . so the projection matrix has rank q.Substitute this into [5.

σ) n X (y − Xβ0) = nσ 2 Xε = nσ 2 √ 2π − n ln σ − 1 (y − Xβ) (y − Xβ) . σ) = −n ln Using this. 1 g(β0) ≡ Dβ ln L(β. 2 2 σ .Under normality. we have seen that the likelihood function is ln L(β.

Also. by the information matrix equality: I(θ0) = −H∞(θ0) = lim −Dβ g(β0) X (y − Xβ0) = lim −Dβ nσ 2 XX = lim nσ 2 QX = 2 σ so I(θ0)−1 = σ 2Q−1 X .

• The LR statistic is based upon distributional assumptions. one can show that. under the null hypothesis. a a a . we get 2 LR = ε X (X X)−1R σ0 R(X X)−1R a ε PR ε = 2 σ0 a = W a −1 R(X X)−1X ε This completes the proof that the Wald and LR tests are asymptotically equivalent. since one can’t write the likelihood function without them. as can be veriﬁed by examining the expressions for the statistics.Substituting these last expressions into [5. Similarly. qF = W = LM = LR • The proof for the statistics except for LR does not depend upon normality of the errors.2].

Though the four statistics are asymptotically equivalent.• However. For example all of the following are consistent for σ 2 under H0 . supposing normality. The numeric values of the tests also depend upon how σ 2 is estimated. This is readily generalizable to nonlinear models and/or other estimation methods. in that it’s like a LR statistic in that it uses the value of the objective functions of the restricted and unrestricted models. the qF statistic can be thought of as a pseudo-LR statistic. they are numerically different in small samples. • The presentation of the score and Wald tests has been done in the context of the linear model. due to the close relationship between the statistics qF and LR. but it doesn’t require distributional assumptions. and we’ve already seen than there are several ways to do this.

For this reason. the Wald test will always reject if the LR test rejects. and if εε ˆˆ n is used to calculate the Wald test and εR εR ˆ ˆ n is used for the score test.εε ˆˆ n−k εε ˆˆ n εR εR ˆ ˆ n−k+q εR εR ˆ ˆ n and in general the denominator call be replaced with any quantity a such that lim a/n = 1. that W > LR > LM. for linear regression models subject to linear restrictions. It can be shown. and in turn the LR test rejects if the LM test rejects. This is a bit prob- .

the true size of the score test is often smaller than the nominal size. 5. The small sample behavior of the tests can be quite different. since asymptotic approximations are not an issue.lematic: there is the possibility that by careful choice of the statistic used. A conservative/honest approach would be to report all three test statistics when they are available. In the case of linear models with normal errors the F test is to be preferred. we need to know how to use them. The true size (probability of rejection of the null when the null is true) of the Wald test is often dramatically higher than the nominal size associated with the asymptotic distribution.4 Interpretation of test statistics Now that we have a menu of test statistics. one can manipulate reported results to favor or disfavor a hypothesis. Likewise. .

5 Conﬁdence intervals Conﬁdence intervals for single coefﬁcients are generated in the normal manner. using a α signiﬁcance level: C(α) = {β : −cα/2 The set of such β is the interval ˆ β ± σβ cα/2 ˆ ˆ β−β < < cα/2} σβ ˆ .5. Given the t statistic ˆ β−β t(β) = σβ ˆ a 100 (1 − α) % conﬁdence interval for β0 is deﬁned by the bounds of the set of β such that t(β) does not reject H0 : β0 = β.

mass 1 − α. – Joint rejection does not imply individal tests will reject. if the estimators are correlated. • The region is an ellipse. Since the ellipse is bounded in both dimensions but also contains mass 1 − α.g. . • From the pictue we can see that: – Rejection of hypotheses individually does not imply that the joint test will reject. the set of {β1. This generates an ellipse. it must extend beyond the bounds of the individual CI.A conﬁdence ellipse for two coefﬁcients jointly would be. since the CI for an individual coefﬁcient deﬁnes a (inﬁnitely long) rectangle with total prob. can take on any value). analogously. β2} such that the F (or some other test statistic) doesn’t reject at the speciﬁed critical value.. since the other coefﬁcient is marginalized (e.

Figure 5.1: Joint and Individual Conﬁdence Regions .

σ0 ) . the small sample distribution of n β − β0 may be very different than its large sample distribution. Suppose that y ε = ∼ X is nonstochastic Xβ0 + ε 2 IID(0. We’ll consider a simple example. just to get the main idea.5. Also. we’re often at serious risk of making important errors. A means of trying to gain information on the small sample distribution of test statistics and estimators is the bootstrap. If the sample size is small and errors are √ ˆ highly nonnormal. the distributions of test statistics may not resemble their limiting distributions at all.6 Bootstrapping When we rely on asymptotic theory to use the normal distributionbased tests and conﬁdence intervals.

Given that the distribution of ε is unknown, the distribution of β̂ will be unknown in small samples. However, since we have random sampling, we could generate artificial data. The steps are:

1. Draw n observations from ε̂ with replacement. Call this vector ε̃^j (it's n × 1).
2. Then generate the data by ỹ^j = Xβ̂ + ε̃^j.
3. Now take this and estimate β̃^j = (X'X)^{-1} X'ỹ^j.
4. Save β̃^j.
5. Repeat steps 1-4, until we have a large number, J, of β̃^j.

With this, we can use the replications to calculate the empirical distribution of β̃^j. One way to form a 100(1−α)% confidence interval for β_0 would be to order the β̃^j from smallest to largest, drop the first and last Jα/2 of the replications, and use the remaining endpoints as the limits of the CI. Note that this will not give the shortest CI if the empirical distribution is skewed.

• Suppose one was interested in the distribution of some function of β̂, for example a test statistic. Simple: just calculate the transformation for each j, and work with the empirical distribution of the transformation.

• If the assumption of iid errors is too strong (for example if there is heteroscedasticity or autocorrelation, see below) one can work with a bootstrap defined by sampling from (y, x) with replacement.

• How to choose J: J should be large enough that the results don't change with repetition of the entire bootstrap. This is easy to check.

If you find the results change a lot, increase J and try again.

• The bootstrap is based fundamentally on the idea that the empirical distribution of the sample data converges to the actual sampling distribution as n becomes large, so statistics based on sampling from the empirical distribution should converge in distribution to statistics based on sampling from the actual sampling distribution.

• In finite samples, this doesn't hold. At a minimum, the bootstrap is a good way to check if asymptotic theory results offer a decent approximation to the small sample distribution.

• Bootstrapping can be used to test hypotheses. Basically, use the bootstrap to get an approximation to the empirical distribution of the test statistic under the null, and use this to get critical values. Compare the test statistic calculated using the real data, under the alternative hypothesis, to the bootstrap critical values. There are many variations on this theme, which we won't go into here.
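The steps listed above can be coded directly. The following Octave sketch is a toy illustration of the residual bootstrap (it is not a program from the notes; the model, sample size and J are assumptions made only for the example).

% Sketch: residual bootstrap for the OLS estimator.
n = 50; J = 1000;                               % J should be large, as discussed above
X = [ones(n,1) randn(n,1)];
y = X*[1; 2] + randn(n,1);                      % hypothetical data
b = X\y;  e = y - X*b;                          % original fit and residuals
bboot = zeros(J, columns(X));
for j = 1:J
  idx = ceil(n*rand(n,1));                      % draw n residuals with replacement
  yj  = X*b + e(idx);                           % generate artificial data
  bboot(j,:) = (X\yj)';                         % re-estimate and save
end
% percentile 95% CI for the slope: order the replications, drop 2.5% in each tail
s = sort(bboot(:,2));
printf("bootstrap 95%% CI for beta2: [%6.3f, %6.3f]\n", s(round(0.025*J)), s(round(0.975*J)))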

5.7 Wald test for nonlinear restrictions: the delta method

Testing nonlinear restrictions of a linear model is not much more difficult, at least when the model is linear. Since estimation subject to nonlinear restrictions requires nonlinear estimation methods, which are beyond the scope of this course, we'll just consider the Wald test for nonlinear restrictions on a linear model.

Consider the q nonlinear restrictions

r(β_0) = 0,

where r(·) is a q-vector valued function. Write the derivative of the restriction evaluated at β as

D_β' r(β)|_β = R(β).

We suppose that the restrictions are not redundant in a neighborhood of β_0, so that ρ(R(β)) = q in a neighborhood of β_0. Take a first order Taylor's series expansion of r(β̂) about β_0:

r(β̂) = r(β_0) + R(β*)(β̂ − β_0),

where β* is a convex combination of β̂ and β_0. Under the null hypothesis we have

r(β̂) = R(β*)(β̂ − β_0).

Due to consistency of β̂ we can replace β* by β_0, asymptotically, so

√n r(β̂) ≈ √n R(β_0)(β̂ − β_0).

We've already seen the distribution of √n(β̂ − β_0). Using this we get

√n r(β̂) →_d N( 0, R(β_0) Q_X^{-1} R(β_0)' σ_0² ).

Considering the quadratic form,

n r(β̂)' [ R(β_0) Q_X^{-1} R(β_0)' ]^{-1} r(β̂) / σ_0² →_d χ²(q)

under the null hypothesis. Substituting consistent estimators for β_0, Q_X and σ_0², the resulting statistic is

r(β̂)' [ R(β̂)(X'X)^{-1}R(β̂)' ]^{-1} r(β̂) / σ̂² →_d χ²(q)

under the null hypothesis.

• This is known in the literature as the delta method, or as Klein's approximation.

• Since this is a Wald test, it will tend to over-reject in finite samples. The score and LR tests are also possibilities, but they require estimation methods for nonlinear models, which aren't in the scope of this course.

Note that this also gives a convenient way to estimate nonlinear functions and associated asymptotic confidence intervals. If the nonlinear function r(β_0) is not hypothesized to be zero, we just have

√n ( r(β̂) − r(β_0) ) →_d N( 0, R(β_0) Q_X^{-1} R(β_0)' σ_0² ),

so an approximation to the distribution of the function of the estimator is

r(β̂) ≈ N( r(β_0), R(β_0)(X'X)^{-1}R(β_0)' σ_0² ).

For example, the vector of elasticities of a function f(x) is

η(x) = (∂f(x)/∂x) ⊙ (x / f(x)),

where ⊙ means element-by-element multiplication. Suppose we estimate a linear function

y = x'β + ε.

The elasticities of y w.r.t. x are

η(x) = (β ⊙ x) / (x'β)

(note that this is the entire vector of elasticities). The estimated elasticities are

η̂(x) = (β̂ ⊙ x) / (x'β̂).

To calculate the estimated standard errors of all five elasticities, use

R(β) = ∂η(x)/∂β' = diag(x)/(x'β) − (β ⊙ x) x' / (x'β)².

To get a consistent estimator just substitute in β̂. Note that the elasticity and the standard error are functions of x. The program ExampleDeltaMethod.m shows how this can be done.

In many cases, nonlinear restrictions can also involve the data, not just the parameters. For example, consider a model of expenditure shares. Let x(p, m) be a demand function, where p is prices and m is income. An expenditure share system for G goods is

s_i(p, m) = p_i x_i(p, m) / m,  i = 1, 2, ..., G.

Now demand must be positive, and we assume that expenditures sum to income, so we have the restrictions

0 ≤ s_i(p, m) ≤ 1, ∀i
Σ_{i=1}^{G} s_i(p, m) = 1.

Suppose we postulate a linear model for the expenditure shares:

s_i(p, m) = β_1^i + p'β_p^i + m β_m^i + ε_i.

It is fairly easy to write restrictions such that the shares sum to one, but the restriction that the shares lie in the [0, 1] interval depends on both parameters and the values of p and m. It is impossible to impose the restriction that 0 ≤ s_i(p, m) ≤ 1 for all possible p and m. In such cases, one might consider whether or not a linear model is a reasonable specification.
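Before turning to the Nerlove example, here is a hedged Octave sketch of the delta method applied to the elasticities η(x) = (β ⊙ x)/(x'β) discussed above. It is a stand-alone toy example, not the ExampleDeltaMethod.m program; the data and the evaluation point are assumptions.

% Sketch: delta-method standard errors for elasticities eta(x) = (beta .* x)/(x'*beta),
% evaluated at a chosen point x. All data here are hypothetical.
n = 200; k = 3;
X = [ones(n,1) rand(n,2)+0.5];
y = X*[1; 0.7; 0.3] + 0.1*randn(n,1);
b   = X\y;  e = y - X*b;  sig2 = (e'*e)/(n-k);
Vb  = sig2*inv(X'*X);                           % estimated covariance of beta-hat
x   = mean(X)';                                 % evaluate elasticities at the sample mean
eta = (b.*x)/(x'*b);
Rb  = diag(x)/(x'*b) - (b.*x)*x'/(x'*b)^2;      % Jacobian of eta with respect to beta
Veta = Rb*Vb*Rb';                               % delta-method covariance matrix
printf("elasticity  std.err.\n");
disp([eta sqrt(diag(Veta))])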

5.8 Example: the Nerlove data

Remember that in a previous example (section 3.8) the OLS results for the Nerlove model were

*********************************************************
OLS estimation results
Observations 145
R-squared 0.925955
Sigma-squared 0.153943

Results (Ordinary var-cov estimator)

             estimate   st.err.   t-stat.   p-value
constant      -3.527     1.774     -1.987     0.049
output         0.720     0.017     41.244     0.000
labor          0.436     0.291      1.499     0.136
fuel           0.427     0.100      4.249     0.000
capital       -0.220     0.339     -0.648     0.518
*********************************************************

Note that s_K = β_K < 0, and that β_L + β_F + β_K ≠ 1, so the unrestricted estimates do not satisfy the restrictions of interest. Remember that if we have constant returns to scale, then β_Q = 1, and if there is homogeneity of degree 1 then β_L + β_F + β_K = 1. We can test these hypotheses either separately or jointly. NerloveRestrictions.m imposes and tests CRTS and then HOD1. From it we obtain the results that follow:

018 0.206 0.000 0.192 t-stat.005 0.878 4.414 -0.040 2.263 41.691 0. -5.969 ******************************************************* .155686 estimate constant output labor fuel capital -4.593 0.100 0.925652 Sigma-squared 0.err.891 0.721 0.038 p-value 0.007 st.000 0. 0.159 -0.000 0.Imposing and testing HOD1 ******************************************************* Restricted LS estimation results Observations 145 R-squared 0.

438861 estimate constant -7.574 0.593 0.592 p-value 0.530 st.441 0. 2.012 .594 0.790420 Sigma-squared 0. -2.539 p-value 0.441 0.442 Imposing and testing CRTS ******************************************************* Restricted LS estimation results Observations 145 R-squared 0.err.966 t-stat.Value F Wald LR Score 0.450 0.

132 0.output labor fuel capital 1. HOD1 is not rejected at usual signiﬁcance levels (e. you should .000 0.863 93.020 0.000 0.572 Inf 0. α = 0.262 265.414 150.771 p-value 0.895 ******************************************************* Value F Wald LR Score 256. compared to the unrestricted results.040 4.000 0.968 0.000 0.10). Also.076 0.000 0.g. For CRTS. R2 does not drop much when the restriction is imposed..289 0.167 0.489 0.000 0.000 Notice that the input price coefﬁcients in fact sum to 1 when HOD1 is imposed.000 0.715 0.

If you look at the unrestricted estimation results. Note that R2 drops quite a bit when imposing CRTS. so the restriction is satisﬁed. Exercise 9.m program to impose and test the restrictions jointly. Also note that the hypothesis that βQ = 1 is rejected by the test statistics at all reasonable signiﬁcance levels. Modify the NerloveRestrictions. From the point of view of neoclassical economic theory. Deﬁne 5 subsamples of ﬁrms.note that βQ = 1. then the next 29 ﬁrms. and that a conﬁdence interval for βQ does not overlap 1. etc. Recall that the data is sorted by output (the third column). you can see that a t-test for βQ = 1 also rejects. The Chow test Since CRTS is rejected. with the ﬁrst group being the 29 ﬁrms with the lowest output levels. let’s examine the possibilities more carefully. but CRTS is not. these results are not anomalous: HOD1 is an implication of the theory. .
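A sketch of how the five output-group dummies and the subsample regressor matrix defined below might be set up in Octave follows. The variable names and the random placeholder data are assumptions made only for the sketch; the Restrictions/ChowTest.m program referred to later does the actual estimation on the Nerlove data.

% Sketch: dummies for 5 output groups of 29 firms each (the data are sorted by output).
% 'Z' stands for the matrix [ln Q, ln PL, ln PF, ln PK]; random numbers are used as a
% placeholder so the sketch runs on its own.
n = 145; g = 5; m = 29;
Z = randn(n,4);  lnC = randn(n,1);              % placeholders for the actual Nerlove data
group = kron((1:g)', ones(m,1));                % 1,...,1,2,...,2,...,5,...,5
D = (group == 1:g);                             % n x 5 matrix of group dummies (broadcasting)
XC = [];
for j = 1:g
  XC = [XC, D(:,j).*[ones(n,1) Z]];             % group-j block: its own constant and slopes
end
b = XC\lnC;                                     % 25 coefficients, 5 per group
printf("number of coefficients: %d\n", numel(b))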

.58} t ∈ {30. 118.. where j = 1 for t = 1. ... 31. . ..The ﬁve subsamples can be indexed by j = 1. ... D5 where D1 = 1 t ∈ {1.... j = 2 for t = 30. 145} / .. 118. . 2.29} / t ∈ {30.. 5. 31. ..58} / 0 1 D2 = 0 .. 31. 2. Deﬁne dummy variables D1.29} t ∈ {1.58. D2. ... 2. 2... ..29... 145} t ∈ {117... D5 = 1 0 t ∈ {117. . ... etc.

+ . . and j is the 29×1 . βL1.3) Note that the ﬁrst column of nerlove. and provides and easy way of deﬁning the dummy variables. X1 is 29×5.g. βF 1.data indicates this way of breaking up the sample. . X3 X4 0 5 β5 0 X5 y5 (5. =. β 1 = (α1.Deﬁne the model 5 5 5 5 5 ln Ct = j=1 α1Dj + j=1 γj Dj ln Qt+ j=1 βLj Dj ln PLt+ j=1 βF j Dj ln PF t+ j=1 βKj Dj ln (5.4) where y1 is 29×1. γ1.. β j is the 5 × 1 vector of coefﬁcients for the j th subsample (e. βK1) ). . The new model may be written as 1 β1 X1 0 · · · 0 y1 β2 2 y2 0 X2 .

that there is coefﬁcient stability across the ﬁve subsamples. is sometimes referred to as a Chow test.m estimates the above model. β1 = β2 = β3 = β4 = β5 This type of test.vector of errors for the j th subsample. that is. The null to test is that the parameter vectors for the separate groups are all the same. • There are 20 restrictions. look at the Octave program. It also tests the hypothesis that the ﬁve subsamples share the same parameter vector. If that’s not clear to you. or in other words. we should probably use the unre- . • The restrictions are rejected at all conventional signiﬁcance levels. The Octave program Restrictions/ChowTest. that parameters are constant across different sets of data. Since the restrictions are rejected.

stricted model for analysis.

What is the pattern of RTS as a function of the output group (small to large)? Figure 5.2 plots RTS. We can see that there is increasing RTS for small firms, but that RTS is approximately constant for large firms.

Figure 5.2: RTS as a function of firm size

5.9 Exercises

1. Using the Chow test on the Nerlove model, we reject that there is coefficient stability across the 5 groups. But perhaps we could restrict the input price coefficients to be the same but let the constant and output coefficients vary by group size. This new model is

ln C = Σ_{j=1}^{5} α_j D_j + Σ_{j=1}^{5} γ_j D_j ln Q + β_L ln P_L + β_F ln P_F + β_K ln P_K + ε.     (5.5)

(a) Estimate this model by OLS, giving R², estimated standard errors for coefficients, t-statistics for tests of significance, and the associated p-values. Interpret the results in detail.

(b) Test the restrictions implied by this model (relative to the model that lets all coefficients vary across groups) using the F, qF, Wald, score and likelihood ratio tests. Comment on the results.

(c) Estimate this model but imposing the HOD1 restriction, using an OLS estimation program. Don't use mc_olsr or any other restricted OLS estimation program. Give estimated standard errors for all coefficients.

(d) Plot the estimated RTS parameters as a function of firm size. Compare the plot to that given in the notes for the unrestricted model. Comment on the results.

2. For the model of the above question, compute 95% confidence intervals for RTS for each of the 5 groups of firms, using the delta method to compute standard errors. Comment on the results.

3. Perform a Monte Carlo study that generates data from the model

y = −2 + 1·x_2 + 1·x_3 + ε,

where the sample size is 30, x_2 and x_3 are independently uniformly distributed on [0, 1] and ε ∼ IIN(0, 1).

(a) Compare the means and standard errors of the estimated coefficients using OLS and restricted OLS, imposing the restriction that β_2 + β_3 = 2.

(b) Compare the means and standard errors of the estimated coefficients using OLS and restricted OLS, imposing the restriction that β_2 + β_3 = 1.

(c) Discuss the results.

then it is irrelevant if they are stochastic or not. since conditional on the values of they regressors take on.Chapter 6 Stochastic regressors Up to now we have treated the regressors as ﬁxed. First. which is clearly unrealistic. Now we will assume they are random. if we are interested in an analysis conditional on the explanatory variables. There are several ways to think of the problem. • In cross-sectional analysis it is usually reasonable to make the 165 . they are nonstochastic. which is the case already considered.

where yt may depend on yt−1. and β0 and . where xt is K × 1.analysis conditional on the regressors. X = x1 x2 · · · xn . • In dynamic models. y = Xβ0 + ε. so we need to consider the ˆ behavior of β and the relevant test statistics unconditional on X. since we may want to predict into the future many periods out. or in matrix form. The model we’ll deal will involve a combination of the following assumptions Assumption 10. a conditional analysis is not sufﬁciently general. where y is n × 1. Linearity: the model is a linear function of the parameter vector β0 : yt = xtβ0 + εt.

Central limit theorem 2 n−1/2X ε → N (0. σ 2In): mally distributed is nor- . Assumption 11. Stochastic.ε are conformable. 1 nX X = QX = 1. where QX is a ﬁnite positive deﬁnite Assumption 12. linearly independent regressors X has rank K with probability 1 X is stochastic limn→∞ Pr matrix. QX σ0 ) d Assumption 13. Normality (Optional): ε|X ∼ N (0.

xtβ is the conditional mean of yt given xt: E(yt|xt) = xtβ .1) Assumption 15. Weakly exogenous regressors: The regressors are weakly exogenous if E(εt|xt) = 0.Assumption 14. The regressors X are strongly exogenous if E(εt|X) = 0. Strongly exogenous regressors. ∀t (6. ∀t In both cases.

(X X)−1σ0 ˆ • If the density of X is dµ(X). unconditional on X. Likewise. strongly exogenous regressors In this case. E(β) = β. 2 ˆ β|X ∼ N β. the marginal density of β is obtained by multiplying the conditional density by dµ(X) and integrating ˆ over X. ˆ β = β0 + (X X)−1X ε ˆ E(β|X) = β0 + (X X)−1X E(ε|X) = β0 ˆ and since this holds for all X. Doing this leads to a nonnormal density for β.6.1 Case 1 Normality of ε. in small .

4. these distributions don’t depend on X. F and χ2 distributions. β is nonnormally distributed 3. and this is true for all X. since it holds conditionally on X. The usual test statistics have the same distribution as with nonstochastic X. The Gauss-Markov theorem still holds. Importantly. conditional on X. β is unbiased ˆ 2.samples. so when marginalizing to obtain the unconditional distribution. • However. the usual test statistics have the t. . The tests are valid in small samples. • Summary: When X is stochastic but strongly exogenous and ε is normally distributed: ˆ 1. nothing changes.

5. we have ˆ β = β0 + (X X)−1X ε −1 Xε XX = β0 + n n Now XX n −1 p → Q−1 X . the argument regarding test statistics doesn’t hold. due to nonnormality of ε. Still.2 Case 2 ε nonnormally distributed. 6. However. strongly exogenous regressors ˆ The unbiasedness of β carries through as before. Asymptotic properties are treated in the next section.

QX σ 2) r. Q−1σ0 ) X . so. Considering the asymptotic distribution √ ˆ n β − β0 = = so √ √ XX Xε n n n −1 XX n−1/2X ε n −1 d 2 ˆ n β − β0 → N (0.v.by assumption. We have unbiasedness and the variance disappearing. and X ε n−1/2X ε p = √ →0 n n since the numerator converges to a N (0. and the denominator still goes to inﬁnity. the estimator is consistent: ˆ p β → β0.

Tests are not valid in small samples if the error is normally distributed . since it holds in the previous case and doesn’t depend on normality. Tests are asymptotically valid 6. • Summary: Under strongly exogenous regressors. Consistency 3. Unbiasedness 2. all the previous asymptotic results on test statistics are also valid in this case. Since the asymptotic results on all test statistics only require this. Asymptotic normality of the estimator still holds.directly following the assumptions. Gauss-Markov theorem holds. with ε normal ˆ or nonnormal. 4. Asymptotic normality 5. β has the properties: 1.

6. E(εt−1xt) = 0 since xt contains yt−1 (which is a function of εt−1) as an element. A simple version of these models that captures the important points is p yt = ztα + s=1 γsyt−s + εt = xtβ + εt where now xt contains lagged dependent variables. For example. Clearly. X and ε are not uncorrelated. even with E( t|xt) = 0.3 Case 3 Weakly exogenous regressors An important class of models are dynamic models. so one can’t show unbiasedness. where lagged dependent variables have an impact on the current value. .

QX σ0 ) .• This fact implies that all of the small sample properties such as unbiasedness.7. and we see that the OLS estimator is biased in this case. This is a case of weakly exogenous regressors. • Nevertheless. and small sample validity of test statistics do not hold in this case. Gauss-Markov theorem. 6. all asymptotic properties continue to hold. using the same arguments as before.4 When are the assumptions reasonable? The two assumptions we’ve added are 1. limn→∞ Pr d 1 nX X = QX = 1. a QX ﬁnite positive deﬁnite matrix. 2 2. under the above assumptions. Recall Figure 3. n−1/2X ε → N (0.

since the other cases can be treated as nested in this case. zt+s. many of which are fairly technical. There exist a number of central limit theorems for dependent processes.. • Strong stationarity requires that the joint distribution of the set {zt.} not depend on t.The most complicated case is that of dynamic models. zt−q . Chapter 7 if you’re interested). We won’t enter into details (see Hamilton.. . in some sense. A main requirement for use of standard asymptotics for a dependent sequence 1 {st} = { n n zt} t=1 to converge in probability to a ﬁnite limit is that zt be stationary. .

. Stationarity prevents the process from trending off to plus or minus inﬁnity. σ 2) One can show that the variance of xt depends upon t in this case. so it’s not weakly stationary. Draw a picture here. and prevents cyclical behavior which would allow correlations between far removed zt znd zs to be high.• Covariance (weak) stationarity requires that the ﬁrst and second moments of this set not depend on t. • The series sin t + t has a ﬁrst moment that depends upon t. so it’s not weakly stationary either. • An example of a sequence that doesn’t satisfy this is an AR(1) process with a unit root (a random walk): xt = xt−1 + εt εt ∼ IIN (0.

e.5 Exercises then E (Af (B)) = 0. 6. How is this used in the proof of the GaussMarkov theorem? 1. Show that for two random variables A and B. • The study of nonstationary processes is an important part of econometrics. the assumptions are reasonable when the stochastic conditioning variables have variances that are ﬁnite.• In summary. The AR(1) model with unit root is an example of a case where the dependence is too strong for standard asymptotics to apply. 2. Is it possible for an AR(1) model for time series data. yt = 0 + 0. if E(A|B) = 0..9yt−1 + εt satisfy weak exogeneity? Strong exogeneity? Dis- .g. but it isn’t in the scope of this course. and are not too strongly dependent.

.cuss.

180 .Chapter 7 Data problems In this section we’ll consider problems associated with the regressor matrix: collinearity. missing observations and measurement error.

2 ..000 population (Range 321.1 Collinearity Motivation: Data on Mortality and Related Factors The data set mortality.8. due to coronary heart disease and their determinants. along with data on factors like smoking and consumption of alcohol.1980 on death rates in the U.7.5) .4) • cal = Per capita consumption of calcium per day in grams (Range 0.S.9 .06) • unemp = Percent of civilian labor force unemployed in 1. The data description is: DATA4-7: Death rates in the U.000 of persons 16 years and older (Range 2.375.S.9 .data contains annual data from 1947 . Data compiled by Jennifer Whisenand • chd = death rate per 100.1.

34. lamb and mutton (Range 138 . 339 cigarettes per pound of tobacco (Range 6.10.9) • beer = Per capita consumption of malted liquor in taxed gallons for individuals 18 and older (Range 15.77 .75 .5) • meat = Per capita intake of meat in pounds–includes beef. margarine and butter (Range 42 .9) • wine = Per capita consumption of wine measured in taxed gallons for individuals 18 and older (Range 0.194.2. veal.04 . pork.65) .• cig = Per capita consumption of cigarettes in pounds of tobacco by persons 18 years and older–approx.56.46) • edfat = Per capita intake of edible fats and oil in pounds–includes lard.2.8) • spirits = Per capita consumption of distilled spirits in taxed gallons for individuals 18 and older (Range 1 .

581 + 3.8783 spirits − 5.735) ¯ T = 34 R2 = 0.5528 F (4.156) (7.9764 wine (12.914 + 5.373) (1.5498 F (3.0102) ¯ T = 34 R2 = 0.28816 beer (56.028 ˆ (standard errors in parentheses) .17560 cig + 38.9945 (standard errors in parentheses) chd = 353. 30) = 14.10365 beer (58.624) (4.433 σ = 10.939) (5.2513) + 13. 29) = 11.2 ˆ σ = 9.7523) (7.Consider estimation results for several models: chd = 334.3481 spirits − 4.41216 cig + 36.275) (1.

4371) (6.638) ¯ T = 34 R2 = 0.1598 ˆ σ = 12.481 (standard errors in parentheses) Note how the signs of the coefﬁcients change depending on the model.1709 σ = 12.7535 cig + 22.8672 spirits (49. The parameter estimates are highly sensitive to the particular model we estimate.chd = 243.21) (6.310 + 10.0359) (12.3198 F (3.3026 F (2. too.219 + 16.8689 wine (67.2079) ¯ T = 34 R2 = 0.8012 spirits − 16.1508) (8. and that the magnitudes of the parameter estimates vary a lot. 30) = 6. 31) = 8.119) (4. Why? We’ll see that the problem is that the data .5146 cig + 15.327 ˆ (standard errors in parentheses) chd = 181.

exhibit collinearity. We can always write λ1x1 + λ2x2 + · · · + λK xK + v = 0 where xi is the ith column of the regressor matrix X. and v is an n × 1 vector. • “relative” and “approximate” are imprecise. Collinearity: deﬁnition Collinearity is the existence of linear relationships amongst the regressors. so it’s difﬁcult to deﬁne when collinearilty exists. . the variation in v is relatively small. In the case that there exists collinearity. so that there is an approximately exact linear relation between the regressors.

In the extreme. if there are exact linear relationships (every element of v equal) then ρ(X) < K. For example. the β s can’t be consistently estimated . so X X is not invertible and the OLS estimator is not uniquely deﬁned. but since the γ s deﬁne two equations in three β s. so ρ(X X) < K. if the model is yt = β1 + β2x2t + β3x3t + εt x2t = α1 + α2x3t then we can write yt = β1 + β2 (α1 + α2x3t) + β3x3t + εt = β1 + β2α1 + β2α2x3t + β3x3t + εt = (β1 + β2α1) + (β2α2 + β3) x3t = γ1 + γ2x3t + εt • The γ s can be consistently estimated.

collected in xi. Similarly. quality etc. Consider a model of rental price (yi) of an apartment. The β s are unidentiﬁed in the case of perfect collinearity. Tarragona and Lleida.(there are multiple values of β that solve the ﬁrst order conditions). if one is not careful. Ti and Li for Girona.. except in the case of an error in construction of the regressor matrix. as well as on the location of the apartment. such as including the same regressor twice. Let Bi = 1 if the ith apartment is in Barcelona. • Perfect collinearity is unusual. Bi = 0 otherwise. One could use a model such as yi = β1 + β2Bi + β3Gi + β4Ti + β5Li + xiγ + εi . This could depend factors such as size. Another case where perfect collinearity may be encountered is with models with dummy variables. deﬁne Gi.

and 0 if the condition is false. Variables like xt and xt3 are ordinary continuous regressors. Bi + Gi + Ti + Li = 1. ∀i. A brief aside on dummy variables Dummy variable: A dummy variable is a binary-valued variable that indicates whether or not some condition is true. so that variables like dt and dt2 are understood to be dummy variables. One must either drop the constant. Use d to indicate that a variable is a dummy. It is customary to assign the value 1 if the condition is true. You know how to interpret the following models: . or one of the qualitative variables.In this model. so there is an exact relationship between these variables and the column of ones corresponding to the constant. Dummy variables are used essentially like any other regressor.

yt = β1 + β2dt + t yt = β1dt + β2(1 − dt) + t yt = β1 + β2dt + β3xt + t Interaction terms: an interaction term is the product of two variables. yt = β1 + β2dt + β3xt + β4dtxt + t ∂E(y|x) ∂x = β3 + β4dt. Note that value of dt. The following model has an interaction term. so that the effect of one variable on the dependent variable depends on the value of the other. The slope depends on the Multiple dummy variables: we can use more than one dummy vari- .

and we create a variable d = 1 if the observation is in the ﬁrst category. (This is not strictly speaking a dummy variable. Suppose that we a condition that deﬁnes 4 possible categories.able in a model. d = 2 if in the second. We will study models of the form yt = β1 + β2dt1 + β3dt2 + β4xt + t yt = β1 + β2dt1 + β3dt2 + β4dt1dt2 + β5xt + t Incorrect usage: You should understand why the following models are not correct usages of dummy variables: 1. multiple values assigned to multiple categories. overparameterization: yt = β1 + β2dt + β3(1 − dt) + t 2. etc. .

there are multiple ways to use dummy variables. Why is the following model not a good one? yt = β1 + β2d + What is the correct way to deal with this situation? Multiple parameterizations. To formulate a model that conditions on a given set of categorical information. 2. For example. 4. the two models yt = β1dt + β2(1 − dt) + β3xt + β4dtxt + and yt = α1 + α2dt + α3xtdt + α4xt(1 − dt) + t t are equivalent. You should know what are the 4 equations that relate the βj parameters to the αj parameters. j = 1.according to our deﬁnition). You should know . 3.

Back to collinearity The more common case. Two children are in a room.e. it is difﬁcult to determine their separate inﬂuences. collinearity is commonly encountered. correlations between the regressors that are less than one in absolute value. Example 16. along with a broken lamp. i.e. Both say ”I didn’t do it!”. How can we tell who broke the lamp? Lack of knowledge about the separate inﬂuences of variables is reﬂected in imprecise estimates. but not zero. i..how to interpret the parameters of both models. With economic data. is the existence of inexact linear relationships. estimates with high variances. and is often a severe problem. .. The basic problem is that when two (or more) variables move together. if one doesn’t make mistakes such as these.

1: s(β) when there is no collinearity 60 55 50 45 40 35 30 25 20 15 6 4 2 0 -2 -4 -6 -6 -4 -2 0 2 4 6 When there is collinearity. the sum of squared errors) is relatively poorly deﬁned.2. the minimizing point of the objective function that deﬁnes the OLS estimator (s(β). This is seen in Figures 7.Figure 7.1 and 7. partition the regressor . To see the effect of collinearity on variances.

Figure 7.2: s(β) when there is collinearity 100 90 80 70 60 50 40 30 20 6 4 2 0 -2 -4 -6 -6 -4 -2 0 2 4 6 .

matrix as

X = [ x  W ],

where x is the first column of X (note: we can interchange the columns of X if we like, so there's no loss of generality in considering the first column). Now, the variance of β̂, under the classical assumptions, is

V(β̂) = (X'X)^{-1} σ².

Using the partition,

X'X = [ x'x  x'W ; W'x  W'W ],

and following a rule for partitioned inversion,

(X'X)^{-1}_{1,1} = ( x'x − x'W (W'W)^{-1} W'x )^{-1}
                 = ( x' [ I_n − W(W'W)^{-1}W' ] x )^{-1}
                 = ( ESS_{x|W} )^{-1},

where by ESS_{x|W} we mean the error sum of squares obtained from the regression

x = Wλ + v.

Since R² = 1 − ESS/TSS, we have ESS = TSS(1 − R²), so the variance of the coefficient corresponding to x is

V(β̂_x) = σ² / ( TSS_x (1 − R²_{x|W}) ).     (7.1)

We see three factors influence the variance of this coefficient. It will be high if

1. σ² is large;
2. There is little variation in x. Draw a picture here.
3. There is a strong linear relationship between x and the other regressors, so that W can explain the movement in x well. In this case, R²_{x|W} will be close to 1. As R²_{x|W} → 1, V(β̂_x) → ∞.

The last of these cases is collinearity.

Intuitively, when there are strong linear relations between the regressors, it is difficult to determine the separate influence of the regressors on the dependent variable. This can be seen by comparing

Example 17. The model is y = 1 + x2 + x3 + . Three estimators are used: OLS. See the ﬁgures nocollin. Contribution received from node 0. The Octave script DataProblems/collinearity.9 is octave:1> collinearity Contribution received from node 0.ps (correlation).m performs a Monte Carlo study with correlated regressors. OLS dropping x3 (a false restriction).ps (no correlation) and collin. where the correlation between x2 and x3can be set. Received so far: 500 Received so far: 1000 . available on the web site.the OLS objective function in the case of no correlation between regressors with the objective function with correlation between the regressors. The output when the correlation between the two regressors is 0. and restricted LS using β2 = β3 (a true restriction).

dev.202 min 0.663 max 1.correlation between x2 and x3: 0. dev.096 min 0. 0.301 max 1.339 descriptive statistics for 1000 OLS replications. 0. b2=b3 .998 1.651 max 1.179 0.198 0.574 1. dropping x3 descriptive statistics for 1000 Restricted OLS replications.463 -0.433 0.182 0.342 min 0.574 2.002 st.395 -0.444 0.330 1.996 1.517 2.207 st.999 1.008 mean 0. dev. 0.900000 descriptive statistics for 1000 OLS replications mean 0.996 0.436 st.696 2.905 mean 0.

096 0. this procedure identiﬁes which parameters are affected. If any of these auxiliary regressions has a high R2. and note how the standard errors of the OLS estimator change. Detection of collinearity The best way is simply to regress each explanatory variable in turn on the remaining regressors.663 1.1. there is a problem of collinearity. .339 Figure 7.002 octave:2> 0. for each of the three estimators. • repeat the experiment with a lower value of rho. Furthermore.3 shows histograms for the estimated β2.

Figure 7.β2 ˆ (b) OLS.β2 . dropping x3 ˆ (c) Restricted LS.β2 . with true restrictionβ2 = β3 .3: Collinearity: Monte Carlo results ˆ (a) OLS.

An alternative is to examine the matrix of correlations between the regressors. some coefﬁcients are not signif- . Example 18.g. In summary..• Sometimes. their separate inﬂuences aren’t well determined). we’re only interested in certain parameters. Nerlove data and collinearity. The simple Nerlove model is ln C = β1 + β2 ln Q + β3 ln PL + β4 ln PF + β5 ln PK + When this model is estimated by OLS. the artiﬁcial regressions are the best approach if one wants to be careful. Also indicative of collinearity is that the model ﬁts well (high R2). but none of the variables is signiﬁcantly different from zero (e. Collinearity isn’t a problem if it doesn’t affect what we’re interested in estimating. High correlations are sufﬁcient but not necessary for severe collinearity.

If you run this.8).icant (see subsection 3. Why is the coefﬁcient of ln PK not signiﬁcantly different from zero? Dealing with collinearity More information Collinearity is a problem of an uninformative sample.The Octave script DataProblems/NerloveCollinearity. The ﬁrst question is: is all the available information being used? Is more data available? Are there coefﬁcient restrictions that have been neglected? Picture illustrating how a restriction can solve problem of perfect collinearity. you will see that collinearity is not a problem with this data. . This may be due to collinearity.m checks the regressors for collinearity.

Stochastic restrictions and ridge regression Supposing that there is no more data or neglected restrictions. For example. A stochastic linear restriction would be something of the form Rβ = r + v where R and r are as in the case of exact linear restrictions. one possibility is to change perspectives. to Bayesian econometrics. One can express prior beliefs regarding the coefﬁcients using stochastic restrictions. . the model could be y = Xβ + ε Rβ = r + v ε 0 ∼ N v 0 2 σε In 0n×q 2 0q×n σv Iq . but v is a random vector.

σε In) with prior beliefs regarding the distribution of the parameter. σv Iq ) Since the sample is random it is reasonable to suppose that E(εv ) = 0. summarized in y = Xβ + ε 2 ε ∼ N (0. which is the last piece of information in the speciﬁcation. summarized in 2 Rβ ∼ N (r. This model does ﬁt the Bayesian perspective: we combine information coming from the model and the data. How can you estimate using this model? The solution is to treat the restrictions .This sort of model isn’t in line with the classical interpretation of parameters as constants: according to this interpretation the left hand side of Rβ = r + v is constant but the right is random.

Deﬁne the prior precision k = σε/σv . This expresses the degree of belief in the restriction relative to the variability of the data. It is consistent. Note that this estimator is biased. given that k is a ﬁxed constant. however. Supposing that we specify k. even if the restriction is false (this is in contrast to the case of false exact restrictions). note that there are Q restrictions.as artiﬁcial data. Write y r = X R β+ ε v 2 2 This model is heteroscedastic. then the model y kr = X kR β+ ε kv is homoscedastic and can be estimated by OLS. As n → ∞. To see this. so the estimator . these Q artiﬁcial observations have no weight in the objective function. where Q is the number of rows of R. since σε = σv .

has the same limiting objective function as the OLS estimator, and is therefore consistent.

To motivate the use of stochastic restrictions, consider the expectation of the squared length of β̂:

E(β̂'β̂) = E[ (β + (X'X)^{-1}X'ε)' (β + (X'X)^{-1}X'ε) ]
        = β'β + E[ ε'X(X'X)^{-1}(X'X)^{-1}X'ε ]
        = β'β + Tr( (X'X)^{-1} ) σ²
        = β'β + σ² Σ_{i=1}^{K} λ_i   (the trace is the sum of the eigenvalues of (X'X)^{-1})
        > β'β + λ_max( (X'X)^{-1} ) σ²   (the eigenvalues are all positive, since X'X is positive definite),

so

E(β̂'β̂) > β'β + σ² / λ_min(X'X),

where λ_min(X'X) is the minimum eigenvalue of X'X (which is the inverse of the maximum eigenvalue of (X'X)^{-1}). As collinearity becomes worse and worse, X'X becomes more nearly singular, so λ_min(X'X) tends to zero (recall that the determinant is the product of the eigenvalues) and E(β̂'β̂) tends to infinity. On the other hand, β'β is finite.

Now considering the restriction I_K β = 0 + v. With this restriction the model becomes

[ y ; 0 ] = [ X ; kI_K ] β + [ ε ; kv ],

and the estimator is

β̂_ridge = ( [X'  kI_K] [ X ; kI_K ] )^{-1} [X'  kI_K] [ y ; 0 ]
         = ( X'X + k² I_K )^{-1} X'y.

This is the ordinary ridge regression estimator. The ridge regression estimator can be seen to add k²I_K, which is nonsingular, to X'X, which is more and more nearly singular as collinearity becomes worse and worse. As k → ∞, the restrictions tend to β = 0, that is, the coefficients are shrunken toward zero. Also, the estimator tends to

β̂_ridge = ( X'X + k²I_K )^{-1} X'y → ( k²I_K )^{-1} X'y = X'y / k² → 0,

so β̂_ridge'β̂_ridge → 0. This is clearly a false restriction in the limit, if our original model is at all sensible.

There should be some amount of shrinkage that is in fact a true restriction. The problem is to determine the k such that the restriction is correct. The interest in ridge regression centers on the fact that it can be shown that there exists a k such that MSE(β̂_ridge) < MSE(β̂_OLS). The problem is that this k depends on β and σ², which are unknown.

The ridge trace method plots β̂_ridge'β̂_ridge as a function of k, and chooses the value of k that "artistically" seems appropriate (e.g., where the effect of increasing k dies off). Draw picture here. This means of choosing k is obviously subjective.
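A minimal Octave sketch of the ridge estimator and the ridge trace follows. The data are hypothetical (deliberately collinear) and the grid of k values is an arbitrary choice for illustration.

% Sketch: ridge regression estimates for a grid of k values (the "ridge trace").
n = 50;
x1 = randn(n,1);  x2 = x1 + 0.05*randn(n,1);    % deliberately collinear regressors
X  = [ones(n,1) x1 x2];
y  = X*[1; 1; 1] + randn(n,1);
K  = columns(X);
for k = [0 0.1 0.5 1 5 10]
  bridge = (X'*X + k^2*eye(K)) \ (X'*y);
  printf("k = %5.2f   squared length of b = %8.4f\n", k, bridge'*bridge);
end

Plotting the squared length against k reproduces the ridge trace described in the text.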

This means of choosing k is obviously subjective. This is not a problem from the Bayesian perspective: the choice of k reflects prior beliefs about the length of β.

In summary, the ridge estimator offers some hope, but it is impossible to guarantee that it will outperform the OLS estimator. Collinearity is a fact of life in econometrics, and there is no clear solution to the problem.

7.2 Measurement error

Measurement error is exactly what it says: either the dependent variable or the regressors are measured with error. Thinking about the way economic data are reported, measurement error is probably quite prevalent. For example, estimates of growth of GDP, inflation, etc. are commonly revised several times. Why should the last revision necessarily be correct?

Error of measurement of the dependent variable

Measurement errors in the dependent variable and the regressors have important differences. First consider error in measurement of the dependent variable. The data generating process is presumed to be

y* = Xβ + ε
y = y* + v
v_t ∼ iid(0, σ²_v)

where y* is the unobservable true dependent variable, and y is what is observed. We assume that ε and v are independent and that y* = Xβ + ε satisfies the classical assumptions. Given this, we have y − v = Xβ + ε, so

y = Xβ + ε + v = Xβ + ω,  ω_t ∼ iid(0, σ²_ε + σ²_v).

• As long as v is uncorrelated with X, this model satisfies the classical assumptions and can be estimated by OLS. This type of measurement error isn't a problem, except in that the increased variability of the error term causes an increase in the variance of the OLS estimator (see equation 7.1).
Error of measurement of the regressors

The situation isn't so good in this case. The DGP is

y_t = x_t*′β + ε_t
x_t = x_t* + v_t,  v_t ∼ iid(0, Σ_v)

where Σ_v is a K × K matrix. Now X* contains the true, unobserved regressors, and X is what is observed. Again assume that v is independent of ε, and that the model y = X*β + ε satisfies the classical assumptions. Now we have

y_t = (x_t − v_t)′β + ε_t = x_t′β − v_t′β + ε_t = x_t′β + ω_t.

The problem is that now there is a correlation between x_t and ω_t, since

E(x_t ω_t) = E[(x_t* + v_t)(−v_t′β + ε_t)] = −Σ_v β

where Σ_v = E(v_t v_t′). Because of this correlation, the OLS estimator is biased and inconsistent, just as in the case of autocorrelated errors with lagged dependent variables. In matrix notation, write the estimated model as

y = Xβ + ω.

We have that

β̂ = (X′X/n)⁻¹ (X′y/n)

and

plim (X′X/n)⁻¹ = plim [(X* + V)′(X* + V)/n]⁻¹ = (Q_{X*} + Σ_v)⁻¹

since X* and V are independent, and

plim V′V/n = lim E[(1/n) Σ_{t=1}^n v_t v_t′] = Σ_v.

Likewise,

plim X′y/n = plim (X* + V)′(X*β + ε)/n = Q_{X*} β,
so

plim β̂ = (Q_{X*} + Σ_v)⁻¹ Q_{X*} β.

So we see that the least squares estimator is inconsistent when the regressors are measured with error.

• A potential solution to this problem is the instrumental variables (IV) estimator, which we'll discuss shortly.

Example 19. Measurement error in a dynamic model. Consider the model

y_t* = α + ρy_{t−1}* + βx_t + ε_t
y_t = y_t* + υ_t

where ε_t and υ_t are independent Gaussian white noise errors. Suppose that y_t* is not observed, and instead we observe y_t. What are the properties of the OLS regression on the equation

y_t = α + ρy_{t−1} + βx_t + ν_t ?

The Octave script DataProblems/MeasurementError.m does a Monte Carlo study. The sample size is n = 100. Figure 7.4 gives the results. The first panel shows a histogram for 1000 replications of ρ̂ − ρ, when σ_ν = 1, so that there is significant measurement error. The second panel repeats this with σ_ν = 0, so that there is no measurement error. Note that there is much more bias with measurement error. There is also some bias without measurement error. This is due to the same reason that we saw bias in Figure 3.7: one of the classical assumptions (nonstochastic regressors) that guarantees unbiasedness of OLS does not hold for this model. By re-running the script with larger n, you can verify that the bias disappears when σ_ν = 0, but not when σ_ν > 0.
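To see the kind of experiment involved, here is a minimal Octave sketch (this is not the MeasurementError.m script itself; the true parameter values and the amount of measurement error are illustrative assumptions) of a Monte Carlo study of OLS when the lagged dependent variable is measured with error:

# Monte Carlo: OLS with a mismeasured lagged dependent variable (illustrative sketch)
n = 100; reps = 1000;
alpha = 0; rho = 0.9; beta = 1; sigma_nu = 1;   # assumed values
rhohat = zeros(reps,1);
for r = 1:reps
    x = randn(n,1);
    ystar = zeros(n,1);
    for t = 2:n
        ystar(t) = alpha + rho*ystar(t-1) + beta*x(t) + randn();
    end
    y = ystar + sigma_nu*randn(n,1);       # observed y, with measurement error
    X = [ones(n-1,1), y(1:n-1), x(2:n)];   # regress y_t on 1, y_{t-1}, x_t
    b = X \ y(2:n);
    rhohat(r) = b(2);
end
printf("mean of rhohat - rho: %f\n", mean(rhohat) - rho);
hist(rhohat - rho, 30);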
Figure 7.4: ρ̂ − ρ with and without measurement error. (a) with measurement error: σ_ν = 1; (b) without measurement error: σ_ν = 0.

7.3 Missing observations

Missing observations occur quite frequently: time series data may not be gathered in a certain year, or respondents to a survey may not answer all questions. We'll consider two cases: missing observations on the dependent variable and missing observations on the regressors.

Missing observations on the dependent variable

In this case, we have

\begin{bmatrix} y_1 \\ y_2 \end{bmatrix} = \begin{bmatrix} X_1 \\ X_2 \end{bmatrix} β + \begin{bmatrix} ε_1 \\ ε_2 \end{bmatrix}

or y = Xβ + ε, where y_2 is not observed. Otherwise, we assume the classical assumptions hold.

• A clear alternative is to simply estimate using the complete observations

y_1 = X_1β + ε_1.

Since these observations satisfy the classical assumptions, one could estimate by OLS.

• The question remains whether or not one could somehow replace the unobserved y_2 by a predictor, and improve over OLS in some sense. Let ŷ_2 be the predictor of y_2. Now

β̂ = (X_1′X_1 + X_2′X_2)⁻¹ (X_1′y_1 + X_2′ŷ_2).

Recall that the OLS fonc are X′Xβ̂ = X′y, so if we regressed using only the first (complete) observations, we would have

X_1′X_1 β̂_1 = X_1′y_1.

Likewise, an OLS regression using only the second (filled in) observations would give

X_2′X_2 β̂_2 = X_2′ŷ_2.

Substituting these into the equation for the overall combined estimator gives

β̂ = (X_1′X_1 + X_2′X_2)⁻¹ X_1′X_1 β̂_1 + (X_1′X_1 + X_2′X_2)⁻¹ X_2′X_2 β̂_2
  ≡ A β̂_1 + (I_K − A) β̂_2,

where A ≡ (X_1′X_1 + X_2′X_2)⁻¹ X_1′X_1, and we use

(X_1′X_1 + X_2′X_2)⁻¹ X_2′X_2 = (X_1′X_1 + X_2′X_2)⁻¹ [(X_1′X_1 + X_2′X_2) − X_1′X_1]
                               = I_K − (X_1′X_1 + X_2′X_2)⁻¹ X_1′X_1
                               = I_K − A.

Now,

E(β̂) = Aβ + (I_K − A) E(β̂_2)

and this will be unbiased only if E(β̂_2) = β.

• The conclusion is that the filled-in observations alone would need to define an unbiased estimator. This will be the case only if

ŷ_2 = X_2β + ε̂_2

where ε̂_2 has mean zero. Clearly, it is difficult to satisfy this condition without knowledge of β.
• Note that putting ŷ_2 = ȳ_1 does not satisfy the condition and therefore leads to a biased estimator.

Exercise 20. Formally prove this last statement.

The sample selection problem

In the above discussion we assumed that the missing observations are random. The sample selection problem is a case where the missing observations are not random. Consider the model

y_t* = x_t′β + ε_t

which is assumed to satisfy the classical assumptions. However, y_t* is not always observed. What is observed is y_t, defined as

y_t = y_t* if y_t* ≥ 0.

Or, in other words, y_t* is missing when it is less than zero. The difference in this case is that the missing values are not random: they are correlated with the x_t. Consider the case

y* = x + ε

with V(ε) = 25, but using only the observations for which y* > 0 to estimate. Figure 7.5 illustrates the bias. The Octave program is sampsel.m. There are means of dealing with sample selection bias, but we will not go into it here. One should at least be aware that nonrandom selection of the sample will normally lead to bias and inconsistency if the problem is not taken into account.

Figure 7.5: Sample selection bias. (The plot shows the data, the true line, and the fitted line.)
Missing observations on the regressors

Again the model is

\begin{bmatrix} y_1 \\ y_2 \end{bmatrix} = \begin{bmatrix} X_1 \\ X_2 \end{bmatrix} β + \begin{bmatrix} ε_1 \\ ε_2 \end{bmatrix}

but we assume now that each row of X_2 has an unobserved component(s). Again, one could just estimate using the complete observations, but it may seem frustrating to have to drop observations simply because of a single missing variable. In general, if the unobserved X_2 is replaced by some prediction, X_2*, then we are in the case of errors of observation. As before, this means that the OLS estimator is biased when X_2* is used instead of X_2. Consistency is salvaged, however, as long as the number of missing observations doesn't increase with n.

• Including observations that have missing values replaced by ad hoc values can be interpreted as introducing false stochastic restrictions. In general, this introduces bias. It is difficult to determine whether MSE increases or decreases. Monte Carlo studies suggest that it is dangerous to simply substitute the mean, for example.

• In the case that there is only one regressor other than the constant, substitution of x̄ for the missing x_t does not lead to bias. This is a special case that doesn't hold for K > 2. Exercise 21. Prove this last statement.

• In summary, if one is strongly concerned with bias, it is best to drop observations that have missing components. There is potential for reduction of MSE through filling in missing elements with intelligent guesses, but this could also increase MSE.

7.4 Exercises

1. Consider the simple Nerlove model

ln C = β_1 + β_2 ln Q + β_3 ln P_L + β_4 ln P_F + β_5 ln P_K + ε.

When this model is estimated by OLS, some coefficients are not significant. We have seen that collinearity is not an important problem. Why is β_5 not significantly different from zero? Give an economic explanation.

2. For the model y = β_1x_1 + β_2x_2 + ε,

(a) verify that the level sets of the OLS criterion function (defined in equation 3.2) are straight lines when there is perfect collinearity;

(b) for this model with perfect collinearity, the OLS estimator does not exist. Depict what this statement means using a drawing;

(c) show how a restriction R_1β_1 + R_2β_2 = r causes the restricted least squares estimator to exist, using a drawing.
Chapter 8

Functional form and nonnested tests

Though theory often suggests which conditioning variables should be included, and suggests the signs of certain derivatives, it is usually silent regarding the functional form of the relationship between the dependent variable and the regressors. For example, considering a cost function, one could have a Cobb-Douglas model

c = A w_1^{β_1} w_2^{β_2} q^{β_q} e^ε.

This model, after taking logarithms, gives

ln c = β_0 + β_1 ln w_1 + β_2 ln w_2 + β_q ln q + ε

where β_0 = ln A. Theory suggests that A > 0, β_1 > 0, β_2 > 0, β_q > 0. This model isn't compatible with a fixed cost of production since c = 0 when q = 0. Homogeneity of degree one in input prices suggests that β_1 + β_2 = 1, while constant returns to scale implies β_q = 1.

While this model may be reasonable in some cases, an alternative

√c = β_0 + β_1 √w_1 + β_2 √w_2 + β_q √q + ε

may be just as plausible. Note that √x and ln(x) look quite alike, for certain values of the regressors, and up to a linear transformation, so it may be difficult to choose between these models.

The basic point is that many functional forms are compatible with the linear-in-parameters model, since this model can incorporate a wide variety of nonlinear transformations of the dependent variable and the regressors. For example, suppose that g(·) is a real valued function and that x(·) is a K-vector-valued function. The following model is linear in the parameters but nonlinear in the variables:

x_t = x(z_t)
y_t = x_t′β + ε_t.

There may be P fundamental conditioning variables z_t, but there may be K regressors, where K may be smaller than, equal to or larger than P. For example, x_t could include squares and cross products of the conditioning variables in z_t.
8.1 Flexible functional forms

Given that the functional form of the relationship between the dependent variable and the regressors is in general unknown, one might wonder if there exist parametric models that can closely approximate a wide variety of functional relationships. A "Diewert-flexible" functional form is defined as one such that the function, the vector of first derivatives and the matrix of second derivatives can take on an arbitrary value at a single data point. Flexibility in this sense clearly requires that there be at least

K = 1 + P + (P² − P)/2 + P

free parameters: one for each independent effect that we wish to model.

Suppose that the model is

y = g(x) + ε.

A second-order Taylor's series expansion (with remainder term) of the function g(x) about the point x = 0 is

g(x) = g(0) + x′D_x g(0) + (1/2) x′D²_x g(0) x + R.

Use the approximation, which simply drops the remainder term, as an approximation to g(x):

g(x) ≈ g_K(x) = g(0) + x′D_x g(0) + (1/2) x′D²_x g(0) x.

As x → 0, the approximation becomes more and more exact, in the sense that g_K(x) → g(x), D_x g_K(x) → D_x g(x) and D²_x g_K(x) → D²_x g(x). For x = 0, the approximation is exact, up to the second order. The idea behind many flexible functional forms is to note that g(0), D_x g(0) and D²_x g(0) are all constants. If we treat them as parameters, the approximation will have exactly enough free parameters to approximate the function g(x), which is of unknown form, exactly, up to second order, at the point x = 0. The model is

g_K(x) = α + x′β + (1/2) x′Γx

so the regression model to fit is

y = α + x′β + (1/2) x′Γx + ε.

• While the regression model has enough free parameters to be Diewert-flexible, the question remains: is plim α̂ = g(0)? Is plim β̂ = D_x g(0)? Is plim Γ̂ = D²_x g(0)?

• The answer is no, in general. The reason is that if we treat the true values of the parameters as these derivatives, then ε is forced to play the part of the remainder term, which is a function of x, so that x and ε are correlated in this case. As before, the estimator is biased in this case.

• A simpler example would be to consider a first-order T.S. approximation to a quadratic function. Draw picture.

• The conclusion is that "flexible functional forms" aren't really flexible in a useful statistical sense, in that neither the function itself nor its derivatives are consistently estimated, unless the function belongs to the parametric family of the specified functional form. In order to lead to consistent inferences, the regression model must be correctly specified.

The translog form In spite of the fact that FFF’s aren’t really ﬂexible for the purposes of econometric estimation and inference. the expansion point is usually taken to be the sample mean of the data. This model is as above. except that the variables are subjected to a logarithmic tranformation. they are useful. The model is deﬁned by y = ln(c) z x = ln ¯ z = ln(z) − ln(¯) z y = α + x β + 1/2x Γx + ε . Also. and they are certainly subject to less bias due to misspeciﬁcation of the functional form than are many popular forms. such as the Cobb-Douglas or the simple linear in the variables model. The translog model is probably the most widely used FFF. after the logarithmic transformation.

q) . z .In this presentation. consider that y is cost of production: y = c(w. at the means of the data. the t subscript that distinguishes observations is suppressed for simplicity. so ∂y ∂x =β z=¯ z so the β are the ﬁrst-order elasticities. This is a convenient feature of the translog model. Note that at the means of the conditioning ¯ variables. To illustrate. x = 0. Note that ∂y = β + Γx ∂x ∂ ln(c) = (the other part of x is constant) ∂ ln(z) ∂c z = ∂z c which is the elasticity of c with respect to z.

By Shephard’s lemma. q) x= ∂w and the cost shares of the factors are therefore wx ∂c(w. the conditional factor demands are ∂c(w. If the cost function is modeled using a translog function. q) w s= = c ∂w c which is simply the vector of elasticities of cost with respect to input prices. we have ln(c) = α + x β + z δ + 1/2 x z Γ11 Γ12 Γ12 Γ22 x z = α + x β + z δ + 1/2x Γ11x + x Γ12z + 1/2z 2γ22 .where w is a vector of input prices and q is output. We could add other variables by extending q in the obvious manner. but this is supressed for simplicity.

q and Γ11 = Γ12 = Γ22 γ11 γ12 γ12 γ22 γ13 γ23 = γ33. By pooling the equations together and imposing the (true) restriction that the parameters of the equations be the same. Then the share equations are just s = β + Γ11 Γ12 x z Therefore. the share equations and the cost equation have parameters in common. Note that symmetry of the second derivatives has been imposed. .¯ where x = ln(w/w) (element-by-element division) and z = ln(q/¯).

One can do a pooled estimation of the three equa- . so x= x1 x2 . consider the case of two inputs. To illustrate in more detail.we can gain efﬁciency. In this case the translog model of the logarithmic cost function is ln c = α+β1x1 +β2x2 +δz+ γ11 2 γ22 2 γ33 2 x1 + x2 + z +γ12x1x2 +γ13x1z+γ23x2z 2 2 2 The two cost shares of the inputs are the derivatives of ln c with respect to x1 and x2: s1 = β1 + γ11x1 + γ12x2 + γ13z s2 = β2 + γ12x1 + γ22x2 + γ13z Note that the share equations and the cost equation have parameters in common.

and would then be misspeciﬁed for the shares. Note that this does assume that the cost equation is correctly speciﬁed (i.tions at once. To pool . which will lead to imporved efﬁciency. since otherwise the derivatives would not be the true derivatives of the log cost function. imposing that the parameters are the same. In this way we’re using more observations and therefore more information. not an approximation)..e.

a single observation can be written as y t = X t θ + εt . With the appropriate notation. write the model in matrix form (adding in error terms) α β1 β 2 δ x2 x2 z 2 ln c ε 1 x1 x2 z 21 22 2 x1x2 x1z x2z γ11 1 s1 = 0 1 0 0 x 0 0 x z 0 1 2 γ + ε2 22 s2 ε3 0 0 1 0 0 x2 0 x1 0 z γ 33 γ12 γ 13 γ23 This is one observation on the three equations.the equations.

For observation t the errors can be placed in a vector ε 1t εt = ε2t ε3t First consider the covariance matrix of this vector: the shares are certainly correlated since they must sum to one.The overall model would stack n observations on the three equations for a total of 3n observations: y1 X1 y2 X2 = . . . . with 2 shares the variances are equal and the covariance is -1 times the variance. General notation is used to allow easy extension to the case of more . (In fact. yn ε1 θ + ε2 . . Xn εn Next we need to consider the errors.

than 2 inputs). the overall covariance matrix has the . Also. since the shares sum to 1. it’s likely that the shares and the cost equation have different variances. Assuming that there is no autocorrelation. the variance of εt won t depend upon t: σ σ σ 11 12 13 V arεt = Σ0 = · σ22 σ23 · · σ33 Note that this matrix is singular. Supposing that the model is covariance stationary.

seemingly unrelated regressions (SUR) structure. The Kronecker .. . . . = Σ . 0 .. . ε1 ε2 V ar . . εn Σ0 0 0 = . 0 Σ0 where the symbol ⊗ indicates the Kronecker product. 0 ··· 0 = In ⊗ Σ0 ··· Σ0 .

. An asymptotically efﬁcient procedure is (supposing normality of the errors) 1. so OLS won’t be efﬁcient. The next question is: how do we estimate efﬁciently using FGLS? FGLS is based upon inverting the estimated error ˆ covariance Σ. apq B · · · a1q B . apq B FGLS estimation of a translog model So. . . So we need to estimate Σ.product of two matrices A and B is a11B a12B · · · a21B . A⊗B =. this model has heteroscedasticity and autocorrelation. . . Estimate each equation by OLS .

Next we need to account for the singularity of Σ0. so FGLS won’t work. It can be ˆ shown that Σ0 will be singular when the shares sum to one.2. Estimate Σ0 using ˆ0 = 1 Σ n n ˆˆ εtεt t=1 3. The solution is to drop one of the share .

for example the second. The model becomes α β1 β 2 δ x2 x2 z 2 ln c 1 x1 x2 z 21 22 2 x1x2 x1z x2z γ11 = + γ22 s1 0 1 0 0 x1 0 0 x2 z 0 γ 33 γ12 γ 13 γ23 or in matrix notation for the observation: ∗ yt = Xt∗θ + ε∗ t ε1 ε2 .equations.

.and in stacked notation for all observations we have the 2n observations: ∗ y1 ∗ y2 = θ + . . ﬁnally in matrix notation for all observations: y ∗ = X ∗ θ + ε∗ Considering the error covariance. ∗ ∗ yn Xn ε∗ n or. . . we can deﬁne Σ∗ 0 = V ar ε1 ε2 ∗ X1 ∗ X2 ε∗ 1 ε∗ 2 Σ∗ = In ⊗ Σ∗ 0 . .

which is consistent with the way Octave does it) and the Cholesky factorization of the overall covariance matrix of the 2 equation model. 4.ˆ0 ˆ Deﬁne Σ∗ as the leading 2 × 2 block of Σ0 . and form ˆ ˆ0 Σ∗ = In ⊗ Σ∗. which can be calculated as ˆ ˆ ˆ P = CholΣ∗ = In ⊗ P0 . Next compute the Cholesky factorization ˆ P0 = Chol ˆ0 Σ∗ −1 (I am assuming this is deﬁned as an upper triangular matrix. This is a consistent estimator. following the consistency of OLS and applying a LLN.

It is relatively simple to relax this. This is clearly restrictive.5. This is probably the simplest approach. We have assumed no autocorrelation across time. but we won’t go . 1. A few last comments. Finally the FGLS estimator can be calculated by applying OLS to the transformed model ˆ ˆ y ∗ = P X ∗ θ + P ε∗ ˆ ˆ P or by directly using the GLS formula ˆ θF GLS = X ∗ ˆ0 Σ∗ −1 −1 X ∗ X ∗ ˆ0 Σ∗ −1 y∗ It is equivalent to transform each observation individually: ˆ ∗ ˆ ˆ P0yy = P0Xt∗θ + P0ε∗ and then apply OLS.

This can be accomplished by imposing β1 + β2 = 1 3 γij = 0. i=1 These are linear parameter restrictions. then re-estimate Σ∗ using errors cal0 culated as ˆ ε = y − X θF GLS ˆ . estimate θF GLS as above. Another restriction that the model should satisfy is that the estimated shares should sum to 1. The estimation procedure outlined above can be iterated. we have only imposed symmetry of the second derivatives. Also. 3. so they are easy to impose and will improve efﬁciency if they are true. 3. 2.into it here. 2. j = 1. That ˆ is.

These might be expected to lead to a better estimate than the ˆ estimator based on θOLS . since both are based upon a consistent estimator of the error covariance. since FGLS is asymptotically more efﬁcient. Then re-estimate θ using the new estimated error covariance. 8.2 Testing nonnested hypotheses Given that the choice of functional form isn’t perfectly clear. in that many possibilities exist. score or qF are all possibilities.e. iterated to convergence) then the resulting estimator is the MLE. For example. how can one choose between forms? When one form is a parametric restriction of another.. LR. the Cobb-Douglas model is a parametric restriction of the translog: . the previously studied tests such as Wald. At any rate. the asymptotic properties of the iterated and uniterated estimators are the same. It can be shown that if this is repeated until the estimates don’t change (i.
The translog is

y_t = α + x_t′β + (1/2) x_t′Γx_t + ε

where the variables are in logarithms, while the Cobb-Douglas is

y_t = α + x_t′β + ε

so a test of the Cobb-Douglas versus the translog is simply a test that Γ = 0.

The situation is more complicated when we want to test non-nested hypotheses. If the two functional forms are linear in the parameters, and use the same transformation of the dependent variable, then they may be written as

M_1 : y = Xβ + ε,  ε_t ∼ iid(0, σ²_ε)
M_2 : y = Zγ + η,  η ∼ iid(0, σ²_η).

We wish to test hypotheses of the form: H_0: M_i is correctly specified versus H_A: M_i is misspecified, for i = 1, 2.

• One could account for non-iid errors, but we'll suppress this for simplicity.

• There are a number of ways to proceed. We'll consider the J test, proposed by Davidson and MacKinnon, Econometrica (1981). The idea is to artificially nest the two models, e.g.,

y = (1 − α)Xβ + α(Zγ) + ω.

If the first model is correctly specified, then the true value of α is zero. On the other hand, if the second model is correctly specified then α = 1.

– The problem is that this model is not identified in general. For example, if the models share some regressors, as in

M_1 : y_t = β_1 + β_2 x_{2t} + β_3 x_{3t} + ε_t
M_2 : y_t = γ_1 + γ_2 x_{2t} + γ_3 x_{4t} + η_t

then the composite model is

y_t = (1 − α)β_1 + (1 − α)β_2 x_{2t} + (1 − α)β_3 x_{3t} + αγ_1 + αγ_2 x_{2t} + αγ_3 x_{4t} + ω_t.
Combining terms we get

y_t = ((1 − α)β_1 + αγ_1) + ((1 − α)β_2 + αγ_2) x_{2t} + (1 − α)β_3 x_{3t} + αγ_3 x_{4t} + ω_t
    = δ_1 + δ_2 x_{2t} + δ_3 x_{3t} + δ_4 x_{4t} + ω_t.

The four δ's are consistently estimable, but α is not, since we have four equations in 7 unknowns, so one can't test the hypothesis that α = 0.

The idea of the J test is to substitute γ̂ in place of γ. This is a consistent estimator supposing that the second model is correctly specified. It will tend to a finite probability limit even if the second model is misspecified. Then estimate the model

y = (1 − α)Xβ + α(Zγ̂) + ω = Xθ + αŷ + ω

where ŷ = Z(Z′Z)⁻¹Z′y = P_Z y. In this model, α is consistently estimable, and one can show that, under the hypothesis that the first model is correct, α̂ → 0 in probability and that the ordinary t-statistic for α = 0 is asymptotically normal:

t = α̂/σ̂_α ∼ N(0, 1), asymptotically.

• If the second model is correctly specified, then t → ∞ in probability, since α̂ tends in probability to 1, while its estimated standard error tends to zero. Thus the test will always reject the false null model, asymptotically, since the statistic will eventually exceed any critical value with probability one.

• We can reverse the roles of the models, testing the second against the first.

• It may be the case that neither model is correctly specified. In this case, the test will still reject the null hypothesis, asymptotically, if we use critical values from the N(0, 1) distribution, since as long as α̂ tends to something different from zero, |t| → ∞ in probability. Of course, when we switch the roles of the models the other will also be rejected asymptotically.

• In summary, there are 4 possible outcomes when we test two models, each against the other. Both may be rejected, neither may be rejected, or one of the two may be rejected.

• The above presentation assumes that the same transformation of the dependent variable is used by both models. MacKinnon, White and Davidson, Journal of Econometrics (1983), show how to deal with the case of different transformations.

• There are other tests available for non-nested models. The J-test is simple to apply when both models are linear in the parameters. The P-test is similar, but easier to apply when M_1 is nonlinear.
• Monte Carlo evidence shows that these tests often over-reject a correctly specified model. One can use bootstrap critical values to get better-performing tests.
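As a concrete illustration, the following Octave sketch (not from the original materials; the data-generating process, sample size and variable names are illustrative assumptions) carries out the J test of M_1 against M_2 for two linear models that share some regressors:

# J test sketch: test M1 (y on x2, x3) against M2 (y on x2, x4)
n = 200;
x2 = randn(n,1); x3 = randn(n,1); x4 = randn(n,1);
y = 1 + 0.5*x2 + 0.5*x3 + randn(n,1);   # here M1 is the true model

X = [ones(n,1), x2, x3];                # regressors of M1
Z = [ones(n,1), x2, x4];                # regressors of M2

gammahat = Z \ y;                       # OLS estimate of M2
yhat2 = Z*gammahat;                     # fitted values from M2
W = [X, yhat2];                         # augmented regression for the J test
theta = W \ y;
e = y - W*theta;
k = columns(W);
sig2 = (e'*e)/(n - k);
V = sig2 * inv(W'*W);
t_J = theta(end) / sqrt(V(end,end));    # t-stat on the M2 fitted values
printf("J test t-statistic: %f\n", t_J); # compare to N(0,1) critical values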
Chapter 9

Generalized least squares

Recall the assumptions of the classical linear regression model, in Section 3.6. One of the assumptions we've made up to now is that

ε_t ∼ IID(0, σ²)

or occasionally

ε_t ∼ IIN(0, σ²).

Now we'll investigate the consequences of nonidentically and/or dependently distributed errors. We'll assume fixed regressors for now, to keep the presentation simple, and later we'll look at the consequences of relaxing this admittedly unrealistic assumption. The model is

y = Xβ + ε
E(ε) = 0
V(ε) = Σ

where Σ is a general symmetric positive definite matrix (we'll write β in place of β_0 to simplify the typing of these notes).

• The case where Σ is a diagonal matrix gives uncorrelated, nonidentically distributed errors. This is known as heteroscedasticity: ∃ i, j such that V(ε_i) ≠ V(ε_j).

• The case where Σ has the same number on the main diagonal but nonzero elements off the main diagonal gives identically (assuming higher moments are also the same) dependently distributed errors. This is known as autocorrelation: ∃ i ≠ j such that E(ε_i ε_j) ≠ 0.

• The general case combines heteroscedasticity and autocorrelation. This is known as "nonspherical" disturbances, though why this term is used, I have no idea. Perhaps it's because under the classical assumptions, a joint confidence region for ε would be an n-dimensional hypersphere.
9.1 Effects of nonspherical disturbances on the OLS estimator

The least squares estimator is

β̂ = (X′X)⁻¹X′y = β + (X′X)⁻¹X′ε.

• We have unbiasedness, as before.

• The variance of β̂ is

E[(β̂ − β)(β̂ − β)′] = E[(X′X)⁻¹X′εε′X(X′X)⁻¹]
                      = (X′X)⁻¹X′ΣX(X′X)⁻¹     (9.1)

Due to this, any test statistic that is based upon an estimator of σ² is invalid, since there isn't any σ²: it doesn't exist as a feature of the true d.g.p. In particular, the formulas for the t, F, and χ² based tests given above do not lead to statistics with these distributions.

• β̂ is still consistent, following exactly the same argument given before.

• If ε is normally distributed, then

β̂ ∼ N(β, (X′X)⁻¹X′ΣX(X′X)⁻¹).

The problem is that Σ is unknown in general, so this distribution won't be useful for testing hypotheses.

• Without normality, and with stochastic X (e.g., weakly exogenous regressors) we still have

√n (β̂ − β) = √n (X′X)⁻¹X′ε = (X′X/n)⁻¹ n^{−1/2} X′ε.

Define the limiting variance of n^{−1/2}X′ε (supposing a CLT applies) as

lim_{n→∞} E(X′εε′X/n) = Ω, a.s.,

so we obtain

√n (β̂ − β) →d N(0, Q_X⁻¹ Ω Q_X⁻¹).

Note that the true asymptotic distribution of the OLS estimator has changed with respect to the results under the classical assumptions, so if we neglect to take this into account, the Wald and score tests will not be asymptotically valid. To see the invalidity of test procedures that are correct under the classical assumptions, when we have nonspherical errors, consider the Octave script GLS/EffectsOLS.m. This script does a Monte Carlo study, generating data that are either heteroscedastic or homoscedastic, and then computes the empirical rejection frequency of a nominally 10% t-test. When the data are heteroscedastic, we obtain something like what we see in Figure 9.1. This sort of heteroscedasticity
causes us to reject a true null hypothesis regarding the slope parameter much too often. You can experiment with the script to look at the effects of other sorts of HET, and to vary the sample size.

Figure 9.1: Rejection frequency of 10% t-test, H_0 is true.

Summary: OLS with heteroscedasticity and/or autocorrelation is:

• unbiased with fixed or strongly exogenous regressors
• biased with weakly exogenous regressors
• has a different variance than before, so the previous test statistics aren't valid
• is consistent
• is asymptotically normally distributed, but with a different limiting covariance matrix. Previous test statistics aren't valid in this case for this reason.
• is inefficient, as is shown below.
9.2 The GLS estimator

Suppose Σ were known. Then one could form the Cholesky decomposition

P′P = Σ⁻¹.

Here, P is an upper triangular matrix. We have

P′P Σ = I_n

so

P′P ΣP′ = P′,

which implies that

P ΣP′ = I_n.

Let's take some time to play with the Cholesky decomposition. Try out the GLS/cholesky.m Octave script to see that the above claims are true, and also to see how one can generate data from a N(0, V) distribution.

Consider the model

Py = PXβ + Pε,

or, making the obvious definitions,

y* = X*β + ε*.

The variance of ε* = Pε is

E(Pεε′P′) = PΣP′ = I_n.

Therefore, the model

y* = X*β + ε*
E(ε*) = 0
V(ε*) = I_n

satisfies the classical assumptions. The GLS estimator is simply OLS applied to the transformed model:

β̂_GLS = (X*′X*)⁻¹X*′y* = (X′P′PX)⁻¹X′P′Py = (X′Σ⁻¹X)⁻¹X′Σ⁻¹y.

The GLS estimator is unbiased in the same circumstances under which the OLS estimator is unbiased.
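Here is a minimal Octave sketch (not part of the original materials) showing that GLS can be computed either by OLS on the transformed model or directly from the GLS formula; the heteroscedastic covariance matrix used to generate the data is an illustrative assumption.

# GLS two ways: OLS on the transformed model, and the direct formula
n = 50;
X = [ones(n,1), randn(n,1)];
beta = [1; 2];
sig = linspace(0.5, 3, n)';          # assumed known standard deviations
Sigma = diag(sig.^2);
y = X*beta + sig .* randn(n,1);

P = chol(inv(Sigma));                # P'P = Sigma^{-1}
ystar = P*y;  Xstar = P*X;
b_transformed = Xstar \ ystar;       # OLS on the transformed model
b_direct = (X'*inv(Sigma)*X) \ (X'*inv(Sigma)*y);   # GLS formula
disp([b_transformed, b_direct]);     # the two columns should be identical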
For example, assuming X is nonstochastic,

E(β̂_GLS) = E[(X′Σ⁻¹X)⁻¹X′Σ⁻¹y] = E[(X′Σ⁻¹X)⁻¹X′Σ⁻¹(Xβ + ε)] = β.

To get the variance of the estimator, we have

β̂_GLS = (X*′X*)⁻¹X*′y* = (X*′X*)⁻¹X*′(X*β + ε*) = β + (X*′X*)⁻¹X*′ε*

so

E[(β̂_GLS − β)(β̂_GLS − β)′] = E[(X*′X*)⁻¹X*′ε*ε*′X*(X*′X*)⁻¹]
                              = (X*′X*)⁻¹X*′X*(X*′X*)⁻¹
                              = (X*′X*)⁻¹
                              = (X′Σ⁻¹X)⁻¹.

Either of these last formulas can be used.

• All the previous results regarding the desirable properties of the least squares estimator hold, when dealing with the transformed model, since the transformed model satisfies the classical assumptions.

• Tests are valid, using the previous formulas, as long as we substitute X* in place of X. Furthermore, any test that involves σ² can set it to 1. This is preferable to re-deriving the appropriate formulas.

• The GLS estimator is more efficient than the OLS estimator. This is a consequence of the Gauss-Markov theorem, since the GLS estimator is based on a model that satisfies the classical assumptions but the OLS estimator is not. To see this directly, note that

Var(β̂) − Var(β̂_GLS) = (X′X)⁻¹X′ΣX(X′X)⁻¹ − (X′Σ⁻¹X)⁻¹ = AΣA′

where A = (X′X)⁻¹X′ − (X′Σ⁻¹X)⁻¹X′Σ⁻¹. This may not seem obvious, but it is true, as you can verify for yourself. Then noting that AΣA′ is a quadratic form in a positive definite matrix, we conclude that AΣA′ is positive semi-definite, and that GLS is efficient relative to OLS.
• As one can verify by calculating first order conditions, the GLS estimator is the solution to the minimization problem

β̂_GLS = arg min (y − Xβ)′Σ⁻¹(y − Xβ),

so the metric Σ⁻¹ is used to weight the residuals.

9.3 Feasible GLS

The problem is that Σ ordinarily isn't known, so this estimator isn't available.

• Consider the dimension of Σ: it's an n × n matrix with (n² − n)/2 + n = (n² + n)/2 unique elements (remember, it is symmetric, because it's a covariance matrix).

• The number of parameters to estimate is larger than n and increases faster than n. There's no way to devise an estimator that satisfies a LLN without adding restrictions.

• The feasible GLS estimator is based upon making sufficient assumptions regarding the form of Σ so that a consistent estimator can be devised.

Suppose that we parameterize Σ as a function of X and θ, where θ may include β as well as other parameters, so that

Σ = Σ(X, θ)

where θ is of fixed dimension. If we can consistently estimate θ, we can consistently estimate Σ, as long as the elements of Σ(X, θ) are continuous functions of θ (by the Slutsky theorem). In this case,

Σ̂ = Σ(X, θ̂) → Σ(X, θ) in probability.

If we replace Σ in the formulas for the GLS estimator with Σ̂, we obtain the FGLS estimator. The FGLS estimator shares the same asymptotic properties as GLS. These are

1. Consistency
2. Asymptotic normality
3. Asymptotic efficiency if the errors are normally distributed (Cramer-Rao).
4. Test procedures are asymptotically valid.

In practice, the usual way to proceed is

1. Define a consistent estimator of θ. This is a case-by-case proposition, depending on the parameterization Σ(θ). We'll see examples below.
2. Form Σ̂ = Σ(X, θ̂).
3. Calculate the Cholesky factorization P̂ = Chol(Σ̂⁻¹).
4. Transform the model using

P̂y = P̂Xβ + P̂ε.

5. Estimate using OLS on the transformed model.

9.4 Heteroscedasticity

Heteroscedasticity is the case where

E(εε′) = Σ

is a diagonal matrix, so that the errors are uncorrelated, but have different variances. Heteroscedasticity is usually thought of as associated with cross sectional data, though there is absolutely no reason why time series data cannot also be heteroscedastic. Actually, the popular ARCH (autoregressive conditionally heteroscedastic) models
explicitly assume that a time series is heteroscedastic.

Consider a supply function

q_i = β_1 + β_p P_i + β_s S_i + ε_i

where P_i is price and S_i is some measure of size of the ith firm. One might suppose that unobservable factors (e.g., talent of managers, degree of coordination between production units, etc.) account for the error term ε_i. If there is more variability in these factors for large firms than for small firms, then ε_i may have a higher variance when S_i is high than when it is low.

Another example:

q_i = β_1 + β_p P_i + β_m M_i + ε_i

where P is price and M is income. In this case, ε_i can reflect variations in preferences. There are more possibilities for expression of preferences when one is rich, so it is possible that the variance of ε_i could be higher when M is high. Add example of group means.

OLS with heteroscedastic consistent varcov estimation

Eicker (1967) and White (1980) showed how to modify test statistics to account for heteroscedasticity of unknown form. The OLS estimator has asymptotic distribution

√n (β̂ − β) →d N(0, Q_X⁻¹ΩQ_X⁻¹)

as we've already seen. Recall that we defined

lim_{n→∞} E(X′εε′X/n) = Ω.

This matrix has dimension K × K and can be consistently estimated, even if we can't estimate Σ consistently. The consistent estimator, under heteroscedasticity but no autocorrelation, is

Ω̂ = (1/n) Σ_{t=1}^n x_t x_t′ ε̂²_t.

One can then modify the previous test statistics to obtain tests that are valid when there is heteroscedasticity of unknown form. For example, the Wald test for H_0: Rβ − r = 0 would be

n (Rβ̂ − r)′ [ R (X′X/n)⁻¹ Ω̂ (X′X/n)⁻¹ R′ ]⁻¹ (Rβ̂ − r) ∼ χ²(q), asymptotically.

To see the effects of ignoring HET when doing OLS, and the good effect of using a HET consistent covariance estimator, consider the script bootstrap_example1.m. This script generates data from a linear model with HET, then computes standard errors using the ordinary OLS formula, the Eicker-White formula, and also bootstrap standard errors. Note that Eicker-White and bootstrap pretty much agree, while the OLS formula gives standard errors that are quite different. Typical output of this script follows:

088 t-stat.369 -0.105 st.014674 Sigma-squared 0.090719 0.237 ********************************************************* OLS estimation results Observations 100 . -1.016 -0.084 0. 0.115 -0.err.197 -1.174 0.083376 0.189 p-value 0.695267 Results (Ordinary var-cov estimator) estimate 1 2 3 -0.octave:1> bootstrap_example1 Bootstrap standard errors 0.143284 ********************************************************* OLS estimation results Observations 100 R-squared 0.083 0.845 0.
• If you run this several times, you will notice that the OLS standard error for the last parameter appears to be biased downward, at least comparing to the other two methods, which are asymptotically valid.

• The true coefficients are zero. With a standard error biased downward, the t-test for lack of significance will reject more often than it should: the variables really are not significant, but we will find that they seem to be significant more often than is due to Type-I error.

• For example, you should see that the p-value for the last coefficient is smaller than 0.10 more than 10% of the time. Run the script 20 times and you'll see.
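The following Octave fragment (an illustrative sketch, not the bootstrap_example1.m script itself; the simulated heteroscedastic data are an assumption) shows how the Eicker-White heteroscedasticity-consistent covariance estimator is computed from OLS residuals:

# Eicker-White heteroscedasticity-consistent covariance of the OLS estimator (sketch)
n = 100;
X = [ones(n,1), randn(n,2)];
e_true = abs(X(:,2)) .* randn(n,1);      # heteroscedastic errors, for illustration
y = X*[0;0;0] + e_true;                  # true coefficients are zero

b = X \ y;                               # OLS
e = y - X*b;                             # residuals
XX_inv = inv(X'*X);
meat = X' * diag(e.^2) * X;              # sum over t of x_t x_t' e_t^2
V_hc = XX_inv * meat * XX_inv;           # HC covariance of b
se_ols = sqrt(diag(XX_inv * (e'*e)/(n - columns(X))));
se_hc  = sqrt(diag(V_hc));
disp([se_ols, se_hc]);                   # compare ordinary and HC standard errors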
Detection

There exist many tests for the presence of heteroscedasticity. We'll discuss three methods.

Goldfeld-Quandt

The sample is divided in to three parts, with n_1, n_2 and n_3 observations, where n_1 + n_2 + n_3 = n. The model is estimated using the first and third parts of the sample, separately, so that β̂¹ and β̂³ will be independent. Then we have

ε̂¹′ε̂¹/σ² = ε¹′M¹ε¹/σ² →d χ²(n_1 − K)

and

ε̂³′ε̂³/σ² = ε³′M³ε³/σ² →d χ²(n_3 − K)

so

[ε̂¹′ε̂¹/(n_1 − K)] / [ε̂³′ε̂³/(n_3 − K)] →d F(n_1 − K, n_3 − K).

The distributional result is exact if the errors are normally distributed. This test is a two-tailed test. Alternatively, and probably more conventionally, if one has prior ideas about the possible magnitudes of the variances of the observations, one could order the observations accordingly, from largest to smallest. In this case, one would use a conventional one-tailed F-test. Draw picture.

• Ordering the observations is an important step if the test is to have any power.

• The motive for dropping the middle observations is to increase the difference between the average variance in the subsamples, supposing that there exists heteroscedasticity. This can increase the power of the test. On the other hand, dropping too many observations will substantially increase the variance of the statistics ε̂¹′ε̂¹ and ε̂³′ε̂³. A rule of thumb, based on Monte Carlo experiments, is to drop around 25% of the observations.

• If one doesn't have any ideas about the form of the het, the test will probably have low power since a sensible data ordering isn't available.
White's test

When one has little idea if there exists heteroscedasticity, and no idea of its potential form, the White test is a possibility. The idea is that if there is homoscedasticity, then

E(ε²_t | x_t) = σ², ∀t,

so that x_t or functions of x_t shouldn't help to explain E(ε²_t). The test works as follows:

1. Since ε_t isn't available, use the consistent estimator ε̂_t instead.

2. Regress

ε̂²_t = σ² + z_t′γ + v_t

where z_t is a P-vector. z_t may include some or all of the variables in x_t, as well as other variables. White's original suggestion was to use x_t, plus the set of all unique squares and cross products of variables in x_t.

3. Test the hypothesis that γ = 0. The qF statistic in this case is

qF = [(ESS_R − ESS_U)/P] / [ESS_U/(n − P − 1)].

Note that ESS_R = TSS_U, so dividing both numerator and denominator by this we get

qF = (n − P − 1) R²/(1 − R²).

Note that this is the R² of the artificial regression used to test for heteroscedasticity, not the R² of the original model.

An asymptotically equivalent statistic, under the null of no heteroscedasticity (so that R² should tend to zero), is

nR² ∼ χ²(P), asymptotically.

This doesn't require normality of the errors, though it does assume that the fourth moment of ε_t is constant, under the null. Question: why is this necessary?

• The White test has the disadvantage that it may not be very powerful unless the z_t vector is chosen well, and this is hard to do without knowledge of the form of heteroscedasticity.

• It also has the problem that specification errors other than heteroscedasticity may lead to rejection.

• Note: the null hypothesis of this test may be interpreted as θ = 0 for the variance model V(ε²_t) = h(α + z_t′θ), where h(·) is an arbitrary function of unknown form. The test is more general than it may appear from the regression that is used.
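A minimal Octave sketch of the nR² form of the test (illustrative; y and X, with a constant in the first column, are assumed to already be in memory, and here z_t is built from the non-constant regressors and their squares):

# White's test, nR^2 form (sketch)
b = X \ y;
ehat = y - X*b;
Xnc = X(:, 2:end);                        # non-constant regressors
Z = [ones(rows(X),1), Xnc, Xnc.^2];       # squares only; cross products could be added
g = Z \ (ehat.^2);
v = ehat.^2 - Z*g;
R2 = 1 - sumsq(v)/sumsq(ehat.^2 - mean(ehat.^2));
P = columns(Z) - 1;
stat = rows(X) * R2;                      # asymptotically chi^2(P) under the null
pval = 1 - gammainc(stat/2, P/2);         # chi-square p-value via the incomplete gamma
printf("White test: nR^2 = %f, p-value = %f\n", stat, pval);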
Plotting the residuals

A very simple method is to simply plot the residuals (or their squares). Like the Goldfeld-Quandt test, this will be more informative if the observations are ordered according to the suspected form of the heteroscedasticity. Draw pictures here.
Correction

Correcting for heteroscedasticity requires that a parametric form for Σ(θ) be supplied, and that a means for estimating θ consistently be determined. The estimation method will be specific to the form supplied for Σ(θ). We'll consider two examples. Before this, let's consider the general nature of GLS when there is heteroscedasticity. When we have HET but no AUT, Σ is a diagonal matrix:

Σ = diag(σ₁², σ₂², ..., σ_n²).

Likewise, Σ⁻¹ is diagonal,

Σ⁻¹ = diag(1/σ₁², 1/σ₂², ..., 1/σ_n²),

and so is the Cholesky decomposition

P = chol(Σ⁻¹) = diag(1/σ₁, 1/σ₂, ..., 1/σ_n).

We need to transform the model, just as before, in the general case:

Py = PXβ + Pε,

or, making the obvious definitions,

y* = X*β + ε*.

Note that multiplying by P just divides the data for each observation (y_i, x_i) by the corresponding standard error of the error term, σ_i. That is, y*_i = y_i/σ_i and x*_i = x_i/σ_i (note that x_i is a K-vector: we divided each element, including the 1 corresponding to the constant). The GLS correction for the case of HET is also known as weighted least squares, for this reason: the GLS solution is equivalent to OLS on the transformed data, and the transformed data are the original data, weighted by the inverse of the standard error of the observation's error term. When the standard error is small, the weight is high, and vice versa. This makes sense. Consider Figure 9.2, which shows a true regression line with heteroscedastic errors. Which sample is more informative about the location of the line? The ones with observations with smaller variances.
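A minimal Octave sketch of the weighted least squares transformation (illustrative; it assumes X, y and a vector sig of known error standard deviations are in memory):

# Weighted least squares: divide each observation by its error standard deviation
ystar = y ./ sig;
Xstar = X ./ repmat(sig, 1, columns(X));
b_wls = Xstar \ ystar;      # equals (X' Sigma^{-1} X)^{-1} X' Sigma^{-1} y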
Figure 9.2: Motivation for GLS correction when there is HET.
Multiplicative heteroscedasticity

Suppose the model is

y_t = x_t′β + ε_t
σ²_t = E(ε²_t) = (z_t′γ)^δ

but the other classical assumptions hold. In this case

ε²_t = (z_t′γ)^δ + v_t

and v_t has mean zero. Nonlinear least squares could be used to estimate γ and δ consistently, were ε_t observable. The solution is to substitute the squared OLS residuals ε̂²_t in place of ε²_t, since ε̂²_t is consistent by the Slutsky theorem. Once we have γ̂ and δ̂, we can estimate σ²_t consistently using

σ̂²_t = (z_t′γ̂)^δ̂ → σ²_t in probability.

In the second step, we transform the model by dividing by the standard deviation:

y_t/σ̂_t = x_t′β/σ̂_t + ε_t/σ̂_t

or

y*_t = x*_t′β + ε*_t.

Asymptotically, this model satisfies the classical assumptions.

• This model is a bit complex in that NLS is required to estimate the model of the variance. A simpler version would be

y_t = x_t′β + ε_t
σ²_t = E(ε²_t) = σ² z_t^δ

where z_t is a single variable. There are still two parameters to be estimated, and the model of the variance is still nonlinear in the parameters. However, the search method can be used in this case to reduce the estimation problem to repeated applications of OLS.

• First, we define an interval of reasonable values for δ, e.g., δ ∈ [0, 3].

• Partition this interval into M equally spaced values, e.g., {0, .1, .2, ..., 2.9, 3}.

• For each of these values, calculate the variable z_t^{δ_m}.

• The regression

ε̂²_t = σ² z_t^{δ_m} + v_t

is linear in the parameters, conditional on δ_m, so one can estimate σ² by OLS.

• Save the pairs (σ²_m, δ_m), and the corresponding ESS_m. Choose the pair with the minimum ESS_m as the estimate.

• Next, divide the model by the estimated standard deviations.

• One can refine the search. Draw picture.

• This approach works well when the parameter to be searched over is low dimensional, as in this case.

Groupwise heteroscedasticity

A common case is where we have repeated observations on each of a number of economic agents: e.g., 10 years of macroeconomic data on each of a set of countries or regions, or daily observations of transactions of 200 banks. This sort of data is a pooled cross-section time-series model. It may be reasonable to presume that the variance is constant over time within the cross-sectional units, but that it differs across them (e.g., firms or countries of different sizes...). The model is

y_it = x_it′β + ε_it
E(ε²_it) = σ²_i, ∀t

where i = 1, 2, ..., G are the agents, and t = 1, 2, ..., n are the observations on each agent.

• The other classical assumptions are presumed to hold.

• In this case, the variance σ²_i is specific to each agent, but constant over the n observations for that agent.

• In this model, we assume that E(ε_it ε_is) = 0. This is a strong assumption that we'll relax later.

To correct for heteroscedasticity, just estimate each σ²_i using the natural estimator:

σ̂²_i = (1/n) Σ_{t=1}^n ε̂²_it.

• Note that we use 1/n here since it's possible that there are more than n regressors, so n − K could be negative. Asymptotically the difference is unimportant.

• With each of these, transform the model as usual:

y_it/σ̂_i = x_it′β/σ̂_i + ε_it/σ̂_i.

Do this for each cross-sectional group. This transformed model satisfies the classical assumptions, asymptotically.
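A short Octave sketch of this correction (illustrative, and not one of the original example scripts; y, X, the OLS residuals ehat, and a vector group of group labels are assumed to exist):

# Groupwise heteroscedasticity correction: one variance per group,
# then weight each observation by the inverse of its group standard deviation
groups = unique(group);
sighat = zeros(rows(y), 1);
for g = 1:numel(groups)
    idx = (group == groups(g));
    sighat(idx) = sqrt(mean(ehat(idx).^2));   # sigma_i hat for this group
end
ystar = y ./ sighat;
Xstar = X ./ repmat(sighat, 1, columns(X));
b_fgls = Xstar \ ystar;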
Example: the Nerlove model (again!)

Remember the Nerlove data (see sections 3.8 and 5.8). In what follows, we're going to use the model with the constant and output coefficient varying across 5 groups, but with the input price coefficients fixed (see Equation 5.5 for the rationale behind this). Let's check the Nerlove data for evidence of heteroscedasticity. Figure 9.3, which is generated by the Octave program GLS/NerloveResiduals.m, plots the residuals, with the observations sorted by firm size. We can see pretty clearly that the error variance is larger for small firms than for larger firms.

Figure 9.3: Residuals, Nerlove model, sorted by firm size.

Now let's try out some tests to formally check for heteroscedasticity. The Octave program GLS/HetTests.m performs the White and Goldfeld-Quandt tests, using the above model. The results are

               Value     p-value
White's test   61.886    0.000
GQ test        10.903    0.000

All in all, it is very clear that the data are heteroscedastic. That means that OLS estimation is not efficient, and tests of restrictions that ignore heteroscedasticity are not valid. The previous tests (CRTS, HOD1 and the Chow test) were calculated assuming homoscedasticity. The Octave program GLS/NerloveRestrictions-Het.m uses the Wald test to check for CRTS and HOD1, but using a heteroscedastic-consistent covariance estimator.
The results are

Testing HOD1
            Value    p-value
Wald test   6.161    0.013

Testing CRTS
            Value    p-value
Wald test   20.169   0.001

We see that the previous conclusions are altered: both CRTS and HOD1 are rejected at the 5% level. Maybe the rejection of HOD1 is due to the Wald test's tendency to over-reject?¹

From the previous plot, it seems that the variance of ε is a decreasing function of output. Suppose that the 5 size groups have different error variances (heteroscedasticity by groups):

Var(ε_i) = σ²_j,

where j = 1 if i = 1, 2, ..., 29, etc., as before. The Octave script GLS/NerloveGLS.m estimates the model using GLS (through a transformation of the model so that OLS can be applied). The estimation results are

OLS estimation results
Observations 145
R-squared 0.958822
Sigma-squared 0.090800

¹By the way, notice that GLS/NerloveResiduals.m and GLS/HetTests.m use the restricted LS estimator directly to restrict the fully general model with all coefficients varying to the model with only the constant and the output coefficient varying. But GLS/NerloveRestrictions-Het.m estimates the model by substituting the restrictions into the model. The methods are equivalent, but the second is more convenient and easier to understand.

046 -1. 1.818 p-value 0.462 1.007 0.237 0.897 0.101 0.586 0.err.498 -0.308 0.771 -3.000 0.450 -2.184 6.364 1.071 .134 0.363 7.052 -5.391 0.000 0.612 12.346 4.031 0.000 0.820 -1.032 6.149 -1.000 0.001 0.Results (Het.090 0.276 1. consistent var-cov estimator) estimate constant1 constant2 constant3 constant4 constant5 output1 output2 output3 output4 output5 labor fuel capital -1. -0.977 -3.975 0.081 0.000 0.184 -2.649 0.616 -4.253 t-stat.006 0.688 8.090 0.962 1.149 0.000 0.208 0.460 st.414 0.112 0.090 0.656 1.

********************************************************* ********************************************************* OLS estimation results Observations 145 R-squared 0.002 0.723 -2.917 0.108 -4.528 -3.494 st.580 -2.092393 Results (Het.808 p-value 0.497 -4.087 0. 0.988 1. consistent var-cov estimator) estimate constant1 constant2 constant3 constant4 -1.097 -3.013 0.err. -1.987429 Sigma-squared 1.000 .180 t-stat.327 1.

028 ********************************************************* Testing HOD1 Value Wald test 9.086 0.002 .755 12.217 0.093 0.274 0.000 0.165 -4.474 8.892 0.000 0.525 4.733 11.312 p-value 0.090 0.000 0.392 0.648 0.138 0.346 6.492 -0.000 0.684 0.000 0.000 0.765 0.294 -2.109 0.465 0.366 1.951 1.044 0.094 0.103 0.141 0.917 6.000 0.constant5 output1 output2 output3 output4 output5 labor fuel capital -5.
The first panel of output are the OLS estimation results, which are used to consistently estimate the σ²_j. The second panel of results are the GLS estimation results. Some comments:

• The R² measures are not comparable: the dependent variables are not the same. The measure for the GLS results uses the transformed dependent variable. One could calculate a comparable R² measure, but I have not done so.

• The differences in estimated standard errors (smaller in general for GLS) can be interpreted as evidence of improved efficiency of GLS, since the OLS standard errors are calculated using the Huber-White estimator. They would not be comparable if the ordinary (inconsistent) estimator had been used.

• Note that the previously noted pattern in the output coefficients persists. The nonconstant CRTS result is robust.

• Note that HOD1 is now rejected. Problem of Wald test overrejecting? Specification error in model?

• The coefficient on capital is now negative and significant at the 3% level. That seems to indicate some kind of problem with the model or the data.

9.5 Autocorrelation

Autocorrelation, which is the serial correlation of the error term, is a problem that is usually associated with time series data, but also can affect cross-sectional data. For example, a shock to oil prices will simultaneously affect all countries, so one could expect contemporaneous correlation of macroeconomic variables across countries.

Example

Consider the Keeling-Whorf data on atmospheric CO2 concentrations at Mauna Loa, Hawaii (see http://en.wikipedia.org/wiki/Keeling_Curve and http://cdiac.ornl.gov/ftp/ndp001/maunaloa.txt). From the file maunaloa.txt: "THE DATA FILE PRESENTED IN THIS SUBDIRECTORY CONTAINS MONTHLY AND ANNUAL ATMOSPHERIC CO2 CONCENTRATIONS DERIVED FROM THE SCRIPPS INSTITUTION OF OCEANOGRAPHY'S (SIO's) CONTINUOUS MONITORING PROGRAM AT MAUNA LOA OBSERVATORY, HAWAII. THIS RECORD CONSTITUTES THE LONGEST CONTINUOUS RECORD OF ATMOSPHERIC CO2 CONCENTRATIONS AVAILABLE IN THE WORLD. MONTHLY AND ANNUAL AVERAGE MOLE FRACTIONS OF CO2 IN WATER-VAPOR-FREE AIR ARE GIVEN FROM MARCH 1958 THROUGH DECEMBER 2003, EXCEPT FOR A FEW INTERRUPTIONS."

979239 Sigma-squared 5.data . we get the results octave:8> CO2Example warning: load: file found in load path ********************************************************* OLS estimation results Observations 468 R-squared 0.406 141.000 .918 0.The data is available in Octave format at CO2.err.001 t-stat. 1394.696791 Results (Het.000 0.227 0. consistent var-cov estimator) estimate 1 2 316.521 p-value 0. 0.121 st. If we ﬁt the model CO2t = β1 + β2t + t.
It seems pretty clear that CO2 concentrations have been going up in the last 50 years, surprise, surprise. Let's look at a residual plot for the last 3 years of the data; see Figure 9.4. Note that there is a very predictable pattern. This is pretty strong evidence that the errors of the model are not independent of one another, which means there seems to be autocorrelation.

Causes

Autocorrelation is the existence of correlation across the error term:

E(ε_t ε_s) ≠ 0, t ≠ s.

Why might this occur? Plausible explanations include

Figure 9.4: Residuals from time trend for CO2 data .
1. Lags in adjustment to shocks. In a model such as

y_t = x_t′β + ε_t,

one could interpret x_t′β as the equilibrium value. Suppose x_t is constant over a number of observations. One can interpret ε_t as a shock that moves the system away from equilibrium. If the time needed to return to equilibrium is long with respect to the observation frequency, one could expect ε_{t+1} to be positive, conditional on ε_t positive, which induces a correlation.

2. Unobserved factors that are correlated over time. The error term is often assumed to correspond to unobservable factors. If these factors are correlated, there will be autocorrelation.

3. Misspecification of the model. Suppose that the DGP is

y_t = β_0 + β_1x_t + β_2x_t² + ε_t

but we estimate

y_t = β_0 + β_1x_t + ε_t.

The effects are illustrated in Figure 9.5.

Effects on the OLS estimator

The variance of the OLS estimator is the same as in the case of heteroscedasticity: the standard formula does not apply. The correct formula is given in equation 9.1. Next we discuss two GLS corrections for OLS. These will potentially induce inconsistency when the regressors are stochastic (see Chapter 6) and should either not be used in that case (which is usually the relevant case) or used with caution. The more recommended procedure is discussed in section 9.5.1.

Figure 9.5: Autocorrelation induced by misspecification.
AR(1)

There are many types of autocorrelation. We'll consider two examples. The first is the most commonly encountered case: autoregressive order 1 (AR(1)) errors. The model is

y_t = x_t′β + ε_t
ε_t = ρε_{t−1} + u_t
u_t ∼ iid(0, σ²_u)
E(ε_t u_s) = 0, t < s.

We assume that the model satisfies the other classical assumptions.

• We need a stationarity assumption: |ρ| < 1. Otherwise the variance of ε_t explodes as t increases, so standard asymptotics will not apply.

• By recursive substitution we obtain

ε_t = ρε_{t−1} + u_t = ρ(ρε_{t−2} + u_{t−1}) + u_t = ρ²ε_{t−2} + ρu_{t−1} + u_t
    = ρ²(ρε_{t−3} + u_{t−2}) + ρu_{t−1} + u_t = ...

In the limit the lagged ε drops out, since ρ^m → 0 as m → ∞, so we obtain

ε_t = Σ_{m=0}^∞ ρ^m u_{t−m}.

With this, the variance of ε_t is found as

E(ε²_t) = σ²_u Σ_{m=0}^∞ ρ^{2m} = σ²_u/(1 − ρ²).

• If we had directly assumed that ε_t were covariance stationary, we could obtain this using

V(ε_t) = ρ²E(ε²_{t−1}) + 2ρE(ε_{t−1}u_t) + E(u²_t) = ρ²V(ε_t) + σ²_u,

so V(ε_t) = σ²_u/(1 − ρ²).

• The variance is the 0th order autocovariance: γ_0 = V(ε_t).

• Note that the variance does not depend on t.

Likewise, the first order autocovariance γ_1 is

Cov(ε_t, ε_{t−1}) = γ_1 = E[(ρε_{t−1} + u_t)ε_{t−1}] = ρV(ε_t) = ρσ²_u/(1 − ρ²).

• Using the same method, we find that for s < t,

Cov(ε_t, ε_{t−s}) = γ_s = ρ^s σ²_u/(1 − ρ²).

• The autocovariances don't depend on t: the process {ε_t} is covariance stationary.

The correlation (in general, for r.v.'s x and y) is defined as

corr(x, y) = cov(x, y)/[se(x)se(y)]

but in this case, the two standard errors are the same, so the s-order autocorrelation ρ_s is ρ_s = ρ^s.

• All this means that the overall matrix Σ has the form

Σ = [σ²_u/(1 − ρ²)] \begin{bmatrix} 1 & ρ & ρ² & \cdots & ρ^{n−1} \\ ρ & 1 & ρ & \cdots & ρ^{n−2} \\ \vdots & & \ddots & & \vdots \\ ρ^{n−1} & \cdots & & & 1 \end{bmatrix}

where the leading scalar is the variance and the matrix is the correlation matrix. So we have homoscedasticity, but elements off the main diagonal are not zero. All of this depends only on two parameters, ρ and σ²_u. If we can estimate these consistently, we can apply FGLS. It turns out that it's easy to estimate these consistently. The steps are

1. Estimate the model y_t = x_t′β + ε_t by OLS.
2. Take the residuals, and estimate the model

ε̂_t = ρε̂_{t−1} + u*_t.

Since ε̂_t → ε_t in probability, this regression is asymptotically equivalent to the regression

ε_t = ρε_{t−1} + u_t,

which satisfies the classical assumptions. Therefore, ρ̂ obtained by applying OLS to ε̂_t = ρε̂_{t−1} + u*_t is consistent. Also, since u*_t → u_t in probability, the estimator

σ̂²_u = (1/n) Σ_{t=2}^n (û*_t)² → σ²_u in probability.

3. With the consistent estimators σ̂²_u and ρ̂, form Σ̂ = Σ(σ̂²_u, ρ̂) using the previous structure of Σ, and estimate by FGLS.

One can recuperate the ﬁrst observation by . etc. since it cancels out in the formula ˆ ˆ βF GLS = X Σ−1X −1 ˆ (X Σ−1y). If one iterates to convergences it’s equivalent to MLE (supposing normal errors). • An asymptotically equivalent approach is to simply estimate the transformed model ˆ ˆ yt − ρyt−1 = (xt − ρxt−1) β + u∗ t using n − 1 observations (since y0 and x0 aren’t available). but it can be very important in small samples.ˆ2 can omit the factor σu/(1−ρ2). Dropping the ﬁrst observation is asymptotically irrelevant. by taking the ﬁrst FGLS estimator 2 of β. This is the method of Cochrane and Orcutt. • One can iterate the process. re-estimating ρ and σu.

putting
\[ y_1^* = y_1\sqrt{1-\hat{\rho}^2}, \qquad x_1^* = x_1\sqrt{1-\hat{\rho}^2}. \]
This somewhat odd-looking result is related to the Cholesky factorization of Σ⁻¹. See Davidson and MacKinnon, pg. 348-49 for more discussion. Note that the variance of y_1^* is σ_u², asymptotically, so we see that the transformed model will be homoscedastic (and nonautocorrelated, since the u's are uncorrelated with the y's in different time periods).
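To fix ideas, here is a minimal Octave sketch of the procedure just described, in the style of the scripts that accompany these notes (the function name ar1_fgls and the variable names are illustrative assumptions, not from the original). Rather than building Σ̂ explicitly, it applies the equivalent Prais-Winsten transformation, keeping the first observation; dropping the first row instead would give the Cochrane-Orcutt variant.

% FGLS for AR(1) errors: a minimal sketch. y is n x 1, x is n x k
% (x already contains whatever constant is needed).
function [beta_fgls, rho_hat] = ar1_fgls(y, x)
        n = rows(y);
        beta_ols = x \ y;                  % step 1: OLS
        e = y - x * beta_ols;              % OLS residuals
        rho_hat = e(1:n-1) \ e(2:n);       % step 2: regress e_t on e_{t-1}
        % step 3: Prais-Winsten transformation (quasi-differencing, with
        % the first observation scaled by sqrt(1 - rho^2))
        ys = [sqrt(1 - rho_hat^2) * y(1); y(2:n) - rho_hat * y(1:n-1)];
        xs = [sqrt(1 - rho_hat^2) * x(1,:); x(2:n,:) - rho_hat * x(1:n-1,:)];
        beta_fgls = xs \ ys;               % OLS on the transformed model = FGLS
endfunction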

MA(1). The linear regression model with moving average order 1 errors is
\[ y_t = x_t'\beta + \varepsilon_t, \qquad \varepsilon_t = u_t + \phi u_{t-1}, \qquad u_t \sim iid(0, \sigma_u^2), \qquad E(\varepsilon_t u_s) = 0, \; t < s. \]
In this case,
\[ V(\varepsilon_t) = \gamma_0 = E\left[(u_t + \phi u_{t-1})^2\right] = \sigma_u^2 + \phi^2\sigma_u^2 = \sigma_u^2(1+\phi^2). \]

Similarly
\[ \gamma_1 = E\left[(u_t + \phi u_{t-1})(u_{t-1} + \phi u_{t-2})\right] = \phi\sigma_u^2 \]
and
\[ \gamma_2 = E\left[(u_t + \phi u_{t-1})(u_{t-2} + \phi u_{t-3})\right] = 0, \]
so in this case
\[ \Sigma = \sigma_u^2 \begin{bmatrix} 1+\phi^2 & \phi & 0 & \cdots & 0 \\ \phi & 1+\phi^2 & \phi & & \vdots \\ 0 & \phi & \ddots & & 0 \\ \vdots & & & & \phi \\ 0 & \cdots & 0 & \phi & 1+\phi^2 \end{bmatrix}. \]

Note that the first order autocorrelation is
\[ \rho_1 = \frac{\gamma_1}{\gamma_0} = \frac{\phi\sigma_u^2}{\sigma_u^2(1+\phi^2)} = \frac{\phi}{1+\phi^2}. \]
• This achieves a maximum at φ = 1 and a minimum at φ = −1, and the maximal and minimal autocorrelations are 1/2 and −1/2. Therefore, series that are more strongly autocorrelated can't be MA(1) processes.
Again the covariance matrix has a simple structure that depends on only two parameters. The problem in this case is that one can't estimate φ using OLS on
\[ \hat{\varepsilon}_t = u_t + \phi u_{t-1} \]
because the u_t are unobservable and they can't be estimated consistently. However, there is a simple way to estimate the parameters.

• Since the model is homoscedastic, we can estimate
\[ V(\varepsilon_t) = \sigma_\varepsilon^2 = \sigma_u^2(1+\phi^2) \]
using the typical estimator:
\[ \hat{\sigma}_\varepsilon^2 = \widehat{\sigma_u^2(1+\phi^2)} = \frac{1}{n}\sum_{t=1}^{n}\hat{\varepsilon}_t^2. \]
• By the Slutsky theorem, we can interpret this as defining an (unidentified) estimator of both σ_u² and φ, e.g., use this as
\[ \hat{\sigma}_u^2\left(1+\hat{\phi}^2\right) = \frac{1}{n}\sum_{t=1}^{n}\hat{\varepsilon}_t^2. \]
However, this isn't sufficient to define consistent estimators of the parameters, since it's unidentified: one equation, two unknowns.

• To solve this problem, estimate the covariance of ε_t and ε_{t−1} using
\[ \widehat{\mathrm{Cov}}(\varepsilon_t, \varepsilon_{t-1}) = \widehat{\phi\sigma_u^2} = \frac{1}{n}\sum_{t=2}^{n}\hat{\varepsilon}_t\hat{\varepsilon}_{t-1}. \]
This is a consistent estimator, following a LLN (and given that the epsilon hats are consistent for the epsilons). As above, this can be interpreted as defining an unidentified estimator of the two parameters:
\[ \hat{\phi}\hat{\sigma}_u^2 = \frac{1}{n}\sum_{t=2}^{n}\hat{\varepsilon}_t\hat{\varepsilon}_{t-1}. \]
• Now solve these two equations to obtain identified (and therefore consistent) estimators of both φ and σ_u². Define the consistent estimator Σ̂ = Σ(φ̂, σ̂_u²) following the form we've seen above.
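A minimal Octave sketch of this moment-based estimation follows. The closed-form solution of the two equations, via the quadratic ρ̂₁φ² − φ + ρ̂₁ = 0 with the root satisfying |φ| ≤ 1, is my own algebraic step, not spelled out in the text; variable names are illustrative.

% Method-of-moments estimation of the MA(1) parameters from OLS residuals e
% (a sketch; e is assumed to be the n x 1 vector of OLS residuals)
n      = rows(e);
g0_hat = sum(e.^2) / n;                 % estimates sigma_u^2 * (1 + phi^2)
g1_hat = sum(e(2:n) .* e(1:n-1)) / n;   % estimates phi * sigma_u^2
r1     = g1_hat / g0_hat;               % first order autocorrelation; |r1| <= 1/2 for an MA(1)
% solve r1*phi^2 - phi + r1 = 0, taking the root with |phi| <= 1
phi_hat   = (1 - sqrt(1 - 4 * r1^2)) / (2 * r1);
sig2u_hat = g0_hat / (1 + phi_hat^2);
% Sigma_hat is the tridiagonal matrix given above
Sigma_hat = sig2u_hat * ((1 + phi_hat^2) * eye(n) + ...
            phi_hat * (diag(ones(n-1,1), 1) + diag(ones(n-1,1), -1)));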

As above, transform the model using the Cholesky decomposition. The transformed model satisfies the classical assumptions asymptotically.
• Note: there is no guarantee that Σ̂ estimated using the above method will be positive definite, which may pose a problem. Another method would be to use ML estimation, if one is willing to make distributional assumptions regarding the white noise errors.

Monte Carlo example: AR1
Let's look at a Monte Carlo study that compares OLS and GLS when we have AR1 errors. The model is
\[ y_t = 1 + x_t + \varepsilon_t, \qquad \varepsilon_t = \rho\varepsilon_{t-1} + u_t, \]

with ρ = 0.9. The sample size is n = 30, and 1000 Monte Carlo replications are done. The Octave script is GLS/AR1Errors.m. Figure 9.6 shows histograms of the estimated coefficient of x minus the true value. We can see that the GLS histogram is much more concentrated about 0, which is indicative of the efficiency of GLS relative to OLS.

Figure 9.6: Efficiency of OLS and FGLS, AR1 errors. (a) OLS (b) GLS

Asymptotically valid inferences with autocorrelation of unknown form
See Hamilton Ch. 10, pp. 261-2 and 280-84. When the form of autocorrelation is unknown, one may decide to use the OLS estimator, without correction. We've seen that this estimator has the limiting distribution
\[ \sqrt{n}\left(\hat{\beta} - \beta\right) \overset{d}{\to} N\left(0, Q_X^{-1}\Omega Q_X^{-1}\right), \]
where, as before, Ω is
\[ \Omega = \lim_{n\to\infty} E\left(\frac{X'\varepsilon\varepsilon'X}{n}\right). \]
We need a consistent estimate of Ω. Define m_t = x_t ε_t (recall that x_t is

defined as a K × 1 vector). Note that
\[ X'\varepsilon = \begin{bmatrix} x_1 & x_2 & \cdots & x_n \end{bmatrix}\begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix} = \sum_{t=1}^{n} x_t\varepsilon_t = \sum_{t=1}^{n} m_t, \]
so that
\[ \Omega = \lim_{n\to\infty} E\left[\frac{1}{n}\left(\sum_{t=1}^{n} m_t\right)\left(\sum_{t=1}^{n} m_t'\right)\right]. \]
We assume that m_t is covariance stationary (so that the covariance between m_t and m_{t−s} does not depend on t).

Define the v-th autocovariance of m_t as
\[ \Gamma_v = E(m_t m_{t-v}'). \]
Note that E(m_t m_{t+v}') = Γ_v'. (show this with an example). In general, we expect that:
• m_t will be autocorrelated, since ε_t is potentially autocorrelated:
\[ \Gamma_v = E(m_t m_{t-v}') \neq 0. \]
Note that this autocovariance does not depend on t, due to covariance stationarity.
• contemporaneously correlated (E(m_{it} m_{jt}) ≠ 0), since the regressors in x_t will in general be correlated (more on this later).
• and heteroscedastic (E(m_{it}²) = σ_i², which depends upon i), again since the regressors will have different variances.

While one could estimate Ω parametrically, we in general have little information upon which to base a parametric specification. Recent research has focused on consistent nonparametric estimators of Ω. Now define
\[ \Omega_n = E\left[\frac{1}{n}\left(\sum_{t=1}^{n} m_t\right)\left(\sum_{t=1}^{n} m_t'\right)\right]. \]
We have (show that the following is true, by expanding the sum and shifting rows to the left)
\[ \Omega_n = \Gamma_0 + \frac{n-1}{n}\left(\Gamma_1 + \Gamma_1'\right) + \frac{n-2}{n}\left(\Gamma_2 + \Gamma_2'\right) + \cdots + \frac{1}{n}\left(\Gamma_{n-1} + \Gamma_{n-1}'\right). \]
The natural, consistent estimator of Γ_v is
\[ \hat{\Gamma}_v = \frac{1}{n}\sum_{t=v+1}^{n} \hat{m}_t\hat{m}_{t-v}', \]

where m̂_t = x_t ε̂_t (note: one could put 1/(n − v) instead of 1/n here). So, a natural, but inconsistent, estimator of Ω_n would be
\[ \hat{\Omega}_n = \hat{\Gamma}_0 + \frac{n-1}{n}\left(\hat{\Gamma}_1 + \hat{\Gamma}_1'\right) + \frac{n-2}{n}\left(\hat{\Gamma}_2 + \hat{\Gamma}_2'\right) + \cdots + \frac{1}{n}\left(\hat{\Gamma}_{n-1} + \hat{\Gamma}_{n-1}'\right) = \hat{\Gamma}_0 + \sum_{v=1}^{n-1}\frac{n-v}{n}\left(\hat{\Gamma}_v + \hat{\Gamma}_v'\right). \]
This estimator is inconsistent in general, since the number of parameters to estimate is more than the number of observations, and increases more rapidly than n, so information does not build up as n → ∞. On the other hand, supposing that Γ_v tends to zero sufficiently

rapidly as v tends to ∞, a modified estimator
\[ \hat{\Omega}_n = \hat{\Gamma}_0 + \sum_{v=1}^{q(n)}\left(\hat{\Gamma}_v + \hat{\Gamma}_v'\right), \]
where q(n) → ∞ as n → ∞, will be consistent, provided q(n) grows sufficiently slowly.
• The assumption that autocorrelations die off is reasonable in many cases. For example, the AR(1) model with |ρ| < 1 has autocorrelations that die off.
• The term (n − v)/n can be dropped because it tends to one for v < q(n), given that q(n) increases slowly relative to n.
• A disadvantage of this estimator is that it may not be positive definite. This could cause one to calculate a negative χ² statistic, for example!

• Newey and West proposed an estimator (Econometrica, 1987) that solves the problem of possible nonpositive definiteness of the above estimator. Their estimator is
\[ \hat{\Omega}_n = \hat{\Gamma}_0 + \sum_{v=1}^{q(n)}\left(1 - \frac{v}{q+1}\right)\left(\hat{\Gamma}_v + \hat{\Gamma}_v'\right). \]
This estimator is p.d. by construction. The condition for consistency is that n^{-1/4} q(n) → 0. Note that this is a very slow rate of growth for q. This estimator is nonparametric: we've placed no parametric restrictions on the form of Ω. It is an example of a kernel estimator.
Finally, since Ω_n has Ω as its limit, Ω̂_n →p Ω. We can now use Ω̂_n and Q̂_X = (1/n) X'X to consistently estimate the limiting distribution of the OLS estimator under heteroscedasticity and autocorrelation of unknown form. With this, asymptotically valid tests are constructed in the usual way.
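A minimal Octave sketch of the Newey-West covariance estimator described above (the function name and the particular choice of the lag truncation q passed in are assumptions; any q(n) growing slowly enough works):

% Newey-West estimate of Omega and the HAC covariance of the OLS estimator.
% x is n x K, e is the n x 1 vector of OLS residuals, q is the lag truncation.
function V_hac = nw_cov(x, e, q)
        [n, K] = size(x);
        m = x .* e;                       % each row is m_t' = x_t' * e_t
        Omega = (m' * m) / n;             % Gamma_0
        for v = 1:q
                Gv = (m(v+1:n,:)' * m(1:n-v,:)) / n;    % Gamma_v
                Omega = Omega + (1 - v/(q+1)) * (Gv + Gv');
        endfor
        Qx = (x' * x) / n;
        V_hac = inv(Qx) * Omega * inv(Qx) / n;   % estimated var(beta_hat)
endfunction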

Testing for autocorrelation

Durbin-Watson test
The Durbin-Watson test is not strictly valid in most situations where we would like to use it. Nevertheless, it is encountered often enough so that one should know something about it. The Durbin-Watson test statistic is
\[ DW = \frac{\sum_{t=2}^{n}\left(\hat{\varepsilon}_t - \hat{\varepsilon}_{t-1}\right)^2}{\sum_{t=1}^{n}\hat{\varepsilon}_t^2} = \frac{\sum_{t=2}^{n}\left(\hat{\varepsilon}_t^2 - 2\hat{\varepsilon}_t\hat{\varepsilon}_{t-1} + \hat{\varepsilon}_{t-1}^2\right)}{\sum_{t=1}^{n}\hat{\varepsilon}_t^2}. \]
• The null hypothesis is that the first order autocorrelation of the errors is zero: H_0: ρ_1 = 0. The alternative is of course H_A: ρ_1 ≠ 0. Note that the alternative is not that the errors are AR(1), since many general patterns of autocorrelation will have the first order autocorrelation different from zero. For this reason the

test is useful for detecting autocorrelation in general. For the same reason, one shouldn't just assume that an AR(1) model is appropriate when the DW test rejects the null.
• Under the null, the middle term tends to zero, and the other two tend to one, so DW →p 2.
• Supposing that we had an AR(1) error process with ρ = 1. In this case the middle term tends to −2, so DW →p 0.
• Supposing that we had an AR(1) error process with ρ = −1. In this case the middle term tends to 2, so DW →p 4.
• These are the extremes: DW always lies between 0 and 4.
• The distribution of the test statistic depends on the matrix of regressors, X, so tables can't give exact critical values. They give upper and lower bounds, which correspond to the extremes that

are possible. See Figure 9.7. There are means of determining exact critical values conditional on X. • Note that DW can be used to test for nonlinearity (add discussion). • The DW test is based upon the assumption that the matrix X is ﬁxed in repeated samples. This is often unreasonable in the context of economic time series, which is precisely the context where the test would have application. It is possible to relate the DW test to other test statistics which are valid without strict exogeneity. Breusch-Godfrey test This test uses an auxiliary regression, as does the White test for heteroscedasticity. The regression is ˆ ˆ ˆ ˆ εt = xtδ + γ1εt−1 + γ2εt−2 + · · · + γP εt−P + vt

Figure 9.7: Durbin-Watson critical values

and the test statistic is the nR2 statistic, just as in the White test. There are P restrictions, so the test statistic is asymptotically distributed as a χ2(P ). • The intuition is that the lagged errors shouldn’t contribute to explaining the current error if there is no autocorrelation. ˆ • xt is included as a regressor to account for the fact that the εt are not independent even if the εt are. This is a technicality that we won’t go into here. • This test is valid even if the regressors are stochastic and contain lagged dependent variables, so it is considerably more useful than the DW test for typical time series data. • The alternative is not that the model is an AR(P), following the argument above. The alternative is simply that some or all of

the ﬁrst P autocorrelations are different from zero. This is compatible with many speciﬁc forms of autocorrelation.
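A minimal Octave sketch of the Breusch-Godfrey test (variable names, the lag length P, and the use of the centered R² from an auxiliary regression whose constant comes through x are assumptions; base Octave's gammainc supplies the χ² tail probability):

% Breusch-Godfrey test with P lags: regress the OLS residuals e on x and
% P lags of e, and use (number of observations) * R^2 from that regression.
% x is assumed to include a constant column.
P = 1;
n = rows(e);
Z = x(P+1:n, :);                       % contemporaneous regressors
for j = 1:P
        Z = [Z, e(P+1-j:n-j)];         % lagged residuals
endfor
edep  = e(P+1:n);
b_aux = Z \ edep;
u     = edep - Z * b_aux;
R2    = 1 - (u' * u) / sum((edep - mean(edep)).^2);
BG    = (n - P) * R2;                  % asymptotically chi^2(P) under the null
pval  = 1 - gammainc(BG / 2, P / 2);   % chi^2(P) upper tail probability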

Lagged dependent variables and autocorrelation

We’ve seen that the OLS estimator is consistent under autocorrelation, as long as plim X ε = 0. This will be the case when E(X ε) = 0, n following a LLN. An important exception is the case where X contains lagged y s and the errors are autocorrelated. Example 22. Dynamic model with MA1 errors. Consider the model yt = α + ρyt−1 + βxt +

t t

= υt + φυt−1

We can easily see that a regressor is not weakly exogenous:
\[ E(y_{t-1}\varepsilon_t) = E\left\{\left(\alpha + \rho y_{t-2} + \beta x_{t-1} + \upsilon_{t-1} + \phi\upsilon_{t-2}\right)\left(\upsilon_t + \phi\upsilon_{t-1}\right)\right\} \neq 0, \]

since one of the terms is E(φυ²_{t−1}), which is clearly nonzero. In this case E(x_t ε_t) ≠ 0 (the regressors include y_{t−1}), and therefore plim X'ε/n ≠ 0. Since
\[ \mathrm{plim}\,\hat{\beta} = \beta + \mathrm{plim}\,\frac{X'\varepsilon}{n}, \]
the OLS estimator is inconsistent in this case. One needs to estimate by instrumental variables (IV), which we'll get to later. The Octave script GLS/DynamicMA.m does a Monte Carlo study. The sample size is n = 100. The true coefficients are α = 1, ρ = 0.9 and β = 1. The MA parameter is φ = −0.95. Figure 9.8 gives the results. You can see that the constant and the autoregressive parameter have a lot of bias. By re-running the script with φ = 0, you will see that

much of the bias disappears (not all - why?).
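For readers without the script at hand, here is a minimal Monte Carlo sketch in the spirit of GLS/DynamicMA.m; it is not the original script, and the distribution of x_t and the number of replications are assumptions.

% OLS in a dynamic model with MA(1) errors: the constant and the AR
% parameter are estimated with substantial bias.
n = 100; reps = 1000; phi = -0.95;
alpha = 1; rho = 0.9; beta = 1;
b_store = zeros(reps, 3);
for r = 1:reps
        x = randn(n, 1);                     % assumed exogenous driver
        v = randn(n + 1, 1);
        e = v(2:end) + phi * v(1:end-1);     % MA(1) errors
        y = zeros(n, 1);
        y(1) = alpha + beta * x(1) + e(1);
        for t = 2:n
                y(t) = alpha + rho * y(t-1) + beta * x(t) + e(t);
        endfor
        X = [ones(n-1,1), y(1:n-1), x(2:n)];
        b_store(r, :) = (X \ y(2:n))';
endfor
disp(mean(b_store) - [alpha, rho, beta]);    % average bias of the three estimates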

Examples

Nerlove model, yet again
The Nerlove model uses cross-sectional data, so one may not think of performing tests for autocorrelation. However, specification error can induce autocorrelated errors. Consider the simple Nerlove model
\[ \ln C = \beta_1 + \beta_2\ln Q + \beta_3\ln P_L + \beta_4\ln P_F + \beta_5\ln P_K + \epsilon \]
and the extended Nerlove model
\[ \ln C = \sum_{j=1}^{5}\alpha_j D_j + \sum_{j=1}^{5}\gamma_j D_j\ln Q + \beta_L\ln P_L + \beta_F\ln P_F + \beta_K\ln P_K + \epsilon \]

discussed around equation 5.5. If you have done the exercises, you have seen evidence that the extended model is preferred. So if it is in fact the proper model, the simple model is misspeciﬁed. Let’s check if this misspeciﬁcation might induce autocorrelated errors.

Figure 9.8: Dynamic model with MA(1) errors. (a) α̂ − α (b) ρ̂ − ρ (c) β̂ − β

The Octave program GLS/NerloveAR.m estimates the simple Nerlove model, and plots the residuals as a function of ln Q, and it calculates a Breusch-Godfrey test statistic. The residual plot is in Figure 9.9 , and the test results are:

Value Breusch-Godfrey test 34.930 p-value 0.000

Clearly, there is a problem of autocorrelated residuals. Repeat the autocorrelation tests using the extended Nerlove model (Equation 5.5) to see the problem is solved. Klein model Klein’s Model I is a simple macroeconometric model. One of the equations in the model explains consumption (C) as a function of proﬁts (P ), both current and lagged, as well as the sum of wages in the private sector (W p) and wages in the government sector

Figure 9.9: Residuals of simple Nerlove model (residuals and a quadratic fit to the residuals, plotted against ln Q)

(W^g). Have a look at the README file for this data set. This gives the variable names and other information. Consider the model
\[ C_t = \alpha_0 + \alpha_1 P_t + \alpha_2 P_{t-1} + \alpha_3\left(W_t^p + W_t^g\right) + \varepsilon_{1t}. \]

The Octave program GLS/Klein.m estimates this model by OLS, plots the residuals, and performs the Breusch-Godfrey test, using 1 lag of the residuals. The estimation and test results are:

*********************************************************
OLS estimation results
Observations 21
R-squared 0.981008
Sigma-squared 1.051732

Results (Ordinary var-cov estimator)

                  estimate   st.err.   t-stat.   p-value
Constant            16.237     1.303    12.464     0.000
Profits              0.193     0.091     2.115     0.049
Lagged Profits       0.090     0.091     0.992     0.335
Wages                0.796     0.040    19.933     0.000

*********************************************************
                        Value    p-value
Breusch-Godfrey test    1.539      0.215

and the residual plot is in Figure 9.10. The test does not reject the null of nonautocorrelated errors, but we should remember that we have only 21 observations, so power is likely to be fairly low. The residual plot leads me to suspect that there may be autocorrelation

- there are some significant runs below and above the x-axis. Your opinion may differ. Since it seems that there may be autocorrelation, let's try an AR(1) correction. The Octave program GLS/KleinAR1.m estimates the Klein consumption equation assuming that the errors follow the AR(1) pattern. The results, with the Breusch-Godfrey test for remaining autocorrelation, are:

Figure 9.10: OLS residuals, Klein consumption equation (regression residuals plotted against observation number)

*********************************************************
OLS estimation results
Observations 21
R-squared 0.967090
Sigma-squared 0.983171

Results (Ordinary var-cov estimator)

                  estimate   st.err.   t-stat.   p-value
Constant            16.992     1.492    11.388     0.000
Profits              0.215     0.096     2.232     0.039
Lagged Profits       0.076     0.094     0.806     0.431
Wages                0.774     0.048    16.234     0.000

*********************************************************
                        Value    p-value
Breusch-Godfrey test    2.129      0.345

• The test is farther away from the rejection region than before, and the residual plot is a bit more favorable for the hypothesis of nonautocorrelated residuals, IMHO. For this reason, it seems that the AR(1) correction might have improved the estimation. • Nevertheless, there has not been much of an effect on the esti-

mated coefﬁcients nor on their estimated standard errors. This is probably because the estimated AR(1) coefﬁcient is not very large (around 0.2) • The existence or not of autocorrelation in this model will be important later, in the section on simultaneous equations.

9.6 Exercises

1. Comparing the variances of the OLS and GLS estimators, I claimed that the following holds:
\[ Var(\hat{\beta}) - Var(\hat{\beta}_{GLS}) = A\Sigma A'. \]
Verify that this is true.

2. Show that the GLS estimator can be defined as
\[ \hat{\beta}_{GLS} = \arg\min_\beta (y - X\beta)'\Sigma^{-1}(y - X\beta). \]
3. The limiting distribution of the OLS estimator with heteroscedasticity of unknown form is
\[ \sqrt{n}\left(\hat{\beta} - \beta\right) \overset{d}{\to} N\left(0, Q_X^{-1}\Omega Q_X^{-1}\right), \]
where
\[ \lim_{n\to\infty} E\left(\frac{X'\varepsilon\varepsilon'X}{n}\right) = \Omega. \]
Explain why
\[ \hat{\Omega} = \frac{1}{n}\sum_{t=1}^{n} x_t x_t'\hat{\varepsilon}_t^2 \]
is a consistent estimator of this matrix.
4. Define the v-th autocovariance of a covariance stationary process m_t, where E(m_t) = 0, as Γ_v = E(m_t m'_{t−v}). Show that E(m_t m'_{t+v}) = Γ_v'.
5. For the Nerlove model with dummies and interactions discussed above (see Section 9.4 and equation 5.5)
\[ \ln C = \sum_{j=1}^{5}\alpha_j D_j + \sum_{j=1}^{5}\gamma_j D_j\ln Q + \beta_L\ln P_L + \beta_F\ln P_F + \beta_K\ln P_K + \epsilon \]

above, we did a GLS correction based on the assumption that there is HET by groups (V(ε_t|x_t) = σ_j²). Let's assume that this

model is correctly speciﬁed, except that there may or may not be HET, and if it is present it may be of the form assumed, or perhaps of some other form. What happens if the assumed form of HET is incorrect?

(a) Is the "FGLS" based on the assumed form of HET consistent?
(b) Is it efficient? Is it likely to be efficient with respect to OLS?
(c) Are hypothesis tests using the "FGLS" estimator valid? If not, can they be made valid following some procedure? Explain.
(d) Are the t-statistics reported in Section 9.4 valid?
(e) Which estimator do you prefer, the OLS estimator or the FGLS estimator? Discuss.
6. Consider the model
\[ y_t = C + A_1 y_{t-1} + \epsilon_t, \qquad E(\epsilon_t\epsilon_t') = \Sigma, \qquad E(\epsilon_t\epsilon_s') = 0, \; t \neq s, \]
where y_t and ε_t are G × 1 vectors, C is a G × 1 vector of constants, and A_1 is a G × G matrix of parameters. The matrix Σ is a G × G covariance matrix. Assume that we have n observations. This is a vector autoregressive model of order 1, commonly referred to as a VAR(1) model.
(a) Show how the model can be written in the form Y = Xβ + ν, where Y is a Gn × 1 vector, β is a (G + G²) × 1 parameter vector, and the other items are conformable. What is the structure of X? What is the structure of the covariance matrix of ν?
(b) This model has HET and AUT. Verify this statement.
(c) Simulate data from this model, then estimate the model using OLS and feasible GLS. You should find that the two estimators are identical, which might seem surprising, given that there is HET and AUT.

(d) (advanced). Prove analytically that the OLS and GLS estimators are identical. Hint: this model is of the form of seemingly unrelated regressions.
7. Consider the model
\[ y_t = \alpha + \rho_1 y_{t-1} + \rho_2 y_{t-2} + \epsilon_t, \]
where ε_t is a N(0, 1) white noise error. This is an autoregressive model of order 2 (AR2) model. Suppose that data is generated from the AR2 model, but the econometrician mistakenly decides to estimate an AR1 model (y_t = α + ρ_1 y_{t−1} + ε_t).
(a) simulate data from the AR2 model, setting ρ_1 = 0.5 and ρ_2 = 0.4, using a sample size of n = 30
(b) estimate the AR1 model by OLS, using the simulated data
(c) test the hypothesis that ρ_1 = 0.5
(d) test for autocorrelation using the test of your choice
(e) repeat the above steps 10000 times.
i. What percentage of the time does a t-test reject the hypothesis that ρ_1 = 0.5?
ii. What percentage of the time is the hypothesis of no autocorrelation rejected?
(f) discuss your findings. Include a residual plot for a representative sample.
8. Modify the script given earlier in this section so that the first observation is dropped, rather than given special treatment. This corresponds to using the Cochrane-Orcutt method, whereas the script as provided implements the Prais-Winsten method. Check if there is an efficiency loss when the first observation is dropped.

Chapter 10

Endogeneity and simultaneity

Several times we've encountered cases where correlation between regressors and the error term leads to biasedness and inconsistency of the OLS estimator. Cases include autocorrelation with lagged dependent variables (Example 22) and measurement error in the regressors

(Example 19). Another important case is that of simultaneous equations. Simultaneous equations is a different prospect. The cause is different, but the effect is the same.

10.1 Simultaneous equations
Up until now our model is
\[ y = X\beta + \varepsilon, \]
where we assume weak exogeneity of the regressors, so that E(x_t ε_t) = 0. With weak exogeneity, the OLS estimator has desirable large sample properties (consistency, asymptotic normality). An example of a

simultaneous equation system is a simple supply-demand system:
\[ \text{Demand: } q_t = \alpha_1 + \alpha_2 p_t + \alpha_3 y_t + \varepsilon_{1t} \]
\[ \text{Supply: } q_t = \beta_1 + \beta_2 p_t + \varepsilon_{2t} \]
\[ E\left(\begin{bmatrix} \varepsilon_{1t} \\ \varepsilon_{2t} \end{bmatrix}\begin{bmatrix} \varepsilon_{1t} & \varepsilon_{2t} \end{bmatrix}\right) = \begin{bmatrix} \sigma_{11} & \sigma_{12} \\ \cdot & \sigma_{22} \end{bmatrix} \equiv \Sigma, \quad \forall t. \]
The presumption is that q_t and p_t are jointly determined at the same time by the intersection of these equations. We'll assume that y_t is determined by some unrelated process. It's easy to see that we have correlation between regressors and errors. Solving for p_t:
\[ \alpha_1 + \alpha_2 p_t + \alpha_3 y_t + \varepsilon_{1t} = \beta_1 + \beta_2 p_t + \varepsilon_{2t} \]
\[ \beta_2 p_t - \alpha_2 p_t = \alpha_1 - \beta_1 + \alpha_3 y_t + \varepsilon_{1t} - \varepsilon_{2t} \]
\[ p_t = \frac{\alpha_1 - \beta_1}{\beta_2 - \alpha_2} + \frac{\alpha_3 y_t}{\beta_2 - \alpha_2} + \frac{\varepsilon_{1t} - \varepsilon_{2t}}{\beta_2 - \alpha_2} \]

Now consider whether p_t is uncorrelated with ε_{1t}:
\[ E(p_t\varepsilon_{1t}) = E\left\{\left(\frac{\alpha_1 - \beta_1}{\beta_2 - \alpha_2} + \frac{\alpha_3 y_t}{\beta_2 - \alpha_2} + \frac{\varepsilon_{1t} - \varepsilon_{2t}}{\beta_2 - \alpha_2}\right)\varepsilon_{1t}\right\} = \frac{\sigma_{11} - \sigma_{12}}{\beta_2 - \alpha_2}. \]
Because of this correlation, weak exogeneity does not hold, and OLS estimation of the demand equation will be biased and inconsistent. The same applies to the supply equation, for the same reason.
In this model, q_t and p_t are the endogenous variables (endogs), that are determined within the system. y_t is an exogenous variable (exogs). These concepts are a bit tricky, and we'll return to it in a minute. First, some notation. Suppose we group together current endogs in the vector Y_t. If there are G endogs, Y_t is G × 1. Group current and lagged exogs, as well as lagged endogs, in the vector X_t, which is K × 1. Stack the errors of the G equations into the error vector E_t.
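A small simulation makes the point concrete: OLS applied to the demand equation does not recover α_2, because p_t is correlated with ε_{1t}. All numerical parameter values below are illustrative assumptions.

% Simultaneity bias in the demand equation of the supply-demand system
n = 10000;
alpha = [ 2; -1; 1];      % demand: q = a1 + a2*p + a3*y + e1, with a2 < 0
beta  = [ 0.5; 1];        % supply: q = b1 + b2*p + e2,        with b2 > 0
y  = randn(n, 1);         % exogenous income
e1 = randn(n, 1);  e2 = randn(n, 1);
% reduced form for price, then quantity from the supply equation
p = (alpha(1) - beta(1) + alpha(3) * y + e1 - e2) / (beta(2) - alpha(2));
q = beta(1) + beta(2) * p + e2;
% OLS on the demand equation: the price coefficient is biased away from alpha(2)
bhat = [ones(n,1), p, y] \ q;
disp([bhat'; alpha']);    % compare estimates with the true demand parameters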

The model. t = s There are G equations here. Ψ) . with additional assumtions. ∀t E(EtEs) = 0. Σ). We can stack all n observations and write the model as Y Γ = XB + E E(X E) = 0(K×G) vec(E) ∼ N (0. can be written as Yt Γ = Xt B + Et Et ∼ N (0. and the parameters that enter into each equation are contained in the columns of the matrices Γ and B.

• There is a normality assumption. .where Y1 X1 E1 E2 X2 Y2 Y = . . in that there are as many equations as endogs. X is n × K. • Since there is no autocorrelation of the Et ’s. . This isn’t necessary. . but allows us to consider the relationship between least squares and ML estimators.E = . and E is n × G. En Xn Yn Y is n × G. . • This system is complete. and since the .X = .

the regressors are weakly exogenous if E(Et|Xt) = 0. lagged endogenous variables are .. These variables are predetermined. σ22In . . • We need to deﬁne what is meant by “endogenous” and “exogenous” when classifying the current period variables. · = In ⊗ Σ σGGIn • X may contain lagged endogenous and exogenous variables.columns of E are individually homoscedastic. then σ11In σ12In · · · σ1GIn . Endogenous regressors are those for which this assumption does not hold. . Remember the deﬁnition of weak exogeneity Assumption 15. As long as there is no autocorrelation. Ψ = ..

.2 Reduced form Recall that the model is Yt Γ = Xt B + Et V (Et) = Σ This is the model in structural form.weakly exogenous. [Structural form] An equation is in structural form when more than one current period endogenous variable is included. Deﬁnition 23. 10.

Deﬁnition 24. It is Yt = Xt BΓ−1 + EtΓ−1 = Xt Π + Vt Now only one current period endog appears in each equation. The reduced form for quantity is obtained by solving the supply equation for price and substituting into demand: . This is the reduced form. [Reduced form] An equation is in reduced form if only one current period endog is included. An example is our supply/demand system.The solution for the current period endogs is easy to ﬁnd.

the rf for price is β1 + β2pt + ε2t = α1 + α2pt + α3yt + ε1t β2pt − α2pt = α1 − β1 + α3yt + ε1t − ε2t α1 − β1 α3 y t ε1t − ε2t pt = + + β2 − α2 β2 − α2 β2 − α2 = π12 + π22yt + V2t The interesting thing about the rf is that the equations individually satisfy the classical assumptions. since yt is uncorrelated with ε1t and .qt = β2qt − α2qt = qt = = qt − β1 − ε2t α1 + α2 + α3yt + ε1t β2 β2α1 − α2 (β1 + ε2t) + β2α3yt + β2ε1t β2α1 − α2β1 β2α3yt β2ε1t − α2ε2t + + β2 − α2 β2 − α2 β2 − α2 π11 + π21yt + V1t Similarly.

so the ﬁrst rf equation is homoscedastic.ε2t by assumption. i=1. so are the Vt. The errors of the rf are V1t V2t The variance of V1t is β2ε1t − α2ε2t β2ε1t − α2ε2t V (V1t) = E β2 − α2 β2 − α2 2 β2 σ11 − 2β2α2σ12 + α2σ22 = (β2 − α2)2 • This is constant over time. ∀t. since the εt are independent over time. and therefore E(ytVit) = 0. • Likewise. = β2 ε1t −α2 ε2t β2 −α2 ε1t −ε2t β2 −α2 .2.

The variance of the second rf error is ε1t − ε2t ε1t − ε2t V (V2t) = E β2 − α2 β2 − α2 σ11 − 2σ12 + σ22 = (β2 − α2)2 and the contemporaneous covariance of the errors across equations is E(V1tV2t) = E β2ε1t − α2ε2t ε1t − ε2t β2 − α2 β2 − α2 β2σ11 − (β2 + α2) σ12 + σ22 = (β2 − α2)2 • In summary the rf equations individually satisfy the classical assumptions. but they are contemporaneously correlated. . under the assumtions we’ve made.

From the reduced form. ∀t and that the Vt are timewise independent (note that this wouldn’t be the case if the Et were autocorrelated). Γ−1 ΣΓ−1 .1) . we can easily see that the endogenous variables are correlated with the structural errors: E(EtYt ) = E Et Xt BΓ−1 + EtΓ−1 = E EtXt BΓ−1 + EtEtΓ−1 = ΣΓ−1 (10.The general form of the rf is Yt = Xt BΓ−1 + EtΓ−1 = Xt Π + Vt so we have that Vt = Γ−1 Et ∼ N 0.

3 Bias and inconsistency of OLS estimation of a structural equation Considering the ﬁrst equation (this is without loss of generality. since we can always reorder the equations) we can partition the Y matrix as Y = y Y 1 Y2 • y is the ﬁrst column • Y1 are the other endogenous variables that enter the ﬁrst equation • Y2 are endogs that are excluded from this equation Similarly.10. partition X as X = X1 X 2 .

These are normalization restrictions that simply scale the remaining coefﬁcients on each equation. Finally. and which scale the variances of the error terms. Given this scaling and our partitioning. the coefﬁcient matrices can be written as 1 Γ12 −γ1 Γ22 Γ = 0 Γ32 B = β1 B12 0 B22 . and X2 are the excluded exogs. partition the error matrix as E = ε E12 Assume that Γ has ones on the main diagonal.• X1 are the included exogs.

So.1. the endogenous variables are correlated with the structural errors. is that the columns of Z corresponding to Y1 are correlated with ε. as we’ve seen.2) = (Z Z) −1 Z Zδ 0 + ε −1 = δ 0 + (Z Z) Z .With this. E(Z ) =0. and as we saw in equation 10. because these are endogenous variables. the ﬁrst equation can be written as y = Y1γ1 + X1β1 + ε = Zδ + ε The problem. so they don’t satisfy weak exogeneity. What are the properties of the OLS estimator in this situation? −1 ˆ δ = (Z Z) Z y (10.

a. whether due to measurement error. or omitted regressors. a. and lim ZnZ = QZ .s.a. Also.It’s clear that the OLS estimator is biased in general. Z So the OLS estimator of a structural equation is inconsistent.s. . In general. ˆ δ − δ0 = ZZ n −1 Z n Say that lim Zn = A.. correlation between regressors and errors leads to this problem.s. simultaneity. Then ˆ lim δ − δ 0 = Q−1A = 0.

10. 10.5 Identiﬁcation by exclusion restrictions The identiﬁcation problem in simultaneous equations is in fact of the same nature as the identiﬁcation problem in any estimation setting: does the limiting objective function have the proper curvature so that there is a unique global minimum or maximum at the true parameter value? In the context of IV estimation. I’m leaving this here for possible future reference.4 Note about the rest of this chaper In class. This matrix is ˆ V∞(βIV ) = (QXW Q−1W QXW )−1σ 2 W . I will not teach the material in the rest of this chapter at this time. You need study GMM before reading the rest of this chapter. this is the case if the limiting 1 covariance of the IV estimator is positive deﬁnite and plim n W ε = 0.

and that the instruments be (asymptotically) uncorrelated with ε. Necessary conditions If we use IV estimation for a single equation of the system.• The necessary and sufﬁcient condition for identiﬁcation is simply that this matrix be positive deﬁnite. • These identiﬁcation conditions are not that intuitive nor is it very obvious how to check them. we need that the conditions noted above hold: QW W must be positive deﬁnite and QXW must be of full rank ( K ). • For this matrix to be positive deﬁnite. the equation can be written as y = Zδ + ε .

Using this notation. • It turns out that X exhausts the set of possible instruments. and let K ∗∗ = K − K ∗ be the number of excluded exogs (in this equation). consider the selection of instruments. in that if the variables in X don’t lead to an identiﬁed model then . • Now the X1 are weakly exogenous and can serve as their own instruments.where Z = Y1 X1 Notation: • Let K be the total numer of weakly exogenous variables. • Let K ∗ = cols(X1) be the number of included exogs. • Let G∗ = cols(Y1) + 1 be the total number of included endogs. and let G∗∗ = G − G∗ be the number of excluded endogs.

Assuming this is true (we’ll prove it in a moment). K ∗∗ ≥ G∗ − 1 • To show that this is in fact a necessary condition consider some . so W will not have full column rank: ρ(W ) < K ∗ + G∗ − 1 ⇒ ρ(QZW ) < K ∗ + G∗ − 1 This is the order condition for identiﬁcation in a set of simultaneous equations..g. When the only identifying information is exclusion restrictions on the variables that enter an equation. then the number of excluded exogs must be greater than or equal to the number of included endogs.no other instruments will identify the model either. then a necessary condition for identiﬁcation is that cols(X2) ≥ cols(Y1) since if not then at least one instrument must be used twice. minus 1 (the normalized lhs endog). e.

A necessary condition for identiﬁcation is that 1 ρ plim W Z n where Z = Y1 X1 Recall that we’ve partitioned the model Y Γ = XB + E as Y = y Y1 Y2 X = X1 X 2 = K ∗ + G∗ − 1 .arbitrary set of instruments W.

the cross between W and V1 converges in probability to zero. so 1 1 plim W Z = plim W n n X1Π12 + X2Π22 X1 = X 1 X2 π11 Π12 Π13 π21 Π22 Π23 + v V1 V2 .Given the reduced form Y = XΠ + V we can write the reduced form using the same partition y Y 1 Y2 so we have Y1 = X1Π12 + X2Π22 + V1 so 1 1 W Z = W X1Π12 + X2Π22 + V1 X1 n n Because the W ’s are uncorrelated with the V1 ’s. by assumption.

and the identiﬁcation condition fails. G∗ − 1 > K ∗∗ In this case. regardless of the choice of instruments. then it is not of full column rank.Since the far rhs term is formed only of linear combinations of columns of X. the rank of this matrix can never be greater than K. . If Z has more than K columns. When Z has more than K columns we have G∗ − 1 + K ∗ > K or noting that K ∗∗ = K − K ∗. the limiting matrix is not of full column rank.

we’re only considering identiﬁcation through zero restricitions on the parameters. We’ve already identiﬁed necessary conditions. Turning to sufﬁcient conditions (again. The model is Yt Γ = Xt B + Et V (Et) = Σ .Sufﬁcient conditions Identiﬁcation essentially requires that the structural parameters be recoverable from the data. for the moment). in general. This won’t be the case. unless the structural model is subject to some restrictions.

To see this.This leads to the reduced form Yt = Xt BΓ−1 + EtΓ−1 = Xt Π + Vt V (Vt) = Γ−1 ΣΓ−1 = Ω The reduced form parameters are consistently estimable. and there are no restrictions on their values. The problem is that more than one structural form has the same reduced form. but none of them are known a priori. consider the model Yt ΓF = Xt BF + EtF V (EtF ) = F ΣF . so knowledge of the reduced form parameters alone isn’t enough to determine the structural parameters.

where F is some arbirary nonsingular G × G matrix. the covariance of the rf of the transformed model is V (EtF (ΓF )−1) = V (EtΓ−1) = Ω Since the two structural forms lead to the same rf. What we need for identiﬁcation are restrictions on Γ and B such that the only admissible F is an identity matrix (if all of the equa- . The rf of this new model is Yt = Xt BF (ΓF )−1 + EtF (ΓF )−1 = Xt BF F −1Γ−1 + EtF F −1Γ−1 = Xt BΓ−1 + EtΓ−1 = Xt Π + Vt Likewise. and the rf is all that is directly estimable. the models are said to be observationally equivalent.

Take the coefﬁcient matrices as partitioned before: 1 −γ1 =0 β1 0 Γ12 Γ B Γ22 Γ32 B12 B22 The coefﬁcients of the ﬁrst equation of the transformed model are simply these coefﬁcients multiplied by the ﬁrst column of F .tions are to be identiﬁed). This gives 1 −γ1 =0 β1 0 Γ12 Γ22 f 11 Γ32 F2 B12 B22 Γ B f11 F2 .

For identiﬁcation of the ﬁrst equation we need that there be enough restrictions so that the only admissible f11 F2 be the leading column of an identity matrix. so that 1 Γ12 1 −γ1 Γ22 −γ1 f 11 =0 0 Γ32 F2 β1 B12 β1 0 B22 0 Note that the third and ﬁfth rows are Γ32 B22 F2 = 0 0 .

e. is if F2 is a vector of zeros.g. Given that F2 is a vector of zeros. ρ Γ32 B22 = cols Γ32 B22 =G−1 then the only way this can hold. as long as ρ Γ32 B22 =G−1 f11 F2 = 1 ⇒ f11 = 1 . without additional restrictions on the model’s parameters.. then the ﬁrst equation 1 Γ12 Therefore.Supposing that the leading matrix is of full column rank.

then f11 F2 = 1 0G−1 The ﬁrst equation is identiﬁed in this case. The above result is fairly intuitive (draw picture here). since the condition implies that this submatrix must have at least G − 1 rows. The necessary condition ensures that there are enough variables not in the . we obtain G − G∗ + K ∗∗ ≥ G − 1 or K ∗∗ ≥ G∗ − 1 which is the previously derived necessary condition. It is also necessary. Since this matrix has G∗∗ + K ∗∗ = G − G∗ + K ∗∗ rows. so the condition is sufﬁcient for identiﬁcation.

• When K ∗∗ > G∗ − 1. to . Since estimation by IV with more instruments is more efﬁcient asymptotically. When an equation is overidentiﬁed we have more instruments than are strictly necessary for consistent estimation. since one could drop a restriction and still retain consistency. is is exactly identiﬁed. the equation is overidentiﬁed. Some points: • When an equation has K ∗∗ = G∗ − 1. Overidentifying restrictions are therefore testable. The sufﬁcient condition ensures that those other equations in fact do move around as the variables change their values. in that omission of an identiﬁying restriction is not possible without loosing consistency. so as to trace out the equation of interest. one should employ overidentifying restrictions if one is conﬁdent that they’re true. • We can repeat this partition for each equation in the system.equation of interest to potentially move the other equations.

the above conditions aren’t necessary for identiﬁcation. These include 1. e.g. Restrictions on the covariance matrix of the errors 4.. There are other sorts of identifying information that can be used. Nonlinearities in variables • When these sorts of information are available. though they are of course still sufﬁcient. and through the use of a normalization. by exclusion restrictions. .see which equations are identiﬁed and which aren’t. Cross equation restrictions 2. • These results are valid assuming that the only identifying information comes from knowing which variables appear in which equations. Additional restrictions on parameters within equations (as in the Klein model discussed below) 3.

suppose that we have the restriction Σ21 = 0. so it fails the order (necessary) condition for identiﬁcation. this equation is identiﬁed. This is a triangular system of equations.To give an example of how other information can be used. In this case. consider the model Y Γ = XB + E where Γ is an upper triangular matrix with 1’s on the main diagonal. so that . the ﬁrst equation is y1 = XB·1 + E·1 Since only exogs appear on the rhs. The second equation is y2 = −γ21y1 + XB·2 + E·2 This equation has K ∗∗ = 0 excluded exogs. • However. and G∗ = 2 included endogs.

then following the same logic. .the ﬁrst and second structural errors are uncorrelated. This is known as a fully recursive model. If the entire Σ matrix is diagonal. In this case E(y1tε2t) = E {(Xt B·1 + ε1t)ε2t} = 0 so there’s no problem of simultaneity. all of the equations are identiﬁed.

Tt. At. σ22 σ23 σ33 0 3t The other variables are the government wage bill. taxes. government nonwage spending.and a time trend. The endogenous . Wtg . Gt. consider the following macro model (this is the widely known Klein’s Model 1) Consumption: Ct = α0 + α1Pt + α2Pt−1 + α3(Wtp + Wtg ) + ε1t Investment: It = β0 + β1Pt + β2Pt−1 + β3Kt−1 + ε2t Private Wages: Wtp = γ0 + γ1Xt + γ2Xt−1 + γ3At + ε3t Output: Xt = Ct + It + Gt Proﬁts: Pt = Xt − Tt − Wtp Capital Stock: Kt = Kt−1 t +I 0 σ σ σ 11 12 13 1t 2t ∼ IID 0 .Example: Klein’s Model 1 To give an example of determining identiﬁcation status.

The model assumes that the errors of the equations are contemporaneously correlated. Yt = Ct It Wtp Xt Pt Kt and the predetermined variables are all others: Xt = 1 Wtg Gt Tt At Pt−1 Kt−1 Xt−1 .variables are the lhs variables. by nonautocorrelated. The model written as Y Γ = XB + E gives 1 0 1 0 0 0 0 0 1 −1 0 −1 0 0 0 0 0 0 −α 3 Γ= 0 −α 1 0 −γ1 1 0 −β1 0 −1 1 0 −1 0 1 0 0 1 .

α0 β0 γ0 0 0 0 α3 0 0 B= 0 α2 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 −1 0 0 γ3 0 0 0 β2 0 0 0 0 β3 0 0 0 1 0 γ2 0 0 0 To check this identiﬁcation of the consumption equation. These are the rows that have zeros in the ﬁrst column. We . and we need to drop the ﬁrst column. the submatrices of coefﬁcients of endogs and exogs that don’t appear in this equation. we need to extract Γ32 and B22.

5.get 1 0 0 0 0 γ3 0 γ2 −1 0 0 1 0 0 0 0 0 0 0 0 0 −1 1 0 0 1 0 0 0 0 = 0 0 β 3 0 −γ1 1 −1 0 Γ32 B22 −1 0 We need to ﬁnd a set of 5 rows of this matrix gives a full-rank 5×5 matrix. selecting rows 3. For example. and 7 we obtain the .4.6.

matrix

0 0 A=0 0 β3

0 0 0

1

0 1 0 0 0 0 −1 0 γ3 0 0 0 0 0 0 1

This matrix is of full rank, so the sufﬁcient condition for identiﬁcation is met. Counting included endogs, G∗ = 3, and counting excluded exogs, K ∗∗ = 5, so K ∗∗ − L = G∗ − 1 5−L =3−1 L =3

• The equation is over-identiﬁed by three restrictions, according to the counting rules, which are correct when the only identifying information are the exclusion restrictions. However, there is

additional information in this case. Both Wtp and Wtg enter the consumption equation, and their coefﬁcients are restricted to be the same. For this reason the consumption equation is in fact overidentiﬁed by four restrictions.

10.6

2SLS

When we have no information regarding cross-equation restrictions or the structure of the error covariance matrix, one can estimate the parameters of a single equation of the system without regard to the other equations. • This isn’t always efﬁcient, as we’ll see, but it has the advantage that misspeciﬁcations in other equations will not affect the consistency of the estimator of the parameters of the equation of interest.

• Also, estimation of the equation won’t be affected by identiﬁcation problems in other equations. The 2SLS estimator is very simple: it is the GIV estimator, using all of the weakly exogenous variables as instruments. In the ﬁrst stage, each column of Y1 is regressed on all the weakly exogenous variables in the system, e.g., the entire X matrix. The ﬁtted values are ˆ Y1 = X(X X)−1X Y1 = PX Y1 ˆ = X Π1 Since these ﬁtted values are the projection of Y1 on the space spanned by X, and since any vector in this space is uncorrelated with ε by ˆ ˆ assumption, Y1 is uncorrelated with ε. Since Y1 is simply the reducedform prediction, it is correlated with Y1, The only other requirement is that the instruments be linearly independent. This should be the case

when the order condition is satisﬁed, since there are more columns in X2 than in Y1 in this case. ˆ The second stage substitutes Y1 in place of Y1, and estimates by OLS. This original model is y = Y1γ1 + X1β1 + ε = Zδ + ε and the second stage model is ˆ y = Y1γ1 + X1β1 + ε. Since X1 is in the space spanned by X, PX X1 = X1, so we can write the second stage model as y = PX Y1γ1 + PX X1β1 + ε ≡ PX Zδ + ε

The OLS estimator applied to this model is ˆ δ = (Z PX Z)−1Z PX y which is exactly what we get if we estimate using IV, with the reduced form predictions of the endogs used as instruments. Note that if we deﬁne ˆ Z = PX Z ˆ = Y1 X1 ˆ so that Z are the instruments for Z, then we can write ˆ ˆ ˆ δ = (Z Z)−1Z y • Important note: OLS on the transformed model can be used to calculate the 2SLS estimate of δ, since we see that it’s equivalent to IV using a particular set of instruments. However the OLS co-

variance formula is not valid. We need to apply the IV covariance formula already seen above. Actually, there is also a simpliﬁcation of the general IV variance formula. Deﬁne ˆ Z = PX Z ˆ = Y X The IV covariance estimator would ordinarily be ˆ ˆ ˆ V (δ) = Z Z

−1

ˆ ˆ ZZ

ˆ ZZ

−1

ˆ2 σIV

However, looking at the last term in brackets ˆ ˆ Z Z = Y1 X1 Y1 X1 = Y1 (PX )Y1 Y1 (PX )X1 X1Y1 X1 X1

but since PX is idempotent and since PX X = X, we can write ˆ Y1 X1 Y1 X1 = = Y1 PX PX Y1 Y1 PX X1 X1PX Y1 X1 X1 ˆ Y1 X1

ˆ Y1 X1 ˆ ˆ = ZZ

Therefore, the second and last term in the variance formula cancel, so the 2SLS varcov estimator simpliﬁes to ˆ ˆ ˆ V (δ) = Z Z

−1

ˆ2 σIV

**which, following some algebra similar to the above, can also be written as ˆ ˆ ˆ ˆ V (δ) = Z Z
**

−1

σIV ˆ2

(10.3)

Finally, recall that though this is presented in terms of the ﬁrst equation, it is general since any equation can be placed ﬁrst. Properties of 2SLS: 1. Consistent 2. Asymptotically normal 3. Biased when the mean esists (the existence of moments is a technical issue we won’t go into here). 4. Asymptotically inefﬁcient, except in special circumstances (more on this later).
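A minimal Octave sketch of the 2SLS estimator for a single equation, together with the simplified variance formula of equation 10.3 (the function name and argument names are assumptions; X is the full matrix of weakly exogenous variables in the system):

% Two stage least squares for one equation: y = Y1*gamma1 + X1*beta1 + eps.
% X is the full n x K matrix of weakly exogenous variables (the instruments).
function [delta_hat, V_hat] = tsls(y, Y1, X1, X)
        n     = rows(y);
        Z     = [Y1, X1];
        Y1hat = X * ((X' * X) \ (X' * Y1));      % first stage fitted values
        Zhat  = [Y1hat, X1];                     % = P_X * Z
        delta_hat = (Zhat' * Z) \ (Zhat' * y);   % IV using Zhat as instruments
        e         = y - Z * delta_hat;           % residuals use the original Z
        sig2_iv   = (e' * e) / n;
        V_hat     = inv(Zhat' * Zhat) * sig2_iv; % simplified 2SLS varcov (eq. 10.3)
endfunction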

10.7

Testing the overidentifying restrictions

The selection of which variables are endogs and which are exogs is part of the speciﬁcation of the model. As such, there is room for error

here: one might erroneously classify a variable as exog when it is in fact correlated with the error term. A general test for the speciﬁcation on the model can be formulated as follows: The IV estimator can be calculated by applying OLS to the transformed model, so the IV objective function at the minimized value is ˆ ˆ s(βIV ) = y − X βIV but ˆ ˆ εIV = y − X βIV = y − X(X PW X)−1X PW y = I − X(X PW X)−1X PW y = I − X(X PW X)−1X PW (Xβ + ε) = A (Xβ + ε) ˆ PW y − X βIV ,

where A ≡ I − X(X PW X)−1X PW so ˆ s(βIV ) = (ε + β X ) A PW A (Xβ + ε) Moreover, A PW A is idempotent, as can be veriﬁed by multiplication: A PW A = I − PW X(X PW X)−1X PW I − X(X PW X)−1X PW = PW − PW X(X PW X)−1X PW = I − PW X(X PW X)−1X PW . Furthermore, A is orthogonal to X AX = I − X(X PW X)−1X PW X = X −X = 0 PW − PW X(X PW X)−1X PW

so ˆ s(βIV ) = ε A PW Aε Supposing the ε are normally distributed, with variance σ 2, then the ˆ s(βIV ) ε A PW Aε = σ2 σ2 is a quadratic form of a N (0, 1) random variable with an idempotent matrix in the middle, so ˆ s(βIV ) ∼ χ2(ρ(A PW A)) σ2 This isn’t available, since we need to estimate σ 2. Substituting a consistent estimator, ˆ s(βIV ) σ2 ∼ χ2(ρ(A PW A))

a

random variable

• Even if the ε aren’t normally distributed, the asymptotic result still holds. The last thing we need to determine is the rank of

the idempotent matrix. We have A PW A = PW − PW X(X PW X)−1X PW so ρ(A PW A) = T r PW − PW X(X PW X)−1X PW = T rPW − T rX PW PW X(X PW X)−1 = T rW (W W )−1W − KX = T rW W (W W )−1 − KX = KW − KX where KW is the number of columns of W and KX is the number of columns of X. The degrees of freedom of the test is simply the number of overidentifying restrictions: the number of instruments we have beyond the number that is strictly necessary for consistent estimation.

• This test is an overall speciﬁcation test: the joint null hypothesis is that the model is correctly speciﬁed and that the W form valid instruments (e.g., that the variables classiﬁed as exogs really are uncorrelated with ε. Rejection can mean that either the model y = Zδ + ε is misspeciﬁed, or that there is correlation between X and ε. • This is a particular case of the GMM criterion test, which is covered in the second half of the course. See Section 14.9. • Note that since εIV = Aε ˆ and ˆ s(βIV ) = ε A PW Aε

**we can write ˆ s(βIV ) σ2 ˆ ˆ ε W (W W )−1W W (W W )−1W ε = ˆˆ ε ε/n = n(RSSεIV |W /T SSεIV ) ˆ ˆ
**

2 = nRu 2 where Ru is the uncentered R2 from a regression of the IV resid-

uals on all of the instruments W . This is a convenient way to calculate the test statistic. On an aside, consider IV estimation of a just-identiﬁed model, using the standard notation y = Xβ + ε and W is the matrix of instruments. If we have exact identiﬁcation then cols(W ) = cols(X), so W X is a square matrix. The transformed

**model is PW y = PW Xβ + PW ε and the fonc are ˆ X PW (y − X βIV ) = 0 The IV estimator is
**

−1 ˆ βIV = (X PW X) X PW y

**Considering the inverse here (X PW X)
**

−1

= X W (W W )−1W X

−1 −1

= (W X)−1 X W (W W )−1 = (W X)−1(W W ) (X W )

−1

**Now multiplying this by X PW y, we obtain
**

−1 ˆ βIV = (W X)−1(W W ) (X W ) X PW y

= (W X)−1(W W ) (X W ) = (W X)−1W y

−1

X W (W W )−1W y

The objective function for the generalized IV estimator is ˆ s(βIV ) = ˆ y − X βIV ˆ PW y − X βIV

ˆ ˆ ˆ = y PW y − X βIV − βIV X PW y − X βIV ˆ ˆ ˆ ˆ = y PW y − X βIV − βIV X PW y + βIV X PW X βIV ˆ ˆ ˆ = y PW y − X βIV − βIV X PW y + X PW X βIV ˆ = y PW y − X βIV by the fonc for generalized IV. However, when we’re in the just inden-

tiﬁed case, this is ˆ s(βIV ) = y PW y − X(W X)−1W y = y PW I − X(W X)−1W y = y W (W W )−1W − W (W W )−1W X(W X)−1W y = 0 The value of the objective function of the IV estimator is zero in the just identiﬁed case. This makes sense, since we’ve already shown that the objective function after dividing by σ 2 is asymptotically χ2 with degrees of freedom equal to the number of overidentifying restrictions. In the present case, there are no overidentifying restrictions, so we have a χ2(0) rv, which has mean 0 and variance 0, e.g., it’s simply 0. This means we’re not able to test the identifying restrictions in the case of exact identiﬁcation.
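Returning to the overidentified case, here is a minimal Octave sketch of the nR²_u version of the specification test described above (the function name is an assumption; e_iv are the IV residuals, W is the instrument matrix, and k_reg is the number of columns of the equation's regressor matrix, so df is the number of overidentifying restrictions):

% Overidentification / specification test: regress the IV residuals on all
% instruments W and use n times the uncentered R^2.
function [stat, pval] = overid_test(e_iv, W, k_reg)
        n    = rows(e_iv);
        ehat = W * ((W' * W) \ (W' * e_iv));     % projection of residuals on W
        R2u  = (ehat' * ehat) / (e_iv' * e_iv);  % uncentered R^2
        stat = n * R2u;
        df   = columns(W) - k_reg;               % number of overidentifying restrictions
        pval = 1 - gammainc(stat / 2, df / 2);   % chi^2(df) upper tail probability
endfunction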

10.8

System methods of estimation

2SLS is a single equation method of estimation, as noted above. The advantage of a single equation method is that it’s unaffected by the other equations of the system, so they don’t need to be speciﬁed (except for deﬁning what are the exogs, so 2SLS can use the complete set of instruments). The disadvantage of 2SLS is that it’s inefﬁcient, in general. • Recall that overidentiﬁcation improves efﬁciency of estimation, since an overidentiﬁed equation can use more instruments than are necessary for consistent estimation. • Secondly, the assumption is that

Y Γ = XB + E E(X E) = 0(K×G) vec(E) ∼ N (0, Ψ) • Since there is no autocorrelation of the Et ’s, and since the columns of E are individually homoscedastic, then σ11In σ12In · · · σ1GIn . σ22In . Ψ = ... . . · = Σ ⊗ In σGGIn

This means that the structural equations are heteroscedastic and correlated with one another • In general, ignoring this will lead to inefﬁcient estimation, following the section on GLS. When equations are correlated with

one another estimation should account for the correlation in order to obtain efﬁciency. • Also, since the equations are correlated, information about one equation is implicitly information about all equations. Therefore, overidentiﬁcation restrictions in any equation improve efﬁciency for all equations, even the just identiﬁed equations. • Single equation methods can’t use these types of information, and are therefore inefﬁcient (in general).

3SLS

Note: It is easier and more practical to treat the 3SLS estimator as a generalized method of moments estimator (see Chapter 14). I no longer teach the following section, but it is retained for its possible historical interest. Another alternative is to use FIML (Subsection

2 2 = + . Following our above notation. yG or y = Zδ + ε 0 ··· 0 ZG δG εG . . . each structural equation can be written as yi = Yiγ1 + Xiβ1 + εi = Ziδi + εi Grouping the G equations together we get Z1 0 · · · 0 ε1 y1 δ1 .. . δ ε y2 0 Z2 .8).10. . .. . . 0 . if you are willing to make distributional assumptions on the errors. This is computationally feasible with modern computers.

. 0 ··· ··· 0 ..where we already have that E(εε ) = Ψ = Σ ⊗ In The 3SLS estimator is just 2SLS combined with a GLS correction that ˆ takes advantage of the structure of Ψ. 0 ˆ Y1 X1 0 ˆ 0 Y2 X2 = . Deﬁne Z as X(X X) X Z1 0 −1 ˆ = 0 Z . 0 0 ˆ YG XG . . . .. .. X(X X)−1X Z2 . 0 ··· 0 X(X X)−1X ZG ··· 0 . .

This IV estimator still ignores the covariance information. Also. and noting that the inverse of a block-diagonal matrix is just the matrix with the inverses of the blocks on the main diagonal. ˆ note that Π is calculated using OLS equation by equation. combined with the exogs. More on this later. then Π = BΓ−1 may be subject to some zero restrictions. The distinction is that if the model is overidentiﬁed. The natural extension is to add the GLS . The 2SLS estimator would be ˆ ˆ ˆ δ = (Z Z)−1Z y as can be veriﬁed by simple multiplication. and Π does not impose these restrictions. depending on the restricˆ tions on Γ and B.These instruments are simply the unrestricted rf predicitions of the endogs.

transformation. putting the inverse of the error covariance into the formula. Then the element i. not Zi). The obvious solution is to use an estimator based on the 2SLS residuals: ˆ ˆ εi = yi − Ziδi.2SLS ˆ (IMPORTANT NOTE: this is calculated using Zi. which gives the 3SLS estimator ˆ δ3SLS = = ˆ Z (Σ ⊗ In) −1 −1 Z ˆ Z (Σ ⊗ In)−1 y ˆ Z Σ−1 ⊗ In y ˆ Z Σ−1 ⊗ In Z −1 This estimator requires knowledge of Σ. j of Σ is estimated by εi εj ˆˆ ˆ σij = n . The solution is to deﬁne a feasible estimator using a consistent estimator of Σ.

3SLS is numerically equivalent to 2SLS. lim E n δ n→∞ n −1 A formula for estimating the variance of the 3SLS estimator in ﬁnite samples (cancelling out the powers of n) is ˆ ˆ ˆ V δ3SLS = Z ˆ ˆ Σ−1 ⊗ In Z −1 • This is analogous to the 2SLS formula in equation (10. the asymptotic distribution of the 3SLS estimator can be shown to be Z (Σ ⊗ I )−1 Z ˆ ˆ √ a n ˆ3SLS − δ ∼ N 0.3). Analogously to what we did in the case of 2SLS.ˆ Substitute Σ into the formula above to get the feasible 3SLS estimator. • In the case that all equations are just identiﬁed. GMM is presented in the next . combined with the GLS correction. Proving this is easiest if we use a GMM interpretation of 2SLS and 3SLS.

calculated equation by equation using OLS: ˆ Π = (X X)−1X Y which is simply ˆ Π = (X X)−1X y1 y2 · · · yG that is. since the rf equations are correlated: Yt = Xt BΓ−1 + EtΓ−1 = Xt Π + Vt . OLS equation by equation using all the exogs in the estimation of each column of Π. For now. take it on faith.econometrics course. ˆ The 3SLS estimator is based upon the rf parameter estimator Π. It may seem odd that we use OLS on the reduced form.

∀t Let this var-cov matrix be indicated by Ξ = Γ−1 ΣΓ−1 OLS equation by equation to get the rf is equivalent to y1 X 0 ··· 0 π1 v1 . Γ−1 ΣΓ−1 . . 2 2 = . .. . .. 0 . . π v y2 0 X . + . . X is the entire n × K matrix of exogs. and vi is the ith . πi is the ith column of Π. yG 0 ··· 0 X πG vG where yi is the n × 1 vector of observations of the ith endog.and Vt = Γ−1 Et ∼ N 0.

• However. Use the notation y = Xπ + v to indicate the pooled model. The equations are contemporanously correlated. Following this notation. The general case would have a different Xi for each equation. the error covariance matrix is V (v) = Ξ ⊗ In • This is a special case of a type of model known as a set of seemingly unrelated equations (SUR) since the parameter vector of each equation is different. however.column of V. • Note that each equation of the system individually satisﬁes the classical assumptions. pooled estimation using the GLS correction is more .

where Ξ is estimated using the OLS residuals from equation-by-equation estimation. • In the special case that all the Xi are the same.efﬁcient. (A ⊗ B)−1 = (A−1 ⊗ B −1) 2. • The model is estimated by GLS. SUR ≡OLS. since equation-by-equation estimation is equivalent to pooled estimation. but ignoring the covariance information. which is true in the present case of estimation of the rf parameters. since X is block diagonal. which are consistent. (A ⊗ B) = (A ⊗ B ) and . To show this note that in this case X = In ⊗ X. Using the rules 1.

• So the unrestricted rf coefﬁcients can be estimated efﬁciently (assuming normality) by OLS. we get ˆ πSU R = = (In ⊗ X) (Ξ ⊗ In) −1 (In ⊗ X) −1 −1 (In ⊗ X) (Ξ ⊗ In)−1 y Ξ−1 ⊗ X (In ⊗ X) Ξ−1 ⊗ X y = Ξ ⊗ (X X)−1 Ξ−1 ⊗ X y = IG ⊗ (X X)−1X y ˆ π1 π2 ˆ = . . even if the equations are correlated. ˆ πG • Note that this provides the answer to the exercise 6d in the chapter on GLS. . (A ⊗ B)(C ⊗ D) = (AC ⊗ BD).3.

t. See two sections ahead. FIML Full information maximum likelihood is an alternative estimation method. and in the case of the full-information ML estimator we use the entire information set. since ML estimators based on a given information set are asymptotically efﬁcient w. all other estimators that use the same information set.r. which if they exist could potentially increase the efﬁciency of estimation of the rf. . The 2SLS and 3SLS estimators don’t require distributional assumptions. • Another example where SUR≡OLS is in estimation of vector autoregressions. FIML will be asymptotically efﬁcient.• We have ignored any potential zeros in the matrix Π.

recall Yt Γ = Xt B + Et Et ∼ N (0. Σ). Our model is. t = s The joint normality of Et means that the density for Et is the multivariate normal. which is (2π) −g/2 det Σ −1 −1/2 1 exp − EtΣ−1Et 2 The transformation from Et to Yt requires the Jacobian dEt | = | det Γ| | det dYt .while FIML of course does. ∀t E(EtEs) = 0.

the joint log-likelihood function is nG n −1 1 ln L(B. Maximixation of this can be done using iterative numeric methods. Γ. The steps . Σ) = − ln(2π)+n ln(| det Γ|)− ln det Σ − 2 2 2 n (Yt Γ − Xt B) Σ−1 (Yt t=1 • This is a nonlinear in the parameters objective function. • One can calculate the FIML estimator by iterating the 3SLS estimator. • It turns out that the asymptotic distribution of 3SLS and FIML are the same. assuming normality of the errors.so the density for Yt is (2π) −G/2 | det Γ| det Σ −1 −1/2 1 exp − (Yt Γ − Xt B) Σ−1 (Yt Γ − Xt B) 2 Given the assumption of independence over time. thus avoiding the use of a nonlinear optimizer. We’ll see how to do this in the next section.

Calculate the instruments Y = X Π and calculate Σ using Γ ˆ and B to get the estimated errors. ˆ ˆ ˆ ˆ 3. but only ˆ ˆ ˆ ˆ updates Σ and B and Γ. he ˆ means this for a procedure that doesn’t update Π. applying the usual estimator. This is new. Apply 3SLS using these new instruments and the estimate of Σ.are ˆ ˆ 1. This estimator may have some zeros in it. we didn’t estimate Π 3SLS in this way before. When Greene says iterated 3SLS doesn’t lead to FIML. Repeat steps 2-4 until there is no change in the parameters. If you update Π you do converge to FIML. 4. Calculate Π = B3SLS Γ−1 . 5. ˆ ˆ ˆ 2. Calculate Γ3SLS and B3SLS as normal. .

10. then 2SLS will be fully efﬁcient. This implies that 3SLS is fully efﬁcient when the errors are normally distributed.m performs 2SLS estimation for the 3 equations of Klein’s model 1. • When the errors aren’t normally distributed. assuming nonautocorrelated errors. The results are: CONSUMPTION EQUATION . since in this case 2SLS≡3SLS. if each equation is just identiﬁed and the errors are normal. Also. so that lagged endogenous variables can be used as instruments.• FIML is fully efﬁcient. since it’s an ML estimator that uses all information. the likelihood function is of course different than what’s written above.9 Example: 2SLS and Klein’s Model 1 The Octave program Simeq/Klein.

810 st.147 2.017 0. 1.321 0.err.555 0.534 0.040 t-stat.129 p-value 0.******************************************************* 2SLS estimation results Observations 21 R-squared 0.885 0. 12.216 0.976711 Sigma-squared 1.000 0.060 0.044059 estimate Constant Profits Lagged Profits Wages 16.000 ******************************************************* INVESTMENT EQUATION .016 20.107 0.118 0.

368 p-value 0.err.278 0.398 0.000 ******************************************************* WAGES EQUATION ******************************************************* . 7.543 0.016 0.884884 Sigma-squared 1.150 0.784 -4.616 -0.001 0.163 0.******************************************************* 2SLS estimation results Observations 21 R-squared 0.158 st.383184 estimate Constant Profits Lagged Profits Lagged Capital 20.036 t-stat.173 0.867 3.688 0. 2.

since lagged endogenous variables will not be valid instruments in that case.439 0.147 0.130 st.000 0.029 t-stat.039 0. 1.209 0.987414 Sigma-squared 0.000 ******************************************************* The above results are not valid (speciﬁcally.036 0.307 12.500 0.2SLS estimation results Observations 21 R-squared 0.476427 estimate Constant Output Lagged Output Trend 1.148 0. 1. You might consider .777 4.002 0.316 3.err.475 p-value 0. they are inconsistent) if the errors are autocorrelated.

eliminating the lagged endogenous variables as instruments and re-estimating by 2SLS, to obtain consistent parameter estimates in this more complex case. Standard errors will still be estimated inconsistently, unless one uses a Newey-West type covariance estimator. Food for thought...

Chapter 11

Numeric optimization methods

Readings: Hamilton, ch. 5, section 7 (pp. 133-139)*; Gourieroux and Monfort, Vol. 1, ch. 13, pp. 443-60*; Goffe, et al. (1994).

If we're going to be applying extremum estimators, we'll need to know how to find an extremum. This section gives a very brief introduction

to what is a large literature on numeric optimization methods. We'll consider a few well-known techniques, and one fairly new technique that may allow one to solve difficult problems. The main objective is to become familiar with the issues, and to learn how to use the BFGS algorithm at the practical level.

The general problem we consider is how to find the maximizing element θ̂ (a K-vector) of a function s(θ). This function may not be continuous, and it may not be differentiable. Even if it is twice continuously differentiable, it may not be globally concave, so local maxima, minima and saddlepoints may all exist. Supposing s(θ) were a quadratic function of θ, e.g.,

s(θ) = a + b′θ + (1/2) θ′Cθ,

the first order conditions would be linear:

Dθ s(θ) = b + Cθ

so the maximizing (minimizing) element would be θ̂ = −C^(−1)b. This is the sort of problem we have with linear models estimated by OLS. It's also the case for feasible GLS, since conditional on the estimate of the varcov matrix, we have a quadratic objective function in the remaining parameters. More general problems will not have linear f.o.c., and we will not be able to solve for the maximizer analytically. This is when we need a numeric optimization method.

11.1 Search

The idea is to create a grid over the parameter space and evaluate the function at each point on the grid. Select the best point. Then refine

the grid in the neighborhood of the best point, and continue until the accuracy is "good enough". See Figure 11.1. One has to be careful that the grid is fine enough in relation to the irregularity of the function to ensure that sharp peaks are not missed entirely.

Figure 11.1: Search method
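To make the idea concrete, here is a minimal Octave sketch of the search method for a one-dimensional problem. The objective function, bounds and number of refinement rounds are invented purely for illustration; this is not one of the programs distributed with these notes.

# Sketch of the search method: evaluate s(theta) on a grid, keep the best
# point, then refine the grid around it and repeat.
s = @(theta) -(theta - 0.3).^2 + 0.1*cos(10*theta);  # hypothetical objective
lo = -1; hi = 1;               # initial bounds on the parameter
for round = 1:5                # each round shrinks the grid around the best point
  grid = linspace(lo, hi, 100);
  vals = s(grid);
  [best_val, i] = max(vals);
  best = grid(i);
  width = (hi - lo)/10;        # refine: new bounds around the current best point
  lo = best - width; hi = best + width;
endfor
printf("approximate maximizer: %f, value: %f\n", best, best_val);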

To check q values in each dimension of a K dimensional parameter space, we need to check q^K points. For example, if q = 100 and K = 10, there would be 100^10 points to check. If 1000 points can be checked in a second, it would take 3.171 × 10^9 years to perform the calculations, which is approximately the age of the earth. The search method is a very reasonable choice if K is small, but it quickly becomes infeasible if K is moderate or large.

11.2 Derivative-based methods

Introduction

Derivative-based methods are defined by

1. the method for choosing the initial value, θ^1
2. the iteration method for choosing θ^(k+1) given θ^k (based upon

derivatives)
3. the stopping criterion.

The iteration method can be broken into two problems: choosing the stepsize a_k (a scalar) and choosing the direction of movement, d_k, which is of the same dimension as θ, so that

θ^(k+1) = θ^(k) + a_k d_k.

A locally increasing direction of search d is a direction such that

∃a : ∂s(θ + ad)/∂a > 0

for a positive but small a. That is, if we go in direction d, we will improve on the objective function, at least if we don't go too far in that direction.

• As long as the gradient at θ is not zero there exist increasing

directions, and they can all be represented as Q_k g(θ_k) where Q_k is a symmetric p.d. matrix and g(θ) = Dθ s(θ) is the gradient at θ. To see this, take a Taylor series expansion around a^0 = 0:

s(θ + ad) = s(θ + 0d) + (a − 0) g(θ + 0d)′d + o(1)
          = s(θ) + a g(θ)′d + o(1)

For small enough a the o(1) term can be ignored. If d is to be an increasing direction, we need g(θ)′d > 0. Defining d = Qg(θ), where Q is positive definite, we guarantee that

g(θ)′d = g(θ)′Q g(θ) > 0

unless g(θ) = 0. Every increasing direction can be represented in this way (p.d. matrices are those such that the angle between g and Qg(θ) is less than 90 degrees). See Figure 11.2.

Figure 11.2: Increasing directions of search

• With this, the iteration rule becomes

θ^(k+1) = θ^(k) + a_k Q_k g(θ_k)

and we keep going until the gradient becomes zero, so that there is no increasing direction. The problem is how to choose a and Q.
• Conditional on Q, choosing a is fairly straightforward, since a is a scalar. A simple line search is an attractive possibility.
• The remaining problem is how to choose Q.
• Note also that this gives no guarantees to find a global maximum.

Steepest descent

Steepest descent (ascent if we're maximizing) just sets Q to an identity matrix, since the gradient provides the direction of maximum rate

of change of the objective function.

• Advantages: fast; doesn't require anything more than first derivatives.
• Disadvantages: This doesn't always work too well however (draw picture of banana function).

Newton-Raphson

The Newton-Raphson method uses information about the slope and curvature of the objective function to determine which direction and how far to move from an initial point. Supposing we're trying to maximize sn(θ), take a second order Taylor's series approximation of sn(θ) about θ^k (an initial guess):

sn(θ) ≈ sn(θ^k) + g(θ^k)′(θ − θ^k) + (1/2)(θ − θ^k)′H(θ^k)(θ − θ^k)

To attempt to maximize sn(θ), we can maximize the portion of the right-hand side that depends on θ, i.e., we can maximize

s̃(θ) = g(θ^k)′θ + (1/2)(θ − θ^k)′H(θ^k)(θ − θ^k)

with respect to θ. This is a much easier problem, since it is a quadratic function in θ, so it has linear first order conditions. These are

Dθ s̃(θ) = g(θ^k) + H(θ^k)(θ − θ^k)

So the solution for the next round estimate is

θ^(k+1) = θ^k − H(θ^k)^(−1) g(θ^k)

This is illustrated in Figure 11.3. However, it's good to include a stepsize, since the approximation to sn(θ) may be bad far away from the maximizer θ̂, so the actual iteration formula is

θ^(k+1) = θ^k − a_k H(θ^k)^(−1) g(θ^k)

Figure 11.3: Newton iteration
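A minimal Octave sketch of a Newton iteration with a crude step-halving rule follows. The objective function, its derivatives and the tolerances are invented for illustration only; this is not the bfgsmin routine used elsewhere in these notes.

# Sketch of Newton-Raphson with a stepsize, for an invented smooth objective
# s(theta); the gradient g and Hessian H are coded analytically here.
s = @(t) -(t(1)-1).^4 - (t(2)+0.5).^2 - 0.5*(t(1)*t(2)).^2;
g = @(t) [-4*(t(1)-1).^3 - t(1)*t(2).^2; -2*(t(2)+0.5) - t(1).^2*t(2)];
H = @(t) [-12*(t(1)-1).^2 - t(2).^2,  -2*t(1)*t(2);
          -2*t(1)*t(2),               -2 - t(1).^2];
theta = [0; 0];                       # initial value
for k = 1:100
  d = -H(theta) \ g(theta);           # Newton direction
  a = 1;                              # step halving until the objective improves
  while s(theta + a*d) < s(theta) && a > 1e-12
    a = a/2;
  endwhile
  theta = theta + a*d;
  if norm(g(theta)) < 1e-8, break; endif
endfor
theta                                  # approximate maximizer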

• A potential problem is that the Hessian may not be negative definite when we're far from the maximizing point. So −H(θ^k)^(−1) may not be positive definite, and −H(θ^k)^(−1)g(θ^k) may not define an increasing direction of search. This can happen when the objective function has flat regions, in which case the Hessian matrix is very ill-conditioned (e.g., is nearly singular), or when we're in the vicinity of a local minimum, where H(θ^k) is positive definite, and our direction is a decreasing direction of search. Matrix inverses by computers are subject to large errors when the matrix is ill-conditioned. Also, we certainly don't want to go in the direction of a minimum when we're maximizing. To solve this problem, quasi-Newton methods simply add a positive definite component to H(θ) to ensure that the resulting matrix is positive definite,

e.g., Q = −H(θ) + bI, where b is chosen large enough so that Q is well-conditioned and positive definite. This has the benefit that improvement in the objective function is guaranteed.

• Another variation of quasi-Newton methods is to approximate the Hessian by using successive gradient evaluations. This avoids actual calculation of the Hessian, which is an order of magnitude (in the dimension of the parameter vector) more costly than calculation of the gradient. These updates can be done so as to ensure that the approximation is p.d. DFP and BFGS are two well-known examples.

Stopping criteria

The last thing we need is to decide when to stop. A digital computer is subject to limited machine precision and round-off errors. For these reasons, it is unreasonable to hope that a program can exactly

find the point that maximizes a function. We need to define acceptable tolerances. Some stopping criteria are:

• Negligible change in parameters: |θj^k − θj^(k−1)| < ε1, ∀j
• Negligible relative change: |(θj^k − θj^(k−1)) / θj^(k−1)| < ε2, ∀j
• Negligible change of function: |s(θ^k) − s(θ^(k−1))| < ε3
• Gradient negligibly different from zero: |gj(θ^k)| < ε4, ∀j
• Or, even better, check all of these.
• Also, if we're maximizing, it's good to check that the last round (real, not approximate) Hessian is negative definite.
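In code, such checks might look like the following Octave fragment. The tolerance values and the variable names (theta_k, s_k, g_k and so on) are arbitrary placeholders, not part of any particular program in these notes; dummy values are filled in so the fragment runs on its own.

# Sketch of typical convergence checks inside an iterative optimizer.
# theta_k, theta_km1: current and previous parameter vectors;
# s_k, s_km1: objective values; g_k: current gradient (dummy values here).
theta_km1 = [0.5; 1.0];  theta_k = theta_km1 + 1e-8;
s_km1 = -1.234567;       s_k = s_km1 + 1e-10;
g_k = [1e-7; -2e-7];
eps1 = 1e-6; eps2 = 1e-6; eps3 = 1e-8; eps4 = 1e-5;       # tolerances
param_conv = all(abs(theta_k - theta_km1) < eps1) && ...
             all(abs((theta_k - theta_km1) ./ theta_km1) < eps2);
func_conv  = abs(s_k - s_km1) < eps3;
grad_conv  = all(abs(g_k) < eps4);
converged  = param_conv && func_conv && grad_conv          # check all of these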

Starting values

The Newton-Raphson and related algorithms work well if the objective function is concave (when maximizing), but not so well if there are convex regions and local minima or multiple local maxima. The algorithm may converge to a local minimum or to a local maximum that is not optimal. The algorithm may also have difficulties converging at all.

• The usual way to "ensure" that a global maximum has been found is to use many different starting values, and choose the solution that returns the highest objective function value. THIS IS IMPORTANT in practice. More on this later.

Calculating derivatives

The Newton-Raphson algorithm requires first and second derivatives. It is often difficult to calculate derivatives (especially the Hessian) analytically if the function sn(·) is complicated. Possible solutions are to calculate derivatives numerically, or to use programs such as MuPAD or Mathematica to calculate analytic derivatives. For example, Figure 11.4 shows Sage¹ calculating a couple of derivatives. [¹ Sage is free software that has both symbolic and numeric computational capabilities. See http://www.sagemath.org/]

• Numeric derivatives are less accurate than analytic derivatives, and are usually more costly to evaluate. Both factors usually cause optimization programs to be less successful when numeric derivatives are used.
• One advantage of numeric derivatives is that you don't have to worry about having made an error in calculating the analytic

Figure 11.4: Using Sage to get analytic derivatives

derivative. When programming analytic derivatives it's a good idea to check that they are correct by using numeric derivatives. This is a lesson I learned the hard way when writing my thesis.

• Numeric second derivatives are much more accurate if the data are scaled so that the elements of the gradient are of the same order of magnitude. Example: if the model is yt = h(αxt + βzt) + εt, and estimation is by NLS, suppose that Dα sn(·) = 1000 and Dβ sn(·) = 0.001. One could define α* = α/1000, xt* = 1000xt, β* = 1000β, zt* = zt/1000. In this case, the gradients Dα* sn(·) and Dβ* sn(·) will both be 1. In general, estimation programs always work better if data is scaled in this way, since roundoff errors are less likely to become important. This is important in practice.

• There are algorithms (such as BFGS and DFP) that use the sequential gradient evaluations to build up an approximation to the Hessian. The iterations are faster for this reason since the actual Hessian isn't calculated, but more iterations usually are required for convergence.
• Switching between algorithms during iterations is sometimes useful.

11.3 Simulated Annealing

Simulated annealing is an algorithm which can find an optimum in the presence of nonconcavities, discontinuities and multiple local minima/maxima. Basically, the algorithm randomly selects evaluation points, accepts all points that yield an increase in the objective function, but also accepts some points that decrease the objective function. This allows the algorithm to escape from local minima. As more and more points are tried, periodically the algorithm focuses on the best

point so far, and reduces the range over which random points are generated. Also, the probability that a negative move is accepted reduces. The algorithm relies on many evaluations, as in the search method, but focuses in on promising areas, which reduces function evaluations with respect to the search method. It does not require derivatives to be evaluated. I have a program to do this if you're interested.

11.4 Examples of nonlinear optimization

This section gives a few examples of how some nonlinear models may be estimated using maximum likelihood.

Discrete Choice: The logit model

In this section we will consider maximum likelihood estimation of the logit model for binary 0/1 dependent variables. We will use the BFGS

algorithm to find the MLE. A binary response is a variable that takes on only two values, customarily 0 and 1, which can be thought of as codes for whether or not a condition is satisfied. For example, 0 = drive to work, 1 = take the bus. Often the observed binary variable, say y, is related to an unobserved (latent) continuous variable, say y*. We would like to know the effect of covariates, x, on y. The model can be represented as

y* = g(x) − ε
y = 1(y* > 0)
Pr(y = 1) = Fε[g(x)] ≡ p(x, θ)

The log-likelihood function is

sn(θ) = (1/n) Σ_{i=1}^{n} ( yi ln p(xi, θ) + (1 − yi) ln[1 − p(xi, θ)] )

For the logit model, the probability has the specific form

p(x, θ) = 1 / (1 + exp(−x′θ))

You should download and examine LogitDGP.m, which generates data according to the logit model, logit.m, which calculates the loglikelihood, and EstimateLogit.m, which sets things up and calls the estimation routine, which uses the BFGS algorithm. Here are some estimation results with n = 100, and the true θ = (0, 1)′.

***********************************************
Trial of MLE estimation of Logit model

MLE Estimation Results
BFGS convergence: Normal convergence

Average Log-L: -0.607063
Observations: 100

            estimate   st. err    t-stat    p-value
constant      0.5400    0.2229    2.4224     0.0154
slope         0.7566    0.2374    3.1863     0.0014

Information Criteria
CAIC : 132.6230
BIC  : 130.6230
AIC  : 125.4127
***********************************************

The estimation program is calling mle_results(), which in turn calls a number of other routines.
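The calculation itself is straightforward to code. The following Octave sketch is not the logit.m/mle_results() code distributed with these notes; it is an independent illustration of the same average log-likelihood, maximized with Octave's built-in fminunc rather than the BFGS routine the notes use, on simulated data.

# Sketch: logit average log-likelihood and ML estimation on simulated data.
n = 100;
x = [ones(n,1) randn(n,1)];              # regressors: constant and one covariate
theta_true = [0; 1];
p = 1 ./ (1 + exp(-x*theta_true));
y = rand(n,1) < p;                        # simulated 0/1 outcomes

# negative average log-likelihood (fminunc minimizes)
negs = @(theta) -mean(y .* log(1 ./ (1 + exp(-x*theta))) + ...
                      (1-y) .* log(1 - 1 ./ (1 + exp(-x*theta))));
theta_hat = fminunc(negs, zeros(2,1))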

individual demand will be a function of the parameters of the individuals’ utility functions. The data is from the 1996 Medical Expenditure Panel Survey (MEPS). the effects of ageing). Under the home production framework. individuals decide when to make health care visits to maintain their health stock.Count Data: The MEPS data and the Poisson model Demand for health care is usually thought of a a derived demand: health care is an input to a home production function that produces health. Grossman (1972). The six measures of use . models health as a capital stock that is subject to depreciation (e. The MEPS health data ﬁle . and health is an argument of the utility function. or to deal with negative shocks to the stock in the form of accidents or illnesses. meps1996. As such.meps.g. Health care visits restore the stock..gov/. for example. contains 4564 observations on six measures of health care usage. You can get more information at http://www.data.ahrq.

These form columns 1 .. dental visits (VDV). These form columns 7 12 of the ﬁle. The program ExploreMEPS. which follows: All of the measures of use are count data. and number of prescription drugs taken (PRESCR). emergency room visits (ERV). outpatient visits (OPV).6 of meps1996. and income (INCOME). which means that they take on the values 0. The conditioning variables are public insurance (PUBLIC). PRIV and PUBLIC are 0/1 binary variables. in the order given here. where 1 indicates that the person is female.. . age (AGE). 2.m shows how the data may be read in.are are ofﬁce-based visits (OBDV). inpatient visits (IPV).data. sex (SEX). SEX is also 0/1. 1. and gives some descriptive information about variables. It might be reasonable to try to use this . where a 1 indicates that the person has access to public or private insurance coverage. years of education (EDUC).. This data will be used in examples fairly extensively in what follows. private insurance (PRIV).

information by specifying the density as a count data density. One of the simplest count data densities is the Poisson density, which is

fY(y) = exp(−λ) λ^y / y!

The Poisson average log-likelihood function is

sn(θ) = (1/n) Σ_{i=1}^{n} ( −λi + yi ln λi − ln yi! )

We will parameterize the model as

λi = exp(xi′β)
xi = [1 PUBLIC PRIV SEX AGE EDUC INC]′        (11.1)

This ensures that the mean is positive, as is required for the Poisson model.
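A hedged Octave sketch of this average log-likelihood follows. This is not the EstimatePoisson.m program itself: simulated counts stand in for the MEPS data, gammaln is used for ln y!, and poissrnd requires Octave's statistics package.

# Sketch: Poisson average log-likelihood with lambda_i = exp(x_i' * beta).
pkg load statistics                             # for poissrnd
n = 1000;
x = [ones(n,1) randn(n,1) randn(n,1)];          # stand-in regressors
beta_true = [0.5; 0.2; -0.1];
y = poissrnd(exp(x*beta_true));                 # simulated counts

sn = @(beta) mean(-exp(x*beta) + y .* (x*beta) - gammaln(y+1));  # avg log-L
beta_hat = fminunc(@(b) -sn(b), zeros(3,1))     # maximize by minimizing -sn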

Note that for this parameterization

(∂λ/∂xj) / λ = βj

so

βj xj = η_{λ,xj},

the elasticity of the conditional mean of y with respect to the jth conditioning variable.

The program EstimatePoisson.m estimates a Poisson model using the full data set. The results of the estimation, using OBDV as the dependent variable, are here:

MPITB extensions found

OBDV

******************************************************
Poisson model, MEPS 1996 full data set

MLE Estimation Results
BFGS convergence: Normal convergence

Average Log-L: -3.671090
Observations: 4564

              estimate   st. err    t-stat    p-value
constant        -0.791     0.149    -5.290      0.000
pub. ins.        0.848     0.076    11.093      0.000
priv. ins.       0.294     0.071     4.137      0.000
sex              0.487     0.055     8.797      0.000
age              0.024     0.002    11.290      0.000
edu              0.029     0.010     3.061      0.002
inc             -0.000     0.000    -0.978      0.328

Information Criteria
CAIC : 33575.6881   Avg. CAIC: 7.3566
BIC  : 33568.6881   Avg. BIC:  7.3551
AIC  : 33523.7064   Avg. AIC:  7.3452
******************************************************

Duration data and the Weibull model

In some cases the dependent variable may be the time that passes between the occurrence of two events. For example, it may be the duration of a strike, or the time needed to find a job once one is unemployed. Such variables take on values on the positive real line,

and the ﬁnal event is the ﬁnding of a new job. The spell is the period of unemployment.and are referred to as duration data. and t1 be the time the concluding event occurs. The random variable D is the duration of the spell. assume that time is measured in years. Deﬁne the density function of D. with distribution function FD (t) = Pr(D < t). fD (t). For example. D = t1 − t0. For simplicity. . Let t0 be the time the initial event occurs. A spell is the period of time between the occurence of initial event and the concluding event. For example. the initial event could be the loss of a job. The probability that a spell lasts more than s years is Pr(D > s) = 1 − Pr(D ≤ s) = 1 − FD (s). Several questions may be of interest. one might wish to know the expected time one has to wait to ﬁnd a job given that one has already waited s years.

The density of D conditional on the spell being longer than s years is

fD(t | D > s) = fD(t) / (1 − FD(s)).

The expected additional time required for the spell to end, given that it has already lasted s years, is the expectation of D with respect to this density, minus s:

E = E(D | D > s) − s = ( ∫_s^∞ z fD(z)/(1 − FD(s)) dz ) − s

To estimate this function, one needs to specify the density fD(t) as a parametric density, then estimate by maximum likelihood. There are a number of possibilities, including the exponential density, the lognormal, etc. A reasonably flexible model that is a generalization of the exponential density is the Weibull density

fD(t|θ) = exp( −(λt)^γ ) λγ(λt)^(γ−1).
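A minimal Octave sketch of ML estimation with this Weibull density follows. The durations here are simulated by inverse-CDF draws, not the mongoose data used below, and the parameters are kept positive by estimating their logarithms; this is an illustrative sketch, not the program used for the empirical example.

# Sketch: Weibull log-likelihood, f(t) = exp(-(lambda*t)^gamma)*lambda*gamma*(lambda*t)^(gamma-1),
# fit to simulated durations.
n = 400;
lambda0 = 0.9; gamma0 = 1.5;
u = rand(n,1);
t = (-log(u)).^(1/gamma0) / lambda0;            # inverse-CDF draws from this Weibull

lnf  = @(lam, gam) -(lam*t).^gam + log(lam) + log(gam) + (gam-1)*log(lam*t);
negs = @(ab) -mean(lnf(exp(ab(1)), exp(ab(2)))); # lambda = exp(a), gamma = exp(b) > 0
ab_hat = fminunc(negs, [0; 0]);
lambda_hat = exp(ab_hat(1)), gamma_hat = exp(ab_hat(2))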

According to this model, E(D) = λ^(−1) Γ(1 + 1/γ). The log-likelihood is just the sum of the log densities.

To illustrate application of this model, 402 observations on the lifespan of dwarf mongooses (see Figure 11.5) in Serengeti National Park (Tanzania) were used to fit a Weibull model. The "spell" in this case is the lifetime of an individual mongoose. The parameter estimates and standard errors are λ̂ = 0.559 (0.034) and γ̂ = 0.867 (0.033), and the log-likelihood value is -659.3. Figure 11.6 presents fitted life expectancy (expected additional years of life) as a function of age, with 95% confidence bands. The plot is accompanied by a nonparametric Kaplan-Meier estimate of life-expectancy. This nonparametric estimator simply averages all spell lengths greater than age, and then subtracts age. This is consistent by the LLN.

In the figure one can see that the model doesn't fit the data well, in that it predicts life expectancy quite differently than does the nonparametric model. For ages 4-6, the nonparametric estimate is outside

Figure 11.5: Dwarf mongooses

Figure 11.6: Life expectancy of mongooses, Weibull model

the confidence interval that results from the parametric model, which casts doubt upon the parametric model. Mongooses that are between 2-6 years old seem to have a lower life expectancy than is predicted by the Weibull model, whereas young mongooses that survive beyond infancy have a higher life expectancy, up to a bit beyond 2 years. Due to the dramatic change in the death rate as a function of t, one might specify fD(t) as a mixture of two Weibull densities,

fD(t|θ) = δ [ exp(−(λ1 t)^γ1) λ1 γ1 (λ1 t)^(γ1−1) ] + (1 − δ) [ exp(−(λ2 t)^γ2) λ2 γ2 (λ2 t)^(γ2−1) ].

The parameters γi and λi, i = 1, 2 are the parameters of the two Weibull densities, and δ is the parameter that mixes the two. With the same data, θ can be estimated using the mixed model. The result is a log-likelihood = -623.17. Note that a standard likelihood ratio test cannot be used to choose between the two models, since under the null that δ = 1 (single density), the two parameters λ2 and γ2 are not identified. It is possible to take this into account,

but this topic is out of the scope of this course. Note that the improvement in the likelihood function is considerable.

This model leads to the fit in Figure 11.7. Note that the parametric and nonparametric fits are quite close to one another, up to around 6 years. The disagreement after this point is not too important, since less than 5% of mongooses live more than 6 years, which implies that the Kaplan-Meier nonparametric estimate has a high variance (since it's an average of a small number of observations). The parameter estimates are

Parameter   Estimate   St. Error
λ1           0.233      0.016
γ1           1.722      0.166
λ2           1.731      0.101
γ2           1.522      0.096
δ            0.428      0.035

Note that the mixture parameter is highly significant.

Figure 11.7: Life expectancy of mongooses, mixed Weibull model

Mixture models are often an effective way to model complex responses, though they can suffer from overparameterization. Alternatives will be discussed later.

11.5 Numeric optimization: pitfalls

In this section we'll examine two common problems that can be encountered when doing numeric optimization of nonlinear models, and some solutions.

Poor scaling of the data

When the data is scaled so that the magnitudes of the first and second derivatives are of different orders, problems can easily result. If we uncomment the appropriate line in EstimatePoisson.m, the data will not be scaled, and the estimation program will have difficulty con-

verging (it seems to take an infinite amount of time). With unscaled data, the elements of the score vector have very different magnitudes at the initial value of θ (all zeros). To see this, run CheckScore.m. With unscaled data, one element of the gradient is very large, and the maximum and minimum elements are 5 orders of magnitude apart. This causes convergence problems due to serious numerical inaccuracy when doing inversions to calculate the BFGS direction of search. With scaled data, none of the elements of the gradient are very large, and the maximum difference in orders of magnitude is 3. Convergence is quick.

Multiple optima

Multiple optima (one global, others local) can complicate life, since we have limited means of determining if there is a higher maximum than the one we're at. Think of climbing a mountain in an unknown

range, in a very foggy place (Figure 11.8). You can go up until there's nowhere else to go up, but since you're in the fog you don't know if the true summit is across the gap that's at your feet. Do you claim victory and go home, or do you trudge down the gap and explore the other side?

The best way to avoid stopping at a local maximum is to use many starting values, for example on a grid, or randomly generated. Or perhaps one might have priors about possible values for the parameters (e.g., from previous studies of similar data). Let's try to find the true minimizer of minus 1 times the foggy mountain function (since the algorithms are set up to minimize). From the picture, you can see it's close to (0, 0), but let's pretend there is fog, and that we don't know that. The program FoggyMountain.m shows that poor start values can lead to problems. It uses SA, which finds the true global minimum, and it shows that BFGS using a battery of random start values can also find the global minimum.
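The multistart idea itself takes only a few lines. The sketch below is not FoggyMountain.m: the bimodal objective is invented for illustration and Octave's fminunc stands in for the bfgsmin routine used in the notes.

# Sketch: many random start values, keep the best local optimum found.
obj = @(th) (th(1)^2 + th(2)^2) + 3*sin(3*th(1)) + 3*cos(3*th(2));  # invented bimodal objective (minimized)
nstarts = 20;
best_val = Inf;
for i = 1:nstarts
  start = 4*randn(2,1);                  # random start value
  [th, val] = fminunc(obj, start);       # local quasi-Newton search from this start
  if val < best_val
    best_val = val; best_th = th;
  endif
endfor
best_th, best_val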

Figure 11.8: A foggy mountain

The output of one run is here:

MPITB extensions found

======================================================
BFGSMIN final results

Used numeric gradient

------------------------------------------------------
STRONG CONVERGENCE
Function conv 1  Param conv 1  Gradient conv 1
------------------------------------------------------
Objective function value -0.0130329
Stepsize 0.102833
43 iterations
------------------------------------------------------

   param      gradient     change
  15.9999     -0.0000      0.0000
 -28.8119      0.0000      0.0000

The result with poor start values
ans =

   16.000  -28.812

================================================
SAMIN final results
NORMAL CONVERGENCE

Func. tol. 1.000000e-10   Param. tol. 1.000000e-03
Obj. fn. value -0.100023

       parameter     search width
        0.037419       0.000018
       -0.000000       0.000051
================================================

Now try a battery of random start values and a short BFGS on each, then iterate to convergence.

The result using 20 random start values
ans =

   3.7417e-02   2.7628e-07

The true maximizer is near (0,0). In that run, the single BFGS run with bad start values converged to a point far from the true minimizer, which simulated annealing and

BFGS using a battery of random start values both found. Using a battery of random start values, we managed to find the global max. The moral of the story is to be cautious and don't publish your results too quickly.

11.6 Exercises

1. In Octave, type "help bfgsmin_example", to find out the location of the file. Edit the file to examine it and learn how to call bfgsmin. Run it, and examine the output.
2. In Octave, type "help samin_example", to find out the location of the file. Edit the file to examine it and learn how to call samin. Run it, and examine the output.
3. Using logit.m and EstimateLogit.m as templates, write a function to calculate the probit log likelihood, and a script to estimate a probit model. Run it using data that actually follows a logit model (you can generate it in the same way that is done in the logit example).
4. Study mle_results.m to see what it does. Examine the functions that mle_results.m calls, and in turn the functions that those functions call. Write a complete description of how the whole chain works.
5. Look at the Poisson estimation results for the OBDV measure of health care use and give an economic interpretation. Estimate Poisson models for the other 5 measures of health care usage.

Chapter 12

Asymptotic properties of extremum estimators

Readings: Hayashi (2000), Ch. 7; Gourieroux and Monfort (1995), Vol. 2, Ch. 24; Amemiya, Ch. 4 section 4.1; Davidson and MacKinnon, pp. 591-96; Gallant, Ch. 3; Newey and McFadden (1994), "Large Sample Estimation and Hypothesis Testing," in Handbook of

Econometrics, Vol. 4, Ch. 36.

12.1 Extremum estimators

We'll begin with study of extremum estimators in general. Let Zn = {z1, z2, ..., zn} be the available data, based on a sample of size n (there are p variables), arranged in a n × p matrix. Our paradigm is that data are generated as a draw from the joint density fZn(z). This density may not be known, but it exists in principle. The draw from the density may be thought of as the outcome of a random experiment that is characterized by the probability space {Ω, F, P}. When the experiment is performed, ω ∈ Ω is the result, and Zn(ω) = {Z1(ω), Z2(ω), ..., Zn(ω)} is the realized data. The probability space is rich enough to allow us to consider events defined in terms of an infinite sequence of data Z = {z1, z2, ...}.

Definition 25. [Extremum estimator] An extremum estimator θ̂ is the optimizing element of an objective function sn(Zn, θ) over a set Θ. Because the data Zn(ω) depends on ω, we can emphasize this by writing sn(ω, θ). I'll be loose with notation and interchange when convenient.

Example 26. OLS. Let the d.g.p. be yt = xt′θ⁰ + εt, t = 1, 2, ..., n, θ⁰ ∈ Θ. Stacking observations vertically, yn = Xnθ⁰ + εn, where Xn = ( x1 x2 ··· xn )′. Let Zn = [yn Xn]. The least squares estimator is defined as

θ̂ ≡ arg minΘ sn(Zn, θ), where sn(Zn, θ) = (1/n) Σ_{t=1}^{n} (yt − xt′θ)²

As you already know, θ̂ = (X′X)^(−1)X′y.

Example 27. Maximum likelihood. Suppose that the continuous random variables Yt ∼ IIN(θ⁰, 1), t = 1, 2, ..., n. The density of a single observation is

fY(yt) = (2π)^(−1/2) exp( −(yt − θ)²/2 )

The maximum likelihood estimator maximizes the joint density of the sample. Because the data are i.i.d., the joint density is the product of the densities of each observation, and the ML estimator is

θ̂ ≡ arg maxΘ Ln(θ) = Π_{t=1}^{n} (2π)^(−1/2) exp( −(yt − θ)²/2 )

Because the natural logarithm is strictly increasing on (0, ∞), maximization of the average logarithmic likelihood function is achieved at the same θ̂ as for the likelihood function. So, the ML estimator is

θ̂ ≡ arg maxΘ sn(θ), where

sn(θ) = (1/n) ln Ln(θ) = −(1/2) ln 2π − (1/n) Σ_{t=1}^{n} (yt − θ)²/2

Solution of the f.o.c. leads to the familiar result that θ̂ = ȳ. We'll come back to this in more detail later.

Note that the objective function sn(Zn, θ) is a random function, because it depends on Zn(ω) = {Z1(ω), Z2(ω), ..., Zn(ω)} = {z1, z2, ..., zn}. We need to consider what happens as different outcomes ω ∈ Ω occur. These different outcomes lead to different data being generated, and the different data causes the objective function to change. Note, however, that for a fixed ω ∈ Ω, the data Zn(ω) = {z1, z2, ..., zn} are a fixed realization, and the objective function sn(Zn, θ) becomes a non-random function of θ. When actually computing an extremum estimator, we treat the data as fixed, and employ algorithms for optimization of nonstochastic functions.
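To see this at work, here is a small Octave sketch that holds one realized sample fixed and maximizes the average log-likelihood of Example 27 numerically; for the N(θ, 1) model the numeric maximizer should match the sample mean. The sample size and true value are arbitrary.

# Sketch: treat a realized sample as fixed and maximize s_n(theta) numerically.
n = 200;
y = 1.7 + randn(n,1);                                    # one realized data set, true theta = 1.7
sn = @(theta) -0.5*log(2*pi) - mean((y - theta).^2)/2;    # average log-likelihood
theta_hat = fminunc(@(th) -sn(th), 0)
mean(y)                                                   # the analytical ML estimator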

This is because we would like to find estimators that work well on average for any data set that can result from ω ∈ Ω. When analyzing the properties of an extremum estimator, we need to investigate what happens throughout Ω: we do not focus only on the ω that generated the observed data.

We'll often write the objective function suppressing the dependence on Zn, as sn(ω, θ) or simply sn(θ), depending on context. The first of these emphasizes the fact that the objective function is random, and the second is more compact. However, the data is still in there, and because the data is randomly sampled, the objective function is random, too.

12.2 Existence

If sn(θ) is continuous in θ and Θ is compact, then a maximizer exists, by the Weierstrass maximum theorem (Debreu, 1959). In some cases of interest, sn(θ) may not be continuous. Nevertheless, it may still converge to a continuous function, in which case existence will not be a problem, at least asymptotically. Henceforth in this course, we assume that sn(θ) is continuous.

12.3 Consistency

The following theorem is patterned on a proof in Gallant (1987) (the article, ref. later), which we'll see in its original form later in the course. It is interesting to compare the following proof with Amemiya's Theorem 4.1.1, which is done in terms of convergence in probability.

Theorem 28. [Consistency of e.e.] Suppose that θ̂n is obtained by maximizing sn(θ) over Θ. Assume

(a) Compactness: The parameter space Θ is an open bounded subset of Euclidean space ℝ^K. So the closure of Θ, Θ̄, is compact.
(b) Uniform Convergence: There is a nonstochastic function s∞(θ) that is continuous in θ on Θ̄ such that

lim_{n→∞} sup_{θ∈Θ̄} |sn(ω, θ) − s∞(θ)| = 0, a.s.

(c) Identification: s∞(·) has a unique global maximum at θ⁰ ∈ Θ̄, i.e., s∞(θ⁰) > s∞(θ), ∀θ ≠ θ⁰, θ ∈ Θ̄.

Then θ̂n →a.s. θ⁰.

Proof: Select a ω ∈ Ω and hold it fixed. Then {sn(ω, θ)} is a fixed sequence of functions. Suppose that ω is such that sn(ω, θ) converges to s∞(θ). This happens with probability one by assumption (b). The sequence {θ̂n} lies in the compact set Θ̄, by assumption (a) and the fact that maximization is over Θ̄. Since every sequence from a compact set has at least one limit point (Bolzano-Weierstrass), say that θ̂

By uniform m convergence and continuity. There is a subsequence {θnm } ({nm} is simply ˆ ˆ a sequence of increasing integers) with limm→∞ θn = θ. . So the above claim is true. Then uniform convergence implies m→∞ ˆ ˆ lim snm (θt) = s∞(θt) Continuity of s∞ (·) implies that t→∞ ˆ ˆ lim s∞(θt) = s∞(θ) ˆ ˆ since the limit as t → ∞ of θt is θ. m→∞ ˆ ˆ lim snm (θnm ) = s∞(θ). ﬁrst of all. ˆ ˆ To see this. select an element θt from the sequence θnm .ˆ ˆ is a limit point of {θn}.

and m→∞ by uniform convergence. there is a unique global maximum of s∞(θ) at . by maximization ˆ snm (θnm ) ≥ snm (θ0) which holds in the limit.Next. lim snm (θ0) = s∞(θ0) as seen above. But by assumption (3). m→∞ However. so m→∞ ˆ lim snm (θnm ) ≥ lim snm (θ0). m→∞ ˆ ˆ lim snm (θnm ) = s∞(θ). so ˆ s∞(θ) ≥ s∞(θ0).

Finally. ˆ • We assume that θn is in fact a global maximum of sn (θ) . It is not required to be unique for n ﬁnite. Therefore {θn} has only one limit point. all of the above limits hold almost surely. which matches the way we will write the assumption in the section on nonparametric inference.ˆ ˆ θ0. but now we need to consider all ω ∈ Ω. An equivalent way to state this is (c) Identiﬁcation: Any point θ in Θ with s∞(θ) ≥ s∞(θ0) must be such that θ − θ0 = 0. The previous section on numeric . though the identiﬁcation assumption requires that the limiting objective function have a unique maximizing argument. so we must have s∞(θ) = s∞(θ0). since so far we have held ω ˆ ﬁxed. Discussion of the proof: • This proof relies on the identiﬁcation assumption of a unique global maximum at θ0. except on a set C ⊂ Ω with P (C) = 0. θ0. and θ = θ0 in the limit.

though s∞(θ) . Parameters on the boundary of the parameter set cause theoretical difﬁculties that we will not deal with in this course.1. for clarity. so we could directly assume that θ0 is simply an element of a compact set Θ.optimization methods showed that actually ﬁnding the global maximum of sn (θ) may be a non-trivial problem. Just note that conventional hypothesis testing methods do not apply in this case. • See Amemiya’s Example 4. The reason that we assume it’s in the interior here is that this is necessary for subsequent proof of asymptotic normality. • The assumption that θ0 is in the interior of Θ (part of the identiﬁcation assumption) has not been used to prove consistency. and I’d like to maintain a minimal set of simple assumptions. • Note that sn (θ) is not required to be continuous.4 for a case where discontinuity leads to breakdown of consistency.

In the second ﬁgure. there is no guarantee that the maximizer will be in the neighborhood of the global maximizer. if the function is not converging around the lower of the two maxima. the maximum of the sample objective function eventually must be in the neighborhood of the maximum of the limiting objective function . With uniform convergence. • The following ﬁgures illustrate why uniform convergence is important.is.

it is often feasible to employ the following set of stronger assumptions: .With pointwise convergence. To verify the uniform convergence assumption. the sample objective function may have its maximum far away from that of the limiting objective function Sufﬁcient conditions for assumption (b) We need a uniform strong law of large numbers in order to verify assumption (2) of Theorem 28.

s. That is. such as 1 sn(θ) = n n a. which is given by assumption (b) • the objective function sn(θ) is continuous and bounded with probability one on the entire parameter space • a standard SLLN can be shown to apply to some point θ in the parameter space. we can usually ﬁnd a SLLN that will apply. st(θ) t=1 As long as the st(θ) are not too strongly dependent. so we obtain the needed uni- . it can be shown that pointwise convergence holds throughout the parameter space. With these assumptions. and have ﬁnite variances.• the parameter space is compact. the objective function will be an average of terms. Note that in most cases. we can show that sn(θ) → s∞(θ) for some θ.

These are reasonable conditions in many cases. θ).form convergence. θ). More on the limiting objective function The limiting objective function in assumption (b) is s∞(θ). • Usually. What is the nature of this function and where does it come from? • Remember our paradigm . and henceforth when dealing with speciﬁc estimators we’ll simply assume that pointwise almost sure convergence can be extended to uniform almost sure convergence in this way. • The limiting objective function is found by applying a strong (weak) law of large numbers to sn(Zn. θ) is an average of terms.data is presumed to be generated as a draw from fZn (z). sn(Zn. and the objective function is sn(Zn. .

ρ ∈ . ρ0)dz Zn .• A strong (weak) LLN says that an average of terms converges almost surely (in probability) to the limit of the expectation of the average. θ)fZn (z)dz Zn Now suppose that the density fZn (z) that characterizes the DGP is parametric: fZn (z. ρ). s∞(θ) = lim Esn(Zn. θ)fZn (z. which means that ρ0 is the item of interest. When the DGP is parametric. Supposing one holds. We are probably interested in learning about the true DGP. the limiting objective function is s∞(θ) = lim Esn(Zn. θ) = lim n→∞ n→∞ sn(z. and the data is generated by ρ0 ∈ . Now we have two parameters to worry about. θ) = lim n→∞ n→∞ sn(z. θ and ρ.

• If knowledge of θ0 is sufﬁcient for knowledge of some but not all elements of ρ0. For example. Often. ˆ a. In some cases. θ0 is referred to as the true parameter value. For example. we know that θn → θ0 What is the relationship between θ0 and ρ0? • ρ and θ may have different dimensions. • If knowledge of θ0 is sufﬁcient for knowledge of ρ0. ﬁtting a polynomial to an unknown nonlinear function. the statistical model (with parameter θ) only partially describes the DGP. the case of OLS with errors of unknown distribution. we have a correctly speciﬁed semiparametric . ρ0) to emphasize the dependence on the parameter of the DGP.s. From the theorem. the dimension of θ may be greater than that of ρ. we have a correctly and fully speciﬁed model.and we can write the limiting objective function as s∞(θ.

. depending on how good or poor is the model. θ0 is referred to as the pseudo-true parameter value. Because the objective function may or may not make sense. or if it causes us to draw false conclusions regarding at least some of the elements of ρ0. It says that with probability one. θ0 is referred to as the true parameter value. understanding that not all parameters of the DGP are estimated. our model is misspeciﬁed. an extremum estimator converges to the value that maximizes the limit of the expectation of the objective function. • If knowledge of θ0 is not sufﬁcient for knowledge of any elements of ρ0.model. Summary The theorem for consistency is really quite intuitive. we may or may not be estimating parameters of the DGP.

4 Example: Consistency of Least Squares We suppose that data is generated by random sampling of (Y.12.s. ε) has the common distribution function FZ = µxµε (x and ε are independent) with support Z = X × E. X E 2 ε2dµX dµE = σε . by the SLLN. n 1/n t=1 ε2 → t a. The sample objective function for a sample size n is n n sn(θ) = 1/n t=1 n (yt − βxt)2 = 1/n i=1 (β0xt + εt − βxt)2 n n = 1/n t=1 (xt (β0 − β))2 + 2/n t=1 xt (β0 − β) εt + 1/n t=1 ε2 t • Considering the last term. where yt = β0xt +εt. (X. X). Sup2 2 pose that the variances σX and σε are ﬁnite. .

• Considering the second term, since E(ε) = 0 and X and ε are independent, the SLLN implies that it converges to zero.
• Finally, for the first term, for a given β, we assume that a SLLN applies so that

(1/n) Σ_{t=1}^{n} (xt(β⁰ − β))² →a.s. ∫X (x(β⁰ − β))² dµX = (β⁰ − β)² ∫X x² dµX = (β⁰ − β)² E(X²)        (12.1)

Finally, the objective function is clearly continuous, and the parameter space is assumed to be compact, so the convergence is also uniform. Thus,

s∞(β) = (β⁰ − β)² E(X²) + σ²ε

A minimizer of this is clearly β = β⁰.
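A quick numerical check of this conclusion is easy to do. The following Octave sketch uses an invented DGP satisfying the example's assumptions (uniform x, standard normal ε, independent of x); it is not part of the example files for these notes.

# Sketch: the least squares extremum estimator approaches beta0 as n grows.
beta0 = 2;
for n = [100 10000 1000000]
  x = rand(n,1);                   # E(X^2) > 0, finite variance
  e = randn(n,1);                  # independent of x, mean zero
  y = beta0*x + e;
  beta_hat = (x'*y) / (x'*x);      # minimizer of s_n(beta)
  printf("n = %8d   beta_hat = %f\n", n, beta_hat);
endfor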

Exercise 29. Show that in order for the above solution to be unique it is necessary that E(X²) ≠ 0. Interpret this condition.

This example shows that Theorem 28 can be used to prove strong consistency of the OLS estimator. There are easier ways to show this, of course; this is only an example of application of the theorem.

12.5 Example: Inconsistency of Misspecified Least Squares

You already know that the OLS estimator is inconsistent when relevant variables are omitted. Let's verify this result in the context of extremum estimators. We suppose that data is generated by random sampling of (Y, X), where yt = β⁰xt + εt. (X, ε) has the common distribution function FZ = µX µε (x and ε are independent) with support Z = X × E. Suppose that the variances σ²X and σ²ε are finite. However,

Suppose that E(W ) = 0 but that E(W X) = 0. β0 . and instead proposes the misspeciﬁed model yt = γ0wt +ηt.the econometrician is unaware of the true DGP. The sample objective function for a sample size n is n n sn(γ) = 1/n t=1 n (yt − γwt)2 = 1/n i=1 n (β0xt + εt − γwt)2 n n n = 1/n t=1 (β0xt)2 + 1/n t=1 (γwt)2 + 1/n t=1 ε2 + 2/n t t=1 β0xtεt − 2/n t=1 β Using arguments similar to above. multiplied by the pseudo-true value of a regression of X on W. s∞(γ) = γ 2E W 2 − 2β0γE(W X) + C So. The OLS estimator is not consistent for the true parameter. γ0 = β0 E(W X) . E(W 2 ) which is the true parameter of the DGP.

1980 is an earlier reference.3. θ))2 i=1 We’ll study this more later.4. section 8.12. but for now it is clear that the foc for mini- .6 Example: Linearization of a nonlinear model Ref. θ0) + εi where εi ∼ iid(0. Intn’l Econ. White. Suppose we have a nonlinear model yi = h(xi. Gourieroux and Monfort. Rev. σ 2) The nonlinear least squares estimator solves 1 ˆ θn = arg min n n (yi − h(xi.

its mean is not zero. θ0) − x0 ∂x ∂h(x0.mization will require solving a set of nonlinear equations. θ0) yi = h(x . Deﬁne ∂h(x0. θ ) + (xi − x0) + νi ∂x 0 0 where νi encompasses both εi and the Taylor’s series remainder. A common approach to the problem seeks to avoid this difﬁculty by linearizing the model. A ﬁrst order Taylor’s series expansion about the point x0 with remainder gives ∂h(x0. Note that νi is no longer a classical error . θ0) α∗ = h(x0. We should expect problems. θ0) ∗ β = ∂x .

. to the γ 0 that minimizes s∞(γ): γ 0 = arg min EX EY |X (y − α − βx)2 u.a. β ) .s.s. one might try to estimate α∗ and β ∗ by applying OLS to yi = α + βxi + νi ˆ ˆ • Question. Let γ = (α. as one can see by interpreting α and β as extremum estimators. 1 ˆ γ = arg min sn(γ) = n n (yi − α − βxi)2 i=1 The objective function converges to its expectation sn(γ) → s∞(γ) = EX EY |X (y − α − βx)2 ˆ and γ converges a.Given this. will α and β be consistent for α∗ and β ∗? ˆ ˆ • The answer is no.

α0 and β 0 correspond to the hyperplane that is closest to the true regression function h(x. θ0) − α − βx 2 2 2 since cross products involving ε drop out. . θ0) + ε − α − βx = σ 2 + EX h(x.Noting that EX EY |X (y − α − x β) = EX EY |X h(x. This depends on both the shape of h(·) and the density function of the conditioning variables. θ0) according to the mean squared error criterion.

for example.Inconsistency of the linear approximation. θ0) is concave. if h(x. even at the approximation point x h(x. since. • Note that the true underlying parameter θ0 is not estimated consistently. which is 2 in this example). .θ) Tangent line x x β α x x x x x x x Fitted line x_0 • It is clear that the tangent line does not minimize MSE. either (it may be of a different dimension than the dimension of the parameter of the approximating model. all errors between the tangent line and the true function are negative.

The bias may not be too important for analysis of conditional means. In production and consumer analysis. • This sort of linearization about a long run equilibrium is a common practice in dynamic macroeconomic models. one should be cautious of unthinking application of models that impose stong restrictions on second derivatives. though to a less severe degree. but it can be very important for analyzing ﬁrst and second derivatives. Generalized Leontiev and other “ﬂexible functional forms” based upon second-order approximations in general suffer from bias and inconsistency. For this reason.. ﬁrst and second derivatives (e.• Second order and higher-order approximations suffer from exactly the same problem. so in this case. translog. but it is not justiﬁable for the estima- .g. of course. It is justiﬁed for the purposes of theoretical analysis of a model given the model’s parameters. elasticities of substitution) are often of interest.

The section on simulation-based methods offers a means of obtaining consistent estimators of the parameters of dynamic macro models that are too complex for standard methods of analysis. 111). Theorem 30.1. Establishment of asymptotic normality with a known scaling factor solves these two problems. assume .e. The following theorem is similar to Amemiya’s Theorem 4. 12.7 Asymptotic Normality A consistent estimator is oftentimes not very useful unless we know how fast it is likely to be converging to the true value. and the probability that it is far away from the true value.3 (pg.tion of the parameters of the model using data.] In addition to the assumptions of Theorem 28. [Asymptotic normality of e.

must hold exactly since the .o.s. J∞(θ0)−1I∞(θ0)J∞(θ0)−1 Proof: By Taylor expansion: 2 ˆ ˆ Dθ sn(θn) = Dθ sn(θ0) + Dθ sn(θ∗) θ − θ0 a. √ √ 0 d 0 0 (c) nDθ sn(θ ) → N 0.s. I∞(θ ) .2 (a) Jn(θ) ≡ Dθ sn(θ) exists and is continuous in an open.h. (b) {Jn(θn)} → J∞(θ0). ˆ where θ∗ = λθ + (1 − λ)θ0. of this equation is zero. ˆ since θn is a maximizer and the f. by consistency. convex neighborhood of θ0. a ﬁnite negative deﬁnite matrix. 2 ˆ • Note that θ will be in the neighborhood where Dθ sn(θ) exists with probability one as n becomes large. 0 ≤ λ ≤ 1.c. for any sequence {θn} that converges almost surely to θ0. • Now the l. at least asymptotically. where I∞(θ ) = limn→∞ V ar nDθ sn(θ0) √ ˆ d Then n θ − θ0 → N 0.

limiting objective function is strictly concave in a neighborhood of θ0. I∞(θ0) . I∞(θ0) by assumption c. ˆ ˆ a. and since θn → θ0 .s. assumption (b) gives 2 Dθ sn(θ∗) → J∞(θ0) a. so − J∞(θ ) + os(1) 0 √ d ˆ n θ − θ0 → N 0. So 0 = Dθ sn(θ0) + J∞(θ0) + os(1) And 0= Now √ √ nDθ sn(θ ) + J∞(θ ) + os(1) d 0 0 ˆ θ − θ0 √ ˆ n θ − θ0 nDθ sn(θ0) → N 0. • Also. since θ∗ is between θn and θ0.s.

J∞(θ0) + os(1) → J (θ0). Dθ sn(θ) is also a. so √ d ˆ n θ − θ0 → N 0. J∞(θ0)−1I∞(θ0)J∞(θ0)−1 by the Slutsky Theorem (see Gallant. If we were to omit the . assumption (c): nDθ sn(θ0) → N 0. an average of n matrices.Also.6). Theorem 4. as we saw in √ d Example 74. I∞(θ0) means that √ nDθ sn(θ0) = Op() where we use the result of Example 72. On the other hand. the elements of which are not centered (they do not have zero expectation). 2 the almost sure limit of Dθ sn(θ0).s. • Skip this in lecture. J∞(θ0) = O(1). A note on the order of these matrices: Supposing that sn(θ) is representable as an average of n terms. 2 which is the case for all estimators we consider. Supposing a SLLN applies.

to verify that we obtain the results seen before.√ n. we’d have Dθ sn(θ ) = n Op(1) = Op n− 2 1 0 −1 2 where we use the fact that Op(nr )Op(nq ) = Op(nr+q ). The se√ 0 quence Dθ sn(θ ) is centered. 12.8 Example: Classical linear model Let’s use the results to get the asymptotic distribution of the OLS estimator applied to the classical model. The OLS criterion is . so we need to scale by n to avoid convergence to zero.

sn(β) = (1/n) (y − Xβ)′(y − Xβ)
      = (1/n) (Xβ⁰ + ε − Xβ)′(Xβ⁰ + ε − Xβ)
      = (β⁰ − β)′ (X′X/n) (β⁰ − β) + (2/n) ε′X(β⁰ − β) + (1/n) ε′ε

The first derivative is

Dβ sn(β) = (1/n) ( −2X′X(β⁰ − β) − 2X′ε )

so, evaluating at β⁰,

Dβ sn(β⁰) = −2 X′ε / n

This has expectation 0, so the variance is the expectation of the outer product:

Var[ √n Dβ sn(β⁰) ] = E[ ( −2 X′ε/√n )( −2 ε′X/√n ) ] = E[ 4 X′εε′X / n ] = 4σ² X′X / n

Therefore

I∞(β⁰) = lim_{n→∞} Var[ √n Dβ sn(β⁰) ] = 4σ² QX

The second derivative is

Jn(β) = D²β sn(β⁰) = (1/n) [ −2X′X ].

A SLLN tells us that this converges almost surely to the limit of its expectation:

J∞(β⁰) = −2QX

There's no parameter in that last expression, so uniformity is not an issue. So, given the above, the asymptotic normality theorem (30) tells us that

√n (β̂ − β⁰) →d N[ 0, J∞(β⁰)^(−1) I∞(β⁰) J∞(β⁰)^(−1) ] = N[ 0, (−2QX)^(−1) (4σ²QX) (−2QX)^(−1) ]

or

√n (β̂ − β⁰) →d N( 0, QX^(−1) σ² )

which is, of course, the same thing we saw in equation 4.1. So the

theory seems to work :-) .
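And we can see it work numerically, too. The following Octave sketch is a small Monte Carlo check of the asymptotic distribution for a scalar regressor; the DGP is invented for illustration.

# Sketch: Monte Carlo check that sqrt(n)*(beta_hat - beta0) has variance sigma^2/QX.
n = 500; reps = 2000;
beta0 = 1; sigma = 2;
x = rand(n,1);                       # regressor held fixed in repeated samples
QX = mean(x.^2);
z = zeros(reps,1);
for r = 1:reps
  y = beta0*x + sigma*randn(n,1);
  beta_hat = (x'*y)/(x'*x);
  z(r) = sqrt(n)*(beta_hat - beta0);
endfor
[var(z), sigma^2/QX]                 # simulated vs. theoretical asymptotic variance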

12.9 Exercises

1. Suppose that xi ∼ uniform(0,1), and yi = 1 − x²i + εi, where εi is iid(0, σ²). Suppose we estimate the misspecified model yi = α + βxi + ηi by OLS. Find the numeric values of α⁰ and β⁰ that are the probability limits of α̂ and β̂.
2. Verify your results using Octave by generating data that follows the above model, and calculating the OLS estimator. When the sample size is very large the estimator should be very close to the analytical results you obtained in question 1.
3. Use the asymptotic normality theorem to find the asymptotic distribution of the ML estimator of β⁰ for the model y = xβ⁰ + ε, where ε ∼ N(0, 1) and is independent of x. This means finding ∂sn(β)/∂β, ∂²sn(β)/∂β∂β′, J(β⁰), and I(β⁰). The expressions may involve the unspecified density of x.

Its use of all of the information causes it to have a number of attractive properties. foremost of which is asymptotic efﬁciency. the 530 . For this reason.Chapter 13 Maximum likelihood estimation The maximum likelihood estimator is important because it uses all of the information in a fully speciﬁed statistical model.

Z. . y n is characterized by a parameter vector ψ0 : fY Z (Y. Suppose the joint density of Y = y1 . given the parameter.1 The likelihood function and Z = z1 . which essentially means that there is enough information to draw data from the DGP. . zn Suppose we have a sample of size n of the random vectors y and z. If this is the case. This is a fairly strong requirement. The ML estimator requires that the statistical model be fully speciﬁed.ML estimator can serve as a benchmark against which other estimators may be measured. 13. ψ0). . the ML estimator will not have the nice properties that it has under correct speciﬁcation. . . and for this reason we need to be concerned about the possible misspeciﬁcation of the statistical model.

ψ) = f (Y. Z. ψ0) = fY |Z (Y |Z. The maximum likelihood estimator of ψ0 is the value of ψ that maximizes the likelihood function. θ0)fZ (Z.This is the joint density of the sample. In this case. the variables Z are said to be exogenous for estimation of θ. Z. ψ). θ) with respect to θ is the same as the maximizer of the overall likelihood function fY Z (Y. for the elements of ψ that correspond to θ. Note that if θ0 and ρ0 share no elements. ψ) = fY |Z (Y |Z. Z. θ)fZ (Z. This density can be factored as fY Z (Y. then the maximizer of the conditional likelihood function fY |Z (Y |Z. ρ0) The likelihood function is just this density evaluated at other values ψ L(Y. ψ ∈ Ψ. ρ). Z. where Ψ is a parameter space. and we may more conveniently work with the conditional likelihood .

by using the fact that a joint density can be factored into the product of a marginal and conditional (doing this iteratively) L(Y. θ) • If the n observations are independent. θ) . z3. zn. • If this is not possible. we can always factor the likelihood into contributions of observations. θ)f (y3|y1. θ)f (y2|y1.y2. θ) · · · f (yn|y1. θ) = f (y1|z1. θ) = t=1 f (yt|zt. z2.function fY |Z (Y |Z. . θ) where the ft are possibly of different form. . The maximum likelihood estimator of θ0 = arg max fY |Z (Y |Z. y2. θ) for the purposes of estimating θ0. yt−n. . the likelihood function can be written as n L(Y |Z.

θ) t=1 The maximum likelihood estimator may thus be deﬁned equivalently as ˆ θ = arg max sn(θ). etc.. deﬁne xt = {y1. Now the likelihood function can be written as L(Y.. θ) The criterion function can be deﬁned as the average log-likelihood function: 1 1 sn(θ) = ln L(Y. . yt−1.. y2. x2 = {y1.it contains exogenous and predetermined endogeous variables.To simplify notation. z2}. θ) = n n n ln f (yt|xt. zt} so x1 = z1. θ) = t=1 n f (yt|xt. . .

1} / .where the set maximized over is deﬁned below. so that the probability of a heads may not be 0. Maybe we’re interested in estimating the probability of a heads. ˆ Dividing by n has no effect on θ. Since ln(·) is a monotonic increasing function.5. y ∈ {0. y ∈ {0. Example: Bernoulli trial Suppose that we are ﬂipping a coin that may be biased. 1} 0 = 0. ln L and L maximize at the same value of θ. The outcome of a toss is a Bernoulli random variable: fY (y. Let Y = 1(heads) be a binary variable that indicates whether or not a heads is observed. p0) = py (1 − p0)1−y .

p) = y ln p + (1 − y) ln(1 − p)

The derivative of this is

∂ ln fY(y, p)/∂p = y/p − (1 − y)/(1 − p) = (y − p) / ( p(1 − p) )

Averaging this over a sample of size n gives

∂sn(p)/∂p = (1/n) Σ_{i=1}^{n} (yi − p) / ( p(1 − p) )

Setting to zero and solving gives

p̂ = ȳ        (13.1)

So it's easy to calculate the MLE of p⁰ in this case. For future reference, note that E(Y) = Σ_{y∈{0,1}} y (p⁰)^y (1 − p⁰)^(1−y) = p⁰ and Var(Y) = E(Y²) − [E(Y)]² = p⁰ − (p⁰)².

Now imagine that we had a bag full of bent coins, each bent around a sphere of a different radius (with the head pointing to the outside of the sphere). We might suspect that the probability of a heads could depend upon the radius. Suppose that pi ≡ p(xi, β) = (1 + exp(−xi′β))^(−1) where xi = [1 ri]′, so that β is a 2×1 vector. Now

∂pi(β)/∂β = pi(1 − pi) xi

so

∂ ln fY(y, β)/∂β = ( (yi − pi) / ( pi(1 − pi) ) ) pi(1 − pi) xi = (yi − p(xi, β)) xi

So the derivative of the average log likelihood function is now

∂sn(β)/∂β = (1/n) Σ_{i=1}^{n} (yi − p(xi, β)) xi

This is a set of 2 nonlinear equations in the two unknown elements in β. There is no explicit solution for the two elements that set the equations to zero. This is commonly the case with ML estimators: they are often nonlinear, and finding the value of the estimate often requires use of numeric methods to find solutions to the first order conditions. See Chapter 11 for more information on how to do this.

13. The question is: what is the value that maximizes s∞(θ)? 1 Remember that sn(θ) = n ln L(Y. the expectation on the RHS is E L(θ) L(θ0) = L(θ) L(θ0)dy = 1. and since the . given basic assumptions it is consistent for the value that maximizes the limiting objective function. following Theorem 28. and L(Y. Now. θ0) is the true density of the sample data. θ). For any θ = θ0 L(θ) E ln L(θ0) ≤ ln E L(θ) L(θ0) by Jensen’s inequality ( ln (·) is a concave function). L(θ0) since L(θ0) is the density function of the observations.2 Consistency of MLE The MLE is an extremum estimator.

a. By the identiﬁcation assumption there is a unique maximizer. so the inequality is strict if θ = θ0: s∞(θ.s. a. and with continuity and a compact parameter space. . ≤ 0. this is uniform. θ0) = lim E (sn (θ)).s. Note: the θ0 appears because the expectation is taken with respect to the true density L(θ0). θ0) − s∞(θ0. θ0) < 0. ∀θ = θ0. since ln(1) = 0. A SLLN tells us that sn(θ) → s∞(θ. θ0) ≤ 0 except on a set of zero probability. so s∞(θ. θ0) − s∞(θ0.integral of any density is 1. Therefore. L(θ) E ln L(θ0) or E (sn (θ)) − E (sn (θ0)) ≤ 0.

13. . θ0 is the unique maximizer of s∞(θ. a.Therefore. θ0). the ML estimator is consistent. So. and thus.s.3 The score function Assumption: (Differentiability) Assume that sn(θ) is twice continuously differentiable in a neighborhood N (θ0) of θ0. at least when n is large enough. Theorem 28 tells us that n→∞ ˆ lim θ = θ0.

This is the expectation taken .To maximize the log-likelihood function. t=1 We will show that Eθ [gt(θ)] = 0. Note that the score function has Y as an argument. take derivatives: gn(Y. ˆ The ML estimator θ sets the derivatives to zero: ˆ =1 gn(θ) n n ˆ gt(θ) ≡ 0. ∀t. which implies that it is a random function. t=1 This is the score vector (with dim K × 1). θ) = Dθ sn(θ) n 1 = Dθ ln f (yt|xx. Y (and any exogeneous variables) will often be suppressed for clarity. θ) n t=1 1 ≡ n n gt(θ). but one should not forget that they are still there.

θ)]f (yt|x. θ)] f (yt|xt. we can switch the order of integration and differentiation. This gives Eθ [gt(θ)] = Dθ = Dθ 1 = 0 where we use the fact that the integral of the density is 1. θ) Dθ f (yt|xt. Given some regularity conditions on boundedness of Dθ f. θ)dyt . f (yt|xt. not necessarily f (θ0) . Eθ [gt(θ)] = = = [Dθ ln f (yt|xt. θ)dyt. by the dominated convergence theorem. θ)dyt 1 [Dθ f (yt|xt.with respect to the density f (θ). θ)dyt f (yt|xt.

• So Eθ (gt(θ) = 0 : the expectation of the score vector is zero. Take a ﬁrst order Taylor’s series expanˆ sion of g(Y. θ) = 0. so it implies that Eθ gn(Y. . θ) about the true value θ0 : ˆ ˆ 0 ≡ g(θ) = g(θ0) + (Dθ g(θ∗)) θ − θ0 or with appropriate deﬁnitions ˆ J (θ∗) θ − θ0 = −g(θ0). 13. • This hold for all t.4 Asymptotic normality of MLE Recall that we assume that the log-likelihood function sn(θ) is twice continuously differentiable.

where θ* = λθ̂ + (1 − λ)θ0, 0 < λ < 1. Assume J(θ*) is invertible (we'll justify this in a minute). So

√n (θ̂ − θ0) = −J(θ*)^(−1) √n g(θ0).

Now consider J(θ*), the matrix of second derivatives of the average log-likelihood function. This is

J(θ*) = D_θ' g(θ*) = D²_θ s_n(θ*) = (1/n) Σ_{t=1}^n D²_θ ln f_t(θ*),

where the notation D²_θ s_n(θ) ≡ ∂²s_n(θ)/∂θ∂θ'. Given that this is an average of terms, it should usually be the case that this satisfies a strong law of large numbers (SLLN). Regularity conditions are a set of assumptions that guarantee that this will happen. There are different sets of assumptions that can be used to justify appeal to different SLLN's. For example, the D²_θ ln f_t(θ*) must not be too strongly dependent over time, and their variances must not become infinite. We don't assume any particular set here, since the appropriate assumptions will depend upon the particularities of a given model. However, we assume that a SLLN applies.

Also, since we know that θ̂ is consistent, and since θ* = λθ̂ + (1 − λ)θ0, we have that θ* → θ0, a.s. Also, by the above differentiability assumption, J(θ) is continuous in θ. Given this, J(θ*) converges to the limit of its expectation:

J(θ*) → lim_{n→∞} E[D²_θ s_n(θ0)] = J∞(θ0) < ∞, a.s.

This matrix converges to a finite limit. Re-arranging orders of limits and differentiation, which is legitimate given certain regularity conditions related to the boundedness of the log-likelihood function, we get

J∞(θ0) = D²_θ lim_{n→∞} E(s_n(θ0)) = D²_θ s∞(θ0, θ0).

We've already seen that s∞(θ, θ0) < s∞(θ0, θ0), i.e., θ0 maximizes the limiting objective function. Since there is a unique maximizer, and by the assumption that s_n(θ) is twice continuously differentiable (which holds in the limit), J∞(θ0) must be negative definite, and therefore of full rank. Therefore the previous inversion is justified, asymptotically, and we have

√n (θ̂ − θ0) = −J(θ*)^(−1) √n g(θ0).    (13.2)

Now consider √n g(θ0). This is

√n g_n(θ0) = √n D_θ s_n(θ0) = (√n/n) Σ_{t=1}^n D_θ ln f(y_t|x_t, θ0) = (1/√n) Σ_{t=1}^n g_t(θ0).

We've already seen that Eθ[g_t(θ)] = 0. Note that g_n(θ0) → 0, a.s., by consistency. To avoid this collapse to a degenerate r.v. (a constant vector) we need to scale by √n. A generic CLT states that, for X_n a random vector that satisfies certain conditions,

X_n − E(X_n) →(d) N(0, lim V(X_n)).

The "certain conditions" that X_n must satisfy depend on the case at hand. Usually, X_n will be of the form of an average, scaled by √n:

X_n = √n (1/n) Σ_{t=1}^n X_t.

This is the case for √n g(θ0), for example. Then the properties of X_n depend on the properties of the X_t. For example, if the X_t have finite variances and are not too strongly dependent, then a CLT for dependent processes will apply. As such, it is reasonable to assume that a CLT applies here. Supposing that it does, and noting that E(√n g_n(θ0)) = 0, we get

√n g_n(θ0) →(d) N[0, I∞(θ0)]    (13.3)

where

I∞(θ0) = lim_{n→∞} Eθ0 { n [g_n(θ0)] [g_n(θ0)]' } = lim_{n→∞} Vθ0 (√n g_n(θ0)).

• I∞(θ0) is known as the information matrix.
• Combining [13.2] and [13.3], and noting that J(θ*) → J∞(θ0), a.s., we get

√n (θ̂ − θ0) ∼(a) N(0, J∞(θ0)^(−1) I∞(θ0) J∞(θ0)^(−1)).

The MLE estimator is asymptotically normally distributed.

Definition 31. (Consistent and asymptotically normal, CAN) An estimator θ̂ of a parameter θ0 is √n-consistent and asymptotically normally distributed if

√n (θ̂ − θ0) →(d) N(0, V∞),

where V∞ is a finite positive definite matrix.

There do exist, in special cases, estimators that are consistent such that √n (θ̂ − θ0) →(p) 0. These are known as superconsistent estimators, since in ordinary circumstances with stationary data, √n is the highest factor that we can multiply by and still get convergence to a stable limiting distribution. Such cases are unusual, though.

Definition 32. (Asymptotically unbiased) An estimator θ̂ of a parameter θ0 is asymptotically unbiased if lim_{n→∞} Eθ(θ̂) = θ.

Estimators that are CAN are asymptotically unbiased, though not all consistent estimators are asymptotically unbiased.

13.5 The information matrix equality

We will show that J∞(θ) = −I∞(θ). Let f_t(θ) be short for f(y_t|x_t, θ). We have

∫ f_t(θ) dy = 1,

so, differentiating,

0 = ∫ D_θ f_t(θ) dy = ∫ (D_θ ln f_t(θ)) f_t(θ) dy.

Now differentiate again:

0 = ∫ [D²_θ ln f_t(θ)] f_t(θ) dy + ∫ [D_θ ln f_t(θ)] D_θ' f_t(θ) dy
  = Eθ[D²_θ ln f_t(θ)] + ∫ [D_θ ln f_t(θ)] [D_θ ln f_t(θ)]' f_t(θ) dy
  = Eθ[D²_θ ln f_t(θ)] + Eθ{[D_θ ln f_t(θ)] [D_θ ln f_t(θ)]'}
  = Eθ[J_t(θ)] + Eθ{[g_t(θ)] [g_t(θ)]'}    (13.4)

Now sum over n and multiply by 1/n:

Eθ{(1/n) Σ_{t=1}^n J_t(θ)} = −Eθ{(1/n) Σ_{t=1}^n [g_t(θ)] [g_t(θ)]'}    (13.5)

The scores g_t and g_s are uncorrelated for t ≠ s, since for t > s, f_t(y_t|y_1, ..., y_{t−1}, θ) has conditioned on prior information, so what was random in s is fixed in t. (This forms the basis for a specification test proposed by White: if the scores appear to be correlated, one may question the specification of the model.) This allows us to write

Eθ[J_n(θ)] = −Eθ{n [g(θ)] [g(θ)]'}    (13.6)

since all cross products between different periods have expectation zero. Finally, taking limits, we get

J∞(θ) = −I∞(θ).

This holds for all θ, in particular, for θ0. Using this, the asymptotic distribution

√n (θ̂ − θ0) ∼(a) N(0, J∞(θ0)^(−1) I∞(θ0) J∞(θ0)^(−1))

simplifies to

√n (θ̂ − θ0) ∼(a) N(0, −J∞(θ0)^(−1))    (13.7)

or

√n (θ̂ − θ0) ∼(a) N(0, I∞(θ0)^(−1)).    (13.8)

To estimate the asymptotic variance, we need estimators of J∞(θ0) and I∞(θ0). We can use

Î∞(θ0) = (1/n) Σ_{t=1}^n g_t(θ̂) g_t(θ̂)'
Ĵ∞(θ0) = J_n(θ̂).

Note, one can't use

Î∞(θ0) = n g_n(θ̂) g_n(θ̂)'

to estimate the information matrix. Why not? From this we see that there are alternative ways to estimate V∞(θ0) that are all valid. These include

V̂∞(θ0) = −Ĵ∞(θ0)^(−1)
V̂∞(θ0) = Î∞(θ0)^(−1)
V̂∞(θ0) = Ĵ∞(θ0)^(−1) Î∞(θ0) Ĵ∞(θ0)^(−1)

These are known as the inverse Hessian, outer product of the gradient (OPG) and sandwich estimators, respectively. The sandwich form is the most robust, since it coincides with the covariance estimator of the quasi-ML estimator, as is intuitive if one considers equation 13.5.
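As an illustration, the three estimators take only a few lines of Octave. This is a sketch, not part of the course software: it assumes that scores is an n × K matrix whose t-th row is g_t(θ̂)', and that H is the K × K Hessian of the average log-likelihood evaluated at θ̂; both names are hypothetical.

% Sketch: inverse Hessian, OPG and sandwich estimators of V_infinity(theta0).
% Assumes scores (n x K, rows g_t(thetahat)') and H (K x K average Hessian) exist.
n = rows(scores);
J_hat = H;                                    % estimate of J_inf(theta0)
I_hat = (scores' * scores) / n;               % estimate of I_inf(theta0)
V_hessian  = -inv(J_hat);                     % inverse Hessian form
V_opg      = inv(I_hat);                      % outer product of the gradient form
V_sandwich = inv(J_hat) * I_hat * inv(J_hat); % sandwich form
% The estimated covariance of thetahat itself is each of these divided by n.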

Coin ﬂipping.Example. data.i.) n 1 = lim nV ar(y) (by identically distributed obs.1). We can do this in a simple way: √ p lim V ar n (ˆ − p0) = lim nV ar (ˆ − p0) p = lim nV ar (ˆ) p = lim nV ar (¯) y = lim nV ar = lim yt n 1 V ar(yt) (by independence of obs. with i.) n = V ar(y) = p0 (1 − p0) .1 we saw that the MLE for the parameter of a Bernoulli ˆ ¯ trial. √ p Now let’s ﬁnd the limiting variance of n (ˆ − p0).d. again In section 13. is the sample mean: p = y (equation 13.

we get the same limiting variance using both methods. The log-likelihood function is 1 sn(p) = n so Esn(p) = p0 ln p + 1 − p0 (1 − ln p) by the fact that the observations are i. s∞(p) = p0 ln p + 1 − p0 (1 − ln p).While that is simple.i. A bit of calculation shows that 2 Dθ sn(p) p=p0 n {yt ln p + (1 − yt) (1 − ln p)} t=1 −1 . −J∞ (p0) = p0 1 − p0 . let’s verify this using the methods of Chapter 12 give the same answer.d. ≡ Jn(θ) = 0 p (1 − p0) √ ˆ which doesn’t depend upon n. Thus. lim V ar n p − p0 −1 −1 −J∞ (p0). So. By results we’ve seen on MLE. And in this case. .
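A quick Monte Carlo check of this limiting variance is easy to do in Octave. The following sketch is not one of the course scripts; the values of p0, n and the number of replications are arbitrary choices:

% Sketch: simulate sqrt(n)*(phat - p0) and compare its variance to p0*(1-p0).
p0 = 0.7; n = 1000; reps = 5000;
z = zeros(reps, 1);
for r = 1:reps
  y = (rand(n, 1) < p0);                % Bernoulli(p0) draws
  z(r) = sqrt(n) * (mean(y) - p0);
endfor
printf("simulated variance: %f   p0*(1-p0): %f\n", var(z), p0*(1-p0));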

13.6 The Cramér-Rao lower bound

Theorem 33. [Cramér-Rao Lower Bound] The limiting variance of a CAN estimator of θ0, say θ̃, minus the inverse of the information matrix is a positive semidefinite matrix.

Proof: Since the estimator is CAN, it is asymptotically unbiased, so

lim_{n→∞} Eθ(θ̃ − θ) = 0.

Differentiate wrt θ':

D_θ' lim_{n→∞} Eθ(θ̃ − θ) = lim_{n→∞} ∫ D_θ' [f(Y, θ) (θ̃ − θ)] dy = 0

(this is a K × K matrix of zeros). Noting that D_θ' f(Y, θ) = f(θ) D_θ' ln f(θ), we can write

lim_{n→∞} ∫ (θ̃ − θ) f(θ) D_θ' ln f(θ) dy + lim_{n→∞} ∫ f(Y, θ) D_θ' (θ̃ − θ) dy = 0.

Now note that D_θ' (θ̃ − θ) = −I_K, and ∫ f(Y, θ)(−I_K) dy = −I_K. With this we have

lim_{n→∞} ∫ (θ̃ − θ) f(θ) D_θ' ln f(θ) dy = I_K.

Playing with powers of n we get

lim_{n→∞} ∫ √n (θ̃ − θ) √n (1/n) [D_θ' ln f(θ)] f(θ) dy = I_K.

Note that the bracketed part is just the transpose of the score vector, g(θ), so we can write

lim_{n→∞} Eθ { √n (θ̃ − θ) √n g(θ)' } = I_K.

This means that the covariance of the score function with √n (θ̃ − θ), for θ̃ any CAN estimator, is an identity matrix. Using this, suppose the variance of √n (θ̃ − θ) tends to V∞(θ̃). Therefore,

V∞ [ √n (θ̃ − θ) ; √n g(θ) ] = [ V∞(θ̃)  I_K ; I_K  I∞(θ) ].    (13.9)

Since this is a covariance matrix, it is positive semi-definite. Therefore, for any K-vector α,

[ α'  −α' I∞^(−1)(θ) ] [ V∞(θ̃)  I_K ; I_K  I∞(θ) ] [ α ; −I∞(θ)^(−1) α ] ≥ 0.

This simplifies to

α' [ V∞(θ̃) − I∞^(−1)(θ) ] α ≥ 0.

Since α is arbitrary, V∞(θ̃) − I∞^(−1)(θ) is positive semidefinite. This concludes the proof. This means that I∞^(−1)(θ) is a lower bound for the asymptotic variance of a CAN estimator.

Definition. (Asymptotic efficiency) Given two CAN estimators of a parameter θ0, say θ̃ and θ̂, θ̂ is asymptotically efficient with respect to θ̃ if V∞(θ̃) − V∞(θ̂) is a positive semidefinite matrix.

A direct proof of asymptotic efficiency of an estimator is infeasible, but if one can show that the asymptotic variance is equal to the inverse of the information matrix, then the estimator is asymptotically efficient. In particular, the MLE is asymptotically efficient with respect to any other CAN estimator.

13.7 Likelihood ratio-type tests

Suppose we would like to test a set of q possibly nonlinear restrictions r(θ) = 0, where the q × k matrix D_θ' r(θ) has rank q. The Wald test can be calculated using the unrestricted model. The score test can be calculated using only the restricted model. The likelihood ratio test, on the other hand, uses both the restricted and the unrestricted estimators. The test statistic is

LR = 2 [ ln L(θ̂) − ln L(θ̃) ]

where θ̂ is the unrestricted estimate and θ̃ is the restricted estimate. To show that it is asymptotically χ², take a second order Taylor series expansion of ln L(θ̃) about θ̂:

ln L(θ̃) ≃ ln L(θ̂) + (n/2) (θ̃ − θ̂)' J(θ̂) (θ̃ − θ̂)

(note, the first order term drops out since D_θ ln L(θ̂) ≡ 0 by the fonc, and we need to multiply the second-order term by n since J(θ) is defined in terms of (1/n) ln L(θ)), so

LR ≃ −n (θ̃ − θ̂)' J(θ̂) (θ̃ − θ̂).

As n → ∞, J(θ̂) → J∞(θ0) = −I(θ0), by the information matrix equality, so

LR ≃(a) n (θ̃ − θ̂)' I∞(θ0) (θ̃ − θ̂).    (13.10)

We also have that, from the theory on the asymptotic normality of the MLE and the information matrix equality,

√n (θ̂ − θ0) ≃(a) I∞(θ0)^(−1) n^(1/2) g(θ0).

An analogous result for the restricted estimator is (this is unproven here; to prove it, set up the Lagrangean for MLE subject to Rβ = r, and manipulate the first order conditions):

√n (θ̃ − θ0) ≃(a) I∞(θ0)^(−1) [ I_K − R' (R I∞(θ0)^(−1) R')^(−1) R I∞(θ0)^(−1) ] n^(1/2) g(θ0).

Combining the last two equations,

√n (θ̃ − θ̂) ≃(a) −n^(1/2) I∞(θ0)^(−1) R' (R I∞(θ0)^(−1) R')^(−1) R I∞(θ0)^(−1) g(θ0)

so, substituting into [13.10],

LR ≃(a) [ n^(1/2) g(θ0)' I∞(θ0)^(−1) R' ] (R I∞(θ0)^(−1) R')^(−1) [ R I∞(θ0)^(−1) n^(1/2) g(θ0) ].

But since

n^(1/2) g(θ0) →(d) N(0, I∞(θ0)),

the linear function

R I∞(θ0)^(−1) n^(1/2) g(θ0) →(d) N(0, R I∞(θ0)^(−1) R').

We can see that LR is a quadratic form of this rv, with the inverse of its variance in the middle, so

LR →(d) χ²(q).
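In practice the statistic is trivial to compute once the two models have been estimated. A one-line Octave sketch (logL_u, logL_r and q are assumed to hold the maximized unrestricted and restricted log-likelihoods and the number of restrictions; these names are illustrative, not from the course scripts):

% Sketch: the LR statistic from the two maximized log-likelihood values.
LR = 2 * (logL_u - logL_r);
printf("LR = %f, asymptotically chi^2(%d) under the null\n", LR, q);
% Compare LR with a chi^2(q) critical value (e.g. 3.84 for q = 1 at the 5% level).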

Summary of MLE
• Consistent
• Asymptotically normal (CAN)
• Asymptotically efficient
• Asymptotically unbiased
• LR test is available for testing hypotheses
• The presentation is for general MLE: we haven't specified the distribution or the linearity/nonlinearity of the estimator

13.8 Example: Binary response models

This section extends the Bernoulli trial model to binary response models with conditioning variables, as such models arise in a variety of contexts. Assume that

y* = x'θ − ε
y = 1(y* > 0)
ε ∼ N(0, 1)

Here, y* is an unobserved (latent) continuous variable, and y is a binary variable that indicates whether y* is negative or positive. Then the probit model results, where

Pr(y = 1|x) = Pr(ε < x'θ) = Φ(x'θ),

where

Φ(x'θ) = ∫_{−∞}^{x'θ} (2π)^(−1/2) exp(−ε²/2) dε

is the standard normal distribution function. The logit model results if the errors are not normal, but rather have a logistic distribution. This distribution is similar to the standard normal, but has fatter tails. The probability has the following parameterization:

Pr(y = 1|x) = Λ(x'θ) = (1 + exp(−x'θ))^(−1).

In general, a binary response model will require that the choice probability be parameterized in some form, which could be logit, probit, or something else. For a vector of explanatory variables x, the response probability will be parameterized in some manner:

Pr(y = 1|x) = p(x, θ)

If p(x, θ) = Λ(x'θ), we have a logit model. If p(x, θ) = Φ(x'θ), where Φ(·) is the standard normal distribution function, then we have a probit model. Regardless of the parameterization, we are dealing with a Bernoulli density:

f_{Y_i}(y_i|x_i) = p(x_i, θ)^{y_i} (1 − p(x_i, θ))^{1−y_i}

First one can take the expectation conditional on x to get Ey|x {y ln p(x.so as long as the observations are independent. and following a SLLN for i. θ)+[1 − p(x. θ0)] ln [1 − p .d. Noting that Eyi = p(xi. processes. θ) + (1 − y) ln [1 − p(x. sn(θ) converges almost surely to the expectation of a representative term s(y.i. xi. x. θ)]) i=1 n s(yi. θ). θ0) ln p(x. θ. i=1 (13. θ)]} = p(x. the maximum likeliˆ hood (ML) estimator. θ) + (1 − yi) ln [1 − p(xi. is the maximizer of 1 sn(θ) = n 1 ≡ n n (yi ln p(xi. θ0). θ). θ tends in probability to the θ0 that maximizes the uniform almost sure limit of sn(θ).11) ˆ Following the above theoretical results.

the integral is understood to be multiple. θ∗. and if the parameter space is compact we therefore have uniform almost sure convergence.Next taking expectation over x we get the limiting objective function s∞(θ) = X {p(x. θ) is continuous. θ)]} µ(x)dx. as long as p(x. θ is consistent. solves the ﬁrst order conditions p(x. for example.12) where µ(x) is the (joint . and X is the support of x) density function of the explanatory variables x. θ∗) µ(x)dx = 0 p(x. θ∗) ∂θ X ˆ This is clearly solved by θ∗ = θ0. θ) + [1 − p(x. θ0) ∂ 1 − p(x. θ) is continous for the logit and probit models. Question: what’s needed to ensure that the solution is . This is clearly continuous in θ. θ0)] ln [1 − p(x. θ∗) ∂θ 1 − p(x. θ0) ln p(x. (13. θ∗) − p(x. Note that p(x. θ0) ∂ p(x. The maximizing element of s∞(θ). Provided the solution is unique.

in the consistency proof above and the fact that observations are i.d. since it’s zero.c. .o. √ In the case of i. following the f.unique? The asymptotic normality theorem tells us that d ˆ n θ − θ0 → N 0.i.i.d. observations I∞(θ0) = limn→∞ V ar nDθ sn(θ0) is √ simply the expectation of a typical element of the outer product of the gradient. J∞(θ0)−1I∞(θ0)J∞(θ0)−1 . • There’s no need to subtract the mean.

• The terms in n also drop out by the same argument:

lim_{n→∞} Var[√n D_θ s_n(θ0)] = lim_{n→∞} Var[√n D_θ (1/n) Σ_t s(θ0)]
  = lim_{n→∞} Var[(1/√n) D_θ Σ_t s(θ0)]
  = lim_{n→∞} (1/n) Var[D_θ Σ_t s(θ0)]
  = lim_{n→∞} Var[D_θ s(θ0)]
  = Var[D_θ s(θ0)]

So we get

I∞(θ0) = E[ (∂/∂θ) s(y, x, θ0) (∂/∂θ') s(y, x, θ0) ].

Likewise,

J∞(θ0) = E[ (∂²/∂θ∂θ') s(y, x, θ0) ].

Expectations are jointly over y and x, or equivalently, first over y conditional on x, then over x. From above, a typical element of the objective function is

s(y, x, θ0) = y ln p(x, θ0) + (1 − y) ln[1 − p(x, θ0)].

Now suppose that we are dealing with a correctly specified logit model:

p(x, θ) = (1 + exp(−x'θ))^(−1).

We can simplify the above results in this case. We have that

∂p(x, θ)/∂θ = (1 + exp(−x'θ))^(−2) exp(−x'θ) x
            = (1 + exp(−x'θ))^(−1) [exp(−x'θ)/(1 + exp(−x'θ))] x
            = p(x, θ) (1 − p(x, θ)) x
            = (p(x, θ) − p(x, θ)²) x.

So

(∂/∂θ) s(y, x, θ0) = [y − p(x, θ0)] x    (13.13)
(∂²/∂θ∂θ') s(θ0) = − [p(x, θ0) − p(x, θ0)²] x x'.    (13.14)

Taking expectations over y then x gives

I∞(θ0) = ∫ E_Y [ (y − p(x, θ0))² ] x x' µ(x) dx = ∫ [ p(x, θ0) − p(x, θ0)² ] x x' µ(x) dx,    (13.15)

where we use the fact that E_Y(y) = E_Y(y²) = p(x, θ0). Likewise,

J∞(θ0) = − ∫ [ p(x, θ0) − p(x, θ0)² ] x x' µ(x) dx.    (13.16)

Note that we arrive at the expected result: the information matrix equality holds (that is, J∞(θ0) = −I∞(θ0)). With this,

√n (θ̂ − θ0) →(d) N(0, J∞(θ0)^(−1) I∞(θ0) J∞(θ0)^(−1))

simplifies to

√n (θ̂ − θ0) →(d) N(0, −J∞(θ0)^(−1)),

which can also be expressed as

√n (θ̂ − θ0) →(d) N(0, I∞(θ0)^(−1)).

On a final note, the logit and standard normal CDF's are very similar — the logit distribution is a bit more fat-tailed. While coefficients will vary slightly between the two models, functions of interest such as estimated probabilities p(x, θ̂) will be virtually identical for the two models.

y ∈ {0. 13. p0) = py (1 − p0)1−y . see Section 11.4 in the chapter on Numerical Optimization. Consider coin tossing with a single possibly biased coin.10 Exercises 1.13. y ∈ {0. 1} 0 = 0. The density function for the random variable y = 1(heads) is fY (y. You should examine the scripts and run them to see you MLE is actually done. 1} / .9 Examples For examples of MLE using logit and Poisson model applied to data.

Suppose that we have a sample of size n. Please give me histograms that show the sam√ pling frequency of n (¯ − p0) for several values of n. J∞(p0)−1I∞(p0)J∞(p0)−1 a a) ﬁnd the analytic expression for gt(θ) and show that Eθ [gt(θ)] = 0 b) ﬁnd the analytical expressions for J∞(p0) and I∞(p0) for this problem p c) verify that the result for lim V ar n (ˆ − p) found in section 13. y √ . We know from above that the ML estimator is p0 = y .5 is equal to J∞(p0)−1I∞(p0)J∞(p0)−1 d) Write an Octave program that does a Monte Carlo study that √ y shows that n (¯ − p0) is approximately normally distributed when n is large. We also know from the theory ¯ above that √ y n (¯ − p0) ∼ N 0.

Consider the model yt = xtβ + α t where the errors follow the Cauchy (Student-t with 1 degree of freedom) density.2. Find the score function gn(θ) where θ = β α . −∞ < π (1 + 2) t t <∞ The Cauchy density has a shape similar to a normal density. Why are the ﬁrst order conditions that deﬁne an efﬁcient estimator different . 4. Consider the model classical linear regression model yt = xtβ + where β σ t ∼ IIN (0. but with much thicker tails. Compare the ﬁrst order conditions that deﬁne the ML estimators of problems 2 and 3 and interpret the differences. t 3. Thus. σ 2). So 1 f ( t) = . extremely small and large errors occur much more frequently with this density than would happen if the errors were normally distributed. Find the score function gn(θ) where θ = .

7.a).d.8 by ML.m. Interpret the output. Edit (to see what it does) then run the script mle_example. σ 2) and compare to the results you get from running Nerlove.2a).p. follows the logit model: Pr(y = 1|x) = 1 + exp(−β 0x) (a) Assume that x ∼ uniform(-a. assuming that the errors are i. Find the asymptotic distribution of the ML estimator of β 0 (this is a scalar parameter). Assume a d. (c) Comment on the results 6. −1 .m .g.in the two cases? 5. There is an ML estimation routine in the provided software that accompanies these notes. Again ﬁnd the asymptotic distribution of the ML estimator of β 0. (b) Now assume that x ∼ uniform(-2a. N (0.i. Estimate the simple Nerlove model discussed in section 3. .

Chapter 14

Generalized method of moments

Readings: Hamilton Ch. 14*; Davidson and MacKinnon, Ch. 17 (see pg. 587 for refs. to applications); Newey and McFadden (1994), "Large Sample Estimation and Hypothesis Testing", in Handbook of Econometrics, Vol. 4, Ch. 36.

14.1 Motivation

Sampling from χ²(θ0)

Example 34. (Method of moments, v1) Suppose we draw a random sample of y_t from the χ²(θ0) distribution. Here, θ0 is the parameter of interest. The first moment (expectation), µ1, of a random variable will in general be a function of the parameters of the distribution: µ1 = µ1(θ0). In this example, if Y ∼ χ²(θ0), then E(Y) = θ0, so the relationship is the identity function µ1(θ0) = θ0, though in general the relationship may be more complicated. The sample first moment is

µ̂1 = Σ_{t=1}^n y_t / n.

Define the single observation contribution to the moment condition as the true moment minus the t-th observation's contribution to the sample moment:

m_{1t}(θ) = µ1(θ) − y_t

The corresponding average moment condition is

m_1(θ) = µ1(θ) − µ̂1

where the sample moment µ̂1 = ȳ = Σ_{t=1}^n y_t / n. The method of moments principle is to choose the estimator of the parameter to set the estimate of the population moment equal to the sample moment, i.e., m_1(θ̂) ≡ 0. Then the equation is solved for the estimator. In this case,

m_1(θ̂) = θ̂ − Σ_{t=1}^n y_t / n = 0

is solved by θ̂ = ȳ. Since ȳ = Σ_{t=1}^n y_t / n →(p) θ0 by the LLN, the estimator is consistent.

Example 35. (Method of moments, v2) The variance of a χ²(θ0) r.v. is

V(y_t) = E(y_t − θ0)² = 2θ0.

The sample variance is V̂(y_t) = Σ_{t=1}^n (y_t − ȳ)² / n. Define the average moment condition as the population moment minus the sample moment:

m_2(θ) = V(y_t) − V̂(y_t) = 2θ − Σ_{t=1}^n (y_t − ȳ)² / n

We can see that the average moment condition is the average of the contributions

m_{2t}(θ) = V(y_t) − (y_t − ȳ)².

The MM estimator using the variance would set

m_2(θ̂) = 2θ̂ − Σ_{t=1}^n (y_t − ȳ)² / n ≡ 0.

Again, by the LLN, the sample variance is consistent for the true variance, that is, Σ_{t=1}^n (y_t − ȳ)² / n →(p) 2θ0. So the estimator is half the sample variance:

θ̂ = (1/2) Σ_{t=1}^n (y_t − ȳ)² / n.

This estimator is also consistent for θ0.

Example 36. Try some MM estimation yourself: here's an Octave script that implements the two MM estimators discussed above: GMM/chi2mm.m
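If you don't have the script at hand, the two estimators amount to the following sketch (this is not the chi2mm.m script itself; θ0 and n are arbitrary, and the χ²(θ0) draws are generated as sums of θ0 squared standard normals, which works for integer θ0):

% Sketch of the two method of moments estimators for chi^2(theta0) data.
theta0 = 3; n = 1000;
y = sum(randn(n, theta0).^2, 2);        % chi^2(theta0) draws (integer theta0)
theta_v1 = mean(y);                     % v1: sets  thetahat - ybar = 0
theta_v2 = var(y) / 2;                  % v2: sets  2*thetahat - sample variance = 0
printf("MM via mean: %f   MM via variance: %f\n", theta_v1, theta_v2);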

Note that when you run the script, the two estimators give different results. Each of the two estimators is consistent.

• With two moment-parameter equations and only one parameter, we have overidentification, which means that we have more information than is strictly necessary for consistent estimation of the parameter.
• The idea behind GMM is to combine information from the two moment-parameter equations to form a new estimator which will be more efficient, in general (proof of this below).

This is because the ML estimator uses all of the available information (e. the ML estimator is inter- . Recalling that a distribution is completely characterized by its moments. θ ) = 0 Γ θ0 + 1 /2 (πθ0)1/2 Γ (θ0/2) 1+ 2 yt /θ0 −(θ0 +1)/2 Given an iid sample of size n.v. Yt is fYt (yt..Sampling from t(θ0) Here’s another example based upon the t-distribution. one could estimate θ0 by maximizing the log-likelihood function n ˆ θ ≡ arg max ln Ln(θ) = Θ t=1 ln fYt (yt. The density function of a t-distributed r.g. θ) • This approach is attractive since ML estimators are asymptotically efﬁcient. the distribution is fully speciﬁed up to a parameter).

t=1 As before. efﬁciency is lost relative to the ML estimator. A t-distributed r. ˆ ˆ Choosing θ to set m1(θ) ≡ 0 yields a MM estimator: ˆ θ= 2 1− n 2 i yi (14. Using the notation introduced previously. θ0) has mean zero and variance V (yt) = θ0/ θ0 − 2 (for θ0 > 2). The method of moments estimator uses only K moments to estimate a K−dimensional parameter. with density fYt (yt.pretable as a GMM estimator that uses all of the moments. when evaluated at the true parameter value θ0.v. in general.1) . Since information is discarded. by the MM estimator. (Method of moments). Example 37. both Eθ0 m1t(θ0) = 0 and Eθ0 m1(θ0) = 0. deﬁne a moment con2 tribution m1t(θ) = θ/ (θ − 2) − yt and the average moment condition m1(θ) = 1/n n t=1 m1t (θ) = θ/ (θ − 2) − 1/n n 2 yt .

is 3 θ0 4 . . (Method of moments). Example 38. We can deﬁne a second moment condition 3 (θ)2 1 m2(θ) = − (θ − 2) (θ − 4) n n 4 yt t=1 2 ˆ ˆ A second. If you solve this you’ll see that the estimate is different from that in equation 14.1. The fourth moment of a t-distributed r. different MM estimator chooses θ to set m2(θ) ≡ 0.v. An alternative MM estimator could be based upon the fourth moment of the t-distribution.it uses less information than the ML estimator.This estimator is based on only one moment of the distribution . µ4 ≡ E(yt ) = 0 0 − 4) (θ − 2) (θ provided that θ0 > 4. so it is intuitively clear that the MM estimator will be inefﬁcient relative to the ML estimator.

and Wn converges almost surely to a ﬁnite g × g symmetric positive deﬁnite matrix W∞.2 Deﬁnition of GMM estimator For the purposes of this course. A GMM estimator would use the two moment conditions together to estimate the single parameter.This estimator isn’t efﬁcient either. θ ≡ arg minΘ sn(θ) ≡ mn(θ) Wnmn(θ). since it uses only one moment. where mn(θ) = 1 n n t=1 mt (θ) is a g-vector. The GMM estimator is overidentiﬁed. the following deﬁnition of the GMM estimator is sufﬁciently general: Deﬁnition 39. with Eθ m(θ) = 0. The GMM estimator of the K -dimensional parameˆ ter vector θ0. which leads to an estimator which is efﬁcient relative to the just identiﬁed MM estimators (more on efﬁciency later). What’s the reason for using GMM if MLE is asymptotically efﬁcient? . 14. g ≥ K.

• Robustness: GMM is based upon a limited set of moment conditions. For consistency, only these moment conditions need to be correctly specified, whereas MLE in effect requires correct specification of every conceivable moment condition. GMM is robust with respect to distributional misspecification. The price for robustness is loss of efficiency with respect to the MLE estimator. Keep in mind that the true distribution is not known, so if we erroneously specify a distribution and estimate by MLE, the estimator will be inconsistent in general (not always).
• Feasibility: in some cases the MLE estimator is not available, because we are not able to deduce the likelihood function. More on this in the section on simulation-based estimation. The GMM estimator may still be feasible even though MLE is not available.

Example 40. The Octave script GMM/chi2gmm.m implements GMM using the same χ² data as was used in Example 36. The two moment conditions, based on the sample mean and the sample variance, are combined. The weight matrix is an identity matrix, I2. In Octave, type "help gmm_estimate" to get more information on how the GMM estimation routine works.

14.3 Consistency

We simply assume that the assumptions of Theorem 28 hold, so the GMM estimator is strongly consistent. The only assumption that warrants additional comments is that of identification. In Theorem 28, the third assumption reads: (c) Identification: s∞(·) has a unique global maximum at θ0, i.e., s∞(θ0) > s∞(θ), ∀θ ≠ θ0. Taking the case of a quadratic objective function s_n(θ) = m_n(θ)' W_n m_n(θ), first consider m_n(θ).

• Applying a uniform law of large numbers, we get m_n(θ) → m∞(θ), a.s.
• Since Eθ0 m_n(θ0) = 0 by assumption, m∞(θ0) = 0.
• Since s∞(θ0) = m∞(θ0)' W∞ m∞(θ0) = 0, in order for asymptotic identification, we need that m∞(θ) ≠ 0 for θ ≠ θ0, for at least some element of the vector. This and the assumption that W_n → W∞, a.s., a finite positive definite g × g matrix, guarantee that θ0 is asymptotically identified.
• Note that asymptotic identification does not rule out the possibility of lack of identification for a given data set — there may be multiple minimizing solutions in finite samples.

Example 41. Increase n in the Octave script GMM/chi2gmm.m to see evidence of the consistency of the GMM estimator.
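The estimator of Definition 39 can also be sketched directly for this example, without relying on the gmm_estimate routine: combine the mean and variance moment conditions with an identity weight matrix and minimize the quadratic form. The sketch below uses a crude grid search just to keep it self-contained; θ0, n and the grid bounds are arbitrary choices, and this is not the chi2gmm.m script.

% Sketch: GMM for chi^2(theta0) data, mean and variance moments, W = I_2.
theta0 = 3; n = 1000;
y = sum(randn(n, theta0).^2, 2);                  % simulated chi^2(theta0) sample
m = @(theta) [theta - mean(y); 2*theta - var(y)]; % m_n(theta), g = 2, K = 1
s = @(theta) m(theta)' * eye(2) * m(theta);       % s_n(theta) with identity weight
grid = linspace(0.1, 10, 2000);
[smin, i] = min(arrayfun(s, grid));
printf("GMM estimate: %f\n", grid(i));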

14.4 Asymptotic normality

We also simply assume that the conditions of Theorem 30 hold, so we will have asymptotic normality. However, we do need to find the structure of the asymptotic variance-covariance matrix of the estimator. From Theorem 30, we have

√n (θ̂ − θ0) →(d) N(0, J∞(θ0)^(−1) I∞(θ0) J∞(θ0)^(−1))

where J∞(θ0) is the almost sure limit of (∂²/∂θ∂θ') s_n(θ) when evaluated at θ0, and

I∞(θ0) = lim_{n→∞} Var[ √n (∂/∂θ) s_n(θ0) ].

We need to determine the form of these matrices given the objective function s_n(θ) = m_n(θ)' W_n m_n(θ). Now, using the product rule from the introduction,

(∂/∂θ) s_n(θ) = 2 [ (∂/∂θ) m_n'(θ) ] W_n m_n(θ)

(this is analogous to (∂/∂β) β'X'Xβ = 2X'Xβ, which appears when computing the first order conditions for the OLS estimator). Define the K × g matrix

D_n(θ) ≡ (∂/∂θ) m_n'(θ),

so:

(∂/∂θ) s(θ) = 2 D(θ) W m(θ).    (14.2)

(Note that s_n(θ), D_n(θ), W_n and m_n(θ) all depend on the sample size n, but it is omitted to unclutter the notation.) To take second derivatives, let D_i be the i-th row of D(θ). Using the product rule,

(∂²/∂θ'∂θ_i) s(θ) = (∂/∂θ') [ 2 D_i(θ) W m(θ) ] = 2 D_i W D' + 2 m' W (∂/∂θ') D_i'.

When evaluating the term 2 m(θ)' W (∂/∂θ') D(θ)_i' at θ0, assume that (∂/∂θ') D(θ)_i satisfies a LLN, so that it converges almost surely to a finite limit. In this case, we have

2 m(θ0)' W (∂/∂θ') D(θ0)_i' → 0, a.s.,

since m(θ0) = o_p(1) and W → W∞, a.s.

Stacking these results over the K rows of D, we get

lim (∂²/∂θ∂θ') s_n(θ0) = J∞(θ0) = 2 D∞ W∞ D∞', a.s.,

where we define lim D = D∞, a.s., and lim W = W∞, a.s. (we assume a LLN holds). With regard to I∞(θ0), following equation 14.2, and noting that the scores have mean zero at θ0 (since E m(θ0) = 0 by assumption), we have

I∞(θ0) = lim_{n→∞} Var[ √n (∂/∂θ) s_n(θ0) ]
       = lim_{n→∞} E[ 4 n D W m(θ0) m(θ0)' W D' ]
       = lim_{n→∞} E[ 4 D W √n m(θ0) √n m(θ0)' W D' ]

Now, given that m(θ0) is an average of centered (mean-zero) quantities, it is reasonable to expect a CLT to apply, after multiplication by √n. Assuming this,

√n m(θ0) →(d) N(0, Ω∞),

where Ω∞ = lim_{n→∞} E[ n m(θ0) m(θ0)' ]. Using this, and the last equation, we get

I∞(θ0) = 4 D∞ W∞ Ω∞ W∞ D∞'.

Using these results, the asymptotic normality theorem (30) gives us

√n (θ̂ − θ0) →(d) N[ 0, (D∞ W∞ D∞')^(−1) D∞ W∞ Ω∞ W∞ D∞' (D∞ W∞ D∞')^(−1) ],

the asymptotic distribution of the GMM estimator for arbitrary weighting matrix W_n. Note that for J∞ to be positive definite, D∞ must have full row rank, ρ(D∞) = k. This is related to identification. If the rows of m_n(θ) were not linearly independent of one another, then neither D_n nor D∞ would have full row rank. Identification plus two times differentiability of the objective function lead to J∞ being positive definite.

Example 42. The Octave script GMM/AsymptoticNormalityGMM.m does a Monte Carlo of the GMM estimator for the χ² data. Histograms for 1000 replications of √n (θ̂ − θ0) are given in Figure 14.1. On the left are results for n = 30, on the right are results for n = 1000. Note that the two distributions are fairly similar. In both cases the distribution is approximately normal. The distribution for the small sample size is somewhat asymmetric. This has mostly disappeared for the larger sample size.

Figure 14.1: Asymptotic Normality of GMM estimator, χ² example. (a) n = 30; (b) n = 1000.

14.5 Choosing the weighting matrix

W is a weighting matrix, which determines the relative importance of violations of the individual moment conditions. For example, if we are much more sure of the first moment condition, which is based upon the variance, than of the second, which is based upon the fourth moment, we could set

W = [ a  0 ; 0  b ]

with a much larger than b. In this case, errors in the second moment condition have less weight in the objective function.

• Since moments are not independent, in general, we should expect that there be a correlation between the moment conditions, so it may not be desirable to set the off-diagonal elements to 0. W may be a random, data dependent matrix.

• We have already seen that the choice of W will influence the asymptotic distribution of the GMM estimator. Since the GMM estimator is already inefficient w.r.t. MLE, we might like to choose the W matrix to make the GMM estimator efficient within the class of GMM estimators defined by m_n(θ).
• To provide a little intuition, consider the linear model y = x'β + ε, where ε ∼ N(0, Ω). That is, we have heteroscedasticity and autocorrelation.
• Let P be the Cholesky factorization of Ω^(−1), e.g., P'P = Ω^(−1).
• Then the model Py = PXβ + Pε satisfies the classical assumptions of homoscedasticity and nonautocorrelation, since

V(Pε) = P V(ε) P' = P Ω P' = P (P'P)^(−1) P' = P P^(−1) (P')^(−1) P' = I_n.

(Note: we use (AB)^(−1) = B^(−1)A^(−1) for A, B both nonsingular.) This means that the transformed model is efficient.

• The OLS estimator of the model Py = PXβ + Pε minimizes the objective function (y − Xβ)' Ω^(−1) (y − Xβ). Interpreting (y − Xβ) = ε(β) as moment conditions (note that they do have zero expectation when evaluated at β0), the optimal weighting matrix is seen to be the inverse of the covariance matrix of the moment conditions. This result carries over to GMM estimation. (Note: this presentation of GLS is not a GMM estimator, because the number of moment conditions here is equal to the sample size, n. Later we'll see that GLS can be put into the GMM framework defined above.)

Theorem 43. If θ̂ is a GMM estimator that minimizes m_n(θ)' W_n m_n(θ), the asymptotic variance of θ̂ will be minimized by choosing W_n so that W_n → W∞ = Ω∞^(−1), a.s., where Ω∞ = lim_{n→∞} E[ n m(θ0) m(θ0)' ].

Proof: For W∞ = Ω∞^(−1), the asymptotic variance

(D∞ W∞ D∞')^(−1) D∞ W∞ Ω∞ W∞ D∞' (D∞ W∞ D∞')^(−1)

simplifies to (D∞ Ω∞^(−1) D∞')^(−1). Now, let A be the difference between the general form and the simplified form:

A = (D∞ W∞ D∞')^(−1) D∞ W∞ Ω∞ W∞ D∞' (D∞ W∞ D∞')^(−1) − (D∞ Ω∞^(−1) D∞')^(−1).

Set B = (D∞ W∞ D∞')^(−1) D∞ W∞ − (D∞ Ω∞^(−1) D∞')^(−1) D∞ Ω∞^(−1). You can show that A = B Ω∞ B'. This is a quadratic form in a p.d. matrix, so it is p.s.d., which concludes the proof.

The result

√n (θ̂ − θ0) →(d) N[ 0, (D∞ Ω∞^(−1) D∞')^(−1) ]    (14.3)

allows us to treat

θ̂ ≈ N( θ0, (D∞ Ω∞^(−1) D∞')^(−1) / n ),

where the ≈ means "approximately distributed as." To operationalize this we need estimators of D∞ and Ω∞.

• The obvious estimator of D∞ is simply (∂/∂θ) m_n'(θ̂), which is consistent by the consistency of θ̂, assuming that (∂/∂θ) m_n' is continuous in θ. Stochastic equicontinuity results can give us this result even if (∂/∂θ) m_n' is not continuous.

Example 44. To see the effect of using an efﬁcient weight matrix, consider the Octave script GMM/EfﬁcientGMM.m. This modiﬁes the previous Monte Carlo for the χ2 data. This new Monte Carlo computes the GMM estimator in two ways: 1) based on an identity weight matrix 2) using an estimated optimal weight matrix. The estimated efﬁcient weight matrix is computed as the inverse of the estimated covariance of the moment conditions, using the inefﬁcient estimator of the ﬁrst step. See the next section for more on how to do this.

Figure 14.2: Inefficient and Efficient GMM estimators, χ² data. (a) inefficient; (b) efficient.

Figure 14.2 shows the results, plotting histograms for 1000 replica√ ˆ tions of n θ − θ0 . Note that the use of the estimated efﬁcient weight matrix leads to much better results in this case. This is a simple case where it is possible to get a good estimate of the efﬁcient weight matrix. This is not always so. See the next section.

14.6 Estimation of the variance-covariance matrix

(See Hamilton Ch. 10, pp. 261-2 and 280-84)*. In the case that we wish to use the optimal weighting matrix, we need an estimate of Ω∞, the limiting variance-covariance matrix of √n m_n(θ0). While one could estimate Ω∞ parametrically, we in general have little information upon which to base a parametric specification. In general, we expect that:

• m_t will be autocorrelated (Γ_{ts} = E(m_t m_{t−s}') ≠ 0). Note that this autocovariance will not depend on t if the moment conditions are covariance stationary.
• m_t will be contemporaneously correlated, since the individual moment conditions will not in general be independent of one another (E(m_{it} m_{jt}) ≠ 0).
• m_t will have different variances (E(m²_{it}) = σ²_{it}).

Since we need to estimate so many components if we are to take the parametric approach, it is unlikely that we would arrive at a correct parametric specification. For this reason, research has focused on consistent nonparametric estimators of Ω∞.

Henceforth we assume that m_t is covariance stationary (the covariance between m_t and m_{t−s} does not depend on t). Define the v-th autocovariance of the moment conditions Γ_v = E(m_t m_{t−v}'). Note that E(m_t m_{t+v}') = Γ_v'. Recall that m_t and m are functions of θ, so for now assume that we have some consistent estimator of θ0, so that m̂_t = m_t(θ̂).

Now

Ω_n = E[ n m(θ0) m(θ0)' ]
    = E[ n (1/n) Σ_{t=1}^n m_t (1/n) Σ_{t=1}^n m_t' ]
    = E[ (1/n) ( Σ_{t=1}^n m_t ) ( Σ_{t=1}^n m_t' ) ]
    = Γ_0 + ((n−1)/n)(Γ_1 + Γ_1') + ((n−2)/n)(Γ_2 + Γ_2') + · · · + (1/n)(Γ_{n−1} + Γ_{n−1}')

A natural, consistent estimator of Γ_v is

Γ̂_v = (1/n) Σ_{t=v+1}^n m̂_t m̂_{t−v}'

(you might use n − v in the denominator instead). So, a natural, but inconsistent, estimator of Ω∞ would be

Ω̂ = Γ̂_0 + ((n−1)/n)(Γ̂_1 + Γ̂_1') + ((n−2)/n)(Γ̂_2 + Γ̂_2') + · · · + (1/n)(Γ̂_{n−1} + Γ̂_{n−1}')
  = Γ̂_0 + Σ_{v=1}^{n−1} ((n−v)/n)(Γ̂_v + Γ̂_v').

This estimator is inconsistent in general, since the number of parameters to estimate is more than the number of observations, and increases more rapidly than n, so information does not build up as n → ∞.

On the other hand, supposing that Γ_v tends to zero sufficiently rapidly as v tends to ∞, a modified estimator

Ω̂ = Γ̂_0 + Σ_{v=1}^{q(n)} (Γ̂_v + Γ̂_v'),

where q(n) →(p) ∞ as n → ∞, will be consistent, provided q(n) grows sufficiently slowly. The term (n−v)/n can be dropped because q(n) must be o_p(n). This allows information to accumulate at a rate that satisfies a LLN. A disadvantage of this estimator is that it may not be positive definite. This could cause one to calculate a negative χ² statistic, for example!

• Note: the formula for Ω̂ requires an estimate of m(θ0), which in turn requires an estimate of θ, which is based upon an estimate of Ω! The solution to this circularity is to set the weighting matrix W arbitrarily (for example to an identity matrix), obtain a first consistent but inefficient estimate of θ0, then use this estimate to form Ω̂, then re-estimate θ0. The process can be iterated until neither Ω̂ nor θ̂ change appreciably between iterations.

Newey-West covariance estimator

The Newey-West estimator (Econometrica, 1987) solves the problem of possible nonpositive definiteness of the above estimator. Their estimator is

Ω̂ = Γ̂_0 + Σ_{v=1}^{q(n)} [ 1 − v/(q+1) ] (Γ̂_v + Γ̂_v').

This estimator is p.d. by construction. The condition for consistency is that n−1/4q → 0. Note that this is a very slow rate of growth for q. This estimator is nonparametric - we’ve placed no parametric restrictions on the form of Ω. It is an example of a kernel estimator. In a more recent paper, Newey and West (Review of Economic Studies, 1994) use pre-whitening before applying the kernel estimator. The idea is to ﬁt a VAR model to the moment conditions. It is expected that the residuals of the VAR model will be more nearly white noise, so that the Newey-West covariance estimator might perform better

with short lag lengths. The VAR model is

m̂_t = Θ_1 m̂_{t−1} + · · · + Θ_p m̂_{t−p} + u_t.

This is estimated, giving the residuals û_t. Then the Newey-West covariance estimator is applied to these pre-whitened residuals, and the covariance Ω is estimated combining the fitted VAR

m̂_t = Θ̂_1 m̂_{t−1} + · · · + Θ̂_p m̂_{t−p}

with the kernel estimate of the covariance of the u_t. See Newey-West for details.

• I have a program that does this if you're interested.
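A bare-bones version of the kernel estimator (without the pre-whitening step) is easy to code. The following Octave sketch is not the program mentioned above; it assumes m is an n × g matrix whose t-th row is m̂_t', and q is the chosen lag truncation:

% Sketch of the Newey-West covariance estimator, Bartlett kernel, no pre-whitening.
function Omega = newey_west(m, q)
  n = rows(m);
  Omega = (m' * m) / n;                          % Gamma_0
  for v = 1:q
    Gv = (m(v+1:n, :)' * m(1:n-v, :)) / n;       % Gamma_v
    Omega = Omega + (1 - v/(q+1)) * (Gv + Gv');  % add weighted Gamma_v + Gamma_v'
  endfor
endfunction

The Bartlett weights 1 − v/(q+1) are what guarantee positive definiteness of the result.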

14.7 Estimation using conditional moments

So far, the moment conditions have been presented as unconditional expectations. One common way of defining unconditional moment conditions is based upon conditional moment conditions. Suppose that a random variable Y has zero expectation conditional on the random variable X:

E_{Y|X} Y = ∫ Y f(Y|X) dY = 0

Then the unconditional expectation of the product of Y and a function g(X) of X is also zero. The unconditional expectation is

E[Y g(X)] = ∫_X ( ∫_Y Y g(X) f(Y, X) dY ) dX.

This can be factored into a conditional expectation and an expectation w.r.t. the marginal density of X:

E[Y g(X)] = ∫_X ( ∫_Y Y g(X) f(Y|X) dY ) f(X) dX.

Since g(X) doesn't depend on Y it can be pulled out of the integral:

E[Y g(X)] = ∫_X ( ∫_Y Y f(Y|X) dY ) g(X) f(X) dX.

But the term in parentheses on the rhs is zero by assumption, so EY g(X) = 0 as claimed. This is important econometrically, since models often imply restrictions on conditional moments. Suppose a model tells us that the function K(yt, xt) has expectation, conditional on the information set

I_t, equal to k(x_t, θ):

Eθ [ K(y_t, x_t) | I_t ] = k(x_t, θ).

• For example, in the context of the classical linear model y_t = x_t'β + ε_t, we can set K(y_t, x_t) = y_t so that k(x_t, θ) = x_t'β.

With this, the error function

ε_t(θ) = K(y_t, x_t) − k(x_t, θ)

has conditional expectation equal to zero:

Eθ [ ε_t(θ) | I_t ] = 0.

This is a scalar moment condition, which isn't sufficient to identify a K-dimensional parameter θ (K > 1). However, the above result

allows us to form various unconditional expectations

m_t(θ) = Z(w_t) ε_t(θ)

where Z(w_t) is a g × 1-vector valued function of w_t, and w_t is a set of variables drawn from the information set I_t. The Z(w_t) are instrumental variables. We now have g moment conditions, so as long as g ≥ K the necessary condition for identification holds. One can form the n × g matrix

Z_n = [ Z_1(w_1)  Z_2(w_1)  · · ·  Z_g(w_1)
        Z_1(w_2)  Z_2(w_2)  · · ·  Z_g(w_2)
           ...
        Z_1(w_n)  Z_2(w_n)  · · ·  Z_g(w_n) ]

With this we can form the g moment conditions. Define the vector of error functions

h_n(θ) = [ ε_1(θ) ; ε_2(θ) ; ... ; ε_n(θ) ].

With this, we can write

m_n(θ) = (1/n) Z_n' h_n(θ) = (1/n) Σ_{t=1}^n Z_t' ε_t(θ) = (1/n) Σ_{t=1}^n m_t(θ)

where Z_t is the t-th row of Z_n. This fits the previous treatment.

14.8 Estimation using dynamic moment conditions

Note that dynamic moment conditions simplify the var-cov matrix, but are often harder to formulate. The details will be added in future editions. For now, the Hansen application below is enough.

14.9 A specification test

The first order conditions for minimization, using an estimate of the optimal weighting matrix, are

(∂/∂θ) s(θ̂) = 2 [ (∂/∂θ) m_n'(θ̂) ] Ω̂^(−1) m_n(θ̂) ≡ 0

or

D(θ̂) Ω̂^(−1) m_n(θ̂) ≡ 0

Consider a Taylor expansion of m(θ̂):

m(θ̂) = m_n(θ0) + D_n'(θ*) (θ̂ − θ0)    (14.4)

where θ* is between θ̂ and θ0. Multiplying by D(θ̂) Ω̂^(−1) we obtain

D(θ̂) Ω̂^(−1) m(θ̂) = D(θ̂) Ω̂^(−1) m_n(θ0) + D(θ̂) Ω̂^(−1) D(θ*)' (θ̂ − θ0).

The lhs is zero, so

D(θ̂) Ω̂^(−1) m_n(θ0) = − [ D(θ̂) Ω̂^(−1) D(θ*)' ] (θ̂ − θ0)

or

(θ̂ − θ0) = − [ D(θ̂) Ω̂^(−1) D(θ*)' ]^(−1) D(θ̂) Ω̂^(−1) m_n(θ0).

With this, and taking into account the original expansion (equation 14.4), we get

√n m(θ̂) = √n m_n(θ0) − √n D_n'(θ*) [ D(θ̂) Ω̂^(−1) D(θ*)' ]^(−1) D(θ̂) Ω̂^(−1) m_n(θ0).

With some factoring, this last can be written as

√n m(θ̂) = { Ω̂^(1/2) − D_n'(θ*) [ D(θ̂) Ω̂^(−1) D(θ*)' ]^(−1) D(θ̂) Ω̂^(−1/2) } √n Ω̂^(−1/2) m_n(θ0),

and then multiplying by Ω̂^(−1/2) gives

√n Ω̂^(−1/2) m(θ̂) = { I_g − Ω̂^(−1/2) D_n'(θ*) [ D(θ̂) Ω̂^(−1) D(θ*)' ]^(−1) D(θ̂) Ω̂^(−1/2) } √n Ω̂^(−1/2) m_n(θ0).

Now

√n Ω̂^(−1/2) m_n(θ0) →(d) N(0, I_g)

and the big matrix { I_g − Ω̂^(−1/2) D_n'(θ*) [ D(θ̂) Ω̂^(−1) D(θ*)' ]^(−1) D_n(θ*) Ω̂^(−1/2) } converges in probability to

P = I_g − Ω∞^(−1/2) D∞' (D∞ Ω∞^(−1) D∞')^(−1) D∞ Ω∞^(−1/2).

However, one can easily verify that P is idempotent and has rank g − K (recall that the rank of an idempotent matrix is equal to its trace). We know that N(0, I_g)' P N(0, I_g) ∼ χ²(g − K), so a quadratic form on the r.h.s. has an asymptotic chi-square distribution. The quadratic form on the l.h.s. must have the same distribution, so we finally get

[ √n Ω̂^(−1/2) m(θ̂) ]' [ √n Ω̂^(−1/2) m(θ̂) ] = n m(θ̂)' Ω̂^(−1) m(θ̂) →(d) χ²(g − K)

or

n · s_n(θ̂) →(d) χ²(g − K)

supposing the model is correctly specified. This is a convenient test since we just multiply the optimized value of the objective function by n, and compare with a χ²(g − K) critical value. The test is a general test of whether or not the moments used to estimate are correctly specified.

• This won't work when the estimator is just identified. The f.o.c. are D_θ s_n(θ̂) = D̂ Ω̂^(−1) m(θ̂) ≡ 0. But with exact identification, both D̂ and Ω̂ are square and invertible (at least asymptotically, assuming that asymptotic normality holds), so m(θ̂) ≡ 0. So the moment conditions are zero regardless of the weighting matrix used. As such, we might as well use an identity matrix and save trouble. Also s_n(θ̂) = 0, so the test breaks down.
• A note: this sort of test often over-rejects in finite samples. One should be cautious in rejecting a model when this test rejects.
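Given an estimate computed with the efficient weight matrix, the test takes one line. Here is a sketch in Octave; mhat (the g-vector m_n(θ̂)), Omegahat (the estimate of Ω∞), n, g and K are all assumed to already be in memory, and the names are illustrative only:

% Sketch: Hansen's overidentification test, n * s_n(thetahat) ~ chi^2(g - K).
J  = n * mhat' * inv(Omegahat) * mhat;
df = g - K;
printf("J statistic: %f, degrees of freedom: %d\n", J, df);
% Reject correct specification if J exceeds the chi^2(df) critical value.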

14.10 Example: Generalized instrumental variables estimator

The IV estimator may appear a bit unusual at first, but it will grow on you over time. We have in fact already seen the IV estimator above, in the discussion of conditional moments. Let's look at the special case of a linear model with iid errors, but with correlation between regressors and errors:

y_t = x_t'θ + ε_t
E(x_t ε_t) ≠ 0

• Let's assume, just to keep things simple, that the errors are iid.
• The model in matrix form is y = Xθ + ε.

Let K = dim(x_t). Consider some vector z_t of dimension G × 1, where G ≥ K. Assume that E(z_t ε_t) = 0. The variables z_t are instrumental variables. Consider the moment conditions

m_t(θ) = z_t ε_t = z_t (y_t − x_t'θ).

We can arrange the instruments in the n × G matrix

Z = [ z_1' ; z_2' ; ... ; z_n' ].

The average moment conditions are

m_n(θ) = (1/n) Z'ε(θ) = (1/n) (Z'y − Z'Xθ).

The generalized instrumental variables estimator is just the GMM estimator based upon these moment conditions. When G = K, we have exact identification, and it is referred to as the instrumental variables estimator. The first order conditions for GMM are D_n W_n m_n(θ̂) = 0, which imply that

D_n W_n Z'X θ̂_IV = D_n W_n Z'y.

Exercise 45. Verify that D_n = −X'Z/n. Remember that (assuming differentiability) identification of the GMM estimator requires that this matrix must converge to a matrix with full row rank. Can just any variable that is uncorrelated with the error be used as an instrument, or is there some other condition?

Exercise 46. Verify that the efficient weight matrix is W_n = (Z'Z/n)^(−1) (up to a constant).

If we accept what is stated in these two exercises, then

(X'Z/n)(Z'Z/n)^(−1)(Z'X/n) θ̂_IV = (X'Z/n)(Z'Z/n)^(−1)(Z'y/n).

Noting that the powers of n cancel, we get

X'Z(Z'Z)^(−1)Z'X θ̂_IV = X'Z(Z'Z)^(−1)Z'y

or

θ̂_IV = [ X'Z(Z'Z)^(−1)Z'X ]^(−1) X'Z(Z'Z)^(−1)Z'y    (14.5)

Another way of arriving at the same point is to define the projection matrix P_Z:

P_Z = Z(Z'Z)^(−1)Z'

Anything that is projected onto the space spanned by Z will be uncorrelated with ε, by the definition of Z. Transforming the model with this projection matrix, we get

P_Z y = P_Z Xθ + P_Z ε

or

y* = X*θ + ε*

Now we have that ε* and X* are uncorrelated, since this is simply

E(X*'ε*) = E(X'P_Z'P_Z ε) = E(X'P_Z ε)

and

P_Z X = Z(Z'Z)^(−1)Z'X

is the fitted value from a regression of X on Z. This is a linear combination of the columns of Z, so it must be uncorrelated with ε. This implies that applying OLS to the model y* = X*θ + ε* will lead to a consistent estimator, given a few more assumptions.

Exercise 47. Verify algebraically that applying OLS to the above model gives the IV estimator of equation 14.5.

With the definition of P_Z, we can write

θ̂_IV = (X'P_Z X)^(−1) X'P_Z y

from which we obtain

θ̂_IV = (X'P_Z X)^(−1) X'P_Z (Xθ0 + ε) = θ0 + (X'P_Z X)^(−1) X'P_Z ε
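For reference, equation 14.5 translates directly into Octave. This sketch is not one of the course scripts; it assumes y (n × 1), X (n × K) and Z (n × G, G ≥ K) are already in memory:

% Sketch: the generalized IV estimator of equation 14.5 and its classical covariance.
PZ = Z * inv(Z'*Z) * Z';                         % projection onto the span of Z
theta_iv = (X' * PZ * X) \ (X' * PZ * y);
e = y - X * theta_iv;
sig2_iv = (e' * e) / rows(y);
V_iv = inv(X'*Z * inv(Z'*Z) * Z'*X) * sig2_iv;   % estimated variance of theta_iv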

so

θ̂_IV − θ0 = (X'P_Z X)^(−1) X'P_Z ε = [ X'Z(Z'Z)^(−1)Z'X ]^(−1) X'Z(Z'Z)^(−1)Z'ε.

Now we can introduce factors of n to get

θ̂_IV − θ0 = [ (X'Z/n)(Z'Z/n)^(−1)(Z'X/n) ]^(−1) (X'Z/n)(Z'Z/n)^(−1)(Z'ε/n).

Assuming that each of the terms with an n in the denominator satisfies a LLN, so that

• Z'Z/n →(p) Q_ZZ, a finite pd matrix
• X'Z/n →(p) Q_XZ, a finite matrix with rank K (= cols(X)). That is to say, the instruments must be correlated with the regressors.
• Z'ε/n →(p) 0

then the plim of the rhs is zero. This last term has plim 0 since we assume that Z and ε are uncorrelated, e.g., E(z_t ε_t) = 0. Given these assumptions, the IV estimator is consistent:

θ̂_IV →(p) θ0.

Furthermore, scaling by √n, we have

√n (θ̂_IV − θ0) = [ (X'Z/n)(Z'Z/n)^(−1)(Z'X/n) ]^(−1) (X'Z/n)(Z'Z/n)^(−1) (Z'ε/√n).

Assuming that the far right term satisfies a CLT, so that

• Z'ε/√n →(d) N(0, Q_ZZ σ²)

then we get

√n (θ̂_IV − θ0) →(d) N( 0, (Q_XZ Q_ZZ^(−1) Q_XZ')^(−1) σ² ).

The estimators for Q_XZ and Q_ZZ are the obvious ones. An estimator for σ² is

σ̂²_IV = (1/n) (y − Xθ̂_IV)'(y − Xθ̂_IV).

This estimator is consistent following the proof of consistency of the OLS estimator of σ², when the classical assumptions hold. The formula used to estimate the variance of θ̂_IV is

V̂(θ̂_IV) = [ X'Z (Z'Z)^(−1) Z'X ]^(−1) σ̂²_IV.

The GIV estimator is
1. Consistent
2. Asymptotically normally distributed
3. Biased in general, since even though E(X'P_Z ε) = 0, E[(X'P_Z X)^(−1) X'P_Z ε] may not be zero, because (X'P_Z X)^(−1) and X'P_Z ε are not independent.

An important point is that the asymptotic distribution of θ̂_IV depends upon Q_XZ and Q_ZZ, and these depend upon the choice of Z. The choice of instruments influences the efficiency of the estimator. This point was made above, when optimal instruments were discussed.

• When we have two sets of instruments, Z1 and Z2 such that Z1 ⊂ Z2, then the IV estimator using Z2 is at least as asymptotically efficient as the estimator that used Z1. More instruments leads to more asymptotically efficient estimation, in general.

and instead we observe yt. Recall Example 19 which deals with a dynamic model with measurement error. The model is ∗ ∗ yt = α + ρyt−1 + βxt + ∗ yt = yt + υt t where t and υt are independent Gaussian white noise errors. Suppose ∗ that yt is not observed. The reason for this is that PZ X becomes closer and closer to X itself as the number of instruments increases.• The penalty for indiscriminant use of instruments is that the small sample bias of the IV estimator rises as the number of instruments increases. Exercise 48. If we estimate the . How would one adapt the GIV estimator presented here to deal with the case of HET and AUT? Example 49.

equation yt = α + ρyt−1 + βxt + νt by OLS. As we have 4 instruments and 3 parameters. What about using the GIV estimator? Consider using as instruments Z = [1 xt xt−1 xt−2]. we have seen in Example 19 that the estimator is biased an inconsistent. Using the GIV estimator. and by assumption xt and its lags are uncorrelated with t and υt (and thus they’re also uncorrelated with νt). The Octave script GMM/MeasurementErrorIV.m does a Monte Carlo study using 1000 replications. these are legitimate instruments. descriptive statistics for 1000 replications are octave:3> MeasurementErrorIV rm: cannot remove `meas_error. Thus.out': No such file or directory . The results are comparable with those in Example 19. The lags of xt are correlated with yt−1as long as ρ and β are different from zero. with a sample size of 100. this is an overidentiﬁed situation.

757 max 1.149 0.541 0. dev.868 -0.827 0.016 -0. you will see that the GIV estimator is consistent.876 If you compare these with the results for the OLS estimator.3. Figure 7. You can compare with the similar ﬁgure for the OLS estimator. ˆ A histogram for ρ − ρ is in Figure 14.000 -0. you will see that the bias of the GIV estimator is much less for estimation of ρ.001 octave:4> st. If you increase the sample size.4. .241 0. but that the OLS estimator is not.mean 0.250 -0. 0.177 min -1.

Figure 14.3: GIV estimation results for ρ̂ − ρ, dynamic model with measurement error.

2SLS

In the general discussion of GIV above, we haven't considered from where we get the instruments. Two stage least squares is an example of a particular GIV estimator, where the instruments are obtained in a particular way. Consider a single equation from a system of simultaneous equations. Refer back to equation 10.2 for context. The model is

y = Y1 γ1 + X1 β1 + ε = Zδ + ε

where Y1 are current period endogenous variables that are correlated with the error term, and X1 are exogenous and predetermined variables that are assumed not to be correlated with the error term. Let X be all of the weakly exogenous variables (please refer back for context). The problem, recall, is that the variables in Y1 are correlated with the error term.

• Define Ẑ = [Ŷ1 X1] as the vector of predictions of Z when regressed upon X:

Ẑ = X (X'X)^(−1) X'Z

Remember that X are all of the exogenous variables from all equations. The fitted values of a regression of X1 on X are just X1, because X contains X1. So, Ŷ1 are the reduced form predictions of Y1.
• Since Ẑ is a linear combination of the weakly exogenous variables X, it must be uncorrelated with ε. This suggests the K-dimensional moment condition m_t(δ) = ẑ_t (y_t − z_t'δ), and so

m(δ) = (1/n) Σ_t ẑ_t (y_t − z_t'δ).

• Since we have K parameters and K moment conditions, the GMM estimator will set m identically equal to zero, regardless of W, so we have

δ̂ = ( Σ_t ẑ_t z_t' )^(−1) Σ_t ẑ_t y_t = (Ẑ'Z)^(−1) Ẑ'y

This is the standard formula for 2SLS. We use the exogenous variables and the reduced form predictions of the endogenous variables as instruments, and apply IV estimation. See Hamilton pp. 420-21 for the varcov formula (which is the standard formula for 2SLS), and for how to deal with ε_t heterogeneous and dependent (basically, just use the Newey-West or some other consistent estimator of Ω, and apply the usual formula).

• Note that autocorrelation of ε_t causes lagged endogenous variables to lose their status as legitimate instruments. Some caution is warranted if this suspicion arises.
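A sketch of the computation in Octave (y is the left-hand-side variable, Z = [Y1 X1] the included right-hand-side variables, and X all weakly exogenous variables of the system; these are assumptions about what is already in memory, not one of the course scripts):

% Sketch: 2SLS as a GIV estimator.
Zhat = X * ((X'*X) \ (X'*Z));            % reduced form predictions of Z
delta_2sls = (Zhat' * Z) \ (Zhat' * y);  % the 2SLS estimate
e = y - Z * delta_2sls;                  % residuals from the ORIGINAL regressors
% Note: standard errors must be based on e, not on y - Zhat*delta_2sls.

The "two stage" name comes from the equivalent computation of regressing Z on X, then y on Ẑ; the sketch makes clear that it is just GIV with instruments Ẑ.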

0 yGt = fG(zt. θ1 ) + ε1t 0 y2t = f2(zt. We have a system of equations of the form 0 y1t = f1(zt. θ2 .11 Nonlinear simultaneous equations GMM provides a convenient way to estimate nonlinear systems of simultaneous equations. . and θ0 = (θ1 .14. 0 0 0 where f (·) is a G -vector valued function. We need to ﬁnd an Ai × 1 vector of instruments xit. θG) + εGt. Typical instruments would be low . that are uncorrelated with εit. θG ) . or in compact notation yt = f (zt. θ2 ) + ε2t . for each equation. θ0) + εt. · · · .

• A note on efﬁciency: the selected set of instruments has important effects on the efﬁciency of estimation. More on this later. θ1)) x1t (y2t − f2(zt. mt(θ) = . (yGt − fG(zt. . Then we can deﬁne the tions G i=1 Ai × 1 orthogonality condi (y1t − f1(zt. with their lagged values. θ2)) x2t . θG)) xGt • A note on identiﬁcation: selection of instruments that ensure identiﬁcation is a non-trivial problem. Unfortunately there is little theory offering guidance on what is the optimal set. .order monomials in the exogenous variables in zt.

The likelihood function is the . Let yt be a G -vector of variables. yt) . y2. since we assume the conditioning variables have been selected to take advantage of all useful information). since the moment conditions could be highly correlated..12 Maximum likelihood In the introduction we argued that ML will in general be more efﬁcient than GMM since ML implicitly uses all of the moments of the distribution while GMM uses a limited number of moments.. Actually. Here we’ll see that the optimal moment conditions are simply the scores of the ML estimator. A GMM estimator that chose an optimal set of P moment conditions would be fully efﬁcient. Yt−1 has been observed (refer to it as the information set.14. and let Yt = (y1.. Then at time t. . some sets of P moment conditions may contain more information than others. However. a distribution with P parameters can be uniquely characterized by P moment conditions.

θ) and we can repeat this to get L(θ) = f (yn|Yn−1. θ) · . θ) . θ) ≡ Dθ ln f (yt|Yt−1. yn. y2. . · f (y1).. θ) which can be factored as L(θ) = f (yn|Yn−1... The log-likelihood function is therefore n ln L(θ) = t=1 ln f (yt|Yt−1. θ).. θ) · f (yn−1|Yn−2. Deﬁne mt(Yt. θ) · f (Yn−1.joint density of the sample: L(θ) = f (y1..

which are precisely the ﬁrst order conditions of MLE. The GMM estimator sets n n 1/n t=1 ˆ mt(Yt.as the score of the tth observation. Consistent estimates of variance components are as follows . that the scores have conditional mean zero when evaluated at θ0 (see notes to Introduction to Econometrics): E{mt(Yt. It can be shown that. Therefore. θ) = 1/n t=1 ˆ Dθ ln f (yt|Yt−1. θ0)|Yt−1} = 0 so one could interpret these as moment conditions to use to deﬁne a just-identiﬁed GMM estimator ( if there are K parameters there are K score equations). MLE can be interpreted as a GMM estimator. under the regularity conditions. θ) = 0. The GMM varcov formula is V∞ = D∞Ω−1D∞ −1 .

above). θ) t=1 It is important to note that mt and mt−s. Unconditional uncorrelation follows from the fact that conditional uncorrelation hold regardless of the realization of Yt−1. so marginalizing with respect to Yt−1 preserves uncorrelation (see the section on ML estimation. s > 0 are both conditionally and unconditionally uncorrelated. θ) = 1/n D∞ = ∂θ • Ω n 2 ˆ Dθ ln f (yt|Yt−1. which is in the information set at time t. Conditional uncorrelation follows from the fact that mt−s is a function of Yt−s.• D∞ ∂ ˆ m(Yt. The fact that the scores are serially uncorrelated implies that Ω can be estimated by the estimator of the 0th autocovari- .

This result implies the well known (and already seeen) result that we can estimate V∞ in any of three ways: • The sandwich version: n 2 ˆ t=1 Dθ ln f (yt |Yt−1 .4) states that E Dθ ln f (yt|Yt−1.ance of the moment conditions: n n Ω = 1/n t=1 ˆ ˆ mt(Yt. θ) −1 × −1 . θ) n 2 ˆ t=1 Dθ ln f (yt |Yt−1 . θ) = 1/n t=1 ˆ Dθ ln f (yt|Yt−1. θ0) Dθ ln f (yt|Yt−1. θ) Dθ ln f (yt|Yt−1 Recall from study of ML estimation that the information matrix equality (equation 13. θ0) . θ0) 2 = −E Dθ ln f (yt|Yt−1. θ)mt(Yt. θ) ˆ V∞ = n t=1 Dθ ln f (yt |Yt−1 . θ) × n ˆ Dθ ln f (yt|Yt−1.

Asymptotically. and the ﬁrst term converges to minus the inverse of the middle term. which is still inside the overall inverse) n −1 V∞ = 1/n t=1 ˆ Dθ ln f (yt|Yt−1. • or the inverse of the outer product of the gradient (since the middle and last cancel except for a minus sign. θ) . except for a minus sign): n −1 2 ˆ Dθ ln f (yt|Yt−1. In small samples they will differ. In par- . if the model is correctly speciﬁed. θ) t=1 V∞ = −1/n . all of these forms converge to the same limit. This simpliﬁcation is a special result for the MLE estimator .it doesn’t apply to GMM estimators in general.• or the inverse of the negative of the Hessian (since the middle and last term cancel. θ) ˆ Dθ ln f (yt|Yt−1.

If they differ by too much. The Octave script NerloveGMM. 14. 477). White’s Information matrix test (Econometrica.13 Example: OLS as a GMM estimator the Nerlove model again The simple Nerlove model can be estimated using GMM. there is evidence that the outer product of the gradient formula does not perform very well in small samples (see Davidson and MacKinnon. 1982) is based upon comparing the two ways to estimate the information matrix: outer product of gradient or negative of the Hessian. It also illustrates that the weight matrix does not matter when the moments just identify the parameter. pg.ticular. You are encouraged to examine the .m estimates the model by GMM and by OLS. this is evidence of misspeciﬁcation of the model.

14.14 Example: The MEPS data

The MEPS data on health care usage discussed in section 11.4 estimated a Poisson model by "maximum likelihood" (probably misspecified). Perhaps the same latent factors (e.g., chronic illness) that induce one to make doctor visits also influence the decision of whether or not to purchase insurance. If this is the case, the PRIV variable could well be endogenous, in which case the Poisson "ML" estimator would be inconsistent, even if the conditional mean were correctly specified. The Octave script meps.m estimates the parameters of the model presented in equation 11.1, using Poisson "ML" (better thought of as quasi-ML), and IV estimation¹. Both estimation methods are implemented using a GMM form.

¹The validity of the instruments used may be debatable, but real data sets often don't contain ideal instruments.

Running that script gives the output

OBDV

******************************************************
IV

GMM Estimation Results
BFGS convergence: Normal convergence

Objective function value: 0.004273
Observations: 4564
No moment covariance supplied, assuming efficient weight matrix

              Value      df      p-value

err 0.149 0.002 0.000 0.000 0.851 -5. ins. sex age edu inc -0.000 t-stat -2.053 0.254 0.441 -0.000 0.127 -1.535 4.213 0.072 0.395 0.502 estimate 3.072 -0.624 10.133 13.000 ****************************************************** ****************************************************** Poisson QML .500 p-value 0.038 0.000 0. priv.537 0.000 constant pub.000 st.011 0.031 0.431 6.429 0.000 0. ins.X^2 test 19.

000 0.469 3.000 0.294 0.000000 Observations: 4564 No moment covariance supplied.024 0. sex age edu -0.136 8.000 0.029 st.487 0.000 0. test estimate constant pub.055 0.060 p-value 0.071 0.000 0. err 0.002 0. priv.149 0. no spec.092 4. ins. assuming efficient weight matrix Exactly identified.791 0.GMM Estimation Results BFGS convergence: Normal convergence Objective function value: 0.076 0.848 0.002 .796 11.289 11. ins.010 t-stat -5.

******************************************************

Note how the Poisson QML results, estimated here using a GMM routine, are the same as were obtained using the ML estimation routine (see subsection 11.4). This is an example of how (Q)ML may be represented as a GMM estimator. Also note that the IV and QML results are considerably different. Treating PRIV as potentially endogenous causes the sign of its coefficient to change. Perhaps it is logical that people who own private insurance make fewer visits, if they have to make a co-payment. Note that income becomes positive and significant when PRIV is treated as endogenous. Perhaps the difference in the results depending upon whether or not PRIV is treated as endogenous can suggest a method for testing exogeneity. Onward to the Hausman test!

14.15 Example: The Hausman Test

This section discusses the Hausman test, which was originally presented in Hausman, J.A. (1978), Specification tests in econometrics, Econometrica, 46, 1251-71. Consider the simple linear regression model yt = xt′β + εt. We assume that the functional form and the choice of regressors is correct, but that some of the regressors may be correlated with the error term, which as you know will produce inconsistency of β̂. For example, this will be a problem if
• some regressors are endogenous
• some regressors are measured with error
• lagged values of the dependent variable are used as regressors and εt is autocorrelated.

To illustrate, the Octave program OLSvsIV.m performs a Monte Carlo experiment where errors are correlated with regressors, and estimation is by OLS and IV. The true value of the slope coefficient used to generate the data is β = 2. Figure 14.4 shows that the OLS estimator is quite biased, while Figure 14.5 shows that the IV estimator is on average much closer to the true value. If you play with the program, increasing the sample size, you can see evidence that the OLS estimator is asymptotically biased, while the IV estimator is consistent. We have seen that inconsistent and consistent estimators converge to different probability limits. This is the idea behind the Hausman test: a pair of consistent estimators converge to the same probability limit, while if one is consistent and the other is not, they converge to different limits. If we accept that one is consistent (e.g., the IV estimator), but we are doubting if the other is consistent (e.g., the OLS estimator), we might try to check if the difference between the estimators is significantly different from zero.

Figure 14.4: OLS estimates

Figure 14.5: IV estimates

• If we're doubting about the consistency of OLS (or QML, etc.), why should we be interested in testing - why not just use the IV estimator? Because the OLS estimator is more efficient when the regressors are exogenous and the other classical assumptions (including normality of the errors) hold. When we have a more efficient estimator that relies on stronger assumptions (such as exogeneity) than the IV estimator, we might prefer to use it, unless we have evidence that the assumptions are false.

So, let's consider the covariance between the MLE estimator θ̂ (or any other fully efficient estimator) and some other CAN estimator, say θ̃. Now, let's recall some results from MLE. Equation 13.2 is:

√n (θ̂ − θ0) →a.s. −J∞(θ0)⁻¹ √n g(θ0).

Equation 13.6 is J∞(θ) = −I∞(θ). Combining these two equations, we get

√n (θ̂ − θ0) →a.s. I∞(θ0)⁻¹ √n g(θ0).

Also, equation 13.9 tells us that the asymptotic covariance between any CAN estimator and the MLE score vector is

V∞ [ √n(θ̃ − θ) ; √n g(θ) ] = [ V∞(θ̃)  IK ; IK  I∞(θ) ].

Now, consider

[ √n(θ̃ − θ) ; √n(θ̂ − θ) ] →a.s. [ IK  0K ; 0K  I∞(θ)⁻¹ ] [ √n(θ̃ − θ) ; √n g(θ) ].

The asymptotic covariance of this is

V∞ [ √n(θ̃ − θ) ; √n(θ̂ − θ) ] = [ IK  0K ; 0K  I∞(θ)⁻¹ ] [ V∞(θ̃)  IK ; IK  I∞(θ) ] [ IK  0K ; 0K  I∞(θ)⁻¹ ] = [ V∞(θ̃)  I∞(θ)⁻¹ ; I∞(θ)⁻¹  I∞(θ)⁻¹ ],

which, for clarity in what follows, we might write as

V∞ [ √n(θ̃ − θ) ; √n(θ̂ − θ) ] = [ V∞(θ̃)  I∞(θ)⁻¹ ; I∞(θ)⁻¹  V∞(θ̂) ].

So, the asymptotic covariance between the MLE and any other CAN estimator is equal to the MLE asymptotic variance (the inverse of the information matrix). Now, suppose we wish to test whether the two estimators are in fact both converging to θ0, versus the alternative hypothesis that

the "MLE" estimator is not in fact consistent (the consistency of θ̃ is a maintained hypothesis). Under the null hypothesis that they are both converging to θ0, we have

[ IK  −IK ] [ √n(θ̃ − θ0) ; √n(θ̂ − θ0) ] = √n (θ̃ − θ̂),

which will be asymptotically normally distributed as

√n (θ̃ − θ̂) →d N( 0, V∞(θ̃) − V∞(θ̂) ).

So,

n (θ̃ − θ̂)′ [ V∞(θ̃) − V∞(θ̂) ]⁻¹ (θ̃ − θ̂) →d χ²(ρ),

where ρ is the rank of the difference of the asymptotic variances. A statistic that has the same asymptotic distribution is

(θ̃ − θ̂)′ [ V̂(θ̃) − V̂(θ̂) ]⁻¹ (θ̃ − θ̂) →d χ²(ρ).

This is the Hausman test statistic, in its original form.
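As an illustration (this is a simplified sketch, not the hausman.m script discussed below), the statistic can be computed in Octave by comparing OLS, efficient under exogeneity, with a just-identified IV estimator, using a generalized inverse in case the variance difference is not of full rank. The data generating process here is made up, with an exogenous regressor, so the null should rarely be rejected.

% Minimal sketch of the classic Hausman statistic, OLS vs IV.
n = 500;
z = randn(n,1);                     % instrument
x = z + 0.5*randn(n,1);             % regressor (exogenous in this dgp)
y = 1 + 2*x + randn(n,1);
X = [ones(n,1) x];  Z = [ones(n,1) z];

b_ols = X \ y;                                  % efficient under the null
V_ols = sum((y - X*b_ols).^2)/(n-2) * inv(X'*X);
b_iv  = (Z'*X) \ (Z'*y);                        % consistent under the alternative
V_iv  = sum((y - X*b_iv).^2)/(n-2) * inv(Z'*X)*(Z'*Z)*inv(X'*Z);

d   = b_iv - b_ols;
Vd  = V_iv - V_ols;                 % estimated variance of the contrast
H   = d' * pinv(Vd) * d;            % Hausman statistic
rho = rank(Vd);
pval = 1 - gammainc(H/2, rho/2);    % chi-square(rho) p-value via gammainc
printf("Hausman statistic %5.3f, rank %d, p-value %5.3f\n", H, rho, pval);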

The reason that this test has power under the alternative hypothesis is that in that case the "MLE" estimator will not be consistent, and will converge to θA, say, where θA ≠ θ0. Then the mean of the asymptotic distribution of the vector √n(θ̃ − θ̂) will be θ0 − θA, a non-zero vector, so the test statistic will eventually reject, regardless of how small a significance level is used. Some things to note:

• Note: if the test is based on a sub-vector of the entire parameter vector of the MLE, it is possible that the inconsistency of the MLE will not show up in the portion of the vector that has been used. If this is the case, the test may not have power to detect the inconsistency. This may occur, for example, when the consistent but inefficient estimator is not identified for all the parameters of the model.

• The rank, ρ, of the difference of the asymptotic variances is often less than the dimension of the matrices, and it may be difficult to determine what the true rank is. If the true rank is lower than what is taken to be true, the test will be biased against rejection of the null hypothesis. The contrary holds if we underestimate the rank.

• A solution to this problem is to use a rank 1 test, by comparing only a single coefficient. For example, if a variable is suspected of possibly being endogenous, that variable's coefficients may be compared.

• This simple formula only holds when the estimator that is being tested for consistency is fully efficient under the null hypothesis. This means that it must be a ML estimator or a fully efficient estimator that has the same asymptotic distribution as the ML estimator. This is quite restrictive since modern estimators such

Figure 14.6: Incorrect rank and the Hausman test

as GMM and QML are not in general fully efficient.

Following up on this last point, let's think of two not necessarily efficient estimators, θ̂1 and θ̂2, where one is assumed to be consistent, but the other may not be. We assume for expositional simplicity that both θ̂1 and θ̂2 belong to the same parameter space, and that they can be expressed as generalized method of moments (GMM) estimators. The estimators are defined (suppressing the dependence upon data) by

θ̂i = arg min_{θi∈Θ} mi(θi)′ Wi mi(θi),

where mi(θi) is a gi × 1 vector of moment conditions, and Wi is a gi × gi positive definite weighting matrix, i = 1, 2. Consider the omnibus GMM estimator

(θ̂1, θ̂2) = arg min_{Θ×Θ} [ m1(θ1)′  m2(θ2)′ ] [ W1  0(g1×g2) ; 0(g2×g1)  W2 ] [ m1(θ1) ; m2(θ2) ].   (14.6)

Suppose that the asymptotic covariance of the omnibus moment vector is

Σ = lim_{n→∞} Var { √n [ m1(θ1) ; m2(θ2) ] } ≡ [ Σ1  Σ12 ; ·  Σ2 ].   (14.7)

The standard Hausman test is equivalent to a Wald test of the equality of θ1 and θ2 (or subvectors of the two) applied to the omnibus GMM estimator, but with the covariance of the moment conditions

estimated as

Σ̂ = [ Σ̂1  0(g1×g2) ; 0(g2×g1)  Σ̂2 ].

While this is clearly an inconsistent estimator in general, the omitted Σ12 term cancels out of the test statistic when one of the estimators is asymptotically efficient, as we have seen above, and thus it need not be estimated. The general solution when neither of the estimators is efficient is clear: the entire Σ matrix must be estimated consistently, since the Σ12 term will not cancel out. Methods for consistently estimating the asymptotic covariance of a vector of moment conditions are well-known, e.g., the Newey-West estimator discussed previously. The Hausman test using a proper estimator of the overall covariance matrix will now have an asymptotic χ2 distribution when neither estimator is efficient. However, the test suffers from a loss of power due to the fact that

the omnibus GMM estimator of equation 14.6 is defined using an inefficient weight matrix. A new test can be defined by using an alternative omnibus GMM estimator

(θ̂1, θ̂2) = arg min_{Θ×Θ} [ m1(θ1)′  m2(θ2)′ ] Σ̃⁻¹ [ m1(θ1) ; m2(θ2) ],   (14.8)

where Σ̃ is a consistent estimator of the overall covariance matrix Σ of equation 14.7. By standard arguments, this is a more efficient estimator than that defined by equation 14.6, so the Wald test using this alternative is more powerful. See my article in Applied Economics, 2004, for more details, including simulation results. The Octave script hausman.m calculates the Wald test corresponding to the efficient joint GMM estimator (the "H2" test in my paper), for a simple linear model.

14.16 Application: Nonlinear rational expectations

Readings: Hansen and Singleton, 1982*; Tauchen, 1986.

Though GMM estimation has many applications, application to rational expectations models is elegant, since theory directly suggests the moment conditions. Hansen and Singleton's 1982 paper is also a classic worth studying in itself. Though I strongly recommend reading the paper, I'll use a simplified model with similar notation to Hamilton's. The literature on estimation of these models has grown a lot since these early papers. After work like the cited papers, people moved to ML estimation of linearized models, using Kalman filtering. Current methods are usually Bayesian, and involve sophisticated filtering methods to compute the likelihood function for nonlinear models with non-normal shocks. I have done some work using simulation-based estimation methods applied to such models. Perhaps these results will stimulate your interest. The methods explained in this section are intended to provide an example of GMM estimation. They are not the state of the art for estimation of such models. There is a lot of interesting stuff that is beyond the scope of this course.

We assume a representative consumer maximizes expected discounted utility over an infinite horizon. Utility is temporally additive, and the expected utility hypothesis holds. The future consumption stream is the stochastic sequence {ct}_{t=0}^∞. The objective function at time t is the discounted expected utility

∑_{s=0}^∞ β^s E( u(ct+s) | It ).   (14.9)

• The parameter β is between 0 and 1, and reflects discounting.
• It is the information set at time t, and includes all realizations of random variables indexed t and earlier.
• The choice variable is ct, current consumption, which is constrained to be less than or equal to current wealth wt.
• Suppose the consumer can invest in a risky asset. A dollar invested in the asset yields a gross return

(1 + rt+1) = (pt+1 + dt+1) / pt,

where pt is the price and dt is the dividend in period t.
• The price of ct is normalized to 1.
• Current wealth wt = (1 + rt) it−1, where it−1 is investment in period t − 1. So the problem is to allocate current wealth between current consumption and investment to finance future consumption: wt = ct + it.
• Future net rates of return rt+s, s > 0, are not known in period t: the asset is risky.

A partial set of necessary conditions for utility maximization have the form:

u′(ct) = β E{ (1 + rt+1) u′(ct+1) | It }.   (14.10)

To see that the condition is necessary, suppose that the lhs < rhs. Then reducing current consumption marginally would cause equation 14.9 to drop by u′(ct), since there is no discounting of the current period. At the same time,

the marginal reduction in consumption finances investment, which has gross return (1 + rt+1), which could finance consumption in period t + 1. This increase in consumption would cause the objective function to increase by β E{(1 + rt+1) u′(ct+1)|It}. Therefore, unless the condition holds, the expected discounted utility function is not maximized.

• To use this we need to choose the functional form of utility. A

constant relative risk aversion form is

u(ct) = (ct^{1−γ} − 1) / (1 − γ),

where γ is the coefficient of relative risk aversion. With this form,

u′(ct) = ct^{−γ},

so the foc are

ct^{−γ} = β E{ (1 + rt+1) ct+1^{−γ} | It }.

While it is true that

E{ ct^{−γ} − β (1 + rt+1) ct+1^{−γ} | It } = 0,

so that we could use this to define moment conditions, it is unlikely that ct is stationary, even though it is in real terms, and our theory

requires stationarity. To solve this, divide through by ct^{−γ}:

E{ 1 − β (1 + rt+1) (ct+1/ct)^{−γ} | It } = 0

(note that ct can be passed through the conditional expectation since ct is chosen based only upon information available in time t). Now

1 − β (1 + rt+1) (ct+1/ct)^{−γ}

is analogous to ht(θ) defined above: it's a scalar moment condition. To get a vector of moment conditions we need some instruments. Suppose that zt is a vector of variables drawn from the information set It. We can use the necessary conditions to form the expressions

[ 1 − β (1 + rt+1) (ct+1/ct)^{−γ} ] zt ≡ mt(θ)

• θ represents β and γ.
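To fix ideas, the following Octave fragment (a sketch with simulated series, not the portfolio.m program discussed below) builds the moment contributions mt(θ) from consumption growth, returns and a vector of instruments drawn from the time-t information set.

1;  % script-file marker so the function below can be defined here
% Euler-equation moment contributions: h_t(theta) * z_t, where
% h_t = 1 - beta*(1 + r_{t+1})*(c_{t+1}/c_t)^(-gamma).
function m = euler_moments(theta, c, r)
  beta = theta(1);  gamma = theta(2);
  h = 1 - beta * (1 + r(2:end)) .* (c(2:end) ./ c(1:end-1)).^(-gamma);
  Z = [ones(length(h),1), c(1:end-1), r(1:end-1)];  % constant and lagged c, r
  m = h .* Z;          % (T-1) x g matrix of moment contributions
endfunction

T = 96;
c = exp(0.01*cumsum(randn(T,1)));    % made-up consumption series
r = 0.02 + 0.01*randn(T,1);          % made-up net returns
m = euler_moments([0.95; 2], c, r);
disp(mean(m));                        % sample moment conditions at a trial theta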

• Therefore, mt(θ) may be interpreted as a moment condition which can be used for GMM estimation of the parameters θ0. Note that at time t, mt−s has been observed, and is therefore an element of the information set. By rational expectations, the autocovariances of the moment conditions other than Γ0 should be zero. The optimal weighting matrix is therefore the inverse of the variance of the moment conditions:

Ω∞ = lim E[ n m(θ0) m(θ0)′ ],

which can be consistently estimated by

Ω̂ = 1/n ∑_{t=1}^n mt(θ̂) mt(θ̂)′.

As before, this estimate depends on an initial consistent estimate of

θ, which can be obtained by setting the weighting matrix W arbitrarily (to an identity matrix, for example). After obtaining θ̂, we then minimize

s(θ) = m(θ)′ Ω̂⁻¹ m(θ).

This process can be iterated, e.g., use the new estimate to re-estimate Ω, use this to estimate θ0, and repeat until the estimates don't change.

• In principle, we could use a very large number of moment conditions in estimation, since any current or lagged variable could be used in xt. Since use of more moment conditions will lead to a more (asymptotically) efficient estimator, one might be tempted to use many instrumental variables. We will do a computer lab that will show that this may not be a good idea with finite samples. This issue has been studied using Monte Carlos (Tauchen, JBES, 1986). The reason for poor performance when using many instruments is that the estimate of Ω becomes very imprecise, and identification can be problematic.
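The iteration is easy to code. The following Octave sketch applies it to a simple linear model with instruments (made-up data, not the consumption model), where the GMM estimator has a closed form for any weight matrix; the weight matrix is recomputed from the estimated moment covariance until the estimates settle down.

% Minimal sketch of iterated GMM with weight matrix updating.
n = 1000;
z = [ones(n,1) randn(n,2)];                      % g = 3 instruments
x = [ones(n,1) z(:,2) + 0.5*randn(n,1)];         % k = 2 regressors
y = x*[1; 0.5] + randn(n,1).*(1 + abs(z(:,2)));  % heteroscedastic errors

W = eye(columns(z));        % arbitrary starting weight matrix
theta = zeros(2,1);
for iter = 1:20
  theta_old = theta;
  theta = (x'*z*W*z'*x) \ (x'*z*W*z'*y);      % linear GMM for current W
  m = z .* (y - x*theta);                     % n x g moment contributions
  W = inv(m'*m / n);                          % inverse of estimated Omega
  if norm(theta - theta_old) < 1e-8, break; endif
endfor
printf("converged after %d iterations\n", iter);
disp(theta');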

• Empirical papers that use this approach often have serious problems in obtaining precise estimates of the parameters. Note that we are basing everything on a single partial first order condition. Probably this f.o.c. is simply not informative enough.

14.17 Empirical example: a portfolio model

The Octave program portfolio.m performs GMM estimation of a portfolio model, using the data file tauchen.data. The columns of this data file are c, p, and d, in that order. There are 95 observations (source: Tauchen, JBES, 1986). As instruments we use lags of c and r, as well as a constant. For a single lag the estimation results are

MPITB extensions found

******************************************************
Example of GMM estimation of rational expectations model

GMM Estimation Results
BFGS convergence: Normal convergence

Objective function value: 0.000014
Observations: 94

              Value      df      p-value
X^2 test      0.001      1.000   0.971

          estimate   st. err    t-stat   p-value
beta         0.915     0.009    97.271     0.000
gamma        0.569     0.319     1.783     0.075

******************************************************

For two lags the estimation results are

MPITB extensions found

******************************************************
Example of GMM estimation of rational expectations model

GMM Estimation Results
BFGS convergence: Normal convergence

Objective function value: 0.037882
Observations: 93

              Value      df      p-value
X^2 test      3.523      3.000   0.318

          estimate   st. err    t-stat   p-value
beta         0.857     0.024    35.636     0.000
gamma       -2.351     0.315    -7.462     0.000

******************************************************

Pretty clearly, the results are sensitive to the choice of instruments. Maybe there is some problem here: poor instruments, or possibly a conditional moment that is not very informative. Moment conditions formed from Euler conditions sometimes do not identify the parameter of a model. See Hansen, Heaton and Yarron, (1996) JBES V14, N3. Is that a problem here, (I haven't checked it carefully)?

14.18 Exercises

1. Do the exercises in section 14.10.

2. Show how the GIV estimator presented in section 14.10 can be adapted to account for an error term with HET and/or AUT.

3. For the GIV estimator presented in section 14.10, find the form of the expressions I∞(θ0) and J∞(θ0) that appear in the asymptotic distribution of the estimator, assuming that an efficient weight matrix is used.

4. The Octave script meps.m estimates a model for office-based doctor visits (OBDV) using two different moment conditions, a Poisson QML approach and an IV approach. If all conditioning variables are exogenous, both approaches should be consistent. If the PRIV variable is endogenous, only the IV approach should be consistent. Neither of the two estimators is efficient in any case, since we already know that this data exhibits variability that exceeds what is implied by the Poisson model (e.g., negative binomial and other models fit much better). Test the exogeneity of the variable PRIV with a GMM-based Hausman-type test, using the Octave script hausman.m for hints about how to set up the test.

5. Using Octave, generate data from the logit dgp. The script EstimateLogit.m should prove quite helpful.
(a) Recall that E(yt|xt) = p(xt, θ) = [1 + exp(−xt′θ)]⁻¹. Consider the moment conditions (exactly identified) mt(θ) = [yt − p(xt, θ)]xt. Estimate by GMM (using gmm_results), using these moments.
(b) Estimate by ML (using mle_results).
(c) The two estimators should coincide. Prove analytically that the estimators coincide.

6. Verify the missing steps needed to show that n · m(θ̂)′ Ω̂⁻¹ m(θ̂) has a χ²(g − K) distribution. That is, show that the monster matrix is idempotent and has trace equal to g − K.

7. For the portfolio example, experiment with the program using lags of 3 and 4 periods to define instruments.
(a) Iterate the estimation of θ = (β, γ) and Ω to convergence.
(b) Comment on the results. Are the results sensitive to the set of instruments used? Look at Ω̂ as well as θ̂. Are these good instruments? Are the instruments highly correlated with one another? Is there something analogous to collinearity going on here?

8. Run the Octave script GMM/chi2gmm.m with several sample sizes. Do the results you obtain seem to agree with the consistency of the GMM estimator? Explain.

9. The GMM estimator with an arbitrary weight matrix has the asymptotic distribution

√n (θ̂ − θ0) →d N[ 0, (D∞W∞D∞′)⁻¹ D∞W∞Ω∞W∞D∞′ (D∞W∞D∞′)⁻¹ ].

Supposing that you compute a GMM estimator using an arbitrary weight matrix, so that this result applies, carefully explain how you could test the hypothesis H0 : Rθ0 = r versus HA : Rθ0 ≠ r, where R is a given q × k matrix and r is a given q × 1 vector. I suggest that you use a Wald test. Explain exactly what is the test statistic, and how to compute every quantity that appears in the statistic.

10. (Proof that the GMM optimal weight matrix is one such that W∞ = Ω∞⁻¹.) Consider the difference of the asymptotic variance using an arbitrary weight matrix, minus the asymptotic variance

using the optimal weight matrix:

A = (D∞W∞D∞′)⁻¹ D∞W∞Ω∞W∞D∞′ (D∞W∞D∞′)⁻¹ − (D∞Ω∞⁻¹D∞′)⁻¹.

Set B = (D∞W∞D∞′)⁻¹ D∞W∞ − (D∞Ω∞⁻¹D∞′)⁻¹ D∞Ω∞⁻¹. Verify that A = BΩ∞B′. What is the implication of this? Explain.

11. Recall the dynamic model with measurement error that was discussed in class:

y*t = α + ρ y*t−1 + β xt + εt
yt = y*t + υt

where εt and υt are independent Gaussian white noise errors. Suppose that y*t is not observed, and instead we observe yt. If we estimate the equation

yt = α + ρ yt−1 + β xt + νt

12. The Octave script GMM/SpecTest.m performs a Monte Carlo study of the performance of the GMM criterion test

n · sn(θ̂) →d χ²(g − K).

Examine the script and describe what it does. Run this script to verify that the test over-rejects. Increase the sample size, to determine if the over-rejection problem becomes less severe. Discuss your findings.

Chapter 15

Introduction to panel data

Reference: Cameron and Trivedi, 2005, Microeconometrics: Methods and Applications, Part V, Chapters 21 and 22 (plus 23 if you have special interest in the topic).

In this chapter we'll look at panel data. Panel data is an important

area in applied econometrics, simply because much of the available data has this structure. Also, it provides an example where things we've already studied (GLS, endogeneity, GMM, Hausman test) come into play. There has been much work in this area, and the intention is not to give a complete overview, but rather to highlight the issues and see how the tools we have studied can be applied.

15.1 Generalities

Panel data combines cross sectional and time series data: we have a time series for each of the agents observed in a cross section. The addition of temporal information can in principle allow us to investigate issues such as persistence, habit formation, and dynamics. Starting from the perspective of a single time series, the addition of cross-sectional information allows investigation of heterogeneity. In both cases, if parameters are common across units or over time, the additional data allows for more precise estimation.

The basic idea is to allow variables to have two indices, i = 1, 2, ..., n and t = 1, 2, ..., T. The simple linear model

yi = α + xi′β + εi

becomes

yit = α + xit′β + εit.

We could think of allowing the parameters to change over time and over cross sectional units. This would give

yit = αit + xit′βit + εit.

The problem here is that there are more parameters than observations, so the model is not identified. We need some restraint! The proper restrictions to use of course depend on the problem at hand, and a single model is unlikely to be appropriate for all situations.

For example, one could have time and cross-sectional dummies, and slopes that vary by time:

yit = αi + αt + xit′βt + εit.

There is a lot of room for playing around here. We also need to consider whether or not n and T are fixed or growing. We'll need at least one of them to be growing in order to do asymptotics.

To provide some focus, we'll consider common slope parameters, but agent-specific intercepts, which gives:

yit = αi + xit′β + εit.   (15.1)

I will refer to this as the "simple linear panel model". This is the model most often encountered in the applied literature. It is like the original cross-sectional model, in that the β's are constant over time for all i. However, we're now allowing for the constant to vary across i (some individual heterogeneity). The β's are fixed over time, which is a testable restriction,

of course. We can consider what happens as n → ∞ but T is fixed. This would be relevant for microeconometric panels (e.g., the PSID data), where a survey of a large number of individuals may be done for a limited number of time periods. Macroeconometric applications might look at longer time series for a small number of cross-sectional units (e.g., 40 years of quarterly data for 15 European countries). For that case, we could keep n fixed (seems appropriate when dealing with the EU countries), and do asymptotics as T increases, as is normal for time series. The asymptotic results depend on how we do this, of course.

Why bother using panel data, what are the benefits? The model

yit = αi + xit′β + εit

is a restricted version of

yit = αi + xit′βi + εit,

which could be estimated for each i in turn. Why use the panel approach?

• Because the restrictions that βi = βj = ... = β, if true, lead to more efficient estimation. Estimation for each i in turn will be very uninformative if T is small.

• Another reason is that panel data allows us to estimate parameters that are not identified by cross sectional (time series) data. For example, if the model is

yit = αi + αt + xit′βt + εit

and we have only cross sectional data, we cannot estimate the αi. If we have only time series data on a single cross sectional unit i = 1, we cannot estimate the αt. Cross-sectional variation allows us to estimate parameters indexed by time, and time series variation allows us to estimate parameters indexed by cross-sectional unit. Parameters indexed by both i and t will require other forms of restrictions in order to be estimable.

The main issues are:
• can β be estimated consistently? This is almost always a goal.
• can the αi be estimated consistently? This is often of secondary interest.
• sometimes, we're interested in estimating the distribution of αi across i.
• are the αi correlated with xit?
• does the presence of αi complicate estimation of β?
• what about the covariance structure? We're likely to have HET and AUT, so GLS issues will probably be relevant. Potential for efficiency gains.

15.2 Static issues and panel data

To begin with, assume that the xit are weakly exogenous variables (uncorrelated with εit), and that the model is static: xit does not contain lags of yit. The basic problem we have in the panel data model is the presence of the αi. These are individual-specific parameters. Or, possibly more accurately, they can be thought of as individual-specific variables that are not observed (latent variables). The reason for thinking of them as variables is because the agent may choose their values following some process. Define α = E(αi), so E(αi − α) = 0. Our model

yit = αi + xit′β + εit

may be written

yit = αi + xit′β + εit = α + xit′β + (αi − α + εit) = α + xit′β + ηit.

Note that E(ηit) = 0. A way of thinking about the data generating process is this: first, αi is drawn, either in turn from the set of n fixed values, or randomly, and then x is drawn from fX(z|αi). In either case, the important point is that the distribution of x may vary depending on the realization of αi. Thus, there may be correlation between αi and xit, which means that E(xit ηit) ≠ 0 in the above equation. This means that OLS estimation of the model would lead to biased and inconsistent estimates. However, it is possible (but unlikely for economic data) that xit and ηit are independent or at least uncorrelated, if the distribution of xit is constant with respect to the realization of αi. In this case OLS estimation would be consistent.

Fixed effects: when E(xit ηit) ≠ 0, the model is called the "fixed effects model".

Random effects: when E(xit ηit) = 0, the model is called the "random effects model".

I find this to be pretty poor nomenclature, because the issue is not whether "effects" are fixed or random (they are always random, unconditional on i). The issue is whether or not the "effects" are correlated with the other regressors. In economics, it seems likely that the unobserved variable α is probably correlated with the observed regressors, x (this is simply the presence of collinearity between observed and unobserved variables, and collinearity is usually the rule rather than the exception). So, we expect that the "fixed effects" model is probably the relevant one unless special circumstances mean that the αi are uncorrelated with the xit.

15.3 Estimation of the simple linear panel model

"Fixed effects": The "within" estimator

How can we estimate the parameters of the simple linear panel model (equation 15.1), and what properties do the estimators have? First, we assume that the αi are correlated with the xit (the "fixed effects" model). The model can be written as yit = α + xit′β + ηit, and we have that E(xit ηit) ≠ 0. As such, OLS estimation of this model will give biased and inconsistent estimates of the parameters α and β. The "within" estimator is a solution: this involves subtracting the time

series average from each cross sectional unit:

ȳi = 1/T ∑_{t=1}^T yit = αi + (1/T ∑_{t=1}^T xit)′β + 1/T ∑_{t=1}^T εit = αi + x̄i′β + ε̄i.   (15.2)

The transformed model is

yit − ȳi = αi + xit′β + εit − αi − x̄i′β − ε̄i,

or

y*it = x*it′β + ε*it,   (15.3)

where x*it = xit − x̄i and ε*it = εit − ε̄i. In this model, it is clear that x*it and ε*it are uncorrelated, as long as the original regressors xit are strongly exogenous with respect to the original error εit (E(xit εis) = 0, ∀t, s). Thus OLS will give consistent estimates of the parameters of this model, β.
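In Octave, the demeaning and the resulting estimator can be sketched as follows (simulated data, with the αi deliberately correlated with the regressors so that pooled OLS is biased; the variable names are made up for the illustration).

% Minimal sketch of the "within" (fixed effects) estimator.
n = 100;  T = 5;  k = 2;
alpha = randn(n,1);
id = kron((1:n)', ones(T,1));                 % unit index for each observation
x = randn(n*T, k) + 0.5*alpha(id);            % regressors correlated with alpha
y = alpha(id) + x*[1; -1] + randn(n*T,1);

ybar = accumarray(id, y) / T;                 % group (time-series) means
y_w = y - ybar(id);
x_w = zeros(n*T, k);
for j = 1:k
  xbar = accumarray(id, x(:,j)) / T;
  x_w(:,j) = x(:,j) - xbar(id);
endfor

beta_within = x_w \ y_w;                      % consistent despite the alpha_i
beta_pooled = [ones(n*T,1) x] \ y;            % pooled OLS, biased here
disp([beta_within, beta_pooled(2:3)]);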

What about the αi? Can they be estimated? An estimator is

α̂i = 1/T ∑_{t=1}^T (yit − xit′β̂).

It's fairly obvious that this is a consistent estimator if T → ∞. For a short panel with fixed T, this estimator is not consistent. Nevertheless, the variation in the α̂i can be fairly informative about the heterogeneity. A couple of notes:

• an equivalent approach is to estimate the model

yit = ∑_{j=1}^n dj,it αi + xit′β + εit

by OLS. The dj, j = 1, 2, ..., n, are n dummy variables that take on the value 1 if j = i, zero otherwise. They are indicators of the cross sectional unit of the observation. (Write out the form of the regressor matrix on blackboard.) Estimating this model by OLS gives numerically exactly the same results as the "within" estimator, and you get the α̂i automatically. An interesting and important result known as the Frisch-Waugh-Lovell Theorem can be used to show that the two means of estimation give identical results. See Cameron and Trivedi, section 21.6.4, for details.

• This last expression makes it clear why the "within" estimator cannot estimate slope coefficients corresponding to variables that have no time variation. Such variables are perfectly collinear with the cross sectional dummies dj. The corresponding coefficients are not identified.

• OLS estimation of the "within" model is consistent, but probably not efficient, because

it is highly probable that the εit are not iid. There is very likely heteroscedasticity across the i and autocorrelation between the T observations corresponding to a given i. One needs to estimate the covariance matrix of the parameter estimates taking this into account. It is possible to use GLS corrections if you make assumptions regarding the het. and autocor. Quasi-GLS, using a possibly misspecified model of the error covariance, can lead to more efficient estimates than simple OLS. One can then combine it with subsequent panel-robust covariance estimation to deal with the misspecification of the error covariance, which would invalidate inferences if ignored. The White heteroscedasticity consistent covariance estimator is easily extended to panel data with independence across i, but with heteroscedasticity and autocorrelation within i, and heteroscedasticity between i. See Cameron and Trivedi, Section 21.2.3.

Estimation with random effects

The original model is

yit = αi + xit′β + εit.

This can be written as

yit = α + xit′β + (αi − α + εit) = α + xit′β + ηit,   (15.4)

where E(ηit) = 0 and E(xit ηit) = 0. As such, the OLS estimator of this model is consistent. We can recover estimates of the αi as discussed above. It is to be noted that the error ηit is almost certainly heteroscedastic and autocorrelated, so OLS will not be efficient, and inferences based on OLS need to be done taking this into account. One could attempt to use GLS, or panel-robust covariance matrix estimation, or both, as above.

There are other estimators when we have random effects, a well-known example being the "between" estimator, which operates on the time averages of the cross sectional units. There is no advantage to doing this, as the overall estimator is already consistent, and averaging loses information (efficiency loss). One would still need to deal with cross sectional heteroscedasticity when using the between estimator, so there is no gain in simplicity, either. It is to be emphasized that "random effects" is not a plausible assumption with most economic data, so use of this estimator is discouraged, even if your statistical package offers it as an option. Think carefully about whether the assumption is warranted before trusting the results of this estimator.

Hausman test

Suppose you're doubting about whether fixed or random effects are present. If we have fixed effects, then the "within" estimator will be consistent, but the estimator of the previous section will not. Evidence that the two estimators are converging to different limits is evidence in favor of fixed effects, not random effects. A Hausman test statistic can be computed, using the difference between the two estimators. The null hypothesis is "random effects", so that both estimators are consistent. When the test rejects, we conclude that fixed effects are present, so the "within" estimator should be used. Now, what happens if the test does not reject? One could optimistically turn to the random effects model, but it's probably more realistic to conclude that the test may have low power. Failure to reject does not mean that the null hypothesis is true. After all, estimation of the covariance matrices needed to compute the Hausman test is a non-trivial issue, and is

a source of considerable noise in the test statistic (noise = low power). Finally, the simple version of the Hausman test requires that the estimator under the null be fully efficient. Achieving this goal is probably a utopian prospect. A conservative approach would acknowledge that neither estimator is likely to be efficient, and operate accordingly. I have a little paper on this topic, Creel, Applied Economics, 2004. See also Cameron and Trivedi, section 21.4.3.

15.4 Dynamic panel data

When we have panel data, we have information on both yit as well as yi,t−1. One may naturally think of including yi,t−1 as a regressor, to capture dynamic effects that can't be analyzed with only cross-sectional data. Excluding dynamic effects is often the reason for detection of spurious AUT of the errors. With dynamics, there is likely to be less of a problem of autocorrelation, but one should still be concerned that some might still be present. The model becomes

yit = αi + γ yi,t−1 + xit′β + εit = α + γ yi,t−1 + xit′β + (αi − α + εit) = α + γ yi,t−1 + xit′β + ηit.

We assume that the xit are uncorrelated with εit. Note that αi is a component that determines both yit and its lag, yi,t−1. Thus, αi and yi,t−1 are correlated, even if the αi are pure random effects (uncorrelated with xit). So, yi,t−1 is correlated with ηit. For this reason, OLS estimation is inconsistent even for the random effects model, and it's also of course still inconsistent for the fixed effects model. When regressors are correlated with the errors, the natural thing to do is start thinking of instrumental variables estimation, or GMM.

To illustrate, consider a simple linear dynamic panel model

yit = αi + φ0 yit−1 + εit,   (15.5)

where εit ∼ N(0, 1), αi ∼ N(0, 1), φ0 = 0, 0.3, 0.6, 0.9, and αi and εi are independently distributed. Tables 15.1 and 15.2 present bias and RMSE for the "within" estimator (labeled as ML) and some simulation-based estimators. Simulation-based estimators are discussed in a later Chapter. Note that the "within" estimator is very biased, and has a large RMSE. The overidentified SBIL estimator has the lowest RMSE. Perhaps these results will stimulate your interest.

0 0.022 -0.001 -0.275 -0.004 -0.6 0.001 0.001 -0.3 0.000 -0.000 0.000 0.464 -0. 2010.000 0.9 0.000 0. SBIL(OI) and SMIL(OI) are overidentiﬁed. T 5 5 5 5 5 5 5 5 N 100 100 100 100 200 200 200 200 φ 0.000 0.Table 15.3 0.003 -0.199 -0.001 0.362 -0. Phillips and Yu.001 -0.6 0. using the ML auxiliary statistic.001 0.010 0. Table 2.004 -0.200 -0.001 -0. SBIL.0 0. using both the naive and ML auxiliary statistics. Source for ML and II is Gouriéroux.465 II 0.001 .000 -0.010 -0. Bias. SMIL and II are exactly identiﬁed.9 ML -0.1: Dynamic panel data model.363 -0.001 -0.001 0.274 -0.003 SBIL SBIL(OI) 0.

365 0. RMSE. SBIL.0 0.3 0.025 0.029 0.041 0. SBIL(OI) and SMIL(OI) are overidentiﬁed.036 0.070 0. T 5 5 5 5 5 5 5 5 N 100 100 100 100 200 200 200 200 φ 0.050 0.044 0.059 0.203 0.074 0. Source for ML and II is Gouriéroux.081 0.041 0.065 0.0 0. using the ML auxiliary statistic.277 0.3 0.278 0.2: Dynamic panel data model.365 0.046 0.9 ML 0. 2010. SMIL and II are exactly identiﬁed. Phillips and Yu.031 0.059 0.6 0.027 .6 0. using both the naive and ML auxiliary statistics.033 0.Table 15.467 II 0.041 0.071 0.054 SBIL SBIL(OI) 0.057 0.050 0. Table 2.076 0.204 0.046 0.467 0.9 0.

Arellano-Bond estimator

The first thing is to realize that the αi that are a component of the error are correlated with all regressors in the general case of fixed effects. Getting rid of the αi is a step in the direction of solving the problem. We could subtract the time averages, as above for the "within" estimator, but this would give us problems later when we need to define instruments. Instead, consider the model in first differences:

yit − yi,t−1 = αi + γ yi,t−1 + xit′β + εit − αi − γ yi,t−2 − xi,t−1′β − εi,t−1 = γ (yi,t−1 − yi,t−2) + (xit − xi,t−1)′β + εit − εi,t−1,

or

∆yit = γ ∆yi,t−1 + ∆xit′β + ∆εit.

Now the pesky αi are no longer in the picture. Note that we lose one observation when doing first differencing.

OLS estimation of this model will still be inconsistent, because yi,t−1 is clearly correlated with εi,t−1. Note also that the error ∆εit is serially correlated even if the εit are not. There is no problem of correlation between ∆xit and ∆εit, so the variables in ∆xit can serve as their own instruments. Thus, to do GMM, we need to find instruments for ∆yi,t−1. How about using yi,t−2 as an instrument? It is clearly correlated with ∆yi,t−1 = (yi,t−1 − yi,t−2), and as long as the εit are not serially correlated, then yi,t−2 is not correlated with ∆εit = εit − εi,t−1. We can also use additional lags yi,t−s, s ≥ 2, to increase efficiency, because GMM with additional instruments is asymptotically more efficient than with fewer instruments. This sort of estimator is widely known in the literature as an Arellano-Bond estimator, due to the influential 1991 paper of Arellano and Bond (1991).

• Note that this sort of estimator requires T = 3 at a minimum. Suppose T = 4. Then for t = 1 and t = 2, we cannot compute the moment conditions. For t = 3, we can compute the moment conditions using a single lag yi,1 as an instrument.

When t = 4, we can use both yi,1 and yi,2 as instruments. This sort of unbalancedness in the instruments requires a bit of care when programming. Also, additional instruments increase asymptotic efficiency but can lead to increased small sample bias, so one should be a little careful with using too many instruments. Some robustness checks, looking at the stability of the estimates, are a way to proceed.

• One should note that serial correlation of the εit will cause this estimator to be inconsistent. Serial correlation of the errors may be due to dynamic misspecification, and this can be solved by including additional lags of the dependent variable. However, serial correlation may also be due to factors not captured in lags of the dependent variable. If this is a possibility, then the validity of the Arellano-Bond type instruments is in question.

• A final note is that the error ∆εit is serially correlated, and very likely heteroscedastic across i.
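A stripped-down version of the idea, using only the single instrument yi,t−2 and a just-identified IV estimator on the differenced equation, can be sketched in Octave as follows (simulated data; a full Arellano-Bond implementation would use the complete, unbalanced instrument set and a GMM weight matrix).

% Minimal sketch: IV on first differences with a lagged level as instrument.
n = 500;  T = 5;  phi = 0.6;
alpha = randn(n,1);
y = zeros(n,T);
y(:,1) = alpha + randn(n,1);
for t = 2:T
  y(:,t) = alpha + phi*y(:,t-1) + randn(n,1);
endfor

dy = [];  dylag = [];  z = [];
for t = 3:T
  dy    = [dy;    y(:,t)   - y(:,t-1)];
  dylag = [dylag; y(:,t-1) - y(:,t-2)];
  z     = [z;     y(:,t-2)];             % instrument: twice-lagged level
endfor
phi_iv = (z'*dylag) \ (z'*dy);           % just-identified IV estimate
printf("true phi = %.2f, IV estimate = %.3f\n", phi, phi_iv);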

One needs to take this into account when computing the covariance of the GMM estimator. One can also attempt to use GLS style weighting to improve efficiency. There are many possibilities.

15.5 Exercises

1. In the context of a dynamic model with fixed effects, why is the differencing used in the "within" estimation approach (equation 15.3) problematic? That is, why does the Arellano-Bond estimator operate on the model in first differences instead of using the within approach?

2. Consider the simple linear panel data model with random effects (equation 15.4). Suppose that the εit are independent across cross sectional units, so that E(εit εjs) = 0, i ≠ j, ∀t, s. Within a cross sectional unit, the errors are independently and identically distributed, so E(εit²) = σi², but E(εit εis) = 0, t ≠ s. More compactly, E(εi εi′) = σi² IT, where εi = (εi1, εi2, ..., εiT)′.

(a) Write out the form of the entire covariance matrix (nT × nT) of all errors, Σ = E(εε′), where ε = (ε1′, ε2′, ..., εn′)′ is the column vector of nT errors.
(b) Suppose that n is fixed, and consider asymptotics as T grows. Is it possible to estimate the Σi consistently? If so, how?
(c) Suppose that T is fixed, and consider asymptotics as n grows. Is it possible to estimate the Σi consistently? If so, how?
(d) For one of the two preceding parts (b) and (c), consistent estimation is possible. For that case, outline how to do "within" estimation using a GLS correction.

Chapter 16

Quasi-ML

Quasi-ML is the estimator one obtains when a misspecified probability model is used to calculate an "ML" estimator. Given a sample of size n of a random vector y and a vector of conditioning variables x, suppose the joint density of Y = (y1, ..., yn) conditional on X = (x1, ..., xn) is a member of the parametric family pY(Y|X, ρ), ρ ∈ Ξ. The true joint density is associated with the vector ρ0:

pY(Y|X, ρ0).

As long as the marginal density of X doesn't depend on ρ0, this conditional density fully characterizes the random characteristics of samples: i.e., it fully describes the probabilistically important features of the d.g.p. The likelihood function is just this density evaluated at other values ρ:

L(Y|X, ρ) = pY(Y|X, ρ), ρ ∈ Ξ.

• Let Yt−1 = (y1, ..., yt−1), Y0 = 0, and let Xt = (x1, ..., xt). The likelihood function, taking into account possible dependence

**of observations, can be written as
**

n

L(Y|X, ρ) =

t=1 n

pt(yt|Yt−1, Xt, ρ) pt(ρ)

t=1

≡

**• The average log-likelihood function is: 1 1 sn(ρ) = ln L(Y|X, ρ) = n n
**

n

ln pt(ρ)

t=1

• Suppose that we do not have knowledge of the family of densities pt(ρ). Mistakenly, we may assume that the conditional density of yt is a member of the family ft(yt|Yt−1, Xt, θ), θ ∈ Θ, where there is no θ0 such that ft(yt|Yt−1, Xt, θ0) = pt(yt|Yt−1, Xt, ρ0), ∀t (this is what we mean by “misspeciﬁed”).

• This setup allows for heterogeneous time series data, with dynamic misspeciﬁcation. The QML estimator is the argument that maximizes the misspeciﬁed average log likelihood, which we refer to as the quasi-log likelihood function. This objective function is 1 sn(θ) = n 1 ≡ n and the QML is ˆ θn = arg max sn(θ)

Θ n

ln ft(yt|Yt−1, Xt, θ0)

t=1 n

ln ft(θ)

t=1

**A SLLN for dependent sequences applies (we assume), so that 1 sn(θ) → lim E n→∞ n
**

a.s. n

ln ft(θ) ≡ s∞(θ)

t=1

We assume that this can be strengthened to uniform convergence, a.s., following the previous arguments. The “pseudo-true” value of θ ¯ is the value that maximizes s(θ): θ0 = arg max s∞(θ)

Θ

**Given assumptions so that theorem 28 is applicable, we obtain
**

n→∞

ˆ lim θn = θ0, a.s.

**• Applying the asymptotic normality theorem, √ where
**

2 J∞(θ0) = lim EDθ sn(θ0) n→∞ d ˆ n θ − θ0 → N 0, J∞(θ0)−1I∞(θ0)J∞(θ0)−1

and

0 n→∞

I∞(θ ) = lim V ar nDθ sn(θ0).

√

• Note that asymptotic normality only requires that the additional assumptions regarding J and I hold in a neighborhood of θ0 for J and at θ0, for I, not throughout Θ. In this sense, asymptotic normality is a local property.

16.1 Consistent Estimation of Variance Components

Consistent estimation of J∞(θ0) is straightforward. Assumption (b) of Theorem 30 implies that

Jn(θ̂n) = (1/n) ∑_{t=1}^n Dθ² ln ft(θ̂n) →a.s. lim_{n→∞} E (1/n) ∑_{t=1}^n Dθ² ln ft(θ0) = J∞(θ0).

That is, just calculate the Hessian using the estimate θ̂n in place of θ0. Consistent estimation of I∞(θ0) is more difficult, and may be impossible.
ˆ That is, just calculate the Hessian using the estimate θn in place of θ0. Consistent estimation of I∞(θ0) is more difﬁcult, and may be im-

**possible. • Notation: Let gt ≡ Dθ ft(θ0) We need to estimate I∞(θ ) = lim V ar nDθ sn(θ0)
**

n→∞ 0

√

**√ 1 = lim V ar n n→∞ n 1 = lim V ar n→∞ n 1 = lim E n→∞ n
**

n

n

Dθ ln ft(θ0)

t=1

gt

t=1 n n

(gt − Egt)

t=1 t=1

(gt − Egt)

**This is going to contain a term 1 lim n→∞ n
**

n

(Egt) (Egt)

t=1

which will not tend to zero, in general. This term is not consistently estimable in general, since it requires calculating an expectation using the true density under the d.g.p., which is unknown.

• There are important cases where I∞(θ0) is consistently estimable. For example, suppose that the data come from a random sample (i.e., they are iid). This would be the case with cross sectional data, for example. (Note: under i.i.d. sampling, the joint distribution of (yt, xt) is identical. This does not imply that the conditional density f(yt|xt) is identical.)

• With random sampling, the limiting objective function is simply

s∞(θ0) = EX E0 ln f(y|x, θ0)

where E0 means expectation of y|x and EX means expectation with respect to the marginal density of x.

• By the requirement that the limiting objective function be maximized at θ0, we have

Dθ EX E0 ln f(y|x, θ0) = Dθ s∞(θ0) = 0

• The dominated convergence theorem allows switching the order of expectation and differentiation, so

Dθ EX E0 ln f(y|x, θ0) = EX E0 Dθ ln f(y|x, θ0) = 0

The CLT implies that

(1/√n) ∑_{t=1}^n Dθ ln f(y|x, θ0) →d N(0, I∞(θ0)).

That is, it's not necessary to subtract the individual means, since they are zero. Given this, and due to independent observations, a consistent estimator is

Î = (1/n) ∑_{t=1}^n Dθ ln ft(θ̂) Dθ ln ft(θ̂)′

This is an important case where consistent estimation of the covariance matrix is possible. Other cases exist, even for dynamically misspeciﬁed time series models.

16.2 Example: the MEPS Data

To check the plausibility of the Poisson model for the MEPS data, we can compare the sample unconditional variance with the estimated unconditional variance according to the Poisson model:

V̂(y) = (∑_{t=1}^n λ̂t) / n.

Using the program PoissonVariance.m, for OBDV and ERV, we get the results in Table 16.1. We see that even after conditioning, the overdispersion is not captured in either case. There is a huge problem with OBDV, and a significant problem with ERV. In both cases the Poisson model does not appear to be plausible. You can check this for the other use measures if you like.

Table 16.1: Marginal Variances, Sample and Estimated (Poisson)
             OBDV     ERV
Sample       38.09    0.151
Estimated     3.28    0.086
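The comparison itself is just a couple of lines. The sketch below (made-up overdispersed data, not the MEPS data or the PoissonVariance.m program) fits a Poisson regression by Newton steps and contrasts the sample variance of y with the Poisson-implied unconditional variance, the mean of the fitted λ̂t.

% Minimal sketch: sample variance versus Poisson-implied variance.
% (poissrnd is in the Octave statistics package.)
n = 2000;
x = [ones(n,1) randn(n,1)];
u = randn(n,1);                            % unobserved heterogeneity
y = poissrnd(exp(0.5 + 0.5*x(:,2) + u));   % counts, overdispersed for the model below

b = [log(mean(y)); 0];                     % Poisson ML by Newton's method
for it = 1:50
  lam = exp(x*b);
  b = b + (x'*(lam.*x)) \ (x'*(y - lam));
endfor
printf("sample variance %.2f, Poisson-implied variance %.2f\n", var(y), mean(exp(x*b)));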

Infinite mixture models: the negative binomial model
Reference: Cameron and Trivedi (1998) Regression analysis of count data, chapter 4. The two measures seem to exhibit extra-Poisson variation. To capture unobserved heterogeneity, a possibility is the random parameters approach. Consider the possibility that the constant term in a Poisson

model were random: exp(−θ)θy fY (y|x, ε) = y! θ = exp(x β + ε) = exp(x β) exp(ε) = λν where λ = exp(x β) and ν = exp(ε). Now ν captures the randomness in the constant. The problem is that we don’t observe ν, so we will need to marginalize it to get a usable density exp[−θ]θy fY (y|x) = fv (z)dz y! −∞ This density can be used directly, perhaps using numerical integration to evaluate the likelihood function. In some cases, though, the integral will have an analytic solution. For example, if ν follows a certain

∞

**one parameter gamma density, then Γ(y + ψ) fY (y|x, φ) = Γ(y + 1)Γ(ψ) ψ ψ+λ
**

ψ

λ ψ+λ

y

(16.1)

where φ = (λ, ψ). ψ appears since it is the parameter of the gamma density. • For this density, E(y|x) = λ, which we have parameterized λ = exp(x β) • The variance depends upon how ψ is parameterized. – If ψ = λ/α, where α > 0, then V (y|x) = λ + αλ. Note that λ is a function of x, so that the variance is too. This is referred to as the NB-I model. – If ψ = 1/α, where α > 0, then V (y|x) = λ + αλ2. This is referred to as the NB-II model.
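A compact way to code the NB-II case is to write the log-density of equation 16.1 as a function of the linear index and the overdispersion parameter; the sketch below (function and argument names are made up, and it is not the estimation program referred to in the text) could be handed to any maximizer.

% nb2_loglik.m: log-likelihood of the NB-II model, with psi = 1/alpha,
% so that E(y|x) = lambda = exp(x*beta) and V(y|x) = lambda + alpha*lambda^2.
function ll = nb2_loglik(y, x, beta, alpha)
  lambda = exp(x*beta);
  psi = 1/alpha;
  ll = sum(gammaln(y + psi) - gammaln(y + 1) - gammaln(psi) ...
           + psi*log(psi ./ (psi + lambda)) ...
           + y .* log(lambda ./ (psi + lambda)));
endfunction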

So both forms of the NB model allow for overdispersion, with the NB-II model allowing for a more radical form. Testing reduction of a NB model to a Poisson model cannot be done by testing α = 0 using standard Wald or LR procedures. The critical values need to be adjusted to account for the fact that α = 0 is on the boundary of the parameter space. Without getting into details, suppose that the data were in fact Poisson, so there is equidispersion and the true α = 0. Then about half the time the sample data will be underdispersed, and about half the time overdispersed. When the data is underdispersed, the MLE of α will be α̂ = 0. Thus, under the null, there will be a probability spike in the asymptotic distribution of √n(α̂ − α) = √n α̂ at 0, so standard testing methods will not be valid. This program will do estimation using the NB model. Note how modelargs is used to select a NB-I or NB-II density. Here are NB-I estimation results for OBDV:

MPITB extensions found

OBDV

======================================================
BFGSMIN final results
Used analytic gradient
------------------------------------------------------
STRONG CONVERGENCE
Function conv 1  Param conv 1  Gradient conv 1
------------------------------------------------------
Objective function value 2.18573
Stepsize 0.0007
17 iterations
------------------------------------------------------
   param    gradient    change
  1.0965     0.0000    -0.0000
  0.2551    -0.0000     0.0000
  0.2024    -0.0000     0.0000
  0.2289     0.0000    -0.0000
  0.1969    -0.0000     0.0000
  0.0769     0.0000    -0.0000
  0.0000     0.0000    -0.0000
  1.7146    -0.0000     0.0000

******************************************************
Negative Binomial model, MEPS 1996 full data set

MLE Estimation Results
BFGS convergence: Normal convergence

Average Log-L: -2.185730
Observations: 4564

             estimate   st. err    t-stat   p-value
constant       -0.523     0.104    -5.005     0.000
pub. ins.       0.765     0.054    14.198     0.000
priv. ins.      0.451     0.049     9.196     0.000
sex             0.458     0.034    13.512     0.000
age             0.016     0.001    11.869     0.000
edu             0.027     0.007     3.979     0.000
inc             0.000     0.000     0.000     1.000
alpha           5.555     0.296    18.752     0.000

Information Criteria
CAIC : 20026.7513   Avg. CAIC: 4.3880
BIC  : 20018.7513   Avg. BIC:  4.3862
AIC  : 19967.3437   Avg. AIC:  4.3750
******************************************************

Note that the parameter values of the last BFGS iteration are different from those reported in the final results. This reflects two things. First, the data were scaled before doing the BFGS minimization, but the mle_results script takes this into account and reports the results using the original scaling. But also, the parameterization α = exp(α*) is used to enforce the restriction that α > 0. The unrestricted parameter α* = log α is used to define the log-likelihood function, since the BFGS minimization algorithm does not do constrained minimization. To get the standard error and t-statistic of the estimate of α, we need to use the delta method. This is done inside mle_results, making use of the function parameterize.m.

Likewise, here are NB-II results:

MPITB extensions found

OBDV

======================================================
BFGSMIN final results
Used analytic gradient
------------------------------------------------------
STRONG CONVERGENCE
Function conv 1  Param conv 1  Gradient conv 1
------------------------------------------------------
Objective function value 2.18496
Stepsize 0.0104394
13 iterations
------------------------------------------------------
   param    gradient    change
  1.0375     0.0000    -0.0000
  0.3673    -0.0000     0.0000
  0.2136     0.0000    -0.0000
  0.2816     0.0000    -0.0000
  0.3027     0.0000     0.0000
  0.0843    -0.0000     0.0000
 -0.0048     0.0000    -0.0000
  0.4780    -0.0000     0.0000

******************************************************
Negative Binomial model, MEPS 1996 full data set

MLE Estimation Results
BFGS convergence: Normal convergence

Average Log-L: -2.184962
Observations: 4564

             estimate   st. err    t-stat   p-value
constant       -1.068     0.161    -6.622     0.000
pub. ins.       1.101     0.095    11.611     0.000
priv. ins.      0.476     0.081     5.880     0.000
sex             0.564     0.050    11.166     0.000
age             0.025     0.002    12.240     0.000
edu             0.029     0.009     3.106     0.002
inc            -0.000     0.000    -0.176     0.861
alpha           1.613     0.055    29.099     0.000

Information Criteria
CAIC : 20019.7439   Avg. CAIC: 4.3864
BIC  : 20011.7439   Avg. BIC:  4.3847
AIC  : 19960.3362   Avg. AIC:  4.3734
******************************************************

• For the OBDV usage measure, the NB-II model does a slightly better job than the NB-I model, in terms of the average log-likelihood and the information criteria (more on this last in a moment).
• Note that both versions of the NB model fit much better than does the Poisson model (see 11.4).
• The estimated α is highly significant.

To check the plausibility of the NB-II model, we can compare the sample unconditional variance with the estimated unconditional variance

according to the NB-II model:

V̂(y) = ( ∑_{t=1}^n [ λ̂t + α̂ λ̂t² ] ) / n.

For OBDV and ERV (estimation results not reported), we get the results in Table 16.2. For OBDV, the overdispersion problem is significantly better than in the Poisson case, but there is still some that is not captured. For ERV, the negative binomial model seems to capture the overdispersion adequately.

Table 16.2: Marginal Variances, Sample and Estimated (NB-II)
             OBDV     ERV
Sample       38.09    0.151
Estimated    30.58    0.182

Finite mixture models: the mixed negative binomial model
The ﬁnite mixture approach to ﬁtting health care demand was introduced by Deb and Trivedi (1997). The mixture approach has the intu-

p. φ1... The available objective measures. Subjective... φp. π1 ≥ π2 ≥ · · · ≥ πp and φi = φj . Many studies have incorporated objective and/or subjective indicators of health status in an effort to capture this heterogeneity. φi) + πpfY (y. and may also not be exogenous Finite mixture models are conceptually simple.itive appeal of allowing for subgroups of the population with different health status. . and p i=1 πi = 1. φp).. . self-reported measures may suffer from the same problem. are not necessarily very informative about a person’s overall health status. for example.. This is simple to accomplish .. The density is p−1 fY (y. πp−1) = i=1 p πifY (y. If individuals are classiﬁed as healthy or unhealthy then two subgroups are deﬁned. i = j. A ﬁner classiﬁcation scheme would lead to more subgroups. such as limitations on activity. 2. Iden- tiﬁcation requires that the πi are ordered in some way. (i) where πi > 0. πp = 1 − p−1 i=1 πi . π1.. i = 1. ..

• The properties of the mixture density follow in a straightforward way from those of the components. In particular, the moment generating function is the same mixture of the moment generating functions of the component densities, so, for example, E(Y|x) = Σ_{i=1}^p π_i μ_i(x), where μ_i(x) is the mean of the i-th component density.
• Mixture densities may suffer from overparameterization, since the total number of parameters grows rapidly with the number of component densities. It is possible to constrain parameters across the mixtures.
• Testing for the number of component densities is a tricky issue. For example, testing for p = 1 (a single component, no mixture) versus p = 2 (a mixture of two components) involves the restriction π_1 = 1, which is on the boundary of the parameter space. Note that when π_1 = 1, the parameters of the second component can take on any value without affecting the density. Usual methods such as the likelihood ratio test are not applicable when parameters are on the boundary under the null hypothesis. Information criteria means of choosing the model (see below) are valid.

The following results are for a mixture of 2 NB-II models, for the OBDV data, which you can replicate using this program.

OBDV
******************************************************
Mixed Negative Binomial model, MEPS 1996 full data set

MLE Estimation Results
BFGS convergence: Normal convergence

Average Log-L: -2.164783
Observations: 4564

(estimates, standard errors, t-statistics and p-values for the two latent classes, each with constant, pub. ins., priv. ins., sex, age, edu, inc and alpha, plus the mixing parameter Mix)

Information Criteria
CAIC : 19920.3807   Avg. CAIC: 4.3647
BIC  : 19903.3807   Avg. BIC:  4.3610
AIC  : 19794.1395   Avg. AIC:  4.3370
******************************************************

It is worth noting that the mixture parameter is not significantly different from zero, but also note that the coefficients of public insurance and age, for example, differ quite a bit between the two latent classes.

Information criteria

As seen above, a Poisson model can't be tested (using standard methods) as a restriction of a negative binomial model. But it seems, based upon the values of the likelihood functions and the fact that the NB model fits the variance much better, that the NB model is more appropriate. How can we determine which of a set of competing models is the best?

The information criteria approach is one possibility. Information criteria are functions of the log-likelihood, with a penalty for the number of parameters used. Three popular information criteria are the Akaike (AIC), Bayes (BIC) and consistent Akaike (CAIC). The formulae are

CAIC = -2 ln L(θ̂) + k (ln n + 1)
BIC  = -2 ln L(θ̂) + k ln n
AIC  = -2 ln L(θ̂) + 2k
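As a quick check on the formulae, here is a minimal Octave sketch (illustrative only, not part of the course software) that reproduces the criteria for the NB-II output reported above:

% Information criteria for the NB-II results above:
% Average Log-L = -2.184962, n = 4564 observations, k = 8 parameters.
n = 4564; k = 8;
logL = -2.184962 * n;          % maximized log-likelihood
AIC  = -2*logL + 2*k           % approx. 19960.3
BIC  = -2*logL + k*log(n)      % approx. 20011.7
CAIC = -2*logL + k*(log(n)+1)  % approx. 20019.7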

It can be shown that the CAIC and BIC will select the correctly specified model from a group of models, asymptotically. This doesn't mean, of course, that the correct model is necessarily in the group. The AIC is not consistent, and will asymptotically favor an over-parameterized model over the correctly specified model. Here are information criteria values for the models we've seen, for OBDV.

Table 16.3: Information Criteria, OBDV
Model      AIC     BIC     CAIC
Poisson    7.345   7.355   7.357
NB-I       4.375   4.386   4.388
NB-II      4.373   4.385   4.386
MNB-II     4.337   4.361   4.365

Pretty clearly, the NB models are better than the Poisson. The one additional parameter gives a very significant improvement in the likelihood function value. Between the NB-I and NB-II models, the NB-II is slightly favored. But one should remember that information criteria values are statistics,

with variances. With another sample, it may well be that the NB-I model would be favored, since the differences are so small. The MNB-II model is favored over the others, by all 3 information criteria.

Why is all of this in the chapter on QML? Let's suppose that the correct model for OBDV is in fact the NB-II model. It turns out in this case that the Poisson model will give consistent estimates of the slope parameters (if a model is a member of the linear-exponential family and the conditional mean is correctly specified, then the parameters of the conditional mean will be consistently estimated). So the Poisson estimator would be a QML estimator that is consistent for some parameters of the true model. The ordinary OPG or inverse Hessian "ML" covariance estimators are however biased and inconsistent, since the information matrix equality does not hold for QML estimators. But for i.i.d. data (which is the case for the MEPS data) the QML asymptotic covariance can be consistently estimated, as discussed above, using the sandwich form for the ML estimator. mle_results in fact reports

sandwich results, so the Poisson estimation results would be reliable for inference even if the true model is the NB-I or NB-II. Note that they are in fact similar to the results for the NB models. However, if we assume that the correct model is the MNB-II model, as is favored by the information criteria, then both the Poisson and NB-x models will have misspecified mean functions, so the parameters that influence the means would be estimated with bias and inconsistently.
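To make the sandwich form concrete, here is a rough Octave sketch (the input names S and H are hypothetical, assumed to be produced by the user's own likelihood code; this is not the mle_results implementation):

% S : n x K matrix of score contributions, row t = d ln f(y_t|x_t,theta)/d theta'
% H : K x K Hessian of the total log-likelihood
% both evaluated at the (quasi-)ML estimate theta_hat
Hinv   = inv(H);
V_ml   = -Hinv;                    % usual inverse-Hessian "ML" covariance
V_opg  = inv(S' * S);              % outer-product-of-gradient covariance
V_qml  = Hinv * (S' * S) * Hinv;   % sandwich form, valid under quasi-ML
se_qml = sqrt(diag(V_qml));        % robust standard errors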

16.3 Exercises

1. Considering the MEPS data (the description is in Section 11.4), for the OBDV (y) measure, let η be a latent index of health status that has expectation equal to unity.[1] We suspect that η and PRIV may be correlated, but we assume that η is uncorrelated with the other regressors. We assume that

E(y | PUB, PRIV, AGE, EDUC, INC, η) = exp(β1 + β2 PUB + β3 PRIV + β4 AGE + β5 EDUC + β6 INC) η.

We use the Poisson QML estimator of the model

y ~ Poisson(λ)
λ = exp(β1 + β2 PUB + β3 PRIV + β4 AGE + β5 EDUC + β6 INC).   (16.2)

Since much previous evidence indicates that health care services usage is overdispersed,[2] this is almost certainly not an ML estimator, and thus is not efficient. However, when η and PRIV are uncorrelated, this estimator is consistent for the βi parameters, since the conditional mean is correctly specified in that case. When η and PRIV are correlated, Mullahy's (1997) NLIV estimator that uses the residual function

ε = y/λ − 1,

where λ is defined in equation 16.2, with appropriate instruments, is consistent. As instruments we use all the exogenous regressors, as well as the cross products of PUB with the variables in Z = {AGE, EDUC, INC}. That is, the full set of instruments is

W = {1  PUB  Z  PUB × Z}.

(a) Calculate the Poisson QML estimates.
(b) Calculate the generalized IV estimates (do it using a GMM formulation; see the portfolio example for hints how to do this).
(c) Calculate the Hausman test statistic to test the exogeneity of PRIV.
(d) Comment on the results.

[1] A restriction of this sort is necessary for identification.
[2] Overdispersion exists when the conditional variance is greater than the conditional mean. If this is the case, the Poisson specification is not correct.

Chapter 17
Nonlinear least squares (NLS)

Readings: Davidson and MacKinnon, Ch. 2∗ and 5∗; Gallant, Ch. 1.

17.1 Introduction and definition

Nonlinear least squares (NLS) is a means of estimating the parameter of the model

y_t = f(x_t, θ0) + ε_t,   ε_t ~ iid(0, σ²).

• In general, ε_t will be heteroscedastic and autocorrelated, and possibly nonnormally distributed. However, dealing with this is exactly as in the case of linear models, so we'll just treat the iid case here.

If we stack the observations vertically, defining

y = (y_1, y_2, ..., y_n)'
f = (f(x_1, θ), f(x_2, θ), ..., f(x_n, θ))'

and ε = (ε_1, ε_2, ..., ε_n)', we can write the n observations as

y = f(θ) + ε.

Using this notation, the NLS estimator can be defined as

θ̂ ≡ arg min_{θ∈Θ} s_n(θ) = (1/n) [y − f(θ)]' [y − f(θ)] = (1/n) ||y − f(θ)||².

• The estimator minimizes the sum of squared errors, which is the same as minimizing the Euclidean distance between y and f(θ).

The objective function can be written as

s_n(θ) = (1/n) [ y'y − 2 y'f(θ) + f(θ)'f(θ) ],

which gives the first order conditions

− [∂f(θ̂)/∂θ]' y + [∂f(θ̂)/∂θ]' f(θ̂) ≡ 0.

Define the n × K matrix

F(θ̂) ≡ D_θ' f(θ̂).   (17.1)

In shorthand, use F̂ in place of F(θ̂). Using this, the first order conditions can be written as

− F̂' y + F̂' f(θ̂) ≡ 0,

or

F̂' [ y − f(θ̂) ] ≡ 0.   (17.2)

This bears a good deal of similarity to the f.o.c. for the linear model: the derivative of the prediction is orthogonal to the prediction error. If f(θ) = Xθ, then F̂ is simply X, so the f.o.c. (with spherical errors) simplify to the usual OLS f.o.c.

X'y − X'Xβ = 0.

We can interpret this geometrically: INSERT drawings of geometrical depiction of OLS and NLS (see Davidson and MacKinnon, pgs. 8, 13 and 46).

• Note that the nonlinearity of the manifold leads to potential multiple local maxima, minima and saddlepoints: the objective function s_n(θ) is not necessarily well-behaved and may be difficult to minimize.
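For a concrete picture of what the estimator does, here is a minimal Octave sketch that minimizes s_n(θ) directly with a derivative-free optimizer; the regression function and the data vectors y and x are hypothetical placeholders, not part of the original notes:

% NLS for the (hypothetical) regression function f(x, theta) = theta1 + theta2 * x^theta3
f  = @(x, theta) theta(1) + theta(2) * x.^theta(3);   % assumes x > 0
sn = @(theta) mean((y - f(x, theta)).^2);             % the NLS objective
theta_hat = fminsearch(sn, [1; 1; 1]);                % may only find a local minimum

Since s_n(θ) need not be well-behaved, it can pay to try several starting values.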

17.2 Identification

As before, identification can be considered conditional on the sample, and asymptotically. The condition for asymptotic identification is that s_n(θ) tend to a limiting function s_∞(θ) such that s_∞(θ0) < s_∞(θ), ∀θ ≠ θ0. This will be the case if s_∞(θ0) is strictly convex at θ0, which requires that D²_θ s_∞(θ0) be positive definite. Consider the objective function:

s_n(θ) = (1/n) Σ_{t=1}^n [ y_t − f(x_t, θ) ]²
       = (1/n) Σ_{t=1}^n [ f_t(x_t, θ0) + ε_t − f_t(x_t, θ) ]²
       = (1/n) Σ_{t=1}^n [ f_t(θ0) − f_t(θ) ]² + (1/n) Σ_{t=1}^n ε_t² − (2/n) Σ_{t=1}^n [ f_t(θ0) − f_t(θ) ] ε_t

• As in example 12.4, which illustrated the consistency of extremum estimators using OLS, we'll assume a pointwise law of large numbers applies, so

(1/n) Σ_{t=1}^n [ f_t(θ0) − f_t(θ) ]²  →  ∫ [ f(z, θ0) − f(z, θ) ]² dμ(z),   a.s.,   (17.3)

where μ(x) is the distribution function of x.
• Next, applying a LLN, we conclude that the second term will converge to a constant which does not depend upon θ.
• A LLN can be applied to the third term to conclude that it converges pointwise to 0, as long as f(θ) and ε are uncorrelated.
• Turning to the first term, pointwise convergence needs to be strengthened to uniform almost sure convergence. There are a number of possible assumptions one could use; here, we'll just assume it holds. In many cases, f(x, θ) will be bounded and continuous, so strengthening to uniform almost sure convergence is immediate. For example if f(x, θ) = [1 + exp(−xθ)]^{-1}, f : R^K → (0, 1), a bounded range, and the function is continuous in θ.

Given these results, it is clear that a minimizer is θ0. When considering identification (asymptotic), the question is whether or not there may be some other minimizer. A local condition for identification is that

∂²/∂θ∂θ' s_∞(θ) = ∂²/∂θ∂θ' ∫ [ f(x, θ0) − f(x, θ) ]² dμ(x)

be positive definite at θ0. Evaluating this derivative, we obtain (after a little work)

∂²/∂θ∂θ' ∫ [ f(x, θ0) − f(x, θ) ]² dμ(x) |_{θ0} = 2 ∫ [D_θ f(z, θ0)]' [D_θ f(z, θ0)] dμ(z),

the expectation of the outer product of the gradient of the regression function evaluated at θ0. (Note: the uniform boundedness we have already assumed allows passing the derivative through the integral, by the dominated convergence theorem.) Note that the LLN implies that the above expectation is equal to

J_∞(θ0) = 2 lim E ( F'F / n ).

This matrix will be positive definite (wp1) as long as the gradient vector is of full rank (wp1). The tangent space to the regression manifold must span a K-dimensional space if we are to consistently estimate a K-dimensional parameter vector. This is a necessary condition for identification, and is analogous to the requirement that there be no perfect collinearity in a linear model.

17.3 Consistency

We simply assume that the conditions of Theorem 28 hold, so the estimator is consistent. Given that the strong stochastic equicontinuity conditions hold, as discussed above, and given the above identifi-

cation conditions and a compact estimation space (the closure of the parameter space Θ), the consistency proof's assumptions are satisfied.

17.4 Asymptotic normality

As in the case of GMM, we also simply assume that the conditions for asymptotic normality as in Theorem 30 hold. The only remaining problem is to determine the form of the asymptotic variance-covariance matrix. Recall that the result of the asymptotic normality theorem is

√n (θ̂ − θ0) →d N( 0, J_∞(θ0)^{-1} I_∞(θ0) J_∞(θ0)^{-1} ),

where J_∞(θ0) is the almost sure limit of ∂²/∂θ∂θ' s_n(θ) evaluated at θ0, and

I_∞(θ0) = lim Var √n D_θ s_n(θ0).

The objective function is

s_n(θ) = (1/n) Σ_{t=1}^n [ y_t − f(x_t, θ) ]²,

so

D_θ s_n(θ) = −(2/n) Σ_{t=1}^n [ y_t − f(x_t, θ) ] D_θ f(x_t, θ).

Evaluating at θ0,

D_θ s_n(θ0) = −(2/n) Σ_{t=1}^n ε_t D_θ f(x_t, θ0).

Note that the expectation of this is zero, since ε_t and x_t are assumed to be uncorrelated. So to calculate the variance, we can simply calculate the second moment about zero. Also note that

Σ_{t=1}^n ε_t D_θ f(x_t, θ0) = [ ∂f(θ0)/∂θ ]' ε = F'ε.

With this we obtain

I_∞(θ0) = lim Var √n D_θ s_n(θ0)
        = lim n E (4/n²) F'ε ε'F
        = 4σ² lim E ( F'F / n ),

where the expectation is with respect to the joint density of x and ε. We've already seen that

J_∞(θ0) = 2 lim E ( F'F / n ).

Combining these expressions for J_∞(θ0) and I_∞(θ0), and the result of the asymptotic normality theorem, we get

√n (θ̂ − θ0) →d N( 0, [ lim E (F'F/n) ]^{-1} σ² ).

We can consistently estimate the variance-covariance matrix using

( F̂'F̂ / n )^{-1} σ̂²,   (17.4)

where F̂ is defined as in equation 17.1 and

σ̂² = [ y − f(θ̂) ]' [ y − f(θ̂) ] / n,

the obvious estimator. Note the close correspondence to the results for the linear model.

17.5 Example: The Poisson model for count data

Suppose that y_t conditional on x_t is independently distributed Poisson. A Poisson random variable is a count data variable, which means it can take the values {0, 1, 2, ...}. This sort of model has been used to study visits to doctors per year, number of patents registered by businesses per year, etc. The Poisson density is

f(y_t) = exp(−λ_t) λ_t^{y_t} / y_t!,   y_t ∈ {0, 1, 2, ...}.

The mean of y_t is λ_t, as is the variance. Note that λ_t must be positive. Suppose that the true mean is

λ_t^0 = exp(x_t'β0),

which enforces the positivity of λ_t. Suppose we estimate β0 by nonlinear least squares:

β̂ = arg min s_n(β) = (1/n) Σ_{t=1}^n ( y_t − exp(x_t'β) )².

We can write

s_n(β) = (1/n) Σ_{t=1}^n ( exp(x_t'β0) + ε_t − exp(x_t'β) )²
       = (1/n) Σ_{t=1}^n ( exp(x_t'β0) − exp(x_t'β) )² + (1/n) Σ_{t=1}^n ε_t² + (2/n) Σ_{t=1}^n ε_t ( exp(x_t'β0) − exp(x_t'β) ).

The last term has expectation zero since the assumption that E(y_t|x_t) = exp(x_t'β0) implies that E(ε_t|x_t) = 0, which in turn implies that functions of x_t are uncorrelated with ε_t. Applying a strong LLN, and noting that the objective function is continuous on a compact parameter space, we get

s_∞(β) = E_x ( exp(x'β0) − exp(x'β) )² + E_x exp(x'β0),

where the last term comes from the fact that the conditional variance of ε is the same as the variance of y. This function is clearly minimized at β = β0, so the NLS estimator is consistent as long as identification holds.

Exercise 50. Determine the limiting distribution of √n (β̂ − β0). This means finding the specific forms of ∂²/∂β∂β' s_n(β), ∂s_n(β)/∂β, J(β0), and I(β0). Again, use a CLT as needed, no need to verify that it can be applied.
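The consistency claim is easy to illustrate by simulation. Here is a minimal Octave sketch (all names are illustrative; poissrnd comes from the Octave statistics package):

% Simulate count data with conditional mean exp(x*beta0) and estimate beta0 by NLS
n = 10000;
beta0 = [0.5; -1.0];
x = [ones(n,1), rand(n,1)];            % a constant and a uniform regressor
y = poissrnd(exp(x * beta0));          % Poisson draws (statistics package)
sn = @(b) mean((y - exp(x * b)).^2);   % NLS objective
beta_hat = fminsearch(sn, zeros(2,1))  % close to beta0 when n is large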

17.6 The Gauss-Newton algorithm

Readings: Davidson and MacKinnon, Chapter 6, pgs. 201-207∗.

The Gauss-Newton optimization technique is specifically designed for nonlinear least squares. The idea is to linearize the nonlinear model, rather than the objective function. The model is

y = f(θ0) + ε.

At some θ in the parameter space, not equal to θ0, we have

y = f(θ) + ν,

where ν is a combination of the fundamental error term ε and the error due to evaluating the regression function at θ rather than the true value θ0. Take a first order Taylor's series approximation around a point θ1:

y = f(θ1) + D_θ' f(θ1) (θ − θ1) + ν + approximation error.

Define z ≡ y − f(θ1) and b ≡ (θ − θ1). Then the last equation can be written as

z = F(θ1) b + ω,

where, as above, F(θ1) ≡ D_θ' f(θ1) is the n × K matrix of derivatives of the regression function, evaluated at θ1, and ω is ν plus approximation error from the truncated Taylor's series.

• Note that F is known, given θ1.
• Note that one could estimate b simply by performing OLS on the above equation.
• Given b̂, we calculate a new round estimate of θ0 as θ2 = b̂ + θ1. With this, take a new Taylor's series expansion around θ2 and repeat the process. Stop when b̂ = 0 (to within a specified tolerance).
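A bare-bones Octave sketch of this iteration (f and F are hypothetical user-supplied functions returning the n-vector of fitted values and the n × K derivative matrix; y, theta_start, tol and maxiter are assumed given):

theta = theta_start;
for iter = 1:maxiter
  z = y - f(theta);                  % current residuals
  Fmat = F(theta);                   % this round's "regressor matrix"
  b = (Fmat' * Fmat) \ (Fmat' * z);  % OLS of z on F(theta)
  theta = theta + b;                 % new round estimate
  if norm(b) < tol                   % stop when the step is essentially zero
    break;
  end
end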

To see why this might work, consider the above approximation, but evaluated at the NLS estimator:

y = f(θ̂) + F(θ̂) (θ − θ̂) + ω.

The OLS estimate of b ≡ θ − θ̂ is

b̂ = ( F̂'F̂ )^{-1} F̂' [ y − f(θ̂) ].

This must be zero, since

F̂' [ y − f(θ̂) ] ≡ 0

by definition of the NLS estimator (these are the normal equations as in equation 17.2). Since b̂ ≡ 0 when we evaluate at θ̂, updating would stop.

• The varcov estimator, as in equation 17.4, is simple to calculate, since we have F̂ as a by-product of the estimation process (i.e., it's just the last round "regressor matrix"). In fact, a normal OLS program will give the NLS varcov estimator directly, since it's just the OLS varcov estimator from the last iteration.
• The Gauss-Newton method doesn't require second derivatives, as does the Newton-Raphson method, so it's faster.
• The method can suffer from convergence problems since F(θ)'F(θ) may be very nearly singular, even with an asymptotically identified model, especially if θ is very far from θ̂. Consider the example

y = β1 + β2 x_t^{β3} + ε_t.

When evaluated at β2 ≈ 0, β3 has virtually no effect on the NLS objective function, so F will have rank that is "essentially" 2, rather than 3. In this case, F'F will be nearly singular, so (F'F)^{-1} will be subject to large roundoff errors.

17.7 Application: Limited dependent variables and sample selection

Readings: Davidson and MacKinnon, Ch. 15∗ (a quick reading is sufficient), not required for reading. J. Heckman, "Sample Selection Bias as a Specification Error", Econometrica, 1979 (This is a classic article, which is however a bit out-dated. Nevertheless it's a good place to start if you encounter sample selection problems in your research).

Sample selection is a common problem in applied research. The problem occurs when observations used in estimation are sampled non-randomly, according to some selection scheme.

Example: Labor Supply

Labor supply of a person is a positive number of hours per unit time, supposing the offer wage is higher than the reservation wage, which is the wage at which the person prefers not to work. The model (very simple, with t subscripts suppressed):

• Characteristics of individual: x
• Latent labor supply: s* = x'β + ω
• Offer wage: w^o = z'γ + ν
• Reservation wage: w^r = q'δ + η

Write the wage differential as

w* = (z'γ + ν) − (q'δ + η)
   ≡ r'θ + ε.

We have the set of equations

s* = x'β + ω
w* = r'θ + ε.

We assume that the offer wage and the reservation wage, as well as the latent variable s*, are unobservable. What is observed is

w = 1[w* > 0]
s = w s*.

In other words, we observe whether or not a person is working. If the person is working, we observe labor supply, which is equal to latent labor supply, s*. Otherwise, s = 0 ≠ s*. Assume that

(ω, ε)' ~ N( (0, 0)', [σ²  ρσ; ρσ  1] ).

Note that we are using a

simplifying assumption that individuals can freely choose their weekly hours of work. Suppose we estimated the model

s* = x'β + residual

using only observations for which s > 0. The problem is that these observations are those for which w* > 0, or equivalently, −ε < r'θ, and

E[ω | −ε < r'θ] ≠ 0,

since ε and ω are dependent. Furthermore, this expectation will in general depend on x, since elements of x can enter in r. Because of these two facts, least squares estimation is biased and inconsistent.

Consider more carefully E[ω | −ε < r'θ]. Given the joint normality of ω and ε, we can write (see for example Spanos, Statistical Founda-

tions of Econometric Modelling, pg. 122)

ω = ρσε + η,

where η has mean zero and is independent of ε. With this we can write

s* = x'β + ρσε + η.

If we condition this equation on −ε < r'θ we get

s = x'β + ρσ E(ε | −ε < r'θ) + η,

which may be written as

s = x'β + ρσ E(ε | ε > −r'θ) + η.

• A useful result is that for z ~ N(0, 1),

E(z | z > z*) = φ(z*) / Φ(−z*),

where φ(·) and Φ(·) are the standard normal density and distribution function, respectively. The quantity on the RHS above is known as the inverse Mill's ratio:

IMR(z*) = φ(z*) / Φ(−z*).

With this we can write (making use of the fact that the standard normal density is symmetric about zero, so that φ(−a) = φ(a)):

s = x'β + ρσ φ(r'θ)/Φ(r'θ) + η   (17.5)
  ≡ [ x'  φ(r'θ)/Φ(r'θ) ] (β', ζ)' + η,   (17.6)

where ζ = ρσ. The error term η has conditional mean zero, and is uncorrelated with the regressors x and φ(r'θ)/Φ(r'θ). At this point, we can estimate the equation by NLS.

• Heckman showed how one can estimate this in a two step procedure where first θ is estimated, then equation 17.6 is estimated by least squares using the estimated value of θ to form the regressors. This is inefficient and estimation of the covariance is a tricky issue. It is probably easier (and more efficient) just to do MLE.
• The model presented above depends strongly on joint normality. There exist many alternative models which weaken the maintained assumptions. It is possible to estimate consistently without distributional assumptions. See Ahn and Powell, Journal of Econometrics, 1994.
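To make the correction term concrete, here is a small Octave sketch of the regressor φ(r'θ)/Φ(r'θ) used in equation 17.6, written with core Octave functions only. The first-step estimate theta_hat and the arrays r, x and s for the working subsample are assumed to come from the user's own code; this illustrates only the second-step regressor, not a complete Heckman routine:

stdnpdf = @(z) exp(-0.5 * z.^2) / sqrt(2*pi);   % standard normal density
stdncdf = @(z) 0.5 * erfc(-z / sqrt(2));        % standard normal cdf
mills   = @(z) stdnpdf(z) ./ stdncdf(z);        % phi(z)/Phi(z)
% second-step regressors for the observations with s > 0:
X2 = [x, mills(r * theta_hat)];                 % then regress s on X2 by least squares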

Chapter 18
Nonparametric inference

18.1 Possible pitfalls of parametric inference: estimation

Readings: H. White (1980) "Using Least Squares to Approximate Unknown Regression Functions," International Economic Review, pp. 149-70.

In this section we consider a simple example, which illustrates both why nonparametric methods may in some cases be preferred to parametric methods. We suppose that data is generated by random sampling of (y, x), where y = f(x) + ε, x is uniformly distributed on (0, 2π), and ε is a classical error. Suppose that

f(x) = 1 + 3x/(2π) − (x/(2π))².

The problem of interest is to estimate the elasticity of f(x) with respect to x, throughout the range of x. In general, the functional form of f(x) is unknown. One idea is to take a Taylor's series approximation to f(x) about some point x0. Flexible functional forms such as the transcendental logarithmic (usually known as the translog) can be interpreted as second order Taylor's series approximations. We'll work with a first order approximation,

for simplicity. Approximating about x0:

h(x) = f(x0) + D_x f(x0) (x − x0).

If the approximation point is x0 = 0, we can write

h(x) = a + bx.

The coefficient a is the value of the function at x = 0, and the slope is the value of the derivative at x = 0. These are of course not known. One might try estimation by ordinary least squares. The objective function is

s(a, b) = (1/n) Σ_{t=1}^n ( y_t − h(x_t) )².

The limiting objective function, following the argument we used to

get equations 12.1 and 17.3, is

s_∞(a, b) = ∫_0^{2π} ( f(x) − h(x) )² dx.

The theorem regarding the consistency of extremum estimators (Theorem 28) tells us that â and b̂ will converge almost surely to the values that minimize the limiting objective function. Solving the first order conditions[1] reveals that s_∞(a, b) obtains its minimum at {a0 = 7/6, b0 = 1/π}. The estimated approximating function h(x) therefore tends almost surely to

h_∞(x) = 7/6 + x/π.

In Figure 18.1 we see the true function and the limit of the approximation, to see the asymptotic bias as a function of x.

[1] The results were obtained using the free computer algebra system (CAS) Maxima. Unfortunately, I have lost the source code to get the results :-(
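The limit is also easy to check numerically. A small Octave sketch (entirely illustrative) that simulates a large sample and compares the OLS coefficients to 7/6 and 1/π:

n = 100000;
x = 2*pi*rand(n,1);                    % x ~ U(0, 2*pi)
fx = 1 + 3*x/(2*pi) - (x/(2*pi)).^2;   % the true regression function
y = fx + randn(n,1);                   % add a classical error
ab = [ones(n,1), x] \ y;               % OLS estimates of (a, b)
[ab, [7/6; 1/pi]]                      % compare to the almost sure limits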

Figure 18.1: True and simple approximating functions

(The approximating model is the straight line, the true model has curvature.) Note that the approximating model is in general inconsistent, even at the approximation point. This shows that "flexible functional forms" based upon Taylor's series approximations do not in general lead to consistent estimation of functions. The approximating model seems to fit the true model fairly well, asymptotically. However, we are interested in the elasticity of the function. Recall that an elasticity is the marginal function divided by the average function:

ε(x) = x φ'(x) / φ(x).

Good approximation of the elasticity over the range of x will require a good approximation of both f(x) and f'(x) over the range of x. The approximating elasticity is

η(x) = x h'(x) / h(x).

Figure 18.2: True and approximating elasticities

In Figure 18.2 we see the true elasticity and the elasticity obtained from the limiting approximating model. The true elasticity is the line that has negative slope for large x. Visually we see that the elasticity is not approximated so well. Root mean squared error in the approximation of the elasticity is

[ ∫_0^{2π} ( ε(x) − η(x) )² dx ]^{1/2} = 0.31546.

Now suppose we use the leading terms of a trigonometric series as the approximating model. The reason for using a trigonometric series as an approximating model is motivated by the asymptotic properties of the Fourier flexible functional form (Gallant, 1981, 1982), which we will study in more detail below. Normally with this type of model the number of basis functions is an increasing function of the sample size. Here we hold the set of basis functions fixed. We will consider the asymptotic behavior of a fixed model, which we interpret as an approximation to the estimator's behavior in finite samples. Consider the set of basis functions:

Z(x) = [ 1  x  cos(x)  sin(x)  cos(2x)  sin(2x) ].

The approximating model is

g_K(x) = Z(x)α.

Maintaining these basis functions as the sample size increases, we find that the limiting objective function is minimized at

{ a1 = 7/6,  a2 = 1/π,  a3 = −1/π²,  a4 = 0,  a5 = −1/(4π²),  a6 = 0 }.

Substituting these values into g_K(x) we obtain the almost sure limit of the approximation

g_∞(x) = 7/6 + x/π + (cos x)(−1/π²) + (sin x)·0 + (cos 2x)(−1/(4π²)) + (sin 2x)·0.   (18.1)

In Figure 18.3 we have the approximation and the true function: Clearly the truncated trigonometric series model offers a better approximation, asymptotically, than does the linear model. In Figure 18.4 we have the more flexible approximation's elasticity and that of the true function: On average, the fit is better, though there is some implausible wavyness in the estimate.
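Again, the limiting coefficients can be verified numerically. The following Octave sketch (illustrative only) projects the true function on the six basis functions over a fine grid, which approximates the integrals defining s_∞:

x = linspace(0, 2*pi, 100000)';                 % fine grid on (0, 2*pi)
fx = 1 + 3*x/(2*pi) - (x/(2*pi)).^2;            % true function
Z = [ones(size(x)), x, cos(x), sin(x), cos(2*x), sin(2*x)];
a = Z \ fx;                                     % least squares projection on the basis
[a, [7/6; 1/pi; -1/pi^2; 0; -1/(4*pi^2); 0]]    % compare to the stated limits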

Figure 18.3: True function and more flexible approximation

Figure 18.4: True elasticity and more flexible approximation

Root mean squared error in the approximation of the elasticity is

[ ∫_0^{2π} ( ε(x) − x g_∞'(x)/g_∞(x) )² dx ]^{1/2} = 0.16213,

about half that of the RMSE when the first order approximation is used. If the trigonometric series contained infinite terms, this error measure would be driven to zero, as we shall see.

18.2 Possible pitfalls of parametric inference: hypothesis testing

What do we mean by the term "nonparametric inference"? Simply, this means inferences that are possible without restricting the functions of interest to belong to a parametric family.

• Consider means of testing for the hypothesis that consumers

maximize utility. A consequence of utility maximization is that the Slutsky matrix D²_p h(p, U), where h(p, U) are the compensated demand functions, must be negative semi-definite. One approach to testing for utility maximization would estimate a set of normal demand functions x(p, m).

• Estimation of these functions by normal parametric methods requires specification of the functional form of demand, for example

x(p, m) = x(p, m, θ0) + ε,   θ0 ∈ Θ0,

where x(p, m, θ0) is a function of known form and Θ0 is a finite dimensional parameter.
• After estimation, we could use x̂ = x(p, m, θ̂) to calculate (by solving the integrability problem, which is non-trivial) D²_p ĥ(p, U). If we can statistically reject that the matrix is negative semi-definite, we might conclude that consumers don't maximize util-

ity.

• The problem with this is that the reason for rejection of the theoretical proposition may be that our choice of functional form is incorrect. In the introductory section we saw that functional form misspecification leads to inconsistent estimation of the function and its derivatives.
• Testing using parametric models always means we are testing a compound hypothesis. The hypothesis that is tested is 1) the economic proposition we wish to test, and 2) the model is correctly specified. Failure of either 1) or 2) can lead to rejection (as can a Type-I error, even when 2) holds). This is known as the "model-induced augmenting hypothesis."
• Varian's WARP allows one to test for utility maximization without specifying the form of the demand functions. The only assumptions used in the test are those directly implied by theory,

so rejection of the hypothesis calls into question the theory. The cost of nonparametric methods is usually an increase in complexity and a loss of power, compared to what one would get using a well-specified parametric model. The benefit is robustness against possible misspecification.

• Nonparametric inference also allows direct testing of economic propositions, avoiding the "model-induced augmenting hypothesis".

18.3 Estimation of regression functions

The Fourier functional form

Readings: Gallant, 1987, "Identification and consistency in seminonparametric regression," in Advances in Econometrics, Fifth World Congress, V. 1, Truman Bewley, ed., Cambridge.

The Fourier form. where f (x) is of unknown form and x is a P −dimensional vector. but with a somewhat different parameterization. For simplicity. assume that ε is a classical error.2) . may be written as A J xi ∂f (x) . f (x) ∂xif (x) gK (x | θK ) = α+x β+1/2x Cx+ α=1 j=1 (ujα cos(jkαx) − vjα sin(jkαx)) . following Gallant (1982). (18. Let us take the estimation of the vector of elasticities with typical element ξxi = at an arbitrary point xi.Suppose we have a multivariate model y = f (x) + ε.

divide by the maxima of the conditioning variables. which is desirable since economic functions aren’t periodic. .. 2. (18. β . and multiply by 2π − eps. This is required to avoid periodic behavior of the approximation. positive and zero). A are required to be linearly independent.. α = 1. . For example.where the K-dimensional parameter vector θK = {α. uJA. u11. v11.. and we follow the convention that the ﬁrst non-zero element be posi- . • The kα are ”elementary multi-indices” which are simply P − vectors formed of integers (negative. . subtract sample means. where eps is some positive number less than 2π in value. vec∗(C) . . vJA} . The kα.3) • We assume that the conditioning variables x have each been transformed to lie in an interval that is shorter than 2π. .

• We parameterize the matrix C differently than does Gallant because it simpliﬁes things in practice. For example 0 1 −1 0 1 is a potential multi-index to be used.tive. since it is a scalar multiple of the original multi-index. Nor is 0 2 −2 0 2 a multi-index we would use. . but 0 −1 −1 0 1 is not since its ﬁrst nonzero element is negative. The cost of this is that we are no longer able to test a quadratic speciﬁcation using nested testing.

4) and the matrix of second partial derivatives is A 2 DxgK (x|θK ) = C + α=1 j=1 J (−ujα cos(jkαx) + vjα sin(jkαx)) j 2kαkα (18. let λ be an N -dimensional multi-index with no negative elements. If we have N arguments x of the (arbitrary) function h(x).5) To deﬁne a compact notation for partial derivatives. Deﬁne | λ |∗ as the sum of the elements of λ.The vector of ﬁrst partial derivatives is A J DxgK (x | θK ) = β + Cx + α=1 j=1 [(−ujα sin(jkαx) − vjα cos(jkαx)) jkα] (18. use Dλh(x) to indicate a certain partial deriva- .

we see that it is possible to deﬁne (1 × K) vector Z λ(x) so that DλgK (x|θK ) = zλ(x) θK .6) • Both the approximating model and the derivatives of the approximating model are linear in the parameters. (18. . Dλh(x) ≡ h(x). The following theorem can be used to prove the consistency of the Fourier form.tive: Dλh(x) ≡ ∂ |λ|∗ λ · · · ∂xNN ∂xλ1 ∂xλ2 2 1 h(x) When λ is the zero vector. Taking this deﬁnition and the last few equations into account. write gK (x|θK ) = z θK for simplicity. • For the approximating model to the function (not derivatives).

is a dense subset of the closure and HK ⊂ HK+1. 2.ˆ Theorem 51. h such that (c) Uniform convergence: There is a point h∗ in H and there is a function s∞(h. (b) Denseness: ∪K HK . h∗) must have h − h∗ = 0. K = 1.. [Gallant and Nychka. . h∗) ≥ s∞(h∗. 3. h is compact h . h∗) |= 0 H almost surely. (d) Identiﬁcation: Any point h in the closure of H with s∞(h.. 1987] Suppose that hn is obtained by maximizing a sample objective function sn(h) over HKn where HK is a subset of some function space H on which is deﬁned a norm Consider the following conditions: (a) Compactness: The closure of H with respect to in the relative topology deﬁned by of H with respect to h h . h∗) that is continuous in h with respect to n→∞ lim sup | sn(h) − s∞(h. .

A generic norm h is used in place of the Euclidean norm.r. The main differences are: 1. Typically we will want to make sure that the norm is strong enough to imply convergence of all functions of interest. It plays the role . This theorem is very similar in form to Theorem 28.Under these conditions limn→∞ that limn→∞ Kn = ∞ almost surely. The “estimation space” H is a function space. h implies convergence w. so that convergence with respect to Euclidean norm.t the This norm may be stronger than the Euclidean norm. 2. provided The modiﬁcation of the original statement of the theorem that has been made is to set the parameter space Θ in Gallant and Nychka’s (1987) Theorem 0 to a single point and to state the theorem in terms of maximization rather than minimization. ˆ h∗ − hn = 0 almost surely.

There is no restriction to a parametric family. we need to make explicit what norm we wish to use. There is a denseness assumption that was not present in the other theorem. We need a norm that guarantees that the errors in approximation of the functions we are interested in are accounted for. in relation to the Fourier form as the approximating model. only a restriction to a space of functions that satisfy certain conditions. Sobolev norm Since all of the assumptions involve the norm h .of the parameter space Θ in our discussion of parametric estimators. This formulation is much less restrictive than the restriction to a parametric family. see Gallant. We will not prove this theorem (the proof is quite similar to the proof of theorem [28]. Since we are interested in ﬁrst- . 3. 1987) but we will discuss its assumptions.

The Sobolev norm is appropriate in this case. making use of our notation for partial derivatives. We see that this norm takes into account errors in approximating the function and partial derivatives up to order m. It is deﬁned. as: h m. we would evaluate f (x) − gK (x | θK ) m. .X = |λ |≤m X max sup Dλh(x) ∗ To see whether or not the function f (x) is well approximated by an approximating model gK (x | θK ). we need close approximation of both the function f (x) and its ﬁrst derivative f (x). the relevant m would be m = 1. If we want to estimate ﬁrst order elasticities. throughout the range of x.X . since we examine the sup over X .order elasticities in the present case. Let X be an open set that contains all values of x that we’re interested in. as is the case in this example. Furthermore.

so that we obtain consistent estimates for all values of x.r. the Sobolev means uniform convergence. Econometrica.X < D}.t. It is proven by Elbadawi. The basic requirement is that if we need consistency w.r. h m.X . . 1983. Compactness Verifying compactness with respect to this norm is quite technical and unenlightening. where D is a ﬁnite constant. Gallant and Souza.t. In plain words. A Sobolev space is the set of functions Wm. the functions must have bounded partial derivatives of one order higher than the derivatives we seek to estimate.convergence w.X (D) = {h(x) : h(x) m. then the functions of interest must belong to a Sobolev space which takes into account derivatives of order m + 1.

we optimize over a subspace. Rather. deﬁned as: Deﬁnition 53. With seminonparametric estimators. HKn . we’ll deﬁne the estimation space as follows: Deﬁnition 52. and we presume that h∗ ∈ H.X (D). [Estimation subspace] The estimation subspace HK is deﬁned as HK = {gK (x|θK ) : gK (x|θK ) ∈ W2. we don’t actually optimize over the estimation space. [Estimation space] The estimation space H = W2. .The estimation space and the estimation subspace Since in our case we’re interested in consistent estimation of ﬁrst-order elasticities. The estimation space is an open set. θK ∈ K }.Z (D). So we are assuming that the function to be estimated has bounded second derivatives throughout X .

the sample size. In order for optimization over HK to be equivalent to optimization over H. so optimization over HK may not lead to a consistent estimator.2. Note that the true function h∗ is not necessarily an element of HK . The second requirement is: 2.2 increasing functions of n. θK ) is the Fourier form approximation as deﬁned in Equation 18. We need that the HK be dense subsets of H. . at least asymptotically. Denseness The important point here is that HK is a space of func- tions that is indexed by a ﬁnite dimensional parameter (θK has K elements. With n observations. this parameter is estimable.3). n > K. This is achieved by making A and J in equation 18. as in equation 18. dim θKn → ∞ as n → ∞. It is clear that K will have to grow more slowly than n. we need that: 1.where gK (x. The dimension of the parameter vector.

it is useful to apply Theorem 1 of Gallant (1982). . is a subset of the closure of the estimation space. deﬁned above. To show that HK is a dense subset of H with respect to h 1. We reproduce the theorem as presented by Gallant. θ2. . . . . . who in turn cites Edmunds and Moscatelli (1977). . The rest of the discussion of denseness is provided just for completeness: there’s no need to study it in detail. with minor notational changes.X . for convenience of reference: Theorem 54. H .The estimation subspace HK . 1977] Let the real-valued function h∗(x) be continuously differentiable up to order m on an open set containing the closure of X . . [Edmunds and Moscatelli. A set of subsets Aa of a set A is “dense” if the closure of the countable union of the subsets is equal to the closure of A: ∪∞ Aa = A a=1 Use a picture here. Then it is possible to choose a triangular array of coefﬁcients θ1. θK . such that for every q with 0 ≤ q < m.

q = 1. h∗(x) − hK (x|θK ) q. and m = 2. . In the present application. H ⊂ ∪∞HK . ∪∞HK is the countable union of the HK .and every ε > 0. the elements of H are once continuously differentiable on X . ∪∞HK ⊂ H. However. By deﬁnition of the estimation space.X = 0.X = o(K −m+q+ε) as K → ∞. for all h∗ ∈ H. so the theorem is applicable. The implication of Theorem 54 is that there is a sequence of {hK } from ∪∞HK such that K→∞ lim h∗ − hK 1. Therefore. which is open and contains the closure of X . Closely following Gallant and Nychka (1987).

X . the limiting objective function is . We estimate by OLS. Therefore H = ∪∞HK .so ∪∞HK ⊂ H. as in the case of Equations 12. so ∪∞HK is a dense subset of H. Uniform convergence We now turn to the limiting objective function. The sample objective function stated in terms of maximization is 1 sn(θK ) = − n n (yt − gK (xt | θK ))2 t=1 With random sampling.1 and 17.3. with respect to the norm h 1.

with respect to the norm g 1 −g 0 1. The pointwise convergence of the objective function needs to be strengthened to uniform convergence. We will simply assume that this holds.s∞ (g. We also have continuity of the objective function in g.X →0 lim − g 0(x) − f (x) 2 dµx. f ) − s∞ g 0 .X since lim s∞ g 1 .7) where the true function f (x) takes the place of the generic function h∗ in the presentation of the theorem. Both g(x) and f (x) are elements of ∪∞HK . f ) g 1(x) − f (x) X 2 = g 1 −g 0 1. f ) = − X 2 (f (x) − g(x))2 dµx − σε . since the way to verify this depends upon the speciﬁc application.X →0 h 1. By the dominated convergence theorem (which applies since the ﬁ- . (18.

The closure of H is compact with . h 1.Z (D) is dominated by an integrable function).X (D): the function space in the closure of which the true function must lie. This condition is clearly satisﬁed given that g and f are once continuously differentiable (by the assumption that deﬁnes the estimation space).X = (g. the limit and the integral can be interchanged. f ) ⇒ 0.nite bound D used to deﬁne W2. f ) in H × H. the limit is zero. • Consistency norm respect to this norm. the relevant concepts are: • Estimation space H = W2. f ) ≥ s∞(f. Review of concepts For the example of estimation of ﬁrst-order elasticities. s∞(g.X . so by inspection. Identiﬁcation The identiﬁcation condition requires that for any point g−f 1.

• Estimation subspace HK . The estimation subspace is the subset of H that is representable by a Fourier form with parameter θK . over the closure of the inﬁnite union of the estimation subpaces. . • Sample objective function sn(θK ). ﬁrst order elasticities xi ∂f (x) f (x) ∂xif (x) are consistently estimated for all x ∈ X . which is continuous in g and has a global maximum in its ﬁrst argument. the negative of the sum of squares. f ). • As a result of this. By standard arguments this converges uniformly to the • Limiting objective function s∞( g. These are dense subsets of H. at g = f.

. The issue of how to chose the rate at which parameters are added and which to add ﬁrst is fairly complex. our approximating model is: gK (x|θK ) = z θK . If parameters are added at a high rate. A basic problem is that a high rate of inclusion of additional parameters causes the variance to tend more slowly to zero. • Deﬁne ZK as the n×K matrix of regressors obtained by stacking observations. Gallant and Souza. the bias tends relatively rapidly to zero. 1991) are very strict. A problem is that the allowable rates for asymptotic normality to obtain (Andrews 1991.Discussion Consistency requires that the number of parameters used in the expansion increase with the sample size. tending to inﬁnity. Supposing we stick to these rates. The LS estimator is + ˆ θK = (ZK ZK ) ZK y.

where (·)+ is the Moore-Penrose generalized inverse. z θK . AV ). as would be the case for K(n) large enough when some dummy variables are included. σ Formally. If we can’t stick to . ˆ • . The prediction. though. – This is used since ZK ZK may be singular. this is exactly the same as if we were dealing with a parametric linear model. I emphasize. of the unknown function f (x) is asymptotically normally distributed: √ where AV = lim E z n→∞ d ˆ n z θK − f (x) → N (0. that this is only valid if K grows very slowly as n grows. ZK ZK n + zˆ 2 .

. Bootstrapping is a possibility. We’ll consider the Nadaraya-Watson kernel regression estimator in a simple case. Truman Bewley. nearest neighbor. • Suppose we have an iid sample from the joint density f (x.acceptable rates. we should probably use some other method of approximating the small sample distribution. Fifth World Congress. We’ll discuss this in the section on simulation. 1. y). Kernel regression estimation is an example (others are splines. V. Kernel regression estimators Readings: Bierens.). An alternative method to the semi-nonparametric method is a fully nonparametric method of estimation. 1987. ed. “Kernel estimators of regression functions.” in Advances in Econometrics. Cambridge.. etc.

we have g(x) = = y f (x.where x is k -dimensional. The model is yt = g(xt) + εt. By deﬁnition of the conditional expectation. y)dy. y) dy h(x) yf (x. 1 h(x) where h(x) is the marginal density of x : h(x) = f (x. y)dy. • The conditional expectation of y given x is g(x). . where E(εt|xt) = 0.

Estimation of the denominator A kernel estimator for h(x) has the form 1 ˆ h(x) = n n t=1 K [(x − xt) /γn] . • The function K(·) (the kernel) is absolutely integrable: |K(x)|dx < ∞. k γn where n is the sample size and k is the dimension of x.• This suggests that we could estimate g(x) by estimating h(x) and yf (x. y)dy. . and K(·) integrates to 1 : K(x)dx = 1.

but we do not necessarily restrict K(·) to be nonnegative. but not too quickly. • The window width parameter. . ﬁrst consider the expectation of the estimator (since the estimator is an average of iid terms we only need to consider the expectation of a representative term): ˆ E h(x) = −k γn K [(x − z) /γn] h(z)dz. K(·) is like a density function. the window width must tend to zero. γn is a sequence of positive numbers that satisﬁes n→∞ n→∞ lim γn = 0 k lim nγn = ∞ So.In this respect. ˆ • To show pointwise consistency of h(x) for h(x).

so z = x−γnz ∗ and | dz ∗ | = γn . ˆ lim E h(x) = lim = = K (z ∗) h(x − γnz ∗)dz ∗ −k k γn K (z ∗) h(x − γnz ∗)γn dz ∗ K (z ∗) h(x − γnz ∗)dz ∗.dz k Change variables as z ∗ = (x−z)/γn. we obtain ˆ E h(x) = = Now. n→∞ n→∞ n→∞ lim K (z ∗) h(x − γnz ∗)dz ∗ K (z ∗) h(x)dz ∗ K (z ∗) dz ∗ = h(x) = h(x). asymptotically. .

we have.. (Note: that we can pass the limit through the integral is a result of the dominated convergence theorem. For this to hold we need that h(·) be dominated by an absolutely integrable function.since γn → 0 and K (z ∗) dz ∗ = 1 by assumption. this is −k k ˆ nγn V h(x) = γn V {K [(x − z) /γn]} . ˆ • Next. considering the variance of h(x). due to the iid assumption k k 1 ˆ nγn V h(x) = nγn 2 n n V t=1 n K [(x − xt) /γn] k γn = −k 1 γn n V {K [(x − xt) /γn]} t=1 • By the representative term argument.

since V (x) = E(x2) − E(x)2 we have k −k −k ˆ nγn V h(x) = γn E (K [(x − z) /γn])2 − γn {E (K [(x − z) /γn])}2 = = −k k γn K [(x − z) /γn]2 h(z)dz − γn −k γn K −k γn K [(x − z) /γn] h( 2 [(x − z) /γn] h(z)dz − 2 k γn E h(x) The second term converges to zero: k γn E 2 h(x) → 0. n→∞ k ˆ lim nγn V h(x) = lim n→∞ −k γn K [(x − z) /γn]2 h(z)dz. by the previous result regarding the expectation and the fact that γn → 0. Using exactly the same change of variables as before. this can be .• Also. Therefore.

k and since nγn → ∞ by assumption.shown to be n→∞ k ˆ lim nγn V h(x) = h(x) [K(z ∗)]2 dz ∗. The esti- mator has the same form as the estimator for h(x). only with one . we need an estimator of f (x. Since both [K(z ∗)]2 dz ∗ and h(x) are bounded. y)dy. • Since the bias and the variance both go to zero. Estimation of the numerator To estimate yf (x. this is bounded. we have pointwise consistency (convergence in quadratic mean implies convergence in probability). we have that ˆ V h(x) → 0. y).

x) dy = 0 and to marginalize to the previous kernel for h(x) : K∗ (y. y) = 1 f n n t=1 K∗ [(y − yt) /γn. With this kernel. x)dy = n n yt t=1 K [(x − xt) /γn] k γn . (x − xt) /γn] k+1 γn The kernel K∗ (·) is required to have mean zero: yK∗ (y.dimension more: ˆ(x. we have 1 ˆ y f (y. x) dy = K(x).

. where higher weights are associated with points that are closer to xt. 2. j = 1.by marginalization of the kernel. . n. Discussion • The kernel regression estimator for g(xt) is a weighted average of the yj .. The weights sum to 1. n t=1 K [(x − xt ) /γn ] This is the Nadaraya-Watson kernel regression estimator.. so we obtain ˆ g (x) = = = 1 ˆ h(x) 1 n 1 n ˆ y f (y. x)dy n yt K[(x−xt)/γn] k t=1 γn n K[(x−xt )/γn ] k t=1 γn n t=1 yt K [(x − xt ) /γn ] . • The window width parameter γn imposes smoothness. The es- .

• A large window width reduces the variance (strong imposition of flatness), but increases the bias.
• A small window width reduces the bias, but makes very little use of information except points that are in a small neighborhood of x_t. Since relatively little information is used, the variance is large when the window width is small.
• The standard normal density is a popular choice for K(·) and K*(y, x), though there are possibly better alternatives.

Choice of the window width: Cross-validation

The selection of an appropriate window width is important. One popular method is cross validation. This consists of splitting the sample into two parts (e.g., 50%-50%). The first part is the "in sample" data, which is used for estimation, and the second part is the "out of sample" data, used for evaluation of the fit through RMSE or some other criterion. The steps are (a small Octave sketch follows the list):

1. Split the data. The out of sample data is y^out and x^out.
2. Choose a window width γ.
3. With the in sample data, fit ŷ_t^out corresponding to each x_t^out. This fitted value is a function of the in sample data, as well as the evaluation point x_t^out, but it does not involve y_t^out.
4. Repeat for all out of sample points.
5. Calculate RMSE(γ).
6. Go to step 2, or to the next step if enough window widths have been tried.
7. Select the γ that minimizes RMSE(γ) (verify that a minimum has been found, for example by plotting RMSE as a function of γ).
8. Re-estimate using the best γ and all of the data.
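Here is a minimal Octave sketch of the procedure, reusing the nw() function from the previous sketch. The split, the grid of window widths and the variable names (n, xt, yt) are all illustrative assumptions:

in  = 1:floor(n/2);  out = (floor(n/2)+1):n;   % 50%-50% split of the n observations
gammas = 0.1:0.1:2;                            % candidate window widths
rmse = zeros(size(gammas));
for i = 1:numel(gammas)
  yfit = arrayfun(@(x0) nw(x0, xt(in), yt(in), gammas(i)), xt(out));
  rmse(i) = sqrt(mean((yt(out) - yfit).^2));   % out-of-sample fit for this width
end
[rmin, best] = min(rmse);
gamma_star = gammas(best)                      % re-estimate with this width on all data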

This same principle can be used to choose A and J in a Fourier form model.

18.4 Density function estimation

Kernel density estimation

The previous discussion suggests that a kernel density estimator may easily be constructed. We have already seen how joint densities may be estimated. If we're interested in a conditional density, for example of y conditional on x, then the kernel estimate of the conditional

See also Cameron and Johansson. Journal of Applied Economet- . y) = ˆ h(x) = = 1 n n K∗ [(y−yt )/γn . Semi-nonparametric maximum likelihood Readings: Gallant and Nychka. For a Fortran program to do this and a useful discussion in the user’s guide. see this link. (x − xt ) /γn ] n t=1 K [(x − xt ) /γn ] 1 γn where we obtain the expressions for the joint and marginal densities from the section on kernel regression.density is simply fy|x ˆ f (x. 1987.(x−xt )/γn ] k+1 t=1 γn n K[(x−xt )/γn ] 1 k t=1 n γn n t=1 K∗ [(y − yt ) /γn . Econometrica.

γ) is a normalizing factor to make the density integrate . 12. 1997. γ) = ηp(x. φ. yes. φ) p gp(y|x. The new density is h2(y|γ)f (y|x. φ. Suppose we’re interested in the density of y conditional on x (both may be vectors). MLE is the estimation method of choice when we are conﬁdent about specifying the density. φ) is a reasonable starting approximation to the true density.rics. This density can be reshaped by multiplying it by a squared polynomial. V. Is is possible to obtain the beneﬁts of MLE when we’re not so conﬁdent about the speciﬁcation? In part. φ. γ) p where hp(y|γ) = γk y k k=0 and ηp(x. Suppose that the density f (y|x.

(sum) to one. γ). The normalization factor ηp(φ. φ. γ) is calculated (following Cameron and Johansson) using ∞ E(Y r ) = y=0 ∞ y r fY (y|φ. γ) [hp (y|γ)]2 = yr fY (y|φ) ηp(φ. γ) = k=0 l=0 γk γl mk+l+r /ηp(φ. γ) ∞ = k=0 l=0 p p γk γl y r+k+l fY (y|φ) y=0 /ηp(φ. γ) is a homogenous function p of θ it is necessary to impose a normalization: γ0 is set to 1. . γ) y=0 ∞ p p = y=0 k=0 l=0 p p y r fY (y|φ)γk γl y k y l /ηp(φ. Because h2(y|γ)/ηp(x.

Similarly to Cameron and Johannson (1997). The negative binomial baseline density may be written (see equation as Γ(y + ψ) fY (y|φ) = Γ(y + 1)Γ(ψ) ψ ψ+λ ψ λ ψ+λ y . Gallant and Nychka (1987) give conditions under which such a density may be treated as correctly speciﬁed. the order of the polynomial must increase as the sample size increases. asymptotically. we may develop a negative binomial polynomial (NBP) density for count data.8 ηp(φ. γ) = k=0 l=0 p p γk γl mk+l (18.8) Recall that γ0 is set to 1 to achieve identiﬁcation.8 are the raw moments of the baseline density. Basically. The mr in equation 18.By setting r = 0 we get that the normalizing factor is 18. there are technicalities. However.

ψ}. γ) = ηp(φ.9) To get the normalization factor. λ > 0 and ψ > 0.where φ = {λ. For both forms.10) To illustrate. γ) Γ(y + 1)Γ(ψ) ψ ψ+λ ψ λ ψ+λ y . For the NB-I density. which is a Computer Algebra System that (used to be?) free for personal use. These . (18. with normalization to sum to one. E(Y ) = λ. (18. Figure 18. V (Y ) = λ + αλ. we have V (Y ) = λ + αλ2. When ψ = 1/α we have the negative binomial-II (NP-II) model. calculated using MuPAD. we need the moment generating function: MY (t) = ψ ψ λ − etλ + ψ −ψ . is [hp (y|γ)]2 Γ(y + ψ) fY (y|φ.5 shows calculation of the ﬁrst four raw moments of the NB density. When ψ = λ/α we have the negative binomial-I model (NB-I). The usual means of incorporating conditioning variables x is the parameterization λ = ex β . The reshaped density. In the case of the NB-II model.

Gallant and Nychka. This has been done in NegBinSNP. which is a C++ version of this model that can be compiled to use with octave using the mkoctfile command. Econometrica. This . which is relatively easy to edit to write the likelihood function for the model. Note the impressive length of the expressions when the degree of the expansion is 4 or 5! This is an example of a model that would be difﬁcult to formulate without the help of a program like MuPAD. This can be accomodated by allowing the γk parameters to depend upon the conditioning variables. MuPAD will output these results in the form of C code. 1987 prove that this sort of density can approximate a wide variety of densities arbitrarily well as the degree of the polynomial increases with the sample size. for example using polynomials.are the moments you would need to use a second order polynomial (p = 2). It is possible that there is conditional heterogeneity such that the appropriate reshaping should be more local.cc.

5: Negative binomial raw moments .Figure 18.

they would probably get the paper published in a good journal.approach is not without its drawbacks: the sample objective function can have an extremely large number of local maxima that can lead to numeric difﬁculties. The baseline model is a negative binomial density. Any ideas? Here’s a plot of true and the limiting SNP approximations (with the order of the polynomial ﬁxed) to four different count data densities. . which variously exhibit over and underdispersion. as well as excess zeros. If someone could ﬁgure out how to do in a way such that the sample objective function was nice and smooth.

(Plot: true densities and limiting SNP approximations for the four count-data densities, Case 1 through Case 4.)

18.5 Examples

MEPS health care usage data

We'll use the MEPS OBDV data to illustrate kernel regression and semi-nonparametric maximum likelihood.

Kernel regression estimation

Let's try a kernel regression fit for the OBDV data. The program OBDVkernel.m loads the MEPS OBDV data, scans over a range of window widths and calculates leave-one-out CV scores, and, using the best window width, plots the fitted OBDV usage versus AGE. The plot is in Figure 18.6. Note that usage increases with age, just as we've seen with the parametric models. One could use bootstrapping to generate a confidence interval for the fit.

Figure 18.6: Kernel fitted OBDV usage versus AGE (kernel fit, OBDV visits versus AGE)

Seminonparametric ML estimation and the MEPS data Now let’s estimate a seminonparametric density for the OBDV data. The program EstimateNBSNP. using a NB-I baseline density and a 2nd order polynomial expansion. as discussed above. The output is: OBDV ====================================================== BFGSMIN final results Used numeric gradient -----------------------------------------------------STRONG CONVERGENCE . We’ll reshape a negative binomial density.m loads the MEPS OBDV data and estimates the model.

0000 0.17061 Stepsize 0.0000 0.0000 0.0000 0.1898 0.4358 0.2214 0.0000 -0.0722 -0.0000 -0.2317 0.0065 24 iterations -----------------------------------------------------param 1.0002 1.0000 change -0.0000 .7853 -0.Function conv 1 Param conv 1 Gradient conv 1 -----------------------------------------------------Objective function value 2.0000 0.3826 0.0000 -0.1839 0.0000 -0.0000 0.0000 -0.0000 -0.0000 0.1129 gradient 0.0000 0.0000 -0.0000 -0.0000 0.0000 0.

025 -0.126 0.409 0.034 0.786 4.000 0.000 0.936 8.027 t-stat -1.000 0. ins.****************************************************** NegBin SNP model.903 -0.173 13.029 0.000 0.148 11.141 0.695 0.170614 Observations: 4564 constant pub.001 0. ins.785 -0.833 13.113 st.629 -14.436 0.000 .000 0.241 0.000 0.016 0. MEPS full data set MLE Estimation Results BFGS convergence: Normal convergence Average Log-L: -2.006 0.166 p-value 0.000 0. sex age edu inc gam1 gam2 lnalpha estimate -0.000 1. priv.443 0.011 12. err 0.880 3.147 0.000 0.046 0.050 0.991 0.

still being parsimonious.3. AIC: 4. CAIC: 4.3456 ****************************************************** Note that the CAIC and BIC are lower for this model than for the models presented in Table 16. Density functions formed in this way may have MANY local maxima. you can use SA to do the minimization. If you uncomment the relevant lines in the program.6244 Avg.6244 Avg. one can try using multiple starting values. BIC: 4.3649 Avg. This model ﬁts well. using a NP-II baseline density.3597 AIC : 19833. You can play around trying other use measures. This will take a lot . so you need to be careful before accepting the results of a casual run. and using other orders of expansions. To guard against having converged to a local maximum. or one could try simulated annealing as an optimization method.3619 BIC : 19897.Information Criteria CAIC : 19907.

Financial data and volatility The data set rates contains the growth rate (100×log difference) of the daily spot $/euro and $/yen exchange rates at New York. Figures ?? and ?? show the data and their histograms. . 2008.Figure 18. See the README ﬁle for details. from January 04. 1999 to February 12. There are 2291 observations.7: Dollar-Euro of time. noon. compared to the default BFGS minimization. The chapter on parallel computations might be interesting to read before trying this.

alternating between periods of stability and periods of more wild swings. the bars extend above the normal density that best ﬁts the data.Figure 18. and the tails are fatter than those of the best ﬁt normal density. This is known as conditional heteroscedasticity. we can see that the variance of the growth rates is not constant over time. Volatility clusters are apparent. This feature of the data is known as leptokurtosis.8: Dollar-Yen • at the center of the histograms. • in the series plots. ARCH and GARCH well-known models that are often applied to this .

leptokurtosis.9. • Many structural economic models often cannot generate data that exhibits conditional heteroscedasticity without directly assuming shocks that are conditionally heteroscedastic. note how current volatility depends on lags of the .m performs kernel regression to ﬁt E(yt |yt−1. • In the Figure.) features of ﬁnancial data result from the behavior of economic agents. etc.sort of data. and generates the plots in Figure 18. 2 2 2 The Octave script kernelﬁt. note how the data is compactiﬁed in the example script. rather than from a black box that provides shocks. and other (leverage. • From the point of view of learning the practical aspects of kernel regression. It would be nice to have an economic explanation for how conditional heteroscedasticity.yt−2).

• The fact that the plots are not ﬂat suggests that this conditional moment contain information about the process that generates the data. but drops off quickly when either of the lags is low.squared return rate . We’ll come back to this later.it is high when both of the lags are high. . Perhaps attempting to match this moment might be a means of estimating the parameters of the dgp.

Figure 18.9: Kernel regression ﬁtted conditional second moments. Yen/Dollar and Euro/Dollar (a) Yen/Dollar (b) Euro/Dollar .

(c) Experiment with different bandwidths.18.6 Exercises 1. (b) Run the script and interpret the output. (a) Look this script over. In Octave. 2. type ”help kernel_regression”. and comment on the effects of choosing small and large values. (a) How can a kernel ﬁt be done without supplying a bandwidth? (b) How is the bandwidth chosen if a value is not provided? (c) What is the default kernel used? . and describe in words what it does. In Octave. type ”edit kernel_example”.

Using the Octave script OBDVkernel.3. plot kernel regression ﬁts for OBDV visits as a function of income and education.m as a model. .

pages 846 . ECONOMETRIC THEORY. Vol. 1996. “Which Moments to Match?”. There are many articles. 12. Some of the seminal papers are Gallant and Tauchen (1996).Chapter 19 Simulation-based estimation Readings: Gourieroux and Monfort (1996) Simulation-Based Econometric Methods (Oxford University Press).

Apl. 19. Econometrics. Simulation-based estimation methods open up the possibility to estimate truly complex models.1 Motivation Simulation methods are of interest when the DGP is fully characterized by a parameter vector. so that MLE or GMM estimation is not possible. Pakes and Pollard (1989) Econometrica. The desirability introducing a great deal of complexity may be an issue1. McFadden (1989) Econometrica. Gourieroux. but the likelihood function and moments of the observable varables are not calculable.657-681. and abstraction helps us to isolate the important features of a phenomenon. 1 . so that simulated data can be generated. “Indirect Inference. as we will see. but it least it Remember that a model is an abstraction from reality.” J. Many moderately complex models result in intractible likelihoods or moments. Monfort and Renault (1993).

we observe a many-to-one mapping y = τ (y ∗) .1) Henceforth drop the i subscript when it is not needed for clarity. Ω) (19. Suppose that yi∗ = Xiβ + εi where Xi is m × K. Suppose that εi ∼ N (0. Example: Multinomial and/or dynamic discrete response models Let yi∗ be a latent random vector of dimension m.becomes a possibility. • y ∗ is not observed. Rather.

Ω)dyi∗ −ε Ω−1ε exp 2 where n(ε. Xi). However. • Deﬁne Ai = A(yi) = {y ∗|yi = τ (y ∗)} Suppose random sampling of (yi. The contribution of the ith observation to the likelihood function is pi(θ) = Ai n(yi∗ − Xiβ. In this case the elements of yi may not be independent of one another (and clearly are not if Ω is not diagonal).This mapping is such that each element of y is either zero or one (in some cases only one element will be one). i = j. yi is independent of yj . Ω) = (2π) −M/2 |Ω| −1/2 is the multivariate normal density of an M -dimensional random . (vec∗Ω) ) be the vector of parameters of the model. • Let θ = (β .

vector. θ by standard methods of numeric integration such as quadrature is computationally infeasible when m (the dimension of y) is higher than 3 or 4 (as long as there are no restrictions on Ω). ˆ pi(θ) • The problem is that evaluation of Li(θ) and its derivative w. • The mapping τ (y ∗) has not been made speciﬁc so far. This setup is quite general: for different choices of τ (y ∗) it nests the case of dynamic binary discrete choice models as well as the case of .t.r. The log-likelihood function is 1 ln L(θ) = n n ln pi(θ) i=1 ˆ and the MLE θ solves the score equations 1 n n i=1 1 ˆ gi(θ) = n n i=1 ˆ Dθ pi(θ) ≡ 0.

stacked in the vector ui are not observed. We have cross sectional data on individuals’ matching to a set of m jobs that are available (one of which is unemployment). we observe the vector formed of elements yj = 1 [uj > uk . Rather. – Dynamic discrete choice is illustrated by repeated choices over time between two alternatives. k = j] Only one of these elements is different than zero. ∀k ∈ m. The utility of alternative j is uj = Xj β + εj Utilities of jobs. Let alternative j have .multinomial discrete choice (the choice of one out of a ﬁnite set of alternatives). – Multinomial discrete choice is illustrated by a (very simple) job search model.

that is yit = 1 if individual i chooses the second alternative . 2. 2} t ∈ {1.. j ∈ {1... .utility ujt = Wjtβ − εjt. m} Then y ∗ = u2 − u1 = (W2 − W1)β + ε2 − ε1 ≡ Xβ + ε Now the mapping is (element-by-element) y = 1 [y ∗ > 0] .

For example. count data (that takes values 0.in period t. 1... zero otherwise.) is often modeled using the Poisson distribution exp(−λ)λi Pr(y = i) = i! The mean and variance of the Poisson distribution are both equal to λ: E(y) = V (y) = λ. 3. 2. Example: Marginalization of latent variables Economic data often presents substantial heterogeneity that may be difﬁcult to model. . A possibility is to introduce latent random variables. This can cause the problem that there may be no known closed form for the distribution of observable variables after marginalizing out the unobservable latent variables. .

count data exhibits “overdispersion” which simply means that V (y) > E(y). If this is the case. This ensures that the mean is positive (as it must be). In .Often. one parameterizes the conditional mean as λi = exp(Xiβ). Estimation by ML is straightforward. An alternative is to introduce a latent variable that reﬂects heterogeneity into the speciﬁcation: λi = exp(Xiβ + ηi) where ηi has some speciﬁed density with support S (this density may depend on additional parameters). Let dµ(ηi) be the density of ηi. Often. a solution is to use the negative binomial distribution rather than the Poisson.

but often this will not be possible.some cases. the marginal density of y exp [− exp(Xiβ + ηi)] [exp(Xiβ + ηi)]yi dµ(ηi) Pr(y = yi) = yi ! S will have a closed-form solution (one can derive the negative binomial distribution in the way if η has an exponential distribution .1). However. quadrature is probably a better choice. simulation is a means of calculating Pr(y = i). For example exp [− exp(Xiβi)] [exp(Xiβi)]yi Pr(y = yi) = dµ(βi) yi ! S . This would be an example of the Simulated Maximum Likelihood (SML) estimation.see equation 16. since there is only one latent variable. which is then used to do ML estimation. • In this case. a more ﬂexible model with heterogeneity would allow all parameters (not just the constant) to vary. In this case.

Consider the process dyt = g(θ. which will not be evaluable by quadrature when K gets large. yt)dt + h(θ. This leads to a model that is expressed as a system of stochastic differential equations.entails a K = dim βi-dimensional integral. A realistic model should account for exogenous shocks to the system. Estimation of models speciﬁed in terms of stochastic differential equations It is often convenient to formulate models in terms of continuous time using differential equations. yt)dWt . which can be done by assuming a random component.

non-overlapping segments are independent. yt) is the deterministic part. T ) Brownian motion is a continuous-time stochastic process such that • W (0) = 0 • [W (s) − W (t)] ∼ N (0. • h(θ. yt) determines the variance of the shocks.which is assumed to be stationary. such that T W (T ) = 0 dWt ∼ N (0. {Wt} is a standard Brownian motion (Weiner process). s − t) • [W (s) − W (t)] and [W (j) − W (k)] are independent for s > t > j > k. • The function g(θ. One can think of Brownian motion the accumulation of independent normally distributed shocks with inﬁnitesimal variance. That is. .

This density is necessary to evaluate the likelihood function or to evaluate moment conditions (which are based upon expectations with respect to this density). in general.. To perform inference on θ.. • A typical solution is to “discretize” the model. yt−1) + h(φ. direct ML or GMM estimation is not usually feasible. deduce the transition density f (yt|yt−1. the φ0 . by which we mean to ﬁnd a discrete time approximation to the model. y2. .To estimate a model of this sort. 1) The discretization induces a new parameter. θ). though yt is a continuous process it is observed in discrete time. yt−1)εt εt ∼ N (0. because one cannot. we typically have data that are assumed to be observations of yt in discrete points y1. The discretized version of the model is yt − yt−1 = g(φ.yT . φ (that is. That is.

. Nevertheless. GMM. as we will see. the approximation shouldn’t be too bad. This means that the model is simulable. which will be useful. Nevertheless the model is fully speciﬁed in probabilistic terms up to a parameter vector. etc. conditional on the parameter vector. This is an approximation.which deﬁnes the best approximation of the discretization to the actual (unknown) discrete time version of the model is not equal to θ0 which is the true parameter value). and as such “ML” estimation of φ (which is actually quasi-maximum likelihood. • The important point about these three examples is that computational difﬁculties prevent direct application of ML. θ. QML) based upon this equation is in general biased and inconsistent for the original parameter.

θ) = ˜ H H f (νts. However. Xt. it may be possible to deﬁne a random function such that Eν f (ν. θ) = p(yt|Xt. the simulator 1 p (yt. An ML estimator solves ˆ θM L 1 = arg max sn(θ) = n ln p(yt|Xt. yt. θ) where the density of ν is known. When ˆ p(yt|Xt. If this is the case. θM L is an infeasible estimator. Xt. θ) t=1 where p(yt|Xt. yt. Xt. θ) is the density function of the tth observation. consider cross-sectional data.2 Simulated maximum likelihood (SML) n For simplicity.19. θ) does not have a known closed form. θ) s=1 .

θ) i=1 Example: multinomial probit Recall that the utility of alternative j is uj = Xj β + εj and the vector y is formed of elements yj = 1 [uj > uk . that is ˆ θSM L 1 = arg max sn(θ) = n n ˜ ln p (yt. k = j] .is unbiased for p(yt|Xt. θ) in the log-likelihood function. k ∈ m. θ) in place of p(yt|Xt. Xt. ˜ • The SML simply substitutes p (yt. θ). Xt.

∀k ∈ m. ˜ • Draw εi from the distribution N (0. However. Each element of πi is between 0 and 1. it is easy to simulate this probability. Ω) • Calculate ui = Xiβ + εi (where Xi is the matrix formed by stack˜ ˜ ing the Xij ) ˜ • Deﬁne yij = 1 [uij > uik . • Now p (yi. and the elements sum to one.The problem is that Pr(yj = 1|θ) can’t be calculated when m is larger than 4 or 5. k = j] • Repeat this H times and deﬁne πij = H ˜ h=1 yijh H • Deﬁne πi as the m-vector formed of the πij . θ) = yiπi ˜ . Xi.

β and Ω. The draws are dif˜ ferent for each i. it does cause problems if one attempts to use a gradient-based optimization method .t. Xi. θ) i=1 This is to be maximized w. If the εi are re-drawn at every iteration the estimator will not converge.• The SML multinomial probit log-likelihood function is 1 ln L(β.r. Ω) is stochastically equicontinuous. Notes: ˜ • The H draws of εi are draw only once and are used repeatedly ˆ ˆ during the iterations used to ﬁnd β and Ω. However. • The log-likelihood function with this simulator is a discontinuous function of β and Ω. This does not cause problems from a theoretical point of view since it can be shown that ln L(β. Ω) = n n ˜ yi ln p (yi.

This is computationally costly.5 × 1 uij = max uik k=1 m where A is a large positive number. that some elements of πi are zero.such as Newton-Raphson. For example. simulated annealing. apply a kernel transformation such as ˜ yij = Φ A × uij − max uik k=1 m + . are used. This approximates a . particularly if few simulations. for example. – 2) Smooth the simulated probabilities so that they are continuous functions of the parameters. there will be a log(0) problem. If the corresponding element of yi is equal to 1. H. • Solutions to discontinuity: – 1) use an estimation method that doesn’t require a continuous and differentiable objective function. • It may be the case.

Consistency requires that A(n) → ∞. Gibbs sampling) that may work better. p . Ω) will be continuous and differentiable. This makes yij a continuous function of β and Ω. so ˜ that pij and therefore ln L(β. so that the approximation to a step function becomes arbitrarily close as the sample size increases.. • To solve to log(0) problem. but this is too technical to discuss here. increase H if this is a serious problem. There are alternative methods (e.g. and yij is very close to 1 if uij is the maxi˜ ˜ mum.˜ step function such that yij is very close to zero if uij is not the maximum. Also. one possibility is to search the web for the slog function.

[Lee] 1) if limn→∞ n1/2/H = 0.” Econometric Theory. λ a ﬁnite constant. 11. pp. then √ d ˆ n θSM L − θ0 → N (0. I −1(θ0)) 2) if limn→∞ n1/2/H = λ. The following is taken from Lee (1995) “Asymptotic Bias in Simulated Maximum Likelihood Estimation of Discrete Choice Models. • This means that the SML estimator is asymptotically biased if H doesn’t grow faster than n1/2.Properties The properties of the SML estimator depend on how H is set. then √ d ˆ n θSM L − θ0 → N (B. Theorem 55. 437-83. . I −1(θ0)) where B is a ﬁnite vector of constants.

Once could.3 Method of simulated moments (MSM) Suppose we have a DGP(y|x. xt) − k(xt. so that as long as H grows fast enough the estimator is consistent and fully asymptotically efﬁcient. θ)] zt where k(xt. θ) which is simulable given θ. but is such that the density of y is not calculable. zt is a vector of instruments in the information set and p(y|xt. θ) = K(yt. The problem is that this density is not . in principle. base a GMM estimator upon the moment conditions mt(θ) = [K(yt. xt)p(y|xt. 19. θ) is the density of y conditional on xt.• The varcov is the typical inverse of the information matrix. θ)dy.

xt) h=1 a. which provides a clear intuitive basis for the estimator. θ) . xt) − k (xt. as H → ∞.2) .available. θ) → k (xt. though in fact we obtain consistency even for H ﬁnite. so errors introduced by simulation cancel themselves out. • This allows us to form the moment conditions mt(θ) = K(yt. • However k(xt.s. since a law of large numbers is also operating across the n observations of real data. θ) is readily simulated using 1 k (xt. θ) = H H h K(yt . k (xt. • By the law of large numbers. θ) zt (19.

McFadden (ref. xt) zt h=1 (19. Properties Suppose that the optimal weighting matrix is used. above) show that the asymptotic distribution of the MSM estimator is very similar to that of the infeasible GMM estimator. xt) appears linearly within the sums. above) and Pakes and Pollard (refs.3) with which we form the GMM criterion and estimate as usual. h Note that the unbiased simulator k(yt . As before. form 1 m(θ) = n 1 = n n mt(θ) i=1 n i=1 1 K(yt. xt) − H H h k(yt . assuming that the optimal .where zt is drawn from the information set. In particular.

the asymptotic variance is inﬂated by a factor 1 + 1/H. but the efﬁciency loss is small and controllable. For this reason the MSM estimator is not fully asymptotically efﬁcient relative to the infeasible GMM estimator. by setting H reasonably large. inﬂated by 1 + 1/H. . 1 + 1 n θ H −1 D∞Ω−1D∞ −1 (19. • The estimator is asymptotically unbiased even for H = 1. is the asymptotic variance of the infeasible GMM • That is.4) where D∞Ω−1D∞ estimator. √ d ˆM SM − θ0 → N 0. for H ﬁnite. • If one doesn’t use the optimal weighting matrix.weighting matrix is used. the asymptotic varcov is just the ordinary GMM varcov. This is an advantage relative to SML. and for H ﬁnite.

To use the multinomial probit model as an example.• The above presentation is in terms of a speciﬁc moment condition based upon the conditional mean. Ω) i=1 yi ln pi(β. Ω) ˜ i=1 . Ω) = n n n yi ln pi(β. Ω) = n The SML version is 1 ln L(β. the log-likelihood function is 1 ln L(β. Simulated GMM can be applied to moment conditions of any form. while MSM is? The reason is that SML is based upon an average of logarithms of an unbiased simulator (the densities of the observations). Comments Why is SML inconsistent if H is ﬁnite.

That is. The only way for the two to be equal (in the limit) is if H tends to inﬁnite so that ˜ p (·) tends to p (·). using simple notation for the random .The problem is that ˜ E ln(˜i(β. Therefore the SLLN applies to cancel out simulation errors. Ω) = pi(β. Ω)) p in spite of the fact that ˜ E pi(β. from which we get consistency. and it appears within a sum over n (see equation [19.3]). Ω)) = ln(E pi(β. Ω) due to the fact that ln(·) is a nonlinear transformation. The reason that MSM does not suffer from this problem is that in this case the unbiased simulator appears linearly within every sum of terms.

θ) + εht] zt (19. henceforth consistency.6) h=1 converge almost surely to ˜ m∞(θ) = k(x. θ ) + εt − H ˜ [k(xt. . the moment conditions 1 ˜ m(θ) = n 1 = n n i=1 n 1 K(yt.sampling case. xt) − H H h k(yt . θ0) − k(x.5) i=1 1 0 k(xt. xt) zt h=1 H (19. The objective function converges to ˜ s∞(θ) = m∞(θ) Ω−1m∞(θ) ∞ ˜ which obviously has a minimum at θ0. θ) z(x)dµ(x). (note: zt is assume to be made up of functions of xt).

The asymptotic efﬁciency of the estimator may be low.6 a bit. • A poor choice of moment conditions may lead to very inefﬁcient estimators.4 Efﬁcient method of moments (EMM) The choice of which moments upon which to base a GMM estimator can have very pronounced effects upon the efﬁciency of the estimator. you will see why the variance 1 inﬂation factor is (1 + H ).• If you look at equation 19. and can even cause identiﬁcation problems (as we’ve seen with the GMM problem set). • The drawback of the above approach MSM is that the moment conditions used in estimation are selected arbitrarily. 19. • The asymptotically optimal choice of moments would be the .

Vol. this choice is unavailable. pages 657-681) seeks to provide moment conditions that closely mimic the score vector. 12. mt(θ) = Dθ ln pt(θ | It) As before.score vector of the likelihood function. θ0) ≡ pt(θ0) . the resulting estimator will be very nearly fully efﬁcient. If the approximation is very good. 1996. The efﬁcient method of moments (EMM) (see Gallant and Tauchen (1996). ECONOMETRIC THEORY. The DGP is characterized by random sampling from the density p(yt|xt. “Which Moments to Match?”.

θ0).We can deﬁne an auxiliary model. λ). t=1 ˆ ˆ • After determining λ we can calculate the score functions Dλ ln f (yt|xt. 1 ˆ λ = arg max sn(λ) = Λ n n ln ft(λ). We assume that this density function is calculable. • The important point is that even if the density is misspeciﬁed. λ) ≡ ft(λ) • This density is known up to a parameter λ. Therefore quasi-ML estimation is possible. p(y|xt. called the “score generator”. . Speciﬁcally. there is a pseudo-true λ0 for which the true expectation. which simply provides a (misspeciﬁed) parametric density f (y|xt. taken with respect to the true but unknown density of y.

λ) = n n t=1 1 H H h ˆ Dλ ln f (yt |xt. this suggests • We have seen in the section on QML that λ using the moment conditions ˆ =1 mn(θ. λ0) = X Y |X Dλ ln f (y|x.7) • These moment conditions are not calculable.and then marginalized over x is zero: ∃λ0 : EX EY |X Dλ ln f (y|x. θ0)dydµ(x) = p ˆ → λ0. By the LLN . λ) h=1 ˜h where yt is a draw from DGP (θ). λ0)p(y|x. λ) n n ˆ Dλ ln ft(λ)pt(θ)dy t=1 (19. holding xt ﬁxed. since pt(θ) is not available. but they are simulable using 1 ˆ mn(θ.

m∞(θ0. 1987) . which is fully efﬁcient. This is not the case for other values of θ.ˆ and the fact that λ converges to λ0. there exist good ways of approximating unknown distributions parametrically: Philips’ ERA’s (Econometrica. λ) closely apˆ proximates p(y|xt. λ) will closely approximate the optimal moment conditions which characterize maximum likelihood estimation. λ0) = 0. assuming that λ0 is identiﬁed. θ). then mn(θ. • If one has no density in mind. it would be a good choice for f (·). • If one has prior information that a certain density approximates the data well. • The advantage of this procedure is that if f (yt|xt. 1983) and Gallant and Nychka’s (Econometrica.

J (λ0)−1I(λ0)J (λ0)−1 (19. Optimal weighting matrix I will present the theory for H ﬁnite.SNP density estimator which we saw before. and possibly small. The theory for the case of H inﬁnite follows directly from the results presented here. ˆ The moment condition m(θ. We can apply Theorem 30 to conclude that √ d ˆ n λ − λ0 → N 0. Gallant and Tauchen give the theory for the case of H so large that it may be treated as inﬁnite (the difference being irrelevant given the numerical precision of a computer). This is done because it is sometimes impractical to estimate with H very large. λ) depends on the pseudo-ML estiˆ mate λ. Since the SNP density is consistent. the efﬁciency of the indirect estimator is the same as the infeasible ML estimator.8) .

λ) about λ0 : . then λ would be the maximum likelihood estimator. θ). and J (λ0)−1I(λ0) would be an identity matrix. Ω. so there is no cancellation. ∂sn(λ) I(λ ) = lim E n n→∞ ∂λ 0 0 ∂2 ∂λ∂λ sn(λ0) . θ).ˆ ˆ If the density f (yt|xt. due to the information matrix equality. in the present case we assume that f (yt|xt. Howˆ ever. Now take a ﬁrst order Taylor’s series √ ˆ approximation to nmn(θ0. Recall that J (λ ) ≡ p lim we see that J (λ0) = Dλ m(θ0. λ0 ∂sn(λ) ∂λ . Comparing the deﬁnition of sn(λ) with the deﬁnition of the moment condition in Equation 19. this is simply the asymptotic variance covariance matrix of the moment conditions. As in Theorem 30. λ0 In this case. λ) were in fact the true density p(y|xt. λ) is only an approximation to p(y|xt.7. λ0).

It is straightforward but somewhat 1 tedious to show that the asymptotic variance of this term is H I∞(λ0). But noting equation 19. λ0).√ ˆ ˜ nmn(θ . combining the results for the ﬁrst and second terms. 1 + H I(λ0) . λ ) λ − λ0 = 0 0 a. λ ) + 0 0 √ ˆ ˜ nDλ m(θ0. λ) ∼ N 0. √ 1 ˆ a ˜ nmn(θ0. so we have √ ˆ ˜ nDλ m(θ .8 √ a ˆ nJ (λ0) λ − λ0 ∼ N 0.s. λ0) λ − λ0 . λ0) → J (λ0). Note that ˜ Dλ mn(θ0. a. λ0) λ − λ0 + op(1) First consider ˜ nmn(θ0.s. √ ˆ nJ (λ0) λ − λ0 . λ) = 0 √ √ ˜ nmn(θ . √ ˆ ˜ Next consider the second term nDλ m(θ0. I(λ0) Now.

Even if this is the case. This may be complicated if the score generator is a poor approximator. since the individual score contributions may not have mean zero in this case (see the section on QML) . λ) is the GMM estimator with the efﬁcient choice of weighting matrix. On the other hand.Suppose that I(λ0) is a consistent estimator of the asymptotic variancecovariance matrix of the moment conditions. • If one has used the Gallant-Nychka ML estimator as the auxiliary . the ordinary estimator of the information matrix is consistent. Combining this with the result on the efﬁcient GMM weighting matrix in Theorem 43. if the score generator is taken to be correctly speciﬁed. so it is always possible to consistently estimate I(λ0) when the model is simulable. we see ˆ that deﬁning θ as ˆ ˆ θ = arg min mn(θ. λ) Θ 1 1+ H −1 I(λ0) ˆ mn(θ. the individuals means can be calculated by simulation.

the asymptotic distribution is as in Equation 14.3. so we have (using the result in Equation 19. since the scores are uncorrelated.g. it really is ML estimation asymptotically. since the score generator can approximate the unknown density arbitrarily well). I(λ0) D∞ where D∞ = lim E Dθ mn(θ0. λ0) . the appropriate weighting matrix is simply the information matrix of the auxiliary model.model. D∞ d 1 1+ H −1 −1 .. Asymptotic distribution Since we use the optimal weighting matrix.8): √ ˆ n θ − θ0 → N 0. (e. n→∞ .

so testing is impossible. λ) ∼ χ2(q) where q is dim(λ) − dim(θ). λ) Diagnotic testing The fact that √ a ˆ ∼ N 0. since without dim(θ) moment conditions the model is not identiﬁed. λ) H 0 I(λ0) implies that ˆ ˆ nmn(θ. 1 + 1 nmn(θ . One test of the model is simply based on this statistic: if it exceeds the χ2(q) critical . λ) 1 1+ H −1 ˆ I(λ) ˆ ˆ a mn(θ.This can be consistently estimated using ˆ ˆ ˆ D = Dθ mn(θ.

1). See Gourieroux et. λ) is somewhat more complicated). These aren’t √ √ 0 ˆ ˆ ˆ actually distributed as N (0. • Information about what is wrong can be gotten from the pseudot-statistics: diag 1 1+ H 1/2 −1 ˆ I(λ) √ ˆ ˆ nmn(θ. which are usually related to certain features of the model. λ) can be used to test which moments are not well modeled. λ) and nmn(θ. since nmn(θ . .point. al. It can be shown that the pseudo-t statistics are biased toward nonrejection. 1995. Since these moments are related to parameters of the score generator. for more details. or Gallant and Long. λ) √ ˆ ˆ have different distributions (that of nmn(θ. something may be wrong (the small sample performance of this sort of test would be a topic worth investigating). this information can be used to revise the model.

1) that a Poisson model with latent heterogeneity that follows an exponential distribution leads to the negative binomial model. This model is similar to the negative binomial model.19. but it is a nice example since it allows us to compare SML to the actual ML estimator. The Octave script Esti- . The Octave function deﬁned by PoissonLatentHet.5 Examples SML of a Poisson model with latent heterogeneity We have seen (see equation 16. To illustrate SML. where η ∼ N (0.m calculates the simulated log likelihood for a Poisson model where λ = exp xtβ + ση). 1). rather than the analytical approach which leads to the negative binomial model. except that the latent variable is normally distributed rather than gamma distributed. we can integrate out the latent heterogeneity using Monte Carlo. one would not want to use SML in this case. In actual practice.

Attempting to use BFGS leads to trouble. exceeded Average Log-L: -2. though I have not checked if this is true. I suspect that the log likelihood is approximately non-differentiable in places. If you run this script.matePoissonLatentHet. iters. SML estimation. MEPS 1996 full da MLE Estimation Results BFGS convergence: Max.171826 . you will see that it takes a long time to get the estimation results.m estimates this model using the MEPS OBDV data that has already been discussed. around which it is very ﬂat. which are: ****************************************************** Poisson Latent Heterogeneity model. Note that simulated annealing is used to maximize the log likelihood function.

892 17. AIC: 4.036 p-value 0.024 -0. priv.068 0.124 13.000 0.002 0. ins.189 0.044 0.012 0.3602 4.000 0.065 0. CAIC: Avg.8396 AIC : 19840.000 0. err 0.000 0.531 14.655 0.000 0. ins.146 0.010 0.000 0.000 Information Criteria CAIC : 19899.018 0.Observations: 4564 estimate constant pub.203 st.8396 BIC : 19891.3584 4.425 10.888 10.523 -0.4320 Avg.615 0.865 2.000 0.3472 .592 1.014 t-stat -10. BIC: Avg. sex age edu inc lnalpha -1.596 0.

due to the computational requirements. The chapter on parallel computing is relevant if you wish to use models of this sort.****************************************************** octave:3> If you compare these results to the results for the negative binomial model. you can see that the present model ﬁts better according to the CAIC criterion. The present model is considerably less convenient to work with.2). however. given in subsection (16. SMM To be added in future: do SMM using unconditional moments for SV model (compare to Andersen et al and others) .

and the arguments needed to evaluate it are deﬁned in line 4. This ﬁle is interesting enough to warrant some discussion. The ﬁle emm_moments. it is a general purpose moment condition for EMM estimation. The QML estimate of the parameter of the score generator . Line 3 deﬁnes the DGP. but here we’ll look at some simple. where the DGP and score generator can be passed as arguments. EMM estimation of a discrete choice model In this section consider EMM estimation. There is a sophisticated package by Gallant and Tauchen for this. but hopefully didactic code.m generates data that follows the probit model. A listing appears in Listing 19.SNM To be added. and its arguments are deﬁned in line 6. Thus.m deﬁnes EMM moment conditions. The ﬁle probitdgp. The score generator is deﬁned in line 5.1.

# SG arguments (cell array) phi = momentargs{6}. # the data generating process (DGP) dgpargs = momentargs{3}. # its arguments (cell array) sg = momentargs{4}. # passed with data to ensure . and are thus ﬁxed during estimation. momentargs) k = momentargs{1}. # the score generator (SG) sgargs = momentargs{5}. data. # QML estimate of SG parameter y = data(:. In line 20 we average the scores of the score generator. rand_draws = data(:. The simulated data is generated in line 16.1). x = data(:. 1 2 3 4 5 6 7 8 9 10 function scores = emm_moments(theta.is read in line 7. dgp = momentargs{2}.2:k+1).k+2:columns(data)). Note in line 10 how the random draws needed to simulate data are passed with the data. to avoid ”chattering”. and the derivative of the score generator using the simulated data is calculated in line 18. which are the moment conditions that the function returns.

{phi.rows(phi)).i). scores = zeros(n. # average over number of simulations endfunction 19 20 21 Listing 19.1: emm_moments. sgdata. y = feval(dgp. # simulated data for SG scores = scores + numgradient(sg. x. The results we obtain are .m The ﬁle emm_example.m performs EMM estimation of the probit model. # simulated data sgdata = [y x]. dgpargs). sgargs}). # how many simulations? for i = 1:reps e = rand_draws(:. e. using a logit model as the score generator. # container for moment contributions reps = columns(rand_draws).fixed across iterations 11 12 13 14 15 16 17 18 n = rows(y). theta. # gradient of SG endfor scores = scores / reps.

6648 1.0000 -0.Score generator results: ===================================================== BFGSMIN final results Used analytic gradient -----------------------------------------------------STRONG CONVERGENCE Function conv 1 Param conv 1 Gradient conv 1 -----------------------------------------------------Objective function value 0.8979 1.281571 Stepsize 0.0279 15 iterations -----------------------------------------------------param 1.8875 gradient 0.0000 0.0000 .0000 -0.0000 change 0.9125 1.0000 0.0000 0.0000 -0.

1. err 0.0000 0.022 0.7433 -0.085 st.000 0.022 t-stat 47. test p1 p2 p3 estimate 1.022 0.0000 ====================================================== Model results: ****************************************************** EMM example GMM Estimation Results BFGS convergence: Normal convergence Objective function value: 0.000 0.000000 Observations: 1000 Exactly identified.935 1.000 .630 p-value 0.240 49.618 42.069 0. no spec.

022 49.p4 1.080 0.000 p5 0.000 ****************************************************** It might be interesting to compare the standard errors with those obtained from ML estimation. to check efﬁciency of the EMM estimator.047 0.978 0. One could even do a Monte Carlo study.643 0.023 41. .

19. (more advanced) Do a little Monte Carlo study to compare ML. . (advanced. Compare the SML estimates to ML estimates.m might be helpful).6 Exercises 1. SML and EMM estimation of the probit model. Do an estimation using data generated by a probit model ( probitdgp.5 and describe what they do. 3. 2. but even if you don’t do this you should be able to describe what needs to be done) Write Octave code to do SML estimation of the probit model.5 and describe what they do. 4. (basic) Examine the Octave scripts and functions discussed in subsection 19. (basic) Examine the Octave script and function discussed in subsection 19. Investigate how the number of simulations affect the two simulation-based estimators.

Chapter 20 Parallel programming for econometrics The following borrows heavily from Creel (2005). but it bears emphasis since it is the main reason that parallel computing may be attractive to 897 . Parallel computing can offer an important reduction in the time to complete computations. This is well-known.

was introduced in November of 2002. was introduced in November of 2000. running at 3.users. The ﬁrst is the fact that setting up a cluster of computers for distributed parallel computing is not difﬁcult. using distributed parallel computing on available computers. the Intel Pentium IV (Willamette) processor. The Pentium IV (Northwood-HT) processor. The examples in this chapter show that a 10-fold improvement in performance can be achieved immediately.5GHz. running at 1. Extrapolating this admittedly rough snapshot of the evolution of the performance of commodity processors. . If you are using the ParallelKnoppix bootable CD that accompanies these notes. To illustrate.06GHz.6 years and then purchase a new computer to obtain a 10-fold improvement in computational performance. one would need to wait more than 6. Recent (this is written in 2005) developments that may make parallel computing attractive to a broader spectrum of researchers who do computations. An approximate doubling of the performance of a commodity CPU took place in two years.

end users will ﬁnd it By ”high-level matrix programming language” I mean languages such as MATLAB (TM the Mathworks. If programs that run in parallel have an interface that is nearly identical to the interface of equivalent serial versions. See the ParallelKnoppix tutorial. for example. A third is the spread of dual and quad-core CPUs. A second development is the existence of extensions to some of the high-level matrix programming (HLMP) languages1 that allow the incorporation of parallelism into programs written in these languages. A focus of the examples is on the possibility of hiding parallelization from end users of programs. Ltd.).octave.you are less than 10 minutes away from creating a cluster. Following are examples of parallel implementations of several mainstream problems in econometrics. Ox (TM OxMetrics Technologies. so that an ordinary desktop or laptop computer can be made into a mini-cluster. supposing you have a second computer at hand and a crossover ethernet cable. Those cores won’t work together on a single problem unless they are told how to. Inc. 1 . and GNU Octave (www.org).).

the following examples are the most accessible introduction to parallel programming for econometricians. There are also parallel packages for Ox. taking advantage of the MPI Toolbox (MPITB) for Octave. R.1 Example problems This section introduces example problems from econometrics. and Python which may be of interest to econometricians.easy to take advantage of parallel computing’s performance. (2004). . 20. by by Fernández Baldomero et al. and shows how they can be parallelized in a natural way. but as of this writing. We continue to use Octave.

we use same trace test example as do Doornik. and the number of series in the second position. (2002).Monte Carlo A Monte Carlo study involves repeating a random experiment many times under identical conditions. The single argument in this case is a cell array that holds the length of the series in its ﬁrst position. al. and it returns a row vector that holds the results of one random simulation.m is a function that calculates the trace test statistic for the lack of cointegration of integrated time series. To illustrate the parallelization of a Monte Carlo study. This function is illustrative of the format that we adopt for Monte Carlo simulation of a function: it receives a single argument of cell type. et. 2002. tracetest. Bruche. Several authors have noted that Monte Carlo studies are obvious candidates for parallelization (Doornik et al. It generates a random result though a process that . 2003) since blocks of replications can be done independently on different computers.

. and it reports some output in a row vector (in this case the result is a scalar).m is an Octave script that executes a Monte Carlo study of the trace test by repeatedly evaluating the tracetest. as in line 7. the last argument is the number of slave hosts to use. In line 10. there is a fourth argument. montecarlo.m function. When called with four arguments.he or she must only indicate the number of slave computers to be used.is internal to the function. The main thing to notice about this script is that lines 7 and 10 call the function montecarlo.m executes serially on the computer it is called from. We see that running the Monte Carlo study on one or more processors is transparent to the user . mc_example1. When called with 3 arguments.m.

As Swann (2002) points out. xt)}n of n observations of a set of dependent and explanatory variables. for example .ML For a sample {(yt. yt may be a vector of random variables. the maximum likelihood estimator of the parameter θ can be deﬁned as ˆ θ = arg max sn(θ) where 1 sn(θ) = n n ln f (yt|xt. θ) t=1 Here. this can be broken into sums over blocks of observations. and the model may be dynamic since xt may contain lags of yt.

two blocks: sn(θ) = 1 n n1 ln f (yt|xt. In line 5. In lines 1-3 the data is read. The fourth and ﬁfth arguments are empty placeholders where . and the initial value of the parameter vector is set. while in line 7 the same function is called with 6 arguments. parallelization can be done by calculating each block on separate computers. mle_example1. θ) + n t=1 t=n1 +1 ln f (yt|xt.m is an Octave script that calculates the maximum likelihood estimator of the parameter vector of a model that assumes that the dependent variable is distributed as a Poisson random variable. Again following Swann. the function mle_estimate performs ordinary serial calculation of the ML estimator. the name of the density function is provided in the variable model. we can deﬁne up to n blocks. conditional on some explanatory variables. θ) Analogously.

Users need only learn how to write the likelihood function using the Octave language. 1 in this case.options to mle_estimate may be set. while the sixth argument is the number of slave computers to use for parallel execution. The likelihood function itself is an ordinary Octave function that is not parallelized. It is worth noting that a different likelihood function may be used by making the model variable point to a different function. A person who runs the program sees no parallel programming code the parallelization is transparent to the end user. When executed. beyond having to select the number of slave computers. which are identical. The mle_estimate function is a generic function that can call any likelihood function that has the appropriate input/output syntax for evaluation either serially or in parallel. . this script prints out the estimates theta_s and theta_p.

using for example 2 blocks: n1 1 mt(yt|xt. θ) mn(θ) = n t=1 + n t=n1 +1 mt(yt|xt. θ) (20. it can obviously be computed blockwise.1) .GMM For a sample as above. θ) t=1 Since mn(θ) is an average. the GMM estimator of the parameter θ can be deﬁned as ˆ θ ≡ arg min sn(θ) Θ where sn(θ) = mn(θ) Wnmn(θ) and 1 mn(θ) = n n mt(yt|xt.

theta_s and theta_p are identical up to the tolerance for convergence of the min- imization routine.a different model can be estimated by changing the value of the moments variable. Whether estimation is done in parallel or serially depends only the seventh argument to gmm_estimate . gmm_example1. The function that moments points to is an ordinary Octave function that uses no parallel programming.when it is missing or zero. estimation is by default done serially with one processor. The point to notice here is that an end user can perform the estimation in parallel in virtually the same way as it is done serially. is a generic function that will estimate any model speciﬁed by the moments variable .m is a script that illustrates how GMM estimation may be done serially or in parallel. Again. gmm_estimate. used in lines 8 and 10.Likewise. so users can write their models using the simple and intuitive HLMP syntax of Octave. we may deﬁne up to n blocks. When this is run. When it is . each of which could potentially be computed on a different machine.

on the order of n2k calculations must be done. it speciﬁes the number of slave nodes to use. where k is the dimension of the vector of explanatory variables. x. Racine (2002) demonstrates that MPI parallelization can be used to speed up calculation of the kernel . Kernel regression The Nadaraya-Watson kernel regression estimator of a function g(x) at a point x is ˆ g (x) = n n t=1 yt K [(x − xt ) /γn ] n t=1 K [(x − xt ) /γn ] ≡ t=1 w t yy We see that the weight depends upon every data point in the sample.positive. To calculate the ﬁt at every point in a sample of size n.

We follow this implementation here. The speedup for k nodes is the time to ﬁnish the problem on a single node divided by the time to ﬁnish the problem on k nodes. Note that you can get 10X speedups. a single slave is speciﬁed. Serial execution is obtained by setting the number of slaves equal to zero. as claimed in the . as well as the size of the cluster. In line 17. so execution is in parallel on the master and slave nodes. Figure 20. The example programs show that parallelization may be mostly hidden from end users. Users can beneﬁt from parallelization without having to write or understand parallel code. the efﬁciency of the network. etc. kernel_example1.1 reproduces speedups for some econometric problems on a cluster of 12 desktop computers. in line 15. The speedups one can obtain are highly dependent upon the speciﬁc problem at hand.m is a script for serial and parallel kernel regression.regression estimator by calculating the ﬁts for portions of the sample on different computers. Some examples of speedups are presented in Creel (2005).

Figure 20. It’s pretty obvious that much greater speedups could be obtained using a larger cluster. . for the ”embarrassingly parallel” problems.1: Speedups from parallelization 11 10 9 8 7 6 5 4 3 2 1 2 4 6 nodes 8 10 12 MONTECARLO BOOTSTRAP MLE GMM KERNEL introduction.

Bibliography [1] Bruche. (2003) A note on embarassingly parallel computation using OpenMosix and Ox.A. V. 107-128. M. [2] Creel. M. Shephard (2002) Computationally-intensive econometrics using a dis911 . pp. Computational Economics.. working paper. 26. (2005) User-friendly parallel computations with econometric examples. Hendry and N. [3] Doornik. London School of Economics. J. Financial Markets Group. D.F.

es/ javier-bin/mpitb. 40. 293-302. 19. Computational Statistics & Data Analysis. 145-178. atc. Jeff (2002) Parallel distributed kernel estimation. [4] Fernández Baldomero. C. Computational Economics. [6] Swann. Series A.tributed matrix-programming language.ugr. Philosophical Transactions of the Royal Society of London. 1245-1266. J. (2004) LAM/MPI parallel computing under GNU Octave. .A. 360. [5] Racine. (2002) Maximum likelihood estimation using parallel computing: an introduction to MPI.

IGNORE IT FOR NOW In this last chapter we’ll go through a worked example that com913 .Chapter 21 Final project: econometric estimation of a RBC model THIS IS NOT FINISHED .

The data are obtained from the US Bureau of Economic Analysis (BEA) National Income and Product Accounts (NIPA). The program plots.1.1 though 21. Table 11.m will make a few plots. 21. First looking at the plot for levels. The data we use are in the ﬁle rbc_data.1 Data We’ll develop a model for private consumption and real gross private investment. We’ll do simulated method of moments estimation of a real business cycle model. There appears to be somewhat of a structural change in the .5.3.bines a number of the topics we’ve seen. including Figures 21. This data is real (constant dollars).m. Lines 2 and 6 (you can download quarterly data from 1947-I to the present). surprise). similar to what Valderrama (2002) does. we can see that real consumption and investment are clearly nonstationary (surprise.

Looking at growth rates. becoming more moderate in the 90’s. volatility seems to have declined.the data that the models generate is stationary and ergodic. Or. Levels Figure 21. Since 1990 or so. The volatility of growth of consumption has declined somewhat. Growth Rates mid-1970’s. over time.3: Consumption and Investment. the data that the models generate needs to be passed Figure 21. there are some notable periods of high volatility in the mid-1970’s and early 1980’s.1: Consumption and Investment. for example.Figure 21. Bandpass Filtered . Economic models for growth often imply that there is no long term growth (!) . the series for consumption has an extended period of high growth in the 1970’s. Looking at investment.2: Consumption and Investment.

kt}∞ E0 t=0 ct + kt log φt t ∞ t t=0 β U (ct ) α (1 − δ) kt−1 + φtkt−1 = = ∼ ρ log φt−1 + IIN (0.3. and generate stationary business cycle data by applying the bandpass ﬁlter of Christiano and Fitzgerald (1999).2 An RBC Model Consider a very simple stochastic growth model (the same used by Maliar and Maliar (2003). The ﬁltered data is in Figure 21. We’ll follow this. To get data that look like the levels for consumption and investment. σ 2) t . 21.through the inverse of a ﬁlter. we’d need to apply the inverse of the bandpass ﬁlter. with minor notational difference): max{ct. We’ll try to specify an economic model that can generate similar data.

the utility function is logarithmic. • gross investment. it. • γ is the coefﬁcient of relative risk aversion.Assume that the utility function is c1−γ − 1 U (ct) = t 1−γ • β is the discount rate • δ is the depreciation rate of capital • α is the elasticity of output with respect to capital • φ is a technology shock that is positive. When γ = 1. φt is observed in period t. is the change in the capital stock: it = kt − (1 − δ) kt−1 .

recall. Now we’ll try to use some more informative moment conditions to see if we get better results. and use it to deﬁne a GMM estimator.16 and 14. We would like to estimate the parameters θ = β. α. γ. Once can derive the Euler condition in the same way we did there.• we assume that the initial condition (k0. σ 2 using the data that we have on consumption and investment. This problem is very similar to the GMM estimation of the portfolio model discussed in Sections 14. Let yt be a G-vector of jointly dependent vari- . ρ. That approach was not very successful. A vector autogression is just the vector version of an autoregressive model. 21. θ0) is given. δ.17.3 A reduced form model Macroeconomic time series data are often modeled using vector autoregressions.

ables.. I2). if all of the variables are endogenous.. Let vt = Rtηt. Clearly.p.. one would need some form of additional information to identify a structural model.2.. and Rt is upper triangular..yt−p) = RtRt. A VAR(p) model is yt = c + A1yt−1 + A2yt−2 + . j=1. where ηt ∼ IIN (0. . A well-ﬁtting reduced form model will be adequate for the purpose. and Aj .. But we already have a structural model. . So V (vt|yt−1. + Apyt−p + vt where c is a G-vector of parameters. and we’re only going to use the VAR to help us estimate the parameters. We’re seen that our data seems to have episodes where the variance of growth rates and ﬁltered data is non-constant. are G × G matrices of parameters.. You can think of a VAR model as the reduced form of a dynamic linear simultaneous equations model where all of the variables are treated as endogenous. Without going into details.. This brings us to the general area of stochastic volatility.

. the vector of elements in the upper triangle of Rt (in our case this is a 3 × 1 vector). pg.1) speciﬁcation. as well as upon the past realizations of the shocks..) log ht−1 The variance of the VAR error depends upon its own past. Deﬁne ht = vec∗(Rt). The obvious generalization is the EGARCH(r. m) speciﬁcation.) |vt−1| − 2/π + ℵ(j. with longer lags (r for lags of v. m for lags of h)..)vt−1 + G(j. • The advantage of the EGARCH formulation is that the variance is assuredly positive without parameter restrictions . We assume that the elements follow log hjt = κj + P(j. • This is an EGARCH(1. 668-669).we’ll just consider the exponential GARCH model of Nelson (1991) as presented in Hamilton (1994.

so that positive and negative shocks can have asymmetric effects upon volatility. given the data and the . • The matrix G has dimension 3 × 3. • The parameter matrix ℵ allows for leverage. With the above speciﬁcation. I2) ηt = R−1vt t and we know how to calculate Rt and vt.• The matrix P has dimension 3 × 2. • We will probably want to restrict these parameter matrices in some way. we have ηt ∼ IIN (0. • The matrix ℵ (reminder to self: this is an ”aleph”) has dimension 2 × 2. G could plausibly be diagonal. For instance.

4 21.parameters. 21. This will be the score generator. The parameterized expectations algorithm (PEA: den Haan and . Thus. it is straighforward to do estimation by maximum likelihood.5 Results (I): The score generator Solving the structural model The ﬁrst order condition for the structural model is α−1 c−γ = βEt c−γ 1 − δ + αφt+1kt t t+1 or ct = βEt c−γ t+1 1−δ+ α−1 αφt+1kt −1 γ The problem is that we cannot solve for ct since we do not know the solution for the expectation in the previous equation.

1990). there exist parameter values that make the approximation as close to the true expectation as is desired. and then for kt using the restriction that α ct + kt = (1 − δ) kt−1 + φtkt−1 This allows us to generate a series {(ct. we can solve for ct.Marcet. kt)}. Then the expectations . As long as the parametric function is a ﬂexible enough function of variables that have been realized in period t. The expectations term is replaced by a parametric function. is a means of solving the problem. We will write the approximation α−1 Et c−γ 1 − δ + αφt+1kt t+1 exp (ρ0 + ρ1 log φt + ρ2 log kt−1) For given values of the parameters of this approximating function.

γ. This can be used to do EMM estimation using the scores of the reduced form model to deﬁne moments. kt)} using the PEA. ρ. using the simulated data from the structural model. δ. the expectations function is the best ﬁt to the generated data. we can generate data {(ct. it can be made to be equal to the true expectations function by using a long enough simulation. θ = β. When this is the case. As long it is a rich enough parametric model to encompass the true expectations function. α. Thus. it)} using it = kt − (1 − δ) kt−1. From this we can get the series {(ct. . given the parameters of the structural model.approximation is updated by ﬁtting α−1 c−γ 1 − δ + αφt+1kt = exp (ρ0 + ρ1 log φt + ρ2 log kt−1) + ηt t+1 by nonlinear least squares. The 2 step procedure of generating data and updating the parameters of the approximation to expectations is iterated until the parameters no longer change. σ 2 .

W. Press 925 . [2] den Haan. Princeton Univ. and Marcet. (1994) Time Series Analysis. A. Journal of Business and Economics Statistics. [3] Hamilton. 31-34. 8. M (2005) A Note on Parallelizing the Parameterized Expectations Algorithm.Bibliography [1] Creel. J. (1990) Solving the stochastic growth model by parameterized expectations.

Federal Reserve Bank of San Francisco.org/p/ fip/fedfap/2002-13. D. http://ideas. [6] Valderrama.repec.html . D. (1991) Conditional heteroscedasticity is asset returns: a new approach. 34770.[4] Maliar. (2002) Statistical nonlinearities in the business cycle: a challenge for the canonical RBC model. S. and Maliar. Economic Research. 59. Econometrica. (2003) Matlab code for Solving a Neoclassical Growh Model with a Parametrized Expectations Algorithm and Moving Bounds [5] Nelson. L.

so you can get it for free and modify it if you like. because it is a high quality environment that is easily extensible. it is licensed under the GNU GPL.Chapter 22 Introduction to Octave Why is Octave being used here. It’s also quite easy to learn. Mac OSX and Windows systems. uses well-tested and high performance numerical libraries. since it’s not that well-known by econometricians? Well. 927 . and it runs on both GNU/Linux.

or that you have conﬁgured your computer to be able to run the *. and boot your computer with it. This will give you this same PDF ﬁle. These are just some rudiments.22. as was described in Section 1.2 A short introduction The objective of this introduction is to learn just the basics of Octave. From this point. After this. There are other ways to use Octave. which is of course installed.1 Getting started Get the ParallelKnoppix CD.5.m ﬁles mentioned below. Then burn the image. I assume you are running the CD (or sitting in the computer room across the hall from my ofﬁce). you can look at . 22. which I encourage you to explore. The editor is conﬁgure with a macro to execute the programs using Octave. but with all of the example programs ready to run.

m gets us started. To run this.1). So study the examples! Octave can be used interactively. with NEdit (by ﬁnding the correct ﬁle inside the /home/knoppix/Desktop/Econom .” and re-run the program. That’s because printf() doesn’t automatically start a new line. Edit first. and calling Octave from within the editor.m so that the 8th line reads ”printf(hello world\n). Students of mine: your problem sets will include exercises that can be done by modifying the example programs in relatively minor ways. We’ll use this second method. Note that the output is not formatted in a pleasing way. or it can be used to run programs that are written using a text editor. and run them) to learn more about how Octave can be used to do econometrics.the example programs scattered throughout the document (and edit them. The program ﬁrst. open it up folder and clicking on the icon) and then type CTRL-ALT-o. or use the Octave item in the Shell menu (see Figure 22. preparing programs with NEdit.

Figure 22.1: Running an Octave program .

Please have a look at CommonOperations. After having done so. just use the command ”load myfile. you will ﬁnd the ﬁle ”x” in the directory Econometrics/Examples/OctaveIntro/ You might have a look at it with NEdit to see Octave’s default format for saving data. out of context. If you are looking at the browsable PDF version of this document.m shows how. Basically. then you should be able to click on links to open them. named for example ”myfile. have a look at the Octave programs that are included as examples. if you have data in an ASCII text ﬁle. Those pages will allow you to examine individual ﬁles. the matrix ”myfile” (without extension) will contain the data.data”. Once you have run this.m for examples of how to do some basic things in Octave.We need to know how to load and save data. If not. The program second. formed of numbers separated by spaces. the example programs are available here and the support ﬁles needed to run these are available here. Now that we’re done with the basics.data”. To actually use these ﬁles (edit and .

which is for Matlab.g. since you will probably want to download the pdf version together with all the support ﬁles and examples. You might like to check the article Econometrics with Octave and the Econometrics Toolbox . 22. and tell Octave how to ﬁnd them. Then to get the same behavior as found on the CD. you need to: • Get the collection of support programs and the examples.run them). from the document home page..3 If you’re running a Linux installation. • Put them somewhere. but much of which could be easily used with Octave. by putting a link to the MyOctaveFiles directory in /usr/local/share/octave/ . Or get the bootable CD. you should go to the home page of this document. e... There are some other resources for doing econometrics with Octave.

• Make sure nedit is installed and configured to run Octave and use syntax highlighting. Copy the file /home/econometrics/.nedit from the CD to do this. Or, get the file NeditConfiguration and save it in your $HOME directory with the name ".nedit". Not to put too fine a point on it, please note that there is a period in that name.
• Associate *.m files with NEdit so that they open up in the editor when you click on them.

That should do it.

There are some other resources for doing econometrics with Octave. You might like to check the article Econometrics with Octave and the Econometrics Toolbox, which is for Matlab, but much of which could be easily used with Octave.
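As an alternative to the symbolic link mentioned in the list above, one can tell Octave where the support programs live from an ~/.octaverc startup file. This is a sketch under the assumption that you unpacked the files into a directory called MyOctaveFiles in your home directory; adjust the path to wherever you actually put them:

% line to put in ~/.octaverc so Octave finds the support programs at startup
addpath([getenv("HOME") "/MyOctaveFiles"]);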

Chapter 23 Notation and Review

• All vectors will be column vectors, unless they have a transpose symbol (or I forget to apply this rule - your help catching typos and er0rors is much appreciated). For example, if x_t is a p × 1 vector, x_t' is a 1 × p vector. When I refer to a p-vector, I mean a column vector.

23.1 Notation for differentiation of vectors and matrices

[3, Chapter 1] Let s(·) : ℜ^p → ℜ be a real valued function of the p-vector θ. Then ∂s(θ)/∂θ is organized as a p-vector:

∂s(θ)/∂θ = [ ∂s(θ)/∂θ_1 ; ∂s(θ)/∂θ_2 ; ... ; ∂s(θ)/∂θ_p ]

Following this convention, ∂s(θ)/∂θ' is a 1 × p vector, and ∂²s(θ)/∂θ∂θ' is a p × p matrix. Also,

∂²s(θ)/∂θ∂θ' = ∂/∂θ ( ∂s(θ)/∂θ' ) = ∂/∂θ' ( ∂s(θ)/∂θ ).

Exercise 56. For a and x both p-vectors, show that ∂(a'x)/∂x = a.

Let f(θ) : ℜ^p → ℜ^n be an n-vector valued function of the p-vector θ, and let f(θ)' be the 1 × n valued transpose of f. Then ∂f(θ)'/∂θ is a p × n matrix, and (∂f(θ)'/∂θ)' = ∂f(θ)/∂θ'.

• Product rule: Let f(θ) : ℜ^p → ℜ^n and h(θ) : ℜ^p → ℜ^n be n-vector valued functions of the p-vector θ. Then

∂/∂θ' [ h(θ)'f(θ) ] = h' (∂f/∂θ') + f' (∂h/∂θ')

has dimension 1 × p. Applying the transposition rule we get

∂/∂θ [ h(θ)'f(θ) ] = (∂f'/∂θ) h + (∂h'/∂θ) f

which has dimension p × 1.

Exercise 57. For A a p × p matrix and x a p × 1 vector, show that ∂²(x'Ax)/∂x∂x' = A + A'.

• Chain rule: Let f(·) : ℜ^p → ℜ^n be an n-vector valued function of a p-vector argument, and let g() : ℜ^r → ℜ^p be a p-vector valued function of an r-vector valued argument ρ. Then

∂/∂ρ' f[g(ρ)] = ( ∂f(θ)/∂θ' )|_{θ=g(ρ)} · ∂g(ρ)/∂ρ'

has dimension n × r.

Exercise 58. For x and β both p × 1 vectors, show that ∂ exp(x'β)/∂β = exp(x'β)x.

23.2 Convergence modes

Readings: [1, Chapter 4]; [4, Chapter 4].

We will consider several modes of convergence. The first three modes discussed are simply for background. The stochastic modes

are those which will be used later in the course.

Definition 59. A sequence is a mapping from the natural numbers {1, 2, ...} = {n}_{n=1}^∞ = {n} to some other set, so that the set is ordered according to the natural numbers associated with its elements.

Real-valued sequences:

Definition 60. [Convergence] A real-valued sequence of vectors {a_n} converges to the vector a if for any ε > 0 there exists an integer N_ε such that for all n > N_ε, ‖a_n − a‖ < ε. a is the limit of a_n, written a_n → a.

Deterministic real-valued functions

Consider a sequence of functions {f_n(ω)} where f_n : Ω → T ⊆ ℜ.

Ω may be an arbitrary set.

Definition 61. [Pointwise convergence] A sequence of functions {f_n(ω)} converges pointwise on Ω to the function f(ω) if for all ε > 0 and ω ∈ Ω there exists an integer N_εω such that |f_n(ω) − f(ω)| < ε, ∀n > N_εω.

It's important to note that N_εω depends upon ω, so that convergence may be much more rapid for certain ω than for others. Uniform convergence requires a similar rate of convergence throughout Ω.

Definition 62. [Uniform convergence] A sequence of functions {f_n(ω)} converges uniformly on Ω to the function f(ω) if for any ε > 0 there exists an integer N such that sup_{ω∈Ω} |f_n(ω) − f(ω)| < ε, ∀n > N.

(insert a diagram here showing the envelope around f(ω) in which

For example. i. i. Stochastic sequences In econometrics. recall that a random variable maps the sample space to the real line. A number of modes of convergence are in use when dealing with sequences of random variables.e. A sequence of random variables {Xn(ω)} is a collection of such mappings. each Xn(ω) is a random variable with respect to the probability space (Ω.fn(ω) must lie). F. Several such modes of convergence should already be familiar: Deﬁnition 63.e. where n is the sample size. the OLS estimaˆ tor βn = (X X)−1 X Y. [Convergence in probability] Let Xn(ω) be a sequence of random variables. P ) . given the model Y = Xβ 0 +ε. Given a probability space (Ω. we typically deal with stochastic sequences. F.. X(ω) : Ω → . P ) . Let An = . and let X(ω) be a random variable.. can be used to form ˆ a sequence of random vectors {βn}.

. a.s. Then {Xn(ω)} converges in probability to X(ω) if n→∞ lim P (An) = 0. and let X(ω) be a random variable. Deﬁnition 64. One can show that Xn → X ⇒ Xn → X. Let A = {ω : limn→∞ Xn(ω) = X(ω)}.{ω : |Xn(ω) − X(ω)| > ε}. Almost sure convergence is written as Xn → X. p a. In other words. Then {Xn(ω)} converges almost surely to X(ω) if P (A) = 1. Xn(ω) → X(ω) (ordinary convergence of the two functions) except on a set C = Ω − A such that P (C) = 0. ∀ε > 0. p Convergence in probability is written as Xn → X.s. or plim Xn = X. [Almost sure convergence] Let Xn(ω) be a sequence of random variables. or Xn → X. a.s.

.v. then Xn converges in distribution to X. since ˆ βn = β 0 + and Xε n a. Xn have distribution function F. Note that this term is not a function of the parameter β. It can be shown that convergence in probability implies convergence in distribution. This easy proof is a result of the linearity of the model.s.v. XX n −1 Xε . d Stochastic functions Simple laws of large numbers (LLN’s) allow us to directly conclude ˆ a. If Fn → F at every continuity point of F. Convergence in distribution is written as Xn → X.Deﬁnition 65. n → 0 by a SLLN. Xn have distribution function Fn and the r. that βn → β 0 in the OLS example. [Convergence in distribution] Let the r.s.

In general, this is not possible. We often deal with the more complicated situation where the stochastic sequence depends on parameters in a manner that is not reducible to a simple sequence of random variables. In this case, we have a sequence of random functions that depend on θ: {X_n(ω, θ)}, where each X_n(ω, θ) is a random variable with respect to a probability space (Ω, F, P) and the parameter θ belongs to a parameter space θ ∈ Θ.

Definition 66. [Uniform almost sure convergence] {X_n(ω, θ)} converges uniformly almost surely in Θ to X(ω, θ) if

lim_{n→∞} sup_{θ∈Θ} |X_n(ω, θ) − X(ω, θ)| = 0  (a.s.)

Implicit is the assumption that all X_n(ω, θ) and X(ω, θ) are random variables w.r.t. (Ω, F, P) for all θ ∈ Θ. We'll indicate uniform almost sure convergence by →u.a.s. and uniform convergence in probability by →u.p.

• An equivalent definition, based on the fact that "almost sure" means "with probability one", is

Pr( lim_{n→∞} sup_{θ∈Θ} |X_n(ω, θ) − X(ω, θ)| = 0 ) = 1

This has a form similar to that of the definition of a.s. convergence - the essential difference is the addition of the sup.

23.3 Rates of convergence and asymptotic equality

It's often useful to have notation for the relative magnitudes of quantities. Quantities that are small relative to others can often be ignored, which simplifies analysis.

Definition 67. [Little-o] Let f(n) and g(n) be two real-valued functions. The notation f(n) = o(g(n)) means lim_{n→∞} f(n)/g(n) = 0.

Definition 68. [Big-O] Let f(n) and g(n) be two real-valued functions. The notation f(n) = O(g(n)) means there exists some N such that for n > N, |f(n)/g(n)| < K, where K is a finite constant. This definition doesn't require that f(n)/g(n) have a limit (it may fluctuate boundedly).

If {f_n} and {g_n} are sequences of random variables, analogous definitions are

Definition 69. The notation f(n) = o_p(g(n)) means f(n)/g(n) →p 0.

Example 70. The least squares estimator θ̂ = (X'X)⁻¹X'Y = (X'X)⁻¹X'(Xθ⁰ + ε) = θ⁰ + (X'X)⁻¹X'ε. Since plim [(X'X)⁻¹X'ε]/1 = 0, we can write (X'X)⁻¹X'ε =

o_p(1) and θ̂ = θ⁰ + o_p(1). Asymptotically, the term o_p(1) is negligible. This is just a way of indicating that the LS estimator is consistent.

Definition 71. The notation f(n) = O_p(g(n)) means there exists some N_ε such that for ε > 0 and all n > N_ε,

P( |f(n)/g(n)| < K_ε ) > 1 − ε,

where K_ε is a finite constant.

Example 72. If X_n ∼ N(0, 1) then X_n = O_p(1), since, given ε, there is always some K_ε such that P(|X_n| < K_ε) > 1 − ε.

Useful rules:
• O_p(n^p) O_p(n^q) = O_p(n^{p+q})
• o_p(n^p) o_p(n^q) = o_p(n^{p+q})

Example 73. Consider a random sample of iid r.v.'s with mean 0 and variance σ². The estimator of the mean θ̂ = (1/n) Σ_{i=1}^n x_i is asymptotically normally distributed, e.g., n^{1/2} θ̂ is asymptotically distributed as N(0, σ²). So n^{1/2} θ̂ = O_p(1), so θ̂ = O_p(n^{−1/2}). Before we had θ̂ = o_p(1); now we have the stronger result that relates the rate of convergence to the sample size.

Example 74. Now consider a random sample of iid r.v.'s with mean µ and variance σ². The estimator of the mean θ̂ = (1/n) Σ_{i=1}^n x_i is asymptotically normally distributed, e.g., n^{1/2}(θ̂ − µ) is asymptotically distributed as N(0, σ²). So n^{1/2}(θ̂ − µ) = O_p(1), so θ̂ − µ = O_p(n^{−1/2}), so θ̂ = O_p(1).

These two examples show that averages of centered (mean zero) quantities typically have plim 0, while averages of uncentered quantities have finite nonzero plims. Note that the definition of O_p does not mean that f(n) and g(n) are of the same order. Asymptotic equality ensures that this is the case.

Definition 75. Two sequences of random variables {f_n} and {g_n} are asymptotically equal (written f_n =a g_n) if

plim f(n)/g(n) = 1

Finally, analogous almost sure versions of o_p and O_p are defined in the obvious way.
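The root-n rate in Examples 73 and 74 is easy to see by simulation. The following sketch (not from the text; the distribution, sample sizes and number of replications are made up) shows that the spread of θ̂ − µ shrinks with n while the spread of n^{1/2}(θ̂ − µ) stays roughly constant, which is the content of θ̂ − µ = O_p(n^{−1/2}):

reps = 1000; mu = 0.5;                      % U(0,1) draws have mean 0.5
for n = [100 1000 10000]
    xbar = mean(rand(n, reps))';            % reps independent estimates of the mean
    printf("n = %6d  sd(xbar-mu) = %f  sd(sqrt(n)*(xbar-mu)) = %f\n", ...
           n, std(xbar - mu), std(sqrt(n)*(xbar - mu)));
end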

For a and x both p × 1 vectors, show that D_x a'x = a.

For A a p × p matrix and x a p × 1 vector, show that D²_x x'Ax = A + A'.

For x and β both p × 1 vectors, show that D_β exp(x'β) = exp(x'β)x.

For x and β both p × 1 vectors, find the analytic expression for D²_β exp(x'β).

Write an Octave program that verifies each of the previous results by taking numeric derivatives. For a hint, type help numgradient and help numhessian inside octave.
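As a hint for the simplest case only, here is a sketch of the kind of check the last exercise asks for, written with plain forward differences rather than the numgradient routine from the support files (which you can also use). It verifies D_x a'x = a at one random point; the dimension and step size are arbitrary:

p = 3; a = randn(p,1); x = randn(p,1); h = 1e-6;
g = zeros(p,1);
for i = 1:p
    e = zeros(p,1); e(i) = h;
    g(i) = (a'*(x+e) - a'*x)/h;   % forward-difference derivative in direction i
end
disp([g a])                        % the two columns should agree to several digits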

Chapter 24 Licenses

This document and the associated examples and materials are copyright Michael Creel, under the terms of the GNU General Public License, ver. 2, or at your option, under the Creative Commons Attribution-Share Alike License, Version 2.5. The licenses follow.

By contrast.1 The GPL GNU GENERAL PUBLIC LICENSE Version 2. Inc. This License is intended to guarantee your freedom to share and change fre . 59 Temple Place. MA 02111-1307 USA Everyone is permitted to copy and distribute verbatim copies of this license document. June 1991 Copyright (C) 1989. 1991 Free Software Foundation. Suite 330. Boston. Preamble The licenses for most software are designed to take away your freedom to share and change it. the GNU General Public software--to make sure the software is free for all its users.24. but changing it is not allowed.

not using it. Our General Public Licenses are designed to make sure that you have the freedom to distribute copies of free software (and charge fo if you want it. that you receive source code or can get it anyone to deny you these rights or to ask you to surrender the rights These restrictions translate to certain responsibilities for you if y . we are referring to freedom.) You can apply it to your programs.General Public License applies to most of the Free Software Foundation's software and to any other program whose authors commit t the GNU Library General Public License instead. too. that you can change the software or use pieces of it in new free programs. To protect your rights. (Some other Free Software Foundation software is covered by price. and that you know you can do these things. we need to make restrictions that forbid this service if you wish). When we speak of free software.

for each author's protection and ours. or if you modify it. if you distribute copies of such a program. And you must show them these terms so they know their rights. a (2) offer you this license which gives you legal permission to copy. . too. distribute and/or modify the software. you must give the recipients all the rights that We protect your rights with two steps: (1) copyright the software. receive or can get the source code. want its recipients to know that what they have is not the original. You must make sure that they. we want to make certain that everyone understands that there is no warranty for this free software. whether you have. If the software is modified by someone else and passed on. gratis or for a fee. For example. Also.distribute copies of the software.

We wish to avoid the danger that redistributors of a free program proprietary. Finally. . in effect making th patent must be licensed for everyone's free use or not licensed at al The precise terms and conditions for copying. To prevent this.that any problems introduced by others will not reflect on the origin authors' reputations. we have made it clear that any program will individually obtain patent licenses. distribution and modification follow. any free program is threatened constantly by software patents.

The "Program". either verbatim or with modifications and/or translated into another the term "modification". means either the Program or any derivative work under copyright law: that is to say. translation is included without limitation in running the Program is not restricted. they are outside its scope.) Each licensee is addressed as "you". below. and the output from the Progra .GNU GENERAL PUBLIC LICENSE TERMS AND CONDITIONS FOR COPYING. This License applies to any program or other work which contains a notice placed by the copyright holder saying it may be distributed under the terms of this General Public License. DISTRIBUTION AND MODIFICATION 0. (Hereinafter. and a "work based on the Program" language. The act of refers to any such program or work. distribution and modification are not covered by this License. Activities other than copying. a work containing the Program or a portion of it.

You may copy and distribute verbatim copies of the Program's source code as you receive it. in any medium. 1. Whether that is true depends on what the Program does. and you may at your option offer warranty protection in exchange for a fe 2. provided that you conspicuously and appropriately publish on each copy an appropriate copyright notice and disclaimer of warranty. You may modify your copy or copies of the Program or any portion . keep intact all the and give any other recipients of the Program a copy of this License along with the Program. notices that refer to this License and to the absence of any warranty You may charge a fee for the physical act of transferring a copy.is covered only if its contents constitute a work based on the Program (independent of having been made by running the Program).

provided that you also meet all of these conditions: a) You must cause the modified files to carry prominent notices stating that you changed the files and the date of any change. to be licensed as a whole at no charge to all third parties under the terms of this License. b) You must cause any work that you distribute or publish. that in whole or in part contains or is derived from the Program or any part thereof. and copy and distribute such modifications or work under the terms of Section 1 above. to print or display an announcement including an appropriate copyright notice and a .of it. c) If the modified program normally reads commands interactively when run. when started running for such interactive use in the most ordinary way. you must cause it. thus forming a work based on the Program.

) These requirements apply to the modified work as a whole. and can be reasonably considered independent and separate works in themselves. then this License. But when you . your work based on the Program is not required to print an announcement. If identifiable sections of that work are not derived from the Program.notice that there is no warranty (or else. and telling the user how to view a copy of this License. and its terms. saying that you provide a warranty) and that users may redistribute the program under these conditions. (Exception: if the Program itself is interactive but does not normally print such an announcement. do not apply to those sections when you distribute them as separate works.

the distribution of the whole must be on the terms of entire whole. mere aggregation of another work not based on the Progra a storage or distribution medium does not bring the other work under the scope of this License. whose permissions for other licensees extend to the on the Program.distribute the same sections as part of a whole which is a work based this License. the intent is to In addition. rather. it is not the intent of this section to claim rights or contest exercise the right to control the distribution of derivative or collective works based on the Program. with the Program (or with a work based on the Program) on a volume of . You may copy and distribute the Program (or a work based on it. 3. and thus to each and every part regardless of who wrote Thus. your rights to work written entirely by you.

c) Accompany it with the information you received as to the offer to distribute corresponding source code. which must be distributed under the terms of Sections 1 and 2 above on a medium customarily used for software interchange. valid for at least three years. or. for a charge no more than your cost of physically performing source distribution. to be distributed under the terms of Sections 1 and 2 above on a medium customarily used for software interchange. a complete machine-readable copy of the corresponding source code. b) Accompany it with a written offer.under Section 2) in object code or executable form under the terms of Sections 1 and 2 above provided that you also do one of the following a) Accompany it with the complete corresponding machine-readable source code. (This alternative is . to give any third party.

complete source code means all the source code for all modules it contains. unless that component itself accompanies the executable. plus any associated interface definition files. If distribution of executable or object code is made by offering control compilation and installation of the executable.allowed only for noncommercial distribution and only if you received the program in object code or executable form with such an offer. and so on) of the operating system on which the executable runs. the source code distributed need not include anything that is normally distributed (in either source or binary form) with the major components (compiler. plus the scripts used to special exception. in accord with Subsection b above. However.) The source code for a work means the preferred form of the work for making modifications to it. For an executable work. as a . kernel.

Any attempt otherwise to copy. You may not copy. from you under this License will not have their licenses terminated so long as such parties remain in full compliance.access to copy from a designated place. then offering equivalent access to copy the source code from the same place counts as distribution of the source code. modify. or distribute the Program except as expressly provided under this License. sublicense. even though third parties are not compelled to copy the source along with the object code. 4. or rights. modify. void. and will automatically terminate your rights under this License . parties who have received copies. sublicense or distribute the Program is However.

distribute or modify the Program subject t restrictions on the recipients' exercise of the rights granted herein . nothing else grants you permission to modify or distribute the Program or its derivative works. and all its terms and conditions for copying. since you have not signed it. 6. Therefore. by modifying or distributing the Program (or any work based on the Program). you indicate your acceptance of this License to do so. the recipient automatically receives a license from the these terms and conditions. These actions are prohibited by law if you do not accept this License. However. distributing or modifying the Program or works based on it. Each time you redistribute the Program (or any work based on the Program).5. You may not impose any further You are not responsible for enforcing compliance by third parties to original licensor to copy. You are not required to accept this License.

if a patent License and any other pertinent obligations. then as a consequence yo license would not permit royalty-free redistribution of the Program b the only way you could satisfy both it and this License would be to refrain entirely from distribution of the Program. If. For example. conditions are imposed on you (whether by court order. If you cannot otherwise) that contradict the conditions of this License. agreement or excuse you from the conditions of this License.this License. as a consequence of a court judgment or allegation of patent infringement or for any other reason (not limited to patent issues). all those who receive copies directly or indirectly through you. 7. they do no distribute so as to satisfy simultaneously your obligations under thi may not distribute the Program at all. then If any portion of this section is held invalid or unenforceable under .

It is not the purpose of this section to induce you to infringe any patents or other property right claims or to contest validity of any such claims. which is implemented by public license practices. to distribute software through any other system and a licensee cannot This section is intended to make thoroughly clear what is believed to . the balance of the section is intended t apply and the section as a whole is intended to apply in other circumstances.any particular circumstance. it is up to the author/donor to decide if he or she is willin impose that choice. Many people have made generous contributions to the wide range of software distributed through that system in reliance on consistent application of that system. this section has the sole purpose of protecting the integrity of the free software distribution system.

The Free Software Foundation may publish revised and/or new versi of the General Public License from time to time. If the distribution and/or use of the Program is restricted in original copyright holder who places the Program under this License may add an explicit geographical distribution limitation excluding those countries. Such new versions wi . this License incorporates the limitation as if written in the body of this License. 8. so that distribution is permitted only in or among countries not thus excluded. the 9. In such case.be a consequence of the rest of this License. certain countries either by patents or by copyrighted interfaces.

write to the Free Software Foundation. 10. If the Program does not specify a version number Foundation. If the Program specifies a version number of this License which applies to it and "a either of that version or of any later version published by the Free later version". If you wish to incorporate parts of the Program into other free to ask for permission. write to the au Software Foundation. you have the option of following the terms and condit Software Foundation. you may choose any version ever published by the Free S programs whose distribution conditions are different. we someti make exceptions for this.be similar in spirit to the present version. but may differ in detail address new problems or concerns. Our decision will be guided by the two goal . Each version is given a distinguishing version number. For software which is copyrighted by the Free this License.

THE ENTIRE RISK TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. INCLUDING. EITHER EXPR MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WR . BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE. THE IMPLIED WARRANTIES OF PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND. YOU ASSUME THE COST OF ALL NECESSARY SERVICI 12. EXCEPT WH OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIE OR IMPLIED. SHOULD THE REPAIR OR CORRECTION. THERE IS NO WARR FOR THE PROGRAM.of preserving the free status of all derivatives of our free software of promoting the sharing and reuse of software generally. PROGRAM PROVE DEFECTIVE. TO THE EXTENT PERMITTED BY APPLICABLE LAW. NO WARRANTY 11. BUT NOT LIMITED TO.

OR ANY OTHER PARTY WHO MAY MODIFY AND/OR REDISTRIBUTE THE PROGRAM AS PERMITTED ABOVE. SPECIAL. BE LIABLE TO YOU FOR DAM INCLUDING ANY GENERAL. END OF TERMS AND CONDITIONS TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED PROGRAMS). INCIDENTAL OR CONSEQUENTIAL DAMAGES A OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIM YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY POSSIBILITY OF SUCH DAMAGES.WILL ANY COPYRIGHT HOLDER. EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE How to Apply These Terms to Your New Programs .

either version 2 of the License.If you develop a new program. It is safest to attach them to the start of each source file to most effectively convey the exclusion of warranty. attach the following notices to the program. <one line to give the program's name and a brief idea of what it doe Copyright (C) <year> <name of author> This program is free software. and you want it to be of the greatest free software which everyone can redistribute and change under these To do so. and each file should have at least possible use to the public. or it under the terms of the GNU General Public License as published by . the best way to achieve this is to make i the "copyright" line and a pointer to where the full notice is found. you can redistribute it and/or modify the Free Software Foundation.

This program is distributed in the hope that it will be useful. MA 02111-1307 Also add information on how to contact you by electronic and paper ma If the program is interactive. if not. Boston. Suite 330.. without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. Inc.(at your option) any later version. 59 Temple Place. You should have received a copy of the GNU General Public License along with this program. See the GNU General Public License for more details. but WITHOUT ANY WARRANTY. make it output a short notice like thi when it starts in an interactive mode: . write to the Free Software Foundation.

alter the names: Yoyodyne. and you are welcome to redistribute it under certain conditions. they could even You should also get your employer (if you work as a programmer) or yo school. Copyright (C) year name of author This is free software. to sign a "copyright disclaimer" for the program. for details type `sho The hypothetical commands `show w' and `show c' should show the appro parts of the General Public License. Of course. Gnomovision comes with ABSOLUTELY NO WARRANTY. type `show c' for details. Inc.Gnomovision version 69.. be called something other than `show w' and `show c'. if necessary. hereby disclaims all copyright interest in the progr . Here is a sample. the commands you use mouse-clicks or menu items--whatever suits your program. if any.

2 Creative Commons Legal Code Attribution-ShareAlike 2. use the GNU Library General Public License instead of this License. If this is what you want to do. If your program is a subroutine library. President of Vice This General Public License does not permit incorporating your progra proprietary programs. 1 April 1989 Ty Coon.`Gnomovision' (which makes passes at compilers) written by James Hac <signature of Ty Coon>. you ma library.5 CREATIVE COMMONS CORPORATION IS NOT A LAW FIRM AND . consider it more useful to permit linking proprietary applications wi 24.

THE WORK IS PROTECTED BY COPYRIGHT AND/OR OTHER APPLICABLE LAW. CREATIVE COMMONS MAKES NO WARRANTIES REGARDING THE INFORMATION PROVIDED. AND DISCLAIMS LIABILITY FOR DAMAGES RESULTING FROM ITS USE. CREATIVE COMMONS PROVIDES THIS INFORMATION ON AN "ASIS" BASIS.DOES NOT PROVIDE LEGAL SERVICES. BY EXERCISING ANY RIGHTS TO THE WORK PROVIDED HERE. THE LICENSOR GRANTS YOU THE RIGHTS CONTAINED . DISTRIBUTION OF THIS LICENSE DOES NOT CREATE AN ATTORNEY-CLIENT RELATIONSHIP. YOU ACCEPT AND AGREE TO BE BOUND BY THE TERMS OF THIS LICENSE. License THE WORK (AS DEFINED BELOW) IS PROVIDED UNDER THE TERMS OF THIS CREATIVE COMMONS PUBLIC LICENSE ("CCPL" OR "LICENSE"). ANY USE OF THE WORK OTHER THAN AS AUTHORIZED UNDER THIS LICENSE OR COPYRIGHT LAW IS PROHIBITED.

such as a translation. dramatization. sound recording. "Derivative Work" means a work based upon the Work or upon the Work and other pre-existing works. 2. musical arrangement. Deﬁnitions 1. motion picture version. art reproduction. anthology or encyclopedia.HERE IN CONSIDERATION OF YOUR ACCEPTANCE OF SUCH TERMS AND CONDITIONS. or adapted. 1. abridgment. such as a periodical issue. constituting separate and independent works in themselves. are assembled into a collective whole. transformed. in which the Work in its entirety in unmodiﬁed form. "Collective Work" means a work. condensation. or any other form in which the Work may be recast. except that a work that constitutes a Collective Work will not . A work that constitutes a Collective Work will not be considered a Derivative Work (as deﬁned below) for the purposes of this License. ﬁctionalization. along with a number of other contributions.

"Original Author" means the individual or entity who created the Work. the synchronization of the Work in timed-relation with a moving image ("synching") will be considered a Derivative Work for the purpose of this License. 4. "You" means an individual or entity exercising rights under this License who has not previously violated the terms of this License with respect to the Work. 3. where the Work is a musical composition or sound recording. 6. "Work" means the copyrightable work of authorship offered under the terms of this License. or who has received express permission from the Licensor to exercise rights under this License despite a previous violation. For the avoidance of doubt.be considered a Derivative Work for the purpose of this License. "Licensor" means the individual or entity that offers the Work under the terms of this License. . 5.

Nothing in this license is intended to reduce. perpetual (for the duration of the applicable copyright) license to exercise the rights in the Work as stated below: 1. ShareAlike. to incorporate the Work into one or more Collective Works. royalty-free. 2. Subject to the terms and conditions of this License. Fair Use Rights. and to reproduce the Work as incorporated in the Collective Works. display publicly. Licensor hereby grants You a worldwide. ﬁrst sale or other limitations on the exclusive rights of the copyright owner under copyright law or other applicable laws. nonexclusive. 3.7. to create and reproduce Derivative Works. License Grant. "License Elements" means the following high-level license attributes as selected by Licensor and indicated in the title of this License: Attribution. limit. 2. per- . 3. to reproduce the Work. or restrict any rights arising from fair use. to distribute copies or phonorecords of.

Licensor waives the exclusive right to collect. whether individually or via a performance rights society (e. BMI. whether individually or via a music rights society or designated agent (e. webcast) of the Work. ASCAP.g. 5. Harry Fox Agency). perform publicly.g. display publicly. royalties for . Licensor waives the exclusive right to collect.form publicly. and perform publicly by means of a digital audio transmission Derivative Works. royalties for the public performance or public digital performance (e. Mechanical Rights and Statutory Royalties. 2. Performance Royalties Under Blanket Licenses. to distribute copies or phonorecords of. For the avoidance of doubt. SESAC). 4. and perform publicly by means of a digital audio transmission the Work including as incorporated in Collective Works.g. where the work is a musical composition: 1.

subject to the compulsory license created by 17 USC Section 114 of the US Copyright Act (or the equivalent in other jurisdictions).g. 4. The above rights include the right to make such modiﬁcations as are technically necessary to exercise the rights in other media and formats. Restrictions. whether individually or via a performancerights society (e. Licensor waives the exclusive right to collect. webcast) of the Work. The above rights may be exercised in all media and formats whether now known or hereafter devised.g. SoundExchange). royalties for the public digital performance (e. where the Work is a sound recording. For the avoidance of doubt.any phonorecord You create from the Work ("cover version") and distribute. All rights not expressly granted by Licensor are hereby reserved. subject to the compulsory license created by 17 USC Section 115 of the US Copyright Act (or the equivalent in other jurisdictions). 6. Webcasting Rights and Statutory Royalties.The license granted in Section 3 above is expressly .

publicly perform. publicly display. and You must include a copy of. You may distribute. but this does not require the Collective Work apart from the Work itself to be made . or publicly digitally perform the Work with any technological measures that control access or use of the Work in a manner inconsistent with the terms of this License Agreement. publicly display. You may not sublicense the Work. or the Uniform Resource Identiﬁer for.made subject to and limited by the following restrictions: 1. or publicly digitally perform. publicly perform. The above applies to the Work as incorporated in a Collective Work. You may not offer or impose any terms on the Work that alter or restrict the terms of this License or the recipients’ exercise of the rights granted hereunder. or publicly digitally perform the Work only under the terms of this License. You may not distribute. You must keep intact all notices that refer to this License and to the disclaimer of warranties. this License with every copy or phonorecord of the Work You distribute. publicly display. publicly perform.

as requested. as requested.g. a later version of this License with the same License Elements as this License. or a Creative Commons iCommons license that contains the same License Elements as this License (e. You may not offer or impose any terms on . upon notice from any Licensor You must. 2. If You create a Collective Work. or publicly digitally perform. publicly perform. remove from the Collective Work any credit as required by clause 4(c). upon notice from any Licensor You must. If You create a Derivative Work. publicly display. You may distribute. remove from the Derivative Work any credit as required by clause 4(c). to the extent practicable. this License or other license speciﬁed in the previous sentence with every copy or phonorecord of each Derivative Work You distribute. to the extent practicable.5 Japan). AttributionShareAlike 2. or the Uniform Resource Identiﬁer for. You must include a copy of. publicly display.subject to the terms of this License. publicly perform. or publicly digitally perform a Derivative Work only under the terms of this License.

3. or publicly digitally perform the Work or any Derivative Works or Collective Works. If you distribute. if applicable) if supplied. The above applies to the Derivative Work as incorporated in a Collective Work. or publicly digitally perform the Derivative Work with any technological measures that control access or use of the Work in a manner inconsistent with the terms of this License Agreement. publicly display. You must keep intact all copyright notices for the Work and provide. You may not distribute. and/or (ii) if the Original Author and/or Licensor designate another . reasonable to the medium or means You are utilizing: (i) the name of the Original Author (or pseudonym.the Derivative Works that alter or restrict the terms of this License or the recipients’ exercise of the rights granted hereunder. publicly display. publicly perform. and You must keep intact all notices that refer to this License and to the disclaimer of warranties. publicly perform. but this does not require the Collective Work apart from the Derivative Work itself to be made subject to the terms of this License.

g. the title of the Work if supplied.g. a sponsor institute. to the extent reasonably practicable. that in the case of a Derivative Work or Collective Work. Such credit may be implemented in any reasonable manner.party or parties (e. if any. at a minimum such credit will appear where any other comparable authorship credit appears and in a manner at least as prominent as such other comparable authorship credit. a credit identifying the use of the Work in the Derivative Work (e. that Licensor speciﬁes to be associated with the Work. Warranties and Disclaimer . provided." or "Screenplay based on original Work by Original Author"). Representations. terms of service or by other reasonable means. however. and in the case of a Derivative Work. unless such URI does not refer to the copyright notice or licensing information for the Work. "French translation of the Work by Original Author. 5. the name of such party or parties. publishing entity. journal) for attribution in Licensor’s copyright notice.. the Uniform Resource Identiﬁer.

FITNESS FOR A PARTICULAR PURPOSE. Limitation on Liability. IN NO EVENT WILL LICENSOR BE LIABLE TO YOU ON ANY LEGAL THEORY FOR ANY SPECIAL. WITHOUT LIMITATION. PUNITIVE OR EXEMPLARY DAMAGES ARISING OUT OF THIS LICENSE OR THE USE OF THE WORK. INCIDENTAL. EXPRESS.UNLESS OTHERWISE AGREED TO BY THE PARTIES IN WRITING. 6. NONINFRINGEMENT. MERCHANTIBILITY. LICENSOR OFFERS THE WORK AS-IS AND MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND CONCERNING THE MATERIALS. SO SUCH EXCLUSION MAY NOT APPLY TO YOU. SOME JURISDICTIONS DO NOT ALLOW THE EXCLUSION OF IMPLIED WARRANTIES. EVEN IF LI- . ACCURACY. WARRANTIES OF TITLE. OR THE PRESENCE OF ABSENCE OF ERRORS. WHETHER OR NOT DISCOVERABLE. INCLUDING. EXCEPT TO THE EXTENT REQUIRED BY APPLICABLE LAW. STATUTORY OR OTHERWISE. IMPLIED. CONSEQUENTIAL. OR THE ABSENCE OF LATENT OR OTHER DEFECTS.

Licensor reserves the right to release the Work under different license terms or to stop distributing the Work at any time. and 8 will survive any termination of this License. provided. Notwithstanding the above. Termination 1. the license granted here is perpetual (for the duration of the applicable copyright in the Work). 7. 7. however that any such election will not serve to withdraw this License (or any other license that has been.CENSOR HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. 2. Individuals or entities who have received Derivative Works or Collective Works from You under this License. 5. will not have their licenses terminated provided such individuals or entities remain in full compliance with those licenses. Sections 1. Subject to the above terms and conditions. however. This License and the rights granted hereunder will terminate automatically upon any breach by You of the terms of this License. 2. 6. .

If any provision of this License is invalid or unenforceable under applicable law. Miscellaneous 1. it shall not affect the validity or enforceability of the remainder of the terms of this License. Licensor offers to the recipient a license to the original Work on the same terms and conditions as the license granted to You under this License. 2. such provision shall be reformed . 8. Each time You distribute or publicly digitally perform a Derivative Work. 3. and this License will continue in full force and effect unless terminated as stated above. granted under the terms of this License).or is required to be. and without further action by the parties to this agreement. Each time You distribute or publicly digitally perform the Work or a Collective Work. the Licensor offers to the recipient a license to the Work on the same terms and conditions as the license granted to You under this License.

No term or provision of this License shall be deemed waived and no breach consented to unless such waiver or consent shall be in writing and signed by the party to be charged with such waiver or consent. This License constitutes the entire agreement between the parties with respect to the Work licensed here. agreements or representations with respect to the Work not speciﬁed here. Creative Commons will not be liable to You or any party on any legal theory for any .to the minimum extent necessary to make such provision valid and enforceable. 4. Creative Commons is not a party to this License. There are no understandings. 5. Licensor shall not be bound by any additional provisions that may appear in any communication from You. This License may not be modiﬁed without the mutual written agreement of the Licensor and You. and makes no warranty whatsoever in connection with the Work.

as may be published on its website or otherwise made available upon request from time to time. neither party will use the trademark "Creative Commons" or any related trademark or logo of Creative Commons without the prior written consent of Creative Commons. Notwithstanding the foregoing two (2) sentences. it shall have all rights and obligations of Licensor. including without limitation any general. Except for the limited purpose of indicating to the public that the Work is licensed under the CCPL.org/. .damages whatsoever. special. incidental or consequential damages arising in connection to this license. Any permitted use will be in compliance with Creative Commons’ then-current trademark usage guidelines. Creative Commons may be contacted at http://creativecommons. if Creative Commons has expressly identiﬁed itself as the Licensor hereunder.

Chapter 25 The attic

This holds material that is not really ready to be incorporated into the main body, but that I don't want to lose. Basically, ignore it, unless you'd like to help get it ready for inclusion.

Optimal instruments for GMM

PLEASE IGNORE THE REST OF THIS SECTION: there is a flaw in the argument that needs correction. In particular, it may be the case that E(Z_t ε_t) ≠ 0 if instruments are chosen in the way suggested here.

An interesting question that arises is how one should choose the instrumental variables Z(w_t) to achieve maximum efficiency. Note that with this choice of moment conditions, we have that D_n ≡ (∂/∂θ) m'(θ) (a K × g matrix) is

D_n(θ) = (∂/∂θ) [ (1/n) Z_n' h_n(θ) ]' = (1/n) (∂ h_n'(θ)/∂θ) Z_n

which we can define to be

D_n(θ) = (1/n) H_n Z_n

where H_n is a K × n matrix that has the derivatives of the individual moment conditions as its columns. Likewise, define the var-cov. of the moment conditions

Ω_n = E[ n m_n(θ⁰) m_n(θ⁰)' ]
    = E[ (1/n) Z_n' h_n(θ⁰) h_n(θ⁰)' Z_n ]
    = (1/n) Z_n' E[ h_n(θ⁰) h_n(θ⁰)' ] Z_n
    ≡ (1/n) Z_n' Φ_n Z_n

where we have defined Φ_n = V[h_n(θ⁰)]. Note that the dimension of this matrix is growing with the sample size, so it is not consistently estimable without additional assumptions.

The asymptotic normality theorem above says that the GMM estimator using the optimal weighting matrix is distributed as

√n (θ̂ − θ⁰) →d N(0, V∞)

where

V∞ = lim_{n→∞} [ (H_n Z_n / n) (Z_n' Φ_n Z_n / n)⁻¹ (Z_n' H_n' / n) ]⁻¹.   (25.1)

Using an argument similar to that used to prove that Ω∞⁻¹ is the efficient weighting matrix, we can show that putting

Z_n = Φ_n⁻¹ H_n'

causes the above var-cov matrix to simplify to

V∞ = lim_{n→∞} [ H_n Φ_n⁻¹ H_n' / n ]⁻¹.   (25.2)

and furthermore, this matrix is smaller than the limiting var-cov for any other choice of instrumental variables. (To prove this, examine the difference of the inverses of the var-cov matrices with the optimal instruments and with non-optimal instruments. As above, you can show that the difference is positive semi-definite.)

• Note that both H_n, which we should write more properly as H_n(θ⁰), since it depends on θ⁰, and Φ must be consistently estimated to apply this.

• Usually, estimation of H_n is straightforward - one just uses

H̃ = (∂/∂θ) h_n'(θ̃),

where θ̃ is some initial consistent estimator based on non-optimal instruments.

• Estimation of Φ_n may not be possible. It is an n × n matrix, so

it has more unique elements than n, the sample size, so without restrictions on the parameters it can't be estimated consistently. Basically, you need to provide a parametric specification of the covariances of the h_t(θ) in order to be able to use optimal instruments. A solution is to approximate this matrix parametrically to define the instruments. Note that the simplified var-cov matrix in equation 25.2 will not apply if approximately optimal instruments are used - it will be necessary to use an estimator based upon equation 25.1, where the term (1/n) Z_n' Φ_n Z_n must be estimated consistently apart, for example by the Newey-West procedure.

25.1.1 Hurdle models

Returning to the Poisson model, let's look at actual and fitted count probabilities. Actual relative frequencies are f(y = j) = Σ_i 1(y_i = j)/n and fitted frequencies are f̂(y = j) = Σ_{i=1}^n f_Y(j|x_i, θ̂)/n. We see

002 4 0. but the difference is not too important. there are somewhat more actual zeros than ﬁtted.032 0. there are many more actual zeros than predicted.18 0.19 0. Why might OBDV not ﬁt the zeros well? What if people made the decision to contact the doctor for a ﬁrst visit.052 0.15 0.4e-5 that for the OBDV measure.02 3 0.10 0 2. then the doctor decides on whether or not follow-up visits are needed.002 0.0002 5 0.004 0. For ERV.02 0.1: Actual and Poisson ﬁtted frequencies Count OBDV ERV Count Actual Fitted Actual Fitted 0 0.10 0.14 2 0.10 0.11 0.15 0.86 0.32 0.83 1 0. θ)/n We see Table 25. they are sick. This .06 0.ˆ j)/n and ﬁtted frequencies are f (y = j) = n ˆ i=1 fY (j|xi .18 0.

a trun- . Since different parameters may govern the two decision-makers choices. for the observations where visits are positive. where the total number of visits depends upon the decision of both the patient and the doctor. λp) = 1 − 1/ [1 + exp(−λp)] Pr(Y > 0) = 1/ [1 + exp(−λp)] . for example. we might expect that different parameters govern the probability of zeros versus the other counts. a logit model: Pr(Y = 0) = fY (0. Then. and let λd be the paramter of the doctor’s “demand” for visits. The patient will initiate visits according to a discrete choice model. Let λp be the parameters of the patient’s demand for visits.is a principal/agent type situation. The above probabilities are used to estimate the binary 0/1 hurdle process.

which is computationally more efﬁcient than estimating the overall model. for example. λd) = 1 − exp(−λd) since according to the Poisson model with the doctor’s paramaters. 0! Since the hurdle and truncated components of the overall density for Y share no parameters. exp(−λd)λ0 d Pr(y = 0) = . λd) fY (y. This density is fY (y. The computational overhead is of order K 2 where K is the number of parameters to be estimated) . will have to invert the approximated Hessian. they may be estimated separately. λd|y > 0) = Pr(y > 0) fY (y. The expec- .cated Poisson density is estimated. (Recall that the BFGS algorithm.

x) 1 λd = 1 + exp(−λp) 1 − exp(−λd) .tation of Y is E(Y |x) = Pr(Y > 0|x)E(Y |Y > 0.
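Because the two components share no parameters, the sample log-likelihood splits into two pieces that can be maximized separately. The following sketch (not the author's estimation code) shows the two pieces for given data and trial parameter values. The parameterization λ_p = x*beta_p for the patient index and λ_d = exp(x*beta_d) for the doctor's Poisson mean is a common choice assumed here for illustration, and the data and parameter values are made up just so the lines run:

n = 200; x = [ones(n,1) randn(n,1)];          % illustrative regressors
y = floor(3*rand(n,1));                       % fake counts (not a true hurdle DGP)
beta_p = [0.5; 0.2]; beta_d = [0.3; 0.1];     % trial parameter values
lambda_p = x*beta_p;
P0 = 1 - 1 ./ (1 + exp(-lambda_p));           % Pr(Y = 0) from the logit part
ll_logit = sum((y==0).*log(P0) + (y>0).*log(1 - P0));   % hurdle (0/1) part
yp = y(y>0); lambda_d = exp(x(y>0,:)*beta_d); % positive counts only
ll_trunc = sum(-lambda_d + yp.*log(lambda_d) - lgamma(yp+1) ...
               - log(1 - exp(-lambda_d)));    % truncated-at-zero Poisson part
printf("hurdle part: %f   truncated part: %f\n", ll_logit, ll_trunc);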

5709 3.58939 ******************************************************************* .) -2.5502 1.0520 1.1677 2.0467 t(Sand.0873 2.1807 1.7166 3.7289 3.1547 1.0519 0.1969 0.98710 t(Hess) -2. obtained from this estimation program MEPS data.039606 t(OPG) -2.018614 0.0222 -0.0027 1.5560 3.Here are hurdle Poisson estimation results for OBDV. OBDV logit results Strong convergence Observations = 500 Function value t-Stats params constant pub_ins priv_ins sex age educ -1.63570 0.6924 3.5269 3.45867 0.0384 1.1366 2.

89 Schwartz 632.9601 Information Criteria ******************************************************************* .1672 1.96 Akaike 603.39 0.inc Consistent Akaike 639.89 Hannan-Quinn 614.7655 2.077446 1.

56547 -0.3186 t(Sand.18112 3.2144 -2.The results for the truncated part: MEPS data.0079016 t(OPG) 7.5708 0.4291 6.2323 3.7042 ******************************************************************* .31001 0.148 4.96078 -2.19075 0.7183 0.014382 0.9814 1.6353 -0.016683 0.1747 1.10438 1.7573 0.1890 3.29433 10.35309 t(Hess) 3.6942 7.) 1. OBDV tpoisson results Strong convergence Observations = 500 Function value t-Stats params constant pub_ins priv_ins sex age educ inc 0.54254 0.016286 -0.5262 0.293 16.

7 Schwartz 2747.8 Akaike 2718.2 ******************************************************************* .Information Criteria Consistent Akaike 2754.7 Hannan-Quinn 2729.

Fitted and actual probabilities (NB-II fits are provided as well) are:

Table 25.2: Actual and Hurdle Poisson fitted frequencies

Count   OBDV Actual   Fitted HP   Fitted NB-II   ERV Actual   Fitted HP   Fitted NB-II
0       0.32          0.32        0.34           0.86         0.86        0.86
1       0.18          0.035       0.16           0.10         0.10        0.10
2       0.11          0.071       0.11           0.02         0.02        0.02
3       0.10          0.10        0.08           0.004        0.006       0.006
4       0.052         0.10        0.05           0.002        0.002       0.002
5       0.032         0.08        0.032          0            0.0005      0.001

For the Hurdle Poisson models, the ERV fit is very accurate. The OBDV fit is not so good. Zeros are exact, but 1's and 2's are underestimated, and higher counts are overestimated. For the NB-II fits, performance is at least as good as the hurdle Poisson model, and one should recall that many fewer parameters are used. Hurdle versions of the negative binomial model are also widely used.

Finite mixture models

The following are results for a mixture of 2 negative binomial (NB-I) models, for the OBDV data, which you can replicate using this estimation program.

4029 -2.015880 0.58386 2.64852 -0.049175 0.23188 0.) 1.8036 0.******************************************************************* MEPS data.69961 t(OPG) 1. OBDV mixnegbin results Strong convergence Observations = 500 Function value t-Stats params constant pub_ins priv_ins sex age educ inc ln_alpha 0.33046 2.18729 0.093396 0.76782 2.0396 t(Hess) 1.4358 -0.13802 0.3851 -0.6121 2.7151 -1.8013 0.015969 -0.46948 2.4882 2.3226 -0.39785 0.2148 2.40854 2.2312 .062139 0.73281 2.3456 t(Sand.5173 -1.5475 -1.7061 0.

021425 0.0922 0.8 Hannan-Quinn 2293.2497 1.8419 0.7096 -1.6519 0.1470 0.22461 0.3456 0.74016 1.73854 0.97338 0.77431 0.6126 1.8 Schwartz 2336.34886 0.019227 2.7365 3.3 -3.8411 2.7677 1.7826 0.3387 2.1366 0.36313 7.81892 1.constant pub_ins priv_ins sex age educ inc ln_alpha logit_inv_mix Consistent Akaike 2353.4827 -1.6182 1.85186 -1.8702 1.6130 2.1354 2.80035 1.7883 Information Criteria .7527 0.3032 1.20453 6.40854 6.

Akaike 2265.2

Delta method for mix parameter (st. err.): mix 0.70096   se_mix 0.12043

*******************************************************************

• The 95% confidence interval for the mix parameter is perilously close to 1, which suggests that there may really be only one component density, rather than a mixture. Again, this is not the way to test this - it is merely suggestive. A larger sample could help clarify things.

• Education is interesting. For the subpopulation that is "healthy", i.e., that makes relatively few visits, education seems to have a positive effect on visits. For the "unhealthy" group, education has a negative effect on visits. The other results are more mixed.

. The constants and the overdispersion parameters αj are allowed to differ for the two components.The following are results for a 2 component constrained mixture negative binomial model where all the slope parameters in λj = exβj are the same across the two components.

1212 0.20663 0.015822 0.3895 3.6140 t(Sand.7042 0.7806 0.50362 0.4929 3.011784 0.91456 2.2441 . OBDV cmixnegbin results Strong convergence Observations = 500 Function value t-Stats params constant pub_ins priv_ins sex age educ inc ln_alpha -0.4293 -2.014088 1.5088 1.2462 t(Hess) -0.1948 3.97943 2.4258 3.37714 0.58331 0.1798 t(OPG) -0.45320 0.7067 1.69088 4.83408 6.6206 1.65887 0.96831 7.) -0.94203 2.******************************************************************* MEPS data.5319 3.34153 0.3105 3.

5 Hannan-Quinn 2284.5219 6.2243 1.4888 0.5 Schwartz 2312.const_2 lnalpha_2 logit_inv_mix 1.5060 4.4918 3.1 Delta method for mix parameter st.7224 1.47525 1.3 Akaike 2266.5539 0.9693 Information Criteria Consistent Akaike 2323. ******************************************************************* .2621 2.60073 2. mix se_mix err.7769 2.

. Hamilton.047318 • Now the mixture parameter is even closer to 1. This is very incomplete and contributions would be very welcome. yt−1. 25. Up to now we’ve considered the behavior of the dependent variable yt as a function of other variables xt.92335 0. yt−j ). Pure time series methods consider the behavior of yt as a function only . Time Series Analysis is a good reference for this section.g. • The slope parameter estimates are pretty close to what we got with the NB-I model.0. These variables can of course contain lagged dependent variables.. Just left in to form a basis for completion (by someone else ?!) at some point.. e.2 Models for time series data This section can be ignored in its present form. .. xt = (wt.

of its own lagged values, unconditional on other observable variables. One can think of this as modeling the behavior of y_t after marginalizing out all other variables. While it's not immediately clear why a model that has other explanatory variables should marginalize to a linear in the parameters time series model, most time series work is done with linear models, though nonlinear time series is also a large and growing field. We'll stick with linear time series models.

Basic concepts

Definition 76. [Stochastic process] A stochastic process is a sequence of random variables, indexed by time: {Y_t}_{t=−∞}^{∞}.

Definition 77. [Time series] A time series is one observation of a stochastic process, over a specific interval: {y_t}_{t=1}^{n}.

So a time series is a sample of size n from a stochastic process. It's important to keep in mind that conceptually, one could draw another sample, and that the values would be different.

Definition 78. [Autocovariance] The jth autocovariance of a stochastic process is

γ_jt = E[(y_t − µ_t)(y_{t−j} − µ_{t−j})]

where µ_t = E(y_t).

Definition 79. [Covariance (weak) stationarity] A stochastic process is covariance stationary if it has time constant mean and autocovariances of all orders:

µ_t = µ, ∀t
γ_jt = γ_j, ∀t

As we've seen, this implies that γ_j = γ_{−j}: the autocovariances

since we can’t go back in time and collect another.depend only one the interval between observations. e. we would expect that 1 lim M →∞ M M ytm → E(Yt) m=1 p The problem is. Since moments are determined by the distribution.. Deﬁnition 80. What is the mean of Yt? The time series is one sample from the stochastic process. strong stationarity⇒weak stationarity..g. {yt } By a LLN. [Strong stationarity]A stochastic process is strongly stationary if the joint distribution of an arbitrary collection of the {Yt} doesn’t depend on t. How can E(Yt) be estimated then? . but not the time of the observations. we have only one sample to work with. One could think of M repeated samples from the m stoch. proc.

Deﬁnition 81.It turns out that ergodicity is the needed property. Deﬁnition 82. ρj is just the j th autocovariance divided by the variance: . [Autocorrelation] The j th autocorrelation. A stationary stochastic process is ergodic (for the mean) if the time average converges to the mean 1 n n yt → µ t=1 p (25. [Ergodicity].3) A sufﬁcient condition for ergodicity is that the autocovariances be absolutely summable: ∞ |γj | < ∞ j=0 This implies that the autocovariances die off. so that the yt are not so strongly dependent that they don’t satisfy a LLN.

ρ_j = γ_j / γ_0     (25.4)

Definition 83. [White noise] White noise is just the time series literature term for a classical error. ε_t is white noise if i) E(ε_t) = 0, ∀t, ii) V(ε_t) = σ², ∀t, and iii) ε_t and ε_s are independent, t ≠ s. Gaussian white noise just adds a normality assumption.

ARMA models

With these concepts, we can discuss ARMA models. These are closely related to the AR and MA error processes that we've already discussed. The main difference is that the lhs variable is observed directly now.

MA(q) processes

A qth order moving average (MA) process is

y_t = µ + ε_t + θ_1 ε_{t−1} + θ_2 ε_{t−2} + ··· + θ_q ε_{t−q}

where ε_t is white noise. The variance is

γ_0 = E[(y_t − µ)²] = E[(ε_t + θ_1 ε_{t−1} + θ_2 ε_{t−2} + ··· + θ_q ε_{t−q})²] = σ²(1 + θ_1² + θ_2² + ··· + θ_q²)

Similarly, the autocovariances are

γ_j = σ²(θ_j + θ_{j+1}θ_1 + θ_{j+2}θ_2 + ··· + θ_q θ_{q−j}),  j ≤ q
γ_j = 0,  j > q
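These formulas are easy to check by simulation. A sketch (not from the text; the θ values, σ² and sample size are made up) that simulates an MA(2) and compares sample and theoretical autocovariances:

n = 50000; theta = [0.5 0.25]; sigma2 = 1;
e = sqrt(sigma2)*randn(n+2,1);                              % white noise
y = e(3:end) + theta(1)*e(2:end-1) + theta(2)*e(1:end-2);   % MA(2)
gamma0 = sigma2*(1 + theta(1)^2 + theta(2)^2);
gamma1 = sigma2*(theta(1) + theta(2)*theta(1));
gamma2 = sigma2*theta(2);
yc = y - mean(y);
printf("gamma0: theory %f  sample %f\n", gamma0, var(y));
printf("gamma1: theory %f  sample %f\n", gamma1, mean(yc(2:end).*yc(1:end-1)));
printf("gamma2: theory %f  sample %f\n", gamma2, mean(yc(3:end).*yc(1:end-2)));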

Therefore an MA(q) process is necessarily covariance stationary and ergodic, as long as σ² and all of the θ_j are finite.

AR(p) processes

An AR(p) process can be represented as

y_t = c + φ_1 y_{t−1} + φ_2 y_{t−2} + ··· + φ_p y_{t−p} + ε_t

The dynamic behavior of an AR(p) process can be studied by writing this pth order difference equation as a vector first order difference

equation:

Y_t = C + F Y_{t−1} + E_t

where Y_t = (y_t, y_{t−1}, ..., y_{t−p+1})', C = (c, 0, ..., 0)', E_t = (ε_t, 0, ..., 0)', and F is the p × p companion matrix

F = [ φ_1  φ_2  ···  φ_{p−1}  φ_p ]
    [  1    0   ···     0      0  ]
    [  0    1   ···     0      0  ]
    [  ⋮              ⋱         ⋮ ]
    [  0    0   ···     1      0  ]

With this, we can recursively work forward in time:

Y_{t+1} = C + F Y_t + E_{t+1}
        = C + F (C + F Y_{t−1} + E_t) + E_{t+1}
        = C + F C + F² Y_{t−1} + F E_t + E_{t+1}

and

Y_{t+2} = C + F Y_{t+1} + E_{t+2}
        = C + F (C + F C + F² Y_{t−1} + F E_t + E_{t+1}) + E_{t+2}
        = C + F C + F² C + F³ Y_{t−1} + F² E_t + F E_{t+1} + E_{t+2}

or in general

Y_{t+j} = C + F C + ··· + F^j C + F^{j+1} Y_{t−1} + F^j E_t + F^{j−1} E_{t+1} + ··· + F E_{t+j−1} + E_{t+j}

Consider the impact of a shock in period t on y_{t+j}. This is simply

∂y_{t+j} / ∂ε_t = F^j_(1,1)

If the system is to be stationary, then as we move forward in time this impact must die off. Otherwise a shock causes a permanent change

in the mean of $y_t$. Therefore, stationarity requires that
\[ \lim_{j\to\infty} F^j_{(1,1)} = 0 \]
• Save this result, we'll need it in a minute.

Consider the eigenvalues of the matrix $F$. These are the values of $\lambda$ such that
\[ |F - \lambda I_P| = 0 \]
The determinant here can be expressed as a polynomial. For example, for $p = 1$, the matrix $F$ is simply
\[ F = \phi_1 \]
so
\[ |\phi_1 - \lambda| = 0 \]

can be written as
\[ \phi_1 - \lambda = 0 \]
When $p = 2$, the matrix $F$ is
\[ F = \begin{bmatrix} \phi_1 & \phi_2 \\ 1 & 0 \end{bmatrix} \]
so
\[ F - \lambda I_P = \begin{bmatrix} \phi_1 - \lambda & \phi_2 \\ 1 & -\lambda \end{bmatrix} \]
and
\[ |F - \lambda I_P| = \lambda^2 - \lambda\phi_1 - \phi_2 \]
So the eigenvalues are the roots of the polynomial
\[ \lambda^2 - \lambda\phi_1 - \phi_2 \]

which can be found using the quadratic formula. This generalizes. For a $p$th order AR process, the eigenvalues are the roots of
\[ \lambda^p - \lambda^{p-1}\phi_1 - \lambda^{p-2}\phi_2 - \cdots - \lambda\phi_{p-1} - \phi_p = 0 \]
Supposing that all of the roots of this polynomial are distinct, the matrix $F$ can be factored as
\[ F = T \Lambda T^{-1} \]
where $T$ is the matrix which has as its columns the eigenvectors of $F$, and $\Lambda$ is a diagonal matrix with the eigenvalues on the main diagonal. Using this decomposition, we can write
\[ F^j = \left(T\Lambda T^{-1}\right)\left(T\Lambda T^{-1}\right)\cdots\left(T\Lambda T^{-1}\right) \]

where $T\Lambda T^{-1}$ is repeated $j$ times. This gives
\[ F^j = T\Lambda^j T^{-1} \]
and
\[ \Lambda^j = \begin{bmatrix} \lambda_1^j & 0 & \cdots & 0 \\ 0 & \lambda_2^j & & \vdots \\ \vdots & & \ddots & 0 \\ 0 & \cdots & 0 & \lambda_p^j \end{bmatrix} \]
Supposing that the $\lambda_i$, $i = 1, 2, \ldots, p$ are all real valued, it is clear that
\[ \lim_{j\to\infty} F^j_{(1,1)} = 0 \]
requires that
\[ |\lambda_i| < 1, \; i = 1, 2, \ldots, p \]
e.g., the eigenvalues must be less than one in absolute value.
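A quick Octave check (the same made-up AR(2) coefficients as above): the eigenvalues of the companion matrix are exactly the roots of the determinantal polynomial, and the stationarity condition can be verified from them directly.

phi = [0.5, 0.3];                          % made-up AR(2) coefficients
F = [phi; 1, 0];                           % companion matrix
lambda = eig(F)                            % eigenvalues of F
roots([1, -phi(1), -phi(2)])               % roots of lambda^2 - phi_1 lambda - phi_2 (same values, possibly reordered)
stationary = all(abs(lambda) < 1)          % 1 (true): both eigenvalues lie inside the unit circle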

• It may be the case that some eigenvalues are complex-valued. The previous result generalizes to the requirement that the eigenvalues be less than one in modulus, where the modulus of a complex number $a + bi$ is
\[ \mathrm{mod}(a + bi) = \sqrt{a^2 + b^2} \]
This leads to the famous statement that "stationarity requires the roots of the determinantal polynomial to lie inside the complex unit circle." (draw picture here)

• When there are roots on the unit circle (unit roots) or outside the unit circle, we leave the world of stationary processes.

• Dynamic multipliers: $\partial y_{t+j}/\partial \varepsilon_t = F^j_{(1,1)}$ is a dynamic multiplier or an impulse-response function. Real eigenvalues lead to steady movements, whereas complex eigenvalues lead to oscillatory behavior. Of course, when there are multiple eigenvalues the overall effect can be a mixture. (pictures)

Invertibility of AR process

To begin with, define the lag operator $L$:
\[ L y_t = y_{t-1} \]
The lag operator is defined to behave just as an algebraic quantity, e.g.,
\[ L^2 y_t = L(L y_t) = L y_{t-1} = y_{t-2} \]
or
\[ (1 - L)(1 + L) y_t = y_t + L y_t - L y_t - L^2 y_t = y_t - y_{t-2} \]
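A tiny Octave sketch of the lag operator acting on a made-up series, checking that $(1 - L)(1 + L) y_t = y_t - y_{t-2}$:

y = (1:10)';                        % made-up data
L = @(x) [NaN; x(1:end-1)];         % one application of the lag operator
lhs = y + L(y) - L(y) - L(L(y));    % (1 - L)(1 + L) y_t, expanded term by term
rhs = y - L(L(y));
max(abs(lhs(3:end) - rhs(3:end)))   % 0: the two expressions agree where both are defined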

A mean-zero AR(p) process can be written as
\[ y_t - \phi_1 y_{t-1} - \phi_2 y_{t-2} - \cdots - \phi_p y_{t-p} = \varepsilon_t \]
or
\[ (1 - \phi_1 L - \phi_2 L^2 - \cdots - \phi_p L^p) y_t = \varepsilon_t \]
Factor this polynomial as
\[ 1 - \phi_1 L - \phi_2 L^2 - \cdots - \phi_p L^p = (1 - \lambda_1 L)(1 - \lambda_2 L) \cdots (1 - \lambda_p L) \]
For the moment, just assume that the $\lambda_i$ are coefficients to be determined. Since $L$ is defined to operate as an algebraic quantity, determination of the $\lambda_i$ is the same as determination of the $\lambda_i$ such that the following two expressions are the same for all $z$:
\[ 1 - \phi_1 z - \phi_2 z^2 - \cdots - \phi_p z^p = (1 - \lambda_1 z)(1 - \lambda_2 z) \cdots (1 - \lambda_p z) \]

Multiply both sides by $z^{-p}$:
\[ z^{-p} - \phi_1 z^{1-p} - \phi_2 z^{2-p} - \cdots - \phi_{p-1} z^{-1} - \phi_p = (z^{-1} - \lambda_1)(z^{-1} - \lambda_2) \cdots (z^{-1} - \lambda_p) \]
and now define $\lambda = z^{-1}$, so we get
\[ \lambda^p - \phi_1 \lambda^{p-1} - \phi_2 \lambda^{p-2} - \cdots - \phi_{p-1}\lambda - \phi_p = (\lambda - \lambda_1)(\lambda - \lambda_2) \cdots (\lambda - \lambda_p) \]
The LHS is precisely the determinantal polynomial that gives the eigenvalues of $F$. Therefore, the $\lambda_i$ that are the coefficients of the factorization are simply the eigenvalues of the matrix $F$, as above.

Now consider a different stationary process
\[ (1 - \phi L) y_t = \varepsilon_t \]
• Stationarity implies that $|\phi| < 1$.

Multiply both sides by $1 + \phi L + \phi^2 L^2 + \ldots + \phi^j L^j$ to get
\[ \left(1 + \phi L + \phi^2 L^2 + \ldots + \phi^j L^j\right)(1 - \phi L) y_t = \left(1 + \phi L + \phi^2 L^2 + \ldots + \phi^j L^j\right)\varepsilon_t \]
or, multiplying the polynomials on the LHS, we get
\[ \left(1 + \phi L + \phi^2 L^2 + \ldots + \phi^j L^j - \phi L - \phi^2 L^2 - \ldots - \phi^j L^j - \phi^{j+1} L^{j+1}\right) y_t = \left(1 + \phi L + \phi^2 L^2 + \ldots + \phi^j L^j\right)\varepsilon_t \]
and with cancellations we have
\[ \left(1 - \phi^{j+1} L^{j+1}\right) y_t = \left(1 + \phi L + \phi^2 L^2 + \ldots + \phi^j L^j\right)\varepsilon_t \]
so
\[ y_t = \phi^{j+1} L^{j+1} y_t + \left(1 + \phi L + \phi^2 L^2 + \ldots + \phi^j L^j\right)\varepsilon_t \]
Now as $j \to \infty$, $\phi^{j+1} L^{j+1} y_t \to 0$, since $|\phi| < 1$, so
\[ y_t \cong \left(1 + \phi L + \phi^2 L^2 + \ldots + \phi^j L^j\right)\varepsilon_t \]
and the approximation becomes better and better as $j$ increases. However, we started with
\[ (1 - \phi L) y_t = \varepsilon_t \]
Substituting this into the above equation we have
\[ y_t \cong \left(1 + \phi L + \phi^2 L^2 + \ldots + \phi^j L^j\right)(1 - \phi L) y_t \]
so
\[ \left(1 + \phi L + \phi^2 L^2 + \ldots + \phi^j L^j\right)(1 - \phi L) \cong 1 \]
and the approximation becomes arbitrarily good as $j$ increases arbitrarily. Therefore, for $|\phi| < 1$, define
\[ (1 - \phi L)^{-1} = \sum_{j=0}^{\infty} \phi^j L^j \]

Recall that our mean zero AR(p) process
\[ (1 - \phi_1 L - \phi_2 L^2 - \cdots - \phi_p L^p) y_t = \varepsilon_t \]
can be written using the factorization
\[ (1 - \lambda_1 L)(1 - \lambda_2 L) \cdots (1 - \lambda_p L) y_t = \varepsilon_t \]
where the $\lambda$ are the eigenvalues of $F$, and given stationarity, all the $|\lambda_i| < 1$. Therefore, we can invert each first order polynomial on the LHS to get
\[ y_t = \left(\sum_{j=0}^{\infty} \lambda_1^j L^j\right)\left(\sum_{j=0}^{\infty} \lambda_2^j L^j\right) \cdots \left(\sum_{j=0}^{\infty} \lambda_p^j L^j\right)\varepsilon_t \]
The RHS is a product of infinite-order polynomials in $L$, which can be represented as
\[ y_t = (1 + \psi_1 L + \psi_2 L^2 + \cdots)\varepsilon_t \]
where the $\psi_i$ are real-valued and absolutely summable.

• The $\psi_i$ are formed of products of powers of the $\lambda_i$, which are in turn functions of the $\phi_i$.

• The $\psi_i$ are real-valued because any complex-valued $\lambda_i$ always occur in conjugate pairs. This means that if $a + bi$ is an eigenvalue of $F$, then so is $a - bi$. In multiplication
\[ (a + bi)(a - bi) = a^2 - abi + abi - b^2 i^2 = a^2 + b^2 \]
which is real-valued.

• This shows that an AR(p) process is representable as an infinite-order MA(q) process.

• Recall before that by recursive substitution, an AR(p) process

can be written as
\[ Y_{t+j} = C + FC + \cdots + F^j C + F^{j+1} Y_{t-1} + F^j E_t + F^{j-1} E_{t+1} + \cdots + F E_{t+j-1} + E_{t+j} \]
If the process is mean zero, then everything with a $C$ drops out. Take this and lag it by $j$ periods to get
\[ Y_t = F^{j+1} Y_{t-j-1} + F^j E_{t-j} + F^{j-1} E_{t-j+1} + \cdots + F E_{t-1} + E_t \]
As $j \to \infty$, the lagged $Y$ on the RHS drops out. The $E_{t-s}$ are vectors of zeros except for their first element, so we see that the first equation here, in the limit, is just
\[ y_t = \sum_{j=0}^{\infty} F^j_{(1,1)} \, \varepsilon_{t-j} \]
which makes explicit the relationship between the $\psi_i$ and the $\phi_i$ (and the $\lambda_i$ as well, recalling the previous factorization of $F^j$).

Moments of AR(p) process

The AR(p) process is
\[ y_t = c + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \cdots + \phi_p y_{t-p} + \varepsilon_t \]
Assuming stationarity, $E(y_t) = \mu$, $\forall t$, so
\[ \mu = c + \phi_1\mu + \phi_2\mu + \ldots + \phi_p\mu \]
so
\[ \mu = \frac{c}{1 - \phi_1 - \phi_2 - \ldots - \phi_p} \]
and
\[ c = \mu - \phi_1\mu - \ldots - \phi_p\mu \]
so
\[
\begin{aligned}
y_t - \mu &= \mu - \phi_1\mu - \ldots - \phi_p\mu + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \cdots + \phi_p y_{t-p} + \varepsilon_t - \mu \\
&= \phi_1(y_{t-1} - \mu) + \phi_2(y_{t-2} - \mu) + \ldots + \phi_p(y_{t-p} - \mu) + \varepsilon_t
\end{aligned}
\]
With the process written in deviations from its mean, the second moments are easy to find. The variance is
\[ \gamma_0 = \phi_1\gamma_1 + \phi_2\gamma_2 + \ldots + \phi_p\gamma_p + \sigma^2 \]
The autocovariances of orders $j \geq 1$ follow the rule
\[
\begin{aligned}
\gamma_j &= E\left[(y_t - \mu)(y_{t-j} - \mu)\right] \\
&= E\left[\left(\phi_1(y_{t-1} - \mu) + \phi_2(y_{t-2} - \mu) + \ldots + \phi_p(y_{t-p} - \mu) + \varepsilon_t\right)(y_{t-j} - \mu)\right] \\
&= \phi_1\gamma_{j-1} + \phi_2\gamma_{j-2} + \ldots + \phi_p\gamma_{j-p}
\end{aligned}
\]
Using the fact that $\gamma_{-j} = \gamma_j$, one can take the $p + 1$ equations for $j = 0, 1, \ldots, p$, which have $p + 1$ unknowns ($\sigma^2$, $\gamma_0$, $\gamma_1$, ..., $\gamma_p$), and solve for the unknowns. With these, the $\gamma_j$ for $j > p$ can be solved for recursively.
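The following Octave sketch (made-up AR(2) parameters) puts these results to work: it computes the mean from $\mu = c/(1 - \phi_1 - \phi_2)$, solves the three equations for $(\gamma_0, \gamma_1, \gamma_2)$, and then applies the recursion for higher-order autocovariances.

c = 1; phi = [0.5, 0.3]; sigma2 = 1;       % made-up parameter values
mu = c / (1 - sum(phi))                    % stationary mean
% the p+1 = 3 equations in (gamma_0, gamma_1, gamma_2):
%   gamma_0 - phi_1 gamma_1 - phi_2 gamma_2 = sigma^2
%   -phi_1 gamma_0 + (1 - phi_2) gamma_1    = 0
%   -phi_2 gamma_0 - phi_1 gamma_1 + gamma_2 = 0
A = [1,       -phi(1),     -phi(2);
     -phi(1), 1 - phi(2),  0;
     -phi(2), -phi(1),     1];
g = A \ [sigma2; 0; 0];                    % g = [gamma_0; gamma_1; gamma_2]
for j = 4:10                               % gamma_j = phi_1 gamma_{j-1} + phi_2 gamma_{j-2} for j > p
  g(j) = phi(1)*g(j-1) + phi(2)*g(j-2);
endfor
disp(g')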

Invertibility of MA(q) process

An MA(q) can be written as
\[ y_t - \mu = (1 + \theta_1 L + \ldots + \theta_q L^q)\varepsilon_t \]
As before, the polynomial on the RHS can be factored as
\[ (1 + \theta_1 L + \ldots + \theta_q L^q) = (1 - \eta_1 L)(1 - \eta_2 L) \cdots (1 - \eta_q L) \]
and each of the $(1 - \eta_i L)$ can be inverted as long as $|\eta_i| < 1$. If this is the case, then we can write
\[ (1 + \theta_1 L + \ldots + \theta_q L^q)^{-1}(y_t - \mu) = \varepsilon_t \]
where
\[ (1 + \theta_1 L + \ldots + \theta_q L^q)^{-1} \]

will be an infinite-order polynomial in $L$, so we get
\[ \sum_{j=0}^{\infty} -\delta_j L^j (y_{t-j} - \mu) = \varepsilon_t \]
with $\delta_0 = -1$, or
\[ (y_t - \mu) - \delta_1(y_{t-1} - \mu) - \delta_2(y_{t-2} - \mu) - \ldots = \varepsilon_t \]
or
\[ y_t = c + \delta_1 y_{t-1} + \delta_2 y_{t-2} + \ldots + \varepsilon_t \]
where
\[ c = \mu - \delta_1\mu - \delta_2\mu - \ldots \]
So we see that an MA(q) has an infinite AR representation, as long as the $|\eta_i| < 1$, $i = 1, 2, \ldots, q$.
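A small numeric check (the $\theta$ values are made up): the $\eta_i$ are the reciprocals of the roots of the MA polynomial $1 + \theta_1 z + \ldots + \theta_q z^q$, so invertibility can be verified directly in Octave.

theta = [0.6, 0.08];                        % made-up MA(2) coefficients
z_roots = roots([theta(end:-1:1), 1]);      % roots of theta_2 z^2 + theta_1 z + 1
eta = 1 ./ z_roots                          % eta_1, eta_2 of the factorization (1 - eta_1 L)(1 - eta_2 L)
invertible = all(abs(eta) < 1)              % 1 (true): this MA(2) is invertible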

• It turns out that one can always manipulate the parameters of an MA(q) process to find an invertible representation. For example, the two MA(1) processes
\[ y_t - \mu = (1 - \theta L)\varepsilon_t \]
and
\[ y_t^* - \mu = (1 - \theta^{-1} L)\varepsilon_t^* \]
have exactly the same moments if
\[ \sigma_{\varepsilon^*}^2 = \sigma_{\varepsilon}^2 \theta^2 \]
For example, we've seen that
\[ \gamma_0 = \sigma^2(1 + \theta^2) \]
Given the above relationships amongst the parameters,
\[ \gamma_0^* = \sigma_{\varepsilon}^2\theta^2(1 + \theta^{-2}) = \sigma^2(1 + \theta^2) \]
so the variances are the same. It turns out that all the autocovariances will be the same, as is easily checked (see the sketch below).
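A minimal check in Octave (made-up $\theta$ and $\sigma^2$): the invertible and noninvertible MA(1) parameterizations produce identical autocovariances.

theta = 0.5; sigma2 = 2;                         % made-up values
acov = @(th, s2) s2 * [1 + th^2, -th, 0];        % [gamma_0, gamma_1, gamma_2] of y_t - mu = (1 - th L) eps_t
disp(acov(theta, sigma2))                        % invertible representation
disp(acov(1/theta, sigma2 * theta^2))            % observationally equivalent representation: same numbers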

This means that the two MA processes are observationally equivalent. As before, it's impossible to distinguish between observationally equivalent processes on the basis of data.

• For a given MA(q) process, it's always possible to manipulate the parameters to find an invertible representation (which is unique).

• It's important to find an invertible representation, since it's the only representation that allows one to represent $\varepsilon_t$ as a function of past $y$'s. The other representations express $\varepsilon_t$ in terms of future values of $y$.

• Why is invertibility important? The most important reason is that it provides a justification for the use of parsimonious models. Since an AR(1) process has an MA(∞) representation, one can reverse the argument and note that at least some MA(∞) processes have an AR(1) representation. At the time of estimation, it's a lot easier to estimate the single AR(1) coefficient rather than the infinite number of coefficients associated with the MA representation.

• This is the reason that ARMA models are popular. Combining low-order AR and MA models can usually offer a satisfactory representation of univariate time series data with a reasonable number of parameters.

• Stationarity and invertibility of ARMA models is similar to what we've seen - we won't go into the details. Likewise, calculating moments is similar.

Exercise 84. Calculate the autocovariances of an ARMA(1,1) model: $(1 + \phi L) y_t = c + (1 + \theta L)\varepsilon_t$
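As a way of checking an analytical answer (a sketch only; the parameter values are made up, and the sign convention follows the exercise), the sample autocovariances of a long simulated ARMA(1,1) should match whatever formulas you derive.

phi = -0.7; theta = 0.4; c = 1; sigma = 1;       % made-up values; y_t = -phi y_{t-1} + c + eps_t + theta eps_{t-1}
n = 200000;
u = sigma * randn(n, 1);
y = filter([1, theta], [1, phi], u) + c / (1 + phi);   % simulate (1 + phi L) y_t = c + (1 + theta L) eps_t
yc = y - mean(y);
gamma_hat = zeros(1, 4);
for j = 0:3
  gamma_hat(j+1) = sum(yc(1+j:end) .* yc(1:end-j)) / n;
endfor
disp(gamma_hat)                                  % compare with your analytical gamma_0, ..., gamma_3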

