You are on page 1of 376

E

onometri s

Mi hael Creel

Department of E onomi s and E onomi History


Universitat Autnoma de Bar elona

version 0.95, November 2008

Contents
1 About this do ument

15

1.1

Li enses

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

16

1.2

Obtaining the materials

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

18

1.3

An easy way to use LYX and O tave today . . . . . . . . . . . . . . . . . . . . .

18

2 Introdu tion: E onomi and e onometri models

19

3 Ordinary Least Squares

21

3.1

The Linear Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

21

3.2

Estimation by least squares

22

3.3

Geometri interpretation of least squares estimation

X, Y

. . . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . .

24

3.3.1

In

Spa e . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

24

3.3.2

In Observation Spa e . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

25

3.3.3

Proje tion Matri es

. . . . . . . . . . . . . . . . . . . . . . . . . . . . .

25

3.4

Inuential observations and outliers . . . . . . . . . . . . . . . . . . . . . . . . .

26

3.5

Goodness of t

28

3.6

The lassi al linear regression model

3.7

Small sample statisti al properties of the least squares estimator

3.8

3.9

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . .

31

3.7.1

Unbiasedness

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

31

3.7.2

Normality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

32

3.7.3

The varian e of the OLS estimator and the Gauss-Markov theorem . . .

33

Example: The Nerlove model

. . . . . . . . . . . . . . . . . . . . . . . . . . . .

36

3.8.1

Theoreti al ba kground

. . . . . . . . . . . . . . . . . . . . . . . . . . .

36

3.8.2

Cobb-Douglas fun tional form . . . . . . . . . . . . . . . . . . . . . . . .

37

3.8.3

The Nerlove data and OLS

. . . . . . . . . . . . . . . . . . . . . . . . .

38

Exer ises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

40

4 Maximum likelihood estimation


4.1

30

41

The likelihood fun tion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

41

4.1.1

42

Example: Bernoulli trial . . . . . . . . . . . . . . . . . . . . . . . . . . .

4.2

Consisten y of MLE

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

44

4.3

The s ore fun tion

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

45

4.4

Asymptoti normality of MLE . . . . . . . . . . . . . . . . . . . . . . . . . . . .

46

CONTENTS

4.4.1

Coin ipping, again . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

49

4.5

The information matrix equality

. . . . . . . . . . . . . . . . . . . . . . . . . .

49

4.6

The Cramr-Rao lower bound . . . . . . . . . . . . . . . . . . . . . . . . . . . .

50

4.7

Exer ises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

52

5 Asymptoti properties of the least squares estimator

55

5.1

Consisten y . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

55

5.2

Asymptoti normality

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

56

5.3

Asymptoti e ien y . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

57

5.4

Exer ises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

58

6 Restri tions and hypothesis tests


6.1

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

59

6.1.1

Imposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

59

6.1.2

Properties of the restri ted estimator . . . . . . . . . . . . . . . . . . . .

62

Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

63

6.2.1

t-test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

63

6.2.2

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

65

6.2.3

Wald-type tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

66

6.2.4

S ore-type tests (Rao tests, Lagrange multiplier tests)

. . . . . . . . . .

66

6.2.5

Likelihood ratio-type tests . . . . . . . . . . . . . . . . . . . . . . . . . .

68

6.3

The asymptoti equivalen e of the LR, Wald and s ore tests . . . . . . . . . . .

69

6.4

Interpretation of test statisti s

. . . . . . . . . . . . . . . . . . . . . . . . . . .

72

6.5

Conden e intervals

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

72

6.6

Bootstrapping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

74

6.7

Testing nonlinear restri tions, and the delta method

. . . . . . . . . . . . . . .

75

6.8

Example: the Nerlove data

. . . . . . . . . . . . . . . . . . . . . . . . . . . . .

77

6.9

Exer ises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

80

6.2

Exa t linear restri tions

59

test

7 Generalized least squares

83

7.1

Ee ts of nonspheri al disturban es on the OLS estimator . . . . . . . . . . . .

84

7.2

The GLS estimator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

85

7.3

Feasible GLS

87

7.4

Heteros edasti ity

7.5

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

7.4.1

OLS with heteros edasti onsistent var ov estimation

7.4.2

Dete tion

88

. . . . . . . . . .

88

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

89

7.4.3

Corre tion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

90

7.4.4

Example: the Nerlove model (again!) . . . . . . . . . . . . . . . . . . . .

93

Auto orrelation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

96

7.5.1

Causes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

96

7.5.2

Ee ts on the OLS estimator

. . . . . . . . . . . . . . . . . . . . . . . .

97

7.5.3

AR(1) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

98

CONTENTS

7.6

7.5.4

MA(1) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100

7.5.5

Asymptoti ally valid inferen es with auto orrelation of unknown form

7.5.6

Testing for auto orrelation . . . . . . . . . . . . . . . . . . . . . . . . . . 105

7.5.7

Lagged dependent variables and auto orrelation . . . . . . . . . . . . . . 107

7.5.8

Examples

. 102

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107

Exer ises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111

8 Sto hasti regressors

113

8.1

Case 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114

8.2

Case 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114

8.3

Case 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115

8.4

When are the assumptions reasonable? . . . . . . . . . . . . . . . . . . . . . . . 116

8.5

Exer ises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117

9 Data problems
9.1

9.2

9.3

9.4

Collinearity

119
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119

9.1.1

A brief aside on dummy variables . . . . . . . . . . . . . . . . . . . . . . 120

9.1.2

Ba k to ollinearity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120

9.1.3

Dete tion of ollinearity

9.1.4

Dealing with ollinearity . . . . . . . . . . . . . . . . . . . . . . . . . . . 123

Measurement error

. . . . . . . . . . . . . . . . . . . . . . . . . . . 122

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125

9.2.1

Error of measurement of the dependent variable . . . . . . . . . . . . . . 125

9.2.2

Error of measurement of the regressors . . . . . . . . . . . . . . . . . . . 126

Missing observations

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127

9.3.1

Missing observations on the dependent variable . . . . . . . . . . . . . . 127

9.3.2

The sample sele tion problem . . . . . . . . . . . . . . . . . . . . . . . . 129

9.3.3

Missing observations on the regressors

. . . . . . . . . . . . . . . . . . . 130

Exer ises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131

10 Fun tional form and nonnested tests

133

10.1 Flexible fun tional forms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134


10.1.1 The translog form

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135

10.1.2 FGLS estimation of a translog model . . . . . . . . . . . . . . . . . . . . 138


10.2 Testing nonnested hypotheses . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141

11 Exogeneity and simultaneity


11.1 Simultaneous equations
11.2 Exogeneity

145

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147

11.3 Redu ed form . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149


11.4 IV estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
11.5 Identi ation by ex lusion restri tions

. . . . . . . . . . . . . . . . . . . . . . . 154

11.5.1 Ne essary onditions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155

CONTENTS

11.5.2 Su ient onditions

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157

11.5.3 Example: Klein's Model 1 . . . . . . . . . . . . . . . . . . . . . . . . . . 161


11.6 2SLS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
11.7 Testing the overidentifying restri tions

. . . . . . . . . . . . . . . . . . . . . . . 165

11.8 System methods of estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168


11.8.1 3SLS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
11.8.2 FIML

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173

11.9 Example: 2SLS and Klein's Model 1

. . . . . . . . . . . . . . . . . . . . . . . . 174

12 Introdu tion to the se ond half

177

13 Numeri optimization methods

183

13.1 Sear h . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183


13.2 Derivative-based methods

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184

13.2.1 Introdu tion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184


13.2.2 Steepest des ent

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185

13.2.3 Newton-Raphson . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186


13.3 Simulated Annealing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
13.4 Examples

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189

13.4.1 Dis rete Choi e: The logit model . . . . . . . . . . . . . . . . . . . . . . 189


13.4.2 Count Data: The MEPS data and the Poisson model . . . . . . . . . . . 190
13.4.3 Duration data and the Weibull model

. . . . . . . . . . . . . . . . . . . 192

13.5 Numeri optimization: pitfalls . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195


13.5.1 Poor s aling of the data
13.5.2 Multiple optima

. . . . . . . . . . . . . . . . . . . . . . . . . . . 195

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197

13.6 Exer ises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200

14 Asymptoti properties of extremum estimators

201

14.1 Extremum estimators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201


14.2 Existen e

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201

14.3 Consisten y . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202


14.4 Example: Consisten y of Least Squares . . . . . . . . . . . . . . . . . . . . . . . 205
14.5 Asymptoti Normality
14.6 Examples

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208

14.6.1 Coin ipping, yet again

. . . . . . . . . . . . . . . . . . . . . . . . . . . 208

14.6.2 Binary response models

. . . . . . . . . . . . . . . . . . . . . . . . . . . 208

14.6.3 Example: Linearization of a nonlinear model

. . . . . . . . . . . . . . . 212

14.7 Exer ises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215

15 Generalized method of moments


15.1 Denition

217

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 217

15.2 Consisten y . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219

CONTENTS

15.3 Asymptoti normality

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219

15.4 Choosing the weighting matrix

. . . . . . . . . . . . . . . . . . . . . . . . . . . 221

15.5 Estimation of the varian e- ovarian e matrix

. . . . . . . . . . . . . . . . . . . 222

15.5.1 Newey-West ovarian e estimator . . . . . . . . . . . . . . . . . . . . . . 224


15.6 Estimation using onditional moments

. . . . . . . . . . . . . . . . . . . . . . . 225

15.7 Estimation using dynami moment onditions . . . . . . . . . . . . . . . . . . . 228


15.8 A spe i ation test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228
15.9 Other estimators interpreted as GMM estimators . . . . . . . . . . . . . . . . . 230
15.9.1 OLS with heteros edasti ity of unknown form . . . . . . . . . . . . . . . 230
15.9.2 Weighted Least Squares

. . . . . . . . . . . . . . . . . . . . . . . . . . . 232

15.9.3 2SLS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232


15.9.4 Nonlinear simultaneous equations . . . . . . . . . . . . . . . . . . . . . . 233
15.9.5 Maximum likelihood

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234

15.10Example: The MEPS data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 236


15.11Example: The Hausman Test

. . . . . . . . . . . . . . . . . . . . . . . . . . . . 238

15.12Appli ation: Nonlinear rational expe tations . . . . . . . . . . . . . . . . . . . . 243


15.13Empiri al example: a portfolio model . . . . . . . . . . . . . . . . . . . . . . . . 245
15.14Exer ises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 248

16 Quasi-ML

249

16.1 Consistent Estimation of Varian e Components

. . . . . . . . . . . . . . . . . . 251

16.2 Example: the MEPS Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 252


16.2.1 Innite mixture models: the negative binomial model . . . . . . . . . . . 252
16.2.2 Finite mixture models: the mixed negative binomial model

. . . . . . . 257

16.2.3 Information riteria . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 258


16.3 Exer ises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 260

17 Nonlinear least squares (NLS)


17.1 Introdu tion and denition
17.2 Identi ation

261

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 261

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 262

17.3 Consisten y . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 264


17.4 Asymptoti normality

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 264

17.5 Example: The Poisson model for ount data . . . . . . . . . . . . . . . . . . . . 265


17.6 The Gauss-Newton algorithm

. . . . . . . . . . . . . . . . . . . . . . . . . . . . 266

17.7 Appli ation: Limited dependent variables and sample sele tion
17.7.1 Example: Labor Supply

18 Nonparametri inferen e

. . . . . . . . . 268

. . . . . . . . . . . . . . . . . . . . . . . . . . . 268

271

18.1 Possible pitfalls of parametri inferen e: estimation . . . . . . . . . . . . . . . . 271


18.2 Possible pitfalls of parametri inferen e: hypothesis testing . . . . . . . . . . . . 274
18.3 Estimation of regression fun tions . . . . . . . . . . . . . . . . . . . . . . . . . . 276
18.3.1 The Fourier fun tional form . . . . . . . . . . . . . . . . . . . . . . . . . 276

CONTENTS

18.3.2 Kernel regression estimators . . . . . . . . . . . . . . . . . . . . . . . . . 283


18.4 Density fun tion estimation

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 287

18.4.1 Kernel density estimation

. . . . . . . . . . . . . . . . . . . . . . . . . . 287

18.4.2 Semi-nonparametri maximum likelihood


18.5 Examples

. . . . . . . . . . . . . . . . . 288

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 291

18.5.1 MEPS health are usage data . . . . . . . . . . . . . . . . . . . . . . . . 291


18.5.2 Finan ial data and volatility . . . . . . . . . . . . . . . . . . . . . . . . . 293
18.6 Exer ises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 295

19 Simulation-based estimation

297

19.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 297


19.1.1 Example: Multinomial and/or dynami dis rete response models
19.1.2 Example: Marginalization of latent variables

. . . . 297

. . . . . . . . . . . . . . . 299

19.1.3 Estimation of models spe ied in terms of sto hasti dierential equations300
19.2 Simulated maximum likelihood (SML)
19.2.1 Example: multinomial probit

. . . . . . . . . . . . . . . . . . . . . . . 301

. . . . . . . . . . . . . . . . . . . . . . . . 302

19.2.2 Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 303


19.3 Method of simulated moments (MSM)

. . . . . . . . . . . . . . . . . . . . . . . 304

19.3.1 Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 305


19.3.2 Comments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 305
19.4 E ient method of moments (EMM) . . . . . . . . . . . . . . . . . . . . . . . . 306
19.4.1 Optimal weighting matrix . . . . . . . . . . . . . . . . . . . . . . . . . . 308
19.4.2 Asymptoti distribution

. . . . . . . . . . . . . . . . . . . . . . . . . . . 309

19.4.3 Diagnoti testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 310


19.5 Examples

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 310

19.5.1 SML of a Poisson model with latent heterogeneity


19.5.2 SMM

. . . . . . . . . . . . 310

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 311

19.5.3 SNM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 312


19.5.4 EMM estimation of a dis rete hoi e model

. . . . . . . . . . . . . . . . 312

19.6 Exer ises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 314

20 Parallel programming for e onometri s


20.1 Example problems
20.1.1 Monte Carlo

315

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 316
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 316

20.1.2 ML . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 316
20.1.3 GMM

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 317

20.1.4 Kernel regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 318

21 Final proje t: e onometri estimation of a RBC model

323

21.1 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 323


21.2 An RBC Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 324
21.3 A redu ed form model

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 325

CONTENTS

21.4 Results (I): The s ore generator . . . . . . . . . . . . . . . . . . . . . . . . . . . 326


21.5 Solving the stru tural model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 326

22 Introdu tion to O tave

331

22.1 Getting started . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 331


22.2 A short introdu tion

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 331

22.3 If you're running a Linux installation... . . . . . . . . . . . . . . . . . . . . . . . 333

23 Notation and Review

335

23.1 Notation for dierentiation of ve tors and matri es


23.2 Convergenge modes

. . . . . . . . . . . . . . . . 335

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 336

23.3 Rates of onvergen e and asymptoti equality . . . . . . . . . . . . . . . . . . . 338

24 Li enses
24.1 The GPL

341
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 341

24.2 Creative Commons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 350

25 The atti

355

25.1 Hurdle models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 355


25.1.1 Finite mixture models
25.2 Models for time series data

. . . . . . . . . . . . . . . . . . . . . . . . . . . . 359

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 362

25.2.1 Basi on epts

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 363

25.2.2 ARMA models

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 364

10

CONTENTS

List of Figures

1.1

O tave . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

16

1.2

LYX

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

17

3.1

Typi al data, Classi al Model . . . . . . . . . . . . . . . . . . . . . . . . . . . .

22

3.2

Example OLS Fit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

24

3.3

The t in observation spa e

25

3.4

Dete tion of inuential observations

3.5

2
Un entered R

. . . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . .

28

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

29

3.6

Unbiasedness of OLS under lassi al assumptions . . . . . . . . . . . . . . . . .

32

3.7

Biasedness of OLS when an assumption fails . . . . . . . . . . . . . . . . . . . .

33

3.8

Gauss-Markov Result: The OLS estimator . . . . . . . . . . . . . . . . . . . . .

35

3.9

Gauss-Markov Resul: The split sample estimator

. . . . . . . . . . . . . . . . .

35

6.1

Joint and Individual Conden e Regions . . . . . . . . . . . . . . . . . . . . . .

73

6.2

RTS as a fun tion of rm size . . . . . . . . . . . . . . . . . . . . . . . . . . . .

81

7.1

Residuals, Nerlove model, sorted by rm size

. . . . . . . . . . . . . . . . . . .

93

7.2

Auto orrelation indu ed by misspe i ation

. . . . . . . . . . . . . . . . . . . .

97

7.3

Durbin-Watson riti al values . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106

7.4

Residuals of simple Nerlove model

7.5

OLS residuals, Klein onsumption equation

9.1

s()

when there is no ollinearity . . . . . . . . . . . . . . . . . . . . . . . . . . 121

9.2

s()

when there is ollinearity . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121

9.3

Sample sele tion bias . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130

. . . . . . . . . . . . . . . . . . . . . . . . . 108
. . . . . . . . . . . . . . . . . . . . 110

13.1 In reasing dire tions of sear h . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185


13.2 Using MuPAD to get analyti derivatives

. . . . . . . . . . . . . . . . . . . . . 188

13.3 Life expe tan y of mongooses, Weibull model

. . . . . . . . . . . . . . . . . . . 194

13.4 Life expe tan y of mongooses, mixed Weibull model

. . . . . . . . . . . . . . . 196

13.5 A foggy mountain . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197


15.1 OLS
15.2 IV

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 238

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 239

11

12

LIST OF FIGURES

15.3 In orre t rank and the Hausman test . . . . . . . . . . . . . . . . . . . . . . . . 241


18.1 True and simple approximating fun tions

. . . . . . . . . . . . . . . . . . . . . 272

18.2 True and approximating elasti ities . . . . . . . . . . . . . . . . . . . . . . . . . 273


18.3 True fun tion and more exible approximation

. . . . . . . . . . . . . . . . . . 274

18.4 True elasti ity and more exible approximation

. . . . . . . . . . . . . . . . . . 275

18.5 Negative binomial raw moments . . . . . . . . . . . . . . . . . . . . . . . . . . . 290


18.6 Kernel tted OBDV usage versus AGE . . . . . . . . . . . . . . . . . . . . . . . 291
18.7 Dollar-Euro

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 293

18.8 Dollar-Yen . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 293


18.9 Kernel regression tted onditional se ond moments, Yen/Dollar and Euro/Dollar294
20.1 Speedups from parallelization

. . . . . . . . . . . . . . . . . . . . . . . . . . . . 319

21.1 Consumption and Investment, Levels . . . . . . . . . . . . . . . . . . . . . . . . 323


21.2 Consumption and Investment, Growth Rates

. . . . . . . . . . . . . . . . . . . 324

21.3 Consumption and Investment, Bandpass Filtered

. . . . . . . . . . . . . . . . . 324

22.1 Running an O tave program . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 332

List of Tables

16.1 Marginal Varian es, Sample and Estimated (Poisson) . . . . . . . . . . . . . . . 252


16.2 Marginal Varian es, Sample and Estimated (NB-II) . . . . . . . . . . . . . . . . 256
16.3 Information Criteria, OBDV . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 259
25.1 A tual and Poisson tted frequen ies . . . . . . . . . . . . . . . . . . . . . . . . 355
25.2 A tual and Hurdle Poisson tted frequen ies . . . . . . . . . . . . . . . . . . . . 359

13

14

LIST OF TABLES

Chapter 1
About this do ument
This do ument integrates le ture notes for a one year graduate level ourse with omputer
programs that illustrate and apply the methods that are studied. The immediate availability
of exe utable (and modiable) example programs when using the PDF version of the do ument is a distinguishing feature of these notes. If printed, the do ument is a somewhat terse
approximation to a textbook. These notes are not intended to be a perfe t substitute for a
printed textbook. If you are a student of mine, please note that last senten e arefully. There
are many good textbooks available. The hapters make referen e to readings in arti les and
textbooks.
With respe t to ontents, the emphasis is on estimation and inferen e within the world of
stationary data, with a bias toward mi roe onometri s.

The se ond half is somewhat more

polished than the rst half, sin e I have taught that ourse more often. If you take a moment
to read the li ensing information in the next se tion, you'll see that you are free to opy and
modify the do ument. If anyone would like to ontribute material that expands the ontents,
it would be very wel ome. Error orre tions and other additions are also wel ome.
The integrated examples (they are on-line here and the support les are here) are an
important part of these notes. GNU O tave (www.o tave.org) has been used for the example
programs, whi h are s attered though the do ument.

This hoi e is motivated by several

fa tors. The rst is the high quality of the O tave environment for doing applied e onometri s.
R , and will run s ripts for that language
O tave is similar to the ommer ial pa kage Matlab
1

without modi ation . The fundamental tools (manipulation of matri es, statisti al fun tions,

minimization,

et .)

exist and are implemented in a way that make extending them fairly easy.

Se ond, an advantage of free software is that you don't have to pay for it. This an be an
important onsideration if you are at a university with a tight budget or if need to run many
opies, as an be the ase if you do parallel omputing (dis ussed in Chapter 20). Third, O tave
runs on GNU/Linux, Windows and Ma OS. Figure 1.1 shows a sample GNU/Linux work
environment, with an O tave s ript being edited, and the results are visible in an embedded
1

R is a trademark of The Mathworks, In . O tave will run pure Matlab s ripts. If a Matlab s ript
Matlab

alls an extension, su h as a toolbox fun tion, then it is ne essary to make a similar extension available to
O tave. The examples dis ussed in this do ument all a number of fun tions, su h as a BFGS minimizer, a
program for ML estimation, et . All of this ode is provided with the examples, as well as on the Peli anHPC
live CD image.

15

16

CHAPTER 1.

ABOUT THIS DOCUMENT

Figure 1.1: O tave

shell window.
The main do ument was prepared using LYX (www.lyx.org). LYX is a free

what you see

AT X. It (with
is what you mean word pro essor, basi ally working as a graphi al frontend to L
E
AT X, HTML, PDF and several other
help from other appli ations) an export your work in L
E

forms. It will run on Linux, Windows, and Ma OS systems. Figure 1.2 shows LYX editing
this do ument.

1.1 Li enses
All materials are opyrighted by Mi hael Creel with the date that appears above. They are
provided under the terms of the GNU General Publi Li ense, ver.

2, whi h forms Se tion

24.1 of the notes, or, at your option, under the Creative Commons Attribution-Share Alike
2.5 li ense, whi h forms Se tion 24.2 of the notes. The main thing you need to know is that
you are free to modify and distribute these materials in any way you like, as long as you share
2

Free is used in the sense of freedom, but LYX is also free of harge (free as in free beer).

1.1.

17

LICENSES

Figure 1.2: LYX

18

CHAPTER 1.

ABOUT THIS DOCUMENT

your ontributions in the same way the materials are made available to you.

In parti ular,

you must make available the sour e les, in editable form, for your modied version of the
materials.

1.2 Obtaining the materials


The materials are available on my web page. In addition to the nal produ t, whi h you're
probably looking at in some form now, you an obtain the editable LYX sour es, whi h will
allow you to reate your own version, if you like, or send error orre tions and ontributions.

1.3 An easy way to use LYX and O tave today


The example programs are available as links to les on my web page in the PDF version,
and here. Support les needed to run these are available here. The les won't run properly
from your browser, sin e there are dependen ies between les - they are only illustrative when
browsing. To see how to use these les (edit and run them), you should go to the home page
of this do ument, sin e you will probably want to download the pdf version together with all
the support les and examples. Then set the base URL of the PDF le to point to wherever
the O tave les are installed. Then you need to install O tave and the support les. All of
this may sound a bit ompli ated, be ause it is. An easier solution is available:
The Peli anHPC distribution of Linux is an ISO image le that may be burnt to CDROM.
It ontains a bootable-from-CD GNU/Linux system.

These notes, in sour e form and as a

PDF, together with all of the examples and the software needed to run them are available
on Peli anHPC. Peli anHPC is a live CD image. You an burn the Peli anHPC image to
a CD and use it to boot your omputer, if you like. When you shut down and reboot, you
will return to your normal operating system.

The need to reboot to use Peli anHPC an

be somewhat in onvenient. It is also possible to use Peli anHPC while running your normal
operating system by using a virtualization platform su h as Sun xVM Virtualbox

3
R

The reason why these notes are integrated into a Linux distribution for parallel omputing
will be apparent if you get to Chapter 20. If you don't get that far or you're not interested in
parallel omputing, please just ignore the stu on the CD that's not related to e onometri s.
If you happen to be interested in parallel omputing but not e onometri s, just skip ahead to
Chapter 20.

R is a trademark of Sun, In . Virtualbox is free software (GPL v2). That, and the fa t
xVM Virtualbox

that it works very well, is the reason it is re ommended here. There are a number of similar produ ts available.
It is possible to run Peli anHPC as a virtual ma hine, and to ommuni ate with the installed operating system
using a private network. Learning how to do this is not too di ult, and it is very onvenient.

Chapter 2
Introdu tion: E onomi and
e onometri models
E onomi theory tells us that an individual's demand fun tion for a good is something like:

x = x(p, m, z)
x

is the quantity demanded

is

m
z

G1

ve tor of pri es of the good and its substitutes and omplements

is in ome

is a ve tor of other variables su h as individual hara teristi s that ae t preferen es

Suppose we have a sample onsisting of one observation on


period

(this is a

ross se tion ,

where

i = 1, 2, ..., n

individuals' demands at time

indexes the individuals in the sample).

The individual demand fun tions are

xi = xi (pi , mi , zi )
The model is not estimable as it stands, sin e:

The form of the demand fun tion is dierent for all

Some omponents of

zi

i.

may not be observable to an outside modeler.

For example,

people don't eat the same lun h every day, and you an't tell what they will order just
by looking at them. Suppose we an break
single unobservable omponent

zi

into the observable omponents

wi

and a

i .

A step toward an estimable e onometri model is to suppose that the model may be written
as

xi = 1 + pi p + mi m + wi w + i
We have imposed a number of restri tions on the theoreti al model:

19

20

CHAPTER 2.

The fun tions

INTRODUCTION: ECONOMIC AND ECONOMETRIC MODELS

xi ()

whi h in prin iple may dier for all

have been restri ted to all

belong to the same parametri family.

Of all parametri families of fun tions, we have restri ted the model to the lass of linear

The parameters are onstant a ross individuals.

There is a single unobservable omponent, and we assume it is additive.

in the variables fun tions.

If we assume nothing about the error term


order for the

we an always write the last equation. But in

oe ients to exist in a sense that has e onomi meaning, and in order to

be able to use sample data to make reliable inferen es about their values, we need to make
additional assumptions. These additional assumptions have

no theoreti al basis,

they are

assumptions on top of those needed to prove the existen e of a demand fun tion. The validity
of any results we obtain using this model will be ontingent on these additional restri tions
being at least approximately orre t. For this reason,

spe i ation testing

will be needed, to

he k that the model seems to be reasonable. Only when we are onvin ed that the model is
at least approximately orre t should we use it for e onomi analysis.
When testing a hypothesis using an e onometri model, at least three fa tors an ause a
statisti al test to reje t the null hypothesis:
1. the hypothesis is false
2. a type I error has o ured
3. the e onometri model is not orre tly spe ied, and thus the test does not have the
assumed distribution
To be able to make s ienti progress, we would like to ensure that the third reason is not
ontributing in a major way to reje tions, so that reje tion will be most likely due to either
the rst or se ond reasons. Hopefully the above example makes it lear that there are many
possible sour es of misspe i ation of e onometri models. In the next few se tions we will
obtain results supposing that the e onometri model is entirely orre tly spe ied. Later we
will examine the onsequen es of misspe i ation and see some methods for determining if a
model is orre tly spe ied. Later on, e onometri methods that seek to minimize maintained
assumptions are introdu ed.

Chapter 3
Ordinary Least Squares
3.1 The Linear Model
Consider approximating a variable

y using the variables x1 , x2 , ..., xk .

We an onsider a model

that is a linear approximation:

Linearity:

the model is a linear fun tion of the parameter ve tor

0 :

y = 10 x1 + 20 x2 + ... + k0 xk +
or, using ve tor notation:

y = x 0 +
The dependent variable

x = ( x1 x2 xk ) is a k-ve tor

k0 ) . The supers ript 0 in 0 means this

is a s alar random variable,

of explanatory variables, and

0 = ( 10 20

is the true value of the unknown parameter.

It will be dened more pre isely later, and

usually suppressed when it's not ne essary for larity.


Suppose that we want to use data to try to determine the best linear approximation to

using the variables

x.

The data

{(yt , xt )} , t = 1, 2, ..., n

are obtained by some form of

sampling . An individual observation is

yt = xt + t
The

observations an be written in matrix form as

y = X + ,
where

y=

y1 y2 yn

is

n1

and

X=

(3.1)

x1 x2 xn

Linear models are more general than they might rst appear, sin e one an employ non-

For example, ross-se tional data may be obtained by random sampling.

histori ally.

21

Time series data a umulate

22

CHAPTER 3.

ORDINARY LEAST SQUARES

Figure 3.1: Typi al data, Classi al Model


10
data
true regression line

-5

-10

-15
0

10
X

12

14

16

18

20

linear transformations of the variables:

0 (z) =
where the

i ()

1 (w) 2 (w) p (w)

are known fun tions. Dening

y = 0 (z), x1 = 1 (w),

et .

leads to a model in

the form of equation 3.3. For example, the Cobb-Douglas model

z = Aw22 w33 exp()


an be transformed logarithmi ally to obtain

ln z = ln A + 2 ln w2 + 3 ln w3 + .
If we dene

y = ln z, 1 = ln A,

et .,

we an put the model in the form needed.

The

approximation is linear in the parameters, but not ne essarily linear in the variables.

3.2 Estimation by least squares


Figure 3.1, obtained by running Typi alData.m shows some data that follows the linear model

yt = 1 + 2 xt2 + t .
are the data points
of

xt2 .

The green line is the true regression line

(xt2 , yt ), where t

1 + 2 xt2 ,

and the red rosses

is a random error that has mean zero and is independent

Exa tly how the green line is dened will be ome lear later. In pra ti e, we only have

the data, and we don't know where the green line lies. We need to gain information about the
straight line that best ts the data points.

3.2.

23

ESTIMATION BY LEAST SQUARES

The

ordinary least squares

(OLS) estimator is dened as the value that minimizes the sum

of the squared errors:

= arg min s()


where

s() =

n
X
t=1

yt xt

2

= (y X) (y X)

= y y 2y X + X X
= k y X k2

This last expression makes it lear how the OLS estimator is dened: it minimizes the Eu lidean distan e between
linear approximation to
distan e.

and

using

X.
x

The tted OLS oe ients are those that give the best

as basis fun tions, where best means minimum Eu lidean

One ould think of other estimators based upon other metri s.

minimum absolute distan e

(MAD) minimizes

Pn

t=1 |yt

For example, the

xt |. Later, we will see that whi h

estimator is best in terms of their statisti al properties, rather than in terms of the metri s
that dene them, depends upon the properties of

about whi h we have as yet made no

assumptions.

To minimize the riterion

s(),

nd the derivative with respe t to

D s() = 2X y + 2X X
Then setting it to zeros gives

= 2X y + 2X X 0
D s()
so

= (X X)1 X y.

To verify that this is a minimum, he k the se ond order su ient ondition:

= 2X X
D2 s()
Sin e

(X) = K,

this matrix is positive denite, sin e it's a quadrati form in a p.d.

matrix (identity matrix of order

n),

so

The

tted values are the ve tor y = X.

The

residuals are the ve tor = y X

is in fa t a minimizer.

24

CHAPTER 3.

ORDINARY LEAST SQUARES

Figure 3.2: Example OLS Fit


15
data points
fitted line
true line
10

-5

-10

-15
0

10
X

12

14

16

18

20

Note that

y = X +
= X +

Also, the rst order onditions an be written as

X y X X = 0


X y X = 0
X = 0

whi h is to say, the OLS residuals are orthogonal to

X.

Let's look at this more arefully.

3.3 Geometri interpretation of least squares estimation


3.3.1 In X, Y Spa e
Figure 3.2 shows a typi al t to data, along with the true regression line.

Note that the

true line and the estimated line are dierent. This gure was reated by running the O tave
program OlsFit.m . You an experiment with hanging the parameter values to see how this
ae ts the t, and to see how the tted line will sometimes be lose to the true line, and
sometimes rather far away.

3.3.

25

GEOMETRIC INTERPRETATION OF LEAST SQUARES ESTIMATION

3.3.2 In Observation Spa e


If we want to plot in observation spa e, we'll need to use only two or three observations, or
we'll en ounter some limitations of the bla kboard. If we try to use 3, we'll en ounter the limits
of my artisti ability, so let's use two. With only two observations, we an't have

K > 1.

Figure 3.3: The t in observation spa e

Observation 2

e = M_xY

S(x)

x
x*beta=P_xY

Observation 1

spa e spanned by

nK

y into two omponents: the orthogonal proje tion onto the Kdimensional
and the omponent that is the orthogonal proje tion onto the
X , X ,

We an de ompose

Sin e
by

subpa e that is orthogonal to the span of

X, .

is hosen to make as short as possible, will be orthogonal to the spa e spanned

X. Sin e X

is in this spa e,

X = 0. Note that the f.o. .

estimator imply that this is so.

3.3.3 Proje tion Matri es


X

is the proje tion of

onto the span of

X,

or

X = X X X
Therefore, the matrix that proje ts

1

X y

onto the span of

PX = X(X X)1 X

is

that dene the least squares

26

CHAPTER 3.

ORDINARY LEAST SQUARES

sin e

X = PX y.
is
X.

the proje tion of

onto the

We have that

N K

dimensional spa e that is orthogonal to the span of

= y X

So the matrix that proje ts

= y X(X X)1 X y


= In X(X X)1 X y.

onto the spa e orthogonal to the span of

is

= In X(X X)1 X

MX

= In PX .
We have

= MX y.
Therefore

y = PX y + MX y
= X + .
These two proje tion matri es de ompose the
omponents - the portion that lies in the
that lies in the orthogonal

Note that both

PX

nK

and

MX

dimensional ve tor

into two orthogonal

dimensional spa e dened by

X,

dimensional spa e.
are

symmetri

and

idempotent.

A symmetri matrix

An idempotent matrix

The only nonsingular idempotent matrix is the identity matrix.

and the portion

is one su h that

A = A .

is one su h that

A = AA.

3.4 Inuential observations and outliers


The OLS estimator of the

ith

element of the ve tor

i =
=

(X X)1 X

ci y

is simply

This is how we dene a linear estimator - it's a linear fun tion of the dependent variable. Sin e it's a linear ombination of the observations on the dependent variable, where the
weights are determined by the observations on the regressors, some observations may have
more inuen e than others.

3.4.

To investigate this, let

tth

27

INFLUENTIAL OBSERVATIONS AND OUTLIERS

olumn of the matrix

et

In .

be an

ve tor of zeros with a

in the t

th position,

i.e., it's the

Dene

ht = (PX )tt
= et PX et
so

ht

is the t

th element on the main diagonal of

PX .

Note that

ht = k PX et k2
so

ht k et k2 = 1
So

0 < ht < 1.

Also,

T rPX = K h = K/n.
So the average of the

ht

is

K/n.

The value

ht

is referred to as the

leverage of the observation.

If the leverage is mu h higher than average, the observation has the potential to ae t the
OLS t importantly. However, an observation may also be inuential due to the value of

yt ,

xt 's.
th
without using the t
observation (designate

rather than the weight it is multiplied by, whi h only depends on the
To a ount for this, onsider estimation of
this estimator as

(t) ).

One an show (see Davidson and Ma Kinnon, pp. 32-5 for proof ) that

(t)

so the hange in the

tth

1
1 ht

(X X)1 Xt t

observations tted value is

xt

xt (t)

ht
1 ht

t
 is
t

While an observation may be inuential if it doesn't ae t its own tted value, it ertainly
inuential if it does. A fast means of identifying inuential observations is to plot
(whi h I will refer to as the

own inuen e of the observation) as a fun tion of t.

ht
1ht

Figure 3.4 gives

an example plot of data, t, leverage and inuen e. The O tave program is InuentialObservation.m . If you re-run the program you will see that the leverage of the last observation (an
outlying value of x) is always high, and the inuen e is sometimes high.
After inuential observations are dete ted, one needs to determine

why they are inuential.

Possible auses in lude:

data entry error, whi h an easily be orre ted on e dete ted. Data entry errors

ommon.

are very

spe ial e onomi fa tors that ae t some observations. These would need to be identied
and in orporated in the model. This is the idea behind
may not be onstant a ross all observations.

stru tural hange :

the parameters

28

CHAPTER 3.

ORDINARY LEAST SQUARES

Figure 3.4: Dete tion of inuential observations


14
Data points
fitted
Leverage
Influence

12

10

0
0

0.5

1.5

2.5

3.5

pure randomness may have aused us to sample a low-probability observation.

There exist

robust

estimation methods that downweight outliers.

3.5 Goodness of t
The tted model is

y = X +
Take the inner produ t:

y y = X X + 2 X +
But the middle term of the RHS is zero sin e

X = 0,

so

y y = X X +
The

un entered Ru2

is dened as

Ru2 = 1
=
=


yy

X X
yy
k PX y k2
k y k2

= cos2 (),

(3.2)

3.5.

29

GOODNESS OF FIT

where

is the angle between

The un entered

R2

and the span of

hanges if we add a onstant to

3.5, the yellow ve tor is a onstant, sin e it's on the

y,

sin e this hanges

45 degree

(see Figure

line in observation spa e).

Another, more ommon denition measures the ontribution of the variables, other than

Figure 3.5: Un entered

R2

the onstant term, to explaining the variation in


model to explain the variation of

= (1, 1, ..., 1) ,

Let

y.

Thus it measures the ability of the

about its un onditional sample mean.

-ve tor. So

M = In ( )1
= In /n

M y

just returns the ve tor of deviations from the mean.

In terms of deviations from the

mean, equation 3.2 be omes

y M y = X M X + M
The

entered Rc2

is dened as

Rc2 = 1
where

ESS = and T SS = y M y =

Pn

ESS

=1

y M y
T SS

t=1 (yt

y)2 .

30

CHAPTER 3.

Supposing that

ontains a olumn of ones (

X = 0
so

M = .

ORDINARY LEAST SQUARES

i.e., there is a onstant term),

t = 0

In this ase

y M y = X M X +
So

Rc2 =
where

RSS
T SS

RSS = X M X

Supposing that a olumn of ones is in the spa e spanned by


show that

Rc2

X (PX = ),

then one an

1.

3.6 The lassi al linear regression model


Up to this point the model is empty of ontent beyond the denition of a best linear approximation to

and some geometri al properties. There is no e onomi ontent to the model, and

the regression parameters have no e onomi interpretation. For example, what is the partial
derivative of

with respe t to

xj ?

The linear approximation is

y = 1 x1 + 2 x2 + ... + k xk +
The partial derivative is

y
= j +
xj
xj

Up to now, there's no guarantee that

xj =0. For the

to have an e onomi meaning, we

need to make additional assumptions. The assumptions that are appropriate to make depend
on the data under onsideration. We'll start with the lassi al linear regression model, whi h
in orporates some assumptions that are learly not realisti for e onomi data. This is to be
able to explain some on epts with a minimum of onfusion and notational lutter. Later we'll
adapt the results to what we an get with more realisti assumptions.

Linearity:

the model is a linear fun tion of the parameter ve tor

y = 10 x1 + 20 x2 + ... + k0 xk +

0 :
(3.3)

or, using ve tor notation:

y = x 0 +

Nonsto hasti linearly independent regressors: X is a


has rank

K,

xed matrix of onstants, it

its number of olumns, and

1
lim X X = QX
n

(3.4)

3.7.

SMALL SAMPLE STATISTICAL PROPERTIES OF THE LEAST SQUARES ESTIMATOR31

where

QX

is a nite positive denite matrix. This is needed to be able to identify the individual

ee ts of the explanatory variables.

Independently and identi ally distributed errors:


IID(0, 2 In )

(3.5)

is jointly distributed IID. This implies the following two properties:

Homos edasti errors:


V (t ) = 02 , t

(3.6)

E(t s ) = 0, t 6= s

(3.7)

Nonauto orrelated errors:

Optionally, we will sometimes assume that the errors are normally distributed.

Normally distributed errors:


N (0, 2 In )

(3.8)

3.7 Small sample statisti al properties of the least squares estimator


Up to now, we have only examined numeri properties of the OLS estimator, that always
hold. Now we will examine statisti al properties. The statisti al properties depend upon the
assumptions we make.

3.7.1 Unbiasedness
We have

= (X X)1 X y .

By linearity,

= (X X)1 X (X + )
= + (X X)1 X
By 3.4 and 3.5

E(X X)1 X = E(X X)1 X


= (X X)1 X E
= 0
so the OLS estimator is unbiased under the assumptions of the lassi al model.
Figure 3.6 shows the results of a small Monte Carlo experiment where the OLS estimator
was al ulated for 10000 samples from the lassi al model with

= 9, and x is xed a ross samples.

We an see that the

y = 1 + 2x + ,

where

n = 20,

appears to be estimated without

32

CHAPTER 3.

ORDINARY LEAST SQUARES

Figure 3.6: Unbiasedness of OLS under lassi al assumptions

0.1

0.08

0.06

0.04

0.02

0
-3

-2

-1

bias. The program that generates the plot is Unbiased.m , if you would like to experiment
with this.
With time series data, the OLS estimator will often be biased. Figure 3.7 shows the results
of a small Monte Carlo experiment where the OLS estimator was al ulated for 1000 samples
from the AR(1) model with

yt = 0 + 0.9yt1 + t ,

where

n = 20

and

2 = 1.

In this ase,

assumption 3.4 does not hold: the regressors are sto hasti . We an see that the bias in the
estimation of

is about -0.2.

The program that generates the plot is Biased.m , if you would like to experiment with
this.

3.7.2 Normality
With the linearity assumption, we have

= + (X X)1 X .

This is a linear fun tion of

Adding the assumption of normality (3.8, whi h implies strong exogeneity), then

N , (X X)1 02

sin e a linear fun tion of a normal random ve tor is also normally distributed. In Figure 3.6
you an see that the estimator appears to be normally distributed.

It in fa t is normally

distributed, sin e the DGP (see the O tave program) has normal errors. Even when the data
may be taken to be IID, the assumption of normality is often questionable or simply untenable.
For example, if the dependent variable is the number of automobile trips per week, it is a ount
variable with a dis rete distribution, and is thus not normally distributed. Many variables in

SMALL SAMPLE STATISTICAL PROPERTIES OF THE LEAST SQUARES ESTIMATOR33

3.7.

Figure 3.7: Biasedness of OLS when an assumption fails

0.12

0.1

0.08

0.06

0.04

0.02

0
-1

-0.8

-0.6

-0.4

-0.2

0.2

0.4

e onomi s an take on only nonnegative values, whi h, stri tly speaking, rules out normality.

3.7.3 The varian e of the OLS estimator and the Gauss-Markov theorem
Now let's make all the lassi al assumptions ex ept the assumption of normality.

= + (X X)1 X

= .
E()

and we know that

= E
V ar()



We have

So


 



= E (X X)1 X X(X X)1

= (X X)1 02
The OLS estimator is a
dependent variable,

linear estimator ,

whi h means that it is a linear fun tion of the

y.
=


(X X)1 X y

= Cy
where
also

is a fun tion of the explanatory variables only, not the dependent variable.

unbiased

weights

It is

under the present assumptions, as we proved above. One ould onsider other

that are a fun tion of

that dene some other linear estimator. We'll still insist

= W y, where W = W (X) is some k n matrix fun tion of


fun tion of X, it is nonsto hasti , too. If the estimator is unbiased,

upon unbiasedness. Consider

X.

Note that sin e


2

is a

Normality may be a good model nonetheless, as long as the probability of a negative value o uring is

negligable under the model. This depends upon the mean being large enough in relation to the varian e.

34

CHAPTER 3.

then we must have

ORDINARY LEAST SQUARES

W X = IK :
E(W y)

E(W X0 + W )

W X0

WX
The varian e of

IK

is

= W W 2 .
V ()
0
Dene

D = W (X X)1 X
so

W = D + (X X)1 X
Sin e

W X = IK , DX = 0,

so



D + (X X)1 X D + (X X)1 X 02

1  2
0
=
DD + X X

=
V ()

So

V ()

V ()
The inequality is a shorthand means of expressing, more formally, that

is a positive
V ()V
()

semi-denite matrix. This is a proof of the Gauss-Markov Theorem. The OLS estimator is
the best linear unbiased estimator (BLUE).

It is worth emphasizing again that we have not used the normality assumption in any
way to prove the Gauss-Markov theorem, so it is valid if the errors are not normally
distributed, as long as the other assumptions hold.

To illustrate the Gauss-Markov result, onsider the estimator that results from splitting the
sample into

equally-sized parts, estimating using ea h part of the data separately by OLS,

then averaging the

resulting estimators.

You should be able to show that this estimator

is unbiased, but ine ient with respe t to the OLS estimator.

The program E ien y.m

illustrates this using a small Monte Carlo experiment, whi h ompares the OLS estimator and
a 3-way split sample estimator. The data generating pro ess follows the lassi al model, with

n = 21.

The true parameter value is

= 2.

In Figures 3.8 and 3.9 we an see that the OLS

estimator is more e ient, sin e the tails of its histogram are more narrow.
We have that

=
E()

and

=
V ar()

XX

1

02 ,

but we still need to estimate the

2
varian e of , 0 , in order to have an idea of the pre ision of the estimates of

A ommonly

3.7.

SMALL SAMPLE STATISTICAL PROPERTIES OF THE LEAST SQUARES ESTIMATOR35

Figure 3.8: Gauss-Markov Result: The OLS estimator


0.12

0.1

0.08

0.06

0.04

0.02

0
0

0.5

1.5

2.5

3.5

Figure 3.9: Gauss-Markov Resul: The split sample estimator


0.12

0.1

0.08

0.06

0.04

0.02

0
0

36

CHAPTER 3.

used estimator of

02

ORDINARY LEAST SQUARES

is

This estimator is unbiased:

c2 =

0
=

c2 ) =
E(
0
=

=
=
=
=
where we use the fa t that

1

nK

c2 =

1

nK
1
M
nK
1
E(T r M )
nK
1
E(T rM )
nK
1
T rE(M )
nK
1
2 T rM
nK 0
1
2 (n k)
nK 0
02

T r(AB) = T r(BA)

when both produ ts are onformable. Thus,

this estimator is also unbiased under these assumptions.

3.8 Example: The Nerlove model


3.8.1 Theoreti al ba kground
For a rm that takes input pri es

and the output level

problem is to hoose the quantities of inputs

as given, the ost minimization

to solve the problem

min w x
x

subje t to the restri tion

f (x) = q.
The solution is the ve tor of fa tor demands

x(w, q).

The

ost fun tion

tuting the fa tor demands into the riterion fun tion:

Cw, q) = w x(w, q).


Monotoni ity

In reasing fa tor pri es annot de rease ost, so

C(w, q)
0
w

is obtained by substi-

3.8.

37

EXAMPLE: THE NERLOVE MODEL

Remember that these derivatives give the onditional fa tor demands (Shephard's Lemma).

Homogeneity The ost fun tion is homogeneous of degree 1 in input pri es: C(tw, q) =
tC(w, q)

where

is a s alar onstant. This is be ause the fa tor demands are homoge-

neous of degree zero in fa tor pri es - they only depend upon relative pri es.

Returns to s ale

returns to s ale

The

parameter

is dened as the inverse of the

elasti ity of ost with respe t to output:

Constant returns to s ale

q
C(w, q)
q
C(w, q)

1

is the ase where in reasing produ tion

in reases in the proportion 1:1. If this is the ase, then

implies that ost

= 1.

3.8.2 Cobb-Douglas fun tional form


The Cobb-Douglas fun tional form is linear in the logarithms of the regressors and the dependent variable. For a ost fun tion, if there are

fa tors, the Cobb-Douglas ost fun tion has

the form

C = Aw11 ...wg g q q e
C

What is the elasti ity of

eC
wj

with respe t to

C
WJ



wj ?

wj 
C

= j Aw11 .wj j

..wg g q q e

wj

1
Aw1 ...wg g q q e

= j
This is one of the reasons the Cobb-Douglas form is popular - the oe ients are easy to interpret, sin e they are the elasti ities of the dependent variable with respe t to the explanatory
variable. Not that in this ase,

eC
wj

C
WJ



= xj (w, q)

wj 
C

wj
C

sj (w, q)
the

ost share

of the

j th

input. So with a Cobb-Douglas ost fun tion,

shares are onstants.


Note that after a logarithmi transformation we obtain

ln C = + 1 ln w1 + ... + g ln wg + q ln q +

j = sj (w, q).

The ost

38

CHAPTER 3.

where

= ln A

ORDINARY LEAST SQUARES

. So we see that the transformed model is linear in the logs of the data.

One an verify that the property of HOD1 implies that

g
X

g = 1

i=1

In other words, the ost shares add up to 1.


The hypothesis that the te hnology exhibits CRTS implies that

=
so

q = 1.

1
=1
q

Likewise, monotoni ity implies that the oe ients

i 0, i = 1, ..., g.

3.8.3 The Nerlove data and OLS


The le nerlove.data ontains data on 145 ele tri utility ompanies' ost of produ tion, output
and input pri es. The data are for the U.S., and were olle ted by M. Nerlove. The observations

COMPANY, COST (C), OUTPUT (Q), PRICE OF


LABOR (PL ), PRICE OF FUEL (PF ) and PRICE OF CAPITAL (PK ). Note that the
are by row, and the olumns are

data are sorted by output level (the third olumn).


We will estimate the Cobb-Douglas model

ln C = 1 + 2 ln Q + 3 ln PL + 4 ln PF + 5 ln PK +

(3.9)

using OLS. To do this yourself, you need the data le mentioned above, as well as Nerlove.m
(the estimation program), and the library of O tave fun tions mentioned in the introdu tion
3

to O tave that forms se tion 22 of this do ument.


The results are

*********************************************************
OLS estimation results
Observations 145
R-squared 0.925955
Sigma-squared 0.153943
Results (Ordinary var- ov estimator)
onstant
output
labor
fuel
apital
3

estimate
-3.527
0.720
0.436
0.427
-0.220

st.err.
1.774
0.017
0.291
0.100
0.339

t-stat.
-1.987
41.244
1.499
4.249
-0.648

p-value
0.049
0.000
0.136
0.000
0.518

If you are running the bootable CD, you have all of this installed and ready to run.

3.8.

39

EXAMPLE: THE NERLOVE MODEL

*********************************************************

Do the theoreti al restri tions hold?

Does the model t well?

What do you think about RTS?

While we will use O tave programs as examples in this do ument, sin e following the programming statements is a useful way of learning how theory is put into pra ti e, you may
be interested in a more user-friendly environment for doing e onometri s. I heartily re ommend Gretl, the Gnu Regression, E onometri s, and Time-Series Library. This is an easy to
use program, available in English, Fren h, and Spanish, and it omes with a lot of data ready
AT X fragments, so that I an just in lude
to use. It even has an option to save output as L
E

the results into this do ument, no muss, no fuss. Here the results of the Nerlove model from
GRETL:

Model 2: OLS estimates using the 145 observations 1145


Dependent variable: l_ ost
Variable
onst

Coe ient

Std. Error

3.5265

t-statisti

1.77437

p-value

1.9875

0.0488

41.2445

0.0000

0.720394

0.0174664

l_labor

0.436341

0.291048

1.4992

0.1361

l_fuel

0.426517

0.100369

4.2495

0.0000

0.219888

0.339429

0.6478

0.5182

l_output

l_ apita

Mean of dependent variable

1.72466

S.D. of dependent variable

1.42172

Sum of squared residuals


Standard error of residuals (
)
Unadjusted

R2

2
Adjusted R

21.5520
0.392356
0.925955
0.923840

F (4, 140)

437.686

Akaike information riterion

145.084

S hwarz Bayesian riterion

159.967

Fortunately, Gretl and my OLS program agree upon the results.

Gretl is in luded in the

bootable CD mentioned in the introdu tion. I re ommend using GRETL to repeat the examples that are done using O tave.
The previous properties hold for nite sample sizes.

Before onsidering the asymptoti

properties of the OLS estimator it is useful to review the MLE estimator, sin e under the
assumption of normal errors the two estimators oin ide.

40

CHAPTER 3.

ORDINARY LEAST SQUARES

3.9 Exer ises


1. Prove that the split sample estimator used to generate gure 3.9 is unbiased.
2. Cal ulate the OLS estimates of the Nerlove model using O tave and GRETL, and provide
printouts of the results. Interpret the results.
3. Do an analysis of whether or not there are inuential observations for OLS estimation
of the Nerlove model. Dis uss.
4. Using GRETL, examine the residuals after OLS estimation and tell me whether or not
you believe that the assumption of independent identi ally distributed normal errors is
warranted. No need to do formal tests, just look at the plots. Print out any that you
think are relevant, and interpret them.
5. For a random ve tor

X N (x , ),

what is the distribution of

AX + b,

where

and

are onformable matri es of onstants?


6. Using O tave, write a little program that veries that

T r(AB) = T r(BA)

4x4 matri es of random numbers. Note: there is an O tave fun tion


7. For the model with a onstant and a single regressor,

for

and

tra e.

yt = 1 + 2 xt + t ,

whi h satises

the lassi al assumptions, prove that the varian e of the OLS estimator de lines to zero
as the sample size in reases.

Chapter 4
Maximum likelihood estimation
The maximum likelihood estimator is important sin e it is asymptoti ally e ient, as is shown
below.

For the lassi al linear model with normal errors, the ML and OLS estimators of

are the same, so the following theory is presented without examples. In the se ond half of the
ourse, nonlinear models with nonnormal errors are introdu ed, and examples may be found
there.

4.1 The likelihood fun tion


Suppose we have a sample of size
of

Y =

y1 . . . yn

and

Z=

n

of the random ve tors

z1 . . . zn

and

z.

Suppose the joint density

is hara terized by a parameter ve tor

0 :

fY Z (Y, Z, 0 ).
This is the joint density of the sample. This density an be fa tored as

fY Z (Y, Z, 0 ) = fY |Z (Y |Z, 0 )fZ (Z, 0 )


The

likelihood fun tion

is just this density evaluated at other values

L(Y, Z, ) = f (Y, Z, ), ,
where

parameter spa e.
maximum likelihood estimator

The

is a

of

is the value of

that maximizes the likelihood

fun tion.
Note that if
fun tion
fun tion

and

fY |Z (Y |Z, )

share no elements, then the maximizer of the onditional likelihood

with respe t to

is the same as the maximizer of the overall likelihood

fY Z (Y, Z, ) = fY |Z (Y |Z, )fZ (Z, ),

this ase, the variables

are said to be

for the elements of

exogenous

0 .

The maximum likelihood estimator of

fY |Z (Y |Z, )

0 = arg max fY |Z (Y |Z, )


41

that orrespond to

for estimation of

onveniently work with the onditional likelihood fun tion


estimating

In

and we may more


for the purposes of

42

CHAPTER 4.

If the

MAXIMUM LIKELIHOOD ESTIMATION

observations are independent, the likelihood fun tion an be written as

L(Y |Z, ) =
where the

ft

n
Y
t=1

f (yt |zt , )

are possibly of dierent form.

If this is not possible, we an always fa tor the likelihood into

vations,

ontributions of obser-

by using the fa t that a joint density an be fa tored into the produ t of a

marginal and onditional (doing this iteratively)

L(Y, ) = f (y1 |z1 , )f (y2 |y1 , z2 , )f (y3 |y1 , y2 , z3 , ) f (yn |y1, y2 , . . . ytn , zn , )
To simplify notation, dene

xt = {y1 , y2 , ..., yt1 , zt }


so

x1 = z1 , x2 = {y1 , z2 },

et .

- it ontains exogenous and predetermined endogeous variables.

Now the likelihood fun tion an be written as

L(Y, ) =

n
Y
t=1

f (yt |xt , )

The riterion fun tion an be dened as the average log-likelihood fun tion:

1X
1
ln f (yt |xt , )
sn () = ln L(Y, ) =
n
n t=1
The maximum likelihood estimator may thus be dened equivalently as

= arg max sn (),


where the set maximized over is dened below. Sin e

ln L

and

maximize at the same value of

ln()

Dividing by

is a monotoni in reasing fun tion,

has no ee t on

4.1.1 Example: Bernoulli trial


Suppose that we are ipping a oin that may be biased, so that the probability of a heads may
not be 0.5. Maybe we're interested in estimating the probability of a heads. Let

y = 1(heads)

be a binary variable that indi ates whether or not a heads is observed. The out ome of a toss
is a Bernoulli random variable:

fY (y, p0 ) = py0 (1 p0 )1y , y {0, 1}


= 0, y
/ {0, 1}

4.1.

43

THE LIKELIHOOD FUNCTION

So a representative term that enters the likelihood fun tion is

fY (y, p) = py (1 p)1y
and

ln fY (y, p) = y ln p + (1 y) ln (1 p)
The derivative of this is

ln fY (y, p)
p

=
=

Averaging this over a sample of size

y (1 y)

p (1 p)
yp
p (1 p)

gives

1 X yi p
sn (p)
=
p
n
p (1 p)
i=1

Setting to zero and solving gives

p = y
So it's easy to al ulate the MLE of

p0 in

(4.1)

this ase.

Now imagine that we had a bag full of bent oins, ea h bent around a sphere of a dierent radius (with the head pointing to the outside of the sphere). We might suspe t that the probability of a heads ould depend upon the radius. Suppose that
where

xi =

1 ri

, so that

is a 21 ve tor. Now

pi p(xi , ) = (1 + exp(xi ))1

pi ()
= pi (1 pi ) xi

so

ln fY (y, )

y pi
pi (1 pi ) xi
pi (1 pi )
= (yi p(xi , )) xi

So the derivative of the average log lihelihood fun tion is now

sn ()
=

Pn

i=1 (yi

p(xi , )) xi
n

This is a set of 2 nonlinear equations in the two unknown elements in

There is no expli it

solution for the two elements that set the equations to zero. This is ommonly the ase with
ML estimators: they are often nonlinear, and nding the value of the estimate often requires
use of numeri methods to nd solutions to the rst order onditions.
explored further in the se ond half of these notes (see se tion 14.6).

This possibility is

44

CHAPTER 4.

MAXIMUM LIKELIHOOD ESTIMATION

4.2 Consisten y of MLE


To show onsisten y of the MLE, we need to make expli it some assumptions.

Compa t parameter spa e ,

over

an open bounded subset of

whi h is ompa t.

This implies that

is an interior point of the

K .

Maximixation is

parameter spa e .

Uniform onvergen e
u.a.s

sn () lim E0 sn () s (, 0 ), .
n

We have suppressed

here for simpli ity. This requires that almost sure onvergen e holds for

all possible parameter values. For a given parameter value, an ordinary Law of Large Numbers
will usually imply almost sure onvergen e to the limit of the expe tation. Convergen e for a
single element of the parameter spa e, ombined with the assumption of a ompa t parameter
spa e, ensures uniform onvergen e.

Continuity sn ()
in

is ontinuous in

, .

This implies that

Identi ation s (, 0 ) has a unique maximum


We will use these assumptions to show that
First,

s (, 0 )

is ontinuous

in its rst argument.

a.s.
n 0 .

ertainly exists, sin e a ontinuous fun tion has a maximum on a ompa t set.

Se ond, for any

6= 0

E ln
by Jensen's inequality (

ln ()

L()
L(0 )



 

L()
ln E
L(0 )

is a on ave fun tion).

Now, the expe tation on the RHS is

E
sin e

L(0 )

is

L()
L(0 )

L()
L(0 )dy = 1,
L(0 )

the density fun tion of the observations, and sin e the integral of any density is

1. Therefore, sin e

ln(1) = 0,

E ln

L()
L(0 )



0,

or

E (sn ()) E (sn (0 )) 0.


Taking limits, this is (by the assumption on uniform onvergen e)

s (, 0 ) s (0 , 0 ) 0

4.3.

45

THE SCORE FUNCTION

ex ept on a set of zero probability.


By the identi ation assumption there is a unique maximizer, so the inequality is stri t if

6= 0 :

s (, 0 ) s (0 , 0 ) < 0, 6= 0 , a.s.
is a limit point of n (any sequen e from a ompa t
n is a maximizer, independent of n, we must have
Sin e

Suppose that
limit point).

set has at least one

s ( , 0 ) s (0 , 0 ) 0.
These last two inequalities imply that

= 0 , a.s.
Thus there is only one limit point, and it is equal to the true parameter value, with probability
one. In other words,

lim = 0 , a.s.

This ompletes the proof of strong onsisten y of the MLE. One an use weaker assumptions
to prove weak onsisten y ( onvergen e in probability to

0 ) of the MLE. This is omitted here.

Note that almost sure onvergen e implies onvergen e in probability.

4.3 The s ore fun tion


Assumption:

Dierentiability Assume that

a neighborhood

N (0 )

of

0 ,

at least when

sn () is twi e ontinuously dierentiable in


is large enough.

To maximize the log-likelihood fun tion, take derivatives:

gn (Y, ) = D sn ()
n
1X
=
D ln f (yt |xx , )
n t=1
n

This is the

s ore ve tor

(with dim

K 1).

1X
gt ().
n
t=1

Note that the s ore fun tion has

whi h implies that it is a random fun tion.

sets

the derivatives to zero:

as an argument,

(and any exogeneous variables) will often be

suppressed for larity, but one should not forget that they are still there.
The ML estimator

X
= 1
0.
gn ()
gt ()
n t=1

46

CHAPTER 4.

We will show that

f (),

not ne essarily

MAXIMUM LIKELIHOOD ESTIMATION

E [gt ()] = 0, t. This is the expe tation

taken with respe t to the density

f (0 ) .

E [gt ()] =

[D ln f (yt |xt , )]f (yt |x, )dyt

D f (yt |xt , )dyt .

=
=

1
[D f (yt |xt , )] f (yt |xt , )dyt
f (yt |xt , )

Given some regularity onditions on boundedness of

D f, we an swit h the order of integration

and dierentiation, by the dominated onvergen e theorem. This gives

E [gt ()] = D

f (yt |xt , )dyt

= D 1
= 0

where we use the fa t that the integral of the density is 1.

So

E (gt () = 0 :

This hold for all

the expe tation of the s ore ve tor is zero.


t,

so it implies that

E gn (Y, ) = 0.

4.4 Asymptoti normality of MLE


Re all that we assume that
series expansion of

g(Y, )

sn () is twi e ontinuously dierentiable.


about the true value

Take a rst order Taylor's

0 :



= g(0 ) + (D g( )) 0
0 g()
or with appropriate denitions



H( ) 0 = g(0 ),
where

= + (1 )0 , 0 < < 1.

minute). So

Now onsider

H( ).

Assume

H( )

is invertible (we'll justify this in a




n 0 = H( )1 ng(0 )

This is

H( ) = D g( )
= D2 sn ( )
n
1X 2
D ln ft ( )
=
n t=1

4.4.

47

ASYMPTOTIC NORMALITY OF MLE

where the notation

D2 sn ()

2 sn ()
.

Given that this is an average of terms, it should usually be the ase that this satises a strong
law of large numbers (SLLN).
that this will happen.

Regularity onditions

are a set of assumptions that guarantee

There are dierent sets of assumptions that an be used to justify

appeal to dierent SLLN's. For example, the

D2 ln ft ( ) must

not be too strongly dependent

over time, and their varian es must not be ome innite. We don't assume any parti ular set
here, sin e the appropriate assumptions will depend upon the parti ularities of a given model.
However, we assume that a SLLN applies.
Also, sin e we know that

is onsistent, and sin e

a.s.

0 . Also, by the above dierentiability assumtion,

H( ) onverges to the limit of it's expe tation:

= + (1 )0 ,

H()

is ontinuous in

we have that

Given this,


a.s.
H( ) lim E D2 sn (0 ) = H (0 ) <
n

This matrix onverges to a nite limit.

Re-arranging orders of limits and dierentiation, whi h is legitimate given regularity onditions, we get

H (0 ) = D2 lim E (sn (0 ))
n
2
D s (0 , 0 )

=
We've already seen that

s (, 0 ) < s (0 , 0 )

i.e., 0

maximizes the limiting obje tive fun tion. Sin e there is a unique maximizer, and by

the assumption that

H (0 )

sn ()

is twi e ontinuously dierentiable (whi h holds in the limit), then

must be negative denite, and therefore of full rank. Therefore the previous inversion

is justied, asymptoti ally, and we have

Now onsider

ng(0 ).


a.s.
n 0 H (0 )1 ng(0 ).

(4.2)

This is

ngn (0 ) =
nD sn ()
X
n
n
=
D ln ft (yt |xt , 0 )
n t=1
n

=
We've already seen that
Note that

1 X

gt (0 )
n t=1

E [gt ()] = 0. As su h,

a.s.

gn (0 ) 0,

by onsisten y.

it is reasonable to assume that a CLT applies.

To avoid this ollapse to a degenerate r.v.

(a

48

CHAPTER 4.

onstant ve tor) we need to s ale by

n.

MAXIMUM LIKELIHOOD ESTIMATION

A generi CLT states that, for

Xn

a random ve tor

that satises ertain onditions,

Xn E(Xn ) N (0, lim V (Xn ))


The  ertain onditions that

Xn

must satisfy depend on the ase at hand. Usually,

be of the form of an average, s aled by

n:

Xn = n
This is the ase for
of the

Xt .

ng(0 ) for example.

For example, if the

Xt

Pn

that

will

t=1 Xt

n
Xn

Then the properties of

depend on the properties

have nite varian es and are not too strongly dependent,

then a CLT for dependent pro esses will apply.

E( ngn (0 ) = 0,

Xn

Supposing that a CLT applies, and noting

we get

d
I (0 )1/2 ngn (0 ) N [0, IK ]
where

I (0 ) =
=
This an also be written as

I (0 )

is known as the

lim E0 n [gn (0 )] [gn (0 )]




lim V0
ngn (0 )

n
n

ngn (0 ) N [0, I (0 )]

(4.3)

information matrix.

Combining [4.2 and [4.3, we get





a
n 0 N 0, H (0 )1 I (0 )H (0 )1 .

The MLE estimator is asymptoti ally normally distributed.


Denition 1 (CAN) An estimator of a parameter 0 is

normally distributed if

n- onsistent



d
n 0 N (0, V )

(4.4)

where V is a nite positive denite matrix.

There do exist, in spe ial ases, estimators that are onsistent su h that
These are known as

and asymptoti ally

super onsistent

estimators, sin e normally,

 p

n 0 0.

is the highest fa tor that

we an multiply by and still get onvergen e to a stable limiting distribution.

Denition 2 (Asymptoti unbiasedness) An estimator of a parameter 0 is asymptot-

i ally unbiased if

= .
lim E ()

(4.5)

4.5.

49

THE INFORMATION MATRIX EQUALITY

Estimators that are CAN are asymptoti ally unbiased, though not all onsistent estimators
are asymptoti ally unbiased. Su h ases are unusual, though. An example is

4.4.1 Coin ipping, again


In se tion 4.1.1 we saw that the MLE for the parameter of a Bernoulli trial, with i.i.d. data,
is the sample mean:

p = y (equation

4.1). Now let's nd the limiting varian e of

n (
p p).

lim V ar n (
p p) = lim nV ar (
p p)
= lim nV ar (
p)
= lim nV ar (
y)
P 
yt
= lim nV ar
n
X
1
V ar(yt ) (by independen e of obs.)
= lim
n
1
= lim nV ar(y) (by identi ally distributed obs.)
n
= p (1 p)

4.5 The information matrix equality


We will show that

H () = I ().

Let

1 =

0 =

ft ()

be short for

ft ()dy,

f (yt |xt , )

so

D ft ()dy
(D ln ft ()) ft ()dy

Now dierentiate again:

D2 ln ft ()

ft ()dy + [D ln ft ()] D ft ()dy


Z
 2

= E D ln ft () + [D ln ft ()] [D ln ft ()] ft ()dy


= E D2 ln ft () + E [D ln ft ()] [D ln ft ()]

0 =

= E [Ht ()] + E [gt ()] [gt ()]


Now sum over

and multiply by

(4.6)

1
n

#
" n
n
1X
1X
[Ht ()] = E
[gt ()] [gt ()]
E
n t=1
n t=1
The s ores

gt

and

gs

are un orrelated for

t 6= s,

sin e for

onditioned on prior information, so what was random in

t > s, ft (yt |y1 , ..., yt1 , )


is xed in

t.

has

(This forms the

50

CHAPTER 4.

MAXIMUM LIKELIHOOD ESTIMATION

basis for a spe i ation test proposed by White: if the s ores appear to be orrelated one may
question the spe i ation of the model). This allows us to write

E [H()] = E n [g()] [g()]

sin e all ross produ ts between dierent periods expe t to zero. Finally take limits, we get

H () = I ().
This holds for all

in parti ular, for

0 .

(4.7)

Using this,





a.s.
n 0 N 0, H (0 )1 I (0 )H (0 )1

simplies to





a.s.
n 0 N 0, I (0 )1

To estimate the asymptoti varian e, we need estimators of

I\
(0 ) = n

n
X

(4.8)

H (0 )

and

I (0 ).

We an use

t ()

gt ()g

t=1

H\
(0 ) = H().
Note, one an't use

h
ih
i

I\
(
)
=
n
g
(
)
g
(
)
0
n
n

to estimate the information matrix. Why not?

From this we see that there are alternative ways to estimate

V (0 )

that are all valid.

These in lude

\
V\
(0 ) = H (0 )

\
V\
(0 ) = I (0 )

\
V\
(0 ) = H (0 )

These are known as the


estimators, respe tively.

\
I\
(0 )H (0 )

inverse Hessian, outer produ t of the gradient

(OPG) and

sandwi h

The sandwi h form is the most robust, sin e it oin ides with the

ovarian e estimator of the

quasi-ML estimator.

4.6 The Cramr-Rao lower bound


The limiting varian e of a CAN estimator of 0 , say
, minus the inverse of the information matrix is a positive semidenite matrix.

Theorem 3

[Cramer-Rao Lower Bound

4.6.

51

THE CRAMR-RAO LOWER BOUND

Proof: Sin e the estimator is CAN, it is asymptoti ally unbiased, so

lim E ( ) = 0

Dierentiate wrt

lim E ( ) =

lim

= 0 (this
Noting that

D f (Y, ) = f ()D ln f (),


lim

Now note that

Z 

h

i
D f (Y, ) dy

is a

K K

matrix of zeros).

we an write

Z



f (Y, )D dy = 0.
f ()D ln f ()dy + lim
n



D = IK ,
lim

and

Z 

f (Y, )(IK )dy = IK .

With this we have


f ()D ln f ()dy = IK .

n we get
Z
 1

lim
n
n [D ln f ()] f ()dy = IK
n
{z
}
|n

Playing with powers of

Note that the bra keted part is just the transpose of the s ore ve tor,

lim E

g(),

so we an write

i
h 

n
ng() = IK



n , for any CAN esti


varian e of
n tends to V ().

This means that the ovarian e of the s ore fun tion with
mator, is an identity matrix. Using this, suppose the
Therefore,

 # "
" 
#

n
V ()
IK
V
.
=

IK
I ()
ng()

(4.9)

Sin e this is a ovarian e matrix, it is positive semi-denite. Therefore, for any

1
I
()

This simplies to

Sin e

is arbitrary,

This means that

"

V ()
IK

IK
I ()

#"

I ()1

-ve tor

0.

h
i
I 1 () 0.
V ()

I 1 ()
V ()

1 ()
I

Asymptoti e ien y )

is a

is positive semidenite. This onludes the proof.

lower bound

for the asymptoti varian e of a CAN estimator.

Given two CAN estimators of a parameter

asymptoti ally e ient with respe t to

V ()

if V ()

0 ,

say

and

is

is a positive semidenite matrix.

52

CHAPTER 4.

MAXIMUM LIKELIHOOD ESTIMATION

A dire t proof of asymptoti e ien y of an estimator is infeasible, but if one an show that
the asymptoti varian e is equal to the inverse of the information matrix, then the estimator
is asymptoti ally e ient. In parti ular,

any other CAN estimator.

the MLE is asymptoti ally e ient with respe t to

Summary of MLE

Consistent

Asymptoti ally normal (CAN)

Asymptoti ally e ient

Asymptoti ally unbiased

This is for general MLE: we haven't spe ied the distribution or the linearity/nonlinearity of the estimator

4.7 Exer ises


1. Consider oin tossing with a single possibly biased oin. The density fun tion for the
random variable

y = 1(heads)

is

fY (y, p0 ) = py0 (1 p0 )1y , y {0, 1}


= 0, y
/ {0, 1}
Suppose that we have a sample of size
is

pb0 = y.

n.

We know from above that the ML estimator

We also know from the theory above that



a
n (
y p0 ) N 0, H (p0 )1 I (p0 )H (p0 )1

a) nd the analyti expression for gt () and show that E [gt ()] = 0
b) nd the analyti al expressions for H (p0 ) and I (p0 ) for this problem

) verify that the result for lim V ar n (


p p) found in se tion 4.4.1 is equal to H (p0 )1 I (p0 )H (p0 )1

d) Write an O tave program that does a Monte Carlo study that shows that n (y p0 )
is approximately normally distributed when
show the sampling frequen y of

2. Consider the model

n (
y p0 )

yt = xt + t

is large. Please give me histograms that

for several values of

n.

where the errors follow the Cau hy (Student-t with

1 degree of freedom) density. So

f (t ) =

1
 , < t <
1 + 2t

The Cau hy density has a shape similar to a normal density, but with mu h thi ker tails.
Thus, extremely small and large errors o ur mu h more frequently with this density

4.7.

53

EXERCISES

than would happen if the errors were normally distributed.

gn ()

where

3. Consider the model lassi al linear regression model


Find the s ore fun tion

Find the s ore fun tion

gn ()

where

yt = xt + t

where

t IIN (0, 2 ).

4. Compare the rst order onditions that dene the ML estimators of problems 2 and 3
and interpret the dieren es.

Why

are the rst order onditions that dene an e ient

estimator dierent in the two ases?


5. There is an ML estimation routine in the provided software that a ompanies these
notes.
output.

Edit (to see what it does) then run the s ript mle_example.m.

Interpret the

54

CHAPTER 4.

MAXIMUM LIKELIHOOD ESTIMATION

Chapter 5
Asymptoti properties of the least
squares estimator
1

The OLS estimator under the lassi al assumptions is BLUE , for all sample sizes. Now let's
see what happens when the sample size tends to innity.

5.1 Consisten y
= (X X)1 X y
= (X X)1 X (X + )
= 0 + (X X)1 X
 1
XX
X
= 0 +
n
n
Consider the last two terms. By assumption

limn

XX
n

= QX limn

X X
n

1

Q1
X , sin e the inverse of a nonsingular matrix is a ontinuous fun tion of the elements of the
X
matrix. Considering
n ,
n
X
1X
n

Ea h

xt t

has expe tation zero, so

xt t

X
n

t=1

=0

The varian e of ea h term is

V (xt t ) = xt xt 2 .
1

BLUE

best linear unbiased estimator if I haven't dened it before

55

56CHAPTER 5.

ASYMPTOTIC PROPERTIES OF THE LEAST SQUARES ESTIMATOR

As long as these are nite, and given a te hni al ondition , the Kolmogorov SLLN applies, so

1X
a.s.
xt t 0.
n
t=1

This implies that

a.s.
0 .
This is the property of

strong onsisten y:

the estimator onverges in almost surely to the

true value.

The onsisten y proof does not use the normality assumption.

Remember that almost sure onvergen e implies onvergen e in probability.

5.2 Asymptoti normality


We've seen that the OLS estimator is normally distributed

errors.

under the assumption of normal

If the error distribution is unknown, we of ourse don't know the distribution of the

estimator. However, we an get asymptoti results.

Assuming the distribution of is unknown,

but the the other lassi al assumptions hold:

= 0 + (X X)1 X

0 = (X X)1 X
 1


XX
X

n 0
=
n
n

XX
n

1

Q1
X .

Now as before,

X
Considering , the limit of the varian e is
n

lim V

lim E

= 02 QX

X X
n

The mean is of ourse zero. To get asymptoti normality, we need to apply a CLT. We
assume one (for instan e, the Lindeberg-Feller CLT) holds, so


X d
N 0, 02 QX
n
For appli ation of LLN's and CLT's, of whi h there are very many to hoose from, I'm going to avoid

the te hni alities. Basi ally, as long as terms that make up an average have nite varian es and are not too
strongly dependent, one will be able to nd a LLN or CLT to apply. Whi h one it is doesn't matter, we only
need the result.

5.3.

57

ASYMPTOTIC EFFICIENCY

Therefore,




d
n 0 N 0, 02 Q1
X

In summary, the OLS estimator is normally distributed in small and large samples if
is normally distributed. If

is not normally distributed,

is asymptoti ally normally

distributed when a CLT an be applied.

5.3 Asymptoti e ien y


The least squares obje tive fun tion is

s() =

n
X
t=1

Supposing that

yt xt

2

is normally distributed, the model is

y = X0 + ,

N (0, 02 In ), so


n
Y
1
2t

f () =
exp 2
2
2 2
t=1
The joint density for

so
y

= 1,

In and | y
|

an be onstru ted using a hange of variables. We have


so

n
Y

(yt xt )2

f (y) =
exp
2 2
2 2
t=1
Taking logs,

= y X,

n
X

(yt xt )2
ln L(, ) = n ln 2 n ln
.
2 2
t=1

It's lear that the fon for the MLE of

are the same as the fon for OLS (up to multipli ation

the estimators are the same, under the present assumptions. Therefore, their
properties are the same. In parti ular, under the lassi al assumptions with normality, the OLS
estimator is asymptoti ally e ient.

by a onstant), so

As we'll see later, it will be possible to use (iterated) linear estimation methods and still
a hieve asymptoti e ien y even if the assumption that
normally distributed. This is

V ar() 6= 2 In ,

not the ase if is nonnormal.

as long as

is still

In general with nonnormal errors

it will be ne essary to use nonlinear estimation methods to a hieve asymptoti ally e ient
estimation. That possibility is addressed in the se ond half of the notes.

58CHAPTER 5.

ASYMPTOTIC PROPERTIES OF THE LEAST SQUARES ESTIMATOR

5.4 Exer ises


1. Write an O tave program that generates a histogram for



n j j ,

where

is

the OLS estimator and

Monte Carlo repli ations of

is one of the

slope parameters.

should be a large number, at least 1000. The model used to generate data should follow
the lassi al assumptions, ex ept that the errors should not be normally distributed (try

U (a, a), t(p), 2 (p) p,

et ). Generate histograms for

observe eviden e of asymptoti normality? Comment.

n {20, 50, 100, 1000} .

Do you

Chapter 6
Restri tions and hypothesis tests
6.1 Exa t linear restri tions
In many ases, e onomi theory suggests restri tions on the parameters of a model.

For

example, a demand fun tion is supposed to be homogeneous of degree zero in pri es and
in ome. If we have a Cobb-Douglas (log-linear) model,

ln q = 0 + 1 ln p1 + 2 ln p2 + 3 ln m + ,
then we need that

k0 ln q = 0 + 1 ln kp1 + 2 ln kp2 + 3 ln km + ,
so

1 ln p1 + 2 ln p2 + 3 ln m = 1 ln kp1 + 2 ln kp2 + 3 ln km
= (ln k) (1 + 2 + 3 ) + 1 ln p1 + 2 ln p2 + 3 ln m.
The only way to guarantee this for arbitrary

is to set

1 + 2 + 3 = 0,
whi h is a

parameter restri tion.

In parti ular, this is a linear equality restri tion, whi h is

probably the most ommonly en ountered ase.

6.1.1 Imposition
The general formulation of linear equality restri tions is the model

y = X +
R = r
where

is a

QK

matrix,

Q<K

and

is a

Q1

59

ve tor of onstants.

60

CHAPTER 6.

We assume

We also assume that

is of rank

Let's onsider how to estimate

Q,

RESTRICTIONS AND HYPOTHESIS TESTS

so that there are no redundant restri tions.

that satises the restri tions: they aren't infeasible.

subje t to the restri tions R = r. The most obvious approa h

is to set up the Lagrangean

min s() =

1
(y X) (y X) + 2 (R r).
n

The Lagrange multipliers are s aled by 2, whi h makes things less messy. The fon are

)
= 2X y + 2X X R + 2R
0
D s(,
)
= RR r 0,
D s(,
"

whi h an be written as

We get

"

X X R

R
#

"

#"

X X R
R

"

X y

#1 "

X y
r

Maybe you're urious about how to invert a partitioned matrix? I an help you with that:

Note that

"

(X X)1

R (X X)1 IQ

#"

X X R
R

AB
=

"

"

IK

(X X)1 R

R (X X)1 R
#
(X X)1 R

IK
0

C,
and

"

IK

(X X)1 R P 1

P 1

#"

IK

(X X)1 R

DC
= IK+Q ,

6.1.

61

EXACT LINEAR RESTRICTIONS

so

DAB = IK+Q
DA = "
B 1
#
#"
(X X)1
0
IK (X X)1 R P 1
1
B
=
R (X X)1 IQ
0
P 1
#
"
(X X)1 (X X)1 R P 1 R (X X)1 (X X)1 R P 1
,
=
P 1 R (X X)1
P 1
If you weren't urious about that, please start paying attention again. Also, note that we have

P = R (X X)1 R )

made the denition

"

"

(X X)1 (X X)1 R P 1 R (X X)1 (X X)1 R P 1

P 1 R (X X)1
P 1


(X X)1 R P 1 R r




=
P 1 R r
#
"
"
 #
X)1 R P 1 r
(X
IK (X X)1 R P 1 R
+
=
P 1 r
P 1 R

X y
r

are linear fun tions of makes it easy to determine their distributions,

is already known. Re all that for x a random ve tor, and for A and
sin e the distribution of
The fa t that

#"

and

a matrix and ve tor of onstants, respe tively,

V ar (Ax + b) = AV ar(x)A .

Though this is the obvious way to go about nding the restri ted estimator, an easier way,
if the number of restri tions is small, is to impose them by substitution. Write

h
where

R1

is

QQ

an always make

R1 R2

"

1
2

y = X1 1 + X2 2 +
#
= r

nonsingular. Supposing the

R1

restri tions are linearly independent, one

nonsingular by reorganizing the olumns of

X.

Then

1 = R11 r R11 R2 2 .
Substitute this into the model

y = X1 R11 r X1 R11 R2 2 + X2 2 +


y X1 R11 r = X2 X1 R11 R2 2 +

or with the appropriate denitions,

yR = XR 2 + .

62

CHAPTER 6.

RESTRICTIONS AND HYPOTHESIS TESTS

This model satises the lassi al assumptions,


estimate by OLS. The varian e of

supposing the restri tion is true.

is as before

V (2 ) = XR
XR
and the estimator is

V (2 ) = XR
XR
where one estimates

1 , use
2 , so
of

To re over
fun tion

02

1
1

02

in the normal way, using the restri ted model,

c2 =

One an

i.e.,


 

yR XR b2
yR XR b2
n (K Q)

the restri tion. To nd the varian e of

1 ,

use the fa t that it is a linear


V (1 ) = R11 R2 V (2 )R2 R11
1

R2 R11 02
= R11 R2 X2 X2

6.1.2 Properties of the restri ted estimator


We have that



R = (X X)1 R P 1 R r

= + (X X)1 R P 1 r (X X)1 R P 1 R(X X)1 X y

= + (X X)1 X + (X X)1 R P 1 [r R] (X X)1 R P 1 R(X X)1 X

R = (X X)1 X

+ (X X)1 R P 1 [r R]

(X X)1 R P 1 R(X X)1 X

Mean squared error is

M SE(R ) = E(R )(R )


Noting that the rosses between the se ond term and the other terms expe t to zero, and that
the ross of the rst and third has a an ellation with the square of the third, we obtain

M SE(R ) = (X X)1 2
+ (X X)1 R P 1 [r R] [r R] P 1 R(X X)1
(X X)1 R P 1 R(X X)1 2

So, the rst term is the OLS ovarian e. The se ond term is PSD, and the third term is NSD.

If the restri tion is true, the se ond term is 0, so we are better o.

improve e ien y of estimation.

True restri tions

6.2.

63

TESTING

If the restri tion is false, we may be better or worse o, in terms of MSE, depending on
the magnitudes of

r R

and

2.

6.2 Testing
In many ases, one wishes to test e onomi theories. If theory suggests parameter restri tions,
as in the above homogeneity example, one an test theory by testing parameter restri tions.
A number of tests are available.

6.2.1 t-test
Suppose one has the model

y = X +
and one wishes to test the

single restri tion H0 :R = r

HA :R 6= r

vs.

normality of the errors,

R r N 0, R(X X)1 R 02
so

The problem is that

02

R r

R(X X)1 R 02

R r

R(X X)1 R

H0 ,

with

N (0, 1) .

is unknown. One ould use the onsistent estimator

but the test would only be valid asymptoti ally in this ase.

Proposition 4

. Under

c2

in pla e of

N (0, 1)
q 2
t(q)
(q)
q

02 ,

(6.1)

as long as the N (0, 1) and the 2 (q) are independent.


We need a few results on the

distribution.

Proposition 5 If x N (, In ) is a ve tor of n independent r.v.'s., then


x x 2 (n, )

where =

When a

2
i i

is the

(6.2)

non entrality parameter.

r.v. has the non entrality parameter equal to zero, it is referred to as a entral

2 r.v., and it's distribution is written as

2 (n),

suppressing the non entrality parameter.

Proposition 6 If the n dimensional random ve tor x N (0, V ), then x V 1 x 2 (n).


We'll prove this one as an indi ation of how the following unproven propositions ould be
proved.

64

CHAPTER 6.

Proof: Fa tor

V 1

as

P P

RESTRICTIONS AND HYPOTHESIS TESTS

(this is the Cholesky fa torization, where

y = P x.

upper triangular). Then onsider

is dened to be

We have

y N (0, P V P )
but

so

P V P = In

and thus

y N (0, In ).

V P P

= In

P V P P

= P

Thus

y y 2 (n)

but

y y = x P P x = xV 1 x
and we get the result we wanted.
A more general proposition whi h implies this result is

Proposition 7 If the n dimensional random ve tor x N (0, V ), then


x Bx 2 ((B))

(6.3)

if and only if BV is idempotent.


An immediate onsequen e is

Proposition 8 If the random ve tor (of dimension n) x N (0, I), and B is idempotent with

rank r, then

x Bx 2 (r).

(6.4)

Consider the random variable


=
02

MX
02
 
 

MX
=
0
0
2 (n K)

Proposition 9 If the random ve tor (of dimension n) x N (0, I), then Ax and x Bx are

independent if AB = 0.

Now onsider (remember that we have only one restri tion in this ase)

Rr
R(X X)1 R
q
=


(nK)02

0
c

R r

R(X X)1 R

6.2.

65

TESTING

This will have the

t(nK) distribution if and are independent.

But

= +(X X)1 X

and

(X X)1 X MX = 0,
so

0
c

R r
R r
=
t(n K)

R
R(X X)1 R

In parti ular, for the ommonly en ountered


for whi h

H0 : i = 0

vs.

H0 : i 6= 0

test of signi an e

of an individual oe ient,

, the test statisti is

i
t(n K)

Note:

the

test is stri tly valid only if the errors are a tually normally distributed.

If one has nonnormal errors, one ould use the above asymptoti result to justify taking
riti al values from the

N (0, 1)

distribution, sin e

t(n K) N (0, 1)

In pra ti e, a onservative pro edure is to take riti al values from the


if nonnormality is suspe ted.

This will reje t

H0

less often sin e the

as

n .

distribution

distribution is

fatter-tailed than is the normal.

6.2.2
The

test

test allows testing multiple restri tions jointly.

Proposition 10 If x 2 (r) and y 2 (s), then


x/r
F (r, s)
y/s

(6.5)

provided that x and y are independent.


Proposition 11 If the random ve tor (of dimension n) x N (0, I), then x Ax and x Bx are

independent if AB = 0.

Using these results, and previous results on the


the following statisti has the

F =

distribution, it is simple to show that

distribution:

1 


 
R r
R r
R (X X)1 R
q
2

F (q, n K).

A numeri ally equivalent expression is

(ESSR ESSU ) /q
F (q, n K).
ESSU /(n K)
Note:

The

test is stri tly valid only if the errors are truly normally distributed. The

following tests will be appropriate when one annot assume normally distributed errors.

66

CHAPTER 6.

RESTRICTIONS AND HYPOTHESIS TESTS

6.2.3 Wald-type tests


The Wald prin iple is based on the idea that if a restri tion is true, the unrestri ted model
should approximately satisfy the restri tion. Given that the least squares estimator is asymptoti ally normally distributed:

then under

H0 : R0 = r,

Q1
X

or

02

we have




d

n R r N 0, 02 RQ1
X R

so by Proposition [6

Note that




d
n 0 N 0, 02 Q1
X




 
d
1
r
R

n R r
R
2 (q)
02 RQ1
X
are not observable. The test statisti we use substitutes the onsistent es-

1 as the onsistent estimator of


timators. Use (X X/n)
of

n s,

Q1
X . With this, there is a an ellation

and the statisti to use is


 
 

d
c2 R(X X)1 R 1 R r
R r

2 (q)
0

The Wald test is a simple way to test restri tions without having to estimate the re-

Note that this formula is similar to one of the formulae provided for the

stri ted model.

test.

6.2.4 S ore-type tests (Rao tests, Lagrange multiplier tests)


In some ases, an unrestri ted model may be nonlinear in the parameters, but the model is
linear in the parameters under the null hypothesis. For example, the model

y = (X) +
is nonlinear in

and

but is linear in

under

H0 : = 1.

Estimation of nonlinear models

is a bit more ompli ated, so one might prefer to have a test based upon the restri ted, linear
model. The s ore test is useful in this situation.

S ore-type tests are based upon the general prin iple that the gradient ve tor of the unrestri ted model, evaluated at the restri ted estimate, should be asymptoti ally normally
distributed with mean zero, if the restri tions are true. The original development was
for ML estimation, but the prin iple is valid for a wide variety of estimation methods.

6.2.

67

TESTING

We have seen that


1 
R r
R(X X)1 R


= P 1 R r

so

Given that

nP =



n R r




d

n R r N 0, 02 RQ1
X R

under the null hypothesis, we obtain

nP N 0, 02 RQ1
X R

So

Noting that



 
d
1 1
2

nP 0 RQX R
nP 2 (q)

1
R,
lim nP = RQX

we obtain,

sin e the powers of

R(X X)1 R
02

2 (q)

an el. To get a usable test statisti substitute a onsistent estimator of

02 .

This makes it lear why the test is sometimes referred to as a Lagrange multiplier test.
It may seem that one needs the a tual Lagrange multipliers to al ulate this.

If we

impose the restri tions by substitution, these are not available. Note that the test an
be written as

(X X)1 R
02

2 (q)

However, we an use the fon for the restri ted estimator:

X y + X X R + R
to get that

= X (y X R )
R
= X R

Substituting this into the above, we get

R X(X X)1 X R d 2
(q)
02

68

CHAPTER 6.

but this is simply

RESTRICTIONS AND HYPOTHESIS TESTS

PX
d
R 2 (q).
2
0

To see why the test is also known as a s ore test, note that the fon for restri ted least squares

X y + X X R + R
give us

= X y X X R
R
and the rhs is simply the gradient (s ore) of the unrestri ted model, evaluated at the restri ted
estimator. The s ores evaluated at the unrestri ted estimate are identi ally zero. The logi
behind the s ore test is that the s ores evaluated at the restri ted estimate should be approximately zero, if the restri tion is true. The test is also known as a Rao test, sin e P. Rao rst
proposed it in 1948.

6.2.5 Likelihood ratio-type tests


The Wald test an be al ulated using the unrestri ted model. The s ore test an be al ulated
using only the restri ted model. The likelihood ratio test, on the other hand, uses both the
restri ted and the unrestri ted estimators. The test statisti is



ln L()

LR = 2 ln L()
where

is the unrestri ted estimate and

asymptoti ally

2 ,

is the restri ted estimate.

take a se ond order Taylor's series expansion of

ln L()
+
ln L()
(note, the rst order term drops out sin e

the se ond-order term by

sin e

H()

To show that it is

ln L()

about



n  

H()
2

0
D ln L()

by the fon and we need to multiply

is dened in terms of

1
n

ln L())

so






LR n H()
As

H (0 ) = I(0 ),
n , H()

by the information matrix equality. So





a
LR = n I (0 )

?? that

We also have that, from [



a
n 0 = I (0 )1 n1/2 g(0 ).

6.3.

THE ASYMPTOTIC EQUIVALENCE OF THE LR, WALD AND SCORE TESTS

69

An analogous result for the restri ted estimator is (this is unproven here, to prove this set up
the Lagrangean for MLE subje t to

R = r,

and manipulate the rst order onditions) :




1

a
n 0 = I (0 )1 In R RI (0 )1 R
RI (0 )1 n1/2 g(0 ).

Combining the last two equations


1

a
n = n1/2 I (0 )1 R RI (0 )1 R
RI (0 )1 g(0 )

??

so, substituting into [

But sin e

i
h
i
1 h
a
LR = n1/2 g(0 ) I (0 )1 R RI (0 )1 R
RI (0 )1 n1/2 g(0 )
d

n1/2 g(0 ) N (0, I (0 ))


the linear fun tion

RI (0 )1 n1/2 g(0 ) N (0, RI (0 )1 R ).


We an see that LR is a quadrati form of this rv, with the inverse of its varian e in the
middle, so

LR 2 (q).

6.3 The asymptoti equivalen e of the LR, Wald and s ore tests
We have seen that the three tests all onverge to
to the

same

2 random variables.

In fa t, they all onverge

2 rv, under the null hypothesis. We'll show that the Wald and LR tests are

asymptoti ally equivalent. We have seen that the Wald test is asymptoti ally equivalent to

Using




 
a
d
1
r
W = n R r
R

R
02 RQ1
2 (q)
X
0 = (X X)1 X

and

R r = R( 0 )
we get

nR( 0 ) =

nR(X X)1 X
 1
XX
n1/2 X
= R
n

70

CHAPTER 6.

RESTRICTIONS AND HYPOTHESIS TESTS

?? to get

Substitute this into [

=
a

=
where

PR

1

RQ1
X X
1
a
= X(X X)1 R 02 R(X X)1 R
R(X X)1 X
1

2
= n1 XQ1
X R 0 RQX R

A(A A)1 A
02
PR
02

is the proje tion matrix formed by the matrix

Note that this matrix is idempotent and has


rank

X(X X)1 R .

olumns, so the proje tion matrix has

q.

Now onsider the likelihood ratio statisti

LR = n1/2 g(0 ) I(0 )1 R RI(0 )1 R

1

RI(0 )1 n1/2 g(0 )

Under normality, we have seen that the likelihood fun tion is

1 (y X) (y X)
ln L(, ) = n ln 2 n ln
.
2
2
Using this,

1
ln L(, )
n
X (y X0 )
n 2

X
n 2

g(0 ) D
=
=
Also, by the information matrix equality:

I(0 ) = H (0 )
= lim D g(0 )
X (y X0 )
= lim D
n 2

XX
= lim
n 2
QX
=
2
so

I(0 )1 = 2 Q1
X

6.3.

THE ASYMPTOTIC EQUIVALENCE OF THE LR, WALD AND SCORE TESTS

71

??, we get

Substituting these last expressions into [

LR = X (X X)1 R 02 R(X X)1 R


PR
a
=
02

1

R(X X)1 X

= W
This ompletes the proof that the Wald and LR tests are asymptoti ally equivalent. Similarly,
one an show that,

under the null hypothesis,


a

qF = W = LM = LR

The proof for the statisti s ex ept for

The

LR does not depend upon normality

of the errors,

as an be veried by examining the expressions for the statisti s.

LR

statisti

is

based upon distributional assumptions, sin e one an't write the

likelihood fun tion without them.

However, due to the lose relationship between the statisti s


normality, the

qF

statisti an be thought of as a

qF

and

pseudo-LR statisti ,

LR,

supposing

in that it's like

a LR statisti in that it uses the value of the obje tive fun tions of the restri ted and
unrestri ted models, but it doesn't require distributional assumptions.

The presentation of the s ore and Wald tests has been done in the ontext of the linear model.

This is readily generalizable to nonlinear models and/or other estimation

methods.

Though the four statisti s

are

asymptoti ally equivalent, they are numeri ally dierent in

small samples. The numeri values of the tests also depend upon how

is estimated, and

we've already seen than there are several ways to do this. For example all of the following are
onsistent for

under

H0


nk

n
R R
nk+q
R R
n
and in general the denominator all be repla ed with any quantity

su h that

lim a/n = 1.

It an be shown, for linear regression models subje t to linear restri tions, and if
used to al ulate the Wald test and

R R
is used for the s ore test, that
n

W > LR > LM.


n is

72

CHAPTER 6.

RESTRICTIONS AND HYPOTHESIS TESTS

For this reason, the Wald test will always reje t if the LR test reje ts, and in turn the LR
test reje ts if the LM test reje ts. This is a bit problemati : there is the possibility that by
areful hoi e of the statisti used, one an manipulate reported results to favor or disfavor a
hypothesis. A onservative/honest approa h would be to report all three test statisti s when
they are available. In the ase of linear models with normal errors the

test is to be preferred,

sin e asymptoti approximations are not an issue.


The small sample behavior of the tests an be quite dierent. The true size (probability of
reje tion of the null when the null is true) of the Wald test is often dramati ally higher than
the nominal size asso iated with the asymptoti distribution. Likewise, the true size of the
s ore test is often smaller than the nominal size.

6.4 Interpretation of test statisti s


Now that we have a menu of test statisti s, we need to know how to use them.

6.5 Conden e intervals


Conden e intervals for single oe ients are generated in the normal manner. Given the
statisti

t() =
a

100 (1 ) %

t()

onden e interval for

does not reje t

H0 : 0 = ,

using a

is dened by the bounds of the set of

su h that

signi an e level:

C() = { : c/2 <


The set of su h

is the interval


< c/2 }


c c/2

A onden e ellipse for two oe ients jointly would be, analogously, the set of {1 , 2 }

su h that the

(or some other test statisti ) doesn't reje t at the spe ied riti al value. This

generates an ellipse, if the estimators are orrelated.

The region is an ellipse, sin e the CI for an individual oe ient denes a (innitely
long) re tangle with total prob. mass

1 ,

sin e the other oe ient is marginalized

(e.g., an take on any value). Sin e the ellipse is bounded in both dimensions but also
ontains mass

1 ,

it must extend beyond the bounds of the individual CI.

From the pi tue we an see that:

Reje tion of hypotheses individually does not imply that the joint test will reje t.

Joint reje tion does not imply individal tests will reje t.

6.5.

CONFIDENCE INTERVALS

Figure 6.1: Joint and Individual Conden e Regions

73

74

CHAPTER 6.

RESTRICTIONS AND HYPOTHESIS TESTS

6.6 Bootstrapping
When we rely on asymptoti theory to use the normal distribution-based tests and onden e
intervals, we're often at serious risk of making important errors. If the sample size is small
and errors are highly nonnormal, the small sample distribution of



n 0

may be very

dierent than its large sample distribution. Also, the distributions of test statisti s may not
resemble their limiting distributions at all. A means of trying to gain information on the small
sample distribution of test statisti s and estimators is the

bootstrap.

We'll onsider a simple

example, just to get the main idea.


Suppose that

X0 +

IID(0, 02 )

X
Given that the distribution of
samples.

is nonsto hasti

is unknown, the distribution of

will be unknown in small

However, sin e we have random sampling, we ould generate

arti ial data.

The

steps are:
1. Draw

observations from

with repla ement.

2. Then generate the data by

Call this ve tor

(it's a

n 1).

yj = X + j

3. Now take this and estimate

j = (X X)1 X yj .
4. Save

j
J,

5. Repeat steps 1-4, until we have a large number,


With this, we an use the repli ations to al ulate the
form a 100(1-)% onden e interval for
and drop the rst and last

J/2

of

j .

empiri al distribution of j . One way to

would be to order the

from smallest to largest,

of the repli ations, and use the remaining endpoints as the

limits of the CI. Note that this will not give the shortest CI if the empiri al distribution is
skewed.

Suppose one was interested in the distribution of some fun tion of


test statisti . Simple: just al ulate the transformation for ea h

j,

for example a

and work with the

empiri al distribution of the transformation.

If the assumption of iid errors is too strong (for example if there is heteros edasti ity
or auto orrelation, see below) one an work with a bootstrap dened by sampling from

(y, x)

with repla ement.

How to hoose

J: J

should be large enough that the results don't hange with repetition

of the entire bootstrap.


in rease

and try again.

This is easy to he k.

If you nd the results hange a lot,

6.7.

75

TESTING NONLINEAR RESTRICTIONS, AND THE DELTA METHOD

The bootstrap is based fundamentally on the idea that the empiri al distribution of the
sample data onverges to the a tual sampling distribution as

be omes large, so statis-

ti s based on sampling from the empiri al distribution should onverge in distribution


to statisti s based on sampling from the a tual sampling distribution.

In nite samples, this doesn't hold.

At a minimum, the bootstrap is a good way to

he k if asymptoti theory results oer a de ent approximation to the small sample


distribution.

Bootstrapping an be used to test hypotheses.

Basi ally, use the bootstrap to get an

approximation to the empiri al distribution of the test statisti under the alternative
hypothesis, and use this to get riti al values.

Compare the test statisti al ulated

using the real data, under the null, to the bootstrap riti al values.

There are many

variations on this theme, whi h we won't go into here.

6.7 Testing nonlinear restri tions, and the delta method


Testing nonlinear restri tions of a linear model is not mu h more di ult, at least when the
model is linear. Sin e estimation subje t to nonlinear restri tions requires nonlinear estimation
methods, whi h are beyond the s ore of this ourse, we'll just onsider the Wald test for
nonlinear restri tions on a linear model.
Consider the

nonlinear restri tions

r(0 ) = 0.
where

r()

is a

q -ve tor

valued fun tion. Write the derivative of the restri tion evaluated at

as


D r() = R()

We suppose that the restri tions are not redundant in a neighborhood of

0 ,

so that

(R()) = q
in a neighborhood of

0 .

Take a rst order Taylor's series expansion of

r()

about

= r(0 ) + R( )( 0 )
r()
where

is a onvex ombination of

and

0 .

Under the null hypothesis we have

= R( )( 0 )
r()
Due to onsisten y of

we an repla e

by

=
nr()

0 ,

asymptoti ally, so

nR(0 )( 0 )

0 :

76

CHAPTER 6.

We've already seen the distribution of

RESTRICTIONS AND HYPOTHESIS TESTS

n( 0 ).

Using this we get

d
2

nr()
N 0, R(0 )Q1
X R(0 ) 0 .

Considering the quadrati form

R(0 )Q1 R(0 )


nr()
X
02

1

r()

2 (q)

under the null hypothesis. Substituting onsistent estimators for


statisti is

1

X)1 R()

R()(X
r()
r()
c2

under the null hypothesis.

0, QX

and

02 ,

the resulting

2 (q)

delta method, or as Klein's approximation.

This is known in the literature as the

Sin e this is a Wald test, it will tend to over-reje t in nite samples. The s ore and LR
tests are also possibilities, but they require estimation methods for nonlinear models,
whi h aren't in the s ope of this ourse.

Note that this also gives a onvenient way to estimate nonlinear fun tions and asso iated

r(0 ) is not hypothesized to be zero,

asymptoti onden e intervals. If the nonlinear fun tion


we just have




d
2
r(0 )
n r()
N 0, R(0 )Q1
X R(0 ) 0

so an approximation to the distribution of the fun tion of the estimator is

N (r(0 ), R(0 )(X X)1 R(0 ) 2 )


r()
0
For example, the ve tor of elasti ities of a fun tion

(x) =
where

f (x)

is

x
f (x)

x
f (x)

means element-by-element multipli ation. Suppose we estimate a linear fun tion

y = x + .
The elasti ities of

w.r.t.

are

(x) =

x
x

(note that this is the entire ve tor of elasti ities). The estimated elasti ities are

b(x) =

x
x

6.8.

77

EXAMPLE: THE NERLOVE DATA

To al ulate the estimated standard errors of all ve elasti ites, use

R() =

(x)

x1 0 0

.
0 x
.
.
2

.
..
..
.
0

0 0 xk

x.

1 x21

2 x22

.
.
.

..

(x )2

To get a onsistent estimator just substitute in


error are fun tions of

0
.
.
.

k x2k

Note that the elasti ity and the standard

The program ExampleDeltaMethod.m shows how this an be done.

In many ases, nonlinear restri tions an also involve the data, not just the parameters.
For example, onsider a model of expenditure shares. Let

is pri es and

x(p, m) be a demand fun ion, where

is in ome. An expenditure share system for

si (p, m) =

goods is

pi xi (p, m)
, i = 1, 2, ..., G.
m

Now demand must be positive, and we assume that expenditures sum to in ome, so we have
the restri tions

G
X

0 si (p, m) 1, i
si (p, m)

i=1

Suppose we postulate a linear model for the expenditure shares:

i
si (p, m) = 1i + p pi + mm
+ i
It is fairly easy to write restri tions su h that the shares sum to one, but the restri tion that
the shares lie in the

[0, 1]

interval depends on both parameters and the values of

is impossible to impose the restri tion that

0 si (p, m) 1

for all possible

and

and

m.

m.

It

In su h

ases, one might onsider whether or not a linear model is a reasonable spe i ation.

6.8 Example: the Nerlove data


Remember that we in a previous example (se tion 3.8.3) that the OLS results for the Nerlove
model are

*********************************************************
OLS estimation results
Observations 145

78

CHAPTER 6.

RESTRICTIONS AND HYPOTHESIS TESTS

R-squared 0.925955
Sigma-squared 0.153943
Results (Ordinary var- ov estimator)
onstant
output
labor
fuel
apital

estimate
-3.527
0.720
0.436
0.427
-0.220

st.err.
1.774
0.017
0.291
0.100
0.339

t-stat.
-1.987
41.244
1.499
4.249
-0.648

p-value
0.049
0.000
0.136
0.000
0.518

*********************************************************

Note that

sK = K < 0,

and that

L + F + K 6= 1.

Remember that if we have onstant returns to s ale, then


geneity of degree 1 then

L + F + K = 1.

Q = 1,

and if there is homo-

We an test these hypotheses either separately or

jointly. NerloveRestri tions.m imposes and tests CRTS and then HOD1. From it we obtain
the results that follow:

Imposing and testing HOD1


*******************************************************
Restri ted LS estimation results
Observations 145
R-squared 0.925652
Sigma-squared 0.155686

onstant
output
labor
fuel
apital

estimate
-4.691
0.721
0.593
0.414
-0.007

st.err.
0.891
0.018
0.206
0.100
0.192

t-stat.
-5.263
41.040
2.878
4.159
-0.038

p-value
0.000
0.000
0.005
0.000
0.969

*******************************************************
Value
p-value
F
0.574
0.450
Wald
0.594
0.441
LR
0.593
0.441
S ore
0.592
0.442

6.8.

79

EXAMPLE: THE NERLOVE DATA

Imposing and testing CRTS


*******************************************************
Restri ted LS estimation results
Observations 145
R-squared 0.790420
Sigma-squared 0.438861
estimate
-7.530
1.000
0.020
0.715
0.076

onstant
output
labor
fuel
apital

st.err.
2.966
0.000
0.489
0.167
0.572

t-stat.
-2.539
Inf
0.040
4.289
0.132

p-value
0.012
0.000
0.968
0.000
0.895

*******************************************************
Value
p-value
F
256.262
0.000
Wald
265.414
0.000
LR
150.863
0.000
S ore
93.771
0.000

Noti e that the input pri e oe ients in fa t sum to 1 when HOD1 is imposed. HOD1 is

e.g., = 0.10).

not reje ted at usual signi an e levels (

Also,

R2

does not drop mu h when

the restri tion is imposed, ompared to the unrestri ted results. For CRTS, you should note

Q = 1 is
2
reje ted by the test statisti s at all reasonable signi an e levels. Note that R drops quite a
that

Q = 1,

so the restri tion is satised.

Also note that the hypothesis that

bit when imposing CRTS. If you look at the unrestri ted estimation results, you an see that
a t-test for

Q = 1

also reje ts, and that a onden e interval for

does not overlap 1.

From the point of view of neo lassi al e onomi theory, these results are not anomalous:
HOD1 is an impli ation of the theory, but CRTS is not.

Exer ise 12 Modify the NerloveRestri tions.m program to impose and test the restri tions

jointly.

The Chow test

Sin e CRTS is reje ted, let's examine the possibilities more arefully. Re all

that the data is sorted by output (the third olumn). Dene 5 subsamples of rms, with the
rst group being the 29 rms with the lowest output levels, then the next 29 rms, et . The
ve subsamples an be indexed by

t = 30, 31, ...58,

j = 1, 2, ..., 5,

where

j = 1

for

t = 1, 2, ...29, j = 2

for

et . Dene a pie ewise linear model

ln Ct = 1j + 2j ln Qt + 3j ln PLt + 4j ln PF t + 5j ln PKt + t

(6.6)

80

CHAPTER 6.

where

RESTRICTIONS AND HYPOTHESIS TESTS

is a supers ript (not a power) that ini ates that the oe ients may be dierent

a ording to the subsample in whi h the observation falls.


upon

whi h in turn depends upon

That is, the oe ients depend

t. Note that the rst olumn of nerlove.data

indi ates this

way of breaking up the sample. The new model may be written as

y1

X1 0


y2 0


.. ..
. = .


y5
0
where

y1

j is the

is 291,

29 1

X1

is 295,

X2

X3
X4

is the

51

.
+
.

5
5

X5

ve tor of oe ient for the

j th

(6.7)

subsample, and

th subsample.
ve tor of errors for the j

The O tave program Restri tions/ChowTest.m estimates the above model. It also tests the
hypothesis that the ve subsamples share the same parameter ve tor, or in other words, that
there is oe ient stability a ross the ve subsamples. The null to test is that the parameter
ve tors for the separate groups are all the same, that is,

1 = 2 = 3 = 4 = 5
This type of test, that parameters are onstant a ross dierent sets of data, is sometimes
referred to as a

Chow test.

There are 20 restri tions. If that's not lear to you, look at the O tave program.

The restri tions are reje ted at all onventional signi an e levels.

Sin e the restri tions are reje ted, we should probably use the unrestri ted model for analysis.
What is the pattern of RTS as a fun tion of the output group (small to large)? Figure 6.2 plots
RTS. We an see that there is in reasing RTS for small rms, but that RTS is approximately
onstant for large rms.

6.9 Exer ises


1. Using the Chow test on the Nerlove model, we reje t that there is oe ient stability
a ross the 5 groups. But perhaps we ould restri t the input pri e oe ients to be the
same but let the onstant and output oe ients vary by group size. This new model is

ln Ci = 1j + 2j ln Qi + 3 ln PLi + 4 ln PF i + 5 ln PKi + i
(a) estimate this model by OLS, giving

(6.8)

R2 , estimated standard errors for oe ients, t-

statisti s for tests of signi an e, and the asso iated p-values. Interpret the results
in detail.

6.9.

81

EXERCISES

Figure 6.2: RTS as a fun tion of rm size


2.6
RTS
2.4

2.2

1.8

1.6

1.4

1.2

1
1

1.5

2.5

3.5

4.5

(b) Test the restri tions implied by this model (relative to the model that lets all
oe ients vary a ross groups) using the F, qF, Wald, s ore and likelihood ratio
tests. Comment on the results.
( ) Estimate this model but imposing the HOD1 restri tion,

using an OLS

estimation

program. Don't use m _olsr or any other restri ted OLS estimation program. Give
estimated standard errors for all oe ients.
(d) Plot the estimated RTS parameters as a fun tion of rm size. Compare the plot to
that given in the notes for the unrestri ted model. Comment on the results.
2. For the simple Nerlove model, estimated returns to s ale is

[
RT
S =

1
cq .

Apply the

delta method to al ulate the estimated standard error for estimated RTS. Dire tly test

H0 : RT S = 1 versus HA : RT S 6= 1 rather than testing H0 : Q = 1 versus HA : Q 6= 1.

Comment on the results.

3. Perform a Monte Carlo study that generates data from the model

y = 2 + 1x2 + 1x3 +
where the sample size is 30,
and

x2

and

x3

are independently uniformly distributed on

[0, 1]

IIN (0, 1)

(a) Compare the means and standard errors of the estimated oe ients using OLS
and restri ted OLS, imposing the restri tion that

2 + 3 = 2.

(b) Compare the means and standard errors of the estimated oe ients using OLS
and restri ted OLS, imposing the restri tion that

2 + 3 = 1.

82

CHAPTER 6.

( ) Dis uss the results.

RESTRICTIONS AND HYPOTHESIS TESTS

Chapter 7
Generalized least squares
One of the assumptions we've made up to now is that

t IID(0, 2 ),
or o asionally

t IIN (0, 2 ).
Now we'll investigate the onsequen es of nonidenti ally and/or dependently distributed errors.
We'll assume xed regressors for now, relaxing this admittedly unrealisti assumption later.
The model is

y = X +
E() = 0
V () =
where

is a general symmetri positive denite matrix (we'll write

in pla e of

to simplify

the typing of these notes).

The ase where

The ase where

is a diagonal matrix gives un orrelated, nonidenti ally distributed

errors. This is known as

heteros edasti ity.

has the same number on the main diagonal but nonzero elements

o the main diagonal gives identi ally (assuming higher moments are also the same)
dependently distributed errors. This is known as

auto orrelation.

The general ase ombines heteros edasti ity and auto orrelation.

This is known as

nonspheri al disturban es, though why this term is used, I have no idea. Perhaps it's
be ause under the lassi al assumptions, a joint onden e region for
dimensional hypersphere.

83

would be an

84

CHAPTER 7.

GENERALIZED LEAST SQUARES

7.1 Ee ts of nonspheri al disturban es on the OLS estimator


The least square estimator is

= (X X)1 X y
= + (X X)1 X

We have unbiasedness, as before.

The varian e of

is

i
h


E ( )( ) = E (X X)1 X X(X X)1
= (X X)1 X X(X X)1

(7.1)

is invalid, sin e there

Due to this, any test statisti that is based upon an estimator of

isn't

any

for the

If

t,

2 ,

it doesn't exist as a feature of the true d.g.p. In parti ular, the formulas

F, 2 based tests given above do not lead to statisti s with these distributions.

is still onsistent, following exa tly the same argument given before.

is normally distributed, then

N , (X X)1 X X(X X)1


The problem is that

is unknown in general, so this distribution won't be useful for

testing hypotheses.

Without normality, and un onditional on

we still have


n
=
n(X X)1 X
 1
XX
=
n1/2 X
n
Dene the limiting varian e of

n1/2 X
lim E

so we obtain

Summary:

(supposing a CLT applies) as

X X
n




d
1
n N 0, Q1
X QX

OLS with heteros edasti ity and/or auto orrelation is:

unbiased in the same ir umstan es in whi h the estimator is unbiased with iid errors

has a dierent varian e than before, so the previous test statisti s aren't valid

is onsistent

7.2.

85

THE GLS ESTIMATOR

is asymptoti ally normally distributed, but with a dierent limiting ovarian e matrix.
Previous test statisti s aren't valid in this ase for this reason.
is ine ient, as is shown below.

7.2 The GLS estimator


Suppose

were known. Then one ould form the Cholesky de omposition

P P = 1
Here,

is an upper triangular matrix. We have

P P = In
so

P P P = P ,
whi h implies that

P P = In
Consider the model

P y = P X + P ,
or, making the obvious denitions,

y = X + .
This varian e of

= P

is

E(P P ) = P P
= In
Therefore, the model

y = X +
E( ) = 0

V ( ) = In
satises the lassi al assumptions.

The GLS estimator is simply OLS applied to the trans-

formed model:

GLS

= (X X )1 X y
= (X P P X)1 X P P y
= (X 1 X)1 X 1 y

86

CHAPTER 7.

GENERALIZED LEAST SQUARES

The GLS estimator is unbiased in the same ir umstan es under whi h the OLS estimator
is unbiased. For example, assuming

is nonsto hasti



E(GLS ) = E (X 1 X)1 X 1 y


= E (X 1 X)1 X 1 (X +
= .

The varian e of the estimator, onditional on

GLS

an be al ulated using

= (X X )1 X y
= (X X )1 X (X + )
= + (X X )1 X

so



 


GLS GLS
E
= E (X X )1 X X (X X )1
= (X X )1 X X (X X )1
= (X X )1
= (X 1 X)1
Either of these last formulas an be used.

All the previous results regarding the desirable properties of the least squares estimator
hold, when dealing with the transformed model, sin e the transformed model satises
the lassi al assumptions..

Tests are valid, using the previous formulas, as long as we substitute


Furthermore, any test that involves

an set it to

1.

in pla e of

X.

This is preferable to re-deriving

the appropriate formulas.

The GLS estimator is more e ient than the OLS estimator. This is a onsequen e of
the Gauss-Markov theorem, sin e the GLS estimator is based on a model that satises
the lassi al assumptions but the OLS estimator is not. To see this dire tly, not that

V ar(GLS ) = (X X)1 X X(X X)1 (X 1 X)1


V ar()

= AA
where

i
h
A = (X X)1 X (X 1 X)1 X 1 .

true, as you an verify for yourself.

This may not seem obvious, but it is

Then noting that

positive denite matrix, we on lude that

AA

AA

is a quadrati form in a

is positive semi-denite, and that GLS

is e ient relative to OLS.

As one an verify by al ulating fon , the GLS estimator is the solution to the mini-

7.3.

87

FEASIBLE GLS

mization problem

GLS = arg min(y X) 1 (y X)


so the

metri 1

is used to weight the residuals.

7.3 Feasible GLS


The problem is that

isn't known usually, so this estimator isn't available.

Consider the dimension of

The number of parameters to estimate is larger than

The

: it's an

unique elements.

nn

matrix with



n2 n /2 + n = n2 + n /2
and in reases faster than

n.

There's no way to devise an estimator that satises a LLN without adding restri tions.

feasible GLS estimator

form of

is based upon making su ient assumptions regarding the

so that a onsistent estimator an be devised.

Suppose that we

parameterize as a fun tion of X

and

where

may in lude

as well as

other parameters, so that

= (X, )
where

is of xed dimension. If we an onsistently estimate

as long as

If we repla e

(X, )

is a ontinuous fun tion of

we an onsistently estimate

(by the Slutsky theorem). In this ase,

p

b = (X, )

(X, )

in the formulas for the GLS estimator with

b
,

we obtain the FGLS estimator.

The FGLS estimator shares the same asymptoti properties as GLS. These are
1. Consisten y
2. Asymptoti normality
3. Asymptoti e ien y

if

the errors are normally distributed. (Cramer-Rao).

4. Test pro edures are asymptoti ally valid.

In pra ti e, the usual way to pro eed is


1. Dene a onsistent estimator of
parameterization
2. Form

().

This is a ase-by- ase proposition, depending on the

We'll see examples below.

b = (X, )

3. Cal ulate the Cholesky fa torization


4. Transform the model using

1 ).
Pb = Chol(

P y = P X + P
5. Estimate using OLS on the transformed model.

88

CHAPTER 7.

GENERALIZED LEAST SQUARES

7.4 Heteros edasti ity


Heteros edasti ity is the ase where

E( ) =
is a diagonal matrix, so that the errors are un orrelated, but have dierent varian es. Heteros edasti ity is usually thought of as asso iated with ross se tional data, though there is
absolutely no reason why time series data annot also be heteros edasti . A tually, the popular ARCH (autoregressive onditionally heteros edasti ) models expli itly assume that a time
series is heteros edasti .
Consider a supply fun tion

qi = 1 + p Pi + s Si + i
where

Pi

is pri e and

Si

is some measure of size of the

ith

rm.

One might suppose that

unobservable fa tors (e.g., talent of managers, degree of oordination between produ tion
units,

et .)

i .

a ount for the error term

rms than for small rms, then

If there is more variability in these fa tors for large

may have a higher varian e when

Si

is high than when it is

low.
Another example, individual demand.

qi = 1 + p Pi + m Mi + i
where

is pri e and

is in ome. In this ase,

an ree t variations in preferen es. There

are more possibilities for expression of preferen es when one is ri h, so it is possible that the
varian e of

ould be higher when

is high.

Add example of group means.

7.4.1 OLS with heteros edasti onsistent var ov estimation


Ei ker (1967) and White (1980) showed how to modify test statisti s to a ount for heteros edasti ity of unknown form. The OLS estimator has asymptoti distribution




d
1
n N 0, Q1
X QX

as we've already seen. Re all that we dened

lim E

n
This matrix has dimension

K K

X X
n

and an be onsistently estimated, even if we an't estimate

onsistently. The onsistent estimator, under heteros edasti ity but no auto orrelation is

X
b= 1
xt xt 2t

n t=1

7.4.

89

HETEROSCEDASTICITY

One an then modify the previous test statisti s to obtain tests that are valid when there is
heteros edasti ity of unknown form. For example, the Wald test for
be



n R r

X X
n

1

X X
n

1

!1

H0 : R r = 0

would



a
R r 2 (q)

7.4.2 Dete tion


There exist many tests for the presen e of heteros edasti ity. We'll dis uss three methods.

Goldfeld-Quandt

The sample is divided in to three parts, with

n1 , n2

n1 + n2 + n3 = n. The model is estimated using the rst and


1 and 3 will be independent. Then we have
separately, so that

where

n3

and

observations,

third parts of the sample,

1 M 1 1 d 2
1 1
(n1 K)
=
2
2
and

3 M 3 3 d 2
3 3
=
(n3 K)
2
2
so

1 1 /(n1 K) d
F (n1 K, n3 K).
3 3 /(n3 K)

The distributional result is exa t if the errors are normally distributed. This test is a two-tailed
test. Alternatively, and probably more onventionally, if one has prior ideas about the possible
magnitudes of the varian es of the observations, one ould order the observations a ordingly,
from largest to smallest. In this ase, one would use a onventional one-tailed F-test.

pi ture.

Draw

Ordering the observations is an important step if the test is to have any power.

The motive for dropping the middle observations is to in rease the dieren e between the
average varian e in the subsamples, supposing that there exists heteros edasti ity. This
an in rease the power of the test. On the other hand, dropping too many observations
will substantially in rease the varian e of the statisti s

1 1

and

3 3 .

A rule of thumb,

based on Monte Carlo experiments is to drop around 25% of the observations.

If one doesn't have any ideas about the form of the het. the test will probably have low
power sin e a sensible data ordering isn't available.

White's test

When one has little idea if there exists heteros edasti ity, and no idea of its

potential form, the White test is a possibility. The idea is that if there is homos edasti ity,
then

E(2t |xt ) = 2 , t
so that

xt

or fun tions of

xt

shouldn't help to explain

E(2t ).

The test works as follows:

90

CHAPTER 7.

1. Sin e

GENERALIZED LEAST SQUARES

isn't available, use the onsistent estimator

instead.

2. Regress

2t = 2 + zt + vt
where

zt

is a

P -ve tor. zt

may in lude some or all of the variables in

variables. White's original suggestion was to use


and ross produ ts of variables in

= 0.

3. Test the hypothesis that

ESSR = T SSU ,

The

qF

R2

plus the set of all unique squares

statisti in this ase is

P (ESSR ESSU ) /P
ESSU / (n P 1)

so dividing both numerator and denominator by this we get

qF = (n P 1)
Note that this is the

as well as other

xt .

qF =
Note that

xt ,

xt ,

R2
1 R2

or the arti ial regression used to test for heteros edasti ity,

2
not the R of the original model.
An asymptoti ally equivalent statisti , under the null of no heteros edasti ity (so that

R2

should tend to zero), is

nR2 2 (P ).
This doesn't require normality of the errors, though it does assume that the fourth moment
of

is onstant, under the null.

Question:

why is this ne essary?

The White test has the disadvantage that it may not be very powerful unless the

zt ve tor

It also has the problem that spe i ation errors other than heteros edasti ity may lead

Note: the null hypothesis of this test may be interpreted as

is hosen well, and this is hard to do without knowledge of the form of heteros edasti ity.

to reje tion.

(2t )

= h( +

zt ), where

h()

= 0 for the varian e model

is an arbitrary fun tion of unknown form. The test is

more general than is may appear from the regression that is used.

Plotting the residuals


squares).

A very simple method is to simply plot the residuals (or their

Draw pi tures here.

Like the Goldfeld-Quandt test, this will be more informative if

the observations are ordered a ording to the suspe ted form of the heteros edasti ity.

7.4.3 Corre tion


Corre ting for heteros edasti ity requires that a parametri form for
that a means for estimating

onsistently be determined.

()

be supplied, and

The estimation method will be

7.4.

91

HETEROSCEDASTICITY

spe i to the for supplied for

().

We'll onsider two examples. Before this, let's onsider

the general nature of GLS when there is heteros edasti ity.


Multipli ative heteros edasti ity
Suppose the model is

yt = xt + t
t2 = E(2t ) = zt
but the other lassi al assumptions hold. In this ase

2t = zt
and

vt

+ vt

has mean zero. Nonlinear least squares ould be used to estimate

and

onsistently,

2t in pla e of
were t observable. The solution is to substitute the squared OLS residuals
we an estimate 2
2t , sin e it is onsistent by the Slutsky theorem. On e we have and ,
t
onsistently using

t2 = zt

t2 .

In the se ond step, we transform the model by dividing by the standard deviation:

x
t
yt
= t +

t
or

yt = x
t + t .
Asymptoti ally, this model satises the lassi al assumptions.

This model is a bit omplex in that NLS is required to estimate the model of the varian e.
A simpler version would be

yt

xt + t

t2 = E(2t ) = 2 zt
where

zt

is a single variable. There are still two parameters to be estimated, and the

model of the varian e is still nonlinear in the parameters. However, the

sear h method

an be used in this ase to redu e the estimation problem to repeated appli ations of
OLS.

First, we dene an interval of reasonable values for

Partition this interval into

For ea h of these values, al ulate the variable

The regression

e.g.,

equally spa ed values, e.g.,

ztm .

2t = 2 ztm + vt

[0, 3].

{0, .1, .2, ..., 2.9, 3}.

92

CHAPTER 7.

is linear in the parameters, onditional on

m ,

GENERALIZED LEAST SQUARES

so one an estimate

by OLS.

Save the pairs (m , m ), and the orresponding

Next, divide the model by the estimated standard deviations.

Can rene.

Works well when the parameter to be sear hed over is low dimensional, as in this ase.

mum

ESSm

ESSm .

Choose the pair with the mini-

as the estimate.

Draw pi ture.

Groupwise heteros edasti ity


A ommon ase is where we have repeated observations on ea h of a number of e onomi
agents: e.g., 10 years of ma roe onomi data on ea h of a set of ountries or regions, or daily
observations of transa tions of 200 banks.

series model.

This sort of data is a

pooled ross-se tion time-

It may be reasonable to presume that the varian e is onstant over time within

the ross-se tional units, but that it diers a ross them (e.g., rms or ountries of dierent
sizes...). The model is

yit = xit + it
E(2it ) = i2 , t
where

i = 1, 2, ..., G

are the agents, and

t = 1, 2, ..., n

are the observations on ea h agent.

The other lassi al assumptions are presumed to hold.

In this ase, the varian e

In this model, we assume that

i2 is spe i to ea h agent, but onstant over the n observations

for that agent.

later.

E(it is ) = 0.

This is a strong assumption that we'll relax

To orre t for heteros edasti ity, just estimate ea h

i2

using the natural estimator:

i2 =

1X 2
it
n
t=1

1/n

Note that we use

With ea h of these, transform the model as usual:

nK

here sin e it's possible that there are more than

regressors, so

ould be negative. Asymptoti ally the dieren e is unimportant.

yit
x
it
= it +

i
Do this for ea h ross-se tional group.
assumptions, asymptoti ally.

This transformed model satises the lassi al

7.4.

93

HETEROSCEDASTICITY

Figure 7.1: Residuals, Nerlove model, sorted by rm size


Regression residuals
1.5
Residuals

0.5

-0.5

-1

-1.5
0

20

40

60

80

100

120

140

160

7.4.4 Example: the Nerlove model (again!)


Let's he k the Nerlove data for eviden e of heteros edasti ity. In what follows, we're going
to use the model with the onstant and output oe ient varying a ross 5 groups, but with
the input pri e oe ients xed (see Equation 6.8 for the rationale behind this). Figure 7.1,
whi h is generated by the O tave program GLS/NerloveResiduals.m plots the residuals. We
an see pretty learly that the error varian e is larger for small rms than for larger rms.
Now let's try out some tests to formally he k for heteros edasti ity. The O tave program
GLS/HetTests.m performs the White and Goldfeld-Quandt tests, using the above model. The
results are

Value
p-value
White's test
61.903
0.000
Value
p-value
GQ test
10.886
0.000
All in all, it is very lear that the data are heteros edasti . That means that OLS estimation
is not e ient, and tests of restri tions that ignore heteros edasti ity are not valid.

The

previous tests (CRTS, HOD1 and the Chow test) were al ulated assuming homos edasti ity.
The O tave program GLS/NerloveRestri tions-Het.m uses the Wald test to he k for CRTS
and HOD1, but using a heteros edasti - onsistent ovarian e estimator.
1

The results are

By the way, noti e that GLS/NerloveResiduals.m and GLS/HetTests.m use the restri ted LS estimator

dire tly to restri t the fully general model with all oe ients varying to the model with only the onstant
and the output oe ient varying. But GLS/NerloveRestri tions-Het.m estimates the model by substituting

94

CHAPTER 7.

GENERALIZED LEAST SQUARES

Testing HOD1
Wald test

Value
6.161

p-value
0.013

Value
20.169

p-value
0.001

Testing CRTS
Wald test

We see that the previous on lusions are altered - both CRTS is and HOD1 are reje ted at
the 5% level. Maybe the reje tion of HOD1 is due to to Wald test's tenden y to over-reje t?
From the previous plot, it seems that the varian e of

is a de reasing fun tion of output.

Suppose that the 5 size groups have dierent error varian es (heteros edasti ity by groups):

V ar(i ) = j2 ,
where

j = 1

if

i = 1, 2, ..., 29,

et .,

as before.

The O tave program GLS/NerloveGLS.m

estimates the model using GLS (through a transformation of the model so that OLS an be
applied). The estimation results are

*********************************************************
OLS estimation results
Observations 145
R-squared 0.958822
Sigma-squared 0.090800
Results (Het. onsistent var- ov estimator)

onstant1
onstant2
onstant3
onstant4
onstant5
output1
output2
output3
output4
output5
labor
fuel

estimate
-1.046
-1.977
-3.616
-4.052
-5.308
0.391
0.649
0.897
0.962
1.101
0.007
0.498

st.err.
1.276
1.364
1.656
1.462
1.586
0.090
0.090
0.134
0.112
0.090
0.208
0.081

t-stat.
-0.820
-1.450
-2.184
-2.771
-3.346
4.363
7.184
6.688
8.612
12.237
0.032
6.149

p-value
0.414
0.149
0.031
0.006
0.001
0.000
0.000
0.000
0.000
0.000
0.975
0.000

the restri tions into the model. The methods are equivalent, but the se ond is more onvenient and easier to
understand.

7.4.

95

HETEROSCEDASTICITY

apital

-0.460

0.253

-1.818

0.071

*********************************************************
*********************************************************
OLS estimation results
Observations 145
R-squared 0.987429
Sigma-squared 1.092393
Results (Het. onsistent var- ov estimator)
estimate
-1.580
-2.497
-4.108
-4.494
-5.765
0.392
0.648
0.892
0.951
1.093
0.103
0.492
-0.366

onstant1
onstant2
onstant3
onstant4
onstant5
output1
output2
output3
output4
output5
labor
fuel
apital

st.err.
0.917
0.988
1.327
1.180
1.274
0.090
0.094
0.138
0.109
0.086
0.141
0.044
0.165

t-stat.
-1.723
-2.528
-3.097
-3.808
-4.525
4.346
6.917
6.474
8.755
12.684
0.733
11.294
-2.217

p-value
0.087
0.013
0.002
0.000
0.000
0.000
0.000
0.000
0.000
0.000
0.465
0.000
0.028

*********************************************************
Testing HOD1
Value
9.312

Wald test

p-value
0.002

The rst panel of output are the OLS estimation results, whi h are used to onsistently
estimate the

The

j2 .

R2

The se ond panel of results are the GLS estimation results. Some omments:

measures are not omparable - the dependent variables are not the same. The

measure for the GLS results uses the transformed dependent variable. One ould al ulate a omparable

R2

measure, but I have not done so.

The dieren es in estimated standard errors (smaller in general for GLS)

an

be in-

terpreted as eviden e of improved e ien y of GLS, sin e the OLS standard errors are

96

CHAPTER 7.

al ulated using the Huber-White estimator.

GENERALIZED LEAST SQUARES

They would not be omparable if the

ordinary (in onsistent) estimator had been used.

Note that the previously noted pattern in the output oe ients persists. The non onstant CRTS result is robust.

The oe ient on apital is now negative and signi ant at the 3% level. That seems to

Note that HOD1 is now reje ted.

indi ate some kind of problem with the model or the data, or e onomi theory.

Problem of Wald test over-reje ting?

Spe i ation

error in model?

7.5 Auto orrelation


Auto orrelation, whi h is the serial orrelation of the error term, is a problem that is usually
asso iated with time series data, but also an ae t ross-se tional data. For example, a sho k
to oil pri es will simultaneously ae t all ountries, so one ould expe t ontemporaneous
orrelation of ma roe onomi variables a ross ountries.

7.5.1 Causes
Auto orrelation is the existen e of orrelation a ross the error term:

E(t s ) 6= 0, t 6= s.
Why might this o ur? Plausible explanations in lude

1. Lags in adjustment to sho ks. In a model su h as

yt = xt + t ,
one ould interpret
of observations.
equilibrium.

xt

as the equilibrium value. Suppose

One an interpret

xt

is onstant over a number

as a sho k that moves the system away from

If the time needed to return to equilibrium is long with respe t to the

observation frequen y, one ould expe t

t+1

to be positive, onditional on

positive,

whi h indu es a orrelation.

2. Unobserved fa tors that are orrelated over time.


to orrespond to unobservable fa tors.

The error term is often assumed

If these fa tors are orrelated, there will be

auto orrelation.

3. Misspe i ation of the model. Suppose that the DGP is

yt = 0 + 1 xt + 2 x2t + t

7.5.

97

AUTOCORRELATION

Figure 7.2: Auto orrelation indu ed by misspe i ation

but we estimate

yt = 0 + 1 xt + t
The ee ts are illustrated in Figure 7.2.

7.5.2 Ee ts on the OLS estimator


The varian e of the OLS estimator is the same as in the ase of heteros edasti ity - the standard
formula does not apply. The orre t formula is given in equation 7.1. Next we dis uss two
GLS orre tions for OLS. These will potentially indu e in onsisten y when the regressors are
nonsto hasti (see Chapter 8) and should either not be used in that ase (whi h is usually the
relevant ase) or used with aution. The more re ommended pro edure is dis ussed in se tion
7.5.5.

98

CHAPTER 7.

GENERALIZED LEAST SQUARES

7.5.3 AR(1)
There are many types of auto orrelation. We'll onsider two examples. The rst is the most
ommonly en ountered ase: autoregressive order 1 (AR(1) errors. The model is

yt = xt + t
t = t1 + ut
ut iid(0, u2 )
E(t us ) = 0, t < s
We assume that the model satises the other lassi al assumptions.

We need a stationarity assumption:

By re ursive substitution we obtain

|| < 1.

Otherwise the varian e of

explodes as

in reases, so standard asymptoti s will not apply.

t = t1 + ut
= (t2 + ut1 ) + ut
= 2 t2 + ut1 + ut
= 2 (t3 + ut2 ) + ut1 + ut
In the limit the lagged

drops out, sin e

t =

m 0

as

m ,

so we obtain

m utm

m=0
With this, the varian e of

is found as

E(2t )

If we had dire tly assumed that

u2

u2
1 2

2m

m=0

were ovarian e stationary, we ould obtain this using

V (t ) = 2 E(2t1 ) + 2E(t1 ut ) + E(u2t )


= 2 V (t ) + u2 ,

so

V (t ) =
0th

The varian e is the

Note that the varian e does not depend on

u2
1 2

order auto ovarian e:

0 = V (t )

7.5.

99

AUTOCORRELATION

Likewise, the rst order auto ovarian e

is

Cov(t , t1 ) = s = E((t1 + ut ) t1 )
=

V (t )
u2
1 2

s<t

Using the same method, we nd that for

Cov(t , ts ) = s =

The

The auto ovarian es don't depend on

t:

s u2
1 2

the pro ess

{t }

is

ovarian e stationary

orrelation ( in general, for r.v.'s x and y ) is dened as


orr(x, y)

ov(x, y)
se(x)se(y)

but in this ase, the two standard errors are the same, so the

s-order

auto orrelation

is

s = s

All this means that the overall matrix

1 2

| {z }

this is the varian e


u2

has the form

.
.
.

n1

..

{z

n1

n2

.
.

..

.

1
}

this is the orrelation matrix

So we have homos edasti ity, but elements o the main diagonal are not zero.
this depends only on two parameters,

we an apply FGLS.

It turns out that it's easy to estimate these onsistently. The steps are
1. Estimate the model

yt = xt + t

by OLS.

2. Take the residuals, and estimate the model

t =
t1 + ut
Sin e

t t ,

All of

2
and u . If we an estimate these onsistently,

this regression is asymptoti ally equivalent to the regression

t = t1 + ut

100

CHAPTER 7.

whi h satises the lassi al assumptions.

t =

GENERALIZED LEAST SQUARES

Therefore,

t1 + ut is onsistent. Also, sin e ut

ut ,

obtained by applying OLS to

the estimator

u2 =

1X 2 p 2
(
ut ) u
n
t=2

3. With the onsistent estimators


of

u2

and

= (
, form
u2 , ) using the previous stru ture

and estimate by FGLS. A tually, one an omit the fa tor

an els out in the formula

u2 /(1 2 ),

sin e it


1
1 X
1 y).
F GLS = X
(X

One an iterate the pro ess, by taking the rst FGLS estimator of

re-estimating

2
and u , et . If one iterates to onvergen es it's equivalent to MLE (supposing normal
errors).

An asymptoti ally equivalent approa h is to simply estimate the transformed model

yt yt1 = (xt xt1 ) + ut


using

n1 observations (sin e y0 and x0 aren't available).

and Or utt.

This is the method of Co hrane

Dropping the rst observation is asymptoti ally irrelevant, but

very important in small samples.

it an be

One an re uperate the rst observation by putting

y1 = y1
x1 = x1

1 2
1 2

This somewhat odd-looking result is related to the Cholesky fa torization of

1 .

See

Davidson and Ma Kinnon, pg. 348-49 for more dis ussion. Note that the varian e of
is

u2 ,

y1

asymptoti ally, so we see that the transformed model will be homos edasti (and

nonauto orrelated, sin e the

u s

are un orrelated with the

y s,

7.5.4 MA(1)
The linear regression model with moving average order 1 errors is

yt = xt + t
t = ut + ut1
ut iid(0, u2 )
E(t us ) = 0, t < s

in dierent time periods.

7.5.

101

AUTOCORRELATION

In this ase,

i
h
V (t ) = 0 = E (ut + ut1 )2
=

u2 + 2 u2

u2 (1 + 2 )

Similarly

1 = E [(ut + ut1 ) (ut1 + ut2 )]


= u2

and

2 = [(ut + ut1 ) (ut2 + ut3 )]


= 0
so in this ase

= u2

1 + 2

1 + 2

..

.
.
.

.
.
.

..

1 + 2

Note that the rst order auto orrelation is

1 =

This a hieves a maximum at

=1

1
0

(1 + 2 )

2
u
2 (1+2 )
u

and a minimum at

minimal auto orrelations are 1/2 and -1/2.

= 1,

and the maximal and

Therefore, series that are more strongly

auto orrelated an't be MA(1) pro esses.

Again the ovarian e matrix has a simple stru ture that depends on only two parameters. The
problem in this ase is that one an't estimate

using OLS on

t = ut + ut1
be ause the

ut

are unobservable and they an't be estimated onsistently. However, there is a

simple way to estimate the parameters.

Sin e the model is homos edasti , we an estimate

V (t ) = 2 = u2 (1 + 2 )

102

CHAPTER 7.

GENERALIZED LEAST SQUARES

using the typi al estimator:

1X 2
c2 = 2 (1
\
2) =
t

u
n
t=1

By the Slutsky theorem, we an interpret this as dening an (unidentied) estimator of


both

u2

and

e.g., use this as

X
c2 (1 + b2 ) = 1
2

u
n t=1 t

However, this isn't su ient to dene onsistent estimators of the parameters, sin e it's
unidentied.

To solve this problem, estimate the ovarian e of

and

t1

using

X
d2 = 1
d t , t1 ) =
Cov(
t t1
u
n t=2
This is a onsistent estimator, following a LLN (and given that the epsilon hats are
onsistent for the epsilons). As above, this an be interpreted as dening an unidentied
estimator:

X
c2 = 1
t t1

u
n
t=2

Now solve these two equations to obtain identied (and therefore onsistent) estimators
of both

and

u2 .

Dene the onsistent estimator

c2 )

= (,

following the form we've seen above, and transform the model using the Cholesky de omposition. The transformed model satises the lassi al assumptions asymptoti ally.

7.5.5 Asymptoti ally valid inferen es with auto orrelation of unknown form
See Hamilton Ch. 10, pp. 261-2 and 280-84.
When the form of auto orrelation is unknown, one may de ide to use the OLS estimator,
without orre tion. We've seen that this estimator has the limiting distribution

where, as before,

is




d
1
n N 0, Q1
X QX
= lim E
n

We need a onsistent estimate of

Dene

X X
n

mt = xt t


(re all that

xt

is dened as a

K 1

7.5.

103

AUTOCORRELATION

ve tor). Note that

X =

=
=

t=1
n
X

i
2

xn .

..
n

x1 x2

n
X

xt t
mt

t=1

so that

1
= lim E
n n
We assume that

n
X

mt

t=1

n
X

mt

t=1

!#

is ovarian e stationary (so that the ovarian e between

mt

and

mts

does

t).

not depend on
Dene the

mt

"

v th

mt

auto ovarian e of

as

v = E(mt mtv ).
Note that

mt

E(mt mt+v ) = v .

(show this with an example).

will be auto orrelated, sin e

In general, we expe t that:

is potentially auto orrelated:

v = E(mt mtv ) 6= 0
Note that this auto ovarian e does not depend on

ontemporaneously orrelated (

and heteros edasti (E(mit )

t,

due to ovarian e stationarity.

E(mit mjt ) 6= 0 ), sin e the regressors in xt

will in general

be orrelated (more on this later).

= i2

, whi h depends upon

), again sin e the regressors

will have dierent varian es.

While one ould estimate

parametri ally, we in general have little information upon whi h

to base a parametri spe i ation. Re ent resear h has fo used on onsistent nonparametri
estimators of

Now dene

1
n = E
n
We have (

"

n
X
t=1

mt

n
X
t=1

mt

!#

show that the following is true, by expanding sum and shifting rows to left)
n = 0 +

 n2


n1
1
1 + 1 +
2 + 2 +
n1 + n1
n
n
n

104

CHAPTER 7.

The natural, onsistent estimator of

GENERALIZED LEAST SQUARES

is

n
X
cv = 1
m
tm
tv .

n
t=v+1

where

m
t = xt t
(note: one ould put
of

would be

1/(n v)

1/n

instead of

here). So, a natural, but in onsistent, estimator








c + n 2
c + + 1
[
c1 +
c2 +
n =
c0 + n 1
[
+

n1
n1
1
2
n
n
n
n1

Xnv 
c .
c0 +
cv +
=

v
n
v=1

This estimator is in onsistent in general, sin e the number of parameters to estimate is more
than the number of observations, and in reases more rapidly than
build up as

n .

On the other hand, supposing that

q(n)

as

so information does not

tends to zero su iently rapidly as

modied estimator

where

n,

n =
c0 +

q(n) 

X
v=1


c ,
cv +

will be onsistent, provided

q(n)

tends to

grows su iently slowly.

The assumption that auto orrelations die o is reasonable in many ases. For example,
the AR(1) model with

|| < 1

has auto orrelations that die o.

nv
n an be dropped be ause it tends to one for

The term

A disadvantage of this estimator is that is may not be positive denite. This ould ause

Newey and West proposed and estimator (

in reases slowly relative to

one to al ulate a negative

v < q(n),

given that

q(n)

n.

statisti , for example!

E onometri a, 1987) that solves the problem

of possible nonpositive deniteness of the above estimator. Their estimator is

n =
c0 +

q(n) 
X
v=1



v
c .
cv +

1
v
q+1

This estimator is p.d. by onstru tion. The ondition for onsisten y is that

0.

Note that this is a very slow rate of growth for

q.

n1/4 q(n)

This estimator is nonparametri -

we've pla ed no parametri restri tions on the form of

It is an example of a

kernel

estimator.
Finally, sin e

has

as its limit,

p
n

We an now use

and

d
Q
X =

1
n X X to

onsistently estimate the limiting distribution of the OLS estimator under heteros edasti ity

7.5.

105

AUTOCORRELATION

and auto orrelation of unknown form. With this, asymptoti ally valid tests are onstru ted
in the usual way.

7.5.6 Testing for auto orrelation


Durbin-Watson test
The Durbin-Watson test is not stri tly valid in most situations where we would like to use
it. Nevertheless, it is ommonly enough en ountered so that it probably should be presented.
The Durbin-Watson test statisti is

DW

=
=

Pn

(
t t1 )2
t=2P
n
2t
t=1
Pn
t t1
2t 2
t=2
Pn 2
t
t=1

+ 2t1

The null hypothesis is that the rst order auto orrelation of the errors is zero:

1 = 0.

The alternative is of ourse

HA : 1 6= 0.

H0 :

Note that the alternative is not that

the errors are AR(1), sin e many general patterns of auto orrelation will have the rst
order auto orrelation dierent than zero. For this reason the test is useful for dete ting
auto orrelation in general.

For the same reason, one shouldn't just assume that an

AR(1) model is appropriate when the DW test reje ts the null.

Under the null, the middle term tends to zero, and the other two tend to one, so

Supposing that we had an AR(1) error pro ess with

tends to

2,

so

= 1.

In this ase the middle term

DW 0

Supposing that we had an AR(1) error pro ess with


term tends to

DW 2.

2,

so

DW 4

= 1.

In this ase the middle

These are the extremes:

The distribution of the test statisti depends on the matrix of regressors,

DW

always lies between 0 and 4.

X,

so tables

an't give exa t riti al values. The give upper and lower bounds, whi h orrespond to
the extremes that are possible. See Figure 7.3. There are means of determining exa t
riti al values onditional on

X.

Note that DW an be used to test for nonlinearity (add dis ussion).

The DW test is based upon the assumption that the matrix


samples.

is xed in repeated

This is often unreasonable in the ontext of e onomi time series, whi h is

pre isely the ontext where the test would have appli ation. It is possible to relate the
DW test to other test statisti s whi h are valid without stri t exogeneity.

Breus h-Godfrey test

106

CHAPTER 7.

GENERALIZED LEAST SQUARES

Figure 7.3: Durbin-Watson riti al values

This test uses an auxiliary regression, as does the White test for heteros edasti ity. The
regression is

t = xt + 1 t1 + 2 t2 + + P tP + vt
and the test statisti is the

nR2

statisti , just as in the White test. There are

restri tions,

2
so the test statisti is asymptoti ally distributed as a (P ).

The intuition is that the lagged errors shouldn't ontribute to explaining the urrent
error if there is no auto orrelation.

xt

is in luded as a regressor to a ount for the fa t that the

if the

are not independent even

are. This is a te hni ality that we won't go into here.

This test is valid even if the regressors are sto hasti and ontain lagged dependent

The alternative is not that the model is an AR(P), following the argument above. The

variables, so it is onsiderably more useful than the DW test for typi al time series data.

alternative is simply that some or all of the rst

auto orrelations are dierent from

zero. This is ompatible with many spe i forms of auto orrelation.

7.5.

107

AUTOCORRELATION

7.5.7 Lagged dependent variables and auto orrelation

We've seen that the OLS estimator is onsistent under auto orrelation, as long as
This will be the ase when
where

E(X ) = 0,

plim Xn = 0.

following a LLN. An important ex eption is the ase

ontains lagged y s and the errors are auto orrelated. A simple example is the ase

of a single lag of the dependent variable with AR(1) errors. The model is

yt = xt + yt1 + t
t = t1 + ut
Now we an write



E(yt1 t ) = E (xt1 + yt2 + t1 )(t1 + ut )
6= 0

sin e one of the terms is


therefore

plim

X
n

6= 0.

E(2t1 )

whi h is learly nonzero.

In this ase

Sin e

plim = + plim

E(X ) 6= 0,

and

X
n

the OLS estimator is in onsistent in this ase. One needs to estimate by instrumental variables
(IV), whi h we'll get to later.

7.5.8 Examples
Nerlove model, yet again

The Nerlove model uses ross-se tional data, so one may not

think of performing tests for auto orrelation.

However, spe i ation error an indu e auto-

orrelated errors. Consider the simple Nerlove model

ln C = 1 + 2 ln Q + 3 ln PL + 4 ln PF + 5 ln PK +
and the extended Nerlove model

ln C = 1j + 2j ln Q + 3 ln PL + 4 ln PF + 5 ln PK + .
We have seen eviden e that the extended model is preferred.
model, the simple model is misspe ied.

So if it is in fa t the proper

Let's he k if this misspe i ation might indu e

auto orrelated errors.


The O tave program GLS/NerloveAR.m estimates the simple Nerlove model, and plots
the residuals as a fun tion of

ln Q,

and it al ulates a Breus h-Godfrey test statisti .

residual plot is in Figure 7.4 , and the test results are:

Breus h-Godfrey test

Value
34.930

p-value
0.000

The

108

CHAPTER 7.

GENERALIZED LEAST SQUARES

Figure 7.4: Residuals of simple Nerlove model

Residuals
Quadratic fit to Residuals

1.5

0.5

-0.5

-1
0

Clearly, there is a problem of auto orrelated residuals.


Repeat the auto orrelation tests using the extended Nerlove model (Equation

??)

to see

the problem is solved.

Klein model

Klein's Model I is a simple ma roe onometri model. One of the equations

in the model explains onsumption (C ) as a fun tion of prots (P ), both urrent and lagged,

as well as the sum of wages in the private se tor (W ) and wages in the government se tor

(W ). Have a look at the README le for this data set. This gives the variable names and
other information.
Consider the model

Ct = 0 + 1 Pt + 2 Pt1 + 3 (Wtp + Wtg ) + 1t

10

7.5.

109

AUTOCORRELATION

The O tave program GLS/Klein.m estimates this model by OLS, plots the residuals, and
performs the Breus h-Godfrey test, using 1 lag of the residuals.

The estimation and test

results are:

*********************************************************
OLS estimation results
Observations 21
R-squared 0.981008
Sigma-squared 1.051732
Results (Ordinary var- ov estimator)

Constant
Profits
Lagged Profits
Wages

estimate
16.237
0.193
0.090
0.796

st.err.
1.303
0.091
0.091
0.040

t-stat.
12.464
2.115
0.992
19.933

p-value
0.000
0.049
0.335
0.000

*********************************************************
Value
p-value
Breus h-Godfrey test
1.539
0.215

and the residual plot is in Figure 7.5. The test does not reje t the null of nonauto orrelatetd
errors, but we should remember that we have only 21 observations, so power is likely to be
fairly low. The residual plot leads me to suspe t that there may be auto orrelation - there are
some signi ant runs below and above the x-axis. Your opinion may dier.
Sin e it seems that there

may

be auto orrelation, lets's try an AR(1) orre tion.

The

O tave program GLS/KleinAR1.m estimates the Klein onsumption equation assuming that
the errors follow the AR(1) pattern. The results, with the Breus h-Godfrey test for remaining
auto orrelation are:

*********************************************************
OLS estimation results
Observations 21
R-squared 0.967090
Sigma-squared 0.983171
Results (Ordinary var- ov estimator)
estimate

st.err.

t-stat.

p-value

110

CHAPTER 7.

GENERALIZED LEAST SQUARES

Figure 7.5: OLS residuals, Klein onsumption equation

Residuals

-1

-2

-3
0

10

15

20

25

7.6.

111

EXERCISES

Constant
Profits
Lagged Profits
Wages

16.992
0.215
0.076
0.774

1.492
0.096
0.094
0.048

11.388
2.232
0.806
16.234

0.000
0.039
0.431
0.000

*********************************************************
Value
p-value
Breus h-Godfrey test
2.129
0.345

The test is farther away from the reje tion region than before, and the residual plot is
a bit more favorable for the hypothesis of nonauto orrelated residuals, IMHO. For this
reason, it seems that the AR(1) orre tion might have improved the estimation.

Nevertheless, there has not been mu h of an ee t on the estimated oe ients nor on
their estimated standard errors. This is probably be ause the estimated AR(1) oe ient
is not very large (around 0.2)

The existen e or not of auto orrelation in this model will be important later, in the
se tion on simultaneous equations.

7.6 Exer ises


1. Comparing the varian es of the OLS and GLS estimators, I laimed that the following
holds:

V ar(GLS ) = AA
V ar()
Verify that this is true.
2. Show that the GLS estimator an be dened as

GLS = arg min(y X) 1 (y X)


3. The limiting distribution of the OLS estimator with heteros edasti ity of unknown form
is

where




d
1
n N 0, Q1
X QX ,
lim E

n
Explain why

X X
n
n

X
b= 1
xt xt 2t

n t=1

is a onsistent estimator of this matrix.

112

CHAPTER 7.

4. Dene the
as

GENERALIZED LEAST SQUARES

v th auto ovarian e of a ovarian e stationary pro ess mt , where E(mt = 0)


v = E(mt mtv ).

Show that

E(mt mt+v ) = v .

5. For the Nerlove model

ln C = 1j + 2j ln Q + 3 ln PL + 4 ln PF + 5 ln PK +
assume that

V (t |xt ) = j2 , j = 1, 2, ..., 5.

That is, the varian e depends upon whi h of

the 5 rm size groups the observation belongs to.

a) Apply White's test using the OLS residuals, to test for homos edasti ity
b) Cal ulate the FGLS estimator and interpret the estimation results.
) Test the transformed model to he k whether it appears to satisfy homos edasti ity.

Chapter 8
Sto hasti regressors
Up to now we have treated the regressors as xed, whi h is learly unrealisti . Now we will
assume they are random.
interested in an analysis

There are several ways to think of the problem.

onditional

First, if we are

on the explanatory variables, then it is irrelevant if they

are sto hasti or not, sin e onditional on the values of they regressors take on, they are
nonsto hasti , whi h is the ase already onsidered.

In ross-se tional analysis it is usually reasonable to make the analysis onditional on


the regressors.
In dynami models, where

yt may depend on yt1 , a onditional analysis is not su iently

general, sin e we may want to predi t into the future many periods out, so we need to
onsider the behavior of

and the relevant test statisti s un onditional on

X.

The model we'll deal will involve a ombination of the following assumptions

Linearity:

the model is a linear fun tion of the parameter ve tor

0 :

yt = xt 0 + t ,
or in matrix form,

where

is

n 1, X =

x1 x2

y = X0 + ,

, where xt
xn

is

Sto hasti , linearly independent regressors


X

has rank

is sto hasti

limn Pr

K 1, and 0

and

are onformable.

with probability 1

1
nX X


= QX = 1,

where

QX

is a nite positive denite matrix.

Central limit theorem


d

n1/2 X N (0, QX 02 )

Normality (Optional): |X N (0, 2 In ): is normally distributed


Strongly exogenous regressors:
E(t |X) = 0, t
113

(8.1)

114

CHAPTER 8.

STOCHASTIC REGRESSORS

Weakly exogenous regressors:


E(t |xt ) = 0, t
In both ases,

xt

is the onditional mean of

yt

(8.2)

xt : E(yt |xt ) = xt

given

8.1 Case 1
Normality of , strongly exogenous regressors
In this ase,

= 0 + (X X)1 X

E(|X)
= 0 + (X X)1 X E(|X)
= 0
and sin e this holds for all

= ,
X, E()

un onditional on

N , (X X)1 2
|X
0

If the density of

d(X),

is

onditional density by
density for

d(X)

X.


the marginal density of


and integrating over

X.

Likewise,

is obtained by multiplying the

Doing this leads to a nonnormal

in small samples.

However, onditional on

X,

the usual test statisti s have the

t, F

and

distributions.

Importantly, these distributions don't depend on X, so when marginalizing to obtain the


un onditional distribution, nothing hanges. The tests are valid in small samples.

Summary: When

is sto hasti but strongly exogenous and

1.

is unbiased

2.

is nonnormally distributed

is normally distributed:

3. The usual test statisti s have the same distribution as with nonsto hasti
4. The Gauss-Markov theorem still holds, sin e it holds onditionally on
is true for all

X,

X.
and this

X.

5. Asymptoti properties are treated in the next se tion.

8.2 Case 2

nonnormally distributed, strongly exogenous regressors


The unbiasedness of

arries through as before. However, the argument regarding test

8.3.

115

CASE 3

statisti s doesn't hold, due to nonnormality of

Still, we have

= 0 + (X X)1 X
 1
XX
X
= 0 +
n
n
Now

X X
n

1

Q1
X

by assumption, and

n1/2 X p
X

0
=
n
n
sin e the numerator onverges to a

N (0, QX 2 ) r.v.

and the denominator still goes to innity.

We have unbiasedness and the varian e disappearing, so,

the estimator is onsistent :

p
0 .
Considering the asymptoti distribution

 1



XX
X

n 0
=
n
n
n
 1
XX
n1/2 X
=
n
so



d
2
n 0 N (0, Q1
X 0 )

dire tly following the assumptions.

Asymptoti normality of the estimator still holds.

Sin e

the asymptoti results on all test statisti s only require this, all the previous asymptoti results
on test statisti s are also valid in this ase.

Summary: Under strongly exogenous regressors, with

normal or nonnormal,

has

the

properties:
1. Unbiasedness
2. Consisten y
3. Gauss-Markov theorem holds, sin e it holds in the previous ase and doesn't depend
on normality.
4. Asymptoti normality
5. Tests are asymptoti ally valid
6. Tests are not valid in small samples if the error is normally distributed

8.3 Case 3
Weakly exogenous regressors

116

CHAPTER 8.

An important lass of models are

STOCHASTIC REGRESSORS

dynami models, where lagged dependent variables have

an impa t on the urrent value. A simple version of these models that aptures the important
points is

yt = zt +

p
X

s yts + t

s=1

= xt + t
where now

xt

ontains lagged dependent variables. Clearly, even with

are not un orrelated, so one an't show unbiasedness. For example,

E(t |xt ) = 0, X

and

E(t1 xt ) 6= 0
sin e

xt

ontains

yt1

(whi h is a fun tion of

t1 )

as an element.

This fa t implies that all of the small sample properties su h as unbiasedness, GaussMarkov theorem, and small sample validity of test statisti s

do not hold

in this ase.

Re all Figure 3.7. This is a ase of weakly exogenous regressors, and we see that the
OLS estimator is biased in this ase.

Nevertheless, under the above assumptions, all asymptoti properties ontinue to hold,
using the same arguments as before.

8.4 When are the assumptions reasonable?


The two assumptions we've added are

1
nX X


= QX = 1,

1.

limn Pr

2.

n1/2 X N (0, QX 02 )

QX

nite positive denite matrix.

The most ompli ated ase is that of dynami models, sin e the other ases an be treated as
nested in this ase. There exist a number of entral limit theorems for dependent pro esses,
many of whi h are fairly te hni al.
if you're interested).

We won't enter into details (see Hamilton, Chapter 7

A main requirement for use of standard asymptoti s for a dependent

sequen e

{st } = {

1X
zt }
n t=1

to onverge in probability to a nite limit is that

zt

be

stationary, in some sense.

Strong stationarity requires that the joint distribution of the set

{zt , zt+s , ztq , ...}


not depend on

t.

8.5.

117

EXERCISES

Covarian e (weak) stationarity requires that the rst and se ond moments of this set

An example of a sequen e that doesn't satisfy this is an AR(1) pro ess with a unit root

not depend on

(a

t.

random walk):
xt = xt1 + t
t IIN (0, 2 )

One an show that the varian e of

xt

depends upon

in this ase, so it's not weakly

stationary.

The series

sin t + t

has a rst moment that depends upon t, so it's not weakly stationary

either.

Stationarity prevents the pro ess from trending o to plus or minus innity, and prevents
y li al behavior whi h would allow orrelations between far removed

zt

znd

zs

to be high.

Draw a pi ture here.

In summary, the assumptions are reasonable when the sto hasti onditioning variables
have varian es that are nite, and are not too strongly dependent. The AR(1) model
with unit root is an example of a ase where the dependen e is too strong for standard
asymptoti s to apply.

The e onometri s of nonstationary pro esses has been an a tive area of resear h in the
last two de ades. The standard asymptoti s don't apply in this ase. This isn't in the
s ope of this ourse.

8.5 Exer ises


1. Show that for two random variables

A and B, if E(A|B) = 0, then E (Af (B)) = 0.

How

is this used in the proof of the Gauss-Markov theorem?


2. Is it possible for an AR(1) model for time series data,
weak exogeneity? Strong exogeneity? Dis uss.

e.g., yt = 0 + 0.9yt1 + t

satisfy

118

CHAPTER 8.

STOCHASTIC REGRESSORS

Chapter 9
Data problems
In this se tion well onsider problems asso iated with the regressor matrix: ollinearity, missing
observation and measurement error.

9.1 Collinearity
Collinearity is the existen e of linear relationships amongst the regressors.

We an always

write

1 x1 + 2 x2 + + K xK + v = 0
where

xi

is the

ith

olumn of the regressor matrix

there exists ollinearity, the variation in

X, and v

is an

n 1 ve tor.

In the ase that

is relatively small, so that there is an approximately

exa t linear relation between the regressors.

relative and approximate are impre ise, so it's di ult to dene when ollinearilty
exists.

In the extreme, if there are exa t linear relationships (every element of

so (X X)

< K,

v equal) then (X) < K,

so X X is not invertible and the OLS estimator is not uniquely dened. For

example, if the model is

yt = 1 + 2 x2t + 3 x3t + t
x2t = 1 + 2 x3t
then we an write

yt = 1 + 2 (1 + 2 x3t ) + 3 x3t + t
= 1 + 2 1 + 2 2 x3t + 3 x3t + t
= (1 + 2 1 ) + (2 2 + 3 ) x3t
= 1 + 2 x3t + t

The

an be onsistently estimated, but sin e the

119

dene two equations in three

120

CHAPTER 9.

s, the s an't be onsistently


fon ). The

are

unidentied

DATA PROBLEMS

estimated (there are multiple values of

that solve the

in the ase of perfe t ollinearity.

Perfe t ollinearity is unusual, ex ept in the ase of an error in onstru tion of the
regressor matrix, su h as in luding the same regressor twi e.

Another ase where perfe t ollinearity may be en ountered is with models with dummy
variables, if one is not areful. Consider a model of rental pri e

(yi )

of an apartment. This

xi , as well as on the lo ation of


th
if the i
apartment is in Bar elona, Bi = 0 otherwise. Similarly,

ould depend fa tors su h as size, quality et ., olle ted in


the apartment. Let
dene

Gi , Ti

and

Li

Bi = 1

for Girona, Tarragona and Lleida. One ould use a model su h as

yi = 1 + 2 Bi + 3 Gi + 4 Ti + 5 Li + xi + i
In this model,

Bi + Gi + Ti + Li = 1, i,

so there is an exa t relationship between these

variables and the olumn of ones orresponding to the onstant.

One must either drop the

onstant, or one of the qualitative variables.

9.1.1 A brief aside on dummy variables


Introdu e a brief dis ussion of dummy variables here.

9.1.2 Ba k to ollinearity
The more ommon ase, if one doesn't make mistakes su h as these, is the existen e of inexa t
linear relationships,

i.e., orrelations between the regressors that are less than one in absolute

value, but not zero. The basi problem is that when two (or more) variables move together, it

i.e.,
With e onomi data, ollinearity is ommonly en ountered, and

is di ult to determine their separate inuen es. This is ree ted in impre ise estimates,
estimates with high varian es.

is often a severe problem.

When there is ollinearity, the minimizing point of the obje tive fun tion that denes the
OLS estimator (s(), the sum of squared errors) is relatively poorly dened. This is seen in
Figures 9.1 and 9.2.
To see the ee t of ollinearity on varian es, partition the regressor matrix as

X=
where

is the rst olumn of

x W

(note: we an inter hange the olumns of

isf we like, so

there's no loss of generality in onsidering the rst olumn). Now, the varian e of
the lassi al assumptions, is

= X X
V ()
Using the partition,

XX=

"

x x

1

x W

W x W W

under

9.1.

121

COLLINEARITY

Figure 9.1:

s()

when there is no ollinearity

6
4

60
55
50
45
40
35
30
25
20
15

2
0
-2
-4
-6
-6

-4

-2

Figure 9.2:

s()

when there is ollinearity

6
4
2
0
-2
-4
-6
-6

-4

-2

100
90
80
70
60
50
40
30
20

122

CHAPTER 9.

DATA PROBLEMS

and following a rule for partitioned inversion,

X X

where by

ESSx|W

1
x x x W (W W )1 W x
 1
 

=
x In W (W W ) 1 W x
1
= ESSx|W

1

1,1

we mean the error sum of squares obtained from the regression

x = W + v.
Sin e

R2 = 1 ESS/T SS,
we have

ESS = T SS(1 R2 )
so the varian e of the oe ient orresponding to

V (x ) =

is

2
2
T SSx (1 Rx|W
)

We see three fa tors inuen e the varian e of this oe ient. It will be high if
1.

is large

2. There is little variation in

x.

Draw a pi ture here.

3. There is a strong linear relationship between


explain the movement in

well.

and the other regressors, so that

2
In this ase, R
x|W will be lose to 1.

1, V (x ) .

2
As R
x|W

an

The last of these ases is ollinearity.


Intuitively, when there are strong linear relations between the regressors, it is di ult to
determine the separate inuen e of the regressors on the dependent variable. This an be seen
by omparing the OLS obje tive fun tion in the ase of no orrelation between regressors with
the obje tive fun tion with orrelation between the regressors. See the gures no ollin.ps (no
orrelation) and ollin.ps ( orrelation), available on the web site.

9.1.3 Dete tion of ollinearity


The best way is simply to regress ea h explanatory variable in turn on the remaining regressors.

If any of these auxiliary regressions has a high

R2 ,

there is a problem of ollinearity.

Furthermore, this pro edure identies whi h parameters are ae ted.

Sometimes, we're only interested in ertain parameters. Collinearity isn't a problem if


it doesn't ae t what we're interested in estimating.

9.1.

123

COLLINEARITY

An alternative is to examine the matrix of orrelations between the regressors. High orrelations are su ient but not ne essary for severe ollinearity.
Also indi ative of ollinearity is that the model ts well (high

R2 ), but none of the variables

is signi antly dierent from zero (e.g., their separate inuen es aren't well determined).
In summary, the arti ial regressions are the best approa h if one wants to be areful.

9.1.4 Dealing with ollinearity


More information
Collinearity is a problem of an uninformative sample.
available information being used?
that have been negle ted?

ollinearity.

Is more data available?

The rst question is:

is all the

Are there oe ient restri tions

Pi ture illustrating how a restri tion an solve problem of perfe t

Sto hasti restri tions and ridge regression


Supposing that there is no more data or negle ted restri tions, one possibility is to hange
perspe tives, to Bayesian e onometri s. One an express prior beliefs regarding the oe ients
using sto hasti restri tions. A sto hasti linear restri tion would be something of the form

R = r + v
where

and

are as in the ase of exa t linear restri tions, but

is a random ve tor. For

example, the model ould be

y = X +
R = r + v
!
!
0

,
N
0
v

2 In 0nq
0qn

v2 Iq

This sort of model isn't in line with the lassi al interpretation of parameters as onstants:
a ording to this interpretation the left hand side of

R = r + v

is onstant but the right is

random. This model does t the Bayesian perspe tive: we ombine information oming from
the model and the data, summarized in

y = X +
N (0, 2 In )
with prior beliefs regarding the distribution of the parameter, summarized in

R N (r, v2 Iq )
Sin e the sample is random it is reasonable to suppose that

E(v ) = 0,

whi h is the last pie e

of information in the spe i ation. How an you estimate using this model? The solution is

124

CHAPTER 9.

DATA PROBLEMS

to treat the restri tions as arti ial data. Write

"

y
r

"

X
R

"

2 6= v2 .

Dene the

This model is heteros edasti , sin e

prior pre ision k = /v .

This

expresses the degree of belief in the restri tion relative to the variability of the data. Supposing
that we spe ify

k,

then the model

"

y
kr

"

X
kR

"

kv

is homos edasti and an be estimated by OLS. Note that this estimator is biased.
onsistent, however, given that

is a xed onstant, even if the restri tion is false (this is in

ontrast to the ase of false exa t restri tions). To see this, note that there are
where

Q is the number

of rows of

It is

restri tions,

R. As n , these Q arti ial observations have

no weight

in the obje tive fun tion, so the estimator has the same limiting obje tive fun tion as the OLS
estimator, and is therefore onsistent.

To motivate the use of sto hasti restri tions, onsider the expe tation of the squared
length of

:


1 
1  
X
X
+ X X
+ X X

= + E X(X X)1 (X X)1 X
1 2

= + T r X X

= E
E( )

= + 2

K
X

i (the

tra e is the sum of eigenvalues)

i=1

> + max(X X 1 ) 2 (the

eigenvalues are all positive, sin eX

so

min(X X)

is the minimum eigenvalue of

values) and

is p.d.

min(X X)

X X

(whi h is the inverse of the maximum

1 ). As ollinearity be omes worse and worse,


eigenvalue of (X X)
singular, so

> +
E( )
where

X X

be omes more nearly

min(X X) tends to zero (re all that the determinant is the


tends to innite. On the other hand, is nite.
E( )

produ t of the eigen-

Now onsidering the restri tion

"

y
0

IK = 0 + v.
#

"

X
kIK

With this restri tion the model be omes

"

kv

9.2.

125

MEASUREMENT ERROR

and the estimator is

ridge =
=
This is the ordinary

kIK

X X + k2 IK

ridge regression

2
add k IK , whi h is nonsingular, to
be omes worse and worse. As

"

1

X
kIK

#!1

X y

IK

"

y
0

estimator. The ridge regression estimator an be seen to

X X, whi h is more and more nearly singular as ollinearity

k ,

the restri tions tend to

= 0,

that is, the oe ients

are shrunken toward zero. Also, the estimator tends to

ridge = X X + k2 IK
so

ridge
ridge 0.

1

X y k2 IK

1

X y =

X y
0
k2

This is learly a false restri tion in the limit, if our original model is at al

sensible.

There should be some amount of shrinkage that is in fa t a true restri tion. The problem is
to determine the

su h that the restri tion is orre t. The interest in ridge regression enters

on the fa t that it an be shown that there exists a


problem is that this

depends on

The ridge tra e method plots

su h that

M SE(ridge ) < OLS .

ridge
ridge

as a fun tion of

k,

and hooses the value of

that artisti ally seems appropriate (e.g., where the ee t of in reasing

pi ture here.

The

2
and , whi h are unknown.

This means of hoosing

the Bayesian perspe tive: the hoi e of

dies o ).

Draw

is obviously subje tive. This is not a problem from

ree ts prior beliefs about the length of

In summary, the ridge estimator oers some hope, but it is impossible to guarantee that
it will outperform the OLS estimator. Collinearity is a fa t of life in e onometri s, and there
is no lear solution to the problem.

9.2 Measurement error


Measurement error is exa tly what it says, either the dependent variable or the regressors
are measured with error. Thinking about the way e onomi data are reported, measurement
error is probably quite prevalent. For example, estimates of growth of GDP, ination, et . are
ommonly revised several times. Why should the last revision ne essarily be orre t?

9.2.1 Error of measurement of the dependent variable


Measurement errors in the dependent variable and the regressors have important dieren es.
First onsider error in measurement of the dependent variable. The data generating pro ess

126

CHAPTER 9.

DATA PROBLEMS

is presumed to be

y = X +
y = y + v
vt iid(0, v2 )
where
that

is the unobservable true dependent variable, and

and

are independent and that y

= X +

is what is observed. We assume

satises the lassi al assumptions. Given

this, we have

y + v = X +
so

y = X + v
= X +
t iid(0, 2 + v2 )

As long as

is un orrelated with

X,

this model satises the lassi al assumptions and

an be estimated by OLS. This type of measurement error isn't a problem, then.

9.2.2 Error of measurement of the regressors


The situation isn't so good in this ase. The DGP is

yt = x
t + t
xt = xt + vt
vt iid(0, v )
where

is a

K K

matrix.

Now

what is observed. Again assume that

ontains the true, unobserved regressors, and

is independent of

and that the model

satises the lassi al assumptions. Now we have

yt = (xt vt ) + t
= xt vt + t

= xt + t

The problem is that now there is a orrelation between

xt

and

t ,

E(xt t ) = E (xt + vt ) vt + t
= v
where


v = E vt vt .



sin e

y=

is

9.3.

127

MISSING OBSERVATIONS

Be ause of this orrelation, the OLS estimator is biased and in onsistent, just as in the ase of
auto orrelated errors with lagged dependent variables. In matrix notation, write the estimated
model as

y = X +
We have that

X X
n

1 

X y
n

and

plim

sin e

and

X X
n

1

= plim

(X + V ) (X + V )
n

= (QX + v )1

are independent, and

plim

V V
n

= lim E
= v

1X
vt vt
n
t=1

Likewise,

plim

X y
n

(X + V ) (X + )
n
= QX
= plim

so

plim = (QX + v )1 QX
So we see that the least squares estimator is in onsistent when the regressors are measured
with error.

A potential solution to this problem is the instrumental variables (IV) estimator, whi h
we'll dis uss shortly.

9.3 Missing observations


Missing observations o ur quite frequently: time series data may not be gathered in a ertain
year, or respondents to a survey may not answer all questions.

We'll onsider two ases:

missing observations on the dependent variable and missing observations on the regressors.

9.3.1 Missing observations on the dependent variable


In this ase, we have

y = X +

128

CHAPTER 9.

"

or

where

y2

y1
y2

"

X1
X2

"

1
2

DATA PROBLEMS

is not observed. Otherwise, we assume the lassi al assumptions hold.

A lear alternative is to simply estimate using the ompete observations

y1 = X1 + 1
Sin e these observations satisfy the lassi al assumptions, one ould estimate by OLS.

The question remains whether or not one ould somehow repla e the unobserved
a predi tor, and improve over OLS in some sense. Let

y2

("

# "

=
=
Re all that the OLS fon are

X1
X2

# "

X1

#)1 "

X1

be the predi tor of

y1

X2
X2
y2



1
X1 X1 + X2 X2
X1 y1 + X2 y2

y2 .

X X = X y
so if we regressed using only the rst ( omplete) observations, we would have

X1 X1 1 = X1 y1.
Likewise, an OLS regression using only the se ond (lled in) observations would give

X2 X2 2 = X2 y2 .
Substituting these into the equation for the overall ombined estimator gives

i
1 h
X1 X1 1 + X2 X2 2
1
1


X2 X2 2
X1 X1 1 + X1 X1 + X2 X2
= X1 X1 + X2 X2

X1 X1 + X2 X2

A1 + (IK A)2

where

1

X1 X1
A X1 X1 + X2 X2

and we use

X1 X1 + X2 X2

1



1 
X1 X1 + X2 X2 X1 X1
X1 X1 + X2 X2
1

X1 X1
= IK X1 X1 + X2 X2

X2 X2 =

= IK A.

y2

by

Now

9.3.

129

MISSING OBSERVATIONS

Now,

and this will be unbiased only

 
= A + (IK A)E 2
E()
 
2 = .
if E

The on lusion is the this lled in observations alone would need to dene an unbiased
estimator. This will be the ase only if

y2 = X2 + 2
where
of

2 has mean zero.

Clearly, it is di ult to satisfy this ondition without knowledge

Note that putting

y2 = y1

does not satisfy the ondition and therefore leads to a biased

estimator.

Exer ise 13 Formally prove this last statement.

One possibility that has been suggested (see Greene, page 275) is to estimate

using a

rst round estimation using only the omplete observations

1 = (X1 X1 )1 X1 y1
then use this estimate,

1 ,to

predi t

y2

y2 = X2 1
= X2 (X1 X1 )1 X1 y1
Now, the overall estimate is a weighted average of

and

2 ,

just as above, but we have

2 = (X2 X2 )1 X2 y2
= (X2 X2 )1 X2 X2 1
= 1
This shows that this suggestion is ompletely empty of ontent: the nal estimator is
the same as the OLS estimator using only the omplete observations.

9.3.2 The sample sele tion problem


In the above dis ussion we assumed that the missing observations are random. The sample
sele tion problem is a ase where the missing observations are not random. Consider the model

yt = xt + t

130

CHAPTER 9.

DATA PROBLEMS

Figure 9.3: Sample sele tion bias


25
Data
True Line
Fitted Line
20

15

10

-5

-10
0

10

whi h is assumed to satisfy the lassi al assumptions. However,


What is observed is

yt
yt

is not always observed.

dened as

yt = yt
Or, in other words,

yt

if

yt 0

is missing when it is less than zero.

The dieren e in this ase is that the missing values are not random: they are orrelated
with the

xt .

Consider the ase

y = x +
with

V () = 25,

but using only the observations for whi h

y > 0

to estimate.

Figure 9.3

illustrates the bias. The O tave program is sampsel.m

9.3.3 Missing observations on the regressors


Again the model is

"

y1
y2

but we assume now that ea h row of

X2

"

X1
X2

"

1
2

has an unobserved omponent(s). Again, one ould

just estimate using the omplete observations, but it may seem frustrating to have to drop

X2 is

repla ed by some predi tion, X2 , then we are in the ase of errors of observation. As before,

this means that the OLS estimator is biased when X2 is used instead of X2 . Consisten y is
observations simply be ause of a single missing variable. In general, if the unobserved

salvaged, however, as long as the number of missing observations doesn't in rease with

In luding observations that have missing values repla ed by

ad ho

n.

values an be inter-

9.4.

131

EXERCISES

preted as introdu ing false sto hasti restri tions. In general, this introdu es bias. It is
di ult to determine whether MSE in reases or de reases. Monte Carlo studies suggest
that it is dangerous to simply substitute the mean, for example.

In the ase that there is only one regressor other than the onstant, subtitution of
the missing

xt

does not lead to bias.

This is a spe ial ase that doesn't hold for

for

K > 2.

Exer ise 14 Prove this last statement.

In summary, if one is strongly on erned with bias, it is best to drop observations that
have missing omponents.

There is potential for redu tion of MSE through lling in

missing elements with intelligent guesses, but this ould also in rease MSE.

9.4 Exer ises


1. Consider the Nerlove model

ln C = 1j + 2j ln Q + 3 ln PL + 4 ln PF + 5 ln PK +
When this model is estimated by OLS, some oe ients are not signi ant. This may
be due to ollinearity.

(a) Cal ulate the orrelation matrix of the regressors.


(b) Perform arti ial regressions to see if ollinearity is a problem.
( ) Apply the ridge regression estimator.
(d) Plot the ridge tra e diagram
(e) Che k what happens as

goes to zero, and as

be omes very large.

132

CHAPTER 9.

DATA PROBLEMS

Chapter 10
Fun tional form and nonnested tests
Though theory often suggests whi h onditioning variables should be in luded, and suggests
the signs of ertain derivatives, it is usually silent regarding the fun tional form of the relationship between the dependent variable and the regressors. For example, onsidering a ost
fun tion, one ould have a Cobb-Douglas model

c = Aw11 w22 q q e
This model, after taking logarithms, gives

ln c = 0 + 1 ln w1 + 2 ln w2 + q ln q +
where

0 = ln A.

Theory suggests that

A > 0, 1 > 0, 2 > 0, 3 > 0.

ompatible with a xed ost of produ tion sin e


in input pri es suggests that

1 + 2 = 1,

This model isn't

c = 0 when q = 0. Homogeneity

of degree one

while onstant returns to s ale implies

q = 1.

While this model may be reasonable in some ases, an alternative

c = 0 + 1 w1 + 2 w2 + q q +

may be just as plausible. Note that

and

ln(x)

look quite alike, for ertain values of the

regressors, and up to a linear transformation, so it may be di ult to hoose between these
models.
The basi point is that many fun tional forms are ompatible with the linear-in-parameters
model, sin e this model an in orporate a wide variety of nonlinear transformations of the
dependent variable and the regressors. For example, suppose that
and that

x()

is a

K ve tor-valued

g() is a real valued fun tion

fun tion. The following model is linear in the parameters

but nonlinear in the variables:

xt = x(zt )
yt = xt + t
There may be

fundamental onditioning variables

133

zt ,

but there may be

regressors, where

134

CHAPTER 10.

FUNCTIONAL FORM AND NONNESTED TESTS

may be smaller than, equal to or larger than

ross produ ts of the onditioning variables in

P.

For example,

xt

ould in lude squares and

zt .

10.1 Flexible fun tional forms


Given that the fun tional form of the relationship between the dependent variable and the
regressors is in general unknown, one might wonder if there exist parametri models that an
losely approximate a wide variety of fun tional relationships. A Diewert-Flexible fun tional
form is dened as one su h that the fun tion, the ve tor of rst derivatives and the matrix
of se ond derivatives an take on an arbitrary value

at a single data point.

Flexibility in this

sense learly requires that there be at least


K = 1 + P + P 2 P /2 + P

free parameters: one for ea h independent ee t that we wish to model.


Suppose that the model is

y = g(x) +
A se ond-order Taylor's series expansion (with remainder term) of the fun tion
point

x=0

is

g(x) = g(0) + x Dx g(0) +

g(x) about the

x Dx2 g(0)x
+R
2

Use the approximation, whi h simply drops the remainder term, as an approximation to

g(x) gK (x) = g(0) + x Dx g(0) +


As

x 0,

x Dx2 g(0)x
2

the approximation be omes more and more exa t, in the sense that

Dx gK (x) Dx g(x)

2
and Dx gK (x)

Dx2 g(x). For

x = 0,

g(x) :

gK (x) g(x),

the approximation is exa t, up to

g(0), Dx g(0)
2
and Dx g(0) are all onstants. If we treat them as parameters, the approximation will have
the se ond order. The idea behind many exible fun tional forms is to note that

exa tly enough free parameters to approximate the fun tion


exa tly, up to se ond order, at the point

x = 0.

g(x),

whi h is of unknown form,

The model is

gK (x) = + x + 1/2x x
so the regression model to t is

y = + x + 1/2x x +

While the regression model has enough free parameters to be Diewert-exible, the question remains: is

plim
= g(0)?

The answer is no, in general.

Is

plim = Dx g(0)?

Is

= D 2 g(0)?
plim
x

The reason is that if we treat the true values of the

parameters as these derivatives, then

is for ed to play the part of the remainder term,

10.1.

135

FLEXIBLE FUNCTIONAL FORMS

whi h is a fun tion of

x,

so that

and

are orrelated in this ase.

As before, the

estimator is biased in this ase.

A simpler example would be to onsider a rst-order T.S. approximation to a quadrati


fun tion.

Draw pi ture.

The on lusion is that exible fun tional forms aren't really exible in a useful statisti al sense, in that neither the fun tion itself nor its derivatives are onsistently estimated,
unless the fun tion belongs to the parametri family of the spe ied fun tional form. In
order to lead to onsistent inferen es, the regression model must be orre tly spe ied.

10.1.1 The translog form


In spite of the fa t that FFF's aren't really exible for the purposes of e onometri estimation
and inferen e, they are useful, and they are ertainly subje t to less bias due to misspe i ation
of the fun tional form than are many popular forms, su h as the Cobb-Douglas or the simple
linear in the variables model. The translog model is probably the most widely used FFF. This
model is as above, ex ept that the variables are subje ted to a logarithmi tranformation. Also,
the expansion point is usually taken to be the sample mean of the data, after the logarithmi
transformation. The model is dened by

y = ln(c)
z 
x = ln
z
= ln(z) ln(
z)

y = + x + 1/2x x +

In this presentation, the

t subs ript that distinguishes observations is suppressed for simpli ity.

Note that

y
x

= + x
=
=

whi h is the elasti ity of

ln(c)
ln(z)
c z
z c

(the other part of

onstant)

c with respe t to z. This is a onvenient

Note that at the means of the onditioning variables,

so the

x is

z, x = 0,


y
=
x z=z

feature of the translog model.


so

are the rst-order elasti ities, at the means of the data.

To illustrate, onsider that

is ost of produ tion:

y = c(w, q)

136

CHAPTER 10.

where

is a ve tor of input pri es and

FUNCTIONAL FORM AND NONNESTED TESTS

is output. We ould add other variables by extending

in the obvious manner, but this is supressed for simpli ity.

onditional fa tor demands are

x=

By Shephard's lemma, the

c(w, q)
w

and the ost shares of the fa tors are therefore

s=

wx
c(w, q) w
=
c
w c

whi h is simply the ve tor of elasti ities of ost with respe t to input pri es. If the ost fun tion
is modeled using a translog fun tion, we have

ln(c) = + x + z + 1/2

x z

"

11 12
12 22

#"

x
z

= + x + z + 1/2x 11 x + x 12 z + 1/2z 22
where

x = ln(w/w)

(element-by-element division) and

"

11 =

"

12 =

z = ln(q/
q ),

11 12
12 22
#
13

and

23

22 = 33 .
Note that symmetry of the se ond derivatives has been imposed.

Then the share equations are just

s=+

11 12

"

x
z

Therefore, the share equations and the ost equation have parameters in ommon. By pooling
the equations together and imposing the (true) restri tion that the parameters of the equations
be the same, we an gain e ien y.

To illustrate in more detail, onsider the ase of two inputs, so

x=

"

x1
x2

In this ase the translog model of the logarithmi ost fun tion is

ln c = + 1 x1 + 2 x2 + z +

11 2 22 2 33 2
x +
x +
z + 12 x1 x2 + 13 x1 z + 23 x2 z
2 1
2 2
2

10.1.

137

FLEXIBLE FUNCTIONAL FORMS

The two ost shares of the inputs are the derivatives of

ln c

with respe t to

x1

and

x2 :

s1 = 1 + 11 x1 + 12 x2 + 13 z
s2 = 2 + 12 x1 + 22 x2 + 13 z
Note that the share equations and the ost equation have parameters in ommon.

One

an do a pooled estimation of the three equations at on e, imposing that the parameters are
the same. In this way we're using more observations and therefore more information, whi h
will lead to imporved e ien y. Note that this does assume that the ost equation is orre tly

i.e.,

spe ied (

not an approximation), sin e otherwise the derivatives would not be the true

derivatives of the log ost fun tion, and would then be misspe ied for the shares. To pool
the equations, write the model in matrix form (adding in error terms)

ln c

s1
s2

This is

1 x1 x2 z


= 0 1
0 0

one

x21
2

x22
2

z2
2

x1 x2

x2

x2 0

x1

0 x1 0

0 0

x1 z x2 z

z
0

0
z

11

+ 2
22

3
33

12

13

23

observation on the three equations. With the appropriate notation, a single

observation an be written as

yt = Xt + t
The overall model would sta k

vations:

observations on the three equations for a total of

y1

X1

3n

obser-


y2 X2
+ .2
. = .

.
. .

.
. .
yn
Xn
n

Next we need to onsider the errors. For observation

1t

the errors an be pla ed in a ve tor

t = 2t
3t
First onsider the ovarian e matrix of this ve tor: the shares are ertainly orrelated sin e
they must sum to one. (In fa t, with 2 shares the varian es are equal and the ovarian e is
-1 times the varian e. General notation is used to allow easy extension to the ase of more

138

CHAPTER 10.

FUNCTIONAL FORM AND NONNESTED TESTS

than 2 inputs). Also, it's likely that the shares and the ost equation have dierent varian es.
Supposing that the model is ovarian e stationary, the varian e of

won t depend upon

t:

11 12 13

V art = 0 =

22 23

33

Note that this matrix is singular, sin e the shares sum to 1.

Assuming that there is no

seemingly unrelated regressions

auto orrelation, the overall ovarian e matrix has the

(SUR)

stru ture.

V ar .
=
..
n

0 0

= .
..

0
..

0
..

.
.
.

= In 0
where the symbol

and

is

indi ates the

Krone ker produ t.

The Krone ker produ t of two matri es

a11 B a12 B a1q B

a B ...
21
AB = .
..

apq B

.
.
.

apq B

10.1.2 FGLS estimation of a translog model


So, this model has heteros edasti ity and auto orrelation, so OLS won't be e ient. The next
question is: how do we estimate e iently using FGLS? FGLS is based upon inverting the
estimated error ovarian e

So we need to estimate

An asymptoti ally e ient pro edure is (supposing normality of the errors)

1. Estimate ea h equation by OLS

2. Estimate

using

X
0 = 1

t t
n t=1

3. Next we need to a ount for the singularity of

0 .

It an be shown that

will be

singular when the shares sum to one, so FGLS won't work. The solution is to drop one

10.1.

139

FLEXIBLE FUNCTIONAL FORMS

of the share equations, for example the se ond. The model be omes

"

ln c
s1

"

1 x1 x2 z
0 1

x21
2

x22
2

0 x1 0

z2
2

x1 x2

x2

#
x1 z x2 z

z
0


"
#
11
1

+
22
2

33

12

13

23

or in matrix notation for the observation:

yt = Xt + t
and in sta ked notation for all observations we have the

y1

X1

2n

observations:




y2 X2
+ .2
. = .

.
. .

.
. .
yn
Xn
n
or, nally in matrix notation for all observations:

y = X +
Considering the error ovarian e, we an dene

0 = V ar

Dene

as the leading

22

= In

blo k of

"

1
2

, and form

= In
0 .

This is a onsistent estimator, following the onsisten y of OLS and applying a LLN.

4. Next ompute the Cholesky fa torization

 1
0
P0 = Chol
(I am assuming this is dened as an upper triangular matrix, whi h is onsistent with

140

CHAPTER 10.

FUNCTIONAL FORM AND NONNESTED TESTS

the way O tave does it) and the Cholesky fa torization of the overall ovarian e matrix
of the 2 equation model, whi h an be al ulated as

= In P0
P = Chol
5. Finally the FGLS estimator an be al ulated by applying OLS to the transformed model

P y = P X + P
or by dire tly using the GLS formula

F GLS =

1
 1
 1
0
0
y
X
X
X

It is equivalent to transform ea h observation individually:

P0 yy = P0 Xt + P0
and then apply OLS. This is probably the simplest approa h.
A few last omments.
1. We have assumed no auto orrelation a ross time. This is learly restri tive. It is relatively simple to relax this, but we won't go into it here.
2. Also, we have only imposed symmetry of the se ond derivatives.

Another restri tion

that the model should satisfy is that the estimated shares should sum to 1. This an be
a omplished by imposing

1 + 2 = 1
3
X
ij = 0, j = 1, 2, 3.
i=1

These are linear parameter restri tions, so they are easy to impose and will improve
e ien y if they are true.
3. The estimation pro edure outlined above an be

iterated.

That is, estimate

F GLS

as

above, then re-estimate 0 using errors al ulated as

= y X F GLS
These might be expe ted to lead to a better estimate than the estimator based on
sin e FGLS is asymptoti ally more e ient. Then re-estimate

OLS ,

using the new estimated

error ovarian e. It an be shown that if this is repeated until the estimates don't hange

i.e., iterated to onvergen e) then the resulting estimator is the MLE. At any rate, the

10.2.

141

TESTING NONNESTED HYPOTHESES

asymptoti properties of the iterated and uniterated estimators are the same, sin e both
are based upon a onsistent estimator of the error ovarian e.

10.2 Testing nonnested hypotheses


Given that the hoi e of fun tional form isn't perfe tly lear, in that many possibilities exist,
how an one hoose between forms? When one form is a parametri restri tion of another, the
previously studied tests su h as Wald, LR, s ore or

qF

are all possibilities. For example, the

Cobb-Douglas model is a parametri restri tion of the translog: The translog is

yt = + xt + 1/2xt xt +
where the variables are in logarithms, while the Cobb-Douglas is

yt = + xt +
so a test of the Cobb-Douglas versus the translog is simply a test that
The situation is more ompli ated when we want to test

= 0.

non-nested hypotheses.

If the

two fun tional forms are linear in the parameters, and use the same transformation of the
dependent variable, then they may be written as

M1 : y = X +
t iid(0, 2 )
M2 : y = Z +
iid(0, 2 )
We wish to test hypotheses of the form:

H0 : Mi

misspe ied, for i = 1, 2.

is orre tly spe ied

versus

One ould a ount for non-iid errors, but we'll suppress this for simpli ity.

There are a number of ways to pro eed. We'll onsider the


and Ma Kinnon,

E onometri a

HA : Mi

is

test, proposed by Davidson

(1981). The idea is to arti ially nest the two models,

e.g.,

y = (1 )X + (Z) +
If the rst model is orre tly spe ied, then the true value of
hand, if the se ond model is orre tly spe ied then

is zero. On the other

= 1.

The problem is that this model is not identied in general.


models share some regressors, as in

M1 : yt = 1 + 2 x2t + 3 x3t + t
M2 : yt = 1 + 2 x2t + 3 x4t + t

For example, if the

142

CHAPTER 10.

FUNCTIONAL FORM AND NONNESTED TESTS

then the omposite model is

yt = (1 )1 + (1 )2 x2t + (1 )3 x3t + 1 + 2 x2t + 3 x4t + t


Combining terms we get

yt = ((1 )1 + 1 ) + ((1 )2 + 2 ) x2t + (1 )3 x3t + 3 x4t + t


= 1 + 2 x2t + 3 x3t + 4 x4t + t
The four

are onsistently estimable, but

knowns, so one an't test the hypothesis that


The idea of the

test is to substitute

is not, sin e we have four equations in 7 un-

= 0.

in pla e of

This is a onsistent estimator

supposing that the se ond model is orre tly spe ied. It will tend to a nite probability limit
even if the se ond model is misspe ied. Then estimate the model

y = (1 )X + (Z ) +
= X +
y+
where

y = Z(Z Z)1 Z y = PZ y.

In this model,

is onsistently estimable, and one an show

that, under the hypothesis that the rst model is orre t,


-statisti for

=0

is asymptoti ally normal:

t=

and that the ordinary

a
N (0, 1)

If the se ond model is orre tly spe ied, then

t ,

sin e

tends in probability to

1, while it's estimated standard error tends to zero. Thus the test will always reje t the
false null model, asymptoti ally, sin e the statisti will eventually ex eed any riti al
value with probability one.

We an reverse the roles of the models, testing the se ond against the rst.

It may be the ase that

neither

model is orre tly spe ied. In this ase, the test will

still reje t the null hypothesis, asymptoti ally, if we use riti al values from the
distribution, sin e as long as

N (0, 1)

tends to something dierent from zero, |t| . Of ourse,

when we swit h the roles of the models the other will also be reje ted asymptoti ally.

In summary, there are 4 possible out omes when we test two models, ea h against the
other. Both may be reje ted, neither may be reje ted, or one of the two may be reje ted.

There are other tests available for non-nested models. The


when both models are linear in the parameters. The
when

M1

test is simple to apply

P -test is similar, but easier to apply

is nonlinear.

The above presentation assumes that the same transformation of the dependent variable

10.2.

143

TESTING NONNESTED HYPOTHESES

is used by both models.

Ma Kinnon, White and Davidson,

Journal of E onometri s,

(1983) shows how to deal with the ase of dierent transformations.

Monte-Carlo eviden e shows that these tests often over-reje t a orre tly spe ied model.
Can use bootstrap riti al values to get better-performing tests.

144

CHAPTER 10.

FUNCTIONAL FORM AND NONNESTED TESTS

Chapter 11
Exogeneity and simultaneity
Several times we've en ountered ases where orrelation between regressors and the error term
lead to biasedness and in onsisten y of the OLS estimator. Cases in lude auto orrelation with
lagged dependent variables and measurement error in the regressors. Another important ase
is that of simultaneous equations. The ause is dierent, but the ee t is the same.

11.1 Simultaneous equations


Up until now our model is

y = X +
where, for purposes of estimation we an treat
we

ondition

on

X.

as xed. This means that when estimating

When analyzing dynami models, we're not interested in onditioning on

X, as we saw in the se tion on sto hasti regressors.


by treating

Nevertheless, the OLS estimator obtained

as xed ontinues to have desirable asymptoti properties even in that ase.

Simultaneous equations is a dierent prospe t.

An example of a simultaneous equation

system is a simple supply-demand system:

Demand:

"

The presumption is that

1t
2t

qt

and

Supply:

1t 2t

pt

qt = 1 + 2 pt + 3 yt + 1t
q = 1 + 2 pt + 2t
#
!t
"
i
11 12
=

22
, t

are jointly determined at the same time by the interse tion

of these equations. We'll assume that

yt

is determined by some unrelated pro ess. It's easy

to see that we have orrelation between regressors and errors. Solving for

pt

1 + 2 pt + 3 yt + 1t = 1 + 2 pt + 2t
2 pt 2 pt = 1 1 + 3 yt + 1t 2t
3 yt
1t 2t
1 1
+
+
pt =
2 2 2 2
2 2
145

146

CHAPTER 11.

pt

Now onsider whether

is un orrelated with

EXOGENEITY AND SIMULTANEITY

1t :



1 1
3 yt
1t 2t
+
+
2 2 2 2
2 2
11 12
2 2

E(pt 1t ) = E
=

1t

Be ause of this orrelation, OLS estimation of the demand equation will be biased and in onsistent. The same applies to the supply equation, for the same reason.
In this model,
the system.

yt

qt

is an

and

pt

endogenous

are the

exogenous

varibles (endogs), that are determined within

variable (exogs).

These on epts are a bit tri ky, and we'll

return to it in a minute. First, some notation. Suppose we group together urrent endogs in
the ve tor

Yt .

If there are

lagged endogs in the ve tor


error ve tor

Et .

endogs,

Xt

Yt

is

, whi h is

G 1.

K 1.

Group urrent and lagged exogs, as well as


Sta k the errors of the

equations into the

The model, with additional assumtions, an be written as

Yt = Xt B + Et
Et N (0, ), t

E(Et Es ) = 0, t 6= s
We an sta k all

observations and write the model as

Y = XB + E

E(X E) = 0(KG)
vec(E) N (0, )
where

is

n G, X

is

n K,

Y1

X1



X2
Y2

.
,
X
=
Y =
.
..
.
.

Yn
Xn

and

is

n G.

E1

E
,E = . 2

.
En

omplete, in that there are as many equations as endogs.

This system is

There is a normality assumption.

Sin e there is no auto orrelation of the

This isn't ne essary, but allows us to onsider the

relationship between least squares and ML estimators.

Et

's, and sin e the olumns of

are individually

11.2.

147

EXOGENEITY

homos edasti , then

11 In 12 In 1G In
.
.
.

22 In
..

.
.
.

GG In

= In

may ontain lagged endogenous and exogenous variables. These variables are

termined.

prede-

We need to dene what is meant by endogenous and exogenous when lassifying the
urrent period variables.

11.2 Exogeneity
The model denes a

Xt ,

data generating pro ess.

The model involves two sets of variables,

Yt

and

as well as a parameter ve tor

vec() vec(B) vec ()


is a G2 +GK + G2 G /2+G dimensional

In general, without additional restri tions,

In prin iple, there exists a joint density fun tion for

ve tor. This is the parameter ve tor that were interested in estimating.

parameter ve tor

Yt

and

Xt ,

whi h depends on a

Write this density as

ft (Yt , Xt |, It )
where

It

is the information set in period

t.

This in ludes lagged

ourse. This an be fa tored into the density of


density of

Xt

Yt

Yt s

onditional on

Xt

and lagged

Xt

's of

times the marginal

ft (Yt , Xt |, It ) = ft (Yt |Xt , , It )ft (Xt |, It )


This is a general fa torization, but is may very well be the ase that not all parameters in

ae t both fa tors. So use

density and write

to indi ate elements of

that enter into the onditional

for parameters that enter into the marginal. In general,

may share elements, of ourse. We have

ft (Yt , Xt |, It ) = ft (Yt |Xt , 1 , It )ft (Xt |2 , It )

and

148

CHAPTER 11.

EXOGENEITY AND SIMULTANEITY

Re all that the model is

Yt = Xt B + Et
Et N (0, ), t

E(Et Es ) = 0, t 6= s

Normality and la k of orrelation over time imply that the observations are independent of
one another, so we an write the log-likelihood fun tion as the sum of likelihood ontributions
of ea h observation:

ln L(Y |, It ) =
=

n
X
t=1

n
X
t=1

n
X
t=1

ln ft (Yt , Xt |, It )
ln (ft (Yt |Xt , 1 , It )ft (Xt |2 , It ))
ln ft (Yt |Xt , 1 , It ) +

n
X
t=1

ln ft (Xt |2 , It ) =

Denition 15 (Weak Exogeneity) Xt is weakly exogeneous for (the original parameter

ve tor) if there is a mapping from to that is invariant to 2 . More formally, for an arbitrary

(1 , 2 ), () = (1 ).
This implies that
hange as

1 and 2 annot share elements if Xt is weakly exogenous, sin e 1 would

hanges, whi h prevents onsideration of arbitrary ombinations of

Supposing that

Xt

is weakly exogenous, then the MLE of

(1 , 2 ).

using the joint density is the

same as the MLE using only the onditional density

ln L(Y |X, , It ) =

n
X
t=1

ln ft (Yt |Xt , 1 , It )

sin e the onditional likelihood doesn't depend on


log-likelihoods maximize at the same value of

1 .

With weak exogeneity, knowledge of the DGP of


knowledge of

Xt

2 . In other words, the joint and onditional

Xt

is irrelevant for inferen e on

is su ient to re over the parameter of interest,

is irrelevant, we an treat

Xt

1 , and

Sin e the DGP of

as xed in inferen e.

By the invarian e property of MLE, the MLE of

is

(1 ),and

this mapping is assumed

to exist in the denition of weak exogeneity.

from 1 .

Of ourse, we'll need to gure out just what this mapping is to re over

With la k of weak exogeneity, the joint and onditional likelihood fun tions maximize in

is the famous

This

identi ation problem.

dierent pla es. For this reason, we an't treat


is valid, but the onditional MLE is not.

Xt

as xed in inferen e. The joint MLE

11.3.

149

REDUCED FORM

In resume, we require the variables in

Xt

to be weakly exogenous if we are to be able to

treat them as xed in estimation. Lagged


onditioning information set, e.g.,

Yt

Yt1 It .

satisfy the denition, sin e they are in the


Lagged

Yt

aren't exogenous in the normal

are determined within the model, just earlier on.


Weakly exogenous variables in lude exogenous (in the normal sense) variables as well as
all predetermined variables.
usage of the word, sin e their values

11.3 Redu ed form


Re all that the model is

Yt = Xt B + Et
V (Et ) =
This is the model in

stru tural form.

Denition 16 (Stru tural form) An equation is in stru tural form when more than one

urrent period endogenous variable is in luded.

The solution for the urrent period endogs is easy to nd. It is

Yt = Xt B1 + Et 1
= Xt + Vt =
Now only one urrent period endog appears in ea h equation. This is the

redu ed form.

Denition 17 (Redu ed form) An equation is in redu ed form if only one urrent period

endog is in luded.

An example is our supply/demand system. The redu ed form for quantity is obtained by
solving the supply equation for pri e and substituting into demand:

qt =
2 qt 2 qt =
qt =
=


qt 1 2t
1 + 2
+ 3 yt + 1t
2
2 1 2 (1 + 2t ) + 2 3 yt + 2 1t
2 3 yt
2 1t 2 2t
2 1 2 1
+
+
2 2
2 2
2 2
11 + 21 yt + V1t

150

CHAPTER 11.

EXOGENEITY AND SIMULTANEITY

Similarly, the rf for pri e is

1 + 2 pt + 2t = 1 + 2 pt + 3 yt + 1t
2 pt 2 pt = 1 1 + 3 yt + 1t 2t
3 yt
1t 2t
1 1
+
+
pt =
2 2 2 2
2 2
= 12 + 22 yt + V2t
The interesting thing about the rf is that the equations individually satisfy the lassi al assumptions, sin e
i=1,2,

t.

yt

is un orrelated with

1t

and

2t

by assumption, and therefore

The errors of the rf are

The varian e of

"
V1t

V1t
V2t

"

2 1t 2 2t
2 2
1t 2t
2 2

E(yt Vit ) = 0,

is





2 1t 2 2t
2 1t 2 2t
V (V1t ) = E
2 2
2 2
2
2 11 22 2 12 + 2 22
=
(2 2 )2

This is onstant over time, so the rst rf equation is homos edasti .

Likewise, sin e the

are independent over time, so are the

Vt .

The varian e of the se ond rf error is





1t 2t
1t 2t
V (V2t ) = E
2 2
2 2
11 212 + 22
=
(2 2 )2
and the ontemporaneous ovarian e of the errors a ross equations is



1t 2t
2 1t 2 2t
E(V1t V2t ) = E
2 2
2 2
2 11 (2 + 2 ) 12 + 22
=
(2 2 )2


In summary the rf equations individually satisfy the lassi al assumptions, under the
assumtions we've made, but they are ontemporaneously orrelated.

The general form of the rf is

Yt = Xt B1 + Et 1
= Xt + Vt

11.4.

151

IV ESTIMATION

so we have that

and that the

Vt





Vt = 1 Et N 0, 1 1 , t

are timewise independent (note that this wouldn't be the ase if the

Et

were

auto orrelated).

11.4 IV estimation
The IV estimator may appear a bit unusual at rst, but it will grow on you over time.
The simultaneous equations model is

Y = XB + E
Considering the rst equation (this is without loss of generality, sin e we an always reorder
the equations) we an partition the

matrix as

Y =
y

y Y1 Y2

is the rst olumn

Y1

are the other endogenous variables that enter the rst equation

Y2

are endogs that are ex luded from this equation

Similarly, partition

as

X=
X1

are the in luded exogs, and

X2

X1 X2

are the ex luded exogs.

Finally, partition the error matrix as

E=
Assume that

E12

has ones on the main diagonal. These are normalization restri tions that

simply s ale the remaining oe ients on ea h equation, and whi h s ale the varian es of the
error terms.
Given this s aling and our partitioning, the oe ient matri es an be written as

12

= 1 22
0
32
"
#
1 B12
B =
0 B22

152

CHAPTER 11.

EXOGENEITY AND SIMULTANEITY

With this, the rst equation an be written as

y = Y1 1 + X1 1 +
= Z +
The problem, as we've seen is that

is orrelated with

sin e

Y1

is formed of endogs.

Now, let's onsider the general problem of a linear regression model with orrelation between regressors and the error term:

y = X +
iid(0, In 2 )

E(X ) 6= 0.

The present ase of a stru tural equation from a system of equations ts into this notation,
but so do other problems, su h as measurement error or lagged dependent variables with
auto orrelated errors.
with

Consider some matrix

whi h is formed of variables un orrelated

This matrix denes a proje tion matrix

PW = W (W W )1 W
so that anything that is proje ted onto the spa e spanned by
by the denition of

W.

will be un orrelated with

Transforming the model with this proje tion matrix we get

PW y = PW X + PW
or

y = X +
Now we have that

and

are un orrelated, sin e this is simply

E(X ) = E(X PW
PW )

= E(X PW )

and

PW X = W (W W )1 W X
is the tted value from a regression of

W,

This implies that applying OLS to the model

so it must be un orrelated with

on

W.

This is a linear ombination of the olumns of

y = X +
will lead to a onsistent estimator, given a few more assumptions.

This is the

generalized

11.4.

153

IV ESTIMATION

instrumental variables estimator. W

is known as the matrix of instruments. The estimator is

IV = (X PW X)1 X PW y
from whi h we obtain

IV

= (X PW X)1 X PW (X + )
= + (X PW X)1 X PW

so

IV = (X PW X)1 X PW

X W (W W )1 W X

=
Now we an introdu e fa tors of

IV =

X W
n

to get

W W 1
n

!
n

Assuming that ea h of the terms with a

W W
n

QW W ,

X W
n

QXW ,

W p
n

W X
n

1

X W (W W )1 W

!1 

X W
n



W W
n

1 

W
n

in the denominator satises a LLN, so that

a nite pd matrix

a nite matrix with rank

(= ols(X) )

then the plim of the rhs is zero. This last term has plim 0 sin e we assume that

and

un orrelated, e.g.,

E(Wt t ) = 0,
Given these assumtions the IV estimator is onsistent

p
IV .
Furthermore, s aling by



n IV =

n,

we have

X W
n



W W
n

1 

W X
n

!1 

X W
n



W W
n

Assuming that the far right term saties a CLT, so that

d
W

n

then we get

N (0, QW W 2 )



d

1 2
n IV N 0, (QXW Q1
W W QXW )

1 

are

154

CHAPTER 11.

The estimators for

QXW

and

QW W

EXOGENEITY AND SIMULTANEITY

are the obvious ones. An estimator for

is


 

2 = 1 y X
IV .
IV
y

d
IV
n

This estimator is onsistent following the proof of onsisten y of the OLS estimator of

2 ,

when the lassi al assumptions hold.


The formula used to estimate the varian e of

V (IV ) =

The IV estimator is

X W

IV

W W

is

1

W X

1 d
2
IV

1. Consistent
2. Asymptoti ally normally distributed
3. Biased in general, sin e even though

1 and
zero, sin e (X PW X)

E(X PW ) = 0, E(X PW X)1 X PW

An important point is that the asymptoti distribution of


and these depend upon the hoi e of

the estimator.

W.

W1 .

W2

IV

depends upon

QXW

and

QW W ,

The hoi e of instruments inuen es the e ien y of

When we have two sets of instruments,


estimator using

may not be

X PW are not independent.

W1

and

W2

su h that

W1 W2 ,

then the IV

is at least as e iently asymptoti ally as the estimator that used

More instruments leads to more asymptoti ally e ient estimation, in general.

There are spe ial ases where there is no gain (simultaneous equations is an example of

The penalty for indis riminant use of instruments is that the small sample bias of the

this, as we'll see).

IV estimator rises as the number of instruments in reases. The reason for this is that

PW X

be omes loser and loser to

itself as the number of instruments in reases.

IV estimation an learly be used in the ase of simultaneous equations. The only issue
is whi h instruments to use.

11.5 Identi ation by ex lusion restri tions


The identi ation problem in simultaneous equations is in fa t of the same nature as the
identi ation problem in any estimation setting: does the limiting obje tive fun tion have the
proper urvature so that there is a unique global minimum or maximum at the true parameter
value? In the ontext of IV estimation, this is the ase if the limiting ovarian e of the IV
estimator is positive denite and

plim n1 W = 0.

This matrix is

1 2
V (IV ) = (QXW Q1
W W QXW )

11.5.

155

IDENTIFICATION BY EXCLUSION RESTRICTIONS

The ne essary and su ient ondition for identi ation is simply that this matrix be

For this matrix to be positive denite, we need that the onditions noted above hold:

These identi ation onditions are not that intuitive nor is it very obvious how to he k

positive denite, and that the instruments be (asymptoti ally) un orrelated with

QW W

must be positive denite and

QXW

must be of full rank (

).

them.

11.5.1 Ne essary onditions


If we use IV estimation for a single equation of the system, the equation an be written as

y = Z +
where

Z=

Notation:

Let

Let

K = cols(X1 )

Let

Y1 X1

be the total numer of weakly exogenous variables.

be the number of in luded exogs, and let

number of ex luded exogs (in this equation).

G = cols(Y1 ) + 1

K = K K

be the total number of in luded endogs, and let

the number of ex luded endogs.

be the

G = G G

be

Using this notation, onsider the sele tion of instruments.

Now the

It turns out that

X1

are weakly exogenous and an serve as their own instruments.

exhausts the set of possible instruments, in that if the variables in

don't lead to an identied model then no other instruments will identify the model

either. Assuming this is true (we'll prove it in a moment), then a ne essary ondition
for identi ation is that
must be used twi e, so

cols(X2 ) cols(Y1 )

sin e if not then at least one instrument

will not have full olumn rank:

(W ) < K + G 1 (QZW ) < K + G 1


This is the

order ondition

for identi ation in a set of simultaneous equations. When

the only identifying information is ex lusion restri tions on the variables that enter an
equation, then the number of ex luded exogs must be greater than or equal to the number
of in luded endogs, minus 1 (the normalized lhs endog), e.g.,

K G 1

156

CHAPTER 11.

EXOGENEITY AND SIMULTANEITY

To show that this is in fa t a ne essary ondition onsider some arbitrary set of instruments

W.

A ne essary ondition for identi ation is that

1
plim W Z
n
where

Z=
Re all that we've partitioned the model

= K + G 1

Y1 X1

Y = XB + E
as

Y =

X=

y Y1 Y2

X1 X2

Given the redu ed form

Y = X + V
we an write the redu ed form using the same partition

y Y1 Y2

X1 X2

"

11 12 13
21 22 23

v V1 V2

so we have

Y1 = X1 12 + X2 22 + V1
so

Be ause the

V1

i
h
1
1
W Z = W X1 12 + X2 22 + V1 X1
n
n

's are un orrelated with the

V1

's, by assumption, the ross between

and

onverges in probability to zero, so

i
h
1
1
plim W Z = plim W X1 12 + X2 22 X1
n
n
Sin e the far rhs term is formed only of linear ombinations of olumns of
matrix an never be greater than
than

K,

X,

the rank of this

regardless of the hoi e of instruments. If

olumns, then it is not of full olumn rank. When

has more than

has more

olumns we

have

G 1 + K > K
or noting that

K = K K ,

G 1 > K

In this ase, the limiting matrix is not of full olumn rank, and the identi ation ondition

11.5.

IDENTIFICATION BY EXCLUSION RESTRICTIONS

157

fails.

11.5.2 Su ient onditions


Identi ation essentially requires that the stru tural parameters be re overable from the data.
This won't be the ase, in general, unless the stru tural model is subje t to some restri tions.
We've already identied ne essary onditions. Turning to su ient onditions (again, we're
only onsidering identi ation through zero restri itions on the parameters, for the moment).
The model is

Yt = Xt B + Et
V (Et ) =
This leads to the redu ed form

Yt = Xt B1 + Et 1
= Xt + Vt

V (Vt ) = 1 1
=

The redu ed form parameters are onsistently estimable, but none of them are known

a priori,

and there are no restri tions on their values. The problem is that more than one stru tural
form has the same redu ed form, so knowledge of the redu ed form parameters alone isn't
enough to determine the stru tural parameters. To see this, onsider the model

Yt F

= Xt BF + Et F

V (Et F ) = F F
where

is some arbirary nonsingular

GG

matrix. The rf of this new model is

Yt = Xt BF (F )1 + Et F (F )1
= Xt BF F 1 1 + Et F F 1 1
= Xt B1 + Et 1
= Xt + Vt
Likewise, the ovarian e of the rf of the transformed model is

V (Et F (F )1 ) = V (Et 1 )
=
Sin e the two stru tural forms lead to the same rf, and the rf is all that is dire tly estimable,
the models are said to be

observationally equivalent.

What we need for identi ation are

158

CHAPTER 11.

restri tions on

and

EXOGENEITY AND SIMULTANEITY

su h that the only admissible

is an identity matrix (if all of the

equations are to be identied). Take the oe ient matri es as partitioned before:

"

12

=
0

1
0

22

32

B12
B22

The oe ients of the rst equation of the transformed model are simply these oe ients
multiplied by the rst olumn of

"

F.

#"

This gives

f11
F2

12

=
0

1
0

#
22 "
f11
32
F

2
B12
B22

For identi ation of the rst equation we need that there be enough restri tions so that the
only admissible

"

f11
F2

be the leading olumn of an identity matrix, so that

12

1
0

#
22 "
1
f11

=
32
F
0

B12
1
0
B22

Note that the third and fth rows are

"

32
B22

F2 =

"

0
0

Supposing that the leading matrix is of full olumn rank, e.g.,

"

32
B22

#!

= cols

"

32
B22

#!

=G1

then the only way this an hold, without additional restri tions on the model's parameters, is
if

F2

is a ve tor of zeros. Given that

F2

1 12

is a ve tor of zeros, then the rst equation

"

f11
F2

= 1 f11 = 1

11.5.

159

IDENTIFICATION BY EXCLUSION RESTRICTIONS

Therefore, as long as

"

then

"

B22

#!

"

32

f11
F2

=G1
1
0G1

The rst equation is identied in this ase, so the ondition is su ient for identi ation. It is
also ne essary, sin e the ondition implies that this submatrix must have at least
Sin e this matrix has

G1

rows.

G + K = G G + K
rows, we obtain

G G + K G 1
or

K G 1
whi h is the previously derived ne essary ondition.
The above result is fairly intuitive (draw pi ture here). The ne essary ondition ensures
that there are enough variables not in the equation of interest to potentially move the other
equations, so as to tra e out the equation of interest. The su ient ondition ensures that
those other equations in fa t do move around as the variables hange their values.

Some

points:

When an equation has

K = G 1,

is is

exa tly identied,

in that omission of an

identiying restri tion is not possible without loosing onsisten y.


When

K > G 1,

the equation is

and still retain onsisten y.

overidentied,

sin e one ould drop a restri tion

Overidentifying restri tions are therefore testable.

When

an equation is overidentied we have more instruments than are stri tly ne essary for
onsistent estimation. Sin e estimation by IV with more instruments is more e ient
asymptoti ally, one should employ overidentifying restri tions if one is ondent that
they're true.

We an repeat this partition for ea h equation in the system, to see whi h equations are

These results are valid assuming that the only identifying information omes from know-

identied and whi h aren't.

ing whi h variables appear in whi h equations, e.g., by ex lusion restri tions, and through
the use of a normalization. There are other sorts of identifying information that an be
used. These in lude

1. Cross equation restri tions


2. Additional restri tions on parameters within equations (as in the Klein model dis ussed below)

160

CHAPTER 11.

EXOGENEITY AND SIMULTANEITY

3. Restri tions on the ovarian e matrix of the errors

4. Nonlinearities in variables

When these sorts of information are available, the above onditions aren't ne essary for
identi ation, though they are of ourse still su ient.

To give an example of how other information an be used, onsider the model

Y = XB + E
where

system

is an upper triangular matrix with 1's on the main diagonal.

This is a

triangular

of equations. In this ase, the rst equation is

y1 = XB1 + E1
Sin e only exogs appear on the rhs, this equation is identied.

The se ond equation is

y2 = 21 y1 + XB2 + E2
This equation has

K = 0

ex luded exogs, and

G = 2

in luded endogs, so it fails the order

(ne essary) ondition for identi ation.

However, suppose that we have the restri tion

21 = 0,

so that the rst and se ond

stru tural errors are un orrelated. In this ase



E(y1t 2t ) = E (Xt B1 + 1t )2t = 0

so there's no problem of simultaneity. If the entire

the same logi , all of the equations are identied.


model.

matrix is diagonal, then following

This is known as a

fully re ursive

11.5.

161

IDENTIFICATION BY EXCLUSION RESTRICTIONS

11.5.3 Example: Klein's Model 1


To give an example of determining identi ation status, onsider the following ma ro model
(this is the widely known Klein's Model 1)

Ct = 0 + 1 Pt + 2 Pt1 + 3 (Wtp + Wtg ) + 1t

Consumption:
Investment:

Wtp = 0 + 1 Xt + 2 Xt1 + 3 At + 3t

Private Wages:

Xt = Ct + It + Gt

Output:
Prots:
Capital Sto k:

It = 0 + 1 Pt + 2 Pt1 + 3 Kt1 + 2t

1t

Pt = Xt Tt Wtp
Kt = Kt1 + It

11 12 13


2t IID 0 ,
0
3t
The other variables are the government wage bill,
ing,

Gt ,and

a time trend,

At .

Wtg ,

22 23
33

taxes,

Tt ,

government nonwage spend-

The endogenous variables are the lhs variables,

Yt =

Ct It Wtp Xt Pt Kt

and the predetermined variables are all others:

Xt

Wtg

Gt Tt At Pt1 Kt1 Xt1

The model assumes that the errors of the equations are ontemporaneously orrelated, by
nonauto orrelated. The model written as

Y = XB + E

3 0
0
0
0

B=

1 0

0 0 0 0 0
3 0

2 2 0
0

3 0

1 0
0

0 1 0

0 0
0

0 0
0

0 0
1

0 0
0
0 0

1
0

1 0

1
0

0
1

1 0

1
0
1 1

1 1 0

gives

162

CHAPTER 11.

EXOGENEITY AND SIMULTANEITY

To he k this identi ation of the onsumption equation, we need to extra t


submatri es of oe ients of endogs and exogs that

don't

32

and

B22 ,

the

appear in this equation. These are

the rows that have zeros in the rst olumn, and we need to drop the rst olumn. We get

"

32
B22

1 0

1 1

1 0

1 0

3 0

We need to nd a set of 5 rows of this matrix gives a full-rank 55 matrix.

For example,

sele ting rows 3,4,5,6, and 7 we obtain the matrix

A=
0

0
3

0
0
3
0

0 0

0 1 0

0 0
0
0 0
1
1 0

This matrix is of full rank, so the su ient ondition for identi ation is met.

in luded endogs, G

= 3,

and ounting ex luded exogs, K

= 5,

Counting

so

K L = G 1
5L
L

=31
=3

The equation is over-identied by three restri tions, a ording to the ounting rules,
whi h are orre t when the only identifying information are the ex lusion restri tions.
However, there is additional information in this ase.

Both

Wtp

and

Wtg

onsumption equation, and their oe ients are restri ted to be the same.

enter the
For this

reason the onsumption equation is in fa t overidentied by four restri tions.

11.6 2SLS
When we have no information regarding ross-equation restri tions or the stru ture of the
error ovarian e matrix, one an estimate the parameters of a single equation of the system
without regard to the other equations.

This isn't always e ient, as we'll see, but it has the advantage that misspe i ations
in other equations will not ae t the onsisten y of the estimator of the parameters of

11.6.

163

2SLS

the equation of interest.

Also, estimation of the equation won't be ae ted by identi ation problems in other
equations.

The 2SLS estimator is very simple: in the rst stage, ea h olumn of


weakly exogenous variables in the system, e.g., the entire

Y1

is regressed on

all

the

matrix. The tted values are

Y1 = X(X X)1 X Y1
= PX Y1
1
= X
Sin e these tted values are the proje tion of

Y1

on the spa e spanned by

1
ve tor in this spa e is un orrelated with by assumption, Y
simply the redu ed-form predi tion, it is orrelated with

X,

and sin e any

is un orrelated with

. Sin e Y1

is

Y1 , The only other requirement is that

the instruments be linearly independent. This should be the ase when the order ondition is
satised, sin e there are more olumns in
The se ond stage substitutes

Y1

X2

than in

in pla e of

Y1 ,

Y1

in this ase.

and estimates by OLS. This original model

is

y = Y1 1 + X1 1 +
= Z +
and the se ond stage model is

y = Y1 1 + X1 1 + .
Sin e

X1

is in the spa e spanned by

X, PX X1 = X1 ,

so we an write the se ond stage model

as

y = PX Y1 1 + PX X1 1 +
PX Z +
The OLS estimator applied to this model is

= (Z PX Z)1 Z PX y
whi h is exa tly what we get if we estimate using IV, with the redu ed form predi tions of the
endogs used as instruments. Note that if we dene

Z = PX Z
h
i
=
Y1 X1

164

CHAPTER 11.

so that

are the instruments for

Z,

EXOGENEITY AND SIMULTANEITY

then we an write

= (Z Z)1 Z y

Important note:
estimate of
However

OLS on the transformed model an be used to al ulate the 2SLS

, sin e we see that it's equivalent to IV using a parti ular set of instruments.

the OLS ovarian e formula is not valid.

We need to apply the IV ovarian e

formula already seen above.

A tually, there is also a simpli ation of the general IV varian e formula. Dene

Z = PX Z
h
i
=
Y X
The IV ovarian e estimator would ordinarily be


1 

1
2
= Z Z
V ()
Z Z Z Z

IV
However, looking at the last term in bra kets

Z Z =
but sin e

PX

Y1 X1

i h

is idempotent and sin e

Y1 X1

i h

Y1 X1

PX X = X,

Y1 X1

=
=

"
h

"

Y1 (PX )Y1 Y1 (PX )X1


X1 Y1

X1 X1

we an write

Y1 PX PX Y1 Y1 PX X1
X1 X1
X1 PX Y1
i h
i
Y1 X1
Y1 X1

= Z Z

Therefore, the se ond and last term in the varian e formula an el, so the 2SLS var ov estimator simplies to


1
2
= Z Z
V ()

IV

whi h, following some algebra similar to the above, an also be written as


1
2
= Z Z
V ()

IV
Finally, re all that though this is presented in terms of the rst equation, it is general sin e
any equation an be pla ed rst.

Properties of 2SLS:
1. Consistent
2. Asymptoti ally normal

11.7.

165

TESTING THE OVERIDENTIFYING RESTRICTIONS

3. Biased when the mean esists (the existen e of moments is a te hni al issue we won't go
into here).

4. Asymptoti ally ine ient, ex ept in spe ial ir umstan es (more on this later).

11.7 Testing the overidentifying restri tions


The sele tion of whi h variables are endogs and whi h are exogs

the model.

is part of the spe i ation of

As su h, there is room for error here: one might erroneously lassify a variable as

exog when it is in fa t orrelated with the error term. A general test for the spe i ation on
the model an be formulated as follows:
The IV estimator an be al ulated by applying OLS to the transformed model, so the IV
obje tive fun tion at the minimized value is





s(IV ) = y X IV PW y X IV ,
but

IV

= y X IV

= y X(X PW X)1 X PW y

= I X(X PW X)1 X PW y

= I X(X PW X)1 X PW (X + )
= A (X + )

where

A I X(X PW X)1 X PW
so

Moreover,

A PW A

A PW A =
=
=
Furthermore,


s(IV ) = + X A PW A (X + )

is idempotent, as an be veried by multipli ation:



I PW X(X PW X)1 X PW I X(X PW X)1 X PW


PW PW X(X PW X)1 X PW PW PW X(X PW X)1 X PW

I PW X(X PW X)1 X PW .

is orthogonal to

AX =


I X(X PW X)1 X PW X

= X X
= 0

166

CHAPTER 11.

EXOGENEITY AND SIMULTANEITY

so

s(IV ) = A PW A
Supposing the

are normally distributed, with varian e

2 ,

then the random variable

A PW A
s(IV )
=
2
2
is a quadrati form of a

N (0, 1)

random variable with an idempotent matrix in the middle, so

s(IV )
2 ((A PW A))
2
This isn't available, sin e we need to estimate

Even if the

2 .

Substituting a onsistent estimator,

s(IV ) a 2
((A PW A))
c
2

aren't normally distributed, the asymptoti result still holds.

The last

thing we need to determine is the rank of the idempotent matrix. We have

A PW A = PW PW X(X PW X)1 X PW
so

(A PW A) = T r PW PW X(X PW X)1 X PW

= T rPW T rX PW PW X(X PW X)1

= T rW (W W )1 W KX

= T rW W (W W )1 KX
= KW KX

where

KW

is the number of olumns of

and

KX

is the number of olumns of

X.

The degrees of freedom of the test is simply the number of overidentifying restri tions:
the number of instruments we have beyond the number that is stri tly ne essary for
onsistent estimation.

This test is an overall spe i ation test:


is orre tly spe ied

and

that the

the joint null hypothesis is that the model

form valid instruments (e.g., that the variables

lassied as exogs really are un orrelated with


model

y = Z +

Reje tion an mean that either the

is misspe ied, or that there is orrelation between

and

This is a parti ular ase of the GMM riterion test, whi h is overed in the se ond half
of the ourse. See Se tion 15.8.

Note that sin e

IV = A

11.7.

167

TESTING THE OVERIDENTIFYING RESTRICTIONS

and

s(IV ) = A PW A
we an write

s(IV )
c2

where

Ru2



W (W W )1 W W (W W )1 W
=
/n
= n(RSSIV |W /T SSIV )
= nRu2

is the un entered

instruments

W.

R2

from a regression of the

IV

residuals on all of the

This is a onvenient way to al ulate the test statisti .

On an aside, onsider IV estimation of a just-identied model, using the standard notation

y = X +
and
so

is the matrix of instruments. If we have exa t identi ation then

W X

cols(W ) = cols(X),

is a square matrix. The transformed model is

PW y = PW X + PW
and the fon are

X PW (y X IV ) = 0
The IV estimator is

IV = X PW X
Considering the inverse here

X PW X

Now multiplying this by

IV

1

X PW y,

1

X PW y

X W (W W )1 W X

1

1
= (W X)1 X W (W W )1
1
= (W X)1 (W W ) X W

we obtain

= (W X)1 (W W ) X W
= (W X)1 (W W ) X W
= (W X)1 W y

1

1

X PW y
X W (W W )1 W y

168

CHAPTER 11.

EXOGENEITY AND SIMULTANEITY

The obje tive fun tion for the generalized IV estimator is

s(IV ) =
=
=
=
=




y X IV PW y X IV





X PW y X IV
y PW y X IV IV



X PW y + IV
X PW X IV
y PW y X IV IV





X PW y + X PW X IV
y PW y X IV IV


y PW y X IV


by the fon for generalized IV. However, when we're in the just indentied ase, this is


s(IV ) = y PW y X(W X)1 W y

= y PW I X(W X)1 W y


= y W (W W )1 W W (W W )1 W X(W X)1 W y
= 0

The value of the obje tive fun tion of the IV estimator is zero in the just identied ase.
makes sense, sin e we've already shown that the obje tive fun tion after dividing by
asymptoti ally

This

is

with degrees of freedom equal to the number of overidentifying restri tions.

In the present ase, there are no overidentifying restri tions, so we have a

2 (0)

rv, whi h has

mean 0 and varian e 0, e.g., it's simply 0. This means we're not able to test the identifying
restri tions in the ase of exa t identi ation.

11.8 System methods of estimation


2SLS is a single equation method of estimation, as noted above. The advantage of a single
equation method is that it's unae ted by the other equations of the system, so they don't
need to be spe ied (ex ept for dening what are the exogs, so 2SLS an use the omplete set
of instruments). The disadvantage of 2SLS is that it's ine ient, in general.

Re all that overidenti ation improves e ien y of estimation, sin e an overidentied


equation an use more instruments than are ne essary for onsistent estimation.

Se ondly, the assumption is that

Y = XB + E
E(X E) = 0(KG)
vec(E) N (0, )

Sin e there is no auto orrelation of the

Et

's, and sin e the olumns of

are individually

11.8.

169

SYSTEM METHODS OF ESTIMATION

homos edasti , then

11 In 12 In 1G In
.
.
.

22 In
..

.
.
.

GG In

= In

This means that the stru tural equations are heteros edasti and orrelated with one
another

In general, ignoring this will lead to ine ient estimation, following the se tion on GLS.
When equations are orrelated with one another estimation should a ount for the orrelation in order to obtain e ien y.

Also, sin e the equations are orrelated, information about one equation is impli itly information about all equations. Therefore, overidenti ation restri tions in any equation
improve e ien y for

all

equations, even the just identied equations.

Single equation methods an't use these types of information, and are therefore ine ient
(in general).

11.8.1 3SLS
Note: It is easier and more pra ti al to treat the 3SLS estimator as a generalized method of
moments estimator (see Chapter 15). I no longer tea h the following se tion, but it is retained
Another alternative is to use FIML (Subse tion 11.8.2),

for its possible histori al interest.

if you are willing to make distributional assumptions on the errors. This is omputationally
feasible with modern omputers.
Following our above notation, ea h stru tural equation an be written as

yi = Yi 1 + Xi 1 + i
= Zi i + i
Grouping the

equations together we get

y1

y2
.
.
.
yG
or

Z1 0



0
=

.
.

.

0
.
.
.

Z2
..

0
ZG

y = Z +

.
..


2
+ .
.
.
G

170

CHAPTER 11.

EXOGENEITY AND SIMULTANEITY

where we already have that

E( ) =
= In
The 3SLS estimator is just 2SLS ombined with a GLS orre tion that takes advantage of the
stru ture of

Dene

as

X(X X)1 X Z1 0

Z = .
..

Y1 X1

= .
.
.
0

X(X X)1 X Z2

.
.
.

..

0
Y2 X2

0
0
.
.
.

..

These instruments are simply the

unrestri ted

0
YG XG

X(X X)1 X ZG

rf predi itions of the endogs, ombined with

the exogs. The distin tion is that if the model is overidentied, then

= B1
may be subje t to some zero restri tions, depending on the restri tions on
does not impose these restri tions.

Also, note that

and

B,

and

is al ulated using OLS equation by

equation. More on this later.


The 2SLS estimator would be

= (Z Z)1 Z y
as an be veried by simple multipli ation, and noting that the inverse of a blo k-diagonal
matrix is just the matrix with the inverses of the blo ks on the main diagonal.

This IV

estimator still ignores the ovarian e information. The natural extension is to add the GLS
transformation, putting the inverse of the error ovarian e into the formula, whi h gives the
3SLS estimator

3SLS


1
Z ( In )1 Z
Z ( In )1 y


 1 1
=
Z In y
Z 1 In Z
=

This estimator requires knowledge of


a onsistent estimator of

The solution is to dene a feasible estimator using

The obvious solution is to use an estimator based on the 2SLS

residuals:

i = yi Zi i,2SLS

11.8.

171

SYSTEM METHODS OF ESTIMATION

(IMPORTANT NOTE:

this is al ulated using

estimated by

ij =
Substitute

Zi ,

not

Zi ).

Then the element

i, j

of

is

i j
n

into the formula above to get the feasible 3SLS estimator.

Analogously to what we did in the ase of 2SLS, the asymptoti distribution of the 3SLS
estimator an be shown to be

!
Z ( I )1 Z 1


a
n

n 3SLS N 0, lim E
n

A formula for estimating the varian e of the 3SLS estimator in nite samples ( an elling out
the powers of

n)

is


  
 1
1 In Z
V 3SLS = Z

??), ombined with the GLS orre -

This is analogous to the 2SLS formula in equation (

In the ase that all equations are just identied, 3SLS is numeri ally equivalent to 2SLS.

tion.

Proving this is easiest if we use a GMM interpretation of 2SLS and 3SLS. GMM is
presented in the next e onometri s ourse. For now, take it on faith.

The 3SLS estimator is based upon the rf parameter estimator

al ulated equation by

equation using OLS:

= (X X)1 X Y

whi h is simply

= (X X)1 X

that is, OLS equation by equation using

all

y1 y2 yG

the exogs in the estimation of ea h olumn of

It may seem odd that we use OLS on the redu ed form, sin e the rf equations are orrelated:

Yt = Xt B1 + Et 1
= Xt + Vt
and





Vt = 1 Et N 0, 1 1 , t

Let this var- ov matrix be indi ated by


= 1 1

172

CHAPTER 11.

EXOGENEITY AND SIMULTANEITY

OLS equation by equation to get the rf is equivalent to

y1

y2
.
.
.
yG
where

yi

is the

n1

X 0



0
=

.
.

.

.
.
0
.
G
X
.
.
.

X
..

ve tor of observations of the

th olumn of
exogs, i is the i

ith

v1


v2
+ .
.
.
vG

endog,

th olumn of
and vi is the i

V.

is the entire

nK

matrix of

Use the notation

y = X + v
to indi ate the pooled model. Following this notation, the error ovarian e matrix is

V (v) = In

This is a spe ial ase of a type of model known as a set of

(SUR)

seemingly unrelated equations

sin e the parameter ve tor of ea h equation is dierent. The equations are on-

temporanously orrelated, however. The general ase would have a dierent

Xi

for ea h

equation.

Note that ea h equation of the system individually satises the lassi al assumptions.

However, pooled estimation using the GLS orre tion is more e ient, sin e equationby-equation estimation is equivalent to pooled estimation, sin e

X is blo k diagonal,

but

ignoring the ovarian e information.

The model is estimated by GLS, where

In the spe ial ase that all the

is estimated using the OLS residuals from

equation-by-equation estimation, whi h are onsistent.

Xi

are the same, whi h is true in the present ase

of estimation of the rf parameters, SUR

X = In X.

Using the rules

1.

(A B)1 = (A1 B 1 )

2.

(A B) = (A B )

and

OLS.

To show this note that in this ase

11.8.

3.

(A B)(C D) = (AC BD),

SU R =
=
=
=

173

SYSTEM METHODS OF ESTIMATION

we get

1
(In X) ( In )1 (In X)
(In X) ( In )1 y

1 1

1 X (In X)
X y


(X X)1 1 X y


IG (X X)1 X y



.2
.
.

So the unrestri ted rf oe ients an be estimated e iently (assuming normality) by


OLS, even if the equations are orrelated.

We have ignored any potential zeros in the matrix

, whi h if they exist ould potentially

in rease the e ien y of estimation of the rf.

Another example where SUROLS is in estimation of ve tor autoregressions. See two


se tions ahead.

11.8.2 FIML
Full information maximum likelihood is an alternative estimation method.

FIML will be

asymptoti ally e ient, sin e ML estimators based on a given information set are asymptoti ally e ient w.r.t. all other estimators that use the same information set, and in the ase
of the full-information ML estimator we use the entire information set. The 2SLS and 3SLS
estimators don't require distributional assumptions, while FIML of ourse does. Our model
is, re all

Yt = Xt B + Et
Et N (0, ), t

E(Et Es ) = 0, t 6= s
The joint normality of

Et

means that the density for

g/2

(2)
The transformation from

Et

to

Yt

det


1 1/2

Et

1
exp Et 1 Et
2

requires the Ja obian

| det

is the multivariate normal, whi h is

dEt
| = | det |
dYt

174

CHAPTER 11.

so the density for

Yt

EXOGENEITY AND SIMULTANEITY

is

(2)G/2 | det | det 1

1/2





1
exp Yt Xt B 1 Yt Xt B
2

Given the assumption of independen e over time, the joint log-likelihood fun tion is

ln L(B, , ) =



nG
n
1X
ln(2)+n ln(| det |) ln det 1
Yt Xt B 1 Yt Xt B
2
2
2 t=1

This is a nonlinear in the parameters obje tive fun tion. Maximixation of this an be

It turns out that the asymptoti distribution of 3SLS and FIML are the same,

One an al ulate the FIML estimator by iterating the 3SLS estimator, thus avoiding

done using iterative numeri methods. We'll see how to do this in the next se tion.

normality of the errors.

assuming

the use of a nonlinear optimizer. The steps are

1. Cal ulate

3SLS

2. Cal ulate

=B
3SLS
1 .

3SLS

and

3SLS
B

as normal.
This is new, we didn't estimate

in this way before.

This estimator may have some zeros in it. When Greene says iterated 3SLS doesn't
lead to FIML, he means this for a pro edure that doesn't update
updates

and

and

If you update

3. Cal ulate the instruments

Y = X

you

do

but only

onverge to FIML.

and al ulate

using

and

to get the

estimated errors, applying the usual estimator.


4. Apply 3SLS using these new instruments and the estimate of

5. Repeat steps 2-4 until there is no hange in the parameters.

FIML is fully e ient, sin e it's an ML estimator that uses all information. This implies
that 3SLS is fully e ient

when the errors are normally distributed.

Also, if ea h equation

is just identied and the errors are normal, then 2SLS will be fully e ient, sin e in this
ase 2SLS3SLS.

When the errors aren't normally distributed, the likelihood fun tion is of ourse dierent
than what's written above.

11.9 Example: 2SLS and Klein's Model 1


The O tave program Simeq/Klein.m performs 2SLS estimation for the 3 equations of Klein's
model 1, assuming nonauto orrelated errors, so that lagged endogenous variables an be used
as instruments. The results are:

CONSUMPTION EQUATION

11.9.

175

EXAMPLE: 2SLS AND KLEIN'S MODEL 1

*******************************************************
2SLS estimation results
Observations 21
R-squared 0.976711
Sigma-squared 1.044059

Constant
Profits
Lagged Profits
Wages

estimate
16.555
0.017
0.216
0.810

st.err.
1.321
0.118
0.107
0.040

t-stat.
12.534
0.147
2.016
20.129

p-value
0.000
0.885
0.060
0.000

*******************************************************
INVESTMENT EQUATION
*******************************************************
2SLS estimation results
Observations 21
R-squared 0.884884
Sigma-squared 1.383184

Constant
Profits
Lagged Profits
Lagged Capital

estimate
20.278
0.150
0.616
-0.158

st.err.
7.543
0.173
0.163
0.036

t-stat.
2.688
0.867
3.784
-4.368

p-value
0.016
0.398
0.001
0.000

*******************************************************
WAGES EQUATION
*******************************************************
2SLS estimation results
Observations 21
R-squared 0.987414
Sigma-squared 0.476427

Constant
Output
Lagged Output
Trend

estimate
1.500
0.439
0.147
0.130

st.err.
1.148
0.036
0.039
0.029

t-stat.
1.307
12.316
3.777
4.475

p-value
0.209
0.000
0.002
0.000

176

CHAPTER 11.

EXOGENEITY AND SIMULTANEITY

*******************************************************

The above results are not valid (spe i ally, they are in onsistent) if the errors are auto orrelated, sin e lagged endogenous variables will not be valid instruments in that ase. You
might onsider eliminating the lagged endogenous variables as instruments, and re-estimating
by 2SLS, to obtain onsistent parameter estimates in this more omplex ase. Standard errors
will still be estimated in onsistently, unless use a Newey-West type ovarian e estimator. Food
for thought...

Chapter 12
Introdu tion to the se ond half
We'll begin with study of
on a sample of size

extremum estimators in general.

sn (Zn , )

Zn

be the available data, based

n.

[Extremum estimator An extremum estimator


fun tion

Let

over a set

is

the optimizing element of an obje tive

We'll usually write the obje tive fun tion suppressing the dependen e on

Zn .

Example: Least squares, linear model


yt = xt 0 + t , t = 1, 2, ..., 
n, 0 . Sta king observations verti ally,

. The least squares estimator is dened as


yn = Xn 0 + n , where Xn = x1 x2 xn
Let the d.g.p. be

arg min sn () = (1/n) [yn Xn ] [yn Xn ]

We readily nd that

= (X X)1 X y.

Example: Maximum likelihood


Suppose that the ontinuous random variable
estimator is dened as

arg max Ln () =

n
Y

(2)

yt IIN ( 0 , 1).

1/2

t=1

The maximum likelihood

(yt )2
exp
2
(0, ),
as
same

Be ause the logarithmi fun tion is stri tly in reasing on

maximization of the average

logarithm of the likelihood fun tion is a hieved at the

for the likelihood fun tion:

arg max sn () = (1/n) ln Ln () = 1/2 ln 2 (1/n)

Solution of the f.o. . leads to the familiar result that

n
X
(yt )2
t=1

.
= y

sup-

MLE estimators are asymptoti ally e ient (Cramr-Rao lower bound, Theorem3),

One an investigate the properties of an ML estimator supposing that the distributional

posing the strong distributional assumptions upon whi h they are based are true.
assumptions are in orre t. This gives a

quasi-ML estimator, whi h we'll study later.

177

178

CHAPTER 12.

INTRODUCTION TO THE SECOND HALF

The strong distributional assumptions of MLE may be questionable in many ases. It is


possible to estimate using weaker distributional assumptions based only on some of the
moments of a random variable(s).

Example: Method of moments


Suppose we draw a random sample of

yt

from the

1 = 1 ( 0 )

is a

distribution.

Here,

is the

1 , of a random variable will in general

parameter of interest. The rst moment (expe tation),


be a fun tion of the parameters of the distribution,

2 ( 0 )

i.e., 1 ( 0 ) .

moment-parameter equation.
1 ( 0 ) = 0 ,

In this example, the relationship is the identity fun tion

though in general

the relationship may be more ompli ated. The sample rst moment is

c1 =

n
X

yt /n.

t=1

Dene

The method of moments prin iple is to hoose the estimator of the parameter to set the

m1 () = 1 ()
c1

0.
, i.e., m1 ()

estimate of the population moment equal to the sample moment

Then

the moment-parameter equation is inverted to solve for the parameter estimate.

In this ase,

=
m1 ()
Sin e

Pn

t=1 yt /n

n
X

yt /n = 0.

t=1

by the LLN, the estimator is onsistent.

More on the method of moments


Continuing with the above example, the varian e of a

V (yt ) = E yt 0

Dene

m2 () = 2

2

2 ( 0 )

r.v. is

= 2 0 .

Pn

t=1 (yt

y)2

The MM estimator would set

= 2
m2 ()

Pn

t=1 (yt

y)2

0.

Again, by the LLN, the sample varian e is onsistent for the true varian e, that is,

Pn

t=1 (yt

y)2

2 0 .

179

So,

Pn

y)2

t=1 (yt

2n

whi h is obtained by inverting the moment-parameter equation, is onsistent.

Example: Generalized method of moments (GMM)


The previous two examples give two estimators of

whi h are both onsistent. With a

given sample, the estimators will be dierent in general.

With two moment-parameter equations and only one parameter, we have

overidenti a-

tion, whi h means that we have more information than is stri tly ne essary for onsistent
estimation of the parameter.

The GMM ombines information from the two moment-parameter equations to form a
new estimator whi h will be

From the rst example, dene


average of

m1t (),

i.e.,

more e ient, in general (proof of this below).

m1t () = yt .

m1 () = 1/n
=

We already have that

n
X

t=1
n
X

m1 ()

is the sample

m1t ()
yt /n.

t=1

Clearly, when evaluated at the true parameter value

0.





0 , both E m1t ( 0 ) = 0 and E m1 ( 0 ) =

From the se ond example we dene additional moment onditions

m2t () = 2 (yt y)2


and

m2 () = 2
Again, it is lear from the LLN that
either

=0
m1 ()

or

= 0.
m2 ()

Pn

t=1 (yt

a.s.

m2 ( 0 ) 0.

y)2

The MM estimator would hose

In general, no single value of

to set

will solve the two equations

simultaneously.

The GMM estimator is based on dening a measure of distan e

(m1 (), m2 ()) ,

d(m()),

where

m() =

and hoosing

= arg min sn () = d (m()) .

An example would be to hoose

d(m) = m Am,

where

is a positive denite matrix. While

it's lear that the MM gives onsistent estimates if there is a one-to-one relationship between

180

CHAPTER 12.

INTRODUCTION TO THE SECOND HALF

parameters and moments, it's not immediately obvious that the GMM estimator is onsistent.
(We'll see later that it is.)
These examples show that these widely used estimators may all be interpreted as the
solution of an optimization problem.
useful for its generality.

For this reason, the study of extremum estimators is

We will see that the general results extend smoothly to the more

spe ialized results available for spe i estimators.

After studying extremum estimators in

general, we will study the GMM estimator, then QML and NLS. The reason we study GMM
rst is that LS, IV, NLS, MLE, QML and other well-known parametri estimators may all be
interpreted as spe ial ases of the GMM estimator, so the general results on GMM an simplify
and unify the treatment of these other estimators. Nevertheless, there are some spe ial results
on QML and NLS, and both are important in empiri al resear h, whi h makes fo us on them
useful.

One of the fo al points of the ourse will be nonlinear models.

This is not to suggest that

linear models aren't useful. Linear models are more general than they might rst appear, sin e
one an employ nonlinear transformations of the variables:

0 (yt ) =
For example,

1 (xt ) 2 (xt ) p (xt )

0 + t

ln yt = + x1t + x21t + x1t x2t + t


ts this form.

The important point is that the model is

linear in the variables.

linear in the parameters

but not ne essarily

In spite of this generality, situations often arise whi h simply an not be onvin ingly represented by linear in the parameters models. Also, theory that applies to nonlinear models also
applies to linear models, so one may as well start o with the general ase.

Example: Expenditure shares


Roy's Identity states that the quantity demanded of the

xi =

ith

of

goods is

v(p, y)/pi
.
v(p, y)/y

An expenditure share is

so ne essarily

si [0, 1],

and

PG

i=1 si

si pi xi /y,
= 1.

No linear in the parameters model for

xi

or

si

with

a parameter spa e that is dened independent of the data an guarantee that either of these
onditions holds. These onstraints will often be violated by estimated linear models, whi h
alls into question their appropriateness in ases of this sort.

Example: Binary limited dependent variable


The referendum ontingent valuation (CV) method of infering the so ial value of a proje t
provides a simple example. This example is a spe ial ase of more general dis rete hoi e (or

181

binary response) models. Individuals are asked if they would pay an amount
a proje t. Indire t utility in the base ase (no proje t) is

v 0 (m, z) + 0 , where m is in ome and

is a ve tor of other variables su h as pri es, personal hara teristi s,

1
utility is v (m, z)

1 . The random terms

i , i
1

population. With this, an individual agrees

0
1}
| {z

= 0 1 ,

Dene

y =1

let

olle t

<

= 1, 2,

to pay

A for provision of

et .

After provision,

ree t variations of preferen es in the


if

v 1 (m A, z) v 0 (m, z)
{z
}
|
v(w, A)

z, and let v(w, A) = v 1 (m A, z) v 0 (m, z). Dene


A for the hange, y = 0 otherwise. The probability of

and

if the onsumer agrees to pay

agreement is

Pr(y = 1) = F [v(w, A)] .


To simplify notation, dene
that

p(w, A) F [v(w, A)] .

(12.1)

To make the example spe i , suppose

v 1 (m, z) = m

v 0 (m, z) = m
and

and

are i.i.d.

extreme value random variables.

That is, utility depends only on

in ome, preferen es in both states are homotheti , and a spe i distributional assumption is
made on the distribution of preferen es in the population. With these assumptions (the details
are unimportant here, see arti les by D. M Fadden if you're interested) it an be shown that

p(A, ) = ( + A) ,
where

(z) is

the logisti distribution fun tion

(z) = (1 + exp(z))1 .
This is the simple logit model:

the hoi e probability is the logit fun tion of a linear in

parameters fun tion.


Now,

is either

or 1, and the expe ted value of

is

( + A)

. Thus, we an write

y = ( + A) +
E() = 0.
One ould estimate this by (nonlinear) least squares


1


1X
(y ( + A))2
,
= arg min
n t

We assume here that responses are truthful, that is there is no strategi behavior and that individuals are

able to order their preferen es in this hypotheti al situation.

182

CHAPTER 12.

The main point is that it is impossible that

INTRODUCTION TO THE SECOND HALF

( + A)

parameters model, in the sense that, for arbitrary

A,

an be written as a linear in the

there are no

, (A)

su h that

( + A) = (A) , A
where

(A)

is a

be ause for any

1,

p-ve tor
,

valued fun tion of

we an always nd a

and

is a

dimensional parameter. This is

su h that (A) will be negative or greater than

whi h is illogi al, sin e it is the expe tation of a 0/1 binary random variable.

Sin e this

sort of problem o urs often in empiri al work, it is useful to study NLS and other nonlinear
models.
After dis ussing these estimation methods for parametri models we'll briey introdu e

nonparametri estimation methods.

These methods allow one, for example, to estimate

f (xt )

onsistently when we are not willing to assume that a model of the form

yt = f (xt ) + t
an be restri ted to a parametri form

yt = f (xt , ) + t
Pr(t < z) = F (z|, xt )
,
where

f ()

and perhaps

F (z|, xt )

are of known fun tional form.

This is important sin e

e onomi theory gives us general information about fun tions and the signs of their derivatives,
but not about their spe i form.
Then we'll look at simulation-based methods in e onometri s. These methods allow us to
substitute omputer power for mental power.

Sin e omputer power is be oming relatively

heap ompared to mental eort, any e onometri ian who lives by the prin iples of e onomi
theory should be interested in these te hniques.
Finally, we'll look at how e onometri omputations an be done in parallel on a luster of
omputers. This allows us to harness more omputational power to work with more omplex
models that an be dealt with using a desktop omputer.

Chapter 13
Numeri optimization methods
Readings:

Hamilton, h. 5, se tion 7 (pp. 133-139)

; Gourieroux and Monfort, Vol. 1, h.

13, pp. 443-60 ; Goe, et. al. (1994).


If we're going to be applying extremum estimators, we'll need to know how to nd an
extremum.

This se tion gives a very brief introdu tion to what is a large literature on nu-

meri optimization methods. We'll onsider a few well-known te hniques, and one fairly new
te hnique that may allow one to solve di ult problems.

The main obje tive is to be ome

familiar with the issues, and to learn how to use the BFGS algorithm at the pra ti al level.
The general problem we onsider is how to nd the maximizing element
of a fun tion

s().

(a K

-ve tor)

This fun tion may not be ontinuous, and it may not be dierentiable.

Even if it is twi e ontinuously dierentiable, it may not be globally on ave, so lo al maxima,


minima and saddlepoints may all exist. Supposing

s()

were a quadrati fun tion of

e.g.,

1
s() = a + b + C,
2
the rst order onditions would be linear:

D s() = b + C
so the maximizing (minimizing) element would be

= C 1 b.

This is the sort of problem we

have with linear models estimated by OLS. It's also the ase for feasible GLS, sin e onditional
on the estimate of the var ov matrix, we have a quadrati obje tive fun tion in the remaining
parameters.
More general problems will not have linear f.o. ., and we will not be able to solve for the
maximizer analyti ally. This is when we need a numeri optimization method.

13.1 Sear h
The idea is to reate a grid over the parameter spa e and evaluate the fun tion at ea h point
on the grid. Sele t the best point. Then rene the grid in the neighborhood of the best point,
and ontinue until the a ura y is good enough. See Figure

183

??.

One has to be areful that

184

CHAPTER 13.

NUMERIC OPTIMIZATION METHODS

the grid is ne enough in relationship to the irregularity of the fun tion to ensure that sharp
peaks are not missed entirely.
To he k

values in ea h dimension of a

q K points. For example, if

q = 100

and

dimensional parameter spa e, we need to he k

K = 10,

there would be

10010

points to he k. If 1000

9
points an be he ked in a se ond, it would take 3. 171 10 years to perform the al ulations,

whi h is approximately the age of the earth. The sear h method is a very reasonable hoi e if

is small, but it qui kly be omes infeasible if

is moderate or large.

13.2 Derivative-based methods


13.2.1 Introdu tion
Derivative-based methods are dened by
1. the method for hoosing the initial value,
2. the iteration method for hoosing

k+1

given

(based upon derivatives)

3. the stopping riterion.


The iteration method an be broken into two problems: hoosing the stepsize

k
and hoosing the dire tion of movement, d , whi h is of the same dimension of

ak
,

(a s alar)

so that

(k+1) = (k) + ak dk .

A lo ally in reasing dire tion of sear h d is a dire tion su h that


a :
for

positive but small.

s( + ad)
>0
a

That is, if we go in dire tion

d,

we will improve on the obje tive

fun tion, at least if we don't go too far in that dire tion.

As long as the gradient at

is not zero there exist in reasing dire tions, and they an

k
k
all be represented as Q g( ) where
the gradient at

Qk

is a symmetri pd matrix and

To see this, take a T.S. expansion around

g () = D s()

is

a0 = 0

s( + ad) = s( + 0d) + (a 0) g( + 0d) d + o(1)


= s() + ag() d + o(1)

For small enough

we need g() d

> 0.

the

o(1)

Dening

term an be ignored. If

d = Qg(),

where

is to be an in reasing dire tion,

is positive denite, we guarantee that

g() d = g() Qg() > 0


unless

g() = 0. Every in reasing dire tion

are those su h that the angle between

an be represented in this way (p.d. matri es

and

Qg()

is less that 90 degrees). See Figure

13.2.

185

DERIVATIVE-BASED METHODS

Figure 13.1: In reasing dire tions of sear h

13.1.

With this, the iteration rule be omes

(k+1) = (k) + ak Qk g( k )
and we keep going until the gradient be omes zero, so that there is no in reasing dire tion.
The problem is how to hoose

Conditional on Q,

and

hoosing

attra tive possibility, sin e

Q.
a

is fairly straightforward.

A simple line sear h is an

is a s alar.

The remaining problem is how to hoose

Q.

Note also that this gives no guarantees to nd a global maximum.

13.2.2 Steepest des ent


Steepest des ent (as ent if we're maximizing) just sets

to and identity matrix, sin e the

gradient provides the dire tion of maximum rate of hange of the obje tive fun tion.

Advantages: fast - doesn't require anything more than rst derivatives.

186

CHAPTER 13.

Disadvantages:

NUMERIC OPTIMIZATION METHODS

This doesn't always work too well however (draw pi ture of banana

fun tion).

13.2.3 Newton-Raphson
The Newton-Raphson method uses information about the slope and urvature of the obje tive
fun tion to determine whi h dire tion and how far to move from an initial point. Supposing
we're trying to maximize

sn ().

Take a se ond order Taylor's series approximation of

sn ()

k
about (an initial guess).







sn () sn ( k ) + g( k ) k + 1/2 k H( k ) k
To attempt to maximize
depends on

sn (),

we an maximize the portion of the right-hand side that

i.e., we an maximize




s() = g( k ) + 1/2 k H( k ) k

with respe t to

. This is a mu h

easier problem, sin e it is a quadrati fun tion in

, so it has

linear rst order onditions. These are



D s() = g( k ) + H( k ) k

So the solution for the next round estimate is

k+1 = k H( k )1 g( k )
This is illustrated in Figure

??.

However, it's good to in lude a stepsize, sin e the approximation to


away from the maximizer

sn ()

may be bad far

so the a tual iteration formula is

k+1 = k ak H( k )1 g( k )

A potential problem is that the Hessian may not be negative denite when we're far from
the maximizing point. So

H( k )1

may not be positive denite, and

H( k )1 g( k )

may not dene an in reasing dire tion of sear h. This an happen when the obje tive
fun tion has at regions, in whi h ase the Hessian matrix is very ill- onditioned (e.g.,
is nearly singular), or when we're in the vi inity of a lo al minimum,
denite, and our dire tion is a

de reasing

H( k )

is positive

dire tion of sear h. Matrix inverses by om-

puters are subje t to large errors when the matrix is ill- onditioned. Also, we ertainly
don't want to go in the dire tion of a minimum when we're maximizing. To solve this
problem,

Quasi-Newton

methods simply add a positive denite omponent to

ensure that the resulting matrix is positive denite,


hosen large enough so that

e.g., Q = H() + bI,

is well- onditioned and positive denite.

benet that improvement in the obje tive fun tion is guaranteed.

H()

to

is

where

This has the

13.2.

187

DERIVATIVE-BASED METHODS

Another variation of quasi-Newton methods is to approximate the Hessian by using


su essive gradient evaluations.

This avoids a tual al ulation of the Hessian, whi h

is an order of magnitude (in the dimension of the parameter ve tor) more ostly than
al ulation of the gradient. They an be done to ensure that the approximation is p.d.
DFP and BFGS are two well-known examples.

Stopping riteria
The last thing we need is to de ide when to stop. A digital omputer is subje t to limited
ma hine pre ision and round-o errors. For these reasons, it is unreasonable to hope that a
program an

exa tly nd the point that maximizes a fun tion.

We need to dene a eptable

toleran es. Some stopping riteria are:

Negligable hange in parameters:

|jk jk1 | < 1 , j

Negligable relative hange:

jk jk1
jk1

| < 2 , j

Negligable hange of fun tion:

|s( k ) s( k1 )| < 3

Gradient negligibly dierent from zero:

|gj ( k )| < 4 , j

Or, even better, he k all of these.

Also, if we're maximizing, it's good to he k that the last round (real, not approximate)
Hessian is negative denite.

Starting values
The Newton-Raphson and related algorithms work well if the obje tive fun tion is on ave
(when maximizing), but not so well if there are onvex regions and lo al minima or multiple
lo al maxima. The algorithm may onverge to a lo al minimum or to a lo al maximum that
is not optimal. The algorithm may also have di ulties onverging at all.

The usual way to ensure that a global maximum has been found is to use many dierent
starting values, and hoose the solution that returns the highest obje tive fun tion value.

THIS IS IMPORTANT in pra ti e.

More on this later.

Cal ulating derivatives


The Newton-Raphson algorithm requires rst and se ond derivatives. It is often di ult to
al ulate derivatives (espe ially the Hessian) analyti ally if the fun tion

sn ()

is ompli ated.

188

CHAPTER 13.

NUMERIC OPTIMIZATION METHODS

Figure 13.2: Using MuPAD to get analyti derivatives

Possible solutions are to al ulate derivatives numeri ally, or to use programs su h as MuPAD
1

or Mathemati a to al ulate analyti derivatives. For example, Figure 13.2 shows MuPAD
al ulating a derivative that I didn't know o the top of my head, and one that I did know.

Numeri derivatives are less a urate than analyti derivatives, and are usually more
ostly to evaluate. Both fa tors usually ause optimization programs to be less su essful
when numeri derivatives are used.

One advantage of numeri derivatives is that you don't have to worry about having made
an error in al ulating the analyti derivative. When programming analyti derivatives
it's a good idea to he k that they are orre t by using numeri derivatives. This is a
lesson I learned the hard way when writing my thesis.

Numeri se ond derivatives are mu h more a urate if the data are s aled so that the
elements of the gradient are of the same order of magnitude. Example: if the model is

yt = h(xt + zt ) + t ,
D sn () = 0.001.

and estimation is by NLS, suppose that

One ould dene

In this ase, the gradients


1

D sn ()

= /1000;

and

D sn ()

xt

= 1000xt

D sn () = 1000
1000; zt

and

= zt /1000.

will both be 1.

MuPAD is not a freely distributable program, so it's not on the CD. You an download it from

http://www.mupad.de/download.shtml

13.3.

189

SIMULATED ANNEALING

In general, estimation programs always work better if data is s aled in this way, sin e
roundo errors are less likely to be ome important.

This is important in pra ti e.

There are algorithms (su h as BFGS and DFP) that use the sequential gradient evaluations to build up an approximation to the Hessian. The iterations are faster for this
reason sin e the a tual Hessian isn't al ulated, but more iterations usually are required
for onvergen e.

Swit hing between algorithms during iterations is sometimes useful.

13.3 Simulated Annealing


Simulated annealing is an algorithm whi h an nd an optimum in the presen e of non on avities, dis ontinuities and multiple lo al minima/maxima.

Basi ally, the algorithm randomly

sele ts evaluation points, a epts all points that yield an in rease in the obje tive fun tion,
but also a epts some points that de rease the obje tive fun tion. This allows the algorithm
to es ape from lo al minima. As more and more points are tried, periodi ally the algorithm
fo uses on the best point so far, and redu es the range over whi h random points are generated. Also, the probability that a negative move is a epted redu es. The algorithm relies on
many evaluations, as in the sear h method, but fo uses in on promising areas, whi h redu es
fun tion evaluations with respe t to the sear h method. It does not require derivatives to be
evaluated. I have a program to do this if you're interested.

13.4 Examples
This se tion gives a few examples of how some nonlinear models may be estimated using
maximum likelihood.

13.4.1 Dis rete Choi e: The logit model


In this se tion we will onsider maximum likelihood estimation of the logit model for binary
0/1 dependent variables. We will use the BFGS algotithm to nd the MLE.
We saw an example of a binary hoi e model in equation 12.1. A more general representation is

y = g(x)

y = 1(y > 0)

P r(y = 1) = F [g(x)]
p(x, )
The log-likelihood fun tion is

190

CHAPTER 13.

NUMERIC OPTIMIZATION METHODS

sn () =

1X
(yi ln p(xi , ) + (1 yi ) ln [1 p(xi , )])
n
i=1

For the logit model (see the ontingent valuation example above), the probability has the
spe i form

p(x, ) =

1
1 + exp(x)

You should download and examine LogitDGP.m , whi h generates data a ording to the
logit model, logit.m , whi h al ulates the loglikelihood, and EstimateLogit.m , whi h sets
things up and alls the estimation routine, whi h uses the BFGS algorithm.
Here are some estimation results with

n = 100,

and the true

= (0, 1) .

***********************************************
Trial of MLE estimation of Logit model
MLE Estimation Results
BFGS onvergen e: Normal onvergen e
Average Log-L: 0.607063
Observations: 100
onstant
slope

estimate
0.5400
0.7566

st. err
0.2229
0.2374

t-stat
2.4224
3.1863

p-value
0.0154
0.0014

Information Criteria
CAIC : 132.6230
BIC : 130.6230
AIC : 125.4127
***********************************************

mle_results(), whi h in turn


the o tave-forge repository.

The estimation program is alling


routines. These fun tions are part of

alls a number of other

13.4.2 Count Data: The MEPS data and the Poisson model
Demand for health are is usually thought of a a derived demand: health are is an input to
a home produ tion fun tion that produ es health, and health is an argument of the utility
fun tion. Grossman (1972), for example, models health as a apital sto k that is subje t to
depre iation (e.g., the ee ts of ageing). Health are visits restore the sto k. Under the home
produ tion framework, individuals de ide when to make health are visits to maintain their
health sto k, or to deal with negative sho ks to the sto k in the form of a idents or illnesses.

13.4.

191

EXAMPLES

As su h, individual demand will be a fun tion of the parameters of the individuals' utility
fun tions.
The MEPS health data le ,

meps1996.data,

ontains 4564 observations on six measures

of health are usage. The data is from the 1996 Medi al Expenditure Panel Survey (MEPS).
You an get more information at

http://www.meps.ahrq.gov/.

The six measures of use are

are o e-based visits (OBDV), outpatient visits (OPV), inpatient visits (IPV), emergen y
room visits (ERV), dental visits (VDV), and number of pres ription drugs taken (PRESCR).
These form olumns 1 - 6 of

meps1996.data.

The onditioning variables are publi insuran e

(PUBLIC), private insuran e (PRIV), sex (SEX), age (AGE), years of edu ation (EDUC), and
in ome (INCOME). These form olumns 7 - 12 of the le, in the order given here. PRIV and
PUBLIC are 0/1 binary variables, where a 1 indi ates that the person has a ess to publi or
private insuran e overage. SEX is also 0/1, where 1 indi ates that the person is female. This
data will be used in examples fairly extensively in what follows.
The program ExploreMEPS.m shows how the data may be read in, and gives some des riptive information about variables, whi h follows:
All of the measures of use are ount data, whi h means that they take on the values

0, 1, 2, ....

It might be reasonable to try to use this information by spe ifying the density as a

ount data density. One of the simplest ount data densities is the Poisson density, whi h is

exp()y
.
y!

fY (y) =
The Poisson average log-likelihood fun tion is

1X
sn () =
(i + yi ln i ln yi !)
n
i=1

We will parameterize the model as

i = exp(xi )
xi = [1 P U BLIC P RIV SEX AGE EDU C IN C]

(13.1)

This ensures that the mean is positive, as is required for the Poisson model. Note that for this
parameterization

j =

/j

so

j xj = xj ,
the elasti ity of the onditional mean of

with respe t to the

j th

onditioning variable.

The program EstimatePoisson.m estimates a Poisson model using the full data set. The
results of the estimation, using OBDV as the dependent variable are here:

MPITB extensions found

192

CHAPTER 13.

NUMERIC OPTIMIZATION METHODS

OBDV

******************************************************
Poisson model, MEPS 1996 full data set
MLE Estimation Results
BFGS onvergen e: Normal onvergen e
Average Log-L: -3.671090
Observations: 4564

onstant
pub. ins.
priv. ins.
sex
age
edu
in

estimate
-0.791
0.848
0.294
0.487
0.024
0.029
-0.000

st. err
0.149
0.076
0.071
0.055
0.002
0.010
0.000

t-stat
-5.290
11.093
4.137
8.797
11.471
3.061
-0.978

p-value
0.000
0.000
0.000
0.000
0.000
0.002
0.328

Information Criteria
CAIC : 33575.6881
Avg. CAIC: 7.3566
BIC : 33568.6881
Avg. BIC:
7.3551
AIC : 33523.7064
Avg. AIC:
7.3452
******************************************************

13.4.3 Duration data and the Weibull model


In some ases the dependent variable may be the time that passes between the o uren e of
two events.

For example, it may be the duration of a strike, or the time needed to nd a

job on e one is unemployed. Su h variables take on values on the positive real line, and are
referred to as duration data.
A

spell

event.

is the period of time between the o uren e of initial event and the on luding

For example, the initial event ould be the loss of a job, and the nal event is the

nding of a new job. The spell is the period of unemployment.


Let

t0

be the time the initial event o urs, and

t1

be the time the on luding event o urs.

For simpli ity, assume that time is measured in years. The random variable
of the spell,

D = t1 t0 .

FD (t) = Pr(D < t).

Dene the density fun tion of

D, fD (t),

is the duration

with distribution fun tion

13.4.

193

EXAMPLES

Several questions may be of interest. For example, one might wish to know the expe ted
time one has to wait to nd a job given that one has already waited

that a spell lasts

years. The probability

years is

Pr(D > s) = 1 Pr(D s) = 1 FD (s).


The density of

onditional on the spell already having lasted

years is

fD (t)
.
1 FD (s)

fD (t|D > s) =

The expe tan ed additional time required for the spell to end given that is has already lasted

years is the expe tation of

with respe t to this density, minus

E = E(D|D > s) s =

Z

fD (z)
z
dz
1 FD (s)

To estimate this fun tion, one needs to spe ify the density

s.

fD (t)

as a parametri density,

then estimate by maximum likelihood.

There are a number of possibilities in luding the

et .

A reasonably exible model that is a generalization

exponential density, the lognormal,

of the exponential density is the Weibull density

fD (t|) = e(t) (t)1 .


A ording to this model,

E(D) = . The log-likelihood is just the produ t of the log densities.

To illustrate appli ation of this model, 402 observations on the lifespan of mongooses in
Serengeti National Park (Tanzania) were used to t a Weibull model. The spell in this ase
is the lifetime of an individual mongoose. The parameter estimates and standard errors are

= 0.559 (0.034)

and

= 0.867 (0.033)

and the log-likelihood value is -659.3.

Figure 13.3

presents tted life expe tan y (expe ted additional years of life) as a fun tion of age, with
95% onden e bands. The plot is a ompanied by a nonparametri Kaplan-Meier estimate of
life-expe tan y. This nonparametri estimator simply averages all spell lengths greater than
age, and then subtra ts age. This is onsistent by the LLN.
In the gure one an see that the model doesn't t the data well, in that it predi ts life
expe tan y quite dierently than does the nonparametri model. For ages 4-6, the nonparametri estimate is outside the onden e interval that results from the parametri model,
whi h asts doubt upon the parametri model.

Mongooses that are between 2-6 years old

seem to have a lower life expe tan y than is predi ted by the Weibull model, whereas young
mongooses that survive beyond infan y have a higher life expe tan y, up to a bit beyond 2
years. Due to the dramati hange in the death rate as a fun tion of t, one might spe ify

fD (t)

as a mixture of two Weibull densities,





2
1
fD (t|) = e(1 t) 1 1 (1 t)1 1 + (1 ) e(2 t) 2 2 (2 t)2 1 .
The parameters

and

i , i = 1, 2

are the parameters of the two Weibull densities, and

is

194

CHAPTER 13.

NUMERIC OPTIMIZATION METHODS

Figure 13.3: Life expe tan y of mongooses, Weibull model

13.5.

195

NUMERIC OPTIMIZATION: PITFALLS

the parameter that mixes the two.


With the same data,
likelihood = -623.17.

an be estimated using the mixed model. The results are a log-

Note that a standard likelihood ratio test annot be used to hose

between the two models, sin e under the null that

and

=1

(single density), the two parameters

are not identied. It is possible to take this into a ount, but this topi is out of the

s ope of this ourse. Nevertheless, the improvement in the likelihood fun tion is onsiderable.
The parameter estimates are

Parameter

Estimate

St. Error

0.233

0.016

1.722

0.166

1.731

0.101

1.522

0.096

0.428

0.035

Note that the mixture parameter is highly signi ant. This model leads to the t in Figure
13.4.

Note that the parametri and nonparametri ts are quite lose to one another, up

to around

years.

The disagreement after this point is not too important, sin e less than

5% of mongooses live more than 6 years, whi h implies that the Kaplan-Meier nonparametri
estimate has a high varian e (sin e it's an average of a small number of observations).
Mixture models are often an ee tive way to model omplex responses, though they an
suer from overparameterization. Alternatives will be dis ussed later.

13.5 Numeri optimization: pitfalls


In this se tion we'll examine two ommon problems that an be en ountered when doing
numeri optimization of nonlinear models, and some solutions.

13.5.1 Poor s aling of the data


When the data is s aled so that the magnitudes of the rst and se ond derivatives are of
dierent orders, problems an easily result.

If we un omment the appropriate line in Esti-

matePoisson.m, the data will not be s aled, and the estimation program will have di ulty
onverging (it seems to take an innite amount of time). With uns aled data, the elements
of the s ore ve tor have very dierent magnitudes at the initial value of

(all zeros). To see

this run Che kS ore.m. With uns aled data, one element of the gradient is very large, and the
maximum and minimum elements are 5 orders of magnitude apart. This auses onvergen e
problems due to serious numeri al ina ura y when doing inversions to al ulate the BFGS
dire tion of sear h. With s aled data, none of the elements of the gradient are very large, and
the maximum dieren e in orders of magnitude is 3. Convergen e is qui k.

196

CHAPTER 13.

NUMERIC OPTIMIZATION METHODS

Figure 13.4: Life expe tan y of mongooses, mixed Weibull model

13.5.

197

NUMERIC OPTIMIZATION: PITFALLS

Figure 13.5: A foggy mountain

13.5.2 Multiple optima


Multiple optima (one global, others lo al) an ompli ate life, sin e we have limited means of
determining if there is a higher maximum the the one we're at. Think of limbing a mountain
in an unknown range, in a very foggy pla e (Figure 13.5). You an go up until there's nowhere
else to go up, but sin e you're in the fog you don't know if the true summit is a ross the gap
that's at your feet. Do you laim vi tory and go home, or do you trudge down the gap and
explore the other side?
The best way to avoid stopping at a lo al maximum is to use many starting values, for
example on a grid, or randomly generated. Or perhaps one might have priors about possible

e.g., from previous studies of similar data).

values for the parameters (

Let's try to nd the true minimizer of minus 1 times the foggy mountain fun tion (sin e the
algoritms are set up to minimize). From the pi ture, you an see it's lose to

(0, 0),

but let's

pretend there is fog, and that we don't know that. The program FoggyMountain.m shows that
poor start values an lead to problems. It uses SA, whi h nds the true global minimum, and
it shows that BFGS using a battery of random start values an also nd the global minimum
help. The output of one run is here:

MPITB extensions found

198

CHAPTER 13.

NUMERIC OPTIMIZATION METHODS

======================================================
BFGSMIN final results
Used numeri gradient
-----------------------------------------------------STRONG CONVERGENCE
Fun tion onv 1 Param onv 1 Gradient onv 1
-----------------------------------------------------Obje tive fun tion value -0.0130329
Stepsize 0.102833
43 iterations
-----------------------------------------------------param
gradient hange
15.9999 -0.0000
0.0000
-28.8119 0.0000
0.0000
The result with poor start values
ans =
16.000 -28.812

================================================
SAMIN final results
NORMAL CONVERGENCE
Fun . tol. 1.000000e-10 Param. tol. 1.000000e-03
Obj. fn. value -0.100023
parameter
sear h width
0.037419
0.000018
-0.000000
0.000051
================================================
Now try a battery of random start values and
a short BFGS on ea h, then iterate to onvergen e
The result using 20 randoms start values
ans =
3.7417e-02

2.7628e-07

13.5.

NUMERIC OPTIMIZATION: PITFALLS

199

The true maximizer is near (0.037,0)

In that run, the single BFGS run with bad start values onverged to a point far from the
true minimizer, whi h simulated annealing and BFGS using a battery of random start values
both found the true maximizer. Using a battery of random start values, we managed to nd
the global max. The moral of the story is to be autious and don't publish your results too
qui kly.

200

CHAPTER 13.

NUMERIC OPTIMIZATION METHODS

13.6 Exer ises


1. In o tave, type  help
le to examine it and

bfgsmin_example, to nd out


learn how to all bfgsmin. Run

the lo ation of the le. Edit the


it, and examine the output.

2. In o tave, type  help


to examine it and

samin_example, to nd out the lo ation of the le. Edit


learn how to all samin. Run it, and examine the output.

the le

3. Using logit.m and EstimateLogit.m as templates, write a fun tion to al ulate the probit
log likelihood, and a s ript to estimate a probit model. Run it using data that a tually
follows a logit model (you an generate it in the same way that is done in the logit
example).
4. Study

mle_results.m

to see what it does. Examine the fun tions that

mle_results.m

alls, and in turn the fun tions that those fun tions all. Write a omplete des ription
of how the whole hain works.
5. Look at the Poisson estimation results for the OBDV measure of health are use and
give an e onomi interpretation. Estimate Poisson models for the other 5 measures of
health are usage.

Chapter 14
Asymptoti properties of extremum
estimators
Readings:

Hayashi (2000), Ch. 7; Gourieroux and Monfort (1995), Vol. 2, Ch. 24; Amemiya,

Ch. 4 se tion 4.1; Davidson and Ma Kinnon, pp. 591-96; Gallant, Ch. 3; Newey and M Fadden (1994),  Large Sample Estimation and Hypothesis Testing, in

Vol. 4, Ch.

Handbook of E onometri s,

36.

14.1 Extremum estimators


In Denition 12 we dened an extremum estimator
fun tion
matrix

sn ()h over

Zn =

a set

as the optimizing element of an obje tive

Let the obje tive fun tion

z1 z2 zn

where the

zt

are

sn (Zn , ) depend

p-ve tors

and

upon a

n p random

is nite.

Example 18 Given the model yi = xi + i , with n observations, dene zi = (yi , xi ) . The

OLS estimator minimizes

sn (Zn , ) = 1/n

n
X
i=1

yi xi

2

= 1/n k Y X k2

where Y and X are dened similarly to Z.

14.2 Existen e
If

sn (Zn , )

interest,

is ontinuous in

sn (Zn , )

and

is ompa t, then a maximizer exists. In some ases of

may not be ontinuous. Nevertheless, it may still onverge to a ontinous

fun tion, in whi h ase existen e will not be a problem, at least asymptoti ally.

201

202

CHAPTER 14.

ASYMPTOTIC PROPERTIES OF EXTREMUM ESTIMATORS

14.3 Consisten y
The following theorem is patterned on a proof in Gallant (1987) (the arti le, ref. later), whi h
we'll see in its original form later in the ourse. It is interesting to ompare the following proof
with Amemiya's Theorem 4.1.1, whi h is done in terms of onvergen e in probability.

Theorem 19

Suppose that n is obtained by maximizing sn () over .

[Consisten y of e.e.

Assume

1. Compa tness: The parameter spa e is an open bounded subset of Eu lidean spa e K .
So the losure of , , is ompa t.
2. Uniform Convergen e: There is a nonsto hasti fun tion s () that is ontinuous in
on su h that
lim sup |sn () s ()| = 0, a.s.
n

3. Identi ation: s () has a unique global maximum at 0 , i.e., s ( 0 ) > s (),


6= 0 ,

Then n a.s.
0.
Proof:

Sele t a

and hold it xed. Then

{sn (, )}

is a xed sequen e of fun tions.

is su h that sn () onverges uniformly to s (). This happens with probability


n } lies in the ompa t set , by assumption (1) and
assumption (b). The sequen e {

Suppose that
one by

the fa t that maximixation is over

Sin e every sequen e from a ompa t set has at least one

limit point (Davidson, Thm. 2.12), say that

{nm } ({nm }

is

a limit point of

is simply a sequen e of in reasing integers) with

{n }.

There is a subsequen e

limm nm = .

By uniform

onvergen e and ontinuity

lim snm (nm ) = s ().

To see this, rst of all, sele t an element

from the sequen e

gen e implies

o
n
nm .

lim snm (t ) = s (t ).

m
Continuity of

s ()

implies that

lim s (t ) = s ()

t
sin e the limit as

of

Next, by maximization

n o
t

is

So the above laim is true.

snm (nm ) snm ( 0 )


whi h holds in the limit, so

lim snm (nm ) lim snm ( 0 ).

Then uniform onver-

14.3.

203

CONSISTENCY

However,

lim snm (nm ) = s (),

m
as seen above, and

lim snm ( 0 ) = s ( 0 )

m
by uniform onvergen e, so

s ( 0 ).
s ()
But by assumption (3), there is a unique global maximum of

= s ( 0 ),
s ()
have held

= 0 . Finally,
and
C

at

0,

so we must have

all of the above limits hold almost surely, sin e so far we

xed, but now we need to onsider all

0
point, , ex ept on a set

s ()

with

P (C) = 0.

Therefore

{n }

has only one limit

Dis ussion of the proof:

(2)

This proof relies on the identi ation assumption of a unique global maximum at

0 . An

equivalent way to state this is

Identi ation:

Any point

in

with

s () s ( 0 )

must be su h that

k 0 k= 0,

whi h mat hes the way we will write the assumption in the se tion on nonparametri inferen e.

We assume that
for

is in fa t a global maximum of

sn () . It is not required

to be unique

nite, though the identi ation assumption requires that the limiting obje tive

fun tion have a unique maximizing argument.

The previous se tion on numeri opti-

mization methods showed that a tually nding the global maximum of

sn ()

may be a

non-trivial problem.

See Amemiya's Example 4.1.4 for a ase where dis ontinuity leads to breakdown of
onsisten y.

The assumption that

is in the interior of

(part of the identi ation assumption)

has not been used to prove onsisten y, so we ould dire tly assume that
element of a ompa t set

is simply an

The reason that we assume it's in the interior here is that

this is ne essary for subsequent proof of asymptoti normality, and I'd like to maintain
a minimal set of simple assumptions, for larity.

Parameters on the boundary of the

parameter set ause theoreti al di ulties that we will not deal with in this ourse. Just
note that onventional hypothesis testing methods do not apply in this ase.

Note that

sn ()

The following gures illustrate why uniform onvergen e is important.

is not required to be ontinuous, though

s ()

is.

In the se ond

gure, if the fun tion is not onverging around the lower of the two maxima, there is no
guarantee that the maximizer will be in the neighborhood of the global maximizer.

204

CHAPTER 14.

ASYMPTOTIC PROPERTIES OF EXTREMUM ESTIMATORS

With uniform convergence, the maximum of the sample


objective function eventually must be in the neighborhood
of the maximum of the limiting objective function

With pointwise convergence, the sample objective function


may have its maximum far away from that of the limiting
objective function

We need a uniform strong law of large numbers in order to verify assumption (2) of Theorem
19. The following theorem is from Davidson, pg. 337.

Let {Gn ()} be a sequen e of sto hasti real-valued fun tions on a totally-bounded metri spa e (, ). Then

Theorem 20

[Uniform Strong LLN

a.s.

sup |Gn ()| 0

if and only if

14.4.

205

EXAMPLE: CONSISTENCY OF LEAST SQUARES

0 for ea h 0 , where 0 is a dense subset of and


(a) Gn () a.s.
(b) {Gn ()} is strongly sto hasti ally equi ontinuous..
K , using the

The metri spa e we are interested in now is simply

The pointwise almost sure onvergen e needed for assuption (a) omes from one of the

Stronger assumptions that imply those of the theorem are:

Eu lidean norm.

usual SLLN's.

the parameter spa e is ompa t (this has already been assumed)

the obje tive fun tion is ontinuous and bounded with probability one on the entire
parameter spa e

a standard SLLN an be shown to apply to some point in the parameter spa e

These are reasonable onditions in many ases, and hen eforth when dealing with spe i
estimators we'll simply assume that pointwise almost sure onvergen e an be extended
to uniform almost sure onvergen e in this way.

The more general theorem is useful in the ase that the limiting obje tive fun tion an be
ontinuous in

even if

sn ()

is dis ontinuous. This an happen be ause dis ontinuities

may be smoothed out as we take expe tations over the data. In the se tion on simlationbased estimation we will se a ase of a dis ontinuous obje tive fun tion.

14.4 Example: Consisten y of Least Squares


We suppose that data is generated by random sampling of

(wt , t )

(y, w),

where

yt = 0 + 0 wt +t .

w (w and are independent) with support


2
2
W E. Suppose that the varian es w and are nite. Let 0 = (0 , 0 ) , for whi h

0
is ompa t. Let xt = (1, wt ) , so we an write yt = xt + t . The sample obje tive fun tion
has the ommon distribution fun tion

for a sample size

is

sn () = 1/n
= 1/n

n
X

t=1
n
X
t=1

yt xt

xt 0

= 1/n

2

n
X
i=1

+ 2/n

xt 0 + t xt

n
X
t=1

2

n
X

2t
xt 0 t + 1/n
t=1

Considering the last term, by the SLLN,

1/n

n
X
t=1

2

a.s.
2t

Considering the se ond term, sin e


implies that it onverges to zero.

E() = 0

2 dW dE = 2 .
and

and

are independent, the SLLN

206

CHAPTER 14.

ASYMPTOTIC PROPERTIES OF EXTREMUM ESTIMATORS

Finally, for the rst term, for a given

1/n

n
X

xt

t=1

=
=

2

a.s.

we assume that a SLLN applies so that

x 0

2


0 + 2 0 0

2

dW

wdW + 0

(14.1)

2

2


2

+ 2 0 0 E(w) + 0 E w2
0

w2 dW

Finally, the obje tive fun tion is learly ontinuous, and the parameter spa e is assumed to
be ompa t, so the onvergen e is also uniform. Thus,

2


2

s () = 0 + 2 0 0 E(w) + 0 E w2 + 2

A minimizer of this is learly

= 0 , = 0 .

Exer ise 21 Show that in order for the above solution to be unique it is ne essary that
E(w2 ) 6= 0.

of regressors.

Dis uss the relationship between this ondition and the problem of olinearity

This example shows that Theorem 19 an be used to prove strong onsisten y of the OLS
estimator. There are easier ways to show this, of ourse - this is only an example of appli ation
of the theorem.

14.5 Asymptoti Normality


A onsistent estimator is oftentimes not very useful unless we know how fast it is likely to
be onverging to the true value, and the probability that it is far away from the true value.
Establishment of asymptoti normality with a known s aling fa tor solves these two problems.
The following theorem is similar to Amemiya's Theorem 4.1.3 (pg. 111).

Theorem 22

[Asymptoti normality of e.e.

In addition to the assumptions of Theorem 19,

assume
(a) Jn () D2 sn() exists and is ontinuous in an open, onvex neighborhood of 0 .
(b) {Jn (n )} a.s.
J ( 0 ), a nite negative denite matrix, for any sequen e {n } that
onverges almost surely to 0.



d
( ) nDsn ( 0 )
N 0, I ( 0 ) , where I ( 0 ) = limn V ar nD sn ( 0 )




d
Then n 0
N 0, J ( 0 )1 I ( 0 )J ( 0 )1
Proof:

By Taylor expansion:



D sn (n ) = D sn ( 0 ) + D2 sn ( ) 0
where

= + (1 ) 0 , 0 1.

14.5.

207

ASYMPTOTIC NORMALITY

will

D2 sn ()

Note that

Now the l.h.s. of this equation is zero, at least asymptoti ally, sin e

be in the neighborhood where

be omes large, by onsisten y.

is a maximizer

and the f.o. . must hold exa tly sin e the limiting obje tive fun tion is stri tly on ave
in a neighborhood of

exists with probability one as

Also, sin e

0.

is between

and

0,

and sin e

a.s.
n 0

, assumption (b) gives

a.s.

D2 sn ( ) J ( 0 )
So




0 = D sn ( 0 ) + J ( 0 ) + op (1) 0

And

0=
Now

J ( 0 )



 

nD sn ( 0 ) + J ( 0 ) + op (1) n 0

is a nite negative denite matrix, so the

op (1)

term is asymptoti ally irrelevant

0
next to J ( ), so we an write
a

0=


nD sn ( 0 ) + J ( 0 ) n 0




a
n 0 = J ( 0 )1 nD sn ( 0 )

Be ause of assumption ( ), and the formula for the varian e of a linear ombination of r.v.'s,





d
n 0 N 0, J ( 0 )1 I ( 0 )J ( 0 )1

Assumption (b) is not implied by the Slutsky theorem. The Slutsky theorem says that

a.s.

g(xn ) g(x)
depend on

if

xn x and g()

n to use this theorem.

applies (Amemiya, Ch. 4) is

is ontinuous at

In our ase

x.

Jn (n ) is

However, the fun tion


a fun tion of

g()

an't

n. A theorem whi h

Theorem 23 If gn () onverges uniformly almost surely to a nonsto hasti fun tion g ()


a.s.
uniformly on an open neighborhood of 0, then gn ()
g ( 0 ) if g ( 0 ) is ontinuous at 0
a.s.
and 0 .

To apply this to the se ond derivatives, su ient onditions would be that the se ond
derivatives be strongly sto hasti ally equi ontinuous on a neighborhood of
an ordinary LLN applies to the derivatives when evaluated at

0,

and that

N ( 0 ).

Stronger onditions that imply this are as above: ontinuous and bounded se ond derivatives in a neighborhood of

Skip this in le ture.

0.

A note on the order of these matri es: Supposing that

representable as an average of

sn ()

is

terms, whi h is the ase for all estimators we onsider,

208

CHAPTER 14.

D2 sn ()

ASYMPTOTIC PROPERTIES OF EXTREMUM ESTIMATORS

matri es, the elements of whi h are not entered (they

do not have zero expe tation).

Supposing a SLLN applies, the almost sure limit of

is also an average of

D2 sn ( 0 ),
( ):

nD sn

( 0 )

d
( 0 )

= O(1), as we saw in Example




N 0, I ( 0 ) means that

51. On the other hand, assumption

nD sn ( 0 ) = Op ()

where we use the result of Example 49. If we were to omit the

n,

we'd have

D sn ( 0 ) = n 2 Op (1)
 1
= Op n 2

Op (nr )Op (nq ) = Op (nr+q ). The sequen e D sn ( 0 )

n to avoid onvergen e to zero.


by

where we use the fa t that


tered, so we need to s ale

is

en-

14.6 Examples
14.6.1 Coin ipping, yet again
Remember that in se tion 4.4.1 we saw that the asymptoti varian e of the MLE of the
parameter of a Bernoulli trial, using i.i.d.

data, was

lim V ar n (
p p) = p (1 p).

Let's

verify this using the methods of this Chapter. The log-likelihood fun tion is

sn (p) =
so

1X
{yt ln p + (1 yt ) (1 ln p)}
n t=1


Esn (p) = p0 ln p + 1 p0 (1 ln p)

by the fa t that the observations are i.i.d. Thus,


of al ulation shows that

whi h doesn't depend upon


And in this ase,


s (p) = p0 ln p + 1 p0 (1 ln p).


D2 sn (p) p=p0 Jn () =

n.

1
,
p0 )

p0 (1

By results we've seen on MLE,

1 (p0 ) = p0 1 p0
J

we got in se tion 4.4.1.

A bit

1 (p0 ).
lim V ar n p p0 = J

. It's omforting to see that this is the same result

14.6.2 Binary response models


Extending the Bernoulli trial model to binary response models with onditioning variables,
su h models arise in a variety of ontexts. We've already seen a logit model. Another simple

14.6.

209

EXAMPLES

example is a probit threshold- rossing model. Assume that

y = x

y = 1(y > 0)
N (0, 1)

Here,

is an unobserved (latent) ontinuous variable, and

whether y is negative or positive. Then

() =

y is a binary variable that indi ates

P r(y = 1) = P r( < x) = (x),


x

(2)1/2 exp(

where

2
)d
2

is the standard normal distribution fun tion.


In general, a binary response model will require that the hoi e probability be parameterized in some form. For a ve tor of explanatory variables

x,

the response probability will be

parameterized in some manner

P r(y = 1|x) = p(x, )


If

p(x, ) = (x ),

we have a logit model.

If

p(x, ) = (x ),

where

()

is the standard

normal distribution fun tion, then we have a probit model.


Regardless of the parameterization, we are dealing with a Bernoulli density,

fYi (yi |xi ) = p(xi , )yi (1 p(x, ))1yi


so as long as the observations are independent, the maximum likelihood (ML) estimator,

is

the maximizer of

1X
(yi ln p(xi , ) + (1 yi ) ln [1 p(xi , )])
n

sn () =

i=1

n
1X
s(yi , xi , ).
n

Following the above theoreti al results,


uniform almost sure limit of
pro esses,

sn ()

(14.2)

i=1

tends in probability to the

sn (). Noting that Eyi =

that maximizes the

p(xi , 0 ), and following a SLLN for i.i.d.

onverges almost surely to the expe tation of a representative term

First one an take the expe tation onditional on

s(y, x, ).

to get



Ey|x {y ln p(x, ) + (1 y) ln [1 p(x, )]} = p(x, 0 ) ln p(x, ) + 1 p(x, 0 ) ln [1 p(x, )] .

Next taking expe tation over

s () =
where

(x)

we get the limiting obje tive fun tion




p(x, 0 ) ln p(x, ) + 1 p(x, 0 ) ln [1 p(x, )] (x)dx,

is the (joint - the integral is understood to be multiple, and

(14.3)

is the support of

210

x)

CHAPTER 14.

ASYMPTOTIC PROPERTIES OF EXTREMUM ESTIMATORS

density fun tion of the explanatory variables

p(x, )

x.

This is learly ontinuous in

as long as

is ontinuous, and if the parameter spa e is ompa t we therefore have uniform almost

sure onvergen e. Note that


The maximizing element of

Z 
X

p(x, )

is ontinous for the logit and probit models, for example.

s (), ,

solves the rst order onditions


p(x, 0 )
1 p(x, 0 )

p(x, )
p(x, ) (x)dx = 0
p(x, )
1 p(x, )

This is learly solved by

= 0.

Provided the solution is unique,

is

onsistent. Question:

what's needed to ensure that the solution is unique?


The asymptoti normality theorem tells us that





d
n 0 N 0, J ( 0 )1 I ( 0 )J ( 0 )1 .

In the ase of i.i.d. observations

I ( 0 ) = limn V ar nD sn ( 0 )

is simply the expe tation

of a typi al element of the outer produ t of the gradient.

There's no need to subtra t the mean, sin e it's zero, following the f.o. . in the onsisten y proof above and the fa t that observations are i.i.d.
The terms in

also drop out by the same argument:

1X
lim V ar nD
s( 0 )
n
n t
X
1
s( 0 )
= lim V ar D
n
n
t
X
1
D s( 0 )
= lim V ar
n n
t

lim V ar nD sn ( 0 ) =

lim V arD s( 0 )

= V arD s( 0 )
So we get

I ( ) = E

0
0
s(y, x, ) s(y, x, ) .

Likewise,

J ( 0 ) = E
Expe tations are jointly over

x.

and

x,

2
s(y, x, 0 ).

or equivalently, rst over

onditional on

From above, a typi al element of the obje tive fun tion is



s(y, x, 0 ) = y ln p(x, 0 ) + (1 y) ln 1 p(x, 0 ) .

Now suppose that we are dealing with a orre tly spe ied logit model:

1
p(x, ) = 1 + exp(x )
.

x,

then over

14.6.

211

EXAMPLES

We an simplify the above results in this ase. We have that

2
1 + exp(x )
exp(x )x

p(x, ) =

exp(x )
x
1 + exp(x )
= p(x, ) (1 p(x, )) x

= p(x, ) p(x, )2 x.
1
1 + exp(x )

So




s(y, x, 0 ) = y p(x, 0 ) x



2
s( 0 ) = p(x, 0 ) p(x, 0 )2 xx .

Taking expe tations over

I ( ) =
=
where we use the fa t that

then

(14.4)

gives



EY y 2 2p(x, 0 )p(x, 0 ) + p(x, 0 )2 xx (x)dx



p(x, 0 ) p(x, 0 )2 xx (x)dx.

EY (y) = EY (y 2 ) = p(x, 0 ).

J ( 0 ) =

(14.5)

(14.6)

Likewise,


p(x, 0 ) p(x, 0 )2 xx (x)dx.

(14.7)

Note that we arrive at the expe ted result: the information matrix equality holds (that is,

J ( 0 ) = I( 0 )).

simplies to

With this,





d
n 0 N 0, J ( 0 )1 I ( 0 )J ( 0 )1




d
n 0 N 0, J ( 0 )1

whi h an also be expressed as





d
n 0 N 0, I ( 0 )1 .

On a nal note, the logit and standard normal CDF's are very similar - the logit distribution
is a bit more fat-tailed. While oe ients will vary slightly between the two models, fun tions
of interest su h as estimated probabilities

will be virtually identi al


p(x, )

for the two models.

212

CHAPTER 14.

ASYMPTOTIC PROPERTIES OF EXTREMUM ESTIMATORS

14.6.3 Example: Linearization of a nonlinear model


Ref.

Gourieroux and Monfort, se tion 8.3.4.

Intn'l E on. Rev.

White,

1980 is an earlier

referen e.
Suppose we have a nonlinear model

yi = h(xi , 0 ) + i
where

i iid(0, 2 )
The

nonlinear least squares

estimator solves

1X
n = arg min
(yi h(xi , ))2
n
i=1

We'll study this more later, but for now it is lear that the fo for minimization will require
solving a set of nonlinear equations. A ommon approa h to the problem seeks to avoid this
di ulty by

linearizing

the model. A rst order Taylor's series expansion about the point

with remainder gives

yi = h(x0 , 0 ) + (xi x0 )
where

en ompasses both

x0

h(x0 , 0 )
+ i
x

and the Taylor's series remainder. Note that

is no longer a

lassi al error - its mean is not zero. We should expe t problems.


Dene

= h(x0 , 0 ) x0
=

h(x0 , 0 )
x

h(x0 , 0 )
x

Given this, one might try to estimate

and

by applying OLS to

yi = + xi + i

Question, will

The answer is no, as one an see by interpreting

and

be onsistent for

and

and

as extremum estimators. Let

(, ) .
n

= arg min sn () =

1X
(yi xi )2
n
i=1

The obje tive fun tion onverges to its expe tation

u.a.s.

sn () s () = EX EY |X (y x)2

14.6.

and

213

EXAMPLES

onverges

a.s.

to the

that minimizes

s ():

0 = arg min EX EY |X (y x)2


Noting that

EX EY |X y x

sin e ross produ ts involving

2

2
= EX EY |X h(x, 0 ) + x
2
= 2 + EX h(x, 0 ) x

drop out.

and

orrespond to the hyperplane that is

0
losest to the true regression fun tion h(x, ) a ording to the mean squared error riterion.
This depends on both the shape of

h()

and the density fun tion of the onditioning variables.

Inconsistency of the linear approximation, even at


the approximation point
x
h(x,)
x

Tangent line

x
x

x
Fitted line

x
x

x_0

It is lear that the tangent line does not minimize MSE, sin e, for example, if

Note that the true underlying parameter

h(x, 0 )

is on ave, all errors between the tangent line and the true fun tion are negative.

is not estimated onsistently, either (it may

be of a dierent dimension than the dimension of the parameter of the approximating


model, whi h is 2 in this example).

Se ond order and higher-order approximations suer from exa tly the same problem,
though to a less severe degree, of ourse. For this reason, translog, Generalized Leontiev
and other exible fun tional forms based upon se ond-order approximations in general
suer from bias and in onsisten y. The bias may not be too important for analysis of
onditional means, but it an be very important for analyzing rst and se ond derivatives.

e.g.,

In produ tion and onsumer analysis, rst and se ond derivatives (

elasti ities of

214

CHAPTER 14.

ASYMPTOTIC PROPERTIES OF EXTREMUM ESTIMATORS

substitution) are often of interest, so in this ase, one should be autious of unthinking
appli ation of models that impose stong restri tions on se ond derivatives.

This sort of linearization about a long run equilibrium is a ommon pra ti e in dynami
ma roe onomi models. It is justied for the purposes of theoreti al analysis of a model

given the model's parameters, but it is not justiable for the estimation of the parameters
of the model using data.

The se tion on simulation-based methods oers a means of

obtaining onsistent estimators of the parameters of dynami ma ro models that are too
omplex for standard methods of analysis.

14.7.

215

EXERCISES

14.7 Exer ises


1. Suppose that

xi

uniform(0,1), and

where

is iid(0,

2 ). Suppose we

yi = + xi + i by OLS. Find the numeri values of


0
that are the probability limits of
and

estimate the misspe ied model

0 and

yi = 1 x2i + i ,

2. Verify your results using O tave by generating data that follows the above model, and
al ulating the OLS estimator. When the sample size is very large the estimator should
be very lose to the analyti al results you obtained in question 1.
3. Use the asymptoti normality theorem to nd the asymptoti distribution of the ML

y = x 0 + , where
N (0, 1) and is independent of x.

2

s
()
n

0
0
This means nding
sn (), J ( ),
, and I( ). The expressions may involve
the unspe ied density of x.

estimator of

for the model

4. Assume a d.g.p. follows the logit model:

(a) Assume that

1
Pr(y = 1|x) = 1 + exp( 0 x) .

uniform(-a,a). Find the asymptoti distribution of the ML esti-

0
mator of (this is a s alar parameter).
(b) Now assume that

uniform(-2a,2a). Again nd the asymptoti distribution of

0
the ML estimator of .
( ) Comment on the results

216

CHAPTER 14.

ASYMPTOTIC PROPERTIES OF EXTREMUM ESTIMATORS

Chapter 15
Generalized method of moments
Readings:

Hamilton Ch.

14 ; Davidson and Ma Kinnon, Ch.

17 (see pg.

587 for refs.

to appli ations); Newey and M Fadden (1994), "Large Sample Estimation and Hypothesis
Testing", in

Handbook of E onometri s, Vol. 4, Ch.

36.

15.1 Denition
We've already seen one example of GMM in the introdu tion, based upon the
Consider the following example based upon the t-distribution.
t-distributed r.v.

Yt

distribution.

The density fun tion of a

is

fYt (yt , ) =
Given an iid sample of size

 
0 + 1 /2 

( 0 )1/2 ( 0 /2)

n, one ould

estimate

arg max ln Ln () =

1 + yt2 / 0

(0 +1)/2

by maximizing the log-likelihood fun tion

n
X

ln fYt (yt , )

t=1

This approa h is attra tive sin e ML estimators are asymptoti ally e ient.

This is

be ause the ML estimator uses all of the available information (e.g., the distribution
is fully spe ied up to a parameter). Re alling that a distribution is ompletely hara terized by its moments, the ML estimator is interpretable as a GMM estimator that
uses

all

of the moments. The method of moments estimator uses only

estimate a

Kdimensional parameter.

moments to

Sin e information is dis arded, in general, by the

MM estimator, e ien y is lost relative to the ML estimator.

fYt (yt , 0 )

Continuing with the example, a t-distributed r.v. with density

Using the notation introdu ed previously, dene a moment ondition

and varian e

yt2 and


V (yt ) = 0 / 0 2

m1 () = 1/n

(for

has mean zero

0 > 2).

Pn

m1t () = / ( 2)
2
t=1 yt . As before, when evaluated


0 and E0 m1 ( 0 ) = 0.

Pn

t=1 m1t () = / ( 2) 1/n




0
0
at the true parameter value , both E 0 m1t ( ) =
217

218

CHAPTER 15.

Choosing

to

set

0
m1 ()

GENERALIZED METHOD OF MOMENTS

yields a MM estimator:

2
1

(15.1)

Pn 2
i yi

This estimator is based on only one moment of the distribution - it uses less information than
the ML estimator, so it is intuitively lear that the MM estimator will be ine ient relative
to the ML estimator.

An alternative MM estimator ould be based upon the fourth moment of the t-distribution.
The fourth moment of a t-distributed r.v. is

4
provided that

0 > 4.

E(yt4 )

2
3 0
= 0
,
( 2) ( 0 4)

We an dene a se ond moment ondition

1X 4
3 ()2
yt

m2 () =
( 2) ( 4) n
t=1

A se ond, dierent MM estimator hooses

to

set

0.
m2 ()

If you solve this you'll see

that the estimate is dierent from that in equation 15.1.

This estimator isn't e ient either, sin e it uses only one moment. A GMM estimator would
use the two moment onditions together to estimate the single parameter. The GMM estimator
is overidentied, whi h leads to an estimator whi h is e ient relative to the just identied
MM estimators (more on e ien y later).

As before, set

mn () = (m1 (), m2 ()) .

0
size. Note that m( )
whereas

The

subs ript is used to indi ate the sample

Op (n1/2 ), sin e it is an average of entered random variables,

m() = Op (1), 6= 0 ,

where expe tations are taken using the true distribution

0
with parameter . This is the fundamental reason that GMM is onsistent.

A GMM estimator requires dening a measure of distan e,


(for reasons noted below) is to set

m() W

n m(). We assume that

In general, assume we have

Wn

d (m()) =

m W

A popular hoi e

n m, and we minimize

sn () =

onverges to a nite positive denite matrix.

moment onditions, so

matrix.

d (m()).

m() is a g

-ve tor and

is a

gg

For the purposes of this ourse, the following denition of the GMM estimator is su iently
general:

Denition 24 The GMM estimator of the K -dimensional parameter ve tor 0 , arg min sn ()
P

where mn () = n1 nt=1 mt () is a g-ve tor, g K, with E m() = 0, and


Wn onverges almost surely to a nite g g symmetri positive denite matrix W .

mn () Wn mn (),

15.2.

219

CONSISTENCY

What's the reason for using GMM if MLE is asymptoti ally e ient?

Robustness: GMM is based upon a limited set of moment onditions. For onsisten y,
only these moment onditions need to be orre tly spe ied, whereas MLE in ee t

every on eivable moment ondition. GMM is robust


with respe t to distributional misspe i ation. The pri e for robustness is loss of e ien y
requires orre t spe i ation of

with respe t to the MLE estimator. Keep in mind that the true distribution is not known
so if we erroneously spe ify a distribution and estimate by MLE, the estimator will be
in onsistent in general (not always).

Feasibility: in some ases the MLE estimator is not available, be ause we are not able
to dedu e the likelihood fun tion.

More on this in the se tion on simulation-based

estimation. The GMM estimator may still be feasible even though MLE is not available.

15.2 Consisten y
We simply assume that the assumptions of Theorem 19 hold, so the GMM estimator is strongly
onsistent. The only assumption that warrants additional omments is that of identi ation.
In Theorem 19, the third assumption reads:

0
maximum at ,
fun tion

i.e., s

sn () = mn

( 0 )

() W

> s (), 6=

Identi ation:

s () has a unique global


0
. Taking the ase of a quadrati obje tive

( )

n mn (), rst onsider

mn ().
a.s.

Applying a uniform law of large numbers, we get

Sin e

E mn ( 0 ) = 0

Sin e

s ( 0 ) = m ( 0 ) W m ( 0 ) = 0, in order for asymptoti identi ation,

that

m () 6= 0

by assumption,

for

6=

m ( 0 ) = 0.

0 , for at least some element of the ve tor.

a.s.

Wn W , a nite positive
0
is asymptoti ally identied.

assumption that

mn () m ().

gg

denite

gg

we need

This and the

matrix guarantee that

Note that asymptoti identi ation does not rule out the possibility of la k of identi ation for a given data set - there may be multiple minimizing solutions in nite samples.

15.3 Asymptoti normality


We also simply assume that the onditions of Theorem 22 hold, so we will have asymptoti
normality. However, we do need to nd the stru ture of the asymptoti varian e- ovarian e
matrix of the estimator. From Theorem 22, we have





d
n 0 N 0, J ( 0 )1 I ( 0 )J ( 0 )1

where

J ( 0 ) is the almost sure limit of

2
sn () and


sn ( 0 ). We
I ( 0 ) = limn V ar n

need to determine the form of these matri es given the obje tive fun tion

sn () = mn () Wn mn ().

220

CHAPTER 15.

GENERALIZED METHOD OF MOMENTS

Now using the produ t rule from the introdu tion,




sn () = 2
mn () Wn mn ()

Dene the

K g

matrix

Dn ()

so:


m () ,
n

s() = 2D()W m () .

(Note that

sn (), Dn (), Wn

and

mn ()

(15.2)

all depend on the sample size

n,

but it is omitted to

un lutter the notation).


To take se ond derivatives, let

Di

be the

th row of

D().

Using the produ t rule,

2
s() =
i

2Di ()Wn m ()




= 2Di W D + 2m W
D
i

When evaluating the term

D()i
2m() W

at

0 , assume that

D()i satises a LLN, so that it onverges almost surely to a nite limit.

In this ase, we have

0 a.s.
2m( ) W
D( )i 0,

0

sin e

a.s.

m( 0 ) = op (1), W W .

Sta king these results over the

lim
where we dene

rows of

D,

we get

sn ( 0 ) = J ( 0 ) = 2D W D
, a.s.,

lim D = D , a.s.,

With regard to

0
at (sin e

and

lim W = W ,

I ( 0 ), following equation 15.2,

Em( 0 )

=0

a.s. (we assume a LLN holds).

and noting that the s ores have mean zero

by assumption), we have


lim V ar n sn ( 0 )
n

= lim E4nDn Wn m( 0 )m() Wn Dn


n



nm( 0 )
nm() Wn Dn
= lim E4Dn Wn

I ( 0 ) =

Now, given that

m( 0 )

is an average of entered (mean-zero) quantities, it is reasonable to

expe t a CLT to apply, after multipli ation by

n.

Assuming this,

nm( 0 ) N (0, ),

15.4.

221

CHOOSING THE WEIGHTING MATRIX

where



= lim E nm( 0 )m( 0 ) .
n

Using this, and the last equation, we get

I ( 0 ) = 4D W W D
Using these results, the asymptoti normality theorem gives us


h

 i

d
1

1
n 0 N 0, D W D
D W W D
,
D W D

the asymptoti distribution of the GMM estimator for arbitrary weighting matrix
that for

to be positive denite,

must have full row rank,

Wn .

Note

(D ) = k.

15.4 Choosing the weighting matrix


W

is a

weighting matrix, whi h determines the relative importan e of violations of the individ-

ual moment onditions. For example, if we are mu h more sure of the rst moment ondition,
whi h is based upon the varian e, than of the se ond, whi h is based upon the fourth moment,

"

we ould set

W =
with

a 0
0 b

a mu h larger than b. In this ase, errors in the se ond moment ondition have less weight

in the obje tive fun tion.

Sin e moments are not independent, in general, we should expe t that there be a orrelation between the moment onditions, so it may not be desirable to set the o-diagonal
elements to 0.

may be a random, data dependent matrix.

We have already seen that the hoi e of


the GMM estimator.

will inuen e the asymptoti distribution of

Sin e the GMM estimator is already ine ient w.r.t.

might like to hoose the

of GMM estimators

matrix to make the GMM estimator e ient

dened by

To provide a little intuition, onsider the linear model

Let

Then the model

y = x + ,

That is, he have heteros edasti ity and auto orrelation.


be the Cholesky fa torization of

P y = P X + P

1 ,

e.g,

where

N (0, ).

P P = 1 .

satises the lassi al assumptions of homos edas-

V (P ) = P V ()P = P P = P (P P )1 P =

ti ity and nonauto orrelation, sin e

P P 1 (P )1 P = In .

within the lass

mn ().

MLE, we

(Note: we use

(AB)1 = B 1 A1

for

A, B

both nonsingular).

This means that the transformed model is e ient.

The OLS estimator of the model

(y

X) 1 (y

X).

P y = P X + P

Interpreting

(y X) = ()

minimizes the obje tive fun tion


as moment onditions (note that

222

CHAPTER 15.

GENERALIZED METHOD OF MOMENTS

they do have zero expe tation when evaluated at

0 ),

the optimal weighting matrix

is seen to be the inverse of the ovarian e matrix of the moment onditions.

This

result arries over to GMM estimation. (Note: this presentation of GLS is not a GMM
estimator, be ause the number of moment onditions here is equal to the sample size,

n.

Later we'll see that GLS an be put into the GMM framework dened above).

Theorem 25 If is a GMM estimator that minimizes mn () Wn mn (), the asymptoti vari-

a.s
W = 1
an e of will be minimized by hoosing Wn so that Wn
, where =



limn E nm( 0 )m( 0 ) .

Proof:

For

W = 1
,

the asymptoti varian e

D W D
simplies to

D 1
D

1

1

D W W D
D W D

1

. Now, for any hoi e su h that W 6= 1


, onsider the dieren e

of the inverses of the varian es when

W = 1

versus when

is some arbitrary positive

denite matrix:






D 1
D W D
D D W D D W W D
i
h


1/2

1
I 1/2
D W 1/2
= D 1/2
W D D W W D
D

as an be veried by multipli ation. The term in bra kets is idempotent, whi h is also easy
to he k by multipli ation, and is therefore positive semidenite.

A quadrati form in a

positive semidenite matrix is also positive semidenite. The dieren e of the inverses of the
varian es is positive semidenite, whi h implies that the dieren e of the varian es is negative
semidenite, whi h proves the theorem.
The result

allows us to treat


h
 i

d
1
n 0 N 0, D 1
D

N

where the
of

and

D 1
D
,
n
0

1 !

(15.3)

means approximately distributed as. To operationalize this we need estimators

 

d
The obvious estimator of D is simply mn , whi h is onsistent by the onsisten y
assuming that m is ontinuous in . Sto hasti equi ontinuity results an give
of ,
us this result even if

mn is not ontinuous. We now turn to estimation of

15.5 Estimation of the varian e- ovarian e matrix


(See Hamilton Ch. 10, pp. 261-2 and 280-84) .

15.5.

223

ESTIMATION OF THE VARIANCE-COVARIANCE MATRIX

In the ase that we wish to use the optimal weighting matrix, we need an estimate of
the limiting varian e- ovarian e matrix of

nmn ( 0 ).

While one ould estimate

paramet-

ri ally, we in general have little information upon whi h to base a parametri spe i ation. In
general, we expe t that:

mt

will be auto orrelated (ts

not depend on

= E(mt mts ) 6= 0).

Note that this auto ovarian e will

if the moment onditions are ovarian e stationary.

ontemporaneously orrelated, sin e the individual moment onditions will not in general

and have dierent varian es (E(mit )

be independent of one another (E(mit mjt )

2
= it

6= 0).
).

Sin e we need to estimate so many omponents if we are to take the parametri approa h, it
is unlikely that we would arrive at a orre t parametri spe i ation. For this reason, resear h
has fo used on onsistent nonparametri estimators of

mt

Hen eforth we assume that

mts

does not depend on

v = E(mt mts ).

t).

is ovarian e stationary (the ovarian e between

v th

Dene the

E(mt mt+s ) = v .

and

auto ovarian e of the moment onditions

mt and m are fun tions of


0
Now
now assume that we have some onsistent estimator of , so that m
t = mt ().
Note that

mt

Re all that

so for

!#
!
"
n
n
X
X

mt
1/n
mt
= E nm( 0 )m( 0 ) = E n 1/n


"

= E 1/n

n
X

mt

t=1

n
X

mt

t=1

!#

t=1

t=1

 n2


n1
1
= 0 +
1 + 1 +
2 + 2 +
n1 + n1
n
n
n

A natural, onsistent estimator of

(you might use


of

nv

is

cv = 1/n

n
X

m
tm
tv .

t=v+1

in the denominator instead). So, a natural, but in onsistent, estimator

would be








c + n 2
c + +
[
c1 +
c2 +
=
c0 + n 1
[

n1
1
2
n1
n
n
n1

Xnv 
c .
c0 +
cv +
=

v
n
v=1

This estimator is in onsistent in general, sin e the number of parameters to estimate is more
than the number of observations, and in reases more rapidly than
build up as

n .

On the other hand, supposing that

n,

so information does not

tends to zero su iently rapidly as

tends to

224

CHAPTER 15.

modied estimator

where

q(n)

as

=
c0 +

GENERALIZED METHOD OF MOMENTS

q(n) 

X
c ,
cv +

v
v=1

will be onsistent, provided

nv
term
n an be dropped be ause

q(n)

must be

op (n).

q(n)

grows su iently slowly. The

This allows information to a umulate

at a rate that satises a LLN. A disadvantage of this estimator is that it may not be positive
denite. This ould ause one to al ulate a negative

Note: the formula for


of

The solution to this ir ularity is to set

arbitrarily (for example to an identity matrix), obtain a rst

onsistent but ine ient estimate of


estimate

statisti , for example!

requires an estimate of m( 0 ), whi h in turn requires an estimate

whi h is based upon an estimate of

the weighting matrix

0,

then use this estimate to form

then re-

nor hange appre iably between


0 . The pro ess an be iterated until neither

iterations.

15.5.1 Newey-West ovarian e estimator


E onometri a,

The Newey-West estimator (

1987) solves the problem of possible nonpositive

deniteness of the above estimator. Their estimator is

=
c0 +

q(n) 
X
1
v=1



v
c .
cv +

v
q+1

This estimator is p.d. by onstru tion. The ondition for onsisten y is that
that this is a very slow rate of growth for
no parametri restri tions on the form of

q.

This estimator is nonparametri - we've pla ed

kernel estimator.
(Review of E onomi Studies, 1994)

It is an example of a

In a more re ent paper, Newey and West

whitening

n1/4 q 0. Note

use

pre-

before applying the kernel estimator. The idea is to t a VAR model to the moment

onditions. It is expe ted that the residuals of the VAR model will be more nearly white noise,
so that the Newey-West ovarian e estimator might perform better with short lag lengths..
The VAR model is

m
t = 1 m
t1 + + p m
tp + ut
This is estimated, giving the residuals

u
t . Then the Newey-West ovarian e estimator is applied

to these pre-whitened residuals, and the ovarian e

is estimated ombining the tted VAR

c
c1 m
cp m
m
t =
t1 + +
tp

with the kernel estimate of the ovarian e of the

ut .

See Newey-West for details.

I have a program that does this if you're interested.

15.6.

225

ESTIMATION USING CONDITIONAL MOMENTS

15.6 Estimation using onditional moments


So far, the moment onditions have been presented as un onditional expe tations. One ommon way of dening un onditional moment onditions is based upon onditional moment
onditions.
Suppose that a random variable

has zero expe tation onditional on the random variable

X
EY |X Y =

Y f (Y |X)dY = 0

Then the un onditional expe tation of the produ t of

and a fun tion

g(X)

of

is also zero.

The un onditional expe tation is

EY g(X) =

Z Z
X

Y g(X)f (Y, X)dY


Y

dX.

This an be fa tored into a onditional expe tation and an expe tation w.r.t. the marginal
density of

X:
EY g(X) =

Sin e

g(X)

doesn't depend on

EY g(X) =

Z Z
X

Y g(X)f (Y |X)dY

f (X)dX.

it an be pulled out of the integral

Z Z
X

Y f (Y |X)dY

g(X)f (X)dX.

But the term in parentheses on the rhs is zero by assumption, so

EY g(X) = 0
as laimed.
This is important e onometri ally, sin e models often imply restri tions on onditional
moments. Suppose a model tells us that the fun tion
on the information set

It ,

equal to

K(yt , xt )

has expe tation, onditional

k(xt , ),
E K(yt , xt )|It = k(xt , ).

For example, in the ontext of the lassi al linear model

K(yt , xt ) = yt

so that

k(xt , ) =

xt .

With this, the error fun tion

t () = K(yt , xt ) k(xt , )
has onditional expe tation equal to zero

E t ()|It = 0.

yt = xt + t ,

we an set

226

CHAPTER 15.

GENERALIZED METHOD OF MOMENTS

This is a s alar moment ondition, whi h isn't su ient to identify a

(K > 1).

-dimensional parameter

However, the above result allows us to form various un onditional expe tations

mt () = Z(wt )t ()
where

Z(wt ) is a g 1-ve tor

information set
so as long as

It .

The

g>K

valued fun tion of

Z(wt ) are

wt

and

wt

instrumental variables.

is a set of variables drawn from the


We now have

moment onditions,

the ne essary ondition for identi ation holds.

One an form the

ng

matrix

Z1 (w1 ) Z2 (w1 ) Zg (w1 )

Zg (w2 )
Z1 (w2 ) Z2 (w2 )
Zn =
.
..
.
.
.
Z1 (wn ) Z2 (wn ) Zg (wn )

Z1

Z2

Zn
With this we an form the

moment onditions

mn () =

1 ()

2 ()
1
Zn
.
n
..

n ()

Dene the ve tor of error fun tions

1 ()

2 ()
hn () =
..
.
n ()

With this, we an write

1
Z hn ()
n n
n
1X
=
Zt ht ()
n

mn () =

t=1

n
1X
mt ()
=
n t=1

where

Z(t,)

is the

tth

row of

Zn .

This ts the previous treatment.

An interesting question

15.6.

227

ESTIMATION USING CONDITIONAL MOMENTS

that arises is how one should hoose the instrumental variables

Z(wt )

to a hieve maximum

e ien y.
Note that with this hoi e of moment onditions, we have that
matrix) is

Dn () =
=

Dn () =
Hn

is a

K n

m () (a

Kg


1
Zn hn ()
n

1
hn () Zn
n

whi h we an dene to be

where

Dn

1
Hn Zn .
n

matrix that has the derivatives of the individual moment onditions as

its olumns. Likewise, dene the var- ov. of the moment onditions

where we have dened



n = E nmn ( 0 )mn ( 0 )


1
0
0
Z hn ( )hn ( ) Zn
= E
n n


1

0
0
= Zn E
hn ( )hn ( ) Zn
n
n
Zn
Zn
n

n = V hn ( 0 ) . Note that the dimension

of this matrix is growing

with the sample size, so it is not onsistently estimable without additional assumptions.
The asymptoti normality theorem above says that the GMM estimator using the optimal
weighting matrix is distributed as



d
n 0 N (0, V )
where

V = lim

Hn Zn
n



Zn n Zn
n

1 

Using an argument similar to that used to prove that


we an show that putting

Zn Hn
n

!1

(15.4)

is the e ient weighting matrix,

Zn = 1
n Hn
auses the above var- ov matrix to simplify to

V = lim

Hn 1
n Hn
n

1

(15.5)

and furthermore, this matrix is smaller that the limiting var- ov for any other hoi e of instrumental variables. (To prove this, examine the dieren e of the inverses of the var- ov matri es
with the optimal intruments and with non-optimal instruments. As above, you an show that

228

CHAPTER 15.

GENERALIZED METHOD OF MOMENTS

the dieren e is positive semi-denite).

Note that both

Usually, estimation of

0 , and

where

Hn , whi h

we should write more properly as

Hn ( 0 ), sin e it

depends on

must be onsistently estimated to apply this.

is

Hn

is straightforward - one just uses

 
b = h ,
H
n

some initial onsistent estimator based on non-optimal instruments.

Estimation of

elements than

n,

may not be possible.

It is an

nn

matrix, so it has more unique

the sample size, so without restri tions on the parameters it an't be

estimated onsistently. Basi ally, you need to provide a parametri spe i ation of the
ovarian es of the

ht () in

order to be able to use optimal instruments. A solution is to

approximate this matrix parametri ally to dene the instruments. Note that the simplied var- ov matrix in equation 15.5 will not apply if approximately optimal instruments
are used - it will be ne essary to use an estimator based upon equation 15.4, where the
term

n1 Zn n Zn

must be estimated onsistently apart, for example by the Newey-West

pro edure.

15.7 Estimation using dynami moment onditions


Note that dynami moment onditions simplify the var- ov matrix, but are often harder to
formulate. The will be added in future editions. For now, the Hansen appli ation below is
enough.

15.8 A spe i ation test


The rst order onditions for minimization, using the an estimate of the optimal weighting
matrix, are



 
  1

s() = 2
mn mn 0

or

0
1 mn ()
D()
Consider a Taylor expansion of

Multiplying by

1
D()

:
m()



= mn ( 0 ) + D ( 0 ) 0 + op (1).
m()
n
we obtain




= D()

1 m()
1 mn ( 0 ) + D()
1 D( 0 ) 0 + op (1)
D()

(15.6)

15.8.

229

A SPECIFICATION TEST

The lhs is zero, and sin e

tends

to

and

tends to

we an write



0 a
1
0

D 1
m
(
)
=
D

D
n


or





a
1
0
n 0 = n D 1
D 1
D
mn ( )

With this, and taking into a ount the original expansion (equation 15.6), we get

=
nm()

nmn ( 0 )

This last an be written as

Or

a
=
nm()



 1/2
1/2

1 1
1/2

mn ( 0 )
n D D D
D

a
n1/2
m() =

Now



1
0
nD D 1
D 1
D
mn ( ).




0

1 1
1/2
1/2
n Ig 1/2
D
D

D
D

mn ( )

0
n1/2
mn ( ) N (0, Ig )

and one an easily verify that

is idempotent of rank
tra e) so





1 1
1/2
P = Ig 1/2
D
D D D
g K,

(re all that the rank of an idempotent matrix is equal to its


 

d
1/2

= nm()
1 m()

n1/2
m(
)
n
m(
)
2 (g K)

Sin e

onverges to

we also have


2 (g K)
1 m()
nm()
or

d

n sn ()
2 (g K)
supposing the model is orre tly spe ied.

This is a onvenient test sin e we just multiply

the optimized value of the obje tive fun tion by

n,

and ompare with a

2 (g K)

riti al

value. The test is a general test of whether or not the moments used to estimate are orre tly
spe ied.

This won't work when the estimator is just identied. The f.o. . are

0.
1 m()
D sn () = D

230

CHAPTER 15.

But with exa t identi ation, both

GENERALIZED METHOD OF MOMENTS

and

are square and invertible (at least asymp-

toti ally, assuming that asymptoti normality hold), so

0.
m()
So the moment onditions are zero

regardless

of the weighting matrix used. As su h, we

might as well use an identity matrix and save trouble. Also

= 0,
sn ()

so the test breaks

down.

A note: this sort of test often over-reje ts in nite samples. One should be autious in
reje ting a model when this test reje ts.

15.9 Other estimators interpreted as GMM estimators


15.9.1 OLS with heteros edasti ity of unknown form
Example 26 White's heteros edasti onsistent var ov estimator for OLS.
Suppose

y = X 0 + ,

parameter ve tor, and to estimate

and

a diagonal matrix.

= (),

The typi al approa h is to parameterize

the parameterization of

N (0, ),

where

where

is a nite dimensional

jointly (feasible GLS). This will work well if

is orre t.

If we're not ondent about parameterizing

OLS. However, the typi al ovarian e estimator

we an still estimate

= (X X)1
V ()
2

onsistently by

will be biased and

in onsistent, and will lead to invalid inferen es.


By exogeneity of the regressors

xt

(a

the moment ondition

K 1 olumn ve tor) we have E(xt t ) = 0,whi h suggests


mt () = xt yt xt .

In this ase, we have exa t identi ation (

m() = 1/n

parameters and

mt = 1/n

For any hoi e of

X
t

xt yt 1/n

moment onditions). We have

xt xt .

W, m() will be identi ally zero at the minimum,

due to exa t identi ation.

That is, sin e the number of moment onditions is identi al to the number of parameters, the
fo imply that

0 regardless of W. There is no need to use the optimal


m()

weighting matrix

in this ase, an identity matrix works just as well for the purpose of estimation. Therefore

X
t

whi h is the usual OLS estimator.

xt xt

!1

X
t

xt yt = (X X)1 X y,

15.9.

The GMM estimator of the asymptoti var ov matrix is

d
D

231

OTHER ESTIMATORS INTERPRETED AS GMM ESTIMATORS

is simply

 
m .

In this ase

d
D
= 1/n

Re all that a possible estimator of

X
t

d
b 1 d
D
D

1

Re all that

xt xt = X X/n.

is

=
c0 +

n1
X
v=1


c .
cv +

This is in general in onsistent, but in the present ase of nonauto orrelation, it simplies to

=
c0

whi h has a onstant number of elements to estimate, so information

will

a umulate, and

onsisten y obtains. In the present ase

b =
c0 = 1/n

= 1/n

= 1/n
=
where

is an

nn

"

"

n
X
t=1

n
X

n
X

m
tm
t

t=1

xt xt

2
yt xt

xt xt 2t

t=1

X EX

diagonal matrix with

2t

in the position

t, t.

Therefore, the GMM var ov. estimator, whi h is onsistent, is

!
)1

1  X X 
X EX
X X

n
n
n
!
 1


XX
X X 1
X EX
=
n
n
n

 

V
n
=

(

This is the var ov estimator that White (1980) arrived at in an inuential arti le. This estimator is onsistent under heteros edasti ity of an unknown form. If there is auto orrelation,
the Newey-West estimator an be used to estimate

- the rest is the same.

232

CHAPTER 15.

GENERALIZED METHOD OF MOMENTS

15.9.2 Weighted Least Squares


Consider the previous example of a linear model with heteros edasti ity of unknown form:

y = X 0 +
N (0, )
where

is a diagonal matrix.

Now, suppose that the form of

tion (whi h may also depend upon

is known, so that

X).

( 0 )

is a orre t parametri spe i a-

In this ase, the GLS estimator is

1 1
= X 1 X
X y)

This estimator an be interpreted as the solution to the

= 1/n
m()

moment onditions

X xt yt
X xt x
t

1/n
0.
0)
0)

(
t
t
t
t

That is, the GLS estimator in this ase has an obvious representation as a GMM estimator.
With auto orrelation, the representation exists but it is a little more ompli ated. Nevertheless, the idea is the same. There are a few points:

The (feasible) GLS estimator is known to be asymptoti ally e ient in the lass of linear
asymptoti ally unbiased estimators (Gauss-Markov).

This means that it is more e ient than the above example of OLS with White's het-

This means that the hoi e of the moment onditions is important to a hieve e ien y.

eros edasti onsistent ovarian e, whi h is an alternative GMM estimator.

15.9.3 2SLS
Consider the linear model

yt = zt + t ,
or

y = Z +
using the usual onstru tion, where

is

K1

and

one of a system of simultaneous equations, so that


variables. Suppose that
un orrelated with

Dene

xt

zt

is i.i.d. Suppose that this equation is

ontains both endogenous and exogenous

is the ve tor of all exogenous and predetermined variables that are

(suppose that

xt

is

r 1).

as the ve tor of predi tions of Z when regressed upon X, e.g., Z


= X (X X)1 X Z
Z
= X X X
Z

1

X Z

15.9.

Sin e
with

t
x, z

is a linear ombination of the exogenous variables


This suggests the

K -dimensional

so

moment ondition

m() = 1/n

233

OTHER ESTIMATORS INTERPRETED AS GMM ESTIMATORS

Sin e we have

parameters and

zt zt

mt () = zt (yt zt )

and


zt yt zt .

moment onditions, the GMM estimator will set

identi ally equal to zero, regardless of

must be un orrelated

W,

!1

so we have

X
t


1
Z
y
(
zt yt ) = Z
Z

This is the standard formula for 2SLS. We use the exogenous variables and the redu ed form
predi tions of the endogenous variables as instruments, and apply IV estimation. See Hamilton
pp. 420-21 for the var ov formula (whi h is the standard formula for 2SLS), and for how to
deal with

heterogeneous and dependent (basi ally, just use the Newey-West or some other

, and apply the usual formula).

onsistent estimator of

Note that

dependent auses lagged

endogenous variables to loose their status as legitimate instruments.

15.9.4 Nonlinear simultaneous equations


GMM provides a onvenient way to estimate nonlinear systems of simultaneous equations. We
have a system of equations of the form

y1t = f1 (zt , 10 ) + 1t
y2t = f2 (zt , 20 ) + 2t
.
.
.

0
yGt = fG (zt , G
) + Gt ,
or in ompa t notation

yt = f (zt , 0 ) + t ,
where

f ()

is a

-ve tor valued fun tion, and

We need to nd an
with

it .

0 ) .
0 = (10 , 20 , , G

Ai 1 ve tor of instruments xit , for ea h equation, that are un orrelated

Typi al instruments would be low order monomials in the exogenous variables in

with their lagged values. Then we an dene the

P
G

i=1 Ai

(y1t f1 (zt , 1 )) x1t

(y2t f2 (zt , 2 )) x2t


mt () =
.

.
.

(yGt fG (zt , G )) xGt

A note on identi ation:

zt ,

orthogonality onditions

sele tion of instruments that ensure identi ation is a non-

234

CHAPTER 15.

GENERALIZED METHOD OF MOMENTS

trivial problem.

A note on e ien y: the sele ted set of instruments has important ee ts on the e ien y
of estimation.

Unfortunately there is little theory oering guidan e on what is the

optimal set. More on this later.

15.9.5 Maximum likelihood


In the introdu tion we argued that ML will in general be more e ient than GMM sin e ML
impli itly uses all of the moments of the distribution while GMM uses a limited number of
moments. A tually, a distribution with

ment onditions. However, some sets of

parameters an be uniquely hara terized by

mo-

moment onditions may ontain more information

than others, sin e the moment onditions ould be highly orrelated. A GMM estimator that
hose an optimal set of

moment onditions would be fully e ient. Here we'll see that the

optimal moment onditions are simply the s ores of the ML estimator.


Let

yt

be a

-ve tor of variables, and let

Yt = (y1 , y2 , ..., yt ) .

Then at time

t, Yt1

has

been observed (refer to it as the information set, sin e we assume the onditioning variables
have been sele ted to take advantage of all useful information). The likelihood fun tion is the
joint density of the sample:

L() = f (y1 , y2 , ..., yn , )


whi h an be fa tored as

L() = f (yn |Yn1 , ) f (Yn1 , )


and we an repeat this to get

L() = f (yn |Yn1 , ) f (yn1 |Yn2 , ) ... f (y1 ).


The log-likelihood fun tion is therefore

ln L() =

n
X
t=1

ln f (yt |Yt1 , ).

Dene

mt (Yt , ) D ln f (yt|Yt1 , )
as the

s ore

of the

tth

observation.

It an be shown that, under the regularity onditions,

that the s ores have onditional mean zero when evaluated at

(see notes to Introdu tion to

E onometri s):

E{mt (Yt , 0 )|Yt1 } = 0


so one ould interpret these as moment onditions to use to dene a just-identied GMM
estimator ( if there are

1/n

parameters there are

n
X
t=1

= 1/n
mt (Yt , )

n
X
t=1

s ore equations). The GMM estimator sets

= 0,
D ln f (yt |Yt1 , )

15.9.

235

OTHER ESTIMATORS INTERPRETED AS GMM ESTIMATORS

whi h are pre isely the rst order onditions of MLE. Therefore, MLE an be interpreted as
a GMM estimator. The GMM var ov formula is

V = D 1 D

Consistent estimates of varian e omponents are as follows

1

= 1/n
d
D2 ln f (yt |Yt1 , )
D
m(Y
,
)
=
t

t=1

It is important to note that


tionally un orrelated.
fun tion of

Yts ,

mt

mts , s > 0

and

are both onditionally and un ondi-

Conditional un orrelation follows from the fa t that

whi h is in the information set at time

t.

mts

is a

Un onditional un orrelation

follows from the fa t that onditional un orrelation hold regardless of the realization of

Yt1 ,

so marginalizing with respe t to

Yt1

preserves un orrelation (see the se tion on

ML estimation, above). The fa t that the s ores are serially un orrelated implies that

th auto ovarian e of the moment onditions:


an be estimated by the estimator of the 0

b = 1/n

n
X

t (Yt , )
= 1/n
mt (Yt , )m

n h
ih
i
X
D ln f (yt |Yt1 , )

D ln f (yt |Yt1 , )
t=1

t=1

Re all from study of ML estimation that the information matrix equality (equation

??) states

that

n




 o
= E D2 ln f (yt |Yt1 , 0 ) .
D ln f (yt |Yt1 , 0 ) D ln f (yt |Yt1 , 0 )

This result implies the well known (and already seeen) result that we an estimate

in any

of three ways:

The sandwi h version:

nP
o

n
2 ln f (y |Y

D
,
)

t
t1
t=1

P h
ih
i 1
n
c
D ln f (yt |Yt1 , )

V = n

t=1 D ln f (yt |Yt1 , )

n
o

Pn

D2 ln f (yt|Yt1 , )
t=1

or the inverse of the negative of the Hessian (sin e the middle and last term an el,
ex ept for a minus sign):

"

Vc
= 1/n

n
X
t=1

#1

D2 ln f (yt |Yt1 , )

or the inverse of the outer produ t of the gradient (sin e the middle and last an el
ex ept for a minus sign, and the rst term onverges to minus the inverse of the middle
term, whi h is still inside the overall inverse)

236

CHAPTER 15.

Vc
=

1/n

n h
X
t=1

GENERALIZED METHOD OF MOMENTS

ih
i
D ln f (yt |Yt1 , )

D ln f (yt |Yt1 , )

)1

This simpli ation is a spe ial result for the MLE estimator - it doesn't apply to GMM
estimators in general.
Asymptoti ally, if the model is orre tly spe ied, all of these forms onverge to the same
limit. In small samples they will dier. In parti ular, there is eviden e that the outer produ t of the gradient formula does not perform very well in small samples (see Davidson and
Ma Kinnon, pg. 477). White's

Information matrix test

(E onometri a, 1982) is based upon

omparing the two ways to estimate the information matrix:

outer produ t of gradient or

negative of the Hessian. If they dier by too mu h, this is eviden e of misspe i ation of the
model.

15.10 Example: The MEPS data


The MEPS data on health are usage dis ussed in se tion 13.4.2 estimated a Poisson model by
maximum likelihood (probably misspe ied). Perhaps the same latent fa tors (e.g., hroni
illness) that indu e one to make do tor visits also inuen e the de ision of whether or not
to pur hase insuran e.

If this is the ase, the PRIV variable ould well be endogenous, in

whi h ase, the Poisson ML estimator would be in onsistent, even if the onditional mean
were orre tly spe ied.

The O tave s ript meps.m estimates the parameters of the model

presented in equation 13.1, using Poisson ML (better thought of as quasi-ML), and IV
1

estimation . Both estimation methods are implemented using a GMM form. Running that
s ript gives the output

OBDV

******************************************************
IV
GMM Estimation Results
BFGS onvergen e: Normal onvergen e
Obje tive fun tion value: 0.004273
Observations: 4564
No moment ovarian e supplied, assuming effi ient weight matrix

X^2 test
1

Value
19.502

df
3.000

p-value
0.000

The validity of the instruments used may be debatable, but real data sets often don't ontain ideal instru-

ments.

15.10.

EXAMPLE: THE MEPS DATA

237

estimate
st. err
t-stat
p-value
onstant
-0.441
0.213
-2.072
0.038
pub. ins.
-0.127
0.149
-0.851
0.395
priv. ins.
-1.429
0.254
-5.624
0.000
sex
0.537
0.053
10.133
0.000
age
0.031
0.002
13.431
0.000
edu
0.072
0.011
6.535
0.000
in
0.000
0.000
4.500
0.000
******************************************************

******************************************************
Poisson QML
GMM Estimation Results
BFGS onvergen e: Normal onvergen e
Obje tive fun tion value: 0.000000
Observations: 4564
No moment ovarian e supplied, assuming effi ient weight matrix
Exa tly identified, no spe . test
estimate
st. err
t-stat
p-value
onstant
-0.791
0.149
-5.289
0.000
pub. ins.
0.848
0.076
11.092
0.000
priv. ins.
0.294
0.071
4.136
0.000
sex
0.487
0.055
8.796
0.000
age
0.024
0.002
11.469
0.000
edu
0.029
0.010
3.060
0.002
in
-0.000
0.000
-0.978
0.328
******************************************************

Note how the Poisson QML results, estimated here using a GMM routine, are the same as
were obtained using the ML estimation routine (see subse tion 13.4.2). This is an example of
how (Q)ML may be represented as a GMM estimator. Also note that the IV and QML results
are onsiderably dierent.

Treating PRIV as potentially endogenous auses the sign of its

oe ient to hange. Perhaps it is logi al that people who own private insuran e make fewer
visits, if they have to make a o-payment. Note that in ome be omes positive and signi ant
when PRIV is treated as endogenous.

238

CHAPTER 15.

GENERALIZED METHOD OF MOMENTS

Figure 15.1: OLS


OLS estimates

0.14

0.12

0.1

0.08

0.06

0.04

0.02

0
2.28

2.3

2.32

2.34

2.36

2.38

Perhaps the dieren e in the results depending upon whether or not PRIV is treated as
endogenous an suggest a method for testing exogeneity. Onward to the Hausman test!

15.11 Example: The Hausman Test


This se tion dis usses the Hausman test, whi h was originally presented in Hausman, J.A.
(1978), Spe i ation tests in e onometri s,

E onometri a, 46, 1251-71.

Consider the simple linear regression model

yt = xt + t .

We assume that the fun tional

form and the hoi e of regressors is orre t, but that the some of the regressors may be
orrelated with the error term, whi h as you know will produ e in onsisten y of

For example,
.

this will be a problem if

if some regressors are endogeneous

some regressors are measured with error

lagged values of the dependent variable are used as regressors and

is auto orrelated.

To illustrate, the O tave program biased.m performs a Monte Carlo experiment where errors
are orrelated with regressors, and estimation is by OLS and IV. The true value of the slope
oe ient used to generate the data is

= 2.

Figure 15.1 shows that the OLS estimator is

quite biased, while Figure 15.2 shows that the IV estimator is on average mu h loser to the
true value. If you play with the program, in reasing the sample size, you an see eviden e that
the OLS estimator is asymptoti ally biased, while the IV estimator is onsistent.
We have seen that in onsistent and the onsistent estimators onverge to dierent probability limits.

This is the idea behind the Hausman test - a pair of onsistent estimators

15.11.

239

EXAMPLE: THE HAUSMAN TEST

Figure 15.2: IV
IV estimates

0.14

0.12

0.1

0.08

0.06

0.04

0.02

0
1.9

1.92

1.94

1.96

1.98

2.02

2.04

2.06

2.08

onverge to the same probability limit, while if one is onsistent and the other is not they

e.g.,

onverge to dierent limits. If we a ept that one is onsistent (

e.g., the OLS

we are doubting if the other is onsistent (

the IV estimator), but

estimator), we might try to he k if

the dieren e between the estimators is signi antly dierent from zero.

If we're doubting about the onsisten y of OLS (or QML,

et .),

why should we be

interested in testing - why not just use the IV estimator? Be ause the OLS estimator
is more e ient when the regressors are exogenous and the other lassi al assumptions
(in luding normality of the errors) hold. When we have a more e ient estimator that
relies on stronger assumptions (su h as exogeneity) than the IV estimator, we might
prefer to use it, unless we have eviden e that the assumptions are false.

So, let's onsider the ovarian e between the MLE estimator


estimator) and some other CAN estimator, say

(or any other fully e ient

Now, let's re all some results from MLE.

Equation 4.2 is:

Equation 4.7 is




a.s.
n 0 H (0 )1 ng(0 ).
H () = I ().

Combining these two equations, we get




a.s.
n 0 I (0 )1 ng(0 ).
Also, equation 4.9 tells us that the asymptoti ovarian e between any CAN estimator and

240

CHAPTER 15.

GENERALIZED METHOD OF MOMENTS

the MLE s ore ve tor is

 # "
#
" 

n
V ()
IK
.
=
V

IK
I ()
ng()
Now, onsider

"



 #
#" 

n

n
a.s.
 .

I ()1
n
ng()

IK

0K

0K

The asymptoti ovarian e of this is



#"
"
#"

n
IK
I
0
V
(
)
I
K
K

K
 =
V 
1
0K
0K I ()
IK
I ()
n
#
"

V ()
I ()1
,
=
I ()1 I ()1

0K
I ()1

whi h, for larity in what follows, we might write as

 "

#
1

n
V
(
)
I
()

 =
V 
.

I ()1 V ()
n
So, the asymptoti ovarian e between the MLE and any other CAN estimator is equal to the
MLE asymptoti varian e (the inverse of the information matrix).
Now, suppose we with to test whether the the two estimators are in fa t both onverging

0 , versus the alternative hypothesis that the MLE estimator is not in fa t onsistent (the
is a maintained hypothesis). Under the null hypothesis that they are, we have
onsisten y of

to

IK

IK




n 0


 = n ,
n 0

will be asymptoti ally normally distributed as

So,

where





d
V ()
.
n N 0, V ()

 
1 

d
V ()

n
V ()
2 (),

is the rank of the dieren e of the asymptoti varian es.

A statisti that has the same

asymptoti distribution is


 
1 

d
V ()


V ()
2 ().
This is the Hausman test statisti , in its original form. The reason that this test has power
under the alternative hypothesis is that in that ase the MLE estimator will not be onsistent,

15.11.

241

EXAMPLE: THE HAUSMAN TEST

Figure 15.3: In orre t rank and the Hausman test

A , say, where A 6= 0 . Then the mean of the


 
n will be 0 A , a non-zero ve tor, so the test

and will onverge to

asymptoti distribution

of ve tor

statisti will eventually

reje t, regardless of how small a signi an e level is used.

Note: if the test is based on a sub-ve tor of the entire parameter ve tor of the MLE,
it is possible that the in onsisten y of the MLE will not show up in the portion of the
ve tor that has been used. If this is the ase, the test may not have power to dete t the
in onsisten y. This may o ur, for example, when the onsistent but ine ient estimator
is not identied for all the parameters of the model.

Some things to note:

The rank,

, of the dieren e of the asymptoti varian es is often less than the dimension

of the matri es, and it may be di ult to determine what the true rank is. If the true
rank is lower than what is taken to be true, the test will be biased against reje tion of
the null hypothesis. The ontrary holds if we underestimate the rank.

A solution to this problem is to use a rank 1 test, by omparing only a single oe ient.
For example, if a variable is suspe ted of possibly being endogenous, that variable's
oe ients may be ompared.

This simple formula only holds when the estimator that is being tested for onsisten y is

fully

e ient under the null hypothesis. This means that it must be a ML estimator or a

242

CHAPTER 15.

GENERALIZED METHOD OF MOMENTS

fully e ient estimator that has the same asymptoti distribution as the ML estimator.
This is quite restri tive sin e modern estimators su h as GMM and QML are not in
general fully e ient.

Following up on this last point, let's think of two not ne essarily e ient estimators,

and

2 ,

where one is assumed to be onsistent, but the other may not be. We assume for expositional
simpli ity that both

and

belong to the same parameter spa e, and that they an be

expressed as generalized method of moments (GMM) estimators. The estimators are dened
(suppressing the dependen e upon data) by

i = arg min mi (i ) Wi mi (i )
i

where

mi (i )

is a

weighting matrix,

gi 1

ve tor of moment onditions, and

i = 1, 2.

Wi

gi gi

is a

Consider the omnibus GMM estimator



i
h
1 , 2 = arg min m1 (1 ) m2 (2 )

"

W1
0(g2 g1 )

0(g1 g2 )
W2

#"

positive denite

m1 (1 )
m2 (2 )

(15.7)

Suppose that the asymptoti ovarian e of the omnibus moment ve tor is

#)
(
"

m1 (1 )
= lim V ar
n
n
m2 (2 )
!
1 12
.

(15.8)

The standard Hausman test is equivalent to a Wald test of the equality of

and

(or

subve tors of the two) applied to the omnibus GMM estimator, but with the ovarian e of the
moment onditions estimated as

b=

c1

0(g2 g1 )

0(g1 g2 )
c2

While this is learly an in onsistent estimator in general, the omitted

12

term an els out of

the test statisti when one of the estimators is asymptoti ally e ient, as we have seen above,
and thus it need not be estimated.
The general solution when neither of the estimators is e ient is lear: the entire
must be estimated onsistently, sin e the

12 term will not an el out.

matrix

Methods for onsistently

, e.g.,

estimating the asymptoti ovarian e of a ve tor of moment onditions are well-known

the Newey-West estimator dis ussed previously. The Hausman test using a proper estimator
of the overall ovarian e matrix will now have an asymptoti

distribution when neither

estimator is e ient. This is


However, the test suers from a loss of power due to the fa t that the omnibus GMM
estimator of equation 15.7 is dened using an ine ient weight matrix.

A new test an be

15.12.

243

APPLICATION: NONLINEAR RATIONAL EXPECTATIONS

dened by using an alternative omnibus GMM estimator


where

1 , 2 = arg min

m1 (1

m2 (2

i  1
e

"

m1 (1 )
m2 (2 )

is a onsistent estimator of the overall ovarian e matrix

(15.9)

of equation 15.8.

By

standard arguments, this is a more e ient estimator than that dened by equation 15.7, so
the Wald test using this alternative is more powerful. See my arti le in

Applied E onomi s,

2004, for more details, in luding simulation results. The O tave s ript hausman.m al ulates
the Wald test orresponding to the e ient joint GMM estimator (the H2 test in my paper),
for a simple linear model.

15.12 Appli ation: Nonlinear rational expe tations


Readings:

Hansen and Singleton, 1982

; Tau hen, 1986

Though GMM estimation has many appli ations, appli ation to rational expe tations models is elegant, sin e theory dire tly suggests the moment onditions. Hansen and Singleton's
1982 paper is also a lassi worth studying in itself. Though I strongly re ommend reading
the paper, I'll use a simplied model with similar notation to Hamilton's.
We assume a representative onsumer maximizes expe ted dis ounted utility over an innite horizon. Utility is temporally additive, and the expe ted utility hypothesis holds. The
future onsumption stream is the sto hasti sequen e

is the dis ounted expe ted utility

X
s=0

The parameter

It

is the

indexed

The obje tive fun tion at time

s E (u(ct+s )|It ) .

(15.10)

is between 0 and 1, and ree ts dis ounting.

information set
t

{ct }
t=0 .

at time

t,

and in ludes the all realizations of random variables

and earlier.

The hoi e variable is

ct

equal to urrent wealth

- urrent onsumption, whi h is onstained to be less than or

wt .

Suppose the onsumer an invest in a risky asset. A dollar invested in the asset yields a
gross return

(1 + rt+1 ) =
where

pt

is the pri e and

dt

pt+1 + dt+1
pt

is the dividend in period

t.

The pri e of

ct

is normalized to

1.

Current wealth

wt = (1 + rt )it1 ,

where

it1

is investment in period

t 1.

So the

problem is to allo ate urrent wealth between urrent onsumption and investment to
nan e future onsumption:

wt = ct + it .

244

CHAPTER 15.

Future net rates of return

rt+s , s > 0

are

GENERALIZED METHOD OF MOMENTS

not known

in period

t:

the asset is risky.

A partial set of ne essary onditions for utility maximization have the form:



u (ct ) = E (1 + rt+1 ) u (ct+1 )|It .

(15.11)

To see that the ondition is ne essary, suppose that the lhs < rhs.

Then by redu ing ur-

rent onsumption marginally would ause equation 15.10 to drop by

u (ct ),

sin e there is no

dis ounting of the urrent period. At the same time, the marginal redu tion in onsumption
nan es investment, whi h has gross return
period
by

t + 1.

(1 + rt+1 ) ,

whi h ould nan e onsumption in

This in rease in onsumption would ause the obje tive fun tion to in rease

E {(1 + rt+1 ) u (ct+1 )|It } .

Therefore, unless the ondition holds, the expe ted dis ounted

utility fun tion is not maximized.

To use this we need to hoose the fun tional form of utility.


aversion form is

u(ct ) =
where

A onstant relative risk

c1
1
t
1

is the oe ient of relative risk aversion. With this form,

u (ct ) = c
t
so the fo are

o
n

c
t = E (1 + rt+1 ) ct+1 |It

While it is true that


n
o
E ct (1 + rt+1 ) c
|It = 0
t+1

so that we ould use this to dene moment onditions, it is unlikely that

ct

is stationary, even

though it is in real terms, and our theory requires stationarity. To solve this, divide though
by

c
t
E

(note that

ct

1-

(1 + rt+1 )

ct+1
ct

 )!

|It = 0

an be passed though the onditional expe tation sin e

upon information available in time

1-

ht () dened above:

(1 + rt+1 )

ct+1
ct

It .

 )

it's a s alar moment ondition. To get a ve tor of moment

onditions we need some instruments. Suppose that


information set

is hosen based only

t).

Now

is analogous to

ct

zt

is a ve tor of variables drawn from the

We an use the ne essary onditions to form the expressions

1 (1 + rt+1 )

ct+1
ct

 

zt mt ()

15.13.

245

EMPIRICAL EXAMPLE: A PORTFOLIO MODEL

represents

and

Therefore, the above expression may be interpreted as a moment ondition whi h an


be used for GMM estimation of the parameters

Note that at time

t, mts

0.

has been observed, and is therefore an element of the information

set. By rational expe tations, the auto ovarian es of the moment onditions other than

should be zero. The optimal weighting matrix is therefore the inverse of the varian e of the
moment onditions:



= lim E nm( 0 )m( 0 )

whi h an be onsistently estimated by

= 1/n

n
X

t ()

mt ()m

t=1

As before, this estimate depends on an initial onsistent estimate of


by setting the weighting matrix
obtaining

whi h an be obtained

arbitrarily (to an identity matrix, for example).

After

we then minimize

1 m().
s() = m()
This pro ess an be iterated, e.g., use the new estimate to re-estimate

use this to estimate

0 , and repeat until the estimates don't hange.

In prin iple, we ould use a very large number of moment onditions in estimation, sin e

any urrent or lagged variable

ould be used in

xt . Sin e

use of more moment onditions

will lead to a more (asymptoti ally) e ient estimator, one might be tempted to use
many instrumental variables. We will do a omputer lab that will show that this may
not be a good idea with nite samples. This issue has been studied using Monte Carlos
(Tau hen,

JBES, 1986).

is that the estimate of

The reason for poor performan e when using many instruments


be omes very impre ise.

Empiri al papers that use this approa h often have serious problems in obtaining pre ise
estimates of the parameters. Note that we are basing everything on a single parial rst
order ondition. Probably this f.o. . is simply not informative enough. Simulation-based
estimation methods (dis ussed below) are one means of trying to use more informative
moment onditions to estimate this sort of model.

15.13 Empiri al example: a portfolio model


The O tave program portfolio.m performs GMM estimation of a portfolio model, using the
data le tau hen.data. The olumns of this data le are
95 observations (sour e: Tau hen,

JBES,

c, p,

and

in that order. There are

1986). As instruments we use lags of

well as a onstant. For a single lag the estimation results are

and

r,

as

246

CHAPTER 15.

GENERALIZED METHOD OF MOMENTS

MPITB extensions found

******************************************************
Example of GMM estimation of rational expe tations model
GMM Estimation Results
BFGS onvergen e: Normal onvergen e
Obje tive fun tion value: 0.000014
Observations: 94

X^2 test

Value
0.001

df
1.000

p-value
0.971

estimate
st. err
t-stat
p-value
beta
0.915
0.009
97.271
0.000
gamma
0.569
0.319
1.783
0.075
******************************************************
For two lags the estimation results are

MPITB extensions found

******************************************************
Example of GMM estimation of rational expe tations model
GMM Estimation Results
BFGS onvergen e: Normal onvergen e
Obje tive fun tion value: 0.037882
Observations: 93

X^2 test

Value
3.523

df
3.000

p-value
0.318

estimate
st. err
t-stat
p-value
beta
0.857
0.024
35.636
0.000
gamma
-2.351
0.315
-7.462
0.000
******************************************************

15.13.

EMPIRICAL EXAMPLE: A PORTFOLIO MODEL

Pretty learly, the results are sensitive to the hoi e of instruments.

247

Maybe there is some

problem here: poor instruments, or possibly a onditional moment that is not very informative.
Moment onditions formed from Euler onditions sometimes do not identify the parameter of
a model. See Hansen, Heaton and Yarron, (1996)
haven't he ked it arefully)?

JBES

V14, N3. Is that a problem here, (I

248

CHAPTER 15.

GENERALIZED METHOD OF MOMENTS

15.14 Exer ises


1. Show how to ast the generalized IV estimator presented in se tion 11.4 as a GMM
estimator. Identify what are the moment onditions,
matrix

Dn ,

mt (),

what is the form of the the

what is the e ient weight matrix, and show that the ovarian e matrix

formula given previously orresponds to the GMM ovarian e matrix formula.


2. The O tave s ript meps.m estimates a model for o e-based do tpr visits (OBDV) using
two dierent moment onditions, a Poisson QML approa h and an IV approa h. If all
onditioning variables are exogenous, both approa hes should be onsistent. If the PRIV
variable is endogenous, only the IV approa h should be onsistent. Neither of the two
estimators is e ient in any ase, sin e we already know that this data exhibits variability
that ex eeds what is implied by the Poisson model (e.g., negative binomial and other
models t mu h better). Test the exogeneity of the variable PRIV with a GMM-based
Hausman-type test, using the O tave s ript hausman.m for hints about how to set up
the test.
3. Using O tave, generate data from the logit dgp.
prove quite helpful. Re all that

E(yt |xt ) = p(xt , ) = [1 + exp(xt )]1 .

moment ondtions (exa tly identied)

(a) Estimate by GMM (using


(b) Estimate by ML (using

The s ript EstimateLogit.m should


Consider the

mt () = [yt p(xt , )]xt

gmm_results),

using these moments.

mle_results).

( ) The two estimators should oin ide. Prove analyti ally that the estimators oi ide.

4. Verify the missing steps needed to show that

1 m()
n m()

has a

2 (g K)

dis-

tribution. That is, show that the monster matrix is idempotent and has tra e equal to

g K.
5. For the portfolio example, experiment with the program using lags of 3 and 4 periods to
dene instruments

(a) Iterate the estimation of

= (, )

and

to onvergen e.

(b) Comment on the results. Are the results sensitive to the set of instruments used?
Look at

as well as

Are these good instruments? Are the instruments highly

orrelated with one another? Is there something analogous to ollinearity going on


here?

Chapter 16
Quasi-ML
Quasi-ML is the estimator one obtains when a misspe ied probability model is used to al ulate an ML estimator.
Given a sample of size

suppose the joint density of

of a random ve tor

Y=

member of the parametri family


the ve tor

0 :

y

and a ve tor of onditioning variables

y1 . . . yn onditional on X = x1 . . . xn
pY (Y|X, ), . The true joint density is asso iated

x,

is a

with

pY (Y|X, 0 ).
X

As long as the marginal density of

doesn't depend on

0 ,

this onditional density fully

hara terizes the random hara teristi s of samples: i.e., it fully des ribes the probabilisti ally

likelihood fun tion

important features of the d.g.p. The


values

Let

Yt1 =

L(Y|X, ) = pY (Y|X, ), .
y1 . . . yt1

Y0 = 0,

and let

Xt =

The likelihood

pt ().

Mistakenly, we

x1 . . . xt

fun tion, taking into a ount possible dependen e of observations, an be written as

L(Y|X, ) =

is just this density evaluated at other

n
Y

t=1
n
Y

pt (yt |Yt1 , Xt , )
pt ()

t=1

The average log-likelihood fun tion is:

1
1X
sn () = ln L(Y|X, ) =
ln pt ()
n
n
t=1

Suppose that we do not have knowledge of the family of densities


may assume that the onditional density of

0
where there is no su h that

yt is a member of the family ft (yt |Yt1 , Xt , ),

ft (yt |Yt1 , Xt , 0 ) = pt (yt |Yt1 , Xt , 0 ), t


249

(this

250

CHAPTER 16.

QUASI-ML

is what we mean by misspe ied).

This setup allows for heterogeneous time series data, with dynami misspe i ation.

The QML estimator is the argument that maximizes the

misspe ied average log likelihood,

whi h we refer to as the quasi-log likelihood fun tion. This obje tive fun tion is

1X
ln ft (yt |Yt1 , Xt , 0 )
n t=1

sn () =

1X
ln ft ()
n

t=1

and the QML is

n = arg max sn ()

A SLLN for dependent sequen es applies (we assume), so that

1X
ln ft () s ()
sn () lim E
n n
t=1
a.s.

We assume that this an be strengthened to uniform onvergen e, a.s., following the previous
arguments. The pseudo-true value of

is the value that maximizes

s():

0 = arg max s ()

Given assumptions so that theorem 19 is appli able, we obtain

lim n = 0 , a.s.

Applying the asymptoti normality theorem,





d
n 0 N 0, J ( 0 )1 I ( 0 )J ( 0 )1

where

J ( 0 ) = lim ED2 sn ( 0 )
n

and

I ( 0 ) = lim V ar nD sn ( 0 ).
n

Note that asymptoti normality only requires that the additional assumptions regarding

and

hold in a neighborhood of

for

and at

sense, asymptoti normality is a lo al property.

0,

for

I,

not throughout

In this

16.1.

251

CONSISTENT ESTIMATION OF VARIANCE COMPONENTS

16.1 Consistent Estimation of Varian e Components


Consistent estimation of
that

J ( 0 )

is straightforward.

Assumption (b) of Theorem 22 implies

1X 2
1X 2
a.s.
D ln ft (n ) lim E
D ln ft ( 0 ) = J ( 0 ).
Jn (n ) =
n n
n t=1
t=1

That is, just al ulate the Hessian using the estimate


Consistent estimation of

Notation:

Let

I ( 0 )

in pla e of

0.

is more di ult, and may be impossible.

gt D ft ( 0 )

We need to estimate

I ( 0 ) =

lim V ar nD sn ( 0 )

n
1X
D ln ft ( 0 )
= lim V ar n
n
n
t=1

n
X

1
gt
V ar
n
t=1
( n
! n
! )
X
X
1
= lim E
(gt Egt )
(gt Egt )
n n
t=1
t=1
=

lim

This is going to ontain a term

1X
(Egt ) (Egt )
n n
t=1
lim

whi h will not tend to zero, in general.

This term is not onsistently estimable in general,

sin e it requires al ulating an expe tation using the true density under the d.g.p., whi h is
unknown.

There are important ases where

I ( 0 ) is

onsistently estimable. For example, suppose

i.e., they are iid).

that the data ome from a random sample (

This would be the ase with

ross se tional data, for example. (Note: under i.i.d. sampling, the joint distribution of

(yt , xt ) is identi al.

This does not imply that the onditional density

f (yt|xt ) is identi al).

With random sampling, the limiting obje tive fun tion is simply

s ( 0 ) = EX E0 ln f (y|x, 0 )
where

E0

density of

means expe tation of

x.

y|x

and

EX

means expe tation respe t to the marginal

By the requirement that the limiting obje tive fun tion be maximized at

D EX E0 ln f (y|x, 0 ) = D s ( 0 ) = 0

we have

252

CHAPTER 16.

QUASI-ML

The dominated onvergen e theorem allows swit hing the order of expe tation and differentiation, so

D EX E0 ln f (y|x, 0 ) = EX E0 D ln f (y|x, 0 ) = 0
The CLT implies that

1 X
d

D ln f (y|x, 0 ) N (0, I ( 0 )).


n t=1
That is, it's not ne essary to subtra t the individual means, sin e they are zero. Given
this, and due to independent observations, a onsistent estimator is

1X

ln ft ()
D ln ft ()D
Ib =
n
t=1

This is an important ase where onsistent estimation of the ovarian e matrix is possible.
Other ases exist, even for dynami ally misspe ied time series models.

16.2 Example: the MEPS Data


To he k the plausibility of the Poisson model for the MEPS data, we an ompare the sample
un onditional varian e with the estimated un onditional varian e a ording to the Poisson
model:

V[
(y) =

Pn

t=1

. Using the program PoissonVarian e.m, for OBDV and ERV, we get

Table 16.1: Marginal Varian es, Sample and Estimated (Poisson)


OBDV

ERV

Sample

38.09

0.151

Estimated

3.28

0.086

We see that even after onditioning, the overdispersion is not aptured in either ase. There
is huge problem with OBDV, and a signi ant problem with ERV. In both ases the Poisson
model does not appear to be plausible. You an he k this for the other use measures if you
like.

16.2.1 Innite mixture models: the negative binomial model


Referen e: Cameron and Trivedi (1998)

Regression analysis of ount data, hapter 4.

The two measures seem to exhibit extra-Poisson variation. To apture unobserved heterogeneity, a possibility is the

random parameters

approa h.

Consider the possibility that the

16.2.

253

EXAMPLE: THE MEPS DATA

onstant term in a Poisson model were random:

exp() y
y!
= exp(x + )

fY (y|x, ) =

= exp(x) exp()
=
where

= exp(x )

and

= exp().

problem is that we don't observe

Now

an

aptures the randomness in the onstant. The

so we will need to marginalize it to get a usable density

fY (y|x) =
This density

exp[] y
fv (z)dz
y!

be used dire tly, perhaps using numeri al integration to evaluate the likeli-

hood fun tion. In some ases, though, the integral will have an analyti solution. For example,
if

follows a ertain one parameter gamma density, then

(y + )
fY (y|x, ) =
(y + 1)()
where

= (, ).

For this density,

The varian e depends upon how

If

 

y

(16.1)

appears sin e it is the parameter of the gamma density.

= /,

E(y|x) = ,

where

whi h we have parameterized

> 0,

= exp(x )

is parameterized.

then

V (y|x) = + .

Note that

is a fun tion of

x,

so

that the varian e is too. This is referred to as the NB-I model.

If

= 1/,

where

> 0,

then

V (y|x) = + 2 .

This is referred to as the NB-II

model.

So both forms of the NB model allow for overdispersion, with the NB-II model allowing for a
more radi al form.
Testing redu tion of a NB model to a Poisson model annot be done by testing

= 0

using standard Wald or LR pro edures. The riti al values need to be adjusted to a ount for
the fa t that

=0

is on the boundary of the parameter spa e. Without getting into details,

suppose that the data were in fa t Poisson, so there is equidispersion and the true

= 0.

Then about half the time the sample data will be underdispersed, and about half the time
overdispersed. When the data is underdispersed, the MLE of

= 0. Thus, under
p

n(
) = n

of

will be

the null, there will be a probability spike in the asymptoti distribution


at 0, so standard testing methods will not be valid.

This program will do estimation using the NB model. Note how modelargs is used to sele t
a NB-I or NB-II density. Here are NB-I estimation results for OBDV:

254

CHAPTER 16.

MPITB extensions found


OBDV
======================================================
BFGSMIN final results
Used analyti gradient
-----------------------------------------------------STRONG CONVERGENCE
Fun tion onv 1 Param onv 1 Gradient onv 1
-----------------------------------------------------Obje tive fun tion value 2.18573
Stepsize 0.0007
17 iterations
-----------------------------------------------------param
1.0965
0.2551
0.2024
0.2289
0.1969
0.0769
0.0000
1.7146

gradient
0.0000
-0.0000
-0.0000
0.0000
0.0000
0.0000
-0.0000
-0.0000

hange
-0.0000
0.0000
0.0000
-0.0000
-0.0000
-0.0000
0.0000
0.0000

******************************************************
Negative Binomial model, MEPS 1996 full data set
MLE Estimation Results
BFGS onvergen e: Normal onvergen e
Average Log-L: -2.185730
Observations: 4564
onstant
pub. ins.
priv. ins.
sex
age
edu
in
alpha

estimate
-0.523
0.765
0.451
0.458
0.016
0.027
0.000
5.555

Information Criteria
CAIC : 20026.7513

st. err
0.104
0.054
0.049
0.034
0.001
0.007
0.000
0.296
Avg. CAIC:

t-stat
-5.005
14.198
9.196
13.512
11.869
3.979
0.000
18.752
4.3880

p-value
0.000
0.000
0.000
0.000
0.000
0.000
1.000
0.000

QUASI-ML

16.2.

255

EXAMPLE: THE MEPS DATA

BIC : 20018.7513
Avg. BIC:
4.3862
AIC : 19967.3437
Avg. AIC:
4.3750
******************************************************

Note that the parameter values of the last BFGS iteration are dierent that those reported
in the nal results.

This ree ts two things - rst, the data were s aled before doing the

BFGS minimization, but the

mle_results

results using the original s aling.


enfor e the restri tion that

>

s ript takes this into a ount and reports the

But also, the parameterization

0. The unrestri ted parameter

= exp( )
= log

is used to

is used to dene

the log-likelihood fun tion, sin e the BFGS minimization algorithm does not do ontrained
minimization. To get the standard error and t-statisti of the estimate of
delta method. This is done inside

mle_results,

making use of the fun tion parameterize.m .

Likewise, here are NB-II results:

MPITB extensions found


OBDV
======================================================
BFGSMIN final results
Used analyti gradient
-----------------------------------------------------STRONG CONVERGENCE
Fun tion onv 1 Param onv 1 Gradient onv 1
-----------------------------------------------------Obje tive fun tion value 2.18496
Stepsize 0.0104394
13 iterations
-----------------------------------------------------param
1.0375
0.3673
0.2136
0.2816
0.3027
0.0843
-0.0048
0.4780

gradient
0.0000
-0.0000
0.0000
0.0000
0.0000
-0.0000
0.0000
-0.0000

, we need to use the

hange
-0.0000
0.0000
-0.0000
-0.0000
0.0000
0.0000
-0.0000
0.0000

******************************************************
Negative Binomial model, MEPS 1996 full data set

256

CHAPTER 16.

QUASI-ML

MLE Estimation Results


BFGS onvergen e: Normal onvergen e
Average Log-L: -2.184962
Observations: 4564
estimate
-1.068
1.101
0.476
0.564
0.025
0.029
-0.000
1.613

onstant
pub. ins.
priv. ins.
sex
age
edu
in
alpha

st. err
0.161
0.095
0.081
0.050
0.002
0.009
0.000
0.055

t-stat
-6.622
11.611
5.880
11.166
12.240
3.106
-0.176
29.099

p-value
0.000
0.000
0.000
0.000
0.000
0.002
0.861
0.000

Information Criteria
CAIC : 20019.7439
Avg. CAIC: 4.3864
BIC : 20011.7439
Avg. BIC:
4.3847
AIC : 19960.3362
Avg. AIC:
4.3734
******************************************************

For the OBDV usage measurel, the NB-II model does a slightly better job than the NB-I
model, in terms of the average log-likelihood and the information riteria (more on this
last in a moment).

Note that both versions of the NB model t mu h better than does the Poisson model
(see 13.4.2).

The estimated

is highly signi ant.

To he k the plausibility of the NB-II model, we an ompare the sample un onditional


varian e with the estimated un onditional varian e a ording to the NB-II model:

Pn

V[
(y) =

t )2
(
t=1 t +
. For OBDV and ERV (estimation results not reported), we get For OBDV, the
n

Table 16.2: Marginal Varian es, Sample and Estimated (NB-II)


OBDV

ERV

Sample

38.09

0.151

Estimated

30.58

0.182

overdispersion problem is signi antly better than in the Poisson ase, but there is still some
that is not aptured. For ERV, the negative binomial model seems to apture the overdispersion adequately.

16.2.

257

EXAMPLE: THE MEPS DATA

16.2.2 Finite mixture models: the mixed negative binomial model


The nite mixture approa h to tting health are demand was introdu ed by Deb and Trivedi
(1997). The mixture approa h has the intuitive appeal of allowing for subgroups of the population with dierent health status. If individuals are lassied as healthy or unhealthy then
two subgroups are dened. A ner lassi ation s heme would lead to more subgroups. Many
studies have in orporated obje tive and/or subje tive indi ators of health status in an eort
to apture this heterogeneity. The available obje tive measures, su h as limitations on a tivity, are not ne essarily very informative about a person's overall health status.

Subje tive,

self-reported measures may suer from the same problem, and may also not be exogenous
Finite mixture models are on eptually simple. The density is

fY (y, 1 , ..., p , 1 , ..., p1 ) =

p1
X

(i)

i fY (y, i ) + p fYp (y, p ),

i=1

where
the

i > 0, i = 1, 2, ..., p, p = 1

Pp1
i=1

i ,

are ordered in some way, for example,

and

Pp

i=1 i

= 1.

1 2 p

Identi ation requires that


and

i 6= j , i 6= j .

This is

simple to a omplish post-estimation by rearrangement and possible elimination of redundant


omponent densities.

The properties of the mixture density follow in a straightforward way from those of the
omponents. In parti ular, the moment generating fun tion is the same mixture of the
moment generating fun tions of the omponent densities, so, for example,

Pp

i=1 i i (x), where

i (x)

is the mean of the

ith

omponent density.

E(Y |x) =

Mixture densities may suer from overparameterization, sin e the total number of parameters grows rapidly with the number of omponent densities. It is possible to onstrained
parameters a ross the mixtures.

Testing for the number of omponent densities is a tri ky issue. For example, testing for

p=1

(a single omponent, whi h is to say, no mixture) versus

omponents) involves the restri tion


spa e.

1 = 1,

Not that when

1 = 1,

p=2

(a mixture of two

whi h is on the boundary of the parameter

the parameters of the se ond omponent an take on

any value without ae ting the density. Usual methods su h as the likelihood ratio test
are not appli able when parameters are on the boundary under the null hypothesis.
Information riteria means of hoosing the model (see below) are valid.
The following results are for a mixture of 2 NB-II models, for the OBDV data, whi h you an
repli ate using this program .

OBDV
******************************************************

258

CHAPTER 16.

QUASI-ML

Mixed Negative Binomial model, MEPS 1996 full data set


MLE Estimation Results
BFGS onvergen e: Normal onvergen e
Average Log-L: -2.164783
Observations: 4564
onstant
pub. ins.
priv. ins.
sex
age
edu
in
alpha
onstant
pub. ins.
priv. ins.
sex
age
edu
in
alpha
Mix

estimate
0.127
0.861
0.146
0.346
0.024
0.025
-0.000
1.351
0.525
0.422
0.377
0.400
0.296
0.111
0.014
1.034
0.257

st. err
0.512
0.174
0.193
0.115
0.004
0.016
0.000
0.168
0.196
0.048
0.087
0.059
0.036
0.042
0.051
0.187
0.162

t-stat
0.247
4.962
0.755
3.017
6.117
1.590
-0.214
8.061
2.678
8.752
4.349
6.773
8.178
2.634
0.274
5.518
1.582

p-value
0.805
0.000
0.450
0.003
0.000
0.112
0.831
0.000
0.007
0.000
0.000
0.000
0.000
0.008
0.784
0.000
0.114

Information Criteria
CAIC : 19920.3807
Avg. CAIC: 4.3647
BIC : 19903.3807
Avg. BIC:
4.3610
AIC : 19794.1395
Avg. AIC:
4.3370
******************************************************

It is worth noting that the mixture parameter is not signi antly dierent from zero, but
also not that the oe ients of publi insuran e and age, for example, dier quite a bit between
the two latent lasses.

16.2.3 Information riteria


As seen above, a Poisson model an't be tested (using standard methods) as a restri tion of
a negative binomial model. But it seems, based upon the values of the likelihood fun tions
and the fa t that the NB model ts the varian e mu h better, that the NB model is more
appropriate. How an we determine whi h of a set of ompeting models is the best?
The information riteria approa h is one possibility. Information riteria are fun tions of
the log-likelihood, with a penalty for the number of parameters used. Three popular information riteria are the Akaike (AIC), Bayes (BIC) and onsistent Akaike (CAIC). The formulae

16.2.

259

EXAMPLE: THE MEPS DATA

are

+ k(ln n + 1)
CAIC = 2 ln L()
+ k ln n
BIC = 2 ln L()
+ 2k
AIC = 2 ln L()

It an be shown that the CAIC and BIC will sele t the orre tly spe ied model from a group
of models, asymptoti ally. This doesn't mean, of ourse, that the orre t model is ne esarily
in the group. The AIC is not onsistent, and will asymptoti ally favor an over-parameterized
model over the orre tly spe ied model. Here are information riteria values for the models
we've seen, for OBDV. Pretty learly, the NB models are better than the Poisson. The one

Table 16.3: Information Criteria, OBDV


Model

AIC

BIC

CAIC

Poisson

7.345

7.355

7.357

NB-I

4.375

4.386

4.388

NB-II

4.373

4.385

4.386

MNB-II

4.337

4.361

4.365

additional parameter gives a very signi ant improvement in the likelihood fun tion value.
Between the NB-I and NB-II models, the NB-II is slightly favored. But one should remember
that information riteria values are statisti s, with varian es. With another sample, it may
well be that the NB-I model would be favored, sin e the dieren es are so small. The MNB-II
model is favored over the others, by all 3 information riteria.
Why is all of this in the hapter on QML? Let's suppose that the orre t model for OBDV
is in fa t the NB-II model. It turns out in this ase that the Poisson model will give onsistent
estimates of the slope parameters (if a model is a member of the linear-exponential family
and the onditional mean is orre tly spe ied, then the parameters of the onditional mean
will be onsistently estimated).

So the Poisson estimator would be a QML estimator that

is onsistent for some parameters of the true model. The ordinary OPG or inverse Hessian
ML ovarian e estimators are however biased and in onsistent, sin e the information matrix
equality does not hold for QML estimators. But for i.i.d. data (whi h is the ase for the MEPS
data) the QML asymptoti ovarian e an be onsistently estimated, as dis ussed above, using
the sandwi h form for the ML estimator.

mle_results in fa t reports sandwi h results, so the

Poisson estimation results would be reliable for inferen e even if the true model is the NB-I
or NB-II. Not that they are in fa t similar to the results for the NB models.
However, if we assume that the orre t model is the MNB-II model, as is favored by the
information riteria, then both the Poisson and NB-x models will have misspe ied mean
fun tions, so the parameters that inuen e the means would be estimated with bias and
in onsistently.

260

CHAPTER 16.

QUASI-ML

16.3 Exer ises


1. Considering the MEPS data (the des ription is in Se tion 13.4.2), for the OBDV (y )
measure, let

be a latent index of health status that has expe tation equal to unity.

We suspe t that

and

P RIV

may be orrelated, but we assume that

is un orrelated

with the other regressors. We assume that

E(y|P U B, P RIV, AGE, EDU C, IN C, )


= exp(1 + 2 P U B + 3 P RIV + 4 AGE + 5 EDU C + 6 IN C).
We use the Poisson QML estimator of the model

Poisson()

= exp(1 + 2 P U B + 3 P RIV +

(16.2)

4 AGE + 5 EDU C + 6 IN C).


2

Sin e mu h previous eviden e indi ates that health are servi es usage is overdispersed ,
this is almost ertainly not an ML estimator, and thus is not e ient. However, when
and

P RIV

are un orrelated, this estimator is onsistent for the

onditional mean is orre tly spe ied in that ase. When

and

parameters, sin e the

P RIV

are orrelated,

Mullahy's (1997) NLIV estimator that uses the residual fun tion

=
where

y
1,

is dened in equation 16.2, with appropriate instruments, is onsistent.

instruments we use all the exogenous regressors, as well as the ross produ ts of
with the variables in

Z = {AGE, EDU C, IN C}.

As

PUB

That is, the full set of instruments is

W = {1 P U B Z P U B Z }.
(a) Cal ulate the Poisson QML estimates.
(b) Cal ulate the generalized IV estimates (do it using a GMM formulation - see the
portfolio example for hints how to do this).
( ) Cal ulate the Hausman test statisti to test the exogeneity of PRIV.
(d) omment on the results

1
2

A restri tion of this sort is ne essary for identi ation.


Overdispersion exists when the onditional varian e is greater than the onditional mean. If this is the

ase, the Poisson spe i ation is not orre t.

Chapter 17
Nonlinear least squares (NLS)
Readings:

and 5 ; Gallant, Ch. 1

Davidson and Ma Kinnon, Ch. 2

17.1 Introdu tion and denition


Nonlinear least squares (NLS) is a means of estimating the parameter of the model

yt = f (xt , 0 ) + t .

In general,

will be heteros edasti and auto orrelated, and possibly nonnormally dis-

tributed. However, dealing with this is exa tly as in the ase of linear models, so we'll
just treat the iid ase here,

t iid(0, 2 )
If we sta k the observations verti ally, dening

y = (y1 , y2 , ..., yn )
f = (f (x1 , ), f (x1 , ), ..., f (x1 , ))
and

= (1 , 2 , ..., n )
we an write the

observations as

y = f () +
Using this notation, the NLS estimator an be dened as

1
1
arg min sn () = [y f ()] [y f ()] = k y f () k2

n
n

The estimator minimizes the weighted sum of squared errors, whi h is the same as
minimizing the Eu lidean distan e between

The obje tive fun tion an be written as

261

and

f ().

262

CHAPTER 17.

sn () =

NONLINEAR LEAST SQUARES (NLS)


1
y y 2y f () + f () f () ,
n

whi h gives the rst order onditions

Dene the

nK

In shorthand, use






0.
f () y +
f () f ()

matrix

D f ().
F()

in pla e of

F().

(17.1)

Using this, the rst order onditions an be written as

0,
y + F
f ()
F
or

h
i
0.
y f ()
F

(17.2)

This bears a good deal of similarity to the f.o. . for the linear model - the derivative of the
predi tion is orthogonal to the predi tion error. If

is simply X, so the f.o. .


f () = X, then F

(with spheri al errors) simplify to

X y X X = 0,
the usual 0LS f.o. .

INSERT drawings of geometri al depi tion of OLS


and NLS (see Davidson and Ma Kinnon, pgs. 8,13 and 46).
We an interpret this geometri ally:

Note that the nonlinearity of the manifold leads to potential multiple lo al maxima,
minima and saddlepoints: the obje tive fun tion

sn ()

is not ne essarily well-behaved

and may be di ult to minimize.

17.2 Identi ation


As before, identi ation an be onsidered onditional on the sample, and asymptoti ally. The

sn () tend to a limiting fun tion s () su h


0
. This will be the ase if s ( 0 ) is stri tly onvex at 0 , whi h

ondition for asymptoti identi ation is that


that

( 0 )

< s (), 6=

17.2.

263

IDENTIFICATION

requires that

D2 s ( 0 )

be positive denite. Consider the obje tive fun tion:

sn () =

1X
[yt f (xt , )]2
n
t=1

=
=

n
2
1 X
f (xt , 0 ) + t ft (xt , )
n t=1

n
n
2 1 X
1 X
(t )2
ft ( 0 ) ft () +
n
n
t=1

t=1

n

2 X
ft ( 0 ) ft () t
n
t=1

As in example 14.4, whi h illustrated the onsisten y of extremum estimators using OLS,
we on lude that the se ond term will onverge to a onstant whi h does not depend
upon

A LLN an be applied to the third term to on lude that it onverges pointwise to 0, as

Next, pointwise onvergen e needs to be stregnthened to uniform almost sure onver-

long as

f () and

gen e.

are un orrelated.

There are a number of possible assumptions one ould use.

Here, we'll just

assume it holds.

Turning to the rst term, we'll assume a pointwise law of large numbers applies, so

n
2 a.s.
1 X
ft ( 0 ) ft ()
n t=1
where

(x)

is the distribution fun tion of

and ontinuous, for all


immediate.

For example if

2
f (z, 0 ) f (z, ) d(z),

x.

In many ases,

f (x, )

will

(17.3)

be bounded

so strengthening to uniform almost sure onvergen e is

f (x, ) = [1 + exp(x)]1 , f : K (0, 1) ,

range, and the fun tion is ontinuous in

a bounded

Given these results, it is lear that a minimizer is

0.

When onsidering identi ation (asymp-

toti ), the question is whether or not there may be some other minimizer. A lo al ondition
for identi ation is that

2
2
s
()
=



be positive denite at

0.

2
f (x, 0 ) f (x, ) d(x)

Evaluating this derivative, we obtain (after a little work)



2
f (x, ) f (x, ) d(x)
0

=2
0

D f (z, 0 )




D f (z, 0 ) d(z)

264

CHAPTER 17.

NONLINEAR LEAST SQUARES (NLS)

the expe tation of the outer produ t of the gradient of the regression fun tion evaluated at

0.

(Note: the uniform boundedness we have already assumed allows passing the derivative

through the integral, by the dominated onvergen e theorem.)

This matrix will be positive

denite (wp1) as long as the gradient ve tor is of full rank (wp1). The tangent spa e to the
regression manifold must span a

-dimensional spa e if we are to onsistently estimate a

-dimensional parameter ve tor. This is analogous to the requirement that there be no perfe t
olinearity in a linear model. This is a ne essary ondition for identi ation. Note that the
LLN implies that the above expe tation is equal to

J ( 0 ) = 2 lim E

F F
n

17.3 Consisten y
We simply assume that the onditions of Theorem 19 hold, so the estimator is onsistent.
Given that the strong sto hasti equi ontinuity onditions hold, as dis ussed above, and given
the above identi ation onditions an a ompa t estimation spa e (the losure of the parameter
spa e

),

the onsisten y proof 's assumptions are satised.

17.4 Asymptoti normality


As in the ase of GMM, we also simply assume that the onditions for asymptoti normality as
in Theorem 22 hold. The only remaining problem is to determine the form of the asymptoti
varian e- ovarian e matrix. Re all that the result of the asymptoti normality theorem is

where

J ( 0 )





d
n 0 N 0, J ( 0 )1 I ( 0 )J ( 0 )1 ,
is the almost sure limit of

2
sn () evaluated at

0,

and

I ( 0 ) = lim V ar nD sn ( 0 )
The obje tive fun tion is

sn () =
So

D sn () =
Evaluating at

1X
[yt f (xt , )]2
n t=1

0,

2X
[yt f (xt , )] D f (xt , ).
n t=1
n

2X
D sn ( ) =
t D f (xt , 0 ).
n t=1
0

Note that the expe tation of this is zero, sin e

and

xt

are assumed to be un orrelated. So

to al ulate the varian e, we an simply al ulate the se ond moment about zero. Also note

17.5.

265

EXAMPLE: THE POISSON MODEL FOR COUNT DATA

that

n
X

t D f (xt , 0 ) =

t=1

 0 
f ( )

= F

With this we obtain

I ( 0 ) = lim V ar nD sn ( 0 )
4
= lim nE 2 F ' F
n
F F
= 4 2 lim E
n
We've already seen that

J ( 0 ) = 2 lim E

F F
,
n

where the expe tation is with respe t to the joint density of

J ( 0 )

pressions for
get

and

I ( 0 ),

and

Combining these ex-

and the result of the asymptoti normality theorem, we



d
n 0 N

F F
0, lim E
n

1

We an onsistently estimate the varian e ovarian e matrix using

F
n
where

!1

2,

(17.4)

is dened as in equation 17.1 and

2 =

i h
i

y f ()
y f ()
n

the obvious estimator. Note the lose orresponden e to the results for the linear model.

17.5 Example: The Poisson model for ount data


Suppose that
variable is a

yt

onditional on

ount data

xt

is independently distributed Poisson.

A Poisson random

variable, whi h means it an take the values {0,1,2,...}.

This sort

of model has been used to study visits to do tors per year, number of patents registered by
businesses per year,

et .

The Poisson density is

f (yt ) =
The mean of

yt

is

t ,

exp(t )yt t
, yt {0, 1, 2, ...}.
yt !

as is the varian e. Note that

must be positive. Suppose that the true

266

CHAPTER 17.

NONLINEAR LEAST SQUARES (NLS)

mean is

0t = exp(xt 0 ),
whi h enfor es the positivity of

t .

Suppose we estimate

by nonlinear least squares:

n
2
1X
yt exp(xt )
= arg min sn () =
T t=1
We an write

sn () =
=

n
2
1X
exp(xt 0 + t exp(xt )
T t=1

n
n
n

2 1 X
1X
1X
2
0

t + 2
t exp(xt 0 exp(xt )
exp(xt exp(xt ) +
T
T
T
t=1

t=1

t=1

The last term has expe tation zero sin e the assumption that
that

E (t |xt ) = 0, whi h in turn implies that fun tions of xt

E(yt |xt ) = exp(xt 0 )

are un orrelated with

implies

t . Applying

a strong LLN, and noting that the obje tive fun tion is ontinuous on a ompa t parameter
spa e, we get

2
s () = Ex exp(x 0 exp(x ) + Ex exp(x 0 )

where the last term omes from the fa t that the onditional varian e of
varian e of

y. This fun tion

is learly minimized at

is the same as the

0 , so the NLS estimator is onsistent

as long as identi ation holds.

Exer ise 27 Determine the limiting distribution of



n ()
of sn (), J ( 0 ), s
,

spe i forms
to verify that it an be applied.
2



n 0 .

This means nding the the

and I( 0 ). Again, use a CLT as needed, no need

17.6 The Gauss-Newton algorithm


Readings: Davidson and Ma Kinnon,

Chapter 6, pgs. 201-207 .

The Gauss-Newton optimization te hnique is spe i ally designed for nonlinear least squares.
The idea is to linearize the nonlinear model, rather than the obje tive fun tion. The model is

y = f ( 0 ) + .
At some

in the parameter spa e, not equal to

0,

we have

y = f () +
where

is a ombination of the fundamental error term

the regression fun tion at

and the error due to evaluating

0
rather than the true value . Take a rst order Taylor's series

17.6.

267

THE GAUSS-NEWTON ALGORITHM

approximation around a point

Dene

1 :




1 + + approximation
y = f ( 1 ) + D f 1

z y f ( 1 )

and

b ( 1 ).

error.

Then the last equation an be written as

z = F( 1 )b + ,
where, as above,

F( 1 ) D f ( 1 ) is the nK

1
evaluated at , and

is

plus approximation error from the trun ated Taylor's series.

1.

Note that

Note that one ould estimate

Given

b,

matrix of derivatives of the regression fun tion,

is known, given

simply by performing OLS on the above equation.

we al ulate a new round estimate of

as

2 = b + 1 .

With this, take a new

2
Taylor's series expansion around and repeat the pro ess. Stop when

b = 0

(to within

a spe ied toleran e).

To see why this might work, onsider the above approximation, but evaluated at the NLS
estimator:

The OLS estimate of



+ F()
+
y = f ()

b is


1 h
i

b = F

F y f () .

This must be zero, sin e

 h
i
0
y f ()
F

by denition of the NLS estimator (these are the normal equations as in equation 17.2, Sin e

b 0

when we evaluate at

updating would stop.

The Gauss-Newton method doesn't require se ond derivatives, as does the Newton-

The var ov estimator, as in equation 17.4 is simple to al ulate, sin e we have

Raphson method, so it's faster.

as a

i.e., it's just the last round regressor matrix).

by-produ t of the estimation pro ess (

In

fa t, a normal OLS program will give the NLS var ov estimator dire tly, sin e it's just
the OLS var ov estimator from the last iteration.

The method an suer from onvergen e problems sin e

F() F(),

singular, even with an asymptoti ally identied model, espe ially if


Consider the example

y = 1 + 2 xt 3 + t

may be very nearly

is very far from

268

CHAPTER 17.

2 0, 3

When evaluated at

NONLINEAR LEAST SQUARES (NLS)

has virtually no ee t on the NLS obje tive fun tion, so

will have rank that is essentially 2, rather than 3. In this ase,

F F

will be nearly

1 will be subje t to large roundo errors.


singular, so (F F)

17.7 Appli ation: Limited dependent variables and sample sele tion
Readings:

Davidson and Ma Kinnon, Ch. 15

(a qui k reading is su ient), J. He kman,

Sample Sele tion Bias as a Spe i ation Error,

E onometri a, 1979 (This is a lassi arti le,

not required for reading, and whi h is a bit out-dated. Nevertheless it's a good pla e to start
if you en ounter sample sele tion problems in your resear h).
Sample sele tion is a ommon problem in applied resear h.

The problem o urs when

observations used in estimation are sampled non-randomly, a ording to some sele tion s heme.

17.7.1 Example: Labor Supply


Labor supply of a person is a positive number of hours per unit time supposing the oer wage
is higher than the reservation wage, whi h is the wage at whi h the person prefers not to work.
The model (very simple, with

subs ripts suppressed):

Chara teristi s of individual:

Latent labor supply:

Oer wage:

Reservation wage:

s = x +

wo = z +
wr = q +

Write the wage dierential as



z + q +

w =

r +
We have the set of equations

s = x +
w = r + .
Assume that

"

"

0
0

# "
,

#!

APPLICATION: LIMITED DEPENDENT VARIABLES AND SAMPLE SELECTION269

17.7.

We assume that the oer wage and the reservation wage, as well as the latent variable

are

unobservable. What is observed is

w = 1 [w > 0]
s = ws .
In other words, we observe whether or not a person is working. If the person is working, we
observe labor supply, whi h is equal to latent labor supply,

s .

Otherwise,

s = 0 6= s .

Note

that we are using a simplifying assumption that individuals an freely hoose their weekly
hours of work.
Suppose we estimated the model

s = x + residual
using only observations for whi h

whi h w

or equivalently,

<

The problem is that these observations are those for

r and


E | < r 6= 0,

and

are dependent. Furthermore, this expe tation will in general depend on

elements of

an enter in

sin e

> 0,

s > 0.

r.

sin e

Be ause of these two fa ts, least squares estimation is biased and

in onsistent.
Consider more arefully

E [| < r ] .

write (see for example Spanos

Given the joint normality of

and

Statisti al Foundations of E onometri Modelling, pg.

we an
122)

= + ,
where

has mean zero and is independent of

With this we an write

s = x + + .
If we ondition this equation on

< r

we get

s = x + E(| < r ) +
whi h may be written as

s = x + E(| > r ) +

A useful result is that for

z N (0, 1)
E(z|z > z ) =
where

()

and

()

(z )
,
(z )

are the standard normal density and distribution fun tion, respe -

270

CHAPTER 17.

NONLINEAR LEAST SQUARES (NLS)

tively. The quantity on the RHS above is known as the

IM R(z ) =

inverse Mill's ratio:

(z )
(z )

With this we an write (making use of the fa t that the standard normal density is
symmetri about zero, so that

(a) = (a)):
(r )
+
(r )
#
"
i
(r )

s = x +

where

= .

regressors

The error term

(r )
(r )

(r )

(17.5)

+ .

(17.6)

has onditional mean zero, and is un orrelated with the

At this point, we an estimate the equation by NLS.

He kman showed how one an estimate this in a two step pro edure where rst

is

estimated, then equation 17.6 is estimated by least squares using the estimated value of

to form the regressors. This is ine ient and estimation of the ovarian e is a tri ky

issue. It is probably easier (and more e ient) just to do MLE.

The model presented above depends strongly on joint normality. There exist many alternative models whi h weaken the maintained assumptions. It is possible to estimate
onsistently without distributional assumptions. See Ahn and Powell,

metri s, 1994.

Journal of E ono-

Chapter 18
Nonparametri inferen e
18.1 Possible pitfalls of parametri inferen e: estimation
Readings:
tions,

H. White (1980) Using Least Squares to Approximate Unknown Regression Fun -

International E onomi Review, pp.

149-70.

In this se tion we onsider a simple example, whi h illustrates both why nonparametri
methods may in some ases be preferred to parametri methods.
We suppose that data is generated by random sampling of
uniformly distributed on

(0, 2),

and

where

y = f (x) +, x

is

is a lassi al error. Suppose that

f (x) = 1 +

3x  x 2

2
2

The problem of interest is to estimate the elasti ity of


range of

(y, x),

f (x)

with respe t to

x,

throughout the

x.

In general, the fun tional form of


proximation to

f (x) is unknown.

One idea is to take a Taylor's series ap-

f (x) about some point x0 . Flexible fun tional forms su h as the trans endental

logarithmi (usually know as the translog) an be interpreted as se ond order Taylor's series
approximations. We'll work with a rst order approximation, for simpli ity. Approximating
about

x0 :
h(x) = f (x0 ) + Dx f (x0 ) (x x0 )

If the approximation point is

x0 = 0,

we an write

h(x) = a + bx
The oe ient
derivative at

is the value of the fun tion at

x = 0.

x = 0,

and the slope is the value of the

These are of ourse not known. One might try estimation by ordinary

least squares. The obje tive fun tion is

s(a, b) = 1/n

n
X
t=1

(yt h(xt ))2 .

271

272

CHAPTER 18.

NONPARAMETRIC INFERENCE

Figure 18.1: True and simple approximating fun tions

3.5

approx
true
3.0

2.5

2.0

1.5

1.0
0

The limiting obje tive fun tion, following the argument we used to get equations 14.1 and 17.3
is

s (a, b) =

(f (x) h(x))2 dx.

The theorem regarding the onsisten y of extremum estimators (Theorem 19) tells us that
and
b will

onverge almost surely to the values that minimize the limiting obje tive fun tion.
1

Solving the rst order onditions reveals that


The estimated approximating fun tion

h(x)



s (a, b) obtains its minimum at a0 = 76 , b0 = 1 .

therefore tends almost surely to

h (x) = 7/6 + x/
In Figure 18.1 we see the true fun tion and the limit of the approximation to see the asymptoti
bias as a fun tion of

x.

(The approximating model is the straight line, the true model has urvature.) Note that
the approximating model is in general in onsistent, even at the approximation point.

This

shows that exible fun tional forms based upon Taylor's series approximations do not in
general lead to onsistent estimation of fun tions.
The approximating model seems to t the true model fairly well, asymptoti ally. However,
we are interested in the elasti ity of the fun tion.

Re all that an elasti ity is the marginal

fun tion divided by the average fun tion:

(x) = x (x)/(x)
Good approximation of the elasti ity over the range of
1

will require a good approximation of

The following results were obtained using the free omputer algebra system (CAS) Maxima. Unfortunately,

I have lost the sour e ode to get the results :-(

18.1.

POSSIBLE PITFALLS OF PARAMETRIC INFERENCE: ESTIMATION

273

Figure 18.2: True and approximating elasti ities

0.7

approx
true
0.6

0.5

0.4

0.3

0.2

0.1

0.0
0

both

f (x)

and

f (x)

over the range of

x.

The approximating elasti ity is

(x) = xh (x)/h(x)
In Figure 18.2 we see the true elasti ity and the elasti ity obtained from the limiting approximating model.
The true elasti ity is the line that has negative slope for large

x.

Visually we see that the

elasti ity is not approximated so well. Root mean squared error in the approximation of the
elasti ity is

Z

2
0

((x) (x))2 dx

1/2

= . 31546

Now suppose we use the leading terms of a trigonometri series as the approximating
model. The reason for using a trigonometri series as an approximating model is motivated
by the asymptoti properties of the Fourier exible fun tional form (Gallant, 1981, 1982),
whi h we will study in more detail below. Normally with this type of model the number of
basis fun tions is an in reasing fun tion of the sample size.

Here we hold the set of basis

fun tion xed. We will onsider the asymptoti behavior of a xed model, whi h we interpret
as an approximation to the estimator's behavior in nite samples. Consider the set of basis
fun tions:

Z(x) =

1 x cos(x) sin(x) cos(2x) sin(2x)

The approximating model is

gK (x) = Z(x).
Maintaining these basis fun tions as the sample size in reases, we nd that the limiting ob-

274

CHAPTER 18.

NONPARAMETRIC INFERENCE

Figure 18.3: True fun tion and more exible approximation

3.5

approx
true
3.0

2.5

2.0

1.5

1.0
0

je tive fun tion is minimized at



1
1
1
7
a1 = , a2 = , a3 = 2 , a4 = 0, a5 = 2 , a6 = 0 .
6

4
Substituting these values into

gK (x)


we obtain the almost sure limit of the approximation

1
g (x) = 7/6 + x/ + (cos x) 2



1
+ (sin x) 0 + (cos 2x) 2 + (sin 2x) 0
4

(18.1)

In Figure 18.3 we have the approximation and the true fun tion: Clearly the trun ated trigonometri series model oers a better approximation, asymptoti ally, than does the linear model.
In Figure 18.4 we have the more exible approximation's elasti ity and that of the true fun tion: On average, the t is better, though there is some implausible wavyness in the estimate.
Root mean squared error in the approximation of the elasti ity is

2
0

g (x)x
(x)
g (x)

2

dx

!1/2

= . 16213,

about half that of the RMSE when the rst order approximation is used. If the trigonometri
series ontained innite terms, this error measure would be driven to zero, as we shall see.

18.2 Possible pitfalls of parametri inferen e: hypothesis testing


What do we mean by the term nonparametri inferen e? Simply, this means inferen es that
are possible without restri ting the fun tions of interest to belong to a parametri family.

18.2.

POSSIBLE PITFALLS OF PARAMETRIC INFERENCE: HYPOTHESIS TESTING275

Figure 18.4: True elasti ity and more exible approximation

0.7

approx
true
0.6

0.5

0.4

0.3

0.2

0.1

0.0
0

Consider means of testing for the hypothesis that onsumers maximize utility. A onsequen e of utility maximization is that the Slutsky matrix

Dp2 h(p, U ),

where

h(p, U )

the a set of ompensated demand fun tions, must be negative semi-denite.

are

One ap-

proa h to testing for utility maximization would estimate a set of normal demand fun tions

x(p, m).

Estimation of these fun tions by normal parametri methods requires spe i ation of
the fun tional form of demand, for example

x(p, m) = x(p, m, 0 ) + , 0 0 ,
where

x(p, m, 0 )

is a fun tion of known form and

is a nite dimensional parameter.

to al ulate (by solving the integrability


x
= x(p, m, )
b p2 h(p, U ). If we an statisti ally reje t that the matrix is
D

After estimation, we ould use


problem, whi h is non-trivial)

negative semi-denite, we might on lude that onsumers don't maximize utility.

The problem with this is that the reason for reje tion of the theoreti al proposition may
be that our hoi e of fun tional form is in orre t. In the introdu tory se tion we saw
that fun tional form misspe i ation leads to in onsistent estimation of the fun tion and
its derivatives.

Testing using parametri models always means we are testing a ompound hypothesis.
The hypothesis that is tested is 1) the e onomi proposition we wish to test, and 2) the
model is orre tly spe ied.

Failure of either 1) or 2) an lead to reje tion (as an a

Type-I error, even when 2) holds).


hypothesis.

This is known as the model-indu ed augmenting

276

CHAPTER 18.

NONPARAMETRIC INFERENCE

Varian's WARP allows one to test for utility maximization without spe ifying the form of
the demand fun tions. The only assumptions used in the test are those dire tly implied
by theory, so reje tion of the hypothesis alls into question the theory.

Nonparametri inferen e also allows dire t testing of e onomi propositions, avoiding


the model-indu ed augmenting hypothesis.

The ost of nonparametri methods is

usually an in rease in omplexity, and a loss of power, ompared to what one would
get using a well-spe ied parametri model. The benet is robustness against possible
misspe i ation.

18.3 Estimation of regression fun tions


18.3.1 The Fourier fun tional form
Readings:
in

Gallant, 1987, Identi ation and onsisten y in semi-nonparametri regression,

Advan es in E onometri s, Fifth World Congress, V. 1, Truman Bewley, ed., Cambridge.


Suppose we have a multivariate model

y = f (x) + ,
where

f (x) is of unknown form and x is a P dimensional ve tor.

For simpli ity, assume that

is a lassi al error. Let us take the estimation of the ve tor of elasti ities with typi al element

x i =
at an arbitrary point

xi f (x)
,
f (x) xi f (x)

xi .

The Fourier form, following Gallant (1982), but with a somewhat dierent parameterization, may be written as

gK (x | K ) = + x + 1/2x Cx +
where the

K -dimensional

A X
J
X

=1 j=1


uj cos(jk x) vj sin(jk x) .

parameter ve tor

K = {, , vec (C) , u11 , v11 , . . . , uJA , vJA } .

(18.2)

We assume that the onditioning variables


interval that is shorter than

2.

(18.3)

have ea h been transformed to lie in an

This is required to avoid periodi behavior of the ap-

proximation, whi h is desirable sin e e onomi fun tions aren't periodi . For example,
subtra t sample means, divide by the maxima of the onditioning variables, and multiply
by

2 eps,

The

where

eps

is some positive number less than

are elementary multi-indi es whi h are simply

in value.
ve tors formed of integers

18.3.

277

ESTIMATION OF REGRESSION FUNCTIONS

(negative, positive and zero). The

k , = 1, 2, ..., A

are required to be linearly inde-

pendent, and we follow the onvention that the rst non-zero element be positive. For
example

0 1 1 0 1

is a potential multi-index to be used, but

i
i

0 1 1 0 1

is not sin e its rst nonzero element is negative. Nor is

0 2 2 0 2

a multi-index we would use, sin e it is a s alar multiple of the original multi-index.

We parameterize the matrix

dierently than does Gallant be ause it simplies things

in pra ti e. The ost of this is that we are no longer able to test a quadrati spe i ation
using nested testing.

The ve tor of rst partial derivatives is

Dx gK (x | K ) = + Cx +

A X
J
X


=1 j=1



uj sin(jk x) vj cos(jk x) jk

(18.4)

and the matrix of se ond partial derivatives is

Dx2 gK (x|K ) = C +

A X
J
X


=1 j=1



uj cos(jk x) + vj sin(jk x) j 2 k k

To dene a ompa t notation for partial derivatives, let


index with no negative elements. Dene

arguments

| |

D h(x)

be an

N -dimensional

as the sum of the elements of

is the zero ve tor,

||

D h(x) h(x).

h(x)

Taking this denition and the last few equations

(1 K)

ve tor

Z (x)

so that

D gK (x|K ) = z (x) K .

If we have

x1 1 x2 2 xNN

into a ount, we see that it is possible to dene

multi-

of the (arbitrary) fun tion h(x), use D h(x) to indi ate a ertain partial

derivative:

When

(18.5)

(18.6)

Both the approximating model and the derivatives of the approximating model are linear
in the parameters.
For the approximating model to the fun tion (not derivatives), write

gK (x|K ) = z K

278

CHAPTER 18.

NONPARAMETRIC INFERENCE

for simpli ity.

The following theorem an be used to prove the onsisten y of the Fourier form.

Theorem 28 [Gallant and Ny hka, 1987 Suppose that h n is obtained by maximizing a sample

obje tive fun tion sn (h) over HKn where HK is a subset of some fun tion spa e H on whi h
is dened a norm k h k. Consider the following onditions:
(a) Compa tness: The losure of H with respe t to k h k is ompa t in the relative topology
dened by k h k.
(b) Denseness: K HK , K = 1, 2, 3, ... is a dense subset of the losure of H with respe t to
k h k and HK HK+1 .
( ) Uniform onvergen e: There is a point h in H and there is a fun tion s (h, h ) that
is ontinuous in h with respe t to k h k su h that
lim sup | sn (h) s (h, h ) |= 0

almost surely.
(d) Identi ation: Any point h in the losure of H with s (h, h ) s (h , h ) must have
k h h k= 0.
Under these onditions limn k h h n k= 0 almost surely, provided that limn Kn =
almost surely.
The modi ation of the original statement of the theorem that has been made is to set the
parameter spa e

in Gallant and Ny hka's (1987) Theorem 0 to a single point and to state

the theorem in terms of maximization rather than minimization.


This theorem is very similar in form to Theorem 19. The main dieren es are:

1. A generi norm

k h k is used in pla e of the Eu lidean norm.

This norm may be stronger

than the Eu lidean norm, so that onvergen e with respe t to

khk

implies onvergen e

w.r.t the Eu lidean norm. Typi ally we will want to make sure that the norm is strong
enough to imply onvergen e of all fun tions of interest.

2. The estimation spa e

is a fun tion spa e. It plays the role of the parameter spa e

in our dis ussion of parametri estimators.

There is no restri tion to a parametri

family, only a restri tion to a spa e of fun tions that satisfy ertain onditions.

This

formulation is mu h less restri tive than the restri tion to a parametri family.

3. There is a denseness assumption that was not present in the other theorem.

We will not prove this theorem (the proof is quite similar to the proof of theorem [19, see
Gallant, 1987) but we will dis uss its assumptions, in relation to the Fourier form as the
approximating model.

18.3.

279

ESTIMATION OF REGRESSION FUNCTIONS

Sobolev norm

Sin e all of the assumptions involve the norm

k h k , we need to make expli it

what norm we wish to use. We need a norm that guarantees that the errors in approximation

of the fun tions we are interested in are a ounted for. Sin e we are interested in rst-order
elasti ities in the present ase, we need lose approximation of both the fun tion

rst derivative f (x), throughout the range of


of

that we're interested in.

x. Let X

f (x)

and its

be an open set that ontains all values

The Sobolev norm is appropriate in this ase.

It is dened,

making use of our notation for partial derivatives, as:





k h km,X = max sup D h(x)

| |m X

To see whether or not the fun tion

gK (x | K ),

f (x)

is well approximated by an approximating model

we would evaluate

k f (x) gK (x | K ) km,X .
We see that this norm takes into a ount errors in approximating the fun tion and partial
derivatives up to order

m.

this example, the relevant

X,

If we want to estimate rst order elasti ities, as is the ase in

would be

m = 1.

onvergen e w.r.t. the Sobolev means

estimates for all values of

Compa tness
enlightening.

Furthermore, sin e we examine the

uniform

sup

over

onvergen e, so that we obtain onsistent

x.

Verifying ompa tness with respe t to this norm is quite te hni al and un-

It is proven by Elbadawi, Gallant and Souza,

requirement is that if we need onsisten y w.r.t.

k h km,X ,

E onometri a,

1983.

The basi

then the fun tions of interest must

belong to a Sobolev spa e whi h takes into a ount derivatives of order

m + 1.

A Sobolev

spa e is the set of fun tions

Wm,X (D) = {h(x) :k h(x) km,X < D},


where

is a nite onstant. In plain words, the fun tions must have bounded partial deriva-

tives of one order higher than the derivatives we seek to estimate.

The estimation spa e and the estimation subspa e

Sin e in our ase we're interested

in onsistent estimation of rst-order elasti ities, we'll dene the estimation spa e as follows:

Denition 29 [Estimation spa e The estimation spa e H = W2,X (D). The estimation spa e

is an open set, and we presume that h H.

So we are assuming that the fun tion to be estimated has bounded se ond derivatives
throughout

X.

With seminonparametri estimators, we don't a tually optimize over the estimation spa e.
Rather, we optimize over a subspa e,

HK n ,

dened as:

280

CHAPTER 18.

NONPARAMETRIC INFERENCE

Denition 30 [Estimation subspa e The estimation subspa e HK is dened as


HK = {gK (x|K ) : gK (x|K ) W2,Z (D), K K },

where gK (x, K ) is the Fourier form approximation as dened in Equation 18.2.


Denseness

The important point here is that

nite dimensional parameter (K has

n > K,

this parameter is estimable.

HK ,

element of

so optimization over

for optimization over


need that:

HK

that

and

HK

H.

dim Kn

A set of subsets

Aa

be dense subsets of

HK , dened above,

of a set

is equal to the losure of

A:

A is dense

n.

H,

as

in equation 18.2 in reasing fun tions of

The estimation subspa e

observations,

may not lead to a onsistent estimator.

to be equivalent to optimization over

HK

Note that the true fun tion h is not ne essarily an

will have to grow more slowly than

2. We need that the

is a spa e of fun tions that is indexed by a

elements, as in equation 18.3). With

1. The dimension of the parameter ve tor,


making

HK

at least asymptoti ally, we

n .

n,

In order

This is a hieved by

the sample size. It is lear

The se ond requirement is:

H.

is a subset of the losure of the estimation spa e,

if the losure of the ountable union of the subsets

a=1 Aa = A

Use a pi ture here. The rest of the dis ussion of denseness is provided just for ompleteness:
there's no need to study it in detail. To show that HK is a dense subset of H with respe t to
k h k1,X ,

it is useful to apply Theorem 1 of Gallant (1982), who in turn ites Edmunds and

Mos atelli (1977). We reprodu e the theorem as presented by Gallant, with minor notational
hanges, for onvenien e of referen e:

Theorem 31 [Edmunds and Mos atelli, 1977 Let the real-valued fun tion h (x) be ontin-

uously dierentiable up to order m on an open set ontaining the losure of X . Then it is


possible to hoose a triangular array of oe ients 1 , 2 , . . . K , . . . , su h that for every q with
0 q < m, and every > 0, k h (x) hK (x|K ) kq,X = o(K m+q+ ) as K .
In the present appli ation,
elements of
of

X,

q = 1,

and

m = 2.

By denition of the estimation spa e, the

H are on e ontinuously dierentiable on X , whi h is open and ontains the losure

so the theorem is appli able. Closely following Gallant and Ny hka (1987),

the ountable union of the


{hK } from

HK

HK .

h H.

is

The impli ation of Theorem 31 is that there is a sequen e of

su h that

lim k h hK k1,X = 0,

K
for all

HK

Therefore,

H HK .

18.3.

281

ESTIMATION OF REGRESSION FUNCTIONS

However,

HK H,
so

HK H.
Therefore

H = HK ,
so

HK

is a dense subset of

Uniform onvergen e

H,

with respe t to the norm

k h k1,X .

We now turn to the limiting obje tive fun tion.

We estimate by

OLS. The sample obje tive fun tion stated in terms of maximization is

1X
(yt gK (xt | K ))2
sn (K ) =
n
t=1

With random sampling, as in the ase of Equations 14.1 and 17.3, the limiting obje tive
fun tion is

s (g, f ) =
where the true fun tion
the theorem. Both

g(x)

f (x)
and

(f (x) g(x))2 dx 2 .

takes the pla e of the generi fun tion

f (x)

are elements of

(18.7)

in the presentation of

HK .

The pointwise onvergen e of the obje tive fun tion needs to be strengthened to uniform
onvergen e. We will simply assume that this holds, sin e the way to verify this depends upon
the spe i appli ation. We also have ontinuity of the obje tive fun tion in
to the norm

k h k1,X

g,

with respe t

sin e




s g 1 , f ) s g 0 , f )
kg 1 g 0 k1,X 0
Z h
2
2 i
dx.
g1 (x) f (x) g0 (x) f (x)
=
lim
lim

kg 1 g 0 k1,X 0 X

By the dominated onvergen e theorem (whi h applies sin e the nite bound

W2,Z (D)

used to dene

is dominated by an integrable fun tion), the limit and the integral an be inter-

hanged, so by inspe tion, the limit is zero.

Identi ation

The identi ation ondition requires that for any point

s (g, f ) s (f, f ) k g f k1,X = 0.

(g, f )

in

This ondition is learly satised given that

H H,
g

and

are on e ontinuously dierentiable (by the assumption that denes the estimation spa e).

Review of on epts
on epts are:

For the example of estimation of rst-order elasti ities, the relevant

282

CHAPTER 18.

H = W2,X (D):

Estimation spa e
fun tion must lie.

Consisten y norm

k h k1,X .

Estimation subspa e

Sample obje tive fun tion

Limiting obje tive fun tion

HK .

NONPARAMETRIC INFERENCE

the fun tion spa e in the losure of whi h the true

The losure of

is ompa t with respe t to this norm.

that is repre-

the negative of the sum of squares.

By standard

The estimation subspa e is the subset of

sentable by a Fourier form with parameter

sn (K ),

K .

These are dense subsets of

H.

arguments this onverges uniformly to the

s ( g, f ), whi h is ontinuous in g and has a global maximum

in its rst argument, over the losure of the innite union of the estimation subpa es, at

g = f.

As a result of this, rst order elasti ities

xi f (x)
f (x) xi f (x)
are onsistently estimated for all

Dis ussion

x X.

Consisten y requires that the number of parameters used in the expansion in-

rease with the sample size, tending to innity. If parameters are added at a high rate, the bias
tends relatively rapidly to zero. A basi problem is that a high rate of in lusion of additional
parameters auses the varian e to tend more slowly to zero. The issue of how to hose the rate
at whi h parameters are added and whi h to add rst is fairly omplex. A problem is that the
allowable rates for asymptoti normality to obtain (Andrews 1991; Gallant and Souza, 1991)
are very stri t. Supposing we sti k to these rates, our approximating model is:

gK (x|K ) = z K .

Dene

ZK

as the

estimator is

nK

matrix of regressors obtained by sta king observations. The LS

K = ZK ZK
where

()+

+

ZK y,

is the Moore-Penrose generalized inverse.

This is used sin e

ZK ZK

may be singular, as would be the ase for

K(n)

large

enough when some dummy variables are in luded.

. The predi tion,


tributed:

where

z K ,

of the unknown fun tion

f (x)

is asymptoti ally normally dis-



d
n z K f (x) N (0, AV ),

#
" 
+

2
ZK ZK
z
.
AV = lim E z
n
n

18.3.

283

ESTIMATION OF REGRESSION FUNCTIONS

Formally, this is exa tly the same as if we were dealing with a parametri linear model. I
emphasize, though, that this is only valid if

grows very slowly as

grows. If we an't

sti k to a eptable rates, we should probably use some other method of approximating
the small sample distribution. Bootstrapping is a possibility. We'll dis uss this in the
se tion on simulation.

18.3.2 Kernel regression estimators


Readings:

Bierens, 1987, Kernel estimators of regression fun tions, in

metri s, Fifth World Congress, V. 1, Truman Bewley, ed., Cambridge.

Advan es in E ono-

An alternative method to the semi-nonparametri method is a fully nonparametri method


of estimation. Kernel regression estimation is an example (others are splines, nearest neighbor,
et .). We'll onsider the Nadaraya-Watson kernel regression estimator in a simple ase.

Suppose we have an iid sample from the joint density

f (x, y), where x is k

-dimensional.

The model is

yt = g(xt ) + t ,
where

E(t |xt ) = 0.

The onditional expe tation of

given

is

By denition of the onditional expe -

tation, we have

f (x, y)
dy
h(x)
Z
1
yf (x, y)dy,
h(x)

g(x) =
=
where

h(x)

x:

is the marginal density of

h(x) =

g(x).

g(x)

This suggests that we ould estimate

f (x, y)dy.
by estimating

Estimation of the denominator


A kernel estimator for

h(x)

has the form

1 X K [(x xt ) /n ]

h(x) =
,
n t=1
nk
where

is the sample size and

The fun tion

K()

is the dimension of

x.

(the kernel) is absolutely integrable:

|K(x)|dx < ,

h(x)

and

yf (x, y)dy.

284

CHAPTER 18.

and

K()

integrates to

In this respe t,

K()

1:

NONPARAMETRIC INFERENCE

K(x)dx = 1.

is like a density fun tion, but we do not ne essarily restri t

K()

to

be nonnegative.

The

window width

parameter,

is a sequen e of positive numbers that satises

lim n = 0

lim nnk =

So, the window width must tend to zero, but not too qui kly.

To show pointwise onsisten y of

h(x)

h(x),

for

rst onsider the expe tation of the

estimator (sin e the estimator is an average of iid terms we only need to onsider the
expe tation of a representative term):

i Z

E h(x) = nk K [(x z) /n ] h(z)dz.


h

Change variables as

z = (x z)/n ,

so

z = x n z

and

| dzdz | = nk ,

we obtain

Z
h
i

E h(x)
=
nk K (z ) h(x n z )nk dz
Z
=
K (z ) h(x n z )dz .

Now, asymptoti ally,

lim E h(x)
=

K (z ) h(x n z )dz
lim
Z
=
lim K (z ) h(x n z )dz
n
Z
=
K (z ) h(x)dz
Z
= h(x) K (z ) dz
n

= h(x),
sin e

n 0

and

K (z ) dz = 1

by assumption. (Note: that we an pass the limit

through the integral is a result of the dominated onvergen e theorem.. For this to hold
we need that

h()

be dominated by an absolutely integrable fun tion.

18.3.

Next, onsidering the varian e of

nnk V

h(x),

we have, due to the iid assumption



n
i
X
K [(x xt ) /n ]
k 1

V
h(x) = nn 2
n
nk
= nk

285

ESTIMATION OF REGRESSION FUNCTIONS

1
n

t=1
n
X
t=1

V {K [(x xt ) /n ]}

By the representative term argument, this is

h
i

nnk V h(x)
= nk V {K [(x z) /n ]}

Also, sin e

V (x) = E(x2 ) E(x)2

we have

o
h
i
n

nnk V h(x)
= nk E (K [(x z) /n ])2 nk {E (K [(x z) /n ])}2
Z
2
Z
2
k
k
k
=
n K [(x z) /n ] h(z)dz n
n K [(x z) /n ] h(z)dz
Z
h
i2
=
nk K [(x z) /n ]2 h(z)dz nk E b
h(x)

The se ond term onverges to zero:

h
i2
nk E b
h(x) 0,

by the previous result regarding the expe tation and the fa t that

lim nnk V
n

n 0.

Therefore,

Z
i

nk K [(x z) /n ]2 h(z)dz.
h(x) = lim
n

Using exa tly the same hange of variables as before, this an be shown to be

Z
h
i

lim nnk V h(x)


= h(x) [K(z )]2 dz .

n
Sin e both

[K(z )]2 dz

and

by assumption, we have that

h(x)

are bounded, this is bounded, and sin e

nnk

h
i

V h(x)
0.

Sin e the bias and the varian e both go to zero, we have pointwise onsisten y ( onvergen e in quadrati mean implies onvergen e in probability).

286

CHAPTER 18.

NONPARAMETRIC INFERENCE

Estimation of the numerator


To estimate

yf (x, y)dy,

as the estimator for

h(x),

we need an estimator of

f (x, y).

The estimator has the same form

only with one dimension more:

1 X K [(y yt ) /n , (x xt ) /n ]
f(x, y) =
n
nk+1
t=1

The kernel

K ()

is required to have mean zero:

yK (y, x) dy = 0
h(x) :

and to marginalize to the previous kernel for

K (y, x) dy = K(x).

With this kernel, we have

1 X K [(x xt ) /n ]
y f(y, x)dy =
yt
n
nk
t=1

by marginalization of the kernel, so we obtain

g(x) =
=
=

h(x)
1 Pn

y f(y, x)dy

K[(xxt )/n ]
k
n
P
K[(xx
n
t )/n ]
1
k
t=1
n
n
Pn
yt K [(x xt ) /n ]
Pt=1
.
n
t=1 K [(x xt ) /n ]
n

t=1 yt

This is the Nadaraya-Watson kernel regression estimator.

Dis ussion

The kernel regression estimator for

g(xt )

is a weighted average of the

yj , j = 1, 2, ..., n,

xt .

The weights sum

where higher weights are asso iated with points that are loser to
to 1.

The window width parameter

A large window width redu es the varian e (strong imposition of atness), but in reases

A small window width redu es the bias, but makes very little use of information ex ept

as

n ,

imposes smoothness. The estimator is in reasingly at

sin e in this ase ea h weight tends to

1/n.

the bias.

points that are in a small neighborhood of

xt .

Sin e relatively little information is used,

18.4.

287

DENSITY FUNCTION ESTIMATION

the varian e is large when the window width is small.

The standard normal density is a popular hoi e for

K(.)

and

K (y, x),

though there

are possibly better alternatives.

Choi e of the window width: Cross-validation


The sele tion of an appropriate window width is important.

One popular method is ross

validation. This onsists of splitting the sample into two parts (e.g., 50%-50%). The rst part
is the in sample data, whi h is used for estimation, and the se ond part is the out of sample
data, used for evaluation of the t though RMSE or some other riterion. The steps are:
1. Split the data. The out of sample data is
2. Choose a window width

y out

and

xout .

3. With the in sample data, t

ytout orresponding to ea h xout


t . This tted value is a fun tion

of the in sample data, as well as the evaluation point

xout
t ,

but it does not involve

ytout .

4. Repeat for all out of sample points.


5. Cal ulate RMSE()
6. Go to step

2,

or to the next step if enough window widths have been tried.

7. Sele t the

that minimizes RMSE() (Verify that a minimum has been found, for

example by plotting RMSE as a fun tion of


8. Re-estimate using the best

).

and all of the data.

This same prin iple an be used to hoose

and

in a Fourier form model.

18.4 Density fun tion estimation


18.4.1 Kernel density estimation
The previous dis ussion suggests that a kernel density estimator may easily be onstru ted.
We have already seen how joint densities may be estimated. If were interested in a onditional
density, for example of

onditional on

x,

then the kernel estimate of the onditional density

is simply

fby|x =
=
=

f(x, y)

h(x)
1 Pn

K [(yyt )/n ,(xxt )/n ]


k+1
n
K[(xxt )/n ]
1 Pn
k
t=1
n
n
Pn
1
[(y yt ) /n , (x xt ) /n ]
t=1 K
P
n
n
t=1 K [(x xt ) /n ]
n

t=1

288

CHAPTER 18.

NONPARAMETRIC INFERENCE

where we obtain the expressions for the joint and marginal densities from the se tion on kernel
regression.

18.4.2 Semi-nonparametri maximum likelihood


Readings:

Gallant and Ny hka,

E onometri a, 1987.

For a Fortran program to do this and a

useful dis ussion in the user's guide, see this link. See also Cameron and Johansson,

of Applied E onometri s, V. 12, 1997.

Journal

MLE is the estimation method of hoi e when we are ondent about spe ifying the density.
Is is possible to obtain the benets of MLE when we're not so ondent about the spe i ation?
In part, yes.

Suppose we're interested in the density of


Suppose that the density

f (y|x, )

onditional on

(both may be ve tors).

is a reasonable starting approximation to the true density.

This density an be reshaped by multiplying it by a squared polynomial. The new density is

gp (y|x, , ) =

h2p (y|)f (y|x, )


p (x, , )

where

hp (y|) =

p
X

k y k

k=0

p (x, , ) is a normalizing fa tor to make the density integrate (sum) to one. Be ause
2
hp (y|)/p (x, , ) is a homogenous fun tion of it is ne essary to impose a normalization: 0
and

is set to 1. The normalization fa tor

p (, ) is al ulated

(following Cameron and Johansson)

using

E(Y r ) =
=

y=0

y r fY (y|, )
y

(y|)]2
fY (y|)
p (, )

r [hp

y=0
p X
p
X
X

y r fY (y|)k l y k y l /p (, )

y=0 k=0 l=0

p X
p
X

k l

k=0 l=0

p
p X
X

y=0

r+k+l
y
fY (y|) /p (, )

k l mk+l+r /p (, ).

k=0 l=0

By setting

r=0

we get that the normalizing fa tor is

18.8

p (, ) =

p X
p
X

k l mk+l

(18.8)

k=0 l=0

Re all that

is set to 1 to a hieve identi ation.

The

mr

in equation 18.8 are the raw

18.4.

289

DENSITY FUNCTION ESTIMATION

moments of the baseline density. Gallant and Ny hka (1987) give onditions under whi h su h
a density may be treated as orre tly spe ied, asymptoti ally.

Basi ally, the order of the

polynomial must in rease as the sample size in reases. However, there are te hni alities.
Similarly to Cameron and Johannson (1997), we may develop a negative binomial polynomial (NBP) density for ount data. The negative binomial baseline density may be written
(see equation as

(y + )
fY (y|) =
(y + 1)()
where

= {, }, > 0

and

is the parameterization

(NB-I). When

> 0.

 

y

The usual means of in orporating onditioning variables

ex . When

= /

we have the negative binomial-I model

= 1/ we have the negative binomial-II (NP-II) model.

V (Y ) = + .

In the ase of the NB-II model, we have

For the NB-I density,

V (Y ) = + 2 .

For both forms,

E(Y ) = .
The reshaped density, with normalization to sum to one, is

[hp (y|)]2 (y + )
fY (y|, ) =
p (, ) (y + 1)()

 

y

(18.9)

To get the normalization fa tor, we need the moment generating fun tion:

MY (t) = et +

(18.10)

To illustrate, Figure 18.5 shows al ulation of the rst four raw moments of the NB density,
al ulated using MuPAD, whi h is a Computer Algebra System that (used to be?) free for
personal use. These are the moments you would need to use a se ond order polynomial

(p = 2).

MuPAD will output these results in the form of C ode, whi h is relatively easy to edit to
write the likelihood fun tion for the model. This has been done in NegBinSNP. , whi h is
a C++ version of this model that an be ompiled to use with o tave using the

mko tfile

ommand. Note the impressive length of the expressions when the degree of the expansion is
4 or 5! This is an example of a model that would be di ult to formulate without the help of
a program like

MuPAD.

It is possible that there is onditional heterogeneity su h that the appropriate reshaping


should be more lo al.

This an be a omodated by allowing the

parameters to depend

upon the onditioning variables, for example using polynomials.


Gallant and Ny hka,

E onometri a,

1987 prove that this sort of density an approximate

a wide variety of densities arbitrarily well as the degree of the polynomial in reases with the
sample size. This approa h is not without its drawba ks: the sample obje tive fun tion an
have an

extremely

large number of lo al maxima that an lead to numeri di ulties.

If

someone ould gure out how to do in a way su h that the sample obje tive fun tion was ni e
and smooth, they would probably get the paper published in a good journal. Any ideas?
Here's a plot of true and the limiting SNP approximations (with the order of the polynomial
xed) to four dierent ount data densities, whi h variously exhibit over and underdispersion,

290

CHAPTER 18.

NONPARAMETRIC INFERENCE

Figure 18.5: Negative binomial raw moments

as well as ex ess zeros. The baseline model is a negative binomial density.

Case 1

Case 2

.5
.4

.1

.3
.2

.05

.1
0

Case 3

10

15

20

.25

.2

.2

.15

.15

Case 4

10

15

20

25

.1

.1
.05
.05
1

2.5

7.5

10

12.5

15

18.5.

291

EXAMPLES

Figure 18.6: Kernel tted OBDV usage versus AGE


Kernel fit, OBDV visits versus AGE

3.29

3.285

3.28

3.275

3.27

3.265

3.26

3.255

20

25

30

35

40

45

50

55

60

65

Age

18.5 Examples
18.5.1 MEPS health are usage data
We'll use the MEPS OBDV data to illustrate kernel regression and semi-nonparametri maximum likelihood.

Kernel regression estimation


Let's try a kernel regression t for the OBDV data. The program OBDVkernel.m loads the
MEPS OBDV data, s ans over a range of window widths and al ulates leave-one-out CV
s ores, and plots the tted OBDV usage versus AGE, using the best window width. The plot
is in Figure 18.6. Note that usage in reases with age, just as we've seen with the parametri
models. On e ould use bootstrapping to generate a onden e interval to the t.

Seminonparametri ML estimation and the MEPS data


Now let's estimate a seminonparametri density for the OBDV data. We'll reshape a negative
binomial density, as dis ussed above. The program EstimateNBSNP.m loads the MEPS OBDV
data and estimates the model, using a NB-I baseline density and a 2nd order polynomial
expansion. The output is:

OBDV

292

CHAPTER 18.

NONPARAMETRIC INFERENCE

======================================================
BFGSMIN final results
Used numeri gradient
-----------------------------------------------------STRONG CONVERGENCE
Fun tion onv 1 Param onv 1 Gradient onv 1
-----------------------------------------------------Obje tive fun tion value 2.17061
Stepsize 0.0065
24 iterations
-----------------------------------------------------param
gradient hange
1.3826 0.0000 -0.0000
0.2317 -0.0000
0.0000
0.1839 0.0000
0.0000
0.2214 0.0000 -0.0000
0.1898 0.0000 -0.0000
0.0722 0.0000 -0.0000
-0.0002 0.0000 -0.0000
1.7853 -0.0000 -0.0000
-0.4358 0.0000 -0.0000
0.1129 0.0000
0.0000
******************************************************
NegBin SNP model, MEPS full data set
MLE Estimation Results
BFGS onvergen e: Normal onvergen e
Average Log-L: -2.170614
Observations: 4564
onstant
pub. ins.
priv. ins.
sex
age
edu
in
gam1
gam2
lnalpha

estimate
-0.147
0.695
0.409
0.443
0.016
0.025
-0.000
1.785
-0.436
0.113

Information Criteria
CAIC : 19907.6244

st. err
0.126
0.050
0.046
0.034
0.001
0.006
0.000
0.141
0.029
0.027
Avg. CAIC:

t-stat
-1.173
13.936
8.833
13.148
11.880
3.903
-0.011
12.629
-14.786
4.166
4.3619

p-value
0.241
0.000
0.000
0.000
0.000
0.000
0.991
0.000
0.000
0.000

18.5.

293

EXAMPLES

Figure 18.7: Dollar-Euro

Figure 18.8: Dollar-Yen

BIC : 19897.6244
Avg. BIC:
4.3597
AIC : 19833.3649
Avg. AIC:
4.3456
******************************************************

Note that the CAIC and BIC are lower for this model than for the models presented in
Table 16.3. This model ts well, still being parsimonious. You an play around trying other
use measures, using a NP-II baseline density, and using other orders of expansions. Density
fun tions formed in this way may have

MANY

before a epting the results of a asual run.

lo al maxima, so you need to be areful

To guard against having onverged to a lo al

maximum, one an try using multiple starting values, or one ould try simulated annealing
as an optimization method.

If you un omment the relevant lines in the program, you an

use SA to do the minimization. This will take a

lot

of time, ompared to the default BFGS

minimization. The hapter on parallel omputations might be interesting to read before trying
this.

18.5.2 Finan ial data and volatility


The data set rates ontains the growth rate (100log dieren e) of the daily spot $/euro and

$/yen ex hange rates at New York, noon, from January 04, 1999 to February 12, 2008. There

are 2291 observations. See the README le for details. Figures 18.5.2 and 18.5.2 show the
data and their histograms.

294

CHAPTER 18.

NONPARAMETRIC INFERENCE

Figure 18.9: Kernel regression tted onditional se ond moments, Yen/Dollar and Euro/Dollar

(a) Yen/Dollar

at the enter of the histograms, the bars extend above the normal density that best ts
the data, and the tails are fatter than those of the best t normal density. This feature
of the data is known as

(b) Euro/Dollar

leptokurtosis.

in the series plots, we an see that the varian e of the growth rates is not onstant
over time. Volatility lusters are apparent, alternating between periods of stability and
periods of more wild swings. This is known as

onditional heteros edasti ity.

ARCH and

GARCH well-known models that are often applied to this sort of data.

Many stru tural e onomi models often annot generate data that exhibits onditional
heteros edasti ity without dire tly assuming sho ks that are onditionally heteros edasti . It would be ni e to have an e onomi explanation for how onditional heteros edasti ity, leptokurtosis, and other (leverage, et .) features of nan ial data result from the
behavior of e onomi agents, rather than from a bla k box that provides sho ks.

The O tave s ript kernelt.m performs kernel regression to t


the plots in Figure 18.9.

2
2 ),
yt2
E(yt2 |yt1,

and generates

From the point of view of learning the pra ti al aspe ts of kernel regression, note how
the data is ompa tied in the example s ript.
In the Figure, note how urrent volatility depends on lags of the squared return rate it is high when both of the lags are high, but drops o qui kly when either of the lags is
low.

The fa t that the plots are not at suggests that this onditional moment ontain information about the pro ess that generates the data. Perhaps attempting to mat h this
moment might be a means of estimating the parameters of the dgp. We'll ome ba k to
this later.

18.6.

295

EXERCISES

18.6 Exer ises


1. In O tave, type  edit

kernel_example.

(a) Look this s ript over, and des ribe in words what it does.
(b) Run the s ript and interpret the output.
( ) Experiment with dierent bandwidths, and omment on the ee ts of hoosing
small and large values.
2. In O tave, type  help

kernel_regression.

(a) How an a kernel t be done without supplying a bandwidth?


(b) How is the bandwidth hosen if a value is not provided?
( ) What is the default kernel used?

3. Using the O tave s ript OBDVkernel.m as a model, plot kernel regression ts for OBDV
visits as a fun tion of in ome and edu ation.

296

CHAPTER 18.

NONPARAMETRIC INFERENCE

Chapter 19
Simulation-based estimation
Readings:

Gourieroux and Monfort (1996)

University Press).

Simulation-Based E onometri Methods

There are many arti les.

(Oxford

Some of the seminal papers are Gallant and

Tau hen (1996), Whi h Moments to Mat h?, ECONOMETRIC THEORY, Vol.

12, 1996,

J. Apl. E onometri s; Pakes and Pollard (1989) E onometri a ; M Fadden (1989) E onometri a.

pages 657-681; Gourieroux, Monfort and Renault (1993), Indire t Inferen e,

19.1 Motivation
Simulation methods are of interest when the DGP is fully hara terized by a parameter ve tor,
so that simulated data an be generated, but the likelihood fun tion and moments of the
observable varables are not al ulable, so that MLE or GMM estimation is not possible.
Many moderately omplex models result in intra tible likelihoods or moments, as we will
see. Simulation-based estimation methods open up the possibility to estimate truly omplex
1

models. The desirability introdu ing a great deal of omplexity may be an issue , but it least
it be omes a possibility.

19.1.1 Example: Multinomial and/or dynami dis rete response models


Let

yi

be a latent random ve tor of dimension

m.

Suppose that

yi = Xi + i
where

Xi

is

m K.

Hen eforth drop the

Suppose that

i N (0, )
i

(19.1)

subs ript when it is not needed for larity.

is not observed. Rather, we observe a many-to-one mapping

y = (y )
1

Remember that a model is an abstra tion from reality, and abstra tion helps us to isolate the important

features of a phenomenon.

297

298

CHAPTER 19.

This mapping is su h that ea h element of

SIMULATION-BASED ESTIMATION

is either zero or one (in some ases only

one element will be one).

Dene

Ai = A(yi ) = {y |yi = (y )}
Suppose random sampling of

(yi , Xi ).

In this ase the elements of

dent of one another (and learly are not if


of

is not diagonal).

yi may not be indepen-

However,

yi

is independent

yj , i 6= j.

Let

= ( , (vec ) )

be the ve tor of parameters of the model. The ontribution of

th observation to the likelihood fun tion is


the i

pi () =

Ai

n(yi Xi , )dyi

where

M/2

n(, ) = (2)

||

is the multivariate normal density of an


likelihood fun tion is

1/2

1
exp
2

-dimensional random ve tor.

The log-

1X
ln pi ()
ln L() =
n
i=1

and the MLE

solves

the s ore equations

n
n

1 X D pi ()
1X
gi () =
0.

n
n
pi ()
i=1

The problem is that evaluation of

Li ()

and its derivative w.r.t.

by standard methods

of numeri integration su h as quadrature is omputationally infeasible when


dimension of

i=1

The mapping

y)

is higher than 3 or 4 (as long as there are no restri tions on

(y )

dierent hoi es of

(the

).

has not been made spe i so far. This setup is quite general: for

(y )

it nests the ase of dynami binary dis rete hoi e models as

well as the ase of multinomial dis rete hoi e (the hoi e of one out of a nite set of
alternatives).

Multinomial dis rete hoi e is illustrated by a (very simple) job sear h model. We
have ross se tional data on individuals' mat hing to a set of

available (one of whi h is unemployment). The utility of alternative

jobs that are

is

uj = Xj + j
Utilities of jobs, sta ked in the ve tor

ui

are not observed. Rather, we observe the

19.1.

299

MOTIVATION

ve tor formed of elements

yj = 1 [uj > uk , k m, k 6= j]
Only one of these elements is dierent than zero.

Dynami dis rete hoi e is illustrated by repeated hoi es over time between two
alternatives. Let alternative

have utility

ujt = Wjt jt ,
j

{1, 2}

t {1, 2, ..., m}
Then

y = u2 u1
= (W2 W1 ) + 2 1
X +
Now the mapping is (element-by-element)

y = 1 [y > 0] ,
that is

yit = 1

if individual

hooses the se ond alternative in period

t,

zero other-

wise.

19.1.2 Example: Marginalization of latent variables


E onomi data often presents substantial heterogeneity that may be di ult to model.

possibility is to introdu e latent random variables. This an ause the problem that there may
be no known losed form for the distribution of observable variables after marginalizing out
the unobservable latent variables. For example, ount data (that takes values
often modeled using the Poisson distribution

Pr(y = i) =

exp()i
i!

The mean and varian e of the Poisson distribution are both equal to

E(y) = V (y) = .
Often, one parameterizes the onditional mean as

i = exp(Xi ).

0, 1, 2, 3, ...)

is

300

CHAPTER 19.

SIMULATION-BASED ESTIMATION

This ensures that the mean is positive (as it must be). Estimation by ML is straightforward.
Often, ount data exhibits overdispersion whi h simply means that

V (y) > E(y).


If this is the ase, a solution is to use the negative binomial distribution rather than the
Poisson. An alternative is to introdu e a latent variable that ree ts heterogeneity into the
spe i ation:

i = exp(Xi + i )
where

has some spe ied density with support

parameters). Let

d(i )

i .

be the density of

Pr(y = yi ) =

(this density may depend on additional

In some ases, the marginal density of

exp [ exp(Xi + i )] [exp(Xi + i )]yi


d(i )
yi !

will have a losed-form solution (one an derive the negative binomial distribution in the way if

has an exponential distribution - see equation 16.1), but often this will not be possible.
ase, simulation is a means of al ulating

Pr(y = i),

In this

whi h is then used to do ML estimation.

This would be an example of the Simulated Maximum Likelihood (SML) estimation.

In this ase, sin e there is only one latent variable, quadrature is probably a better
hoi e. However, a more exible model with heterogeneity would allow all parameters
(not just the onstant) to vary. For example

Pr(y = yi ) =
entails a
when

exp [ exp(Xi i )] [exp(Xi i )]yi


d(i )
yi !

K = dim i -dimensional

integral, whi h will not be evaluable by quadrature

gets large.

19.1.3 Estimation of models spe ied in terms of sto hasti dierential


equations
It is often onvenient to formulate models in terms of ontinuous time using dierential equations.

A realisti model should a ount for exogenous sho ks to the system, whi h an be

done by assuming a random omponent. This leads to a model that is expressed as a system
of sto hasti dierential equations. Consider the pro ess

dyt = g(, yt )dt + h(, yt )dWt


whi h is assumed to be stationary.
su h that

{Wt }

W (T ) =

is a standard Brownian motion (Weiner pro ess),

dWt N (0, T )

Brownian motion is a ontinuous-time sto hasti pro ess su h that

19.2.

301

SIMULATED MAXIMUM LIKELIHOOD (SML)

W (0) = 0
[W (s) W (t)] N (0, s t)
[W (s) W (t)]

and

[W (j) W (k)]

are independent for

s > t > j > k.

That is, non-

overlapping segments are independent.

One an think of Brownian motion the a umulation of independent normally distributed


sho ks with innitesimal varian e.

The fun tion

h(, yt )

g(, yt )

is the deterministi part.

determines the varian e of the sho ks.

To estimate a model of this sort, we typi ally have data that are assumed to be observations
of

yt

in dis rete points

y1 , y2 , ...yT .

That is, though

yt

is a ontinuous pro ess it is observed in

dis rete time.


To perform inferen e on

dire t ML or GMM estimation is not usually feasible, be ause

one annot, in general, dedu e the transition density

f (yt |yt1 , ).

This density is ne essary

to evaluate the likelihood fun tion or to evaluate moment onditions (whi h are based upon
expe tations with respe t to this density).

A typi al solution is to dis retize the model, by whi h we mean to nd a dis rete time
approximation to the model. The dis retized version of the model is

yt yt1 = g(, yt1 ) + h(, yt1 )t


t N (0, 1)
The dis retization indu es a new parameter,

(that is, the

whi h denes the best

approximation of the dis retization to the a tual (unknown) dis rete time version of the
model is not equal to

whi h is the true parameter value). This is an approximation,

and as su h ML estimation of

(whi h is a tually quasi-maximum likelihood, QML)

based upon this equation is in general biased and in onsistent for the original parameter,

Nevertheless, the approximation shouldn't be too bad, whi h will be useful, as we will

see.

The important point about these three examples is that omputational di ulties prevent dire t appli ation of ML, GMM, et . Nevertheless the model is fully spe ied in
probabilisti terms up to a parameter ve tor. This means that the model is simulable,
onditional on the parameter ve tor.

19.2 Simulated maximum likelihood (SML)


For simpli ity, onsider ross-se tional data. An ML estimator solves

1X
M L = arg max sn () =
ln p(yt |Xt , )
n t=1

302

CHAPTER 19.

SIMULATION-BASED ESTIMATION

p(yt |Xt , ) is the density fun tion of the tth observation. When p(yt |Xt , ) does not have
M L is an infeasible estimator. However, it may be possible to dene a
known losed form,

where
a

random fun tion su h that

E f (, yt , Xt , ) = p(yt |Xt , )
where the density of

is known. If this is the ase, the simulator

p (yt , Xt , ) =

H
1 X
f (ts , yt , Xt , )
H
s=1

is unbiased for

p(yt |Xt , ).

The SML simply substitutes


tion, that is

p (yt , Xt , ) in pla e of p(yt |Xt , ) in the log-likelihood fun n

1X
SM L = arg max sn () =
ln p (yt , Xt , )
n
i=1

19.2.1 Example: multinomial probit


Re all that the utility of alternative

is

uj = Xj + j
and the ve tor

is formed of elements

yj = 1 [uj > uk , k m, k 6= j]
The problem is that

Pr(yj = 1|)

an't be al ulated when

is larger than 4 or 5. However,

it is easy to simulate this probability.

Draw

Cal ulate

Dene

Repeat this

from the distribution

u
i = Xi + i

N (0, )

(where

Xi

is the matrix formed by sta king the

Xij )

yij = 1 [uij > uik , k m, k 6= j]

ei

times and dene

m-ve tor

eij =

PH

ijh
h=1 y

eij .

Dene

Now

The SML multinomial probit log-likelihood fun tion is

as the

formed of the

the elements sum to one.

p (yi , Xi , ) = yi
ei

Ea h element of

ln L(, ) =

1X
yi ln p (yi , Xi , )
n
i=1

ei

is between 0 and 1, and

19.2.

303

SIMULATED MAXIMUM LIKELIHOOD (SML)

This is to be maximized w.r.t.

and

Notes:

The

draws of

used to nd

and

are draw

only on e

and are used repeatedly during the iterations

The draws are dierent for ea h

i.

If the

are re-drawn at every

iteration the estimator will not onverge.

The log-likelihood fun tion with this simulator is a dis ontinuous fun tion of

and

This does not ause problems from a theoreti al point of view sin e it an be shown
that

ln L(, )

is sto hasti ally equi ontinuous. However, it does ause problems if one

attempts to use a gradient-based optimization method su h as Newton-Raphson.

It may be the ase, parti ularly if few simulations,

Solutions to dis ontinuity:

are zero. If the orresponding element of

yi

H , are used, that some elements

is equal to 1, there will be a

log(0)

of

ei

problem.

1) use an estimation method that doesn't require a ontinuous and dierentiable obje tive fun tion, for example, simulated annealing. This is omputationally ostly.

2) Smooth the simulated probabilities so that they are ontinuous fun tions of the
parameters. For example, apply a kernel transformation su h as

yij = A uij max uik


where

yij

k=1

uij

is the maximum. This makes

+ .5 1 uij = max uik


k=1

is a large positive number. This approximates a step fun tion su h that

is very lose to zero if

therefore



is not the maximum, and

yij

a ontinuous fun tion of

ln L(, ) will be ontinuous and dierentiable.

A(n) ,

yij

is very lose to 1 if
and

so that

pij

uij

and

Consisten y requires that

so that the approximation to a step fun tion be omes arbitrarily lose

as the sample size in reases. There are alternative methods (e.g., Gibbs sampling)
that may work better, but this is too te hni al to dis uss here.

To solve to log(0) problem, one possibility is to sear h the web for the slog fun tion.
Also, in rease

if this is a serious problem.

19.2.2 Properties
The properties of the SML estimator depend on how

is set. The following is taken from

Lee (1995) Asymptoti Bias in Simulated Maximum Likelihood Estimation of Dis rete Choi e
Models,

E onometri Theory, 11, pp.

437-83.

Theorem 32 [Lee 1) if limn n1/2 /H = 0, then




d
n SM L 0 N (0, I 1 ( 0 ))

304

CHAPTER 19.

SIMULATION-BASED ESTIMATION

2) if limn n1/2 /H = , a nite onstant, then




d
n SM L 0 N (B, I 1 ( 0 ))

where B is a nite ve tor of onstants.

This means that the SML estimator is asymptoti ally biased if

doesn't grow faster

1/2 .
than n

The var ov is the typi al inverse of the information matrix, so that as long as

grows

fast enough the estimator is onsistent and fully asymptoti ally e ient.

19.3 Method of simulated moments (MSM)


Suppose we have a DGP(y|x, ) whi h is simulable given

but is su h that the density of

is not al ulable.
On e ould, in prin iple, base a GMM estimator upon the moment onditions

mt () = [K(yt , xt ) k(xt , )] zt
where

k(xt , ) =
zt
on

K(yt , xt )p(y|xt , )dy,

is a ve tor of instruments in the information set and

xt .

p(y|xt , ) is the density of y

onditional

The problem is that this density is not available.

However

k(xt , )

is readily simulated using

H
1 X
e
k (xt , ) =
K(e
yth , xt )
H
h=1

By the law of large numbers,

a.s.
e
k (xt , ) k (xt , ) ,

as

H ,

whi h provides a lear

intuitive basis for the estimator, though in fa t we obtain onsisten y even for
sin e a law of large numbers is also operating a ross the

nite,

observations of real data, so

errors introdu ed by simulation an el themselves out.

This allows us to form the moment onditions

h
i
m
ft () = K(yt , xt ) e
k (xt , ) zt

(19.2)

19.3.

305

METHOD OF SIMULATED MOMENTS (MSM)

where

zt

is drawn from the information set. As before, form

m()
e
=
=

1X
m
ft ()
n
i=1
"
#
n
H
1X
1 X
K(yt , xt )
k(e
yth , xt ) zt
n
H
i=1

(19.3)

h=1

with whi h we form the GMM riterion and estimate as usual. Note that the unbiased
simulator

k(e
yth , xt )

appears linearly within the sums.

19.3.1 Properties
Suppose that the optimal weighting matrix is used. M Fadden (ref. above) and Pakes and
Pollard (refs.

above) show that the asymptoti distribution of the MSM estimator is very

similar to that of the infeasible GMM estimator.


weighting matrix is used, and for

In parti ular, assuming that the optimal

nite,

 





1
d
1
n M SM 0 N 0, 1 +
D 1 D
H
where

D 1 D

1

(19.4)

is the asymptoti varian e of the infeasible GMM estimator.

That is, the asymptoti varian e is inated by a fa tor

1+1/H. For this reason the MSM

estimator is not fully asymptoti ally e ient relative to the infeasible GMM estimator,
for

nite, but the e ien y loss is small and ontrollable, by setting

reasonably

large.

The estimator is asymptoti ally unbiased even for

If one doesn't use the optimal weighting matrix, the asymptoti var ov is just the ordi-

H = 1.

This is an advantage relative

to SML.

nary GMM var ov, inated by

1 + 1/H.

The above presentation is in terms of a spe i moment ondition based upon the onditional mean. Simulated GMM an be applied to moment onditions of any form.

19.3.2 Comments
Why is SML in onsistent if
an average of

is nite, while MSM is? The reason is that SML is based upon

logarithms of an unbiased simulator (the densities of the observations).

the multinomial probit model as an example, the log-likelihood fun tion is

ln L(, ) =

1X
yi ln pi (, )
n
i=1

To use

306

CHAPTER 19.

The SML version is

SIMULATION-BASED ESTIMATION

ln L(, ) =

1X
yi ln pi (, )
n
i=1

The problem is that

E ln(
pi (, )) 6= ln(E pi (, ))
in spite of the fa t that

E pi (, ) = pi (, )
due to the fa t that
(in the limit) is if

ln()

is a nonlinear transformation. The only way for the two to be equal

tends to innite so that

p ()

tends to

p ().

The reason that MSM does not suer from this problem is that in this ase the unbiased
simulator appears

linearly

within every sum of terms, and it appears within a sum over

(see

equation [19.3). Therefore the SLLN applies to an el out simulation errors, from whi h we
get onsisten y.

That is, using simple notation for the random sampling ase, the moment

onditions

m()

"
#
n
H
1X
1 X
h
K(yt , xt )
k(e
yt , xt ) zt
n
H
i=1
h=1
"
#
n
H
X
1
1X
k(xt , 0 ) + t
[k(xt , ) + ht ] zt
n
H
i=1

(19.5)

(19.6)

h=1

onverge almost surely to

m
() =
(note:

zt


k(x, 0 ) k(x, ) z(x)d(x).

is assume to be made up of fun tions of

xt ).

The obje tive fun tion onverges to

s () = m
() 1
()
m
whi h obviously has a minimum at

0,

hen eforth onsisten y.

If you look at equation 19.6 a bit, you will see why the varian e ination fa tor is

(1+ H1 ).

19.4 E ient method of moments (EMM)


The hoi e of whi h moments upon whi h to base a GMM estimator an have very pronoun ed
ee ts upon the e ien y of the estimator.

A poor hoi e of moment onditions may lead to very ine ient estimators, and an

The drawba k of the above approa h MSM is that the moment onditions used in es-

even ause identi ation problems (as we've seen with the GMM problem set).

timation are sele ted arbitrarily.


low.

The asymptoti e ien y of the estimator may be

19.4.

307

EFFICIENT METHOD OF MOMENTS (EMM)

The asymptoti ally optimal hoi e of moments would be the s ore ve tor of the likelihood
fun tion,

mt () = D ln pt ( | It )
As before, this hoi e is unavailable.
The e ient method of moments (EMM) (see Gallant and Tau hen (1996), Whi h Moments
to Mat h?, ECONOMETRIC THEORY, Vol.

12, 1996, pages 657-681) seeks to provide

moment onditions that losely mimi the s ore ve tor. If the approximation is very good, the
resulting estimator will be very nearly fully e ient.
The DGP is hara terized by random sampling from the density

p(yt |xt , 0 ) pt ( 0 )
We an dene an auxiliary model, alled the s ore generator, whi h simply provides a
(misspe ied) parametri density

f (y|xt , ) ft ()

This density is known up to a parameter

is

We assume that this density fun tion

al ulable. Therefore quasi-ML estimation is possible. Spe i ally,

X
= arg max sn () = 1
ln ft ().

n
t=1

.
D ln f (yt |xt , )

After determining

The important point is that even if the density is misspe ied, there is a pseudo-true

and then marginalized over

: EX EY |X


D ln f (y|x, ) =
0

X
= 1
mn (, )
n

D ln f (y|x, 0 )p(y|x, 0 )dyd(x) = 0

Y |X

We have seen in the se tion on QML that


onditions

is zero:

Z Z

t=1

for whi h the true expe tation, taken with respe t to the true but unknown density of

y, p(y|xt , 0 ),

we an al ulate the s ore fun tions

0 ;

this suggests using the moment

t ()dy
D ln ft ()p

These moment onditions are not al ulable, sin e

pt ()

(19.7)

is not available, but they are

simulable using

n
H
1X 1 X

D ln f (e
yth |xt , )
m
fn (, ) =
n t=1 H
h=1

where

yth

is a draw from

onverges to

DGP (),

holding

xt

xed.

0 ,
m
e ( 0 , 0 ) = 0.

By the LLN and the fa t that

308

CHAPTER 19.

This is not the ase for other values of

assuming that

is identied.

f (yt |xt , ) losely approximates p(y|xt , ), then

The advantage of this pro edure is that if

m
e n (, )

SIMULATION-BASED ESTIMATION

will losely approximate the optimal moment onditions whi h hara terize

maximum likelihood estimation, whi h is fully e ient.

If one has prior information that a ertain density approximates the data well, it would
be a good hoi e for

f ().

If one has no density in mind, there exist good ways of approximating unknown distri-

E onometri a, 1983) and Gallant and Ny hka's

butions parametri ally: Philips' ERA's (

E onometri a, 1987)

SNP density estimator whi h we saw before. Sin e the SNP den-

sity is onsistent, the e ien y of the indire t estimator is the same as the infeasible ML
estimator.

19.4.1 Optimal weighting matrix


I will present the theory for
impra ti al to estimate with

nite, and possibly small. This is done be ause it is sometimes

very large. Gallant and Tau hen give the theory for the ase of

so large that it may be treated as innite (the dieren e being irrelevant given the numeri al

pre ision of a omputer). The theory for the ase of

innite follows dire tly from the results

presented here.
The moment ondition

m(,
e )

depends on the pseudo-ML estimate

We an apply

Theorem 22 to on lude that





d
0
n
N 0, J (0 )1 I(0 )J (0 )1

If the density

were in fa t the true density p(y|xt , ), then


would be the maximum
f (yt|xt , )

likelihood estimator, and

J (0 )1 I(0 )

would be an identity matrix, due to the information

matrix equality. However, in the present ase we assume that


mation to

(19.8)

p(y|xt , ),

Re all that

so there is no an ellation.

J (0 ) p lim

2
0
sn ( )

f (yt |xt , )

is only an approxi-

Comparing the denition of

sn ()

with the

denition of the moment ondition in Equation 19.7, we see that

J (0 ) = D m( 0 , 0 ).
As in Theorem 22,




sn () sn ()
.
I( ) = lim E n
n
0 0
0

In this ase, this is simply the asymptoti varian e ovarian e matrix of the moment onditions,

Now take a rst order Taylor's series approximation to

=
nm
n ( 0 , )

nm
n ( 0 , 0 ) +

nmn ( 0 , )

about



0 + op (1)
0 , 0 )
nD m(

19.4.

309

EFFICIENT METHOD OF MOMENTS (EMM)

First onsider

nm
n ( 0 , 0 ).

It is straightforward but somewhat tedious to show that the

asymptoti varian e of this term is


Next onsider the se ond term

J (0 ),

1
0
H I ( ).

so we have



0 .
0 , 0 )
nD m(

Note that

a.s.

n ( 0 , 0 )
D m





0 = nJ (0 )
0 , a.s.
0 , 0 )
nD m(

But noting equation 19.8






a
0
nJ (0 )
N 0, I(0 )

Now, ombining the results for the rst and se ond terms,



 

1
0
0 a
I( )
nm
n ( , ) N 0, 1 +
H
Suppose that

\
0)
I(

is a onsistent estimator of the asymptoti varian e- ovarian e matrix of

the moment onditions. This may be ompli ated if the s ore generator is a poor approximator,
sin e the individual s ore ontributions may not have mean zero in this ase (see the se tion
on QML) . Even if this is the ase, the individuals means an be al ulated by simulation,
so it is always possible to onsistently estimate

I(0 )

when the model is simulable. On the

other hand, if the s ore generator is taken to be orre tly spe ied, the ordinary estimator of
the information matrix is onsistent.

Combining this with the result on the e ient GMM

weighting matrix in Theorem 25, we see that dening


= arg min mn (, )

as


1

1 \
0

mn (, )
1+
I( )
H

is the GMM estimator with the e ient hoi e of weighting matrix.

If one has used the Gallant-Ny hka ML estimator as the auxiliary model, the appropriate
weighting matrix is simply the information matrix of the auxiliary model, sin e the
s ores are un orrelated. (e.g., it really is ML estimation asymptoti ally, sin e the s ore
generator an approximate the unknown density arbitrarily well).

19.4.2 Asymptoti distribution


Sin e we use the optimal weighting matrix, the asymptoti distribution is as in Equation 15.3,
so we have (using the result in Equation 19.8):

!1


1


1
d

,
n 0 N 0, D 1 +
D
I(0 )
H
where



D = lim E D mn ( 0 , 0 ) .
n

310

CHAPTER 19.

SIMULATION-BASED ESTIMATION

This an be onsistently estimated using

= D mn (,
D

19.4.3 Diagnoti testing


 



1
0 a
0
nmn ( , ) N 0, 1 +
I( )
H

The fa t that

implies that

)

nmn (,
where



1
1
a 2
)

1+
mn (,
(q)
I()
H

q is dim()dim(), sin e without dim() moment onditions the model is not identied,

so testing is impossible. One test of the model is simply based on this statisti : if it ex eeds
the

2 (q)

riti al point, something may be wrong (the small sample performan e of this sort

of test would be a topi worth investigating).

Information about what is wrong an be gotten from the pseudo-t-statisti s:

diag



1
1+
H

1/2 !1

nmn (,
I()

an be used to test whi h moments are not well modeled.

Sin e these moments are

related to parameters of the s ore generator, whi h are usually related to ertain features
of the model, this information an be used to revise the model. These aren't a tually

and nmn (,
)
have dierent distributions
N (0, 1), sin e nmn ( 0 , )

)
is somewhat more ompli ated). It an be shown that the pseudo-t
nmn (,
of

distributed as
(that

statisti s are biased toward nonreje tion. See Gourieroux

et. al.

or Gallant and Long,

1995, for more details.

19.5 Examples
19.5.1 SML of a Poisson model with latent heterogeneity
We have seen (see equation 16.1) that a Poisson model with latent heterogeneity that follows
an exponential distribution leads to the negative binomial model. To illustrate SML, we an
integrate out the latent heterogeneity using Monte Carlo, rather than the analyti al approa h
whi h leads to the negative binomial model. In a tual pra ti e, one would not want to use
SML in this ase, but it is a ni e example sin e it allows us to ompare SML to the a tual
ML estimator. The O tave fun tion dened by PoissonLatentHet.m al ulates the simulated
log likelihood for a Poisson model where

= exp xt + ),

where

N (0, 1).

This model is

similar to the negative binomial model, ex ept that the latent variable is normally distributed
rather than gamma distributed.

The O tave s ript EstimatePoissonLatentHet.m estimates

this model using the MEPS OBDV data that has already been dis ussed. Note that simulated

19.5.

311

EXAMPLES

annealing is used to maximize the log likelihood fun tion. Attempting to use BFGS leads to
trouble. I suspe t that the log likelihood is approximately non-dierentiable in pla es, around
whi h it is very at, though I have not he ked if this is true. If you run this s ript, you will
see that it takes a long time to get the estimation results, whi h are:

******************************************************
Poisson Latent Heterogeneity model, SML estimation, MEPS 1996 full data set
MLE Estimation Results
BFGS onvergen e: Max. iters. ex eeded
Average Log-L: -2.171826
Observations: 4564

onstant
pub. ins.
priv. ins.
sex
age
edu
in
lnalpha

estimate
-1.592
1.189
0.655
0.615
0.018
0.024
-0.000
0.203

st. err
0.146
0.068
0.065
0.044
0.002
0.010
0.000
0.014

t-stat
-10.892
17.425
10.124
13.888
10.865
2.523
-0.531
14.036

p-value
0.000
0.000
0.000
0.000
0.000
0.012
0.596
0.000

Information Criteria
CAIC : 19899.8396
Avg. CAIC: 4.3602
BIC : 19891.8396
Avg. BIC:
4.3584
AIC : 19840.4320
Avg. AIC:
4.3472
******************************************************
o tave:3>

If you ompare these results to the results for the negative binomial model, given in subse tion
(16.2.1), you an see that the present model ts better a ording to the CAIC riterion. The
present model is onsiderably less onvenient to work with, however, due to the omputational
requirements. The hapter on parallel omputing is relevant if you wish to use models of this
sort.

19.5.2 SMM
To be added in future:

do SMM using un onditional moments for SV model ( ompare to

Andersen et al and others)

312

CHAPTER 19.

SIMULATION-BASED ESTIMATION

19.5.3 SNM
To be added.

19.5.4 EMM estimation of a dis rete hoi e model


In this se tion onsider EMM estimation.

There is a sophisti ated pa kage by Gallant and

Tau hen for this, but here we'll look at some simple, but hopefully dida ti ode.

The le

probitdgp.m generates data that follows the probit model. The le emm_moments.m denes
EMM moment onditions, where the DGP and s ore generator an be passed as arguments.
Thus, it is a general purpose moment ondition for EMM estimation. This le is interesting
enough to warrant some dis ussion. A listing appears in Listing 19.1. Line 3 denes the DGP,
and the arguments needed to evaluate it are dened in line 4. The s ore generator is dened
in line 5, and its arguments are dened in line 6. The QML estimate of the parameter of the
s ore generator is read in line 7. Note in line 10 how the random draws needed to simulate
data are passed with the data, and are thus xed during estimation, to avoid  hattering.
The simulated data is generated in line 16, and the derivative of the s ore generator using the
simulated data is al ulated in line 18. In line 20 we average the s ores of the s ore generator,
whi h are the moment onditions that the fun tion returns.
1
2
3
4
5
6
7
8
9
10

11
12
13
14
15
16
17
18
19
20
21

fun tion s ores = emm_moments(theta, data, momentargs)


k = momentargs{1};
dgp = momentargs{2}; # the data generating pro ess (DGP)
dgpargs = momentargs{3}; # its arguments ( ell array)
sg = momentargs{4}; # the s ore generator (SG)
sgargs = momentargs{5}; # SG arguments ( ell array)
phi = momentargs{6}; # QML estimate of SG parameter
y = data(:,1);
x = data(:,2:k+1);
rand_draws = data(:,k+2: olumns(data)); # passed with data to ensure fixed a ross
iterations
n = rows(y);
s ores = zeros(n,rows(phi)); # ontainer for moment ontributions
reps = olumns(rand_draws); # how many simulations?
for i = 1:reps
e = rand_draws(:,i);
y = feval(dgp, theta, x, e, dgpargs); # simulated data
sgdata = [y x; # simulated data for SG
s ores = s ores + numgradient(sg, {phi, sgdata, sgargs}); # gradient of SG
endfor
s ores = s ores / reps; # average over number of simulations
endfun tion
Listing 19.1: emm_moments.m
The le emm_example.m performs EMM estimation of the probit model, using a logit
model as the s ore generator. The results we obtain are

19.5.

313

EXAMPLES

S ore generator results:


=====================================================
BFGSMIN final results
Used analyti gradient
-----------------------------------------------------STRONG CONVERGENCE
Fun tion onv 1 Param onv 1 Gradient onv 1
-----------------------------------------------------Obje tive fun tion value 0.281571
Stepsize 0.0279
15 iterations
-----------------------------------------------------param
1.8979
1.6648
1.9125
1.8875
1.7433

gradient
0.0000
-0.0000
-0.0000
-0.0000
-0.0000

hange
0.0000
0.0000
0.0000
0.0000
0.0000

======================================================
Model results:
******************************************************
EMM example
GMM Estimation Results
BFGS onvergen e: Normal onvergen e
Obje tive fun tion value: 0.000000
Observations: 1000
Exa tly identified, no spe . test
estimate
st. err
t-stat
p-value
p1
1.069
0.022
47.618
0.000
p2
0.935
0.022
42.240
0.000
p3
1.085
0.022
49.630
0.000
p4
1.080
0.022
49.047
0.000
p5
0.978
0.023
41.643
0.000
******************************************************

It might be interesting to ompare the standard errors with those obtained from ML
estimation, to he k e ien y of the EMM estimator.
study.

One ould even do a Monte Carlo

314

CHAPTER 19.

SIMULATION-BASED ESTIMATION

19.6 Exer ises


1. (basi ) Examine the O tave s ript and fun tion dis ussed in subse tion 19.5.1 and des ribe what they do.
2. (basi ) Examine the O tave s ripts and fun tions dis ussed in subse tion 19.5.4 and
des ribe what they do.
3. (advan ed, but even if you don't do this you should be able to des ribe what needs to be
done) Write O tave ode to do SML estimation of the probit model. Do an estimation
using data generated by a probit model ( probitdgp.m might be helpful). Compare the
SML estimates to ML estimates.
4. (more advan ed) Do a little Monte Carlo study to ompare ML, SML and EMM estimation of the probit model. Investigate how the number of simulations ae t the two
simulation-based estimators.

Chapter 20
Parallel programming for e onometri s
The following borrows heavily from Creel (2005).
Parallel omputing an oer an important redu tion in the time to omplete omputations.
This is well-known, but it bears emphasis sin e it is the main reason that parallel omputing
may be attra tive to users. To illustrate, the Intel Pentium IV (Willamette) pro essor, running
at 1.5GHz, was introdu ed in November of 2000. The Pentium IV (Northwood-HT) pro essor,
running at 3.06GHz, was introdu ed in November of 2002. An approximate doubling of the
performan e of a ommodity CPU took pla e in two years.

Extrapolating this admittedly

rough snapshot of the evolution of the performan e of ommodity pro essors, one would need
to wait more than 6.6 years and then pur hase a new omputer to obtain a 10-fold improvement
in omputational performan e. The examples in this hapter show that a 10-fold improvement
in performan e an be a hieved immediately, using distributed parallel omputing on available
omputers.
Re ent (this is written in 2005) developments that may make parallel omputing attra tive
to a broader spe trum of resear hers who do omputations. The rst is the fa t that setting
up a luster of omputers for distributed parallel omputing is not di ult. If you are using
the ParallelKnoppix bootable CD that a ompanies these notes, you are less than 10 minutes
away from reating a luster, supposing you have a se ond omputer at hand and a rossover
ethernet able. See the ParallelKnoppix tutorial. A se ond development is the existen e of
extensions to some of the high-level matrix programming (HLMP) languages

that allow the

in orporation of parallelism into programs written in these languages. A third is the spread
of dual and quad- ore CPUs, so that an ordinary desktop or laptop omputer an be made
into a mini- luster. Those ores won't work together on a single problem unless they are told
how to.
Following are examples of parallel implementations of several mainstream problems in
e onometri s. A fo us of the examples is on the possibility of hiding parallelization from end
users of programs. If programs that run in parallel have an interfa e that is nearly identi al
to the interfa e of equivalent serial versions, end users will nd it easy to take advantage
of parallel omputing's performan e.
1

We ontinue to use O tave, taking advantage of the

By high-level matrix programming language I mean languages su h as MATLAB (TM the Mathworks,

In .), Ox (TM OxMetri s Te hnologies, Ltd.), and GNU O tave (www.o tave.org), for example.

315

316

CHAPTER 20.

PARALLEL PROGRAMMING FOR ECONOMETRICS

MPI Toolbox (MPITB) for O tave, by by Fernndez Baldomero

et al.

(2004).

There are

also parallel pa kages for Ox, R, and Python whi h may be of interest to e onometri ians,
but as of this writing, the following examples are the most a essible introdu tion to parallel
programming for e onometri ians.

20.1 Example problems


This se tion introdu es example problems from e onometri s, and shows how they an be
parallelized in a natural way.

20.1.1 Monte Carlo


A Monte Carlo study involves repeating a random experiment many times under identi al
onditions. Several authors have noted that Monte Carlo studies are obvious andidates for
parallelization (Doornik

et al.

2002; Bru he, 2003) sin e blo ks of repli ations an be done

independently on dierent omputers. To illustrate the parallelization of a Monte Carlo study,


we use same tra e test example as do Doornik,

et. al.

(2002). tra etest.m is a fun tion that

al ulates the tra e test statisti for the la k of ointegration of integrated time series. This
fun tion is illustrative of the format that we adopt for Monte Carlo simulation of a fun tion:
it re eives a single argument of ell type, and it returns a row ve tor that holds the results of
one random simulation. The single argument in this ase is a ell array that holds the length
of the series in its rst position, and the number of series in the se ond position. It generates
a random result though a pro ess that is internal to the fun tion, and it reports some output
in a row ve tor (in this ase the result is a s alar).
m _example1.m is an O tave s ript that exe utes a Monte Carlo study of the tra e test by
repeatedly evaluating the

tra etest.m

is that lines 7 and 10 all the fun tion


in line 7,

monte arlo.m

fun tion. The main thing to noti e about this s ript

monte arlo.m.

When alled with 3 arguments, as

exe utes serially on the omputer it is alled from. In line 10, there

is a fourth argument. When alled with four arguments, the last argument is the number of
slave hosts to use. We see that running the Monte Carlo study on one or more pro essors is
transparent to the user - he or she must only indi ate the number of slave omputers to be
used.

20.1.2 ML
For a sample

{(yt , xt )}n

of

n observations of a set of dependent

maximum likelihood estimator of the parameter

an be dened as

= arg max sn ()
where

sn () =

and explanatory variables, the

1X
ln f (yt |xt , )
n t=1

20.1.

EXAMPLE PROBLEMS

Here,

yt

317

may be a ve tor of random variables, and the model may be dynami sin e

ontain lags of

yt .

xt

may

As Swann (2002) points out, this an be broken into sums over blo ks of

observations, for example two blo ks:

1
sn () =
n

n1
X
t=1

Analogously, we an dene up to

ln f (yt |xt , )

n blo ks.

n
X

t=n1 +1

!)

ln f (yt |xt , )

Again following Swann, parallelization an be done

by al ulating ea h blo k on separate omputers.


mle_example1.m is an O tave s ript that al ulates the maximum likelihood estimator of
the parameter ve tor of a model that assumes that the dependent variable is distributed as a
Poisson random variable, onditional on some explanatory variables. In lines 1-3 the data is

model, and the initial value


mle_estimate performs ordinary serial

read, the name of the density fun tion is provided in the variable
of the parameter ve tor is set. In line 5, the fun tion

al ulation of the ML estimator, while in line 7 the same fun tion is alled with 6 arguments.
The fourth and fth arguments are empty pla eholders where options to

mle_estimate may be

set, while the sixth argument is the number of slave omputers to use for parallel exe ution,
1 in this ase.

A person who runs the program sees no parallel programming ode - the

parallelization is transparent to the end user, beyond having to sele t the number of slave

theta_p,

whi h

It is worth noting that a dierent likelihood fun tion may be used by making the

model

omputers. When exe uted, this s ript prints out the estimates

theta_s

and

are identi al.

variable point to a dierent fun tion.


fun tion that is not parallelized. The

The likelihood fun tion itself is an ordinary O tave

mle_estimate

fun tion is a generi fun tion that an

all any likelihood fun tion that has the appropriate input/output syntax for evaluation either
serially or in parallel.

Users need only learn how to write the likelihood fun tion using the

O tave language.

20.1.3 GMM
For a sample as above, the GMM estimator of the parameter

arg min sn ()

where

sn () = mn () Wn mn ()
and

mn () =

1X
mt (yt |xt , )
n t=1

an be dened as

318

CHAPTER 20.

Sin e

PARALLEL PROGRAMMING FOR ECONOMETRICS

mn () is an average, it an obviously be omputed blo kwise, using for example 2 blo ks:
1
mn () =
n

n1
X
t=1

Likewise, we may dene up to

mt (yt |xt , )

n
X

t=n1 +1

!)

mt (yt |xt , )

(20.1)

blo ks, ea h of whi h ould potentially be omputed on a

dierent ma hine.
gmm_example1.m is a s ript that illustrates how GMM estimation may be done serially
or in parallel. When this is run,

theta_s

onvergen e of the minimization routine.

and

theta_p

are identi al up to the toleran e for

The point to noti e here is that an end user an

perform the estimation in parallel in virtually the same way as it is done serially.

Again,

gmm_estimate,

used in lines 8 and 10, is a generi fun tion that will estimate any model

spe ied by the

moments

of the

moments

variable - a dierent model an be estimated by hanging the value

variable. The fun tion that

moments

points to is an ordinary O tave fun tion

that uses no parallel programming, so users an write their models using the simple and
intuitive HLMP syntax of O tave. Whether estimation is done in parallel or serially depends
only the seventh argument to

gmm_estimate

- when it is missing or zero, estimation is by

default done serially with one pro essor. When it is positive, it spe ies the number of slave
nodes to use.

20.1.4 Kernel regression


The Nadaraya-Watson kernel regression estimator of a fun tion

g(x)

at a point

is

Pn
yt K [(x xt ) /n ]
Pt=1
n
t=1 K [(x xt ) /n ]
n
X
wt yy

g(x) =

t=1

We see that the weight depends upon every data point in the sample.
at every point in a sample of size

n,

To al ulate the t

2
on the order of n k al ulations must be done, where

is the dimension of the ve tor of explanatory variables,

x.

Ra ine (2002) demonstrates that

MPI parallelization an be used to speed up al ulation of the kernel regression estimator


by al ulating the ts for portions of the sample on dierent omputers.

We follow this

implementation here. kernel_example1.m is a s ript for serial and parallel kernel regression.
Serial exe ution is obtained by setting the number of slaves equal to zero, in line 15. In line
17, a single slave is spe ied, so exe ution is in parallel on the master and slave nodes.
The example programs show that parallelization may be mostly hidden from end users.
Users an benet from parallelization without having to write or understand parallel ode.
The speedups one an obtain are highly dependent upon the spe i problem at hand, as well
as the size of the luster, the e ien y of the network,

et .

Some examples of speedups are

presented in Creel (2005). Figure 20.1 reprodu es speedups for some e onometri problems on
a luster of 12 desktop omputers. The speedup for

nodes is the time to nish the problem

20.1.

319

EXAMPLE PROBLEMS

Figure 20.1: Speedups from parallelization


11
10

MONTECARLO
BOOTSTRAP
MLE
GMM
KERNEL

9
8
7
6
5
4
3
2
1
2

10

12

nodes

on a single node divided by the time to nish the problem on

nodes. Note that you an get

10X speedups, as laimed in the introdu tion. It's pretty obvious that mu h greater speedups
ould be obtained using a larger luster, for the embarrassingly parallel problems.

320

CHAPTER 20.

PARALLEL PROGRAMMING FOR ECONOMETRICS

Bibliography
[1 Bru he, M. (2003) A note on embarassingly parallel omputation using OpenMosix and
Ox, working paper, Finan ial Markets Group, London S hool of E onomi s.
[2 Creel, M. (2005) User-friendly parallel omputations with e onometri examples,

putational E onomi s, V. 26, pp. 107-128.

Com-

[3 Doornik, J.A., D.F. Hendry and N. Shephard (2002) Computationally-intensive e onometri s using a distributed matrix-programming language,

the Royal So iety of London, Series A, 360, 1245-1266.

Philosophi al Transa tions of

[4 Fernndez Baldomero, J. (2004) LAM/MPI parallel omputing under GNU O tave,

at .ugr.es/javier-bin/mpitb.
[5 Ra ine, Je (2002) Parallel distributed kernel estimation,

Data Analysis, 40, 293-302.

Computational Statisti s &

[6 Swann, C.A. (2002) Maximum likelihood estimation using parallel omputing: an introdu tion to MPI,

Computational E onomi s, 19, 145-178.

321

322

BIBLIOGRAPHY

Chapter 21
Final proje t: e onometri estimation
of a RBC model
THIS IS NOT FINISHED - IGNORE IT FOR NOW
In this last hapter we'll go through a worked example that ombines a number of the
topi s we've seen. We'll do simulated method of moments estimation of a real business y le
model, similar to what Valderrama (2002) does.

21.1 Data
We'll develop a model for private onsumption and real gross private investment. The data
are obtained from the US Bureau of E onomi Analysis (BEA) National In ome and Produ t
A ounts (NIPA), Table 11.1.5, Lines 2 and 6 (you an download quarterly data from 1947-I to
the present). The data we use are in the le rb _data.m. This data is real ( onstant dollars).
The program plots.m will make a few plots, in luding Figures 21.1 though 21.3.

First

looking at the plot for levels, we an see that real onsumption and investment are learly
nonstationary (surprise, surprise). There appears to be somewhat of a stru tural hange in
the mid-1970's.
Looking at growth rates, the series for onsumption has an extended period of high growth in
the 1970's, be oming more moderate in the 90's. The volatility of growth of onsumption has
de lined somewhat, over time. Looking at investment, there are some notable periods of high
volatility in the mid-1970's and early 1980's, for example. Sin e 1990 or so, volatility seems
to have de lined.
E onomi models for growth often imply that there is no long term growth (!) - the data
that the models generate is stationary and ergodi .

Or, the data that the models generate

needs to be passed through the inverse of a lter. We'll follow this, and generate stationary
business y le data by applying the bandpass lter of Christiano and Fitzgerald (1999). The
ltered data is in Figure 21.3. We'll try to spe ify an e onomi model that an generate similar

Figure 21.1: Consumption and Investment, Levels

323

324CHAPTER 21. FINAL PROJECT: ECONOMETRIC ESTIMATION OF A RBC MODEL

Figure 21.2: Consumption and Investment, Growth Rates

Figure 21.3: Consumption and Investment, Bandpass Filtered

data. To get data that look like the levels for onsumption and investment, we'd need to apply
the inverse of the bandpass lter.

21.2 An RBC Model


Consider a very simple sto hasti growth model (the same used by Maliar and Maliar (2003),
with minor notational dieren e):

max{ct ,kt }
E0
t=0

t=0

t U (c )
t

(1 ) kt1 + t kt1

ct + kt

log t

log t1 + t

IIN (0, 2 )

Assume that the utility fun tion is

U (ct ) =

c1
1
t
1

is the dis ount rate

is the depre iation rate of apital

is the elasti ity of output with respe t to apital

is a te hnology sho k that is positive.

is the oe ient of relative risk aversion. When

is observed in period

= 1,

t.

the utility fun tion is logarith-

mi .

gross investment,

it ,

is the hange in the apital sto k:

it = kt (1 ) kt1

we assume that the initial ondition

We would like to estimate the parameters

(k0 , 0 )

is given.

= , , , , , 2

using the data that we have

on onsumption and investment. This problem is very similar to the GMM estimation of the
portfolio model dis ussed in Se tions 15.12 and 15.13. On e an derive the Euler ondition in
the same way we did there, and use it to dene a GMM estimator. That approa h was not
very su essful, re all. Now we'll try to use some more informative moment onditions to see
if we get better results.

21.3.

325

A REDUCED FORM MODEL

21.3 A redu ed form model


Ma roe onomi time series data are often modeled using ve tor autoregressions.
autogression is just the ve tor version of an autoregressive model.

Let

yt

be a

A ve tor

G-ve tor

of

jointly dependent variables. A VAR(p) model is

yt = c + A1 yt1 + A2 yt2 + ... + Ap ytp + vt


where

c is a G-ve tor of parameters, and Aj ,

vt = Rt t ,

where

t IIN (0, I2 ),

and

Rt

j=1,2,...,p, are

GG matri es of parameters.

is upper triangular. So

Let

V (vt |yt1 , ...ytp ) = Rt Rt .

You an think of a VAR model as the redu ed form of a dynami linear simultaneous equations
model where all of the variables are treated as endogenous. Clearly, if all of the variables are
endogenous, one would need some form of additional information to identify a stru tural
model.

But we already have a stru tural model, and we're only going to use the VAR to

help us estimate the parameters. A well-tting redu ed form model will be adequate for the
purpose.
We're seen that our data seems to have episodes where the varian e of growth rates and
ltered data is non- onstant. This brings us to the general area of sto hasti volatility. Without going into details, we'll just onsider the exponential GARCH model of Nelson (1991) as
presented in Hamilton (1994, pg. 668-669).
Dene
is a

31

ht = vec (Rt ),

the ve tor of elements in the upper triangle of

Rt

(in our ase this

ve tor). We assume that the elements follow

n
o
p
log hjt = j + P(j,.) |vt1 | 2/ + (j,.) vt1 + G(j,.) log ht1

The varian e of the VAR error depends upon its own past, as well as upon the past realizations
of the sho ks.

This is an EGARCH(1,1) spe i ation. The obvious generalization is the EGARCH(r, m)

The advantage of the EGARCH formulation is that the varian e is assuredly positive

The matrix

has dimension

3 2.

The matrix

has dimension

3 3.

The matrix

The parameter matrix

spe i ation, with longer lags (r for lags of

v, m

for lags of

h).

without parameter restri tions

(reminder to self: this is an aleph) has dimension

allows for

leverage,

2 2.

so that positive and negative sho ks an

have asymmetri ee ts upon volatility.

We will probably want to restri t these parameter matri es in some way. For instan e,

ould plausibly be diagonal.

326CHAPTER 21. FINAL PROJECT: ECONOMETRIC ESTIMATION OF A RBC MODEL

With the above spe i ation, we have

t IIN (0, I2 )

t = R1
t vt
and we know how to al ulate

Rt

and

vt ,

given the data and the parameters.

Thus, it is

straighforward to do estimation by maximum likelihood. This will be the s ore generator.

21.4 Results (I): The s ore generator


21.5 Solving the stru tural model
The rst order ondition for the stru tural model is




1
c
1

k
c
=
E
t+1 t
t
t
t+1

or

h
n
io 1

1
1

k
ct = Et c
t+1
t
t+1

The problem is that we annot solve for

ct sin e we do not know the solution for the expe tation

in the previous equation.


The parameterized expe tations algorithm (PEA: den Haan and Mar et, 1990), is a means
of solving the problem. The expe tations term is repla ed by a parametri fun tion. As long
as the parametri fun tion is a exible enough fun tion of variables that have been realized
in period

t,

there exist parameter values that make the approximation as lose to the true

expe tation as is desired. We will write the approximation

h
i
1
1

k
exp (0 + 1 log t + 2 log kt1 )
Et c
t+1 t
t+1

For given values of the parameters of this approximating fun tion, we an solve for
then for

kt

ct ,

and

using the restri tion that

ct + kt = (1 ) kt1 + t kt1
This allows us to generate a series
by tting

{(ct , kt )}.

Then the expe tations approximation is updated


1
c
= exp (0 + 1 log t + 2 log kt1 ) + t
t+1 1 + t+1 kt

by nonlinear least squares. The 2 step pro edure of generating data and updating the parameters of the approximation to expe tations is iterated until the parameters no longer hange.
When this is the ase, the expe tations fun tion is the best t to the generated data. As long
it is a ri h enough parametri model to en ompass the true expe tations fun tion, it an be
made to be equal to the true expe tations fun tion by using a long enough simulation.
Thus, given the parameters of the stru tural model,

= , , , , , 2

, we an generate

21.5.

data

327

SOLVING THE STRUCTURAL MODEL

{(ct , kt )}

(1 ) kt1 .

using the PEA. From this we an get the series

{(ct , it )}

using

it = kt

This an be used to do EMM estimation using the s ores of the redu ed form

model to dene moments, using the simulated data from the stru tural model.

328CHAPTER 21. FINAL PROJECT: ECONOMETRIC ESTIMATION OF A RBC MODEL

Bibliography
[1 Creel. M (2005) A Note on Parallelizing the Parameterized Expe tations Algorithm.
[2 den Haan, W. and Mar et, A. (1990) Solving the sto hasti growth model by parameterized expe tations,

Journal of Business and E onomi s Statisti s, 8, 31-34.

[3 Hamilton, J. (1994)

Time Series Analysis, Prin eton Univ. Press

[4 Maliar, L. and Maliar, S. (2003) Matlab ode for Solving a Neo lassi al Growh Model
with a Parametrized Expe tations Algorithm and Moving Bounds
[5 Nelson, D. (1991) Conditional heteros edasti ity is asset returns: a new approa h,

metri a, 59, 347-70.


[6 Valderrama, D. (2002) Statisti al nonlinearities in the business y le:

E ono-

a hallenge for

the anoni al RBC model, E onomi Resear h, Federal Reserve Bank of San Fran is o.

http://ideas.repe .org/p/fip/fedfap/2002-13.html

329

330

BIBLIOGRAPHY

Chapter 22
Introdu tion to O tave
Why is O tave being used here, sin e it's not that well-known by e onometri ians?

Well,

be ause it is a high quality environment that is easily extensible, uses well-tested and high
performan e numeri al libraries, it is li ensed under the GNU GPL, so you an get it for free
and modify it if you like, and it runs on both GNU/Linux, Ma OSX and Windows systems.
It's also quite easy to learn.

22.1 Getting started


Get the ParallelKnoppix CD, as was des ribed in Se tion 1.3. Then burn the image, and boot
your omputer with it.

This will give you this same PDF le, but with all of the example

programs ready to run. The editor is ongure with a ma ro to exe ute the programs using
O tave, whi h is of ourse installed. From this point, I assume you are running the CD (or
sitting in the omputer room a ross the hall from my o e), or that you have ongured your
omputer to be able to run the

*.m

les mentioned below.

22.2 A short introdu tion


The obje tive of this introdu tion is to learn just the basi s of O tave. There are other ways
to use O tave, whi h I en ourage you to explore. These are just some rudiments. After this,
you an look at the example programs s attered throughout the do ument (and edit them,
and run them) to learn more about how O tave an be used to do e onometri s.

Students

of mine: your problem sets will in lude exer ises that an be done by modifying the example
programs in relatively minor ways. So study the examples!
O tave an be used intera tively, or it an be used to run programs that are written using a
text editor. We'll use this se ond method, preparing programs with NEdit, and alling O tave
from within the editor.

The program rst.m gets us started.

NEdit (by nding the orre t le inside the

To run this, open it up with

/home/knoppix/Desktop/E onometri s

folder

and li king on the i on) and then type CTRL-ALT-o, or use the O tave item in the Shell
menu (see Figure 22.1).

331

332

CHAPTER 22.

INTRODUCTION TO OCTAVE

Figure 22.1: Running an O tave program

22.3.

333

IF YOU'RE RUNNING A LINUX INSTALLATION...

printf() doesn't
reads  printf(hello

Note that the output is not formatted in a pleasing way. That's be ause
automati ally start a new line.

world\n);

Edit

first.m

so that the 8th line

and re-run the program.

We need to know how to load and save data. The program se ond.m shows how. On e you
have run this, you will nd the le  x in the dire tory

E onometri s/Examples/O taveIntro/

You might have a look at it with NEdit to see O tave's default format for saving data. Basi ally,
if you have data in an ASCII text le, named for example  myfile.data, formed of numbers
separated by spa es, just use the ommand  load

myfile.data.

After having done so, the

matrix  myfile (without extension) will ontain the data.


Please have a look at CommonOperations.m for examples of how to do some basi things
in O tave. Now that we're done with the basi s, have a look at the O tave programs that are
in luded as examples. If you are looking at the browsable PDF version of this do ument, then
you should be able to li k on links to open them. If not, the example programs are available
here and the support les needed to run these are available here. Those pages will allow you
to examine individual les, out of ontext. To a tually use these les (edit and run them),
you should go to the home page of this do ument, sin e you will probably want to download
the pdf version together with all the support les and examples. Or get the bootable CD.
There are some other resour es for doing e onometri s with O tave.
he k the arti le E onometri s with O tave

You might like to

and the E onometri s Toolbox , whi h is for

Matlab, but mu h of whi h ould be easily used with O tave.

22.3 If you're running a Linux installation...


Then to get the same behavior as found on the CD, you need to:

Get the olle tion of support programs and the examples, from the do ument home page.

Put them somewhere, and tell O tave how to nd them, e.g., by putting a link to the

MyO taveFiles dire tory in

/usr/lo al/share/o tave/site-m

Make sure nedit is installed and ongured to run O tave and use syntax highlighting.
Copy the le

/home/e onometri s/.nedit

from the CD to do this.

Or, get the le

NeditConguration and save it in your $HOME dire tory with the name  .nedit. Not
to put too ne a point on it, please note that there is a period in that name.

Asso iate

*.m

les with NEdit so that they open up in the editor when you li k on

them. That should do it.

334

CHAPTER 22.

INTRODUCTION TO OCTAVE

Chapter 23
Notation and Review

All ve tors will be olumn ve tors, unless they have a transpose symbol (or I forget to
apply this rule - your help at hing typos and er0rors is mu h appre iated). For example,
if

xt

is a

ve tor.

p1

ve tor,

xt

is a

1p

ve tor. When I refer to a

p-ve tor,

I mean a olumn

23.1 Notation for dierentiation of ve tors and matri es


[3, Chapter 1
Let

s() : p

be a real valued fun tion of the

p-ve tor,

Following this onvention,

s()
is a

s()
=

1p

2 s()

s()
1
s()
2
.
.
.

s()
p

ve tor, and

s()

f ():p n

valued transpose of

Produ t rule:
p-ve tor .

be a

n-ve tor

. Then

Let

Then

f ():p n

and

has dimension

1 p.

2 s()
is a

a x
x

pp

s()

= a.
p-ve tor .

h():p n


be

+f

n-ve tor


Applying the transposition rule we get

h() f () =

matrix. Also,

f ().

h() f () = h








f h+
h f

335

s()
is organized as a

valued fun tion of the

=
f ()

Then

Exer ise 33 For a and x both p-ve tors, show that


Let

p-ve tor .

Let

f ()

be the

1n

valued fun tions of the

336

CHAPTER 23.

whi h has dimension

NOTATION AND REVIEW

p 1.

Exer ise 34 For A a p p matrix and x a p 1 ve tor, show that

Chain rule :

Let

r
let g():

p be a

has dimension

f ():p n
p-ve tor

n r.

n-ve tor

x Ax
x

valued fun tion of a

valued fun tion of an

r -ve tor

= A + A .

p-ve tor

argument, and

valued argument

Then

f [g ()] =
f ()
g()

=g()

Exer ise 35 For x and both p 1 ve tors, show that

exp(x )

= exp(x )x.

23.2 Convergenge modes


Readings:

[1, Chapter 4;[4, Chapter 4.

We will onsider several modes of onvergen e. The rst three modes dis ussed are simply
for ba kground. The sto hasti modes are those whi h will be used later in the ourse.

Denition 36 A sequen e is a mapping from the natural numbers {1, 2, ...} = {n}
n=1 = {n}

to some other set, so that the set is ordered a ording to the natural numbers asso iated with
its elements.

Real-valued sequen es:


A real-valued sequen e of ve tors {an } onverges to the ve tor
a if for any > 0 there exists an integer N su h that for all n > N , k an a k< . a is the
limit of an , written an a.
Denition 37

[Convergen e

Deterministi real-valued fun tions


Consider a sequen e of fun tions

{fn ()}

where

fn : T .

may be an arbitrary set.

A sequen e of fun tions {fn ()} onverges pointwise


on to the fun tion f () if for all > 0 and there exists an integer N su h that
Denition 38

[Pointwise onvergen e

|fn () f ()| < , n > N .


It's important to note that
rapid for ertain
throughout

depends upon

so that onverge may be mu h more

than for others. Uniform onvergen e requires a similar rate of onvergen e

23.2.

337

CONVERGENGE MODES

A sequen e of fun tions {fn ()} onverges


on to the fun tion f () if for any > 0 there exists an integer N su h that
Denition 39

[Uniform onvergen e

uniformly

sup |fn () f ()| < , n > N.

(insert a diagram here showing the envelope around f () in whi h fn() must lie)

Sto hasti sequen es


In e onometri s, we typi ally deal with sto hasti sequen es.

(, F, P ) ,

Given a probability spa e

re all that a random variable maps the sample spa e to the real line

A sequen e of random variables

{Xn ()}

, X() :
i.e., ea h

i.e.

is a olle tion of su h mappings,

Xn () is a random variable with respe t to the probability spa e (, F, P ) . For example, given
0
n = (X X)1 X Y, where n is the sample size,
the model Y = X + , the OLS estimator
n }. A number of modes of onvergen e are
an be used to form a sequen e of random ve tors {
in use when dealing with sequen es of random variables. Several su h modes of onvergen e
should already be familiar:

Let Xn () be a sequen e of random variables, and


let X() be a random variable. Let An = { : |Xn () X()| > }. Then {Xn ()} onverges
in probability to X() if

Denition 40

[Convergen e in probability

lim P (An ) = 0, > 0.

Convergen e in probability is written as

Xn X,

or plim

Xn = X.

Let Xn () be a sequen e of random variables, and


let X() be a random variable. Let A = { : limn Xn () = X()}. Then {Xn ()}
onverges almost surely to X() if
Denition 41

[Almost sure onvergen e

P (A) = 1.
In other words,
set

C = A

Xn X, a.s.

Xn () X()

su h that

(ordinary onvergen e of the two fun tions) ex ept on a

P (C) = 0.

Almost sure onvergen e is written as

One an show that

a.s.

Xn X,

or

a.s.

Xn X Xn X.

Let the r.v. Xn have distribution fun tion Fn


and the r.v. Xn have distribution fun tion F. If Fn F at every ontinuity point of F, then
Xn onverges in distribution to X.
Denition 42

[Convergen e in distribution

Convergen e in distribution is written as

Xn X.

probability implies onvergen e in distribution.

It an be shown that onvergen e in

338

CHAPTER 23.

NOTATION AND REVIEW

Sto hasti fun tions


Simple laws of large numbers (LLN's) allow us to dire tly on lude that
example, sin e

n = 0 +
and

X X
n

1 

X
n

a.s.
n 0

in the OLS

a.s.

X
n

by a SLLN. Note that this term is not a fun tion of the parameter

This

easy proof is a result of the linearity of the model, whi h allows us to express the estimator
in a way that separates parameters from random fun tions. In general, this is not possible.
We often deal with the more ompli ated situation where the sto hasti sequen e depends on
parameters in a manner that is not redu ible to a simple sequen e of random variables. In
this ase, we have a sequen e of random fun tions that depend on

Xn (, ) is a random

variable with respe t to a probability spa e

belongs to a parameter spa e

Denition 43

[Uniform almost sure onvergen e

surely in to X(, ) if

{Xn (, )}

: {Xn (, )},

(, F, P )

where ea h

and the parameter

onverges uniformly almost

lim sup |Xn (, ) X(, )| = 0, (a.s.)

Impli it is the assumption that all

(, F, P )

for all

and

X(, )

are random variables w.r.t.

We'll indi ate uniform almost sure onvergen e by

onvergen e in probability by

Xn (, )

u.a.s.

u.p.

and uniform

An equivalent denition, based on the fa t that almost sure means with probability
one is

Pr


lim sup |Xn (, ) X(, )| = 0 = 1

This has a form similar to that of the denition of a.s.


dieren e is the addition of the

onvergen e - the essential

sup.

23.3 Rates of onvergen e and asymptoti equality


It's often useful to have notation for the relative magnitudes of quantities. Quantities that are
small relative to others an often be ignored, whi h simplies analysis.

Denition 44
o(g(n))

means

[Little-o

Let f (n) and g(n) be two real-valued fun tions. The notation f (n) =

(n)
limn fg(n)

= 0.

Let f (n) and g(n) be two real-valued fun tions.


The notation f (n) =

f (n)
O(g(n)) means there exists some N su h that for n > N, g(n) < K, where K is a nite
onstant.
Denition 45

[Big-O

This denition doesn't require that


If

{fn }

and

{gn }

f (n)
g(n) have a limit (it may u tuate boundedly).

are sequen es of random variables analogous denitions are

23.3.

339

RATES OF CONVERGENCE AND ASYMPTOTIC EQUALITY

Denition 46 The notation f (n) = op (g(n)) means

f (n) p
g(n)

0.


Example 47 The least squares estimator = (X X)1 X Y = (X X)1 X X 0 + = 0 +

(X X)1 X . Sin e plim (X

X)1 X

= 0, we an write (X X)1 X = op (1) and = 0 +op (1).

Asymptoti ally, the term op (1) is negligible. This is just a way of indi ating that the LS
estimator is onsistent.

Denition 48 The notation f (n) = Op (g(n)) means there exists some N su h that for > 0

and all n > N ,

where K is a nite onstant.




f (n)


P
< K > 1 ,
g(n)

Example 49 If Xn N (0, 1) then Xn = Op (1), sin e, given , there is always some K su h

that P (|Xn | < K ) > 1 .


Useful rules:

Op (np )Op (nq ) = Op (np+q )


op (np )op (nq ) = op (np+q )

Example 50 Consider a random sample of iid r.v.'s with mean 0 and varian e 2 . The
P

A
estimator of the mean = 1/n ni=1 xi is asymptoti ally normally distributed, e.g., n1/2
N (0, 2 ). So n1/2 = Op (1), so = Op (n1/2 ). Before we had = op (1), now we have have
the stronger result that relates the rate of onvergen e to the sample size.

Example 51 Now onsider a random sample of iid r.v.'s with mean and varian e 2 .
P

The estimator
 of the mean = 1/n  i=1 xi is asymptoti ally normally distributed, e.g.,
A
n1/2 N (0, 2 ). So n1/2 = Op (1), so = Op (n1/2 ), so = Op (1).
These two examples show that averages of entered (mean zero) quantities typi ally have
plim 0, while averages of un entered quantities have nite nonzero plims.
denition of

Op

does not mean that

f (n) and g(n) are of the same order.

Note that the

Asymptoti equality

ensures that this is the ase.

Denition 52 Two sequen es of random variables {fn } and {gn } are asymptoti ally equal

(written fn =a gn ) if

plim

f (n)
g(n)

Finally, analogous almost sure versions of

op

=1

and

Op

are dened in the obvious way.

340

CHAPTER 23.

For

For

For

and

both

For

and

both

and

both

pp

p1

ve tors, show that

p1

ve tors, show that

matrix and

p1

p1

NOTATION AND REVIEW

Dx a x = a.

ve tor, show that

Dx2 x Ax = A + A .

D exp x = exp(x )x.

ve tors, nd the analyti expression for

D2 exp x .

Write an O tave program that veries ea h of the previous results by taking numeri
derivatives. For a hint, type

help numgradient

and

help numhessian

inside o tave.

Chapter 24
Li enses
This do ument and the asso iated examples and materials are opyright Mi hael Creel, under
the terms of the GNU General Publi Li ense, ver. 2., or at your option, under the Creative
Commons Attribution-Share Alike Li ense, Version 2.5. The li enses follow.

24.1 The GPL


GNU GENERAL PUBLIC LICENSE
Version 2, June 1991
Copyright (C) 1989, 1991 Free Software Foundation, In .
59 Temple Pla e, Suite 330, Boston, MA 02111-1307 USA
Everyone is permitted to opy and distribute verbatim opies
of this li ense do ument, but hanging it is not allowed.
Preamble
The li enses for most software are designed to take away your
freedom to share and hange it. By ontrast, the GNU General Publi
Li ense is intended to guarantee your freedom to share and hange free
software--to make sure the software is free for all its users. This
General Publi Li ense applies to most of the Free Software
Foundation's software and to any other program whose authors ommit to
using it. (Some other Free Software Foundation software is overed by
the GNU Library General Publi Li ense instead.) You an apply it to
your programs, too.
When we speak of free software, we are referring to freedom, not
pri e. Our General Publi Li enses are designed to make sure that you
have the freedom to distribute opies of free software (and harge for
341

342

CHAPTER 24.

LICENSES

this servi e if you wish), that you re eive sour e ode or an get it
if you want it, that you an hange the software or use pie es of it
in new free programs; and that you know you an do these things.
To prote t your rights, we need to make restri tions that forbid
anyone to deny you these rights or to ask you to surrender the rights.
These restri tions translate to ertain responsibilities for you if you
distribute opies of the software, or if you modify it.
For example, if you distribute opies of su h a program, whether
gratis or for a fee, you must give the re ipients all the rights that
you have. You must make sure that they, too, re eive or an get the
sour e ode. And you must show them these terms so they know their
rights.
We prote t your rights with two steps: (1) opyright the software, and
(2) offer you this li ense whi h gives you legal permission to opy,
distribute and/or modify the software.
Also, for ea h author's prote tion and ours, we want to make ertain
that everyone understands that there is no warranty for this free
software. If the software is modified by someone else and passed on, we
want its re ipients to know that what they have is not the original, so
that any problems introdu ed by others will not refle t on the original
authors' reputations.
Finally, any free program is threatened onstantly by software
patents. We wish to avoid the danger that redistributors of a free
program will individually obtain patent li enses, in effe t making the
program proprietary. To prevent this, we have made it lear that any
patent must be li ensed for everyone's free use or not li ensed at all.
The pre ise terms and onditions for opying, distribution and
modifi ation follow.

GNU GENERAL PUBLIC LICENSE


TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION

24.1.

343

THE GPL

0. This Li ense applies to any program or other work whi h ontains


a noti e pla ed by the opyright holder saying it may be distributed
under the terms of this General Publi Li ense. The "Program", below,
refers to any su h program or work, and a "work based on the Program"
means either the Program or any derivative work under opyright law:
that is to say, a work ontaining the Program or a portion of it,
either verbatim or with modifi ations and/or translated into another
language. (Hereinafter, translation is in luded without limitation in
the term "modifi ation".) Ea h li ensee is addressed as "you".
A tivities other than opying, distribution and modifi ation are not
overed by this Li ense; they are outside its s ope. The a t of
running the Program is not restri ted, and the output from the Program
is overed only if its ontents onstitute a work based on the
Program (independent of having been made by running the Program).
Whether that is true depends on what the Program does.
1. You may opy and distribute verbatim opies of the Program's
sour e ode as you re eive it, in any medium, provided that you
onspi uously and appropriately publish on ea h opy an appropriate
opyright noti e and dis laimer of warranty; keep inta t all the
noti es that refer to this Li ense and to the absen e of any warranty;
and give any other re ipients of the Program a opy of this Li ense
along with the Program.
You may harge a fee for the physi al a t of transferring a opy, and
you may at your option offer warranty prote tion in ex hange for a fee.
2. You may modify your opy or opies of the Program or any portion
of it, thus forming a work based on the Program, and opy and
distribute su h modifi ations or work under the terms of Se tion 1
above, provided that you also meet all of these onditions:
a) You must ause the modified files to arry prominent noti es
stating that you hanged the files and the date of any hange.
b) You must ause any work that
whole or in part ontains or is
part thereof, to be li ensed as
parties under the terms of this

you distribute or publish, that in


derived from the Program or any
a whole at no harge to all third
Li ense.

344

CHAPTER 24.

LICENSES

) If the modified program normally reads ommands intera tively


when run, you must ause it, when started running for su h
intera tive use in the most ordinary way, to print or display an
announ ement in luding an appropriate opyright noti e and a
noti e that there is no warranty (or else, saying that you provide
a warranty) and that users may redistribute the program under
these onditions, and telling the user how to view a opy of this
Li ense. (Ex eption: if the Program itself is intera tive but
does not normally print su h an announ ement, your work based on
the Program is not required to print an announ ement.)

These requirements apply to the modified work as a whole. If


identifiable se tions of that work are not derived from the Program,
and an be reasonably onsidered independent and separate works in
themselves, then this Li ense, and its terms, do not apply to those
se tions when you distribute them as separate works. But when you
distribute the same se tions as part of a whole whi h is a work based
on the Program, the distribution of the whole must be on the terms of
this Li ense, whose permissions for other li ensees extend to the
entire whole, and thus to ea h and every part regardless of who wrote it.
Thus, it is not the intent of this se tion to laim rights or ontest
your rights to work written entirely by you; rather, the intent is to
exer ise the right to ontrol the distribution of derivative or
olle tive works based on the Program.
In addition, mere aggregation of another work not based on the Program
with the Program (or with a work based on the Program) on a volume of
a storage or distribution medium does not bring the other work under
the s ope of this Li ense.
3. You may opy and distribute the Program (or a work based on it,
under Se tion 2) in obje t ode or exe utable form under the terms of
Se tions 1 and 2 above provided that you also do one of the following:
a) A ompany it with the omplete orresponding ma hine-readable

24.1.

THE GPL

sour e ode, whi h must be distributed under the terms of Se tions


1 and 2 above on a medium ustomarily used for software inter hange; or,
b) A ompany it with a written offer, valid for at least three
years, to give any third party, for a harge no more than your
ost of physi ally performing sour e distribution, a omplete
ma hine-readable opy of the orresponding sour e ode, to be
distributed under the terms of Se tions 1 and 2 above on a medium
ustomarily used for software inter hange; or,
) A ompany it with the information you re eived as to the offer
to distribute orresponding sour e ode. (This alternative is
allowed only for non ommer ial distribution and only if you
re eived the program in obje t ode or exe utable form with su h
an offer, in a ord with Subse tion b above.)
The sour e ode for a work means the preferred form of the work for
making modifi ations to it. For an exe utable work, omplete sour e
ode means all the sour e ode for all modules it ontains, plus any
asso iated interfa e definition files, plus the s ripts used to
ontrol ompilation and installation of the exe utable. However, as a
spe ial ex eption, the sour e ode distributed need not in lude
anything that is normally distributed (in either sour e or binary
form) with the major omponents ( ompiler, kernel, and so on) of the
operating system on whi h the exe utable runs, unless that omponent
itself a ompanies the exe utable.
If distribution of exe utable or obje t ode is made by offering
a ess to opy from a designated pla e, then offering equivalent
a ess to opy the sour e ode from the same pla e ounts as
distribution of the sour e ode, even though third parties are not
ompelled to opy the sour e along with the obje t ode.

4. You may not opy, modify, subli ense, or distribute the Program
ex ept as expressly provided under this Li ense. Any attempt
otherwise to opy, modify, subli ense or distribute the Program is
void, and will automati ally terminate your rights under this Li ense.

345

346

CHAPTER 24.

LICENSES

However, parties who have re eived opies, or rights, from you under
this Li ense will not have their li enses terminated so long as su h
parties remain in full omplian e.
5. You are not required to a ept this Li ense, sin e you have not
signed it. However, nothing else grants you permission to modify or
distribute the Program or its derivative works. These a tions are
prohibited by law if you do not a ept this Li ense. Therefore, by
modifying or distributing the Program (or any work based on the
Program), you indi ate your a eptan e of this Li ense to do so, and
all its terms and onditions for opying, distributing or modifying
the Program or works based on it.
6. Ea h time you redistribute the Program (or any work based on the
Program), the re ipient automati ally re eives a li ense from the
original li ensor to opy, distribute or modify the Program subje t to
these terms and onditions. You may not impose any further
restri tions on the re ipients' exer ise of the rights granted herein.
You are not responsible for enfor ing omplian e by third parties to
this Li ense.
7. If, as a onsequen e of a ourt judgment or allegation of patent
infringement or for any other reason (not limited to patent issues),
onditions are imposed on you (whether by ourt order, agreement or
otherwise) that ontradi t the onditions of this Li ense, they do not
ex use you from the onditions of this Li ense. If you annot
distribute so as to satisfy simultaneously your obligations under this
Li ense and any other pertinent obligations, then as a onsequen e you
may not distribute the Program at all. For example, if a patent
li ense would not permit royalty-free redistribution of the Program by
all those who re eive opies dire tly or indire tly through you, then
the only way you ould satisfy both it and this Li ense would be to
refrain entirely from distribution of the Program.
If any portion of this se tion is held invalid or unenfor eable under
any parti ular ir umstan e, the balan e of the se tion is intended to
apply and the se tion as a whole is intended to apply in other
ir umstan es.
It is not the purpose of this se tion to indu e you to infringe any
patents or other property right laims or to ontest validity of any

24.1.

THE GPL

347

su h laims; this se tion has the sole purpose of prote ting the
integrity of the free software distribution system, whi h is
implemented by publi li ense pra ti es. Many people have made
generous ontributions to the wide range of software distributed
through that system in relian e on onsistent appli ation of that
system; it is up to the author/donor to de ide if he or she is willing
to distribute software through any other system and a li ensee annot
impose that hoi e.
This se tion is intended to make thoroughly lear what is believed to
be a onsequen e of the rest of this Li ense.

8. If the distribution and/or use of the Program is restri ted in


ertain ountries either by patents or by opyrighted interfa es, the
original opyright holder who pla es the Program under this Li ense
may add an expli it geographi al distribution limitation ex luding
those ountries, so that distribution is permitted only in or among
ountries not thus ex luded. In su h ase, this Li ense in orporates
the limitation as if written in the body of this Li ense.
9. The Free Software Foundation may publish revised and/or new versions
of the General Publi Li ense from time to time. Su h new versions will
be similar in spirit to the present version, but may differ in detail to
address new problems or on erns.
Ea h version is given a distinguishing version number. If the Program
spe ifies a version number of this Li ense whi h applies to it and "any
later version", you have the option of following the terms and onditions
either of that version or of any later version published by the Free
Software Foundation. If the Program does not spe ify a version number of
this Li ense, you may hoose any version ever published by the Free Software
Foundation.
10. If you wish to in orporate parts of the Program into other free
programs whose distribution onditions are different, write to the author
to ask for permission. For software whi h is opyrighted by the Free
Software Foundation, write to the Free Software Foundation; we sometimes

348

CHAPTER 24.

LICENSES

make ex eptions for this. Our de ision will be guided by the two goals
of preserving the free status of all derivatives of our free software and
of promoting the sharing and reuse of software generally.
NO WARRANTY
11. BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY
FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN
OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES
PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED
OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS
TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE
PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING,
REPAIR OR CORRECTION.
12. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING
WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR
REDISTRIBUTE THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES,
INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING
OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED
TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY
YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER
PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE
POSSIBILITY OF SUCH DAMAGES.
END OF TERMS AND CONDITIONS

How to Apply These Terms to Your New Programs


If you develop a new program, and you want it to be of the greatest
possible use to the publi , the best way to a hieve this is to make it
free software whi h everyone an redistribute and hange under these terms.
To do so, atta h the following noti es to the program. It is safest
to atta h them to the start of ea h sour e file to most effe tively
onvey the ex lusion of warranty; and ea h file should have at least

24.1.

THE GPL

the " opyright" line and a pointer to where the full noti e is found.
<one line to give the program's name and a brief idea of what it does.>
Copyright (C) <year> <name of author>
This program is free software; you an redistribute it and/or modify
it under the terms of the GNU General Publi Li ense as published by
the Free Software Foundation; either version 2 of the Li ense, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Publi Li ense for more details.
You should have re eived a opy of the GNU General Publi Li ense
along with this program; if not, write to the Free Software
Foundation, In ., 59 Temple Pla e, Suite 330, Boston, MA 02111-1307 USA

Also add information on how to onta t you by ele troni and paper mail.
If the program is intera tive, make it output a short noti e like this
when it starts in an intera tive mode:
Gnomovision version 69, Copyright (C) year name of author
Gnomovision omes with ABSOLUTELY NO WARRANTY; for details type `show w'.
This is free software, and you are wel ome to redistribute it
under ertain onditions; type `show ' for details.
The hypotheti al ommands `show w' and `show ' should show the appropriate
parts of the General Publi Li ense. Of ourse, the ommands you use may
be alled something other than `show w' and `show '; they ould even be
mouse- li ks or menu items--whatever suits your program.
You should also get your employer (if you work as a programmer) or your
s hool, if any, to sign a " opyright dis laimer" for the program, if
ne essary. Here is a sample; alter the names:
Yoyodyne, In ., hereby dis laims all opyright interest in the program
`Gnomovision' (whi h makes passes at ompilers) written by James Ha ker.

349

350

CHAPTER 24.

LICENSES

<signature of Ty Coon>, 1 April 1989


Ty Coon, President of Vi e
This General Publi Li ense does not permit in orporating your program into
proprietary programs. If your program is a subroutine library, you may
onsider it more useful to permit linking proprietary appli ations with the
library. If this is what you want to do, use the GNU Library General
Publi Li ense instead of this Li ense.

24.2 Creative Commons


Legal Code
Attribution-ShareAlike 2.5
CREATIVE COMMONS CORPORATION IS NOT A LAW FIRM AND DOES NOT
PROVIDE LEGAL SERVICES. DISTRIBUTION OF THIS LICENSE DOES NOT CREATE
AN ATTORNEY-CLIENT RELATIONSHIP. CREATIVE COMMONS PROVIDES THIS
INFORMATION ON AN "AS-IS" BASIS. CREATIVE COMMONS MAKES NO WARRANTIES REGARDING THE INFORMATION PROVIDED, AND DISCLAIMS LIABILITY FOR DAMAGES RESULTING FROM ITS USE.
Li ense
THE WORK (AS DEFINED BELOW) IS PROVIDED UNDER THE TERMS OF THIS
CREATIVE COMMONS PUBLIC LICENSE ("CCPL" OR "LICENSE"). THE WORK IS
PROTECTED BY COPYRIGHT AND/OR OTHER APPLICABLE LAW. ANY USE OF
THE WORK OTHER THAN AS AUTHORIZED UNDER THIS LICENSE OR COPYRIGHT
LAW IS PROHIBITED.
BY EXERCISING ANY RIGHTS TO THE WORK PROVIDED HERE, YOU ACCEPT
AND AGREE TO BE BOUND BY THE TERMS OF THIS LICENSE. THE LICENSOR
GRANTS YOU THE RIGHTS CONTAINED HERE IN CONSIDERATION OF YOUR ACCEPTANCE OF SUCH TERMS AND CONDITIONS.
1. Denitions
1. "Colle tive Work" means a work, su h as a periodi al issue, anthology or en y lopedia, in
whi h the Work in its entirety in unmodied form, along with a number of other ontributions,
onstituting separate and independent works in themselves, are assembled into a olle tive
whole. A work that onstitutes a Colle tive Work will not be onsidered a Derivative Work
(as dened below) for the purposes of this Li ense.
2. "Derivative Work" means a work based upon the Work or upon the Work and other
pre-existing works, su h as a translation, musi al arrangement, dramatization,  tionalization,
motion pi ture version, sound re ording, art reprodu tion, abridgment, ondensation, or any
other form in whi h the Work may be re ast, transformed, or adapted, ex ept that a work
that onstitutes a Colle tive Work will not be onsidered a Derivative Work for the purpose of

24.2.

351

CREATIVE COMMONS

this Li ense. For the avoidan e of doubt, where the Work is a musi al omposition or sound
re ording, the syn hronization of the Work in timed-relation with a moving image ("syn hing")
will be onsidered a Derivative Work for the purpose of this Li ense.
3. "Li ensor" means the individual or entity that oers the Work under the terms of this
Li ense.
4. "Original Author" means the individual or entity who reated the Work.
5.

"Work" means the opyrightable work of authorship oered under the terms of this

Li ense.
6. "You" means an individual or entity exer ising rights under this Li ense who has not
previously violated the terms of this Li ense with respe t to the Work, or who has re eived
express permission from the Li ensor to exer ise rights under this Li ense despite a previous
violation.
7.

"Li ense Elements" means the following high-level li ense attributes as sele ted by

Li ensor and indi ated in the title of this Li ense: Attribution, ShareAlike.
2.

Fair Use Rights.

Nothing in this li ense is intended to redu e, limit, or restri t any

rights arising from fair use, rst sale or other limitations on the ex lusive rights of the opyright
owner under opyright law or other appli able laws.
3.

Li ense Grant.

Subje t to the terms and onditions of this Li ense, Li ensor hereby

grants You a worldwide, royalty-free, non-ex lusive, perpetual (for the duration of the appli able opyright) li ense to exer ise the rights in the Work as stated below:
1.

to reprodu e the Work, to in orporate the Work into one or more Colle tive Works,

and to reprodu e the Work as in orporated in the Colle tive Works;


2. to reate and reprodu e Derivative Works;
3.

to distribute opies or phonore ords of, display publi ly, perform publi ly, and per-

form publi ly by means of a digital audio transmission the Work in luding as in orporated in
Colle tive Works;
4. to distribute opies or phonore ords of, display publi ly, perform publi ly, and perform
publi ly by means of a digital audio transmission Derivative Works.
5.
For the avoidan e of doubt, where the work is a musi al omposition:
1. Performan e Royalties Under Blanket Li enses. Li ensor waives the ex lusive right to
olle t, whether individually or via a performan e rights so iety (e.g. ASCAP, BMI, SESAC),
royalties for the publi performan e or publi digital performan e (e.g. web ast) of the Work.
2. Me hani al Rights and Statutory Royalties. Li ensor waives the ex lusive right to olle t, whether individually or via a musi rights so iety or designated agent (e.g. Harry Fox
Agen y), royalties for any phonore ord You reate from the Work (" over version") and distribute, subje t to the ompulsory li ense reated by 17 USC Se tion 115 of the US Copyright
A t (or the equivalent in other jurisdi tions).
6.

Web asting Rights and Statutory Royalties.

For the avoidan e of doubt, where the

Work is a sound re ording, Li ensor waives the ex lusive right to olle t, whether individually
or via a performan e-rights so iety (e.g.

SoundEx hange), royalties for the publi digital

352

CHAPTER 24.

LICENSES

performan e (e.g. web ast) of the Work, subje t to the ompulsory li ense reated by 17 USC
Se tion 114 of the US Copyright A t (or the equivalent in other jurisdi tions).
The above rights may be exer ised in all media and formats whether now known or hereafter
devised.

The above rights in lude the right to make su h modi ations as are te hni ally

ne essary to exer ise the rights in other media and formats. All rights not expressly granted
by Li ensor are hereby reserved.
4. Restri tions.The li ense granted in Se tion 3 above is expressly made subje t to and
limited by the following restri tions:
1.

You may distribute, publi ly display, publi ly perform, or publi ly digitally perform

the Work only under the terms of this Li ense, and You must in lude a opy of, or the
Uniform Resour e Identier for, this Li ense with every opy or phonore ord of the Work
You distribute, publi ly display, publi ly perform, or publi ly digitally perform. You may not
oer or impose any terms on the Work that alter or restri t the terms of this Li ense or the
re ipients' exer ise of the rights granted hereunder. You may not subli ense the Work. You
must keep inta t all noti es that refer to this Li ense and to the dis laimer of warranties.
You may not distribute, publi ly display, publi ly perform, or publi ly digitally perform the
Work with any te hnologi al measures that ontrol a ess or use of the Work in a manner
in onsistent with the terms of this Li ense Agreement.

The above applies to the Work as

in orporated in a Colle tive Work, but this does not require the Colle tive Work apart from
the Work itself to be made subje t to the terms of this Li ense.

If You reate a Colle tive

Work, upon noti e from any Li ensor You must, to the extent pra ti able, remove from the
Colle tive Work any redit as required by lause 4( ), as requested. If You reate a Derivative
Work, upon noti e from any Li ensor You must, to the extent pra ti able, remove from the
Derivative Work any redit as required by lause 4( ), as requested.
2.

You may distribute, publi ly display, publi ly perform, or publi ly digitally perform

a Derivative Work only under the terms of this Li ense, a later version of this Li ense with
the same Li ense Elements as this Li ense, or a Creative Commons iCommons li ense that
ontains the same Li ense Elements as this Li ense (e.g. Attribution-ShareAlike 2.5 Japan).
You must in lude a opy of, or the Uniform Resour e Identier for, this Li ense or other
li ense spe ied in the previous senten e with every opy or phonore ord of ea h Derivative
Work You distribute, publi ly display, publi ly perform, or publi ly digitally perform.

You

may not oer or impose any terms on the Derivative Works that alter or restri t the terms
of this Li ense or the re ipients' exer ise of the rights granted hereunder, and You must keep
inta t all noti es that refer to this Li ense and to the dis laimer of warranties.

You may

not distribute, publi ly display, publi ly perform, or publi ly digitally perform the Derivative
Work with any te hnologi al measures that ontrol a ess or use of the Work in a manner
in onsistent with the terms of this Li ense Agreement. The above applies to the Derivative
Work as in orporated in a Colle tive Work, but this does not require the Colle tive Work
apart from the Derivative Work itself to be made subje t to the terms of this Li ense.
3. If you distribute, publi ly display, publi ly perform, or publi ly digitally perform the
Work or any Derivative Works or Colle tive Works, You must keep inta t all opyright noti es

24.2.

CREATIVE COMMONS

353

for the Work and provide, reasonable to the medium or means You are utilizing: (i) the name
of the Original Author (or pseudonym, if appli able) if supplied, and/or (ii) if the Original
Author and/or Li ensor designate another party or parties (e.g. a sponsor institute, publishing
entity, journal) for attribution in Li ensor's opyright noti e, terms of servi e or by other
reasonable means, the name of su h party or parties; the title of the Work if supplied; to the
extent reasonably pra ti able, the Uniform Resour e Identier, if any, that Li ensor spe ies
to be asso iated with the Work, unless su h URI does not refer to the opyright noti e or
li ensing information for the Work; and in the ase of a Derivative Work, a redit identifying
the use of the Work in the Derivative Work (e.g., "Fren h translation of the Work by Original
Author," or "S reenplay based on original Work by Original Author"). Su h redit may be
implemented in any reasonable manner; provided, however, that in the ase of a Derivative
Work or Colle tive Work, at a minimum su h redit will appear where any other omparable
authorship redit appears and in a manner at least as prominent as su h other omparable
authorship redit.
5. Representations, Warranties and Dis laimer
UNLESS OTHERWISE AGREED TO BY THE PARTIES IN WRITING, LICENSOR
OFFERS THE WORK AS-IS AND MAKES NO REPRESENTATIONS OR WARRANTIES
OF ANY KIND CONCERNING THE MATERIALS, EXPRESS, IMPLIED, STATUTORY
OR OTHERWISE, INCLUDING, WITHOUT LIMITATION, WARRANTIES OF TITLE,
MERCHANTIBILITY, FITNESS FOR A PARTICULAR PURPOSE, NONINFRINGEMENT,
OR THE ABSENCE OF LATENT OR OTHER DEFECTS, ACCURACY, OR THE PRESENCE OF ABSENCE OF ERRORS, WHETHER OR NOT DISCOVERABLE. SOME JURISDICTIONS DO NOT ALLOW THE EXCLUSION OF IMPLIED WARRANTIES, SO
SUCH EXCLUSION MAY NOT APPLY TO YOU.
6. Limitation on Liability. EXCEPT TO THE EXTENT REQUIRED BY APPLICABLE
LAW, IN NO EVENT WILL LICENSOR BE LIABLE TO YOU ON ANY LEGAL THEORY
FOR ANY SPECIAL, INCIDENTAL, CONSEQUENTIAL, PUNITIVE OR EXEMPLARY
DAMAGES ARISING OUT OF THIS LICENSE OR THE USE OF THE WORK, EVEN IF
LICENSOR HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
7. Termination
1. This Li ense and the rights granted hereunder will terminate automati ally upon any
brea h by You of the terms of this Li ense. Individuals or entities who have re eived Derivative
Works or Colle tive Works from You under this Li ense, however, will not have their li enses
terminated provided su h individuals or entities remain in full omplian e with those li enses.
Se tions 1, 2, 5, 6, 7, and 8 will survive any termination of this Li ense.
2. Subje t to the above terms and onditions, the li ense granted here is perpetual (for
the duration of the appli able opyright in the Work). Notwithstanding the above, Li ensor
reserves the right to release the Work under dierent li ense terms or to stop distributing the
Work at any time; provided, however that any su h ele tion will not serve to withdraw this
Li ense (or any other li ense that has been, or is required to be, granted under the terms of
this Li ense), and this Li ense will ontinue in full for e and ee t unless terminated as stated

354

CHAPTER 24.

LICENSES

above.
8. Mis ellaneous
1. Ea h time You distribute or publi ly digitally perform the Work or a Colle tive Work,
the Li ensor oers to the re ipient a li ense to the Work on the same terms and onditions as
the li ense granted to You under this Li ense.
2.

Ea h time You distribute or publi ly digitally perform a Derivative Work, Li ensor

oers to the re ipient a li ense to the original Work on the same terms and onditions as the
li ense granted to You under this Li ense.
3. If any provision of this Li ense is invalid or unenfor eable under appli able law, it shall
not ae t the validity or enfor eability of the remainder of the terms of this Li ense, and
without further a tion by the parties to this agreement, su h provision shall be reformed to
the minimum extent ne essary to make su h provision valid and enfor eable.
4. No term or provision of this Li ense shall be deemed waived and no brea h onsented to
unless su h waiver or onsent shall be in writing and signed by the party to be harged with
su h waiver or onsent.
5. This Li ense onstitutes the entire agreement between the parties with respe t to the
Work li ensed here. There are no understandings, agreements or representations with respe t
to the Work not spe ied here. Li ensor shall not be bound by any additional provisions that
may appear in any ommuni ation from You. This Li ense may not be modied without the
mutual written agreement of the Li ensor and You.
Creative Commons is not a party to this Li ense, and makes no warranty whatsoever in
onne tion with the Work. Creative Commons will not be liable to You or any party on any
legal theory for any damages whatsoever, in luding without limitation any general, spe ial,
in idental or onsequential damages arising in onne tion to this li ense. Notwithstanding the
foregoing two (2) senten es, if Creative Commons has expressly identied itself as the Li ensor
hereunder, it shall have all rights and obligations of Li ensor.
Ex ept for the limited purpose of indi ating to the publi that the Work is li ensed under
the CCPL, neither party will use the trademark "Creative Commons" or any related trademark
or logo of Creative Commons without the prior written onsent of Creative Commons. Any
permitted use will be in omplian e with Creative Commons' then- urrent trademark usage
guidelines, as may be published on its website or otherwise made available upon request from
time to time.
Creative Commons may be onta ted at http:// reative ommons.org/.

Chapter 25
The atti
This holds material that is not really ready to be in orporated into the main body, but that
I don't want to lose. Basi ally, ignore it, unless you'd like to help get it ready for in lusion.

25.1 Hurdle models


Returning to the Poisson model, lets look at a tual and tted ount probabilities.
relative frequen ies are

Pn

f (y = j) =

1(yi = j)/n

and tted frequen ies are

A tual

f(y = j) =

i=1 fY (j|xi , )/n We see that for the OBDV measure, there are many more a tual zeros
Table 25.1: A tual and Poisson tted frequen ies
Count

OBDV

ERV

Count

A tual

Fitted

A tual

Fitted

0.32

0.06

0.86

0.83

0.18

0.15

0.10

0.14

0.11

0.19

0.02

0.02

0.10

0.18

0.004

0.002

0.052

0.15

0.002

0.0002

0.032

0.10

2.4e-5

than predi ted. For ERV, there are somewhat more a tual zeros than tted, but the dieren e
is not too important.
Why might OBDV not t the zeros well?

What if people made the de ision to onta t

the do tor for a rst visit, they are si k, then the

do tor

de ides on whether or not follow-up

visits are needed. This is a prin ipal/agent type situation, where the total number of visits
depends upon the de ision of both the patient and the do tor. Sin e dierent parameters may
govern the two de ision-makers hoi es, we might expe t that dierent parameters govern the
probability of zeros versus the other ounts. Let
for visits, and let

be the parameters of the patient's demand

be the paramter of the do tor's demand for visits.

The patient will

initiate visits a ording to a dis rete hoi e model, for example, a logit model:

355

356

CHAPTER 25.

THE ATTIC

Pr(Y = 0) = fY (0, p ) = 1 1/ [1 + exp(p )]


Pr(Y > 0)

1/ [1 + exp(p )] ,

The above probabilities are used to estimate the binary 0/1 hurdle pro ess.

Then, for the

observations where visits are positive, a trun ated Poisson density is estimated. This density
is

fY (y, d |y > 0) =
=

fY (y, d )
Pr(y > 0)
fY (y, d )
1 exp(d )

sin e a ording to the Poisson model with the do tor's paramaters,

Pr(y = 0) =

exp(d )0d
.
0!

Sin e the hurdle and trun ated omponents of the overall density for

share no parameters,

they may be estimated separately, whi h is omputationally more e ient than estimating
the overall model.

(Re all that the BFGS algorithm, for example, will have to invert the

approximated Hessian. The omputational overhead is of order


parameters to be estimated) . The expe tation of

K2

where

is

E(Y |x) = Pr(Y > 0|x)E(Y |Y > 0, x)





1
d
=
1 + exp(p )
1 exp(d )

is the number of

25.1.

HURDLE MODELS

357

Here are hurdle Poisson estimation results for OBDV, obtained from this estimation program

**************************************************************************
MEPS data, OBDV
logit results
Strong onvergen e
Observations = 500
Fun tion value
-0.58939
t-Stats
params
t(OPG)
t(Sand.)
t(Hess)
onstant
-1.5502
-2.5709
-2.5269
-2.5560
pub_ins
1.0519
3.0520
3.0027
3.0384
priv_ins
0.45867
1.7289
1.6924
1.7166
sex
0.63570
3.0873
3.1677
3.1366
age
0.018614
2.1547
2.1969
2.1807
edu
0.039606
1.0467
0.98710
1.0222
in
0.077446
1.7655
2.1672
1.9601
Information Criteria
Consistent Akaike
639.89
S hwartz
632.89
Hannan-Quinn
614.96
Akaike
603.39
**************************************************************************

358

CHAPTER 25.

THE ATTIC

The results for the trun ated part:

**************************************************************************
MEPS data, OBDV
tpoisson results
Strong onvergen e
Observations = 500
Fun tion value
-2.7042
t-Stats
params
t(OPG)
t(Sand.)
t(Hess)
onstant
0.54254
7.4291
1.1747
3.2323
pub_ins
0.31001
6.5708
1.7573
3.7183
priv_ins
0.014382
0.29433
0.10438
0.18112
sex
0.19075
10.293
1.1890
3.6942
age
0.016683
16.148
3.5262
7.9814
edu
0.016286
4.2144
0.56547
1.6353
in
-0.0079016
-2.3186
-0.35309
-0.96078
Information Criteria
Consistent Akaike
2754.7
S hwartz
2747.7
Hannan-Quinn
2729.8
Akaike
2718.2
**************************************************************************

25.1.

359

HURDLE MODELS

Fitted and a tual probabilites (NB-II ts are provided as well) are:

Table 25.2: A tual and Hurdle Poisson tted frequen ies


Count

OBDV

Count

A tual

0
1

ERV

Fitted HP

Fitted NB-II

A tual

Fitted HP

Fitted NB-II

0.32

0.32

0.34

0.86

0.86

0.86

0.18

0.035

0.16

0.10

0.10

0.10

0.11

0.071

0.11

0.02

0.02

0.02

0.10

0.10

0.08

0.004

0.006

0.006

0.052

0.11

0.06

0.002

0.002

0.002

0.032

0.10

0.05

0.0005

0.001

For the Hurdle Poisson models, the ERV t is very a urate. The OBDV t is not so good.
Zeros are exa t, but 1's and 2's are underestimated, and higher ounts are overestimated. For
the NB-II ts, performan e is at least as good as the hurdle Poisson model, and one should
re all that many fewer parameters are used. Hurdle version of the negative binomial model
are also widely used.

25.1.1 Finite mixture models


The following are results for a mixture of 2 negative binomial (NB-I) models, for the OBDV
data, whi h you an repli ate using this estimation program

360

CHAPTER 25.

THE ATTIC

**************************************************************************
MEPS data, OBDV
mixnegbin results
Strong onvergen e
Observations = 500
Fun tion value
-2.2312
t-Stats
params
t(OPG)
t(Sand.)
t(Hess)
onstant
0.64852
1.3851
1.3226
1.4358
pub_ins
-0.062139
-0.23188
-0.13802
-0.18729
priv_ins
0.093396
0.46948
0.33046
0.40854
sex
0.39785
2.6121
2.2148
2.4882
age
0.015969
2.5173
2.5475
2.7151
edu
-0.049175
-1.8013
-1.7061
-1.8036
in
0.015880
0.58386
0.76782
0.73281
ln_alpha
0.69961
2.3456
2.0396
2.4029
onstant
-3.6130
-1.6126
-1.7365
-1.8411
pub_ins
2.3456
1.7527
3.7677
2.6519
priv_ins
0.77431
0.73854
1.1366
0.97338
sex
0.34886
0.80035
0.74016
0.81892
age
0.021425
1.1354
1.3032
1.3387
edu
0.22461
2.0922
1.7826
2.1470
in
0.019227
0.20453
0.40854
0.36313
ln_alpha
2.8419
6.2497
6.8702
7.6182
logit_inv_mix
0.85186
1.7096
1.4827
1.7883
Information Criteria
Consistent Akaike
2353.8
S hwartz
2336.8
Hannan-Quinn
2293.3
Akaike
2265.2
**************************************************************************
Delta method for mix parameter st. err.
mix
se_mix
0.70096
0.12043

The 95% onden e interval for the mix parameter is perilously lose to 1, whi h suggests
that there may really be only one omponent density, rather than a mixture. Again, this
is

not

the way to test this - it is merely suggestive.

25.1.

361

HURDLE MODELS

Edu ation is interesting. For the subpopulation that is healthy, i.e., that makes relatively few visits, edu ation seems to have a positive ee t on visits. For the unhealthy
group, edu ation has a negative ee t on visits. The other results are more mixed. A
larger sample ould help larify things.

The following are results for a 2 omponent onstrained mixture negative binomial model
where all the slope parameters in

j = exj

onstants and the overdispersion parameters

are the same a ross the two omponents.

The

are allowed to dier for the two omponents.

362

CHAPTER 25.

THE ATTIC

**************************************************************************
MEPS data, OBDV
mixnegbin results
Strong onvergen e
Observations = 500
Fun tion value
-2.2441
t-Stats
params
t(OPG)
t(Sand.)
t(Hess)
onstant
-0.34153
-0.94203
-0.91456
-0.97943
pub_ins
0.45320
2.6206
2.5088
2.7067
priv_ins
0.20663
1.4258
1.3105
1.3895
sex
0.37714
3.1948
3.4929
3.5319
age
0.015822
3.1212
3.7806
3.7042
edu
0.011784
0.65887
0.50362
0.58331
in
0.014088
0.69088
0.96831
0.83408
ln_alpha
1.1798
4.6140
7.2462
6.4293
onst_2
1.2621
0.47525
2.5219
1.5060
lnalpha_2
2.7769
1.5539
6.4918
4.2243
logit_inv_mix
2.4888
0.60073
3.7224
1.9693
Information Criteria
Consistent Akaike
2323.5
S hwartz
2312.5
Hannan-Quinn
2284.3
Akaike
2266.1
**************************************************************************
Delta method for mix parameter st. err.
mix
se_mix
0.92335
0.047318

Now the mixture parameter is even loser to 1.

The slope parameter estimates are pretty lose to what we got with the NB-I model.

25.2 Models for time series data


This se tion an be ignored in its present form. Just left in to form a basis for ompletion (by
someone else ?!) at some point.

25.2.

363

MODELS FOR TIME SERIES DATA

Hamilton,

Time Series Analysis is a good referen e for this se tion.

This is very in omplete

and ontributions would be very wel ome.


Up to now we've onsidered the behavior of the dependent variable
other variables

xt .

yt

as a fun tion of

These variables an of ourse ontain lagged dependent variables, e.g.,

xt = (wt , yt1 , ..., ytj ).

Pure time series methods onsider the behavior of

yt

as a fun tion

only of its own lagged values, un onditional on other observable variables. One an think of
this as modeling the behavior of

yt

after marginalizing out all other variables. While it's not

immediately lear why a model that has other explanatory variables should marginalize to a
linear in the parameters time series model, most time series work is done with linear models,
though nonlinear time series is also a large and growing eld.

We'll sti k with linear time

series models.

25.2.1 Basi on epts


Denition 53 (Sto hasti pro ess) A sto hasti pro ess is a sequen e of random variables,

indexed by time:

{Yt }
t=

(25.1)

Denition 54 (Time series) A time series is one observation of a sto hasti pro ess, over

a spe i interval:

{yt }nt=1
So a time series is a sample of size

from a sto hasti pro ess.

(25.2)

It's important to keep

in mind that on eptually, one ould draw another sample, and that the values would be
dierent.

Denition 55 (Auto ovarian e) The j th auto ovarian e of a sto hasti pro ess is
jt = E(yt t )(ytj tj )

(25.3)

where t = E (yt ) .
Denition 56 (Covarian e (weak) stationarity) A sto hasti pro ess is ovarian e sta-

tionary if it has time onstant mean and auto ovarian es of all orders:
t

= , t

jt = j , t
As we've seen, this implies that

j = j : the auto ovarian es depend only one the interval

between observations, but not the time of the observations.

Denition 57 (Strong stationarity) A sto hasti pro ess is strongly stationary if the joint

distribution of an arbitrary olle tion of the {Yt } doesn't depend on t.

364

CHAPTER 25.

THE ATTIC

Sin e moments are determined by the distribution, strong stationarityweak stationarity.

Yt ?

What is the mean of


ould think of

The time series is one sample from the sto hasti pro ess. One

repeated samples from the sto h.

pro ., e.g.,

expe t that

{ytm }

By a LLN, we would

M
1 X
p
lim
ytm E(Yt )
M M
m=1
The problem is, we have only one sample to work with, sin e we an't go ba k in time and
olle t another. How an
property.

E(Yt )

be estimated then? It turns out that

ergodi ity

is the needed

Denition 58 (Ergodi ity) A stationary sto hasti pro ess is ergodi (for the mean) if the

time average onverges to the mean

1X
p
yt
n t=1

(25.4)

A su ient ondition for ergodi ity is that the auto ovarian es be absolutely summable:

X
j=0

|j | <

This implies that the auto ovarian es die o, so that the

yt

are not so strongly dependent that

they don't satisfy a LLN.

Denition 59 (Auto orrelation) The j th auto orrelation, j is just the j th auto ovarian e

divided by the varian e:

j =

j
0

(25.5)

Denition 60 (White noise) White noise is just the time series literature term for a las-

si al error. t is white noise if i) E(t ) = 0, t, ii) V (t ) = 2 , t, and iii) t and s are


independent, t 6= s. Gaussian white noise just adds a normality assumption.

25.2.2 ARMA models


With these on epts, we an dis uss ARMA models. These are losely related to the AR and
MA error pro esses that we've already dis ussed. The main dieren e is that the lhs variable
is observed dire tly now.

MA(q) pro esses


A

q th

order moving average (MA) pro ess is

yt = + t + 1 t1 + 2 t2 + + q tq

25.2.

MODELS FOR TIME SERIES DATA

where

365

is white noise. The varian e is

= E (yt )2

= E (t + 1 t1 + 2 t2 + + q tq )2

= 2 1 + 12 + 22 + + q2

Similarly, the auto ovarian es are

= j + j+1 1 + j+2 2 + + q qj , j q
= 0, j > q

Therefore an MA(q) pro ess is ne essarily ovarian e stationary and ergodi , as long as
and all of the

are nite.

AR(p) pro esses


An AR(p) pro ess an be represented as

yt = c + 1 yt1 + 2 yt2 + + p ytp + t


The dynami behavior of an AR(p) pro ess an be studied by writing this

pth

order dieren e

equation as a ve tor rst order dieren e equation:

yt

yt1
.
.
.
ytp+1

1 2

1

0
= .
0
.
.
..
.
0
0
c

1
..

..

..

..

yt1
0

yt2

0
.
..

0
ytp
0

Yt = C + F Yt1 + Et
With this, we an re ursively work forward in time:

= C + F Yt + Et+1
= C + F (C + F Yt1 + Et ) + Et+1
= C + F C + F 2 Yt1 + F Et + Et+1

and

Yt+2

0
+ .
.
.
0

or

Yt+1

= C + F Yt+1 + Et+2


= C + F C + F C + F 2 Yt1 + F Et + Et+1 + Et+2

= C + F C + F 2 C + F 3 Yt1 + F 2 Et + F Et+1 + Et+2

366

CHAPTER 25.

THE ATTIC

or in general

Yt+j = C + F C + + F j C + F j+1 Yt1 + F j Et + F j1 Et+1 + + F Et+j1 + Et+j


t

Consider the impa t of a sho k in period

on

yt+j .

This is simply

Yt+j
j
= F(1,1)
Et (1,1)
If the system is to be stationary, then as we move forward in time this impa t must die
o. Otherwise a sho k auses a permanent hange in the mean of

yt .

Therefore, stationarity

requires that

j
=0
lim F(1,1)

Save this result, we'll need it in a minute.

Consider the eigenvalues of the matrix

F.

These are the for

su h that

|F IP | = 0
The determinant here an be expressed as a polynomial. for example, for

p = 1,

the matrix

is simply

F = 1
so

|1 | = 0
an be written as

1 = 0
When

p = 2,

the matrix

is

F =

"

so

F IP =

"

1 2
1

1 2

and

|F IP | = 2 1 2
So the eigenvalues are the roots of the polynomial

2 1 2
whi h an be found using the quadrati equation. This generalizes. For a
the eigenvalues are the roots of

p p1 1 p2 2 p1 p = 0

pth

order AR pro ess,

25.2.

367

MODELS FOR TIME SERIES DATA

Supposing that all of the roots of this polynomial are distin t, then the matrix

an be

fa tored as

F = T T 1
where

is the matrix whi h has as its olumns the eigenve tors of

F,

and

is a diagonal

matrix with the eigenvalues on the main diagonal. Using this de omposition, we an write

F j = T T 1
where

T T 1

is repeated



T T 1 T T 1

times. This gives

F j = T j T 1
and

j1 0

0
=

0
j

Supposing that the

i i = 1, 2, ..., p

j2
..

jp

are all real valued, it is lear that

j
=0
lim F(1,1)

j
requires that

|i | < 1, i = 1, 2, ..., p
e.g., the eigenvalues must be less than one in absolute value.

It may be the ase that some eigenvalues are omplex-valued.

The previous result

generalizes to the requirement that the eigenvalues be less than one in


the modulus of a omplex number

a + bi

modulus,

where

is

mod(a + bi) =

a2 + b2

This leads to the famous statement that stationarity requires the roots of the determinantal polynomial to lie inside the omplex unit ir le.

When there are roots on

Dynami multipliers:

draw pi ture here.

the unit ir le (unit roots) or outside the unit ir le, we leave

the world of stationary pro esses.

j
yt+j /t = F(1,1)

is a

dynami multiplier

or an

impulse-response

fun tion. Real eigenvalues lead to steady movements, whereas omlpex eigenvalue lead
to o illatory behavior. Of ourse, when there are multiple eigenvalues the overall ee t
an be a mixture.

pi tures

Invertibility of AR pro ess

368

CHAPTER 25.

To begin with, dene the lag operator

THE ATTIC

Lyt = yt1
The lag operator is dened to behave just as an algebrai quantity, e.g.,

L2 yt = L(Lyt )
= Lyt1
= yt2
or

(1 L)(1 + L)yt = 1 Lyt + Lyt L2 yt


= 1 yt2

A mean-zero AR(p) pro ess an be written as

yt 1 yt1 2 yt2 p ytp = t


or

yt (1 1 L 2 L2 p Lp ) = t
Fa tor this polynomial as

1 1 L 2 L2 p Lp = (1 1 L)(1 2 L) (1 p L)
For the moment, just assume that the

are oe ients to be determined. Sin e

to operate as an algebrai quantitiy, determination of the


the

L is

dened

is the same as determination of

su h that the following two expressions are the same for all

z:

1 1 z 2 z 2 p z p = (1 1 z)(1 2 z) (1 p z)
Multiply both sides by

z p

z p 1 z 1p 2 z 2p p1 z 1 p = (z 1 1 )(z 1 2 ) (z 1 p )
and now dene

= z 1

so we get

p 1 p1 2 p2 p1 p = ( 1 )( 2 ) ( p )
The LHS is pre isely the determinantal polynomial that gives the eigenvalues of
the

F.

Therefore,

that are the oe ients of the fa torization are simply the eigenvalues of the matrix

F.

25.2.

369

MODELS FOR TIME SERIES DATA

Now onsider a dierent stationary pro ess

(1 L)yt = t

Stationarity, as above, implies that

Multiply both sides by

|| < 1.

1 + L + 2 L2 + ... + j Lj

to get



1 + L + 2 L2 + ... + j Lj (1 L)yt = 1 + L + 2 L2 + ... + j Lj t

or, multiplying the polynomials on th LHS, we get


1 + L + 2 L2 + ... + j Lj L 2 L2 ... j Lj j+1 Lj+1 yt

== 1 + L + 2 L2 + ... + j Lj t

and with an ellations we have



1 j+1 Lj+1 yt = 1 + L + 2 L2 + ... + j Lj t

so

Now as


yt = j+1 Lj+1 yt + 1 + L + 2 L2 + ... + j Lj t

j , j+1 Lj+1 yt 0,

sin e

|| < 1,

so


yt
= 1 + L + 2 L2 + ... + j Lj t
j

and the approximation be omes better and better as

in reases. However, we started with

(1 L)yt = t
Substituting this into the above equation we have


yt
= 1 + L + 2 L2 + ... + j Lj (1 L)yt

so


1 + L + 2 L2 + ... + j Lj (1 L)
=1

and the approximation be omes arbitrarily good as

|| < 1,

dene

(1 L)

in reases arbitrarily.

j Lj

j=0

Re all that our mean zero AR(p) pro ess

yt (1 1 L 2 L2 p Lp ) = t

Therefore, for

370

CHAPTER 25.

THE ATTIC

an be written using the fa torization

yt (1 1 L)(1 2 L) (1 p L) = t
where the

are the eigenvalues of

F,

and given stationarity, all the

an invert ea h rst order polynomial on the LHS to get

|i | < 1.

Therefore, we

X
X
X
yt =
j1 Lj
j2 Lj
jp Lj t
j=0

j=0

j=0

The RHS is a produ t of innite-order polynomials in

L,

whi h an be represented as

yt = (1 + 1 L + 2 L2 + )t
where the

are real-valued and absolutely summable.

The

are formed of produ ts of powers of the

The

are real-valued be ause any omplex-valued

This means that if

a + bi

is an eigenvalue of

i ,

F,

whi h are in turn fun tions of the

i .

always o ur in onjugate pairs.

then so is

a bi.

In multipli ation

(a + bi) (a bi) = a2 abi + abi b2 i2


= a2 + b2

whi h is real-valued.

This shows that an AR(p) pro ess is representable as an innite-order MA(q) pro ess.

Re all before that by re ursive substitution, an AR(p) pro ess an be written as

Yt+j = C + F C + + F j C + F j+1 Yt1 + F j Et + F j1 Et+1 + + F Et+j1 + Et+j


If the pro ess is mean zero, then everything with a

drops out. Take this and lag it by

periods to get

Yt = F j+1 Ytj1 + F j Etj + F j1 Etj+1 + + F Et1 + Et


As

j ,

the lagged

on the RHS drops out. The

Ets

are ve tors of zeros ex ept

for their rst element, so we see that the rst equation here, in the limit, is just

yt =

X
j=0

Fj

1,1 tj

whi h makes expli it the relationship between the

j
re alling the previous fa torization of F ).

and the

(and the

as well,

25.2.

371

MODELS FOR TIME SERIES DATA

Moments of AR(p) pro ess

The AR(p) pro ess is

yt = c + 1 yt1 + 2 yt2 + + p ytp + t


Assuming stationarity,

E(yt ) = , t,

so

= c + 1 + 2 + ... + p
so

c
1 1 2 ... p

and

c = 1 ... p
so

yt = 1 ... p + 1 yt1 + 2 yt2 + + p ytp + t


= 1 (yt1 ) + 2 (yt2 ) + ... + p (ytp ) + t
With this, the se ond moments are easy to nd: The varian e is

0 = 1 1 + 2 2 + ... + p p + 2
The auto ovarian es of orders

j1

follow the rule

= E [(yt ) (ytj ))]


= E [(1 (yt1 ) + 2 (yt2 ) + ... + p (ytp ) + t ) (ytj )]
= 1 j1 + 2 j2 + ... + p jp

Using the fa t that

p+1

j = j ,

2
unknowns ( ,

one an take the

0 , 1 , ..., p )

p+1

equations for

j = 0, 1, ..., p,

and solve for the unknowns. With these, the

an be solved for re ursively.

Invertibility of MA(q) pro ess


An MA(q) an be written as

yt = (1 + 1 L + ... + q Lq )t
As before, the polynomial on the RHS an be fa tored as

(1 + 1 L + ... + q Lq ) = (1 1 L)(1 2 L)...(1 q L)

whi h have

for

j>p

372

CHAPTER 25.

and ea h of the
write

(1 i L)

an be inverted as long as

|i | < 1.

THE ATTIC

If this is the ase, then we an

(1 + 1 L + ... + q Lq )1 (yt ) = t
where

(1 + 1 L + ... + q Lq )1
will be an innite-order polynomial in

X
j=0

with

0 = 1,

L,

so we get

j Lj (ytj ) = t

or

(yt ) 1 (yt1 ) 2 (yt2 ) + ... = t

or

yt = c + 1 yt1 + 2 yt2 + ... + t


where

c = + 1 + 2 + ...
So we see that an MA(q) has an innite AR representation, as long as the

|i | < 1, i = 1, 2, ..., q.

It turns out that one an always manipulate the parameters of an MA(q) pro ess to nd
an invertible representation. For example, the two MA(1) pro esses

yt = (1 L)t
and

yt = (1 1 L)t
have exa tly the same moments if

2 = 2 2
For example, we've seen that

0 = 2 (1 + 2 ).
Given the above relationships amongst the parameters,

0 = 2 2 (1 + 2 ) = 2 (1 + 2 )
so the varian es are the same.

It turns out that

all

the auto ovarian es will be the

same, as is easily he ked. This means that the two MA pro esses are

equivalent.

observationally

As before, it's impossible to distinguish between observationally equivalent

pro esses on the basis of data.

For a given MA(q) pro ess, it's always possible to manipulate the parameters to nd an

25.2.

373

MODELS FOR TIME SERIES DATA

invertible representation (whi h is unique).

It's important to nd an invertible representation, sin e it's the only representation that
allows one to represent

as a fun tion of past

y s.

The other representations express

Why is invertibility important? The most important reason is that it provides a justi ation for the use of parsimonious models. Sin e an AR(1) pro ess has an MA() rep-

resentation, one an reverse the argument and note that at least some MA() pro esses
have an AR(1) representation. At the time of estimation, it's a lot easier to estimate the

single AR(1) oe ient rather than the innite number of oe ients asso iated with
the MA representation.

This is the reason that ARMA models are popular. Combining low-order AR and MA
models an usually oer a satisfa tory representation of univariate time series data with
a reasonable number of parameters.

Stationarity and invertibility of ARMA models is similar to what we've seen - we won't
go into the details. Likewise, al ulating moments is similar.

Exer ise 61 Cal ulate the auto ovarian es of an ARMA(1,1) model: (1 + L)yt = c + (1 +
L)t

374

CHAPTER 25.

THE ATTIC

Bibliography
[1 Davidson, R. and J.G. Ma Kinnon (1993)

Estimation and Inferen e in E onometri s,

Oxford Univ. Press.


[2 Davidson, R. and J.G. Ma Kinnon (2004)

E onometri Theory and Methods,

Oxford

Univ. Press.
[3 Gallant, A.R. (1985)

Nonlinear Statisti al Models, Wiley.

[4 Gallant, A.R. (1997)

An Introdu tion to E onometri Theory, Prin eton Univ. Press.

[5 Hamilton, J. (1994)
[6 Hayashi, F. (2000)
[7 Wooldridge (2003),

Time Series Analysis, Prin eton Univ. Press

E onometri s, Prin eton Univ. Press.


Introdu tory E onometri s,

plementary use only).

375

Thomson. (undergraduate level, for sup-

Index
ARCH, 294

R-squared, entered, 29

asymptoti equality, 339

residuals, 23

Chain rule, 336


Cobb-Douglas model, 22
onditional heteros edasti ity, 294
onvergen e, almost sure, 337
onvergen e, in distribution, 337
onvergen e, in probability, 337
Convergen e, ordinary, 336
onvergen e, pointwise, 336
onvergen e, uniform, 337
onvergen e, uniform almost sure, 338
ross se tion, 19
estimator, linear, 26, 33
estimator, OLS, 23
extremum estimator, 177
tted values, 23
GARCH, 294
leptokurtosis, 294
leverage, 27
likelihood fun tion, 41
matrix, idempotent, 26
matrix, proje tion, 25
matrix, symmetri , 26
observations, inuential, 26
outliers, 26
own inuen e, 27
parameter spa e, 41
Produ t rule, 335
R- squared, un entered, 29

376

You might also like