
© 2001–2003
Masao Ogaki, Kyungho Jang, and Hyoung-Seok Lim

Structural Macroeconometrics

Masao Ogaki
The Ohio State University

Kyungho Jang
The University of Alabama at Birmingham

Hyoung-Seok Lim
The Bank of Korea

First draft: May, 2000
This version: February 29, 2004

PREFACE
This book presents various structural econometric tools used in macroeconomics.
The word “structural” has been defined in many ways. In this book, “structural”
means that explicit assumptions are made in econometric methods so that estimators
or test statistics can be interpreted in terms of an economic model (or models) as
explained in Chapter 1.
Many applied macroeconomists link macroeconomic models with econometric
methods in this sense of structural econometrics. In principle, recent advances in
theoretical time series econometrics make this task easier because they often relax
the very restrictive assumptions made in conventional econometrics. There are many
textbooks that explain these advanced econometric methods. It is often difficult,
however, for applied researchers to exploit these advances because few textbooks in
time series econometrics explain how macroeconomic models are mapped into advanced econometric models.[1] To fill this gap, this book presents methods to apply
advanced econometric procedures to structural macroeconomic models. The econometric methods covered are mainly those of time series econometrics, and include the
generalized method of moments, vector autoregressions, and estimation and testing
in the presence of nonstationary variables.
Since this book focuses on applications, proofs are usually omitted, with references given for interested readers. When proofs are helpful for understanding issues that
are important for applied research, they are given in mathematical appendices. Many
examples are given to illustrate concepts and methods.
[1]

For example, Hamilton (1994) contains an exceptional volume of explanations of applications for a
time series econometrics textbook, but its main focus is on econometrics, and not on the mapping
of economic models into econometric models.


This book is intended for an advanced graduate course in time series econometrics or macroeconomics. The prerequisites for such a course include an introduction to econometrics. This book is also useful to applied macroeconomic researchers interested in learning how recent advances in time series econometrics can be used to estimate and test structural macroeconomic models.

Contents

1  INTRODUCTION
2  STOCHASTIC PROCESSES
3  FORECASTING
4  ARMA AND VECTOR AUTOREGRESSION REPRESENTATIONS
5  STOCHASTIC REGRESSORS IN LINEAR MODELS
6  ESTIMATION OF THE LONG-RUN COVARIANCE MATRIX
7  TESTING LINEAR FORECASTING MODELS
8  VECTOR AUTOREGRESSION
9  GENERALIZED METHOD OF MOMENTS
10 EMPIRICAL APPLICATIONS OF GMM
11 UNIT ROOT NONSTATIONARY PROCESSES
12 COINTEGRATING AND SPURIOUS REGRESSIONS
13 ECONOMIC MODELS AND COINTEGRATING REGRESSIONS
14 ESTIMATION AND TESTING OF LINEAR RATIONAL EXPECTATIONS MODELS
15 VECTOR AUTOREGRESSIONS WITH UNIT ROOT NONSTATIONARY PROCESSES
16 PANEL AND CROSS-SECTIONAL DATA
A  INTRODUCTION TO GAUSS
B  COMPLEX VARIABLES, THE SPECTRUM, AND LAG OPERATOR
C  ANSWERS TO SELECTED QUESTIONS

List of Tables

Chapter 1
INTRODUCTION

The word “structural” has various different meanings in econometrics. In this book, “structural” means that explicit assumptions are made in econometric methods so that estimators or test statistics can be interpreted in terms of an economic model (or models). In some cases, some properties of the estimators and test statistics are known when they are applied to data generated from an economic model. We then use the economic model to interpret empirical results obtained by applying the econometric tools to real data. This is important because an economic model is used to analyze causal relationships between economic variables, and understanding causal relationships is essential for policy evaluations and forecasting.

As a very simple example, consider a model of demand for a good:

(1.1)  Qdt = a − bPt + et,

where Pt is the price and Qdt is the market quantity demanded. In this model a and b are constants and et is the demand shock. The model assumes that the observed quantity, Qt, is equated with Qdt, that Pt is nonstochastic, that et has mean zero with E(et^2) = σ^2, and that E(et es) = 0 if t ≠ s. With these assumptions the Gauss-Markov Theorem can be applied to this model.

If the Ordinary Least Squares (OLS) slope coefficient estimator is applied to data of Qt and Pt for t = 1, · · · , T in this model, then the estimator is the Best Linear Unbiased Estimator (BLUE) for the demand slope coefficient, b.

One benefit of having this structural model is that we know exactly what the limitations are when we interpret OLS results applied to real data in terms of the model. The structural demand model tells us under what assumptions we can interpret the OLS slope estimator as an unbiased estimator for b. By studying the assumptions, we can see what will happen when they are violated. This knowledge is helpful because we can then study how to improve our econometric methods for better interpretation of data.

For example, consider the assumption made in the model that Pt is nonstochastic. This assumption is sometimes motivated by saying that the price is taken as given by the individual market participants. It is easy to see that this motivation is problematic by considering the supply side of the market. Consider a model of supply of the good:

(1.2)  Qst = c + dPt + ut,

where Qst is the market quantity supplied and ut is the supply shock. In equilibrium, the observed quantity, Qt, is equal to Qdt and Qst. Equating the right hand sides of (1.1) and (1.2), and solving for Pt, we obtain

(1.3)  Pt = (1/(d + b))(a − c + et − ut).

Hence Pt is stochastic. Moreover, (1.3) makes it clear that Pt is correlated with et and ut. This means that the OLS slope coefficient estimator is not even a consistent estimator for b or d, as discussed in Chapter 5. This leads us to consider an improved econometric method, for example, an instrumental variable method. This process leads to better econometric methods.
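The simultaneity problem in (1.1)–(1.3) can be illustrated with a small simulation. The following sketch is not from the text: it uses Python/NumPy with hypothetical values of a, b, c, d and of the shock variances, generates equilibrium prices and quantities, and compares the OLS slope from regressing Qt on Pt with the true demand slope −b and with the probability limit implied by (1.3). With these hypothetical values the OLS slope settles near −0.5 rather than the true −1.5.

```python
import numpy as np

# Hypothetical parameter values (not from the text): demand Q = a - b*P + e,
# supply Q = c + d*P + u, with independent demand and supply shocks e and u.
a, b, c, d = 10.0, 1.5, 2.0, 0.5
sigma_e, sigma_u = 1.0, 1.0

rng = np.random.default_rng(0)
T = 100_000
e = sigma_e * rng.standard_normal(T)
u = sigma_u * rng.standard_normal(T)

# Equilibrium price and quantity implied by (1.1)-(1.3).
P = (a - c + e - u) / (d + b)
Q = a - b * P + e

# OLS slope from regressing Q on a constant and P.
X = np.column_stack([np.ones(T), P])
ols = np.linalg.lstsq(X, Q, rcond=None)[0]

# Probability limit of the OLS slope under this simultaneity:
# -b + (b + d) * var(e) / (var(e) + var(u)).
plim_slope = -b + (b + d) * sigma_e**2 / (sigma_e**2 + sigma_u**2)

print("true demand slope      :", -b)
print("OLS slope estimate     :", ols[1])
print("theoretical plim of OLS:", plim_slope)
```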

We do not claim that structural econometrics as defined here is better than non-structural econometrics. They are tools that serve different purposes. Just as it does not make sense to argue whether a hammer is better than a screwdriver, we cannot compare structural and non-structural econometrics without specifying the purposes. For the purpose of summarizing data properties and finding stylized facts, non-structural econometrics is better. Using a structural econometric model that enforces a certain economic interpretation is not good for this purpose. On the other hand, after finding stylized facts with non-structural econometrics, one may wish to understand causal relationships that explain stylized facts and make policy recommendations based on causal relationships. This purpose is obviously very important in economics. For that purpose, structural econometrics is better than non-structural econometrics.

Another consideration is the trend observed in most aggregate data. When data contain trends, the demand model with trends leads to cointegrating regressions as discussed in Chapter 13. Instead of starting with a demand function, one can start with a utility function as in the Euler Equation Approach discussed in Chapter 10. Similarly, cointegrating regressions can be used to estimate preference parameters, and this Cointegration Approach can be combined with the Euler Equation Approach as described in Chapter 13.

Similarly, we do not claim that the definition of “structural” in this book is better than other definitions.

For example, Hendry (1993) and Ericsson (1995) define a structural model as an econometric model that is invariant over extensions of the information set in time, interventions or variables. Their definition is useful for their purpose of finding invariant relationships between economic variables in data, but cannot be used for our purpose of interpreting empirical results in terms of an economic model.

References

Ericsson, N. R. (1995): “Conditional and Structural Error Correction Models,” Journal of Econometrics, 69, 159–171.

Hendry, D. F. (1993): “The Roles of Economic Theory and Econometrics in Time Series Economics,” Invited paper presented at the Econometric Society European Meeting, Uppsala, Sweden.

Chapter 2
STOCHASTIC PROCESSES

In most macroeconomic models, expectations conditional on information sets are used to model the forecasting conducted by economic agents. Economic agents typically observe stochastic processes of random variables (collections of random variables indexed by time) to form their information sets. This chapter defines the concepts of conditional expectations and information sets for the case of a finite number of elements in the probability space.[1]

2.1 Review of Probability Theory

Since the probability statements made in asymptotic theory involve infinitely many random variables instead of just one random variable, it is important to understand basic concepts in probability theory. Thus, we first review those basic concepts.

Imagine that we are interested in making probability statements about a set of the states of the world (or a probability space), which we denote by S. For our purpose, nothing is lost by assuming that there is a finite number of states of the world.

[1] For the general probability space, these concepts are defined with measure theory (see Appendix 2.A). For the purpose of understanding concepts, it is not necessary for the reader to understand measure theory.

Hence we adopt the simplifying assumption that S consists of N possible states: S = {s1, · · · , sN}. We assign a probability πi = Pr(si) to si, depending on how likely si is to occur. It is assumed that Σ_{i=1}^N πi = 1 and 0 ≤ πi ≤ 1 for all i. Note that we can now assign a probability to all subsets of S. For example, let Λ be {s1, s2}. Then the probability that the true s is in Λ is denoted by Pr(s ∈ Λ), where Pr(s ∈ Λ) = π1 + π2.

Example 2.1 The state of the world consists of s1: it rains tomorrow, and s2: it does not rain tomorrow. According to a weather forecast, π1 = 0.8 and π2 = 0.2.

A random variable assigns a real value to each element s in S (that is, it is a real valued function on S). Let X(s) be a random variable (we will often omit the argument s). For a real value x, the distribution function, F(x), of the random variable is defined by F(x) = Pr{s : X(s) ≤ x}. A random variable is assigned an expected value or mean value

(2.1)  E(X) = Σ_{i=1}^N X(si)πi.

Example 2.2 Continuing Example 2.1, let X(s) be the profit of an umbrella seller in terms of dollars with X(s1) = 100 and X(s2) = 10. The distribution function F(x) is given by F(x) = 0 for x < 10, F(x) = 0.2 for 10 ≤ x < 100, and F(x) = 1 for x ≥ 100. Then E(X) = 100 × 0.8 + 10 × 0.2 = 82.

A random vector is a vector of random variables defined on the set of states. For a k-dimensional random vector X(s) = (X1(s), · · · , Xk(s))′, the joint distribution function F is defined by

(2.2)  F(x1, · · · , xk) = Pr[X1 ≤ x1, · · · , Xk ≤ xk].
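For a finite state space these objects are simple to compute. The short sketch below is not from the text; it reproduces the expected value and the distribution function of Examples 2.1 and 2.2 in Python/NumPy.

```python
import numpy as np

# Finite state space S = {s1, s2} with probabilities from Example 2.1
# and the umbrella seller's profit X from Example 2.2.
pi = np.array([0.8, 0.2])      # pi_1, pi_2
X = np.array([100.0, 10.0])    # X(s1), X(s2)

EX = np.sum(X * pi)            # expected value (2.1)
print("E(X) =", EX)            # 82.0

# Distribution function F(x) = Pr{s : X(s) <= x} at a few points.
for x in [5.0, 10.0, 50.0, 100.0]:
    F = pi[X <= x].sum()
    print(f"F({x}) = {F}")
```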

2.2 Stochastic Processes

Let Xt(s) be a random variable; then a collection {Xt : X0(s), X1(s), X2(s), · · · } is a univariate stochastic process. Here, if we take Xt as a random vector rather than a random variable, then it is a vector stochastic process. In general, {Xt(s) : t ∈ A} for any set A is a stochastic process. If A is a set of integers, then time is discrete. It is also possible to consider a continuous time stochastic process for which the time index takes any real value, {Xt(s) : t is a nonnegative real number}. It is sometimes more convenient to consider a stochastic process that starts from the infinite past, {· · · , X−2(s), X−1(s), X0(s), X1(s), X2(s), · · · }.

Note that once s is determined, the complete history of the stochastic process becomes known. In a sense, the stochastic process modeled in this manner is deterministic because everything is determined at the beginning of the world when s is determined. However, this does not mean that there is no uncertainty to economic agents because they do not learn s until the end of the world. For asymptotic theory, it is usually easier to think about the stochastic nature of economic variables this way rather than the alternative, which is to consider a probability space for each period based on independent disturbances.

When we observe a sample of size T of a random variable X or a random vector X: {X1, · · · , XT}, it is considered a particular realization of a part of the stochastic process. In order to illustrate this, let us consider the following example:

Example 2.3 Imagine an economy with three periods and six states of the world. The world begins in period 0. We observe two variables, aggregate output (Yt) and the interest rate (it), in period 1 and period 2. The world ends in period 2. In each period, Yt can take two values, 150 and 300, and it can take two values, 5 and 10. The six states of the world can be described by the triplet [Y1, i1, Y2]. We assume that i2 is equal to i1 in all states of the world, and that i1 = 5 in all states in which Y1 = 150. The six states of the world are s1 = [300, 10, 300], s2 = [300, 10, 150], s3 = [300, 5, 300], s4 = [300, 5, 150], s5 = [150, 5, 300], and s6 = [150, 5, 150]. For example, s1 means the economy is in a boom (higher output level) with a high interest rate in period 1, and is in a boom in period 2. In period 0, the economic agents assign a probability to each state: π1 = 0.20, π2 = 0.10, π3 = 0.15, π4 = 0.05, π5 = 0.15, and π6 = 0.35. Unconditional expected values are taken with these probabilities.

In this example, let Xt(s) = [Yt(s), it(s)]. Then [X1(s), X2(s)] is a stochastic process. The whole history of the process is determined at the beginning of the world when s is chosen. In each period, however, the agents only have partial information as to which state of the world is true, and they learn which state of the world they are in at the end of the world in period 2. To illustrate, if Y1 = 300 and i1 = 5, then in period 1 the agents learn that they are in either s3 or s4, but cannot tell which one they are in until they observe Y2 in period 2. This forecasting process can be modeled using conditional expectations.

2.3 Conditional Expectations

Economic agents use available information to learn the true state of the world and make forecasts of future economic variables. Information can be modeled as a partition of S into mutually exclusive subsets:

F = {Λ1, · · · , ΛM}, where Λ1 ∪ · · · ∪ ΛM = S, and Λj ∩ Λk = ∅ if j ≠ k. The information represented by F tells us which Λ contains the true s, but no further information is given by F. For example, suppose that information F consists of two subsets: F = {Λ1, Λ2}, where Λ1 = {s1, · · · , sM} and Λ2 = {sM+1, · · · , sN}. Once agents obtain the information represented by F, then the agents know which subset contains the true s, and they can assign a probability of zero to all elements in the other subset. There is no reason to change the ratios of probabilities assigned to the elements in the subset containing the true s. In this situation, the absolute level of each probability should be increased, so that the probabilities add up to one. The considerations given above lead to the following definition of conditional probability:

(2.3)  Pr{si |s ∈ Λj} = Pr{si} / Pr{s ∈ Λj}  when si is in Λj.

Here each probability is scaled by the probability of the subset containing the true s, so that the probabilities add up to one. The probability conditional on the information that the true s is in Λj is denoted by Pr{si |s ∈ Λj}.

We use conditional probability to define the conditional expectation. The expectation of a random variable Y conditional on the information that the true s is in Λj is

(2.4)  E(Y |s ∈ Λj) = Σ_{si ∈ Λj} Y(si) Pr{si} / Pr{s ∈ Λj},

where the summation is taken over all si in Λj.

It is convenient to view the conditional expectation as a random variable. For this purpose, the conditional expectation needs to be defined over all s in S, not just for s in a particular Λj. Given each s, we first find out which Λ contains s.

When Λj contains s, the expected value of Y conditional on F for s is given by E(Y |F)(s) = E(Y |s ∈ Λj).

Instead of a partition, we can use a random variable or a random vector to describe information. Consider information represented by a partition F = {Λ1, Λ2, · · · , ΛM}. Consider the set I, which consists of all random variables that take the same value for all elements in each Λj:

I = {X(s) : X(si) = X(sk) if si ∈ Λj and sk ∈ Λj for all i, j, k}.

Then the information set I represents the same information as F does.[2] A random variable X is said to be in this information set when X(si) = X(sk) if both si and sk are in the same Λj. When a random variable X is in the information set I, and if X takes on different values for all different Λ (X(si) ≠ X(sk) when si and sk are not in the same Λ), then we say that the random variable X generates the information set I. A random vector X is said to be in this information set when each element of X is in the information set. If a random vector X is in I, and if at least one element of X takes on different values for different Λ, then the random vector X is said to generate the information set I.

When a random variable X or a random vector X generates the information set I, which represents the same information as a partition F, we define E(Y |I) as E(Y |F). If I is generated by X, we define E(Y |X) = E(Y |I), and if I is generated by a random vector X, we define E(Y |X) = E(Y |I).

Example 2.4 Continuing Example 2.3, let I be the information set generated by X1 = (Y1, i1), and let F be the partition that represents the same information as I. Then F = {Λ1, Λ2, Λ3}, where Λ1 = {s1, s2}, Λ2 = {s3, s4}, and Λ3 = {s5, s6}.

[2] In the terminology of probability theory, instead of a partition, we consider a set of all possible unions of Λ's in F plus the null set. This set of subsets of S is called a σ-field, and is used to describe information. When a random variable X is in the information set I, we say that the random variable is measurable with respect to this σ-field.

Using (2.3), Pr(s1 |s ∈ Λ1) = 0.20/(0.20 + 0.10) = 2/3 and Pr(s2 |s ∈ Λ1) = 0.10/(0.20 + 0.10) = 1/3. Similarly, Pr(s3 |s ∈ Λ2) = 3/4, Pr(s4 |s ∈ Λ2) = 1/4, Pr(s5 |s ∈ Λ3) = 3/10, and Pr(s6 |s ∈ Λ3) = 7/10. Hence E(Y2 |s ∈ Λ1) = 300 × (2/3) + 150 × (1/3) = 250. Similarly, E(Y2 |s ∈ Λ2) = 262.5, and E(Y2 |s ∈ Λ3) = 195. Hence the random variable E(Y2 |I) is given by

(2.5)  E(Y2 |I)(s) = 250 if s ∈ Λ1, 262.5 if s ∈ Λ2, and 195 if s ∈ Λ3.

Example 2.5 Continuing Example 2.4, consider the information set J which is generated by Y1. Then J is a smaller information set than I in the sense that J ⊂ I. Similar computations as those in Example 2.4 yield

(2.6)  E(Y2 |J)(s) = 255 if s ∈ {s1, s2, s3, s4}, and 195 if s ∈ {s5, s6}.

Two properties of conditional expectations are very important in macroeconomics.

Proposition 2.1 (Properties of Conditional Expectations)
(a) If a random variable Z is in the information set I, then
(2.7)  E(ZY |I) = ZE(Y |I)
for any random variable Y with finite E(|Y |), assuming that E(|ZY |) is finite.
(b) The Law of Iterated Expectations: If the information set J is smaller than the information set I (J ⊂ I), then
(2.8)  E(Y |J) = E[E(Y |I)|J]
for any random variable Y with finite E(|Y |).
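For a finite probability space such as the one in Examples 2.3–2.5, these conditional expectations can be computed directly from (2.3) and (2.4). The following sketch is not part of the text; it does the computation in Python/NumPy with the state and probability values of Example 2.3 and also checks the Law of Iterated Expectations numerically.

```python
import numpy as np

# States s1..s6 described by [Y1, i1, Y2] and the period-0 probabilities
# from Example 2.3.
states = np.array([
    [300, 10, 300],   # s1
    [300, 10, 150],   # s2
    [300,  5, 300],   # s3
    [300,  5, 150],   # s4
    [150,  5, 300],   # s5
    [150,  5, 150],   # s6
])
pi = np.array([0.20, 0.10, 0.15, 0.05, 0.15, 0.35])
Y1, i1, Y2 = states[:, 0], states[:, 1], states[:, 2]

def cond_exp(Y, pi, labels):
    """E(Y | information) as a random variable: within each cell of the
    partition generated by `labels`, average Y with the scaled probabilities (2.3)."""
    out = np.empty_like(Y, dtype=float)
    for lab in np.unique(labels, axis=0):
        cell = (labels == lab).all(axis=1)
        out[cell] = np.sum(Y[cell] * pi[cell]) / pi[cell].sum()
    return out

# I is generated by X1 = (Y1, i1); J is generated by Y1 alone.
E_Y2_I = cond_exp(Y2, pi, np.column_stack([Y1, i1]))
E_Y2_J = cond_exp(Y2, pi, np.column_stack([Y1]))

print(E_Y2_I)   # 250, 250, 262.5, 262.5, 195, 195  -- equation (2.5)
print(E_Y2_J)   # 255, 255, 255, 255, 195, 195      -- equation (2.6)

# Law of Iterated Expectations: E[E(Y2|I)|J] = E(Y2|J), and E[E(Y2|I)] = E(Y2).
print(cond_exp(E_Y2_I, pi, np.column_stack([Y1])))  # same as E_Y2_J
print(np.sum(Y2 * pi), np.sum(E_Y2_I * pi))         # both equal E(Y2)
```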

Expectation can be viewed as a special case of conditional expectation in which the information set consists of constants. When we wish to emphasize the difference between expectations and conditional expectations, expectations are called unconditional expectations. Since a constant is a random variable which takes the same value for all states of the world, any information set includes all constants. Therefore, the Law of Iterated Expectations implies

(2.9)  E(Y) = E[E(Y |I)].

Relation (2.9) states that an unconditional expected value of a random variable Y can be computed as an unconditional expected value of the expectation of the random variable conditional on any information set. For a proof of Proposition 2.1 in the general case, see, e.g., Billingsley (1986, Theorem 34.3 and Theorem 34.4).

2.4 Stationary Stochastic Processes

A stochastic process {· · · , X−1, X0, X1, · · · } is strictly stationary if the joint distribution function of (Xt, Xt+1, · · · , Xt+h) is the same for all t = 0, ±1, ±2, · · · and all h = 0, 1, 2, · · · . A stochastic process {· · · , X−1, X0, X1, · · · } is covariance stationary (or weakly stationary) if Xt has finite second moments (E(XtXt′) < ∞) and if E(Xt) and E(XtXt−h′) do not depend on the date t for all t = 0, ±1, ±2, · · · and all h = 0, 1, 2, · · · . If Xt is covariance stationary, then its mean E(Xt) and its h-th autocovariance Φ(h) = E[(Xt − E(Xt))(Xt−h − E(Xt−h))′] = E(XtXt−h′) − E(Xt)E(Xt−h′) do not depend on date t. Because all moments are computed from distribution functions, if Xt is strictly stationary and has finite second moments, then it is also covariance stationary.

Proposition 2.2 If a k-dimensional vector stochastic process Xt is strictly stationary, and if a continuous function f(·) : R^k → R^p does not depend on date t, then f(Xt) is also strictly stationary.[3]

This follows from the fact that the distribution function of f(Xt), f(Xt+1), · · · , f(Xt+h) is determined by f and the joint distributions of Xt, Xt+1, · · · , Xt+h (see Appendix 2.A). Proposition 2.2 will be used frequently to derive the cointegrating properties of economic variables from economic models in Chapter 13.

The next proposition is for covariance stationary processes.

Proposition 2.3 If a k-dimensional vector stochastic process Xt is covariance stationary, and if a linear function f(·) : R^k → R^p does not depend on date t, then f(Xt) is also covariance stationary.

This proposition is true because f(Xt) has finite second moments, and the first and second moments of f(Xt) do not depend on date t. However, unlike Proposition 2.2 for strictly stationary processes, a nonlinear function of a covariance stationary process may not be covariance stationary. For example, suppose that Xt is covariance stationary. Imagine that Xt's variance is finite but E(|Xt|^4) = ∞. Consider Zt = f(Xt) = (Xt)^2. Then Zt's variance is not finite, and hence Zt is not covariance stationary.

In order to model strictly stationary and covariance stationary processes, it is convenient to consider white noise processes.

[3] This proposition holds for any measurable function f(·) : R^k → R^p (see Appendix 2.A). The term "measurable" is avoided because this book does not require knowledge of measure theory. All continuous functions are measurable but not vice versa. Thus the continuity condition in Proposition 2.2 is more stringent than necessary. This is not a problem for the purpose of this book because continuous functions are used in all applications of this proposition.

A univariate stochastic process {et : t = · · · , −1, 0, 1, · · · } is white noise if E(et) = 0 and

(2.10)  E(et ej) = σ^2 if t = j, and 0 if t ≠ j,

where σ is a constant. In these definitions, et can be a vector white noise process. For a vector white noise, we require

(2.11)  E(et ej′) = Σ if t = j, and 0 if t ≠ j,

where Σ is a matrix of constants. A white noise process is covariance stationary. All linear functions of white noise random variables are covariance stationary because of Proposition 2.3.

If a process is independent and identically distributed (i.i.d.), then it is strictly stationary. In addition, by Proposition 2.2, all functions of i.i.d. white noise random variables are strictly stationary. The simplest example of an i.i.d. process is an i.i.d. white noise process. A Gaussian white noise process {et : −∞ < t < ∞} is an i.i.d. white noise process for which et is normally distributed with zero mean.

Example 2.6 Let Xt = δ + et, where et is a white noise process, and δ is a constant. Then E(Xt) = δ, and Xt is covariance stationary. If et is an i.i.d. white noise process, then Xt is strictly stationary.

If Xt is strictly stationary with finite second moments, then it is also covariance stationary. Therefore, Xt's first and second moments cannot depend on date t. In empirical work, the easiest case to see that an observed variable is not strictly stationary is when a variable's mean shifts upward or downward over time. A simple example of this case is:

Example 2.7 Let Xt = δ + θt + et, where et is a Gaussian white noise random variable and δ and θ ≠ 0 are constants. Then Xt is not stationary because E(Xt) = δ + θt depends on time.[4]

Strictly stationary and covariance stationary processes can be serially correlated, that is, their h-th order autocovariances can be nonzero for h ≠ 0, as in the next two examples.

Example 2.8 (The first order Moving Average Process) Let Xt = δ + et + Bet−1, where et is an i.i.d. white noise, and δ and B are constants. This is a moving average process of order 1 (see Chapter 4). Then Xt is covariance stationary for any B because of Proposition 2.3,[5] E(Xt) = δ, and its h-th autocovariance is

(2.12)  φh = E[(Xt − δ)(Xt−h − δ)] = σ^2(1 + B^2) if h = 0, Bσ^2 if |h| = 1, and 0 if |h| > 1.

In this example, if et is an i.i.d. white noise, then Xt is strictly stationary.

Example 2.9 (The first order Autoregressive Process) Consider a process Xt which is generated from an initial random variable X0 by

(2.13)  Xt = AXt−1 + et  for t ≥ 1,

where et is a white noise which satisfies (2.10), and A is a constant. This is an autoregressive process of order 1 (see Chapter 4). If |A| < 1 and X0 is a normally distributed random variable with mean zero and variance of Var(et)/(1 − A^2), then Xt is strictly stationary (see Exercise 2.3).

[4] Because Xt is stationary after removing a deterministic trend in this example, we say that Xt is trend stationary, as we will discuss in Chapter 11. Trend stationarity is a way to model nonstationarity.
[5] Even though Xt is stationary for any B, it is often convenient to impose a restriction |B| ≤ 1 as explained in Chapter 4.
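The moments in (2.12) and the stationary initial condition used in Example 2.9 are easy to check by simulation. The sketch below is not from the text; it uses Python/NumPy with hypothetical values of δ, B, σ, and A.

```python
import numpy as np

rng = np.random.default_rng(0)
T, delta, B, sigma = 1_000_000, 1.0, 0.6, 1.0   # hypothetical values

# MA(1) of Example 2.8: X_t = delta + e_t + B e_{t-1} with i.i.d. white noise e_t.
e = sigma * rng.standard_normal(T + 1)
X = delta + e[1:] + B * e[:-1]

def autocov(x, h):
    x = x - x.mean()
    return np.mean(x[h:] * x[:len(x) - h])

# Compare sample autocovariances with (2.12).
print(autocov(X, 0), sigma**2 * (1 + B**2))   # h = 0
print(autocov(X, 1), B * sigma**2)            # |h| = 1
print(autocov(X, 2), 0.0)                     # |h| > 1

# AR(1) of Example 2.9: X_t = A X_{t-1} + e_t with X_0 ~ N(0, var(e)/(1 - A^2)).
A = 0.8
n_paths, n_periods = 50_000, 40
x = np.sqrt(sigma**2 / (1 - A**2)) * rng.standard_normal(n_paths)
for _ in range(n_periods):
    x = A * x + sigma * rng.standard_normal(n_paths)
# With the stationary initial condition, the variance stays at var(e)/(1 - A^2).
print(x.var(), sigma**2 / (1 - A**2))
```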

The methods explained in Chapter 4 can be used to show that Xt is not strictly stationary when X0's distribution is different from the one given above.

2.5 Conditional Heteroskedasticity

Using conditional expectations, we can define variance and covariance conditional on an information set just as we use unconditional expectations to define (unconditional) variance and covariance. The variance of Y conditional on an information set I is

(2.14)  Var(Y |I) = E[(Y − E(Y |I))^2 |I],

and the covariance of X and Y conditional on an information set I is

(2.15)  Cov(X, Y |I) = E[(X − E(X|I))(Y − E(Y |I))|I].

Consider a stochastic process [Yt : t ≥ 1]. If the unconditional variance of Yt, Var(Yt), depends on date t, then Yt is said to be heteroskedastic; if not, it is homoskedastic. If Yt's variance conditional on an information set It, Var(Yt |It), is constant and does not depend on the information set, then Yt is said to be conditionally homoskedastic; if not, it is conditionally heteroskedastic. A heteroskedastic process is not strictly stationary because its variance depends on date t.

Example 2.10 Let Yt = δ + ht et, where et is an i.i.d. white noise with unit variance (E(et^2) = 1), and {ht : −∞ < t < ∞} is a sequence of real numbers. Then the (unconditional) variance of Yt is ht^2, and Yt is heteroskedastic as long as ht ≠ hj for some t and j.

It should be noted, however, that a strictly stationary random variable can be conditionally heteroskedastic. This fact is important because many of the financial time series have been found to be conditionally heteroskedastic. For example, the growth rates of asset prices and foreign exchange rates can be reasonably modeled as strictly stationary processes. However, the volatility of such a growth rate at a point in time tends to be high if it has been high in the recent past. Therefore, such a growth rate is often modeled as a conditionally heteroskedastic process. A popular method to model conditional heteroskedasticity, introduced by Engle (1982), is an autoregressive conditional heteroskedastic (ARCH) process. The following is a simple example of an ARCH process.

Example 2.11 (An ARCH Process) Let It be an information set, and et be a univariate stochastic process such that et is in It and E(et |It−1) = 0. Assume that

(2.16)  et^2 = η + α et−1^2 + wt,

where η > 0 and wt is another white noise process in It with E(wt |It−1) = 0 and

(2.17)  E(wk wj |It) = λ^2 if k = j, and 0 if k ≠ j,

where λ is a constant. Relation (2.16) implies that et's conditional variance depends on It−1:

(2.18)  E(et^2 |It−1) = η + α et−1^2,

and thus et is conditionally heteroskedastic.

In order to see whether or not et's unconditional variance is constant over time, take expectations of both sides of (2.18) to obtain

(2.19)  E(et^2) = η + αE(et−1^2).

Hence if the variance of et is a constant σ^2, then σ^2 = η + ασ^2, and σ^2 = η/(1 − α). Because σ^2 is positive, this equation implies that α < 1. When α < 1, an ARCH process can be covariance stationary and strictly stationary.
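A short simulation (not from the text) illustrates these properties. It builds a series consistent with (2.18) by setting et = sqrt(η + α et−1^2) zt with i.i.d. standard normal zt (a standard way to generate such a process), using hypothetical values of η and α, and then checks that the sample variance is close to η/(1 − α) and that et^2, unlike et itself, is correlated with its own past.

```python
import numpy as np

# Hypothetical ARCH(1) parameters; conditional variance follows (2.18).
eta, alpha = 0.5, 0.5
rng = np.random.default_rng(0)

T, burn = 200_000, 1_000
z = rng.standard_normal(T + burn)
e = np.zeros(T + burn)
for t in range(1, T + burn):
    e[t] = np.sqrt(eta + alpha * e[t - 1] ** 2) * z[t]
e = e[burn:]

# Unconditional variance should be close to eta / (1 - alpha) when alpha < 1.
print("sample variance   :", e.var())
print("eta / (1 - alpha) :", eta / (1 - alpha))

# Conditional heteroskedasticity: squared e_t is predictable from its own past,
# even though e_t itself is serially uncorrelated.
def corr(x, y):
    return np.corrcoef(x, y)[0, 1]

print("corr(e_t, e_{t-1})    :", corr(e[1:], e[:-1]))
print("corr(e_t^2, e_{t-1}^2):", corr(e[1:] ** 2, e[:-1] ** 2))
```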

2.6 Martingales and Random Walks

As we will discuss later in this book, most of the rational expectations models imply that certain variables are martingales. Consider a stochastic process [Yt : −∞ < t < ∞], and a sequence of information sets [It : −∞ < t < ∞] which is increasing (It ⊂ It+1). If Yt is in It and if

(2.20)  E(Yt+1 |It) = Yt,

then Yt is a martingale adapted to It. Consider a stochastic process [et : −∞ < t < ∞], and a sequence of information sets [It : −∞ < t < ∞] that is increasing (It ⊂ It+1). If et is in It and if

(2.21)  E(et+1 |It) = 0,

then et is a martingale difference sequence adapted to It. If Yt is a martingale adapted to It, then et = Yt − Yt−1 is a martingale difference sequence (see Exercise 2.6). If Yt is a martingale adapted to It and if its conditional variance, E((Yt+1 − Yt)^2 |It), is constant (that is, Yt is conditionally homoskedastic), then Yt is a random walk.

Rational expectations often imply that an economic variable is a martingale (see Section 3.2). The models typically do not imply that the variables are conditionally homoskedastic, and hence do not imply that they are random walks. It is often easier to test whether or not the variable is a random walk than to test whether or not it is a martingale. However, if the data for the variable does not show signs of conditional heteroskedasticity, then we may test whether or not a variable is a random walk.
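The distinction between a martingale and a random walk can be seen in a simulation. The sketch below is not from the text; it reuses ARCH-type innovations as in Example 2.11 (with hypothetical η and α). Their cumulative sum is a martingale whose increments are serially uncorrelated, but the squared increments are predictable, so the process is not a random walk; replacing the innovations with i.i.d. noise gives a random walk.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 200_000

# A martingale difference sequence that is NOT conditionally homoskedastic:
# ARCH-type innovations as in Example 2.11 (eta, alpha hypothetical).
eta, alpha = 0.5, 0.5
z = rng.standard_normal(T)
e = np.zeros(T)
for t in range(1, T):
    e[t] = np.sqrt(eta + alpha * e[t - 1] ** 2) * z[t]

Y = np.cumsum(e)          # Y_t is a martingale adapted to I_t = {e_1, ..., e_t}
dY = Y[1:] - Y[:-1]

# Martingale property: increments are unpredictable in the mean ...
print(np.corrcoef(dY[1:], dY[:-1])[0, 1])            # close to 0
# ... but the conditional variance of the increments is not constant,
# so Y_t is a martingale without being a random walk:
print(np.corrcoef(dY[1:] ** 2, dY[:-1] ** 2)[0, 1])  # clearly positive

# Replacing e_t with i.i.d. noise gives a random walk (conditionally homoskedastic).
W = np.cumsum(rng.standard_normal(T))
dW = W[1:] - W[:-1]
print(np.corrcoef(dW[1:] ** 2, dW[:-1] ** 2)[0, 1])  # close to 0
```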

In these definitions, a martingale or a martingale difference sequence can be a vector stochastic process. An i.i.d. white noise process is a martingale difference sequence (see Exercise 2.4). A covariance stationary martingale difference sequence is a white noise process (see Exercise 2.5). However, a white noise process may not be a martingale difference sequence for any sequence of information sets.

Appendix 2.A  A Review of Measure Theory

Let S be an arbitrary nonempty set of points s. An event is a subset of S. A set of subsets is called a class. A class F of subsets of S is called a field if
(i) S ∈ F;
(ii) A ∈ F implies Ac ∈ F, where Ac is the complement of A;
(iii) A, B ∈ F implies A ∪ B ∈ F.
A class F is a σ-field if it is a field and if
(iv) A1, A2, · · · ∈ F implies A1 ∪ A2 ∪ · · · ∈ F.

A set function is a real-valued function defined on some class of subsets of S. A set function Pr on a field F is a probability measure if it satisfies these conditions:
(i) 0 ≤ Pr(A) ≤ 1 for A ∈ F;
(ii) Pr(∅) = 0, Pr(S) = 1;

(iii) if A1, A2, · · · is a disjoint sequence of F-sets and if ∪_{k=1}^∞ Ak ∈ F, then Pr(∪_{k=1}^∞ Ak) = Σ_{k=1}^∞ Pr(Ak).

If F is a σ-field in S and Pr is a probability measure on F, the triple (S, F, Pr) is a probability space.

Proposition 2.A.1 A probability measure on a field has a unique extension to the generated σ-field.

Given a class A, consider the class which is the intersection of all σ-fields containing A. This class is the smallest σ-field which contains A, and is called the σ-field generated by A and is denoted by σ(A). In Euclidean k-space R^k, consider the class of the bounded rectangles [x = (x1, · · · , xk) : ai ≤ xi ≤ bi, i = 1, · · · , k]. The σ-field generated from this class is called the k-dimensional Borel sets, and denoted by R^k.

For a mapping T : S → S′, consider the inverse images T^{-1}(A′) = [s ∈ S : T(s) ∈ A′]. Let F be a σ-field of subsets of S and F′ be a σ-field of subsets of S′. The mapping T is measurable F/F′ if T^{-1}(A′) ∈ F for each A′ ∈ F′. For a real-valued function f, the image space S′ is the line R^1, and in this case R^1 is always tacitly understood to play the role of F′. A real-valued function on S is measurable F (or simply measurable when it is clear from the context what F is involved) if it is measurable F/R^1. If (S, F, Pr) is a probability space, then a real-valued measurable function is called a random variable. For a random variable X, we can assign a probability to the event that X(s) belongs to a Borel set B by Pr(X^{-1}(B)).

.A. . .. then g(X) is an i-dimensional random vector. . e2 . P r) is a probability space. We now introduce two definitions of conditional expectation. Let S be a probability space. We are concerned with a linear model of the form: y = Xb0 + e. where b0 is a K × 1 vector of real numbers. XT )0 be a T × K matrix of random variables. . then it is measurable.2..2 can be proven by taking X = [Yt0 . F be a σ-field of S. then a measurable mapping X : S 7−→ Rk is called a random vector. Proposition 2. Such functions are called Borel functions. It is known that X is a random vector if and only if each component of X is a random variable... the distribution of 0 g(X) is µg −1 . · · · .. y2 . If the distribution of X is µ.A. eT ) be T × 1 vectors of random variables. A REVIEW OF MEASURE THEORY 21 For a mapping f : S 7−→ Rk . and g : Rj 7−→ Ri is measurable. Yt+k ]0 . the conditional Gauss-Markov theorem is obtained by stating all assumptions and results of the Gauss-Markov theorem conditional on the stochastic regressors. Proposition 2.. One definition is standard in measure theory. X2 . If (S. Formally. Intuitively. yT ) and e = (e1 . The other definition is given because it is convenient for the purpose of stating a version of the conditional Gauss-Markov theorem used in this book. Rk is always understood to be the σ-field in the image space. and Pr be a probability measure defined on F. it is necessary to make sure that the conditional expectations of the relevant variables are well defined.. Let y = (y1 .2 If f : Ri 7−→ Rk is continuous. which will be the regressor matrix of the regression to be considered. Let X = (X1 . The random variables we will consider in this section are defined on this probability space. F. If X is a j-dimensional random vector. A mapping f : Ri 7−→ Rk is defined to be measurable if it is measurable Ri /Rk ..

. G There will in general be many such random variables which satisfy these two properties.6 E[Z|σ(X)] is a random variable. e. and σ(X) be the smallest σ-field with respect to which of the random variables in X are measurable. 1986). Let Z be an integrable random variable (namely. we use the notation E[Z|σ(X)] to denote the usual conditional expectation of Z conditional on X as defined by Billingsley (1986) for a random variable Z. Billingsley.22 CHAPTER 2. This condition is too restrictive when we define 6 If z is a vector. It satisfies the following two properties: (i) E(Z|σ(X)) is measurable and integrable given σ(X). namely E(|Z|) < ∞. the conditional expectation is taken for each element in z. Any two versions are equal with probability 1.A.A. E(|Z|) < ∞). Throughout this paper.g. the OLS estimator is bT = (X0 X)−1 X0 y.. G ∈ σ(X). STOCHASTIC PROCESSES For s such that X(s)0 X(s) is nonsingular. (2.2) E(Z|σ(X))dPr = G ZdPr. The standard definition of the expectation of Z given X is obtained by applying the Radon-Nikodym theorem (see. It should be noted that this definition is given under the condition that Z is integrable.1) In order to apply a conditional version of the Gauss-Markov Theorem. it is necessary to define the expectation and variance of bT conditional on X. (ii) E(Z|σ(X)) satisfies the functional equation: Z Z (2. and E[Z|σ(X)]s denotes the value of the random variable at s in S. any one of them is called a version of E(Z|σ(X)).

This condition is too restrictive when we define the conditional expectation and variance of the OLS estimator in many applications,[7] because the moments of (X′X)^{-1} may not be finite even when X has many finite moments. Thus, it is difficult to confirm that E(bT|σ(X)) can be defined in each application even if X is normally distributed. Judge et al. (1985) conclude that the Gauss-Markov theorem based on E(·|σ(X)) is not very useful. For this reason, we avoid this problem by adopting a different definition of conditional expectation based on conditional distribution.

For this purpose, we first define conditional probabilities following Billingsley (1986). Given A in F, define a finite measure v on σ(X) by v(G) = Pr(A ∩ G) for G in σ(X). Then Pr(G) = 0 implies that v(G) = 0. The Radon-Nikodym theorem can be applied to the measures v and Pr, and there exists a random variable f that is measurable and integrable with respect to Pr, such that Pr(A ∩ G) = ∫_G f dPr for all G in σ(X). Denote this random variable by Pr(A|σ(X)). This random variable satisfies these two properties:
(i) Pr(A|σ(X)) is measurable and integrable given σ(X).
(ii) Pr(A|σ(X)) satisfies the functional equation

(2.A.3)  ∫_G Pr(A|σ(X)) dPr = Pr(A ∩ G),  G ∈ σ(X).

There will in general be many such random variables, but any two of them are equal with probability 1. A specific such random variable is called a version of the conditional probability.

Given a random variable Z, we define a conditional distribution µ(·, s) given X for each s in S.

[7] Loeve (1978) slightly relaxes this restriction by defining the conditional expectation for any random variable whose expectation exists (but may not be finite) with an extension of the Radon-Nikodym theorem, but this slight relaxation does not solve our problem.

Let R^1 be the σ-field of the Borel sets in R^1. By Theorem 33.3 in Billingsley (1986, p.460), there exists a function µ(H, s), defined for H in R^1 and s in S, with these two properties:
(i) For each s in S, µ(H, s) is, as a function of H, a probability measure on R^1.
(ii) For each H in R^1, µ(H, s) is, as a function of s, a version of Pr(Z ∈ H|σ(X))s.

For each s in S, we define E(Z|X)s to be ∫_{R^1} z µ(dz, s). If ∫_{R^1} |z| µ(dz, s) is finite, then E(Z|X)s is said to exist and be finite. In general, E(Z|X)s may not even exist for some s. It should be noted that E(Z|X)s does not necessarily satisfy the usual properties of conditional expectation such as the law of iterated expectations. E(Z|X)s is identical for all s in X^{-1}(x). Therefore, given a T × K matrix of real numbers x, we define E(Z|X = x) as E(Z|X)s for s in X^{-1}(x). This is the definition of the conditional expectation of Z given X = x in this paper.

We are concerned with a linear model of the form:

Assumption 2.A.1  y = Xb0 + e,

where b0 is a K × 1 vector of real numbers. Next, we assume that the conditional expectation of e given X = x is zero:

Assumption 2.A.2  E[e|X = x] = 0.

Given a T × K matrix of real numbers x, we assume that e is homoskedastic and et is not serially correlated given X = x:

Assumption 2.A.3  E[ee′|X = x] = σ^2 I_T.

The OLS estimator can be expressed by (2.A.1) for all s in X^{-1}(x) when the next assumption is satisfied:

Assumption 2.A.4  x′x is nonsingular.

A. Applying any of the standard proofs of the (unconditional) Gauss-Markov theorem can prove this theorem by replacing the unconditional expectation with E(·|X = x). Assumption 2.A.4. The conditional version of the Best Linear Unbiased Estimator (BLUE) given X = x can be defined as follows: An estimator bT for b0 is BLUE conditional on X = x if (1) bT is linear conditional on X = x. Given E(e) = 0.A. namely. With these preparations.20 E[e|σ(X)] = 0. Since E[e|σ(X)] is defined only when each element of e is integrable. bT can be written as bT = Ay for all s in X−1 (x) where A is a K×T matrix of real numbers. namely.A.4. (3) for any linear unbiased estimator b∗ conditional on X = x.A.20 implicitly assumes that E(e) exists and is finite. namely.1 (The Conditional Gauss-Markov Theorem) Under Assumptions 2. Modifying some assumptions and adding another yields the textbook version of the conditional Gauss-Markov theorem based on E(·|σ(X)).20 does not imply that X is statistically independent of e. It also implies E(e) = 0 because of the law of iterated expectations.20 is that X is statistically independent of e.A. the OLS estimator is BLUE conditional on X = x.1– 2.1–2. Assumption 2. E(bT |X = x) = b.A.20 .2.A. E[(bT − b0 )(bT − b0 )0 |X = x] ≤ E[(b∗ − b0 )(b∗ − b0 )0 |X = x].A.A. Assumption 2. E[bT |X = x] = b0 and E[(bT − b0 )0 (bT − b0 )|X = x] = σ 2 (x0 x)−1 . a sufficient condition for Assumption 2.A. E[(b∗ −b0 )(b∗ −b0 )0 |X(s) = x]−E[(bT −b0 )(bT −b0 )0 |X(s) = x] is a positive semidefinite matrix. A REVIEW OF MEASURE THEORY 25 Under Assumptions 2. (2) bT is unbiased conditional on X = x. Since Assumption 2. the following theorem can be stated: Theorem 2.

For this purpose.A.A.A.20 –2. The next assumption replaces Assumption 2. and 2.5.1.26 CHAPTER 2. 2.5 E[trace((X0 X)−1 X0 ee0 X(X0 X)−1 )] exists and is finite. However. we consider the following assumption: Assumption 2.A. we assume that e is conditionally homoskedastic and et is not serially correlated: Assumption 2. Under Assumptions 2. E[(bT − b0 )0 (bT − b0 )|σ(X)] can be defined.5 is that it is not easy to verify the assumption for many distributions of X and e that are often used in applications and Monte Carlo studies.A.A. We now consider a different definition of the conditional version of the Best Linear Unbiased Estimator (BLUE). and E[(bT − b0 )0 (bT − b0 )|σ(X)] = E[(X0 X)−1 X0 ee0 X(X0 X)−1 |σ(X)] = (X0 X)−1 X0 E[ee0 |σ(X)]X(X0 X)−1 = σ 2 (X0 X)−1 .40 X0 X is nonsingular with probability one. a sufficient condition for Assumption 2. E(bT |σ(X)) = b0 + E[(X0 X)−1 X0 e|σ(X)] = b0 . From Assumption 2. Assumption 2. STOCHASTIC PROCESSES is weaker than the assumption of independent stochastic regressors.1. Hence we can prove a version of the conditional Gauss-Markov theorem based on E(·|σ(X)) when the expectations of (X0 X)−1 X0 e and (X0 X)−1 X0 ee0 X(X0 X)−1 exist and are finite.40 .A. Moreover.A. The Best Linear Unbiased Estimator (BLUE) . With the next assumption.A.30 E[ee0 |σ(X)] = σ 2 IT .A.5 is that the distributions of X and e have finite supports. The problem with Assumption 2. bT = b0 + (X0 X)−1 X0 e.4.A.

and 2. E(bT |σ(X)) = b0 . (2) bT is unbiased conditional on σ(X) in G. 2. it is unconditionally unbiased and has the minimum unconditional covariance matrix among all linear unbiased estimators conditional on σ(X). (1/T )X0 X converges almost surely to E[X0t Xt ] if Xt is stationary and ergodic. the OLS estimator is BLUE conditional on σ(X). . the covariance matrix of bT is σ 2 E[(X0 X)−1 ]. An estimator bT for b0 is BLUE conditional on σ(X) in H if (1) bT is linear conditional on σ(X). but it does not.40 . e follows a multivariate normal distribution. is equal to the asymptotic covariance matrix. Section 6.A.A.6 Conditional on X. This property may seem to contradict the standard asymptotic theory.5. σ 2 [E(X0t Xt )]−1 . E[(b∗ − b0 )(b∗ − b0 )0 |σ(X)] − E[(bT − b0 )(bT − b0 )0 |σ(X)] is a positive semidefinite matrix with probability 1.A. Proof The proof of this proposition is given in Greene (1997. and each element of A is measurable given σ(X). bT can be written as bT = Ay where A is a K ×T matrix. which is different from σ 2 [E(X0 X)]−1 . (3) for any linear unbiased estimator b∗ conditional on σ(X) for which E(b∗ b∗ 0 ) exists and is finite. Hence the limit of the covariance matrix of √ T (bT − b0 ). In this proposition. namely. Proposition 2. Asymptotically.7).A.A.20 –2.A. Moreover. equivalently. A REVIEW OF MEASURE THEORY 27 conditional on σ(X) is defined as follows. namely.A. In order to study the distributions of the t ratios and F test statistics we need an additional assumption: Assumption 2.3 Under Assumptions 2.2. E[(bT − b0 )(bT − b0 )0 |σ(X)] ≤ E[(b∗ − b0 )(b∗ − b0 )0 |σ(X)] with probability 1. σ 2 E[{(1/T )(X0 X)}−1 ].1.
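The contrast drawn above between the exact covariance matrix σ²E[(X'X)⁻¹] and the asymptotic approximation based on σ²[E(X't Xt)]⁻¹ can be checked by simulation. This is a rough sketch under the stated homoskedasticity assumptions; the regressor design and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
T, sigma, reps = 50, 1.0, 20000

avg_inv = np.zeros((2, 2))
for _ in range(reps):
    X = np.column_stack([np.ones(T), rng.normal(size=T)])   # stochastic regressors, redrawn each replication
    avg_inv += np.linalg.inv(X.T @ X)
exact = sigma**2 * avg_inv / reps                            # estimates sigma^2 E[(X'X)^{-1}]

EXtXt = np.eye(2)                                            # E[X_t' X_t] for this particular design
asym = sigma**2 * np.linalg.inv(EXtXt) / T                   # sigma^2 [E(X_t' X_t)]^{-1} / T

print("sigma^2 E[(X'X)^{-1}] (simulated):\n", exact)
print("asymptotic approximation:\n", asym)
```

The two matrices are close for this design but not identical, which is the point made in the text: the exact finite-sample covariance differs from its asymptotic counterpart.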

Proposition 2. and 2. STOCHASTIC PROCESSES Given a 1 × K vector of real numbers R.A.2 and 2. With the standard argument. 2.4 implies the following proposition: Proposition 2.1–2. consider a random variable (2.A.5) tR = R(bT − b0 ) σ ˆ [R(X0 X)−1 R]1/2 .A.5. 2.20 .A.A.A. Since NR and tR follow a standard normal distribution and a t distribution conditional on X. respectively.6. respectively.4) NR = R(bT − b0 ) σ[R(X0 X)−1 R]1/2 and the usual t ratio for Rb0 (2. This proposition is obtained by integrating the probability density function conditional on Q over all possible values of the random variables in Q.A.A. under either Assumptions 2.A.4 If the probability density function of a random variable Z conditional on a random vector Q does not depend on the values of Q.30 .A.28 CHAPTER 2. .A.6 or Assumptions 2.A. and 2.A.6 are satisfied and that Assumptions 2.A.A.1. then the marginal probability density function of Z is equal to the probability density function of Z conditional on Q. Proposition 2.3 are satisfied for all x in a set H such that P r(X−1 (H)) = 1. The following proposition is useful in order to derive the unconditional distributions of these statistics.A. NR and tR can be shown to follow the standard normal distribution and Student’s t distribution with T − K degrees of freedom with appropriate conditioning.A.5 Suppose that Assumptions 2.5–2. Then NR is a standard normal random variable and tR is a t random variable with T − K degrees of freedom. 2.1. Here σ ˆ is the positive square root of σ ˆ 2 = (y − XbT )0 (y − XbT )/(T − K).
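A small Monte Carlo sketch of the claim about tR: under normal errors, the t ratio should behave like a Student-t variable with T − K degrees of freedom. NumPy is assumed; the design, parameter values, and the restriction vector R below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
T, K = 30, 2
b0 = np.array([1.0, 0.5])
R = np.array([0.0, 1.0])                 # picks out the second coefficient, so Rb0 = 0.5

t_draws = []
for _ in range(20000):
    X = np.column_stack([np.ones(T), rng.normal(size=T)])
    e = rng.normal(size=T)               # normally distributed errors, as in Assumption 2.A.6
    y = X @ b0 + e
    bT = np.linalg.solve(X.T @ X, X.T @ y)
    resid = y - X @ bT
    sigma2_hat = resid @ resid / (T - K)                     # sigma-hat squared
    se = np.sqrt(sigma2_hat * R @ np.linalg.inv(X.T @ X) @ R)
    t_draws.append(R @ (bT - b0) / se)

# the empirical 95% quantile should be close to the Student-t quantile with
# T - K = 28 degrees of freedom, which is about 1.70
print(np.quantile(t_draws, 0.95))
```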

2.A.4. cT .1.A.A.A.2 and 2. and 2.6 are satisfied for s and that Assumptions 2. Similarly. · · · . Fix s.164).A.A.A. · · · } by interpreting |cT − c| as the Euclidean distance (cT − c)0 (cT − c). 2. there exists an N such that |cT − c| < ε for all T ≥ N .2. Consider a univariate stochastic process [XT : T ≥ 1].3 are satisfied for all x in a set H such that P r(X−1 (H)) = 1.5.50 Suppose that Assumptions 2. · · · be a sequence of real numbers and c be a real number. However.B Convergence in Probability Let c1 . Judge et al. The sequence is said to converge to c if for any ε. CONVERGENCE IN PROBABILITY 29 Alternatively. the usual t and F test statistics have the usual (unconditional) distributions as a result of Proposition 2. (1985. cT . We write cT → c or limT →∞ cT = c. 2. Then NR is a standard normal random variable and tR is a t random variable with T − K degrees of freedom. states that “our usual test statistics do not hold in finite samples” on the ground that the (unconditional) distribution of b0T s is not normal. For example.B. and a random variable X.20 –2.A. · · · .A. the assumptions for Proposition 2.6 can be used to obtain a similar result: Proposition 2. These results are sometimes not well understood by econometricians.A. so it does not follow a normal distribution even if X and e are both normally distributed. and then [XT (s) : T ≥ 1] is a sequence of real numbers and X(s) is a real . c2 . p. It is true that bT is a nonlinear function of X and e.3 with Assumption 2. a standard textbook.A. the usual F test statistics also follow (unconditional) F distributions. This definition is extended to a sequence of vectors of real numbers {c1 .30 . c2 .

Slutsky’s Theorem is important for working with probability limits. then XT is said to converge in distribution to X. if a property holds for all s except for a set of s with probability zero. and a random variable X with respective distribution functions FT and F . we say the sequence of random variables.30 CHAPTER 2.1 Convergence in Distribution Consider a univariate stochastic process [XT : T ≥ 1]. verify whether or not XT (s) → X(s). STOCHASTIC PROCESSES number. [XT : T ≥ 1]. 2. If Ω has finite elements. converges to X almost surely or with probability one. It states that. This extension to the vector case is done by using the Euclidean distance. The sequence of random variables [XT : T ≥ 1] converges in probability to the random variable XT if. however. and calculate the probability that XT (s) → X(s).B. Almost sure convergence implies convergence in probability. For each s. In general. In general. This is P expressed by writing XT → c or plimT →∞ XT = X. almost sure convergence is the same thing as convergence of XT (s) to X(s) in all states of the world. limT →∞ P rob(|XT − X| > ε) = 0. Then collect s such that XT (s) → X(s). almost sure convergence does not imply convergence in all states. If the probability is one. This definition is extended to a sequence of random vectors by using convergence for a sequence of vectors for each s. we say that the property holds almost surely or with probability one. The distribution F is called the asymptotic distribution or the . this is expressed by D writing XT → X. then plim(f (XT )) = f (plim(XT )). We write XT → X almost surely. if plimXT = X and if f (·) is a continuous function. for all ε > 0. If FT (x) → F (x) for every continuity point x of F .
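A quick numerical illustration of these convergence concepts for the sample mean of i.i.d. draws, assuming NumPy; the distribution and tolerance are hypothetical choices. The probability that the sample mean deviates from the population mean by more than a fixed amount shrinks with T (convergence in probability), while the standardized mean keeps a stable, approximately normal spread (convergence in distribution).

```python
import numpy as np

rng = np.random.default_rng(3)
mu = 1.0
for T in (10, 100, 1000, 5000):
    X = rng.exponential(scale=mu, size=(1000, T))   # 1000 realizations of T i.i.d. draws with mean mu
    xbar = X.mean(axis=1)
    prob = np.mean(np.abs(xbar - mu) > 0.1)          # shrinks toward 0: convergence in probability
    z = np.sqrt(T) * (xbar - mu)                     # converges in distribution to N(0, 1) for this example
    print(T, "P(|xbar - mu| > 0.1) =", prob, "  std of sqrt(T)(xbar - mu) =", round(z.std(), 3))
```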

15. · · · .3. compute E(Y2 |I)(s) and E(Y2 |J)(s). The vector version of this proposition is: Proposition 2. Verify that E(Y2 |J)(s) = E(E(Y2 |I)|J)(s) for all s ∈ S.2 Propositions 2.B. .05. Exercises 2.20. and if the absolute value of the i-th row of a sequence of a k × k matrix of real P∞ numbers {Aj }∞ j=0 is square summable for i = 1. then Xt = j=0 Aj Xt−j is covariance stationary. As in Example 2. the infinite sum j=0 aj Xt−j means P the convergence in mean square of Tj=0 aj Xt−j as T goes to infinity. In the following propositions. We need the convergence concepts explained in ApP∞ 2 pendix 2.1 If Xt is a scalar covariance stationary process. π4 = 0.’s (Incomplete) In Propositions 2.30. P∞ P∞ j=0 |aj | is finite.B.2 If Xt is a k-dimensional vector covariance stationary process.4. we are often interested in infinite sums of covariance or strictly stationary random variables.B. that is. 2.2. π3 = 0. Then compute E(E(Y2 |I)|J)(s). A sequence of real numbers {aj }∞ j=0 is square summable if j=0 aj is finite.1 In Example 2.3 for Infinite Numbers of R. assume that π1 = 0. π5 = 0.B.10. and if {aj }∞ j=0 is P square summable.2 and 2.B.2 and 2. In many applications.20. Proposition 2. π2 = 0.3. A sufficient condition for {aj }∞ j=0 is that it is absolutely summable. then X = ∞ j=0 aj Xt−j is covariance stationary.V. we only allow for a finite number of random variables. and π6 = 0. k. CONVERGENCE IN PROBABILITY 31 limiting distribution of XT .
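The mean-square convergence behind these infinite sums can be illustrated by checking that partial sums with square-summable coefficients behave like a Cauchy sequence in mean square. A sketch assuming NumPy; the coefficients 0.9^j and the white noise input are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)
reps, maxT = 2000, 400
X = rng.normal(size=(reps, maxT + 1))      # many realizations of a white noise (covariance stationary) input
a = 0.9 ** np.arange(maxT + 1)             # square-summable (indeed absolutely summable) coefficients

# column T of `partial` holds the partial sum sum_{j=0}^{T} a_j X_{t-j} for each realization
partial = np.cumsum(a * X[:, ::-1], axis=1)
for T1, T2 in [(10, 20), (40, 80), (160, 320)]:
    msd = np.mean((partial[:, T2] - partial[:, T1]) ** 2)
    print(f"mean square difference between partial sums at T={T1} and T={T2}: {msd:.5f}")
```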

2.2 In Example 2.9, assume that |A| < 1. This condition does not ensure that Yt is strictly stationary. In order to see this, suppose that Y0 = 0. Then compute the expected values of Y1 and Y2 and the variance of Y1 and Y2, and show that Yt is not strictly stationary if A ≠ 0.

2.3 In Example 2.9, assume that |A| < 1 and that Y0 is N(0, σ²/(1 − A²)). Then compute the expected values of Y1 and Y2, the variance of Y1 and Y2, and the k-th autocovariance of Yt. Prove that Yt is strictly stationary in this case. (Hint: Remember that first and second moments completely determine the joint distribution of jointly normally distributed random variables.)

2.4 Let Yt be a martingale adapted to It. Then prove that et = Yt − Yt−1 is a martingale difference sequence.

2.5 Prove that a covariance stationary martingale difference sequence is a white noise process.

2.6 Prove that an i.i.d. white noise process is a martingale difference sequence.

References

Billingsley, P. (1986): Probability and Measure, 2nd edn. Wiley, New York.

Engle, R. F. (1982): “Autoregressive Conditional Heteroscedasticity with Estimates of the Variance of United Kingdom Inflation,” Econometrica, 50(4), 987–1008.

Greene, W. H. (1997): Econometric Analysis, 3rd edn. Prentice-Hall, New York.

Judge, G. G., W. E. Griffiths, R. C. Hill, H. Lütkepohl, and T. C. Lee (1985): The Theory and Practice of Econometrics, 2nd edn. Wiley, New York.

Loeve, M. (1978): Probability Theory, 4th edn. Springer-Verlag, New York.

Chapter 3

FORECASTING

3.1 Projections

In macroeconomics, forecasting is important in many ways. For structural macroeconomic models, we usually need to specify the forecasting rules that economic agents are using and the information set used by them to forecast future economic variables. Taking the conditional expectation is one way to model forecasting. This method generally requires nonlinear forecasting rules which are difficult to estimate. For the purpose of testing the models and parameter estimation, it is sometimes possible for an econometrician to use a simpler forecasting rule and a smaller information set. In this section, we study projections as a forecasting method. Projections are used to explain the Wold representation, which forms a basis for studying linear and nonlinear stochastic processes.

3.1.1 Definitions and Properties of Projections

In this chapter, we consider random variables with finite second moments unless otherwise noted. We consider the problem of forecasting y, using a set H of random variables. Typically, y is a future random variable such as the growth rate of the Gross Domestic Product (GDP) or the growth rate of a stock price, and H contains

2) The expression (3. FORECASTING current and past economic variables that are observed by economic agents and/or econometricians. E[(y − y f )2 ] ≤ E[(y − h)2 ]. When such a forecast exists. so that the forecasting error is y − y f . and for all h in H.34 CHAPTER 3. and that y f is the minimizer if and only if the forecasting error is orthogonal to all members of H: (3. y f . Luenberger.g. When either h1 or h2 has mean zero. we choose the forecast. Let us denote a forecast of y based on H by y f . Under certain conditions on H. .1) is called the mean squared error associated with the forecast. y f . ˆ we call the forecast. When Y is a random vector with finite second moments. 1969) states that there exists a unique random variable y f in H that minimizes the mean squared error.3) E(h1 h2 ) = 0. and denote it by E(y|H). orthogonality means that they are uncorrelated. y f .1) In other words. we apply the projection to each ˆ element of Y and write E(Y|H). (3.. the Classical Projection Theorem (see. y f is in H. The concept of orthogonality is closely related to the problem of minimizing the mean squared error. so that y f minimizes E[(y − y f )2 ]. e. this is called the orthogonality condition. (3. In most economic applications. a projection of y onto H. When two random variables h1 and h2 satisfy (3. they are said to be orthogonal to each other.4) E((y − y f )h) = 0 for all h in H.
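A finite-sample sketch of the projection and its orthogonality condition: project y onto the linear span of a constant and two random variables by least squares, and check that the forecast error is (approximately) orthogonal to every element of H and that no other forecast in H has a smaller mean squared error. NumPy is assumed and the data-generating process is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 100000
h1 = rng.normal(size=n)
h2 = rng.normal(size=n)
y = 2.0 * h1 - h2 + h1**2 + rng.normal(size=n)     # y is a nonlinear function of (h1, h2) plus noise

H = np.column_stack([np.ones(n), h1, h2])          # the information set: constants, h1, h2
b = np.linalg.solve(H.T @ H, H.T @ y)
yf = H @ b                                          # the projection of y onto H
err = y - yf

print("E[(y - yf) h] for each h in H:", (H * err[:, None]).mean(axis=0))    # approximately zero
print("MSE of the projection:", np.mean(err**2))
print("MSE of another forecast in H:", np.mean((y - (yf + 0.1 * h1))**2))   # at least as large
```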

then ˆ ˆ |H).1.1 (Properties of Projections) ˆ ˆ ˆ |H) for any random (a) Projections are linear: E(aX + bY |H) = aE(X|H) + bE(Y variables. We write E(y|H) = E(y|X). the projection based on such ˆ ˆ an information set is called a linear projection. Since E(y|H) is also a member of H. X and Y . Let X be a p × 1 vector of random variables with finite second moments. PROJECTIONS 35 Some properties of projections are very important: Proposition 3. with finite variance and constants. ˆ In this sense. then ˆ |H) = E[ ˆ E(Y ˆ |G)|H].1. there exists b0 such that (3. (b) If a random variable Z is in the information set H. Let H = {h is a random variable such that h = X0 b for some p-dimensional ˆ vector of real numbers b}. E(ZY |H) = Z E(Y (c) The Law of Iterated Projections: If the information set H is smaller than the information set G (H ⊂ G). a and b. E(Y 3.2 Linear Projections and Conditional Expectations The meaning of projection depends on how the information set H used for the projection is constructed.5) ˆ E(y|H) = X 0 b0 . which only allows for linear forecasting rules. . When we use an information set such as H.3. E(y|H) uses a linear forecasting rule.

E(y|H ) allows for a nonlinear forecasting rule. When it is necessary to distinguish the information set I generated by X for conditional expectations introduced in Chapter 2 and the information set H generated Masao needs to check this! by X for linear projections. we require that the function f is measurable. An important special case is when y and X are jointly normally distributed.36 CHAPTER 3.9) 1 E[(y − X0 b0 )Xi ] = 0 As in Proposition 2. (3. Hence the linear projection of y onto the information set generated by X is equal to the expectation of y conditional on X.2.8) E[(y − X0 b0 )h] = 0 for any h in H. we obtain (3. In this case. For this reason. . there exists a function f0 (·) such that N ˆ E(y|H ) = f0 (X). It can be shown that N ˆ E(y|H ) = E(y|X). H will be called the linear information set generated by X. using the i-th element Xi for h. (3. the projections we use in this book are linear projections unless otherwise noted. (????? Unclear! from Billy) Linear projections are important because it is easy to estimate them in many applications. the expectation of y conditional on X is a linear function of X. FORECASTING Let HN = {h is a random variable with a finite variance such that h = f (X) for a function f }. Since each element of X is in H.6) N ˆ In this sense. Note that the orthogonality condition states that (3.1 In this case.7) Hence the projection and conditional expectation coincide when we allow for nonlinear forecasting rules.

SOME APPLICATIONS OF CONDITIONAL EXPECTATIONS AND PROJECTIONS37 for i = 1. 3.1 Volatility Tests Many rational expectations models imply (3. 3. Assuming that E(XX0 ) is nonsingular.11) E(Xy) = E(XX0 )b0 . As we will discuss.10) Therefore (3. 2.2. p. · · · . we obtain (3.12) b0 = E(XX0 )−1 E(Xy) and ˆ t |Ht ) = X0t b0 . Ordinary Least Squares (OLS) can be used to estimate b0 .2. or E[X(y − X0 b0 )] = 0. if Xt and yt are strictly stationary. In this chapter.13) where Ht is the linear information set generated by Xt .2 Some Applications of Conditional Expectations and Projections This section presents some applications of conditional expectations and projections in order to illustrate their use in macroeconomics. all random variables are assumed to have finite second moments. More explanations of some of these applications and presentations of other applications will be given in later chapters.14) Xt = E(Yt |It ) . E(y (3. (3.3.

14) implies (3. Since E(²2t ) ≥ 0. Therefore.16) E(²t ht ) = 0 for any random variable ht that is in It . Since Xt is in It . Since E(²t |It ) = 0. we conclude (3. LeRoy and Porter (1981) and Shiller (1981) started to apply volatility tests to the present value model of stock prices.38 CHAPTER 3. Thus. We can interpret (3.14) can be obtained by comparing the volatility of Xt with that of Yt . Here Xt is in the information set It which is available to the economic agents at date t while Yt is not. A testable implication of (3. (3. Relation (3.19) V ar(Yt ) ≥ V ar(Xt ). Since (3.15) we obtain (3.16) implies E(²t Xt ) = 0.15) Yt = Xt + ²t where ²t = Yt − E(Yt |It ) is the forecast error.17) E(Yt2 ) = E(Xt2 ) + E(²2t ).14). (3.18) V ar(Yt ) = V ar(Xt ) + E(²2t ). (3. The forecast error must be uncorrelated with any variable in the information set.17) implies (3.16) as an orthogonality condition. FORECASTING for economic variables Xt and Yt . from (3. Various volatility tests have been developed to test this implication of (3. if Xt forecasts Yt .14) implies that E(Xt ) = E(Yt ). Let pt be the real stock price (after the . Xt must be less volatile than Yt .
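A sketch of the volatility comparison Var(Yt) ≥ Var(Xt): generate Yt so that part of it is unforecastable at date t, let Xt be the date-t conditional expectation, and compare sample variances; the orthogonality of Xt and the forecast error can be checked at the same time. The data-generating process is hypothetical and NumPy is assumed.

```python
import numpy as np

rng = np.random.default_rng(6)
T = 50000
z = rng.normal(size=T + 1)
Y = 0.8 * z[:-1] + z[1:]        # Y_t = 0.8 z_t + z_{t+1}: the second term is unknown at date t
X = 0.8 * z[:-1]                # X_t = E(Y_t | I_t) when I_t contains z_t but not z_{t+1}
eps = Y - X                     # forecast error

print("Var(Y):", Y.var(), ">= Var(X):", X.var())
print("Cov(X, eps):", np.cov(X, eps)[0, 1])   # approximately zero: the orthogonality condition
```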

we use (3. · · · . ∞ i=1 b dt+i involves infinitely many data points for the dividend. Then the no-arbitrage condition is (3. consider E(Y |I) for a random variable Y and an information set I generated from a random variable X.21) i=1 Applying the volatility test. and MacKinlay (1997). and compare them to form a test statistic.20) forward and imposing the no bubble condition. conditional expectations allow for nonlinear forecasting rules.1.20) pt = E[b(pt+1 + dt+1 )|It ]. see Campbell. we conclude that the variance of P∞ i=1 bi dt+i is greater than or equal to the variance of pt . .2. We can estimate the variance of pt and the variance of Yt .2 2 Parameterizing Expectations As discussed in Section 3. and It is the information set available to economic agents in period t. T . i=1 Let Yt = PT −t i=1 bi dt+i + bT −t pT . For example. Then E(Y |I) can be written as a function of 2 There are some problems with this procedure. One problem is nonstationarity of pt and Yt . 3. One way to test this is to directly P i estimate these variances and compare them. we obtain the present value formula: ∞ X pt = E( bi dt+i |It ). Solving (3. SOME APPLICATIONS 39 dividend is paid) in period t and dt be the real dividend paid to the owner of the stock at the beginning of period t.3. where b is the constant real discount rate. For more detailed explanation of volatility tests. Lo.22) pt = E( T −t X bi dt+i + bT −t pT |It ). Then we have data on Yt from t = 1 to t = T when we choose a reasonable number for the discount rate b. (3.2. However.21) to obtain (3. When we have data for the stock price and dividend for t = 1.
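The construction of Yt from price and dividend data and a chosen discount rate b can be coded directly. This is a sketch with hypothetical series and discount rate, and it ignores the nonstationarity caveat mentioned in the footnote.

```python
import numpy as np

def present_value_Y(p, d, b):
    """Y_t = sum_{i=1}^{T-t} b^i d_{t+i} + b^{T-t} p_T  (0-indexed time here)."""
    T = len(p)
    Y = np.zeros(T)
    for t in range(T):
        horizon = T - 1 - t
        discounts = b ** np.arange(1, horizon + 1)
        Y[t] = discounts @ d[t + 1: T] + b**horizon * p[T - 1]
    return Y

# hypothetical data and discount rate
rng = np.random.default_rng(7)
d = 1.0 + 0.1 * rng.normal(size=200)                     # dividend series
p = np.array([d[t:].mean() * 20 for t in range(200)])    # placeholder price series
Y = present_value_Y(p, d, 0.95)
print("Var(Y):", Y.var(), " Var(p):", p.var())           # the volatility-test comparison
```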

and thus minimizes the mean square error. E[(Y −fN (X))2 ]. In order to simulate rational expectations models. For example. 1990) is based on the fact that the conditional expectation is a projection. . In most applications involving nonlinear forecasting rules. Durlauf and Hall (1990) and Durlauf and Maccini (1995) propose such a measure called the noise ratio. Given that all economic models are meant to be approximations. · · · . aN to minimize the mean square error. we often test an economic model with test statistics whose probability distributions are known under the null hypothesis that the model is true. however. it seems desirable to measure how good a model is in approximating reality. take a class of polynomial functions and let fN (X) = a0 + a1 X + a2 X 2 + · · · + aN X N . which will be discussed in Chapter 9. Marcet’s (1989) parameterizing expectations method (also see den Haan and Marcet. We choose a0 . For example. The function f (·) can be nonlinear here. Intuitively.3 Consider an economic model which states (3. This method is used to simulate economic models with rational expectations. fN (·) should approximate f (X) for a large enough N . it is often necessary to have a method to estimate f (·). 3.3 Noise Ratio In econometrics. is an example.40 CHAPTER 3.23) E(g(Y)|I) = 0 for an information set I and a function g(·) of a random vector Y. F be the forward 3 See Konuki (1999) for an application of the noise ratio to foreign exchange rate models. We take a class of functions that approximate any function. let S be the spot exchange rate of a currency in the next period. FORECASTING X : E(Y |I) = f (X). the functional form of f (·) is not known.2. Hansen’s J test.
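A sketch of the approximation step in the parameterizing expectations idea: choosing a0, ..., aN to minimize the mean squared error E[(Y − fN(X))²] is a least-squares regression of Y on powers of X. The nonlinear conditional expectation used here is a hypothetical example; NumPy is assumed.

```python
import numpy as np

rng = np.random.default_rng(8)
n = 20000
X = rng.uniform(-1, 1, size=n)
Y = np.exp(X) + 0.1 * rng.normal(size=n)        # so E(Y|X) = exp(X), a nonlinear function of X

for N in (1, 2, 4):
    P = np.column_stack([X**j for j in range(N + 1)])    # the polynomial terms 1, X, ..., X^N
    a = np.linalg.lstsq(P, Y, rcond=None)[0]             # a_0, ..., a_N minimizing the sample MSE
    mse = np.mean((Y - P @ a)**2)
    print(f"N = {N}: mean squared error = {mse:.5f}")    # approaches Var(Y - E(Y|X)) = 0.01 as N grows
```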

we assumed that the constants are included in H. . E(N 2 ) = E[(E(N ˆ |H))2 ]+ and the forecast error. noise. is defined by N R = η . Thus.2. (3. Since we do not know much about the model ˆ |H) = N . The noise ratio. Because N = E(N ˆ |H) + (N − E(N ˆ |H)). Let ν = g(Y) − E(g(Y)|I). N . Therefore. If N happens to be in H. g(S. is orthogonal to E(N ˆ |H). in this case V ar(N ) = η. then E(N Therefore.24) 4 ˆ V ar(g(Y)) = η + V ar(g(Y) − E(g(Y)|H)). then g(Y) = ν. some lagged values of S − F and a constant can be used to generate a linear information set H. E(g(Y)|H) = ˆ |H). Then under the assumption of risk neutral investors. however. we obtain (3. which is called the model noise. N R. E[(N − E(N ˆ |H))2 ] − {E[E(N ˆ |H)]}2 = η.4 ˆ ˆ Using the law of iterated projections5 . Since this model is an approximation. In a sense. g(Y) deviates from ν. where H is an information set generated from some variables in I. For example. Therefore. and therefore η = V ar(E(N ˆ |H)). F ) = S − F . we have E(ν|H) = 0. E(N ˆ |H). E(N 2 ) ≥ E[(E(N ˆ |H))2 ]. Since E[(N − E(N ˆ |H))2 ] ≥ 0.23). N −E(N ˆ |H))2 ]. it may or may not be in H. 5 We assume that the second moment exists and is finite. and I be the information set available to the economic agents today. 6 ˆ Here. so that E(S) = E[E(S|H)]. Let N be the deviation: N = g(Y) − ν.3.6 Thus η is a V ar(N ) = E(N 2 ) − (E(N ))2 ≥ E[(E(N lower bound of V ar(N ). If the model is true. the conditional expectation is a projection. SOME APPLICATIONS 41 exchange rate observed today for the currency to be delivered in the next period. V ar(g(Y)) ˆ Since E(g(Y)|H) is orthog- ˆ onal to g(Y) − E(g(Y)|H). in the forward exchange rate model mentioned above. η is a sharp lower bound. A natural measure of how well the model approximates reality is V ar(N ). Durlauf and Hall (1990) propose a method ˆ to estimate a lower bound of V ar(N ) using η = V ar(E(g(Y)|H)).
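A sketch of estimating the lower bound η = Var(Ê(g(Y)|H)) in the forward-rate example: regress gt = St+1 − Ft on a constant and its own lags, take the variance of the fitted values as an estimate of η, and divide by the sample variance of g to form the noise ratio (reading (3.24) and the statement that 0 ≤ NR ≤ 1 as implying the normalization by Var(g(Y))). The series, lag length, and function names below are hypothetical.

```python
import numpy as np

def noise_ratio(g, lags=4):
    """eta-hat = variance of the fitted values from regressing g_t on a constant and lagged g's;
    the returned value is eta-hat / Var(g), a sketch of the lower-bound idea described above."""
    gdep = g[lags:]
    H = np.column_stack([np.ones(len(gdep))] +
                        [g[lags - j - 1: len(g) - j - 1] for j in range(lags)])  # g_{t-1}, ..., g_{t-lags}
    b = np.linalg.lstsq(H, gdep, rcond=None)[0]
    eta_hat = (H @ b).var()
    return eta_hat / gdep.var()

# hypothetical serially correlated series standing in for S_{t+1} - F_t
rng = np.random.default_rng(9)
e = rng.normal(size=2000)
g = e + 0.3 * np.roll(e, 1)
print("estimated noise ratio:", noise_ratio(g))
```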

For example.7 Projections explained in this chapter are defined in a Hilbert space. A Hilbert space is a complete pre-Hilbert space.A. FORECASTING Therefore. Since a Hilbert space is complete. Section 3. we will consider another Hilbert space.A Introduction to Hilbert Space This Appendix explains Hilbert space techniques used in this book. the 0 ≤ N R ≤ 1. The inner product is used to define a distance.42 CHAPTER 3.1 reviews definitions regarding vector spaces.2 gives an introduction to Hilbert space. A pre-Hilbert space is a vector space on which an inner product is defined. . One reason why a Hilbert space is useful is that the notion of orthogonality can be defined with the inner product. this technique can be used to prove that a projection can be defined. In Appendix B.A. then it is said to be complete. Appendix 3. we can prove that the limit of a sequence exists once we prove that the sequence is Cauchy. which provides the foundation for the lag operator methods and the frequency domain analysis which are useful in macroeconomics and time series economics. Section 3. 7 All proofs of the results can be found in Luenberger (1969) or Hansen and Sargent (1991). If all Cauchy sequences of a preHilbert space converge.

A.A. together with two operations (addition and scalar multiplication) which satisfy the following conditions: For any x. 1x = x.3. In this Appendix. For this purpose. we use linear combinations of vectors in H.A.A. x2 . ¾ α(x + y) = αx + αy (3.2) (x + y) + z = x + (y + z) (commutative law) (associative law) (3. INTRODUCTION TO HILBERT SPACE 3.A. but state results that are applicable when K = C .4)-(3. A nonempty subset H of a vector space X is called a (linear) subspace of X if every vector of the form αx + βy is in H whenever x and y are both in H and α and β are in K .6) for any field K is a vector space on K . a vector space (or a linear space) X on K is a set of elements. or the complex plane.A. C )8 .6) 0x = 0. we require (3. it is often convenient to construct the smallest subspace containing H.A. . Examples of vector spaces on C are given in Appendix B. A linear combination of the vectors x1 . A subspace always contains the null vector 0. xn is a sum of the form 8 In general. an additive group X for which scalar multiplication satisfies (3.5) (αβ)x = α(βx) (3. · · · .1)-(3.A.1 43 Vector Spaces Given a set of scalars K (either the real line.A. called vectors. we give examples of vector spaces on R.3) There is a null vector 0 in X such that x + 0 = x for all x in X . we define x − y = x + (−1)y.A.6). β in K .1) x+y =y+x (3. y.A. z in X and for any α. Hence a subspace is itself a vector space. R.A. (associative law) Using α = −1.4) (distributive laws) (α + β)x = αx + βx (3. and satisfies conditions (3. If a subset H of X is not a subspace. In this book K is either the real line or the complex plane.

Example 3. m > N . is a vector space on K = R with addition and scalar multiplication defined in the usual way. yn )0 . The norm is a real-valued function that maps each element of x in X into a real number kxk. · · · . In a normed vector space. (The triangle inequality) A norm can be used to define a metric d on X by d(x. which is a vector space on R when x+y for x = (x1 . The set consisting of all vectors in X which are linear combinations of vectors in H is called the (linear) subspace generated by H. When the norm of a real number is defined as its absolute value. x2 . there exists an integer N such that kxn − xm k < ² for all n.A. A space in which every Cauchy sequence has a limit is said to be complete. y) = kx − yk. · · · .2 Vectors in the space consist of sequences of n real numbers. Example 3. A complete normed vector space is called a Banach space. every convergent sequence is a Cauchy sequence.A. (3. Rn . n). A sequence {xn }∞ n=1 in a normed vector space converges to x0 if the sequence {kxn − x0 k}∞ n=1 of real numbers converges to zero. R. y2 . FORECASTING α1 x1 + α2 x2 + · · · + αn xn where αi is a scalar (i = 1.A. R is a Banach space.7) kxk ≥ 0 for all x in X and kxk = 0 if and only if x = 0.8) kx + yk ≤ kxk + kyk (3.44 CHAPTER 3.9) kαxk = |α| kxk for all α in K and x in X .1 The real line.A. · · · .A. A normed vector space is a vector space X on which a norm is defined. xn )0 and y = (y1 . which is denoted by xn → x0 or lim xn = x0 . A sequence {xn }∞ n=1 in a normed vector space is a Cauchy sequence if for any ² > 0. which satisfies (3.

The following Hilbert space of random variables with finite second moments is the one we used in Chapter 3. The inner product is a scalar-valued function that maps each element of (x. INTRODUCTION TO HILBERT SPACE 45 is defined by (x1 +y1 . Let L2 (P rob) = {h : h is a (real-valued) random variable and E(|h|2 ) < ∞}. y. A norm can be defined from an inner product by kxk = p (x|x). Then with an inner product .A. P rob) be a probability space. which can be ignored if K is R. (x|x) is real for each x even when K is C .13) (x|x) ≥ 0 and (x|x) = 0 if and only if x = 0.A. y) in X × X into an element (x|y) in K . Rn is a Hilbert space on R.A.A. · · · .3 When we define (x|y) = Pn i=1 xi yi .A. The bar on the right side on (3. which satisfies (3.A. · · · .2 Hilbert Space A pre-Hilbert space is a vector space X on K for which an inner product is defined.3. Example 3. A complete pre-Hilbert space is called a Hilbert space. αx2 .10) (x|y) = (y|x) (3.A. x2 +y2 . αxn )0 .A. for any x. 3. z in X and α in K .11) (x + z |y) = (x|y) + (z|y) (3. Thus a pre- Hilbert space is a normed vector space.A.10). By (3. R is a Banach space.12) (αx|y) = α(x|y) (3.4 Let (S. F. Example 3.A.10) denotes complex conjugation. pPn n 2 When we define a norm of x as kxk = i=1 xi . xn +yn )0 and αx for α in R is defined by (αx1 .

Proposition 3.49). two vectors x and y are said to be orthogonal if (x|y) = 0.A. Equality holds if and only if x = λy for some λ in K . Some useful results concerning the inner product are:9 Proposition 3.3 If x is orthogonal to y in a Hilbert space.A.A. y in a Hilbert space. Proposition 3. or y = 0. In a Hilbert space. then E[(x + y)2 ] = E(x2 ) + E(y 2 ).47 and p. if E[(h1 − h2 )2 ] = 0. Then (xn |yn ) → (x|y). the distance is defined by the mean square.5 In L2 (P rob). Example 3. then h1 and h2 are the same element in this space. . In other words. Hence this definition does not cause problems for most purposes. Proposition 3. |(x|y)| ≤ kxk kyk. If two different random variables h1 and h2 satisfy E[(h1 − h2 )2 ] = 0. A vector x is said to be orthogonal to a set H if x is orthogonal to each element h in H. FORECASTING defined by (h1 |h2 ) = E(h1 h2 ). See Luenberger (1969. One reason why an inner product is useful is that we can define the notion of orthogonality. L2 (P rob) is a Hilbert space on R. Projections can be defined on a Hilbert space due to the following result: 9 These three propositions hold for a pre-Hilbert space. then h1 = h2 with probability one. In this space. so the convergence in this space is the convergence in mean square. the Cauchy-Schwarz Inequality becomes |E(xy)| ≤ p p E(x2 ) E(y 2 ) for any random variables with finite second moments.1 (The Cauchy-Schwarz Inequality) For all x.46 CHAPTER 3. then kx + yk2 = kxk2 + kyk2 .3 states that if x and y satisfy E(xy) = 0. p.2 (Continuity of the Inner Product) Suppose that xn → x and yn → y in a Hilbert space.A.A.

A. A necessary and sufficient condition for an infinite series of orthonormal sequence to converge in Hilbert space is known (see Luenberger. E(x|H) is the projection of x onto H. p.1 in L2 (P rob) is one example.A.3. ˆ ˆ Given a closed linear space H. If a sequence {et }∞ t=1 in a Hilbert space satisfies ket k = 1 for all t and (et |es ) = 0 for all t 6= s. and in that case we have αj = (x|ej ). Example 3. we obtain a necessary P and sufficient condition for an MA(∞) representation ∞ j=0 bj vt−j to converge for a ∞ with E(vt2 ) = σv2 > 0. σv and αj = bj σv . From . there is a unique vector h0 in H such that kx − h0 k ≤ kx − hk. 1969. we define a function E(·|H) on X by E(x|H) = h0 ˆ where h0 is an element in H such that x − h0 is orthogonal to H. then it is said to be an orthonormal sequence. a necessary and sufficient condition that h0 in H be the unique minimizing vector is that x − h0 be orthogonal to H. A P P∞ 2 series of the form ∞ j=1 αj ej converges to an element x in X if and only if j=1 |αj | < ∞.A. An infinite series of the form t=1 xt is said to converge to the element x in a Hilbert space if the sequence of partial sums PT P∞ sT = t=1 xt converges to x.59): Proposition 3. We are concerned with P∞ P∞ an infinite series of the form t=1 αt et . INTRODUCTION TO HILBERT SPACE 47 Proposition 3. Corresponding to any vector x in X .4 (The Classical Projection Theorem) Let X be a Hilbert space and H be a closed linear subspace of X . The projection defined in Section 3.6 Applying the above proposition in L2 (P rob). Furthermore.A.5 Let {ej }∞ j=1 be an orthonormal sequence in a Hilbert space X . 2 so that {et−j }∞ j=0 is orthonormal because E(et ) = 1 and E(et es ) = 0 for t 6= s. In that case we write x = t=1 xt . Define et = white noise process {vt−j }j=0 vt .

we started from a square summable P sequence {αj } and constructed x = ∞ j=1 αj ej in X in the above proposition. The difference vector x− x is orthogonal to M . x is not equal to its Fourier series.14) ∞ X (x|ej )ej . j=1 bj vj converges in mean square if and only if {bj }∞ j=1 is square summable. Given an orthonormal sequence {ej }∞ j=1 . and (x|ej ) is called the Fourier coefficient of x with respect to ej . If x is in M.6 Let x be an element in a Hilbert space X and {ej }∞ j=1 be an P orthonormal sequence in H. We now start with a given x in X and consider a series (3. This proposition shows that the Fourier series of x is the projection of x onto M : P 10 ˆ E(x|M )= ∞ j=1 (x|ej )ej .1 Let St be a spot exchange rate at time t and Ft be a forward exchange rate 10 See Luenberger (1969. if and only if P∞ P∞ P∞ P∞ 2 2 2 j=1 |αj | < ∞.48 CHAPTER 3. In general.A.60). the closed subspace generated by H is the closure of the linear subspace generated by H. Then the Fourier series ∞ j=1 (x|ej )ej converges to an ˆ in the closed subspace M generated by {ej }∞ ˆ element x j=1 . Given a subset H of a Hilbert space. Exercises 3. .A. then x is equal to its Fourier series as implied by the next proposition: Proposition 3. j=1 The series is called the Fourier series of x relative to {ej }∞ j=1 . Since j=1 |αj | < ∞ if and only if j=1 |bj | < ∞. ∞ j=1 bj vj = j=1 αj ej converges in L (P rob). Let M be the closed subspace generated by {ej }∞ j=1 . p. FORECASTING P P∞ 2 the above proposition.

2 Let in. τ =1 where It is the information set available in period t that includes the present and past ˆ values of pt and dt .3. Assume that Ft = E(St+1 |It ) where It is the information set available for the economic agents at t. Let E(·|H t ) be the linear projection onto an information set Ht . and b be the constant ex ante discount rate.t be the n year interest rate observed at time t.E. Define the model noise Nt by (3. 3.t+τ .A. Does the data support the expectations theory? Explain your answer.3) Nt = pt − pet .t ) ≤ V ar(At ). INTRODUCTION TO HILBERT SPACE 49 observed at time t for delivery of one unit of a currency at t + 1. Assume that pt and dt are stationary with zero mean and finite second moments. Let (3.2) pet = ∞ X bτ E(dt+τ |It ).t = E(At |It ) where n−1 At = (3.1) 1X i1.3 Let pt be the real stock price.E.E. n τ =0 where It is the information available at time t. 3. The expectations hypothesis of the term structure of interest rates states that in. Imagine that data on interest rates clearly indicate that V ar(in. Show that η ≤ V ar(Nt ) for any noise Nt . ˆ t |Ht )). Prove that V ar(Ft ) ≤ V ar(St+1 ). Let η = V ar(E(N (a) Assume that Ht is generated by {dt }. . dt be the real dividend.

(b) Assume that Ht is generated by {dt, dt−1, dt−2}. Show that η ≤ Var(Nt) for any noise Nt.

3.4 Derive (3.24) in the text.

References

Campbell, J. Y., A. W. Lo, and A. C. MacKinlay (1997): The Econometrics of Financial Markets. Princeton University Press, Princeton, New Jersey.

den Haan, W. J., and A. Marcet (1990): “Solving the Stochastic Growth Model by Parameterizing Expectations,” Journal of Business and Economic Statistics, 8, 31–34.

Durlauf, S. N., and R. E. Hall (1990): “Bounds on the Variances of Specification Errors in Models with Expectations,” Manuscript.

Durlauf, S. N., and L. J. Maccini (1995): “Measuring Noise in Inventory Models,” Journal of Monetary Economics, 36, 65–89.

Hansen, L. P., and T. J. Sargent (1991): Rational Expectations Econometrics. Westview, London.

Konuki, T. (1999): “Measuring Noise in Exchange Rate Models,” Journal of International Economics, 48(2), 255–270.

LeRoy, S. F., and R. D. Porter (1981): “The Present-Value Relation: Tests Based on Implied Variance Bounds,” Econometrica, 49(3), 555–574.

Luenberger, D. G. (1969): Optimization by Vector Space Methods. Wiley, New York.

Marcet, A. (1989): “Solving Non-Linear Stochastic Models by Parameterizing Expectations,” Manuscript.

Shiller, R. J. (1981): “Do Stock Prices Move Too Much to be Justified by Subsequent Changes in Dividends?,” American Economic Review, 71, 421–436.

If the process is covariance stationary. Xt−j ) p . γ0 where γj = Cov(Xt . γj = γ−j . Xt−j ) = p In general. ρj depends on t. ρj does not depend on t. V ar(Xt ) V ar(Xt−j ) Corr(Xt . The j-th autocorrelation of a process (denoted by ρj ) is defined as the correlation between Xt and Xt−j : Cov(Xt .1 Autocorrelation The Wold representation of a univariate process {Xt : −∞ < t < ∞} provides us with a description of how future values of Xt depend on its current and past values (in the sense of linear projections). For covariance stationary processes. When we view ρj as a function of j. Note that ρ0 = 1 for any process by 51 . hence ρj = ρ−j . Xt−j ) is the j-th autocovariance. and is equal to its j-th autocovariance divided by its variance: (4. A useful description of this dependence is autocorrelation.Chapter 4 ARMA AND VECTOR AUTOREGRESSION REPRESENTATIONS 4. it is called the autocorrelation function.1) ρj = γj . and γ0 = V ar(Xt ).
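The autocorrelation function defined in (4.1) can be estimated by replacing population moments with sample averages. A sketch assuming NumPy, using hypothetical AR(1) data whose population autocorrelations are 0.7^j.

```python
import numpy as np

def sample_autocorr(x, maxlag):
    """rho_hat_j = sum (x_t - xbar)(x_{t-j} - xbar) / sum (x_t - xbar)^2, for j = 0, ..., maxlag."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    denom = x @ x
    return np.array([1.0] + [(x[j:] @ x[:-j]) / denom for j in range(1, maxlag + 1)])

# hypothetical AR(1) data with coefficient 0.7
rng = np.random.default_rng(10)
e = rng.normal(size=5000)
X = np.zeros(5000)
for t in range(1, 5000):
    X[t] = 0.7 * X[t - 1] + e[t]
print(sample_autocorr(X, 5))     # roughly 1.00, 0.70, 0.49, 0.34, 0.24, 0.17
```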

denoted by the symbol L. it results in a new sequence {Yt : −∞ < t < ∞}. it is convenient to use the lag operator.52 CHAPTER 4.2) LXt = Xt−1 . When the operator is applied to a sequence {Xt : −∞ < t < ∞} of real numbers. We define a p-th order polynomial in the lag operator B(L) = B0 +B1 L+B2 L2 + · · · + Bp Lp . where B1 . for any integer k > 0. ARMA AND VAR definition. the lag operator is applied to all sequences of real numbers {Xt (ω) : −∞ < t < ∞} given by fixing the state of the world ω to generate a new stochastic process {Xt : −∞ < t < ∞} that satisfies Xt−1 (ω) = LXt (ω) for each ω. and can be estimated by its sample counterpart as explained in Chapter 5. · · · . as the operator that yields B(L)Xt = (B0 + B1 L + B2 L2 + · · · + Bp Lp )Xt = B0 Xt + B1 Xt−1 + · · · + Bp Xt−p . ρj = 0 for j 6= 0. 4. where the value of Y at date t is equal to the value X at date t − 1: Yt = Xt−1 . It is convenient to define L0 = 1 as the identity operator that gives L0 Xt = Xt . and we write (4. . and to define L−k as the operator that moves the sequence forward: L−k Xt = Xt+k for any integer k > 0.2 The Lag Operator In order to study ARMA representations. we write L2 Xt = Xt−2 . In general. Bp are real numbers. The autocorrelation function is a population concept. For a white noise process. Lk Xt = Xt−k . When the lag operator is applied twice to a process {Xt : −∞ < t < ∞}. When we apply the lag operator to a univariate stochastic process {Xt : −∞ < t < ∞}.
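A tiny sketch of applying a lag polynomial B(L) = B0 + B1 L + · · · + Bp L^p to a series, assuming NumPy; the coefficients and the input sequence are hypothetical.

```python
import numpy as np

def apply_lag_polynomial(B, X):
    """Apply B(L) = B0 + B1 L + ... + Bp L^p to X; entry t of the result is
    B0*X[t] + B1*X[t-1] + ... + Bp*X[t-p], computed where all lags are available."""
    p = len(B) - 1
    return sum(B[j] * X[p - j: len(X) - j] for j in range(p + 1))

X = np.arange(10.0)                            # a simple deterministic sequence for illustration
print(apply_lag_polynomial([1.0, -0.5], X))    # X_t - 0.5 X_{t-1} for t = 1, ..., 9
```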

For a vector stochastic process {Xt : −∞ < t < ∞}, a polynomial in the lag operator B0 + B1 L + B2 L2 + · · · + Bp Lp for matrices B0, · · · , Bp with real numbers is used in the same way, so that (B0 + B1 L + B2 L2 + · · · + Bp Lp)Xt = B0 Xt + B1 Xt−1 + · · · + Bp Xt−p. When an infinite sum B0 Xt + B1 Xt−1 + B2 Xt−2 + · · · converges in some sense (such as convergence in mean square), we write B(L) = B0 + B1 L + B2 L2 + · · · , and B(L)Xt = (B0 + B1 L + B2 L2 + · · · )Xt = B0 Xt + B1 Xt−1 + B2 Xt−2 + · · · .

4.3 Moving Average Representation

Using the lag operator, Xt = Φ0 et + Φ1 et−1 + · · · can be expressed as

(4.3) Xt = Φ(L)et,

where Φ(L) = Φ0 + Φ1 L + Φ2 L2 + · · · . If Xt is linearly regular and covariance stationary with mean µ, then it has a Moving Average (MA) representation of the form Xt = µ + Φ(L)et or

(4.4) Xt = µ + Φ0 et + Φ1 et−1 + Φ2 et−2 + · · · ,

where Φ0 = 1. In this section, we study how some properties of Xt depend on Φ(L). If Φ(L) is a polynomial of order q, Xt is a moving average process of order q (denoted MA(q)). If Φ(L) is a polynomial of infinite order, Xt is a moving average process of infinite order (denoted MA(∞)).

8.1 Using (2.10). we obtain the mean of an MA(q) process: (4. When a vector stochastic process {· · · . X1 . The mean.10). X0 . · · · .54 CHAPTER 4. where et is a white noise process that satisfies (2. E(Xt ) = µ. Φq ). and ρk = 0 if |k| > 1. and µ and Φ are constants. variance. X−1 . . We often impose conditions on (Φ1 . for |j| > q Hence the j-th autocorrelation of an MA(q) process is zero when |j| > q.6) E(Xt ) = µ. A moving average process is covariance stationary for any (Φ1 . X−2 . and its k-th autocorrelation is ρk = Φ 1+Φ2 if |k| = 1.8. Xt . ARMA AND VAR An MA(1) process Xt has a representation Xt = µ + et + Φet−1 as in Example 2.8) γj = E[(Xt − µ)(Xt−j − µ)] ½ 2 σ (Φj + Φj+1 Φ1 + · · · + Φq Φq−j ) = 0 for |j| ≤ q . where et is a white noise process that satisfies (2. its variance: γ0 = E[(Xt − µ)2 ] = σ 2 (1 + Φ21 + · · · + Φ2q ).7) and its j-th autocovariance: (4. · · · . and autocovariance of this process are given in Example 2. · · · . (4. and µ and Φ1 .10). Φq ) as we will discuss later in this chapter. An MA(q) process Xt satisfies (4. · · · } can be written as (4.9) 1 Xt = µ + Φ0 et + Φ1 et−1 + · · · + Φq et−q . Φq are real numbers. · · · .5) Xt = µ + et + Φ1 et−1 + · · · + Φq et−q .
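The MA(1) moments quoted above (γ0 = σ²(1 + Φ²), ρ1 = Φ/(1 + Φ²), ρk = 0 for |k| > 1) can be checked by simulation. A sketch with hypothetical parameter values, assuming NumPy.

```python
import numpy as np

rng = np.random.default_rng(11)
sigma, Phi, n = 1.0, 0.5, 200000
e = sigma * rng.normal(size=n + 1)
X = 3.0 + e[1:] + Phi * e[:-1]                  # MA(1): X_t = mu + e_t + Phi e_{t-1}, with mu = 3

xc = X - X.mean()
gamma0 = xc @ xc / n
gamma1 = xc[1:] @ xc[:-1] / n
gamma2 = xc[2:] @ xc[:-2] / n
print("gamma0:", gamma0, "theory:", sigma**2 * (1 + Phi**2))
print("rho1  :", gamma1 / gamma0, "theory:", Phi / (1 + Phi**2))
print("rho2  :", gamma2 / gamma0, "theory: 0")
```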

· · · . which states that any covariance stationary process can be decomposed into linearly regular and linearly deterministic components: 2 We only define the linear information set for a finite number of random variables. In this case. · · · } be a covariance stationary n-dimensional vector process with mean zero. Xt−2 . The stochastic process Xt is linearly regular if H−∞ contains only the constant zero when T H−∞ = ∞ n=1 Ht−n . Xt . . X−1 . The stochastic process Xt is linearly deterministic if Ht = H−∞ for all t. X0 . As q goes to infinity. X1 . X−2 . an MA(∞) representation (4. Φij . Therefore. Let H−∞ be T 0 the set of random variables that are in Ht for all t: H−∞ = ∞ n=1 Ht−n .10) Xt = µ + Φ0 et + Φ1 et−1 + · · · is well defined and covariance stationary if P∞ j=0 |Φij |2 < ∞ for the i-th row of Φj . a process with MA(q) representation is covariance stationary.4. E(y|X t . Φq . then Xt is linearly deterministic. For any Φ0 . Note that the information set grows larger over time and the sequence {Ht : −∞ < t < ∞} is increasing in the sense that Ht ⊂ Ht+1 for all t. For example. Then 0 = 0 Xt is a member of Ht . · · · ) ˆ for E(y|H t ). · · · . THE WOLD REPRESENTATION 55 for a white noise process et . the constant zero is always a member of H−∞ . then Xt has a q-th order (one-sided) moving average (MA(q)) representation. Let Ht be the linear information set generated by ˆ the current and past values of Xt . We can now state the Wold decomposition theorem. 4. See Appendix 3. Xt has a moving average representation of infinite order. in which Ht is generated by the current and past values of Xt .A for further explanation.4 The Wold Representation Let {· · · . Xt−1 . if Xt is an n-dimensional vector of constants.2 We use the notation.4.

Xt−2 . and et is defined by (4. Then it can be written as (4. Φij . j=0 where Φ0 = In .14) Xt = ∞ X Φj et−j .11) Xt = ∞ X Φj et−j + gt . Φij .1 (The Wold Decomposition Theorem) Let {· · · . X1 .56 CHAPTER 4. X0 . j=0 where Φ0 = In . Proposition 4. X1 . · · · . ARMA AND VAR Proposition 4. P∞ j=0 |Φij |2 < ∞ for the i-th row of Φj .12) j=0 |Φij |2 < ∞ for the i-th row of Φj .2 (The Wold Representation) Let {· · · . Then it can be written as (4. Xt . · · · } be a covariance stationary vector process with mean zero. There may exist infinitely . and ˆ t |Xt−1 . X−1 . X0 .12). P∞ (4.12). Xt−3 . · · · . Xt . The Wold representation gives a unique MA representation when the MA innovation et is restricted to the form given by Equation (4.13) It can be shown that P∞ j=0 Φj et−j is a linearly regular covariance stationary process and gt is linearly deterministic. Hence if Xt is not linearly regular. · · · } be a linearly regular covariance stationary vector process with mean zero. it is possible to remove gt and work with a linearly regular process as long as gt can be estimated. · · · ) et = Xt − E(X and ˆ t |H−∞ ). gt = E(X (4. X−1 .

d. Example 4. THE WOLD REPRESENTATION 57 many other MA representations when the MA innovation is not restricted to be given by (4. where et = ut + Φ(u2t−1 − 1). In this sense. Then E(Xt Xt−1 ) = E[ut ut−1 + Φu3t−1 − Φut−1 + Φut u2t−2 − Φut + Φ2 (u2t−1 − 1)(u2t−2 − 1)] = 0. Note that the Wold representation innovation et in this example is serially uncorrelated. as in the next example. but not i. Xt is a nonlinear transformation of a Gaussian white noise.d. Then the Wold representation of Xt is Xt = et . In many macroeconomic models. interest rates.i.i. stock prices.2 Suppose that ut is an i. the innovations in the Wold representation may not be i. stochastic processes that we observe (real GDP.i. Even when the underlying shocks that generate processes are i.4. where et = u2t − 1. etc. is not normally distributed. ut . the innovation in the Wold representation of a process can have a different distribution from the underlying shock that generates the process. In this example. but Proposition 4. Example 4. because et (= ut +Φu2t−1 ) and et−1 (= ut−1 +Φu2t−2 ) are related . Thus.1 states that even a nonlinear stochastic process has a linear moving average representation as long as it is linearly regular and covariance stationary. et .) are considered to be generated from the nonlinear function of underlying shocks.d Gaussian white noise.12) as we will discuss below. the processes in these models are nonlinear.d.1 Suppose that ut is a Gaussian white noise.i.4. is normally distributed. the innovation in its Wold representation. The shock that generates Xt . Let Xt = u2t − 1. Let Xt be generated by Xt = ut + Φ(u2t−1 − 1). Hence the Wold representation of Xt is Xt = et .. so that E(u3t ) = 0. However.

Therefore. which are closely related to MA representations. Such a process is called an autoregression of order . If B(L) is a polynomial of order p.1 Autoregression of Order One Consider a process Xt that satisfies (4.15). If B(L) is a polynomial of infinite order.58 CHAPTER 4. 4. which satisfies B(L)Xt = δ + et with B0 = 1 or Xt + B1 Xt−1 + B2 Xt−2 + · · · = δ + et for a white noise process et . In this section. where et is a white noise process with variance σ 2 and X0 is a random variable that gives an initial condition for (4. The Wold representation states that any linearly regular covariance stationary process has an MA representation. and it is often convenient to consider AR representations and ARMA representations. it is useful to estimate an MA representation in order to study how linear projections of future variables depend on their current and past values. 4. Xt is an autoregression of order p (denoted AR(p)).5.15) Xt = δ + BXt−1 + et for t ≥ 1. Higher order MA representations and vector MA representations are hard to estimate.5 Autoregression Representation A process Xt . we study how some properties of Xt depend on B(L). Xt is an autoregression of infinite order (denoted AR(∞)). ARMA AND VAR nonlinearly through the Φu2t−1 and ut−1 terms. is an autoregression. however.

and will be discussed in detail in Chapter 11. consider an MA process (4. so that (4. Consider the case where the absolute value of B is less than one. Substituting (4.5. denoted by AR(1).19) X0 = µ + e0 + Be−1 + B 2 e−2 + · · · . however. In this way. Hence Xt cannot be covariance stationary. .4. 1−B for t ≥ 1. the case in which B = 1 is of importance. In order to choose X0 .16) can be written as (4.20) (1 − BL)(Xt − µ) = et . Xt is covariance stationary. As seen in Example 2. It is often convenient to consider (4. B t X0 (ω) becomes negligible as t goes to infinity for a fixed ω. · · · .9.16) recursively. then the variance of Xt increases over time. and choose the initial condition for the process Xt in (4. we obtain X1 − µ = B(X0 − µ) + e1 and X2 − µ = B(X1 − µ) + e2 = B 2 (X0 − µ) + Be1 + e2 . When the absolute value of B is greater than or equal to one. Whether or not Xt is stationary depends upon the initial condition X0 . Suppose that X0 is uncorrelated with e1 .16) where µ = Xt − µ = B(Xt−1 − µ) + et δ .15) by (4. (4. the process Xt is not covariance stationary in general.18) Xt = µ + et + Bet−1 + B 2 et−2 + · · · . Xt is defined for any real number B. In this case. e2 . With the lag operator.15) in a deviation-fromthe-mean form: (4.17) Xt − µ = B t (X0 − µ) + B t−1 e1 + B t−2 e2 + · · · + Bet−1 + et for t ≥ 1. In macroeconomics. When this particular initial condition is chosen. AUTOREGRESSION REPRESENTATION 59 1.

5. Consider.26) (4. or equivalently.24) 1 − B1 z − B2 z 2 − · · · − Bp z p = 0 are larger than one in absolute value. ARMA AND VAR We define the inverse of (1 − BL) as (4. 4.22) Xt = µ + (1 − BL)−1 et .18).25) z p − B1 z p−1 − B2 z p−2 − · · · − Bp = 0 are smaller than one in absolute value. all the roots of (4. the special case of a AR(1) process with B1 = 1 and X0 = 0: (4. When a process Xt has an MA representation of the form (4. for instance.2 The p-th Order Autoregression A p-th order autoregression satisfies (4.21) (1 − BL)−1 = 1 + BL + B 2 L2 + B 3 L3 + · · · . . which is the MA(∞) representation of an AR(1) process.27) Xt = Xt−1 + et = e1 + e2 + · · · + et−1 + et for t ≥ 1. The stability condition is that all the roots of (4.23) Xt = δ + B1 Xt−1 + B2 Xt−2 + · · · + Bp Xt−p + et for t ≥ 1.60 CHAPTER 4. when the absolute value of B is less than one. we write (4.
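The stability condition can be checked numerically: compute the roots of z^p − B1 z^{p−1} − · · · − Bp = 0 and verify that they all lie inside the unit circle. A sketch using numpy.roots with hypothetical coefficients; the second example is the unit-root case discussed above.

```python
import numpy as np

def is_stable(B):
    """B = [B1, ..., Bp]; stable iff all roots of z^p - B1 z^{p-1} - ... - Bp = 0 lie inside the unit circle."""
    poly = np.concatenate(([1.0], -np.asarray(B, dtype=float)))
    roots = np.roots(poly)
    return bool(np.all(np.abs(roots) < 1)), roots

print(is_stable([1.2, -0.35]))   # roots 0.7 and 0.5: the stability condition holds
print(is_stable([1.0]))          # root 1: the B = 1 (unit root) case, which is not covariance stationary
```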

6. B(1) Provided that the p-th order polynomial B(z) satisfies stability conditions. Xt is nonstationary. · · · . We also say that zi is a root of the equation Φ(z) = 0. Note also that its first difference is stationary since et (= Xt − Xt−1 ) is stationary. Since the variance of Xt varies over time. q) process yields the MA(∞) representation (4. When a (possibly infinite order) polynomial in the lag operator Φ(L) = Φ0 + Φ1 L + Φ2 L2 + · · · is given. we consider a complex valued function Φ(z −1 ) = Φ0 + Φ1 z −1 +Φ2 z −2 +· · · by replacing the lag operator L by a complex number z. V ar(X2 ) = 2σ 2 . we have the deviation-from-the-mean form (4. V ar(Xt ) = tσ 2 . . In this case.28). Such a process is called difference stationary.31) Xt = µ + Φ(L)et .29) Xt = δ + B1 Xt−1 + B2 Xt−2 + · · · + Bp Xt−p + et + θ1 et−1 + θ2 et−2 + · · · . Note that V ar(X1 ) = σ 2 . 4. Consider a condition Φ(z) = Φ0 + Φ1 z + Φ2 z 2 + · · · = 0. (4.28) If a complex number zi satisfies the condition (4.4.6 ARMA An ARMA(p. q) process satisfies (4. where Φ(L) = B(L)−1 θ(L) = Φ0 + Φ1 L + θ2 L2 + · · · and P∞ j=0 |θj |2 ≤ ∞. ARMA 61 where E(Xt ) = 0 and E(Xt−i Xt−j ) = σ 2 for i = j.30) where µ = B(L)(Xt − µ) = θ(L)et . then zi is a zero of Φ(z). δ . the ARMA(p. If B(1) = 1 − B1 − · · · − Bp 6= 1.

q) process has both the MA(∞) and AR(∞) representations. q) process by the ARMA(p − m. another MA representation is obtained by adopting a different dating procedure for the innovation. q) process yields the AR(∞) representation3 θ(L)−1 B(L)Xt = δ ∗ + et . q − m) process that has no common roots.7 Fundamental Innovations Let Xt be a covariance stationary vector process with mean zero that is linearly regular. In this example. Let (4. Then Xt = ut is an MA representation. where m is the number of common roots. Then the Wold representation in (4. if both B(z) and θ(z) satisfy stability conditions. then θ(L) is invertible and the ARMA(p.14) gives an MA representation. Let u∗t = ut+1 . See Hayashi (2000. then the ARMA(p. There are infinitely many other MA representations. 382) for further discussion. Then Xt = u∗t−1 is another MA representation.33) Xt = ∞ X Φj ut−j = Φ(L)ut j=0 3 Without any loss of generality.62 CHAPTER 4. Example 4. In such a case. ARMA AND VAR On the other hand. 4. (4.3 let ut be a white noise. we assume that there are no common roots of B(z) = 0 and θ(z) = 0. p. .32) where δ ∗ = δ . It is often convenient to restrict our attention to the MA representations for which the information content of the current and past values of the innovations is the same as that of the current and past values of Xt . we can write the ARMA(p. and Xt = ut . if θ(z) satisfies stability conditions that all roots of θ(z) = 0 lie outsize the unit circle. θ(1) Therefore.

The innovation process ut is said to be fundamental if Ht = Hut . then this MA representation is not invertible. then ut is fundamental. the information set generated by the current and past values of u∗t : {u∗t . If Φ = 1 or if Φ = −1. then ut = Φ(L)−1 Xt . Xt = ut is a fundamental MA representation while Xt = u∗t−1 is not. but ut is fundamental. If all the roots of det[Φ(z)] = 0 lie on or outside the unit circle. As a result of the dating procedure used for Xt = u∗t−1 .33). Since (4. then ut is not fundamental.3.33) is invertible. but ut is fundamental. we can allow some roots of det[Φ(z)] = 0 to lie on the unit circle. Let Ht be the linear information set generated by the current and past values of Xt . If all the roots of det[Φ(z)] = 0 lie outside the unit circle. and is strictly larger than the information set generated by Ht . then ut is fundamental. The MA representations with fundamental innovations are useful. If |Φ| < 1. The concept of fundamental innovations is closely related to the concept of invertibility. FUNDAMENTAL INNOVATIONS 63 be an MA representation for Xt . Thus for fundamentalness. then this MA representation is invertible.4. In the univariate case. Ht = Hut . · · · } is equal to Ht+1 .33) implies Ht ⊂ Hut . Therefore.7. and ut is fundamental. The innovation in the Wold representation is fundamental.33) is invertible. and ut is fundamental. Thus if the MA representation (4. it is easier to express projections of variables onto Ht with them than if they had non-fundamental . If |Φ| > 1. Hut ⊂ Ht . and Hut be the linear information set generated by the current and past values of ut . For example. let Xt = ut + Φut−1 . In Example 4. u∗t−1 . If the MA representation (4. if Xt = Φ(L)ut and all the roots of Φ(z) = 0 lie on or outside the unit circle. then Φ(L) may not be invertible. Then Ht ⊂ Hut because of (4. then Φ(L) is invertible.
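For the MA(1) example Xt = ut + Φut−1 with |Φ| < 1, invertibility means the fundamental innovation can be recovered from current and past Xt as ut = Xt − ΦXt−1 + Φ²Xt−2 − · · · , or recursively as ut = Xt − Φut−1. A simulation sketch with hypothetical values, assuming NumPy.

```python
import numpy as np

rng = np.random.default_rng(12)
n, Phi = 5000, 0.5                      # |Phi| < 1: the MA(1) is invertible, so u_t is fundamental
u = rng.normal(size=n + 1)
X = u[1:] + Phi * u[:-1]                # X_t = u_t + Phi u_{t-1}

# invert Phi(L) = 1 + Phi L recursively: u_hat_t = X_t - Phi * u_hat_{t-1}
u_hat = np.zeros(n)
for t in range(n):
    u_hat[t] = X[t] - Phi * (u_hat[t - 1] if t > 0 else 0.0)

# after the start-up effect dies out, the recovered series matches the true innovation
print("corr(u_hat, true u):", np.corrcoef(u_hat[100:], u[101:])[0, 1])
```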

36) f (λ) = ∞ 1 X Φ(k) exp(iλk) 2π k=−∞ for a real number λ (−π < λ < π) when the autocovariances are absolutely summable. j=0 Then (4. It is natural to assume that economic agents observe Xt . The spectral density of Yt . define (4. and E(X ˆ t+1 |Ht ). P Then Yt −E(Yt ) = b(L)et = ∞ j=0 bj et−j for a square summable {bj } and a white noise process et such that E(e2t ) = 1 and E(et es ) = 0 for t 6= s.64 CHAPTER 4. The spectral density is a function of λ. For a real number r. f (λ) is defined by f (λ) = ( ∞ X bj exp(−iλj))( j=0 ∞ X bj exp(iλj)). let Xt be a univariate process with an MA(1) representation: Xt = ut + Φut−1 . Using the . On the other hand. the economic agents’ forecast for Xt+1 can be modˆ t+1 |Ht ) rather than E(X ˆ t+1 |Hut ).34) where i = (4. if |Φ| > 1. For example. which is called the frequency. and eled as E(X ˆ t+1 |Ht ) = E(X ˆ t+1 |Hu ) = Φut . ut is not fundaE(X t ˆ t+1 |Ht ) 6= E(X ˆ t+1 |Hut ) = Φut . E(X 4. Therefore. and there is no easy way to express mental. ARMA AND VAR innovations. Its k-th autocovariance Φ(k) = E[(Yt − E(Yt ))(Yt−k − E(Yt−k )0 ] does not depend on date t.8 The Spectral Density Consider a covariance stationary process Yt such that Yt − E(Yt ) is linearly regular. If |Φ| ≤ 1. ut is fundamental.35) exp(ir) = cos(r) + i sin(r). √ −1. but not ut .

Using the properties of the cos and sin functions and the fact that Φ(k) = Φ(−k), it can be shown that

(4.37)    f(λ) = (1/2π)[Φ(0) + 2 Σ_{k=1}^{∞} Φ(k) cos(λk)],

where f(λ) = f(−λ) and f(λ) is nonnegative for all λ.

Equation (4.36) gives the spectral density from the autocovariances. When the spectral density is given, the autocovariances can be calculated from the following formula:

(4.38)    ∫_{−π}^{π} f(λ) exp(iλk) dλ = Φ(k).

Thus the spectral density and the autocovariances contain the same information about the process. In some applications, it is more convenient to examine the spectral density than the autocovariances. For example, it requires infinite space to plot the autocovariances for k = 0, 1, 2, ···, whereas the spectral density can be concisely plotted.

An interpretation of the spectral density is given by the special case of (4.38) in which k = 0:

(4.39)    ∫_{−π}^{π} f(λ) dλ = Φ(0).

This relationship suggests an intuitive interpretation that f(λ) is the contribution of the frequency λ to the variance of Yt. This intuition can be formalized by the spectral representation theorem, which states that any covariance stationary process Yt with absolutely summable autocovariances can be expressed in the form

(4.40)    Yt = µ + ∫_{0}^{π} [α(λ) cos(λt) + δ(λ) sin(λt)] dλ,

where α(λ) and δ(λ) are random variables with mean zero for any λ in [0, π]. These variables have the further properties that, for any frequencies 0 < λ1 < λ2 < λ3 < λ4 < π, the variable ∫_{λ1}^{λ2} α(λ)dλ is uncorrelated with ∫_{λ3}^{λ4} α(λ)dλ, and the variable ∫_{λ1}^{λ2} δ(λ)dλ is uncorrelated with ∫_{λ3}^{λ4} δ(λ)dλ. For any 0 < λ1 < λ2 < π and 0 < λ3 < λ4 < π, the variable ∫_{λ1}^{λ2} α(λ)dλ is uncorrelated with ∫_{λ3}^{λ4} δ(λ)dλ. For such a process, the portion of the variance due to cycles with frequency less than or equal to λ1 is given by

(4.41)    2 ∫_{0}^{λ1} f(λ) dλ.

Exercises

4.1 Let ut be a white noise, and xt = ut + ut−1. Is xt covariance stationary? Is ut fundamental for xt? Give an expression for Ê(xt|ut−1, ut−2, ···) in terms of past ut's. Is it possible to give an expression for Ê(xt|xt−1, xt−2, ···) in terms of past ut's? If so, give an expression. Explain your answers.

4.2 Let ut be a white noise, and xt = ut + 0.8ut−1. Is xt covariance stationary? Is ut fundamental for xt? Give an expression for Ê(xt|ut−1, ut−2, ···) in terms of past ut's. Is it possible to give an expression for Ê(xt|xt−1, xt−2, ···) in terms of past ut's? If so, give an expression. Explain your answers.

4.3 Let ut be a white noise, and xt = ut + 1.2ut−1. Is xt covariance stationary? Is ut fundamental for xt? Give an expression for Ê(xt|ut−1, ut−2, ···) in terms of past ut's. Is it possible to give an expression for Ê(xt|xt−1, xt−2, ···) in terms of past ut's? If so, give an expression. Explain your answers.

References

Hayashi, F. (2000): Econometrics. Princeton University Press, Princeton.

Chapter 5

STOCHASTIC REGRESSORS IN LINEAR MODELS

This chapter explains asymptotic theory for linear models in the form that is convenient for most applications of structural econometrics for linear time series models. In many applications of rational expectations models, stringent distributional assumptions, such as an assumption that the disturbances are normally distributed, are unattractive. Without such assumptions, however, it is not possible to obtain the exact distributions of estimators in finite samples. For this reason, asymptotic theory, which describes the properties of estimators as the sample size goes to infinity, is often used. If the sample size is "large," then asymptotic theory must be a good approximation of the true properties of estimators. The problem is that no one knows how large the sample size should be, because the answer depends on the nature of each application.

Given the difficulties of obtaining the exact small sample distributions of estimators in many applications, many researchers use asymptotic theory at initial stages of an empirical research project. This utilization seems to be a sound strategy. After the importance of a research project is established, small sample properties of the estimators used in the project are often studied. For this purpose, Monte Carlo experiments can be used as described later in this book.

5.1 The Conditional Gauss Markov Theorem

In regressions (5.4?) and (5.7?), Xt is strictly exogenous in the time series sense if E(et| ···, Xt+2, Xt+1, Xt, Xt−1, Xt−2, ···) = 0. This is a very restrictive assumption that does not hold in all applications of cointegration discussed in Chapter 13. For example, E(et|Xt, Xt−1, Xt−2, ···) = 0 in some applications because et is a forecast error. However, the forecast error is usually correlated with future values of Xt. Hence the strict exogeneity assumption is violated. Nevertheless, as Choi and Ogaki (1999) argue, it is useful to observe that the Gauss Markov theorem applies to cointegrating regressions in order to understand small sample properties of various estimators for cointegrating vectors. Moreover, this observation leads to a Generalized Least Squares (GLS) correction to spurious regressions.

We use the notation E[Z|σ(X)] to denote the usual conditional expectation of Z conditional on X, as defined by Billingsley (1986) for a random variable Z. Let σ(X) be the smallest σ-field with respect to which the random variables in X are measurable. E[Z|σ(X)] is a random variable, and E[Z|σ(X)](s) denotes the value of this random variable at an element s of the sample space S. It should be noted that the definition is given under the condition that Z is integrable, namely E(|Z|) < ∞.¹ This condition can be too restrictive when we define the conditional expectation of the OLS estimator in some applications, as we discuss later. For this reason, we will also use a different concept of conditional expectation, conditional on X, that can be used when Z and vec(X) have probability density functions fZ(z) and fX(vec(x)).

¹Loeve (1978) slightly relaxes this restriction by defining the conditional expectation for any random variable whose expectation exists (but may not be finite) with an extension of the Radon-Nikodym theorem, but this slight relaxation does not solve the problem which we describe later.

In this case, we use the notation E[Z|X(s) = x]. More precisely, if fX(vec(x)) is positive, we define the expectation of Z conditional on X(s) = x as

(5.1)    E[Z|X(s) = x] = ∫_{−∞}^{∞} z f_{ZX}(z, vec(x)) / fX(vec(x)) dz,

where f_{ZX} denotes the joint probability density function of Z and vec(X). This definition can only be used when the probability density functions exist and fX(vec(x)) is positive, but the advantage of this definition for our purpose is that the conditional expectation can be defined even when E(Z) does not exist. For example, let Z = Y/X, where Y and X are independent random variables with a standard normal distribution. Then Z has the Cauchy distribution, and E(Z) does not exist. Thus, E[Z|σ(X)] cannot be defined.² However, if the probability density function of X is always positive, we can define E[Z|X(s) = x] for all s in the probability space.

In the special case in which both types of conditional expectations can be defined, they coincide. More precisely, suppose that Z and vec(X) have probability density functions, that the probability density function of vec(X) is always positive, and that Z is integrable. Then E[Z|σ(X)](s) = E[Z|X(s)] with probability one.

We are concerned with a linear model of the form:

Assumption 5.1    y = Xb0 + e,

where y = (y1, y2, ···, yT)' is a T × 1 vector of random variables, X is a T × K matrix of random variables, b0 is a K × 1 vector of real numbers, and e = (e1, e2, ···, eT)' is a T × 1 vector of random variables. We assume that the expectation of e conditional on X is zero:

²It should be noted that we cannot argue that E(Z) = E(E(Y/X|σ(X))) = E(E(Y|σ(X))/X) = 0, even though 1/X is measurable in σ(X), because E(Y/X|σ(X)) is not defined.

Assumption 5.2    E[e|σ(X)] = 0.

Since E[e|σ(X)] is only defined when each element of e is integrable, Assumption 5.2 implicitly assumes that E(e) exists and is finite. Assumption 5.2 does not imply that X is statistically independent of e; a sufficient condition for Assumption 5.2 is that X is statistically independent of e. Hence Assumption 5.2 is weaker than the assumption of independent stochastic regressors. It also implies E(e) = 0 because of the law of iterated expectations.

Given E(e) = 0, the OLS estimator is

(5.2)    bT = (X'X)^{-1}X'y.

From Assumption 5.1, bT = b0 + (X'X)^{-1}X'e. Let G = {s in S : X(s)'X(s) is nonsingular}; for any s in G, the OLS estimator is well defined. Since the determinant of a matrix is a continuous function of the elements of the matrix, G is a member of the σ-field F.

With the next assumption, we assume that e is conditionally homoskedastic and et is not serially correlated:

Assumption 5.3    E[ee'|σ(X)] = σ^2 IT.

The conditional Gauss-Markov theorem can be proved when the expectations of (X'X)^{-1}X'e and (X'X)^{-1}X'ee'X(X'X)^{-1} can be defined. For this purpose, we consider the following two alternative assumptions:

Assumption 5.4    E[(X'X)^{-1}X'ee'X(X'X)^{-1}] exists and is finite.

Assumption 5.4'    e and vec(X) have probability density functions, and the probability density functions of vec(X) are positive for all s in G.

A sufficient condition for Assumption 5.4 is that the distributions of X and e have finite supports. The problem with Assumption 5.4 is that it is not easy to verify it for many distributions of X and et which are often used in applications and Monte Carlo studies.

Under Assumption 5.4, E[(X'X)^{-1}X'e] also exists and is finite. Hence E(bT|σ(X)) can be defined, and E(bT|σ(X)) = b0 + E[(X'X)^{-1}X'e|σ(X)] = b0 for s in G with probability Pr(G). Under Assumptions 5.1-5.4, E[(bT − b0)(bT − b0)'|σ(X)] can be defined, and E[(bT − b0)(bT − b0)'|σ(X)] = E[(X'X)^{-1}X'ee'X(X'X)^{-1}|σ(X)] = (X'X)^{-1}X'E[ee'|σ(X)]X(X'X)^{-1} = σ^2(X'X)^{-1} for s in G with probability Pr(G).

Corresponding with Assumptions 5.4 and 5.4', we consider two definitions of the conditional version of the Best Linear Unbiased Estimator (BLUE). Given a set H in the σ-field F, the BLUE conditional on σ(X) in H is defined as follows. An estimator bT for b0 is the BLUE conditional on σ(X) in H if (1) bT is linear conditional on σ(X), namely, bT can be written as bT = Ay, where A is a K × T matrix and each element of A is measurable σ(X); (2) bT is unbiased conditional on σ(X) in H, namely, E(bT|σ(X)) = b0 for s in H with probability Pr(H); and (3) for any linear unbiased estimator b* conditional on σ(X) for which E(b*b*') exists and is finite, E[(bT − b0)(bT − b0)'|σ(X)] ≤ E[(b* − b0)(b* − b0)'|σ(X)] for s in H with probability Pr(H), namely, E[(b* − b0)(b* − b0)'|σ(X)] − E[(bT − b0)(bT − b0)'|σ(X)] is a positive semidefinite matrix for s in H with probability Pr(H).

Similarly, an estimator bT for b0 is the BLUE conditional on X(s) = x in H if (1) bT is linear conditional on X(s) = x in H, namely, bT can be written as bT = Ay, where A is a K × T matrix and each element of A is measurable σ(X); (2) bT is unbiased conditional on X(s) = x in H, namely, E(bT|X(s) = x) = b0 for any s in H; and (3) for any linear unbiased estimator b* conditional on X(s) = x for which E(b*b*'|X(s) = x) exists and is finite, E[(bT − b0)(bT − b0)'|X(s) = x] ≤ E[(b* − b0)(b* − b0)'|X(s) = x] in H, namely, E[(b* − b0)(b* − b0)'|X(s) = x] − E[(bT − b0)(bT − b0)'|X(s) = x] is a positive semidefinite matrix for any s in H.

With these preparations, the following theorem can be stated:

Theorem 5.1 (The Conditional Gauss-Markov Theorem) Under Assumptions 5.1-5.4, the OLS estimator is the BLUE conditional on σ(X) in G. Under Assumptions 5.1-5.3 and 5.4', the OLS estimator is the BLUE conditional on X(s) = x in G, namely, E[bT|X(s)] = b0 and E[(bT − b0)(bT − b0)'|X(s)] = σ^2(X(s)'X(s))^{-1} for any s in G.

The theorem can be proved by applying any of the standard proofs of the (unconditional) Gauss-Markov theorem, replacing the unconditional expectation with the appropriate conditional expectation.

With an additional assumption that Pr(G) = 1, or

Assumption 5.5    X'X is nonsingular with probability one,

the unconditional expectation and the unconditional covariance matrix of bT can be defined under Assumptions 5.1-5.4. We then obtain the following corollary of the theorem:

Proposition 5.1 Under Assumptions 5.1-5.5, the OLS estimator is unconditionally unbiased and has the minimum unconditional covariance matrix among all linear unbiased estimators conditional on σ(X).

Proof  Using the law of iterated expectations, E(bT) = E{E[bT|σ(X)]} = E(b0) = b0, and E[(bT − b0)(bT − b0)'] = E{E[(bT − b0)(bT − b0)'|σ(X)]} = σ^2 E[(X'X)^{-1}]. For the minimum covariance matrix part, let b* be another linear unbiased estimator conditional on σ(X). Then

(5.3)    E[(b* − b0)(b* − b0)'|σ(X)] = E[(bT − b0)(bT − b0)'|σ(X)] + ∆,

where ∆ is a positive semidefinite matrix with probability one. Then E[(b* − b0)(b* − b0)'] − E[(bT − b0)(bT − b0)'] = [E(b*b*') − b0b0'] − [E(bTbT') − b0b0'] = E[E(b*b*'|σ(X))] − E[E(bTbT'|σ(X))] = E(∆), which is a positive semidefinite matrix.

A few remarks for this proposition are in order:

Remark 5.1 Assumption 5.4 cannot be replaced by Assumption 5.4' for this proposition. Under Assumption 5.4', E(bT) and E[(bT − b0)(bT − b0)'] may not exist.

Remark 5.2 In this proposition, the covariance matrix of bT is σ^2 E[(X'X)^{-1}]. This result may seem to contradict the standard asymptotic theory, but it does not. Asymptotically, (1/T)X'X converges almost surely to E[XtXt'] if Xt is stationary and ergodic. Hence the limit of the covariance matrix of √T(bT − b0), σ^2 E[{(1/T)X'X}^{-1}], is equal to the asymptotic covariance matrix, σ^2[E(XtXt')]^{-1}, which is different from σ^2[E(X'X)]^{-1}.

5.2 Unconditional Distributions of Test Statistics

In order to study distributions of the t ratios and F test statistics, we need an additional assumption:

Assumption 5.6    Conditional on X, e follows a multivariate normal distribution.

Given a 1 × K vector of real numbers R, consider a random variable

(5.4)    NR = R(bT − b0) / {σ[R(X'X)^{-1}R']^{1/2}}

and the usual t ratio for Rb0,

(5.5)    tR = R(bT − b0) / {σ̂[R(X'X)^{-1}R']^{1/2}}.

Here σ̂ is the positive square root of

σ̂^2 = (1/(T − K)) (y − XbT)'(y − XbT).

Under either Assumptions 5.1-5.3, 5.4, 5.5, and 5.6 or Assumptions 5.1-5.3, 5.4', and 5.6, NR and tR can be shown to follow the standard normal distribution and Student's t distribution with T − K degrees of freedom conditional on X, respectively. The following proposition is useful in order to derive unconditional distributions of these statistics.

Proposition 5.2 If the probability density function of a random variable Z conditional on a random vector Q does not depend on the values of Q, then the marginal probability density function of Z is equal to the probability density function of Z conditional on Q.

This proposition is obtained by integrating the probability density function conditional on Q over all possible values of the random variables in Q. Since NR and tR follow the standard normal and the Student's t distribution conditional on X, respectively, Proposition 5.2 implies the following proposition:

Proposition 5.3 Under Assumptions 5.1-5.3, either 5.4 and 5.5 or 5.4', and 5.6, NR is the standard normal random variable and tR is the Student's t random variable with T − K degrees of freedom.

With the standard argument, the usual F test statistics also follow (unconditional) F distributions. These results are sometimes not well understood by econometricians. For example, Judge et al. (1985, p. 164), a standard textbook, states that "our usual test statistics do not hold in finite samples" on the grounds that bT's (unconditional) distribution is not normal.

It is true that bT is a nonlinear function of X and e, so it does not follow a normal distribution even if X and e are both normally distributed. However, the usual t and F test statistics have the usual (unconditional) distributions as a result of Proposition 5.2.

5.3 The Law of Large Numbers

If an estimator bT converges almost surely to a vector of parameters b0, then bT is said to be strongly consistent for b0. If an estimator bT converges in probability to a vector of parameters b0, then bT is said to be weakly consistent for b0.

Consider a univariate stationary stochastic process {Xt}. When Xt is stationary, E(Xt) does not depend on date t. Therefore, we often write E(X) instead of E(Xt). Assume that E(|X|) is finite, and consider a sequence of random variables [YT : T ≥ 1], where YT = (1/T) Σ_{t=1}^{T} Xt is the sample mean of X computed from a sample of size T. In general, the sample mean does not converge to its unconditional expected value, but converges almost surely to an expectation of X conditional on an information set. For the sample mean to converge almost surely to its unconditional mean, we require the series to be ergodic. A stationary process {Xt} is said to be ergodic if, for any bounded functions f : R^{i+1} → R and g : R^{j+1} → R,

(5.6)    lim_{T→∞} |E[f(Xt, ···, Xt+i) g(Xt+T, ···, Xt+T+j)]| = |E[f(Xt, ···, Xt+i)]| |E[g(Xt, ···, Xt+j)]|.

Heuristically, a stationary process is ergodic if it is asymptotically independent: that is, (Xt, ···, Xt+i) and (Xt+T, ···, Xt+T+j) are approximately independent for large enough T.
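The following short GAUSS sketch (added here; it is not part of the original text) illustrates the convergence of the sample mean discussed above, using i.i.d. standard normal draws whose population mean is zero. The seed and the sample sizes are arbitrary choices for this illustration.

RNDSEED 123457;
t=100;
do until t>100000;
  x=RNDN(t,1);       @ t draws from N(0,1), whose mean is 0 @
  ? t~meanc(x);      @ print the sample size and the sample mean @
  t=t*10;
endo;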

Proposition 5.4 (The Strong Law of Large Numbers) If a stochastic process [Xt : t ≥ 1] is stationary and ergodic, and if E(|X|) is finite, then (1/T) Σ_{t=1}^{T} Xt → E(X) almost surely.

5.4 Convergence in Distribution and Central Limit Theorem

This section explains a definition of convergence in distribution and presents some central limit theorems. Central limit theorems establish that the sample mean scaled by √T converges in distribution to a normal distribution³ under various regularity conditions. The central limit theorems presented here are based on martingale difference sequences, and are useful in many applications of rational expectations models.

³In some central limit theorems, the limiting distribution is not normal.

The following central limit theorem by Billingsley (1961) is useful for many applications because we can apply it when economic models imply that a variable is a martingale difference sequence.

Proposition 5.5 (Billingsley's Central Limit Theorem) Suppose that et is a stationary and ergodic martingale difference sequence adapted to It. Assume that It−1 ⊂ It for all t, and that E(|e|^2) < ∞. Then

(1/√T) Σ_{t=1}^{T} et →D N(0, E(e^2)).

If et is an i.i.d. white noise, then it is a stationary and ergodic martingale difference sequence adapted to It, which is generated from {et, et−1, ···}. Hence Billingsley's Central Limit Theorem is more general than the central limit theorems
in econometric text books. However, Billingsley’s Central Limit Theorem cannot be
applied to any serially correlated series.
A generalization of the theorem to serially correlated series is due to Gordin
(1969):
Proposition 5.6 (Gordin’s Central Limit Theorem) Suppose that et is a univariate
stationary and ergodic process with mean zero and E(|e|2 ) < ∞, that E(et |et−j , et−j−1 , · · · )
converges in mean square to 0 as j → ∞, and that
(5.7)    Σ_{j=0}^{∞} [E(r_{tj}^2)]^{1/2} < ∞,
where
(5.8)

rtj = E(et |It−j ) − E(et |It−j−1 ),

where It is the information set generated from {et , et−1 , · · · }. Then et ’s autocovariances are absolutely summable, and
(5.9)    (1/√T) Σ_{t=1}^{T} et →D N(0, Ω),

where

(5.10)    Ω = lim_{T→∞} Σ_{j=−T+1}^{T−1} E(et et−j).

When et is serially correlated, the sample mean scaled by √T still converges to a normal
distribution, but the variance of the limiting normal distribution is affected by serial
correlation as in (5.10).

In (5.10), Ω is called a long-run variance of et. Intuition behind the long-run variance can be obtained by observing
(5.11)    E[((1/√T) Σ_{t=1}^{T} et)^2] = Σ_{j=−T+1}^{T−1} ((T − |j|)/T) E(et et−j),

and that the right hand side of (5.11) is the Cesaro sum of Σ_{j=−T+1}^{T−1} E(et et−j). Thus when Σ_{j=−T+1}^{T−1} E(et et−j) converges, its limit is equal to the limit of the right hand side of (5.11) (Apostol, 1974).
Another expression for the long-run variance can be obtained from an MA
representation of et . Let et = Ψ(L)ut = Ψ0 ut +Ψ1 ut−1 +· · · be an MA representation.
Then E(et et−j) = (Ψj Ψ0 + Ψ_{j+1}Ψ1 + Ψ_{j+2}Ψ2 + ···)E(ut^2), and Ω = {(Ψ0^2 + Ψ1^2 + Ψ2^2 + ···) + 2(Ψ1Ψ0 + Ψ2Ψ1 + Ψ3Ψ2 + ···) + 2(Ψ2Ψ0 + Ψ3Ψ1 + Ψ4Ψ2 + ···) + ···}E(ut^2) = (Ψ0 + Ψ1 + Ψ2 + ···)^2 E(ut^2). Hence

(5.12)    Ω = Ψ(1)^2 E(ut^2).
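For example (an added illustration, not part of the original text), for an MA(1) process the long-run variance in (5.12) can be computed directly. Let et = ut + Ψ1 ut−1 with E(ut^2) = σ^2. Then E(et^2) = (1 + Ψ1^2)σ^2, E(et et−1) = Ψ1 σ^2, and E(et et−j) = 0 for j ≥ 2, so

    Ω = (1 + Ψ1^2)σ^2 + 2Ψ1 σ^2 = (1 + Ψ1)^2 σ^2 = Ψ(1)^2 E(ut^2).

Note that Ω differs from the variance E(et^2) unless Ψ1 = 0, and that Ω = 0 when Ψ1 = −1.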

In the next example, we consider a multi-period forecasting model. For this
model, it is easy to show that Gordin’s Theorem is applicable to the serially correlated
forecast error.
Example 5.1 (The Multi-Period Forecasting Model) Suppose that It is an information set generated by {Yt , Yt−1 , Yt−2 , · · · }, where Yt is a stationary and ergodic
vector stochastic process. In typical applications, economic agents are assumed to
use the current and past values of Yt to generate their information set. Let Xt be a
stationary and ergodic random variable in the information set It with E(|Xt |2 ) < ∞.
We consider an s-period ahead forecast of Xt , E(Xt+s |It ), and the forecast error,
et = Xt+s − E(Xt+s |It ).

It is easy to verify that all the conditions for Gordin’s Theorem are satisfied for
et . Moreover, because E(et |It ) = 0 and et is in the information set It+s , E(et et−j ) =
E(E(et et−j|It)) = E(et−j E(et|It)) = 0 for j ≥ s. Hence Ω = lim_{T→∞} Σ_{j=−T+1}^{T−1} E(et et−j) = Σ_{j=−s+1}^{s−1} E(et et−j).
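As a concrete case (an added illustration, not part of the original text), the formula simplifies as follows: for s = 1, et is a martingale difference sequence and Ω = E(et^2); for s = 2, only the first-order autocovariance survives, so Ω = E(et^2) + 2E(et et−1).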
Hansen (1985) generalized Gordin’s Central Limit Theorem to vector processes.
In this book, we call the generalized theorem Gordin and Hansen’s Central Limit
Theorem.

Proposition 5.7 (Gordin and Hansen’s Central Limit Theorem) Suppose that et is
a vector stationary and ergodic process with mean zero and finite second moments,
that E(et |et−j , et−j−1 , · · · ) converges in mean square to 0 as j → ∞, and that
(5.13)    Σ_{j=0}^{∞} [E(r'_{tj} r_{tj})]^{1/2} < ∞,

where
(5.14)

rtj = E(et |It−j ) − E(et |It−j−1 ),

where It is the information set generated from {et , et−1 , · · · }. Then et ’s autocovariances are absolutely summable, and
(1/√T) Σ_{t=1}^{T} et →D N(0, Ω),

where

(5.15)    Ω = lim_{T→∞} Σ_{j=−T+1}^{T−1} E(et e't−j).

The matrix Ω in Equation (5.15) is called the long-run covariance matrix of et. As in the univariate case, another expression for the long-run covariance matrix can be obtained from an MA representation of et. Let et = Ψ(L)ut = Ψ0 ut + Ψ1 ut−1 + ··· be an MA representation. Then E(et e't−j) = Ψj E(ut u't)Ψ0' + Ψ_{j+1} E(ut u't)Ψ1' + Ψ_{j+2} E(ut u't)Ψ2' + ···, and Ω = (Ψ0 + Ψ1 + Ψ2 + ···)E(ut u't)(Ψ0 + Ψ1 + Ψ2 + ···)'. Hence

(5.16)    Ω = Ψ(1)E(ut u't)Ψ(1)'.

In empirical work, it is often necessary to apply a central limit theorem to a random vector. In the next example, Gordin and Hansen's Central Limit Theorem is applied to a serially correlated vector process:

Example 5.2 Continuing Example 5.1, let Zt be a random vector with finite second moments in the information set It. Define ft = Zt et. It is easy to verify that all conditions for Gordin and Hansen's Theorem are satisfied for ft. Then E(ft|It) = E(Zt et|It) = Zt E(et|It) = 0. Moreover, E(ft|It) = 0 and ft is in the information set It+s; thus E(ft f't−j) = E(E(ft f't−j|It)) = E(E(ft|It)f't−j) = 0 for j ≥ s. Hence Ω = lim_{T→∞} Σ_{j=−T+1}^{T−1} E(ft f't−j) = Σ_{j=−s+1}^{s−1} E(ft f't−j).

5.5 Consistency and Asymptotic Distributions of OLS Estimators

Consider a linear model

(5.17)    yt = xt'b0 + et,

where yt and et are stationary and ergodic random variables, and xt is a p-dimensional stationary and ergodic random vector. We assume that the orthogonality conditions

(5.18)    E(xt et) = 0

are satisfied. (Appendix 3.A explains why these types of conditions are called orthogonality conditions.) Imagine that we observe a sample of (yt, xt') of size T. The Ordinary Least Squares (OLS) estimator for (5.17) can be written as

(5.19)    bT = (Σ_{t=1}^{T} xt xt')^{-1} (Σ_{t=1}^{T} xt yt).

In order to apply the Law of Large Numbers to show that the OLS estimator is strongly consistent, rewrite (5.19), using (5.17) and scaling each element of the right side by 1/T:

(5.20)    bT − b0 = ((1/T) Σ_{t=1}^{T} xt xt')^{-1} ((1/T) Σ_{t=1}^{T} xt et).

Applying Proposition 5.4 shows that (1/T) Σ_{t=1}^{T} xt xt' converges to E(xt xt') almost surely. We assume that E(xt xt') is nonsingular. Hence, with probability one, Σ_{t=1}^{T} xt xt'(s) is nonsingular for large enough T. Therefore,

(5.21)    bT − b0 → [E(xt xt')]^{-1} (E(xt et)) = 0 almost surely.

Hence the OLS estimator, bT, is a strongly consistent estimator.

In order to obtain the asymptotic distribution of the OLS estimator, we multiply both sides of (5.20) by the square root of T:

(5.22)    √T(bT − b0) = ((1/T) Σ_{t=1}^{T} xt xt')^{-1} ((1/√T) Σ_{t=1}^{T} xt et).

We make an additional assumption that a central limit theorem applies to xt et. In particular, assuming that Gordin and Hansen's Central Limit Theorem is applicable,

(5.23)    √T(bT − b0) →D N(0, [E(xt xt')]^{-1} Ω [E(xt xt')]^{-1}),

where Ω is the long-run covariance matrix of xt et:

(5.24)    Ω = Σ_{j=−∞}^{∞} E(et et−j xt x't−j).

5.6 Consistency and Asymptotic Distributions of IV Estimators

Consider the linear model (5.17) for which the orthogonality conditions (5.18) are not satisfied. In this case, we try to find a p-dimensional stationary and ergodic random vector zt which satisfies two types of conditions: the orthogonality condition

(5.25)    E(zt et) = 0,

and the relevance condition that E(zt xt') is nonsingular. We define the Linear Instrumental Variable (IV) estimator as

(5.26)    bT = (Σ_{t=1}^{T} zt xt')^{-1} (Σ_{t=1}^{T} zt yt).

Then

(5.27)    bT − b0 = ((1/T) Σ_{t=1}^{T} zt xt')^{-1} ((1/T) Σ_{t=1}^{T} zt et).

Applying Proposition 5.4, we obtain

(5.28)    bT − b0 → [E(zt xt')]^{-1} (E(zt et)) = 0 almost surely.

Hence the linear IV estimator, bT, is a strongly consistent estimator. Assuming that the Vector Martingale Approximation Central Limit Theorem is applicable to zt et,

(5.29)    √T(bT − b0) →D N(0, [E(zt xt')]^{-1} Ω ([E(zt xt')]^{-1})'),

where Ω is the long-run covariance matrix of zt et:

(5.30)    Ω = Σ_{j=−∞}^{∞} E(et et−j zt z't−j).
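A minimal GAUSS sketch of the estimators in (5.19) and (5.26) follows (added for illustration; it is not part of the original text). The data y (T × 1), x (T × p), and z (T × p) are assumed to be already defined, and the Ω estimate shown is valid only when xt et is serially uncorrelated; serially correlated cases require the long-run covariance estimators discussed in the next chapter.

bols=inv(x'x)*(x'y);              @ OLS estimator in (5.19) @
biv=inv(z'x)*(z'y);               @ linear IV estimator in (5.26) @
e=y-x*bols;                       @ OLS residuals @
xe=x.*e;                          @ rows are xt'*et @
omega=xe'xe/rows(y);              @ estimate of Omega without serial correlation @
avar=inv(x'x/rows(y))*omega*inv(x'x/rows(y));   @ estimated asymptotic covariance in (5.23) @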

5.7 Nonlinear Functions of Estimators

In many applications of linear models, we are interested in nonlinear functions of b0, say a(b0). In many applications, bT is an OLS estimator or a linear IV estimator. This section explains the delta method, which is a convenient method to derive asymptotic properties of a(bT) as an estimator for a(b0), where bT is a weakly consistent estimator for b0. Later in this book we will use the proof of the delta method to prove the asymptotic normality of the GMM estimator.

Proposition 5.8 Suppose that {bT} is a sequence of p-dimensional random vectors such that √T(bT − b0) →D z for a random vector z. If a(·) : R^p → R^r is continuously differentiable at b0, then

√T[a(bT) − a(b0)] →D d(b0)z,

where d(b0) = ∂a(b)/∂b'|_{b=b0} denotes the r × p matrix of first derivatives evaluated at b0. In particular, if z ∼ N(0, Σ), then

√T[a(bT) − a(b0)] →D N(0, d(b0)Σd(b0)').

Proof  A sketch of the standard argument is as follows. By the mean value theorem applied to each element of a(·), a(bT) = a(b0) + d(b̄T)(bT − b0), where b̄T lies between bT and b0. Hence √T[a(bT) − a(b0)] = d(b̄T)√T(bT − b0). Since bT converges in probability to b0, so does b̄T, and d(b̄T) converges in probability to d(b0) by continuity. The conclusion then follows from Slutsky's theorem.

5.8 Remarks on Asymptotic Theory

When we use asymptotic theory, we do not have to make restrictive assumptions such as the assumption that the disturbances are normally distributed. In particular, serial correlation and conditional heteroskedasticity can easily be taken into account as long as we can estimate the long-run covariance matrix (which is the topic of the next chapter).
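As a small illustration of Proposition 5.8 (added here; not part of the original text), consider the ratio of two coefficients. Let p = 2, r = 1, and a(b) = b1/b2 with b02 ≠ 0. Then d(b0) = [1/b02, −b01/b02^2], and if √T(bT − b0) →D N(0, Σ), the delta method gives

    √T[a(bT) − a(b0)] →D N(0, Σ11/b02^2 − 2Σ12 b01/b02^3 + Σ22 b01^2/b02^4),

which is simply d(b0)Σd(b0)' written out.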

In many applications with financial variables, the assumption that the error term in a regression is normal is inappropriate because many authors have found evidence against normality for several financial variables. In many time series applications, for example in a regression with lagged dependent variables, we can assume neither that the regressor is nonrandom nor that the error term is strictly exogenous in the time series sense.

It is a common mistake to think that the linearity of the formula for the long-run covariance matrix means a linearity assumption for the process of xt et (for the OLS estimator) or zt et (for the IV estimator). It should be noted that we did not assume that xt et or zt et was generated by linear functions (i.e., a moving average process in the terminology of Chapter 4) of independent white noise processes. Even when zt et is generated as a nonlinear process, as long as it is a linearly regular and covariance stationary process, it has the Wold representation zt et = Ψ(L)ut, and its long-run covariance matrix is given by (5.30). Thus, even when xt et or zt et is generated from nonlinear functions of independent white noise processes, the distributions based on the long-run covariance matrices give the correct limiting distributions. This point is related to the Wold representation for nonlinear processes discussed in Chapter 4.

Appendix 5.A Monte Carlo Studies

This appendix explains Monte Carlo methods. Example programs in GAUSS, explained in Appendix A, are given. Asymptotic theory is used to obtain approximations of the exact finite sample properties of estimators and test statistics when the exact finite sample properties cannot be obtained. Asymptotic theory gives accurate approximations when the sample size is "large," but exactly how "large" is large enough depends on each application.

One method to study the quality of asymptotic approximations is Monte Carlo simulation. From actual data, we obtain only one sample, but in Monte Carlo studies, we can obtain many samples from generated random numbers. Generated random numbers are used to generate samples. Each time a sample is generated, we compute estimators or test statistics of interest. After replicating many samples, we can estimate small sample properties of the estimators or test statistics by studying the generated distributions of these variables and comparing them with the predictions of asymptotic theory.

5.A.1 Random Number Generators

In Monte Carlo studies, data are generated with computer programs called pseudo-random number generators. These programs generate sequences of values that appear to be draws from a specified probability distribution. Modern pseudo-random number generators are accurate enough that we can ignore the fact that the numbers generated are not exactly independent draws from a specified probability distribution for most purposes.⁵ Hence in the rest of this appendix, phrases such as "values that appear to be" are often suppressed.

Recall that when a probability space is given, the whole history of a stochastic process {et(s)}_{t=1}^{N} is determined when a point s in the probability space is given. For a random number generator, we use a number called the starting seed to determine s. Then the random number generator automatically updates the seed each time a number is generated. It should be noted that the same sequence of numbers is generated whenever the same starting seed is given to a random number generator.

⁵One exception is that a pseudo-random number generator ultimately cycles back to the initial value generated and repeats the sequence when too many numbers are generated. Most modern pseudo-random number generators cycle back only after millions of values are drawn, and this tendency is not a problem for most Monte Carlo studies. However, in some studies in which millions or billions of values are needed, there can be a serious problem.

Most programs offer random number generators for the uniform distribution and the standard normal distribution. For example, y=RNDN(r,c); in GAUSS generates r × c values that appear to be a realization of independent standard normal random variables, which are stored in an r × c matrix. The starting seed for RNDN can be given by the statement RNDSEED n;, where the value of the seed n must be in the range 0 < n < 2^31 − 1. One can produce random numbers with other distributions by transforming generated random numbers. The following examples are some of the transformations which are often used.

Example 5.A.1 A χ^2 random variable with d degrees of freedom can be created from d independent random variables with the standard normal distribution. If ei ∼ N(0, 1), and if ei is independent from ej for j ≠ i, then Σ_{i=1}^{d} ei^2 follows the χ^2 distribution with d degrees of freedom. For example, in GAUSS one can generate a T × 1 vector with values that appear to be a realization of an i.i.d. sequence {xt}_{t=1}^{T} of random variables with the χ^2 distribution with d degrees of freedom by the following program:

e=RNDN(T,d);
x=sumc((e^2)');

Example 5.A.2 A random variable which follows the Student's t distribution with d degrees of freedom can be generated from d + 1 independent random variables with the standard normal distribution. If ei ∼ N(0, 1), and if ei is independent from ej for j ≠ i, then x = e1/√(Σ_{i=2}^{d+1} ei^2/d) follows the t distribution with d degrees of freedom. For example, in GAUSS one can generate a T × 1 vector with values that appear to be a realization of an i.i.d. sequence {xt}_{t=1}^{T} of random variables with the t distribution with d degrees of freedom by the following program:

e=RNDN(T,d+1);
c=sumc((e[.,2:d+1]^2)');
x=e[.,1]./sqrt(c/d);

Example 5.A.3 A K-dimensional random vector which follows N(0, Ψ) for any positive definite covariance matrix Ψ can be generated from K independent random variables with the standard normal distribution. Let Ψ = PP' be the Cholesky decomposition of Ψ, in which P is a lower triangular matrix. If ei ∼ N(0, 1), and if ei is independent from ej for j ≠ i, then X = Pe ∼ N(0, Ψ), where e = (e1, e2, ···, eK)'. For example, in GAUSS one can generate a T × K matrix with values that appear to be a realization of an i.i.d. sequence {Xt}_{t=1}^{T} of K-dimensional random vectors with the N(0, C) distribution with the following program, provided that the matrix C is already defined:

e=RNDN(T,K);
P=chol(C)';
x=e*P';

Note that the Cholesky decomposition in GAUSS gives an upper triangular matrix. Thus, the above program transposes the matrix to be lower triangular.
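Another transformation that is sometimes useful is the inverse-CDF method (an added illustration, not part of the original text). For example, exponential random variables with mean one can be generated from uniform random numbers:

u=RNDU(T,1);        @ uniform(0,1) random numbers @
x=-ln(1-u);         @ inverse CDF of the exponential distribution with mean 1 @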

5.A.2 Estimators

When a researcher applies an estimator to actual data without the normality assumption, asymptotic theory is used as a guide to the small sample properties of the estimator. In some cases, asymptotic theory does not give a good approximation of the exact finite sample properties. A Monte Carlo study can be used to estimate the true finite sample properties. In a Monte Carlo study, N independent samples are created and an estimate bi (i ≥ 1) for a parameter b0 is calculated for the i-th sample. Then the expected value of the estimator E(bi) can be estimated by its mean over the samples: (1/N) Σ_{i=1}^{N} bi. By the strong law of large numbers, the mean converges almost surely to the expected value. Other properties can also be reported, depending on the purpose of the study. For example, the mean, median, and standard deviation of the realized values of the estimator over generated samples can be computed and reported as estimates of the true values of these statistics. For example, Nelson and Startz (1990) report estimated 1%, 5%, 10%, 50%, 90%, and 99% fractiles for an IV estimator and compare them with the fractiles implied by the asymptotic distribution. This influential paper uses Monte Carlo simulations to study the small sample properties of the IV estimator and its t-ratio when instruments are poor in the sense that the relevance condition is barely satisfied.

When the deviation from the normal distribution is of interest, the skewness and kurtosis are often estimated and reported. The skewness of a variable Y with mean µ is

(5.A.1)    E(Y − µ)^3 / [Var(Y)]^{3/2}.

A variable with negative skewness is more likely to be far below the mean than it is to be far above, and conversely a variable with positive skewness is more likely to be far above the mean than it is to be below.
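A small GAUSS sketch (added here; not part of the original text) that computes these summary statistics from a vector b of N Monte Carlo estimates, using (5.A.1) for the skewness and the kurtosis measure (5.A.2) defined just below, up to the usual finite-sample divisor conventions:

m=meanc(b);
s=stdc(b);                     @ sample standard deviation @
sk=meanc((b-m)^3)/(s^3);       @ sample skewness, as in (5.A.1) @
ku=meanc((b-m)^4)/(s^4);       @ sample kurtosis, as in (5.A.2); about 3 for normal data @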

If Y has a symmetric distribution such as a normal distribution, then the skewness is zero. The kurtosis of Y is

(5.A.2)    E(Y − µ)^4 / [Var(Y)]^2.

If Y is normally distributed, the kurtosis is 3. If the kurtosis of Y exceeds 3, then its distribution has more mass in the tails than the normal distribution with the same variance.

5.A.3 Tests

When a researcher applies a test to actual data without the normality assumption, asymptotic theory is typically used. The significance level and critical value based on the asymptotic distribution are called the nominal significance level and the nominal critical value, respectively. For example, the critical value of 1.96 is used for a test statistic with the asymptotic normal distribution for the significance level of 5%. The probability of rejecting the null hypothesis when it is true is called the size of the test. Since the asymptotic distribution is not exactly equal to the exact distribution of the test statistic, the true size of the test based on the nominal critical value is usually either larger or smaller than the nominal significance level. This property is called the size distortion. If the true size is larger than the nominal significance level, the test overrejects the null hypothesis and is said to be liberal. If the true size is smaller than the nominal significance level, the test underrejects the null hypothesis and is said to be conservative. Using the distribution of the test statistic produced by a Monte Carlo simulation, one can estimate the true critical value. The power of the test is the probability of rejecting the null hypothesis when

the alternative hypothesis is true. In Monte Carlo studies, two versions of the power can be reported for each point of the alternative hypothesis: the power based on the nominal critical value and the power based on the estimated true critical value. The latter is called the size corrected power. The power based on the nominal critical value is also of interest because it is the probability of rejecting the null hypothesis in practice if asymptotic theory is used. For example, a liberal test tends to have a higher power based on the nominal critical value than a conservative test. However, we cannot conclude that the liberal test is better from this observation, because the probability of Type I error is not equal for the two tests. For the purpose of comparing tests, the size corrected power is more appropriate.

5.A.4 A Pitfall in Monte Carlo Simulations

Common mistakes are made by many graduate students when they first use Monte Carlo simulations. These mistakes happen when they repeatedly use a random number generator to conduct simulations, and are caused by updating seeds arbitrarily in the middle of a simulation. Recall that once the starting seed is given, a random number generator automatically updates the seed whenever it creates a number. The way the seed is updated depends on the program. The following example illustrates common mistakes in a simple form:

Example 5.A.4 A Monte Carlo Program with a Common Mistake (I)

ss=3937841;
i=1;
vecm=zeros(100,1);
do until i>100;
  RNDSEED ss;
  y=RNDN(50,1);
  m=meanc(y);
  vecm[i]=m;
  i=i+1;
endo;

In this example, the programmer is trying to create 100 samples of the sample mean of a standard normal random variable y when the sample size is 50. However, exactly the same data are generated 100 times because the same starting seed is given for each replication inside the do-loop. This mistake is innocuous because it is easy to detect. To correct the program, the RNDSEED statement inside the do loop should be removed and the statement RNDSEED ss; can be added after the first line. The following program contains a mistake which is harder to detect:

Example 5.A.5 A Monte Carlo Program with a Common Mistake (II)

ss=3937841;
i=1;
vecm=zeros(100,1);
do until i>100;
  RNDSEED ss+i;
  y=RNDN(50,1);
  m=meanc(y);
  vecm[i]=m;
  i=i+1;
endo;

The problem is that the seed is updated in an arbitrary way in each sample by giving a different starting seed. There is no guarantee that one sample is independent from the others. A correct program would put the RNDSEED statement before the do loop.
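A corrected version along the lines described above (added for illustration; not part of the original text) sets the starting seed once, before the do loop:

ss=3937841;
RNDSEED ss;
i=1;
vecm=zeros(100,1);
do until i>100;
  y=RNDN(50,1);
  m=meanc(y);
  vecm[i]=m;
  i=i+1;
endo;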

In Monte Carlo simulations, it is also important to control the starting seed so that the simulation results can be replicated. If no RNDSEED statement is given before the RNDN command is used, GAUSS will take the starting seed from the computer clock. Then there is no way to exactly replicate the Monte Carlo results. When you publish Monte Carlo results, it is advisable to put enough information in the publication so that others can exactly replicate the results.⁶ At the very least, a record of the information should be kept.

⁶This information is also relevant because different computer specifications and different versions of the program (such as GAUSS) can produce different results.

5.A.5 An Example Program

This section describes basic Monte Carlo methods with an example program. In the following example, the sample mean is calculated as an estimator for the expected value of Xt, E(Xt), where Xt = µ + et and et is drawn from the t distribution with 3 degrees of freedom. The example is concerned with the t test of the null hypothesis that µ = 0. Because a random variable with the t distribution with 3 degrees of freedom has zero mean and a finite second moment, asymptotic theory predicts that the t test statistic of the sample mean divided by the estimated standard error approximately follows the standard normal distribution. Because Xt is not normally distributed, the standard theory for the exact finite sample properties cannot be applied. The t distribution with 3 degrees of freedom has thick tails, and outlying values have high probability. Hence the t distribution is often considered a better distribution to describe some financial variables.

Example 5.A.6 The program MCMEAN.PRG:

@MCMEAN.PRG@
@Monte Carlo Program for the sample mean@
@This example program is a GAUSS program to calculate the empirical size and power of the t-test for H0: E(X)=0, where X follows the t-distribution with 3 degrees of freedom. The power is calculated for the case when E(X)=0.2.@
tend=25;        @the sample size@
nor=1000;       @the number of replications@
df=3;           @d.f. for the t-distribution of X@
RNDSEED 382974;
tn=zeros(nor,1);    @used to store t-values under H0@
ta=zeros(nor,1);    @used to store t-values under H1@
i=1;
do until i>nor;
  nrv=RNDN(tend,df+1);               @normal r.v.'s@
  crv=nrv[.,2:df+1]^2;               @chi square r.v.'s@
  x0=nrv[.,1]./sqrt(sumc(crv')/df);  @t distribution: used under H0@
  x1=x0+0.2;                         @used for H1@
  mx0=meanc(x0);
  mx1=meanc(x1);
  sighat0=sqrt((x0-mx0)'(x0-mx0)/(tend-1));   @sigmahat under H0@
  sighat1=sqrt((x1-mx1)'(x1-mx1)/(tend-1));   @sigmahat under H1@
  tn[i]=meanc(x0)*sqrt(tend)/sighat0;   @t-value under H0@
  ta[i]=meanc(x1)*sqrt(tend)/sighat1;   @t-value under H1@
  i=i+1;
endo;
sorttn=sortc(abs(tn),1);
etcv=sorttn[int(nor*0.95)];
output file=mc.out reset;
? "***** When H0 is true *****";
? "The estimated size with the nominal critical value";
? meanc(abs(tn).>1.96);
? "The estimated true 5-percent critical value";
? etcv;
? "***** When H1 is true *****";
? "The estimated power with the nominal critical value";
? meanc(abs(ta).>1.96);
? "The estimated size corrected power";
? meanc(abs(ta).>etcv);
output off;

Some features of the example are important. Before the do-loop of the replications, the program defines variables to store the test results:

tn=zeros(nor,1);
ta=zeros(nor,1);

Then, to avoid the common mistake explained in 5.A.4, it places the RNDSEED statement before the do-loop. It is a good idea to minimize the content inside the do-loop to speed up replications; everything that can be done outside the do-loop should be done before it. For example, the do-loop can be set up as follows:

i=1;
do until i>250;
  (Program for each replication)
  i=i+1;
endo;

After the do-loop, the program calculates characteristics of the generated distributions of the test statistics under the null hypothesis and the alternative hypothesis, such as the frequency of rejecting the null with the nominal critical value. In GAUSS, the program sets up an output file by output file=mc.out.

Exercises

5.1 Show that all conditions of Gordin's Central Limit Theorem are satisfied for et in Example 5.1.

5.2 Show that all conditions of Gordin and Hansen's Central Limit Theorem are satisfied for ft in Example 5.2.

5.3 Let et = Ψ(L)ut = Ψ0 ut + Ψ1 ut−1 + ··· be an MA representation. What is the long-run variance of ft = (1 − L)et?

5.4 Explain what it means to say that "a test under-rejects in small samples" (or "a test is conservative"). When a test is conservative, which is greater, the true critical value or the nominal critical value?

5.5 Consider the linear model yt = xt'β + et, where xt is a k-dimensional vector. Let zt be a k × 1 vector of instrumental variables. We will adopt the following set of assumptions:
(A1) (et, xt, zt)_{t=1}^{∞} is a stationary and ergodic stochastic process.
(A2) zt et have finite second moments.
(A3) E(et^2|zt) = σ^2, where σ is a constant.
(A4) E(et|It) = 0 for a sequence of information sets (It)_{t=1}^{∞} which is increasing (i.e., It ⊂ It+1); zt and xt are in It, and yt is in It+1.
(A5) E(zt xt') is nonsingular.
Note that E(et) = 0 is implied by (A4) if zt includes a constant. Note also that many rational expectations models imply (A4). For the following problems, prove the asymptotic properties of the instrumental variable (IV) estimator, bIV, for β under (A1)-(A5). Use a central limit theorem and a strong law of large

numbers given in this chapter, and indicate which ones you are using and where you are using them in your proof.

(a) Express the IV estimator bIV in terms of zt, xt, and yt (t = 1, ..., T) when Σ_{t=1}^{T} zt xt' is nonsingular.
(b) Let gt = zt et. Prove that gt is a martingale difference sequence.
(c) Prove that the IV estimator is consistent under (A1)-(A5).
(d) Prove that the IV estimator is asymptotically normally distributed. Derive the formula of the covariance matrix of the asymptotic distribution.
(e) Explain what happens if yt is in It+2 in (A4).

5.6 Consider the linear model yt = xt'β + εt, where xt is a k-dimensional vector. Following Hayashi (2000), suppose that this model satisfies the classical linear regression model assumptions for any sample size (n) as follows:
(A1) Linearity: yt = xt'β + et.
(A2) Ergodic stationarity: {yt, xt} is jointly stationary and ergodic.
(A3) Predetermined regressors: E(et xt) = 0.
(A4) Rank condition: E(xt xt') is nonsingular (and hence finite).
(A5) xt et is a martingale difference sequence with finite second moments.

(A6) Finite fourth moments for regressors: E[(xit xjt)^2] exists and is finite for all i, j (= 1, 2, ···, k).
(A7) Conditional homoskedasticity: E(et^2|xt) = σ^2 > 0.

Let

tk = (bk − β̄k)/SE(bk) = (bk − β̄k)/√(s^2[(X'X)^{-1}]kk)

be the t statistic for the null hypothesis βk = β̄k, where X is an n × k matrix with xt' in its t-th row.

(a) Prove that tk converges in distribution to the standard normal distribution as the sample size goes to infinity. You can assume that s^2 is consistent for σ^2; you do not have to prove that s^2 is consistent for this question.
(b) Based on the asymptotic result in (a), suppose that you set the nominal size to be 5 percent and reject the null hypothesis when |tk| is greater than 1.96. Does this test overreject or underreject? How do you know? Suppose that k = 3, and further assume that et is normally distributed conditional on X. Is the actual size larger than 10 percent when n = 4, ···, 9? What if n = 8, ···, 11? Explain.

5.7 Consider the linear model

(5.E.1)    y = Xβ + e,

where X is an n × k matrix with xt' in its t-th row. The model (5.E.1) can be written as

(5.E.2)    yt = xt'β + et.

We will adopt the following set of assumptions:
(A1) (et, xt)_{t=1}^{∞} are independent and identically distributed (i.i.d.) random vectors.
(A2) xt and et have finite second moments.
(A3) E(et^2|xt) = σ^2, which is a constant.
(A4) E(xt et) = 0 for all t ≥ 1.
(A5) E(xt xt') is nonsingular.
Note that E(et) = 0 is implied by (A4) if xt includes a constant. Consider two alternative assumptions for et:
(A6) et follows the t distribution with 4 degrees of freedom.
(A6') et follows the standard normal distribution.

Consider the model (5.E.1) with k = 1. Assume that xt follows N(0, 1) and that xt and et are independent. Under the null hypothesis H0, the true value of β is 0, so that yt = et. Consider the standard t statistic

(5.E.3)    t1 = (b − β)/(σ̂1 √((X'X)^{-1})),

where σ̂1^2 = (Y − Xb)'(Y − Xb)/(n − k), and consider another version of the t statistic

(5.E.4)    t2 = (b − β)/(σ̂2 √((X'X)^{-1})),

where σ̂2^2 = (Y − Xb)'(Y − Xb)/n. Note that both t1 and t2 converge in distribution to a random variable with the standard normal distribution. Note also that Assumptions 5.1-5.6 are satisfied with (A6'), so that t1 has the exact t distribution with n − k degrees of freedom.

Using GAUSS, conduct a Monte Carlo simulation with the sample size of 26 and 500 replications under Assumption (A6).

(a) Use the t1 in (5.E.3) to estimate (i) the true size of the t test for H0: β = 0 based on the nominal significance level of 5% when the nominal critical value based on the standard normal distribution is used; (ii) the true size of the t test for H0: β = 0 based on the nominal significance level of 5% when the nominal critical value based on the t distribution with 25 degrees of freedom is used; (iii) the true critical value of the t test for the 5% significance level; (iv) the power of the test at β = 0.15 based on the nominal critical value; and (v) the size corrected power of the test. For each t ratio, discuss whether it is better to use the standard normal distribution or the t distribution critical values for this application.

(b) Use the t2 in (5.E.4) and repeat the exercises (i)-(v). Also discuss whether t1 or t2 is better for this application.

For the starting seed, use 3648xxxx, where xxxx is your birth date, such as 0912 for September 12. Submit your program and output.

References

Apostol, T. M. (1974): Mathematical Analysis. Addison-Wesley, Reading, Massachusetts.

Billingsley, P. (1961): "The Lindeberg-Levy Theorem for Martingales," Proceedings of the American Mathematical Society, 12, 788-792.

Billingsley, P. (1986): Probability and Measure, 2nd edn. Wiley, New York.

Choi, C.-Y., and M. Ogaki (1999): "The Gauss-Markov Theorem for Cointegrating and Spurious Regressions," Manuscript.

Gordin, M. I. (1969): "The Central Limit Theorem for Stationary Processes," Soviet Mathematics-Doklady, 10, 1174-1176.

Hansen, L. P. (1985): "A Method for Calculating Bounds on the Asymptotic Covariance Matrices of Generalized Method of Moments Estimators," Journal of Econometrics, 30, 203-238.

Hayashi, F. (2000): Econometrics. Princeton University Press, Princeton.

Judge, G. G., W. E. Griffiths, R. C. Hill, H. Lütkepohl, and T.-C. Lee (1985): The Theory and Practice of Econometrics, 2nd edn. Wiley, New York.

Loeve, M. (1978): Probability Theory, 4th edn. Springer-Verlag, New York.

Nelson, C. R., and R. Startz (1990): "The Distribution of the Instrumental Variables Estimator and Its t-Ratio When the Instrument is a Poor One," Journal of Business, 63, S125-S140.