Econometric Methods with Applications in Business and Economics


**Econometric Methods with Applications in Business and Economics**

Christiaan Heij Paul de Boer Philip Hans Franses Teun Kloek Herman K. van Dijk


Great Clarendon Street, Oxford OX2 6DP

Oxford University Press is a department of the University of Oxford. It furthers the University’s objective of excellence in research, scholarship, and education by publishing worldwide in Oxford New York Auckland Cape Town Dar es Salaam Hong Kong Karachi Kuala Lumpur Madrid Melbourne Mexico City Nairobi New Delhi Shanghai Taipei Toronto

With offices in Argentina Austria Brazil Chile Czech Republic France Greece Guatemala Hungary Italy Japan Poland Portugal Singapore South Korea Switzerland Thailand Turkey Ukraine Vietnam

Oxford is a registered trade mark of Oxford University Press in the UK and in certain other countries

Published in the United States by Oxford University Press Inc., New York

© Christiaan Heij, Paul de Boer, Philip Hans Franses, Teun Kloek, and Herman K. van Dijk, 2004

The moral rights of the authors have been asserted
Database right Oxford University Press (maker)

First published 2004

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, without the prior permission in writing of Oxford University Press, or as expressly permitted by law, or under terms agreed with the appropriate reprographics rights organization. Enquiries concerning reproduction outside the scope of the above should be sent to the Rights Department, Oxford University Press, at the address above.

You must not circulate this book in any other binding or cover and you must impose the same condition on any acquirer

British Library Cataloguing in Publication Data
Data available

Library of Congress Cataloguing in Publication Data
Data available

Typeset by SPI Publisher Services, Pondicherry, India
Printed in Great Britain on acid-free paper by Antony Rowe Ltd, Chippenham, Wiltshire

ISBN 0–19–926801–0

3 5 7 9 10 8 6 4 2

Preface

Econometric models and methods are applied in the daily practice of virtually all disciplines in business and economics, such as finance, marketing, microeconomics, and macroeconomics. This book is meant for anyone interested in obtaining a solid understanding and active working knowledge of this field. The book provides the reader both with the required insight into econometric methods and with the practical training needed for successful applications. The guiding principle of the book is to stimulate the reader to work actively on examples and exercises, so that econometrics is learnt the way it works in practice — that is, as practical methods for solving questions in business and economics, based on a solid understanding of the underlying methods. In this way the reader is trained to make the proper decisions in econometric modelling. This book has grown out of half a century of experience in teaching undergraduate econometrics at the Econometric Institute in Rotterdam. With the support of Jan Tinbergen, Henri Theil founded the institute in 1956 and he developed econometrics into a full-blown academic programme. Originally, econometrics was mostly concerned with national and international macroeconomic policy; the required computing power to estimate econometric models was expensive and scarcely available, so that econometrics was almost exclusively applied in public (statistical) agencies. Much has changed, and nowadays econometrics finds widespread application in a rich variety of fields. The two major causes of this increased role of econometrics are the

information explosion in business and economics (with large data sets — for instance, in ﬁnance and marketing) and the enormous growth in cheap computing power and user-friendly software for a wide range of econometric methods. This development is reﬂected in the book, as it presents econometric methods as a collection of very useful tools to address issues in a wide range of application areas. First of all, students should learn the essentials of econometrics in a rigorous way, as this forms the indispensable basis for all valid practical work. These essentials are treated in Chapters 1–5, after which two major application areas are discussed in Chapter 6 (on individual choice data with applications in marketing and microeconomics) and Chapter 7 (on time series data with applications in ﬁnance and international economics). The Introduction provides more information on the motivation and contents of the book, together with advice for students and instructors, and the Guide to the Book explains the structure and use of the book. We thank our students, who always stimulate our enthusiasm to teach and who make us feel proud by their achievements in their later careers in econometrics, economics, and business management. We also thank both current and former members of the Econometric Institute in Rotterdam who have inspired our econometric work. Several people helped us in the process of writing the book and the solutions manual. First of all we should mention our colleague Zsolt Sandor and our (current and former) Ph.D.


students Charles Bos, Lennart Hoogerheide, Rutger van Oest, and Björn Vroomen, who all contributed substantially in producing the solutions manual. Further we thank our (current and former) colleagues at the Econometric Institute, Bas Donkers, Rinse Harkema, Johan Kaashoek, Frank Kleibergen, Richard Kleijn, Peter Kooiman, Marius Ooms, and Peter Schotman. We were assisted by our (former) students Arjan van Dijk, Alex Hoogendoorn, and Jesse de Klerk, and we obtained very helpful feedback from our students, in particular from Simone Jansen, Martijn de Jong, Mariëlle Non,

Arnoud Pijls, and Gerard Voskuil. Special thanks go to Aletta Henderiks, who never lost her courage in giving us the necessary secretarial support in processing the manuscript. Finally we wish to thank the delegates and staff of Oxford University Press for their assistance, in particular Andrew Schuller, Arthur Attwell, and Hilary Walford.

Christiaan Heij, Paul de Boer, Philip Hans Franses, Teun Kloek, Herman K. van Dijk
Rotterdam, 2004

From left to right: Christiaan Heij, Paul de Boer, Philip Hans Franses, Teun Kloek, and Herman K. van Dijk

Contents

Detailed Contents
List of Exhibits
Abbreviations
Guide to the Book
Introduction
1 Review of Statistics
2 Simple Regression
3 Multiple Regression
4 Non-Linear Methods
5 Diagnostic Tests and Model Adjustments
6 Qualitative and Limited Dependent Variables
7 Time Series and Dynamic Models
Appendix A. Matrix Methods
Appendix B. Data Sets
Index


Detailed Contents

List of Exhibits
Abbreviations
Guide to the Book

Introduction
  Econometrics
  Purpose of the book
  Characteristic features of the book
  Target audience and required background knowledge
  Brief contents of the book
  Study advice
  Teaching suggestions
  Some possible course structures

1 Review of Statistics
  1.1 Descriptive statistics
    1.1.1 Data graphs
    1.1.2 Sample statistics
  1.2 Random variables
    1.2.1 Single random variables
    1.2.2 Joint random variables
    1.2.3 Probability distributions
    1.2.4 Normal random samples
  1.3 Parameter estimation
    1.3.1 Estimation methods
    1.3.2 Statistical properties
    1.3.3 Asymptotic properties
  1.4 Tests of hypotheses
    1.4.1 Size and power
    1.4.2 Tests for mean and variance
    1.4.3 Interval estimates and the bootstrap
  Summary, further reading, and keywords
  Exercises

2 Simple Regression
  2.1 Least squares
    2.1.1 Scatter diagrams
    2.1.2 Least squares
    2.1.3 Residuals and R²
    2.1.4 Illustration: Bank Wages
  2.2 Accuracy of least squares
    2.2.1 Data generating processes
    2.2.2 Examples of regression models
    2.2.3 Seven assumptions
    2.2.4 Statistical properties
    2.2.5 Efficiency
  2.3 Significance tests
    2.3.1 The t-test
    2.3.2 Examples
    2.3.3 Use under less strict conditions
  2.4 Prediction
    2.4.1 Point predictions and prediction intervals
    2.4.2 Examples
  Summary, further reading, and keywords
  Exercises

3 Multiple Regression
  3.1 Least squares in matrix form
    3.1.1 Introduction
    3.1.2 Least squares
    3.1.3 Geometric interpretation
    3.1.4 Statistical properties
    3.1.5 Estimating the disturbance variance
    3.1.6 Coefficient of determination
    3.1.7 Illustration: Bank Wages
  3.2 Adding or deleting variables
    3.2.1 Restricted and unrestricted models
    3.2.2 Interpretation of regression coefficients
    3.2.3 Omitting variables
    3.2.4 Consequences of redundant variables
    3.2.5 Partial regression
  3.3 The accuracy of estimates
    3.3.1 The t-test
    3.3.2 Illustration: Bank Wages
    3.3.3 Multicollinearity
    3.3.4 Illustration: Bank Wages
  3.4 The F-test
    3.4.1 The F-test in different forms
    3.4.2 Illustration: Bank Wages
    3.4.3 Chow forecast test
    3.4.4 Illustration: Bank Wages
  Summary, further reading, and keywords
  Exercises

4 Non-Linear Methods
  4.1 Asymptotic analysis
    4.1.1 Introduction
    4.1.2 Stochastic regressors
    4.1.3 Consistency
    4.1.4 Asymptotic normality
    4.1.5 Simulation examples
  4.2 Non-linear regression
    4.2.1 Motivation
    4.2.2 Non-linear least squares
    4.2.3 Non-linear optimization
    4.2.4 The Lagrange Multiplier test
    4.2.5 Illustration: Coffee Sales
  4.3 Maximum likelihood
    4.3.1 Motivation
    4.3.2 Maximum likelihood estimation
    4.3.3 Asymptotic properties
    4.3.4 The Likelihood Ratio test
    4.3.5 The Wald test
    4.3.6 The Lagrange Multiplier test
    4.3.7 LM-test in the linear model
    4.3.8 Remarks on tests
    4.3.9 Two examples
  4.4 Generalized method of moments
    4.4.1 Motivation
    4.4.2 GMM estimation
    4.4.3 GMM standard errors
    4.4.4 Quasi-maximum likelihood
    4.4.5 GMM in simple regression
    4.4.6 Illustration: Stock Market Returns
  Summary, further reading, and keywords
  Exercises

5 Diagnostic Tests and Model Adjustments
  5.1 Introduction
  5.2 Functional form and explanatory variables
    5.2.1 The number of explanatory variables
    5.2.2 Non-linear functional forms
    5.2.3 Non-parametric estimation
    5.2.4 Data transformations
    5.2.5 Summary
  5.3 Varying parameters
    5.3.1 The use of dummy variables
    5.3.2 Recursive least squares
    5.3.3 Tests for varying parameters
    5.3.4 Summary
  5.4 Heteroskedasticity
    5.4.1 Introduction
    5.4.2 Properties of OLS and White standard errors
    5.4.3 Weighted least squares
    5.4.4 Estimation by maximum likelihood and feasible WLS
    5.4.5 Tests for homoskedasticity
    5.4.6 Summary
  5.5 Serial correlation
    5.5.1 Introduction
    5.5.2 Properties of OLS
    5.5.3 Tests for serial correlation
    5.5.4 Model adjustments
    5.5.5 Summary
  5.6 Disturbance distribution
    5.6.1 Introduction
    5.6.2 Regression diagnostics
    5.6.3 Test for normality
    5.6.4 Robust estimation
    5.6.5 Summary
  5.7 Endogenous regressors and instrumental variables
    5.7.1 Instrumental variables and two-stage least squares
    5.7.2 Statistical properties of IV estimators
    5.7.3 Tests for exogeneity and validity of instruments
    5.7.4 Summary
  5.8 Illustration: Salaries of top managers
  Summary, further reading, and keywords
  Exercises

6 Qualitative and Limited Dependent Variables
  6.1 Binary response
    6.1.1 Model formulation
    6.1.2 Probit and logit models
    6.1.3 Estimation and evaluation
    6.1.4 Diagnostics
    6.1.5 Model for grouped data
    6.1.6 Summary
  6.2 Multinomial data
    6.2.1 Unordered response
    6.2.2 Multinomial and conditional logit
    6.2.3 Ordered response
    6.2.4 Summary
  6.3 Limited dependent variables
    6.3.1 Truncated samples
    6.3.2 Censored data
    6.3.3 Models for selection and treatment effects
    6.3.4 Duration models
    6.3.5 Summary
  Summary, further reading, and keywords
  Exercises

7 Time Series and Dynamic Models
  7.1 Models for stationary time series
    7.1.1 Introduction
    7.1.2 Stationary processes
    7.1.3 Autoregressive models
    7.1.4 ARMA models
    7.1.5 Autocorrelations and partial autocorrelations
    7.1.6 Forecasting
    7.1.7 Summary
  7.2 Model estimation and selection
    7.2.1 The modelling process
    7.2.2 Parameter estimation
    7.2.3 Model selection
    7.2.4 Diagnostic tests
    7.2.5 Summary
  7.3 Trends and seasonals
    7.3.1 Trend models
    7.3.2 Trend estimation and forecasting
    7.3.3 Unit root tests
    7.3.4 Seasonality
    7.3.5 Summary
  7.4 Non-linearities and time-varying volatility
    7.4.1 Outliers
    7.4.2 Time-varying parameters
    7.4.3 GARCH models for clustered volatility
    7.4.4 Estimation and diagnostic tests of GARCH models
    7.4.5 Summary
  7.5 Regression models with lags
    7.5.1 Autoregressive models with distributed lags
    7.5.2 Estimation, testing, and forecasting
    7.5.3 Regression of variables with trends
    7.5.4 Summary
  7.6 Vector autoregressive models
    7.6.1 Stationary vector autoregressions
    7.6.2 Estimation and diagnostic tests of stationary VAR models
    7.6.3 Trends and cointegration
    7.6.4 Summary
  7.7 Other multiple equation models
    7.7.1 Introduction
    7.7.2 Seemingly unrelated regression model
    7.7.3 Panel data
    7.7.4 Simultaneous equation model
    7.7.5 Summary
  Summary, further reading, and keywords
  Exercises

Appendix A. Matrix Methods
  A.1 Summations
  A.2 Vectors and matrices
  A.3 Matrix addition and multiplication
  A.4 Transpose, trace, and inverse
  A.5 Determinant, rank, and eigenvalues
  A.6 Positive (semi)definite matrices and projections
  A.7 Optimization of a function of several variables
  A.8 Concentration and the Lagrange method
  Exercise

Appendix B. Data Sets
  List of Data Sets

Index

List of Exhibits

0.1 Econometrics as an interdisciplinary field
0.2 Econometric modelling
0.3 Book structure

Exhibits 1.1–7.38 and A.1–A.3 accompany the examples and illustrations in Chapters 1–7 and Appendix A, drawing on the data sets listed in Appendix B (among others Student Learning, Bank Wages, Stock Market Returns, Coffee Sales, Interest and Bond Rates, Industrial Production, Direct Marketing for Financial Product, Dow-Jones Index, and Primary Metal Industries).

Abbreviations

Apart from abbreviations that are common in econometrics, the list also contains the abbreviations (in italics) used to denote the data sets of examples and exercises, but not the abbreviations used to denote the variables in these data sets (see Appendix B for the meaning of the abbreviated variable names).

2SLS  two-stage least squares
3SLS  three-stage least squares
ACF  autocorrelation function
ADF  augmented Dickey–Fuller
ADL  autoregressive distributed lag
AIC  Akaike information criterion
AR  autoregressive
ARCH  autoregressive conditional heteroskedasticity
ARIMA  autoregressive integrated moving average
ARMA  autoregressive moving average
BHHH  method of Berndt, Hall, Hall, and Hausman
BIC  Bayes information criterion
BLUE  best linear unbiased estimator
BWA  bank wages (data set 2)
CAPM  capital asset pricing model
CAR  car production (data set 18)
CDF  cumulative distribution function
CL  conditional logit
COF  coffee sales (data set 4)
CUSUM  cumulative sum
CUSUMSQ  cumulative sum of squares
DGP  data generating process
DJI  Dow-Jones index (data set 15)
DMF  direct marketing for financial product (data set 13)
DUS  duration of strikes (data set 14)
ECM  error correction model
EWMA  exponentially weighted moving average
EXR  exchange rates (data set 21)
FAS  fashion sales (data set 8)
FEX  food expenditure (data set 7)
FGLS  feasible generalized least squares
FWLS  feasible weighted least squares
GARCH  generalized autoregressive conditional heteroskedasticity
GLS  generalized least squares
GMM  generalized method of moments
GNP  gross national product (data set 20)
HAC  heteroskedasticity and autocorrelation consistent
IBR  interest and bond rates (data set 9)
IID  identically and independently distributed
INP  industrial production (data set 10)
IV  instrumental variable
LAD  least absolute deviation
LM  Lagrange multiplier
LOG  natural logarithm
LR  likelihood ratio
MA  moving average
MAE  mean absolute error
MGC  motor gasoline consumption (data set 6)
ML  maximum likelihood
MNL  multinomial logit
MOM  mortality and marriages (data set 16)
MOR  market for oranges (data set 23)
MSE  mean squared error
NEP  nuclear energy production (data set 19)
NID  normally and independently distributed
NLS  non-linear least squares
OLS  ordinary least squares
P  probability (P-value)
PACF  partial autocorrelation function
PMI  primary metal industries (data set 5)
QML  quasi-maximum likelihood
RESET  regression specification error test
RMSE  root mean squared error
SACF  sample autocorrelation function
SCDF  sample cumulative distribution function
SEM  simultaneous equation model
SIC  Schwarz information criterion
SMR  stock market returns (data set 3)
SPACF  sample partial autocorrelation function
SSE  explained sum of squares
SSR  sum of squared residuals
SST  total sum of squares
STAR  smooth transition autoregressive
STP  standard and poor index (data set 22)
STU  student learning (data set 1)
SUR  seemingly unrelated regression
TAR  threshold autoregressive
TBR  Treasury Bill rates (data set 17)
TMSP  total mean squared prediction error
TOP  salaries of top managers (data set 11)
USP  US presidential elections (data set 12)
VAR  vector autoregressive
VECM  vector error correction model
W  Wald
WLS  weighted least squares

Guide to the Book

This guide describes the organization and use of the book. We refer to the Introduction for the purpose of the book, for a synopsis of the contents of the book, for study advice, and for suggestions for instructors as to how the book can be used in different courses.

Learning econometrics: Why, what, and how

The learning student is confronted with three basic questions: Why should I study this? What knowledge do I need? How can I apply this knowledge in practice? Therefore the topics of the book are presented in the following manner:
- explanation by motivating examples;
- discussion of appropriate econometric models and methods;
- illustrative applications in practical examples;
- training by empirical exercises (using an econometric software package);
- optional deeper understanding (theory text parts and theory and simulation exercises).

The book can be used for applied courses that focus on the ‘how’ of econometrics and also for more advanced courses that treat both the ‘how’ and the ‘what’ of econometrics. The user is free to choose the desired balance between econometric applications and econometric theory. In applied courses, the theory parts (clearly marked in the text) and the theory and simulation exercises can be skipped without any harm. Even without these parts, the text still provides a good understanding of the ‘what’ of econometrics that is required in sound applied work. In more advanced courses, students get a deeper understanding of econometrics — in addition to the practical skills of applied courses — by studying also the theory parts and by doing the theory and simulation exercises. This allows them to apply econometrics in new situations that require a creative mind in developing alternative models and methods, as there exist no standard ‘how-to-do’ recipes that can be applied blindly in practice.

Text structure

The required background material is covered in Chapter 1 (which reviews statistical methods that are fundamental in econometrics) and in Appendix A (which summarizes useful matrix methods, together with computational examples). In Chapters 2–4 we refer to the preliminary knowledge needed from Chapter 1 (on statistics) and Appendix A (on matrix methods). Therefore, it is not necessary to cover all Chapter 1 before starting on the other chapters, as Chapter 1 can be reviewed along the way as one progresses through Chapters 2–4, and the same holds true for the material of Appendix A.

The core material on econometrics is in Chapters 2–7. Chapters 2–5 treat fundamental econometric methods that are needed for the topics discussed in Chapters 6 and 7. In Chapters 6 and 7 we indicate which parts of the earlier chapters are needed at each stage. Most of the sections of Chapter 5 can be read independently of each other, and in Chapters 6 and 7 some sections can be skipped depending on the topics of interest for the reader. Further details of the text structure are discussed in the Introduction (see the section ‘Teaching suggestions’ — in particular, Exhibit 0.3).

Each chapter has the following structure. The chapter starts with a brief statement of the purpose of the chapter, followed by sections and subsections that are divided into manageable parts with clear headings. To facilitate the use of the book, the required preliminary knowledge is indicated at the start of subsections. Examples, theory parts, and computational schemes are clearly indicated in the text. Summaries are included at many points — especially at the end of all sections in Chapters 5–7. The chapter concludes with a brief summary, further reading, and a keyword list that summarizes the treated topics. A varied set of exercises is included at the end of each chapter.

Examples and data sets

The econometric models and methods are motivated by means of fully worked-out examples using real-world data sets from a variety of applications in business and economics. The examples are clearly marked in the text because they play a crucial role in explaining the application of econometric methods.

The corresponding data sets are available from the web site of the book, and Appendix B explains the type and source of the data and the meaning of the variables in the data files (see p. 748 for a list of all the data sets used in the book). The names of the data sets consist of three parts: XM (for examples) and XR (for exercises); three digits, indicating the example or exercise number; and three letters, indicating the data topic. For example, the file XM101STU contains the data for Example 1.1 on student learning, and the file XR111STU contains the data for Exercise 1.11 on student learning.

Exercises

Students will enhance their understanding and acquire practical skills by working through the exercises, which are of three types.
- Theory exercises on derivations and model extensions. These exercises deepen the theoretical understanding of the ‘what’ of econometrics.
- Simulation exercises illustrating statistical properties of econometric models and methods. These exercises provide more intuitive understanding of some of the central theoretical results.
- Empirical exercises on applications with business and economic data sets to solve questions of practical interest. These exercises focus on the ‘how’ of econometrics, so that the student learns to construct appropriate models from real-world data and to draw sound conclusions from the obtained results.

Actively working through the empirical exercises is essential to gaining a proper understanding of econometrics and to getting hands-on experience with applications to solve practical problems. Each subsection concludes with a list of exercises related to the material of that subsection (where T denotes theory exercises, S simulation exercises, and E empirical exercises). An asterisk (*) denotes advanced (parts of) exercises. Every exercise refers to the parts of the chapter that are needed for doing the exercise, and the choice of appropriate exercises is facilitated by these cross-references. The desired level of the course will determine how many of the theory exercises should be covered. The web site of the book contains the data sets of all empirical exercises, and Appendix B contains information on these data sets.
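As a concrete illustration of this naming scheme, the following minimal Python sketch builds such file names. It is not part of the book’s materials; the helper name is purely illustrative, and it assumes that the three digits combine the chapter number with a two-digit example or exercise number, as in the two examples given above.

    def data_file_name(kind, chapter, number, topic):
        # kind:    "XM" for examples, "XR" for exercises
        # chapter: chapter number, e.g. 1
        # number:  example or exercise number within the chapter, e.g. 1 or 11
        # topic:   three-letter data topic code, e.g. "STU" for student learning
        assert kind in ("XM", "XR")
        return "%s%d%02d%s" % (kind, chapter, number, topic.upper())

    print(data_file_name("XM", 1, 1, "STU"))   # XM101STU (Example 1.1)
    print(data_file_name("XR", 1, 11, "STU"))  # XR111STU (Exercise 1.11)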

Web site and software

The web site of the book contains all the data sets used in the book, in three formats: EViews, Excel, and ASCII. All the examples and all the empirical and simulation exercises in the book can be done with EViews version 3.1 and higher (Quantitative Micro Software, 1994–8), but other econometric software packages can also be used in most cases. The exhibits for the empirical examples in the text have been obtained by using EViews version 3.1. The student version of the EViews package suffices for most of the book; this version has some limitations, however — for example, it does not support the programs required for the simulation exercises (see the web site of the book for further details).

Instructor material

Instructors who adopt the book can receive the Solutions Manual of the book for free. The manual contains over 350 pages with fully worked-out text solutions of all exercises, both of the theory questions and of the empirical and simulation questions. The manual contains a CD-ROM with solution files (EViews work files with the solutions of all empirical exercises and EViews programs for all simulation exercises); this will assist instructors in selecting material for exercise sessions and computer sessions as part of their course. This CD-ROM also contains all the exhibits of the book (in Word format) to facilitate lecture presentations. The printed solutions manual and CD-ROM can be obtained from Oxford University Press, upon request by adopting instructors. For further information and additional material we refer readers to the Oxford University Press web site of the book.

Remarks on notation

In the text we follow the notational conventions commonly used in econometrics. Scalar variables and vectors are denoted by lower-case italic letters (x, y, and so on), and matrices are denoted by upper-case italic letters (X, A, and so on). In Section 7.6 vectors of variables are denoted by upper-case italic letters, such as Yt, in accordance with most of the literature on this topic.

Transposition is denoted by a prime (X′, x′, and so on). The element in row i and column j of a matrix A is generally denoted by aij, except for the regressor matrix X, where this element is denoted by xji, which is observation i of variable j (see Section 3.1.2). Further, xi denotes the vector containing the values of all the explanatory variables xji for observation i (including the value 1 as first element of xi if the model contains a constant term). Unknown parameters are denoted by Greek italic letters (β, σ, and so on), and estimated quantities by Latin italic letters (b, s, e, and so on), sometimes by imposing a hat (β̂, σ̂, and so on). Expected values are denoted by E[·] — for instance, E[b]. log(x) denotes the natural logarithm of x (with base e = 2.71828...). In many of the exhibits — for instance, the ones related to empirical examples — we show the output as generated by the software program EViews. The notation in these exhibits may differ from the above conventions: scalar variables are denoted by capital letters (X, Y, and so on), and statistics are denoted by text (R-squared instead of R², Std. Dev. instead of s, and so on). In most cases this does not lead to any confusion, and otherwise the notation is explained in the text or in the caption of the exhibits.
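As a compact preview of these conventions in use (the linear regression model and the least squares formulas below are developed in Chapters 2 and 3):

    \[
      y = X\beta + \varepsilon, \qquad
      b = (X'X)^{-1}X'y, \qquad
      e = y - Xb,
    \]

where $\beta$ denotes the unknown (Greek) parameter vector, $b$ its (Latin) least squares estimate, $\varepsilon$ the vector of disturbances, $e$ the vector of residuals, and $X'$ the transpose of the regressor matrix $X$.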


Introduction

Econometrics

Decision making in business and economics is often supported by the use of quantitative information. Econometrics is concerned with summarizing relevant data information by means of a model. Such econometric models help to understand the relation between economic and business variables and to analyse the possible effects of decisions.

Econometrics was founded as a scientific discipline around 1930. In the early years, most applications dealt with macroeconomic questions to help governments and large firms in making their long-term decisions. Nowadays econometrics forms an indispensable tool to model empirical reality in almost all economic and business disciplines. There are three major reasons for this increasing attention for factual data and econometric models.

- Relevant quantitative data are available in many economic and business disciplines. In areas such as finance and marketing, quantitative data (on price movements, sales patterns, and so on) are collected on a regular basis — weekly, daily, or even every split second. Much information is also available in microeconomics (for instance, on the spending behaviour of households). Econometric techniques have been developed to deal with all such kinds of information.
- Economic theory often does not give the quantitative information that is needed in practical decision making.
- Realistic models can easily be solved by modern econometric techniques to support everyday decisions of economists and business managers.

Econometrics is an interdisciplinary field. It uses insights from economics and business in selecting the relevant variables and models, it uses statistics and mathematics to develop econometric methods that are appropriate for the data and the problem at hand, and it uses computer-science methods to collect the data and to solve econometric models. The interplay of these disciplines in econometric modelling is summarized in Exhibit 0.1.

Exhibit 0.1 Econometrics as an interdisciplinary field (diagram: econometrics at the intersection of economics and business, statistics, mathematics, and computer science)

Purpose of the book

The book gives the student a sound introduction into modern econometrics. As the title of the book indicates, it discusses econometric methods (tools for the formulation, estimation, and diagnostic analysis of econometric models) that are motivated and illustrated by applications in business and economics (to answer practical questions that support decisions by means of relevant quantitative data information). Although econometric models and methods differ according to the nature of the data and the type of questions under investigation, all applications share this common structure. This involves the following steps.

1. Question. Formulate the economic and business questions of central interest.
2. Information. Collect and analyse relevant statistical data.
3. Model. Formulate and estimate an appropriate econometric model.
4. Analysis. Analyse the empirical validity of the model.
5. Application. Apply the model to answer the questions and to support decisions.

These steps are shown in Exhibit 0.2. Steps 1, 2, and 5 form the applied part of econometrics and steps 3 and 4 the theoretical part.

The book provides a rigorous and self-contained treatment of the central methods in econometrics in Chapters 1–5. Two major application areas are discussed in more detail — that is, models for individual economic behaviour (with applications in marketing and microeconomics) in Chapter 6 and models for time series data (with applications in finance and macroeconomics) in Chapter 7. This provides the student with a thorough understanding of the central ideas and their practical application. The student obtains a solid understanding of econometric methods and an active training in econometrics as it is applied in practice. The book is selective, as its purpose is not to give an exhaustive encyclopaedic overview of all available methods. The thorough treatment of the selected topics not only enables the student to apply these methods successfully in practice, it also gives an excellent preparation for understanding and applying econometrics in other application areas.

Exhibit 0.2 Econometric modelling (flow diagram: economic or business problem of interest; data, economic model, statistical method, and software lead to an econometric model; if the model is not OK, revise it; if it is OK, use it for forecasting and decision making)

Characteristic features of the book

Over recent years several new and refreshing econometric textbooks have appeared. Our book is characterized by its thorough discussion of core econometrics motivated and illustrated by real-world examples from a broad range of economic and business applications. Some characteristic features of the book follow.

- The book gives a sound and solid training in basic econometric thinking and working in Chapters 1–5.
- Two major application areas are discussed in detail — namely, choice data (in marketing and microeconomics) in Chapter 6 and time series data (in finance and international economics) in Chapter 7.
- The book presents deep coverage of key econometric topics rather than exhaustive coverage of all topics. All topics are treated thoroughly and are illustrated with up-to-date real-world applications to solve practical economic and business questions.
- The book is of an academic level and it is rigorous and self-contained. Preliminary topics in statistics are reviewed in Chapter 1, and required matrix methods are summarized in Appendix A.
- The book stimulates active learning by the examples, which show econometrics as it works in practice, and by extensive exercise sets, which contain both theory questions and empirical questions. This twofold serious attention for methods and applications is also reflected in the extensive exercise sections at the end of each chapter. The theory and simulation exercises provide a deeper understanding, and the empirical examples provide the student with a working understanding and hands-on experience with econometrics in a broad set of real-world economic and business data sets.

Target audience and required background knowledge

As stated in the Preface, the book is directed at anyone interested in obtaining a solid understanding and active working knowledge of econometrics as it works in the daily practice of business and economics. The book can be used both in more advanced (graduate) courses and in introductory applied (undergraduate) courses, because the more theoretical parts can easily be skipped without loss of coherence of the exposition. It does not require any prior course in econometrics. The book assumes a good working knowledge of basic statistics and some knowledge of matrix algebra. An overview of the required statistical concepts and methods is given in Chapter 1, which is meant as a refresher and which requires a preliminary course in statistics. The required matrix methods are summarized in Appendix A.

Brief contents of the book

The contents can be split into four parts: Chapter 1 (review of statistics), Chapters 2–4 (model building), Chapter 5 (model evaluation), and Chapters 6 and 7 (selected application areas). The book builds up econometrics from its fundamentals in simple models to modern applied research.

Chapter 1 reviews the statistical material needed in later chapters. It serves as a refresher for students with some background in statistics. The chapter discusses the concepts of random variables and probability distributions and methods of estimation and testing.

Basic econometric methods are described in Chapters 2–4. Chapters 2 and 3 treat a relatively simple yet very useful model that is much applied in practice — namely, the linear regression model. The statistical properties of the least squares method are derived under a number of assumptions. The multiple regression model in Chapter 3 is formulated in matrix terms, because this enables an analysis by means of transparent and efficient matrix methods (summarized in Appendix A). In Chapter 4 we extend Chapters 2 and 3 to non-linear models and we discuss the maximum likelihood method and the generalized method of moments. The corresponding estimates can be computed by numerical optimization procedures and statistical properties can be derived under the assumption that a sufficient number of observations are available.

Chapter 5 forms the bridge between the basic methods in Chapters 2–4 and the application areas discussed in Chapters 6 and 7. It discusses a set of diagnostic instruments that play a crucial role in obtaining empirically valid models. Along with tests on the correct specification of the regression model, we also discuss several extensions that are often used in practice. This involves, to mention a few, the use of dummy variables in regression models, models with varying parameters, models for changing variance (heteroskedasticity), dynamic models (serial correlation), robust estimation methods, and instrumental variables methods. Our motivation for the extensive treatment of these topics is that regression is by far the most popular method for applied work. The applied researcher should check whether the required regression assumptions are valid and, if some of the assumptions are not acceptable, he or she should know how to proceed to improve the model.

In Chapters 6 and 7 we discuss econometric models and methods for two major application areas — namely, discrete choice models and models for time series data. Chapter 6 concerns individual decision making with applications in marketing and microeconomics. We discuss logit and probit models and models for truncated and censored data and duration data. Chapter 7 discusses univariate and multivariate time series methods, which find many applications in finance and international economics. We pay special attention to forecasting methods and to the modelling of trends and changing variance in time series. These two chapters can be read independently from each other.

The book discusses core econometrics and selected key topics. It does not provide an exhaustive treatment of all econometric topics — for instance, we discuss only parametric models and we pay hardly any attention to non-parametric or semi-parametric techniques. We pay only brief attention to panel data models, simultaneous equation models, and models with latent variables. Our models are relatively simple and can be optimized in a relatively straightforward way — for example, we do not discuss optimization by means of simulation techniques. Also some aspects of significant practical importance, such as data collection and report writing, are not discussed in the book. As stated before, our purpose is to give the student a profound working knowledge of core econometrics needed in good applied work. We are confident that, with the views and skills acquired after studying the book, the student will be well prepared to master the other topics on his or her own.

Study advice

In Chapters 2–4 it is assumed that the student understands the statistical topics of Chapter 1. The student can check this by means of the keyword list at the end of Chapter 1. The subsections of Chapters 2–4 contain references to the corresponding relevant parts of Chapter 1, so that statistical topics that are unknown or partly forgotten can be studied along the way. The further reading list in Chapter 1 contains references to statistical textbooks that treat the required topics in much more detail. The sections of Chapter 5 can be read independently from each other, and it is not necessary to study all Chapter 5 before proceeding with the applications in Chapters 6 and 7.

The best way to study is as follows.
1. Understand the general nature of the practical question of interest.
2. Understand the model formulation and the main methods of analysis, including the model properties and assumptions.
3. Train the practical understanding by working through the text examples (preferably using a software package to analyse the example data sets).
4. Obtain active understanding by doing the empirical exercises, using EViews or a similar econometric software package (the data sets can be downloaded from the web site of the book).
5. Deepen the understanding by studying the theoretical parts in the main text and by doing the theory and simulation exercises. This provides a better understanding of the various model assumptions that are needed to justify the econometric analysis.

Chapter 2 is a fundamental chapter that prepares the ground for all later chapters. It discusses the concept of an econometric model and the role of random variables and it treats statistical methods for estimation, testing, and forecasting. This is extended in Chapters 3 and 4 to more general models and methods. After studying Chapters 2–4 in this way, the student is ready for more. Several options for further chapters are open, and we refer to the teaching suggestions and Exhibit 0.3 below for further details.

Teaching suggestions

The book is suitable both for advanced undergraduate courses and for introductory graduate courses in business and economics programmes. In applied courses much of the underlying theory can easily be skipped without loss of coherence of the exposition (by skipping the theory sections in the text and the theory and simulation exercises). In more advanced courses, the theory parts in the text clarify the structure of econometric models and the role of model assumptions needed to justify econometric methods. The book can be used in three types of courses.

1. Advanced Undergraduate Course on Econometrics. Focus on Chapters 2–4, and possibly on parts of Chapter 5. This material can be covered in one trimester or semester.
2. Introductory Graduate Course on Econometrics. Focus on Chapters 2–4, and on some parts of Chapters 5–7. This requires one or two trimesters or semesters, depending on the background of the students and on the desired coverage of topics.
3. Intermediate Graduate Course on Econometrics. Focus on Chapters 5–7, and possibly on some parts of Chapters 3–4 as background material. This requires one or two trimesters or semesters, depending on background and coverage.

In all cases it is necessary for the students to understand the statistical topics reviewed in Chapter 1. The keyword list at the end of Chapter 1 summarizes the required topics and the further reading contains references to textbooks that treat the topics in more detail. Chapters 2–4 treat concepts and methods that are of fundamental importance in all econometric work, and the basics (Chapters 2–4 and possibly selected parts of Chapter 5) can be treated thoroughly in one trimester or semester. This material can be skipped only if the students have already followed an introductory course in econometrics. Chapters 5–7 can be treated selectively, according to the purposes of the course. Exhibit 0.3 gives an overview of the dependencies between topics. The book leaves the teacher a lot of freedom to select topics, as long as the logical dependencies between the topics in Exhibit 0.3 are respected. For instance, if the aim is to cover GARCH models (Section 7.4) in the course, then it will be necessary to include the main topics of Sections 5.4 and 7.1–7.2 in the course.

The book is suitable for different entrance levels. Students starting in econometrics will have to begin at the top of Exhibit 0.3. Students with a preliminary background in econometrics can start somewhere lower in Exhibit 0.3 and select different routes to applied econometric areas.

We advise teachers always to include the following three ingredients in the course.
- Lectures on the book material to discuss econometric models and methods with illustrative text examples, preferably supported by a lecture room PC to show the data and selected results of the analysis.
- Exercise sessions treating selected theory exercises to train mathematical and statistical econometric methods on paper.
- Computer sessions treating selected empirical and simulation exercises to get hands-on experience by applying econometrics to real-world economic and business data.

Our advice is always to pay particular attention to the motivation of models and methods; the examples in the main text serve this purpose, and the students can get further training by working on the empirical exercises at the end of each chapter. In our own programme in Rotterdam, the students work together in groups of four to perform small-scale projects on the computer by analysing data sets from the book.

Exhibit 0.3 Book structure (diagram of dependencies: background in Chapter 1: Statistics; basics in Chapters 2: Simple Regression, 3: Multiple Regression, and 4: Non-Linear Methods; diagnostics in Chapter 5; choice models in Chapter 6, with applications in microeconomics and marketing; time series models in Chapter 7, with applications in macroeconomics and finance)

Some possible course structures

For all courses we suggest reserving approximately the following relative time load for the students’ different activities:
- 40 per cent self-study of the book, including preparation of computer and paper exercises;
- 20 per cent for attending lectures;
- 20 per cent for exercise sessions (10 per cent guided, 10 per cent group work);
- 20 per cent for computer sessions (10 per cent guided, 10 per cent individual work).

For instance, in a twelve-week trimester course with a student load of 120 hours, this corresponds basically to two lecture hours per week and two hours per week each of exercise sessions and computer sessions.
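As a quick arithmetic check of this guideline (an illustration, not from the book):

    \[
      0.20 \times 120\ \text{hours} = 24\ \text{hours}, \qquad
      24\ \text{hours} / 12\ \text{weeks} = 2\ \text{hours per week}
    \]

for each of the lecture, exercise-session, and computer-session components.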

and Chapter 6. Sections 5. (d2) Econometric Applications in Finance and Macroeconomics (single course. This is a second-year course for students who followed introductory courses in statistics and linear algebra in their ﬁrst year. as the course load is 160 hours. 120 hours): parts of Chapters 3 and 4. 7. Here we also basically follow option (e). but.6. 300 hours): Chapters 2–7. (e2) Econometric Applications (single course.4–5. 120 hours): parts of Chapters 3 and 4 and Sections 5.Introduction 9 exercise hours per week (half on computer and half on paper). Taking this type of course as our basis.4–5. (b) Introductory Econometrics (extended course on basics.1–7.6. In Rotterdam we use the book for undergraduate students in econometrics and we basically follow option (e) above. 180 hours): Chapters 2–5. and Chapter 7.4–5.4–5. 120 hours): parts of Chapters 3 and 4.7.2. and Chapters 5–7. we focus on practical aspects and skip most of the theory parts. (c) Econometrics with Applications in Marketing and Microeconomics (double course. Sections 5. We also use the book for ﬁrst-year graduate students in economics in Rotterdam and Amsterdam.6. 240 hours): Chapters 2–4.4 and 5. and Chapter 7. (a) Introductory Econometrics (single course on basics. . 6. (f ) Econometrics with Applications (extended double course. (c2) Econometric Applications in Marketing and Microeconomics (single course. (e) Econometrics with Applications (double course.1 and 6.6. we mention the following possible course structures for students without previous knowledge of econometrics. and Chapter 6. 7.1–7.2. 6. (d) Econometrics with Applications in Finance and Macroeconomics (double course. The book is also suitable for a second course.7. Sections 5.1 and 6. (f 2) Econometric Applications (extended or double course. 120 hours): Chapters 2–4. 240 hours): Chapters 2–4. Sections 5.7. after an undergraduate introductory course in econometrics. The book can then be used as a graduate text by skipping most of Chapters 2–4 and choosing one of the options (c)–(f ) above.4 and 5. 240 hours): Chapters 2–4 and Sections 5. 180–240 hours): parts of Chapters 3 and 4.7.


1 Review of Statistics

A first step in the econometric analysis of economic data is to get an idea of the general pattern of the data. Graphs and sample statistics such as mean, standard deviation, and correlation are helpful tools. In general, economic data are partly systematic and partly random. This motivates the use of random variables and distribution functions to describe the data. This chapter pays special attention to data obtained by random sampling, where the observations are mutually independent and come from an underlying population with fixed mean and standard deviation. The concepts and methods for this relatively simple situation form the building blocks for dealing with more complex models that are relevant in practice and that will be discussed in later chapters.

1.1: Student Learning As an example. A part of the corresponding data table is given in Exhibit 1. (We refer readers to Appendix B for further details on the data sets and corresponding notation of variables used in this book. we consider in this chapter a data set on student learning. 185–202).1 Data graphs E Data First used in Section 2. on a scale from 0 to 4). A histogram of a variable consists of a two-dimensional plot. Economic data sets may contain a large number of observations for many variables. and FEM (with value 1 for females and value 0 for males). Siegfried in their paper ‘Does More Calculus Improve Student Learning in Intermediate Micro. It is often useful to summarize the information in some way. on a scale from 0 to 10). FGPA (the overall grade point average at the end of the freshman year. so that the data set consists of 18.1. and J.) Graphs The data can be visualized by means of various possible graphs. In this chapter we restrict the attention to four variables — that is. SATM (the score on the SAT mathematics test divided by 100. J. These data were analysed by J.1 Descriptive statistics 1.879 numbers. ﬁnancial investors can analyse the patterns of many individual stocks traded on the stock exchange.12 1 Review of Statistics 1. SATV (the score on the SAT verbal test divided by 100. E XM101STU Example 1.and Macroeconomic Theory’ (Journal of Applied Econometrics. marketing departments get very detailed information on individual buyers from scanner data. In total there are thirty-one observed variables. 13/2 (1998). and national authorities have detailed data on import and export ﬂows for many kinds of goods.1. For instance. A. T. Finegan. On the horizontal axis. Butler. S.1. This data set contains information on 609 students of the Vanderbilt University in the USA. the . In this section we discuss some simple graphical methods and in the next section some summary statistics. on a scale from 0 to 10).

On the horizontal axis, the outcome range of the variable is divided into a number of intervals. In the case of intervals with equal width, the value on the vertical axis measures the number of observations of the variable that have an outcome in that particular interval. The sample cumulative distribution function (SCDF) is represented by a two-dimensional plot with the outcome range of the variable on the horizontal axis. For each value v in this range, the function value on the vertical axis is the fraction of the observations with an outcome smaller than or equal to v.

To investigate possible dependencies between two variables one can draw a scatter diagram. One variable is measured along the horizontal axis, the other along the vertical axis, and the plot consists of points representing the joint outcomes of the two variables that occur in the data set.

[Exhibit 1.1 Student Learning (Example 1.1): part of the data on 609 students on FGPA (grade point average at the end of the freshman year), SATM (scaled score on SAT mathematics test), SATV (scaled score on SAT verbal test), and FEM (1 for females, 0 for males).]

Example 1.2: Student Learning (continued)

Exhibit 1.2 shows histograms (a, c, e) and SCDFs (b, d, f) of the variables FGPA, SATM, and SATV, and scatter diagrams of FGPA against SATM (g), FGPA against SATV (h), and SATM against SATV (i). The scatter diagrams show much variation in the outcomes. In this example it is not so easy to determine from the diagrams whether the variables are related or not.

More than two variables

For three variables it is possible to plot a three-dimensional scatter cloud, but such graphs are often difficult to read. Instead three two-dimensional scatter diagrams can be used.

[Exhibit 1.2 Student Learning (Example 1.2): histograms and sample cumulative distribution functions of FGPA ((a)–(b)), SATM ((c)–(d)), and SATV ((e)–(f)), and scatter diagrams ((g)–(i)) of FGPA against SATM (g), of FGPA against SATV (h), and of SATM against SATV (i).]

The same idea applies for four or more variables. It should be realized, though, that histograms and scatter diagrams provide only partial information if there are more than one or two variables. In general, the variation in one variable may be partly caused by another variable, but the influence of these variables cannot be detected from the diagrams. The shape of these diagrams will partly be determined by the neglected variables. One of the main purposes of econometric modelling is to disentangle the mutual dependencies between a group of variables. In Example 1.3 we give an illustration of the possible effects of such a partial analysis.

Example 1.3: Student Learning (continued)

The histogram of FGPA shows a spread that is partly caused by differences in the learning abilities of the students, which of course cannot be detected from a histogram. As an example, Exhibit 1.3 shows histograms for two groups of students. The 609 students are ordered by their average SAT score, defined as SATA = 0.5(SATM + SATV). The first group consists of students with low or high SATA scores (rank numbers between 1 and 100 and between 510 and 609) and the second group of students with middle SATA scores (rank numbers between 205 and 405). If the students in the first group had differed less on their SATM and SATV scores, then they would possibly have had less different FGPA outcomes. As expected, the spread of the FGPA scores in the first group (see Exhibit 1.3(a)) is somewhat larger than that in the second group (see Exhibit 1.3(b)). The difference is small, however, and cannot easily be detected from Exhibit 1.3. In the next section we describe numerical measures for the spread of data that will simplify the comparison.

[Exhibit 1.3 Student Learning (Example 1.3): histograms for FGPA scores of the 100 students with lowest and the 100 students with highest average SATA scores (a) and for FGPA scores of the 201 students with middle average SATA scores (b).]

Exercises: E: 1.11c, 1.13a.

1.1.2 Sample statistics (First used in Section 2.1; uses Appendix A.1.)

Sample moments

For a single variable, the shape of the histogram is often summarized by measures of location and dispersion. Let the number of observations be denoted by n and let the observed data points be denoted by y_i with i = 1, ..., n. The sample mean is defined as the average of the observations over the sample — that is,

\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i.    (1.1)

The sample mean is also called the first sample moment. An alternative measure of location is the median. Let the observations be ordered so that y_i ≤ y_{i+1} for i = 1, ..., n − 1; then the median is equal to the middle observation y_{(n+1)/2} if n is odd and equal to (y_{n/2} + y_{n/2+1})/2 if n is even.

A measure of dispersion is the second sample moment, defined by

m_2 = \frac{1}{n} \sum_{i=1}^{n} (y_i - \bar{y})^2.    (1.2)

For reasons that will become clear later (see Example 1.9), in practice one often uses a slightly different measure of dispersion, defined by

s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (y_i - \bar{y})^2.    (1.3)

This is called the sample variance, and the sample standard deviation is equal to s (the square root of s²). The rth (centred) sample moment is defined by m_r = (1/n) Σ_{i=1}^{n} (y_i − ȳ)^r, and the standardized rth moment is defined by m_r/s^r. In particular, m_3/s³ is called the skewness and m_4/s⁴ the kurtosis. The skewness is zero if the observations are distributed symmetrically around the mean, negative if the left tail is longer than the right tail, and positive if the right tail is longer than the left tail. If the mean is larger (smaller) than the median, this is an indication of positive (negative) skewness. The kurtosis measures the relative amount of observations in the tails as compared to the amount of observations around the mean; it is larger for distributions with fatter tails.
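The definitions above translate directly into code. The following minimal sketch (not part of the original text; it assumes NumPy is available and uses artificial data in place of an actual sample) computes the sample mean, median, variance, skewness, and kurtosis as defined in (1.1)–(1.3):

```python
import numpy as np

def sample_statistics(y):
    """Sample moments as defined in (1.1)-(1.3): the sample mean, the sample
    variance with divisor n-1, and the standardized third and fourth moments
    (skewness and kurtosis)."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    ybar = y.mean()                           # (1.1) sample mean
    s2 = ((y - ybar) ** 2).sum() / (n - 1)    # (1.3) sample variance
    s = np.sqrt(s2)
    m3 = ((y - ybar) ** 3).mean()             # third centred sample moment
    m4 = ((y - ybar) ** 4).mean()             # fourth centred sample moment
    return {"mean": ybar, "median": np.median(y), "variance": s2,
            "skewness": m3 / s ** 3, "kurtosis": m4 / s ** 4}

rng = np.random.default_rng(1)
print(sample_statistics(rng.normal(size=609)))   # artificial data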

Example 1.4: Student Learning (continued)

Exhibit 1.4 shows the sample mean, median, standard deviation, skewness, and kurtosis of the data on FGPA (a), SATM (b), and SATV (c). The tails of FGPA and SATV are somewhat fatter on the right, and the mean exceeds the median. The tails of the SATM scores are somewhat fatter on the left, and the mean is smaller than the median. Both the mean and the median of the SATV scores are lower than those of the SATM scores. Of the three variables, FGPA has the smallest kurtosis, as it contains somewhat less observations in the tails as compared to SATM and SATV.

Further, returning to our discussion in Example 1.3 on two groups of students, we measure the spread of the FGPA scores in both groups by the sample standard deviation. The first group of students (with either low or high average SATA scores) has s = 0.485, whereas the second group of students (with middle average SATA scores) has s = 0.449. As expected, the standard deviation is larger for the first, more heterogeneous group of students, but the difference is small.

[Exhibit 1.4 Student Learning (Example 1.4): summary statistics (mean, median, maximum, minimum, standard deviation, skewness, and kurtosis) of FGPA (a), SATM (b), and SATV (c) of 609 students. For FGPA the mean is 2.793, the standard deviation 0.460, the skewness 0.168, and the kurtosis 2.511.]

Covariance and correlation

The dependence between two variables can be measured by their common variation.

Let the two variables be denoted by x and y, with observed outcome pairs (x_i, y_i) for i = 1, ..., n. Let x̄ be the sample mean of x and ȳ that of y, and let s_x be the standard deviation of x and s_y that of y. Then the sample covariance between x and y is defined by

s_{xy} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})    (1.4)

and the sample correlation coefficient by

r_{xy} = \frac{s_{xy}}{s_x s_y}.    (1.5)

When two variables are positively correlated, this means that, on average, relatively large observations on x correspond with relatively large observations on y and small observations on x with small observations on y. The correlation coefficient r_{xy} always lies between −1 and +1 and it does not depend on the units of measurement (see Exercise 1.1).

In the case of two or more variables, the first and second moments can be summarized in vectors and matrices (see Appendix A for an overview of results on matrices that are used in this book). When there are p variables, the corresponding sample means can be collected in a p × 1 vector, and, when s_{jk} denotes the sample covariance between the jth and kth variable, the p × p sample covariance matrix S is defined by

S = \begin{pmatrix} s_{11} & s_{12} & \cdots & s_{1p} \\ s_{21} & s_{22} & \cdots & s_{2p} \\ \vdots & \vdots & & \vdots \\ s_{p1} & s_{p2} & \cdots & s_{pp} \end{pmatrix}.

The diagonal elements are the sample variances of the variables. The covariances are scale dependent. The sample correlation coefficients are given by r_{jk} = s_{jk}/\sqrt{s_{jj} s_{kk}}, and the p × p correlation matrix is defined similarly to the covariance matrix by replacing the elements s_{jk} by r_{jk}. As r_{jj} = 1, this matrix contains unit elements on the diagonal. The correlations do not depend on the scale of measurement and are therefore easier to interpret.

Example 1.5: Student Learning (continued)

Exhibit 1.5 shows the sample covariance matrix (Panel 1) and the sample correlation matrix (Panel 2) for the four variables FGPA, SATM, SATV, and FEM. The scores on FGPA, SATM, and SATV are all positively correlated. As compared with males, females have on average somewhat better scores on FGPA and SATV and somewhat lower scores on SATM.

[Exhibit 1.5 Student Learning (Example 1.5): sample covariances (Panel 1) and sample correlations (Panel 2) of FGPA, SATM, SATV, and FEM for 609 students.]

Exercises: T: 1.1; E: 1.11a, 1.13b, d.
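As an illustration of how such matrices can be computed, the following hypothetical sketch builds the sample covariance and correlation matrices from a data matrix; artificial random numbers stand in for the 609 × 4 table of FGPA, SATM, SATV, and FEM:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(609, 4))   # artificial stand-in for the four data columns

n, p = X.shape
Xc = X - X.mean(axis=0)              # centre each column
S = Xc.T @ Xc / (n - 1)              # p x p sample covariance matrix, divisor n-1
d = np.sqrt(np.diag(S))              # sample standard deviations
R = S / np.outer(d, d)               # r_jk = s_jk / sqrt(s_jj * s_kk)

print(np.allclose(S, np.cov(X, rowvar=False)))        # check against NumPy
print(np.allclose(R, np.corrcoef(X, rowvar=False)))   # diagonal of R equals 1
```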

1.2 Random variables

1.2.1 Single random variables (First used in Section 2.1; uses Appendix A.1.)

Randomness

The observed outcomes of variables are often partly systematic and partly random. One of the causes of randomness is sampling. For instance, the data on student scores in Example 1.1 concern a group of 609 students. Other data would have been obtained if another group of students (at another university or in another year) had been observed.

Distributions

A variable y is called random if, prior to observation, its outcome cannot be predicted with certainty. The uncertainty about the outcome is described by a probability distribution. If the set of possible outcome values is discrete, say V = {v_1, v_2, ...}, then the distribution is given by the set of probabilities p_i = P[y = v_i]. These probabilities have the properties that p_i ≥ 0 and Σ p_i = 1. The corresponding cumulative distribution function (CDF) is given by F(v) = P[y ≤ v] = Σ_{i: v_i ≤ v} p_i, which is a nondecreasing function with lim_{v→−∞} F(v) = 0 and lim_{v→∞} F(v) = 1. If the set of possible outcomes is continuous, then the CDF is again defined by P[y ≤ v] and, if this function is differentiable, the derivative f(v) = dF(v)/dv is called the probability density function. It has the properties that f(v) ≥ 0 and ∫ f(v) dv = 1. Interval probabilities are obtained from

P[a < y \leq b] = F(b) - F(a) = \int_a^b f(v) \, dv.

The CDF of a random variable is also called the population CDF, as it represents the distribution of all the possible outcomes of the variable. For observed data y_1, ..., y_n, the sample cumulative distribution function (SCDF) of Section 1.1.1 is given by F_s(v) = (1/n)(number of y_i ≤ v).

Remarks on notation

Some remarks on notation are in order. In statistics one usually denotes random variables by capital letters (for instance, Y) and observed outcomes of random variables by lower-case letters (for instance, y).

For the random variable y we denoted the set of possible outcome values by V and the observed outcome by v. However, in econometrics it is usual to reserve capital letters for matrices only. To avoid confusion with the notation in later chapters, we use lower-case letters (like y) to denote random variables. Further, in a sample of n observed data, the observations are usually denoted by y_i with i = 1, ..., n. Prior to observation, the outcome of y_i can be seen as a random variable. After observation, the realized values could be denoted by, say, v(y_i), the outcome value of the random variable y_i, but for simplicity of notation we write y_i both for the random variables and for the observed outcomes. This notation is common in econometrics, so that the notation in econometrics differs from the usual one in statistics. We will make sure that it is always clear from the context what the notation y_i means — a random variable (prior to observation) or an observed outcome.

Mean

The distribution of a random variable can be summarized by measures of location and dispersion. If y has a discrete distribution, then the (population) mean is defined as a weighted average over the outcome set V with weights equal to the probabilities p_i of the different outcomes v_i — that is,

\mu = E[y] = \sum_i v_i p_i.    (1.6)

The operator E that determines the mean of a random variable is also called the expectation operator. When y has a continuous distribution with density function f, the mean is defined by

\mu = E[y] = \int v f(v) \, dv    (1.7)

(if an integral runs from −∞ to +∞, we delete this for simplicity of notation). Note that the sample mean is obtained when the SCDF is used.

Variance

The (population) variance is defined as the mean of (y − μ)². For a discrete distribution this gives

\sigma^2 = E[(y - \mu)^2] = \sum_i (v_i - \mu)^2 p_i    (1.8)

and for a continuous distribution

\sigma^2 = E[(y - \mu)^2] = \int (v - \mu)^2 f(v) \, dv.    (1.9)

The standard deviation σ is the square root of the variance σ². The mean is also called the (population) first moment, and the variance the second (centred) moment.

Higher moments

The rth centred moment is defined as the mean of (y − μ)^r — that is (in the case of a continuous distribution), m_r = E[(y − μ)^r] = ∫ (v − μ)^r f(v) dv. The standardized rth moment is given by m_r/σ^r. For r = 3 this gives the skewness and for r = 4 the kurtosis. The sample moments of Section 1.1.2 are obtained by replacing the CDF by the sample CDF. Although the sample moments always exist, this is not always the case for the population moments. If E[|y − μ|^c] < ∞, then all the moments m_r with r ≤ c exist. In particular, a random variable with a finite variance also has a finite mean.

Transformations of random variables

Now we consider the statistical properties of functions of random variables. If y is a random variable and g is a given function, then z = g(y) is also a random variable. If y has a discrete distribution with outcomes {v_1, v_2, ...}, then z also has a discrete distribution with outcomes {w_i = g(v_i), i = 1, 2, ...}. Suppose that g is invertible with inverse function y = h(z); then the probabilities are given by P[z = w_i] = P[y = h(w_i)] = p_i. When y has a continuous distribution with density function f_y and h is differentiable with derivative h', then z has density function

f_z(w) = f_y(h(w)) \, |h'(w)|    (1.10)

(see Exercise 1.3 for a special case). The mean of z is given by E[z] = E[g(y)] = Σ p_i g(v_i) in the discrete case and by E[g(y)] = ∫ f(v) g(v) dv in the continuous case. If g is linear, so that g(y) = ay + b for some constants a and b, then E[ay + b] = aE[y] + b, but if g is not linear then in general E[g(y)] ≠ g(E[y]).

Exercises: T: 1.3a.
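A small simulation can make (1.10) concrete. The sketch below (an illustration, not part of the original text) takes y ~ N(0, 1) and z = g(y) = e^y, so that h(w) = log(w) and |h'(w)| = 1/w, and compares the implied density φ(log w)/w with a histogram of simulated outcomes; it also shows that E[g(y)] ≠ g(E[y]) for this non-linear g:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
y = rng.normal(size=100_000)
z = np.exp(y)                    # z = g(y) = e^y, with inverse h(w) = log(w)

# (1.10): f_z(w) = f_y(h(w)) |h'(w)| = phi(log w) / w
hist, edges = np.histogram(z, bins=np.linspace(0.1, 5.0, 50), density=True)
mids = 0.5 * (edges[:-1] + edges[1:])
print(np.abs(hist - norm.pdf(np.log(mids)) / mids).max())   # close to 0

# g is not linear, so E[g(y)] differs from g(E[y]):
print(z.mean(), np.exp(y.mean()))   # about e^(1/2) = 1.65 versus about 1
```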

1.2.2 Joint random variables (First used in Section 3.1; uses Appendix A.2–A.4.)

Two random variables

When there are two or more variables of interest, one can consider their joint distribution. For instance, the data set on 609 student scores in Example 1.1 contains the outcomes of mathematics and verbal tests. The uncertainty about the pair of outcomes (x, y) on these two tests can be described by a joint probability distribution. If the sets of possible outcome values for x and y are both discrete, say V = {v_1, v_2, ...} and W = {w_1, w_2, ...}, then the joint distribution is given by the set of probabilities p_{ij} = P[x = v_i, y = w_j]. The corresponding cumulative distribution function (CDF) is given by F(v, w) = P[x ≤ v, y ≤ w] = Σ_{(i,j): v_i ≤ v, w_j ≤ w} p_{ij}. If the sets of possible outcomes are continuous, then the CDF is also defined as F(v, w) = P[x ≤ v, y ≤ w], and, if the second derivative of this function exists, the corresponding density function is defined by

f(v, w) = \frac{\partial^2 F(v, w)}{\partial v \, \partial w}.

The density function has the properties f(v, w) ≥ 0 and ∬ f(v, w) dv dw = 1, and every function with these two properties describes a joint probability distribution.

When the joint distribution of x and y is given, the individual distributions of x and y (also called the marginal distributions) can be derived. The CDF F_y of y is obtained from the CDF of (x, y) by F_y(w) = P[y ≤ w] = F(∞, w). For continuous distributions, the corresponding densities are related by f_y(w) = ∫ f(v, w) dv. The mean and variance of x and y can also be determined in this way — for instance, μ_y = ∫ f_y(w) w dw = ∬ f(v, w) w dv dw.

Covariance and correlation

The covariance between x and y is defined (for continuous distributions) by

cov(x, y) = E[(x - \mu_x)(y - \mu_y)] = \iint (v - \mu_x)(w - \mu_y) f(v, w) \, dv \, dw.

The correlation coefficient between x and y is defined by

\rho_{xy} = \frac{cov(x, y)}{\sigma_x \sigma_y}    (1.11)

where σ_x and σ_y are the standard deviations of x and y. The two random variables are called uncorrelated if ρ_{xy} = 0. This is equivalent to the condition that E[xy] = E[x]E[y].

Conditional distribution

The conditional distribution of y for a given value of x is defined as follows. When the distribution is discrete and the outcome x = v_i is given, with P[x = v_i] > 0, the conditional probabilities are given by

P[y = w_j \,|\, x = v_i] = \frac{P[x = v_i, y = w_j]}{P[x = v_i]} = \frac{p_{ij}}{\sum_j p_{ij}}.    (1.12)

This gives a new distribution for y, as the conditional probabilities sum up (over j) to unity. For continuous distributions, the conditional density f_{y|x=v} is defined as follows (for values of v for which f_x(v) > 0):

f_{y|x=v}(w) = \frac{f(v, w)}{f_x(v)} = \frac{f(v, w)}{\int f(v, w) \, dw}.    (1.13)

Conditional mean and variance

The conditional mean and variance of y for a given value x = v are the mean and variance with respect to the corresponding conditional distribution. For instance, for continuous distributions the conditional expectation is given by

E[y \,|\, x = v] = \int f_{y|x=v}(w) \, w \, dw = \frac{\int f(v, w) \, w \, dw}{\int f(v, w) \, dw}.    (1.14)

Note that the conditional expectation is a function of v, so that E[y|x] is a random variable (a function of the random variable x, which has density f_x(v)). The mean of this conditional expectation is (see Exercise 1.2)

E[E[y|x]] = \int E[y \,|\, x = v] f_x(v) \, dv = E[y].    (1.15)

In words, the conditional expectation E[y|x] has the same mean as the unconditional random variable y. The conditional variance var(y|x = v) is the variance of y with respect to the conditional distribution f_{y|x=v}. This variance depends on the value of v, and the mean of this variance satisfies (see Exercise 1.2)

E[var(y|x)] = \int f_x(v) \left( \int f_{y|x=v}(w) (w - E[y \,|\, x = v])^2 \, dw \right) dv \leq var(y).    (1.16)

So, on average, the conditional random variable y|x = v has a smaller variance than the unconditional random variable y. That is, knowledge of the outcome of the variable x helps to reduce the uncertainty about the outcome of y. This is an important motivation for econometric models with explanatory variables. In such models, the differences in the outcomes of the variable of interest (y) are explained in terms of underlying factors (x) that influence this variable. For instance, the variation in the FGPA scores of students can be related to differences in student abilities as measured by their SATM and SATV scores. Such econometric models with explanatory variables are further discussed in Chapters 2 and 3.

Independence

A special situation occurs when the conditional distribution is always equal to the marginal distribution. If this holds true, then x and y are called independent random variables — that is, the variable x does not contain information on the variable y. For discrete distributions this is the case if and only if P[y = w_j | x = v_i] = P[y = w_j] for all (v_i, w_j) — that is, P[x = v_i, y = w_j] = P[x = v_i] P[y = w_j] for all (v_i, w_j). For continuous distributions the condition is that f(v, w) = f_x(v) f_y(w) for all (v, w). So in this case the joint distribution is simply obtained by multiplying the marginal distributions with each other. It follows from (1.12) and (1.13) that for independent variables E[y|x = v] = E[y] is independent of the value v of x, so that the uncertainty of y is not diminished by conditioning on x. Further, for independent variables there holds var(y|x = v) = var(y) for all values x = v, and hence also E[var(y|x)] = var(y). Independent variables are always uncorrelated, but the reverse does not hold true (see Exercise 1.2).

More than two random variables

The definitions of joint, marginal, and conditional distributions are easily extended to the case of more than two random variables.

For instance, the joint density function of p continuous random variables y_1, ..., y_p is a function f(v_1, ..., v_p) that is non-negative everywhere and that integrates (over the p-dimensional space) to unity. The variables are independent if and only if the joint density f is equal to the product of the p individual marginal densities f_{y_i} of y_i — that is,

f(v_1, \cdots, v_p) = \prod_{i=1}^{p} f_{y_i}(v_i).

Means, variances, and covariances can be determined from the joint distribution. For instance, for continuous distributions the covariance between y_1 (with mean μ_1) and y_2 (with mean μ_2) is given by σ_{12} = cov(y_1, y_2) = E[(y_1 − μ_1)(y_2 − μ_2)] — that is, the p-dimensional integral

\sigma_{12} = \int \cdots \int (v_1 - \mu_1)(v_2 - \mu_2) f(v_1, \cdots, v_p) \, dv_1 \cdots dv_p.

These variances and covariances can be collected in the p × p symmetric covariance matrix

\Sigma = \begin{pmatrix} var(y_1) & cov(y_1, y_2) & \cdots & cov(y_1, y_p) \\ cov(y_2, y_1) & var(y_2) & \cdots & cov(y_2, y_p) \\ \vdots & \vdots & & \vdots \\ cov(y_p, y_1) & cov(y_p, y_2) & \cdots & var(y_p) \end{pmatrix} = \begin{pmatrix} \sigma_{11} & \sigma_{12} & \cdots & \sigma_{1p} \\ \sigma_{21} & \sigma_{22} & \cdots & \sigma_{2p} \\ \vdots & \vdots & & \vdots \\ \sigma_{p1} & \sigma_{p2} & \cdots & \sigma_{pp} \end{pmatrix}.

The correlation matrix is defined in an analogous way, replacing the elements σ_{ij} in Σ by the correlations ρ_{ij} = σ_{ij}/\sqrt{\sigma_{ii}\sigma_{jj}}. Independent variables are uncorrelated, so that in this case σ_{ij} = 0 for all i ≠ j. If in addition all the variables have equal variance σ_{ii} = σ², then the covariance matrix is of the form Σ = σ²I, where I is the p × p identity matrix.

Linear transformations of random variables

For our statistical analysis in later chapters we now consider the distribution of functions of random variables. For linear transformations the first and second moments of the transformed variables can be determined in a simple way. Let y_1, ..., y_p be given random variables and let z = b + Σ_{j=1}^{p} a_j y_j be a linear function of these random variables, for given (non-random) constants b and a_j. Then the mean and variance of z are given by

E[z] = b + \sum_{j=1}^{p} a_j E[y_j], \qquad var(z) = \sum_{j=1}^{p} \sum_{k=1}^{p} a_j a_k \, cov(y_j, y_k)

where cov(y_j, y_j) = var(y_j). When the random variables y_j are uncorrelated and have identical mean μ and variance σ², it follows that E[z] = b + μ Σ a_j and var(z) = σ² Σ a_j². For instance, when ȳ = (1/n) Σ_{i=1}^{n} y_i is the mean of n uncorrelated random variables, then it follows that

E[\bar{y}] = \mu, \qquad var(\bar{y}) = \frac{\sigma^2}{n}.    (1.17)

Now let z = Ay + b be a vector of random variables, where A and b are a given (non-random) m × p matrix and m × 1 vector respectively and where y is a p × 1 vector of random variables with vector of means μ and covariance matrix Σ. Then the vector of means of z and its covariance matrix Σ_z are given by

E[z] = A\mu + b, \qquad \Sigma_z = A \Sigma A'    (1.18)

where A' denotes the transpose of the matrix A (see Exercise 1.3).

Arbitrary transformations of random variables

The distribution of non-linear functions of random variables can be derived from the joint distribution of these variables. For instance, let z_1 = g_1(y_1, y_2) and z_2 = g_2(y_1, y_2) be two functions of given random variables y_1 and y_2, and suppose that the mapping g = (g_1, g_2) from (y_1, y_2) to (z_1, z_2) is invertible with inverse h = (h_1, h_2). For discrete random variables, the distribution of (z_1, z_2) is given by P[z_1 = w_1, z_2 = w_2] = P[y_1 = v_1, y_2 = v_2], where (v_1, v_2) = h(w_1, w_2). For continuous random variables, the joint density function of (z_1, z_2) is given by

f_{z_1, z_2}(w_1, w_2) = f_{y_1, y_2}(h(w_1, w_2)) \, |J(w_1, w_2)|.    (1.19)

That is, the density of (y_1, y_2) should be evaluated at the point h(w_1, w_2), and the result should be multiplied by the absolute value of the Jacobian J in (w_1, w_2). The Jacobian J is defined as the determinant of the 2 × 2 matrix with elements ∂h_i(z)/∂z_j for i, j = 1, 2.

When z_1 = g_1(y_1) and z_2 = g_2(y_2) and y_1 and y_2 are independent, then it follows from (1.19) that z_1 and z_2 are also independent (see Exercise 1.3). So in this case z_1 and z_2 are uncorrelated, and E[g_1(y_1)g_2(y_2)] = E[g_1(y_1)]E[g_2(y_2)] when y_1 and y_2 are independent. This result generalizes to the case of more than two functions. If y_1 and y_2 are uncorrelated but not independent, then z_1 and z_2 are in general not uncorrelated, unless g_1 and g_2 are linear functions (see Exercise 1.3).
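The moment rules (1.17)–(1.18) are easy to verify by simulation. In the following hypothetical sketch, the vectors μ and b and the matrices Σ and A are chosen arbitrarily for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
mu = np.array([1.0, -2.0, 0.5])
C = np.array([[1.0, 0.0, 0.0], [0.5, 1.0, 0.0], [-0.3, 0.2, 1.0]])
Sigma = C @ C.T                                   # a positive definite covariance matrix
A = np.array([[2.0, 0.0, 1.0], [0.0, -1.0, 3.0]])
b = np.array([0.5, -1.0])

y = rng.multivariate_normal(mu, Sigma, size=200_000)   # draws of the 3x1 vector y
z = y @ A.T + b                                        # z = Ay + b, draw by draw

print(np.round(z.mean(axis=0), 2), np.round(A @ mu + b, 2))                # E[z] = A mu + b
print(np.round(np.cov(z, rowvar=False), 1), np.round(A @ Sigma @ A.T, 1))  # (1.18)
```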

Example 1.6: Student Learning (continued)

As an illustration we consider again the data on student learning of the 609 students. In this example we will consider these 609 students as the population of interest and we will analyse the effect of the gender of the student by conditioning with respect to this variable. Of the 609 students in the population, 373 are male and 236 are female. Exhibit 1.6 shows histograms of the variable FGPA for male students (a) and female students (b) separately. The mean and standard deviation of the unconditional (full) population are in Exhibit 1.4(a). The two means and standard deviations in Exhibit 1.6 are conditional on the gender of the student, and they differ in the two groups.

The relations (1.15) and (1.16) (more precisely, their analogue for the current discrete distributions) are easily verified, using the fact that the conditioning variable x in this case is a discrete random variable with probabilities 373/609 for a male and 236/609 for a female student. Indeed, denoting males by M and females by F, we can verify the result (1.15) for the mean because

E[E[y|x]] = \frac{373}{609} E[y|M] + \frac{236}{609} E[y|F] = \frac{373}{609}(2.728) + \frac{236}{609}(2.895) = 2.793 = E[y],

and we can verify the result (1.16) for the variance because

E[var(y|x)] = \frac{373}{609} var(y|M) + \frac{236}{609} var(y|F) = \frac{373}{609}(0.441)^2 + \frac{236}{609}(0.472)^2 = 0.206 < 0.212 = (0.460)^2 = var(y).

[Exhibit 1.6 Student Learning (Example 1.6): histograms for FGPA scores of the 373 males (a), with mean 2.728 and standard deviation 0.441, and of the 236 females (b), with mean 2.895 and standard deviation 0.472.]

Exercises: T: 1.2, 1.3b–d; E: 1.11e.
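The two verifications in this example amount to weighted averages, as in the following sketch (the weights and conditional moments are those reported in Exhibit 1.6):

```python
# Weights and conditional moments from Exhibit 1.6:
p_m, p_f = 373 / 609, 236 / 609          # P[male], P[female]
mean_m, mean_f = 2.728239, 2.894831      # E[y|M], E[y|F]
sd_m, sd_f = 0.441261, 0.471943          # conditional standard deviations

print(round(p_m * mean_m + p_f * mean_f, 3))         # (1.15): 2.793 = E[y]
print(round(p_m * sd_m**2 + p_f * sd_f**2, 3))       # (1.16): 0.206 < 0.212 = var(y)
```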

1.2.3 Probability distributions (First used in Section 2.2; uses Appendix A.2–A.5.)

Bernoulli distribution and binomial distribution

In this section we consider some probability distributions that are often used in econometrics. The simplest case of a random variable is a discrete variable y with only two possible outcomes, denoted by 0 (failure) and 1 (success). This is called the Bernoulli distribution. The probability distribution is completely described by the probability p = P[y = 1] of success, as P[y = 0] = 1 − P[y = 1] = 1 − p. It has mean p and variance p(1 − p) (see Exercise 1.4).

Suppose that the n random variables y_i, i = 1, ..., n, are independent and identically distributed with the Bernoulli distribution with probability p of success, and let y = Σ_{i=1}^{n} y_i be the total number of successes. The set of possible outcome values of y is V = {0, ..., n}, and

P[y = v] = \binom{n}{v} p^v (1-p)^{n-v},

where the first term, 'n over v', is the number of possibilities to locate v successes over n positions. This is called the binomial distribution. It has mean np and variance np(1 − p) (see Exercise 1.4).

Normal distribution

The normal distribution is the most widely used distribution in econometrics. One of the reasons is the central limit theorem (to be discussed later), which says that many distributions can be approximated by normal distributions if the sample size is large enough. Another reason is that the normal distribution has a number of attractive properties. A normal random variable is a continuous random variable that can take on any value. Its density function is given by

f(v) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2\sigma^2}(v-\mu)^2}, \qquad -\infty < v < \infty.    (1.20)

This function is symmetric around μ and it is shaped like a bell (see Exhibit 1.7). The distribution contains two parameters, μ and σ², and the distribution is denoted by N(μ, σ²). This notation is motivated by the fact that μ is the mean and σ² the variance of this distribution. The third and fourth moments of this distribution are 0 and 3σ⁴ respectively (see Exercise 1.4), so that the skewness is zero and the kurtosis is equal to 3.

As the normal distribution is often taken as a benchmark, distributions with kurtosis larger than three are called fat-tailed.

[Exhibit 1.7 Normal distribution: density functions of two normal distributions, one with mean 0 and variance 1 (a) and another one with mean 3 and variance 2 (b); (c) shows the two densities in one diagram for comparison.]

When y follows the N(μ, σ²) distribution, this is written as y ~ N(μ, σ²). The result in (1.10) implies (see Exercise 1.4) that the linear function ay + b (with a and b fixed numbers) is also normally distributed, with

ay + b \sim N(a\mu + b, \; a^2\sigma^2).

In particular, when y is standardized by subtracting its mean and dividing by its standard deviation, it follows that

\frac{y - \mu}{\sigma} \sim N(0, 1).

This is called the standard normal distribution. Its density function is denoted by φ, so that

\phi(v) = \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}v^2},

and the cumulative distribution function is denoted by \Phi(v) = \int_{-\infty}^{v} \phi(u) \, du.

Multivariate normal distribution

In later chapters we will often consider jointly normally distributed random variables. It is very convenient to use matrix notation to describe the multivariate normal distribution.

The multivariate normal distribution of n random variables has density function

f(v) = \frac{1}{(2\pi)^{n/2} (\det(\Sigma))^{1/2}} e^{-\frac{1}{2}(v-\mu)' \Sigma^{-1} (v-\mu)},    (1.21)

where v denotes the n variables, μ is an n × 1 vector, and Σ an n × n positive definite matrix (det(Σ) denotes the determinant of this matrix). The distribution is written as N(μ, Σ). This notation is motivated by the fact that this distribution has mean μ and covariance matrix Σ.

Properties of the multivariate normal distribution

Marginal and conditional distributions of normal distributions remain normal. If the n × 1 vector y is normally distributed, then the ith component y_i is also normally distributed and y_i ~ N(μ_i, σ_{ii}), where μ_i is the ith component of μ and σ_{ii} the ith diagonal element of Σ. For the conditional distribution, let the vector y be split in two parts (with sub-vectors y_1 and y_2) and let the mean vector and covariance matrix be split accordingly. Then the conditional distribution of y_1, given that y_2 = v_2, is given by

y_1 \,|\, y_2 = v_2 \sim N\left(\mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(v_2 - \mu_2), \; \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\right),    (1.22)

where Σ_{11} is the covariance matrix of y_1, Σ_{22} is the covariance matrix of y_2, Σ_{12} is the covariance matrix between y_1 and y_2, and Σ_{21} is the transpose of Σ_{12} (see Exercise 1.4). Note that the conditional variance does not depend on the value of y_2 in this case. That is, knowledge of the value of y_2 always leads to the same reduction in the uncertainty of y_1 if the variables are normally distributed.

For arbitrary random variables, independence implies being uncorrelated but not the other way round. However, when jointly normally distributed variables are uncorrelated, they are also independent. If the variables are uncorrelated, so that Σ is a diagonal matrix, then the joint density (1.21) reduces to the product of the individual densities. This also follows from (1.22), as Σ_{12} = 0 if y_1 and y_2 are uncorrelated, so that the conditional distribution of y_1 becomes independent of y_2.

If y ~ N(μ, Σ), then the linear function Ay + b (with A a given m × n matrix and b a given m × 1 vector) is also normally distributed and (see Exercise 1.4)

Ay + b \sim N(A\mu + b, \; A\Sigma A').    (1.23)

Chi-square (χ²) distribution

In the rest of this section we consider the distribution of some other functions of normally distributed random variables that will be used later on.
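Formula (1.22) is easily coded. The following sketch (hypothetical; the function name and the example numbers are ours) computes the conditional mean and covariance from a partitioned mean vector and covariance matrix:

```python
import numpy as np

def conditional_normal(mu, Sigma, idx1, idx2, v2):
    """Mean and covariance of y1 given y2 = v2 when y ~ N(mu, Sigma), as in (1.22)."""
    S11 = Sigma[np.ix_(idx1, idx1)]
    S12 = Sigma[np.ix_(idx1, idx2)]
    S22 = Sigma[np.ix_(idx2, idx2)]
    S22_inv = np.linalg.inv(S22)
    cond_mean = mu[idx1] + S12 @ S22_inv @ (v2 - mu[idx2])
    cond_cov = S11 - S12 @ S22_inv @ S12.T     # does not depend on v2
    return cond_mean, cond_cov

mu = np.array([0.0, 1.0])
Sigma = np.array([[1.0, 0.6], [0.6, 2.0]])
print(conditional_normal(mu, Sigma, [0], [1], np.array([2.0])))
# conditional mean 0.3 = 0 + (0.6/2)(2 - 1); conditional variance 0.82 = 1 - 0.6^2/2
```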

Suppose that y_1, ..., y_n are independent and all follow the standard normal distribution. Then the distribution of the sum of squares Σ_{i=1}^{n} y_i² is called the chi-square distribution with n degrees of freedom, denoted by χ²(n). This can be generalized to other quadratic forms in the vector of random variables y = (y_1, ..., y_n)'. Let A be an n × n matrix that is symmetric (that is, A' = A), idempotent (that is, A² = A), and that has rank r (which in this case is equal to the trace of A — that is, the sum of the n diagonal elements of this matrix). Then

y'Ay \sim \chi^2(r)    (1.24)

(see Exercise 1.5). For a symmetric idempotent matrix A there always holds that y'Ay ≥ 0. The density of the χ²(r) distribution is given by

f(v) \propto v^{\frac{r}{2}-1} e^{-\frac{v}{2}} \;\; (v \geq 0), \qquad f(v) = 0 \;\; (v < 0),    (1.25)

where ∝ means 'proportional to' — that is, f(v) is equal to the given expression up to a scaling constant that does not depend on v. This scaling constant is defined by the condition that ∫ f(v) dv = 1. The χ²(r) distribution has mean r and variance 2r (see Exercise 1.5). Exhibit 1.8 shows chi-square densities for varying degrees of freedom. The distributions have a positive skewness.

[Exhibit 1.8 χ²-distribution: density functions of two chi-squared distributions, one with 4 degrees of freedom (a) and another one with 8 degrees of freedom (b); (c) shows the two densities in one diagram for comparison.]

Student t-distribution

If y_1 ~ N(0, 1) and y_2 ~ χ²(r) and y_1 and y_2 are independently distributed, then the distribution of y_1/\sqrt{y_2/r} is called the Student t-distribution with r degrees of freedom.
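Result (1.24) can be checked by simulation. The sketch below (illustrative, not from the original text) uses the symmetric idempotent matrix M = I − ii'/n of rank n − 1, which will reappear in Section 1.2.4, and compares the simulated quantiles of y'My with those of the χ²(n − 1) distribution:

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(4)
n = 5
M = np.eye(n) - np.ones((n, n)) / n      # symmetric and idempotent, rank n - 1
assert np.allclose(M @ M, M)
r = round(np.trace(M))                   # rank = trace = 4

y = rng.normal(size=(100_000, n))
q = np.einsum("ij,jk,ik->i", y, M, y)    # quadratic form y'My for each draw

print(np.quantile(q, [0.5, 0.9]).round(2))   # simulated quantiles
print(chi2.ppf([0.5, 0.9], df=r).round(2))   # chi2(4) quantiles: about 3.36 and 7.78
```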

This is written as

\frac{y_1}{\sqrt{y_2/r}} \sim t(r).    (1.26)

Up to a scaling constant, the density of the t(r)-distribution is given by

f(v) \propto \left(1 + \frac{v^2}{r}\right)^{-\frac{r+1}{2}}, \qquad -\infty < v < \infty.    (1.27)

For r > 1 the mean is equal to 0, and for r > 2 the variance is equal to r/(r − 2). These distributions are symmetric (the skewness is zero) and have fat tails (the kurtosis is larger than three). Exhibit 1.9 shows t-distributions for varying degrees of freedom. For more degrees of freedom the density is more concentrated around zero and has less fat tails. For r = 1, the t(1)-distribution (also called the Cauchy distribution) has density

f(v) = \frac{1}{\pi(1 + v^2)}.

This distribution is so much dispersed that it does not have finite moments — in particular, the mean and the variance do not exist. On the other hand, if r → ∞ then the t(r) density converges to the standard normal density (see Exercise 1.5).

[Exhibit 1.9 t-distribution: density functions of three t-distributions, with number of degrees of freedom equal to 1 (a), 4 (b), and 100 (c); (d) shows the three densities in one diagram for comparison.]

F-distribution

If y_1 ~ χ²(r_1) and y_2 ~ χ²(r_2) and y_1 and y_2 are independently distributed, then the distribution of (y_1/r_1)/(y_2/r_2) is called the F-distribution with r_1 and r_2 degrees of freedom. This is written as

\frac{y_1/r_1}{y_2/r_2} \sim F(r_1, r_2).    (1.28)

Exhibit 1.10 shows F-distributions for varying degrees of freedom. For more degrees of freedom in the numerator the density shifts more to the right, and for more degrees of freedom in the denominator it gets less fat tails. If r_2 → ∞, then r_1 · F(r_1, r_2) converges to the χ²(r_1)-distribution (see Exercise 1.5).

[Exhibit 1.10 F-distribution: density functions of three F-distributions, with numbers of degrees of freedom in numerator and denominator respectively (4, 4) (a), (4, 100) (b), and (100, 4) (c); (d) shows the three densities in one diagram for comparison.]

Conditions for independence

In connection with the t- and F-distributions, it is for later purposes helpful to use simple checks for the independence between linear and quadratic forms of normally distributed random variables.

s2 I) (1:32) .1.4 Normal random samples E First used in Section 2. 1. 1. (1:29) and the random variables z1 and z2 (both with w2 -distribution) are independently distributed if Q1 Q2 ¼ 0: (1:30) E Exercises: T: 1. yn .2 Random variables 35 of independent standard normal random variables.4.3). The random variables z0 (with normal distribution) and z1 (with w2 -distribution) are independently distributed if AQ1 ¼ 0.5a–e. To illustrate some of the foregoing results. and let z0 ¼ Ay.5). 1. s2 ). yn are normally and independently distributed random variables with the same mean m and variance s2 . 1.2. n.2.13f. z1 ¼ y0 Q1 y and z2 ¼ y0 Q2 y be respectively a linear form (with A an m Â n matrix) and two quadratic forms (with Q1 and Q2 symmetric and idempotent n Â n matrices). Á Á Á . i ¼ 1. (1:31) where NID stands for normally and independently distributed.15b.3.1) and of the sample variance s2 in (1. uses Appendix A. Á Á Á . we consider the situation where y1 . so that y $ N(mi. The two following results are left as an exercise (see Exercise 1. yn is a random sample (that is. s2 ).2–A. Á Á Á . Sample mean Let y be the n Â 1 vector with elements y1 . We are interested in the distributions of the sample mean y in (1. with independent observations) from N(m. One also says that y1 . This is written as yi $ NID(m. Á Á Á .5.

where i is the n × 1 vector with all its elements equal to 1 and I is the n × n identity matrix. The sample mean is given by ȳ = (1/n) Σ_{i=1}^{n} y_i = (1/n) i'y. As i'i = n and i'Ii = n, it follows from (1.23) that

\bar{y} = \frac{1}{n} i'y \sim N\left(\mu, \frac{\sigma^2}{n}\right).    (1.33)

Sample variance

To derive the distribution of the sample variance s², let z_i = (y_i − μ)/σ, so that z_i ~ NID(0, 1). Then s² can be written as s² = (1/(n−1)) Σ_{i=1}^{n} (y_i − ȳ)² = (σ²/(n−1)) Σ_{i=1}^{n} (z_i − z̄)². Now Σ_{i=1}^{n} (z_i − z̄)² = z'Mz, where the matrix M is defined by

M = I - \frac{1}{n} ii'.    (1.34)

The matrix M is symmetric and idempotent and has rank n − 1 (see Exercise 1.5). Then the result (1.24) shows that

\frac{(n-1)s^2}{\sigma^2} = z'Mz \sim \chi^2(n-1).    (1.35)

The t-value of the sample mean

Using the notation introduced above, the result (1.33) implies that (1/\sqrt{n}) i'z = \sqrt{n}\bar{z} = \sqrt{n}(\bar{y} - \mu)/\sigma \sim N(0, 1). As i'M = 0, it follows from (1.29) that this standard normal random variable is independent from the χ²(n − 1) random variable in (1.35). By definition,

\frac{\sqrt{n}(\bar{y}-\mu)/\sigma}{\sqrt{\frac{(n-1)s^2/\sigma^2}{n-1}}} = \frac{\bar{y}-\mu}{s/\sqrt{n}} \sim t(n-1).    (1.36)

Note that the random variable in (1.36) has a distribution that does not depend on σ². Such a random variable (in this case, a function of the data and of the parameter μ that does not depend on σ²) is called pivotal for the parameter μ. The result in (1.35) shows that (n − 1)s²/σ² is pivotal for σ². Such pivotal random variables are helpful in statistical hypothesis testing, as will become clear in Section 1.4.

If it is assumed that the population mean is zero — that is, μ = 0 — it follows that

\frac{\bar{y}}{s/\sqrt{n}} \sim t(n-1).    (1.37)

This is called the t-value of the sample mean.

Exercises: T: 1.5f; E: 1.15a.
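The distribution theory in (1.33)–(1.37) can be illustrated by simulation, as in the following hypothetical sketch, which draws many samples of size n = 10 with μ = 0 and compares the simulated quantiles of the t-value with those of the t(n − 1) distribution:

```python
import numpy as np
from scipy.stats import t

rng = np.random.default_rng(5)
n, runs = 10, 50_000
y = rng.normal(0.0, 1.0, size=(runs, n))    # random samples with mu = 0
ybar = y.mean(axis=1)
s = y.std(axis=1, ddof=1)                   # sample standard deviation, divisor n-1
tval = ybar / (s / np.sqrt(n))              # t-value (1.37)

print(np.quantile(tval, [0.05, 0.95]).round(2))   # simulated quantiles
print(t.ppf([0.05, 0.95], df=n - 1).round(2))     # t(9) quantiles: about -1.83 and 1.83
```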

1.3 Parameter estimation

1.3.1 Estimation methods (First used in Section 4.3; uses Appendix A.7.)

In this section we consider a general framework for estimation, with corresponding concepts and terminology that are used throughout this book.

Concepts: model, parameters, estimator, estimate

Suppose that the n available observations y_i, i = 1, ..., n, are considered as the outcomes of random variables with a joint probability distribution f_θ(y_1, ..., y_n). Here it is assumed that the general shape of the distribution is known up to one or more unknown parameters, denoted by θ. A set of distributions {f_θ; θ ∈ Θ} is called a model for the observations — that is, it specifies the general shape of the distribution together with a set Θ of possible values for the unknown parameters. For instance, if it is supposed that y_i ~ NID(μ, σ²) with unknown mean μ and variance σ², then the joint distribution is given by (1.32) with parameter set Θ = {(μ, σ²); σ² > 0}. The numerical values of θ are unknown, but they can be estimated from the observed data — for instance, by the sample mean and sample variance discussed in Section 1.2.

A statistic is any given function g(y_1, ..., y_n) — that is, any numerical expression that can be evaluated from the observed data alone. An estimator is a statistic that is used to make a guess about an unknown parameter. For instance, the sample mean (1.1) is a statistic that provides an intuitively appealing guess for the population mean μ. An estimator is a random variable, as it depends on the random variables y_i; for given observed outcomes, the resulting numerical value of the estimator is called the estimate of the parameter. So an estimator is a numerical expression in terms of random variables, and an estimate is a number. Estimated parameters are denoted by θ̂. In all that follows, we use the notation y_i both for the random variable and for the observed outcome of this variable.

Several methods have been developed for the construction of estimators. We discuss three of them — the method of moments, least squares, and maximum likelihood.

The method of moments

In the method of moments the parameters are estimated as follows. Suppose that θ contains k unknown parameters. The specified model (that is, the general shape of the distribution) implies expressions for the population moments in terms of θ. If k such moments are selected, the parameters θ can in general be solved from these k expressions. Now θ is estimated by replacing the unknown population moments by the corresponding sample moments. An advantage of this method is that it is based on moments that are often easy to compute. However, it should be noted that the obtained estimates depend on the chosen moments.

Example 1.7: Student Learning (continued)

To illustrate the method of moments, we consider the FGPA scores of the 609 students in Example 1.1. Summary statistics of this sample are in Exhibit 1.4(a), with mean ȳ = 2.793, standard deviation s = 0.460, skewness 0.168, and kurtosis 2.511. If these scores are assumed to be normally and independently distributed with mean μ and variance σ², then the first and second moments of this distribution are equal to μ and σ² respectively. The first sample moment is 2.793 and the second sample moment (1.2) is equal to m_2 = (n − 1)s²/n = 0.211, so the moment estimates become μ̂ = 2.793 and σ̂² = 0.211.

Instead of using the second moment, one could also use the fourth moment to estimate σ². The fourth (population) moment of the normal distribution is equal to 3σ⁴. To obtain the fourth sample moment from the summary statistics presented in Exhibit 1.4(a), note that the sample kurtosis (K) is equal to the sample fourth moment (m_4) divided by s⁴, so that m_4 = Ks⁴ = 2.511(0.460)⁴ = 0.112. The estimate of σ² based on the fourth moment is then obtained by solving 3σ̂⁴ = m_4, so that σ̂² = \sqrt{m_4/3} = 0.194. These results show that the parameter estimates may be different for different choices of the fitted moments. In our example the differences are not so large.
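The computations in this example amount to a few lines of code, as in the following sketch (the summary statistics are those of Exhibit 1.4(a)):

```python
import numpy as np

n, ybar, s, kurt = 609, 2.793, 0.460, 2.511   # summary statistics of FGPA

m2 = (n - 1) * s ** 2 / n            # second sample moment (1.2)
m4 = kurt * s ** 4                   # kurtosis K = m4 / s^4, so m4 = K s^4
sigma2_hat_m2 = m2                   # estimate of sigma^2 from the second moment
sigma2_hat_m4 = np.sqrt(m4 / 3)      # solve 3 sigma^4 = m4 (normal fourth moment)

print(round(sigma2_hat_m2, 3), round(sigma2_hat_m4, 3))   # 0.211 and 0.194
```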

Least squares

Another method for parameter estimation is least squares. We illustrate this method for the estimation of the population mean from a random sample y_1, ..., y_n of a distribution with unknown mean μ and unknown variance σ². Let e_i = y_i − μ; then it follows that e_1, ..., e_n are identically and independently distributed with mean zero and variance σ². This is written as e_i ~ IID(0, σ²). The model can now be written as

y_i = \mu + e_i, \qquad e_i \sim IID(0, \sigma^2).    (1.38)

The least squares estimate is that value of μ that minimizes the sum of squared errors

S(\mu) = \sum_{i=1}^{n} (y_i - \mu)^2.

Taking the first derivative of this expression with respect to μ gives the first order condition Σ_{i=1}^{n} (y_i − μ) = 0. Solving this for μ gives the least squares estimate μ̂ = ȳ, the sample mean. Instead of least squares one could also use other estimation criteria — for instance, the sum of absolute errors

\sum_{i=1}^{n} |y_i - \mu|.

As will be seen in Chapter 5 (see Exercise 5.14), the resulting estimate is then given by the median of the sample.

Maximum likelihood

A third method is that of maximum likelihood. Recall that a model consists of a set {f_θ; θ ∈ Θ} of joint probability distributions for y_1, ..., y_n. For every value of θ, the distribution gives a certain value f_θ(y_1, ..., y_n) for the given observations. When seen as a function of θ, this is called the likelihood function, denoted by L(θ), so that

L(\theta) = f_\theta(y_1, \cdots, y_n), \qquad \theta \in \Theta.    (1.39)

For discrete distributions, the likelihood L(θ) is equal to the probability (with respect to the distribution f_θ) of the actually observed outcome. The maximum likelihood estimate is the value of θ for which this probability is maximal (over the set of all possible values θ ∈ Θ). Similarly, for a continuous distribution the maximum likelihood estimate is obtained by maximizing L(θ) over Θ.

An attractive property of this method is that the estimates are invariant with respect to changes in the definition of the parameters. Suppose that, instead of using the parameters θ, one describes the model in terms of another set of parameters ψ, and that the relation between ψ and θ is given by ψ = h(θ), where h is an invertible transformation. The model is then expressed as the set of distributions {f̃_ψ; ψ ∈ Ψ}, where f̃_ψ = f_{h⁻¹(ψ)} and Ψ = h(Θ). Let θ̂ and ψ̂ be the maximum likelihood estimates of θ and ψ respectively. Then θ̂ = h⁻¹(ψ̂), so that both models lead to the same estimated probability distribution (see Exercise 1.6 for an example).

This method can be applied only if the joint probability distribution of the observations is completely specified, so that the likelihood function (1.39) is a known function of θ.

Comparison of methods

In later chapters we will encounter each of the above three estimation methods. It depends on the application which method is the most attractive one. Least squares and the method of moments are both based on the idea of minimizing a distance function. For least squares the distance is measured directly in terms of the observed data, whereas for the method of moments the distance is measured in terms of the sample and population moments. If the model is expressed in terms of an equation, as in (1.38), then least squares is intuitively appealing, as it optimizes the fit of the model with respect to the observations. Sometimes the model is expressed in terms of moment conditions, so that the method of moments is a natural way of estimation. The maximum likelihood method, on the other hand, is based not on a distance function, but on the likelihood function, which expresses the likelihood or 'credibility' of parameter values with respect to the observed data. In this case maximum likelihood estimators have optimal properties in large samples, as will be discussed in Section 1.3.3.

Example 1.8: Normal Random Sample

We will illustrate the method of maximum likelihood by considering data generated by a random sample from a normal distribution. Suppose that y_i ~ NID(μ, σ²), i = 1, ..., n, with unknown parameters θ = (μ, σ²). Then the likelihood function is given by

L(\mu, \sigma^2) = \prod_{i=1}^{n} \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2\sigma^2}(y_i - \mu)^2}.

As the logarithm is a monotonically increasing function, the likelihood function and its logarithm log(L(μ, σ²)) obtain their maximum for the same values of μ and σ². As log(L(μ, σ²)) is easier to work with, we maximize

\log(L(\mu, \sigma^2)) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \mu)^2.    (1.40)

The first order conditions (with respect to μ and σ²) for a maximum are given by

\frac{\partial \log(L)}{\partial \mu} = \frac{1}{\sigma^2} \sum_{i=1}^{n} (y_i - \mu) = 0,    (1.41)

\frac{\partial \log(L)}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4} \sum_{i=1}^{n} (y_i - \mu)^2 = 0.    (1.42)

The solutions of these two equations are given by

\hat{\mu}_{ML} = \frac{1}{n} \sum_{i=1}^{n} y_i = \bar{y}, \qquad \hat{\sigma}^2_{ML} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \bar{y})^2 = \frac{n-1}{n} s^2.

So μ is estimated by the sample mean and σ² by (1 − 1/n) times the sample variance. For large sample sizes the difference with the sample variance s² becomes negligible.

To check whether the estimated values indeed correspond to a maximum of the likelihood function, we compute the matrix of second order derivatives (the Hessian matrix) and check whether this matrix (evaluated at μ̂_ML and σ̂²_ML) is negative definite. By differentiating the above two first order conditions, it follows that the Hessian matrix is equal to

H(\theta) = \begin{pmatrix} \frac{\partial^2 \log(L)}{\partial \mu^2} & \frac{\partial^2 \log(L)}{\partial \mu \, \partial \sigma^2} \\ \frac{\partial^2 \log(L)}{\partial \sigma^2 \, \partial \mu} & \frac{\partial^2 \log(L)}{\partial (\sigma^2)^2} \end{pmatrix} = \begin{pmatrix} -\frac{n}{\sigma^2} & -\frac{1}{\sigma^4} \sum (y_i - \mu) \\ -\frac{1}{\sigma^4} \sum (y_i - \mu) & \frac{n}{2\sigma^4} - \frac{1}{\sigma^6} \sum (y_i - \mu)^2 \end{pmatrix}.    (1.43)

Evaluating this at the values μ̂_ML and σ̂²_ML shows that H(μ̂_ML, σ̂²_ML) is a diagonal matrix with elements −n/σ̂²_ML and −n/(2σ̂⁴_ML) on the diagonal, which is indeed a negative definite matrix.

Note that we expressed the model and the likelihood function in terms of the parameters μ and σ². We could equally well use the parameters μ and σ. We leave it as an exercise (see Exercise 1.6) to show that solving the first order conditions with respect to μ and σ gives the same estimators as before, which illustrates the invariance property of maximum likelihood estimators.

Exercises: T: 1.6a, b, c; E: 1.9d, 1.10a.
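The closed-form solutions of Example 1.8 can be cross-checked by maximizing (1.40) numerically. The following hypothetical sketch uses SciPy and parametrizes the likelihood by (μ, log σ²); by the invariance property discussed above, this reparametrization leads to the same estimates:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(6)
y = rng.normal(loc=2.0, scale=1.5, size=500)   # artificial sample

def neg_loglik(theta):
    # Parametrized by (mu, log sigma^2) so the variance stays positive.
    mu, log_s2 = theta
    s2 = np.exp(log_s2)
    return 0.5 * len(y) * np.log(2 * np.pi * s2) + ((y - mu) ** 2).sum() / (2 * s2)

res = minimize(neg_loglik, x0=np.array([0.0, 0.0]))
mu_ml, s2_ml = res.x[0], np.exp(res.x[1])
print(mu_ml, s2_ml)                 # numerical maximum of (1.40)
print(y.mean(), y.var(ddof=0))      # closed-form ML estimates: the same values
```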

1.3.2 Statistical properties (First used in Section 2.2; uses Appendix A.2–A.5.)

Data generating process

To evaluate the quality of estimators, suppose that the data are generated by a particular distribution that belongs to the specified model. That is, the data generating process (DGP) of y_1, ..., y_n has a distribution f_{θ_0} where θ_0 ∈ Θ. An estimator θ̂ is a function of the random variables y_1, ..., y_n, so that θ̂ is itself a random variable with a distribution that depends on θ_0. The estimator would be perfect if P[θ̂ = θ_0] = 1, as it would always infer the correct parameter value from the sample. However, as the observations are partly random, θ_0 can in general not be inferred with certainty from the information in the data. To evaluate the quality of an estimator we therefore need statistical measures for the distance of the distribution of θ̂ from θ_0.

Variance and bias

First assume that θ consists of a single parameter. The mean squared error (MSE) of an estimator is defined by E[(θ̂ − θ_0)²], which can be decomposed in two terms as

MSE(\hat{\theta}) = E[(\hat{\theta} - \theta_0)^2] = var(\hat{\theta}) + (E[\hat{\theta}] - \theta_0)^2.    (1.44)

Here all expectations are taken with respect to the underlying distribution f_{θ_0} of the data generating process. The first term is the variance of the estimator; if this is small, the estimator is not so much affected by the randomness in the data. The second term is the square of the bias E[θ̂] − θ_0; if this is small, the estimator has a distribution that is centred around θ_0. The mean squared error provides a trade-off between the variance and the bias of an estimator.

Unbiased and efficient estimators

The practical use of the MSE criterion is limited by the fact that MSE(θ̂) depends in general on the value of θ_0. As θ_0 is unknown (else there would be no reason to estimate it), one often uses other criteria that can be evaluated without knowing θ_0. For instance, one can restrict the attention to unbiased estimators — that is, estimators with the property that E[θ̂] = θ_0 — and try to minimize the variance var(θ̂) within the class of unbiased estimators. An estimator that minimizes the variance over a class of estimators is called efficient within that class. Assume again that θ consists of a single parameter. The Cramér–Rao lower bound states that for every unbiased estimator θ̂ there holds

var(\hat{\theta}) \geq \left( E\left[ \left( \frac{d \log(L(\theta))}{d\theta} \right)^2 \right] \right)^{-1} = \left( -E\left[ \frac{d^2 \log(L(\theta))}{d\theta^2} \right] \right)^{-1}    (1.45)

where L(θ) is the likelihood function and the expectations are taken with respect to the distribution with parameter θ_0. The proof of the equality in (1.45) is left as an exercise (see Exercise 1.7). The inequality in (1.45) implies that a sufficient condition for the efficiency of an estimator θ̂ in the class of unbiased estimators with E[θ̂] = θ_0 is that var(θ̂) is equal to the Cramér–Rao lower bound. This condition is not necessary, however, because in some situations the lower bound on the variance cannot be attained by any unbiased estimator.

Warning on terminology

A comment on the terminology is in order. Although the property of unbiasedness is an attractive one, this does not mean that biased estimators should automatically be discarded. In practice we have a single sample y_1, ..., y_n at our disposal, and corresponding single outcomes of the estimators. Exhibit 1.11 shows the density functions of two estimators, one that is unbiased but that has a relatively large variance and another that has a small bias and a relatively small variance. As is clear from Exhibit 1.11, the outcome of the biased estimator will in general be closer to the correct parameter value than the outcome of the unbiased estimator. This shows that unbiasedness should not be imposed blindly.

[Exhibit 1.11 Bias and variance: densities of two estimators, one that is unbiased but has a larger variance, and another that is biased (downwards) but has a smaller variance; θ_0 denotes the parameter of the data generating process.]

More than one parameter

Now suppose that θ consists of a vector of parameters, and that θ̂ is a vector of estimators where each component is an estimator of the corresponding component of θ. Then θ̂ is unbiased if E[θ̂] = θ_0 — that is, if all components are unbiased. For unbiased estimators, the covariance matrix is given by

var(\hat{\theta}) = E\left[(\hat{\theta} - E[\hat{\theta}])(\hat{\theta} - E[\hat{\theta}])'\right] = E\left[(\hat{\theta} - \theta_0)(\hat{\theta} - \theta_0)'\right].
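The decomposition (1.44) and the trade-off of Exhibit 1.11 can be illustrated by simulation. The following hypothetical sketch compares two estimators of σ² — the unbiased sample variance s² and the downward-biased ML estimator with divisor n (both appear in Example 1.9 below) — and shows that the biased estimator can nevertheless have the smaller mean squared error:

```python
import numpy as np

rng = np.random.default_rng(8)
n, runs, sigma2 = 10, 100_000, 1.0
y = rng.normal(0.0, np.sqrt(sigma2), size=(runs, n))

for name, est in [("s2, divisor n-1", y.var(axis=1, ddof=1)),
                  ("ML, divisor n  ", y.var(axis=1, ddof=0))]:
    bias = est.mean() - sigma2
    var = est.var()
    mse = ((est - sigma2) ** 2).mean()    # (1.44): MSE = variance + bias^2
    print(f"{name}: bias {bias:+.3f}, variance {var:.3f}, MSE {mse:.3f}")
```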

The Cramér–Rao lower bound for the variance of unbiased estimators is given by the inverse of the so-called information matrix. This matrix is defined as follows:

$$\mathcal{I}_0 = E\left[ \left( \frac{\partial \log L(\theta)}{\partial \theta} \right) \left( \frac{\partial \log L(\theta)}{\partial \theta} \right)' \right] = -E\left[ \frac{\partial^{2} \log L(\theta)}{\partial \theta \, \partial \theta'} \right], \qquad (1.46)$$

where the expectations are taken with respect to the probability distribution with parameters $\theta_0$ and where the derivatives are evaluated at $\theta_0$. So, for every unbiased estimator there holds that $\mathrm{var}(\hat{\theta}) - \mathcal{I}_0^{-1}$ is positive semidefinite. A sufficient condition for efficiency of an unbiased estimator of the $k$th component of $\theta$ is that its variance is equal to the $k$th diagonal element of $\mathcal{I}_0^{-1}$.

Example 1.9: Normal Random Sample (continued)

As in Example 1.8, we consider the case of data consisting of a random sample from the normal distribution. We suppose that $y_i \sim \mathrm{NID}(\mu, \sigma^2)$, $i = 1, \cdots, n$, with unknown parameters $\theta = (\mu, \sigma^2)$. We will investigate (i) the unbiasedness of the ML estimators $\hat{\mu}_{ML}$ and $\hat{\sigma}^2_{ML}$, (ii) the variance and efficiency of these two estimators, (iii) simulated sample distributions of these two estimators and of two alternative estimators, the median (for $\mu$) and the sample variance $s^2$ (for $\sigma^2$), and (iv) the interpretation of the outcomes of this simulation experiment.

(i) Means of the ML estimators $\hat{\mu}_{ML}$ and $\hat{\sigma}^2_{ML}$

The maximum likelihood estimators are given by $\hat{\mu}_{ML} = \bar{y}$ and $\hat{\sigma}^2_{ML} = (n-1)s^2/n$, where $s^2$ is the sample variance. As was shown in Section 1.3.2, $\bar{y} \sim N(\mu, \sigma^2/n)$ and $(n-1)s^2/\sigma^2 \sim \chi^2(n-1)$. It follows that $E[\hat{\mu}_{ML}] = \mu$ and that $E[\hat{\sigma}^2_{ML}] = (n-1)\sigma^2/n$ — that is, $\hat{\mu}_{ML}$ is unbiased but $\hat{\sigma}^2_{ML}$ is not. An unbiased estimator of $\sigma^2$ is given by the sample variance $s^2$. This is the reason to divide by $(n-1)$ in (1.3) instead of by $n$. Unless the sample size $n$ is small, the difference between $s^2$ and $\hat{\sigma}^2_{ML}$ is small.
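The stated means follow directly from the chi-square result above; a short derivation (added here for completeness):

$$E[s^2] = \frac{\sigma^2}{n-1}\, E\left[ \frac{(n-1)s^2}{\sigma^2} \right] = \frac{\sigma^2}{n-1}(n-1) = \sigma^2, \qquad E[\hat{\sigma}^2_{ML}] = \frac{n-1}{n}\, E[s^2] = \frac{n-1}{n}\,\sigma^2,$$

so the bias of $\hat{\sigma}^2_{ML}$ equals $-\sigma^2/n$, which vanishes as $n \to \infty$.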

(ii) Variance and efficiency of the ML estimators

Now we evaluate the efficiency of the estimators $\bar{y}$ and $s^2$ in the class of all unbiased estimators. The variance of $\bar{y}$ is equal to $\sigma^2/n$. As the $\chi^2(n-1)$ distribution has variance $2(n-1)$, it follows that $s^2$ has variance $2\sigma^4/(n-1)$. The information matrix is equal to $\mathcal{I}_0 = -E[H(\theta_0)]$, where $\theta_0 = (\mu, \sigma^2)$ and $H(\theta_0)$ is the Hessian matrix in (1.43). As $E[y_i - \mu] = 0$ and $E[(y_i - \mu)^2] = \sigma^2$, it follows that

$$\mathcal{I}_0 = \begin{pmatrix} \dfrac{n}{\sigma^2} & 0 \\ 0 & \dfrac{n}{2\sigma^4} \end{pmatrix}.$$

By taking the inverse, it follows that the Cramér–Rao lower bounds for the variance of unbiased estimators of $\mu$ and $\sigma^2$ are respectively $\sigma^2/n$ and $2\sigma^4/n$. So $\bar{y}$ is efficient, but the variance of $s^2$ does not attain the lower bound. We mention that $s^2$ is nonetheless efficient — that is, there exists no unbiased estimator of $\sigma^2$ with variance smaller than $2\sigma^4/(n-1)$.

(iii) Simulated sample distributions

To illustrate the sampling aspect of estimators, we perform a small simulation experiment. To perform a simulation we have to specify the data generating process. We consider $n = 10$ independent random variables $y_1, \cdots, y_{10}$ obtained by ten random drawings from $N(0, 1)$. Statistical and econometric software packages contain random number generators for this purpose. A simulation run then consists of the outcomes of the variables $y_1, \cdots, y_n$. For such a simulated set of ten data points, we compute the following statistics: the sample mean $\bar{y}$, the sample median $\mathrm{med}(y)$, the sample variance $s^2$, and the second sample moment $m_2 = \hat{\sigma}^2_{ML}$. The values of these statistics depend on the simulated data, so that the outcomes will be different for different simulation runs. To get an idea of this variation we perform 10,000 runs. Exhibit 1.12 shows histograms for the resulting 10,000 outcomes of the statistics $\bar{y}$ in (a), $\mathrm{med}(y)$ in (b), $s^2$ in (c), and $\hat{\sigma}^2_{ML}$ in (d), together with their averages and standard deviations over the 10,000 runs.
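The experiment just described is easy to reproduce. A minimal sketch (assuming NumPy; the seed is arbitrary, so the numbers will differ slightly from those in Exhibit 1.12):

```python
# Sketch of the simulation of Example 1.9: 10,000 runs of n = 10 drawings
# from N(0, 1); report average and standard deviation of four statistics.
import numpy as np

rng = np.random.default_rng(123)
runs, n = 10_000, 10
y = rng.standard_normal((runs, n))

stats = {
    "mean":      y.mean(axis=1),               # sample mean
    "median":    np.median(y, axis=1),         # sample median
    "s2":        y.var(axis=1, ddof=1),        # unbiased sample variance
    "sigma2_ML": y.var(axis=1, ddof=0),        # ML estimator (n-1)s^2/n
}
for name, values in stats.items():
    print(f"{name:10s} average={values.mean():7.4f}  st.dev.={values.std():6.4f}")
```

The averages should come out close to 0, 0, 1, and 0.9 respectively, in line with the theoretical results derived above.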

(iv) Interpretation of simulation outcomes

Both the sample mean and the median have an average close to the mean $\mu = 0$ of the data generating process, but the sample mean has a smaller standard deviation than the median. This is in line with the fact that the sample mean is the efficient estimator. Also note that the sample standard deviation of the sample mean over the 10,000 runs (0.3159, see (a)) is close to the theoretical standard deviation of the sample mean (which is $\sigma/\sqrt{n} = 1/\sqrt{10} = 0.3162$). The estimates $\hat{\sigma}^2_{ML}$ show a downward bias, whereas $s^2$ has an average that is close to the variance $\sigma^2 = 1$ of the data generating process. This is in line with the fact that $\hat{\sigma}^2_{ML}$ is biased and $s^2$ is unbiased. The theoretical expected value of $\hat{\sigma}^2_{ML}$ is equal to $\frac{n-1}{n}\sigma^2 = 0.9$, which is close to the sample average of the estimates over the 10,000 runs (0.901). The standard deviations of $\hat{\sigma}^2_{ML}$ and $s^2$ over the 10,000 runs are in line with the theoretical standard deviations of $\sqrt{2(n-1)/n^2} = 0.424$ for $\hat{\sigma}^2_{ML}$ (as compared to a value of 0.418 in (d) over the 10,000 simulation runs) and $\sqrt{2/(n-1)} = 0.471$ for $s^2$ (as compared to a value of 0.465 in (c) over the 10,000 simulation runs).

[Exhibit 1.12 Normal Random Sample (Example 1.9): histograms of the sample mean (a), sample median (b), sample variance (c), and second sample moment (d) obtained in 10,000 simulation runs. Each simulation run consists of ten random drawings from the standard normal distribution and provides one outcome of the four sample statistics.]

Exercises: T: 1.6c, 1.7a, 1.8a–c, 1.9a–c, 1.10a.

1.3.3 Asymptotic properties

Motivation

In some situations the sample distribution of an estimator is known exactly. For instance, for random samples from the normal distribution, the sample mean and variance have distributions given by (1.33) and (1.35). In other cases, however, the exact finite sample distribution of estimators is not known. This is the case for many estimators used in econometrics, as will become clear in later chapters.

Basically two methods can be followed in such situations. One method is to simulate the distribution of the estimator for a range of data generating processes. A possible disadvantage is that the results depend on the chosen parameters of the data generating process. Another method is to consider the asymptotic properties of the estimator — that is, the properties if the sample size $n$ tends to infinity. Asymptotic properties give an indication of the distribution of the estimator in large enough finite samples. A possible disadvantage is that it may be less clear whether the actual sample size is large enough to use the asymptotic properties as an approximation.

Consistency

In this section we discuss some asymptotic properties that are much used in econometrics. Let $\theta$ be a parameter of interest and let $\hat{\theta}_n$ be an estimator of $\theta$ that is based on a sample of $n$ observations. We are interested in the properties of this estimator when $n \to \infty$, under the assumption that the data are generated by a process with parameter $\theta_0$. The estimator is called consistent if it converges in probability to $\theta_0$ — that is, if for all $\delta > 0$ there holds

$$\lim_{n \to \infty} P\left[\, |\hat{\theta}_n - \theta_0| < \delta \,\right] = 1. \qquad (1.47)$$

In this case $\theta_0$ is called the probability limit of $\hat{\theta}_n$, written as $\mathrm{plim}(\hat{\theta}_n) = \theta_0$. If $\theta$ is a vector of parameters, an estimator $\hat{\theta}_n$ is called consistent if each component of $\hat{\theta}_n$ is a consistent estimator of the corresponding component of $\theta$. Consistency is illustrated graphically in Exhibit 1.13. If the sample size gets larger, then the distribution becomes more concentrated around the parameter $\theta_0$ of the data generating process.

[Exhibit 1.13 Consistency: distribution of a consistent estimator for three sample sizes $n_1 < n_2 < n_3$. The distribution of the estimator becomes more and more concentrated around the correct parameter value $\theta_0$ if the sample size increases.]
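The concentration in Exhibit 1.13 can be mimicked directly in terms of the definition (1.47). A small sketch (assuming NumPy; the estimator is the sample mean of $N(0,1)$ data, so $\theta_0 = 0$, and $\delta = 0.1$ is an arbitrary choice):

```python
# Sketch: the probability that the sample mean lies within delta = 0.1 of
# theta_0 = 0 increases towards 1 as the sample size n grows (consistency).
import numpy as np

rng = np.random.default_rng(1)
for n in (10, 100, 1000):
    means = rng.standard_normal((10_000, n)).mean(axis=1)
    print(f"n={n:5d}  P[|mean| < 0.1] approx {np.mean(np.abs(means) < 0.1):.3f}")
```

The printed fractions increase from roughly 0.25 for $n = 10$ to nearly 1 for $n = 1000$.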

A sufficient (but not necessary) condition for consistency is that

$$\lim_{n \to \infty} E[\hat{\theta}_n] = \theta_0 \quad \text{and} \quad \lim_{n \to \infty} \mathrm{var}(\hat{\theta}_n) = 0 \qquad (1.48)$$

— that is, if the estimator is asymptotically unbiased and its variance tends to zero (see Exercise 1.7).

Calculation rules for probability limits

Probability limits are easy to work with, as they have similar properties as ordinary limits of functions. Suppose that $y_n$ and $z_n$ are two sequences of random variables with probability limits $\mathrm{plim}(y_n) = c_1$ and $\mathrm{plim}(z_n) = c_2$ $(\neq 0)$. Then there holds, for instance, that $\mathrm{plim}(y_n + z_n) = c_1 + c_2$, $\mathrm{plim}(y_n z_n) = c_1 c_2$, and $\mathrm{plim}(y_n / z_n) = c_1 / c_2$ (see Exercise 1.7). If $g$ is a continuous function that does not depend on $n$, then $\mathrm{plim}(g(y_n)) = g(c_1)$ (see Exercise 1.7). This result implies that, if $\hat{\theta}_n$ is a consistent estimator of $\theta_0$, then $g(\hat{\theta}_n)$ is a consistent estimator of $g(\theta_0)$. Note that for expectations there holds $E[y + z] = E[y] + E[z]$, but $E[yz] = E[y]E[z]$ holds only if $y$ and $z$ are uncorrelated, and in general $E[y/z] \neq E[y]/E[z]$ (even when $y$ and $z$ are independent); for instance, in general $E[y_n^2] \neq (E[y_n])^2$, whereas $\mathrm{plim}(y_n^2) = c_1^2$. So such calculation rules hold true for probability limits, whereas for expectations this does in general not hold true (unless $g$ is linear).

Similar results hold true for vector or matrix sequences of random variables. Let $A_n$ be a sequence of $p \times q$ matrices of random variables $a_n(i, j)$; then we write $\mathrm{plim}(A_n) = A$ if all the elements converge, so that $\mathrm{plim}(a_n(i, j)) = a(i, j)$ for all $i = 1, \cdots, p$, $j = 1, \cdots, q$. For two matrix sequences $A_n$ and $B_n$ with $\mathrm{plim}(A_n) = A$ and $\mathrm{plim}(B_n) = B$ there holds $\mathrm{plim}(A_n + B_n) = A + B$, $\mathrm{plim}(A_n B_n) = AB$, $\mathrm{plim}(A_n^{-1} B_n) = A^{-1}B$, and so on, provided that the matrices have compatible dimensions and, for the last equality, that the matrix $A$ is invertible.

Law of large numbers

When the data consist of a random sample from a population, sample moments provide consistent estimators of the population moments. The reason is that the uncertainty in the individual observations cancels out in the limit by taking averages.

Let $y_i$, $i = 1, \cdots, n$, be independently and identically distributed random variables with finite population mean $E[y_i] = \mu$. That is, if $y_i \sim \mathrm{IID}$, $i = 1, \cdots, n$, with finite population mean $E[y_i] = \mu$, then

$$\mathrm{plim}\left( \frac{1}{n} \sum_{i=1}^{n} y_i \right) = \mu. \qquad (1.49)$$

This is called the law of large numbers. To get the idea, assume for simplicity that the population variance $\sigma^2$ is finite. Then the sample mean of $n$ observations, $\bar{y}_n = \frac{1}{n}\sum_{i=1}^{n} y_i$, is a random variable with mean $\mu$ and variance $\sigma^2/n$, and (1.49) follows from (1.48). Similarly, if $y_i \sim \mathrm{IID}$, $i = 1, \cdots, n$, and the $r$th population moment $\mu_r = E[(y_i - \mu)^r] < \infty$, then

$$\mathrm{plim}\left( \frac{1}{n} \sum_{i=1}^{n} (y_i - \bar{y}_n)^r \right) = \mu_r.$$

For instance, the sample variance converges in probability to the population variance. Also the sample covariance between two variables converges in probability to the population covariance.

Central limit theorem

A sequence of random variables $y_n$ with cumulative distribution functions $F_n$ is said to converge in distribution to a random variable $y$ with distribution function $F$ if $\lim_{n \to \infty} F_n(v) = F(v)$ at all points $v$ where $F$ is continuous. This is written as $y_n \stackrel{d}{\to} y$, and $F$ is also called the asymptotic distribution of $y_n$. A central result in statistics is that, under very general conditions, sample averages from arbitrary distributions are asymptotically normally distributed. Let $y_i$, $i = 1, \cdots, n$, be independently and identically distributed random variables with mean $\mu$ and finite variance $\sigma^2$. Then

$$z_n = \sqrt{n}\, \frac{\bar{y}_n - \mu}{\sigma} \;\stackrel{d}{\to}\; z \sim N(0, 1). \qquad (1.50)$$

This is called the central limit theorem. This means that (after standardization by subtracting the mean, dividing by the standard deviation, and multiplying by the square root of the sample size) the sample mean of a random sample from an arbitrary distribution has an asymptotic standard normal distribution.
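A quick illustration of (1.50) by simulation (a sketch assuming NumPy and SciPy; the exponential distribution with parameter 1, which has mean and standard deviation both equal to 1, serves as an arbitrary skewed population):

```python
# Sketch: standardized sample means from a skewed population behave
# approximately like N(0, 1), as the central limit theorem (1.50) states.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
mu = sigma = 1.0                      # exponential(1): mean = st.dev. = 1
n, runs = 200, 10_000
z = np.sqrt(n) * (rng.exponential(1.0, (runs, n)).mean(axis=1) - mu) / sigma

# compare simulated tail probabilities with the standard normal values
for c in (1.0, 1.645, 1.96):
    print(f"c={c:5.3f}  simulated P[z>c]={np.mean(z > c):.4f}  "
          f"normal={1 - stats.norm.cdf(c):.4f}")
```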

For large enough sample sizes, the finite sample distribution of $z_n$ in (1.50) can be approximated by the standard normal distribution $N(0, 1)$. It follows that $\bar{y}_n$ is approximately distributed as $N(\mu, \sigma^2/n)$, which we write as

$$\bar{y}_n \approx N\left(\mu, \frac{\sigma^2}{n}\right).$$

Note that an exact distribution is denoted by $\sim$ and an approximate distribution by $\approx$.

Generalized central limit theorems

The above central limit theorem for the IID case can be generalized in several directions. We mention three generalizations that are used later in this book. When the $y_i$ are independent random variables with common mean $\mu$ and different variances $\sigma_i^2$ for which the average variance $\sigma^2 = \lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} \sigma_i^2$ is finite, then

$$\sqrt{n}\left( \frac{1}{n} \sum_{i=1}^{n} y_i - \mu \right) \stackrel{d}{\to} N(0, \sigma^2).$$

When $y_i$, $i = 1, \cdots, n$, is a random sample from a $p$-dimensional distribution with finite vector of means $\mu$ and finite covariance matrix $\Sigma$, then

$$\sqrt{n}\left( \frac{1}{n} \sum_{i=1}^{n} y_i - \mu \right) \stackrel{d}{\to} N(0, \Sigma).$$

Now suppose that $A_n$ is a sequence of $p \times p$ matrices of random variables and that $y_n$ is a sequence of $p \times 1$ vectors of random variables. When $\mathrm{plim}(A_n) = A$, where $A$ is a given (non-random) matrix, and $y_n \stackrel{d}{\to} N(0, \Sigma)$, then

$$A_n y_n \stackrel{d}{\to} N(0, A \Sigma A').$$

Asymptotic properties of maximum likelihood estimators

The law of large numbers shows that moment estimators are consistent in case of random samples. Maximum likelihood estimators are, under appropriate conditions, consistent and (asymptotically) efficient.

Suppose that the likelihood function (1.39) is correctly specified, in the sense that the data are generated by a distribution $f_{\theta_0}$ with $\theta_0 \in \Theta$. Then, under certain regularity conditions, maximum likelihood estimators are consistent, so that $\mathrm{plim}(\hat{\theta}_{ML}) = \theta_0$, and asymptotically normally distributed. More in particular, under these conditions there holds

$$\sqrt{n}\,(\hat{\theta}_{ML} - \theta_0) \;\stackrel{d}{\to}\; N(0, \mathcal{I}_0^{-1}). \qquad (1.51)$$

Here $\hat{\theta}_{ML}$ denotes the maximum likelihood estimator (based on $n$ observations) and $\mathcal{I}_0$ is the asymptotic information matrix evaluated at $\theta_0$ — that is,

$$\mathcal{I}_0 = \lim_{n \to \infty} \frac{1}{n} \mathcal{I}_n,$$

where $\mathcal{I}_n$ is the information matrix for sample size $n$ defined in (1.46). The result in (1.51) can be seen as a generalization of the central limit theorem. For finite samples we obtain from (1.51) the approximation

$$\hat{\theta}_{ML} \approx N\left(\theta_0, \hat{\mathcal{I}}_n^{-1}\right),$$

where $\hat{\mathcal{I}}_n$ is the information matrix (1.46) evaluated at $\hat{\theta}_{ML}$. Moreover, $\hat{\theta}_{ML}$ is asymptotically efficient, in the following sense. Let $\hat{\theta}$ be a consistent estimator (based on $n$ observations) and let $S = \lim_{n \to \infty} \mathrm{var}(\sqrt{n}(\hat{\theta} - \theta_0))$ and $S_{ML} = \lim_{n \to \infty} \mathrm{var}(\sqrt{n}(\hat{\theta}_{ML} - \theta_0))$; then $S - S_{ML}$ is a positive semidefinite matrix. Asymptotic efficiency means that $\sqrt{n}(\hat{\theta}_{ML} - \theta_0)$ has, for $n \to \infty$, the smallest covariance matrix among all consistent estimators.

Intuitive argument for the consistency of $\hat{\theta}_{ML}$

Although a formal proof of consistency falls outside the scope of this book, it may be of interest to provide some intuition for this result, without being precise about the required regularity conditions. Suppose that $y_i$, $i = 1, \cdots, n$, are IID with common probability density function $f_{\theta_0}$. The ML estimator $\hat{\theta}_{ML}$ is obtained by maximizing the likelihood function $L(\theta) = \prod_{i=1}^{n} f_\theta(y_i)$, or equivalently by maximizing the log-likelihood $\frac{1}{n}\log(L(\theta)) = \frac{1}{n}\sum_{i=1}^{n} \log(f_\theta(y_i))$. The first order conditions for a maximum of this function can be expressed as

$$\frac{\partial}{\partial \theta}\left( \frac{1}{n} \sum_{i=1}^{n} \log(f_\theta(y_i)) \right) = \frac{1}{n} \sum_{i=1}^{n} \frac{\partial \log(f_\theta(y_i))}{\partial \theta} = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{f_\theta(y_i)} \frac{\partial f_\theta(y_i)}{\partial \theta} = 0.$$

Under suitable regularity conditions, the law of large numbers applies to the IID random variables $\frac{1}{f_\theta(y_i)} \frac{\partial f_\theta(y_i)}{\partial \theta}$, so that the first order conditions converge in probability (for $n \to \infty$) to

$$\mathrm{plim}\left( \frac{1}{n} \sum_{i=1}^{n} \frac{1}{f_\theta(y_i)} \frac{\partial f_\theta(y_i)}{\partial \theta} \right) = E_0\left[ \frac{1}{f_\theta(y_i)} \frac{\partial f_\theta(y_i)}{\partial \theta} \right] = 0,$$

where $E_0$ means that the expectation should be evaluated according to the density $f_{\theta_0}$ of the DGP. We will show below that the DGP parameter $\theta_0$ solves the above asymptotic first order condition for a maximum. Intuitively, the estimator $\hat{\theta}_{ML}$ (which solves the equations for finite $n$) will then converge to $\theta_0$ (which solves the equations asymptotically) in case $n \to \infty$. Then $\mathrm{plim}(\hat{\theta}_{ML}) = \theta_0$, so that $\hat{\theta}_{ML}$ is consistent. This intuition is correct under suitable regularity conditions.

To prove that $\theta_0$ is a solution of the asymptotic first order conditions, note that $f_\theta$ is a density function, so that $\int f_\theta(y_i)\,dy_i = 1$. Using this result, and substituting $\theta = \theta_0$ in the asymptotic first order conditions, we get

$$E_0\left[ \frac{1}{f_{\theta_0}(y_i)} \frac{\partial f_\theta(y_i)}{\partial \theta} \right] = \int \frac{1}{f_{\theta_0}(y_i)} \frac{\partial f_\theta(y_i)}{\partial \theta}\, f_{\theta_0}(y_i)\,dy_i = \int \frac{\partial f_\theta(y_i)}{\partial \theta}\,dy_i = \frac{\partial}{\partial \theta} \int f_\theta(y_i)\,dy_i = \frac{\partial}{\partial \theta}(1) = 0.$$

This shows that $\theta_0$ solves the asymptotic first order conditions.

Example 1.10: Simulated Normal Random Sample

To illustrate the consistency and asymptotic normality of maximum likelihood, we consider the following simulation experiment. We generate a sample of $n$ observations $(y_1, \cdots, y_n)$ by independent drawings from the standard normal distribution $N(0, 1)$, and we compute the corresponding maximum likelihood estimate $\hat{\sigma}^2_{ML} = \frac{1}{n}\sum_{i=1}^{n} (y_i - \bar{y})^2$. The results in Section 1.3.2 and Example 1.9 (for $n = 10$) showed that $\hat{\sigma}^2_{ML}$ is a biased estimator of $\sigma^2 = 1$. For each of the sample sizes $n = 10$, $n = 100$, and $n = 1000$, we perform 10,000 simulation runs. Exhibit 1.14 (a), (c), and (e) show three histograms of the resulting 10,000 estimates $\hat{\sigma}^2_{ML}$ for the three sample sizes. As the histograms become strongly concentrated around the value $\sigma^2 = 1$ for large sample size ($n = 1000$), this illustrates the consistency of this estimator. Exhibits 1.14 (b), (d), and (f) show three histograms of the resulting 10,000 values of $\sqrt{n}\,(\hat{\sigma}^2_{ML} - 1)$ for the three sample sizes. Whereas for $n = 10$ the skewness of the $\chi^2$ distribution is still visible, for $n = 1000$ the distribution is much more symmetric and approaches a normal distribution. This illustrates the asymptotic normality of the maximum likelihood estimator of $\sigma^2$.
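A compact version of this experiment (a sketch assuming NumPy; arbitrary seed) prints, instead of histograms, two summary measures: the standard deviation of $\hat{\sigma}^2_{ML}$, which shrinks with $n$ (consistency), and the skewness of $\sqrt{n}\,(\hat{\sigma}^2_{ML} - 1)$, which moves towards zero (asymptotic normality):

```python
# Sketch of the simulation of Example 1.10 for n = 10, 100, 1000.
import numpy as np

rng = np.random.default_rng(3)
for n in (10, 100, 1000):
    s2_ml = rng.standard_normal((10_000, n)).var(axis=1, ddof=0)
    scaled = np.sqrt(n) * (s2_ml - 1.0)
    skew = ((scaled - scaled.mean()) ** 3).mean() / scaled.std() ** 3
    print(f"n={n:5d}  st.dev.(sigma2_ML)={s2_ml.std():.3f}  "
          f"skewness of sqrt(n)(sigma2_ML - 1)={skew:.2f}")
```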

[Exhibit 1.14 Simulated Normal Random Sample (Example 1.10): histograms of the maximum likelihood estimates $\hat{\sigma}^2_{ML}$ of the error variance (shown in (a), (c), and (e)) and of a scaled version (defined by $\sqrt{n}\,(\hat{\sigma}^2_{ML} - 1)$ and shown in (b), (d), and (f)) for random drawings of the standard normal distribution, with sample size $n = 10$ in ((a)–(b)), $n = 100$ in ((c)–(d)), and $n = 1000$ in ((e)–(f)).]

Exercises: T: 1.7b–e, 1.8d, 1.10b–d, 1.12d.

1.4 Tests of hypotheses

1.4.1 Size and power

Null hypothesis and alternative hypothesis

When observations are affected by random influences, the same holds true for all inference that is based on these data. For instance, consider the hypothesis that the data are generated by a probability distribution with mean zero. Even if the hypothesis is correct, the sample mean of the observed data will not be (exactly) equal to zero. The question is whether the difference between the sample mean and the hypothetical population mean is due only to randomness in the data. If this seems unlikely, then the hypothesis is possibly not correct. If one wishes to evaluate hypotheses concerning the data generating process, one should take the random nature of the data into account.

Now we introduce some terminology. A statistical hypothesis is an assertion about the distribution of one or more random variables. If the hypothesis specifies the distribution completely, then it is called a simple hypothesis; otherwise, a composite hypothesis. If the functional form of the distribution is known up to a parameter (or vector of parameters), then the hypothesis is parametric; otherwise it is non-parametric. We restrict the attention to parametric hypotheses where one assertion, called the null hypothesis and denoted by $H_0$, is tested against another one, called the alternative hypothesis and denoted by $H_1$. Let the specified set of distributions be given by $\{f_\theta,\ \theta \in \Theta\}$, and let $\Theta_0$ and $\Theta_1$ be disjoint subsets of $\Theta$; then $H_0$ corresponds to the assertion that $\theta \in \Theta_0$ and $H_1$ to the assertion that $\theta \in \Theta_1$. For instance, if $\theta$ is the (unknown) population mean, then we can test the hypothesis of zero mean against the alternative of non-zero mean. In this case $\Theta_0 = \{0\}$ and $\Theta_1 = \{\theta \in \Theta,\ \theta \neq 0\}$.

Test statistic and critical region

Let $\theta_0$ be the parameter (or vector of parameters) of the data generating process. The observed data $(y_1, \cdots, y_n)$ are used to decide which of the two hypotheses ($H_0$ and $H_1$) seems to be the most appropriate one.

This decision is made by means of a test statistic $t(y_1, \cdots, y_n)$ — that is, an expression that can be computed from the observed data. Test statistics should be selected in such a way that one can discriminate well between the null and alternative hypotheses. For instance, to test the null hypothesis that the population mean $\theta = 0$ against the alternative that $\theta \neq 0$, the sample mean $\bar{y}$ provides an intuitively appealing test statistic. The possible outcomes of this statistic are divided into two regions, the critical region (denoted by $C$) and the complement of this region. The null hypothesis is rejected if $t \in C$ and it is not rejected if $t \notin C$. The null hypothesis will be rejected if $\bar{y}$ is ‘too far away’ from zero. For instance, one can choose a value $c > 0$ and reject the null hypothesis if $|\bar{y}| > c$. If the sample is such that $|\bar{y}| \leq c$, then the hypothesis is not rejected. Note that we say that $H_0$ is not rejected, instead of saying that $H_0$ is accepted. Tests can help to find a model that is reasonable, but this does not mean that we accept the null hypothesis as a factual truth.

Size and power

In general, to test a hypothesis one has to decide about the test statistic and about the critical region. The quality of a test can be evaluated in terms of the probability $p(\theta) = P[t \in C]$ to reject the null hypothesis. A perfect test would be one where the probability to make a mistake is zero — that is, with $p(\theta) = 0$ for $\theta \in \Theta_0$ and $p(\theta) = 1$ for $\theta \notin \Theta_0$. This is possible only if the parameter values can be inferred with absolute certainty from the observed data, but then there is, of course, no need for tests anymore. If the null hypothesis is valid (so that the data generating process satisfies $\theta_0 \in \Theta_0$) but the observed data lead to rejection of the null hypothesis (because $t \in C$), this is called an error of the first type. The probability of this error is called the size or also the significance level of the test. On the other hand, if the null hypothesis is false (as $\theta \notin \Theta_0$) but the observed data do not lead to rejection of the null hypothesis (because $t \notin C$), this is called an error of the second type. The rejection probability $p(\theta)$ for $\theta \notin \Theta_0$ is called the power of the test. A test is called consistent if the power $p(\theta)$ converges to 1 for all $\theta \notin \Theta_0$ if $n \to \infty$.

We restrict the attention to similar tests — that is, test statistics $t(y_1, \cdots, y_n)$ that are pivotal in the sense that the distribution under the null hypothesis does not depend on any unknown parameters. For similar tests, the rejection probability $p(\theta)$ can be calculated for $\theta \in \Theta_0$, and (in case the set $\Theta_0$ contains more than one element) this probability is independent of the value of $\theta \in \Theta_0$. This means that, for every given critical region $C$, the size can be computed.

Tests with a given significance level

Of course, for practical applications one prefers tests that have small size and large power in finite samples. In practice one often fixes a maximally tolerated size — for instance, 5 per cent — to control for errors of the first type. It then remains to choose a test statistic and a critical region with good power properties — that is, with small probabilities of errors of the second type. In this book we mostly use intuitive arguments to construct econometric tests of a given size, and we pay relatively little attention to their power properties. However, many of the standard econometric tests can be shown to have reasonably good power properties. Note that the null and alternative hypotheses play different roles in testing, because a small size is taken as a starting point. This means that an econometrician should try to formulate tests in such a way that errors of the first type are more serious than errors of the second type.

The meaning of significance

One should distinguish statistical significance from practical significance. In empirical econometrics we analyse data that are the result of economic processes that are often relatively involved. The purpose of an econometric model is to capture the main aspects of interest of these processes. Hereby the less relevant details are neglected on purpose. In many cases the relevant question is not so much whether the null hypothesis is exactly correct, but whether it is a reasonable approximation. One should not always blindly follow rules of thumb like significance levels of 5 per cent in testing. For example, in large samples nearly every null hypothesis will be rejected at the 5 per cent significance level. This means, for instance, that significance levels should in practice be taken as a decreasing function of the sample size.

Example 1.11: Simulated Normal Random Sample (continued)

To illustrate the power of tests, we consider a simulation experiment where the data $y_i$, $i = 1, \cdots, n$, are generated by independent drawings from $N(\mu, 1)$. We assume that the modeller knows that the data are generated by an NID process with known variance $\sigma^2 = 1$ but that the mean $\mu$ is unknown. We will test the null hypothesis $H_0: \mu = 0$ against the alternative $H_1: \mu \neq 0$. We will now discuss (i) two alternative test statistics (the mean and the median), (ii) the choice of critical regions (for fixed significance level), (iii) the set-up of the simulation experiment, and (iv) the outcomes of this experiment.

(i) Two test statistics: mean and median

We will analyse the properties of two alternative estimators of $\mu$ — namely, the sample mean $\bar{y}$ and the sample median $\mathrm{med}(y)$. Both estimators can be used to construct test statistics to test the null hypothesis.

(ii) Choice of critical regions

As both statistics (sample mean and sample median) have a distribution that is symmetric around the population mean, intuition suggests that we should reject the null hypothesis if $|\bar{y}| > c_1$ (if we use the sample mean) or if $|\mathrm{med}(y)| > c_2$ (if we use the sample median). We fix the size of the tests at 5 per cent in all cases. The critical values $c_1$ and $c_2$ are determined by the condition that $P[-c_1 \leq \bar{y} \leq c_1] = P[-c_2 \leq \mathrm{med}(y) \leq c_2] = 0.95$ when $\mu = 0$. We consider sample sizes of $n = 10$, $n = 100$, and $n = 1000$, for which $c_1 = 1.96/\sqrt{n}$ (because $\sigma$ is known to the modeller) and $c_2$ is approximately 0.73, 0.24, and 0.08 respectively. The values of $c_2$ are obtained from a simulation experiment with 100,000 runs, where in each run the median of a sample $y_i \sim N(0, 1)$, $i = 1, \cdots, n$, is determined.

(iii) Simulation experiment

To investigate the power properties of the two test statistics, we consider as data generating processes $y_i \sim N(\mu, 1)$, $i = 1, \cdots, n$, for a range of eleven values for $\mu$ between $\mu = -2$ and $\mu = 2$, including $\mu = 0$ (see Exhibit 1.15 for the precise values of $\mu$ in the eleven experiments). This leads to in total thirty-three simulation experiments (three sample sizes $n = 10$, 100, or 1000, for each of the eleven values of $\mu$). For each experiment we perform 10,000 simulation runs and determine the frequency of rejection of the null hypothesis, both for the sample mean and for the sample median.

[Exhibit 1.15 Simulated Normal Random Sample (Example 1.11): results of simulation experiments with random samples of different sizes ($n = 10$, $n = 100$, and $n = 1000$) and with different means of the data generating process ($\mu = -2$, $-1$, $-0.2$, $-0.1$, $-0.05$, $0$, $0.05$, $0.1$, $0.2$, $1$, $2$). The numbers in the table report the rejection percentages (over 10,000 simulation runs) of the null hypothesis that $\mu = 0$, using tests of size 5% based on the sample mean and on the sample median.]
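The rejection percentages for the mean-based test can be reproduced with a few lines of code (a sketch assuming NumPy; seed and the selection of $\mu$ values are arbitrary, and the analogous median test would simply replace the mean by the median and $c_1$ by the simulated critical value $c_2$):

```python
# Sketch of the power simulation of Example 1.11 for the sample-mean test:
# reject H0: mu = 0 when |ybar| > 1.96/sqrt(n), since sigma = 1 is known.
import numpy as np

rng = np.random.default_rng(4)
runs = 10_000
for n in (10, 100, 1000):
    c1 = 1.96 / np.sqrt(n)
    for mu in (0.0, 0.1, 0.2, 1.0):
        ybar = rng.normal(mu, 1.0, (runs, n)).mean(axis=1)
        reject = np.mean(np.abs(ybar) > c1)    # rejection frequency
        print(f"n={n:5d}  mu={mu:4.1f}  rejection rate={reject:.3f}")
```

At $\mu = 0$ the rejection rate estimates the size (about 0.05); at $\mu \neq 0$ it estimates the power, which grows with $n$ and with the distance of $\mu$ from zero.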

(iv) Outcomes of the simulation experiment

The results are in Exhibit 1.15. For $\mu = 0$ this indicates that the size is indeed around 5 per cent. For $\mu \neq 0$ the power of the tests increases for larger samples, indicating that both tests are consistent. The sample mean has higher power than the sample median. This means that, for the normal distribution, the sample mean is to be preferred above the sample median to perform tests on the population mean. Note that for $n = 1000$ the null hypothesis that $\mu = 0$ is rejected nearly always (both by the mean and by the median) if the data generating process has $\mu = 0.2$. It depends on the particular investigation whether the distinction between $\mu = 0$ and $\mu = 0.2$ is really of interest — that is, whether this difference is of practical significance.

1.4.2 Tests for mean and variance

Two-sided test for the mean

To illustrate the general principles discussed in Section 1.4.1, we consider tests for the mean and variance of a population. It is assumed that the data $y_i$, $i = 1, \cdots, n$, consist of a random sample from a normal distribution, so that $y_i \sim \mathrm{NID}(\mu, \sigma^2)$. Both the mean $\mu$ and the variance $\sigma^2$ of the population are unknown, so that $\theta = (\mu, \sigma^2)$. First we test a hypothesis about the mean — for instance, $H_0: \mu = \mu_0$ against $H_1: \mu \neq \mu_0$, where $\mu_0$ is a given value. This is called a two-sided test, as the alternative contains values $\mu > \mu_0$ as well as values $\mu < \mu_0$. As test statistic we consider the sample mean $\bar{y}$, and we reject the null hypothesis if $|\bar{y} - \mu_0| > c$, where $c$ determines the size of the test. According to (1.33), for $\mu = \mu_0$ we get

$$\frac{\bar{y} - \mu_0}{\sigma/\sqrt{n}} \sim N(0, 1).$$

However, as $\sigma^2$ is unknown, the expression on the left-hand side is not a statistic. If $\sigma^2$ is replaced by the unbiased estimator $s^2$, we obtain, according to (1.36),

$$t = \frac{\bar{y} - \mu_0}{s/\sqrt{n}} \sim t(n-1). \qquad (1.52)$$

Because all the terms in this expression are known ($\mu_0$ is given and $\bar{y}$ and $s$ can be computed from the data), this is a test statistic. It is also a similar test statistic, as the statistic $t$ is pivotal in the sense that its distribution does not depend on the unknown variance $\sigma^2$. The null hypothesis is rejected if $|t| > c$, where the value of $c$ is chosen in accordance with the desired size of the test. So the null hypothesis is rejected if $\bar{y}$ falls in the critical region

$$\bar{y} < \mu_0 - c\,\frac{s}{\sqrt{n}} \quad \text{or} \quad \bar{y} > \mu_0 + c\,\frac{s}{\sqrt{n}}. \qquad (1.53)$$

This defines the critical region of the test. The size of this test is equal to $P[|t| > c]$, where $t$ follows the $t(n-1)$ distribution. For size 5 per cent, the critical value for $n = 20$ is around 2.09, for $n = 60$ it is around 2.00, and for $n \to \infty$ it converges to 1.96. As a rule of thumb, one often takes $c \approx 2$. Note that $s/\sqrt{n}$ is the estimated standard deviation of the sample mean, see (1.33). The estimated standard deviation $s/\sqrt{n}$ is called the standard error of the sample mean. So the null hypothesis is rejected if the sample mean is more than two standard errors away from the postulated mean $\mu_0$. In this case one says that the sample mean differs significantly from $\mu_0$, or (when $\mu_0 = 0$) that the sample mean is significant at the 5 per cent significance level. The test (1.52) is called the t-test for the mean.

One-sided test for the mean

In some cases it is of interest to test the null hypothesis $H_0: \mu = \mu_0$ against the one-sided alternative $H_1: \mu > \mu_0$. The test statistic is again as given in (1.52), but now the null hypothesis is rejected if $t > c$, with size equal to $P[t > c]$ where $t \sim t(n-1)$. The critical value for $n = 20$ is around 1.73, for $n = 60$ it is around 1.67, and for $n \to \infty$ it approaches 1.645. This test can also be used for the null hypothesis $H_0: \mu \leq \mu_0$ against the alternative $H_1: \mu > \mu_0$. Tests for $H_0: \mu = \mu_0$ or $H_0: \mu \geq \mu_0$ against $H_1: \mu < \mu_0$ are performed in a similar fashion, where $H_0$ is rejected for small values of $t$. This is called a one-sided test.

Probability value (P-value)

In practice it may not be clear how to choose the size. In principle this depends on the consequences of making errors of the first and second type, but such errors are often difficult to determine. Instead of fixing the size — for instance, at 5 per cent — one can also leave the size unspecified and compute the value of the test statistic from the observed sample. One can then ask for which sizes this test outcome would lead to rejection of the null hypothesis. As larger sizes correspond to larger rejection probabilities, there will be a minimal value of the size for which the null hypothesis is rejected. This is called the probability value or P-value of the test outcome. That is, the null hypothesis should be rejected for all sizes larger than $P$, and it should not be rejected for all sizes smaller than $P$. Stated otherwise, the null hypothesis should be rejected for small values of $P$ and it should not be rejected for large values of $P$. If this P-value is small, this means that outcomes of the test statistic so far away from zero are improbable under the null hypothesis, so that the null hypothesis should be rejected.

P-value for the mean

As an example, let $t_0$ be the calculated value of the t-test for the null hypothesis $H_0: \mu = \mu_0$, where $t$ follows the $t(n-1)$ distribution. When $H_0$ is tested against the one-sided alternative $H_1: \mu > \mu_0$, then $P = P[t > t_0]$. When $H_0$ is tested against the two-sided alternative $H_1: \mu \neq \mu_0$, then the P-value is given by $P = P[t < -|t_0| \text{ or } t > |t_0|]$. In general, the P-value can be defined as the probability (under the null hypothesis) of getting the observed outcome of the test statistic or a more extreme outcome — that is, the P-value is the corresponding (one-sided or two-sided) tail probability. Note that the P-value depends on the form of the (one-sided or two-sided) alternative hypothesis. This is illustrated graphically in Exhibit 1.16.

[Exhibit 1.16 P-value: P-value of a one-sided test ((a), for the one-sided alternative that the parameter is larger than zero) and of a two-sided test ((b), for the two-sided alternative that the parameter is not zero, with equal areas in both tails). The arrow indicates the outcome of the test statistic calculated from the observed sample. In (a) the P-value is equal to the surface of the shaded area, and in (b) the P-value is equal to the sum of the surfaces of the two shaded areas.]
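The t-test (1.52) and its one- and two-sided P-values are simple to compute in practice. A sketch (assuming SciPy; the data vector is purely illustrative and $\mu_0 = 0$ is an arbitrary null value):

```python
# Sketch: t-test for the mean with one-sided and two-sided P-values.
import numpy as np
from scipy import stats

y = np.array([0.8, -0.2, 1.1, 0.4, 0.9, 1.3, -0.1, 0.6])  # illustrative data
n, mu0 = len(y), 0.0
t = (y.mean() - mu0) / (y.std(ddof=1) / np.sqrt(n))        # statistic (1.52)
p_two_sided = 2 * (1 - stats.t.cdf(abs(t), df=n - 1))
p_one_sided = 1 - stats.t.cdf(t, df=n - 1)                 # for H1: mu > mu0
print(t, p_two_sided, p_one_sided)

# the same two-sided outcome via the built-in one-sample t-test:
print(stats.ttest_1samp(y, popmean=mu0))
```

Note that the two-sided P-value is exactly twice the one-sided one when $t_0 > 0$, which matches the tail-area definition above.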

Chi-square test for the variance

Next we consider tests on the variance. Again it is assumed that the data consist of a random sample $y_i \sim N(\mu, \sigma^2)$, $i = 1, \cdots, n$, with $\mu$ and $\sigma^2$ unknown. Let the null hypothesis be $H_0: \sigma^2 = \sigma_0^2$ and the (one-sided) alternative $H_1: \sigma^2 > \sigma_0^2$. If the null hypothesis holds true, then the variables $(y_i - \mu)/\sigma_0 \sim N(0, 1)$ are independent, so that

$$\sum_{i=1}^{n} (y_i - \mu)^2/\sigma_0^2 \sim \chi^2(n).$$

However, as $\mu$ is unknown, this is not a test statistic. When this parameter is replaced by its estimate $\bar{y}$, we obtain, according to (1.35),

$$\frac{1}{\sigma_0^2} \sum_{i=1}^{n} (y_i - \bar{y})^2 = \frac{(n-1)s^2}{\sigma_0^2} \sim \chi^2(n-1). \qquad (1.54)$$

The null hypothesis is rejected for large values of this test statistic, with critical value determined from the $\chi^2(n-1)$ distribution. Note that this test statistic is similar. For other hypotheses — for instance, $H_0: \sigma^2 = \sigma_0^2$ against $H_1: \sigma^2 \neq \sigma_0^2$ — the same test statistic can be used with appropriate modifications of the critical regions.

F-test for equality of two variances

Finally we discuss a test to compare the variances of two populations. Suppose that the data consist of two independent samples, one of $n_1$ observations distributed as $\mathrm{NID}(\mu_1, \sigma_1^2)$ and the other of $n_2$ observations distributed as $\mathrm{NID}(\mu_2, \sigma_2^2)$. The testing problem $H_0: \mu_1 = \mu_2$ against the alternative $H_1: \mu_1 \neq \mu_2$ is more complicated and will be discussed later (see Exercise 3.10). Consider the null hypothesis $H_0: \sigma_1^2 = \sigma_2^2$ of equal variances against the alternative $H_1: \sigma_1^2 \neq \sigma_2^2$ that the variances are different. Let $s_1^2$ be the sample variance in the first sample and $s_2^2$ that in the second sample. As the two samples are assumed to be independent, the same holds true for $s_1^2$ and $s_2^2$. Further, (1.35) implies that $(n_i - 1)s_i^2/\sigma_i^2 \sim \chi^2(n_i - 1)$ for $i = 1, 2$, so that $(s_2^2/\sigma_2^2)/(s_1^2/\sigma_1^2) \sim F(n_2 - 1, n_1 - 1)$. When the null hypothesis $\sigma_1^2 = \sigma_2^2$ is true, it follows that

$$\frac{s_2^2}{s_1^2} \sim F(n_2 - 1, n_1 - 1), \qquad (1.55)$$

and the null hypothesis is rejected if this test statistic differs significantly from 1. The critical values can be obtained from the $F(n_2 - 1, n_1 - 1)$ distribution. Note that this test statistic is similar, as its distribution does not depend on the unknown parameters $\mu_1$ and $\mu_2$.

Example 1.12: Student Learning (continued)

We illustrate tests for the mean and variance by considering the random variable consisting of the FGPA score of students at the Vanderbilt University (see Example 1.1; data file XM101STU). We will discuss (i) a test for the mean and (ii) a test for the equality of two variances.

(i) Test for the mean

Suppose that the mean value of this score over a sequence of previous years is equal to 2.70 (this is a hypothetical, non-random value). Further suppose that the individual FGPA scores of students in the current year are independently and normally distributed with unknown mean $\mu$ and unknown variance $\sigma^2$.

The FGPA scores in the current year of 609 students of this university are summarized in Exhibit 1.4 (a). The university wishes to test the null hypothesis $H_0: \mu = 2.70$ of average scores against the alternative hypothesis $H_1: \mu > 2.70$ that the scores in the current year are above average. The sample mean and standard deviation are $\bar{y} = 2.793$ and $s = 0.460$ (after rounding; see Exhibit 1.4 (a) for the more precise numbers). So the sample average is above 2.70. The question is whether this should be attributed to random fluctuations in student scores or whether the student scores are above average in the current year. Under the null hypothesis that $\mu = 2.70$, it follows from (1.52) that $(\bar{y} - 2.70)/(s/\sqrt{609}) \sim t(608)$. If we substitute the values of $\bar{y}$ and $s$ as calculated from the sample, this gives the value $t = 4.97$. The one-sided critical value for size 5 per cent is (approximately) 1.645. As this outcome is well above 1.645, this leads to rejection of the null hypothesis. The P-value of this test outcome is around $10^{-6}$. It seems that the current students have better scores on average than the students in previous years.

(ii) Test for equality of two variances

Next we split the sample into two parts, males and females, with $n_1 = 373$ and $n_2 = 236$, and we assume that all scores are independently distributed with distribution $N(\mu_1, \sigma_1^2)$ for males and $N(\mu_2, \sigma_2^2)$ for females. The sample means and standard deviations for both sub-samples are in Exhibit 1.4: the sample means are $\bar{y}_1 = 2.728$ and $\bar{y}_2 = 2.895$, and the sample standard deviations are $s_1 = 0.441$ and $s_2 = 0.472$. We test whether $\sigma_1^2 = \sigma_2^2$ against the alternative that $\sigma_1^2 \neq \sigma_2^2$ by means of (1.55). The outcome is $F = s_2^2/s_1^2 = (0.472)^2/(0.441)^2 = 1.14$, and for the corresponding $F(235, 372)$ distribution this gives a (two-sided) P-value of around 0.26. The null hypothesis of equal variances is not rejected (at 5 per cent significance level). That is, there is no significant difference in the variance of the scores for male and female students.

Exercises: E: 1.12a–c.
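Both outcomes can be reproduced from the reported summary statistics alone (a sketch assuming SciPy; small differences from the reported $t = 4.97$ arise because $\bar{y}$ and $s$ are rounded here):

```python
# Sketch: the two tests of Example 1.12, computed from summary statistics.
import numpy as np
from scipy import stats

# (i) t-test of H0: mu = 2.70 against H1: mu > 2.70
n, ybar, s = 609, 2.793, 0.460
t = (ybar - 2.70) / (s / np.sqrt(n))
print(t, 1 - stats.t.cdf(t, df=n - 1))     # close to 4.97, P around 1e-6

# (ii) F-test of H0: sigma1^2 = sigma2^2 (males versus females)
n1, s1, n2, s2 = 373, 0.441, 236, 0.472
F = s2**2 / s1**2
tail = 1 - stats.f.cdf(F, n2 - 1, n1 - 1)
print(F, 2 * min(tail, 1 - tail))          # about 1.14, two-sided P about 0.26
```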

1.4.3 Interval estimates and the bootstrap

Interval estimates

Although a point estimate may suggest a high precision, this neglects the random nature of estimators. Because estimates depend on data that are partly random, it is sometimes preferred to give an interval estimate of the parameter instead of a point estimate. This interval indicates the uncertainty about the actual value of the parameter. When the interval is constructed in such a way that it contains the true parameter value with probability $1 - \alpha$, then it is called a $(1 - \alpha) \times 100$ per cent interval estimate. One method to construct such an interval is to use a test of size $\alpha$ and to include all parameter values $\theta^*$ for which the null hypothesis $H_0: \theta = \theta^*$ is not rejected. Indeed, if the true parameter value is $\theta_0$, then for a test of size $\alpha$ the probability that $H_0: \theta = \theta_0$ is rejected is precisely $\alpha$, so that the probability that the constructed interval contains $\theta_0$ is $1 - \alpha$.

For instance, assuming that the observations are $\mathrm{NID}(\mu, \sigma^2)$, a 95 per cent interval estimate for the mean is given by all values $\mu$ for which

$$\bar{y} - c\,\frac{s}{\sqrt{n}} \;\leq\; \mu \;\leq\; \bar{y} + c\,\frac{s}{\sqrt{n}}, \qquad (1.56)$$

where $c$ is such that $P[-c \leq t \leq c] = 0.95$ when $t$ has the $t(n-1)$ distribution. If $\mu_0$ is the true population mean, the complementary set of outcomes in (1.53) has a probability of 5 per cent, so that the interval in (1.56) has a probability of 95 per cent to contain $\mu_0$. In a similar way, using (1.54), a 95 per cent interval estimate for the variance $\sigma^2$ is given by

$$\frac{(n-1)s^2}{c_2} \;\leq\; \sigma^2 \;\leq\; \frac{(n-1)s^2}{c_1},$$

where $c_1 < c_2$ are chosen such that $P[c_1 \leq t \leq c_2] = 0.95$ when $t$ has the $\chi^2(n-1)$ distribution. For example, one can take these values so that $P[t \leq c_1] = P[t \geq c_2] = 0.025$.

Approximate tests and approximate interval estimators

Until now the attention has been restricted to data consisting of random samples from the normal distribution. In this case, tests can be constructed for the mean and variance that have a known distribution in finite samples. In other cases, the distribution of the estimator $\hat{\theta}$ in finite samples is not known. When the asymptotic distribution is known, this can be used to construct asymptotic tests and corresponding interval estimates. For example, if a maximum likelihood estimator $\hat{\theta}_n$ is asymptotically normally distributed, so that $\sqrt{n}(\hat{\theta}_n - \theta_0) \stackrel{d}{\to} N(0, \sigma^2)$, then in finite samples we can take as an approximation $\hat{\theta}_n \approx N(\theta_0, \hat{\sigma}^2/n)$, where $\hat{\sigma}^2$ is a consistent estimator of $\sigma^2$.
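The exact 95 per cent intervals for the mean in (1.56) and for the variance can be computed as follows (a sketch assuming SciPy; the data are an arbitrary simulated normal sample):

```python
# Sketch: exact 95% interval estimates for the mean and the variance
# of a normal random sample, following (1.56) and the chi-square interval.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
y = rng.normal(0.0, 1.0, 25)                          # illustrative sample
n, ybar, s2 = len(y), y.mean(), y.var(ddof=1)

c = stats.t.ppf(0.975, df=n - 1)                      # P[-c <= t <= c] = 0.95
mean_ci = (ybar - c * np.sqrt(s2 / n), ybar + c * np.sqrt(s2 / n))

c1, c2 = stats.chi2.ppf([0.025, 0.975], df=n - 1)     # P[c1 <= t <= c2] = 0.95
var_ci = ((n - 1) * s2 / c2, (n - 1) * s2 / c1)
print(mean_ci, var_ci)
```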

This means that

$$\frac{\hat{\theta}_n - \theta_0}{\hat{\sigma}/\sqrt{n}} \approx N(0, 1),$$

which is very similar to (1.52). For instance, let $H_0: \theta = \theta_0$ and $H_1: \theta \neq \theta_0$; then the null hypothesis is rejected if

$$\hat{\theta} > \theta_0 + c\,\frac{\hat{\sigma}}{\sqrt{n}} \quad \text{or} \quad \hat{\theta} < \theta_0 - c\,\frac{\hat{\sigma}}{\sqrt{n}},$$

where $P[|t| > c]$ with $t \sim N(0, 1)$ is the approximate size of the test. For a 5 per cent size there holds $c \approx 2$. An approximate 95 per cent interval estimate of $\theta$ is given by all values in the interval

$$\hat{\theta} - 2\,\frac{\hat{\sigma}}{\sqrt{n}} \;\leq\; \theta \;\leq\; \hat{\theta} + 2\,\frac{\hat{\sigma}}{\sqrt{n}}.$$

Bootstrap method

An alternative to asymptotic approximations is to use the bootstrap method. This method has the attractive property that it does not require knowledge of the shape of the probability distribution of the data or of the estimator. It is, therefore, said to be distribution free. We discuss this for the case of random samples. The bootstrap method uses the sample distribution to construct an interval estimate. If the observations all have different values (that is, $y_i \neq y_j$ for $i \neq j$), then the bootstrap probability distribution of the random variable $y$ is the discrete distribution with outcome set $V = \{y_1, \cdots, y_n\}$ and with probabilities

$$P[y = y_i] = \frac{1}{n}. \qquad (1.57)$$

The distribution of a statistic $t(y_1, \cdots, y_n)$ is simulated as follows. In one simulation run, $n$ observations are randomly drawn (with replacement) from the distribution (1.57) and the corresponding value of $t$ is calculated. Repeating this in a large number of runs (always with the same distribution (1.57), which is based on the original data), this provides an accurate approximation of the distribution of $t$ if (1.57) would be the data generating process. In reality it will, of course, be only an approximation of the data generating process. However, the bootstrap is a simple method to get an idea of possible random variations when there is little information about the probability distribution that generates the data.
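A bootstrap interval for the mean takes only a few lines (a sketch assuming NumPy; taking the 2.5% and 97.5% quantiles of the simulated means is equivalent to deleting the 2.5% smallest and largest values, as is done in Example 1.13 below; the sample here is illustrative):

```python
# Sketch: 95% bootstrap interval estimate for the population mean, by
# resampling the observed data with replacement according to (1.57).
import numpy as np

def bootstrap_mean_interval(y, runs=10_000, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    # each row is one bootstrap sample of size n drawn with replacement
    means = rng.choice(y, size=(runs, n), replace=True).mean(axis=1)
    return np.quantile(means, [0.025, 0.975])

y = np.random.default_rng(6).normal(2.9, 0.47, 236)   # illustrative data
print(bootstrap_mean_interval(y))
```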

Example 1.13: Student Learning (continued)

We will construct two interval estimates of the mean of the FGPA scores — namely, (i) based on the assumed normal distribution of FGPA scores, and (ii) obtained by the bootstrap method (data file XM101STU).

(i) Interval based on normal distribution

We construct interval estimates of the mean of FGPA, both for the combined population (denoted by $\mu$) and for males ($\mu_1$) and females ($\mu_2$) separately. If we assume that the scores are in all three cases normally distributed, the interval estimates can be computed from (1.56). The resulting 95 per cent interval estimates are

$$2.76 \leq \mu \leq 2.83, \qquad 2.68 \leq \mu_1 \leq 2.77, \qquad 2.83 \leq \mu_2 \leq 2.96.$$

Note that the interval estimate of $\mu_1$ is below that of $\mu_2$, which suggests that the two means differ significantly. If we do not assume that the individual scores are normally distributed but apply the asymptotic normal approximation for the sample means, then the corresponding asymptotic interval estimates for the mean are the same as before.

(ii) Interval obtained by the bootstrap method

Although the sample sizes are relatively large in all three cases, we will consider the bootstrap as an alternative to construct an interval estimate for $\mu_2$ (as the corresponding sample size $n_2 = 236$ is the smallest of the three cases). For this purpose, the bootstrap distribution (1.57) consists of the 236 FGPA scores of female students. We perform 10,000 simulation runs. In each run, 236 IID observations are drawn from the bootstrap distribution and the corresponding sample mean is calculated. This gives a set of 10,000 simulated values of the sample mean, with histogram given in Exhibit 1.17 (a).

[Exhibit 1.17 Student Learning (Example 1.13): (a) bootstrap distribution of the sample mean obtained by 10,000 simulation runs; (b) 95% bootstrap interval estimate for the population mean $\mu_2$ (lower value 2.8338, upper value 2.9552).]

Deleting the 250 smallest and the 250 largest values of the sample mean, we obtain the 95 per cent bootstrap interval estimate $2.83 \leq \mu_2 \leq 2.96$ (see Exhibit 1.17 (b)). This interval coincides (within this precision) with the earlier interval that was based on the normal distribution, but this is a coincidence, as in general the two intervals will be different. Here the outcomes are very close, because the sample size ($n = 236$) is large enough to use the normal distribution of the sample mean as a reasonable approximation.

Exercises: S: 1.14; E: 1.13c, e, g, 1.15c–g.

Summary, further reading, and keywords

SUMMARY

This chapter gives a concise review of the main statistical concepts and methods that are used in the rest of this book. Random variables provide a means to describe the random nature of economic data. The normal distribution and distributions related to the normal distribution play a central role. We discussed methods to describe observed data by means of graphs and sample statistics. We considered different methods to estimate the parameters of distributions from observed data and we discussed statistical properties such as unbiasedness, efficiency, consistency, and asymptotic distributions of estimators. Further we discussed hypothesis testing and related concepts such as size, power, significance, and P-values.

FURTHER READING

For a more extensive treatment of statistics the reader is referred to the literature. We mention only a few key references from the large collection of good textbooks in statistics. As concerns the technical level, Aczel and Sounderpandian (2002), Keller and Warrack (2003), and Moore and McCabe (1999) are introductory; Arnold (1990), Bain and Engelhardt (1992), Hogg and Craig (1995), and Mood, Graybill, and Boes (1987) are intermediate; and Kendall and Stuart (1977–83) and Spanos (1995) are advanced.

Aczel, A. D., and Sounderpandian, J. (2002). Complete Business Statistics. Boston: McGraw-Hill.
Arnold, S. F. (1990). Mathematical Statistics. Englewood Cliffs, NJ: Prentice Hall.
Bain, L. J., and Engelhardt, M. (1992). Introduction to Probability and Mathematical Statistics. Boston: Duxbury Press.
Hogg, R. V., and Craig, A. T. (1995). Introduction to Mathematical Statistics. Englewood Cliffs, NJ: Prentice Hall.
Keller, G., and Warrack, B. (2003). Statistics for Management and Economics. Pacific Grove, Calif.: Brooks/Cole-Thomson Learning.
Kendall, M., and Stuart, A. (1977–83). The Advanced Theory of Statistics. 3 vols. London: Griffin.
Mood, A. M., Graybill, F. A., and Boes, D. C. (1987). Introduction to the Theory of Statistics. Auckland: McGraw-Hill.

Moore, D. S., and McCabe, G. P. (1999). Introduction to the Practice of Statistics. New York: Freeman.
Spanos, A. (1995). Statistical Foundations of Econometric Modelling. Cambridge: Cambridge University Press.

KEYWORDS

alternative hypothesis; asymptotic distribution; asymptotic normal distribution; asymptotic properties; Bernoulli distribution; bias; binomial distribution; bootstrap; central limit theorem; chi-square distribution; conditional distribution; conditional expectation; consistent estimator; consistent test; convergence in distribution; correlation coefficient; correlation matrix; covariance; critical region; cumulative distribution function; data generating process; density function; dependence; efficient; error of the first type; error of the second type; estimate; estimator; expectation; F-distribution; histogram; hypothesis; independence; information matrix; interval estimate; invariant; kurtosis; law of large numbers; least squares; likelihood function; log-likelihood; marginal distribution; maximum likelihood; mean; mean squared error; median; method of moments; model; multivariate normal distribution; normal distribution; null hypothesis; one-sided test; P-value; parameters; pivotal; power; practical significance; probability distribution; probability limit; probability value; random; random sample; sample correlation coefficient; sample covariance; sample covariance matrix; sample cumulative distribution function; sample mean; sample standard deviation; sample variance; scatter diagram; significance level; significant; similar test; simulation experiment; simulation run;

size; skewness; standard deviation; standard error; standard normal distribution; statistic; statistical significance; Student t-distribution; t-test for the mean; t-value; test statistic; two-sided test; unbiased; uncorrelated; variance.

Exercises

THEORY QUESTIONS

1.1 (Section 1.2.2) Suppose that $n$ pairs of outcomes $(x_i, y_i)$ of the variables $x$ and $y$ have been observed.
a. Prove that the sample correlation coefficient between the variables $x$ and $y$ always lies between $-1$ and $+1$. It may be helpful to consider the function $S(b) = \sum_{i=1}^{n} (y_i - \bar{y} - b(x_i - \bar{x}))^2$ and to use the fact that the minimal value of this function is non-negative.
b. Prove that the sample correlation coefficient is equal to $1$ or $-1$ if and only if $y$ is a linear function of $x$ — that is, $y_i = a + bx_i$, $i = 1, \cdots, n$, for some numbers $a$ and $b$ that do not depend on $i$.
c. Prove that the sample correlation coefficient is invariant under the linear transformations $x_i^* = a_1 x_i + b_1$ and $y_i^* = a_2 y_i + b_2$ for all $i = 1, \cdots, n$, with $a_1 > 0$ and $a_2 > 0$ positive constants.
d. Show, by means of an example, that the sample correlation coefficient is in general not invariant under non-linear transformations.

1.2 (Section 1.2.2)
a. Prove the result in (1.10) for the case of a linear transformation $z = ay + b$ with $a \neq 0$.
b. Prove the results in (1.15) and (1.16).
c. Show the result in (1.18).
d. Show the result in (1.22).
e. Prove the result in (1.23) for the case that $A$ is an $n \times n$ non-singular matrix.

1.3 (Sections 1.2.2, 1.2.3) Suppose that $x$ and $y$ are independent random variables.
a. Prove that the conditional distribution of $y \mid x = v$ then does not depend on the value of $v$ and that therefore the conditional mean and variance of $y \mid x = v$ also do not depend on $v$.
b. Prove that independent variables are uncorrelated.
c. Suppose that $y_1$ and $y_2$ are independent random variables and that $z_1 = g_1(y_1)$ and $z_2 = g_2(y_2)$. Prove that $z_1$ and $z_2$ are also independent.
d. Show, by means of an example, that the result in c does not generally hold true when ‘independent’ is replaced by ‘uncorrelated’.
e. Give an example of two random variables that are uncorrelated but not independent.

1.4 (Sections 1.2.3, 1.2.4)
a. Let $y \sim N(\mu, \sigma^2)$. Show that the first four moments of the normal distribution $N(\mu, \sigma^2)$ are equal to $m_1 = \mu$, $m_2 = \sigma^2$, $m_3 = 0$, and $m_4 = 3\sigma^4$, and show that the skewness and kurtosis are equal to zero and three respectively.
b. Let $y \sim N(\mu, \sigma^2)$ and let $z = ay + b$ with $a \neq 0$; then prove that $z \sim N(a\mu + b, a^2\sigma^2)$.
c. Prove the results in (1.29) and (1.30).
d. Prove that, when two jointly normally distributed random variables are uncorrelated, they are also independent.
e. Let $y \sim N(0, I)$ be an $n \times 1$ vector of independent standard normal random variables and let $z_0 = Ay$, $z_1 = y'Q_1y$ and $z_2 = y'Q_2y$, with $A$ a given $m \times n$ matrix and with $Q_1$ and $Q_2$ given symmetric idempotent $n \times n$ matrices. Use the generalization of the result in (1.19) to the case of $n$ functions to prove that $z_1$ and $z_2$ are independent.
f. If the rank of $Q_1$ is equal to $r$, then $Q_1 = UU'$ for an $n \times r$ matrix $U$ with the property that $U'U = I$ (the $r \times r$ identity matrix). Use this result to prove that $y'Q_1y \sim \chi^2(r)$, using the fact that jointly normally distributed random variables are independent if and only if they are uncorrelated.
g. Show that the matrix $M$ in (1.34) is symmetric and idempotent and that it has rank $(n-1)$.

1.5 (Sections 1.2.2, 1.2.4, 1.3.3)
a. Show that the mean and variance of the Bernoulli distribution are equal to $p$ and $p(1-p)$ respectively.
b. Show that the mean and variance of the binomial distribution are equal to $np$ and $np(1-p)$ respectively.

c. Show that the mean and variance of the $\chi^2(r)$ distribution are $r$ and $2r$ respectively. Use this result to prove that the maximum likelihood estimators $\hat{\mu}_{ML}$ and $\hat{\sigma}^2_{ML}$ in Example 1.8 are consistent.
d. Let $y_r \sim t(r)$; then show that for $r \to \infty$ the random variables $y_r$ converge in distribution to the standard normal distribution.
e. Let $y_r \sim F(r_1, r)$ with $r_1$ fixed; then show that for $r \to \infty$ the random variables $r_1 \cdot y_r$ converge in distribution to the $\chi^2(r_1)$ distribution.

1.6 (Section 1.3.2) Let $y_i \sim \mathrm{IID}(\mu, \sigma^2)$, $i = 1, \cdots, n$, and consider linear estimators of the mean $\mu$ of the form $\hat{\mu} = \sum_{i=1}^{n} a_i y_i$.
a. Derive the restriction on the coefficients $a_i$ needed to guarantee that the estimator $\hat{\mu}$ is unbiased.
b. Determine the mean and variance of this estimator.
c. Derive the linear unbiased estimator that has the minimal variance in this class of estimators.

1.7 (Section 1.3.3)
a. Prove the inequality of Chebyshev, which states that for a random variable $y$ with mean $\mu$ and variance $\sigma^2$ there holds $P[\,|y - \mu| \geq c\sigma\,] \leq 1/c^2$ for every $c > 0$.
b. Use the inequality of Chebyshev to prove that the two conditions in (1.48) imply consistency.
c. Prove the four rules for probability limits that are stated in the text below formula (1.48) — that is, for the sum, the product, the quotient, and arbitrary continuous functions of sequences of random variables.
d. Prove the equality in (1.45) for arbitrary distributions.

1.8 (Sections 1.3.2, 1.3.3) Let $y_1, \cdots, y_n$ be a random sample from a Bernoulli distribution.
a. Derive the maximum likelihood estimator of the parameter $p = P[y_i = 1]$, and derive the expression for the variance of this estimator.
b. Investigate whether this estimator is unbiased and consistent.
c. Derive the Cramér–Rao lower bound, and investigate whether the estimator in a is efficient in the class of unbiased estimators.
d. Suppose that the odds ratio $P[y = 1]/P[y = 0]$ is estimated by $\hat{p}/(1 - \hat{p})$, with $\hat{p}$ the estimator in a. Investigate whether this estimator of the odds ratio is unbiased and consistent.

1.9 (Sections 1.3.2, 1.3.3) Let $y_i \sim \mathrm{NID}(\mu, \sigma^2)$, $i = 1, \cdots, n$. In Example 1.8 the log-likelihood function (1.40) was analysed in terms of the parameters $\theta = (\mu, \sigma^2)$, but now we consider as an alternative the parameters $\psi = (\mu, \sigma)$.
a. Determine the gradient and Hessian matrix of the log-likelihood function $\log(L(\psi))$.
b. Check the equality in (1.46) for the log-likelihood function $\log(L(\psi))$.
c. Check that the maximum likelihood estimates are invariant under this change of parameters. In particular, show that the estimated value of $\sigma$ is the square root of the estimated value of $\sigma^2$ in Example 1.8.
d. Suppose that the distribution of each observation is given by the double exponential density $f(v) = \frac{1}{2}e^{-|v-\mu|}$ with $-\infty < v < \infty$. Show that in this case the maximum likelihood estimator of $\mu$ is the median.
e. Discuss whether the estimator of $\psi$ will be asymptotically efficient in the class of all unbiased estimators if the data are generated by the double exponential density.

1.10 (Section 1.3.3) Let $y_1, \cdots, y_n$ be a random sample from a population with density function $f_\theta(v) = e^{\theta - v}$ for $v \geq \theta$ and $f_\theta(v) = 0$ for $v < \theta$, where $\theta$ is an unknown parameter.
a. Prove that the maximum likelihood estimator of $\theta$ is given by the minimum value of $y_1, \cdots, y_n$.
b. Give explicit proofs that this estimator is biased but consistent.
c. Determine the method of moments estimator of $\theta$, based on the first moment, and prove that this estimator is consistent.
d. Discuss which of the two estimators in a and c you would prefer.

EMPIRICAL AND SIMULATION QUESTIONS

1.11 (E Sections 1.2.1, 1.2.2) In this exercise we consider data of ten randomly drawn students (the observation index $i$ indicates the position of the students in the file of all 609 students of Example 1.1; data file XR111STU). In the sample there are six female and four male students. These ten students are actually drawn from a larger data set consisting of 236 female and 373 male students, where the FGPA scores have mean 2.79 and standard deviation 0.46. The values of FGPA, SATM, and the gender variable FEM of these students are as follows.

i:    8,     381,   186,   325,   290,   138,   178,   108,   71,    594
FGPA: 3.692, 3.168, 3.264, 3.566, 2.482, 2.592, 2.805, 2.074, 3.225, 2.020
SATM: 6.3,   6.7,   6.4,   5.4,   5.9,   5.0,   5.2,   5.4,   6.2,   6.0
FEM:  0,     0,     1,     0,     1,     1,     1,     1,     0,     1

a. Compute for each of the three variables the sample mean, median, standard deviation, skewness, and kurtosis.
b. Make three histograms and three scatter plots for these three variables.
c. Compute the sample covariances and sample correlation coefficients between these three variables.
d. Compute the conditional means of FGPA and SATM for the four male students and also for the six female students. Relate the outcomes in a and b with the results in c.
e. Check the relation (1.15) (applied to the 'population' of the ten students considered here) for the two variables FGPA and SATM.

1.12 (E Sections 1.3.3, 1.4.1) Consider the data set of ten observations used in Exercise 1.11. The FGPA scores are assumed to be independently normally distributed with mean $\mu$ and variance $\sigma^2$, and the gender variable FEM is assumed to be independently Bernoulli distributed with parameter $p = P[\mathrm{FEM} = 1]$.
a. On the basis of the ten observations, test the null hypothesis that the mean is $\mu = 2.79$ against the alternative that $\mu < 2.79$. Use a statistical package to compute the corresponding (one-sided) P-value of this test outcome.
b. Repeat a, but now for the two-sided alternative that $\mu \neq 2.79$. What is the relation between the P-values of the one-sided and the two-sided tests?
c. Answer the questions in a and b for testing the null hypothesis $\sigma = 0.46$ against the one-sided alternative $\sigma > 0.46$.
d. Let $\hat{p}$ denote the random variable consisting of the fraction of successes in a random sample of size $n$ from the Bernoulli distribution. Use the central limit theorem to argue that the approximate distribution of $\hat{p}$ is $\hat{p} \approx N\bigl(p, \tfrac{1}{n} p(1-p)\bigr)$. Consider in particular the two extreme cases of a single observation ($n = 1$) and the asymptotic case (for $n \to \infty$).
e. Test the null hypothesis that $p = 236/609$ against the alternative that $p \neq 236/609$, based on the asymptotic approximation in d.
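As a sketch of how part a of Exercise 1.12 could be carried out in Python (the book itself refers only to a general 'statistical package'): the FGPA values below are those recovered in the table above and should be checked against the data file XR111STU.

```python
import numpy as np
from scipy import stats

# FGPA values of the ten students, as recovered from the table above.
fgpa = np.array([3.692, 3.168, 3.264, 3.566, 2.482,
                 2.592, 2.805, 2.074, 3.225, 2.020])

n = len(fgpa)
ybar = fgpa.mean()
s = fgpa.std(ddof=1)                       # sample standard deviation

# t-test of H0: mu = 2.79 against H1: mu < 2.79
t_stat = (ybar - 2.79) / (s / np.sqrt(n))
p_value = stats.t.cdf(t_stat, df=n - 1)    # one-sided (left tail) P-value
print(t_stat, p_value)
```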

1.13 (E Sections 1.2.2, 1.4.1) In this exercise we consider data of 474 employees (working in the banking sector) on the variables $y$ (the yearly salary in dollars) and $x$ (the number of finished years of education) (data file XR113BWA).
a. Compute the mean, median, and standard deviation of the variables $x$ and $y$, and compute the correlation between $x$ and $y$.
b. Make histograms of the variables $x$ and $y$ and make a scatter plot of $y$ against $x$. Check that the distribution of the salaries $y$ is very skewed and has excess kurtosis.
c. Define the random variable $z = \log(y)$. Make a histogram of the resulting 474 values of $z$ and compute the mean, median, standard deviation, skewness, and kurtosis of $z$.
d. If $z \sim N(\mu_z, \sigma_z^2)$, then $y = e^z$ is said to be lognormally distributed. Show that the mean of $y$ is given by $\mu = e^{\mu_z + \frac{1}{2}\sigma_z^2}$.
e. Compute a 95% interval estimate of the mean of the variable $z$, assuming that the observations on $z$ are NID($\mu_z$, $\sigma_z^2$).
f. Compute a 95% interval estimate of the mean of the variable $y$, assuming that the salaries are NID($\mu$, $\sigma^2$). Compare this interval with that obtained in e. Which interval do you prefer?
g. Check that $\bar{z} \neq \log(\bar{y})$ but that $\mathrm{med}(z) = \log(\mathrm{med}(y))$. Explain this last result.

1.14 (E Sections 1.3.3, 1.4.2) In this simulation exercise we consider the quality of the asymptotic interval estimates discussed in Section 1.4. We focus on the construction of interval estimates of the mean and on corresponding tests. As data generating process we consider the t(3) distribution, which has mean equal to zero and variance equal to three.
a. Generate a sample of $n = 10$ independent drawings from the t(3) distribution. Let $\bar{y}$ be the sample mean and $s$ the sample standard deviation, and compute the interval $\bar{y} \pm 2s/\sqrt{n}$. Reject the null hypothesis of zero mean if and only if this interval does not include zero.
b. Repeat the simulation run of a 10,000 times and compute the number of times that the null hypothesis of zero mean is rejected.
c. Repeat the simulation experiment of a and b for sample sizes $n = 100$ and $n = 1000$ instead of $n = 10$.
d. Give an explanation for the simulated rejection frequencies of the null hypothesis for sample sizes $n = 10$, $n = 100$, and $n = 1000$. Comment on these results.

1.15 (E Sections 1.3.3, 1.4.2) In this simulation exercise we consider an example of the use of the bootstrap in constructing an interval estimate of the median. If the median is taken as a measure of location of a distribution $f$, it can be estimated by the sample median. For a random sample of size $n$, the sample median has a standard deviation of $(2f(m)\sqrt{n})^{-1}$, where $m$ is the median of the density $f$. When the distribution $f$ is unknown, this expression cannot be used to construct an interval estimate.
a. Show that for random samples from the normal distribution the standard deviation of the sample median is $\sigma\sqrt{\pi/(2n)}$, whereas the standard deviation of the sample mean is $\sigma/\sqrt{n}$.
b. Show that for random samples from the Cauchy distribution (that is, the t(1) distribution) the standard deviation of the sample mean does not exist, but that the standard deviation of the sample median is finite and equal to $\pi/(2\sqrt{n})$.
c. Simulate a data set of $n = 1000$ observations $y_1, \cdots, y_{1000}$ by independent drawings from the t(1) distribution, and compute the median of this sample.
d. Repeat c 10,000 times. Compute the standard deviation of the median over the 10,000 simulations, and compare this with the theoretical standard deviation in b. The 95% interval estimate of the median can be obtained by ordering the 10,000 computed sample medians; the lower bound is then the 251st value and the upper bound is the 9750th value in this ordered sequence of sample medians (this interval contains 9500 of the 10,000 medians — that is, 95%).
e. Use the bootstrap method (based on the data of c) to construct a 95% interval estimate of the median, as follows. Generate a new set of 1000 observations by IID drawings from the bootstrap distribution and compute the median. Repeat this 10,000 times. Also compute the standard deviation of the median over these 10,000 times, and compare this with the results in d.
f. Construct a corresponding 95% interval estimate of the median and compare this with the result in d.
g. Comment on the differences between the methods in d and f and their usefulness in practice if we do not know the true data generating process.
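For the simulation and bootstrap steps c–e of Exercise 1.15, a minimal Python sketch (one possible implementation; the seed is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(seed=123)     # seed chosen arbitrarily

# c. One data set of n = 1000 drawings from the t(1) (Cauchy) distribution.
y = rng.standard_t(df=1, size=1000)

# e. Bootstrap: resample with replacement from the empirical distribution
#    of y and compute the median of each bootstrap sample.
medians = np.array([np.median(rng.choice(y, size=1000, replace=True))
                    for _ in range(10_000)])

print(medians.std())                      # compare with pi/(2*sqrt(n)) in b
lower, upper = np.percentile(medians, [2.5, 97.5])
print(lower, upper)                       # 95% bootstrap interval estimate
```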

2 Simple Regression

Econometrics is concerned with relations between economic variables. The simplest case is a linear relation between two variables. This relation can be estimated by the method of least squares. We discuss this method and we describe conditions under which this method performs well. We also describe tests for the statistical significance of models and their use in making predictions.

2.1 Least squares

2.1.1 Scatter diagrams

E Uses Section 1.2.1; Appendix A.1.

Data form the basic ingredient for every applied econometric study. Therefore we start by introducing three data sets, one taken from the financial world, the second one from the field of labour economics, and the third one from a marketing experiment. These examples are helpful to understand the methods that we will discuss in this chapter.

Example 2.1: Stock Market Returns

Exhibit 2.1 shows two histograms (in (a) and (b)) and a scatter plot (in (c)) of monthly excess returns in the UK over the period January 1980 to December 1999 (data file XR201SMR). The data are taken from the data bank DataStream. One variable, which we denote by $x_i$, corresponds to the excess returns on an overall stock market index. The other variable, which we denote by $y_i$, corresponds to the excess returns on an index of stocks in the sector of cyclical consumer goods. This index is composed on the basis of 104 firms in the areas of household durables, automobiles, textiles, and sports. The consumption of these goods is relatively sensitive to economic fluctuations, for which reason they are said to be cyclical. The excess returns are obtained by subtracting the return on a riskless asset from the asset returns. Here we used the one-month interest rate as riskless asset. The index $i$ denotes the observation number and runs from $i = 1$ (for 1980.01) to $i = 240$ (for 1999.12). In Exhibit 2.1, $x_i$ is denoted by RENDMARK and $y_i$ by RENDCYCO (see Appendix B for an explanation of the data sets and corresponding notation of variables used in the book).

The histograms indicate that the excess returns in the sector of cyclical consumer goods are on average lower than those in the total market and that they show a relatively larger variation over time. The extremely large negative returns (of around $-36$ per cent and $-28$ per cent) correspond to the stock market crash in October 1987. The scatter diagram shows that the two variables are positively related, since the top right and

bottom left of the diagram contain relatively many observations. However, the relationship is not completely linear: if we were to draw a straight line through the scatter of points, then there would be clear deviations from the line. The further analysis of these data is left as an exercise (see Exercises 2.11 and 2.12).

Exhibit 2.1 Stock Market Returns (Example 2.1): Histograms of monthly returns in the sector of cyclical consumer goods (a, RENDCYCO) and monthly total market returns (b, RENDMARK) in the UK over 1980:01–1999:12 (240 observations), and corresponding scatter diagram (c).

Example 2.2: Bank Wages

Exhibit 2.2 shows two histograms (in (a) and (b)) and a scatter diagram (in (c)) of 474 observations on education (in terms of finished years of education) and salary (in (natural) logarithms of the yearly salary $S$ in dollars) (data file XM202BWA). The data are taken from one of the standard data files of the statistical software package SPSS and concern the employees of a US bank. The salaries are measured in logarithms for reasons to be discussed later (see Example 2.6). Each point in the scatter diagram corresponds to the education and salary of an employee. On average, salaries are higher for higher educated people. This can be seen in the scatter diagram (c). However, for a fixed level of education there remains much variation in salaries, as for a fixed value of

education (on the horizontal axis) there remains variation in salaries (on the vertical axis). The further analysis of these data is left as an exercise (see Exercises 2.6 and 2.11).

Exhibit 2.2 Bank Wages (Example 2.2): Histograms of education (in years (a)) and salary (in logarithms (b)) of 474 bank employees, and corresponding scatter diagram (c).

Example 2.3: Coffee Sales

Exhibit 2.3 shows a scatter diagram with twelve observations $(x_i, y_i)$ on price and quantity sold of a brand of coffee (data file XR210COF). The data are taken from A. C. Bemmaor and D. Mouchoux, 'Measuring the Short-Term Effect of In-Store Promotion and Retail Advertising on Brand Sales: A Factorial Experiment', Journal of Marketing Research, 28 (1991), 202–14, and were obtained from a controlled marketing experiment in stores in Paris. The price is indexed, with value one for the usual price. Two price actions are investigated, with reductions of 5 per cent or 15 per cent of the usual price. The quantity sold is in units of coffee per week. Clearly, lower prices result in higher sales. Further, for a fixed price (on the horizontal axis) there remains variation in sales (different values on the vertical axis). In the sequel we will take this as the leading example to illustrate the theory of this chapter. The further analysis of this data set is left as an exercise (see Exercise 2.10).

Exhibit 2.3 Coffee Sales (Example 2.3): Scatter diagram of quantity sold against price index of a brand of coffee (the data set consists of twelve observations; in the top left part of the diagram two observations nearly coincide).

2.1.2 Least squares

E Uses Appendix A.7.

Fitting a line to a scatter of data

Our starting point is a set of points in a scatter diagram corresponding to $n$ paired observations $(x_i, y_i)$, $i = 1, \cdots, n$, and we want to find the line that gives the best fit to these points. We describe the line by the formula

$$y = a + bx. \quad (2.1)$$

Here $b$ is called the slope of the line and $a$ the intercept. The idea is to explain the differences in the outcomes of the variable $y$ in terms of differences in the corresponding values of the variable $x$.

Terminology

The variable $y$ in (2.1) is called the variable to be explained (or also the dependent variable or the endogenous variable) and the variable $x$ is called the explanatory variable (or also the independent variable, the regressor, the exogenous variable, or the covariate). In the three examples in Section 2.1.1, this means that the monthly variation in sector returns is explained by the market returns, that differences in salaries are explained by education, and that variations in sales are explained by prices. To evaluate the fit, we assume that our purpose is to explain or predict the value of $y$ that is associated with a given value of $x$. We measure the deviations $e_i$ of the observations from the line vertically — that is,

$$e_i = y_i - a - bx_i. \quad (2.2)$$

So $e_i$ is the error that we make in predicting $y_i$ by means of the variable $x_i$ using the linear relation (2.1) (see Exhibit 2.4).

Exhibit 2.4 Scatter diagram with fitted line: Scatter diagram with observed data $(x_i, y_i)$, regression line $(a + bx)$, fitted value $(a + bx_i)$, and residual $(e_i)$.

The least squares criterion

Now we have to make precise what we mean by the fit of a line. We will do this by specifying a criterion function — that is, a function of $a$ and $b$ that takes smaller values if the deviations are smaller. There are several ways to specify such a function. In many situations we dislike positive deviations as much as negative deviations — for instance,

$$S_{abs}(a, b) = \sum |e_i|,$$

in which case the criterion function depends on $a$ and $b$ via the absolute values of the deviations $e_i$, or the least squares criterion that measures the sum of squared deviations,

$$S(a, b) = \sum e_i^2.$$

In both cases the summation index runs from 1 through $n$ (where no misunderstanding can arise we simply write $\sum$). The second of these functions is by far the most frequently used. The reason is that its minimization is much more convenient than that of other functions; we will meet other criterion functions in later chapters. In this chapter we restrict our attention to the minimization of $S(a, b)$. This method is also called ordinary least squares (abbreviated as OLS).

Computation of the least squares estimates a and b

By substituting (2.2) in the least squares criterion we obtain

$$S(a, b) = \sum (y_i - a - bx_i)^2. \quad (2.3)$$

Here the $n$ observations $(x_i, y_i)$ are given, and we minimize the function $S(a, b)$ with respect to $a$ and $b$. The first order conditions for a minimum are given by

$$\frac{\partial S}{\partial a} = -2 \sum (y_i - a - bx_i) = 0, \quad (2.4)$$

$$\frac{\partial S}{\partial b} = -2 \sum x_i (y_i - a - bx_i) = 0. \quad (2.5)$$

From the condition in (2.4) we obtain, after dividing by $2n$, that

$$a = \bar{y} - b\bar{x} \quad (2.6)$$

where $\bar{y} = \sum y_i / n$ and $\bar{x} = \sum x_i / n$ denote the sample means of the variables $y$ and $x$ respectively. Because $\sum \bar{x}(y_i - a - bx_i) = \bar{x} \sum (y_i - a - bx_i) = 0$ according to (2.4), we can rewrite (2.5) as

$$\sum (x_i - \bar{x})(y_i - a - bx_i) = 0. \quad (2.7)$$

Now we substitute (2.6) in (2.7) and solve this expression for $b$, so that

$$b = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}. \quad (2.8)$$

To check whether the values of $a$ and $b$ in (2.6) and (2.8) indeed provide the minimum of $S(a, b)$, it suffices to check whether the Hessian matrix is positive definite. From (2.4) and (2.5) the Hessian matrix is obtained as

$$\begin{pmatrix} \partial^2 S / \partial a^2 & \partial^2 S / \partial a \partial b \\ \partial^2 S / \partial b \partial a & \partial^2 S / \partial b^2 \end{pmatrix} = 2 \begin{pmatrix} n & \sum x_i \\ \sum x_i & \sum x_i^2 \end{pmatrix}.$$

This matrix is positive definite if $n > 0$ and the determinant $4n \sum x_i^2 - 4 \left( \sum x_i \right)^2 > 0$ — that is,

$$\sum (x_i - \bar{x})^2 > 0.$$

The condition that $n > 0$ is evident, and the condition that $\sum (x_i - \bar{x})^2 > 0$ means that there should be some variation in the explanatory variable $x$. If this condition does not hold true,

then all the points in the scatter diagram are situated on a vertical line. Of course, it makes little sense to try to explain variations in $y$ by variations in $x$ if $x$ does not vary in the sample.

Normal equations

We can rewrite the two first order conditions in (2.4) and (2.5) as

$$\sum y_i = an + b \sum x_i, \quad (2.9)$$

$$\sum x_i y_i = a \sum x_i + b \sum x_i^2. \quad (2.10)$$

These are called the normal equations. The expressions in (2.6) and (2.8) show that the least squares estimates depend solely on the first and second (non-centred) sample moments of the data.

Remark on notation

Now that we have completed our minimization procedure, we make a remark on the notation. When minimizing (2.3) we have treated the $x_i$ and $y_i$ as fixed numbers and $a$ and $b$ as independent variables that could be chosen freely. After completing the minimization procedure, we have found specific values of $a$ and $b$ given by (2.6) and (2.8). Strict mathematicians would stress the difference by using different symbols, but from now on we no longer need $a$ and $b$ as independent variables, and for convenience we will use the notation $a$ and $b$ only for the expressions in (2.6) and (2.8). That is, from now on $a$ and $b$ are uniquely defined as the numbers that can be computed from the observed data $(x_i, y_i)$ by means of these two formulas.

E Exercises: T: 2.1a, b, c; E: 2.12a, b.

2.1.3 Residuals and R²

Least squares residuals

Given the observations $(x_1, y_1), \cdots, (x_n, y_n)$, and the corresponding unique values of $a$ and $b$ given by (2.6) and (2.8), we obtain the residuals

$$e_i = y_i - a - bx_i.$$

Because $a$ and $b$ satisfy the first order conditions (2.4) and (2.5), we find two properties of these residuals:

$$\sum e_i = 0, \qquad \sum (x_i - \bar{x}) e_i = 0. \quad (2.11)$$

So (in the language of descriptive statistics) the residuals have zero mean and they are uncorrelated with the explanatory variable.
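As an illustration, the estimates (2.6) and (2.8) and the residual properties (2.11) can be computed directly; the following Python sketch uses purely illustrative numbers.

```python
import numpy as np

def ols_line(x, y):
    """Least squares estimates of intercept a and slope b in y = a + b*x,
    following (2.6) and (2.8)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    a = y.mean() - b * x.mean()
    return a, b

# Illustrative numbers only; the residuals satisfy the properties in (2.11).
x = np.array([0.85, 0.90, 0.95, 1.00, 1.05])
y = np.array([150.0, 130.0, 125.0, 100.0, 95.0])
a, b = ols_line(x, y)
e = y - a - b * x
print(a, b, e.sum(), ((x - x.mean()) * e).sum())   # both sums are ~ 0
```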

We can rewrite (2.2) as $y_i = a + bx_i + e_i$, and we obtain from (2.6) that

$$y_i - \bar{y} = b(x_i - \bar{x}) + e_i.$$

So the difference from the mean $(y_i - \bar{y})$ can be decomposed as a sum of two components: a component corresponding to the difference from the mean of the explanatory variable $(x_i - \bar{x})$ and an unexplained component described by the residual $e_i$.

Three sums of squares

A traditional way to measure the performance of least squares is to compare the sum of squared residuals with the sum of squares of $(y_i - \bar{y})$. The sum of squares of $(y_i - \bar{y})$ also consists of two components,

$$\sum (y_i - \bar{y})^2 = b^2 \sum (x_i - \bar{x})^2 + \sum e_i^2, \quad (2.12)$$

$$SST = SSE + SSR. \quad (2.13)$$

Note that the cross product term $\sum (x_i - \bar{x}) e_i$ vanishes as a consequence of (2.11). Here SST is called the total sum of squares, SSE the explained sum of squares, and SSR the sum of squared residuals.

Coefficient of determination: R²

The above three sums of squares depend on the scale of measurement of the variable $y$. To get a performance measure that is independent of scale we divide through by SST. The coefficient of determination, denoted by the symbol $R^2$, is defined as the relative explained sum of squares

$$R^2 = \frac{SSE}{SST} = \frac{b^2 \sum (x_i - \bar{x})^2}{\sum (y_i - \bar{y})^2}. \quad (2.14)$$

By substituting (2.8) in (2.14) we obtain

$$R^2 = \frac{\left( \sum (x_i - \bar{x})(y_i - \bar{y}) \right)^2}{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}.$$

So $R^2$ is equal to the square of the correlation coefficient between $x$ and $y$.
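A short sketch of the decomposition (2.12)–(2.13) and the resulting $R^2$; again the numbers are illustrative only.

```python
import numpy as np

def sums_of_squares(x, y, a, b):
    """Return SST, SSE, SSR, and R2 as in (2.12)-(2.15)."""
    e = y - a - b * x
    sst = np.sum((y - y.mean()) ** 2)            # total sum of squares
    sse = b ** 2 * np.sum((x - x.mean()) ** 2)   # explained sum of squares
    ssr = np.sum(e ** 2)                         # sum of squared residuals
    return sst, sse, ssr, 1.0 - ssr / sst

# Illustrative check of the decomposition SST = SSE + SSR:
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()
sst, sse, ssr, r2 = sums_of_squares(x, y, a, b)
print(sst, sse + ssr, r2)    # the first two numbers coincide
```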

By using (2.11) and (2.12), it follows that (2.14) can be rewritten as

$$R^2 = 1 - \frac{\sum e_i^2}{\sum (y_i - \bar{y})^2}. \quad (2.15)$$

The expressions (2.14) and (2.15) show that $0 \leq R^2 \leq 1$ and that the least squares criterion is equivalent to the maximization of $R^2$.

R² in model without intercept

Until now we have assumed that an intercept term (the coefficient $a$) is included in the model. If the model does not contain an intercept — that is, if the fitted line is of the form $y = bx$ — then $R^2$ is still defined as in (2.14). However, the results in (2.11), (2.12), and (2.15) then no longer hold true (see Exercise 2.12).

E Exercises: T: 2.2; E: 2.10c, 2.11.

2.1.4 Illustration: Bank Wages

We illustrate the results in Sections 2.1.2 and 2.1.3 with the data on bank wages discussed before in Example 2.2. We will discuss (i) the precision of reported results in this book, (ii) the least squares estimates, (iii) the sums of squares and $R^2$, and (iv) the outcome of a regression package (we used the package EViews).

(i) Precision of reported results

In all our examples, we report intermediary and final results with a much lower precision than the software packages used to compute the outcomes. Therefore, to check the outcomes, the reader should also use a software package and should not work with our intermediary outcomes, which involve rounding errors. For readers who want to check the numerical outcomes, we report the following sample statistics for the $n = 474$ observations of the variables $x$ (education) and $y$ (natural logarithm of salary):

$$\sum x_i = 6395, \quad \sum y_i = 4909, \quad \sum x_i^2 = 90215, \quad \sum y_i^2 = 50917, \quad \sum x_i y_i = 66609.$$

(ii) Least squares estimates

To compute the slope (2.8), we use

$$\sum (x_i - \bar{x})(y_i - \bar{y}) = \sum x_i y_i - \frac{1}{n} \sum x_i \sum y_i = 378.9,$$

$$\sum (x_i - \bar{x})^2 = \sum x_i^2 - n \bar{x}^2 = \sum x_i^2 - \frac{1}{n} \left( \sum x_i \right)^2 = 3936.5,$$

so that $b = 0.096$. The formula (2.6) for the intercept gives $a = 9.06$. We leave it as an exercise to check that these values satisfy the normal equations (2.9) and (2.10) (up to rounding errors). The regression line is given by $a + bx = 9.06 + 0.096x$ and is shown in Exhibit 2.5 (a). In the sense of least squares, this line gives an optimal fit to the cloud of points. The histogram of the residuals is shown in Exhibit 2.5 (b).

(iii) Sums of squares and R²

The sums of squares are

$$SST = \sum y_i^2 - \frac{1}{n} \left( \sum y_i \right)^2 = 74.7, \qquad SSE = b^2 \left( \sum x_i^2 - n \bar{x}^2 \right) = 36.3, \qquad SSR = SST - SSE = 38.4,$$

with a corresponding coefficient of determination $R^2 = 0.49$.

Exhibit 2.5 Bank Wages (Section 2.1.4): Scatter diagram of salary (in logarithms) against education with least squares line (a) and histogram of the residuals (b).

(iv) Outcome of regression package

The outcome of a regression package is given in Exhibit 2.6. The two values in the column denoted by 'Coefficient' show the values of $a$ (the constant term) and $b$ (the coefficient of the explanatory variable). The table also reports $R^2$, SSR, $\bar{y}$, and $\sqrt{SST/(n-1)}$ (the sample mean and sample standard deviation of the dependent variable). We conclude that there is an indication of a positive effect of education on salary and that around 50 per cent of the variation in (logarithmic) salaries can be attributed to differences in education.

Dependent Variable: LOGSALARY
Method: Least Squares
Sample: 1 474

Variable            Coefficient
C                   9.062102
EDUC                0.095963

R-squared           0.485447    Mean dependent var    10.35679
Sum squared resid   38.42407    S.D. dependent var    0.397334

Exhibit 2.6 Bank Wages (Section 2.1.4): Results of regression of salary (in logarithms) on a constant (denoted by C) and education, based on data of 474 bank employees.

2.2 Accuracy of least squares

2.2.1 Data generating processes

E Uses Appendix A.1.

Economic data are the outcome of economic processes. For instance, the stock market data in Example 2.1 result from developments in the production and value of many firms (in this case firms in the sector of cyclical consumer goods) in the UK, and the sales data in Example 2.3 result from the purchase decisions of many individual buyers in a number of stores in Paris. The reported figures may further depend on the method of measurement. For the stock market data, this depends on the firms that are included in the analysis, and for the sales data this depends on the chosen shops and the periods of measurement. It is common to label the combined economic and measurement process as the data generating process (DGP).

The helpful fiction of a 'true model' in statistical analysis

Before discussing the statistical properties of least squares, we pay attention to the meaning of some of the terminology that is used in this context. This concerns in particular the meaning of 'data generating process' and 'true model'. An econometric model aims to provide a concise and reasonably accurate reflection of the data generating process. By disregarding less relevant aspects of the data, the model helps to obtain a better understanding of the main aspects of the data generating process. This implies that, in practice, an econometric model will never provide a completely accurate description of the data generating process. Therefore, if taken literally, the concept of a 'true model' does not make much practical sense. Still, in discussing statistical properties, we sometimes use the notion of a 'true model'. This reflects an idealized situation that allows us to obtain mathematically exact results. The idea is that similar results hold approximately true if the model is a reasonably accurate approximation of the data generating process.

Simulation as a tool in statistical analysis

The ideal situation of a 'true model' will never hold in practice, but we can imitate this situation by means of computer simulations. In this case the data

are generated by means of a computer program that satisfies the assumptions of the model. Then the model is indeed 'true', as the data generating process satisfies all the model assumptions. For illustrative purposes, we will therefore start not by analysing a set of empirical data, but by generating a set of data ourselves. For that purpose we shall write a small computer program in which we shall carry out a number of steps.

Example of a simulation experiment

We start by choosing a value for the number $n$ of observations — for instance, $n = 20$. Then we fix $n$ numbers for the explanatory variable $x$ — for instance, $x_1 = 1$, $x_2 = 2$, $\cdots$, $x_n = n$. We choose a constant term $\alpha$ — say, $\alpha = 10$ — and a slope coefficient $\beta$ — say, $\beta = 1$. Finally we choose a value $\sigma^2$ for the variance of the disturbance or error term — for instance, $\sigma^2 = 25$. Then we generate $n$ random disturbances $\varepsilon_1, \cdots, \varepsilon_n$. For this purpose we use a generator of normally distributed random numbers; many computer packages contain such a generator. As the computer usually generates random numbers with zero mean and unit variance, we have to multiply them by $\sigma = 5$ to obtain disturbances with variance $\sigma^2 = 25$. Finally we generate values for the dependent variable according to

$$y_i = \alpha + \beta x_i + \varepsilon_i \quad (i = 1, \cdots, n). \quad (2.16)$$

The role of the disturbances is to ensure that our data points are scattered around the line $\alpha + \beta x$ instead of lying exactly on this line. In practice, simple relations like $y_i = \alpha + \beta x_i$ will not hold exactly true for the observed data, and the disturbances $\varepsilon_i$ summarize the effect of all the other variables (apart from $x_i$) on $y_i$. This completes our data generating process (DGP).

Use of simulated data in statistical analysis

Now consider the situation of an econometrician whose only information consists of a data set $(x_i, y_i)$, $i = 1, \cdots, n$, which is generated by this process, but who does not know the underlying values of $\alpha$, $\beta$, $\sigma$, and $\varepsilon_i$. If this econometrician now applies the formulas of Section 2.1.2 to this data set to compute $a$ and $b$, we can interpret them as estimates of $\alpha$ and $\beta$. The observed data are partly random because of the effects of the disturbance terms $\varepsilon_i$ in (2.16). Therefore the outcomes of $a$ and $b$ are random, and in general $a \neq \alpha$ and $b \neq \beta$. The estimates are accurate if they do not differ much from the values of $\alpha$ and $\beta$ of the DGP, which are known to us. So this experiment is useful for assessing the accuracy of the method of least squares.
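The simulation experiment described above can be written down compactly; the following Python sketch is one possible implementation of the DGP (2.16) with $n = 20$, $\alpha = 10$, $\beta = 1$, and $\sigma^2 = 25$ (the seed is arbitrary).

```python
import numpy as np

rng = np.random.default_rng(1)               # seed chosen arbitrarily

n, alpha, beta, sigma = 20, 10.0, 1.0, 5.0
x = np.arange(1, n + 1, dtype=float)         # x_i = i, fixed regressors
eps = sigma * rng.standard_normal(n)         # disturbances with variance 25
y = alpha + beta * x + eps                   # the DGP of (2.16)

# Least squares estimates a and b computed from the simulated data:
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()
print(a, b)                                  # random, close to 10 and 1
```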

We can repeat this simulation, say $m$ times. The values of $a$ and $b$ obtained in the $j$th simulation run are denoted by $a_j$ and $b_j$, $j = 1, \cdots, m$. The accuracy of the least squares estimates can then be evaluated in terms of the means $\bar{a} = \frac{1}{m} \sum a_j$ and $\bar{b} = \frac{1}{m} \sum b_j$ and the mean squared errors

$$MSE_a = \frac{1}{m} \sum_{j=1}^{m} (a_j - \alpha)^2, \qquad MSE_b = \frac{1}{m} \sum_{j=1}^{m} (b_j - \beta)^2.$$

Example 2.4: Simulated Regression Data

We will consider outcomes of the above simulation experiment. The data are generated by the equation (2.16) with $n = 20$, $\alpha = 10$, $\beta = 1$, $x_i = i$ for $i = 1, \cdots, 20$, and with $\varepsilon_1, \cdots, \varepsilon_{20}$ a random sample from a normal distribution with mean zero and variance $\sigma^2 = 25$. The results of two simulation runs are shown in Exhibit 2.7. As the two series of disturbance terms are different in the two simulation runs (see (b) and (c)), the values of the dependent variable are also different. As a result, the obtained regression line $a + bx$ is different in the two simulation runs. This is also clear from the two scatter diagrams in Exhibit 2.7 (d) and (e).

Exhibit 2.7 Simulated Regression Data (Example 2.4): Data generated by $y = 10 + x + \varepsilon$ with $\varepsilon \sim N(0, 25)$; shown are two simulations of sample size 20 (a), with the series of disturbances (EPS1 and EPS2 ((b)–(c))) and scatter diagrams (of $y_1$ against $x$ and of $y_2$ against $x$) with fitted regression lines (Y1FIT and Y2FIT ((d)–(e))).

This simulation is repeated $m = 10{,}000$ times. Histograms of the resulting estimates $a$ and $b$ are shown in Exhibit 2.8 (a) and (b). The means of the outcomes are close to the values $\alpha = 10$ and $\beta = 1$ of the DGP. We see that the variation of the outcomes in $b$ (measured by the standard deviation) is much smaller than that of $a$, and the same holds true for the MSE (see (c)).

Exhibit 2.8 Simulated Regression Data (Example 2.4): Histograms of least squares estimates ($a$ in (a) and $b$ in (b)) in 10,000 simulations (the DGP has $\alpha = 10$ and $\beta = 1$), and mean squared errors of these estimates (c).
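The repeated experiment of Example 2.4 can be replicated along the following lines (a sketch, not the program used for Exhibit 2.8; with a different seed the exact outcomes will differ).

```python
import numpy as np

rng = np.random.default_rng(1)
n, alpha, beta, sigma, m = 20, 10.0, 1.0, 5.0, 10_000
x = np.arange(1, n + 1, dtype=float)
sxx = np.sum((x - x.mean()) ** 2)

a_est = np.empty(m)
b_est = np.empty(m)
for j in range(m):
    y = alpha + beta * x + sigma * rng.standard_normal(n)
    b_est[j] = np.sum((x - x.mean()) * (y - y.mean())) / sxx
    a_est[j] = y.mean() - b_est[j] * x.mean()

print(a_est.mean(), b_est.mean())        # means close to alpha and beta
print(((a_est - alpha) ** 2).mean(),     # MSE of a
      ((b_est - beta) ** 2).mean())      # MSE of b, much smaller
```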

2.2.2 Examples of regression models

Notation: we do not know Greek but we can compute Latin

One of the virtues of the computer experiment in the foregoing section is that it helps to explain the usual notation and terminology. When we analyse empirical data we do not know 'true' values of $\alpha$ and $\beta$, but we can compute estimates $a$ and $b$ from the observed data. The formulas for $a$ and $b$ in (2.6) and (2.8) can be used for this purpose. We follow the convention to denote the parameters of the DGP by Greek letters ($\alpha$, $\beta$, $\sigma^2$) and the estimates by Latin letters ($a$, $b$, $s^2$). It is one of the tasks of the econometrician to estimate $\alpha$ and $\beta$ as well as possible. The residuals $e_i$ in (2.2) can be seen as estimates of the disturbances $\varepsilon_i$ in (2.16). This will be made more precise in Section 2.2.4, where we discuss the situation of a single data set, which is the usual situation in practice.

Example 2.5: Stock Market Returns (continued)

A well-known model of financial economics, the capital asset pricing model (CAPM), relates the excess returns $x_i$ (on the market) and $y_i$ (of an individual asset or a portfolio of assets in a sector) by the model (2.16). So the CAPM assumes that the data in Example 2.1 in Section 2.1.1 are generated by the linear model $y_i = \alpha + \beta x_i + \varepsilon_i$ for certain (unknown) values of $\alpha$ and $\beta$. The disturbance terms $\varepsilon_i$ are needed because the linear dependence between the returns is only an approximation. The further analysis of this data set is left as an exercise (see Exercises 2.11 and 2.12).

Example 2.6: Bank Wages (continued)

We consider again Example 2.2 and assume that the model (2.16) applies for the data on logarithmic salary and education. Let $S$ be the salary; then the model states that $y_i = \log(S_i) = \alpha + \beta x_i + \varepsilon_i$. Here $\beta$ can be interpreted as the relative increase in salary due to one year of additional education, which is given by

$$\frac{dS/dx}{S} = \frac{d \log(S)}{dx} = \frac{dy}{dx};$$

in the model (2.16) this derivative is assumed to be constant. A careful inspection of Exhibit 2.2 (c) may cast some doubt on this assumption, but for the time being we will accept it as a working hypothesis. Intuitively speaking, the disturbance terms $\varepsilon_i$ are needed because the linear dependence between the $y_i$ and $x_i$ is only an approximation. For every individual there will be many factors, apart from education, that affect the salary — for instance, the different opportunities and situational characteristics of the individuals — and these factors are not observed in this sample. In the next chapter we will introduce some of these factors explicitly. This data set is further analysed in Example 2.9 (p. 102) and in Example 2.11 (p. 107).

Example 2.7: Coffee Sales (continued)

The data on prices and quantities sold of Example 2.3 were obtained from a controlled marketing experiment in stores in Paris. The scatter diagram in Exhibit 2.3 clearly shows that demand increases if the price is decreased. If this effect is supposed to be proportional to the price decrease, then the demand curve can be described by the model (2.16), with $y_i$ for the quantity sold and with $x_i$ for the price. However, for fixed prices the sales still fluctuate owing to unobserved causes. The variations in sales that are not related to price variations are summarized by the disturbance terms $\varepsilon_i$ in (2.16). The analysis of this data set is left as an exercise (see Exercise 2.10).

2.2.3 Seven assumptions

E Uses Sections 1.2.2, 1.3.2.

The purpose of assumptions: simpler analysis

Data generating experiments as described in Section 2.2.1 are often performed in applied econometrics, in particular in complicated cases where little is known about the accuracy of the estimation procedures used. In the case of the linear model and the method of least squares, however, we can obtain accuracy measures by means of analytical methods. For this purpose, we introduce the following assumptions on the data generating process.

Assumption on the regressors

. Assumption 1: fixed regressors. The $n$ observations on the explanatory variable $x_1, \cdots, x_n$ are fixed numbers. They satisfy $\sum (x_i - \bar{x})^2 > 0$.

This means that the values $x_i$ of the explanatory variable are assumed to be non-random. This describes the situation of controlled experiments — for example, the price reductions in Example 2.3 were determined in a controlled marketing experiment. However, in economics the possibilities for experiments are often quite limited. For instance, the data on salaries and education in Example 2.2 are obtained from a sample of individuals. Here the $x$ variable, education, is not determined by a controlled experiment. It is influenced by many factors — for instance, the different opportunities and situational characteristics of the individuals.

Assumptions on the disturbances

. Assumption 2: random disturbances, zero mean. The $n$ disturbances $\varepsilon_1, \cdots, \varepsilon_n$ are random variables with zero mean, $E[\varepsilon_i] = 0$ ($i = 1, \cdots, n$).

. Assumption 3: homoskedasticity. The variances of the $n$ disturbances $\varepsilon_1, \cdots, \varepsilon_n$ exist and are all equal, $E[\varepsilon_i^2] = \sigma^2$ ($i = 1, \cdots, n$).

When the variances are equal the disturbances are called homoskedastic, and when the variances differ they are called heteroskedastic.

. Assumption 4: no correlation. All pairs of disturbances $(\varepsilon_i, \varepsilon_j)$ are uncorrelated, $E[\varepsilon_i \varepsilon_j] = 0$ ($i, j = 1, \cdots, n$, $i \neq j$).

Assumption 4 is also called the absence of serial correlation across the observations. Assumptions 2–4 concern properties of the disturbance terms. Note that they say nothing about the shape of the distribution, except that extreme distributions (such as the Cauchy distribution) are excluded because it is assumed that the means and variances exist.

Assumptions on model and model parameters

. Assumption 5: constant parameters. The parameters $\alpha$, $\beta$, and $\sigma$ are fixed unknown numbers with $\sigma > 0$.

This means that, although the parameters of the DGP are unknown, we assume that all the $n$ observations are generated with the same values of the parameters.

. Assumption 6: linear model. The data on $y_1, \cdots, y_n$ have been generated by

$$y_i = \alpha + \beta x_i + \varepsilon_i \quad (i = 1, \cdots, n). \quad (2.17)$$

The model is called linear because it postulates that $y_i$ depends in a linear way on the parameters $\alpha$ and $\beta$. Together with Assumptions 1–5 it follows that

$$E[y_i] = \alpha + \beta x_i, \qquad \mathrm{var}(y_i) = \sigma^2, \qquad \mathrm{cov}(y_i, y_j) = 0 \quad (i \neq j).$$

So the observed values of the dependent variable are uncorrelated and have the same variance, but the mean value of $y_i$ varies across the observations and depends on $x_i$.

Assumption on the probability distribution

. Assumption 7: normality. The disturbances $\varepsilon_1, \cdots, \varepsilon_n$ are jointly normally distributed.

Together with Assumptions 2–4, this assumption specifies a precise distribution for the disturbance terms, and it implies that the disturbances are mutually independent.

Interpretation of the simple regression model

Under Assumptions 1–7, the values $y_i$ are normally and independently distributed with varying means $\alpha + \beta x_i$ and constant variance $\sigma^2$. This can be written as

$$y_i \sim \mathrm{NID}(\alpha + \beta x_i, \sigma^2) \quad (i = 1, \cdots, n).$$

If $\beta = 0$, this reduces to the case of random samples from a fixed population described in Chapter 1. The essential characteristic of the current model is that variations in $y_i$ are not seen purely as the effect of randomness, but partly as the effect of variations in the explanatory variable $x_i$.

E Exercises: E: 2.10a, 2.13.

2.2.4 Statistical properties

E Uses Sections 1.2.2, 1.3.2; Appendix A.2.

Derivation: Some helpful notation and results

Using Assumptions 1–6 we now derive some statistical properties of the least squares estimators $a$ and $b$ as defined in (2.6) and (2.8). Several of the results to be given below are proved under Assumptions 1–6, but sometimes we also need Assumption 7. This is the case, for instance, when we test whether the estimated slope parameter $b$ is significantly different from zero, as in Section 2.3. For the derivations it is convenient to express the random part of $a$ and $b$ as explicit functions of the random variables $\varepsilon_i$. For this purpose, first note that

$$\sum (x_i - \bar{x})\bar{y} = \bar{y} \sum (x_i - \bar{x}) = 0, \qquad \sum (x_i - \bar{x})\bar{x} = \bar{x} \sum (x_i - \bar{x}) = 0. \quad (2.18)$$

Using this result, (2.8) can be written as

$$b = \frac{\sum (x_i - \bar{x}) y_i}{\sum (x_i - \bar{x}) x_i}.$$

Because of Assumption 6 we may substitute (2.17) for $y_i$, and by using $\sum (x_i - \bar{x})\alpha = 0$ it follows that

$$b = \beta + \frac{\sum (x_i - \bar{x}) \varepsilon_i}{\sum (x_i - \bar{x}) x_i} = \beta + \sum c_i \varepsilon_i \quad (2.19)$$

where the coefficients $c_i$ are non-random (because of Assumption 1) and given by

$$c_i = \frac{x_i - \bar{x}}{\sum (x_i - \bar{x}) x_i} = \frac{x_i - \bar{x}}{\sum (x_i - \bar{x})^2}. \quad (2.20)$$

To express $a$ in (2.6) in terms of $\varepsilon_i$, note that (2.17) implies that $\bar{y} = \alpha + \beta \bar{x} + \bar{\varepsilon}$ (where $\bar{\varepsilon} = \frac{1}{n} \sum \varepsilon_i$), and it follows from (2.19) that $b\bar{x} = \beta \bar{x} + \bar{x} \sum c_i \varepsilon_i$. This shows that

$$a = \bar{y} - b\bar{x} = \alpha + \sum \left( \frac{1}{n} - \bar{x} c_i \right) \varepsilon_i = \alpha + \sum d_i \varepsilon_i \quad (2.21)$$

where the coefficients $d_i$ are non-random and given by

$$d_i = \frac{1}{n} - \bar{x} c_i = \frac{1}{n} - \frac{\bar{x}(x_i - \bar{x})}{\sum (x_i - \bar{x})^2}. \quad (2.22)$$

From (2.20) and (2.22) we directly obtain the following properties:

$$\sum c_i = 0, \qquad \sum c_i^2 = \frac{1}{\sum (x_i - \bar{x})^2}, \quad (2.23)$$

$$\sum d_i = 1, \qquad \sum d_i^2 = \frac{1}{n} + \frac{\bar{x}^2}{\sum (x_i - \bar{x})^2}. \quad (2.24)$$

Least squares is unbiased

If we use the rules of the calculus of expectations (see Section 1.2.2), it follows from (2.19) that

$$E[b] = E\left[ \beta + \sum c_i \varepsilon_i \right] = \beta + \sum c_i E[\varepsilon_i] = \beta, \quad (2.25)$$

because $\beta$ is non-random (Assumption 5), the $c_i$ are non-random (Assumption 1), and $E[\varepsilon_i] = 0$ (Assumption 2). Summarizing, under Assumptions 1, 2, 5, and 6 the estimator $b$ has expected value $\beta$, and hence $b$ is an unbiased estimator of $\beta$. Under the same assumptions we get from (2.21)

$$E[a] = E\left[ \alpha + \sum d_i \varepsilon_i \right] = \alpha + \sum d_i E[\varepsilon_i] = \alpha, \quad (2.26)$$

so that $a$ is also an unbiased estimator of $\alpha$. So the least squares estimates will, on average, be equal to the correct parameter values.

The variance of least squares estimators

Although the property of being unbiased is nice, it tells us only that the estimators $a$ and $b$ will on average be equal to $\alpha$ and $\beta$. However, in practice we often have only a single data set at our disposal. Then it is important that the deviations $(b - \beta)^2$ and $(a - \alpha)^2$ are expected to be small. We measure the accuracy by the mean squared errors $E[(b - \beta)^2]$ and $E[(a - \alpha)^2]$. As the estimators are unbiased, these MSEs are equal to the variances $\mathrm{var}(b)$ and $\mathrm{var}(a)$ respectively. It follows from (2.19) that

$$\mathrm{var}(b) = \sum \sum c_i c_j E[\varepsilon_i \varepsilon_j],$$

and Assumptions 3 and 4 and (2.23) give

$$\mathrm{var}(b) = \sum c_i^2 \sigma^2 = \frac{\sigma^2}{\sum (x_i - \bar{x})^2}. \quad (2.27)$$

The variance of $a$ follows from (2.21) and (2.24), with result

$$\mathrm{var}(a) = \sum d_i^2 \sigma^2 = \sigma^2 \left( \frac{1}{n} + \frac{\bar{x}^2}{\sum (x_i - \bar{x})^2} \right). \quad (2.28)$$

Graphical illustration

In Exhibit 2.9 we show four scatters generated with simulations of the type described in Section 2.2.1, with different values for the error variance $\sigma^2$ and for the systematic variance $\sum (x_i - \bar{x})^2$. The shapes of the scatters give a good impression of the possibilities to determine the regression line accurately. The best case is small error variance and large systematic variance (shown in (b)), and the worst case is large error variance and small systematic variance (shown in (c)).

Mean and variance of residuals

In a similar way we can derive the mean and variance of the residuals $e_i$, where $e_i = y_i - a - bx_i$. There holds $E[e_i] = 0$ and

$$\mathrm{var}(e_i) = \sigma^2 \left( 1 - \frac{1}{n} - \frac{(x_i - \bar{x})^2}{\sum_j (x_j - \bar{x})^2} \right) \quad (2.29)$$

(see Exercise 2.7). Note that this variance is smaller than the variance $\sigma^2$ of the disturbances $\varepsilon_i$. The reason is that the method of least squares tries to minimize the sum of squares of the residuals. Note also that the difference is small if $n$ and $\sum_j (x_j - \bar{x})^2$ are large. If both $n$ and $\sum_j (x_j - \bar{x})^2$ tend to infinity, then $\mathrm{var}(e_i)$ tends to $\sigma^2$.
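To make the formulas (2.27) and (2.28) concrete, the following sketch evaluates them for the DGP of Example 2.4; the outcomes can be compared with the simulated MSEs in Exhibit 2.8 (c).

```python
import numpy as np

# Theoretical variances (2.27) and (2.28) for the DGP of Example 2.4:
n, sigma2 = 20, 25.0
x = np.arange(1, n + 1, dtype=float)
sxx = np.sum((x - x.mean()) ** 2)      # equals (n**3 - n)/12 = 665 here

var_b = sigma2 / sxx                           # formula (2.27)
var_a = sigma2 * (1.0 / n + x.mean() ** 2 / sxx)   # formula (2.28)
print(var_b, var_a)                    # roughly 0.0376 and 5.39
```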

Exhibit 2.9 Accuracy of least squares: Scatter diagrams of $y$ against $x$ with (a) small error variance and small systematic variance, (b) small error variance and large systematic variance, (c) large error variance and small systematic variance, and (d) large error variance and large systematic variance. The standard deviation of $x$ in (b) and (d) is three times as large as in (a) and (c), and the standard deviation of the error terms in (c) and (d) is three times as large as in (a) and (b).

E Exercises: T: 2.4, 2.5, 2.6.

2.2.5 Efficiency

E Uses Section 1.3.2.

Best linear unbiased estimators (BLUE)

The least squares estimators $a$ and $b$ given in (2.6) and (2.8) are linear expressions in $y_1, \cdots, y_n$. Such estimators are called linear estimators. We have shown that they are unbiased. Now we will show that, under Assumptions 1–6, the estimators $a$ and $b$ are the best linear unbiased estimators (BLUE) — that is, they have the smallest possible variance in the class of all linear unbiased estimators. This is called the Gauss–Markov theorem. Stated otherwise, the least squares estimators are efficient in this respect. Note that the assumption of normality is not needed for this result.

Proof of BLUE

We will prove this result for $b$ (the result for $a$ follows from a more general result treated in Section 3.4). Let $\hat{b}$ be an arbitrary linear estimator of $\beta$. This means that it can be written as $\hat{b} = \sum g_i y_i$ for certain fixed coefficients $g_1, \cdots, g_n$. The least squares estimator can be written as $b = \sum c_i y_i$ with $c_i$ as defined in (2.20). Now define $w_i = g_i - c_i$; then it follows that $g_i = c_i + w_i$ and

$$\hat{b} = \sum (c_i + w_i) y_i = b + \sum w_i y_i. \quad (2.30)$$

Under Assumptions 1–6, the expected value of $\hat{b}$ is equal to

$$E[\hat{b}] = E[b] + \sum w_i E[y_i] = \beta + \alpha \sum w_i + \beta \sum w_i x_i.$$

We require unbiasedness, irrespective of the values taken by $\alpha$ and $\beta$. So the two conditions on $w_1, \cdots, w_n$ are

$$\sum w_i = 0, \qquad \sum w_i x_i = 0. \quad (2.31)$$

It then follows from the assumption of the linear model (2.17) that

$$\sum w_i y_i = \alpha \sum w_i + \beta \sum w_i x_i + \sum w_i \varepsilon_i = \sum w_i \varepsilon_i, \quad (2.32)$$

and from (2.19) that

$$\hat{b} = b + \sum w_i \varepsilon_i = \beta + \sum (c_i + w_i) \varepsilon_i.$$

Because of Assumptions 3 and 4, the variance of $\hat{b}$ is equal to

$$\mathrm{var}(\hat{b}) = \sigma^2 \sum (c_i + w_i)^2. \quad (2.33)$$

Now $(c_i + w_i)^2 = c_i^2 + w_i^2 + 2 c_i w_i$, and the expression (2.20) for $c_i$ together with the properties in (2.31) imply that $\sum c_i w_i = 0$. Therefore (2.33) reduces to

$$\mathrm{var}(\hat{b}) = \mathrm{var}(b) + \sigma^2 \sum w_i^2.$$

Clearly, the variance is minimal if and only if $w_i = 0$ for all $i = 1, \cdots, n$. This means that $\hat{b} = b$, and this proves the Gauss–Markov theorem.

E Exercises: T: 2.3.
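The Gauss–Markov theorem can also be illustrated by simulation. The sketch below compares the least squares estimator $b$ with one particular alternative linear unbiased estimator, the slope through the first and last observations (an estimator chosen here only for illustration).

```python
import numpy as np

rng = np.random.default_rng(7)
n, alpha, beta, sigma, m = 20, 10.0, 1.0, 5.0, 10_000
x = np.arange(1, n + 1, dtype=float)
sxx = np.sum((x - x.mean()) ** 2)

b_ols = np.empty(m)
b_alt = np.empty(m)
for j in range(m):
    y = alpha + beta * x + sigma * rng.standard_normal(n)
    b_ols[j] = np.sum((x - x.mean()) * (y - y.mean())) / sxx
    # Endpoint estimator: linear in y and unbiased, but not least squares.
    b_alt[j] = (y[-1] - y[0]) / (x[-1] - x[0])

print(b_ols.mean(), b_alt.mean())   # both close to beta = 1 (unbiased)
print(b_ols.var(), b_alt.var())     # the OLS variance is the smaller one
```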

2.3 Significance tests

2.3.1 The t-test

E Uses Sections 1.2.3, 1.4.1–1.4.2.

The significance of an estimate

The regression model aims to explain the variation in the dependent variable $y$ in terms of variations in the explanatory variable $x$. This makes sense only if $y$ is related to $x$ — that is, if $\beta \neq 0$ in the model (2.17). In general, even if $\beta = 0$, the least squares estimator $b$ will be different from zero, because the obtained value of $b$ is the outcome of a random variable. So, to decide whether $b$ is significant or not, we have to take the uncertainty of this random variable into account. For instance, if $b$ has standard deviation 100, then an outcome $b = 10$ is not significantly different from zero, and if $b$ has standard deviation 0.01, then an outcome $b = 0.1$ is significantly different from zero. Therefore we scale the outcome of $b$ by its standard deviation.

Derivation of test statistic

To derive a test for the significance of the slope estimate $b$, we will assume that the disturbances $\varepsilon_i$ are normally distributed. So we will make use of Assumptions 1–7 of Section 2.2.3. We want to apply a test for the null hypothesis $\beta = 0$ against the alternative that $\beta \neq 0$. The null hypothesis will be rejected if $b$ differs significantly from zero. To apply the testing approach discussed in Section 1.4, the distribution of $b$ should be known. Since $b - \beta$ is linear in the disturbances (see (2.19)), it is, under Assumptions 1–7, normally distributed with mean zero and with variance given by (2.27). So the standard deviation of $b$ is given by $\sigma_b = \sigma / \sqrt{\sum (x_i - \bar{x})^2}$ and

$$\frac{b - \beta}{\sigma_b} \sim N(0, 1).$$

This expression cannot be used as a test statistic, since $\sigma$ is an unknown parameter. As the residuals $e_i$ are estimates of the disturbances $\varepsilon_i$, this suggests estimating the variance $\sigma^2 = E[\varepsilon_i^2]$ by $\hat{\sigma}^2 = \frac{1}{n} \sum e_i^2$. However, this estimator is biased. It is left as an exercise (see Exercise 2.7) to show that an unbiased estimator is given by

$$s^2 = \frac{1}{n - 2} \sum e_i^2. \quad (2.34)$$

We also refer to Sections 3.1 and 3.4 below, where it is further proved that $\sum e_i^2 / \sigma^2$ follows the $\chi^2(n-2)$ distribution and that $s^2$ and $b$ are independent.

Standard error and t-value

It follows from the above results and by the definition of the t-distribution that

$$t_b = \frac{b - \beta}{s_b} = \frac{(b - \beta)/\sigma_b}{\sqrt{\left( \sum e_i^2 / \sigma^2 \right) / (n - 2)}} \sim t(n - 2), \quad (2.35)$$

where

$$s_b = \frac{s}{\sqrt{\sum (x_i - \bar{x})^2}}. \quad (2.36)$$

That is, $t_b$ follows the Student t-distribution with $n - 2$ degrees of freedom. Here $s_b$ is called the standard error of $b$, and $s$, the square root of (2.34), is called the standard error of the regression. For $\beta = 0$, $t_b = b / s_b$ is called the t-value of $b$. The null hypothesis $H_0: \beta = 0$ is rejected against the alternative $H_1: \beta \neq 0$ if $b$ is too far from zero — that is, if $|t_b| > c$, or equivalently, if $|b| > c s_b$, where $c$ is the critical value of the t-test. Then $b$ is called significant — that is, it differs significantly from zero.

A practical rule of thumb for significance

For a given level of significance, the critical value $c$ is obtained from the $t(n-2)$ distribution. For a 5 per cent significance level, the critical value for $n = 30$ is $c = 2.05$, for $n = 60$ it is 2.00, and for $n \to \infty$ the critical value converges to 1.96. As a rule of thumb (for the popular 5 per cent significance level), one often uses $c = 2$ as an approximation. In this case the estimate $b$ is significant if $|b| > 2 s_b$ — that is, if the outcome is at least twice as large as the uncertainty in this outcome as measured by the standard deviation. That is, an estimated coefficient is significant if its t-value is (in absolute value) larger than 2.

Interval estimates

The foregoing results can also be used to construct interval estimates of $\beta$. Let $c$ be the critical value of the t-test of size $\alpha$, so that $P[|t_b| > c] = \alpha$, where $t_b$ is defined as in (2.35). Then $P[|t_b| \leq c] = 1 - \alpha$, and a $(1 - \alpha)$ interval estimate of $\beta$ is given by all values for which $-c \leq t_b \leq c$ — that is,

$$b - c s_b \leq \beta \leq b + c s_b. \quad (2.37)$$

E Exercises: T: 2.7; E: 2.13, 2.14a–c.
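The quantities $s$, $s_b$, $t_b$, and the interval (2.37) are easily computed from the data; a Python sketch follows (the helper name slope_inference is ours, not the book's).

```python
import numpy as np
from scipy import stats

def slope_inference(x, y, sig_level=0.05):
    """Standard error (2.36), t-value, and (1 - sig_level) interval (2.37)
    for the slope of the least squares line."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    sxx = np.sum((x - x.mean()) ** 2)
    b = np.sum((x - x.mean()) * (y - y.mean())) / sxx
    a = y.mean() - b * x.mean()
    e = y - a - b * x
    s2 = np.sum(e ** 2) / (n - 2)          # unbiased estimator (2.34)
    s_b = np.sqrt(s2 / sxx)                # standard error of b (2.36)
    t_b = b / s_b                          # t-value for H0: beta = 0
    c = stats.t.ppf(1 - sig_level / 2, df=n - 2)   # critical value
    return b, s_b, t_b, (b - c * s_b, b + c * s_b)

# Example on simulated data from the DGP of Section 2.2.1:
rng = np.random.default_rng(3)
x = np.arange(1.0, 21.0)
y = 10 + x + 5 * rng.standard_normal(20)
print(slope_inference(x, y))
```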

2.3.2 Examples

Example 2.8: Simulated Regression Data (continued)

First we consider the situation of simulated data with a known DGP. For this purpose we consider again the 10,000 simulated data sets of Example 2.4 (with slope $\beta = 1$ and variance $\sigma^2 = 25$), with data generating process $y_i = 10 + x_i + \varepsilon_i$, where $x_i = i$, $i = 1, \cdots, 20$, and where the $\varepsilon_i$ are NID(0, 25). So this DGP has slope parameter $\beta = 1$. The histograms and some summary statistics of the resulting 10,000 values of $b$, $s^2$, and $t_b$ are given in Exhibit 2.10 (a–d). For comparison, (d) also contains some properties of the corresponding theoretical distributions.

Exhibit 2.10 Simulated Regression Data (Example 2.8): Histograms of least squares estimates of the slope ($b$, denoted by B in (a)), of the t-value ($t_b$, denoted by TSTAT_B in (b)), and of the variance ($s^2$, denoted by S2 in (c)) resulting from 10,000 simulations of the data generating process in Example 2.4; theoretical means and standard deviations of $b$ and $s^2$ and theoretical and sample correlations between $b$ and $s^2$ (d); and scatter diagram of $s^2$ against $b$ (e).

The histogram of $b$ is in accordance with the normal distribution of this least squares estimator, and the histogram of $s^2$ is in accordance with the (scaled) $\chi^2$ distribution. The scatter diagram of $s^2$ against $b$ (shown in (e)) illustrates the independence of these two random variables; their sample correlation over the 10,000 simulation runs is less than 1 per cent. The t-statistic for the null hypothesis $H_0: \beta = 0$ does not follow the t-distribution, as this hypothesis is not correct ($\beta = 1$). In the great majority of cases the null hypothesis is rejected; only in nineteen cases is it not rejected (using the 5 per cent critical value $c = 2.1$ for the t(18) distribution). This indicates a high power (of around 99.8 per cent) of the t-test in this example. This simulation illustrates the distribution properties of $b$, $s^2$, and $t_b$. In this example, the t-test is very successful in detecting a significant effect of the variable $x$ on the variable $y$.

Example 2.9: Bank Wages (continued)

Next we consider a real data set — namely, on bank wages. For the salary and education data of bank employees discussed before in Example 2.2, the variance of the disturbance terms is estimated by $s^2 = SSR/(n-2) = 38.424/472 = 0.0814$, and the standard error of the regression is $s = 0.285$. Using the results in Section 2.1.4, the standard error of $b$ is $s_b = 0.285/\sqrt{3937} = 0.00455$, and the t-value is $t_b = 0.096/0.00455 = 21.1$. To perform a 5 per cent significance test of $H_0: \beta = 0$ against $H_1: \beta \neq 0$, the (two-sided) critical value of the t(472) distribution is given by 1.96, so that the null hypothesis is clearly rejected. This means that education has a very significant effect on wages.

These outcomes are also given in Exhibit 2.11, together with the P-value for the test of $H_0: \beta = 0$ against $H_1: \beta \neq 0$. The P-value is reported as 0.0000, which actually means that it is smaller than 0.00005. Note that this P-value is not exactly zero, as even for $\beta = 0$ the probability of getting t-values larger than 21.1 is non-zero. For such a low P-value as in this example we will always reject the null hypothesis; the null hypothesis that $\beta = 0$ is rejected for all sizes $\alpha > 0.00005$.

The regression results are often presented in the following way:

$$y = 9.06 + 0.096x + e,$$
$$\;\;\;\;\;(144)\;\;\; (21.1)$$

where the numbers in parentheses denote the t-values and $e$ denotes the residuals of the regression (the equation without $e$ is not valid, as the data do not lie exactly on the estimated line).
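Output like Exhibit 2.11 (shown below) can be produced with any regression package. As an illustration, the following Python sketch uses statsmodels; since the bank wage data file is not reproduced here, it generates stand-in data with roughly the same design, so the printed numbers will only be close to, not equal to, those in the exhibit.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Stand-in data with roughly the same design as the bank wage example;
# replace educ and logsalary by the actual 474 observations to reproduce
# Exhibit 2.11.
educ = rng.integers(8, 22, size=474).astype(float)
logsalary = 9.06 + 0.096 * educ + 0.285 * rng.standard_normal(474)

X = sm.add_constant(educ)            # regressor matrix with intercept C
results = sm.OLS(logsalary, X).fit()
print(results.summary())             # coefficients, standard errors,
                                     # t-statistics, and P-values
```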

0. In other words. in rows 3 and 5 the DGP does not satisfy Assumption 7 (normality).675 0.004548 21.397334 Results of regression of salary (in logarithms) on a constant (denoted by C) and education.2.0000 10. and p ¼ 0:975.677 0. ﬁxed values of the x variable and normally distributed disturbances) were introduced in order to simplify the proofs.289 1.651 1. The exhibit shows quantiles for p ¼ 0:75.9) Prob.678 0. Error t-Statistic C 9.290 1.900 0.972 1. the quantile function is the inverse of the cumulative distribution function. If a random variable y has a strictly monotone cumulative distribution function F(v) ¼ P[y v].975 1. 2.653 1.3 Use under less strict conditions Weaker assumptions on the DGP The rather strict conditions of Assumptions 1–7 (in particular.750 0. dependent var S. The last quantile corresponds to the critical value for a two-sided test with signiﬁcance level 5 per cent.653 1.3.285 1.950 1.980 1. The exhibit reports some quantiles.982 Exhibit 2. p ¼ 0:95. of regression 0.485447 Mean dependent var Sum squared resid 38.287 0.11 Bank Wages (Example 2.42407 S. In Exhibit 2.062102 0.12 we present the results of a number of simulation experiments where the conditions on both the explanatory variable and the disturbances have been varied. Fortunately.3 Significance tests 103 Dependent Variable: LOGSALARY Method: Least Squares Sample: 1 474 Variable Coefﬁcient Std.676 0.E.10214 R-squared 0.285319 Exhibit 2.679 1.986 1.650 0. .062738 144.4446 EDUC 0.656 1.095963 0.D. Row 1 2 3 4 5 x Fixed Fixed Fixed Normal Normal e Normal Normal Logistic Normal Logistic Result Exact t(198) Simulated Simulated Simulated Simulated Quantiles 0. based on data of 474 bank employees.286 1. p ¼ 0:90.984 1.35679 0. the same results hold approximately true under more general conditions.0000 0.12 Quantiles of distributions of t-statistics Rows 1 and 2 correspond to the standard model that satisﬁes Assumptions 1–7. and in rows 4 and 5 the DGP does not satisfy Assumption 1 (ﬁxed regressors). then the quantile q(p) is deﬁned by the condition that F(q(p) ) ¼ p.

104 2 Simple Regression

Discussion of simulation results

Rows 1 and 2 of the table give the results for the classical linear model, where the x values are fixed and the disturbances are independently, identically normally distributed. The first row gives the exact results corresponding to the t(198) distribution. The second row gives the results from a simulation experiment where 50,000 samples were drawn, each of n = 200 observations. The remaining rows give the results of further simulation experiments (each consisting of 50,000 simulation runs) under different conditions.

In row 3 the disturbances are drawn from a logistic distribution with density function f(x) = eˣ/(1 + eˣ)² and with cumulative distribution function F(x) = 1/(1 + e⁻ˣ). This density is bell-shaped, but the tails are somewhat fatter than those of the normal density. In rows 4 and 5 the values of the x variable are no longer kept fixed along the different simulation runs; instead they are drawn from a normal distribution, independently of the disturbances. Under the assumptions of this simulation example this still gives reliable results. To enhance the comparability of the results, the same x values were used in rows 4 and 5, and, likewise, the same disturbances were used in rows 2 and 4, and in rows 3 and 5.

Conclusion

When we compare the quantiles, we see that the differences between the rows are very small. This illustrates that we may apply the formulas derived under the assumptions of the linear model also in cases where the assumptions of fixed regressors or normal disturbances are not satisfied.
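A small-scale version of this experiment can be coded directly. The sketch below (our own illustration; it uses 10,000 instead of 50,000 replications to keep the running time short, and the intercept and slope of the DGP are hypothetical choices) simulates the t-statistic of the slope in samples of n = 200 observations with fixed regressors and logistic disturbances, and compares its quantiles with those of the exact t(198) distribution.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    n, n_sim = 200, 10_000
    x = rng.normal(size=n)            # drawn once and kept fixed over all runs
    sxx = np.sum((x - x.mean()) ** 2)

    t_stats = np.empty(n_sim)
    for r in range(n_sim):
        eps = rng.logistic(size=n)    # bell-shaped, but fatter tails than the normal
        y = 1.0 + 0.5 * x + eps       # hypothetical intercept and slope
        b = np.sum((x - x.mean()) * (y - y.mean())) / sxx
        a = y.mean() - b * x.mean()
        s2 = np.sum((y - a - b * x) ** 2) / (n - 2)
        t_stats[r] = (b - 0.5) / np.sqrt(s2 / sxx)   # t-statistic around the true slope

    for p in (0.75, 0.90, 0.95, 0.975):
        print(p, np.quantile(t_stats, p).round(3), stats.t.ppf(p, df=n - 2).round(3))

As in Exhibit 2.12, the simulated quantiles come out very close to the exact t(198) quantiles.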

2.4 Prediction 105

2.4 Prediction

2.4.1 Point predictions and prediction intervals

Point prediction

We consider the use of an estimated regression model for the prediction of the outcome of the dependent variable y for a given value of the explanatory variable x. The regression line a + bx can be interpreted as the prediction of the y-value for a given x-value. The least squares residuals e_i = y_i − a − bx_i correspond to the deviations of y_i from the fitted values a + bx_i, i = 1, ..., n, and s² indicates the average accuracy of these predictions. Now assume that we want to predict the outcome y_{n+1} for a given new value x_{n+1}. An obvious prediction is given by a + b x_{n+1}. This is called a point prediction.

Prediction error and variance

In order to say something about the accuracy of this prediction we need to make assumptions about the mechanism generating the value of y_{n+1}. We suppose that Assumptions 1–6 hold true for i = 1, ..., n, n + 1. If at a later point of time we observe y_{n+1}, we can evaluate the quality of our prediction by computing the prediction error

f = y_{n+1} − a − b x_{n+1}.    (2.38)

If y_{n+1} is unknown, we can get an idea of the prediction accuracy by deriving the mean and variance of the prediction error. Under Assumptions 1–6, the mean is E[f] = 0, so that the prediction is unbiased, and the variance is given by

var(f) = σ² (1 + 1/n + (x_{n+1} − x̄)² / Σ(x_i − x̄)²).    (2.39)

Here the average x̄ and the summation refer to the estimation sample i = 1, ..., n. The proofs are left as an exercise (see Exercise 2.8). Note that the variance of the prediction error is larger than the variance σ² of the disturbances. The extra terms are due to the fact that a and b are used rather

106 2 Simple Regression

than the true parameters α and β. By using the expression a = ȳ − b x̄, the prediction error (2.38) can be written as

f = (y_{n+1} − ȳ) − b(x_{n+1} − x̄).

It is also seen that the variance of the prediction error reaches its minimum for x_{n+1} = x̄ and that the prediction errors tend to be larger for values of x_{n+1} that are further away from x̄. So uncertainty about the slope b of the regression line leads to larger forecast uncertainty when x_{n+1} is further away from x̄. This is illustrated in Exhibit 2.13.

[Exhibit 2.13 Prediction error: the figure shows two lines y = a_L + b_L x and y = a_U + b_U x and the forecast intervals f1 and f2 at the values x1 and x2 of the explanatory variable. Uncertainty in the slope of the regression line (indicated by the lower value b_L and the upper value b_U of an interval estimate of the slope) results in larger forecast uncertainty for values of the explanatory variable that are further away from the sample mean (the forecast interval f2 corresponding to x2 is larger than the interval f1 corresponding to x1).]

Prediction interval

The above results can also be used to construct prediction intervals. If Assumptions 1–7 hold true for i = 1, ..., n, n + 1, then the prediction error f is normally distributed and independent of s², where s² is based on the first n observations. Let

s_f² = s² (1 + 1/n + (x_{n+1} − x̄)² / Σ_{i=1}^{n} (x_i − x̄)²);

then it follows that f/s_f ~ t(n − 2). So a (1 − a) prediction interval for y_{n+1} is given by

(a + b x_{n+1} − c s_f,  a + b x_{n+1} + c s_f),

where c is such that P[|t| > c] = a when t ~ t(n − 2).
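For concreteness, the following sketch (our own illustration, not taken from the book) computes a point prediction and a 95 per cent prediction interval by applying (2.38) and (2.39).

    import numpy as np
    from scipy import stats

    def prediction_interval(x, y, x_new, level=0.95):
        # point prediction and prediction interval for y_new in y = alpha + beta*x + eps
        n = len(x)
        sxx = np.sum((x - x.mean()) ** 2)
        b = np.sum((x - x.mean()) * (y - y.mean())) / sxx
        a = y.mean() - b * x.mean()
        s2 = np.sum((y - a - b * x) ** 2) / (n - 2)
        s_f = np.sqrt(s2 * (1 + 1 / n + (x_new - x.mean()) ** 2 / sxx))  # see (2.39)
        c = stats.t.ppf(1 - (1 - level) / 2, df=n - 2)
        y_hat = a + b * x_new                                            # point prediction
        return y_hat, (y_hat - c * s_f, y_hat + c * s_f)

    # illustration with data simulated from the DGP of Example 2.4
    rng = np.random.default_rng(2)
    x = np.arange(1.0, 21.0)
    y = 10 + x + rng.normal(0, 5, size=20)
    print(prediction_interval(x, y, x_new=10.0))
    print(prediction_interval(x, y, x_new=40.0))   # wider: x_new far from the mean

The second interval is much wider, illustrating that forecast uncertainty grows when x_{n+1} lies further away from x̄.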

2.4 Prediction 107

Conditional prediction

In the foregoing results the value of x_{n+1} should be known. Therefore this is called conditional prediction, in contrast to unconditional prediction, where the value of x_{n+1} is unknown and should also be predicted. Since our model does not contain a mechanism to predict x_{n+1}, this would require additional assumptions on the way the x-values are generated.

E Exercises: T: 2.8; E: 2.11, 2.15.

2.4.2 Examples

Example 2.10: Simulated Regression Data (continued)

Consider once more the 10,000 simulated data sets of Example 2.4. We consider two situations: one where the new value x21 = 10 is in the middle of the sample of previous x-values (that range between 1 and 20), and another where x21 = 40 lies outside this range. For both cases we generate 10,000 predictions a + b·x21, each prediction corresponding to the values of (a, b) obtained for one of the 10,000 simulated data sets. Further we also generate in both cases 10,000 new values of y21 = 10 + x21 + ε21 by random drawings ε21 from the N(0, 25) distribution. Exhibit 2.14 shows histograms and summary statistics of the resulting two sets of 10,000 predictions (in (a) and (c)) and of the prediction errors f21 = y21 − (a + b·x21) (in (b), (d), and (e)). Clearly, for x21 = 10 the predictions and forecast errors have a smaller standard deviation than for x21 = 40, as would be expected because of (2.39).

Example 2.11: Bank Wages (continued)   (Data file XM202BWA)

We consider again the salary and education data of bank employees. We will discuss (i) the splitting of the sample in two sub-samples, (ii) the forecasts, and (iii) the interpretation of the forecast results.

(i) Splitting of the sample in two sub-samples

To illustrate the idea of prediction we split the data set up in two parts. The first part (used in estimation) consists of 424 individuals with sixteen years of education or less; the second part (used in prediction) consists of the remaining 50 individuals with seventeen years of education or more. In this way we can investigate whether the effect of education on salary is the same for lower and higher levels of education.

108 2 Simple Regression

[Exhibit 2.14 Simulated Regression Data (Example 2.10): forecasted values of y ((a) and (c), series PRED_Y10 and PRED_Y40) and forecast errors f ((b) and (d), series F10 and F40) in 10,000 simulations from the data generating process of Example 2.4 for two values of x — that is, x = 10 ((a)–(b)) and x = 40 ((c)–(d)) — together with theoretical expected values and standard deviations of the forecast errors (denoted by F10 for x = 10 and F40 for x = 40) in (e). The histogram summary statistics are omitted here.]

(ii) Forecasts

The results of the regression over the first group of individuals are shown in Exhibit 2.15 (a). The estimated intercept is a = 9.39 and the estimated slope is b = 0.0684. With this model the salary of an individual in the second group with an education of x years is predicted by a + bx = 9.39 + 0.0684x.

(iii) Interpretation of forecast results

We mention the following facts. The average squared prediction error (for the fifty highly educated employees) is equal to Σ_{i=425}^{474} f_i²/50 = 0.268. This is larger than the average squared residual Σ_{i=425}^{474} e_i²/50 = 0.142 if the estimates a = 9.06 and b = 0.0960 are used that were obtained from a regression over the full sample earlier in this chapter. Moreover, the average squared prediction error is also larger than what would be expected on the basis of (2.39), which is based on Assumptions 1–7 for the DGP.
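The out-of-sample comparison in (iii) can be replicated along the following lines. This is a sketch with our own function and variable names; the data are simulated here, but with the actual bank wage data, split at sixteen years of education, the same computation gives the value 0.268 reported above.

    import numpy as np

    def avg_squared_prediction_error(x_est, y_est, x_new, y_new):
        # fit y = a + b*x on the estimation sample and return the mean
        # squared prediction error over the new observations
        b = (np.sum((x_est - x_est.mean()) * (y_est - y_est.mean()))
             / np.sum((x_est - x_est.mean()) ** 2))
        a = y_est.mean() - b * x_est.mean()
        f = y_new - (a + b * x_new)          # prediction errors
        return np.mean(f ** 2)

    rng = np.random.default_rng(3)
    x = rng.uniform(8, 21, size=474)         # hypothetical education levels
    y = 9.4 + 0.07 * x + rng.normal(0, 0.26, size=474)
    low, high = x <= 16, x > 16
    print(avg_squared_prediction_error(x[low], y[low], x[high], y[high]))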

2.4 Prediction 109

If we average the expression (2.39) over the fifty values of education (x) in the second group, with the estimated variance s² = (0.262)² = 0.0688 obtained from the regression over the 424 individuals with sixteen years of education or less in Exhibit 2.15 (a), then this gives the value 0.139.

(a) Dependent Variable: LOGSALARY
Method: Least Squares
Sample: 1 424 (individuals with at most 16 years of education)

Variable   Coefficient   Std. Error   t-Statistic   Prob.
C          9.387947      0.068722     136.6081      0.0000
EDUC       0.068414      0.005233     13.07446      0.0000

R-squared            0.288294    Mean dependent var   10.27088
S.E. of regression   0.262272    S.D. dependent var   0.310519
Sum squared resid    29.02805

[Panels (b)–(d) are scatter diagrams of LOGSALARY against EDUC: for all 474 employees (b); for the 424 employees with at most sixteen years of education, with the flat regression line of (a) (c); and for all 474 employees together with the forecasts (FORECAST vs. EDUC), with a steeper regression line (d).]

Exhibit 2.15 Bank Wages (Example 2.11)
Result of regression of salary (in logarithms) on a constant and education for 424 bank employees with at most sixteen years of education (a) and three scatter diagrams: one for all 474 employees (b), one for 424 employees with at most sixteen years of education (c), and one for all 474 employees together with the predicted values of employees with at least seventeen years of education ((d), with predictions based on the regression in (a)).

As the actual squared prediction errors are on average nearly twice as large (0.268 instead of 0.139), this may cast

110 2 Simple Regression

some doubt on the working hypothesis that Assumptions 1–7 hold true for the full data set of 474 persons. Exhibit 2.15 (d) shows that the actual salaries of these highly educated persons are systematically higher than predicted. It seems that the returns on education are larger for higher-educated employees than for lower-educated employees. We will return to this question in Section 5.1.

Summary, further reading, and keywords 111

Summary, further reading, and keywords

SUMMARY

In this chapter we considered the simple regression model, where variations in the dependent variable are explained in terms of variations of the explanatory variable. The method of least squares can be used to estimate the parameters of this model. The statistical properties of these estimators were derived under a number of assumptions on the data generating process. Further we described methods to construct point predictions and prediction intervals. The ideas presented in this chapter form the basis for many other types of econometric models. In Chapter 3 we consider models with more than one explanatory variable, and later chapters contain further extensions that are often needed in practice.

FURTHER READING

Most of the textbooks on statistics mentioned in Section 1.5 contain chapters on regression. Econometric textbooks go beyond the simple regression model. In the following chapters we make intensive use of matrix algebra; references to textbooks that also follow this approach are given in the Further Reading of Chapter 3 (p. 178). We now mention some econometric textbooks that do not use matrix algebra.

Gujarati, D. N. (2003). Basic Econometrics. Boston: McGraw-Hill.
Hill, R. C., Griffiths, W. E., and Judge, G. G. (2001). Undergraduate Econometrics. New York: Wiley.
Kennedy, P. (1998). A Guide to Econometrics. Oxford: Blackwell.
Maddala, G. S. (2001). Introduction to Econometrics. London: Prentice Hall.
Pindyck, R. S., and Rubinfeld, D. L. (1998). Econometric Models and Economic Forecasts. Boston: McGraw-Hill.
Thomas, R. L. (1997). Modern Econometrics. Harlow: Addison-Wesley.
Wooldridge, J. M. (2000). Introductory Econometrics. Australia: Thomson Learning.

112 2 Simple Regression

KEYWORDS

absence of serial correlation 93; best linear unbiased estimator 97; BLUE 97; coefficient of determination 83; conditional prediction 107; controlled experiments 92; covariate 79; data generating process 87; dependent variable 79; DGP 87; disturbance 88; endogenous variable 79; error term 88; exogenous variable 79; explanatory variable 79; fixed regressors 92; heteroskedastic 93; homoskedastic 93; independent variable 79; least squares criterion 80; linear model 93; normal equations 82; normality assumption 93; OLS 80; ordinary least squares 80; parameters 93; point prediction 105; prediction interval 106; R-squared (R²) 83; regressor 79; residuals 82; rounding errors 84; scatter diagram 79; significant 100; standard error of b 100; standard error of the regression 100; t-value of b 100; variable to be explained 79.

Exercises 113

Exercises

THEORY QUESTIONS

2.1 (E Sections 2.1.2, 2.3.1)
Let two data sets (x_i, y_i) and (x_i*, y_i*) be related by x_i* = c1 + c2·x_i and y_i* = c3 + c4·y_i for all i = 1, ..., n. This means that the only differences between the two data sets are the location and the scale of measurement. For example, y_i may be the total variable production costs in dollars of a firm in month i and y_i* the total production costs in millions of dollars; then c3 are the total fixed costs (in millions of dollars) and c4 = 10⁻⁶. Such data transformations are often applied in economic studies.
a. Derive the relation y* = α* + β*x* between y* and x* if y and x would satisfy the linear relation y = α + βx.
b. For arbitrary data (x_i, y_i), derive the relation between the least squares estimators (a, b) for the original data and (a*, b*) for the transformed data.
c. Which of the statistics R², s², s_b, and t_b are invariant with respect to this transformation?
d. Check the results in b and c by considering again the excess returns data of Example 2.1 on stock market returns (data file XR201SMR). Perform two regressions, one with the original data (in percentages) and the other with transformed data with the actual excess returns — that is, with c1 = c3 = 0 and c2 = c4 = 0.01.

2.2 (E Section 2.1.2)
In the regression model the variable y is regressed on the variable x, with resulting regression line a + bx. Reversing the role of the two variables, x can be regressed on y, with resulting regression line c + dy.
a. Derive formulas for the least squares estimates of c and d obtained by regressing x on y.
b. Show that bd = R², where b is the conventional least squares estimator and d the slope estimator in a.
c. Conclude that in general d ≠ 1/b. Explain this in terms of the criterion functions used to obtain b and d.

2.3 (E Section 2.2.3)
Suppose that Assumptions 1–6 are satisfied. We consider two slope estimators, b1 = (y_n − y_1)/(x_n − x_1) and b2 = Σy_i / Σx_i, as alternatives for the least squares estimator b.
a. Investigate whether b1 and b2 are unbiased estimators of β.
b. Determine expressions for the variances of b1 and b2.
c. Show that var(b1) ≥ var(b) and var(b2) ≥ var(b).
d. Check the results in b and c with the excess returns data of Example 2.1 (data file XR201SMR).

2.4 (E Section 2.2.3)
Let Assumption 6 be replaced by the assumption that the data are generated by y_i = βx_i + ε_i, i = 1, ..., n, so that α = 0 is given. We wish to fit a line through the origin by means of least squares — that is, by minimizing Σ(y_i − bx_i)².
a. Adapt Assumptions 1 and 5 for this special case.
b. Prove that the value of b that minimizes this sum of squares is given by b* = Σx_i y_i / Σx_i². Find the mean and variance of this estimator.
c. Investigate whether the estimator b2 of Exercise 2.3 is unbiased now, and show that var(b2) ≥ var(b*).
d. Show, by means of a simulation example, that it may happen that var(b2) < var(b). Is this not in contradiction with the Gauss–Markov theorem?
e. Let R² be defined by R² = b*² Σx_i² / Σy_i². Show that there exist data x_i, y_i, i = 1, ..., n, for which the results in (2.12) and (2.15) no longer hold true.

114 2 Simple Regression

2.5 (E Section 2.2.3)
Sometimes we wish to assign different weights to the observations. This is for instance the case if the observations refer to countries and we want to give larger weights to larger countries. Suppose that Assumptions 1–6 are satisfied. Find the value of b that minimizes Σ w_i e_i², where the weights w_1, ..., w_n are given positive numbers — for instance, the populations or the areas of the countries. Without loss of generality it may be assumed that the weights are scaled so that Σ w_i = 1.

2.6 (E Section 2.2.2)
Suppose that data are generated by a process that satisfies Assumptions 1 and 3–6, but that the random disturbances ε_i do not have mean zero; instead, E[ε_i] = μ_i.
a. Show that the least squares slope estimator b remains unbiased if μ_i = μ is constant for all i = 1, ..., n. Is the estimator of α unbiased?
b. Now suppose that μ_i = x_i/10 is proportional to the level of x_i. Is b still unbiased under these assumptions? Derive the bias E[b] − β under these assumptions.
c. Discuss whether Assumption 2 can be checked by considering the least squares residuals e_i.

2.7 (E Section 2.3.1)
In this exercise we prove that the least squares estimator s² is unbiased. Prove the following results under Assumptions 1–6. The notation x̄, ȳ, and ε̄ is used to denote sample averages over the estimation sample i = 1, ..., n.
a. e_i = (y_i − ȳ) − b(x_i − x̄) = −(x_i − x̄)(b − β) + ε_i − ε̄.
b. E[(ε_i − ε̄)²] = σ²(1 − 1/n) and E[(b − β)(ε_i − ε̄)] = σ²(x_i − x̄)/Σ_j (x_j − x̄)².
c. E[e_i] = 0 and var(e_i) = σ²(1 − 1/n − (x_i − x̄)²/Σ_j (x_j − x̄)²).
d. E[s²] = σ².

2.8 (E Section 2.4.1)
Suppose that Assumptions 1–6 hold true for i = 1, ..., n, n + 1, and that b is the least squares estimator computed over the estimation sample i = 1, ..., n. Prove the following results for the prediction error f in (2.38).
a. f = (y_{n+1} − ȳ) − b(x_{n+1} − x̄) = −(x_{n+1} − x̄)(b − β) + ε_{n+1} − ε̄, and E[f] = 0.
b. E[(ε_{n+1} − ε̄)²] = σ²(1 + 1/n). Comment on the difference between this result and the one in Exercise 2.7b.
c. Prove the result (2.39).
d. Consider in particular the situations of a and b; in particular, explain why var(f) > var(e_i). Explain the difference with the first result in Exercise 2.7c.

2.9 (E Section 2.2.2)
Suppose that Assumptions 1, 2, and 4–6 hold, but that the variances of the disturbances are given by E[ε_i²] = σ²g_i (i = 1, ..., n), where the g_i are known and given numbers.
a. Is b still unbiased under these assumptions?
b. Derive the variance of b under these assumptions.
c. Verify that this result reduces to (2.27) if g_i = 1 for i = 1, ..., n.

EMPIRICAL AND SIMULATION QUESTIONS

2.10 (E Sections 2.1.2 and 2.3.1)
Consider the set of n = 12 observations on price x_i and quantity sold y_i for a brand of coffee (data file XR210COF). It may be instructive to perform the calculations of this exercise only with the help of a calculator. For this purpose we present the data in the following table.

00 0. Perform two regressions of y on x.00 1.00 1. Discuss the conditions needed to be conﬁdent about these predictions.11) and (2. b.85 Quantity 89 86 74 79 68 84 139 122 102 186 179 187 e.4. b. 2.1) Consider the data set of Exercise 1.95 0.2. Determine 95% interval estimates for a and b.85 0. f.1) Consider the CAPM of Example 2. b. Compute a. Perform 5% signiﬁcance tests on a and b. Can you explain this outcome? 2. We pay special attention to the ‘crash observation’ i ¼ 94 . (yi À y)2 =n. with the x-variable XR201SMR for the excess returns for the whole market and with the y-variable for the excess returns for the sector of cyclical consumer goods. Regress the FGPA scores on a constant and SATM and compute a. Á Á Á . 2.11 (E Section 2. the standard errors of a and b. s2 . In Assumption 1 take n ¼ 10 and xi ¼ 100 þ i for i ¼ 1. but we will simulate the situation where the modeller knows only a set of data generated by the DGP.1 on stock market XR201SMR returns.11 on student learning.95 0. Construct also a 95% prediction interval. This data set consists of 240 monthly returns.3.1) Consider again the stock market returns data of Example 2. Make a point prediction of the FGPA score for a student with SATM score equal to 6.13 (E Section 2. Repeat steps a and b 100 times. a.1. and s2 .1. Compute the standard error of b and test the null hypothesis H0 : b ¼ 0 against the alternative H1 : b 6¼ 0. with xi the excess returns of the market index and yi the excess returns in the sector of cyclical consumer goods. b. a.95 0.1. c.00 1. a.14 (E Sections 2.Exercises 115 i 1 2 3 4 5 6 7 8 9 10 11 12 Price 1. Use a software package to compute the sample means x and y Pand the sample P À x)2 =n.1.3) Consider the excess returns data set described in Example 2. Check the results by performing a regression of y on x by means of a software package.00 1. and in Assumption 6 take a ¼ À100 and b ¼ 1. Check the conditions (2. Investigate the correlation between the two series of residuals obtained in a.85 0. Note that we happen to know the parameters of this DGP. Compute a 95% interval estimate of b. Compare the resulting variances in the 100 estimates a and b with the theoretical variances. We investigate how far the FGPA scores of these students can be explained in terms of their SATM scores.3. Discuss the resulting outcomes. a.15 (E Section 2.0.1) Consider the data generating process deﬁned in terms of Assumptions 1–7 with the following speciﬁcations. d.4. Estimate a and b by using all 1000 observations simultaneously and construct 95% interval estimates for a and b. one in the model yi ¼ a þ bxi þ ei and the second in the model yi ¼ bxi þ ei .12 (E Sections 2. e. and R2 from the statistics in a. c. 2. How many of the 100 computed interval estimates contain the true values of a and b? d. using a 5% signiﬁcance level (the corresponding two-sided critical value of the t(10) distribution is c ¼ 2:23). Simulate one data set from this model and determine the least squares estimates a and b. c.5 for the stock market returns data on the XR215SMR excess returns yi for the sector of cyclical consumer goods and xi for the market index. b.00 1. 2. and moments ( x i P (xi À x) (yi À y)=n. with FGPA and SATM XR111STU scores of ten students.12) for both models. and not the parameters of the DGP. 2. in Assumption 3 take s2 ¼ 1.3. b. Now combine the data into one large data set with 1000 observations. and the t-values of a and b. 10. 2. Construct 95% interval estimates for a and b. c.

116 2 Simple Regression

corresponding to October 1987, when a crash took place.
a. Estimate the CAPM using all the available data — that is, using the data over the full sample (including the crash observation).
b. Estimate a second version where the crash observation is deleted from the sample. Compare the outcomes of the two regressions.
c. Use the second model (estimated without the crash observation) to predict the value of y_94 for the given historical value of x_94. Construct also four prediction intervals, with confidence levels 50%, 90%, 95%, and 99%. Does the actual value of y_94 belong to these intervals?
d. Explain the relation between your findings in b and c.
e. Answer questions a and b also for some other sectors instead of cyclical consumer goods — that is, for the three sectors 'Noncyclical Consumer Goods', 'Information Technology', and 'Telecommunication, Media and Technology'.
f. For each of the four sectors in a and e, test the null hypothesis H0: β = 1 against the alternative H1: β ≠ 1. For which sectors should this hypothesis be rejected (at the 5% significance level)?
g. Relate the outcomes in f to the risk of the different sectors as compared to the total market in the UK over the period 1980–99.

3 Multiple Regression

In practice there often exists more than one variable that influences the dependent variable. This chapter discusses the regression model with multiple explanatory variables. We use matrices to describe and analyse this model. We present the method of least squares, its statistical properties, and the idea of partial regression. Particular attention is paid to the question whether additional variables should be included in the model or not. The F-test is the central tool for testing linear hypotheses, with a test for predictive accuracy as a special case.

118 3 Multiple Regression

3.1 Least squares in matrix form

E Uses Appendix A.2–A.7.

3.1.1 Introduction

More than one explanatory variable

In the foregoing chapter we considered the simple regression model where the dependent variable is related to one explanatory variable. In practice the situation is often more involved in the sense that there exists more than one variable that influences the dependent variable.

As an illustration we consider again the salaries of 474 employees at a US bank (see Example 2.2 (p. 77) on bank wages). In Chapter 2 the variations in salaries (measured in logarithms) were explained by variations in education of the employees. As can be observed from the scatter diagram (p. 85) and the regression results (p. 86) in Chapter 2, around half of the variability (as measured by the variance) can be explained in this way. Apart from salary and education, the following data are available for each employee: begin or starting salary (the salary that the individual earned at his or her first position at this bank), gender (with value zero for females and one for males), ethnic minority (with value zero for non-minorities and value one for minorities), and job category (category 1 consists of administrative jobs, category 2 of custodial jobs, and category 3 of management jobs). The begin salary can be seen as an indication of the qualities of the employee that, apart from education, are determined by previous experience, personal characteristics, and so on. The other variables may also affect the earned salary.

Simple regression may be misleading

Of course, the effect of each variable could be estimated by a simple regression of salaries on each explanatory variable separately. For the explanatory variables education, begin salary, and gender, the scatter diagrams with regression lines are shown in Exhibit 3.1 (a–c). However, these results may be misleading, as the explanatory variables are mutually related. For

3.1 Least squares in matrix form 119

[Exhibit 3.1 Scatter diagrams of Bank Wage data. Scatter diagrams with regression lines for several bivariate relations between the variables LOGSAL (logarithm of yearly salary in dollars), EDUC (finished years of education), LOGSALBEGIN (logarithm of yearly salary when the employee entered the firm), and GENDER (0 for females, 1 for males), for 474 employees of a US bank: (a) LOGSAL vs. EDUC, (b) LOGSAL vs. LOGSALBEGIN, (c) LOGSAL vs. GENDER, (d) LOGSALBEGIN vs. EDUC, (e) EDUC vs. GENDER, (f) LOGSALBEGIN vs. GENDER.]

120 3 Multiple Regression

example, the gender effect on salaries (c) is partly caused by the gender effect on education (e). Similar relations between the explanatory variables are shown in (d) and (f). This mutual dependence is taken into account by formulating a multiple regression model that contains more than one explanatory variable.

3.1.2 Least squares

E Uses Appendix A.7.

Regression model in matrix form

The linear model with several explanatory variables is given by the equation

y_i = β1 + β2 x_{2i} + β3 x_{3i} + ... + βk x_{ki} + ε_i    (i = 1, ..., n).    (3.1)

From now on we follow the convention that the constant term is denoted by β1 rather than α. The first explanatory variable x1 is defined by x_{1i} = 1 for every i = 1, ..., n, and for simplicity of notation we write β1 instead of β1 x_{1i}. For purposes of analysis it is convenient to express the model (3.1) in matrix form. Let

y = (y_1, ..., y_n)′,  β = (β_1, ..., β_k)′,  ε = (ε_1, ..., ε_n)′,    (3.2)

and let X be the n × k matrix with ith row (1, x_{2i}, ..., x_{ki}). Note that in the n × k matrix X = (x_{ji}) the first index j (j = 1, ..., k) refers to the variable number (in columns) and the second index i (i = 1, ..., n) refers to the observation number (in rows). The notation in (3.2) is common in econometrics (whereas in books on linear algebra the indices i and j are often reversed). In our notation, we can rewrite (3.1) as

y = Xβ + ε.    (3.3)

Here β is a k × 1 vector of unknown parameters and ε is an n × 1 vector of unobserved disturbances.

Residuals and the least squares criterion

If b is a k × 1 vector of estimates of β, then the estimated model may be written as

3.1 Least squares in matrix form 121

y = Xb + e.    (3.4)

Here e denotes the n × 1 vector of residuals, which can be computed from the data and the vector of estimates b by means of

e = y − Xb.    (3.5)

We denote transposition of matrices by primes (′) — for instance, the transpose of the residual vector e is the 1 × n matrix e′ = (e_1, ..., e_n). To determine the least squares estimator, we write the sum of squares of the residuals (a function of b) as

S(b) = Σ e_i² = e′e = (y − Xb)′(y − Xb) = y′y − y′Xb − b′X′y + b′X′Xb.    (3.6)

Derivation of least squares estimator

The minimum of S(b) is obtained by setting the derivatives of S(b) equal to zero. Note that the function S(b) has scalar values, whereas b is a column vector with k components. So we have k first order derivatives and we will follow the convention to arrange them in a column vector. The second and third terms of the last expression in (3.6) are equal (a 1 × 1 matrix is always symmetric) and may be replaced by −2b′X′y. This is a linear expression in the elements of b and so the vector of derivatives equals −2X′y. The last term of (3.6) is a quadratic form in the elements of b. The vector of first order derivatives of this term b′X′Xb can be written as 2X′Xb. The proof of this result is left as an exercise (see Exercise 3.1). To get the idea we consider the case k = 2 and we denote the elements of X′X by c_{ij}, i, j = 1, 2, with c_{12} = c_{21}. Then b′X′Xb = c_{11} b_1² + c_{22} b_2² + 2 c_{12} b_1 b_2. The derivative with respect to b_1 is 2 c_{11} b_1 + 2 c_{12} b_2, and the derivative with respect to b_2 is 2 c_{12} b_1 + 2 c_{22} b_2. When we arrange these two partial derivatives in a 2 × 1 vector, this can be written as 2X′Xb. See Appendix A (especially Examples A.10 and A.11 in Section A.7) for further computational details and illustrations.

The least squares estimator

Combining the above results, we obtain

∂S/∂b = −2X′y + 2X′Xb.    (3.7)

The least squares estimator is obtained by minimizing S(b). Therefore we set these derivatives equal to zero, which gives the normal equations

X′Xb = X′y.    (3.8)

122 3 Multiple Regression

Solving this for b, we obtain

b = (X′X)⁻¹ X′y,    (3.9)

provided that the inverse of X′X exists, which means that the matrix X should have rank k. As X is an n × k matrix, this requires in particular that n ≥ k — that is, the number of parameters is smaller than or equal to the number of observations. In practice we will almost always require that k is considerably smaller than n. This is the classical formula for the least squares estimator in matrix notation. From now on, if we write b, we always mean the expression in (3.9).

Proof of minimum

To prove that (3.9) is indeed the minimum of (3.6), it follows from (3.7) that the Hessian matrix

∂²S/∂b∂b′ = 2X′X    (3.10)

is a positive definite matrix if the matrix X has rank k (see Exercise 3.2). In (3.10) we take the derivatives of the vector ∂S/∂b with respect to another vector (b′) and we follow the convention to arrange these derivatives in a matrix. An alternative proof that b minimizes the sum of squares (3.6) that makes no use of first and second order derivatives is given in Exercise 3.3.

Summary of computations

The least squares estimates can be computed as follows.

Least squares estimation
Step 1: Choice of variables. Choose the variable to be explained (y) and the explanatory variables (x1, ..., xk, where x1 is often the constant that always takes the value 1).
Step 2: Collect data. Collect n observations of y and of the related values of x1, ..., xk and store the data of y in an n × 1 vector and the data on the explanatory variables in the n × k matrix X.
Step 3: Compute the estimates. Compute the least squares estimates by the OLS formula (3.9) by using a regression package.

E Exercises: T: 3.1, 3.2, 3.3.
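As a concrete illustration of Step 3, the following sketch (our own, with simulated data and hypothetical parameter values) builds the matrix X with a column of ones and computes b from the normal equations (3.8). Solving the linear system is numerically more stable than explicitly inverting X′X as in (3.9).

    import numpy as np

    rng = np.random.default_rng(4)
    n = 474
    x2 = rng.uniform(8, 21, size=n)             # e.g. years of education
    x3 = rng.normal(10.0, 0.5, size=n)          # e.g. logarithm of begin salary
    X = np.column_stack([np.ones(n), x2, x3])   # n x k matrix, first column constant
    beta = np.array([1.6, 0.02, 0.9])           # hypothetical true parameters
    y = X @ beta + rng.normal(0, 0.2, size=n)

    b = np.linalg.solve(X.T @ X, X.T @ y)       # least squares estimates
    e = y - X @ b                               # residuals
    print(b)
    print(X.T @ e)                              # zero up to rounding errors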

3.1 Least squares in matrix form 123

3.1.3 Geometric interpretation

E Uses Sections 1.2.2, 1.2.3.

Least squares seen as projection

The least squares method can be given a geometric interpretation, which we discuss now. Using the expression (3.9) for b, the residuals may be written as

e = y − Xb = y − X(X′X)⁻¹X′y = My,    (3.11)

where

M = I − X(X′X)⁻¹X′.    (3.12)

The matrix M is symmetric (M′ = M) and idempotent (M² = M). Since it also has the property MX = 0, it follows from (3.11) that

X′e = 0.    (3.13)

We may write the explained component ŷ of y as

ŷ = Xb = Hy,    (3.14)

where

H = X(X′X)⁻¹X′    (3.15)

is called the 'hat matrix', since it transforms y into ŷ (pronounced: 'y-hat'). Clearly, there holds H′ = H, H² = H, H + M = I, and HM = 0. So

y = Hy + My = ŷ + e,

where, because of (3.13), ŷ′e = 0, so that the vectors ŷ and e are orthogonal to each other. Therefore, the least squares method can be given the following interpretation. The sum of squares e′e is the square of the length of the residual vector e = y − Xb. The length of this vector is minimized by choosing Xb as the orthogonal projection of y onto the space spanned by the columns of X. This is illustrated in Exhibit 3.2. The projection is characterized by the property that e = y − Xb is orthogonal to all columns of X, so that 0 = X′e = X′(y − Xb); this gives the normal equations (3.8).

124 3 Multiple Regression

[Exhibit 3.2 Least squares. Three-dimensional geometric impression of least squares: the vector of observations on the dependent variable y is projected onto the plane of the independent variables (the X-plane) to obtain the linear combination Xb = Hy of the independent variables that is as close as possible to y, with residual vector e = My orthogonal to this plane.]

[Exhibit 3.3 Least squares. Two-dimensional geometric impression of least squares, where the k-dimensional plane S(X) is represented by the horizontal line: the vector y is decomposed into two orthogonal components, the fitted value Xb = Hy and the residual vector e = My.]

Geometry of least squares

Let S(X) be the space spanned by the columns of X (that is, the set of all n × 1 vectors that can be written as Xa for some k × 1 vector a) and let S⊥(X) be the space orthogonal to S(X) (that is, the set of all n × 1 vectors z with the property that X′z = 0). In y = ŷ + e, the vector y is decomposed into two orthogonal components, with ŷ ∈ S(X) according to (3.14) and e ∈ S⊥(X) according to (3.13). The matrix H projects onto S(X) and the matrix M projects onto S⊥(X). The essence of this decomposition is given in Exhibit 3.3, which can be seen as a two-dimensional version of the three-dimensional picture in Exhibit 3.2.

Geometric interpretation as a tool in analysis

This geometric interpretation can be helpful to understand some of the algebraic properties of least squares. As an example we consider the effect of applying linear transformations on the set of explanatory variables. Suppose that the n × k matrix X is replaced by X* = XA, where A is a k × k invertible matrix. Then the least squares fit (ŷ), the residuals (e), and the projection matrices (H and M) remain unaffected by this transformation, as S(X*) = S(X). This is immediately evident from the geometric pictures in Exhibits 3.2 and 3.3.
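These projection properties are easy to verify numerically. The sketch below (our own illustration) checks that M is symmetric and idempotent, that X′e = 0, and that replacing X by X* = XA for an invertible A leaves the fitted values unchanged while the coefficients become A⁻¹b.

    import numpy as np

    rng = np.random.default_rng(5)
    n, k = 50, 3
    X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
    y = rng.normal(size=n)

    H = X @ np.linalg.solve(X.T @ X, X.T)       # hat matrix H = X (X'X)^{-1} X'
    M = np.eye(n) - H                           # M = I - H
    print(np.allclose(M, M.T), np.allclose(M @ M, M))   # symmetric and idempotent
    print(np.allclose(X.T @ (M @ y), 0))                # X'e = 0

    A = rng.normal(size=(k, k))                 # invertible with probability one
    b = np.linalg.solve(X.T @ X, X.T @ y)
    X_star = X @ A
    b_star = np.linalg.solve(X_star.T @ X_star, X_star.T @ y)
    print(np.allclose(X @ b, X_star @ b_star))          # same fitted values
    print(np.allclose(b_star, np.linalg.solve(A, b)))   # b* = A^{-1} b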

3.1 Least squares in matrix form 125

The properties can also be checked algebraically, by working out the expressions for ŷ, e, H, and M in terms of X* = XA. The least squares estimates do change after the transformation, as b* = (X*′X*)⁻¹X*′y = A⁻¹b. For example, suppose that the variable x_k is measured in dollars and x_k* is the same variable measured in thousands of dollars. Then x_{ki}* = x_{ki}/1000 for i = 1, ..., n, and X* = XA where A is the diagonal matrix diag(1, ..., 1, 0.001). The least squares estimates of β_j for j ≠ k remain unaffected — that is, b_j* = b_j for j ≠ k — and b_k* = 1000 b_k. This also makes perfect sense, as one unit increase in x_k* corresponds to an increase of a thousand units in x_k.

3.1.4 Statistical properties

E Uses Sections 1.2.2, 1.3.2.

Seven assumptions on the multiple regression model

To analyse the statistical properties of least squares estimation, it is convenient to use as conceptual background again the simulation experiment described in Section 2.2.1 (p. 87–8). We first restate the seven assumptions of Section 2.2.3 (p. 92) for the multiple regression model (3.3) and use the matrix notation introduced in Section 3.1.2.

. Assumption 1: fixed regressors. All elements of the n × k matrix X containing the observations on the explanatory variables are non-stochastic. It is assumed that n ≥ k and that the matrix X has rank k.
. Assumption 2: random disturbances, zero mean. The n × 1 vector ε consists of random disturbances with zero mean, so that E[ε] = 0 — that is, E[ε_i] = 0 (i = 1, ..., n).
. Assumption 3: homoskedasticity. The covariance matrix of the disturbances E[εε′] exists and all its diagonal elements are equal to σ² — that is, E[ε_i²] = σ² (i = 1, ..., n).
. Assumption 4: no correlation. The off-diagonal elements of the covariance matrix of the disturbances E[εε′] are all equal to zero — that is, E[ε_i ε_j] = 0 for all i ≠ j.
. Assumption 5: constant parameters. The elements of the k × 1 vector β and the scalar σ are fixed unknown numbers with σ > 0.
. Assumption 6: linear model. The data on the explained variable y have been generated by the data generating process (DGP)

y = Xβ + ε.    (3.16)

126 3 Multiple Regression

Assumptions 3 and 4 can be summarized in matrix notation as

E[εε′] = σ²I,    (3.17)

where I denotes the n × n identity matrix.

. Assumption 7: normality. The disturbances are jointly normally distributed.

If in addition Assumption 7 is satisfied, then ε follows the multivariate normal distribution ε ~ N(0, σ²I). Assumptions 4 and 7 imply that the disturbances ε_i, i = 1, ..., n, are mutually independent.

Least squares is unbiased

Assumption 6 implies that the least squares estimator b = (X′X)⁻¹X′y can be written as

b = (X′X)⁻¹X′(Xβ + ε) = β + (X′X)⁻¹X′ε.    (3.18)

Taking expectations is a linear operation — that is, if z1 and z2 are two random variables and A1 and A2 are two non-random matrices of appropriate dimensions so that z = A1 z1 + A2 z2 is well defined, then E[z] = A1 E[z1] + A2 E[z2]. From Assumptions 1, 2, and 5 we obtain

E[b] = E[β + (X′X)⁻¹X′ε] = β + (X′X)⁻¹X′E[ε] = β.

So b is unbiased.

The covariance matrix of b

Using the result (3.18), we obtain that under Assumptions 1–6 the covariance matrix of b is given by

var(b) = E[(b − β)(b − β)′] = E[(X′X)⁻¹X′εε′X(X′X)⁻¹]
       = (X′X)⁻¹X′E[εε′]X(X′X)⁻¹ = (X′X)⁻¹X′(σ²I)X(X′X)⁻¹ = σ²(X′X)⁻¹.    (3.19)

The diagonal elements of this matrix are the variances of the estimators of the individual parameters, and the off-diagonal elements are the covariances between these estimators.
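The formula (3.19) can be illustrated by simulation. The sketch below (our own illustration, with hypothetical parameter values) keeps X fixed, draws many samples of disturbances, estimates b in each sample, and compares the sample covariance matrix of the estimates with σ²(X′X)⁻¹.

    import numpy as np

    rng = np.random.default_rng(6)
    n, n_sim, sigma = 25, 20_000, 2.0
    X = np.column_stack([np.ones(n), rng.normal(size=n)])   # fixed regressors
    beta = np.array([1.0, 0.5])

    B = np.empty((n_sim, 2))
    for r in range(n_sim):
        y = X @ beta + rng.normal(0, sigma, size=n)
        B[r] = np.linalg.solve(X.T @ X, X.T @ y)

    print(np.cov(B.T))                                      # simulated var(b)
    print(sigma ** 2 * np.linalg.inv(X.T @ X))              # theoretical (3.19)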

3.1 Least squares in matrix form 127

Least squares is best linear unbiased

The Gauss–Markov theorem, proved in Chapter 2 (p. 97–8) for the simple regression model, also holds for the more general model (3.16). It states that, among all linear unbiased estimators, b is the best linear unbiased estimator (BLUE) in the sense that, if b̂ = Ay with A a k × n non-stochastic matrix and E[b̂] = β, then var(b̂) − var(b) is a positive semidefinite matrix. This means that for every k × 1 vector c of constants there holds c′(var(b̂) − var(b))c ≥ 0 or, equivalently, var(c′b) ≤ var(c′b̂). Choosing for c the jth unit vector, this means in particular that for the jth component var(b_j) ≤ var(b̂_j), so that the least squares estimators are efficient. This result holds true under Assumptions 1–6; the assumption of normality is not needed.

Proof of Gauss–Markov theorem

To prove the result, first note that the condition that E[b̂] = E[Ay] = AE[y] = AXβ = β for all β implies that AX = I, the k × k identity matrix. Now define D = A − (X′X)⁻¹X′; then DX = AX − (X′X)⁻¹X′X = I − I = 0, so that

var(b̂) = var(Ay) = var(Aε) = σ²AA′ = σ²DD′ + σ²(X′X)⁻¹,

where the last equality follows by writing A = D + (X′X)⁻¹X′ and working out AA′. So var(b̂) − var(b) = σ²DD′, which is positive semidefinite. This shows that var(b̂) − var(b) is zero if and only if D = 0 — that is, A = (X′X)⁻¹X′. So b̂ = b gives the minimal variance.

3.1.5 Estimating the disturbance variance

Derivation of unbiased estimator

Next we consider the estimation of the unknown variance σ². As in the previous chapter we make use of the sum of squared residuals e′e. Intuition could suggest to estimate σ² = E[ε_i²] by the sample average (1/n) Σ e_i² = (1/n) e′e, but this estimator is not unbiased. It follows from (3.11) and (3.16) and the fact that MX = 0 that

e = My = M(Xβ + ε) = Mε.    (3.20)

So E[e] = 0, and

var(e) = E[ee′] = E[Mεε′M] = ME[εε′]M = σ²M² = σ²M.    (3.21)

128 3 Multiple Regression

To evaluate E[e′e] it is convenient to use the trace of a square matrix, which is defined as the sum of the diagonal elements of this matrix. Because the trace and the expectation operator can be interchanged, we find, using the property that tr(AB) = tr(BA), that

E[e′e] = E[tr(ee′)] = tr(E[ee′]) = σ²tr(M).

Using the property that tr(A + B) = tr(A) + tr(B), we can simplify this as

tr(M) = tr(I_n − X(X′X)⁻¹X′) = n − tr(X(X′X)⁻¹X′) = n − tr(X′X(X′X)⁻¹) = n − tr(I_k) = n − k,

where the subscripts denote the order of the identity matrices.

The least squares estimator s² and standard errors

This shows that E[e′e] = (n − k)σ², so that

s² = e′e/(n − k)    (3.22)

is an unbiased estimator of σ². The square root s of (3.22) is called the standard error of the regression. If in the expression (3.19) we replace σ² by s² and if we denote the jth diagonal element of (X′X)⁻¹ by a_jj, then s√a_jj is called the standard error of the estimated coefficient b_j. This is an estimate of the standard deviation σ√a_jj of b_j.

Intuition for the factor 1/(n − k)

The result in (3.22) can also be given a more intuitive interpretation. Suppose we would try to explain y by a matrix X with k = n columns and rank k. Then we would obtain e = 0, a perfect fit, but we would not have obtained any information on σ². Of course this is an extreme case. In practice we confine ourselves to the case k < n. Let us consider a diagonal element of (3.21),

var(e_i) = σ²(1 − h_i),    (3.23)

where h_i is the ith diagonal element of the matrix H = I − M in (3.15). As H is positive semidefinite, it follows that h_i ≥ 0. If the model contains a constant term (so that the matrix X contains a column of ones), then h_i > 0.
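In code, the estimator (3.22) and the standard errors follow directly from the residuals. The function below is a minimal sketch of our own (not library code), applied to simulated data with hypothetical parameter values.

    import numpy as np

    def ols_with_standard_errors(X, y):
        # returns b, the standard error of the regression s, and the
        # standard errors s * sqrt(a_jj) of the estimated coefficients
        n, k = X.shape
        b = np.linalg.solve(X.T @ X, X.T @ y)
        e = y - X @ b
        s2 = (e @ e) / (n - k)                    # unbiased estimator (3.22)
        se_b = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))
        return b, np.sqrt(s2), se_b

    rng = np.random.default_rng(7)
    n = 200
    X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
    y = X @ np.array([1.0, 0.5, -0.3]) + rng.normal(0, 2, size=n)
    print(ols_with_standard_errors(X, y))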

3.1 Least squares in matrix form 129

So each single element e_i of the residual vector has a variance that is smaller than σ². The very fact that we choose b in such a way that the sum of squared residuals is minimized is the cause of the fact that the squared residuals are smaller (on average) than the squared disturbances, and therefore the sum of squares Σ e_i² has an expected value less than nσ². This effect becomes stronger when we have more parameters to obtain a good fit for the data. If one would like to use a small residual variance as a criterion for a good model, then the denominator (n − k) of the estimator (3.22) gives an automatic penalty for choosing models with large k.

Intuition for the number of degrees of freedom (n − k)

As e = Mε, it follows under Assumptions 1–7 that e′e/σ² = ε′Mε/σ² follows the χ²-distribution with (n − k) degrees of freedom, using the fact that M is an idempotent matrix with rank (n − k). This follows from the results in Chapter 1 (p. 32). The term degrees of freedom refers to the restrictions X′e = 0. That is, the residual vector e lies in S⊥(X), and this space has dimension (n − k); k degrees of freedom are lost because b has been estimated. We may partition the restrictions X′e = 0 as X01 e1 + X02 e2 = 0, where X01 is a k × (n − k) matrix and X02 a k × k matrix. If the matrix X02 has a rank less than k, we may rearrange the columns of X′ in such a way that X02 has rank k. The restrictions imply that, once we have freely chosen the (n − k) elements of e1, the remaining elements are dictated by e2 = −(X02)⁻¹X01 e1.

3.1.6 Coefficient of determination

Derivation of R²

The performance of least squares can be evaluated by the coefficient of determination R² — that is, the fraction of the total sample variation Σ(y_i − ȳ)² that is explained by the model. In matrix notation, the total sample variation can be written as y′Ny with

N = I − (1/n) ii′,

where i = (1, ..., 1)′ is the n × 1 vector of ones. Note that N is a special case of an M-matrix (3.12) with X = i, as i′i = n. The matrix N has the property that it takes deviations from the mean, as the elements of Ny are y_i − ȳ. So Ny can be interpreted as the vector of residuals and y′Ny = (Ny)′Ny as the residual sum of squares from a regression where y is explained by X = i. If X in the multiple regression model (3.3) contains a constant term, then the fact that X′e = 0 implies that i′e = 0 and hence Ne = e. From y = Xb + e we then obtain

Ny = NXb + Ne = NXb + e = 'explained' + 'residual'.

130 3 Multiple Regression

For given matrix X of explanatory variables, it follows that

y′Ny = (Ny)′Ny = (NXb + e)′(NXb + e) = b′X′NXb + e′e.

Here the cross term vanishes because b′X′Ne = 0, as Ne = e and X′e = 0. It follows that the total variation in y (SST) can be decomposed in an explained part SSE = b′X′NXb and a residual part SSR = e′e.

Coefficient of determination: R²

Therefore R² is given by

R² = SSE/SST = b′X′NXb / y′Ny = 1 − e′e/y′Ny = 1 − SSR/SST.    (3.24)

The third equality in (3.24) holds true if the model contains a constant term. In that case (3.24) shows that 0 ≤ R² ≤ 1. If this is not the case, then SSR may be larger than SST (see Exercise 3.6) and R² is defined as SSE/SST (and not as 1 − SSR/SST). It is left as an exercise (see Exercise 3.7) to show that R² is the squared sample correlation coefficient between y and its explained part ŷ = Xb.

In geometric terms, R (the square root of R²) is equal to the length of NXb divided by the length of Ny — that is, R is equal to the cosine of the angle between Ny and NXb. A good fit is obtained when Ny is close to NXb — that is, when the angle between these two vectors is small, which corresponds to a high value of R². This is illustrated in Exhibit 3.4.

[Exhibit 3.4 Geometric picture of R². Two-dimensional geometric impression of the coefficient of determination. The dependent variable and all the independent variables are taken in deviation from their sample means, with resulting vector of dependent variables Ny and matrix of independent variables NX. The explained part of Ny is NXb with residuals Ne = e, and the coefficient of determination is equal to the square of the cosine of the indicated angle φ between Ny and NXb.]

Adjusted R²

When explanatory variables are added to the model, then R² never decreases (see Exercise 3.6). The wish to penalize models with large k has motivated an adjusted R² defined by adjusting for the degrees of freedom:

3.1 Least squares in matrix form 131

R̄² = 1 − (e′e/(n − k)) / (y′Ny/(n − 1)) = 1 − ((n − 1)/(n − k)) (1 − R²).    (3.25)

E Exercises: T: 3.5, 3.6a, b, c, 3.7b.

3.1.7 Illustration: Bank Wages

To illustrate the foregoing results we consider the data on salary and education discussed earlier in Chapter 2 and in Section 3.1.1. We will discuss (i) the data, (ii) the model, (iii) the normal equations and the least squares estimates, (iv) the interpretation of the estimates, (v) the sums of squares and R², and (vi) the orthogonality of residuals and explanatory variables. (Data file XM301BWA.)

(i) Data

The data consist of a cross section of 474 individuals working for a US bank. For each employee, the information consists of the following variables: salary (S), education (x2), begin salary (B), gender (x4 = 0 for females, x4 = 1 for males), minority (x5 = 1 if the individual belongs to a minority group, x5 = 0 otherwise), job category (x6 = 1 for clerical jobs, x6 = 2 for custodial jobs, and x6 = 3 for management positions), and some further job-related variables.

(ii) Model

As a start, we will consider the model with y = log(S) as variable to be explained and with x2 and x3 = log(B) as explanatory variables. That is, we consider the regression model

y_i = β1 + β2 x_{2i} + β3 x_{3i} + ε_i    (i = 1, ..., n).

As before, to simplify the notation we define the first regressor by x_{1i} = 1.

(iii) Normal equations and least squares estimates

The normal equations (3.8) involve the cross product terms X′X and X′y. For the data at hand they are given (after rounding) in Panel 1 of Exhibit 3.5. Solving the normal equations (3.8) gives the least squares estimates shown in Panel 3 of Exhibit 3.5, so that (after rounding) b1 = 1.647, b2 = 0.023, and b3 = 0.869. It may be checked from the cross products in Panel 1 of Exhibit 3.5 that X′Xb = X′y (apart from rounding errors) — that is,

    [  474    6395    4583 ] [ 1.647 ]   [  4909 ]
    [ 6395   90215   62166 ] [ 0.023 ] = [ 66609 ]
    [ 4583   62166   44377 ] [ 0.869 ]   [ 47527 ]
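The numbers in this system can be checked directly. The following sketch (ours) solves the normal equations with the rounded cross products reported above; because of the rounding, it reproduces the reported estimates only approximately.

    import numpy as np

    XtX = np.array([[  474,  6395,  4583],
                    [ 6395, 90215, 62166],
                    [ 4583, 62166, 44377]], dtype=float)
    Xty = np.array([4909, 66609, 47527], dtype=float)

    b = np.linalg.solve(XtX, Xty)   # solves the normal equations X'X b = X'y
    print(b)                        # approximately (1.647, 0.023, 0.869)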

132 3 Multiple Regression

Panel 1          IOTA      LOGSAL    EDUC      LOGSALBEGIN
IOTA             474       4909      6395      4583
LOGSAL                     50917     66609     47527
EDUC                                 90215     62166
LOGSALBEGIN                                    44377

Panel 2          LOGSAL    EDUC      LOGSALBEGIN
LOGSAL           1.000000
EDUC             0.696740  1.000000
LOGSALBEGIN      0.886368  0.685719  1.000000

Panel 3: Dependent Variable: LOGSAL
Method: Least Squares
Sample: 1 474
Included observations: 474

Variable        Coefficient   Std. Error
C               1.646916      0.274598
EDUC            0.023122      0.003894
LOGSALBEGIN     0.868505      0.031835

R-squared 0.800579   Adjusted R-squared 0.799733
S.E. of regression 0.177812   Sum squared resid 14.89166
Total sum of squares 74.67462   Explained sum of squares 59.78296

Panel 4: Dependent Variable: RESID
Method: Least Squares
Sample: 1 474
Included observations: 474

Variable        Coefficient
C               3.10E-11
EDUC            2.47E-13
LOGSALBEGIN     −3.55E-12

R-squared 0.000000   Adjusted R-squared −0.004246
S.E. of regression 0.177812   Sum squared resid 14.89166

Exhibit 3.5 Bank Wages (Section 3.1.7)
Panel 1 contains the cross product terms (X′X and X′y) of the variables (IOTA denotes the constant term, with all values equal to one). Panel 2 shows the correlations between the dependent and the two independent variables. Panel 3 shows the outcomes obtained by regressing salary (in logarithms) on a constant and the explanatory variables education and the logarithm of begin salary. The residuals of this regression are denoted by RESID, and Panel 4 shows the result of regressing these residuals on a constant and the two explanatory variables (3.10E-11 means 3.10 × 10⁻¹¹, and so on; these values are zero up to numerical rounding).

(iv) Interpretation of estimates

A first thing to note here is that the marginal relative effect of education on wage (that is, d log(S)/dx2 = (dS/S)/dx2 = dy/dx2 = β2) is estimated now as 0.023, whereas in Chapter 2 this effect was estimated as 0.096 with a standard error of 0.005 (see Exhibit 2.11 (p. 103)). This is a substantial difference. According to the results in Chapter 2, an

3.1 Least squares in matrix form 133

additional year of education corresponds on average with a 9.6 per cent increase in salary. But, if the begin salary is 'kept fixed', an additional year of education gives only a 2.3 per cent increase in salary. The cause of this difference is that the variable 'begin salary' is strongly related to the variable 'education'. This is clear from Panel 2 in Exhibit 3.5, which shows that x2 and x3 have a correlation of around 69 per cent. We refer also to Exhibit 3.1 (d), which shows a strong positive relation between x2 and x3. This means that in Chapter 2, part of the positive association between education and salary is due to a third variable, begin salary. This explains why the estimated effect in Chapter 2 is larger.

(v) Sums of squares and R²

The sums of squares for this model are reported in Panel 3 in Exhibit 3.5, with values SST = 74.675, SSE = 59.783, and SSR = 14.892, so that R² = 0.801. This is larger than the R² = 0.485 in Chapter 2 (see Exhibit 2.11 (p. 103)), where we have excluded the begin salary from the model. In Section 3.4 we will discuss a method to test whether this is a significant increase in the model fit. Panel 3 in Exhibit 3.5 also reports the standard error of the regression s = √(SSR/(474 − 3)) = 0.178 and the standard error of b2, which equals 0.0039.

(vi) Orthogonality of residuals and explanatory variables

Panel 4 in Exhibit 3.5 shows the result of regressing the least squares residuals on the variables x1, x2, and x3. This gives an R² = 0.000000, which is in accordance with the property that the residuals are uncorrelated with the explanatory variables in the sense that X′e = 0 (see Exhibits 3.2 and 3.3).

. the (partial) effect of X1 on y (b) for given value of X2 is denoted by b1 and the (partial) effect of X2 on y for given value of X1 is denoted by b2. The effect of changes in X1 on X2 is denoted by P. That is.3 and 3.4 we analyse the statistical consequences of omitting or including variables. we have to decide which explanatory variables should be included in the model. Organization of this section The section is organized as follows. where X1 and X2 denote two subsets of variables. The (total) effect of X1 on y (a) is denoted by b in Chapter 2 (and by bR in Section 3.6 Direct and indirect effects Two subsets of explanatory variables (X1 and X2 ) inﬂuence the variable to be explained (y). This is illustrated in the scheme in Exhibit 3. Section 3.2.6. Section 3. and one subset of explanatory variables (X1 ) inﬂuences the other one (X2 ).1).134 3 Multiple Regression 3.2.2. and Section 3. In this section we analyse what happens if we add variables to our model or delete variables from our model. In Sections 3.1 considers the effects of including or deleting variables on the regression coefﬁcients.5 shows that. Choice of the number of explanatory variables To make an econometric model we have to decide which variables provide the best explanation of the dependent variable. in a multiple regression model.2 Adding or deleting variables E Uses Appendix A.2 provides an interpretation of this result in terms of ceteris paribus conditions. Here X1 is included in the model. each individual coefﬁcient measures the effect of an explanatory variable on the dependent variable after neutralizing for the effects that are due to the other explanatory variables included in the model.4.2.2.2–A.2. (a) (b) X1 b1 X1 b y P b2 X2 y Exhibit 3. and the question is whether X2 should be included in the model or not.

3.1 Restricted and unrestricted models Two models: Notation As before.1.2 Adding or deleting variables 135 3. In particular.4 are assumed to hold true.2. The k Â 1 vector b of unknown parameters is decomposed in a similar way in the (k À g) Â 1 vector b1 and the g Â 1 vector b2. We partition the explanatory variables in two groups. (3:28) (3:27) Least squares in the unrestricted model We may write the unrestricted model as y ¼ Xb þ e ¼ ( X1 b X2 ) 1 b2 þ e: (3:29) . we consider the regression model y ¼ Xb þ e where X is the n Â k matrix of explanatory variables with rank(X) ¼ k. where X1 is the n Â (k À g) matrix of observations of the included regressors and X2 is the n Â g matrix with observations on the variables that may be included or deleted. In this section we compare two versions of the model — namely. we investigate the consequences of deleting X2 for the estimate of b1 and for the residuals of the estimated model. The matrix of explanatory variables is partitioned as X ¼ (X1 X2 ). so that bR ¼ (X01 X1 )À1 X01 y: We use the notation eR ¼ y À X1 bR for the corresponding restricted residuals. Least squares in the restricted model In the restricted model we estimate b1 by regressing y on X1 . Then the regression model can be written as y ¼ X1 b1 þ X2 b2 þ e: (3:26) All the assumptions on the linear model introduced in Section 3. the unrestricted version in (3. one with k À g variables that are certainly included in the model and another with the remaining g variables that may be included or deleted.26) and a restricted version where X2 is deleted from the model.

we premultiply (3. in general the restricted estimate bR will be different from the unrestricted estimate b1 . it can be expected that it provides a better (or at least not a worse) ﬁt than the restricted model so that e0 e e0R eR . then bR ¼ b1 . Decomposing the k Â 1 vector b into a (k À g) Â 1 vector b1 (the unrestricted estimator of b1 ) and a g Â 1 vector b2 (the unrestricted estimator of b2 ). as we will now show. This is indeed the case. for instance. As the unrestricted model contains more variables to explain the dependent variable.28) and the residuals e in the unrestricted regression (3.30) by the matrix (X01 X1 )À1 X01 and make use of X01 e ¼ 0 to obtain bR ¼ (X01 X1 )À1 X01 y ¼ b1 þ (X01 X1 )À1 X01 X2 b2 .136 3 Multiple Regression The unrestricted least squares estimator is given by b ¼ (X0 X)À1 X0 y. In these cases it does not matter for the estimate of b1 whether we include X2 in the model or not. that is. So we now have X01 eR ¼ 0 for the restricted model. and X01 e ¼ 0 and X02 e ¼ 0 for the unrestricted model. If either of these terms vanishes. We have learned in the previous section that the least squares residuals are orthogonal to all regressors. the residuals eR in the restricted regression (3. however. if X2 has no effect at all (b2 ¼ 0) or if X1 and X2 are orthogonal (X01 X2 ¼ 0). Comparison of bR and b1 To study the difference between the two estimators bR and b1 of b1 . Note. . However. we can write the unrestricted regression as y ¼ ( X1 b X2 ) 1 þ e ¼ X1 b1 þ X2 b2 þ e: b2 (3:30) So we continue to write e for the residuals of the unrestricted model. bR ¼ b1 þ Pb2 where the (k À g) Â g matrix P is deﬁned by P ¼ (X01 X1 )À1 X01 X2 : (3:32) (3:31) So we see that the difference bR À b1 depends on both P and b2 . This is the case. that in general X02 eR 6¼ 0.30). Comparison of e0R eR and e0 e Next we compare the residuals of both equations — that is.

and we used that M1 X1 ¼ 0 and M1 e ¼ e (as X01 e ¼ 0).4.7. We obtain eR ¼ M1 y ¼ M1 (X1 b1 þ X2 b2 þ e) ¼ M1 X2 b2 þ e: (3:33) T Here M1 ¼ I À X1 (X01 X1 )À1 X01 is the projection orthogonal to the space spanned by the columns of X1 .1. (3.3. The results of the restricted and unrestricted regressions are given in Panels 1 and 2 of Exhibit 3.1: Bank Wages (continued) To illustrate the results in this section we return to the illustration in Section 3. The dependent variable y is the yearly wage (in logarithms). a constant term. X2 has no effect at all (b2 ¼ 0). This additional variable is denoted by the matrix X2 with n rows and g ¼ 1 column in this case. Interpretation of result As M1 is a positive semideﬁnite matrix. it follows that b02 X02 M1 X2 b2 ¼ (X2 b2 )0 M1 (X2 b2 ) ! 0 in (3.7. This shows that adding variables to a regression model in general leads to a reduction of the sum of squared residuals. The unrestricted model (in Panel 2) has a larger R2 than the restricted model (in Panel 1). In the unrestricted model we take as explanatory variables ‘education’. for instance.33) implies that e0R eR ¼ b02 X02 M1 X2 b2 þ e0 e (3:34) where the product term vanishes as X02 M1 e ¼ X02 e À X02 X1 (X01 X1 )À1 X01 e ¼ 0 because X01 e ¼ 0 and X02 e ¼ 0.2.34). A test for the signiﬁcance of the increased model ﬁt is derived in Section 3.30) for y. Example 3.1.1 these two variables are collected in the matrix X1 with n rows and k À g ¼ 2 columns. So the difference between eR and e depends on M1 X2 and b2 . As R2 ¼ 1 À (e0 e=SST) E XM301BWA . We see that eR ¼ e if. and in the notation of Section 3. so that e0R eR ! e0 e and the inequality is strict unless M1 X2 b2 ¼ 0. For the sums of squared residuals. as they provide a signiﬁcant additional explanation of the dependent variable.2 Adding or deleting variables 137 Derivation of sums of squares To prove that e0 e e0R eR we start with the restricted residuals and then substitute the unrestricted model (3. and the additional variable ‘begin salary’ (in logarithms). then this motivates to include the variables X2 in the model. In the restricted model we take as explanatory variables ‘education’ and a constant term. If this reduction is substantial.

000000 Panel 4: Dependent Variable: RESIDREST Method: Least Squares Variable Coefﬁcient C 3.868505 0.062102 0.76E-14 R-squared 0.485447 Panel 2: Dependent Variable: LOGSAL Method: Least Squares Variable Coefﬁcient Std.449130 LOGSALBEGIN 0.47E-13 LOGSALBEGIN À3.1) Regression in the restricted model (Panel 1) and in the unrestricted model (Panel 2).7 Bank Wages (Example 3.78E-13 EDUC À2.324464 Panel 6: Dependent Variable: LOGSALBEGIN Method: Least Squares Variable Coefﬁcient C 8.460124 R-squared 0.10E-11 EDUC 2.55E-12 R-squared 0.274598 EDUC 0.062738 EDUC 0. The regression in Panel 6 shows that the logarithm of begin salary is related to education.470211 Exhibit 3.537878 EDUC 0.023122 0. Error C 9.646916 0. but the residuals of the restricted regression (denoted by RESIDREST) are uncorrelated only with education (Panel 4) and not with the logarithm of begin salary (Panel 5). Error C 1.000000 Panel 5: Dependent Variable: RESIDREST Method: Least Squares Variable Coefﬁcient C À4.138 3 Multiple Regression Panel 1: Dependent Variable: LOGSAL Method: Least Squares Variable Coefﬁcient Std. . The residuals of the unrestricted regression (denoted by RESIDUNREST) are uncorrelated with both explanatory variables (Panel 3).004548 R-squared 0.095963 0.800579 Panel 3: Dependent Variable: RESIDUNREST Method: Least Squares Variable Coefﬁcient C 3.003894 LOGSALBEGIN 0.031835 R-squared 0.083869 R-squared 0.

but X02 eR 6¼ 0 (Panel 5). Panels 3–5 in Exhibit 3. Collecting the g regressions z ¼ ^ z þ M1 z in g columns. 3. For the jth column of X2 — say. In the unrestricted model (where y is regressed on the same k À g regressors and g additional regressors) the k Â 1 vector of least squares estimates is given by b ¼ (X0 X)À1 X0 y. where each column of X2 is regressed on X1 . it follows that e0 e e0R eR .32). It follows from the outcomes in Panel 1 (for bR ).2 Interpretation of regression coefficients Relations between regressors: The effect of X1 on X2 The result in (3.32).7 show that X01 e ¼ 0 (Panel 3). The question arises which of these two estimates should be preferred.2 Adding or deleting variables 139 ¼ 0:801 > 0:485 ¼ 1 À (e0R eR )=SST. Let b be decomposed in two parts as b ¼ (b01 . where the (k À g) Â 1 vector b1 corresponds to the regressors of the restricted model and b2 to the g added regressors. Panel 6 shows the regression of X2 on X1 corresponding to (3. z — this gives estiz ¼ X1 pj and remated coefﬁcients pj ¼ (X01 X1 )À1 X01 z with explained part ^ À1 0 0 ^ sidual vector z À z ¼ M1 z where M1 ¼ I ÀX1 (X1 X1 ) X1 .31) between restricted and unrestricted least squares estimates. Summary of computations In the restricted model (where y is regressed on k À g regressors) the (k À g) Â 1 vector of least squares estimates is given by bR ¼ (X01 X1 )À1 X01 y. b02 )0 .31) shows that the estimated effect of X1 on y changes from b1 to bR ¼ b1 þ Pb2 if we delete the regressors X2 from the model. Then the relation between bR and b1 is given by bR ¼ b1 þ Pb2 . Panel 2 (for b1 and b2 ). we get X2 ¼ X1 P þ M1 X2 ¼ ‘explained part’ þ ‘residuals’ with P ¼ (X01 X1 )À1 X01 X2 as deﬁned in (3. (3:35) . which veriﬁes the relation bR ¼ b1 þ Pb2 in (3. we ﬁrst give an interpretation of the matrix P in (3. X01 eR ¼ 0 (Panel 4). To investigate this question.2. This matrix may be interpreted in terms of regressions.32).3. and Panel 6 (for P) that (apart from rounding errors) 9:062 0:096 ¼ 1:647 0:023 þ 8:538 0:084 Á 0:869.

.140 3 Multiple Regression Non-experimental data and the ceteris paribus idea The auxiliary regressions (3. Traditionally. The result in (3. the direct effect of x1 on y (the ﬁrst term) and the indirect effect that runs via x2 (the second term). the ‘other things’ clearly are the remaining columns of the matrix X1 and the residual eR . If the variables would satisfy exact functional relationships. if X1 and X2 are uncontrolled. say y ¼ f (x1 . then there are several possible reasons why P could be different from 0. bR gives a better idea of the total effect on y of changes in X1 than b1 . In experimental situations where we are free to choose the matrices X1 and X2 . Then a change of X1 may have two effects on y. It may be useful to keep this in mind when interpreting the restricted estimate bR and the unrestricted estimate b. and total effects So the restricted and the unrestricted model raise different questions and one should not be surprised if different questions lead to different answers. On the other hand. we can choose orthogonal columns so that P ¼ 0. Take the particular case that X1 ‘causes’ X2 . then the marginal effect of x1 on y is given by dy @f dh @ f ¼ þ : dx1 @ x1 dx1 @ x2 Here the total effect of x1 on y (on the left-hand side) is decomposed as the sum of two terms (on the right-hand side).35) have an interesting interpretation.31) shows that neglecting the variables X2 then has no effect on the estimate of b1 . Under these circumstances it may be hard to keep X2 constant if X1 changes. the second column of X1 ). Consider the second element of bR (the ﬁrst element is the intercept). x2 ) and x2 ¼ h(x1 ) (with k ¼ 2 and g ¼ 1). For instance. It is seen from (3. or there could exist a third ‘cause’ in the background that inﬂuences both X1 and X2 . X1 may ‘cause’ X2 or X2 may ‘cause’ X1 . and in the unrestricted model the ‘other things’ are the same columns of X1 and in addition the columns of X2 and the residual e. if all other things remain equal. Direct.31) that these are precisely the two components of bR . That is.31) shows that the same relation holds true when linear relationships are estimated by least squares. in a linear relationship this measures the partial derivative @ y=@ z (where z now denotes the second explanatory variable — that is. So in this case it may be more natural to look at the restricted model. a direct effect measured by b1 and an indirect effect measured by Pb2 . It answers the question how y will react on a change in z ceteris paribus — that is. The result in (3. indirect. as it is unnatural to assume that X2 remains ﬁxed. Now the question is: which ‘other things’? In the restricted model.

Example 3. If salary is regressed on education alone. On the other hand. then the estimated effects are respectively 0. the estimated effect is 0.0839 y = log current salary 0. The results discussed in Example 3.0960. The current salary of an employee is inﬂuenced by the education and the begin salary of that employee. In this case the direct effect is 0.2: Bank Wages (continued) To illustrate the relation between direct. the estimated effect is 0. E XM301BWA (a) x2 = education (b) x2 = education 0. the direct effect and all the indirect effects that run via the other explanatory variables — then one should estimate the restricted model where all the other explanatory variables are deleted. The total effect of education on salary consists of two parts. . a direct effect and an indirect effect that runs via the begin salary.2) Two variables (education and begin salary) inﬂuence the current salary. In the restricted model (without begin salary) the coefﬁcient bR ¼ 0:0960 measures the total effect of education on salary. and total effects. we return to Example 3. and if salary is regressed on education and begin salary together. the begin salary may for a large part be determined by education. and education also inﬂuences the begin salary.2 Adding or deleting variables 141 Interpretation of regression coefficients in restricted and unrestricted model If one wants to estimate only the direct effect of an explanatory variable — that is.1 are summarized in Exhibit 3.0231 and 0.1 on bank wages.0960 y = log current salary 0. and the total effect is 0:0231 þ 0:0729 ¼ 0:0960. if one wants to estimate the total effect of an explanatory variable — that is. under the assumption that all other explanatory variables remain ﬁxed — then one should estimate the unrestricted model that includes all explanatory variables.0231 0.8 and have the following interpretation. This effect is split up in two parts in the unrestricted model as bR ¼ b þ pc. Clearly. If begin salary is regressed on education.8685.8685 x3 = log begin salary Exhibit 3.0839. indirect.8 Bank Wages (Example 3. the indirect effect is 0:0839 Á 0:8685 ¼ 0:0729.0231.3.

3. Strictly speaking. but we use the model with only X1 as explanatory variables and with bR as our estimator of b1 . E Exercises: E: 3. When comparing the restricted and the unrestricted model. The question is which of these variables should be included in the model. Then we have bR ¼ (X01 X1 )À1 X01 y ¼ b1 þ (X01 X1 )À1 X01 X2 b2 þ (X01 X1 )À1 X01 e: This shows that E[bR ] ¼ b1 þ (X01 X1 )À1 X01 X2 b2 ¼ b1 þ Pb2 : . In this section we analyse the effect of omitting variables from the model. Omitted variables bias In this section we consider the consequences of omitting variables from the ‘true model’. It seems intuitively reasonable to include variables only if they have a clear effect on the dependent variable and to omit variables that are less important. then the DGP is unknown and can at best be approximated. When the data are from the real world. one can ﬁnd a long list of possible explanatory variables. see also our earlier remarks in Section 2. the estimates bR and b have very different interpretations. Nevertheless.2. We focus on the statistical properties of the least squares estimator. the term ‘true’ model has a clear interpretation only in the case of simulated data.1 (p. and in the next section of including irrelevant variables. it helps our insight to study some of the consequences of estimating a different model from the true model. and pc ¼ 0:0839 Á 0:8685 ¼ 0:0729 measures the indirect effect of education on salary that is due to a higher begin salary. a remark about the term true model is in order.2. Clearly. 87). Suppose the ‘true model’ is y ¼ X1 b1 þ X2 b2 þ e.142 3 Multiple Regression Here b ¼ 0:0231 measures the direct effect of education on salary under the assumption that begin salary remains constant (the ceteris paribus condition).3 Omitting variables Choice of explanatory variables For most economic variables to be explained.14a–c.

then omission of X2 is undesirable unless the resulting bias is small compared to the gain in efﬁciency.2. the DGP satisﬁes Assumptions 1–6 with b2 ¼ 0. Suppose that y ¼ X1 b1 þ e. the omission of relevant variables leads to biased estimates but to a reduction in variance. when b2 is small enough.2 we should not be surprised by this ‘bias’. as this leads to an improved efﬁciency of the least squares estimator. Variance reduction To compute the variance of bR . This means that variables can be omitted if their effect is small.7) to prove that this is smaller than the variance of the unrestricted least squares estimator b1. since bR and b1 have different interpretations. The estimator bR is in general a biased estimator of b1 .3. so that the estimation results are given by y ¼ X1 b1 þ X2 b2 þ e: Although the estimated model y ¼ X1 b1 þ X2 b2 þ e neglects the fact that b2 ¼ 0. In practice we do not know that b2 is zero. var(b1 )À var(bR ) is positive semideﬁnite. 3. Summary Summarizing. for instance. If one is interested in estimating the ‘direct’ effect b1 . Suppose that the variables X2 are included as additional regressors. it is not wrongly speciﬁed as it satisﬁes Assumptions 1–6. that is. the above two expressions show that bR À E[bR ] ¼ (X01 X1 )À1 X01 e so that var(bR ) ¼ E[(bR À E[bR ])(bR À E[bR ])0 ] ¼ s2 (X01 X1 )À1 : It is left as an exercise (see Exercise 3. In the light of our discussion in Section 3. The result . that is.2 Adding or deleting variables 143 The last term is sometimes called the omitted variables bias.2.4 Consequences of redundant variables Redundant variables lead to inefficiency A variable is called redundant if it plays no role in the ‘true’ model.

32) to write (X1 M1 X2 ) ¼ (X1 X2 À X1 P) ¼ (X1 I X2 ) 0 ÀP : I The last matrix is non-singular and Assumption 1 states that the n Â k matrix (X1 X2 ) has rank k. bR ) ¼ 0 It remains to prove that cov(b2 . as we will prove below. Then the result var(b1 ) ¼ var(bR ) þ P var(b2 )P0 follows from the fact that cov(b2 . If we premultiply this by X02. So (X1 M1 X2 ) also has rank k — that is. as X02 e ¼ 0. That is. However. Therefore b1 is an unbiased estimator. This proves that X02 M1 X2 is non-singular.33) states that M1 y ¼ M1 X2 b2 þ e. Now the result in (3. if b2 ¼ 0. So the variances of the elements of b1 are larger than those of the corresponding elements of bR . unless the corresponding rows of P are zero. it follows that P var(b2 )P0 is positive semideﬁnite. We use (3.144 3 Multiple Regression (3. then the parameters are estimated with less precision (larger standard errors) as compared with the model that excludes the redundant variables. bR ) ¼ 0. this means that b2 ¼ (X02 M1 X2 )À1 X02 M1 y: (3:37) We now substitute the ‘true’ model y ¼ X1 b1 þ e into (3. all its columns are linearly independent. it sufﬁces to prove that the n Â g matrix M1 X2 has rank g. so that this matrix has rank g. then in general we gain efﬁciency by deleting the irrelevant variables X2 from the model. This shows the importance of imposing restrictions on the model. then we obtain.18) shows that E[b1 ] ¼ b1 and E[b2 ] ¼ b2 ¼ 0 in this case.35) and (3. T Proof of auxiliary result cov(b2 . that X02 M1 y ¼ X02 M1 X2 b2 . if the model contains redundant variables. As X02 M1 X2 is nonsingular. This means in particular that all columns of the n Â g matrix M1 X2 are linearly independent. we write (3. this gives . The basic step is to express bR and b2 in terms of e.37). That is. Because var(b2 ) is positive deﬁnite.31) as b1 ¼ bR À Pb2 . bR ) ¼ 0. Because M1 X1 ¼ 0. this estimator is inefﬁcient in the sense that var(b1 ) À var(bR ) is positive semideﬁnite. As b2 ¼ 0 it follows that bR ¼ (X01 X1 )À1 X01 y ¼ (X01 X1 )À1 X01 (X1 b1 þ e) ¼ b1 þ (X01 X1 )À1 X01 e: (3:36) To express b2 in terms of e we ﬁrst prove as an auxiliary result that the g Â g matrix X02 M1 X2 is non-singular. To prove this. As X02 M1 X2 ¼ (M1 X2 )0 M1 X2 .

we obtain (as M1 X1 ¼ 0) cov(b2 . 3. E Exercises: T: 3. Estimated Model y ¼ X1 bR þ eR y ¼ X1 b1 þ X2 b2 þ e Data Generating Process y ¼ X1 b1 þ X2 b2 þ e y ¼ X1 b1 þ e (b2 non-zero) ðb2 ¼ 0Þ bR biased. In practice we do not know the true parameters b2 but we can test whether b2 ¼ 0. Comparisons should be made in columns — that is.3 and 3.2 we mentioned that these .5 Partial regression Multiple regression and partial regression In this section we give a further interpretation of the least squares estimates in a multiple regression model. but larger variance b1 unbiased.4. If we include redundant variables (b2 ¼ 0) in our model. In Section 3.9. but smaller variance bR best linear unbiased than b1 b1 unbiased.3. The cells show the statistical properties of the estimators bR (of the restricted model where X2 is deleted. but not efﬁcient than bR Exhibit 3. then this causes a loss of efﬁciency of the estimators of the parameters (b1 ) of the relevant variables. This is discussed in Sections 3. by excluding irrelevant variables we gain efﬁciency. So the choice between a restricted and an unrestricted model involves a trade-off between the bias and efﬁciency of estimators. bR ) ¼ E[(b2 À E[b2 ])(bR À E[bR ])0 ] ¼ E[(X02 M1 X2 )À1 X02 M1 ee0 X1 (X01 X1 )À1 ] ¼ s2 (X02 M1 X2 )À1 X02 M1 X1 (X01 X1 )À1 ¼ 0: Summary of results We summarize the results of this and the foregoing section in Exhibit 3.38) that express bR and b2 in terms of e. for a ﬁxed data generating process. if we exclude relevant variables (b2 6¼ 0).2 Adding or deleting variables 145 b2 ¼ (X02 M1 X2 )À1 X02 M1 e (as b2 ¼ 0): (3:38) Using the expressions (3. ﬁrst row) and b1 (of the unrestricted model that contains both X1 and X2 . That is.9 Bias and efﬁciency Consequences of regression in models that contain redundant variables (bottom right cell) and in models with omitted variables (top left cell).36) and (3.7d. this causes a bias in the estimators.2. under Assumptions 1–6. second row) for the model parameters b1 .2. However.

by including X2 in the regression model. where M2 ¼ I À X2 (X02 X2 )À1 X02 . That is. Step 2: Estimate the ‘cleaned’ effect of X1 on y. Note that. As will be shown below. the question arises what is the precise interpretation of this condition. We consider again the model where the n Â k matrix of explanatory variables X is split in two parts as X ¼ (X1 X2 ).146 3 Multiple Regression estimates measure direct effects under the ceteris paribus condition that the other variables are kept ﬁxed. Partial regression Step 1: Remove the effects of X2 . The regression of y on X1 and X2 gives the result y ¼ X1 b1 þ X2 b2 þ e: Another approach to estimate the effects of X1 on y is the following two-step method. Proof of the result of Frisch–Waugh To prove the result of Frisch–Waugh. the estimated effect b1 of X1 on y is automatically ‘cleaned’ from the side effects caused by X2. This gives M2 y ¼ M2 X 1 b Ã þ eÃ where bÃ ¼ [(M2 X1 ) M2 X1 ]À1 (M2 X1 )0 M2 y ¼ (X01 M2 X1 )À1 X01 M2 y eÃ ¼ M2 y À M2 X1 bÃ are the corresponding residuals. the ‘cleaned’ variables M2 y and M2 X1 are uncorrelated with X2 . we write out the normal equations X0 Xb ¼ X0 y in terms of the partitioned matrix X ¼ (X1 X2 ). Also regress each column of X1 on X2 with residuals M2 X1 . where X1 is an n Â (k À g) matrix and X2 an n Â g matrix. Now estimate the ‘cleaned’ effect of X1 on y by regressing M2 y on M2 X1 . as a consequence of the fact that residuals are orthogonal to explanatory variables. As such ‘controlled experiments’ are almost never possible in economics. T . called partial regression. it means that the indirect effects that are caused by variations in the other variables are automatically removed in a multiple regression. eÃ ¼ e: (3:39) That is. Here we remove the side effects that are caused by X2. 0 and The result of Frisch–Waugh The result of Frisch–Waugh states that bÃ ¼ b1 . regress y on X2 with residuals M2 y. Here M2 y and M2 X1 can be interpreted as the ‘cleaned’ variables obtained after removing the effects of X2 .

40) and arranging terms it follows that X01 M2 X1 b1 ¼ X01 M2 y.41) we get b2 ¼ (X02 X2 )À1 X02 y À (X02 X2 )À1 X02 X1 b1 . should we include X2 or not? Suppose that we wish to estimate the effect of a certain set of regressors (X1 ) on the dependent variable (y). The partial effect X1 ! y (ceteris paribus. The question is whether certain other variables (X2 ) should be added to or omitted from the regression. after which the cleaned M2 y is regressed on the cleaned M2 X1 . If the two sets of regressors X1 and X2 are related (in the sense that X01 X2 6¼ 0). as if X2 were ﬁxed) cannot be determined if X2 is deleted from the model. Case 1: Deviations from sample mean Let X2 have only one column consisting of ones. and a similar argument shows that also X01 M2 X1 is invertible. Instead of this partial regression. Three illustrations There are several interesting applications of the result of Frisch–Waugh. then the estimated effects X1 ! y differ in the two models.3. To isolate the direct effect X1 ! y one can ﬁrst remove the effects of X2 on y and of X2 on X1 . In Section 3.30) and the facts that M2 X2 ¼ 0 and M2 e ¼ e that M2 y ¼ M2 X1 b1 þ e: As b1 ¼ bÃ . because then the indirect effect X1 ! X2 ! y is also present. if one is interested in the total effect of X1 on y. For instance. Further it follows from (3. On the other hand. this shows that eÃ ¼ e. Summary: To estimate the effect of X1 on y.2 Adding or deleting variables 147 X01 X1 b1 þ X01 X2 b2 ¼ X01 y X02 X1 b1 þ X02 X2 b2 ¼ X02 y (3:40) (3:41) From (3. This shows that bÃ ¼ b1 . then X2 should be deleted from the model. and we mention three of them. regressing y on X2 . and by substituting this in (3.4 we proved that X02 M1 X2 is invertible.2. this amounts to taking deviations from means. where M2 ¼ I À X2 (X02 X2 )À1 X02 . one can also include X2 as additional regressors in the model and regress y on X1 and X2 . If we premultiply by M2.

E XM301BWA Example 3. . so that k À g ¼ 1 and X2 contains the remaining k À 1 variables.1. Econometrica. ‘Partial Time Regressions as Compared with Individual Trends’. Then both M2 X1 and M2 y have one column and one can visualize the relation between these columns by drawing a scatter plot.C @.7.30). .A 1 n Then the ﬁrst step in partial regression amounts to removing the (linear) trends from y and the columns of X1 . as follows. so that the elements of M2 y are (y1 À y). . This equals the slope parameter of X1 in the multiple regression equation (3. This is called a partial regression scatter plot.1. This case was the subject of the article by R. Frisch and F. with results in Exhibit 3. a constant and a trend. Waugh.5. and x3i the logarithm of the begin salary of the ith employee. 1 1 1 B1 2C X2 ¼ B Á . In fact we have already met this kind of formula in Chapter 2 — for instance. and (ii) the above-mentioned Case 3. 0 Case 3: Single partial relation Let X1 consist of a single variable. The result of Frisch–Waugh states that inclusion of a constant term gives the same results as a regression where all variables are expressed in deviation from their means. This is the regression of the illustration in Section 3. we illustrate some of the foregoing results for the model yi ¼ b1 þ b2 x2i þ b3 x3i þ ei . 1 (1933). V. Case 2: Detrending Let X2 consist of two columns. 387–401. Á Á Á . . (yn À y). in formula (2.7. where yi denotes the logarithm of yearly salary. Panel 3.148 3 Multiple Regression P gives an estimated coefﬁcient (X02 X2 )À1 X02 y ¼ 1 yi ¼ y with residuals n yi À y. We will now consider (i) the above-mentioned Case 1.8) for the least squares slope estimator. x2i the education. The slope of the regression line in this plot is b1 .3: Bank Wages (continued) Using the data on bank wages of the illustration in Section 3. .

003885 R-squared 0.650406 0.71973 LOGSALBEGIN 5. Error C 0.89166 Exhibit 3.023998 Regression 6: Dependent Variable: EDUC Variable Coefﬁcient C À40. Error RESEDUC 0.89166 Regression 5: Dependent Variable: LOGSAL Variable Coefﬁcient Std. .705383 0.E.023122 0.868505 0.2 Adding or deleting variables 149 Regression 1: Dependent Variable: LOGSAL Variable Coefﬁcient Std. Error 2.132505 Regression 3: Dependent Variable: LOGSALBEGIN Variable Coefﬁcient Std. The residuals of these two regressions (which correspond to the variables LOGSAL and EDUC where the effect of LOGSALBEGIN has been eliminated and which are denoted by RESLOGSAL and RESEDUC) are related in Regression 7.018250 Regression 2: Dependent Variable: EDUC Variable Coefﬁcient C 13. Regressions 5 and 6 determine the effect of LOGSALBEGIN on EDUC and LOGSAL.023122 0. Error DMEDUC 0.669405 0. Regressions 1–3 determine the effect of the constant term on the variables LOGSAL.606476 Std. The residuals of these regressions (which correspond to taking the original observations in deviation from their sample mean and which are denoted by DM) are related in Regression 4. Error C 9.031801 R-squared 0.003890 DMLOGSALBEGIN 0.069658 Adjusted R-squared 0.998139 0. and LOGSALBEGIN.232198 LOGSALBEGIN 0.3) Two illustrations of partial regressions.800157 S. EDUC. Error 0.069658 S.177624 Sum squared resid 14.273920 Regression 7: Dependent Variable: RESLOGSAL Variable Coefﬁcient Std. of regression 0.E.49156 Std.35679 0. of regression 0. Error C 10.800579 Adjusted R-squared 0.177436 Sum squared resid 14.10 Bank Wages (Example 3.016207 Regression 4: Dependent Variable: DMLOGSAL Variable Coefﬁcient Std.3.

5. and then the demeaned y is regressed on the two demeaned variables x2 and x3 . and in the second step M2 y is regressed on M2 X1 . On the vertical axis are the residuals of the regression of log salary on a constant and log begin salary and on the horizontal axis are the residuals of the regression of education on a constant and log begin salary. Panel 3.26). . we see that the regression coefﬁcients are equal. there is a small difference in the calculated standard errors (see Exercise 3.0 0. and let X1 be the 474 Â 1 vector containing the values of x2 (education). If we compare the results of Regression 4 in Exhibit 3. Panel 3. In the ﬁrst step all variables are regressed on a constant.10.10.3) Partial regression scatter plot of (logarithmic) salary against education.9). This is shown in Regressions 5–7 in Exhibit 3.5 RESLOGSAL 0.10 with those of the unrestricted regression in Exhibit 3. where the variables are expressed in deviations from their sample mean.0 −0. The last regression corresponds to the model M2 y ¼ (M2 X1 )bÃ þ eÃ in the result of Frisch–Waugh.5 −10 −5 RESEDUC 0 5 Exhibit 3. This result states that the estimated coefﬁcient in this RESLOGSAL vs. (ii) Direct effect of education on salary Next we consider Case 3 above and give a partial regression interpretation of the coefﬁcient b2 ¼ 0:023 in Exhibit 3. let X2 be the 474 Â 2 matrix with a column of ones and with the values of x3 (begin salary) in column 2. In terms of the model y ¼ X1 b1 þ X2 b2 þ e in (3.5.11 Bank Wages (Example 3. To remove the effects of the other variables.150 3 Multiple Regression (i) Deviation from mean We ﬁrst consider Case 1 above. This is shown in Regressions 1–4 in Exhibit 3. with regression line. However. RESEDUC 1. y and X1 are ﬁrst regressed on X2 with residuals M2 y and M2 X1 . The slope of the regression line in the ﬁgure indicates the direct effect of education on log salary after neutralizing for the indirect effect via log begin salary. for the estimated ‘direct effect’ of education on salary for ‘ﬁxed’ begin salary.

9. which is veriﬁed by comparing Regression 7 in Exhibit 3. E: 3.16. Panel 3. E Exercises: T: 3. where RESLOGSAL denotes M2 y and RESEDUC denotes M2 X1 . 3. The corresponding partial regression scatter plot is shown in Exhibit 3. .2 Adding or deleting variables 151 regression is equal to the coefﬁcient in the multiple regression model.3.11.18.10 with the result in Exhibit 3.5.

By standardization we get bj À bj pﬃﬃﬃﬃﬃ $ N(0. Its mean and variance are given by (3. as the variance s2 is unknown.2. Let w $ N(0. Then Aw $ N(0. we test the null hypothesis H0 : bj ¼ 0 against the alternative H1 : bj 6¼ 0.2.4. 1. 32 and 34–5). Therefore s is replaced by s. 1.22). and let A be a given m Â n matrix and Q a given n Â n symmetric and idempotent matrix. 1): s ajj This expression cannot be used to test whether bj ¼ 0. where ajj is the jth diagonal element of (X0 X)À1 . To check whether the jth explanatory variable has a signiﬁcant effect on y.1. AA0 ) and w0 Qw $ w2 (r) where r ¼ tr(Q). To derive the distribution of the resulting test statistic we use the following results of Section 1. A ¼ (X0 X)À1 X0 . so that b $ N(b.2. Derivation of t-test For this purpose we suppose that Assumptions 1–7 hold true.3 (p.1 The t-test E Uses Sections 1.3 The accuracy of estimates 3.18) and (3. We apply these results with w ¼ (1=s)e. under these assumptions. I) be a n Â 1 vector of independent N(0. As the least squares estimator b is a linear function of e.3. b is normally distributed.19). 1) variables. Note that . it follows that.3. and these two random variables are independently distributed when AQ ¼ 0. where s2 is the unbiased estimator of s2 deﬁned in (3.152 3 Multiple Regression 3. we can test its statistical signiﬁcance. Test of significance To test whether we should include a variable in the model or not. and Q ¼ M ¼ I À X(X0 X)À1 X0 with tr(M) ¼ n À k.4. s2 (X0 X)À1 ): (3:42) T The variance of the jth component bj of the least squares estimator b is equal to s2 ajj .

The t-test pﬃﬃﬃﬃﬃ Let sj ¼ s ajj be the standard error of bj .3. we reject the null hypothesis if jtj > c where c is the signiﬁcance level deﬁned by P[jtj > c] where t $ t(n À k). Further. in this case the size of the test should be chosen small enough to protect ourselves from a large probability of an error of the ﬁrst type. we use the above test statistic with bj ¼ 0. or the test of (individual) signiﬁcance of bj . tj ¼ sj s ajj e0 e =(n À k) s2 that is.1 (p. Stated otherwise. which corresponds to bj ¼ 0. Of course. That is. As AM ¼ 0. the null hypothesis is rejected for P < 0:05 and it is not rejected for P > 0:05. it follows that b and Àep ﬃﬃﬃﬃﬃÁ 0 2 2 independently distributed. That is.3 The accuracy of estimates 153 b À b ¼ (X0 X)À1 X0 e ¼ Ae so that (b À b)=s ¼ Aw. to test the null hypothesis that bj ¼ 0 against the alternative that bj 6¼ 0. we compute the t-value tj ¼ bj bj ¼ pﬃﬃﬃﬃﬃ : sj s ajj (3:44) We reject the null hypothesis if tj differs signiﬁcantly from zero. Use of the t-test and the P-value As discussed in Section 2. 100). In general it is preferable to report the P-value of the test. Against the above two-sided alternative. 1). the null hypothesis is rejected only for small enough P-values of the test. and that e ¼ My ¼ 0 e are M(Xb þ e) ¼ Me so that e0 e=s2 ¼ w0 Mw.3. However. e e=s $ w (n À k) and (bj À bj )= s ajj $ N(0. we should do this only if there exists sufﬁcient evidence for this effect. for a size of 5 per cent we can use c ¼ 2 as a rule of thumb. if we want to establish an effect of xj on y. tj follows the t(n À k) distribution. then À pﬃﬃﬃﬃﬃÁ bj À bj bj À bj (bj À bj )= s ajj ¼ pﬃﬃﬃﬃﬃ ¼ rﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃ $ t(n À k). which is accurate if n À k is not very small (say n À k > 30). This is called the t-test. In some situations smaller signiﬁcance levels are used (especially in large . If the null hypothesis bj ¼ 0 is true. (3:43) The t-value and significance To test whether xj has no effect on y. For a signiﬁcance level of 5 per cent. tj follows the t(n À k) distribution. then we hope to be able to reject the null hypothesis. and as both terms are independent their quotient has by deﬁnition the Student t-distribution with (n À k) degrees of freedom.

(i) Regression outcomes and t-tests Panel 1 in Exhibit 3. the sign of signiﬁcant coefﬁcients (indicating whether the corresponding regressor has a positive or a negative effect on y). and the column ‘t-Statistic’ the t-values tj ¼ bj =sj . The column ‘Coefﬁcient’ contains the regression coefﬁcients bj . (ii) presentation of the regression results. education. Of particular interest are . Error’ the standard errors sj . and (iii) results of the model with two additional regressors (gender and minority). . and their P-value Pj ¼ P[jtj > jtj j] where t has the t(n À k)-distribution.2 Illustration: Bank Wages E XM301BWA We consider again the salary data and the linear model with k ¼ 3 explanatory variables (a constant. if t follows the t(471) distribution and c is the outcome of the t-statistic. Summary of computations In regression we usually compute . In this example with n ¼ 474 and k ¼ 3. The P-value .154 3 Multiple Regression samples).12 shows the outcomes of regressing salary (in logarithms) on a constant and the explanatory variables education and begin salary (the last again in logarithms). We will discuss (i) the regression outcomes and t-tests.3. the regression coefﬁcients b ¼ (X0 X)À1 X0 y. . j ¼ 1. the P-value of the hypothesis that bj ¼ 0 against the twosided alternative that bj 6¼ 0. the signiﬁcance of the regressors (measured by the P-values). and the logarithm of begin salary) discussed in Example 3. pﬃﬃﬃﬃﬃ their t-value tj ¼ bj =sj . the size of the coefﬁcients (which can only be judged properly in combination with the measurement scale of the corresponding regressor). 3. then the P-value is deﬁned as the (two-sided) probability P(jtj > jcj). . Other statistics like R2 may also be of interest. and in other situations sometimes larger signiﬁcance levels are used (for instance in small samples). Á Á Á . . k. as well as other statistics that will be discussed later in the book. for each of the coefﬁcients bj . their standard error sj ¼ s ajj. The column denoted by ‘Prob’ contains the P-values corresponding to the t-values in the preceding column — that is. the standard error of regression s.3. the column ‘Std.

3.3 The accuracy of estimates

155

Panel 1: Dependent Variable: LOGSAL Method: Least Squares Sample: 1 474 Included observations: 474 Variable Coefﬁcient Std. Error t-Statistic C 1.646916 0.274598 5.997550 EDUC 0.023122 0.003894 5.938464 LOGSALBEGIN 0.868505 0.031835 27.28174 R-squared 0.800579 Mean dependent var Adjusted R-squared 0.799733 S.D. dependent var S.E. of regression 0.177812 Sum squared resid 14.89166 Panel 2: Dependent Variable: LOGSAL Method: Least Squares Sample: 1 474 Included observations: 474 Variable Coefﬁcient Std. Error t-Statistic C 2.079647 0.314798 6.606288 EDUC 0.023268 0.003870 6.013129 LOGSALBEGIN 0.821799 0.036031 22.80783 GENDER 0.048156 0.019910 2.418627 MINORITY À0.042369 0.020342 À2.082842 R-squared 0.804117 Mean dependent var Adjusted R-squared 0.802446 S.D. dependent var S.E. of regression 0.176603 Sum squared resid 14.62750

Exhibit 3.12 Bank Wages (Section 3.3.2)

Prob. 0.0000 0.0000 0.0000 10.35679 0.397334

Prob. 0.0000 0.0000 0.0000 0.0160 0.0378 10.35679 0.397334

Results of two regressions. Panel 1 shows the regression of salary (in logarithms) on education and begin salary (in logarithms) and Panel 2 shows the results when gender and minority are included as additional explanatory variables. The column ‘Prob’ contains the P-values for the null hypothesis that the corresponding parameter is zero against the two-sided alternative that it is non-zero.

requires Assumptions 1–7 and, in addition, that the null hypothesis bj ¼ 0 is true. All parameters are highly signiﬁcant.

(ii) Presentation of regression results There are several conventions to present regression results in the form of an equation. For example, similar to what was done in Example 2.9 (p. 102), the parameter estimates can be reported together with their t-values (in parentheses) in the form

y ¼ 1:647 þ 0:023 x2 þ 0:869 x3 þ e: (5:998) (5:938) (27:282)

Sometimes the parameter estimates are reported together with their standard errors. Many readers are interested in the question whether the estimates are

156

3 Multiple Regression

signiﬁcantly different from zero. These readers almost automatically start to calculate the t-values themselves. So it is friendly to them to present the tvalues right away. In some cases, however, the null hypothesis of interest is different from zero. In such a case the t-values give the wrong answers and extra calculations are required. These calculations are simpler if standard errors are presented. Those who prefer interval estimates are also better served by reporting standard errors. The obvious way out seems to report both the t-values and the standard errors, but this requires more reporting space. In any case, one should always clearly mention which convention is followed.

(iii) Two additional regressors As compared with the illustration in Section 3.1.7, we now extend the set of explanatory variables with x4 (gender) and x5 (minority). Panel 2 of Exhibit 3.12 shows the regression outcomes when these variables are added. On the basis of the t-test, both the variable x4 and the variable x5 have signiﬁcant effects (at 5 per cent signiﬁcance level). Note that, if we add variables, the coefﬁcients of the other variables change also. This is because the explanatory variables are correlated with each other — that is, in the notation of Section 3.2.1 we have X01 X2 6¼ 0 (see (3.31) and (3.32)). For instance, the additional regressor gender is correlated with the regressors education and begin salary, with correlation coefﬁcients 0.36 and 0.55 respectively. Using the notation of the result of Frisch–Waugh, to guarantee that bÃ ¼ b1 we should not simply regress y on X1 (as in Panel 1 of Exhibit 3.12), but instead we should regress M2 y on M2 X1 . If important variables like x4 and x5 are omitted from the model, this may lead to biased estimates of direct effects, as was discussed in Section 3.2.3.

**3.3.3 Multicollinearity Factors that affect significance
**

It may happen that bj 6¼ 0 but that the t-test cannot reject the hypothesis that bj ¼ 0. The estimate bj is then not accurate enough — that is, its standard error is too large. In this case the t-test does not have enough power to reject the null hypothesis. To analyse the possible causes of such a situation we decompose the variance of the least squares estimators in terms of a number of components. We will derive the result in three steps, ﬁrst for the mean, then for the simple regression model, and ﬁnally for the multiple regression model.

3.3 The accuracy of estimates

157

**First case: Sample mean
**

We start with the simplest possible example of a matrix X that consists of one column of unit elements. In this case we have b ¼ y and var(b) ¼ s2 : n

We see that for a given required accuracy there is a trade-off between s2 and n. If the disturbance variance s2 is large — that is, if there is much random variation in the outcomes of y — then we need a large sample size to obtain a precise estimate of b.

**Second case: Simple regression
**

Next we consider the simple regression model studied in Chapter 2, yi ¼ a þ bxi þ ei : For the least squares estimator b discussed there, the variance is given by var(b) ¼ P Here we use the expression P s2 x ¼ (xi À x)2 nÀ1 s2 s2 ¼ : (xi À x)2 (n À 1)s2 x (3:45)

for the sample variance of x. For a given required accuracy we now see a tradeoff between three factors: a large disturbance s2 can be compensated for by either a large sample size n or by a large variance s2 x of the explanatory variable. More variation in the disturbances ei gives a smaller accuracy of the estimators whereas more observations and more variation in the regressor xi lead to a higher accuracy.

General case: Multiple regression (derivation) Finally we look at the general multiple regression model. We concentrate on one regression coefﬁcient and without loss of generality we choose the last one, since it is always possible to change the order of the columns of X. We use the notation introduced in Section 3.2. In the current situation g ¼ 1 so that the n Â g matrix X2 reduces to an n Â 1 vector that we will denote by x2. The n Â (k À 1)

T

158

3 Multiple Regression

matrix X1 corresponds to the ﬁrst (k À 1) regressors. We concentrate on the single parameter b2 in the model y ¼ X1 b1 þ X2 b2 þ e ¼ X1 b1 þ b2 x2 þ e: Substituting this in (3.37) and b2 ¼ b2 þ (x02 M1 x2 )À1 x02 M1 e, so that using M1 X1 ¼ 0, it follows that

var(b2 ) ¼ s2 (x02 M1 x2 )À1 :

(3:46)

Here M1 x2 has one column and x02 M1 x2 is the residual sum of squares of the auxiliary regression x2 ¼ X1 P þ M1 x2 (see (3.35)). As R2 ¼ 1 À (SSR=SST ), we may write SSR ¼ SST (1 À R2 ): If we apply this result to the auxiliary P regression x2 ¼ X1 P þ M1 x2 we may substitute SSR ¼ x02 M1 x2 and SST ¼ (x2i À x2 )2 ¼ (n À 1)s2 x2 . Denoting the R2 of this auxiliary regression by R2 a we obtain the following result.

**The effect of multicollinearity
**

In the multiple regression model the variance of the last regression coefﬁcient (denoted by b2 ) may be decomposed as var(b2 ) ¼ s2 : 2 (n À 1)s2 x2 (1 À Ra )

If we compare this with (3.45), we see three familiar factors and a new one, 2 2 (1 À R2 a ). So var(b2 ) increases with Ra and it even explodes if Ra " 1. This is called the multicollinearity problem. If x2 is closely related to the remaining regressors X1 , it is hard to estimate its isolated effect accurately. Indeed, if R2 a is large, then x2 is strongly correlated with the set of variables in X1 , so that the ‘direct’ effect of x2 on y (that is, b2 ) is accompanied by strong side effects via X1 on y. Rewriting the above result for an arbitrary column of X (except the intercept), we get var(bj ) ¼ s2 , 2 (n À 1)s2 xj (1 À Rj ) (j ¼ 2, Á Á Á , k), (3:47)

2 where R2 j denotes the R of the auxiliary regression of the jth regressor variable on the remaining (k À 1) regressors (including the constant term)

3.3 The accuracy of estimates

159

and s2 xj is the sample variance of xj . So, accurate estimates of ‘direct’ or ‘partial’ effects are obtained for large sample sizes, large variation in the relevant explanatory variable, small error variance, and small collinearity with the other explanatory variables. The factor 1=(1 À R2 j ) is called the variance inﬂation factor — that is, the factor by which the variance increases because of collinearity of the jth regressor with the other (k À 1) regressors.

Interpretation of results

In many applications we hope to ﬁnd signiﬁcant estimates of the partial effects of the explanatory variables. If some of the t-values of the regression coefﬁcients are small, this may possibly be caused by high correlations among the explanatory variables, measured by the coefﬁcients of determination R2 j . One method to improve the signiﬁcance is to get more data, if this is possible. However, if the purpose of the model would be to estimate the total effects of some of the variables (as opposed to partial effects), then another solution is to drop some of the other explanatory variables. In some applications the individual parameters may not be of so much interest — for instance, in prediction. Then multicollinearity is not a very relevant issue, but it may be of interest to compare the forecast quality of the full model with that of restricted versions where some of the explanatory variables are omitted. Methods to choose the number of explanatory variables in prediction will be discussed later (see Section 5.2.1).

E

Exercises: S: 3.12; E: 3.14d.

**3.3.4 Illustration: Bank Wages
**

To illustrate the factors that affect the standard errors of least squares estimates we consider once again the bank wage data. Panel 1 of Exhibit 3.13 shows once more the regression of salary on ﬁve explanatory variables (see also Panel 2 of Exhibit 3.12). The standard errors of the estimated parameters are relatively small, but it is still of interest to decompose these errors as in (3.47) to see if this is only due to the fact that the number of observations n ¼ 474 is quite large. The values R2 j of the auxiliary regres2 2 ¼ 0 : 47 (shown in Panel 2), R sions are equal to R2 2 3 ¼ 0:59, R4 ¼ 0:33, and 2 R2 5 ¼ 0:07. Recall from Section 3.1.6 that R is the square of a correlation coefﬁcient, so that these outcomes cannot directly be compared to the (bivariate) correlations that are also reported in Panel qﬃﬃﬃﬃﬃﬃ3 of Exhibit 3.13. R2 Therefore Panel 3 also contains the values of Rj ¼ j and of the square ﬃ qﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃ 2 root of the variance inﬂation factors 1= 1 À Rj that affect the standard

E

XM301BWA

160

3 Multiple Regression

Panel 1: Dependent Variable: LOGSAL Method: Least Squares Sample: 1 474 Included observations: 474 Variable Coefﬁcient Std. Error t-Statistic C 2.079647 0.314798 6.606288 EDUC 0.023268 0.003870 6.013129 LOGSALBEGIN 0.821799 0.036031 22.80783 GENDER 0.048156 0.019910 2.418627 MINORITY À0:042369 0.020342 À2:082842 R-squared 0.804117 Mean dependent var Adjusted R-squared 0.802446 S.D. dependent var S.E. of regression 0.176603 Sum squared resid 14.62750 Panel 2: Dependent Variable: EDUC Method: Least Squares Variable Coefﬁcient C À41:59997 LOGSALBEGIN 5.707538 GENDER À0:149278 MINORITY À0:071606 R-squared 0.470869 Panel 3 EDUC Rj 2 0.470869 0.6862 p Rj 1.3747 1= (1 À Rj 2 ) EDUC 1.000000 LOGSALBEGIN 0.685719 GENDER 0.355986 MINORITY À0.132889

Prob. 0.0000 0.0000 0.0000 0.0160 0.0378 10.35679 0.397334

Std. Error 3.224768 0.339359 0.237237 0.242457

t-Statistic À12:90014 16.81859 À0:629237 À0:295337

Prob. 0.0000 0.0000 0.5295 0.7679

LOGSALBEGIN 0.592042 0.7694 1.5656 1.000000 0.548020 À0.172836

GENDER MINORITY 0.330815 0.071537 0.5752 0.2675 1.2224 1.0378 1.000000 0.075668

1.000000

Exhibit 3.13 Bank Wages (Section 3.3.4)

Panel 1 shows the regression of salary (in logarithms) on a constant, education, begin salary (in logarithms), gender, and minority. Panel 2 shows the regression of one of the explanatory variables (EDUC) on the other ones, with corresponding coefﬁcient of determination. Similar regressions are performed (but not shown) and the corresponding R2 are reported in Panel 3, together with the values of R and of the square root of the variance inﬂation factors. For comparison, Panel 3 also contains the pairwise sample correlations between the explanatory variables.

errors of bj in (3.47), for j ¼ 2, 3, 4, 5. The largest multiple correlation is ¼ﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃ 0:77 ﬃwith corresponding square root of the variance inﬂation factor R3q 1= 1 À R2 3 ¼ 1:57. This shows that some collinearity exists. However, as the variance inﬂation factors are not so large, multicollinearity does not seem to be a very serious problem in this example.

3.4 The F-test

161

3.4 The F -test

3.4.1 The F -test in different forms

E

Uses Section 1.2.3, 1.4.1; Appendix A.2–A.4.

**Testing the joint significance of more than one coefficient
**

In Section 3.2 we considered the choice between the unrestricted model y ¼ X1 b 1 þ X 2 b 2 þ e with estimates y ¼ X1 b1 þ X2 b2 þ e, and the restricted model with b2 ¼ 0 and estimates y ¼ X1 bR þ eR . We may prefer to work with the simpler restricted model if b2 is small. The question is when b2 is small enough to do so, so that a measure is needed for the distance between b2 and 0. For this purpose the F-test is commonly used to test the null hypothesis that b2 ¼ 0. One computes the F-statistic to be deﬁned below and uses the restricted model if F does not exceed a certain critical value.

Derivation of the F -test To derive the F-test for H0 : b2 ¼ 0 against H1 : b2 6¼ 0, we use the result in (3.38), which states that, if b2 ¼ 0, b2 ¼ (X02 M1 X2 )À1 X02 M1 e: Under Assumptions 1–7 we conclude that E[b2 ] ¼ 0 and b2 $ N(0, V ), where V ¼ var(b2 ) ¼ s2 (X02 M1 X2 )À1 . Let V À1=2 be a symmetric matrix with the property that V À1=2 VV À1=2 ¼ I, the g Â g identity matrix. Such a matrix V À1=2 is called a square root of the matrix V À1 , and it exists because V is a positive deﬁnite matrix. As b2 $ N(0, V ), it follows that V À1=2 b2 $ N(0, I) — that is, the g components of V À1=2 b2 are independently distributed with standard normal distribution. By deﬁnition it follows that the sum of the squares of these components b02 V À1 b2 has the w2 (g) distribution. As V À1 ¼ sÀ2 X02 M1 X2 this means that b02 X02 M1 X2 b2 =s2 $ w2 (g), (3:48)

T

162

3 Multiple Regression

if the null hypothesis that b2 ¼ 0 is true. However, this still involves the unknown parameter s2 and hence it can not be used in practice. But if we divide it by the ratio e0 e=s2 (which follows a w2 (n À k) distribution (see Section 3.3.1)), and if we divide both the numerator and the denominator by their degrees of freedom, the two factors with the unknown s2 cancel and we obtain F¼ b02 X02 M1 X2 b2 =g : e0 e=(n À k) (3:49)

This follows an F(g, n À k) distribution, as it was shown in Section 3.3.1 that s2 ¼ e0 e=(n À k) and the least squares estimator b (and hence also b2 ) are independent (for an alternative proof see Exercise 3.7). Using (3.34) we see that b02 X02 M1 X2 b2 ¼ e0R eR À e0 e, so that F may be computed as follows.

**Basic form of the F -test
**

F¼ (e0R eR À e0 e)=g $ F(g, n À k): e0 e=(n À k) (3:50)

So the smaller model with b2 ¼ 0 is rejected if the increase in the sum of squared residuals e0R eR À e0 e is too large. The null hypothesis that b2 ¼ 0 is rejected for large values of F — that is, this is a one-sided test (see Exhibit 3.14). A geometric impression of the equality of the two forms (3.49) and (3.50) of the F-test is given in Exhibit 3.15. This equality can be derived from the theorem of Pythagoras, as is explained in the text below the exhibit.

F(g, n − k)

Exhibit 3.14 P-value

F-test on parameter restrictions, where g is the number of restrictions under the null hypothesis, n is the total number of observations, and k is the total number of regression parameters in the unrestricted model. The P-value is equal to the area of the shaded region in the right tail, and the arrow on the horizontal axis indicates the calculated F-value.

3.4 The F-test

163

S(X2)

y X2b2

eR e

M1X2b2

**X1b1 + X2b2 M1X2b2 X1b1 X1bR
**

S(X1)

Exhibit 3.15 Geometry of F-test

Three-dimensional geometric impression of the F-test for the null hypothesis that the variables X2 are not signiﬁcant. The projection of y on the unrestricted model (which contains both X1 and X2 ) is given by X1 b1 þ X2 b2 with residual vector e. The projection of y on the restricted model (which contains only X1 ) is given by X1 bR with residual vector eR. The vectors eR and e are both orthogonal to the variables X1 , and hence the same holds true for the difference eR À e. This difference is the residual that remains after projection of X1 b1 þ X2 b2 on the space of the variables X1 — that is, eR À e ¼ M1 (X1 b1 þ X2 b2 ) ¼ M1 X2 b2 . As the vector e is orthogonal to X1 and X2 , it is also orthogonal to M1 X2 b2 . The theorem of Pythagoras implies that e0R eR ¼ e0 e þ (M1 X2 b2 )0 M1 X2 b2 ¼ e0 e þ b02 X02 M1 X2 b2 . The F-test for b2 ¼ 0 corresponds to testing whether the contribution M1 X2 b2 of explaining y in terms of X2 is signiﬁcant — that is, it tests whether the length of eR is signiﬁcantly larger than the length of e, or, equivalently, whether (e0R eR À e0 e) differs signiﬁcantly from 0.

**The F-test with $R^2$**

In the literature the F-test appears in various equivalent forms, and we now present some alternative formulations. Let $R^2$ and $R_R^2$ denote the coefficients of determination for the unrestricted model and the restricted model respectively. Then $e'e = SST(1 - R^2)$ and $e_R'e_R = SST(1 - R_R^2)$, where the total sum of squares is in both cases equal to $SST = \sum (y_i - \bar{y})^2$. Substituting this in (3.50) gives

$F = \dfrac{n-k}{g} \cdot \dfrac{R^2 - R_R^2}{1 - R^2}.$  (3.51)


So the restriction $\beta_2 = 0$ is not rejected if the $R^2$ does not decrease too much when this restriction is imposed. This method of comparing the $R^2$ of two models is preferred above the use of the adjusted $R^2$ of Section 3.1.6. This is because the F-test can be used to compute the P-value for the null hypothesis that $\beta_2 = 0$, which provides a more explicit basis to decide whether the decrease in fit is significant or not. The derivation of (3.51) from (3.50) makes clear that the $R^2$ or the adjusted $R^2$ can only be used to compare two models that have the same dependent variable. For instance, it makes no sense to compare the $R^2$ of a model where $y$ is the measured variable with another model where $y$ is the logarithm of the measured variable. This is because the total sums of squares (SST) of the two models differ — that is, explaining the variation of $y$ around its mean is something different from explaining the variation of $\log(y)$ around its mean.
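A quick check of the equivalence of (3.50) and (3.51), reusing the simulated data and the ssr helper from the sketch above:

```python
sst = np.sum((y - y.mean()) ** 2)             # total sum of squares
R2 = 1 - e_e / sst                            # unrestricted R-squared
R2_R = 1 - eR_eR / sst                        # restricted R-squared

F_r2 = (n - k) / g * (R2 - R2_R) / (1 - R2)   # equation (3.51)
assert np.isclose(F_r2, F)                    # same value as (3.50)
```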

F- and t-tests

The above F-statistics can be computed for every partition of the matrix $X$ into two parts $X_1$ and $X_2$. For instance, in the particular case that $X_2$ consists of only one column (so that $g = 1$), $F = t^2$ — that is, the F-statistic equals the square of the t-statistic, and in this case the F-test and the two-sided t-test always lead to the same conclusion (see Exercise 3.7).

**Test on the overall significance of the regression**

Several statistical packages present for every regression the F-statistic and its associated P-value for the so-called significance of the regression. This corresponds to a partitioning of $X$ into $X_1$ and $X_2$ where $X_1$ only contains the constant term (that is, $X_1$ is a single column consisting of unit elements) and $X_2$ contains all remaining columns (so that $g = k - 1$). If we denote the components of the $(k-1) \times 1$ vector $\beta_2$ by the scalar parameters $\beta_2, \cdots, \beta_k$, then the null hypothesis is that $\beta_2 = \beta_3 = \cdots = \beta_k = 0$, which means that none of the explanatory variables (apart from the constant term) has effect on $y$. So this tests whether the model makes any sense at all. In this case, $e_R = y - i\bar{y}$ and $e_R'e_R = SST$, so that $R_R^2 = 0$. For this special case the F-statistic can therefore be written as

$F = \dfrac{n-k}{k-1} \cdot \dfrac{R^2}{1 - R^2}.$

So there is a straightforward link between the F-test for the joint significance of all variables (except the intercept) and the coefficient of determination $R^2$.
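In the same sketch, the overall-significance statistic needs only the unrestricted $R^2$, since the restricted model contains just the constant so that $e_R'e_R = SST$ and $R_R^2 = 0$:

```python
# Restricted model contains only the constant, so e_R'e_R = SST and R_R^2 = 0.
F_overall = (n - k) / (k - 1) * R2 / (1 - R2)
p_overall = stats.f.sf(F_overall, k - 1, n - k)
```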


**Test of general linear restrictions**

Until now we have tested whether certain parameters are zero and we have decomposed the regression matrix $X = (X_1\ X_2)$ accordingly. An arbitrary set of linear restrictions on the parameters can be expressed in the form $R\beta = r$, where $R$ is a given $g \times k$ matrix with rank $g$ and $r$ is a given $g \times 1$ vector. We consider the testing problem

$y = X\beta + \varepsilon, \qquad H_0: R\beta = r,$  (3.52)

which imposes $g$ independent linear restrictions on $\beta$ under the null hypothesis. Examples are given in Section 3.4.2.
**

Derivation of the F-test

We can test these restrictions, somewhat in the spirit of the t-test, by estimating the unrestricted model and checking whether $Rb$ is sufficiently close to $r$. Under Assumptions 1–7, it follows that $b \sim N(\beta, \sigma^2(X'X)^{-1})$ (see (3.42)). Therefore $Rb - r \sim N(R\beta - r, \sigma^2 R(X'X)^{-1}R')$ and we reject the null hypothesis if $Rb - r$ differs significantly from zero. If the null hypothesis is true, then $Rb - r \sim N(0, \sigma^2 R(X'X)^{-1}R')$ and

$(Rb - r)'[\sigma^2 R(X'X)^{-1}R']^{-1}(Rb - r) \sim \chi^2(g).$  (3.53)


The unknown $\sigma^2$ drops out again if we divide by $e'e/\sigma^2$, which has the $\chi^2(n-k)$ distribution and which is independent of $b$ and hence also of the expression (3.53). By the definition of the F-distribution, this means that

$\dfrac{(Rb - r)'[R(X'X)^{-1}R']^{-1}(Rb - r)/g}{e'e/(n-k)}$  (3.54)

follows the $F(g, n-k)$ distribution if the null hypothesis is true. Expression (3.54) is not so convenient from a computational point of view. It is left as an exercise (see Exercise 3.8) to show that this F-test can again be written in terms of the sum of squared residuals (SSR) as in (3.50), where $e'e$ is the unrestricted SSR and $e_R'e_R$ is the SSR under the null hypothesis.
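The Wald form (3.54) can be checked numerically against the SSR form; the sketch below (same simulated data as before) encodes the two exclusion restrictions used earlier as $R\beta = r$:

```python
R = np.array([[0.0, 0.0, 1.0, 0.0],
              [0.0, 0.0, 0.0, 1.0]])          # beta_3 = beta_4 = 0, so g = 2 restrictions
r = np.zeros(2)

b = np.linalg.lstsq(X, y, rcond=None)[0]
XtX_inv = np.linalg.inv(X.T @ X)
d = R @ b - r
quad = d @ np.linalg.solve(R @ XtX_inv @ R.T, d)   # quadratic form in (3.54)
F_wald = (quad / g) / (e_e / (n - k))              # equals the F of (3.50)
assert np.isclose(F_wald, F)
```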

Summary of computations

A set of linear restrictions on the model parameters can be tested as follows. Let $n$ be the number of observations, $k$ the number of parameters of the unrestricted model, and $g$ the number of parameter restrictions under the null hypothesis (so that there are only $(k - g)$ free parameters in the restricted model).


Testing a set of linear restrictions

Step 1: Estimate the unrestricted model. Estimate the unrestricted model and compute the corresponding sum of squared residuals $e'e$.
Step 2: Estimate the restricted model. Estimate the restricted model under the null hypothesis and compute the corresponding sum of squared residuals $e_R'e_R$.
Step 3: Perform the F-test. Compute the F-test by means of (3.50), and reject the null hypothesis for large values of $F$. The P-values can be obtained from the fact that the F-test has the $F(g, n-k)$ distribution if the null hypothesis is true (provided that Assumptions 1–7 are satisfied).
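These three steps translate directly into a small helper function (a sketch; it assumes the restricted model is supplied through its own regressor matrix with $k - g$ freely varying columns, and it reuses the ssr helper defined above):

```python
def f_test(y, X, X_R):
    """Steps 1-3: F-test of a restricted model against an unrestricted one.

    X_R is the regressor matrix of the restricted model, reparametrized so
    that it has k - g freely varying columns.
    """
    n, k = X.shape
    g = k - X_R.shape[1]                          # number of restrictions
    e_e = ssr(y, X)                               # Step 1: unrestricted SSR
    eR_eR = ssr(y, X_R)                           # Step 2: restricted SSR
    F = ((eR_eR - e_e) / g) / (e_e / (n - k))     # Step 3: equation (3.50)
    return F, stats.f.sf(F, g, n - k)
```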


Exercises: T: 3.6c, d, 3.7e, f, 3.8, 3.10; E: 3.13, 3.15, 3.19a–e.


XM301BWA

**3.4.2 Illustration: Bank Wages**

As an illustration, we consider again the data discussed in previous examples on salary ($y$, in logarithms of yearly wage), education ($x_2$, in years), begin salary ($x_3$, in logarithms of yearly wage), gender ($x_4$, taking the value 0 for females and 1 for males), and minority ($x_5$, taking the value 0 for non-minorities and 1 for minorities). We will discuss (i) the results of various models for three data sets, (ii) the significance of the variable minority, (iii) the joint significance of the regression, (iv) the joint significance of gender and minority, and (v) the test whether gender and minority have the same effect.

(i) Results of various models for three data sets

Exhibit 3.16 summarizes results (the sum of squared residuals and the coefficient of determination) of regressions in the unrestricted model

$y = \beta_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_5 x_5 + \varepsilon$

(see Panel 1) and in several restricted versions corresponding to different restrictions on the parameters $\beta_i$, $i = 1, \cdots, 5$ (see Panel 2). Most of the results of the unrestricted regression in Panel 1 of Exhibit 3.16 were already reported in Panel 1 of Exhibit 3.13 (p. 160). In Panel 2 of Exhibit 3.16 the models are estimated for different data sets. One version uses the data of all 474 employees, a second one the data of the employees with custodial jobs (job category 2), and a third one the data of the employees with management jobs (job category 3). Some of the regressions cannot be performed for the second version. The reason is that all employees with a custodial job are male, so that $x_4 = 1$ for all employees in job category 2.


Panel 1: Dependent Variable: LOGSAL
Method: Least Squares
Sample: 1 474
Included observations: 474

Variable        Coefficient   Std. Error   t-Statistic   Prob.
C                  2.079647     0.314798      6.606288   0.0000
EDUC               0.023268     0.003870      6.013129   0.0000
LOGSALBEGIN        0.821799     0.036031     22.80783    0.0000
GENDER             0.048156     0.019910      2.418627   0.0160
MINORITY          -0.042369     0.020342     -2.082842   0.0378

R-squared 0.804117; Adjusted R-squared 0.802446; S.E. of regression 0.176603; Sum squared resid 14.62750; Mean dependent var 10.35679; S.D. dependent var 0.397334; F-statistic 481.3211; Prob(F-statistic) 0.000000

Panel 2
X-variables                   ALL (n = 474)       JOBCAT 2 (n = 27)    JOBCAT 3 (n = 84)
                              SSR       R²        SSR       R²         SSR       R²
1                             74.6746   0.0000    0.1274    0.0000     5.9900    0.0000
1 2                           38.4241   0.4854    0.1249    0.0197     4.8354    0.1928
1 2 3                         14.8917   0.8006    0.1248    0.0204     3.1507    0.4740
1 2 3 4                       14.7628   0.8023    ----      ----       3.1263    0.4781
1 2 3 5                       14.8100   0.8017    0.1224    0.0391     3.0875    0.4846
1 2 3 4 5 (β4 + β5 = 0)       14.6291   0.8041    ----      ----       3.1503    0.4741
1 2 3 4 5 (unrestricted)      14.6275   0.8041    ----      ----       3.0659    0.4882

Exhibit 3.16 Bank Wages (Section 3.4.2)

Summary of outcomes of regressions where the dependent variable (logarithm of salary) is explained in terms of different sets of explanatory variables. Panel 1 shows the unrestricted regression in terms of five explanatory variables (including a constant term). In Panel 2, the explanatory variables (X) are denoted by their index 1 (the constant term), 2 (education), 3 (logarithm of begin salary), 4 (gender), and 5 (minority). The significance of explanatory variables can be tested by F-tests using the SSR (sum of squared residuals) or the $R^2$ (coefficient of determination) of the regressions. The column 'X-variables' indicates which variables are included in the model (in the sixth row all variables are included and the parameter restriction is that $\beta_4 + \beta_5 = 0$). The models are estimated for three data sets, for all 474 employees, for the twenty-seven employees in job category 2 (custodial jobs), and for the eighty-four employees in job category 3 (management jobs).

Therefore the variable $x_4$ should not be included in this second version of the model, as $x_4 = x_1 = 1$ and this would violate Assumption 1. With the results in Exhibit 3.16, we will perform four tests, all for the data set of all 474 employees. We refer to Exercise 3.13 for the analysis of similar questions for the sub-samples of employees with management or custodial jobs.

(ii) Significance of minority

Here the unrestricted model contains a constant term and the variables $x_2$, $x_3$, $x_4$, and $x_5$, and we test $H_0: \beta_5 = 0$ against $H_1: \beta_5 \neq 0$. This corresponds to (3.52) with $k = 5$ and $g = 1$ and with $R = (0, 0, 0, 0, 1)$ and $r = 0$. This restriction can be tested by the t-value of $b_5$ in Panel 1 of Exhibit 3.16. It is equal to $-2.083$ with P-value 0.038, so that the hypothesis is rejected at the 5 per cent level of significance.


As an alternative, we can also compare the residual sum of squares $e'e = 14.6275$ in the unrestricted model (see the last row in Panel 2 of Exhibit 3.16) with the restricted sum of squares $e_R'e_R = 14.7628$ (see the row with $x_1$, $x_2$, $x_3$, and $x_4$ included in Panel 2 of Exhibit 3.16) and compute the F-test

$F = \dfrac{(14.7628 - 14.6275)/1}{14.6275/(474 - 5)} = 4.338$

with corresponding P-value of 0.038. The 5 per cent critical value of the $F(1, 469)$ distribution is 3.84, so that the null hypothesis is rejected at 5 per cent significance level. Note that $\sqrt{4.338} = 2.083$ is equal (in absolute value) to the t-value of $b_5$, that $\sqrt{3.84} = 1.96$ is the two-sided 5 per cent critical value of the $t(469)$ distribution, and that the P-values of the t-test and the F-test are equal. If we substitute the values $R^2 = 0.8041$ and $R_R^2 = 0.8023$ into (3.51), then the same value for $F$ is obtained.
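These numbers can be reproduced directly from the reported sums of squared residuals (a quick check; the figures are taken from Exhibit 3.16):

```python
F5 = ((14.7628 - 14.6275) / 1) / (14.6275 / (474 - 5))   # ≈ 4.338
p5 = stats.f.sf(F5, 1, 469)                              # ≈ 0.038
abs_t5 = F5 ** 0.5                                       # ≈ 2.083, the |t|-value of b5
```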

(iii) Significance of the regression

Now we test the joint significance of all explanatory variables by testing the null hypothesis that $\beta_2 = \beta_3 = \beta_4 = \beta_5 = 0$. In this case there are $g = 4$ independent restrictions and in terms of (3.52) we have

$R = \begin{pmatrix} 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 1 \end{pmatrix}, \qquad r = \begin{pmatrix} 0 \\ 0 \\ 0 \\ 0 \end{pmatrix}.$

Using the values of the sum of squared residuals in Panel 2 of Exhibit 3.16, the F-statistic becomes

$F = \dfrac{(74.6746 - 14.6275)/4}{14.6275/(474 - 5)} = 481.321.$

The 5 per cent critical value of $F(4, 469)$ is 2.39 and so this hypothesis is strongly rejected. Note that the value of this F-test has already been reported in the regression table in Panel 1 in Exhibit 3.16, with a P-value that is rounded to zero.

(iv) Joint significance of gender and minority

Next we test the null hypothesis that $\beta_4 = \beta_5 = 0$. This corresponds to (3.52) with $k = 5$ and $g = 2$ and with

$R = \begin{pmatrix} 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 1 \end{pmatrix}, \qquad r = \begin{pmatrix} 0 \\ 0 \end{pmatrix}.$


To perform this test for the joint significance of the variables $x_4$ and $x_5$, we use row 3 (for the restricted model) and row 7 (for the unrestricted model) in Exhibit 3.16, Panel 2, and find (using the $R^2$ this time)

$F = \dfrac{(0.8041 - 0.8006)/2}{(1 - 0.8041)/(474 - 5)} = 4.190.$

The P-value with respect to the $F(2, 469)$ distribution is equal to $P = 0.016$. So, at 5 per cent significance level we reject the null hypothesis.

(v) Test whether gender and minority have the same effect

In the unrestricted model the variable gender ($x_4$) has a positive coefficient (0.048). As $x_4 = 0$ for females and $x_4 = 1$ for males, this means that, on average, males have higher salaries than females (for the same education, begin salary, and minority classification). Further, the variable minority has a negative coefficient ($-0.042$). As $x_5 = 1$ for minorities and $x_5 = 0$ for non-minorities, this means that, on average, minorities have lower salaries than non-minorities (for the same education, begin salary, and gender). As the two estimated effects are nearly of equal magnitude, we will test whether the advantage of males is equally large as the advantage of non-minorities. This corresponds to the null hypothesis that $\beta_4 = -\beta_5$, or, equivalently, $\beta_4 + \beta_5 = 0$. In terms of (3.52), we have $k = 5$, $g = 1$, $R = (0, 0, 0, 1, 1)$, and $r = 0$. Using the last two rows in Exhibit 3.16, Panel 2, we get (in terms of SSR)

$F = \dfrac{(14.6291 - 14.6275)/1}{14.6275/(474 - 5)} = 0.051$

with a P-value of $P = 0.821$. So this hypothesis is not rejected — that is, the two factors of discrimination (gender and minority) seem to be of equal magnitude.

3.4.3 Chow forecast test


Uses Appendix A.2–A.4.

**Evaluation of predictive performance: Sample split**

One of the possible practical uses of a multiple regression model is to produce forecasts of the dependent variable for given values of the explanatory variables. It is, therefore, of interest to evaluate an estimated regression model by studying its predictive performance out of sample. For this purpose


Exhibit 3.17 Prediction

The full sample ($n + g$ observations) is split into two non-overlapping parts, the estimation sample ($n$ observations) with observations that are used to estimate the model, and the prediction sample ($g$ observations). The estimated model is used to forecast the values in the prediction sample, which can be compared with the actually observed values in the prediction sample.

the full sample is split into two parts, an estimation sample with n observations used to estimate the parameters, and a prediction sample with g additional observations used for the evaluation of the forecast quality of the estimated model. This is illustrated in Exhibit 3.17.

Notation

The data in the estimation sample are denoted by $y_1$ and $X_1$, where $y_1$ is an $n \times 1$ vector and $X_1$ an $n \times k$ matrix. The data in the prediction sample are denoted by $y_2$ and $X_2$, where $y_2$ is a $g \times 1$ vector and $X_2$ a $g \times k$ matrix. Note that this notation of $X_1$ and $X_2$ differs from the one used until now. That is, now the rows of the matrix $X$ are partitioned instead of the columns. We can write

$X = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix}, \qquad y = \begin{pmatrix} y_1 \\ y_2 \end{pmatrix},$

where $X$ is an $(n+g) \times k$ matrix and $y$ is an $(n+g) \times 1$ vector. Since we use $y_1$ and $X_1$ for estimation, we assume that $n > k$, whereas $g$ may be any positive integer. For the DGP over the full sample we suppose that Assumptions 1–7 are satisfied, so that $y_1 = X_1\beta + \varepsilon_1$ and $y_2 = X_2\beta + \varepsilon_2$, with $E[\varepsilon_1\varepsilon_1'] = \sigma^2 I$, $E[\varepsilon_2\varepsilon_2'] = \sigma^2 I$, and $E[\varepsilon_1\varepsilon_2'] = 0$.

**Prediction and prediction error**

The estimate of $\beta$ is based on the estimation sample and is given by

$b = (X_1'X_1)^{-1}X_1'y_1.$

This estimate is used to predict the values of $y_2$ by means of $X_2b$, with resulting prediction error

$f = y_2 - X_2b.$  (3.55)

It is left as an exercise (see Exercise 3.7) to show that $X_2b$ is the best linear unbiased predictor of $y_2$ in the sense that it minimizes the variance of $f$. We can write

$f = X_2\beta + \varepsilon_2 - X_2(X_1'X_1)^{-1}X_1'y_1 = \varepsilon_2 - X_2(X_1'X_1)^{-1}X_1'\varepsilon_1,$

so that the prediction error $f$ consists of two uncorrelated components — namely, the disturbance $\varepsilon_2$ and a component caused by the fact that we use $b$ rather than $\beta$ in our prediction formula $X_2b$. As a consequence, the variance of the prediction errors is larger than the variance of the disturbances:

$\mathrm{var}(f) = \sigma^2\left(I + X_2(X_1'X_1)^{-1}X_2'\right).$  (3.56)

Superficial observation could suggest that the prediction error covariance matrix attains its minimum if $X_2 = 0$, but in a model with an intercept this is impossible (as the elements in the first column of $X_2$ all have the value 1). It can be shown that the minimum is reached if all the rows of $X_2$ are equal to the row of column averages of $X_1$ (for the regression model with $k = 2$ this follows from formula (2.39) for the variance of the prediction error in Section 2.4.1 (p. 105)).

Prediction interval

If $\sigma^2$ in (3.56) is replaced by the least squares estimator $s^2 = e_1'e_1/(n-k)$, where $e_1 = y_1 - X_1b$ are the residuals over the estimation sample, then one can construct forecast intervals for $y_2$. It is left as an exercise (see Exercise 3.7) that a $(1-a)$ prediction interval for $y_{2j}$ for given values $X_{2j}$ of the explanatory variables is given by

$X_{2j}'b - cs\sqrt{d_{jj}} \le y_{2j} \le X_{2j}'b + cs\sqrt{d_{jj}},$

where $d_{jj}$ is the $j$th diagonal element of the matrix $I + X_2(X_1'X_1)^{-1}X_2'$ in (3.56) and $c$ is such that $P[|t| > c] = a$ when $t \sim t(n-k)$.
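A compact simulation sketch of the prediction error (3.55), its variance (3.56), and the resulting prediction intervals (all data and variable names are hypothetical):

```python
# Split a sample into n estimation and g_obs prediction observations.
rng = np.random.default_rng(2)
n, g_obs, k = 80, 20, 3
X_full = np.column_stack([np.ones(n + g_obs), rng.normal(size=(n + g_obs, k - 1))])
y_full = X_full @ np.array([1.0, 0.5, -0.3]) + rng.normal(size=n + g_obs)
X1, X2 = X_full[:n], X_full[n:]
y1, y2 = y_full[:n], y_full[n:]

b = np.linalg.lstsq(X1, y1, rcond=None)[0]     # estimated on the first n observations
f = y2 - X2 @ b                                # prediction errors (3.55)

e1 = y1 - X1 @ b
s2 = e1 @ e1 / (n - k)                         # estimate of sigma^2
D = np.eye(g_obs) + X2 @ np.linalg.inv(X1.T @ X1) @ X2.T   # matrix in (3.56)
c = stats.t.ppf(0.975, n - k)                  # two-sided 5% critical value of t(n-k)
half = c * np.sqrt(s2 * np.diag(D))
lower, upper = X2 @ b - half, X2 @ b + half    # 95% prediction intervals for y2
```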

Test of constant DGP

To obtain the predicted values of $y_2$, we assumed that the data in the two subsamples are generated by the same DGP. This may be tested by considering whether the predictions are sufficiently accurate. For a cross section this may mean, for example, that we check whether our model estimated using data from a number of regions may be used to predict the $y$ variable in another region. For a time series model we check if our model estimated using data from a certain period can be used to predict the $y$ variable in another period. In all cases we study conditional prediction — that is, we assume that the $X_2$ matrix required in the prediction is given.

In order to test the predictive accuracy, we formulate the model

$y_1 = X_1\beta + \varepsilon_1, \qquad y_2 = X_2\beta + \gamma + \varepsilon_2,$

where $\gamma$ is a $g \times 1$ vector of unknown parameters. The foregoing predictions of $y_2$ are made under the assumption that $\gamma = 0$. We can test this by means of an F-test in the model

$\begin{pmatrix} y_1 \\ y_2 \end{pmatrix} = \begin{pmatrix} X_1 & 0 \\ X_2 & I \end{pmatrix}\begin{pmatrix} \beta \\ \gamma \end{pmatrix} + \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \end{pmatrix},$  (3.57)

where it is assumed, as before, that the model satisfies Assumptions 1–7 over the full sample of $n + g$ observations. To perform the F-test for $H_0: \gamma = 0$ against $H_1: \gamma \neq 0$, note that $H_0$ involves $g$ restrictions. The number of observations in the model (3.57) is $n + g$ and the number of parameters is $k + g$. So the F-test in (3.50) becomes in this case

$F = \dfrac{(e_R'e_R - e'e)/g}{e'e/(n + g - (k + g))} = \dfrac{(e_R'e_R - e'e)/g}{e'e/(n - k)},$

which follows the $F(g, n-k)$ distribution when $\gamma = 0$. Note that $n$ is the number of observations in the estimation sample, not in the full sample.

Derivation of sums of squares

To compute the F-test we still have to determine the restricted sum of squared residuals $e_R'e_R$ and the unrestricted sum of squared residuals $e'e$. Under the null hypothesis that $\gamma = 0$, the model becomes

$\begin{pmatrix} y_1 \\ y_2 \end{pmatrix} = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix}\beta + \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \end{pmatrix}.$

So $e_R$ is obtained as the $(n + g) \times 1$ vector of residuals of the regression over the full sample of $(n + g)$ observations, and $e_R'e_R$ is the corresponding SSR.

Under the alternative hypothesis that $\gamma \neq 0$, least squares in (3.57) is equivalent to minimizing

$S(b, c) = \begin{pmatrix} y_1 - X_1b \\ y_2 - X_2b - c \end{pmatrix}'\begin{pmatrix} y_1 - X_1b \\ y_2 - X_2b - c \end{pmatrix} = (y_1 - X_1b)'(y_1 - X_1b) + (y_2 - X_2b - c)'(y_2 - X_2b - c).$

The first term is minimized by regressing $y_1$ on $X_1$ — that is, for $b = (X_1'X_1)^{-1}X_1'y_1$ — and the second term attains its minimal value zero for $c = y_2 - X_2b$. So the unrestricted SSR is equal to $e'e = (y_1 - X_1b)'(y_1 - X_1b) = e_1'e_1$ — that is, the SSR corresponding to a regression of the $n$ observations in the estimation sample.

Chow forecast test

The test may therefore be performed by running two regressions, an 'unrestricted' one (the regression of $y_1$ on $X_1$ on the estimation sample, with residuals $e_1$) and a 'restricted' one (the regression of $\binom{y_1}{y_2}$ on $\binom{X_1}{X_2}$ on the full sample, with residuals $e_R$). This gives

$F = \dfrac{(e_R'e_R - e_1'e_1)/g}{e_1'e_1/(n-k)},$  (3.58)

which is called the Chow forecast test for predictive accuracy. This shows that the null hypothesis that $\gamma = 0$ is rejected if the prediction errors $f$ are too large. If we use the expression (3.49) of the F-test instead of (3.50), with submatrices of explanatory variables as indicated in (3.57), then $b_2$ corresponds to the estimated parameters $\gamma$ in the unrestricted model. As stated before, these estimates are given by $c = y_2 - X_2b$ — that is, $c = f$ are the prediction errors in (3.55). So the Chow test may also be written as

$F = \dfrac{f'Vf/g}{e_1'e_1/(n-k)},$

where $V$ is a $g \times g$ matrix of similar structure as in (3.49).

Comment on the two regressions in the Chow forecast test

Note that in the Chow forecast test (3.58) the regression in the 'large' (unrestricted) model corresponds to the regression over the 'small' subsample (of the first $n$ observations), whereas the regression in the 'small' (restricted) model corresponds to the regression over the 'large' sample (of all $n + g$ observations). The unrestricted model is larger in the sense that it contains more parameters ($k + g$ instead of $k$). Both models apply to the same set of $n + g$ observations, and it is precisely because the large model contains $g$ parameters for the $g$ observations in the second sub-sample that the estimation of the large model can be reduced to a regression over the first sub-sample.
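The two regressions of the Chow forecast test can be run on the simulated split sample of the previous sketch (ssr is the helper defined at the start of these sketches):

```python
e1_e1 = ssr(y1, X1)                            # 'unrestricted': estimation sample only
eR_eR = ssr(y_full, X_full)                    # 'restricted': full sample with gamma = 0
F_chow = ((eR_eR - e1_e1) / g_obs) / (e1_e1 / (n - k))   # equation (3.58)
p_chow = stats.f.sf(F_chow, g_obs, n - k)
```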

Exercises: T: 3.7g, h; E: 3.19f, g.

XM301BWA

**3.4.4 Illustration: Bank Wages**

As an illustration we return to the data on bank wages and we perform two forecast tests of the salary model with the explanatory variables $x_1$, $x_2$, $x_3$, $x_4$, and $x_5$ described in Section 3.4.2. We will discuss (i) the regression results, (ii) the forecast of salaries for custodial jobs, (iii) the forecast of salaries for management jobs, and (iv) a comparison of the two forecasts.

(i) Regression results

We use the results in Exhibit 3.18. This exhibit contains three regressions, one over the full sample of 474 employees (Panel 1), a second one over an estimation sample of 447 employees working in administration or management (Panel 2; the twenty-seven employees with custodial jobs form the prediction sample in this case), and a third one over an estimation sample of 390 employees with administrative or custodial jobs (Panel 3; the eighty-four employees with management jobs form the prediction sample in this case).

(ii) Forecast of salaries for custodial jobs

We first perform a Chow forecast test by predicting the salaries of the twenty-seven employees with custodial jobs. That is, the salaries for custodial jobs are predicted by means of the model estimated for administrative and management jobs. The corresponding F-statistic (3.58) can be computed from the results in Panels 1 and 2 in Exhibit 3.18:

$F = \dfrac{(e_R'e_R - e_1'e_1)/g}{e_1'e_1/(n-k)} = \dfrac{(14.6275 - 13.9155)/27}{13.9155/(447 - 5)} = 0.838.$

The P-value of the corresponding $F(27, 442)$ distribution is $P = 0.70$, so that the predictions are sufficiently accurate. The scatter of twenty-seven points of the actual and predicted salaries is shown in Exhibit 3.19 (a), and a histogram of the forecast errors is given in Exhibit 3.19 (b). Although the great majority of the predicted salaries are lower than the actual salaries, indicating a downward bias, the forecast errors are of the same order as the random variation on the estimation sample. The mean squared error of the forecasts (that is, the sum of the squared bias and the variance) is $(0.1286)^2 + (0.1270)^2 = 0.0327$ (see Exhibit 3.19 (b)), whereas the estimated variance of the disturbances is $s^2 = (0.1774)^2 = 0.03125$ (see Panel 2 in Exhibit 3.18). This explains that the Chow test does not reject the hypothesis that custodial salaries can be predicted from the model estimated on the basis of wage data for jobs in administration and management.

Exhibit 3.18 Bank Wages (Section 3.4.4)

Panel 1: Dependent Variable: LOGSAL; Method: Least Squares; Sample: 1 474; Included observations: 474

Variable        Coefficient   Std. Error   t-Statistic   Prob.
C                  2.079647     0.314798      6.606288   0.0000
EDUC               0.023268     0.003870      6.013129   0.0000
LOGSALBEGIN        0.821799     0.036031     22.80783    0.0000
GENDER             0.048156     0.019910      2.418627   0.0160
MINORITY          -0.042369     0.020342     -2.082842   0.0378

R-squared 0.804117; Adjusted R-squared 0.802446; S.E. of regression 0.176603; Sum squared resid 14.62750; Mean dependent var 10.35679; S.D. dependent var 0.397334

Panel 2: Dependent Variable: LOGSAL; Method: Least Squares; Sample: 1 474 IF JOBCAT = 1 OR JOBCAT = 3; Included observations: 447; regressors C, EDUC, LOGSALBEGIN, GENDER, MINORITY. R-squared 0.813307; Adjusted R-squared 0.811617; S.E. of regression 0.177434; Sum squared resid 13.91547.

Panel 3: Dependent Variable: LOGSAL; Method: Least Squares; Sample (adjusted): 2 474 IF JOBCAT = 1 OR JOBCAT = 2; Included observations: 390 after adjusting endpoints; regressors C, EDUC, LOGSALBEGIN, GENDER, MINORITY. S.E. of regression 0.161635; Sum squared resid 10.05848.

Regressions for two forecast tests. In Panel 1 a model for salaries is estimated using the data of all 474 employees, in Panel 2 this model is estimated using only the data of the employees with jobs in categories 1 and 3 (administration and management), in Panel 3 this model is estimated using only the data of the employees with jobs in categories 1 and 2 (administration and custodial jobs).
Exhibit 3.19 Bank Wages (Section 3.4.4)

Scatter diagrams of forecasted salaries against actual salaries, both in logarithms ((a) and (c)), and histograms of forecast errors ((b) and (d)), for employees in job category 2 ((a) and (b), forecasts obtained from the model estimated for the data of employees in job categories 1 and 3) and for employees in job category 3 ((c) and (d), forecasts obtained from the model estimated for the data of employees in job categories 1 and 2). The forecast errors in (b) (series FORECER2, 27 observations) have mean 0.128563 and standard deviation 0.126952; those in (d) (series FORECER3, 84 observations) have mean 0.200624 and standard deviation 0.199694. The forecast test is based on the magnitude of the forecast errors. In terms of the Chow forecast test, the prediction errors in (a) and (b) are acceptable, whereas those in (c) and (d) are not.

(iii) Forecast of salaries for management jobs

As a second test, we predict the salaries for the eighty-four employees with management positions from the model estimated for administrative and custodial jobs (job categories 1 and 2). The regression results based on the 390 observations in job categories 1 and 2 are shown in Panel 3 of Exhibit 3.18. The corresponding Chow forecast test (3.58) can be computed from the results in Panels 1 and 3 of Exhibit 3.18:

$F = \dfrac{(e_R'e_R - e_1'e_1)/g}{e_1'e_1/(n-k)} = \dfrac{(14.6275 - 10.0585)/84}{10.0585/(390 - 5)} = 2.082.$

The P-value of the corresponding $F(84, 385)$ distribution is rounded to $P = 0.0000$, so that the predictions are not accurate. That is, the salaries in job category 3 cannot be predicted well in this case. The scatter of eighty-four points of the actual and predicted salaries is shown in Exhibit 3.19 (c), and the histogram of the forecast errors in Exhibit 3.19 (d). The values are again mostly below the 45° line. Stated otherwise, people with management positions earn on average more than people with administrative or custodial jobs for given level of education, begin salary, gender, and minority, so that salaries in this category are higher than would be expected (on the basis of education, begin salary, gender, and minority) for categories 1 and 2. The standard error of the regression over the 390 individuals in categories 1 and 2 is $s = 0.1616$ (see Panel 3 in Exhibit 3.18), whereas the root mean squared forecast error over the eighty-four individuals in category 3 can be computed from Exhibit 3.19 (d) as $((0.2006)^2 + (0.1997)^2)^{1/2} = 0.2831$. So the forecast errors are much larger than the usual random variation in the estimation sample.

(iv) Comparison of the two forecasts

Comparing once more Exhibit 3.19 (a) and (c), at first sight the predictive quality seems to be comparable in both cases. Note, however, that the vertical scales differ in the two scatter diagrams. Further, (a) contains far fewer observations than (c) (27 and 84 respectively). Forecast errors become more significant if they occur for a larger number of observations.
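Both reported Chow statistics follow from (3.58) and the sums of squared residuals in Exhibit 3.18; a small check:

```python
def chow_forecast(ssr_full, ssr_est, n, k, g):
    """Chow forecast test (3.58): ssr_full over n+g observations, ssr_est over n."""
    F = ((ssr_full - ssr_est) / g) / (ssr_est / (n - k))
    return F, stats.f.sf(F, g, n - k)

F_cust, p_cust = chow_forecast(14.6275, 13.9155, n=447, k=5, g=27)   # F ≈ 0.838
F_mgmt, p_mgmt = chow_forecast(14.6275, 10.0585, n=390, k=5, g=84)   # F ≈ 2.082
```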

Summary, further reading, and keywords

SUMMARY

In this chapter we considered regression models with more than one explanatory variable. The least squares coefficients measure the direct effect of an explanatory variable on the dependent variable after neutralizing for the indirect effects that run via the other explanatory variables. These estimated effects therefore depend on the set of all explanatory variables included in the model. We paid particular attention to the question of which explanatory variables should be included in the model. For reasons of efficiency it is better to exclude variables that have only a marginal effect. The statistical properties of least squares were derived under a number of assumptions on the data generating process. Under these assumptions, the F-test can be used to test for the individual and joint significance of explanatory variables.

FURTHER READING

In our analysis we made intensive use of matrix methods. We give some references to econometric textbooks that also follow this approach. Johnston and DiNardo (1997), Stewart and Gill (1998), Greene (2000), Verbeek (2000), and Wooldridge (2002) are on an intermediate level; the other books are on an advanced level. The handbooks edited by Griliches and Intriligator contain overviews of many topics that are treated in this and the next chapters.

Chow, G. C. (1983). Econometrics. Auckland: McGraw-Hill.
Davidson, R., and MacKinnon, J. G. (1993). Estimation and Inference in Econometrics. New York: Oxford University Press.
Gourieroux, C., and Monfort, A. (1995). Statistics and Econometric Models. 2 vols. Cambridge: Cambridge University Press.
Greene, W. H. (2000). Econometric Analysis. New York: Prentice Hall.
Griliches, Z., and Intriligator, M. D. (1983, 1984, 1986). Handbook of Econometrics. 3 vols. Amsterdam: North-Holland.
Johnston, J., and DiNardo, J. (1997). Econometric Methods. New York: McGraw-Hill.
Judge, G. G., Griffiths, W. E., Hill, R. C., Lütkepohl, H., and Lee, T. C. (1985). The Theory and Practice of Econometrics. New York: Wiley.

Malinvaud, E. (1980). Statistical Methods of Econometrics. Amsterdam: North-Holland.
Mittelhammer, R. C., Judge, G. G., and Miller, D. J. (2000). Econometric Foundations. Cambridge: Cambridge University Press.
Ruud, P. A. (2000). An Introduction to Classical Econometric Theory. New York: Oxford University Press.
Stewart, J., and Gill, L. (1998). Econometrics. London: Prentice Hall.
Theil, H. (1971). Principles of Econometrics. New York: Wiley.
Verbeek, M. (2000). A Guide to Modern Econometrics. Chichester: Wiley.
Wooldridge, J. M. (2002). Econometric Analysis of Cross Section and Panel Data. Cambridge, MA: MIT Press.

KEYWORDS

auxiliary regressions; ceteris paribus; Chow forecast test; coefficient of determination; covariance matrix; degrees of freedom; direct effect; F-test; Frisch–Waugh; indirect effect; inefficient; joint significance; least squares estimator; linear restrictions; matrix form; minimal variance; multicollinearity; normal equations; omitted variables bias; partial regression; partial regression scatter; prediction interval; predictive performance; projection; significance; significance of the regression; standard error; standard error of the regression; t-test; t-value; total effect; true model; unbiased; uncontrolled; variance inflation factor

Exercises

THEORY QUESTIONS

3.1 (E Section 3.1.2) In this exercise we study the derivatives of (3.6) and prove the result in (3.7). In the model $y = X\beta + \varepsilon$, the normal equations are given by $X'Xb = X'y$, the least squares estimates by $b = (X'X)^{-1}X'y$, and the variance by $\mathrm{var}(b) = \sigma^2(X'X)^{-1}$. For convenience, we write $X'y = p$ (a $k \times 1$ vector) and $X'X = Q$ (a $k \times k$ matrix), so that we have to minimize the function $f(b) = y'y - p'b - b'p + b'Qb = y'y - 2b'p + b'Qb$.
a. There are $k$ first order derivatives and we follow the convention to arrange them in a column vector. The vector of first order derivatives in (3.7) contains one term that depends on $b$. For convenience we write it as $Qb$ and we partition the $k \times k$ matrix $Q = 2X'X$ into its columns as $Q = (q_1\ q_2\ \cdots\ q_k)$, so that $Qb$ can be written as $Qb = q_1b_1 + q_2b_2 + \cdots + q_kb_k$. The derivatives of the elements of $Qb$ with respect to the scalar $b_i$ can be written as a column $q_i$. To write all derivatives for $i = 1, \cdots, k$ in one formula we follow the convention to write them as a 'row of columns' — that is, we group them into a matrix, so that $\partial Qb/\partial b' = Q$ (note the prime in the left-hand denominator; this indicates that the separate derivatives are arranged as a row). Check every detail of the following argument.
b. If we apply this to (3.6), this shows that $\partial S/\partial b = -2X'y + 2X'Xb$. With the same conventions we get $\partial^2 S/\partial b\partial b' = Q$ for the Hessian.
c. Let $b$ increase to $b + h$, where we may choose the elements of the $k \times 1$ vector $h$ as small as we like. Then $f(b + h) = f(b) + h'(-2p + (Q' + Q)b) + h'Qh$. This result can be interpreted as a Taylor expansion, and the central term is a linear expression containing the $k \times 1$ vector of first order derivatives $\partial f/\partial b = -2p + (Q' + Q)b$. If the elements of $h$ are sufficiently small, the last term can be neglected.
d. Work these three formulas out for the special case of the simple regression model $y_i = \alpha + \beta x_i + \varepsilon_i$ and prove that these results are respectively equal to the normal equations, the estimates $a$ and $b$, and the variances of $a$ and $b$ obtained in Chapter 2.

3.2 (E Section 3.1.2) In this exercise we prove the result in (3.10).
a. Let $X$ be an $n \times k$ matrix with rank $k$; then prove that the $k \times k$ matrix $X'X$ is positive definite.

3.3 (E Section 3.1.2) The following steps show that the least squares estimator $b = (X'X)^{-1}X'y$ minimizes (3.6) without using the first and second order derivatives. Verify each step in the following argument. In this exercise $b^*$ denotes any $k \times 1$ vector.
a. Let $b^* = (X'X)^{-1}X'y + d$; then show that $y - Xb^* = e - Xd$, where $e$ is a vector of constants that does not depend on the choice of $d$.
b. Show that $S(b^*) = e'e + (Xd)'(Xd)$ and that the minimum of this expression is attained if $Xd = 0$.
c. Derive the condition for uniqueness of this minimum and show that the minimum is then given by $d = 0$.

3.4 (E Section 3.1.4)
a. Suppose that the $k$ random variables $y, x_2, \cdots, x_k$ are jointly normally distributed with mean $\mu$ and (non-singular) covariance matrix $\Sigma$. Define the random variable $y^c = y \mid \{x_2, \cdots, x_k\}$ — that is, $y$ conditional on the values of $x_2, \cdots, x_k$.
b. Let the observations be obtained by a random sample of size $n$ from this distribution $N(\mu, \Sigma)$. Show that the $n$ observations $y^c$ satisfy Assumptions 1–7 of Section 3.1.4.
50). Prove that the least squares estimator obtained by regressing yÃ on XÃ gives the desired result.2. a. among all predictors of the form ^ y2 ¼ Ly1 (with L a given matrix) with the property that E[y2 À ^ y2 ] ¼ 0. 3. c. with (ii) Q1 ¼ M1 À M and Q2 ¼ M.1) Consider the model y ¼ Xb þ e with the null hypothesis that Rb ¼ r where R is a given g Â k matrix of rank g and r is a given g Â 1 vector. Show also that both tests lead to the same conclusion. b. E[e1 e01 ] ¼ s2 I. The restricted least squares estimator bR ^)0 (y À Xb ^) minimizes the sum of squares (y À Xb ^ ¼ r.5. 3. under the null hypothesis that b2 ¼ 0.Exercises 181 3. Prove that the Theil criterion is equivalent with minimizing s. where n is the number of observations and k the number of explanatory variables.3). while an intercept is added automatically. 3.6 (E Section 3. Prove that R2 (in the model with constant term) is the square of the sample correlation coefﬁcient between y and ^ y ¼ Xb. That is. and (iv) Q1 Q2 ¼ 0. Prove that R2 never decreases by including an additional regressor in the model. these two random variables are independently distributed as w2 (g) and w2 (n À k) respectively by showing that (i) they can be expressed as e0 Q1 e and e0 Q2 e.6. 3.8 (E Section 3. Show that var(bR ) var(b1 ) in the sense that var(b1 ) – var(bR ) is a positive semideﬁnite matrix.6) Suppose we wish to explain a variable y and that the number of possible explanatory variables is so large that it is tempting to take a subset. fÃ .4 we considered the prediction of y2 for given values of X2 under the assumptions that y1 ¼ X1 b þ e1 and y2 ¼ X2 b þ e2 where E[e1 ] ¼ 0.4.3) Some of the following questions and arguments were mentioned in this chapter.3.5 that hi > 0 if the n Â k matrix X contains a column of unit elements and rank (X) ¼ k. Consider the expression (3. b.4. the standard error of regression. In Section 3. consisting of unit elements only.4.1. it minimy2. i ÀX where the i columns. If a regression model contains no constant term so that the matrix X contains no column of ones. 3. so that (iii) Q1 is idempotent with rank g and Q2 is idempotent with rank (n À k). izes the variance of the forecast error y2 À ^ h. a.4. irrespective of the chosen signiﬁcance level. 3. Show that under the restriction that Rb .05. a.2.5 (E Section 3. Let y ¼ X1 b1 þ X2 b2 þ e and let b1 be estimated by regressing y on X1 alone (the ‘omitted variables’ case of Section 3. When are the two variances equal? e.49) of the F-test in terms of the random variables b02 X02 M1 X2 b2 and e0 e.1. Use the following steps to show that the expression (3. In such a situation some researchers apply the so-called Theil criterion and maximize the adjusted R2 deﬁned by 2 À1 2 R ¼1Àn nÀk (1 À R ). Now suppose that you wish to compute the least squares estimates b in a regression of the type y ¼ Xb þ e where the n Â k matrix X does not contain an ‘intercept column’ consisting of unit elements.1. Using the notation introduced in Section 3. E[e2 e02 ] ¼ s2 I. Ày XÃ ¼ i X . Show that the size (signiﬁcance level) of such a test is larger than 0. 3. b. where M is the M-matrix corresponding to X and M1 is the M-matrix corresponding to X1 . Prove the result stated in Section 3.1. are added by the computer package and the user speciﬁes the other data.5) In some software packages the user is asked to specify the variable to be explained and the explanatory variables. then show that 1 À (SSR=SST ) (and hence R2 when it is computed in this way) may be negative. 
3.5 (E Section 3.1.5) In some software packages the user is asked to specify the variable to be explained and the explanatory variables, while an intercept is added automatically. Now suppose that you wish to compute the least squares estimates $b$ in a regression of the type $y = X\beta + \varepsilon$ where the $n \times k$ matrix $X$ does not contain an 'intercept column' consisting of unit elements. Define

$y^* = \begin{pmatrix} y \\ -y \end{pmatrix}, \qquad X^* = \begin{pmatrix} i & X \\ i & -X \end{pmatrix},$

where the $i$ columns, consisting of unit elements only, are added by the computer package and the user specifies the other data.
a. Prove that the least squares estimator obtained by regressing $y^*$ on $X^*$ gives the desired result.
b. Prove that the standard errors of the regression coefficients of this regression must be corrected by a factor $\sqrt{(2n - k - 1)/(n - k)}$.

3.6 Suppose we wish to explain a variable $y$ and that the number of possible explanatory variables is so large that it is tempting to take a subset. In such a situation some researchers apply the so-called Theil criterion and maximize the adjusted $R^2$ defined by $\bar{R}^2 = 1 - \frac{n-1}{n-k}(1 - R^2)$, where $n$ is the number of observations and $k$ the number of explanatory variables.
a. Prove that $R^2$ never decreases by including an additional regressor in the model.
b. Prove that $R^2$ (in the model with constant term) is the square of the sample correlation coefficient between $y$ and $\hat{y} = Xb$.
c. If a regression model contains no constant term, so that the matrix $X$ contains no column of ones, then show that $1 - (SSR/SST)$ (and hence $R^2$ when it is computed in this way) may be negative.
d. Prove that the Theil criterion is equivalent with minimizing $s$, the standard error of the regression. Prove that the Theil criterion implies that an explanatory variable $x_j$ will be maintained if and only if the F-test statistic for the null hypothesis $\beta_j = 0$ is larger than one. Show that the size (significance level) of such a test is larger than 0.05.

3.7 Some of the following questions and arguments were mentioned in this chapter.
a. Prove the result stated in Section 3.1.5 that $h_i > 0$ if the $n \times k$ matrix $X$ contains a column of unit elements and $\mathrm{rank}(X) = k$.
b. Show that the F-test for a single restriction $\beta_j = 0$ is equal to the square of the t-value of $b_j$. Show also that both tests lead to the same conclusion, irrespective of the chosen significance level.
c. Consider the expression (3.49) of the F-test in terms of the random variables $b_2'X_2'M_1X_2b_2$ and $e'e$. Prove that, under the null hypothesis that $\beta_2 = 0$, these two random variables are independently distributed as $\chi^2(g)$ and $\chi^2(n-k)$ respectively, by showing that (i) they can be expressed as $\varepsilon'Q_1\varepsilon$ and $\varepsilon'Q_2\varepsilon$, with (ii) $Q_1 = M_1 - M$ and $Q_2 = M$, where $M$ is the M-matrix corresponding to $X$ and $M_1$ is the M-matrix corresponding to $X_1$, so that (iii) $Q_1$ is idempotent with rank $g$ and $Q_2$ is idempotent with rank $(n-k)$, and (iv) $Q_1Q_2 = 0$.
d. Let $y = X_1\beta_1 + X_2\beta_2 + \varepsilon$ and let $b_1$ be estimated by regressing $y$ on $X_1$ alone (the 'omitted variables' case of Section 3.2.3).
e. In Section 3.4.2 we tested the null hypothesis that $\beta_4 + \beta_5 = 0$ in the model with $k = 5$ explanatory variables.
f. In Section 3.4.3 we considered the prediction of $y_2$ for given values of $X_2$ under the assumptions that $y_1 = X_1\beta + \varepsilon_1$ and $y_2 = X_2\beta + \varepsilon_2$, where $E[\varepsilon_1] = 0$, $E[\varepsilon_2] = 0$, $E[\varepsilon_1\varepsilon_1'] = \sigma^2 I$, $E[\varepsilon_2\varepsilon_2'] = \sigma^2 I$, and $E[\varepsilon_1\varepsilon_2'] = 0$. Prove that under Assumptions 1–6 the predictor $X_2b$ with $b = (X_1'X_1)^{-1}X_1'y_1$ is best linear unbiased.
g. Prove that, among all predictors of the form $\hat{y}_2 = Ly_1$ (with $L$ a given matrix) with the property that $E[y_2 - \hat{y}_2] = 0$, it minimizes the variance of the forecast error $y_2 - \hat{y}_2$.
h. Using the notation introduced in Section 3.4.3, show that a $(1-a)$ prediction interval for $y_{2j}$ is given by $X_{2j}'b \pm cs\sqrt{d_{jj}}$.

3.8 (E Section 3.4.1) Consider the model $y = X\beta + \varepsilon$ with the null hypothesis that $R\beta = r$, where $R$ is a given $g \times k$ matrix of rank $g$ and $r$ is a given $g \times 1$ vector. The restricted least squares estimator $b_R$ minimizes the sum of squares $(y - X\hat{b})'(y - X\hat{b})$ under the restriction that $R\hat{b} = r$.

a. Show that under the restriction that $R\beta = r$, $b_R = b - A(Rb - r)$, where $b$ is the unrestricted least squares estimator and $A = (X'X)^{-1}R'[R(X'X)^{-1}R']^{-1}$.
b. Let $e = y - Xb$ and $e_R = y - Xb_R$; then show that $e_R'e_R = e'e + (Rb - r)'[R(X'X)^{-1}R']^{-1}(Rb - r)$.
c. Show that the F-test in (3.54) can be written in terms of residual sums of squares as in (3.50).
d. Show that $\mathrm{var}(b_R) \le \mathrm{var}(b_1)$ in the sense that $\mathrm{var}(b_1) - \mathrm{var}(b_R)$ is a positive semidefinite matrix. When are the two variances equal?

3.9 This exercise serves to clarify a remark on standard errors in partial regressions that was made in Example 3.3 (p. 150). We use the estimated regressions (1) $y = X_1b_1 + X_2b_2 + e$ and (2) $M_2y = (M_2X_1)b^* + e^*$ in the result of Frisch–Waugh. Here $X_1$ and $M_2X_1$ are $n \times (k-g)$ matrices and $X_2$ is an $n \times g$ matrix.
a. Prove that $\mathrm{var}(b_1) = \mathrm{var}(b^*) = \sigma^2(X_1'M_2X_1)^{-1}$.
b. Derive expressions for the estimated variance $s^2$ in regression (1) and $s^2_*$ in regression (2), both in terms of $e'e$.
c. Prove that the standard errors of the coefficients $b_1$ in (1) can be obtained by multiplying the standard errors of the coefficients $b^*$ in (2) by the factor $\sqrt{(n-k+g)/(n-k)}$.
d. Check this result by considering the standard errors of the variable education in the second regression in Exhibit 3.7 and the last regression in Exhibit 3.11. (These values are rounded; a more precise result is obtained when higher precision values from a regression package are used.)
e. Derive the relation between the t-values of (1) and (2). Provide an intuitive explanation for this result.

3.10 (E Section 3.4.3) We consider the Chow forecast test (3.58) for the case $g = 1$ of a single new observation $(x_{n+1}, y_{n+1})$. The $n$ preceding observations are used in the model $y_1 = X_1\beta + \varepsilon$ with least squares estimator $b$, whereas for the $(n+1)$st observation we write $y_{n+1} = x_{n+1}'\beta + \gamma + \varepsilon_{n+1}$ with $\gamma$ an unknown scalar parameter. We assume that Assumptions 1–4 and 7 are satisfied for the full sample $i = 1, \cdots, n+1$, and Assumptions 5 and 6 for the estimation sample $i = 1, \cdots, n$. We consider the null hypothesis that $\gamma = 0$ against the alternative that $\gamma \neq 0$.
a. Prove that the least squares estimators of $\beta$ and $\gamma$ over the full sample $i = 1, \cdots, n+1$ are given by $b$ and $c = y_{n+1} - x_{n+1}'b$.
b. Show that the residual for the $(n+1)$st observation is equal to zero.
c. Derive the residual sum of squares over the full sample $i = 1, \cdots, n+1$ under the alternative hypothesis. Describe a method to determine the restricted sum of squared residuals $e_R'e_R$ in this case.
d. Derive the F-test for the hypothesis that $\gamma = 0$.

3.11 (E Section 3.4.1) In Chapter 1 we mentioned the situation of two independent random samples, one of size $n_1$ from $N(\mu_1, \sigma^2)$ and a second one of size $n_2$ from $N(\mu_2, \sigma^2)$. We want to test the null hypothesis $H_0: \mu_1 = \mu_2$ against the alternative $H_1: \mu_1 \neq \mu_2$. The pooled t-test is based on the difference between the sample means $\bar{y}_1$ and $\bar{y}_2$ of the two sub-samples. Let $e_1'e_1 = \sum_{i=1}^{n_1}(y_i - \bar{y}_1)^2$ and $e_2'e_2 = \sum_{i=n_1+1}^{n_1+n_2}(y_i - \bar{y}_2)^2$ be the total sum of squares in the first and second sub-sample respectively; then the pooled estimator of the variance is defined by $s_p^2 = (e_1'e_1 + e_2'e_2)/(n_1 + n_2 - 2)$ and the pooled t-test is defined by

$t_p = \sqrt{\dfrac{n_1 n_2}{n_1 + n_2}} \cdot \dfrac{\bar{y}_2 - \bar{y}_1}{s_p}.$

a. Formulate the testing problem of $\mu_1 = \mu_2$ against $\mu_1 \neq \mu_2$ in terms of a parameter restriction in a multivariate regression model (with parameters $\mu_1$ and $\mu_2$).
b. Derive the F-test for $H_0: \mu_1 = \mu_2$ in the form (3.50).
c. Prove that $t_p$ is equal to the F-test in b and that $t_p$ follows the $t(n_1 + n_2 - 2)$ distribution if the null hypothesis of equal means holds true.
d. In Chapter 1 (p. 62) we considered the FGPA scores of $n_1 = 373$ male students and $n_2 = 236$ female students. Use the results reported in Exhibit 1.6 to perform a test of the null hypothesis of equal means for male and female students against the alternative that female students have on average higher scores than male students.

EMPIRICAL AND SIMULATION QUESTIONS

3.12 In this simulation exercise we consider five variables ($y$, $z$, $x_1$, $x_2$, and $x_3$) that are generated as follows. Let $n = 100$ and let $\varepsilon_i, \omega_i, \eta_i \sim \mathrm{NID}(0, 1)$ be independent random samples from the standard normal distribution. Define

$x_{1i} = 5 + \omega_i + 0.3\eta_i$, $\quad x_{2i} = 10 + \omega_i$, $\quad x_{3i} = 5 + \eta_i$, $\quad y_i = x_{1i} + x_{2i} + \varepsilon_i$, $\quad z_i = x_{2i} + x_{3i} + \varepsilon_i$.

a. What is the correlation between $x_1$ and $x_3$? And what is the correlation between $x_2$ and $x_3$?
b. Perform the regression of $y$ on a constant, $x_1$ and $x_2$. Compute the regression coefficients and their t-values.
c. Answer the questions of b for the regression of $z$ on a constant, $x_2$ and $x_3$.
d. Perform also regressions of $y$ on a constant and $x_1$, and of $z$ on a constant and $x_3$. Comment on the outcomes.

3.13 (E Section 3.4.2) In Section 3.4.2 we tested four different hypotheses — that is, (i) $\beta_5 = 0$, (ii) $\beta_2 = \beta_3 = \beta_4 = \beta_5 = 0$, (iii) $\beta_4 = \beta_5 = 0$, and (iv) $\beta_4 + \beta_5 = 0$. As data set we considered the data on all 474 employees (see Exhibit 3.16). Use a significance level of 5 per cent in all tests below.
a. Test these four hypotheses also for the subset of employees working in management (job category 3), using the results in the last two columns in Exhibit 3.16.
b. We mention that of the eighty-four employees in management, seventy are male non-minority, four are male-minority, ten are female non-minority, and no one is female-minority. Discuss the relevance of this information with respect to the power of the test for hypothesis (iii).
c. Finally consider the subset of employees with custodial jobs (job category 2, where all employees are male). Use the results in Exhibit 3.16 to test the hypothesis that $\beta_5 = 0$. Test also the hypothesis that $\beta_2 = \beta_3 = \beta_5 = 0$.
d. Discuss the differences that arise between these two cases.

3.14 In this exercise we consider the data set on student learning of Example 1.1 (p. 12) for 609 students (data file XR314STU). The dependent variable ($y$) is the FGPA score of a student, and the explanatory variables are $x_1$ (constant term), $x_2$ (SATM score), $x_3$ (SATV score), and $x_4$ (FEM, with $x_4 = 1$ for females and $x_4 = 0$ for males). Use a significance level of 5 per cent in all tests below.
a. Compute the $4 \times 4$ correlation matrix for the variables ($y$, $x_2$, $x_3$, $x_4$).
b. Estimate a model for FGPA in terms of SATV by regressing $y$ on $x_1$ and $x_3$. Estimate also a model by regressing $y$ on $x_1$, $x_2$, $x_3$, and $x_4$. Comment on the differences between the two models in b for the effect of SATV on FGPA.
c. Investigate the presence of collinearity between the explanatory variables by computing $R_j^2$ in (3.47) and the square root of the variance inflation factors, $1/\sqrt{1 - R_j^2}$, for $j = 2, 3, 4$.

3.15 (E Section 3.4.1) In this exercise we consider production data for the year 1994 of $n = 26$ US firms in the sector of primary metal industries (SIC33) (data file XR315PMI). For each firm, values are given of production ($Y$, value added in millions of dollars), labour ($L$, total payroll in millions of dollars), and capital ($K$, real capital stock in millions of 1987 dollars). The data are taken from E. J. Bartelsman and W. Gray, NBER Technical Working Paper 205, National Bureau of Economic Research, 1996. A log-linear production function is estimated with the following result (standard errors are in parentheses):

$\log(Y) = \underset{(0.415)}{0.701} + \underset{(0.091)}{0.756}\,\log(L) + \underset{(0.110)}{0.242}\,\log(K) + e$

The model is also estimated under two alternative restrictions, the first with equal coefficients for $\log(L)$ and $\log(K)$ and the second with the sum of the coefficients of $\log(L)$ and $\log(K)$ equal to one ('constant returns to scale'):

$\log(Y) = \underset{(0.358)}{0.010} + \underset{(0.026)}{0.524}\,(\log(L) + \log(K)) + e_1$

$\log(Y) - \log(K) = \underset{(0.132)}{0.686} + \underset{(0.089)}{0.756}\,(\log(L) - \log(K)) + e_2$

The residual sums of squares are respectively $e'e = 1.825544$, $e_1'e_1 = 2.371989$, and $e_2'e_2 = 1.825652$, and the $R^2$ are respectively equal to $R^2 = 0.956888$, $R_1^2 = 0.943984$, and $R_2^2 = 0.751397$. In the following tests use a significance level of 5%.
a. Test for the individual significance of $\log(L)$ and $\log(K)$ in the first regression. Test also for the joint significance of these two variables.
b. Test the restriction of equal coefficients by means of an F-test based on the residual sums of squares.
c. Test this restriction also by means of the $R^2$.
d. Test the restriction of constant returns to scale also in two ways, one with the F-test based on the residual sums of squares and the other with the F-test based on the $R^2$.
e. Explain why the outcomes of b and c are the same but the two outcomes in d are different. Which of the two tests in d is the correct one?

3.16 (E Section 3.2.5) Consider the data on bank wages of the example in Section 3.2.5 (data file XM301BWA). To test for the possible effect of gender on wage, someone proposes to estimate the model $y = \beta_1 + \beta_4 x_4 + \varepsilon$, where $y$ is the yearly wage (in logarithms) and $x_4$ is the variable gender (with $x_4 = 0$ for females and $x_4 = 1$ for males). As an alternative we consider the model with $x_2$ (education) as an additional explanatory variable.
a. Use the data to perform the two regressions. Comment on the differences between the conclusions that could be drawn (without further thinking) from each of these two regressions.
b. Check the results on regression coefficients and residuals in the result of Frisch–Waugh (3.39) for these data, where $X_1$ refers to the variable $x_4$, and $X_2$ refers to the constant term and the variable $x_2$.
c. Draw a partial regression scatter plot (with regression line) for salary (in logarithms) against gender after correction for the variable education (see Case 3 in Section 3.2.5).
d. Draw also a scatter plot (with regression line) for the original (uncorrected) data on salary (in logarithms) and gender. Discuss how these plots help in clarifying the differences in b.

3.17 (E Section 3.4.3) In this exercise we consider data on weekly coffee sales of a certain brand of coffee (data file XR317COF). These data come from the same marketing experiment as discussed in Example 2.3 (p. 78), but for another brand of coffee and for another selection of weeks. The data provide for $n = 18$ weeks the values of the coffee sales in that week ($Q$, in units), the applied deal rate ($D = 1$ for the usual price, $D = 1.05$ in weeks with 5% price reduction, and $D = 1.15$ in weeks with 15% price reduction), and advertisement ($A = 1$ in weeks with advertisement, $A = 0$ otherwise). We postulate the model

$\log(Q) = \beta_1 + \beta_2 \log(D) + \beta_3 A + \varepsilon.$

For all tests below use a significance level of 5%.
a. Construct 95% interval estimates for the parameters $\beta_2$ and $\beta_3$.
b. Test the null hypothesis that $\beta_2 = 1$ against the alternative that $\beta_2 > 1$.
c. Test whether advertisement has a significant effect on sales, both by a t-test and by an F-test.
d. Estimate the model using only observations in weeks without advertisement. Test whether this model produces acceptable forecasts for the sales (in logarithms) in the weeks with advertisement. Note: take special care of the fact that the estimated model cannot predict the effect of advertisement.
e. Make two scatter plots, one of the actual values of $\log(Q)$ against the fitted values for the twelve observations in the estimation sample, and a second one of $\log(Q)$ against the predicted values for the six observations in the prediction sample. Relate these graphs to your conclusions in d.

3.18 (E Section 3.2.5) In this exercise we consider yearly data (from 1970 to 1999) related to motor gasoline consumption in the USA (data file XR318MGC). The data are taken from different sources (see the table). Here 'rp' refers to data in the Economic Report of the President (see w3.access.gpo.gov), 'ecocb' to data of the Census Bureau, and 'ecode' to data of the Department of Energy (see www.economagic.com).

Variable   Definition                                          Units          Source
SGAS       Retail sales gasoline service stations              10^6 dollars   ecocb
PGAS       Motor gasoline retail price, US city average        cts/gallon     ecode
INC        Nominal personal disposable income                  10^9 dollars   rp
PALL       Consumer price index, (1982-4)/3 = 100                             rp
PPUB       Consumer price index of public transport                           rp
PNCAR      Consumer price index of new cars                                   rp
PUCAR      Consumer price index of used cars                                  rp

The price indices are defined so that the average value over the years 1982-4 is equal to 100. We define the variables $y = \log(SGAS/PGAS)$, $x_2 = \log(INC/PALL)$, $x_3 = \log(PGAS/PALL)$, $x_4 = \log(PPUB/PALL)$, $x_5 = \log(PNCAR/PALL)$, and $x_6 = \log(PUCAR/PALL)$. We are interested in the price elasticity of gasoline consumption — that is, the marginal relative increase in sold quantity due to a marginal relative price increase. For all tests below, use a significance level of 5%.
a. Estimate this price elasticity by regressing $\log(SGAS)$ on a constant and $\log(PGAS)$. Comment on the outcome, and explain why this outcome is misleading.
b. Estimate the price elasticity now by regressing $y$ on a constant and $\log(PGAS)$. Use the fact that, in the period 1970-99, real income has mostly gone up and the price of gasoline (as compared with other prices) has mostly gone down. Why is this outcome still misleading?
c. Now estimate the price elasticity by regressing $y$ on a constant and the variables $x_2$ and $x_3$. Provide a motivation for this choice of explained and explanatory variables and comment on the outcomes. Use the results to construct a 95% interval estimate for the price elasticity of gasoline consumption.
d. Estimate the price elasticity by regressing $y$ on a constant and the variables $x_2$, $x_3$, $x_4$, $x_5$, and $x_6$. Test for the joint significance of the prices of new and used cars. Comment on the outcomes and compare them with the ones in c.
e. If $y$ is regressed on a constant and the variable $x_3$, then the estimated elasticity is more negative than in c. Check this result and give an explanation in terms of partial regressions.
f. Perform the partial regressions needed to remove the effect of income ($x_2$) on the consumption ($y$) and on the relative price ($x_3$). Make a partial regression scatter plot of the 'cleaned' variables and check the validity of the result of Frisch-Waugh in this case.
g. Transform the four price indices (PALL, PPUB, PNCAR, and PUCAR) so that they all have the value 100 in 1970. Perform the regression of f for the transformed data (taking logarithms again) and compare the outcomes with the ones in f. Which regression statistics remain the same, and which ones have changed? Explain these results.

PALL. and log (PPUB) in the model of b is equal to zero.186 3 Multiple Regression c. . Compare the most recent value of y with the two forecast intervals of part f. Show that the restricted model has regressors log (PGAS). x2 and x4 as regressors) to construct a 95% interval estimate for the price elasticity of gasoline consumption. Show that the following null hypothesis is not rejected: the sum of the coefﬁcients of log (PALL). d. Use the model of d (with the constant. Compare this with the result in b and comment. perform Chow forecast tests for the most recent value of y. and PPUB (make sure to use the same units as the ones mentioned in Exercise 3. and estimate this model. f. Test the null hypothesis that the sum of the coefﬁcients of the four regressors in the model in b (except the constant) is equal to zero.18). log (PGAS). For the two models in b and d. PGAS. log (INC). Explain why this restriction is of interest by relating this regression model to the restricted regression in a. x2 and x4 (and a constant term). Use the models in b and d to construct 95% forecast intervals of y ¼ log (SGAS=PGAS) for the given most recent values of the regressors. g. e. Search the Internet to ﬁnd the most recent year with values of the variables SGAS. INC.

4 Non-Linear Methods

In the previous chapter, the finite sample statistical properties of regression methods were derived under restrictive assumptions on the data generating process. In this chapter we describe several methods that can be applied more generally. We consider models with stochastic explanatory variables, non-normal disturbances, and non-linearities in the parameters. Often the finite sample statistical properties of the estimators cannot be derived analytically. An approximation is obtained by asymptotic analysis — that is, by considering the statistical properties if the sample size tends to infinity. Some of these models can be estimated by (non-linear) least squares; other models are better estimated by maximum likelihood or by the generalized method of moments. In most cases there exists no closed-form expression for the estimates, so that numerical methods are required.

4.1 Asymptotic analysis

4.1.1 Introduction

E Uses Section 1.3.3.

Motivation of asymptotic analysis and use in finite samples

In the previous chapter we have seen that, given certain assumptions on the data generating process, we can derive the exact distributional properties of estimators (b and s²) and of tests (for instance, t- and F-tests). Of course, these assumptions are rather strong and one might have a hard time finding practical applications where all these assumptions hold exactly true. For example, regressors typically do not tend to be 'fixed' (as we do not often do controlled experiments), but they are often stochastic (as we rely on empirical data that are for some part affected by random factors). Also, regression models need not be linear in the parameters. Strictly speaking, if one or several of the standard Assumptions 1–7 in Section 3.1.4 (p. 125–6) are violated, then we do not know the statistical properties of the estimators and tests anymore. An interesting question now is whether estimators and tests, which are based on the same principles as before, still make sense in this more general setting. A useful tool to obtain understanding of the properties of estimators and tests in this more general setting is to pretend that we can obtain a limitless number of observations. We can then pose the question how the estimators and tests would behave when the number of observations increases without limit. This, in essence, is what is called asymptotic analysis. Of course, in practice our sample size is finite. However, once we know how estimators and tests behave for a limitless number of observations, we also get an approximate idea of how they perform in finite samples of usual size. That is, the asymptotic properties translate into results that hold true approximately in finite samples, provided that the sample size is large enough.

Random regressors and non-normal disturbances

As before, we consider the linear model

    y = Xβ + ε.    (4.1)

In the previous chapter we derived the statistical properties of the least squares estimator under the seven assumptions listed in Section 3.1.4. In this chapter we relax some of these assumptions. In particular, in this section we consider the properties of the least squares estimator in situations where the explanatory variables are random (so that Assumption 1 is not satisfied), or where the disturbances are not normally distributed (so that Assumption 7 is violated). Such situations often occur in practice when we analyse observed economic data. Non-linear regression models — where the model is not linear in the parameters, so that Assumption 6 is violated — are discussed in Section 4.2.

Averaging to remove randomness and to obtain normality, asymptotically

The general idea is to remove randomness and non-normality asymptotically by taking averages of the observed data. In Section 1.3.3, Panel 3 (p. 50), we discussed the law of large numbers, which states that the (random) sample average converges in probability to the (non-random) population mean, and the central limit theorem, which states that this average (properly scaled) converges in distribution to a normal distribution. If Assumptions 1 and 7 are violated, then under appropriate conditions these assumptions still hold true asymptotically — that is, if the sample size grows without limit (n → ∞). The results of Chapter 3 then also hold true asymptotically, and they can be taken as an approximation in large enough finite samples.

Before discussing further details of asymptotic analysis, we give an example to illustrate that Assumptions 1 and 7 are often violated in practice. We will discuss (i) randomness of the regressors due to sampling, (ii) measurement errors, and (iii) non-normality of the disturbances.

Example 4.1: Bank Wages (continued)

E XR414BWA

(i) Sampling as a source of randomness

Suppose that we want to investigate the wage structure in the US banking sector. To estimate a wage equation for this sector, we could use the data of n = 474 employees of a US bank (see Section 2.4 and Exhibit 2.5(a) (p. 85), and Section 3.7 and Exhibit 3.3 (p. 132)). That is, both y and X in (4.1) are obtained by sampling from the full population of employees of all US banks. This means that both y and X are random. If we were to use data of employees of another bank, this would of course give other values for the dependent and explanatory variables. As an illustration, suppose that our data set consisted only of a subset of the 474 employees considered before. We show the results for two such sub-samples. Exhibit 4.1 contains three histograms of the explanatory

[Exhibit 4.1 Bank Wages (Example 4.1): histograms of the variable education (EDUC) ((a), (c), and (e)) and scatter diagrams of salary (in logarithms, LOGSAL) against education ((b), (d), and (f)), for the full sample (n = 474, (a)–(b)) and for two (complementary) random samples of size 237 ((c)–(f)). Each histogram panel also reports summary statistics (mean, median, maximum, minimum, standard deviation, skewness, kurtosis) of EDUC in that sample.]

variable education (in (a), (c), and (e)) and three corresponding scatter diagrams (in (b), (d), and (f)). The first data set consists of the full sample; the other two are the result of a random selection of the employees in two distinct groups of size 237 each. Clearly, the outcomes depend on the chosen sample, and both y and X are random because of sampling.

(ii) Measurement errors

Apart from sampling effects, the observed explanatory variables often provide only partial information on the economic variables of interest. The reported data contain measurement errors, in the sense that they give imperfect information on the relevant underlying economic variables. For example, the measured number of years of education of employees does not take the quality of the education into account.

(iii) Non-normality of disturbances

As concerns Assumption 7, an indication of the distribution of the disturbances may be obtained by considering the least squares residuals. For the simple regression model of Chapter 2, where salaries (in logarithms) are explained from education alone, the histogram of the residuals is given in Exhibit 2.5(b). This distribution is skewed, and this may cast doubt on the validity of Assumption 7.

4.1.2 Stochastic regressors

Interpretation of previous results for stochastic regressors

One way to deal with stochastic regressors is to interpret the results that are obtained under the assumption of fixed regressors as results that hold true conditional on the given outcomes of the regressors. The results in Chapters 2 and 3, which were obtained under Assumption 1 of fixed regressors, carry over to the case of stochastic regressors, provided that all assumptions and results are interpreted conditional on the given values of the regressors. To illustrate this idea, we consider the mean and variance of the least squares estimator b. In Section 3.1.4 (p. 126) we showed that, under Assumptions 1–6, E[b] = β and var(b) = σ²(X′X)⁻¹. If the regressors in the n × k matrix X are stochastic, these results are not valid anymore. However, suppose that we replace Assumption 2 (that E[ε] = 0) and Assumptions 3 and 4 (that var(ε) = σ²I) by the following two assumptions that are conditional on X:

    E[ε|X] = 0,    var(ε|X) = σ²I.

Then it holds true that

    E[b|X] = β,    var(b|X) = σ²(X′X)⁻¹,

so that the previous results remain true if we interpret everything conditional on X. To prove the above two results, note that

    E[b|X] = E[b + (X′X)⁻¹X′ε|X] = β + (X′X)⁻¹X′E[ε|X] = β,
    var(b|X) = var(b + (X′X)⁻¹X′ε|X) = (X′X)⁻¹X′var(ε|X)X(X′X)⁻¹ = (X′X)⁻¹X′(σ²I)X(X′X)⁻¹ = σ²(X′X)⁻¹.

Derivation of statistical properties of OLS when X and ε are independent

Consider the linear model y = Xβ + ε and suppose that Assumptions 2–6 (see Section 3.1.4) are satisfied, but that Assumption 1 of fixed regressors is not valid. If X is random but independently distributed from ε, then it follows that

    E[b] = E[(X′X)⁻¹X′y] = β + E[(X′X)⁻¹X′ε] = β + E[(X′X)⁻¹X′]E[ε] = β,

where the third equality follows because X and ε are independent. So, in this case the least squares estimator is still unbiased. To evaluate the variance var(b) = E[(b − β)(b − β)′] we write

    b = (X′X)⁻¹X′y = (X′X)⁻¹X′(Xβ + ε) = β + (X′X)⁻¹X′ε,    (4.2)

so that b − β = (X′X)⁻¹X′ε. Using the properties of conditional expectations (see Section 1.2.2 (p. 24)), it follows by conditioning on X (denoted by E[·|X]) that

    var(b) = E[(X′X)⁻¹X′εε′X(X′X)⁻¹]
           = E[ E[(X′X)⁻¹X′εε′X(X′X)⁻¹ | X] ]
           = E[(X′X)⁻¹X′E[εε′|X]X(X′X)⁻¹]
           = E[(X′X)⁻¹X′E[εε′]X(X′X)⁻¹]
           = σ²E[(X′X)⁻¹].

The third equality follows because, conditional on X, X is given, and the fourth equality holds true because X and ε are independent. The last equality uses the fact that E[εε′] = σ²I because of Assumptions 2–4. This shows that the variance of b depends on the distribution of X.
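To make the conditional argument concrete, here is a small simulation sketch (Python/NumPy; the sample size, parameter values, and distributions are illustrative choices, not taken from the text). Each replication draws a fresh random X independently of ε, and the average of b across replications is close to β, in line with E[b] = β.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, reps = 50, 2, 5000
beta = np.array([1.0, 0.5])

estimates = np.empty((reps, k))
for r in range(reps):
    # stochastic regressors, drawn independently of the disturbances
    X = np.column_stack([np.ones(n), rng.normal(size=n)])
    eps = rng.normal(size=n)            # disturbances with E[eps|X] = 0
    y = X @ beta + eps
    estimates[r] = np.linalg.lstsq(X, y, rcond=None)[0]

print(estimates.mean(axis=0))           # close to beta: b is unbiased
```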

Consequences of random regressors

In general it may be difficult to estimate the joint distribution of X or to estimate E[(X′X)⁻¹]. In general, the exact finite sample distributions of b and of the t- and F-statistics cannot be determined analytically. The t- and F-statistics as computed in Chapter 3 will no longer exactly follow the t- and F-distributions. This also means that the P-values reported by statistical packages that are based on these distributions are no longer valid. However, the asymptotic properties can be determined under appropriate regularity conditions (see Section 4.1.4).

4.1.3 Consistency

The exogeneity condition for consistency

If X is random but independent of ε, then the least squares estimator b is unbiased. If X and ε are not independent, then b is in general no longer unbiased, because E[b] = β + E[(X′X)⁻¹X′ε] and the last term is non-zero in general.

The assumption of stable regressors

In the sequel we no longer assume that X and ε are independent. In order to investigate the asymptotic properties of the least squares estimator, we make the following assumption.

Assumption 1*: stability (replaces Assumption 1 of fixed regressors). The regressors X may be stochastic and the probability limit of (1/n)X′X exists and is non-singular — that is, for some non-singular k × k matrix Q there holds

    plim (1/n)X′X = Q.

For the definition and calculation rules of probability limits we refer to Section 1.3.3 (p. 48–9). This stability assumption places restrictions on the variation in the explanatory variables — that is, the variables should vary sufficiently (so that Q is invertible) but not excessively (so that Q is finite). The element (h, j) of the matrix (1/n)X′X is given by (1/n)Σ_{i=1}^n x_{hi}x_{ji}, the (non-centred) second moment of the hth and jth explanatory variable. For example, suppose that the values of the k × 1 vector of regressors are obtained by random sampling from a population with zero mean and positive definite covariance matrix Q — that is, from a population where the regressors are not perfectly collinear. The law of large numbers (see Section 1.3.3 (p. 50)) implies that plim (1/n)Σ_{i=1}^n x_{hi}x_{ji} = E[x_{hi}x_{ji}] = Q_{hj}, so that Assumption 1* holds true under these conditions.

To investigate whether b is consistent — that is, whether plim(b) = β — we write (4.2) as

    b = β + ((1/n)X′X)⁻¹ (1/n)X′ε,    (4.3)

and, using the rules for probability limits, it follows from Assumption 1* that plim(b) = β + Q⁻¹ plim((1/n)X′ε), so that b is consistent if and only if

    plim (1/n)X′ε = 0.    (4.4)

This last condition is called the orthogonality condition. The jth component of (4.4) can be written as plim (1/n)Σ_{i=1}^n x_{ji}ε_i = 0, so that this condition basically means that the explanatory variables should be asymptotically uncorrelated with the disturbances. If this condition is satisfied, then the explanatory variables are said to be exogenous (or sometimes 'weakly' exogenous, to distinguish this type of exogeneity, which is related to consistent estimation, from other types of exogeneity related to forecasting and structural breaks).

Derivation of consistency of s²

Under Assumption 1* and condition (4.4), s² (defined in (3.22)) is a consistent estimator of σ², provided that plim((1/n)ε′ε) = plim((1/n)Σ_{i=1}^n ε_i²) = σ². This can be seen by writing (using the notation and results of Chapter 3, with M = I − X(X′X)⁻¹X′)

    s² = e′e/(n−k) = ε′Mε/(n−k) = (ε′ε − ε′X(X′X)⁻¹X′ε)/(n−k)
       = (n/(n−k)) [ (1/n)ε′ε − (1/n)ε′X ((1/n)X′X)⁻¹ (1/n)X′ε ].

For n → ∞ the first factor in the last line converges to 1, the next term to σ², the third and fifth factors to zero because of condition (4.4), and the fourth converges to Q⁻¹ because of Assumption 1*. This shows that plim(s²) = σ² under the stated conditions.

An example where OLS is consistent

As an illustration, we consider the data generating process y_i = βx_i + ε_i,

where the x_i are IID(0, q) and the ε_i are IID(0, σ²). If the explanatory variable x_i and the disturbance term ε_i are independent, it follows that

    E[(1/n)X′ε] = E[(1/n)Σ_{i=1}^n x_iε_i] = (1/n)Σ_{i=1}^n E[x_i]E[ε_i] = 0,
    var((1/n)X′ε) = E[(1/n²)Σ_i Σ_j x_i x_j ε_i ε_j] = (1/n²)Σ_i Σ_j E[x_i x_j]E[ε_i ε_j] = (1/n²)Σ_{i=1}^n E[x_i²]σ² = σ²q/n.

It follows from the result (1.48) in Section 1.3.3 (p. 49) that in this case condition (4.4) is satisfied, so that the least squares estimator is consistent.

An example where OLS is not consistent

On the other hand, if x_i and ε_i are correlated, then the least squares estimator is no longer consistent. This is illustrated by a simulation in Exhibit 4.2. The data are generated by y = x + ε (so that the DGP has slope parameter β = 1), where the explanatory variable and the disturbance terms are positively correlated, so that γ = E[x_iε_i] > 0 (see (a)). This shows that least squares overestimates the slope parameter, and the estimated slope b is larger than the slope β of the DGP (see (b)). This is in line with the fact that plim(b) = β + q⁻¹ plim((1/n)X′ε) = β + γ/q > β. Note that the least squares estimate is obtained from the normal equation (1/n)Σ x_i(y_i − bx_i) = (1/n)Σ x_i e_i = 0, where e_i = y_i − bx_i are the least squares residuals. This means that the positive correlation between x_i and ε_i cannot be detected from the least squares residuals (see (c)).

[Exhibit 4.2 Inconsistency: effect of correlation between regressor and disturbance terms. (a) scatter diagram of the disturbance terms ε (EPS) against the regressor x; (b) scatter diagram of y against x with the regression line and the systematic relation y = x (dashed line) of the DGP; (c) scatter diagram of the least squares residuals (RES) against x, which shows that the correlation between x and the disturbances cannot be detected in this way.]
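The following sketch (Python/NumPy, with illustrative values not taken from the text) mimics this situation: with q = E[x_i²] = 1 and γ = E[x_iε_i] = 0.5, the OLS slope settles near β + γ/q = 1.5 rather than β = 1, no matter how large the sample.

```python
import numpy as np

rng = np.random.default_rng(1)
beta, rho = 1.0, 0.5                   # slope of the DGP and corr(x, eps)

for n in (100, 10_000, 1_000_000):
    x = rng.normal(size=n)
    # eps correlated with x: E[x*eps] = rho, so plim(b) = beta + rho / E[x^2]
    eps = rho * x + np.sqrt(1 - rho**2) * rng.normal(size=n)
    y = beta * x + eps
    b = (x @ y) / (x @ x)              # OLS in the model without constant
    print(n, b)                        # converges to 1.5, not to beta = 1
```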

Therefore, in practice this issue cannot be tested by simply looking at the residuals. Tests for exogeneity will be discussed later (see Section 5.7 (p. 411)).

E Exercises: T: 4.1; S: 4.4.

4.1.4 Asymptotic normality

Derivation of asymptotic distribution

To determine the asymptotic distribution of b, it is helpful to rewrite (4.3) as

    √n (b − β) = ((1/n)X′X)⁻¹ (1/√n)X′ε.    (4.5)

Under Assumption 1*, the first factor in (4.5) converges in probability to Q⁻¹, so that it remains to determine the asymptotic distribution of (1/√n)X′ε. It can be shown that, under Assumptions 1* and 2–6 and some additional weak regularity conditions, there holds

    (1/√n)X′ε →d N(0, σ²Q).    (4.6)

The result in (4.6) is based on generalizations of the central limit theorem. For instance, if x_i = 1 (so that the model contains only the constant term), then (1/√n)X′ε = (1/√n)Σ ε_i, and according to the central limit theorem (1.50) in Section 1.3.3 (p. 50) this converges in distribution to N(0, σ²). As Q = 1 in this situation, this shows (4.6) for this particular case.

Illustration: Simple regression model

We do not discuss the precise regularity conditions needed for this general result, but we analyse the simple regression model y_i = βx_i + ε_i in somewhat more detail. Suppose that the disturbances ε_i are independently but not normally distributed and that the (single) explanatory variable x_i is non-stochastic. In this case (1/√n)X′ε = (1/√n)Σ x_iε_i = (1/√n)Σ z_i, where the random variables z_i = x_iε_i are independently distributed with mean E[z_i] = E[x_i]E[ε_i] = 0 and variance E[z_i²] = E[x_i²ε_i²] = σ²x_i². If x_i is not constant, we can use the generalized central limit theorem stated in Section 1.3.3 (p. 50): (1/√n)Σ z_i converges in distribution to N(0, σ*²) with variance equal to

    σ*² = lim_{n→∞} var((1/√n)Σ_{i=1}^n z_i) = σ² lim_{n→∞} ((1/n)Σ_{i=1}^n x_i²) = σ²Q.

As Q is a scalar here, it follows that (1/√n)X′ε converges in distribution to N(0, σ²Q), which proves (4.6) for this case.

Asymptotic distribution of OLS estimator

If the result on the asymptotic distribution in (4.6) holds true, it follows from (4.5) and Assumption 1* that

    √n (b − β) →d N(0, Q⁻¹ σ²Q Q⁻¹) = N(0, σ²Q⁻¹).    (4.7)

Approximate distribution in finite samples

We say that the rate of convergence of b to β is √n. If the sample size n is large enough, the finite sample distribution of b can be approximated by N(β, σ²Q⁻¹/n). To apply the normal approximation in practice, the (unknown) matrix Q is approximated by (1/n)X′X. This gives the approximate distribution

    b ≈ N(β, σ²(X′X)⁻¹).    (4.8)

This means that the statistical results of Chapter 3 — for example, the t-test and the F-test that are based on the assumption that b ∼ N(β, σ²(X′X)⁻¹) — remain valid as an asymptotic approximation under the following four assumptions:

(i) plim((1/n)X′X) = Q exists and is invertible (Assumption 1*);
(ii) E[ε] = 0, var(ε) = σ²I (Assumptions 2–4);
(iii) y = Xβ + ε (Assumptions 5 and 6);
(iv) plim((1/n)X′ε) = 0 (orthogonality condition).

Note that asymptotic normality is obtained independent of the distribution of the disturbances — that is, even if the disturbances are not normally distributed. The result in (4.6) can be proved under much weaker conditions, but the orthogonality condition is crucial to obtain the zero mean in (4.6).

Practical use of asymptotic distribution

It depends on the application at hand which size of the sample n is required to justify this approximation. For instance, for the case of random samples discussed in Section 1.3.3, the distribution of the sample mean is often well approximated by a normal distribution for small sample sizes like n = 50. However, if the model for example contains many regressors, with correlated disturbances and stochastic X, then larger sample sizes may be required. The situation is somewhat comparable to the discussion in Section 3.3 on multicollinearity (p. 158). The expression (3.47) for the variance shows that the sample size required to get a prescribed precision depends on the amount of variation in the individual regressors and on the correlations between the regressors.

The standard inference methods for least squares are still valid for stochastic regressors and non-normal disturbances, provided that these four conditions are satisfied.

4.1.5 Simulation examples

As an illustration, we perform some simulation experiments with the model

    y_i = x_i + ε_i,    i = 1, ..., n.

So our data generating process has parameters β = 1 and σ² = 1. The number of simulation runs is 10,000, and the histograms in Exhibits 4.3 and 4.4 show the distribution of the resulting 10,000 estimates.

Simulations with stable random regressors

First we consider simulations where the values of (x_i, ε_i) are obtained by a random sample from the bivariate normal distribution with mean zero, unit variances, and covariance r. So the regressor x_i is random, and it is also stable because the law of large numbers implies that plim((1/n)Σ_{i=1}^n x_i²) = E[x_i²] = 1. We consider two experiments, one with r = 0 (so that the regressor satisfies the orthogonality condition) and another with r = 0.5 (so that the orthogonality condition is violated). Exhibit 4.3 shows histograms (based on 10,000 simulations) of the values of b, of a normalized version √n(b − 1), and of s², for sample sizes n = 25 (a, c, e, g, i, k) and n = 100 (b, d, f, h, j, l). The histograms (a–f) indicate the consistency and approximate normality when the orthogonality condition is satisfied, and the histograms (g–l) indicate the inconsistency of both b and s² when the orthogonality condition is violated.

[Exhibit 4.3 Simulation Example (Section 4.1.5). Panels (a)–(f): consistency and asymptotic normality for simulated data that satisfy the orthogonality condition; panels (g)–(l): inconsistency when the orthogonality condition is violated. Shown are histograms (10,000 simulation runs) of estimates of the slope parameter (b, denoted by B), of a normalized version (BNORM, i.e. √n(b − 1)), and of estimates of the disturbance variance (s², denoted by S2); the sample size is n = 25 in (a), (c), (e), (g), (i), and (k) and n = 100 in (b), (d), (f), (h), (j), and (l). Note the differences between the scales on the horizontal axis for the two sample sizes.]
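A minimal sketch of this experiment (Python/NumPy; 10,000 replications as in the text, but all coding details and the random seed are our own illustrative choices) is given below.

```python
import numpy as np

rng = np.random.default_rng(2)

def simulate(n, rho, reps=10_000):
    """Return reps simulated OLS slopes b for the DGP y_i = x_i + eps_i."""
    b = np.empty(reps)
    for r in range(reps):
        x = rng.normal(size=n)
        eps = rho * x + np.sqrt(1 - rho**2) * rng.normal(size=n)
        y = x + eps
        b[r] = (x @ y) / (x @ x)
    return b

for rho in (0.0, 0.5):
    for n in (25, 100):
        b = simulate(n, rho)
        print(f"rho={rho}, n={n}: mean(b)={b.mean():.3f}, sd(b)={b.std():.3f}")
# For rho = 0 the mean stays near 1 and the spread shrinks with sqrt(n);
# for rho = 0.5 the mean stays near 1.5, illustrating the inconsistency.
# (Estimates of s^2 could be tracked in the same loop.)
```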

[Exhibit 4.4 Simulation Example (Section 4.1.5): estimates of the slope parameter (b, denoted by B, in (a)–(d)) and a normalized version (BNORM, √n(b − 1), in (e)–(h)) for two data generating processes that do not satisfy Assumption 1*. Panels (a), (c), (e), (g), and (i) are for the model with linear trend and (b), (d), (f), (h), and (j) for the model with hyperbolic trend. (a)–(b) show the estimates of b for sample size n = 25 and (c)–(d) for n = 100; (e)–(f) show the outcomes of BNORM for n = 25 and (g)–(h) for n = 100 (note the differences between the scales on the horizontal axis for the two sample sizes). (i)–(j) show scatter diagrams for a sample of size n = 100 of the models with linear trend (i) and with hyperbolic trend (j).]

Simulations with regressors that are not stable

Next we generate data from the model y_i = x_i + ε_i with x_i = i (a linear trend) and with x_i = 1/i (a hyperbolic trend). In both cases the disturbances ε_i are NID(0, 1). Note that these models do not satisfy the stability Assumption 1*, as lim (1/n)Σ_{i=1}^n i² = ∞ and lim (1/n)Σ_{i=1}^n i⁻² = 0. Exhibit 4.4 shows the histograms of b and √n(b − 1) for 10,000 simulations of both models, with sample sizes n = 25 (a, b, e, f) and n = 100 (c, d, g, h). In the linear trend model the least squares estimator b is unbiased and efficient, and the rate of convergence of b to β is equal to n√n (instead of √n). By comparing the reported standard deviations in the histograms, it is seen that for the linear trend the distribution of √n(b − 1) shrinks to zero for n → ∞ (see (e) and (g)), whereas for the hyperbolic trend the distribution of √n(b − 1) does not converge for n → ∞ (see (f) and (h)). For the hyperbolic trend data the least squares estimator b is not consistent (see (b) and (d)), as the observations x_i = 1/i of the explanatory variable do not contain sufficient variation for i → ∞, and the estimator b does not converge to β (the proof is left as an exercise (see Exercise 4.2)). (Note that the horizontal axis differs among the different histograms, so that the width of the distributions is more easily compared by comparing the reported standard deviations of the outcomes.)

E Exercises: T: 4.2.
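The contrast between the two trends can be checked with a short simulation (a Python/NumPy sketch with illustrative settings, not code from the text):

```python
import numpy as np

rng = np.random.default_rng(3)

def slopes(x, reps=10_000):
    n = len(x)
    eps = rng.normal(size=(reps, n))
    y = x + eps                              # DGP: y_i = x_i + eps_i, beta = 1
    return (y @ x) / (x @ x)                 # one OLS slope per replication

for n in (25, 100):
    b_lin = slopes(np.arange(1.0, n + 1))        # x_i = i  (linear trend)
    b_hyp = slopes(1.0 / np.arange(1.0, n + 1))  # x_i = 1/i (hyperbolic trend)
    print(n, b_lin.std(), b_hyp.std())
# The spread of b shrinks very fast for the linear trend but hardly at all
# for the hyperbolic trend, since sum(1/i^2) stays bounded as n grows.
```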

4.2 Non-linear regression

4.2.1 Motivation

Assumptions on the data generating process

In this section we consider regression models that are non-linear in the parameters. This means that Assumption 6 in Section 3.1.4 (p. 125) no longer holds true. Throughout this section we suppose that the stability Assumption 1* of Section 4.1.3 and the Assumptions 2–5 of Section 3.1.4 are satisfied, and that the regressors satisfy the orthogonality condition (4.4) in Section 4.1.3. We now present two examples motivating the use of non-linear models.

Example 4.2: Coffee Sales (continued)

E XM402COF

In this example we consider marketing data on coffee sales. The question of interest is whether the sensitivity of consumers to price reductions depends on the magnitude of the price reduction. Stated in economic terms, the question is whether the price elasticity of demand for coffee is constant or whether it depends on the price. We will discuss (i) the data, (ii) the linear model with constant elasticity, and (iii) a non-linear model with varying elasticity.

(i) Data

Exhibit 4.5 shows scatter diagrams of weekly sales (q) of two brands of coffee against the applied deal rate (d) in these weeks (both variables are taken in natural logarithms). These data are obtained from a controlled marketing experiment in stores in suburban Paris (see A. C. Bemmaor and D. Mouchoux, 'Measuring the Short-Term Effect of In-Store Promotion and Retail Advertising on Brand Sales: A Factorial Experiment', Journal of Marketing Research, 28 (1991), 202–14). The data for brand 2 (in (b)) were discussed before in Chapter 2 (p. 78). The deal rate d is defined as d = 1 if no price reduction applies, d = 1.05 if the price reduction is 5 per cent, and d = 1.15 if the price reduction is 15 per cent. For each brand there are n = 12 observations: six with d = 1, three with d = 1.05, and three with d = 1.15. For both brands, two of the sales figures for d = 1.15 are nearly overlapping (the lower figure for brand 1 in (a) and the higher figure for brand 2 in (b)).

[Exhibit 4.5 Coffee Sales (Example 4.2): scatter diagrams for two brands of coffee, brand 1 (a) and brand 2 (b). The variable on the vertical axis is the logarithm of sales (in units of coffee, LOGQ1 and LOGQ2); the variable on the horizontal axis is the logarithm of the deal rate (LOGD1 and LOGD2; deal rates of 1.05 and 1.15 correspond to price reductions of 5% and 15% respectively). Both scatter diagrams contain twelve points, but for both brands two observations for deal rate 15% are nearly overlapping (for brand 1 the ones with the lower sales and for brand 2 the ones with the higher sales).]

(ii) Linear model with constant elasticity

A simple linear regression model is given by log(q) = β1 + β2 log(d) + ε (here we suppress the observation index i for ease of notation). In this model β2 is the derivative of log(q) with respect to log(d) — that is,

    β2 = ∂log(q)/∂log(d) = (∂q/q)/(∂d/d),

which is the demand elasticity with respect to the deal rate. So the slope in the scatter diagram of log(q) against log(d) corresponds to the demand elasticity.

(iii) Non-linear model with varying elasticity

The scatter diagrams in Exhibit 4.5 suggest that for both brands the elasticity may not be constant. The slope seems to decrease for larger values of log(d), so that the elasticity may be decreasing for higher deal rates. A possible way to model such a rate-specific elasticity is given by the equation

    log(q) = β1 + β2 (d^β3 − 1)/β3 + ε.    (4.9)

As (d^β3 − 1)/β3 → log(d) for β3 → 0, the limiting model for β3 = 0 is the linear model log(q) = β1 + β2 log(d) + ε. The deal rate elasticity in (4.9) is equal to

    ∂log(q)/∂log(d) = ∂log(q)/(∂d/d) = d · ∂log(q)/∂d = d · β2 d^(β3−1) = β2 d^β3.

The null hypothesis of constant elasticity corresponds to β3 = 0 — that is, the linear model. So the non-linear model (4.9) provides a simple way to model a non-constant elasticity. This example will be further analysed in Sections 4.2.5 and 4.3.9.
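As an illustration of how a model like (4.9) can be fitted in practice, the following sketch uses SciPy's general-purpose least squares routine on synthetic data generated from the model itself; the parameter values, noise level, and starting values are hypothetical choices, not estimates from the text.

```python
import numpy as np
from scipy.optimize import least_squares

rng = np.random.default_rng(4)
# synthetic data in the spirit of the experiment: 12 weeks with deal rates
# 1, 1.05, 1.15; true (hypothetical) parameters b1=5, b2=10, b3=-4
d = np.repeat([1.00, 1.05, 1.15], [6, 3, 3])
log_q = 5.0 + 10.0 * (d**(-4.0) - 1.0) / (-4.0) + 0.05 * rng.normal(size=12)

def residuals(b):
    b1, b2, b3 = b
    # model (4.9); b3 must be kept away from 0, where the limit is b2*log(d)
    return log_q - (b1 + b2 * (d**b3 - 1.0) / b3)

fit = least_squares(residuals, x0=[5.0, 8.0, -1.0])
b1, b2, b3 = fit.x
print("estimates:", fit.x)
print("elasticity b2*d^b3 at d = 1, 1.05, 1.15:",
      b2 * np.array([1.0, 1.05, 1.15])**b3)   # decreasing in d
```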

Example 4.3: Food Expenditure

E XR416FEX

As a second example we consider budget data on food expenditure of groups of households. Here the question of interest is whether the food expenditure depends linearly on household income or whether this dependence becomes weaker for higher levels of income. Such a decreasing effect of income on food consumption may be expected because households with higher incomes can afford to spend relatively more on other expenses that provide a higher marginal utility than additional food.

Exhibit 4.6 shows a scatter diagram of the fraction of consumptive expenditure of households spent on food against total consumptive expenditure (measured in $10,000). The data consist of averages over groups of households and were obtained by a budget survey in the USA in 1950. These data are analysed (amongst others) in a special issue of the Journal of Applied Econometrics (12/5 (1997)).

[Exhibit 4.6 Food Expenditure (Example 4.3): scatter diagram of fifty-four data points of the fraction of expenditure spent on food (FRACFOOD, vertical axis) against total (consumption) expenditure (TOTCONS, horizontal axis).]

We denote the fraction of expenditure spent on food by y, the total consumption expenditure (in $10,000) by x2, and the (average) household size by x3. The total consumption expenditure is taken as a measure of the household income. The scatter diagram indicates that the effect of income on the fraction spent on food declines for higher income levels. Such a relation can be expressed by the non-linear model

    y = β1 + β2 x2^β3 + β4 x3 + ε.

The hypothesis that the fraction spent on food does not depend on household income corresponds to β3 = 0, and the hypothesis that it depends linearly on income corresponds to β3 = 1. Further analysis of this example is left as an exercise (see Exercise 4.16).

4.2.2 Non-linear least squares

E Uses Appendix A.7.

Non-linear regression

The linear regression model y = Xβ + ε can be written as y_i = x_i′β + ε_i, where i (i = 1, ..., n) denotes the observation and where x_i′ is the ith row of the n × k matrix X (so that x_i is a k × 1 vector). A non-linear regression model is described by an equation of the form

    y_i = f(x_i, β) + ε_i,    (4.10)

where f is a non-linear function. If the non-linearity is only in x_i — that is, if for fixed x_i the function f is linear in β — then this can be written as f(x_i, β) = β1 f1(x_i) + ... + βk fk(x_i). This model is linear in the unknown parameters β; it is a linear regression model with explanatory variables f_j(x_i), j = 1, ..., k, and in this case the parameters can be estimated by regressing y on the explanatory variables f1, ..., fk. On the other hand, if the function is non-linear in β — as in (4.9) — then the least squares estimation problem to minimize

    S(b) = Σ_{i=1}^n (y_i − f(x_i, b))²    (4.11)

becomes non-linear. The first order conditions are given by

    ∂S(b)/∂b = −2 Σ_{i=1}^n (y_i − f(x_i, b)) · ∂f(x_i, b)/∂b = 0.

This gives a set of non-linear normal equations in b. In general the solution of these equations cannot be determined analytically, so that numerical approximations are needed. Numerical aspects are discussed in the next section.

Requirement of identified parameters

The non-linear least squares (NLS) estimator b_NLS is defined as the minimizing value of (4.11). We assume that this minimum exists and that it is unique. This imposes conditions on the model. The parameters of the model (4.10) are said to be identified if for all β1 ≠ β2 there exists a vector x such that f(x, β1) ≠ f(x, β2). On the other hand, if there exist parameter vectors β1 ≠ β2 with f(x_i, β1) = f(x_i, β2) for all x_i, then S(β1) = S(β2) in (4.11), in which case minima need not be unique. To avoid problems in optimization one should work only with models with identified parameters. The parameters of the linear model with f(x, β) = x′β are always identified provided that the explanatory variables x are not perfectly collinear. So, if Assumption 1 is satisfied so that the regressor matrix X has rank k, then the parameters β of the linear model are identified. An example of a non-linear regression model with unidentified parameters (with a single explanatory variable x) is f(x, β) = β1 e^(β2+β3x), as two parameter vectors (β11, β21, β31) and (β12, β22, β32) give the same function values for all values of x if β31 = β32 and β11 e^β21 = β12 e^β22.

Statistical properties of non-linear least squares

The estimator b_NLS will in general not be unbiased. Under appropriate assumptions it is a consistent estimator, and its variance may be approximated in large samples by

    var(b_NLS) ≈ s²(X′X)⁻¹,

where s² = (1/(n−k)) Σ_{i=1}^n (y_i − f(x_i, b_NLS))² is the NLS estimate of the variance of the disturbance terms ε_i. Here X is the n × k matrix of first order derivatives of the function f in (4.10) with respect to β — that is, the matrix with ith row ∂f(x_i, β)/∂β′,

    X = (∂f(x_1, β)/∂β′ ; ... ; ∂f(x_n, β)/∂β′),    (4.12)

evaluated at β = b_NLS.

Note that for the linear model with f(x_i, β) = x_i′β this gives the matrix X as defined in (3.2) in Section 3.1.2 (p. 120).

Idea of conditions for asymptotic properties

It is beyond the scope of this book to derive the above asymptotic results for b_NLS; we give an idea of the required assumptions. Suppose that the data are generated by (4.10) with parameter vector β = β0. Let f_i⁰ = f(x_i, β0) and f_i = f(x_i, b); then the least squares criterion (4.11) can be decomposed as follows:

    (1/n)S(b) = (1/n)Σ (y_i − f_i)² = (1/n)Σ (f_i⁰ + ε_i − f_i)²
              = (1/n)Σ (f_i⁰ − f_i)² + (1/n)Σ ε_i² + (2/n)Σ (f_i⁰ − f_i)ε_i.

Of the three terms in the last expression, the middle one does not depend on b and hence it does not affect the location of the minimum of S(b). Suppose that the disturbance terms satisfy Assumptions 2–4, and that Assumption 5 (constant parameters) is also satisfied. The last term will tend (in probability) to zero under appropriate orthogonality conditions. For instance, in the linear model with f_i = x_i′b, we get f_i⁰ − f_i = x_i′(β0 − b), and the condition plim((1/n)Σ x_iε_i) = 0 is the orthogonality condition (4.4). However, the first term (1/n)Σ (f_i⁰ − f_i)² will not vanish for b ≠ β0 if the parameters are identified in the sense that for every b ≠ β0

    plim( (1/n)Σ_{i=1}^n (f(x_i, β0) − f(x_i, b))² ) ≠ 0.

Under the above conditions, the minimum value of (1/n)S(b) is asymptotically only obtained for b = β0, and hence b_NLS is consistent. Suppose further that Assumption 1* is satisfied with X as defined in (4.12) and evaluated at β0. Under similar conditions b_NLS is also asymptotically normally distributed, in the sense that

    √n (b_NLS − β0) →d N(0, σ²Q⁻¹),    (4.13)

where Q = plim((1/n)X′X) with X the n × k matrix of first order derivatives defined in (4.12) and evaluated at β = β0. Finally, the result in (4.13) means that

in large enough finite samples

    b_NLS ≈ N(β0, s²(X′X)⁻¹),    (4.14)

where X is the matrix defined in (4.12) evaluated at b_NLS and where s² = (1/(n−k)) Σ_{i=1}^n (y_i − f(x_i, b_NLS))² is the NLS estimate of σ². Summarizing, under suitable regularity conditions, and provided that the parameters of the model are identified, the estimator b_NLS is consistent and asymptotically normally distributed. Under similar conditions b_NLS is also asymptotically efficient, in the sense that √n(b_NLS − β0) has the smallest covariance matrix among all consistent estimators of β0.

Approximate t- and F-tests

The result in (4.14) motivates the use of t-tests and F-tests in a similar way as in Chapter 3. Asymptotic t-values and F-tests can be obtained as in the linear regression model. For the F-test the sums of squares are equal to the minimum value of S(b) in (4.11) under the null hypothesis and under the alternative hypothesis. That is, let b_NLS be the unrestricted non-linear least squares estimator and b_NLS^R the restricted estimator obtained by imposing g restrictions under the null hypothesis. Then under the above assumptions the F-test is computed by

    F = ((e_R′e_R − e′e)/g) / (e′e/(n−k)) ≈ F(g, n−k),

where e′e = S(b_NLS) is the sum of squares (4.11) obtained for the unrestricted NLS estimate b_NLS and e_R′e_R = S(b_NLS^R) is the sum of squares obtained for b_NLS^R.

Summary of computations in NLS

Computations for NLS:

Step 1: Estimation. Estimate β by minimizing the sum of squares (4.11) — for instance, by one of the non-linear optimization algorithms discussed in the next section — and determine the NLS residuals e_i = y_i − f(x_i, b_NLS).

Step 2: Testing. Approximate t- and F-tests can be based on the fact that in large enough samples b_NLS ≈ N(β, s²(X′X)⁻¹), where s² = (1/(n−k))Σ e_i² (with e_i = y_i − f(x_i, b_NLS) the NLS residuals) and where the n × k regressor matrix X is given in (4.12), evaluated at b = b_NLS.
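The two steps can be carried out as in the following sketch (Python/NumPy/SciPy; the exponential model, the synthetic data, and all numbers are illustrative assumptions, not from the text). The gradient matrix X of (4.12) is computed analytically for this model.

```python
import numpy as np
from scipy.optimize import least_squares

rng = np.random.default_rng(5)
x = rng.uniform(0, 2, size=60)
y = 2.0 * np.exp(0.7 * x) + 0.1 * rng.normal(size=60)   # synthetic data

f = lambda b: b[0] * np.exp(b[1] * x)                   # identified model
fit = least_squares(lambda b: y - f(b), x0=[1.0, 0.5])  # step 1: minimize S(b)
b = fit.x
e = y - f(b)                                            # NLS residuals

# step 2: gradient matrix X = df/db' of (4.12), evaluated at b_NLS
X = np.column_stack([np.exp(b[1] * x), b[0] * x * np.exp(b[1] * x)])
n, k = X.shape
s2 = e @ e / (n - k)
se = np.sqrt(np.diag(s2 * np.linalg.inv(X.T @ X)))      # asymptotic s.e.
print("b_NLS:", b, "s.e.:", se)

# F-test of H0: b2 = 0 (g = 1 restriction); the restricted model is then
# y = b1 + e, so its sum of squares is the variation around the mean
e_R = y - y.mean()
F = ((e_R @ e_R - e @ e) / 1) / (e @ e / (n - k))
print("F =", F)
```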

4.2.3 Non-linear optimization

E Uses Appendix A.7.

Numerical aspects

If the model (4.10) is non-linear, the objective function S(b) in (4.11) is not quadratic and the optimal value of b cannot be written as an explicit expression in terms of the data (y_i, x_i), i = 1, ..., n. In this section we consider some numerical aspects of non-linear optimization. The vector of unknown parameters is denoted by θ and the objective function by F(θ), with column vector of gradients G(θ) = ∂F(θ)/∂θ and Hessian matrix H(θ) = ∂²F(θ)/∂θ∂θ′. Optimal values of θ are characterized by the first order conditions

    G(θ) = 0.

Iterative optimization

Numerical procedures often involve the following steps.

Step 1: Start. Determine an initial estimate of θ, say θ̂0.

Step 2: Improve and repeat. Determine an improved estimate of θ, say θ̂1. Iterate these improvements, giving a sequence of estimates θ̂1, θ̂2, θ̂3, ...

Step 3: Stop. Stop the iterations if the improvements become sufficiently small. For the stopping rule in step 3, one can consider the percentage changes in the estimated parameters θ̂_h and θ̂_{h+1} in two consecutive iterations and the relative improvement (F(θ̂_{h+1}) − F(θ̂_h))/F(θ̂_h). If these changes are small enough, the iterations are stopped.

Remarks on numerical methods

In general there is no guarantee that the final estimate θ̂ is close to the global optimum. Even if G(θ̂) ≈ 0, this may correspond to a local optimum. To prevent the calculated θ̂ being only a local optimum instead of a global optimum one can vary the initial estimate of θ in step 1. For instance, we can change each component of the final estimate θ̂ by a certain percentage and take the new values as initial estimates in a new round of iterations. If the improvements in the objective function are small but the changes in the parameters remain large in a sequence of iterations, this may be an indication of identification problems. A possible solution is to adjust the objective function or the underlying model specification.

Several methods are available for the iterations in step 2. Here we discuss two methods that are often applied — namely, Newton–Raphson and Gauss–Newton. Both methods are based on the idea of linear approximation — of the gradient G(θ) in Newton–Raphson and of the non-linear function f(x, β) in Gauss–Newton.

The Newton–Raphson method

The Newton–Raphson method is based on the iterative linearization of the first order condition for an optimum — that is, G(θ) = 0. Around a given value θ̂_h, the gradient G can be linearized by G(θ) ≈ G(θ̂_h) + H(θ̂_h)(θ − θ̂_h). The condition G(θ) = 0 is approximated by the condition G(θ̂_h) + H(θ̂_h)(θ − θ̂_h) = 0. These equations are linear in the unknown parameter vector θ and they are easily solved, giving the next estimate

    θ̂_{h+1} = θ̂_h − H_h⁻¹ G_h,    (4.15)

where G_h and H_h are the gradient and Hessian matrix evaluated at θ̂_h. Under certain regularity conditions these iterations converge to a local optimum of F(θ). It depends on the form of the function F(θ) and on the procedure to determine initial estimates θ̂0 whether the limiting estimate corresponds to the global optimum. A graphical illustration of this method is given in Exhibit 4.7, which shows the (non-linear) gradient function and two iterations of the algorithm.

[Exhibit 4.7 Newton–Raphson: illustration of two Newton–Raphson iterations to find the optimum of an objective function. The graph shows the first derivative (G) of the objective function as a function of the parameter θ. The algorithm starts in θ̂0; θ̂1 and θ̂2 denote the estimates obtained in the first and second iteration, and θ̄ is the optimal value.]

Regularization

Sometimes — for instance, if the Hessian matrix is nearly singular — the iterations in (4.15) are adjusted by a regularization factor, so that

    θ̂_{h+1} = θ̂_h − (H_h + cI)⁻¹ G_h,

where c > 0 is a chosen constant and I is the identity matrix. This forces the parameter adjustments more in the direction of the gradient.
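A compact implementation of iteration (4.15) might look as follows (a Python/NumPy sketch; the example objective function is an arbitrary illustrative choice):

```python
import numpy as np

def newton_raphson(grad, hess, theta0, tol=1e-10, max_iter=100):
    """Iterate theta_{h+1} = theta_h - H_h^{-1} G_h, as in (4.15)."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        step = np.linalg.solve(hess(theta), grad(theta))
        theta = theta - step
        if np.max(np.abs(step)) < tol:      # stopping rule of step 3
            break
    return theta

# example: minimize F(theta) = exp(theta) - 2*theta (optimum at log 2)
grad = lambda t: np.array([np.exp(t[0]) - 2.0])
hess = lambda t: np.array([[np.exp(t[0])]])
print(newton_raphson(grad, hess, [0.0]))    # approximately [0.6931]
```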

The Gauss–Newton method

The Newton–Raphson method requires the computation of the gradient vector and the Hessian matrix in each iteration. In some cases these can be computed analytically; in other cases one has to use numerical methods. In many cases the computation of the Hessian matrix is cumbersome, and it is much more convenient to use methods that require only the gradient. Therefore we now discuss the Gauss–Newton method for non-linear regression models. In this case the parameter vector is θ = β and the objective function is S(b) defined in (4.11). The idea is to linearize the function f so that this objective function becomes quadratic.

Derivation of Gauss–Newton iterations

Assuming that the function f(x, β) is differentiable around a given value β̂_h, it can be written as

    f(x, β) = f_h(x) + g_h(x)′(β − β̂_h) + r_h(x),

where f_h(x) = f(x, β̂_h) and g_h(x) = ∂f(x, β)/∂β is the k × 1 vector of first order derivatives, evaluated at β̂_h. Further, r_h(x) is a remainder term that becomes negligible if β is close to β̂_h. If we replace the function f(x, β) in (4.11) by its linear approximation, the least squares problem becomes to minimize

    S_h(β) = Σ_{i=1}^n (y_i − f_h(x_i) − g_h(x_i)′(β − β̂_h))² = Σ_{i=1}^n (z_hi − g_hi′β)²,

where z_hi = y_i − f_h(x_i) + g_h(x_i)′β̂_h and g_hi = g_h(x_i) are computed at the given value of β̂_h. The minimization of S_h(β) with respect to β is an ordinary least squares problem with dependent variable z_hi and with independent variables g_hi. Let z_h be the n × 1 vector with elements z_hi and let X_h be the n × k matrix with rows g_hi′ — that is, the matrix (4.12) evaluated at β̂_h. Further let e_hi = y_i − f(x_i, β̂_h) be the residuals of the non-linear regression model (4.10) corresponding to β̂_h. The value of β that minimizes S_h(β) is obtained by regressing z_h on X_h, and using the fact that z_h = e_h + X_h β̂_h it follows that β̂_{h+1} = (X_h′X_h)⁻¹X_h′z_h = (X_h′X_h)⁻¹X_h′(e_h + X_h β̂_h), and hence

    β̂_{h+1} = β̂_h + (X_h′X_h)⁻¹X_h′e_h.    (4.16)

So, in each Gauss–Newton iteration the parameter adjustment β̂_{h+1} − β̂_h is obtained by regressing the residuals e_h of the last estimated model on the gradient matrix X_h evaluated at β̂_h. The Gauss–Newton iterations are repeated until the estimates converge. The usual expression for the variance of least squares estimators in the final iteration is s²(X′X)⁻¹, where X is the gradient matrix evaluated at the final estimate β̂. This is precisely the asymptotic approximation of the variance of the non-linear least squares estimator b_NLS in (4.14). So asymptotic standard errors of β̂ are immediately obtained from the final regression in (4.16).
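Iteration (4.16) can be implemented directly, since each step is an ordinary least squares regression of the current residuals on the current gradient matrix. The sketch below (Python/NumPy, reusing the same illustrative exponential model as before) is one way to do this:

```python
import numpy as np

def gauss_newton(f, jac, y, b0, tol=1e-10, max_iter=100):
    """b_{h+1} = b_h + (X_h'X_h)^{-1} X_h' e_h, as in (4.16):
    regress the current residuals e_h on the gradient matrix X_h."""
    b = np.asarray(b0, dtype=float)
    for _ in range(max_iter):
        e = y - f(b)                                   # residuals at b_h
        X = jac(b)                                     # gradients at b_h
        step = np.linalg.lstsq(X, e, rcond=None)[0]    # OLS of e_h on X_h
        b = b + step
        if np.max(np.abs(step)) < tol:
            break
    return b

rng = np.random.default_rng(6)
x = rng.uniform(0, 2, size=60)
y = 2.0 * np.exp(0.7 * x) + 0.1 * rng.normal(size=60)  # synthetic data
f = lambda b: b[0] * np.exp(b[1] * x)
jac = lambda b: np.column_stack([np.exp(b[1] * x),
                                 b[0] * x * np.exp(b[1] * x)])
print(gauss_newton(f, jac, y, [1.0, 0.5]))
```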

Comparison of the two methods

Finally, we compare the Gauss–Newton iterations (4.16) with those of Newton–Raphson in (4.15) for the least squares criterion S(b) in (4.11). For the criterion function F(θ) = S(b), the gradient and Hessian at β̂_h are given by

    ∂S/∂b = −2 Σ_{i=1}^n (y_i − f(x_i, b)) ∂f(x_i, b)/∂b = −2X_h′e_h,
    ∂²S/∂b∂b′ = 2 Σ_{i=1}^n (∂f(x_i, b)/∂b)(∂f(x_i, b)/∂b′) − 2 Σ_{i=1}^n e_hi ∂²f(x_i, b)/∂b∂b′
              = 2X_h′X_h − 2 Σ_{i=1}^n e_hi ∂²f(x_i, b)/∂b∂b′.

So the Newton–Raphson iterations (4.15) reduce to those of Gauss–Newton (4.16) if we neglect the last term in the above expression for the Hessian. This can also be motivated asymptotically, as (1/n)X_h′X_h has a finite and non-zero limit (under Assumption 1*) and the term (1/n)Σ e_hi ∂²f/∂b∂b′ converges to zero for n → ∞ (under appropriate orthogonality conditions).

E Exercises: S: 4.8, 4.9, 4.13b, c; E: 4.16b.

4.2.4 The Lagrange Multiplier test

E Uses Appendix A.2, A.7.

For the computation of the F-test at the end of Section 4.2.2 we have to perform two non-linear optimizations, one in the restricted model and another one in the unrestricted model. We now discuss an alternative approach for testing parameter restrictions that needs the estimates only of the restricted model. This test is based on the method of Lagrange for minimization under restrictions.

Interpretation of the Lagrange multiplier in the linear model

For simplicity we first consider the case of a linear model with linear restrictions, so that

    y = X1β1 + X2β2 + ε,    H0: β2 = 0,    (4.17)

where β2 contains g parameters and β1 contains the remaining k − g parameters. We assume that the restricted model contains a constant term, so that X1 contains a column with all elements equal to 1. The Lagrange method states that the least squares estimates under the null hypothesis are obtained by minimization of the (unconstrained) Lagrange function

    L(β1, β2, λ) = S(β1, β2) + 2λ′β2,    (4.18)

where S(β1, β2) = (y − X1β1 − X2β2)′(y − X1β1 − X2β2) is the least squares criterion function and λ is a vector with the g Lagrange multipliers. The first order conditions for a minimum are given by

    ∂L/∂β1 = −2X1′(y − X1β1 − X2β2) = 0,
    ∂L/∂β2 = −2X2′(y − X1β1 − X2β2) + 2λ̂ = 0,
    ∂L/∂λ = 2β2 = 0.

Substituting β2 = 0 in the first condition shows that X1′(y − X1β1) = 0 — that is, β1 = b_R = (X1′X1)⁻¹X1′y is the restricted least squares estimate obtained by regressing y on X1. If we write e_R = y − X1b_R for the corresponding restricted least squares residuals, then the above three first order conditions can be written as

    X1′e_R = 0,    λ̂ = X2′e_R,    b2 = 0.    (4.19)

The second condition states that ∂L/∂β2 = ∂S/∂β2 + 2λ̂ = 0, so that (evaluated at the restricted estimates)

    −2λ̂ = ∂S(b_R, 0)/∂β2.

So λ̂ measures the marginal decrease of the least squares criterion S in (4.18) which can be achieved by relaxing the restriction that β2 = 0. This is illustrated graphically in Exhibit 4.8.

[Exhibit 4.8 Lagrange multiplier: graphical interpretation of the Lagrange multiplier in constrained optimization. The graphs show the objective function S(b1, b2) as a function of the parameter b2. In (a) the restriction b2 = 0 is close to the unrestricted minimizing value b2*, the slope −2λ̂ = ∂S(b1, 0)/∂b2 is 'small', and the value of S at b2 = 0 is nearly minimal; in (b) b2 = 0 is further away from b2*, the slope is 'large', and the value of S at b2 = 0 is further away from the minimum.]

The hypothesis β2 = 0 is acceptable if the sum of squares S does not increase much by imposing this restriction — that is, if λ̂ is sufficiently small. This suggests that the null hypothesis can be tested by testing whether λ̂ differs significantly from zero. For this purpose we need to know the distribution of λ̂ under the null hypothesis that β2 = 0.

Derivation of LM-test statistic

Under the null hypothesis that β2 = 0, it follows that e_R = y − X1b_R = M1y = M1(X1β1 + ε) = M1ε, where M1 = I − X1(X1′X1)⁻¹X1′. Under the standard Assumptions 1–7 of Section 3.1.4 (p. 125–6), there holds ε ∼ N(0, σ²I), so that e_R ∼ N(0, σ²M1) and

    λ̂ = X2′e_R ∼ N(0, σ²X2′M1X2).

This means that λ̂′(X2′M1X2)⁻¹λ̂/σ² is distributed as χ²(g). If the unknown variance σ² is replaced by the consistent estimator σ̂² = (1/n)e_R′e_R, then it follows that

    LM = λ̂′(X2′M1X2)⁻¹λ̂/σ̂² ≈ χ²(g).    (4.20)

This is called the Lagrange Multiplier test statistic. The null hypothesis is rejected for large values of LM, as the value of λ̂ then differs significantly from zero.
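The χ²(g) approximation in (4.20) can be checked by simulation. In the sketch below (Python/NumPy; all design choices are illustrative), the null hypothesis holds by construction, and the rejection frequency at the χ²(1) 5% critical value comes out close to 0.05:

```python
import numpy as np

rng = np.random.default_rng(7)
n, reps = 100, 10_000
lm = np.empty(reps)
for r in range(reps):
    X1 = np.column_stack([np.ones(n), rng.normal(size=n)])
    X2 = rng.normal(size=(n, 1))                        # g = 1 tested variable
    y = X1 @ np.array([1.0, 0.5]) + rng.normal(size=n)  # H0: beta_2 = 0 true
    M1 = np.eye(n) - X1 @ np.linalg.solve(X1.T @ X1, X1.T)
    e_R = M1 @ y                                        # restricted residuals
    lam = X2.T @ e_R                                    # Lagrange multiplier
    s2 = e_R @ e_R / n
    A = X2.T @ M1 @ X2
    lm[r] = lam @ np.linalg.solve(A, lam) / s2          # statistic (4.20)

print((lm > 3.841).mean())   # about 0.05 = P(chi2(1) > 3.841)
```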

Of course we could also use the unbiased estimator s² = e_R′e_R/(n − k + g) instead of σ̂², but here we use σ̂² for ease of later comparisons. The difference between s² and σ̂² is small if n is sufficiently large, and it disappears for n → ∞.

Computation of LM-test by auxiliary regressions

The expression for the LM-test in (4.20) involves the inverse of the matrix X2′M1X2. It is convenient to compute the LM-test in an alternative way by means of regressions. We will show that the value of the LM-test in (4.20) can be computed by the following steps.

Computation of LM-test:

Step 1: Estimate the restricted model. Estimate the restricted model under the null hypothesis that β2 = 0 — that is, regress y on X1 alone, with result y = X1b_R + e_R, where e_R is the vector of residuals of this regression.

Step 2: Auxiliary regression of residuals of step 1. Regress the residuals e_R of step 1 on the set of all explanatory variables of the unrestricted model — that is, regress e_R on X = (X1 X2).

Step 3: LM = nR² of step 2. Then LM = nR² of the regression in step 2, and LM ≈ χ²(g) if the null hypothesis β2 = 0 holds true (where g is the number of elements of β2 — that is, the number of restrictions under the null hypothesis).
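These three steps translate directly into code. The following sketch (Python/NumPy; the data-generating choices are illustrative) computes LM = nR² for a model where X1 contains a constant, as assumed above:

```python
import numpy as np

def lm_test(y, X1, X2):
    """LM = n R^2 from regressing the restricted residuals on (X1, X2)."""
    n = len(y)
    # step 1: restricted model y = X1 b_R + e_R
    b_R = np.linalg.lstsq(X1, y, rcond=None)[0]
    e_R = y - X1 @ b_R
    # step 2: auxiliary regression of e_R on all regressors
    X = np.hstack([X1, X2])
    g_hat = np.linalg.lstsq(X, e_R, rcond=None)[0]
    resid = e_R - X @ g_hat
    # e_R has mean zero because X1 contains a constant, so SST = e_R'e_R
    R2 = 1.0 - (resid @ resid) / (e_R @ e_R)
    # step 3
    return n * R2

rng = np.random.default_rng(8)
n = 100
X1 = np.column_stack([np.ones(n), rng.normal(size=n)])
X2 = rng.normal(size=(n, 2))                    # g = 2 restrictions tested
y = X1 @ np.array([1.0, 0.5]) + rng.normal(size=n)
print(lm_test(y, X1, X2))                       # compare with chi-squared(2)
```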

The regression of $e_R$ on $X$ in the model $e_R = X\gamma + \omega$ gives $\hat\gamma = (X'X)^{-1}X'e_R$, with explained part $\hat e_R = X\hat\gamma = X(X'X)^{-1}X'e_R$. As $X$ contains a constant term, it follows that the mean of $\hat e_R$ is zero. So the explained sum of squares is
$$SSE = \hat e_R'\hat e_R = e_R'X(X'X)^{-1}X'e_R.$$
It remains to prove that this can be written as $e_R'X_2(X_2'M_1X_2)^{-1}X_2'e_R$. The conditions (4.19) imply that, with $X = (X_1 \;\; X_2)$, there holds
$$X'e_R = \begin{pmatrix} X_1'e_R \\ X_2'e_R \end{pmatrix} = \begin{pmatrix} 0 \\ X_2'e_R \end{pmatrix}.$$
Further it follows from the results in Chapter 3 that the covariance matrix of $b_2$ (the least squares estimator of $\beta_2$ in the unrestricted model) is equal to $\mathrm{var}(b_2) = \sigma^2(X_2'M_1X_2)^{-1}$ (see (3.46) (p. 158) for the case where $X_2$ contains a single column). As the covariance matrix of the unrestricted estimators $(b_1, b_2)$ is equal to $\sigma^2(X'X)^{-1}$, this means that $(X_2'M_1X_2)^{-1}$ is the lower $g \times g$ diagonal block of $(X'X)^{-1}$. Combining these results gives
$$e_R'X(X'X)^{-1}X'e_R = (0 \;\; e_R'X_2)(X'X)^{-1}\begin{pmatrix} 0 \\ X_2'e_R \end{pmatrix} = e_R'X_2(X_2'M_1X_2)^{-1}X_2'e_R.$$
The above results prove the validity of the three-step procedure to compute the LM-test, so that
$$LM = n\,\frac{e_R'X(X'X)^{-1}X'e_R}{e_R'e_R} = n\,\frac{SSE}{SST} = nR^2, \qquad (4.21)$$
where $R^2$ is the coefficient of determination of the auxiliary regression
$$e_R = X_1\gamma_1 + X_2\gamma_2 + \omega. \qquad (4.22)$$

Interpretation of LM-test and relation with F-test

The null hypothesis that $\beta_2 = 0$ is rejected for large values of LM, that is, for large values of $R^2$ in (4.22). Stated intuitively, the restrictions are rejected if the residuals $e_R$ under the null hypothesis can be explained by the variables $X_2$. The LM-test in the linear model is related to the F-test (3.50). It is left as an exercise (see Exercise 4.6) to prove that in the linear model
$$LM = \frac{ngF}{n-k+gF}. \qquad (4.23)$$
This shows that for a large sample size $n$ there holds $LM \approx gF$.
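The three steps, the direct formula (4.20), and the relation (4.23) can be illustrated numerically. The following is a minimal sketch in Python; the simulated design, sample sizes, and variable names are illustrative assumptions, not part of the text.

```python
# Sketch of the three-step LM computation and its checks, on simulated data.
import numpy as np

rng = np.random.default_rng(0)
n, k, g = 100, 4, 2                      # k regressors in total, g restricted
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
X1, X2 = X[:, :k - g], X[:, k - g:]      # null hypothesis: coefficients of X2 are zero
y = X1 @ np.array([1.0, 0.5]) + rng.normal(size=n)   # DGP satisfies the null

# Step 1: restricted regression of y on X1
bR = np.linalg.lstsq(X1, y, rcond=None)[0]
eR = y - X1 @ bR

# Step 2: auxiliary regression of eR on the full X; Step 3: LM = n * R^2
gamma = np.linalg.lstsq(X, eR, rcond=None)[0]
u = eR - X @ gamma
R2 = 1 - (u @ u) / ((eR - eR.mean()) @ (eR - eR.mean()))
LM = n * R2

# Check against the direct formula (4.20) and the F-relation (4.23)
M1 = np.eye(n) - X1 @ np.linalg.inv(X1.T @ X1) @ X1.T
lam = X2.T @ eR
LM_direct = lam @ np.linalg.inv(X2.T @ M1 @ X2) @ lam / (eR @ eR / n)
b = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ b
F = ((eR @ eR - e @ e) / g) / (e @ e / (n - k))
print(LM, LM_direct, n * g * F / (n - k + g * F))    # all three coincide
```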

Derivation of LM-test in non-linear regression model

Until now we considered the linear regression model. A similar approach can be followed to perform tests in non-linear regression models. Consider the following testing problem:
$$y_i = f(x_i, \beta_1, \beta_2) + e_i, \qquad H_0: \beta_2 = 0, \qquad (4.24)$$
where $\beta_2$ contains $g$ parameters and $\beta_1$ the remaining $k - g$ parameters of the model. The Lagrange function is defined as in (4.18), with $S$ the non-linear least squares criterion in (4.11). So $S = \sum(y_i - f(x_i, \beta_1, \beta_2))^2$ and
$$\frac{\partial S}{\partial\beta_1} = -2X_1'e, \qquad \frac{\partial S}{\partial\beta_2} = -2X_2'e,$$
where $X_1 = \partial f/\partial\beta_1'$ is the $n \times (k-g)$ matrix of first order derivatives with respect to $\beta_1$, $X_2 = \partial f/\partial\beta_2'$ is the $n \times g$ matrix of derivatives with respect to $\beta_2$, and $e_i = y_i - f(x_i, \beta_1, \beta_2)$ are the residuals. It follows from (4.19) that the first order conditions $\partial L/\partial\beta_1 = 0$, $\partial L/\partial\beta_2 = 0$, and $\partial L/\partial\lambda = 0$ can be written as
$$\lambda = X_{2R}'e_R, \qquad \beta_2 = 0, \qquad X_{1R}'e_R = 0.$$
Here $X_{1R}$ and $X_{2R}$ are the matrices $X_1 = \partial f/\partial\beta_1'$ and $X_2 = \partial f/\partial\beta_2'$ of derivatives evaluated at $(\beta_1, \beta_2) = (b_{NLS}^R, 0)$, with $b_{NLS}^R$ the restricted NLS estimator of $\beta_1$ under the restriction that $\beta_2 = 0$, and $e_{Ri} = y_i - f(x_i, b_{NLS}^R, 0)$ are the corresponding residuals. The difference with (4.19) is that $X_{1R}$ and $X_{2R}$ depend on $b_{NLS}^R$, so that the normal equations $X_{1R}'e_R = 0$ are non-linear in $\beta_1$.

LM-test in non-linear regression model

The foregoing arguments show that, as before, the restrictions $\beta_2 = 0$ can be tested by considering whether $\hat\lambda$ differs significantly from zero. Under the conditions of asymptotic normality in (4.14), the test can again be computed (approximately, in large enough samples) as in (4.21), by an auxiliary linear regression as in (4.22). That is, $LM = nR^2 \approx \chi^2(g)$, with $R^2$ of the regression of the restricted residuals $e_{Ri} = y_i - f(x_i, b_{NLS}^R, 0)$ on the gradients $X_1 = \partial f/\partial\beta_1'$ and $X_2 = \partial f/\partial\beta_2'$, evaluated at $(b_{NLS}^R, 0)$. In terms of the Gauss–Newton iterations (4.16), this means that the residuals of the last iteration (in the model estimated under the null hypothesis) are regressed on the full matrix of gradients under the alternative hypothesis. The LM-test has the advantage that only the smaller model has to be estimated by NLS.
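As a sketch of this procedure in a genuinely non-linear model, the fragment below applies the three steps to an illustrative exponential regression; the model $f(x, \beta) = \exp(\beta_1 + \beta_2 x)$, the data, and all names are assumptions for demonstration only.

```python
# LM-test in a non-linear regression: regress the restricted residuals on the
# gradient of f evaluated at the restricted estimate (here H0: beta2 = 0).
import numpy as np

rng = np.random.default_rng(1)
n = 200
x = rng.uniform(0.0, 2.0, n)
y = np.exp(0.5) + 0.1 * rng.normal(size=n)           # DGP satisfies H0

# Step 1: restricted NLS. With beta2 = 0, S = sum (y - exp(b1))^2 is
# minimized by exp(b1R) = mean(y).
b1R = np.log(y.mean())
eR = y - np.exp(b1R)

# Step 2: regress eR on the gradient columns df/db1 = exp(b1 + b2*x) and
# df/db2 = x * exp(b1 + b2*x), both evaluated at (b1R, 0).
G = np.column_stack([np.exp(b1R) * np.ones(n), np.exp(b1R) * x])
gamma = np.linalg.lstsq(G, eR, rcond=None)[0]
u = eR - G @ gamma
R2 = 1 - (u @ u) / ((eR - eR.mean()) @ (eR - eR.mean()))

# Step 3: LM = n * R^2, approximately chi-squared(1) under H0
print("LM =", n * R2)
```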

Summary of computations for the LM-test

The LM-test of the hypothesis that $\beta_2 = 0$ in the model $y = f(x, \beta_1, \beta_2) + e$ can be computed by means of an auxiliary regression. Let $\beta_2$ consist of $g$ components and $\beta_1$ of $(k-g)$ components.

Computation of LM-test
Step 1: Estimate the restricted model. Estimate the restricted model (with $\beta_2 = 0$ imposed), with corresponding vector of residuals $e_R$.
Step 2: Auxiliary regression of residuals on full set of regressors. Regress the residuals $e_R$ on the $n \times k$ matrix of first order derivatives $X = (\partial f/\partial\beta_1' \;\; \partial f/\partial\beta_2')$.
Step 3: LM = nR² of the regression in step 2. Then $LM = nR^2$ of the regression in step 2. Asymptotically, the LM-statistic follows the $\chi^2(g)$ distribution if the hypothesis that $\beta_2 = 0$ holds true, and the null hypothesis is rejected for large enough values of the LM-statistic.

Exercises: T: 4.6a; S: 4.13d; E: 4.10, 4.16g.

4.2.5 Illustration: Coffee Sales

We illustrate the results on non-linear regression by considering the marketing data of coffee sales (XM402COF) discussed before in Example 4.2 (p. 202). We will discuss (i) the model, (ii) the non-linear least squares estimates, (iii) results of the Gauss–Newton iterations, (iv) t- and F-tests on constant elasticity, and (v) the LM-test on constant elasticity.

(i) Model

In Example 4.2 we considered the non-linear regression model $\log(q_i) = f(d_i, \beta) + e_i$ for coffee sales ($q$) in terms of the deal rate ($d$), where
$$f(d, \beta) = \beta_1 + \frac{\beta_2}{\beta_3}\big(d^{\beta_3} - 1\big).$$
Of special interest is the hypothesis that $\beta_3 = 0$, which corresponds to a constant demand elasticity. This case is obtained in the limit for $\beta_3 \to 0$, which gives the linear model $f(d, \beta) = \beta_1 + \beta_2\log(d)$.

(ii) Non-linear least squares estimates

We first consider the $n = 12$ data for the first brand of coffee. For a given value of $\beta_3$ the model is linear in the parameters $\beta_1$ and $\beta_2$, and these two parameters

4.10.2.20 −5 −10 −15 −20 SSR 0. @ b2 b3 @f b b ¼À 2 (db3 À 1) þ 2 db3 log (d): 2 b3 @ b3 b3 Exhibit 4. This grid search gives ^ ¼ 5:81 and b ^ ¼ 10:30. The resulting estimates of a software package are in Panel 2 in Exhibit 4. (a) shows the minimum SSR that can be obtained for a given value of b3 . and the same holds true for the parameter estimates.087049 0. @ b1 @f 1 ¼ (db3 À 1).5) Non-linear least squares for the model for coffee sales of brand 1. The outcomes are in line with the earlier results based on a grid search for b3.443987 0.087049 0. Exhibit 3 4.087049 0. 219 (iii) Gauss–Newton iterations Next we apply the Gauss–Newton algorithm for the estimation of b. .11)) for a grid of values of b3 .9 shows the estimates of b3 (in (b)) and the value of SSR (in (c)) for a number of iterations of the Gauss–Newton method.087049 0.105480 0.05 −60 0 5 ITER 10 15 9 10 Exhibit 4.087049 BETA3 −40 0 −20 BETA3 20 40 0. (c) shows the values of SSR that are obtained in the Gauss–Newton iterations.313433 0.087176 0.10 0. As starting values we take b1 ¼ 0. b2 ¼ 1. with starting values b1 ¼ 0 and b2 ¼ b3 ¼ 1. (a) 0. The vector of gradients is given by @f ¼ 1.9 (a) shows the minimal value of the least squares criterion (the sum of squared residuals SSR in (4.25 (b) 5 0 (c) iter 0 1 2 3 4 5 6 7 8 SSR 434.9 Coffee Sales (Section 4.15 0. with corresponding estimates b b 3 1 2 ^ SSR at b3 is of course lower than at b3 ¼ 0.087049 0. and b3 ¼ 1. (b) shows the values of b3 that are obtained in iterations of the Gauss–Newton algorithm. The NLS estimates correspond to the values where SSR is minimal. and below we will test the hypothesis that b3 ¼ 0 by evaluating whether this difference is signiﬁcant. and the NLS estimate corresponds to the value of b3 where this SSR is minimal. The ^ ¼ À13:43.30 0. This shows that the values of SSR converge.2 Non-linear regression 1 can be estimated by regressing log (qi ) on a constant and b (db3 À 1).087049 0.

Exhibit 4.10 Coffee Sales (Section 4.2.5)
Regressions for two brands of coffee: models with constant elasticity (Panels 1 and 3), models with varying elasticity (Panels 2 and 4), and auxiliary regressions for the LM-tests on constant elasticity (Panels 5 and 6). Each panel reports coefficients, standard errors, t-statistics, and P-values; the main results are as follows.

Panel 1: dependent variable LOGQ1 (brand 1, 12 observations); least squares of LOGQ1 on C and LOGD1; sum of squared residuals 0.1323.
Panel 2: dependent variable LOGQ1 (brand 1, 12 observations); least squares on LOGQ1 = C(1) + (C(2)/C(3))·(D1^C(3) − 1), convergence achieved after 5 iterations; C(3) = −13.43 with t-statistic −2.012 (Prob. 0.0751); sum of squared residuals 0.0870.
Panel 3: dependent variable LOGQ2 (brand 2, 12 observations); least squares of LOGQ2 on C and LOGD2; R-squared 0.8653; sum of squared residuals 0.1322.
Panel 4: dependent variable LOGQ2 (brand 2, 12 observations); least squares on LOGQ2 = C(1) + (C(2)/C(3))·(D2^C(3) − 1), convergence achieved after 5 iterations; C(3) has t-statistic −1.651 (Prob. 0.1332); sum of squared residuals 0.1009.
Panel 5: dependent variable RESLIN1 (the 12 residuals of Panel 1 for brand 1); regressors C, LOGD1, LOGD1^2; R-squared 0.3422.
Panel 6: dependent variable RESLIN2 (the 12 residuals of Panel 3 for brand 2); regressors C, LOGD2, LOGD2^2; R-squared 0.2363.

This table also contains the NLS estimates for the second brand of coffee (in Panel 4) and the estimates under the null hypothesis that $\beta_3 = 0$ (in Panels 1 and 3).

(iv) t- and F-tests on constant elasticity

At 5 per cent significance, the t-test fails to reject the null hypothesis that $\beta_3 = 0$ for both brands. The reported P-values (rounded to two decimals) are 0.08 for brand 1 (see Panel 2) and 0.13 for brand 2 (see Panel 4). Note, however, that these values are based on the asymptotic distribution in (4.14) and that the number of observations ($n = 12$) is quite small, so that the P-values are not completely reliable. The F-tests for brands 1 and 2 are given by
$$F_1 = \frac{(0.1323 - 0.0870)/1}{0.0870/(12-3)} = 4.68, \qquad F_2 = \frac{(0.1322 - 0.1009)/1}{0.1009/(12-3)} = 2.79.$$
The 5 per cent critical value of the F(1, 9) distribution is equal to 5.12, so that the hypothesis that $\beta_3 = 0$ is again not rejected. The relation $F = t^2$ for a single parameter restriction was shown in Chapter 3 to be valid for linear models, but for non-linear models this no longer holds true. As can be checked from the t-values in Panels 2 and 4 in Exhibit 4.10, the F-values are not equal to the squares of the t-values of $\hat\beta_3$.

(v) LM-test on constant elasticity

Next we compute the LM-test for the hypothesis that $\beta_3 = 0$. To compute this test, the residuals of the log-linear models (corresponding to $\beta_3 = 0$) are regressed on the partial derivatives $\partial f/\partial\beta_i$, evaluated at the estimated parameters under the null hypothesis (so that limits for $\beta_3 \to 0$ should be taken). This gives
$$\frac{\partial f}{\partial\beta_1} = 1, \qquad \frac{\partial f}{\partial\beta_2} = \lim_{\beta_3 \to 0}\frac{d^{\beta_3} - 1}{\beta_3} = \log(d),$$
$$\frac{\partial f}{\partial\beta_3} = \lim_{\beta_3 \to 0}\frac{\beta_2}{\beta_3}\Big(d^{\beta_3}\log(d) - \frac{d^{\beta_3} - 1}{\beta_3}\Big) = \frac{1}{2}\beta_2(\log(d))^2,$$
so the relevant regressors in step 2 of the LM computation scheme are 1, $\log(d)$, and $(\log(d))^2$. The results of the auxiliary regressions in (4.22) for the two brands are in Panels 5 and 6 in Exhibit 4.10. So the test statistics (rounded to two decimals) are $LM_1 = 12R_1^2 = 12 \cdot 0.34 = 4.11$ and $LM_2 = 12R_2^2 = 12 \cdot 0.24 = 2.84$. The 5 per cent critical value of the $\chi^2(1)$ distribution is equal to 3.84, so that in this case the null hypothesis is rejected for brand 1 but not for brand 2.
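The reported F- and LM-statistics, and the limit used for $\partial f/\partial\beta_3$, can be verified with a few lines of code; the numbers plugged in are the rounded values quoted above.

```python
# Checks of the F- and LM-statistics and of the gradient limit as beta3 -> 0.
import numpy as np
from scipy import stats

# F-tests from restricted/unrestricted SSRs (g = 1 restriction, n = 12, k = 3)
for ssr_r, ssr_u in [(0.1323, 0.0870), (0.1322, 0.1009)]:
    F = (ssr_r - ssr_u) / (ssr_u / 9)
    print(F, stats.f.ppf(0.95, 1, 9))          # compare with critical value 5.12

# LM-tests from the auxiliary R^2 values
for R2 in [0.342, 0.236]:
    print(12 * R2, stats.chi2.ppf(0.95, 1))    # compare with critical value 3.84

# The gradient with respect to beta3 approaches beta2*(log d)^2 / 2
d, b2, b3 = 1.15, 10.3, 1e-6                   # illustrative values
num = -b2 / b3**2 * (d**b3 - 1) + b2 / b3 * d**b3 * np.log(d)
print(num, 0.5 * b2 * np.log(d) ** 2)          # nearly identical
```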

4.3 Maximum likelihood

4.3.1 Motivation

Two approaches in estimation

In Section 1.3.1 (p. 41) we discussed two approaches in parameter estimation. One is based on the idea of minimizing, in some way, the distance between the data and the model parameters. Least squares is an example of this approach. Although this is a very useful method, it is not always the most appropriate approach, and for some models it is even impossible to apply this method. Another approach in parameter estimation is to maximize the likelihood of the parameters for the observed data. Then the parameters are chosen in such a way that the observed data become as likely or 'probable' as possible. In this section we will discuss the method of maximum likelihood (ML) in more detail. We will consider the general framework and we will use the linear model as an illustration. The ML method is the appropriate estimation method for a large variety of models, as will become clear in later chapters, and applications for models of special interest in business and economics will be discussed in later chapters.

Some disadvantages of least squares

If we apply least squares in the linear model $y = X\beta + \varepsilon$, then the estimator is given by
$$b = (X'X)^{-1}X'y = \beta + (X'X)^{-1}X'\varepsilon.$$
This means that the (unobserved) disturbances $\varepsilon$ affect the outcome of $b$ in a linear way. If some of the disturbances $\varepsilon_i$ are large, these observations have a relatively large impact on the estimates. We recall from Section 3.1.4 (p. 127) that OLS is the best linear unbiased estimator under Assumptions 1–6. There are several ways to reduce the influence of such observations, for instance by transforming the data, by adjusting the model, or by using another criterion than least squares. These methods are discussed in Chapter 5. Another approach is to replace the normal distribution of the disturbances by another distribution, for instance one that has fatter tails.
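A small simulation makes this sensitivity of least squares to large disturbances concrete; all numbers below are illustrative assumptions.

```python
# One disturbance far in the tail shifts the whole OLS fit.
import numpy as np

rng = np.random.default_rng(3)
n = 50
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
eps = rng.normal(size=n)
b_clean = np.linalg.lstsq(X, X @ [1.0, 0.5] + eps, rcond=None)[0]

eps_out = eps.copy()
eps_out[0] = 25.0                    # a single large disturbance
b_outlier = np.linalg.lstsq(X, X @ [1.0, 0.5] + eps_out, rcond=None)[0]
print(b_clean, b_outlier)            # the estimates move markedly
```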

However, if the disturbances are not normally distributed (so that Assumption 7 is not satisfied), there exist non-linear estimators that are more efficient. Asymptotically, the most efficient estimators are the maximum likelihood estimators.

Example 4.4: Stock Market Returns (continued)

We investigate the assumption of normally distributed disturbances in the CAPM for stock market returns (XM404SMR) discussed before in Example 2.1 (p. 76–7). We will discuss (i) the possibility of fat tails in returns data, (ii) the least squares residuals, and (iii) the choice of the distribution of the disturbances.

(i) Possibility of fat tails in returns data

Traders on financial markets may react relatively strongly to positive or negative news, and in particular they may react to the behaviour of fellow traders. This kind of herd behaviour may cause excessive up and down swings of stock prices, so that the returns may be larger (both positive and negative) than would normally be expected. Such periods of shared panic or euphoria among traders may lead to returns far away from the long-run mean, that is, in the tail of the distribution of returns.

(ii) Least squares residuals

The data consist of excess returns for the sector of cyclical consumer goods (denoted by $y$) and for the whole market (denoted by $x$) in the UK. A scatter diagram of these data is given in Exhibit 2.1 (p. 77). The CAPM postulates the linear model
$$y_i = \alpha + \beta x_i + \varepsilon_i, \qquad i = 1, \cdots, n.$$
The parameters $\alpha$ and $\beta$ can be estimated by least squares. The histogram of the least squares residuals $e_i$ is shown in Exhibit 4.11 (a). The sample mean and standard deviation of the residuals are $\bar e = 0$ and $s = \sqrt{\sum e_i^2/(n-1)} = 5.53$. Two of the $n = 240$ residuals have values of around $-20 \approx -3.6s$. For the normal distribution, the probability of outcomes more than 3.6 standard deviations away from the mean is around 0.0003, which is much smaller than $2/240 = 0.0083$. The histogram indicates that the disturbances may have fatter tails than the normal distribution, so it seems somewhat doubtful that the disturbances are normally distributed.

(iii) Choice of distribution of the disturbances

As an alternative, one could for instance use a t-distribution for the disturbances. Exhibit 4.11 (b) shows the density function of the standard normal

distribution and of the t(5) distribution (scaled so that it also has variance equal to one). Clearly, the t-distribution has fatter tails than the normal distribution. The table with values of the kurtosis in Exhibit 4.11 (c) confirms that t-distributions have fatter tails than the normal distribution.

Exhibit 4.11 Stock Market Returns (Example 4.4)
(a) shows the histogram of the least squares residuals of the CAPM for the sector of cyclical consumer goods in the UK (sample 1980:01–1999:12, 240 observations; mean ≈ 0, median 0.23, maximum 15.11, minimum −20.41, standard deviation 5.53, skewness −0.28, kurtosis 4.04). (b) shows two distributions, the standard normal distribution and the t(5) distribution (scaled so that it has variance 1). (c) shows the kurtosis of t(d) distributions for selected values of the number of degrees of freedom d:

d            : 5    6    7    8    9    10   1000   ∞
kurtosis t(d): 9.0  6.0  5.0  4.5  4.2  4.0  3.006  3
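Both the tail probability quoted in the example and the kurtosis values in Exhibit 4.11 (c), which follow the formula $3(d-2)/(d-4)$ for $d > 4$, can be checked directly:

```python
# Tail probability of the normal and kurtosis of the t(d) distribution.
from scipy import stats

print(2 * stats.norm.sf(3.6))        # ~0.0003, versus the empirical 2/240 = 0.0083
for d in [5, 6, 7, 8, 9, 10, 1000]:
    # scipy reports excess kurtosis, so add 3 to compare with 3(d-2)/(d-4)
    print(d, 3 * (d - 2) / (d - 4), stats.t(d).stats(moments='k') + 3)
```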

4.3.2 Maximum likelihood estimation

Uses Section 1.3.1; Appendix A.7.

The idea of maximum likelihood

In Section 1.3.1 we discussed the method of maximum likelihood estimation for data consisting of a random sample from a population with fixed mean and variance. For a random sample $y_1, \cdots, y_n$ from the normal density $N(\mu, \sigma^2)$, the normal distribution with the largest likelihood is given by $\hat\mu = \bar y$ and $\hat\sigma^2 = \frac{1}{n}\sum(y_i - \bar y)^2$. In the next sections we describe the method of maximum likelihood that can be applied for any specified distribution. The idea is illustrated in Exhibit 4.12. The observed values of the dependent variable are indicated by crosses. Clearly, this set of outcomes is much more probable for the distribution on the right side than for the distribution on the left side. This is expressed by saying that the distribution on the right side has a larger likelihood than the one on the left side.

Exhibit 4.12 Maximum likelihood
The set of actually observed outcomes of $y$ (denoted by the crosses on the horizontal axis) is less probable for the distribution on the left than for the distribution on the right; the distribution on the left therefore has a smaller likelihood than the distribution on the right.

Likelihood function and log-likelihood

We now extend the maximum likelihood (ML) method to more general models. The observed data on the dependent variable are denoted by the $n \times 1$ vector $y$ and those on the explanatory variables by the $n \times k$ matrix $X$. In order to apply ML, the model is expressed in terms of a joint probability density $p(y, X, \theta)$. Here $\theta$ denotes the vector of model parameters; that is, $p(y, X, \theta)$ is a probability density for $(y, X)$ for given values of $\theta$. For the observed data $(y, X)$ the likelihood function is defined by
$$L(\theta) = p(y, X, \theta). \qquad (4.25)$$
Stated intuitively, for given $(y, X)$, this measures the 'probability' of observing the data $(y, X)$ for different values of $\theta$. It is natural to prefer parameter values for which this 'probability' is large. The maximum likelihood estimator $\hat\theta_{ML}$ is defined as the value of $\theta$ that maximizes the function $L(\theta)$ over the set of allowed parameter values. In practice, for computational convenience one often maximizes the logarithmic likelihood function or log-likelihood
$$l(\theta) = \log(L(\theta)). \qquad (4.26)$$
As the logarithm is a monotonically increasing transformation, the maximum of (4.25) and (4.26) is obtained for the same values of $\theta$.

An attractive property of ML is that it is invariant under reparametrization. Suppose that the model is formulated in terms of another parameter vector $\psi$ and that $\psi$ and $\theta$ are related by an invertible transformation $\psi = h(\theta)$. Then the ML estimates are related by $\hat\psi_{ML} = h(\hat\theta_{ML})$ (see also Section 1.3.1).

The log-likelihood can be decomposed if the observations $(y_i, x_i)$ are mutually independent for $i = 1, \cdots, n$. If the probability density function for the $i$th observation is $p_\theta(y_i, x_i)$, then the joint density is $p(y, X, \theta) = \prod_{i=1}^n p_\theta(y_i, x_i)$, so that

$$l(\theta) = \sum_{i=1}^n \log(p_\theta(y_i, x_i)) = \sum_{i=1}^n l_i(\theta), \qquad (4.27)$$
where $l_i(\theta) = \log(p_\theta(y_i, x_i))$ is the contribution of the $i$th observation to the log-likelihood $l(\theta)$.

Numerical aspects of optimization

In general the computation of $\hat\theta_{ML}$ is a non-linear optimization problem. Solution methods were discussed in Section 4.2.2. The Newton–Raphson iterations (4.15) can be performed with the gradient vector $G = \frac{1}{n}\partial l(\theta)/\partial\theta$ and the Hessian matrix $H = \frac{1}{n}\partial^2 l(\theta)/\partial\theta\partial\theta'$. If the observations are mutually independent, the result in (4.27) shows that, equivalently,
$$G = \frac{1}{n}\sum_{i=1}^n \frac{\partial l_i}{\partial\theta}, \qquad H = \frac{1}{n}\sum_{i=1}^n \frac{\partial^2 l_i}{\partial\theta\partial\theta'}.$$
In this case it is also possible to perform the iterations in a way where only the first order derivatives (and no second order derivatives) need to be computed, as there holds
$$\frac{1}{n}\sum_{i=1}^n \frac{\partial^2 l_i}{\partial\theta\partial\theta'} \approx \frac{1}{n}\sum_{i=1}^n E\Big[\frac{\partial^2 l_i}{\partial\theta\partial\theta'}\Big] = -\frac{1}{n}\sum_{i=1}^n E\Big[\frac{\partial l_i}{\partial\theta}\frac{\partial l_i}{\partial\theta'}\Big] \approx -\frac{1}{n}\sum_{i=1}^n \frac{\partial l_i}{\partial\theta}\frac{\partial l_i}{\partial\theta'}. \qquad (4.28)$$
The first and the last approximate equalities follow from the law of large numbers, as the terms $\partial^2 l_i/\partial\theta\partial\theta'$ are mutually independent and the same holds true for the terms $(\partial l_i/\partial\theta)(\partial l_i/\partial\theta')$. The middle equality in (4.28) follows from (1.46) in Section 1.3.2 (p. 45), applied for each individual observation $i$ separately. The last term in (4.28) is called the outer product of gradients. Using this approximation, the Newton–Raphson iterations in (4.15) become
$$\hat\theta_{h+1} = \hat\theta_h + \Big(\frac{1}{n}\sum_{i=1}^n \frac{\partial l_i}{\partial\theta}\frac{\partial l_i}{\partial\theta'}\Big)^{-1}\Big(\frac{1}{n}\sum_{i=1}^n \frac{\partial l_i}{\partial\theta}\Big).$$
This is called the method of Berndt, Hall, Hall, and Hausman (abbreviated as BHHH). Methods of this kind have the advantage that they require only the first order derivatives, but they may give less precise estimates as compared with methods using the second order derivatives. For numerical stability one sometimes uses a regularization factor and replaces the above matrix inverse by $\big(\frac{1}{n}\sum_{i=1}^n \frac{\partial l_i}{\partial\theta}\frac{\partial l_i}{\partial\theta'} + cI\big)^{-1}$, with $c > 0$ a chosen constant and $I$ the identity matrix. This is called the Marquardt algorithm.
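A minimal sketch of the BHHH iteration, assuming independent observations, is given below. The normal location-scale example used to exercise it, and all names, are illustrative assumptions.

```python
# One BHHH step: Newton iteration with -H replaced by the outer product of
# gradients; scores holds row i equal to dl_i/dtheta at the current theta.
import numpy as np

def bhhh_step(theta, scores):
    n = scores.shape[0]
    G = scores.mean(axis=0)                   # average gradient
    OPG = scores.T @ scores / n               # outer product of gradients
    return theta + np.linalg.solve(OPG, G)

# Example: ML for y_i ~ N(mu, v); the scores of observation i are
# ((y_i - mu)/v, -1/(2v) + (y_i - mu)^2/(2v^2)).
rng = np.random.default_rng(4)
y = rng.normal(2.0, 3.0, size=500)
theta = np.array([1.0, 5.0])                  # (mu, v), illustrative start
for _ in range(100):
    mu, v = theta
    scores = np.column_stack([(y - mu) / v,
                              -0.5 / v + (y - mu) ** 2 / (2 * v ** 2)])
    theta = bhhh_step(theta, scores)
print(theta, [y.mean(), y.var()])             # converges to the ML estimates
```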

ML in the linear model

In some cases the ML estimates can be computed analytically. An example is given by the linear model under Assumptions 1–7 of Section 3.1.4 (p. 125–6). In this case
$$y = X\beta + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2 I), \qquad (4.29)$$
so that $y \sim N(X\beta, \sigma^2 I)$. This model has parameters $\theta = (\beta', \sigma^2)'$. Using the expression in Section 1.2.3 for the density of the multivariate normal distribution with mean $\mu = X\beta$ and covariance matrix $\Sigma = \sigma^2 I$, it follows that the log-likelihood (4.26) is given by
$$l(\beta, \sigma^2) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\sigma^2) - \frac{1}{2\sigma^2}(y - X\beta)'(y - X\beta). \qquad (4.30)$$
The maximum likelihood estimates are obtained from the first order conditions
$$\frac{\partial l}{\partial\beta} = \frac{1}{\sigma^2}X'(y - X\beta) = 0, \qquad (4.31)$$
$$\frac{\partial l}{\partial\sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}(y - X\beta)'(y - X\beta) = 0. \qquad (4.32)$$
The solutions are given by
$$b_{ML} = (X'X)^{-1}X'y = b, \qquad (4.33)$$
$$s_{ML}^2 = \frac{1}{n}(y - Xb)'(y - Xb) = \frac{n-k}{n}s^2, \qquad (4.34)$$
where $s^2$ is the (unbiased) least squares estimator of $\sigma^2$ discussed in Chapter 3 (p. 128). This shows that $b_{ML}$ coincides with the least squares estimator $b$, and that $s_{ML}^2$ differs from the unbiased estimator $s^2$ by a factor that tends to 1 for $n \to \infty$.

ML in non-linear regression models

In a similar way, the ML estimates of $\beta$ in the non-linear regression model $y_i = f(x_i, \beta) + \varepsilon_i$ with $\varepsilon_i \sim NID(0, \sigma^2)$ are equal to the non-linear least squares estimates $b_{NLS}$ (see Exercise 4.6).

Exercises: T: 4.6c, d–f; S: 4.12a.
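The equalities (4.33) and (4.34) can be verified by maximizing the log-likelihood numerically. The sketch below uses the reparametrization $\sigma^2 = \exp(\cdot)$ to keep the variance positive (the invariance property of ML applies); the simulated data are an illustrative assumption.

```python
# ML in the normal linear model reproduces OLS, with s2_ML = (n-k)/n * s2.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(5)
n, k = 80, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([1.0, -0.5, 2.0]) + rng.normal(size=n)

def negloglik(theta):
    beta, log_s2 = theta[:k], theta[k]
    e = y - X @ beta
    return 0.5 * n * np.log(2 * np.pi) + 0.5 * n * log_s2 \
        + e @ e / (2 * np.exp(log_s2))

res = minimize(negloglik, np.zeros(k + 1), method='BFGS')
b = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ b
print(res.x[:k], b)                           # ML coefficients equal OLS
print(np.exp(res.x[k]), e @ e / n)            # s2_ML = e'e/n
```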

4.3.3 Asymptotic properties

Uses Sections 1.3.2 and 1.3.3 (p. 51–2).

Asymptotic distribution of ML estimators

As was discussed in Section 1.3.3, maximum likelihood estimators have asymptotically optimal statistical properties. Apart from mild regularity conditions on the log-likelihood (4.26), the main condition is that the model (that is, the joint probability distribution of the data) has been specified correctly. The model is correctly specified if there exists a parameter $\theta_0$ so that the data are generated by the probability distribution $p(y, X, \theta_0)$ in (4.25). Some regularity conditions are necessary for generalizations of the central limit theorem to hold true. Then the maximum likelihood estimator is consistent, is asymptotically efficient, and has an asymptotically normal distribution, so that
$$\sqrt{n}\,(\hat\theta_{ML} - \theta_0) \xrightarrow{d} N(0, \mathcal{I}_0^{-1}). \qquad (4.35)$$
Here $\mathcal{I}_0$ is the information matrix evaluated at $\theta_0$, that is, $\mathcal{I}_0 = \lim_{n\to\infty}\frac{1}{n}\mathcal{I}_n(\theta_0)$, where
$$\mathcal{I}_n(\theta_0) = E\Big[\frac{\partial l}{\partial\theta}\frac{\partial l}{\partial\theta'}\Big] = -E\Big[\frac{\partial^2 l}{\partial\theta\partial\theta'}\Big] \qquad (4.36)$$
is the information matrix (evaluated at $\theta = \theta_0$) for sample size $n$ of the data $(y, X)$. The asymptotic efficiency means that $\sqrt{n}(\hat\theta_{ML} - \theta_0)$ has the smallest covariance matrix among all consistent estimators of $\theta_0$ (the reason for scaling with $\sqrt{n}$ is to get a finite, non-zero covariance matrix in the limit).

Approximate distribution for finite samples

This means that, asymptotically, conventional t- and F-tests can be based on the approximate distribution
$$\hat\theta_{ML} \approx N\Big(\theta_0, \frac{1}{n}\mathcal{I}_0^{-1}\Big),$$
where we used that in large enough samples $\mathrm{var}(\hat\theta_{ML}) \approx \frac{1}{n}\mathcal{I}_0^{-1} \approx \mathcal{I}_n(\theta_0)^{-1} \approx \mathcal{I}_n(\hat\theta_{ML})^{-1}$.

In the following sections we discuss some alternative tests that are of much practical use, namely the Likelihood Ratio test, the Wald test, and the Lagrange Multiplier test. These tests are compared in Section 4.3.8.
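In practice the covariance matrix in this approximation is often formed from the observed information, the negative Hessian of the log-likelihood at the ML estimate. A minimal sketch with a numerically differentiated Hessian, using an illustrative Poisson-mean example, is:

```python
# Standard error from the observed information (negative Hessian at the MLE).
import numpy as np

rng = np.random.default_rng(6)
y = rng.poisson(3.0, size=400)

def loglik(lam):
    # Poisson log-likelihood up to an additive constant (log y! terms dropped)
    return np.sum(y * np.log(lam) - lam)

lam_hat = y.mean()                           # ML estimator of the Poisson mean
h = 1e-5                                     # central second difference
d2l = (loglik(lam_hat + h) - 2 * loglik(lam_hat) + loglik(lam_hat - h)) / h**2
se = np.sqrt(-1.0 / d2l)
print(se, np.sqrt(lam_hat / len(y)))         # matches the analytic sqrt(lam/n)
```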

Illustration of asymptotic results for ML in the linear model

Now we illustrate the above asymptotic results for the linear model $y = X\beta + \varepsilon$ that satisfies Assumptions 1–7. Here the parameter vector is given by the $(k+1) \times 1$ vector $\theta = (\beta', \sigma^2)'$ and the vector of ML estimators by $\hat\theta_{ML} = (b_{ML}', s_{ML}^2)'$. The second order derivatives of the log-likelihood in (4.30) are given by
$$\frac{\partial^2 l}{\partial\beta\partial\beta'} = -\frac{1}{\sigma^2}X'X, \qquad (4.37)$$
$$\frac{\partial^2 l}{\partial\beta\partial\sigma^2} = -\frac{1}{\sigma^4}X'(y - X\beta), \qquad (4.38)$$
$$\frac{\partial^2 l}{\partial\sigma^2\partial\sigma^2} = \frac{n}{2\sigma^4} - \frac{1}{\sigma^6}(y - X\beta)'(y - X\beta). \qquad (4.39)$$
Using (4.29) and the assumption that $X$ is fixed, for $\beta = \beta_0$ there holds $E[X'(y - X\beta_0)] = X'E[y - X\beta_0] = 0$ and $E[(y - X\beta_0)'(y - X\beta_0)] = n\sigma^2$, so that the $(k+1) \times (k+1)$ information matrix in (4.36) is given by
$$\mathcal{I}_n(\theta_0) = \begin{pmatrix} \frac{1}{\sigma^2}X'X & 0 \\ 0 & \frac{n}{2\sigma^4} \end{pmatrix}. \qquad (4.40)$$
The asymptotic covariance matrix is obtained from $\mathcal{I}_0 = \lim\frac{1}{n}\mathcal{I}_n(\theta_0)$, and under Assumption 1* in Section 4.1.2 it follows that
$$\mathcal{I}_0 = \begin{pmatrix} \frac{1}{\sigma^2}Q & 0 \\ 0 & \frac{1}{2\sigma^4} \end{pmatrix}.$$
Therefore, according to (4.35), large sample approximations of the distributions of the ML estimators (4.33) and (4.34) are given by
$$b_{ML} \approx N\big(\beta_0, \sigma^2(X'X)^{-1}\big), \qquad (4.41)$$
$$s_{ML}^2 \approx N\Big(\sigma^2, \frac{2\sigma^4}{n}\Big). \qquad (4.42)$$
Actually, for the model (4.29) the distribution in (4.41) holds exactly, as was shown in Section 3.3.1 (p. 152).

In Section 3.4.1 we considered the F-test for the null hypothesis of $g$ linear restrictions on the model (4.29) of the form $R\beta = r$, where $R$ is a $g \times k$ matrix of rank $g$. In (3.50) this test is computed in the form
$$F = \frac{(e_R'e_R - e'e)/g}{e'e/(n-k)}, \qquad (4.43)$$

where $e_R'e_R$ and $e'e$ are the sums of squared residuals under the null and alternative hypothesis respectively. Under Assumptions 1–7, the test statistic (4.43) has the F($g$, $n-k$) distribution.

Summary of computations in ML

To estimate model parameters by the method of maximum likelihood, one proceeds as follows.

Computations in ML
Step 1: Formulate the log-likelihood. First one has to specify the form of the likelihood function, that is, the form of the joint probability function $L(\theta) = p(y, X, \theta)$. For given data $y$ and $X$, this should be a known function of $\theta$, so that for every choice of $\theta$ the value of $L(\theta)$ can be computed.
Step 2: Maximize the log-likelihood. The criterion for estimation is the maximization of $L(\theta)$ or, equivalently, of the log-likelihood $l(\theta) = \log(L(\theta))$. For the observed data $y$ and $X$, the log-likelihood $l(\theta) = \log(p(y, X, \theta))$ is maximized with respect to the parameters $\theta$. This is often a non-linear optimization problem, and numerical aspects were discussed in Section 4.3.2.
Step 3: Asymptotic tests. Approximate t-values and F-tests for the ML estimates $\hat\theta_{ML}$ can be obtained from the fact that this estimator is consistent and approximately normally distributed with covariance matrix $\mathrm{var}(\hat\theta_{ML}) \approx \mathcal{I}_n(\hat\theta_{ML})^{-1}$, where $\mathcal{I}_n$ is the information matrix defined in (4.36) and evaluated at $\hat\theta_{ML}$. In Section 4.3.8 we will make some comments on the actual computation of this covariance matrix.

Exercises: E: 4.17a–f.

4.3.4 The Likelihood Ratio test

Uses Appendix A.

General form of the LR-test

Suppose that the model is given by the likelihood function (4.25) and that the null hypothesis imposes $g$ independent restrictions $r(\theta) = 0$ on the parameters. We denote the ML estimator under the null hypothesis by $\hat\theta_0$ and the ML estimator under the alternative by $\hat\theta_1$. The Likelihood Ratio test is based on the loss of log-likelihood that results if the restrictions are imposed, that is,

$$LR = 2\log(L(\hat\theta_1)) - 2\log(L(\hat\theta_0)) = 2\big(l(\hat\theta_1) - l(\hat\theta_0)\big). \qquad (4.44)$$
A graphical illustration of this test is given in Exhibit 4.13. It can be shown that, if the null hypothesis is true,
$$LR \xrightarrow{d} \chi^2(g). \qquad (4.45)$$
The null hypothesis is rejected if LR is sufficiently large. For a proof of (4.45) we refer to textbooks on statistics (see Chapter 1, Further Reading (p. 68)).

Exhibit 4.13 Likelihood Ratio test
Graphical illustration of the Likelihood Ratio test. Here $\theta$ is a single parameter and the null hypothesis is that $\theta = 0$. This hypothesis is rejected if the (vertical) distance between the log-likelihoods at $\hat\theta_{ML}$ and at $\theta = 0$ is too large, that is, if the loss in the log-likelihood (measured on the vertical axis) is too large.

LR-test in the linear model

As an illustration we consider the linear model $y = X\beta + \varepsilon$ with Assumptions 1–7. To compute the LR-test for the null hypothesis that $R\beta = r$ (with $R$ a $g \times k$ matrix of rank $g$), we use a technique known as concentration of the log-likelihood. This means that the ML optimization problem is transformed into another one that involves fewer parameters. It follows from (4.32) that, for a given value of $\beta$, the optimal value of $\sigma^2$ is given by $\sigma^2(\beta) = \frac{1}{n}(y - X\beta)'(y - X\beta)$. Substituting this in (4.30), the optimal value of $\beta$ is obtained by maximizing the concentrated log-likelihood
$$l(\beta, \sigma^2(\beta)) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\sigma^2(\beta)) - \frac{n}{2}.$$

This function of $\beta$ is maximal if $\sigma^2(\beta)$ is minimal, and this corresponds to least squares. The maximum likelihood estimator of $\beta$ under the null hypothesis is therefore given by the restricted least squares estimator $b_R$, and the above expression for the log-likelihood shows that
$$LR = 2l(\hat\theta_1) - 2l(\hat\theta_0) = -n\log(\sigma^2(b)) + n\log(\sigma^2(b_R)) = n\log\Big(\frac{e_R'e_R}{e'e}\Big),$$
where $b$ is the unrestricted least squares estimator. The relation between this test and the F-test in (4.43) is given by
$$LR = n\log\Big(1 + \frac{e_R'e_R - e'e}{e'e}\Big) = n\log\Big(1 + \frac{g}{n-k}F\Big). \qquad (4.46)$$
This result holds true for linear models with linear restrictions under the null hypothesis. It does not in general hold true for other types of models and restrictions.

Computational disadvantage of the LR-test

The LR-test (4.44) requires that ML estimates are determined both for the unrestricted model and for the restricted model. If the required computations turn out to be complicated, then it may be more convenient to estimate only one of these two models. Two such test methods are discussed in the following two sections.

Exercises: E: 4.13e, 4.14b, 4.15b, 4.16f.
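The identity $LR = n\log(e_R'e_R/e'e)$ and the relation (4.46) can be checked numerically on simulated data; the design below is an illustrative assumption.

```python
# LR-test in the linear model and its relation (4.46) to the F-test.
import numpy as np

rng = np.random.default_rng(7)
n, k, g = 60, 4, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X[:, :k - g] @ np.array([1.0, 0.5]) + rng.normal(size=n)   # null is true

b = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ b                                  # unrestricted residuals
bR = np.linalg.lstsq(X[:, :k - g], y, rcond=None)[0]
eR = y - X[:, :k - g] @ bR                     # restricted residuals

LR = n * np.log((eR @ eR) / (e @ e))
F = ((eR @ eR - e @ e) / g) / (e @ e / (n - k))
print(LR, n * np.log(1 + g * F / (n - k)))     # identical by (4.46)
```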

4.3.5 The Wald test

Idea of Wald test (for a single parameter)

Whereas the LR-test requires two optimizations (ML under the null hypothesis and ML under the alternative hypothesis), the Wald test is based on the unrestricted model alone. This test considers how far the restrictions are satisfied by the unrestricted estimator $\hat\theta_1$. This is illustrated graphically in Exhibit 4.14 for the simple case of a single parameter $\theta$ with the restriction $\theta = 0$. The (horizontal) difference between the unrestricted estimator $\hat\theta_1$ and $\theta = 0$ is related to the (vertical) difference in the log-likelihoods. An indication of this vertical distance is obtained from the curvature $\frac{d^2l}{d\theta^2}$ of the log-likelihood $l$ in $\hat\theta_1$; the exhibit shows that this distance becomes larger for larger curvatures. Asymptotically, the curvature is equal to the inverse of the covariance matrix of $\hat\theta_1$.

Exhibit 4.14 Wald test
Graphical illustration of the Wald test. The restrictions are rejected if the estimated parameters are too far away from the restrictions of the null hypothesis. The (horizontal) difference between $\hat\theta_{ML}$ and $\theta = 0$ corresponds to a 'vertical' difference in the log-likelihoods, which is larger if the log-likelihood function has a larger curvature; a large difference is taken as an indication that the loss in the log-likelihood is too large.

This motivates estimating the loss in log-likelihood that results from imposing the restriction $\theta = 0$ by the Wald test statistic
$$W = \hat\theta_1^2\cdot\Big(-\frac{d^2l}{d\theta^2}\Big) \approx \Big(\frac{\hat\theta_1}{s_{\hat\theta_1}}\Big)^2 \approx \chi^2(1).$$
Here $s_{\hat\theta_1}^2$ is an estimate of the variance of the unrestricted ML estimator $\hat\theta_1$, and the asymptotic distribution follows from (4.35). The expression $\hat\theta_1/s_{\hat\theta_1}$ is analogous to the t-value in a regression model (see Section 3.3.1 (p. 153)). The t-test for a single parameter restriction is also obtained by estimating the unrestricted model and evaluating whether the estimated parameter differs significantly from zero.

Derivation of Wald test for general parameter restrictions

Now we describe the Wald test for the general case of $g$ non-linear restrictions $r(\theta) = 0$. Suppose that this hypothesis holds true for the DGP, so that $r(\theta_0) = 0$. Because $\hat\theta_1$ is consistent, it follows that in large enough samples
$$r(\hat\theta_1) \approx r(\theta_0) + R_0(\hat\theta_1 - \theta_0) = R_0(\hat\theta_1 - \theta_0),$$
where $R_0 = \partial r/\partial\theta'$ evaluated at $\theta = \theta_0$. It follows from (4.35) that

$$\sqrt{n}\,r(\hat\theta_1) \xrightarrow{d} N\big(0, R_0\mathcal{I}_0^{-1}R_0'\big). \qquad (4.47)$$
Let $R_1 = \partial r/\partial\theta'$ evaluated at $\theta = \hat\theta_1$, and let $\mathcal{I}_n(\hat\theta_1)$ be the information matrix for sample size $n$ defined in (4.36) and evaluated at $\theta = \hat\theta_1$. Then $\mathrm{plim}(R_1) = R_0$ and $\mathrm{plim}(\frac{1}{n}\mathcal{I}_n(\hat\theta_1)) = \mathcal{I}_0$, so (4.47) implies that $r(\hat\theta_1) \approx N(0, R_1\mathcal{I}_n^{-1}(\hat\theta_1)R_1')$. Now recall that, if the $g \times 1$ vector $z$ has the distribution $N(0, V)$, then $z'V^{-1}z \sim \chi^2(g)$, so that under the null hypothesis
$$W = r(\hat\theta_1)'\big(R_1\mathcal{I}_n^{-1}(\hat\theta_1)R_1'\big)^{-1}r(\hat\theta_1) \approx \chi^2(g). \qquad (4.48)$$
This is an attractive test if the restricted model is difficult to estimate, for instance if the parameter restriction $r(\theta) = 0$ is non-linear. A disadvantage is that the numerical outcome of the test may depend on the way the model and the restrictions are formulated (see Exercise 4.16 for an example).

Wald test in the linear model

We illustrate the Wald test by considering the linear model $y = X\beta + \varepsilon$ with Assumptions 1–7 and the linear hypothesis that $R\beta = r$ (with $R$ a $g \times k$ matrix of rank $g$). The parameter vector $\theta = (\beta', \sigma^2)'$ contains $k + 1$ parameters and the restrictions are given by $r(\theta) = 0$, where $r(\theta) = R\beta - r = (R \;\; 0)\theta - r$. The unrestricted estimators are given by $b$ in (4.33) and $s_{ML}^2 = e'e/n$ in (4.34), where $e = y - Xb$ are the unrestricted least squares residuals. So in (4.48) we have $r(\hat\theta_1) = Rb - r$ and
$$R_1 = \partial r/\partial\theta' = (\partial r/\partial\beta' \;\; \partial r/\partial\sigma^2) = (R \;\; 0).$$
An asymptotic approximation of the inverse of the information matrix in (4.48) is obtained from (4.40), that is,
$$\mathcal{I}_n^{-1}(\hat\theta_1) \approx \mathcal{I}_n^{-1}(\theta_0) = \begin{pmatrix} \sigma^2(X'X)^{-1} & 0 \\ 0 & \frac{2\sigma^4}{n} \end{pmatrix} \approx \begin{pmatrix} s_{ML}^2(X'X)^{-1} & 0 \\ 0 & \frac{2s_{ML}^4}{n} \end{pmatrix}.$$
Combining these results, we get $R_1\mathcal{I}_n^{-1}(\hat\theta_1)R_1' \approx s_{ML}^2R(X'X)^{-1}R'$, so that
$$W = (Rb - r)'\big(s_{ML}^2R(X'X)^{-1}R'\big)^{-1}(Rb - r) = \frac{(Rb - r)'\big(R(X'X)^{-1}R'\big)^{-1}(Rb - r)}{e'e/n} = \frac{ng}{n-k}F. \qquad (4.49)$$
The last equality follows from (3.54) in Section 3.4.1 (p. 165). This formula, like the one in (4.46), holds true only for linear models with linear restrictions. (Some software packages, such as EViews, compute the Wald test with the OLS estimate

$s^2$ instead of the ML estimate $s_{ML}^2$, in which case the relation (4.49) becomes $W = gF$. That is, in EViews, tests of coefficient restrictions are computed in two ways: one with the F-test and with P-values based on the F($g$, $n-k$) distribution, and another with the Wald test $W = gF$ and with P-values based on the $\chi^2(g)$ distribution.)

Relation between Wald test and t-test

For the case of a single restriction (so that $g = 1$) we obtained in Section 3.4.1 the result that $F = t^2$. Substituting this in (4.49), we get the following relation between the Wald test and the t-test for a single parameter restriction:
$$W = \frac{n}{n-k}\,t^2. \qquad (4.50)$$
The cause of the difference lies in the different estimators of the variance $\sigma^2$ of the error terms: $s_{ML}^2$ in the Wald test and the OLS estimator $s^2$ in the t-test. Because of the relation $s_{ML}^2 = \frac{n-k}{n}s^2$ in (4.34), the relation (4.50) can also be written as
$$W = t^2\cdot\frac{s^2}{s_{ML}^2}.$$

Exercises: S: 4.11a, b, d, e; E: 4.13c, 4.14c, 4.15b, 4.16h–j.

4.3.6 The Lagrange Multiplier test

Uses Section 1.2.2; Appendix A.

Formulation of parameter restrictions by means of Lagrange parameters

As a third test we discuss the Lagrange Multiplier test, also called the score test. This test considers whether the gradient (also called the 'score') of the unrestricted likelihood function is sufficiently close to zero at the restricted estimate $\hat\theta_0$. An LM-test was discussed in Section 4.2.4 for regression models, where we minimized the sum of squares criterion (4.11); now we consider this test within the framework of ML estimation, where we maximize the log-likelihood criterion (4.26). For simplicity of notation, suppose that the vector of parameters can be split in two parts, $\theta = \binom{\theta_1}{\theta_2}$, where $\theta_2$ contains $g$ components, and that the null hypothesis imposes the $g$ independent restrictions $\theta_2 = 0$. Then the restricted ML estimator can be obtained by maximizing the Lagrange function

$$L(\theta_1, \theta_2, \lambda) = l(\theta_1, \theta_2) - \lambda'\theta_2.$$
Here $\lambda$ is the $g \times 1$ vector of Lagrange multipliers. The restricted maximum satisfies the first order conditions
$$\frac{\partial L}{\partial\theta_1} = \frac{\partial l}{\partial\theta_1} = 0, \qquad \frac{\partial L}{\partial\theta_2} = \frac{\partial l}{\partial\theta_2} - \hat\lambda = 0, \qquad \frac{\partial L}{\partial\lambda} = \theta_2 = 0. \qquad (4.51)$$
So the Lagrange multipliers $\hat\lambda = \partial l/\partial\theta_2$ measure the marginal increase in the log-likelihood $l$ if the restrictions $\theta_2 = 0$ are relaxed. The idea is to reject the restrictions if these marginal effects are too large.

Idea of LM-test for a single parameter

This is illustrated graphically in Exhibit 4.15 for the simple case of a single parameter ($g = 1$, $\theta = \theta_2$ contains one component, and there are no additional components $\theta_1$). The slope $\hat\lambda = \partial l/\partial\theta$ in $\theta = 0$ is related to the (vertical) difference in the log-likelihoods $l(\hat\theta) - l(0)$, where $\hat\theta$ is the unrestricted ML estimate. This difference is larger for smaller curvatures $\partial^2 l/\partial\theta^2$ in $\theta = 0$. This suggests evaluating the loss in log-likelihood, which results from imposing the restriction that $\theta = 0$, by the LM-test statistic
$$LM = \frac{(\partial l/\partial\theta)^2}{-\partial^2 l/\partial\theta^2} \qquad \text{(evaluated at } \theta = 0\text{)}.$$

Exhibit 4.15 Lagrange Multiplier test
Graphical illustration of the Lagrange Multiplier test. The restrictions are rejected if the gradient (evaluated at the restricted parameter estimates) differs too much from zero, as this is taken as an indication that the loss in the log-likelihood is too large.

Derivation of LM-test for general parameter restrictions

Now we return to the more general case in (4.51) and consider a test for the null hypothesis that $\theta_2 = 0$ that is based on the magnitude of the vector of Lagrange multipliers $\lambda$. To test the significance of the Lagrange multipliers we have to derive the distribution under the null hypothesis of $\hat\lambda = \partial l/\partial\theta_2$ (evaluated at the restricted ML estimates). This derivation (which runs till (4.54) below) goes as follows. Let
$$z(\theta) = \begin{pmatrix} \partial l/\partial\theta_1 \\ \partial l/\partial\theta_2 \end{pmatrix}$$
be the gradient vector of the log-likelihood (4.26) for $n$ observations. Then, under weak regularity conditions, this vector evaluated at the parameter $\theta = \theta_0$ of the DGP has the property that
$$\frac{1}{\sqrt{n}}\,z(\theta_0) \xrightarrow{d} N(0, \mathcal{I}_0). \qquad (4.52)$$
The proof of asymptotic normality is beyond the scope of this book and is based on generalizations of the central limit theorem. Here we only consider the mean and variance of $z(\theta_0)$. To compute the mean, we write $l(\theta) = \log(p_\theta(y))$, with $p_\theta(y)$ the density in (4.25). Then
$$E[z(\theta_0)] = E\Big[\frac{\partial\log p_\theta(y)}{\partial\theta}\Big|_{\theta=\theta_0}\Big] = E\Big[\frac{1}{p_{\theta_0}(y)}\frac{\partial p_\theta(y)}{\partial\theta}\Big|_{\theta=\theta_0}\Big] = \int p_{\theta_0}(y)\,\frac{1}{p_{\theta_0}(y)}\frac{\partial p_\theta(y)}{\partial\theta}\Big|_{\theta=\theta_0}dy = \frac{\partial\int p_\theta(y)dy}{\partial\theta}\Big|_{\theta=\theta_0} = 0, \qquad (4.53)$$
as $\int p_\theta(y)dy = 1$ for every density function. Using (4.36) it then follows that
$$\mathrm{var}(z(\theta_0)) = E\Big[\frac{\partial l}{\partial\theta}\frac{\partial l}{\partial\theta'}\Big]\Big|_{\theta=\theta_0} = \mathcal{I}_n(\theta_0).$$
The two foregoing results show that $\frac{1}{\sqrt{n}}z(\theta_0)$ in (4.52) has mean zero and covariance matrix $\frac{1}{n}\mathcal{I}_n(\theta_0)$. For $n \to \infty$ this covariance matrix converges to $\mathcal{I}_0$.

If the null hypothesis $\theta_2 = 0$ is true and $\hat\theta_0$ denotes the ML estimator of $\theta$ under this hypothesis, then $\hat\theta_0$ is a consistent estimator of $\theta_0$ and (4.52) implies that
$$\frac{1}{\sqrt{n}}\,z(\hat\theta_0) \approx N(0, \mathcal{I}_0).$$
Now the Lagrange multipliers $\hat\lambda$ in (4.51) are given by $\hat\lambda = \partial l/\partial\theta_2$ under the restriction that $\partial l/\partial\theta_1 = 0$. If we decompose the matrix $\mathcal{I}_0$ in (4.52) in

accordance with the components $z_1 = \partial l/\partial\theta_1$ and $z_2 = \partial l/\partial\theta_2$ of $z$, and use the result (1.22) on conditional distributions of the normal distribution, it follows that
$$\frac{1}{\sqrt{n}}\,\hat\lambda = \frac{1}{\sqrt{n}}\big(z_2 \mid z_1 = 0\big) \approx N\big(0,\; \mathcal{I}_{0,22} - \mathcal{I}_{0,21}\mathcal{I}_{0,11}^{-1}\mathcal{I}_{0,12}\big).$$
If we denote the above covariance matrix by $W$, it follows that $\hat\lambda \approx N(0, V)$, where $V = nW \approx \mathcal{I}_{22} - \mathcal{I}_{21}\mathcal{I}_{11}^{-1}\mathcal{I}_{12}$ is defined in terms of $\mathcal{I}_n$ in (4.36), with decomposition according to that of $z$ in $z_1$ and $z_2$. Therefore $\hat\lambda'V^{-1}\hat\lambda \approx \chi^2(g)$. As the matrix $V^{-1}$ is equal to the lower diagonal block of the matrix $\mathcal{I}_n^{-1}$, and as $\partial l/\partial\theta_1 = 0$ at the restricted estimates, it follows from (4.51) that
$$LM = \hat\lambda'V^{-1}\hat\lambda = \Big(\frac{\partial l}{\partial\theta}\Big)'\mathcal{I}_n^{-1}\Big(\frac{\partial l}{\partial\theta}\Big).$$

LM-test in terms of the log-likelihood

The above result can be written as
$$LM = \Big(\frac{\partial l}{\partial\theta}\Big)'\Big(-E\Big[\frac{\partial^2 l}{\partial\theta\partial\theta'}\Big]\Big)^{-1}\Big(\frac{\partial l}{\partial\theta}\Big) \approx \chi^2(g), \qquad (4.54)$$
where the expressions $\partial l/\partial\theta$ and $E[\partial^2 l/\partial\theta\partial\theta']$ are both evaluated at $\theta = \hat\theta_0$, the ML estimate under the null hypothesis. The advantage of the LM-test is that only the restricted ML estimate $\hat\theta_0$ has to be computed. We do need to compute the gradient and the Hessian matrix of the unrestricted model, and to substitute the restricted estimate in (4.54) in the gradient and the information matrix of the unrestricted model, but we do not need to optimize the unrestricted likelihood function. Therefore the LM-test is attractive if the unrestricted likelihood function is relatively complicated.

4.3.7 LM-test in the linear model

Model formulation

As an illustration we apply the LM-test (4.54) to the linear model $y = X\beta + \varepsilon$ with Assumptions 1–7. The vector of parameters $\beta$ is split in two parts $\beta = \binom{\beta_1}{\beta_2}$, where $\beta_2$ is a $g \times 1$ vector and $\beta_1$ is a $(k-g) \times 1$ vector. The model can be written as $y = X_1\beta_1 + X_2\beta_2 + \varepsilon$, and we consider the null hypothesis that $\beta_2 = 0$.

Derivation of LM-test with auxiliary regressions

To compute the LM-test (4.54) for this hypothesis, we first note that under the null hypothesis the model is given by $y = X_1\beta_1 + \varepsilon$. According to (4.31) and (4.32), the ML estimates of this model are given by $b_{ML} = b_R = (X_1'X_1)^{-1}X_1'y$ and $s_R^2 = \frac{1}{n}e_R'e_R$, where $e_R = y - X_1b_R$ are the restricted least squares residuals. The gradient $\partial l/\partial\theta$ in (4.54), evaluated at $(\beta_1, \beta_2, \sigma^2) = (b_R, 0, s_R^2)$, is given by
$$\frac{\partial l}{\partial\beta_1} = \frac{1}{s_R^2}X_1'(y - X_1b_R) = 0,$$
$$\frac{\partial l}{\partial\beta_2} = \frac{1}{s_R^2}X_2'(y - X_1b_R) = \frac{1}{s_R^2}X_2'e_R,$$
$$\frac{\partial l}{\partial\sigma^2} = -\frac{n}{2s_R^2} + \frac{1}{2s_R^4}(y - X_1b_R)'(y - X_1b_R) = 0.$$
To compute the information matrix in (4.54), we use the second order derivatives in (4.37), (4.38), and (4.39), evaluated at $(b_R, 0, s_R^2)$. The term in (4.37) becomes $-\frac{1}{s_R^2}X'X$. The term in (4.38) is given by $-\frac{1}{s_R^4}X'(y - X_1b_R)$, and its expectation is
$$E\Big[\frac{1}{s_R^4}X'(y - X_1b_R)\Big] = 0,$$
as $E[y - X_1b_R] = E[y] - X_1E[b_R] = X_1\beta_1 - X_1\beta_1 = 0$ for $\beta_2 = 0$ (note that, as was proved in Section 3.3.1 (p. 152–3) for the unrestricted estimators $b$ and $s^2$, the restricted least squares estimators $b_R$ and $s_R^2$ are independent if the null hypothesis holds true). The term in (4.39) becomes $\frac{n}{2s_R^4} - \frac{1}{s_R^6}e_R'e_R = -\frac{n}{2s_R^4}$, and as $s_R^2$ is a consistent estimator of $\sigma^2$ (under the null hypothesis), the expectation of this term is approximately the same. Combining the above results, we get
$$-E\Big[\frac{\partial^2 l}{\partial\theta\partial\theta'}\Big]\Big|_{\theta=\hat\theta_0} \approx \begin{pmatrix} \frac{1}{s_R^2}X_1'X_1 & \frac{1}{s_R^2}X_1'X_2 & 0 \\ \frac{1}{s_R^2}X_2'X_1 & \frac{1}{s_R^2}X_2'X_2 & 0 \\ 0 & 0 & \frac{n}{2s_R^4} \end{pmatrix} = \begin{pmatrix} \frac{1}{s_R^2}X'X & 0 \\ 0 & \frac{n}{2s_R^4} \end{pmatrix}.$$
With the above expressions for the gradient and the Hessian matrix, and using $X_1'e_R = 0$, the LM-test (4.54) becomes
$$LM = \frac{1}{s_R^2}\begin{pmatrix} X_1'e_R \\ X_2'e_R \end{pmatrix}'(X'X)^{-1}\begin{pmatrix} X_1'e_R \\ X_2'e_R \end{pmatrix} = \frac{e_R'X(X'X)^{-1}X'e_R}{s_R^2} = n\,\frac{e_R'X(X'X)^{-1}X'e_R}{e_R'e_R} = nR^2. \qquad (4.55)$$
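The equality between the score form (4.54) and the auxiliary-regression form (4.55) can be verified numerically; the simulated design below is an illustrative assumption.

```python
# Score form of the LM-test versus n*R^2 from the auxiliary regression.
import numpy as np

rng = np.random.default_rng(8)
n, k, g = 120, 5, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
X1, X2 = X[:, :k - g], X[:, k - g:]
y = X1 @ np.array([1.0, 0.5, -1.0]) + rng.normal(size=n)

bR = np.linalg.lstsq(X1, y, rcond=None)[0]
eR = y - X1 @ bR
s2R = eR @ eR / n

# Score vector and block-diagonal information matrix at the restricted estimate
score = np.concatenate([X.T @ eR / s2R, [0.0]])   # last entry: dl/dsigma2 = 0
info = np.zeros((k + 1, k + 1))
info[:k, :k] = X.T @ X / s2R
info[k, k] = n / (2 * s2R ** 2)
LM_score = score @ np.linalg.solve(info, score)

# Auxiliary regression of eR on X (eR has zero mean since X1 has a constant)
gam = np.linalg.lstsq(X, eR, rcond=None)[0]
u = eR - X @ gam
R2 = 1 - (u @ u) / (eR @ eR)
print(LM_score, n * R2)                           # the two coincide
```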

Computation of LM-test as variable addition test

This is precisely the result (4.21) that was obtained in Section 4.2.4 within the setting of non-linear regression models. This result holds true much more generally, that is, in many cases (non-linear models, non-linear restrictions, non-normal disturbances) the LM-test can be computed as follows.

Computation of LM-test by auxiliary regressions
Step 1: Estimate the restricted model. Estimate the restricted model, with corresponding residuals $e_R$.
Step 2: Auxiliary regression of residuals of step 1. Perform a regression of $e_R$ on all the variables in the unrestricted model. In non-linear models $y = f(x, \beta) + \varepsilon$, the regressors are given by $\partial f/\partial\beta'$; in other types of models the regressors may be of a different nature (several examples will follow in the next chapters). The precise nature of the variables to be used in the regression in step 2 depends on the particular testing problem at hand.
Step 3: LM = nR² of step 2. Then $LM = nR^2 \approx \chi^2(g)$, where $R^2$ is the $R^2$ of the regression in step 2.

Because variables are added in step 2 to the variables that are used in step 1, this is also called a variable addition test. In the rest of this book we will encounter several examples.

Exercises: E: 4.14d, 4.15b.

4.3.8 Remarks on tests

Uses Section 1.4.

Comparison of three tests

In the foregoing sections we discussed four tests on parameter restrictions (F, LR, W, and LM). In this section we give a brief summary and we comment on some computational issues. Exhibit 4.16 (a) gives a graphical illustration of the relation between the LR-, W-, and LM-tests for the case of a single parameter $\theta$ with the null hypothesis that $\theta = 0$. The W- and LM-tests are an approximation of the LR-test, that is, of the loss in log-likelihood caused by imposing the null hypothesis. The advantage of the W- and LM-tests is that only one model needs to be estimated. If the restricted model is the simplest to estimate, as is often the

case, then the LM-test is preferred. In situations where the unrestricted model is the simplest to estimate we can use the W-test. Exhibit 4.16 (b) gives a summary comparison of the three tests LR, W, and LM.

Exhibit 4.16 Comparison of tests
(a) gives a graphical illustration of the Likelihood Ratio test, the Wald test, and the Lagrange Multiplier test: the LR-test is based on the indicated vertical distance, the W-test on the indicated horizontal distance, and the LM-test on the indicated gradient. (b) contains a summary comparison of the three tests:

Test | Estimated models    | Advantage                                 | Disadvantage                             | Main formula
LR   | 2 (under H0 and H1) | Optimal power                             | Needs 2 optimizations (ML under H0, H1)  | (4.44): 2 log L(H1) − 2 log L(H0)
W    | 1 (under H1)        | If model under H0 is complicated          | Test depends on parametrization          | (4.48) (generalizes F-test)
LM   | 1 (under H0)        | Simple computations (auxiliary regressions) | Power may be small                     | (4.54) and (4.55): LM = nR²

In Section 4.2.2 we discussed methods for non-linear optimization. In general this involves a number of iterations to improve the estimates, and a stopping rule determines when the iterations are ended. In ML estimation one can stop the iterations if the criterion values of the log-likelihood do not change anymore (this is related to the LR-test), if the estimates do not change anymore (this is related to the W-test, which weighs the changes against the variance of the estimates), or if the gradient has become zero (this is related to the LM-test, which weighs the gradient against its variance).

Relations between tests

The relations of the tests LR, W, and LM with the F-test for a linear hypothesis in a linear model are given in (4.46), (4.49), and (4.23). From these expressions the following inequalities can be derived for testing a linear hypothesis in a linear model:
$$LM \le LR \le W. \qquad (4.56)$$
This is left as an exercise (see Exercise 4.6). As all three statistics have the same asymptotic $\chi^2(g)$ distribution, it follows that the P-values based on this distribution satisfy $P(LM) \ge P(LR) \ge P(W)$. This means that, if the LM-test rejects the null hypothesis, then the same holds true for the LR- and W-tests, and, if the W-test fails to reject the null hypothesis, then the same holds true for the LR- and LM-tests.

Use of the χ²- and F-distribution in testing

It also follows from (4.46), (4.49), and (4.23) that the three tests are asymptotically (for $n \to \infty$) equivalent to $gF(g, n-k)$, and this converges in distribution to a $\chi^2(g)$ distribution; that is, all four tests are asymptotically equivalent. For small samples, however, it is sometimes preferable to use the critical values of the $gF(g, n-k)$ distribution instead of those of the $\chi^2(g)$ distribution. These critical values are somewhat larger, so that the evidence to reject the null hypothesis should be somewhat stronger than what would be required asymptotically. Exhibit 4.17 shows the 5 per cent critical values for some selected degrees of freedom ($g$, $n-k$).

Exhibit 4.17 F- and χ²-distributions
The 5% critical values of the chi-squared distribution (last column) for some selected degrees of freedom (g) and the 5% critical values of the scaled F-distribution gF(g, n−k) for different values of n−k (10, 100, and 1000).

g   | gF(g,10) | gF(g,100) | gF(g,1000) | χ²(g)
1   |   4.96   |   3.94    |    3.85    |   3.84
2   |   8.21   |   6.17    |    6.01    |   5.99
3   |  11.12   |   8.09    |    7.84    |   7.81
4   |  13.91   |   9.85    |    9.53    |   9.49
5   |  16.63   |  11.53    |   11.12    |  11.07
6   |  19.30   |  13.15    |   12.65    |  12.59
7   |  21.95   |  14.72    |   14.13    |  14.07
8   |  24.58   |  16.26    |   15.59    |  15.51
9   |  27.18   |  17.78    |   17.00    |  16.92
10  |  29.78   |  19.27    |   18.40    |  18.31
20  |  55.48   |  33.52    |   31.63    |  31.41
50  | 131.85   |  73.53    |   68.18    |  67.50
100 | 258.41   | 139.86    |  125.63    | 124.34
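The entries of Exhibit 4.17 can be reproduced with standard distribution functions:

```python
# 5% critical values of gF(g, n-k) and of chi-squared(g).
from scipy import stats

for g in [1, 2, 5, 10, 50]:
    row = [g * stats.f.ppf(0.95, g, m) for m in (10, 100, 1000)]
    print(g, [round(v, 2) for v in row], round(stats.chi2.ppf(0.95, g), 2))
```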

This shows that both methods lead to the same results for large sample sizes, but that for small samples the critical values of the $gF(g, n-k)$ distribution may be considerably larger than those of the $\chi^2(g)$ distribution.

Alternative expressions for tests and information matrix

Sometimes the W-test and the LM-test are computed by expressions that differ from (4.48) and (4.54) in using approximations of the information matrix. For instance, for independent observations the log-likelihood is given by (4.27) and the information matrix in (4.54) can be approximated by using
$$\frac{1}{n}\mathcal{I}_n = -\frac{1}{n}\sum_{i=1}^n E\Big[\frac{\partial^2 l_i}{\partial\theta\partial\theta'}\Big] \approx -\frac{1}{n}\sum_{i=1}^n \frac{\partial^2 l_i}{\partial\theta\partial\theta'} \approx \frac{1}{n}\sum_{i=1}^n \frac{\partial l_i}{\partial\theta}\frac{\partial l_i}{\partial\theta'}. \qquad (4.57)$$
The last approximation was stated in (4.28) and may be convenient, as it requires only the first order derivatives. These approximations evaluated at any of the three parameter values $\theta = \theta_0$ (of the DGP), $\theta = \hat\theta_1$ (ML in the unrestricted model), or $\theta = \hat\theta_0$ (ML under the null hypothesis) are asymptotically equal (if the null hypothesis is true), as ML estimators are consistent, so that $\mathrm{plim}(\hat\theta_1) = \theta_0$ and $\mathrm{plim}(\hat\theta_0) = \theta_0$. All these expressions can also be used as approximations of the asymptotic covariance matrix of the ML estimator in (4.35). This is left for the exercises (see Exercises 4.14 and 4.15). Of course it is also of interest to apply the LR-, W-, and LM-tests for a linear hypothesis in a linear model and to compare the outcomes with the F-test of Chapter 3.

Exercises: T: 4.6b; E: 4.13, 4.14, 4.15, 4.16.

4.3.9 Two examples

We illustrate ML estimation and testing with two examples. The first example concerns a linear model with non-normal disturbances, the second a non-linear regression model with normally distributed disturbances. ML also has important applications for other types of models that cannot be expressed as a regression; such applications will be discussed in Chapters 6 and 7.

Example 4.5: Stock Market Returns (continued)

We consider again the CAPM for the sector of cyclical consumer goods of Example 4.4 (XM404SMR). We will discuss (i) the specification of the log-likelihood for t(5)-distributed disturbances, (ii) the outcomes of the ML estimates, and (iii) the choice of the number of degrees of freedom in the t-distribution.

(i) Log-likelihood for (scaled) t(5)-distributed disturbances

As was discussed in Example 4.4 (p. 223–4), the disturbance terms in the CAPM may have fatter tails than the normal distribution (see also Exhibit 4.11 (a)). As an alternative, we consider the same linear model with disturbances that have the (scaled) t(5)-distribution. That is, the model is given by $y_i = \alpha + \beta x_i + \varepsilon_i$, where $y_i$ and $x_i$ are the excess returns in respectively the sector of cyclical consumer goods and the whole market. We suppose that Assumptions 1–6 of Section 3.1.4 (p. 125) are satisfied. In particular, the disturbance terms $\varepsilon_i$ have zero mean and equal variance, and we assume that they are mutually independent. As independence implies being uncorrelated, this is stronger than Assumption 4 of uncorrelated disturbance terms. The postulated scaled t(5)-density of the disturbance terms is
$$p(\varepsilon_i) = \frac{c_5}{\sigma}\Big(1 + \frac{\varepsilon_i^2}{5\sigma^2}\Big)^{-3},$$
where $c_5$ is a scaling constant (that does not depend on $\sigma$) such that $\int p(\varepsilon_i)d\varepsilon_i = 1$. The log-likelihood (4.27) is given by
$$l(\alpha, \beta, \sigma^2) = \sum_{i=1}^n \log(p(\varepsilon_i)) = n\log(c_5) - \frac{n}{2}\log(\sigma^2) - 3\sum_{i=1}^n \log\Big(1 + \frac{(y_i - \alpha - \beta x_i)^2}{5\sigma^2}\Big).$$

(ii) ML estimates based on (scaled) t(5)-distribution

The first order derivatives of the above log-likelihood are given by
$$\frac{\partial l}{\partial\alpha} = \sum_{i=1}^n \frac{-3}{1 + \varepsilon_i^2/5\sigma^2}\cdot\Big(-\frac{2\varepsilon_i}{5\sigma^2}\Big) = \sum_{i=1}^n \frac{6\varepsilon_i}{5\sigma^2 + \varepsilon_i^2},$$
$$\frac{\partial l}{\partial\beta} = \sum_{i=1}^n \frac{-3}{1 + \varepsilon_i^2/5\sigma^2}\cdot\Big(-\frac{2\varepsilon_i x_i}{5\sigma^2}\Big) = \sum_{i=1}^n \frac{6\varepsilon_i x_i}{5\sigma^2 + \varepsilon_i^2},$$
$$\frac{\partial l}{\partial\sigma^2} = -\frac{n}{2\sigma^2} - 3\sum_{i=1}^n \Big(-\frac{\varepsilon_i^2}{5\sigma^4}\Big)\frac{1}{1 + \varepsilon_i^2/5\sigma^2} = -\frac{n}{2\sigma^2} + \frac{3}{\sigma^2}\sum_{i=1}^n \frac{\varepsilon_i^2}{5\sigma^2 + \varepsilon_i^2}.$$
Substituting $\varepsilon_i = y_i - \alpha - \beta x_i$, the ML estimates are obtained by solving the three non-linear equations $\partial l/\partial\alpha = \partial l/\partial\beta = \partial l/\partial\sigma^2 = 0$. The outcomes $(\hat\alpha_{ML}, \hat\beta_{ML}, \hat\sigma_{ML})$ of the BHHH algorithm of Section 4.3.2 are given

in Exhibit 4.18 (a–c). The iterations are started in $(\alpha, \beta, \sigma) = (0, 1, 1)$ and converge to $(\hat\alpha_{ML}, \hat\beta_{ML}, \hat\sigma_{ML}) = (-0.34, 1.20, 4.49)$.

Exhibit 4.18 Stock Market Returns (Example 4.5)
ML estimates of the CAPM for the sector of cyclical consumer goods in the UK, using a scaled t(5) distribution for the disturbances. (a)–(c) show the estimates of the constant term $\alpha$ (a), the slope $\beta$ (b), and the scale parameter $\sigma$ (c) obtained in twenty iterations of the BHHH algorithm, with starting values $\alpha = 0$, $\beta = 1$, and $\sigma = 1$. (d)–(e) show the values of SSR (d) and of the log-likelihood (denoted by LL, (e)) obtained in these iterations; the value of LL increases at each iteration, but the value of SSR does not always decrease. (f) shows the histogram of the ML residuals $\hat\varepsilon_i = y_i - \hat\alpha_{ML} - \hat\beta_{ML}x_i$ (sample 1980:01–1999:12, 240 observations; mean −0.12, median 0.07, maximum 14.85, minimum −20.60, skewness −0.30). (g) shows the maximum of the log-likelihood function for the t(d) distribution for different degrees of freedom d; the optimal value is obtained for d = 8, and for d infinitely large (the case of the normal distribution) the LL value is indicated by the horizontal line.
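A sketch of the ML computation with scaled t(5) disturbances, maximizing the log-likelihood given under (i) by a general-purpose optimizer rather than BHHH, is given below. Simulated data stand in for the UK returns, which are not reproduced here; the constant $n\log(c_5)$ is dropped as it does not affect the maximizer.

```python
# ML for the CAPM with scaled t(5) disturbances, by direct optimization.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(9)
n = 240
x = rng.normal(0.0, 4.0, n)
eps = 3.5 * rng.standard_t(5, n)          # sigma = 3.5 is the scale parameter
y = -0.3 + 1.2 * x + eps

def negloglik(theta):
    a, b, log_s2 = theta
    s2 = np.exp(log_s2)                   # keeps sigma^2 positive
    e = y - a - b * x
    return 0.5 * n * log_s2 + 3 * np.sum(np.log(1 + e ** 2 / (5 * s2)))

res = minimize(negloglik, np.array([0.0, 1.0, 0.0]), method='BFGS')
print(res.x[0], res.x[1], np.sqrt(np.exp(res.x[2])))   # alpha, beta, sigma
```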

(iii) Choice of degrees of freedom of t-distribution

The motivation to use the (scaled) t(5)-distribution instead of the normal distribution is that the disturbance distribution may have fat tails. However, we have no special reason to take the t-distribution with $d = 5$ degrees of freedom. Therefore we estimate the CAPM also with scaled t(d)-distributions for selected values of $d$, including $d = \infty$ (which corresponds to the normal distribution, see Section 1.2.3 (p. 33)). This can be used for a grid search for $d$ to obtain ML estimates of the parameters $(\alpha, \beta, \sigma^2, d)$. Exhibit 4.18 (g) shows the maximum of the log-likelihood for different values of $d$. The overall optimum is obtained for $d = 8$; the difference in the log-likelihood with $d = 5$ is rather small.

We can also test the null hypothesis of normally distributed error terms against the alternative of a t(d)-distribution, that is, the test of $d = \infty$ against $d < \infty$. The Likelihood Ratio test is given by

$$LR = 2l(\hat\theta_1) - 2l(\hat\theta_0) = 2(-747.16 + 750.54) = 6.77,$$

where $l(\hat\theta_1) = -747.16$ is the unrestricted maximal log-likelihood value (that is, for $d = 8$) and $l(\hat\theta_0) = -750.54$ is the log-likelihood value for the model with normally distributed disturbances. Asymptotically, LR follows the $\chi^2(1)$ distribution. The P-value of the computed LR-test is $P = 0.009$, so that the null hypothesis is rejected. Therefore we conclude that, under the stated assumptions, a t-distribution may be more convenient to model the disturbances of the CAPM than a normal distribution. In Section 4.4.6 we provide a further comparison between the models with normal and with scaled t(5)-disturbances.
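The following is a rough sketch of such a grid search over the degrees of freedom, assuming the simulated arrays x and y from the previous snippet. For each $d$ the log-likelihood is maximized over $(\alpha, \beta, \sigma)$, and $d = \infty$ gives the normal benchmark for the LR test.

```python
# Sketch: profile the maximal log-likelihood over the degrees of freedom d
# of a scaled t(d) distribution and test normality (d = infinity) by LR.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import t as t_dist, norm, chi2

def max_loglik(d):
    def negll(theta):
        alpha, beta, log_s = theta
        s = np.exp(log_s)
        z = (y - alpha - beta * x) / s
        pdf = norm.logpdf(z) if np.isinf(d) else t_dist.logpdf(z, df=d)
        return -(np.sum(pdf) - len(y) * log_s)     # Jacobian term for the scale
    return -minimize(negll, np.array([0.0, 1.0, 1.0]), method="Nelder-Mead").fun

grid = [3, 4, 5, 6, 8, 10, 20, 50, np.inf]
ll = {d: max_loglik(d) for d in grid}
d_opt = max(ll, key=ll.get)
LR = 2.0 * (ll[d_opt] - ll[np.inf])                # LR test of d = infinity
print(d_opt, LR, chi2.sf(LR, 1))                   # asymptotic chi2(1) P-value
```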

Example 4.6: Coffee Sales (continued)

As a second example we consider again the data on sales of two brands of coffee discussed before in Section 4.2.4 (data file XM402COF). For each of the two brands separately, we use the non-linear regression model (4.9) with the assumption of normally distributed disturbances, so that

$$\log(q_i) = \beta_1 + \frac{\beta_2}{\beta_3}\bigl(d_i^{\beta_3} - 1\bigr) + \varepsilon_i, \qquad \varepsilon_i \sim \mathrm{NID}(0, \sigma^2).$$

The null hypothesis of constant demand elasticity corresponds to the parameter restriction $\beta_3 = 0$, with corresponding model $\log(q) = \beta_1 + \beta_2\log(d) + \varepsilon$. We will discuss (i) the outcomes of ML estimation for the two brands, (ii) LR-tests on constant elasticity, (iii) LM-tests on constant elasticity, (iv) Wald tests on constant elasticity, and (v) comparison of the tests and conclusion. We perform the different tests of this hypothesis for both brands of coffee.

(i) Outcomes of ML for the two brands

Panels 1 and 3 of Exhibit 4.19 give the results of ML estimation under the hypothesis that $\beta_3 = 0$. Under the null hypothesis, the model is linear with dependent variable $\log(q_i)$ and with explanatory variable $\log(d_i)$ (see Example 4.2). Because the disturbances are assumed to be normally distributed, ML corresponds to least squares (see Section 4.3.2). Panels 2 and 4 give the results of ML estimation in the unrestricted non-linear regression model (4.9); this corresponds to non-linear least squares.

(ii) LR-tests on constant elasticity

The Likelihood Ratio tests for the null hypothesis that $\beta_3 = 0$ against the alternative that $\beta_3 \neq 0$ can be obtained from the results in Exhibit 4.19 for brands 1 and 2, with P-values based on the asymptotic $\chi^2(1)$ distribution:

$$LR_1 = 2(12.530 - 10.017) = 5.026\ (P = 0.025), \qquad LR_2 = 2(11.641 - 10.024) = 3.235\ (P = 0.072).$$

(iii) LM-tests on constant elasticity

The Lagrange Multiplier test for non-linear regression models has already been performed in Section 4.2.4 (p. 202-4) for both brands of coffee, with the results in Panels 5 and 6 in Exhibit 4.10. The test outcomes are, with P-values based on the asymptotic $\chi^2(1)$ distribution,

$$LM_1 = nR^2 = 12 \cdot 0.342 = 4.106\ (P = 0.043), \qquad LM_2 = nR^2 = 12 \cdot 0.236 = 2.836\ (P = 0.092).$$

(iv) Wald tests on constant elasticity

To compute the Wald test (4.48) we use the relation (4.50) between the Wald test and the t-test, that is,

$$W = \frac{n}{n-k}\,t^2.$$

The non-linear regressions in Panels 2 and 4 in Exhibit 4.19 show the t-values of $\hat\beta_3$ with P-values based on the t(9)-distribution, namely

$$t_1 = -2.012\ (P = 0.075), \qquad t_2 = -1.651\ (P = 0.133).$$

Using (4.50) with $n = 12$ and $k = 3$, this leads to the following values for the Wald test, with corresponding P-values based on the $\chi^2(1)$ distribution:

$$W_1 = \frac{12}{9}\cdot(-2.012)^2 = 5.398\ (P = 0.020), \qquad W_2 = \frac{12}{9}\cdot(-1.651)^2 = 3.633\ (P = 0.057).$$
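A small sketch that recomputes these statistics and their asymptotic P-values from the reported exhibit values. The minor discrepancies with the text (for instance 4.104 versus 4.106) come from the rounded $R^2$ values used here.

```python
# Sketch: LR, LM, and W statistics for the coffee example, with chi2(1) P-values.
from scipy.stats import chi2

n, k = 12, 3
LR1 = 2 * (12.530 - 10.017)          # brand 1: unrestricted minus restricted log-lik
LR2 = 2 * (11.641 - 10.024)          # brand 2
LM1, LM2 = n * 0.342, n * 0.236      # LM = n * R^2 from the auxiliary regressions
W1 = n / (n - k) * (-2.012) ** 2     # W = n/(n-k) * t^2, using the reported t-values
W2 = n / (n - k) * (-1.651) ** 2
for name, stat in [("LR1", LR1), ("LR2", LR2), ("LM1", LM1),
                   ("LM2", LM2), ("W1", W1), ("W2", W2)]:
    print(name, round(stat, 3), round(chi2.sf(stat, 1), 3))
```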

[Exhibit 4.19 Coffee Sales (Example 4.6): Regressions for two brands of coffee, models with constant elasticity (Panels 1 and 3) and models with varying elasticity (Panels 2 and 4). Panels 1 and 3 report least squares regressions of LOGQ1 (brand 1) and LOGQ2 (brand 2) on a constant and LOGD1 respectively LOGD2, with 12 included observations. Panels 2 and 4 report the non-linear least squares regressions LOGQ1 = C(1) + (C(2)/C(3))*(D1^C(3) - 1) and LOGQ2 = C(1) + (C(2)/C(3))*(D2^C(3) - 1), with convergence achieved after 5 iterations. Each panel lists coefficients, standard errors, t-statistics and P-values, the R-squared, the sum of squared residuals, and the log-likelihood (10.017 and 12.530 for brand 1 in Panels 1 and 2, and 10.024 and 11.641 for brand 2 in Panels 3 and 4).]

(v) Comparison of tests and conclusion

Summarizing the outcomes of the test statistics, note that for both brands of coffee $LM < LR < W$, in accordance with (4.56). If we use a 5 per cent significance level, the null hypothesis of constant demand elasticity is not rejected for brand 2, but it is rejected for brand 1 by the LR-test, the LM-test, and the W-test (although not by the t-test). As the sample size ($n = 12$) is very small, the asymptotic $\chi^2(1)$ distribution is only a rough approximation. It is helpful to consider also the $gF(g, n-k) = F(1, 9)$ distribution, with 5 per cent critical value equal to 5.12. This is considerably larger than the value 3.84 for the $\chi^2(1)$ distribution. With this critical value of 5.12, all tests fail to reject the null hypothesis, with the exception of the Wald test for brand 1. Therefore, on the basis of these data there is not so much compelling evidence to reject the null hypothesis of constant elasticity of the demand for coffee. Of course, the number of observations for the two models ($n = 12$ for both brands) is very small. In Section 5.3.1 (p. 307-10) we will use a combined model for the two brands (so that $n = 24$ in this case), and, as we shall see in Section 5.3, the null hypothesis of constant elasticity can then be rejected for both brands by all three tests (LR, LM, and Wald).

4.4 Generalized method of moments

4.4.1 Motivation

Requirements for maximum likelihood

The results in the foregoing section show that maximum likelihood has (asymptotically) optimal properties for correctly specified models. In practice this means that the joint probability distribution (4.25) of the data should be a reasonable reflection of the data generating process. If there is much uncertainty about this distribution, then it may be preferable to use an estimation method that requires somewhat less information on the DGP. In general, by making fewer assumptions on the DGP, some efficiency will be lost as compared to ML in the correct model. However, this loss may be relatively small compared to the loss of using ML in a model that differs much from the DGP.

Evaluation of accuracy of estimates

The accuracy of parameter estimates is usually evaluated in terms of their standard errors and the P-values associated with tests of significance. The expression

$$\widehat{\mathrm{var}}(b) = s^2(X'X)^{-1}$$

provides correct P-values on the significance of least squares estimates if the seven standard Assumptions 1-7 of Section 3.1.4 are satisfied. Further, the expression

$$\widehat{\mathrm{var}}(\hat\theta_{ML}) = \mathcal{I}_n^{-1}(\hat\theta_{ML})$$

provides asymptotically correct P-values on the significance of maximum likelihood estimates if the joint probability function $p(y, X, \theta)$ of the data is correctly specified (see Section 4.3.3).

In this section we discuss the generalized method of moments (GMM). In this approach the parameters are estimated by solving a set of moment conditions. GMM can be used to compute reliable standard errors and P-values in situations where some of the assumptions of OLS or ML are not satisfied. As we shall see below, both OLS and ML can be seen as particular examples of estimators based on moment conditions. For instance, one can estimate the parameters by OLS and compute the GMM standard errors even if not all the Assumptions 1-7 hold true. One can also estimate the parameters by ML and compute the GMM standard errors, in particular to obtain reliable standard errors even if the specified probability distribution is not correct. The GMM standard errors are computed on the basis of the moment conditions, and they provide asymptotically correct P-values, provided that the specified moment conditions are valid.

Example 4.7: Stock Market Returns (continued)

As an illustration, Exhibit 4.20 shows the OLS residuals of the CAPM discussed in Examples 2.1, 4.1, and 4.5 (data file XM404SMR). It seems that the disturbances have a larger variance at the beginning and near the end of the observation period as compared to the middle period. If the variances differ, then the disturbances are heteroskedastic and Assumption 3 is violated. We have already concluded (see Sections 4.3.4 and 4.3.9) that Assumption 7 of normally distributed disturbances is also doubtful. However, the alternative of ML based on the t-distribution does not take the apparent heteroskedasticity of the disturbances into account either. It seems preferable to evaluate the CAPM without making such assumptions. In Section 4.4.6 we will use GMM for this purpose.

[Exhibit 4.20 Stock Market Returns (Example 4.7): Least squares residuals of the CAPM for the sector of cyclical consumer goods in the UK, monthly data 1980-99.]

4.4.2 GMM estimation

Uses Section 1.3.1 (p. 39).

Method of moments estimator of the mean

In Section 1.3.1 (p. 39) we discussed the method of moments, which is based on estimating population moments by means of sample moments. For example, suppose that the data $y_i$ consist of a random sample from a population with unknown mean $\mu$, so that $E[y_i - \mu] = 0$. Then the moment estimator of $\mu$ is obtained by replacing the population mean ($E$) by the sample mean $\bigl(\frac{1}{n}\sum_{i=1}^n\bigr)$, so that

$$\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat\mu) = 0,$$

that is, $\hat\mu = \frac{1}{n}\sum_{i=1}^{n} y_i$.

Least squares derived by the method of moments

The least squares estimator in the linear model (4.1) can also be derived by the method of moments. The basic requirement for this estimator is the orthogonality condition (4.4), and this condition is satisfied (under weak regularity conditions) if $E[x_i\varepsilon_i] = 0$ for all $i$, where $x_i$ is the $k \times 1$ vector of explanatory variables for the $i$th observation. As $\varepsilon_i = y_i - x_i'\beta$, this is equivalent to the condition that

$$E[x_i(y_i - x_i'\beta)] = 0, \qquad i = 1, \cdots, n. \tag{4.58}$$

Note that $x_i$ is a $k \times 1$ vector, so that this imposes $k$ restrictions on the parameter vector $\beta$. The corresponding conditions on the sample moments (replacing the population mean $E$ by the sample mean $\frac{1}{n}\sum_{i=1}^n$) give the $k$ equations

$$\frac{1}{n}\sum_{i=1}^{n} x_i(y_i - x_i'\hat\beta) = 0.$$

This can be written as $X'(y - X\hat\beta) = 0$, so that $\hat\beta = b$ is equal to the least squares estimator. This shows that OLS can be derived by the method of moments, using the orthogonality conditions (4.58) as moment conditions.
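A minimal sketch of both moment estimators on simulated data (the DGP and variable names are illustrative assumptions):

```python
# Sketch: method of moments for the mean and for OLS. Replacing E by the
# sample average turns E[y - mu] = 0 into mu_hat = mean(y), and turns the
# k orthogonality conditions E[x_i (y_i - x_i' b)] = 0 into the normal equations.
import numpy as np

rng = np.random.default_rng(1)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # constant and one regressor
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)

mu_hat = y.mean()                                       # moment estimator of the mean
b = np.linalg.solve(X.T @ X, X.T @ y)                   # solves (1/n) X'(y - Xb) = 0
print(mu_hat, b)
print(np.allclose(X.T @ (y - X @ b) / n, 0.0))          # sample moment conditions hold
```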

ML as method of moments estimator

ML estimators can also be obtained from moment conditions. Suppose that the data consist of $n$ independent observations, so that the log-likelihood is $l(\theta) = \sum_{i=1}^{n} l_i(\theta)$, as in (4.27), where $l_i(\theta) = \log p_\theta(y_i, x_i)$ replaces $\log p(y)$. By the arguments in (4.53), it follows that

$$E\left[\frac{\partial l_i}{\partial\theta}\Big|_{\theta=\theta_0}\right] = 0, \qquad i = 1, \cdots, n. \tag{4.59}$$

Replacing the population mean $E$ by the sample mean $\frac{1}{n}\sum_{i=1}^n$, this gives

$$\frac{1}{n}\sum_{i=1}^{n}\frac{\partial l_i}{\partial\theta} = 0. \tag{4.60}$$

The solution of these equations gives the ML estimator, as this corresponds to the first order conditions for a maximum of the log-likelihood.

The generalized method of moments

We now describe the generalized method of moments more in general. Suppose that the parameter vector of interest, $\theta$, contains $p$ unknown parameters and that the DGP has parameters $\theta_0$. The basic assumption is that we can formulate a set of moment conditions, say

$$E[g_i(\theta_0)] = 0, \qquad i = 1, \cdots, n, \tag{4.61}$$

where the $g_i$ are known functions $g_i : \mathbb{R}^p \to \mathbb{R}^m$ that depend on the observed data. That is, for each observation ($i = 1, \cdots, n$) the DGP satisfies $m$ distinct moment conditions. Such equations are called 'generalized' moment conditions. Examples are the orthogonality conditions (4.58) (which correspond to $k$ linear functions in the $k$ unknown parameters) and the first order conditions (4.60) (which give $p$ non-linear functions in the $p$ unknown parameters). The crucial assumption is that the DGP satisfies the $m$ restrictions in (4.61) for the observations $i = 1, \cdots, n$. If the number of moment conditions $m$ is equal to the number of unknown parameters $p$, then the model (4.61) is called exactly identified, and if $m > p$ then the model is called over-identified. The GMM estimator $\hat\theta$ is defined as the solution of the $m$ equations obtained by replacing the population mean $E$ in (4.61) by the sample mean $\frac{1}{n}\sum_{i=1}^n$, that is,

$$\frac{1}{n}\sum_{i=1}^{n} g_i(\hat\theta) = 0. \tag{4.62}$$

Numerical aspects of GMM

Let the $m \times 1$ vector $G_n(\theta)$ be defined by

$$G_n(\theta) = \sum_{i=1}^{n} g_i(\theta).$$

In the exactly identified case ($m = p$), this system of $m$ equations in $p$ unknown parameters has a unique solution (under suitable regularity conditions). The numerical solution methods discussed in Section 4.2.3, for example Newton-Raphson, can be used for this purpose. In the over-identified case ($m > p$) there are more equations than unknown parameters, and there will in general exist no exact solution of this system of equations. That is, although the $m$ (population) conditions (4.61) are satisfied (by assumption) for the DGP, that is, for $\theta = \theta_0$, there often exists no value $\hat\theta$ for which the sample condition (4.62) is exactly satisfied. If there exists no value of $\theta$ so that $G_n(\theta) = 0$, one can instead minimize the distance of this vector from zero, for instance by minimizing $\frac{1}{n}G_n'(\theta)G_n(\theta)$ with respect to $\theta$. As an alternative one can also minimize a weighted sum of squares

$$\frac{1}{n}G_n'WG_n, \tag{4.63}$$

where $W$ is an $m \times m$ symmetric and positive definite matrix. In the exactly identified case (with a solution $G_n(\hat\theta) = 0$) the choice of $W$ is irrelevant, but in the over-identified case it may be chosen to take possible differences in sampling variation of the individual moment conditions into account. The choice of the weighting matrix $W$ (when $m > p$) will be discussed in the next section. In general the minimization of (4.63) will be a non-linear optimization problem that can be solved by the numerical methods discussed in Section 4.2.3.

Summary of computations in GMM estimation

Estimation by GMM proceeds in the following two steps.

GMM estimation
Step 1: Specify a sufficient number of moment conditions. Identify the $p$ parameters of interest $\theta$ and specify $m$ ($\geq p$) moment conditions (4.61). The crucial assumption is that the DGP satisfies these moment conditions. In particular, the specified moments should exist. To obtain a solution for $\hat\theta$, we need in general to impose at least as many moment conditions as there are unknown parameters ($m \geq p$).
Step 2: Estimate the parameters. Estimate $\theta$ by GMM by solving the equations (4.62) (in the exactly identified case with $m = p$) or by minimizing (4.63) (in the over-identified case with $m > p$).
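As a small numerical sketch of the over-identified case, consider estimating the parameter $\theta$ of an exponential distribution from the two moment conditions $E[y_i - \theta] = 0$ and $E[y_i^2 - 2\theta^2] = 0$ (an illustrative choice with $m = 2$ and $p = 1$), minimizing (4.63) with the identity weighting matrix as a first step:

```python
# Sketch: first-step GMM (W = I) for an over-identified exponential model.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(2)
y = rng.exponential(scale=2.0, size=500)      # true theta = 2

def G(theta):                                  # G_n(theta) = sum_i g_i(theta)
    return np.array([np.sum(y - theta), np.sum(y**2 - 2.0 * theta**2)])

def objective(theta, W=np.eye(2)):             # (1/n) G' W G
    g = G(theta)
    return g @ W @ g / len(y)

theta1 = minimize_scalar(objective, bounds=(0.01, 10.0), method="bounded").x
print(theta1)
```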

4.4.3 GMM standard errors

An asymptotic result

To apply tests based on GMM estimators we need to know (asymptotic) expressions for the covariance matrix of these estimators. In our analysis we will assume that the moment conditions are valid for the DGP. That is, the random vectors $g_i(\theta_0)$ should satisfy the moment conditions (4.61), and these random vectors should not be too strongly correlated for $i = 1, \cdots, n$. Note that $E[G_n(\theta_0)] = 0$ if (4.61) is valid. Our assumption is that the GMM estimator is consistent and that the sample average $\frac{1}{n}G_n = \frac{1}{n}\sum_{i=1}^n g_i$ satisfies the following central limit theorem:

$$\frac{1}{\sqrt{n}}G_n(\theta_0) \xrightarrow{d} N(0, J_0), \qquad J_0 = E[g_i(\theta_0)g_i'(\theta_0)]. \tag{4.64}$$

These assumptions hold true under suitable regularity assumptions on the moment conditions (4.61) and on the correlation structure of the data generating process (in particular, for $i = 1, \cdots, n$ with $n \to \infty$). It falls beyond the scope of this book to treat the required assumptions for asymptotic normality in (4.64) in more detail. However, for two special cases (OLS and ML) the result (4.64) follows from earlier results in this chapter, as we shall now show.

Illustration of asymptotic result: OLS

If the moment conditions are those of OLS in (4.58), it follows that

$$G_n(\theta_0) = \sum_{i=1}^{n} x_i(y_i - x_i'\beta) = \sum_{i=1}^{n} x_i\varepsilon_i = X'\varepsilon.$$

Under appropriate conditions (Assumptions 1*, 2-6, and orthogonality between $x_i$ and $\varepsilon_i$) there holds $J_0 = E[x_i\varepsilon_i^2 x_i'] = \sigma^2 E[x_ix_i'] = \sigma^2\,\mathrm{plim}\bigl(\frac{1}{n}\sum_{i=1}^n x_ix_i'\bigr) = \sigma^2 Q$, and then (4.64) is equivalent to the result (4.6) in Section 4.1, which states that $\frac{1}{\sqrt{n}}X'\varepsilon \xrightarrow{d} N(0, \sigma^2 Q)$.

Second illustration of asymptotic result: ML

If the moment conditions are those of ML in (4.60), it follows that

$$G_n(\theta_0) = \sum_{i=1}^{n}\frac{\partial l_i}{\partial\theta}\Big|_{\theta=\theta_0} = \frac{\partial l}{\partial\theta}\Big|_{\theta=\theta_0}.$$

Now (4.52) states that for $z(\theta_0) = \frac{\partial l}{\partial\theta}\big|_{\theta=\theta_0}$ there holds $\frac{1}{\sqrt{n}}z(\theta_0) \xrightarrow{d} N(0, \mathcal{I}_0)$. This result is equivalent to (4.64), because $z(\theta_0) = G_n(\theta_0)$ and

$$J_0 = E\left[\frac{\partial l_i}{\partial\theta}\frac{\partial l_i}{\partial\theta'}\right] = \lim\left(-\frac{1}{n}\sum_{i=1}^{n}E\left[\frac{\partial^2 l_i}{\partial\theta\,\partial\theta'}\right]\right) = \mathcal{I}_0$$

(see (4.36) and (4.57)).

Derivation of asymptotic distribution of the GMM estimator

Assuming that (4.64) is satisfied, it follows that for large enough samples (so that $\hat\theta$ is close to $\theta_0$) the minimization problem in (4.63) can be simplified by the linearization $G_n = G_n(\theta) \approx G_{n0} + H_{n0}(\theta - \theta_0)$, where $G_{n0} = G_n(\theta_0)$ and $H_{n0} = H_n(\theta_0)$ is the $m \times p$ matrix defined by $H_n = \partial G_n/\partial\theta'$. Substituting this linear approximation in (4.63), and using the fact that the derivative of $G_{n0} + H_{n0}(\theta - \theta_0)$ with respect to $\theta$ is equal to $H_{n0}$, the first order conditions for a minimum of (4.63) are given by

$$H_{n0}'W\bigl(G_{n0} + H_{n0}(\theta - \theta_0)\bigr) = 0.$$

The solution is given by

$$\hat\theta = \theta_0 - (H_{n0}'WH_{n0})^{-1}H_{n0}'WG_{n0}.$$

Suppose that $\mathrm{plim}\bigl(\frac{1}{n}H_{n0}\bigr) = H_0$ exists. If (4.64) is satisfied, then it follows from the above expression that

$$\sqrt{n}(\hat\theta - \theta_0) \xrightarrow{d} N(0, V), \tag{4.65}$$

where $V = (H_0'WH_0)^{-1}H_0'WJ_0WH_0(H_0'WH_0)^{-1}$.

Choice of weighting matrix in the over-identified case

The weighting matrix $W$ in (4.63) can now be chosen so that this covariance matrix is minimal (in the sense of positive semidefinite matrices), to get an asymptotically efficient estimator. Intuitively, it seems reasonable to allow larger errors for estimated moments that contain more uncertainty: we can penalize the deviations of $G_n$ from zero less heavily in directions that have a larger variance. This suggests choosing the weights inversely proportional to the covariance matrix $\mathrm{var}\bigl(\frac{1}{\sqrt{n}}G_n(\theta_0)\bigr) \approx J_0$, that is, taking $W = J_0^{-1}$. It is left as an exercise (see Exercise 4.7) to show that this is indeed the optimal weighting matrix. The resulting $p \times p$ asymptotic covariance matrix is given by

$$V = (H_0'J_0^{-1}H_0)^{-1}. \tag{4.66}$$

So the estimator $\hat\theta$ obtained by minimizing (4.63) with $W = J_0^{-1}$ is the most efficient estimator within the class of GMM estimators obtained by minimizing (4.63) for any positive definite matrix $W$.

Factors that influence the variance of GMM

The efficiency of this estimator further depends on the set of moment conditions that has been specified. Stated in general terms, the best moment conditions (that is, with the smallest covariance matrix $V$ for $\hat\theta$) are those for which $H_0$ is large and $J_0$ is small (all in the sense of positive definite matrices). Here $H_0 = \partial G/\partial\theta'$ is large when the violation of the moment conditions (4.61) is relatively strong for $\theta \neq \theta_0$, that is, when the restrictions are powerful in this sense. And $J_0$ is small when the random variation of the moments $g_i(\theta_0)$ in (4.61) is small.

Illustration: OLS

As an illustration, for the OLS moment conditions $E[x_i(y_i - x_i'\beta)] = 0$ in (4.58) we obtain $H_n = \partial G_n/\partial\beta' = -\sum_{i=1}^n x_ix_i' = -X'X$, and (under Assumptions 1*, 2-6, and orthogonality of the regressors $x_i$ with the disturbances $\varepsilon_i$) $H_0 = \mathrm{plim}\bigl(-\frac{1}{n}X'X\bigr) = -Q$. We showed earlier that $J_0 = \sigma^2 Q$ in this case, so that

$$V_{OLS} = (H_0'J_0^{-1}H_0)^{-1} = (Q\sigma^{-2}Q^{-1}Q)^{-1} = \sigma^2 Q^{-1} \approx s^2\left(\frac{1}{n}X'X\right)^{-1}.$$

This agrees with the asymptotic distribution of $b$ derived in Section 4.1. So this estimator is more efficient if $X'X$ is larger (more systematic variation) and if $\sigma^2$ is smaller (less random variation).

Second illustration: ML

For the ML moment conditions (4.60) we obtain $H_n = \partial G_n/\partial\theta' = \sum_{i=1}^n \frac{\partial^2 l_i}{\partial\theta\,\partial\theta'}$, and it follows from (4.57) that $H_0 = \mathrm{plim}\bigl(\frac{1}{n}H_{n0}\bigr) = -\mathcal{I}_0$. We showed earlier that in this case $J_0 = \mathcal{I}_0$, so that for ML there holds $H_0 = -J_0$ and

$$V_{ML} = (H_0'J_0^{-1}H_0)^{-1} = (\mathcal{I}_0\mathcal{I}_0^{-1}\mathcal{I}_0)^{-1} = \mathcal{I}_0^{-1}.$$

This is in line with (4.35) in Section 4.3.3. So ML estimators are efficient if the information matrix $\mathcal{I}_0$ is large. This is also intuitively evident, as for $\theta \neq \theta_0$ the log-likelihood values drop quickly if the curvature is large, that is, if the log-likelihood has a large curvature around $\theta_0$.

Iterative choice of weights

In practice $\theta_0$ is unknown, so that we cannot estimate $\theta$ with the criterion (4.63) with $W = J_0^{-1}$. A possible iterative method is to start, for instance, with $W = I$ (the $m \times m$ identity matrix) and to minimize (4.63). The resulting estimate $\hat\theta$ is then used to compute $\hat J_0 = \frac{1}{n}\sum_{i=1}^n g_i(\hat\theta)g_i'(\hat\theta)$ as an estimate of $J_0$. Then (4.63) is minimized with $W = \hat J_0^{-1}$, and this process is repeated until the estimates converge.
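Continuing the exponential-moments sketch from above, the second step re-minimizes (4.63) with the data-based weighting matrix; the names y, objective, and theta1 are carried over from that snippet.

```python
# Sketch: second GMM step with W = J_hat^{-1}, where J_hat = (1/n) sum g_i g_i'.
import numpy as np
from scipy.optimize import minimize_scalar

def g_i(theta):                                   # n x 2 matrix of moment terms
    return np.column_stack([y - theta, y**2 - 2.0 * theta**2])

gi = g_i(theta1)
J_hat = gi.T @ gi / len(y)                        # estimate of J0 at the first step
W = np.linalg.inv(J_hat)
theta2 = minimize_scalar(lambda t: objective(t, W),
                         bounds=(0.01, 10.0), method="bounded").x
print(theta2)                                     # efficient two-step GMM estimate
```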

GMM standard errors

Consistent estimates of the standard errors of the GMM estimators $\hat\theta$ are obtained as the square roots of the diagonal elements of the estimated covariance matrix of $\hat\theta$. According to (4.65), $\hat\theta$ has covariance matrix approximately equal to $\frac{1}{n}V$, where $H_0$ and $J_0$ in (4.66) are approximated by $\frac{1}{n}H_n$ and $\frac{1}{n}J_n$ evaluated at $\theta = \hat\theta$, so that

$$\widehat{\mathrm{var}}(\hat\theta) = \frac{1}{n}V = \frac{1}{n}(H_0'J_0^{-1}H_0)^{-1} \approx \frac{1}{n}\left(\frac{1}{n}H_n'\Bigl(\frac{1}{n}J_n\Bigr)^{-1}\frac{1}{n}H_n\right)^{-1} = (H_n'J_n^{-1}H_n)^{-1}, \tag{4.67}$$

$$J_n = \sum_{i=1}^{n} g_i(\hat\theta)g_i'(\hat\theta), \qquad H_n = \sum_{i=1}^{n}\frac{\partial g_i(\hat\theta)}{\partial\theta'}. \tag{4.68}$$

The covariance matrix in (4.67) is called the sandwich estimator of the covariance matrix of the GMM estimator $\hat\theta$.

Test of moment conditions: the J-test

In the over-identified case, one can test the over-identifying restrictions by means of the result that, under the null hypothesis that the moment conditions (4.61) hold true,

$$G_n'J_n^{-1}G_n \approx \chi^2(m - p). \tag{4.69}$$

This is called the J-test. Here $m$ is the number of moment conditions and $p$ is the number of parameters in $\theta$. The result of the $\chi^2$-distribution is based on (4.64). Note that (4.63) can be seen as a non-linear least squares problem with $m$ 'observations' and $p$ parameters, which explains that the number of degrees of freedom is $m - p$. In the exactly identified case ($m = p$) the moment conditions cannot be tested, as $G_n(\hat\theta)$ will be identically zero irrespective of the question whether the imposed moment conditions are correct or not.

Summary of computations in GMM estimation and testing

Summarizing the results on GMM estimation and testing obtained in this and the foregoing section, this approach consists of the following steps.

GMM estimation and testing
Step 1: Specify a sufficient number of moment conditions. Identify the $p$ parameters of interest $\theta$ and specify $m$ ($\geq p$) moment conditions (4.61). The crucial assumption is that the DGP satisfies these moment conditions.
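For the exponential-moments example above, the sandwich standard error (4.67)-(4.68) and the J-test (4.69) can be computed as follows; $H_n$ is derived analytically for these two moments, and y, g_i, and theta2 are carried over from the earlier snippets.

```python
# Sketch: sandwich variance (H' J^{-1} H)^{-1} and J-test G' J^{-1} G ~ chi2(m - p).
import numpy as np
from scipy.stats import chi2

n = len(y)
gi = g_i(theta2)
Jn = gi.T @ gi                                   # sum of g_i g_i'
Hn = np.array([[-n], [-4.0 * theta2 * n]])       # d/dtheta of sum g_i, a 2 x 1 matrix
V = np.linalg.inv(Hn.T @ np.linalg.inv(Jn) @ Hn)
se = np.sqrt(V[0, 0])                            # GMM standard error of theta2
Gn = gi.sum(axis=0)
J_stat = Gn @ np.linalg.inv(Jn) @ Gn             # J-test, here with m - p = 1
print(se, J_stat, chi2.sf(J_stat, 1))
```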

Step 2: Estimate the parameters. Estimate $\theta$ by GMM by solving the equations (4.62) (if $m = p$) or by minimizing (4.63) (if $m > p$). The weighting matrix $W$ can be chosen iteratively, starting with $W = I$ and (if $\hat\theta_h$ is the estimate obtained in the $h$th iteration) choosing in the $(h+1)$st iteration $W = \hat J_h^{-1}$, where $\hat J_h = \frac{1}{n}\sum_{i=1}^n g_i(\hat\theta_h)g_i'(\hat\theta_h)$.
Step 3: Compute the GMM standard errors. The (asymptotically) correct covariance matrix of the GMM estimator $\hat\theta$ can be computed by means of (4.67) and (4.68). The GMM standard errors are the square roots of the diagonal elements of this matrix.
Step 4: Test of moment conditions (in over-identified models). The correctness of the moment conditions can be tested in the over-identified case ($m > p$) by the J-test in (4.69).

Exercises: T: 4.7; S: 4.11c, f; E: 4.12c, g.

4.4.4 Quasi-maximum likelihood

Moment conditions derived from a postulated likelihood

Considering the four steps of GMM at the end of the last section, the question remains how to find the required moment conditions in step 1. In some cases these conditions can be based on models of economic behaviour, for instance expected utility maximization. Another possibility is the so-called quasi-maximum likelihood (QML) method. This method derives the moment conditions from a postulated likelihood function, where it is assumed that the corresponding moment conditions $E[g_i(\theta)] = E[\partial l_i/\partial\theta] = 0$ hold true for the DGP, but that the likelihood function itself is possibly misspecified.

Comparison of ML and QML

So in QML the likelihood function is used only to obtain the first order conditions (4.60), and the standard errors are computed from (4.67). As was discussed in Section 4.3.3, if the likelihood function is correct, then $H_0 = -J_0$, but this no longer holds true if the model is misspecified. The reason is that the equality (4.36) holds true only at $\theta = \theta_0$, that is, for correctly specified models. This means that under misspecification the expression (4.35) for the covariance matrix does not apply. On the other hand, the results in (4.65) and (4.66) always hold true as long as the moment conditions (4.61) are valid. QML is consistent if the conditions $E[\partial l_i/\partial\theta] = 0$ hold true for the DGP.

In practice, when one is uncertain about the correct specification of the likelihood function, it may be helpful to calculate the standard errors in both ways, that is, with ML and with QML. If the outcomes are widely different, this is a sign of misspecification.

Summary of QML method

In quasi-maximum likelihood, the parameter estimates and their standard errors are computed in the following way.

Quasi-maximum likelihood
Step 1: Specify a probability distribution for the observed data. Postulate a probability distribution $p(y_i, x_i, \theta)$ for the $i$th observation. Here it is assumed that the $n$ observations $(y_i, x_i)$ are mutually independent for $i = 1, \cdots, n$.
Step 2: Derive the corresponding moment conditions. Let $l_i(\theta) = \log(p(y_i, x_i, \theta))$ be the contribution of the $i$th observation to the log-likelihood $\log(L(\theta)) = \sum_{i=1}^n l_i(\theta)$, and define the $p$ moment conditions $E[g_i(\theta)] = 0$, $i = 1, \cdots, n$, where the moments are defined by $g_i(\theta) = \partial l_i/\partial\theta$. The crucial assumption is that the DGP satisfies these moment conditions.
Step 3: Estimate the parameters. Estimate $\theta$ by solving the equations (4.62) (as $m = p$, there is no need for a weighting matrix). This is equivalent to ML estimation based on the chosen probability distribution in step 1.
Step 4: Compute the GMM standard errors. Approximate standard errors of the QML estimates can be obtained from the asymptotic covariance matrix in (4.67) and (4.68).

Exercises: E: 4.17h.
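A minimal sketch of the 'calculate both ways' advice, under a deliberately misspecified likelihood. The DGP and names here are illustrative assumptions: a Poisson log-likelihood is postulated for counts that are in fact overdispersed, so the QML estimate remains consistent (the score moment condition still holds) but the conventional ML standard error understates the uncertainty, while the sandwich does not.

```python
# Sketch: Poisson QML for overdispersed (negative binomial) counts.
# Scores are proportional to g_i = y_i - lam; the proportionality factor
# cancels in the sandwich (H' J^{-1} H)^{-1}.
import numpy as np

rng = np.random.default_rng(3)
y = rng.negative_binomial(n=2, p=0.25, size=500)   # mean 6, variance 24
n = len(y)
lam = y.mean()                                     # Poisson QML estimate of the mean
se_ml = np.sqrt(lam / n)                           # information-based (ML) s.e.
gi = y - lam                                       # moment terms at the estimate
Hn, Jn = -float(n), np.sum(gi**2)                  # sums of dg_i/dlam and of g_i^2
se_qml = np.sqrt(Jn / Hn**2)                       # sandwich s.e. from (4.67)-(4.68)
print(se_ml, se_qml)                               # widely different: misspecification
```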

4.4.5 GMM in simple regression

The two moment conditions

We illustrate GMM by considering the simple regression model; the results will be used in the example in the next section. Suppose that we wish to estimate the parameters $\alpha$ and $\beta$ in the model

$$y_i = \alpha + \beta x_i + \varepsilon_i, \qquad i = 1, \cdots, n.$$

We suppose that the functional form is correctly specified in the sense that the DGP has parameters $(\alpha_0, \beta_0)$ with the property that

$$E[\varepsilon_i] = E[y_i - \alpha_0 - \beta_0 x_i] = 0, \qquad i = 1, \cdots, n.$$

Further we assume that the explanatory variable $x_i$ satisfies the orthogonality condition

$$E[x_i\varepsilon_i] = E[x_i(y_i - \alpha_0 - \beta_0 x_i)] = 0, \qquad i = 1, \cdots, n.$$

This provides two moment conditions, so that the model is exactly identified. The above two moment conditions correspond to Assumptions 1 (exogeneity), 2 (zero mean), and 5 and 6 (linear model with constant parameters). We now suppose that Assumption 4 (no correlation) is also satisfied, but that Assumption 3 (homoskedasticity) is doubtful.

The GMM estimators

The GMM estimates of $\alpha$ and $\beta$ are obtained by replacing the expectation $E$ by the sample mean $\frac{1}{n}\sum_{i=1}^n$, so that

$$\frac{1}{n}\sum_{i=1}^{n}\bigl(y_i - \hat\alpha - \hat\beta x_i\bigr) = 0, \qquad \frac{1}{n}\sum_{i=1}^{n}x_i\bigl(y_i - \hat\alpha - \hat\beta x_i\bigr) = 0.$$

These equations are equivalent to the two normal equations (2.9) and (2.10) in Section 2.1.2 (p. 82). So the GMM estimates of $\alpha$ and $\beta$ are the OLS estimates $a$ and $b$.

GMM standard errors (allowing for heteroskedasticity)

The variance of the estimators $a$ and $b$ was derived in Section 2.2.4 (p. 96) under Assumptions 1-6 (see (2.27) and (2.28)). If Assumption 3 is violated, then the formulas (2.27) and (2.28) for the variances of $a$ and $b$ do not apply. In our case

$$g_i(\alpha, \beta) = \begin{pmatrix} y_i - \alpha - \beta x_i \\ x_i(y_i - \alpha - \beta x_i) \end{pmatrix} = \begin{pmatrix} \varepsilon_i \\ x_i\varepsilon_i \end{pmatrix},$$

so that

$$H_n = \sum_{i=1}^{n}\left(\frac{\partial g_i}{\partial\alpha}, \frac{\partial g_i}{\partial\beta}\right) = -\sum_{i=1}^{n}\begin{pmatrix} 1 & x_i \\ x_i & x_i^2 \end{pmatrix}, \qquad J_n = \sum_{i=1}^{n}e_i^2\begin{pmatrix} 1 & x_i \\ x_i & x_i^2 \end{pmatrix}.$$

A consistent estimator of the $2 \times 2$ covariance matrix is obtained from (4.67), that is, $\widehat{\mathrm{var}}(\hat\theta) = (H_n'J_n^{-1}H_n)^{-1}$.
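A minimal sketch of these heteroskedasticity-robust (sandwich) standard errors on simulated data; the DGP below is an illustrative assumption.

```python
# Sketch: conventional versus GMM (sandwich) standard errors in simple regression.
import numpy as np

rng = np.random.default_rng(4)
n = 240
x = rng.normal(size=n)
e = (1.0 + 0.9 * np.abs(x)) * rng.normal(size=n)   # heteroskedastic disturbances
y = 0.2 + 1.0 * x + e

X = np.column_stack([np.ones(n), x])
b = np.linalg.solve(X.T @ X, X.T @ y)              # OLS = exactly identified GMM
res = y - X @ b
s2 = res @ res / (n - 2)
se_ols = np.sqrt(np.diag(s2 * np.linalg.inv(X.T @ X)))   # conventional formula

Hn = -X.T @ X                                      # sum of dg_i/dtheta
Jn = (X * (res**2)[:, None]).T @ X                 # sum of e_i^2 x_i x_i'
V = np.linalg.inv(Hn.T @ np.linalg.inv(Jn) @ Hn)   # sandwich estimator (4.67)
se_gmm = np.sqrt(np.diag(V))
print(se_ols, se_gmm)
```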

If the residuals are all of nearly equal magnitude, so that $e_i^2 \approx s^2$, then we obtain $H_n = -X'X$ and $J_n \approx s^2(X'X)$, where $X$ is the $n \times 2$ regressor matrix, and (4.67) then gives the formula $\hat V \approx s^2(X'X)^{-1}$. However, if the residuals differ much in magnitude, then $J_n$ may differ considerably from $s^2(X'X)$, and the (correct) GMM expression in (4.67) may differ much from the (incorrect) expression $s^2(X'X)^{-1}$ for the covariance matrix.

Exercises: E: 4.17g.

4.4.6 Illustration: Stock Market Returns

We consider once again the excess returns data for the sector of cyclical consumer goods ($y_i$) and for the whole asset market ($x_i$) in the UK (see also Examples 4.5 (p. 243-6) and 4.7 (p. 251); data file XM404SMR). We will discuss (i) the data and the model assumptions, (ii) two estimation methods, (iii) correctness of the implied moment conditions, (iv) the estimation results, and (v) tests of two hypotheses.

(i) Data and model assumptions

The data set consists of $n = 240$ monthly data over the period 1980.01-1999.12. The CAPM is given by

$$y_i = \alpha + \beta x_i + \varepsilon_i, \qquad i = 1, \cdots, n.$$

We make the following assumptions on the DGP. The disturbances have mean zero (Assumption 2). However, we do not assume homoskedasticity (Assumption 3), as the variance of the disturbances may be varying over time (see Example 4.7). That is, we assume that the disturbances $\varepsilon_i$ are independently distributed with unknown distributions $p_i(\varepsilon_i)$ with mean $E[\varepsilon_i] = 0$ and possibly different unknown variances $E[\varepsilon_i^2] = \sigma_i^2$. So the disturbances are independent (Assumption 4), and the DGP is described by the above simple regression model for certain (unknown) parameters $(\alpha_0, \beta_0)$ (Assumptions 5 and 6). The terms $x_i$ and $\varepsilon_i$ are independent, so that in particular $E[x_i\varepsilon_i] = E[x_i]E[\varepsilon_i] = 0$ (compare with Assumption 1). We also do not assume normality (Assumption 7), as the results in Example 4.4 (p. 223-4) indicate that the distribution may have fat tails. Further we assume that the density functions $p_i(\varepsilon_i)$ are symmetric around zero in the sense that $p_i(\varepsilon_i) = p_i(-\varepsilon_i)$, that is, $P[\varepsilon_i \geq c] = P[\varepsilon_i \leq -c]$ for every value of $c$.

(ii) Two estimation methods: OLS and QML with (scaled) t(5)-disturbances

As the distribution of the disturbances is unknown, we cannot estimate the parameters $\alpha$ and $\beta$ by maximum likelihood. We consider two estimators: least squares (OLS), as in Chapter 3, and quasi-maximum likelihood (QML) based on the (scaled) t(5)-distribution introduced in Example 4.5 (p. 244). We compute the standard errors by GMM and compare the outcomes with those obtained by the conventional expressions for OLS and ML standard errors.

(iii) Correctness of moment conditions under the stated assumptions

Under the above assumptions, the moment conditions are given by (4.59), that is, $E[\partial l_i/\partial\theta]\big|_{\theta=\theta_0} = 0$, where we use only the moments for $\alpha$ and $\beta$. For OLS the validity of these moment conditions follows from Assumptions 1 and 2. For QML, the moment conditions are

$$g_i^{QML}(\alpha_0, \beta_0) = \begin{pmatrix} \partial l_i/\partial\alpha \\ \partial l_i/\partial\beta \end{pmatrix} = \begin{pmatrix} \dfrac{6\varepsilon_i}{5\sigma^2 + \varepsilon_i^2} \\[2ex] \dfrac{6x_i\varepsilon_i}{5\sigma^2 + \varepsilon_i^2} \end{pmatrix}$$

(in QML we use the estimated scale value $\hat s = 4.49$ obtained in Example 4.5 (p. 244)). It follows from Assumptions 1 and 2, together with the symmetry of the densities $p_i(\varepsilon_i)$, that $E[g_i^{QML}(\alpha_0, \beta_0)] = 0$. Therefore, under the stated assumptions the moment conditions are valid for both estimation procedures, so that the OLS and QML estimators are consistent and (asymptotic) GMM standard errors can be obtained from (4.67) and (4.68), provided that the specified moment conditions hold true for the DGP.

(iv) Estimation results

The results in Exhibit 4.21 show the estimates for OLS (Panels 1 and 2) and QML (Panels 3 and 4), with standard errors computed both in the conventional way (as discussed in Section 4.4.3, that is, by means of $V_{OLS}$ in Panel 1 and $V_{ML}$ in Panel 3) and by means of GMM as in (4.67) (in Panels 2 and 4). For OLS, the matrices $H_n$ and $J_n$ in (4.67) and (4.68) were derived in Section 4.4.5. For QML, the matrices $H_n$ and $J_n$ can be derived from the above expression for $g_i$. The differences between the OLS and QML estimates are not so large, and the same applies for the standard errors of $\hat\alpha$ and $\hat\beta$ (computed in four different ways). Therefore, the effects of possible heteroskedasticity and non-normality of the disturbances seem to be relatively mild for these data. The application of OLS with conventional formulas for the standard errors seems to be reasonable for these data.

(v) Test outcomes

We finally consider tests for the hypothesis that $\alpha = 0$ against the alternative that $\alpha \neq 0$, and also for $\beta = 1$ against the alternative that $\beta \neq 1$.
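A sketch of the QML sandwich computation, reusing the simulated data and the estimates a_ml, b_ml, and s_ml from the BFGS snippet in Example 4.5; here $H_n$ is approximated by a forward numerical difference to keep the sketch short.

```python
# Sketch: GMM (sandwich) standard errors for t(5)-based QML estimates of the CAPM.
import numpy as np

s2 = s_ml**2
def gi(a, b):                                     # scores for alpha and beta
    e = y - a - b * x
    w = 6.0 / (5.0 * s2 + e**2)
    return np.column_stack([w * e, w * e * x])

g = gi(a_ml, b_ml)
Jn = g.T @ g                                      # sum of g_i g_i'
h = 1e-5                                          # forward difference for H_n
Hn = np.column_stack([(gi(a_ml + h, b_ml) - g).sum(axis=0) / h,
                      (gi(a_ml, b_ml + h) - g).sum(axis=0) / h])
V = np.linalg.inv(Hn.T @ np.linalg.inv(Jn) @ Hn)  # sandwich estimator (4.67)
print(np.sqrt(np.diag(V)))                        # GMM standard errors of (a, b)
```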

Based on the (asymptotic) normal distribution, the P-values of the test outcomes in Exhibit 4.21 are as follows:

for $\alpha = 0$: $P_{OLS} = 0.22$, $P_{OLS}^{GMM} = 0.19$, $P_{ML} = 0.32$, $P_{ML}^{GMM} = 0.30$;
for $\beta = 1$: $P_{OLS} = 0.012$, $P_{OLS}^{GMM} = 0.023$, $P_{ML} = 0.008$, $P_{ML}^{GMM} = 0.003$.

[Exhibit 4.21 Stock Market Returns (Section 4.4.6): Results of different estimates of the CAPM for the sector of cyclical consumer goods, estimated by OLS (Panel 1) and by ML (Panel 3, using the scaled t(5) distribution for the disturbances). Panel 1: dependent variable RENDCYCO regressed by least squares on a constant and RENDMARK (sample 1980:01-1999:12, 240 included observations), with standard errors computed as usual, using the expression $s^2(X'X)^{-1}$ of Chapter 3. Panel 2: the same model estimated by the generalized method of moments, with GMM standard errors using the normal equations of OLS as moment conditions. Panel 3: the model RENDCYCO = C(1) + C(2)*RENDMARK + EPS, where EPS are IID with scaled t(5) distribution and scale parameter C(3), estimated by maximum likelihood (BHHH, convergence achieved after 19 iterations), with standard errors computed as usual, using the information matrix as discussed in Section 4.3.3. Panel 4: GMM standard errors of $a_{ML}$ (C(1) in Panel 3) and $b_{ML}$ (C(2) in Panel 3), using the first order conditions for the maximum of the log-likelihood as moment conditions.]

The four computed P-values for each of these two tests all point in the same direction. The outcomes suggest that we should reject the hypothesis that $\beta = 1$ but not that $\alpha = 0$. The conclusions based on ML are somewhat sharper than those based on OLS.

Summary, further reading, and keywords

SUMMARY

In this chapter we considered methods that can be applied if some of the assumptions of the regression model in Chapter 3 are not satisfied. If the regressors are stochastic or the disturbances are not normally distributed, then the results of Chapter 3 are still valid asymptotically if the regressors are exogenous. If the model is non-linear in the parameters, then the least squares estimator has to be computed by numerical optimization methods, and this estimator has similar asymptotic properties as the least squares estimator in the linear model. Maximum likelihood is a widely applicable estimation method that has (asymptotically) optimal properties, that is, it is consistent and it has minimal variance among all consistent estimators. This method requires that the joint probability distribution of the disturbances is correctly specified. If there is much uncertainty about this distribution, then the generalized method of moments can be applied. In this case the parameters are estimated by solving a set of moment equations, and the standard errors are computed in a way that does not require the joint probability distribution. This method requires that the specified moment conditions are valid for the data generating process.

FURTHER READING

The textbooks mentioned in Chapter 3, Further Reading (p. 178-9), all contain sections on asymptotic analysis, non-linear methods, maximum likelihood, and the generalized method of moments. We further refer in particular to Davidson and MacKinnon (1993), Gourieroux and Monfort (1995), and Hayashi (2000).

Davidson, R., and MacKinnon, J. G. (1993). Estimation and Inference in Econometrics. New York: Oxford University Press.
Gourieroux, C., and Monfort, A. (1995). Statistics and Econometric Models. 2 vols. Cambridge: Cambridge University Press.
Hayashi, F. (2000). Econometrics. Princeton: Princeton University Press.

KEYWORDS

asymptotic analysis 188, asymptotic approximation 197, asymptotic distribution of b 196, asymptotic normal distribution 207, asymptotic properties 193, asymptotically efficient 228, asymptotically normal 207, 228, auxiliary regression 216, concentration 231, consistent 194, 228, exactly identified 253, exogenous 194, Gauss-Newton 211, generalized method of moments 251, GMM standard errors 258, identified parameter 206, information matrix 228, J-test 258, Lagrange method 213, Lagrange Multiplier test 215, 235, likelihood function 225, Likelihood Ratio test 230, log-likelihood 225, measurement errors 191, moment conditions 253, Newton-Raphson 210, non-linear model 205, non-linear optimization 226, orthogonality condition 194, outer product of gradients 226, over-identified 253, over-identifying restrictions 258, quasi-maximum likelihood 259, random regressors 193, sandwich estimator 258, score test 235, stability 193, stochastic regressors 191, variable addition test 240, Wald test 232

Exercises

THEORY QUESTIONS

4.1 (Section 4.1.5) Consider the data generating process $y_i = x_i + \varepsilon_i$, where the $\varepsilon_i$ are independently normally distributed N(0, 1) random variables. For simplicity we estimate the parameter $\beta = 1$ by regression in the model without constant term, that is, in the model $y_i = \beta x_i + \varepsilon_i$. By the speed of convergence of $b$ to $\beta$ we mean the power $n^p$ for which the distribution of $n^p(b - \beta)$ does not diverge and also does not have limit zero if $n \to \infty$. Section 4.1 presented results with speed of convergence $\sqrt{n}$.
a. Let $x_i = i$, $i = 1, \cdots, n$. Show that this DGP does not satisfy Assumption 1*. (It may be helpful to use the fact that $\sum_{i=1}^{n} i^2 = \frac{1}{6}n(n+1)(2n+1)$.)
b. Show that the speed of convergence is $n\sqrt{n}$ in this case.
c. Now let $x_i = 1/i$. Show that this DGP also does not satisfy Assumption 1*. (It may be helpful to use the fact that $\sum_{i=1}^{\infty}(1/i^2) = \frac{1}{6}\pi^2$.)
d. Show that plim(b) does not exist in this case.

4.2 (Section 4.1.3) The consistency of $b$ depends on the probability limits of the two terms $\frac{1}{n}X'X$ and $\frac{1}{n}X'\varepsilon$.
a. Give examples of models where $\mathrm{plim}\bigl(\frac{1}{n}X'X\bigr)$ is zero, finite, or infinite.
b. Investigate this consistency for nine cases, according to whether these limits are zero, finite, or infinite.
c. Give an intuitive explanation why $b$ is (in)consistent in these cases.

4.3 (Section 4.1.3) Consider the linear model $y = X\beta + \varepsilon$ with stochastic regressors that satisfy Assumption 1* and with $\mathrm{plim}\bigl(\frac{1}{n}X'\varepsilon\bigr) = (0, \cdots, 0, r)'$, so that only the last regressor is asymptotically correlated with the error term.
a. Show that, in general, $b$ is inconsistent with respect to all coefficients of the vector $\beta$.
b. Under which condition does only the estimator of the last coefficient become inconsistent? Provide an intuitive explanation of this result.

4.4 (Section 4.1.3) Consider the model with measurement errors, where two economic variables $y^*$ and $x^*$ are related by $y^* = \alpha + \beta x^*$ and where the measured variables are given by $y = y^* + \varepsilon_y$ and $x = x^* + \varepsilon_x$. The variances of the measurement errors $\varepsilon_y$ and $\varepsilon_x$ are denoted by $\sigma_y^2$ and $\sigma_x^2$ respectively, and the variance of $x^*$ is denoted by $\sigma_*^2$. It is assumed that $\varepsilon_y$ and $\varepsilon_x$ are uncorrelated with each other and that both are also uncorrelated with the variables $y^*$ and $x^*$. The observed data consist of $n$ independent observations $(x_i, y_i)$. For simplicity, assume that $x^*$, $\varepsilon_x$, and $\varepsilon_y$ are all IID (identically and independently distributed).
a. Write the model in the form $y = \alpha + \beta x + \varepsilon$ and express $\varepsilon$ in terms of $\varepsilon_y$ and $\varepsilon_x$.
b. Show that the OLS estimator $b$ is inconsistent if $\sigma_x^2 \neq 0$ and $\beta \neq 0$.
c. Express the magnitude of the inconsistency (that is, $\mathrm{plim}(b) - \beta$) in terms of the so-called signal-to-noise ratio $\mathrm{var}(x^*)/\mathrm{var}(\varepsilon_x) = \sigma_*^2/\sigma_x^2$.
d. Explain this result by means of two scatter diagrams, one with small and the other with large signal-to-noise ratio.

4.5 (Section 4.3.8) In Section 4.3 it was discussed that the LM-, LR-, and W-tests are asymptotically distributed as $\chi^2(g)$, but that the $gF(g, n-k)$ distribution can also be used.
a. Show that for $n \to \infty$ there holds $gF(g, n-k) \xrightarrow{d} \chi^2(g)$.
b. Check that the P-values (corresponding to the right tail of the distributions) of $gF(g, n-k)$ are larger than those of the $\chi^2(g)$ distribution, by using a statistical package or by inspecting tables of critical values of both distributions.

4. c.49).4. Also argue whether or not s2 R will be an unbiased and/or consistent estimator of s2 . Á Á Á .2 that ML in a non-linear regression model with normally distributed disturbances is equivalent to non-linear least squares.46). xn ) independent from (!1 . with M ¼ I À X(X0 X)À1 X0 . 4.56) for testing a linear hypothesis in the linear model y ¼ Xb þ e. Determine the 3 Â 1 vector of gradients g ¼ @ f =@ b of this model. with additional regressor xi. Compute the estimates bR 2 and sR for the sample sizes n ¼ 10. Á Á Á .3) a.21) can be written. Prove the expression (4.9 (E Section 4. EMPIRICAL AND SIMULATION QUESTIONS 4.84). b) ¼ b1 þ b2 xb3 .61) — that is. where the !i are IID(0.23) for the relation between the LM-test and the F-test in the linear model y ¼ X1 b1 þ X2 b2 þ e for the null hypothesis b2 ¼ 0. With the ﬁnal estimate in c.3.4) minimizes the asymptotic covariance matrix V in (4. LR. Determine b2 and the standard error of b2 for these two samples. Suppose that the regressors x2 and x3 are positively correlated. Perform also an LM-test of this hypothesis. It may be helpful to prove as a ﬁrst step that the numerator of R2 in (4.1. LR-.65) of the GMM estimator. s2 ) disturbances that are uncorrelated with the regressors x2 and x3 .3) À1 Prove that the choice of weights W ¼ J0 (with the notation of Section 4. Plot the three resulting series of twenty estimates of b1 . b. 1)0 . which express the three tests LM. The estimated model is yi ¼ b1 þ b2 xi þ b3 x2 i þ ei — that is. c. n ¼ 100. and the estimator of s in this model is denoted by s2 . 1) and !i $ NID(0.7Ã (E Section 4.01).6 (E Sections 4. b. where A is an m Â m non-singular matrix. Generate a sample of size 100 from the model pﬃﬃﬃﬃ yi ¼ 2 þ xi þ ei . where the xi are independent and uniformly distributed on the interval [0.23). Perform twenty steps of the Gauss–Newton method to estimate b. b. Let b2 denote the least squares estimate of b2 in this model. . 0.8 (E Section 4. and (4. as e0R (I À M)eR ¼ e0R eR À e0R MeR and that e0R MeR ¼ e 0 e. 4. 1. Generate samples of sizes n ¼ 10.2. (4.2. 2 d. For this purpose. Now take as starting values b ¼ (0.4.Exercises 269 c. The estimator of b2 in this restricted 2 model is denoted by bR 2 . Comment on the relevance of this result for applying the LM-. Explain the problems that arise in this case.7) Consider the DGP yi ¼ 1 þ x2 with i þ !i xi $ NID(0. Also prove that this choice makes the GMM estimator invariant with respect to linear transformations of the model restrictions (4. if gi is replaced by Agi.4. b. An investigator investigates the relation between y and x2 by regressing y on a constant and x2 — that is. b) þ e with f (x. Construct a data generating process that satisﬁes the above speciﬁcations. 20] and the ei are independent and distributed as N(0.3. 4.2. 0)0 . 4.10 (E Sections 4. 4. Use a 5% signiﬁcance level (the 5% critical value of the w2 (1) distribution is 3. and W in terms of the F-test. b2 . d. x3 is omitted. 1.3. 4. Perform an LM-test for the null hypothesis that b2 ¼ 0 against the alternative that b2 6¼ 0 for the two samples of a. Investigate whether bR 2 is an unbiased and/or consistent estimator of b2 . Consider the non-linear regression model y ¼ f (x.3) Suppose that data are generated by the process yi ¼ b1 þ b2 x2i þ b3 x3i þ !i . one of size n ¼ 10 and another of size n ¼ 100.3. make use of the expressions (4. Show the statement at the end of Section 4.2. n ¼ 100. and n ¼ 1000 of this process. !n ). perform an F-test of the hypothesis that b3 ¼ 1=2. 
Prove the inequalities in (4. R a. c. and b3 . and W -tests. e. with starting values b ¼ (0. 1) and with (x1 . a. Compare these outcomes with the results in a and b. Generate two samples of this model.8) a. and n ¼ 1000.

This hypothesis is.68). Suppose now that the researcher uses GMM. d. What approximation does this provide for the standard error of b2 ? How does this compare with the results in d? e.12 (E Sections 4. Explain why GMM provides no help here to get a clear idea of the slope parameter b. d. 4. A researcher who does not know the DGP is interested in testing the hypothesis that the observations come from a population with mean zero.7) for the parameter vector b ¼ (b1 . How many of the 1000 computed LM-values are larger than 3. Estimate the parameters a and b now by ML using the (incorrect) t(5) distribution with dens1 ity f (ei ) / . b.4.3.2. b2 .84 for n ¼ 10? And for n ¼ 100? Comment on the outcomes. both for n ¼ 10 and for n ¼ 100.3) In this simulation exercise we generate a random sample by means of yi ¼ m þ ei . 4. the t(1) distribution) is f (ei ) ¼ p 11 . f (ei ) / 2 2 ð1þ1 3ei Þ g. ð þe2 iÞ e. Make histograms of the resulting 1000 values of b2 and of the LM-test. b. f. . Make a scatter plot of the data and make a histogram of the OLS residuals obtained in a. The estimated model is yi ¼ a þ bxi þ ei . 50. not correct. Discuss which method (of the ones used in b–e) the researcher would best use if he or she does not know the DGP and is uncertain about the correct disturbance distribution. Estimate the parameters a and b by means of OLS. n. a. based on the moment condition E[yi À m] ¼ 0 for i ¼ 1. What is the estimated mean? What is the corresponding GMM standard error? Give a formal proof. Á Á Á .270 4 Non-Linear Methods c. f.3) In this exercise we consider a simulated data set of sample size n ¼ 50. Now the researcher postulates the Cauchy distribution (that is. compute the corresponding ML estimate of and perform the Wald test on the hypothesis that m ¼ 0.3.4. as the DGP has mean 1 2. where m ¼ 1 2 and the disturbances ei are independently and identically distributed with the t(3) distribution. 4. the t(1) distribution) for the disturbances. What is the computed standard error of this ML estimator of m? Why is this not the true standard error of this estimator? where the regressors xi are IID with uniform distribution on the interval 0 x 2 and the Zi are IID with tð3Þ distribution. based on (4. Also explain why the (incorrect) ML estimates in d and e perform quite well in this case. so that the correct parameter values of the DGP are a ¼ 0:5 and b ¼ 1.11 (E Sections 4. Make a histogram of the simulated data set. a. Now suppose that the researcher is so lucky to postulate the t(3) distribution for the disturbances. give comments on all the outcomes.5. State your overall conclusion for estimating models for data that are scattered in a way as depicted in b. Á Á Á . Perform this test. The density of the Cauchy distribution (that is. In answering the following questions. Simulate a set of n ¼ 50 data from this DGP. of the fact that in the current model the GMM standard error is equal to the conventional OLS ﬃ standard qﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃ error multiplied by the factor 1 À 1 n. drawing new values for xi and !i in each simulation run.67) and (4. The data were generated by the model yi ¼ 0:5 þ xi þ Zi . d. What is the computed standard error of the sample mean? What is the true standard deviation of the sample mean? What is your conclusion? c. using the (incorrect) Cauchy distribution for the disturbances ei . Determine the GMM standard errors of the OLS estimates of a and b. Estimate the parameters a and b by ML. 
4.12 (Sections 4.3.3 and 4.4.3) In this exercise we consider a simulated data set of sample size $n = 50$ (data file XR412SIM). The data were generated by the model $y_i = 0.5 + x_i + \eta_i$, $i = 1, \cdots, n$, where the regressors $x_i$ are IID with uniform distribution on the interval $0 \leq x \leq 2$ and the $\eta_i$ are IID with t(3) distribution, so that the correct parameter values of the DGP are $\alpha = 0.5$ and $\beta = 1$. The estimated model is $y_i = \alpha + \beta x_i + \varepsilon_i$. In answering the following questions, give comments on all the outcomes.
a. Estimate the parameters $\alpha$ and $\beta$ by means of OLS.
b. Make a scatter plot of the data and make a histogram of the OLS residuals obtained in a.
c. Determine the GMM standard errors of the OLS estimates of $\alpha$ and $\beta$.
d. Estimate the parameters $\alpha$ and $\beta$ by ML, using the (incorrect) Cauchy distribution (that is, the t(1) distribution) for the disturbances $\varepsilon_i$. Show that the ML estimates for $\alpha$ and $\beta$ are obtained from the two conditions $\sum\varepsilon_i(1+\varepsilon_i^2)^{-1} = 0$ and $\sum\varepsilon_ix_i(1+\varepsilon_i^2)^{-1} = 0$.
e. Estimate the parameters $\alpha$ and $\beta$ now by ML using the (incorrect) t(5) distribution with density $f(\varepsilon_i) \propto \bigl(1 + \frac{1}{5}\varepsilon_i^2\bigr)^{-3}$.
f. Finally estimate the parameters $\alpha$ and $\beta$ by ML using the correct t(3) distribution with density $f(\varepsilon_i) \propto \bigl(1 + \frac{1}{3}\varepsilon_i^2\bigr)^{-2}$.
g. Compare the outcomes of the methods in a and d-f. Also explain why the (incorrect) ML estimates in d and e perform quite well in this case. State your overall conclusion for estimating models for data that are scattered in a way as depicted in b.

4.13 (Section 4.3.8) Consider the $n = 12$ data for coffee sales of brand 2 in Section 4.2.3 (p. 204-5) (data file XR413COF). Let $y = \log(q)$ denote the logarithm of quantity sold and $x = \log(d)$ the logarithm of the deal rate. Two econometricians (A and B) estimate different models for these data, namely A: $y = \alpha + \beta x + \varepsilon$ and B: $y = \gamma(1 + \delta x) + \varepsilon$. The least squares estimates of $\alpha$ and $\beta$ are denoted by $a$ and $b$, and the (non-linear) least squares estimates of $\gamma$ and $\delta$ by $c$ and $d$. In the tests below use a significance level of 5%.
a. Give a mathematical proof of the fact that $c = a$ and $d = b/a$.
b. Perform the two regressions and check that the outcomes satisfy the relations in a.
c. Test the hypothesis that $\delta = 1$ by a Wald test.
d. Test this hypothesis also by a Likelihood Ratio test.
e. Test this hypothesis also by a Lagrange Multiplier test.

4.14 (Section 4.3.8) In this exercise we consider the bank wage data and the model discussed before in Section 3.4 (data file XR414BWA). Here the logarithm of yearly wage ($y$) is explained in terms of education ($x_2$), logarithm of begin salary ($x_3$), gender ($x_4$), and minority ($x_5$), by the model $y = \beta_1 + \beta_2x_2 + \beta_3x_3 + \beta_4x_4 + \beta_5x_5 + \varepsilon$. The data set consists of observations for $n = 474$ individuals. It is assumed that the error terms $\varepsilon$ are NID$(0, \sigma^2)$. Apart from the unrestricted model we consider three restricted models, that is, (i) $\beta_5 = 0$, (ii) $\beta_4 = \beta_5 = 0$, and (iii) $\beta_4 + \beta_5 = 0$. In the tests below use a significance level of 5%.
a. For each of the four models, compute the SSR and the ML estimate $\sigma^2_{ML}$ of the disturbance variance.
b. Compute the log-likelihoods (4.30) of the four models.
c. Perform LR-tests for the three restricted models against the unrestricted model.
d. Perform also LM-tests for the three restricted models against the unrestricted model, by means of auxiliary regressions (4.55).
e. Perform Wald tests for the three restricted models against the unrestricted model. Compare the outcomes with the ones obtained in Section 3.4.

4.15 (Section 4.3.8) Use the same bank wage data set as in Exercise 4.14 (data file XR414BWA). In the notation of Exercise 4.14, we now assume that we accept the hypothesis that $\beta_5 = 0$ and that we wish to test the hypothesis that $\beta_4 = 0$, given that $\beta_5 = 0$. That is, we test the restricted model (ii) against the alternative 'unrestricted' model (i).
a. Perform conventional t- and F-tests for this hypothesis.
b. Compute also the LR-, W-, and LM-tests for this hypothesis.
c. Use these outcomes to discuss the difference between joint testing of multiple restrictions (as in Exercise 4.14, with the joint model restriction (ii) tested against the full model with all five regressors) and sequential testing of single hypotheses (as in the current exercise). In particular, consider the differences if one uses a significance level of 2.5% in all tests.

4.16 (Section 4.3.8) In this exercise we consider the food expenditure data on food consumption (fc, measured in \$10,000 per year), total consumption (tc, also measured in \$10,000 per year), and average household size (hs) that were discussed in Example 4.3 (data file XR416FEX). As dependent variable we take $y = fc/tc$, the fraction of total consumption spent on food, and as explanatory variables we take (apart from a constant term) $x_2 = tc$ and $x_3 = hs$. The estimated model is of the form $y_i = f(x_{2i}, x_{3i}, \beta) + \varepsilon_i$, where $f(x_{2i}, x_{3i}, \beta) = \beta_1 + \beta_2 x_{2i}^{\beta_3} + \beta_4 x_{3i}$ and $\varepsilon_i \sim$ NID$(0, \sigma^2)$. We will consider three hypotheses for the parameter $\beta_3$, namely $\beta_3 = 0$ (so that $x_2$ has no effect on $y$), $\beta_3 = 1$ (so that the marginal effect of $x_2$ on $y$ is constant), and $\beta_3 = \frac{1}{2}$ (so that the marginal effect of $x_2$ on $y$ declines for higher values of $x_2$).
a. Exhibit 4.6 shows a scatter diagram of $y$ against $x_2$. Discuss whether you can get any intuition from this diagram concerning the question which of the hypotheses $\beta_3 = 0$, $\beta_3 = 1$, and $\beta_3 = \frac{1}{2}$ could be plausible.
b. Estimate the unrestricted model with four regression parameters. Determine also the (asymptotic) standard errors of these estimates.
c. Test the three hypotheses ($\beta_3 = 0$, $\beta_3 = 1$, and $\beta_3 = \frac{1}{2}$) by means of t-tests. For all tests below, compute the relevant (asymptotic) P-values.
d. Test these three hypotheses also by means of F-tests.
e. Now test the three hypotheses by means of LR-tests.
f. Test the three hypotheses by means of LM-tests, with appropriate auxiliary regressions. The regressors in step 2 of the LM-test consist of the four partial derivatives $\partial f/\partial\beta_j$ for $j = 1, 2, 3, 4$.
g. Test the three hypotheses by means of the Wald test as expressed in (4.48), formulating the parameter restriction respectively as $r(\theta) = \beta_3 = 0$, $r(\theta) = \beta_3 - 1 = 0$, and $r(\theta) = \beta_3 - \frac{1}{2} = 0$. For $\beta_3 = 0$ the parameters $(\beta_1, \beta_2)$ of the model are not identified; prove this. What reformulation of the restricted model (for $\beta_3 = 0$) is needed to get identified parameters in this case?
Give a mathematical proof of the fact that c ¼ a and d ¼ b=a. B: y ¼ g(1 þ dx) þ e: d. a. Now assume that we XR414BWA accept the hypothesis that b5 ¼ 0 and that we wish to test the hypothesis that b4 ¼ 0.3. c.14.3. (i) b5 ¼ 0. Compare the outcomes with the ones obtained in Section 3. consider the differences if one uses a signiﬁcance level of 2. Perform the two regressions and check that the outcomes satisfy the relations in a.2. For the tests below. 204–5). by the model 4.4.13 (E Section 4. W -. .

48). Compute GMM standard errors of the estimates in c — that is. r(y) ¼ b3 À 1 ¼ 0. r(y) ¼ b3 À 1 ¼ 0. f.4. d. Determine also the (asymptotic) standard errors of these estimates. Estimate a and b by ML.4. . Compare the outcomes (standard errors and test results) with the outcomes in b. Compute also the standard errors of these estimates. where yi are the excess returns in this sector and xi are the market excess returns. For b3 ¼ 0 the parameters (b1 . 4. b. s2 ) orÀwith the Á CauÀ1 chy distribution with density f (ei ) ¼ p(1 þ e2 . based on the normal distribution. 3. What reformulation of the restricted model (for b3 ¼ 0) is needed to get identiﬁed parameters in this case? c.48).272 4 Non-Linear Methods b. Determine the two histograms of the residuals corresponding to the estimates in b and c. Compare the outcomes of the foregoing six testing methods (in d–i) for the three hypotheses on b3 . Prove this. but now with the parameter restriction formulated as respect2 ively r(y) ¼ b2 3 ¼ 0. and b3 ¼ 1 2 ) by means of t-tests. Estimate a and b by ML. Estimate the unrestricted model with four regression parameters. g. and r(y) ¼ 2 1 b3 À 4 ¼ 0: j. Comment on the similarities and differences of the test outcomes.3. Again use a 5% signiﬁcance level. b3 ¼ 1. e. which of the two estimation methods do you prefer? Motivate your answer. Test the three hypotheses by means of LM-tests. Show that the ML estimates for a and b P are obtained from the À1 two conditions ei (1 þ e2 ¼0 and i) P 2 À1 ei xi (1 þ ei ) ¼ 0. h. giving n ¼ 240 observations. Finally. Compare the results with those obtained by OLS.17 (E Sections 4. The disturb- ances ei are assumed to be IID distributed. b4 ) of the model are not identiﬁed. e. Test the hypothesis that a ¼ 0 using the results in b. Try out different starting values and pay attention to the convergence of the estimates.4. f.21) with appropriate auxiliary regressions. Answer the questions in d also for the hypothesis that b ¼ 1. d. i) a. d. either with normal distribution N(0. consider the QML estimates based on the two Cauchy moment conditions deﬁned by ei i xi E[ 1þ ] ¼ 0 and E[ 1eþ ] ¼ 0. 2. Use a 5% signiﬁcance level. On the basis of this information. The model is yi ¼ a þ bxi þ ei . c. Determine the log-likelihood for the case of Cauchy disturbances. Test these three hypotheses also by means of F-tests. The regressors in step 2 of the LM@f test consist of the four partial derivatives @ bj for j ¼ 1. Test the three hypotheses again by means of the Wald test as expressed in (4. the estimates based on the two moment conditions E[ei ] ¼ 0 and E[ei xi ] ¼ 0. The monthly data are given for 1980–99. Determine the e2 e2 i i GMM standard errors of these estimates and perform the two tests of d and e. Formulate the parameter restriction respectively as r(y) ¼ b3 ¼ 0. and r(y) ¼ b3 À 1 2 ¼ 0: i. Test this hypothesis also using the results in c.5) In this exercise we consider the stock market returns data for the sector of XR417SMR non-cyclical consumer goods in the UK.3. Test the three hypotheses by means of the Wald test as expressed in (4. based on the Cauchy distribution. 4. 4. and e based on the Cauchy ML standard errors. 4. b2 . using the result (4. Test the three hypotheses (b3 ¼ 0. How does this compare with the (ordinary) standard errors computed in c? Does this alter your answers in d and e to test respectively whether a ¼ 0 and b ¼ 1? hÃ . g. Now test the three hypotheses by means of LR-tests.

5 Diagnostic Tests and Model Adjustments

In this chapter we describe methods to test the assumptions of the regression model. If some of the assumptions are not satisfied then there are several ways to proceed. One option is to use least squares and to derive the properties of this estimator under more general conditions. Another option is to adjust the specification of the model, for instance by changing the included variables, the functional form, or the probability distribution of the disturbance terms. We discuss alternative model specifications, including non-linear models, disturbances that are heteroskedastic or serially correlated, and the use of instrumental variables. Most of the sections of this chapter can be read independently of each other. We refer to Exhibit 0.3 (p. 8) for the sections of this chapter that are needed for selected topics in Chapters 6 and 7.

5.1 Introduction

Modelling in practice

It is the skill of econometricians to use economic theory and statistical data in order to construct econometric models that provide an adequate summary of the available information. In most situations the relevant theoretical information is of a qualitative nature, suggesting which economic variables play a role and perhaps whether variables are positively or negatively related. Most models from economic theory describe a part of the economy in isolation from its environment (the ceteris paribus assumption). As economic theory does not often suggest explicit models, this leaves some freedom to choose the model specification. This means that the empirical modeller is faced with the following two questions. How should the relationships between the variables of interest be specified, and how should the other influences be taken into account?

In practice it often occurs that an initially chosen econometric model does not fit well to the data. This may happen despite genuine efforts to use economic theory and to collect data that are relevant for the investigation at hand. The model may turn out to be weak, because important aspects of the data are left unexplained or because some of the basic assumptions underlying the econometric model are violated. Examples of the latter are that the residuals may be far from normal or that the parameter estimates may differ substantially in subsamples. Several diagnostic tests have been developed that help to get clear ideas about which features of the model need improvement.

If the model is not correctly specified, there are various avenues to take, depending on the degree of belief one has in the employed model structure and in the observed data. In this book we describe econometric modelling from an applied point of view where we start from the data. We consider models as constructs that we can change in the light of the data information. The selection and adjustment of models are guided by our insight in the relevant economic and business phenomena. By incorporating more of the relevant data characteristics in the model, we may improve our understanding of the underlying economic processes. This view on econometric modelling differs from a more traditional one that has more confidence in the theory and the postulated model and less in the observed data. In this view econometrics is concerned with the measurement of theoretical relations as suggested by economic theory. In our approach, we are not primarily interested in testing a particular theory but in using data to get a better understanding of an observed phenomenon of interest.

Diagnostic tests

In econometrics we use empirical data to improve our understanding of economic processes. The regression model discussed in Chapter 3 is one of the standard tools of analysis. This is a nice tool, as it is simple to apply and it gives reliable information if the assumptions of Chapter 3 are satisfied. Several tests are available to test whether the proposed model is correctly specified. Such tests of the underlying model assumptions are called misspecification tests. Because the purpose of the analysis is to make a diagnosis of the quality of the model, this is also called diagnostic testing. Like a medical doctor, the econometrician tries to detect possible weaknesses of the model, to diagnose possible causes, and to propose treatments (model adjustments) to end up with a 'healthy' model. The major role of tests is then to find out whether the chosen model is able to represent the main characteristics of interest of the data. The test outcomes can help to make better choices for the model and the estimation method (and sometimes for the data).

The regression model y = Xβ + ε was analysed in Chapter 3 under the seven assumptions stated in Section 3.1.4 (p. 125–6). All these assumptions will be subjected to diagnostic tests in this chapter. In Section 5.2 we test the specification of the functional form, that is, the number of included explanatory variables in X and the way they enter into the model (Assumptions 2 and 6). Section 5.3 considers the possibility of non-constant parameters β (Assumptions 5 and 2). Next we examine the assumptions on the disturbance terms ε and we discuss alternative estimation methods in the case of heteroskedasticity (Assumption 3, in Section 5.4), serial correlation (Assumption 4, in Section 5.5), and non-normal distributions (Assumption 7, in Section 5.6). Finally, in Section 5.7 we consider models with endogenous regressors in X, in which case the orthogonality condition of Section 4.1.3 (p. 194) is violated.

The empirical cycle in model construction

In practice, econometric models are formed in a sequence of steps. First one selects the relevant data, specifies an initial model, and chooses an estimation method. The resulting estimated model is subjected to diagnostic tests. If needed, the specification of the model is then adjusted. The new model is again subjected to diagnostic tests, and this process is repeated until the final model is satisfactory.

This process of iterative model specification and testing is called the empirical cycle (see Exhibit 5.1).

Exhibit 5.1 The empirical cycle in econometric modelling. (Flow diagram: an economic model and statistical data lead to an econometric model; the model is estimated by a numerical method and subjected to diagnostic tests; if the outcome is not OK the model is adjusted, and if it is OK the model is used.)

This sequential method of model construction has implications for the interpretation of test outcomes. Tests are usually performed under the assumption that the model has been correctly specified. For instance, the computed standard errors of estimated coefficients and their P-values depend on this assumption. However, in initial rounds of the empirical cycle we may work with first-guess models that are not appropriately specified. This may lead, for instance, to underestimation of the standard errors. Also in this situation diagnostic tests remain helpful tools to find suitable models. In practice, one should not report P-values without providing the details of the search process that has led to the finally chosen model.

At this point we mention one diagnostic tool that is of particular importance, namely the evaluation of the predictive quality of proposed models. It is advisable to exclude a part of the observed data in the process of model construction. The excluded data are called the hold-out sample. It is then possible to investigate whether the final model that is obtained in the empirical cycle is able to predict the outcomes in this hold-out sample. This provides a clear test of model quality, irrespective of the way the model has been obtained. Forecast evaluation as a diagnostic tool will be further discussed in Sections 5.2.1 (p. 280) and 7.4 (p. 570).

5.2 Functional form and explanatory variables

5.2.1 The number of explanatory variables

How many variables should be included?

Assume that a set of explanatory variables has been selected as possible determinants of the variable y. In practice, the list of possibly influential variables may be very long. The question then is how many variables to include in the model. If all these variables are included, it may be impossible to estimate the model (if the number of parameters becomes larger than the number of observations) or the estimates may become very inefficient (owing to a lack of degrees of freedom if there are insufficient observations available). On the other hand, even if one is interested in the effect of only one of these explanatory variables, say x2, it is of importance not to exclude the other variables a priori. The reason is that variation in the other variables may cause variations in the variable y, and, if these variables are excluded from the model, then all the variations in y will be attributed to the variable x2 alone.

Suppose that we want to estimate the effects of a set of (k − g) variables X1 on the dependent variable y, and that in addition another set of g variables X2 is available that possibly also influence the dependent variable y. The effects of X1 on y can be estimated in the model y = X1β1 + ε, with estimator bR = (X1′X1)⁻¹X1′y. An alternative is to include the variables X2 and to perform a regression in the model y = X1β1 + X2β2 + ε, with corresponding estimators (b1, b2) of (β1, β2). Which estimator of β1 should be preferred, b1 or bR?

Trade-off between bias and efficiency

The answer to the above question is easy if β2 = 0. In Section 3.4 (p. 144) we showed that the inclusion of irrelevant variables leads to a loss in efficiency. More precisely, under Assumptions 1–6 and with β2 = 0 there holds E[b1] = E[bR] = β1, so that both estimators are unbiased, and var(b1) ≥ var(bR) (in the sense that var(b1) − var(bR) is positive semidefinite).

On the other hand, if β2 ≠ 0, then the situation is more complicated. In Section 3.4 (p. 142–3) we showed that by deleting variables we obtain an estimator that is in general biased (so that E[bR] ≠ β1), but that it has a smaller variance than the unbiased estimator b1 (so that var(bR) ≤ var(b1)). The question then becomes whether the gain in efficiency is large enough to justify the bias that results from deleting X2. If many observations are available, then it is better to start with a model that includes all variables that are economically meaningful, as deleting variables gives only a small gain in efficiency. The fact that restrictions improve the efficiency is one of the main motivations for modelling, but of course the restrictions should not introduce too much bias.

A prediction criterion and relation with the F-test

A possible criterion to find a trade-off between bias and variance is the mean squared error (MSE) of an estimator b̂ of β, defined by

MSE(b̂) = E[(b̂ − β)(b̂ − β)′] = var(b̂) + (E[b̂] − β)(E[b̂] − β)′,

where the last equality follows by using the definition of the variance var(b̂) = E[(b̂ − E[b̂])(b̂ − E[b̂])′]. If β contains more than one component then the MSE is a matrix. A scalar criterion could be obtained by taking the trace of the MSE matrix. However, this addition of squared errors (b̂j − βj)² does not make much sense in general, as the magnitude of the individual parameters βj depends on the scales of measurement of the individual explanatory variables xj. Instead we consider the accuracy of the prediction ŷ = Xb̂ of the vector of mean values E[y] = Xβ, as the prediction error Xb̂ − Xβ does not depend on the scales of measurement. The total mean squared prediction error (TMSP) is defined as the sum of the squared prediction errors (ŷi − E[yi])², that is,

TMSP(b̂) = E[(Xb̂ − Xβ)′(Xb̂ − Xβ)].

We can apply this criterion to compare the predictions ŷ = X1b1 + X2b2 of the larger model with the predictions ŷR = X1bR of the smaller model. It is left as an exercise (see Exercise 5.2) to show that TMSP(bR) ≤ TMSP(b1) if and only if β2′V2⁻¹β2 ≤ g, where V2 is the covariance matrix of b2 and g is the number of components of β2. So the restricted estimator bR has a smaller TMSP than the unbiased estimator b1 if β2 is sufficiently small and/or the variance V2 of the estimator b2 is sufficiently large. In such a situation it is also intuitively evident that it is better to reduce the uncertainty by eliminating the variables X2 from the model.
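To make the TMSP comparison concrete, the following small Monte Carlo sketch (in Python; the design with one relevant regressor in X1 and one weakly relevant regressor in X2, and all names, are our own illustrative choices, not taken from the text) estimates TMSP(b1) and TMSP(bR) by simulation:

import numpy as np

# Monte Carlo sketch: TMSP of the unrestricted vs the restricted estimator
# when beta2 is small (deleting X2 then tends to reduce the TMSP).
rng = np.random.default_rng(0)
n, n_sim = 50, 5000
X1 = np.column_stack([np.ones(n), rng.normal(size=n)])  # constant plus one regressor
X2 = rng.normal(size=(n, 1))                            # one candidate regressor
X = np.hstack([X1, X2])
beta1, beta2 = np.array([1.0, 2.0]), 0.1                # beta2 close to zero
Ey = X1 @ beta1 + X2[:, 0] * beta2                      # E[y] = X beta
tmsp_u = tmsp_r = 0.0
for _ in range(n_sim):
    y = Ey + rng.normal(size=n)
    b1 = np.linalg.lstsq(X, y, rcond=None)[0]           # unrestricted estimator
    bR = np.linalg.lstsq(X1, y, rcond=None)[0]          # restricted estimator (X2 deleted)
    tmsp_u += np.sum((X @ b1 - Ey) ** 2)
    tmsp_r += np.sum((X1 @ bR - Ey) ** 2)
print('TMSP unrestricted:', tmsp_u / n_sim)
print('TMSP restricted:  ', tmsp_r / n_sim)

With β2 this small, the condition β2′V2⁻¹β2 ≤ g typically holds in this design, and the restricted estimator indeed attains the lower TMSP.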

In practice, β2 and V2 are of course unknown. We can replace β2 and V2 by their least squares estimates in the model y = X1β1 + X2β2 + ε. That is, β2 is replaced by b2 and V2 by its estimate s²(X2′M1X2)⁻¹, where s² is the estimated error variance in the unrestricted model and M1 = I − X1(X1′X1)⁻¹X1′ (see Section 3.4). We then prefer to delete the variables X2 from the model if

b2′X2′M1X2b2/(gs²) ≤ 1.

According to the result (3.49) in Section 3.4.1 (p. 161), this is equivalent to the F-test for the null hypothesis that β2 = 0 with a critical value of 1. This F-test can also be written as

F = ((eR′eR − e′e)/g) / (e′e/(n − k)),  (5.1)

where eR and e are the residuals of the restricted and unrestricted model, respectively. The critical value of 1 corresponds to a size of more than 5 per cent; that is, the TMSP criterion used in this way is more liberal in accepting additional regressors.

The information criteria of Akaike and Schwarz

Another method to decide whether the variables X2 should be included in the model or not is to use information criteria that express the model fit and the number of parameters in a single criterion. The Akaike information criterion (AIC) and Schwarz information criterion (SIC) (also called the Bayes information criterion or BIC) are defined as follows, where p is the number of included regressors and s_p² is the maximum likelihood estimator of the error variance in the model with p regressors:

AIC(p) = log(s_p²) + 2p/n,
SIC(p) = log(s_p²) + p·log(n)/n.

These criteria involve a penalty term for the number of parameters, to account for the fact that the model fit always increases (that is, s_p² decreases) if more explanatory variables are included. The model with the smallest value of AIC or SIC is chosen. The unrestricted model has p = k, and the restricted model obtained by deleting X2 has p = (k − g). For n ≥ 8, the SIC imposes a stronger penalty on extra variables than AIC, so that SIC is more inclined to choose the smaller model than AIC.
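The following sketch shows how the F-test (5.1) and the criteria AIC(p) and SIC(p) can be computed from the residuals of the restricted and unrestricted regressions. The data are simulated stand-ins and the helper names are ours; this is a minimal illustration, not code from the text:

import numpy as np
from scipy import stats

def ols_residuals(X, y):
    # OLS residuals e = y - Xb
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    return y - X @ b

rng = np.random.default_rng(1)
n, k, g = 100, 4, 2                                   # k regressors, the last g form X2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X[:, :2] @ np.array([1.0, 0.5]) + rng.normal(size=n)

e = ols_residuals(X, y)                               # unrestricted model
eR = ols_residuals(X[:, :k - g], y)                   # restricted model (X2 deleted)

F = ((eR @ eR - e @ e) / g) / ((e @ e) / (n - k))     # F-test (5.1)
print('F =', F, 'P-value =', stats.f.sf(F, g, n - k))

def aic(e, p, n): return np.log(e @ e / n) + 2 * p / n          # AIC(p)
def sic(e, p, n): return np.log(e @ e / n) + p * np.log(n) / n  # SIC(p)
print('AIC: restricted', aic(eR, k - g, n), 'unrestricted', aic(e, k, n))
print('SIC: restricted', sic(eR, k - g, n), 'unrestricted', sic(e, k, n))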

For large enough sample size n, the information criteria are related to the F-test (5.1). For the linear regression model, the comparison of AIC values corresponds to an F-test with critical value 2, and SIC corresponds to an F-test with critical value log(n) (see Exercise 5.2). For instance, if the F-test in (5.1) is smaller than 2, the restricted model is preferred above the unrestricted model by AIC, in the sense that AIC(k − g) < AIC(k).

Criteria based on out-of-sample predictions

Another useful method for model selection is to compare the predictive performance of the models. For this purpose the data set is split in two parts, an 'estimation sample' (used to construct the model) and a 'prediction sample' or 'hold-out sample' for predictive evaluation. So the models are estimated using only the data in the first subsample, and the estimated models are then used to predict the y-values in the prediction sample. Possible evaluation criteria are the root mean squared error (RMSE) and the mean absolute error (MAE). These are defined by

RMSE = ((1/nf) Σ_{i=1}^{nf} (yi − ŷi)²)^{1/2},  MAE = (1/nf) Σ_{i=1}^{nf} |yi − ŷi|,

where nf denotes the number of observations in the prediction sample and ŷi denotes the predicted values.
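A sketch of the corresponding computations, again with simulated data in place of a real application (the split and names are illustrative):

import numpy as np

def rmse_mae(y_true, y_pred):
    # RMSE and MAE over the nf observations of the hold-out sample
    err = y_true - y_pred
    return np.sqrt(np.mean(err ** 2)), np.mean(np.abs(err))

rng = np.random.default_rng(2)
n, nf = 80, 20                                    # estimation and hold-out sample sizes
x = rng.uniform(0, 4, n + nf)
y = 1 + 0.5 * x + rng.normal(scale=0.3, size=n + nf)
X = np.column_stack([np.ones(n + nf), x])
b = np.linalg.lstsq(X[:n], y[:n], rcond=None)[0]  # estimate on the estimation sample only
print(rmse_mae(y[n:], X[n:] @ b))                 # evaluate forecasts on the hold-out sample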

Iterative variable selection methods

In the foregoing we assumed that the (k − g) variables in X1 should all be included in the model and that the g variables in X2 should either all be included or all be deleted. But how should we choose g and the decomposition of the variables in the two groups X1 and X2? We assume that the k regressors can be ordered in decreasing importance, that is, if the jth regressor is included in the model then also the regressors 1, 2, ..., j − 1 are included. It then remains to choose the number of regressors (k − g) to be included in the model. This can be done, for instance, by choosing the model with the smallest value of TMSP, AIC, or SIC. Another method is to perform a sequence of t-tests.

In the bottom-up approach one starts with the smallest model (including only the constant term, with g = k − 1) and tests H0: β2 = 0 against H1: β2 ≠ 0 (in the model with g = k − 2). If this hypothesis is rejected, then the second regressor is included in the model and one tests H0: β3 = 0 against H1: β3 ≠ 0 (in the model with g = k − 3). Variables are added in this way until the next regressor is not significant anymore. This is also called the specific-to-general method, and it is applied much in practice, as it starts from simple models. However, the initial small models are in general misspecified, as they will exclude relevant regressors. In contrast, in the top-down approach one starts with the largest model (with g = 0) and tests H0: βk = 0 against H1: βk ≠ 0. If this hypothesis is not rejected, then one tests H0: βk = βk−1 = 0, and so on. Variables are deleted until the next regressor becomes significant. This is also called the general-to-specific method, and it has the attractive statistical property that all tests are performed in correctly specified models.

Variants of this approach can also be applied if the regressors cannot be ordered in decreasing importance. The method of forward selection starts with the smallest model (that includes only the constant term, corresponding to g = k − 1). Then the variable is added that has the (in absolute sense) largest t-value (this involves k − 1 regressions in models that contain a constant and one other regressor). This is repeated until none of the additional regressors is significant anymore. The method of backward elimination starts with the full model (with g = 0) and deletes the variable that is least significant. In the second step, the model with the remaining k − 1 regressors is estimated and again the least significant variable is deleted. This is repeated until all remaining regressors are significant.
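As a sketch of the backward elimination just described (Python, simulated data; the function name and significance threshold are illustrative, and the constant term is always kept):

import numpy as np
from scipy import stats

def backward_elimination(X, y, names, alpha=0.05):
    # Repeatedly drop the least significant regressor (never the constant,
    # column 0) until all remaining t-values are significant at level alpha.
    keep = list(range(X.shape[1]))
    while True:
        Z = X[:, keep]
        n, p = Z.shape
        b = np.linalg.lstsq(Z, y, rcond=None)[0]
        e = y - Z @ b
        s2 = e @ e / (n - p)
        se = np.sqrt(np.diag(s2 * np.linalg.inv(Z.T @ Z)))
        pvals = 2 * stats.t.sf(np.abs(b / se), n - p)
        cand = list(range(1, p))                  # candidate columns to drop
        if not cand:
            break
        worst = max(cand, key=lambda i: pvals[i])
        if pvals[worst] <= alpha:                 # all remaining regressors significant
            break
        del keep[worst]
    return [names[i] for i in keep]

rng = np.random.default_rng(8)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
y = 1 + 0.6 * X[:, 1] + rng.normal(size=n)        # only the first regressor matters
print(backward_elimination(X, y, ['const', 'x1', 'x2', 'x3']))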

Example 5.1: Bank Wages (continued) (data: XM501BWA)

As an illustration we consider again the data on wages and education of 474 employees of a US bank that were analysed in foregoing chapters. We will discuss (i) the data and possible non-linearities in the wage equation, (ii) a class of polynomial models, (iii) selection of the degree of the polynomial by means of different selection criteria, and (iv) a forecast evaluation of the models.

(i) The data and possible non-linearities in the wage equation

The dependent variable (y) is the logarithm of yearly wage, and as regressors we take the variables education (x, the number of years of education), gender (Dg = 0 for females, Dg = 1 for males), and minority (Dm = 0 for non-minorities, Dm = 1 for minorities). The relation between education and wage may be non-linear because the marginal returns of schooling may depend on the attained level of education. Exhibit 5.2 (a) shows the partial regression scatter plot of wage against education (after regressions on a constant and the variables Dg and Dm). This plot indicates the possibility of a non-linear relation between education and wage.

(ii) Polynomial models for the wage equation

One method to incorporate non-linearities is to consider polynomial models of the form

y = α + γDg + μDm + β1x + β2x² + ··· + βp x^p + ε.

The constant term and the variables Dg and Dm are included in all models, and the question is how many powers of x to include in the model. These variables are ordered in a natural way, that is, if x^p is included in the model, then x^j is also included for all j < p.

(iii) Selection of the degree of the polynomial model

For evaluation purposes we leave out the fifty observations corresponding to employees with the highest education (x ≥ 17). The remaining 424 observations (with x ≤ 16) are used to estimate models with different values of p. Exhibit 5.3 summarizes the outcomes of the models with degrees p = 1, 2, 3, 4. If we use the adjusted R² as criterion, then p = 4 is optimal. If we perform F-tests on the significance of the highest powers in the model with p = 4 ('top down'), then this would suggest taking p = 3 (for a significance level of 5 per cent). If we use the t-test on the highest included power of x ('bottom up'), then p = 3 is again preferred. The AIC and SIC criteria also prefer the model with p = 3. Exhibit 5.2 (b) and (c) show plots for p = 1 of the residuals against x and of the value of y against the fitted value ŷ. Both plots indicate some non-linearities. Exhibit 5.2 (d) and (e) show the same two plots for the model with p = 2. There are less indications of remaining non-linearities in this case.

(iv) Forecast evaluation of the models

Although the foregoing results could suggest selecting the degree of the polynomial model as p = 3 or p = 4, the models with p = 1 and p = 2 provide much better forecasts, and the model with degree p = 2 is optimal from this perspective. This is also illustrated by the graphs in Exhibit 5.2 (f–i), which show that for p = 3 and p = 4 the forecasted wages of the fifty employees with the highest education are larger than the actual wages. This means that the models with larger values of p do not reflect systematic properties of the wage–education relation for higher levels of education.

Exhibit 5.2 Bank Wages (Example 5.1). (a): partial regression scatter plot of wage (in logarithms) against education (after regressions on a constant and the variables 'gender' and 'minority'; 474 employees). (b) and (c): scatter diagrams of residuals against education (b) and of wage against fitted values (c) for the (linear) model with p = 1, using data of 424 employees with EDUC ≤ 16. (d) and (e): two similar scatter diagrams for the (quadratic) model with p = 2.

Exhibit 5.2 (Contd.) (f)–(i): scatter diagrams of actual wages against forecasted wages (both in logarithms) for the fifty employees with highest education (≥ 17 years), based on polynomial models with different values of p (p = 1, 2, 3, 4).

Exhibit 5.3 Bank Wages (Example 5.1). Model selection criteria applied to the wage data, based on polynomial models with different values of p (p = 1, 2, 3, 4): adjusted R², P-values of 'bottom up' t-tests and of 'top down' F-tests, AIC, SIC, and the RMSE and MAE of the forecasts; * indicates the optimal degree of the polynomial model for each criterion.

E Exercises: T: 5.2a–d

5.2.2 Non-linear functional forms

A general misspecification test: RESET

In the foregoing chapters the functional relation between the dependent variable and the explanatory variables was assumed to be known up to a set of parameters to be estimated. The linear model is given by

yi = xi′β + εi = β1 + Σ_{j=2}^k βj xji + εi.  (5.2)

Instead of this linear relation, it may be that the dependent variable depends in a non-linear way on the explanatory variables. To test this, we can, for example, add quadratic and cross product terms to obtain the model

yi = β1 + Σ_{j=2}^k βj xji + Σ_{j=2}^k γjj xji² + Σ_{j=2}^k Σ_{h=j+1}^k γjh xji xhi + εi.

A test for non-linearity is given by the F-test for the ½k(k − 1) restrictions that all parameters γjh are zero. This may be impractical if k is not small. A simpler test is to add a single squared term to the linear model (5.2), for example ŷi², where ŷi = xi′b with b the OLS estimator in (5.2). This gives the test equation

yi = xi′β + γŷi² + εi.  (5.3)

Under the null hypothesis of a correct linear specification in (5.2) there holds γ = 0, which can be tested by the t-test in (5.3). This is called the regression specification error test (RESET) of Ramsey. As b depends on y, the added regressor ŷi² in (5.3) is stochastic. Therefore the t-test is valid only asymptotically, under the assumptions stated in Section 4.1.4 (p. 197). To allow for higher order non-linearities we can include higher order terms in the RESET, that is,

yi = xi′β + Σ_{j=1}^p γj (ŷi)^{j+1} + εi.  (5.4)

The hypothesis that the linear model is correctly specified then corresponds to the F(p, n − k − p) test on the joint significance of the parameters γ1, ..., γp.
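A sketch of the RESET computations: fit the linear model, add powers of the fitted values, and test their joint significance with an F-test. The data are simulated with a quadratic effect, so the test should reject; the function name and design are ours:

import numpy as np
from scipy import stats

def reset_test(X, y, p=1):
    # Regress y on X, add the powers yhat^2, ..., yhat^(p+1),
    # and test their joint significance with an F(p, n-k-p) test.
    n, k = X.shape
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    yhat = X @ b
    e = y - yhat
    Z = np.column_stack([X] + [yhat ** (j + 1) for j in range(1, p + 1)])
    bz = np.linalg.lstsq(Z, y, rcond=None)[0]
    ez = y - Z @ bz
    F = ((e @ e - ez @ ez) / p) / ((ez @ ez) / (n - k - p))
    return F, stats.f.sf(F, p, n - k - p)

rng = np.random.default_rng(9)
n = 200
x = rng.uniform(0, 3, n)
X = np.column_stack([np.ones(n), x])
y = 1 + x + 0.4 * x ** 2 + rng.normal(scale=0.3, size=n)  # true relation is quadratic
print(reset_test(X, y, p=1))                               # should reject linearity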

Some meaningful non-linear specifications

The RESET is a misspecification test: it tests the null hypothesis of correct specification, but if the null hypothesis is rejected it does not tell us how to adjust the functional form. If possible, the choice of an alternative model should be inspired by economic insight. We discuss some possible models. In the linear model (5.2), the marginal effects of the explanatory variables on the dependent variable are constant, that is, ∂y/∂xj = βj for j = 2, ..., k. Alternative models can be obtained by assuming different forms for these marginal effects, and for simplicity we assume that k = 3. It may be that the marginal effect ∂y/∂x2 depends on the level of x2, say ∂y/∂x2 = β2 + γ2x2. This can be modelled by including the squared term x2² in the model, so that

yi = β1 + β2x2i + ½γ2x2i² + β3x3i + εi.

It may also be that the marginal effect ∂y/∂x2 depends on the level of another variable, say ∂y/∂x2 = β2 + γ3x3. This can be modelled by including the product term x2x3 in the model, so that

yi = β1 + β2x2i + β3x3i + γ3x2ix3i + εi.

The term x2ix3i is called an interaction term. The above two specifications provide non-linear functional forms with a clear interpretation. As these models remain linear in the unknown parameters, they can be estimated by (linear) least squares. Other methods to deal with non-linearities are to use non-parametric techniques, to transform the data, or to use varying parameters. This is discussed in Sections 5.2.3, 5.2.4, and 5.3 respectively.
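Since these specifications are linear in the parameters, estimation amounts to ordinary least squares on an augmented regressor matrix. A small sketch with simulated data (k = 3 with one interaction term; all names and parameter values are illustrative):

import numpy as np

rng = np.random.default_rng(7)
n = 300
x2, x3 = rng.normal(size=n), rng.normal(size=n)
y = 1 + 0.8 * x2 + 0.5 * x3 + 0.3 * x2 * x3 + rng.normal(scale=0.5, size=n)

# The interaction model remains linear in the parameters, so OLS applies
X = np.column_stack([np.ones(n), x2, x3, x2 * x3])
b = np.linalg.lstsq(X, y, rcond=None)[0]
print(b)                           # estimates of beta1, beta2, beta3, gamma3
# implied marginal effect of x2, evaluated at the sample mean of x3:
print(b[1] + b[3] * x3.mean())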

Example 5.2: Bank Wages (continued) (data: XM501BWA)

We consider again the wage data discussed in Example 5.1, with education (x), gender (Dg), and minority (Dm) as explanatory variables. As was discussed in Example 5.1, the wage equation may be non-linear. We will now discuss (i) tests on non-linearities, (ii) a non-linear model with non-constant marginal returns to schooling, and (iii) the results of this model.

(i) Tests on non-linearities

The linear model is given by

y = α + γDg + μDm + βx + ε.

Recall that y is the logarithm of yearly wage S, so that β = ∂y/∂x = ∂log(S)/∂x = (∂S/∂x)/S measures the relative wage increase due to an additional year of education. The above linear model assumes that this effect of education is constant for all employees. The results of two RESETs (with p = 1 and with p = 2 in (5.4)) are in Panels 2 and 3 of Exhibit 5.4; Panel 1 shows the results of the linear model. Both tests indicate that the linear model is misspecified. Note that in the model with p = 2 the two terms ŷi² and ŷi³ are individually not significant but jointly they are highly significant. This is because of multicollinearity, as the terms ŷi² and ŷi³ have a correlation coefficient of 0.999871. The reason for this high correlation is that the logarithmic salaries yi vary only between 9.66 and 11.81 in the sample (corresponding to salaries ranging from $15,750 to $135,000).

Exhibit 5.4 Bank Wages (Example 5.2). Panel 1: regression of wage (in logarithms) on education, gender, and minority (least squares, 474 observations). Panels 2 and 3: RESET, respectively with p = 1 and with p = 2; in both panels the F-test and the log-likelihood-ratio test have P-values of 0.0000. Panels 4 and 5: non-linear models with quadratic term for education and with interaction terms for education with gender and minority (Panel 4) and with the insignificant interaction term GENDER*EDUC omitted (Panel 5).

(ii) A non-linear model

As a possible alternative model we investigate whether the marginal returns of schooling depend on the level of (previous) education and on the variables gender and minority, that is,

∂y/∂x = β1 + 2β2x + β3Dg + β4Dm.

This motivates a model with quadratic term and interaction effects

y = α + γDg + μDm + β1x + β2x² + β3Dgx + β4Dmx + ε.

(iii) Results of the non-linear model

The estimated model is in Panel 4 of Exhibit 5.4. The regression coefficient β3 is not significant. The estimated model obtained after deleting the regressor Dgx is given in Panel 5 of Exhibit 5.4. The marginal returns of schooling are estimated as

∂y/∂x = b1 + 2b2x + b4Dm = −0.169 + 0.019x − 0.033Dm.

For instance, for an education level of x = 16 years an additional year of education gives an estimated wage increase of 13.8 per cent for non-minorities and of 10.5 per cent for minorities.

5.2.3 Non-parametric estimation

Non-parametric model formulation

The methods discussed in the foregoing section to deal with non-linear functional forms require that the non-linearity is explicitly modelled in terms of a limited number of variables (such as squared terms and interaction terms) and their associated parameters. Such methods are called parametric, as the non-linearity is modelled in terms of a limited number of parameters. Non-linearity can also be modelled in a more flexible way, by means of so-called non-parametric models. In this section we will discuss the main ideas by considering the situation of a scatter of points (xi, yi), i = 1, ..., n. Instead of the simple linear regression model that requires a linear dependence in the sense that y = α + βx + ε, it is assumed that y = f(x) + ε, where the function f is unknown; in particular, it may be non-linear in the explanatory variable x. It is assumed that E[ε] = 0, or, in the case where the regressor x is stochastic, that E[ε|x] = 0. This means that the (parametric) assumption of the linear regression model that E[y|x] = α + βx is replaced by the (non-parametric) assumption that E[y|x] = f(x). That is, f(x) can be interpreted as the expectation of y for a given value of x.

Local regression with nearest neighbour fit

We now describe a procedure called local regression to estimate the function f(x). This estimation method is called local because the function values f(x) are estimated locally, that is, for (a large number of) fixed values of x. We describe the estimation of f(x0) at a given point x0.

It is assumed that the function f is smooth, in particular differentiable at x0. This implies that, locally around x0, the function f can be approximated by a linear function, that is, f(x) ≈ α0 + β0(x − x0), where α0 = f(x0) and β0 is the derivative of the function f at x0. The basic idea of local regression is to use the observations (xi, yi) with xi-values that are close enough to x0 to estimate the parameters α0 and β0 in the model

yi = α0 + β0(xi − x0) + ωi.

As the linear function is only an approximation, we denote the error by a new disturbance term ω. If we consider a point x0 that is present in the observed data set, say for observation i0 so that xi0 = x0, then

E[yi0 | xi0] = f(xi0) = α0.

That is, in this case the estimate of the constant term α0 can be interpreted as an estimate of the function value f(xi0). The function f(x) can then be estimated by repeating the procedure for a grid of values of x.

The linear approximation is more accurate for values of xi that are closer to x0, and this motivates the use of larger weights for such observations. Therefore, instead of estimating α0 and β0 by ordinary least squares, the parameters are estimated by minimizing the weighted sum of squares

Σi wi (yi − α0 − β0(xi − x0))².

This is called weighted least squares. In particular, we can exclude observations with values of xi that are too far away from x0 (and that include no reliable information on f(x0) anymore) by choosing weights with wi = 0 if |xi − x0| is larger than a certain threshold value. In this case only the observations for which xi lies in some sufficiently close neighbourhood of x0 are included in the regression. This is, therefore, called a regression with nearest neighbour fit.

Choice of neighbourhood

To apply the above local regression method, we have to choose which observations are included in the regression (that is, the considered neighbourhood of x0) and the weights of the included observations. We will discuss a method for choosing neighbourhoods and weights that is much applied in practice.

The neighbourhood can be chosen by selecting the bandwidth, also called the span. This is a number 0 ≤ s ≤ 1 representing the fraction of the n observations (xi, yi) that are included in the regression to estimate α0 and β0. The selected observations are the ones that are closest to x0, that is, the sn nearest neighbours of x0, and the other (1 − s)n observations with largest values of |xi − x0| are excluded in the regression. One usually chooses the bandwidth span around s = 0.6 or s = 0.7. Smaller values may lead to estimated curves that are overly erratic, whereas larger values may lead to very smooth curves that miss relevant aspects of the function f. It is often instructive to try out some values for the bandwidth span, for instance s = 0.3, s = 0.6, and s = 0.9, and then to decide which estimated curve has the best interpretation.

Choice of weights

After the selection of the relevant neighbourhood of x0, the next step is to select the weights of the included observations. These weights decrease for observations with a larger distance between xi and x0. Let D be the maximal distance |xi − x0| that occurs for the sn included observations, and let di = |xi − x0|/D be the scaled distance of the ith observation from x0 (so that 0 ≤ di ≤ 1 for all included observations). A popular weighting function is the so-called tricube weighting function, defined by

wi = (1 − di³)³ for 0 ≤ di ≤ 1.

The largest weight is given when di = 0 (that is, when xi = x0) and the weights gradually decrease to zero as di tends to 1 (the upper bound). The graph of the tricube function is shown in Exhibit 5.5.

Exhibit 5.5 Tricube weights: the tricube weighting function w = (1 − d³)³ for 0 ≤ d ≤ 1 (weight plotted against scaled distance).

T Some extensions

The tricube function is only one out of a number of possible weighting functions that are used in practice. Since the weights need not add up to unity, the weighting functions are often called kernel functions. Note that the weights of the tricube function will in general not add up to unity. This is not important, as the choice of scaled weights (with weights wi/Σwj, where the sum runs over the sn included observations) gives the same estimates of α0 and β0, and the same holds true for other weighting functions. In most cases the choice of the bandwidth span is crucial, whereas the estimates for a given bandwidth span do not depend much on the chosen weights.

The local linear specification yi = α0 + β0(xi − x0) + ωi is recommended in most cases, but sometimes one uses regressions with only a constant term, yi = α0 + ωi, or regressions with a second degree polynomial, yi = α0 + β0(xi − x0) + γ0(xi − x0)² + ωi. The version with only the constant term was the first one that was developed and is usually called the kernel method. It has the disadvantage that it leads to biased estimates near the left and right end of the curve, whereas the local linear regression method is unbiased. Local regression is most often used to draw a smooth curve through a two-dimensional scatter plot. It is, however, also possible to use it with k regressors, but then it is less easy to get a good graphical feeling for the obtained estimates.

Summary of local linear regression

To estimate a non-linear curve y = f(x) from a scatter of points (xi, yi), i = 1, ..., n, by means of local linear regression, one takes the following steps (a code sketch of these steps is given after the two examples below).

Local regression

Step 1: Choice of grid of points. Choose a grid of points for the variable x where the function f(x) will be estimated. If the number of observations is not too large, one can take all the n observed values xi; otherwise one can estimate f(x) only for a selected subsample of these values.

Step 2: Choice of bandwidth span. Choose the fraction s (with 0 ≤ s ≤ 1) of the observations to be included in each local regression. A usual choice is s = 0.6, but it is advisable to try out some other values as well.

Step 3: Choice of weighting function. Choose the weights wi to be used in weighted least squares. A possible choice is the tricube weighting function.

(continues)

Local regression (continued)

Step 4: Perform weighted linear regressions. For each point x0 in the grid of points chosen in step 1, perform a weighted linear regression by minimizing the weighted sum of squares Σi wi(yi − α0 − β0(xi − x0))². Here the summation runs over the sn included points of step 3.

Step 5: Estimated non-linear function y = f(x). For a given value of x0, estimate the function value f(x0) by the estimated constant term of step 4. The estimated function can be visualized by means of a scatter plot of these estimates against xi for the grid of points of step 1, and a continuous curve is obtained by interpolating between the points in this scatter.

Example 5.3: Simulated Data from a Non-linear Model

To illustrate the idea of local regression we first apply the method to a set of simulated data. We simulate a set of n = 200 data from the data generating process yi = sin(xi) + εi, where the xi consist of a random sample from the uniform distribution on the interval 0 ≤ xi ≤ 2.5 and the εi are a random sample from the normal distribution with mean zero and standard deviation σ = 0.2, with xi and εj independent for all i, j = 1, ..., 200. Exhibit 5.6 shows the scatter of the generated data (in (a)) as well as four curves, namely the curve of the data generating process in (b) and three curves that are estimated by local linear regression with three choices for the bandwidth span, s = 0.3 in (c), s = 0.6 in (d), and s = 0.9 in (e). The sine curve declines at the right end of the sample interval. For s = 0.3 this decline is picked up well, but the curve shows some erratic movements that do not correspond to properties of the data generating process. For s = 0.9 the fitted curve is very smooth, but it underestimates the decline of the curve at the right end. The curve obtained for s = 0.6 provides a reasonable compromise between smoothness and sensitivity to fluctuations that are present in the functional relationship.

Example 5.4: Bank Wages (continued) (data: XM501BWA)

As a second illustration we consider the relation between education and wages in the banking sector. In Example 5.1 we found evidence for possible non-linearities in this relation. We can also investigate this by a local linear regression of wage on education (for simplicity we exclude the other explanatory variables gender and minority). Exhibit 5.7 (a) shows the scatter of the n = 474 data, together with four fitted curves in (b–e). The relation does not seem to be linear, and the returns to education seem to become larger for higher levels of education. For this data set the local linear regression with bandwidth span s = 0.9 seems to be preferable, as it gives nearly the same results as s = 0.6 but without the small irregularities that do not seem to have a clear interpretation.
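The following sketch implements steps 1 to 5 with tricube weights, for data generated as in Example 5.3. It is a minimal illustration of the procedure; the function name, grid, and seed are our own choices:

import numpy as np

def local_linear(x, y, x0, span=0.6):
    # Steps 2-3: select the s*n nearest neighbours of x0 and tricube weights
    n = len(x)
    m = max(2, int(np.ceil(span * n)))
    dist = np.abs(x - x0)
    idx = np.argsort(dist)[:m]                   # the m nearest neighbours of x0
    D = dist[idx].max()                          # maximal distance among them
    w = (1 - (dist[idx] / D) ** 3) ** 3          # tricube weights
    # Step 4: weighted regression of y on a constant and (x - x0)
    Z = np.column_stack([np.ones(m), x[idx] - x0])
    sw = np.sqrt(w)
    coef, *_ = np.linalg.lstsq(Z * sw[:, None], y[idx] * sw, rcond=None)
    return coef[0]                               # Step 5: fhat(x0) = estimated constant

rng = np.random.default_rng(3)
x = rng.uniform(0, 2.5, 200)
y = np.sin(x) + rng.normal(scale=0.2, size=200)
grid = np.linspace(0, 2.5, 50)                   # Step 1: grid of points
fhat = np.array([local_linear(x, y, x0, span=0.6) for x0 in grid])
print(fhat[:5])                                  # estimates of f on the first grid points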

Exhibit 5.6 Simulated Data from a Non-linear Model (Example 5.3). Simulated data (a), the curve of the data generating process (b), and three curves estimated by local linear regression based on nearest neighbour fit with spans 0.3 (c), 0.6 (d), and 0.9 (e).

Exhibit 5.7 Bank Wages (Example 5.4). Scatter diagram of salary (in logarithms) against education (a) and four fitted curves: linear (b) and local linear with spans 0.3 (c), 0.6 (d), and 0.9 (e).

5.2.4 Data transformations

Data should be measured on compatible scales

If diagnostic tests indicate misspecification of the model, one can consider transformations of the data to obtain a better specification. In every empirical investigation, one of the first questions to be answered concerns the most appropriate form of the data to be used in the econometric model. For linear models (5.2) the scaling of the variables is not of intrinsic importance, although for the computation of the inverse of X′X in b = (X′X)⁻¹X′y it is preferable that all explanatory variables are roughly of the same order of magnitude, as was discussed at the end of Section 3.1.3 (p. 124–5). What is more important is that the additive structure of the model implies that the variables should be incorporated in a compatible manner. For example, it makes sense to relate the price of one stock to the price of another stock, or to relate the logarithms of these variables, and also to relate the respective returns, but it makes less sense to relate the price of one stock to the returns of another stock. It also makes sense to relate the output of a firm to labour and capital, or to relate the logarithms of these variables, but it makes less sense to relate the logarithm of output to the levels of labour and capital. Of all the possible data transformations we discuss two important ones, taking logarithms and taking differences.

Use and interpretation of taking logarithms of observed data

The logarithmic transformation is useful for several reasons. Of course, it can only be applied if all variables take on only positive values, but this is the case for many economic variables. For instance, if the dependence of the dependent variable on the explanatory variable is multiplicative of the form yi = a1·xi^a2·e^εi, then

log(yi) = β1 + β2 log(xi) + εi,  (5.5)

with β1 = log(a1) and β2 = a2. This so-called log-linear specification is of interest because the coefficient β2 is the elasticity of y with respect to x, that is,

β2 = d log(yi)/d log(xi) = (dyi/dxi)·(xi/yi).

It is often more plausible that economic agents show constant reactions to relative changes in variables like prices and income than to absolute changes. Further, the logarithmic transformation may reduce skewness and heteroskedasticity. Suppose, for instance, that the model (5.5) holds, where εi is normally distributed.

Then log(yi) is normally distributed with mean μi = β1 + β2 log(xi) and variance σ², and the original variable yi is log-normally distributed with median e^μi, mean E[yi] = e^(μi + σ²/2), and variance var(yi) = (E[yi])²(e^σ² − 1) (see Exercise 5.2). This means that the distribution of yi is (positively) skewed and that the standard deviation is proportional to the level. These are very common properties of economic data, and then the logarithmic transformation of the data may reduce the skewness and heteroskedasticity.

Taking differences of observed data

Many economic time series show a trending pattern. In such cases, the statistical assumptions of Chapters 3 and 4 may fail to hold. For instance, Assumption 1* of Section 4.1.2 (p. 193) requires the regressors to be stable in the sense that plim((1/n)Σ_{i=1}^n xji²) exists and is finite for all explanatory variables xj. In the case of a linear deterministic trend, say x2i = i for i = 1, ..., n, the sequence (1/n)Σ x2i² diverges. To apply conventional tests, the variables should be transformed to get stable regressors. The trend in a variable y can often be removed by taking first differences. This operation is denoted by Δ, which is defined by

Δyi = yi − yi−1.

For instance, if x2i = i, then Δx2i = 1, which is a stable regressor. Because

Δ log(yi) = log(yi/yi−1) = log(1 + Δyi/yi−1) ≈ Δyi/yi−1

for Δyi/yi−1 sufficiently small, this transforms the original level variables yi into growth rates. The modelling of trends and the question whether variables should be differenced or not is further discussed in Chapter 7.

The Box–Cox transformation

A combination of the two foregoing transformations is also of interest. If one doubts whether the variables should be included in levels or in logarithms, one can consider the more general Box–Cox transformation given by

yi(λ) = β1 + Σ_{j=2}^k βj xji(λ) + εi,  (5.6)

where yi(λ) = (yi^λ − 1)/λ and xji(λ) is defined in a similar way. If λ = 1, this corresponds to a linear model, and the log-linear model is obtained for λ → 0, as yi(λ) → log(yi) for λ → 0.

To estimate the parameters of the model (5.6) we assume that the disturbance terms εi satisfy Assumptions 2–4 and 7. Then the logarithm of the joint density function is given by

log(p(ε1, ..., εn)) = Σ_{i=1}^n log(p(εi)) = −(n/2) log(2π) − (n/2) log(σ²) − (1/(2σ²)) Σ_{i=1}^n εi².

To obtain the likelihood function, we use that εi = yi(λ) − β1 − Σ_{j=2}^k βj xji(λ), so that dεi/dyi = yi^(λ−1). The Jacobian corresponding to the transformation of (ε1, ..., εn) to (y1, ..., yn) is therefore equal to Π_{i=1}^n yi^(λ−1) (see also (1.19) in Chapter 1 (p. 27)). The log-likelihood is equal to

l(β1, ..., βk, λ, σ²) = log(p(y1, ..., yn)) = (λ − 1) Σ_{i=1}^n log(yi) + log(p(ε1, ..., εn))
= −(n/2) log(2π) − (n/2) log(σ²) + (λ − 1) Σ_{i=1}^n log(yi) − (1/(2σ²)) Σ_{i=1}^n (yi(λ) − β1 − Σ_{j=2}^k βj xji(λ))².  (5.7)

The ML estimates of the parameters are obtained by maximizing this function. Note that this differs from non-linear least squares in (5.6), as in the minimization of Σεi² the term (λ − 1)Σ log(yi) in (5.7) would be neglected. Actually, NLS in (5.6) to estimate the parameters makes no sense. For instance, if the values of the variables satisfy yi ≥ 1 and xji ≥ 1 for all i = 1, ..., n and all j = 1, ..., k, then Σεi² → 0 by taking λ → −∞. Tests for a linear model (λ = 1) or a log-linear model (λ = 0) can be based on (5.7), for instance by using the LR-test. The elasticity of y with respect to xj in (5.6) is given by βj xji^λ / yi^λ.
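A sketch of the ML computation for the case in which only the dependent variable is transformed (as in Example 5.5 below): for each λ on a grid, β and σ² are concentrated out by OLS on the transformed data and (5.7) is evaluated. The data are simulated from a log-linear process, and all names are our own:

import numpy as np

def boxcox(v, lam):
    # Box-Cox transform; the limit for lam -> 0 is the logarithm
    return np.log(v) if abs(lam) < 1e-8 else (v ** lam - 1) / lam

def profile_loglik(lam, y, X):
    # Concentrated log-likelihood (5.7): for given lambda, beta and
    # sigma^2 are replaced by their OLS/ML estimates.
    n = len(y)
    z = boxcox(y, lam)
    b = np.linalg.lstsq(X, z, rcond=None)[0]
    e = z - X @ b
    s2 = e @ e / n
    return (-n / 2 * np.log(2 * np.pi) - n / 2 * np.log(s2) - n / 2
            + (lam - 1) * np.log(y).sum())

rng = np.random.default_rng(4)
n = 200
x = rng.uniform(8, 20, n)
X = np.column_stack([np.ones(n), x])
y = np.exp(0.5 + 0.1 * x + rng.normal(scale=0.1, size=n))  # log-linear DGP (lambda = 0)
grid = np.linspace(-1.5, 1.5, 61)
ll = np.array([profile_loglik(l, y, X) for l in grid])
lam_hat = grid[ll.argmax()]                                # grid-search ML estimate of lambda
lr_linear = 2 * (ll.max() - profile_loglik(1.0, y, X))     # LR-test of lambda = 1
print(lam_hat, lr_linear)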

Example 5.5: Bank Wages (continued) (data: XM501BWA)

We consider once again the bank wage data and investigate the best way to include the dependent variable, the salary of US bank employees, in the model. Until now we have chosen to take the logarithm of salary as the dependent variable, but there are alternatives. We will discuss (i) the choice between salaries in levels or logarithms, (ii) a test of linearity and log-linearity, and (iii) the results and interpretation of an alternative relation.

(i) Choice between levels and logarithms

Exhibit 5.8 (a) and (b) show histograms of the salary (S) (in dollars per year) and of the natural logarithm of salary (y = log(S)) of the 474 employees of the considered bank. The distribution of S is more skewed than that of y. Exhibit 5.8 (c) and (d) show scatter diagrams of S and y against education. As could be expected, the variation of salaries is considerably larger for higher levels of education than for lower levels of education. This effect is much less pronounced for the variable y. This provides statistical reasons to formulate models in terms of the variable y instead of the variable S. Regression models for y also have an attractive economic interpretation, as ∂y/∂xj = (∂S/∂xj)/S measures the relative increase in salary due to an increase in the explanatory variable xj. We are often more interested in such relative effects than in absolute effects.

Exhibit 5.8 Bank Wages (Example 5.5). Histograms of salary (a) and of log salary (b), with the following sample statistics (474 observations), and scatter diagrams of salary against education (c) and of log salary against education (d).

            SALARY      LOGSALARY
Mean        34419.57    10.35679
Median      28875.00    10.27073
Maximum     135000.0    11.81303
Minimum     15750.00    9.664596
Std. Dev.   17075.66    0.397334
Skewness    2.117877    0.998033
Kurtosis    8.308630    3.662632

(ii) Tests of linearity and log-linearity

Now we consider the following model for the relation between (scaled) salary and education, where the dependent variable is expressed in terms of S = Salary/$10,000. Here we scale the salary to make the two variables x and S of similar order of magnitude.

S(λ) = (S^λ − 1)/λ = α + γDg + μDm + βx + ε.

Exhibit 5.9 Bank Wages (Example 5.5). Values of the log-likelihood for a grid of values of λ (a) and ML estimates, both unrestricted (Panel 2) and under the restriction that λ = 1 (Panel 3) or λ = 0 (Panel 4). Panel 2 (dependent variable (S^λ − 1)/λ with S = Salary/10000) gives the estimate λ = −0.836 with log-likelihood −519.94; Panel 3 (λ = 1, dependent variable S − 1) gives log-likelihood −759.50; Panel 4 (λ = 0, dependent variable log(S)) gives log-likelihood −568.51.

So we consider the transformation only of the dependent variable and not of the regressors. The log-likelihood of this model is given by (5.7), replacing the last term in parentheses of this expression by εi = Si(λ) − α − γDgi − μDmi − βxi. The ML estimates are given in Panel 2 of Exhibit 5.9, and the ML estimate of λ is −0.836. The exhibit also shows the results for λ = 1 in Panel 3 (with dependent variable S − 1) and for λ = 0 in Panel 4 (with dependent variable log(S)). The LR-tests for linearity and log-linearity are given by

LR(λ = 1) = 2(−519.94 + 759.50) = 479.14 (P = 0.0000),
LR(λ = 0) = 2(−519.94 + 568.51) = 97.14 (P = 0.0000).

We conclude that linearity and log-linearity are both rejected.

(iii) Interpretation of an alternative relation

We now use the ML estimates of the above model in Panel 2 of Exhibit 5.9 (with λ = −0.836) to determine the relative increase in salary caused by an additional year of schooling, that is, (dS/dx)/S. It is left as an exercise (see Exercise 5.2) to show that in this model

(dS/dx)/S = β / (1 + λ(α + γDg + μDm + βx + ε)).

In the log-linear model that was considered in previous examples, λ = 0 and the marginal return to schooling is constant. Now, in our model with λ = −0.836, this return depends on the values of the explanatory variables. For instance, for an 'average' non-minority female employee (with Dg = 0, Dm = 0, and ε = 0; note that the formula below indeed excludes the gender and minority terms) the estimated increase is

(dS/dx)/S = 0.0258 / (0.732 − 0.022x).

This means that the marginal returns of schooling increase with the previously achieved level of education. For instance, at an education level of x = 10 years the predicted increase in salary is 5.0 per cent, whereas for an education level of x = 20 years this becomes 8.6 per cent. Such a non-linear effect is in line with our previous analysis in Examples 5.1 and 5.2.

E Exercises: T: 5.2e, f

5.2.5 Summary

In order to construct a model for the explanation of the dependent variable we have to make a number of decisions.

• How many explanatory variables should be included in the model? This can be investigated by means of selection criteria (such as AIC and SIC), by tests of significance (for instance, forward selection or backward elimination), and by comparing the predictive performance of competing models on a hold-out sample.

• Can the relation between explanatory variables and explained variable be expressed by a linear model or is the relationship non-linear? The method of local regression and Ramsey's RESET can be used to get an idea of possible non-linearities.

• What is the best way to incorporate the variables in the model? In many cases the model has a better economic interpretation if variables are taken in logarithms, and, if the observed data contain trends, it may be worthwhile to take first differences.

5.3 Varying parameters

5.3.1 The use of dummy variables

Relaxing the assumption of fixed parameters
In the linear model y = Xβ + ε, the 'direct' effect of a regressor xj on the dependent variable y is given by ∂y/∂xj = βj. The assumption of fixed parameters (Assumption 5) means that these effects are the same for all observations. If the effects differ over the sample, then this can be modelled in different ways. In Section 5.2.2 we discussed the addition of quadratic terms and product terms of regressors. In other cases the sampled population may consist of several groups that are affected in different ways by the regressors, so that the sample can be split into groups where the parameters are constant for all observations within a group but differ between groups. This kind of parameter variation can be modelled by means of dummy variables.

An example: Seasonal dummies
For example, suppose that the data consist of quarterly observations with a mean level that varies over the seasons. This can be represented by the time varying parameter model

yi = αi + Σ_{j=2}^k βj xji + εi,   (5.8)

where αi takes on four different values, according to the season of the ith observation. This means that αi = αi+4 for all i, as the observations i and (i + 4) fall in the same season. Now define four dummy variables Dh, h = 1, 2, 3, 4, where Dhi = 1 if the ith observation falls in season h and Dhi = 0 if it falls in another season. These variables are called 'dummies' because they are artificial variables that we define ourselves. With the help of these dummies, the model (5.8) can be expressed as

yi = α1 D1i + α2 D2i + α3 D3i + α4 D4i + Σ_{j=2}^k βj xji + εi.   (5.9)

That is, the parameter variation in (5.8) is removed by including dummy variables as additional regressors, and (5.9) is a linear regression model with constant parameters. In general, models with dummy variables can often be formulated in different ways, and we can choose the one with the most appealing interpretation. In practice we often prefer models that include a constant term. In this case we should delete one of the dummy variables in (5.9), and (5.9) can be reformulated as

yi = α1 + γ2 D2i + γ3 D3i + γ4 D4i + Σ_{j=2}^k βj xji + εi,   (5.10)

where γs = αs − α1 for s = 2, 3, 4. The first quarter is called the reference quarter in this case, and the parameters γs measure the incremental effects of the other quarters relative to the first quarter. If we delete another dummy variable from (5.9) — for instance, D4 instead of D1 — then the dummy part in (5.10) becomes α4 + δ1 D1i + δ2 D2i + δ3 D3i, where δs = αs − α4 for s = 1, 2, 3.

The parameters γs in (5.10) have a different interpretation from the parameters αs in (5.9). For instance, suppose we want to test whether the second quarter has a significant effect on the level of y. A t-test on α2 in (5.9) corresponds to the null hypothesis that E[yi] = Σ_{j=2}^k βj xji in the second quarter, whereas a t-test on γ2 in (5.10) corresponds to the null hypothesis that E[yi] = α1 + Σ_{j=2}^k βj xji in the second quarter — that is, that α1 = α2. The latter hypothesis is usually the more interesting one. The interpretation of the t-test on δ2 differs again from that of the t-test on γ2.

The use of dummies for piece-wise linear relations
Dummy variables can also be used to model varying slope parameters. For instance, suppose that the dependence of y on x2 is continuous and piece-wise linear, with slope β2 for x2 ≤ a and slope β2 + γ2 for x2 > a. This can be formulated as follows. Let D be a dummy variable with Di = 0 if x2i ≤ a and Di = 1 if x2i > a; then

yi = α + β2 x2i + γ2 (x2i − a)Di + Σ_{j=3}^k βj xji + εi.

This model has constant parameters and it is linear in the parameters, provided that the break point a is known. The null hypothesis that the marginal effect of x2 on y does not vary over the sample can be tested by a t-test on the significance of γ2.
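As an illustration of these two constructions, the following Python sketch builds seasonal dummies in the reference-quarter form (5.10) and a piece-wise linear regressor. The arrays y and x, the quarterly ordering of the data, the break point a = 10, and the use of statsmodels are all assumptions made for the example; the book's own computations use EViews.

```python
import numpy as np
import statsmodels.api as sm

# quarterly dummies for model (5.10), with quarter 1 as reference quarter;
# q holds the quarter (1-4) of each observation (assumed quarterly ordering)
q = np.tile([1, 2, 3, 4], len(y) // 4)
D = np.column_stack([(q == h).astype(float) for h in (2, 3, 4)])   # D2, D3, D4
X = sm.add_constant(np.column_stack([D, x]))
res = sm.OLS(y, X).fit()
print(res.tvalues[1])                  # t-test on gamma_2 (second quarter)

# piece-wise linear effect of x with known break point a (assumed value)
a = 10.0
Dx = (x > a).astype(float)
X2 = sm.add_constant(np.column_stack([x, (x - a) * Dx]))
print(sm.OLS(y, X2).fit().tvalues[2])  # t-test on gamma_2 in the slope
```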

Example 5.6: Fashion Sales (data file XM506FAS)
We consider US retail sales data of high-priced fashion apparel, and we investigate whether there exists a quarterly effect in the relation between sales (Si, real sales per thousand square feet of retail space) and two explanatory variables, purchasing ability (Ai, real personal disposable income) and consumer confidence (Ci, an index of consumer sentiment). The data are taken from G. M. Allenby, L. Jen, and R. P. Leone, 'Economic Trends and Being Trendy: The Influence of Consumer Confidence on Retail Fashion Sales', Journal of Business and Economic Statistics, 14/1 (1996), 103–11. We will discuss (i) the data and the model, and (ii) estimation results and tests of seasonal effects.

(i) The data and the model
We consider quarterly data from 1986 to 1992, so that n = 28. We may expect seasonal fluctuations in sales of fashion apparel — for instance, because of sales actions around the change of seasons. The general levels of sales and the effect of purchasing ability and consumer confidence on fashion sales may vary over the seasons. We define four quarterly dummies Dji, j = 1, 2, 3, 4, where Dji = 1 if the ith observation falls in quarter j and Dji = 0 if it does not. We suppose that the standard Assumptions 1–7 are satisfied for the model

log(Si) = α1 + Σ_{j=2}^4 αj Dji + Σ_{j=1}^4 βj Dji log(Ai) + Σ_{j=1}^4 γj Dji log(Ci) + εi.

The variation in the coefficients α reflects the possible differences in the average level of retail fashion sales between seasons.

(ii) Estimation results and tests on seasonal effects
Exhibit 5.10 shows the results of three estimated models. The null hypothesis that the effects of the variables Ai and Ci on sales do not depend on the season corresponds to the six parameter restrictions β1 = β2 = β3 = β4 and γ1 = γ2 = γ3 = γ4. The corresponding F-test can be computed from the results in Panels 1 and 2 of Exhibit 5.10 — that is,

F = ((0.1993 − 0.1437)/6) / (0.1437/(28 − 12)) = 1.03 (P = 0.440).

Therefore, the null hypothesis of constant parameters for β and γ is not rejected. The corresponding restricted model has six parameters, and we next test whether fashion sales depend on the season at all — that is, we test whether α2 = α3 = α4 = 0 in this model.
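Both F-tests are simple to compute from the reported sums of squared residuals. A small Python sketch, with the exhibit values plugged in by hand (only scipy is assumed to be available):

```python
from scipy.stats import f as f_dist

def restriction_F(ssr_restricted, ssr_unrestricted, g, df_unrestricted):
    """F-test for g linear restrictions, computed from the restricted and
    unrestricted sums of squared residuals."""
    F = ((ssr_restricted - ssr_unrestricted) / g) / (ssr_unrestricted / df_unrestricted)
    return F, f_dist.sf(F, g, df_unrestricted)

# seasonal slopes (Panels 1 and 2): 6 restrictions, 28 - 12 degrees of freedom
print(restriction_F(0.1993, 0.1437, 6, 28 - 12))   # approx. (1.03, 0.44)
# seasonal intercepts (Panels 2 and 3): 3 restrictions, 28 - 6 degrees of freedom
print(restriction_F(1.5292, 0.1993, 3, 28 - 6))    # approx. (48.9, 0.000)
```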

The results in Panels 2 and 3 of Exhibit 5.10 give

F = ((1.5292 − 0.1993)/3) / (0.1993/(28 − 6)) = 48.93 (P = 0.000).

This hypothesis is therefore clearly rejected.

Exhibit 5.10 Fashion Sales (Example 5.6): regressions of sales on purchasing ability and consumer confidence (all in logarithms), with seasonal variation in all parameters (Panel 1), only in the constant term (Panel 2), or in none of the parameters (Panel 3); each panel uses the 28 quarterly observations 1986:1–1992:4. Key results: sum of squared residuals 0.1437 (Panel 1, twelve parameters), 0.1993 (Panel 2, six parameters), and 1.5292 (Panel 3, three parameters).

Exhibit 5.11 shows the residuals of the model with α2 = α3 = α4 = 0. The residuals show a clear seasonal pattern, with peaks in the fourth quarter.

This can also be interpreted as a violation of Assumption 2 that the disturbance terms have a fixed mean. The seasonal variation of this mean is modelled by including the three dummy variables with parameters α2, α3, and α4 in the model.

Exhibit 5.11 Fashion Sales (Example 5.6): residuals of the model for fashion sales where none of the parameters is allowed to vary over the seasons (time is measured on the horizontal axis and the values of the residuals on the vertical axis).

Example 5.7: Coffee Sales (data file XM507COF)
As a second illustration of the use of dummy variables we return to the marketing data on coffee sales of two brands of coffee that were discussed before in Section 4.2.5 (pp. 218–21). We will discuss (i) the results for the two brands separately, (ii) a combined model for the two brands, (iii) a test of constant elasticity in the combined model, and (iv) the interpretation of the results.

(i) Results for the two brands separately
In Section 4.2.5 we analysed the relation between coffee sales and the applied deal rate, and we tested the null hypothesis of constant price elasticity for two brands of coffee. Although scatter diagrams of the data indicate a decreasing elasticity for larger deal rates, we had difficulty in rejecting the null hypothesis of constant elasticity when this is tested for the two brands separately. A possible reason is the small number of observations, n = 12, for each brand. The model for the effect of price deals (denoted by d) on coffee sales (denoted by q) in Section 4.2.5 is given by

log(qi) = β1 + (β2/β3)(di^β3 − 1) + εi.

(ii) A combined model for the two brands
We will now consider a model that combines the information of the two brands.

The price elasticity in this model is equal to β2 d^β3 (see Section 4.2.5). Now we combine the data of the two brands of coffee by means of the model

log(qi) = D1i (β1 + (β2/β3)(di^β3 − 1)) + D2i (γ1 + (γ2/γ3)(di^γ3 − 1)) + εi,   i = 1, ..., 24,

where D1i = 1 for the observations of brand one and D1i = 0 for the observations of brand two, and where D2i = 1 − D1i. This model allows for the possibility that all regression coefficients differ between the two brands of coffee. Exhibit 5.12 shows the NLS estimates of this model in Panel 1 and of the restricted model with β2 = γ2 and β3 = γ3 in Panel 3. We do not impose the condition β1 = γ1, as the levels of the sales are clearly different for the two brands (see Section 4.2.5). The Wald test for the hypothesis that (β2, β3) = (γ2, γ3) has P-value 0.249 (see Panel 2). We do not reject this hypothesis, and therefore we will consider the following combined model for the two brands of coffee:

log(qi) = D1i β1 + D2i γ1 + (β2/β3)(di^β3 − 1) + εi,   i = 1, ..., 24.

This corresponds to the assumption that the elasticities are the same for the two brands of coffee — that is, β2 d^β3 = γ2 d^γ3.

(iii) Test of constant elasticity in the combined model
We now test the hypothesis of constant elasticity in the above combined model for the two brands. The null hypothesis of constant price elasticity corresponds to the parameter restriction β3 = 0, in which case (β2/β3)(di^β3 − 1) reduces to β2 log(di), as in the Box–Cox transformation. That is, we test whether β3 = 0. The results in Panels 3–5 of Exhibit 5.12 are used to compute the values of the Wald test, the Likelihood Ratio test, and the Lagrange Multiplier test. For the Wald test (for a single parameter restriction) we use the relation (4.50) with the t-test, and Panel 3 gives

W = (n/(n − k)) t² = (24/20)(−2.520)² = 7.62 (P = 0.006).

The Likelihood Ratio test is obtained from Panels 3 and 4 and is equal to

LR = 2(l1 − l0) = 2(22.054 − 18.549) = 7.01 (P = 0.008).

The Lagrange Multiplier test is computed in a similar way as described in Chapter 4, and Panel 5 gives

LM = nR² = 24(0.253) = 6.08 (P = 0.014).
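The three statistics can be reproduced directly from the values reported in Exhibit 5.12. A minimal Python sketch, with the quoted numbers plugged in by hand (χ²(1) is the reference distribution for all three tests):

```python
from scipy.stats import chi2

n, k = 24, 4                            # observations and parameters in Panel 3
W = n / (n - k) * (-2.520)**2           # Wald test from the t-value, cf. (4.50)
LR = 2 * (22.054 - 18.549)              # from the log-likelihoods in Panels 3 and 4
LM = n * 0.253                          # nR^2 from the auxiliary regression in Panel 5
for name, stat in [("W", W), ("LR", LR), ("LM", LM)]:
    print(name, round(stat, 2), "P =", round(chi2.sf(stat, 1), 3))
```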

Exhibit 5.12 Coffee Sales (Example 5.7): regression of coffee sales on deal rate with all parameters different for the two brands (Panel 1; NLS with 24 observations, log-likelihood 24.138), Wald test on equal elasticities for the two brands (Panel 2; test of C(3) = C(5) and C(4) = C(6), P-value 0.249), and regression model with equal elasticities but different sales levels for the two brands (Panel 3; log-likelihood 22.054, t-value of the elasticity parameter C(4) equal to −2.520).

(iv) Interpretation of the results
The above test outcomes indicate that the null hypothesis of constant deal elasticity should be rejected. Our earlier results in Example 4.9 gave less clear conclusions. This illustrates the power of imposing model restrictions — here, the assumption that the functional dependence of the elasticity on the deal rate is the same for the two brands of coffee. The combined model is estimated for twenty-four observations, so that, in comparison with our analysis in Section 4.2.5, we gain twelve

observations at the cost of one parameter. This gain of eleven degrees of freedom leads to more clear-cut conclusions.

Exhibit 5.12 (continued) Coffee Sales (Example 5.7): regression model for coffee sales with constant elasticity (Panel 4; log-likelihood 18.549) and regression of the residuals of this model on the gradient of the unrestricted model where the elasticity depends on the deal rate (Panel 5; R² = 0.253).

Exercises: E: 5.26.

5.3.2 Recursive least squares

Recursive estimation to detect parameter variations
If we want to model varying parameters by means of dummy variables, we should know the nature of this variation. In some situations the choice of dummy variables is straightforward (see, for instance, Examples 5.6 and 5.7 in the foregoing section), but in other cases it may be quite difficult to specify the precise nature of the parameter variation. Now suppose that the data can be ordered in a natural way. For instance, if the data consist of time series that are observed sequentially over time, then it is natural to order them in time. If the data consist of a cross section, then the observations can be ordered according to the values of one of the explanatory variables. For such ordered data sets we can detect possible

break points by applying recursive least squares. For every value of t with k + 1 ≤ t ≤ n, a regression is performed in the model yi = xi'β + εi using only the (t − 1) observations i = 1, ..., t − 1. This gives an OLS estimator b_{t−1} and a corresponding forecast ŷt = xt'b_{t−1} with forecast error

ft = yt − xt'b_{t−1}.   (5.11)

The recursive least squares estimators are defined as the series of estimators bt, t = k + 1, ..., n. It is left as an exercise (see Exercise 5.3) to show that these estimators can be calculated recursively by

bt = b_{t−1} + At xt ft,   (5.12)
At = A_{t−1} − (1/vt) A_{t−1} xt xt' A_{t−1},   (5.13)
vt = 1 + xt' A_{t−1} xt,   (5.14)

where At = (Xt'Xt)^{−1}, with Xt the t × k regressor matrix for the observations i = 1, ..., t. The result in (5.12) shows that the magnitude of the changes bt − b_{t−1} in the recursive estimates depends on the forecast errors ft in (5.11). Under the standard Assumptions 1–7, the correction factor At is proportional to the covariance matrix of the estimator bt, so that large uncertainty leads to large changes in the estimates.

Recursive residuals
Under Assumptions 1–7 the forecast errors have mean E[ft] = 0. As yt = xt'β + εt is independent of b_{t−1} (which depends only on ε1, ..., ε_{t−1}), it follows that

var(ft) = var(yt) + var(xt'b_{t−1}) = σ²(1 + xt'A_{t−1}xt) = σ²vt.

It is left as an exercise (see Exercise 5.3) to show that the forecast errors ft are also mutually independent. This means that, if the model is valid (so that in particular the parameters are constant),

wt = ft/√vt ~ NID(0, σ²),   t = k + 1, ..., n.   (5.15)

The values of wt are called the recursive residuals. To detect possible parameter breaks it is helpful to plot the recursive estimates bt and the recursive residuals wt as functions of t. If the parameters are varying, then this is reflected in variations in the estimates bt and in relatively large and serially correlated recursive residuals wt after the break. Such breaks may suggest additional explanatory variables that account for the break, or the model can be adjusted by including non-linear terms or dummy variables, as discussed in Sections 5.2 and 5.3.1.
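A direct implementation of the recursion (5.12)–(5.15) takes only a few lines. The following Python sketch is an illustration, not the book's own code; it initializes with the first k observations (assumed to give a non-singular Xk'Xk) and then updates the estimates one observation at a time.

```python
import numpy as np

def recursive_ls(y, X):
    """Recursive least squares following (5.11)-(5.15).
    Returns the recursive estimates b_t and the recursive residuals w_t."""
    n, k = X.shape
    A = np.linalg.inv(X[:k].T @ X[:k])      # A_k = (X_k' X_k)^{-1}, assumed invertible
    b = A @ X[:k].T @ y[:k]
    B, w = [], []
    for t in range(k, n):
        x = X[t]
        f = y[t] - x @ b                    # forecast error (5.11)
        v = 1.0 + x @ A @ x                 # (5.14)
        A = A - np.outer(A @ x, A @ x) / v  # (5.13), A is symmetric
        b = b + A @ x * f                   # (5.12)
        w.append(f / np.sqrt(v))            # recursive residual (5.15)
        B.append(b.copy())
    return np.array(B), np.array(w)
```

Plotting the columns of B and the series w against t reproduces graphs of the type shown in Exhibit 5.13.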

Example 5.8: Bank Wages (continued) (data file XM501BWA)
We continue our analysis of the bank wage data discussed in previous sections. Using the notation introduced there, the model is

yi = α + γDgi + μDmi + βxi + εi,   (5.16)

where y is the logarithm of yearly salary, x is the number of completed years of education, Dg is a gender dummy, and Dm is a minority dummy. We order the n = 474 employees according to the values of x, starting with the lowest education. The education ranges from 8 to 21 years. Employees with ranking number 365 or lower have at most 15 years of education (x ≤ 15), those with ranking number 366–424 have x = 16, and those with ranking number 425–474 have x > 16.

Exhibit 5.13 shows the recursive least squares estimates of the constant term α (in (a)) and of the marginal return of schooling β (in (b)), together with a plot of the recursive residuals (c). The graphs also show 95 per cent interval estimates of the parameters and 95 per cent confidence intervals for the recursive residuals.

Exhibit 5.13 Bank Wages (Example 5.8): recursive estimates of the constant term (a) and of the slope with respect to education (b), each with ±2 standard error bands, together with a plot of the recursive residuals with ±2 standard error bounds (c).

The estimates of β show a break after observation 365, suggesting that the returns may be larger for higher levels of education. The plot of recursive residuals in (c) shows mostly positive values after observation 365. This means that for higher levels of education the wages are higher than is predicted from the estimates based on the employees with less education. All these results indicate that the effect of education on salary is non-linear. These results are in line with our analysis of non-linearities in the previous examples.

Exercises: T: 5.3.

5.3.3 Tests for varying parameters

The CUSUM test for the regression parameters
Although plots of recursive estimates and recursive residuals are helpful in analysing possible parameter variations, it is also useful to perform statistical tests on the null hypothesis of constant parameters. Such tests can be based on the recursive residuals wt defined in (5.15). Under the hypothesis of constant parameters, it follows from (5.15) that the sample mean w̄ = (1/(n − k)) Σ_{t=k+1}^n wt is normally distributed with mean zero and variance σ²/(n − k). Let σ̂² = (1/(n − k − 1)) Σ_{t=k+1}^n (wt − w̄)² be the unbiased estimator of σ² based on the recursive residuals. If the model is correctly specified, then

w̄ √(n − k) / σ̂ ~ t(n − k − 1).

A significant non-zero mean of the recursive residuals indicates possible instability of the regression parameters.

The CUSUM test is based on the cumulative sums

Wr = (1/s) Σ_{t=k+1}^r wt,   r = k + 1, ..., n,

where s² is the OLS estimate of σ² in the model y = Xβ + ε over the full data sample using all n observations. If the model is correctly specified, then the terms wt/s are independent with distribution N(0, 1), so that Wr is approximately distributed as N(0, r − k). For a significance level of (approximately) 5 per cent, an individual value Wr differs significantly from zero if |Wr| > 2√(r − k). It is also possible to test for the joint significance of the whole set of values Wr, r = k + 1, ..., n. For a significance level of (approximately) 5

per cent, it can be shown that this set of values indicates misspecification of the model if there exists a point r for which |Wr| > 0.948 [1 + 2(r − k)/(n − k)] √(n − k).

The CUSUMSQ test for the variance
Large values for one or more recursive residuals are not necessarily caused by changes in the regression parameters β. Another possibility is that the variance σ² of the error terms is changing — that is, the amount of uncertainty or randomness in the observations may vary over time. This can be investigated by considering the sequence of squared recursive residuals wt²/σ̂². If the model is correctly specified, then (5.15) shows that these values are approximately distributed as independent χ²(1) variables. The CUSUMSQ test is based on the cumulative sums of squares

Sr = Σ_{t=k+1}^r wt² / Σ_{t=k+1}^n wt²,   r = k + 1, ..., n.

Note that these values always run from Sk = 0 (for r = k) to Sn = 1 (for r = n). For large enough sample size, (1/(n − k)) Σ_{t=k+1}^n wt² ≈ σ², so that (n − k)Sr is approximately distributed as χ²(r − k), with expected value r − k and variance 2(r − k). So Sr has approximately a mean of (r − k)/(n − k) and a variance of 2(r − k)/(n − k)². This provides simple tests for the individual significance of a value of Sr (for fixed r). Tests on the joint significance of the deviations of Sr from their mean values have also been derived, where the model is said to be misspecified if there exists a point r for which |Sr − (r − k)/(n − k)| > c. The value of c depends on the significance level and on (n − k), independent of the values of the recursive residuals.

Interpretation as general misspecification tests
Apart from the effects of changing parameters or variances, large recursive residuals may also be caused by exceptional values of the disturbance terms εi in the relation yi = xi'β + εi. Such observations are called outliers; this is discussed in Section 5.6. It may also be the case that breaks occur in the explanatory variables. For instance, if one of the explanatory variables shows significant growth over the sample period, then the linear approximation yi = xi'β + εi that may be acceptable at the beginning of the sample, for small values of this variable, may cause large errors at the end of the sample. That is, the diagnostic tests CUSUM and CUSUMSQ that are introduced here as parameter stability tests are sensitive to any kind of instability of the model, not only to changes in the parameters.
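Given the recursive residuals, both test statistics and the 5 per cent CUSUM bound quoted above are easy to compute. A Python sketch under the assumption that w contains the recursive residuals for t = k + 1, ..., n (for instance from the recursive_ls sketch above) and that s is the full-sample OLS estimate of σ:

```python
import numpy as np

def cusum_cusumsq(w, k, s):
    """CUSUM and CUSUMSQ paths from recursive residuals w_t, t = k+1..n,
    with the approximate 5% CUSUM bound and the mean/variance of S_r."""
    n = k + len(w)
    r = np.arange(k + 1, n + 1)
    W = np.cumsum(w) / s                                   # CUSUM
    W_bound = 0.948 * (1 + 2 * (r - k) / (n - k)) * np.sqrt(n - k)
    S = np.cumsum(w**2) / np.sum(w**2)                     # CUSUMSQ
    S_mean = (r - k) / (n - k)                             # approximate mean of S_r
    S_sd = np.sqrt(2.0 * (r - k)) / (n - k)                # approximate st. dev. of S_r
    return W, W_bound, S, S_mean, S_sd
```

Plotting W against the bound ±W_bound, and S against S_mean with a band around it, reproduces diagnostic pictures of the type shown in Exhibit 5.15.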

The Chow break test
In some situations there may be a clear break point in the sample and we want to test whether the parameters have changed at this point. This is called the Chow break test. Let the n observations be split in two parts, the first part consisting of n1 observations and the second part of the remaining n2 = n − n1 observations. In order to test the hypothesis of constant coefficients across the two subsets of data, the model can be formulated as

y1 = X1 β1 + ε1,   y2 = X2 β2 + ε2,   (5.17)

where y1 and y2 are the n1 × 1 and n2 × 1 vectors of the dependent variable in the two subsets, and X1 and X2 are the n1 × k and n2 × k matrices of explanatory variables. This can also be written as the partitioned model

(y1; y2) = diag(X1, X2)(β1; β2) + (ε1; ε2).   (5.18)

It is assumed that the model (5.18) satisfies all the standard Assumptions 1–7 — in particular, that all the (n1 + n2) error terms are independent and have equal variance. The null hypothesis of constant coefficients is given by

H0: β1 = β2.   (5.19)

This can be tested against the alternative that β1 ≠ β2 by means of the F-test. The number of parameters under the alternative hypothesis is 2k, and the number of restrictions in (5.19) is k. Least squares in the unrestricted model (5.18) gives an error sum of squares that is equal to the sum of the error sums of squares of the two separate regressions in (5.17) (see Exercise 5.4). So the F-test is given by

F = ((S0 − S1 − S2)/k) / ((S1 + S2)/(n1 + n2 − 2k)),   (5.20)

where S0 is the error sum of squares under the null hypothesis (obtained by regression in y = Xβ + ε over the full sample of n = n1 + n2 observations) and where S1 and S2 are obtained by the two subset regressions in (5.17). Under the null hypothesis of constant parameters, the statistic follows the F(k, n1 + n2 − 2k) distribution. The regressions under the alternative hypothesis require that n1 ≥ k and n2 ≥ k — that is, in both subsets the number of observations should be at least as large as the number of parameters in the model for that subset.

The Chow forecast test
The model specification (5.17) allows for a break in the parameters, but apart from this the model structure is assumed to be the same in both subsamples. The model structure under the alternative can also be left unspecified. Then the null hypothesis is that y = Xβ + ε holds for all n1 + n2 observations, and the alternative is that this model holds only for the first n1 observations, while the last n2 observations are generated by an unknown model. This can be expressed by the model

yi = xi'β + Σ_{j=n1+1}^{n1+n2} γj Dji + εi,   (5.21)

where Dj is a dummy variable with Dji = 1 for i = j and Dji = 0 for i ≠ j. So, for every observation i > n1, the model allows for an additional effect γj that may differ from observation to observation. The coefficients γj represent all factors that are excluded under the null hypothesis — for instance, neglected variables, another functional form, or another error model. The null hypothesis of constant model structure corresponds to

H0: γj = 0 for all j = n1 + 1, ..., n1 + n2.   (5.22)

This can be tested by the F-test, which in this case is called the Chow forecast test. Using the above notation, the Chow forecast test is computed as

F = ((S0 − S1)/n2) / (S1/(n1 − k)).

This is exactly equal to the forecast test discussed in Section 3.4.3 (p. 173) (see Exercise 5.4). This test can also be used as an alternative to the Chow break test (5.20) if one of the subsets of data contains fewer than k observations.

Example 5.9: Bank Wages (continued) (data file XM501BWA)
We continue our analysis of the data on wages and education where the data are ordered with increasing values of education (see Example 5.8). We will discuss (i) Chow tests on parameter variations, and (ii) CUSUM and CUSUMSQ tests.
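Both Chow tests follow directly from three regressions: the full sample and the two subsamples. A self-contained Python sketch (an illustration under the stated formulas, with y, X, and the break point n1 assumed given):

```python
import numpy as np
from scipy.stats import f as f_dist

def ols_ssr(y, X):
    """Error sum of squares of an OLS regression."""
    e = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    return e @ e

def chow_tests(y, X, n1):
    """Chow break test (5.20) and Chow forecast test for a split at n1."""
    n, k = X.shape
    n2 = n - n1
    S0 = ols_ssr(y, X)                  # full sample
    S1 = ols_ssr(y[:n1], X[:n1])        # first subsample
    S2 = ols_ssr(y[n1:], X[n1:])        # second subsample
    F_break = ((S0 - S1 - S2) / k) / ((S1 + S2) / (n - 2 * k))
    F_fcast = ((S0 - S1) / n2) / (S1 / (n1 - k))
    return (F_break, f_dist.sf(F_break, k, n - 2 * k),
            F_fcast, f_dist.sf(F_fcast, n2, n1 - k))
```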

(i) Chow tests
To test whether an additional year of education gives the same relative increase in wages for lower and higher levels of education, we perform a Chow break test and a Chow forecast test. The n = 474 employees are split into two groups, one group with at most sixteen years of education (n1 = 424) and the other with seventeen years of education or more (n2 = 50). Exhibit 5.14 shows the results of regressions for the whole data set (in Panel 1) and for the two subsamples (in Panels 2 and 3). The Chow break test (5.20) is given by

F = ((30.852 − 23.403 − 2.941)/4) / ((23.403 + 2.941)/(424 + 50 − 8)) = 19.93 (P = 0.000).

Exhibit 5.14 Bank Wages (Example 5.9): regressions of log salary on gender, minority, and education, over the full sample (Panel 1, 474 observations, sum of squared residuals 30.852), over the subsample of employees with at most sixteen years of education (Panel 2, 424 observations, sum of squared residuals 23.403), and over the subsample of employees with seventeen years of education or more (Panel 3, 50 observations, sum of squared residuals 2.941).

The Chow forecast test (3.58) gives

F = ((30.852 − 23.403)/50) / (23.403/(424 − 4)) = 2.67 (P = 0.000).

The null hypothesis of constant returns of schooling is clearly rejected.

(ii) CUSUM and CUSUMSQ tests
Exhibit 5.15 shows plots of the CUSUM and CUSUMSQ tests. After observation i = 366 the recursive residuals are mostly positive, meaning that predicted wages are smaller than the actual wages. This shows that, at the end of the sample, the CUSUM deviates significantly from zero. This is in agreement with the recursive slope estimate in Exhibit 5.13(b), which becomes larger after observation 366. The CUSUMSQ plot shows that the squared recursive residuals in the first part of the sample are relatively small and that the sum of squares builds up faster after observation 366. This is a further sign that the returns of schooling are not constant for different levels of education.

Exhibit 5.15 Bank Wages (Example 5.9): plots of CUSUM (a) and CUSUMSQ (b), with 5% significance bounds, for the wage data ordered with increasing education. Employees with index 365 or lower have at most fifteen years of education, those with index between 366 and 424 have sixteen years of education, and those with index 425 or higher have seventeen years of education or more.

Exercises: T: 5.19, 5.24, 5.31a, b, d, f; S: 5.33b; E: 5.4.

5.3.4 Summary
An econometric model usually involves a number of parameters that are all assumed to be constant over the observation sample. It is advisable to apply tests on parameter constancy and to adjust the model if the parameters seem to vary over the sample.

- The assumption of constant parameters can be tested by applying recursive least squares, by considering plots of recursive residuals and of the CUSUM and CUSUMSQ statistics, and by means of the break and forecast tests of Chow.
- If the parameters are not constant, one has to think of meaningful adjustments of the model that do have constant parameters. This may mean that one has to adjust the specification of the model — for instance, by choosing an appropriate non-linear model or by incorporating additional relevant explanatory variables.
- Dummy variables are a helpful tool to remove parameter variation by incorporating additional parameters that account for this variation.

5.4 Heteroskedasticity

5.4.1 Introduction

General model for heteroskedastic error terms
For ordinary least squares, it is assumed that the error terms of the model have constant variance and that they are mutually uncorrelated. Under Assumptions 1–6, the standard regression model is given by

y = Xβ + ε,   E[ε] = 0,   E[εε'] = σ²I.

In this section we suppose that Assumptions 1, 2, 4, 5, and 6 are satisfied but that Assumption 3 of constant variance is violated. This means that the amount of randomness in the outcome of yi, measured by var(yi) = σi², may differ for each observation. Let the disturbances be heteroskedastic with E[εi²] = σi², i = 1, ..., n; then

E[εε'] = V = diag(σ1², σ2², ..., σn²).

So the covariance matrix is diagonal because of Assumption 4 of uncorrelated disturbances, but the elements on the diagonal may differ across observations. In this section we discuss the estimation and testing of models for data that exhibit heteroskedasticity; in the next section we discuss serial correlation.

Implications of heteroskedasticity for estimation
In least squares we minimize Σ_{i=1}^n (yi − xi'b)². If the variances differ across observations, then OLS is no longer efficient, so that we can possibly get more accurate estimates by applying different methods. It may then be better to assign relatively smaller weights to observations with large variance and larger weights to observations with small variance. This is because observations with small error terms provide more information on the value of β than observations with large error terms. We can then use a weighted least squares criterion of the form

Σ_{i=1}^n wi² (yi − xi'b)²,

with weights wi² that decrease for larger values of σi². The choice of optimal weights is one of the issues discussed below. First we give two examples.

Example 5.10: Bank Wages (continued) (data file XM501BWA)
We consider again the bank wage data of 474 bank employees. We will discuss (i) three job categories, (ii) a possible model for heteroskedasticity, and (iii) a graphical idea of the amount of variation in wages.

(i) Three job categories
The bank employees can be divided according to three job categories — namely, administrative jobs, custodial jobs, and management jobs. It may well be that the amount of variation in wages differs among these three categories. For instance, for a given level of education it may be expected that employees with custodial jobs earn more or less similar wages. However, two managers with the same level of education may have quite different salaries — for instance, because the job responsibilities differ or because the two employees have different management experience.

(ii) A possible model for heteroskedasticity
We consider the regression model

yi = β1 + β2 xi + β3 Dgi + β4 Dmi + β5 D2i + β6 D3i + εi,

where yi is the logarithm of yearly wage, xi is the number of years of education, Dg is a gender dummy (1 for males, 0 for females), and Dm is a minority dummy (1 for minorities, 0 otherwise). Administration is taken as the reference category, and D2 and D3 are dummy variables (D2 = 1 for individuals with a custodial job and D2 = 0 otherwise, and D3 = 1 for individuals with a management position and D3 = 0 otherwise). We sort the observations so that the first n1 = 363 individuals have jobs in administration, the next n2 = 27 have custodial jobs, and the last n3 = 84 have jobs in management. If we allow for different variances among the three job categories, the covariance matrix can be specified as the block-diagonal matrix

V = diag(σ1² I_{n1}, σ2² I_{n2}, σ3² I_{n3}),

where I_{ni} denotes the ni × ni identity matrix for i = 1, 2, 3.

(iii) Graphical impression of the amount of variation

Exhibit 5.16 shows for each job category both the (unconditional) variation in y in (a) and the conditional variation (that is, the variation of the OLS residuals of the above regression model) in (b). The exhibit indicates that the variations are smallest for custodial jobs.

Exhibit 5.16 Bank Wages (Example 5.10): unconditional variation in log salary (a) and conditional variation of the residuals of log salary after regression on education, gender, minority, and job category dummies (b), for the job categories administration (1), custodial jobs (2), and management (3), with respective subsample sizes n1 = 363, n2 = 27, and n3 = 84.

Example 5.11: Interest and Bond Rates (data file XM511IBR)
We now consider monthly data on the short-term interest rate (the three-month Treasury Bill rate) and on the AAA corporate bond yield in the USA. We will discuss (i) the data and the model, (ii) a graphical impression of changes in variance, and (iii) a possible model for heteroskedasticity.

(i) Data and model
The AAA bond rate is defined as an average over long-term bonds of firms with AAA rating. The data run from January 1950 to December 1999. The data on the Treasury Bill rate are taken from the Federal Reserve Board of Governors and the data on AAA bonds from Moody's Investors Service. Let xi denote the monthly change in the Treasury Bill rate and let yi be the monthly change in the AAA bond rate. As Treasury Bill notes and AAA bonds can be seen as alternative ways of investing in low-risk securities, it may be expected that the AAA bond rate is positively related to the interest rate, so that these changes will be related to each other. It may further be that this relation holds more tightly for lower than for higher levels of the rates, as for higher rates there may be more possibilities for speculative gains. We postulate the simple regression model

yi = α + βxi + εi,   i = 1, ..., n,

where n = 600.

(ii) Graphical impression of changes in variance
Exhibit 5.17(a) shows the residuals that are obtained from regression in the above model (the figure has time on the horizontal axis; the values of the residuals are measured on the vertical axis). The variance over the first half of the considered time period is considerably smaller than that over the second half. This suggests that the uncertainty of AAA bonds has increased over time. One of the possible causes is that the Treasury Bill rate has become more volatile. Exhibit 5.17 also shows two scatter diagrams of yi against xi, one (b) for the first 300 observations (1950–74) and the other (c) for the last 300 observations (1975–99).

Exhibit 5.17 Interest and Bond Rates (Example 5.11): plot of the residuals of the regression of changes in the AAA bond rate on changes in the three-month Treasury Bill rate (a), and scatter diagrams of these changes over the periods 1950–74 (b) and 1975–99 (c).

(iii) A possible model for heteroskedasticity
The magnitude of the random variations εi in the AAA bond rate changes may be related to the magnitude of the changes xi in the Treasury Bill rate. Observations in months with small changes in the Treasury Bill rate are then more informative about α and β than observations in months with large changes. For instance, if E[εi²] = σ²xi², then the covariance matrix becomes

V = σ² diag(x1², x2², ..., xn²).

Alternative models for the variance in these data will be considered in later sections (see Examples 5.16 and 5.18).

5.4.2 Properties of OLS and White standard errors

Properties of OLS for heteroskedastic disturbances
Suppose that Assumptions 1, 2, and 6 are satisfied but that the covariance matrix of the disturbances is not equal to σ²I. That is, assume that

y = Xβ + ε,   E[ε] = 0,   E[εε'] = V,

where the disturbances are uncorrelated but heteroskedastic, so that V is a diagonal matrix with elements σ1², ..., σn² on the diagonal. In this section we consider the consequences of applying ordinary least squares under the above assumptions. Although ordinary least squares will no longer have all the optimality properties discussed in Chapter 3, it is still attractive, as it is simple to compute these estimates. The OLS estimator is given by b = (X'X)^{−1}X'y, and, substituting y = Xβ + ε, it follows that b = β + (X'X)^{−1}X'ε. Under the stated assumptions this means that

E[b] = β,   var(b) = (X'X)^{−1} X'VX (X'X)^{−1}.   (5.23)

So the OLS estimator b remains unbiased. However, unless V = σ²I, the usual expression s²(X'X)^{−1} for the variance does not apply anymore. So the estimated coefficients b are 'correct' in the sense of being unbiased, but the OLS formulas for the standard errors are wrong. Therefore, if one routinely applies the usual least squares expressions for standard errors, then the outcomes misrepresent the correct standard errors.

White standard errors
In order to perform significance tests we should estimate the covariance matrix in (5.23). If the disturbances are uncorrelated but heteroskedastic, then (5.23) can be written as

var(b) = (X'X)^{−1} (Σ_{i=1}^n σi² xi xi') (X'X)^{−1}.   (5.24)

Here xi is the k × 1 vector of explanatory variables for the ith observation. In most situations the values σi² of the variances are unknown. A simple estimator of σi² is given by ei², the square of the OLS residual ei = yi − xi'b. This gives

var̂(b) = (X'X)^{−1} (Σ_{i=1}^n ei² xi xi') (X'X)^{−1}.   (5.25)

This is called the White estimate of the covariance matrix of b, and the square roots of its diagonal elements are called the White standard errors. Sometimes a correction is applied: it was derived in Chapter 3 (p. 127–8) that the residual vector e has covariance matrix σ²M in the homoskedastic case, where M = I − X(X'X)^{−1}X' = I − H, so that the residual ei then has variance σ²(1 − hi), with hi the ith diagonal element of H. For this reason one sometimes replaces ei² in (5.25) by ei²/(1 − hi). Note that, even in the homoskedastic case and with the above correction, the estimator ei²/(1 − hi) of the variance σi² is unbiased but not consistent. This is because only a single observation (the ith) has information about the value of σi², so that by increasing the sample size we gain no additional information on σi². However, we will now show that the estimator (5.25) of the covariance matrix (5.24) of b is consistent, provided that

E[εi xi] = E[(yi − xi'β) xi] = 0.

That is, the orthogonality conditions should be satisfied. This is also required for the consistency of the OLS estimator b.

Proof of consistency of White standard errors
To prove that (5.25) is a consistent estimator of the covariance matrix (5.24), we use the results in Section 4.4 (p. 258) on GMM estimators. The above moment conditions can be formulated as E[gi] = 0 with gi = (yi − xi'β)xi, and the GMM estimator for these moment conditions is equal to the OLS estimator b (see Section 4.4.3 (p. 252)). According to the results in (4.67) and (4.68), a consistent estimator of the covariance matrix of the GMM estimator is given by var̂(b) = (H'J^{−1}H)^{−1}, where J = Σ_{i=1}^n gi gi' and H = Σ_{i=1}^n ∂gi/∂β', with J and H evaluated at b. For the above moment conditions this means that J = Σ ei² xi xi' and H = −Σ xi xi' = −X'X, so that (H'J^{−1}H)^{−1} equals (5.25). This shows that (5.25) is the GMM estimate of the covariance matrix, which is consistent.

Example 5.12: Bank Wages, Interest and Bond Rates (continued) (data files XM501BWA, XM511IBR)
As an illustration we consider the two examples of Section 5.4.1, the first on wages (see Example 5.10) and the second on interest rates (see Example 5.11). Exhibit 5.18 shows the results of least squares with conventional OLS formulas for the standard errors (in Panels 1 and 3) and with White heteroskedasticity-consistent standard errors (in Panels 2 and 4). For most coefficients, these two standard errors are quite close to each other.
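A compact Python sketch of (5.25), including the optional leverage correction ei²/(1 − hi); this is an illustration under the formulas above, with y and X assumed given:

```python
import numpy as np

def white_se(y, X, leverage_correct=False):
    """White heteroskedasticity-consistent standard errors (5.25) for OLS,
    optionally replacing e_i^2 by e_i^2 / (1 - h_i)."""
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ (X.T @ y)                      # OLS estimator
    e2 = (y - X @ b) ** 2                        # squared OLS residuals
    if leverage_correct:
        h = np.einsum('ij,jk,ik->i', X, XtX_inv, X)   # diagonal of H
        e2 = e2 / (1.0 - h)
    meat = (X * e2[:, None]).T @ X               # sum of e_i^2 x_i x_i'
    V = XtX_inv @ meat @ XtX_inv                 # White covariance matrix
    return b, np.sqrt(np.diag(V))
```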

Exhibit 5.18 Bank Wages, Interest and Bond Rates (Example 5.12): regressions for the wage data (Panels 1 and 2) and for the AAA bond rate data (Panels 3 and 4), with conventional standard errors (Panels 1 and 3) and with White standard errors (Panels 2 and 4). The coefficient estimates are identical in both cases; only the standard errors differ. For the bond rate data (600 monthly observations, 1950:01–1999:12), the coefficient of DUS3MT is 0.2746, with conventional standard error 0.0146 (Panel 3) and White standard error 0.023 (Panel 4).

Note, however, that for the interest rate data the (consistent) White standard error of the slope coefficient is 0.023, whereas according to the conventional OLS formula this standard error is computed as 0.015.

Exercises: S: 5.11.

5.4.3 Weighted least squares

Models for the variance
The use of OLS with White standard errors has the advantage that no model for the variances is needed. However, OLS is no longer efficient, and more efficient estimators can be obtained if one has reliable information on the variances σi². Stated in general terms, a model for heteroskedasticity is of the form

σi² = h(zi'γ),   (5.26)

where h is a known function, z = (1, z2, ..., zp)' is a vector consisting of p observed variables that influence the variances, and γ is a vector of p unknown parameters. Two specifications that are often applied are the model with additive heteroskedasticity, where h(z'γ) = z'γ, and the model with multiplicative heteroskedasticity, where h(z'γ) = e^{z'γ}. An example is the regression model for bond rates in Example 5.11, where we proposed the model σi² = σ²xi². The multiplicative model has the advantage that it always gives positive variances, whereas in the additive model we have to impose restrictions on the parameters γ. If the model explaining the heteroskedasticity is sufficiently accurate, then its use will increase the efficiency of the estimators.

Weighted least squares
A particularly simple model is obtained if the variance depends only on a single variable v, so that

σi² = σ²vi,   i = 1, ..., n,

where vi > 0 is known and σ² is an unknown scalar parameter. In such cases we can transform the model yi = xi'β + εi, E[εi²] = σ²vi, by dividing the ith equation by √vi. Let yi* = yi/√vi, xi* = xi/√vi, and εi* = εi/√vi; then we obtain the transformed model

yi* = xi*'β + εi*,   E[εi*²] = σ²,   i = 1, ..., n.

As the numbers vi are known, we can calculate the transformed data yi* and xi*. The transformed model also has homoskedastic error terms, so that, if the original model satisfies Assumptions 1, 2, and 4–6, then the same holds true for the transformed model. The best linear (in yi*) unbiased estimator of β is then obtained by applying least squares in the transformed model. To derive an explicit formula for this estimator, let X* be the n × k matrix with rows xi*' and let y* be the n × 1 vector with elements yi*. Then the estimator is given by

b* = (X*'X*)^{−1}X*'y* = (Σ_{i=1}^n xi xi'/vi)^{−1} (Σ_{i=1}^n xi yi/vi).   (5.27)

This estimator is obtained by minimizing the criterion

S(b) = Σ_{i=1}^n (yi* − xi*'b)² = Σ_{i=1}^n (yi − xi'b)²/vi.   (5.28)

As observations with smaller variance have a relatively larger weight in determining the estimate b*, this is called weighted least squares (WLS). The intuition is that there is less uncertainty around observations with smaller variances, so that these observations are more important for estimation. We recall that in Section 5.2.3 we applied weighted least squares in local regression, where the observations get larger weight the nearer they are to a given reference value.

Illustration: Heteroskedasticity for grouped data
In research in business and economics, the original data of individual agents or individual firms are often averaged over groups for privacy reasons. The groups should be chosen so that the individuals within a group are more or less homogeneous with respect to the variables in the model. Let the individual data satisfy the model y = Xβ + ε with E[ε] = 0 and E[εε'] = σ²I (that is, with homoskedastic error terms). Let nj be the number of individuals in group j; then, in terms of the reported group means, the model becomes

ȳj = x̄j'β + ε̄j,

where ȳj and ε̄j are the means of yj and εj and x̄j' is the row vector of the means of the explanatory variables in group j. The error terms satisfy

E[ε̄j] = 0, E[ε̄j²] = σ²/nj, and E[ε̄j ε̄h] = 0 for j ≠ h, so that grouping leads to heteroskedastic disturbances with covariance matrix

V = σ² diag(1/n1, 1/n2, ..., 1/nG),

where G denotes the number of groups. The WLS estimator is given by (5.27) with vj = 1/nj, so that

b_WLS = (Σ_{j=1}^G nj x̄j x̄j')^{−1} (Σ_{j=1}^G nj x̄j ȳj).

The weighting factors show that larger groups get larger weights.

Statistical properties of the WLS estimator
The properties of the weighted least squares estimator are easily obtained from the transformed model. The covariance matrix of b* is given by

var(b*) = σ²(X*'X*)^{−1} = σ²(Σ_{i=1}^n xi xi'/vi)^{−1}.   (5.29)

The weighted least squares estimator is efficient, and hence its covariance matrix is smaller than that of the OLS estimator in (5.24) (see also Exercise 5.5). In terms of the residuals of the transformed model, e* = y* − X*b*, an unbiased estimator of the variance σ² is given by

s*² = (1/(n − k)) Σ_{i=1}^n (yi* − xi*'b*)² = (1/(n − k)) Σ_{i=1}^n (yi − xi'b*)²/vi.

If we add Assumption 7 that the disturbance terms are normally distributed, then the results of Chapter 3 on testing linear hypotheses can be applied directly to the transformed model. For instance, for g restrictions the F-test of Chapter 3 now becomes

F = ((Σ e*Ri² − Σ e*i²)/g) / (Σ e*i²/(n − k)) = ((Σ eRi²/vi − Σ ei²/vi)/g) / ((Σ ei²/vi)/(n − k)),   (5.30)

where e = y − Xb*, e*R = y* − X*b*R, and eR = y − Xb*R, with b*R the restricted ordinary least squares estimator in the transformed model.

Asymptotic properties of WLS
The asymptotic results in Chapter 4 can be applied directly to the transformed model. For instance, if we drop Assumption 1 of fixed regressors and Assumption 7 of normally distributed error terms, then

√n (b* − β) →d N(0, σ²Q*^{−1}),   (5.31)

under the conditions that

plim((1/n) X*'X*) = plim((1/n) Σ_{i=1}^n xi xi'/vi) = Q*,   (5.32)
plim((1/n) X*'ε*) = plim((1/n) Σ_{i=1}^n xi εi/vi) = 0.   (5.33)

Under these assumptions, WLS is consistent and has an asymptotic normal distribution, and expressions like (5.29) and (5.30) remain valid asymptotically.

Summary of estimation by WLS
Estimation by weighted least squares can be performed by means of the following steps.

Step 1: Formulate the model. Formulate the regression model yi = xi'β + εi and the model for the variances E[εi²] = σ²vi, where (yi, xi) are observed, the vi are known, and β and σ² are unknown fixed parameters.

Step 2: Transform the data. Transform the observed data yi and xi by dividing by √vi — that is, yi* = yi/√vi and xi* = xi/√vi.

Step 3: Estimate and test with transformed data. Apply the standard procedures for estimation and testing of Chapters 3 and 4 to the transformed data y*, X*.

Step 4: Transform results to original data. The results can be rewritten in terms of the original data by substituting yi = √vi yi* and xi = √vi xi*.
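The four steps translate directly into code. A minimal Python sketch, assuming arrays y, X, and the known variance factors v are available (for grouped data, as shown in the usage comment, take vj = 1/nj):

```python
import numpy as np

def wls(y, X, v):
    """Weighted least squares via the transformed model (steps 1-4):
    divide each observation by sqrt(v_i) and apply OLS, cf. (5.27)-(5.29)."""
    w = 1.0 / np.sqrt(v)
    ys, Xs = y * w, X * w[:, None]           # transformed data y*, X*
    XtX_inv = np.linalg.inv(Xs.T @ Xs)
    b = XtX_inv @ (Xs.T @ ys)                # WLS estimator (5.27)
    e = ys - Xs @ b                          # transformed residuals e*
    s2 = e @ e / (len(y) - X.shape[1])       # unbiased estimate of sigma^2
    return b, s2 * XtX_inv                   # estimates and var(b*), cf. (5.29)

# grouped data: apply to group means with v_j = 1/n_j, i.e. weights n_j
# b, V = wls(y_bar, X_bar, 1.0 / group_sizes)
```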

We illustrate this with two examples.

Example 5.13: Bank Wages (continued) (data file XM513BWA)
In this example we continue our previous analysis of the bank wage data. We consider the possible heteroskedasticity that results from grouping the data. We will discuss (i) the grouped data, and (ii) the results of OLS and WLS for the grouped data.

(i) Grouped bank wage data
Suppose that for privacy reasons the individual bank wage data are grouped according to the variables gender, minority, job category, and four education groups (10 years or less, between 11 and 13 years, between 14 and 16 years, and 17 years or more). In principle this gives 2 × 2 × 3 × 4 = 48 groups. However, twenty-two combinations do not occur in the sample, so that G = 26 groups remain. Exhibit 5.19 shows a histogram of the resulting group sizes. Some groups consist of a single individual, and the largest group contains 101 individuals. It is intuitively clear that the averaged data in this large group should be given more weight than the data in the small groups.

Exhibit 5.19 Bank Wages (Example 5.13): histogram of the sizes of the twenty-six groups of 474 employees, with groups defined by gender, minority, job category, and four education groups; mean group size 18.23, minimum 1, maximum 101. The group size is measured on the horizontal axis, and the vertical axis measures the frequency of occurrence of the group sizes in the indicated intervals.

(ii) Results of OLS and WLS for grouped data
Exhibit 5.20 shows the result of applying OLS to the grouped data, both with OLS standard errors (in Panel 1) and with White standard errors (in Panel 2); efficient WLS estimates are reported in Panel 3. The WLS estimates are clearly different from the OLS estimates, and the standard errors of WLS are considerably smaller than those of OLS. For WLS, the R² and the standard error of regression are reported both for weighted data (based on the residuals e* = y* − X*b* of step 3 of WLS) and for unweighted data (based on the residuals e = y − Xb* of step 4 of WLS).

Exhibit 5.20 Bank Wages (Example 5.13): regressions for the grouped wage data (G = 26 groups) of mean log salary on mean education, gender, minority, and job category dummies. Panel 1: OLS; Panel 2: OLS with White standard errors; Panel 3: WLS with the square roots of the group sizes as weighting series (vj = 1/nj, with nj the group size). In Panel 3 the weighted statistics refer to the transformed (weighted) data and the unweighted statistics refer to the observed (unweighted) data.

Example 5.14: Interest and Bond Rates (continued) [data file XM511IBR]

We continue our analysis of the interest and bond rate data introduced in Example 5.11 in Section 5.4.1. We will discuss (i) the application of weighted least squares in this model, (ii) the outcomes of OLS and WLS, and (iii) comments on the outcomes.

(i) Application of weighted least squares

In Example 5.11 in Section 5.4.1 we considered the regression model $y_i = \alpha + \beta x_i + \varepsilon_i$ for the relation between changes in the AAA bond rate $y_i$ and the three-month Treasury Bill rate $x_i$.

[Exhibit 5.21 Interest and Bond Rates (Example 5.14): regressions for the AAA bond rate data. Panel 1 (a): OLS of DAAA on DUS3MT, 600 monthly observations 1950:01–1999:12. Panel 2 (b): WLS with variances proportional to the square of DUS3MT (weighting series 1/DUS3MT, $v_i = (\text{DUS3MT}_i)^2$), with 583 included and 17 excluded observations. Panel (c): histogram of the values of DAAA in the seventeen months where DUS3MT = 0 (these observations are excluded in WLS in Panel 2).]

The plots in Exhibit 5.17 suggest $E[\varepsilon_i^2] = \sigma^2 x_i^2$ as a possible model for the variances. The WLS estimator (5.27) is then obtained by ordinary least squares in the transformed model

$$\frac{y_i}{x_i} = \alpha \cdot \frac{1}{x_i} + \beta + \varepsilon_i^*,$$

where the error terms $\varepsilon_i^* = \varepsilon_i / x_i$ are homoskedastic with $E[\varepsilon_i^{*2}] = \sigma^2$. Note that, for WLS, 17 of the $n = 600$ observations are dropped. This is because in these months $x_i = 0$.

(ii) Outcomes of OLS and WLS

Exhibit 5.21 shows the results of OLS in the original model (in Panel 1) and of WLS (in Panel 2).

(iii) Comments on the outcomes

Panel 2 of Exhibit 5.21 indicates that, according to the WLS outcomes and at the 5 per cent significance level, the Treasury Bill rate changes ($x_i$) provide no significant explanation of AAA rate changes ($y_i$). However, for $x_i = 0$ the model postulates that $\text{var}(y_i) = E[\varepsilon_i^2] = \sigma^2 x_i^2 = 0$, whereas in reality this variance is non-zero, as the AAA rate does not always remain fixed in months where the Treasury Bill rate remains unchanged (see the histogram in Exhibit 5.21(c)). This indicates a shortcoming of the model for the variance. In the next section we will consider alternative, less restrictive models for the variance of the disturbances (see Example 5.16).

Exercises: T: 5.5; E: 5.20b–e, 5.33c.

5.4.4 Estimation by maximum likelihood and feasible WLS

Maximum likelihood in models with heteroskedasticity

The application of WLS requires that the variances of the disturbances are known up to a scale factor — that is, $\sigma_i^2 = \sigma^2 v_i$ with $\sigma^2$ an unknown scalar parameter and with $v_i$ known for all $i = 1, \cdots, n$. If we are not able to specify such a type of model, then we can use the more general model (5.26) with variances $\sigma_i^2 = h(z_i'\gamma)$, where $\gamma$ contains $p$ unknown parameters. Under Assumptions 1, 2, and 4–7, the log-likelihood (in terms of the $(k + p)$ unknown parameters $\beta$ and $\gamma$) is given by

$$l(\beta, \gamma) = -\frac{n}{2}\log(2\pi) - \frac{1}{2}\sum_{i=1}^{n}\log h(z_i'\gamma) - \frac{1}{2}\sum_{i=1}^{n}\frac{(y_i - x_i'\beta)^2}{h(z_i'\gamma)}. \qquad (5.34)$$
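To make the formula concrete, the following minimal sketch (ours, not from the book) evaluates (5.34) in Python for the multiplicative specification $h(z_i'\gamma) = e^{z_i'\gamma}$; the function and variable names are hypothetical.

```python
import numpy as np

def loglik_hetero(beta, gamma, y, X, Z):
    """Log-likelihood (5.34), assuming the multiplicative model
    h(z_i'gamma) = exp(z_i'gamma) for the variances."""
    s2 = np.exp(Z @ gamma)           # variances sigma_i^2 = h(z_i'gamma)
    e = y - X @ beta                 # disturbances y_i - x_i'beta
    n = len(y)
    return (-0.5 * n * np.log(2.0 * np.pi)
            - 0.5 * np.sum(np.log(s2))
            - 0.5 * np.sum(e ** 2 / s2))
```

Maximizing this function over $(\beta, \gamma)$, for instance with a general-purpose numerical optimizer, gives the ML estimates discussed below.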

The ML estimators of $\beta$ and $\gamma$ are obtained by maximizing $l(\beta, \gamma)$. For given values of $\gamma$, the optimal values of $\beta$ are obtained by WLS, so that

$$b_{WLS}(\gamma) = \left(\sum_{i=1}^{n}\frac{1}{\sigma_i^2}x_i x_i'\right)^{-1}\left(\sum_{i=1}^{n}\frac{1}{\sigma_i^2}x_i y_i\right), \qquad \sigma_i^2 = h(z_i'\gamma). \qquad (5.35)$$

We can substitute this formula for $b_{WLS}$ in (5.34) to obtain the concentrated log-likelihood as a function of $\gamma$ alone. Then $\gamma$ can be estimated by maximizing this concentrated log-likelihood, and the corresponding estimate of $\beta$ follows from (5.35). These estimators have the usual optimal asymptotic properties. The estimator (5.35) itself is not 'feasible' — that is, it cannot be computed — because $\gamma$ is unknown.

Feasible weighted least squares

An alternative and computationally simpler estimation method is to use a two-step approach. In the first step the variance parameters $\gamma$ are estimated, and in the second step the regression parameters $\beta$ are estimated, using the estimated variances of the first step. This method is called (two-step) feasible weighted least squares (FWLS).

Two-step feasible weighted least squares
Step 1: Estimate the variance parameters. Determine an estimate $c$ of the variance parameters $\gamma$ in the model $\text{var}(\varepsilon_i) = h(z_i'\gamma)$ and define the estimated variances by $s_i^2 = h(z_i'c)$.
Step 2: Apply WLS with the estimated variances. Replacing $v_i$ in (5.27) by $h(z_i'c)$, compute the feasible weighted least squares estimates

$$b_{FWLS} = \left(\sum_{i=1}^{n}\frac{1}{s_i^2}x_i x_i'\right)^{-1}\left(\sum_{i=1}^{n}\frac{1}{s_i^2}x_i y_i\right). \qquad (5.36)$$
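As a sketch of the two-step procedure (our own illustration in Python with numpy, assuming the multiplicative model $s_i^2 = e^{z_i'c}$ in step 1 and nonzero OLS residuals):

```python
import numpy as np

def fwls_two_step(y, X, Z):
    """Two-step FWLS sketch: step 1 estimates the variance parameters from
    the OLS residuals (multiplicative model assumed), step 2 is (5.36)."""
    b_ols, *_ = np.linalg.lstsq(X, y, rcond=None)   # OLS residuals e = y - Xb
    e = y - X @ b_ols
    # Step 1: regress log(e_i^2) on z_i; a constant shift of c[0] would
    # rescale all s_i^2 by a common factor and leave step 2 unchanged.
    c, *_ = np.linalg.lstsq(Z, np.log(e ** 2), rcond=None)
    s2 = np.exp(Z @ c)                              # estimated variances s_i^2
    # Step 2: WLS, i.e. OLS after dividing each observation by s_i
    w = 1.0 / np.sqrt(s2)
    b_fwls, *_ = np.linalg.lstsq(X * w[:, None], y * w, rcond=None)
    cov = np.linalg.inv(X.T @ (X / s2[:, None]))    # (X' Vc^{-1} X)^{-1}
    return b_fwls, cov
```

The covariance matrix returned here is the finite-sample approximation $(X'V_c^{-1}X)^{-1}$ discussed below.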

Derivation of statistical properties of FWLS

The properties of the estimator $b_{FWLS}$ depend on those of the estimator $c$ of $\gamma$ used in step 1. To investigate the consistency and the asymptotic distribution of $b_{FWLS}$, we write the model $y_i = x_i'\beta + \varepsilon_i$ in matrix form $y = X\beta + \varepsilon$, and we write $V_\gamma$ for the $n \times n$ diagonal matrix with elements $\sigma_i^2 = h(z_i'\gamma)$ and $V_c$ for the $n \times n$ diagonal matrix with elements $s_i^2 = h(z_i'c)$. Then we get

$$b_{FWLS} - b_{WLS}(\gamma) = (X'V_c^{-1}X)^{-1}X'V_c^{-1}y - (X'V_\gamma^{-1}X)^{-1}X'V_\gamma^{-1}y = (X'V_c^{-1}X)^{-1}X'V_c^{-1}\varepsilon - (X'V_\gamma^{-1}X)^{-1}X'V_\gamma^{-1}\varepsilon.$$

Therefore, $b_{FWLS}$ has the same asymptotic distribution (5.31) as $b_{WLS}$ (and hence it is consistent and asymptotically efficient) provided that

$$\text{plim}\,\frac{1}{n}X'\left(V_c^{-1} - V_\gamma^{-1}\right)X = \text{plim}\,\frac{1}{n}\sum_{i=1}^{n}\left(\frac{1}{s_i^2} - \frac{1}{\sigma_i^2}\right)x_i x_i' = 0, \qquad (5.37)$$

$$\text{plim}\,\frac{1}{\sqrt{n}}X'\left(V_c^{-1} - V_\gamma^{-1}\right)\varepsilon = \text{plim}\,\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\left(\frac{1}{s_i^2} - \frac{1}{\sigma_i^2}\right)x_i \varepsilon_i = 0. \qquad (5.38)$$

Under some regularity conditions on the regressors $x_i$ and the function $h$ in (5.26), the above two conditions are satisfied if $c$ is a consistent estimator of $\gamma$.

Approximate distribution of the FWLS estimator

Under conditions (5.37) and (5.38), the FWLS estimator has the same asymptotic covariance matrix as the WLS estimator. So, provided that the estimator $c$ in step 1 is a consistent estimator of $\gamma$, we can use the following result as an approximation in finite samples:

$$b_{FWLS} \approx N\left(\beta,\; (X'V_c^{-1}X)^{-1}\right).$$

Here $c$ is the estimate of $\gamma$ obtained in step 1 of FWLS, and $V_c$ is the corresponding diagonal matrix with the estimated variances $s_i^2 = h(z_i'c)$ on the diagonal. So the covariance matrix of the FWLS estimator can be approximated by

$$\widehat{\text{var}}(b_{FWLS}) = (X'V_c^{-1}X)^{-1} = \left(\sum_{i=1}^{n}\frac{1}{s_i^2}x_i x_i'\right)^{-1}.$$

If one wants to use WLS with chosen weighting factors $s_i^2$ but one is uncertain whether these weights correspond to the actual variances, then the above formula for the variance is in general no longer correct. In this case consistent estimates of the standard errors can be obtained by GMM. This corresponds to the White standard errors of OLS after the observations $(y_i, x_i)$ have been transformed to $(y_i^*, x_i^*)$, where $y_i^* = y_i/s_i$ and $x_i^* = \frac{1}{s_i}x_i$.

Two-step FWLS in the additive and multiplicative model

The foregoing shows that the two-step FWLS estimator is asymptotically as efficient as WLS, provided that the above two conditions are satisfied — in particular, consistency of the estimator $c$ of the variance parameters $\gamma$. We consider this for the additive and the multiplicative model for heteroskedasticity. In both cases, first OLS is applied in the model $y = X\beta + \varepsilon$ with residuals $e$; the squared residuals $e_i^2$ are asymptotically unbiased estimates of $\sigma_i^2$.

Then $\gamma$ is estimated by replacing $\sigma_i^2$ by $e_i^2$ and by running the regression

$$e_i^2 = z_i'\gamma + \eta_i$$

in the additive model, and

$$\log(e_i^2) = z_i'\gamma + \eta_i$$

in the multiplicative model. The error terms are given by $\eta_i = e_i^2 - \sigma_i^2$ in the additive model and by $\eta_i = \log(e_i^2/\sigma_i^2)$ in the multiplicative model. It is left as an exercise (see Exercise 5.6) to show that the above regression for the additive model gives a consistent estimate of $\gamma$, but that in the multiplicative model a correction factor is needed. In the latter model the coefficients $\gamma_j$ of the variables $z_j$ are consistently estimated for $j = 2, \cdots, p$, but the coefficient $\gamma_1$ of the constant term $z_1 = 1$ should be estimated as $\hat\gamma_1 + a$, where $\hat\gamma_1$ is the OLS estimate of $\gamma_1$ and $a = -E[\log(\chi^2(1))] \approx 1.27$.

Iterated FWLS

Instead of the above two-step FWLS method, we can also apply iterated FWLS. In this case the FWLS estimate of $b$ in step 2 is used to construct the corresponding series of residuals, which are used again in step 1 to determine new estimates of the heteroskedasticity parameters $\gamma$. The newly estimated variances are then used in step 2 to compute the corresponding new FWLS estimates of $b$. This is iterated until the parameter estimates converge. These iterations can improve the efficiency of the FGLS estimator in finite samples.

Example 5.15: Bank Wages (continued) [data file XM501BWA]

We consider the bank wage data again and will discuss (i) a multiplicative model for heteroskedasticity, (ii) the two-step FWLS estimates of this model, and (iii) the ML estimates.

(i) A multiplicative model for heteroskedasticity

In Example 5.10 we considered the regression model

$$y_i = \beta_1 + \beta_2 x_i + \beta_3 D_{gi} + \beta_4 D_{mi} + \beta_5 D_{2i} + \beta_6 D_{3i} + \varepsilon_i.$$

We concluded that the unexplained variation $\varepsilon_i$ in the (logarithmic) salaries may differ among the three job categories (see Exhibit 5.16). Suppose that the disturbance terms $\varepsilon_i$ in the above regression model have variances $\sigma_1^2$, $\sigma_2^2$, or $\sigma_3^2$ according to whether the $i$th employee has a job in category 1, 2, or 3 respectively. Let the parameters be transformed by $\gamma_1 = \log(\sigma_1^2)$, $\gamma_2 = \log(\sigma_2^2/\sigma_1^2)$, and $\gamma_3 = \log(\sigma_3^2/\sigma_1^2)$; then we can formulate the following multiplicative model:

$$\sigma_i^2 = E[\varepsilon_i^2] = e^{\gamma_1 + \gamma_2 D_{2i} + \gamma_3 D_{3i}}.$$

(ii) Two-step FWLS estimates

To apply (two-step) FWLS, the parameters of this model for the variances are estimated in Panels 1 and 2 of Exhibit 5.22. In Panel 2 the explained variable is $\log(e_i^2)$, where $e_i$ are the OLS residuals of the regression in Panel 1.

[Exhibit 5.22 Bank Wages (Example 5.15): OLS for the wage data (Panel 1; dependent variable LOGSALARY, 474 observations, regressors constant, EDUC, GENDER, MINORITY, DUMJCAT2, DUMJCAT3), step 1 of FWLS (Panel 2; auxiliary regression of the logarithm of the squared OLS residuals for estimation of the variance parameters in the multiplicative model of heteroskedasticity), and step 2 of FWLS (Panel 3; WLS with estimated variances obtained from Panel 2).]

The corresponding (two-step) FWLS estimator (5.36) is given in Panel 3 of Exhibit 5.22. The outcomes are quite close to those of OLS, so that the effect of heteroskedasticity is relatively small. The results in Panel 2 give the following estimates of the standard deviations per job category. With the correction factor for multiplicative models, the variances are estimated by $\hat\sigma_i^2 = e^{1.27 + \hat\gamma_1 + \hat\gamma_2 D_{2i} + \hat\gamma_3 D_{3i}}$ — that is, $\hat\sigma_1^2 = e^{1.27 + \hat\gamma_1}$, $\hat\sigma_2^2 = \hat\sigma_1^2 e^{\hat\gamma_2}$, and $\hat\sigma_3^2 = \hat\sigma_1^2 e^{\hat\gamma_3}$. This gives

$$\hat\sigma_1 = \sqrt{e^{1.27 - 4.733}} = 0.177, \qquad \hat\sigma_2 = \hat\sigma_1 e^{-0.289/2} = 0.153, \qquad \hat\sigma_3 = \hat\sigma_1 e^{0.460/2} = 0.223.$$

As expected, the standard deviation is smallest for custodial jobs and it is largest for management jobs. However, the estimates $\hat\gamma_2$ and $\hat\gamma_3$ are not significant, indicating that the homoskedasticity of the error terms need not be rejected.

(iii) ML estimates

Panel 4 of Exhibit 5.22 shows the results of ML.

[Exhibit 5.22 (Contd.) Bank Wages (Example 5.15): ML estimates of the model for wages with the multiplicative model for heteroskedasticity (Panel 4; maximum likelihood (BHHH), 474 observations, convergence achieved after 76 iterations, starting values at the FWLS estimates).]

The ML estimates of the parameters of the regression equation are close to the (two-step) FWLS estimates. However, the ML estimates of the variance parameters $\gamma_1$, $\gamma_2$, and $\gamma_3$ are quite different from those obtained in the (two-step) FWLS method. In particular, the ML estimates of the parameters $\gamma_2$ and $\gamma_3$ differ significantly from zero. That is, the ML results indicate significant heteroskedasticity between the three job categories. As the ML estimates are efficient, this leads to sharper conclusions than FWLS (where the null hypothesis of homoskedasticity could not be rejected).
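As a quick numerical check of these computations (a sketch of ours, using the Panel 2 values cited above):

```python
import numpy as np

# Step-1 estimates cited from Panel 2 of Exhibit 5.22; the intercept is
# taken before the correction a = 1.27 is added.
g1, g2, g3 = -4.733, -0.289, 0.460
var1 = np.exp(1.27 + g1)                       # sigma_1^2, job category 1
sigma = np.sqrt(var1 * np.exp([0.0, g2, g3]))  # categories 1, 2, 3
print(sigma.round(3))                          # [0.177 0.153 0.223]
```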

Example 5.16: Interest and Bond Rates (continued) [data file XM511IBR]

We continue our analysis of the interest and bond rate data of Example 5.14. We will discuss (i) two alternative models for the variance, (ii) two-step FWLS and ML estimates of both models, and (iii) our conclusion.

(i) Two alternative models for the variance

We consider again the relation between monthly changes in AAA bond rates ($y_i$) and monthly changes in Treasury Bill rates ($x_i$) given by

$$y_i = \alpha + \beta x_i + \varepsilon_i, \qquad E[\varepsilon_i^2] = \sigma_i^2.$$

In Section 5.4.3 we considered WLS with the model $\sigma_i^2 = \sigma^2 x_i^2$ for the variances, and we concluded that this model has its shortcomings; it did not turn out to be very realistic. Exhibit 5.17 shows that the variance in the period 1950–74 is smaller than that in the period 1975–99. This can be modelled by $\sigma_i^2 = \gamma_1 + \gamma_2 D_i$, where $D_i$ is a dummy variable with $D_i = 0$ in the months 1950.01–1974.12 and $D_i = 1$ in the months 1975.01–1999.12. In this model the variance is $\gamma_1$ until 1974 and it becomes $\gamma_1 + \gamma_2$ from 1975 onwards. However, the variance is also changing within these two subperiods, as is clear from Exhibit 5.17(a). In general, large residuals tend to be followed by large residuals, and small residuals by small ones. A model for this kind of clustered variances is given by

$$\sigma_i^2 = \gamma_1 + \gamma_2\varepsilon_{i-1}^2 = \gamma_1 + \gamma_2(y_{i-1} - \alpha - \beta x_{i-1})^2.$$

(ii) Two-step FWLS and ML estimates

Exhibit 5.23 shows the results of two-step FWLS and ML estimates for both heteroskedasticity models. The two-step FWLS estimates are obtained as follows. In the first step, $y_i$ is regressed on $x_i$ with residuals $e_i = y_i - a - bx_i$ (see Panels 1 and 5). In the second step, for the dummy variable model we perform the regression (see Panel 2)

$$e_i^2 = \gamma_1 + \gamma_2 D_i + \eta_i,$$

and for the model with clustered variances we perform the regression (see Panel 6)

$$e_i^2 = \gamma_1 + \gamma_2 e_{i-1}^2 + \eta_i.$$
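For the dummy-variable model these two steps amount to a few lines of code; the following sketch (ours, with hypothetical variable names) mirrors the computations reported in Panels 1–3:

```python
import numpy as np

def fwls_dummy_model(y, x, D):
    """Two-step FWLS for y_i = a + b x_i + e_i with s_i^2 = g1 + g2 D_i."""
    X = np.column_stack([np.ones_like(x), x])
    b_ols, *_ = np.linalg.lstsq(X, y, rcond=None)    # OLS (Panel 1)
    e = y - X @ b_ols
    ZD = np.column_stack([np.ones_like(D), D])
    g, *_ = np.linalg.lstsq(ZD, e ** 2, rcond=None)  # step 1 (Panel 2)
    s2 = ZD @ g                                      # g1, or g1 + g2 after 1974
    w = 1.0 / np.sqrt(s2)                            # step 2: WLS (Panel 3)
    b_fwls, *_ = np.linalg.lstsq(X * w[:, None], y * w, rcond=None)
    return b_fwls
```

The clustered-variance model is handled in the same way, with the lagged squared residual $e_{i-1}^2$ taking the place of $D_i$ (so that the first observation is lost).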

The estimated variances — $\hat\sigma_i^2 = \hat\gamma_1 + \hat\gamma_2 D_i$ in the first model and $\hat\sigma_i^2 = \hat\gamma_1 + \hat\gamma_2 e_{i-1}^2$ in the second model — are then used to compute the (two-step) FWLS estimates (5.36) of $\alpha$ and $\beta$ (see Panels 3 and 7).

[Exhibit 5.23 Interest and Bond Rates (Example 5.16): OLS of the AAA bond rate changes on the Treasury Bill rate changes (Panel 1), step 1 of FWLS (Panel 2; auxiliary regression of the squared residuals to estimate the dummy model of heteroskedasticity), step 2 of FWLS (Panel 3; WLS with estimated variances obtained from Panel 2), and ML with the dummy model for heteroskedasticity (Panel 4; convergence achieved after 18 iterations).]

The results in Panels 4 and 8 of Exhibit 5.23 show that the standard errors of the ML estimates are smaller than those of the FWLS estimates. For instance, the estimated slope parameter $b$ in the dummy model for the variance has standard errors 0.0107 (ML, see Panel 4) and 0.0141 (FWLS, see Panel 3), and in the model with

clustered variances the standard errors are 0.0036 (ML, see Panel 8) and 0.0156 (FWLS, see Panel 7).

[Exhibit 5.23 (Contd.) Interest and Bond Rates (Example 5.16): OLS of the AAA bond rate changes on the Treasury Bill rate changes (Panel 5), step 1 of FWLS (Panel 6; auxiliary regression of the residuals to estimate the model with clustered variances), step 2 of FWLS (Panel 7; WLS with estimated variances obtained from Panel 6), and ML for the model with clustered variances (Panel 8; convergence achieved after 14 iterations).]

(iii) Conclusion

A natural question is which model for the variance should be preferred. To answer this question we should test the validity of the specified models for heteroskedasticity. This is further analysed in Example 5.18 at the end of the next section.

Exercises: T: 5.6a, b; E: 5.25a, 5.28a–c.

5.4.5 Tests for homoskedasticity

Motivation of diagnostic tests for heteroskedasticity

When heteroskedasticity is present, ML and FWLS will in general offer a gain in efficiency as compared to OLS. But efficiency is lost if the disturbances are homoskedastic. In order to decide which estimation method to use, we first have to test for the presence of heteroskedasticity. It is often helpful to make plots of the least squares residuals $e_i$ and their squares $e_i^2$, as well as scatters of these variables against explanatory variables $x_i$ or against the fitted values $\hat y_i = x_i'b$. This may provide a first indication of deviations from homoskedastic error terms. Diagnostic tests like the CUSUMSQ discussed in Section 5.3 are also helpful. In this section we discuss some tests for homoskedasticity — Goldfeld–Quandt, Likelihood Ratio, Breusch–Pagan, and White. Further, if the disturbances in the model $y_i = x_i'\beta + \varepsilon_i$ are heteroskedastic and a model $\sigma_i^2 = h(z_i'\gamma)$ has been postulated, then it is of interest to test whether this model for the variances is adequately specified. Let $\hat b$ be the (ML or FWLS) estimate of $\beta$ with corresponding residuals $\hat e_i = y_i - x_i'\hat b$, and let $\hat\gamma$ be the estimate of $\gamma$ and $\hat s_i^2 = h(z_i'\hat\gamma)$. Then the standardized residuals $\hat e_i/\hat s_i$ should be (approximately) homoskedastic.

The Goldfeld–Quandt test

The Goldfeld–Quandt test requires that the data can be ordered with non-decreasing variance. The null hypothesis is that the variance is constant for all observations, and the alternative is that the variance increases. To test this hypothesis, the ordered data set is split in three groups. The first group consists of the first $n_1$ observations (with variance $\sigma_1^2$), the second group of the last $n_2$ observations (with variance $\sigma_2^2$), and the third group of the remaining $n_3 = n - n_1 - n_2$ observations in the middle. This last group is left out of the analysis, to obtain a sharper contrast between the variances in the first and second group. The null and alternative hypotheses are

$$H_0: \sigma_1^2 = \sigma_2^2, \qquad H_1: \sigma_2^2 > \sigma_1^2.$$

Now OLS is applied in groups 1 and 2 separately, with resulting sums of squared residuals $SSR_1$ and $SSR_2$ respectively and estimated variances $s_1^2 = SSR_1/(n_1 - k)$ and $s_2^2 = SSR_2/(n_2 - k)$. Under the standard Assumptions 1–7 (in particular, independently and normally distributed error terms), $SSR_j/\sigma_j^2$ follows the $\chi^2(n_j - k)$ distribution for $j = 1, 2$, and these two statistics are independent. Therefore

$$F = \frac{SSR_2/\left((n_2 - k)\sigma_2^2\right)}{SSR_1/\left((n_1 - k)\sigma_1^2\right)} = \frac{s_2^2/\sigma_2^2}{s_1^2/\sigma_1^2} \sim F(n_2 - k,\, n_1 - k).$$

So, under the null hypothesis of equal variances, the test statistic $F = s_2^2/s_1^2$ follows the $F(n_2 - k, n_1 - k)$ distribution, and the null hypothesis is rejected in favour of the alternative if $F$ takes large values.
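In code, the test requires nothing more than two regressions on the ordered subsamples; a minimal sketch (ours), assuming the rows of $y$ and $X$ are already ordered by non-decreasing variance:

```python
import numpy as np

def goldfeld_quandt(y, X, n1, n2):
    """Goldfeld-Quandt statistic F = s2^2 / s1^2, to be compared with the
    F(n2 - k, n1 - k) distribution; the middle observations are dropped."""
    n, k = X.shape
    def s_squared(yg, Xg):
        bg, *_ = np.linalg.lstsq(Xg, yg, rcond=None)
        r = yg - Xg @ bg
        return (r @ r) / (len(yg) - k)       # SSR_j / (n_j - k)
    return s_squared(y[n - n2:], X[n - n2:]) / s_squared(y[:n1], X[:n1])
```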

There exists no generally accepted rule to choose the number $n_3$ of excluded middle observations. If the variance changes only at a single break-point, then it would be optimal to select the two groups accordingly and to take $n_3 = 0$. On the other hand, if nearly all variances are equal and only a few first observations have smaller variance and a few last ones have larger variance, then it would be best to take $n_3$ large. In practice one uses rules of thumb — for example, $n_3 = n/5$ if the sample size $n$ is small and $n_3 = n/3$ if $n$ is large.

Likelihood Ratio test

Sometimes the data can be split in several groups where the variance is assumed to be constant within groups and to vary between groups. If there are $G$ groups and $\sigma_j^2$ denotes the variance in group $j$, then the null hypothesis of homoskedasticity is

$$H_0: \sigma_1^2 = \sigma_2^2 = \cdots = \sigma_G^2,$$

and the alternative is that this restriction does not hold true. It is left as an exercise (see Exercise 5.6) to show that, under the standard Assumptions 1–7, the Likelihood Ratio test for the above hypothesis is given by

$$LR = n\log\left(s_{ML}^2\right) - \sum_{j=1}^{G} n_j \log\left(s_{j,ML}^2\right) \approx \chi^2(G - 1). \qquad (5.39)$$

Here $s_{ML}^2 = e'e/n$ is the estimated variance over the full data set (that is, under the null hypothesis of homoskedasticity) and $s_{j,ML}^2 = e_j'e_j/n_j$ is the estimated variance in group $j$ (obtained by a regression over the $n_j$ observations in this group).
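A sketch of the computation of (5.39) (our own code; `groups` is a list of index arrays defining the $G$ groups):

```python
import numpy as np

def lr_equal_variances(y, X, groups):
    """Likelihood Ratio test (5.39); approximately chi^2(G - 1) under H0."""
    def s2_ml(yg, Xg):
        bg, *_ = np.linalg.lstsq(Xg, yg, rcond=None)
        r = yg - Xg @ bg
        return (r @ r) / len(yg)             # ML estimate e'e / n
    lr = len(y) * np.log(s2_ml(y, X))        # full sample, under H0
    for idx in groups:
        lr -= len(idx) * np.log(s2_ml(y[idx], X[idx]))
    return lr
```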

Breusch–Pagan LM-test

The Breusch–Pagan test is based on models of the type $\sigma_i^2 = h(z_i'\gamma)$ for the variances, with variables $z_i = (1, z_{2i}, \cdots, z_{pi})'$ that explain the differences in the variances. The null hypothesis of constant variance corresponds to the $(p - 1)$ parameter restrictions

$$\gamma_2 = \cdots = \gamma_p = 0.$$

The Breusch–Pagan test is equal to the LM-test

$$LM = \left(\frac{\partial l}{\partial \theta}\right)'\left(-E\left[\frac{\partial^2 l}{\partial \theta\,\partial \theta'}\right]\right)^{-1}\frac{\partial l}{\partial \theta}.$$

To compute this test we should calculate the first and second order derivatives of the (unrestricted) log-likelihood (5.34) with respect to the parameters $\theta = (\beta, \gamma)$, which are then evaluated at the estimated parameter values under the null hypothesis. It is left as an exercise (see Exercise 5.7) to show that this leads to the following three-step procedure to compute the Breusch–Pagan test for heteroskedasticity.

Breusch–Pagan test for heteroskedasticity
Step 1: Apply OLS. Apply OLS in the model $y = X\beta + \varepsilon$ and compute the residuals $e = y - Xb$.
Step 2: Perform auxiliary regression. If the variances $\sigma_i^2$ are possibly affected by the $(p - 1)$ variables $(z_{2i}, \cdots, z_{pi})$, then apply OLS in the auxiliary regression equation

$$e_i^2 = \gamma_1 + \gamma_2 z_{2i} + \cdots + \gamma_p z_{pi} + \eta_i. \qquad (5.40)$$

Step 3: $LM = nR^2$ of the regression in step 2. Here $R^2$ is the coefficient of determination of the auxiliary regression in step 2. This is asymptotically distributed as $\chi^2(p - 1)$ under the null hypothesis of homoskedasticity.

White test

An advantage of the Breusch–Pagan test is that the function $h$ in the model (5.26) may be left unspecified. However, one should know the variables $z_j$ ($j = 2, \cdots, p$) that influence the variance. If these variables are unknown, then one can replace the variables $z_j$ ($j = 2, \cdots, p$) by functions of the explanatory variables $x$ — for instance, $x_{2i}, \cdots, x_{ki}$ and $x_{2i}^2, \cdots, x_{ki}^2$, in which case $p - 1 = 2k - 2$. The above LM-test with this particular choice of the variables $z$ is called the White test (without cross terms). An extension is the White test with cross terms, where all cross products $x_{ji}x_{hi}$ with $j \neq h$ are also included as $z$-variables.
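The three steps translate directly into code; the sketch below (ours) computes $LM = nR^2$, and the White test follows by taking for $Z$ a constant together with the regressors, their squares, and (for the version with cross terms) their cross products:

```python
import numpy as np

def breusch_pagan(y, X, Z):
    """Breusch-Pagan LM test from the auxiliary regression (5.40);
    Z contains a constant and the (p - 1) variables z_2, ..., z_p."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)       # step 1: OLS
    e2 = (y - X @ b) ** 2
    g, *_ = np.linalg.lstsq(Z, e2, rcond=None)      # step 2: auxiliary OLS
    fit = Z @ g
    r2 = 1.0 - np.sum((e2 - fit) ** 2) / np.sum((e2 - e2.mean()) ** 2)
    return len(y) * r2      # step 3: compare with chi^2(p - 1)
```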

Remarks on choice and interpretation of tests

If one can identify variables $z_j$ for the model for the variances that are based on plausible economic assumptions, then the corresponding test of Breusch and Pagan is preferred. If not, one can apply the White test or the Goldfeld–Quandt test. Although the above tests were originally developed to test for heteroskedasticity, they can also be considered more generally as misspecification tests. For example, in the White test a significant correlation between the squared OLS residuals $e_i^2$ and the squares and cross products of explanatory variables may be caused by misspecification of the functional form. The hypothesis of homoskedastic error terms may also be rejected because of the presence of outliers. This is further discussed in Section 5.6. If homoskedasticity is rejected, then the variances can be modelled in terms of the variables $z_j$. After the model has been estimated by taking this type of heteroskedasticity into account, one can test whether the standardized residuals are homoskedastic. If not, one can try to find a better model for the variances.

Example 5.17: Bank Wages (continued) [data files XM501BWA, XM513BWA]

We continue our analysis of the bank wage data (see Example 5.15). We will discuss (i) the Goldfeld–Quandt test, (ii) the Breusch–Pagan test, (iii) the Likelihood Ratio test, and (iv) tests for grouped data.

(i) Goldfeld–Quandt test

We apply tests on homoskedasticity for the bank wage data. Using the notation of Example 5.15, the model is given by

$$y_i = \beta_1 + \beta_2 x_i + \beta_3 D_{gi} + \beta_4 D_{mi} + \beta_5 D_{2i} + \beta_6 D_{3i} + \varepsilon_i.$$

For the Goldfeld–Quandt test we perform three regressions, one for each job category (see Panels 2–4 of Exhibit 5.24). The two job category dummies $D_2$ and $D_3$ should, of course, be dropped in these regressions. For the second job category the gender dummy $D_g$ also has to be deleted from the model, as this subsample consists of males only. Because the results in job category 2 are not significant, possibly owing to the limited number of observations within this group, we will leave them out and test the null hypothesis that $\sigma_1^2 = \sigma_3^2$ against the alternative that $\sigma_3^2 > \sigma_1^2$. Using the results in Panels 2 and 4 of Exhibit 5.24, the corresponding test is computed as $F = (0.227/0.188)^2 = 1.46$, and this has the $F(n_2 - k, n_1 - k) = F(84 - 4, 363 - 4) = F(80, 359)$ distribution. The corresponding P-value is 0.011, which indicates that the variance in the third job category is larger than that in the first job category.

[Exhibit 5.24 Bank Wages (Example 5.17): regression for the wage data of all employees (Panel 1; dependent variable LOGSALARY, 474 observations, standard error of regression 0.195) and for the three job categories separately (Panel 2 for category 1, with 363 observations and standard error of regression 0.188; Panel 3 for category 2, with 27 observations; Panel 4 for category 3, with 84 observations and standard error of regression 0.227), and the auxiliary regression for the Breusch–Pagan test (Panel 5; RES denotes the residuals of Panel 1).]

(ii) Breusch–Pagan test

The Breusch–Pagan test for the multiplicative model $\sigma_i^2 = e^{\gamma_1 + \gamma_2 D_{2i} + \gamma_3 D_{3i}}$ can be computed from the regression in Panel 5 of Exhibit 5.24. The explained variable in this regression consists of the squared OLS residuals of the above regression model for the wages. Note that in step 2 of the Breusch–Pagan test the dependent variable is $e_i^2$, not $\log(e_i^2)$. The test result for the hypothesis that $\gamma_2 = \gamma_3 = 0$ is $LM = nR^2 = 474(0.0195) = 9.24$. With the (asymptotic) $\chi^2(2)$ distribution, the P-value is 0.010. This leads to rejection of homoskedasticity.

(iii) Likelihood Ratio test

The Likelihood Ratio test (5.39) for equal variances in the three job categories can also be computed from the results in Exhibit 5.24. For each regression in the exhibit, the standard error of regression ($s$) is computed by least squares, and $s_{ML}^2$ can then be computed as $s_{ML}^2 = \frac{n-k}{n}s^2$. For the regression over the full sample with $n = 474$ in Panel 1, this gives $s = 0.195$ and $s_{ML}^2 = \frac{468}{474}s^2 = 0.0377$. In a similar way, using the results for the three job categories in Panels 2–4 of Exhibit 5.24, we obtain $s_{1,ML}^2 = 0.0350$, $s_{2,ML}^2 = 0.0045$, and $s_{3,ML}^2 = 0.0493$. With these values, the LR-test is computed as

$$LR = 474\log(0.0377) - 363\log(0.0350) - 27\log(0.0045) - 84\log(0.0493) = 61.2.$$

With the (asymptotic) $\chi^2(2)$ distribution the P-value is $P = 0.000$. This again indicates that the hypothesis of homoskedastic error terms should be rejected.

(iv) Tests for grouped data

Next we consider the data obtained after grouping, as described in Example 5.13. The result of estimating the above regression model for the grouped data was given in Panel 1 of Exhibit 5.20, and is repeated in Panel 1 of Exhibit 5.25. Panel 2 of Exhibit 5.25 shows the corresponding White test for homoskedasticity. Note that the square of a dummy variable is equal to that dummy variable, so that the squares of dummies are not included as explanatory variables in the White test. This outcome does not lead to the rejection of homoskedasticity. However, if we use the model $\sigma_j^2 = \sigma^2/n_j$, which relates the variance directly to the inverse of the group size, then the Breusch–Pagan test in Panel 3 of Exhibit 5.25, which relates the squared OLS residuals to the inverse of the group sizes $1/n_j$, gives a value of $LM = nR^2 = 26 \cdot 0.296 = 7.69$ with P-value 0.006, so that homoskedasticity is again rejected. This also becomes evident in the scatter plot in Exhibit 5.25(d). The foregoing results illustrate the importance of using all the available information on the variances of the disturbance terms.

[Exhibit 5.25 Bank Wages (Example 5.17): (a) regression for the grouped wage data (Panel 1; dependent variable MEANLOGSAL, 26 group observations); (b) White heteroskedasticity test (Panel 2; test equation with dependent variable RES^2, where RES are the residuals of Panel 1); (c) Breusch–Pagan test for heteroskedasticity related to group size (Panel 3; regression of RES^2 on a constant and 1/GROUPSIZE); (d) scatter diagram of the squared residuals against the inverse of the group size.]

Example 5.18: Interest and Bond Rates (continued) [data file XM511IBR]

We continue our previous analysis of the interest and bond rate data (see Examples 5.14 and 5.16). We will discuss (i) heteroskedasticity tests based on different models, (ii) evaluation of the obtained results, and (iii) our conclusion.

(i) Tests on heteroskedasticity based on different models

We consider again the model $y_i = \alpha + \beta x_i + \varepsilon_i$ for the relation between the monthly changes in the AAA bond rate ($y_i$) and the monthly changes in the three-month Treasury Bill rate ($x_i$). In the foregoing we considered different possible models for the variances $\sigma_i^2$ of the disturbances — that is, (i) $\sigma_i^2 = \sigma^2 x_i^2$, (ii) $\sigma_i^2 = \gamma_1 + \gamma_2 D_i$, where $D_i$ is a dummy variable for the second half (1975–99) of the considered time period, and (iii) $\sigma_i^2 = \gamma_1 + \gamma_2\varepsilon_{i-1}^2$. Now we use these models to test for the presence of heteroskedasticity. For the models (ii) and (iii) this can be done by testing whether $\gamma_2$ differs significantly from zero. The results in Panels 4 and 8 of Exhibit 5.23 show that the null hypothesis of homoskedastic disturbances is rejected for both models ($P = 0.000$). Exhibit 5.26 shows the result of the White test. The P-value of this test is $P = 0.046$ (see Panel 2). At 5 per cent significance we still reject the null hypothesis of homoskedasticity, but the tests based on the explicit models (ii) and (iii) have smaller P-values.

[Exhibit 5.26 Interest and Bond Rates (Example 5.18): OLS (Panel 1) and White heteroskedasticity test (Panel 2) for the regression of changes in the AAA bond rate on changes in the Treasury Bill rate (RES in Panel 2 denotes the residuals of the regression in Panel 1).]

[Exhibit 5.27 Interest and Bond Rates (Example 5.18): time plots of standardized residuals of the AAA bond rate data, for OLS (STRESOLS, (a), with $s = \text{stdev}(e) = 0.171$), for the model with variance proportional to the square of DUS3MT (STRES1, (b); (c) shows the scatter diagram of these standardized residuals against DUS3MT), for the dummy variance model (STRES2, (d)), and for the clustered variance model (STRES3, (e)).]

(ii) Evaluation of the results

It is of interest to compare the success of the models (i), (ii), and (iii) in removing the heteroskedasticity. For this purpose we compute the standardized residuals of the three models — that is, $(y_i - \hat\alpha - \hat\beta x_i)/\hat\sigma_i$. Plots of the standardized residuals are in Exhibit 5.27(b, d, e), together with the plot of the standardized OLS residuals $e_i/s$ in (a). This shows that model (i) has some very large standardized residuals, corresponding to observations in months where $x_i$ is close to zero. Such observations get an excessively large weight (see Exhibit 5.27(c)). The standardized residuals of models (ii) and (iii) still show some changes in the variance, but somewhat less than the OLS residuals.

(iii) Conclusion

The overall conclusion is that the models considered here are not able to describe the relation between AAA bond rates and the Treasury Bill rate over the time span 1950–99. This means that we should either consider less simplistic models or restrict the attention to a shorter time period. We will return to these data in Chapter 7, where we discuss the modelling of time series data in more detail.

Exercises: T: 5.6c, 5.7; E: 5.25b–e, 5.31c.

5.4.6 Summary

If the error terms in a regression model are heteroskedastic, then OLS should not be routinely applied, as it is not efficient and the usual formulas for the standard errors (as computed by software packages) do not apply. If heteroskedasticity is present, this means that some observations are more informative than others for the underlying relation. Efficient estimation requires that the more informative observations get a relatively larger weight in estimation. One can proceed as follows.

- Apply a test for the possible presence of heteroskedasticity. If one has an idea what the possible causes of heteroskedasticity are, it is helpful to formulate a corresponding model for the variance of the error terms and to apply the Breusch–Pagan test. If one has no such ideas, one can apply the White test or the Goldfeld–Quandt test.
- If tests indicate the presence of significant heteroskedasticity, then the model parameters can be estimated by weighted least squares if the variances are known up to a scale factor. Otherwise one can use feasible weighted least squares or maximum likelihood, with the usual approximate distributions of the estimators.
- If one can formulate a model for the heteroskedasticity (for instance, an additive or a multiplicative model), then the model for heteroskedasticity may be evaluated by checking whether the scaled residuals $e_i/\hat s_i$ are homoskedastic. Here $e_i = y_i - x_i'\hat b$ is the $i$th residual and $\hat s_i^2$ is the estimated variance of the $i$th disturbance. If this is not the case, one can try to improve the model for the variances, or otherwise apply OLS with White standard errors.

5.5 Serial correlation

5.5.1 Introduction

Interpretation of serial correlation

As before, let the relation between the dependent variable $y$ and the independent variables $x$ be specified by

$$y_i = x_i'\beta + \varepsilon_i, \qquad i = 1, \cdots, n. \qquad (5.41)$$

The disturbances are said to be serially correlated if there exist observations $i \neq j$ so that $\varepsilon_i$ and $\varepsilon_j$ have a non-zero correlation. In this case the covariance matrix $V$ is not diagonal. In general, the purpose of (5.41) is to model all systematic factors that influence the dependent variable $y$. If the error terms are serially correlated, this means that the model is not successful in this respect: apart from the systematic parts modelled by $x_i'\beta$ and $x_j'\beta$, the observations $y_i$ and $y_j$ have something more in common. One should then try to detect the possible causes for serial correlation and, if possible, to adjust the model so that its disturbances become uncorrelated. For example, it may be that in (5.41) an important independent variable is missing (omitted variables), or that the functional relationship is non-linear instead of linear (functional misspecification), or that lagged values of the dependent or independent variables should be included as explanatory variables (neglected dynamics). We illustrate this by two examples.

Example 5.19: Interest and Bond Rates (continued) [data file XM511IBR]

We continue our analysis of the interest and bond rate data and will discuss (i) graphical evidence of serial correlation for these data, and (ii) an economic interpretation of this serial correlation.

(i) Graphical evidence for serial correlation

We consider the linear model $y_i = \alpha + \beta x_i + \varepsilon_i$, $i = 1, \cdots, 600$,

for the monthly changes $y_i$ in the AAA bond rate and the monthly changes $x_i$ in the three-month Treasury Bill rate. The sample period runs from January 1950 to December 1999.

[Exhibit 5.28 Interest and Bond Rates (Example 5.19): residuals of the regression of changes in AAA bond rates on changes in Treasury Bill rates over the period 1950.01–1999.12 (a), the same plot over the subsample 1990.01–1999.12 (b), and scatter plot of the residuals (RES) against their one-month lagged value (RESLAG), with $r = 0.28$ (c).]

Exhibit 5.28 shows graphs of the series of residuals

$e_i$ over the whole sample period (in (a)) and also over the period January 1990 to December 1999 (in (b)). These graphs have time on the horizontal axis and the values of the residuals on the vertical axis. In 60 per cent of the months the residual $e_i$ has the same sign as the residual $e_{i-1}$ in the previous month. Exhibit 5.28(c) is a scatter plot of the residuals against their lagged values — that is, the points in this plot are given by $(e_{i-1}, e_i)$. The residuals of consecutive months are positively correlated, with sample correlation coefficient $r = 0.28$.

(ii) Economic interpretation

These results indicate that the series of disturbances $\varepsilon_i$ may be positively correlated over time. Suppose that in some month the change of the AAA bond rate is larger than would be predicted from the change of the Treasury Bill rate in that month, so that $e_{i-1} = y_{i-1} - a - bx_{i-1} > 0$. If $e_i$ and $e_{i-1}$ are positively correlated, then we expect that $e_i > 0$ — that is, that in the next month the change of the AAA rate is again larger than usual. This may be caused by the fact that deviations from an equilibrium relation between the two rates are not corrected within a single month but that this adjustment takes a longer period of time. Such dynamic adjustments require a different model from the above (static) regression model.

Example 5.20: Food Expenditure (continued) [data file XM520FEX]

The investigation of serial correlation for cross section data makes sense only if the observations can be ordered in some meaningful way. We will illustrate this by considering a cross section of budget data on food expenditure for a number of households. We will discuss (i) the data, (ii) a meaningful ordering of the data, and (iii) the interpretation of serial correlation for these cross section data.

(i) The data

The budget study of Example 4.3 (p. 204) consists of a cross section of 12,488 households that are aggregated in fifty-four groups. For each group the following data are available: the fraction of expenditure spent on food ($y$), the total consumption expenditure ($x_2$, in \$10,000 per year), and the average household size ($x_3$). Exhibit 5.29(a) shows a histogram of the group sizes. In all that follows we will delete the six groups with size smaller than twenty (see Exhibit 5.29(b)). This leaves $n = 48$ group observations for our analysis. We consider the following linear regression model:

$$y_i = \beta_1 + \beta_2 x_{2i} + \beta_3 x_{3i} + \varepsilon_i, \qquad i = 1, \cdots, 48.$$

(ii) A meaningful ordering of the data

The OLS estimates of $\beta_1$, $\beta_2$, and $\beta_3$ and of the variance $\sigma^2 = E[\varepsilon_i^2]$ do not depend on the ordering of the groups. Of course, for a randomly chosen order of the groups, it does not make much sense to compare the residual of one observation with the residual of the previous observation. Exhibit 5.29(c) shows the scatter diagram of the residuals against their lagged values — that is, $(e_{i-1}, e_i)$ — for such a random ordering. The sample correlation between $e_i$ and $e_{i-1}$ is very small: $r = -0.012$. To obtain a meaningful ordering we first order the data in six segments, with $1 \leq x_3 < 2$ in the first segment to $5 \leq x_3 < 6$ in the fifth segment, and with $x_3 \geq 6$ in the last segment. Each segment consists of group observations with comparable household size. Within each segment — that is, for 'fixed' household size — the observations are ordered according to the total consumption expenditure. This ordering is indicated in Exhibit 5.29(d) and (f). The number of observations in the six segments is respectively 6, 8, 9, 9, 8, and 8.

(iii) Interpretation of serial correlation for cross section data

To obtain a better understanding, we make a scatter diagram of the residuals $e_i$ against the previous residuals $e_{i-1}$ within segments — that is, the scatter of points $(e_{i-1}, e_i)$ for $i$ taking the values 2–6, 8–15, 17–24, 26–32, 34–40, and 42–48, so that residuals are compared only within the same segment and not between different segments. With this ordering, the scatter diagram in (e) shows a positive correlation between $e_{i-1}$ and $e_i$, and the sample correlation coefficient is $r = 0.43$ in this case. This indicates that the series of error terms $\varepsilon_i$ may be positively correlated within each segment. Exhibits 5.29(g) and (h) show the actual values of $y_i$ and the fitted values $\hat y_i = b_1 + b_2 x_{2i} + b_3 x_{3i}$ for the third segment of households (where $x_{3i} = 3.1$ for each observation), together with the residuals $e_i = y_i - \hat y_i$. Whereas the fitted relation is linear, the observed data indicate a non-linear relation with diminishing slope for higher levels of total expenditure. That is, residuals tend to be positive for relatively small and for relatively large values of total expenditure and they tend to be negative for average values of total expenditure. These results are in line with the earlier discussion in Example 4.3 (pp. 204–5). The relation between $x_2$ and $y$ is non-linear, as the effect of income on food expenditure declines for higher income levels, so that the linear regression model is misspecified. As a consequence, the residuals are serially correlated. The serial correlation of the ordered data provides a diagnostic indication of this misspecification.

[Exhibit 5.29 Food Expenditure (Example 5.20): (a) and (b) show histograms of the group sizes ((a) for all 54 groups, with minimum 1 and maximum 989 and six groups of size below twenty; (b) for the 48 groups with size at least twenty, with minimum 22 and maximum 989). (c) and (e) show scatter diagrams of the OLS residuals (RESOLS) against their lagged values (RESOLSLAG), for random ordering ((c), $r = -0.012$) and for systematic ordering ((e), $r = 0.43$). The systematic ordering is in six segments according to household size (d), and the ordering within each segment is by total consumption (f). (g) shows the actual and fitted values in the third segment (groups 16–24, average household size 3.1), with the corresponding residuals in (h).]

Exercises: E: 5.27a, b.

5.5.2 Properties of OLS

Consequences of serial correlation

Serial correlation is often a sign that the model should be adjusted. If one sees no possibilities to adjust the model to remove the serial correlation, then one

can still apply OLS. OLS remains unbiased and is consistent under appropriate conditions (Assumptions 1, 2, and 6), so in this sense OLS is still an acceptable method of estimation. However, the OLS estimator is not efficient, and its covariance matrix is not equal to $\sigma^2(X'X)^{-1}$ but depends on the (unknown) covariance matrix $V$ (see (5.23)). In many cases, the OLS expressions underestimate the standard errors of the regression coefficients, and therefore $t$- and $F$-tests tend to exaggerate the significance of these coefficients (see Exercise 5.22 for an illustration).

Derivation of GMM standard errors

Serial correlation corresponds to the case where the covariance matrix $V = E[\varepsilon\varepsilon']$ of the disturbances is not diagonal, and we assume that $V$ is unknown. Consistent estimates of the standard errors can be obtained by GMM. As was discussed in Section 5.4.2, OLS can be expressed in terms of the $k$ moment conditions $E[g_i] = 0$, with $g_i = \varepsilon_i x_i = (y_i - x_i'\beta)x_i$, $i = 1, \cdots, n$. Note that the situation differs from the one considered in Section 5.4.2, as there the functions $g_i$ are mutually independent. To describe the required modifications, we use the result (5.23), so that the variance of $b$ is given by

$$\text{var}(b) = \frac{1}{n}\left(\frac{1}{n}X'X\right)^{-1}\left(\frac{1}{n}X'VX\right)\left(\frac{1}{n}X'X\right)^{-1}.$$

Let $\sigma_{ij}$ denote the $(i, j)$th element of $V$; then $\sigma_{ij} = \sigma_{ji}$ (as $V$ is symmetric) and

$$\frac{1}{n}X'VX = \frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{n}\sigma_{ij}x_i x_j' = \frac{1}{n}\sum_{i=1}^{n}\sigma_{ii}x_i x_i' + \frac{1}{n}\sum_{i=1}^{n-1}\sum_{j=i+1}^{n}\sigma_{ij}(x_i x_j' + x_j x_i').$$

In the White correction for standard errors, the unknown variances $\sigma_i^2$ in (5.24) are replaced by the squared residuals $e_i^2$. If we copy this idea for the current situation, then the variance would be estimated simply by replacing $\sigma_{ij} = E[\varepsilon_i\varepsilon_j]$ by the product $e_i e_j$ of the corresponding residuals. However, the resulting estimate of the covariance matrix of $b$ is useless, because the OLS residuals satisfy $X'e = 0$, so that

$$\frac{1}{n}\sum_{i=1}^{n}e_i^2 x_i x_i' + \frac{1}{n}\sum_{i=1}^{n-1}\sum_{j=i+1}^{n}e_i e_j(x_i x_j' + x_j x_i') = \frac{1}{n}X'ee'X = 0.$$

A consistent estimator of the variance of $b$ can be obtained by weighting the contributions of the terms $e_i e_j$, giving the estimate

$$\hat V = \frac{1}{n}\sum_{i=1}^{n}e_i^2 x_i x_i' + \frac{1}{n}\sum_{i=1}^{n-1}\sum_{j=i+1}^{n}w_{j-i}\,e_i e_j\,(x_i x_j' + x_j x_i'). \qquad (5.42)$$

The terms on the diagonal (with $i = j$) have weight 1, and the terms with $i \neq j$ are given weights $w_{j-i}$ with $0 \leq w_{j-i} \leq 1$. The weighting function $w$ is also called the kernel. For example, the Bartlett kernel has weights

$$w_h = 1 - \frac{h}{B} \;\text{ for } h < B, \qquad w_h = 0 \;\text{ for } h \geq B.$$

To get consistent estimates, the bandwidth $B$ should depend on the sample size $n$ in such a way that $B \to \infty$ for $n \to \infty$, but at the same time $B$ should be sufficiently small so that the double summation in (5.42) converges. Rules that are applied in practice are to take $B \approx n^{1/3}$ or, in large samples, $B \approx n^{1/5}$.

Newey–West standard errors

The above method with weighting kernels is due to Newey and West. The Newey–West standard errors of $b$ are given by the square roots of the diagonal elements of the matrix

$$\widehat{\text{var}}(b) = n\,(X'X)^{-1}\hat V(X'X)^{-1},$$

with the matrix $\hat V$ as defined in (5.42). The corresponding estimates of the standard errors of the OLS estimator $b$ are called HAC — that is, they are heteroskedasticity and autocorrelation consistent.

Example 5.21: Interest and Bond Rates (continued) [data file XM511IBR]

We continue our analysis of the interest and bond rate data (see Example 5.19). Exhibit 5.30 shows the result of regressing the changes in AAA bond rates on the changes in Treasury Bill rates, with conventional OLS standard errors (Panel 1) and with Newey–West standard errors (Panel 2).

[Exhibit 5.30 Interest and Bond Rates (Example 5.21): regression of changes in AAA bond rates on changes in Treasury Bill rates with conventional OLS standard errors (Panel 1) and with Newey–West HAC standard errors (Panel 2; lag truncation 5); 600 monthly observations 1950:01–1999:12.]
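A direct implementation of (5.42) with the Bartlett kernel is short; the following sketch (ours, not the routine of any particular package) returns the HAC covariance matrix of $b$, whose diagonal gives the squared Newey–West standard errors:

```python
import numpy as np

def newey_west_cov(X, e, B):
    """HAC covariance n (X'X)^{-1} Vhat (X'X)^{-1}, with Vhat as in (5.42)
    and Bartlett weights w_h = 1 - h/B for h < B (w_h = 0 for h >= B)."""
    n, k = X.shape
    Xe = X * e[:, None]                 # row i equals e_i * x_i'
    V = Xe.T @ Xe / n                   # diagonal terms: sum e_i^2 x_i x_i' / n
    for h in range(1, B):
        w = 1.0 - h / B                 # Bartlett kernel weight w_h
        G = Xe[h:].T @ Xe[:-h] / n      # cross terms at lag h
        V += w * (G + G.T)
    XtX_inv = np.linalg.inv(X.T @ X)
    return n * XtX_inv @ V @ XtX_inv
```

For the $n = 600$ observations here, Exhibit 5.30 uses a lag truncation of 5.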

The HAC standard errors in Panel 2 are larger than the standard errors computed by the conventional OLS formulas in Panel 1. However, these differences do not affect the significance of the relationship, as the residual correlation is relatively mild ($r = 0.28$; see Exhibit 5.28(c)). In situations with more substantial serial correlation the differences may be much more dramatic (see Exercise 5.22 for an illustration).

Exercises: S: 5.22.

5.5.3 Tests for serial correlation

Autocorrelation coefficients

Serial correlation tests require that the observations can be ordered in a natural way. For time series data, where the variables are observed sequentially over time, such a natural ordering is given by the time index $i$. For cross section data the observations can be ordered according to one of the explanatory variables. In the foregoing sections we considered the correlation between consecutive residuals. The sample correlation coefficient of the residuals is defined by

$$r = \frac{\sum_{i=2}^{n} e_i e_{i-1}}{\sqrt{\sum_{i=2}^{n} e_i^2 \sum_{i=1}^{n-1} e_i^2}}.$$

In practice one often considers a slightly different (but asymptotically equivalent) expression, the first order autocorrelation coefficient defined by

$$r_1 = \frac{\sum_{i=2}^{n} e_i e_{i-1}}{\sum_{i=1}^{n} e_i^2}. \qquad (5.43)$$

Large values of $r_1$ may be an indication of dynamic misspecification (for time series data) or of functional misspecification (for cross section data). To consider the possibility of more general forms of misspecification, it is informative to consider also the $k$th order autocorrelation coefficients

$$r_k = \frac{\sum_{i=k+1}^{n} e_i e_{i-k}}{\sum_{i=1}^{n} e_i^2}. \qquad (5.44)$$

This measures the correlation between residuals that are $k$ observations apart. A plot of the autocorrelations $r_k$ against the lag $k$ is called the correlogram. This plot provides a first idea of possible serial correlation.
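The autocorrelations (5.44) — and hence the correlogram — can be computed in a few lines; a sketch (ours):

```python
import numpy as np

def autocorrelations(e, max_lag):
    """Residual autocorrelations r_k of (5.44); r_1 is (5.43)."""
    denom = np.sum(e ** 2)
    return np.array([np.sum(e[k:] * e[:-k]) / denom
                     for k in range(1, max_lag + 1)])
```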

We will now discuss three tests for serial correlation: the Durbin–Watson test, the Breusch–Godfrey test, and the Box–Pierce and Ljung–Box tests.

The Durbin–Watson test

The Durbin–Watson test is based on the following idea. Let σ² be the variance of the disturbances and let ρ be the correlation between ε_i and ε_{i−1}; then E[(ε_i − ε_{i−1})²] = 2σ²(1 − ρ). So if successive error terms are positively (negatively) correlated, then the differences e_i − e_{i−1} tend to be relatively small (large). The Durbin–Watson statistic is defined as

d = Σ_{i=2}^n (e_i − e_{i−1})² / Σ_{i=1}^n e_i².

This statistic satisfies 0 ≤ d ≤ 4, and d ≈ 2(1 − r_1) with r_1 the first order autocorrelation coefficient defined in (5.43). In the absence of first order serial correlation r_1 ≈ 0, so that d ≈ 2. Values of d close to zero indicate positive serial correlation, and values close to four indicate negative serial correlation. Critical values to test the null hypothesis ρ = 0 depend on the matrix X of explanatory variables. However, lower and upper bounds for the critical values that do not depend on X have been calculated by Durbin and Watson. The use of these bounds requires that the model contains a constant term, that the disturbances are normally distributed, and that the regressors are non-stochastic; for instance, lagged values of the dependent variable y_i are not allowed. The Durbin–Watson test is nowadays mostly used informally, as a diagnostic tool to indicate the possible existence of serial correlation.
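The statistic itself is a one-line computation; the following sketch (ours, assuming a numpy array of residuals) makes the definition concrete.

```python
import numpy as np

def durbin_watson(e):
    """Durbin-Watson statistic d = sum (e_i - e_{i-1})^2 / sum e_i^2."""
    e = np.asarray(e)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)
```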

Derivation of the Breusch–Godfrey LM-test

The Breusch–Godfrey test is an LM-test on serial correlation. The model is given by

y_i = x_i'β + ε_i,  (5.45)
ε_i = γ_1 ε_{i−1} + ... + γ_p ε_{i−p} + η_i,  (5.46)

where η_i satisfies Assumptions 2–4 and 7. That is, the n × 1 vector η is distributed as N(0, σ_η² I), so that the η_i are homoskedastic and serially uncorrelated. The equation (5.46) is called an autoregressive model of order p (written as AR(p)) for the error terms. For simplicity we consider in our analysis below only the case of an AR(1) model for the error terms. In this case we have

ε_i = γ ε_{i−1} + η_i,

and we assume that −1 < γ < 1. By repetitive substitution we get

ε_i = η_i + γ η_{i−1} + γ² η_{i−2} + ... + γ^{i−2} η_2 + γ^{i−1} ε_1.  (5.47)

So the error term for observation i is composed of independent terms with weights that decrease geometrically. As a first step in the derivation of the LM-test we use (5.47) to obtain

y_i = γ y_{i−1} + x_i'β − γ x_{i−1}'β + η_i.  (5.48)

This model contains 2k + 1 regressors but only k + 1 parameters, that is, the parameters satisfy k (non-linear) restrictions. The absence of serial correlation corresponds to the null hypothesis that

H0: γ = 0.

We derive the LM-test for this hypothesis by using the results in Section 4.2.4 (p. 217–8) for non-linear regression models (the ML approach of Section 4.3 is left as an exercise (see Exercise 5.8)). The model (5.48) can be written as a non-linear regression model y_i = f(z_i, β, γ) + η_i, where z_i = (y_{i−1}, x_i', x_{i−1}')'. According to the results in Section 4.2.4, the LM-test can be computed by auxiliary regressions, provided that the regressors z_i satisfy the two conditions that plim((1/n) Σ z_i z_i') = Q_z exists (and is non-singular) and that plim((1/n) Σ η_i z_i) = 0 (orthogonality). The model (5.45) shows that z_i is a linear function of (η_{i−1}, x_i', x_{i−1}')'. The above two limit conditions are satisfied if plim((1/n) Σ η_i x_i) = plim((1/n) Σ η_i x_{i−1}) = plim((1/n) Σ η_{i−1} x_i) = 0 and

plim (1/n) [ Σ x_i x_i'    Σ x_i x_{i−1}'  ;  Σ x_{i−1} x_i'    Σ x_{i−1} x_{i−1}' ] = Q,

with Q a non-singular (2k) × (2k) matrix. Under the null hypothesis that γ = 0 in (5.47), we get ε_i = η_i, and the LM-test for γ = 0 can then be computed as

LM = nR².

Here R² is obtained from the regression of the OLS residuals e = y − Xb on the gradient of the function f, that is, the vector of first order derivatives ∂f/∂β and ∂f/∂γ evaluated in the point (b, 0). The model (5.48) gives, when evaluated at β = b and γ = 0,

∂f/∂β = x_i − γ x_{i−1} = x_i,
∂f/∂γ = y_{i−1} − x_{i−1}'b = e_{i−1}.

The Breusch–Godfrey test

The foregoing arguments show that the Breusch–Godfrey test is obtained as LM = nR² of the auxiliary regression

e_i = x_i'δ + γ e_{i−1} + ω_i,  i = 2, ..., n.  (5.49)

This test has an asymptotic χ²(1) distribution under the null hypothesis of absence of serial correlation. The LM-test for the null hypothesis of absence of serial correlation against the alternative of AR(p) errors in (5.46) can be derived in a similar way. This leads to the following test procedure.

Breusch–Godfrey test for serial correlation of order p

Step 1: Apply OLS. Apply OLS in the model y = Xβ + ε and compute the residuals e = y − Xb.

Step 2: Perform auxiliary regression. Apply OLS in the auxiliary regression equation e_i = x_i'δ + γ_1 e_{i−1} + ... + γ_p e_{i−p} + ω_i, i = p + 1, ..., n.

Step 3: Then LM = nR², where R² is the coefficient of determination of the auxiliary regression in step 2. This is asymptotically distributed as χ²(p) under the null hypothesis of no serial correlation, that is, if γ_1 = ... = γ_p = 0.

An asymptotically equivalent test is given by the usual F-test on the joint significance of the parameters (γ_1, ..., γ_p) in the above auxiliary regression. To choose the value of p in the Breusch–Godfrey test, it may be helpful to draw the correlogram of the residuals e_i. In practice one usually selects small values for p (p = 1 or p = 2) and includes selective additional lags according to the data structure. For instance, if the data consist of time series that are observed every month, then one can include all lags up to order twelve to incorporate monthly effects.
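A minimal sketch of the three steps above, assuming X contains a constant column and e holds the step 1 OLS residuals (function name ours):

```python
import numpy as np

def breusch_godfrey(X, e, p):
    """LM = n * R^2 of the regression of e_i on x_i and p lagged residuals."""
    n, k = X.shape
    # lagged residual columns e_{i-1}, ..., e_{i-p} for i = p+1, ..., n
    lags = np.column_stack([e[p - j:n - j] for j in range(1, p + 1)])
    Z = np.hstack([X[p:], lags])
    y = e[p:]
    coef = np.linalg.lstsq(Z, y, rcond=None)[0]
    resid = y - Z @ coef
    yc = y - y.mean()                       # centred total sum of squares
    R2 = 1.0 - (resid @ resid) / (yc @ yc)
    return len(y) * R2                      # asymptotically chi-squared(p)
```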

Box–Pierce and Ljung–Box tests

As a third test for serial correlation we consider the Box–Pierce test for the joint significance of the first p autocorrelation coefficients defined in (5.44). The test statistic is given by

BP = n Σ_{k=1}^p r_k².  (5.50)

It is left as an exercise (see Exercise 5.9) to show that this test is asymptotically equivalent to the Breusch–Godfrey test, so that BP ≈ χ²(p) under the null hypothesis of no serial correlation. Sometimes the correlations in (5.50) are weighted, because higher order autocorrelations are based on less observations; that is, r_k in (5.44) is based on (n − k) products of residuals e_i e_{i−k}. This gives the Ljung–Box test (also denoted as the Q-test)

LB = n Σ_{k=1}^p ((n + 2)/(n − k)) r_k² ≈ χ²(p).

Similar to the Durbin–Watson test, the Box–Pierce test and the Ljung–Box test also require that the regressors x_i in the model (5.45) are non-stochastic. Otherwise it is better to apply the Breusch–Godfrey LM-test.

Example 5.22: Interest and Bond Rates (continued)

We perform serial correlation tests for the interest and bond rate data (XM511IBR) discussed before in Examples 5.19 and 5.21. Exhibit 5.31, Panel 1, shows the results of regressing the changes in the AAA bond rate on the changes in the Treasury Bill rate. Panel 2, the correlogram, contains the first twelve autocorrelation coefficients of the residuals. The Durbin–Watson statistic is equal to d = 1.447, so that the first order autocorrelation coefficient is r_1 ≈ 1 − d/2 = 0.277. The first order autocorrelation coefficient is significant. The Q-test in Panel 2 corresponds to the Ljung–Box test (with p ranging from p = 1 to p = 12). Panels 3 and 4 show the results of the Breusch–Godfrey test with one or two lags of the residuals. All tests lead to a clear rejection of the null hypothesis of no serial correlation.
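The Ljung–Box statistic just defined is again a direct computation; a short sketch (ours), assuming a numpy array of residuals:

```python
import numpy as np

def ljung_box(e, p):
    """Ljung-Box Q-statistic over the first p residual autocorrelations."""
    e = np.asarray(e)
    n = len(e)
    denom = np.sum(e ** 2)
    r = np.array([np.sum(e[k:] * e[:-k]) / denom for k in range(1, p + 1)])
    k = np.arange(1, p + 1)
    return n * np.sum((n + 2) / (n - k) * r ** 2)   # approx. chi-squared(p)
```

Dropping the factor (n + 2)/(n − k) gives the Box–Pierce statistic (5.50).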

[Exhibit 5.31 Interest and Bond Rates (Example 5.22): regression of changes in AAA bond rates on changes in Treasury Bill rates (Panel 1: DUS3MT coefficient 0.2746, t = 18.75, R² = 0.370, d = 1.447), correlogram of the residuals (Panel 2, 'Q-Stat' is the Ljung–Box test; r_1 = 0.276, and all twelve Q-statistics have P = 0.000), and Breusch–Godfrey tests on serial correlation (with order p = 1 in Panel 3 and p = 2 in Panel 4, both rejecting with P = 0.000000; these panels also show the auxiliary regression of step 2 of this LM-test).]

[Exhibit 5.32 Food Expenditure (Example 5.23): correlograms of residuals and auxiliary regressions of step 2 of the Breusch–Godfrey test for the budget data, with randomly ordered data (Panels 1 and 2, residuals RESRAND: no significant autocorrelations, and the coefficient of RESRAND(−1) is insignificant) and with systematically ordered data (Panels 3 and 4, residuals RESORD: significant positive first order autocorrelation; the auxiliary regression over the forty-two within-segment pairs has R² = 0.183).]

Example 5.23: Food Expenditure (continued)

Next we perform tests on serial correlation for the data on food expenditure (XM520FEX) discussed before in Example 5.20. Exhibit 5.32 shows the results of different tests for serial correlation for the budget data of forty-eight groups of households. For a randomly chosen ordering of the groups, the correlogram, the Ljung–Box test, and the Breusch–Godfrey test indicate that there is no serial correlation (see Panels 1 and 2). Now we consider a meaningful ordering of the groups in six segments. Each segment consists of groups of households of comparable size, and the observations within a segment are ordered according to the total consumption expenditure. The six segments consist of the observations with index 1–6, 7–15, 16–24, 25–32, 33–40, and 41–48. We investigate the presence of first order serial correlation within these segments. At the observations i = 7, 16, 25, 33, and 41, the residuals e_i and e_{i−1} correspond to different segments, and the correlations between these residuals are excluded from the analysis. This leaves forty-two pairs of residuals (e_i, e_{i−1}) for analysis. The results are in Panels 3 and 4 of Exhibit 5.32. The correlogram, the Ljung–Box test, and the Breusch–Godfrey test (with LM = nR² = 42 · 0.18 = 7.69 and P = 0.006) all reject the absence of serial correlation. This indicates misspecification of the linear model, that is, the fraction of expenditure spent on food depends in a non-linear way on total expenditure (see also Example 5.20 (p. 356–8)).

Exercises: T: 5.8, 5.9, 5.11; S: 5.21; E: 5.30a, 5.31d.

5.5.4 Model adjustments

Regression models with lagged variables

If the residuals of an estimated equation are serially correlated, this indicates that the model is not correctly specified. For (ordered) cross section data this may be caused by non-linearities in the functional form, and we refer to Section 5.2 for possible adjustments of the model. For time series data, serial correlation means that some of the dynamic properties of the data are not captured by the model. In this case one can adjust the model, for instance, by including lagged values of the explanatory variables and of the explained variable as additional regressors. The search for correct dynamic specifications of time series models is discussed in Chapter 7.

As an example, suppose that the model y_i = β_1 + β_2 x_i + ε_i is estimated by OLS and that the residuals are serially correlated. This suggests that e_i = y_i − b_1 − b_2 x_i is correlated with e_{i−1} = y_{i−1} − b_1 − b_2 x_{i−1}. This may be caused by correlation of y_i with y_{i−1} and x_{i−1}, which can be expressed by the model

y_i = γ_1 + γ_2 x_i + γ_3 x_{i−1} + γ_4 y_{i−1} + η_i.  (5.51)

When the disturbances η_i of this model are identically and independently distributed (IID), then the model is said to have a correct dynamic specification.

Regression model with autoregressive disturbances

In this section we consider only a special case that is often applied as a first step in modelling serial correlation. Here it is assumed that the dynamics can be modelled by means of the disturbances ε_i, and more in particular that ε_i satisfies the AR(1) model (5.47), so that ε_i = γ ε_{i−1} + η_i. This is called the regression model with AR(1) errors. If one substitutes ε_i = y_i − β_1 − β_2 x_i and ε_{i−1} = y_{i−1} − β_1 − β_2 x_{i−1}, it follows that (5.47) can be written as

y_i = β_1(1 − γ) + β_2 x_i − β_2 γ x_{i−1} + γ y_{i−1} + η_i.  (5.52)

This is of the form (5.51) with γ_1 = β_1(1 − γ), γ_2 = β_2, γ_3 = −β_2 γ, and γ_4 = γ, so that the parameters satisfy the restriction

γ_2 γ_4 + γ_3 = β_2 γ − β_2 γ = 0.

Estimation by Cochrane–Orcutt

If the terms η_i are IID and normally distributed, then the parameters β_1, β_2, and γ can be estimated by NLS. An alternative is to use the following iterative two-step method. Note that for a given value of γ the parameters β_1 and β_2 can be estimated by OLS in

y_i − γ y_{i−1} = β_1(1 − γ) + β_2 (x_i − γ x_{i−1}) + η_i.

On the other hand, if the values of β_1 and β_2 are given, then ε_i = y_i − β_1 − β_2 x_i can be computed, and hence γ can be estimated by OLS in ε_i = γ ε_{i−1} + η_i. We can exploit this as follows. As a first step take γ = 0 and estimate β_1 and β_2 by OLS. Let e_i = y_i − b_1 − b_2 x_i be the OLS residuals; then in the second step γ is estimated by regressing e_i on e_{i−1}. This estimator (say γ̂) is consistent (provided that −1 < γ < 1 (see Chapter 7)), but it is not efficient. To improve the efficiency we can repeat these two steps. First a new estimate of b_1 and b_2 is obtained by regressing y_i − γ̂ y_{i−1} on a constant and x_i − γ̂ x_{i−1}. Second, if ẽ_i are the new residuals, then a new estimate of γ is obtained by regressing ẽ_i on ẽ_{i−1}. This process is iterated till the estimates of β_1, β_2, and γ converge. This is called the Cochrane–Orcutt method for the estimation of regression models with AR(1) errors. The estimates converge to a local minimum of the sum-of-squares criterion function, and it may be worthwhile to redo the iterations with different initial values for the parameters β_1, β_2, and γ.
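The iterated regressions are straightforward to program. The following is a sketch of the Cochrane–Orcutt iterations for the simple regression case, assuming numpy arrays y and x (function name and convergence tolerances are ours):

```python
import numpy as np

def cochrane_orcutt(y, x, tol=1e-8, max_iter=100):
    """Iterative two-step estimation of y_i = b1 + b2*x_i + eps_i
    with AR(1) errors eps_i = g*eps_{i-1} + eta_i."""
    g = 0.0                                     # first step: g = 0 gives OLS
    for _ in range(max_iter):
        # step 1: OLS on quasi-differenced data estimates b1*(1-g) and b2
        ys = y[1:] - g * y[:-1]
        xs = x[1:] - g * x[:-1]
        A = np.column_stack([np.ones(len(xs)), xs])
        (c, b2), _, _, _ = np.linalg.lstsq(A, ys, rcond=None)
        b1 = c / (1.0 - g)
        # step 2: regress residuals on lagged residuals to update g
        e = y - b1 - b2 * x
        g_new = (e[1:] @ e[:-1]) / (e[:-1] @ e[:-1])
        if abs(g_new - g) < tol:
            break
        g = g_new
    return b1, b2, g
```

Since the iterations may stop at a local minimum, it can help to rerun with other starting values, as noted above.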

As the regression model with AR(1) errors is a restriction of the more general model (5.51), this restriction can be tested in the usual way, for instance, by the Wald test. The regression model with AR(1) errors has been popular because it is simple and because the Cochrane–Orcutt estimator can be computed by iterated regressions. Nowadays more general dynamic models like (5.51) are often preferred, and this will be further discussed in Chapter 7.

Example 5.24: Interest and Bond Rates (continued)

We continue our analysis of the interest and bond rate data (XM511IBR). In Example 5.22 in the previous section we found clear evidence for the presence of serial correlation for these data. We now estimate the adjusted model (5.51), with the result shown in Panel 2 of Exhibit 5.33. Panel 1 contains for comparison the results of OLS. Both lagged terms (x_{i−1} and y_{i−1}) are significant, and γ̂_2 γ̂_4 + γ̂_3 = 0.252 · 0.290 − 0.080 = −0.007 is close to zero. The Wald test on the restriction γ_2 γ_4 + γ_3 = 0 in Panel 3 has a P-value of P = 0.64, so that this restriction is not rejected. The regression model with AR(1) errors is therefore not rejected, and the estimation results of this model are shown in Panel 4 of Exhibit 5.33. Including AR(1) errors leads to an increase of R². The Durbin–Watson statistic is more close to 2 (1.90 as compared with 1.45), but recall that for models with lags this statistic does not provide consistent estimates of the correlation between the residuals (see p. 362 and Exercise 5.11). Panel 5 contains the correlogram of the OLS residuals and of the residuals of the model with AR(1) errors, that is, of (5.52). The residuals of the model (5.52) still contain some significant correlations. This is an indication that this linear model is not correctly specified, but this should not be a surprise. Other models are needed for these data, as will be discussed in Chapter 7.

Example 5.25: Food Expenditure (continued)

In Example 5.23 we concluded that there exists significant serial correlation for the residuals of the linear food expenditure model of Example 5.20 (XM520FEX). As it makes no sense to include 'lagged' variables for cross section data, we consider instead another specification of the functional relation between income and food expenditure. To evaluate this model we will discuss (i) a non-linear model, (ii) the Breusch–Godfrey test for this non-linear model, and (iii) the outcome and interpretation of the test.

[Exhibit 5.33 Interest and Bond Rates (Example 5.24): regression models for AAA bond rates (1950.01–1999.12): the simple regression model (Panel 1: DUS3MT 0.2746, R² = 0.370, d = 1.447), the dynamic model with single lags (Panel 2: DUS3MT 0.252, DUS3MT(−1) −0.080, DAAA(−1) 0.290, R² = 0.421, d = 1.90), the Wald test for AR(1) errors (Panel 3: null hypothesis C(2)·C(4) + C(3) = 0, F = 0.215, P = 0.64), the simple regression model with AR(1) errors (Panel 4: DUS3MT 0.252, AR(1) 0.290, R² = 0.421, d = 1.90, convergence achieved after 3 iterations), and correlograms of residuals (Panel 5, for the residuals of Panel 1 on the left and for the residuals of Panel 4 on the right).]

(i) Non-linear food expenditure model

In Chapter 4 (p. 205) we considered a non-linear functional form for the budget data,

y_i = β_1 + β_2 x_{2i}^{β_3} + β_4 x_{3i} + ε_i,

where y_i is the fraction of total expenditure spent on food, x_2 is total expenditure (in $10,000 per year), and x_3 is the (average) household size of households in group i. Panel 1 of Exhibit 5.34 shows the resulting estimates, and the exhibit also contains a scatter diagram (in (b)) of the NLS residuals e_i against their lagged values e_{i−1}, with correlation r = 0.167 (as compared to r = 0.43 for the residuals of the linear model in Example 5.20 (see Exhibit 5.29(g))).

(ii) Breusch–Godfrey test for the non-linear model

To test whether the residual correlation is significant we apply the Breusch–Godfrey LM-test for the non-linear model. Step 1 of this test consists of NLS, with NLS residuals e_i. To perform step 2, we first reformulate the non-linear model with AR(1) error terms as a non-linear regression model, similar to (5.48) for the linear model. The AR(1) model is ε_i = γ ε_{i−1} + η_i, where η_i ~ NID(0, σ_η²), and the non-linear model can be written in terms of the independent error terms η_i as

y_i = β_1(1 − γ) + γ y_{i−1} + β_2 x_{2i}^{β_3} − γ β_2 x_{2,i−1}^{β_3} + β_4 x_{3i} − γ β_4 x_{3,i−1} + η_i.

This is a non-linear regression model y_i = f(x_i, θ) + η_i with 6 × 1 vector of regressors x_i = (1, y_{i−1}, x_{2i}, x_{2,i−1}, x_{3i}, x_{3,i−1})' and with 5 × 1 parameter vector θ = (β_1, β_2, β_3, β_4, γ)'. Now step 2 of the Breusch–Godfrey test can be performed as described in Section 4.2.4 (p. 217–8) for LM-tests, that is, the NLS residuals e_i are regressed on the gradient ∂f/∂θ, evaluated at the restricted NLS estimates θ̂ = (b_1, b_2, b_3, b_4, 0), with γ = 0 and with the NLS estimates of the other parameters. The regressors in step 2 are therefore (for γ = 0) given by

∂f/∂β_1 = 1 − γ = 1,
∂f/∂β_2 = x_{2i}^{b_3} − γ x_{2,i−1}^{b_3} = x_{2i}^{b_3},
∂f/∂β_3 = b_2 x_{2i}^{b_3} log(x_{2i}) − γ b_2 x_{2,i−1}^{b_3} log(x_{2,i−1}) = b_2 x_{2i}^{b_3} log(x_{2i}),
∂f/∂β_4 = x_{3i} − γ x_{3,i−1} = x_{3i},
∂f/∂γ = y_{i−1} − β_1 − β_2 x_{2,i−1}^{β_3} − β_4 x_{3,i−1} = y_{i−1} − b_1 − b_2 x_{2,i−1}^{b_3} − b_4 x_{3,i−1} = e_{i−1}.

[Exhibit 5.34 Food Expenditure (Example 5.25): non-linear regression model for the budget data (Panel 1: NLS estimates of FRACFOOD = C(1) + C(2)·TOTCONS^C(3) + C(4)·AHSIZE over the 48 groups, with C(3) = b_3 = 0.4126; convergence achieved after 7 iterations), scatter plot of the residuals against their lags within segments ((b), with n = 42 and r = 0.167), and the auxiliary regression of step 2 of the Breusch–Godfrey test on serial correlation (Panel 3: regression of RESNONLIN on the step 2 regressors and RESNONLIN(−1), with R² = 0.158; the coefficient of RESNONLIN(−1) has P = 0.2065).]

Therefore the required regression in step 2 is

e_i = δ_1 + δ_2 x_{2i}^{b_3} + δ_3 x_{2i}^{b_3} log(x_{2i}) + δ_4 x_{3i} + γ e_{i−1} + ω_i.

The Breusch–Godfrey test is LM = nR² of this regression.

(iii) Outcome and interpretation of the test

The results in Panel 3 of Exhibit 5.34 show that LM = nR² = 42 · 0.158 = 6.64 with P-value P = 0.010, as compared to P = 0.007 for the linear model in Panel 4 of Exhibit 5.32 (there are forty-two relevant observations because residuals in different segments should not be compared to each other (see Example 5.23)). This indicates that there still exists significant serial correlation, although the coefficient of e_{i−1} is not significant (P = 0.207 (see Panel 3 of Exhibit 5.34)). So the above simple non-linear model does not capture all the non-linear effects of the variable x_2 on y, but it is an improvement as compared to the linear model. This example shows that serial correlation tests can be applied as diagnostic tools for cross section data, provided that the observations are ordered in a meaningful way.

Example 5.26: Industrial Production

Whereas the two foregoing examples were concerned with data from finance and microeconomics, serial correlation is also often a relevant issue for macroeconomic time series. For instance, serial correlation may result because of prolonged up- and downswings of macroeconomic variables from their long-term growth path. Although the discussion of time series models is postponed till Chapter 7, we will now give a brief illustration (data file XM526INP). We will discuss (i) the data, (ii) a simple trend model, (iii) tests on serial correlation, and (iv) interpretation of the result.

(i) The data

We consider quarterly data on industrial production in the USA over the period 1950.1 until 1998.3. The data are taken from the OECD main economic indicators.

(ii) A simple trend model

We denote the series of industrial production by INP. In order to model the exponential growth of this series, we fit a linear trend to the logarithm of this series. That is, we estimate the simple regression model

log(INP_i) = α + β x_i + ε_i,

where x_i = i denotes the linear trend.
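A small sketch of this trend regression (ours, assuming a numpy array inp with the quarterly production levels); the slope estimate is the quarterly growth rate of the log series, and the Durbin–Watson statistic is computed from the residuals as a first diagnostic:

```python
import numpy as np

def loglinear_trend(inp):
    """Fit log(INP_i) = a + b*i by OLS; b approximates the quarterly growth rate."""
    y = np.log(np.asarray(inp))
    i = np.arange(1, len(y) + 1)
    A = np.column_stack([np.ones_like(y), i])
    (a, b), _, _, _ = np.linalg.lstsq(A, y, rcond=None)
    e = y - (a + b * i)
    d = np.sum(np.diff(e) ** 2) / np.sum(e ** 2)   # Durbin-Watson statistic
    return a, b, d
```

The implied yearly growth rate is approximately 4b.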

The result is shown in Panel 1 of Exhibit 5.35. The estimated quarterly growth rate is around 0.8 per cent, corresponding to a yearly growth rate of around 3.3 per cent.

(iii) Tests on serial correlation

The Durbin–Watson statistic is very close to zero (d = 0.085), indicating a strong positive serial correlation in the residuals. This is also clear from the autocorrelations of the residuals in Panel 2 of Exhibit 5.35. Both the Ljung–Box test in Panel 2 and the Breusch–Godfrey test in Panel 3 (with p = 2) strongly reject the absence of serial correlation.

[Exhibit 5.35 Industrial Production (Example 5.26): linear trend model for industrial production (in logarithms) (Panel 1, @TREND denotes the linear trend: trend coefficient 0.0082, R² = 0.967, d = 0.085), correlogram of the residuals (Panel 2: the autocorrelations decline only slowly from r_1 = 0.941, and all Q-statistics have P = 0.000), and Breusch–Godfrey test (Panel 3: P = 0.000000).]

(iv) Interpretation

The time plot of the residuals in Exhibit 5.36 shows that the growth was above average for a long period, from around 1965 to 1980. Such prolonged deviations from the linear trend line indicate that this simple linear trend model misses important dynamical aspects of the time series. More realistic models for this series will be presented in Chapter 7.

[Exhibit 5.36 Industrial Production (Example 5.26): actual and fitted values of US quarterly industrial production (in logarithms, right vertical axis) and plot of residuals with 95% confidence interval (left vertical axis).]

Exercises: T: 5.10; E: 5.27c, 5.29a.

5.5.5 Summary

If the error terms in a regression model are serially correlated, this means that the model misses some of the systematic factors that influence the dependent variable. One should then try to find the possible causes and to adjust the model accordingly. The following steps may be helpful in the diagnostic analysis.

- Order the observations in a natural way. The ordering is evident for time series data, where the variables are observed sequentially over time, and serial correlation is one of the major issues for such data (see Chapter 7). In the case of cross section data, the analysis of serial correlation makes sense only if the observations are ordered in some meaningful way.
- Check whether serial correlation is present, by drawing the correlogram of the residuals and by performing tests, in particular the Breusch–Godfrey LM-test.

- If serial correlation is present, OLS is no longer efficient and the usual formulas for the standard errors do not apply. The best way to deal with serial correlation is to adjust the model so that the correlation disappears. This may sometimes be achieved by adjusting the specification of the functional relation, for instance, by including lagged variables in the model (this is further discussed in Chapter 7).
- If it is not possible to adjust the model to remove the serial correlation, then OLS can be applied with Newey–West standard errors.

5.6 Disturbance distribution

5.6.1 Introduction

Weighted influence of individual observations

In ordinary least squares, the regression parameters are estimated by minimizing the criterion

S(b) = Σ_{i=1}^n (y_i − x_i'b)².

This means that errors are penalized in the same way for all observations and that large errors are penalized more than proportionally. An alternative is to apply weighted least squares, where the errors are not all penalized in the same way. For instance, for time series data the criterion

S_w(b) = Σ_{t=1}^n w^{n−t} (y_t − x_t'b)²  (5.53)

(with 0 < w < 1) assigns larger weights to more recent observations. This criterion may be useful, for instance, when the parameters β vary over time, so that the most recent observations contain more information on the current parameter values than the older observations. For time-varying parameters the criterion (5.53) allows for relatively larger residuals for older observations.

If the outcomes depend heavily on only a few observations, it is advisable to investigate the validity of these data. It may be that outlying observations are caused by special circumstances that fall outside the scope of the model. If the explanation of outlying data falls within the purpose of the analysis, then the specification of the model should be reconsidered. Otherwise, the influence of such data can be reduced by using a less sensitive criterion function, for example,

S_abs(b) = Σ_{i=1}^n |y_i − x_i'b|.

If the outcomes of estimation and testing methods are less sensitive to individual observations and to the underlying model assumptions, then such methods are called robust.

Overview

In Section 5.4 the use of weighted least squares was motivated by heteroskedastic error terms. In Section 5.6.2 we investigate the question which observations are the most influential ones in determining the outcomes of an ordinary least squares regression. Section 5.6.3 contains a test for normality of the disturbances, and robust methods are discussed in Section 5.6.4.
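A minimal sketch of weighted least squares with the geometrically declining weights of (5.53) (ours, assuming numpy arrays and 0 < w < 1):

```python
import numpy as np

def discounted_ls(y, X, w):
    """WLS with weights w^(n-t), t = 1, ..., n, as in criterion (5.53)."""
    n = len(y)
    weights = w ** (n - 1 - np.arange(n))   # w^(n-t); most recent weight is 1
    Xw = X * weights[:, None]
    # normal equations: (X' W X) b = X' W y
    return np.linalg.solve(Xw.T @ X, Xw.T @ y)
```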

5. So the mean leverage is equal to k=n. Robust methods are discussed in Section 5.6 Disturbance distribution 379 outside the scope of the model.12). we use the hat-matrix H deﬁned by (see Section 3. Á Á Á . The leverages satisfy 0 hj 1 Pthe n and j¼1 hj ¼ k (see Exercise 5. (5:55) .6. The value hj is called leverage of the jth observation. Sabs (b) ¼ n X i¼1 jyi À x0i bj: If the outcomes of estimation and testing methods are less sensitive to individual observations and to the underlying model assumptions.4. we consider the model with a dummy variable for the jth observation — that is. then such methods are called robust.1. 123)) H ¼ X(X0 X)À1 X0 : ^ of the dependent variable y is given by ^ The explained part y y ¼ Xb ¼ Hy. yi ¼ x0i b þ gDji þ ei . A large leverage hj means that the values of the explanatory variables xj are somewhat unusual as compared to the average of these values over the sample.6.2 Regression diagnostics The leverage of an observation To characterize inﬂuential data in the regression model y ¼ Xb þ e. (5:54) where the 1 Â k vector x0j is the jth row of the n Â k matrix X. Characterization of outliers An observation is called an outlier if the value of the dependent variable yj differs substantially from what would be expected from the general pattern of the other observations. n.5. To test whether the jth observation is an outlier. The inﬂuence of such data can be reduced by using a less sensitive criterion function — for example.3 (p. The jth diagonal element of H is denoted by hj ¼ x0j (X0 X)À1 xj . i ¼ 1.

The null hypothesis that the jth observation fits in the general pattern of the data corresponds to the null hypothesis that γ = 0, and this can be tested by the t-test. The jth observation is an outlier if γ̂ is significant, that is, if the residual e_j or the leverage h_j is sufficiently large. Let s_j² be the OLS estimator of σ² based on the model (5.55) (for n observations, including the dummy). Then the t-value of γ̂ in (5.55) is given by

e_j* = γ̂ / (s_j/√(1 − h_j)) = e_j / (s_j √(1 − h_j)).  (5.56)

This statistic follows the t(n − k − 1) distribution under the null hypothesis that γ = 0. The residuals e_j* are called the studentized residuals.

Derivation of studentized residuals

Let D_j denote the n × 1 vector with elements D_{ji}; then the model in (5.55) can be written as y = Xβ + D_j γ + ε. According to the result of Frisch–Waugh in Section 3.2 (p. 146), the OLS estimator of γ is given by

γ̂ = (D_j'MD_j)^{−1}D_j'My = (D_j'D_j − D_j'X(X'X)^{−1}X'D_j)^{−1}D_j'e = (1 − x_j'(X'X)^{−1}x_j)^{−1}e_j = e_j/(1 − h_j).

Here M = I − H, and e = My is the usual vector of OLS residuals in the model y = Xβ + ε, that is, in (5.55) with γ = 0. If ε ~ N(0, σ²I), then e = Mε ~ N(0, σ²M), and e_j = D_j'e ~ N(0, σ²(1 − h_j)), so that γ̂ ~ N(0, σ²/(1 − h_j)).

If one uses the rule of thumb |t| > 2 for significance, then one may expect that 5 per cent of all observations are 'outliers'. Such 'ordinary' outliers are of no concern, but one should pay attention to large outliers (with t-values further away from zero) and try to understand the cause of such outliers, as this may help to improve the model. This should not be interpreted as an advice to include dummies in the model for each outlier; note that the dummy variable is included only to compute the studentized residual.

The 'leave-one-out' interpretation of studentized residuals

The jth studentized residual can also be obtained by leaving out the jth observation. That is, perform a regression in the model y_i = x_i'β + ε_i using the (n − 1) observations with i ≠ j, so that the jth observation is excluded. Let b(j) and s²(j) be the corresponding OLS estimators of β and σ². It is left as an exercise (see Exercise 5.12) to show that b(j) is the OLS estimator of β in (5.55), that s²(j) = s_j² (that is, the OLS estimator of σ² in (5.55)), and that y_j − x_j'b(j) = γ̂ in (5.55). With these results it follows from (5.56) that the studentized residuals can be computed as

e_j* = (y_j − x_j'b(j)) / (s(j)/√(1 − h_j)).
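The leave-one-out quantities need not be computed by n separate regressions; standard closed forms give them from a single OLS fit. A sketch (ours, assuming numpy arrays; the formula s(j)² = (e'e − e_j²/(1 − h_j))/(n − k − 1) is the usual leave-one-out variance identity):

```python
import numpy as np

def studentized_residuals(X, y):
    """Studentized residuals e*_j = e_j / (s(j) * sqrt(1 - h_j)), see (5.56)."""
    n, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y
    e = y - X @ b
    h = np.einsum('ij,jk,ik->i', X, XtX_inv, X)        # leverages
    s2_j = (e @ e - e ** 2 / (1 - h)) / (n - k - 1)    # leave-one-out variances
    return e / np.sqrt(s2_j * (1 - h))
```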

The studentized residual can be interpreted in terms of the Chow forecast test of observation j, where the forecast is based on the model estimated from the (n − 1) observations i ≠ j (see Exercise 5.12). If e_j* is large, this means that y_j cannot be predicted well from the other observations. In this sense the jth observation then is an outlier.

OLS may not detect outliers

It should be noted that outliers may not always be detected from the plot of OLS residuals e_i. This is because OLS tries to prevent very large residuals. For instance, this can occur if the leverage of the observation is large, so that e_j may be small even if e_j* is large. This is illustrated by a simulation in Exhibit 5.37. The outlier (corresponding to the first observation, with j = 1) is not detected from the residuals if we include all observations (a–b), but it is revealed very clearly if the outlier observation is excluded from the regression (c–e). Panels 7 and 8 of Exhibit 5.37 illustrate that the estimates of β and σ² and the sum of squared residuals (SSR) in (5.55) are the same as the estimates obtained by deleting the outlier observation. The R² of (5.55) is much larger. This is simply caused by the fact that the total sum of squares is much larger for the set of all observations (SST = 1410 in Panel 8) than for the set of observations excluding the outlier (SST = 285 in Panel 7).

Influence on parameter estimates: 'dfbetas'

The influence of individual observations on the estimates of β can be evaluated as follows. Let b be the usual OLS estimator in (5.55) under the restriction that γ = 0, with residuals e, and let b(j) and γ̂ be the OLS estimators in (5.55), the dummy included, with residuals e(j). Then y = Xb + e = Xb(j) + D_j γ̂ + e(j), so that

X(b − b(j)) − D_j γ̂ − e(j) + e = 0.

If we premultiply this with X' and use that X'e = 0, X'e(j) = 0, and X'D_j = x_j, then we obtain

b − b(j) = (X'X)^{−1}X'D_j γ̂ = (1/(1 − h_j)) (X'X)^{−1} x_j e_j.  (5.57)

[Exhibit 5.37 Outliers and OLS: scatter diagrams and residuals for simulated data with an outlier at the first observation ((a)–(b): n = 25, with the outlier; (c)–(d): n = 24, without the outlier), studentized residuals (e), and regressions with the outlier included (Panel 6), with the outlier removed (Panel 7), and with a dummy for the outlier observation (Panel 8). Panels 7 and 8 give the same estimates of the slope and of σ², with sum of squared residuals 223.5 in both cases, but the total sum of squares is 284.7 in Panel 7 and 1409.7 in Panel 8.]

It is preferable to make the difference b_l − b_l(j) in the lth estimated parameter, owing to the jth observation, invariant with respect to the measurement scale of the explanatory variable x_l. Therefore this difference is scaled with an estimate of the standard deviation of b_l, for example, s_j √a_{ll}, where a_{ll} is the lth diagonal element of (X'X)^{−1}. This gives the dfbetas defined by

dfbetas_{lj} = (b_l − b_l(j)) / (s_j √a_{ll}).  (5.58)

It is left as an exercise (see Exercise 5.13) to show that (under appropriate conditions) the variance of the 'dfbetas' is approximately 1/n. So the difference in the parameter estimates can be stated to be significant if the value of (5.58) is (in absolute value) larger than 2/√n.

Influence on fitted values: 'dffits'

The influence of the jth observation on the fitted values is given by ŷ − ŷ(j), where ŷ = Xb and ŷ(j) = Xb(j). Because of (5.57), the difference in the fitted values for y_j is given by

ŷ_j − ŷ_j(j) = x_j'(b − b(j)) = (h_j/(1 − h_j)) e_j.

As e_j = y_j − ŷ_j, it follows that

ŷ_j = h_j y_j + (1 − h_j) ŷ_j(j).

Therefore, the leverage h_j gives the relative weight of the observation y_j itself in constructing the predicted value for the jth observation. Because the variance of ŷ_j is equal to E[(x_j'b − x_j'β)²] = σ² x_j'(X'X)^{−1}x_j = σ² h_j, a scale invariant measure for the difference in fitted values is given by the dffits defined by

dffits_j = (ŷ_j − ŷ_j(j)) / (s_j √h_j) = (√h_j e_j) / ((1 − h_j) s_j) = √(h_j/(1 − h_j)) e_j*.

Also in this respect, the jth observation is influential if the studentized residual or the leverage is large. In particular, if h_j is large, then the jth observation may be difficult to fit from the other observations. As var(e_j*) ≈ 1 and h_j is generally very small for large enough sample sizes, it follows that var('dffits') ≈ h_j var(e_j*) ≈ h_j. As Σ_{j=1}^n h_j = k, the average variance is approximately k/n. Therefore, differences in fitted values can be stated to be significant if 'dffits' is larger (in absolute value) than 2√(k/n).
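Both diagnostics follow from the closed form (5.57), so no leave-one-out regressions are needed. A sketch (ours, assuming numpy arrays):

```python
import numpy as np

def influence_measures(X, y):
    """dfbetas (5.58) and dffits for each observation, via (5.57)."""
    n, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y
    e = y - X @ b
    h = np.einsum('ij,jk,ik->i', X, XtX_inv, X)        # leverages
    s2_j = (e @ e - e ** 2 / (1 - h)) / (n - k - 1)    # leave-one-out variances
    estar = e / np.sqrt(s2_j * (1 - h))                # studentized residuals
    dbeta = (XtX_inv @ X.T) * (e / (1 - h))            # column j holds b - b(j)
    a = np.sqrt(np.diag(XtX_inv))
    dfbetas = dbeta / (a[:, None] * np.sqrt(s2_j)[None, :])
    dffits = estar * np.sqrt(h / (1 - h))
    return dfbetas, dffits
```

Values of dfbetas beyond 2/√n and of dffits beyond 2√(k/n) then flag the influential observations discussed above.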

What to do with influential observations?

If the most influential observations in the data set are detected, the question arises what to do with these observations. If the influential observations do not fit well in the general pattern of the data, one may be tempted to delete them from the analysis. However, one should be careful not to remove important sample information. Influential observations can be the most important pieces of information, in which case their large influence is justified. In any case one should check whether these observations are correctly reported, and one should investigate whether outliers can possibly be explained in terms of additional explanatory variables.

Example 5.27: Stock Market Returns (continued)

As an illustration we consider the stock market returns data (XM527SMR) that were introduced in Example 2.1 (p. 76–7). These data were previously analysed in Chapter 4 (pp. 223–4, 243–6, and 262–5), where among others we analysed the possibility of fat tails; now we will apply regression diagnostics on these data. Here we consider data on excess returns in the sector of cyclical consumer goods and in the whole market in the UK. The CAPM corresponds to the simple regression model y_i = α + βx_i + ε_i, where y_i is the excess return of the sector and x_i that of the market. We will discuss (i) the data and the possibility of outliers and (ii) the analysis of influential data.

(i) The data and the possibility of outliers

Financial markets are characterized by sudden deviations from normal operation, caused by crashes or moments of excessive growth.

(ii) Analysis of influential data

The data consist of monthly observations over the period from January 1980 to December 1999. Exhibit 5.38 provides graphical information (leverages, studentized residuals, dfbetas, and dffits), and Exhibit 5.39 displays the characteristics for some of the data points. The observation in October 1987 (when a crash took place) has a very large leverage but a small studentized residual, so that this is not an outlier. Such observations are helpful in estimation, as they fit well in the estimated model and reduce the standard errors of the estimated parameters. On the other hand, the observations in September 1980 and September 1982 are outliers, but the leverages of these observations are small.

5. Panel 6). Panel 4). studentized residuals (Panel 7). and dfﬁts (Panel 9).1 0.0 −0.40 −2 5.27) Time plots of excess returns in market (Panel 1) and in sector of cyclical consumer goods (Panel 2).08 10 0 −10 −20 0.2 0.1 0. leverages (Panel 5).2 −0.38 Stock Market Returns (Example 5.3 0. .3 80 82 84 86 88 90 92 94 96 98 −0.0 0.35 80 82 84 86 88 90 92 94 96 98 −4 80 82 84 86 88 90 92 94 96 98 Panel 6: sj Panel 7: studresid 0.55 5.1 −0.16 0. scatter diagram of excess returns in sector against market (Panel 3).04 −30 80 82 84 86 88 90 92 Panel 4: e 94 96 98 0.3 0.00 80 82 84 86 88 90 92 Panel 5: leverage 94 96 98 5.60 4 5. The dashed horizontal lines in Panels 4 and 7–9 denote 95% conﬁdence intervals.50 2 0 5.1 −0.3 80 82 84 86 88 90 92 94 96 98 Panel 8: dfbetas Panel 9: dffits Exhibit 5.6 Disturbance distribution 20 10 385 0 −10 40 −20 80 82 84 86 88 90 92 94 Panel 1: RENDMARK 96 98 20 RENDCYCO −30 0 30 20 10 0 −20 −10 −20 −30 −40 80 82 84 86 88 90 92 94 96 98 −40 −30 −20 −10 0 RENDMARK 10 20 Panel 3: Scatter diagram Panel 2: RENDCYCO 20 0. regression of excess sector returns on excess market returns with corresponding residuals (e.2 0.45 5. dfbetas (Panel 8).12 0.2 −0. standard deviations (sj .

[Exhibit 5.39 Stock Market Returns (Example 5.27): characteristics of some selected influential observations in the CAPM over the period 1980.01–1999.12 (n = 240 observations): the residual e_j, leverage h_j, studentized residual e_j*, dfbetas, and dffits for the months 1980:06, 1980:09, 1981:04, 1981:09, 1982:09, 1983:04, 1987:10, and 1991:02. The 5% critical values are ±2s = ±11.086 for the residuals, 2/n = 0.008 for the leverages, ±2 for the studentized residuals, ±2/√n = ±0.129 for the dfbetas, and ±2√(2/n) = ±0.183 for the dffits; values that differ significantly from zero are marked with an asterisk in the original table.]

Exercises: T: 5.12a–c, 5.13; E: 5.29c, d, 5.30c, d, 5.31e, 5.33a.

5.6.3 Test for normality

Skewness and kurtosis

As was discussed in Chapter 4, OLS is equivalent to maximum likelihood if the error terms are normally distributed. So under this assumption OLS is an optimal estimation method, in the sense that it is consistent and (asymptotically) efficient. For this reason it is of interest to test Assumption 7 of normally distributed error terms. It is also of interest for other reasons, for example, because many econometric tests (like the t-test and the F-test) are based on the assumption of normally distributed error terms. Suppose that the standard Assumptions 1–6 of the regression model are satisfied. This means that

y_i = x_i'β + ε_i,  i = 1, ..., n,

where E[ε_i] = 0, E[ε_i²] = σ², and E[ε_i ε_j] = 0 for all i ≠ j. Then Assumption 7 of normally distributed disturbances can be tested by means of the OLS residuals e_i = y_i − x_i'b. That is, we can compare the sample moments of the residuals with the theoretical moments of the disturbances under the null hypothesis of the normal distribution. Under normality there holds E[ε_i³] = 0 and E[ε_i⁴] = 3σ⁴, so that the skewness (S) and kurtosis (K) are equal to

S = E[ε_i³]/σ³ = 0,
K = E[ε_i⁴]/σ⁴ = 3.

If the null hypothesis of normality is true, then the residuals e_i should have a skewness close to 0 and a kurtosis close to 3. Suppose that the model contains a constant term, so that the sample mean of the residuals is zero. Then the jth moment of the residuals is given by m_j = Σ_{i=1}^n e_i^j/n, and the skewness and kurtosis are computed as

S = m_3/(m_2)^{3/2},  K = m_4/m_2².

It can be shown that, under the null hypothesis of normality, √(n/6) S and √(n/24)(K − 3) are asymptotically independently distributed as N(0, 1). These results can be used to perform individual tests for the skewness and kurtosis.

Jarque–Bera test on normality

The skewness and kurtosis can also be used jointly to test for normality, and the deviation from normality can be measured by

JB = (√(n/6) S)² + (√(n/24) (K − 3))² = n((1/6) S² + (1/24)(K − 3)²) ≈ χ²(2).

This is the Jarque–Bera test on normality. Here we will not derive the asymptotic χ²(2) distribution, but note that the null hypothesis poses two conditions (S = 0 and K = 3), so that the test statistic has two degrees of freedom. The null hypothesis is rejected for large values of JB.

Example 5.28: Stock Market Returns (continued)

We continue our analysis of Example 5.27 in the previous section and consider the Capital Asset Pricing Model (CAPM) of Example 2.5 (p. 91) for the sector of cyclical consumer goods. The data consist of monthly observations over the period 1980–99 (n = 240). Exhibit 5.40 (a, b) shows the time series plot and the histogram of the residuals. The skewness and kurtosis are equal to S = −0.28 and K = 4.04. This gives values of √(n/6) S = −1.77 and √(n/24)(K − 3) = 3.30. The corresponding (two-sided) P-value for the hypothesis that S = 0 is P = 0.08, and for the hypothesis that K = 3 it is P = 0.001. So the residuals have a considerably larger kurtosis than the normal distribution. The Jarque–Bera test has value JB = (−1.77)² + (3.30)² = 14.06, with P-value 0.001, so the assumption of normality is rejected. The two extreme observations were detected as outliers in Example 5.27. Exhibit 5.40(c) shows the histogram that results when these two extremely large negative residuals (in the months of September 1980 and September 1982) are removed. This has a large effect on the skewness and kurtosis, and the assumption of normality is no longer rejected. This indicates that for the majority of the sample period the assumption of normality is a reasonable one.
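A sketch of the Jarque–Bera computation from the residual moments defined above (ours, assuming a numpy array of residuals with zero mean):

```python
import numpy as np

def jarque_bera(e):
    """JB = n * (S^2/6 + (K-3)^2/24); approx. chi-squared(2) under normality."""
    e = np.asarray(e)
    n = len(e)
    m2 = np.mean(e ** 2)
    S = np.mean(e ** 3) / m2 ** 1.5     # skewness m3 / m2^(3/2)
    K = np.mean(e ** 4) / m2 ** 2       # kurtosis m4 / m2^2
    return n * (S ** 2 / 6.0 + (K - 3.0) ** 2 / 24.0)
```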

[Exhibit 5.40 Stock Market Returns (Example 5.28): time plot of the residuals of the CAPM (a) and histograms of all residuals (n = 240 (b): skewness −0.28, kurtosis 4.04, Jarque–Bera 14.06 with P = 0.0009) and of the residuals with the two outliers removed (n = 238 (c): Jarque–Bera 0.72 with P = 0.70).]

Exercises: E: 5.28d, 5.31e.

5.6.4 Robust estimation

Motivation of robust methods

If we apply OLS, then all the observations are weighted in a similar way. The regression model (5.55) with the dummy variable D_j, on the other hand, effectively removes all effects of the jth observation on the estimate of β. If this observation is very influential but not very reliable, it may indeed make sense to remove it. Sometimes, however, there are several or even a large number of outlying data points, and it may be undesirable to neglect them completely. An alternative is to use another estimation criterion that assigns relatively less weight to extreme observations as compared to OLS. Such estimation methods are called robust, because the estimation results are relatively insensitive to changes in the data.

As a simple illustration, we first consider the situation where the data consist of a random sample y_i, i = 1, ..., n, and we want to estimate the centre of location of the population. This centre is more robustly estimated by the sample median than by the sample mean, as the following simulation example illustrates.

Example 5.29: Simulated Data of Normal and Student t(2) Distributions

To illustrate the idea of robust estimation we consider two data generating processes. The first one is the standard normal distribution N(0, 1); in this case the sample mean is an efficient estimator. The second one is the Student t-distribution with two degrees of freedom, t(2). This distribution has mean zero and infinite variance, and it has very fat tails, so that outliers occur frequently; in this case the mean is an inefficient estimator. Exhibit 5.41 shows summary statistics of simulated data from the two distributions. The sample sizes are n = 10, 25, 100, and 400, with 1000 replications for each sample size. For every replication the mean and median of the sample are computed as estimates of the centre of location. The exhibit reports the range (the difference between the maximum and minimum values of these estimates over the 1000 replications) and the (non-centred) sample standard deviation √(Σ m̂_j²/1000) of the estimates m̂_j over the replications. It clearly shows that the mean is the best estimator if the population is normally distributed and that the median is best if the population has the t(2) distribution.

[Exhibit 5.41 Simulated Data of Normal and Student t(2) Distribution (Example 5.29): sample standard deviation and range of the sample mean and the sample median over 1000 simulation runs of the two DGPs (N(0, 1) and t(2)) for the sample sizes n = 10, 25, 100, and 400. For the normal population the mean has the smaller standard deviation and range at every sample size, whereas for the t(2) population the median is clearly superior.]
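The experiment is easy to replicate; a sketch (ours, with an arbitrary random seed) of the simulation design just described:

```python
import numpy as np

rng = np.random.default_rng(0)
samplers = {"N(0,1)": lambda n: rng.standard_normal(n),
            "t(2)":   lambda n: rng.standard_t(2, size=n)}

def noncentred_sd(m):
    """Non-centred standard deviation sqrt(sum(m_j^2)/1000) over replications."""
    return np.sqrt(np.mean(m ** 2))

for name, draw in samplers.items():
    for n in (10, 25, 100, 400):
        means = np.array([np.mean(draw(n)) for _ in range(1000)])
        medians = np.array([np.median(draw(n)) for _ in range(1000)])
        print(name, n,
              noncentred_sd(means), np.ptp(means),      # mean: sd and range
              noncentred_sd(medians), np.ptp(medians))  # median: sd and range
```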

That is, for distributions with fat tails the median is a more robust estimator than the mean.

Robust estimation criteria

Now we consider the model y_i = x_i'β + ε_i, and we suppose that Assumptions 1–6 are satisfied. Further we suppose that β is estimated by minimizing a criterion function of the form

S(β̂) = Σ_{i=1}^n G(y_i − x_i'β̂) = Σ_{i=1}^n G(e_i),  (5.59)

where we write e_i = y_i − x_i'β̂ for the residuals. The function G is assumed to be differentiable, with derivative g(e_i) = dG(e_i)/de_i. The first order conditions for a minimum of (5.59) are given by

∂S(β̂)/∂β̂ = −Σ_{i=1}^n g(e_i) x_i = 0.  (5.60)

If one defines the weights w_i = g(e_i)/e_i, this can also be written as

Σ_{i=1}^n w_i e_i x_i = 0.  (5.61)

Ordinary least squares corresponds to the choice

G(e_i) = ½ e_i²,  g(e_i) = e_i,  w_i = 1.

The function g(e_i) measures the influence of outliers in the first order conditions (5.60) for the estimator β̂. In ordinary least squares this influence is a linear function of the residuals. A more robust estimator, that is, an estimator that is less sensitive to outliers, is obtained by choosing

G(e_i) = |e_i|,  g(e_i) = −1 for e_i < 0 and g(e_i) = +1 for e_i > 0.

We call this criterion function the least absolute deviation (LAD). If the observations consist of a random sample, that is, y_i = μ + ε_i for i = 1, ..., n, then OLS gives μ̂ = Σ y_i/n and LAD gives μ̂ = med(y_i), the median of the observations (see Exercise 5.14). As our simulation in Example 5.29 illustrates, LAD is more robust than OLS, but some efficiency is lost if the disturbances are normally distributed. The attractive properties of both methods (OLS and LAD) can be combined by using, for instance, the following criterion:

G(e_i) = ½ e_i² if |e_i| ≤ c,  G(e_i) = c|e_i| − ½c² if |e_i| > c.  (5.62)

This criterion was proposed by Huber. The derivative of G is given by

g(e_i) = −c if e_i < −c,  g(e_i) = e_i if −c ≤ e_i ≤ c,  g(e_i) = c if e_i > c.  (5.63)

The corresponding estimator of β gives a compromise between the efficiency (for normally distributed errors) of OLS (obtained for c → ∞) and the robustness of LAD (obtained for c ↓ 0). The influence of outliers is reduced because (5.63) imposes a threshold on the function g(e_i). Relatively small residuals have a linear influence and constant weights, and large residuals have constant influence and declining weights. In Exhibit 5.42 the Huber criterion is compared with OLS and LAD.

[Exhibit 5.42 Three estimation criteria: criterion functions (G in (a–c)), first order derivatives (influence functions g in (d–f)), and weights (w_i in Σ w_i e_i x_i = 0, in (g–i)) of the three criteria OLS ((a), (d), (g)), LAD ((b), (e), (h)), and Huber ((c), (f), (i)).]
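The Huber influence function (5.63) and the corresponding weights w_i = g(e_i)/e_i in (5.61) take a simple closed form; a sketch (ours, assuming numpy arrays):

```python
import numpy as np

def huber_g(e, c):
    """Influence function (5.63): linear inside [-c, c], constant outside."""
    return np.clip(e, -c, c)

def huber_weights(e, c):
    """Weights w_i = g(e_i)/e_i: 1 for small residuals, c/|e_i| for large ones."""
    return np.where(np.abs(e) <= c, 1.0, c / np.abs(e))
```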

Remarks on statistical properties

In general, the equations (5.60) that define the estimator $\hat{\beta}$ are non-linear and should be solved by numerical methods. The initial estimate is of importance, and it is advisable to use a robust initial estimate, even if this may be inefficient. The statistical properties of the estimator can be derived by noting that (5.60) corresponds to a GMM estimator with moment functions $g_i = -g(e_i) x_i$ (see Section 4.4.2 (p. 253)). The estimator is consistent, provided that $E[g(\varepsilon_i) x_i] = 0$. If the regressors $x_i$ are not stochastic, this gives the condition $E[g(\varepsilon_i)] = 0$. For OLS, with $g(\varepsilon_i) = \varepsilon_i$, this is guaranteed by Assumption 2, as this states that $E[\varepsilon_i] = 0$. For LAD this condition means that $P(\varepsilon_i > 0) = P(\varepsilon_i < 0)$, that is, that the median of the distribution of $\varepsilon_i$ is zero. In large enough samples, approximate standard errors can be obtained from the asymptotic results on GMM in Section 4.4.3 (p. 258).

Interpretation of robust estimation in terms of ML

Robust estimation can also be interpreted in terms of maximum likelihood estimation by an appropriate choice of the probability distribution of the error terms $\varepsilon_i$. Let $\varepsilon_i$ have density function $f$ and let $l_i = \log(f(e_i))$, where $e_i = y_i - x_i'\hat{\beta}$ for a given estimate $\hat{\beta}$ of $\beta$. Then the log-likelihood is given by $\log L = \sum l_i = \sum \log(f(e_i))$, and ML corresponds to the minimization of $(-\log L)$ with first order conditions

$$ -\frac{\partial \log L}{\partial \hat{\beta}} = -\sum_{i=1}^{n} \frac{\partial \log(f(e_i))}{\partial \hat{\beta}} = -\sum_{i=1}^{n} \frac{d \log(f(e_i))}{d e_i} \frac{\partial e_i}{\partial \hat{\beta}} = \sum_{i=1}^{n} \frac{f'(e_i)}{f(e_i)} x_i = 0, $$

where $f'(e_i) = df(e_i)/de_i$ is the derivative of $f$. This corresponds to the equations (5.60) with

$$ g(e_i) = -f'(e_i)/f(e_i). $$

A criterion of the type (5.59) can therefore be interpreted as postulating that $-f'(e_i)/f(e_i) = g(e_i)$ is a reasonable assumption on the disturbance density when estimating $\beta$. For OLS this leads to $-f'(e_i)/f(e_i) = e_i$, with solution $f(e_i) = \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2} e_i^2}$, the standard normal distribution. For LAD this gives $-f'(e_i)/f(e_i) = \pm 1$ (the sign of $e_i$), with solution $f(e_i) = \frac{1}{2} e^{-|e_i|}$ (see also Exercise 5.14). In practice the density $f$ of the disturbances is unknown.

Interpretation of robust estimation in terms of WLS

The estimator $\hat{\beta}$ that minimizes (5.59) can also be interpreted in terms of weighted least squares. If the weights $w_i$ in (5.61) are fixed, then these equations correspond to the first order conditions for minimizing the weighted least squares criterion $\sum_{i=1}^{n} w_i e_i^2$. So the weights $w_i$ measure the relative importance of the squared errors $e_i^2$ in determining $\hat{\beta}$. This also motivates a simple iterative method for estimating $\beta$ by means of the (robust) criterion (5.59). Start with weights $w_i = 1$, and estimate $\beta$ by OLS with residuals $e_i$. Then compute $w_i = g(e_i)/e_i$ and estimate $\beta$ by WLS. Iterate the computation of residuals $e_i$, weights $w_i$, and WLS estimates of $\beta$, until the estimates converge.

Appropriate scaling in robust estimation

It is of course preferable that the results do not depend on the chosen scales of measurement of the variables. Let us consider the effect of rescaling the dependent variable $y_i$. If this variable is replaced by $y_i^* = a y_i$, with $a$ a given constant, then we would like the estimates $\hat{\beta}$ to be replaced by $\hat{\beta}^* = a \hat{\beta}$, as in this case the fitted values $\hat{y}_i^* = x_i'\hat{\beta}^* = a x_i'\hat{\beta} = a \hat{y}_i$ are related by the same factor. The criteria OLS and LAD satisfy this requirement. For other criterion functions this requirement is satisfied by replacing (5.59) by

$$ S(\hat{\beta}) = \sum_{i=1}^{n} G\!\left( \frac{y_i - x_i'\hat{\beta}}{s} \right), $$

where $s^2 = E[\varepsilon_i^2] = \mathrm{var}(y_i)$. For instance, for the Huber criterion (5.62) this means that $c$ should be replaced by $cs$. In practice $s$ is unknown and has to be estimated. The usual OLS estimator of the variance may be sensitive for outliers. Let $m$ denote the median of the $n$ residuals $e_1, \cdots, e_n$; then a robust estimator of the standard deviation is given by

$$ \hat{s} = 1.483 \cdot \mathrm{med}(|e_j - m|, \; j = 1, \cdots, n), \qquad (5.64) $$

where 'med' denotes the median of the $n$ values $|e_j - m|$. It is left as an exercise (see Exercise 5.12) to prove (for a simple case) that this gives a consistent estimator of $s$ if the observations are normally distributed. This can be used to estimate the parameters $\beta$ and $s$ by an iterative two-step method. In the first step $s$ is fixed and $\beta$ is estimated robustly, with corresponding residuals $e_i$. In the second step, $s$ is estimated from the residuals $e_i$. This new estimate of $s$ can be used to compute new robust estimates of $\beta$, and so on, until convergence. A compact sketch of this scheme in code follows below.
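The following is a minimal sketch of the iterative scheme just described, combining the WLS iteration with the robust scale estimate (5.64). It is illustrative, not the book's implementation; the Huber threshold c = 1.345 is an assumed conventional value, and the convergence tolerance is arbitrary.

```python
import numpy as np

def robust_scale(e):
    m = np.median(e)
    return 1.483 * np.median(np.abs(e - m))    # equation (5.64)

def huber_irls(X, y, c=1.345, max_iter=50, tol=1e-8):
    b = np.linalg.lstsq(X, y, rcond=None)[0]   # start with w_i = 1, i.e. OLS
    for _ in range(max_iter):
        u = (y - X @ b) / robust_scale(y - X @ b)          # scaled residuals
        w = np.where(np.abs(u) <= c, 1.0, c / np.abs(u))   # Huber weights g(u)/u
        sw = np.sqrt(w)                                    # WLS via square-root weights
        b_new = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
        if np.max(np.abs(b_new - b)) < tol:
            return b_new
        b = b_new
    return b
```

Multiplying the rows of X and y by the square roots of the weights makes the least squares step minimize exactly the weighted criterion $\sum w_i e_i^2$ of the text.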

Limiting the influence of observations with large leverage

Finally we note that by an appropriate choice of the function $g$ in the criterion (5.60) the influence of large residuals can be limited, but the explanatory variables can still be influential because of the linear term $x_i$. This influence may also be bounded, for example, by replacing the 'normal equations' (5.60) by

$$ \sum_{i=1}^{n} g_1(x_i) \, g_2\!\left( \frac{y_i - x_i'\hat{\beta}}{s} \right) x_i = 0. $$

For instance, one can take $g_1(x_i) = g(x_i'(X'X)^{-1} x_i) = g(h_i)$, with $h_i$ the leverage in (5.54) and with $g$ chosen as in (5.63) with $c = k/n$ (the mean value of the leverages).

5.6.5 Summary

In least squares the deviations from the postulated relation between dependent and independent variables are penalized in a quadratic way. This means that observations that deviate much from the general pattern may have an excessive influence on the parameter estimates. To investigate the presence of such influential observations and to reduce their influence, one can proceed as follows.

- A first impression may be obtained by inspecting the histogram of the least squares residuals and by the Jarque–Bera test on normality. Note, however, that OLS is not a reliable method to detect influential observations.
- Influential data may be detected by considering the leverages, studentized residuals, dffits, and dfbetas of the individual observations.
- If some of the observations deviate a lot from the overall pattern, one should try to understand the possible causes. This may suggest, for instance, additional relevant explanatory variables or another choice for the distribution of the disturbances. In some cases it may also be that some of the reported data are unreliable, so that they should be excluded in estimation.
- If the deviating observations are a realistic aspect of the data (as is the case in many situations), one may wish to limit their influence by applying a robust estimation method.

- The choice of the robust estimation method (corresponding to solving the equations $\sum g(e_i) x_i = 0$) can be based on ideas concerning appropriate weights $w_i$ of the individual observations (by taking $g(e_i) = w_i e_i$) or on ideas concerning the probability distribution $f$ of the disturbances (by taking $g(e_i) = -f'(e_i)/f(e_i)$, where $f'(e_i) = df(e_i)/de_i$).

Exercises: T: 5.12d, 5.14, 5.15; E: 5.29e.

5.7 Endogenous regressors and instrumental variables

5.7.1 Instrumental variables and two-stage least squares

Motivation

Until now we have assumed either that the regressors $x_i$ are fixed or that they are stochastic and exogenous in the sense that there is no correlation between the regressors and the disturbance terms. It is intuitively clear that, if $x_i$ and $\varepsilon_i$ are mutually correlated, it will be hard to distinguish their individual contributions to the outcome of the dependent variable $y_i = x_i'\beta + \varepsilon_i$. In Section 4.1.3 (p. 194–6) we showed that OLS is inconsistent in this situation. We briefly discuss two examples that will be treated in greater detail later in this section.

The first example is concerned with price movements on financial markets. If we relate the returns of one financial asset $y$ (in our example AAA bonds) to the returns of another asset $x$ (in our example Treasury Bill notes) by means of the simple regression model $y_i = \alpha + \beta x_i + \varepsilon_i$, then $x_i$ and $\varepsilon_i$ may well be correlated. This is the case if the factors $\varepsilon_i$ that affect the bond rate, such as the general sentiment in the market, also affect the Treasury Bill rate. For instance, unforeseen increased uncertainties in international trade may have a simultaneous upward effect both on bond rates and on interest rates. We will consider this possible endogeneity of Treasury Bill rates in later examples in this section (see Examples 5.30, 5.32, and 5.33).

As a second example, for many goods the price and traded quantity are determined jointly in the market. If we relate price $x$ and quantity $y$ by the simple regression model $y_i = \alpha + \beta x_i + \varepsilon_i$, then $x_i$ and $\varepsilon_i$ may again be correlated. A higher price may lead to lower demand, whereas a higher demand may lead to higher prices. For instance, the demand may increase because of higher wealth of consumers, and this may at the same time increase the price. We will consider this possible endogeneity of the price by considering the

market for motor gasoline consumption in later examples (see Examples 5.31 and 5.34).

OLS requires exogenous regressors

In the multivariate regression model $y_i = x_i'\beta + \varepsilon_i$, the dependent variable $y_i$ is modelled in terms of $k$ explanatory variables $x_i' = (1, x_{2i}, \cdots, x_{ki})$. Under the standard Assumptions 1–6 of Section 3.1.4 (p. 125), it is assumed that $y_i$ is a random variable, but that the values of $x_i$ are 'fixed'. In many situations the outcomes of the variables $x_i$ are partly random. In this case the results that were obtained under the assumption of fixed regressors (including the diagnostic analysis in Sections 5.1–5.6) carry over to the case of stochastic exogenous regressors, by interpreting the results conditional on the given outcomes of the regressors in the $n \times k$ matrix $X$. For instance, the statistical properties $E[b] = \beta$ and $\mathrm{var}(b) = \sigma^2 (X'X)^{-1}$ should then be interpreted as $E[b|X] = \beta$ and $\mathrm{var}(b|X) = \sigma^2 (X'X)^{-1}$. This was analysed in Section 4.1.2 (p. 191–2) under Assumption 1* of stability, that is, the assumption that

$$ \mathrm{plim} \, \frac{1}{n} X'X = Q \qquad (5.65) $$

exists with $Q$ a $k \times k$ invertible matrix. In Section 4.1.3 (p. 194) we derived that OLS is consistent in this case if and only if the explanatory variables are (weakly) exogenous, that is, the variables should satisfy the orthogonality condition $\mathrm{plim}(\frac{1}{n} X'\varepsilon) = 0$.

Consequences of endogenous regressors

We will now consider the situation where one or more of the regressors is endogenous in the sense that

$$ \mathrm{plim} \, \frac{1}{n} X'\varepsilon \ne 0. \qquad (5.66) $$

This means that the random variation in $X$ is correlated with the random variation $\varepsilon$ in $y$. In such a situation it is difficult to isolate the effect of $X$ on $y$, because variations in $X$ are related to variations in $y$ in two ways, directly via the term $X\beta$ but also indirectly via changes in the term $\varepsilon$. For instance, in a cross section of cities the per capita crime ($y$) may very well be positively correlated with the per capita police force ($x$). The reason is that in the model $y_i = \alpha + \beta x_i + \varepsilon_i$ cities with high crime rates ($\varepsilon_i > 0$) tend to have larger police forces (values of $x_i$ larger than average), in which case a regression of $y$ on $x$ gives a positive OLS estimate of the effect of police on crime. Clearly, in such a situation the effect of police on crime cannot be estimated reliably by OLS (see also Exercise 5.23).
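To see the inconsistency (5.66) at work, consider a small simulation. This is a sketch with purely illustrative numbers: the regressor is constructed to be correlated with the disturbance, and the OLS slope then converges to a value that differs from the true one, no matter how large the sample.

```python
import numpy as np

rng = np.random.default_rng(1)
beta = 1.0
for n in (100, 10_000, 1_000_000):
    eps = rng.standard_normal(n)
    x = 0.5 * eps + rng.standard_normal(n)   # endogenous: plim (1/n) x'eps = 0.5, not 0
    y = beta * x + eps
    b = (x @ y) / (x @ x)                    # OLS slope in a model without constant
    print(n, round(b, 3))                    # converges to 1 + 0.5/1.25 = 1.4, not to 1.0
```

Here $\mathrm{cov}(x_i, \varepsilon_i) = 0.5$ and $\mathrm{var}(x_i) = 1.25$, so the probability limit of the OLS slope is $\beta + 0.5/1.25 = 1.4$ rather than $\beta = 1$.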

Stated in statistical terms, if one or more of the regressors is endogenous, then OLS is no longer consistent, and the conventional results (t-tests, F-tests, diagnostic tests in previous sections of this chapter) are no longer valid.

The use of instruments

A consistent estimator can be obtained if one can identify instruments. A set of $m$ observed variables $z_i' = (z_{1i}, \cdots, z_{mi})$ is called a set of instruments if the following three conditions are satisfied, where $Z$ denotes the $n \times m$ matrix with rows $z_i'$:

$$ \mathrm{plim} \, \frac{1}{n} Z'\varepsilon = 0, \qquad (5.67) $$

$$ \mathrm{plim} \, \frac{1}{n} Z'X = Q_{zx}, \quad \mathrm{rank}(Q_{zx}) = k, \qquad (5.68) $$

$$ \mathrm{plim} \, \frac{1}{n} Z'Z = Q_{zz}, \quad \mathrm{rank}(Q_{zz}) = m. \qquad (5.69) $$

The condition (5.67) means that the instruments should be exogenous. For instance, this is satisfied (under weak additional conditions) when the instruments are uncorrelated with the disturbances in the sense that

$$ E[z_i \varepsilon_i] = 0, \quad i = 1, \cdots, n. \qquad (5.70) $$

The condition (5.68) means that the instruments should be sufficiently correlated with the regressors. As $Q_{zx}$ is an $m \times k$ matrix, the requirement $\mathrm{rank}(Q_{zx}) = k$ implies that $m \ge k$, that is, the number of instruments should be at least as large as the number of regressors. This is called the order condition for the instruments; the requirement on the rank of $Q_{zx}$ itself is called the rank condition. The condition (5.69) is similar to the stability condition (5.65).

How to find instruments?

Before we describe the instrumental variable estimator (below) and its statistical properties (in the next section), we first discuss the question of how to find instruments. First of all, one should analyse which of the explanatory variables are endogenous. If the jth explanatory variable is exogenous, so that $\mathrm{plim} \, \frac{1}{n} \sum_{i=1}^{n} x_{ji} \varepsilon_i = 0$, then this variable should be included in the set of instruments. For instance, the constant term should always be

included. If $k_0$ of the regressors are endogenous, one should find at least $k_0$ additional instruments. One option is to formulate additional equations that explain the dependence of the endogenous variables in terms of exogenous variables. This leads to simultaneous equation models, which are discussed in Chapter 7. In many cases it is too demanding to specify such additional equations, and instead one selects a number of variables that are supposed to satisfy the conditions (5.67)–(5.69). In practice the choice of instruments is often based on economic insight, as we will illustrate by means of two examples at the end of this section. In Section 5.7.3 we describe a test for the validity of these conditions.

Derivation of IV estimator

To describe the instrumental variable (IV) estimator, we assume that condition (5.70) is satisfied. This corresponds to $m$ moment conditions, and the IV estimator is defined as the GMM estimator corresponding to these moment conditions. In the exactly identified case ($m = k$), the IV estimator $b_{IV}$ is given by the solution of the $m$ equations $\frac{1}{n} \sum_{i=1}^{n} z_i (y_i - x_i' b_{IV}) = 0$, that is,

$$ b_{IV} = \left( \sum_{i=1}^{n} z_i x_i' \right)^{-1} \sum_{i=1}^{n} z_i y_i = (Z'X)^{-1} Z'y. $$

In the over-identified case ($m > k$), the results in Section 4.4.3 (p. 256) show that, under weak regularity conditions, the efficient estimator corresponding to these moment conditions is obtained by weighted least squares. More particularly, the GMM criterion function $\frac{1}{n} G_n' W G_n$ in (4.63), with $G_n = \sum_{i=1}^{n} z_i (y_i - x_i'b) = Z'(y - Xb)$, leads to the criterion function

$$ S(b) = \frac{1}{n} (y - Xb)' Z W Z' (y - Xb), $$

where the weighting matrix $W$ is equal to the inverse of the covariance matrix $J^* = E[z_i \varepsilon_i (z_i \varepsilon_i)'] = E[\varepsilon_i^2 z_i z_i']$. A consistent estimator of these weights is given by $W = J^{-1}$, where $J = \frac{s^2}{n} \sum_{i=1}^{n} z_i z_i' = \frac{s^2}{n} Z'Z$. As the scale factor $s^2$ has no effect on the location of the minimum of $S(b)$, we obtain the criterion function

$$ S_{IV}(b) = (y - Xb)' Z (Z'Z)^{-1} Z' (y - Xb) = (y - Xb)' P_Z (y - Xb), \qquad (5.71) $$

where $P_Z = Z(Z'Z)^{-1} Z'$ is the projection matrix corresponding to regression on the instruments $Z$. The first order conditions for a minimum are given by

$$ \frac{\partial S_{IV}(b)}{\partial b} = -2 X' P_Z (y - Xb) = 0. $$
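Solving these first order conditions gives the estimator derived at the start of the next subsection, which can also be computed by two successive regressions. As a computational sketch (illustrative code, not from the book), the estimator and the fitted values used below can be obtained as follows.

```python
import numpy as np

def iv_2sls(y, X, Z):
    # Stage 1: fitted values Xhat = P_Z X = Z (Z'Z)^{-1} Z'X
    Xhat = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)
    # Stage 2: regress y on Xhat, so that b_IV = (Xhat'Xhat)^{-1} Xhat'y
    b_iv = np.linalg.solve(Xhat.T @ Xhat, Xhat.T @ y)
    return b_iv, Xhat
```

In the exactly identified case this reduces numerically to $(Z'X)^{-1} Z'y$, and in the over-identified case it gives $(X'P_Z X)^{-1} X'P_Z y$.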

The IV estimator and two-stage least squares

The foregoing analysis shows that the IV estimator is given by

$$ b_{IV} = (X' P_Z X)^{-1} X' P_Z y. \qquad (5.72) $$

This estimator has an interesting interpretation. If the regressors had been $Z$ instead of $X$, then (5.67) would mean that OLS is consistent; but the regressors are $X$, and OLS is inconsistent because of (5.66). The idea is therefore to replace $X$ by linear combinations of $Z$ that approximate $X$ as well as possible. This best approximation is obtained by regressing every column of $X$ on the instruments matrix $Z$. The fitted values of this regression are

$$ \hat{X} = Z(Z'Z)^{-1} Z'X = P_Z X. \qquad (5.73) $$

Then $\beta$ is estimated by regressing $y$ on $\hat{X}$, which gives the following estimator of $\beta$:

$$ (\hat{X}'\hat{X})^{-1} \hat{X}'y = (X' P_Z X)^{-1} X' P_Z y = b_{IV}. \qquad (5.74) $$

So the IV estimator can be computed by two successive regressions, and it is therefore also called the two-stage least squares estimator, abbreviated as 2SLS.

Two-stage least squares estimates of the parameters $\beta$ (2SLS)
Stage 1: Regress each column of $X$ on $Z$, with fitted values $\hat{X} = Z(Z'Z)^{-1} Z'X$.
Stage 2: Regress $y$ on $\hat{X}$, with parameter estimates $b_{IV} = (\hat{X}'\hat{X})^{-1} \hat{X}'y$.

Example 5.30: Interest and Bond Rates (continued) [data set XM511IBR]

As an illustration we consider the interest and bond rate data introduced in Example 5.11. We will discuss (i) the possible endogeneity of the explanatory variable (the interest rate), (ii) a suggestion for possible instruments, and (iii) the results of IV estimation with these instruments.

(i) Possible endogeneity of the interest rate

In foregoing sections we analysed the relation between monthly changes in the AAA bond rate ($y_i$) and in the short-term interest rate (the three-month Treasury Bill rate, $x_i$) by the model $y_i = \alpha + \beta x_i + \varepsilon_i$, $i = 1, \cdots, n$.

It may very well be that the factors $\varepsilon_i$ that cause changes in the AAA bond rate reflect general financial conditions that also affect the Treasury Bill rate. If this is the case, then $x_i$ is not exogenous and OLS is not consistent.

Panel 1: Correlogram of explanatory variable DUS3MT, sample 1980.01–1999.12 (240 observations): autocorrelations and Ljung–Box Q-statistics for lags 1 to 12.

Panel 2: Dependent Variable: DUS3MT; Method: Least Squares; Sample: 1980:01 1999:12; Included observations: 240
Variable      Coefficient   Std. Error   t-Statistic   Prob.
C             -0.026112     0.039009     -0.669400     0.5039
DUS3MT(-1)     0.358145     0.062307      5.748060     0.0000
DUS3MT(-2)    -0.282601     0.062266     -4.5386       0.0000
R-squared      0.151651

Panel 3: Dependent Variable: DAAA; Method: Least Squares; Sample: 1980:01 1999:12; Included observations: 240
Variable      Coefficient   Std. Error   t-Statistic   Prob.
C             -0.004558     0.015440     -0.295200     0.7681
DUS3MT         0.306453     0.023692     12.93503      0.0000
R-squared      0.412803

Panel 4: Dependent Variable: DAAA; Method: Instrumental Variables; Sample: 1980:01 1999:12; Included observations: 240; Instrument list: C DUS3MT(-1) DUS3MT(-2)
Variable      Coefficient   Std. Error   t-Statistic   Prob.
C             -0.008453     0.016572     -0.510085     0.6105
DUS3MT         0.169779     0.064952      2.613906     0.0095
R-squared      0.330694

[Exhibit 5.43 Interest and Bond Rates (Example 5.30): correlations of the explanatory variable (DUS3MT) with its lagged values (Panels 1 and 2) and the regression model estimated by OLS (Panel 3) and by IV (Panel 4).]

(ii) Possible instruments

If financial markets are efficient, this means that all past information is processed in the current prices. In this case the current value of $\varepsilon_i$ is uncorrelated with the past values of both $y_{i-j}$ and $x_{i-j}$ for all $j \ge 1$. We will assume that the disturbance term $\varepsilon_i$ is correlated with the current change $x_i$ in the Treasury Bill rate, but not with past changes $x_{i-1}$, $x_{i-2}$, and so on. Then these past changes can serve as instruments.

(iii) Results of IV estimation

We now analyse the interest and bond rate data over the period from January 1980 to December 1999 ($n = 240$). As instruments we take $x_{i-1}$ and $x_{i-2}$, the one- and two-month lagged changes in the Treasury Bill rate. In Example 5.33 we will test the exogeneity condition (5.67), that is, the condition that $E[x_{i-1} \varepsilon_i] = E[x_{i-2} \varepsilon_i] = 0$. To check the rank condition (5.68), Panel 1 of Exhibit 5.43 shows that the variable $x_i$ is correlated with its past values, although the correlations are not so large. The regression of $x_i$ on $x_{i-1}$ and $x_{i-2}$ has an $R^2 = 0.15$ (see Panel 2 of Exhibit 5.43), so that the rank condition (5.68) is satisfied. Panel 4 reports the IV estimates with instruments $z_i' = (1, x_{i-1}, x_{i-2})$, and for comparison Panel 3 reports the OLS estimates. The estimates of the slope parameter $\beta$ differ quite substantially. A further analysis is given in Example 5.33 at the end of Section 5.7.3.

Example 5.31: Motor Gasoline Consumption [data set XM531MGC]

For many goods, the price and traded quantities are determined jointly in the market process. It may well be that price and quantity influence each other, with higher prices leading to lower demand and with higher demand leading to higher prices. We will analyse this for the market of motor gasoline in the USA. We are interested in the demand equation for motor gasoline, in particular, in the effects of price and income on demand. We will discuss (i) the data and (ii) possible instruments and corresponding IV estimates.

(i) The data

We consider the relation between gasoline consumption, gasoline price, and disposable income in the USA. Yearly data on these variables and three price indices (of public transport, new cars, and used cars) are available over the period 1970–99. Exhibit 5.44 (a–c) shows time plots of these three variables (all in logarithms), together with a scatter diagram (d) and a partial scatter diagram (e) (after removing the influence of income) of consumption against price. We postulate the linear demand function

$$ GC_i = \alpha + \beta \, PG_i + \gamma \, RI_i + \varepsilon_i, \quad i = 1, \cdots, 30, $$

where GC stands for gasoline consumption, PG for the gasoline price index, and RI for disposable income (all measured in real terms and taken in logarithms). The USA is a major player on the world oil market, so the fluctuations $\varepsilon_i$ in US gasoline consumption could affect the gasoline price. If this is the case, then PG is not exogenous, and OLS provides inconsistent estimates. In Example 5.34 we will formally test whether the price is exogenous or endogenous.

[Exhibit 5.44 Motor Gasoline Consumption (Example 5.31): time plots of real gasoline consumption (GC (a)), real gasoline price (PG (b)), and real income (RI (c)), and scatter diagram of consumption against price (d) and partial scatter diagram after removing the influence of income (e).]

(ii) Possible instruments and corresponding IV estimates

As possible instruments we consider (apart from the constant term and the regressor RI) the real price indices of public transport (RPT), of new cars (RPN), and of used cars (RPU), all in logarithms. In Example 5.34 we will test whether these variables are indeed exogenous. Exhibit 5.45 shows the results of OLS (in Panel 1) and IV (in Panel 2). The estimates do not differ much, which can be taken as an indication that the gasoline price can be considered as an exogenous variable for gasoline consumption in the US.

[Exhibit 5.45 Motor Gasoline Consumption (Example 5.31): OLS of gasoline consumption (GC) on the price of gasoline (PG) and income (RI) (Panel 1); IV of consumption on price and income using five instruments, namely the constant term, income, and three real price indices (of public transport (RPT), new cars (RPN), and used cars (RPU)) (Panel 2); and the relation between the gasoline price and the five instruments (Panel 3). The OLS and IV estimates of the price and income coefficients are nearly equal, with somewhat smaller standard errors for OLS, and all three regressions have R-squared values above 0.985.]

Exercises: T: 5.16a, 5.18; S: 5.23a–d.

5.7.2 Statistical properties of IV estimators

Derivation of consistency of IV estimators

We consider the properties of the IV estimator (5.72) for the model $y = X\beta + \varepsilon$ with $n \times m$ instrument matrix $Z$. Referring to Section 3.1.4 (p. 125–6), we suppose that Assumptions 2–6 are satisfied and that Assumption 1 is replaced

by the five (asymptotic) conditions (5.65)–(5.69). Under these conditions the IV estimator is consistent. To prove this, we write (5.72) as

$$ b_{IV} = \left( X'Z (Z'Z)^{-1} Z'X \right)^{-1} X'Z (Z'Z)^{-1} Z'(X\beta + \varepsilon) = \beta + \left( \tfrac{1}{n} X'Z \left( \tfrac{1}{n} Z'Z \right)^{-1} \tfrac{1}{n} Z'X \right)^{-1} \tfrac{1}{n} X'Z \left( \tfrac{1}{n} Z'Z \right)^{-1} \tfrac{1}{n} Z'\varepsilon. \qquad (5.75) $$

Because of the conditions (5.67)–(5.69), we obtain the probability limit of $b_{IV}$ as

$$ \mathrm{plim}(b_{IV}) = \beta + (Q_{zx}' Q_{zz}^{-1} Q_{zx})^{-1} Q_{zx}' Q_{zz}^{-1} \cdot 0 = \beta. $$

This shows that the exogeneity of the instruments $Z$ is crucial to obtain consistency. Note that the IV estimator is also consistent if Assumptions 3 and 4 are not satisfied (that is, for heteroskedastic or serially correlated errors), as long as the instruments are exogenous. Assumptions 3 and 4 are needed only in the derivation of the asymptotic distribution of $b_{IV}$ below.

Derivation of asymptotic distribution

We will assume (in analogy with (4.6) in Section 4.1.4 (p. 196)) that

$$ \frac{1}{\sqrt{n}} Z'\varepsilon \overset{d}{\longrightarrow} N(0, \sigma^2 Q_{zz}). $$

Using the notation $b_{IV} = \beta + A_n (\frac{1}{n} Z'\varepsilon)$ for the last expression in (5.75), where $A_n$ has probability limit $A = (Q_{zx}' Q_{zz}^{-1} Q_{zx})^{-1} Q_{zx}' Q_{zz}^{-1}$, we can rewrite (5.75) as $\sqrt{n}(b_{IV} - \beta) = A_n \cdot \frac{1}{\sqrt{n}} Z'\varepsilon$. Combining these results and using $A Q_{zz} A' = (Q_{zx}' Q_{zz}^{-1} Q_{zx})^{-1}$ gives

$$ \sqrt{n}(b_{IV} - \beta) \overset{d}{\longrightarrow} N\left( 0, \; \sigma^2 (Q_{zx}' Q_{zz}^{-1} Q_{zx})^{-1} \right). $$

In large enough finite samples, $b_{IV}$ is therefore approximately normally distributed with mean $\beta$ and covariance matrix $\frac{\sigma^2}{n} \left( \frac{1}{n} X'Z (\frac{1}{n} Z'Z)^{-1} \frac{1}{n} Z'X \right)^{-1} = \sigma^2 (X' P_Z X)^{-1}$. With the notation (5.73) this gives

$$ b_{IV} \approx N\left( \beta, \; \sigma^2 (X' P_Z X)^{-1} \right) = N\left( \beta, \; \sigma^2 (\hat{X}'\hat{X})^{-1} \right). \qquad (5.76) $$

The instrumental variable estimator is relatively more efficient if the instruments $Z$ are more highly correlated with the explanatory variables. In practice, the exogeneity condition (5.67) is often satisfied only for variables that are relatively weakly correlated with the explanatory variables. Such weak instruments lead to relatively large variances of the IV estimator.
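In code, the approximate standard errors from (5.76) can be sketched as follows. This is an illustrative fragment, not the book's implementation; $\sigma^2$ is replaced by the residual-based estimator discussed next (see (5.77) below), and the residuals are computed with the original regressors X, not with the fitted values Xhat.

```python
import numpy as np

def iv_cov(y, X, Xhat, b_iv):
    n, k = X.shape
    e_iv = y - X @ b_iv                         # residuals with the original X
    s2_iv = (e_iv @ e_iv) / (n - k)             # consistent estimate of sigma^2, (5.77)
    cov = s2_iv * np.linalg.inv(Xhat.T @ Xhat)  # approximate covariance from (5.76)
    return cov, np.sqrt(np.diag(cov))           # covariance matrix and standard errors
```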

To use the above results in testing we need a consistent estimator of the variance $\sigma^2$. Let $e_{IV} = y - X b_{IV}$ be the IV residuals; then a consistent estimator is given by

$$ s_{IV}^2 = \frac{1}{n-k} e_{IV}' e_{IV} = \frac{1}{n-k} (y - X b_{IV})'(y - X b_{IV}). \qquad (5.77) $$

If the IV estimator is computed as in (5.74), that is, by regressing $y$ on $\hat{X}$, then the conventional OLS expression for the covariance matrix is not correct. This would give $\hat{s}^2 (\hat{X}'\hat{X})^{-1}$ with $\hat{s}^2 = \frac{1}{n-k} (y - \hat{X} b_{IV})'(y - \hat{X} b_{IV})$, and this estimator of $\sigma^2$ is not consistent (see Exercise 5.16).

Remark on finite sample statistical properties

The above analysis is based on asymptotic results. As concerns finite sample properties, we mention that in finite samples the pth moment of $b_{IV}$ exists if and only if $p < m - k + 1$. In the exactly identified case there holds $m = k$, so that the finite sample probability distribution of $b_{IV}$ does not have a well-defined mean or variance. The covariance matrix of $b_{IV}$ exists if and only if $m \ge k + 2$. This result could suggest that it is always best to incorporate as many instruments as possible. Adding instruments also leads to asymptotically smaller variances, provided that all additional instruments are exogenous. However, if the additional instruments are weak, then the finite sample distribution may very well deteriorate. In practice it is often better to search for a sufficient number of good instruments than for a large number of relatively weak ones.

Derivation of the F-test in IV estimation

Tests on the individual significance of coefficients can be performed by conventional t-tests based on (5.76) and (5.77). An F-test for joint linear restrictions can be performed along the lines of Section 3.4.1 (p. 161–2). To derive the expression for this test we use some results of matrix algebra (see Appendix A, Section A.6 (p. 737)). There it is proved that the $n \times n$ projection matrix $P_Z = Z(Z'Z)^{-1}Z'$ of rank $m$ can be written in terms of an $m \times n$ matrix $K$ as $P_Z = K'K$, with $KK' = I_m$, where $I_m$ is the $m \times m$ identity matrix. Define the $m \times 1$ vector $y^* = Ky$ and the $m \times k$ matrix $X^* = KX$. The instrumental variable criterion (5.71) can then be written as

$$ S_{IV}(b) = (y - Xb)' K'K (y - Xb) = (y^* - X^* b)'(y^* - X^* b). $$

If $y = X\beta + \varepsilon$ with $\varepsilon \sim N(0, \sigma^2 I_n)$, then $y^* = X^* \beta + \varepsilon^*$, where $\varepsilon^* = K\varepsilon \sim N(0, \sigma^2 KK') = N(0, \sigma^2 I_m)$.

This shows that IV estimation in the model $y = X\beta + \varepsilon$ is equivalent to applying OLS in the transformed model $y^* = X^*\beta + \varepsilon^*$. Let the unrestricted IV estimator be denoted by $b_{IV}$ and the restricted IV estimator by $b_{RIV}$, with corresponding residuals $e^* = y^* - X^* b_{IV} = K(y - X b_{IV}) = K e_{IV}$ and $e_R^* = y^* - X^* b_{RIV} = K(y - X b_{RIV}) = K e_{RIV}$. If the $g$ restrictions of the null hypothesis hold true, then the results in Section 3.4.1 (p. 161–2) imply that

$$ (e_R^{*\prime} e_R^* - e^{*\prime} e^*)/\sigma^2 = (e_{RIV}' K'K e_{RIV} - e_{IV}' K'K e_{IV})/\sigma^2 = (e_{RIV}' P_Z e_{RIV} - e_{IV}' P_Z e_{IV})/\sigma^2 \approx \chi^2(g). $$

If we replace $\sigma^2$ by the consistent estimator (5.77), then we get

$$ F = \frac{(e_{RIV}' P_Z e_{RIV} - e_{IV}' P_Z e_{IV})/g}{e_{IV}' e_{IV}/(n-k)} \approx F(g, \, n-k). $$

This differs from the standard expression (3.50) for the F-test, as in the numerator the IV residuals are weighted with $P_Z$.

Computation of the F-test

It is computationally more convenient to perform the following regressions (a sketch in code follows after the example introduction below). First regress every column of $X$ on $Z$, with fitted values $\hat{X}$ as in (5.73). Then perform two regressions of $y$ on $\hat{X}$, one without restrictions (with residuals denoted by $\hat{e}$) and one with the restrictions of the null hypothesis imposed (with residuals denoted by $\hat{e}_R$). Then

$$ F = \frac{(\hat{e}_R' \hat{e}_R - \hat{e}' \hat{e})/g}{e_{IV}' e_{IV}/(n-k)}. \qquad (5.78) $$

The proof that this leads to the same F-value as the foregoing expression is left as an exercise (see Exercise 5.16).

Example 5.32: Interest and Bond Rates (continued) [data set XM511IBR]

We continue our previous analysis of the interest and bond rate data in Example 5.30. The model is $y_i = \alpha + \beta x_i + \varepsilon_i$, with $y_i$ the monthly AAA bond rate changes and $x_i$ the monthly Treasury Bill rate changes. As instruments we take again $z_i' = (1, x_{i-1}, x_{i-2})$. Now we test whether the AAA bond rate will on average remain the same if the Treasury Bill rate is fixed, so we test the null hypothesis that $\alpha = 0$.
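Before turning to the results, the computation in (5.78) can be sketched in code. This is illustrative only: the caller supplies the restricted second-stage regressor matrix (here obtained, for the restriction $\alpha = 0$, by dropping the constant column from Xhat), and the function names are ours.

```python
import numpy as np
from scipy import stats

def ssr(A, b):
    # sum of squared residuals from OLS of b on A
    r = b - A @ np.linalg.lstsq(A, b, rcond=None)[0]
    return r @ r

def iv_ftest(y, X, Xhat, Xhat_restr, g):
    n, k = X.shape
    b_iv = np.linalg.lstsq(Xhat, y, rcond=None)[0]
    e_iv = y - X @ b_iv                      # IV residuals, with the original X
    F = ((ssr(Xhat_restr, y) - ssr(Xhat, y)) / g) / ((e_iv @ e_iv) / (n - k))
    return F, stats.f.sf(F, g, n - k)
```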

Panel 1 of Exhibit 5.46 shows the t-value obtained by IV, namely $t_{IV}(\hat{\alpha}) = -0.510$. Exhibit 5.46 also contains the regressions needed for (5.78), with sums of squared residuals $\hat{e}_R'\hat{e}_R = 22.717$ (Panel 4), $\hat{e}'\hat{e} = 22.700$ (Panel 3), and $e_{IV}'e_{IV} = 15.491$ (Panel 1). Note that if we compute the IV estimate by regressing $y$ on $\hat{X}$ as in (5.74), then the reported t-value becomes $-0.42$ (see Panel 3 of Exhibit 5.46); this t-value is not correct. So the F-test for $\alpha = 0$ becomes

Panel 1: Dependent Variable: DAAA; Method: Instrumental Variables; Sample: 1980:01 1999:12; Included observations: 240; Instrument list: C DUS3MT(-1) DUS3MT(-2)
Variable    Coefficient   Std. Error   t-Statistic   Prob.
C           -0.008453     0.016572     -0.510085     0.6105
DUS3MT       0.169779     0.064952      2.613906     0.0095
R-squared    0.330694     Sum squared resid   15.49061

Panel 2: Dependent Variable: DUS3MT; Method: Least Squares; Sample: 1980:01 1999:12; Included observations: 240
Variable    Coefficient   Std. Error   t-Statistic   Prob.
C           -0.026112     0.039009     -0.669400     0.5039
DUS3MT(-1)   0.358145     0.062307      5.748060     0.0000
DUS3MT(-2)  -0.282601     0.062266     -4.5386       0.0000
R-squared    0.151651     Sum squared resid   86.30464

Panel 3: Dependent Variable: DAAA; Method: Least Squares; Sample: 1980:01 1999:12; Included observations: 240
Variable    Coefficient   Std. Error   t-Statistic   Prob.
C           -0.008453     0.020060     -0.421374     0.6739
XHAT         0.169779     0.078626      2.159311     0.0318
R-squared    0.019214     Sum squared resid   22.69959

Panel 4: Dependent Variable: DAAA; Method: Least Squares; Sample: 1980:01 1999:12; Included observations: 240
Variable    Coefficient   Std. Error   t-Statistic   Prob.
XHAT         0.173480     0.078000      2.224107     0.0271
R-squared    0.018483     Sum squared resid   22.71653

[Exhibit 5.46 Interest and Bond Rates (Example 5.32): model for AAA bond rates estimated by IV (Panel 1); first step of 2SLS (Panel 2, construction of $\hat{X}$, denoted by XHAT, by regressing DUS3MT on three instruments, namely the constant term and the one- and two-month lagged values of DUS3MT); second step of 2SLS (Panel 3, regression of AAA bond rates on XHAT); and regression of AAA bond rates on XHAT in the restricted model without constant term (Panel 4). The sums of squared residuals in Panels 1, 3, and 4 are used in the F-test for the significance of the constant term.]

^ 0X ^ 0X ^ )À1 X ^ 0 y ¼ b þ (X ^ ) À1 X ^ 0 e: bIV ¼ (X T . otherwise it may be better to use IV to prevent large biases due to the inconsistency of OLS for endogenous regressors. If the regressors in y ¼ Xb þ e are exogenous.16). 5. This suggests basing the test on ^ 0 X ¼ X0 PZ X ¼ X ^ 0X ^. Derivation of test based on comparison of OLS and IV A simple idea is the following.66) that plim( 1 n X e) 6¼ 0.7. then OLS is consistent and (under the usual assumptions) more efﬁcient than IV in the sense that var(bIV ) ! var(b). 15:491=(240 À 2) P ¼ 0:611: 2 ^ (a). and both This is equal to the square of the IV t-value in Panel 1. The choice between these two estimators can be based on a test for the exogeneity of the regressors. 1 0 plim X e ¼ 0. E Exercises: T: 5. then OLS and IV are both consistent and the respective estimators b and bIV of b should not differ very much (in large enough samples). On the other hand. n (5:79) 0 against the alternative of endogeneity (5. we can apply OLS. the difference d ¼ bIV À b. if the regressors are exogenous.66) are small) and IV will be better if the regressors are (too strongly) endogenous. because (X0 PZ X)À1 ! (X0 X)À1 (see Exercise 5.16c. F ¼ tIV tests do not lead to rejection of the hypothesis that a ¼ 0.5. If the assumption of exogeneity is not rejected.7 Endogenous regressors and instrumental variables 409 F¼ (22:717 À 22:700)=1 ¼ 0:261.74) and the fact that X we get b ¼ (X0 X)À1 X0 y ¼ b þ (X0 X)À1 X0 e. Using (5. So we want to test the null hypothesis of exogeneity — that is. then OLS is not consistent but IV is consistent. e.3 Tests for exogeneity and validity of instruments Motivation of exogeneity tests If some of the regressors are endogenous. So OLS will be preferred if the regressors are exogenous (or weakly endogenous in the sense that the correlations in (5.

So $d = b_{IV} - b = (\hat{X}'\hat{X})^{-1}\hat{X}'\varepsilon - (X'X)^{-1}X'\varepsilon$. If the null hypothesis (5.79) holds true, then $E[d] \approx 0$ and

$$ \mathrm{var}(d) = \mathrm{var}\left( (\hat{X}'\hat{X})^{-1}\hat{X}'\varepsilon - (X'X)^{-1}X'\varepsilon \right) \approx \sigma^2 \left( (\hat{X}'\hat{X})^{-1}\hat{X}' - (X'X)^{-1}X' \right) \left( (\hat{X}'\hat{X})^{-1}\hat{X}' - (X'X)^{-1}X' \right)' = \sigma^2 \left( (\hat{X}'\hat{X})^{-1} - (X'X)^{-1} \right) \approx \mathrm{var}(b_{IV}) - \mathrm{var}(b), $$

where we used that $\mathrm{var}(\varepsilon) = \sigma^2 I$ and $\hat{X}'X = \hat{X}'\hat{X}$. Under the usual assumptions, $d$ is also asymptotically normally distributed, so that (under the null hypothesis of exogeneity)

$$ (b_{IV} - b)' \left( \mathrm{var}(b_{IV}) - \mathrm{var}(b) \right)^{-1} (b_{IV} - b) \approx \chi^2(k). $$

This test is easy to apply, as OLS of $y$ on $X$ gives $b$ and an estimate of $\mathrm{var}(b)$, and OLS of $y$ on $\hat{X}$ gives $b_{IV}$ and an estimate of $\mathrm{var}(b_{IV})$ (see (5.76)). However, in finite samples the estimated covariances may be such that $(\widehat{\mathrm{var}}(b_{IV}) - \widehat{\mathrm{var}}(b))$ is not positive semidefinite, in which case the variance of $d$ is very badly estimated and the test as computed above does not have a good interpretation.

Derivation of the exogeneity test of Durbin, Wu, and Hausman

Usually exogeneity is therefore tested in another way. We now describe an exogeneity test associated with Durbin, Wu, and Hausman, commonly known as the Hausman test; it corresponds to a Lagrange Multiplier test. The main idea is to reformulate the exogeneity condition (5.79) in terms of a parameter restriction. For this purpose we split the regressors into two parts, the $k_0$ variables that are possibly endogenous and the other $(k - k_0)$ variables that are exogenous (for instance, the constant term). We order the regressors so that the first $(k - k_0)$ ones are exogenous and the last $k_0$ ones are potentially endogenous, so that the null hypothesis of exogeneity of these regressors is formulated as $E[x_{ji}\varepsilon_i] = 0$ for $j = k - k_0 + 1, \cdots, k$. By assumption, the $m$ instruments $z_i$ satisfy the exogeneity condition (5.70) that $E[z_i \varepsilon_i] = 0$. Now consider the auxiliary regression model explaining the jth regressor in terms of these $m$ instruments, that is,

$$ x_{ji} = z_i'\gamma_j + v_{ji}, \quad i = 1, \cdots, n. \qquad (5.80) $$

Here $\gamma_j$ is an $m \times 1$ vector of parameters and the $v_{ji}$ are error terms. Because of (5.70) it follows that $E[x_{ji}\varepsilon_i] = E[v_{ji}\varepsilon_i]$, and the null hypothesis of exogeneity is equivalent to $E[v_{ji}\varepsilon_i] = 0$ for $j = k - k_0 + 1, \cdots, k$. Let $v_i$ be the $k_0 \times 1$ vector with components $v_{ji}$; then the condition is that $\varepsilon_i$ is uncorrelated with all components of $v_i$.

If we assume that all error terms are normally distributed, the condition becomes more specific, namely that $v_i$ and $\varepsilon_i$ are independent, that is, that in the conditional expectation $E[\varepsilon_i | v_i] = v_i'\alpha$ there holds $\alpha = 0$. Let $w_i = \varepsilon_i - E[\varepsilon_i | v_i]$; then

$$ \varepsilon_i = v_i'\alpha + w_i, \quad \text{where } w_i \text{ is independent of } v_i, $$

and the condition of exogeneity can be expressed as follows:

$$ H_0: \; \alpha = 0. \qquad (5.81) $$

Substituting the results (5.81) and (5.80) in the original model $y_i = x_i'\beta + \varepsilon_i$ gives

$$ y_i = \sum_{j=1}^{k} \beta_j x_{ji} + \sum_{j=k-k_0+1}^{k} \alpha_j (x_{ji} - z_i'\gamma_j) + w_i. \qquad (5.82) $$

This is a non-linear regression model, as it involves products of the unknown parameters $\alpha_j$ and $\gamma_j$. Because the regressors $v_{ji} = x_{ji} - z_i'\gamma_j$ are unknown (as the parameters $\gamma_j$ are unknown), they are replaced by the residuals $\hat{v}_{ji}$ obtained by regressing the jth regressor on the $m$ instruments. Assuming a joint normal distribution for the error terms $w_i$ in (5.82) and $v_{ji}$ in (5.80), the LM-test for the hypothesis that $\alpha = 0$ can be derived along the lines of Section 4.3.6 (p. 238) in terms of the score vector and the Hessian matrix (see (4.54)). The computations are straightforward but tedious and are left as an exercise for the interested reader (see Exercise 5.17 for the derivation). Under the hypothesis that all $k_0$ regressors $x_j$, $j = k - k_0 + 1, \cdots, k$, are exogenous, LM has asymptotically the $\chi^2(k_0)$ distribution.

Hausman test on exogeneity

Step 1: Perform preliminary regressions. Regress $y$ on $X$, with $n \times 1$ residual vector $e = y - Xb$. Regress every possibly endogenous regressor $x_j$ on $Z$ as in (5.80), with $n \times 1$ residual vector $\hat{v}_j = x_j - Z\hat{\gamma}_j$.

Step 2: Perform the auxiliary regression. Regress $e$ on $X$ (the $n \times k$ matrix including both the $(k - k_0)$ exogenous and the $k_0$ possibly endogenous regressors) and on the $k_0$ series of residuals $\hat{v}_{k-k_0+1}, \cdots, \hat{v}_k$, that is, perform OLS in the model

$$ e_i = \sum_{j=1}^{k} \delta_j x_{ji} + \sum_{j=k-k_0+1}^{k} \alpha_j \hat{v}_{ji} + \eta_i. \qquad (5.83) $$

Step 3: Compute LM = $nR^2$, where $R^2$ is the coefficient of determination of the regression in step 2.
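A compact sketch of these three steps in code is given below. It is illustrative, not the book's implementation; it assumes that X contains a constant column, so that the OLS residuals have mean zero and $R^2$ can be computed from the residual sums of squares directly.

```python
import numpy as np
from scipy import stats

def ols_resid(A, b):
    return b - A @ np.linalg.lstsq(A, b, rcond=None)[0]

def hausman_lm(y, X, X_endog, Z):
    e = ols_resid(X, y)                                        # step 1: OLS residuals
    V = np.column_stack([ols_resid(Z, x) for x in X_endog.T])  # step 1: vhat_j residuals
    eta = ols_resid(np.column_stack([X, V]), e)                # step 2: auxiliary regression
    r2 = 1.0 - (eta @ eta) / (e @ e)    # e has zero mean when X contains a constant
    lm = len(y) * r2                                           # step 3: LM = n R^2
    return lm, stats.chi2.sf(lm, df=X_endog.shape[1])          # compare with chi2(k0)
```

Here X_endog holds the $k_0$ possibly endogenous columns of X, and Z holds the $m$ instruments; the returned P-value corresponds to the $\chi^2(k_0)$ distribution of the text.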

This three-step method to compute the LM-test by means of auxiliary regressions is similar to the LM-test procedure described in Section 4.3.7 (p. 238–40) for the linear model. Summarizing, exogeneity is equivalent to the condition that $E[v_{ji}\varepsilon_i] = 0$, where $\varepsilon_i$ are the error terms in the model $y_i = x_i'\beta + \varepsilon_i$ and $v_{ji}$ are the error terms in (5.80). As error terms are not observed, they are replaced by the residuals of step 1, and the correlation between the residuals $e_i$ and $\hat{v}_{ji}$ is evaluated by the regression (5.83) in step 2. Endogeneity of the regressors is indicated by significant correlations, that is, by a significant $R^2$ and significant estimates of the parameters $\alpha_j$.

The null hypothesis of exogeneity, that is, $\alpha_j = 0$ for $j = k - k_0 + 1, \cdots, k$ in (5.82), can also be tested by the usual F-test on the joint significance of these parameters in the regression (5.83). Under the null hypothesis, this test statistic is asymptotically distributed as $F(k_0, \, n - k - k_0)$. This F-test and the LM-test of step 3 above are asymptotically equivalent: in large enough samples they provide (nearly) the same P-value, and hence both tests lead to the same conclusion (rejection or not) concerning the exogeneity of the last $k_0$ regressors. In another version of the F-test on exogeneity, the 'explained' variable $e_i$ in (5.83) is replaced by the dependent variable $y_i$. As $e = y - Xb$, both regressions (with $e_i$ or with $y_i$ on the left-hand side) have the same residual sum of squares (as all $k$ regressors $x_i$ are included on the right-hand side). That is, the F-test on the joint significance of the parameters $\alpha_j$ can be performed equivalently in the regression (5.83) or in the same equation with $e_i$ replaced by $y_i$.

Sargan test on validity of instruments

Finally we consider the question whether the instruments are valid. That is, we test whether the instruments are exogenous in the sense that condition (5.70) is satisfied. This assumption is critical in all the foregoing results: if the instruments are not exogenous, then IV is not consistent and the Hausman test is not correct any more. In some cases the exogeneity of the instruments is reasonable from an economic point of view, but in other situations this may be less clear. We illustrate this later with two examples. A simple idea to test (5.70) is to replace the (unobserved) error terms $\varepsilon_i$ by reliable estimates. As the regressors may be endogenous, we should take not the OLS residuals but the IV residuals $e_{IV} = y - X b_{IV}$.

Under the null hypothesis that the instruments are exogenous, $b_{IV}$ is consistent and $e_{IV}$ provides reliable estimates of the vector of error terms $\varepsilon$. We test (5.70) by testing whether $z_i$ is uncorrelated with $e_{IVi}$, the ith component of $e_{IV}$. This suggests the following test, which is called the Sargan test on the validity of instruments.

Sargan test on the validity of instruments

Step 1: Apply IV. Estimate $y = X\beta + \varepsilon$ by IV, with $n \times 1$ residual vector $e_{IV} = y - X b_{IV}$.

Step 2: Perform the auxiliary regression. Regress $e_{IV}$ on $Z$ in the model $e_{IVi} = z_i'\gamma + \eta_i$.

Step 3: Compute LM = $nR^2$ of the regression in step 2. Under the null hypothesis that the instruments are exogenous, LM asymptotically has the $\chi^2(m-k)$ distribution, where $m$ is the number of instruments (the number of variables in $z_i$) and $k$ is the number of regressors (the number of variables in $x_i$). (A code sketch follows after the derivation below.)

Derivation of the distribution of the Sargan test

To derive the distribution under the null hypothesis, we recall that IV corresponds to GMM with the moment conditions (5.70). According to Section 4.4.3 (p. 258), the moment functions corresponding to (5.70) are $g_i = z_i \varepsilon_i = z_i(y_i - x_i'\beta)$. Using the notation of Section 4.4.2 (p. 253), we get $G_n = \sum_{i=1}^{n} g_i = \sum_{i=1}^{n} z_i \varepsilon_i = Z'\varepsilon$ and $J_n = \sum_{i=1}^{n} g_i g_i' = \sum_{i=1}^{n} \varepsilon_i^2 z_i z_i'$. Evaluated at the GMM estimator $b_{IV}$, we get $G_n = Z'e_{IV}$ and $\mathrm{plim}(\frac{1}{n} J_n) = \mathrm{plim}(\frac{1}{n} \sum \varepsilon_i^2 z_i z_i') = \sigma^2 Q_{zz}$. If we approximate $Q_{zz}$ by $\frac{1}{n} Z'Z$ and $\sigma^2$ by $\frac{1}{n} e_{IV}'e_{IV}$, we get $J_n \approx \frac{1}{n} (e_{IV}'e_{IV}) Z'Z$. Then the test on over-identifying restrictions, that is, the J-test of Section 4.4.3, is given by

$$ G_n' J_n^{-1} G_n = n \, \frac{e_{IV}' Z (Z'Z)^{-1} Z' e_{IV}}{e_{IV}' e_{IV}} = nR^2, $$

with the $R^2$ of the regression in step 2 above. If $m > k$, that is, in the over-identified case, we can therefore apply the GMM test on over-identifying restrictions, in particular the result that the number of degrees of freedom is equal to $(m - k)$. So our intuitive arguments for the Sargan test can be justified by the GMM test on over-identifying restrictions, and under the null hypothesis of exogenous instruments there holds $LM = nR^2 \approx \chi^2(m-k)$.
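In code, the three steps of the Sargan test can be sketched as follows. This is an illustrative fragment with our own function names; it uses the centred $R^2$, which virtually coincides with the uncentred version when Z contains a constant.

```python
import numpy as np
from scipy import stats

def sargan_lm(y, X, Z, b_iv):
    n, k = X.shape
    m = Z.shape[1]
    e_iv = y - X @ b_iv                                        # step 1: IV residuals
    eta = e_iv - Z @ np.linalg.lstsq(Z, e_iv, rcond=None)[0]   # step 2: regress e_IV on Z
    tss = ((e_iv - e_iv.mean()) ** 2).sum()
    r2 = 1.0 - (eta @ eta) / tss
    lm = n * r2                                                # step 3: LM = n R^2
    return lm, stats.chi2.sf(lm, df=m - k)                     # compare with chi2(m - k)
```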

The validity of the instruments is rejected for large values of this test statistic. Note that the validity can be checked only if $m > k$, that is, if the number of instruments exceeds the number of regressors. In the exactly identified case ($m = k$) the validity of the instruments cannot be tested.

Example 5.33: Interest and Bond Rates (continued) [data set XM511IBR]

We continue our previous analysis of the interest and bond rate data in Example 5.30. We will discuss (i) a comparison of the IV and OLS estimates, (ii) the Hausman test on exogeneity, and (iii) the Sargan test on the validity of the lagged Treasury Bill rate changes as instruments.

(i) Comparison of IV and OLS estimates

In Section 5.7.1 we estimated the relation between changes in the AAA bond rate ($y_i$) and the Treasury Bill rate ($x_i$) by instrumental variables, with the lagged values $x_{i-1}$ and $x_{i-2}$ as instruments. Denoting the estimates of $\alpha$ and $\beta$ in $y_i = \alpha + \beta x_i + \varepsilon_i$ by $a$ and $b$, respectively, Panels 3 and 4 of Exhibit 5.43 (p. 401) show the results of OLS and of IV. We see that $a_{IV} - a = -0.004$ and $b_{IV} - b = -0.137$. The covariance matrices of these estimates are

$$ \widehat{\mathrm{var}}_{OLS} = s^2 (X'X)^{-1} = 10^{-5} \begin{pmatrix} 23.8 & 1.6 \\ 1.6 & 56.1 \end{pmatrix}, \qquad \widehat{\mathrm{var}}_{IV} = s_{IV}^2 (\hat{X}'\hat{X})^{-1} = 10^{-5} \begin{pmatrix} 27.5 & 12.0 \\ 12.0 & 421.9 \end{pmatrix}. $$

If one uses these results to test for exogeneity, it follows that

$$ \begin{pmatrix} a_{IV} - a \\ b_{IV} - b \end{pmatrix}' \left( \widehat{\mathrm{var}}_{IV} - \widehat{\mathrm{var}}_{OLS} \right)^{-1} \begin{pmatrix} a_{IV} - a \\ b_{IV} - b \end{pmatrix} = 5.11. $$

This is smaller than the 5 per cent critical value of the $\chi^2(2)$ distribution (5.99), so at this significance level this test does not lead to rejection of the hypothesis that $x_i$ is exogenous.

(ii) Hausman test on exogeneity

As the above test is not so reliable, we now perform the Hausman test. As in the model $y_i = \alpha + \beta x_i + \varepsilon_i$ we have $k = 2$ and the constant term is exogenous, it follows that $k_0 = 1$. The result of step 2 of the Hausman test is in Panel 1 of Exhibit 5.47, where RESAUX stands for the residuals obtained

by regressing $x_i$ on the instruments $x_{i-1}$, $x_{i-2}$, and a constant term. The t-test on the significance of RESAUX has a P-value of 0.014, and the Hausman LM-test gives LM = $nR^2$ = 240 · 0.024996 = 6.00, with P-value (corresponding to the $\chi^2(1)$ distribution) P = 0.014. This indicates that the assumption of exogeneity should be rejected, and that the OLS estimator

Panel 1: Step 2 of Hausman test. Dependent Variable: RESOLS; Method: Least Squares; Sample: 1980:01 1999:12; Included observations: 240
Variable    Coefficient   Std. Error   t-Statistic   Prob.
C           -0.003895     0.015359     -0.253610     0.8000
DUS3MT      -0.136674     0.060199     -2.270359     0.0241
RESAUX       0.161106     0.065359      2.464945     0.0144
R-squared    0.024996

Panel 2: Correlations between the IV residuals and lagged values of DUS3MT for lags 0 to 10: the correlation at lag 0 is about -0.35, while the correlations at all positive lags are small (those at lags 1 and 2 are around -0.01).

(c)-(e): Scatter diagrams of the IV residuals (RESIV) against DUS3MT (r = -0.35), DUS3MTLAG1 (r = -0.01), and DUS3MTLAG2 (r = -0.008).

Panel 6: Step 2 of Sargan test. Dependent Variable: RESIV; Method: Least Squares; Sample: 1980:01 1999:12; Included observations: 240
Variable      Coefficient   Std. Error   t-Statistic   Prob.
C             -0.000156     0.016525     -0.009431     0.9925
DUS3MT(-1)    -0.003387     0.026395     -0.128395     0.8979
DUS3MT(-2)    -0.002218     0.026378     -0.084042     0.9331
R-squared      0.000135

[Exhibit 5.47 Interest and Bond Rates (Example 5.33): Panel 1 contains the regression of step 2 of the Hausman test on exogeneity of the explanatory variable DUS3MT (RESOLS and RESAUX are the residuals obtained in step 1 of the Hausman test; RESOLS are the residuals of the regression in Panel 3 of Exhibit 5.43, and RESAUX are the residuals of the regression in Panel 2 of Exhibit 5.43). Panel 2 shows the correlations of the IV residuals with lags of the explanatory variable for lags 0–10, and the three scatter diagrams are for lags 0 (c), 1 (d), and 2 (e). Panel 6 contains the regression for step 2 of the Sargan test on validity of instruments (RESIV are the IV residuals obtained in step 1 of this test; the IV regression is shown in Panel 1 of Exhibit 5.46).]

may be considerably biased. The IV estimate of the slope is much smaller than the OLS estimate, and it has a much larger standard error (0.065 instead of the value 0.024 computed for OLS; see Panels 3 and 4 of Exhibit 5.43).

(iii) Sargan test on validity of instruments

The IV estimates can be trusted only if the instruments $x_{i-1}$ and $x_{i-2}$ are exogenous, that is, if $E[x_{i-1}\varepsilon_i] = E[x_{i-2}\varepsilon_i] = 0$. Panel 6 of Exhibit 5.47 shows the regression of step 2 of the Sargan test. This gives LM = $nR^2$ = 240 · 0.000135 = 0.032. As there are $m = 3$ instruments (the constant term, $x_{i-1}$, and $x_{i-2}$) and $k = 2$ regressors (the constant term and $x_i$), the $\chi^2$ distribution has $(m - k) = 1$ degree of freedom. The P-value of the LM-test, corresponding to the $\chi^2(1)$ distribution, is P = 0.86. This indicates that the lagged values of $x_i$ are valid instruments. Exhibit 5.47 also shows the correlations between lagged values of $x_i$ and the IV residuals $e_{IV}$ (in Panel 2) and scatter diagrams of the IV residuals against $x_i$, $x_{i-1}$, and $x_{i-2}$ (in (c), (d), and (e)). These indicate that $x_i$ is indeed not exogenous, but that $x_{i-1}$ and $x_{i-2}$ are exogenous (with correlations of around $-0.01$, both between $x_{i-1}$ and $e_{IV}$ and between $x_{i-2}$ and $e_{IV}$).

Example 5.34: Motor Gasoline Consumption (continued) [data set XM531MGC]

Next we consider the data on motor gasoline consumption introduced in Example 5.31. We will discuss (i) the Hausman test on exogeneity of the gasoline price, (ii) the Sargan test on the validity of the price indices as instruments, and (iii) a remark on the required model assumptions.

(i) Hausman test on the exogeneity of the gasoline price

In Example 5.31 we considered the relation between gasoline consumption (GC), gasoline price (PG), and disposable income (RI) in the USA. We postulated the demand equation

$$ GC_i = \alpha + \beta \, PG_i + \gamma \, RI_i + \varepsilon_i. $$

We supposed that RI is exogenous and considered the possible endogeneity of PG. The outcomes of the OLS and IV estimates (with five instruments, namely a constant, RI, and the three price indices RPT, RPN, and RPU) in Panels 1 and 2 of Exhibit 5.45 turned out to be close together, suggesting that PG is exogenous. Since the constant and income are assumed to be exogenous and the price PG is the only possibly endogenous variable, we have $k_0 = 1$. Panel 1 of Exhibit 5.48 shows the regression of step 2 of the Hausman test, with outcome LM = $nR^2$ = 2.38. So the distribution of the LM-test

(under the null hypothesis of exogeneity) is $\chi^2(1)$, which gives a P-value of P = 0.12. This does not lead to rejection of the exogeneity of the variable PG.

(ii) Sargan test on validity of instruments

Panel 2 of Exhibit 5.48 shows the regression of step 2 of the Sargan test. Here we test whether the five instruments are exogenous. In this case $k = 3$ and $m = 5$, so that LM = $nR^2$ = 3.12 should be compared with the $\chi^2(2)$ distribution. The corresponding P-value is P = 0.21, so that the exogeneity of the instruments is not rejected. For these data we therefore prefer OLS, as OLS is consistent and gives (somewhat) smaller standard errors (see the results in Exhibit 5.45). Note, however, that IV estimation is not required here, as the regressor PG seems to be exogenous.

[Exhibit 5.48 Motor Gasoline Consumption (Example 5.34): Panel 1 shows the regression for step 2 of the Hausman test on exogeneity of the explanatory variable PG (dependent variable RESOLS, the OLS residuals of Panel 1 of Exhibit 5.45, regressed on C, PG, RI, and RESAUX, the residuals of the regression in Panel 3 of Exhibit 5.45; R-squared = 0.079363, and the coefficient of RESAUX is not significant). Panel 2 shows the regression for step 2 of the Sargan test on the validity of the instruments (dependent variable RESIV, the IV residuals of Panel 2 of Exhibit 5.45, regressed on the five instruments C, RPT, RPN, RPU, and RI; R-squared = 0.104159, and none of the coefficients is significant).]

(iii) Remark on required model assumptions

We conclude by mentioning that the above tests require that the standard Assumptions 2–6 of the regression model are satisfied. It is left as an exercise

(see Exercise 5.31) to show that the residuals of the above demand equation for motor gasoline consumption show significant serial correlation. In this case the OLS estimates are not efficient, as they neglect the serial correlation of the disturbances. Therefore we should give the above test outcomes on exogeneity and validity of instruments the correct interpretation, namely as diagnostic tests indicating possible problems with OLS. Similar remarks apply to our analysis of the interest and bond rate data in Example 5.33; in Examples 5.27 and 5.28 we concluded that these data contain some outliers.

5.7.4 Summary

The OLS method becomes inconsistent if the regressors are not exogenous. If some of the regressors are endogenous and the instruments are valid, then consistent estimates are obtained by the instrumental variables estimation method. One may proceed as follows.

- First of all, try to use economic intuition to guess whether endogeneity might play a role for the investigation at hand.
- Investigate the possible endogeneity of 'suspect' regressors by means of the Hausman test. If the endogeneity is only weak, then OLS may be considered as an alternative, provided that the resulting bias is compensated by a sufficiently large increase in efficiency as compared to IV.
- One should find a sufficient number of instruments that are exogenous and that carry information on the possibly endogenous regressors (that is, the order and rank conditions should be satisfied). If one has a sufficiently large number of instruments, then perform the Sargan test to check whether the proposed instruments are indeed exogenous.
- The t- and F-tests can be performed as usual, although some care is needed to use the correct formulas (see (5.77) and (5.78)).

Exercises: T: 5.16d, 5.17; S: 5.23e.

5.8 Illustration: Salaries of top managers

The discussion in this chapter could lead one to think that ordinary least squares is threatened from so many sides that it never works in practice. This is not true. OLS is a natural first step in estimating economic relations, and in many cases it provides valuable insight into the nature of such relations. By means of the following example we illustrate that in some cases OLS provides a reasonable model that performs well under various relevant diagnostic tests.

Example 5.35: Salaries of Top Managers [data set XM535TOP]

As an example we analyse the relation between the salaries of top managers and the profits of firms. The data set consists of 100 large firms in the Netherlands in 1999. Let $y_i$ be the logarithm of the average yearly salary (in thousands of Dutch guilders) of the top managers of firm $i$, and let $x_i$ be the logarithm of the profit (in millions of Dutch guilders) of firm $i$. The 100 firms are ordered with increasing profits. Results of OLS in the model $y_i = \alpha + \beta x_i + \varepsilon_i$ are in Panel 3 of Exhibit 5.49, and this exhibit also shows the outcomes of various diagnostic tests discussed in this chapter. The estimated elasticity $b$ is around 16 per cent, so that salaries of top managers tend to be rather inelastic with respect to profits when compared over this cross section of firms. The tests in Exhibit 5.49 (e–q) do not indicate any misspecification of the model, so we are satisfied with this simple relation.

[Exhibit 5.49, Salaries of Top Managers (Example 5.35): scatter diagrams of salary against profit (in levels (a) and in logarithms (b)); regression table (with variables in logarithms, Panel 3); and graph of actual and fitted (logarithmic) salaries and corresponding least squares residuals (d). The data are ordered with increasing values of profits. The original number of observations is 100, but the number of observations in estimation is 96 (sample adjusted to 5–100), as 4 firms have negative profits. Panel 3 reports the least squares regression of LOGSALARY on a constant and LOGPROFIT, with an estimated coefficient of LOGPROFIT of approximately 0.16.]

[Exhibit 5.49 (Contd.), diagnostic tests: RESET (Panel 5), recursive residuals with CUSUM and CUSUMSQ tests ((f)–(h)), Chow break test (Panel 9), and Chow forecast test (Panel 10). Both Chow tests take the break at observation 77, with 72 firms (those with lower profits) in the first subsample and 24 firms (those with higher profits) in the second subsample. The test outcomes do not give reason to adjust the functional specification of the model (Assumptions 2, 5, and 6).]
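The two Chow statistics of Panels 9 and 10 are simple functions of restricted and subsample residual sums of squares. The following sketch follows the standard formulas of Section 5.3; the arrays y and X (the data of Panel 3) and the break point n1 = 72 are assumptions of the illustration.

```python
# Sketch of the Chow break and Chow forecast tests of Panels 9 and 10,
# for a break after observation n1 (here n1 = 72 of the n = 96 firms).
import numpy as np
from scipy import stats

def ssr(X, y):
    """Residual sum of squares of the OLS regression of y on X."""
    e = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    return e @ e

def chow_tests(y, X, n1):
    n, k = X.shape
    s_all = ssr(X, y)                 # restricted (full-sample) SSR
    s1 = ssr(X[:n1], y[:n1])          # first subsample SSR
    s2 = ssr(X[n1:], y[n1:])          # second subsample SSR
    n2 = n - n1
    f_break = ((s_all - s1 - s2) / k) / ((s1 + s2) / (n - 2 * k))
    f_fcst = ((s_all - s1) / n2) / (s1 / (n1 - k))
    return (f_break, stats.f.sf(f_break, k, n - 2 * k),
            f_fcst, stats.f.sf(f_fcst, n2, n1 - k))
```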

[Exhibit 5.49 (Contd.), diagnostic tests: White test on heteroskedasticity (Panel 11), tests on serial correlation (Breusch–Godfrey LM-test in Panel 12 and Ljung–Box test on the correlogram of the residuals in Panel 13), and test on normality (histogram of the residuals and Jarque–Bera test (n)). RESOLS denotes the OLS residuals of the regression in Panel 3. The test outcomes do not give reason to adjust the standard probability model for the disturbance terms (Assumptions 3, 4, and 7).]
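For readers who wish to reproduce this battery of tests, the sketch below applies them to the fitted model of Panel 3 (the results object fit from the earlier sketch is assumed). The statsmodels functions used here are believed to have these signatures; the Ljung–Box statistic of Panel 13 is likewise available as statsmodels' acorr_ljungbox.

```python
# Sketch of the tests in Panels 11-12 and graph (n), applied to the OLS
# results object `fit` estimated in the earlier sketch (an assumption).
from statsmodels.stats.diagnostic import acorr_breusch_godfrey, het_white
from statsmodels.stats.stattools import jarque_bera

lm_w, p_w, _, _ = het_white(fit.resid, fit.model.exog)   # White test (Panel 11)
lm_bg, p_bg, _, _ = acorr_breusch_godfrey(fit, nlags=2)  # BG LM-test (Panel 12)
jb, p_jb, skew, kurt = jarque_bera(fit.resid)            # normality test (n)
print(f"White p={p_w:.2f}, BG(2) p={p_bg:.2f}, JB p={p_jb:.2f}")
```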

[Exhibit 5.49 (Contd.), diagnostic tests: instrumental variable estimate of the wage equation with LOGTURNOVER as instrument (Panel 16; the scatter diagram of the explanatory variable LOGPROFIT against the instrument LOGTURNOVER is shown in (o)) and step 2 of the Hausman test on exogeneity of the explanatory variable LOGPROFIT in the wage equation (Panel 17; RESOLS denotes the OLS residuals of the regression in Panel 3 and V denotes the residuals of the regression of LOGPROFIT on a constant and LOGTURNOVER). The sample size in estimation is 84, with 16 observations excluded, because the turnover of some of the firms is unknown. The test outcomes do not give reason to reject the assumption of exogeneity of profits in the wage equation for top managers (Assumption 1).]

Exercises: E: 5.32.
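The computations of Panels 16 and 17 can be sketched as follows. The arrays logsalary, logprofit, and logturnover are assumed to hold the 84 firms with known turnover; this is an illustration of the two-step procedure, not the book's software.

```python
# Sketch of Panels 16-17: IV estimation with LOGTURNOVER as instrument for
# LOGPROFIT, and step 2 of the Hausman test on exogeneity of LOGPROFIT.
import numpy as np

def resid(A, b):
    return b - A @ np.linalg.lstsq(A, b, rcond=None)[0]

c = np.ones_like(logprofit)
Xw = np.column_stack([c, logprofit])        # regressors of the wage equation
Zw = np.column_stack([c, logturnover])      # instruments
Xhat = Zw @ np.linalg.lstsq(Zw, Xw, rcond=None)[0]
b_iv = np.linalg.solve(Xhat.T @ Xw, Xhat.T @ logsalary)    # Panel 16
v = resid(Zw, logprofit)                    # V: first-stage residuals
res_ols = resid(Xw, logsalary)              # RESOLS (recomputed on 84 firms)
u = resid(np.column_stack([Xw, v]), res_ols)
lm = len(logsalary) * (1 - (u @ u) / (res_ols @ res_ols))  # Panel 17, chi2(1)
```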

Summary, further reading, and keywords

SUMMARY

In this chapter the seven standard assumptions of the regression model were subjected to diagnostic tests. The functional specification of the model (linear model with constant parameters, Assumptions 2, 5, and 6) was discussed in Sections 5.2 and 5.3. A correct specification is required to get consistent estimators. In practice it may be worthwhile excluding the less relevant variables — namely, if the resulting bias is compensated by an increased efficiency of the estimators. We also discussed methods for the specification and estimation of non-linear models and models with varying parameters. If the disturbances of the model are heteroskedastic or serially correlated (so that Assumptions 3 or 4 are not satisfied), then OLS is consistent but not efficient. The efficiency can be increased by using weighted least squares (based on a model for the variances of the disturbances) or by transforming the model (to remove the serial correlation of the disturbances). This was discussed in Sections 5.4 and 5.5. In Section 5.6 we considered the assumption of normally distributed disturbances (Assumption 7). If the disturbances are not normally distributed, then OLS is consistent but not efficient. Regression diagnostics can be used to detect influential observations, and if there are relatively many outliers then robust methods can improve the efficiency of the estimators. The exogeneity of the regressors (Assumption 1) is required for OLS to be consistent. If the regressors are endogenous, then consistent estimates can be obtained by using instrumental variables, and this was investigated in Section 5.7.

FURTHER READING

The textbooks mentioned in Chapter 3, Further Reading (p. 178–9), contain chapters on most of the topics discussed in this chapter. For a more extensive treatment of some of these topics we refer to the three volumes of the Handbook of Econometrics mentioned in Chapter 3. We mention some further references: Belsley, Kuh, and Welsch (1980) for regression diagnostics; Cleveland (1993) and Fan and Gijbels (1996) for non-parametric methods; Godfrey (1988) for diagnostic tests; and Rousseeuw and Leroy (1987) for robust methods.

Belsley, D. A., Kuh, E., and Welsch, R. E. (1980). Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. New York: Wiley.
Cleveland, W. S. (1993). Visualizing Data. Summit, NJ: Hobart Press.
Fan, J., and Gijbels, I. (1996). Local Polynomial Modelling and its Applications. London: Chapman and Hall.
Godfrey, L. G. (1988). Misspecification Tests in Econometrics. Cambridge: Cambridge University Press.
Rousseeuw, P. J., and Leroy, A. M. (1987). Robust Regression and Outlier Detection. New York: Wiley.

KEYWORDS

additive heteroskedasticity 327; Akaike information criterion 279; autoregressive model 362; backward elimination 281; bandwidth span 291; bias 278; bottom-up approach 281; Box–Cox transformation 297; Box–Pierce test 364; Breusch–Godfrey test 364; Breusch–Pagan test 345; Chow break test 315; Chow forecast test 316; Cochrane–Orcutt method 369; correlogram 361; CUSUM test 313; CUSUMSQ test 314; dfbetas 383; dffits 383; diagnostic testing 275; dummy variables 303; Durbin–Watson test 362; elasticity 296; empirical cycle 276; endogenous regressor 397; exogeneity 409; feasible weighted least squares 335; first differences 297; first order autocorrelation coefficient 361; forward selection 281; GMM estimator 325; Goldfeld–Quandt test 343; growth rates 297; Hausman test 410; hold-out sample 276; influential data 379; instrumental variable (IV) estimator 399; instruments 398; interaction term 286; iterated FWLS 337; Jarque–Bera test 387; kth order autocorrelation coefficient 361; kernel 360; kernel function 292; kernel method 292; kurtosis 386; least absolute deviation 390; leverage 379; Ljung–Box test 365; local regression 289; logarithmic transformation 296; mean absolute error 280; misspecification test 275, 286; multiplicative heteroskedasticity 327; nearest neighbour fit 290; Newey–West standard errors 360; order condition 398; ordered data 310; outlier 379; predictive performance 280; rank condition 398; recursive least squares 311; recursive residuals 311;

regression specification error test 285; RESET 285; robust estimation 389; root mean squared error 280; Sargan test 413; scaling 296; Schwarz information criterion 279; skewness 386; span 291; studentized residual 380; top-down approach 281; total mean squared prediction error 278; transformed model 327; trend 297; tricube weighting function 291; two-stage least squares 400; two-step FWLS 335; validity of instruments 413; weak instruments 405; weighted least squares 290, 328; White standard errors 325; White test 345

Exercises

THEORY QUESTIONS

5.1 (E Section 5.2.1) Consider the model y = X₁β₁ + X₂β₂ + ε with β₂ ≠ 0. It was shown in Section 3.2.3 (p. 143) that the restricted least squares estimator b_R = (X₁′X₁)⁻¹X₁′y has a variance that is smaller than that of the unrestricted least squares estimator b₁ in the model that includes both X₁ and X₂.
a. Using the notation of Section 5.2.1, show that MSE(b₁) − MSE(b_R) = P(V₂ − β₂β₂′)P′, where P = (X₁′X₁)⁻¹X₁′X₂ and where V₂ = var(b₂) is the g × g covariance matrix of b₂ in the model y = X₁β₁ + X₂β₂ + ε.
b. Using again the notation of Section 5.2.1, prove that TMSP(b_R) ≤ TMSP(b₁) if and only if β₂′V₂⁻¹β₂ ≤ g.
c. Show that the standard error of the regression s may be larger in the restricted model. Is the standard error in the restricted model always larger?
d. Show that, as a consequence, the estimates b_R need not be more significant (in the sense of having larger t-values) than the estimates b₁.

5.2 (E Section 5.2.1)
a. Prove that, for n sufficiently large, AIC corresponds to an F-test with critical value approximately equal to 2.
b. Prove that SIC corresponds to an F-test with critical value approximately equal to log(n).
c. Discuss the relevance of your findings for the 'bottom-up' strategy in model selection, which starts with small models and performs sequential tests on the significance of additional variables.

5.3 (E Sections 5.2.2, 5.2.3)
a. Prove the expressions (5.12)–(5.14). It is helpful to write out the normal equations X′_{t+1}X_{t+1}b_{t+1} = X′_{t+1}Y_{t+1}, where Y_{t+1} = (y₁, ..., y_{t+1})′ and X_{t+1} = (X′_t, x_{t+1})′.
b. Prove that the variances of the forecast errors f_t in (5.11) are equal to σ²v_t.
c. Prove that the forecast errors f_t are independent under the standard Assumptions 1–7.
d. The F-test for the hypothesis (5.19) requires that the disturbance vectors ε₁ and ε₂ in (5.18) are uncorrelated with mean zero and covariance matrices σ₁²I_{n₁} and σ₂²I_{n₂}, where σ₁² = σ₂². Prove the result (5.20) for the hypothesis (5.19).
e. Derive a test for the hypothesis (5.19) for the case that σ₁² ≠ σ₂².
f. Prove that the F-test (5.22) in the model (5.21) is equal to the forecast test of Section 3.4.

5.4 (E Section 5.4.1) Consider the model y_i = βx_i + ε_i (without constant term and with k = 1), where E[ε_i] = 0, E[ε_iε_j] = 0 for i ≠ j, and E[ε_i²] = σ_i². Consider the following three estimators of β: b₁ = Σx_iy_i / Σx_i², b₂ = Σy_i / Σx_i, and b₃ = (1/n)Σ(y_i/x_i).
a. For each estimator, derive a model for the variances σ_i² for which this estimator is the best linear unbiased estimator of β.

5.5 (E Section 5.3.3) Consider the non-linear wage model S(λ) = α + γD_g + μD_m + βx + ε discussed in Section 5.3, where S(λ) = (S^λ − 1)/λ.
a. Prove that in this model (dS/dx)/S = β/(1 + λ(α + γD_g + μD_m + βx + ε)).
b. Suppose that log(y) ~ N(μ, σ²); then prove that y has mean e^{μ+σ²/2}, median e^μ, and variance e^{2μ+σ²}(e^{σ²} − 1).

5.6 (E Section 5.4.1) Consider the model y = Xβ + ε, which satisfies the standard regression Assumptions 1–7, except for Assumption 3.
a. Let σ_i² = σ²z_i, where z_i is a single explanatory variable that takes on only positive values. Show that the OLS estimator is unbiased, and derive expressions for the variance of the OLS estimator and also for the WLS estimator.
b. Let σ_i² = σ²x_i². Use the results in a to show that the OLS estimator has a variance that is at least as large as that of the WLS estimator.
c. For the general case, show that the OLS variance in (5.24) is always at least as large as the WLS variance (5.26) — that is, var(b) − var(b*) is positive semidefinite.
d. Now let σ_i² = σ²z_i^a. Derive the log-likelihood and its first order derivatives, and derive the LM-test for homoskedasticity (a = 0).
e. Show that the LM-test in d is given by LM = SSE/2, where SSE is the explained sum of squares of the regression of e_i²/σ²_ML on a constant and log(z_i), and where σ²_ML = e′e/n, with e the vector of OLS residuals.
f. Show that, in large enough samples, the result in e can also be written as LM = nR² of the regression of e_i² on a constant and log(z_i).
g. Derive the expression (5.39) for the LR-test for groupwise heteroskedasticity in the model y = Xβ + ε.
h. Prove the results that are stated in Section 5.4.4 for the consistent estimation of the parameters of the multiplicative model for heteroskedasticity.

5.7* (E Section 5.4.3) In this exercise we derive the three-step method for the computation of the Breusch–Pagan test on homoskedasticity. In the additive heteroskedasticity model σ_i² = z_i′γ, the variances are estimated by the regression e_i² = z_i′γ + η_i, with error terms η_i = e_i² − σ_i² (see Section 5.4.4). It may be assumed that the model (5.45) contains a constant term, and that the variables z_i are exogenous in the sense that plim((1/n)Σ z_i(e_i² − σ_i²)) = 0. We assume further that plim((1/n)Σ z_iz_i′) = Q_zz exists, with Q_zz an invertible matrix.
a. Show that γ is estimated consistently under these assumptions.
b. Show that the LM-test for γ₂ = ... = γ_p = 0 is given by LM = SSE/2, where SSE is the explained sum of squares of the regression of e_i²/σ²_ML on z. In the text this test was derived by using the results of Section 4.3.6 (p. 218) on non-linear regression models.
c. Prove that the result in b can be written as nR² of the auxiliary regression (5.40).

5.8* (E Section 5.5.3) In this exercise we consider an alternative derivation of the Breusch–Godfrey test — that is, the ML-based version of this test. The model is given by (5.45) with AR(1) errors (5.47), where η_i ~ NID(0, σ_η²), and the null hypothesis of no serial correlation corresponds to γ = 0. The parameter vector is θ = (β′, γ, σ_η²)′.
a. Determine the log-likelihood of this model (for the observations (y₂, ..., y_n), treating y₁ as a fixed, non-random value).
b. Determine the first and second order derivatives of the log-likelihood with respect to the parameter vector θ, and determine the first order conditions (for ML) and the information matrix.
c. Use the results in a and b to compute the LM-test by means of the definition in (4.54) in Section 4.2.4 (p. 238).
d. Show that, in large enough samples, the LM-test of c is asymptotically equivalent to the Breusch–Godfrey (BG) test obtained by the regression of the OLS residuals e_i on x_i and the lagged values e_{i−1}, ..., e_{i−p}.

5.9* (E Section 5.5.3) In this exercise we show that the Box–Pierce (BP) test (5.50) is asymptotically equivalent to the Breusch–Godfrey (BG) test. We assume that the explanatory variables x include a constant term and that they satisfy the conditions that plim((1/n)Σ x_ix_i′) = Q is invertible and that plim((1/n)Σ e_{i−j}x_i) = 0 for all j = 1, ..., p. Further we assume that under the null
non-random value).4.29) in the sense that var(b) À var(bÃ ) is positive semideﬁnite. Derive the log-likelihood for this model. Á Á Á . ML 5:8Ã (E Section 5. Show in particular that this can be written as LM ¼ SSE=2.5) a. Show that. n — and that plim( 1 xi eiÀ1 ) ¼ 0. except for Assumption 3. eiÀp . 5:9Ã (E Section 5. treating y1 as a ﬁxed. Use the results in a and b to compute the LM-test by means of the deﬁnition in (4. the auxiliary regression (5. Derive the expression (5. d. The model is given by (5. with parameter vector y ¼ (b0 . c. Use the results in b to show that the OLS estimator has a variance that is at least as large as that of the WLS estimator. yn ).

Describe in detail how m can be estimated by the Cochrane–Orcutt method. where the ei are NID(0. Á Á Á . Now suppose that the error terms are not generated by the above process. Prove that plim(b) þ plim(r) ¼ b þ g. then show that (under p the ﬃﬃﬃ^ null hypothesis of no serial correlation) nd % 0 pﬃﬃﬃ 1 p 1ﬃﬃ 0 and n^ g%s E e (where we write a % b if 2 n plim(a À b) ¼ 0).5. d. c. transformed parameters (b þ g and bg) can be estimated consistently by OLS.6.6. where À1 < b < 1 and À1 < g < 1 and the terms Zi are homoskedastic and uncorrelated.Exercises 429 hypothesis of P absence of serial correlation there also n holds plim( 1 i¼pþ1 eiÀj ei ) ¼ 0 for all j ¼ 1. a. i 6¼ j — that is.2 and 5. n a. a. 2 b. OLS for all observations but with a dummy for the jth observation included in the model — and let s2 j be the corresponding esti^ ¼ b(j). Give an interpretation of the results in a and c by drawing scatter plots.4) Consider the model yi ¼ m þ ei . where the terms Zi (with mean zero) are uncorrelated and homoskedastic and where À1 < g < 1.6. j c. Show that sj ¼ nÀ kÀ1 s À (nÀkÀ1)(1Àhj ). Á Á Á . 2 1Àhj ) (xi Àx d.54) satisfy 0 hj 1 and n j¼1 hj ¼ k. Show that the above model can be rewritten as yi ¼ (b þ g)yiÀ1 À bgyiÀ2 þ Zi and that the two 5. Investigate whether the estimator of a is unbiased. eiÀp . n P 2 2 and plim( 1 e i)¼s . by leaving ^ and ^ out the jth observation. Prove that theP leverages hj in (5. nþ )2 (xi Àx e2 j nÀk 2 2 b. The following results are helpful in computing regression diagnostics for this model by means of a single regression including all n observations. k ¼1 n c.12 (E Sections 5. Show that 2the leverages are equal to ) (xj Àx P hj ¼ 1 . where the forecast is based on the (n À 1) observations i 6¼ j.11. What is the implication of these results for the Durbin–Watson test when lagged values of the dependent variable are used as explanatory variables in a regression model? 5. mated disturbance variance. We use the notation of Section 5. e. but that instead e1 ¼ Z1 and ei ¼ eiÀ1 þ Zi for i ¼ 2. Prove this. Further let b g be the OLS estimators of b and g in the model (5. Á Á Á . Show that the ‘dfbetas’ (for b) are equal to eÃ x Àx j pﬃﬃﬃﬃﬃﬃﬃ ﬃ pﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃ Pj . 5.6. Use the result in b to prove that for regression Pthe p SSE 2 in a there holds nR2 ¼ SST =n % n k¼1 rk . Can this result be used to estimate b and g? Àb2 ) b. s2 ).3) Let yi ¼ byiÀ1 þ ei and ei ¼ geiÀ1 þ Zi . Investigate also whether it is consistent.2) Consider the simple regression model yi ¼ a þ bxi þ ei .6. and s2 ¼ s ( j ). where the columns of E consist of lagged values of the OLS residuals. By b we denote the OLS estimator of b and by r the estimator of g obtained by regressing the OLS residuals ei on their lagged values eiÀ1 . where m is the unknown mean of the variable y and the ei are error terms. . Show that the error terms ei are homoskedastic but that all autocorrelations are non-zero. c. Show that the explained sum of squares SSE of regression in a satisﬁes SSE % Pp the P 2 1 2 [ i ei eiÀk ] =s . Show that (5. Let ^ d and ^ g be the OLS estimators obtained from this model. Derive the best linear unbiased estimator for m in this model.5.56) can be interpreted as a Chow forecast test for the jth observation. Write the regression of ei on xi and eiÀ1 .6.2. Prove that b 2 ^ g ¼ yj À x0j b(j). Prove that plim(b) ¼ b þ g(1 1þbg .13 (E Section 5. Investigate whether the estimator of c is unbiased.10 (E Section 5. d. The studentized residual in (5.11 (E Section 5. 
5.12 (E Sections 5.5.2, 5.5.4) Consider the model y_i = μ + ε_i, where μ is the unknown mean of the variable y and the ε_i are error terms. It is assumed that ε₁ = (1 − γ²)^{−1/2}η₁ and ε_i = γε_{i−1} + η_i for i = 2, ..., n, where the terms η_i are uncorrelated and homoskedastic and where −1 < γ < 1.
a. Derive the best linear unbiased estimator for μ in this model.
b. Describe in detail how μ can be estimated by the Cochrane–Orcutt method. Give also an intuitive motivation for this estimator of μ.
c. Show that (5.64) is a consistent estimator of σ in the simple case that y_i = μ + ε_i, using the results of b and of Exercise 3.14. Use the fact that s² is a consistent estimator of the variance 1/(1 − γ²) of the disturbances ε_i.

5.13 (E Sections 5.6.1, 5.6.2) In this exercise we use the notation of Sections 5.6.1 and 5.6.2. Let b(j) and s²(j) be the estimators of β and the disturbance variance σ² obtained by the regression y_i = x_i′β + ε_i, leaving out the jth observation. The following results are helpful in computing regression diagnostics for this model by means of a single regression including all n observations — that is, OLS for all observations but with a dummy for the jth observation included in the model (5.55); let s_j² be the corresponding estimated disturbance variance.
a. Prove that the leverages h_j in (5.54) satisfy 0 ≤ h_j ≤ 1 and Σ_{j=1}^n h_j = k.
b. Show that for the simple regression model the leverages are equal to h_j = 1/n + (x_j − x̄)²/Σ(x_i − x̄)².
c. Prove that γ̂ = y_j − x_j′b(j), the coefficient of the dummy variable, and that β̂ = b(j).
d. Show that s_j² = ((n − k)/(n − k − 1))s² − e_j²/((n − k − 1)(1 − h_j)). Try to give an intuitive explanation of this result.
e. Show that the 'dfbetas' (for β) are equal to e*_j(x_j − x̄)/(s_j √(1 − h_j) √Σ(x_i − x̄)²).
f. Show that (5.56) — that is, the studentized residual in (5.55) — can be interpreted as a Chow forecast test for the jth observation, where the forecast is based on the (n − 1) observations i ≠ j.

5.14 (E Section 5.6.2) In this exercise we consider the variance of the 'dfbetas'.
a. Show that the average variance of the 'dfbetas' in (5.58) is approximately equal to 1/n if the terms are all approximately constant — that is, if all least squares coefficients have approximately the same standard error.
b. Show that the variance of the 'dfbetas' for the slope parameter β in the simple regression model is approximately (x_j − x̄)²/Σ(x_i − x̄)², with average value 1/n.

5.15 (E Section 5.6.4) Consider estimation of the mean μ from a random sample y_i = μ + ε_i, i = 1, ..., n. Let μ be estimated by minimizing the criterion Σ|y_i − μ|.
a. Show that the median m = med(y_i, i = 1, ..., n) is optimal for this criterion (distinguish the cases n odd and n even).
b. Show that the median is the maximum likelihood estimator if the disturbances are independently distributed with the double-exponential distribution with density function f(ε_i) = (1/(2a))e^{−|ε_i|/a}. Show this also by computing the weights w_i in (5.61) corresponding to this criterion.
c. Now suppose that the disturbances are independently t(d) distributed with density f(ε_i) = c_d(1 + ε_i²/d)^{−(d+1)/2}, where d is positive and c_d is a scaling constant. Compare this distribution with the normal distribution, and give an intuitive motivation why the median will be a better estimator than the sample mean in this case.
d. Show that the ML estimator is relatively insensitive to outliers (for small values of d) by writing out the first order condition d log(L)/dμ = 0. Show this also by computing the weights w_i in (5.61) for the t-distribution.

5.16 (E Section 5.6.4) As estimation criterion we consider expressions of the form S(θ) = Σ_{i=1}^n G(e_i(θ)), where e_i(θ) are the residuals corresponding to θ. We impose the following conditions on the function G: G(0) = 0; G(e) = G(−e); G is non-decreasing in |e| and constant for |e| > c for a given constant c > 0; and the Hessian of G is continuous.
a. Discuss possible motivations for each of these conditions.
b. Let G be a polynomial of degree six, G(e) = Σ_{k=0}^6 G_k e^k for |e| ≤ c. Show that,
up to an arbitrary multiplicative scaling constant.3. Á Á Á . n.58) in this case is approximately equal to 1=n if the terms all are approximately constant — that is. Prove that the F-test on linear restrictions can be computed by (5. 5:17Ã (E Section 5. Prove that in this case X equal to Xj . n) is optimal for this criterion (distinguish the cases n odd and n even). Show that. j ¼ 1. It is helpful to prove ﬁrst that ^ e0R ^ eR À e0RIV PZ eRIV ¼ y0 y À y0 PZ y and also that ^ e0 ^ e À e0IV PZ eIV ¼ y0 y À y0 PZ y. G is non-decreasing in jej and constant for jej > c for a given constant c > 0. Show this also by computing the weights wi in (5. Now consider the multiple regression model. ^ bIV )0 (y À X ^ bIV )=(n À k) — is not a ^ s2 ¼ (y À X consistent estimator of s2 .

b. e. Analyse the recursive residuals and the CUSUM and CUSUMSQ plots.1) Consider the following model for the relation between macroeconomic consumption (C). Generate a sample of size 100 from the model pﬃﬃﬃﬃ yi ¼ 2 þ xi þ ei where the xi are independent and uniformly distributed on the interval [0. The null hypothesis of exogeneity is that a ¼ 0. In the equations xei ¼ Gzi þ vi it is assumed that vi $ NID(0. 2 i¼1 i 2 2s b. s) with r ¼ 1. Answer the same questions after the data have been ordered with increasing values of x. but these results need not be proved: Brs ¼ 0 for all (r. d. Show that the score vector @ l=@ y of the unrestricted log-likelihood. Let sub-index 1 indicate the blocks related to a and sub-index 2 the blocks 1 ^0 ^ related to b. s2 ). The Hessian matrix ( À @@yyl0 ) is a 5 Â 5 block matrix with blocks Brs . vi ^ n ^ where ^ vi ¼ xei À Gzi . and also for all (r. 5. . because of the product term G0 a. 0. 4. For the estimated value of l. and non-consumptive expenditures (Z): Ci ¼ a þ bDi þ ei (the consumption equation) and Di ¼ Ci þ Zi (the income equation). matrix U ¼ (V f. where U is the n Â (k0 þ k) ^2 s ^ X). Prove that e implies that LM ¼ nR2 of the regression in (5. ^2 2 d. Use the results in c and d to prove that the LM-test computed according to LM ¼ @l 0 @ 2 l À1 @ l (@ y ) ( À @ yy0 ) ( @ y ) can be written as LM ¼ 1 0 0 e U(U U)À1 U0 e.82) can be written as yi ¼ x0i b þ ei ¼ x0i b þ v0i a þ wi ¼ x0i b þ (xei À Gzi )0 a þ wi : This model is non-linear in the parameters y ¼ (a. n. FurB12 ¼ B21 ¼ s V 22 2 2 ^ ^ s ther the following approximations may be used in e. c. V). Perform a RESET test and a Chow forecast test. that ^2 0 1 ^0 1 X . where x(l) ¼ (xl À 1)=l. 2 and s ¼ 3. a.83). Á Á Á .18 (E Section 5. in which case ei ¼ wi $ NID(0. With this notation. is zero. b. Here Z is assumed to be exogenous in the sense that E[Zi ei ] ¼ 0 for all i ¼ 1. and use this graph to explain why OLS is not consistent. Show that the ML estimators obtained under the ^ ¼ b. Perform also a RESET test and a Chow forecast test. Prove that the application of OLS in the consumption equation gives an inconsistent estimator of the parameter b. EMPIRICAL AND SIMULATION QUESTIONS 5. and G for the k0 Â m matrix with rows g0j . e.7. c. 2. then show that B11 ¼ s V V . and V G v0i .3) a. Show that the log-likelihood is by Pngiven À1 1 0 À1 logdet( V ) À v V vi l(y) ¼ Àn log(2p) þ n i i ¼ 1 2 P2 n 2 1 2 Àn log ( s ) À w . 4. s2 . null hypothesis that a ¼ 0. a.3. b. s ^2 ¼ e0 e=n. e. Estimate the model y ¼ a þ bx(l) þ e by ML. and in the equation ei ¼ v0i a þ wi (where v0i a ¼ E[ei jvi ]) it is assumed that wi $ NID(0. c.Exercises 431 of Xe .19 (E Section 5. where V is the k0 Â k0 covariance matrix of vi . D. one where Z does not vary at all and another where Z has a very large variance. 5 and s ¼ 1. d. Derive an explicit expression for the IV estimator of b in terms of the observed variables C. disposable income (D). a for the k0 Â 1 vector with elements aj .01). evaluated at the estimates of b. with the exception of @ l=@ a. are given by P b ^ ¼1 ^ ^ 0 ¼ (Z0 Z)À1 Z0 Xe . 20] and the ei are independent and distributed as N(0. and that B ¼ X0 X. Regress y on a constant and x. and Z. G. Give a graphical illustration of the result in a by drawing a scatter plot of C against D. regress y on a constant and x(l) and analyse the corresponding recursive residuals and CUSUM and CUSUMSQ plots. VÀ1 ). which 1 ^0 is equal to s V e. Use the expression of d to prove that this IV estimator is consistent. 
a. Show that the model (5.82) can be written as y_i = x_i′β + ε_i = x_i′β + v_i′α + w_i = x_i′β + (x_{ei} − Gz_i)′α + w_i. This model is non-linear in the parameters θ = (α′, β′, G, V, σ²), because of the product term G′α.
b. Show that the log-likelihood is given by l(θ) = −n log(2π) + (n/2) log det(V⁻¹) − (1/2)Σ v_i′V⁻¹v_i − (n/2) log(σ²) − (1/(2σ²))Σ w_i².
c. Show that the ML estimators obtained under the null hypothesis that α = 0 are given by β̂ = b, Ĝ′ = (Z′Z)⁻¹Z′X_e, σ̂² = e′e/n, and V̂ = (1/n)Σ v̂_iv̂_i′, where v̂_i = x_{ei} − Ĝz_i.
d. Show that the score vector ∂l/∂θ of the unrestricted log-likelihood, evaluated at the estimates of c, is zero, with the exception of ∂l/∂α, which is equal to (1/σ̂²)V̂′e, where V̂ is the matrix with rows v̂_i′.
e. The Hessian matrix (−∂²l/∂θ∂θ′) is a 5 × 5 block matrix with blocks B_rs. Let sub-index 1 indicate the blocks related to α and sub-index 2 the blocks related to β. The following results may be used in e, but these results need not be proved: B_rs = 0 for all (r, s) not related to α and β, B₁₁ = (1/σ̂²)V̂′V̂, B₁₂ = B₂₁′ = (1/σ̂²)V̂′X, and B₂₂ = (1/σ̂²)X′X. Use the results in c and d to prove that the LM-test computed according to LM = (∂l/∂θ)′(−∂²l/∂θ∂θ′)⁻¹(∂l/∂θ) can be written as LM = (1/σ̂²)e′U(U′U)⁻¹U′e, where U = (V̂ X) is the n × (k₀ + k) matrix.
f. Prove that e implies that LM = nR² of the regression in (5.83).

EMPIRICAL AND SIMULATION QUESTIONS

5.19 (E Section 5.7.1) Consider the following model for the relation between macroeconomic consumption (C), disposable income (D), and non-consumptive expenditures (Z): C_i = α + βD_i + ε_i (the consumption equation) and D_i = C_i + Z_i (the income equation). Here Z is assumed to be exogenous in the sense that E[Z_iε_i] = 0 for all i = 1, ..., n.
a. Prove that the application of OLS in the consumption equation gives an inconsistent estimator of the parameter β.
b. Give a graphical illustration of the result in a by drawing a scatter plot of C against D, and use this graph to explain why OLS is not consistent.
c. Derive an explicit expression for the IV estimator of β in terms of the observed variables C, D, and Z.
d. Use the expression of c to prove that this IV estimator is consistent.
e. Give an interpretation of the results in a and c by drawing scatter plots. Consider two cases in b, one where Z does not vary at all and another where Z has a very large variance.

5.20 (E Section 5.3.1) Generate a sample of size 100 from the model y_i = 2 + √x_i + ε_i, where the x_i are independent and uniformly distributed on the interval [0, 20] and the ε_i are independent and distributed as N(0, 0.01).
a. Regress y on a constant and x. Analyse the recursive residuals and the CUSUM and CUSUMSQ plots.
b. Perform a RESET test and a Chow forecast test.
c. Estimate the model y = α + βx^(λ) + ε by ML, where x^(λ) = (x^λ − 1)/λ.
d. For the estimated value of λ, regress y on a constant and x^(λ) and analyse the corresponding recursive residuals and CUSUM and CUSUMSQ plots.
e. Perform also a RESET test and a Chow forecast test, and comment on the outcomes.

5.21 (E Sections 5.4.1, 5.4.3) Simulate n = 100 data points as follows. Let x_i consist of 100 random drawings from the standard normal distribution, let η_i be a random drawing from the distribution N(0, x_i²), and let y_i = x_i + η_i. We will estimate the model y_i = βx_i + ε_i. The OLS estimator of β is given by b = Σx_iy_i / Σx_i², and the conventional OLS formula for the variance is var(b) = s²/Σx_i², where s² = Σ(y_i − bx_i)²/(n − 1).
a. Prove that, although the regressors are stochastic here, the OLS estimator b is unbiased and consistent in this case. Estimate β by OLS, and estimate the standard error of b both in the conventional way and by White's method.
b. Estimate β by WLS, using the knowledge that σ_i² = σ²x_i². Compute the standard error of this estimate, using the specified model for the variances.
c. Now estimate β by WLS using the (incorrect) heteroskedasticity model σ_i² = σ²/x_i². Compute the standard error of this estimate in three ways — that is, by the WLS expression corresponding to this (incorrect) model, by the White method for OLS on the (incorrectly) weighted data, and also by deriving the correct formula for the standard deviation of WLS with this incorrect model for the variances. Compare the estimate and the standard error obtained for this WLS estimator with the results for OLS in a.
d. Perform 1000 simulations, where the n = 100 values of x_i remain the same over all simulations but the 100 values of η_i are different drawings from the N(0, x_i²) distributions and where the values of y_i = x_i + η_i differ accordingly between the simulations. Determine the sample standard deviations over the 1000 simulations of the three estimators of β in a, b, and c.
e. Compare the three sample standard deviations in d with the estimated standard errors in a, b, and c — that is, OLS, WLS (with correct weights), and WLS (with incorrect weights). Which standard errors are reliable, and which ones are not?
For both simulations.4.432 5 Diagnostic Tests and Model Adjustments 5. b. but that it is approximately þrg 2 c b) 1 equal to var var( 1Àrg. Discuss the relevance of your ﬁndings for the interpretation of serial correlation tests (like Durbin–Watson) for cross section data. and comment on the outcomes.4. Sort the data of a with increasing values of x. For the data generated with b ¼ 0. d. Explain the results in b and d by considering relevant scatter diagrams. 20] and the ei are independent and distributed as N(0. Estimate the standard error of b both in the conventional way and by White’s method. where the xi are independent and uniformly distributed on the interval [0. Generate a sample of size n ¼ 100 from the pﬃﬃﬃﬃ model yi ¼ 2 þ xi þ ei . Now estimate b by WLS using the (incorrect) 2 2 heteroskedasticity model s2 i ¼ s =xi . Prove that.

Formulate a model with two different values of b in (5. 5. 5. 5. d. Give a possible explanation of the estimated positive effect. using the speciﬁed model for the variances. a. Perform also a sequence of Chow forecast tests and give an interpretation of the outcomes. c. the model for the variances is additive and contains also effects of the level of education.7.26 (E Section 5. namely six weeks without any marketing actions.24 (E Section 5. one for education levels less than 16 years (observations i 365) and another for education levels of 16 years or more (observations i ! 366). c.1. g. XM501BWA . e.5) Consider the salary data of Example 5. so that there are nine possible break points. Give also an intuitive motivation for this estimator of b. f.7. Check that the result of c holds true. b. For each of the two data generating processes. c. the other data to non-election years (z ¼ 0). 5. Perform Chow break tests and Chow forecast tests (with the break now located at observation 366). 5. Comment on the relevance of your ﬁndings for signiﬁcance tests of regression coefﬁcients if serial correlation is neglected.15 with the regression model discussed in XM501BWA that example. Some of the data refer to election years (z ¼ 1). using z (and a constant) as instruments. and give an interpretation of the outcomes.3. Perform a sequence of Chow break tests for all segments where the variable ‘education’ changes. a. For the model with b ¼ 0. compute the frequency of rejection of the null hypothesis that b ¼ 0 for the ttests based on the OLS and the HAC standard errors of b. Perform the Hausman test on the exogeneity of the variable x. Inspect the histogram of xi and choose two subsamples to perform the Goldfeld–Quandt test on possible heteroskedasticity due to the variable xi . using the two t-values obtained by the OLS and the HAC standard errors of b. Give an interpretation of the resulting estimate. We want to estimate the effect of police on crime — that is.9.1) In this exercise we consider data on weekly coffee sales (for brand 1). b. Perform the Breusch–Pagan test on heteroskedasticity.3) Consider the data of Example 5. This variable takes on ten different values. b. Comment on the outcomes. Comment on the similarities and differences between the test outcomes in b–d. 5. e. Use the data to estimate b by instrumental variables. Regress ‘crime’ on a constant and ‘police’. Check that the data in the data ﬁle are sorted with increasing values of xi . Show that the IV estimator of b is given by (y1 À y0 )=(x1 À x0 ). h.3) In this exercise we consider simulated data on the relation between police (x) XR523SIM and crime (y). Repeat the simulation of d 1000 times.4.25 (E Sections 5.22. Estimate the eleven parameters (six regression parameters and ﬁve variance parameters) by (two-step) FWLS and compare the outcomes with the results in Exhibit 5. where y1 denotes the sample mean of y over election years and y0 over nonelection years and where x1 and x0 are deﬁned in a similar way. e.3. Compare these values and relate them to the outcomes in e. Relate the outcomes in f also to the result obtained in c. Estimate this model. Check the outcomes on a break (at observation 425 for the Chow tests) discussed in Example 5.9 (ordered with education). a.4.16). d. d. that b 6¼ 0 (at 5% signiﬁcance). In this exercise we adjust the model for the variances as follows: 2 E[e2 i ] ¼ g1 þ g2 D2i þ g3 D3i þ g4 xi þg5 xi — that is.23 (E Sections 5. 
compute the frequency of rejection of the null hypothesis that β = 0 for the t-tests based on the OLS and the HAC standard errors of b. Compute also the standard deviation of the estimates b over the 1000 simulations, the mean of the 1000 reported OLS standard errors, and the mean of the 1000 reported HAC standard errors. Compare these values and relate them to the outcomes in d. Comment on the relevance of your findings for significance tests of regression coefficients if serial correlation is neglected.

5.23 (E Section 5.7.3) In this exercise we consider simulated data (XR523SIM) on the relation between police (x) and crime (y). We want to estimate the effect of police on crime — that is, the parameter β in the model y_i = α + βx_i + ε_i. Some of the data refer to election years (z = 1), the other data to non-election years (z = 0).
a. Regress 'crime' on a constant and 'police'. Give a possible explanation of the estimated positive effect.
b. Give a verbal motivation why the election dummy z could serve as an instrument.
c. Show that the IV estimator of β is given by (ȳ₁ − ȳ₀)/(x̄₁ − x̄₀), where ȳ₁ denotes the sample mean of y over election years and ȳ₀ over non-election years and where x̄₁ and x̄₀ are defined in a similar way. Give also an intuitive motivation for this estimator of β.
d. Use the data to estimate β by instrumental variables, using z (and a constant) as instruments. Give an interpretation of the resulting estimate.
e. Perform the Hausman test on the exogeneity of the variable x, and comment on the outcomes.

5.24 (E Section 5.4.4) Consider the salary data of Example 5.15 (XM501BWA) with the regression model discussed in that example. In this exercise we adjust the model for the variances as follows: E[ε_i²] = γ₁ + γ₂D_{2i} + γ₃D_{3i} + γ₄x_i + γ₅x_i² — that is, the model for the variances is additive and contains also effects of the level of education.
a. Estimate the eleven parameters (six regression parameters and five variance parameters) by (two-step) FWLS and compare the outcomes with the earlier results.
b. Perform the Breusch–Pagan test on heteroskedasticity. Also perform the White test on heteroskedasticity. Comment on the similarities and differences between the test outcomes.

5.25 (E Sections 5.3.2, 5.3.3) Consider the data of Example 5.9 (XM501BWA, ordered with education), which showed a break at observation 366 (education at least 16 years) in the marginal effect β of education on salaries (see Exhibit 5.15).
a. Check that the data in the data file are sorted with increasing values of x_i. This variable takes on ten different values, so that there are nine possible break points.
b. Perform a sequence of Chow break tests for all segments where the variable 'education' changes, and give an interpretation of the outcomes. Perform also a sequence of Chow forecast tests and give an interpretation of the outcomes.
c. Check the outcomes on a break (at observation 425 for the Chow tests) discussed in Example 5.15. Perform Chow break tests and Chow forecast tests with the break now located at observation 366.
d. Formulate a model with two different values of β in (5.16), one for education levels less than 16 years (observations i ≤ 365) and another for education levels of 16 years or more (observations i ≥ 366). Estimate this model and relate the outcomes to the results obtained in c.

5.26 (E Section 5.2.1) In this exercise we consider data (XR526COF) on weekly coffee sales (for brand 1). In total there are n = 18 weekly observations, namely six weeks without any marketing actions, six weeks with price reductions without advertisement, and six weeks with joint price reductions and advertisement. In Example 5.7 we considered similar coffee data; now we shift the attention to another subset of the data, and we restrict the attention to sales in the twelve weeks with marketing actions. As there are no advertisements without simultaneous price reductions, we formulate the model y = β₁ + β₂D_p + β₃D_a + β₄D_pD_a + ε, where y denotes the logarithm of weekly sales, D_p is a dummy variable with the value 0 if the price reduction is 5% and the value 1 if this reduction is 15%, and D_a is a dummy variable that is 0 if there is no advertisement and 1 if there is advertisement.
a. Estimate this model and test the null hypothesis that β₂ = 0. What is the P-value of this test?
b. Estimate the above model, replacing D_a by the alternative dummy variable D*_a, which has the value 0 if there is advertisement and 1 if there is not. The model then becomes y = β*₁ + β*₂D_p + β*₃D*_a + β*₄D_pD*_a + ε.
c. Derive the four relations between the parameters β_i and β*_i. Check that the two sets of regression parameters satisfy these relations.
d. Compare the estimated price coefficient and its t-value and P-value in b with the results obtained in a. Explain why the two results for the price dummy differ in a and b.
e. Discuss the relevance of this fact for the interpretation of coefficients of dummy variables in regression models.
.25 (see Exhibit 5.XM520FEX ments as discussed in Example 5.434 5 Diagnostic Tests and Model Adjustments six weeks with price reductions without advertisement.1 and the non-linear model of Section 5. d. b. What is your conclusion? 5. s2 )0. we formulate the model y ¼ b1 þ b2 Dp þ b3 Da þ b4 Dp Da þ e.29 (E Sections 5.4. replacing Da by the alternative dummy variable DÃ a . In particular. assuming that the error terms ei are normally distributed.6. Discuss how the non-linear model can be estimated in case of heteroskedasticity related to the group size. Derive the four relations between the parameters bi and bÃ i . and six weeks with joint price reductions and advertisement.7 we considered similar coffee data. g ¼ 1=2. Test the hypothesis of normally distributed error terms ei by means of the ML residuals of b.

c. When the model is estimated for the forty-eight ﬁrms with the smallest (positive) proﬁts.3. Discuss whether you ﬁnd the seven standard assumptions of the regression model intuitively plausible. Investigate the presence and nature of inﬂuential observations in the CAPM for the sector of noncyclical consumer goods. 5.3. 5. 5. How would you estimate the yearly growth rate of industrial production.49 (for the sample of n ¼ 96 ﬁrms with positive proﬁts). f. Test for the individual and joint signiﬁcance of the seasonal dummies. Test the hypothesis . c. Estimate the number of votes v in Palm Beach county that are accidentally given to Buchanan by including a dummy variable for this county in the regression model of a. 5. b. and discuss the importance of this ﬁnding for a top manager of a ﬁrm with small proﬁts who wishes to predict his or her salary. then estimate the model D4 yi ¼ m þ ei by OLS. d.27 we considered the CAPM for the sector of cyclical consumer XR530SMR goods. Let D4 yi ¼ yi À yiÀ4 be the yearly growth rate. a. In all tests below use a signiﬁcance level of 5%.3) In Example 5.4.31 we considered data on XM531MGC gasoline consumption (GC).4. e. a. before recounting. where yi is the average salary of top managers of ﬁrm i and xi is the proﬁt of ﬁrm i (both in logarithms). c.5. Estimate the model GCi ¼ a þ bPGi þ gRIi þ ei . a. price of gasoline (PG).3. Discuss the relevance of the possible presence of heteroskedasticity and serial correlation on the detection of inﬂuential observations.5. The data XR533USP ﬁle contains the number of votes on the different candidates in the n ¼ 67 counties of the state Florida. b.3.3.31 (E Sections 5.32 (E Section 5.6. Perform a test for heteroskedasticity over this period.3.Exercises 435 b. using the data over the period 1970–95. In addition we now also consider the sector of non-cyclical consumer goods. Perform a test on parameter constancy over this period. e.8) In Section 5. for given values of the explanatory variables over this period. Perform a test for serial correlation over this period. Now include seasonal dummies to account for possible seasonal effects. This resulted in ballot papers with multiple punch holes. a.6. but third punch hole on the ballot paper) but by accident ﬁrst selected Buchanan (second punch hole on the ballot paper).5. 5.30 (E Sections 5. Check this. by the sample mean. b. The county Palm Beach is observation number i ¼ 50. by the median. Check the results of diagnostic tests reported in Exhibit 5. 5.6. The difference (before recounts) between Bush and Gore in the state Florida was 975 votes in favour of Bush. Perform a regression of the number of votes on Buchanan on a constant and the number of votes on Gore. We postulated the model yi ¼ a þ bxi þ ei .33 (E Sections 5. The recounts in Florida were motivated in part by possible mistakes of voters in Palm Beach who wanted to vote for Gore (the second candidate. d. and real income (RI) over the years 1970–99. d. 5. What are the leverages in this model? Investigate the presence of outliers in this model.8 we considered the relation between the salary of top managers and XM535TOP the proﬁts of ﬁrms for the 100 largest ﬁrms in the Netherlands in 1999.2) In this exercise we consider data on the US presidential election in 2000. or in another way? Motivate your answer. c. 5. Perform a Chow forecast test for the quality of the model in a in forecasting the gasoline consumption in the years 1996–99. then no signiﬁcant relation is found. 
5.29 (E Sections 5.2.3, 5.3.2) In this exercise we consider the quarterly series of industrial production (y_i, in logarithms) for the USA over the period 1950.1–1998.3 (XM526INP). These data were discussed in Example 5.25 (see Exhibit 5.34).
a. Estimate the linear trend model y_i = α + βi + ε_i and test whether the slope β is constant over the sample.
b. Now include seasonal dummies to account for possible seasonal effects. Test for the individual and joint significance of the seasonal dummies.
c. Let Δ₄y_i = y_i − y_{i−4} be the yearly growth rate; then estimate the model Δ₄y_i = μ + ε_i by OLS. Check that μ is estimated by the sample mean.
d. How would you estimate the yearly growth rate of industrial production — by the sample mean, by the median, or in another way? Motivate your answer.

5.30 (E Sections 5.4.3, 5.5.3, 5.6.2) In Example 5.27 we considered the CAPM for the sector of cyclical consumer goods (XR530SMR). In addition we now also consider the sector of non-cyclical consumer goods. In all tests below use a significance level of 5%.
a. Perform tests for heteroskedasticity and serial correlation in the CAPM for the sector of cyclical consumer goods. Answer a also for the sector of non-cyclical consumer goods.
b. Investigate the presence of outliers in these models.
c. Investigate the presence and nature of influential observations in the CAPM for the sector of non-cyclical consumer goods.
d. Discuss the relevance of the possible presence of heteroskedasticity and serial correlation on the detection of influential observations.

5.31 (E Sections 5.3.3, 5.4.3, 5.5.3, 5.6.3) In Example 5.34 we considered data (XM531MGC) on gasoline consumption (GC), price of gasoline (PG), and real income (RI) over the years 1970–99.
a. Estimate the model GC_i = α + βPG_i + γRI_i + ε_i, using the data over the period 1970–95.
b. Perform a test on parameter constancy over this period.
c. Perform a test for heteroskedasticity over this period.
d. Perform a test for serial correlation over this period.
e. Perform a test on outliers and a test on normality of the disturbances over this period.
f. Perform a Chow forecast test for the quality of the model in a in forecasting the gasoline consumption in the years 1996–99.

5.32 (E Section 5.8) In Section 5.8 we considered the relation between the salary of top managers and the profits of firms for the 100 largest firms in the Netherlands in 1999 (XM535TOP). We postulated the model y_i = α + βx_i + ε_i, where y_i is the average salary of top managers of firm i and x_i is the profit of firm i (both in logarithms).
a. Discuss whether you find the seven standard assumptions of the regression model intuitively plausible for these data.
b. Check the results of diagnostic tests reported in Exhibit 5.49 (for the sample of n = 96 firms with positive profits).
c. When the model is estimated for the forty-eight firms with the smallest (positive) profits, then no significant relation is found. Check this, and discuss the importance of this finding for a top manager of a firm with small profits who wishes to predict his or her salary.
d. What are the leverages in this model? Investigate the presence of outliers in this model.

5.33 (E Sections 5.4.3, 5.6.2) In this exercise we consider data on the US presidential election in 2000 (XR533USP). The data file contains the number of votes on the different candidates in the n = 67 counties of the state Florida. The difference (before recounts) between Bush and Gore in the state Florida was 975 votes in favour of Bush. The recounts in Florida were motivated in part by possible mistakes of voters in Palm Beach who wanted to vote for Gore (the second candidate, but third punch hole on the ballot paper) but by accident first selected Buchanan (second punch hole on the ballot paper). This resulted in ballot papers with multiple punch holes. The county Palm Beach is observation number i = 50.
a. Perform a regression of the number of votes on Buchanan on a constant and the number of votes on Gore. Investigate for the presence of outliers.
b. Estimate the number of votes v in Palm Beach county that are accidentally given to Buchanan, before recounting, by including a dummy variable for this county in the regression model of a. Test the hypothesis

that v < 975 against the alternative that v ≥ 975.
c. The counties differ in size, so that the error terms in the regression in a may be heteroskedastic. Formulate an intuitively plausible model for the variance of the disturbance terms in the regression model of a. Perform the Breusch–Pagan test on heteroskedasticity of the form σ_i² = h(γ₁ + γ₂n_i), where n_i denotes the total number of votes on all candidates in county i.
d. Answer b and c also for the model where the fraction of votes (instead of the number of votes) on Buchanan in each county is explained in terms of the fraction of votes on Gore in that county.
e. Answer b using a regression equation with appropriately weighted data, using the results of the Breusch–Pagan tests in c and d.
f. Discuss and investigate whether the assumptions that are needed for the (politically important) conclusion of e are plausible for these data.
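As promised in Exercise 5.22, the following minimal Python sketch illustrates the data-generating process of part d there (ρ = γ = 0.7) and the two standard errors to be compared. The use of statsmodels' HAC covariance option for the Newey–West standard error is an assumption of this sketch, not part of the exercise itself.

```python
# Minimal sketch for the simulation in Exercise 5.22d: AR(1) regressor and
# AR(1) disturbances, comparing conventional OLS and HAC standard errors.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n, rho, gamma = 100, 0.7, 0.7
x = np.zeros(n)
e = np.zeros(n)
for i in range(1, n):                     # AR(1) regressor and disturbance
    x[i] = rho * x[i - 1] + rng.standard_normal()
    e[i] = gamma * e[i - 1] + rng.standard_normal()
y = 0.0 * x + e                           # beta = 0: no true effect of x on y
ols = sm.OLS(y, x).fit()                  # conventional OLS standard error
hac = sm.OLS(y, x).fit(cov_type="HAC", cov_kwds={"maxlags": 4})
print(ols.bse[0], hac.bse[0])             # OLS tends to understate uncertainty
```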

6 Qualitative and Limited Dependent Variables

In this chapter we consider dependent variables with a restricted domain of possible outcomes. Binary variables have only two possible outcomes ('yes' and 'no'); other qualitative variables can have more than two but a finite number of possible outcomes (for example, the choice between a limited number of alternatives). It may also be that the outcomes of the dependent variable are restricted to an interval. For instance, for individual agents the amount of money spent on luxury goods or the duration of unemployment is non-negative, with a positive probability for the outcome 'zero'. For all such types of dependent variables, the linear regression model with normally distributed error terms is not suitable. We discuss probit and logit models for qualitative data, tobit models for limited dependent variables, and models for duration data. Section 6.1 is the basic section of this chapter and it is required for the material discussed in Sections 6.2 and 6.3. These last two sections can be read independently from each other.

6.1 Binary response

6.1.1 Model formulation

Uses Chapters 1–4; Sections 5.4 and 5.7.

Motivation

Students may succeed in finishing their studies or they may drop out, households may buy a trendy new product or not, and individuals may respond to a direct mailing or not. In all such cases the variable of interest can take only two possible values. Such variables are called binary. The two outcomes will be labelled as 1 ('success') and 0 ('failure'). The simplest statistical model to describe a binary variable y is the Bernoulli distribution with P[y = 1] = p and P[y = 0] = 1 − p. However, it may well be that the probability of success differs among individuals, and in this section we are interested in modelling the possible causes of these differences. For instance, the probability of success for students in their studies will depend on their intelligence, the probability of buying a new trendy product will depend on income and age, and the probability of a response to a direct mailing will depend on relevant interests of the individuals.

Assumptions on explanatory variables

As before, for individual i the values of k explanatory variables are denoted by the k × 1 vector x_i and the outcome of the binary dependent variable is denoted by y_i, i = 1, ..., n. We will always assume that the model contains a constant term and that x_{1i} = 1 for all individuals. Throughout this chapter we will treat the explanatory variables as fixed values, in accordance with Assumption 1 in Section 3.1.4 (p. 125). However, in practice all data (both y_i and x_i) are often stochastic. This is the case, for instance, when the observations are obtained by random sampling from an underlying population, and this is the usual situation for the types of data considered in this chapter. All the results of this chapter carry over to the case of exogenous stochastic regressors, by interpreting the results conditional on the given outcomes of x_i. This kind of interpretation was also discussed in Section 4.1.2 (p. 191).

The linear probability model

For a binary dependent variable, the regression model

y_i = x_i′β + ε_i = β₁ + Σ_{j=2}^k β_j x_{ji} + ε_i,  E[ε_i] = 0  (6.1)

is called the linear probability model. As E[ε_i] = 0 and y_i can take only the values zero and one, it follows that x_i′β = E[y_i] = 0·P[y_i = 0] + 1·P[y_i = 1], so that

P[y_i = 1] = E[y_i] = x_i′β.  (6.2)

Note that we write P[y_i = 1] = x_i′β — that is, the subindex i of y_i indicates that we deal with an individual with characteristics x_i. This can be written more explicitly as P[y_i = 1|x_i], but for simplicity of notation we delete the conditioning on x_i. Similar shorthand notations will be used throughout this chapter. In the linear probability model, x_i′β measures the probability that an individual with characteristics x_i will make the choice y_i = 1, so that the marginal effect of the jth explanatory variable is equal to ∂P[y_i = 1]/∂x_{ji} = β_j, j = 2, ..., k.

Disadvantages of the linear model

The linear probability model has several disadvantages. It places implicit restrictions on the parameters β, as (6.2) requires that 0 ≤ x_i′β ≤ 1 for all i = 1, ..., n. Further, the error terms ε_i are not normally distributed. This is because the variable y_i can take only the values zero and one, so that ε_i is a random variable with discrete distribution given by

ε_i = 1 − x_i′β with probability x_i′β,
ε_i = −x_i′β with probability 1 − x_i′β.

The distribution of ε_i depends on x_i and has variance equal to var(ε_i) = x_i′β(1 − x_i′β), so that the error terms are heteroskedastic with variances that depend on β. The assumption that E[ε_i] = 0 in (6.1) implies that OLS is an unbiased estimator of β (provided that the regressors are exogenous), but clearly it is not efficient and the conventional OLS formulas for the standard errors do not apply. Further, if the OLS estimates b are used to compute the estimated probabilities P̂[y_i = 1] = x_i′b, then this may give values smaller than zero or larger than one, in which case they are not real 'probabilities'. This may occur because OLS neglects the implicit restrictions 0 ≤ x_i′β ≤ 1.
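The following minimal sketch illustrates both points: the linear probability model estimated by OLS with heteroskedasticity-robust standard errors (since var(ε_i) depends on x_i), and a check of how many fitted 'probabilities' fall outside [0, 1]. The 0/1 array y and regressor matrix X (with constant column) are assumptions of the sketch.

```python
# Sketch of the linear probability model (6.1): OLS with White standard
# errors, since var(e_i) = x_i'beta (1 - x_i'beta) is heteroskedastic.
import statsmodels.api as sm

lpm = sm.OLS(y, X).fit(cov_type="HC1")  # heteroskedasticity-robust inference
p_hat = lpm.fittedvalues                # estimated P[y_i = 1] = x_i'b
outside = ((p_hat < 0) | (p_hat > 1)).mean()
print(lpm.params, outside)  # some fitted 'probabilities' may leave [0, 1]
```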

Non-linear model for probabilities

The probabilities can be confined to values between zero and one by using a non-linear model. Let F be a function with values ranging between zero and one, and let

P[y_i = 1] = F(x_i′β).  (6.3)

For the ease of interpretation of this model, the function F is always taken to be monotonically non-decreasing, with P[y_i = 1] = F(x_i′β) → 1 if x_i′β → ∞ and P[y_i = 1] → 0 if x_i′β → −∞. In this case, x_i′β can be interpreted as the strength of the stimulus for the outcome y_i = 1. An obvious choice for the function F is a cumulative distribution function. This is illustrated in Exhibit 6.1, for a single explanatory variable (x).

[Exhibit 6.1, Probability Models: binary dependent variable (y takes value 0 or 1) with linear probability model (a) and with non-linear probability model in terms of a cumulative distribution function (b); both panels plot y and P[y = 1] against x.]

Marginal effects on probabilities

In the model (6.3), positive (negative) coefficients correspond to positive (negative) effects on the probability of success. That is, if β_j > 0, then an increase in x_{ji} leads to an increase (or at least not to a decrease) of the probability that y_i = 1. Assuming that F is differentiable with derivative f (the density function corresponding to F), the marginal effect of the jth explanatory variable is given by

∂P[y_i = 1]/∂x_{ji} = f(x_i′β)β_j,  j = 2, ..., k.  (6.4)

This shows that the marginal effect of changes in the explanatory variables depends on the level of these variables. Usually, the density function f has relatively smaller values in the tails and relatively larger values near the mean. Then the marginal effects are maximal for values of x_i′β around zero, where P[y_i = 1] is around 1/2, so that the effects are smallest for individuals for which P[y_i = 1] is near zero (in the left tail of f) or near one (in the right tail of f). The sensitivity of decisions to changes in the explanatory variables depends on the shape of the density function f. This conforms with the intuition that individuals with clear-cut preferences are less affected by changes in the explanatory variables.

Interpretation of model in terms of latent variables

The model (6.3) can be given an interpretation in terms of an unobserved variable y*_i that represents the latent preference of individual i for the choice y_i = 1. It is assumed that

y*_i = x_i′β + ε_i,  ε_i ~ IID,  E[ε_i] = 0.

This is the so-called index function, where x_i′β is the systematic preference and ε_i the individual-specific effect. This takes the possibility into account that individuals with the same observed characteristics x may make different choices because of unobserved individual effects. The observed choice y is related to the index y* by means of the equation

y_i = 1 if y*_i ≥ 0,  y_i = 0 if y*_i < 0.

It is assumed that the individual effects ε_i are independent and identically distributed with symmetric density f — that is, f(ε_i) = f(−ε_i). Further it is usually assumed that the density is unimodal and symmetric, so that f(t) is maximal for t = 0 and f(t) = f(−t) for all t. It then follows that P[ε_i ≥ −t] = ∫_{−t}^{∞} f(s)ds = ∫_{−∞}^{t} f(s)ds = P[ε_i ≤ t], so that P[y_i = 1] = P[ε_i ≥ −x_i′β] = P[ε_i ≤ x_i′β] = F(x_i′β). This provides an interpretation of the model (6.3) in terms of unobserved individual effects in the index function.

Restriction needed for parameter identification

The standard deviation of the density f should be specified beforehand, as otherwise the parameter vector β is not identified. Indeed, if g(t) = sf(st), then the cumulative distribution functions (G of g and F of f) are related by G(t) = F(st), so that P[y_i = 1] = F(x_i′β) = G(x_i′β/s). That is, the model (6.3) with function F and parameter vector β is equivalent to the model with function G and parameter vector β/s. So the variance of the distribution f should be fixed, independent of the data. It is usually assumed that this density has mean zero, which is no loss of generality, because the explanatory variables include a constant term.
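The index-function interpretation is easy to simulate, which can help to build intuition. In the following sketch the latent preference y* = x′β + ε is generated and y = 1 is observed exactly when y* ≥ 0; taking ε standard normal (an assumption of this sketch, anticipating the probit model of the next subsection) implies P[y = 1] = Φ(x′β), and the parameter values are purely illustrative.

```python
# Sketch of the index-function interpretation: simulate y* = x'beta + e and
# observe y = 1 exactly when y* >= 0. Standard normal e yields probit data.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
n = 10_000
x = rng.standard_normal(n)
beta0, beta1 = 0.5, 1.0                   # illustrative parameter values
y_star = beta0 + beta1 * x + rng.standard_normal(n)  # latent preference
y = (y_star >= 0).astype(int)             # observed binary choice
print(y.mean(), norm.cdf(beta0 + beta1 * x).mean())  # agree in large samples
```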

It then follows that $P[\varepsilon_i \ge -t] = \int_{-t}^{\infty} f(s)\,ds = \int_{-\infty}^{t} f(s)\,ds = P[\varepsilon_i \le t]$, so that

$$P[y_i = 1] = P[\varepsilon_i \ge -x_i'\beta] = P[\varepsilon_i \le x_i'\beta] = F(x_i'\beta),$$

where $F$ is the cumulative distribution function of $\varepsilon_i$.

Interpretation of model in terms of utilities

Another possible interpretation of the model (6.3) is in terms of the utilities $U_0$ and $U_1$ of the two alternative choices. The utilities for individual $i$ are defined by

$$U_{i0} = x_i'\beta_0 + \varepsilon_{0i}, \qquad U_{i1} = x_i'\beta_1 + \varepsilon_{1i}.$$

The alternative with maximal utility is chosen, so that $y_i = 1$ if $U_{i0} \le U_{i1}$ and $y_i = 0$ if $U_{i0} > U_{i1}$. In this case the choice depends on the difference in the utilities $U_{i1} - U_{i0} = x_i'\beta + \varepsilon_i$, where $\beta = \beta_1 - \beta_0$ and $\varepsilon_i = \varepsilon_{1i} - \varepsilon_{0i}$. Again, if the individual-specific terms $\varepsilon_i$ are assumed to be independent and identically distributed with symmetric density $f$, it follows that $P[y_i = 1] = P[\varepsilon_i \ge -x_i'\beta] = P[\varepsilon_i \le x_i'\beta] = F(x_i'\beta)$. This provides an interpretation of the model (6.3) in terms of unobserved individual effects in the utilities of the two alternatives — that is, in terms of differences in the individual effects $\varepsilon_i$ over the population.

Example 6.1: Direct Marketing for Financial Product (data file XM601DMF)

To illustrate the modelling of binary response data, we consider data that were collected in a marketing campaign for a new financial product of a commercial investment firm (Robeco). We will discuss (i) the motivation of the marketing campaign, and (ii) the data set.

(i) Motivation of the marketing campaign

The campaign consisted of a direct mailing to customers of the firm. The firm is interested in identifying characteristics that might explain which customers are interested in the new product and which ones are not. In particular, there may be differences between male and female customers and between active and inactive customers (where active means that the customer already invests in other products of the firm). Also the age of customers may be of importance, as relatively young and relatively old customers may have less interest in investing in this product than middle-aged people.

(ii) The data set

The variable to be explained is whether a customer is interested in the new financial product or not. This is denoted by the binary variable $y_i$, with $y_i = 1$ if the $i$th customer is interested and $y_i = 0$ otherwise. The data set considered in this chapter is drawn from a much larger database that contains more than 100,000 observations. A sample of 1000 observations is drawn from this database, and 75 observations are omitted because of missing data (on the age of the customer). This leaves a data set of $n = 925$ customers. Of these customers, 470 responded positively (denoted by $y_i = 1$) and the remaining 455 did not respond (denoted by $y_i = 0$). The original data set of more than 100,000 observations contains only around 5000 respondents, so our sample contains relatively many more positive responses (470 out of 925) than the original database. The effect of this selection is analysed in Exercises 6.2 and 6.11.

Apart from a constant term (denoted by $x_{1i} = 1$), the explanatory variables are gender (denoted by $x_{2i} = 0$ for females and $x_{2i} = 1$ for males), activity (denoted by $x_{3i} = 1$ for customers that are already active investors and $x_{3i} = 0$ for customers that do not yet invest in other products of the firm), age (in years, denoted by $x_{4i}$), and the square of age (divided by hundred, denoted by $x_{5i} = x_{4i}^2/100$). For further background on the data we refer to the research report by P. H. Franses, 'On the Econometrics of Modelling Marketing Response', RIBES Report 97-15, Rotterdam, 1997. This data set will be further analysed in Examples 6.2 and 6.3.

6.1.2 Probit and logit models

Model formulation

The model (6.3) depends not only on the choice of the explanatory variables $x$ but also on the shape of the distribution function $F$. This choice corresponds to assuming a specific distribution for the unobserved individual effects (in the index function or in the utilities) and it determines the shape of the marginal response function (6.4) via the corresponding density function $f$. In practice one often chooses either the standard normal density

$$f(t) = \phi(t) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}t^2}$$

or the logistic density

$$f(t) = \lambda(t) = \frac{e^t}{(1 + e^t)^2}.$$

The model (6.3) with the standard normal distribution is called the probit model, and that with the logistic distribution is called the logit model.

Comparison of probit and logit model

Both the standard normal density and the logistic density have mean zero and are unimodal and symmetric. The standard deviation of both distributions is fixed: the logistic distribution has standard deviation $\sigma = \pi/\sqrt{3} \approx 1.8$, whereas the standard normal distribution has standard deviation 1. In order to compare the two models, the graphs of the density $\phi(t)$ and the standardized logistic density $\sigma\lambda(\sigma t)$ are given in Exhibit 6.2. As compared with the normal density, the logistic density has larger values around the mean ($x = 0$) and also in both tails (for values of $x$ far away from 0). This shows that, as compared to the probit model, the logit model has marginal effects (6.4) that are relatively somewhat larger around the mean and in the tails but somewhat smaller in the two regions in between.

Exhibit 6.2 Normal and logistic densities. Densities of the standard normal distribution (dashed line) and of the logistic distribution (solid line, scaled so that both densities have standard deviation equal to 1).

There are often no compelling reasons to choose between the logit and probit model. An advantage of the logit model is that the cumulative distribution function $F = L$ can be computed explicitly, as

$$L(t) = \int_{-\infty}^{t} \lambda(s)\,ds = \frac{e^t}{1 + e^t} = \frac{1}{1 + e^{-t}}, \tag{6.5}$$

whereas the cumulative distribution function $F = \Phi$ of the probit model should be computed numerically by approximating the integral

$$\Phi(t) = \int_{-\infty}^{t} \phi(s)\,ds = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{t} e^{-\frac{1}{2}s^2}\,ds. \tag{6.6}$$

In practice this poses no real problems, as there exist very accurate numerical integration algorithms. In general the differences between the two models are not so large, unless the tails of the distributions are of importance. This is the case when the choices are very unbalanced — that is, when the fraction of individuals with $y_i = 1$ differs considerably from $\tfrac{1}{2}$.

Comparison of parameters of the two models: scaling

One can, of course, always estimate both the logit and the probit model and compare the outcomes. The parameters of the two models should be scaled for such a comparison. The estimated probit parameters can be multiplied by the factor 1.8 (the standard deviation of the logistic distribution), which gives the two densities the same variance. Instead of the scaling factor 1.8, however, one often uses another correction factor. As $\phi(0)/\lambda(0) = 4/\sqrt{2\pi} \approx 1.6$, the estimated probit parameters $b$ can be multiplied by 1.6 to compare them with the estimated logit parameters. In terms of Exhibit 6.2 this means that, after scaling, the two densities have the same function value in $t = 0$.

Marginal effects of explanatory variables

As concerns the interpretation of the parameters $\beta$, (6.4) shows that the signs of the coefficients $\beta_j$ and the relative magnitudes $\beta_j/\beta_h$ have a direct interpretation in terms of the sign and the relative magnitude of the marginal effects of the explanatory variables on the chance of success ($y_i = 1$). The marginal effects (6.4) of the explanatory variables are maximal for values of $x_i'\beta$ around zero, where $P[y_i = 1]$ is around $\tfrac{1}{2}$, so that these effects are of special interest. Since the marginal effects depend on the values of $x_i$, these effects vary among the different individuals. The effects of the $j$th explanatory variable can be summarized by the mean marginal effects over the sample of $n$ individuals — that is,

$$\frac{1}{n}\sum_{i=1}^{n} \frac{\partial P[y_i = 1]}{\partial x_{ji}} = \beta_j\, \frac{1}{n}\sum_{i=1}^{n} f(x_i'\beta), \qquad j = 2, \cdots, k.$$

Sometimes the effect at the mean values of the explanatory variables $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ is reported instead — that is, (6.4) evaluated at $x = \bar{x}$. This is a bit simpler to compute, but the interpretation is somewhat less clear.
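The following sketch (with simulated data and hypothetical estimates, not taken from the text) illustrates the 1.6 scaling and the difference between the mean marginal effect and the effect at the sample mean.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(50, 10, size=200)])  # constant + one regressor
b_probit = np.array([-2.0, 0.04])        # hypothetical probit estimates

print(1.6 * b_probit)                    # rough logit-comparable scale

# mean marginal effect of x_2: b_2 * (1/n) * sum of f(x_i'b) over the sample
scale = norm.pdf(X @ b_probit).mean()
print(b_probit[1] * scale)

# marginal effect (6.4) evaluated at the sample mean of the regressors
print(norm.pdf(X.mean(axis=0) @ b_probit) * b_probit[1])
```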

When the $j$th explanatory variable is a dummy variable, it remains possible to compute 'marginal' effects in this way. Instead, it is also possible to compare the two situations $x_{ji} = 0$ and $x_{ji} = 1$ by comparing $P[y_i = 1] = F\bigl(\sum_{l \ne j} \beta_l x_{li}\bigr)$ (for individuals with $x_{ji} = 0$) with $P[y_i = 1] = F\bigl(\beta_j + \sum_{l \ne j} \beta_l x_{li}\bigr)$ (for individuals with $x_{ji} = 1$). This may reveal differences in the effect of the dummy variable for different ranges of the other explanatory variables ($x_{li}$ with $l \ne j$).

Comparison of probabilities and the odds ratio

It may further be informative to consider the predicted probabilities $p_i = P[y_i = 1] = F(x_i'\beta)$, $i = 1, \cdots, n$ — for instance, the mean, variance, minimum, and maximum of these probabilities. The individuals may also be split into groups, after which the values of $p_i$ can be compared within and between groups. Of special interest is the odds ratio, which is defined by

$$\frac{P[y_i = 1]}{P[y_i = 0]} = \frac{F(x_i'\beta)}{1 - F(x_i'\beta)}.$$

So the odds ratio is the relative preference of option 1 as compared to option 0. This preference depends on the values $x_i$ of the explanatory variables. The log-odds is the natural logarithm of the odds ratio. In the logit model with $F = L$ there holds $L(t) = e^t/(1 + e^t)$ and $1 - L(t) = 1/(1 + e^t)$, so that $L(t)/(1 - L(t)) = e^t$ and

$$\log\left(\frac{L(x_i'\beta)}{1 - L(x_i'\beta)}\right) = x_i'\beta.$$

That is, in the logit model the log-odds is a linear function of the explanatory variables.

As a constant term is included in the model, we can transform the data by measuring all other explanatory variables ($x_2, \cdots, x_k$) in deviation from their sample mean, and this provides the following interpretation of the constant term. After this transformation, the odds ratio evaluated at the sample mean becomes $F(\beta_1)/(1 - F(\beta_1))$. If $\beta_1 = 0$, then the odds ratio evaluated at the sample mean is equal to 1 (as $F(0) = \tfrac{1}{2}$, both for the probit and for the logit model), so that for an 'average' individual both choices are equally likely. If $\beta_1 > 0$, then $F(\beta_1) > F(0) = \tfrac{1}{2}$, so that an 'average' individual has a relative preference for alternative 1 above alternative 0; and, if $\beta_1 < 0$, an 'average' individual has a relative preference for alternative 0 above alternative 1.

Exercises: T: 6.2a–c.
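A minimal illustration (hypothetical coefficients, added here) that in the logit model the log-odds equals the linear index $x_i'\beta$:

```python
import numpy as np

b = np.array([0.2, 0.8])                  # hypothetical logit coefficients
x = np.array([1.0, 0.5])                  # regressors (in deviation from means)
p = 1.0 / (1.0 + np.exp(-(x @ b)))        # P[y=1] under the logit model

odds = p / (1 - p)                        # odds ratio F/(1-F)
print(odds, np.log(odds), x @ b)          # the log-odds equals the index x'b
```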

6.1.3 Estimation and evaluation

The likelihood function

The logit and probit models are non-linear and the parameters can be estimated by maximum likelihood. Suppose that a random sample of $n$ outcomes of the binary variable $y_i$ is available. If the probability of success is the same for all observations — say, $P[y_i = 1] = p$ — then the probability distribution of the $i$th observation is given by $p^{y_i}(1-p)^{1-y_i}$, $y_i = 0, 1$. If the observations are mutually independent, then the likelihood function is given by $L(p) = \prod_{i=1}^{n} p^{y_i}(1-p)^{1-y_i}$ and the log-likelihood by

$$\log(L(p)) = \sum_{\{i,\, y_i = 1\}} \log(p) + \sum_{\{i,\, y_i = 0\}} \log(1-p) = \sum_{i=1}^{n} y_i \log(p) + \sum_{i=1}^{n} (1-y_i)\log(1-p).$$

Maximizing this with respect to $p$ we get the ML estimator $\hat{p} = \sum_{i=1}^{n} y_i / n$.

Now suppose that the observations $y_1, \cdots, y_n$ are mutually independent but that the probability of success differs among the observations according to the model (6.3), all with the same function $F$ but with differences in the values of the explanatory variables $x_i$. Then the variable $y_i$ follows a Bernoulli distribution with probability $p_i = P[y_i = 1] = F(x_i'\beta)$ on the outcome $y_i = 1$ and with probability $(1-p_i)$ on the outcome $y_i = 0$. The probability distribution is then given by $p(y_i) = p_i^{y_i}(1-p_i)^{1-y_i}$. The terms $p_i$ depend on $\beta$, but for simplicity of notation we will in the sequel often write $p_i$ instead of the more explicit expression $F(x_i'\beta)$. The log-likelihood is therefore equal to

$$\log(L(\beta)) = \sum_{i=1}^{n} y_i \log(p_i) + \sum_{i=1}^{n} (1-y_i)\log(1-p_i) = \sum_{\{i,\, y_i = 1\}} \log(F(x_i'\beta)) + \sum_{\{i,\, y_i = 0\}} \log(1 - F(x_i'\beta)). \tag{6.7}$$
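As a minimal sketch (assuming the data sit in NumPy arrays; not from the original text), the log-likelihood (6.7) can be evaluated directly, with the choice of cdf selecting the probit or logit model.

```python
import numpy as np
from scipy.stats import norm, logistic

def binary_loglik(beta, X, y, cdf=norm.cdf):
    """Log-likelihood (6.7) of a binary response model P[y=1] = F(x'beta)."""
    p = cdf(X @ beta)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# usage: binary_loglik(b, X, y, cdf=logistic.cdf) gives the logit log-likelihood
```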

Maximization of the log-likelihood

The maximum likelihood estimates are obtained by solving the first order conditions. Using the fact that the density function $f(t)$ is the derivative of the cumulative distribution function $F(t)$, the $k$ first order conditions are given by

$$g(\beta) = \frac{\partial \log(L)}{\partial \beta} = \sum_{i=1}^{n} \frac{y_i}{p_i}\frac{\partial p_i}{\partial \beta} + \sum_{i=1}^{n} \frac{(1-y_i)}{1-p_i}\frac{\partial (1-p_i)}{\partial \beta} = \sum_{i=1}^{n} \frac{y_i}{p_i} f_i x_i - \sum_{i=1}^{n} \frac{(1-y_i)}{1-p_i} f_i x_i = \sum_{i=1}^{n} \frac{y_i - p_i}{p_i(1-p_i)} f_i x_i = 0. \tag{6.8}$$

Here $f_i = f(x_i'\beta)$ is the density function corresponding to the cumulative distribution function $F$. These first order conditions can be seen as a variation of the normal equations $\sum_{i=1}^{n} e_i x_i = 0$ of the linear regression model. In a binary response model, $y_i - p_i = y_i - P[y_i = 1]$ is the residual of the model (6.3) with respect to the actually observed outcome of $y_i$. The weighting factor $p_i(1-p_i)$ is equal to the variance of $y_i$, so that this corresponds to the usual correction for heteroskedasticity in weighted least squares (see Section 5.3). Finally, the factor $f_i$ reflects the fact that the marginal effects (6.4) are not constant over the sample (as is the case in a linear regression model) but depend on the value of $f(x_i'\beta)$. The set of $k$ non-linear equations $g(\beta) = 0$ can be solved numerically — for instance, by Newton–Raphson (see Section 4.3.3, p. 228) — to give the estimate $b$.

Approximate distribution of the ML estimator

The general properties of ML estimators were discussed in Section 4.3.3. Under the stated assumptions — that is, that the observations $y_i$ are independently distributed with $P[y_i = 1] = F(x_i'\beta)$ with the same cumulative distribution function $F$ for all observations — the ML estimator $b$ has an asymptotic normal distribution in the sense that $\sqrt{n}(b - \beta)$ converges in distribution to the normal distribution with mean zero and covariance matrix $\text{plim}(n\hat{V})$. This probability limit exists under weak regularity conditions on the explanatory variables $x_i$. Large sample standard errors can be obtained from the inverse of the information matrix. It is often convenient to use the outer product of gradients expression for this (see Sections 4.3.2 and 4.3.3, and formula (4.57) (pp. 327–8)). With the notation introduced there, $\partial l_i/\partial \beta = (y_i - p_i)f_i x_i / (p_i(1-p_i))$, so that the covariance matrix of $b$ can be estimated by

$$\hat{V} = \widehat{\text{var}}(b) \approx \left[\sum_{i=1}^{n} \frac{\partial l_i}{\partial \beta}\frac{\partial l_i}{\partial \beta'}\right]^{-1} = \left[\sum_{i=1}^{n} \frac{(y_i - \hat{p}_i)^2 \hat{f}_i^2}{\hat{p}_i^2(1-\hat{p}_i)^2}\, x_i x_i'\right]^{-1}, \tag{6.9}$$

where $\hat{p}_i = F(x_i'b)$ and $\hat{f}_i = f(x_i'b)$. In finite samples this gives

$$b \approx N(\beta, \hat{V}). \tag{6.10}$$

These results can be used to perform $t$- and $F$-tests in the usual way, and of course the Likelihood Ratio test (4.44) can also be applied.

Results for the logit model

The foregoing expressions apply for any choice of the distribution function $F$. As an illustration we consider the logit model with $F = L$ in (6.5) in more detail. The expression for the gradient (6.8) simplifies in this case, as

$$\lambda_i = \frac{e^{x_i'\beta}}{(1 + e^{x_i'\beta})^2} = \frac{e^{x_i'\beta}}{1 + e^{x_i'\beta}}\left(1 - \frac{e^{x_i'\beta}}{1 + e^{x_i'\beta}}\right) = L_i(1 - L_i),$$

so that $f_i = p_i(1-p_i)$ in this case. Therefore the logit estimates are obtained by solving the $k$ equations

$$g(b) = \sum_{i=1}^{n} (y_i - p_i)x_i = \sum_{i=1}^{n} \left(y_i - \frac{1}{1 + e^{-x_i'b}}\right)x_i = 0.$$

As the first explanatory variable is the constant term with $x_{1i} = 1$ for all $i = 1, \cdots, n$, it follows that $\sum_{i=1}^{n}(y_i - \hat{p}_i) = 0$, so that

$$\frac{1}{n}\sum_{i=1}^{n} \hat{p}_i = \frac{1}{n}\sum_{i=1}^{n} y_i.$$

So the logit model has the property that the average predicted probabilities of success and failure are equal to the observed fractions of successes and failures in the sample. The ML first order conditions have a unique solution, because the Hessian matrix

$$\frac{\partial^2 \log(L)}{\partial \beta \partial \beta'} = \frac{\partial g(\beta)}{\partial \beta'} = -\sum_{i=1}^{n} f_i x_i x_i' = -\sum_{i=1}^{n} p_i(1-p_i)\,x_i x_i'$$

is negative definite. This simplifies the numerical optimization, and in general the Newton–Raphson iterations will converge rather rapidly to the global maximum. The information matrix (for given values of the explanatory variables) is given by

$$\mathcal{I}_n = -E\left[\frac{\partial^2 \log(L)}{\partial \beta \partial \beta'}\right] = \sum_{i=1}^{n} p_i(1-p_i)\,x_i x_i'. \tag{6.11}$$

Large sample standard errors of the logit parameters can be obtained by substituting the logit estimate $b$ for $\beta$ in the above expression and by taking the square roots of the diagonal elements of the inverse of (6.11).
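The following is a compact sketch (our own code, not the book's) of this Newton–Raphson scheme for the logit model, using the gradient and Hessian just derived.

```python
import numpy as np

def logit_newton_raphson(X, y, tol=1e-10, max_iter=50):
    """ML estimation of a logit model by Newton-Raphson, with gradient
    sum((y_i - p_i) x_i) and Hessian -sum(p_i (1-p_i) x_i x_i')."""
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        grad = X.T @ (y - p)
        hess = -(X * (p * (1 - p))[:, None]).T @ X
        step = np.linalg.solve(hess, grad)
        beta = beta - step                    # Newton-Raphson update
        if np.max(np.abs(step)) < tol:
            break
    # large-sample covariance matrix: inverse of the information matrix (6.11)
    cov = np.linalg.inv((X * (p * (1 - p))[:, None]).T @ X)
    return beta, cov
```

Standard errors are the square roots of the diagonal of the returned covariance matrix, as described above.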

Expression (6.9) for the covariance matrix can be obtained from (6.11) by replacing the terms $p_i(1-p_i)$ in (6.11) by $(y_i - p_i)^2$. As $E[(y_i - p_i)^2] = \text{var}(y_i) = p_i(1-p_i)$, the two expressions (6.9) and (6.11) for the covariance matrix are asymptotically equivalent. Indeed, for the logit model $\hat{f}_i^2 = \hat{p}_i^2(1-\hat{p}_i)^2$, so that these terms cancel in (6.9).

Remarks on the probit model

The analysis of the probit model is technically somewhat more involved. The Hessian matrix is again negative definite, and the numerical optimization poses no problems in general. With suitable software, the practical usefulness of probit and logit models is very much alike.

Example 6.2: Direct Marketing for Financial Product (continued, data file XM601DMF)

We continue our analysis of the direct mailing data introduced in Example 6.1. We will discuss (i) the outcomes of estimated logit and probit models for the probability that a customer is interested in the product, and (ii) the odds ratios (depending on the age of the customer) of the two models.

(i) Outcomes of logit and probit models

The dependent variable is $y_i$ with $y_i = 1$ if the $i$th individual is interested and $y_i = 0$ otherwise. The explanatory variables are gender, activity, and age (with a linear and a squared term) (see Example 6.1). The results of logit and probit models are given in Panels 2 and 3 of Exhibit 6.3. For comparison the results of the linear probability model are also given (see Panel 1). All models indicate that the variables 'gender' and 'activity' are statistically the most significant ones. As the corresponding two parameters are positive, male customers and active customers tend to be more interested than female and inactive customers. That is, these variables have a positive impact on the probability of responding to the mailing.

The numerical values of the coefficients of the three models can be compared by determining the mean marginal effects of the explanatory variables in the three models. As discussed in Section 6.1.2, the mean marginal effect of the $j$th explanatory variable is $\beta_j \frac{1}{n}\sum_{i=1}^{n} f(x_i'b)$, so we take as correction factor $\frac{1}{n}\sum_{i=1}^{n} f(x_i'b)$. For our data, in the logit model this correction factor is 0.230 and in the probit model it is 0.373. For instance, the mean marginal effect of the variable gender is 0.224 in the linear probability model (see Panel 1), $0.954 \cdot 0.230 = 0.219$ in the logit model, and $0.588 \cdot 0.373 = 0.219$ in the probit model. So the coefficient of the variable gender differs in the three models (0.224, 0.954, and 0.588), but its interpretation in terms of mean marginal effects is very much the same. This also holds true for the coefficients of the other explanatory variables. The effects of 'gender' and 'activity' are almost the same.
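In modern software such models are estimated in one call. The sketch below (with a hypothetical file name and column layout, not the authors' code) fits both models to data organized as in Example 6.1 and reports the mean marginal effects.

```python
import pandas as pd
import statsmodels.api as sm

# assumed layout: columns response, gender, activity, age in a CSV file
df = pd.read_csv("direct_mailing.csv")          # hypothetical file name
df["age_sq_100"] = df["age"] ** 2 / 100
X = sm.add_constant(df[["gender", "activity", "age", "age_sq_100"]])
y = df["response"]

logit_res = sm.Logit(y, X).fit(disp=0)
probit_res = sm.Probit(y, X).fit(disp=0)

# mean marginal effects, averaged over the sample (compare the scale factors)
print(logit_res.get_margeff(at="overall").summary())
print(probit_res.get_margeff(at="overall").summary())
```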

The variable 'age' has an effect that first increases and then decreases, although the effects are only marginally significant (at 5 per cent significance level). However, the possible effect of age is of great practical importance for the firm.

Exhibit 6.3 Direct Marketing for Financial Product (Example 6.2). Responses to direct mailing (1 = response, 0 = no response) explained by gender, activity dummy, and age (quadratic function). Estimates obtained from the linear probability model (Panel 1: dependent variable RESPONSE, least squares, sample 1–1000, 925 included and 75 excluded observations), the logit model (Panel 2: ML — binary logit, convergence achieved after 5 iterations), and the probit model (Panel 3: ML — binary probit, convergence achieved after 5 iterations). Each panel reports coefficients, standard errors, and t- or z-statistics for the variables C (constant), GENDER, ACTIVITY, AGE, and AGE^2/100; the logit and probit models both have a log-likelihood of about -601. The reported scale factors for the marginal effects are the averages of $f(x_i'b)$ over the sample, with $f$ the logistic density (0.230, Panel 2) or the standard normal density (0.373, Panel 3). The key estimates for the variable GENDER are:

Model (Panel)             GENDER coefficient   Scale factor   Mean marginal effect
Linear probability (1)    0.224                —              0.224
Logit (2)                 0.954                0.230          0.219
Probit (3)                0.588                0.373          0.219

(ii) Odds ratios depending on age

To give an impression of the age effect, Exhibit 6.3 shows the estimated odds ratios (for the logit model in (d) and for the probit model in (e)) against the variable 'age'. In each diagram, the top curve is for active males, the second one for non-active males, the (nearly coinciding) third one for active females, and the lowest one for non-active females. As the coefficients of 'gender' and 'activity' are almost equal, the odds ratios for inactive males and active females coincide approximately. All odds ratios are highest around an age of 50 years. In both diagrams, the top curve shows that males who are already active investors have a probability of responding to the direct mailing that is two to three times as large as the probability of not responding. The opposite odds ratios apply for females who are not yet investing.

Exhibit 6.3 (Contd.) Estimated odds ratios for logit model (d) and for probit model (e) against age.

Exercises: T: 6.1; E: 6.2d, 6.7d–f, 6.13a, b, e; S: 6.11.

6.1.4 Diagnostics

In this section we discuss some diagnostic tools for logit and probit models — namely, the goodness of fit (LR-test and $R^2$), the predictive quality (classification table and hit rate), and analysis of the residuals (in particular an LM-test for heteroskedasticity).

Goodness of fit

The significance of individual explanatory variables can be tested by the usual $t$-test based on (6.10). The sample size should be sufficiently large to rely on the asymptotic expressions for the standard errors, and the $t$-test statistic then follows approximately the standard normal distribution. Joint parameter restrictions can be tested by the Likelihood Ratio test. For logit and probit models it is no problem to estimate the unrestricted and restricted models, at least if the restrictions are not too involved. The overall goodness of fit of the model can be tested by the LR-test on the null hypothesis that all coefficients (except the constant term) are zero — that is, $\beta_2 = \cdots = \beta_k = 0$. This test follows (asymptotically) the $\chi^2(k-1)$ distribution.

Sometimes one reports measures similar to the $R^2$ of linear regression models — for instance, McFadden's $R^2$ defined by

$$R^2 = 1 - \frac{\log(L_1)}{\log(L_0)},$$

where $L_1$ is the maximum value of the unrestricted likelihood function and $L_0$ that of the restricted likelihood function. It follows from (6.7) that $\log(L_0) \le \log(L_1) < 0$, so that $0 \le R^2 < 1$ and higher values of $R^2$ correspond to a relatively higher overall significance of the model. Note, however, that this $R^2$ cannot be used, for example, to choose between a logit and a probit model, as these two models have different likelihood functions.

Predictive quality

Alternative specifications of the model may be compared by evaluating whether the model gives a good classification of the data into the two categories $y_i = 1$ and $y_i = 0$. The estimated model gives predicted probabilities $\hat{p}_i$ for the choice $y_i = 1$, and this can be transformed into predicted choices by predicting that $\hat{y}_i = 1$ if $\hat{p}_i \ge c$ and $\hat{y}_i = 0$ if $\hat{p}_i < c$. In practice one often takes $c = \tfrac{1}{2}$, or, if the fraction $p$ of successes differs much from 50 per cent (in the sense that the choices are very unbalanced), one sometimes takes $c = \hat{p}$. The choice of $c$ can sometimes be based on the costs of misclassification. This leads to a $2 \times 2$ classification table of the predicted responses $\hat{y}_i$ against the actually observed responses $y_i$.

Formally, let $w_i$ be the random variable indicating a correct prediction — that is, $w_i = 1$ if $y_i = \hat{y}_i$ and $w_i = 0$ if $y_i \ne \hat{y}_i$. Then the hit rate is defined as the fraction of correct predictions in the sample, $h = \frac{1}{n}\sum_{i=1}^{n} w_i$. In the population the fraction of successes is $p$. If we randomly make the prediction 1 with probability $p$ and 0 with probability $(1-p)$, then we make a correct prediction with probability $q = p^2 + (1-p)^2$.
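Two small helper functions (assumed NumPy arrays; added for illustration) for these measures:

```python
import numpy as np

def mcfadden_r2(loglik_unrestricted, loglik_restricted):
    """McFadden's R^2 = 1 - log(L1)/log(L0); both log-likelihoods are negative."""
    return 1.0 - loglik_unrestricted / loglik_restricted

def hit_rate(y, p_hat, c=0.5):
    """Fraction of correct predictions, with yhat_i = 1 iff p_hat_i >= c."""
    return np.mean(y == (p_hat >= c).astype(int))
```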

Using the properties of the binomial distribution for the number of correct random predictions, it follows that the 'random' hit rate $h_r$ has expected value $E[h_r] = E[w] = q$ and variance $\text{var}(h_r) = \text{var}(w)/n = q(1-q)/n$. Under the null hypothesis that the predictions of the model are no better than pure random predictions, the hit rate $h$ is approximately normally distributed with mean $q$ and variance $q(1-q)/n$. Therefore we reject the null hypothesis of random predictions in favour of the (one-sided) alternative of better-than-random predictions if

$$z = \frac{h - q}{\sqrt{q(1-q)/n}} = \frac{nh - nq}{\sqrt{nq(1-q)}}$$

is large enough (larger than 1.645 at 5 per cent significance level). In the above expression for the $z$-test, $nh$ is the total number of correct predictions in the sample and $nq$ is the expected number of correct random predictions. In practice $q = p^2 + (1-p)^2$ is unknown and estimated by $\hat{q} = \hat{p}^2 + (1-\hat{p})^2$, where $\hat{p}$ is the fraction of successes in the sample.

Description may be more relevant than prediction

Although the comparison of the classification success of alternative models may be of interest, it should be realized that the parameters of binary response models are chosen to maximize the likelihood function, and not directly to maximize a measure of fit between the observed outcomes $y_i$ and the predicted outcomes $\hat{y}_i$. This is another distinction with the linear regression model, where maximizing the (normal) likelihood function is equivalent to maximizing the (least squares) fit. A binary response model may be preferred over another one because it gives a more useful description — for example, of the marginal effects (6.4) — even if it performs worse in terms of classification.

Standardized residuals and consequences of heteroskedasticity

The residuals $e_i$ of a binary response model are defined as the differences between the observed outcomes $y_i$ and the fitted probabilities $\hat{p}_i$. As the variance of $y_i$ (for given values of $x_i$) is $p_i(1-p_i)$, the standardized residuals are defined by

$$e_i^* = \frac{y_i - \hat{p}_i}{\sqrt{\hat{p}_i(1-\hat{p}_i)}}. \tag{6.12}$$

A histogram of the standardized residuals may be of interest, for example, to detect outliers. Further, scatter diagrams of these residuals against explanatory variables are useful to investigate the possible presence of heteroskedasticity.
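The z-test and the standardized residuals (6.12) are easily computed; a sketch (our own helper names, assumed NumPy arrays):

```python
import numpy as np

def prediction_z_test(y, y_pred):
    """One-sided z-test of the hit rate against purely random predictions."""
    n = len(y)
    h = np.mean(y == y_pred)             # hit rate
    p = np.mean(y)                       # fraction of successes
    q = p**2 + (1 - p)**2                # estimated random hit rate
    return (h - q) / np.sqrt(q * (1 - q) / n)

def standardized_residuals(y, p_hat):
    """Standardized residuals (6.12)."""
    return (y - p_hat) / np.sqrt(p_hat * (1 - p_hat))

# reject randomness at the 5 per cent level if the z-value exceeds 1.645
```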

Until now it was assumed that the error terms $\varepsilon_i$ all follow the same distribution (described by $F$). Heteroskedasticity can be due to different kinds of misspecification of the model. It may be, for instance, that a relevant explanatory variable is missing or that the function $F$ is misspecified. In contrast with the linear regression model, where OLS remains consistent under heteroskedasticity, maximum likelihood estimators of binary response models become inconsistent under this kind of misspecification. If the data generating process is a probit model but one estimates a logit model, for instance, the outcomes may still be reasonably reliable, as the differences between the probit function $\Phi$ and the logit function $L$ are not so large. In general, however, if the shape of the distribution is misspecified, then the estimated parameters and marginal effects are inconsistent and the calculated standard errors are not correct. If one has doubts on the correct choice of the distribution function $F$, it may be helpful to compute the standard errors in two ways — that is, by the ML expression (6.9) and also by GMM based on the 'moment' conditions (6.8). If the two sets of computed standard errors differ significantly, then this is a sign of misspecification.

Likelihood Ratio test on heteroskedasticity

A formal test for heteroskedasticity can be based on the index model $y_i^* = x_i'\beta + \varepsilon_i$. We assume again that the density function $f$ (the derivative of $F$) is symmetric — that is, $f(t) = f(-t)$. As an alternative we consider the model where all $\varepsilon_i/\sigma_i$ follow the same distribution $F$, where $\sigma_i = e^{z_i'\gamma}$, with $z_i$ a vector of observed variables. The constant term should not be included in this vector because (as was discussed in Section 6.1.1) the scale parameter of a binary response model should be fixed, independent of the data. It then follows that

$$P[y_i = 1] = P[y_i^* \ge 0] = P[\varepsilon_i \ge -x_i'\beta] = P[(\varepsilon_i/\sigma_i) \ge -x_i'\beta/\sigma_i] = P[(\varepsilon_i/\sigma_i) \le x_i'\beta/\sigma_i],$$

so that

$$P[y_i = 1] = F\left(x_i'\beta / e^{z_i'\gamma}\right). \tag{6.13}$$

The null hypothesis of homoskedasticity corresponds to the parameter restriction $H_0: \gamma = 0$. This hypothesis can be tested by the LR-test. The unrestricted likelihood function is obtained from the log-likelihood (6.7) by replacing the terms $p_i = F(x_i'\beta)$ by $p_i = F(x_i'\beta/e^{z_i'\gamma})$.

Lagrange Multiplier test on heteroskedasticity

An alternative is to use the LM-test, so that only the model under the null hypothesis (with $\gamma = 0$) needs to be estimated. This amounts to estimating the model $P[y_i = 1] = F(x_i'\beta)$ by ML, as discussed in Section 6.1.3. The residuals of this model are denoted by $e_i = y_i - F(x_i'b)$. By working out the formulas for the gradient and the Hessian of the unrestricted likelihood, it can be shown that the LM-test can be performed as if (6.13) were a non-linear regression model. The variance of the 'error term' $y_i - p_i$ is $\text{var}(y_i - p_i) = \text{var}(y_i) = p_i(1-p_i)$, and this heteroskedasticity of the residuals should be taken into account. This amounts to applying (feasible) weighted least squares — that is, OLS after division for the $i$th observation by the (estimated) standard deviation $\sqrt{\hat{p}_i(1-\hat{p}_i)}$, so that the weight of the $i$th observation in WLS is given by $1/\sqrt{\hat{p}_i(1-\hat{p}_i)}$. Further, the gradient of the function $F(x_i'\beta/e^{z_i'\gamma})$ in the model (6.13), when evaluated at $\gamma = 0$, is given by

$$\frac{\partial F(x'\beta/e^{z'\gamma})}{\partial \beta} = f(x'\beta)\,x, \qquad \frac{\partial F(x'\beta/e^{z'\gamma})}{\partial \gamma} = -f(x'\beta)\,x'\beta\, z.$$

Therefore, as a second step, we regress the residuals $e_i$ on the gradient of the non-linear model $P[y_i = 1] = F(x_i'\beta/e^{z_i'\gamma})$. Under the null hypothesis of homoskedasticity, the required auxiliary regression in this second step can be written in terms of the standardized residuals (6.12) as

$$e_i^* = \frac{y_i - \hat{p}_i}{\sqrt{\hat{p}_i(1-\hat{p}_i)}} = \frac{f(x_i'b)}{\sqrt{\hat{p}_i(1-\hat{p}_i)}}\, x_i'\delta_1 + \frac{f(x_i'b)\, x_i'b}{\sqrt{\hat{p}_i(1-\hat{p}_i)}}\, z_i'\delta_2 + \eta_i. \tag{6.14}$$

Under the null hypothesis, there holds that $LM = nR^2_{nc}$, where $R^2_{nc}$ denotes the non-centred $R^2$ — that is, the explained sum of squares of (6.14) divided by the non-centred total sum of squares: $R^2_{nc} = \sum (\hat{e}_i^*)^2 / \sum (e_i^*)^2$, where $\hat{e}_i^*$ denote the fitted values of the regression in (6.14). The non-centred $R^2$ should be taken here because the regression in (6.14) does not contain a constant term on the right-hand side. We reject the null hypothesis for large values of the LM-test, and under the null hypothesis of homoskedasticity ($\gamma = 0$) the test is asymptotically distributed as $\chi^2(g)$, where $g$ is the number of variables in $z_i$ — that is, the number of parameters in $\gamma$. The correctness of the following steps to compute the LM-test is left as an exercise (see Exercise 6.1). This can be summarized as follows.

Computation of LM-test on heteroskedasticity

- Step 1: Estimate the restricted model. Estimate the homoskedastic model $P[y_i = 1] = F(x_i'\beta)$ by ML. Let $\hat{p}_i = F(x_i'b)$ and define the generalized residuals $e_i^*$ by (6.12).
- Step 2: Auxiliary regression of generalized residuals of step 1. Regress the generalized residuals $e_i^*$ of step 1 on the (scaled) gradient of the heteroskedastic model $P[y_i = 1] = F(x_i'\beta/e^{z_i'\gamma})$ — that is, perform OLS in (6.14).
- Step 3: $LM = nR^2_{nc}$ of step 2, where $R^2_{nc}$ is the non-centred $R^2$ of the regression in step 2. If the null hypothesis of homoskedasticity ($\gamma = 0$) holds true, then $LM \approx \chi^2(g)$, where $g$ is the number of parameters in $\gamma$.
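A sketch of steps 1–3 for the probit case (the logit case is analogous); the function and variable names are ours, and X, y, Z are assumed NumPy arrays, with no constant term in Z.

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import norm

def lm_het_probit(X, y, Z):
    """LM test for heteroskedasticity sigma_i = exp(z_i'gamma) in a probit model."""
    # Step 1: ML under the null of homoskedasticity
    b = sm.Probit(y, X).fit(disp=0).params
    xb = X @ b
    p, f = norm.cdf(xb), norm.pdf(xb)
    w = np.sqrt(p * (1 - p))
    e_star = (y - p) / w                              # generalized residuals (6.12)
    # Step 2: regress e* on the scaled gradient, as in (6.14)
    R = np.column_stack([X * (f / w)[:, None],        # derivative w.r.t. beta
                         Z * (f * xb / w)[:, None]])  # derivative w.r.t. gamma
    coef, *_ = np.linalg.lstsq(R, e_star, rcond=None)
    e_fit = R @ coef
    # Step 3: LM = n * (non-centred R^2); compare with chi^2(g), g = Z.shape[1]
    return len(y) * (e_fit @ e_fit) / (e_star @ e_star)
```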

Example 6.3: Direct Marketing for Financial Product (continued, data file XM601DMF)

We perform some diagnostic checks on the logit and probit models that were estimated for the direct mailing data in Example 6.2. We will discuss (i) the significance of the explanatory variables, (ii) the investigation of the possible presence of outliers and heteroskedasticity, and (iii) the predictive performance of the models. Exhibit 6.4 reports the results of these diagnostic checks.

(i) Significance of the explanatory variables

In Example 6.2 we concluded that the variables 'gender' and 'activity' are significant but that the linear and quadratic age variables are individually only marginally significant. Panel 1 of Exhibit 6.4 contains the result of the LR-test for the joint significance of the two age variables. This indicates that they are jointly not significant, as $P = 0.12$ in the logit model and $P = 0.13$ in the probit model. The two models have nearly equal and not so large values of $R^2$ (0.061), but the LR-test for the joint significance of the variables $(x_2, \cdots, x_5)$ in Panel 1 of Exhibit 6.4 shows that the models have explanatory power. The combination of statistical significance with relatively low fit is typical for models explaining individual behaviour. This means that the model may have difficulty in describing individual decisions but that it gives insight into the overall pattern of behaviour.

(ii) Investigation of possible outliers and heteroskedasticity

The maximum and minimum values of the standardized residuals reported in Panel 1 of Exhibit 6.4 for the logit and probit model do not indicate the presence of outliers. To test for the possible presence of heteroskedasticity, we consider the model $\sigma_i = e^{\gamma z_i}$, where $z_i$ is the total amount of money that individual $i$ has already invested in other products of the bank. The test outcomes provide some evidence for the presence of heteroskedasticity ($P = 0.01$).

(iii) Predictive performance

Panels 2 and 3 of Exhibit 6.4 also contain results on the predictive performance of the logit and probit models. The hit rates are 0.616 for the logit model and 0.622 for the probit model. This is well above the expected hit rate of around 0.5 of random predictions (more precisely, of the 925 observations there are 470 with $y_i = 1$ and 455 with $y_i = 0$, so that $\hat{p} = 470/925$ and $\hat{q} = \hat{p}^2 + (1-\hat{p})^2 = 0.5001$). The test whether the predictions are better than random gives values of $z = 7.07$ for the logit model and $z = 7.40$ for the probit model, with $P$-value $P = 0.0000$ (see Panels 2 and 3 of Exhibit 6.4).

Exhibit 6.4 Direct Marketing for Financial Product (Example 6.3). Outcomes of various diagnostic tests for logit and probit models for responses to direct mailing (Panel 1) and predictive performance of the logit model (Panel 2) and of the probit model (Panel 3).

Panel 1: Diagnostic test results. For both models the panel reports the maximum and minimum of the standardized residuals, the LM-test on heteroskedasticity (df = 1, P-values 0.0125 for the logit and 0.0129 for the probit model), the LR-test for the significance of the explanatory variables (df = 4, P = 0.0000 for both models), the LR-test for the significance of the age variables (df = 2, P-values 0.1196 and 0.1294), and R-squared = 0.061 for both models.

Panel 2: LOGIT: Prediction Evaluation (success cutoff C = 0.5)

                 Estimated Equation          Constant Probability
               Dep=0   Dep=1   Total       Dep=0   Dep=1   Total
P(Dep=1)<=C      196      96     292           0       0       0
P(Dep=1)>C       259     374     633         455     470     925
Total            455     470     925         455     470     925
Correct          196     374     570           0     470     470
% Correct      43.08   79.57   61.62        0.00  100.00   50.81

$\hat{p} = 470/925 = 0.508$, random hit rate $\hat{p}^2 + (1-\hat{p})^2 = 0.5001$;
$z = (570 - 462.5)/\sqrt{925 \cdot 0.5001 \cdot 0.4999} = 7.07$, $P = 0.0000$.

Panel 3: PROBIT: Prediction Evaluation (success cutoff C = 0.5)

                 Estimated Equation          Constant Probability
               Dep=0   Dep=1   Total       Dep=0   Dep=1   Total
P(Dep=1)<=C      190      85     275           0       0       0
P(Dep=1)>C       265     385     650         455     470     925
Total            455     470     925         455     470     925
Correct          190     385     575           0     470     470
% Correct      41.76   81.91   62.16        0.00  100.00   50.81

$\hat{p} = 470/925 = 0.508$, random hit rate $\hat{p}^2 + (1-\hat{p})^2 = 0.5001$;
$z = (575 - 462.5)/\sqrt{925 \cdot 0.5001 \cdot 0.4999} = 7.40$, $P = 0.0000$.

This shows that the classification of respondents by the logit and probit models is better than what would have been achieved by random predictions. The models are more successful in predicting positive responses (around 80 per cent is predicted correctly) than in predicting no response (of which a bit more than 40 per cent is predicted correctly).

Exercises: T: 6.10; E: 6.8c–f, 6.13c.

6.1.5 Model for grouped data

Grouped data

Sometimes — for instance, for reasons of confidentiality — the individual data are not given and only the average values of the variables over groups of individuals are reported. For instance, the investment decisions of customers of a bank may be averaged over residential areas (zip codes) or over age groups. Suppose that the individual data satisfy the binary response model (6.3) — that is, $P[y_i = 1] = F(x_i'\beta)$ with the same function $F$ for all $i = 1, \cdots, n$. Let the data be grouped into $G$ groups, $j = 1, \cdots, G$, with $n_j$ individuals in group $j$. The groups should be chosen so that the values of the explanatory variables $x$ are reasonably constant within each group. Let $x_j$ denote the vector of group means of the explanatory variables for the $n_j$ individuals in this group, and let $y_j$ be the fraction of individuals in group $j$ that have chosen alternative 1, so that a fraction $1 - y_j$ has chosen the alternative 0. The data consist of the $G$ values of $(y_j, x_j)$, and the group sizes $n_j$ are assumed to be known.

Estimation by maximum likelihood

It is assumed that $x_j$ is a close enough approximation of the characteristics of all individuals in group $j$ so that their probabilities to choose alternative 1 are constant and given by $p_j = F(x_j'\beta)$. Then the joint contribution of the individuals in group $j$ to the log-likelihood (6.7) is given by $n_{j1}\log(p_j) + (n_j - n_{j1})\log(1 - p_j)$, where $n_{j1} = n_j y_j$ is the number of individuals in group $j$ that chooses alternative 1. So, in terms of the observed fractions $y_j$, the log-likelihood becomes

$$\log(L) = \sum_{j=1}^{G} n_j \bigl[ y_j \log(p_j) + (1 - y_j)\log(1 - p_j) \bigr]. \tag{6.15}$$

It is required that $k \le G$ — that is, the number of explanatory variables may not be larger than the number of groups. The model parameters can be estimated by maximum likelihood, much in the same way as was discussed in Section 6.1.3 for the case of individual binary response data.

To test the specification of the model, one can consider as an alternative the model that contains a dummy for each group. This model contains $G$ parameters and allows for arbitrary different specific probabilities for each group, with $p_j = F(\delta_j)$ for $j = 1, \cdots, G$. The corresponding maximum likelihood estimates satisfy $\hat{p}_j = F(\hat{\delta}_j) = y_j$. The model $p_j = F(x_j'\beta)$ imposes $(G - k)$ parameter restrictions $\delta_j = x_j'\beta$. This can be tested by the LR-test, which follows a $\chi^2(G-k)$ distribution under the null hypothesis of correct specification.

Estimation by feasible weighted least squares

Instead of using the above maximum likelihood approach, one can also use feasible weighted least squares (FWLS) to estimate the parameters $\beta$. This is based on the fact that $y_j$ is the sample mean of $n_j$ independent drawings from the Bernoulli distribution with mean $p_j$ and variance $p_j(1-p_j)$. If $n_j$ is sufficiently large, it follows from the central limit theorem that

$$y_j \approx N\left(p_j,\; \frac{p_j(1-p_j)}{n_j}\right).$$

If $F$ is continuous and monotonically increasing (as is the case for logit and probit models), then the inverse function $F^{-1}$ exists. We define transformed observations $z_j = F^{-1}(y_j)$. Using the facts that $F^{-1}(p_j) = x_j'\beta$ and that $F^{-1}(p)$ has derivative $1/f(F^{-1}(p))$, it follows that in large enough samples

$$z_j \approx N\left(x_j'\beta,\; \frac{p_j(1-p_j)}{n_j f_j^2}\right),$$

where $f_j = f(x_j'\beta)$. This can be written as a regression equation

$$z_j = x_j'\beta + e_j, \qquad j = 1, \cdots, G.$$

Here the error terms $e_j$ are independent and approximately normally distributed with mean zero and variances $\sigma_j^2 = p_j(1-p_j)/\bigl(n_j f_j^2\bigr)$. So the error terms are heteroskedastic. Then $\beta$ can be estimated by FWLS — for instance, as follows. In the first step $\beta$ is estimated by OLS — that is, regressing $z_j$ on $x_j$ for the $G$ groups. Let $b$ be the OLS estimate; then the variance $\sigma_j^2$ of $e_j$ can be estimated by replacing $p_j$ by $\hat{p}_j = F(x_j'b)$ and $f_j$ by $f(x_j'b)$, so that

$$\hat{\sigma}_j^2 = \hat{p}_j(1-\hat{p}_j)/\bigl(n_j f(x_j'b)^2\bigr).$$

In the second step $\beta$ is estimated by WLS, using the estimated standard deviations of $e_j$ to obtain the appropriate weighting factors. That is, in the second step OLS is applied in the transformed model

$$\frac{z_j}{\hat{\sigma}_j} = \frac{x_j'}{\hat{\sigma}_j}\,\beta + \omega_j, \qquad j = 1, \cdots, G.$$
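A compact sketch of this two-step procedure (assumed NumPy inputs; probit by default, and not the book's own code — pass the logistic cdf/pdf/ppf for the logit case):

```python
import numpy as np
from scipy.stats import norm

def grouped_fwls(Xg, yfrac, nj, cdf=norm.cdf, pdf=norm.pdf, inv=norm.ppf):
    """Two-step FWLS for grouped binary data.
    Xg: G x k matrix of group means, yfrac: G observed fractions, nj: G group sizes."""
    z = inv(yfrac)                                   # z_j = F^{-1}(y_j)
    b = np.linalg.lstsq(Xg, z, rcond=None)[0]        # step 1: OLS of z_j on x_j
    p, f = cdf(Xg @ b), pdf(Xg @ b)
    sigma = np.sqrt(p * (1 - p) / nj) / f            # estimated std dev of e_j
    bw, *_ = np.linalg.lstsq(Xg / sigma[:, None], z / sigma, rcond=None)
    return bw                                        # step 2: WLS estimate
```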

FWLS in the logit model

We specify the above general method in more detail for the logit model. In this case the required regressions simplify somewhat, because the logit model has the property that $f_j = \lambda_j = L_j(1 - L_j) = p_j(1-p_j)$ (see Section 6.1.3). So $\hat{\sigma}_j^2 = 1/\bigl(n_j \hat{p}_j(1-\hat{p}_j)\bigr)$ and the FWLS estimates are obtained by performing OLS in the following regression model:

$$\sqrt{n_j \hat{p}_j(1-\hat{p}_j)}\; z_j = \sqrt{n_j \hat{p}_j(1-\hat{p}_j)}\; x_j'\beta + \omega_j, \qquad j = 1, \cdots, G,$$

where

$$z_j = \log\left(\frac{y_j}{1-y_j}\right), \qquad \hat{p}_j = L(x_j'b) = \frac{1}{1 + e^{-x_j'b}}.$$

So, for the logit model the FWLS estimates are obtained by regressing $z_j$ on $x_j$ (with OLS estimate $b$), followed by a regression of $w_j z_j$ on $w_j x_j$ with weights

$$w_j = \sqrt{n_j \hat{p}_j(1-\hat{p}_j)} = \sqrt{n_j}\, \frac{e^{-\frac{1}{2}x_j'b}}{1 + e^{-x_j'b}} = \sqrt{n_j}\, \frac{e^{\frac{1}{2}x_j'b}}{1 + e^{x_j'b}}.$$

FWLS is asymptotically equivalent to ML (see also Section 5.4 (p. 336)). However, if $n_j$ is relatively small for some groups, it may be preferable to use ML. An example using grouped data is left as an exercise: see Exercise 6.12, which considers the direct mailing data averaged over ten age groups.

Exercises: E: 6.12.

6.1.6 Summary

To model the underlying factors that influence the outcome of a binary dependent variable we take the following steps.

- Determine the possibly relevant explanatory variables and formulate a model of the form $P[y = 1] = F(x'\beta)$, where $y$ is the dependent variable (with possible outcomes 0 and 1) and $x$ is the vector of explanatory variables. The function $F$ is chosen as a cumulative distribution function, in most cases $F = L$ of (6.5) (the logit model) or $F = \Phi$ of (6.6) (the probit model).

- Estimate the parameters $\beta$ of the model by maximum likelihood. For logit and probit models, the required non-linear optimization can be solved without any problems by standard numerical methods.
- The estimated model can be interpreted in terms of the signs and significance of the estimated coefficients $b$ and in terms of the mean marginal effects and odds ratios discussed in Section 6.1.2.
- The model can be evaluated in different ways, by diagnostic tests (standardized residuals, test on heteroskedasticity) and by measuring the model quality (goodness of fit and predictive performance).
- The approach for grouped (instead of individual) data is similar; the main distinction is that the log-likelihood is now given by (6.15) instead of (6.7).

6.2 Multinomial data

6.2.1 Unordered response

Uses Chapters 1–4; Section 5.4; Section 6.1.

Multinomial data

When the dependent variable has a finite number of possible outcomes, the data are called multinomial. This occurs, for instance, when individuals can choose among more than two options. In some cases the options can be ordered (for example, how much one agrees or disagrees with a statement); in other cases the different options are unordered (for example, the choice of travel mode for urban commuters). In this section and the next one we discuss models for unordered data, and in Section 6.2.3 we consider ordered data.

Multinomial model for individual-specific data

Let $m$ be the number of alternatives. These alternatives (for example, to travel by bicycle, bus, car, or train) are supposed to have no natural ordering, so that the response $y_i = j$ is a nominal (not an ordinal) variable. However, for ease of reference the alternatives are labelled by an index $j = 1, \cdots, m$. Let $n_j$ be the number of observations with response $y_i = j$ and let $n = \sum_{j=1}^{m} n_j$ be the total number of observations. Suppose that, apart from the choices $y_i$, also the values $x_i$ of $k$ explanatory variables are observed, $i = 1, \cdots, n$. A possible model in terms of stochastic utilities is given by

$$U_i^j = u_{ij} + \varepsilon_{ij} = x_i'\beta_j + \varepsilon_{ij}. \tag{6.16}$$

Here $x_i$ is a $k \times 1$ vector of explanatory variables for individual $i$ and $\beta_j$ is a $k \times 1$ vector of parameters for alternative $j$. The first element of $x_i$ is the constant term $x_{1i} = 1$, and the other elements of $x_i$ represent characteristics of the $i$th individual. Here $u_{ij} = x_i'\beta_j$ represents the systematic utility of alternative $j$ for an individual with characteristics $x_i$, and $\beta_j$ measures the relative weights of the characteristics in the derived utility. The differences between the alternatives are modelled by differences in the weights, and $\beta_{jl} - \beta_{hl}$ measures the marginal increase of the utility of alternative $j$ as compared to alternative $h$ when the $l$th explanatory variable rises by one unit.

The model (6.16) is called the multinomial model. This model can be used if data are available on the individual-specific values of the $k$ explanatory variables $x_i$.

Conditional model for individual- and alternative-specific data

Another type of model is obtained when aspects of the alternatives are measured for each individual — for example, the travel times for alternative transport modes. Let $x_{ij}$ be the vector of values of the explanatory variables that apply for individual $i$ and alternative $j$. A possible model for the utilities is

$$U_i^j = u_{ij} + \varepsilon_{ij} = x_{ij}'\beta + \varepsilon_{ij}, \qquad i = 1, \cdots, n, \quad j = 1, \cdots, m, \tag{6.17}$$

where $x_{ij}$ and $\beta$ are $k \times 1$ vectors. This is called the conditional model. The difference with the multinomial model (6.16) is that the differences between the alternatives $j$ and $h$ are measured now by $(x_{ij} - x_{ih})$, which may vary between individuals, whereas in (6.16) these differences are $(\beta_j - \beta_h)$, which are unknown and the same for all individuals. This model can be used if relevant characteristics $x_{ij}$ of the $m$ alternatives can be measured for the $n$ individuals. The terms $\varepsilon_{ij}$ are individual-specific and represent unmodelled factors in individual preferences.

Choice model and log-likelihood

Both in the multinomial model and in the conditional model, it is assumed that the $i$th individual chooses the alternative $j$ for which the utility $U_i^j$ is maximal. It is assumed that (conditional on the given values of the explanatory variables) the individuals make independent choices, so that $\varepsilon_{ij}$ and $\varepsilon_{gh}$ are independent for all $i \ne g$ and all $j, h = 1, \cdots, m$. It then follows that

$$p_{ij} = P[y_i = j] = P[u_{ij} + \varepsilon_{ij} > u_{ih} + \varepsilon_{ih} \text{ for all } h \ne j], \tag{6.18}$$

where $u_{ij} = x_i'\beta_j$ or $u_{ij} = x_{ij}'\beta$ depending on which of the two above models is chosen. The log-likelihood can then be written as

$$\log(L) = \sum_{i=1}^{n}\sum_{j=1}^{m} y_{ij}\log(p_{ij}) = \sum_{i=1}^{n} \log(p_{i,y_i}), \tag{6.19}$$

where $y_{ij} = 1$ if $y_i = j$ and $y_{ij} = 0$ otherwise, and where $p_{ij} = p_{i,y_i}$ for the actually chosen alternative $j = y_i$. So this consists of the sum over the $n$ terms $\log(p_{ij})$, where $j = y_i$ is the alternative chosen by individual $i$. In order to estimate the parameters, the joint distribution of the terms $\varepsilon_{ij}$ has to be specified.

.18). if in the multinomial probit model bj % bh or in the conditional probit model xij % xih (so that the utilities derived from the alternatives j and h are close together). the log-likelihood (6. in (6. A $ NID(0. C @ .17). For the binary choice model with m ¼ 2 alternatives. then (6.6. so that 0 1 ei1 B . Estimation of multinomial and conditional probit models The multinomial and conditional probit models can be estimated by ML. V ) is called the multinomial probit model. then a j typical preference eij ¼ Ui À uij > 0 (meaning that the ith individual derives a larger utility from alternative j than is usual for individuals with the same values of the explanatory variables) will mostly correspond to a preference eih ¼ Uih À uih > 0 as well. h) of the covariance matrix V . Such correlations can be modelled by the off-diagonal elements (j. Á Á Á . m.7).16). (6.18) with ei $ NID(0.19) after the joint distribution of the terms eij . For ﬁxed values of the parameters. suppose that these terms are jointly normally distributed with mean zero and (unknown) m Â m covariance matrix V . j ¼ 1.19) reduces to the log-likelihood (6. That is.2 Multinomial data n X m X i¼1 j¼1 n X i¼1 465 log (L) ¼ yij log (pij ) ¼ log (piyi ): (6:19) So this consists of the sum over the n terms log (pij ). As the probability pij involves the (m À 1) conditions eij À eih > uih À uij (for h 6¼ j). of x0i (bh À bj ) in the multinomial model and of (xih À xij )0 b in the conditional T . V ): eim If uij ¼ x0i bj . For example. An important advantage of incorporating the covariance matrix V in the model is the following. that is. And if uij ¼ x0ij b. then the model (6.18) for the choice probabilities pij with ei $ NID(0. where j ¼ yi is the alternative chosen by individual i.18) has been speciﬁed. Multinomial and conditional probit models The ML estimates of the parameters of the model can be obtained by maximizing (6. this probability is expressed as an (m À 1) dimensional integral in terms of the (m À 1) random variables (eij À eih ). then it may be expected that eij and eih are positively correlated. as in the multinomial model (6. as in the conditional model (6. V ) is called the conditional probit model. When two alternatives j and h are perceived as being close together.19) can be evaluated by numerical integration of the probabilities pij in (6. The evaluation of this integral (for given values of (uih À uij ).

6.2.2 Multinomial and conditional logit

Model formulation

Although multinomial and conditional probit models can be estimated by suitable numerical integration methods, in practice it is often preferred to use simpler models. Numerically simpler likelihood functions can be obtained by choosing other distributions for the error terms $\varepsilon_{ij}$. A considerable simplification is obtained by assuming that all the $mn$ error terms $\varepsilon_{ij}$ are independently and identically distributed (for all individuals and all alternatives) with the so-called extreme value distribution. It can be shown (see Exercise 6.3) that in this case the multinomial and the conditional probabilities in (6.18) become

$$\text{multinomial logit:} \quad p_{ij} = \frac{e^{x_i'\beta_j}}{\sum_{h=1}^{m} e^{x_i'\beta_h}} = \frac{e^{x_i'\beta_j}}{1 + \sum_{h=2}^{m} e^{x_i'\beta_h}}, \qquad \text{conditional logit:} \quad p_{ij} = \frac{e^{x_{ij}'\beta}}{\sum_{h=1}^{m} e^{x_{ih}'\beta}}. \tag{6.20}$$

The first model for the choice probabilities $p_{ij}$ is called the multinomial logit model, the second model the conditional logit model.

Parameter restrictions needed for identification

Some parameter restrictions have to be imposed, as the probabilities $p_{ij}$ depend only on the differences $u_{ih} - u_{ij}$ of the utilities. In the multinomial model this means that one of the parameter vectors can be chosen arbitrarily — for instance, $\beta_1 = 0$, which corresponds to choosing the first alternative as reference. For the multinomial model in (6.20) we used this identification convention. In the conditional logit model, a term that is common to all alternatives cancels from the probabilities, so that the vector of explanatory variables should not include a constant term in this case.

For the case of $m = 2$ alternatives, both models boil down to a binary logit model. Indeed, for the multinomial model we get $p_{i2} = e^{x_i'\beta_2}/(1 + e^{x_i'\beta_2})$, which is a binary logit model with parameter vector $\beta = \beta_2$. In the conditional logit model we get for $m = 2$ that

$$p_{i2} = \frac{e^{x_{i2}'\beta}}{e^{x_{i1}'\beta} + e^{x_{i2}'\beta}} = \frac{e^{(x_{i2} - x_{i1})'\beta}}{1 + e^{(x_{i2} - x_{i1})'\beta}},$$

which is a binary logit model with explanatory variables $x_i = x_{i2} - x_{i1}$.
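The multinomial logit probabilities (6.20) are a softmax over the utilities; a sketch (our own helper, with the first category as reference):

```python
import numpy as np

def mnl_probs(X, B):
    """Multinomial logit probabilities (6.20): B holds beta_2, ..., beta_m as
    columns and beta_1 = 0 (first category is the reference)."""
    U = np.column_stack([np.zeros(X.shape[0]), X @ B])  # utilities, u_i1 = 0
    # subtracting the row maximum is allowed because the probabilities depend
    # only on utility differences; it stabilizes the exponentials numerically
    expU = np.exp(U - U.max(axis=1, keepdims=True))
    return expU / expU.sum(axis=1, keepdims=True)
```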

Estimation of the multinomial logit model

The multinomial logit (MNL) model can be estimated by maximum likelihood — that is, by maximizing (6.19) with respect to the parameters $\beta_j$, $j = 2, \cdots, m$. If we substitute (6.20) in (6.19), the log-likelihood becomes

$$\log(L_{MNL}(\beta_2, \cdots, \beta_m)) = \sum_{i=1}^{n}\left[\sum_{j=2}^{m} y_{ij}\, x_i'\beta_j - \log\left(1 + \sum_{h=2}^{m} e^{x_i'\beta_h}\right)\right]. \tag{6.21}$$

The gradient of the log-likelihood consists of the $(m-1)$ stacked $k \times 1$ vectors

$$\frac{\partial \log(L_{MNL})}{\partial \beta_h} = \sum_{i=1}^{n} (y_{ih} - p_{ih})\,x_i, \qquad h = 2, \cdots, m,$$

with $p_{ih}$ as specified above for the multinomial model. Further, the $(m-1)k \times (m-1)k$ Hessian matrix is negative definite, with $k \times k$ blocks $-\sum_{i=1}^{n} p_{ih}(1 - p_{ih})\,x_i x_i'$ on the diagonal ($h = 2, \cdots, m$) and $k \times k$ blocks $\sum_{i=1}^{n} p_{ig}\,p_{ih}\,x_i x_i'$ off the diagonal ($g, h = 2, \cdots, m$, $g \ne h$). It is left as an exercise (see Exercise 6.3) to show these results.

Estimation of the conditional logit model

For the conditional logit (CL) model the results are as follows (see Exercise 6.3). The log-likelihood is given by

$$\log(L_{CL}(\beta)) = \sum_{i=1}^{n}\left[\sum_{j=1}^{m} y_{ij}\, x_{ij}'\beta - \log\left(\sum_{h=1}^{m} e^{x_{ih}'\beta}\right)\right]. \tag{6.22}$$

The gradient of the log-likelihood is

$$\frac{\partial \log(L_{CL})}{\partial \beta} = \sum_{i=1}^{n}\sum_{j=1}^{m} (y_{ij} - p_{ij})\,x_{ij}.$$

Finally, the Hessian is

$$-\sum_{i=1}^{n}\left[\sum_{j=1}^{m} p_{ij}\,x_{ij}x_{ij}' - \left(\sum_{h=1}^{m} p_{ih}\,x_{ih}\right)\left(\sum_{h=1}^{m} p_{ih}\,x_{ih}\right)'\right],$$

which is negative definite.
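In practice the MNL model can be estimated directly with statsmodels; the sketch below uses placeholder simulated data purely to show the call pattern.

```python
import numpy as np
import statsmodels.api as sm

# y takes values 0, ..., m-1 (0 is the reference category); X includes a constant
rng = np.random.default_rng(1)
X = sm.add_constant(rng.normal(size=(500, 2)))
y = rng.integers(0, 3, size=500)        # placeholder choices, for illustration only

res = sm.MNLogit(y, X).fit(disp=0)      # ML by Newton-type iterations
print(res.params)                       # one coefficient column per non-reference category
print(res.predict(X)[:5])               # fitted choice probabilities p_ij
```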

Numerical aspects

The first order conditions for a maximum can be solved numerically — for instance, by using the above expressions for the gradient and the Hessian in the Newton–Raphson algorithm. In both the multinomial and the conditional logit model the Hessian matrix is negative definite, so that in general the iterations converge relatively fast to the global maximum. As usual, approximate standard errors of the ML estimates can be obtained from the inverse of (minus) the Hessian matrix.

Marginal effects of explanatory variables

The parameters of the model can be interpreted in terms of the marginal effects of the explanatory variables on the choice probabilities. The following results are left as an exercise (see Exercise 6.3). In the multinomial logit model, the $k \times 1$ vector of marginal effects is given by

$$\frac{\partial P_{MNL}[y_i = j]}{\partial x_i} = p_{ij}\left(\beta_j - \sum_{h=2}^{m} p_{ih}\beta_h\right).$$

In the conditional logit model the marginal effects are

$$\frac{\partial P_{CL}[y_i = j]}{\partial x_{ij}} = p_{ij}(1 - p_{ij})\,\beta, \qquad \frac{\partial P_{CL}[y_i = j]}{\partial x_{ih}} = -p_{ij}\,p_{ih}\,\beta \quad \text{for } h \ne j. \tag{6.23}$$

Note that, in the multinomial logit model, all the parameters $\beta_h$, $h = 2, \cdots, m$, together determine the marginal effect of $x_i$ on the probability to choose the $j$th alternative. It may even be the case that the marginal effect of the $l$th variable $x_{li}$ on $P[y_i = j]$ has the opposite sign of the parameter $\beta_{jl}$. So the sign of the parameter $\beta_{jl}$ cannot always be interpreted directly as the sign of the effect of the $l$th explanatory variable on the probability to choose the $j$th alternative. Therefore the individual parameters of a multinomial logit model do not always have an easy direct interpretation. On the other hand, in the conditional logit model the sign of $\beta_l$ is equal to the sign of the marginal effect of the $l$th explanatory variable ($x_{ij,l}$) on the probability to choose each alternative, since $0 < p_{ij}(1 - p_{ij}) < 1$.

Odds ratios and the 'independence of irrelevant alternatives'

The above multinomial and conditional logit models are based on the assumption that the error terms $\varepsilon_{ij}$ are independent not only among different individuals $i$ but also among the different alternatives $j$. That is, the unmodelled individual preferences $\varepsilon_{ij}$ of a given individual $i$ are independent for the different alternatives $j$. This requires that the alternatives should be sufficiently different from each other.

Odds ratios and the 'independence of irrelevant alternatives'

The above multinomial and conditional logit models are based on the assumption that the error terms $\varepsilon_{ij}$ are independent, not only among different individuals $i$ but also among the different alternatives $j$. That is, the unmodelled individual preferences $\varepsilon_{ij}$ of a given individual $i$ are independent for the different alternatives $j$. This requires that the alternatives should be sufficiently different from each other. This can be further clarified by considering the log-odds between two alternatives $j$ and $h$. In the multinomial and the conditional logit model, the log-odds of alternative $j$ against alternative $h$ are given respectively by

$$\log\left(\frac{P_{MNL}[y_i = j]}{P_{MNL}[y_i = h]}\right) = x_i'(\beta_j - \beta_h), \qquad \log\left(\frac{P_{CL}[y_i = j]}{P_{CL}[y_i = h]}\right) = (x_{ij} - x_{ih})'\beta.$$

So the relative odds to choose between the alternatives $j$ and $h$ are not affected by the other alternatives. That is, in comparing the alternatives $j$ and $h$, the other options are irrelevant. This property of the multinomial and conditional logit model is called the 'independence of irrelevant alternatives'. The odds ratio between two alternatives then does not change when other alternatives are added to or deleted from the model. As an example, suppose that consumers can choose between ten brands of a certain product, with two strong leading brands ($j = 1, 2$) and with eight other much smaller brands. Suppose that the owner of the first leading brand is interested in the odds of his product compared with the other leading brand — that is, in $\log(P[y_i = 1]/P[y_i = 2])$. Clearly, it should make little difference whether this is modelled as a choice between ten alternative brands or as a choice between three alternatives (the two leading brands and the rest, taken as one category). In such situations the 'independence of irrelevant alternatives' is a reasonable assumption. In other situations the independence of irrelevant alternatives is not realistic, especially if some of the alternatives are very similar, so that the discussed logit models are not appropriate. In this case it is better to use multinomial or conditional probit models to incorporate the dependencies between the error terms $\varepsilon_{ij}$ for the different alternatives $j$.

Diagnostic tests

One can apply similar diagnostic checks on multinomial and conditional models as discussed before in Section 6.1.4 for binary models. For instance, the overall significance of the model can again be tested by means of the likelihood ratio test on the null hypothesis that all parameters are zero. One can further evaluate the success of classification — for instance, by predicting that the $i$th individual chooses the alternative $h$ for which $\hat{p}_{ih}$ is maximal. These predicted choices can be compared with the actual observed choices $y_i$ in an $m \times m$ classification table. Let $n_{jj}$ be the number of individuals for which $y_i = j$ is predicted correctly, and let $p_{jj} = n_{jj}/n$. Then $h = \sum_{j=1}^{m} p_{jj}$ is the hit rate — that is, the success rate of the model predictions. This may be compared to random predictions, where for each individual the alternative $j$ is predicted with probability $\hat{p}_j = n_j/n$, the observed fractions in the sample. The expected hit rate of these random predictions is $\hat{q} = \sum_{j=1}^{m} \hat{p}_j^2$. The model provides better-than-random predictions if

$$z = \frac{h - \hat{q}}{\sqrt{\hat{q}(1 - \hat{q})/n}} = \frac{nh - n\hat{q}}{\sqrt{n\hat{q}(1 - \hat{q})}}$$

is large enough (larger than 1.645 at 5 per cent significance level).
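These diagnostics are easy to script. A minimal sketch follows; the function name and the table orientation (rows actual, columns predicted) are our own conventions, and the demonstration uses the classification counts of Example 6.4 below.

```python
import numpy as np

def classification_z(table):
    # table: m x m counts, rows = actual and columns = predicted categories
    table = np.asarray(table, dtype=float)
    n = table.sum()
    h = np.trace(table) / n                   # hit rate, sum of diagonal p_jj
    p = table.sum(axis=1) / n                 # observed fractions p_j = n_j/n
    q = (p ** 2).sum()                        # expected random hit rate
    z = (h - q) / np.sqrt(q * (1.0 - q) / n)  # compare with 1.645 (5 per cent)
    return h, q, z

# counts from Example 6.4 below (rows actual, columns predicted)
example = [[138, 10, 9], [14, 13, 0], [7, 0, 67]]
print(classification_z(example))   # roughly (0.845, 0.464, 12.3)
```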

An LR-test for heteroskedasticity may be performed — for instance, by specifying a model for the error terms $\varepsilon_{ij}$ in the utility function (6.16) or (6.17), where $E[\varepsilon_{ij}^2] = \sigma_j^2$ and $E[\varepsilon_{i1}^2] = \sigma_1^2 = 1$ is fixed. This allows for the possibility that the utilities of some of the alternatives are better captured by the explanatory variables than other ones.

Example 6.4: Bank Wages (continued; data file XM604BWA)

We return to the data on employees of a bank considered in earlier chapters. We will discuss (i) the data and the model, (ii) the estimation results, (iii) an analysis of the marginal effects of education, (iv) the average marginal effects of education, (v) the predictive performance of the model, and (vi) the odds ratios.

(i) The data and the model

The dependent variable is the attained job category (1, 2, or 3) of the bank employee. The jobs in the bank are divided into three categories. One category (which is given the label '1') consists of administrative jobs, a second category (with label '2') of custodial jobs, and a third category (with label '3') of management jobs. We consider the job category (1, 2, 3) as nominal variable, and we estimate a multinomial logit model to explain the attained job category in terms of observed characteristics of the employees. As explanatory variables we use the education level ($x_2$, in years) and the variable 'minority' ($x_3 = 1$ for minorities and $x_3 = 0$ otherwise). As there are no women with custodial jobs, we restrict the attention to the 258 male employees of the bank (a model for all 474 employees of the bank is left as an exercise (see Exercise 6.14)).

The multinomial logit model (6.20) for the $m = 3$ job categories has $k = 3$ explanatory variables (the constant term and $x_2$ and $x_3$). We take the first job category (administration) as reference category. The model contains in total six parameters: a $3 \times 1$ vector $\beta_2$ for job category 2 (custodial jobs) and a $3 \times 1$ vector $\beta_3$ for job category 3 (management). For an individual with characteristics $x_i$, the probabilities for the three job categories are then given by

$$p_{i1} = \frac{1}{1 + e^{x_i'\beta_2} + e^{x_i'\beta_3}}, \qquad p_{i2} = \frac{e^{x_i'\beta_2}}{1 + e^{x_i'\beta_2} + e^{x_i'\beta_3}}, \qquad p_{i3} = \frac{e^{x_i'\beta_3}}{1 + e^{x_i'\beta_2} + e^{x_i'\beta_3}}.$$
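These three probabilities are easy to evaluate in code. The sketch below uses coefficient values rounded from Panel 1 of Exhibit 6.5 further on; the helper name is ours, and the rounded values serve only to illustrate the formulas.

```python
import numpy as np

# rounded coefficients from Panel 1 of Exhibit 6.5 (constant, educ, minority)
b2 = np.array([4.76, -0.55, 0.43])    # custodial versus administration
b3 = np.array([-26.0, 1.63, -2.6])    # management versus administration

def job_probs(educ, minority):
    x = np.array([1.0, educ, minority])
    e2, e3 = np.exp(x @ b2), np.exp(x @ b3)
    denom = 1.0 + e2 + e3
    return np.array([1.0, e2, e3]) / denom    # (p_i1, p_i2, p_i3)

print(job_probs(educ=8, minority=0))    # low education: custodial most likely
print(job_probs(educ=18, minority=0))   # high education: management dominates
```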

(ii) Estimation results

The results of the multinomial logit model are in Panel 1 of Exhibit 6.5. The coefficient of education ($x_2$) is $\hat{\beta}_{32} = 1.63$ for management jobs and $\hat{\beta}_{22} = -0.55$ for custodial jobs. For administrative jobs the coefficient of education is by definition $\beta_{12} = 0$, as this is the reference category. The education effect is significant for both job categories, with a positive coefficient of 1.63 for management jobs and a negative coefficient of $-0.55$ for custodial jobs. The outcomes show that the minority effect is significant for management jobs, but not for custodial jobs. Note, however, that these coefficients do not have the interpretation of marginal effects, not even their signs. The marginal effects are analysed below in part (iii).

Panel 2 of Exhibit 6.5 contains the results of the model without the variables education and minority. The corresponding LR-test on the joint significance of education and minority has value $LR = 2(-118.7 + 231.3) = 225.2$ (see Exhibit 6.5, Panels 1 and 2). This test corresponds to four restrictions, and the 5 per cent critical value of the corresponding $\chi^2(4)$ distribution is 9.49, so that the two explanatory variables are clearly jointly significant.

(iii) Analysis of the marginal effects of education

In multinomial logit models the sign of the marginal effect of an explanatory variable is not always the same as the sign of the corresponding coefficient. We will now analyse the marginal effect of education on the probabilities to attain a job in the three job categories. For an individual with characteristics $x_i = (1, x_{2i}, x_{3i})'$, the estimated marginal effects of education are obtained from (6.23), with the following results:

$$\frac{\partial P_{MNL}[y_i = 1]}{\partial x_{2i}} = \hat{p}_{i1}\left(\hat{\beta}_{12} - \sum_{h=2}^{3} \hat{p}_{ih}\hat{\beta}_{h2}\right) = \hat{p}_{i1}\left(0.55\hat{p}_{i2} - 1.63\hat{p}_{i3}\right),$$

$$\frac{\partial P_{MNL}[y_i = 2]}{\partial x_{2i}} = \hat{p}_{i2}\left(\hat{\beta}_{22} - \sum_{h=2}^{3} \hat{p}_{ih}\hat{\beta}_{h2}\right) = \hat{p}_{i2}\left(-0.55(1 - \hat{p}_{i2}) - 1.63\hat{p}_{i3}\right) < 0,$$

$$\frac{\partial P_{MNL}[y_i = 3]}{\partial x_{2i}} = \hat{p}_{i3}\left(\hat{\beta}_{32} - \sum_{h=2}^{3} \hat{p}_{ih}\hat{\beta}_{h2}\right) = \hat{p}_{i3}\left(0.55\hat{p}_{i2} + 1.63(1 - \hat{p}_{i3})\right) > 0.$$

Here we used the fact that the probabilities $\hat{p}_{ij}$ satisfy $0 < \hat{p}_{ij} < 1$. So we conclude that additional education leads to a lower probability of getting a custodial job and a higher probability of getting a management job. The effect on the probability of attaining an administrative job is positive if and only if $0.55\hat{p}_{i2} - 1.63\hat{p}_{i3} > 0$ — that is, as long as the probability of a custodial job for this individual is at least $1.63/0.55 \approx 3$ times as large as the probability of a management job. The interpretation is as follows. If someone is most suited to a custodial job, then additional education may lead more quickly to a job in administration. On the other hand, if someone already has some chances of a management job, then additional education decreases the chance of an administrative job in favour of a management job.
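These sign patterns can be checked numerically by reusing job_probs and the rounded coefficients from the sketch in part (i); the education levels chosen below are only illustrative.

```python
def education_effects(educ, minority):
    # marginal effects of education, p_ij (beta_j2 - sum_h p_ih beta_h2),
    # with beta_12 = 0, beta_22 = -0.55, and beta_32 = 1.63
    p1, p2, p3 = job_probs(educ, minority)
    avg = -0.55 * p2 + 1.63 * p3
    return (p1 * (0.0 - avg),      # administration: either sign is possible
            p2 * (-0.55 - avg),    # custodial: always negative
            p3 * (1.63 - avg))     # management: always positive

print(education_effects(educ=8, minority=0))    # administration effect positive
print(education_effects(educ=16, minority=0))   # administration effect negative
```

The three effects always sum to zero, as the probabilities sum to one for every education level.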

(a) Panel 1: MULTINOMIAL LOGIT
Method: Maximum Likelihood (Marquardt); Sample: 1 474 IF (GENDER=1); Included observations: 258; Convergence achieved after 33 iterations

Cat    Variable   Beta     Coefficient   Std. Error   z-Statistic   Prob.
Cat 2  C          B2(1)    4.760409
Cat 2  EDUC       B2(2)    -0.553399     0.114211     -4.845405     0.0000
Cat 2  MINORITY   B2(3)    0.426952      0.488181     0.874578      0.3818
Cat 3  C          B3(1)    -26.01435
Cat 3  EDUC       B3(2)    1.633386      0.168697     9.682362      0.0000
Cat 3  MINORITY   B3(3)    -2.573738                                0.0002

Log likelihood -118.7360; Avg. log likelihood -0.460217; Number of Coefs. 6
Akaike info criterion 0.966946; Schwarz criterion 1.049573

(b) Panel 2: MULTINOMIAL LOGIT
Method: Maximum Likelihood (Marquardt); Sample: 1 474 IF (GENDER=1); Included observations: 258; Convergence achieved after 10 iterations

Cat    Variable   Beta     Coefficient   Std. Error
Cat 2  C          B2(1)    -1.760717     0.208342
Cat 3  C          B3(1)    -0.752181     0.141007

Log likelihood -231.3446; Avg. log likelihood -0.896684; Number of Coefs. 2
Akaike info criterion 1.808873; Schwarz criterion 1.836415

(c) Panel 3: MARGINAL EFFECTS OF EDUCATION ON PROBABILITIES

                  JOBCAT = 1   JOBCAT = 2   JOBCAT = 3
NON-MINORITIES    -0.127       -0.030       0.157
MINORITIES        0.012        -0.062       0.049

Exhibit 6.5 Bank Wages (Example 6.4) Multinomial logit model for the attained job category of male employees (Panel 1: category 1 (administration) is the reference category; category 2 (custodial jobs) has coefficients B2(1), B2(2), and B2(3); and category 3 (management) has coefficients B3(1), B3(2), and B3(3)); multinomial model without explanatory variables (except constant terms for each job category, Panel 2); and the marginal effects of education on the probability of attaining the three job categories (Panel 3: the reported numbers are averages over the two subsamples of non-minority males and minority males).
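The LR-test of part (ii) can be reproduced from the log-likelihoods reported in Panels 1 and 2; a minimal check, assuming SciPy for the chi-squared quantile:

```python
from scipy.stats import chi2

LR = 2 * (-118.7360 - (-231.3446))   # 2(logL_unrestricted - logL_restricted)
crit = chi2.ppf(0.95, df=4)          # 5 per cent critical value, 4 restrictions
print(round(LR, 1), round(crit, 2))  # 225.2 and 9.49, as in the text
```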

(iv) Average marginal effects of education

Panel 3 of Exhibit 6.5 shows the average marginal effect of education on the probabilities of having a job in each of the three categories. The estimated marginal effects are averaged over the relevant subsamples of minority males and non-minority males. With more education, the chance of getting a management job increases and that of getting a custodial job decreases. For management, the effects are much larger for non-minority males (around 16 per cent more chance for one additional year of education) than for minority males (around 5 per cent more chance).

(v) Predictive performance

Panel 4 of Exhibit 6.5 shows actual against predicted job categories, where an individual is predicted to have a job in the category with the highest estimated probability.

(d) Panel 4: PREDICTION-REALIZATION TABLE

                          actual        actual        actual     predicted
                        jobcat = 1    jobcat = 2    jobcat = 3     total
predicted jobcat = 1       138            14             7          159
predicted jobcat = 2        10            13             0           23
predicted jobcat = 3         9             0            67           76
actual total               157            27            74          258

random hit rate $(157/258)^2 + (27/258)^2 + (74/258)^2 = 0.464$
Z-value $= (218 - 119.6)/\sqrt{258 \times 0.464 \times 0.536} = 12.28$, $P = 0.0000$

(e) Log-odds (LOG-ODDS, vertical axis) of job category 3 against category 1 and against category 2, plotted against education (EDUC, 5 to 25 on the horizontal axis), for non-minority males.
(f) The same log-odds plotted against education for minority males.

Exhibit 6.5 (Contd.) Prediction-realization table of the predicted and actual job categories for the multinomial model of Panel 1 (Panel 4), and relation between the logarithm of the odds ratio (on the vertical axis) against education (on the horizontal axis) for non-minority males (e) and for minority males (f).
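The lines in panels (e) and (f) are linear functions of education. With the rounded coefficients b2 and b3 from the part (i) sketch they can be reproduced as follows (helper name ours):

```python
import numpy as np

def log_odds_management(educ, minority):
    # log-odds of category 3 against category 1 is x'beta_3 (beta_1 = 0);
    # against category 2 it is x'(beta_3 - beta_2); both are linear in educ
    x = np.array([1.0, educ, minority])
    return x @ b3, x @ (b3 - b2)       # b2, b3 as in the part (i) sketch

for educ in (5, 10, 15, 20, 25):       # the education range of panels (e)-(f)
    print(educ, log_odds_management(educ, minority=0))
```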

The hit rate is equal to $(138 + 13 + 67)/258 = 218/258 = 0.845$, whereas the expected hit rate of random predictions is equal to $(157/258)^2 + (27/258)^2 + (74/258)^2 = 0.464$. To test the classification success of the model, these hit rates can be compared by

$$z = \frac{0.845 - 0.464}{\sqrt{0.464(1 - 0.464)/258}} = 12.28, \qquad P = 0.0000.$$

This shows that the model indeed provides significantly better predictions than would be obtained by random predictions. The predictions are quite successful for jobs in administration and management, but somewhat less so for custodial jobs, as for around half the people with custodial jobs it is predicted that they will work in administration. If the estimated probabilities of having a custodial job are added over all $n = 258$ individuals, then the predicted total number is equal to 27, but for fourteen individuals in job category 2 it is predicted to be more likely that they belong to job category 1.

(vi) Odds ratios

Exhibit 6.5 gives the log-odds (as a function of education) of job category 3 against job categories 1 and 2, for non-minority male employees (e) and for male employees belonging to minorities (f). Recall that in the logit model the log-odds is a linear function of the explanatory variables. The odds ratios become very large for high levels of education. This corresponds to relatively large probabilities for a management job, as could be expected. The odds ratios are higher for non-minority males, and the odds ratios are larger with respect to category 2 than with respect to category 1.

Exercises: T: 6.2d, 6.3; E: 6.13e, 6.14, 6.15c.

6.2.3 Ordered response

Model formulation

In some situations the alternatives can be ordered — for instance, if the dependent variable measures opinions (degree of agreement or disagreement with a statement) or rankings (quality of products). Such a variable is called ordinal — that is, the outcomes are ordered, although their numerical values have no further meaning. We follow the convention of labelling the $m$ ordered alternatives by integers ranging from $1$ to $m$. In the ordered response model, the outcome $y_i$ is related to the index function

the ordered response model has $k + m - 2$ parameters
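This parameter count can be made concrete with a small sketch of ordered logit probabilities. We assume here the standard specification (index $y_i^* = x_i'\beta + \varepsilon_i$ with logistic $\varepsilon_i$, increasing thresholds $\tau_1 < \cdots < \tau_{m-1}$, and the intercept absorbed into the thresholds), so that $k - 1$ slopes plus $m - 1$ thresholds give the $k + m - 2$ free parameters; all names and numbers below are illustrative assumptions.

```python
import numpy as np

def ordered_logit_probs(x, beta, tau):
    # x: (k-1,) regressors without constant; beta: (k-1,) slopes;
    # tau: (m-1,) increasing thresholds, so (k-1) + (m-1) = k + m - 2
    # parameters in total; F is the logistic cdf
    F = lambda v: 1.0 / (1.0 + np.exp(-v))
    cut = np.concatenate(([-np.inf], tau, [np.inf]))
    idx = x @ beta
    return np.array([F(c2 - idx) - F(c1 - idx)
                     for c1, c2 in zip(cut[:-1], cut[1:])])

p = ordered_logit_probs(np.array([12.0]), np.array([0.3]), np.array([2.0, 5.0]))
print(p, p.sum())   # m = 3 ordered probabilities summing to one
```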