Time series book

© All Rights Reserved


Bo Sjö

Linköping, Sweden

email:bo.sjo@liu.se

October 30, 2011

CONTENTS

1 Introduction
    Why Econometrics?
    Junk Science and Junk Econometrics

2.1 Programs

Basic Statistics

3.1 Statistical Models
3.2 Random Variables
3.3 Moments of random variables
3.4 Popular Distributions in Econometrics
    Multidimensional Random Variables
    Marginal and Conditional Densities
    The Linear Regression Model: A General Description

4.1 MLE for a Univariate Process
4.2 MLE for a Linear Combination of Variables

II

6.1 Different types of processes
6.2 White Noise
6.3 The Log Normal Distribution
6.4 The ARIMA Model
6.5 The Random Walk Model
6.6 Martingale Processes
6.7 Markov Processes
6.8 Brownian Motions
6.9 Brownian motions and the sum of white noise
    A more formal definition

7.1 Descriptive Tools for Time Series
    7.1.1 Weak and Strong Stationarity
    7.1.2 Weak Stationarity, Covariance Stationary and Ergodic Processes
    7.1.3 Strong Stationarity
    7.1.4 Finding the Optimal Lag Length and Information Criteria
    7.1.6 Generating Functions
    7.1.7 The Difference Operator
    7.1.8 Filters
    7.1.9 Dynamics and Stability
    7.1.10 Fractional Integration
    7.1.11 Building an ARIMA Model. The Box-Jenkins Approach
    7.1.12 Is the ARMA model identified?
7.2 Theoretical Properties of Time Series Models
    7.2.1 The Principle of Duality
7.3 Additional Topics
    7.3.1 Seasonality
    7.3.2 Non-stationarity
7.4 Aggregation
7.5 Overview of Single Equation Dynamic Models

    9.0.1 How to estimate a VAR?
    9.0.2 Impulse responses in a VAR with non-stationary variables and cointegration
9.1 BVAR, TVAR etc.

III

10.1 Exogeneity
    10.1.1 Weak Exogeneity
    10.1.2 Strong Exogeneity
    10.1.3 Super Exogeneity
10.2 Multicollinearity and understanding of multiple regression

    11.0.1 The DF-test
    11.0.2 The ADF-test
    11.0.3 The Phillips-Perron test
    11.0.4 The LMSP-test
    11.0.5 The KPSS-test
    11.0.6 The G(p, q) test
11.1 The Alternative Hypothesis in I(1) Tests
11.2 Fractional Integration

    12.0.1 The Spurious Regression Problem
    12.0.2 Integrated Variables and Co-integration
    12.0.3 Approaches to Testing for Co-integration

13 Integrated Variables and Common Trends

15 The
15.1 Deterministic Explanatory Variables
15.2 The Deterministic Trend Model
15.3 Stochastic Explanatory Variables
15.4 Lagged Dependent Variables
15.5 Lagged Dependent Variables and Autocorrelation
15.6 The Problems of Dependence and the Initial Observation
15.7 Estimation with Integrated Variables

16 Encompassing

17 ARCH Models
    17.0.1 Practical Modelling Tips
17.1 Some ARCH Theory
17.2 Some Different Types of ARCH and GARCH Models
17.3 The Estimation of ARCH models

18 Econometrics and Rational Expectations
    18.0.1 Rational vs. other Types of Expectations
    18.0.2 Typical Errors in the Modeling of Expectations
    18.0.3 Modeling Rational Expectations
    18.0.4 Testing Rational Expectations

19 A Research Strategy

20 References
20.1 APPENDIX 1
20.2 Appendix III Operators
    20.2.1 The Expectations Operator
    20.2.2 The Variance Operator
    20.2.3 The Covariance Operator
    20.2.4 The Sum Operator
    20.2.5 The Plim Operator
    20.2.6 The Lag and the Difference Operators

Abstract


1. INTRODUCTION

"1984".

Please respect that this is a work in progress. It has never been my intention to write a commercial book, or a perfect textbook in time series econometrics. It is simply a collection of lectures in a popular form that can serve as a complement to the ordinary textbooks and articles used in education. The parts dealing with tests for unit roots (order of integration) and cointegration are not well developed. These topics have a memo of their own, "A Guide to testing for unit roots and cointegration".

When I started to put these lecture notes together some years ago I decided on the title "Lectures in Modern Time Series Econometrics" because I thought that the contents were a bit "modern" compared to the standard econometric textbook. During the fall of 2010, as I started to update the notes, I thought that it was time to remove the word "modern" from the title. A quick look in Damodar Gujarati's textbook "Basic Econometrics" from 2009 convinced me to keep the word "modern" in the title. Gujarati's text on time series hasn't changed since the 1970s, even though time series econometrics has changed completely since then. Under these circumstances I see no reason to change the title, at least not yet.

There are four ways in which one can do time series econometrics. The first is to use the approach of the 1970s: view your time series model just like any linear regression, and impose a number of ad hoc restrictions that will hide all the problems you find. This is not a good approach. It is only found in old textbooks and never in today's research; you might only see it used in journals of very low scientific standing. Second, you can use theory to derive a time series model, and interesting parameters, that you then estimate with appropriate estimators. Examples of this are to derive utility functions, to assume that agents have rational expectations, etc. This is a proper research strategy. It typically takes good data, and you need to be original in your approach, but you can get published in good journals. The third approach is simply to do a statistical description of the data series, in the form of a vector autoregressive system, or the reduced form of the vector error correction model. This system can be used for forecasting, for analysing relationships among data series, and for investigating the effects of unforeseen shocks such as drastic changes in energy prices, money supply, etc. The fourth way is to go beyond the vector autoregressive system and try to estimate structural parameters in the form of elasticities and policy intervention parameters. If you forget about the first method, the choice depends on the problem at hand and how you choose to formulate it. This book aims at telling you how to use methods three and four.

The basic thinking is that your data is the real world, while theories are abstractions that we use to understand the real world. In applied econometric time series you should always strive to build well-defined statistical models, that is, models that are consistent with the data chosen. There is a complex statistical theory behind all this, which I will try to popularize in this book. I do not see this book as a substitute for an ordinary textbook. It is simply a complement.

This book is intended for people who have done a basic course in statistics and econometrics, either at the undergraduate or at the graduate level. If you did an undergraduate course I assume that you did it well. Econometrics is a type of course where every lecture, and every textbook chapter, leads to the next level. The best way to learn econometrics is to be active: read several books and work on your own with econometric software. No teacher can teach you how to run a piece of software. That is something you have to learn on your own by practicing. There is some very good software out there. The outline differs between the graduate and Ph.D. level mainly in the theoretical parts. At the Ph.D. level, there is more stress on theoretical backgrounds.

1) I will begin by talking about why econometrics is different from statistics, and why econometric time series is different from the econometrics you meet in many basic textbooks.
2) I will repeat very briefly basic statistics and linear regression, and stress what you should know in terms of testing and modeling dynamic models. For most students that will imply going back and doing some quick repetition.
3) An introduction to statistical theory, including maximum likelihood, random variables, density functions and stochastic processes.
4) Basic time series properties and processes.
5) Using and understanding ARFIMA and VAR modelling techniques.
6) Testing for non-stationarity in the form of stochastic trends, i.e. tests for unit roots.
7) The spurious regression problem.
8) Testing and understanding cointegration.
9) Testing for Granger non-causality.
10) The theory of reduction, exogeneity, and building dynamic models and systems.
11) Modelling time-varying variances: ARCH and GARCH models.
12) The implications and consequences of rational expectations for econometric modelling.
13) Non-linearities.
14) Additional topics.

For most of these topics I have developed more or less self-instructing exercises.

Why is there a subject called econometrics? Why study econometrics instead of statistics? Why not let the statisticians teach statistics, and in particular time series techniques? These are common questions, raised during seminars and in private, by students, statisticians and economists. The answer is that each scientific area tends to create its own special methodological problems, often heavily interrelated with theoretical issues. These problems, and the ways of solving them, are important in a particular area of science but not necessarily in others. Economics is a typical example, where the formulation of the economic and the statistical problem is deeply interrelated from the beginning.

In everyday life we are forced to make decisions based on limited information. Most of our decisions deal with an uncertain, stochastic future. We all base our decisions on some view of the economy where we assume that certain events are linked to each other in more or less complex ways. Economists call this a model of the economy. We can describe the economy and the behavior of individuals in terms of multivariate stochastic processes. Decisions based on stochastic sequences play a central role in economics and in finance. Stochastic processes are the basis for our understanding of the behavior of economic agents and of how their behavior determines the future path of the economy. Most econometric textbooks deal with stochastic time series as a special application of the linear regression technique. Though this approach is acceptable for an introductory course in econometrics, it is unsatisfactory for students with a deeper interest in economics and finance. To understand the empirical and theoretical work in these areas, it is necessary to understand some of the basic philosophy behind stochastic time series.

This work is a work in progress. It is based on my lectures on Modern Economic Time Series Analysis at the Department of Economics, first at the University of Gothenburg and later at the University of Skövde and Linköping University in Sweden. The material is not ready for widespread distribution. This work most likely contains lots of errors; some are known to the author, and some are not yet detected. The different sections do not necessarily follow in a logical order. Therefore, I invite anyone who has opinions about this work to share them with me.

The first part of this work provides a repetition of some basic statistical concepts, which are necessary for understanding modern economic time series analysis. The motive for repeating these concepts is that they play a larger role in econometrics than many contemporary textbooks in econometrics indicate. Econometrics did not change much from the first edition of Johnston in the 60s until the revised version of Kmenta in the mid 80s. However, the critique against the use of econometrics delivered by Sims, Lucas, Leamer, Hendry and others, in combination with new insights into the behavior of non-stationary time series and the rapid development of computer technology, has revolutionized econometric modeling and resulted in an explosion of knowledge. The demands on writing a decent thesis, or a scientific paper, based on econometric methods have risen far beyond what one can learn in an introductory course in econometrics.

JUNK SCIENCE AND JUNK ECONOMETRICS

In the media you often hear about this and that being proved by scientific research. In the late 1990s newspapers reported that someone had proved that genetically modified (GM) food could be dangerous. The news spread quickly, and according to the story the original article had been stopped from being published by scientists with suspicious motives. Various lobby groups immediately jumped in: GM food was dangerous, it should be banned, and more money should go into this line of research. What had happened was the following. A researcher claimed to have shown that GM food was bad for health. He presented these results to a number of media people, who distributed them. (Remember the fuss about cold fusion.) The results were presented in a paper sent to a scientific journal for publication. The journal, however, did not publish the article. It was dismissed because the results were not based on a sound scientific method. The researcher had fed rats with potatoes. One group of rats got GM potatoes, the other group got normal non-GM potatoes. The rats that got GM potatoes seemed to develop cancer more often than the control group. The statistical difference between the groups was not big, but sufficiently big for those wanting to confirm their a priori belief that GM food is bad. A somewhat embarrassing detail, never reported in the media, is that rats in general do not like potatoes. As a consequence both groups of rats in this study were suffering from starvation, which severely affected the test. It was not possible to determine whether the difference between the two groups was caused by starvation or by GM food. Once the researcher conditioned on the effects of starvation, the difference became insignificant. This is an example of junk science: bad science getting a lot of media exposure because the results fit the interests of lobby groups and can be used to scare people.

The lesson for econometricians is obvious: if you come up with good results you get rewarded, while bad results can quickly be forgotten. The GM food example is extreme by the standards of econometric work. Econometric research seldom gets such media coverage, though there are examples, such as the claim that Sweden's economic growth is lower than that of other similar countries, or the assumed dynamic effects of a reduction of marginal taxes. There are significant results that depend on one single outlier. Once the outlier is removed, the significance is gone, and the whole story behind this particular book is also gone.
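The point about a single outlier can be illustrated with a small simulation. This is only a sketch under stated assumptions: the data are artificial, the helper `slope_and_t` is a hypothetical name introduced here, and the outlier at (10, 10) is chosen for illustration. Two unrelated series are regressed on each other, first as they are and then with one extreme observation added.

```python
import numpy as np

def slope_and_t(x, y):
    """OLS slope and its t-statistic in the regression y = a + b*x + e."""
    X = np.column_stack([np.ones(len(x)), x])
    beta = np.linalg.solve(X.T @ X, X.T @ y)
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - 2)            # residual variance
    se_b = np.sqrt(sigma2 * np.linalg.inv(X.T @ X)[1, 1])
    return beta[1], beta[1] / se_b

rng = np.random.default_rng(0)
x = rng.normal(size=30)
y = rng.normal(size=30)        # unrelated to x by construction

# Add one extreme observation (an assumed, artificial outlier).
x_out = np.append(x, 10.0)
y_out = np.append(y, 10.0)

b_clean, t_clean = slope_and_t(x, y)
b_out, t_out = slope_and_t(x_out, y_out)

print(f"without outlier: slope={b_clean:.2f}, t={t_clean:.2f}")
print(f"with outlier:    slope={b_out:.2f}, t={t_out:.2f}")
```

With the outlier included the slope looks strongly "significant", although the two series were generated independently. Removing that one observation removes the result.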

In these lectures we will argue that the only way to avoid junk econometrics is careful and systematic construction and testing of models. Basically, this is the modern econometric time series approach. Why is this modern, and why stress the idea of testing? The answers are simply that careers have been built on running junk econometric equations, and that most people are unfamiliar with scientific methods in general and with the consequences of living in a world surrounded by random variables in particular.

INTRODUCTION TO ECONOMETRIC TIME SERIES

"Time is a great teacher, but unfortunately it kills all its pupils" Louis Hector Berlioz

A time series is simply data ordered by time. For an econometrician, a time series is usually data that is also generated over time in such a way that time can be seen as a driving factor behind the data. Time series analysis is simply the set of approaches that look for regularities in data ordered by time.

In comparison with other academic fields, the modeling of economic time series is characterized by the following problems, which partly motivate why econometrics is a subject of its own:

- The empirical sample sizes in economics are generally small, especially compared with many applications in physics or biology. Typical sample sizes range between 25 and 100 observations. In many areas anything below 500 observations is considered a small sample.

- Economic time series are dependent in the sense that they are correlated with other economic time series. In economics, problems are almost never concerned with univariate series. Consumption, as an example, is a function of income, and at the same time consumption also affects income, directly and through various other variables.

- Economic time series are often dependent over time. Many series display high autocorrelation, as well as cross-autocorrelation with other variables over time.

- Economic time series are generally non-stationary. Their means and variances change over time, implying that estimated parameters might follow unknown distributions instead of standard tabulated distributions like the normal distribution. Non-stationarity arises from productivity growth and price inflation. Non-stationary economic series appear to be integrated, driven by stochastic trends, perhaps as a result of stochastic changes in total factor productivity. Integrated variables, and in particular the need to model them, are not that common outside economics. In some situations, therefore, inference in econometrics becomes quite complicated and requires the development of new statistical techniques for handling stochastic trends. The concepts of cointegration and common trends, and the recently developed asymptotic theory for integrated variables, are examples of this.

- Economic time series cannot be assumed to be drawn from samples in the way assumed in classical statistics. The classical approach is to start from a population from which a sample is drawn. Since the sampling process can be controlled, the variables which make up the sample can be seen as random variables. Hypotheses are then formulated and tested conditionally on the assumption that the random variables have a specific distribution. Economic time series are seldom random variables drawn from some underlying population in the classical statistical sense. Observations do not represent a random sample in the classical statistical sense, because the econometrician cannot control the sampling process of the variables. Variables like GDP, money, prices and dividends are given from history. To get a different sample we would have to re-run history, which of course is impossible. The way statistical theory deals with this situation is to reverse the approach taken in classical statistical analysis, and build a model that describes the behavior of the observed data. A model which achieves this is called a well-defined statistical model. It can be understood as a parsimonious, time-invariant model with white noise residuals that makes sense from economic theory.

- Finally, from the view of economics, the subject of statistics deals mainly with the estimation and inference of covariances only. The econometrician, however, must also give estimated parameters an economic interpretation. This problem cannot always be solved ex post, after a model has been estimated. When it comes to time series, economic theory is an integrated part of the modeling process. Given a well-defined statistical model, estimated parameters should represent the behavior of economic agents. Many econometric studies fail because researchers assume that their estimates can be given an economic interpretation without considering the statistical properties of the model, or the simple fact that there is in general not a one-to-one correspondence between observed variables and the concepts defined in economic theory.1
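The non-stationarity item in the list above can be made concrete with a simulated random walk, the simplest integrated series. The sketch below is an illustration under assumptions introduced here, not taken from the notes: standard normal shocks, a start value of zero, and 2000 artificial paths so that the variance at each date can be measured across paths. It shows that the variance of the level grows with time, while the variance of the first difference does not.

```python
import numpy as np

rng = np.random.default_rng(42)
n_paths, T = 2000, 200

eps = rng.normal(size=(n_paths, T))   # white noise shocks
y = eps.cumsum(axis=1)                # random walk: y_t = y_{t-1} + eps_t, y_0 = 0
dy = np.diff(y, axis=1)               # first difference: y_t - y_{t-1}

# Variance across paths at an early and a late date.
var_early, var_late = y[:, 9].var(), y[:, -1].var()       # t = 10 and t = 200
dvar_early, dvar_late = dy[:, 9].var(), dy[:, -1].var()

print(f"level:      var at t=10 is {var_early:.1f}, var at t=200 is {var_late:.1f}")
print(f"difference: var early is {dvar_early:.2f}, var late is {dvar_late:.2f}")
```

The level has no constant variance, so it is non-stationary. The differenced series behaves like the white noise that generated it.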

2.1 Programs

Here is a list of statistical software that you should be familiar with; please google them (those recommended for time series are marked with *):

- *RATS and CATS in RATS, Regression Analysis of Time Series and Cointegrating Analysis of Time Series (www.estima.com)
- *PcGive - Comes highly recommended. Included in the OxMetrics modules; see also Timberlake Consultants for more programs.
- *Gretl (Free, GNU license, very good for students in econometrics)
- *JMulti (Free, for multivariate time series analysis, updated? The discussion forum is quite dead, www.jmulti.com)
- *EViews
- Gauss (good for simulation)
- STATA (used by the World Bank, good for microeconometrics, panel data, OK on time series)
- LIMDEP (Mostly free with some editions of Greene's econometrics textbook? You need to pay for duration models?)
- SAS - Statistical Analysis System (good for big data sets, but not time series, mainly medicine, "the calculus program for decision makers")
- Shazam

And more; some are very special programs for this and that, but I don't find them worth mentioning in this context.

1 For a recent discussion about the controversies in econometrics see The Economic Journal 1996.

There is a bunch of software that allows you to program your own models or use other people's modules:

- Matlab

- R (Free, GNU license, connects with Gretl)

- Ox

You should also know about C, C++, and LaTeX to be a good econometrician.

Please google.

For Data Envelopment Analysis (DEA) I recommend Tom Coelli's DEAP 2.1 or Paul W. Wilson's FEAR.

DIFFERENT TYPES OF TIME SERIES

Given the general definition of time series above, there are many types of time series. The focus in econometrics, macroeconomics and finance is on stochastic time series, typically in the time domain, which are non-stationary in levels but become what is called covariance stationary after differencing.

In a broad perspective, time series analysis typically aims at making time series more understandable by decomposing them into different parts. The aim of this introduction is to give a general overview of the subject. A time series is any sequence ordered by time. The sequence can be either deterministic or stochastic. The primary interest in economics is in stochastic time series, where the sequence of observations is made up by the outcome of random variables. A sequence of stochastic variables ordered by time is called a stochastic time series process.

The random variables that make up the process can either be discrete random variables, taking on a given set of integer numbers, or be continuous random variables taking on any real number between $-\infty$ and $+\infty$. While discrete random variables are possible they are not that common in economic time series research.

Another dimension in modeling time series is to consider processes in discrete time or in continuous time. The principal difference is that stochastic variables in continuous time can take different values at any time. In a discrete time process, the variables are observed at fixed intervals of time ($t$), and they do not change between these observation points. Discrete time variables are not common in finance and economics. There are few, if any, variables that remain fixed between their points of observation. The distinction between continuous time and discrete time is not a matter of measurability alone. A common mistake is to be confused by the fact that economic variables are measured at discrete time intervals. The money stock is generally measured and recorded as an end-of-month value. The way of measuring the stock of money does not imply that it remains unchanged between observations; instead it changes whenever the money market is open. The same holds for variables like production and consumption. These activities take place 24 hours a day, during the whole year. They are measured as the flow of income and consumption over a period, typically a quarter, representing the integral sum of these activities.

Usually, a discrete time variable is written with a time subscript ($x_t$), while continuous time variables are written as $x(t)$. The continuous time approach has a number of benefits, but the cost and quality of the empirical results seldom motivate the continuous time approach. It is better to use discrete time approaches; the cost of doing this simplification is small compared with the complexity of continuous time analysis. This should not be understood as a rejection of all continuous time approaches. Continuous time is good for analyzing a number of well-defined problems like aggregation over time and individuals. In the end it should lead to a better understanding of adjustment speeds, stability conditions and interactions among economic time series, see Sjö (1990, 1995).2
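The difference between how a variable evolves and how it is recorded can be sketched in a few lines. Everything in the example is an assumption made for illustration: a stylized calendar of 12 months with 30 days each, a daily random walk standing in for a stock such as the money stock, and daily uniform draws standing in for a flow such as consumption.

```python
import numpy as np

rng = np.random.default_rng(1)
days, months = 30, 12                  # stylized calendar (an assumption)

# Stock variable: a level that moves every day.
stock_daily = 100.0 + rng.normal(size=days * months).cumsum()
# Flow variable: activity that takes place every day.
flow_daily = rng.uniform(0.5, 1.5, size=days * months)

# What ends up in the recorded data:
by_month = stock_daily.reshape(months, days)
stock_monthly = by_month[:, -1]                               # end-of-month value
flow_monthly = flow_daily.reshape(months, days).sum(axis=1)   # total over the month

# The stock is far from constant between two recorded values.
within_month_range = by_month.max(axis=1) - by_month.min(axis=1)

print("recorded stock:", np.round(stock_monthly[:3], 1))
print("recorded flow: ", np.round(flow_monthly[:3], 1))
print("movement of the stock inside each month:", np.round(within_month_range[:3], 1))
```

The recorded monthly series has 12 observations, but the underlying variables changed 360 times. The stock is sampled at a point in time, while the flow is the sum of the activity over the period.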

In addition, stochastic time series can be analysed in the time domain or in the frequency domain. In the time domain the data is analysed ordered in given time periods such as days, weeks, years, etc. The frequency approach decomposes time series into frequencies by using trigonometric functions like sines and cosines. Spectral analysis is an example of analysis that uses the frequency domain to identify regularities such as seasonal factors, trends, and systematic lags in adjustment. The main advantage of analysing time series in the frequency domain is that it is relatively easy to handle continuous time processes and observations observed as aggregations over time, such as consumption.

However, in economics and finance we are typically faced with given observations at given frequencies, and we seek to study the behavior of agents operating in real time. Under these circumstances, the time domain is the most interesting road ahead because it has a direct intuitive appeal to both economists and policy makers.
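As a small illustration of the frequency domain, the sketch below computes a raw periodogram with the fast Fourier transform and reads off the dominant cycle. The series is artificial and every number in it is an assumption: 240 monthly observations, a sine wave with a period of 12 months, and normal noise with standard deviation 0.5. A raw periodogram is only one simple spectral tool, and this should not be read as the method used later in these notes.

```python
import numpy as np

rng = np.random.default_rng(7)
T = 240                                    # e.g. 20 years of monthly data
t = np.arange(T)

# Artificial series: a seasonal cycle of period 12 plus noise.
y = np.sin(2 * np.pi * t / 12) + 0.5 * rng.normal(size=T)
y = y - y.mean()

power = np.abs(np.fft.rfft(y)) ** 2 / T    # raw periodogram
freqs = np.fft.rfftfreq(T, d=1.0)          # frequencies in cycles per observation

# Skip frequency zero and locate the frequency with the most power.
peak_freq = freqs[1:][np.argmax(power[1:])]
period = 1.0 / peak_freq

print(f"dominant frequency: {peak_freq:.4f} cycles per month")
print(f"implied period:     {period:.1f} months")
```

In the time domain the same regularity would show up as a high autocorrelation at lag 12. The two views carry the same information, presented differently.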


Our interest is usually in analysing discrete time stochastic processes in the

time domain.

A time series process is generally indicated with brackets, like $\{y_t\}$. In some situations it is necessary to be more precise about the length of the process. Writing $\{y\}_{1}^{\infty}$ indicates that the process starts at period one and continues infinitely. The process consists of random variables because we can view each element in $\{y_t\}$ as a random variable. Let the process go from the integer values 1 up to $T$. If necessary, to be exact, the first variable in the process can be written as $y_{t_1}$, the second variable as $y_{t_2}$, etc., up until $y_{t_T}$. The distribution function of the process can then be written as $F(y_{t_1}, y_{t_2}, \ldots, y_{t_T})$.

2 We can also mention the different types of series that are used: stocks, flows and price variables. Stocks are variables that can be observed at a point in time, like the money stock or inventories. Flows are variables that can only be observed over some period, like consumption or GDP. In this context price variables include prices, interest rates and similar variables which can be observed at a market at a given point in time. Combining these variables into a multivariate process and constructing econometric models from observed variables in discrete time produces further problems, and in general they are quite difficult to solve without using continuous time methods. Usually, careful discrete time models will reduce the problems to a large extent.


In some situations it is necessary to start from the very beginning. A time series is data ordered by time. A stochastic time series is a set of random variables ordered by time. Let $\tilde{Y}_{it}$ represent the stochastic variable $\tilde{Y}_i$ given at time $t$. Observations on this random variable are often indicated as $y_{it}$. A series starting at time $t = 1$ and ending at time $t = T$, consisting of $T$ different random variables, is written as $\{\tilde{Y}_{1,1}, \tilde{Y}_{2,2}, \ldots, \tilde{Y}_{T,T}\}$. Of course, assuming that the series is built up by individual random variables, each with its own independent probability distribution, is a complex thought. But nothing in our definition of stochastic time series rules out that the data are made up of completely different random variables. Sometimes, to understand and find solutions to practical problems, it will be necessary to go all the way back to the most basic assumptions.

Suppose we are given a time series consisting of yearly observations of interest rates, $\{6.6, 7.5, 5.9, 5.4, 5.5, 4.5, 4.3, 4.8\}$. The first question to ask is whether this is a stochastic series in the sense that these numbers were generated by one stochastic process, or perhaps by several different stochastic processes. Further questions are whether the process or processes are best represented as continuous or discrete, and whether the observations are independent or dependent. Quite often we will assume that the series is generated by one and the same stochastic process in discrete time. Based on these assumptions the modelling process tries to find systematic historical patterns and cross-correlations with other variables in the data.

All time series methods aim at decomposing the series into separate parts in some way. The standard approach in time series analysis is to decompose as

$$y_t = T_{t,d} + S_{t,d} + C_{t,d} + I_t,$$

where $T_{t,d}$ and $S_{t,d}$ represent (deterministic) trend and seasonal components, $C_{t,d}$ is a deterministic cyclical component and $I_t$ is a process representing irregular factors (see footnote 3).

For time series econometrics this definition is limited, since the econometrician is highly interested in the irregular component. As an alternative, let $\{y_t\}$ be a stochastic time series process, which is composed as,

$$y_t = T_d + T_s + S_d + S_s + \{y_t\} + e_t, \tag{2.1}$$

where the components are a deterministic trend $T_d$, a stochastic trend $T_s$, deterministic seasonals $S_d$, stochastic seasonals $S_s$, a stationary process (or the short-run dynamics) $\{y_t\}$, and finally a white noise innovation term $e_t$. The modeling problem can be described as the problem of identifying the systematic components such that the residual becomes a white noise process. For all series, remember that any inference is potentially wrong if not all components have been modeled correctly. This is so regardless of whether we model a simple univariate series with time series techniques, a reduced system, or a structural model. Inference is only valid for a correctly specified model.

Econometrics

1. To be completed...

3 For some series the components are multiplicative, $x_t = T_{t,d} \, S_{t,d} \, C_{t,d} \, I_t$.


In your first course in statistics you learned how to use descriptive statistics: the mean and the variance. Next you learned to calculate the mean and the variance from a sample that represents the whole underlying population. For the mean and the variance to work as a description of the underlying population it is necessary to construct the sample in such a way that the difference between the sample mean and the true population mean is non-systematic, meaning that the difference between the sample mean and the population mean is unpredictable. This means that your estimated sample mean is a random variable with known characteristics.

The most important thing is to construct a sampling mechanism so that the mean calculated from the sample has the characteristics you want it to have. That is, the estimated mean should be unbiased, efficient and consistent. You learn about random variables, probabilities, distribution functions and frequency distributions.

Your rst course in econometrics

"A theory should be as simple as possible, but not simpler" Albert Einstein

To be completed...

Random variables, OLS, minimize the sum of squares, assumptions 1 - 5(6),

understanding, multiple regression, multicollinearity, properties of OLS estimator

Matrix algebra

Tests and solutions for heteroscedasticity (cross-section), and autocorrelation

(time series).

If you read a good course you should have learned the three golden rules: test

test test, and learned about the probabilities of the OLS estimator.

Generalized least squares GLS

System estimation: demand and supply models.

Further extensions:

Panel data, Tobit, Heckit, discrete choice, probit/logit, duration

Time series: distributed lag models, partial adjustment models, error correction models, lag structure, stationarity vs. non-stationarity, co-integration

What you need to know ...

What you probably do not know but should know.

OLS

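Ordinary least squares is a common estimation method. Suppose there are two series $\{y_t, x_t\}$, related as

$$y_t = \alpha + \beta x_t + \varepsilon_t.$$

Minimize the sum of squares over the sample $t = 1, 2, \ldots, T$,

$$S = \sum_{t=1}^{T} \varepsilon_t^2 = \sum_{t=1}^{T} (y_t - \alpha - \beta x_t)^2.$$

Take the derivative of $S$ with respect to $\alpha$ and $\beta$, set the expressions to zero, and solve for $\alpha$ and $\beta$:

$$\frac{\partial S}{\partial \alpha} = -2 \sum_{t=1}^{T} (y_t - \alpha - \beta x_t) = 0,$$

$$\frac{\partial S}{\partial \beta} = -2 \sum_{t=1}^{T} x_t (y_t - \alpha - \beta x_t) = 0,$$

$$\hat{\beta} = \frac{\sum_{t=1}^{T} (x_t - \bar{x})(y_t - \bar{y})}{\sum_{t=1}^{T} (x_t - \bar{x})^2}, \qquad \hat{\alpha} = \bar{y} - \hat{\beta} \bar{x}.$$

The total sum of squares (TSS) splits into an explained part (ESS) and a residual part (RSS):

$$TSS = ESS + RSS,$$

$$1 = \frac{ESS}{TSS} + \frac{RSS}{TSS},$$

$$R^2 = 1 - \frac{RSS}{TSS} = \frac{ESS}{TSS}.$$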
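As a minimal sketch of the least squares steps above, the following Python code solves the two first-order conditions for a single regressor and checks the decomposition TSS = ESS + RSS. The data and the function names `ols_simple` and `sums_of_squares` are hypothetical and chosen only for illustration.

```python
# Minimal OLS sketch for y_t = alpha + beta * x_t + e_t.
# The data below are made up for illustration only.

def ols_simple(x, y):
    """Return (alpha_hat, beta_hat) solving the two first-order conditions."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    beta_hat = s_xy / s_xx
    alpha_hat = y_bar - beta_hat * x_bar
    return alpha_hat, beta_hat


def sums_of_squares(x, y, alpha_hat, beta_hat):
    """Return (TSS, ESS, RSS) for the fitted line."""
    y_bar = sum(y) / len(y)
    fitted = [alpha_hat + beta_hat * xi for xi in x]
    tss = sum((yi - y_bar) ** 2 for yi in y)
    ess = sum((fi - y_bar) ** 2 for fi in fitted)
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    return tss, ess, rss


x = [1.0, 2.0, 3.0, 4.0, 5.0]    # hypothetical regressor
y = [2.1, 3.9, 6.2, 7.8, 10.1]   # hypothetical dependent variable

alpha_hat, beta_hat = ols_simple(x, y)
tss, ess, rss = sums_of_squares(x, y, alpha_hat, beta_hat)
r_squared = 1.0 - rss / tss
print(alpha_hat, beta_hat, r_squared)
```

The decomposition only holds exactly when the regression includes a constant, which is the case here.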

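Basic assumptions

1) $E(\varepsilon_t) = 0$ for all $t$

2) $E(\varepsilon_t^2) = \sigma^2$ for all $t$

3) $E(\varepsilon_t \varepsilon_{t-k}) = 0$ for all $k \neq 0$

4) $E(X_t \varepsilon_t) = 0$

5) $E(X'X) \neq 0$

6) $\varepsilon_t \sim NID(0, \sigma^2)$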

Discuss these properties

Properties

Gauss-Markov, BLUE

Deviations

Misspecification, add extra variable, forget relevant variable

Multicollinearity

Error in variables problem

Homoscedasticity Heteroscedasticity

Autocorrelation


Part I

Basic Statistics


OVERVIEW

Economists are generally interested in a small part of what is normally included in the subject Time Series Analysis. Various techniques such as filtering, smoothing and interpolation, developed for deterministic time series, are of relatively minor interest for economists. Time series econometrics is more focused on the stochastic part of time series. The following is a brief overview of time series modeling from an econometric perspective. It is not a textbook in mathematical statistics, nor is the ambition to be extremely rigorous in the presentation of statistical concepts. The aim is rather to be a guide for the not yet so informed economist who wants to know more about the statistical concepts behind time series econometrics.

When approaching time series econometrics the statistical vocabulary quickly increases and can become overwhelming. These first two chapters seek to make it possible for people without deeper knowledge of mathematical statistics to read and follow the econometric and financial time series literature.

A time series is simply a set of observations ordered by time. Time series techniques seek to decompose this ordered series into different components, which in turn can be used to generate forecasts, to learn about the dynamics of the series, and to learn how it relates to other series. There are a number of dimensions and decisions to keep track of when approaching this subject.

First, the series, or the process, can be univariate or multivariate, depending on the problem at hand. Second, the series can be stochastic or purely deterministic. In the former case a stochastic random process is generating the observations. Third, given that the series is stochastic, with perhaps deterministic components, it can be modeled in the time domain or in the frequency domain. Modeling in the frequency domain implies describing the series in terms of cosine functions of different wave lengths. This is a useful approach for solving some problems, but not a general approach for economic time series modeling. Fourth, the data generating process and the statistical model can be constructed in continuous or discrete time. Continuous time econometrics is good for some problems but not all. In general it leads to more complex models. A discrete time approach builds on the assumption that the observed data are unchanged between the intervals of observation. This is a convenient approximation that makes modeling easier, but it comes at a cost in the form of aggregation biases. However, in the general case, this is a low cost compared with the costs of general misspecification. A special chapter deals with the discussion of discrete versus continuous time modeling.

The typical economic time series is a discrete stochastic process modeled in the time domain. Time series can be modelled by smoothing and filter techniques. For economists these techniques are generally uninteresting, though we will briefly come back to the concept of filters.

The simplest way to model an economic time series is to use autoregressive techniques, or ARIMA techniques in the general case. Most economic time series, however, are better modeled as a part of a multivariate stochastic process. Economic theory describes systems of economic variables, leading to single equation transfer functions and systems of equations in a VAR model.

These techniques are descriptive; they do not identify structural, or deep, parameters like elasticities, marginal propensities to consume etc. The estimate more


VECM.

What is outlined above is quite different from the typical basic econometric textbook approach, which starts with OLS and ends in practice with GLS as the solution to all problems. Here we will develop methods which first describe the statistical properties of the (joint) series at hand, and then allow the researcher to answer economic questions in such a way that the conclusions are statistically and economically valid. To get there we have to start with some basic statistics.

A general definition of statistical time series analysis is that it finds a mathematical model that links observed variables with the stochastic mechanism that generated the data. This sounds abstract, but the purpose of this abstraction is to understand the analytical tools of time series statistics. The practical problem is the following: we have some stochastic observations over time. We know that these observations have been generated by a process, but we do not know what this process looks like. Statistical time series analysis is about developing the tools needed to mimic the unknown data generating process (DGP).

We can formulate some general features of the model. First, it should be a "well-defined statistical model" in the sense that the assumptions behind the model should be valid for the data chosen. Later we will define more exactly what this implies for an econometric model. For the time being, we can say that the single most important criterion for a model is that the residuals should be a white noise process. Second, the parameters of the model should be stable over time. Third, the model should be simple, or parsimonious, meaning that its functional form should be simple. Fourth, the model should be parameterized in such a way that it is possible to give the parameters a clear interpretation and identify them with events in the real world. Finally, the model should be able to explain other rival models describing the dependent variable(s).

The way to build a "well-defined statistical model" is to investigate the underlying assumptions of the model in a systematic way. It can easily be shown that t-values, $R^2$, and Durbin-Watson values are not sufficient for determining the fit of a model. In later chapters we will introduce a systematic test procedure.

The final aim of econometric modelling is to learn about economic behavior. To some extent this always implies using some a priori knowledge in the form of theoretical relationships. Economists, in general, have extremely strong a priori beliefs about the size and sign of certain parameters. This way of thinking has led to much confusion, because a priori beliefs can be driven too far. Econometrics is basically about measuring correlations. It is a common misunderstanding among non-econometricians that correlations can be too high or too low, or be deemed right or wrong. Measured correlations are the outcome of the data used, only. Anyone who thinks of an estimated correlation as wrong must also explain what went wrong in the estimation process, which requires knowledge of econometrics and the real world.


The basic reason for dealing with stochastic models rather than deterministic models is that we are faced with random variables. A popular definition of random variables goes like this: a random variable is a variable that can take on more than one value (see footnote 1). For every possible value that a random variable can take on there is a number between zero and one that describes the probability that the random variable will take on this value. In the following a random variable is indicated with a tilde, as in $\tilde{X}$.

In statistical terms, a random variable is associated with the outcome of a statistical experiment. All possible outcomes of such an experiment can be called the sample space. If $S$ is a sample space with a probability measure and if $\tilde{X}$ is a real valued function defined over $S$, then $\tilde{X}$ is called a random variable.

There are two types of random variables: discrete random variables, which only take on a specific number of real values, and (absolutely) continuous random variables, which can take on any value between $-\infty$ and $+\infty$. It is also possible to examine discontinuous random variables, but we will limit ourselves to the first two types.

If the discrete random variable $\tilde{X}$ can take $k$ values $(x_1, \ldots, x_k)$, the probability of observing a value $x_j$ can be stated as,

$$P(x_j) = p_j. \tag{3.1}$$

The probability of observing one of the $k$ possible outcomes is equal to 1.0, or using the notation just introduced,

$$P(x_1, x_2, \ldots, \text{or } x_k) = p_1 + p_2 + \cdots + p_k = 1. \tag{3.2}$$

A discrete random variable is described by its probability function, $F(x_i)$, which specifies the probability with which $\tilde{X}$ takes on a certain value. (The term cumulative distribution is used synonymously with probability function.)

In time series econometrics we are in most applications dealing with continuous random variables. Unlike discrete variables, it is not possible to associate a specific observation with a certain probability, since these variables can take on an infinite range of numbers. The probability that a continuous random variable will take on a certain value is always zero. Because it is continuous we cannot distinguish between 1.01 and 1.0101 etc. This does not mean that the variables do not take on specific values. The outcome of the experiment, or the observation, is of course always a given number.

Thus, for a continuous random variable, statements of the probability of an observation must be made in terms of the probability that the random variable $\tilde{X}$ is less than or equal to some specific value. We express this with the distribution function $F(x)$ of the random variable $\tilde{X}$ as follows,

$$F(x) = P(\tilde{X} \le x) \quad \text{for } -\infty < x < \infty, \tag{3.3}$$

which states the probability of $\tilde{X}$ taking a value less than or equal to $x$.

The continuous analogue of the probability function is called the density function $f(x)$, which we get by differentiating the distribution function with respect to the observations $(x)$,

$$\frac{dF(x)}{dx} = f(x). \tag{3.4}$$

1 Random variables (RV:s) are also called stochastic variables, chance variables, or variates.


Integrating the density function gives the expression for the probability that $\tilde{X}$ takes on a value less than or equal to $x$,

$$F(x) = \int_{-\infty}^{x} f(u) \, du. \tag{3.5}$$

It follows that for any two constants $a$ and $b$, with $a < b$, the probability that $\tilde{X}$ takes on a value on the interval from $a$ to $b$ is given by

$$F(b) - F(a) = \int_{-\infty}^{b} f(u) \, du - \int_{-\infty}^{a} f(u) \, du \tag{3.6}$$

$$= \int_{a}^{b} f(u) \, du. \tag{3.7}$$

The term density function comes from physics. Think of a rod of variable density, measured by the function $f(x)$. To obtain the weight of some given length of this rod, we would have to integrate its density function over that particular part in which we are interested.
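The link between the density and the distribution function can be checked numerically. The sketch below assumes a standard normal density purely as an example, integrates it over an interval from a to b with the trapezoidal rule, and compares the result with F(b) - F(a) computed from the closed-form distribution function. The interval end points and the helper names are hypothetical.

```python
import math


def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density f(x) of a normal random variable (assumed example density)."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def normal_cdf(x, mu=0.0, sigma=1.0):
    """Distribution function F(x) of the same normal random variable."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def integrate(f, a, b, steps=10_000):
    """Trapezoidal approximation of the integral of f from a to b."""
    h = (b - a) / steps
    total = 0.5 * (f(a) + f(b))
    for i in range(1, steps):
        total += f(a + i * h)
    return total * h


a, b = -1.0, 2.0                                # hypothetical interval
prob_from_density = integrate(normal_pdf, a, b)
prob_from_cdf = normal_cdf(b) - normal_cdf(a)
print(prob_from_density, prob_from_cdf)
```

Both numbers are approximations of the same probability, so they should agree up to the numerical integration error.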

Random variables are described by their density function and/or by their moments: the mean, the variance etc. Given the density function, the moments can be determined exactly. In statistical work, we must first estimate the moments, and from the moments we can learn about the density function. For instance, we can test if the assumption of an underlying normal density function is consistent with the observed data.

A random variable can be predicted; in other words, it is possible to form an expectation of its outcome based on its density function. Appendix III deals with the expectations operator and other operators related to random variables.

Random variables are characterized by their probability density functions (pdf:s) or their moments. In the previous section we introduced pdf:s. Moments refer to measurements such as the mean, the variance, skewness, etc. If we know the exact density function of a random variable then we also know the moments. In applied work, we will typically first calculate the moments from a sample, and from the moments figure out the density function of the variables. The term moment originates from physics and the moment of a pendulum. For our purposes it can be thought of as a general term which includes the definition of concepts like the mean and the variance, without referring to any specific distribution. Starting with the first moment, the mathematical expectation of a discrete random variable is given by,

$$E(\tilde{X}) = \sum_{x} x f(x) \tag{3.8}$$

where $E$ is the expectation operator and $f(x)$ is the value of its probability function at $x$. Thus, $E(\tilde{X})$ represents the mean of the discrete random variable $\tilde{X}$, or, in other words, the first moment of the random variable. For a continuous random variable $\tilde{X}$, the mathematical expectation is

$$E(\tilde{X}) = \int_{-\infty}^{\infty} x f(x) \, dx \tag{3.9}$$

where $f(x)$ is the value of its probability density at $x$. The first moment can also be referred to as the location of the random variable. Location is a more generic concept than the first moment or the mean.

The term moments is used in situations where we are interested in the expected value of a function of a random variable, rather than the expectation of the specific variable itself. Say that we are interested in $\tilde{Y}$, whose values are related to $\tilde{X}$ by the equation $y = g(x)$. The expectation of $\tilde{Y}$ is equal to the expectation of $g(x)$, since $E(\tilde{Y}) = E[g(x)]$. In the continuous case this leads to,

$$E(\tilde{Y}) = E[g(\tilde{X})] = \int_{-\infty}^{\infty} g(x) f(x) \, dx. \tag{3.10}$$

Like density, the term moment, or moment about the origin, has its explanation in physics. (In physics the length of a lever arm is measured as the distance from the origin. Or, if we refer to the example with the rod above, the first moment around the mean would correspond to the horizontal center of gravity of the rod.) Reasoning from intuition, the mean can be seen as the midpoint of the limits of the density. The midpoint can be scaled in such a way that it becomes the origin of the x-axis.

The term moments of a random variable is a more general way of talking about the mean and variance of a variable. Setting $g(x)$ equal to $x^r$, we get the $r$:th moment around the origin,

$$\mu_r' = E(\tilde{X}^r) = \sum_{x} x^r f(x) \tag{3.11}$$

when $\tilde{X}$ is a discrete variable. In the continuous case we get,

$$\mu_r' = E(\tilde{X}^r) = \int_{-\infty}^{\infty} x^r f(x) \, dx. \tag{3.12}$$

The first moment is nothing else than the mean, or the expected value of $\tilde{X}$. The second moment, taken about the mean, is the variance. Higher moments give additional information about the distribution and density functions of random variables.

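Now, defining $g(\tilde{X}) = (\tilde{X} - \mu')^r$, where $\mu'$ is the mean of $\tilde{X}$, we get what is called the $r$:th moment about the mean of the distribution of the random variable $\tilde{X}$. For $r = 0, 1, 2, 3, \ldots$ we get for a discrete variable,

$$\mu_r = E[(\tilde{X} - \mu')^r] = \sum_{x} (x - \mu')^r f(x) \tag{3.13}$$

and when $\tilde{X}$ is continuous

$$\mu_r = E[(\tilde{X} - \mu')^r] = \int_{-\infty}^{\infty} (x - \mu')^r f(x) \, dx. \tag{3.14}$$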

The second moment about the mean, also called the second central moment, is nothing else than the variance of $g(x) = x$,

$$var(\tilde{X}) = \int_{-\infty}^{\infty} [x - E(\tilde{X})]^2 f(x) \, dx \tag{3.15}$$

$$= \int_{-\infty}^{\infty} x^2 f(x) \, dx - [E(\tilde{X})]^2 \tag{3.16}$$

$$= E(\tilde{X}^2) - [E(\tilde{X})]^2, \tag{3.17}$$

where $f(x)$ is the value of the probability density function of the random variable $\tilde{X}$ at $x$. A more generic expression for the variance is dispersion. We can say that the second moment, or the variance, is a measure of dispersion, in the same way as the mean is a measure of location.
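The definitions of moments about the origin and about the mean can be illustrated with a small discrete example. The sketch below assumes a fair six-sided die (an illustrative choice, not taken from the text) and uses exact fractions, so the identity var(X) = E(X^2) - [E(X)]^2 can be verified without rounding error. The function names are hypothetical.

```python
from fractions import Fraction

# Probability function of a fair die: f(x) = 1/6 for x = 1, ..., 6 (assumed example).
pmf = {x: Fraction(1, 6) for x in range(1, 7)}


def raw_moment(pmf, r):
    """r:th moment about the origin, the sum of x^r f(x)."""
    return sum(x ** r * p for x, p in pmf.items())


def central_moment(pmf, r):
    """r:th moment about the mean, the sum of (x - mean)^r f(x)."""
    mean = raw_moment(pmf, 1)
    return sum((x - mean) ** r * p for x, p in pmf.items())


mean = raw_moment(pmf, 1)                              # location
variance = central_moment(pmf, 2)                      # dispersion
variance_shortcut = raw_moment(pmf, 2) - mean ** 2     # E(X^2) - [E(X)]^2
third_central = central_moment(pmf, 3)                 # zero for a symmetric distribution
print(mean, variance, variance_shortcut, third_central)
```

The two variance calculations give the same value, and the third central moment is zero because the die is symmetric around its mean.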

The third moment, $r = 3$, measures asymmetry around the mean, referred to as skewness. The normal distribution is symmetric around the mean. The likelihood of observing a value above or below the mean is the same for a normal distribution. For a right skewed distribution, the right tail is longer, so that large deviations above the mean are more likely than equally large deviations below it. For a left skewed distribution, the left tail is longer, so that large deviations below the mean are more likely than equally large deviations above it.

The fourth moment, referred to as kurtosis, measures the thickness of the tails of the distribution. A distribution with thicker tails than the normal is characterized by a higher likelihood of extreme events compared with the normal distribution. Higher moments give further information about the skewness, tails and the peak of the distribution. The fifth, the seventh moments etc. give more information about the skewness. Even moments, above four, give further information about the thickness of the tails and the peak.

In time series econometrics, and financial economics, there is a small set of distributions that one has to know. The following is a list of common distributions:

Distribution                 Notation
Normal distribution          $N(\mu, \sigma^2)$
Log Normal distribution      $LogN(\mu, \sigma^2)$
Student t distribution       $St(\nu, \mu, \sigma^2)$
Cauchy distribution          $Ca(\mu, \sigma^2)$
Gamma distribution           $Ga(\nu, \mu, \sigma^2)$
Chi-square distribution      $\chi^2(\nu)$
F distribution               $F(d_1, d_2)$
Poisson distribution         $Pois(\lambda)$
Uniform distribution         $U(a, b)$

The pdf of a normal distribution is written as

$$f(x) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\left( -\frac{(x - \mu)^2}{2 \sigma^2} \right).$$

The normal distribution is characterized by the following: the distribution is symmetric around its mean, and it is fully described by two moments, the mean and the variance, $N(\mu, \sigma^2)$. The normal distribution can be standardised to have a mean of zero and a variance of unity (say $(x - E(x))/\sigma$) and is then called a standardised normal distribution, $N(0, 1)$.

In addition, it follows that the first four moments, the mean, the variance, the skewness and kurtosis, are $E(\tilde{X}) = \mu$, $Var(\tilde{X}) = \sigma^2$, $Sk(\tilde{X}) = 0$, and $Ku(\tilde{X}) = 3$.

There are random variables that are not normal by themselves but become normal if they are logged. The typical examples are stock prices and various macroeconomic variables. Let $S_t$ be a stock price. The dollar return over a given interval, $R_t = S_t - S_{t-1}$, is not likely to be normally distributed, due to the simple fact that the stock price is rising over time, partly because investors demand a return on their investment but mostly because of inflation. However, if you take the log of the stock price and calculate the per cent return (approximately), $r_t = \ln S_t - \ln S_{t-1}$, you get a variable that is normally distributed (or has a distribution that can be approximated with a normal distribution). Thus, since you have taken logs of variables in your econometric models, you have already worked with log normal variables. Knowledge about log normal distributions is necessary if you want to model, or better understand, the movements of actual stock prices and dollar returns.
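The difference between dollar returns and log returns can be seen in a few lines of Python. The price series below is hypothetical and only serves to show that the log difference is close to the per cent return, and that log returns add up over time.

```python
import math

prices = [100.0, 102.0, 99.0, 105.0, 110.0]   # hypothetical stock prices S_t

# Dollar return: R_t = S_t - S_{t-1}
dollar_returns = [prices[t] - prices[t - 1] for t in range(1, len(prices))]

# Per cent (simple) return: S_t / S_{t-1} - 1
simple_returns = [prices[t] / prices[t - 1] - 1.0 for t in range(1, len(prices))]

# Log return: ln(S_t) - ln(S_{t-1}), approximately the per cent return
log_returns = [math.log(prices[t]) - math.log(prices[t - 1])
               for t in range(1, len(prices))]

# Log returns are additive: their sum is the log return over the whole period.
total_log_return = sum(log_returns)
print(dollar_returns, simple_returns, log_returns, total_log_return)
```

The approximation between simple and log returns is good for small price changes and deteriorates as the changes get larger.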

The Student t distribution is similar to the normal distribution: it is symmetric around the mean and it has a variance, but it has thicker tails than the normal distribution. The Student t distribution is described by $(\nu, \mu, \sigma^2)$, where $\mu$ refers to the mean and $\sigma^2$ refers to the variance. The parameter $\nu$ is called the degrees of freedom of the Student t distribution and refers to the thickness of the tails. A random variable that follows a Student t distribution will converge to a normal random variable as the number of observations goes to infinity.

The Cauchy distribution is related to the normal distribution and the Student t distribution. Like the normal it is symmetric and described by two parameters, but it has fatter tails and is therefore better suited for modelling random variables which take on relatively more extreme values than the normal. The setback for empirical work is that its higher moments are not defined, meaning that it is difficult to use empirical moments to test for a Cauchy distribution against, say, the normal or the Student t distribution.

The gamma and the chi-square distributions are related to variances of normal random variables. If we have a set of normal random variables $\{\tilde{Y}_1, \tilde{Y}_2, \ldots, \tilde{Y}_\nu\}$ and form a new variable as $\tilde{X} = \tilde{Y}_1^2 + \tilde{Y}_2^2 + \cdots + \tilde{Y}_\nu^2$, then this new variable will have a gamma distribution, $\tilde{X} \sim Ga(\nu, \mu, \sigma^2)$. A special case of the gamma distribution is when we have $\mu = 0$ and $\sigma^2 = 1$; the distribution is then called a chi-square distribution, $\chi^2(\nu)$, with $\nu$ degrees of freedom. Thus, take the square of an estimated regression parameter and divide it by its variance and you get a chi-square distributed test for significance of the estimated parameter, $(\hat{\beta} / \hat{\sigma}_{\hat{\beta}})^2 \sim \chi^2(\nu)$, with $\nu = 1$ for a single parameter.
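The construction of a chi-square variable as a sum of squared standard normal variables can be illustrated by simulation. The sketch below is an illustration under stated assumptions: the degrees of freedom, the number of replications and the seed are arbitrary choices. It uses the known facts that a chi-square variable with nu degrees of freedom has mean nu and variance 2 nu.

```python
import random


def chi_square_draw(rng, nu):
    """One draw of X = Y_1^2 + ... + Y_nu^2 with independent Y_i ~ N(0, 1)."""
    return sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(nu))


rng = random.Random(12345)   # fixed seed so the sketch is reproducible
nu = 3                       # degrees of freedom (arbitrary choice)
draws = [chi_square_draw(rng, nu) for _ in range(20_000)]

sample_mean = sum(draws) / len(draws)
sample_var = sum((d - sample_mean) ** 2 for d in draws) / len(draws)

# Theory: mean = nu and variance = 2 * nu for a chi-square(nu) variable.
print(sample_mean, sample_var)
```

With 20,000 draws the sample mean and variance should be close to 3 and 6, although the exact values depend on the seed.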

The F distribution comes about when you compare the ratio (or log difference) of two squared normal random variables. The Poisson distribution is used to model jumps in the data, usually in combination with a geometric Brownian motion (jump diffusion models). The typical example is stock prices that might move up or down drastically. The parameter $\lambda$ measures the probability of a jump in the data.

In practical work we need to know the empirical distribution of the variables we are working with, in order to make any inference. All empirical distributions can be analysed with the help of their first four moments. Through the first four moments we get information first about the mean and the variance, and second about the skewness and kurtosis. The latter moments are often critical when we decide if a certain empirical distribution should be seen as normal, or at least approximately normal.

It is, of course, extremely convenient to work with the assumption of a normal distribution, since a normal distribution is described by its first two moments only. In finance, the expected return is given by the mean, and the risk of the asset is given by its variance. An approximation to the holding period return of an asset is the log difference of its price. In the case of a normal distribution, there is no need to consider higher moments. Furthermore, linear combinations of normal variables are also normal. When building regression equations, the residual process is assumed to be a normally independent white noise process, in order to allow for inference and testing.

It is by calculating the sample moments that we learn about the distribution of the series at hand. The most typical problem in empirical work is to investigate how well the distribution of a variable can be approximated with a normal distribution. If the normal distribution is rejected for the residuals in a regression, the typical conclusion is that there is something important missing in the regression equation. The missing part is either an important explanatory variable, or the direct cause of an outlier.

To investigate the empirical distribution we need to calculate the sample moments of the variable. The sample mean of $\{x_t\} = \{x_1, x_2, \ldots, x_T\}$ can be estimated as $\hat{\mu}_x = \bar{x} = (1/T) \sum_{t=1}^{T} x_t$. Higher moments can be estimated with the formula $m_r = (1/T) \sum_{t=1}^{T} (x_t - \bar{x})^r$ (see footnote 2).

If a series is normally distributed, $\tilde{X}_t \sim N(\mu_x, \sigma_x^2)$, subtracting the mean and dividing by the standard deviation leads to a standardised normal variable, distributed as $\tilde{X} \sim N(0, 1)$. For a standardised normal variable the third and fourth moments equal 0 and 3, respectively. The standardised third moment is known as skewness. Its square is given as $b_1 = m_3^2 / m_2^3$, and its sign is the sign of $m_3$. A skewness with a negative value indicates a left skewed distribution, compared with the normal. If the series is the return on an asset it means that "bad" or negative surprises dominate over "good" positive surprises. A positive value of skewness implies a right skewed distribution. In terms of asset returns, "good" or positive surprises are more likely than "bad" negative surprises.

The fourth moment, kurtosis, is calculated as $b_2 = m_4 / m_2^2$. A value above 3 implies that the distribution generates more extreme values than the normal distribution. The distribution has fatter tails than the normal. Referring to asset returns, approximating the distribution with the normal would underestimate the risk associated with the asset.

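An asymptotic test, with a null of a normal distribution, is given by (see footnote 3),

$$JB = T \left[ \frac{m_3^2 / m_2^3}{6} + \frac{\left[ (m_4 / m_2^2) - 3 \right]^2}{24} \right] + T \left[ \frac{3 m_1^2}{2 m_2} + \frac{m_1 m_3}{m_2^2} \right] \sim \chi^2(2).$$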

This test is known as the Jarque-Bera (JB) test and is the most common test for normality in regression analysis. The null hypothesis is that the series is normally distributed. Let $\mu_1$, $\mu_2$, $\mu_3$ and $\mu_4$ represent the mean, the variance, the skewness and the kurtosis. The null of a normal distribution is rejected if the test statistic is significant. The fact that the test is only valid asymptotically means that we do not know the reason for a rejection in a limited sample. In a less than asymptotic sample, rejection of normality is often caused by outliers. If we think the most extreme value(s) in the sample are non-typical outliers, "removing" them from the calculation of the sample moments usually results in a non-significant JB test. Removing outliers is ad hoc. It could be that these outliers are typical values of the true underlying distribution.
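The sample moments and the test statistic can be computed directly. The sketch below is a minimal illustration of the mean-adjusted version of the statistic, that is, only the first bracket of the expression, which applies when the series has been adjusted for its mean. The sample values and function names are hypothetical, and the sample is far too small for the asymptotic distribution to be reliable.

```python
def central_moment(x, r):
    """Sample moment m_r = (1/T) * sum((x_t - mean)^r)."""
    n = len(x)
    mean = sum(x) / n
    return sum((xi - mean) ** r for xi in x) / n


def jarque_bera(x):
    """Return (JB, b1, b2) for the mean-adjusted form of the statistic."""
    n = len(x)
    m2 = central_moment(x, 2)
    m3 = central_moment(x, 3)
    m4 = central_moment(x, 4)
    b1 = m3 ** 2 / m2 ** 3      # squared skewness
    b2 = m4 / m2 ** 2           # kurtosis
    jb = n * (b1 / 6.0 + (b2 - 3.0) ** 2 / 24.0)
    return jb, b1, b2


sample = [1.0, 2.0, 3.0, 4.0, 5.0]   # hypothetical data, symmetric around 3
jb, b1, b2 = jarque_bera(sample)
print(jb, b1, b2)
```

Under the null the statistic is compared with a chi-square distribution with two degrees of freedom, whose 5 per cent critical value is approximately 5.99.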

2 For these moments to be meaningful, the series must be stationary. Also, we would like $\{x_t\}$ to be an independent process. Finally, notice that the estimators of the higher moments suggested here are not necessarily efficient estimators.

3 This test statistic is for a variable with a non-zero mean. If the variable is adjusted for its mean (say an estimated residual), the second term should be removed from the expression.


We will now generalize the work of the previous sections by considering a vector of $n$ random variables,

$$\tilde{X} = (\tilde{X}_1, \tilde{X}_2, \ldots, \tilde{X}_n) \tag{3.18}$$

whose elements are continuous random variables with density functions $f(x_1), \ldots, f(x_n)$, and distribution functions $F(x_1), \ldots, F(x_n)$. The joint distribution will look like,

$$F(x_1, x_2, \ldots, x_n) = \int_{-\infty}^{x_n} \cdots \int_{-\infty}^{x_1} f(u_1, u_2, \ldots, u_n) \, du_1 \cdots du_n. \tag{3.19}$$

If these random variables are independent, it will be possible to write their joint density as the product of their univariate densities,

$$f(x_1, x_2, \ldots, x_n) = f(x_1) f(x_2) \cdots f(x_n). \tag{3.20}$$

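For independent random variables we can define the $r$:th product moment as,

$$E(\tilde{X}_1^{r_1} \tilde{X}_2^{r_2} \cdots \tilde{X}_n^{r_n}) \tag{3.21}$$

$$= \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} x_1^{r_1} x_2^{r_2} \cdots x_n^{r_n} f(x_1, x_2, \ldots, x_n) \, dx_1 dx_2 \cdots dx_n \tag{3.22}$$

$$= E(\tilde{X}_1^{r_1}) E(\tilde{X}_2^{r_2}) \cdots E(\tilde{X}_n^{r_n}). \tag{3.23}$$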

It follows from this result that the variance of a sum of independent random variables is merely the sum of the individual variances,

$$var(\tilde{X}_1 + \tilde{X}_2 + \ldots + \tilde{X}_n) = var(\tilde{X}_1) + var(\tilde{X}_2) + \ldots + var(\tilde{X}_n). \qquad (3.24)$$
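A small numerical check of (3.24) can make the result concrete. The sketch below is only an illustration: the three variances (1, 4 and 9), the sample size and the seed are arbitrary choices that do not come from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n_obs = 200_000

# Three independent normal variables with variances 1, 4 and 9 (illustrative values).
x1 = rng.normal(0.0, 1.0, n_obs)
x2 = rng.normal(0.0, 2.0, n_obs)
x3 = rng.normal(0.0, 3.0, n_obs)

var_of_sum = np.var(x1 + x2 + x3)                   # left-hand side of (3.24)
sum_of_vars = np.var(x1) + np.var(x2) + np.var(x3)  # right-hand side of (3.24)

print(f"variance of the sum : {var_of_sum:.3f}")
print(f"sum of the variances: {sum_of_vars:.3f}")
```

Both numbers should be close to 14. They differ slightly in any finite sample because the sample covariances between the simulated series are not exactly zero.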

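We can also look at a linear combination of random variables, say

$$a'\tilde{X} = a_1 \tilde{X}_1 + a_2 \tilde{X}_2 + \ldots + a_p \tilde{X}_p, \qquad (3.25)$$

which leads to,

$$cov(a'\tilde{X}) = \sum_{i=1}^{p} \sum_{j=1}^{p} a_i a_j \sigma_{ij}, \qquad (3.26)$$

where $\sigma_{ij}$ is the covariance between $\tilde{X}_i$ and $\tilde{X}_j$.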

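These results hold for matrices as well. If we have $\tilde{Y} = A\tilde{X}$, $\tilde{Z} = B\tilde{X}$, and the covariance matrix of $\tilde{X}$ is $\Sigma$, we also have that,

$$cov(\tilde{Y}, \tilde{Y}) = A \Sigma A', \qquad (3.27)$$

$$cov(\tilde{Z}, \tilde{Z}) = B \Sigma B', \qquad (3.28)$$

and

$$cov(\tilde{Y}, \tilde{Z}) = A \Sigma B'. \qquad (3.29)$$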
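The matrix results (3.27) to (3.29) also hold exactly for sample covariance matrices, which makes them easy to verify. In the sketch below the matrices A and B, the dimensions and the seed are made-up illustrations, not values from the text.

```python
import numpy as np

rng = np.random.default_rng(1)

# 3 variables observed 500 times; rows are variables, as np.cov expects by default.
X = rng.normal(size=(3, 500))
A = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, -1.0]])   # hypothetical 2 x 3 weight matrix
B = np.array([[0.5, 0.5, 0.5]])    # hypothetical 1 x 3 weight matrix

Sigma = np.cov(X)                  # sample covariance matrix of X
Y = A @ X
Z = B @ X

cov_Y = np.cov(Y)                  # compare with A Sigma A'
cov_Z = np.cov(Z)                  # compare with B Sigma B'
joint = np.cov(np.vstack([Y, Z]))  # covariance of the stacked vector (Y, Z)
cov_YZ = joint[:2, 2:]             # off-diagonal block, compare with A Sigma B'

print(np.allclose(cov_Y, A @ Sigma @ A.T))
print(np.allclose(cov_Z, B @ Sigma @ B.T))
print(np.allclose(cov_YZ, A @ Sigma @ B.T))
```

The equalities are exact here (up to rounding) because the sample covariance is itself a bilinear function of the data.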


Given a joint density function of n random variables, the joint density of a subset of them is called the joint marginal density. We can also talk about joint marginal distribution functions. If we set n = 3 we get the joint density function $f(x_1, x_2, x_3)$. Given the marginal distribution $g(x_2, x_3)$, the conditional probability density function of the random variable $\tilde{X}_1$, given that the random variables $\tilde{X}_2$ and $\tilde{X}_3$ take on the values $x_2$ and $x_3$, is defined as,

$$\varphi(x_1 \mid x_2, x_3) = \frac{f(x_1, x_2, x_3)}{g(x_2, x_3)}, \qquad (3.30)$$

or

$$f(x_1, x_2, x_3) = \varphi(x_1 \mid x_2, x_3)\, g(x_2, x_3). \qquad (3.31)$$

Of course we can define a conditional density for various combinations of $\tilde{X}_1$, $\tilde{X}_2$ and $\tilde{X}_3$, like $p(x_1, x_3 \mid x_2)$ or $g(x_3 \mid x_1, x_2)$. And, instead of three different variables we can talk about the density function for one random variable, say $\tilde{Y}_t$, for which we have a sample of T observations. If all observations are independent we get,

$$f(y_1, y_2, \ldots, y_T) = f(y_1) f(y_2) \cdots f(y_T). \qquad (3.32)$$

Like before we can also look at conditional densities, like

$$f(y_t \mid y_1, y_2, \ldots, y_{t-1}), \qquad (3.33)$$

which in this case would mean that $y_t$, the observation at time t, is dependent on all earlier observations on $\tilde{Y}_t$.

It is seldom that we deal with independent variables when modeling economic time series. For example, a simple first order autoregressive model like $y_t = a_1 y_{t-1} + \epsilon_t$ implies dependence between the observations. The same holds for all time series models. Despite this shortcoming, density functions with independent random variables are still good tools for describing time series modelling, because the results based on independent variables carry over to dependent variables in almost every case.

In this section we look at the linear regression model starting from two random variables $\tilde{Y}$ and $\tilde{X}$. Two regressions can be formulated,

$$y = \alpha + \beta x + \epsilon, \qquad (3.34)$$

and

$$x = \gamma + \delta y + \nu. \qquad (3.35)$$

Whether one chooses to condition y on x, or x on y, depends on the parameter of interest. In the following it is shown how these regression expressions are constructed from the correlation between x and y, and their first moments, by making use of the (bivariate) joint density function of x and y. (One can view this section as an exercise in using density functions.)


Without explicitly stating what the density function looks like, we will assume that we know the joint density function for the two random variables $\tilde{Y}$ and $\tilde{X}$, and want to estimate a set of parameters, $\alpha$ and $\beta$. Hence we have the joint density,

$$D(y, x; \psi). \qquad (3.36)$$

To get the linear regression model above we have to condition on the outcome of $\tilde{X}$,

$$D(y, x; \psi) = D(y \mid x; \theta), \qquad (3.37)$$

where $\theta$ represents the vector of parameters of interest, $\theta = [\alpha, \beta]$. This operation requires that the parameters of interest can be written as a function of the parameters in the joint distribution function, $\theta = f(\psi)$.

The expected mean of $\tilde{Y}$ for given $\tilde{X}$ is (compare the first regression equation, 3.34),

$$E(\tilde{Y} \mid x; \theta) = \int y\, D(y \mid x; \theta)\, dy = \alpha + \beta x, \qquad (3.38)$$

or, if we choose to condition on $\tilde{Y}$ instead,

$$E(\tilde{X} \mid y; \theta) = \int x\, D(x \mid y; \theta)\, dx = \gamma + \delta y.$$

The parameters in 3.38 can be estimated by using means, variances and covariances of the variables. Or in other terms, by using some of the lower moments of the joint distribution of $\tilde{X}$ and $\tilde{Y}$. Hence, the first step is to rewrite 3.38 in such a way that we can write $\alpha$ and $\beta$ in terms of the means of $\tilde{X}$ and $\tilde{Y}$.

Looking at the LHS of 3.38 it can be seen that multiplying the conditional density by the marginal density for $\tilde{X}$, g(x), leads to the joint density. Given the joint density we can choose to integrate out either x or y. In this case we choose to integrate over x. Thus we have after multiplication,

$$\int y\, D(y \mid x; \theta)\, dy\, g(x) = \alpha\, g(x) + \beta x\, g(x). \qquad (3.39)$$

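Integrating over x leads to, at the LHS,

$$\int\!\!\int y\, D(y \mid x; \theta)\, g(x)\, dy\, dx = \int\!\!\int y\, D(y, x \mid \theta)\, dy\, dx = \int y\, D(y \mid \theta)\, dy = E(y \mid \theta) = \mu_y, \qquad (3.40)$$

and at the RHS,

$$\alpha \int g(x)\, dx + \beta \int x\, g(x)\, dx = \alpha + \beta E(\tilde{X}) = \alpha + \beta \mu_x. \qquad (3.41)$$

Hence,

$$E(\tilde{Y}) = \alpha + \beta E(\tilde{X}) = \alpha + \beta \mu_x. \qquad (3.42)$$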

We now have one equation to solve for the two unknowns. Since we have used up the means, let us turn to the variances by multiplying both sides of 3.38 by x and performing the same operations again,

$$\int x y\, D(y \mid x; \theta)\, dy\, g(x) = \alpha x\, g(x) + \beta x^2 g(x). \qquad (3.43)$$

Integrate over x,

$$\int\!\!\int x y\, D(y \mid x; \theta)\, g(x)\, dy\, dx = \alpha \int x\, g(x)\, dx + \beta \int x^2 g(x)\, dx. \qquad (3.44)$$

The LHS is,

$$\int\!\!\int x y\, D(y, x \mid \theta)\, dy\, dx = E(\tilde{X}\tilde{Y}), \qquad (3.45)$$

and the RHS,

$$\alpha \int x\, g(x)\, dx + \beta \int x^2 g(x)\, dx = \alpha E(\tilde{X}) + \beta E(\tilde{X}^2). \qquad (3.46)$$

Hence,

$$E(\tilde{X}\tilde{Y}) = \alpha E(\tilde{X}) + \beta E(\tilde{X}^2). \qquad (3.47)$$

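Remembering the rules for the expectations operator, $E(\tilde{X}\tilde{Y}) = \mu_x \mu_y + \sigma_{xy}$ and $E(\tilde{X}^2) = \mu_x^2 + \sigma_x^2$, makes it possible to solve for $\alpha$ and $\beta$ in terms of means and variances. From the first equation we get for $\alpha$,

$$\alpha = \mu_y - \beta \mu_x. \qquad (3.48)$$

Substituting this into 3.47 gives,

$$E(\tilde{X}\tilde{Y}) = (\mu_y - \beta \mu_x)\mu_x + \beta(\mu_x^2 + \sigma_x^2),$$

$$\mu_x \mu_y + \sigma_{xy} = \mu_x \mu_y - \beta \mu_x^2 + \beta \mu_x^2 + \beta \sigma_x^2,$$

$$\sigma_{xy} = \beta \sigma_x^2, \qquad (3.49)$$

which gives

$$\beta = \frac{\sigma_{xy}}{\sigma_x^2}. \qquad (3.50)$$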

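Substituting the solutions for $\alpha$ and $\beta$ back into the conditional mean gives,

$$E(\tilde{Y} \mid x; \theta) = \frac{\sigma_{xy}}{\sigma_x^2}(x - \mu_x) + \mu_y, \qquad (3.51)$$

and, by the same steps for the regression of x on y,

$$E(\tilde{X} \mid y; \theta) = \frac{\sigma_{yx}}{\sigma_y^2}(y - \mu_y) + \mu_x. \qquad (3.52)$$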

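We can now make use of the correlation coefficient and the parameter $\beta$ in the linear regression. The correlation coefficient between $\tilde{X}$ and $\tilde{Y}$ is defined as,

$$\rho = \frac{\sigma_{xy}}{\sigma_x \sigma_y} \quad \text{or} \quad \rho\, \sigma_x \sigma_y = \sigma_{xy}. \qquad (3.53)$$

Using this in 3.51 and 3.52 gives,

$$E(\tilde{Y} \mid x; \theta) = \mu_y + \rho \frac{\sigma_y}{\sigma_x}(x - \mu_x), \qquad (3.54)$$

$$E(\tilde{X} \mid y; \theta) = \mu_x + \rho \frac{\sigma_x}{\sigma_y}(y - \mu_y). \qquad (3.55)$$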

So, if the two variables are independent their covariance is zero, and the correlation is also zero. Therefore, the conditional mean of each variable does not depend on the mean and variance of the other variable. The final message is that a non-zero correlation between two normal random variables results in a linear relationship between them. With a multivariate model, with more than two random variables, things are more complex.
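The moment expressions (3.48), (3.50) and (3.53) can be checked against an ordinary least squares fit. The sketch below uses simulated data. The "true" intercept 1.0 and slope 0.8 are arbitrary illustrative values, and `np.polyfit` is used only as an independent OLS benchmark.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1_000
x = rng.normal(2.0, 1.5, n)
y = 1.0 + 0.8 * x + rng.normal(0.0, 1.0, n)   # illustrative data generating process

# Sample counterparts of the first and second moments.
mu_x, mu_y = x.mean(), y.mean()
s_xy = np.mean((x - mu_x) * (y - mu_y))
s_xx = np.mean((x - mu_x) ** 2)
s_yy = np.mean((y - mu_y) ** 2)

beta = s_xy / s_xx                       # equation (3.50)
alpha = mu_y - beta * mu_x               # equation (3.48)
rho = s_xy / np.sqrt(s_xx * s_yy)        # equation (3.53)
beta_from_rho = rho * np.sqrt(s_yy / s_xx)

# Independent check: least squares line, coefficients returned as [slope, intercept].
slope_ols, intercept_ols = np.polyfit(x, y, 1)

print(beta, slope_ols, beta_from_rho)
print(alpha, intercept_ols)
```

The slope from the moments, the OLS slope, and the slope written as the correlation times the ratio of standard deviations coincide, which is the content of (3.51) and (3.54).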

THE METHOD OF MAXIMUM LIKELIHOOD

There are two fundamental approaches to estimation in econometrics, the method of moments and the maximum likelihood method. The difference is that the moments estimator deals with estimation without a priori choosing a specific density function. The maximum likelihood estimator (MLE), on the other hand, requires that a specific density function is chosen from the beginning. Asymptotically there is no difference between the two approaches. The MLE is more general, and is the basis for all the various tests applied in practical modeling. In this section we will focus on MLE exclusively because of its central role.

The principles of MLE were developed early, but for a long time it was considered mainly as a theoretical device, with limited practical use. The progress in computer capacity has changed this. Many presentations of the MLE are too complex for students below the advanced graduate level. The aim of this chapter is to change this. The principle of ML is not different from OLS. The way to learn MLE is to start with the simplest case, the estimation of the mean and the variance of a single normal random variable. In the next step, it is easy to show how the parameters of a simple linear regression model can be found, and tested, using the techniques of MLE. In the third step, we can analyse how the parameters of any density function can be estimated. Finally, it is often interesting to study the bivariate joint normal density function. This last exercise is good for understanding when certain variables can be treated as exogenous. The general idea is that after viewing how a single random variable can be replaced by a function of random variables, it becomes obvious how a multivariate non-linear system of variables can be estimated.

Let us start with a single stochastic time series. The first moment, or the sample mean, of the random process $\tilde{X}_t$ with the observations $(x_1, x_2, \ldots, x_T)$ is found as $\bar{x} = \sum_{t=1}^{T} x_t / T$. By using this technique we simply calculate a number that we can use to describe one characteristic of the process $\tilde{X}_t$. In the same way we can calculate the second moment around the mean, etc. In the long run, and for a stationary variable, we can use the central limit theorem (CLT) to argue that the sample mean computed from $(x_1, x_2, \ldots, x_T)$ has an approximately normal distribution, which allows us to test for significance etc.

The MLE approach starts from a random variable $\tilde{X}_t$, and a sample of T independent observations $(x_1, x_2, \ldots, x_T)$. The joint density function is

$$f(x_1, x_2, \ldots, x_T; \theta) = f(x; \theta) = \prod_{t=1}^{T} f(x_t; \theta). \qquad (4.1)$$

To describe this process there are k parameters, $\theta = (\theta_1, \theta_2, \ldots, \theta_k)$, so we write the density function as,

$$f(x; \theta), \qquad (4.2)$$

where $x; \theta$ indicates that it is the shape of the density, described by the parameters $\theta$, which gives us the sample. If the density function describes a normal distribution, $\theta$ would consist of two parameters, the mean and the variance.

Now, suppose that we know the functional form of the density function. If we also have a sample of observations on $\tilde{X}_t$, we can ask the question which estimates of $\theta$ would be the most likely to find, given the functional form of the density and given the observations. Viewing the density in this way amounts to asking which values of $\theta$ maximize the value of the density function.

Formulating the estimation problem in this way leads to a restatement of the density function in terms of a likelihood function,

$$L(\theta; x), \qquad (4.3)$$

where the parameters are seen as a function of the sample. It is often convenient to work with the log of the likelihood instead, leading to the log likelihood

$$\log L(\theta; x) = l(\theta; x). \qquad (4.4)$$

What is left is to find the maximum of this function with respect to the parameters in $\theta$. The maximum, if it exists, is found by solving the system of k simultaneous equations,

$$\frac{\partial l(\theta; x)}{\partial \theta_i} = 0, \quad i = 1, \ldots, k, \qquad (4.5)$$

provided that the matrix of second order derivatives is a negative definite matrix. In matrix form this expression is also known as the score matrix, or the efficient score for $\theta$, which can be written as,

$$\frac{\partial l(\theta; x)}{\partial \theta} = S(\theta). \qquad (4.6)$$

The matrix of the expected second order expressions is known as the information matrix,

$$-E\left[\frac{\partial^2 l(\theta; x)}{\partial \theta\, \partial \theta'}\right] = I(\theta). \qquad (4.7)$$

The information matrix plays an important role in demonstrating that ML estimators asymptotically attain the Cramér-Rao lower bound, and in the derivation of the so-called classical test statistics associated with the ML estimator. It can be shown, under quite general conditions, that the variances of the estimated parameters from above ($\hat{\theta}$) are given by the inverse of the information matrix,

$$var(\hat{\theta}) = [I(\theta)]^{-1}. \qquad (4.8)$$

So far we have not assigned any specific distribution to the density function. Let us assume a sample of T independent normal random variables $\{\tilde{X}_t\}$. The normal distribution is particularly easy to work with since it only requires two parameters to describe it. We want to estimate the first two moments, the mean $\mu$ and the variance $\sigma^2$, thus $\theta = (\mu, \sigma^2)$. The likelihood is,

$$L(\theta; x) = (2\pi\sigma^2)^{-T/2} \exp\left[-\frac{1}{2\sigma^2} \sum_{t=1}^{T} (x_t - \mu)^2\right]. \qquad (4.9)$$

Taking logs of this expression yields,

$$l(\theta; x) = -(T/2) \log 2\pi - (T/2) \log \sigma^2 - (1/2\sigma^2) \sum_{t=1}^{T} (x_t - \mu)^2. \qquad (4.10)$$

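The first order derivatives with respect to $\mu$ and $\sigma^2$ are,

$$\frac{\partial l}{\partial \mu} = \frac{1}{\sigma^2} \sum_{t=1}^{T} (x_t - \mu), \qquad (4.11)$$

and,

$$\frac{\partial l}{\partial \sigma^2} = -(T/2\sigma^2) + (1/2\sigma^4) \sum_{t=1}^{T} (x_t - \mu)^2. \qquad (4.12)$$

Setting these expressions equal to zero gives,

$$\sum_{t=1}^{T} x_t - T\hat{\mu} = 0, \qquad (4.13)$$

$$\sum_{t=1}^{T} (x_t - \hat{\mu})^2 - T\hat{\sigma}^2 = 0, \qquad (4.14)$$

which can be solved for the estimates of the mean and the variance as1

$$\hat{\mu}_x = \frac{1}{T} \sum_{t=1}^{T} x_t, \qquad (4.15)$$

and

$$\hat{\sigma}_x^2 = \frac{1}{T} \sum_{t=1}^{T} (x_t - \hat{\mu}_x)^2 = \frac{1}{T} \sum_{t=1}^{T} x_t^2 - \left[\frac{1}{T} \sum_{t=1}^{T} x_t\right]^2. \qquad (4.16)$$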
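The closed-form solutions (4.15) and (4.16) can be checked numerically against the log likelihood (4.10) and the first order conditions (4.11) and (4.12). This is a minimal sketch on simulated data. The function name `normal_loglik`, the true mean 5 and the true standard deviation 2 are illustrative assumptions.

```python
import numpy as np

def normal_loglik(mu, sigma2, x):
    """Log likelihood (4.10) of an independent normal sample."""
    T = x.size
    return (-0.5 * T * np.log(2.0 * np.pi)
            - 0.5 * T * np.log(sigma2)
            - np.sum((x - mu) ** 2) / (2.0 * sigma2))

rng = np.random.default_rng(3)
x = rng.normal(5.0, 2.0, 250)      # simulated sample, illustrative parameters
T = x.size

mu_hat = x.sum() / T                            # equation (4.15)
sigma2_hat = np.sum((x - mu_hat) ** 2) / T      # equation (4.16)

# First order conditions (4.11) and (4.12) evaluated at the estimates.
score_mu = np.sum(x - mu_hat) / sigma2_hat
score_sigma2 = (-T / (2.0 * sigma2_hat)
                + np.sum((x - mu_hat) ** 2) / (2.0 * sigma2_hat ** 2))

l_max = normal_loglik(mu_hat, sigma2_hat, x)
print(mu_hat, sigma2_hat, score_mu, score_sigma2, l_max)
```

Both scores are zero up to rounding error, and moving either parameter away from its estimate lowers the value of the log likelihood.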

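Do these estimates represent a maximum of the likelihood function? To answer that question we have to look at the sign of the Hessian of the log likelihood function, the second order conditions, evaluated at the estimated values of the parameters in $\theta$,

$$D^2 l(\theta; x) = \begin{bmatrix} \dfrac{\partial^2 l}{\partial \mu^2} & \dfrac{\partial^2 l}{\partial \mu\, \partial \sigma^2} \\ \dfrac{\partial^2 l}{\partial \sigma^2\, \partial \mu} & \dfrac{\partial^2 l}{\partial (\sigma^2)^2} \end{bmatrix} = \begin{bmatrix} -\dfrac{T}{\sigma^2} & -\dfrac{1}{\sigma^4} \sum (x_t - \mu) \\ -\dfrac{1}{\sigma^4} \sum (x_t - \mu) & \dfrac{T}{2\sigma^4} - \dfrac{1}{\sigma^6} \sum (x_t - \mu)^2 \end{bmatrix}. \qquad (4.17)$$

Evaluated at $\hat{\mu}_x$ and $\hat{\sigma}_x^2$, we get,

$$E[D^2 l(\theta; x)] = -\begin{bmatrix} \dfrac{T}{\hat{\sigma}_x^2} & 0 \\ 0 & \dfrac{T}{2\hat{\sigma}_x^4} \end{bmatrix} = -I(\hat{\theta}), \qquad (4.18)$$

which is a negative definite matrix, and therefore a maximum value for the function at $\hat{\mu}_x$ and $\hat{\sigma}_x^2$.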

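It remains to investigate whether the estimates are unbiased. Therefore, replace the observations, in the solutions for $\hat{\mu}_x$ and $\hat{\sigma}_x^2$, by the random variable $\tilde{X}$ and take expectations. The expected value of the mean is,

$$E(\hat{\mu}_x) = \frac{1}{T} \sum_{t=1}^{T} E(\tilde{X}_t) = \frac{1}{T} \sum_{t=1}^{T} \mu = \mu, \qquad (4.19)$$

so the estimate of the mean is unbiased.

1 The solution is given by

$$\frac{1}{T} \sum_{t=1}^{T} [x_t - \hat{\mu}]^2 = \frac{1}{T} \sum_{t=1}^{T} \left(x_t^2 + \hat{\mu}^2 - 2 x_t \hat{\mu}\right) = \frac{1}{T} \sum_{t=1}^{T} x_t^2 + \hat{\mu}^2 - 2\hat{\mu} \frac{1}{T} \sum_{t=1}^{T} x_t$$

$$= \frac{1}{T} \sum_{t=1}^{T} x_t^2 + \frac{1}{T^2} \left[\sum_{t=1}^{T} x_t\right]^2 - \frac{2}{T^2} \left[\sum_{t=1}^{T} x_t\right]^2 = \frac{1}{T} \sum_{t=1}^{T} x_t^2 - \frac{1}{T^2} \left[\sum_{t=1}^{T} x_t\right]^2.$$

The calculations for the variance are a bit more complex, but the idea is the same. The expected variance is,

$$E[\hat{\sigma}_x^2] = E\left[\frac{1}{T} \sum_{t=1}^{T} \tilde{X}_t^2 - \left(\frac{1}{T} \sum_{t=1}^{T} \tilde{X}_t\right)^2\right] = \frac{1}{T} E\left[\sum_{t=1}^{T} \tilde{X}_t^2 - \frac{1}{T} \sum_{t=1}^{T} \sum_{s=1}^{T} \tilde{X}_t \tilde{X}_s\right]$$

$$= \frac{1}{T} \left[T E(\tilde{X}_t^2) - \frac{1}{T} \left(T E(\tilde{X}_t^2) + E \sum_{t \neq s} \tilde{X}_t \tilde{X}_s\right)\right]$$

$$= \frac{1}{T} \left[(T - 1) E(\tilde{X}_t^2) - \frac{1}{T} T (T - 1) [E(\tilde{X}_t)]^2\right] = \frac{T - 1}{T} \sigma^2. \qquad (4.20)$$

The estimated variance is therefore biased, but the bias goes to zero as $T \to \infty$. This is a typical result from MLE, the mean is correct but the variance is biased. To get an unbiased estimate we need to correct the estimate in the following manner,

$$s^2 = \frac{T}{T - 1} \hat{\sigma}^2 = \frac{T}{T - 1} \left[\frac{1}{T} \sum_{t=1}^{T} \tilde{X}_t^2 - \left(\frac{1}{T} \sum_{t=1}^{T} \tilde{X}_t\right)^2\right]. \qquad (4.21)$$

The correction involves multiplying the estimated variance with $T/(T - 1)$.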
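The size of the small sample bias in (4.20), and the effect of the correction in (4.21), can be illustrated with a short Monte Carlo experiment. The sample size T = 5, the true variance 4 and the number of replications are arbitrary choices made for this sketch.

```python
import numpy as np

rng = np.random.default_rng(4)
T, n_rep, sigma2 = 5, 200_000, 4.0        # illustrative settings

# Each row is one sample of T independent normal observations.
samples = rng.normal(0.0, np.sqrt(sigma2), size=(n_rep, T))

sigma2_ml = samples.var(axis=1, ddof=0)   # ML estimate (4.16), divides by T
s2 = sigma2_ml * T / (T - 1)              # corrected estimate (4.21)

mean_ml = sigma2_ml.mean()                # should be near (T - 1) / T * sigma2
mean_s2 = s2.mean()                       # should be near sigma2

print(f"average ML estimate       : {mean_ml:.3f}")
print(f"average corrected estimate: {mean_s2:.3f}")
```

With these settings the average ML estimate should lie close to 3.2, that is (T - 1)/T times the true variance, while the corrected estimate should average close to 4. The corrected estimate is the same quantity NumPy returns with `ddof=1`.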

We have derived the maximum likelihood estimates for a single independent normal variable. How does this relate to a linear regression model? Earlier, when we discussed the moments of a variable, we showed how it was possible, as a general principle, to substitute a random variable with a function of the variable. The same reasoning applies here. Say that $\tilde{X}$ is a function of two other random variables $\tilde{Y}$ and $\tilde{Z}$. Assume the linear model

$$y_t = \beta z_t + x_t, \qquad (4.22)$$

where $\tilde{Y}$ is a random variable, with observations $\{y_t\}$, and $z_t$ is, for the time being, assumed to be a deterministic variable. (This is not a necessary assumption.) Instead of using the symbol x for observations on the random variable $\tilde{X}$, let us set $x_t = \epsilon_t$ where $\epsilon_t \sim NID(0, \sigma^2)$. Thus, we have formulated a linear regression model with a white noise residual. This linear equation can be rewritten as,

$$\epsilon_t = y_t - \beta z_t, \qquad (4.23)$$

where the RHS is the function to be substituted for the single normal variable $x_t$ used in the MLE example above. The algebra gets a bit more complicated but the principal steps are the same.2

2 As a consequence of more complex algebra the computer algorithms for estimating the parameters will also get more complex. For the ordinary econometrician there are a lot of software packages that cover most of the cases.

The unknown parameters in this case are $\beta$ and $\sigma^2$. The log likelihood is,

$$l(\beta, \sigma^2; y, z) = -(T/2) \log 2\pi - (T/2) \log \sigma^2 - (1/2\sigma^2) \sum_{t=1}^{T} (y_t - \beta z_t)^2. \qquad (4.24)$$

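The sum in the last term of this expression can be identified as the sum of squares function, $S(\beta)$. In matrix form we have,

$$S(\beta) = \sum_{t=1}^{T} (y_t - \beta z_t)^2 = (Y - Z\beta)'(Y - Z\beta), \qquad (4.25)$$

and

$$l(\beta, \sigma^2; y, z) = -(T/2) \log 2\pi - (T/2) \log \sigma^2 - (1/2\sigma^2)(Y - Z\beta)'(Y - Z\beta). \qquad (4.26)$$

Differentiation of S with respect to $\beta$ yields

$$\frac{\partial S}{\partial \beta} = -2Z'(Y - Z\beta), \qquad (4.27)$$

which, set equal to zero, gives the solution

$$\hat{\beta} = (Z'Z)^{-1}(Z'Y). \qquad (4.28)$$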

Notice that the ML estimator of the linear regression model is identical to the

OLS estimator.

The variance estimate is,

$$\hat{\sigma}^2 = \hat{\epsilon}'\hat{\epsilon}/T, \qquad (4.29)$$

which in contrast to the OLS estimate is biased.
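The equivalence between the ML solution (4.28) and OLS, and the difference between the two variance estimates, can be seen in a few lines of NumPy. The data below are simulated. The constant column in Z, the coefficient values and the error standard deviation 0.7 are assumptions made only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
T = 300
Z = np.column_stack([np.ones(T), rng.normal(size=T)])   # constant and one regressor
beta_true = np.array([1.0, 0.5])                        # illustrative values
Y = Z @ beta_true + rng.normal(0.0, 0.7, T)

# ML estimator from equation (4.28): (Z'Z)^{-1} Z'Y.
beta_hat = np.linalg.solve(Z.T @ Z, Z.T @ Y)

# Independent OLS benchmark from a least squares solver.
beta_ols, *_ = np.linalg.lstsq(Z, Y, rcond=None)

resid = Y - Z @ beta_hat
sigma2_ml = resid @ resid / T                    # equation (4.29), biased
sigma2_ols = resid @ resid / (T - Z.shape[1])    # usual degrees-of-freedom correction

print(beta_hat, beta_ols)
print(sigma2_ml, sigma2_ols)
```

The coefficient vectors agree, while the ML variance estimate is always somewhat smaller than the degrees-of-freedom corrected one. The difference vanishes as T grows.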

To obtain these estimates we did not have to make any direct assumptions about the distribution of $y_t$ or $z_t$. The necessary and sufficient condition is that $y_t$ conditional on $z_t$ is normal, which means that $y_t - \beta z_t = \epsilon_t$ should follow a normal distribution. This is the reason why MLE is feasible even though $y_t$ might be a dependent AR(p) process. In the AR(p) process the residual term is an independent normal random variable. The MLE is given by substitution of the independently distributed normal variable with the conditional mean of $y_t$.

The above results can be extended to a vector of normal random variables. In this case we have a multivariate normal distribution, where the density is

$$D(X) = D(X_1, X_2, \ldots, X_n). \qquad (4.30)$$

The random variables $\tilde{X}$ will have a mean vector $\mu$ and a covariance matrix $\Sigma$. The density function for the multivariate normal is,

$$D(X) = \left[(2\pi)^{n/2} |\Sigma|^{1/2}\right]^{-1} \exp\left[-(1/2)(X - \mu)' \Sigma^{-1} (X - \mu)\right], \qquad (4.31)$$

which can be expressed in a compact form as $X_t \sim N(\mu, \Sigma)$.

With multivariate densities it is possible to handle systems of equations with stochastic variables, the typical case in econometrics. The bivariate normal is an often used device to derive models including 2 variables. Set $\tilde{X} = (\tilde{X}_1, \tilde{X}_2)$, and

$$\Sigma = \begin{bmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{21} & \sigma_2^2 \end{bmatrix} \quad \text{with} \quad |\Sigma| = \sigma_1^2 \sigma_2^2 (1 - \rho^2), \qquad (4.32)$$

where $\rho$ is the correlation coefficient. As can be seen, $|\Sigma|$ is positive unless $\rho^2 = 1$. If $\sigma_{12} = \sigma_{21} = 0$, the two processes are independent and can be estimated individually. If the covariances are not zero, the equations are dependent, and it will be necessary to estimate a complete system of equations to get correct estimates, which are unbiased and efficient.
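The bivariate case can be made concrete with a direct implementation of the density (4.31). The helper names `mvn_density` and `normal_pdf`, and all parameter values below, are hypothetical and chosen only to illustrate the determinant formula (4.32) and the factorisation under zero covariance.

```python
import numpy as np

def mvn_density(x, mu, Sigma):
    """Multivariate normal density (4.31) evaluated at the point x."""
    n = mu.size
    dev = x - mu
    quad = dev @ np.linalg.solve(Sigma, dev)          # (x - mu)' Sigma^{-1} (x - mu)
    norm_const = (2.0 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm_const

def normal_pdf(v, m, s2):
    """Univariate normal density with mean m and variance s2."""
    return np.exp(-0.5 * (v - m) ** 2 / s2) / np.sqrt(2.0 * np.pi * s2)

s1, s2, rho = 1.0, 2.0, 0.5                 # illustrative standard deviations and correlation
mu = np.array([0.0, 1.0])
Sigma = np.array([[s1 ** 2, rho * s1 * s2],
                  [rho * s1 * s2, s2 ** 2]])

det_Sigma = np.linalg.det(Sigma)            # compare with (4.32)
det_formula = s1 ** 2 * s2 ** 2 * (1.0 - rho ** 2)

# With zero covariance the joint density is the product of the two univariate densities.
Sigma0 = np.diag([s1 ** 2, s2 ** 2])
x = np.array([0.3, 2.0])
joint0 = mvn_density(x, mu, Sigma0)
product0 = normal_pdf(x[0], mu[0], s1 ** 2) * normal_pdf(x[1], mu[1], s2 ** 2)

print(det_Sigma, det_formula)
print(joint0, product0)
```

When the correlation is non-zero the joint density no longer factorises, which is the reason the two equations then have to be estimated as a system.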

A disadvantage with MLE is that the variance estimate is biased. This, however, is only a small sample effect. It can be shown that as T goes to infinity the bias disappears. Hence, the MLE is an asymptotically efficient estimator. Furthermore, it can also be shown that MLE behaves asymptotically nicely even if we drop the assumption of normally independently distributed residuals. The estimates will tend towards those given by NID errors. This situation is referred to as quasi maximum likelihood.

The advantages are easy to see. MLE offers a general approach to the estimation of econometric models. These models can be quite complex; non-linearity, moving average residuals and so on can be handled by MLE. Consequently there exists a large literature on MLE. In principle this literature is not difficult. The main problem for our understanding of the use of MLE in different situations lies in our understanding of matrix algebra.


AND LR TESTS

(To be completed: add a figure of a normally distributed variable with the value of the likelihood function (L) on the vertical axis and the parameter value on the horizontal axis, with $\hat{\theta}$ indicating the maximum value of L.)

There are three approaches to testing a statistical model. The first is to start with an unrestricted model and impose restrictions on the estimated model. The second approach is to impose the restrictions prior to estimation, and estimate a restricted model. The test is then performed by asking if the restriction should be lifted. The third approach is to test for significant differences between an estimated restricted model and an estimated unrestricted model. The last approach involves estimating two models, rather than one.

The three approaches to testing are named

Wald tests (W): estimate an unrestricted model.

Lagrange Multiplier tests (LM): estimate a restricted model.

Likelihood Ratio tests (LR): estimate both the unrestricted and the restricted models.

A test is labeled Wald, Lagrange Multiplier or Likelihood Ratio depending on how it is constructed. A typical Wald test is the t-test for significance. A Lagrange multiplier test is the LM test of autocorrelation. Finally, the F-test for testing the significance of one or several parameters in a group represents a typical Likelihood Ratio test.

Imagine a figure of a normal density function, with the shape of a normal random variable centered around its (true) mean. On the vertical axis put the value of the likelihood function. The maximum is given by the peak of the distribution. Let the horizontal axis represent the estimated mean. The true mean is indicated by the peak of the normal distribution. The LR test is based on a comparison of likelihood values. If a restriction, which is imposed on the unrestricted model, is valid, the value of the likelihood should not be reduced significantly. This test is based on two estimations, one unrestricted giving the value of the likelihood $\hat{L}_U$ and one restricted leading to $\hat{L}_R$. From these two values the likelihood ratio is defined as,

$$\lambda = \frac{\hat{L}_R}{\hat{L}_U}. \qquad (5.1)$$

This leads to the test statistic $(-2 \ln \lambda)$, which has a $\chi^2(R)$ distribution, where R is the number of restrictions.
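As an illustration of (5.1), consider testing the single restriction that the mean of a normal sample is zero. The sketch below estimates the unrestricted and the restricted model and forms $-2 \ln \lambda$. The simulated data, the function name `normal_loglik` and the hypothesis itself are assumptions chosen for the example. The p-value uses the identity $P(\chi^2(1) > c) = \mathrm{erfc}(\sqrt{c/2})$, which holds for one restriction only.

```python
import math
import numpy as np

def normal_loglik(mu, sigma2, x):
    """Log likelihood of an independent normal sample."""
    T = x.size
    return (-0.5 * T * np.log(2.0 * np.pi)
            - 0.5 * T * np.log(sigma2)
            - np.sum((x - mu) ** 2) / (2.0 * sigma2))

rng = np.random.default_rng(6)
x = rng.normal(0.3, 1.0, 200)      # illustrative sample with a non-zero true mean
T = x.size

# Unrestricted model: both mean and variance are estimated.
mu_u = x.mean()
s2_u = np.mean((x - mu_u) ** 2)
l_u = normal_loglik(mu_u, s2_u, x)

# Restricted model: the mean is fixed at zero, only the variance is estimated.
s2_r = np.mean(x ** 2)
l_r = normal_loglik(0.0, s2_r, x)

lr_stat = -2.0 * (l_r - l_u)                     # -2 ln(lambda), one restriction
p_value = math.erfc(math.sqrt(lr_stat / 2.0))    # upper tail of chi-square(1)

print(f"LR statistic: {lr_stat:.3f}, p-value: {p_value:.4f}")
```

In this normal example the statistic simplifies to T times the log of the ratio of the restricted to the unrestricted variance estimate, so the test asks whether imposing the restriction inflates the residual variance by more than chance would explain.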

The Wald test compares (squared) estimated parameters with their variances. In a linear regression, if the residual is $NID(0, \sigma^2)$, then $\hat{\beta} \sim N(\beta, var(\hat{\beta}))$, so $(\hat{\beta} - \beta) \sim N(0, var(\hat{\beta}))$, and a standard t-test will tell if $\beta$ is significant or not. More generally, if we have a vector of J normally distributed random variables $\hat{x} \sim N_J(\mu, \Sigma)$, then we have

$$(\hat{x} - \mu)' \Sigma^{-1} (\hat{x} - \mu) \sim \chi^2(J). \qquad (5.2)$$

The LM test starts from a restricted model and tests if the restrictions are valid. Here restrictions should be understood as a general concept. A model is restricted if it assumes homoscedasticity, no autocorrelation, etc. The test is formulated as,

$$LM = \left[\frac{\partial \ln L(\hat{\theta}_R)}{\partial \hat{\theta}_R}\right]' \left[I(\hat{\theta}_R)\right]^{-1} \left[\frac{\partial \ln L(\hat{\theta}_R)}{\partial \hat{\theta}_R}\right]. \qquad (5.3)$$

The formula looks complex but is in many cases extremely easy to apply. Consider the LM test for p:th order autocorrelation in the residuals $\hat{\epsilon}_t$,

$$\hat{\epsilon}_t = \gamma_1 \hat{\epsilon}_{t-1} + \gamma_2 \hat{\epsilon}_{t-2} + \ldots + \gamma_p \hat{\epsilon}_{t-p} + \nu_t. \qquad (5.4)$$

The LM test statistic for testing if the parameters $\gamma_1$ to $\gamma_p$ are zero amounts to estimating the equation with OLS and calculating the test statistic $TR^2$, distributed as $\chi^2(p)$ under the null of no autocorrelation. Similar tests can be formulated for testing various forms of heteroscedasticity.
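The $TR^2$ form of the autocorrelation test can be sketched directly from (5.4). The function `lm_autocorrelation` below is a hypothetical helper, not a library routine. As an assumption of this sketch it adds a constant to the auxiliary regression and uses the centered $R^2$. The two test series are simulated.

```python
import numpy as np

def lm_autocorrelation(resid, p):
    """T * R^2 from regressing resid_t on a constant and its first p lags."""
    e = np.asarray(resid, dtype=float)
    T = e.size - p                                   # usable observations
    y = e[p:]
    # Column i holds the series lagged i periods, aligned with y.
    lags = np.column_stack([e[p - i:-i] for i in range(1, p + 1)])
    X = np.column_stack([np.ones(T), lags])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ coef
    r2 = 1.0 - np.sum((y - fitted) ** 2) / np.sum((y - y.mean()) ** 2)
    return T * r2

rng = np.random.default_rng(7)
n = 500
white = rng.normal(size=n)                 # series without autocorrelation

ar1 = np.empty(n)                          # series with first order autocorrelation
ar1[0] = white[0]
for t in range(1, n):
    ar1[t] = 0.6 * ar1[t - 1] + white[t]

lm_white = lm_autocorrelation(white, 2)
lm_ar1 = lm_autocorrelation(ar1, 2)

print(f"LM statistic, white noise: {lm_white:.2f}")
print(f"LM statistic, AR(1) series: {lm_ar1:.2f}")
```

Each statistic is compared with the $\chi^2(2)$ distribution, whose 5 per cent critical value is about 5.99. The autocorrelated series gives a statistic far above that value.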

Tests can often be formulated in such a way that they follow both $\chi^2$ and F-distributions. In less than large samples the F-distribution is the better one to use. The general rule for choosing among tests based on the F or the $\chi^2$ distribution is to use the F distribution, since it has the better small sample properties.

If the information matrix is known (meaning that it is not necessary to estimate it), all three tests would lead to the same test statistic, regardless of the chosen distribution, $\chi^2$ or F. If all three approaches lead to the same test statistic, we would have $R_W = R_{LR} = R_{LM}$. However, when the information matrix is estimated we get the following relation between the tests, $R_W \geq R_{LR} \geq R_{LM}$.

Remember (1) that when dealing with limited samples the three tests might lead to different conclusions, and (2) that if the null is rejected the alternative can never be accepted. As a matter of principle, statistical tests only reject the null hypothesis. Rejection of the null does not lead to accepting the alternative hypothesis, it leads only to the formulation of a new null. As an example, in a test where the null hypothesis is homoscedasticity, the alternative is not necessarily heteroscedasticity. Tests are generally derived on the assumption that everything else is OK in the model. Thus, in this example, rejection of homoscedasticity could be caused by autocorrelation, non-normality, etc. The econometrician has to search for all possible alternatives.

Part II

RANDOM WALKS, WHITE NOISE AND ALL THAT

6.1 Different types of processes

This section looks at different types of stochastic time series processes that are important in economics and finance. A time series is a series where the data is ordered by time. A random variable ($\tilde{X}$) is a variable which can take on more than one value, and for each value it can take on there is a value between zero and one that describes the probability of observing that value. We distinguish between discrete and continuous random variables. Discrete random variables can only take on a finite number of outcomes. A continuous random variable can take on any value between $-\infty$ and $+\infty$. The mathematical model of the probabilities associated with a random variable is given by the distribution function $F(x)$, $F(x) = P(\tilde{X} \leq x)$. If we have a continuous random variable, we can define the probability density function of the random variable as $f(x) = dF(x)/dx$. Random variables are characterized by their probability functions and their moments.

First, second, third and fourth moments all describe the characteristics of a random variable. By estimating these we describe a random variable. All moments have direct implications for risk and return decisions. Mean = return, variance = risk, while skewness and kurtosis imply deviations from normality and might affect behavior. To be completed.

A stochastic time series process is then made up of a random variable that over time can take on more than one value.

We denote a stochastic process as $\{\tilde{X}_t\}_0^T$, indicating that it starts at time zero and continues to time T. To define a stochastic time series process we start with the random variable ($\tilde{X}_{t+i}$), which at time t can take on different values at the future periods $i = 1, 2, 3, \ldots, n$, where n might go to infinity. Often we will talk about the conditional expectation of ($\tilde{X}_{t+i}$): we want to estimate the most likely future value, given the information we have today. A stochastic time series process can be discrete or continuous. A discrete series only changes values at discrete time periods, while a continuous process changes, or can potentially change, values continuously and not only at discrete time intervals.

The conditional expectation is written as $E(\tilde{X}_{t+1} \mid I_t)$ or $E_t(\tilde{X}_{t+1})$. To formalize the use of conditional expectations, assume a probability space $(\Omega, \mathcal{F}, P)$, where $\Omega$ is the total sample space (or possible states of the world), $\mathcal{F}$ denotes the tribe of subsets of $\Omega$ that are outcomes (observations), and P is a probability measure associated with the outcomes. A very practical question in modeling is if there exists a simple mathematical form for associating outcomes with probabilities. Usually we will refer to the tribe of subsets $\mathcal{F}$ as the information set $I_t$. We will assume that memory is not forgotten by the decision makers, so the information set is increasing over time,

$$I_{t_0} \subseteq I_{t_1} \subseteq \ldots \subseteq I_{t_k} \subseteq I_{t_{k+1}} \subseteq \ldots$$

In a discrete time setting we refer to these increasing sets as an increasing sequence of sigma-fields. In a continuous time setting, where new information arrives continuously, rather than at discrete time intervals, the increasing information set is referred to as a filtration, or an increasing family of sigma-algebras. A very unofficial standard is to use $I_t$ for discrete time settings and $\mathcal{F}_t$ for continuous time settings. We can also say that the set $\{\mathcal{F}_t : t \geq 0\}$ is a filtration, representing an increasing family of sub-sigma-algebras on $\mathcal{F}$. Over time outcomes of $\tilde{X}_t$, $(x_1, x_2, \ldots, x_t)$, will be added to the increasing family of information sets. We refer to the observed process, $(x_1, x_2, \ldots, x_t)$, as adapted to the filtration $\mathcal{F}_t$. We can also say that if $(x_1, x_2, \ldots, x_t)$ is an adapted process, then for the sequence $\{x_t\}$, $\tilde{X}_t$ is a random variable with respect to $(\Omega, \mathcal{F})$, and for each t the value of $\tilde{X}_t$ is known as $x_t$.

A random variable is a white noise process if its expected mean is equal to zero,

$$E[\epsilon_t] = \mu = 0, \qquad (6.1)$$

its variance exists and is constant, $\sigma^2$, and there is no memory in the process so the autocorrelation function is zero,

$$E[\epsilon_t \epsilon_t] = \sigma^2, \qquad (6.2)$$

$$E[\epsilon_t \epsilon_s] = 0 \quad \text{for } t \neq s. \qquad (6.3)$$

In addition, the white noise process is supposed to follow a normal and independent distribution, $\epsilon_t \sim NID(0, \sigma^2)$. A standardized white noise has a distribution like $NID(0, 1)$. Dividing $\epsilon_t$ by $\sqrt{\sigma^2}$ gives $(\epsilon_t/\sigma) \sim NID(0, 1)$.

The independent normal distribution has some important characteristics. First, if we add normal random variables together, the sum will have a mean equal to the sum of the means of all variables. Thus, adding T standardized white noise variables together as $z_T = \sum_{t=1}^{T} (\epsilon_t/\sigma)$ forms a new variable with mean $E(z_T) = E(\epsilon_1/\sigma) + E(\epsilon_2/\sigma) + \ldots + E(\epsilon_T/\sigma) = (1/\sigma)[E(\epsilon_1) + E(\epsilon_2) + \ldots + E(\epsilon_T)] = 0$. Since each variable is independent, we have the variance as $\sigma_z^2 = \sigma_{z,1}^2 + \sigma_{z,2}^2 + \ldots + \sigma_{z,T}^2 = 1 + 1 + \ldots + 1 = T$. The random variable is distributed as $z_T \sim NID(0, T)$, with a standard deviation given as $\sqrt{T}$. As the forecast horizon for $z_T$ increases, a 95% forecast confidence interval also increases, with $\pm 1.96\sqrt{T}$.

In the same way, we can define the distribution, mean and variance during subsets of time. If $\epsilon_t \sim N(0, 1)$ is defined for the period of a year, the variable will be distributed over six months as $N(0, 1/2)$, with a standard deviation of $1/\sqrt{2}$, and over three months as $N(0, 1/4)$, with a standard deviation of $1/\sqrt{4}$. For any fraction $1/\tau$ of the year, the distribution becomes $NID(0, 1/\tau)$ and the standard deviation $1/\sqrt{\tau}$. This property of the variable, following from the assumption of independent distribution, is known as the Markov property. Given that $x_0$ is generated from an independent normal distribution $N(\mu, \sigma^2)$, the future value of $x_t$ at time $0 + T$ is distributed as $N(\mu T, \sigma^2 T)$.
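The statement that the sum of T standardized white noise terms has variance T, so that the forecast interval widens with the square root of the horizon, is easy to check by simulation. The values of sigma, T and the number of replications below are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(8)
sigma, T, n_rep = 2.0, 25, 100_000          # illustrative settings

# Each row holds T independent white noise draws with standard deviation sigma.
eps = rng.normal(0.0, sigma, size=(n_rep, T))
z_T = (eps / sigma).sum(axis=1)             # sum of T standardized shocks

mean_z = z_T.mean()                         # theory: 0
var_z = z_T.var()                           # theory: T
half_width = 1.96 * np.sqrt(T)              # 95% interval half-width
coverage = np.mean(np.abs(z_T) <= half_width)

print(f"mean {mean_z:.3f}, variance {var_z:.2f}, coverage {coverage:.3f}")
```

The simulated variance should be close to T = 25, and about 95 per cent of the simulated sums should fall within $\pm 1.96\sqrt{T}$.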

To sum up, it follows from the definition that a white noise process is not linearly predictable from its own past. The expected mean of a white noise, conditional on its history, is zero,

$$E[\epsilon_t \mid \epsilon_{t-1}, \epsilon_{t-2}, \ldots, \epsilon_1] = E[\epsilon_t] = 0. \qquad (6.4)$$

The process may, however, still be predictable by other variables, and by its own past using non-linear functions.

A process is called an innovation if it is unpredictable given some information set $I_t$. A process $y_t$ is an innovation process w.r.t. an information set if,

$$E[y_t \mid I_t] = 0, \qquad (6.5)$$

where the information set $I_t$ includes not only the history of $\epsilon_t$, but also all other information which might be of importance for explaining this process. Stating that a series is a white noise innovation process, with respect to some information set $I_t$, is a stronger requirement than a white noise process. It is also a stronger statement than saying that $\epsilon_t$ is a martingale difference process, because we add the assumption of a normal distribution. The martingale and the martingale difference processes were defined in terms of their first moments only. Creating a residual process that is a white noise innovation term is a basic requirement in the modelling process.

The normal distribution is central in econometric modeling. However, financial prices display two characteristics which make them unfit for a stochastic process based on the assumption of normal distributions. Stock prices cannot be negative, due to limited liability, and they tend to grow over time due to the time value of money. Thus, the distribution of stock prices is typically non-negative and skewed. The normal distribution on the other hand is symmetric and stretches from $-\infty$ to $+\infty$. A better alternative for modelling stock prices, and many other asset prices, is to assume a log normal distribution, which compared to the normal is only defined over $[0, \infty)$, and is right skewed, reflecting the fact that stock prices have a tendency to move up rather than down. Furthermore, the log normal distribution has the property that the log of a log normal random variable has a normal distribution. Thus, taking the log of log normal random stock prices transforms their distribution to a normal distribution.

Let $\tilde{S}_t$ be a random log normal stock price, with mean $\mu$ and variance $\sigma^2$. The log of $s_t$ is then distributed as $\ln s_t \sim N\left[\mu - \frac{\sigma^2}{2}, \sigma^2\right]$.

Given that $\tilde{S}_t$ has a log normal distribution, it follows that the distance between $\tilde{S}_t$ and $\tilde{S}_{t+n}$, measured in logs, is distributed as $\ln \tilde{S}_{t+n} - \ln \tilde{S}_t \sim N\left[\left(\mu - \frac{\sigma^2}{2}\right) n, \sigma^2 n\right]$.

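The non-parametric white noise can be used to define (or generate) autoregressive models (AR) and moving average models (MA). The AR(p) model is

$$y_t = \mu + a_1 y_{t-1} + \ldots + a_p y_{t-p} + \epsilon_t, \qquad (6.6)$$

or, using the lag operator L, where $L^i x_t = x_{t-i}$,

$$A(L) y_t = \mu + \epsilon_t, \qquad (6.7)$$

where $A(L) = 1 - a_1 L - \ldots - a_p L^p$. The roots of this polynomial inform about the time path of $y_t$. The moving average model of order q is,

$$y_t = \mu + \epsilon_t + b_1 \epsilon_{t-1} + \ldots + b_q \epsilon_{t-q}, \qquad (6.8)$$

$$y_t = \mu + B(L) \epsilon_t. \qquad (6.9)$$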

A special case of the AR(1) model is the random walk model,

$$x_t = x_{t-1} + \epsilon_t, \quad \text{where } \epsilon_t \sim NID(0, \sigma^2) \qquad (6.10)$$

is a white noise process. It follows that given the past of the series the best prediction we can use is the present value of the series, and that the first difference is nothing else than a white noise, $x_t - x_{t-1} = \Delta x_t = \epsilon_t$. The important factor is that the increments of the series are unpredictable from the series' own past. A random walk is non-stationary. By definition, it is integrated of order one, I(1). Taking the first difference of a random walk series produces a stationary I(0) (white noise) series.

A random walk has the property that today's value is the prediction of the variable's future values,

$$E(x_{t+1} \mid x_t, x_{t-1}, x_{t-2}, \dots, x_{t-n}) = E(x_{t+1} \mid x_t) = x_t, \qquad (6.11)$$

where n might be equal to infinity. This definition does not rule out the case that there are other variables that can be correlated with $x_t$ and thereby also predict $x_{t+1}$. We can also say that a random walk has an infinitely long memory. The mean is zero, and the variance and autocovariance are equal to $var(\tilde{X}_t) = \sigma^2 t$ and $Cov(\tilde{X}_t, \tilde{X}_{t-n}) = (t-n)\sigma^2$:

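$$E(x_t) = E\left(\sum_{i=1}^{t} \varepsilon_i\right) = 0, \qquad (6.12)$$

$$var(x_t) = E(x_t^2) = E\left[\left(\sum_{i=1}^{t} \varepsilon_i\right)^2\right] = \sum_{i=1}^{t}\sum_{j=1}^{t} E[\varepsilon_i \varepsilon_j] = \sigma^2 t, \qquad (6.13)$$

$$cov(x_t, x_{t-1}) = E(x_t x_{t-1}) = E\left[\left(\sum_{i=1}^{t} \varepsilon_i\right)\left(\sum_{j=1}^{t-1} \varepsilon_j\right)\right] = \sum_{i=1}^{t}\sum_{j=1}^{t-1} E[\varepsilon_i \varepsilon_j] = \sigma^2 (t-1). \qquad (6.14)$$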

The autocovariances for higher lag orders follow from this previous example. As can be seen these are non-stationary moments, since both are dependent on time (t). It follows that the autocorrelation function looks like

$$\rho_n = [(t-n)/t]^{1/2}, \qquad (6.15)$$

which approaches one for any fixed lag n as t grows. This is the infinite memory: all theoretical autocorrelations are equal to 1.0.
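As a worked check, the autocorrelation at lag n follows directly from the variance and autocovariance of the random walk stated earlier. The derivation below assumes a zero start value and white noise shocks with variance $\sigma^2$, so that the variance at time t is $\sigma^2 t$ and the autocovariance at lag n is $\sigma^2 (t-n)$:

```latex
\rho_n
  = \frac{\operatorname{cov}(x_t, x_{t-n})}
         {\sqrt{\operatorname{var}(x_t)\,\operatorname{var}(x_{t-n})}}
  = \frac{\sigma^2 (t-n)}{\sqrt{\sigma^2 t \cdot \sigma^2 (t-n)}}
  = \sqrt{\frac{t-n}{t}}
  \;\longrightarrow\; 1
  \quad \text{as } t \to \infty \text{ for fixed } n .
```

The unknown variance cancels, so the autocorrelation depends only on how far into the series we are (t) relative to the lag (n).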

By repeated backward substitution, the random walk can be written as

$$x_t = x_0 + \sum_{i=1}^{t} \varepsilon_i. \qquad (6.16)$$

Thus, a random walk is a sum of white noise errors from the beginning of the series ($x_0$). Hence, the value of today is dependent on shocks built up from the beginning of the series. All shocks in the past are still affecting the series today. Furthermore, all shocks are equally important. The process formed by $\sum_{i=1}^{t} \varepsilon_i$ is called a stochastic trend. In contrast to a deterministic trend, the stochastic trend changes its slope in a random way period by period. Ex post, a stochastic trend might look like a deterministic trend. Thus, it is not really possible to determine whether a variable is driven by a stochastic or a deterministic trend, or a combination of both.
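The time-dependent moments of the random walk can be checked by simulation. The sketch below is an illustration only: the function name, the number of paths and the choices of t and n are assumptions. It builds many random walks as cumulative sums of white noise with a zero start value and compares the sample variance, autocovariance and autocorrelation with $\sigma^2 t$, $\sigma^2 (t-n)$ and $[(t-n)/t]^{1/2}$.

```python
import numpy as np

def simulate_random_walks(n_paths, n_steps, sigma=1.0, seed=0):
    """Return an (n_paths, n_steps) array with x_t = eps_1 + ... + eps_t and x_0 = 0."""
    rng = np.random.default_rng(seed)
    eps = rng.normal(0.0, sigma, size=(n_paths, n_steps))
    return np.cumsum(eps, axis=1)

paths = simulate_random_walks(n_paths=20_000, n_steps=100)

t, n = 100, 20                                  # illustrative date and lag
x_t, x_tn = paths[:, t - 1], paths[:, t - n - 1]

var_t = x_t.var()                               # theory: sigma^2 * t
cov_tn = np.mean(x_t * x_tn)                    # theory: sigma^2 * (t - n), since the mean is zero
rho_n = np.corrcoef(x_t, x_tn)[0, 1]            # theory: sqrt((t - n) / t)

print("variance:", var_t)
print("autocovariance:", cov_tn)
print("autocorrelation:", rho_n, "theory:", np.sqrt((t - n) / t))
```

Note that the moments are computed across simulated paths at fixed dates, which is what the theoretical expectations refer to; moments computed along a single non-stationary path do not estimate the same quantities.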

If we add a constant term to the model we get a random walk with a drift,

$$x_t = \mu + x_{t-1} + \varepsilon_t, \qquad (6.17)$$

where the constant $\mu$ represents the drift term. In this process $x_t$ is driven by both a deterministic and a stochastic trend. If we perform the same backward substitution as above, we get

$$x_t = \mu t + \sum_{i=1}^{t} \varepsilon_i + x_0, \qquad (6.18)$$

which shows that the variable follows a linear deterministic trend ($\mu t$) and a stochastic trend in the long run. In the long run the deterministic trend will dominate the stochastic trend and determine the path of $x_t$. Taking first differences leads to

$$\Delta x_t = \mu + \varepsilon_t. \qquad (6.19)$$

The expected value of a driftless random walk, for any future date, is always today's value, $E(x_{t+n} \mid x_t) = x_t$. For a random walk with a drift the expected value is $E(x_{t+n} \mid x_t) = \mu n + x_t$.

At a first glance the random walk model might seem extreme. Is it possible to motivate that a series has an infinite memory, so that shocks remain in the series forever? The answer is yes. The most common example is that of innovations leading to economic growth, which then spills over into other economic variables. Innovations leading to economic growth do not occur at fixed intervals, nor is every single invention equally important. Over time, innovations will occur at random intervals and some inventions will be more important than others. The outcome is that productivity and economic growth are driven by a stochastic trend, just as described by a random walk. In empirical work it is common to find variables that behave like random walks.

Given forward looking behavior of economic agents, it is often possible to construct economic models where transformed variables will behave like random walks. In a forward looking world agents will use all relevant information when they determine today's prices. One important characteristic follows from this, namely that today's price is the best prediction of future prices. However, the relationship between today's price and the predicted future price is more complex. We return to this issue below, when we talk about martingales.


A random walk process is also a series integrated of order one; it is also called a unit root process, and it contains a stochastic trend. Furthermore, a random walk process can also be embedded in another process, say an ARIMA(p, d, q) process. The problem is that it is difficult to do inference on random walk variables (and integrated variables) because the estimated parameter on the lagged term will not follow a standard normal distribution. Hence, ordinary t, chi-square and F distributions are not suitable for inference. Parameter estimates will generally be asymptotically unbiased, but their standard errors and variances do not follow standard distributions.

For instance, a common t-test cannot be used to test if a = 1 in the regression

$$x_t = a x_{t-1} + \varepsilon_t. \qquad (6.20)$$

The distribution of the test statistic $[(\hat{a} - a)/\mathrm{st.dev}(\hat{a})]$ will be skewed to the left, and thus depart from the Student t-table. Just as in any autoregressive model the estimate $\hat{a}$ will be biased downward. The term $(\hat{a} - a)$, however, becomes asymptotically a ratio between two random variables, which will lead to a second order bias in the estimation of the variance as $T \to \infty$. In this case, with a unit root process, the ratio consists of random variables which in turn are functions of Wiener processes. In this situation one common approach is to use the so-called Dickey-Fuller test in combination with simulated distributions.
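The idea of a simulated distribution can be sketched with a small Monte Carlo experiment. The code below is a minimal illustration, not the procedure used to produce published Dickey-Fuller tables: the function name, the sample size, the number of replications and the choice of a regression without constant are all assumptions. It generates random walks, estimates a by OLS in each replication and collects the t-statistic for the null a = 1.

```python
import numpy as np

def df_t_statistics(n_reps=5000, T=200, seed=0):
    """t-statistics for H0: a = 1 in x_t = a x_{t-1} + eps_t, when the data are random walks."""
    rng = np.random.default_rng(seed)
    x = np.cumsum(rng.normal(size=(n_reps, T + 1)), axis=1)      # one random walk per row
    lag, cur = x[:, :-1], x[:, 1:]
    a_hat = np.sum(lag * cur, axis=1) / np.sum(lag ** 2, axis=1)  # OLS without a constant
    resid = cur - a_hat[:, None] * lag
    s2 = np.sum(resid ** 2, axis=1) / (T - 1)                     # residual variance
    se = np.sqrt(s2 / np.sum(lag ** 2, axis=1))                   # standard error of a_hat
    return a_hat, (a_hat - 1.0) / se

a_hat, t_stats = df_t_statistics()
q05 = np.quantile(t_stats, 0.05)

print("mean of a_hat:", a_hat.mean())                  # below one: downward bias
print("mean of t-statistic:", t_stats.mean())          # negative: skewed to the left
print("simulated 5% quantile:", q05)                   # compare with -1.645 for the normal
print("share below -1.645:", np.mean(t_stats < -1.645))
```

In runs of this kind the simulated 5 percent quantile typically lies near −1.95, the tabulated Dickey-Fuller value for the case without constant, rather than near the −1.645 implied by the normal distribution, so a test based on the standard table would reject a true unit root too often.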

Testing for a unit root (a = 1) is one aspect of testing if a variable is a random walk. Another aspect, if it is not possible to reject a unit root, is to test if the residual is $NID(0, \sigma^2)$. Campbell, Lo and MacKinlay (1997, Ch. 1) show how you can test for the absence of autocorrelation when dealing with the null hypothesis of a random walk. Unfortunately, it is quite common in the literature to conclude that a series is a random walk (meaning not rejecting the null of a unit root) only on the basis of unit root testing, forgetting about the properties of the residual term, which under a random walk is simply the first difference.

When testing for a random walk in limited samples it is extremely difficult to distinguish between a random walk and a stationary AR(1) model with a parameter of, say, 0.99.

A problem with random walks, as well as all variables which include stochastic trends, is that it is in general not possible to use standard distributions for inference. Parameter estimates will generally be asymptotically unbiased, but their standard deviations and variances do not follow standard distributions.

A random variable is said to be a martingale if the present observation is the best prediction of all future values. Let $\{\tilde{X}_t\}_{t=1}^{\infty}$ be a process of the random variable $\tilde{X}_t$. We say that the variable is a martingale with respect to the information set $I_t$,¹ if the expected value of $\tilde{X}_{t+s}$ is equal to the present value of $\tilde{X}_t$,

$$E\left[\tilde{X}_{t+s} \mid I_t\right] = x_t \quad \text{for } s > 0. \qquad (6.21)$$

¹ Alternatively, it is possible to define the information set at time t−1, and write the definition as $E\left[\tilde{X}_t \mid I_{t-1}\right] = \tilde{X}_{t-1}$.

Given the information set, all information relevant for predicting $\tilde{X}_{t+s}$ is contained in today's value of $\tilde{X}_t$. Thus, the best prediction of $\tilde{X}_{t+1}$ is $x_t$, and the value of today is the best prediction of all periods in the future. The information set might include the history of $\tilde{X}_t$ as well as all other information that might be of relevance for predicting $\tilde{X}_{t+s}$. The definition of a martingale is always relative, since we have the freedom of defining different information sets. If $\tilde{X}_t$ is a martingale with respect to the information set $I_t'$, it might not be a martingale with respect to another information set $I_t''$ unless the two sets are identical.

We can now continue and define the martingale difference process as the expected difference between $\tilde{X}_{t+s}$ and $\tilde{X}_t$,

$$E\left[(\tilde{X}_{t+s} - \tilde{X}_t) \mid I_t\right] = E\left[\tilde{X}_{t+s} \mid I_t\right] - x_t = 0. \qquad (6.22)$$

If a process is a martingale difference process, changes in the process are unpredictable from the information set.

The sub-martingale and the super-martingale are two versions of martingale processes. A sub-martingale is defined as

$$E\left[\tilde{X}_{t+s} \mid I_t\right] \geq x_t,$$

which says that, on average, the expected value is growing over time. A super-martingale is defined as

$$E\left[\tilde{X}_{t+s} \mid I_t\right] \leq x_t,$$

which says that the expected value of $\tilde{X}_{t+s}$ is given by $\tilde{X}_t$ but is, on average, declining over time.
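A rough numerical illustration of these definitions is to check whether the average change in a series depends on a simple piece of past information. The sketch below uses assumptions that are not in the text: the function name is hypothetical, the information used is only the sign of the most recent change, and the drift of 0.1 is an arbitrary illustrative value. For a driftless random walk the average next change is close to zero whatever happened last period; with a positive drift it is positive, as for a sub-martingale.

```python
import numpy as np

def mean_increment_given_past(x):
    """Average next-period change, split by the sign of the most recent change."""
    dx = np.diff(x)
    prev, nxt = dx[:-1], dx[1:]
    return nxt[prev > 0].mean(), nxt[prev <= 0].mean()

rng = np.random.default_rng(0)
eps = rng.normal(size=200_000)
martingale = np.cumsum(eps)                  # driftless random walk
sub_martingale = np.cumsum(0.1 + eps)        # random walk with drift 0.1 (illustrative)

print("driftless:", mean_increment_given_past(martingale))
print("with drift:", mean_increment_given_past(sub_martingale))
```

This is only a check against one particular information set; as stressed above, the martingale property is always relative to the information set chosen.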

Martingales are well known in the financial literature. If the agents on a financial market use all relevant information to predict the yields of financial assets, the prices of these assets will, under certain special conditions, behave like martingales. The random walk hypothesis of asset prices does not come from finance theory; it is based on empirical observations, and is mainly a hypothesis about the empirical behavior of asset prices which lacks a theoretical foundation. A random walk process is a martingale, but also includes statements about distributions. If we compare with the random walk we have the model $x_t = x_{t-1} + \varepsilon_t$, where $\varepsilon_t$ is a normally distributed white noise process. The latter is a stronger condition than assuming a martingale process. A random walk with a drift, $x_t = \mu + x_{t-1} + \varepsilon_t$ with $\mu > 0$, is a sub-martingale, since the deterministic trend will increase the expectation over time, $E(\tilde{X}_{t+1} \mid x_t) = \mu + x_t$. Let us now turn to finance theory. Theory states that the price of an asset ($P_{t+1}$) at time t + 1 is given by the price at t plus a risk-adjusted discount factor r. If we assume, for simplicity, that the discount factor is a constant we get $P_{t+1} = (1 + r)P_t$. Asset prices are therefore not driftless random walks, or martingales. The process described by theory is $\ln P_{t+1} = \ln(1 + r) + \ln P_t + \varepsilon_{t+1}$, which is a sub-martingale given, in this case, a constant discount factor. If we would like to say that asset prices are martingales we must either transform the price process according to $[P_{t+1}/(1 + r)]$, or we must include the risk-adjusted discount factor in the information set.²

Thus, the expected value of an asset price is, by definition, $E(P_{t+1}) = E(1 + r)P_t$. If the discount factor (and risk) is a constant (g) we get $E(P_{t+1}) = g + P_t$, which is a random walk with drift. If the risk premium is a time-varying stochastic variable (G […] the random walk.

² It is obvious that we can transform a variable into a martingale by subtracting elements from the process by conditioning or direct calculation. In fact most variables can be transformed into a martingale in this way. An alternative way of transforming a variable into a martingale is to transform its probability distribution. In this method you look for a probability distribution which is "equivalent" to the one generating the conditional expectations. This type of distribution is called an equivalent martingale distribution.

It is important to distinguish between martingales and random walks. Financial theory ends in statements about the expected mean of a variable with respect to a given information set. A random walk is defined in terms of its own past only. Thus, saying that a variable is a random walk does not exclude the case that there exists an information set for which the variable is not a martingale.

Furthermore, the residuals in a random walk model are by definition independent, if we assume them to be white noise. But a martingale describes the behavior of the first moment of a random variable. It does not imply independence between the higher moments of the series. If we model a martingale by a first order autoregressive process, we might find that the errors are dependent through higher moments. The variance of $\varepsilon_t$ is then not $\sigma^2$, but a function of its own past, like

$$\varepsilon_t^2 = \alpha_0 + \alpha_1 \varepsilon_{t-1}^2 + \nu_t, \qquad (6.23)$$

where $\nu_t$ is a white noise process. This is a first order ARCH model, ARCH(1) (Auto Regressive Conditional Heteroscedasticity), which implies that a large shock to the series is likely to be followed by another large shock. In addition, it implies that the residuals are not independent of each other.
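The difference between being unpredictable in the mean and being independent can be illustrated by simulating errors whose conditional variance depends on the previous squared error. The sketch below is an illustration under assumptions: the function names, the symbols alpha0 and alpha1 and their values are not from the text, and the conditional variance is written as h_t = alpha0 + alpha1 * eps_{t-1}^2. The simulated errors are close to uncorrelated, while their squares are clearly autocorrelated.

```python
import numpy as np

def simulate_arch1(n, alpha0, alpha1, seed=0):
    """eps_t = sqrt(h_t) * z_t with h_t = alpha0 + alpha1 * eps_{t-1}^2 and z_t ~ N(0, 1)."""
    rng = np.random.default_rng(seed)
    z = rng.normal(size=n)
    eps = np.zeros(n)
    h = alpha0 / (1.0 - alpha1)          # start at the unconditional variance
    for t in range(n):
        eps[t] = np.sqrt(h) * z[t]
        h = alpha0 + alpha1 * eps[t] ** 2   # conditional variance for the next period
    return eps

def autocorr(x, lag=1):
    """Sample autocorrelation at the given lag."""
    x = x - x.mean()
    return float(np.dot(x[:-lag], x[lag:]) / np.dot(x, x))

eps = simulate_arch1(n=50_000, alpha0=0.5, alpha1=0.5)   # illustrative parameter values

print("autocorrelation of eps:", autocorr(eps))          # close to zero
print("autocorrelation of eps^2:", autocorr(eps ** 2))   # clearly positive
```

A series built from such errors can still be a martingale, since the level of the error remains unpredictable, but it is not a random walk with independent increments.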

The conclusion is that we must be careful when reading articles which claim that the exchange rate, or some other variable, should be, or is, a random walk. Often what the authors really mean is that the variable is a martingale, conditional on some information.

The martingale property is directly related to the efficient market hypothesis (EMH), which sets out the conditions under which changes in asset prices become unpredictable given different types of information.

Markov³ processes represent a general type of series with the property that the value at time t contains all information necessary to form probability assessments of all future values of the variable. Compared with the martingale property above, this property is more far reaching. The martingale property is concerned with the conditional expectation of a variable, and not with the actual distribution function and the higher moments of the variable. Markov processes and the associated Markov property are important because they help us to form stochastic time series processes. In economics and finance we like to explain how expectations are generated and how expectations affect the outcome of observed prices and quantities on various markets.

In particular, in financial economics and the pricing of derivatives, we like to model asset prices as continuous stochastic processes. Once we can trace the price of an asset continuously over time into the future, we can also determine the price of derivatives through replication and arbitrage. In addition, we learn how to use derivatives to continuously hedge risky positions.⁴

To predict or generate future possible paths of a Markov variable, we only need to know the most recent value, or the most recent values, of the variable. That is, we do not need the history of the variable to learn how it behaves, nor do we need to know actual values/observations of the future. The future of the series can be generated from its conditional past.

³ Markov is known for a number of results, including the so-called Markov estimates that prove the equality between OLS and MLE.

⁴ Recall that the definition of a derivative asset is a financial contract that (1) derives its value from some underlying asset, and (2) at the time of expiration has exactly the same price as the underlying asset.

Let $F(x_1, x_2, \dots, x_t)$ be the distribution function of the random variable $\tilde{X}_t$. There are 1, 2, ..., t observations of the series, where t might be equal to infinity. For each observation ($x_i$) there is a probability statement, $F(x_1, x_2, \dots, x_t) = \mathrm{Prob}(\tilde{X}_1 \leq x_1, \tilde{X}_2 \leq x_2, \dots, \tilde{X}_t \leq x_t)$. A discrete time Markov process is characterized by the following property,

$$\mathrm{Prob}(\tilde{X}_{t+s} \leq x_{t+s} \mid x_1, x_2, \dots, x_t) = \mathrm{Prob}(\tilde{X}_{t+s} \leq x_{t+s} \mid x_t), \qquad (6.24)$$

where s > 0. The expression says that all probability statements about future values of the random variable $\tilde{X}_{t+s}$ are only dependent on the value the variable takes at time t, and do not depend on earlier realizations. By stating that a variable is a Markov process we put a restriction on the memory of the process.

The AR(1) model, and the random walk, are first-order Markov processes,

$$x_t = a_1 x_{t-1} + \varepsilon_t, \quad \text{where } \varepsilon_t \sim NID(0, \sigma^2). \qquad (6.25)$$

Given that we know that $\varepsilon_t$ is a white noise process, $NID(0, \sigma^2)$, and can observe $x_t$, we know all there is to know about $x_{t+1}$, since $x_t$ contains all information about the future. In practical terms, it is not necessary to work with the whole series, only a limited present. We can also say that the future of the process, given the present, is independent of the past. For a first order Markov process, the expected value of $\tilde{X}_{t+1}$, given all its possible present and historical values $\tilde{X}_t, \tilde{X}_{t-1}, \tilde{X}_{t-2}, \dots$, can be expressed as

$$E\left[\tilde{X}_{t+1} \mid \tilde{X}_t, \tilde{X}_{t-1}, \tilde{X}_{t-2}, \dots, \tilde{X}_1\right] = E\left[\tilde{X}_{t+1} \mid \tilde{X}_t\right]. \qquad (6.26)$$
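The first order Markov property of the AR(1) model can be illustrated with a regression: once the present value is included, an extra lag should not help to predict the next value. The sketch below is an illustration only; the function name, the coefficient 0.8 and the sample size are assumptions, and the check concerns the conditional mean rather than the whole conditional distribution.

```python
import numpy as np

def fit_two_lags(x):
    """OLS of x_{t+1} on a constant, x_t and x_{t-1}; returns the three coefficients."""
    y = x[2:]
    X = np.column_stack([np.ones(len(y)), x[1:-1], x[:-2]])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

rng = np.random.default_rng(0)
eps = rng.normal(size=100_000)
x = np.zeros_like(eps)
for t in range(1, len(x)):
    x[t] = 0.8 * x[t - 1] + eps[t]       # AR(1) with a_1 = 0.8 (illustrative value)

const, coef_lag1, coef_lag2 = fit_two_lags(x)
print("coefficient on x_t:", coef_lag1)       # close to 0.8
print("coefficient on x_{t-1}:", coef_lag2)   # close to zero
```

For a second order process the coefficient on the extra lag would not vanish, which is the regression counterpart of needing more than the present value to predict the future.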

Thus, the conditional expectation of a first order Markov process, like that of a martingale, depends only on the present value; the process is itself a martingale when this expectation equals $x_t$, as for the random walk. Typically, the value of $\tilde{X}_t$ is known at time t as $x_t$. The Markov property is a very convenient property if we want to build theoretical models describing the continuous evolution of asset prices. We can focus on the value today, and generate future time series, irrespective of the past history of the process. Furthermore, at each period in the future we can easily determine an "exact" future value, which is the "equilibrium" price for that period.

The white noise process, as an example, is a Markov process. This follows from the fact that we assumed that each $\varepsilon_t$ was independent from its own past, and future. One outcome of the assumption of a normal and independent process was that we could relatively easily form predictions and confidence intervals given only the value of $\varepsilon_t$ today.

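The definition of a Markov process can be extended to an m:th order Markov process, for which we have

$$E\left[\tilde{X}_{t+1} \mid \tilde{X}_t, \tilde{X}_{t-1}, \tilde{X}_{t-2}, \dots, \tilde{X}_1\right] = E\left[\tilde{X}_{t+1} \mid \tilde{X}_t, \tilde{X}_{t-1}, \tilde{X}_{t-2}, \dots, \tilde{X}_{t-m}\right]. \qquad (6.27)$$

In this case more than the "present" value ($\tilde{X}_t$) is needed to predict the future.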

Consider the random walk model, $x_t = x_{t-1} + \varepsilon_t$, and assume that the distance between t and t − 1 becomes smaller and smaller. As the distance between the observations gets smaller the function will in the end get so close to a continuous function that it becomes indistinguishable from a function in continuous time, $x(t) = x(t-1) + \varepsilon(t)$. This takes us to the random walk in continuous time, known as a Brownian motion or Wiener process. This section introduces Brownian motions (Wiener processes), geometric Brownian motion, jump diffusion models and the Ornstein-Uhlenbeck process.

There are (at least) two very important reasons for studying Wiener processes. The first is that the limiting distributions of most non-stationary variables in economics and finance are given as functions of a Brownian motion. It is this knowledge that helps us to understand the distribution of estimates based on non-stationary variables. The second reason for learning about Brownian motions is that they play an important role in modeling asset prices in finance.

A word of warning: though Brownian motions have nice mathematical properties, it is not necessarily so that they also fit given data series better. Normal discrete empirical modelling will take you a long way.

The random walk is defined in discrete time. The intuition behind the random walk and the Brownian motion is as follows. If we let the steps between t and t − 1 become infinitely small, the random walk can be said to converge to a Brownian motion (or Wiener process). As the distance between t and t − 1, alternatively between t and t + 1, shrinks, it becomes harder and harder to distinguish between a discrete time process and a continuous time process. In the end, the difference will be so small that it will not matter.

These processes have a long history. The Brownian motion was named after a Scottish botanist, Robert Brown, who in 1827 observed that small particles immersed in a liquid exhibited ceaseless irregular motion. Brown himself, however, named a few persons who had observed this phenomenon before him. In 1900 a French mathematician named Bachelier described the random variation in stock prices when he wanted to explain option prices. In 1905 Einstein gave a physical explanation of this motion in terms of the movement of molecules. Finally, Norbert Wiener gave the process a rigorous mathematical treatment in a series of papers between 1918 and 1923.

Is there a difference between what we call a Wiener process and a Brownian motion? In practice the answer is no. The two terms can be, and are, used interchangeably. If you look at the details you will find that the Brownian motion has normally distributed increments. The Wiener process, on the other hand, is explicitly assumed to be a martingale. No such statement is made for the Brownian motion.⁵ In practice, these differences mean nothing (for more information search for the Lévy theorem). In econometrics there is a tendency to use Wiener processes to represent univariate processes and Brownian motion for multivariate processes.

The most important characteristic of a Brownian motion is that all increments are independent, and not predictable from the past. Thus the Brownian motion can be said to be a martingale and it fulfills the Markov property. The latter means that the distribution of future values at (t + dt) depends only on the current value of x(t). This is a good characteristic of models describing uncertainty, in particular situations when nature is evolving as a function of random steps that we cannot predict. The further we look into the future, the larger the number of random changes gets, and probability statements about future events get harder and harder.

⁵ See Neftci, Salih (2000), An Introduction to the Mathematics of Financial Derivatives, 2nd ed., Academic Press, Amsterdam.

A generalized (arithmetic) Brownian motion is written as

$$dx_t = \mu\, dt + \sigma\, dW_t, \qquad (6.28)$$

where $dx_t$ is the change in x over the time interval dt. This can be written as $dx_t = x(t + dt) - x(t)$. The parameters $\mu$ and $\sigma$ are real numbers (constants) where $\sigma$ is strictly positive. As in a random walk the term $\mu\, dt$ represents the drift and $\sigma\, dW$ can be said to add a stochastic noise to the series. W represents a standardized Wiener process, or Brownian motion, such that dW represents the differential of the Brownian motion, and $dW_t = W(t + dt) - W(t)$ has a normal distribution with mean zero and variance equal to dt.

It is easy to see that $\mu\, dt$ represents a drift term. Take the expected value of the process, $E(dx_t) = \mu\, dt + \sigma \cdot 0$; both $\mu$ and dt are non-stochastic, and dW has an expected value of zero. It follows that

$$\mu = \frac{1}{dt} E(dx_t)$$

represents the average change in x per unit of time. Of course, if $\mu = 0$ we have a driftless random walk in continuous time, $E(dx_t) = \sigma E(dW_t) = 0$.

The variance is $Var(dx_t) = \sigma^2 Var(dW) = \sigma^2 dt$. Not shown here is that the changes in x ($dx_t$) are independent and stationary.
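The drift and variance results can be checked numerically by approximating the process with many small normal increments, each with mean equal to the drift times the step length and variance equal to the squared volatility times the step length. The sketch below is an illustration under assumptions: the function name, the parameter values and the step size are not from the text. Summing the increments up to a horizon T should give a mean change of about the drift times T and a variance of about the squared volatility times T.

```python
import numpy as np

def simulate_abm(x0, mu, sigma, T, n_steps, n_paths, seed=0):
    """Sum small steps dx = mu*dt + sigma*sqrt(dt)*eps, eps ~ N(0, 1); returns x at time T."""
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    eps = rng.normal(size=(n_paths, n_steps))
    dx = mu * dt + sigma * np.sqrt(dt) * eps     # drift part plus stochastic noise
    return x0 + dx.sum(axis=1)

# Illustrative parameter values.
x_T = simulate_abm(x0=0.0, mu=0.5, sigma=2.0, T=4.0, n_steps=200, n_paths=20_000)

print("mean of x(T):", x_T.mean())       # theory: mu * T = 2.0
print("variance of x(T):", x_T.var())    # theory: sigma^2 * T = 16.0
```

Because the increments are independent and normal, the result does not depend on how finely the interval is divided; changing the number of steps only changes the size of each increment.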

In terms of the change over a specific (possibly) observable time period we need to introduce the notation $\Delta t$ to represent the change over some fraction of time t. By using this notation we can let t be a year or a month, and then by changing $\Delta$ we can let the length of the period become smaller and smaller. The change due to the deterministic trend is written, per unit of time, as $\mu \Delta t$. The stochastic noise that we add to dx over a given interval is written as $\sigma \varepsilon_t \sqrt{\Delta t}$, where $\varepsilon_t \sim NID(0, 1)$. In the limit, as $\Delta \to 0$, we have that $\Delta t \to dt$. In terms of small intervals the Brownian motion becomes

$$\Delta x_t = \mu \Delta t + \sigma \varepsilon_t \sqrt{\Delta t}, \qquad (6.29)$$

but there is a better way to see what happens. As we study a standardized Brownian motion/Wiener process W(t) over the interval [0, T], we will find that we can divide this interval into segments $t_i - t_{i-1}$,

$$0 = t_0 < t_1 < t_2 < \dots < t_i < \dots < t_n = T. \qquad (6.30)$$

Assume that there is a random variable $\Delta W_{t_i}$ that takes on either the value $+\sqrt{h}$ or $-\sqrt{h}$, where $h = t_i - t_{i-1}$ is the length of a segment. Furthermore, assume that $\Delta W_{t_i}$ is independent of $\Delta W_{t_j}$ for $i \neq j$, so that each increment is uncorrelated with other increments. The Wiener process is now defined as the sum of $\Delta W_{t_i}$ as $n \to \infty$, which is the same as saying that as the interval [0, T] is divided into finer and finer segments, we have

$$W(t) = \sum_{i=1}^{n} \Delta W_{t_i} \quad \text{as } n \to \infty. \qquad (6.31)$$

An extension of this, if $\varepsilon_t \sim NID(0, \sigma^2)$, is that the sum of $\Delta W_t = \varepsilon_t/(\sigma\sqrt{T})$ will also converge to a Wiener process. Thus, the sum of a standardized white noise will also converge to a standardized Wiener process. This result is crucial for the understanding of the distribution of a random walk and other "unit root" variables.
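The convergence of a standardized sum of white noise can be illustrated by simulation. The sketch below rests on assumptions that are not in the text: the function name, the values of T and the shock standard deviation, and the scaling of each shock by the standard deviation times the square root of T. If the scaled partial sums behave like a standard Wiener process on [0, 1], the variance at the fraction r of the sample should be close to r, and non-overlapping increments should be uncorrelated.

```python
import numpy as np

def scaled_partial_sums(T, n_paths, sigma=3.0, seed=0):
    """W_T(r) = sum_{t <= rT} eps_t / (sigma * sqrt(T)) for r = 1/T, 2/T, ..., 1."""
    rng = np.random.default_rng(seed)
    eps = rng.normal(0.0, sigma, size=(n_paths, T))
    return np.cumsum(eps, axis=1) / (sigma * np.sqrt(T))

W = scaled_partial_sums(T=400, n_paths=10_000)

var_half = W[:, 199].var()       # r = 0.5, theory: 0.5
var_one = W[:, -1].var()         # r = 1.0, theory: 1.0
# Covariance between W(0.5) and the later increment W(1) - W(0.5), theory: 0
cov_incr = np.mean(W[:, 199] * (W[:, -1] - W[:, 199]))

print("variance at r = 0.5:", var_half)
print("variance at r = 1.0:", var_one)
print("covariance of non-overlapping increments:", cov_incr)
```

The scaling removes both the unknown shock variance and the sample size, which is why the same limiting process appears regardless of the units of the original series.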

6.9.1

The arithmetic Brownian motion is not well suited for asset prices as their changes seldom display a normal distribution. The log of asset prices, and returns, are better described with a normal distribution. This takes us to the geometric Brownian motion

$$\frac{dx_t}{x_t} = \mu\, dt + \sigma\, dW_t.$$

What happens here is that we assume that $\ln x_t$ has a normal distribution, meaning that $x_t$ follows a log normal distribution, and that $\mu\, dt + \sigma\, dW_t$ is a normal variable. Ito's lemma can be used to show that

$$d \ln x_t = \left(\mu - \frac{\sigma^2}{2}\right) dt + \sigma\, dW_t.$$

The expected value of the geometric Brownian motion is $E(dx_t/x_t) = \mu\, dt$, and the variance is $Var(dx_t/x_t) = \sigma^2 dt$.
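The role of the correction term in the log equation can be seen in a simulation. The sketch below is an illustration under assumptions: the function name and the parameter values are not from the text, and terminal values are generated from the log form with drift equal to the growth rate minus half the squared volatility. The mean of the log grows at that corrected rate, while the mean of the level grows at the full growth rate.

```python
import numpy as np

def simulate_gbm_terminal(x0, mu, sigma, T, n_paths, seed=0):
    """Terminal values x_T from d ln x = (mu - sigma^2/2) dt + sigma dW (exact in logs)."""
    rng = np.random.default_rng(seed)
    w_T = rng.normal(0.0, np.sqrt(T), size=n_paths)      # W(T) ~ N(0, T)
    return x0 * np.exp((mu - 0.5 * sigma ** 2) * T + sigma * w_T)

# Illustrative parameter values.
x_T = simulate_gbm_terminal(x0=100.0, mu=0.08, sigma=0.2, T=1.0, n_paths=200_000)

print("mean of ln x_T:", np.log(x_T).mean())   # theory: ln(100) + (0.08 - 0.02)
print("mean of x_T:", x_T.mean())              # theory: 100 * exp(0.08)
print("smallest x_T:", x_T.min())              # prices stay positive
```

Working in logs keeps the simulated prices positive by construction, which is the property that motivated the log normal assumption for asset prices.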

There are several ways in which the model can be modified to better suit real world asset prices. One way is to introduce jumps in the process, so-called "jump diffusion models". This is done by adding a Poisson process to the geometric Brownian motion,

$$\frac{dx_t}{x_t} = \mu\, dt + \sigma\, dW_t + U_t\, dN(\lambda),$$

where $U_t$ is a normally distributed random variable and $N_t$ represents a Poisson process with intensity $\lambda$, to account for jumps in the price process.

The random walk model is good for asset prices, but not for interest rates. The movements of interest rates are more bounded than those of asset prices. In this case the so-called Ornstein-Uhlenbeck process provides a more realistic description of the dynamics,

$$dr_t = \alpha (b - r_t)\, dt + \sigma\, dW_t.$$

Thus the idea behind the Ornstein-Uhlenbeck process is that it restricts the movements of the variable (r) to be mean reverting, or to stay in a band, around b, where b can be zero.
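Mean reversion can be illustrated with a simple discretized simulation. The sketch below is an illustration under assumptions: the function name, the speed-of-adjustment parameter alpha, the volatility sigma and all numerical values are not from the text, and the process is approximated by small Euler steps. Starting well above b, the simulated rate is pulled back and then fluctuates in a band around b.

```python
import numpy as np

def simulate_ou(r0, alpha, b, sigma, dt, n_steps, seed=0):
    """Euler scheme for dr = alpha * (b - r) * dt + sigma * dW."""
    rng = np.random.default_rng(seed)
    shocks = rng.normal(0.0, np.sqrt(dt), size=n_steps)   # dW ~ N(0, dt)
    r = np.empty(n_steps + 1)
    r[0] = r0
    for t in range(n_steps):
        r[t + 1] = r[t] + alpha * (b - r[t]) * dt + sigma * shocks[t]
    return r

# Illustrative values: start at 10 percent, long-run level 3 percent.
rates = simulate_ou(r0=0.10, alpha=2.0, b=0.03, sigma=0.02, dt=0.01, n_steps=100_000)
settled = rates[1000:]          # drop the initial adjustment towards b

print("long-run mean:", settled.mean())               # close to b = 0.03
print("long-run standard deviation:", settled.std())  # close to sigma / sqrt(2 * alpha) = 0.01
```

In contrast to the random walk, the variance of this process does not grow with time; it settles at a constant level, which is what keeps the series in a band around b.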

6.9.2


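If $\tilde{X}(t)$ is a Wiener process, it is defined for $0 \leq t < \infty$. The series always starts in zero, $\tilde{X}(0) = 0$, and if $t_0 \leq t_1 \leq t_2 \leq \dots \leq t_n$, then all increments of $\tilde{X}(t_i)$ are independent. In terms of the density function we have

$$D[x(t_1) - x(t_0), x(t_2) - x(t_1), \dots, x(t_n) - x(t_{n-1}) \mid t_0, t_1, \dots, t_n] \qquad (6.32)$$

$$= \prod_{i=1}^{n} D[x(t_i) - x(t_{i-1}) \mid t_0, t_1, \dots, t_n]. \qquad (6.33)$$

The expected value of each increment is zero,

$$E\left[\tilde{X}(t_n) - \tilde{X}(t_{n-1})\right] = 0, \qquad (6.34)$$

with a variance

$$var\left[\tilde{X}(t) - \tilde{X}(s)\right] = \sigma^2 (t - s), \qquad (6.35)$$

where $0 \leq s < t$. Finally, since the increments are a martingale difference process, we can assume that these increments follow a normal distribution, so $\tilde{X}(t) - \tilde{X}(s) \sim N[0, \sigma^2 (t - s)]$. These assumptions lead to the density function

$$D[x(t)] = \frac{1}{\sigma\sqrt{2\pi t_1}} \exp\left[-\frac{x_1^2}{2\sigma^2 t_1}\right] \prod_{i=2}^{n} \frac{(t_i - t_{i-1})^{-1/2}}{\sigma\sqrt{2\pi}} \exp\left[-\frac{(x_i - x_{i-1})^2}{2\sigma^2 (t_i - t_{i-1})}\right]. \qquad (6.36)$$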

When $\sigma = 1$, the process is called a standard Wiener process or standard Brownian motion. That the Brownian motion is quite special can be seen from this density function. The sample path is continuous, but is not differentiable. [In physics this is explained as the motion of a particle which at no time has a velocity.]

Wiener processes are of interest in economics for many reasons. First, they offer a way of modeling uncertainty, especially in financial markets, where we sometimes have an almost continuous stream of observations. Secondly, many macroeconomic variables appear to be integrated or near integrated. The limiting distributions of such variables are known to be best described as functions of Wiener processes. In general we must assume that these distributions are non-standard.

To sum up, there are five important things to remember about the Brownian motion/Wiener process:

• It represents the continuous time (asymptotic) counterpart of random walks.

• It always starts at zero and is defined over $0 \leq t < \infty$.

• The increments, any change between two points, regardless of the length of the intervals, are not predictable, are independent, and are distributed as $N(0, (t - s)\sigma^2)$, for $0 \leq s < t$.

• It is continuous over $0 \leq t < \infty$, but nowhere differentiable. The intuition behind this result is that differentiability would imply predictability, which would go against the previous condition.

• Finally, a function of a Brownian motion/Wiener process will behave like a Brownian motion/Wiener process.

The last characteristic is important, because most economic time series variables can be classified as random walks, integrated or near-integrated processes. In practice this means that their variances, covariances etc. have distributions that are functionals of Brownian motions. Even in small samples, functionals of Brownian motions will better describe the distributions associated with economic variables that display tendencies of stochastic growth.


"Time is a great teacher, but unfortunately it kills all its pupils" Louis Hector

Berlioz

A time series is simply data ordered by time, and time series analysis is simply a set of approaches that look for regularities in these data ordered by time. Stochastic time series play an important part in economics and finance. To forecast and analyse these series it is necessary to take into account not only their stochastic nature but also the fact that they are non-stationary, dependent over time and by nature correlated with each other. In theoretical models, the emphasis on intertemporal decision making highlights the role expectations play in a world where decisions must be made from information sets made up of stochastic processes.

All time series techniques aim at making the series more understandable by decomposing them into different parts. This can be done in several ways. The aim of this introduction is to give a general overview of the subject. A time series is any sequence ordered by time. The sequence can be either deterministic or stochastic. The primary interest in economics is in stochastic time series, where the sequence is made up of random variables. A sequence of stochastic variables ordered by time is called a stochastic time series process. The random variables making up the process can either be discrete, taking on a given set of integer numbers, or be continuous random variables taking on any real number between $\pm\infty$. While discrete random variables are possible they are not common.

Stochastic time series can be analysed in the time domain or in the frequency domain. The former approach analyses stochastic processes in given time periods like days, weeks, years etc. The frequency approach aims at decomposing the process into frequencies by using trigonometric functions like sines, etc. Spectral analysis is an example of analysis that uses the frequency domain to identify regularities like seasonal factors, trends, and systematic lags in adjustment etc. In economics and finance, where we are faced with given observations and we study the behavior of agents operating in real time, the time domain is the most interesting road ahead. There are relatively few problems that are interesting to analyze in the frequency domain.

Another dimension in modeling is whether processes are in discrete time or in continuous time. The principal difference here is that the stochastic variables in a continuous time process can be measured at any time t, and that they can take different values at any time. In a discrete time process, the variables are observed at fixed intervals of time (t), and they do not change between these observation points. Discrete time variables are not common in finance and economics. There are few, if any, variables that remain fixed between their points of observation. The distinction between continuous time and discrete time is not a matter of measurability alone. A common mistake is to be confused by the fact that economic variables are measured at discrete time intervals. The money stock is generally measured and recorded as an end-of-month value. The way of measuring the stock of money does not imply that it remains unchanged between the observation points; instead it changes whenever the money market is open. The same holds for variables like production and consumption. These activities take place 24 hours a day, during the whole year. They are measured as the flow of income and consumption over a period, typically a quarter, representing the integral sum of these activities.

Usually, a discrete time variable is written with a time subscript ($x_t$) while continuous time variables are written as x(t). The continuous time approach has a number of benefits, but the cost and quality of the empirical results seldom motivate the continuous time approach. It is better to use discrete time approaches as an approximation to the underlying continuous time system. The cost of this simplification is small compared with the complexity of continuous time analysis. This should not be understood as a rejection of continuous time approaches. Continuous time is good for analyzing a number of well defined problems like aggregation over time and individuals. In the end it should lead to a better understanding of adjustment speeds, stability conditions and interactions among economic time series, see Sjöö (1990, 1995).¹ Thus, our interest is in analysing discrete time stochastic processes in the time domain.

A time series process is generally indicated with braces, like $\{y_t\}$. In some situations it will be necessary to be more precise about the length of the process. Writing $\{y_t\}_{1}^{\infty}$ indicates that the process starts at period one and continues infinitely. The process consists of random variables because we can view each element in $\{y_t\}$ as a random variable. Let the process go from the integer values 1 up to $T$. If necessary, to be exact, the first variable in the process can be written as $y_{t_1}$, the second variable as $y_{t_2}$, etc., up until $y_{t_T}$. The distribution function of the process can then be written as $F(y_{t_1}, y_{t_2}, \ldots, y_{t_T})$.

In some situations it is necessary to start from the very beginning. A time series is data ordered by time. A stochastic time series is a set of random variables ordered by time. Let $\tilde{Y}_{it}$ represent the stochastic variable $\tilde{Y}_i$ given at time $t$. Observations on this random variable are often indicated as $y_{it}$. A series starting at time $t = 1$ and ending at time $t = T$, consisting of $T$ different random variables, is written as $\{\tilde{Y}_{1,1}, \tilde{Y}_{2,2}, \ldots, \tilde{Y}_{T,T}\}$. Of course, assuming that the series is built up by individual random variables, each with its own independent probability distribution, is a complex thought. But nothing in our definition of stochastic time series rules out that the data is made up of completely different random variables. Sometimes, to understand and find solutions to practical problems, it will be necessary to go all the way back to the most basic assumptions.

Suppose we are given a time series consisting of yearly observations of interest rates, $\{6.6, 7.5, 5.9, 5.4, 5.5, 4.5, 4.3, 4.8\}$. The first question to ask is whether this is a stochastic series in the sense that these numbers were generated by one stochastic process, or perhaps by several different stochastic processes. Further questions are whether the process or processes are best represented as continuous or discrete, and whether the observations are independent or dependent. Quite often we will assume that the series is generated by the same identical stochastic process in discrete time. Based on these assumptions the modelling process tries to find systematic historical patterns and cross-correlations with other variables in the data.

All time series methods aim at decomposing the series into separate parts in some way. The standard approach in time series analysis is to decompose as
$$ y_t = T_{t,d} + S_{t,d} + C_{t,d} + I_t, $$

1 We can also mention the different types of series that are used: stocks, flows and price variables. Stocks are variables that can be observed at a point in time, like the money stock or inventories. Flows are variables that can only be observed over some period, like consumption or GDP. In this context price variables include prices, interest rates and similar variables which can be observed at a market at a given point in time. Combining these variables into multivariate processes and constructing econometric models from observed variables in discrete time produces further problems, and in general they are quite difficult to solve without using continuous time methods. Usually, careful discrete time models will reduce the problems to a large extent.

where $T_{t,d}$, $S_{t,d}$ and $C_{t,d}$ are deterministic trend, seasonal and cyclical components, and $I_t$ is a process representing irregular factors.2

For time series econometrics this definition is limited. Instead, let $\{y_t\}$ be a stochastic time series process, composed as
$$ y_t = T_d + T_s + S_d + S_s + \{y_t\} + e_t, \tag{7.1} $$
where the components are a deterministic trend $T_d$, a stochastic trend $T_s$, deterministic seasonals $S_d$, stochastic seasonals $S_s$, a stationary process (or the short-run dynamics) $\{y_t\}$, and finally a white noise innovation term $e_t$. The modeling problem can be described as the problem of identifying the systematic components such that the residual becomes a white noise process. For all series, remember that any inference is potentially wrong if not all components have been modeled correctly. This is so regardless of whether we model a simple univariate series with time series techniques, a reduced system, or a structural model. Inference is only valid for a correctly specified model.

Present ARIMA
A class of models: ARIMA(p,d,q) and ARFIMA(p,d,q) models
Operators
Box-Jenkins
Identification tools: ACF, PACF, Q-test
Deal with:
  Non-stationarity, dynamics
  Trend
  Seasonal effects
  Deterministic variables
Theory ARIMA:
After ARIMA?
  ARIMAX, Transfer function RDL, ARCH/GARCH
  Structural: Single equation, ADL
  Error correction models (older stuff)
Multivariate
  VAR
  VECM
  SVAR
Add for VAR: How to build VARs
  Lags - white noise
  Lags - dummies - white noise
  Information criteria + min number of equations with AR
Add for Rational expectations: GMM
GARCH

2 For simplicity we assume a linear process. An alternative is to assume that the components are multiplicative, $y_t = T_{t,d} \, S_{t,d} \, C_{t,d} \, I_t$.

Random variables are described by their moments. Stochastic time series can be described by their means, variances and autocovariances. Given a random variable $\tilde{Y}_t$ which generates an observed process $\{y_t\}$, the mean and the variance are $E\{\tilde{Y}_t\} = \mu$ and $var\{\tilde{Y}_t\}$. The autocovariance at lag $k$ is
$$ \gamma_k = cov(\tilde{Y}_t, \tilde{Y}_{t-k}) = E\{[\tilde{Y}_t - E(\tilde{Y}_t)][\tilde{Y}_{t-k} - E(\tilde{Y}_{t-k})]\}. $$

The autocovariance depends on the units of measurement, so it says little about the strength of the relation. For practical work, a more useful measure is provided by the autocorrelation,
$$ \rho_k = \frac{cov(\tilde{Y}_t, \tilde{Y}_{t-k})}{\sqrt{var(\tilde{Y}_t)\,var(\tilde{Y}_{t-k})}} = \frac{\gamma_k}{\gamma_0}, $$
which measures the correlation between the process at time $t$ and at time $t-k$. The autocorrelation comes out as a number between $-1$ and $1$. It can be calculated for any lag $k \geq 1$, and is therefore generally referred to as the autocorrelation function (ACF). Furthermore, if the series has a stationary mean and variance, it does not matter if we calculate the correlation function (or the autocovariances) backwards or forwards, $\rho_k = \rho_{-k}$.

The ACF tells us the following: the higher the value of the autocorrelation, the stronger is the memory of the series. By studying how the autocorrelation changes as the distance $k$ between $t$ and $t-k$ changes, we can see if the autocorrelations tend to die out slowly or quickly, or remain constant for a given number of $k$.3 If the ACF starts close to unity and dies out slowly, this is a sign of a non-stationary variable. On the other hand, if the ACF is zero at all lags it is a sign of a white noise process, where no historical values can predict coming observations of the same series. For a random time series process, the sample autocorrelation function becomes
$$ \hat{\rho}_k = \frac{\frac{1}{T-k}\sum_{t=k+1}^{T}(y_t - \bar{y})(y_{t-k} - \bar{y})}{\frac{1}{T}\sum_{t=1}^{T}(y_t - \bar{y})^2}, \qquad k = 0, 1, 2, 3, \ldots, $$
where $T$ is the number of observations, and $\bar{y}$ is the sample mean, $\bar{y} = (1/T)\sum_{i=1}^{T} y_i$.
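The sample autocorrelation formula can be sketched in a few lines of Python. This is only an illustration, assuming NumPy is available; the function name `sample_acf` is hypothetical, and the data are the yearly interest rates quoted earlier in the text.

```python
import numpy as np

def sample_acf(y, max_lag):
    """Sample autocorrelations for lags 0..max_lag.

    Numerator: cross products of deviations from the mean, divided by (T - k).
    Denominator: sample variance, divided by T.
    """
    y = np.asarray(y, dtype=float)
    T = len(y)
    dev = y - y.mean()
    var0 = np.sum(dev ** 2) / T
    acf = []
    for k in range(max_lag + 1):
        # pairs (y_t, y_{t-k}) for t = k+1, ..., T
        cov_k = np.sum(dev[k:] * dev[:T - k]) / (T - k)
        acf.append(cov_k / var0)
    return np.array(acf)

# yearly interest rates from the example in the text
rates = [6.6, 7.5, 5.9, 5.4, 5.5, 4.5, 4.3, 4.8]
acf_rates = sample_acf(rates, 2)
print(acf_rates)
```

By construction the value at lag 0 equals one. With only eight observations the higher-order estimates are very imprecise, so the printout should be read as an illustration of the arithmetic rather than as evidence about the series.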

In practical work, the standard assumption is a constant variance over the sample, so that $var(y_t) = var(y_{t-k})$. The sample autocorrelations are estimates of random variables; they are therefore associated with variances. Bartlett (1946) shows that the variance of the $k$:th sample autocorrelation is
$$ var(\hat{\rho}_k) = \frac{1}{T}\left[1 + 2\sum_{j=1}^{k-1}\hat{\rho}_j^2\right]. $$
Given the variance, and thus the standard deviation, of the estimated autocorrelation, it becomes possible to set up a significance test. Asymptotically, this t-test has a normal distribution, with an expected value of zero under the null of no autocorrelation (no memory in the series). For a limited sample, a value of $\hat{\rho}_k$ larger than two times its standard error (in absolute value) is considered significant.
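As a rough sketch of the significance test, the Bartlett standard errors can be computed directly from a vector of sample autocorrelations. The autocorrelation values and the sample size below are hypothetical, chosen only to show the arithmetic, and `bartlett_se` is an illustrative name.

```python
import numpy as np

def bartlett_se(rho, T):
    """Bartlett standard errors for sample autocorrelations at lags 1..K.

    rho[0] is the lag-1 autocorrelation, rho[1] the lag-2 one, and so on.
    The variance at lag k uses the squared autocorrelations at lags 1..k-1.
    """
    rho = np.asarray(rho, dtype=float)
    # cumulative sum of squared autocorrelations up to lag k-1 (zero for k = 1)
    cum = np.concatenate(([0.0], np.cumsum(rho[:-1] ** 2)))
    return np.sqrt((1.0 + 2.0 * cum) / T)

rho_hat = np.array([0.5, 0.2, 0.05])  # hypothetical sample autocorrelations
T = 100                               # hypothetical sample size
se = bartlett_se(rho_hat, T)
significant = np.abs(rho_hat) > 2 * se  # rule of thumb: two standard errors
print(se, significant)
```

In this made-up example only the first autocorrelation exceeds two standard errors.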

3 Standard practice is to calculate about $T/4$ sample autocorrelations.

The next question is how much autocorrelation is left between the observations at $t$ and $t-k$ ($\tilde{Y}_t$ and $\tilde{Y}_{t-k}$) after we remove (condition on) the autocorrelation at the intermediate lags between $t$ and $t-k$. Removing the autocorrelation means that we first calculate the mean of $\tilde{Y}_t$ conditional on all observations between $\tilde{Y}_{t-1}$ and $\tilde{Y}_{t-k+1}$; another way of expressing this is to say that we filter $\tilde{Y}_t$ from the influence of all lags of $\tilde{Y}_t$ between $t-1$ and $t-k+1$. Using the expectations operator, we define the conditional mean as $E\{\tilde{Y}_t \mid y_{t-1}, y_{t-2}, \ldots, y_{t-k+1}\} = \tilde{Y}_t^{*}$. The partial autocorrelation is then the slope coefficient in a regression of $\tilde{Y}_t$ on $\tilde{Y}_{t-k}$, given the intermediate lags. This leads to the following definition of the partial autocorrelation function

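$$ \alpha_k = \frac{cov(\tilde{Y}_t, \tilde{Y}_{t-k} \mid y_{t-1}, \ldots, y_{t-k+1})}{var(\tilde{Y}_{t-k})}, \tag{7.2} $$
which is the coefficient on the lag at $t-k$ in the autoregressive regression:
$$ y_t = a_0 + a_1 y_{t-1} + \ldots + \alpha_k y_{t-k} + e_t. \tag{7.3} $$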

The first partial autocorrelation is estimated by regressing $y_t$ on $y_{t-1}$; the second partial autocorrelation is estimated by regressing $y_t$ on $y_{t-1}$ and $y_{t-2}$, and so on.4 The partial autocorrelation function can be estimated through regression techniques, by the so-called Yule-Walker estimator, or alternatively by recursive techniques (Durbin 1961). The recursive technique utilises the fact that the first autocorrelation is equal to the first partial autocorrelation; then, given the first partial autocorrelation, the higher order partial autocorrelations are solved step by step in a recursive equation system.
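The regression approach to the partial autocorrelations can be sketched as follows. This is a minimal illustration using ordinary least squares from NumPy, not the Yule-Walker or the recursive estimator; the function name `pacf_ols` and the simulated AR(1) series with coefficient 0.7 are assumptions made for the example.

```python
import numpy as np

def pacf_ols(y, max_lag):
    """Partial autocorrelations at lags 1..max_lag by successive OLS regressions.

    For each k, regress y_t on a constant and y_{t-1}, ..., y_{t-k};
    the coefficient on y_{t-k} is the k:th partial autocorrelation.
    """
    y = np.asarray(y, dtype=float)
    T = len(y)
    out = []
    for k in range(1, max_lag + 1):
        lags = [y[k - j:T - j] for j in range(1, k + 1)]  # y_{t-j} for t = k..T-1
        X = np.column_stack([np.ones(T - k)] + lags)
        coef, *_ = np.linalg.lstsq(X, y[k:], rcond=None)
        out.append(coef[-1])  # only the last coefficient is a partial autocorrelation
    return np.array(out)

# simulated AR(1) series, used only to illustrate the cut-off after lag 1
rng = np.random.default_rng(0)
n = 2000
e = rng.standard_normal(n)
y_sim = np.zeros(n)
for t in range(1, n):
    y_sim[t] = 0.7 * y_sim[t - 1] + e[t]

pacf_sim = pacf_ols(y_sim, 3)
print(pacf_sim)
```

For an AR(1) process the first partial autocorrelation should be close to the autoregressive coefficient, and the higher order ones close to zero.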

The complicating factor is to estimate the variance of the partial autocorrelation function. If a regression technique is used, the estimated regression variance of the partial autocorrelation coefficient is not a correct estimate of the variance, because until the residual process is white noise, or at least free from autocorrelation, the estimated variance is inefficient. Furthermore, the other (older) techniques of estimating the PACF do not involve a variance estimate in the same way as the OLS estimator. The solution, therefore, is to assume that the estimated partial autocorrelations are a white noise process. Anderson (1944) shows that the asymptotic variance of the estimated coefficients of a white noise series is $1/T$. This leads to the (asymptotic) significance test in which the estimated coefficient is divided by $1/\sqrt{T}$. As a practical rule of thumb, in a limited sample, a test statistic greater than 2 in absolute value is considered significant, and leads to a rejection of the null that the partial autocorrelation is zero. The PACF informs about the length of an autoregressive process. The necessary number of lags to describe an autoregressive process of order $p$ ends at lag $p$.

A closer look at these measures, and the way they are calculated reveals that

they are only interesting for stationary series. The same holds for the mean and

the variance, and other moments.

The two measures, the ACF and the PACF, are complementary to other descriptive devices, such as the mean, the variance, kurtosis, etc. The ACF and the PACF describe the memory of a process. They explain if and how a series can be predicted from its own past. They help us to identify which type of process we are studying: a white noise process, an integrated process, an AR process, an MA process, or an ARMA process.

A white noise series is recognized by its lack of significant ACF and PACF coefficients. Integrated variables are identified by the fact that their ACF dies out very slowly, in combination with at least one PACF coefficient close to unity. Stationary ARMA models are identified with the following identification scheme:

         AR(p)               MA(q)               ARMA
ACF      Tails off           Cuts off at lag q   Tails off
PACF     Cuts off at lag p   Tails off           Tails off

4 Notice that in the regression, the parameters $a_1, a_2, \ldots, a_{k-1}$ are not identical to the partial autocorrelations at lags $1, 2, \ldots, k-1$, due to the (possible) correlation between $y_{t-1}$ and the other lags, like $y_{t-2}$, etc. The regression formula only identifies the last coefficient, at lag $k$, as the PACF at lag $k$.

The ACF and the PACF thus display a characteristic pattern for each type of model, and the properties of each model can be calculated theoretically. These calculations are an important part of time series analysis and we will come back to them below.

The idea behind ARIMA modeling is to first calculate the ACF and the PACF and use these to form an idea about the order of integration $d$ and the orders $p$ and $q$. The second step, given what we know about $d$, $p$, and $q$, is to estimate an ARIMA model. The third step is to test the estimated model for autocorrelation in the residual. The fourth step is to re-estimate models to find the best model according to three criteria: (i) no autocorrelation, (ii) the lowest possible residual variance and (iii) not so many parameters that the model becomes too complex.

7.1.1

A fundamental issue when analyzing time series processes is whether they are stationary or not. As a first, general definition, we can say that a non-stationary series changes its behavior over time such that the mean is changing over time. Many economic time series are non-stationary in the sense that they are growing over time, their estimated variances are also growing and the covariance function never dies out. In other words, the calculations of the mean, autocovariance etc. are dependent on the time period we study, and inference becomes impossible. A stationary series, on the other hand, displays a behavior which is independent of the time period, and it becomes possible to test for significance. Non-stationarity must either be removed before modeling or included in the model. This requires that we know what type of non-stationarity we are dealing with.

The problem with non-stationarity is that a series can be non-stationary in an infinite number of ways. And, to make the problem even more complex, some types of non-stationarity will skew the distributions of the estimates such that inference based on standard distributions such as the $t$, the $F$ or the $\chi^2$ distributions is not only wrong but completely misleading. In order to model time series, we need to understand what non-stationarity is, how to estimate it and how to deal with it.

7.1.2

Of the two concepts, weak stationarity is the practical one. Weak stationarity is defined in terms of the first two moments of the process, the mean and the variance. A process $\{x_t\}$ is (weakly) stationary if (1) the mean is independent of time $t$,
$$ E\{x_t\} = \mu, $$
(2) the variance exists and is less than infinity,
$$ var\{x_t\} = \sigma^2 < \infty, $$
and (3) the covariance between $x_t$ and $x_{t-k}$ depends only on the lag $k$,
$$ cov\{x_t, x_{t-k}\} = \gamma_k. $$
Thus, the mean and the variance are constant over time, and the covariance between two values of the process is only a function of the distance between the two points.

A related concept is that of covariance stationarity. If the autocovariances go to zero as the distance between the two points increases, the series is said to be covariance stationary (or ergodic),
$$ cov(x_t, x_{t-k}) \to 0 \quad \text{as } k \to \infty. $$
This definition brings us to the concept of ergodicity, which can be understood as a weak form of average asymptotic independence. The most important condition, necessary but not sufficient, for a series to be ergodic is
$$ \lim_{T \to \infty}\left(T^{-1}\sum_{k=1}^{T} cov(x_t, x_{t-k})\right) = 0. $$
This condition is a restriction on the strength of the covariance structure. As more and more autocovariances are calculated their mean should go to zero. The term ergodic is used in connection with stationarity conditions.

7.1.3

Strong Stationarity

Strong stationarity is defined in terms of the distribution function of $\{x_t\}$. Suppose a process that is ordered from observation 1 up to observation $T$. Each observation up to $T$ can be thought of as a random variable. Hence we can write the first variable in the process as $x_{t_1}$, the second variable as $x_{t_2}$, etc., up until $x_{t_T}$. The distribution function for this process is $F(x_{t_1}, x_{t_2}, \ldots, x_{t_T})$. Next, define the distribution function of $\{x_t\}$ for another time interval, namely $t + j$, where $j = 1, 2, \ldots, T$. This leads to the distribution function $F_j(x_{t_1+j}, x_{t_2+j}, \ldots, x_{t_T+j})$. Strong stationarity requires that the two distribution functions are identical, such that $F(x_{t_1}, x_{t_2}, \ldots, x_{t_T}) = F_j(x_{t_1+j}, x_{t_2+j}, \ldots, x_{t_T+j})$, meaning that the characteristics of the process are independent of time. We will get the same means, etc., independently of the time period we choose for our calculations. By letting $j$ take different integer values we get the $j$:th order strong stationarity. Thus, $j = 1$ leads to first order (strong) stationarity, etc.

Strong stationarity incorporates the definition of weak stationarity. But the practical problem is that it is difficult to work with distribution functions for continuous random variables, so strong stationarity is mainly a theoretical concept.

1. (a) ARMA models, autoregressive moving average models. These are a set of models that describe the process $\{x_t\}$ as a function of its own lags and a white noise process. The autoregressive model of order $p$ [AR($p$)] is defined as
$$ x_t = a_0 + a_1 x_{t-1} + \ldots + a_p x_{t-p} + e_t, $$
and the moving average model of order $q$ [MA($q$)] is defined as
$$ x_t = a_0 + e_t - b_1 e_{t-1} - \ldots - b_q e_{t-q}, $$
where $e_t$ is a white noise process. The combination of autoregressive and moving average processes gives the ARMA($p,q$) model
$$ x_t = a_0 + a_1 x_{t-1} + \ldots + a_p x_{t-p} + e_t - b_1 e_{t-1} - \ldots - b_q e_{t-q}. $$
An integrated process is defined as follows: a process $x_t$ is said to be integrated of order $d$, written $I(d)$, if it contains no deterministic components, is non-stationary in levels, but becomes stationary after differencing $d$ times. Thus, a stationary series is denoted $x_t \sim I(0)$, a first order integrated series is denoted as $I(1)$, etc.
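The definitions above can be illustrated with a small simulation. The sketch below generates an ARMA(1,1) series recursively, accumulates it into an $I(1)$ series, and checks that differencing once returns the stationary series. The parameter values and variable names are arbitrary choices for the illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
T = 500
a0, a1, b1 = 0.0, 0.6, 0.3   # illustrative parameter values, not estimates
e = rng.standard_normal(T)   # white noise process

# ARMA(1,1): x_t = a0 + a1*x_{t-1} + e_t - b1*e_{t-1}
x = np.zeros(T)
for t in range(1, T):
    x[t] = a0 + a1 * x[t - 1] + e[t] - b1 * e[t - 1]

# cumulating the stationary series gives a first order integrated series
z = np.cumsum(x)

# differencing once gives back the stationary series (from the second observation on)
dz = np.diff(z)
print(np.allclose(dz, x[1:]))
```

The level series `z` wanders without returning to a fixed mean, while its first difference is the stationary ARMA(1,1) series, which is what the notation $I(1)$ expresses.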

To analyse time series it is necessary to introduce additional descriptive statistical tools besides means and variances. Then, to handle the equations in an efficient way, we need a set of operators. Also, we need to classify time series as stationary or non-stationary. The descriptive devices are autocovariances, autocorrelations and partial autocorrelations. An important classification is stationarity or non-stationarity. For this purpose we need the concepts of weak and strong stationarity, and ergodic processes. The operators needed are the sum operator, the lag operator and the difference operator.

7.1.4

With too few lags the model will by definition be misspecified, and the assumption of normally distributed white noise residuals will be wrong. On the other hand, adding more lags to the AR or MA process will make the model capture more of the possible memory of the process, but the estimates will be inefficient. We need to add as few lags as possible, without rejecting the assumption of white noise residuals. The Box-Jenkins method suggests that we start with a relatively large number of lags and test for autocorrelation. Among those models which have no significant autocorrelation, we then pick the model with the lowest information criterion.

In the Box-Jenkins approach, testing for white noise is equal to testing for autocorrelation. The typical test for autocorrelation is the Box-Pierce test, also known as the portmanteau test or the Q-test; a modified version is the Ljung-Box test. To test for $p$:th order autocorrelation in a mean adjusted series, $\hat{\varepsilon}_t$, calculate the $k$:th order autocorrelation coefficient,
$$ \hat{\rho}_k = \frac{\sum_{t=k+1}^{T}\hat{\varepsilon}_t\hat{\varepsilon}_{t-k}}{\sum_{t=1}^{T}\hat{\varepsilon}_t^2}, $$
and form the test statistic
$$ BP = T\sum_{k=1}^{p}\hat{\rho}_k^2. $$
Under the null of no autocorrelation this test statistic has a $\chi^2(p)$ distribution. The Box-Pierce statistic is best suited for testing the residual in an AR model. A modification, for ARMA and more general regression models, is the so-called Box-Ljung statistic,
$$ BL = T(T+2)\sum_{r=1}^{p}\frac{\hat{\rho}_r^2}{(T-r)}. $$
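Both portmanteau statistics are simple to compute from a residual series. The following is a minimal sketch, assuming NumPy; the function name `portmanteau` and the simulated residuals are illustrative, and the 5 percent critical value of the $\chi^2(5)$ distribution (about 11.07) is taken from standard tables.

```python
import numpy as np

def portmanteau(resid, p):
    """Box-Pierce and Box-Ljung statistics for autocorrelation up to order p."""
    e = np.asarray(resid, dtype=float)
    e = e - e.mean()                      # mean adjusted series
    T = len(e)
    denom = np.sum(e ** 2)
    lags = np.arange(1, p + 1)
    rho = np.array([np.sum(e[k:] * e[:T - k]) / denom for k in lags])
    bp = T * np.sum(rho ** 2)
    bl = T * (T + 2) * np.sum(rho ** 2 / (T - lags))
    return bp, bl

rng = np.random.default_rng(3)
resid = rng.standard_normal(200)          # stand-in for estimated residuals
bp, bl = portmanteau(resid, 5)

CHI2_5PCT_DF5 = 11.07                     # 5% critical value, chi-square with 5 d.f.
print(bp, bl, bl > CHI2_5PCT_DF5)
```

The Box-Ljung statistic is always at least as large as the Box-Pierce statistic, since each squared autocorrelation is weighted by $(T+2)/(T-r) > 1$.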

Given that the residuals of the estimated ARMA model do not display autocorrelation, we can turn to the optimal lag length. Information criteria are simply versions of adjusted $R^2$ values. In an ordinary linear regression, as more explanatory variables are added to the model, the $R^2$ value will go up, and the efficiency of the estimated parameters down. To compare the $R^2$ values of the same model, estimated with more or fewer explanatory variables, it is necessary to look at the so-called adjusted $R^2$ values. The principle behind an information criterion is to create a measure that rewards us in the modelling process for reducing the residual variance, but punishes us for adding too many lags, which makes the estimates inefficient and the prediction intervals too wide.

There are several information criteria. They are developed for special situations. In practice, however, they often tend to give the same answer in the end. The most well known criterion is Akaike's Information Criterion (AIC). If we estimate an autoregressive model with $k$ lags from a sample of $T$ observations, Akaike's information criterion is
$$ AIC = \log\hat{\sigma}^2_{\varepsilon} + 2k/T, $$
where $\hat{\sigma}^2_{\varepsilon}$ is the estimated residual variance. Since the estimated residual variance gets smaller the more lags there are in the model, the last term ($2k/T$) tries to compensate for the number of estimated parameters in the model. The smaller the value of the information criterion the better is the model, as long as there is no autocorrelation.
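A sketch of lag selection with the AIC is given below. It fits autoregressions of increasing order by OLS on a common sample and picks the order with the smallest criterion value. The function name `ar_aic`, the simulated AR(2) data and the maximum lag of 6 are assumptions for the example, and the residual autocorrelation check that should accompany the choice is left out for brevity.

```python
import numpy as np

def ar_aic(y, k, max_lag):
    """AIC = log(residual variance) + 2k/T for an AR(k) model fitted by OLS.

    All models use the observations from max_lag onwards, so that the
    criteria are compared on the same sample.
    """
    y = np.asarray(y, dtype=float)
    n = len(y)
    Y = y[max_lag:]
    cols = [np.ones(n - max_lag)] + [y[max_lag - j:n - j] for j in range(1, k + 1)]
    X = np.column_stack(cols)
    coef, *_ = np.linalg.lstsq(X, Y, rcond=None)
    resid = Y - X @ coef
    T = len(Y)
    return np.log(resid @ resid / T) + 2 * k / T

# simulated AR(2) series with illustrative coefficients
rng = np.random.default_rng(42)
n = 1000
e = rng.standard_normal(n)
y = np.zeros(n)
for t in range(2, n):
    y[t] = 0.5 * y[t - 1] + 0.3 * y[t - 2] + e[t]

max_lag = 6
aic = {k: ar_aic(y, k, max_lag) for k in range(max_lag + 1)}
best_k = min(aic, key=aic.get)
print(best_k)
```

Because the penalty $2k/T$ is small in large samples, the AIC may occasionally select more lags than the true order; it should rarely select fewer when the omitted lags matter.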

For models with both AR and MA components, Hannan and Rissanen suggested a different criterion,
$$ \log\hat{\sigma}^2_{\varepsilon} + (p + q)(\log T)/T, $$
where $p$ and $q$ are the lag orders of the autoregressive and the moving average parts of the model. As for Akaike's criterion, the smaller the value the better the model. From these two original criteria a number of different criteria have been developed, such as the Schwarz information criterion (SIC), the Bayesian information criterion (BIC) and Hatami's information criterion (HIC).

7.1.5

When dealing with time series and dynamic econometric models, the expressions are easier to handle with the backward shift operator ($B$) or the lag operator ($L$).5 The backward shift operator is the symbol most often used in statistical textbooks. Econometricians tend to use the lag operator more often. The first order lag operator is defined as
$$ Lx_t = x_{t-1}, \tag{7.4} $$
and the lag operator of order $n$ as
$$ L^n x_t = x_{t-n}. \tag{7.5} $$

The lag operator is an expression such that when it is multiplied with an observation at any given time, it will shift the observation one period backwards in time. In other words, the lag operator can be viewed as a time traveling device, which makes it possible to travel both forward and backwards in time. A forward shift operator can be constructed along the same lines. Thus, moving forward $n$ observations in the series from an observation at time $t$ is done by $L^{-n}x_t = x_{t+n}$.

5 The practical difference between using the lag operator or the backward shift operator is that the lag operator also affects the conditional expectations generator $E_t$, which is of interest when working with economic theories dealing with expectations.

The properties of the lag operator imply that we can write an autoregressive expression of order $p$ (AR($p$)) as
$$ a_0 x_t + a_1 x_{t-1} + a_2 x_{t-2} + \ldots + a_p x_{t-p} = a_0 x_t + a_1 L x_t + a_2 L^2 x_t + \ldots + a_p L^p x_t = A(L)x_t. \tag{7.6} $$

Notice that the lag operator can be moved across the equal sign. The AR(1) model, $x_t = a_1 x_{t-1} + \varepsilon_t$, can be written as $(1 - a_1 L)x_t = \varepsilon_t$, or $A(L)x_t = \varepsilon_t$, or $x_t = [A(L)]^{-1}\varepsilon_t$. If necessary the lag length of the process can be indicated as $A_p(L)$. An ARMA($p,q$) process can be written compactly as
$$ A_p(L)x_t = B_q(L)\varepsilon_t. \tag{7.7} $$
Skipping the indication of lag lengths for convenience, the ARMA model can be written as $x_t = [A(L)]^{-1}B(L)\varepsilon_t$ or alternatively, depending on the context, as $[B(L)]^{-1}A(L)x_t = \varepsilon_t$. Thus, the lag operator works as any mathematical expression. However, whether or not moving the lag operator around results in a meaningful expression is associated with the principles of stationarity and invertibility, known as duality.

7.1.6

Generating Functions

The function $A(L)$ is a convenient way of writing the sequence. More generally we can refer to any expression of the type $A(L)$ as a generating function. This includes the mean operator, the variance and covariance operators etc. Generating functions summarize a lot of information about sequences in a compact way and are an important tool in time series analysis. Their main advantage is that they save time and make the expressions much simpler, since a number of mathematical operations can be applied to generating functions. As an example, given certain conditions concerning the sum $\sum a_i$, we can invert $A(L)$, such that $A(L)^{-1}A(L) = 1$.

The generating function for the lag operator is
$$ D(L) = \sum_{i=0}^{k} d_i L^i, \tag{7.8} $$
where $d_i$ is generated by some other function. The point here is that it is often easier to do manipulations on $D(L)$ directly than on each individual element in the expression. In the example above, we would refer to $A(L)x_t$ as the generating function of $x_t$.

A property of generating functions is that they are additive. If we have two series, $a_i$ and $b_i$, with $i = 0, 1, 2, \ldots$, and define a third series as $c_i = a_i + b_i$, it then follows that
$$ C(L) = A(L) + B(L). \tag{7.9} $$
Another property is that of convolution. Take the series $a_i$ and $b_i$ from above; a new series $d_i$ can then be defined by
$$ d_i = a_0 b_i + a_1 b_{i-1} + a_2 b_{i-2} + \ldots + a_i b_0 = \sum_{h=0}^{i} a_h b_{i-h}. \tag{7.10} $$
The generating function of $d_i$ is then
$$ D(L) = A(L)B(L). \tag{7.11} $$
The results stated in this section should be compared with chapter 19, below, which shows how long-run multipliers, etc. can be derived from the lag operator.
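The convolution property can be checked numerically: multiplying two lag polynomials is the same as convolving their coefficient sequences. The sketch below uses `numpy.convolve`; the two polynomials are arbitrary examples.

```python
import numpy as np

# coefficient sequences, ordered by increasing power of L
a = np.array([1.0, -0.5])        # A(L) = 1 - 0.5L
b = np.array([1.0, 0.4, 0.2])    # B(L) = 1 + 0.4L + 0.2L^2

# convolution: d_i = sum over h of a_h * b_{i-h}
d = np.convolve(a, b)

# additivity: c_i = a_i + b_i (pad the shorter sequence with zeros)
c = np.pad(a, (0, len(b) - len(a))) + b

# check D(z) = A(z)B(z) and C(z) = A(z) + B(z) at an arbitrary point z
z = 0.3
A_z = np.polyval(a[::-1], z)     # polyval expects the highest power first
B_z = np.polyval(b[::-1], z)
D_z = np.polyval(d[::-1], z)
C_z = np.polyval(c[::-1], z)
print(d, np.isclose(D_z, A_z * B_z), np.isclose(C_z, A_z + B_z))
```

Here the product polynomial is $1 - 0.1L - 0.1L^3$; the coefficient on $L^2$ cancels.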

7.1.7

Given the definition of the lag operator (or the backward shift operator), the difference operator ($\Delta$) is defined as
$$ \Delta = 1 - L. \tag{7.12} $$
In time series statistics the difference operator is usually denoted with $\nabla$. In practice the $\Delta$-symbol denotes taking first differences of a discrete variable. For a continuous variable, first differencing implies taking the derivative with respect to time. If $x(t)$ is a continuous time stochastic variable,
$$ Dx = dx/dt, \tag{7.13} $$
where $D = d/dt$.

Differences of higher order are denoted in the same way as for the lag operator. Thus for the second difference of $x_t$ we write
$$ \Delta^2 x_t = (1 - L)^2 x_t = (1 - 2L + L^2)x_t = x_t - 2x_{t-1} + x_{t-2}, \tag{7.14} $$
and, in general, for the $d$:th difference,
$$ \Delta^d x_t = (1 - L)^d x_t. $$
There are two types of difference operators: the first is the conventional difference operator, the second is the seasonal difference operator, such that
$$ \Delta_s x_t = (1 - L^s)x_t = x_t - x_{t-s}. $$
The subscript $s$ indicates the interval over which we take the (seasonal) difference. If $x_t$ is quarterly, setting $s = 4$ leads to the yearly changes in the series. This new series can then be differenced by using the difference operator,
$$ \Delta^d \Delta_s x_t = \Delta^d (1 - L^s)x_t. $$
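The difference operators translate directly into array operations. The short sketch below applies the first, second and seasonal ($s = 4$) differences to a made-up quarterly series; the numbers are purely illustrative.

```python
import numpy as np

# hypothetical quarterly observations
x = np.array([100.0, 102.0, 101.0, 105.0, 104.0, 107.0, 105.0, 110.0])

d1 = x[1:] - x[:-1]                    # (1 - L) x_t = x_t - x_{t-1}
d2 = x[2:] - 2 * x[1:-1] + x[:-2]      # (1 - L)^2 x_t = x_t - 2x_{t-1} + x_{t-2}
ds4 = x[4:] - x[:-4]                   # (1 - L^4) x_t = x_t - x_{t-4}
d1_ds4 = ds4[1:] - ds4[:-1]            # first difference of the seasonal difference

print(d1, d2, ds4, d1_ds4)
```

Each differencing step shortens the series: the first difference loses one observation, the second two, and the seasonal difference loses $s$ observations.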

7.1.8

Filters

The autoregressive part of a model can be thought of as a filter, such that if we multiply $x_t$ with $A_p(L)$ the result is a white noise process. In the same way, given a white noise series $e_t$ and some filter $B(L)$, $B(L)e_t = y_t$ generates the series $y_t$. Alternatively, think of $S(t)$ as the seasonal component of the series $x_t$, or in other words the seasonal filter. Multiplying $x_t$ with $S(t)$, or in a linear relation subtracting $S(t)x_t$ from $x_t$, gives a deseasonalised variable. Thus, in this context the term filter is a broad concept, which indicates that we can transform series in different ways. From white noise we can produce ARIMA processes, or we can extract certain components out of a series.

7.1.9

The question of dynamic stability is whether a process is stationary or not. Starting from a steady state solution, will a shock to the process, given its parameters, result in an explosion of the series, in infinite growth or in a temporary deviation from steady state? The answers to these questions are given by analysing the roots of the polynomial given by the autoregressive process $A(L)$.

An autoregressive process can always be expressed as a stochastic difference equation, and we can deal with it in the same way as with a normal difference equation. Starting from $A(L)y_t = \varepsilon_t$, subtracting $y_{t-1}$ from both sides leads to the difference equation $\Delta y_t = A^{*}(L)y_{t-1} + \varepsilon_t$. The solution of this equation is
$$ y_t = y_p + y_c, \tag{7.15} $$
where $y_p$ represents the particular solution, the long-run steady state equilibrium or the stationary long-run mean of $y_t$, and $y_c$ represents the complementary solution, the deviation from the long-run steady state.

Dynamic stability requires that $y_c$ vanishes as $T \to \infty$. The roots of the polynomial $A(L)$ tell us if this occurs. Given a change in $\varepsilon_t$, what will happen to $y_{t+1}, y_{t+2}, \ldots, y_{t+\infty}$? Will $y_{t+1}$ explode, continue to grow for ever, or change temporarily until it returns to the steady state equilibrium described by $y_p$? The roots are given by solving for the $r$:s in the following equation,
$$ r^p + a_1 r^{p-1} + a_2 r^{p-2} + \ldots + a_p = 0. \tag{7.16} $$
This equation leads to the latent roots of the polynomial. The condition for stability, when using the latent roots, is that the roots should be less than unity in absolute value, that is, inside the unit circle. Roots equal to unity, so-called unit roots, imply an evergrowing series (stochastic trend); roots greater than unity imply an explosive process. Complex roots suggest that the adjustment is cyclical. Though not very likely, the process could follow an explosive cyclical path or display cyclical permanent shocks. If the process is stationary, following a shock, $y_t$ will return to its stationary long-run mean. The case with one or several unit roots is of particular interest because it represents stochastic growth in a non-stationary variable. Series with one or more unit roots are also called integrated series. Many economic time series appear to have a unit root, or roots close to unity.

Using latent roots to define stability is common, but is not the only way to define stability. Latent roots, or eigenvalues, are motivated by the fact that they are easier to work with when matrix algebra is used. An alternative way of defining stability is to solve for the roots ($\lambda$) in the following equation,
$$ 1 + a_1\lambda + a_2\lambda^2 + \ldots + a_p\lambda^p = 0, \tag{7.17} $$
where $\lambda = 1/r$. If the roots are greater than unity in absolute value, $|\lambda| > 1$, that is, lie outside the unit circle, the process is stationary; if the roots are less than unity the process is explosive. The historical literature on time series uses both definitions; however, latent roots, or eigenvalues, are now the established standard.
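The two root conventions can be compared numerically with `numpy.roots`. The sketch below assumes the autoregressive polynomial is normalised as $A(L) = 1 + a_1 L + \ldots + a_p L^p$, so that the process $y_t = 0.5y_{t-1} + \varepsilon_t$ corresponds to $a_1 = -0.5$. The function names are illustrative.

```python
import numpy as np

def latent_roots(a):
    """Roots r of r^p + a_1 r^(p-1) + ... + a_p = 0, for a = [a_1, ..., a_p]."""
    a = np.asarray(a, dtype=float)
    return np.roots(np.concatenate(([1.0], a)))

def inverse_roots(a):
    """Roots of 1 + a_1*lam + ... + a_p*lam^p = 0 (the alternative definition)."""
    a = np.asarray(a, dtype=float)
    return np.roots(np.concatenate((a[::-1], [1.0])))

def is_stable(a):
    """Stable if all latent roots lie strictly inside the unit circle."""
    return bool(np.all(np.abs(latent_roots(a)) < 1.0))

print(is_stable([-0.5]))        # y_t = 0.5 y_{t-1} + e_t: stationary
print(is_stable([-1.0]))        # y_t = y_{t-1} + e_t: unit root
print(is_stable([-1.2]))        # y_t = 1.2 y_{t-1} + e_t: explosive
print(latent_roots([-1.0, 0.5]))  # complex pair: cyclical but stable adjustment
print(inverse_roots([-0.5]))      # reciprocal of the latent root
```

For the AR(1) example the latent root is 0.5 and the alternative root is 2, so the two definitions give the same verdict from opposite sides of the unit circle.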

7.1.10

Fractional Integration

7.1.11

The Box-Jenkins approach is a practical way of finding a suitable ARMA representation of a given time series. The steps are:

1) Identification.
Determine: (i) if seasonal differencing is necessary to remove seasonal factors, (ii) the number of times the series needs to be differenced to achieve stationarity and (iii) study the ACF and PACF to determine a suitable order of the ARMA process.

2) Estimation.
The identification step (1) leads to a stationary series and (2) narrows down the possible ARMA(p,q) processes of interest to estimate.
Methods of estimation? Remember problems with t-values?!

3) Testing.
Test the estimated model(s) for white noise residuals, using the Box-Pierce test for autocorrelation. Among models with white noise residuals pick the one with the smallest information criterion (AIC, BIC). Differences among information criteria?

This leads quickly to a forecast model, or a representation of an expectations generating mechanism that can be used in simple (rational) expectations modeling.

Limitations of univariate ARIMA models.
Most economic problems are multivariate. Variables depend on each other. Furthermore, the test procedure is only aimed at finding a forecast model. To build an econometric model that can be used for inference, the demands for testing are higher.

7.1.12

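The parameters of an ARMA model might not be unique. To see the conditions for uniqueness, decompose the polynomials of the ARMA process $A(L)y_t = B(L)\varepsilon_t$ into their factors6 as
$$ A(L) = \prod_{i=1}^{p}(1 - \phi_i L), \tag{7.18} $$
and
$$ B(L) = \prod_{j=1}^{q}(1 - \theta_j L). \tag{7.19} $$
Uniqueness requires that the two polynomials have no common factors, like $(1 - \phi_m L) = (1 - \theta_k L)$. To see why, note that it is possible to take any other polynomial $C(L)$ of finite order ($< p$), and multiply both sides of the ARMA process such that
$$ C(L)A(L)y_t = C(L)B(L)\varepsilon_t, \tag{7.20} $$
leads to
$$ A^{*}(L)y_t = B^{*}(L)\varepsilon_t, \tag{7.21} $$
which describes the same process with parameters that differ from the parameters in $A(L)$ and $B(L)$.

6 If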

7.2.1

There is a link between AR and MA models, as the presentation of the lag operator indicated. An AR process with an infinite number of lags can under certain conditions be rewritten as a finite MA process. In a similar way an infinite moving average process can be inverted to an autoregressive process of finite order.

These results have two practical implications. The first is that in practical modelling, a long MA process can often be rewritten as a shorter AR process instead, and the other way around. The second implication is that the two processes are complementary to each other. The combination of AR and MA into ARMA will lead to relatively parsimonious models, meaning models with quite few parameters. In fact, it is quite uncommon to find ARMA models above the order $p = 2$ and $q = 2$.

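The AR(1) process, $y_t = a_1 y_{t-1} + \varepsilon_t$, can be written as $(1 - a_1 L)y_t = \varepsilon_t$, and in the next step as $y_t = (1 - a_1 L)^{-1}\varepsilon_t$. The term $(1 - a_1 L)^{-1}$ represents the sum of an infinite moving average process,

$$y_t = \frac{1}{(1 - a_1 L)}\varepsilon_t = \sum_{i=0}^{\infty} b_i \varepsilon_{t-i} = B(L)\varepsilon_t,$$

where $b_i = a_1^i$. In the same way, an MA(1) process, $y_t = \varepsilon_t - b_1\varepsilon_{t-1}$, can be written as an infinite autoregressive (AR) process, $\sum_{i=0}^{\infty} a_i y_{t-i} = A(L)y_t = \varepsilon_t$. These transformations can be generalized for AR(p) and MA(q) processes, as well as for vector processes.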
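The inversion of an AR(1) into an infinite MA can be checked numerically. The sketch below assumes the moving average weights are $b_i = a_1^i$ and compares them with the response of the AR(1) recursion to a single unit shock. The value of `a1` and the number of terms are arbitrary illustration choices.

```python
import numpy as np

a1 = 0.8          # illustrative AR(1) parameter, |a1| < 1
n_terms = 20      # number of MA weights to compare

# Closed form of the MA(infinity) weights: b_i = a1**i.
b_closed = a1 ** np.arange(n_terms)

# Impulse response: feed one unit shock at t = 0 through
# y_t = a1 * y_{t-1} + eps_t and record the path of y_t.
eps = np.zeros(n_terms)
eps[0] = 1.0
y = np.zeros(n_terms)
y[0] = eps[0]
for t in range(1, n_terms):
    y[t] = a1 * y[t - 1] + eps[t]

weights_match = np.allclose(y, b_closed)
partial_sum = b_closed.sum()      # approaches 1 / (1 - a1) from below
print(weights_match, partial_sum, 1.0 / (1.0 - a1))
```

The partial sum of the weights stays finite only because $|a_1| < 1$, which is the stationarity condition discussed next.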

The question is, when are these transformations meaningful? An AR process can always be inverted, but it will only have a (meaningful) summable MA representation if it is stationary. Another way to state this condition is to say that the (latent) roots of $A(L) = 0$ should be less than unity (inside the unit circle). An MA process, on the other hand, is always stationary, since $\varepsilon_t$ by definition is a stationary process. However, an MA process can only be inverted if the latent roots of the polynomial $B(L) = 0$ are less than unity, that is, if the roots are inside the unit circle. (Notice that we refer to the latent roots. If we switch to the ordinary roots, the requirement is that they should be outside the unit circle, larger than one. See this paper for definitions of inside and outside the unit circle!)

Thus, an MA process is always stationary, but only invertible if the latent roots of $B(L)$ are inside the unit circle. An AR process is always invertible to an infinite MA process, but only stationary if the latent roots of $A(L)$ are inside the unit circle. The latter has one interesting implication: it is often convenient to rewrite an AR or a VAR to a moving average form and investigate the properties and consequences of non-stationarity from the MA representation.

The conditions are similar, and actually more general, for multivariate processes, such that $VAR(p) \Leftrightarrow MA(q)$.

7.2.2

time series processes. A theoretical result which suggests why ARIMA models are useful approximations is offered by Wold's decomposition theorem, Wold (1954). The theorem says that any covariance stationary process can be uniquely represented as the sum of two uncorrelated processes, $x_t = d_t + y_t$, where $d_t$ is a linearly deterministic process, and $y_t$ is an infinite moving average process, MA($\infty$). Thus, we can write $x_t$ as

$$x_t = d_t + \sum_{j=0}^{\infty} b_j e_{t-j},$$

where $b_0 = 1$, $\sum_{j=0}^{\infty} b_j^2 < \infty$, and $e_t$ is stationary (white noise) such that $E(e_t) = 0$, $E(e_t^2) = \sigma^2$ and $E(e_t e_{t-j}) = 0$ for $j \neq 0$.

The theorem has two implications. The first is that any series which appears to be covariance stationary can be modeled as an infinite MA process. Given the principle of duality, we can expect to find a finite autoregressive process as well. Since many economic time series are covariance stationary after first differencing, we expect ARMA models, as well as linear autoregressive distributed lag models, to work quite well for these series. The second implication is that we should be able to extract a white noise process out of any covariance stationary process. This leads to the conclusion that finding (or constructing) a white noise process in an empirical model is a basic necessity in the modeling process, because most economic time series are covariance stationary after differencing.

The presentation above has focused on the practical side of time series modelling. Time series can also be described and analysed theoretically. Consider the AR(1) model $y_t = a_1 y_{t-1} + \varepsilon_t$. The series $y_t$ is generated by the parameter $a_1$, the white noise process $\varepsilon_t$ and some initial value at the beginning of time, say $y_0$ at $t = 0$. Thus, given an initial value, a parameter $a_1$ and a random number generator that generates $\varepsilon_t \sim N(0, \sigma^2)$, where we for simplicity can set $\sigma^2 = 1$, it becomes possible to generate possible series of $y_t$ using Monte Carlo techniques. The different outcomes of the series $y_t$ can then be used to estimate the distribution of $\hat{a}_1$, to learn about how to do inference in small and medium sized samples, and to understand the distributions as $a_1 \rightarrow 1.0$.
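A minimal Monte Carlo sketch of this idea follows. It assumes $y_0 = 0$, $\sigma^2 = 1$, OLS estimation without a constant, and illustrative values for the true parameter, the sample size and the number of replications. None of these choices come from the text.

```python
import numpy as np

def simulate_ar1(a1, n_obs, rng, y0=0.0):
    """Generate y_0, ..., y_n from y_t = a1*y_{t-1} + eps_t, eps_t ~ N(0, 1)."""
    eps = rng.standard_normal(n_obs)
    y = np.empty(n_obs + 1)
    y[0] = y0
    for t in range(1, n_obs + 1):
        y[t] = a1 * y[t - 1] + eps[t - 1]
    return y

def ols_ar1(y):
    """OLS estimate of a1 in a regression of y_t on y_{t-1} (no constant)."""
    return np.sum(y[1:] * y[:-1]) / np.sum(y[:-1] ** 2)

rng = np.random.default_rng(1)
a1_true, n_obs, n_rep = 0.9, 50, 2000   # illustrative settings

# Each replication gives one draw from the sampling distribution of a1_hat.
estimates = np.array([ols_ar1(simulate_ar1(a1_true, n_obs, rng))
                      for _ in range(n_rep)])
mean_estimate = estimates.mean()
print(f"mean of a1_hat: {mean_estimate:.3f}, "
      f"5%/95% quantiles: {np.quantile(estimates, [0.05, 0.95])}")
```

With a short sample and a parameter close to one, the average estimate lies below the true value, which is the kind of small-sample behaviour such simulations are meant to reveal.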

We can also calculate the mean and the variance of $y_t$. The series $y_t$ is not independent, since it is autoregressive. Therefore, the mean and the variance of the observed $y_t$ are not informative for describing the series. Instead look at the mean of the zero mean (no constant) AR(1) process, in the form of the expected value, $E(y_t) = E(a_1 y_{t-1}) + E(\varepsilon_t)$. Looking at the expression, the left hand side tells us that the right hand side represents the mean of $y_t$. The expected value of a white noise is by definition zero, so $E(\varepsilon_t) = 0$. Since $a_1$ is a given constant, we have for the other factor $E(a_1 y_{t-1}) = a_1 E(y_{t-1})$. To find an answer we need to substitute the lags $y_{t-1}$, $y_{t-2}$, etc.

THEORETICAL PROPERTIES OF TIME SERIES MODELS


The first substitution gives $a_1 E(y_{t-1}) = a_1 E(a_1 y_{t-2} + \varepsilon_{t-1}) = a_1^2 E(y_{t-2})$. Substitute one more time, $a_1^2 E(y_{t-2}) = a_1^2 E(a_1 y_{t-3} + \varepsilon_{t-2}) = a_1^3 E(y_{t-3})$. As we continue substituting backwards we will end up with the initial value. Later we will examine the case of minus infinity. Since the initial value can be seen as a constant, we get as the final product $a_1^t E(y_0) = a_1^t y_0$. (Recall that the expected value of a constant is equal to the constant.) If the initial value is set to zero it follows that $a_1^t y_0 = 0$, and that the mean of $y_t$ is $E(y_t) = a_1^t E(y_0) = 0$.

It is standard to assume that the initial value is zero in this type of analysis. What happens if $y_t$ has a mean different from zero, and if the initial value is different from zero? The answer is simple: as long as we can assume that the AR process is stationary, and that the initial value is therefore a constant, there are no problems. Under these conditions, a non-zero mean can be represented by a constant parameter in the AR process, such as $y_t = \alpha_0 + a_1 y_{t-1} + \varepsilon_t$. The expected value of $y_t$ is $E(y_t) = E(\alpha_0) + a_1 E(y_{t-1}) + E(\varepsilon_t)$, which means that the right hand side is $\alpha_0 + a_1 E(y_{t-1})$. Again we need to substitute backwards, leading to $\alpha_0 + a_1 E(\alpha_0 + a_1 y_{t-2} + \varepsilon_{t-1}) = \alpha_0 + a_1\alpha_0 + a_1^2 E(y_{t-2})$. The next substitution gives $\alpha_0 + a_1\alpha_0 + a_1^2\alpha_0 + a_1^3 E(y_{t-3})$. If we continue substituting back to minus infinity, and set the initial value to zero, we get

$$E(y_t) = \alpha_0 \sum_{i=0}^{\infty} a_1^i = \frac{\alpha_0}{(1 - a_1)}.$$

The last step is simply an application of the solution to an infinite series, which works in this case as long as the AR process is stationary, $|a_1| < 1$. It is important that you understand the use of the expectations operator in this example because the technique is frequently used to derive a number of results. We could have reached the result in a simpler way if we had used the lag operator. Write the process as $(1 - a_1 L)y_t = \alpha_0 + \varepsilon_t$, where $\alpha_0$ is the constant, and take expectations, $E[(1 - a_1 L)y_t] = E(\alpha_0 + \varepsilon_t)$. The lag operator is a deterministic factor, which is why the result is $E(y_t) = (1 - a_1 L)^{-1}\alpha_0 = \alpha_0/(1 - a_1)$, since lagging a constant leaves it unchanged. Again, the right hand side is the sum of an infinite series. If there is no constant, $\alpha_0 = 0$, it follows immediately that $E(y_t) = 0$.
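The backward substitution can be mimicked by iterating the recursion for the expected value, $E(y_t) = \alpha_0 + a_1 E(y_{t-1})$, from an initial value of zero. In this sketch the name `alpha0` for the constant and the numerical values are assumptions made for illustration.

```python
# Illustrative values: constant alpha0 and a stationary AR(1) parameter a1.
alpha0, a1 = 2.0, 0.5

# Iterate E(y_t) = alpha0 + a1 * E(y_{t-1}), starting from E(y_0) = 0.
m = 0.0
for _ in range(200):
    m = alpha0 + a1 * m

# Closed form from the infinite geometric series.
closed_form = alpha0 / (1.0 - a1)
print(m, closed_form)
```

The iteration converges to the closed form only because $|a_1| < 1$. With $a_1 = 1$ the mean would grow without bound whenever the constant is non-zero.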

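What is the variance of the process $y_t$? The answer is given by understanding that $E(y_t y_t) = Var(y_t) = \sigma_y^2$. Thus, start from the AR(1) process and multiply both sides with $y_t$ to get $y_t y_t = a_1 y_t y_{t-1} + y_t\varepsilon_t$. Next, take expectations of both sides, $E(y_t y_t) = a_1 E(y_t y_{t-1}) + E(y_t\varepsilon_t)$, and substitute $y_t y_{t-1}$ and $y_t\varepsilon_t$ as $(a_1 y_{t-1} + \varepsilon_t)y_{t-1} = a_1 y_{t-1}^2 + \varepsilon_t y_{t-1}$ and $y_t\varepsilon_t = (a_1 y_{t-1} + \varepsilon_t)\varepsilon_t$. From this we have $a_1^2 E(y_{t-1}^2)$ and $a_1 E(\varepsilon_t y_{t-1}) + E(\varepsilon_t^2)$. In the latter expression we have by definition that $E(\varepsilon_t y_{t-1}) = 0$ (recall the basic assumptions of OLS) and that $E(\varepsilon_t^2) = \sigma_\varepsilon^2$. Put the results together,

$$E(y_t y_t) = a_1^2 E(y_{t-1}^2) + \sigma_\varepsilon^2$$

$$\sigma_y^2 = a_1^2\sigma_y^2 + \sigma_\varepsilon^2$$

$$(1 - a_1^2)\sigma_y^2 = \sigma_\varepsilon^2$$

$$\sigma_y^2 = \frac{\sigma_\varepsilon^2}{(1 - a_1^2)}$$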

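From the calculation of the variance we can also see the value of the autocovariance and the autocorrelation coefficients, say at lag $k$. Multiply both sides of the AR(1) process with $y_{t-k}$ and take expectations,

$$E(y_t y_{t-k}) = a_1 E(y_{t-1} y_{t-k}) + E(\varepsilon_t y_{t-k}).$$

From this follows that the autocovariance is

$$\gamma_k = a_1^k \frac{\sigma_\varepsilon^2}{(1 - a_1^2)}.$$

From this follows, after dividing by the variance, that the autocorrelation is

$$\rho_k = a_1^k.$$

From this expression it is obvious that the autocorrelation function for the AR(1) process dies out geometrically as the lag length $k$ increases, and slowly when $|a_1|$ is close to one.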

Calculating the mean, variance, autocovariances and autocorrelations for AR(1), AR(2), MA(1) and MA(2) processes is a standard exercise in time series courses, followed by an investigation of the unit root case $a_1 = 1$. To be completed...
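As a check on these moments, the following sketch simulates a long stationary AR(1) series and compares the sample variance and the first autocorrelations with the expressions $\sigma_\varepsilon^2/(1 - a_1^2)$ and $a_1^k$. The parameter values, the seed and the helper function name are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
a1, sigma_eps, n = 0.6, 1.0, 200_000   # illustrative settings

eps = sigma_eps * rng.standard_normal(n)
y = np.empty(n)
# Start the series in its stationary distribution.
y[0] = eps[0] / np.sqrt(1.0 - a1 ** 2)
for t in range(1, n):
    y[t] = a1 * y[t - 1] + eps[t]

def sample_autocorr(x, k):
    """Sample autocorrelation of x at lag k."""
    x = x - x.mean()
    return np.sum(x[k:] * x[:-k]) / np.sum(x ** 2)

var_theory = sigma_eps ** 2 / (1.0 - a1 ** 2)
var_sample = y.var()

lags = [1, 2, 3]
rho_theory = np.array([a1 ** k for k in lags])
rho_sample = np.array([sample_autocorr(y, k) for k in lags])
print(var_sample, var_theory)
print(rho_sample, rho_theory)
```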

7.3.1

Seasonality

Seasonality in a time series can be dealt with in three ways. The first is to use seasonal dummy variables. The second method is to use seasonal differencing. And the third method is to use a program called X12 (previously X11).

All methods suffer from the fact that efficient estimation of seasonal effects requires a lot of observations, which are rare in most applied econometric time series work. Econometricians tend to use seasonal dummies, since they are easy to use and lead to transparency in the model. Seasonal differencing is the standard method in the Box-Jenkins approach.

For a quarterly series, seasonal differencing implies differencing in the following way: $(1 - L^4)y_t = y_t - y_{t-4} = \Delta_4 y_t$. The corresponding operator for monthly data is $(1 - L^{12})$. In econometrics, the assumption of seasonal unit roots is difficult to test. There are few clear cut examples of such processes in the literature, and the tests for seasonal unit roots are quite complex, especially given the limited samples in econometrics. Thus, econometricians tend to use seasonal differencing when dummy variables do not work. Otherwise, including lags at seasonal frequencies will usually take care of seasonal effects.
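The seasonal difference operator is easy to apply directly. The sketch below uses a hypothetical helper, `seasonal_difference`, and an artificial quarterly series built from a linear trend plus a fixed seasonal pattern, so that the result of applying $(1 - L^4)$ is known in advance.

```python
import numpy as np

def seasonal_difference(y, s):
    """Apply (1 - L^s): return y_t - y_{t-s} for t = s, ..., n-1."""
    y = np.asarray(y, dtype=float)
    return y[s:] - y[:-s]

# Artificial quarterly data: trend of 0.25 per quarter plus a repeating
# seasonal pattern (values chosen only for illustration).
quarters = np.arange(12)
pattern = np.array([1.0, -2.0, 0.5, 3.0])
y = 0.25 * quarters + np.tile(pattern, 3)

# The seasonal pattern cancels, leaving the yearly trend increase 4 * 0.25.
d4 = seasonal_difference(y, 4)
print(d4)
```

Setting `s=12` gives the corresponding operation for monthly data. Note that the first `s` observations are lost.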

Finally, X12 can be described as a state of the art tool, or as a black box, where you send in seasonal data and out comes a deseasonalised series. X12 is a respected method, and is frequently used to deseasonalise public statistics.

procedure, or some similar program to remove seasonality. Removing seasonality by seasonal differencing, seasonal dummies or by using X12 does not affect the presence of one or more unit roots in the series. The Dickey-Fuller test, or other tests for unit roots, works as before. X12 is a program designed for univariate analysis, meaning that if seasonality is removed from single series by X12 prior to modeling a system, seasonality can still be left in a multivariate single equation model or in a system of equations. The problem with X12 is its black box nature: the econometrician loses some control over the modeling process. Some care in the use of X12 is recommended.

ADDITIONAL TOPICS


7.3.2

Non-stationarity

(To be completed)

Differencing until stationarity is the standard Box-Jenkins approach. A bit ad hoc. In econometrics the approach is to test first, but only reject the null of integrated of order one in the case of strong evidence against it. Alternatives include linear deterministic trends, polynomial trends etc. Dangerous: spurious detrending under the maintained hypothesis of integrated variables.

7.4 Aggregation

The following section offers a brief discussion about the problems of aggregation. The interested reader is referred to the literature to learn more [Wei (1990) is a good textbook with many references on the subject, see also Sjöö (1990, ch. 4)]. Aggregation of series means aggregation over agents and markets, or aggregation over time. The stock of money, measured by M3 at the end of the month, represents an aggregation over individuals. A series like aggregate consumption in the national accounts represents an aggregation over both individuals and time.

Aggregation over time is usually referred to as temporal aggregation. Money holdings are a stock variable, which can be measured at any point in time. Temporal aggregation of a stock variable implies picking observations with larger intervals, using say a money series measured at the end of a quarter, instead of at the end of each month. Consumption, on the other hand, is a flow variable. It cannot be measured at a point in time, only as the sum of consumption over a given period. Temporal aggregation in this case implies taking the sum of consumption over intervals. The distinction is of importance because the effects of temporal aggregation are different for stock and flow variables.

Aggregation, both over time and individuals, can change the functional form of the distribution of the variables, and it can affect the residual variance and t-values. Exactly how aggregation changes a model varies from situation to situation. There are however some general conclusions regarding temporal aggregation which we will repeat in this section. In many situations there is little we can do about these problems, except working with continuous time models and/or selecting series with a low degree of temporal aggregation. That the problems are hard to deal with is no excuse for forgetting or hiding them, as is done in many textbooks in econometrics. The area of aggregation is an interesting challenge for econometricians since it has not been explored as much as it deserves.

An interesting example of the consequences of aggregation is given in Christiano and Eichenbaum (1987). They show how one can get extremely different results by using discrete time models with yearly, quarterly and monthly data compared with a continuous time model. They tried to estimate the speed of adjustment in the stock of inventories in the U.S. national accounts. Using a continuous time model they estimated the average time for closing 95% of the gap between the desired and the actual stock of inventories to be 17 days. The discrete time models predicted much longer adjustment times. Using monthly data the result was 46 days, with quarterly data 7 months, and with yearly data 5 1/2 years!

Aggregation also becomes an important problem if we have a theory that describes the stochastic behavior of a variable which we would like to test with empirical data. There are many results, in macro and finance, that predict that series should follow a random walk, or be the outcome of a martingale process.


by theory. An example is Hall (1978) who, from a life cycle hypothesis, derived that private consumption should follow an AR(1) process, and be a random walk under the assumption of rational expectations. The first complicating factor is that of temporal aggregation. An additional complication is adjustment costs, which will also affect the original model. If private consumption, as an example, is defined as an AR(1) model, temporal aggregation changes it to an ARMA(1,1), and the existence of adjustment costs will then transform it to an ARMA(2,1) model.

Temporal aggregation, adjustment costs and measurement errors are factors which can affect the structure of the model and the size of estimated parameters. To this list one could also add problems of seasonal factors, trends and hidden periodicity. The latter is a problem because the larger the temporal aggregation, the more difficult it is to get a correct estimate of parameters that reflect cycles which are not timed with the sampling interval. Therefore, one should be critical of papers which try to prove that some empirical series behaves like a theoretical process. Is it possible for the author to control all of these factors?

For a flow variable with an ARIMA representation, the outcome of temporal aggregation depends on hidden periodicity, which, if it exists, can affect both the AR and the MA process. In general, aggregation will complicate the structure of the ARIMA model. A simple AR model becomes an ARMA model. But, as aggregation becomes larger, the structure of the model becomes simpler.

For a stock variable the consequences are clearer. An ARIMA($p, d, q$) process of a stock variable becomes, after temporal aggregation, an ARIMA($p, d, s$) process, where

$$s \leq \text{integer}\,[(p + d) + (q - d)/m],$$

and $m$ is the systematic sampling interval. As a rule of thumb it can be assumed that temporal aggregation adds +1 to the MA process.

Since differencing is a form of temporal aggregation, taking higher and higher differences of a series will create an MA process. This can be seen in any time series program that produces ACFs and PACFs. The more one differences a series, the more clearly the series will look like an MA process. Thus, it follows that observing an MA process in the identification step in the Box-Jenkins approach is a sign of over-differencing.
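Over-differencing can be demonstrated on the simplest possible case. If a white noise series $\varepsilon_t$, which needs no differencing, is differenced anyway, the result $\varepsilon_t - \varepsilon_{t-1}$ is an MA(1) process with first autocorrelation $-0.5$ and zero autocorrelations beyond lag one. The sketch below checks this on simulated data. The sample size and seed are arbitrary.

```python
import numpy as np

def sample_autocorr(x, k):
    """Sample autocorrelation of x at lag k."""
    x = x - x.mean()
    return np.sum(x[k:] * x[:-k]) / np.sum(x ** 2)

rng = np.random.default_rng(3)
eps = rng.standard_normal(100_000)   # white noise, already stationary

# Difference a series that did not need differencing.
d = np.diff(eps)

rho1 = sample_autocorr(d, 1)   # theory: -0.5
rho2 = sample_autocorr(d, 2)   # theory: 0
print(rho1, rho2)
```

A large negative spike at lag one in the ACF of a differenced series is therefore a typical warning sign.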

The expression holds even for an ARMA model where $d = 0$. For an ARIMA model, as $m$ gets larger, the model turns towards an IMA($d$, $d - 1$) process. Thus, we end up with a random walk model. This is an interesting result for two reasons. First, since the random walk model often seems to fit macroeconomic and especially financial time series quite well, could that be the outcome of having too large sampling intervals? Second, the result explains the findings in Christiano and Eichenbaum (1987), that larger sampling intervals lead to slower and slower adjustment speed in inventories. The larger the sampling interval, the more inventories seemed like a random walk. As a consequence, historical shocks further and further back in history seemed more important. In the end, in the random walk model, all historical shocks have the same importance and there would be no adjustment at all.

Temporal aggregation will also affect prediction. The general result is that aggregation reduces the efficiency of the forecasts, and that the relative loss of efficiency is larger for a non-stationary series than for a stationary one. (Remember that most macroeconomic series are non-stationary.)

It is also worth mentioning some conclusions concerning causality. When dealing with stock variables, aggregation will not affect the direction of causality if there is a clear causality from one variable to another. It will, however, weaken the


estimated strength of the relationship, and can therefore lead to wrong conclusions from Granger non-causality tests. For flow variables, on the other hand, temporal aggregation turns a one-directional causality into what will appear to be a two-sided causality. In this situation a clear warning is in place.

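Finally, we also look at the aggregation of two random variables, $\tilde{X}_t$ and $\tilde{Y}_t$. Suppose that they are two independent stationary processes with mean zero,

$$E[\tilde{X}_t \mid y_t] = E[\tilde{Y}_t \mid x_t] = 0. \qquad (7.22)$$

The autocovariances of $\tilde{X}_t$ and $\tilde{Y}_t$ are

$$cov(x_t, x_{t-k}) = \gamma_{x,k}, \qquad (7.23)$$

$$cov(y_t, y_{t-k}) = \gamma_{y,k}. \qquad (7.24)$$

The sum of $\tilde{X}_t$ and $\tilde{Y}_t$,

$$\tilde{Z}_t = \tilde{X}_t + \tilde{Y}_t, \qquad (7.25)$$

has the autocovariances

$$\gamma_{z,k} = \gamma_{x,k} + \gamma_{y,k}. \qquad (7.26)$$

If

$$\tilde{X}_t \sim ARMA(p_1, q_1), \qquad (7.27)$$

and

$$\tilde{Y}_t \sim ARMA(p_2, q_2), \qquad (7.28)$$

and,

$$\tilde{Z}_t = \tilde{X}_t + \tilde{Y}_t, \qquad (7.29)$$

then,

$$\tilde{Z}_t \sim ARMA(x_1, x_2), \qquad (7.30)$$

where $x_1 \leq p_1 + p_2$ and $x_2 \leq \max(p_1 + q_2, p_2 + q_1)$. A practical example is a series which is measured with a white noise error. That is, the true series is added to a white noise series. If the true series is AR($p$) then the result of this aggregation will be an ARMA($p, p$) process.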
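The measurement error case can be illustrated with theoretical autocovariances only. The sketch below assumes a true AR(1) series $x_t$ with parameter `a1`, an independent white noise error $u_t$, and the observed sum $z_t = x_t + u_t$. The variable names and numerical values are assumptions made here. Because the autocovariances add, the noise only raises the variance at lag zero, which produces the ARMA(1,1) signature: the autocorrelations decay at the rate `a1` from lag two onwards, but the first autocorrelation is smaller than `a1`.

```python
import numpy as np

a1 = 0.8            # AR(1) parameter of the true series (illustrative)
sigma_eps2 = 1.0    # innovation variance of the true series
sigma_u2 = 1.0      # variance of the white noise measurement error

lags = np.arange(0, 6)

# Autocovariances of the AR(1): gamma_k = a1^k * sigma_eps^2 / (1 - a1^2).
gamma_x = a1 ** lags * sigma_eps2 / (1.0 - a1 ** 2)

# Autocovariances of white noise: sigma_u^2 at lag 0, zero elsewhere.
gamma_u = np.where(lags == 0, sigma_u2, 0.0)

# Independent components: autocovariances of the sum add up.
gamma_z = gamma_x + gamma_u
rho_z = gamma_z / gamma_z[0]

# Ratio rho_k / rho_{k-1} for k >= 2 equals a1, as for an ARMA(1,1).
ratios = rho_z[2:] / rho_z[1:-1]
print(rho_z)
print(ratios)
```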

We can conclude this section by stating that aggregation leads to loss of information, which, if the aggregation is large, might fool us into assuming that the random walk is the appropriate model. The extent to which aggregation leads us to wrong conclusions has not been established yet. Partly this is so because we need better data on shorter time intervals than what is available. Remember that ignoring problems is not a way of solving them. One way of dealing with the problems of aggregation is to use continuous time econometric techniques instead, see Sjöö (1993) for a discussion and further references.

The autoregressive process represents a basic way of modeling time series. As complexity and multivariate processes are introduced, the AR model transforms into a system of equations, where it becomes possible to give the parameters a structural (economic) interpretation. In principle, we have the following types of equation models, where $\epsilon_t \sim NID(0, \sigma^2)$.

4. Distributed lag models, DL(p): $y_t = B(L)x_t + \epsilon_t$;

$A(L)y_t = B(L)x_t + \theta(L)\epsilon_t$;

7. Rational distributed lag model, RDL: $y_t = \frac{B(L)}{A(L)}x_t + \epsilon_t$;

8. Transfer function: $y_t = \frac{B(L)}{A(L)}x_t + \frac{\theta(L)}{\phi(L)}\epsilon_t$.

Notice that the transfer function is also a rational distributed lag since it contains a ratio of two lag structures. Also, (7) and (8) can be viewed as distributed lag models since $D(L) = B(L)/A(L)$. Notice that rational distributed lag models require some information about $B(L)$ to be workable.

Imposing restrictions on the lag structure $B(L)$ in distributed lag models leads to further models:

9. Geometric lag structure (= Koyck), where $B(L)$ is assumed to decline according to some exponential function.

10. Polynomial distributed lag (PDL) models (= Almon lags), where $B(L)$ declines according to some polynomial function, decided a priori.

11. All other types of a priori restrictions on $B(L)$ not covered by (9) and (10).⁷

12. The error correction model. This model embraces all of the above models as special cases. The following explains why this is so.

Introduction to Error Correction Models

Economic time series are often non-stationary: their means and variances change over time. The trend component in the data can be either deterministic or stochastic, or a combination of both. Fitting a deterministic trend assumes that the data series grows at a fixed rate each period. This is seldom a good way of describing trends in economic time series. Instead they are better described as containing stochastic trends with a drift. The series might be growing over time, but it is not possible to predict whether it grows or declines in the next period. Variables with stochastic trends can be made stationary by taking first differences. This type of variable is called integrated of order 1, where the order of integration is determined by the number of times the variable needs to be differenced before it becomes stationary.

A necessary condition for fitting trending data in an econometric model is that the variables share the same trend, otherwise there is no meaningful long-run relationship between them.⁸ Testing for co-integration is a way of testing if the data have a common trend, or if they tend to drift apart as time increases. The simplest way to test for cointegration is the so called Engle and Granger two step procedure. The test implies determining whether the data contain stochastic trends, and if so, testing if there are common trends. If $x_t$ and $y_t$ are two variables with stochastic trends that become stationary after first differencing, cointegration can be tested by running the following co-integrating regression,

$$y_t = \alpha + \beta x_t + \epsilon_t. \qquad (7.31)$$

If both $y_t$ and $x_t$ are integrated variables of the same order, a necessary condition for a statistically meaningful long-run relationship is that the residual term ($\epsilon_t$) is stationary. If that is the case, the error term from the regression can be seen as temporary deviations from the long-run, and $\hat{\alpha}$ and $\hat{\beta}$ can be viewed as estimates of the long-run steady state relation between $x$ and $y$.

7 Restrictions are put on the lag process to make the estimation more effective. A priori restrictions can be motivated by a limited sample and multicollinearity that affects estimated standard errors of the individual lags. These types of restrictions are not used anymore. Today, it is recognized that it is more important to focus on information criteria, white noise residuals and building a well-defined statistical model, instead of imposing restrictions that might not be valid.

8 The exception is tests of the efficient market hypothesis, and related tests of rational expectations. See Appendix A in Sjöö and Sweeney (1998) and Sjöö (1998).
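The first step of the Engle and Granger procedure can be sketched on simulated data. The example assumes that $x_t$ is a random walk, that $y_t$ is tied to it by a long-run relation with hypothetical parameters `alpha` and `beta`, and that deviations from the relation follow a stationary AR(1). The second step is only hinted at through the first-order autoregressive coefficient of the residuals. A proper test would use a Dickey-Fuller type regression with the appropriate critical values, which is not implemented here.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 500

# x_t: random walk, i.e. a series with a stochastic trend.
x = np.cumsum(rng.standard_normal(n))

# u_t: stationary AR(1) deviations from the long-run relation.
shocks = rng.standard_normal(n)
u = np.zeros(n)
for t in range(1, n):
    u[t] = 0.5 * u[t - 1] + shocks[t]

# y_t shares the stochastic trend of x_t (illustrative parameters).
alpha, beta = 1.0, 2.0
y = alpha + beta * x + u

# Step 1: co-integrating regression of y_t on a constant and x_t by OLS.
X = np.column_stack([np.ones(n), x])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
alpha_hat, beta_hat = coef
resid = y - X @ coef

# Step 2 (simplified): residuals from a co-integrating relation should be
# stationary, so their AR(1) coefficient should be clearly below one.
rho_hat = np.sum(resid[1:] * resid[:-1]) / np.sum(resid[:-1] ** 2)
print(f"beta_hat={beta_hat:.3f}  residual AR(1) coefficient={rho_hat:.3f}")
```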

A general way of building a model of time series, without imposing ad hoc a priori restrictions, is the autoregressive distributed lag model. For two variables we have,

$$A(L)y_t = B(L)x_t + \epsilon_t, \qquad (7.32)$$

where the lag polynomials are given by $A(L) = \sum_{i=0}^{k} a_i L^i$ and $B(L) = \sum_{i=0}^{k} b_i L^i$. The first coefficient in $A(L)$ is set to unity, $a_0 = 1$. The lag length is chosen such that the error term becomes a white noise process, $\epsilon_t \sim NID(0, \sigma^2)$. The long-run solution of this model is given by,

$$y_t = \beta x_t + \epsilon_t, \qquad (7.33)$$

where $\beta = B(L)/A(L)$ evaluated at $L = 1$, that is $\beta = B(1)/A(1)$. Without loss of generality we can use the difference operator, $\Delta x_t = x_t - x_{t-1}$, to rewrite the autoregressive model as an error correction model,

$$\Delta y_t = \sum_{i=0}^{k} \gamma_i \Delta x_{t-i} + \sum_{i=1}^{k} \pi_i \Delta y_{t-i} + \lambda ECM_{t-1} + \epsilon_t, \qquad (7.34)$$

where $ECM_{t-1} = y_{t-1} - \beta x_{t-1}$. The latter term can be said to represent the deviation from the long run steady state relation between the two variables. It is convenient to think of the $ECM_t$ variable at the first lag as controlling the long-run path of the dependent variable. Asymptotically it will not matter at which lag the $ECM_t$ is placed. In a multivariate model, and for a finite sample, it might make a difference, however, and a seasonal lag might work better.⁹ Furthermore, for an ECM to work well in a model it should not display any signs of seasonal effects or extreme outliers. These effects should be removed when the $ECM_t$ is constructed. The $\lambda$-parameter of the error correction term indicates how changes in $y_t$ react to deviations from the long-run equilibrium.
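That the error correction form is only a rewriting of the autoregressive distributed lag model can be verified numerically for the simplest case. The sketch assumes an ADL with one lag of each variable, written as $y_t = a_1 y_{t-1} + b_0 x_t + b_1 x_{t-1} + \epsilon_t$. The coefficient names and values are chosen for this illustration. The implied long-run coefficient is $(b_0 + b_1)/(1 - a_1)$ and the adjustment coefficient is $-(1 - a_1)$.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200
a1, b0, b1 = 0.7, 0.4, 0.2     # illustrative ADL(1,1) coefficients

x = np.cumsum(rng.standard_normal(n))   # an integrated regressor
eps = rng.standard_normal(n)

beta_lr = (b0 + b1) / (1.0 - a1)   # long-run coefficient B(1)/A(1)
lam = -(1.0 - a1)                  # error correction (adjustment) coefficient

y_adl = np.zeros(n)
y_ecm = np.zeros(n)
for t in range(1, n):
    # ADL form in levels.
    y_adl[t] = a1 * y_adl[t - 1] + b0 * x[t] + b1 * x[t - 1] + eps[t]
    # ECM form: change in y explained by the change in x and the lagged
    # deviation from the long-run relation.
    ecm_lag = y_ecm[t - 1] - beta_lr * x[t - 1]
    dy = b0 * (x[t] - x[t - 1]) + lam * ecm_lag + eps[t]
    y_ecm[t] = y_ecm[t - 1] + dy

same_series = np.allclose(y_adl, y_ecm)
print(same_series, beta_lr, lam)
```

Both recursions generate the same series, so the error correction form imposes no restriction. It only separates the short-run effect from the adjustment towards the long-run relation.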

When modeling integrated variables, rewriting the system as a (vector) error correction model is a natural step. However, error correction models work with stationary data series as well. Assuming costly adjustment generally leads to partial adjustment models, which are better written in the less restrictive error correction form. Optimal control theory, approximations to structural systems in continuous time etc. will also lead to error correction models, see Hendry, Pagan and Wickens (1982), Hendry (1995), or ch. 2 in Banerjee et al. (1993).

If $x_t$ and $y_t$ contain stochastic trends, it is necessary that they are co-integrated for the ADL model to make sense in the long-run. For instance, if the variables are co-integrated, the error term from the co-integrating regression ($\epsilon_t$ above) can be used as the error correction mechanism. This was shown in Engle and Granger

9 For


that cointegration implies Granger causality in at least one direction. The advantage of the error correction model is that it does not put a priori restrictions on the model and that it separates long-run and short-run effects. It has proven to be a very efficient way to model various economic relations, like money demand, consumption etc. It should be recognized that the early literature on EC models tended to overlook the problem of weak exogeneity. With the developments in the field of multivariate cointegration it has been shown that when the same EC expression determines more than one variable, there are cross equation restrictions between the co-integrating parameters. These restrictions imply that error correction expressions have to be estimated within complete systems, not by single equation OLS.

Multivariate Model Survey

Multivariate models are introduced later. For the time being we can conclude our listing of models with the following:

Vector autoregressive models, VAR.

Vector autoregressive moving average processes, VARMA. The VAR and the VARMA represent multivariate ARIMA models.

Vector error correction models, VECMs.

Structural vector autoregressive models, SVAR.

Systems of structural equations estimated using estimators.

Structural vector error correction models.

The latter represent the final step, where a complete system of interacting variables is modeled and given an (economic) structural interpretation.


8. MULTIPLIERS AND LONG-RUN SOLUTIONS OF DYNAMIC MODELS

Given an autoregressive or distributed lag structure $A(L)$, $B(L)$ or $D(L)$, the long run static solution of the model is found by setting $L = 1$. The intuition is that in the long run there will be no changes in the explanatory variables, and it will not matter if we explain $y_t$ by say $x_t$ and/or $x_{t-i}$.

The conditional mean of $y_t$ in an ADL model, for example, is

$$E_t\{y_t\} = y_t = A(L)^{-1}B(L)x_t. \qquad (8.1)$$

Setting $L = 1$ gives the long-run solution

$$y = \frac{B(1)}{A(1)}x. \qquad (8.2)$$

For a distributed lag model we would have

$$y = D(1)x, \qquad (8.3)$$

and, after a unit increase in $x$,

$$y = D(1)(x + 1). \qquad (8.4)$$
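Setting $L = 1$ amounts to summing the coefficients of each lag polynomial. The sketch below uses an assumed ADL model with $A(L) = 1 - 0.5L$ and $B(L) = 0.2 + 0.1L$, values picked only for illustration, and checks the long-run solution by holding $x$ constant and iterating the dynamic equation until $y$ settles.

```python
# Coefficients of the lag polynomials, lowest power of L first
# (illustrative values).
a_coefs = [1.0, -0.5]   # A(L) = 1 - 0.5 L
b_coefs = [0.2, 0.1]    # B(L) = 0.2 + 0.1 L

# Evaluating a lag polynomial at L = 1 is the sum of its coefficients.
long_run = sum(b_coefs) / sum(a_coefs)   # B(1) / A(1)

# Check: keep x fixed and iterate y_t = 0.5 y_{t-1} + 0.2 x_t + 0.1 x_{t-1}.
x_bar = 1.0
y = 0.0
for _ in range(200):
    y = 0.5 * y + 0.2 * x_bar + 0.1 * x_bar

print(long_run, y)
```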

The total effect of a change in $x_t$ is given by the sum of the coefficients in $D(L)$ when $L = 1$. If there are $m$ lags in $D(L)$, the total multiplier is

$$D(1) = (\delta_0 + \delta_1 + \dots + \delta_m) = \sum_{j=0}^{m} \delta_j. \qquad (8.5)$$

which dies out slowly in the long-run.

The impact multiplier is associated with the first parameter in $D(L)$, here written $\delta_0$. Thus taking $\delta_0 x_t$ gives you the impact multiplier, the first period's effect following a change (a shock) in $x_t$.

The $j$:th interim multiplier is the sum of the coefficients up to and including the $j$:th lag,

$$\sum_{i=0}^{j} \delta_i. \qquad (8.6)$$

The standardized interim multiplier divides this sum by the total multiplier,

$$\Big[\sum_{i=0}^{j} \delta_i\Big] \Big/ D(1), \qquad (8.7)$$

such that it represents the share of the total multiplier up until the $j$:th lag.

The mean lag is the average of the lag lengths, weighted by the coefficients,

$$\Big[\sum_{j=0}^{m} j\,\delta_j\Big] \Big/ \Big[\sum_{j=0}^{m} \delta_j\Big]. \qquad (8.8)$$

Notice that $m$ could be equal to infinity if we have a stable model, with stationary variables, such that the infinite sum of $\delta_i$ converges to a constant sum in the long run.

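The mean lag can be derived in a more sophisticated way, by differentiating $D(L)$ with respect to $L$ and then dividing by $D(1)$. That is,

$$D(L) = \delta_0 + \delta_1 L + \delta_2 L^2 + \dots + \delta_s L^s, \qquad (8.9)$$

and

$$D'(L) = \delta_1 + 2\delta_2 L + 3\delta_3 L^2 + \dots + s\,\delta_s L^{s-1}, \qquad (8.10)$$

so that the mean lag is

$$\frac{D'(1)}{D(1)} = \frac{B'(1)}{B(1)} - \frac{A'(1)}{A(1)}. \qquad (8.11)$$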

Finally we have the median lag, representing the number of periods required for 50% of the total effect to be achieved. The median lag is obtained by solving for $j$ in

$$\Big[\sum_{i=0}^{j} \delta_i\Big] \Big/ D(1) = 0.50. \qquad (8.12)$$

multiplier in the middle of the lag structure.
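The multipliers defined in this chapter can be computed in a few lines for a given set of distributed lag coefficients. The sketch below uses hypothetical weights `delta`. Since (8.12) seldom holds exactly for an integer lag, the median lag is taken here, as an assumption, to be the first lag at which the standardized interim multiplier reaches 0.50.

```python
import numpy as np

# Hypothetical distributed lag coefficients delta_0, ..., delta_3.
delta = np.array([0.2, 0.4, 0.3, 0.1])
lags = np.arange(len(delta))

total = delta.sum()               # total multiplier D(1), eq. (8.5)
impact = delta[0]                 # impact multiplier
interim = np.cumsum(delta)        # interim multipliers, eq. (8.6)
standardized = interim / total    # shares of the total effect, eq. (8.7)

# Mean lag: coefficient-weighted average lag, eq. (8.8).
mean_lag = np.sum(lags * delta) / total

# Same mean lag via D'(1)/D(1), eqs. (8.9)-(8.11).
mean_lag_derivative = np.sum(lags[1:] * delta[1:]) / total

# Median lag: first lag where at least half of the total effect is reached.
median_lag = int(np.argmax(standardized >= 0.5))

print(total, impact, interim, mean_lag, median_lag)
```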


9. VECTOR AUTOREGRESSIVE MODELS

The extension of ARIMA modeling into a multivariate framework leads to Vector Autoregressive (VAR) models, Vector Moving Average (VMA) models and Vector Autoregressive Moving Average (VARMA) models. In economics, since most variables display autocorrelation and are cross-correlated, VAR models are an interesting choice for modeling economic systems. Vector models can be constructed using similar techniques as those for single variable ARIMA models. The autocorrelation and partial autocorrelation functions can be extended to display cross-correlations among the variables in the system. However, when modelling more than two variables, these cross autocorrelation and cross partial autocorrelation functions quickly turn into complex matrix expressions for each lag.¹ Thus, the cross-correlation functions are not practical tools to work with.

The advantage of using VARs is that a VAR represents a statistical description of the economy. When using ARIMA on univariate series, in many situations the combination of AR and MA processes turns out to be an efficient way of finding a stochastic representation of a process. VAR models are usually effective in modeling multivariate systems, and can be used to make forecasts and dynamic simulations of different shocks to the system. These shocks can come from policy, from productivity or basically from anywhere in the economy, and the shocks can be assumed to be transitory or permanent. The main complicating factor is that in order to understand what shocks and simulations actually mean it is necessary to identify the underlying economic relation among the variables. To make VAR models work for economic analysis it is necessary to impose some restrictions on the residual covariance matrix of the VAR. Thus, there is no free lunch here in terms of avoiding discussing causality and simultaneity problems. It is necessary to point out the latter because in the beginning of the history of VAR models it seemed like VAR models could be used without economic theory, but that was built on a misunderstanding.

In econometrics the focus is on finding a parsimonious VAR representation with NID residuals.

Let $x_t$ be a $p$-dimensional vector of stochastic time series variables, represented as a $k$:th order VAR model,

\[
x_t = \sum_{i=1}^{k} A_i x_{t-i} + e_t, \quad \text{or} \quad A(L)x_t = e_t, \tag{9.1}
\]

where $A_i$ is the matrix of coefficients at lag $i$, so that $A(L) = A_0 - \sum_{i=1}^{k} A_i L^i$, where $A_0$ is a diagonal matrix (the identity matrix in the form above), and $e_t$ is a vector of white noise residual terms. Notice that all variables across all equations have the same lag length ($k$). This is so because it makes it possible to estimate the system with OLS. If the lag order is allowed to vary across equations, the VAR must be estimated with the seemingly unrelated regressions (SUR) method.
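As a minimal sketch of the point that a VAR with a common lag length can be estimated equation by equation with OLS, the following Python code simulates a bivariate VAR(1) and recovers its coefficient matrix. The coefficient values, sample size, seed and function names are illustrative assumptions, not taken from the text, and numpy is assumed to be available.

```python
import numpy as np

def simulate_var1(A, T, rng):
    """Simulate x_t = A x_{t-1} + e_t with e_t ~ N(0, I), starting at zero."""
    p = A.shape[0]
    x = np.zeros((T, p))
    for t in range(1, T):
        x[t] = A @ x[t - 1] + rng.standard_normal(p)
    return x

def estimate_var(x, k):
    """OLS estimate of a VAR(k) without constant.

    Returns the list [A_1, ..., A_k] and the (T-k) x p residual matrix.
    """
    T, p = x.shape
    # Row t of Z holds (x_{t-1}', ..., x_{t-k}'); the same Z enters every equation.
    Z = np.hstack([x[k - i:T - i] for i in range(1, k + 1)])
    Y = x[k:]
    coef, *_ = np.linalg.lstsq(Z, Y, rcond=None)  # column j holds equation j
    resid = Y - Z @ coef
    A_hat = [coef[i * p:(i + 1) * p].T for i in range(k)]
    return A_hat, resid

# Hypothetical coefficient matrix (illustration only); eigenvalues lie inside the unit circle.
A_true = np.array([[0.5, 0.1],
                   [0.2, 0.3]])
rng = np.random.default_rng(0)
x = simulate_var1(A_true, T=5000, rng=rng)
A_hat, resid = estimate_var(x, k=1)
print(np.round(A_hat[0], 2))
```

Because every equation contains the same regressors, a single least-squares call on the stacked system gives the same numbers as running OLS on each equation separately.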

A VAR model can be inverted into its vector moving average (VMA) form as

\[
x_t = \sum_{i=0}^{\infty} C_i e_{t-i} = C(L)e_t .
\]

1 See

The MA form is convenient for analysing the properties of a VAR and for investigating the consequences of shocks to the system. Estimation, however, is usually done in the VAR form, and is straightforward since each equation can be estimated individually with OLS. The lag length ($k$) of the autoregressive process is chosen such that the estimated residual process, in combination with constants, trends, dummy variables and seasonals, becomes a white noise process in each equation. The idea is that the lag length is equal for all variables in all equations.

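A second order VAR of dimension $p$ with a constant is

\[
\begin{bmatrix} x_{1t} \\ x_{2t} \\ \vdots \\ x_{pt} \end{bmatrix}
=
\begin{bmatrix} a_{1} \\ a_{2} \\ \vdots \\ a_{p} \end{bmatrix}
+ \sum_{i=1}^{2}
\begin{bmatrix}
a_{11,i} & a_{12,i} & \cdots & a_{1p,i} \\
a_{21,i} & a_{22,i} & \cdots & a_{2p,i} \\
\vdots & \vdots & \ddots & \vdots \\
a_{p1,i} & a_{p2,i} & \cdots & a_{pp,i}
\end{bmatrix}
\begin{bmatrix} x_{1,t-i} \\ x_{2,t-i} \\ \vdots \\ x_{p,t-i} \end{bmatrix}
+
\begin{bmatrix} e_{1t} \\ e_{2t} \\ \vdots \\ e_{pt} \end{bmatrix}.
\]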

VAR models were strongly advocated by Sims (1980) as a response to what he described as "incredible restrictions" imposed on standard structural econometric models. Up until the mid-1980s, empirical time series econometrics was dominated by the estimation of textbook equations. Researchers simply took an equation from theory, estimated it, and did not pay much attention to whether the model and the data actually fitted each other. Typically, dynamic lag structures were treated in a very ad hoc way. Sims argued that it would be better to find a statistical model which described the data series and their interaction as well as possible. Once the statistical model was there, it could be used to forecast and simulate the economy. In particular, it would according to Sims be possible to analyse the effects of various policy changes.

Sims' critique is related to the "Lucas critique". Lucas showed how, in a world of rational expectations, it was not possible to interpret estimated parameters in structural econometric models as (deep) structural behavioral or policy parameters. Since agents form their behavior on plans that build on forecasts of variables, not on historical outcomes of variables, the parameters estimated from historical observations become a mixture of behavioral parameters and forecast-generating parameters. Further, under rational expectations, econometric models could not be used to analyze policy changes, because a change in policy would by definition lead to a change in the parameters of the system. Sims therefore argued for VAR models as a statistical description of the economy, under given policy rules. The effects of surprise changes in policy variables could then be analysed in the reduced form.

VAR models represent the reduced form of an underlying structural model. This can be seen by starting from a general (but not necessarily identified) structural model, and rewriting it in reduced form. As an example, start from the bivariate model,

\[
y_t = \alpha_1 + a_{11}x_t + b_{11}y_{t-1} + b_{12}x_{t-1} + \varepsilon_{1t} \tag{9.2}
\]
\[
x_t = \alpha_2 + a_{21}y_t + b_{21}y_{t-1} + b_{22}x_{t-1} + \varepsilon_{2t}. \tag{9.3}
\]

Solving the system so that only lagged variables appear on the RHS of the equations gives,

\[
y_t = \pi_{10} + \pi_{11}y_{t-1} + \pi_{12}x_{t-1} + e_{1t} \tag{9.4}
\]
\[
x_t = \pi_{20} + \pi_{21}y_{t-1} + \pi_{22}x_{t-1} + e_{2t}. \tag{9.5}
\]

The equations form a bivariate VAR model of order one. The residuals of the VAR model (the reduced form) contain the residuals and the parameters ($a_{11}$ and $a_{21}$) of the structural model. The reduced system can be estimated by applying OLS to each equation.2 The parameters of the VAR relate to the structural model as

\[
\pi_{11} = \frac{a_{11}b_{21} + b_{11}}{1 - a_{11}a_{21}}, \quad \text{etc.}
\]

Thus, the parameters of the VAR are complex functions of some underlying structural model, and as such they are on their own quite uninteresting for economic analysis. It is the lag structure, and sometimes the signs, that are more interesting. The two residuals in this VAR are,

\[
e_{1t} = \frac{\varepsilon_{1t} + a_{11}\varepsilon_{2t}}{1 - a_{11}a_{21}} \tag{9.6}
\]

and

\[
e_{2t} = \frac{a_{21}\varepsilon_{1t} + \varepsilon_{2t}}{1 - a_{21}a_{11}}. \tag{9.7}
\]

These residuals are both white noise terms, but they are correlated with each other whenever the coefficients $a_{11}$ or $a_{21}$ are different from zero.

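The generalization of the structural system above, setting $z_t = \{y_t, x_t\}$, is

\[
Bz_t = \Gamma_0 + \Gamma_1 z_{t-1} + \varepsilon_t, \tag{9.8}
\]

where

\[
B = \begin{bmatrix} 1 & -a_{11} \\ -a_{21} & 1 \end{bmatrix}, \quad
\Gamma_0 = \begin{bmatrix} \alpha_1 \\ \alpha_2 \end{bmatrix} \quad \text{and} \quad
\Gamma_1 = \begin{bmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{bmatrix}.
\]

If both sides of (9.8) are multiplied by $B^{-1}$ the result is,

\[
z_t = B^{-1}\Gamma_0 + B^{-1}\Gamma_1 z_{t-1} + B^{-1}\varepsilon_t = \Pi_0 + \Pi_1 z_{t-1} + e_t, \tag{9.9}
\]

where $e_t = B^{-1}\varepsilon_t$. The VAR model is thus a reduced form of an underlying structural model, where the structural dependence is "hidden" in the covariance matrix of the error terms.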
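The mapping from the structural form to the reduced form can be checked numerically. The sketch below assumes hypothetical parameter values (none of them come from the text) and uses numpy to compute $\Pi_1 = B^{-1}\Gamma_1$ and the covariance matrix of the VAR residuals $e_t = B^{-1}\varepsilon_t$.

```python
import numpy as np

# Hypothetical structural parameters (illustration only, not from the text).
a11, a21 = 0.4, 0.3
b11, b12, b21, b22 = 0.5, 0.1, 0.2, 0.3

B = np.array([[1.0, -a11],
              [-a21, 1.0]])
Gamma1 = np.array([[b11, b12],
                   [b21, b22]])
Sigma_eps = np.diag([1.0, 2.0])        # structural shocks assumed uncorrelated

B_inv = np.linalg.inv(B)
Pi1 = B_inv @ Gamma1                   # reduced-form (VAR) coefficients
Sigma_e = B_inv @ Sigma_eps @ B_inv.T  # covariance of the VAR residuals

print(Pi1)
print(Sigma_e)
```

Even though the structural shocks are uncorrelated in this sketch, the off-diagonal element of `Sigma_e` is non-zero because $a_{11}$ and $a_{21}$ are non-zero, which is the sense in which the structural dependence ends up in the residual covariance matrix.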

VAR models are estimated in their AR form, $A(L)y_t = e_t$. They can be inverted and analysed in their MA form, $y_t = C(L)e_t$. Besides prediction, VAR models are used for three types of analysis: Granger non-causality tests, forecast error variance decomposition and impulse response analysis. Granger non-causality tests deserve special treatment and are therefore discussed in a following chapter. The other two techniques are typical VAR methods that make use of the MA form.

Forecast Error Variance Decomposition. The forecast error variances are explained in terms of the history of each variable. This analysis tells how strong the influence among the variables of the system is. It tells us the proportion of the movements in a sequence (of $y_i$) that is due to its own shocks and the proportion due to shocks in other variables. If these other variables have little influence on the investigated variable, they will contribute little to its forecast error variance. Variables that are exogenous will show only small effects from other variables.

2 OLS is as efficient as the seemingly unrelated regressions (SUR) estimator in this case, because the equations contain the same explanatory variables. However, if we set some lags to zero and have a system with different lags in different equations, SUR will be a more efficient estimator than OLS.

Impulse Response Analysis. This is a simulation of the system's response to an unexpected shock in one variable in the system. A typical example is to study how the economy, and real GDP, reacts to an unexpected change in the money supply under the assumption of rational expectations. Typical questions to ask are: how long does it take for a shock in $y_t$ or $x_t$ to die out, will there be an effect at all, will it be positive or negative, and will it die out smoothly or through fluctuations? We can ask if shocks in $y_t$ affect $x_t$, etc. Let the MA form be $y_t = C(L)e_t = \sum_{i=0}^{t} C_i e_{t-i}$, where $C_i$ is the matrix of coefficients for lag $i$. In matrix form, for a two-dimensional system,

\[
\begin{bmatrix} y_{1t} \\ y_{2t} \end{bmatrix}
= \sum_{i=0}^{t}
\begin{bmatrix} c_{11,i} & c_{12,i} \\ c_{21,i} & c_{22,i} \end{bmatrix}
\begin{bmatrix} e_{1,t-i} \\ e_{2,t-i} \end{bmatrix}. \tag{9.10}
\]

The matrix of total, or long-run, multipliers is given by $\sum_{i=0}^{\infty} C_i$. The impulse response functions are given by $C_j$, where $j = 0, \ldots, t$.
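The MA coefficients behind the impulse responses can be computed recursively from the VAR coefficients. The following sketch assumes the standard recursion $C_0 = I$, $C_j = \sum_{i=1}^{\min(j,k)} A_i C_{j-i}$, a hypothetical VAR(1) coefficient matrix and numpy; the responses are to the non-orthogonalized residuals $e_t$.

```python
import numpy as np

def ma_coefficients(A_list, horizon):
    """Return [C_0, ..., C_horizon] for x_t = sum_i A_i x_{t-i} + e_t.

    Uses C_0 = I and C_j = sum_{i=1}^{min(j,k)} A_i C_{j-i}.
    """
    p = A_list[0].shape[0]
    C = [np.eye(p)]
    for j in range(1, horizon + 1):
        C_j = np.zeros((p, p))
        for i, A_i in enumerate(A_list, start=1):
            if i <= j:
                C_j += A_i @ C[j - i]
        C.append(C_j)
    return C

# Hypothetical stationary VAR(1) (illustration only).
A1 = np.array([[0.5, 0.1],
               [0.2, 0.3]])
C = ma_coefficients([A1], horizon=200)

# Element (m, n) of C[j] is the response of variable m, j periods after a unit shock in e_n.
long_run = sum(C)   # approximates the matrix of long-run multipliers
print(np.round(C[1], 3))
print(np.round(long_run, 3))
```

For a VAR(1) the recursion reduces to $C_j = A_1^j$, and the accumulated responses approach $(I - A_1)^{-1}$, which gives a simple check of the output.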

Both the variance decomposition and the impulse response analysis require that the residual covariance matrix of the VAR is orthogonalized. This is so because the errors $e_t$ depend on each other through the $B^{-1}$ matrix. Unless the residuals of the VAR are orthogonalized it will not be possible to identify a shock as a unique shock coming from one specific variable.3 There are several ways of orthogonalizing the residuals. (In the following we assume that the VAR is made up of stationary variables.) The idea is that restrictions must be put on the covariance matrix of the VAR.

Cholesky decomposition. The Cholesky decomposition represents a purely mathematical way to orthogonalize the residuals, which will depend on the ordering of the variables. It is customary to do several different decompositions, by changing the order of the equations in the model, to show the sensitivity of the results to the way the orthogonalization is created. In terms of the residual covariance matrix, what the Cholesky decomposition achieves is to make the elements above the diagonal zero. Assume a three-dimensional VAR, $p = 3$, and therefore a $3 \times 3$ covariance matrix,

\[
\Sigma = \begin{bmatrix}
\sigma_{11}^2 & \sigma_{12} & \sigma_{13} \\
\sigma_{21} & \sigma_{22}^2 & \sigma_{23} \\
\sigma_{31} & \sigma_{32} & \sigma_{33}^2
\end{bmatrix}.
\]

The decomposition imposes the lower triangular structure

\[
\begin{bmatrix}
\sigma_{11}^2 & 0 & 0 \\
\sigma_{21} & \sigma_{22}^2 & 0 \\
\sigma_{31} & \sigma_{32} & \sigma_{33}^2
\end{bmatrix}.
\]

The problem for identifying the VAR and doing the impulse responses is that the covariance matrix is not diagonal. The Cholesky decomposition builds on the fact that a lower triangular matrix $P$ with the property that $PP' = \Sigma$ defines orthogonalized residuals, $\varepsilon_t = P^{-1}e_t$, whose covariance matrix is diagonal, $\varepsilon_t \sim (0, I_N)$. The ordering of the equations determines the outcome, and the causal ordering of the residual shocks. With $N = 3$, there are six ($3!$) possible orderings and outcomes, which can be more or less different.

3 Early VAR modelers did not recognize the need for orthogonalization. Thus papers from the first part of the 1980s must be read with some care.

Set up a recursive system. Instead of letting the computer do all of the job, you can set up the matrix $B^{-1}$ so that the residuals form a recursive system, by deciding on an ordering of the equations that corresponds to the ordering and residual correlations created by the Cholesky decomposition. Thus, the residual in equation one is not affected by the other two (meaning that $x_{1t}$ is not explained by $x_{2t}$ or $x_{3t}$). The second residual is only affected by the first residual. And finally, the last (third) residual is affected by residuals one and two. Econometric programs often include Cholesky decomposition routines in combination with the analysis of VAR models. By changing the ordering of the equations it becomes possible to compare the effects of different recursive orderings of the variables. The problem is that we are drowning in output as the dimension of the VAR increases.

Structural Autoregressive models (SVAR). If economic theory does not suggest a recursive ordering, use economic theory to impose restrictions on the $B^{-1}$ matrix. This is called a Structural Vector Autoregressive (SVAR) model.4 In practice the approach implies formulating a small structural (static) economic system for the residual process $e_t$. If $y_t$ is a $p$-dimensional system, the error covariance matrix contains a total of $p^2$ elements, of which $p(p+1)/2 = (p^2+p)/2$ are distinct. This is the number of parameters that can be estimated, and it determines the number of restrictions necessary for the matrix $B^{-1}$. As an example for a 3-variable system, the error process could be set up as,

\[
\begin{aligned}
e_{1t} &= \varepsilon_{1t} \\
e_{2t} &= c_{21}\varepsilon_{1t} + \varepsilon_{2t} \\
e_{3t} &= c_{31}\varepsilon_{1t} + c_{32}\varepsilon_{2t} + \varepsilon_{3t},
\end{aligned} \tag{9.11}
\]

Alternatively, the system could look like,

\[
\begin{aligned}
e_{1t} &= \varepsilon_{1t} + c_{13}\varepsilon_{3t} \\
e_{2t} &= c_{21}\varepsilon_{1t} + \varepsilon_{2t} \\
e_{3t} &= c_{31}\varepsilon_{2t} + \varepsilon_{3t}.
\end{aligned} \tag{9.12}
\]

With $p = 3$ the number of estimable parameters is $(3^2 + 3)/2 = 6$. Behind each equation is some reasoning about the plausible correlation among the variables at time $t$. In each equation there is one white noise residual term with an implicit parameter of unity, leaving three possible parameters (the $c$ coefficients) to describe how the shocks in the errors are related.
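The Cholesky orthogonalization and its dependence on the ordering of the variables, together with the forecast error variance decomposition that builds on it, can be sketched as follows. The covariance matrix, the VAR(1) coefficient matrix, the horizon and the function name `fevd` are hypothetical choices for illustration; numpy is assumed.

```python
import numpy as np

# Hypothetical residual covariance matrix of a 3-variable VAR (illustration only).
Sigma = np.array([[1.0, 0.5, 0.2],
                  [0.5, 2.0, 0.3],
                  [0.2, 0.3, 1.5]])

P = np.linalg.cholesky(Sigma)        # lower triangular, P @ P.T equals Sigma
P_inv = np.linalg.inv(P)
cov_orth = P_inv @ Sigma @ P_inv.T   # covariance of P^{-1} e_t, the identity matrix

def fevd(C_list, P):
    """Forecast error variance shares; row = variable, column = orthogonal shock."""
    theta = [C_j @ P for C_j in C_list]        # orthogonalized responses
    contrib = sum(th ** 2 for th in theta)     # accumulated squared responses
    return contrib / contrib.sum(axis=1, keepdims=True)

# Hypothetical VAR(1), so that the MA matrices are C_j = A^j.
A = np.array([[0.5, 0.1, 0.0],
              [0.2, 0.3, 0.1],
              [0.0, 0.1, 0.4]])
C_list = [np.linalg.matrix_power(A, j) for j in range(10)]
shares = fevd(C_list, P)

# Placing variable 3 first gives a different triangular factor and decomposition.
order = [2, 0, 1]
P_re = np.linalg.cholesky(Sigma[np.ix_(order, order)])
A_re = A[np.ix_(order, order)]
shares_re = fevd([np.linalg.matrix_power(A_re, j) for j in range(10)], P_re)

print(np.round(shares, 3))
print(np.round(shares_re, 3))
```

Each row of `shares` sums to one. Comparing `shares` with `shares_re` (after mapping the rows and columns back to the original ordering) shows how much the conclusions depend on the chosen recursive ordering.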

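A more general framework for identifying the VAR is

\[
Az_t = A_0 + A_1 z_{t-1} + B\varepsilon_t,
\]

where $A$ describes the contemporaneous relations among the variables and $B$ takes care of correlations in the residuals such that the covariance matrix of the error term $B\varepsilon_t$ becomes diagonal.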

Once the error process is set up in such a way that the errors are orthogonal, it becomes possible to analyze the effects of one specific shock on the system and argue that the shock is unique, coming only from that particular variable. Without orthogonalization the shock can be a mixture of effects from different variables, and not a "clean" shock.

4 A fourth approach is offered by Blanchard and Quah (1989), and builds on classifying shocks as temporary or permanent. This approach can be seen as an extension of the SVAR approach to processes including integrated variables with common trends.

One controversy here is that it is up to the econometrician to identify and label the shocks as, for instance, demand or supply shocks. The basis for such labeling might not be strong. Further, by definition, the errors include not only structural relations, but also everything that we do not know or understand about the system. For that reason it might be better to use economic theory to identify structural relations and build conventional econometric models, rather than trying to analyse what we do not understand. On the other hand, in a world of rational expectations where the expectations-generating mechanism is unknown, or cannot be modelled, VAR models are the best we can do.

9.0.1

First, you think about your system. What is it that you want to explain? How could it be modelled as a recursive system? Second, you estimate the equations by OLS, with the same lag lengths on all variables across the equations to avoid using the SUR estimation technique. Third, you investigate outliers and shifts and put in the appropriate dummy variables. Fourth, you try to find a short lag structure and white noise residuals. Fifth, if you cannot fulfill the fourth step, you minimize an information criterion. In this case AIC is not the best choice; use BIC or something else.
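The fifth step can be sketched in code. The example below assumes one common multivariate form of the criterion, $\mathrm{BIC}(k) = \ln|\hat\Sigma_k| + (\text{number of coefficients}) \cdot \ln(n)/n$, a hypothetical VAR(2) data generating process and numpy. The function names are illustrative, and a full application would also check the residuals for white noise.

```python
import numpy as np

def simulate_var2(A1, A2, T, rng):
    """Simulate x_t = A1 x_{t-1} + A2 x_{t-2} + e_t with e_t ~ N(0, I)."""
    p = A1.shape[0]
    x = np.zeros((T, p))
    for t in range(2, T):
        x[t] = A1 @ x[t - 1] + A2 @ x[t - 2] + rng.standard_normal(p)
    return x

def var_bic(x, k, k_max):
    """BIC of a VAR(k) with constant, estimated by OLS on a common sample."""
    T, p = x.shape
    Y = x[k_max:]                          # same effective sample for every k
    n = Y.shape[0]
    lags = [x[k_max - i:T - i] for i in range(1, k + 1)]
    Z = np.hstack([np.ones((n, 1))] + lags)
    coef, *_ = np.linalg.lstsq(Z, Y, rcond=None)
    resid = Y - Z @ coef
    Sigma = resid.T @ resid / n
    n_params = p * (p * k + 1)             # all coefficients in the system
    return np.log(np.linalg.det(Sigma)) + n_params * np.log(n) / n

def select_lag(x, k_max):
    """Return the lag length with the smallest BIC and all BIC values."""
    bic = {k: var_bic(x, k, k_max) for k in range(1, k_max + 1)}
    return min(bic, key=bic.get), bic

# Hypothetical stationary VAR(2) (illustration only).
A1 = np.array([[0.5, 0.1], [0.2, 0.3]])
A2 = np.array([[0.3, 0.0], [0.0, 0.2]])
rng = np.random.default_rng(1)
x = simulate_var2(A1, A2, T=2000, rng=rng)
k_best, bic = select_lag(x, k_max=4)
print(k_best, bic)
```

Dropping the first `k_max` observations for every candidate keeps the criteria comparable across lag lengths, since all models are then fitted to the same sample.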

9.0.2

The orthogonalization of the residuals can offer some interesting intellectual challenges, especially in the SVAR approach. If the variables in the VAR are integrated variables that are also co-integrated, we are faced with some interesting problems. In the co-integrated VAR model there will be both stationary shocks and permanent shocks, and identifying these two types in the system is not always easy. If the VAR is of dimension $p$, there can be at most $r$ co-integrating vectors, $0 \le r \le p$, and $p - r$ common stochastic trends. Juselius (2006) ("The Co-integrated VAR Model", Oxford University Press) shows how an identification of the structural MA model, and an orthogonalization of the residuals, can be done in terms of both the short run and the long run of the system.

The VAR(2), with no constants, trends or other deterministic variables, will have the following VECM representation after finding $r$ co-integrating vectors,

\[
\Delta x_t = \Gamma_1 \Delta x_{t-1} + \alpha\beta' x_{t-1} + \varepsilon_t,
\]

with the MA representation

\[
x_t = C\sum_{i=1}^{t}\varepsilon_i + C^*(L)\varepsilon_t + x_0,
\]

where the first term on the right hand side represents the stochastic trends in the system and the second term represents the stationary part. The $C$ matrix will then represent all that is not the stationary vectors, and is related to the co-integrating vectors as $C = \beta_\perp(\alpha_\perp'\Gamma\beta_\perp)^{-1}\alpha_\perp'$, with $\Gamma = I - \Gamma_1$.
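The long-run impact matrix can be computed directly from $\alpha$, $\beta$ and the short-run dynamics. The sketch below assumes the standard expression $C = \beta_\perp(\alpha_\perp'\Gamma\beta_\perp)^{-1}\alpha_\perp'$ with $\Gamma = I - \Gamma_1$ for a VAR(2), hypothetical parameter values and an SVD-based helper `orth_complement` (an illustrative name) for the orthogonal complements.

```python
import numpy as np

def orth_complement(M):
    """Basis for the orthogonal complement of the column space of M (p x r)."""
    U, _, _ = np.linalg.svd(M, full_matrices=True)
    r = np.linalg.matrix_rank(M)
    return U[:, r:]          # the last p - r left singular vectors

# Hypothetical bivariate system with one co-integrating vector (illustration only).
alpha = np.array([[-0.2], [0.1]])     # adjustment coefficients
beta = np.array([[1.0], [-1.0]])      # co-integrating vector
Gamma1 = np.array([[0.3, 0.0],
                   [0.1, 0.2]])       # short-run dynamics
Gamma = np.eye(2) - Gamma1

a_perp = orth_complement(alpha)
b_perp = orth_complement(beta)
C = b_perp @ np.linalg.inv(a_perp.T @ Gamma @ b_perp) @ a_perp.T

print(np.round(C, 3))
print(np.round(beta.T @ C, 10))   # zero: the stationary relation carries no stochastic trend
```

With $p = 2$ and $r = 1$ the matrix $C$ has rank $p - r = 1$, corresponding to the single common stochastic trend.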


VAR models represent statistical descriptions of data series. As such, a VAR is a basis for reducing your model and moving to more ordinary structural econometric models, such as Vector Error Correction Models (VECMs). Estimating a VAR is then a way of making sure that the final model is a well-defined statistical model, i.e. a model that is consistent with the data chosen.

1. We have talked about what you can do with the VAR in terms of forecasting, simulations, impulse responses, forecast error decomposition and Granger causality testing. In this context we meet the so-called SVAR, the Structural VAR. There are, however, a number of other VARs that one needs to know about. The problems of working with VARs are obvious: there is a large number of parameters to be estimated, the estimated parameters might not be stable over time, and there are a number of variables that are not modelled in the VAR because the VAR would get too large to handle. If you want to use the VAR for forecasting you need to address these problems. To handle the problem with time-varying parameters there are Time-Varying-Parameter VARs (TVP-VARs). In addition there are various VAR modeling techniques that deal with regime changes: Markov switching VARs, threshold VARs, floor and ceiling VARs, and smooth transition VARs.

To work with a large number of variables and reduce the model it is possible to use factor analysis, which takes us to Factor Augmented VARs (FA-VARs). Another approach is to use a priori information about parameters and their distribution, as represented by Bayesian VARs (BVARs). The latter is a popular approach in many central banks.

We can illustrate the problem in the following way. Your model predicts that the inflation rate will vary around 10%, and at the same time you have additional information indicating that inflation will fluctuate around 5 per cent, say because there is a sudden drop in inflation. What do you do? One approach is simply to reduce the constant term and predict changes in inflation around 5 per cent instead. A more ambitious approach is to incorporate more information in your model, from more data, and place more emphasis on recent observations, etc. Changing the constant is easy and quite normal. As you start walking along the path of making assumptions about the data and the parameters of the model you might go too far in the other direction. As long as we talk about forecasting, the proof is in the pudding. The best forecast wins, but when we talk about the best policy to achieve goals in the future you have to be much more careful.

The type of VARs we have discussed so far are basically statistical representations of the data. Without further restrictions, and the incorporation of long-run steady state relations in the form of co-integrating vectors, their predictive ability will be quite poor. Also, the economy is more complex, involving many more variables than the two to six variables that can be handled in a standard VAR. If your model contains fifty or one hundred variables there will be too many lags and coefficients to estimate. One way of dealing with this problem is to use so-called Bayesian VARs (BVARs). In the BVAR you can use prior information to reduce the number of coefficients you need to estimate. BVARs are popular among many central banks, including both the ECB and the Fed, as a way to construct better and bigger VARs for forecasting.5

5 Gary Koop at University of Strathclyde has a home page with course material dealing with BVAR models.

Finally, remember that the data is the real world, while economic theories are constructions of the human mind (quote from David Hendry). If you want to use a priori information of some kind you might miss what the data, the real world, is trying to tell you.


Part III


Whether one variable can be said to cause the other variable is a fundamental question in all sciences. However, to validate empirically that one variable is caused by another variable is problematic in economics, since it is often quite difficult to set up controlled experiments. Granger (1969), building upon work done by Wiener, was the first to formalize an empirical concept of causality in economics. Granger's basic idea is that the future cannot predict the present or the past. It follows, as a necessary condition, that for one variable ($x_t$) to cause another variable ($y_t$), lagged values of $x_t$ must predict $y_t$. This can be tested with the following vector autoregressive model,

\[
y_t = \sum_{i=1}^{k}\alpha_i y_{t-i} + \sum_{i=1}^{k}\beta_i x_{t-i} + e_t, \tag{9.13}
\]

where the lag length $k$ is determined such that $e_t$ is a white noise process, $e_t \sim NID(0, \sigma^2)$. Alternatively, if you cannot find white noise residuals, minimize an information criterion instead. If the parameters associated with the process $x_t$ are jointly different from zero, that is, if at least one $\beta_i \neq 0$, then $x_t$ is predicting $y_t$, and $x_t$ can also be said to Granger cause the variable $y_t$. If, on the other hand, all $\beta$-parameters are zero, $x_t$ cannot predict or cause $y_t$. An $F$-test on the joint significance of the $\beta$ parameters is sufficient in this case. (Alternatively, the test can be set up in the form of a chi-square test, depending mainly on the software you are using.) The $F$-test works by comparing the mean squared errors from the equation above with those from a regression where the $x$'s are excluded. If the inclusion of lagged $x$ variables leads to a significant reduction in the mean squared error, lagged values of $x_t$ are predicting $y_t$ and the variable $x_t$ can be said to Granger cause $y_t$.
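The $F$-test described above can be sketched as a comparison of restricted and unrestricted residual sums of squares. The code assumes numpy, adds a constant to both regressions (an assumption, the equation above has none), and uses a hypothetical data generating process in which lagged $x_t$ helps predict $y_t$ but not the reverse. The statistic should be compared with a critical value from the $F(k, n - 2k - 1)$ distribution, which is not computed here.

```python
import numpy as np

def granger_f(y, x, k):
    """F-statistic for H0: lags 1..k of x do not help predict y.

    Returns the statistic and its degrees of freedom (k, n - 2k - 1).
    """
    T = len(y)
    Y = y[k:]
    n = len(Y)
    ylags = np.column_stack([y[k - i:T - i] for i in range(1, k + 1)])
    xlags = np.column_stack([x[k - i:T - i] for i in range(1, k + 1)])
    const = np.ones((n, 1))
    Z_r = np.hstack([const, ylags])           # restricted: own lags only
    Z_u = np.hstack([const, ylags, xlags])    # unrestricted: adds lags of x

    def rss(Z):
        b, *_ = np.linalg.lstsq(Z, Y, rcond=None)
        u = Y - Z @ b
        return u @ u

    rss_r, rss_u = rss(Z_r), rss(Z_u)
    df2 = n - Z_u.shape[1]
    F = ((rss_r - rss_u) / k) / (rss_u / df2)
    return F, (k, df2)

# Hypothetical data: x helps predict y, while y does not help predict x.
rng = np.random.default_rng(2)
T = 500
x = np.zeros(T)
y = np.zeros(T)
for t in range(1, T):
    x[t] = 0.5 * x[t - 1] + rng.standard_normal()
    y[t] = 0.3 * y[t - 1] + 0.5 * x[t - 1] + rng.standard_normal()

F_xy, df_xy = granger_f(y, x, k=2)   # does x Granger cause y?
F_yx, df_yx = granger_f(x, y, k=2)   # does y Granger cause x?
print(F_xy, df_xy)
print(F_yx, df_yx)
```

A large value of `F_xy` rejects the null of Granger non-causality from $x_t$ to $y_t$; a small `F_yx` means the null cannot be rejected in the other direction.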

Please notice the distinction between prediction and causality, which is important in a policy context. The fact that $\sum_{i=1}^{k}\beta_i$ is significantly different from zero, so that $x_t$ is predicting $y_t$, does not imply that $x_t$ causes $y_t$. It is easy to understand why from the following analogy. A weatherman who predicts rain tomorrow does not cause the rain that might fall tomorrow. This is so no matter how good this person is at predicting tomorrow's weather. This is the reason why the test should always be referred to as a Granger non-causality test and not a test of causality. Based on the assumption that the future cannot predict the present and the past, we can only test whether a variable is not causing another.

Of course, the outcome of the test might be affected by the number of lags chosen in the VAR, and by the variables chosen to be included in the VAR. Though two-variable VARs are common, this is often a crude simplification. The classical example is the effect of real money growth on real GDP growth. In one set-up you might find that monetary policy is effective, but add the interest rate to the VAR and you might find that monetary policy is ineffective.

Finding that $x_t$ Granger causes $y_t$ does not exclude that the reverse is true. Two variables can Granger cause each other. A test of whether $y_t$ Granger causes $x_t$ is performed with the following model,

\[
x_t = \sum_{i=1}^{k}\gamma_i x_{t-i} + \sum_{i=1}^{k}\delta_i y_{t-i} + \nu_t, \tag{9.14}
\]

where $\nu_t \sim NID(0, \omega^2)$. If lagged values of $y_t$ predict $x_t$, $y_t$ is Granger causing $x_t$. In some situations testing the reverse relationship is of no interest. For instance, the inflation rate in a small open economy should not Granger cause the inflation rate of the world.

The main weakness of the Granger non-causality test is the assumption that the error process in the VAR is not only a white noise process, but also a white noise innovation process with respect to all relevant information for explaining the movements of $x_t$ and $y_t$. This is an important issue which is often forgotten in applied work, where bivariate systems are the rule rather than the exception. Granger's basic definition of non-causality is based on the assumption that all factors relevant for predicting $y_t$ are known. Let $I_t$ represent all relevant information, both past and present, and let $X_t$ be present and past observations on $x_t$, such that $X_t = (x_t, x_{t-1}, x_{t-2}, \ldots, x_0)$; $I_{t-1}$ and $X_{t-1}$ represent past observations only. The variable $x_t$ can therefore be said to Granger cause $y_t$ if the mean square error (MSE) increases when $y_t$ is regressed against the information set where $X_{t-1}$ is removed. In the bivariate case, this can be stated as,

\[
MSE(\hat{y}_t \mid I_{t-1}) < MSE[\hat{y}_t \mid (I_{t-1} - X_{t-1})]. \tag{9.15}
\]

The practical problem is to decide which variables should be included in $I_t$. If too many variables are included the degrees of freedom will diminish. If too few variables are included the test might lead to the wrong conclusions. The result of a unidirectional relation from $x_t$ to $y_t$ in a bivariate model might be reversed if a relevant third variable is included in the system. This is a serious limitation of the Granger causality test. A way of reducing the problem is to always perform the tests in a VAR system. If some variable is to be treated as exogenous in the system, this must be based on strong a priori knowledge.

The Granger non-causality test is sensitive to the spurious regression problem. The $F$-test is unreliable when used on integrated or near-integrated variables, which is the standard situation in economics. However, using only first differences of the variables implies a loss of information. In this situation it is recommended to include error correction terms (or co-integrating vectors) in the VAR to increase the efficiency of the $F$-tests. There is an interesting relation between cointegration and Granger causality, as shown by Engle and Granger (1987). If a co-integrating relationship is found, it follows that there must exist Granger causality in at least one direction. Tests of cointegration do not make causality tests redundant, since they cannot determine the direction of the causality. However, if no cointegration is found we can conclude that there is no Granger causality either.

10.1 Exogeneity

Exogeneity assumptions are necessary in econometric model building. In many situations they are used in an ad hoc way: variables are said to be determined outside the system, or are classified as endogenous or predetermined. Based on this classification of the variables in the system, the basic econometrics textbook explains how to apply the rank and the order conditions to identify a simultaneous system, and whether it is possible to use OLS or if a system estimator is necessary. In this section we introduce three basic concepts of exogeneity that cover (1) estimation and inference, (2) conditional forecasting and simulations, and (3) policy conclusions. The three concepts that allow you to perform these tasks are weak exogeneity, strong exogeneity and super exogeneity.

Consider the following system, in which there is co-integration,

\[
y_t = \beta x_t + \varepsilon_{1t}
\]
\[
\Delta x_t = \varepsilon_{2t}.
\]

If $\varepsilon_{1t}$ and $\varepsilon_{2t}$ are both stationary it follows that $x_t$ is $I(1)$, and that $y_t$ is $I(0)$ if $\beta = 0$. On the other hand, if $\beta \neq 0$ it follows that $y_t$ is $I(1)$. To estimate $\beta$ it is required that $y_t$ does not simultaneously influence $x_t$. If $y_t$ or $\Delta y_t$ is part of the right-hand side of the $x_t$ equation (and thus embedded in $\varepsilon_{2t}$) the result is that $E(\varepsilon_{1t}\varepsilon_{2t}) \neq 0$, and we can write $\varepsilon_{1t} = \gamma\varepsilon_{2t} + u_t$, where for simplicity we assume that $u_t \sim N(0, \sigma^2)$.

Now, if we estimate $\beta$ with OLS, the outcome would be a biased estimate of $\beta$, since $E(x_t\varepsilon_{1t}) = E(x_t(\gamma\varepsilon_{2t} + u_t)) \neq 0$, and we can no longer assume that $x_t$ and $\varepsilon_{1t}$ are independent. This is an example of a lack of weak exogeneity. With the first model alone it is not possible to estimate the parameter of interest $\beta$; the outcome from OLS is a different and biased value.

10.1.1 Weak Exogeneity

Weak exogeneity spells out the conditions under which it is possible to obtain unbiased and efficient estimates. The definition is based on splitting the joint density function into a conditional density and a marginal density function,

\[
D_1(y_t, z_t \mid Y_{t-1}, Z_{t-1}; \theta_1) = D_2(y_t \mid z_t, Y_{t-1}, Z_{t-1}; \theta_2)\, D_3(z_t \mid Y_{t-1}, Z_{t-1}; \theta_3), \tag{10.1}
\]

where $Y_{t-1}$ and $Z_{t-1}$ are matrices of the finite historical values of these variables. The conditions under which it is possible to estimate the parameters of interest by modeling only the conditional density are that $\theta_2$ and $\theta_3$ should be variation free, and that there are no cross restrictions between the parameters of $\theta_2$ and $\theta_3$.

In practical situations, using stationary data, this comes down to judging whether the error terms of the marginal and conditional models are correlated.1 (If the data series are integrated the question becomes one of long-run independence between the two residual processes.)

Three important conclusions follow from the definition above. The first is that whether a variable is exogenous or not depends on the parameters of interest. An OLS regression will always lead to estimates of some kind, but what is their meaning? To understand the regression we identify parameters of interest that relate to other variables through the (not modelled) marginal density functions. Thus, exogeneity must be stated in terms of parameters of interest, i.e. the variable $z_t$ is weakly exogenous for the parameters of interest.

Second, it is difficult to test for weak exogeneity. Most existing tests fail, with the exception of Johansen's test for weak exogeneity of the variables in co-integrating vectors.2 The purpose of an exogeneity test is mainly to find an argument for not specifying the marginal model. However, the definition of weak exogeneity tells us that this is not possible. A test will need the estimated marginal model, otherwise it will not work. But when the marginal model is estimated (and tested for misspecification) the work is already done, so the only thing left is to compare the results.

The third conclusion is that it is not possible to simply state that a variable like US inflation is determined outside the model for inflation in Zambia, or the rainfall in an agricultural model. If these variables enter the system in terms of expectations, it might be necessary to specify the stochastic process that generates these expectations in the model to get unbiased and efficient estimates of the parameters of interest.

10.1.2 Strong Exogeneity

Strong exogeneity spells out the conditions for conditional forecasting and simulations of a model with variables that are not modelled. The condition is weak exogeneity and that the marginal model should not depend on the endogenous variable. Thus the marginal process must be

\[
D_3(z_t \mid Y_{t-1}, Z_{t-1}; \theta_3) = D_3(z_t \mid Z_{t-1}; \theta_3), \tag{10.2}
\]

so that $z_t$ is not Granger caused by $y_t$.

1 The condition of no correlation between the error terms is easily understandable if we assume that $\{y_t, z_t\}$ is a bivariate normal process. Set up the density function, and determine the condition under which it is possible to estimate the parameters of interest from the conditional model only.

2 Regarding Johansen's test, it is important to remember that it is model dependent. The test is performed conditionally on the short-run dynamics of the variables included in the system, the dummy variables and the specification of the deterministic trend.

10.1.3 Super Exogeneity

Super exogeneity determines the conditions for using the estimated parameters for policy decisions. The condition is weak exogeneity and that the parameters of the conditional model are stable with respect to changes in the marginal model. For instance, if the money supply rule changes, the parameters of the marginal process will also change. If this also leads to changes in the parameters of the conditional model, the conditional model cannot be used to analyse the implications of policy changes. Thus, super exogeneity defines the situations when the Lucas critique is not valid.

regression.

Multicollinearity has to do with how we understand the estimated parameters. Study the following model,

\[
y_t = \beta_1 x_t + \beta_2 z_t + \varepsilon_t.
\]

The estimated parameters of this model are analysed under the assumption that there is no correlation between the variables. The parameter $\beta_1$ is understood as the effect on $y_t$ following a unit change in $x_t$ while holding the other variables in the model ($z_t$) constant. In the same way $\beta_2$ measures the effect on $y_t$ of $z_t$ while $x_t$ is held constant. Another way of expressing this is the following: $E\{y_t \mid z_t\} = \beta_1 x_t$ and $E\{y_t \mid x_t\} = \beta_2 z_t$, which tells us that the effect of one parameter cannot be analysed in isolation from the rest of the model. The effect of $z_t$ in the model is not on $y_t$ in itself, it is on $y_t$ conditional on $x_t$. Holding, say, $x_t$ constant in the model while $z_t$ is free to vary implies that we study the effect on $y_t$ after "removing" the effects of $x_t$ on $y_t$. If $x_t$ and $z_t$ are correlated it is not possible to keep one of them constant while the other is changing. This is the multicollinearity problem.

The statistical problem is best understood by looking at the OLS variance of $\hat{\beta}$. The variance is

\[
Var(\hat{\beta}_1) = \frac{\sigma^2}{\sum (x_t - \bar{x})^2 \, (1 - r_{xz}^2)},
\]

where $r_{xz}$ is the correlation between $x_t$ and $z_t$. If $r_{xz}^2 = 1$, the denominator becomes zero and the calculation of the variance breaks down. Perfect multicollinearity means that the covariance matrix, based on $(X'X)^{-1}$, does not exist, and there is no solution to $\hat{\beta} = (X'X)^{-1}X'Y$. This is seldom a practical problem, since the computer program that calculates the estimates will break down when it tries to invert the matrix.3
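The effect of the correlation on the variance can be illustrated with a few lines of Python. The sketch evaluates the two-regressor variance formula $\sigma^2 / [\sum (x_t - \bar{x})^2 (1 - r_{xz}^2)]$ for hypothetical values of $\sigma^2$ and $\sum (x_t - \bar{x})^2$; the function name is illustrative.

```python
def var_beta1(sigma2, sxx, r_xz):
    """Variance of the OLS estimate of beta_1 in y = b1*x + b2*z + e.

    sigma2: residual variance, sxx: sum of squared deviations of x,
    r_xz: sample correlation between x and z.
    """
    if abs(r_xz) >= 1.0:
        raise ZeroDivisionError("perfect multicollinearity: variance is not defined")
    return sigma2 / (sxx * (1.0 - r_xz ** 2))

# Hypothetical values: the variance grows without bound as r_xz approaches one.
for r in (0.0, 0.5, 0.9, 0.99):
    print(r, var_beta1(sigma2=1.0, sxx=100.0, r_xz=r))
```

The factor $1/(1 - r_{xz}^2)$ is the amount by which the variance is inflated relative to the case of uncorrelated regressors.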

Near, that is less than perfect, multicollinearity, meaning that $r_{xz}^2$ is between zero and unity, is more complex. However, the problem is limited to the understanding of the estimated parameters, not to the understanding of the model. Less than perfect multicollinearity will affect the residual variance of the model ($\sigma^2$), the estimated

3 If the inversion process does not break down completely, estimated variances of one or more parameters will be incredibly large.

quick fixes for multicollinearity have been suggested. None of these actually works.

In cross section studies a typical problem is to explain household consumption.

If you use household income, the number of rooms that the household possesses, the

number of children and the size of the car as explanatory variables, you would

not be surprised to learn that these explanatory variables are highly correlated

with each other. As a consequence it might be hard to understand what the

parameters are estimating. This example shows that throwing in explanatory

variables without a clear economic model in the background will lead to problems.

There is no substitute for economic theory in this example.

In time series modelling multicollinearity is often, somewhat mistakenly, linked to the estimation of lag lengths. Take the following distributed lag model as an example: $y_t = \beta_1 x_t + \beta_2 x_{t-1} + \varepsilon_t$. If $x_t$ is an AR(p) process, the $x_t$ variables in the equation are of course correlated, meaning that we cannot hold $x_{t-1}$ constant and at the same time analyse the effect of varying $x_t$ on its own. On the other hand, we are not interested in changing one lag while keeping the rest fixed. In a time series regression, estimation aims at finding the sufficient number of lags to describe the dynamic process.

However, since the lags are correlated with each other, this will affect the estimated variance of each lag. This will make it more difficult to determine the correct number of lags in a model, if we were to check the fit of the model by looking at the t-values of the parameters only. Since model building should be aimed at finding a white noise innovation term, t-values are seldom used to decide the overall fit of the model. Instead we focus on misspecification tests of the model.

We can summarize the facts about multicollinearity as follows. There is no way to accurately measure the degree of multicollinearity and there are no quick fixes. Never, under any circumstances, can you delete some variables to solve the problem, as is suggested in some textbooks. Deleting variables means that you change the specification and the fit of the model. Leaving out a relevant explanatory variable leads to a misspecified model, which creates bias in the estimates and affects inference. As shown in Hendry (1990, Ch. 6), multicollinearity is not a model problem or a misspecification problem; it has to do with the interpretation of the estimated parameters only, and not with the fit of the model. It can be shown how the variables in a given model can be transformed such that they become orthogonal to each other, without affecting the fit of the model.

Returning to the example above, the interpretation of the parameters can be made clearer if we use the transformation $\Delta = 1 - L$,

$$y_t = \beta_1 \Delta x_t + \beta_3 x_{t-1} + \varepsilon_t. \qquad (10.3)$$

The transformation is just a reparameterization and does not affect the residual term. The parameter $\beta_3 = \beta_1 + \beta_2$ is the long-run static solution of the model. Thus we get an estimate of the short-run effect on $y_t$ from $\beta_1$ and at the same time a direct estimate of the static long-run solution from $\beta_3$. If the collinearity between $x_t$ and $x_{t-1}$ is high, it can be assumed to be quite small when we look at $\Delta x_t$ and $x_{t-1}$. Since our final interest in modelling economic time series is to find a well-defined statistical model, which mimics the DGP of the variable(s), multicollinearity is not really a problem. We will therefore not deal with this topic any further.
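That the transformation is only a reparameterization can be checked numerically. The following is a minimal sketch, assuming the distributed lag model $y_t = \beta_1 x_t + \beta_2 x_{t-1} + \varepsilon_t$ with an AR(1) regressor. The helper names `ols2` and `corr`, the seed, and the parameter values 0.5, 0.3 and 0.9 are all illustrative assumptions.

```python
import math
import random


def ols2(a, b, y):
    """OLS of y on two regressors a and b (no constant), via the normal equations."""
    saa = sum(u * u for u in a)
    sbb = sum(v * v for v in b)
    sab = sum(u * v for u, v in zip(a, b))
    say = sum(u * w for u, w in zip(a, y))
    sby = sum(v * w for v, w in zip(b, y))
    det = saa * sbb - sab ** 2
    c1 = (say * sbb - sby * sab) / det
    c2 = (sby * saa - say * sab) / det
    resid = [w - c1 * u - c2 * v for u, v, w in zip(a, b, y)]
    return c1, c2, resid


def corr(a, b):
    """Sample correlation coefficient."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    saa = sum((u - ma) ** 2 for u in a)
    sbb = sum((v - mb) ** 2 for v in b)
    return sab / math.sqrt(saa * sbb)


random.seed(1)
T = 300
x = [0.0]
for _ in range(T):                      # x_t follows an AR(1) with coefficient 0.9
    x.append(0.9 * x[-1] + random.gauss(0, 1))
x_t, x_lag = x[1:], x[:-1]
dx = [u - v for u, v in zip(x_t, x_lag)]
y = [0.5 * u + 0.3 * v + random.gauss(0, 1) for u, v in zip(x_t, x_lag)]

b1, b2, e_levels = ols2(x_t, x_lag, y)  # y on x_t and x_{t-1}
g1, b3, e_repar = ols2(dx, x_lag, y)    # y on the difference of x_t and x_{t-1}

print(f"beta1 + beta2 = {b1 + b2:.6f}, beta3 = {b3:.6f}")
print(f"corr(x_t, x_(t-1))  = {corr(x_t, x_lag):.3f}")
print(f"corr(dx_t, x_(t-1)) = {corr(dx, x_lag):.3f}")
```

Both regressions give the same residuals and the same short-run coefficient, while the correlation between the regressors is much lower in the transformed model.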


UNIVARIATE TESTS OF THE ORDER OF INTEGRATION

This section looks at a number of unit root tests, which can be applied to determine the order of integration of a variable. The following tests are presented:

- DF-test: Dickey-Fuller test
- ADF-test: Augmented Dickey-Fuller test
- Z-test: Phillips and Perron's Z-test (to be included)
- LMSP-test: Schmidt and Phillips' LM test
- KPSS-test: Kwiatkowski, Phillips, Schmidt and Shin test
- G(p, q)-test: Park's G-test

The alternative hypotheses to having an integrated series are discussed in a following section.

The Dickey-Fuller test is one of the oldest tests. The test builds on the assumed DGP,

$$y_t = y_{t-1} + \varepsilon_t \quad \text{with } \varepsilon_t \sim NID(0, \sigma^2).$$

Given this DGP, subtract $y_{t-1}$ from both sides, and estimate the equation

a) $\Delta y_t = \pi y_{t-1} + \varepsilon_t$,

or, put a constant term in the regression, to allow for the alternative of a deterministic trend in $y_t$,¹

b) $\Delta y_t = \alpha + \pi y_{t-1} + \varepsilon_t$;

or, put in both a constant and a time trend in the estimated equation, to allow for both a linear deterministic trend and a quadratic deterministic trend in $y_t$;

c) $\Delta y_t = \alpha + \pi y_{t-1} + \beta t + \varepsilon_t$;

where $\pi = 0$ if $y_t$ is I(1). In this regression, we know that $\hat\pi$ will be biased downwards in a limited sample. Thus, we can put all the risk on the negative side and perform a one-sided test, instead of a two-sided standard t-test. The one-sided t-test is $H_0: \pi = 0$, $y_t \sim I(1)$, against $H_1: \pi < 0$, $y_t \sim I(0)$. The correct t-statistic for testing the significance of $\hat\pi$ is tabulated in Fuller (1976), under the assumption that $y_t$ is a random walk, $\Delta y_t \sim N(0, \sigma^2)$. The correct distribution for the t-test can also be calculated from MacKinnon (1991), for the exact sample size at hand. In practice the differences are small, though. The t-statistics for the constant term and the trend term are tabulated in Dickey and Fuller (1980). Notice that the null hypothesis is that $\Delta y_t = \varepsilon_t$, where $\varepsilon_t$ is white noise. The econometrician, however, will not know

1 To understand why the constant represents a linear deterministic trend, go back to the

discussion about the properties of the random walk process.


this in advance. He or she must therefore set up the estimated model so that there is a meaningful alternative hypothesis to the stochastic trend (or unit root) hypothesis. A general alternative is to assume that $y_t$ is driven by a combination of $t$ and $t^2$.

It is therefore recommendable, if the residual $\varepsilon_t$ is white noise, to start with model c. If the t-value on $\pi$ (the coefficient on $y_{t-1}$) is significant according to the table in Fuller (1976), the null hypothesis of a unit root process is rejected. It then follows that the t-statistics for testing the significance of the constant $\alpha$ and the trend coefficient $\beta$ follow standard distributions. But, as long as the unit root hypothesis ($\pi = 0$) cannot be rejected, both $\alpha$ and $\beta$ must be assumed to follow non-standard distributions. Thus, under the hypothesis that $\pi = 0$, the appropriate distributions for $\alpha$ and $\beta$ are found in Dickey and Fuller (1980).

In a limited sample it might be wise to compare the outcome of both model c

and a.
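The mechanics of the simplest version of the test, model a without constant or trend, can be sketched in a few lines. This is a minimal illustration, assuming standard normal shocks; the function names `df_test` and `simulate_ar1`, the seed and the sample size are illustrative assumptions. The resulting t-value has to be compared with the Dickey-Fuller tables (Fuller, 1976), not with the standard t-distribution.

```python
import math
import random


def df_test(y):
    """Estimate dy_t = pi * y_{t-1} + e_t by OLS; return (pi_hat, t-value)."""
    ylag = y[:-1]
    dy = [b - a for a, b in zip(y[:-1], y[1:])]
    syy = sum(v * v for v in ylag)
    pi_hat = sum(v * d for v, d in zip(ylag, dy)) / syy
    resid = [d - pi_hat * v for v, d in zip(ylag, dy)]
    s2 = sum(e * e for e in resid) / (len(dy) - 1)   # residual variance
    return pi_hat, pi_hat / math.sqrt(s2 / syy)


def simulate_ar1(rho, T, rng):
    """y_t = rho * y_{t-1} + e_t with y_0 = 0 and standard normal shocks."""
    y = [0.0]
    for _ in range(T):
        y.append(rho * y[-1] + rng.gauss(0, 1))
    return y


rng = random.Random(42)
pi_rw, t_rw = df_test(simulate_ar1(1.0, 500, rng))   # unit root, true pi = 0
pi_st, t_st = df_test(simulate_ar1(0.5, 500, rng))   # stationary, true pi = -0.5

print(f"random walk:      pi_hat = {pi_rw:.4f}, t = {t_rw:.2f}")
print(f"stationary AR(1): pi_hat = {pi_st:.4f}, t = {t_st:.2f}")
```

For the stationary series the t-value is strongly negative, while for the random walk the estimate of the coefficient on the lagged level stays close to zero.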

The test is easily extended to higher order unit roots, simply by performing the test on differenced data series.

When will the test go wrong? First, if $\varepsilon_t$ is not white noise. In principle, $\varepsilon_t$ can be an ARIMA process. In the following, a number of models dealing with this situation are presented. Second, if there is more than one unit root, then testing for one unit root is likely to be misleading. Hence a good testing strategy is to start by testing for two unit roots, which is done by applying the DF-test to the first difference of the series ($\Delta y_t$). If a unit root in $\Delta y_t$ is rejected one can continue with testing for one unit root, using the series in level form, $y_t$.

The DF-test, like all tests of I(1) versus I(0), is sensitive to deviations from the assumption $\varepsilon_t \sim NID(0, \sigma^2)$. The assumption of NID errors is critical to the simulated distributions in Fuller (1976). If there is autocorrelation in the residual process, the OLS estimated residuals will be inappropriate and the residual variance estimate will be biased and inconsistent. The ADF-test seeks to solve the problem by augmenting the equations with lagged $\Delta y_t$:

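$$\Delta y_t = \pi y_{t-1} + \sum_{i=1}^{k} \gamma_i \Delta y_{t-i} + \varepsilon_t, \qquad (11.1)$$

or

$$\Delta y_t = \alpha + \pi y_{t-1} + \sum_{i=1}^{k} \gamma_i \Delta y_{t-i} + \varepsilon_t, \qquad (11.2)$$

or

$$\Delta y_t = \alpha + \pi y_{t-1} + \beta t + \sum_{i=1}^{k} \gamma_i \Delta y_{t-i} + \varepsilon_t. \qquad (11.3)$$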

The asymptotic test statistic is distributed as the DF-test, and the same recommendation applies to these equations: make sure there is a meaningful alternative hypothesis. Therefore start with the model including both a constant and a trend.
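The augmented regression with a constant, a trend and $k$ lagged differences can be sketched as follows. This is a minimal illustration using only the Python standard library; the helper names `ols` and `adf_tvalue`, the choice $k = 2$, the seed and the simulated series are illustrative assumptions. As before, the t-value on the lagged level has to be judged against Dickey-Fuller type tables.

```python
import math
import random


def ols(X, y):
    """OLS by the normal equations. X is a list of rows. Returns (coefs, std errors)."""
    n, k = len(X), len(X[0])
    # Augmented matrix [X'X | I]; Gauss-Jordan elimination turns it into [I | (X'X)^-1]
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)]
         + [1.0 if i == j else 0.0 for j in range(k)] for i in range(k)]
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))   # partial pivoting
        A[c], A[p] = A[p], A[c]
        piv = A[c][c]
        A[c] = [a / piv for a in A[c]]
        for r in range(k):
            if r != c:
                f = A[r][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    inv = [row[k:] for row in A]
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    b = [sum(inv[i][j] * xty[j] for j in range(k)) for i in range(k)]
    rss = sum((yi - sum(bj * xj for bj, xj in zip(b, r))) ** 2 for r, yi in zip(X, y))
    s2 = rss / (n - k)
    return b, [math.sqrt(s2 * inv[i][i]) for i in range(k)]


def adf_tvalue(y, k):
    """t-value on y_{t-1} in the ADF regression with constant, trend and k lagged differences."""
    dy = [b - a for a, b in zip(y[:-1], y[1:])]      # dy[j] = y[j+1] - y[j]
    X, z = [], []
    for j in range(k, len(dy)):
        lags = [dy[j - i] for i in range(1, k + 1)]  # the k previous differences
        X.append([1.0, y[j], float(j)] + lags)       # constant, lagged level, trend, lags
        z.append(dy[j])
    b, se = ols(X, z)
    return b[1] / se[1]


rng = random.Random(7)
rw, ar = [0.0], [0.0]
for _ in range(400):
    rw.append(rw[-1] + rng.gauss(0, 1))        # random walk, I(1)
    ar.append(0.5 * ar[-1] + rng.gauss(0, 1))  # stationary AR(1), I(0)

t_rw = adf_tvalue(rw, k=2)
t_ar = adf_tvalue(ar, k=2)
print(f"ADF t-value, random walk:      {t_rw:.2f}")
print(f"ADF t-value, stationary AR(1): {t_ar:.2f}")
```

The sketch solves the normal equations directly, which is adequate for a handful of regressors; a numerical library would normally be used in applied work.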

The ADF test is better than the original DF-test since the augmentation leads to empirical white noise residuals. As for the DF-test, the ADF test must be set up in such a way that it has a meaningful alternative hypothesis, and higher order integration must be tested before the single unit root case.²

2 Sj


stationary, the distribution of the lags is normal, and standard tests, including Q-tests and LM tests for serial correlation in the residuals, can be used. In small samples the augmentation might play an important role for the outcome of the test. No general rule can be established, other than that the residuals should not display autocorrelation. It is therefore up to the modeller to convince the readers (the critics) that the final verdict regarding the significance, or non-significance, of $\pi$ rests on solid ground.

An additional complication is how to treat outliers in the sample. Outliers will affect the estimation, in particular the significance of the constant and the trend variable. If trends are significant, under the null of a unit root process, according to the tabulations in Dickey and Fuller (1979), the conclusion is that the estimated coefficient on $y_{t-1}$ follows a normal distribution. Finding significant time trends often implies the rejection of a unit root. But, if this is caused by an outlier affecting the estimation of the trend, one has to be careful in rejecting the unit root. In the case of significant trend variables leading to the rejection of the unit root hypothesis, some careful investigation of outliers is called for, to be secure against spurious regressions.

The DF and ADF tests are the most well known tests, and are easily understood by most people. However, in limited samples and with $\varepsilon_t$ not being white noise, they are often quite inconclusive. The tests should therefore be accompanied by graphs and perhaps other tests.

The ADF-test tries to solve the problem of non-white noise residuals by adding lags of the dependent variable. It should be stressed that the ADF-test is quite adequate as a data descriptive device under the maintained hypothesis that the variables in a sample are integrated of order one. There are, however, a number of tests which try to improve on some of the weaknesses of the ADF-test. Phillips and Perron (1988) suggest a non-parametric correction of the test statistic so that the Dickey-Fuller distribution can be used even in cases when the residual in the DF-test is not white noise. (The KPSS-test below is a more recent modification of the same principle.) The method starts from the estimated t-value ($t_{\hat\pi}$) and the estimated residuals from the DF equation. The test statistic, the t-value, is modified with the following formula:

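$$Z(t_{\hat\pi}) = \frac{S_{\varepsilon}}{S_{Tl}}\, t_{\hat\pi} - \frac{T\,[S_{Tl}^2 - S_{\varepsilon}^2]\,[\mathrm{std.er}(\hat\pi)/s]}{2 S_{Tl}}, \qquad (11.4)$$

where

$$S_{\varepsilon}^2 = T^{-1} \sum_{t=1}^{T} \hat\varepsilon_t^2, \qquad (11.5)$$

and

$$S_{Tl}^2 = T^{-1} \sum_{t=1}^{T} \hat\varepsilon_t^2 + 2T^{-1} \sum_{j=1}^{l} \left[1 - \frac{j}{l+1}\right] \sum_{t=j+1}^{T} \hat\varepsilon_t \hat\varepsilon_{t-j}. \qquad (11.6)$$

Here $s$ is the standard error of the DF regression, and the weights $1 - j/(l+1)$ form Bartlett's triangular window. The critical factor is to determine the size of the lag window $l$.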
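The correction can be sketched in code. The following is a minimal illustration, assuming the statistic combines the ordinary residual variance $T^{-1}\sum \hat\varepsilon_t^2$ with its Bartlett-weighted long-run counterpart in the way described above. The function names `bartlett_lrv` and `pp_tvalue` and the input numbers are illustrative assumptions.

```python
import math


def bartlett_lrv(e, l):
    """Long-run variance of the residuals e with Bartlett weights 1 - j/(l+1)."""
    T = len(e)
    s2 = sum(v * v for v in e) / T
    for j in range(1, l + 1):
        w = 1.0 - j / (l + 1.0)
        gamma_j = sum(e[t] * e[t - j] for t in range(j, T)) / T
        s2 += 2.0 * w * gamma_j
    return s2


def pp_tvalue(t_hat, se_pi, s, e, l):
    """Phillips-Perron type correction of the DF t-value.

    t_hat : t-value from the DF regression
    se_pi : standard error of the coefficient on the lagged level
    s     : standard error of the DF regression
    e     : residuals from the DF regression
    l     : size of the lag window
    """
    T = len(e)
    s2_e = sum(v * v for v in e) / T          # ordinary residual variance
    s2_tl = bartlett_lrv(e, l)                # long-run variance
    s_tl = math.sqrt(s2_tl)
    return (math.sqrt(s2_e) / s_tl) * t_hat - T * (s2_tl - s2_e) * (se_pi / s) / (2.0 * s_tl)


# Illustrative inputs (assumed): negatively autocorrelated residuals
e = [1.0, -1.0, 1.0, -1.0]
print("long-run variance, l = 0:", bartlett_lrv(e, 0))
print("long-run variance, l = 1:", bartlett_lrv(e, 1))
print("corrected t-value, l = 1:", pp_tvalue(-2.0, 0.1, 1.0, e, 1))
```

With a lag window of zero the two variance estimates coincide and the corrected statistic reduces to the original t-value, which is a convenient check on any implementation.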


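Start with the following DGP,

$$y_t = \alpha + \beta t + x_t \quad \text{and} \quad x_t = \rho x_{t-1} + \varepsilon_t,$$

where $\varepsilon_t \sim NID(0, \sigma^2)$. Under a unit root, $H_0: \rho = 1$. To test, run the following regression,

$$\Delta y_t = \mu + \phi \hat{S}_{t-1} + \text{error},$$

where $\hat{S}_t = \sum_{j=2}^{t} [\Delta y_j - (y_T - y_1)/(T-1)]$. Schmidt and Phillips (1992) simulated the t-statistic for $\hat\phi$.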

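This test is calculated by RATS 4. The DGP is assumed to be

$$y_t = \xi t + r_t + \varepsilon_t,$$

where $r_t = r_{t-1} + v_t$, $\varepsilon_t \sim NID(0, \sigma^2)$ and $v_t \sim NID(0, \sigma_v^2)$. The null hypothesis is that $y_t$ is stationary. The test is $H_0: \sigma_v^2 = 0$, against $H_1: \sigma_v^2 > 0$. Start by estimating the following equation,

$$y_t = \alpha + \beta t + e_t, \qquad (11.7)$$

and form the test statistic

$$\eta = T^{-2} \sum_{t=1}^{T} \frac{S_t^2}{s^2(k)}, \qquad (11.8)$$

where

$$S_t^2 = \left( \sum_{i=1}^{t} \hat{e}_i \right)^2 \quad \text{and} \qquad (11.9)$$

$$s^2(k) = T^{-1} \sum_{t=1}^{T} \hat{e}_t^2 + 2T^{-1} \sum_{s=1}^{k} w(s, k) \sum_{t=s+1}^{T} \hat{e}_t \hat{e}_{t-s}. \qquad (11.10)$$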

The critical values for the test are given in Kwiatkowski et al. (1992). A Bartlett type window, $w(s, k) = 1 - [s/(k+1)]$, is used to correct the estimate so that the (sample) test statistic corresponds to the simulated distribution, which is based on white noise residuals. The KPSS test appears to be powerful against the alternative of a fractionally integrated series. That is, a rejection of I(0) does not lead to I(1), as in most unit root tests, but rather to an I(d) process where $0 < d < 1$. These types of series are called fractionally integrated. A high value of $d$ implies a long memory process. In contrast to an integrated series, I(1) or I(2) etc., a fractionally integrated series is mean reverting; see Baillie and Bollerslev (1994).
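The computation of the statistic can be sketched as follows: detrend the series, cumulate the residuals, and scale by the Bartlett-weighted variance. This is a minimal illustration; the function name `kpss_stat`, the window size $k = 4$, the seed and the simulated series are illustrative assumptions, and the resulting numbers have to be compared with the critical values in Kwiatkowski et al. (1992).

```python
import random


def kpss_stat(y, k):
    """KPSS statistic around a linear trend with a Bartlett window of size k."""
    T = len(y)
    t = list(range(1, T + 1))
    tbar, ybar = sum(t) / T, sum(y) / T
    beta = (sum((a - tbar) * (b - ybar) for a, b in zip(t, y))
            / sum((a - tbar) ** 2 for a in t))
    alpha = ybar - beta * tbar
    e = [b - alpha - beta * a for a, b in zip(t, y)]      # residuals from the trend regression

    # Partial sums S_t of the residuals
    S, running = [], 0.0
    for v in e:
        running += v
        S.append(running)

    # Long-run variance s^2(k) with weights w(s, k) = 1 - s / (k + 1)
    s2 = sum(v * v for v in e) / T
    for s in range(1, k + 1):
        w = 1.0 - s / (k + 1.0)
        s2 += 2.0 * w * sum(e[i] * e[i - s] for i in range(s, T)) / T

    return sum(v * v for v in S) / (T ** 2 * s2)


rng = random.Random(11)
noise = [rng.gauss(0, 1) for _ in range(300)]             # stationary series
walk, level = [], 0.0
for _ in range(300):
    level += rng.gauss(0, 1)
    walk.append(level)                                    # random walk, I(1)

eta_noise = kpss_stat(noise, k=4)
eta_walk = kpss_stat(walk, k=4)
print(f"KPSS statistic, white noise: {eta_noise:.3f}")
print(f"KPSS statistic, random walk: {eta_walk:.3f}")
```

Note that the roles of the hypotheses are reversed relative to the DF-type tests: a large value of the statistic speaks against stationarity.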

This test builds on the conclusion that for a unit root variable, the estimated residuals are inappropriate and will indicate that unrelated variables are statistically significant (spurious regression). Therefore estimate,

$$1: \quad y_t = \alpha + \beta t + \varepsilon_{1t} \qquad (11.11)$$

$$2: \quad y_t = \alpha + \beta_1 t + \beta_2 t^2 + \varepsilon_{2t}, \qquad (11.12)$$

and compute

$$G(1, 2) = \frac{RSS_1 - RSS_2}{s^2(k)}, \qquad (11.13)$$

where $RSS_1$ and $RSS_2$ are the residual sums of squares from model 1 and 2 respectively, and $s^2(k)$ is as above.

We can conclude that among these tests, the ADF test is robust as long as the lag structure is correctly specified. The gains from correcting the estimated residual variance seem to be small.

Rejecting one unit root does not necessarily mean that one can accept the alternative of an I(0) series. Sometimes unit root tests will reject the assumption of a unit root even though the series is clearly non-stationary. There are several alternatives when the I(1) hypothesis is rejected:

- The series is actually I(0).
- The series is driven by a deterministic rather than a stochastic trend.
- The series contains more than one unit root.³
- The series is driven by segmented trends, meaning that there are different deterministic trends for different sub-periods.
- The series contains fractionally integrated trends. It has an ARFIMA representation (AutoRegressive Fractionally Integrated Moving Average).
- The series is non-stationary, but driven by some (to us) unknown trend process.

Tests for deterministic trends and more than one unit root are straightforward from the section above and are not discussed here.

The segmented trend approach was launched by Perron (1989). He argues that few series really are I(1). If we have detailed knowledge about the data generating process, we might establish that series have different deterministic trends for different time periods. The fact that these segmented trends shift over time implies that unit root tests cannot reject the hypothesis of an integrated variable. Thus, instead of detecting the correct deterministic trend(s), the test approximates the changing deterministic trend with a stochastic trend. Perron (1989) demonstrates this fact and derives a test for a known break date in the series. Banerjee et al. (1992) develop a test for an unknown break date. The problem with this approach is that we somehow have to estimate these segmented trends. Sometimes it will be possible to argue for segmented trends, like World War One and Two, etc., but in principle we are left more or less with ad hoc estimates of what might be segmented trends.

3 Testing for integration should be done according to the Pantula Principle: since higher order integration dominates lower order integration, test from higher to lower order, and stop when it is not possible to reject the null. For instance, a test for I(1) vs. I(0) assumes that there are no I(2) processes. The presence of higher order integration might ruin the test for lower order integration; therefore start with I(2), and only if I(2) is rejected will it be meaningful to test for I(1), etc.


For the class of integrated series discussed above, the order of differencing was assumed to be $d = 1$. The choice between $d = 0$ and $d = 1$ might be too restrictive in some situations, especially if unit root tests reject I(0) in favour of the I(1) hypothesis when we have theoretical information that suggests that I(1) is implausible, or highly unrealistic. For example, unit root tests might find that both the forward and the spot foreign exchange rates are I(1), and that the forward premium ($f - s$), the log difference, is also I(1), indicating no mean reversion in this difference series, and that the forward and the spot rates are not co-integrating. The expectations part of the forward rate would therefore be extremely small or irrational in some sense, so the risk premiums are causing the I(1) behavior.

Autoregressive Fractionally Integrated Moving Average (ARFIMA) models represent a more general class of models than ARMA and ARIMA models, see Granger and Joyeux (1980) and Granger (1980). The ARFIMA($p, d, q$) model is defined as

$$\phi(L)(1 - L)^d y_t = \mu + \theta(L)\varepsilon_t, \qquad (11.14)$$

where the fractional difference operator $(1 - L)^d$ is defined in terms of its Maclaurin series expansion. The difference operator works in the same way as for ARIMA models: applying the operator to $y_t$ results in $(1 - L)^d y_t = z_t$, where $z_t$ has an ARMA representation. The FI operator transforms the original series into a series which has an ARMA representation. Once the long-run memory is removed, the standard techniques for identifying the ARMA process can be applied.
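The series expansion of $(1 - L)^d$ gives the weights $\pi_0 = 1$ and $\pi_j = \pi_{j-1}(j - 1 - d)/j$, which can be computed recursively. The following is a minimal sketch, assuming the expansion is simply truncated at the start of the sample; the function names `frac_diff_weights` and `frac_diff` are illustrative assumptions.

```python
def frac_diff_weights(d, n):
    """First n coefficients of the expansion of (1 - L)^d in powers of L."""
    w = [1.0]
    for j in range(1, n):
        w.append(w[-1] * (j - 1 - d) / j)
    return w


def frac_diff(y, d):
    """Apply the truncated operator (1 - L)^d to the series y."""
    w = frac_diff_weights(d, len(y))
    return [sum(w[j] * y[t - j] for j in range(t + 1)) for t in range(len(y))]


print("d = 1.0:", frac_diff_weights(1.0, 5))   # ordinary first difference
print("d = 0.5:", frac_diff_weights(0.5, 5))   # slowly decaying weights
print("first difference of 1, 3, 6, 10:", frac_diff([1.0, 3.0, 6.0, 10.0], 1.0))
```

For $d = 1$ the weights collapse to the ordinary first difference, while for a fractional $d$ they decay slowly, so that distant observations still enter the transformed series.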

The difference between ARIMA and ARFIMA models is that the latter allows for a more complex memory process. The Wold theorem says that any non-deterministic covariance stationary series has an infinite MA representation like

$$y_t = \sum_{i=0}^{\infty} \psi_i \varepsilon_{t-i}, \qquad (11.15)$$

where $\varepsilon_t \sim iid(0, \sigma^2)$ and $\sum_{i=0}^{\infty} \psi_i^2 < \infty$. If this series also belongs to the class of series which has an ARMA representation, the autocorrelation function will die out exponentially. For an I(1) series the autocorrelation function will display complete persistence: the theoretical autocorrelation function is unity for all lags.

Because the autocorrelation function of an ARMA process dies out exponentially, it can be said to have a relatively short memory compared to series which have autocorrelation functions which do not die out as quickly. ARFIMA series therefore represent long memory time series. The ARFIMA model allows the autocorrelation coefficients to exhibit hyperbolic patterns. For $d < 1$ the series is mean reverting, and for $-0.5 < d < 0.5$ the ARFIMA series is covariance stationary.

For a statistician who is describing the behavior of a time series, an ARFIMA model might offer a better representation than the more traditional ARMA model, see Diebold and Rudebusch (1989) and Sowell (1992). For an econometrician, however, the economic understanding is of equal importance. The standard question in most economic work is whether to use levels or percentage growth rates of the data, to construct models with known distributions. That means deciding whether series are I(0) or I(1). Fractional integration does not affect these problems. It becomes important when we ask specific questions about the type of long-run memory we are dealing with, like whether there is mean reversion in the forward premium, or the real exchange rate, or in asset prices etc. Thus only when economic theory gives us a reason for testing something other than I(0) and I(1) is fractional integration


Cheung and Lai (1995).

FRACTIONAL INTEGRATION


NON-STATIONARITY AND CO-INTEGRATION

Most macroeconomic and finance variables are non-stationary. This has enormous consequences for the use of statistical methods in economics research. Statistical theory assumes that variables are stationary; if they are not stationary, statistical inference is generally not possible. It doesn't matter that numerous old textbooks in econometrics and research papers have ignored the problem. The problems associated with non-stationary variables in econometrics have been known since the 1920s, but didn't get a solution until the end of the 1980s. In principle there are two ways of dealing with non-stationarity: you must either remove the non-stationarity before setting up the econometric model, or set up a model of non-stationary variables that forms a stationary relation. Typically, in none of these cases can you use standard inference based on t-, chi-square or F-distributions.

Now, variables can be non-stationary in an infinite number of ways. In practice, there are broadly two types of non-stationary variables of interest in econometrics. The first type is variables stationary around a deterministic trend. The second type is variables stationary around a stochastic trend. Stochastic trend variables are also known as integrated variables. Most variables in economics and finance seem to be driven by stochastic trends.

The problem with stochastic trend variables (integrated variables) is that not only do they not follow standard distributions; if you try to use standard distributions you will most likely be fooled into thinking there are significant relations when in fact there is no relation. This is known as the spurious regression problem in the literature.

Historically, trends were dealt with by removing what people assumed was a linear deterministic trend. This was done in the following way. The non-stationary variable was regressed against a constant and a linear trend variable,

$$y_t = \alpha + \beta t + \tilde{y}_t. \qquad (12.1)$$

$\tilde{y}_t$ in this regression represents the de-trended $y_t$ series, which was then used in regression models with other stationary or detrended variables. In the equation above, $\alpha$ becomes a combination of the sample mean of $y_t$ and the average of the time variable. In general, the deterministic trend removal can be done with models including polynomial deterministic trends, such as

$$y_t = \alpha + \beta_1 t + \beta_2 t^2 + \dots + \beta_n t^n + \tilde{y}_t. \qquad (12.2)$$

It is also possible to model deterministic trends using trigonometric functions in combination with the time trend. In the literature there are various deterministic filters that aim at removing long-run (supposedly deterministic) trends, such as the so-called Hodrick-Prescott filter.

However, if the series is driven by a stochastic trend, the estimated parameters of these models will not follow standard distributions and the regression will impose a spurious autocorrelation pattern on the spuriously detrended variable $\tilde{y}_t$. Thus, until you have investigated the non-stationary properties of the series and tested for stochastic trends (order of integration) it is not possible to do any econometric modelling.


Deterministic trends are seldom the best choice for economic time series. Instead the non-stationary behaviour is often better described with stochastic trends, which have no fixed trend that can be predicted from period to period. A random walk serves as the simplest example of a stochastic trend. Starting from the model,

$$y_t = y_{t-1} + v_t \quad \text{where } v_t \sim NID(0, \sigma^2), \qquad (12.3)$$

repeated substitution backwards leads to

$$y_t = y_0 + \sum_{i=1}^{t} v_i. \qquad (12.4)$$

The expression shows how the random walk variable is made up of the sum of all historical white noise shocks to the series. The sum represents the stochastic trend. The variable is non-stationary, but we cannot predict how it changes, at least not by looking at the history of the series. (See also the discussion above concerning random walks under the section about different stochastic processes.) The stochastic trend term is removed by taking the first difference of the series. In the random walk case this implies that $\Delta y_t = v_t$ is a stationary variable with constant mean and variance. Variables driven by stochastic trends are also called integrated variables because the sum process represents the integrated property of these variables.
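The summation and differencing steps can be verified directly. The following is a minimal sketch, assuming standard normal shocks; the seed, the sample size and the starting value $y_0 = 5$ are illustrative assumptions.

```python
import random
from itertools import accumulate

random.seed(0)
T = 200
v = [random.gauss(0, 1) for _ in range(T)]   # white noise shocks v_1, ..., v_T
y0 = 5.0

# The random walk as the starting value plus the cumulated shocks
y = [y0 + s for s in accumulate(v)]

# First differences recover the shocks: y_t - y_{t-1} = v_t
dy = [y[0] - y0] + [b - a for a, b in zip(y[:-1], y[1:])]

max_gap = max(abs(a - b) for a, b in zip(dy, v))
print(f"largest gap between the differenced series and the shocks: {max_gap:.2e}")
```

The level series wanders without any tendency to return to its starting value, while its first difference is the stationary shock series itself, up to rounding error.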

A generic representation is the combination of deterministic and stochastic trends,

$$y_t = \alpha + \beta t + \tau_t + \tilde{y}_t, \qquad (12.5)$$

where $\tau_t = \tau_{t-1} + v_t$, $v_t$ is $NID(0, \sigma^2)$, $\beta t$ is the deterministic trend and $\tilde{y}_t$ is a stationary process representing the stationary part of $y_t$. In this model, the stochastic trend is represented by $\sum_{i=1}^{t} v_i$.

An alternative trend representation is segmented deterministic trends, illustrated by the model

$$y_t = \alpha + \beta_1 t_1 + \beta_2 t_2 + \dots + \beta_k t_k + \tilde{y}_t, \qquad (12.6)$$

where $t_1, t_2$, etc. are deterministic trends for different periods, such as wars, or policy regimes such as exchange rate regimes, monetary policy, etc. Segmented trends are an alternative to stochastic trends, see Perron (1989), but the problem is that the identification of these different trends might be ad hoc. Given a suitable choice of trends almost any empirical series can be made stationary, but are the different trends really picking up anything interesting that is not embraced by the assumption of stochastic trends, arising from innovations with permanent effects on the economy?

12.0.1

Most economic time series appear to be driven by stochastic trends. Regression with these variables carries the danger of parameter estimates with non-standard distributions, which makes inference much more difficult.

The spurious regression problem was introduced in an article by Granger and Newbold in 1974. Granger and Newbold generated two random walk series, which were independent of each other by construction. Let the two variables be $x_t$ and $y_t$, and let $y_t$ and $x_t$ be independent. Next, consider the linear regression of $y_t$ on $x_t$,

$$y_t = \alpha + \beta x_t + \varepsilon_t. \qquad (12.7)$$

Since the variables are independent, the true value of $\beta$ is zero, and we would expect the t-statistic of $\hat\beta$ to be centred on zero as the sample size increases, so that $t_{\hat\beta} \sim N(0, 1)$. If we repeat the regression with new independent random walks, we expect that in 5 per cent of the tests we would be unlucky and erroneously conclude that there is significance even though the true value of $\beta$ is zero. However, this is not what happens.

Granger and Newbold studied the empirical distribution of the regression above. They ran 1000 regressions and found that the distribution of the t-statistic of $\hat\beta$ was the opposite of what we expect. In 95 % of the regressions they found a significant relation, even though the rejection frequency should be 5 %. Asymptotically the t-value of $\hat\beta$ approached 2.0. The problem got worse when more independent random walks were put into the equation. Granger and Newbold also found that the reported $R^2$ values became relatively high while the Durbin-Watson value became low.

Later in the 1980s, researchers such as Peter Phillips showed that, due to the integrated properties of the variables, their sample moments converge to functions of Wiener processes (Brownian motions). The sample moments will not converge to constants, as in the case of stationary stochastic regressors. Instead, the sample moments converge to random variables which are functions of Wiener processes. In this situation, with two (or more) random walk variables regressed against each other, the t-statistics will approach 2.0 instead of 0.0. Thus, by using the t-distribution to test the null of no correlation between the variables, one will be fooled into rejecting the assumption of no correlation. This is the spurious regression problem. It is caused by parameter estimates which are not distributed according to the normal distribution, not even in the long run.
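The experiment is easy to repeat. The following is a small Monte Carlo sketch, assuming standard normal shocks and a nominal two-sided 5 per cent test with the critical value 1.96. The function names, the sample size and the number of replications are illustrative assumptions, and the exact rejection frequency depends on them.

```python
import math
import random


def slope_tvalue(x, y):
    """t-value of beta in the regression y = alpha + beta * x + e."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((a - xbar) ** 2 for a in x)
    beta = sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / sxx
    alpha = ybar - beta * xbar
    rss = sum((b - alpha - beta * a) ** 2 for a, b in zip(x, y))
    return beta / math.sqrt(rss / (n - 2) / sxx)


def random_walk(T, rng):
    """Cumulated standard normal shocks."""
    level, out = 0.0, []
    for _ in range(T):
        level += rng.gauss(0, 1)
        out.append(level)
    return out


def rejection_rate(T, reps, integrated, seed=0):
    """Share of replications in which |t| exceeds 1.96 for two independent series."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        if integrated:
            x, y = random_walk(T, rng), random_walk(T, rng)
        else:
            x = [rng.gauss(0, 1) for _ in range(T)]
            y = [rng.gauss(0, 1) for _ in range(T)]
        hits += abs(slope_tvalue(x, y)) > 1.96
    return hits / reps


print("independent white noise:  ", rejection_rate(100, 500, integrated=False))
print("independent random walks: ", rejection_rate(100, 500, integrated=True))
```

With independent white noise series the rejection frequency stays close to the nominal level, whereas with independent random walks the standard t-test rejects far too often.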

In practical work, that is when using limited samples, this will occur not only when regressing random walk variables, but also when regressing integrated variables or near-integrated variables.

Near-integrated variables are a classification of variables which, in a limited sample, look and behave like integrated variables. An autoregressive process with an autoregressive parameter close to unity (say 0.9) can be called near-integrated. In these situations, the distribution theory of integrated variables is a much better approximation than the standard normal.

12.0.2

In general, a linear combination of integrated variables is integrated of the same order as the individual variables. The exception from this rule is called co-integration, when a linear combination of integrated variables results in a lower order of integration.

So, in the linear regression above, since both $y_t$ and $x_t$ are integrated of order one, I(1), and independent, the residual term $\varepsilon_t$ will be integrated of order one, I(1), as well.

In the case when the two I(1) variables share the same stochastic trend and form an I(0) residual we say that they are co-integrating.


The intuition here is that for the two variables to form a meaningful long-run relationship, they must share the same trend. Otherwise they will be drifting away from each other as time elapses. Therefore, to build econometric models which make sense in the long run, we have to investigate the trend properties of the variables and determine the type of trend and whether variables are co-trending and co-integrating or not. In econometric work, trend properties refer to the properties of the sample and how to do inference. It is not a theoretical concept about how economic variables grow in the long run.

Once we have clarified the trend properties, it becomes possible to establish stationary relations and models, and econometric modeling can proceed as usual, and standard techniques for inference can be used.

Definitions:

Definition 1 A series with no deterministic component and which has a stationary and invertible autoregressive moving average (ARMA) representation after differencing $d$ times, but which is not stationary after differencing $(d - 1)$ times, is said to be integrated of order $d$, denoted $x_t \sim I(d)$.

Definition 2 The components of the vector $x_t$ are said to be co-integrated of order $d, b$, denoted $x_t \sim CI(d, b)$, if (i) $x_t$ is $I(d)$ and (ii) there exists a non-zero vector $\beta$ such that $\beta' x_t \sim I(d - b)$, $d \geq b > 0$. The vector $\beta$ is called the co-integrating vector. (Adapted from Engle and Granger (1987).)

Remark 1 If $x_t$ has more than two elements there can be more than one co-integrating vector $\beta$.

Remark 2 The order of integration of the vector $x_t$ is determined by the element which has the highest order of integration. Thus, $x_t$ can in principle have variables integrated of different orders. A related definition concerns the error correction representation following from co-integration.

Definition 3 A vector time series $x_t$ has an error-correction representation if it can be expressed as $A(L)(1 - L)x_t = -\gamma z_{t-1} + \omega_t$, where $\omega_t$ is a stationary multivariate disturbance term, with $A(0) = I$, $A(1)$ having only finite elements, $z_t = \beta' x_t$, and $\gamma$ a non-zero vector. For the case where $d = b = 1$, and with co-integrating rank $r$, the Granger Representation Theorem holds. (Adapted from Banerjee et al. (1993).)

Remark 3 This definition and the Granger Representation Theorem (Engle and Granger, 1987) tell us that if there is co-integration then there is also an error correction representation, and there must be Granger causality in at least one direction.

12.0.3

Under the general null hypothesis of independent and integrated variables, estimated variances and test statistics do not follow standard distributions. Therefore the way ahead is to test for co-integration, and then try to formulate a regression model (or system) in terms of stationary variables only. Traditionally there are two approaches to testing for co-integration: residual based approaches and other approaches. The first type starts with the formulation of a co-integration regression, a regression model with integrated variables. Co-integration is then determined by investigating the residual(s) from that regression. The Engle and Granger two-step procedure and the Phillips-Ouliaris test are examples of this approach. The other approach is to start from some representation of a co-integrated system (VAR, VECMA, etc.) and test for some specific characterization of co-integrated systems. Johansen's VECM approach, or tests for common trends, are examples.

The Engle and Grangers two-step procedure is the easiest and most used

residual based test. It is used because of its simplicity and ease of use, but is not a

good test. The two-step procedure, starts with the estimation of the co-integrating

regression. If yt and xt are two variables integrated of order one, the rst step is

to estimate the following OLS regression

yt =

+ xt + zt

(12.8)

where the estimated residuals are $\hat{z}_t$. If the variables are co-integrating, $\hat{z}_t$ will be I(0). The second step is to perform an Augmented Dickey-Fuller unit root test of the estimated residual,

$$ \Delta \hat{z}_t = \alpha + \pi \hat{z}_{t-1} + \sum_{i=1}^{k} \gamma_i \Delta \hat{z}_{t-i} + \varepsilon_t . \tag{12.9} $$

If the variables co-integrate, they share a common trend and the residual is a stationary process. If they do not share a common trend, they do not co-integrate, the parameter $\pi$ must be zero and the residual $z_t$ must be non-stationary and integrated of the same order as $y_t$. If the null, $H_0 : \pi = 0$, is rejected for the alternative $H_A : \pi < 0$, we conclude that the variables are co-integrated, and that the long-run co-integrating parameter is $\beta$. Furthermore, we can refer to the OLS regression as the co-integrating regression. We know that the residual is stationary, $\hat{z}_t$ is I(0), and therefore $\hat{z}_{t-1}$ can be used as an error correction term, identifying the long-run steady state relation between $y_t$ and $x_t$.
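The two steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data are simulated, the helper names `ols` and `engle_granger_tstat` are made up for this example, and the resulting t-ratio has to be compared with the special critical values for residual based tests, not with the standard Dickey-Fuller or normal tables.

```python
import numpy as np

def ols(X, y):
    """OLS coefficients and residuals for y = X b + e."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return b, y - X @ b

def engle_granger_tstat(y, x, k=1):
    """Hypothetical helper: returns (beta_hat, t-ratio on z_{t-1})."""
    T = len(y)
    # Step 1: co-integrating regression y_t = alpha + beta x_t + z_t
    b, z = ols(np.column_stack([np.ones(T), x]), y)
    # Step 2: ADF regression on the residuals,
    # dz_t = c + pi z_{t-1} + sum_i gamma_i dz_{t-i} + e_t
    dz = np.diff(z)
    rows = len(dz) - k
    cols = [np.ones(rows), z[k:-1]]
    cols += [dz[k - i:len(dz) - i] for i in range(1, k + 1)]
    X = np.column_stack(cols)
    g, e = ols(X, dz[k:])
    s2 = e @ e / (rows - X.shape[1])
    se_pi = np.sqrt(s2 * np.linalg.inv(X.T @ X)[1, 1])
    return b[1], g[1] / se_pi

# Simulated example: x_t is a random walk and y_t = 1 + 2 x_t + noise,
# so the two series are co-integrated with beta = 2.
rng = np.random.default_rng(0)
T = 500
x = np.cumsum(rng.standard_normal(T))
y = 1.0 + 2.0 * x + rng.standard_normal(T)
beta_hat, t_stat = engle_granger_tstat(y, x, k=1)
print(beta_hat, t_stat)
```

In this simulated case the t-ratio is strongly negative, so the null of no co-integration is rejected whatever table is used. With real data the value will usually be much closer to the critical region.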

The relevant critical values are not the ones tabulated by Fuller (1976). Instead you have to look up the simulated tables in Engle and Granger (1987), Engle and Yoo (1987), or Banerjee et al. (1993). The reason is that the unit root test is now performed, not on a univariate process, but on a variable constructed from several stochastic processes. The critical values change depending on how many explanatory variables there are in the model.

Remark 4 Remember that the t-statistics, and the estimated standard deviations, from the co-integrating regression must be treated with caution, even if we find co-integration. Unless $x_t$ is exogenous the estimated parameters follow unknown non-normal distributions, even asymptotically.

Remark 5 For the outcome of the test, it will not matter asymptotically which variable is chosen to be the dependent variable; in a limited sample the result can differ between normalizations. As an economist you might favour setting one variable as dependent and interpret the parameters as long-run economic parameters (elasticities etc.).

There are a number of problems with the Engle and Granger two-step procedure. The first is that the tabulated (non-standard) test statistic assumes white noise residuals. The augmentation tries to deal with this, but in most cases it is only a crude approximation.

Second, the test assumes a common factor in the dynamic processes of $y_t$ and $x_t$. In practice this assumption is quite restrictive and the test will not behave well when it does not hold. The dynamics of the two processes and their possible co-integrating relation are usually more complex.

Third, the test assumes that there is only one co-integrating vector. If we test

for co-integration between two variables this is not a problem, because then there

can be only one co-integration vector. Suppose that we add another I(1) variable

(ut ) to the co-integrating regression equation,

$$ y_t = \alpha + \beta_1 x_t + \beta_2 u_t + \nu_t . \tag{12.10} $$

If $y_t$ and $x_t$ are co-integrating, they already form one linear combination ($z_t$) which is stationary. If $u_t \sim I(1)$ is not co-integrating with the other variables, OLS will set $\beta_2$ to zero, and the estimated residual $\hat{\nu}_t$ is I(0). This is why the test will only work if there is only one co-integrating vector among the variables. If $y_t$ and $x_t$ are not co-integrating, then adding $u_t \sim I(1)$ might lead to a co-integrating relation. Thus, in this respect the test is limited, and testing must be done by creating logical chains of bi-variate co-integration hypotheses.

Other residual based tests try to solve at least the first problem by adjusting the test statistic in the second step, so that it always fulfills the criteria for testing the null correctly. Some approaches try to transform the co-integrating regression in such a way that the estimated parameters follow a standard normal distribution.

A better alternative for testing for co-integration among more than two variables is offered by Johansen's test. This test finds long-run steady-state, or co-integrating, relations in the VAR representation of a system. Let the VAR,

$$ A_k(L) x_t = \Psi D_t + \varepsilon_t , \tag{12.11} $$

represent the system. The VAR is a p-dimensional system, the variables are assumed to be integrated of order d, $\{x_t\} \sim I(d)$, $D_t$ is a vector of deterministic variables (constants, dummies, seasonals and possible trends), and $\Psi$ is the associated coefficient matrix. The residual process is normally distributed white noise, $\varepsilon_t \sim NID(0, \Sigma)$. It is important to find the optimal lag length in the VAR and to have normally distributed error terms, in addition to white noise, because the test uses a full information maximum likelihood (FIML) estimator. FIML estimators are notoriously sensitive to small samples and misspecification, which is why care must be taken in the formulation of the VAR. Once the VAR has been found, it can be rewritten in error correction form,

$$ \Delta x_t = \Pi x_{t-1} + \sum_{i=1}^{k-1} \Gamma_i \Delta x_{t-i} + \Psi D_t + \varepsilon_t . \tag{12.12} $$

In practical use the problem is to formulate the VAR; the program will rewrite the VAR in error correction form automatically. Johansen's test builds on the knowledge that if $x_t$ is I(d), co-integration implies that there exist vectors $\beta$ such that $\beta' x_t \sim I(d-b)$. In a practical situation we will assume that $x_t \sim I(1)$, and if there is cointegration, $\beta' x_t \sim I(0)$. If there is cointegration, the matrix $\Pi$ must have reduced rank. The rank of $\Pi$ indicates the number of independent rows in the matrix. Thus, if $x_t$ is a p-dimensional process, the rank (r) of the $\Pi$ matrix determines the number of co-integrating vectors ($\beta$), or the number of linear steady state relations among the variables in $\{x_t\}$. Zero rank (r = 0) implies no co-integrating vectors, full rank (r = p) means that all variables are stationary, while a reduced rank (0 < r < p) means cointegration and the existence of r co-integrating vectors among the variables.

The procedure is to estimate the eigenvalues of $\Pi$ and determine their significance.1 The test statistics follow non-standard distributions, which depend on whether there is a deterministic trend and/or a constant term in the model. The distribution of the test statistic is only known asymptotically and for a closed system without exogenous variables. In other situations the decision must be based on viewing the test statistics as approximations.

1 The test is called the Trace test and its use is explained in Sjöö, Guide to testing for ...

Once the rank of $\Pi$ is known, the matrix can be rewritten as $\Pi = \alpha\beta'$ such that $\beta' x_t$ forms stationary co-integrating relations. The $\beta$ are the co-integrating parameters, and the $\alpha$ represent the adjustment parameters. The significance of the alphas can be determined by ordinary t-tests since they are associated with stationary relations, $\beta' x_t \sim I(0)$.

Finding the VECM

In practical use the problem is to formulate the VAR; the program will rewrite the VAR for the user and present the estimated $\alpha$ and $\beta$ vectors. Sometimes it is necessary to understand how the VECM is found. Consider the 2-dimensional VAR model, where the deterministic terms have been removed for simplification,

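$$ y_t = a_{11} y_{t-1} + a_{12} y_{t-2} + a_{13} z_{t-1} + a_{14} z_{t-2} + e_{1t} \tag{12.13} $$

$$ z_t = a_{21} z_{t-1} + a_{22} z_{t-2} + a_{23} y_{t-1} + a_{24} y_{t-2} + e_{2t} . \tag{12.14} $$

Subtract $y_{t-1}$ from both sides of the first equation to get

$$ \Delta y_t = (a_{11} - 1) y_{t-1} + a_{12} y_{t-2} + a_{13} z_{t-1} + a_{14} z_{t-2} + e_{1t} . $$

Since the equation was correctly specified from the beginning, it can be transformed as long as we do not do anything that affects the properties of the error term. Our aim is to split all lag terms into first differences and lagged levels in such a way that the model consists of one lag at t-1 for all variables in levels, and first differences. We can do this by using the difference operator, $\Delta = (1 - L)$, which can be used as $\Delta y_t = y_t - y_{t-1}$, or $y_{t-1} = y_t - \Delta y_t$. Referring to the operators we have $L = (1 - \Delta)$, or $L y_t = (y_t - \Delta y_t)$. If we apply this to all lags beyond t-1, we get for t-2 the following, $y_{t-2} = y_{t-1} - \Delta y_{t-1}$, and $z_{t-2} = z_{t-1} - \Delta z_{t-1}$. Substitute this into the equation to get,

$$ \Delta y_t = (a_{11} - 1) y_{t-1} + a_{12} (y_{t-1} - \Delta y_{t-1}) + a_{13} z_{t-1} + a_{14} (z_{t-1} - \Delta z_{t-1}) + e_{1t} $$

$$ \Delta y_t = (-1 + a_{11} + a_{12}) y_{t-1} - a_{12} \Delta y_{t-1} + (a_{13} + a_{14}) z_{t-1} - a_{14} \Delta z_{t-1} + e_{1t} . $$

The same steps applied to the second equation give

$$ \Delta z_t = (-1 + a_{21} + a_{22}) z_{t-1} - a_{22} \Delta z_{t-1} + (a_{23} + a_{24}) y_{t-1} - a_{24} \Delta y_{t-1} + e_{2t} . $$

In matrix form the system is

$$ \Delta x_t = \Pi x_{t-1} + \sum_{i=1}^{1} \Gamma_i \Delta x_{t-i} + \varepsilon_t , \tag{12.15} $$

where

$$ \Delta x_t = \begin{bmatrix} \Delta y_t \\ \Delta z_t \end{bmatrix}, \quad x_{t-1} = \begin{bmatrix} y_{t-1} \\ z_{t-1} \end{bmatrix}, \quad \Pi = \begin{bmatrix} \pi_{11} & \pi_{12} \\ \pi_{21} & \pi_{22} \end{bmatrix}, \quad \text{and} \quad \Gamma_1 = \begin{bmatrix} \gamma_{11} & \gamma_{12} \\ \gamma_{21} & \gamma_{22} \end{bmatrix} . $$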
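The rewriting of a two-lag VAR into error correction form can be checked numerically. The sketch below is an illustration only: the coefficient matrices `A1` and `A2` (the coefficients on $x_{t-1}$ and $x_{t-2}$ stacked into matrices) are hypothetical numbers chosen for this example, and the names are not taken from any particular program.

```python
import numpy as np

# Hypothetical VAR(2): x_t = A1 x_{t-1} + A2 x_{t-2} + e_t, with x_t = (y_t, z_t)'
A1 = np.array([[0.6, 0.2],
               [0.1, 0.7]])
A2 = np.array([[0.2, 0.0],
               [0.1, 0.1]])

# Error correction form: dx_t = Pi x_{t-1} + Gamma1 dx_{t-1} + e_t
Pi = A1 + A2 - np.eye(2)   # coefficients on the lagged levels
Gamma1 = -A2               # coefficients on the lagged differences

# Check on arbitrary values that both forms give the same dx_t
rng = np.random.default_rng(1)
x_lag1, x_lag2, e = rng.standard_normal((3, 2))
dx_from_var = A1 @ x_lag1 + A2 @ x_lag2 + e - x_lag1
dx_from_vecm = Pi @ x_lag1 + Gamma1 @ (x_lag1 - x_lag2) + e
same = np.allclose(dx_from_var, dx_from_vecm)
print(Pi, same)
```

With these numbers the matrix of level coefficients has two proportional rows, so it has rank one, which is the co-integration case discussed in the text.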

The first differences $\Delta x_t$ are integrated of order zero and therefore stationary. And, since $x_t$ is non-stationary, the variables in $x_t$ grow in two dimensions unless they share the same trend. In that case we would say that they are co-integrated and share one common trend. In the case of a p-dimensional system, the system can expand in p dimensions, or in fewer than p dimensions if variables share the same trend. Under these properties a single $y_{t-1}$ or $z_{t-1}$ cannot be correlated with $\Delta y_t$ or $\Delta z_t$.

The only case in which the rows of $\Pi$ can be different from zero is when $(\pi_{11} y_{t-1} + \pi_{12} z_{t-1})$ forms a stationary process, i.e. there exist non-zero parameters $\pi_{11}$ and $\pi_{12}$ (or $\pi_{21}$ and $\pi_{22}$) such that, when multiplied with the x:s, a stationary relation is established. The test for this is to test for the rank of the $\Pi$ matrix, the number of independent non-zero rows in $\Pi$.

A rank of zero means no co-integration. A rank of 2 in this case means that the x:s are stationary, or stationary around deterministic trends if we allowed for constants in the equation. A reduced rank, which in this case is a rank equal to 1, implies co-integration. Co-integration will imply that at least one $\alpha$ parameter will be significant; there will be (long-run) Granger causality in at least one direction. At least one variable must follow the other for them to stay together in fixed formation in the long run.
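The three rank cases for a two-variable system can be illustrated with known matrices. The numbers below are hypothetical and the rank is computed for a given matrix; in empirical work the rank has to be inferred from estimated eigenvalues with the appropriate non-standard tests, which this sketch does not do.

```python
import numpy as np

# Three hypothetical level matrices for a 2-dimensional system
cases = {
    "no co-integration": np.zeros((2, 2)),
    "one co-integrating vector": np.outer([-0.2, 0.2], [1.0, -1.0]),  # alpha beta'
    "stationary levels": np.array([[-0.5, 0.1],
                                   [0.2, -0.4]]),
}

# Rank 0, reduced rank and full rank respectively
ranks = {name: int(np.linalg.matrix_rank(m)) for name, m in cases.items()}
for name, r in ranks.items():
    print(f"{name}: rank = {r}")
```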

Johansen's test is better than the two-step procedure in almost all aspects. The practical problems originate from choosing a correct combination of lags and dummy variables to make the residuals come out as white noise. In a limited sample this can be difficult, and the results might change among different specifications of the system, just as they do in the two-step procedure.

It is recommended to start with the two-step procedure, to learn about the data and get some preliminary results, instead of getting stuck with the Johansen test, having problems finding a specification that leads to economically interesting results.

INTEGRATED VARIABLES AND COMMON TRENDS

This chapter looks at the common trends approach and some of the economics behind cointegration, for instance the question of creating positive or negative shocks in stabilization policy.

An important characteristic of integrated variables is that they become stationary after differencing. The definition of an integrated series is: a series, or a vector of series, $y_t$ with no deterministic component, which has a stationary, invertible ARMA representation after differencing d times, is said to be integrated of order d, denoted as $y_t \sim I(d)$.
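The definition can be illustrated by building integrated series from white noise and differencing them back. This is a sketch on simulated data; the variable names are chosen for this example only.

```python
import numpy as np

rng = np.random.default_rng(3)
e = rng.standard_normal(300)   # white noise, I(0)
i1 = np.cumsum(e)              # random walk, I(1)
i2 = np.cumsum(i1)             # cumulated random walk, I(2)

d1 = np.diff(i1)               # one difference of the I(1) series
d2 = np.diff(i2, n=2)          # two differences of the I(2) series

# Differencing d times returns the underlying stationary shocks
print(np.allclose(d1, e[1:]), np.allclose(d2, e[2:]))
```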

It is possible to have variables driven by both stochastic and deterministic trends. In the very long run a deterministic trend will always dominate over a stochastic trend. In a limited sample, however, it becomes an empirical question whether the deterministic trend is sufficiently strong to have an effect on the distributions of the estimates of the model.1

We know, from the Wold representation theorem, that if $y_t$ is I(0), and has no deterministic process, it can be written as an infinite moving average process. (If the series has a deterministic process this can be removed before solving for the MA process.)

$$ y_t = C(L)\epsilon_t , \tag{13.1} $$

where L is the lag operator, and $\epsilon_t \sim iid(0, \sigma^2)$. Now, suppose that $y_t$ is I(1); then its first difference is stationary and has an infinite MA process,

$$ \Delta y_t = C(L)\epsilon_t . \tag{13.2} $$

Under the assumption that $\epsilon_t \sim iid(0, \sigma^2)$, the level of $y_t$ is

$$ y_t = \frac{C(L)}{\Delta}\epsilon_t = [1/(1-L)]\,C(L)\epsilon_t , \tag{13.3} $$

where $1/(1-L)$ represents the sum of an infinite series. For a limited sample, we get approximately,

$$ y_t = y_0 + (1 + L + L^2 + \dots + L^{t-1})\,C(L)\epsilon_t , \tag{13.4} $$

conditional on everything known at time zero. The long-run solution of this expression, setting L = 1, gives $tC(1)$, and

$$ y_t = y_0 + t\,C(1)\epsilon_t . \tag{13.5} $$

Taking the second difference of $y_t \sim I(1)$ leads to

$$ \Delta^2 y_t = (1-L)\,C(L)\epsilon_t , \tag{13.6} $$

which has a zero long run MA representation, irrespective of $C(1) \neq 0$, since setting L = 1 gives $(1-L)C(L) = (1-1)C(1) = 0$.

1 See Nelson and Plosser (1982) for a discussion about the proper way to model the trend in economic time series.

Let us see what happens with the process in the future. From above we get the MA representation for some future period t + h,

$$ y_{t+h} = y_0 + \sum_{i=1}^{t+h}\left[\sum_{j=0}^{t+h-i} C_j\right]\epsilon_i \tag{13.7} $$

$$ y_{t+h} = y_0 + \sum_{i=1}^{t}\left[\sum_{j=0}^{t+h-i} C_j\right]\epsilon_i + \sum_{i=t+1}^{t+h}\left[\sum_{j=0}^{t+h-i} C_j\right]\epsilon_i . \tag{13.8} $$

The forecasts are decomposed into what is known at time t, the first double sum, and what is going to happen between t and t + h. The latter is unknown at time t; therefore we have to form the conditional forecast of $y_{t+h}$ at time t,

$$ y_{t+h|t} = y_0 + \sum_{i=1}^{t}\left[\sum_{j=0}^{t+h-i} C_j\right]\epsilon_i . $$

The effect of a shock today (at time t) on future periods is found by taking the derivative of the above expression with respect to a change in $\epsilon_t$,

$$ \partial y_{t+h|t} / \partial \epsilon_t = \sum_{j=0}^{h} C_j \to C(1) \quad \text{as } h \to \infty . \tag{13.9} $$

Thus, the long-run effect of a shock today can be expressed by the static long run solution of the MA representation of $\Delta y_t$ (equal to the sum of the MA coefficients).

The persistence of a shock depends on the value of C(1). If C(1) happens to be 0, there is no long-run effect of today's shock. Otherwise we have three cases: C(1) is greater than 0 but less than unity, C(1) = 1, or C(1) is greater than unity. If C(1) is greater than 0 but less than unity, part of the shock will die out in the future and only the fraction C(1) of it remains in the long run. If C(1) = 1, the random walk case, a shock will be as important today as it is for all future periods. Finally, if C(1) is greater than one, the shock magnifies into the future.
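The long-run effect of a unit shock is the cumulated sum of the MA coefficients of the differenced series. The sketch below computes this for three hypothetical sets of coefficients; the function name and the chosen processes are assumptions made for illustration.

```python
def cumulated_effect(ma_coefs, horizons):
    """Effect on the level h periods after a unit shock: sum of C_0, ..., C_h."""
    return [sum(ma_coefs[: h + 1]) for h in horizons]

horizons = (0, 1, 5, 60)
n = 200
cases = {
    # first difference of white noise: C(L) = 1 - L, so C(1) = 0
    "overdifferenced": [1.0, -1.0] + [0.0] * (n - 2),
    # AR(1) in differences with coefficient -0.5: C(1) = 1 / 1.5 = 2/3
    "partial": [(-0.5) ** j for j in range(n)],
    # pure random walk: C(L) = 1, so C(1) = 1
    "random_walk": [1.0] + [0.0] * (n - 1),
}
effects = {name: cumulated_effect(c, horizons) for name, c in cases.items()}
for name, values in effects.items():
    print(name, values)
```

In all three cases the impact effect at horizon zero is one; what differs is how much of the shock is left in the level at long horizons.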

If the series are truly I(1), spectral analysis can be applied to measure the persistence of a shock exactly. The persistence of shocks has interesting implications for economic policy. If shocks are very persistent, or explosive in some cases, it might be a good policy to try to avoid negative shocks, but create positive shocks. In our stabilization policy example, this can be understood as saying that the authorities should be careful with deflationary policies, for instance, since they might result in high and persistent social costs; see Mankiw and Shapiro (198x) for a discussion of these issues.

In the following, the MA representation of systems of integrated processes is analysed. For this purpose let $y_t$ be a vector of I(1) variables. Using the lag operator as $L = 1 - (1 - L) = [1 - \Delta]$,2 and Wold's decomposition theorem gives,

$$ \Delta y_t = C(L)\epsilon_t = C(1)\epsilon_t + \Delta C^*(L)\epsilon_t . \tag{13.10} $$

If $y_t$ is a vector of I(1) variables, then we know from above that if the matrix $C(1) \neq 0$, any shock to the series has permanent effects on future levels of $y_t$. Let us consider a linear combination of these variables, $\beta' y_t = z_t$. Multiplication of the expression with $\beta'$ gives,

$$ \Delta z_t = \beta' C(1)\epsilon_t + \beta' \Delta C^*(L)\epsilon_t . \tag{13.11} $$

In general, it is the case that when $y_t$ is I(1), $z_t$ is I(1) as well. Thus a linear combination of integrated variables will also be integrated, implying that $\beta' C(1)$ is different from zero. Suppose, however, that there exists a matrix $\beta'$ such that $\beta' C(1) = 0$, which implies that when $\beta'$ is multiplied with $y_t$ we get a stationary process, $\beta' y_t \sim I(0)$.

2 This is the same as $y_t = \Delta y_t + y_{t-1} = [(1 - L) + L]\,y_t$.

As an example consider private aggregated consumption and private aggregated (disposable) income. Both variables could be random walks, but what about the difference between these variables? Is it reasonable to assume that a linear combination of them could be driven by a stochastic trend, meaning that consumption would deviate permanently from income in the long run? The answer is no. In the long run it is not likely that a person consumes more than his/her income, nor is it likely that a person will save more and more. Thus, we have to think of situations when two variables cointegrate, that is when a linear combination of I(1) series forms a new stationary series, integrated of order zero. (A more formal definition of co-integration is given in a following section.)

In terms of the C(1) matrix, common trends or co-integration implies that there exists a matrix $\beta$ such that $\beta' C(1) = 0$, hence we get,

$$ z_t = \beta' C^*(L)\epsilon_t , \tag{13.12} $$

which is stationary even though the variables in $y_t$ are integrated of order one. The mathematical condition for having a vector $\beta$ such that $\beta' C(1) = 0$ is that C(1) has reduced rank. There must be at least one row representing the long run that can be solved from the other long run relations in C(1). If C(1) has reduced rank, there can be several $\beta'$-vectors that lead to $\beta' C(1) = 0$. We can express this as follows: any vector lying in the null space of C(1) is a co-integrating vector, and the co-integrating rank is the dimension of this null space. Say that $y_t$ is a vector of n variables. If all variables are non-stationary, and integrated of order one, the whole system could expand in n different directions. If some or all variables share the same trend in the long run, the system would be expanding in fewer than n dimensions.
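A co-integrating vector can be found numerically as a basis of the (left) null space of a reduced rank long-run matrix. The sketch below assumes a hypothetical two-variable system with one common trend; the matrices `A` and `J` are made-up numbers, and the singular value decomposition is just one convenient way to obtain the null space.

```python
import numpy as np

A = np.array([[1.0], [1.0]])   # hypothetical loadings on one common trend
J = np.array([[0.7, 0.3]])     # hypothetical weights of the shocks in the trend
C1 = A @ J                     # long-run matrix C(1) = A J, rank 1

# Left null space of C(1): vectors beta with beta' C(1) = 0
U, s, Vt = np.linalg.svd(C1)
rank = int(np.sum(s > 1e-12))
beta = U[:, rank:]             # columns span the co-integrating space
beta = beta / beta[0, 0]       # normalise on the first variable

print(rank, beta.ravel(), beta.T @ C1)
```

Here the system has two variables and one common trend, so there is one co-integrating vector, and after normalisation it is the difference between the two variables.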

How should we understand the reduced rank of C(1) in economic terms? Think of consumption and income again. If both series are I(1), constantly growing in the long run, the difference between them should still be stationary in the long run. In other words they must have a common (stochastic) trend, which they both follow in the long run. This common trend could be understood as given by technological growth, which leads to growth in income and thereby also to a long-run growth in consumption. Another way of expressing the same thing is to say that the common trend represents the cumulation of past technology shocks.

Stock and Watson (1988) modeled the common trends representation of $y_t$ in the following way. Starting from $\Delta y_t = C(1)\epsilon_t + \Delta C^*(L)\epsilon_t$, the level of $y_t$ is determined by,

$$ y_t = y_0 + (1 + L + L^2 + \dots + L^{t-1})[C(1) + (1 - L)C^*(L)]\epsilon_t \tag{13.13} $$

$$ y_t = y_0 + C(1)(1 + L + L^2 + \dots + L^{t-1})\epsilon_t + C^*(L)\epsilon_t . \tag{13.14} $$

If we have cointegration, and therefore common trends, C(1) must be of reduced rank. The matrix C(1) can be thought of as consisting of two sub-matrices, such that $C(1) = AJ$, where J is defined through the common trends,

$$ \tau_t = \tau_{t-1} + J\epsilon_t = (1 + L + L^2 + \dots + L^{t-1})\,J\epsilon_t . \tag{13.15} $$

Setting the initial condition $\tau_0 = 0$, the level of $y_t$ is solved as

$$ y_t = y_0 + A\tau_t + C^*(L)\epsilon_t . \tag{13.16} $$

It can also be shown that, since $C(1) = AJ$, $\beta' A = 0$, which implies that the common trends cancel in the co-integrating relations $\beta' y_t$. Given the reduced rank of the C(1) matrix, we can talk about two types of shocks. The first type of shocks decline over time, so that the variables in the system return to their equilibrium relation. These shocks are driven by the co-integrating vectors. The second type of shocks are those which move the whole system over time without affecting the long-run equilibrium. These shocks are the common trends of the system.

Cointegration and common trends have interesting implications for econometric model building and inference on dynamic models. For an econometrician, however, the MA representation is not always the easiest way to approach the concept of cointegration and stationary long-run relations.

A DEEPER LOOK AT JOHANSEN'S TEST

Earlier we looked at the moving average representation of a vector of integrated variables. This took us to a definition of common trends in a system of variables with stochastic trends. For an economist it is usually more interesting to analyse a system in autoregressive format. By looking at the VAR representation we get a definition of cointegration or long-run steady state relations among the variables. Let the process $\{y_t\}$ be represented by the following k:th order vector autoregressive (VAR) model, consisting of p variables,

$$ A(L) y_t = \Psi D_t + \epsilon_t , \tag{14.1} $$

where $D_t$ is a vector of deterministic variables, including dummies and constants. The error term is $\epsilon_t \sim NID(0, \Sigma)$, and $A(L) = \sum_{j=0}^{k} A_j L^j$ where $A_0 = I$. Thus, we are assuming that the process is multivariate normal, $y_t \sim NID(\mu_t, \Sigma)$, with conditional mean $\mu_t = -A_1 y_{t-1} - \dots - A_k y_{t-k} + \Psi D_t$, and positive definite error covariance matrix $\Sigma$.

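The system can be rewritten in error correction form, using the definition of the difference operator, $\Delta y_t \equiv y_t - y_{t-1}$,

$$ \Delta y_t = \sum_{i=1}^{k-1} \Gamma_i \Delta y_{t-i} + \Pi y_{t-k} + \Psi D_t + \epsilon_t , \tag{14.2} $$

where

$$ \Gamma_i = -\Big(I + \sum_{j=1}^{i} A_j\Big), \tag{14.3} $$

and

$$ \Pi = -\Big(I + \sum_{j=1}^{k} A_j\Big) = -A(1). \tag{14.4} $$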
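As a check of the rewriting, consider the case k = 2 under the assumed sign convention $A(L) = I + A_1 L + A_2 L^2$, so that the model reads $y_t + A_1 y_{t-1} + A_2 y_{t-2} = \Psi D_t + \epsilon_t$. The worked steps below are a sketch for this special case only, with $\Gamma_1 = -(I + A_1)$ and $\Pi = -(I + A_1 + A_2)$:

$$
\begin{aligned}
\Gamma_1 \Delta y_{t-1} + \Pi y_{t-2}
 &= -(I + A_1)(y_{t-1} - y_{t-2}) - (I + A_1 + A_2)\,y_{t-2} \\
 &= -(I + A_1)\,y_{t-1} - A_2\,y_{t-2} , \\
\Delta y_t = \Gamma_1 \Delta y_{t-1} + \Pi y_{t-2} + \Psi D_t + \epsilon_t
 \;&\Longleftrightarrow\;
 y_t = -A_1 y_{t-1} - A_2 y_{t-2} + \Psi D_t + \epsilon_t .
\end{aligned}
$$

The last line is the original VAR, so nothing has been added to or removed from the model; the lags have only been regrouped into differences and one level term.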

Notice that in this example the system was rewritten such that the variables in levels ($y_{t-k}$) ended up at the k:th lag. As an alternative it is possible to rewrite the system such that the levels enter the expression at the first lag, followed by k - 1 lags of $\Delta y_{t-i}$. The two ways of rewriting the system are equivalent. The preferred form depends on one's preferences.

Since $y_t$ is integrated of order one and $\Delta y_t$ is stationary, it follows that there can be at most p - 1 steady state relationships between the non-stationary variables in $y_t$. Hence, p - 1 is the largest possible number of linearly independent rows in the $\Pi$-matrix. The latter is determined by the number of significant eigenvalues in the estimated matrix $\hat{\Pi} = -\hat{A}(1)$. Let r be the rank of $\Pi$; then rank($\Pi$) = 0 implies that there are no combinations of variables that lead to stationarity. In other words, there is no cointegration. If we have rank($\Pi$) = p, the $\Pi$ matrix is said to have full rank, and all variables in $y_t$ must be stationary. Finally, reduced rank, 0 < r < p, means that there are r co-integrating vectors in the system.

Once a reduced rank has been determined, the matrix can be written as $\Pi = \alpha\beta'$, where $\beta' y_t$ represents the vectors of co-integrating relations, and $\alpha$ is a matrix of adjustment coefficients measuring the strength by which each co-integrating vector affects an element of $\Delta y_t$. Whether the co-integrating vectors $\beta' y_t$ are referred to as error correction mechanisms, long-run steady-state solutions or desired values is a question of how one views the underlying economic mechanisms.

Given estimates of the eigenvalues of $\Pi$, $\alpha$ and $\beta$, it becomes possible to impose various restrictions on the parameter vectors to test homogeneity conditions in the $\beta$-vectors, how $\beta' y_t$ affects $\Delta y_t$, or more general hypotheses regarding which combinations of variables form stationary vectors. The tests are performed by comparing changes in the estimated eigenvalues from the unrestricted reduced rank estimate of $\Pi$ with the outcome of a restricted estimation.

In Johansen (1988) it is shown how to estimate the $\alpha$ and the $\beta$ vectors in the $\Pi$ matrix, given that the latter has reduced rank. The solution starts from conditioning out the short-run dynamics, as well as the effects of the dummy variables, on $\Delta y_t$ and $y_{t-k}$ respectively,

$$ \Delta y_t = \sum_{i=1}^{k-1} \Gamma_{1,i} \Delta y_{t-i} + \Psi_1 D_t + R_{1t} , \tag{14.5} $$

$$ y_{t-k} = \sum_{i=1}^{k-1} \Gamma_{2,i} \Delta y_{t-i} + \Psi_2 D_t + R_{kt} . \tag{14.6} $$

The system in 14.1 can now be written in terms of the residuals above as,

$$ R_{1t} = \alpha\beta' R_{kt} + e_t . \tag{14.7} $$

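The vectors $\alpha$ and $\beta$ can now be estimated by forming the product moment matrices $S_{11}$, $S_{kk}$ and $S_{1k}$ from the residuals $R_{1t}$ and $R_{kt}$,

$$ S_{ij} = T^{-1}\sum_{t=1}^{T} R_{it} R_{jt}' , \quad i, j = 1, k . \tag{14.8} $$

For fixed $\beta$ vectors, $\alpha$ is given by $\hat{\alpha}(\beta) = S_{1k}\beta(\beta' S_{kk}\beta)^{-1}$, and the sum of squares function by $\hat{\Sigma}(\beta) = S_{11} - \hat{\alpha}(\beta)(\beta' S_{kk}\beta)\hat{\alpha}(\beta)'$. Minimizing this sum of squares function leads to maximum likelihood estimates of $\alpha$ and $\beta$. The estimates of $\beta$ are found after solving the eigenvalue problem,

$$ |\lambda S_{kk} - S_{k1} S_{11}^{-1} S_{1k}| = 0 , \tag{14.9} $$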

where $\lambda$ is a vector of eigenvalues. The solution leads to estimates of the eigenvalues $(\hat{\lambda}_1, \hat{\lambda}_2, \dots, \hat{\lambda}_p)$, and the corresponding eigenvectors $\hat{V} = (\hat{v}_1, \hat{v}_2, \dots, \hat{v}_p)$, normalized around the product moment matrix of the residuals $R_{kt}$ such that $V' S_{kk} V = I$. The size of the eigenvalues ($\lambda_i$) tells us how much each linear combination of eigenvectors and variables, $v_i' y_t$, is correlated with the conditional process $R_{1t}$ ($\Delta y_t \mid \Delta y_{t-i}, D$). The number of non-zero eigenvalues (r) determines the rank of $\Pi$ and leads to the co-integrating vectors of the system, while the number of zero eigenvalues (p - r) defines the common trends in the system. These are the combinations of $v_i' y_t$ that determine the directions in which the process is non-stationary. Given that 14.1 is a well-defined statistical model, it is possible to determine the distribution of the estimated eigenvalues under different assumptions about the number of co-integrating vectors in the model. The distributions of the eigenvalues depend not only on 14.1 being a well-defined statistical model, but also on the number of variables, the inclusion of constant terms in the co-integrating vectors and deterministic trends in the equations. Distributions for different models are tabulated in Johansen (1995).
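The eigenvalue problem can be sketched with plain NumPy for the simplest case: a simulated two-variable system with one lag, where the only term conditioned out is a constant. Everything in the block is an assumption made for illustration (the function name `johansen_eigen`, the simulated parameters, the demeaning step); it is not the routine of any particular econometrics package, and it does not supply critical values.

```python
import numpy as np

def johansen_eigen(y):
    """Eigenvalues and eigenvectors of the reduced rank problem (one lag, constant)."""
    dy = np.diff(y, axis=0)           # Delta y_t
    ylag = y[:-1]                     # lagged levels
    R1 = dy - dy.mean(axis=0)         # residuals after conditioning on a constant
    Rk = ylag - ylag.mean(axis=0)
    T = len(R1)
    S11 = R1.T @ R1 / T
    Skk = Rk.T @ Rk / T
    S1k = R1.T @ Rk / T
    # |lambda Skk - Sk1 S11^{-1} S1k| = 0 as an ordinary eigenvalue problem
    M = np.linalg.solve(Skk, S1k.T @ np.linalg.solve(S11, S1k))
    lam, V = np.linalg.eig(M)
    order = np.argsort(lam.real)[::-1]
    lam, V = lam.real[order], V.real[:, order]
    # normalise the eigenvectors so that V' Skk V = I
    V = V / np.sqrt(np.einsum("ij,jk,ki->i", V.T, Skk, V))
    return lam, V, (S11, Skk, S1k)

# Simulated system with one co-integrating vector beta = (1, -1)'
rng = np.random.default_rng(42)
alpha = np.array([[-0.4], [0.1]])
beta = np.array([[1.0], [-1.0]])
Pi = alpha @ beta.T
T = 400
y = np.zeros((T, 2))
for t in range(1, T):
    y[t] = y[t - 1] + Pi @ y[t - 1] + rng.standard_normal(2)

lam, V, (S11, Skk, S1k) = johansen_eigen(y)
beta_hat = V[:, 0] / V[0, 0]          # first eigenvector, normalised on y_1
print(lam, beta_hat)
```

In this simulation one eigenvalue is clearly away from zero and the other is close to zero, and the first eigenvector, once normalised, is close to the vector used to generate the data.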


The maximized log likelihood, conditional on the short run dynamics and the deterministic variables of the model, is

$$ \ln L = \text{constant} - (T/2)\ln|S_{11}| - (T/2)\sum_{i=1}^{r}\ln(1 - \hat{\lambda}_i). \tag{14.10} $$

From this expression two likelihood ratio tests for determining the number of non-zero eigenvalues are formulated. The first test concerns the hypothesis that the number of non-zero eigenvalues is less than or equal to some given number (q), such that $H_0 : r \le q$, against an unrestricted model where $H_1 : r \le p$. The test is given by,

$$ -2\ln(Q; q \mid p) = -T\sum_{i=q+1}^{p}\ln(1 - \hat{\lambda}_i). \tag{14.11} $$

The second test is used for the hypothesis that the number of non-zero eigenvalues is less than or equal to the number tested in the previous hypothesis, $H_0 : r \le q$, against $H_1 : r \le q + 1$, and is given by,

$$ -2\ln(Q; q \mid q+1) = -T\ln(1 - \hat{\lambda}_{q+1}). \tag{14.12} $$

The distributions of the test statistics are non-standard and depend on the number of variables in the system (p), and on the presence of trends and constant terms.
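Both statistics are simple functions of the ordered eigenvalues. The sketch below uses hypothetical eigenvalues and a hypothetical sample size; the function names are made up for this example, and the values must still be compared with the tabulated non-standard critical values.

```python
import math

def trace_stat(eigs, T, q):
    """-T * sum of ln(1 - lambda_i) for i = q+1, ..., p (eigs sorted descending)."""
    return -T * sum(math.log(1.0 - lam) for lam in eigs[q:])

def max_eig_stat(eigs, T, q):
    """-T * ln(1 - lambda_{q+1})."""
    return -T * math.log(1.0 - eigs[q])

eigs = [0.31, 0.012]   # hypothetical estimated eigenvalues, largest first
T = 400                # hypothetical number of observations

for q in range(len(eigs)):
    print(q, trace_stat(eigs, T, q), max_eig_stat(eigs, T, q))
```

For the last hypothesis (q = p - 1) the two statistics coincide, and the first statistic for q = 0 is the sum of all the second-type statistics.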

The estimates of $\beta$ are given by the eigenvectors corresponding to the r non-zero eigenvalues $\hat{\lambda}_i$, such that $\hat{\beta} = (\hat{v}_1, \hat{v}_2, \dots, \hat{v}_r)$. Based on $\hat{\beta}$ the $\alpha$-vectors can be solved by,

$$ \hat{\alpha} = S_{1k}\hat{\beta}(\hat{\beta}' S_{kk}\hat{\beta})^{-1}. \tag{14.13} $$

The estimated matrix $\Pi = \alpha\beta'$ is not identified, in the sense that we can pick any non-singular matrix M (r x r), so that $\alpha M\,(\beta (M')^{-1})' = \alpha\beta' = \Pi$. There is no unique solution for the co-integrating vectors. The identification, which gives the co-integrating vectors their economic meaning, is something that the econometrician must impose on the estimates, first by normalizing each $\beta$-vector around a variable, and then by testing different assumptions about the vector. By looking at the signs and relative sizes of the estimated parameters, it is in general possible to find an appropriate normalization of the $\beta$-vectors such that the outcome can be understood in terms of error correction mechanisms or long-run equilibrium relationships between economic variables. Assumptions concerning the sizes and relative signs of the parameters can be tested by comparing an unrestricted maximization with one where the restrictions have been imposed.
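The identification problem is easy to see numerically: rotating the adjustment and co-integrating matrices with any non-singular matrix leaves their product unchanged. The matrices below are hypothetical numbers for a system with three variables and two co-integrating vectors; the normalisation shown (an identity block on the first two variables) is one common choice, not the only one.

```python
import numpy as np

# Hypothetical alpha and beta, p = 3 variables and r = 2 co-integrating vectors
alpha = np.array([[-0.4, 0.1],
                  [0.1, -0.3],
                  [0.0, 0.2]])
beta = np.array([[1.0, 0.5],
                 [-1.0, 1.0],
                 [0.3, -2.0]])
Pi = alpha @ beta.T

# Any non-singular r x r matrix M gives an observationally equivalent pair
M = np.array([[2.0, 1.0],
              [0.5, 3.0]])
alpha_star = alpha @ M
beta_star = beta @ np.linalg.inv(M).T
same_pi = np.allclose(alpha_star @ beta_star.T, Pi)

# One possible normalisation: make the top r x r block of beta an identity matrix
top = beta[:2, :]
beta_norm = beta @ np.linalg.inv(top)
alpha_norm = alpha @ top.T
print(same_pi, beta_norm)
```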

Furthermore, to rule out the cases where $y_t$ is integrated of order 2, we must require that the matrix $\alpha_\perp' \Gamma \beta_\perp$ has full rank, where $\Gamma$ is the mean lag matrix of the VAR evaluated at unity, and $\alpha_\perp$ and $\beta_\perp$ are the orthogonal matrices to $\alpha$ and $\beta$ such that $\alpha'\alpha_\perp = \beta'\beta_\perp = 0$.

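The system in 14.1 also has a moving average form. Writing the constant term $\mu$ separately from the remaining deterministic variables in $D_t$, it is given by,

$$ \Delta y_t = C(L)(\epsilon_t + \mu + \Psi D_t). \tag{14.14} $$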

(L) t ;

(14.15)

zt =

C(1) t +

Since C(L) can be expanded as $C(L) = C(1) + (1 - L)C^*(L)$, when $y_t$ is integrated of order one we get,

$$ y_t = y_0 + C(1)\sum_{i=1}^{t}\epsilon_i + C(1)\mu t + C(1)\Psi\sum_{i=1}^{t} D_i + C^*(L)(\epsilon_t + \Psi D_t), \tag{14.16} $$

where $\mu$ is the constant term of the model. The expression shows how the non-stationary part of $y_t$ is generated from the underlying stochastic and deterministic trends. The link between the MA and the autoregressive form is shown in Johansen (1991), and is given by1

$$ C(1) = \beta_\perp(\alpha_\perp' \Gamma \beta_\perp)^{-1}\alpha_\perp' . \tag{14.17} $$

Equation 14.17 can be used to estimate C(1) from given estimates of $\alpha$ and $\beta$. But, since the error terms ($\epsilon_{i,t}$) in the reduced form are correlated, the estimate of C(1) is not invariant to different ways of conditioning on current variables ($\Delta y_t$). Given this limitation and the assumption that the driving trends should not be affected by the equilibrium forces, the common trends in the system are represented by $\alpha_\perp' y_t$ or alternatively by $\alpha_\perp' \sum \epsilon_{it}$, see Juselius (1992).

The test procedure can be extended to incorporate variables integrated of order

2 as well. With both I(1) and I(2) processes in the system, two new co-integrating

relations are possible. There can be combinations of I(2) variables forming stationary I(0) vectors, or I(2) variables forming non-stationary I(1) vectors which

in turn cointegrate with I(1) variables to form stationary vectors.

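The error correction system in 14.1 can be written as

$$ \Delta^2 y_t = \Gamma \Delta y_{t-1} + \Pi y_{t-2} + \sum_{i=1}^{k-2} \Phi_i \Delta^2 y_{t-i} + \Psi D_t + \epsilon_t , \tag{14.18} $$

where $\Gamma = \sum_{i=1}^{k-1}\Gamma_i - I + \Pi$, $\Phi_i = -\sum_{j=i+1}^{k-1}\Gamma_j$, and $i = 1, \dots, k - 2$, with the $\Gamma_i$ taken from the error correction form in which the levels enter at the first lag.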

If $y_t$ is I(2) and $\Delta y_t$ is I(1), a reduced rank condition for the $\Pi$ matrix must be combined with a reduced rank condition for the matrix of first differences as well. Johansen (1991) shows that the condition for an I(2) process is

$$ \alpha_\perp' \Gamma \beta_\perp = \varphi\eta' , \tag{14.19} $$

where $\varphi$ and $\eta$ are (p - r) x s, with rank s. With I(2) variables $\beta' y_t$ is I(1). To make these vectors stationary they have to be combined with the vectors of first differences ($\beta_\perp^{2\prime}\Delta y_t$) to form stationary processes. In the latter expression $\beta_\perp^2$ is the squared orthogonal vectors, and $\kappa = (\alpha'\alpha)^{-1}\alpha'\Gamma\beta_\perp^2(\beta_\perp^{2\prime}\beta_\perp^2)^{-1}$ gives the weights in this combination. The squared orthogonal vectors indicate which variables are I(2).

An I(2) model is estimated in a way similar to the I(1) model. Maximum likelihood estimation is feasible since the residual terms of an I(2) model can be assumed to be a Gaussian process. The first step is to perform a reduced rank regression for the I(1) model of $\Delta y_t$ on $y_{t-1}$, corrected for the short run dynamics ($\Delta y_{t-1}, \dots, \Delta y_{t-k+1}$) and the deterministic components ($\Psi D_t$). This leads to estimates of $\hat{r}$, $\hat{\alpha}$ and $\hat{\beta}$.

In the second step, given the estimates of $\hat{r}$, $\hat{\alpha}$ and $\hat{\beta}$, a reduced rank test is performed of $\hat{\alpha}_\perp' \Delta^2 y_t$ on $\hat{\beta}_\perp' \Delta y_{t-1}$, corrected for $\Delta^2 y_{t-1}, \dots, \Delta^2 y_{t-k+2}$, and the constant terms. This leads to the estimates $\hat{s}$, $\hat{\varphi}$, and $\hat{\eta}$.

An I(2) process is harder to analyze in economic terms since the parameters and the test hypotheses have different interpretations. The tests concerning the $\beta$ vectors can still be applied, but they are in general only valid for I(1) processes. It is, however, possible to form stationary relations by combining levels ($\hat{\beta}' y_t$) with first difference expressions ($\hat{\beta}_\perp^{2\prime}\Delta y_t$). The practical solution is to identify the I(2) terms and find ways of transforming them to I(1) relations. The transformation to an I(1) system can be done by taking first differences of I(2) variables or by taking ratios of variables, for example modeling the real money stock rather than the money stock and the price level separately.

1 An orthogonal vector is often indicated by the sign $\perp$ attached to the original vector. The vector $\beta_\perp$ is the orthogonal vector to the vector $\beta$ if $\beta_\perp'\beta = 0$.

THE ESTIMATION OF DYNAMIC MODELS

(To be completed...)

The modelling of stochastic difference models introduces some problems which clearly violate the assumptions behind the classical linear regression model. With some care most of these problems can be solved. The most important factors are whether the data series are stationary, and whether the residuals are white noise. As long as the variables are stationary and the residual is a white noise process, OLS estimation is generally feasible. Autocorrelated residuals, however, mean that the OLS estimator is no longer consistent in a model with lagged dependent variables. In this situation the model must either be re-specified, or the whole model including the autoregressive process in the residuals must be estimated by maximum likelihood.

To understand the differences between the estimation of stochastic difference equations and the classical linear regression model, we will introduce these differences step by step. In all there are 6 models of interest here,

1 The classical linear regression model.

2 Regression with deterministic trends.

3 Models with stochastic explanatory variables.

4 Autoregressive models, lagged dependent variable.

5 Autoregressive models with integrated variables (Testing for unit roots).

6 Regression models with integrated variables (Spurious regression and cointegration).

The following sections do not present any rigorous proofs concerning the properties of the OLS estimator. The aim is only to review known problems and

introduce some new ones.

(The Classical Linear Regression model)

Starting with $y_t = \beta x_t + \epsilon_t$, the matrix form of this model is

$$ y = X\beta + \epsilon , \tag{15.1} $$

where y is the vector of observations on the dependent variable, X contains the observations on the explanatory variable, $\beta$ the parameters and $\epsilon$ a vector of residuals of the same dimension as y. (To keep the example simple, $\beta$ is one parameter, but the example could be extended to a multivariate case.) The classical case builds upon four assumptions. First, the model is linear, or log linear, in variables. Second, the residuals are independent, have a mean of zero and a finite variance,

$$ E(\epsilon) = 0 ; \quad Var(\epsilon) = \sigma^2 I . \tag{15.2} $$

The model is assumed to be specified in such a way that the expected value of the residuals is zero.

Third, the explanatory variables are non-stochastic and therefore independent of the errors,

$$ E(X'\epsilon) = 0 . \tag{15.3} $$

Finally, the explanatory variables are linearly independent such that

$$ rank(X'X) = rank(X) = k , \tag{15.4} $$

which ensures that the inverse of $(X'X)$ exists. Minimizing the sum of squared residuals leads to the following OLS estimator of $\beta$,

$$ \hat{\beta} = (X'X)^{-1}(X'y) = \beta + (X'X)^{-1}(X'\epsilon) . \tag{15.5} $$

(xt ), we have for a sample of T observations,

^=

"

T

1X 2

xt

+

T

t=1

"

T

1X

xt

T

t=1

(15.6)

The estimated parameter $\hat\beta$ is equal to its true value plus an additional term. For the estimate to be unbiased the expectation of this last term must be zero. If we assume that the $x$'s are deterministic the problem is relatively easy. A correct specification of the model, $E(\epsilon) = 0$, leads to the result that $\hat\beta$ is unbiased.

The parameter has the variance,

$$Var(\hat\beta) = E[(\hat\beta - \beta)(\hat\beta - \beta)'] = (X'X)^{-1}X'(\sigma^2 I)X(X'X)^{-1} = \sigma^2 (X'X)^{-1}. \tag{15.7}$$

Taking expectations of (15.5) shows that $\hat\beta$ is an unbiased estimate of $\beta$,¹

$$E(\hat\beta) = E(\beta) + E[(X'X)^{-1}(X'\epsilon)] = \beta + E[(X'X)^{-1}]E(X'\epsilon) = \beta, \tag{15.8}$$

if the residuals have a zero mean. Thus, under these assumptions OLS is unbiased and also consistent (not proven here). Consistency implies that $Var(\hat\beta)$ tends to zero as $T \to \infty$. The problem with assuming that the $x$'s are non-stochastic is of course that it is an unrealistic assumption in a time series setting. Typically the explanatory variables are as stochastic as the dependent variable.

So far we have not made any statements about the distribution of the estimates. OLS has the advantage that it leads to unbiased and efficient estimates under quite general assumptions. However, to make any inference on $\hat\beta$, we need to make assumptions about its distribution. In most cases the assumption of a normal distribution is reasonable, at least asymptotically, or a reasonable approximation in a limited sample, leading to

$$(\hat\beta - \beta) \sim N(0, \sigma^2 (X'X)^{-1}). \tag{15.9}$$

Since the error in $(\hat\beta - \beta)$ is a white noise process, we know that the estimate converges to its true value with the speed given by the standard error of a white noise process, $1/T^{1/2}$.

1 The expectation of an expectation is equal to the expectation, $E[E(\hat\beta)] = E(\hat\beta)$, and the expectation of a constant is equal to the constant. Since the true parameter can be treated as a constant, we have $E(\beta) = \beta$.


A situation when the assumption of deterministic explanatory variables can make sense in a time series setting is when the dependent variable is driven by a deterministic trend.² Suppose that the explanatory variable is a deterministic time trend,

$$y_t = \alpha + \beta t + \epsilon_t, \tag{15.10}$$

where $t$ is a time trend, $t = 1, 2, 3, \ldots, T$, without stochastic variation. If the time trend is adjusted for its mean, $\tilde t = (t - \bar t)$, the constant term ($\alpha$) will measure the unconditional mean of $y_t$. Under the assumption that $y_t$ has a sufficiently large deterministic trend component, relative to the sample size, the error terms from this regression can be understood as the detrended $y_t$ series. Assume that both $y_t$ and $t$ have been corrected for their means. OLS then leads to

$$\hat\beta = \beta + \left[\frac{1}{T}\sum_{t=1}^{T}\tilde t^{\,2}\right]^{-1}\left[\frac{1}{T}\sum_{t=1}^{T}\tilde t\,\epsilon_t\right]. \tag{15.11}$$

Taking expectations leads to the result that $\hat\beta$ is unbiased. The most important reason why this regression works well is that there is an additional $\tilde t$ variable in the denominator. As the sample grows and $\tilde t$ goes to infinity the denominator gets larger and larger compared to the numerator, so the ratio goes to zero much faster than otherwise.
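The faster convergence in the trend regression can be checked numerically. The following Python sketch (assuming NumPy; the function name `trend_slope_errors` and all parameter values are illustrative choices, not taken from the text) simulates the estimation error in the mean-adjusted trend regression. With iid errors of variance one, the standard deviation of the error should shrink roughly like $\sqrt{12}/T^{3/2}$, rather than at the usual $1/T^{1/2}$ rate.

```python
import numpy as np

def trend_slope_errors(T, reps=2000, sigma=1.0, seed=0):
    """Simulated OLS errors (beta_hat - beta) for y_t = alpha + beta*t + eps_t."""
    rng = np.random.default_rng(seed)
    t = np.arange(1, T + 1, dtype=float)
    t_tilde = t - t.mean()                     # mean-adjusted trend
    eps = rng.normal(0.0, sigma, size=(reps, T))
    # After demeaning: beta_hat - beta = sum(t_tilde * eps) / sum(t_tilde^2)
    return eps @ t_tilde / (t_tilde @ t_tilde)

for T in (50, 200, 800):
    err = trend_slope_errors(T)
    # The last column should stay roughly constant, near sqrt(12)
    print(T, err.std(), err.std() * T**1.5)
```

Quadrupling the sample size cuts the standard deviation by a factor of about eight here, compared with a factor of two for a stationary regressor.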

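Applying OLS to time series data introduces the problem of stochastic explanatory variables. The explanatory variables can be stochastic on their own, and lags of the dependent variable imply stochastic regressors. Let the model be,

$$y_t = \beta x_t + \epsilon_t. \tag{15.12}$$

The OLS estimator leads to

$$\hat\beta = \beta + \left[\sum_{t=1}^{T} x_t^2\right]^{-1}\left[\sum_{t=1}^{T} x_t\epsilon_t\right]. \tag{15.13}$$

To find the expected value of $\hat\beta$ we need to evaluate

$$E\left[\sum_{t=1}^{T} x_t^2\right]^{-1} \tag{15.14}$$

for the first factor and

$$E\left[\sum_{t=1}^{T} x_t\epsilon_t\right] \tag{15.15}$$

for the second. In the deterministic case we could use the facts that $[X'X]$ is a constant and that $E(x_t\epsilon_t) = x_tE(\epsilon_t) = 0$. Here $x_t$ is a random variable, so additional assumptions must be made for the OLS estimator.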

2 Other realistic examples in economics are deterministic dummy variables and deterministic

seasonal components.


The necessary conditions are that $\{x_t\}_1^T$ is a stationary process and that $\{x_t\}_1^T$ and $\{\epsilon_t\}_1^T$ are independent. The first condition means that we can view the covariance matrix $(X'X)$ as fixed in repeated samples. In a time series perspective we cannot generally talk about repeated samples; instead we have to look at the sample moments as $T \to \infty$. If $x_t$ is a stationary variable then we can state that, as $T \to \infty$, the covariance matrix will become a constant. This can be written as,

$$\frac{1}{T}\sum_{t=1}^{T} x_t x_t' \to_p Q. \tag{15.16}$$

An alternative way to show the properties of OLS in the case of stochastic explanatory variables is to use the probability limit operator (plim), $\operatorname{plim}\,[(1/T)X'X] = Q$. A convenient property of the plim operator is that $\operatorname{plim}(x^{-1}) = [\operatorname{plim}(x)]^{-1}$. Here, it remains to look at the numerator in the OLS expression. If $\{x_t\}_1^T$ and $\{\epsilon_t\}_1^T$ are generated by two independent stochastic processes we have, for each pair of observations, that $E(x_t\epsilon_t) = E(x_t)E(\epsilon_t)$. It can then be shown that

$$\frac{1}{T}\sum_{t=1}^{T} x_t\epsilon_t \to_p 0, \tag{15.17}$$

or, alternatively,

$$\operatorname{plim}\,[(1/T)X'\epsilon] = 0. \tag{15.18}$$

The intuition behind this result is that, because $\epsilon_t$ is zero on average, we are multiplying $x_t$ with zero. It follows then that the average of $(x_t\epsilon_t)$ will be zero.

The practical implication is that given a sufficiently large sample the OLS estimator will be unbiased, efficient and consistent even when the explanatory variables are stochastic variables. If $\epsilon_t \sim NID(0, \sigma^2)$, we also have, conditional on the stochastic process $\{x_t\}_1^T$, that the estimated $\beta$ is distributed as,

$$\hat\beta \mid x_t \sim N[\beta, \sigma^2 (X'X)^{-1}], \tag{15.19}$$

and

$$(\hat\beta - \beta) \mid x_t \sim N(0, \sigma^2 (X'X)^{-1}), \tag{15.20}$$

so that standard distributions can be used for inference.

The example can be extended by two assumptions. First, let the residuals be $\epsilon_t \sim iid(0, \sigma^2)$; they are independent and identically distributed as before, but not necessarily normal. Second, let the process $\{x_t\}_1^T$ be only covariance stationary in the long run, allowing the sample covariance to vary with time in a limited sample, $E(X'X) = (1/T)\sum_{t=1}^{T} x_t x_t' = Q_t$. The processes $\{x_t\}_1^T$ and $\{\epsilon_t\}_1^T$ are independent as above. Under these conditions the estimated $\beta$ is,

$$\hat\beta_t = \beta + Q_t^{-1}\left[(1/T)\sum_{t=1}^{T} x_t\epsilon_t\right]. \tag{15.21}$$

The estimated $\hat\beta_t$ can vary with $t$ since $Q_t$ varies with time. To establish that OLS is a consistent estimator we need to establish that

$$\left[(1/T)\sum_{t=1}^{T} x_t x_t'\right]^{-1} = Q_t^{-1} \to_p Q^{-1}. \tag{15.22}$$


The condition holds if $\{x_t\}_1^T$ is covariance stationary: as $T$ goes to infinity the estimate will converge in probability ($\to_p$) to a constant. The second condition is that the average $(1/T)\sum_{t=1}^{T} x_t\epsilon_t$ converges in probability to zero, which takes place whenever $x_t$ and $\epsilon_t$ are independent. The error process is iid, but not necessarily normal. Under the conditions given here, the central limit theorem is sufficient to establish that the sequence $\{\sum_{t=1}^{T} x_t\epsilon_t\}$, suitably scaled, converges (weakly in distribution) to a normal distribution,

$$\left[(1/T^{1/2})\sum_{t=1}^{T} x_t\epsilon_t\right] \to_d N(0, \sigma^2 Q), \tag{15.23}$$

so that $(\hat\beta_t - \beta)$ is asymptotically distributed as

$$(\hat\beta_t - \beta) \sim N(0, \sigma^2 Q^{-1}/T). \tag{15.24}$$

This result is necessary for using $t$, $\chi^2$ and $F$-distributions for inference on $\hat\beta$ and $\hat\sigma^2$.

To see how the last result works, recall the central limit theorem (CLT). The CLT states that the sample mean $\bar z_T$ of an iid process, suitably scaled, converges weakly to a normally distributed variable as the sample size $T$ increases, so for the sequence

$$T^{1/2}(\bar z_T - \mu) \Rightarrow N(0, \sigma^2), \tag{15.25}$$

where $\mu$ and $\sigma^2$ are the mean and the variance of the process. The OLS estimator can be written as

$$(\hat\beta_t - \beta) = \left[(1/T)\sum_{t=1}^{T} x_t x_t'\right]^{-1}\left[(1/T)\sum_{t=1}^{T} x_t\epsilon_t\right]. \tag{15.26}$$

Since $(1/T) = (1/T^{1/2})(1/T^{1/2})$ the CLT can be invoked by rewriting the expression as,

$$T^{1/2}(\hat\beta_t - \beta) = \left[(1/T)\sum_{t=1}^{T} x_t x_t'\right]^{-1}\left[(1/T^{1/2})\sum_{t=1}^{T} x_t\epsilon_t\right], \tag{15.27}$$

where the LHS and the numerator on the RHS correspond to the CLT. From the numerator on the RHS we get, as $T$ goes to infinity,

$$\left[(1/T^{1/2})\sum_{t=1}^{T} x_t\epsilon_t\right] \Rightarrow N(0, \sigma^2 Q). \tag{15.28}$$

Moreover, we can also conclude that the rate of convergence is given by $(1/T^{1/2})$. Dividing both sides of the expression for the OLS estimation error by $(1/T^{1/2})$ leaves one factor $(1/T^{1/2})$ scaling the sum on the RHS, which then represents the speed by which the estimate $\hat\beta_t$ converges to its true value $\beta$.
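The $T^{1/2}$ rate can be illustrated with a small Monte Carlo experiment. The sketch below is only an illustration under assumed settings (NumPy; a standard normal regressor so that $Q = 1$; uniform errors with unit variance to stress that normality of the errors is not required; `ols_errors` is a hypothetical helper name). If the argument above holds, $T^{1/2}$ times the standard deviation of $(\hat\beta - \beta)$ should stay close to $(\sigma^2 Q^{-1})^{1/2} = 1$ as $T$ grows.

```python
import numpy as np

def ols_errors(T, reps=4000, seed=1):
    """Simulated (beta_hat - beta) for y_t = beta*x_t + eps_t with a stochastic x_t."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(reps, T))                 # stationary regressor, E(x^2) = Q = 1
    half_width = np.sqrt(3.0)                      # uniform on [-sqrt(3), sqrt(3)] has variance 1
    eps = rng.uniform(-half_width, half_width, size=(reps, T))   # iid, non-normal errors
    return (x * eps).sum(axis=1) / (x * x).sum(axis=1)

for T in (25, 100, 400):
    e = ols_errors(T)
    # Raw dispersion falls with T; the rescaled dispersion stays close to 1
    print(T, e.std(), np.sqrt(T) * e.std())
```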

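LAGGED DEPENDENT VARIABLES

Let us now turn to the AR(1) model,

$$y_t = \beta y_{t-1} + \epsilon_t, \tag{15.29}$$

where $\epsilon_t \sim iid(0, \sigma^2)$. (The estimation of AR(p) models follows from this example in a straightforward way.) The estimated $\beta$ is

$$\hat\beta = \left[T^{-1}\sum_{t=1}^{T} y_{t-1}^2\right]^{-1}\left[T^{-1}\sum_{t=1}^{T} y_{t-1}y_t\right], \tag{15.30}$$

leading to

$$\hat\beta = \beta + \left[T^{-1}\sum_{t=1}^{T} y_{t-1}^2\right]^{-1}\left[T^{-1}\sum_{t=1}^{T} y_{t-1}\epsilon_t\right]. \tag{15.31}$$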

This is similar to the stochastic regressor case, but here $\{y_{t-1}\}$ and $\{\epsilon_t\}$ cannot be assumed to be independent at all leads and lags, so the expectation of the last term in (15.31) does not factor into a product involving $E(\epsilon_t) = 0$, and $\hat\beta$ can be biased in a limited sample. The dependence can be explained as follows: $y_t$ is dependent on $\epsilon_t$, and $y_t$ is through the AR(1) process correlated with $y_{t+1}$, so $\epsilon_t$ is correlated with $y_{t+1}$ and with all later values of the regressor. The long-run covariance (lrcov) between $y_{t-1}$ and $\epsilon_t$ is defined as,

$$lrcov(y_{t-1}, \epsilon_t) = T^{-1}\sum_{t=1}^{T} y_{t-1}\epsilon_t + \sum_{k=1}^{\infty} E(y_{t-1}\epsilon_{t+k}) + \sum_{k=1}^{\infty} E(y_{t-1+k}\epsilon_t), \tag{15.32}$$

where the first term on the RHS is the sample estimate of the covariance, and the last two terms capture leads and lags in the cross correlation between $y_{t-1}$ and $\epsilon_t$. As long as $y_t$ is covariance stationary and $\epsilon_t$ is iid, the sample estimate of the covariance will converge to its true long-run value.

This dependence from $\epsilon_t$ to $y_{t+1}$ is not of major importance for estimation. Since $(y_{t-1}\epsilon_t)$ is still a martingale difference sequence w.r.t. the history of $y_t$ and $\epsilon_t$, we have that $E\{y_{t-1}\epsilon_t \mid y_{t-2}, y_{t-3}, \ldots, \epsilon_{t-1}, \epsilon_{t-2}, \ldots\} = 0$, so it can be established, in line with the CLT, that

$$\left[(1/T^{1/2})\sum_{t=1}^{T} y_{t-1}\epsilon_t\right] \Rightarrow N(0, \sigma^2 Q). \tag{15.33}$$

Using the same assumptions and notation as above, the variance is given by $E(y_{t-1}\epsilon_t\epsilon_t y_{t-1}) = E(\epsilon^2)E(y_{t-1}y_{t-1}) = \sigma^2 Q_t$. These results are sufficient to establish that OLS is a consistent estimator, though not necessarily unbiased in a limited sample. It follows that the distribution of the estimated $\beta$, and its rate of convergence, is as above. The results are the same for higher order stochastic difference models.
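The combination of small-sample bias and consistency can be seen in a simulation. The following is a minimal sketch under assumed settings (NumPy; true coefficient 0.9; start value $y_0 = 0$; `ar1_ols` is a hypothetical helper name). The average estimate should lie below the true value in short samples and approach it as $T$ grows.

```python
import numpy as np

def ar1_ols(T, beta=0.9, reps=5000, seed=2):
    """OLS estimates of beta in y_t = beta*y_{t-1} + eps_t over many simulated samples."""
    rng = np.random.default_rng(seed)
    eps = rng.normal(size=(reps, T + 1))
    y = np.zeros((reps, T + 1))                  # y_0 = 0 is an assumed start value
    for t in range(1, T + 1):
        y[:, t] = beta * y[:, t - 1] + eps[:, t]
    ylag, ycur = y[:, :-1], y[:, 1:]
    return (ylag * ycur).sum(axis=1) / (ylag * ylag).sum(axis=1)

for T in (25, 100, 400):
    est = ar1_ols(T)
    # The mean estimate is below 0.9 for small T and approaches 0.9 as T grows
    print(T, est.mean(), est.std())
```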

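In this section we look at the AR(1) model with an autoregressive residual process. Let the error process be,

$$\epsilon_t = \rho\epsilon_{t-1} + v_t, \tag{15.34}$$

where $v_t \sim iid(0, \sigma^2)$. The numerator of the OLS estimator then becomes

$$\begin{aligned}
\sum_{t=1}^{T} y_{t-1}\epsilon_t &= \sum_{t=1}^{T} [y_{t-1}(\rho\epsilon_{t-1} + v_t)] = \rho\sum_{t=1}^{T} y_{t-1}\epsilon_{t-1} + \sum_{t=1}^{T} y_{t-1}v_t \\
&= \rho\sum_{t=1}^{T} [y_{t-1}(y_{t-1} - \beta y_{t-2})] + \sum_{t=1}^{T} y_{t-1}v_t \\
&= \rho\sum_{t=1}^{T} y_{t-1}^2 - \rho\beta\sum_{t=1}^{T} y_{t-1}y_{t-2} + \sum_{t=1}^{T} y_{t-1}v_t.
\end{aligned} \tag{15.35}$$

Taking expectations gives

$$E\left[(1/T)\sum_{t=1}^{T} y_{t-1}\epsilon_t\right] = \rho\, var(y_t) - \rho\beta\, cov(y_{t-1}, y_{t-2}) + cov(y_{t-1}, v_t), \tag{15.36}$$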

which establishes that the OLS estimator is biased and inconsistent. Only the last covariance term can be assumed to go to zero as $T$ goes to infinity.

In this situation OLS is always inconsistent.⁵ Thus, the conclusion is that with a lagged dependent variable OLS is only feasible if there is no serial correlation in the residual. There are two solutions in this situation: to respecify the equation so the serial correlation is removed from the residual process, or to turn to an iterative ML estimation of the model $(y_t - \beta y_{t-1} - \rho\epsilon_{t-1} = v_t)$. The latter specification implies common factor restrictions, which if not tested is an ad hoc assumption. The approach was extremely popular in the late 70s and early 80s, when people used to rely on a priori assumptions in the form of adaptive expectations or costly adjustment, as examples, to derive their estimated models. Often economists started from a static formulation of the economic model and then added assumptions about expectations or adjustment costs. These assumptions could then lead to an infinite lag structure with white noise residuals. To estimate the model these authors called upon the so called Koyck transformation to reduce the model to a first order autoregressive stochastic difference model, with an assumed first order serially correlated residual term.
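The inconsistency can be made concrete with a simulation. In the sketch below (NumPy; the values $\beta = 0.5$ and $\rho = 0.4$ and the function name `ols_with_ar_errors` are illustrative assumptions), $y_t$ follows an AR(1) with AR(1) errors. In that setup $y_t$ is an AR(2) process, and the OLS estimate of $\beta$ converges to the first autocorrelation of $y_t$, $(\beta + \rho)/(1 + \beta\rho)$, rather than to $\beta$.

```python
import numpy as np

def ols_with_ar_errors(T, beta=0.5, rho=0.4, seed=3):
    """OLS estimate of beta when the residual follows eps_t = rho*eps_{t-1} + v_t."""
    rng = np.random.default_rng(seed)
    v = rng.normal(size=T)
    eps = np.zeros(T)
    y = np.zeros(T)
    for t in range(1, T):
        eps[t] = rho * eps[t - 1] + v[t]
        y[t] = beta * y[t - 1] + eps[t]
    return (y[:-1] @ y[1:]) / (y[:-1] @ y[:-1])

beta, rho = 0.5, 0.4
limit = (beta + rho) / (1 + beta * rho)     # probability limit of OLS, 0.75 here
for T in (200, 20000, 200000):
    # The estimate settles near 0.75, not near the true beta = 0.5
    print(T, ols_with_ar_errors(T), limit)
```

More data does not help: the estimate settles on the wrong value, which is the sense in which the text calls OLS inconsistent here.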

Observation

An additional problem is that of dependent observations. When we derived the estimators, in particular the MLE, we assumed that the observations are independent draws from the same distribution. A basic assumption is therefore violated, because the observations in a typical time series model are dependent. The AR(1) can serve as an example,

$$x_t = a_1 x_{t-1} + \epsilon_t, \qquad \epsilon_t \sim NID(0, \sigma^2). \tag{15.37}$$

In this model each observation of $x_t$ is dependent on the observation in the previous period. How does this affect the ML estimator? Suppose the sample only consists of two observations, $x_1$ and $x_2$. The joint density function for these two observations can be factorised as,

$$D(x_1, x_2) = D_1(x_2 \mid x_1)D_2(x_1). \tag{15.38}$$

For three observations the same factorisation gives

$$D(x_1, x_2, x_3) = D(x_3 \mid x_2, x_1)D(x_2, x_1) \tag{15.39}$$

$$= D(x_3 \mid x_2, x_1)D(x_2 \mid x_1)D(x_1). \tag{15.40}$$

With three observations, we have that the joint probability density function is equal to the density function of $x_3$, conditional on $x_2$ and $x_1$, multiplied by the conditional density for $x_2$, multiplied by the marginal density for $x_1$.

5 Asymptotically, though, the estimates have normal distributions, because the long-run bias converges to a constant while the error process $v_t$ converges to $NID(0, \sigma^2)$. This is a result of the CLT.

It follows that for a sample of $T$ observations, the likelihood function can be written as,

$$L(\theta; x) = \prod_{t=2}^{T} D(x_t \mid X_{t-1}; \theta)\, f(x_1). \tag{15.41}$$

Now, the AR(1) model implies that the conditional density function of $x_t$, $D(x_t \mid x_{t-1}, \ldots, x_1)$, is normal with mean $a_1x_{t-1}$ and variance $\sigma^2$. The log likelihood function is,

$$\log L(a_1, \sigma^2; x) = -[(T-1)/2]\log 2\pi - [(T-1)/2]\log\sigma^2 - (1/2\sigma^2)\sum_{t=2}^{T}(x_t - a_1x_{t-1})^2 + \log D(x_1).$$

This looks like the expression for the MLE derived earlier, with the exception of the last term, the log likelihood for the very first observation. By definition, the first observation here contains the initial conditions for the model, meaning everything that happened up to and including the first period of the sample. The question is, how do we get rid of this term?

A practical solution is to assume that $x_1$ can be treated as a fixed value in repeated realizations. (Compare with the stochastic regressor case in OLS.) In this case $\log f(x_1)$ can be seen as a constant which can be left out of the MLE because it will not affect the estimates of the parameters.

An alternative way is to assume that $x_t$ is stationary and normally distributed. The absolute value of $a_1$ will then be less than one. The unconditional normal distribution of $x_1$ is therefore known to have mean zero and variance $\sigma^2/(1 - a_1^2)$. The likelihood becomes,

$$\log L(a_1, \sigma^2; x) = -(T/2)\log 2\pi - (T/2)\log\sigma^2 + (1/2)\log(1 - a_1^2) - (1/2\sigma^2)(1 - a_1^2)x_1^2 - (1/2\sigma^2)\sum_{t=2}^{T}(x_t - a_1x_{t-1})^2,$$

where the term $\log D(x_1)$ has been written out explicitly. Unfortunately the first-order conditions of this log likelihood are no longer linear in $a_1$. The most convenient solution in this case is to drop the third and the fourth terms from the likelihood, with the argument that we are only dealing with one observation, so the asymptotic properties of the estimator should be unchanged. The conclusion would be the same if we assume that $x_1$ is fixed in repeated samples.
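The difference between conditioning on $x_1$ and modelling it can be checked numerically. The sketch below (NumPy; the function names, the grid search and the choice to hold $\sigma^2$ fixed at one are simplifying assumptions for illustration only) evaluates both log likelihoods over a grid of $a_1$ values. The conditional maximiser should coincide with the OLS estimate, and the exact maximiser should differ from it only slightly for a sample of this size.

```python
import numpy as np

def cond_loglik(a1, sigma2, x):
    """Log likelihood conditional on x_1 (the term log D(x_1) is dropped)."""
    T = len(x)
    resid = x[1:] - a1 * x[:-1]
    return (-(T - 1) / 2 * np.log(2 * np.pi)
            - (T - 1) / 2 * np.log(sigma2)
            - resid @ resid / (2 * sigma2))

def exact_loglik(a1, sigma2, x):
    """Adds the log density of x_1 ~ N(0, sigma2 / (1 - a1^2))."""
    first = (-0.5 * np.log(2 * np.pi) - 0.5 * np.log(sigma2)
             + 0.5 * np.log(1 - a1**2)
             - (1 - a1**2) * x[0]**2 / (2 * sigma2))
    return cond_loglik(a1, sigma2, x) + first

def simulate_ar1(T, a1, sigma2=1.0, seed=0):
    rng = np.random.default_rng(seed)
    x = np.empty(T)
    x[0] = rng.normal(0.0, np.sqrt(sigma2 / (1 - a1**2)))   # stationary start value
    for t in range(1, T):
        x[t] = a1 * x[t - 1] + rng.normal(0.0, np.sqrt(sigma2))
    return x

def grid_mle(loglik, x, sigma2=1.0):
    grid = np.linspace(-0.99, 0.99, 1981)                    # step 0.001, inside |a1| < 1
    values = [loglik(a, sigma2, x) for a in grid]
    return grid[int(np.argmax(values))]

x = simulate_ar1(500, a1=0.7)
a_cond = grid_mle(cond_loglik, x)
a_exact = grid_mle(exact_loglik, x)
a_ols = (x[:-1] @ x[1:]) / (x[:-1] @ x[:-1])
print(a_cond, a_exact, a_ols)
```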

Finally, the most difficult way of dealing with the situation is to use the sample information to estimate the initial conditions of $x_t$. This would be recommended if we are modeling non-stationary variables where the distribution of the initial value might differ to a large extent from the following observations. (An example of this can be found in Bergstrom (1989).)

ESTIMATION WITH INTEGRATED VARIABLES

(To be completed and extended)

In this section we investigate the problems of estimating integrated series. An integrated variable can be defined as,

A series ($x_t$) with no deterministic component and which has a stationary and invertible autoregressive moving average (ARMA) representation after differencing $d$ times, but which is not stationary after differencing only $d - 1$ times, is said to be integrated of order $d$, denoted $x_t \sim I(d)$. (Banerjee et al. (1993))

In many areas where time series techniques are applied integrated variables are rare exceptions, which are seldom interesting to analyse. In economics this is not the case: most macroeconomic time series appear to be integrated or nearly integrated series, see Nelson and Plosser (1982). Thus, the estimation and distribution of sample estimates are of great importance in economics, especially since regression with integrated variables often results in spurious correlations when standard distributions are used for inference.

The simplest example of an I(1) series is the random walk model $y_t = y_{t-1} + \epsilon_t$, where $\epsilon_t \sim NID(0, \sigma^2)$. Taking the first difference of this variable results in a stationary I(0) series according to the definition given above. If $y_t$ is generated as an integrated series, the main problem with estimating a random walk model,

$$y_t = \rho y_{t-1} + \epsilon_t, \tag{15.42}$$

is that the estimated $\rho$ does not follow a normal distribution, not even asymptotically. The problem here is not inconsistency, but the nonstandard distribution of the estimated parameters. This is clearly established in Fuller (1976), where the results from simulating the empirical distribution of the random walk model are presented. Fuller generated data series from a driftless random walk model, and estimated the following models,

a) $\Delta y_t = \pi y_{t-1} + \epsilon_t$,
b) $\Delta y_t = \alpha + \pi y_{t-1} + \epsilon_t$,
c) $\Delta y_t = \alpha + \beta(t - \bar t) + \pi y_{t-1} + \epsilon_t$,

where $\alpha$ is a constant and $(t - \bar t)$ a mean adjusted deterministic trend. These equations follow from the random walk model. The reason for setting up these three models is that the modeler will not know in practice that the data is generated by a driftless random walk. S/he will therefore add a constant (representing the deterministic growth trend in $y_t$) or a constant and trend. The models are easy to understand: simply subtract $y_{t-1}$ from both sides of the random walk model,

$$y_t - y_{t-1} = \rho y_{t-1} - y_{t-1} + \epsilon_t, \tag{15.43}$$

which leads to

$$\Delta y_t = (\rho - 1)y_{t-1} + \epsilon_t = \pi y_{t-1} + \epsilon_t. \tag{15.44}$$

Since the estimated $\pi$ does not follow a standard distribution when the series has a unit root, the conventional tables for the t-statistic cannot be used. This would not be a problem if $\rho$ equals, say, 0.99; then $|\rho| < 1$ and the series would be stationary, and its asymptotic behaviour would be like the AR(1) model above. Fuller's simulations of the empirical t distribution of the estimated $\pi$ in the three models showed that they did not converge to the normal distribution. With these results he established what is now known as the Dickey-Fuller distributions. Furthermore, the divergences compared to the normal distribution are huge. So, here is a case where the central limit theorem does not work.


The standard t-statistic for an infinitely large sample is, for a two sided test of $\pi \neq 0$, equal to 1.96 at the 5 % level. However, according to the simulations of Dickey and Fuller the appropriate value of the t-statistic in model (a) is 2.23 for an infinitely large sample. In an autoregressive model we know that the estimate of $\rho$ is biased downward. Thus, the alternative hypothesis in models (a) to (c) is that $\pi$ is less than zero. The associated asymptotic t-value for an estimate from a normal distribution is therefore -1.65. Dickey and Fuller established that the asymptotic critical values for one sided t-tests at the 5 % level in the models (a) to (c) are -1.95, -2.86 and -3.41 respectively. (See Fuller (1976), Table 3.2.1, page 373.) Notice that the critical values change depending on the parameters included in the empirical model. Also, the empirical distributions assume white noise residuals; if this is not the case, either the model or the test statistic must be adjusted.

Moreover, as long as $\rho = 1$ or $\pi = 0$ cannot be rejected, the estimated constant term in model b, as well as the constant and the trend in model c, also follow non-normal distributions. These cases are tabulated in Dickey and Fuller (1981). The consequence of ignoring the results of Dickey and Fuller is obvious. If using the standard tables, one will reject the null hypothesis of $\pi = 0$ ($\rho = 1.0$) too many times. It follows that if you use standard t-tests you will end up modelling non-stationary series as if they were stationary, which in turn takes you to the spurious regression problem. The alternative hypotheses for unit root tests are discussed in the following chapter.
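The kind of simulation described above is easy to reproduce in outline. The following sketch is not Fuller's original design, only a minimal illustration under assumed settings (NumPy; sample size 200; 5000 replications; `df_tstats` is a hypothetical helper name). It generates driftless random walks, estimates model (a) by OLS and collects the t-statistics of $\hat\pi$. The simulated 5 % quantile should come out near -1.95 rather than -1.65.

```python
import numpy as np

def df_tstats(T=200, reps=5000, seed=4):
    """t-statistics of pi_hat in model (a), Delta y_t = pi*y_{t-1} + eps_t, under a unit root."""
    rng = np.random.default_rng(seed)
    eps = rng.normal(size=(reps, T))
    y = np.cumsum(eps, axis=1)                 # driftless random walks
    ylag, dy = y[:, :-1], eps[:, 1:]           # Delta y_t equals eps_t when rho = 1
    syy = (ylag**2).sum(axis=1)
    pi_hat = (ylag * dy).sum(axis=1) / syy
    resid = dy - pi_hat[:, None] * ylag
    s2 = (resid**2).sum(axis=1) / (T - 2)      # T - 1 observations, one parameter
    return pi_hat / np.sqrt(s2 / syy)

t_stats = df_tstats()
print("simulated 5% quantile:", np.quantile(t_stats, 0.05))
print("share below -1.65:", np.mean(t_stats < -1.65))
```

The second printed number is the actual size of a one sided test that wrongly uses the normal critical value; it comes out above the nominal 5 %.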

The explanation of why the t-statistic ends up being non-normally distributed can be introduced as follows. As $T$ goes to infinity, the relative distance between $y_t$ and $y_{t-1}$ becomes smaller and smaller. Increasing the sample size implies that the random walk model goes towards a continuous time random walk model. The asymptotic distribution of such a model is that of a Wiener process (or Brownian motion). The OLS estimate is,

$$(\hat\rho - 1.0) = \left[T^{-1}\sum_{t=1}^{T} y_{t-1}^2\right]^{-1}\left[T^{-1}\sum_{t=1}^{T} y_{t-1}\epsilon_t\right], \tag{15.45}$$

where, since $y_t$ is driven by a stochastic trend, $y_t = \sum_{i=0}^{t-1}\epsilon_{t-i}$, the sample moments of the two factors on the RHS will not converge to constants, but to random variables instead. These random variables will have a non-standard distribution, often called a Dickey-Fuller distribution. We can express this as,

$$(\hat\rho - 1.0) \Rightarrow [W_y(t)], \tag{15.46}$$

where $W(t)$ indicates that the sample moment converges to a random variable which is a function of a Wiener process and therefore distributed according to a non-standard distribution. If the residuals are white noise then we get the so called Dickey-Fuller distributions.

The intuition behind this result is that an integrated variable has an infinite memory, so the correlation between $y_{t-1}$ and $\epsilon_t$ does not disappear as $T$ grows. The nonstandard distribution remains, and gets worse if we choose to regress two independent integrated variables against each other. Assume that $x_t$ and $y_t$ are two random walk variables, such that

$$y_t = y_{t-1} + \epsilon_t, \quad \text{and} \quad x_t = x_{t-1} + \eta_t, \tag{15.47}$$

where both $\epsilon_t$ and $\eta_t$ are $NID(0, \sigma^2)$. In this case, $\beta$ would equal zero in the model, $y_t = \beta x_t + \xi_t$.

The estimated t-value from this model, when $y_t$ and $x_t$ are independent, should converge to zero. This is not what happens when $y_t$ and $x_t$ are also integrated variables. In this case the empirical t-value will typically exceed 2.0 in absolute value, and increasingly so as the sample grows, leading to spurious correlation if a standard t-table at 5% is used to test for dependence between the variables. The problem can be described as follows. If $\beta$ is zero, the residual term will be I(1), having the same sample moments as $y_t$. Since $y_t$ is a random walk we know that the variance of $\xi_t$ will be time dependent and non-stationary as $T$ goes to infinity. The sample estimate of $\sigma^2_\xi$ is therefore not representative for the true long run variance of the $y_t$ series. The OLS estimator gives

$$\hat\beta = \beta + \left[\frac{1}{T}\sum_{t=1}^{T} x_t^2\right]^{-1}\left[\frac{1}{T}\sum_{t=1}^{T} x_t\xi_t\right], \tag{15.48}$$

which converges as

$$(\hat\beta - \beta) \Rightarrow [B_1(t)]^{-1}[B_2(t)], \tag{15.49}$$

where $B_1(t)$ and $B_2(t)$ are random variables, which follow a Brownian motion (Wiener process).

The intuition here is that in the long run random walk variables collapse to their continuous time counterpart, which is the Brownian motion (the Wiener process). The important difference is that instead of having sample moments which are constant in the long run, we have a ratio between two random variables which are functions of Brownian motions. In this situation the estimated parameters end up following non-standard distributions.

It is easy to understand why this is a bit problematic, just recall that a random

walk can be written as the sum of all shocks to the series, plus the initial value.

In other words, the sample moments in this case are sums of partial sums, since

each observation of xt can be written as a sum of shocks.

The estimated parameter will still converge to its true sample moment. The variance, however, will be different. It can be shown that in this case the sample moment of $\hat\beta$ is $(\hat\beta - \beta) \sim NID[0, \sigma^2(t)]$, where the variance is a function of time. The estimate of $\beta$ is still asymptotically correct and normally distributed but its variance is the variance of a Brownian motion. It can also be shown that the convergence of $\hat\beta$ to its true value is much faster than under standard OLS. Stock (1985) showed that the rate of convergence is $1/T$, instead of the standard OLS rate $1/T^{1/2}$. This is known as super convergence. Unfortunately, this is only an asymptotic result. In most applications the short run dynamics between the variables will seriously bias the OLS estimate in this situation.

The consequence is that if one tries to use standard tables, like t or F, to test the significance of $\beta$, one might not be able to reject spurious results. The true "t-values" of this model will be much higher. If one regresses one or more independent random walks against each other, standard t and F tables become useless and will lead the researcher to accept hypotheses of correlation when there is no correlation whatsoever.
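The spurious regression problem shows up clearly in a simulation. The sketch below (NumPy; the sample sizes, the number of replications and the function name `spurious_rejection_rate` are illustrative assumptions) regresses one random walk on another, independent one, without a constant as in the model above, and records how often the conventional t-test rejects $\beta = 0$ at the 5 % level.

```python
import numpy as np

def spurious_rejection_rate(T, reps=2000, seed=5):
    """Share of |t| > 1.96 when regressing independent random walks on each other."""
    rng = np.random.default_rng(seed)
    y = np.cumsum(rng.normal(size=(reps, T)), axis=1)
    x = np.cumsum(rng.normal(size=(reps, T)), axis=1)   # independent of y by construction
    sxx = (x * x).sum(axis=1)
    b = (x * y).sum(axis=1) / sxx                       # OLS slope, no constant
    resid = y - b[:, None] * x
    s2 = (resid**2).sum(axis=1) / (T - 1)
    t = b / np.sqrt(s2 / sxx)
    return np.mean(np.abs(t) > 1.96)

for T in (25, 100, 400):
    # A correctly sized test would reject about 5 % of the time
    print(T, spurious_rejection_rate(T))
```

The rejection rate is far above 5 % and rises with the sample size, so collecting more data makes the spurious finding more likely rather than less.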

These results might look like a special case, but they are not. In fact they carry

over to small sample estimates involving all types of integrated and near integrated

variables. The distributions of the parameters based on strongly autocorrelated

data are closer to the ones of a random walk, than those of standard stationary

normal variables. These results stress the importance of testing for the type of

non-stationarity, order of integration, and presence of cointegration when working

with time series. Otherwise one can easily fall into the spurious regression trap.

The problems are likely to carry on even to the situation when $\beta$ is different from zero. In this case the regression residual will be stationary, but the distribution of $\hat\beta$ is nonstandard as long as the two residual terms, the innovations driving $y_t$ and $x_t$, are dependent. In general, without a priori knowledge, the estimated standard errors from integrated variables must be modified, or cointegration tests must be carried out.


16. ENCOMPASSING

Often you will find that there are several alternative variables that you can put into a model; there might be several measures of income, capital or interest rates to choose from. Going from a general to a specific model, several models of the same dependent variable might display white noise innovation terms and stable parameters that all have signs and sizes that are in line with economic theory.

A typical example is given by Mankiw and Shapiro (1986), who argue that in a money demand equation, private consumption is a better variable than income. Thus, we are faced with two empirical models of money demand.¹ The first model is,

$$m_t = \alpha_1 y_t + \alpha_2 y_{t-1} + \alpha_3 r_t + \epsilon_t, \tag{16.1}$$

and the second model is,

$$m_t = \beta_1 c_t + \beta_2 c_{t-1} + \beta_3 r_t + \eta_t. \tag{16.2}$$

Which of these models is the best one, given that both can be claimed to be good estimates of the data generating process? The better model is the one that explains more of the systematic variation of $m_t$ and explains where other models go wrong. Thus, the better model will encompass the not so good models. The crucial factor is that $y_t$ and $c_t$ are two different variables, which leads to a non-nested test.

To understand the difference between nested and non-nested tests, set $\alpha_2 = 0$ (the coefficient on $y_{t-1}$ in the first model). This is a nested test because it involves a restriction on the first model only. Now, set $\alpha_1 = \alpha_2 = 0$; this is also a nested test, because it only reduces the information of model one. If $\beta_1 = \beta_2 = 0$ (the coefficients on $c_t$ and $c_{t-1}$), this is also a nested test of the second model. Thus, setting $\alpha_1 = \alpha_2 = 0$, or $\beta_1 = \beta_2 = 0$, gives only special cases of each model.

The problem that we would like to address here is whether to choose $y_t$ or $c_t$ as the scale variable in the money demand equation. This is a non-nested test because the test cannot be written as a restriction in terms of one model only.

The first thing to consider is that a stable model is better than an unstable one, so if one of the models is stable that is the one to choose. The next measure is to compare the residual variance and choose the model with the significantly smaller error variance.

However, variance domination is not really sufficient. PcGive therefore offers more tests that allow the comparison of Model one versus Model two, and vice versa. Thus, there are three possible outcomes: Model one is better, Model two is better, or there is no significant difference between the two models.
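One simple way to compare the two models in both directions is an F-test against a joint model that nests both. The sketch below is only an illustration of that idea, not a description of what PcGive reports: it assumes NumPy, an invented data generating process in which consumption is the true scale variable, and hypothetical helper names (`rss`, `f_exclusion`, `encompassing_demo`). Model one is said to encompass Model two if the consumption terms add nothing once income is included, and vice versa.

```python
import numpy as np

def rss(X, y):
    """Residual sum of squares from an OLS regression of y on X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ coef
    return e @ e

def f_exclusion(X_full, X_restricted, y):
    """F-statistic for excluding the columns of X_full that are not in X_restricted."""
    n, k = X_full.shape
    q = k - X_restricted.shape[1]
    rss_u, rss_r = rss(X_full, y), rss(X_restricted, y)
    return ((rss_r - rss_u) / q) / (rss_u / (n - k))

def encompassing_demo(n=400, seed=6):
    rng = np.random.default_rng(seed)
    inc = rng.normal(size=n + 1)                       # assumed income series y_t
    cons = 0.8 * inc + 0.6 * rng.normal(size=n + 1)    # consumption, correlated with income
    r = rng.normal(size=n)                             # interest rate
    # Assumed DGP: money demand is driven by consumption, so Model two is "true"
    m = 0.5 * cons[1:] + 0.3 * cons[:-1] - 0.2 * r + 0.5 * rng.normal(size=n)
    X1 = np.column_stack([inc[1:], inc[:-1], r])       # Model one regressors
    X2 = np.column_stack([cons[1:], cons[:-1], r])     # Model two regressors
    X_joint = np.column_stack([inc[1:], inc[:-1], cons[1:], cons[:-1], r])
    f_m1 = f_exclusion(X_joint, X1, m)   # H0: Model one encompasses Model two
    f_m2 = f_exclusion(X_joint, X2, m)   # H0: Model two encompasses Model one
    return f_m1, f_m2

f_m1, f_m2 = encompassing_demo()
print("F, Model one vs joint:", f_m1)
print("F, Model two vs joint:", f_m2)
```

In this artificial example the first statistic is large, so Model one fails to encompass Model two, while the second is small. With real data both, or neither, of the hypotheses may be rejected, which corresponds to the third outcome mentioned above.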

1 For simplicity we assume that there is only one lag on income and consumption. This should not be seen as a restriction; the lag length can vary between the first and the second model.

17. ARCH MODELS

Autoregressive Conditional Heteroscedasticity (ARCH) means that the variance of a process changes in a systematic way over time. Why should one bother about heteroscedasticity in time series models? Heteroscedasticity is often viewed as unimportant in time series modeling, except for the fact that it leads to inefficient estimates. Recall the linear regression model,

$$y_t = \beta x_t + \epsilon_t, \qquad \epsilon_t \sim N(0, \sigma^2), \tag{17.1}$$

where the variance $\sigma^2$ is usually assumed to be constant over time. In principle, however, nothing prevents the variance from varying over time ($\sigma^2_t$).

There are four reasons why this type of heteroscedasticity is important in time series models. The first is that any departure from having white noise residuals is a sign of misspecification. Heteroscedasticity tests represent a way of detecting misspecifications originating from leaving out an important explanatory variable, which is totally orthogonal to the other explanatory variables in the model.

Second, if the variance of the model is changing over time so will the forecast intervals of the model. Hence, for the purpose of making better predictions ARCH is of interest because it leads to better forecast confidence intervals. One example is so-called Value at Risk (VaR) models, which are used to forecast the level of reserves needed to meet cash flow fluctuations.

Third, the modeling of ARCH disturbances is sometimes implied by theory, and in general it makes sense from economic theory in many situations. ARCH represents a time series approach to the variance component, which picks up effects not otherwise included in the model. Various types of time varying risk premiums are examples of this. Variables such as time varying risk premiums are difficult to observe and measure. But we can trace their effects on the variance in a model like the one above. Examples of applications are intertemporal asset market models, CAPM, exchange rate markets, etc.

Fourth, option prices depend critically on the expected future variances of the price of the underlying asset. ARCH models offer a way of forecasting variances such that pricing can be more exact, and more profitable for those who are able to make better forecasts.

An example of an ARCH(1) model is provided by,

$$y_t = \beta x_t + \epsilon_t, \qquad \epsilon_t \sim N(0, h_t), \tag{17.2}$$

$$h_t = \omega + \alpha_1\epsilon^2_{t-1}, \tag{17.3}$$

where the error variance is dependent on the lagged value of the squared error. The first equation is referred to as the mean equation and the second equation is referred to as the variance equation. Together they form an ARCH model; both equations must be estimated simultaneously. In the mean equation here, $\beta x_t$ is simply an expression for the conditional mean of $y_t$. In a real situation this can be explanatory variables, an AR or ARIMA process. It will be understood that $y_t$ is stationary and I(0), otherwise the variance will not exist.

This example is an ARCH model of order one, ARCH(1). ARCH models can be said to represent an ARMA process in the variance. The implication is that a high variance in period $t - 1$ will be followed by higher variances in periods $t$, $t + 1$, $t + 2$ etc. How long the shock persists depends, as in the ARMA model, on the size of the parameters in combination with the lag lengths. A low variance period is likely to be followed by another low variance period, but a shock to the process and/or its variance will cause the variance to become higher before it settles down in the future. A consequence of an ARCH process is that the variance can be predicted. In other words it is possible to predict if the future variances, and standard errors, will be large or small. This will improve forecasting in general and is a useful tool for the pricing of derivative instruments.
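The predictability of the variance can be seen in simulated data. The sketch below (NumPy; the parameter values $\omega = 0.2$ and $\alpha_1 = 0.3$ and the helper names `simulate_arch1` and `acf1` are illustrative assumptions) generates an ARCH(1) error process with normal innovations and compares the first order autocorrelation of the errors with that of the squared errors. The errors themselves should look like white noise, while the squared errors should be clearly autocorrelated, with a first order autocorrelation in the neighbourhood of $\alpha_1$.

```python
import numpy as np

def simulate_arch1(T, omega=0.2, alpha=0.3, seed=7):
    """Simulate eps_t with conditional variance h_t = omega + alpha * eps_{t-1}^2."""
    rng = np.random.default_rng(seed)
    z = rng.normal(size=T)
    eps = np.zeros(T)
    h = np.zeros(T)
    h[0] = omega / (1 - alpha)                 # unconditional variance as start value
    eps[0] = np.sqrt(h[0]) * z[0]
    for t in range(1, T):
        h[t] = omega + alpha * eps[t - 1]**2
        eps[t] = np.sqrt(h[t]) * z[t]
    return eps, h

def acf1(u):
    """First order sample autocorrelation."""
    u = u - u.mean()
    return (u[1:] @ u[:-1]) / (u @ u)

eps, h = simulate_arch1(20000)
print("acf(1) of eps:   ", acf1(eps))
print("acf(1) of eps^2: ", acf1(eps**2))
```

Inspecting the autocorrelations of squared residuals in this way is the informal counterpart of a formal test for ARCH.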

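An ARCH(q) process is,

$$y_t = \beta x_t + \epsilon_t, \qquad \epsilon_t \sim D(0, h_t), \tag{17.4}$$

$$h_t = \omega + \alpha_1\epsilon^2_{t-1} + \alpha_2\epsilon^2_{t-2} + \ldots + \alpha_q\epsilon^2_{t-q} = \omega + \sum_{i=1}^{q}\alpha_i\epsilon^2_{t-i}. \tag{17.5}$$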

The expression for the variance shows an autoregressive process in the variance of $\epsilon_t$. The distribution of the residual term is deliberately left undetermined. In ARCH models normality is one option, but often the residual process will be non-normal, display thicker tails, and be leptokurtic. Thus, other distributions such as the Student t-distribution can be a better alternative.

The t-distribution is characterised by three parameters: the mean, the variance and the "degrees of freedom of the Student t-distribution". In this case the residual process is $\epsilon_t \sim St(0, h_t, \nu)$, where $\nu$ is a positive parameter that measures the relative importance of the peak in relation to the thickness of the tails. The Student t-distribution is a symmetrical distribution that contains the normal distribution as a special case, as $\nu \to \infty$.

The ARCH process can be detected by testing for ARCH and by inspecting the PACF and ACF of the estimated squared residuals $\hat{\epsilon}_t^2$. As is the case for AR models, ARCH has a more general form, the Generalised ARCH (GARCH), which implies adding lags of the dependent variable $h_t$. A long lag structure in the ARCH process can be substituted with lagged dependent variables to create a shorter process, just as for ARMA processes. A GARCH(1,1) model is written as,

$$y_t = \beta x_t + \epsilon_t, \qquad \epsilon_t \sim D(0, h_t), \tag{17.6}$$

$$h_t = \omega + \alpha_1 \epsilon_{t-1}^2 + \beta_1 h_{t-1}. \tag{17.7}$$

The GARCH(1,1) process is a very typical process found in a number of empirical applications of ARCH processes. The convention is to indicate the length of the ARCH part with q, and to use the letter p to indicate the length of the lagged variance $h_t$. The same convention assigns $\alpha$ to the ARCH process, and $\beta$ to the GARCH process. Usually $\omega$ is used for the constant, time-independent part of the variance instead of $\alpha_0$; both notations appear in this chapter. For an asset market this type of process would imply that there are persistent periods when asset prices fluctuate relatively little compared with other periods where prices fluctuate more and for longer times. A general GARCH(q,p) process is,

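$$y_t = \beta x_t + \epsilon_t, \qquad \epsilon_t \sim D(0, h_t), \tag{17.8}$$

$$h_t = \omega + \sum_{i=1}^{q} \alpha_i \epsilon_{t-i}^2 + \sum_{i=1}^{p} \beta_i h_{t-i}. \tag{17.9}$$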

ARCH and GARCH models cannot be estimated by OLS, or by standard regression programs. It is necessary to use an iterative system estimation method because the model now consists of two equations: the mean equation and the variance equation. In the example above, the additional parameters $\omega$ and $\alpha$ must be estimated in the same model. Therefore, some iterative ML estimator is necessary (special algorithms are also necessary). Gauss is a good program for estimating ARCH models, but takes some investment to learn; SAS (from ver. 6.08) is quite good; EViews is also good, with excellent help facilities; RATS is an alternative. Finally, PcGive 10 can also do ARCH and GARCH models. A practical problem in estimation is that, in a finite sample, there is no guarantee that the estimated variance ($h_t$) will be a positive number. For that reason, software will offer you the opportunity to restrict the values of $\omega$, as well as the sums of the $\alpha$'s and $\beta$'s, to positive numbers.

17.0.1

In practical modelling it is necessary to start with the mean equation. It is necessary to have a correct specification of the mean equation in order to get the variance process right. A stationary autoregressive process and relevant explanatory variables, and possibly seasonal and other dummies, must be included in the mean equation to get rid of autocorrelation and general misspecification. This is a relatively easy procedure for financial return series, which are often martingale processes.

Notice that ARCH and GARCH effects disappear with aggregation over time and with low frequencies in recording data. Thus, ARCH/GARCH is typically not found in data recorded at intervals longer than a month. Monthly data, or shorter intervals, are necessary for the modelling of ARCH/GARCH processes. Even if models estimated with quarterly or lower-frequency data can display ARCH when the residuals are tested, it is usually not possible to build an ARCH/GARCH model with that type of data.

An ARCH process can be identified by testing for ARCH(q) structure in combination with using ACFs and PACFs on the squared residuals from the mean equation. Estimate the mean equation, save the estimated residuals, square them and inspect the ACFs and PACFs of these squared residuals to identify a preliminary lag order for the GARCH. However, this method is highly approximate regarding the order of q and p.
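The identification steps just described can be sketched in a few lines of Python. This is only an illustration under assumed values: the data are simulated from an AR(1) mean equation with ARCH(1) errors, and the names `acf`, `phi_hat`, `resid` and all parameter values are hypothetical, not taken from the text.

```python
import numpy as np

def acf(x, nlags):
    """Sample autocorrelations of x at lags 1..nlags."""
    x = np.asarray(x, dtype=float)
    d = x - x.mean()
    denom = np.sum(d * d)
    return np.array([np.sum(d[k:] * d[:-k]) / denom for k in range(1, nlags + 1)])

# Hypothetical data: AR(1) mean equation with ARCH(1) errors.
rng = np.random.default_rng(0)
T, phi, omega, alpha = 3000, 0.5, 0.2, 0.5
y = np.zeros(T)
eps = np.zeros(T)
for t in range(1, T):
    h = omega + alpha * eps[t - 1] ** 2          # conditional variance
    eps[t] = np.sqrt(h) * rng.standard_normal()
    y[t] = phi * y[t - 1] + eps[t]

# Step 1: estimate the mean equation by OLS (AR(1) without a constant).
phi_hat = np.sum(y[1:] * y[:-1]) / np.sum(y[:-1] ** 2)
resid = y[1:] - phi_hat * y[:-1]

# Step 2: compare the ACF of the residuals with the ACF of the squared residuals.
acf_resid = acf(resid, 5)
acf_sq = acf(resid ** 2, 5)
band = 2 / np.sqrt(len(resid))   # rough 95% band under no autocorrelation

print("ACF of residuals:        ", np.round(acf_resid, 3))
print("ACF of squared residuals:", np.round(acf_sq, 3))
print("approximate band: +/-", round(band, 3))
```

Autocorrelations of the squared residuals that lie well outside the band, while those of the residuals themselves stay small, point towards ARCH rather than a misspecified mean equation.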

To explore ARCH models, let us start with the following AR(1) model, which could represent an asset price,

$$y_t = \phi y_{t-1} + \epsilon_t, \tag{17.10}$$

where $\epsilon_t$ has mean zero and variance $\sigma^2$, and $|\phi| < 1$ (so that $y_t$ is stationary). Furthermore, let us assume that the unconditional mean of $y_t$ is $E(y_t) = (1/T)\sum y_t$, which is not dependent on time.

The expected value of $y_{t+1}$, conditional on the past history of $y_t$, is

$$E_t(y_{t+1} \mid y_t) = \phi y_t, \tag{17.11}$$

which varies over time since $y_t$ is a random variable. Now turn to the variance of $y_{t+1}$,

$$Var(y_{t+1}) = Var(\phi y_t) + Var(\epsilon_{t+1}). \tag{17.12}$$

This variance consists of two parts. First, we have the unconditional variance of $y_{t+1}$ which is, for an AR(1), given by

$$Var(y_{t+1}) = \frac{\sigma^2}{1 - \phi^2}. \tag{17.13}$$

Second, we have the conditional variance of $y_{t+1}$,

$$Var_t(y_{t+1}) = E_t[y_{t+1} - E(y_{t+1} \mid y_t)]^2 = \sigma^2. \tag{17.14}$$

We can see that while the conditional expectation of $y_{t+1}$ depends on the information set $I_t = y_t$, neither the conditional variance ($Var_t$) nor the unconditional variance ($Var$) depends on $I_t = y_t$.

If we extend the forecasts $k$ periods ahead we get, by repeated substitution,

$$y_{t+k} = \phi^k y_t + \sum_{i=1}^{k} \phi^{k-i} \epsilon_{t+i}. \tag{17.15}$$

The first term is the conditional expectation of $y_t$ $k$ periods ahead. The second term is the forecast error. Hence, the conditional variance of $y_t$ $k$ periods ahead is equal to

$$Var_t(y_{t+k}) = \sigma^2 \sum_{i=1}^{k} \phi^{2(k-i)}. \tag{17.16}$$

It can be seen that the forecast of $y_{t+k}$ depends on the information at time $t$. The conditional variance, on the other hand, depends on the length of the forecast horizon ($k$ periods into the future), but not on the information set. Nothing says that this conditional variance should be stable. Like the forecast of $y_t$ it could very well depend on available information as well, and therefore change over time.

So let us turn to the simplest case, where the errors follow an ARCH(1) model. We have the following model, $y_t = \phi y_{t-1} + \epsilon_t$ where $\epsilon_t \sim D(0, h_t)$, $E(\epsilon_t) = 0$, $E(\epsilon_t \epsilon_{t-i}) = 0$ for $i \neq 0$, and $h_t = \omega + \alpha \epsilon_{t-1}^2$.

The process is assumed to be stable, $|\phi| < 1$, and since $\epsilon_{t-1}^2$ is positive we must have $\omega > 0$ and $\alpha \geq 0$. Notice that the errors are not autocorrelated, but at the same time they are not independent, since they are correlated in higher moments through the ARCH effect. Thus, we cannot assume that the errors really are normally distributed. If we choose to use the normal distribution as a basis for ML estimation, this is only an approximation. (As an alternative we could think of using the t-distribution, since the distribution of the errors tends to have fatter tails than that of the normal.) Looking at the conditional expectations of the mean and the variance of this process, $E_t(y_{t+1} \mid y_t) = \phi y_t$ and $Var_t(y_{t+1} \mid y_t) = h_{t+1} = \omega + \alpha (y_t - \phi y_{t-1})^2$.

We can see that both depend on the available information at time $t$. In particular, it should be noticed that the conditional variance of $y_{t+1}$ is increased by both positive and negative shocks in $y_t$.

Extending the conditional variance expression $k$ periods ahead, as above, we get,

$$Var_t(y_{t+k} \mid y_t) = \sum_{i=1}^{k} \phi^{2(k-i)} E_t(h_{t+i}). \tag{17.17}$$

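The forecast variance now depends on the expected conditional variances of the errors up to $k$ periods ahead. To solve for the latter, and express the forecast in the same way as the one in (17.16), recall that the unconditional variance of the errors is $E(\epsilon_t \epsilon_t) = \sigma^2$. In terms of $h_t$, writing $\sigma_t^2 = E(h_t)$,

$$\sigma_t^2 = \omega + \alpha \sigma_{t-1}^2, \tag{17.18}$$

which is,

$$(1 - \alpha L)\sigma_t^2 = \omega, \tag{17.19}$$

which, since $\sigma_t^2 = \sigma^2$, implies that,

$$(1 - \alpha)\sigma^2 = \omega. \tag{17.20}$$

Substitute in $h_t$,

$$h_t = (1 - \alpha)\sigma^2 + \alpha \epsilon_{t-1}^2, \tag{17.21}$$

to get the relationship between the conditional and the unconditional variances of $y_t$. The expected value of $h_t$ in any period $i$ is,

$$E(h_{t+i}) = \sigma^2 + \alpha E[h_{t+i-1} - \sigma^2]. \tag{17.22}$$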

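Substituting (17.22) recursively into (17.17) gives

$$Var_t(y_{t+k} \mid y_t) = \sigma^2 \sum_{i=0}^{k-1} \phi^{2i} + (h_{t+1} - \sigma^2) \sum_{i=0}^{k-1} \alpha^{i} \phi^{2(k-1-i)}. \tag{17.23}$$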

The first term on the RHS is the long-run unconditional forecast variance of $y_t$. The second term represents the memory in the process, given by the presence of $h_{t+1}$. If $\alpha < 1$ the influence of $(h_{t+1} - \sigma^2)$ will die out in the long run and the second term vanishes. Thus, for long-run forecasts it is only the unconditional forecast variance which is of importance. Under the assumption of $\alpha < 1$ the memory in the ARCH effect dies out. (Below we will relax this assumption, and allow for unit roots in the ARCH process.)
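As a numerical check on the $k$-period forecast variance, the following sketch sums $\phi^{2(k-i)} E_t(h_{t+i})$ directly, using $E_t(h_{t+i}) = \sigma^2 + \alpha^{i-1}(h_{t+1} - \sigma^2)$, which follows from iterating (17.22). The function name and all parameter values are assumptions made for illustration only.

```python
def forecast_variance(k, phi, omega, alpha, h_next):
    """Var_t(y_{t+k}) for an AR(1) with ARCH(1) errors, by direct summation.

    h_next is h_{t+1}, which is known at time t.
    """
    sigma2 = omega / (1.0 - alpha)        # unconditional variance of the error
    total = 0.0
    for i in range(1, k + 1):
        # Expected conditional variance i periods ahead.
        exp_h = sigma2 + alpha ** (i - 1) * (h_next - sigma2)
        total += phi ** (2 * (k - i)) * exp_h
    return total

# Hypothetical values: a turbulent period, so h_{t+1} is above sigma^2 = 1.
phi, omega, alpha, h_next = 0.5, 0.6, 0.4, 4.0
sigma2 = omega / (1.0 - alpha)
long_run = sigma2 / (1.0 - phi ** 2)      # unconditional forecast variance

for k in (1, 2, 5, 20, 200):
    print(k, round(forecast_variance(k, phi, omega, alpha, h_next), 4))
print("long-run value:", round(long_run, 4))
```

At $k = 1$ the function returns $h_{t+1}$ itself, and as $k$ grows the memory term fades and the forecast variance approaches $\sigma^2/(1-\phi^2)$.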

Some Different Types of ARCH and GARCH Models

ARCH models represent a class of models where the variance is changing over time in a systematic way. Let us now define different types of ARCH models. In all these models there is always a mean equation, which must be correctly specified for the ARCH process to be modeled correctly.

1) ARCH(q): the ARCH model of order q,

$$h_t = \alpha_0 + \sum_{i=1}^{q} \alpha_i \epsilon_{t-i}^2 = \alpha_0 + A(L)\epsilon_t^2. \tag{17.24}$$

This is the basic ARCH model from which we now introduce different effects.

2) GARCH(q, p): Generalized ARCH models.

If q is large then it is possible to get a more parsimonious representation by adding lagged $h_t$ to the model. This is like using ARMA instead of AR models. A GARCH(q, p) model is

$$h_t = \alpha_0 + \sum_{i=1}^{q} \alpha_i \epsilon_{t-i}^2 + \sum_{i=1}^{p} \beta_i h_{t-i} = \alpha_0 + A(L)\epsilon_t^2 + B(L)h_t, \tag{17.25}$$

where $p \geq 0$, $q > 0$, $\alpha_0 > 0$, $\alpha_i \geq 0$, and $\beta_i \geq 0$. The sum of the estimated parameters, $A(1) + B(1) = \sum_i \alpha_i + \sum_i \beta_i$, shows the memory of the process. Values of this sum equal to unity indicate that shocks to the variance have permanent effects, like in a random walk model. High values of the sum, but less than unity, indicate a long memory process: it takes a long time before shocks to the variance disappear.

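If the roots of $[1 - B(L)] = 0$ are outside the unit circle, the process is invertible and,

$$h_t = \alpha_0 [1 - B(L)]^{-1} + A(L)[1 - B(L)]^{-1} \epsilon_t^2 \tag{17.26}$$

$$= \alpha_0 \Big[1 - \sum_{i=1}^{p} \beta_i\Big]^{-1} + \sum_{i=1}^{\infty} \delta_i \epsilon_{t-i}^2 \tag{17.27}$$

$$= a + D(L)\epsilon_t^2 \sim ARCH(\infty). \tag{17.28}$$

If $D(L) < 1$ then GARCH = ARCH. Moreover, if the long-run solution of the model, $B(1)$, is $< 1$, the $\delta_i$ will decrease for all $i > \max(p, q)$.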

GARCH models are standard tools, in particular for modeling foreign exchange rate markets and financial market data. Often the GARCH(1,1) is the preferred choice. GARCH models capture some empirical observations quite well. The distributions of many financial series display fatter tails than the standard normal distribution. GARCH models in combination with the assumption of a normal distribution of the residual can generate such distributions. However, many series, like foreign exchange rates, display both fatter tails and are leptokurtic (the peak of the distribution is higher than that of the normal). A GARCH process combined with the assumption that the errors follow the t-distribution can generate this type of observed data.
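The claim that a GARCH process with conditionally normal errors generates fat-tailed data can be illustrated by simulation. The sketch below is a minimal example under assumed parameter values ($\omega = 0.05$, $\alpha_1 = 0.15$, $\beta_1 = 0.80$); the function names are hypothetical and the numbers are not taken from the text.

```python
import numpy as np

def simulate_garch11(T, omega, alpha, beta, seed=0):
    """Simulate errors eps_t and conditional variances h_t from a GARCH(1,1)."""
    rng = np.random.default_rng(seed)
    eps = np.zeros(T)
    h = np.zeros(T)
    h[0] = omega / (1.0 - alpha - beta)          # start at the unconditional variance
    eps[0] = np.sqrt(h[0]) * rng.standard_normal()
    for t in range(1, T):
        h[t] = omega + alpha * eps[t - 1] ** 2 + beta * h[t - 1]
        eps[t] = np.sqrt(h[t]) * rng.standard_normal()
    return eps, h

def kurtosis(x):
    """Sample kurtosis (equal to 3 for a normal distribution)."""
    d = x - x.mean()
    return np.mean(d ** 4) / np.mean(d ** 2) ** 2

eps, h = simulate_garch11(100_000, omega=0.05, alpha=0.15, beta=0.80)
kurt_raw = kurtosis(eps)                  # unconditional distribution of the errors
kurt_std = kurtosis(eps / np.sqrt(h))     # errors standardised by their conditional s.d.

print("kurtosis of errors:             ", round(kurt_raw, 2))
print("kurtosis of standardised errors:", round(kurt_std, 2))
```

The raw errors show kurtosis above 3 even though every draw is conditionally normal, while the standardised errors are close to normal again.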

Before continuing with different ARCH models, we can now look at an alternative formulation of ARCH models which shows their similarities with ordinary time series models. Define the innovations in the conditional variance as,

$$v_t = \epsilon_t^2 - h_t. \tag{17.29}$$

These innovations can be thought of as new, unexpected, information on the markets. The GARCH model is then,

$$[1 - B(L)](\epsilon_t^2 - v_t) = \alpha_0 + A(L)\epsilon_t^2, \tag{17.30}$$

$$[1 - B(L)]\epsilon_t^2 = \alpha_0 + A(L)\epsilon_t^2 + [1 - B(L)]v_t, \tag{17.31}$$

and

$$[1 - B(L) - A(L)]\epsilon_t^2 = \alpha_0 + v_t - B(L)v_t, \tag{17.32}$$

which is an ARMA representation of $\epsilon_t^2$. We can therefore identify a GARCH process using the same tools as for an ARMA model. That is, by looking at the autocorrelations and partial autocorrelations of $\hat{\epsilon}_t^2$, estimated from OLS.

Solving for the GARCH(1,1) model,

$$\epsilon_t^2 = \alpha_0 + (\alpha_1 + \beta_1)\epsilon_{t-1}^2 - \beta_1 v_{t-1} + v_t. \tag{17.33}$$

If $\alpha_1 + \beta_1 = 1$, or $\sum(\alpha_i + \beta_i) = 1$ in a GARCH(q, p) model, we get what is called an integrated GARCH model.

3) ARCH with explanatory variables in the variance equation,

$$h_t = \alpha_0 + A(L)\epsilon_t^2 + \gamma x_t. \tag{17.34}$$

In this model we have added explanatory variables into the ARCH process, just like we can add exogenous explanatory variables into an ARMA model.

4) M-ARCH. Multivariate ARCH.

The multivariate ARCH is basically an extension of the univariate model to a system of equations with time varying variances and covariances, like

$$H_t = \begin{bmatrix} h_{11,t} & h_{12,t} & \dots & h_{1n,t} \\ h_{21,t} & h_{22,t} & \dots & h_{2n,t} \\ \dots & \dots & \dots & \dots \\ h_{n1,t} & h_{n2,t} & \dots & h_{nn,t} \end{bmatrix}.$$

The M-ARCH is like a VAR model for a system of variables, only now the

system is extended to allow for interaction among the variances as well. Typical

applications of multivariate ARCH are CAPM models of asset portfolios.

5) ARCH in mean.

It is possible to put the ARCH process back into the conditional mean of the process, and let it represent some variable, like a time-varying risk premium. In this case we get the following system,

$$y_t = \beta x_t + \lambda h_t^{1/2} + \epsilon_t, \qquad h_t = \alpha_0 + A(L)\epsilon_t^2. \tag{17.35}$$

There exist various ways of "putting the variance back" in the mean equation. The example above assumes that it is the standard error which is the interesting variable in the mean equation.

6) IGARCH. Integrated GARCH.

When the coefficients sum to unity we get a model with extremely long memory (similar to the random walk model). Unlike the cases discussed earlier, the shocks to the variance will not die out. Current information remains important for all future forecasts. We talk about an integrated variance and persistence in variance. A significant constant term in a GARCH process can be understood as mean reversion of the variance. But if the variance is not mean-reverting, integrated GARCH is an alternative, which in a GARCH(1,1) process sets the constant to zero and restricts the sum of the two parameters to unity.

7) EGARCH. Exponential GARCH and ARCH models (exponential due to logs of the variables in the GARCH model). These models have the interesting characteristic that they allow for different reactions to negative and positive shocks, a phenomenon observed on many financial markets. In the output, the first lagged residual indicates the effect of a positive shock, while the second lagged residual (in absolute terms) indicates the effect of a negative shock.

8) FIGARCH. Fractionally Integrated GARCH.

This approach builds on the idea of fractional integration and allows for a slow hyperbolic rate of decay for the lagged squared innovation in the conditional variance function. See Baillie, Bollerslev and Mikkelsen (1996).

9) NGARCH and NARCH. Non-linear GARCH and ARCH models.

10) Common Volatility.

Introduced by Engle and Isle 1989 (and 1993), allows you to test for common GARCH structure in different series.


In the literature there exist a number of X-GARCH-type models. It is not possible to keep track of all possible twists here, but 1-10 are the relevant approaches.

Let us now turn to the estimation of ARCH and GARCH models. The main problem is the distribution of the error terms; in general they are not normally distributed. The most used alternatives are the t-distribution and the gamma distribution. In applications in finance and foreign exchange rates, a t-distribution is often motivated by the fact that the empirical distributions of these variables display fatter tails than the normal distribution.

If we assume that the residuals of the model follow a normal distribution, we have that the errors are conditionally normally distributed, or $\epsilon_t \mid \epsilon_{t-1} \sim NID(0, h_t)$. Using that assumption the following likelihood function is estimated,

$$\log L = -\frac{T}{2}\log 2\pi - \frac{1}{2}\sum_{t=1}^{T}\Big[\log h_t + \frac{\epsilon_t^2}{h_t}\Big]. \tag{17.36}$$

Notice that there are two equations involved here, the mean equation and the variance equation. The process is correctly modelled only when both equations are correctly modelled.
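To make the likelihood in (17.36) concrete, the sketch below evaluates it for a GARCH(1,1) variance equation. It is an illustration only: the start-up value for $h_t$ (the sample variance), the simulated data and the parameter values are all assumptions, and a real application would pass the function to a numerical optimiser rather than evaluate it at two hand-picked points.

```python
import numpy as np

def garch11_variance(params, eps):
    """Conditional variances h_t implied by (omega, alpha, beta) and the errors."""
    omega, alpha, beta = params
    h = np.empty(len(eps))
    h[0] = np.var(eps)                    # assumed start-up value
    for t in range(1, len(eps)):
        h[t] = omega + alpha * eps[t - 1] ** 2 + beta * h[t - 1]
    return h

def gaussian_loglik(params, eps):
    """Log-likelihood under conditional normality, as in (17.36)."""
    h = garch11_variance(params, eps)
    T = len(eps)
    return -0.5 * T * np.log(2 * np.pi) - 0.5 * np.sum(np.log(h) + eps ** 2 / h)

# Hypothetical data: errors simulated from a GARCH(1,1) process.
rng = np.random.default_rng(42)
true_params = (0.05, 0.15, 0.80)
T = 5000
eps = np.zeros(T)
h_now = true_params[0] / (1 - true_params[1] - true_params[2])
for t in range(T):
    eps[t] = np.sqrt(h_now) * rng.standard_normal()
    h_now = true_params[0] + true_params[1] * eps[t] ** 2 + true_params[2] * h_now

ll_true = gaussian_loglik(true_params, eps)
ll_const = gaussian_loglik((np.var(eps), 0.0, 0.0), eps)   # constant-variance model

print("log L at the generating parameters:", round(ll_true, 1))
print("log L with constant variance:      ", round(ll_const, 1))
```

The likelihood is clearly higher when the variance equation is allowed to track the volatility clusters than when the variance is held constant.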

To estimate ARCH and GARCH processes, non-standard algorithms are generally needed. If $y_{t-i}$ is among the regressors some iterative method is always required. (GAUSS, RATS, SAS provide such facilities.) There are also special programs which deal with ARCH, GARCH and multivariate ARCH. The research strategy is to begin by testing for ARCH with standard test procedures. The following LM test for q-th order ARCH is an example,

$$\hat{\epsilon}_t^2 = \alpha_1 \hat{\epsilon}_{t-1}^2 + \alpha_2 \hat{\epsilon}_{t-2}^2 + \dots + \alpha_q \hat{\epsilon}_{t-q}^2 + \gamma y_t + v_t, \tag{17.37}$$

where $TR^2 \sim \chi^2(q)$. Notice that this requires that $E(\epsilon_t) = 0$, and $E(\epsilon_t \epsilon_{t-i}) = 0$ for $i \neq 0$.
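A minimal sketch of an LM test of this kind is given below. It uses the common form of the auxiliary regression, squared residuals on an intercept and q of their own lags, which is an assumption on my part and may differ in detail from the regression written above. The function name, the simulated series and the parameter values are hypothetical.

```python
import numpy as np

def arch_lm_test(resid, q):
    """Return T*R^2 from regressing squared residuals on q of their own lags."""
    e2 = np.asarray(resid, dtype=float) ** 2
    y = e2[q:]
    lags = [e2[q - i:len(e2) - i] for i in range(1, q + 1)]
    X = np.column_stack([np.ones(len(y))] + lags)   # the intercept is an assumption
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid_aux = y - X @ coef
    r2 = 1.0 - np.sum(resid_aux ** 2) / np.sum((y - y.mean()) ** 2)
    return len(y) * r2

# Hypothetical residual series: one homoskedastic, one ARCH(1).
rng = np.random.default_rng(7)
T = 2000
iid = rng.standard_normal(T)
arch = np.zeros(T)
for t in range(1, T):
    h = 0.5 + 0.5 * arch[t - 1] ** 2
    arch[t] = np.sqrt(h) * rng.standard_normal()

stat_iid = arch_lm_test(iid, q=1)
stat_arch = arch_lm_test(arch, q=1)
CHI2_1_CRIT_5PCT = 3.84    # 5% critical value of the chi-square(1) distribution

print("T*R^2, homoskedastic series:", round(stat_iid, 2))
print("T*R^2, ARCH(1) series:      ", round(stat_arch, 2))
```

The statistic is compared with a $\chi^2(q)$ critical value; a large value rejects the null of no ARCH.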

If ARCH is found, or suspected, use standard time series techniques to identify the process. The specification of an ARCH model can be tested by Lagrange multiplier tests, or likelihood ratio tests. As in time series modeling, the Box-Ljung test on the estimated residuals from an ARCH equation serves as a misspecification test. ARCH-type processes are seldom found in low frequency data. High frequency data is generally needed to observe these effects: daily, weekly and sometimes monthly data, but hardly ever quarterly or yearly data.

Finally, remember two things. First, ARCH effects imply thicker tails than the standard normal distribution. It is not obvious that the normal distribution should be used. On the other hand, there is no obvious alternative either. Often the normal distribution is the best approximation, unless there is some other information. One example of other information is that some series are leptokurtic (a higher peak than the normal) in combination with fat tails. In that case the t-distribution might be an alternative. Thus, using the normal density function is often an approximation. Second, correct inference on ARCH effects builds upon a correct specification of the mean equation. Misspecification tests of the mean equation are therefore necessary.


The presence of expectations has consequences for econometric model building. In particular, rational expectations have extremely important consequences. The most pessimistic views, following from rational expectations, reduce econometric modeling to simple data description, with little or no room for increasing our understanding of the behavior of economic agents. Muth's (1961) original definition of rational expectations goes very far. It assumes that agents know the true data generating process (DGP) of the complete system. This is in contrast to the econometrician, who must estimate what he/she thinks is the DGP. The econometrician must also test for significant changes in his/her model before he/she can find out whether the process has changed.

In econometrics we can only deal with a limited aspect of rational expectations, namely expectations formed conditionally on past (observed) history. In contrast to using econometrics, in the world of Muth and other rational expectation theorists, agents are free to form the best expectation at any time without estimating, or making inference from, historical data. We can describe the econometric approach to rational expectations as follows. Let $x_t^e$ be the expected future value of the variable $x_t$ held by the agent(s) at time $t$. The expectation held at time $t$ is $x_t^e = E(x_{t+1} \mid I_t)$, where $I_t$ is the information set containing the historical data used to form the expectation.

Under rational expectations, by definition, the information set contains all relevant information for determining the expectation, so that the expected difference between the actual outcome of $x_{t+1}$ and its expectation ($x_t^e$) is zero, $E[(x_{t+1} - x_t^e) \mid I_t] = 0$. This is a weak condition. It allows expectations to be erroneous in individual periods, but requires that they are correct on average. Thus, in applied work the difference between the outcome and the expectation should be a martingale difference process. Assuming that the difference is also a white noise innovation process is generally stronger than necessary.

If the ordinary, not expectations-based, econometric model is formulated as $y_t = \beta x_t + e_t$, the assumption of rational expectations leads to the following model,

$$y_t = \beta E\{x_{t+1} \mid I_t\} + e_t, \quad \text{or} \quad y_t = \beta x_t^e + e_t. \tag{18.1}$$

18.0.1

There are other types of expectations than rational expectations, like "myopic" or "static" expectations. These alternatives are generally ad hoc, and not based on any reasonable assumptions about the behavior of economic agents. Expectations other than rational imply that agents might ignore information that would raise their utility. With anything other than rational expectations, agents will be allowed to make systematic mistakes, implying that they ignore profit opportunities or that they are not, for some unexplained reason, maximizing their utility. Economic science has yet to identify such behavior in the real world. Rational expectations becomes an equilibrium condition in the sense that the difference between predictions and outcomes cannot be predicted. A model which allows for predictable differences between the expectations and the outcome is not complete without an economic explanation of what the difference means, and why it occurs.

The correct way to approach the modeling of expectations is to assume that agents form expectations so that they do not make systematic mistakes that reduce their welfare. Information used to predict the future will be collected and processed up to the point where the cost of gathering more information balances the revenue from additional information. Based on this type of behavior it might, as a special case, be optimal to use, say, today's value of a variable to predict all future values of that variable. But these are exceptions to the rule.

In general there is a catch-22 situation in the modeling of rational expectations behavior. If the econometrician finds from ex post data that the agents are making systematic mistakes, this is no evidence against the rational expectations hypothesis. Instead, the empirical finding might be the result of conditioning on the wrong information set. Alternatively, the modeling of the expectation might be correct, and be an unbiased and efficient estimate of the expectation held only at a certain point in time. This argument also includes situations where there is a small probability of an event with large consequences, such as devaluations, unpredicted changes in the monetary regime, wars, natural disasters etc. To examine these situations generally requires further testing of the model, where the outcome will depend to a large extent on assumptions regarding the distributions of the processes, whether they are linear or non-linear etc.

The discussion about other types of expectations brings us to the concepts of forward-looking vs. backward-looking behavior. The difference can be explained as follows. Consumption based on forward-looking behavior is determined on the basis of expected future income. Consumption based on actual (existing) income is backward-looking. In practice there might not be a big difference; your present or recent income might be a good approximation to your future income. In some cases rational expectations might be to base decisions on contingent rules, and revise these rules only when the cost of deviating from the optimal/desired consumption is too big (or when the alternative cost of being outside equilibrium is too high).

18.0.2

Without given values of the expected value there are two types of common mistakes in econometric models of expectations-driven stochastic processes. The first mistake is to substitute $x_t^e$ with the observed value $x_t$. This leads to an errors-in-variables problem, since $x_t = x_t^e + v_t$, where $E(v_t) = 0$. The errors-in-variables problem implies that $\beta$ will not be estimated correctly: OLS is inconsistent for the estimation of the original parameter.

The second mistake is to model the process for $x_t$ and substitute this process into (18.1). Assume that the variable $x_t$ follows an AR(2) process, like $x_t = a_1 x_{t-1} + a_2 x_{t-2} + n_t$, where $n_t \sim NID(0, \sigma^2)$. Substitution gives

$$y_t = \beta a_1 x_{t-1} + \beta a_2 x_{t-2} + e_t = \pi_1 x_{t-1} + \pi_2 x_{t-2} + e_t. \tag{18.2}$$

This estimated model also gives the wrong results if we are interested in estimating the (deep) behavioral parameter $\beta$. The variables $x_{t-1}$ and $x_{t-2}$ are not weakly exogenous for the parameter of interest ($\beta$) in this case. The estimated parameters will be a mixture of the deep behavioral parameter and the parameters of the expectations generating process ($a_1$ and $a_2$).

Not only are the estimates biased, but policy conclusions based on this estimated model will also be misleading. If the parameters of the marginal model ($a_1$ and $a_2$) describe some policy reaction function, say a particular type of money supply rule, changing this rule, i.e. changing $a_1$ and $a_2$, will also change the estimated coefficients $\pi_1 = \beta a_1$ and $\pi_2 = \beta a_2$. This is a typical example of when super exogeneity does not hold, and when an estimated model cannot be used to form policy recommendations.
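The point can be illustrated with a small simulation. In the sketch below, the deep parameter $\beta$ is held fixed while the parameters of the marginal AR(2) process are changed, and the regression of $y_t$ on $x_{t-1}$ and $x_{t-2}$ is re-estimated. All numerical values and the function name `simulate_and_fit` are assumptions chosen for illustration.

```python
import numpy as np

def simulate_and_fit(a1, a2, beta=2.0, T=20_000, seed=0):
    """Simulate y_t = beta * E(x_t | past) + e_t and regress y_t on x_{t-1}, x_{t-2}."""
    rng = np.random.default_rng(seed)
    n = rng.standard_normal(T)
    x = np.zeros(T)
    for t in range(2, T):
        x[t] = a1 * x[t - 1] + a2 * x[t - 2] + n[t]    # marginal (policy) process
    x_lag1, x_lag2 = x[1:-1], x[:-2]                   # x_{t-1} and x_{t-2} for t = 2..T-1
    x_expected = a1 * x_lag1 + a2 * x_lag2             # rational expectation of x_t
    y = beta * x_expected + rng.standard_normal(T - 2)
    X = np.column_stack([x_lag1, x_lag2])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

coef_old = simulate_and_fit(a1=0.5, a2=0.2)   # old policy rule
coef_new = simulate_and_fit(a1=0.2, a2=0.1)   # new policy rule, same beta

print("estimated coefficients, old rule:", np.round(coef_old, 3))
print("estimated coefficients, new rule:", np.round(coef_new, 3))
```

The estimated coefficients are close to $\beta a_1$ and $\beta a_2$ in each regime, so they shift when the policy rule shifts even though $\beta$ itself has not changed.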

What is the solution to this dilemma of estimating "deep" behavioral parameters, in order to understand the working of the economy better?

1. One conclusion is that econometrics will not work. The problems of correctly specifying the expectation process, in combination with short samples, make it impossible to use econometrics to estimate deep parameters. A better alternative is to construct micro-based theoretical models and simulate these models (for example, using calibration techniques).

2. Sims' solution was to advocate VAR models, and avoid estimating deep parameters. VAR models can then be used to increase our understanding of the economy, and be used to simulate the consequences of unpredictable events, like monetary or fiscal policy shocks, in order to optimize policy.

3. Though the rational expectations critique (Lucas, Sims and others) seems to be devastating for structural econometric modeling, the critique has yet to be proven. In surprisingly many situations, policy changes appear to have small effects on estimated equations, e.g. the effects of the switch in monetary policy in the UK in the early 1980s.

4. Finally, the assumption of rational expectations provides a priori information that can be used to formulate an econometric model from the beginning. There are, in principle, three ways in which one can approach this problem: i) substitution, ii) system estimation based on the Full Information Maximum Likelihood (FIML) estimator, or iii) use of the Generalized Method of Moments (GMM) estimator.

Substitution means replacing the expected explanatory variable with an expectation. This expectation could either be a survey expectation or an expectation generated by a forecasting model, e.g. an ARIMA model. The FIML method can be said to build the econometric forecast into an estimated system. The GMM estimator builds on the assumption that the explanatory variable and the residuals are orthogonal to each other. Since rational expectations implies that the (rationally expected) explanatory variables are orthogonal to the residuals, the GMM estimator is well suited for rational expectations models. Because of this it is the preferred choice when it comes to estimating rational expectations models, especially in finance applications.

ECONOMETRICS AND RATIONAL EXPECTATIONS


18.0.3

The substitution approach is perhaps the easiest way of modeling rational expectations. The approach is to find an estimate of $E\{x_t \mid I_t\}$. The simplest approach is to let the information set contain only historical values of $x_t$. As an example, suppose that $x_t$ is an AR(1) process, so $x_t = \alpha_1 x_{t-1} + v_t$ where $v_t$ is $NID(0, \sigma^2)$. The estimated process gives the estimates $\hat{x}_t$ that can be substituted into equation (18.1). The outcome of the substitution is

$$y_t = \beta \hat{x}_t + u_t \quad \text{where} \quad u_t = e_t - \beta(\hat{x}_t - x_t^e) = e_t - \beta(v_t - \hat{v}_t). \tag{18.3}$$

... $\hat{x}_t$ is weakly exogenous w.r.t. $\beta$.

FIML estimation builds on substituting $x_t^e$ with the actual value $x_t$ and estimating this equation simultaneously with the marginal model for $x_t$, say the AR(1) model assumed in the substitution example above.

GMM and Instrumental Variables techniques start by substituting the expected value ($x_t^e$) with the actual observation ($x_t$), and then deal with the errors-in-variables problem. The key to the solution lies in the assumption that the difference between the expectation and the actual outcome is orthogonal to the information set used, which is the basic assumption for the method of moments estimator. The variables in the marginal process and the possible exogenous variables in the conditional model can then be used as instruments in the estimation of $\beta$.
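A small simulation can make the instrumental variables idea concrete. The sketch below is an illustration under assumed values (an AR(1) marginal process with coefficient 0.8 and a deep parameter $\beta = 1.5$, neither taken from the text). It compares OLS on the observed $x_t$ with a just-identified method of moments estimator that uses $x_{t-1}$ as the instrument, that is, the sample analogue of $E[x_{t-1}(y_t - \beta x_t)] = 0$.

```python
import numpy as np

# Hypothetical data generating process.
rng = np.random.default_rng(1)
T, a, beta = 50_000, 0.8, 1.5
v = rng.standard_normal(T)          # expectation errors: x_t = x_t^e + v_t
e = rng.standard_normal(T)          # structural errors

x = np.zeros(T)
for t in range(1, T):
    x[t] = a * x[t - 1] + v[t]      # marginal AR(1) process

x_lag = x[:-1]                      # x_{t-1}: in the information set, used as instrument
x_now = x[1:]                       # x_t: observed outcome
x_expected = a * x_lag              # x_t^e: rational expectation of x_t
y = beta * x_expected + e[1:]       # agents react to the expectation, not the outcome

# Mistake: OLS with the observed x_t in place of x_t^e (errors in variables).
beta_ols = (x_now @ y) / (x_now @ x_now)

# Method of moments / IV: solve sum_t x_{t-1} (y_t - beta * x_t) = 0 for beta.
beta_iv = (x_lag @ y) / (x_lag @ x_now)

print("true beta:", beta)
print("OLS estimate:", round(beta_ols, 3))
print("IV estimate: ", round(beta_iv, 3))
```

In this setup OLS converges to $\beta a^2$ rather than $\beta$, because $x_t$ contains the expectation error $v_t$, while the instrument $x_{t-1}$ is orthogonal to $v_t$ and recovers $\beta$.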

18.0.4

(To be completed)

Tests concerning given values of $x_t^e$.

Given some values of the expectation process, there are three types of tests that can be performed.

1. Test if the difference between the expectation and the outcome is a martingale difference process, conditional on assumptions regarding risk premiums.

2. Test for news. Under the assumption of rational expectations the expectations-driven variable should only react to unpredictable events ("news") but not to events that can be predicted. These assumptions are directly testable as soon as we have a forecasting model for $x_t^e$.

3. Variance bounds tests. Again, given $x_t^e$, it follows that the variance of $y_t$ in equation (18.1) must be higher than the variance of $x_t^e$.

Encompassing tests

If a model based on taking account of assumed rational expectations behavior is the correct model, it follows that this model should encompass other models which lack this feature. Thus, encompassing tests can be used to discriminate between models based on rational expectations and other models.

Tests of super exogeneity

Under rational expectations, in the form of forward-looking behavior, the parameters of the conditional model will change whenever the parameters of the marginal model change. First, if it can be established that the conditional model is stable while the marginal model changes, this would be evidence against the rational expectations assumption, at least in the form of forward-looking behavior. In the same way, it is possible to test for joint changes/shifts in the marginal and conditional models.

1. Are rational expectations important? The answer is that it depends on your problem. If you really want to estimate a stochastic phenomenon derived from theory, especially in finance, it is important to take rational expectations into account. It has to be at least weakly rational expectations, because nobody has found any solid evidence against weak rational expectations. However, if you want to forecast or do standard structural modelling you can test for super exogeneity, and thereby also for rational expectations. Ericsson and Hendry (1989), Ericsson and Irons (1995), and Ericsson and Hendry (1997) do this for almost all instances of radical economic policy changes and find no evidence of the structural breaks in the econometric models predicted by the rational expectations theory. Thus, in practice it is not a big problem unless you want it to be a big problem.


This section describes a research strategy for finding a well-defined statistical model of the DGP, which also has an economic interpretation.

I. Start from theory!

Economic theory gives the parameters of interest and the relevant variables for estimating these parameters. Furthermore, theory suggests interesting long-run equilibria, homogeneity conditions etc. It is important to remember that theories are constructions of the human mind. The available data, on the other hand, is the real world. But there might not be a one-to-one mapping between the variables of the real world and theory, no matter how good the theory might be. Aggregation over time and individual units, adjustment costs, measurement errors etc. will affect the estimated model.

II. Determine the order of integration and type of non-stationarity among the variables.

Are some or all variables non-stationary? What type of non-stationarity? The null should be integrated of order one, unless there is sufficient evidence to reject this hypothesis. Once you know the order of integration you know how to organise variables into meaningful statistical relations. You can test for cointegration, or co-trending, and with this knowledge formulate stationary relations where standard inference is possible, and where you can separate long-run relations (or alternatively permanent shocks) and short-term relations.

The golden rule is that if a variable looks like I(1), treat it like an I(1) variable unless you have clear evidence to reject that hypothesis.

III. Build a VAR and test for cointegration among integrated variables.

Cointegration tests aim at identifying long-run stable (stationary) economically interesting relationships among the variables. This can be done 1) in the form of testing specific relations such as PPP, consumption function, money demand etc., or 2) in the case of building and modeling systems, in a "complete system" or by dividing your problem into separate variables such as domestic inflation, money demand, economic growth etc. Remember the (asymptotic) property of co-integrating relations: if you find them, they exist even if you add more variables to the model.

This requires building a VAR and testing for cointegration. The VAR will also be the point of departure for formulating a reduced form VECM and then a structural VECM, or single structural equations.

The critical step is to find a suitable order of the VAR (number of lags). The principle is to work from general to specific models, and search for parsimonious models. For cointegration tests a lag order of 2 is the minimum and often optimal. Sometimes identifying extreme outliers and using impulse or step dummies will help to cure both non-normality and autocorrelation in all equations. If it is not possible to get rid of autocorrelation with a small number of lags (perhaps in combination with dummies and seasonals), the alternative is to focus on second best. Autocorrelation in these equations is very bad for modelling, but it might not be possible to achieve both no autocorrelation and a parsimonious model with sufficient degrees of freedom for inference. In that situation, the relevant question is how much of the variation in the left hand side variables it is optimal to model in order to get a near-well-identified statistical model?

A RESEARCH STRATEGY


equations as possible; hopefully this will include that the vector test for no error autocorrelation is not rejected. In this case, study the F-test for the significance of each lag across the equations in the model. Look at the LR test for comparing lag orders in the VAR and, most importantly, choose the model with the smallest information criteria and the smallest residual autocorrelation.1 And, when you test the lag structure, look at the I(1) test for cointegration and study the estimated matrix for possible economically interesting co-integrating vectors, $\beta' x_{t-1}$. Quite often you will see a stable vector coming up quite independently of the lag order and autocorrelation in some residuals.

Once the co-integrating rank is determined it remains to identify the estimated co-integrating vectors. If there is only one vector this is relatively simple. If there is more than one vector, the vectors should fulfil the rank condition for identification of co-integrating vectors. This is explained in the work of Juselius and Johansen, and in more advanced textbooks on econometric time series. The golden rule is that the vectors should be unique (look different from each other) and, through the alpha values, determine a left-hand variable. This is achieved by first choosing a suitable normalization, imposing unit elasticities and/or the same value but opposite signs, and restricting some parameters to be zero in some vectors. (Remember that the size of the coefficients is not related to their significance.)

If co-integration is not found?

Rethink the problem. Have you forgotten some important explanatory variable? Look for outliers and test their effects. Use dummies, trends etc. if they can be motivated. Look for structural breaks and check the sample size.

Use first (and/or second) differences instead, to get a model with only stationary I(0) variables that leads to estimated parameters with well-defined distributions. You then have to conclude that your model might not be good for long-run analysis.

Continue with the modeling process to get the least bad of all possible models, at least. If possible, show that there may be strong a priori information that justifies the model. Add that cointegration is only an asymptotic result, and that your sample is too short.

Consider stopping the modelling, and conclude that the absence of cointegration is an interesting conclusion in itself! (Data problem, wrong theory, missing explanatory factors etc.) Do not waste too much time on a problem where the answers will be dependent on ad hoc assumptions concerning distributions, or on unstable results which will be totally model dependent.

If you find cointegration, continue by testing for long-run homogeneity assumptions, weak exogeneity and identification. This can be done by using Johansen's multivariate co-integrating technique. If there is more than one vector, think about identification of the vectors.

IV. Decide on single or simultaneous model

There are no good tests for weak exogeneity. Typically a good test of simultaneity requires the specification of the complete model to work. And then the work is already done.

1 In PcGive 12 you need to indicate in the "Option" window under Model choice that you want information criteria for each model. Then, when you press "Progress", you will see both the F-test for lag order and the information criteria for the different VAR models you estimated.


If you reduce to a single equation (or very limited systems), can you motivate the weak exogeneity assumptions?

The reduced form VECM gives you ideas about what a system might look like, and not look like, through the estimated (significant) alpha values.

It is possible to test for predictability in the VECM by looking at the estimated alpha values, and to argue for reductions of the system.

Of course, from the reduced form VECM the logical step is to construct a simultaneous structural model based on testing the order and the rank condition in the model. However, this can be a bit of a challenge, especially if you are short of time. Furthermore, identification must be done on significant parameters (including lags), not on the underlying theoretical lag structure.

V. Set up the Error Correction Representation.

In the following we assume that you have chosen to continue with a single equation.

Use the results from Johansen's multivariate cointegration technique, then formulate an ECM model directly.

Test for cointegration in the ADL representation of the model (PcGive test). It is necessary to choose lag lengths long enough to get white noise residuals. Test if the residuals are $NID(0, \sigma^2)$, and add a RESET test if possible. Having white noise innovation error terms is a necessary condition.

If not white noise innovation?

Add more lags.

Did you forget something important?

Study outliers. Use dummies and trends to get white noise. But remember that they should be motivated.

Or continue to the least bad of all possible models, see above.

Rethink the problem or stop. RESET test!! (Perhaps you should try to condition on some other variable instead?)

When white noise is established:

Is the equation in line with what you think can be an economically meaningful long-run equilibrium? Check signs and sizes of parameters.

Remove insignificant variables (t-values below 1.0 to begin with). Start at low lags. Go from general to specific.

Check misspecification/specification during reductions. Run the test summary after each reduction. In PcGive all reductions are saved under Progress.

It is all about "Data Mining", but done efficiently, building on the empirical approach for ARIMA models introduced by Box-Jenkins, and new developments in statistical theory. Modern mathematical statistical theory explains how you can go about finding a Data Generating Process by reversing the sampling process in classical statistics. Textbooks: Spanos, Mittelhammer.


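$\Delta = 1 - L$ and $1 = \Delta + L$.

So if you have, as an example, $\beta_1 x_{t-1} - \beta_2 x_{t-2}$, where $\beta_1 \approx \beta_2$ (or no significant difference) with different signs on the lags, this is also $\beta_1 \Delta x_{t-1} + (\beta_1 - \beta_2) x_{t-2}$, and if $|\beta_1| \approx |\beta_2|$ then

$$\beta_1 \Delta x_{t-1} + (\beta_1 - \beta_2) x_{t-2} \approx \beta_1 \Delta x_{t-1}.$$

Hence, you save one degree of freedom under these conditions.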

VII. Test the stability of the model

Use the recursive estimation methods in PcGive. Remember that this is also useful during the identification of cointegrating vectors. For instance, it will allow you to see if you need to put (restricted) impulse dummies in the co-integrating vector.

VIII. Test for rival models. Encompassing tests.

Does your model explain the results, and the failure, of other rival models? Encompassing tests imply a comparison of the goodness of fit between different models, based on different explanatory variables. The reduction process might lead to several models with white noise residuals. To discriminate between these models they have to be tested against each other.

IX. Test for super exogeneity (rational expectations), if you want.

Establish the stability of the conditional model without using ad hoc trends or dummies (= criteria for stability).

Test for instability in the marginal model. If it is unstable while the conditional model is stable, you have super exogeneity. If the marginal model is unstable you can go one step further by forcing the marginal model to be stable by imposing trends and dummies in such a way that it becomes stable. Then put these trends and dummies into the conditional model and test if they are significant there. If not, you have super exogeneity, and can reject parts of the assumptions in the rational expectations theory.

X. STOP when you find a model that is consistent with the data chosen, and where the parameters make economic sense.

In other words, "a well-defined statistical model". That is a model with white noise innovation residuals and stable parameters, which also encompasses all other rival models, and which has an economic meaning. Encompassing means that your model explains other models and picks up more of the variation in the dependent variables.

XI. Report your results, both parameters and misspecification tests.

It is not sufficient to report only $R^2$ and DW-values. Show the (corresponding) test summary and graphs of your data, in levels and first differences, and of the error terms, etc.

Be open minded and inform the reader of the tests and the problems you have found. Don't try to prove things which one can easily reject by a simple test. The rule is to minimize the number of assumptions behind your model, and remember that the errors are the outcome of the formulation of the model.


20. REFERENCES

Anderson, T.W. (1971) The Statistical Analysis of Time Series, John Wiley & Sons, New York.

Anderson, T.W. (1984) An Introduction to Multivariate Statistical Analysis, John Wiley & Sons, New York.

Banerjee, A., J. Dolado, J.W. Galbraith and D.F. Hendry (1993) Cointegration, Error-Correction and the Econometric Analysis of Non-stationary Data, Oxford University Press, Oxford.

Baillie, Richard T. and Tim Bollerslev (1994) The Long Memory of the Forward Premium, Journal of International Money and Finance 13 (5), p. 565-571.

Baillie, Richard T., Tim Bollerslev and Hans Ole Mikkelsen (1996) Fractionally Integrated Generalized Autoregressive Conditional Heteroskedasticity, Journal of Econometrics 74, p. 3-30.

Banerjee, A., R.L. Lumsdaine and J.H. Stock (1992) Recursive and Sequential Tests of the Unit Root and Trend Break Hypotheses: Theory and International Evidence, Journal of Business and Economic Statistics ?.

Cheung, Y. and K. Lai (1993) Finite Sample Sizes of Johansen's Likelihood Ratio Tests for Cointegration, Oxford Bulletin of Economics and Statistics 55, p. 313-328.

Cheung, Y. and K. Lai (1995) A Search for Long Memory in International Stock Market Returns, Journal of International Money and Finance 14 (4), p. 597-615.

Davidson, James (1994) Stochastic Limit Theory, Oxford University Press, Oxford.

Dickey, D. and W.A. Fuller (1979) Distribution of the Estimators for Autoregressive Time Series with a Unit Root, Journal of the American Statistical Association 74.

Diebold, F.X. and G.D. Rudebusch (1989) Long Memory and Persistence in Aggregate Output, Journal of Monetary Economics 24 (September), p. 189-209.

Eatwell, J., M. Milgate and P. Newman, eds. (1990) Econometrics, Macmillan, London.

Eatwell, J., M. Milgate and P. Newman, eds. (1990) Time Series and Statistics, Macmillan, London.

Engle, Robert F., ed. (1995) ARCH: Selected Readings, Oxford University Press, Oxford.

Engle, R.F. and C.W.J. Granger, eds. (1991) Long-Run Economic Relationships: Readings in Cointegration, Oxford University Press, Oxford.

Engle, R.F. and B.S. Yoo (1991) Cointegrated Economic Time Series: An Overview with New Results, in R.F. Engle and C.W.J. Granger, eds., Long-Run Economic Relationships: Readings in Cointegration, Oxford University Press, Oxford.

Ericsson, Neil R. and John S. Irons (1994) Testing Exogeneity, Oxford University Press, Oxford.

Fuller, Wayne A. (1996) Introduction to Statistical Time Series, John Wiley & Sons, New York.

Freund, J.E. (1972) Mathematical Statistics, 2nd ed., Prentice/Hall, London.

Granger and Newbold (1986) Forecasting Economic Time Series, Academic Press, San Diego.


Hamilton, James D. (1994) Time Series Analysis, Princeton University Press, Princeton, New Jersey.

Hargreaves, Colin P., ed. (1994) Nonstationary Time Series Analysis and Cointegration, Oxford University Press, Oxford.

Harvey, A. (1990) The Econometric Analysis of Time Series, Philip Allan, New York.

Hendry, David F. (1995) Dynamic Econometrics, Oxford University Press, Oxford.

Hylleberg, Svend (1992) Modelling Seasonality, Oxford University Press, Oxford.

Johansen, Søren (1995) Likelihood-Based Inference in Cointegrated Vector Autoregressive Models, Oxford University Press, Oxford.

Johnston, J. (1984) Econometric Methods, McGraw-Hill, Singapore.

Kwiatkowski, D., P.C.B. Phillips, P. Schmidt and Y. Shin (1992) Testing the Null Hypothesis of Stationarity Against the Alternative of a Unit Root, Journal of Econometrics 54, p. 159-178.

Lo, Andrew W. (1991) Long-Term Memory in Stock Market Prices, Econometrica 59 (5: September), p. 1279-1313.

Maddala, G.S. (1988) Introduction to Econometrics, Macmillan, New York.

Morrison, D.F. (1967) Multivariate Statistical Methods, McGraw-Hill, New York.

Pagan, A.R. and M.R. Wickens (1989) Econometrics: A Survey, Economic Journal, 1-113.

Park, J.Y. (1990) Testing for Unit Roots and Cointegration by Variable Addition, in T.B. Fomby and G.F. Rhodes (eds.) Co-integration, Spurious Regressions, and Unit Roots: Advances in Econometrics 8, JAI Press, New York.

Perron, Pierre (1989) The Great Crash, the Oil Price Shock and the Unit Root Hypothesis, Econometrica 57, 1361-1401.

Phillips, P.C.B. (1988) Reflections on Econometric Methodology, The Economic Record - Symposium on Econometric Methodology, December, 344-359.

Sjö, Boo (2000) Testing for Unit Roots and Cointegration, memo.

Sowell, F.B. (1992) Modeling Long-Memory Behavior with the Fractional ARMA Model, Journal of Monetary Economics 29 (April), p. 277-302.

Spanos, A. (1986) Statistical Foundations of Econometric Modelling, Cambridge University Press, Cambridge.

Wei, William W.S. (1990) Time Series Analysis: Univariate and Multivariate Methods, Addison-Wesley Publishing Company, Redwood City.

20.1 APPENDIX 1

A1 Smoothing Time Series: Lag Windows.

In the discussion about non-stationarity, different ways of removing the trend in a time series were shown. If the trend is removed from, say, GDP we are left with swings in the data that can be identified as business cycles. In time series analysis such cycles are referred to as low frequency or periodic components. Applications of smoothing filters arise in empirical studies of real business cycles, and in modelling financial variables such as daily interest rates, where for example news about inflation and other variables occur only at monthly intervals and might cause monthly cycles in the data.1 Smoothing methods, of course, are related closely to spectral analysis. In this appendix we concentrate on two filters, or lag windows, which represent the best, or most commonly used, methods for time series in the time domain.

Start from a time series, $r_t$. What we are looking for is some weights $b_i$ such that the filtered series $x_t$ is free of low frequency components,

$$x_t = \sum_{i=-k}^{+k} b_i r_{t+i}. \qquad (20.1)$$

In this formula the window is applied both backwards and forwards, implying a combination of backward and forward looking behavior. Whether this is a good or a bad thing depends totally on the series at hand, and is left to the judgment of the econometrician. The alternative is to let the window end at time i = 0. The literature is filled with methods of calculating the weights $b_i$; in this appendix we will look at the two most commonly used methods, the Parzen window and the Tukey-Hanning window.

The Parzen window is calculated using the following weights,

$$w_i = \begin{cases} 1 - 6(i/k)^2 + 6(|i|/k)^3, & |i| \le k/2, \\ 2(1 - |i|/k)^3, & k/2 \le |i| \le k, \\ 0, & |i| > k, \end{cases}$$

where k is the size of the lag window. The Parzen window tries to fit a third degree polynomial to the original series.

An alternative is the so-called Tukey-Hanning window, calculated as,

$$w_i = \begin{cases} \tfrac{1}{2}\,[1 + \cos(\pi i/k)], & |i| \le k, \\ 0, & |i| > k. \end{cases}$$

Like the Parzen window, the weights need to be normalized. Under optimal conditions, that is the correct identification of underlying cycles, the difference between $x_t$ and $r_t$ will appear as a normal distribution. The problem is to determine the bandwidth, the size of the window, or k in the formula above. Unfortunately there is no easy way to determine this in practice. Choosing the size of the lag window involves a trade-off between the bias in the mean and the variance of the smoothed series. The larger the window, the smaller the variance but the higher the bias. In practice, make sure that the weights at the end of the window are close to zero, and then judge the best fit from comparing $x_t - r_t$. As a rule of thumb, choose a bandwidth equal to $N^{2/5}$, the number of observations (N) raised to the power of 2 over 5. The alternative rule is to set the bandwidth equal to $N^{1/4}$, or make a decision based on the last significant autocorrelation. Since the choice of the window is always ad hoc in some sense, great care is needed if the smoothed series is going to be used to reveal correlations of great economic consequence.
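The two windows and the two-sided filter in (20.1) can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names, the artificial series `r` and the choice to drop the first and last k observations (where the window does not fit) are assumptions made here. The weights are normalized to sum to one, and the bandwidth follows the $N^{2/5}$ rule of thumb.

```python
import math


def parzen_weights(k):
    """Parzen weights w_i for i = -k, ..., k, normalized to sum to one."""
    raw = []
    for i in range(-k, k + 1):
        u = abs(i) / k
        if u <= 0.5:
            raw.append(1 - 6 * u ** 2 + 6 * u ** 3)
        else:
            raw.append(2 * (1 - u) ** 3)
    total = sum(raw)
    return [w / total for w in raw]


def tukey_hanning_weights(k):
    """Tukey-Hanning weights w_i for i = -k, ..., k, normalized to sum to one."""
    raw = [0.5 * (1 + math.cos(math.pi * i / k)) for i in range(-k, k + 1)]
    total = sum(raw)
    return [w / total for w in raw]


def smooth(r, weights):
    """Two-sided filter x_t = sum_i b_i r_{t+i}; the k end points are dropped."""
    k = (len(weights) - 1) // 2
    return [
        sum(b * r[t + i] for b, i in zip(weights, range(-k, k + 1)))
        for t in range(k, len(r) - k)
    ]


# Artificial series: a slow cycle plus a period-two disturbance (hypothetical data).
r = [math.sin(t / 10) + 0.3 * (-1) ** t for t in range(200)]
k = round(len(r) ** (2 / 5))          # rule-of-thumb bandwidth N^(2/5)
x = smooth(r, parzen_weights(k))
print("bandwidth:", k, "smoothed observations:", len(x))
```

Both sets of weights go to zero at the ends of the window, which is the practical check recommended above.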

APPENDIX II

Testing the Random Walk Hypothesis using the Variance Ratio Test.

For a random walk, $x_t = x_{t-1} + \varepsilon_t$, where $\varepsilon_t \sim NID(0, \sigma^2)$, we have that the variance is $\sigma^2 t$ and that the autocovariance function is $cov(x_t, x_{t-k}) = (t-k)\sigma^2$. It follows that $var(x_t - x_{t-1}) = \sigma_1^2$, and that $var(x_t - x_{t-k}) = \sigma_1^2 k$. Defining $\sigma_k^2 = \frac{1}{k} var(x_t - x_{t-k})$, for a random walk we get the estimated variance ratio

$$VR(k) = \frac{\hat\sigma_k^2}{\hat\sigma_1^2},$$

1 To be clear, we are not saying that daily interest rates necessarily contain monthly cycles,

only that it might be the case. One example is daily observations of the Swedish overnight

interbank rate.


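where

$$\hat\sigma_1^2 = \frac{1}{T}\sum_{t=1}^{T}(x_t - x_{t-1} - \hat\mu)^2, \qquad (20.2)$$

$$\hat\sigma_k^2 = \frac{1}{k(T-k+1)\left(1-\frac{k}{T}\right)}\sum_{t=k}^{T}(x_t - x_{t-k} - k\hat\mu)^2, \qquad (20.3)$$

the asymptotic variance of the random variable VR(k) is,

$$\phi(k) = \frac{2(2k-1)(k-1)}{3kT} \qquad (20.4)$$

$$Z(k) = \frac{VR(k) - 1}{[\phi(k)]^{1/2}} \overset{a}{\to} N(0,1), \qquad (20.5)$$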

Since many time series, especially in finance, show time-varying heteroscedasticity, the test statistic needs to be modified to take this into account. Lo and MacKinlay (1988) show that a heteroscedasticity consistent estimator of the asymptotic variance is given as,

$$\phi^*(k) = \sum_{j=1}^{k-1}\left[\frac{2(k-j)}{k}\right]^2 \hat\delta(j) \qquad (20.6)$$

where

$$\hat\delta(j) = \frac{\sum_{t=j+1}^{T}(x_t - x_{t-1} - \hat\mu)^2\,(x_{t-j} - x_{t-j-1} - \hat\mu)^2}{\left[\sum_{t=1}^{T}(x_t - x_{t-1} - \hat\mu)^2\right]^2} \qquad (20.7)$$

$$Z^*(k) = \frac{VR(k) - 1}{[\phi^*(k)]^{1/2}} \overset{a}{\to} N(0,1), \qquad (20.8)$$

where n is some chosen fraction of the total number of observations. Since the test statistic only holds asymptotically, Monte Carlo simulations of limited samples are recommended. Under the null hypothesis of a random walk, it will not be possible to reject the hypothesis that Z(k) or Z*(k) is equal to zero.
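The homoscedastic version of the test, (20.2) to (20.5), can be sketched as follows. The code is illustrative only: the function name `variance_ratio`, the simulated series and the drift estimate $\hat\mu = (x_T - x_0)/T$ (the mean first difference) are assumptions made for this example, since the estimator of $\hat\mu$ is not spelled out above. The lag k must be at least 2, because $\phi(1) = 0$.

```python
import math
import random


def variance_ratio(x, k):
    """Return (VR(k), Z(k)) for levels x_0, ..., x_T, with k >= 2."""
    T = len(x) - 1                                   # number of first differences
    mu = (x[T] - x[0]) / T                           # assumed drift estimate
    s1 = sum((x[t] - x[t - 1] - mu) ** 2 for t in range(1, T + 1)) / T
    scale = k * (T - k + 1) * (1 - k / T)
    sk = sum((x[t] - x[t - k] - k * mu) ** 2 for t in range(k, T + 1)) / scale
    vr = sk / s1
    phi = 2 * (2 * k - 1) * (k - 1) / (3 * k * T)    # asymptotic variance (20.4)
    return vr, (vr - 1) / math.sqrt(phi)


rng = random.Random(7)  # fixed seed, an arbitrary choice for reproducibility

# Random walk: VR(k) should be close to one.
walk = [0.0]
for _ in range(5000):
    walk.append(walk[-1] + rng.gauss(0, 1))
vr_rw, z_rw = variance_ratio(walk, 4)

# White noise in levels: differences are negatively autocorrelated, so VR < 1.
noise = [rng.gauss(0, 1) for _ in range(5001)]
vr_wn, z_wn = variance_ratio(noise, 2)

print("random walk VR(4):", round(vr_rw, 3), "Z:", round(z_rw, 2))
print("white noise VR(2):", round(vr_wn, 3), "Z:", round(z_wn, 2))
```

As recommended above, a Monte Carlo loop around `variance_ratio` with the actual sample size would give small-sample critical values instead of relying on the normal approximation.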

APPENDIX III OPERATORS

When dealing with random variables and series of data, there are some operators that simplify the work. This chapter presents the rules of some common operators applied to random variables and series of observations. These are the expectations operator, the variance operator, the covariance operator, the lag operator, the difference operator, and the sum operator.2 The formal proofs behind these operators are not given; instead the chapter states the basic rules for using the operators.

All operators serve the basic purpose of simplifying the calculations and communication involving random variables. Take the expectations operator (E) as an example. Writing $E(x_t)$ means the same as "I will calculate the mean (or the first moment) of the observations on the random variable $\tilde{X}$".3 But I am not telling exactly which specific estimator I would be using, if I were to estimate the mean from empirical data, because in this context it is not important.

One important use of operators is in investigating the properties of estimators under different assumptions concerning the underlying process. For instance, the properties of the OLS estimator when the explanatory variables are stochastic, when the variables in the model are trending etc.

20.2.1

The first operator is the expectations operator. This is a linear operator and is therefore easy to apply, as shown by the following rules. In the following, let c and k be two non-random constants, $\mu_i$ is the mean of the variable i and $\sigma_{ij}$ is the covariance between variable i and variable j. It follows that,

$E(c) = c.$

$E(c\tilde{X}) = cE(\tilde{X}) = c\mu_x.$

$E(k + c\tilde{X}) = k + cE(\tilde{X}) = k + c\mu_x.$

$E(\tilde{X} + \tilde{Y}) = E(\tilde{X}) + E(\tilde{Y}) = \mu_x + \mu_y.$

$E(\tilde{X}\tilde{Y}) = E(\tilde{X})E(\tilde{Y}) + covar(\tilde{X}, \tilde{Y}) = \mu_x\mu_y + \sigma_{xy},$

where $\sigma_{xy} = 0$ if $\tilde{X}$ and $\tilde{Y}$ are independent. Compare this with the expectation of $\tilde{X}^2$,

$E(\tilde{X}^2) = E(\tilde{X})E(\tilde{X}) + var(\tilde{X}) = \mu_x^2 + \sigma_x^2.$

The expectations operator is linear and straightforward to use, with one important exception: the expectation of a ratio. This is an important exception since it represents a quite common problem. $E(\tilde{Y}/\tilde{X})$ is not equal to $E(\tilde{Y})/E(\tilde{X})$. The problem is that the numerator and the denominator are not necessarily independent. In this situation it is necessary to use the p lim operator; alternatively, let the number of observations go to infinity and use convergence in probability or distribution to analyze the outcome. In the derivation of the OLS estimator, the following transformation is often used, when X is viewed as given, $E[\tilde{Y}/\tilde{X}] = E[(1/\tilde{X})\tilde{Y}] = E(\tilde{W}\tilde{Y})$.

For example, if F is the forward exchange rate, and S is the spot rate, $E(F/S) \neq E(F)/E(S)$. However, $E(\ln F - \ln S) = E(\ln F) - E(\ln S)$.
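A two-state example makes the point about ratios concrete. The numbers are hypothetical and chosen only so that the arithmetic is exact: F is fixed at 2 and S is 1 or 3 with probability one half each. The expectation of the ratio differs from the ratio of expectations, while the expectation of the log difference splits into two separate expectations.

```python
import math
from fractions import Fraction

# Hypothetical outcomes (F, S); each occurs with probability 1/2.
outcomes = [(Fraction(2), Fraction(1)), (Fraction(2), Fraction(3))]
prob = Fraction(1, 2)


def expect(values):
    """Expectation of a discrete variable with equally likely outcomes."""
    return sum(prob * v for v in values)


e_of_ratio = expect(f / s for f, s in outcomes)                  # E(F/S)
ratio_of_e = expect(f for f, _ in outcomes) / expect(s for _, s in outcomes)

# Logs: E(ln F - ln S) against E(ln F) - E(ln S), in floating point.
e_log_diff = sum(0.5 * (math.log(float(f)) - math.log(float(s))) for f, s in outcomes)
diff_e_log = (sum(0.5 * math.log(float(f)) for f, _ in outcomes)
              - sum(0.5 * math.log(float(s)) for _, s in outcomes))

print("E(F/S)    =", e_of_ratio)
print("E(F)/E(S) =", ratio_of_e)
print("log rule holds:", abs(e_log_diff - diff_e_log) < 1e-12)
```

Here E(F/S) is 4/3 while E(F)/E(S) is 1, so treating the two as equal would bias the result even in this trivial case.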

2 The

3 Notice the difference between an estimator and an estimate.

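20.2.2

For the variance operator, var(.) or V(.), we have the following rules,

$var(c) = 0.$

$var(c\tilde{X}) = c^2 var(\tilde{X}) = c^2\sigma_x^2.$

$var(k + c\tilde{X}) = c^2 var(\tilde{X}) = c^2\sigma_x^2.$

$var(\tilde{Y} + \tilde{X}) = var(\tilde{Y}) + var(\tilde{X}) + 2cov(\tilde{Y}, \tilde{X}) = \sigma_y^2 + \sigma_x^2 + 2\sigma_{yx}.$

If $\tilde{Y}$ and $\tilde{X}$ are independent,

$var(\tilde{Y} + \tilde{X}) = var(\tilde{Y}) + var(\tilde{X}) + 2cov(\tilde{Y}, \tilde{X}) = \sigma_y^2 + \sigma_x^2.$

20.2.3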

The covariance operator (cov) has already been used above. It can be thought of as a generalization of the variance operator. Suppose we have two elements of $\tilde{X}$, call them $\tilde{X}_i$ and $\tilde{X}_j$. The elements can be two random variables in a multivariate process, or refer to observations at different times (i) or (j) of the same univariate time series process. The covariance between $\tilde{X}_i$ and $\tilde{X}_j$ is

$$cov(\tilde{X}_i, \tilde{X}_j) = E\{[\tilde{X}_i - E(\tilde{X}_i)][\tilde{X}_j - E(\tilde{X}_j)]\} = \sigma_{ij},$$

[To be completed!]

The covariance matrix of a random variable $\tilde{X}$ with p elements can be defined as,

$$E\{[\tilde{X} - E(\tilde{X})][\tilde{X} - E(\tilde{X})]'\} = \begin{bmatrix} \sigma_{11} & \sigma_{12} & \cdots & \sigma_{1p} \\ \sigma_{21} & \sigma_{22} & \cdots & \sigma_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{p1} & \sigma_{p2} & \cdots & \sigma_{pp} \end{bmatrix}$$

where $\sigma_{ii} = \sigma_i^2$, the variance of the i:th element.

Like the expectations and the variance operators, there are some simple rules. If we add constants, a and b, to $\tilde{X}_i$ and $\tilde{X}_j$,

$$cov(\tilde{X}_i + a, \tilde{X}_j + b) = cov(\tilde{X}_i, \tilde{X}_j).$$

If we multiply $\tilde{X}_i$ and $\tilde{X}_j$ with the constants (a) and (b) respectively, we get,

$$cov(a\tilde{X}_i, b\tilde{X}_j) = ab\,cov(\tilde{X}_i, \tilde{X}_j).$$

The covariance operator is sometimes also written as C( ).
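The rules for the variance and covariance operators can be checked numerically. The sketch below uses sample moments with divisor n, for which the identities hold exactly up to rounding error; the helper names, the constants and the simulated data are assumptions made for this illustration only.

```python
import random


def mean(v):
    return sum(v) / len(v)


def cov(u, v):
    """Sample covariance with divisor n (so that var(v) == cov(v, v))."""
    mu, mv = mean(u), mean(v)
    return sum((p - mu) * (q - mv) for p, q in zip(u, v)) / len(u)


def var(v):
    return cov(v, v)


rng = random.Random(1)  # fixed seed, an arbitrary choice
x = [rng.gauss(0, 1) for _ in range(1000)]
y = [0.5 * p + rng.gauss(0, 1) for p in x]       # correlated with x by construction
a, b, c, k = 2.0, -3.0, 4.0, 7.0

checks = {
    "var(k + cX) = c^2 var(X)":
        (var([k + c * p for p in x]), c ** 2 * var(x)),
    "var(Y + X) = var(Y) + var(X) + 2cov(Y, X)":
        (var([p + q for p, q in zip(x, y)]), var(y) + var(x) + 2 * cov(y, x)),
    "cov(X + a, Y + b) = cov(X, Y)":
        (cov([p + a for p in x], [q + b for q in y]), cov(x, y)),
    "cov(aX, bY) = ab cov(X, Y)":
        (cov([a * p for p in x], [b * q for q in y]), a * b * cov(x, y)),
    "E(XY) = E(X)E(Y) + cov(X, Y)":
        (mean([p * q for p, q in zip(x, y)]), mean(x) * mean(y) + cov(x, y)),
}
for rule, (lhs, rhs) in checks.items():
    print(rule, "->", abs(lhs - rhs) < 1e-9)
```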

20.2.4

The sum operator is,

$$\sum_{i=m}^{n} \qquad (20.9)$$

The important characteristic of the sum operator is that it is linear; all proofs of the following rules of the sum operator build on this fact.

If k is a constant,

$$\sum_{i=1}^{n} k x_i = k \sum_{i=1}^{n} x_i. \qquad (20.10)$$

Some important rules deal with series of integer numbers, like a deterministic time trend t = 1, 2, ..., T. These are of interest when dealing with integrated variables and determining the order of probability, that is the order of convergence, here indicated with O(.);

$$\sum_{t=1}^{T} t = (1/2)[T(T+1)] = O(T^2) \qquad (20.11)$$

$$\sum_{t=1}^{T} t^2 = (1/6)[T(T+1)(2T+1)] = O(T^3) \qquad (20.12)$$

$$\sum_{t=1}^{T} t^3 = 1^3 + 2^3 + \dots + T^3 = O(T^4). \qquad (20.13)$$
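A quick numerical check of these orders: the sketch below compares brute-force sums with the standard closed forms and shows that the sum of $t^p$ divided by $T^{p+1}$ settles near $1/(p+1)$, which is what the $O(T^{p+1})$ statements mean in practice. The helper name `power_sum` is made up for this example, and the closed form for the cubes, $[T(T+1)/2]^2$, is the standard result rather than something stated in the text.

```python
def power_sum(T, p):
    """Brute-force sum of t**p for t = 1, ..., T."""
    return sum(t ** p for t in range(1, T + 1))


# Closed forms agree with the brute-force sums.
for T in (1, 2, 10, 250):
    assert power_sum(T, 1) == T * (T + 1) // 2
    assert power_sum(T, 2) == T * (T + 1) * (2 * T + 1) // 6
    assert power_sum(T, 3) == (T * (T + 1) // 2) ** 2

# Normalizing by T^(p+1) gives a finite limit, 1/(p+1).
T = 10_000
ratios = [power_sum(T, p) / T ** (p + 1) for p in (1, 2, 3)]
for p, ratio in zip((1, 2, 3), ratios):
    print(f"sum t^{p} / T^{p + 1} = {ratio:.5f}  (limit {1 / (p + 1):.5f})")
```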

20.2.5

limited samples these requirements will not always be met. To investigate what happens as the sample size increases towards infinity we use probability limits.

If $\hat\theta$ is an estimate of the true parameter $\theta$, we say that the estimator $E(\hat\theta)$ is consistent if the probability that we estimate $\theta$ as the sample size increases to infinity is equal to one. That is, as the sample size approaches the population size, we should end up with the parameter describing the population and nothing else.

Formally this can be stated as: the estimator $E(\hat\theta)$ is a consistent estimator of $\theta$ if, for arbitrarily small (positive) numbers $\varepsilon$ and $\delta$, there exists a sample size ($n_0$) such that,

$$Prob[\,|\hat\theta - \theta| < \varepsilon\,] > 1 - \delta \quad \text{for } n > n_0. \qquad (20.14)$$

$$\lim_{n\to\infty} Prob[\,|\hat\theta - \theta| < \varepsilon\,] = 1 \qquad (20.15)$$

$$\hat\theta \overset{p}{\to} \theta \quad \text{or} \quad p\lim \hat\theta = \theta. \qquad (20.16)$$

Probability limits are useful for examining the asymptotic properties of estimators of stationary processes. There are a few simple rules to follow,

$$p\lim(ax + by) = a\,p\lim(x) + b\,p\lim(y), \qquad (20.17)$$

(20.18)

(20.19)

$$p\lim(x^{-1}) = [p\lim(x)]^{-1} \qquad (20.20)$$

$$p\lim(x^2) = [p\lim(x)]^2. \qquad (20.21)$$

$$p\lim(AB) = p\lim(A)\,p\lim(B), \qquad (20.22)$$

$$p\lim(A^{-1}) = [p\lim(A)]^{-1} \qquad (20.23)$$

These rules hold regardless of whether the variables are independent or not.

20.2.6

$$L^n x_t = x_{t-n}.$$

$$L^{-n} x_t = x_{t+n}.$$

With the lag operator it becomes possible to write long lag structures in a simpler way.

From the lag operator follows the difference operator

$$\Delta = 1 - L, \qquad \Delta x_t = x_t - x_{t-1},$$

such that

$$x_t = \Delta x_t + x_{t-1},$$

or as,

$$x_{t-1} = x_t - \Delta x_t.$$

$$\Delta^d x_t = (1 - L)^d x_t$$

Setting d = 2 we get,

$$\Delta^2 x_t = (1 - L)^2 x_t = (1 - 2L + L^2)x_t = x_t - 2x_{t-1} + x_{t-2} = \Delta x_t - \Delta x_{t-1}.$$

The letter d indicates the order of differencing, which can be an integer such as -2, -1, 0, 1 and 2. It is also possible to use real numbers, typically between -1.5 and +1.5. With non-integer differencing we come to fractional integration, and so-called long-memory series. If variables are expressed in logs, which is the typical thing in time series, the first difference will be a close approximation to per cent growth.

The lag operator is sometimes called the backward shift operator and is then indicated with the symbol $B^n$. The difference operator, defined with the backward shift operator, is written as $\nabla^d = (1 - B)^d$. Econometricians use the terms lag operator and difference operator with the symbols above. Time series statisticians often use the backward shift notation.
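The difference operator for integer d is easy to try out. The sketch below is a minimal illustration with made-up numbers; the function name `diff` is an assumption of this example. It applies $(1 - L)$ repeatedly, checks the expansion $x_t - 2x_{t-1} + x_{t-2}$ for d = 2, and compares the first difference of logs with the actual growth rate.

```python
import math


def diff(x, d=1):
    """Apply (1 - L)^d by differencing d times; each pass drops one observation."""
    for _ in range(d):
        x = [x[t] - x[t - 1] for t in range(1, len(x))]
    return x


x = [1.0, 4.0, 9.0, 16.0, 25.0]
second = diff(x, 2)
expanded = [x[t] - 2 * x[t - 1] + x[t - 2] for t in range(2, len(x))]
print("second difference:", second)
print("same as x_t - 2x_{t-1} + x_{t-2}:", second == expanded)

# Hypothetical level series growing by 2 and then 3 per cent.
levels = [100.0, 102.0, 102.0 * 1.03]
growth = [levels[t] / levels[t - 1] - 1 for t in range(1, len(levels))]
dlog = diff([math.log(v) for v in levels])
for g, dl in zip(growth, dlog):
    print(f"growth {g:.4f}  vs  log difference {dl:.4f}")
```

The approximation is close for small growth rates and deteriorates as the rates get larger.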
