
Lectures in Modern Economic Time Series Analysis. 2 ed. ©

Bo Sjö
Linköping, Sweden
October 30, 2011

1 Introduction
1.1 Outline of this Book/Text/Course/Workshop
1.2 Why Econometrics?
1.3 Junk Science and Junk Econometrics

2 Introduction to Econometric Time Series
2.1 Programs
2.2 Different types of time series
2.3 Repetition - Your First Courses in Statistics and Econometrics

Basic Statistics

3 Time Series Modeling - An Overview
Statistical Models
Random Variables
Moments of random variables
Popular Distributions in Econometrics
Analysing the Distribution
Multidimensional Random Variables
Marginal and Conditional Densities
The Linear Regression Model - A General Description

4 The Method of Maximum Likelihood
4.1 MLE for a Univariate Process
4.2 MLE for a Linear Combination of Variables

5 The Classical tests - Wald, LM and LR tests

Time Series Modeling

6 Random Walks, White noise and All That
6.1 Different types of processes
6.2 White Noise
6.3 The Log Normal Distribution
6.4 The ARIMA Model
6.5 The Random Walk Model
6.6 Martingale Processes
6.7 Markov Processes
6.8 Brownian Motions
6.9 Brownian motions and the sum of white noise
The geometric Brownian motion
A more formal definition

7 Introduction to Time Series Modeling
7.1 Descriptive Tools for Time Series
7.1.1 Weak and Strong Stationarity
7.1.2 Weak Stationarity, Covariance Stationary and Ergodic Processes
7.1.3 Strong Stationarity
7.1.4 Finding the Optimal Lag Length and Information Criteria
7.1.5 The Lag Operator
7.1.6 Generating Functions
7.1.7 The Difference Operator
7.1.8 Filters
7.1.9 Dynamics and Stability
7.1.10 Fractional Integration
7.1.11 Building an ARIMA Model. The Box-Jenkins Approach
7.1.12 Is the ARMA model identified?
7.2 Theoretical Properties of Time Series Models
7.2.1 The Principle of Duality
7.2.2 Wold's decomposition theorem
7.3 Additional Topics
7.3.1 Seasonality
7.3.2 Non-stationarity
Aggregation
Overview of Single Equation Dynamic Models

8 Multipliers and Long-run Solutions of Dynamic Models

9 Vector Autoregressive Models
9.0.1 How to estimate a VAR?
9.0.2 Impulse responses in a VAR with non-stationary variables and cointegration
9.1 BVAR, TVAR etc.
Granger Non-causality Tests

10 Introduction to Exogeneity and Multicollinearity
10.1 Exogeneity
10.1.1 Weak Exogeneity
10.1.2 Strong Exogeneity
10.1.3 Super Exogeneity
10.2 Multicollinearity and understanding of multiple regression

11 Univariate Tests of The Order of Integration
11.0.1 The DF-test
11.0.2 The ADF-test
11.0.3 The Phillips-Perron test
11.0.4 The LMSP-test
11.0.5 The KPSS-test
11.0.6 The G(p, q) test
11.1 The Alternative Hypothesis in I(1) Tests
11.2 Fractional Integration

12 Non-Stationarity and Co-integration
12.0.1 The Spurious Regression Problem
12.0.2 Integrated Variables and Co-integration
12.0.3 Approaches to Testing for Co-integration

13 Integrated Variables and Common Trends

14 A Deeper Look at Johansen's Test

15 The Estimation of Dynamic Models
Deterministic Explanatory Variables
The Deterministic Trend Model
Stochastic Explanatory Variables
Lagged Dependent Variables
Lagged Dependent Variables and Autocorrelation
The Problems of Dependence and the Initial Observation
Estimation with Integrated Variables

16 Encompassing

17 ARCH Models
17.0.1 Practical Modelling Tips
17.1 Some ARCH Theory
17.2 Some Different Types of ARCH and GARCH Models
17.3 The Estimation of ARCH models

18 Econometrics and Rational Expectations
18.0.1 Rational vs. other Types of Expectations
18.0.2 Typical Errors in the Modeling of Expectations
18.0.3 Modeling Rational Expectations
18.0.4 Testing Rational Expectations

19 A Research Strategy

20 References
20.1 APPENDIX 1
20.2 Appendix III Operators
20.2.1 The Expectations Operator
20.2.2 The Variance Operator
20.2.3 The Covariance Operator
20.2.4 The Sum Operator
20.2.5 The Plim Operator
20.2.6 The Lag and the Difference Operators




He who controls the past controls the future. George Orwell in


Please respect that this is work in progress. It has never been my intention to
write a commercial book, or a perfect textbook in time series econometrics. It is
simply a collection of lectures in a popular form that can serve as a complement
to ordinary textbooks and articles used in education. The parts dealing with
tests for unit roots (order of integration) and cointegration are not well developed.
These topics have a memo of their own "A Guide to testing for unit roots and
When I started to put these lecture notes together some years ago I decided
on the title "Lectures in Modern Time Series Econometrics" because I thought that
the contents were a bit "modern" compared to standard econometric textbooks.
During the fall of 2010, as I started to update the notes, I thought that it was
time to remove the word "modern" from the title. A quick look in Damodar
Gujarati's textbook "Basic Econometrics" from 2009 convinced me to keep the
word "modern" in the title. Gujarati's text on time series hasn't changed since the
1970s even though time series econometrics has changed completely since the 70s.
Thus, under these circumstances I see no reason to change the title, at least not
There are four ways in which one can do time series econometrics. The first is to use
the approach of the 1970s: view your time series model just like any linear regression, and impose a number of ad hoc restrictions that will hide all the problems you
find. This is not a good approach. It is only found in old textbooks
and never in today's research. You might only see it used in journals of very low
scientific quality. Second, you can use theory to derive a time series model, and interesting parameters, that you then estimate with appropriate estimators. Examples
of this are to derive utility functions, assume that agents have rational expectations,
etc. This is a proper research strategy. However, it typically takes good data,
and you need to be original in your approach, but you can get published in good
journals. The third approach is simply to do a statistical description of the data
series, in the form of a vector autoregressive system, or a reduced form of the vector
error correction model. This system can be used for forecasting, for analysing relationships among data series, and for investigating the effects of unforeseen shocks such
as drastic changes in energy prices, money supply etc. The fourth way is to go
beyond the vector autoregressive system and try to estimate structural parameters
in the form of elasticities and policy intervention parameters. If you forget about
the first method, the choice depends on the problem at hand and how you choose to
formulate it. This book aims at telling you how to use methods three and four.
The basic thinking is that your data is the real world, while theories are abstractions
that we use to understand the real world. In applied econometric time series you
should always strive to build well-defined statistical models, that is, models that
are consistent with the data chosen. There is a complex statistical theory behind
all this, which I will try to popularize in this book. I do not see this book as a
substitute for an ordinary textbook. It is simply a complement.

1.1 Outline of this Book/Text/Course/Workshop

This book is intended for people who have taken a basic course in statistics and
econometrics, either at the undergraduate or at the graduate level. If you did an
undergraduate course I assume that you did it well. Econometrics is a type of
course where every lecture, and every textbook chapter, leads to the next level. The
best way to learn econometrics is to be active: read several books and work on your
own with econometric software. No teacher can teach you how to run a piece of software.
That is something you have to learn on your own by practicing how to use the
software. There is some very good software out there. The outline
differs between the graduate and Ph.D. levels mainly in the theoretical parts. At
the Ph.D. level, there is more stress on theoretical backgrounds.
1) I will begin by talking about why econometrics is different from statistics,
and why econometric time series is different from the econometrics you meet in
many basic textbooks.
2) I will repeat very briefly basic statistics and linear regression, and stress
what you should know in terms of testing and modeling dynamic models. For
most students that will imply going back and doing some quick repetition.
3) Introduction to statistical theory, including maximum likelihood, random
variables, density functions and stochastic processes.
4) Basic time series properties and processes.
5) Using and understanding ARFIMA and VAR modelling techniques.
6) Testing for non-stationarity in the form of stochastic trends, i.e. testing for unit roots.
7) The spurious regression problem
8) Testing and understanding cointegration.
9) Testing for Granger non-causality
10) The theory of reduction, exogeneity and building dynamic models and
11) Modelling time varying variances, ARCH and GARCH models
12) The implications and consequences of rational expectations on econometric
13) Non-linearities
14) Additional topics
For most of these topics I have developed more or less self-instructing exercises.

1.2 Why Econometrics?

Why is there a subject called econometrics? Why study econometrics, instead
of statistics? Why not let the statisticians teach statistics, and in particular time
series techniques? These are common questions, raised during seminars and in private, by students, statisticians and economists. The answer is that each scientific
area tends to create its own special methodological problems, often heavily interrelated with theoretical issues. These problems, and the ways of solving them, are
important in a particular area of science but not necessarily in others. Economics
is a typical example, where the formulation of the economic and the statistical
problem is deeply interrelated from the beginning.
In everyday life we are forced to make decisions based on limited information.
Most of our decisions deal with an uncertain stochastic future. We all base our
decisions on some view of the economy where we assume that certain events are
linked to each other in more or less complex ways. Economists call this a model
of the economy. We can describe the economy and the behavior of individuals in terms of multivariate stochastic processes. Decisions based on stochastic
sequences play a central role in economics and in finance. Stochastic processes are
the basis for our understanding of the behavior of economic agents and of how
their behavior determines the future path of the economy. Most econometric
textbooks deal with stochastic time series as a special application of the linear regression technique. Though this approach is acceptable for an introductory course in
econometrics, it is unsatisfactory for students with a deeper interest in economics
and finance. To understand the empirical and theoretical work in these areas, it
is necessary to understand some of the basic philosophy behind stochastic time
series.
This work is a work in progress. It is based on my lectures on Modern Economic Time Series Analysis at the Department of Economics, first at the University
of Gothenburg and later at the University of Skövde and Linköping University in
Sweden. The material is not ready for widespread distribution. This work, most
likely, contains lots of errors; some are known to the author, and some are not
yet detected. The different sections do not necessarily follow in a logical order.
Therefore, I invite anyone who has opinions about this work to share them with me.
The first part of this work provides a repetition of some basic statistical concepts, which are necessary for understanding modern economic time series analysis.
The motive for repeating these concepts is that they play a larger role in econometrics than many contemporary textbooks in econometrics indicate. Econometrics did not change much from the first edition of Johnston in the 60s until the
revised version of Kmenta in the mid 80s. However, the critique against the use of econometrics delivered by Sims, Lucas, Leamer, Hendry
and others, in combination with new insights into the behavior of non-stationary
time series and the rapid development of computer technology, has revolutionized
econometric modeling and resulted in an explosion of knowledge. The demands for
writing a decent thesis, or a scientific paper, based on econometric methods have
risen far beyond what one can learn in an introductory course in econometrics.

1.3 Junk Science and Junk Econometrics

In the media you often hear about this and that being proved by scientific research.
In the late 1990s newspapers reported that someone had proved that genetically modified
(GM) food could be dangerous. The news spread quickly, and according to
the story the original article had been stopped from being published by scientists
with suspicious motives. Various lobby groups immediately jumped up: GM food
was dangerous, should be banned, and more money should go into this line of
research. What had happened was the following. A researcher claimed to have
shown that GM food was bad for health. He presented these results to a number
of media people, who distributed them. (Remember the fuss about cold
fusion.) The results were presented in a paper sent to a scientific journal for
publication. The journal, however, did not publish the article. It was dismissed
because the results were not based on a sound scientific method. The researcher
had fed rats with potatoes. One group of rats got GM potatoes, the other group
got normal non-GM potatoes. The rats that got GM potatoes seemed
to develop cancer more often than the control group. The statistical difference
between the groups was not big, but sufficiently big for those wanting to confirm
their a priori beliefs that GM food is bad. A somewhat embarrassing detail, never
reported in the media, is that rats in general do not like potatoes. As a consequence
both groups of rats in this study were suffering from starvation, which severely
affected the test. It was not possible to determine if the difference between the two
groups was caused by starvation, or by GM food. Once the researcher conditioned
on the effects of starvation, the difference became insignificant. This is an example
of junk science: bad science getting a lot of media exposure because the results
fit the interests of lobby groups, and can be used to scare people.
The lesson for econometricians is obvious: if you come up with good results
you get rewarded, while bad results can quickly be forgotten. The
GM food example is extreme compared with econometric work. Econometric research seldom gets
such media coverage, though there are examples, such as the claim that Sweden's economic growth
is lower than that of other similar countries, or the assumed dynamic effects of a reduction
of marginal taxes. There are significant results that depend on one single outlier.
Once the outlier is removed, the significance is gone, and the whole story behind
this particular book is also gone.
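The point about a single outlier can be illustrated with a small sketch. The data below are made up for illustration only (they are not from any study mentioned in the text), and the helper `ols_slope_t` is a hypothetical name; it simply computes the textbook OLS slope and its t-statistic for a regression with a constant.

```python
import math

def ols_slope_t(x, y):
    """Return (slope, t-statistic) from a simple OLS regression of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    se = math.sqrt(rss / (n - 2) / sxx)  # standard error of the slope
    return slope, slope / se

# Made-up data: y fluctuates around 5 and is essentially unrelated to x.
x = list(range(1, 21))
y = [5.0 + (1.0 if xi % 2 else -1.0) for xi in x]

# The same data plus one extreme observation.
x_out = x + [40]
y_out = y + [30.0]

slope_clean, t_clean = ols_slope_t(x, y)
slope_out, t_out = ols_slope_t(x_out, y_out)

print(f"without outlier: slope={slope_clean:.3f}, t={t_clean:.2f}")
print(f"with outlier:    slope={slope_out:.3f}, t={t_out:.2f}")
```

In this constructed example the slope looks clearly "significant" only when the extreme observation is included, which is the kind of fragile result that careful testing is meant to expose.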
In these lectures we will argue that the only way to avoid junk econometrics
is careful and systematic construction and testing of models. Basically, this is the
modern econometric time series approach. Why is this modern, and why stress the
idea of testing? The answers are simply that careers have been built on running
junk econometric equations, and that most people are unfamiliar with scientific methods in
general and with the consequences of living in a world surrounded by random variables
in particular.




"Time is a great teacher, but unfortunately it kills all its pupils" Louis Hector
A time series is simply data ordered by time. For an econometrician, a time series
is usually data that is also generated over time in such a way that time can be
seen as a driving factor behind the data. Time series analysis consists of approaches
that look for regularities in these data ordered by time.
In comparison with other academic fields, the modeling of economic time series
is characterized by the following problems, which partly motivate why econometrics is a subject of its own:
- The empirical sample sizes in economics are generally small, especially compared with many applications in physics or biology. Typical sample sizes
range between 25 and 100 observations. In many areas anything below 500
observations is considered a small sample.
- Economic time series are dependent in the sense that they are correlated with
other economic time series. In economics, problems are almost
never concerned with univariate series. Consumption, as an example, is a
function of income, and at the same time, consumption also affects income
directly and through various other variables.
- Economic time series are often dependent over time. Many series display
high autocorrelation, as well as cross autocorrelation with other variables
over time.
- Economic time series are generally non-stationary. Their means and variances change over time, implying that estimated parameters might follow unknown distributions instead of standard tabulated distributions like the normal distribution. Non-stationarity arises from productivity growth and price
inflation. Non-stationary economic series appear to be integrated, driven by
stochastic trends, perhaps as a result of stochastic changes in total factor productivity. Integrated variables, and in particular the need to model
them, are not that common outside economics. In some situations, therefore,
inference in econometrics becomes quite complicated, and requires the development of new statistical techniques for handling stochastic trends. The
concepts of cointegration and common trends, and the recently developed
asymptotic theory for integrated variables, are examples of this.
- Economic time series cannot be assumed to be drawn from samples in the
way assumed in classical statistics. The classical approach is to start from
a population from which a sample is drawn. Since the sampling process can
be controlled, the variables which make up the sample can be seen as random variables. Hypotheses are then formulated and tested conditionally on
the assumption that the random variables have a specific distribution. Economic time series are seldom random variables drawn from some underlying
population in the classical statistical sense. Observations do not represent
a random sample in the classical statistical sense, because the econometrician cannot control the sampling process of variables. Variables like GDP,
money, prices and dividends are given from history. To get a different sample we would have to re-run history, which of course is impossible. The way
statistical theory deals with this situation is to reverse the approach taken in
classical statistical analysis, and build a model that describes the behavior of
the observed data. A model which achieves this is called a well-defined statistical model. It can be understood as a parsimonious, time invariant model
with white noise residuals that makes sense from economic theory.
- Finally, from the view of economics, the subject of statistics deals mainly
with the estimation and inference of covariances only. The econometrician,
however, must also give estimated parameters an economic interpretation.
This problem cannot always be solved ex post, after a model has been estimated. When it comes to time series, economic theory is an integrated part
of the modeling process. Given a well-defined statistical model, estimated
parameters should represent the behavior of economic agents. Many econometric
studies fail because researchers assume that their estimates can be given an
economic interpretation without considering the statistical properties of the
model, or the simple fact that there is in general not a one-to-one correspondence
between observed variables and the concepts defined in economic theory.¹
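The requirement of white noise residuals in a well-defined statistical model can be checked in practice. The following is a minimal sketch under stated assumptions: the data are simulated from an AR(1) process with coefficient 0.7 (an arbitrary choice), the names `autocorr` and `box_pierce` are illustrative helpers, and the Box-Pierce Q statistic is one of several possible residual checks, not necessarily the one used later in these notes.

```python
import random

def autocorr(x, k):
    """Sample autocorrelation of x at lag k."""
    n = len(x)
    m = sum(x) / n
    den = sum((v - m) ** 2 for v in x)
    num = sum((x[t] - m) * (x[t - k] - m) for t in range(k, n))
    return num / den

def box_pierce(x, lags):
    """Box-Pierce Q statistic: n times the sum of squared autocorrelations."""
    return len(x) * sum(autocorr(x, k) ** 2 for k in range(1, lags + 1))

# Simulated data (assumption): y_t = 0.7 * y_{t-1} + e_t, e_t ~ N(0, 1).
random.seed(1)
n = 300
y = [0.0]
for _ in range(n - 1):
    y.append(0.7 * y[-1] + random.gauss(0.0, 1.0))

# Model A: a constant only, so the dynamics are left in the residuals.
y_bar = sum(y) / n
resid_a = [v - y_bar for v in y]

# Model B: y_t = a + b * y_{t-1} + e_t, estimated by OLS.
lagged, current = y[:-1], y[1:]
lag_bar = sum(lagged) / len(lagged)
cur_bar = sum(current) / len(current)
b_hat = (sum((l - lag_bar) * (c - cur_bar) for l, c in zip(lagged, current))
         / sum((l - lag_bar) ** 2 for l in lagged))
a_hat = cur_bar - b_hat * lag_bar
resid_b = [c - a_hat - b_hat * l for l, c in zip(lagged, current)]

q_a = box_pierce(resid_a, 5)
q_b = box_pierce(resid_b, 5)
CHI2_5DF_5PCT = 11.07  # 5% critical value of chi-square with 5 degrees of freedom

print(f"Q(5), constant only: {q_a:.1f}")
print(f"Q(5), AR(1) model:   {q_b:.1f}")
print(f"rough yardstick:     {CHI2_5DF_5PCT}")
```

The critical value is only a rough yardstick here, since the degrees of freedom should be adjusted for estimated parameters. The point of the sketch is the contrast: the misspecified model leaves strongly autocorrelated residuals, while the model that captures the dynamics leaves far less.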

2.1 Programs

Here is a list of statistical software that you should be familiar with, please google
(those recommended for time series are marked with *):
- *RATS and CATS in RATS, Regression Analysis of Time Series and Cointegration Analysis of Time Series
- *PcGive - Comes highly recommended. Included in the OxMetrics modules; see
also Timberlake Consultants for more programs.
- *Gretl (Free, GNU license, very good for students in econometrics)
- *JMulti (Free, for multivariate time series analysis; updated? The discussion
forum is quite dead)
- *EViews
- Gauss (good for simulation)
- STATA (used by the World Bank; good for microeconometrics and panel data,
OK on time series)
- LIMDEP (Mostly free with some editions of Greene's econometrics
textbook? You need to pay for duration models?)
- SAS - Statistical Analysis System (good for big data sets, but not time series;
mainly medicine, "the calculus program for decision makers")
- Shazam
And more; some are very special programs for this and that, but I don't
find them worth mentioning in this context.
1 For a recent discussion about the controversies in econometrics see The Economic Journal



There is a bunch of software that allows you to program your own models or
use other people's modules:
- Matlab
- R (Free, GNU license, connects with Gretl)
- Ox
You should also know about C, C++, and LaTeX to be a good econometrician.
Please google.
For Data Envelopment Analysis (DEA) I recommend Tom Coelli's DEAP 2.1
or Paul W. Wilson's FEAR.

2.2 Different types of time series

Given the general definition of time series above, there are many types of time series.
The focus in econometrics, macroeconomics and finance is on stochastic time series,
typically in the time domain, which are non-stationary in levels but become what
is called covariance stationary after differencing.
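A small simulation can make "non-stationary in levels but covariance stationary after differencing" concrete. This is a sketch under assumptions: the series is a pure random walk with standard normal innovations, and the number of paths and periods are arbitrary illustrative choices.

```python
import random

random.seed(123)
n_paths, n_periods = 2000, 100

def variance(values):
    """Sample variance across simulated paths."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)

level_10, level_100, diff_100 = [], [], []
for _ in range(n_paths):
    y = 0.0
    e = 0.0
    for t in range(1, n_periods + 1):
        e = random.gauss(0.0, 1.0)   # white noise innovation
        y += e                       # random walk: y_t = y_{t-1} + e_t
        if t == 10:
            level_10.append(y)
    level_100.append(y)
    diff_100.append(e)               # first difference: y_100 - y_99 = e_100

var_level_10 = variance(level_10)
var_level_100 = variance(level_100)
var_diff_100 = variance(diff_100)

print(f"variance of the level at t=10:  {var_level_10:.1f}")
print(f"variance of the level at t=100: {var_level_100:.1f}")
print(f"variance of the first difference at t=100: {var_diff_100:.2f}")
```

Across paths, the variance of the level grows with time, so the level has no constant variance, while the variance of the first difference stays close to the innovation variance.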
In a broad perspective, time series analysis typically aims at making time series
more understandable by decomposing them into different parts. The aim of this
introduction is to give a general overview of the subject. A time series is any
sequence ordered by time. The sequence can be either deterministic or stochastic.
The primary interest in economics is in stochastic time series, where the sequence
of observations is made up by the outcome of random variables. A sequence of
stochastic variables ordered by time is called a stochastic time series process.
The random variables that make up the process can either be discrete random variables, taking on a given set of integer numbers, or be continuous
random variables taking on any real number between $-\infty$ and $+\infty$. While discrete random variables are possible, they are not that common in economic time series.
Another dimension in modeling time series is to consider processes in discrete
time or in continuous time. The principal difference is that stochastic variables
in continuous time can take different values at any time. In a discrete time process,
the variables are observed at fixed intervals of time (t), and they do not change
between these observation points. Discrete time variables are not common in
finance and economics. There are few, if any, variables that remain fixed between
their points of observation. The distinction between continuous time and discrete
time is not a matter of measurability alone. A common mistake is to be confused by the
fact that economic variables are measured at discrete time intervals. The money
stock is generally measured and recorded as an end-of-month value. The way of
measuring the stock of money does not imply that it remains unchanged between
the observation points; instead it changes whenever the money market is open.
The same holds for variables like production and consumption. These activities
take place 24 hours a day, during the whole year. They are measured as the flow
of income and consumption over a period, typically a quarter, representing the
integral sum of these activities.
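The difference between a sampled stock and a measured flow can be written compactly. The notation below is an assumption introduced for illustration: $m(s)$ is the money stock and $c(s)$ the rate of consumption in continuous time $s$, while $m_t$ and $C_t$ are the recorded observations for period $t$.

```latex
\begin{align}
  m_t &= m(t)
      && \text{(stock: the continuous process sampled at the end of period } t\text{)} \\
  C_t &= \int_{t-1}^{t} c(s)\, ds
      && \text{(flow: the integral of the activity over period } t\text{)}
\end{align}
```

In both cases the underlying process moves between observation points; only the recording is discrete.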
Usually, a discrete time variable is written with a time subscript ($x_t$) while
continuous time variables are written as $x(t)$. The continuous time approach has
a number of benefits, but the cost and quality of the empirical results seldom
motivate the continuous time approach. It is better to use discrete time approaches
as an approximation to the underlying continuous time system. The cost of
this simplification is small compared with the complexity of continuous
time analysis. This should not be understood as a rejection of all continuous
time approaches. Continuous time is good for analyzing a number of well-defined
problems like aggregation over time and individuals. In the end it should lead to
a better understanding of adjustment speeds, stability conditions and interactions
among economic time series, see Sjö (1990, 1995).²
In addition, stochastic time series can be analysed in the time domain or
in the frequency domain. In the time domain the data is analysed ordered in
given time periods such as days, weeks, years etc. The frequency approach decomposes time series into frequencies by using trigonometric functions such as sines and cosines.
Spectral analysis is an example of analysis that uses the frequency domain to
identify regularities such as seasonal factors, trends, and systematic lags in adjustment. The main advantage of analysing time series in the frequency domain
is that it is relatively easy to handle continuous time processes and observations
observed as aggregations over time, such as consumption.
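As a rough illustration of the frequency approach, the sketch below computes a simple periodogram with sines and cosines and uses it to find a seasonal cycle. The data are simulated under assumptions chosen for illustration (a 12-period sine wave plus normal noise over 48 observations), and `periodogram` is a hypothetical helper written for this example.

```python
import math
import random

random.seed(0)
n = 48
true_period = 12

# Simulated series (assumption): a seasonal sine wave plus white noise.
y = [math.sin(2 * math.pi * t / true_period) + random.gauss(0.0, 0.3)
     for t in range(n)]
y_bar = sum(y) / n
centered = [v - y_bar for v in y]

def periodogram(x):
    """Return (k, ordinate) pairs for the Fourier frequencies k = 1..n/2."""
    m = len(x)
    out = []
    for k in range(1, m // 2 + 1):
        c = sum(x[t] * math.cos(2 * math.pi * k * t / m) for t in range(m))
        s = sum(x[t] * math.sin(2 * math.pi * k * t / m) for t in range(m))
        out.append((k, (c * c + s * s) / m))
    return out

pgram = periodogram(centered)
peak_k, peak_value = max(pgram, key=lambda pair: pair[1])
peak_period = n / peak_k  # cycles of length n/k periods

print(f"largest ordinate at k={peak_k}, i.e. a cycle of {peak_period:.0f} periods")
```

The largest ordinate picks out the frequency that explains most of the variation, which is how spectral methods reveal seasonal regularities.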
However, in economics and finance we are typically faced with given
observations at given frequencies, and we seek to study the behavior of agents
operating in real time. Under these circumstances, the time domain is the most
interesting road ahead because it has a direct intuitive appeal to both economists
and policy makers.
Our interest is usually in analysing discrete time stochastic processes in the
time domain.
A time series process is generally indicated with brackets, like $\{y_t\}$. In some
situations it is necessary to be more precise about the length of the process. Writing $\{y_t\}_{1}^{\infty}$
indicates that the process starts at period one and continues infinitely.
The process consists of random variables because we can view each element in
$\{y_t\}$ as a random variable. Let the process go from the integer values 1 up to $T$.
If necessary, to be exact, the first variable in the process can be written as $y_{t_1}$, the
second variable as $y_{t_2}$, etc., up until $y_{t_T}$. The distribution function of the process can
then be written as $F(y_{t_1}, y_{t_2}, \ldots, y_{t_T})$.
2 We can also mention the different types of series that are used: stocks, flows and price
variables. Stocks are variables that can be observed at a point in time, like the money stock and
inventories. Flows are variables that can only be observed over some period, like consumption or
GDP. In this context price variables include prices, interest rates and similar variables which can
be observed at a market at a given point in time. Combining these variables into multivariate
processes and constructing econometric models from observed variables in discrete time produces
further problems, and in general they are quite difficult to solve without using continuous time
methods. Usually, careful discrete time models will reduce the problems to a large extent.



In some situations it is necessary to start from the very beginning. A time series
is data ordered by time. A stochastic time series is a set of random variables
ordered by time. Let $\tilde{Y}_{it}$ represent the stochastic variable $\tilde{Y}_i$ given at time $t$.
Observations on this random variable are often indicated as $y_{it}$. In general terms
a stochastic time series is a series of random variables ordered by time. A series
starting at time $t = 1$ and ending at time $t = T$, consisting of $T$ different random
variables, is written as $\{\tilde{Y}_{1,1}, \tilde{Y}_{2,2}, \ldots, \tilde{Y}_{T,T}\}$. Of course, assuming that the series is
built up by individual random variables, with their own independent probability
distributions, is a complex thought. But nothing in our definition of stochastic
time series rules out that the data is made up by completely different random
variables. Sometimes, to understand and find solutions to practical problems, it
will be necessary to go all the way back to the most basic assumptions.
Suppose we are given a time series consisting of yearly observations of interest
rates, $\{6.6, 7.5, 5.9, 5.4, 5.5, 4.5, 4.3, 4.8\}$. The first question to ask is whether this is a stochastic
series in the sense that these numbers were generated by one stochastic process or
perhaps several different stochastic processes. Further questions would be to ask
if the process or processes are best represented as continuous or discrete, and whether the
observations are independent or dependent. Quite often we will assume that the series
are generated by the same identical stochastic process in discrete time. Based on
these assumptions the modelling process tries to find systematic historical patterns
and cross-correlations with other variables in the data.
All time series methods aim at decomposing the series into separate parts in
some way. The standard approach in time series analysis is to decompose as
$$ y_t = T_{t,d} + S_{t,d} + C_{t,d} + I_t , $$
where $T_{t,d}$ and $S_{t,d}$ represent (deterministic) trend and seasonal components, $C_{t,d}$ is
a deterministic cyclical component and $I_t$ is a process representing irregular factors.3
For time series econometrics this definition is limited, since the econometrician
is highly interested in the irregular component. As an alternative, let $\{y_t\}$ be a
stochastic time series process, which is composed as,

systematic components + unsystematic components

$$ T_d + T_s + S_d + S_s + \{y_t\} + e_t , $$

where the systematic components include deterministic trends $T_d$, stochastic trends
$T_s$, deterministic seasonals $S_d$, stochastic seasonals $S_s$, a stationary process (or the
short-run dynamics) $y_t$, and finally a white noise innovation term $e_t$. The modeling
problem can be described as the problem of identifying the systematic components
such that the residual becomes a white noise process. For all series, remember
that any inference is potentially wrong if not all components have been modeled
correctly. This is so regardless of whether we model a simple univariate series
with time series techniques, a reduced system, or a structural model. Inference
is only valid for a correctly specified model.
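The decomposition can be made concrete by building a series from its parts. The following is a minimal sketch, not a method taken from the text: the sample length, the trend slope, the seasonal period of 12, the AR(1) coefficient 0.6 and all variances are assumed values chosen for illustration, and the stochastic seasonal component is left out for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)            # fixed seed so the sketch is reproducible
T = 120                                   # assumed: ten years of monthly observations
t = np.arange(T)

trend_d = 0.05 * t                                # deterministic trend T_d
trend_s = np.cumsum(rng.normal(0.0, 0.1, T))      # stochastic trend T_s (a random walk)
season_d = 0.5 * np.sin(2 * np.pi * t / 12)       # deterministic seasonal S_d

# Stationary short-run dynamics: an AR(1) process with an assumed coefficient of 0.6
shocks = rng.normal(0.0, 0.2, T)
short_run = np.zeros(T)
for i in range(1, T):
    short_run[i] = 0.6 * short_run[i - 1] + shocks[i]

e = rng.normal(0.0, 0.1, T)                       # white noise innovation e_t

# Observed series = systematic components + unsystematic component
y = trend_d + trend_s + season_d + short_run + e
```

In applied work only `y` is observed; the modeling problem is to recover the systematic parts so that what remains behaves like `e`.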

2.3 Repetition - Your First Courses in Statistics and Econometrics

1. To be completed...

3 For simplicity we assume a linear process. An alternative is to assume that the components
are multiplicative, $x_t = T_{t,d} \, S_{t,d} \, C_{t,d} \, I_t$.



In your first course in statistics you learned how to use descriptive statistics:
the mean and the variance. Next you learned to calculate the mean and the variance
from a sample that represents the whole underlying population. For the mean and
the variance to work as a description of the underlying population it is necessary
to construct the sample in such a way that the difference between the sample
mean and the true population mean is non-systematic, meaning that the difference
between the sample mean and the population mean is unpredictable. This means that
your estimated sample mean is a random variable with known characteristics.
The most important thing is to construct a sampling mechanism so that the
mean calculated from the sample has the characteristics you want it to have. That
is, the estimated mean should be unbiased, efficient and consistent. You learned
about random variables, probabilities, distribution functions and frequency distributions.
Your first course in econometrics
"A theory should be as simple as possible, but not simpler" Albert Einstein
To be completed...
Random variables, OLS, minimize the sum of squares, assumptions 1 - 5(6),
understanding, multiple regression, multicollinearity, properties of OLS estimator
Matrix algebra
Tests and solutions for heteroscedasticity (cross-section), and autocorrelation
(time series).
If you took a good course you should have learned the three golden rules: test,
test, test, and learned about the properties of the OLS estimator.
Generalized least squares GLS
System estimation: demand and supply models.
Further extensions:
Panel data, Tobit, Heckit, discrete choice, probit/logit, duration
Time series: distributed lag models, partial adjustment models, error correction models, lag structure, stationarity vs. non-stationarity, co-integration
What you need to know ...
What you probably do not know but should know.
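Ordinary least squares is a common estimation method. Suppose there are two
series $\{y_t, x_t\}$,
$$ y_t = \alpha + \beta x_t + \varepsilon_t . $$
Minimize the sum of squares over the sample $t = 1, 2, \ldots, T$,
$$ S = \sum_{t=1}^{T} \varepsilon_t^2 = \sum_{t=1}^{T} (y_t - \alpha - \beta x_t)^2 . $$
Take the derivative of $S$ with respect to $\alpha$ and $\beta$, set the expressions to zero,
and solve for $\alpha$ and $\beta$:
$$ \hat{\beta} = \frac{s_{xy}}{s_x^2} = \frac{\sum_{t=1}^{T} (x_t - \bar{x})(y_t - \bar{y})}{\sum_{t=1}^{T} (x_t - \bar{x})^2}, \qquad \hat{\alpha} = \bar{y} - \hat{\beta}\bar{x} . $$
The fit is summarized by
$$ R^2 = 1 - \frac{RSS}{TSS} = \frac{ESS}{TSS} , $$
where $TSS$ is the total, $ESS$ the explained and $RSS$ the residual sum of squares.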
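As a check on the first-order conditions, the closed-form solution can be computed directly. This is a small sketch under assumed data: the function name `ols_simple` and the five observations are hypothetical, and the formulas used are the textbook ones (slope equal to the sample covariance of x and y over the sample variance of x, intercept equal to the mean of y minus the slope times the mean of x).

```python
import numpy as np

def ols_simple(x, y):
    """Return (alpha_hat, beta_hat, r_squared) for y = alpha + beta * x + eps."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_bar, y_bar = x.mean(), y.mean()
    # Slope: sum of cross-deviations over the sum of squared deviations of x
    beta_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    # Intercept: the fitted line passes through the point of means
    alpha_hat = y_bar - beta_hat * x_bar
    resid = y - alpha_hat - beta_hat * x
    # R^2 = 1 - RSS / TSS
    r_squared = 1.0 - np.sum(resid ** 2) / np.sum((y - y_bar) ** 2)
    return alpha_hat, beta_hat, r_squared

# Hypothetical data, for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
alpha_hat, beta_hat, r_squared = ols_simple(x, y)
```

The same estimates are returned by a generic least-squares routine such as `np.polyfit(x, y, 1)`, which is a convenient way to verify the algebra.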
Basic assumptions
1) $E(\varepsilon_t) = 0$ for all $t$
2) $E(\varepsilon_t^2) = \sigma^2$ for all $t$
3) $E(\varepsilon_t \varepsilon_{t-k}) = 0$ for all $k \neq 0$
4) $E(X_t \varepsilon_t) = 0$
5) $E(X'X) \neq 0$
6) $\varepsilon_t \sim NID(0, \sigma^2)$
Discuss these properties
Gauss-Markov BLUE
Misspecification, add extra variable, forget relevant variable
Error in variables problem
Homoscedasticity Heteroscedasticity





Part I

Basic Statistics



Economists are generally interested in a small part of what is normally included in
the subject Time Series Analysis. Various techniques such as filtering, smoothing and
interpolation developed for deterministic time series are of relatively minor
interest for economists. Time series econometrics is more focused on the stochastic
part of time series. The following is a brief overview of time series modeling, from
an econometric perspective. It is not a textbook in mathematical statistics, nor is
the ambition to be extremely rigorous in the presentation of statistical concepts.
The aim is more to be a guide for the not yet so informed economist who wants to
know more about the statistical concepts behind time series econometrics.
When approaching time series econometrics the statistical vocabulary quickly
increases and can become overwhelming. These first two chapters seek to make it
possible for people without deeper knowledge in mathematical statistics to read
and follow the econometric and financial time series literature.
A time series is simply a set of observations ordered by time. Time series
techniques seek to decompose this ordered series into different components, which
in turn can be used to generate forecasts, and to learn about the dynamics of the series
and how it relates to other series. There are a number of dimensions and decisions
to keep account of when approaching this subject.
First, the series, or the process, can be univariate or multivariate, depending on
the problem at hand. Second, the series can be stochastic or purely deterministic.
In the former case a stochastic random process is generating the observations.
Third, given that the series is stochastic, with perhaps deterministic components,
it can be modeled in the time domain or in the frequency domain. Modeling in
the frequency domain implies describing the series in terms of cosine functions of
different wave lengths. This is a useful approach for solving some problems, but not
a general approach for economic time series modeling. Fourth, the data generating
process and the statistical model can be constructed in continuous or discrete time.
Continuous time econometrics is good for some problems but not all. In general it
leads to more complex models. A discrete time approach builds on the assumption
that the observed data is unchanged between the intervals of observation. This is
a convenient approximation that makes modeling easier, but comes at a cost in
the form of aggregation biases. However, in the general case, this is a low cost
compared with the costs of general misspecification. A special chapter deals with
the discussion of discrete versus continuous time modeling.
The typical economic time series is a discrete stochastic process modeled in
the time domain. Time series can be modelled by smoothing and filter techniques.
For economists these techniques are generally uninteresting, though we will briefly
come back to the concept of filters.
The simplest way to model an economic time series is to use autoregressive
techniques, or ARIMA techniques in the general case. Most economic time series,
however, are better modeled as a part of a multivariate stochastic process. Economic
theory suggests systems of economic variables, leading to single equation transfer
functions and systems of equations in a VAR model.
These techniques are descriptive; they do not identify structural, or deep, parameters
like elasticities, marginal propensities to consume etc. To estimate more
specific economic models, we turn to techniques such as VECM, SVAR, and structural

What is outlined above is quite different from the typical basic econometric
textbook approach, which starts with OLS and ends in practice with GLS as the
solution to all problems. Here we will develop methods which first describe the
statistical properties of the (joint) series at hand, and then allow the researcher
to answer economic questions in such a way that the conclusions are statistically
and economically valid. To get there we have to start with some basic statistics.

3.1 Statistical Models

A general definition of statistical time series analysis is that it finds a mathematical
model that links observed variables with the stochastic mechanism that generated
the data. This sounds abstract, but the purpose of this abstraction is to understand
the analytical tools of time series statistics. The practical problem is the following:
we have some stochastic observations over time. We know that these observations
have been generated by a process, but we do not know what this process looks
like. Statistical time series analysis is about developing the tools needed to mimic
the unknown data generating process (DGP).
We can formulate some general features of the model. First, it should be a
well-defined statistical model in the sense that the assumptions behind the model
should be valid for the data chosen. Later we will define more exactly what this
implies for an econometric model. For the time being, we can say that the single
most important criterion of a model is that the residuals should be a white noise
process. Second, the parameters of the model should be stable over time. Third,
the model should be simple, or parsimonious, meaning that its functional form
should be simple. Fourth, the model should be parameterized in such a way that
it is possible to give the parameters a clear interpretation and identify them with
events in the real world. Finally, the model should be able to explain other rival
models describing the dependent variable(s).
The way to build a well-defined statistical model is to investigate the underlying
assumptions of the model in a systematic way. It can easily be shown that
t-values, $R^2$, and Durbin-Watson values are not sufficient for determining the fit
of a model. In later chapters we will introduce a systematic test procedure.
The final aim of econometric modelling is to learn about economic behavior. To
some extent this always implies using some a priori knowledge in the form
of theoretical relationships. Economists, in general, have extremely strong a priori
beliefs about the size and sign of certain parameters. This way of thinking has led
to much confusion, because a priori beliefs can be driven too far. Econometrics is
basically about measuring correlations. It is a common misunderstanding among
non-econometricians that correlations can be too high or too low, or be deemed
right or wrong. Measured correlations are the outcome of the data used, only.
Anyone who thinks of an estimated correlation as wrong must also explain what
went wrong in the estimation process, which requires knowledge of econometrics
and the real world.


3.2 Random Variables

The basic reason for dealing with stochastic models rather than deterministic
models is that we are faced with random variables. A popular definition of
random variables goes like this: a random variable is a variable that can take on
more than one value.1 For every possible value that a random variable can take
on there is a number between zero and one that describes the probability that
the random variable will take on this value. In the following a random variable is
indicated with a tilde, as in $\tilde{X}$.
In statistical terms, a random variable is associated with the outcome of a
statistical experiment. All possible outcomes of such an experiment can be called
the sample space. If $S$ is a sample space with a probability measure and if $X$
is a real-valued function defined over $S$, then $X$ is called a random variable.
There are two types of random variables: discrete random variables, which
only take on a specific number of real values, and (absolutely) continuous random
variables, which can take on any value between $-\infty$ and $+\infty$. It is also possible to examine
discontinuous random variables, but we will limit ourselves to the first two types.
If the discrete random variable $\tilde{X}$ can take $k$ values $(x_1, \ldots, x_k)$,
the probability of observing a value $x_j$ can be stated as
$$ P(x_j) = p_j . $$
Since probabilities of discrete random variables are additive, the probability of
observing one of the $k$ possible outcomes is equal to 1.0, or using the notation just introduced,
$$ P(x_1, x_2, \ldots, \text{or } x_k) = p_1 + p_2 + \ldots + p_k = 1 . $$
A discrete random variable is described by its probability function, $f(x_i)$,
which specifies the probability with which $\tilde{X}$ takes on a certain value. (The term
probability distribution is often used synonymously with probability function; the cumulative
distribution function instead gives the probability of observing a value less than or equal to $x_i$.)
In time series econometrics we are in most applications dealing with continuous
random variables. Unlike discrete variables, it is not possible to associate a specific
observation with a certain probability, since these variables can take on an infinite
range of numbers. The probability that a continuous random variable will take
on a certain value is always zero. Because it is continuous we cannot make a
difference between 1.01 and 1.0101 etc. This does not mean that the variables do
not take on specific values. The outcome of the experiment, or the observation, is
of course always a given number.
Thus, for a continuous random variable, statements of the probability of an
observation must be made in terms of the probability that the random variable $\tilde{X}$
is less than or equal to some specific value. We express this with the distribution
function $F(x)$ of the random variable $\tilde{X}$ as follows,
$$ F(x) = P(\tilde{X} \leq x) \quad \text{for} \quad -\infty < x < \infty , $$
which states the probability of $\tilde{X}$ taking a value less than or equal to $x$.
The continuous analogue of the probability function is called the density
function $f(x)$, which we get by differentiating the distribution function with respect to the
observations ($x$),
$$ \frac{dF(x)}{dx} = f(x) . $$

1 Random variables (RV:s) are also called stochastic variables, chance variables, or variates.


The fundamental theorem of integral calculus gives us the following expression
for the probability that $\tilde{X}$ takes on a value less than or equal to $x$,
$$ F(x) = \int_{-\infty}^{x} f(u)\,du . $$
It follows that for any two constants $a$ and $b$, with $a < b$, the probability
that $\tilde{X}$ takes on a value on the interval from $a$ to $b$ is given by
$$ F(b) - F(a) = \int_{-\infty}^{b} f(u)\,du - \int_{-\infty}^{a} f(u)\,du = \int_{a}^{b} f(u)\,du . $$

The term density function is used in a way that is analogous to density in

physics. Think of a rod of variable density, measured by the function f (x). To
obtain the weight of some given length of this rod, we would have to integrate its
density function over that particular part in which we are interested.
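The rod analogy can be checked numerically: integrating the density over an interval should reproduce the difference of the distribution function at the end points. The sketch below assumes a standard normal density purely as an example; the helper names and the interval from -1 to 2 are illustrative choices, and the integral is approximated with a simple trapezoidal rule.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density f(x) of a normal variable with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def normal_cdf(x, mu=0.0, sigma=1.0):
    """Distribution function F(x), written with the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def integrate(f, a, b, n=10_000):
    """Trapezoidal approximation of the integral of f from a to b."""
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    for i in range(1, n):
        total += f(a + i * h)
    return total * h

a, b = -1.0, 2.0
area = integrate(normal_pdf, a, b)        # "weight" of the rod between a and b
prob = normal_cdf(b) - normal_cdf(a)      # F(b) - F(a)
```

Both numbers describe the same probability, so `area` and `prob` agree up to the numerical integration error.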
Random variables are described by their density function and/or by their
moments: the mean, the variance etc. Given the density function, the moments
can be determined exactly. In statistical work, we must first estimate the moments;
from the moments we can learn about the density function. For instance, we can test
if the assumption of an underlying normal density function is consistent with the
observed data.
A random variable can be predicted; in other words, it is possible to form an
expectation of its outcome based on its density function. Appendix III deals with
the expectations operator and other operators related to random variables.

3.3 Moments of random variables

Random variables are characterized by their probability density functions (pdfs)
or their moments. In the previous section we introduced pdfs. Moments refer to
measurements such as the mean, the variance, skewness, etc. If we know the exact
density function of a random variable then we would also know the moments. In
applied work, we will typically first calculate the moments from a sample, and
from the moments figure out the density function of the variables. The term moment
originates from physics and the moment of a pendulum. For our purposes it can be
thought of as a general term which includes the definition of concepts like the mean
and the variance, without referring to any specific distribution. Starting with the
first moment, the mathematical expectation of a discrete random variable is given by
$$ E(\tilde{X}) = \sum_{x} x f(x) , $$
where $E$ is the expectation operator and $f(x)$ is the value of its probability
function at $x$. Thus, $E(\tilde{X})$ represents the mean of the discrete random variable
$\tilde{X}$. Or, in other words, the first moment of the random variable. For a continuous
random variable ($\tilde{X}$), the mathematical expectation is
$$ E(\tilde{X}) = \int_{-\infty}^{\infty} x f(x)\,dx , $$
where $f(x)$ is the value of its probability density at $x$. The first moment can
also be referred to as the location of the random variable. Location is a more
generic concept than the first moment or the mean.
The term moments is used in situations where we are interested in the expected
value of a function of a random variable, rather than the expectation of the
specific variable itself. Say that we are interested in $\tilde{Y}$, whose values are related
to $\tilde{X}$ by the equation $y = g(x)$. The expectation of $\tilde{Y}$ is equal to the expectation
of $g(x)$, since $E(\tilde{Y}) = E[g(x)]$. In the continuous case this leads to
$$ E(\tilde{Y}) = E[g(\tilde{X})] = \int_{-\infty}^{\infty} g(x) f(x)\,dx . $$
Like density, the term moment, or moment about the origin, has its explanation
in physics. (In physics the length of a lever arm is measured as the distance from
the origin. Or if we refer to the example with the rod above, the first moment
would correspond to the horizontal center of gravity of the rod.)
Reasoning from intuition, the mean can be seen as the midpoint of the limits of
the density. The midpoint can be scaled in such a way that it becomes the origin
of the x-axis.
The term moments of a random variable is a more general way of talking
about the mean and variance of a variable. Setting $g(x)$ equal to $x^r$, we get the
r:th moment around the origin,
$$ \mu'_r = E(\tilde{X}^r) = \sum_{x} x^r f(x) $$
when $\tilde{X}$ is a discrete variable. In the continuous case we get
$$ \mu'_r = E(\tilde{X}^r) = \int_{-\infty}^{\infty} x^r f(x)\,dx . $$
The first moment is nothing else than the mean, or the expected value of $\tilde{X}$.
The second moment about the origin, $E(\tilde{X}^2)$, gives the variance once the squared
mean is subtracted. Higher moments give additional information
about the distribution and density functions of random variables.
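Now, defining $g(\tilde{X}) = (\tilde{X} - \mu'_1)^r$, where $\mu'_1 = E(\tilde{X})$ is the mean, we get what is
called the r:th moment about the mean of the distribution of the random variable $\tilde{X}$.
For $r = 0, 1, 2, 3, \ldots$ we get for a discrete variable
$$ \mu_r = E[(\tilde{X} - \mu'_1)^r] = \sum_{x} (x - \mu'_1)^r f(x) , $$
and when $\tilde{X}$ is continuous
$$ \mu_r = E[(\tilde{X} - \mu'_1)^r] = \int_{-\infty}^{\infty} (x - \mu'_1)^r f(x)\,dx . $$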

The second moment about the mean, also called the second central moment,
is nothing else than the variance of $g(x) = x$,
$$ var(\tilde{X}) = \int_{-\infty}^{\infty} [x - E(\tilde{X})]^2 f(x)\,dx = \int_{-\infty}^{\infty} x^2 f(x)\,dx - [E(\tilde{X})]^2 = E(\tilde{X}^2) - [E(\tilde{X})]^2 , $$
where $f(x)$ is the value of the probability density function of the random variable
$\tilde{X}$ at $x$. A more generic expression for the variance is dispersion. We can say that
the second moment, or the variance, is a measure of dispersion, in the same way
as the mean is a measure of location.
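A discrete example makes the distinction between moments about the origin and moments about the mean concrete. The sketch below uses a fair six-sided die, which is an assumed example and not taken from the text; exact fractions are used so that the identity var(X) = E(X^2) - [E(X)]^2 can be verified without rounding error.

```python
from fractions import Fraction

values = [1, 2, 3, 4, 5, 6]       # outcomes of a fair die (assumed example)
prob = Fraction(1, 6)             # f(x) is the same for every outcome

def raw_moment(r):
    """r:th moment about the origin, the sum of x**r * f(x)."""
    return sum(Fraction(x) ** r * prob for x in values)

def central_moment(r):
    """r:th moment about the mean, the sum of (x - mean)**r * f(x)."""
    mu = raw_moment(1)
    return sum((Fraction(x) - mu) ** r * prob for x in values)

mean = raw_moment(1)              # location
variance = central_moment(2)      # dispersion
```

Here `mean` equals 7/2 and `variance` equals 35/12, which is also `raw_moment(2) - mean**2`. The third central moment is zero because the distribution is symmetric.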
The third moment, $r = 3$, measures asymmetry around the mean, referred
to as skewness. The normal distribution is symmetric around the mean. The
likelihood of observing a value above or below the mean is the same for a normal
distribution. For a right skewed distribution, large deviations above the mean are
more likely than equally large deviations below it. For a left skewed
distribution, large deviations below the mean are more likely than equally large
deviations above it.
The fourth moment, referred to as kurtosis, measures the thickness of the
tails of the distribution. A distribution with thicker tails than the normal is
characterized by a higher likelihood of extreme events compared with the normal
distribution. Higher moments give further information about the skewness,
tails and the peak of the distribution. The fifth, the seventh moments etc. give
more information about the skewness. Even moments, above four, give further
information about the thickness of the tails and the peak.

3.4 Popular Distributions in Econometrics

In time series econometrics, and financial economics, there is a small set of
distributions that one has to know. The following is a list of common distributions:

Normal distribution        $N(\mu, \sigma^2)$
Log Normal distribution    $LogN(\mu, \sigma^2)$
Student t distribution     $St(\nu, \mu, \sigma^2)$
Cauchy distribution        $Ca(\mu, \sigma^2)$
Gamma distribution         $Ga(\nu, \mu, \sigma^2)$
Chi-square distribution    $\chi^2(\nu)$
F distribution             $F(d_1, d_2)$
Poisson distribution       $Pois(\lambda)$
Uniform distribution       $U(a, b)$
The pdf of a normal distribution is written as
$$ f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[ -\frac{(x - \mu)^2}{2\sigma^2} \right] . $$
The normal distribution is characterized by the following: the distribution is
symmetric around its mean, and it is completely described by two moments, the mean and the
variance, $N(\mu, \sigma^2)$. The normal distribution can be standardised to have a mean of
zero and variance of unity (say $(x - E(x))/\sigma$) and is consequently called a standardised
normal distribution, $N(0, 1)$.
In addition, it follows that the first four moments, the mean, the variance, the
skewness and kurtosis, are $E(\tilde{X}) = \mu$, $Var(\tilde{X}) = \sigma^2$, $Sk(\tilde{X}) = 0$, and $Ku(\tilde{X}) = 3$.
There are random variables that are not normal by themselves but become
normal if they are logged. The typical examples are stock prices and various
macroeconomic variables. Let $S_t$ be a stock price. The dollar return over a given
interval, $R_t = S_t - S_{t-1}$, is not likely to be normally distributed due to the simple
fact that the stock price is rising over time, partly due to the fact that investors
demand a return on their investment but mostly due to inflation. However, if you
take the log of the stock price and calculate the per cent return (approximately),
$r_t = \ln S_t - \ln S_{t-1}$, this variable is much more likely to have a normal distribution
(or a distribution that can be approximated with a normal distribution). Thus,
since you have taken logs of variables in your econometric models, you have already
worked with log normal variables. Knowledge about log normal distributions is
necessary if you want to model, or better understand, the movements of actual
stock prices and dollar returns.
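The difference between dollar returns and log returns is easy to see on a short price series. The prices below are hypothetical numbers chosen for illustration; the sketch computes the dollar return as the first difference of the price, the log return as the first difference of the log price, and the exact percentage change for comparison.

```python
import numpy as np

prices = np.array([100.0, 102.0, 101.0, 105.0, 110.0])   # hypothetical stock prices S_t

dollar_returns = np.diff(prices)                  # R_t = S_t - S_{t-1}
log_returns = np.diff(np.log(prices))             # r_t = ln S_t - ln S_{t-1}
simple_returns = prices[1:] / prices[:-1] - 1.0   # exact percentage change

# Log returns add up over time: their sum is the log of the total price change
total_log_return = log_returns.sum()
```

For small price changes the log return is close to the percentage change, which is why it is described as an approximate per cent return. The dollar return, in contrast, scales with the price level.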
The Student t distribution is similar to the normal distribution: it is symmetric
around the mean and it has a variance, but it has thicker tails than the normal
distribution. The Student t distribution is described by $(\nu, \mu, \sigma^2)$, where $\mu$ refers to
the mean and $\sigma^2$ refers to the variance. The parameter $\nu$ is called the degrees
of freedom of the Student t distribution and refers to the thickness of the tails. A
random variable that follows a Student t distribution will converge to a normal
random variable as the degrees of freedom, which grow with the number of observations, go to infinity.
The Cauchy distribution is related to the normal distribution and the Student
t distribution. Like the normal it is symmetric and described by two parameters,
but it has fatter tails and is therefore better suited for modelling random variables
which take on relatively more extreme values than the normal. The drawback for
empirical work is that its moments are not defined, meaning that it is difficult
to use empirical moments to test for a Cauchy distribution against, say, the normal
or the Student t distribution.
The gamma and the chi-square distributions are related to variances of normal
random variables. If we have a set of normal random variables $\{\tilde{Y}_1, \tilde{Y}_2, \ldots, \tilde{Y}_\nu\}$
and form a new variable as $\tilde{X} = \tilde{Y}_1^2 + \tilde{Y}_2^2 + \ldots + \tilde{Y}_\nu^2$, then this new variable will
have a gamma distribution, $\tilde{X} \sim Ga(\nu, \mu, \sigma^2)$. A special case of the gamma
distribution is when we have $\mu = 0$ and $\sigma^2 = 1$; the distribution is then called
a chi-square distribution $\chi^2(\nu)$ with $\nu$ degrees of freedom. Thus, take the square
of an estimated regression parameter and divide it by its variance and you get a
chi-square distributed test for significance of the estimated $\beta$,
$(\hat{\beta} / \hat{\sigma}_{\hat{\beta}})^2 \sim \chi^2(\nu)$ with $\nu = 1$.
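The link between squared standard normal variables and the chi-square distribution can be illustrated by simulation. The sketch below is an illustration under assumed settings (5 degrees of freedom, 200,000 draws, a fixed seed). It relies on two known properties of the chi-square distribution, a mean equal to the degrees of freedom and a variance equal to twice the degrees of freedom, and checks that the simulated sums reproduce them approximately.

```python
import numpy as np

rng = np.random.default_rng(42)     # fixed seed so the sketch is reproducible
nu = 5                              # assumed degrees of freedom
n_draws = 200_000                   # assumed number of simulated sums

# Each row holds nu independent standard normal draws
z = rng.standard_normal((n_draws, nu))

# Summing the squares across each row gives one chi-square(nu) draw per row
x = np.sum(z ** 2, axis=1)

sample_mean = x.mean()              # should be close to nu
sample_var = x.var()                # should be close to 2 * nu
```

With a single standard normal variable (`nu = 1`) the same construction gives the distribution of a squared t-ratio in large samples.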
The F distribution comes about when you compare the ratio (or log difference)
of two squared normal random variables. The Poisson distribution is used to model
jumps in the data, usually in combination with a geometric Brownian motion
(jump diffusion models). The typical example is stock prices that might move up
or down drastically. The parameter $\lambda$ measures the probability of a jump in the

3.5 Analysing the Distribution

In practical work we need to know the empirical distribution of the variables we
are working with, in order to make any inference. All empirical distributions can
be analysed with the help of their first four moments. Through the first four moments
we get information first about the mean and the variance and second about the
skewness and kurtosis. The latter moments are often critical when we decide if a
certain empirical distribution should be seen as normal or at least approximately normal.
It is, of course, extremely convenient to work with the assumption of a normal
distribution, since a normal distribution is described by its first two moments
only. In finance, the expected return is given by the mean, and the risk of the
asset is given by its variance. An approximation to the holding period return of
an asset is the log difference of its price. In the case of a normal distribution,
there is no need to consider higher moments. Furthermore, linear combinations of
normal variates result in new normally distributed variables. In econometric work,
building regression equations, the residual process is assumed to be a normally
distributed, independent white noise process, in order to allow for inference and testing.
It is by calculating the sample moments that we learn about the distribution of the
series at hand. The most typical problem in empirical work is to investigate how
well the distribution of a variable can be approximated with a normal distribution.
If the normal distribution is rejected for the residuals in a regression, the typical
conclusion is that there is something important missing in the regression equation.
The missing part is either an important explanatory variable, or the direct cause
of an outlier.
To investigate the empirical distribution we need to calculate the sample moments
of the variable. The sample mean of $\{x_t\} = \{x_1, x_2, \ldots, x_T\}$ can be estimated
as $\hat{\mu}_x = \bar{x} = (1/T) \sum_{t=1}^{T} x_t$. Higher moments can be estimated with the
formula $m_r = (1/T) \sum_{t=1}^{T} (x_t - \bar{x})^r$.2
If a series is normally distributed, $\tilde{X}_t \sim N(\mu_x, \sigma_x^2)$, subtracting the mean and
dividing by the standard deviation leads to a standardised normal variable, distributed
as $\tilde{X} \sim N(0, 1)$. For a standardised normal variable the third and fourth moments
equal 0 and 3, respectively. The standardised third moment is known as skewness,
given as $\sqrt{b_1} = m_3 / m_2^{3/2}$ (so that $b_1 = m_3^2 / m_2^3$). A skewness with a negative value
indicates a left skewed distribution, compared with the normal. If the series is the return on an asset it
means that "bad" or negative surprises dominate over "good" or positive surprises. A
positive value of skewness implies a right skewed distribution. In terms of asset
returns, "good" or positive surprises are more likely than "bad" or negative surprises.
The fourth moment, kurtosis, is calculated as $b_2 = m_4 / m_2^2$. A value above
3 implies that the distribution generates more extreme values than the normal
distribution. The distribution has fatter tails than the normal. Referring to asset
returns, approximating the distribution with the normal would underestimate the
risk associated with the asset.
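The sample moments can be computed in a few lines. The sketch below is illustrative: the function name `sample_moments` is hypothetical, the moments use the divisor T, skewness is the signed ratio of the third moment to the second moment raised to the power 3/2, and kurtosis is the fourth moment over the squared second moment. The data are the eight yearly interest rates quoted earlier in the text; with so few observations the numbers only demonstrate the arithmetic.

```python
import numpy as np

def sample_moments(x):
    """Return (mean, m2, skewness, kurtosis) using moments with divisor T."""
    x = np.asarray(x, dtype=float)
    mean = x.mean()
    dev = x - mean
    # m_r = (1/T) * sum (x_t - mean)**r for r = 2, 3, 4
    m2, m3, m4 = (np.mean(dev ** r) for r in (2, 3, 4))
    skewness = m3 / m2 ** 1.5        # signed, zero for a symmetric sample
    kurtosis = m4 / m2 ** 2          # equals 3 for a normal distribution
    return mean, m2, skewness, kurtosis

rates = [6.6, 7.5, 5.9, 5.4, 5.5, 4.5, 4.3, 4.8]   # interest rate series from the text
mean, m2, skewness, kurtosis = sample_moments(rates)
```

A negative `skewness` would point to a longer left tail, and a `kurtosis` above 3 to fatter tails than the normal distribution.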
An asymptotic test, with a null of a normal distribution, is given by3
$$ JB = T \left[ \frac{m_3^2 / m_2^3}{6} + \frac{\left[ (m_4 / m_2^2) - 3 \right]^2}{24} \right] + T \left[ \frac{3 m_1^2}{2 m_2} - \frac{m_1 m_3}{m_2^2} \right] . $$
This test is known as the Jarque-Bera (JB) test and is the most common
test for normality in regression analysis. The null hypothesis is that the series is
normally distributed, in which case the statistic is asymptotically chi-square distributed
with two degrees of freedom. Let $\mu_1$, $\mu_2$, $\mu_3$ and $\mu_4$ represent the mean, the variance, the
skewness and the kurtosis. The null of a normal distribution is rejected if the test
statistic is significant. The fact that the test is only valid asymptotically means
that we do not know the reason for a rejection in a limited sample. In a less than
asymptotic sample, rejection of normality is often caused by outliers. If we think
the most extreme value(s) in the sample are non-typical outliers, removing them
from the calculation of the sample moments usually results in a non-significant JB
test. Removing outliers is ad hoc. It could be that these outliers are typical
values of the true underlying distribution.
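A sketch of the test for a mean-adjusted series follows. It is an illustration under stated assumptions rather than a full implementation: it uses the common two-term form of the statistic, T times the sum of squared skewness over 6 and squared excess kurtosis over 24, which applies when the mean has been removed. The p-value uses the fact that a chi-square variable with two degrees of freedom has survival function exp(-x/2). The function name, the seed and the two simulated samples are hypothetical.

```python
import math
import numpy as np

def jarque_bera(x):
    """Return (JB statistic, asymptotic p-value) for a mean-adjusted series."""
    x = np.asarray(x, dtype=float)
    T = x.size
    dev = x - x.mean()
    m2, m3, m4 = (np.mean(dev ** r) for r in (2, 3, 4))
    b1 = m3 ** 2 / m2 ** 3          # squared skewness
    b2 = m4 / m2 ** 2               # kurtosis
    jb = T * (b1 / 6.0 + (b2 - 3.0) ** 2 / 24.0)
    p_value = math.exp(-jb / 2.0)   # chi-square(2) survival function
    return jb, p_value

rng = np.random.default_rng(1)                    # fixed seed, illustration only
normal_draws = rng.standard_normal(5_000)         # drawn under the null
skewed_draws = rng.exponential(1.0, 5_000)        # strongly right skewed

jb_normal, p_normal = jarque_bera(normal_draws)
jb_skewed, p_skewed = jarque_bera(skewed_draws)
```

The 5 per cent critical value of the chi-square distribution with two degrees of freedom is about 5.99, so the skewed sample is rejected by a wide margin, while the statistic for the normal sample is far smaller.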
2 For these moments to be meaningful, the series must be stationary. Also, we would like
$\{x_t\}$ to be an independent process. Finally, notice that the estimators of the higher
moments suggested here are not necessarily efficient estimators.
3 This test statistic is for a variable with a non-zero mean. If the variable is adjusted for its
mean (say an estimated residual), the second term should be removed from the expression.



3.6 Multidimensional Random Variables

We will now generalize the work of the previous sections by considering a vector
of $n$ random variables,
$$ \tilde{X} = (\tilde{X}_1, \tilde{X}_2, \ldots, \tilde{X}_n) , $$
whose elements are continuous random variables with density functions $f(x_1), \ldots, f(x_n)$,
and distribution functions $F(x_1), \ldots, F(x_n)$. The joint distribution will
look like
$$ F(x_1, x_2, \ldots, x_n) = \int_{-\infty}^{x_1} \cdots \int_{-\infty}^{x_n} f(x_1, x_2, \ldots, x_n)\,dx_1 \cdots dx_n , $$
where $f(x_1, x_2, \ldots, x_n)$ is the joint density function.

If these random variables are independent, it will be possible to write their
joint density as the product of their univariate densities,
$$ f(x_1, x_2, \ldots, x_n) = f(x_1) f(x_2) \cdots f(x_n) . $$

For these random variables we can define the r:th product moment as
$$ E(\tilde{X}_1^{r_1} \tilde{X}_2^{r_2} \cdots \tilde{X}_n^{r_n}) = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} x_1^{r_1} x_2^{r_2} \cdots x_n^{r_n} f(x_1, x_2, \ldots, x_n)\,dx_1 dx_2 \cdots dx_n , \quad (3.22) $$
which, if the variables are independent, factorizes into the product
$$ E(\tilde{X}_1^{r_1}) E(\tilde{X}_2^{r_2}) \cdots E(\tilde{X}_n^{r_n}) . $$
It follows from this result that the variance of a sum of independent random
variables is merely the sum of the individual variances,
$$ var(\tilde{X}_1 + \tilde{X}_2 + \ldots + \tilde{X}_n) = var(\tilde{X}_1) + var(\tilde{X}_2) + \ldots + var(\tilde{X}_n) . $$
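The additivity of variances under independence can be illustrated by simulation. The three distributions below are assumed examples with known variances (1 for the standard normal, 3 for the uniform on the interval from -3 to 3, and 4 for the exponential with scale 2), so the variance of the sum should be close to 8. The seed and sample size are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(7)       # fixed seed so the sketch is reproducible
n = 500_000                          # assumed number of draws

x1 = rng.normal(0.0, 1.0, n)         # variance 1
x2 = rng.uniform(-3.0, 3.0, n)       # variance (3 - (-3))**2 / 12 = 3
x3 = rng.exponential(2.0, n)         # variance 2**2 = 4

var_of_sum = np.var(x1 + x2 + x3)                      # variance of the sum
sum_of_vars = np.var(x1) + np.var(x2) + np.var(x3)     # sum of the variances
```

The two numbers differ only by twice the sample covariances between the draws, which are close to zero because the variables were generated independently.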


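We can extend the discussion of covariance to linear combinations of random
variables, say
$$ a'\tilde{X} = a_1 \tilde{X}_1 + a_2 \tilde{X}_2 + \ldots + a_p \tilde{X}_p , $$
which leads to
$$ cov(a'\tilde{X}) = \sum_{i=1}^{p} \sum_{j=1}^{p} a_i a_j \sigma_{ij} , $$
where $\sigma_{ij}$ is the covariance between $\tilde{X}_i$ and $\tilde{X}_j$.
These results hold for matrices as well. If we have $\tilde{Y} = A\tilde{X}$, $\tilde{Z} = B\tilde{X}$, and the
covariance matrix of $\tilde{X}$ is $\Sigma$, we have also that
$$ cov(\tilde{Y}, \tilde{Y}) = A \Sigma A' , \qquad cov(\tilde{Y}, \tilde{Z}) = A \Sigma B' . $$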
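The matrix results can be verified on sample covariance matrices, for which the identities hold exactly because the sample covariance is itself a quadratic form in the data. The matrices `A` and `B` and the simulated data below are hypothetical choices made only to have something to compute.

```python
import numpy as np

rng = np.random.default_rng(3)              # fixed seed, illustration only
X = rng.normal(size=(1_000, 3))             # rows are observations on (X1, X2, X3)
X[:, 2] += 0.5 * X[:, 0]                    # make X3 correlated with X1

A = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, -1.0]])            # Y = A X has two elements
B = np.array([[1.0, 1.0, 1.0]])             # Z = B X has one element

Sigma = np.cov(X, rowvar=False)             # sample covariance matrix of X
Y = X @ A.T                                 # each row is A applied to one observation
Z = X @ B.T

cov_Y = np.cov(Y, rowvar=False)             # compare with A Sigma A'
# Covariance between Y and Z: off-diagonal block of the joint covariance matrix
cov_YZ = np.cov(np.hstack([Y, Z]), rowvar=False)[:2, 2:]   # compare with A Sigma B'
```

Comparing `cov_Y` with `A @ Sigma @ A.T` and `cov_YZ` with `A @ Sigma @ B.T` shows agreement up to floating point error.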







3.7 Marginal and Conditional Densities

Given a joint density function of $n$ random variables, the joint probability of a
subset of them is called the joint marginal density. We can also talk about
joint marginal distribution functions. If we set $n = 3$ we get the joint density
function $f(x_1, x_2, x_3)$. Given the marginal density $g(x_2, x_3)$, the conditional
probability density function of the random variable $\tilde{X}_1$, given that the random
variables $\tilde{X}_2$ and $\tilde{X}_3$ take on the values $x_2$ and $x_3$, is defined as
$$ \varphi(x_1 \mid x_2, x_3) = \frac{f(x_1, x_2, x_3)}{g(x_2, x_3)} , $$
or
$$ f(x_1, x_2, x_3) = \varphi(x_1 \mid x_2, x_3)\, g(x_2, x_3) . $$
Of course we can define a conditional density for various combinations of $\tilde{X}_1$,
$\tilde{X}_2$ and $\tilde{X}_3$, like $p(x_1, x_3 \mid x_2)$ or $g(x_3 \mid x_1, x_2)$. And, instead of three different
variables we can talk about the density function for one random variable, say $\tilde{Y}_t$,
for which we have a sample of $T$ observations. If all observations are independent
we get
$$ f(y_1, y_2, \ldots, y_T) = f(y_1) f(y_2) \cdots f(y_T) . $$
Like before we can also look at conditional densities, like
$$ f(y_t \mid y_1, y_2, \ldots, y_{t-1}) , $$
which in this case would mean that $y_t$, the observation at time $t$, is dependent
on all earlier observations on $\tilde{Y}_t$.
It is seldom that we deal with independent variables when modeling economic
time series. For example, a simple first order autoregressive model like
$y_t = \rho y_{t-1} + \varepsilon_t$ implies dependence between the observations. The same holds for all
time series models. Despite this shortcoming, density functions with independent
random variables are still good tools for describing time series modelling, because
the results based on independent variables carry over to dependent variables in
almost every case.

3.8 The Linear Regression Model A General Description

In this section we look at the linear regression model starting from two random
variables $\tilde{Y}$ and $\tilde{X}$. Two regressions can be formulated,
$$y = \alpha + \beta x + \epsilon,$$
$$x = \gamma + \delta y + \nu.$$
Whether one chooses to condition y on x, or x on y, depends on the parameter
of interest. In the following it is shown how these regression expressions are constructed
from the correlation between x and y, and their first moments, by making
use of the (bivariate) joint density function of x and y. (One can view this section
as an exercise in using density functions.)


Without explicitly stating what the density function looks like, we will assume
that we know the joint density function for the two random variables $\tilde{Y}$ and $\tilde{X}$,
and want to estimate a set of parameters, $\alpha$ and $\beta$. Hence we have the joint density,
$$D(y, x; \theta),$$
where $\theta$ is a vector of parameters which describes the relation between $\tilde{Y}$ and
$\tilde{X}$. To get the linear regression model above we have to condition on the outcome of $\tilde{X}$,
$$D(y, x; \theta) = D(y \mid x; \theta_1)\, g(x),$$
where $g(x)$ is the marginal density of $\tilde{X}$ and $\theta_1$ represents the vector of parameters of
interest, $\theta_1 = [\alpha, \beta]$. This operation requires that the parameters of interest can be
written as a function of the parameters in the joint distribution function, $\theta_1 = f(\theta)$.
The expected mean of $\tilde{Y}$ for given $\tilde{X}$ is,
$$E(\tilde{Y} \mid x; \theta_1) = \int y\, D(y \mid x; \theta_1)\, dy = \alpha + \beta x, \qquad (3.38)$$
or if we choose to condition on $\tilde{Y}$ instead,
$$E(\tilde{X} \mid y; \theta_1) = \int x\, D(x \mid y; \theta_1)\, dx = \gamma + \delta y.$$

The parameters in 3.38 can be estimated by using means, variances and covariances
of the variables. Or in other terms, by using some of the lower moments
of the joint distribution of $\tilde{X}$ and $\tilde{Y}$. Hence, the first step is to rewrite 3.38 in such a
way that we can write $\alpha$ and $\beta$ in terms of the means of $\tilde{X}$ and $\tilde{Y}$.
Looking at the LHS of 3.38 it can be seen that a multiplication of the conditional
density with the marginal density for $\tilde{X}$, $g(x)$, leads to the joint density.
Given the joint density we can choose to integrate out either x or y. In this case
we choose to integrate over x. Thus we have after multiplication,
$$\int y\, D(y \mid x; \theta_1)\, dy\; g(x) = \alpha\, g(x) + \beta x\, g(x).$$
Integrating over x leads to, at the LHS,
$$\int\!\!\int y\, D(y \mid x; \theta_1)\, g(x)\, dy\, dx = \int\!\!\int y\, D(y, x; \theta)\, dy\, dx = \int y\, D(y; \theta)\, dy = E(\tilde{Y}) = \mu_y.$$
Performing the same operations on the RHS leads to,
$$\alpha \int g(x)\, dx + \beta \int x\, g(x)\, dx = \alpha + \beta E(\tilde{X}) = \alpha + \beta \mu_x.$$
If we put the two sides together we get,
$$E(\tilde{Y}) = \mu_y = \alpha + \beta \mu_x.$$

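We now have one equation to solve for the two unknowns. Since we have
used up the means let us turn to the variances by multiplying both sides of 3.38
with x and performing the same operations again.
Multiplication with x and g(x) leads to,
$$\int x y\, D(y \mid x; \theta_1)\, dy\; g(x) = \alpha x\, g(x) + \beta x^2 g(x).$$
Integrate over x,
$$\int\!\!\int x y\, D(y \mid x; \theta_1)\, g(x)\, dy\, dx = \alpha \int x\, g(x)\, dx + \beta \int x^2 g(x)\, dx.$$
The LHS leads to,
$$\int\!\!\int x y\, D(y, x; \theta)\, dy\, dx = E(\tilde{X}\tilde{Y}),$$
and the RHS,
$$\alpha \int x\, g(x)\, dx + \beta \int x^2 g(x)\, dx = \alpha E(\tilde{X}) + \beta E(\tilde{X}^2).$$
Hence our second equation is,
$$E(\tilde{X}\tilde{Y}) = \alpha E(\tilde{X}) + \beta E(\tilde{X}^2). \qquad (3.39)$$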

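Remembering the rules for the expectations operator, $E(\tilde{X}\tilde{Y}) = \mu_x \mu_y + \sigma_{xy}$
and $E(\tilde{X}^2) = \mu_x^2 + \sigma_x^2$, makes it possible to solve for $\alpha$ and $\beta$ in terms of means
and variances. From the first equation we get for $\alpha$,
$$\alpha = \mu_y - \beta \mu_x.$$
If we substitute this into 3.39, we get
$$\mu_x \mu_y + \sigma_{xy} = (\mu_y - \beta \mu_x)\mu_x + \beta(\mu_x^2 + \sigma_x^2),$$
which gives
$$\beta = \frac{\sigma_{xy}}{\sigma_x^2}.$$
Using these expressions in the linear regression line leads to,
$$E(\tilde{Y} \mid x; \theta_1) = \mu_y - \frac{\sigma_{xy}}{\sigma_x^2}\mu_x + \frac{\sigma_{xy}}{\sigma_x^2}x,$$
or if we choose to condition on $\tilde{Y}$ instead,
$$E(\tilde{X} \mid y; \theta_1) = \mu_x - \frac{\sigma_{xy}}{\sigma_y^2}\mu_y + \frac{\sigma_{xy}}{\sigma_y^2}y.$$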



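We can now make use of the correlation coefficient and the parameter $\beta$ in the
linear regression. The correlation coefficient between $\tilde{X}$ and $\tilde{Y}$ is defined as,
$$\rho = \frac{\sigma_{xy}}{\sigma_x \sigma_y}, \quad \text{or} \quad \rho\, \sigma_x \sigma_y = \sigma_{xy}.$$
If we put this into the equations above we get,
$$E(\tilde{Y} \mid x; \theta_1) = \mu_y + \rho \frac{\sigma_y}{\sigma_x}(x - \mu_x),$$
$$E(\tilde{X} \mid y; \theta_1) = \mu_x + \rho \frac{\sigma_x}{\sigma_y}(y - \mu_y).$$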


So, if the two variables are independent their covariance is zero, and the correlation
is also zero. Therefore, the conditional mean of each variable does not
depend on the mean and variance of the other variable. The final message
is that a non-zero correlation between two normal random variables results in a
linear relationship between them. With a multivariate model, with more than two
random variables, things are more complex.
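The moment expressions derived above, $\beta = \sigma_{xy}/\sigma_x^2$ and $\alpha = \mu_y - \beta\mu_x$, can be checked on simulated data. The following is a minimal sketch that replaces population moments by sample moments; the function name, the seed and the simulated parameter values (intercept 1.0, slope 0.5) are assumptions made only for this illustration.

```python
import random

def moments_regression(x, y):
    """Return (alpha, beta) of E(Y|x) = alpha + beta*x computed from sample moments."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x) / n                      # sample variance of x
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / n   # sample covariance
    beta = sxy / sxx
    alpha = my - beta * mx
    return alpha, beta

# Simulated data with a known linear relation (hypothetical values).
random.seed(0)
x = [random.gauss(2.0, 1.5) for _ in range(5000)]
y = [1.0 + 0.5 * xi + random.gauss(0.0, 1.0) for xi in x]
alpha, beta = moments_regression(x, y)
print(alpha, beta)   # estimates should lie near 1.0 and 0.5
```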






4 The Method of Maximum Likelihood

There are two fundamental approaches to estimation in econometrics, the method
of moments and the maximum likelihood method. The difference is that the moments
estimator deals with estimation without a priori choosing a specific density
function. The maximum likelihood estimator (MLE), on the other hand, requires
that a specific density function is chosen from the beginning. Asymptotically there
is no difference between the two approaches. The MLE is more general, and is the
basis for all the various tests applied in practical modeling. In this section we will
focus on MLE exclusively because of its central role.
The principles of MLE were developed early, but for a long time it was considered
mainly as a theoretical device, with limited practical use. The progress
in computer capacity has changed this. Many presentations of the MLE are too
complex for students below the advanced graduate level. The aim of this chapter
is to change this. The principle of ML is not different from OLS. The way
to learn MLE is to start with the simplest case, the estimation of the mean and
the variance of a single normal random variable. In the next step, it is easy to
show how the parameters of a simple linear regression model can be found, and
tested, using the techniques of MLE. In the third step, we can analyse how the
parameters of any density function can be estimated. Finally, it is often interesting to study the
bivariate joint normal density function. This last exercise is good for understanding
when certain variables can be treated as exogenous. The general idea is that after
viewing how a single random variable can be replaced by a function of random
variables, it becomes obvious how a multivariate non-linear system of variables
can be estimated.
Let us start with a single stochastic time series. The first moment, or the
sample mean, of the random process $\tilde{X}_t$ with the observations $(x_1, x_2, \dots, x_T)$ is
found as $\bar{x} = \sum_{t=1}^{T} x_t / T$. By using this technique we simply calculate a number
that we can use to describe one characteristic of the process $\tilde{X}_t$. In the same way
we can calculate the second moment around the mean, etc. In the long run, and
for a stationary variable, we can use the central limit theorem (CLT) to argue that
the sample mean of $(x_1, x_2, \dots, x_T)$ is approximately normally distributed, which allows us
to test for significance.

4.1 MLE for a Univariate Process

The MLE approach starts from a random variable $\tilde{X}_t$, and a sample of T independent
observations $(x_1, x_2, \dots, x_T)$. The joint density function is
$$f(x_1, x_2, \dots, x_T; \theta) = f(x; \theta) = \prod_{t=1}^{T} f(x_t; \theta).$$
To describe this process there are k parameters, $\theta = (\theta_1, \theta_2, \dots, \theta_k)$, so we write
the density function as,
$$f(x; \theta),$$
where $x; \theta$ indicates that it is the shape of the density, described by the parameters
$\theta$, which gives us the sample. If the density function describes a normal
distribution, $\theta$ would consist of two parameters, the mean and the variance.
Now, suppose that we know the functional form of the density function. If we
also have a sample of observations on $\tilde{X}_t$, we can ask the question which estimates
of $\theta$ would be the most likely to find, given the functional form of the density and
given the observations. Viewing the density in this way amounts to asking which
values of $\theta$ maximize the value of the density function.
Formulating the estimation problem in this way leads to a restatement of the
density function in terms of a likelihood function,
$$L(\theta; x),$$
where the parameters are seen as a function of the sample. It is often convenient
to work with the log of the likelihood instead, leading to the log likelihood
$$\log L(\theta; x) = l(\theta; x).$$


What is left is to find the maximum of this function with respect to the parameters
in $\theta$. The maximum, if it exists, is found by solving the system of k
simultaneous equations,
$$\frac{\partial l(\theta; x)}{\partial \theta} = 0,$$
for $\theta$, which gives the maximum likelihood estimates $\hat{\theta}$, provided that $D^2 l(\theta; x)$ is a
negative definite matrix. In matrix form this expression is also known as the score
vector, or the efficient score for $\theta$, which can be written as,
$$\frac{\partial l(\theta; x)}{\partial \theta} = S(\theta),$$
such that the efficient score is zero at the maximum.
The matrix of the expected second order expressions is known as the information matrix,
$$-E\left[\frac{\partial^2 l(\theta; x)}{\partial \theta\, \partial \theta'}\right] = I(\theta).$$
The information matrix plays an important role in demonstrating that ML
estimators asymptotically attain the Cramer-Rao lower bound, and in the derivation
of the so-called classical test statistics associated with the ML estimator. It
can be shown, under quite general conditions, that the variances of the estimated
parameters from above ($\hat{\theta}$) are given by the inverse of the information matrix,
$$\operatorname{var}(\hat{\theta}) = [I(\theta)]^{-1}.$$

So far we have not assigned any specific distribution to the density function.
Let us assume a sample of T independent normal random variables $\{\tilde{X}_t\}$. The
normal distribution is particularly easy to work with since it only requires two
parameters to describe it. We want to estimate the first two moments, the mean $\mu$
and the variance $\sigma^2$, thus $\theta = (\mu, \sigma^2)$. The likelihood is,
$$L(\theta; x) = (2\pi\sigma^2)^{-T/2} \exp\left[-\frac{1}{2\sigma^2}\sum_{t=1}^{T}(x_t - \mu)^2\right].$$
Taking logs of this expression yields,
$$l(\theta; x) = -(T/2)\log 2\pi - (T/2)\log \sigma^2 - \frac{1}{2\sigma^2}\sum_{t=1}^{T}(x_t - \mu)^2.$$





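The partial derivatives with respect to $\mu$ and $\sigma^2$ are,
$$\frac{\partial l}{\partial \mu} = \frac{1}{\sigma^2}\sum_{t=1}^{T}(x_t - \mu),$$
$$\frac{\partial l}{\partial \sigma^2} = -(T/2\sigma^2) + (1/2\sigma^4)\sum_{t=1}^{T}(x_t - \mu)^2.$$
If these equations are set to zero, the result is,
$$\frac{1}{\sigma^2}\sum_{t=1}^{T}(x_t - \mu) = 0, \qquad -(T/2\sigma^2) + (1/2\sigma^4)\sum_{t=1}^{T}(x_t - \mu)^2 = 0.$$
If this system is solved for $\mu$ and $\sigma^2$ we get the estimates of the mean and the
variance as$^1$
$$\hat{\mu}_x = \frac{1}{T}\sum_{t=1}^{T} x_t,$$
$$\hat{\sigma}_x^2 = \frac{1}{T}\sum_{t=1}^{T}(x_t - \hat{\mu}_x)^2 = \frac{1}{T}\sum_{t=1}^{T} x_t^2 - \left[\frac{1}{T}\sum_{t=1}^{T} x_t\right]^2.$$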

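Do these estimates of $\mu$ and $\sigma^2$ really represent the maximum solution of the
likelihood function? To answer that question we have to look at the sign of the
Hessian of the log likelihood function, the second order conditions, evaluated at
the estimated values of the parameters in $\theta$,
$$D^2 l(\theta; x) = \begin{bmatrix} -\dfrac{T}{\sigma^2} & -\dfrac{1}{\sigma^4}\sum_{t=1}^{T}(x_t - \mu) \\ -\dfrac{1}{\sigma^4}\sum_{t=1}^{T}(x_t - \mu) & \dfrac{T}{2\sigma^4} - \dfrac{1}{\sigma^6}\sum_{t=1}^{T}(x_t - \mu)^2 \end{bmatrix}.$$
If we substitute from the solutions of the estimates of $\mu$ and $\sigma^2$, we get,
$$E[D^2 l(\hat{\theta}; x)] = -\begin{bmatrix} \dfrac{T}{\hat{\sigma}_x^2} & 0 \\ 0 & \dfrac{T}{2\hat{\sigma}_x^4} \end{bmatrix} = -I(\hat{\theta}).$$
Since the variance, $\hat{\sigma}_x^2$, is always positive we have a negative definite matrix,
and a maximum value for the function at $\hat{\mu}_x$ and $\hat{\sigma}_x^2$.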
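It remains to investigate whether the estimates are unbiased. Therefore, replace
the observations, in the solutions for $\mu$ and $\sigma_x^2$, by the random variable $\tilde{X}$
and take expectations. The expected value of the mean is,
$$E(\hat{\mu}_x) = E\left[\frac{1}{T}\sum_{t=1}^{T}\tilde{X}_t\right] = \frac{1}{T}\sum_{t=1}^{T}E(\tilde{X}_t) = \frac{1}{T}\sum_{t=1}^{T}\mu = \mu,$$
which proves that $\hat{\mu}_x$ is an unbiased estimate of the mean. The calculations
for the variance are a bit more complex, but the idea is the same. The expected
variance is,
$$E[\hat{\sigma}_x^2] = E\left[\frac{1}{T}\sum_{t=1}^{T}\tilde{X}_t^2 - \left(\frac{1}{T}\sum_{t=1}^{T}\tilde{X}_t\right)^2\right]$$
$$= \frac{1}{T}\sum_{t=1}^{T}E(\tilde{X}_t^2) - \frac{1}{T^2}E\left[\sum_{t=1}^{T}\sum_{s=1}^{T}\tilde{X}_t\tilde{X}_s\right]$$
$$= E(\tilde{X}_t^2) - \frac{1}{T^2}\left[T E(\tilde{X}_t^2) + T(T-1)E(\tilde{X}_t)E(\tilde{X}_s)\right]$$
$$= \frac{T-1}{T}E(\tilde{X}_t^2) - \frac{T-1}{T}[E(\tilde{X}_t)]^2 = \frac{T-1}{T}\sigma^2.$$
Thus, $\hat{\sigma}^2$ is not an unbiased estimate of $\sigma^2$. The bias factor, $(T-1)/T$, goes
to one, so the bias goes to zero, as $T \to \infty$. This is a typical result from MLE: the mean is correct but
the variance is biased. To get an unbiased estimate we need to correct the
estimate in the following manner,
$$s^2 = \frac{T}{T-1}\hat{\sigma}^2 = \frac{1}{T-1}\left[\sum_{t=1}^{T}\tilde{X}_t^2 - \frac{1}{T}\left(\sum_{t=1}^{T}\tilde{X}_t\right)^2\right].$$
The correction involves multiplying the estimated variance with $T/(T-1)$.

1 The solution is given by $\frac{1}{T}\sum_{t=1}^{T}[x_t - \hat{\mu}_x]^2 = \frac{1}{T}\sum_{t=1}^{T}x_t^2 - 2\hat{\mu}_x\frac{1}{T}\sum_{t=1}^{T}x_t + \hat{\mu}_x^2 = \frac{1}{T}\sum_{t=1}^{T}x_t^2 - \left[\frac{1}{T}\sum_{t=1}^{T}x_t\right]^2$.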
4.2 MLE for a Linear Combination of Variables

We have derived the maximum likelihood estimates for a single independent normal
variable. How does this relate to a linear regression model? Earlier, when
we discussed the moments of a variable, we showed how it was possible, as a general
principle, to substitute a random variable with a function of the variable.
The same reasoning applies here. Say that $\tilde{X}$ is a function of two other random
variables $\tilde{Y}$ and $\tilde{Z}$. Assume the linear model
$$y_t = \beta z_t + x_t,$$
where $\tilde{Y}$ is a random variable, with observations $\{y_t\}$, and $z_t$ is, for the time
being, assumed to be a deterministic variable. (This is not a necessary assumption.)
Instead of using the symbol x for observations on the random variable $\tilde{X}$, let us
set $x_t = \epsilon_t$ where $\epsilon_t \sim NID(0, \sigma^2)$. Thus, we have formulated a linear regression
model with a white noise residual. This linear equation can be rewritten as,
$$\epsilon_t = y_t - \beta z_t,$$
where the RHS is the function to be substituted for the single normal variable
$x_t$ used in the MLE example above. The algebra gets a bit more complicated but
the principal steps are the same.$^2$ The unknown parameters in this case are $\beta$ and
$\sigma^2$.

2 As a consequence of more complex algebra the computer algorithms for estimating the parameters
will also get more complex. For the ordinary econometrician there are a lot of software
packages that cover most of the cases.

The log likelihood function will now look like,


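$$l(\beta, \sigma^2; y, z) = -(T/2)\log 2\pi - (T/2)\log \sigma^2 - \frac{1}{2\sigma^2}\sum_{t=1}^{T}(y_t - \beta z_t)^2. \qquad (4.24)$$
The last factor in this expression can be identified as the sum of squares function,
$S(\beta)$. In matrix form we have,
$$S(\beta) = \sum_{t=1}^{T}(y_t - \beta z_t)^2 = (Y - Z\beta)'(Y - Z\beta),$$
and
$$l(\beta, \sigma^2; y, z) = -(T/2)\log 2\pi - (T/2)\log \sigma^2 - \frac{1}{2\sigma^2}(Y - Z\beta)'(Y - Z\beta). \qquad (4.26)$$
Differentiation of $S(\beta)$ with respect to $\beta$ yields,
$$\frac{\partial S(\beta)}{\partial \beta} = -2Z'(Y - Z\beta),$$
which, if set to zero, solves to
$$\hat{\beta} = (Z'Z)^{-1}(Z'Y).$$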


Notice that the ML estimator of the linear regression model is identical to the
OLS estimator.
The variance estimate is,
$$\hat{\sigma}^2 = \hat{\epsilon}'\hat{\epsilon}/T,$$
which in contrast to the OLS estimate is biased.
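The closed-form solution can be illustrated for the scalar case, where $(Z'Z)^{-1}(Z'Y)$ reduces to $\sum z_t y_t / \sum z_t^2$. The following sketch uses a small made-up data set, and hypothetical function names, to compute the ML estimates and to check that the log likelihood is lower at nearby parameter values.

```python
import math

def ml_regression(y, z):
    """ML estimates for y_t = beta*z_t + eps_t, eps_t ~ NID(0, sigma2), scalar z_t."""
    T = len(y)
    beta_hat = sum(zt * yt for zt, yt in zip(z, y)) / sum(zt * zt for zt in z)  # (Z'Z)^{-1} Z'Y
    resid = [yt - beta_hat * zt for yt, zt in zip(y, z)]
    sigma2_hat = sum(e * e for e in resid) / T          # divides by T, the ML estimate
    return beta_hat, sigma2_hat

def log_likelihood(beta, sigma2, y, z):
    """Normal log likelihood of the regression, as in the expression for l(beta, sigma2; y, z)."""
    T = len(y)
    ssr = sum((yt - beta * zt) ** 2 for yt, zt in zip(y, z))
    return -(T / 2) * math.log(2 * math.pi) - (T / 2) * math.log(sigma2) - ssr / (2 * sigma2)

# Made-up observations, for illustration only.
z = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
beta_hat, sigma2_hat = ml_regression(y, z)
print(beta_hat, sigma2_hat)
print(log_likelihood(beta_hat, sigma2_hat, y, z) > log_likelihood(beta_hat + 0.1, sigma2_hat, y, z))
```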
To obtain these estimates we did not have to make any direct assumptions
about the distribution of $y_t$ or $z_t$. The necessary and sufficient condition is that $y_t$
conditional on $z_t$ is normal, which means that $y_t - \beta z_t = \epsilon_t$ should follow a normal
distribution. This is the reason why MLE is feasible even though $y_t$ might be a
dependent AR(p) process. In the AR(p) process the residual term is an independent
normal random variable. The MLE is given by substitution of the independently
distributed normal variable with the conditional mean of $y_t$.
The above results can be extended to a vector of normal random variables. In
this case we have a multivariate normal distribution, where the density is
$$D(X) = D(X_1, X_2, \dots, X_n).$$
The random variables X will have a mean vector $\mu$ and a covariance matrix
$\Sigma$. The density function for the multivariate normal is,
$$D(X) = [(2\pi)^{n/2}|\Sigma|^{1/2}]^{-1}\exp[-(1/2)(X - \mu)'\Sigma^{-1}(X - \mu)],$$
which can be expressed in a compact form $X \sim N(\mu, \Sigma)$.
With multivariate densities it is possible to handle systems of equations with
stochastic variables, the typical case in econometrics. The bivariate normal is an
often used device to derive models including 2 variables. Set $\tilde{X} = (\tilde{X}_1, \tilde{X}_2)$, and
$$\Sigma = \begin{bmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{21} & \sigma_2^2 \end{bmatrix}, \quad \text{with} \quad |\Sigma| = \sigma_1^2\sigma_2^2(1 - \rho^2),$$
where $\rho$ is the correlation coefficient. As can be seen $|\Sigma| > 0$ unless $\rho^2 = 1$. If
$\sigma_{12} = \sigma_{21} = 0$, the two processes are independent and can be estimated individually
without losing any important information. In principle, if $\sigma_{12} = \sigma_{21} \neq 0$, the two
equations are dependent, and it will be necessary to estimate a complete system
of equations to get correct estimates, which are unbiased and efficient.
A disadvantage with MLE is that the variance estimate is biased. This, however,
is only a small sample effect. It can be shown that as T goes to infinity
the bias disappears. Hence, the MLE is an asymptotically efficient estimator.
Furthermore, it can also be shown that MLE behaves asymptotically nicely even
if we drop the assumption of normally independently distributed residuals. The
estimates will tend towards those given by NID errors. This situation is referred
to as quasi maximum likelihood.
The advantages are easy to see. MLE offers a general approach to the estimation
of econometric models. These models can be quite complex; non-linearity,
moving average residuals and so on can be handled by MLE. Consequently there
exists a large literature on MLE. In principle this literature is not difficult. The
main problem for our understanding of the use of MLE in different situations lies
in our understanding of matrix algebra.




5 The Classical tests - Wald, LM and LR tests

(To be completed, add figure of normal distributed variable with value of likelihood
function (L) on the vertical axis and parameter value on the horizontal axis, with
($\hat{\theta}$) indicating the maximum value of L).
There are three approaches to testing a statistical model. The first is
to start with an unrestricted model and impose restrictions on the estimated
model. The second approach is to impose the restrictions prior to estimation, and
estimate a restricted model. The test is then performed by asking if the restriction
should be lifted. The third approach is to test for significant differences between
an estimated restricted model and an estimated unrestricted model. The last
approach involves estimating two models, rather than one.
The three approaches of testing are named
Wald tests (W) - estimate an unrestricted model.
Lagrange Multiplier tests (LM) - estimate a restricted model.
Likelihood Ratio tests (LR) - estimate both the unrestricted and the restricted models.
A test is labeled Wald, Lagrange Multiplier or Likelihood ratio depending
on how it is constructed. A typical Wald test is the t-test for significance.
A Lagrange multiplier test is the LM test of autocorrelation. Finally, the
F-test for testing the significance of one or several parameters in a group
represents a typical Likelihood ratio test.
Imagine a figure of a normal density function, with the shape of a normal
random variable centered around its (true) mean. On the vertical axis put the
value of the likelihood function. The max is given by the peak of the distribution.
Let the horizontal axis represent the estimated mean. The true mean is indicated
by the peak of the normal distribution. The LR test is based on a comparison of
likelihood values. If a restriction, which is imposed on the unrestricted model, is
valid the value of the likelihood should not be reduced significantly. This test is
based on two estimations, one unrestricted giving the value of the likelihood $\hat{L}_U$
and one restricted leading to $\hat{L}_R$. From these two values the likelihood ratio is
defined as,
$$\lambda = \frac{\hat{L}_R}{\hat{L}_U}.$$
This leads to the test statistic $(-2\ln\lambda)$ which has a $\chi^2(R)$ distribution, where
R is the number of restrictions.
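As an illustration of the LR principle, consider testing the restriction that the mean of a normal sample equals a given value. The sketch below is an assumed example, not taken from the text. It evaluates the maximised normal log likelihood with and without the restriction and forms $-2\ln\lambda$. The data are made up, and 3.84 is the 5% critical value of the $\chi^2(1)$ distribution.

```python
import math

def max_loglik(x, mu=None):
    """Maximised normal log likelihood; if mu is given it is held fixed (restricted model)."""
    T = len(x)
    m = sum(x) / T if mu is None else mu
    s2 = sum((xt - m) ** 2 for xt in x) / T             # ML variance given the mean m
    return -(T / 2) * (math.log(2 * math.pi) + math.log(s2) + 1)

def lr_statistic(x, mu0):
    """-2 ln(lambda) for H0: mean = mu0; one restriction, so chi2(1) under H0."""
    return -2 * (max_loglik(x, mu0) - max_loglik(x))

# Made-up observations, for illustration only.
x = [0.8, 1.9, -0.3, 1.2, 2.4, 0.1, 1.5, 0.9]
stat = lr_statistic(x, 0.0)
print(stat, stat > 3.84)   # compare with the 5% critical value of chi2(1)
```

Because the restricted likelihood can never exceed the unrestricted one, the statistic is non-negative, and it is zero when the restriction coincides with the unrestricted estimate.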
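The Wald test compares (squared) estimated parameters with their variances.
In a linear regression, if the residual is $\epsilon \sim NID(0, \sigma^2)$, then $\hat{\beta} \sim N(\beta, \operatorname{var}(\hat{\beta}))$, so
$(\hat{\beta} - \beta) \sim N(0, \operatorname{var}(\hat{\beta}))$, and a standard t-test will tell if $\beta$ is significant or not.
More generally, if we have a vector of j normally distributed random variables
$\hat{\theta} \sim N_j(\theta, \Sigma)$, then we have
$$(\hat{\theta} - \theta)'\Sigma^{-1}(\hat{\theta} - \theta) \sim \chi^2(j).$$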





The LM test starts from a restricted model and tests if the restrictions are
valid. Here restrictions should be understood as a general concept. A model
is restricted if it assumes homoscedasticity, no autocorrelation, etc. The test is
formulated as,
$$LM = \left[\frac{\partial \ln L(\hat{\theta}_R)}{\partial \theta}\right]'\left[I(\hat{\theta}_R)\right]^{-1}\left[\frac{\partial \ln L(\hat{\theta}_R)}{\partial \theta}\right].$$

The formula looks complex but is in many cases extremely easy to apply.
Consider the LM test for p:th order autocorrelation in the residuals $\hat{\epsilon}_t$,
$$\hat{\epsilon}_t = \rho_1\hat{\epsilon}_{t-1} + \rho_2\hat{\epsilon}_{t-2} + \dots + \rho_p\hat{\epsilon}_{t-p} + \nu_t.$$
The LM test statistic for testing if the parameters $\rho_1$ to $\rho_p$ are zero amounts to
estimating the equation with OLS and calculating the test statistic $TR^2$, distributed
as $\chi^2(p)$ under the null of no autocorrelation. Similar tests can be formulated for
testing various forms of heteroscedasticity.
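The $TR^2$ recipe can be sketched for the simplest case, p = 1. The assumptions made here, which are not from the text, are that the auxiliary regression has no constant and no further regressors, that $R^2$ is the uncentered one, and that T is the number of usable observations. The series are simulated with a fixed seed and the function name is hypothetical.

```python
import random

def lm_autocorr_stat(resid):
    """T*R^2 from regressing e_t on e_{t-1} (p = 1, no constant, uncentered R^2)."""
    e_t = resid[1:]
    e_lag = resid[:-1]
    rho_hat = sum(a * b for a, b in zip(e_t, e_lag)) / sum(b * b for b in e_lag)  # OLS slope
    ssr = sum((a - rho_hat * b) ** 2 for a, b in zip(e_t, e_lag))
    r2 = 1.0 - ssr / sum(a * a for a in e_t)
    return len(e_t) * r2

random.seed(2)
white = [random.gauss(0.0, 1.0) for _ in range(500)]    # no autocorrelation
ar = [0.0]
for _ in range(499):                                    # first order autocorrelation of 0.7
    ar.append(0.7 * ar[-1] + random.gauss(0.0, 1.0))
print(lm_autocorr_stat(white), lm_autocorr_stat(ar))    # compare each with chi2(1), 3.84 at 5%
```

For the autocorrelated series the statistic should be far above the critical value, while for the white noise series it is a draw from (approximately) a $\chi^2(1)$ distribution.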
Tests can often be formulated in such a way that they follow both $\chi^2$ and
F-distributions. In less than large samples the F-distribution is the better one to use.
The general rule for choosing among tests based on the F or the $\chi^2$ distribution
is to use the F distribution, since it has the better small sample properties.
If the information matrix is known (meaning that it is not necessary to estimate
it), all three tests would lead to the same test statistic, regardless of the chosen
distribution, $\chi^2$ or F. If all three approaches lead to the same test statistic,
we would have $R_W = R_{LR} = R_{LM}$. However, when the information matrix is
estimated we get the following relation between the tests, $R_W \geq R_{LR} \geq R_{LM}$.
Remember (1) that when dealing with limited samples the three tests might
lead to different conclusions, and (2) if the null is rejected the alternative can
never be accepted. As a matter of principle, statistical tests only reject the
null hypothesis. Rejection of the null does not lead to accepting the alternative
hypothesis, it leads only to the formulation of a new null. As an example, in a test
where the null hypothesis is homoscedasticity, the alternative is not necessarily
heteroscedasticity. Tests are generally derived on the assumption that everything
else is OK in the model. Thus, in this example, rejection of homoscedasticity
could be caused by autocorrelation, non-normality, etc. The econometrician has
to search for all possible alternatives.



Part II

Time Series Modeling

6 Random Walks, White noise and All That



6.1 Different types of processes
This section looks at different types of stochastic time series processes that are
important in economics and finance. A time series is a series where the data
is ordered by time. A random variable ($\tilde{X}$) is a variable which can take on more
than one value, and for each value it can take on there is a value between zero
and one that describes the probability of observing that value. We distinguish
between discrete and continuous random variables. Discrete random variables can
only take on a finite number of outcomes. A continuous random variable can take
on any value between $-\infty$ and $+\infty$. The mathematical model of the probabilities
associated with a random variable is given by the distribution function F(x),
$F(x) = P(\tilde{X} \leq x)$. If we have a continuous random variable, we can define the
probability density function of the random variable as $f(x) = dF(x)/dx$. Random
variables are characterized by their probability functions, and their moments.
First, second, third and fourth moments all describe the characteristics of a
random variable. By estimating these we describe a random variable. All moments
have direct implications for risk and return decisions. Mean = return, variance
= risk, while skewness and kurtosis imply deviations from normality and might affect
behavior. To be completed.
A stochastic time series process is then made up of a random variable that
over time can take on more than one value.
We denote a stochastic process as $\{\tilde{X}_t\}_0^T$, indicating that it starts at time zero
and continues to time T. To define a stochastic time series process we start
with the random variable ($\tilde{X}_t$), which at time t can take on different values
at the future periods $i = 1, 2, 3, \dots, n$, where n might go to infinity. Often we
will talk about the conditional expectation of ($\tilde{X}_t$); we want to estimate the most
likely future value, given the information we have today. A stochastic time series
process can be discrete or continuous. A discrete series only changes values
at discrete time periods, while a continuous process changes, or can potentially change,
values continuously and not only at discrete time intervals.
The conditional expectation is written as $E(\tilde{X}_{t+1} \mid I_t)$ or $E_t(\tilde{X}_{t+1})$. To formalize
the use of conditional expectations, assume a probability space $(\Omega, \mathcal{F}, P)$, where
$\Omega$ is the total sample space (or possible states of the world), $\mathcal{F}$ denotes the tribe
(sigma-algebra) of subsets of $\Omega$ that are outcomes (observations), and P is a probability measure
associated with the outcomes. A very practical question in modeling is if there
exists a simple mathematical form for associating outcomes with probabilities.
Usually we will refer to the tribe of subsets $\mathcal{F}$ as the information set $I_t$. We will
assume that memory is not forgotten by the decision makers, so the information
set is increasing over time,
$$I_{t_0} \subseteq I_{t_1} \subseteq \dots \subseteq I_{t_k} \subseteq I_{t_{k+1}} \subseteq \dots$$
In a discrete time setting we refer to these increasing sets as an increasing sequence
of sigma-fields. In a continuous time setting, where new information arrives
continuously, rather than at discrete time intervals, the increasing information set
is referred to as a filtration, or an increasing family of sigma-algebras. A very unofficial
standard is to use $I_t$ in discrete time settings and $\mathcal{F}_t$ for continuous time settings.
We can also say that the set $\{\mathcal{F}_t : t \geq 0\}$ is a filtration, representing an increasing family
of sub-sigma-algebras on $\mathcal{F}$. Over time, outcomes of $\tilde{X}_t$, $(x_1, x_2, \dots, x_t)$, will
be added to the increasing family of information sets. We refer to the observed
process, $(x_1, x_2, \dots, x_t)$, as adapted to the filtration $\mathcal{F}_t$. We can also say that if
$(x_1, x_2, \dots, x_t)$ is an adapted process, then for the sequence $\{x_t\}$, $\tilde{X}_t$ is a random
variable with respect to $(\Omega, \mathcal{F})$, and for each t the value of $\tilde{X}_t$ is known as $x_t$.

6.2 White Noise

A random variable is a white noise process if its expected mean is equal to zero,
$$E[\epsilon_t] = \mu = 0,$$
its variance exists and is constant, $\sigma^2$, and there is no memory in the process
so the autocorrelation function is zero,
$$E[\epsilon_t \epsilon_t] = \sigma^2,$$
$$E[\epsilon_t \epsilon_s] = 0 \quad \text{for } t \neq s.$$
In addition, the white noise process is supposed to follow a normal and independent
distribution, $\epsilon_t \sim NID(0, \sigma^2)$. A standardized white noise has the
distribution NID(0, 1). Dividing $\epsilon_t$ by $\sqrt{\sigma^2} = \sigma$ gives $(\epsilon_t/\sigma) \sim NID(0, 1)$.
The independent normal distribution has some important characteristics. First,
if we add normal random variables together, the sum will have a mean equal to the
sum of the means of all variables. Thus, adding T white noise variables together as
$z_T = \sum_{t=1}^{T}(\epsilon_t/\sigma)$ forms a new variable with mean
$E(z_T) = E(\epsilon_1/\sigma) + E(\epsilon_2/\sigma) + \dots + E(\epsilon_T/\sigma) = (1/\sigma)[E(\epsilon_1) + E(\epsilon_2) + \dots + E(\epsilon_T)] = 0$.
Since each variable is independent, we have the variance as
$\sigma_z^2 = \sigma_{z,1}^2 + \sigma_{z,2}^2 + \dots + \sigma_{z,T}^2 = 1 + 1 + \dots + 1 = T$.
The random variable is distributed as $z_T \sim N(0, T)$, with a standard deviation
given as $\sqrt{T}$. As the forecast horizon for $z_t$ increases, a 95% forecast confidence
interval also increases, with $\pm 1.96\sqrt{T}$.
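That the variance of the sum grows linearly with T can be checked by simulation. The sketch below assumes $\sigma = 2$, 20000 replications and a fixed seed, all chosen only for illustration. It draws repeated sums of standardized white noise and reports their sample variance for a few horizons.

```python
import random

def simulate_sums(T, sigma, reps, seed=0):
    """Return reps draws of z_T = sum_{t=1}^T eps_t / sigma with eps_t ~ NID(0, sigma^2)."""
    rng = random.Random(seed)
    return [sum(rng.gauss(0.0, sigma) / sigma for _ in range(T)) for _ in range(reps)]

def sample_variance(values):
    """Sample variance around the sample mean (dividing by the number of values)."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)

for T in (1, 4, 16):
    z = simulate_sums(T, sigma=2.0, reps=20000)
    print(T, round(sample_variance(z), 2))   # the variance should be close to T
```

The corresponding 95% interval half-width is 1.96 times the square root of the reported variance, so it widens with the square root of the horizon.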
In the same way, we can define the distribution, mean and variance during
subsets of time. Suppose $\epsilon_t \sim N(0, 1)$ is defined for the period of a year. The variable
will be distributed over six months as N(0, 1/2), with a standard deviation of
$1/\sqrt{2}$, and over three months the distribution is N(0, 1/4), with a standard deviation
of $1/\sqrt{4}$. If the year is divided into $\tau$ equal sub-periods, the distribution over each of them
becomes $NID(0, 1/\tau)$ and the standard deviation $1/\sqrt{\tau}$. This property of the variable, following from the
assumption of independent distribution, is known as the Markov property. Given that
the increments of $x_t$ are generated from an independent normal distribution $N(\mu, \sigma^2)$, the change
in x between time 0 and time T is distributed as $N(\mu T, \sigma^2 T)$.
To sum up, it follows from the definition that a white noise process is not
linearly predictable from its own past. The expected mean of a white noise,
conditional on its history, is zero,
$$E[\epsilon_t \mid \epsilon_{t-1}, \epsilon_{t-2}, \dots, \epsilon_1] = E[\epsilon_t] = 0.$$
This is a relatively weak condition. A white noise process might be predicted
by other variables, and by its own past using non-linear functions.
A process is called an innovation if it is unpredictable given some information
set $I_t$. A process $y_t$ is an innovation process w.r.t. an information set if,
$$E[y_t \mid I_t] = 0,$$
where the information set $I_t$ includes not only the history of $y_t$, but also all other
information which might be of importance for explaining this process. Stating
that a series is a white noise innovation process, with respect to some information
set $I_t$, is a stronger requirement than a white noise process. It is also a stronger
statement than saying that $\epsilon_t$ is a martingale difference process, because we add
the assumption of a normal distribution. The martingale and the martingale
difference processes are defined in terms of their first moments only. Creating
a residual process that is a white noise innovation term is a basic requirement in
the modelling process.

6.3 The Log Normal Distribution

The normal distribution is central in econometric modeling. However, financial
prices display two characteristics which make them unfit for a stochastic process
based on the assumption of normal distributions. Stock prices cannot be negative,
due to limited liability, and they tend to grow over time due to the time value
of money. Thus, the distribution of stock prices is typically non-negative and
skewed. The normal distribution on the other hand is symmetric and stretches
from $-\infty$ to $+\infty$. A better alternative for modelling stock prices, and many other
asset prices, is to assume a log normal distribution, which compared to the normal
is only defined over $[0, \infty)$, and is right skewed, reflecting the fact that stock
prices have a tendency to move up rather than down. Furthermore, the log normal
distribution has the property that the log of a log normal random variable has a
normal distribution. Thus, taking the log of log normal random stock prices
transforms their distribution to a normal distribution.
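Let $\tilde{S}_t$ be a random log normal stock price, whose logarithm has mean $\mu$ and variance $\sigma^2$. The
log of $s_t$ is then distributed as $\ln s_t \sim N[\mu, \sigma^2]$.
Given that $\tilde{S}_t$ has a log normal distribution, it follows that, when the one-period log changes are
independent $N[\mu, \sigma^2]$, the distance between $\ln \tilde{S}_t$ and $\ln \tilde{S}_{t+n}$ is distributed as
$\ln \tilde{S}_{t+n} - \ln \tilde{S}_t \sim N[\mu n, \sigma^2 n]$.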

6.4 The ARIMA Model

The non-parametric white noise can be used to define (or generate) autoregressive
models (AR), and moving average models (MA). The AR(p) model is
$$y_t = \mu + a_1 y_{t-1} + \dots + a_p y_{t-p} + \epsilon_t,$$
where the mean is $E(y_t) = \mu/(1 - a_1 - \dots - a_p)$ for a stationary process, and
$\epsilon_t \sim NID(0, \sigma^2)$. Or, using the lag operator, $L^i x_t = x_{t-i}$,
$$A(L)y_t = \mu + \epsilon_t,$$
where $A(L) = (1 - a_1 L - a_2 L^2 - \dots - a_p L^p)$. The eigenvalues associated with
this polynomial inform about the time path of $y_t$. The moving average model of
order q is,
$$y_t = \mu + \epsilon_t + b_1\epsilon_{t-1} + \dots + b_q\epsilon_{t-q},$$
or, using the lag operator,
$$y_t = \mu + B(L)\epsilon_t.$$


6.5 The Random Walk Model

A special case of the AR(1) model is the random walk model,
$$x_t = x_{t-1} + \epsilon_t, \qquad \epsilon_t \sim NID(0, \sigma^2),$$
where $x_{t-1}$ is the lagged value of $x_t$, with an implicit parameter of unity, and
$\epsilon_t$ is a white noise process. It follows that given the past of the series the best
prediction we can use is the present value of the series, and that the first difference
is nothing else than white noise, $x_t - x_{t-1} = \Delta x_t = \epsilon_t$. The important factor
is that the increments of the series are unpredictable from the series' own past.
A random walk is non-stationary. By definition, it is integrated of order one,
I(1). Taking the first difference of a random walk series produces a stationary I(0)
(white noise) series.
A random walk has the property that today's value is the prediction of the
variable's future values,
$$E(x_{t+1} \mid x_t, x_{t-1}, x_{t-2}, \dots, x_{t-n}) = E(x_{t+1} \mid x_t) = x_t,$$

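where n might be equal to infinity. This definition does not rule out the case
that there are other variables that can be correlated with $x_t$ and thereby also
predict $x_{t+1}$. We can also say that a random walk has an infinitely long memory.
The mean is zero, and the variance and autocovariance are equal to $\operatorname{var}(\tilde{X}_t) = \sigma^2 t$ and
$\operatorname{Cov}(X_t, X_{t-n}) = (t - n)\sigma^2$. Taking $x_0 = 0$,
$$E(x_t) = E\left[\sum_{i=1}^{t}\epsilon_i\right] = 0,$$
$$\operatorname{var}(x_t) = E(x_t^2) = E\left[\left(\sum_{i=1}^{t}\epsilon_i\right)\left(\sum_{j=1}^{t}\epsilon_j\right)\right] = \sum_{i=1}^{t}\sum_{j=1}^{t}E[\epsilon_i\epsilon_j] = t\sigma^2.$$
The first autocovariance is $(t - 1)\sigma^2$,
$$\operatorname{cov}(x_t, x_{t-1}) = E(x_t x_{t-1}) = E\left[\left(\sum_{i=1}^{t}\epsilon_i\right)\left(\sum_{j=1}^{t-1}\epsilon_j\right)\right] = \sum_{i=1}^{t}\sum_{j=1}^{t-1}E[\epsilon_i\epsilon_j] = (t - 1)\sigma^2.$$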

The autocovariances for higher lag orders follow from this previous example.
As can be seen these are non-stationary moments, since both are dependent on
time (t). It follows that the autocorrelation function looks like,
$$\rho_n = \frac{(t - n)\sigma^2}{\sqrt{t\sigma^2\,(t - n)\sigma^2}} = [(t - n)/t]^{1/2}.$$
We can see that, given a sufficiently large number of observations, there is an
infinite memory. For any fixed lag n, the theoretical autocorrelation tends to 1.0 as t grows.
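The time-dependent moments can be illustrated by simulating many random walks and looking across replications at a fixed date. The sketch below assumes $x_0 = 0$, $\sigma = 1$, t = 10, 20000 replications and a fixed seed, all chosen only for illustration, and compares the simulated variance, first autocovariance and first autocorrelation with $t\sigma^2$, $(t-1)\sigma^2$ and $[(t-1)/t]^{1/2}$.

```python
import random

def simulate_random_walks(t, reps, sigma=1.0, seed=0):
    """Return reps pairs (x_{t-1}, x_t) of a random walk started at x_0 = 0."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(reps):
        x = 0.0
        for _ in range(t - 1):              # build up x_{t-1} from t-1 shocks
            x += rng.gauss(0.0, sigma)
        prev = x
        x += rng.gauss(0.0, sigma)          # one more shock gives x_t
        pairs.append((prev, x))
    return pairs

pairs = simulate_random_walks(t=10, reps=20000)
n = len(pairs)
var_t = sum(x * x for _, x in pairs) / n        # theory: t * sigma^2 = 10
var_prev = sum(p * p for p, _ in pairs) / n     # theory: (t-1) * sigma^2 = 9
cov_1 = sum(p * x for p, x in pairs) / n        # theory: (t-1) * sigma^2 = 9
rho_1 = cov_1 / (var_t * var_prev) ** 0.5       # theory: sqrt(9/10)
print(var_t, cov_1, rho_1)
```

Moments are computed around zero because the theoretical mean of the process is zero under the stated assumptions.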


If $x_t = (x_0, x_1, \dots, x_n)$, we can substitute repeatedly backwards,
$$x_t = x_0 + \sum_{i=1}^{t}\epsilon_i.$$
Thus, a random walk is a sum of white noise errors from the beginning of the
series ($x_0$). Hence, the value of today is dependent on shocks built up from the
beginning of the series. All shocks in the past are still affecting the series today.
Furthermore, all shocks are equally important. The process formed by $\sum_{i=1}^{t}\epsilon_i$ is
called a stochastic trend. In contrast to a deterministic trend, the stochastic trend
is changing its slope in a random way period by period. Ex post a stochastic trend
might look like a deterministic trend. Thus, it is not really possible to determine
whether a variable is driven by a stochastic or a deterministic trend, or a combination
of both.
If we add a constant term to the model we get a random walk with a drift,
$$x_t = \mu + x_{t-1} + \epsilon_t,$$
where the constant $\mu$ represents the drift term. In this process $x_t$ is driven by
both a deterministic and a stochastic trend. If we perform the same backward
substitution as above, we get,
$$x_t = \mu t + \sum_{i=1}^{t}\epsilon_i + x_0,$$
where $t = 1, 2, \dots, n$. Thus, a constant term in a random walk model implies
that the variable follows a linear deterministic trend ($\mu t$) and a stochastic trend in
the long run. In the long run the deterministic trend will dominate the stochastic
trend and determine the path of $x_t$. Taking first differences leads to,
$$\Delta x_t = \mu + \epsilon_t,$$
where the constant measures the average growth rate of $x_t$, since $E(\Delta x_t) = \mu$.

The expected value of a driftless random walk, for any future date and conditional on the information available today, is always today's value, $E(x_{t+n}) = x_t$. For a random walk with a drift $\mu$ the expected value is $E(x_{t+n}) = \mu n + x_t$.
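The two forecast rules can be illustrated by simulating many continuations of a series from a known value. This is only a sketch under assumed values (starting value 10, drift 0.5, horizon 20, unit-variance normal shocks); `simulate_forward` is a hypothetical helper, not something defined in the text.

```python
import random

def simulate_forward(x_t, mu, n, rng, sigma=1.0):
    """Continue a random walk with drift mu for n steps from the known value x_t."""
    x = x_t
    for _ in range(n):
        x = mu + x + rng.gauss(0.0, sigma)
    return x

rng = random.Random(7)
x_t, mu, n = 10.0, 0.5, 20

# Average of many simulated future values approximates the expected value.
with_drift = [simulate_forward(x_t, mu, n, rng) for _ in range(5000)]
mean_with_drift = sum(with_drift) / len(with_drift)   # theory: mu * n + x_t = 20

driftless = [simulate_forward(x_t, 0.0, n, rng) for _ in range(5000)]
mean_driftless = sum(driftless) / len(driftless)      # theory: x_t = 10
print(mean_with_drift, mean_driftless)
```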
At first glance the random walk model might seem extreme. Is it possible to motivate that a series has an infinite memory, so that shocks remain in the series forever? The answer is yes. The most common example is that of innovations leading to economic growth, which then spills over into other economic variables. Innovations leading to economic growth do not occur at fixed intervals, nor is every single invention equally important. Over time, innovations will occur at random intervals and some inventions will be more important than others. The outcome is that productivity and economic growth are driven by a stochastic trend, just as described by a random walk. In empirical work it is common to find variables that behave like random walks.
Given forward looking behavior of economic agents, it is often possible to construct economic models where transformed variables will behave like random walks. In a forward looking world agents will use all relevant information when they determine today's prices. One important characteristic follows from this, namely that today's price is the best prediction of future prices. However, the relationship between today's price and the predicted future price is more complex. We return to this issue below, when we talk about martingales.


A note on the estimation and testing of random walks

A random walk process is also a series integrated of order one; it is also called a unit root process, and it contains a stochastic trend. Furthermore, a random walk process can be embedded in another process, say an $ARIMA(p, d, q)$ process. The difficulty is that inference on random walk variables (and integrated variables) is problematic, because the estimated parameter on the lagged term will not follow a standard normal distribution. Hence, ordinary $t$, chi-square and $F$ distributions are not suitable for inference. Parameter estimates will generally be asymptotically unbiased, but their standard errors and variances do not follow standard distributions.
For instance, a common t-test cannot be used to test if $a = 1$ in the regression,

$$x_t = a x_{t-1} + \varepsilon_t.$$

If $x_t$ follows a random walk, the distribution of $(\hat{a} - a)$ will be skewed to the left, and thus depart from the Student t distribution. Just as in any autoregressive model the estimate $\hat{a}$ will be biased downward. The term $(\hat{a} - a)$, however, becomes asymptotically a ratio between two random variables, which will lead to a second order bias in the estimation of the variance as $T \to \infty$. In this case, with a unit root process, the ratio consists of random variables which in turn are functions of Wiener processes. In this situation one common approach is to use the so-called Dickey-Fuller test in combination with simulated distributions.
Testing for a unit root ($a = 1$) is one aspect of testing if a variable is a random walk. Another aspect, if it is not possible to reject a unit root, is to test if the residual is $\sim NID(0, \sigma^2)$. Campbell, Lo and MacKinlay (1997, Ch. 1) show how you can test for the absence of autocorrelation when dealing with the null hypothesis of a random walk. Unfortunately, it is quite common in the literature to conclude that a series is a random walk (meaning not rejecting the null of a unit root) on the basis of unit root testing alone, forgetting about the properties of the residual term, which under a random walk is simply the first difference.
When testing for a random walk in limited samples it is extremely difficult to distinguish between a random walk and a stationary AR(1) model with a parameter of, say, 0.99.
As noted above, a problem with random walks, as well as with all variables that include stochastic trends, is that it is in general not possible to use standard distributions for inference. Parameter estimates will generally be asymptotically unbiased, but their standard deviations and variances do not follow standard distributions.
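The left skew of the test statistic under a unit root can be seen in a small Monte Carlo experiment, which is the idea behind the simulated Dickey-Fuller distributions mentioned above. The sketch below assumes the regression without a constant, a sample of 100 observations and 2000 replications; `df_tstat` is a hypothetical helper and the numbers it produces are simulation estimates, not tabulated critical values.

```python
import math
import random

def df_tstat(T, rng):
    """t-statistic for H0: a = 1 in x_t = a x_{t-1} + e_t, OLS without a constant,
    when the data are generated by a driftless random walk."""
    x = [0.0]
    for _ in range(T):
        x.append(x[-1] + rng.gauss(0.0, 1.0))
    lag, cur = x[:-1], x[1:]
    sxx = sum(v * v for v in lag)
    a_hat = sum(l * c for l, c in zip(lag, cur)) / sxx
    resid = [c - a_hat * l for l, c in zip(lag, cur)]
    s2 = sum(r * r for r in resid) / (T - 1)       # residual variance
    return (a_hat - 1.0) / math.sqrt(s2 / sxx)     # (a_hat - 1) / se(a_hat)

rng = random.Random(12345)
stats = sorted(df_tstat(100, rng) for _ in range(2000))
mean_t = sum(stats) / len(stats)
share_negative = sum(s < 0 for s in stats) / len(stats)
q05 = stats[int(0.05 * len(stats))]   # simulated 5 percent quantile
print(mean_t, share_negative, q05)
```

Under a standard normal distribution the mean would be zero, half of the statistics would be negative, and the 5 percent quantile would be about -1.645. The simulated values lie clearly to the left of all three benchmarks.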

6.6 Martingale Processes

A random variable is said to be a martingale if the present observation is the best prediction of all future values. Let $\{\tilde{X}_t\}_1^{\infty}$ be a process of the random variable $X_t$. We say that the variable is a martingale with respect to the information set $I_t$,$^1$ if the expected value of $\tilde{X}_{t+s}$ is equal to the present value of $\tilde{X}_t$,

$$E[\tilde{X}_{t+s} \mid I_t] = x_t \quad \text{for } s > 0.$$

1 Alternatively, it is possible to define the information set at time $t-1$, and write the definition as $E[\tilde{X}_t \mid I_{t-1}] = \tilde{X}_{t-1}$.



Given the information set, all information relevant for predicting $\tilde{X}_{t+s}$ is contained in today's value of $\tilde{X}_t$. Thus, the best prediction of $\tilde{X}_{t+1}$ is $x_t$, and the value of today is the best prediction of all periods in the future. The information set might include the history of $\tilde{X}_t$ as well as all other information that might be of relevance for predicting $\tilde{X}_{t+s}$. The definition of a martingale is always relative, since we have the freedom of defining different information sets. If $\tilde{X}_t$ is a martingale with respect to the information set $I_t'$, it might not be a martingale with respect to another information set $I_t''$ unless the two sets are identical.
We can now continue and define the martingale difference process as the expected difference between $\tilde{X}_{t+s}$ and $\tilde{X}_t$,

$$E[(\tilde{X}_{t+s} - \tilde{X}_t) \mid I_t] = E(\tilde{X}_{t+s} - x_t) = 0.$$

If a process is a martingale difference process, changes in the process are unpredictable from the information set.
The sub-martingale and the super-martingale are two versions of martingale processes. A sub-martingale is defined as

$$E[\tilde{X}_{t+s} \mid I_t] \ge x_t,$$

which says that, on average, the expected value is growing over time. A super-martingale is defined as

$$E[\tilde{X}_{t+s} \mid I_t] \le x_t,$$

which says that the expected value of $\tilde{X}_{t+s}$ is given by $\tilde{X}_t$ but is, on average, declining over time.
Martingales are well known in the financial literature. If the agents on a financial market use all relevant information to predict the yields of financial assets, the prices of these assets will, under certain special conditions, behave like martingales. The random walk hypothesis of asset prices does not come from finance theory. It is based on empirical observations, and is mainly a hypothesis about the empirical behavior of asset prices which lacks a theoretical foundation. A random walk process is a martingale, but also includes statements about distributions. For comparison, the random walk is the model $x_t = x_{t-1} + \varepsilon_t$, where $\varepsilon_t$ is a normally distributed white noise process. The latter is a stronger condition than assuming a martingale process. A random walk with a positive drift, $x_t = \mu + x_{t-1} + \varepsilon_t$, is a sub-martingale, since the deterministic trend will increase the expectation over time, $E(\tilde{X}_{t+1}) = \mu + x_t$.
Let us now turn to finance theory. Theory states that the price of an asset ($P_{t+1}$) at time $t+1$ is given by the price at $t$ plus a risk-adjusted discount factor $r$. If we assume, for simplicity, that the discount factor is a constant we get that $P_{t+1} = (1+r)P_t$. Asset prices are therefore not driftless random walks, or martingales. The process described by theory is $\ln P_{t+1} = \ln(1+r) + \ln P_t + \varepsilon_{t+1}$, which is a sub-martingale given, in this case, a constant discount factor. If we would like to say that asset prices are martingales we must either transform the price process according to $[P_{t+1}/(1+r)]$, or we must include the risk-adjusted discount factor in the information set.$^2$
Thus, the expected value of an asset price is, by definition, $E(P_{t+1}) = E(1+r)P_t$. If the discount factor (and risk) is a constant ($g$) we get $E(P_{t+1}) = g + P_t$, which is a random walk with drift. If the risk premium is a time-varying stochastic variable ($\tilde{G}_t$), we have $E(P_{t+1}) = g_t + P_t$, which takes us even further away from the random walk.

2 It is obvious that we can transform a variable into a martingale by subtracting elements from the process by conditioning or direct calculation. In fact most variables can be transformed into a martingale in this way. An alternative way of transforming a variable into a martingale is to transform its probability distribution. In this method you look for a probability distribution which is equivalent to the one generating the conditional expectations. This type of distribution is called an equivalent martingale distribution.

It is important to distinguish between martingales and random walks. Financial theory ends in statements about the expected mean of a variable with respect to a given information set. A random walk is defined in terms of its own past only. Thus, saying that a variable is a random walk does not exclude the case that there exists an information set for which the variable is not a martingale.
Furthermore, the residuals in a random walk model are by definition independent, if we assume them to be white noise. But a martingale describes the behavior of the first moment of a random variable. It does not imply independence between the higher moments of the series. If we model a martingale by a first order autoregressive process, we might find that the errors are dependent through higher moments. The variance of $\varepsilon_t$ is then not $\sigma^2$, but a function of its own past, like

$$\varepsilon_t^2 = \alpha_0 + \alpha_1 \varepsilon_{t-1}^2 + \nu_t,$$

where $\nu_t$ is a white noise process. This is an ARCH(1) model (Auto Regressive Conditional Heteroscedasticity), which implies that a large shock to the series is likely to be followed by another large shock. In addition, it implies that the residuals are not independent of each other.
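The difference between "uncorrelated" and "independent" can be made concrete by simulating an ARCH(1) error. The sketch below uses the usual conditional variance form, $h_t = \alpha_0 + \alpha_1 \varepsilon_{t-1}^2$ with $\varepsilon_t = \sqrt{h_t}\, z_t$ and $z_t$ standard normal; the parameter values ($\alpha_0 = 1$, $\alpha_1 = 0.3$) and the function names are assumptions made for illustration.

```python
import math
import random

def simulate_arch1(n, alpha0, alpha1, seed=3):
    """Simulate n ARCH(1) errors with conditional variance alpha0 + alpha1 * eps_{t-1}^2."""
    rng = random.Random(seed)
    eps_prev, out = 0.0, []
    for _ in range(n):
        h = alpha0 + alpha1 * eps_prev ** 2        # variance given the past
        eps_prev = math.sqrt(h) * rng.gauss(0.0, 1.0)
        out.append(eps_prev)
    return out

def autocorr1(z):
    """First-order sample autocorrelation."""
    m = sum(z) / len(z)
    num = sum((a - m) * (b - m) for a, b in zip(z[1:], z[:-1]))
    den = sum((a - m) ** 2 for a in z)
    return num / den

eps = simulate_arch1(50000, alpha0=1.0, alpha1=0.3)
rho_level = autocorr1(eps)                       # close to zero: the mean is unpredictable
rho_squared = autocorr1([e * e for e in eps])    # clearly positive: the variance is predictable
print(rho_level, rho_squared)
```

The level of the error looks like white noise, while its square is autocorrelated, so the errors are uncorrelated but not independent.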
The conclusion is that we must be careful when reading articles which claim that the exchange rate, or some other variable, should be, or is, a random walk. Often what the authors really mean is that the variable is a martingale, conditional on some information.
The martingale property is directly related to the efficient market hypothesis (EMH), which sets out the conditions under which changes in asset prices become unpredictable given different types of information.

6.7 Markov Processes

Markov$^3$ processes represent a general type of series with the property that the value at time $t$ contains all information necessary to form probability assessments of all future values of the variable. Compared with the martingale property above, this property is more far reaching. The martingale property is concerned with the conditional expectation of a variable, and not with the actual distribution function and the higher moments of the variable. Markov processes and the associated Markov property are important because they help us to form stochastic time series processes. In economics and finance we like to explain how expectations are generated and how expectations affect the outcome of observed prices and quantities on various markets.
In particular, in financial economics and the pricing of derivatives, we like to model asset prices as continuous stochastic processes. Once we can trace the price of an asset continuously over time into the future, we can also determine the price of derivatives through replication and arbitrage. In addition, we learn how to use derivatives to continuously hedge risky positions.$^4$
To predict or generate future possible paths of a Markov variable, we only need to know the most recent value, or the most recent values, of the variable. This is, in many modeling situations, a very practical assumption. We do not need to know the history of the variable to learn how it behaves, nor do we need to know actual values/observations of the future. The future of the series can be generated from its conditional past.

3 Markov is known for a number of results, including the so-called Markov estimates that prove the equality between OLS and MLE.
4 Recall that the definition of a derivative asset is a financial contract that (1) derives its value from some underlying asset, and (2) at the time of expiration has exactly the same price as the underlying asset.

Let $F(x_1, x_2, \ldots, x_t)$ be the distribution function of the random variable $X$. There are $1, 2, \ldots, t$ observations of the series, where $t$ might be equal to infinity. For each observation ($x_i$) there is a probability statement, $F(x_1, x_2, \ldots, x_t) = \mathrm{Prob}(X_1 \le x_1, X_2 \le x_2, \ldots, X_t \le x_t)$. A discrete time Markov process is characterized by the following property,

$$\mathrm{Prob}(\tilde{X}_{t+s} \le x_{t+s} \mid x_1, x_2, \ldots, x_t) = \mathrm{Prob}(\tilde{X}_{t+s} \le x_{t+s} \mid x_t),$$


where $s > 0$. The expression says that all probability statements of future values of the random variable $\tilde{X}_{t+s}$ are only dependent on the value the variable takes at time $t$, and do not depend on earlier realizations. By stating that a variable is a Markov process we put a restriction on the memory of the process.
The AR(1) model, and the random walk, are first-order Markov processes,

$$x_t = a_1 x_{t-1} + \varepsilon_t, \qquad \varepsilon_t \sim NID(0, \sigma^2).$$

Given that we know that $\varepsilon_t$ is a white noise process [$NID(0, \sigma^2)$], and can observe $x_t$, we know all there is to know about $x_{t+1}$, since $x_t$ contains all information about the future. In practical terms, it is not necessary to work with the whole series, only a limited present. We can also say that the future of the process, given the present, is independent of the past. For a first order Markov process, the expected value of $\tilde{X}_{t+1}$, given all its possible present and historical values $\tilde{X}_t, \tilde{X}_{t-1}, \tilde{X}_{t-2}, \ldots$, can be expressed as,

$$E[\tilde{X}_{t+1} \mid \tilde{X}_t, \tilde{X}_{t-1}, \tilde{X}_{t-2}, \ldots, \tilde{X}_{t-\infty}] = E[\tilde{X}_{t+1} \mid \tilde{X}_t].$$

Thus, the conditional expectation of a first order Markov process depends only on the present value. When that expectation equals $x_t$, as for the random walk, the process is also a martingale. Typically, the value of $\tilde{X}_t$ is known at time $t$ as $x_t$. The Markov property is a very convenient property if we want to build theoretical models describing the continuous evolution of asset prices. We can focus on the value today, and generate future time series, irrespective of the past history of the process. Furthermore, at each period in the future we can easily determine an "exact" future value, which is the "equilibrium" price for that period.
The white noise process, as an example, is a Markov process. This follows from the fact that we assumed that each $\varepsilon_t$ was independent from its own past, and future. One outcome of the assumption of a normal and independent process was that we could relatively easily form predictions and confidence intervals given only the value of $\varepsilon_t$ today.
The definition of a Markov process can be extended to an $m$:th order Markov process, for which we have

$$E[\tilde{X}_{t+1} \mid \tilde{X}_t, \tilde{X}_{t-1}, \tilde{X}_{t-2}, \ldots, \tilde{X}_{t-\infty}] = E[\tilde{X}_{t+1} \mid \tilde{X}_t, \tilde{X}_{t-1}, \tilde{X}_{t-2}, \ldots, \tilde{X}_{t-m}], \quad (6.27)$$

where we need to condition on $m$ historical (random) values (including the "present" value $\tilde{X}_t$) to predict the future.
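For a first-order Markov process such as the AR(1), values older than the present one should add nothing to the prediction. The sketch below checks this by regressing $x_{t+1}$ on both $x_t$ and $x_{t-1}$ for a simulated AR(1) with an assumed coefficient of 0.7; the helper names and sample size are illustrative choices.

```python
import random

def simulate_ar1(n, a1, seed=11):
    """Simulate x_t = a1 * x_{t-1} + e_t with standard normal e_t."""
    rng = random.Random(seed)
    x, out = 0.0, []
    for _ in range(n):
        x = a1 * x + rng.gauss(0.0, 1.0)
        out.append(x)
    return out

def ols_two_regressors(y, z1, z2):
    """OLS of y on z1 and z2 without a constant, solving the 2x2 normal equations."""
    s11 = sum(a * a for a in z1)
    s22 = sum(b * b for b in z2)
    s12 = sum(a * b for a, b in zip(z1, z2))
    s1y = sum(a * c for a, c in zip(z1, y))
    s2y = sum(b * c for b, c in zip(z2, y))
    det = s11 * s22 - s12 * s12
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det

x = simulate_ar1(20000, a1=0.7)
# Dependent variable x_{t+1}; regressors x_t (first lag) and x_{t-1} (second lag).
b1, b2 = ols_two_regressors(x[2:], x[1:-1], x[:-2])
print(b1, b2)   # b1 near 0.7, b2 near 0
```

For an $m$:th order process the same check would show non-zero coefficients up to lag $m$ and coefficients near zero beyond it.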


6.8 Brownian Motions

Consider the random walk model, $x_t = x_{t-1} + \varepsilon_t$, and assume that the distance between $t$ and $t-1$ becomes smaller and smaller. As the distance between the observations gets smaller, the function will in the end get so close to a continuous function that it becomes indistinguishable from a function in continuous time, $x(t) = x(t-1) + \varepsilon(t)$. This takes us to the random walk in continuous time, known as a Brownian motion or Wiener process. This section introduces Brownian motions (Wiener processes), the geometric Brownian motion, jump diffusion models and the Ornstein-Uhlenbeck process.
There are (at least) two very important reasons for studying Wiener processes. The first is that the limiting distributions of most non-stationary variables in economics and finance are given as functions of a Brownian motion. It is this knowledge that helps us to understand the distribution of estimates based on non-stationary variables. The second reason for learning about Brownian motions is that they play an important role in modeling asset prices in finance.
A word of warning: though Brownian motions have nice mathematical properties, it is not necessarily so that they also fit given data series better. Normal discrete empirical modelling will take you a long way.
The random walk is defined in discrete time. The intuition behind the random walk and the Brownian motion is as follows. If we let the steps between $t$ and $t-1$ become infinitely small, the random walk can be said to converge to Brownian motion (or a Wiener process). As the distance between $t$ and $t-1$, alternatively between $t$ and $t+1$, shrinks, it becomes harder and harder to distinguish between a discrete time process and a continuous time process. In the end, the difference will be so small that it will not matter.
These processes have a long history. The Brownian motion was named after an English botanist, Robert Brown, who in 1827 observed that small particles immersed in a liquid exhibited ceaseless irregular motion. Brown himself, however, named a few persons who had observed this phenomenon before him. In 1900 a French mathematician named Bachelier described the random variation in stock prices when he wanted to explain option prices. In 1917 Einstein observed similar behavior in gas molecules. Finally, Norbert Wiener gave the process a rigorous mathematical treatment in a series of papers between 1918 and 1923.
Is there a difference between what we call a Wiener process and a Brownian motion? In practice the answer is no. The two terms can be, and are, used interchangeably. If you look at the details you will find that the Brownian motion has normally distributed increments. The Wiener process, on the other hand, is explicitly assumed to be a martingale. No such statement is made for the Brownian motion.$^5$ In practice, these differences mean nothing (for more information search for the Lévy theorem). In econometrics there is a tendency to use Wiener processes to represent univariate processes and Brownian motion for multivariate processes.
The most important characteristic of a Brownian motion is that all increments are independent, and not predictable from the past. Thus the Brownian motion can be said to be a martingale and it fulfills the Markov property. The latter means that the distribution of future values at $(t + dt)$ depends only on the current value of $x(t)$. This is a good characteristic of models describing uncertainty, in particular situations when nature is evolving as a function of random steps that we cannot predict. The further we look into the future, the larger the number of random changes gets, and probability statements about future events get harder and harder.

5 See Neftci, Salih (2000), An Introduction to the Mathematics of Financial Derivatives, 2 ed. Academic Press, Amsterdam.

A generalized (arithmetic) Brownian motion is written as

$$dx_t = \mu\, dt + \sigma\, dW_t,$$

where $d$ represents the continuous or infinitesimally small change in the variable $x$ over the time interval $dt$. This can be written as $dx_t = x(t + dt) - x(t)$. The parameters $\mu$ and $\sigma$ are real numbers (constants) where $\sigma$ is strictly positive. As in a random walk the term $\mu\, dt$ represents the drift and $\sigma\, dW$ can be said to add a stochastic noise to the series. $W$ represents a standardized Wiener process, or Brownian motion, such that $dW$ represents the differential of the Brownian motion, and $dW_t = W(t + dt) - W(t)$ has a normal distribution with mean zero and variance equal to $dt$.
It is easy to see that $\mu\, dt$ represents a drift term. Take the expected value of the process, $E(dx_t) = \mu\, dt + \sigma \cdot 0$; both $\mu$ and $dt$ are non-stochastic, and $dW$ has an expected value of zero. It follows that

$$\mu = \frac{E(dx_t)}{dt}$$

represents the average change in $x$ per unit of time. Of course, if $\mu = 0$ we have a driftless random walk in continuous time, $E(dx_t) = \sigma E(dW_t) = 0$.
The variance is $Var(dx_t) = \sigma^2 Var(dW) = \sigma^2 dt$. Not shown here is that the changes in $x$ ($dx_t$) are independent and stationary.
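The mean and variance results can be checked by building the process from many small increments. This is a sketch only: it replaces $dt$ by a small finite step, and the parameter values ($\mu = 0.2$, $\sigma = 0.5$, horizon 4) and the helper name `abm_terminal` are assumptions chosen for illustration.

```python
import math
import random

def abm_terminal(mu, sigma, T, n_steps, rng):
    """x(T) - x(0) for dx = mu dt + sigma dW, summed over n_steps small increments."""
    dt = T / n_steps
    x = 0.0
    for _ in range(n_steps):
        dW = rng.gauss(0.0, math.sqrt(dt))   # mean zero, variance dt
        x += mu * dt + sigma * dW
    return x

rng = random.Random(21)
mu, sigma, T = 0.2, 0.5, 4.0
ends = [abm_terminal(mu, sigma, T, 50, rng) for _ in range(4000)]
mean_end = sum(ends) / len(ends)                                     # theory: mu * T = 0.8
var_end = sum((e - mean_end) ** 2 for e in ends) / (len(ends) - 1)   # theory: sigma^2 * T = 1.0
print(mean_end, var_end)
```

Because the increments are independent, their means and variances add up, so both the expected change and the variance grow linearly with the length of the interval.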

6.9 Brownian motions and the sum of white noise

In terms of the change over a specific (possibly) observable time period we need to introduce the notation $\Delta t$ to represent the change over some fraction of time $t$. By using this notation we can let $t$ be a year or a month, and then by changing $\Delta$ we can let the length of the period become smaller and smaller. The change due to the deterministic trend is written, per unit of time, as $\mu \Delta t$. The stochastic noise that we add to $dx$ over a given interval is written as $\sigma\sqrt{\Delta t}\,\varepsilon_t$, where $\varepsilon_t \sim NID(0, 1)$. In the limit, as $\Delta \to 0$, we have that $\Delta x \to dx$ and $\Delta t \to dt$. In terms of small intervals the Brownian motion becomes

$$\Delta x_t = \mu \Delta t + \sigma \varepsilon_t \sqrt{\Delta t}.$$
To understand the asymptotic properties of the Brownian motion we could let the number of observations go to infinity, but there is a better way to see what happens. As we study a standardized Brownian motion/Wiener process $W(t)$ over the interval $[0, T]$, we will find that we can divide this interval into segments $t_i - t_{i-1}$,

$$0 = t_0 < t_1 < t_2 < \ldots < t_i < \ldots < t_n = T.$$

Let the length of each segment be $\Delta = t_i - t_{i-1}$, and assume that there is a random variable $\Delta W_{t_i}$ that takes on either the value $+\sqrt{\Delta}$ or $-\sqrt{\Delta}$. Furthermore, assume that $\Delta W_{t_i}$ is independent of $\Delta W_{t_j}$ for $i \ne j$, so that each increment is uncorrelated with other increments. The Wiener process is now defined as the sum of $\Delta W_{t_i}$ as the number of segments goes to infinity, which is the same as saying that as the interval $[0, T]$ is divided into finer and finer segments, we have

$$W(t) = \sum_{i} \Delta W_{t_i} \quad \text{as } i \to \infty.$$

An extension of this, if $\varepsilon_t \sim NID(0, \sigma^2)$, is that $\Delta W_t = \varepsilon_t / (\sigma\sqrt{T})$ will also converge to a Wiener process. Thus, the sum of a standardized white noise will also converge to a standardized Wiener process. This result is crucial for the understanding of the distribution of a random walk and other unit root variables.
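The construction from two-valued steps can be tried directly. In the sketch below each step equals $+\sqrt{\Delta}$ or $-\sqrt{\Delta}$ with probability one half (the equal probabilities are an assumption made here so that each step has mean zero and variance $\Delta$); the interval length, the number of segments and the helper name are also illustrative.

```python
import math
import random

def binary_wiener_endpoint(T, n_segments, rng):
    """Sum of n_segments independent steps of size +sqrt(delta) or -sqrt(delta)."""
    step = math.sqrt(T / n_segments)   # delta = T / n_segments
    return sum(step if rng.random() < 0.5 else -step for _ in range(n_segments))

rng = random.Random(5)
T = 2.0
ends = [binary_wiener_endpoint(T, 200, rng) for _ in range(3000)]
mean_w = sum(ends) / len(ends)                    # theory: 0
var_w = sum(e * e for e in ends) / len(ends)      # theory: T
# Share of endpoints within 1.96 standard deviations; roughly 0.95 if W(T) is close to normal.
share_within = sum(abs(e) <= 1.96 * math.sqrt(T) for e in ends) / len(ends)
print(mean_w, var_w, share_within)
```

Although every individual step takes only two values, the sum over many small segments already behaves much like a normal variable with variance $T$.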


The geometric Brownian motion

The arithmetic Brownian motion is not well suited for asset prices as their changes seldom display a normal distribution. The log of asset prices, and returns, are better described with a normal distribution. This takes us to the geometric Brownian motion,

$$\frac{dx_t}{x_t} = \mu\, dt + \sigma\, dW_t.$$

What happens here is that we assume that $\ln x_t$ has a normal distribution, meaning that $x_t$ follows a log normal distribution, and $\mu\, dt + \sigma\, dW_t$ follows a normal distribution. Ito's lemma can be used to show that

$$d \ln x_t = \left(\mu - \frac{\sigma^2}{2}\right) dt + \sigma\, dW_t.$$

The expected value of the geometric Brownian motion is $E(dx_t / x_t) = \mu\, dt$, and the variance is $Var(dx_t / x_t) = \sigma^2 dt$.
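The difference between the drift of the level and the drift of the logarithm is easy to overlook, so a numerical check may help. The sketch draws terminal values from the log-normal solution implied by Ito's lemma, $\ln x_T = \ln x_0 + (\mu - \sigma^2/2)T + \sigma W_T$; the parameter values and the helper name `gbm_terminal` are assumptions for illustration.

```python
import math
import random

def gbm_terminal(x0, mu, sigma, T, rng):
    """Draw x(T) using d ln x = (mu - sigma^2 / 2) dt + sigma dW."""
    log_change = (mu - 0.5 * sigma ** 2) * T + sigma * math.sqrt(T) * rng.gauss(0.0, 1.0)
    return x0 * math.exp(log_change)

rng = random.Random(99)
x0, mu, sigma, T = 100.0, 0.08, 0.2, 1.0
ends = [gbm_terminal(x0, mu, sigma, T, rng) for _ in range(20000)]

# Average log return; theory: (mu - sigma^2 / 2) * T = 0.06
mean_log_return = sum(math.log(e / x0) for e in ends) / len(ends)
# Average level; theory: x0 * exp(mu * T), about 108.33
mean_level = sum(ends) / len(ends)
print(mean_log_return, mean_level)
```

The level grows on average at the rate $\mu$, while the average log return is lower by $\sigma^2/2$.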
There are several ways in which the model can be modified to better suit real world asset prices. One way is to introduce jumps in the process, so-called "jump diffusion models". This is done by adding a Poisson process to the geometric Brownian motion,

$$\frac{dx_t}{x_t} = \mu\, dt + \sigma\, dW_t + U_t\, dN(\lambda),$$

where $U_t$ is a normally distributed random variable and $N_t$ represents a Poisson process with intensity $\lambda$ to account for jumps in the price process.
The random walk model is good for asset prices, but not for interest rates. The movements of interest rates are more bounded than asset prices. In this case the so-called Ornstein-Uhlenbeck process provides a more realistic description of the dynamics,

$$dr_t = \alpha (b - r_t)\, dt + \sigma\, dW_t.$$

Thus the idea behind the Ornstein-Uhlenbeck process is that it restricts the movements of the variable ($r$) to be mean reverting, or to stay in a band, around $b$, where $b$ can be zero.
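Mean reversion can be illustrated with a simulated path. The sketch assumes the common parameterization $dr = \alpha(b - r)\,dt + \sigma\, dW$ with $\alpha > 0$ as the speed of adjustment (the symbols $\alpha$ and $\sigma$, the parameter values and the helper name are assumptions), and uses the exact one-step transition of that process rather than a crude approximation.

```python
import math
import random

def simulate_ou(r0, alpha, b, sigma, dt, n_steps, seed=13):
    """Simulate dr = alpha (b - r) dt + sigma dW on a grid with spacing dt."""
    rng = random.Random(seed)
    decay = math.exp(-alpha * dt)                                    # weight on the previous deviation from b
    step_sd = sigma * math.sqrt((1.0 - decay ** 2) / (2.0 * alpha))  # standard deviation of one step
    r, path = r0, [r0]
    for _ in range(n_steps):
        r = b + (r - b) * decay + step_sd * rng.gauss(0.0, 1.0)
        path.append(r)
    return path

path = simulate_ou(r0=0.10, alpha=2.0, b=0.04, sigma=0.02, dt=1.0 / 12, n_steps=24000)
tail = path[1000:]   # drop the start so that the initial value no longer matters
mean_r = sum(tail) / len(tail)                              # theory: b = 0.04
var_r = sum((r - mean_r) ** 2 for r in tail) / len(tail)    # theory: sigma^2 / (2 alpha) = 0.0001
print(mean_r, var_r)
```

Unlike the random walk, whose variance grows without bound, the variance here settles at a constant level, which is why the process stays in a band around $b$.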


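A more formal definition

If $\tilde{X}(t)$ is a Wiener process, $0 \le t < \infty$, the series always starts at zero, $\tilde{X}(0) = 0$, and if $t_0 \le t_1 \le t_2 \le \ldots \le t_n$, then all increments of $\tilde{X}(t_i)$ are independent. In terms of the density function we have,

$$D[x(t_1) - x(t_0), x(t_2) - x(t_1), \ldots, x(t_n) - x(t_{n-1}) \mid t_0, t_1, \ldots, t_n] = \prod_{i=1}^{n} D[x(t_i) - x(t_{i-1}) \mid t_0, t_1, \ldots, t_n]. \quad (6.32)$$

The expected value of each increment is zero,

$$E[\tilde{X}(t_n) - \tilde{X}(t_{n-1})] = 0,$$

with a variance

$$\mathrm{var}[\tilde{X}(t) - \tilde{X}(s)] = \sigma^2 (t - s),$$

where $0 \le s < t$. Finally, since the increments are a martingale difference process, we can assume that these increments follow a normal distribution, so $\tilde{X}(t) - \tilde{X}(s) \sim N[0, \sigma^2 (t - s)]$. These assumptions lead to the density function,

$$D[x_1, \ldots, x_n] = \frac{1}{\sqrt{2\pi\sigma^2 t_1}} \exp\left[-\frac{x_1^2}{2\sigma^2 t_1}\right] \prod_{i=2}^{n} \left[2\pi\sigma^2 (t_i - t_{i-1})\right]^{-1/2} \exp\left[-\frac{(x_i - x_{i-1})^2}{2\sigma^2 (t_i - t_{i-1})}\right],$$

where $x_i = x(t_i)$.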

If $\sigma^2 = 1$, the process is called a standard Wiener process or standard Brownian motion. That the Brownian motion is quite special can be seen from this density function. The sample path is continuous, but is not differentiable. [In physics this is explained as the motion of a particle which at no time has a velocity.]
Wiener processes are of interest in economics for many reasons. First, they offer a way of modeling uncertainty, especially in financial markets, where we sometimes have an almost continuous stream of observations. Secondly, many macroeconomic variables appear to be integrated or near integrated. The limiting distributions of such variables are known to be best described as functions of Wiener processes. In general we must assume that these distributions are non-standard.
To sum up, there are five important things to remember about the Brownian motion/Wiener process:

- It represents the continuous time (asymptotic) counterpart of random walks.
- It always starts at zero and is defined over $0 \le t < \infty$.
- The increments, meaning any change between two points, regardless of the length of the intervals, are not predictable, are independent, and are distributed as $N(0, (t - s)\sigma^2)$, for $0 \le s < t$.
- It is continuous over $0 \le t < \infty$, but nowhere differentiable. The intuition behind this result is that the differential implies predictability, which would go against the previous condition.
- Finally, a function of a Brownian motion/Wiener process will behave like a Brownian motion/Wiener process.

The last characteristic is important, because most economic time series variables can be classified as random walks, integrated or near-integrated processes. In practice this means that their variances, covariances etc. have distributions that are functionals of Brownian motions. Even in small samples, functionals of Brownian motions will better describe the distributions associated with economic variables that display tendencies of stochastic growth.





"Time is a great teacher, but unfortunately it kills all its pupils" Louis Hector
A time series is simply data ordered by time, and time series analysis is simply a set of approaches that look for regularities in these data ordered by time. Stochastic time series play an important part in economics and finance. To forecast and analyse these series it is necessary to take into account not only their stochastic nature but also the fact that they are non-stationary, dependent over time and by nature correlated with each other. In theoretical models, the emphasis on intertemporal decision making highlights the role expectations play in a world where decisions must be made from information sets made up of stochastic processes.
All time series techniques aim at making the series more understandable by decomposing them into different parts. This can be done in several ways. The aim of this introduction is to give a general overview of the subject. A time series is any sequence ordered by time. The sequence can be either deterministic or stochastic. The primary interest in economics is in stochastic time series, where the sequence is made up of random variables. A sequence of stochastic variables ordered by time is called a stochastic time series process. The random variables making up the process can either be discrete, taking on a given set of integer numbers, or be continuous random variables taking on any real number between $\pm\infty$. While discrete random variables are possible they are not common.
Stochastic time series can be analysed in the time domain or in the frequency domain. The former approach analyses stochastic processes in given time periods like days, weeks, years etc. The frequency approach aims at decomposing the process into frequencies by using trigonometric functions like sines and cosines. Spectral analysis is an example of analysis that uses the frequency domain to identify regularities like seasonal factors, trends, and systematic lags in adjustment. In economics and finance, where we are faced with given observations and we study the behavior of agents operating in real time, the time domain is the most interesting road ahead. There are relatively few problems that are interesting to analyze in the frequency domain.
Another dimension in modeling is whether processes are in discrete time or in continuous time. The principal difference here is that the stochastic variables in a continuous time process can be measured at any time $t$, and that they can take different values at any time. In a discrete time process, the variables are observed at fixed intervals of time ($t$), and they do not change between these observation points. Discrete time variables are not common in finance and economics. There are few, if any, variables that remain fixed between their points of observation. The distinction between continuous time and discrete time is not a matter of measurability alone. A common mistake is to be confused by the fact that economic variables are measured at discrete time intervals. The money stock is generally measured and recorded as an end-of-month value. The way of measuring the stock of money does not imply that it remains unchanged between the observation intervals; instead it changes whenever the money market is open. The same holds for variables like production and consumption. These activities take place 24 hours a day, during the whole year. They are measured as the flow of income and consumption over a period, typically a quarter, representing the integral sum of these activities.
Usually, a discrete time variable is written with a time subscript ($x_t$) while continuous time variables are written as $x(t)$. The continuous time approach has a number of benefits, but the cost and quality of the empirical results seldom motivate the continuous time approach. It is better to use discrete time approaches as an approximation to the underlying continuous time system. The cost of this simplification is small compared with the complexity of continuous time analysis. This should not be understood as a rejection of continuous time approaches. Continuous time is good for analyzing a number of well defined problems like aggregation over time and individuals. In the end it should lead to a better understanding of adjustment speeds, stability conditions and interactions among economic time series, see Sjöö (1990, 1995).$^1$ Thus, our interest is in analysing discrete time stochastic processes in the time domain.
A time series process is generally indicated with brackets, like $\{y_t\}$. In some situations it will be necessary to be more precise about the length of the process. Writing $\{y\}_1^{\infty}$ indicates that the process starts at period one and continues infinitely. The process consists of random variables because we can view each element in $\{y_t\}$ as a random variable. Let the process go from the integer values 1 up to $T$. If necessary, to be exact, the first variable in the process can be written as $y_{t_1}$, the second variable $y_{t_2}$ etc. up until $y_{t_T}$. The distribution function of the process can then be written as $F(y_{t_1}, y_{t_2}, \ldots, y_{t_T})$.
In some situations it is necessary to start from the very beginning. A time series is data ordered by time. A stochastic time series is a set of random variables ordered by time. Let $\tilde{Y}_{it}$ represent the stochastic variable $\tilde{Y}_i$ given at time $t$. Observations on this random variable are often indicated as $y_{it}$. A series starting at time $t = 1$ and ending at time $t = T$, consisting of $T$ different random variables, is written as $\{\tilde{Y}_{1,1}, \tilde{Y}_{2,2}, \ldots, \tilde{Y}_{T,T}\}$. Of course, assuming that the series is built up by individual random variables, with their own independent probability distributions, is a complex thought. But nothing in our definition of stochastic time series rules out that the data is made up of completely different random variables. Sometimes, to understand and find solutions to practical problems, it will be necessary to go all the way back to the most basic assumptions.
Suppose we are given a time series consisting of yearly observations of interest rates, $\{6.6, 7.5, 5.9, 5.4, 5.5, 4.5, 4.3, 4.8\}$. The first question to ask is whether this is a stochastic series in the sense that these numbers were generated by one stochastic process or perhaps several different stochastic processes. Further questions would be whether the process or processes are best represented as continuous or discrete, and whether the observations are independent or dependent. Quite often we will assume that the series are generated by the same identical stochastic process in discrete time. Based on these assumptions the modelling process tries to find systematic historical patterns and cross-correlations with other variables in the data.
All time series methods aim at decomposing the series into separate parts in
some way. The standard approach in time series analysis is to decompose as
$$y_t = T_{t,d} + S_{t,d} + C_{t,d} + I_t,$$
1 We can also mention the different types of series that are used: stocks, flows and price
variables. Stocks are variables that can be observed at a point in time, like the money stock
or inventories. Flows are variables that can only be observed over some period, like consumption or
GDP. In this context price variables include prices, interest rates and similar variables which can
be observed at a market at a given point in time. Combining these variables into multivariate
processes and constructing econometric models from observed variables in discrete time produces
further problems, and in general they are quite difficult to solve without using continuous time
methods. Usually, careful discrete time models will reduce the problems to a large extent.



where $T_{t,d}$ and $S_{t,d}$ represent (deterministic) trend and seasonal components, $C_{t,d}$ is
a deterministic cyclical component and $I_t$ is a process representing irregular factors.2
For time series econometrics this definition is limited. Instead, let $\{y_t\}$ be a
stochastic time series process, composed as
$$y_t = \text{systematic components} + \text{unsystematic component} = T_d + T_s + S_d + S_s + y_t^{*} + e_t,$$
where the systematic components include deterministic trends $T_d$, stochastic trends
$T_s$, deterministic seasonals $S_d$, stochastic seasonals $S_s$ and a stationary process (the
short-run dynamics) $y_t^{*}$, and where $e_t$ is a white noise innovation term. The modeling
problem can be described as the problem of identifying the systematic components
such that the residual becomes a white noise process. For all series, remember
that any inference is potentially wrong if not all components have been modeled
correctly. This is so regardless of whether we model a simple univariate series
with time series techniques, a reduced system, or a structural model. Inference
is only valid for a correctly specified model.
Present ARIMA
A class of models: ARIMA(p,d,q), ARFIMA(p,d,q) models
Box-Jenkins
Identification tools: ACF, PACFs, Q-test
Deal with:
Non-stationarity, dynamics
Seasonal effects
Deterministic variables
Theory ARIMA:
After ARIMA?
ARIMAX, Transfer function RDL, ARCH/GARCH
Structural: Single equation, ADL
Error correction models (older stuff)
Add for VAR: How to build VARs
Lags - white noise
Lags dummies white noise
Information criteria + min number of equations with AR
Add for Rational expectations GMM
2 For simplicity we assume a linear process. An alternative is to assume that the components
are multiplicative, $y_t = T_{t,d} S_{t,d} C_{t,d} I_t$.



7.1 Descriptive Tools for Time Series

Random variables are described by their moments. Stochastic time series can be
described by their means, variances and autocovariances. Given a random variable
$\tilde{Y}_t$ which generates an observed process $\{y_t\}$, the mean and the variance are
$E\{\tilde{Y}_t\} = \mu$ and $\mathrm{var}\{\tilde{Y}_t\}$. The autocovariance at lag $k$ is
$$\gamma_k = \mathrm{cov}(\tilde{Y}_t, \tilde{Y}_{t-k}) = E\{[\tilde{Y}_t - E(\tilde{Y}_t)][\tilde{Y}_{t-k} - E(\tilde{Y}_{t-k})]\}.$$

The dimension of a covariance measure makes it difficult to interpret in terms of the
strength of the relation. For practical work, a more useful measure is provided by
the autocorrelation,
$$\rho_k = \frac{\mathrm{cov}(\tilde{Y}_t, \tilde{Y}_{t-k})}{\sqrt{\mathrm{var}(Y_t)\,\mathrm{var}(Y_{t-k})}},$$

where $\rho_k$ is the autocorrelation between realisations of the series at time $t$ and
time $t-k$. The autocorrelation is a dimensionless number between $-1$ and $1$.
It can be calculated for any lag $k \geq 1$, and the sequence of autocorrelations is therefore
generally referred to as the autocorrelation function (ACF). Furthermore, if the series has
a stationary mean and variance, it does not matter if we calculate the correlation
function (or the autocovariances) backwards or forwards, $\rho_k = \rho_{-k}$.
The ACF tells us the following: the higher the value of $\rho_k$, the stronger is the
memory of the series. By studying how the autocorrelation changes as the distance $k$
between $t$ and $t-k$ changes, we can see if the autocorrelations tend to die out slowly or quickly,
or remain constant for a given number of lags $k$.3 If the ACF starts close to unity and
dies out slowly, this is a sign of a non-stationary variable. On the other hand, if
the ACF is zero at all lags, it is a sign of a white noise process where no historical values can
predict coming observations of the same series. For an observed time series,
the sample autocorrelation function is
$$\hat{\rho}_k = \frac{\sum_{t=k+1}^{T} (y_t - \bar{y})(y_{t-k} - \bar{y})}{\sum_{t=1}^{T} (y_t - \bar{y})^2}, \qquad k = 0, 1, 2, 3, \ldots,$$
where $T$ is the number of observations and $\bar{y}$ is the sample mean, $\bar{y} = (1/T)\sum_{i=1}^{T} y_i$.
In practical work, the standard assumption is a constant variance over the
sample, so that $\mathrm{var}(y_t) = \mathrm{var}(y_{t-k})$. The sample autocorrelations are estimates
and hence random variables, so they are associated with variances. Bartlett (1946)
shows that the variance of the $k$:th sample autocorrelation is approximately
$$\mathrm{var}(\hat{\rho}_k) = \frac{1}{T}\Big[1 + 2\sum_{j=1}^{k-1} \hat{\rho}_j^2\Big].$$
Given the variance, and hence the standard error of the estimate, it becomes
possible to set up a significance test. Asymptotically, this t-test has a
normal distribution, with an expected value of zero under the null of no autocorrelation
(no memory in the series). As a rule of thumb in a limited sample, a value of $\hat{\rho}_k$ larger
in absolute value than two times its standard error is considered significant.
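As an illustration, the sample autocorrelation function and the rule-of-thumb significance band can be computed in a few lines. The following is a minimal Python sketch and not part of the original notes: the function name `sample_acf` is made up for the example, the data are the eight interest rate observations quoted earlier in the text, and the band $2/\sqrt{T}$ assumes the white noise approximation $\mathrm{var}(\hat{\rho}_k) \approx 1/T$.

```python
import math

def sample_acf(y, max_lag):
    """Sample autocorrelations at lags 0..max_lag of the list y."""
    T = len(y)
    mean = sum(y) / T
    dev = [v - mean for v in y]              # deviations from the sample mean
    denom = sum(d * d for d in dev)          # sum of squared deviations
    acf = []
    for k in range(max_lag + 1):
        # sum over t = k+1..T (1-based) of (y_t - mean) * (y_{t-k} - mean)
        num = sum(dev[t] * dev[t - k] for t in range(k, T))
        acf.append(num / denom)
    return acf

# Yearly interest rates from the example earlier in the text.
rates = [6.6, 7.5, 5.9, 5.4, 5.5, 4.5, 4.3, 4.8]
rho = sample_acf(rates, max_lag=2)           # about T/4 lags
band = 2 / math.sqrt(len(rates))             # rough two-standard-error band

for k, r in enumerate(rho):
    print(k, round(r, 3), "significant" if k > 0 and abs(r) > band else "")
```

With only eight observations the band is very wide, so the example mainly shows the mechanics of the calculation.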
3 Standard practice is to calculate the first $K \approx T/4$ sample autocorrelations.

The next question is how much autocorrelation is left between the observations
at $t$ and $t-k$ ($\tilde{Y}_t$ and $\tilde{Y}_{t-k}$) after we remove (condition on) the influence of the
observations in between. Removing this influence means that we first calculate
the mean of $\tilde{Y}_t$ conditional on all observations from $\tilde{Y}_{t-1}$ to $\tilde{Y}_{t-k+1}$; another way of
expressing this is to say that we filter $\tilde{Y}_t$ from the influence of all lags of $\tilde{Y}_t$ between
$t-1$ and $t-k+1$. Using the expectations operator, we define the conditional
mean as $E\{\tilde{Y}_t \mid y_{t-1}, y_{t-2}, \ldots, y_{t-k+1}\}$. The partial autocorrelation is then
the slope coefficient in a regression of the filtered $\tilde{Y}_t$ on $\tilde{Y}_{t-k}$. This leads to the following
definition of the partial autocorrelation function,
$$\alpha_k = \frac{\mathrm{cov}(\tilde{Y}_t, \tilde{Y}_{t-k} \mid y_{t-1}, \ldots, y_{t-k+1})}{\mathrm{var}(\tilde{Y}_{t-k} \mid y_{t-1}, \ldots, y_{t-k+1})}.$$
The definition of the partial autocorrelation at lag $k$ can be recognised as the
coefficient on lag $k$ in the autoregressive regression
$$y_t = a_0 + a_1 y_{t-1} + \ldots + \alpha_k y_{t-k} + e_t.$$


Notice the difference: the partial autocorrelation $\alpha_k$ is a definition, not an estimate.
The first partial autocorrelation is estimated by regressing $y_t$ on $y_{t-1}$, the second
partial autocorrelation is estimated by regressing $y_t$ on $y_{t-1}$ and $y_{t-2}$, and so
on.4 The partial autocorrelation function can be estimated through regression
techniques, by the so-called Yule-Walker estimator, or alternatively by recursive
techniques (Durbin 1961). The recursive technique utilises the fact that the first
autocorrelation is equal to the first partial autocorrelation, $\hat{\rho}_1 = \hat{\alpha}_1$; then, given $\hat{\alpha}_1$,
the higher order $\hat{\alpha}_i$ are solved step by step in a recursive equation system.
The complicating factor is to estimate the variance of the partial autocorrelation
function. If a regression technique is used, the estimated regression variance
of $\hat{\alpha}_k$ is not a correct estimate of the variance, because until the residual process
is white noise, or at least free from autocorrelation, the estimated variance is
unreliable. Furthermore, the other (older) techniques of estimating the PACF do not
involve a variance estimate in the same way as the OLS estimator of $\alpha_k$. The solution,
therefore, is to evaluate the estimated $\hat{\alpha}_k$:s under the assumption that the series is a
white noise process. Anderson (1944) shows that the asymptotic variance of the estimated
autocorrelations of a white noise series is $1/T$. This leads to the (asymptotic)
significance test $\hat{\alpha}_k/(1/\sqrt{T})$. As a practical
rule of thumb, in a limited sample, a test statistic greater than 2 in absolute value is considered
significant, and leads to a rejection of the null of $\alpha_k = 0$. The PACF informs about
the length of an autoregressive process: for an autoregressive process of order $p$,
the last significant partial autocorrelation is $\alpha_p$.
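The recursive technique mentioned above can be sketched as follows. This is an illustrative Python version of the recursion commonly known as the Durbin-Levinson algorithm, written under the assumption that the autocorrelations are already available; the function name `pacf_from_acf` and the test values are hypothetical and not taken from the notes.

```python
def pacf_from_acf(rho, max_lag):
    """Partial autocorrelations at lags 1..max_lag.

    rho[k] is the autocorrelation at lag k, with rho[0] == 1.
    """
    pacf = []
    phi_prev = []                      # coefficients of the AR(k-1) projection
    for k in range(1, max_lag + 1):
        if k == 1:
            phi_kk = rho[1]            # first PACF equals first ACF
        else:
            num = rho[k] - sum(phi_prev[j] * rho[k - 1 - j] for j in range(k - 1))
            den = 1.0 - sum(phi_prev[j] * rho[j + 1] for j in range(k - 1))
            phi_kk = num / den
        # update the lower-order coefficients, then append the new last one
        phi_new = [phi_prev[j] - phi_kk * phi_prev[k - 2 - j] for j in range(k - 1)]
        phi_new.append(phi_kk)
        phi_prev = phi_new
        pacf.append(phi_kk)
    return pacf

# Theoretical ACF of an AR(1) with coefficient 0.5: rho_k = 0.5**k.
rho_ar1 = [0.5 ** k for k in range(4)]
alpha_ar1 = pacf_from_acf(rho_ar1, 3)
print(alpha_ar1)
```

For the AR(1) autocorrelations used here the PACF should cut off after the first lag, which is the pattern the identification scheme relies on.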
A closer look at these measures, and the way they are calculated reveals that
they are only interesting for stationary series. The same holds for the mean and
the variance, and other moments.
The two measures, the ACF and the PACF, are complementary to other
descriptive devices, such as the mean, the variance, kurtosis, etc. The ACF and
the PACF describe the memory of a process. They explain if and how a series can
be predicted from its own past. They help us to identify which type of process we
are studying: a white noise process, an integrated process, an AR process,
an MA process, or an ARMA process.
A white noise series is recognized by its lack of significant ACF and PACF
coefficients. Integrated variables are identified by the fact that their ACF dies out
very slowly, in combination with at least one PACF coefficient close to unity.
Stationary ARMA models are identified with the following identification scheme:

Model        ACF                   PACF
AR(p)        Tails off             Cuts off at lag p
MA(q)        Cuts off at lag q     Tails off
ARMA(p,q)    Tails off             Tails off

4 Notice that in the regression the parameters $a_1, a_2, \ldots, a_{k-1}$ are not identical to
$\alpha_1, \alpha_2, \ldots, \alpha_{k-1}$, due to the (possible) correlation between $y_{t-1}$ and the other
lags like $y_{t-2}$ etc. The regression formula only identifies the last coefficient, at lag $k$,
as the PACF $\alpha_k$.



This identification scheme is a direct consequence of the properties of
each type of model, and the properties of each model can be calculated theoretically.
These calculations are an important part of time series analysis and we will
come back to them below.
The idea behind ARIMA modeling is to first calculate the ACF and the PACF
and use these to form an idea about the order of integration $d$ and the orders $p$
and $q$. The second step, given what we know about the orders $d$, $p$ and $q$,
is to estimate an ARIMA model. The third step is to test the estimated
model for autocorrelation in the residual. The fourth step is to re-estimate models to
find the best model according to three criteria: i) no autocorrelation, ii) the
lowest possible residual variance and iii) not so many parameters that the
model becomes too complex.


Weak and Strong Stationarity

A fundamental issue when analyzing time series processes is whether they are
stationary or not. As a first, general definition, we can say that a non-stationary
series changes its behavior over time, for example such that the mean is changing over time.
Many economic time series are non-stationary in the sense that they are growing
over time, their estimated variances are also growing and the covariance function
never dies out. In other words, the calculated mean, autocovariances etc.
depend on the time period we study, and inference becomes impossible. A
stationary series, on the other hand, displays a behavior which is independent of
the time period, and it becomes possible to test for significance. Non-stationarity
must either be removed before modeling or included in the model. This requires
that we know what type of non-stationarity we are dealing with.
The problem with non-stationarity is that a series can be non-stationary in an
infinite number of ways. And, to make the problem even more complex, some types
of non-stationarity will skew the distributions of the estimates such that inference
based on standard distributions such as the $t$, the $F$ or the $\chi^2$ distributions
is not only wrong but completely misleading. In order to model time series, we
need to understand what non-stationarity is, how to estimate it and how to deal
with it.


Weak Stationarity, Covariance Stationary and Ergodic Processes

Of the two concepts, weak stationarity is the practical one. Weak stationarity
is defined in terms of the first two moments of the process, the mean and the
(auto)covariances. A process $\{x_t\}$ is (weakly) stationary if (1) the mean is independent of
time $t$,
$$E\{x_t\} = \mu,$$
(2) the variance exists and is less than infinity,
$$\mathrm{var}\{x_t\} = \sigma^2 < \infty,$$
and (3) the autocovariance depends only on the lag $k$,
$$\mathrm{cov}\{x_t, x_{t-k}\} = \gamma_k.$$
Thus, the mean and the variance are constant over time, and the covariance
between two values of the process is only a function of the distance between the
two points.
A related concept is that of covariance stationarity. If the autocovariances go
to zero as the distance between the two points increases, the series is said to be
covariance stationary (or ergodic),
$$\mathrm{cov}(x_t, x_{t-k}) \to 0 \quad \text{as } k \to \infty.$$
This definition brings us to the concept of ergodicity, which can be understood
as a weak form of average asymptotic independence. The most important
condition, though not a sufficient one, for a series to be ergodic is
$$\lim_{T \to \infty} T^{-1} \sum_{k=1}^{T} \mathrm{cov}(x_t, x_{t-k}) = 0.$$
Compared with the former concept, $\mathrm{cov}(x_t, x_{t-k}) \to 0$, ergodicity implies a
restriction on the strength of the covariance structure: as more and more autocovariances
are calculated, their mean should go to zero. The term ergodic is used
in connection with stationarity conditions.


Strong Stationarity

Strong stationarity is defined in terms of the distribution function of $\{x_t\}$. Consider
a process that is ordered from observation 1 up to observation $T$. Each
observation up to $T$ can be thought of as a random variable. Hence we can
write the first variable in the process as $x_{t_1}$, the second variable as $x_{t_2}$, etc., up until
$x_{t_T}$. The distribution function for this process is $F(x_{t_1}, x_{t_2}, \ldots, x_{t_T})$. Next, define
the distribution function of $\{x_t\}$ for another time interval, shifted by $j$ periods, where
$j = 1, 2, \ldots, T$. This leads to the distribution function $F_j(x_{t_1+j}, x_{t_2+j}, \ldots, x_{t_T+j})$.
Strong stationarity requires that the two distribution functions are identical, such
that $F(x_{t_1}, x_{t_2}, \ldots, x_{t_T}) = F_j(x_{t_1+j}, x_{t_2+j}, \ldots, x_{t_T+j})$, meaning that the
characteristics of the process are independent of time. We will get the same means, etc.,
independently of the time period we choose for our calculations. By letting $j$ take
different integer values we get the $j$:th order strong stationarity. Thus, $j = 1$
leads to first order (strong) stationarity, etc.
Strong stationarity implies weak stationarity, provided that the first two moments exist. But the
practical problem is that it is difficult to work with distribution functions for
continuous random variables, so strong stationarity is mainly a theoretical concept.
In this chapter we deal with a very broad class of models named
ARMA models, autoregressive moving average models. These are
a set of models that describe the process $\{x_t\}$ as a function of its
own lags and a white noise process. The autoregressive model of
order $p$ [AR($p$)] is defined as
$$x_t = a_0 + a_1 x_{t-1} + \ldots + a_p x_{t-p} + e_t,$$
where $e_t$ is a white noise process. A moving average model of order
$q$ [MA($q$)] is defined as
$$x_t = a_0 + e_t - b_1 e_{t-1} - \ldots - b_q e_{t-q},$$
where $e_t$ is a white noise process. The combination of autoregressive
and moving average processes gives the ARMA($p,q$) model
$$x_t = a_0 + a_1 x_{t-1} + \ldots + a_p x_{t-p} + e_t - b_1 e_{t-1} - \ldots - b_q e_{t-q}.$$
In addition we have integrated processes. An integrated process is defined
as follows: a process $x_t$ is said to be integrated of order $d$, written $I(d)$, if it contains no
deterministic components, is non-stationary in levels, but becomes stationary after
differencing $d$ times. Thus, a stationary series is denoted $x_t \sim I(0)$, a first order
integrated series is denoted $I(1)$, etc.
To analyse time series it is necessary to introduce additional descriptive statistical
tools besides means and variances. To handle the equations in an
efficient way we also need a set of operators, and we need to classify time series as
stationary or non-stationary. The descriptive devices are autocovariances,
autocorrelations and partial autocorrelations. For the classification we need the
concepts of weak and strong stationarity, and ergodic processes. The operators
needed are the sum operator, the lag operator and the difference operator.


Finding the Optimal Lag Length and Information Criteria

In empirical work, the question is how to find the correct lag length. If we choose
too few lags the model will by definition be misspecified, and the assumption of
normally distributed white noise residuals will be wrong. On the other hand,
adding more lags to the AR or MA process will make the model capture more
of the possible memory of the process, but the estimates will be inefficient. We
need to add as few lags as possible, without rejecting the assumption of white
noise residuals. The Box-Jenkins method suggests that we start with a relatively
large number of lags and test for autocorrelation. Among those models which
have no significant autocorrelation, we then pick the model with the lowest
information criterion.
In the Box-Jenkins approach, testing for white noise is equal to testing for
autocorrelation. The typical test for autocorrelation is the Box-Pierce test, also
known as the portmanteau test or the Q-test; a modified version is known as the Ljung-Box test.
To test for $p$:th order autocorrelation in a mean adjusted series $\hat{\varepsilon}_t$, calculate the
$k$:th order autocorrelation coefficient,
$$\hat{\rho}_k = \frac{\sum_{t=k+1}^{T} \hat{\varepsilon}_t \hat{\varepsilon}_{t-k}}{\sum_{t=1}^{T} \hat{\varepsilon}_t^2},$$
for $k = 1, 2, \ldots, p$. The Box-Pierce test statistic is then given by
$$BP = T \sum_{k=1}^{p} \hat{\rho}_k^2.$$
Under the null of no autocorrelation this test statistic has a $\chi^2(p)$ distribution.
The Box-Pierce statistic is best suited for testing the residual in an AR model.
A modification, for ARMA and more general regression models, is the so-called
Box-Ljung statistic,
$$BL = T(T+2) \sum_{k=1}^{p} \frac{\hat{\rho}_k^2}{T-k},$$
which is also distributed as $\chi^2(p)$.
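Both test statistics are simple functions of the residual autocorrelations. The following Python sketch computes them for a mean adjusted series; the function names and the toy residual series are assumptions made for the illustration, and the resulting statistic still has to be compared with a $\chi^2(p)$ critical value taken from a table (about 3.84 at the 5% level for $p = 1$).

```python
def residual_autocorr(e, k):
    """k-th order autocorrelation of a mean adjusted series e."""
    num = sum(e[t] * e[t - k] for t in range(k, len(e)))
    den = sum(v * v for v in e)
    return num / den

def q_statistics(e, p):
    """Return (Box-Pierce, Ljung-Box) statistics based on the first p lags."""
    T = len(e)
    rho = [residual_autocorr(e, k) for k in range(1, p + 1)]
    bp = T * sum(r * r for r in rho)
    lb = T * (T + 2) * sum(r * r / (T - k) for k, r in enumerate(rho, start=1))
    return bp, lb

# A toy residual series with strong negative first order autocorrelation.
resid = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
bp, lb = q_statistics(resid, p=1)
print(bp, lb)
```

The Ljung-Box version weights each squared autocorrelation by $(T+2)/(T-k)$, so it is always somewhat larger than the Box-Pierce statistic in a finite sample.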

Given that the residuals of the estimated ARMA model do not display autocorrelation,
we can turn to the optimal lag length. Information criteria work much like
adjusted $R^2$ values. In an ordinary linear regression, as more explanatory
variables are added to the model, the $R^2$ value will go up and the efficiency
of the estimated parameters will go down. To compare the $R^2$ values of the same model,
estimated with more or fewer explanatory variables, it is necessary to look at the so-called
adjusted $R^2$ values. The principle behind an information criterion is to create
a measure that rewards us in the modelling process for reducing the residual variance,
but punishes us for adding so many lags that the estimates become inefficient
and the prediction intervals too wide.
There are several information criteria. They are developed for special situations.
In practice, however, they often tend to give the same answer in the end.
The most well known criterion is Akaike's Information Criterion (AIC). If we estimate
an autoregressive model with $k$ lags from a sample of $T$ observations,
Akaike's information criterion is
$$AIC = \log \hat{\sigma}_{\varepsilon}^2 + 2k/T,$$
where $\hat{\sigma}_{\varepsilon}^2$ is the estimated residual variance. Since the estimated residual variance
gets smaller the more lags there are in the model, the last term ($2k/T$)
compensates for the number of estimated parameters in the model. The smaller
the value of the information criterion the better is the model, as long as there is no
autocorrelation in the residuals.
For models with both AR and MA components Hannan and Rissanen suggested
a different criterion,
$$\log \hat{\sigma}_{\varepsilon}^2 + (p + q)(\log T)/T,$$
where $p$ and $q$ are the lag orders of the autoregressive and the moving average
parts of the model. As for Akaike's criterion, the smaller the value the better the
model. From these two original criteria a number of different criteria have been
developed, such as the Schwarz information criterion (SIC), the Bayesian information
criterion (BIC) and Hatami's information criterion (HIC).
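To make the trade-off concrete, the two criteria can be evaluated for a set of candidate models. The Python sketch below uses the formulas exactly as written above; the residual variances are invented numbers used only for the illustration, and the function names are hypothetical.

```python
import math

def aic(resid_var, k, T):
    """Akaike's criterion as written in the text: log(sigma^2) + 2k/T."""
    return math.log(resid_var) + 2 * k / T

def hannan_rissanen(resid_var, p, q, T):
    """Second criterion in the text: log(sigma^2) + (p + q) log(T) / T."""
    return math.log(resid_var) + (p + q) * math.log(T) / T

T = 100
# Made-up residual variances of AR(k) models, k = 1..4 (illustration only).
resid_vars = {1: 1.30, 2: 1.05, 3: 1.04, 4: 1.035}

aic_values = {k: aic(s2, k, T) for k, s2 in resid_vars.items()}
best_k = min(aic_values, key=aic_values.get)   # lag length with the lowest AIC
print(aic_values, best_k)
```

In this invented example the residual variance hardly falls after the second lag, so the penalty term dominates and the criterion stops rewarding additional lags.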


The Lag Operator

When dealing with time series and dynamic econometric models, the expressions
are easier to handle with the backward shift operator ($B$) or the lag operator
($L$).5 The backward shift operator is the symbol most often used in statistical
textbooks. Econometricians tend to use the lag operator more often. The first
order lag operator is defined as
$$L x_t = x_{t-1},$$
or more generally as the $n$:th order lag operator,
$$L^n x_t = x_{t-n}.$$
The lag operator is an expression such that when it is multiplied with an
observation at any given time, it shifts the observation one period backwards
in time. In other words, the lag operator can be viewed as a time traveling device,
which makes it possible to travel both forward and backwards in time. A forward
shift operator can be constructed along the same lines. Thus, moving forward $n$
observations in the series from an observation at time $t$ is done by $L^{-n} x_t = x_{t+n}$.
5 The practical difference between using the lag operator or the backward shift operator is
that the lag operator also affects the conditional expectations generator $E_t$, which is of interest
when working with economic theories dealing with expectations.
The properties of the lag operator imply that we can write an autoregressive
expression of order $p$ (AR($p$)) as
$$a_0 x_t + a_1 x_{t-1} + a_2 x_{t-2} + \ldots + a_p x_{t-p} = a_0 x_t + a_1 L x_t + a_2 L^2 x_t + \ldots + a_p L^p x_t$$
$$= (a_0 + a_1 L + a_2 L^2 + \ldots + a_p L^p) x_t = A(L) x_t.$$


Notice that the lag operator can be moved across the equal sign. The AR(1)
model, $x_t = a_1 x_{t-1} + \varepsilon_t$, can be written as $(1 - a_1 L) x_t = \varepsilon_t$, or $A(L) x_t = \varepsilon_t$, or
$x_t = [A(L)]^{-1} \varepsilon_t$. If necessary the lag length of the process can be indicated as
$A_p(L)$. An ARMA($p,q$) process can be written compactly as
$$A_p(L) x_t = B_q(L) \varepsilon_t.$$
Skipping the indication of lag lengths for convenience, the ARMA model can be
written as $x_t = [A(L)]^{-1} B(L) \varepsilon_t$ or alternatively, depending on the context, as
$[B(L)]^{-1} A(L) x_t = \varepsilon_t$. Thus, the lag operator can be manipulated like any other
algebraic expression. However, whether or not moving the lag operator around results in a
meaningful expression depends on the principles of stationarity and invertibility,
known as duality.


Generating Functions

The function $A(L)$ is a convenient way of writing the sequence of coefficients. More generally
we can refer to any expression of the type $A(L)$ as a generating function. This
includes the mean operator, the variance and covariance operators etc. Generating
functions summarize a lot of information about sequences in a compact way and
are an important tool in time series analysis. Their main advantage is that they
save time and make the expressions much simpler, since a number of mathematical
operations can be applied to generating functions. As an example, given certain
conditions concerning the sum $\sum a_i$, we can invert $A(L)$, such that $A(L)^{-1} A(L) = 1$.
The generating function for the lag operator is
$$D(L) = \sum_{i=0}^{\infty} d_i L^i,$$
where $d_i$ is generated by some other function. The point here is that it is often
easier to do manipulations on $D(L)$ directly than on each individual element in
the expression. In the example above, we would refer to $A(L) x_t$ as the generating
function of $x_t$.
A property of generating functions is that they are additive. If we have two
series, $a_i$ and $b_i$ with $i = 0, 1, 2, \ldots$, and define a third series as $c_i = a_i + b_i$, it then
follows that
$$C(L) = A(L) + B(L).$$
Another property is that of convolution. Take the series $a_i$ and $b_i$ from above;
a new series $d_i$ can then be defined by
$$d_i = a_0 b_i + a_1 b_{i-1} + a_2 b_{i-2} + \ldots + a_i b_0 = \sum_{h=0}^{i} a_h b_{i-h}.$$
In this case we write $D(L)$ as
$$D(L) = A(L) B(L).$$
The results stated in this section should be compared with chapter 19, below,
which shows how long-run multipliers, etc. can be derived from the lag operator.
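The convolution property says that multiplying two lag polynomials amounts to convolving their coefficient sequences. A minimal Python sketch, with a hypothetical helper `poly_mul` and coefficients chosen only for illustration:

```python
def poly_mul(a, b):
    """Coefficients of D(L) = A(L)B(L); a[i] and b[i] multiply L**i."""
    d = [0.0] * (len(a) + len(b) - 1)
    for h, ah in enumerate(a):
        for j, bj in enumerate(b):
            d[h + j] += ah * bj        # a_h * b_j contributes to the L**(h+j) term
    return d

a = [1.0, -0.5]          # A(L) = 1 - 0.5 L
b = [1.0, 0.5, 0.25]     # B(L) = 1 + 0.5 L + 0.25 L^2
d = poly_mul(a, b)
print(d)

# The same coefficients from the convolution sum d_i = sum_h a_h * b_{i-h}.
d_check = [
    sum(a[h] * b[i - h] for h in range(len(a)) if 0 <= i - h < len(b))
    for i in range(len(a) + len(b) - 1)
]
```

Here $B(L)$ holds the first three terms of the expansion of $(1 - 0.5L)^{-1}$, so the product is close to one apart from a single higher order remainder term. This is the mechanism behind inverting a lag polynomial.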


The Difference Operator

Given the definition of the lag operator (or the backward shift operator) the
difference operator ($\Delta$) is defined as
$$\Delta = 1 - L,$$
which for a variable $x_t$ leads to $\Delta x_t = (1 - L) x_t = x_t - x_{t-1}$. Notice that
in time series statistics the difference operator is usually denoted by $\nabla$. In
practice the $\Delta$-symbol denotes taking first differences of a discrete variable. For
a continuous variable, taking first differences corresponds to taking the derivative with
respect to time. If $x(t)$ is a continuous time stochastic variable,
$$Dx = dx/dt,$$
where $D = d/dt$.
Differences of higher order are denoted in the same way as for the lag operator.
Thus for the second difference of $x_t$ we write
$$\Delta^2 x_t = (1 - L)^2 x_t = (1 - 2L + L^2) x_t = x_t - 2x_{t-1} + x_{t-2}.$$
Higher order differences are given as
$$\Delta^d x_t = (1 - L)^d x_t.$$
Notice the difference between the difference operators $\Delta^d x_t$ and $\Delta_s x_t$. The
first is the conventional difference operator, the second is the seasonal difference
operator, such that
$$\Delta_s x_t = (1 - L^s) x_t = x_t - x_{t-s}.$$
The subscript $s$ indicates the interval over which we take the (seasonal) difference.
If $x_t$ is quarterly, setting $s = 4$ leads to the yearly changes in the series.
This new series can then be differenced by using the difference operator,
$$\Delta \Delta_s x_t = (1 - L)(1 - L^s) x_t.$$
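The difference operators are easy to try out on a small series. The following Python sketch uses a hypothetical helper `difference` and an artificial quadratic trend as data; it is meant only to show how $\Delta$, $\Delta^2$, $\Delta_s$ and $\Delta\Delta_s$ act on a series.

```python
def difference(x, lag=1):
    """Apply (1 - L**lag) to the list x; the first `lag` observations are lost."""
    return [x[t] - x[t - lag] for t in range(lag, len(x))]

x = [t * t for t in range(10)]       # artificial series with a quadratic trend

d1 = difference(x)                   # first difference, (1 - L) x_t
d2 = difference(d1)                  # second difference, (1 - L)^2 x_t
ds = difference(x, lag=4)            # seasonal difference with s = 4, (1 - L^4) x_t
dds = difference(ds)                 # (1 - L)(1 - L^4) x_t

# The second difference written out as x_t - 2 x_{t-1} + x_{t-2}.
d2_direct = [x[t] - 2 * x[t - 1] + x[t - 2] for t in range(2, len(x))]
print(d1, d2, ds, dds)
```

Each differencing step shortens the series, which matters in small samples: one ordinary and one seasonal difference with $s = 4$ cost five observations.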



The generating functions take us to the concept of filters. If $x_t$ is an AR($p$) process,
the autoregressive part of the model can be thought of as a filter, such that if we
multiply $x_t$ with $A_p(L)$ the result is a white noise process. In the same way, given
a white noise series $e_t$ and some filter $B(L)$, $B(L) e_t = y_t$ generates the series $y_t$.
Alternatively, think of $S(t)$ as the seasonal component of the series $x_t$, or in other
words the seasonal filter. Multiplying $x_t$ with $S(t)$, or in a linear relation subtracting
$S(t) x_t$ from $x_t$, gives a deseasonalised variable. Thus, in this context
the term filter is a broad concept, which indicates that we can transform series in
different ways. From white noise we can produce ARIMA processes, or we can
extract certain components out of a series.


Dynamics and Stability

Given the parameters of an autoregressive process we may ask if the process is
stationary or not. Starting from a steady state solution, will a shock to the process,
given its parameters, result in an explosion of the series, in infinite growth or in a
temporary deviation from steady state? The answers to these questions are given
by analysing the roots of the polynomial given by the autoregressive process $A(L)$.
An autoregressive process can always be expressed as a stochastic difference
equation, and we can deal with it in the same way as with an ordinary difference
equation. Starting from $A(L) y_t = \varepsilon_t$, subtracting $y_{t-1}$ from both sides leads to the
difference equation $\Delta y_t = A^{*}(L) y_{t-1} + \varepsilon_t$. The solution of this equation is
$$y_t = y_p + y_c,$$
where $y_p$ represents the particular solution, the long-run steady state equilibrium
or the stationary long-run mean of $y_t$, and $y_c$ represents the complementary
solution, the deviation from the long-run steady state.
Dynamic stability requires that $y_c$ vanishes as $T \to \infty$. The roots of the
polynomial $A(L)$ tell us if this occurs. Given a change in $\varepsilon_t$, what will happen to
$y_{t+1}, y_{t+2}, \ldots, y_{t+\infty}$? Will the series explode, continue to grow for ever, or change
temporarily until it returns to the steady state equilibrium described by $y_p$? The roots
are given by solving for the $r$:s in the following equation,
$$r^p + a_1 r^{p-1} + a_2 r^{p-2} + \ldots + a_p = 0.$$


This equation leads to the latent roots of the polynomial. The condition for
stability, when using the latent roots, is that the roots should be less than unity
in absolute value, that is, they should lie inside the unit circle. Roots equal to unity, so-called
unit roots, imply an ever-growing series (stochastic trend), and roots greater than unity
imply an explosive process. Complex roots suggest that the adjustment is cyclical.
Though not very likely, the process could follow an explosive cyclical path or
display cyclical permanent shocks. If the process is stationary, following a shock, $y_t$ will
return to its stationary long-run mean. The case with one or several unit roots is of
particular interest because it represents stochastic growth in a non-stationary variable.
Series with one or more unit roots are also called integrated series. Many economic time
series appear to have a unit root, or roots close to unity.
Using latent roots to define stability is common, but is not the only way to define
stability. Latent roots, or eigenvalues, are motivated by the fact that they are
easier to work with when matrix algebra is used. An alternative way of defining
stability is to solve for the roots ($\lambda$) in the following equation,
$$1 + a_1 \lambda + a_2 \lambda^2 + \ldots + a_p \lambda^p = 0.$$
These roots are the reciprocals of the latent roots. If the roots are greater than unity
in absolute value, $|\lambda| > 1$, that is, if they lie outside the unit circle, the process is
stationary; if the roots are less than unity the
process is explosive. The historical literature on time series uses both definitions;
however, latent roots, or eigenvalues, are now the established standard.
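The two root conventions can be checked numerically for a second order process. The sketch below is an illustration under a stated assumption: the AR(2) is written with the lags on the right-hand side, $x_t = a_1 x_{t-1} + a_2 x_{t-2} + e_t$, so the latent roots solve $r^2 - a_1 r - a_2 = 0$ (the signs differ from the equation above only because of where the coefficients are placed). The function names and parameter values are hypothetical.

```python
import cmath

def latent_roots_ar2(a1, a2):
    """Roots r of r**2 - a1*r - a2 = 0 for x_t = a1*x_{t-1} + a2*x_{t-2} + e_t."""
    disc = cmath.sqrt(a1 * a1 + 4 * a2)
    return (a1 + disc) / 2, (a1 - disc) / 2

def is_stationary(a1, a2):
    """True if both latent roots lie strictly inside the unit circle."""
    return all(abs(r) < 1 for r in latent_roots_ar2(a1, a2))

for a1, a2 in [(0.5, 0.3), (1.5, -0.5), (1.0, -0.5)]:
    roots = latent_roots_ar2(a1, a2)
    print((a1, a2), roots, [abs(r) for r in roots], is_stationary(a1, a2))

# Roots of the lag polynomial 1 - a1*z - a2*z**2 are the reciprocals 1/r.
r1, r2 = latent_roots_ar2(1.0, -0.5)
z1 = 1 / r1
lag_poly_at_z1 = 1 - 1.0 * z1 + 0.5 * z1 * z1
```

The second parameter pair has one latent root equal to one, a unit root, and the third pair has complex roots inside the unit circle, which corresponds to a stationary process with cyclical adjustment.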


Fractional Integration


Building an ARIMA Model. The Box-Jenkins Approach

The Box-Jenkins approach is a practical way of finding a suitable ARMA representation
of a given time series. The steps are:
1) Identification.
Determine (i) if seasonal differencing is necessary to remove seasonal factors,
(ii) the number of times the series needs to be differenced to achieve stationarity and
(iii) study the ACF and PACF to determine a suitable order of the ARMA process.
2) Estimation.
The identification step leads to (1) a stationary series and (2) a narrower set of
possible ARMA(p,q) processes to estimate.
Methods of estimation? Remember problems with t-values?!
3) Testing.
Test the estimated model(s) for white noise residuals, using the Box-Pierce test for
autocorrelation. Among models with white noise residuals pick the one with the
smallest information criterion (AIC, BIC). Differences among information criteria?
This leads quickly to a forecast model, or a representation of an expectations
generating mechanism that can be used in simple (rational) expectations modeling.
Limitations of univariate ARIMA models.
Most economic problems are multivariate. Variables depend on each other.
Furthermore, the test procedure is only aimed at finding a forecast model. To
build an econometric model that can be used for inference the demands for testing
are higher.


Is the ARMA model identified?

The parameters of an ARMA model might not be unique. To see the conditions
for uniqueness, decompose the polynomials of the ARMA process $A(L) y_t = B(L) \varepsilon_t$
into their factors6 as
$$A(L) = \prod_{i=1}^{p} (1 - \lambda_i L),$$
and
$$B(L) = \prod_{j=1}^{q} (1 - \mu_j L).$$
6 If $A(L)$ contains the polynomial $1 - L$ the process is said to have a unit root.

For a unique representation of the ARMA process there should be no common
factors, like $(1 - \lambda_m L) = (1 - \mu_k L)$. If there is a common factor, it cancels from
both sides without changing the process. Equivalently, it is possible to take any
other polynomial $C(L)$ of finite order ($< p$) and multiply both sides of the ARMA
process, such that
$$C(L) A(L) y_t = C(L) B(L) \varepsilon_t$$
leads to
$$A^{*}(L) y_t = B^{*}(L) \varepsilon_t.$$
Thus, in the case of a common factor there is no unique representation of the
parameters in $A(L)$ and $B(L)$.
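A worked example may help to see why a common factor destroys uniqueness. The following is an illustrative special case, using the sign convention of the MA definition earlier in the chapter; the numerical value 0.5 is chosen only for the example.

```latex
% ARMA(1,1) with identical AR and MA coefficients (a_1 = b_1 = 0.5):
y_t = 0.5\, y_{t-1} + \varepsilon_t - 0.5\, \varepsilon_{t-1}
\quad\Longleftrightarrow\quad
(1 - 0.5L)\, y_t = (1 - 0.5L)\, \varepsilon_t .
% The common factor (1 - 0.5L) is invertible and cancels:
y_t = \varepsilon_t .
% The same white noise process is obtained for any a_1 = b_1 = c with |c| < 1,
% so the data cannot tell the individual values of a_1 and b_1 apart.
```

In estimation this typically shows up as very imprecise or unstable estimates of the AR and MA coefficients when the orders $p$ and $q$ are both set too high.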

7.2 Theoretical Properties of Time Series Models


The Principle of Duality

There is a link between AR and MA models, as the presentation of the lag operator
indicated. An AR process with an infinite number of lags can under certain
conditions be rewritten as a finite MA process. In a similar way an infinite moving
average process can be inverted to an autoregressive process of finite order.
These results have two practical implications. The first is that in practical
modelling, a long MA process can often be rewritten as a shorter AR process
instead, and the other way around. The second implication is that the two processes
are complementary to each other. The combination of AR and MA into ARMA
will lead to relatively parsimonious models, meaning models with quite few
parameters. In fact, it is quite uncommon to find ARMA models above the order
$p = 2$ and $q = 2$.
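The AR(1) process, $y_t = a_1 y_{t-1} + \varepsilon_t$, can be written as $(1 - a_1 L) y_t = \varepsilon_t$, and
in the next step as $y_t = (1 - a_1 L)^{-1} \varepsilon_t$. The term $(1 - a_1 L)^{-1}$ represents the sum
of an infinite moving average process,
$$y_t = \frac{1}{(1 - a_1 L)} \varepsilon_t = \sum_{i=0}^{\infty} b_i \varepsilon_{t-i} = B(L) \varepsilon_t,$$
where $b_i = a_1^i$, so that $b_0 = 1$, and $B(L)$ is of infinite order. In the same way, an MA(1)
process, $y_t = \varepsilon_t - b_1 \varepsilon_{t-1}$, can be
written as an infinite autoregressive process, $\sum_{i=0}^{\infty} a_i y_{t-i} = A(L) y_t = \varepsilon_t$, where
$A(L)$ is of infinite order. These
transformations can be generalized for AR($p$) and MA($q$) processes, as well as for
vector processes.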
The question is, when are these transformations meaningful? An AR process
can always be inverted, but it will only have a (meaningful) summable MA representation
if it is stationary. Another way to state this condition is to say that the (latent)
roots of $A(L) = 0$ should be less than unity (inside the unit circle). A finite MA
process, on the other hand, is always stationary, since $\varepsilon_t$ by definition is a
stationary process. However, an MA process can only be inverted if the latent
roots of the polynomial $B(L) = 0$ are less than unity, that is, inside the unit
circle. (Notice that we refer to the latent roots; if we switch to the ordinary roots,
the requirement is that they should be outside the unit circle, larger than one. See
this paper for definitions of inside and outside the unit circle!)
Thus, an MA process is always stationary, but only invertible if the latent roots of
$B(L)$ are inside the unit circle. An AR process is always invertible to an infinite
MA process, but only stationary if the latent roots of $A(L)$ are inside the unit
circle. The latter has one interesting implication: it is often convenient to rewrite
an AR or a VAR to a moving average form and investigate the properties and
consequences of non-stationarity from the MA representation.
The conditions are similar, and actually more general, for multivariate processes,
such that VAR($p$) $\Leftrightarrow$ MA($q$).


Wold's decomposition theorem

Linear ARIMA models are reasonably good approximations to many empirical
time series processes. A theoretical result which suggests why ARIMA models are
useful approximations is offered by Wold's decomposition theorem, Wold (1954).
The theorem says that any covariance stationary process can be uniquely represented as the sum of two uncorrelated processes, $x_t = d_t + y_t$, where $d_t$ is a linearly
deterministic process, and $y_t$ is an infinite moving average process, MA($\infty$). Thus,
we can write $x_t$ as

$$x_t = d_t + \sum_{j=0}^{\infty} b_j e_{t-j},$$

where $b_0 = 1$, and $e_t$ is stationary (white noise) such that $\sum_{j=0}^{\infty} b_j^2 < \infty$,
$E(e_t) = 0$, $E(e_t^2) = \sigma^2$ and $E(e_t e_{t-j}) = 0$ for $j \neq 0$.
The theorem has two implications. The first is that any series which appears
to be covariance stationary can be modeled as an infinite MA process. Given the
principle of duality, we can expect to find a finite autoregressive process as well.
Since many economic time series are
covariance stationary after first differencing, we expect ARMA models, as well as
linear autoregressive distributed lag models, to work quite well for these series.
The second implication is that we should be able to extract a white noise process
out of any covariance stationary process. This leads to the conclusion that finding
(or constructing) a white noise process in an empirical model is a basic necessity in
the modeling process, because most economic time series are covariance stationary
after differencing.
The presentation above has focused on the practical side of time series modelling. Time series can also be described and analysed theoretically. Consider the AR(1)
model $y_t = a_1 y_{t-1} + \varepsilon_t$. The series $y_t$ is generated by the parameter $a_1$, the white
noise process $\varepsilon_t$ and some initial value at the beginning of time, say $y_0$ at t = 0.
Thus, given an initial value, a parameter $a_1$ and a random number generator that
generates $\varepsilon_t \sim N(0, \sigma^2)$, where we for simplicity can set $\sigma^2 = 1$, it becomes
possible to generate possible series of $y_t$ using Monte Carlo techniques. The different outcomes of the series $y_t$ can then be used to estimate the distribution of
$\hat{a}_1$, to learn about how to do inference in small and medium sized samples, and to
understand the distributions as $a_1 \to 1.0$.
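The Monte Carlo idea can be sketched in a few lines. The sample size (50 observations), the number of replications (2000), the parameter values and the helper names `simulate_ar1` and `ols_ar1` are all assumptions chosen for illustration; the estimator is plain OLS without a constant.

```python
import numpy as np

def simulate_ar1(a1, n_obs, rng, y0=0.0):
    """Generate y_t = a1*y_{t-1} + eps_t with eps_t ~ N(0, 1)."""
    eps = rng.standard_normal(n_obs)
    y = np.empty(n_obs)
    prev = y0
    for t in range(n_obs):
        prev = a1 * prev + eps[t]
        y[t] = prev
    return y

def ols_ar1(y):
    """OLS estimate of a1 in y_t = a1*y_{t-1} + eps_t (no constant)."""
    return float(y[1:] @ y[:-1] / (y[:-1] @ y[:-1]))

rng = np.random.default_rng(12345)
n_obs, n_rep = 50, 2000            # assumed sample size and replications
results = {}
for a1 in (0.5, 0.9, 1.0):
    est = np.array([ols_ar1(simulate_ar1(a1, n_obs, rng)) for _ in range(n_rep)])
    results[a1] = est
    # Mean and selected percentiles of the simulated distribution of a1-hat
    print(a1, est.mean(), np.percentile(est, [5, 50, 95]))
```

In runs of this kind the simulated distribution of $\hat{a}_1$ is centred below the true value in small samples, and it becomes more skewed as $a_1$ approaches one.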
We can also calculate the mean and the variance of $y_t$. The observations of $y_t$ are not
independent, since the series is autoregressive. Therefore, the sample mean and variance of
the observed $y_t$ are not informative for describing the series. Instead look at the
mean of the zero mean (no constant) AR(1) process, in the form of the expected
value, $E(y_t) = E(a_1 y_{t-1}) + E(\varepsilon_t)$. The right hand side of this expression
represents the mean of $y_t$. The expected value
of a white noise process is by definition zero, so $E(\varepsilon_t) = 0$. Since $a_1$ is a given constant we
have for the other term $E(a_1 y_{t-1}) = a_1 E(y_{t-1})$. To find an answer we need to
substitute the lags $y_{t-1}, y_{t-2}$, etc.
For the first lag, by substitution, we get $a_1 E(y_{t-1}) = a_1 E(a_1 y_{t-2} + \varepsilon_{t-1}) = a_1^2 E(y_{t-2})$.
Substitute one more time, $a_1^2 E(y_{t-2}) = a_1^2 E(a_1 y_{t-3} + \varepsilon_{t-2}) = a_1^3 E(y_{t-3})$.
As we continue substituting backwards we will end up with the initial value. Later
we will examine the case of minus infinity. Since the initial value can be seen as a
constant, we get as the final product $a_1^t E(y_0) = a_1^t y_0$. (Recall that the expected
value of a constant is equal to the constant.) If the initial value is set to zero it
follows that $a_1^t y_0 = 0$, and that the mean of $y_t$ is $E(y_t) = a_1^t E(y_0) = 0$.
It is standard to assume that the initial value is zero in this type of analysis.
What happens if $y_t$ has a mean different from zero, and if the initial value is
different from zero? The answer is simple: as long as we can assume that the
AR process is stationary, and that the initial value therefore is a constant, there are
no problems. Under these conditions, a non-zero mean can be represented by
a constant parameter in the AR process, such as $y_t = \alpha_0 + a_1 y_{t-1} + \varepsilon_t$. The
expected value of $y_t$ is $E(y_t) = E(\alpha_0) + a_1 E(y_{t-1}) + E(\varepsilon_t)$, which means that the
right hand side is $\alpha_0 + a_1 E(y_{t-1})$. Again we need to substitute backwards, leading
to $\alpha_0 + a_1 E(\alpha_0 + a_1 y_{t-2} + \varepsilon_{t-1}) = \alpha_0 + a_1 \alpha_0 + a_1^2 E(y_{t-2})$. The next substitution
gives $\alpha_0 + a_1 \alpha_0 + a_1^2 \alpha_0 + a_1^3 E(y_{t-3})$. If we continue substituting back to minus infinity,
and set the initial value to zero, we get,

$$E(y_t) = \alpha_0 (1 + a_1 + a_1^2 + a_1^3 + \ldots) = \frac{\alpha_0}{1 - a_1}.$$

The last step is simply an application of the solution to an infinite series, which
works in this case as long as the AR process is stationary, $|a_1| < 1$. It is important
that you understand the use of the expectations operator in this example because
the technique is frequently used to derive a number of results. We could have
reached the result in a simpler way if we had used the lag operator. Take the
expectation of both sides of the process, $E[(1 - a_1 L) y_t] = E(\alpha_0 + \varepsilon_t)$. The lag operator is a deterministic
factor, so the result is $E(y_t) = (1 - a_1 L)^{-1} \alpha_0 = \alpha_0 / (1 - a_1)$. Again, the right hand side is the sum
of an infinite process. If there is no constant, $\alpha_0 = 0$, it follows immediately that
$E(y_t) = 0$.
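What is the variance of the process $y_t$? The answer is given by understanding that, for the zero mean process, $E(y_t y_t) = Var(y_t) = \sigma_y^2$. Thus, start from the AR(1) process and multiply
both sides with $y_t$ to get $y_t y_t = a_1 y_t y_{t-1} + y_t \varepsilon_t$. Next, take expectations of
both sides, $E(y_t y_t) = a_1 E(y_t y_{t-1}) + E(y_t \varepsilon_t)$, and substitute for $y_t$, so that
$y_t y_{t-1} = (a_1 y_{t-1} + \varepsilon_t) y_{t-1} = a_1 y_{t-1}^2 + \varepsilon_t y_{t-1}$ and $y_t \varepsilon_t = (a_1 y_{t-1} + \varepsilon_t)\varepsilon_t$. From this we have
$a_1 E(y_t y_{t-1}) = a_1^2 E(y_{t-1}^2) + a_1 E(\varepsilon_t y_{t-1})$ and $E(y_t \varepsilon_t) = a_1 E(\varepsilon_t y_{t-1}) + E(\varepsilon_t^2)$. In these expressions we have by definition
that $E(\varepsilon_t y_{t-1}) = 0$ (recall the basic assumptions of OLS) and that $E(\varepsilon_t^2) = \sigma_\varepsilon^2$.
Put the results together, using that stationarity implies $E(y_{t-1}^2) = \sigma_y^2$,

$$\begin{aligned}
E(y_t y_t) &= a_1^2 E(y_{t-1}^2) + \sigma_\varepsilon^2 \\
\sigma_y^2 &= a_1^2 \sigma_y^2 + \sigma_\varepsilon^2 \\
\sigma_y^2 (1 - a_1^2) &= \sigma_\varepsilon^2 \\
\sigma_y^2 &= \frac{\sigma_\varepsilon^2}{1 - a_1^2}.
\end{aligned}$$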

The technique is the same for any AR(p) process.
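The backward substitutions for the mean and the variance can be mimicked by iterating the two moment recursions forward from a fixed initial value. This is only a sketch under assumed values: `alpha` stands for the constant in the AR(1) process, and the numbers 2.0, 0.8 and 1.0 are illustrative.

```python
def ar1_moments(alpha, a1, sigma2, n_steps, mean0=0.0, var0=0.0):
    """Iterate E(y_t) = alpha + a1*E(y_{t-1}) and Var(y_t) = a1^2*Var(y_{t-1}) + sigma2."""
    mean, var = mean0, var0          # a fixed (non-random) initial value has variance 0
    for _ in range(n_steps):
        mean = alpha + a1 * mean
        var = a1 ** 2 * var + sigma2
    return mean, var

alpha, a1, sigma2 = 2.0, 0.8, 1.0    # assumed illustrative values
mean_t, var_t = ar1_moments(alpha, a1, sigma2, n_steps=500)

mean_limit = alpha / (1 - a1)        # closed form for the mean
var_limit = sigma2 / (1 - a1 ** 2)   # closed form for the variance
print(mean_t, mean_limit)
print(var_t, var_limit)
```

For $|a_1| < 1$ the iterated moments settle at the closed-form expressions; with $a_1 = 1$ the variance recursion grows without bound, which is the unit root case.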

From the calculation of the variance we can also see the value of the autocovariance and the autocorrelation coefficients, say at lag k. Multiply both sides of
the process with $y_{t-k}$, take expectations, and solve

$$E(y_t y_{t-k}) = a_1 E(y_{t-1} y_{t-k}) + E(\varepsilon_t y_{t-k}).$$

From this follows that the autocovariance is

$$\gamma_k = a_1^k \frac{\sigma_\varepsilon^2}{1 - a_1^2}.$$

The autocorrelation is simply

$$\rho_k = \frac{\gamma_k}{\gamma_0} = a_1^k.$$

From this expression it is obvious that the autocorrelation function for the
AR(1) process dies out geometrically as the lag length k increases, and slowly when $a_1$ is close to one.
Calculating the mean, variance, autocovariances and autocorrelations for AR(1),
AR(2), MA(1) and MA(2) processes is a standard exercise in time series courses,
followed by investigation of the unit root case $a_1 = 1$. To be completed...
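As a rough check on the autocorrelation result, the sample autocorrelations of a long simulated AR(1) series can be compared with $a_1^k$. The parameter value, the sample length and the helper `sample_acf` are assumptions for the illustration.

```python
import numpy as np

def sample_acf(y, max_lag):
    """Sample autocorrelations at lags 0..max_lag."""
    y = np.asarray(y, dtype=float) - np.mean(y)
    denom = y @ y
    return np.array([1.0] + [y[k:] @ y[:-k] / denom for k in range(1, max_lag + 1)])

a1, n_obs = 0.8, 20000               # assumed parameter and sample length
rng = np.random.default_rng(7)
eps = rng.standard_normal(n_obs)
y = np.empty(n_obs)
y[0] = eps[0]
for t in range(1, n_obs):
    y[t] = a1 * y[t - 1] + eps[t]

acf_hat = sample_acf(y, 5)
acf_theory = a1 ** np.arange(6)      # rho_k = a1**k
for k in range(6):
    print(k, round(float(acf_hat[k]), 3), round(float(acf_theory[k]), 3))
```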

7.3 Additional Topics



Seasonality is an inherent characteristic of most time series data. Seasonality can
be dealt with in three ways. The first is to use seasonal dummy variables. The
second method is to use seasonal differencing. The third method is to use a
program called X12 (previously X11).
All methods suffer from the fact that efficient estimation of seasonal effects
requires a lot of observations, which is rare in most applied econometric time
series work. Econometricians tend to use seasonal dummies, since they are easy to
use and lead to transparency in the model. Seasonal differencing is the standard
method in the Box-Jenkins approach.
For a quarterly series seasonal differencing implies differencing in the following
way: $(1 - L^4) y_t = y_t - y_{t-4} = \Delta_4 y_t$. The corresponding operator for monthly data
is $(1 - L^{12})$. In econometrics, the assumption of seasonal unit roots is difficult to
test. There are few clear cut examples of such processes in the literature and the
tests for seasonal unit roots are quite complex, especially given the limited samples
in econometrics. Thus, econometricians tend to use seasonal differencing when
dummy variables do not work. Otherwise including lags at seasonal frequencies
will usually take care of seasonal effects.
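A small sketch of seasonal differencing for quarterly data follows. The series is hypothetical: a fixed seasonal pattern added to a linear trend, so that the operator $(1 - L^4)$ removes the pattern and leaves a constant.

```python
import numpy as np

def seasonal_difference(y, s):
    """Apply (1 - L^s): returns y_t - y_{t-s}, losing the first s observations."""
    y = np.asarray(y, dtype=float)
    return y[s:] - y[:-s]

# Hypothetical quarterly series: linear trend plus a fixed seasonal pattern
n_years = 10
pattern = np.array([5.0, -2.0, 1.0, -4.0])
t = np.arange(4 * n_years)
y = 0.5 * t + np.tile(pattern, n_years)

d4 = seasonal_difference(y, 4)       # quarterly case, s = 4
print(d4[:8])                        # the seasonal pattern is gone; only 4 * 0.5 remains
```

For monthly data the same function would be called with `s=12`.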
Finally, X12 can be described as a state of the art tool, or as a black box,
where you send in seasonal data and out comes a deseasonalised series. X12 is a
respected method, and this procedure, or some similar program, is frequently used
to remove seasonality from public statistics. Removing seasonality
by seasonal differencing, seasonal dummies or by using X12 does not affect the
presence of one or more unit roots in the series. The Dickey-Fuller test or other
tests for unit roots work as before. X12 is a program designed for univariate
analysis, meaning that if seasonality is removed in single series by X12 prior to
modeling a system, seasonality can still be left in a multivariate single equation
model or in a system of equations. The problem with X12 is its black box nature:
the econometrician loses some control over the modeling process. Some care in
the use of X12 is recommended.




(To be completed)
Differencing until stationarity is the standard Box-Jenkins approach. It is a bit
ad hoc. In econometrics the approach is to test first, but only reject the null
of integrated of order one in the case of strong evidence against it. Alternatives
include linear deterministic trends, polynomial trends etc. These are dangerous: they lead to spurious
detrending under the maintained hypothesis of integrated variables.

7.4 Aggregation
The following section offers a brief discussion about the problems of aggregation.
The interested reader is referred to the literature to learn more [Wei (1990) is a
good textbook with many references on the subject, see also Sjöö (1990, ch. 4)].
Aggregation of series means aggregation over agents and markets, or aggregation
over time. The stock of money, measured by (M3), at the end of the month
represents an aggregation over individuals. A series like aggregate consumption in
the national accounts represents an aggregation over both individuals and time.
Aggregation over time is usually referred to as temporal aggregation. Money
holdings is a stock variable which can be measured at any point in time. Temporal
aggregation of a stock variable implies picking observations with larger intervals,
using say a money series measured at the end of a quarter, instead of at the end
of each month. Consumption, on the other hand, is a flow variable. It cannot
be measured at a point in time, only as the sum of consumption over a given
period. Temporal aggregation in this case implies taking the sum of consumption
over intervals. The distinction is of importance because the effects of temporal
aggregation are different for stock and flow variables.
Aggregation, both over time and individuals, can change the functional form
of the distribution of the variables, and it can affect the residual variance
and t-values. Exactly how aggregation changes a model varies from situation
to situation. There are however some general conclusions regarding temporal
aggregation which we will repeat in this section. In many situations there is little
we can do about these problems, except working with continuous time models,
and/or selecting series with a low degree of temporal aggregation. That the problems
are hard to deal with is no excuse for forgetting or hiding them, as is done in many
textbooks in econometrics. The area of aggregation is an interesting challenge for
econometricians since it has not been explored as much as it deserves.
An interesting example of the consequences of aggregation is given in Christiano and Eichenbaum (1987). They show how one can get extremely different
results by using discrete time models with yearly, quarterly and monthly data
compared with a continuous time model. They tried to estimate the speed of
adjustment in the stock of inventories in the U.S. national accounts. Using a
continuous time model they estimated the average time for closing 95% of the
gap between the desired and the actual stock of inventories to be 17 days. The
discrete time models implied much slower adjustment. Using monthly data the result was
46 days, with quarterly data 7 months, and with yearly data 5 1/2 years!
Aggregation also becomes an important problem if we have a theory that describes the stochastic behavior of a variable which we would like to test with
empirical data. There are many results, in macro and finance, that predict that
series should follow a random walk, or be the outcome of a martingale process.
There are several factors to consider if we would like to estimate a process suggested
by theory. An example is Hall (1978) who, from a life cycle hypothesis, derived
that private consumption should follow an AR(1) process, and be a random walk
under the assumption of rational expectations. The first factor is that of temporal aggregation. An additional complication is adjustment costs, which will also
affect the original model. If private consumption, as an example, is defined as
an AR(1) model, temporal aggregation changes it to ARMA(1,1), and the existence of
adjustment costs will then transform it to an ARMA(2,1) model.
Temporal aggregation, adjustment costs and measurement errors are factors
which can affect the structure of the model and the size of estimated parameters.
To this list one could also add problems of seasonal factors, trends and hidden
periodicity. The latter is a problem, because the larger the temporal aggregation
the more difficult it is to get a correct estimate of parameters that reflect cycles
which are not timed with the sampling interval. Therefore, one should be critical
of papers which try to prove that some empirical series behaves like a theoretical
process. Is it possible for the author to control all of these factors?
For a flow variable with an ARIMA representation, the outcome of temporal
aggregation depends on hidden periodicity, which if it exists can affect both the
AR and the MA process. In general, aggregation will complicate the structure
of the ARIMA model. A simple AR model becomes an ARMA model. But, as
aggregation becomes larger the structure of the model becomes simpler.
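The claim that a temporally aggregated AR(1) flow variable becomes an ARMA process can be illustrated from the exact autocovariances, without simulation. The sketch assumes $a_1 = 0.8$ and non-overlapping sums of m = 3 periods (for example monthly to quarterly); the function names are hypothetical. An AR(1) has $\rho_k = \phi^k$ at every lag, while an ARMA(1,1) only satisfies $\rho_k = \phi \rho_{k-1}$ from lag 2 onwards.

```python
import numpy as np

def ar1_autocov(a1, sigma2, h):
    """Autocovariance of a stationary AR(1) at lag h."""
    return sigma2 * a1 ** abs(h) / (1 - a1 ** 2)

def aggregated_autocov(a1, sigma2, m, K):
    """Autocovariance at lag K of non-overlapping sums of m consecutive values."""
    return sum(ar1_autocov(a1, sigma2, m * K + i - j)
               for i in range(m) for j in range(m))

a1, sigma2, m = 0.8, 1.0, 3          # assumed values; m = 3 mimics monthly -> quarterly
gamma = np.array([aggregated_autocov(a1, sigma2, m, K) for K in range(4)])
rho = gamma / gamma[0]

print("AR root of the aggregate, a1**m:", a1 ** m)
print("rho_2 / rho_1:", rho[2] / rho[1])   # equals a1**m, as for an AR part
print("rho_1:", rho[1])                    # differs from a1**m, so an MA part is present
```

The ratio of successive autocorrelations equals $a_1^m$ from lag 2 onwards, but the first autocorrelation does not, which is the ARMA(1,1) pattern.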
For a stock variable the consequences are clearer. An ARIMA(p, d, q) process
of a stock variable becomes, after temporal aggregation, an ARIMA(p, d, s) process,
where

$$s = \text{integer}\left[(p + d) + \frac{q - p - d}{m}\right],$$

and where m is the degree of temporal aggregation, or in other words the
systematic sampling interval. As a rule of thumb it can be assumed that temporal
aggregation adds +1 to the MA process.
Since differencing is a form of temporal aggregation, taking higher and higher
differences of a series will create an MA process. This can be seen in any time
series program that produces ACFs and PACFs. The more one differences a series
the more clearly will the series look like an MA process. Thus, it follows that
observing an MA process in the identification step in the Box-Jenkins approach
is a sign of over-differencing.
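The effect of over-differencing can be seen from the theoretical autocorrelations of a differenced white noise series. The sketch below computes them from the coefficients of $(1 - L)^d$; the function names are hypothetical.

```python
import numpy as np

def difference_weights(d):
    """Coefficients of (1 - L)^d, e.g. [1, -2, 1] for d = 2."""
    w = np.array([1.0])
    for _ in range(d):
        w = np.convolve(w, [1.0, -1.0])
    return w

def ma_acf(weights, max_lag):
    """Theoretical autocorrelations of y_t = sum_i w_i * eps_{t-i}."""
    w = np.asarray(weights, dtype=float)
    gamma0 = w @ w
    acf = [1.0]
    for k in range(1, max_lag + 1):
        acf.append(float(w[k:] @ w[:-k]) / gamma0 if k < len(w) else 0.0)
    return np.array(acf)

# White noise differenced 0, 1 and 2 times
for d in range(3):
    print(d, ma_acf(difference_weights(d), 3))
```

White noise has no autocorrelation, but its first difference is a non-invertible MA(1) with a first autocorrelation of -0.5, and the second difference is an MA(2).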
The expression holds even for an ARMA model where d = 0. For an ARIMA
model, as m gets larger, the model turns towards an IMA(d, d - 1) process.
Thus, with d = 1, we end up with a random walk model. This is an interesting result for
two reasons. First, since the random walk model often seems to fit macroeconomic
and especially financial time series quite well, could that be the outcome of having
too large sampling intervals? Second, the result explains the findings in Christiano
and Eichenbaum (1987), that larger sampling intervals lead to slower and slower
adjustment speed in inventories. The larger the sampling interval, the more did
inventories seem like a random walk. As a consequence, historical shocks,
further and further back in history, seemed more important. In the end, in the
random walk model, all historical shocks have the same importance and there
would be no adjustment at all.
Temporal aggregation will also affect prediction. The general result is that
aggregation reduces the efficiency of the forecasts, and that the relative loss of
efficiency is larger for a non-stationary series than a stationary one. (Remember
that most macroeconomic series are non-stationary.)
It is also worth mentioning some conclusions concerning causality. Aggregation
will not affect the direction of causality, if there is a clear causality from one
variable to another, when dealing with stock variables. It will, however, weaken the
estimated strength of the relationship and can therefore lead to wrong conclusions
from Granger non-causality tests. For flow variables, on the other hand, temporal
aggregation turns a one direction causality into what will appear to be a two-sided
causality. In this situation a clear warning is in place.
Finally, we also look at the aggregation of two random variables, $\tilde{X}_t$ and $\tilde{Y}_t$.
Suppose that they are two independent stationary processes with mean zero,

$$E[\tilde{X}_t \mid y_t] = E[\tilde{Y}_t \mid x_t] = 0.$$

The autocovariances of $\tilde{X}_t$ and $\tilde{Y}_t$ are,

$$\gamma_{x,k} = \mathrm{cov}(x_t, x_{t-k}), \qquad \gamma_{y,k} = \mathrm{cov}(y_t, y_{t-k}).$$

The sum of $\tilde{X}_t$ and $\tilde{Y}_t$ is,

$$\tilde{Z}_t = \tilde{X}_t + \tilde{Y}_t,$$

which will have an autocovariance equal to,

$$\gamma_{z,k} = \gamma_{x,k} + \gamma_{y,k}.$$

In general, we can write this in the following way. If

$$\tilde{X}_t \sim ARMA(p, m), \qquad \tilde{Y}_t \sim ARMA(q, n),$$

and

$$\tilde{Z}_t = \tilde{X}_t + \tilde{Y}_t,$$

then

$$\tilde{Z}_t \sim ARMA(x_1, x_2),$$

where $x_1 \le p + q$ and $x_2 \le \max(p + n, q + m)$. As an example, think of a series
which is measured with a white noise error. That is, the true series is added to a
white noise series. If the true series is AR(p) then the result of this aggregation
will be an ARMA(p, p) process.
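The measurement error example can be made concrete for p = 1. The sketch adds the autocovariances of an AR(1) and an independent white noise error; the parameter values are assumptions. As in any ARMA(1,1), the autocorrelations of the sum decay at rate $a_1$ from lag 2 onwards, but the first autocorrelation is pulled below $a_1$.

```python
import numpy as np

def ar1_autocov(a1, sigma2, k):
    """Autocovariance of a stationary AR(1) at lag k."""
    return sigma2 * a1 ** abs(k) / (1 - a1 ** 2)

def white_noise_autocov(sigma2, k):
    """Autocovariance of white noise: sigma2 at lag 0, zero otherwise."""
    return sigma2 if k == 0 else 0.0

a1, sig2_eps, sig2_u = 0.8, 1.0, 1.0   # assumed AR parameter and variances

# Independence implies gamma_z(k) = gamma_x(k) + gamma_u(k)
gamma_z = np.array([ar1_autocov(a1, sig2_eps, k) + white_noise_autocov(sig2_u, k)
                    for k in range(4)])
rho_z = gamma_z / gamma_z[0]

print("rho_1 of the observed series:", rho_z[1])   # below a1
print("rho_2 / rho_1:", rho_z[2] / rho_z[1])       # equal to a1
```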
We can conclude this section by stating that aggregation leads to loss of information, which, if the aggregation is large, might fool us into assuming that the
random walk is the appropriate model. The extent to which aggregation leads
us to wrong conclusions has not been established yet. Partly this is so because we
need better data on shorter time intervals than what is available. Remember that
ignoring problems is not a way of solving them. One way of dealing with the
problems of aggregation is to use continuous time econometric techniques instead,
see Sjöö (1993) for a discussion and further references.

7.5 Overview of Single Equation Dynamic Models

The autoregressive process represents a basic way of modeling time series. As
complexity and multivariate processes are introduced, the AR model transforms
into a system of equations, where it becomes possible to give the parameters a
structural (economic) interpretation. In principle, we have the following types of
single equation models, where $\varepsilon_t \sim NID(0, \sigma^2)$.

1. Autoregressive models: AR(p): $A(L) y_t = \varepsilon_t$

2. Moving average models: MA(q): $y_t = \theta(L) \varepsilon_t$

3. ARMA(p, q) models: $A(L) y_t = \theta(L) \varepsilon_t$ (+ARIMA)

4. Distributed lag models: DL(p): $y_t = B(L) x_t + \varepsilon_t$

5. Autoregressive distributed lag models: ADL(p): $A(L) y_t = B(L) x_t + \varepsilon_t$

6. ARMA model with exogenous explanatory variable ARMAX (ARIMAX): $A(L) y_t = B(L) x_t + \theta(L) \varepsilon_t$

7. Rational distributed lag model RDL: $y_t = \frac{B(L)}{A(L)} x_t + \varepsilon_t$

8. Transfer function: $y_t = \frac{B(L)}{A(L)} x_t + \frac{\theta(L)}{\phi(L)} \varepsilon_t$

Notice that the transfer function is also a rational distributed lag since it
contains a ratio of two lag structures. Also, (7) and (8) can be viewed as distributed
lag models since D(L) = [B(L)/A(L)]. Notice that rational distributed lag models
require some information about B(L) to be workable.
Imposing restrictions on the lag structure B(L) in distributed lag models leads
to further models:
9. Geometric lag structure (= Koyck), where B(L) is assumed to decline
according to some exponential function.
10. Polynomial distributed lag (PDL) models, where B(L) declines according
to some polynomial function, decided a priori (= Almon lags).
11. All other types of a priori restrictions on B(L) not covered by (9) and (10).7
12. The error correction model. This model embraces all of the above models
as special cases. The following explains why this is so.
Introduction to Error Correction Models
Economic time series are often non-stationary: their means and variances change
over time. The trend component in the data can either be deterministic or stochastic, or a combination of both. Fitting a deterministic trend assumes that the
data series grow with a fixed rate each period. This is seldom a good way of
describing trends in economic time series. Instead they are better
described as containing stochastic trends with a drift. The series might be growing over time, but it is not possible to predict whether it grows or declines in the
next period. Variables with stochastic trends can be made stationary by taking
first differences. This type of variable is called integrated of order 1, where the
order of integration is determined by the number of times the variable needs to be
differenced before it becomes stationary.
A necessary condition for fitting trending data in an econometric model is that
the variables share the same trend, otherwise there is no meaningful long-run relationship between them.8 Testing for co-integration is a way of testing if the data
have a common trend, or if they tend to drift apart as time increases. The simplest
way to test for cointegration is the so called Engle and Granger two step procedure. The test implies determining whether the data contains stochastic trends,
and if so, testing if there are common trends. If $x_t$ and $y_t$ are two variables with
stochastic trends that become stationary after first differencing, cointegration
can be tested by running the following co-integrating regression,

$$y_t = \alpha + \beta x_t + \varepsilon_t.$$

If both $y_t$ and $x_t$ are integrated variables of the same order, a necessary condition for a statistically meaningful long-run relationship is that the residual term
($\varepsilon_t$) is stationary. If that is the case the error term from the regression can be
seen as temporary deviations from the long-run, and $\alpha$ and $\beta$ can be viewed as
estimates of the long-run steady state relation between x and y.

7 Restrictions are put on the lag process to make the estimation more effective. A priori
restrictions can be motivated by a limited sample and multicollinearity that affects estimated
standard errors of the individual lags. These types of restrictions are not used anymore. Today,
it is recognized that it is more important to focus on information criteria, white noise residuals and
building a well-defined statistical model, instead of imposing restrictions that might not be valid.
8 The exception is tests of the efficient market hypothesis, and related tests of rational expectations. See Appendix A in Sjöö and Sweeney (1998) and Sjöö (1998).
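The two steps of the Engle and Granger procedure can be sketched with simulated data. Everything in the example is an assumption made for illustration: the data generating process, the sample size, the helper names `ols` and `df_tstat`, and the use of a Dickey-Fuller regression without constant or lagged differences in the second step. Because the residuals are estimated, the resulting t-ratio should be compared with critical values tabulated for cointegration tests rather than with the ordinary Dickey-Fuller tables.

```python
import numpy as np

def ols(X, y):
    """OLS coefficients and residuals."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta, y - X @ beta

def df_tstat(u):
    """t-ratio on rho in  du_t = rho*u_{t-1} + v_t  (no constant, no lagged differences)."""
    du, lag = np.diff(u), u[:-1]
    rho = (lag @ du) / (lag @ lag)
    v = du - rho * lag
    se = np.sqrt((v @ v) / (len(du) - 1) / (lag @ lag))
    return rho / se

rng = np.random.default_rng(42)
n = 500
x = np.cumsum(rng.standard_normal(n))            # I(1) regressor (random walk)
y = 1.0 + 2.0 * x + rng.standard_normal(n)       # cointegrated with x by construction
w = np.cumsum(rng.standard_normal(n))            # unrelated random walk

# Step 1: co-integrating regression  y_t = alpha + beta*x_t + e_t
X = np.column_stack([np.ones(n), x])
(alpha_hat, beta_hat), resid = ols(X, y)

# Step 2: unit root test on the residuals
t_coint = df_tstat(resid)

# For contrast: regression of an unrelated random walk on x (spurious regression)
_, resid_spurious = ols(X, w)
t_spurious = df_tstat(resid_spurious)

print("beta_hat:", beta_hat)
print("residual t-ratio, cointegrated pair:", t_coint)
print("residual t-ratio, unrelated pair:", t_spurious)
```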
A general way of building a model of time series, without imposing ad hoc a
priori restrictions, is the autoregressive distributed lag model. For two variables
we have,

$$A(L) y_t = B(L) x_t + \varepsilon_t,$$

where the lags are given by $A(L) = \sum_{i=0}^{p} a_i L^i$ and $B(L) = \sum_{i=0}^{q} b_i L^i$. The first
coefficient in A(L) is set to unity, $a_0 = 1$. The lag length is chosen such that
the error term becomes a white noise process, $\varepsilon_t \sim NID(0, \sigma^2)$. The long-run
solution of this model is given by,

$$y_t = \beta x_t + u_t,$$

where $\beta = B(1)/A(1)$, that is B(L)/A(L) evaluated at L = 1, and $u_t$ is the remaining error term. Without loss of generality we can use the difference operator, $\Delta x_t = x_t - x_{t-1}$, to rewrite the autoregressive model as an error correction model,

$$\Delta y_t = \sum_{i=1}^{p-1} a_i^{*} \Delta y_{t-i} + \sum_{i=0}^{q-1} b_i^{*} \Delta x_{t-i} + \gamma ECM_{t-1} + \varepsilon_t,$$

where the error correction mechanism is given by $ECM_{t-1} = (\beta x_{t-1} - y_{t-1})$.
The latter term can be said to represent the deviation from the long run steady
state relation between the two variables. It is convenient to think of the $ECM_t$
variable at the first lag, controlling the long-run path of the dependent variable.
Asymptotically it will not matter at which lag the $ECM_t$ is placed. In a
multivariate model, and for a finite sample, it might make a difference, though; a seasonal
lag might work better.9 Furthermore, for an ECM to work well in a model it
should not display any signs of seasonal effects or extreme outliers. These effects
should be removed when the $ECM_t$ is constructed. The $\gamma$-parameter of the error
correction term indicates how changes in $y_t$ react to deviations from the long-run steady state.
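The step from an ADL model to the error correction form is a pure reparameterization, which can be verified numerically. The sketch uses an ADL(1,1) written explicitly as $y_t = a_1 y_{t-1} + b_0 x_t + b_1 x_{t-1} + \varepsilon_t$; this sign convention, the parameter values and the simulated data are assumptions for the illustration. The implied long-run coefficient is $(b_0 + b_1)/(1 - a_1)$ and the adjustment parameter is $1 - a_1$.

```python
import numpy as np

a1, b0, b1 = 0.6, 0.3, 0.1           # assumed ADL(1,1) parameters
beta = (b0 + b1) / (1 - a1)          # long-run coefficient
gamma = 1 - a1                       # adjustment speed towards the long-run relation

rng = np.random.default_rng(1)
n = 200
x = np.cumsum(rng.standard_normal(n))    # hypothetical I(1) explanatory variable
eps = 0.5 * rng.standard_normal(n)

y_adl = np.zeros(n)
y_ecm = np.zeros(n)
for t in range(1, n):
    # ADL form in levels
    y_adl[t] = a1 * y_adl[t - 1] + b0 * x[t] + b1 * x[t - 1] + eps[t]
    # ECM form: change in y = short-run effect + correction of last period's deviation
    ecm_lag = beta * x[t - 1] - y_ecm[t - 1]
    y_ecm[t] = y_ecm[t - 1] + b0 * (x[t] - x[t - 1]) + gamma * ecm_lag + eps[t]

print("long-run coefficient:", beta)
print("largest difference between the two forms:", np.max(np.abs(y_adl - y_ecm)))
```

Both recursions generate the same series, so the ECM imposes no restriction on the ADL; it only separates the short-run effect from the adjustment towards the long-run relation.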
When modeling integrated variables, rewriting the system as a (vector) error
correction model is a natural step. However, error correction models also work with
stationary data series. Assuming costly adjustment leads generally to partial adjustment models, which are better written in the less restrictive error correction
form. Optimal control theory, approximations to structural systems in continuous
time etc. will also lead to error correction models, see Hendry, Pagan and Wickens
(1982), Hendry (1995), or ch. 2 in Banerjee et al. (1993).
If $x_t$ and $y_t$ contain stochastic trends it is necessary that they are co-integrated
for the ADL model to make sense in the long-run. For instance, if the variables
are co-integrated, the error term from the co-integrating regression ($\varepsilon_t$ above) can
be used as the error correction mechanism. This was shown in Engle and Granger
(1987). If there is cointegration there is an ECM formulation, the reason being
that cointegration implies Granger causality in at least one direction. The advantage of the error correction model is that it does not put a priori restrictions on
the model and that it separates long-run and short-run effects. It has proven to
be a very efficient way to model various economic relations, like money demand,
consumption etc. It should be recognized that the early literature on EC models tended to overlook the problem of weak exogeneity. With the developments in
the field of multivariate cointegration it has been shown that when the same EC
expression determines more than one variable, there are cross equation restrictions between the co-integrating parameters. These restrictions imply that error
correction expressions have to be estimated within complete systems, not from
single equations.

9 For comparison see the discussion of seasonality, earlier in this paper.
Multivariate Model Survey
Multivariate models are introduced later. For the time being we can conclude our listing
of models with the following,
Vector autoregressive models, VAR.
Vector autoregressive moving average processes, VARMA. The VAR and
the VARMA represent multivariate ARIMA models.
Vector error correction models, VECMs.
Structural vector autoregressive models, SVAR.
Systems of structural equations estimated using estimators.
Structural vector error correction models.
The latter represent the final step, where a complete system of interactive
variables is modeled and given an (economic) structural interpretation.






Given an autoregressive, or distributed lag structure A(L), B(L) or D(L), the long
run static solution of the model is found by setting L = 1. The intuition is that
in the long run there will be no changes in the explanatory variables, and it will
not matter if we explain $y_t$ by say $x_t$ and/or $x_{t-i}$.
The conditional mean of $y_t$ in an ADL model for example is

$$E_t\{y_t\} = y_t = A(L)^{-1} B(L) x_t.$$

The mean path of $y_t$ is therefore

$$y = A(1)^{-1} B(1) x,$$

where $A(1) = (1 - a_1 - a_2 - \ldots - a_r)$ and $B(1) = (b_0 + b_1 + b_2 + \ldots + b_j)$. In
a distributed lag model we would have

$$y = D(1) x.$$

Now a unit change (easier if in percent) in x leads to a new equilibrium,

$$y^{*} = D(1)(x + 1).$$

The total effect of a change in $x_t$ is given by the sum of the coefficients in D(L)
when L = 1. If there are m lags in D(L), the total multiplier is

$$D(1) = (\delta_0 + \delta_1 + \ldots + \delta_m).$$





It is also possible to think of the total multiplier as an infinite sum of terms
which dies out slowly in the long run.
The impact multiplier is associated with the first parameter in D(L), which
is $\delta_0$. Thus taking $\delta_0 x_t$ gives you the impact multiplier, the first period's effect
following a change (a shock) in $x_t$.
The j:th interim multiplier ($\tau_j$) is the sum of the coefficients up to and including the j:th lag,

$$\tau_j = \sum_{i=0}^{j} \delta_i.$$

It is common to standardize the j:th interim multiplier in the following way,

$$\frac{\tau_j}{D(1)} = \frac{\sum_{i=0}^{j} \delta_i}{\sum_{i=0}^{m} \delta_i},$$

such that it represents the share of the total multiplier up until the j:th lag.

The mean lag is given as,

$$\bar{\mu} = \frac{\sum_{j=0}^{m} j \, \delta_j}{\sum_{j=0}^{m} \delta_j}.$$

Notice that m could be equal to infinity if we have a stable model, with stationary variables, such that the infinite sum of $\delta_i$ converges to a constant in
the long run.
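The mean lag can be derived in a more sophisticated way, by differentiating
D(L) with respect to L and then dividing by D(1). That is,

$$D(L) = \delta_0 + \delta_1 L + \delta_2 L^2 + \ldots + \delta_s L^s,$$

$$D'(L) = \delta_1 + 2 \delta_2 L + \ldots + s \delta_s L^{s-1}.$$

By dividing $D'(1)$ by $D(1)$ we get, as a general result for ADL models,

$$\bar{\mu} = \frac{D'(1)}{D(1)} = \frac{B'(1)}{B(1)} - \frac{A'(1)}{A(1)}.$$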


Finally we have the median lag, representing the number of periods required
for 50% of the total effect to be achieved. The median lag is obtained by solving for j in,

$$\frac{\sum_{i=0}^{j} \delta_i}{D(1)} = 0.50.$$

Sometimes the median lag is approximated by choosing the j:th interim
multiplier in the middle of the lag structure.
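The multipliers and lag measures are simple functions of the lag coefficients. The following sketch uses a hypothetical set of distributed lag coefficients (0.4, 0.3, 0.2, 0.1) and a hypothetical helper `lag_summary`. The median lag is taken as the first lag at which the standardized interim multiplier reaches 50%, which presumes that all coefficients have the same sign.

```python
import numpy as np

def lag_summary(delta):
    """Multipliers and lag measures for D(L) = delta_0 + delta_1*L + ... + delta_m*L^m."""
    delta = np.asarray(delta, dtype=float)
    total = delta.sum()                                  # total multiplier D(1)
    interim = np.cumsum(delta)                           # interim multipliers
    share = interim / total                              # standardized interim multipliers
    mean_lag = (np.arange(len(delta)) @ delta) / total   # D'(1) / D(1)
    median_lag = int(np.argmax(share >= 0.5))            # first lag with at least 50% of the effect
    return {"impact": delta[0], "total": total, "interim": interim,
            "share": share, "mean_lag": mean_lag, "median_lag": median_lag}

delta = [0.4, 0.3, 0.2, 0.1]         # assumed distributed lag coefficients
summary = lag_summary(delta)
for name, value in summary.items():
    print(name, value)
```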



The extension of ARIMA modeling into a multivariate framework leads to Vector
Autoregressive (VAR) models, Vector Moving Average (VMA) models and Vector
Autoregressive Moving Average (VARMA) models. In economics, since most variables
display autocorrelation and are cross-correlated, VAR models are an interesting
choice for modeling economic systems. Vector models can be constructed using
similar techniques as those for single variable ARIMA models. The autocorrelation
and partial autocorrelation functions can be extended to display cross-correlations
among the variables in the system. However, when modelling more than two
variables, these cross autocorrelation and cross partial autocorrelation functions
quickly turn into complex matrix expressions for each lag.1 Thus, the cross-correlation functions are not practical tools to work with.
The advantage of using VARs is that a VAR represents a statistical description of the economy. When using ARIMA on univariate series, in many situations
the combination of AR and MA processes turns out to be an efficient way of finding a stochastic representation of a process. VAR models are usually effective in
modeling multivariate systems, and can be used to make forecasts and dynamic
simulations of different shocks to the system. These shocks can come from policy,
from productivity or basically anywhere in the economy, and the shocks can be
assumed to be transitory or permanent. The main complicating factor is that in
order to understand what shocks and simulations actually mean it is necessary
to identify the underlying economic relations among the variables. To make VAR
models work for economic analysis it is necessary to impose some restrictions on
the residual covariance matrix of the VAR. Thus, there is no free lunch here in
terms of avoiding discussing causality and simultaneity problems. It is necessary
to point out the latter because in the beginning of the history of VAR models it
seemed like VAR models could be used without economic theory, but that was
built on a misunderstanding.
In econometrics the focus is on finding a parsimonious VAR representation
with NID residuals.
Let $x_t$ be a p-dimensional vector of stochastic time series variables, represented as the k:th order VAR model,

$$x_t = \sum_{i=1}^{k} A_i x_{t-i} + e_t, \quad \text{or} \quad A(L) x_t = e_t,$$

where $A_i$ is the matrix of coefficients of lag number i, so $A(L) = A_0 - \sum_{i=1}^{k} A_i L^i$, where
$A_0$ is a diagonal matrix (the identity matrix under this normalization), and $e_t$ is a vector of white noise residual terms. Notice that
all variables across all equations have the same lag length (k). This is so because
it makes it possible to estimate the system with OLS. If the lag order is allowed to vary,
the VAR must be estimated with the seemingly unrelated regressions (SUR) method.
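With the same regressors in every equation, OLS can be applied to all equations of the VAR at once. The sketch below simulates a bivariate VAR(1) with a constant and re-estimates it; the coefficient matrix `A1`, the constants and the sample size are assumptions for the illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
A1 = np.array([[0.5, 0.1],
               [0.2, 0.3]])          # assumed (stationary) VAR(1) coefficient matrix
const = np.array([1.0, -0.5])        # assumed constants
n, p = 5000, 2

# Simulate x_t = const + A1 x_{t-1} + e_t
x = np.zeros((n, p))
for t in range(1, n):
    x[t] = const + A1 @ x[t - 1] + rng.standard_normal(p)

# Same regressors (constant and first lag of every variable) in each equation
Z = np.column_stack([np.ones(n - 1), x[:-1]])
Y = x[1:]
coef, *_ = np.linalg.lstsq(Z, Y, rcond=None)   # one column of coefficients per equation

const_hat = coef[0]
A1_hat = coef[1:].T                  # row j holds the coefficients of equation j

# Equation-by-equation OLS gives the same numbers as the joint solve
coef_eq1, *_ = np.linalg.lstsq(Z, Y[:, 0], rcond=None)

print(const_hat)
print(A1_hat)
```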
A VAR model can be inverted into its vector moving average (VMA) form as

$$x_t = \sum_{i=0}^{\infty} C_i e_{t-i} = C(L) e_t.$$

1 See Wei (1989) for a presentation of the Box-Jenkins technique in a multivariate framework.

The MA form is convenient for analysing the properties of a VAR and for investigating the consequences of shocks to the system. Estimation, however, is usually
done in the VAR form, and is straightforward since each equation can be estimated
individually with OLS. The lag length (k) of the autoregressive process is chosen
such that the estimated residual process, in combination with constants, trend,
dummy variables and seasonals, becomes a white noise process in each equation.
The idea is that the lag length is equal for all variables in all equations.
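A second order VAR of dimension $p$ with a constant is

$$\begin{bmatrix} x_{1t} \\ x_{2t} \\ \vdots \\ x_{pt} \end{bmatrix}
= \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_p \end{bmatrix}
+ \sum_{i=1}^{2}
\begin{bmatrix} a_{11,i} & a_{12,i} & \cdots & a_{1p,i} \\ a_{21,i} & a_{22,i} & \cdots & a_{2p,i} \\ \vdots & \vdots & & \vdots \\ a_{p1,i} & a_{p2,i} & \cdots & a_{pp,i} \end{bmatrix}
\begin{bmatrix} x_{1,t-i} \\ x_{2,t-i} \\ \vdots \\ x_{p,t-i} \end{bmatrix}
+ \begin{bmatrix} e_{1t} \\ e_{2t} \\ \vdots \\ e_{pt} \end{bmatrix}.$$
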
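As an illustration of equation-by-equation OLS with a common lag length, the following is a minimal sketch in Python with NumPy. The coefficient matrix `A_true`, the sample size and the function names are assumptions made for the example only; they do not come from the text.

```python
import numpy as np

def simulate_var1(A, n_obs, seed=0):
    """Simulate x_t = A x_{t-1} + e_t with standard normal white noise e_t."""
    rng = np.random.default_rng(seed)
    p = A.shape[0]
    x = np.zeros((n_obs, p))
    for t in range(1, n_obs):
        x[t] = A @ x[t - 1] + rng.standard_normal(p)
    return x

def estimate_var_ols(x, k):
    """Estimate a VAR(k) with a constant by OLS, one column per equation.

    Every equation has the same regressors (constant plus k lags of all
    variables), so a single least-squares call equals OLS on each equation.
    """
    n_obs, p = x.shape
    lags = [x[k - i:n_obs - i] for i in range(1, k + 1)]  # lag i of all variables
    X = np.hstack([np.ones((n_obs - k, 1))] + lags)
    Y = x[k:]
    coef, *_ = np.linalg.lstsq(X, Y, rcond=None)
    resid = Y - X @ coef
    return coef, resid

# Hypothetical stationary VAR(1) used only for illustration.
A_true = np.array([[0.5, 0.1],
                   [0.2, 0.3]])
data = simulate_var1(A_true, n_obs=5000)
coef, resid = estimate_var_ols(data, k=1)
A_hat = coef[1:].T  # drop the constant; row j holds the coefficients of equation j
print(np.round(A_hat, 2))
```

With a long simulated sample the estimated matrix should lie close to the matrix used to generate the data.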
VAR models were strongly advocated by Sims (1980) as a response to what he
described as "incredible restrictions" imposed on standard structural econometric
models. Up until the mid 80s, empirical time series econometrics was dominated
by the estimation of text-book equations. Researchers simply took an equation
from theory, estimated it, and did not pay much attention to whether the model
and the data actually fitted each other. Typically, dynamic lag structures were
treated in a very ad hoc way. Sims argued that it would be better to find a
statistical model which described the data series, and their interaction, as well as
possible. Once the statistical model was there, it could be used to forecast and
simulate the economy. In particular, it would according to Sims be possible to
analyse the effects of various policy changes.
Sims' critique is related to the "Lucas critique". Lucas showed how in a world
of rational expectations, it was not possible to understand estimated parameters in
structural econometric models as (deep) structural behavior or policy parameters.
Since agents form their behavior on plans building on forecasts of variables, not
on historical outcomes of variables, the estimated parameters based on historical
observation become a mixture of behavioral parameters and forecast generating
parameters. Further, under rational expectations, econometric models could not
be used to analyze policy changes, because a change in policy would by definition
lead to a change in the parameters of the system. Sims therefore argued for VAR
models as a statistical description of the economy, under given policy rules. The
effects of surprise changes in policy variables could then be analysed in the reduced form.
VAR models represent the reduced form of an underlying structural model.
This can be seen by starting from a general (but not necessarily identified) structural model, and rewriting it in reduced form. As an example, start from the
bivariate model,

$$y_t = \gamma_1 + a_{11} x_t + b_{11} y_{t-1} + b_{12} x_{t-1} + \varepsilon_{1t}$$
$$x_t = \gamma_2 + a_{21} y_t + b_{21} y_{t-1} + b_{22} x_{t-1} + \varepsilon_{2t}.$$

This system can be rewritten in reduced form by substituting for $x_t$ and $y_t$ on
the RHS of the equations,

$$y_t = \pi_{10} + \pi_{11} y_{t-1} + \pi_{12} x_{t-1} + e_{1t}$$
$$x_t = \pi_{20} + \pi_{21} y_{t-1} + \pi_{22} x_{t-1} + e_{2t}.$$

The equations form a bivariate VAR model of order one. The residuals of the
VAR model (the reduced form) contain the residuals and the parameters ($a_{11}$ and
$a_{21}$) of the structural model. The reduced system can be estimated by applying
OLS to each equation.2 The parameters of the VAR relate to the structural model as

$$\pi_{11} = \frac{a_{11} b_{21} + b_{11}}{1 - a_{11} a_{21}}, \quad \text{etc.}$$

Thus, the parameters of the VAR are complex functions of some underlying structural model, and as such they are on their own quite uninteresting for
economic analysis. It is the lag structure, and sometimes its signs, that are more
interesting. The two residuals in this VAR are,

$$e_{1t} = \frac{\varepsilon_{1t} + a_{11} \varepsilon_{2t}}{1 - a_{11} a_{21}}, \qquad
e_{2t} = \frac{a_{21} \varepsilon_{1t} + \varepsilon_{2t}}{1 - a_{11} a_{21}}.$$

These residuals are both white noise terms, but they are correlated with each
other whenever the coefficients $a_{11}$ or $a_{21}$ are different from zero.
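The generalization of the structural system above, setting $z_t = \{y_t, x_t\}$, is

$$B z_t = \Gamma_0 + \Gamma_1 z_{t-1} + \varepsilon_t, \qquad (9.8)$$

where $B$ collects the contemporaneous coefficients, $\Gamma_0$ the constants and $\Gamma_1$ the coefficients on the first lag,

$$B = \begin{bmatrix} 1 & -a_{11} \\ -a_{21} & 1 \end{bmatrix}, \quad
\Gamma_0 = \begin{bmatrix} \gamma_1 \\ \gamma_2 \end{bmatrix} \quad \text{and} \quad
\Gamma_1 = \begin{bmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{bmatrix}.$$

If both sides of (9.8) are multiplied by $B^{-1}$ the result is,

$$z_t = B^{-1}\Gamma_0 + B^{-1}\Gamma_1 z_{t-1} + B^{-1}\varepsilon_t = \Pi_0 + \Pi_1 z_{t-1} + e_t,$$

where $\Pi_0 = B^{-1}\Gamma_0$, $\Pi_1 = B^{-1}\Gamma_1$ and $e_t = B^{-1}\varepsilon_t$. This shows that the VAR
model is a reduced form of an underlying structural model, where the structural
dependence is "hidden" in the covariance matrix of the error terms.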
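The mapping from structural to reduced form can be checked numerically. The sketch below assumes a bivariate structural model with contemporaneous coefficients a11, a21 and lag coefficients b11, b12, b21, b22; all numerical values are hypothetical, and the structural shocks are assumed uncorrelated with unit variances.

```python
import numpy as np

# Hypothetical structural parameters, chosen only for illustration.
a11, a21 = 0.4, 0.3
b11, b12, b21, b22 = 0.5, 0.1, 0.2, 0.6

B = np.array([[1.0, -a11],
              [-a21, 1.0]])          # contemporaneous relations
Gamma1 = np.array([[b11, b12],
                   [b21, b22]])      # structural lag coefficients

B_inv = np.linalg.inv(B)
Pi1 = B_inv @ Gamma1                 # reduced form (VAR) lag coefficients

# Closed form for the first element, from substituting one equation into the other.
pi11 = (a11 * b21 + b11) / (1.0 - a11 * a21)

# Covariance of the VAR residuals e_t = B^{-1} eps_t when Var(eps_t) = I.
Sigma_e = B_inv @ np.eye(2) @ B_inv.T

print(Pi1[0, 0], pi11)    # the two numbers agree
print(Sigma_e[0, 1])      # non-zero: the VAR residuals are correlated
```

The off-diagonal element of `Sigma_e` is non-zero whenever a11 or a21 differs from zero, which is the sense in which the structural dependence ends up in the residual covariance matrix.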
VAR models are estimated in their AR form, $A(L)y_t = e_t$. They can be
inverted and analysed in their MA form, $y_t = C(L)e_t$. Besides predictions, VAR
models are used for three types of analysis: Granger non-causality tests, forecast error variance decomposition and impulse response analysis. Granger non-causality
tests deserve a special chapter and are therefore discussed in a following chapter.
The other two techniques are typical VAR methods that make use of the MA form.
Forecast Error Variance Decomposition. The forecast error variances are
explained in terms of the history of each variable. This analysis will tell how
strong the influence is among the variables of the system. It tells us the
proportion of movements in a sequence (of $y_i$) that is due to own shocks
and the proportion due to shocks in other variables. If these other variables
have little influence on the investigated variable, they will contribute little
to the forecast error variance. Variables that are exogenous will show small
effects from other variables.
2 OLS is as efficient as the seemingly unrelated regression estimator (SUR) in this case, because the
equations contain the same explanatory variables. However, if we set some lags to zero and have
a system with different lags in different equations, SUR will be a more efficient estimator than OLS.

Impulse response analysis. This is a graphic or numerical presentation of a
simulation of the system's response to an unexpected shock in one variable in
the system. A typical example is to study how the economy, and real GDP,
reacts to an unexpected change in the money supply under the assumption
of rational expectations. Typical questions to ask are: how long does it take
for a shock in $y_t$ or $x_t$ to die out, will there be an effect at all, will it
be positive or negative, will it die out smoothly or through fluctuations? We
also ask if shocks in $y_t$ affect $x_t$, etc. Let the MA form be $y_t = C(L)e_t = \sum_{i=0}^{\infty} C_i e_{t-i}$, where $C_i$ is the matrix of coefficients for lag $i$. In matrix form,
for a two dimensional system,

$$\begin{bmatrix} y_{1t} \\ y_{2t} \end{bmatrix} = \sum_{i=0}^{\infty}
\begin{bmatrix} c_{11,i} & c_{12,i} \\ c_{21,i} & c_{22,i} \end{bmatrix}
\begin{bmatrix} e_{1,t-i} \\ e_{2,t-i} \end{bmatrix}.$$

Setting $i = 0$ gives the impact multiplier, $C_0$, the initial effect of a shock. The
matrix of total, or long-run, multipliers is given by $\sum_{i=0}^{\infty} C_i$. The impulse
response functions are given by $C_j$, where $j = 0, 1, \ldots, t$.
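The MA coefficients behind the impulse responses can be computed directly in the simplest case. The sketch below assumes a stationary VAR(1), x_t = A x_{t-1} + e_t, for which C_i = A^i; the matrix `A` and the horizon are hypothetical, and the shocks are not yet orthogonalized.

```python
import numpy as np

def ma_coefficients(A, horizon):
    """Return [C_0, ..., C_horizon] for the VAR(1) x_t = A x_{t-1} + e_t.

    For a VAR(1) the MA coefficient matrices are C_i = A^i, with C_0 = I.
    """
    p = A.shape[0]
    C = [np.eye(p)]
    for _ in range(horizon):
        C.append(A @ C[-1])
    return C

# Hypothetical stationary coefficient matrix.
A = np.array([[0.5, 0.1],
              [0.2, 0.3]])
C = ma_coefficients(A, horizon=200)

impact = C[0]        # impact multiplier C_0
long_run = sum(C)    # approximates the infinite sum of the C_i

# Response of variable 1 to a unit shock in variable 2 over the first periods.
print([round(float(Ci[0, 1]), 4) for Ci in C[:6]])
print(np.round(long_run, 4))
```

For a stationary VAR(1) the long-run multiplier matrix equals (I - A)^{-1}, which gives a simple check on the truncated sum.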
Both the variance decomposition and the impulse response analysis require that
the residual covariance matrix of the VAR is orthogonalized. This is so because
the errors $e_t$ are dependent on each other through the $B^{-1}$ matrix. Unless the
residuals of the VAR are orthogonalized it will not be possible to identify a shock
as a unique shock coming from one specific variable.3 There are several ways
of performing the orthogonalization of the residuals. (In the following we assume
that the VAR is made up of stationary variables.) The idea is that restrictions
must be put on the covariance matrix of the VAR.
Cholesky decomposition. The Cholesky decomposition represents a purely mathematical way to orthogonalize the residuals, and the result will depend on the ordering of the variables. It is customary to do several different decompositions,
by changing the order of the equations in the model, to show the sensitivity
of the results to creating orthogonalization in different ways. In terms of the residual covariance matrix, what the Cholesky decomposition achieves is to make the
elements above the diagonal of the matrix zero. Assume a three dimensional VAR, $p = 3$,
and therefore a $3 \times 3$ covariance matrix,

$$\Sigma = \begin{bmatrix} \sigma_1^2 & \sigma_{12} & \sigma_{13} \\ \sigma_{21} & \sigma_2^2 & \sigma_{23} \\ \sigma_{31} & \sigma_{32} & \sigma_3^2 \end{bmatrix}.$$

The outcome of the decomposition is to create the following lower triangular matrix,

$$\Sigma^{*} = \begin{bmatrix} \sigma_1^2 & 0 & 0 \\ \sigma_{21} & \sigma_2^2 & 0 \\ \sigma_{31} & \sigma_{32} & \sigma_3^2 \end{bmatrix}.$$

The problem for identifying the VAR and doing the impulse responses is that
the covariance matrix is not diagonal. The Cholesky decomposition builds
on the fact that any positive definite matrix $\Sigma$ can be written as $PP' = \Sigma$ with $P$ lower triangular,
so that the transformed residuals $e_t = P^{-1}\varepsilon_t$ have a diagonal covariance
matrix, $e_t \sim (0, I_N)$. The ordering of the equations determines the outcome,
and the causal ordering of the residual shocks. With $N = 3$, there are $3! = 6$
possible orderings and outcomes, which can be more or less different.
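The dependence on the ordering is easy to see numerically. The following sketch uses `numpy.linalg.cholesky` on a hypothetical 3 x 3 residual covariance matrix; the numbers and the helper name `cholesky_for_ordering` are assumptions for illustration.

```python
import itertools
import numpy as np

# Hypothetical residual covariance matrix of a three dimensional VAR.
Sigma = np.array([[1.0, 0.5, 0.2],
                  [0.5, 2.0, 0.3],
                  [0.2, 0.3, 1.5]])

def cholesky_for_ordering(Sigma, order):
    """Lower triangular P with P P' equal to Sigma re-arranged in the given ordering."""
    idx = np.array(order)
    return np.linalg.cholesky(Sigma[np.ix_(idx, idx)])

P_a = cholesky_for_ordering(Sigma, (0, 1, 2))   # variable 0 ordered first
P_b = cholesky_for_ordering(Sigma, (2, 1, 0))   # variable 0 ordered last

# The transformed residuals P^{-1} e_t have an identity covariance matrix.
P_inv = np.linalg.inv(P_a)
cov_orthogonal = P_inv @ Sigma @ P_inv.T

# Impact of variable 0's own orthogonal shock under the two orderings.
print(P_a[0, 0], P_b[2, 2])

n_orderings = len(list(itertools.permutations(range(3))))
print(n_orderings)
```

The own-shock impact of the first variable differs between the two orderings, which is why results are usually reported for more than one ordering.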

3 Early VAR modelers did not recognize the need for orthogonalization. Thus papers from the
first part of the 1980s must be read with some care.



Set up a recursive system. Instead of letting the computer do all of the job,
you can set up the matrix $B^{-1}$ so that the residuals form a recursive system
by deciding on an ordering of the equations that corresponds to the ordering
and residual correlations created by the Cholesky decomposition. Thus, the
residual in equation one is not affected by the other two (meaning that $x_{1t}$
is not explained by $x_{2t}$ or $x_{3t}$). The second residual is only affected by the
first residual. And finally, the last (third) residual is affected by residuals
one and two. Econometric programs often include Cholesky decomposition
routines in combination with the analysis of VAR models. By changing
the ordering of the equations it becomes possible to compare the effects of
different recursive orderings of the variables. The problem is that we are
drowning in output as the dimension of the VAR increases.
Structural Autoregressive models (SVAR). If economic theory does not suggest a recursive ordering, use economic theory to impose restrictions on the
$B^{-1}$ matrix. These are called Structural Vector Autoregressive (SVAR) models.4 In practice the approach implies formulating a small structural (static)
economic system for the residual process $e_t$. If $y_t$ is a $p$-dimensional system,
the error covariance matrix contains a total of $p^2$ parameters, leading to the
estimation of $p(p+1)/2$, or $(p^2+p)/2$, parameters, equal to the
number of restrictions necessary for the matrix $B^{-1}$. As an example for a 3
variable system, the error process could be set up as,

$$e_{1t} = \varepsilon_{1t}$$
$$e_{2t} = c_{21}\varepsilon_{1t} + \varepsilon_{2t}$$
$$e_{3t} = c_{31}\varepsilon_{1t} + c_{32}\varepsilon_{2t} + \varepsilon_{3t},$$

which happens to be a recursive ordering. Alternatively, the system could
look like,

$$e_{1t} = \varepsilon_{1t} + c_{13}\varepsilon_{3t}$$
$$e_{2t} = c_{21}\varepsilon_{1t} + \varepsilon_{2t}$$
$$e_{3t} = c_{31}\varepsilon_{1t} + \varepsilon_{3t}.$$

In both examples the number of restrictions imposed is equal to $(3^2 + 3)/2 = 6$. Behind each equation is some reasoning about the plausible
correlation among the variables at time $t$. In each equation there is one
white noise residual term with an implicit parameter of unity, leaving three
possible parameters to describe how the shocks in the errors are related.
A more general framework for identifying the VAR is

$$A z_t = A_0 + A_1 z_{t-1} + B\varepsilon_t,$$

where contemporaneous correlations among the variables are captured by $A$,
and $B$ takes care of correlations in the residuals such that the shocks $\varepsilon_t$ become orthogonal.
Once the error process is set up in such a way that the errors are orthogonal,
it becomes possible to analyze the effects of one specific shock on the system and
argue that the shock is unique, coming only from that particular variable. Without
orthogonalization the shock can be a mixture of effects from different variables,
and not a "clean" shock.

4 A fourth approach is offered by Blanchard and Quah (1989), and builds on classifying shocks
as temporary or permanent. This approach can be seen as an extension of the SVAR approach
to processes including integrated variables with common trends.

One controversy here is that it is up to the econometrician to identify and
label the shocks as, for instance, demand or supply shocks. The basis for such
labeling might not be strong. Further, by definition, the errors include not only
structural relations, but also everything that we do not know or understand about
the system. For that reason it might be better to use economic theory to identify
structural relations and build conventional econometric models instead, rather
than trying to analyse what we do not understand. On the other hand, in a
world of rational expectations where the expectations generating mechanism is
unknown, or cannot be modelled, VAR models are the best we can do.


How to estimate a VAR?

First you think about your system. What is it that you want to explain? How
could it be modelled as a recursive system? Second you estimate the equations by
OLS, with the same lag lengths on all variables across the equations to avoid using the
SUR estimation technique. Third, you investigate outliers and shifts and put in
the appropriate dummy variables. Fourth, you try to find a short lag structure and
white noise residuals. Fifth, if you cannot fulfill the fourth step, you minimize an information
criterion. In this case AIC is not the best choice; use BIC or something else.
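The fifth step can be sketched as follows. The code assumes the multivariate BIC in the form ln|Sigma_k| + (number of parameters) ln(T)/T, evaluated on a common estimation sample; the simulated data, the maximum lag and the function names are hypothetical.

```python
import numpy as np

def simulate_var1(A, n_obs, seed=0):
    """Simulate x_t = A x_{t-1} + e_t with standard normal white noise e_t."""
    rng = np.random.default_rng(seed)
    x = np.zeros((n_obs, A.shape[0]))
    for t in range(1, n_obs):
        x[t] = A @ x[t - 1] + rng.standard_normal(A.shape[0])
    return x

def var_bic(x, k, k_max):
    """BIC of a VAR(k) with a constant, using the same sample for every k."""
    n_obs, p = x.shape
    Y = x[k_max:]
    T = Y.shape[0]
    lags = [x[k_max - i:n_obs - i] for i in range(1, k + 1)]
    X = np.hstack([np.ones((T, 1))] + lags)
    coef, *_ = np.linalg.lstsq(X, Y, rcond=None)
    resid = Y - X @ coef
    Sigma = resid.T @ resid / T            # residual covariance matrix
    n_par = p * (1 + p * k)                # constants plus lag coefficients
    return np.log(np.linalg.det(Sigma)) + n_par * np.log(T) / T

# Hypothetical data generated by a VAR(1).
A_true = np.array([[0.5, 0.1],
                   [0.2, 0.3]])
data = simulate_var1(A_true, n_obs=2000)

k_max = 4
bic = {k: var_bic(data, k, k_max) for k in range(1, k_max + 1)}
best_k = min(bic, key=bic.get)
print(bic, best_k)
```

Holding the estimation sample fixed across lag lengths keeps the criteria comparable; the chosen lag length should still be checked against the residual diagnostics described in steps three and four.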


Impulse responses in a VAR with non-stationary variables and cointegration.

The orthogonalization of the residuals can offer some interesting intellectual challenges, especially in the SVAR approach. If the variables in the VAR are integrated
variables, which also are co-integrating, we are faced with some interesting problems. In the co-integrating VAR model there will be both stationary shocks and
permanent shocks, and identifying these two types in the system is not always
easy. If the VAR is of dimension $p$, there can be at most $r$ co-integrating vectors, $0 \le r \le p$, and $p - r$ common stochastic trends. Juselius (2006) ("The
Co-integrated VAR Model", Oxford University Press) shows how an identification
of the structural MA model, and orthogonalization of the residuals, can be done
in terms of both the short run and the long run of the system.
The VAR(2), with no constants, trends or other deterministic variables, will
have the following VECM representation after finding $r$ co-integrating vectors,

$$\Delta x_t = \Gamma_1 \Delta x_{t-1} + \alpha\beta' x_{t-1} + \varepsilon_t.$$

The MA version of this model is,

$$x_t = C \sum_{i=1}^{t} \varepsilon_i + C^{*}(L)\varepsilon_t + x_0,$$

where the first term on the right hand side represents the stochastic trends
in the system and the second term represents the stationary part. The $C$ matrix
will then represent all that is not the stationary vectors, and is related to the
co-integrating vectors as $C = \beta_{\perp}(\alpha_{\perp}' \Gamma \beta_{\perp})^{-1}\alpha_{\perp}'$, where $\Gamma = I - \Gamma_1$ and $\alpha_{\perp}$ and $\beta_{\perp}$ are the orthogonal complements of $\alpha$ and $\beta$.


9.1 BVAR, TVAR etc.

VAR models represent statistical descriptions of data series. As such they are a basis for
reducing your model and going into more ordinary structural econometric models,
such as Vector Error Correction Models (VECMs). Estimating a VAR is then a
way of making sure that the final model is a well-defined statistical model, i.e. a
model that is consistent with the data chosen.
We have talked about what you can do with the VAR in terms of forecasting,
simulations, impulse responses, forecast error decomposition and Granger
causality testing. In this context we meet the so-called SVAR, the Structural
VAR. There are, however, a number of other VARs that one needs to know
about. The problems of working with VARs are obvious: there is a large
number of parameters to be estimated, the estimated parameters might not be
stable over time, and there are a number of variables that are not modelled in
the VAR because the VAR would get too large to handle. If we want to use
the VAR for forecasting we need to address these problems. To handle the
problem with time varying parameters there are Time-Varying-Parameter
VARs (TVP-VARs). In addition there are various VAR modeling techniques
that deal with regime changes: Markov switching VARs, threshold VARs,
floor and ceiling VARs, and smooth transition VARs.
To work with a large number of variables and reduce the model it is possible to
use factor analysis, which takes us to Factor Augmented VARs (FA-VARs). Another
approach is to use a priori information about parameters and their distribution,
as represented by Bayesian VARs (BVARs). The latter is a popular
approach in many central banks.
We can illustrate the problem in the following way. Your model predicts that
the inflation rate will vary around 10%, and at the same time you have additional
information indicating that inflation will fluctuate around 5 per cent, say that there
is a sudden drop in inflation. What do you do? One approach is simply to reduce
the constant term and predict changes in inflation around 5 per cent instead. A
more ambitious approach is to incorporate more information in your model, from
more data, and place more emphasis on recent observations etc. Changing the
constant is easy and quite normal. As you start walking along the path of making
assumptions about the data and the parameters of the model you might go too
far in the other direction. As long as we talk about forecasting, the proof is in the
pudding. The best forecast wins, but as we talk about the best policy to achieve
goals in the future you have to be much more careful.
The types of VARs we have discussed so far are basically statistical representations of the data. Without further restrictions, and incorporation of long-run steady
state relations in the form of co-integrating vectors, their relative predictability
will be quite poor. Also, the economy is more complex, involving many more
variables than the two to six variables that can be handled in a standard VAR.
If your model contains fifty or one hundred variables there will be too many lags
and coefficients to estimate. One way of dealing with this problem is to use so-called
Bayesian VARs (BVARs). In the BVAR you can use prior information to reduce
the number of coefficients you need to estimate. BVARs are popular among many
central banks, including both the ECB and the Fed, as a way to construct better and
bigger VARs for forecasting.5
5 Gary Koop at University of Strathclyde has a home page with course material dealing with
BVAR models.



Finally, remember that the data is the real world, economic theories are constructions of the human mind (quote from David Hendry). If you want to use a
priori information of some kind you might miss what the data, the real world, is
trying to tell you.



Part III

Granger Non-causality Tests


Whether a variable is affected by another in such a way that it can be said
to be caused by the other variable is a fundamental question in all sciences. However, to
validate empirically that one variable is caused by another variable is problematic
in economics since it is often quite difficult to set up controlled experiments.
Granger (1969), building upon work done by Wiener, was the first to formalize
an empirical concept of causality in economics. Granger's basic idea is that the
future cannot predict the present or the past. It follows, as a necessary condition,
that for one variable ($x_t$) to cause another variable ($y_t$), lagged values of $x_t$ must
predict $y_t$. This can be tested with the following vector autoregressive model,

$$y_t = \sum_{i=1}^{k} \alpha_i y_{t-i} + \sum_{i=1}^{k} \beta_i x_{t-i} + e_t,$$

where $y_t$ is explained by lagged values of $y_t$ and $x_t$. The lag length ($k$) is
determined such that $e_t$ is a white noise process, $e_t \sim NID(0, \sigma^2)$. Alternatively,
if you cannot find white noise residuals, minimize an information criterion instead.
If the parameters associated with the process $x_t$ are different from zero, $\beta_1 = \ldots = \beta_k \neq 0$, then $x_t$ is predicting $y_t$, and $x_t$ can also be said to Granger cause the
variable $y_t$. If, on the other hand, all $\beta$-parameters are zero, $x_t$ cannot predict or
cause $y_t$. An $F$-test on the joint significance of the $\beta$-parameters is sufficient in this
case. (Alternatively, the test can be set up in the form of a chi-square test, depending
mainly on the software you are using.) The $F$-test works by comparing the mean
squared errors from the equation above with those from a regression where the $x$'s
are excluded. If the inclusion of lagged $x$ variables leads to a significant reduction
in the mean square error, lagged values of $x_t$ are predicting $y_t$ and the variable $x_t$
can be said to Granger cause $y_t$.
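The F-test described above can be sketched with NumPy alone by comparing the residual sums of squares of the restricted and the unrestricted regression. The simulated series, the lag length and the function names are assumptions for illustration; a p-value would additionally require the F distribution, which is left out here.

```python
import numpy as np

def lagmat(v, k):
    """Columns v_{t-1}, ..., v_{t-k}, aligned with observations t = k, ..., n-1."""
    n = len(v)
    return np.column_stack([v[k - i:n - i] for i in range(1, k + 1)])

def granger_f(y, x, k):
    """F statistic for H0: the k lags of x do not help to predict y."""
    Y = y[k:]
    const = np.ones((len(Y), 1))
    X_r = np.hstack([const, lagmat(y, k)])   # restricted: own lags only
    X_u = np.hstack([X_r, lagmat(x, k)])     # unrestricted: add the lags of x

    def ssr(X):
        beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
        e = Y - X @ beta
        return e @ e

    ssr_r, ssr_u = ssr(X_r), ssr(X_u)
    dof = len(Y) - X_u.shape[1]
    return ((ssr_r - ssr_u) / k) / (ssr_u / dof)

# Hypothetical data in which lagged x helps to predict y, but not the reverse.
rng = np.random.default_rng(1)
n = 500
x = np.zeros(n)
y = np.zeros(n)
for t in range(1, n):
    x[t] = 0.5 * x[t - 1] + rng.standard_normal()
    y[t] = 0.3 * y[t - 1] + 0.5 * x[t - 1] + rng.standard_normal()

f_x_to_y = granger_f(y, x, k=2)   # tests whether x Granger causes y
f_y_to_x = granger_f(x, y, k=2)   # tests the reverse direction
print(f_x_to_y, f_y_to_x)
```

In this simulated setting the statistic for the direction from x to y should be far larger than the one for the reverse direction.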
Please notice the distinction between prediction
and causality, which is important in a policy context. The fact that $\sum_{i=1}^{k} \beta_i$ is significantly different from
zero, so that $x_t$ is predicting $y_t$, does not imply that $x_t$ causes $y_t$. It is easy to
understand why from the following analogy. A weatherman who predicts rain
tomorrow does not cause the rain that might fall tomorrow. This is so no matter
how good this person is at predicting tomorrow's weather. This is the reason why the
test should always be referred to as a Granger non-causality test and not a test
of causality. Based on the assumption that the future cannot predict the present
and the past, we can only test whether a variable is not causing another.
Of course, the outcome of the test might be affected by the number of lags
chosen in the VAR, and by the variables chosen to be included in the VAR. Though
two variable VARs are common, this is often a crude simplification. The classical
example is the effects of real money growth on real GDP growth. In one set-up
you might find that monetary policy is effective, but add the interest rate to the
VAR and you might find that monetary policy is ineffective.
Finding that $x_t$ Granger causes $y_t$ does not exclude that the reverse is true.
Two variables can Granger cause each other. A test of whether $y_t$ Granger causes
$x_t$ is performed with the following model,

$$x_t = \sum_{i=1}^{k} \gamma_i x_{t-i} + \sum_{i=1}^{k} \delta_i y_{t-i} + \nu_t,$$

where the lag length is the same as before, and $\nu_t \sim NID(0, \omega^2)$. If lagged
values of $y_t$ predict $x_t$, $y_t$ is Granger causing $x_t$. In some situations testing the
reverse relationship is of no interest. For instance, the inflation rate in a small
open economy should not Granger cause the inflation rate of the world.
The main weakness of the Granger non-causality test is the assumption that
the error process in the VAR is not only a white noise process, but also a white
noise innovation process with respect to all relevant information for explaining
the movements of $x_t$ and $y_t$. This is an important issue which is often forgotten
in applied work, where bivariate systems are the rule rather than the exception.
Granger's basic definition of non-causality is based on the assumption that all factors relevant for predicting $y_t$ are known. Let $I_t$ represent all relevant information,
both past and present, and let $X_t$ be present and past observations on $x_t$, such that
$X_t = (x_t, x_{t-1}, x_{t-2}, \ldots, x_0)$; $I_{t-1}$ and $X_{t-1}$ represent past observations only.
The variable $x_t$ can therefore be said to Granger cause $y_t$ if the mean square error
(MSE) increases when $y_t$ is regressed against the information set where $X_{t-1}$ is
removed. In the bivariate case, this can be stated as,

$$MSE(\hat{y}_t \mid I_{t-1}) < MSE[\hat{y}_t \mid (I_{t-1} - X_{t-1})],$$

where $\hat{y}_t$ is the predicted value of $y_t$. The problem is to know what should
be included in $I_t$. If too many variables are included the degrees of freedom
will diminish. If too few variables are included the test might lead to the wrong
conclusions. The result of a unidirectional relation from $x_t$ to $y_t$ in a bivariate
model might be reversed if a relevant third variable is included in the system.
This is a serious limitation of the Granger causality test. A way of reducing the
problem is to always perform the tests in a VAR system. If some variable is to
be treated as exogenous in the system, this must be based on strong a priori grounds.
The Granger non-causality test is sensitive to the spurious regression problem.
The $F$-test is unreliable when used on integrated or near integrated variables, which is
the standard situation in economics. However, using only first differences of the
variables implies a loss of information. In this situation it is recommended to
include error correction terms (or co-integrating vectors) in the VAR to increase
the efficiency of the $F$-tests. There is an interesting relation between cointegration
and Granger causality, as shown by Engle and Granger (1987). If a co-integrating
relationship is found, it follows that there must exist Granger causality in at least one
direction. Tests of cointegration do not make causality tests redundant, since they cannot
determine the direction of the causality. However, if no cointegration is found we
can conclude that there is no Granger causality either.



10.1 Exogeneity
Exogeneity assumptions are necessary in econometric model building. In many
situations they are used in an ad hoc way: variables are taken as determined outside the system, or
are classified as endogenous and predetermined. Based on
this classification of the variables in the system, the basic econometric text book
explains how to apply the rank and the order condition to identify a simultaneous
system, and whether it is possible to use OLS or a system estimator is necessary. In this
section we introduce three basic concepts of exogeneity that cover (1) estimation
and inference, (2) conditional forecasting and simulations, and (3) policy conclusions.
The three concepts that allow you to perform these tasks are weak exogeneity,
strong exogeneity and super exogeneity.
Consider the following system, in which there is co-integration,

$$y_t = \beta x_t + \varepsilon_{1t}$$
$$\Delta x_t = \varepsilon_{2t}.$$

If $\varepsilon_{1t}$ and $\varepsilon_{2t}$ are both stationary it follows that $x_t$ is I(1), and that $y_t$ is I(0)
if $\beta = 0$. On the other hand, if $\beta \neq 0$, it follows that $y_t$ is I(1). To estimate $\beta$
it is required that $y_t$ does not simultaneously influence $x_t$. If $y_t$ or $\Delta y_t$ is part of
the right-hand side of the $x_t$ equation (and thus embedded in $\varepsilon_{2t}$) the result is that
$E(\varepsilon_{1t}\varepsilon_{2t}) \neq 0$, and we can write $\varepsilon_{1t} = \lambda\varepsilon_{2t} + u_t$, where for simplicity we assume
that $u_t \sim N(0, \sigma^2)$.
Now, if we estimate $\beta$ with OLS, the outcome would be a biased estimate of
$\beta$, since $E(x_t\varepsilon_{1t}) = E(x_t(\lambda\varepsilon_{2t} + u_t)) \neq 0$, and we can no longer assume that $x_t$ and $\varepsilon_{1t}$
are independent. This is an example of lack of weak exogeneity. With the first model
it is not possible to estimate the parameter of interest $\beta$; the outcome from OLS is
a different and biased estimate.


Weak Exogeneity

Weak exogeneity spells out the conditions under which it is possible to obtain
unbiased and efficient estimates. The definition is based on splitting the joint density
function into a conditional density and a marginal density function,

$$D_1(y_t, z_t \mid Y_{t-1}, Z_{t-1}; \theta_1) = D_2(y_t \mid z_t, Y_{t-1}, Z_{t-1}; \theta_2)\, D_3(z_t \mid Y_{t-1}, Z_{t-1}; \theta_3),$$

where the parameters of interest ($\psi$) are given as $\psi = f(\theta_1)$, and $Y_{t-1}$ and $Z_{t-1}$
are matrices of the finite historical values of these variables. The conditions under
which it is possible to estimate the parameters of interest by modeling only the
conditional density are that $\theta_2$ and $\theta_3$ should be variation free, and that there are no
cross restrictions between the parameters of $\theta_2$ and $\theta_3$.
In practical situations, using stationary data, this comes down to judging
whether the error terms of the marginal and conditional models are correlated.1 (If the data series are integrated the question becomes one of long-run
independence between the two residual processes.)
Three important conclusions follow from the definition above. The first is that
whether a variable is exogenous or not depends on the parameters of interest.
An OLS regression will always lead to estimates of some kind, but what is their
meaning? To understand the regression we identify parameters of interest that
relate to other variables through the (not modelled) marginal density functions.
Thus, exogeneity must be stated in terms of parameters of interest, i.e. the variable
$z_t$ is weakly exogenous for the parameter $\psi$.
Second, it is difficult to test for weak exogeneity. Most existing tests fail,
with the exception of Johansen's test for weak exogeneity of the variables in a
co-integrating vector.2 The purpose of an exogeneity test is mainly to find an
argument for not specifying the marginal model. However, the definition of weak
exogeneity tells us that this is not possible. A test will need the estimated marginal
model, otherwise it will not work. But when the marginal model is estimated (and tested
for misspecification) the work is already done, so the only thing left is to compare
the results.
The third conclusion is that it is not possible to state that a variable like
US inflation is determined outside the model for inflation in Zambia, or the
rainfall in an agricultural model. If these variables enter the system in terms of
expectations, it might be necessary to specify the stochastic process that generates
these expectations in the model to get unbiased and efficient estimates of the
parameters of interest.


Strong Exogeneity

Strong exogeneity spells out the conditions for conditional forecasting and simulations of a model with variables that are not modelled. The condition is weak exogeneity
and that the marginal model should not depend on the endogenous variable. Thus
the marginal process must be

$$D_3(z_t \mid Y_{t-1}, Z_{t-1}; \theta_3) = D_3(z_t \mid Z_{t-1}; \theta_3),$$

meaning that it is not necessary to estimate the marginal process to forecast
$y_t$.

1 The condition of no correlation between the error terms is easily understandable if we assume
that $\{y_t, z_t\}$ is a bivariate normal process. Set up the density function, and determine the
condition when it is possible to estimate the parameters of interest from the conditional model.
2 Regarding Johansen's test, it is important to remember that it is model dependent. The test
is performed conditionally on the short-run dynamics of the variables included in the system,
the dummy variables and the specification of the deterministic trend.




Super Exogeneity

Super exogeneity determines the conditions for using the estimated parameters
for policy decisions. The condition is weak exogeneity and that the parameters
of the conditional model are stable with respect to changes in the marginal model. For
instance, if the money supply rule changes, the parameters of the marginal process
will also change. If this also leads to changes in the parameters of the conditional
model, the conditional model cannot be used to analyse the implications of policy
changes. Thus, super exogeneity defines the situations when the Lucas critique is
not valid.

10.2 Multicollinearity and understanding of multiple regression

Multicollinearity has to do with how we understand the estimated parameters.
Study the following model,

$$y_t = \beta_1 x_t + \beta_2 z_t + \varepsilon_t.$$

The estimated parameters of this model are analysed under the assumption that
there is no correlation between the variables. The parameter $\beta_1$ is understood as
the effect on $y_t$ following a unit change in $x_t$ while holding the other variables in
the model ($z_t$) constant. In the same way $\beta_2$ measures the effect on $y_t$ while $x_t$ is
held constant. Another way of expressing this is the following: $E\{y_t \mid z_t\} = \beta_1 x_t$
and $E\{y_t \mid x_t\} = \beta_2 z_t$, which tells us that the effect of one parameter cannot be
analysed in isolation from the rest of the model. The effect of $z_t$ in the model is
not on $y_t$ in itself, it is on $y_t$ conditional on $x_t$. The meaning of holding, say, $x_t$
constant in the model while $z_t$ is free to vary is that we study the effect on $y_t$
after "removing" the effects of $x_t$ on $y_t$. If $x_t$ and $z_t$ are correlated it is not possible to
keep one of them constant while the other is changing. This is the multicollinearity problem.
The statistical problem is best understood by looking at the OLS variance of
$\hat{\beta}$. The variance is

$$Var(\hat{\beta}_1) = \frac{\sigma^2}{\left(\sum x_t^2\right)\left(1 - \rho_{xz}^2\right)},$$

where $\rho_{xz}$ is the correlation between $x_t$ and $z_t$. If the correlation is perfect,
$\rho_{xz} = 1$, the denominator becomes zero and the calculation of the variance breaks
down. Perfect multicollinearity means that the covariance matrix $E(X'X)^{-1}$ does
not exist, and there is no solution to $\hat{\beta} = (X'X)^{-1}X'Y$. This is seldom a practical
problem, since the computer program that calculates the estimates will break down
when it tries to invert the matrix.3
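A small numerical sketch shows how the variance of an estimated coefficient grows as the two regressors become more correlated. It assumes a regression of y on x and z without a constant and a known residual variance; the data and function names are hypothetical, and the correlation used in the formula is the uncentered sample correlation.

```python
import numpy as np

def var_beta1_matrix(x, z, sigma2):
    """Variance of the coefficient on x taken from sigma^2 (X'X)^{-1}."""
    X = np.column_stack([x, z])
    return sigma2 * np.linalg.inv(X.T @ X)[0, 0]

def var_beta1_formula(x, z, sigma2):
    """Same variance written as sigma^2 / (sum(x^2) (1 - r^2))."""
    r = (x @ z) / np.sqrt((x @ x) * (z @ z))   # uncentered sample correlation
    return sigma2 / ((x @ x) * (1.0 - r ** 2))

# Hypothetical regressors with increasing correlation between x and z.
rng = np.random.default_rng(0)
x = rng.standard_normal(200)
noise = rng.standard_normal(200)

variances = []
for rho in (0.0, 0.9, 0.99):
    z = rho * x + np.sqrt(1.0 - rho ** 2) * noise
    variances.append(var_beta1_matrix(x, z, sigma2=1.0))

print(variances)
```

The two functions give the same number for any pair of regressors, and the variance rises sharply as the correlation approaches unity, where the matrix inversion eventually fails.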
Near, or less than perfect, multicollinearity, meaning that $\rho_{xz}$ is between zero and
unity, is more complex. However, the problem is limited to the understanding
of the estimated parameters, not to the understanding of the model. Less than perfect
multicollinearity will affect the residual variance of the model ($\sigma^2$) and the estimated
variances of the variables. Historically, a number of measurements, remedies and
quick fixes for multicollinearity have been suggested. None of these actually works.

3 If the inversion process does not break down completely, estimated variances of one or more
parameters will be incredibly large.

In cross section studies a typical problem is to explain household consumption.
If you use household income, the number of rooms that the household possesses, the
number of children and the size of the car as explanatory variables, you would
not be surprised to learn that these explanatory variables are highly correlated
with each other. As a consequence it might be hard to understand what the
parameters are estimating. This example shows that throwing in explanatory
variables without a clear economic model in the background will lead to problems.
There is no substitute for economic theory in this example.
In time series modelling multicollinearity is often, somewhat mistakenly, linked
to the estimation of lag lengths. Take the following distributed lag model as an
example: $x_t = \beta_1 x_{t-1} + \beta_2 x_{t-2} + \varepsilon_t$. If $x_t$ is an AR($p$) process, the $x_t$ variables in
the equation are of course correlated, meaning that we cannot hold $x_{t-1}$ constant
and at the same time analyse the effect of varying $x_{t-2}$ on its own. On the other
hand, we are not interested in changing one lag while keeping the rest fixed. In a
time series regression, estimation aims at finding the sufficient number of lags that
describes the dynamic process.
However, since the lags are correlated with each other, this will affect the estimated variance of each lag. This will make it more difficult to determine the correct number of lags in a model, if we were to check the fit of the model by looking at the t-values of the parameters only. Since model building should be aimed at finding a white noise innovation term, t-values are seldom used to decide the over-all fit of the model. Instead we focus on misspecification tests of the model.
We can summarize the facts about multicollinearity as follows. There is no way to accurately measure the degree of multicollinearity and there are no quick fixes. Never, under any circumstances, can you delete some variables to solve the problem as is suggested in some textbooks. Deleting variables means that you change the specification and the fit of the model. Leaving out a relevant explanatory variable leads to a misspecified model, which creates bias in the estimates and affects inference. As shown in Hendry (1990 Ch. 6), multicollinearity is not a model problem, or a misspecification problem; it has to do with the interpretation of the estimated variables only, and not with the fit of the model. It can be shown how the variables in a given model can be transformed such that they become orthogonal to each other, without affecting the fit of the model.
Returning to the example above, the interpretation of the parameters can be made clearer if we use the transformation $\Delta = 1 - L$,
$$y_t = \beta_1 \Delta x_t + \beta_3 x_{t-1} + \varepsilon_t.$$
The transformation is just a reparameterization and does not affect the residual term. The parameter $\beta_3 = \beta_1 + \beta_2$ is the long run static solution of the model. Thus we get an estimate of the short run effect on $y_t$ from $\beta_1$ and at the same time a direct estimate of the static long run solution from $\beta_3$. If the collinearity between $x_t$ and $x_{t-1}$ is high, it can be assumed to be quite small when we look at $\Delta x_t$ and $x_{t-1}$. Since our final interest in modelling economic time series is to find a well-defined statistical model, which mimics the DGP of the variable(s), multicollinearity is not really a problem. We will therefore not deal with this topic any further.
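The reparameterization argument can be checked numerically. The following is a minimal sketch, assuming a simulated distributed lag model $y_t = 0.5 x_t + 0.3 x_{t-1} + \varepsilon_t$ with a persistent AR(1) regressor; the coefficient values, the sample size and all variable names are illustrative assumptions, not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 500

# Assumed illustration: x_t is a persistent AR(1), so x_t and x_{t-1} are highly collinear.
x = np.zeros(T)
for t in range(1, T):
    x[t] = 0.95 * x[t - 1] + rng.standard_normal()

eps = rng.standard_normal(T)
y = np.zeros(T)
y[1:] = 0.5 * x[1:] + 0.3 * x[:-1] + eps[1:]

yy, x0, x1 = y[1:], x[1:], x[:-1]
dx = x0 - x1                                # Delta x_t

X_levels = np.column_stack([x0, x1])        # regressors x_t and x_{t-1}
X_repar = np.column_stack([dx, x1])         # regressors Delta x_t and x_{t-1}

b_levels, *_ = np.linalg.lstsq(X_levels, yy, rcond=None)
b_repar, *_ = np.linalg.lstsq(X_repar, yy, rcond=None)

res_levels = yy - X_levels @ b_levels
res_repar = yy - X_repar @ b_repar

corr_levels = np.corrcoef(x0, x1)[0, 1]     # collinearity in the original form
corr_repar = np.corrcoef(dx, x1)[0, 1]      # collinearity after the transformation

print("beta_1 + beta_2 from levels:", b_levels[0] + b_levels[1])
print("beta_3 from reparameterized model:", b_repar[1])
print("corr(x_t, x_{t-1}):", corr_levels, " corr(dx_t, x_{t-1}):", corr_repar)
```

In this sketch the two regressions should give identical residuals, the coefficient on $x_{t-1}$ in the transformed model equals the sum of the two level coefficients, and the correlation between the regressors drops sharply.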




This section looks at a number of unit root tests, which can be applied to determine
the order of integration of a variable. The following tests are presented,
DF-test: Dickey-Fuller test
ADF-test: Augmented Dickey-Fuller test
Z-test: Phillips and Perron's Z-test (to be included)
LMSP-test: Schmidt and Phillips' LM test
KPSS-test: Kwiatkowski, Phillips, Schmidt and Shin test
G(p, q)-test: Park's G-test
The alternative hypotheses to having an integrated series are discussed in a
following section.

11.0.1 The DF-test:

The Dickey-Fuller test is one of the oldest tests. The test builds on the assumed DGP
$$y_t = y_{t-1} + \varepsilon_t \quad \text{with} \quad \varepsilon_t \sim NID(0, \sigma^2).$$
Given this DGP, subtract $y_{t-1}$ from both sides, and estimate the equation
a) $\Delta y_t = \pi y_{t-1} + \varepsilon_t$,
or, put a constant term in the regression, to allow for the alternative of a deterministic trend in $y_t$,[1]
b) $\Delta y_t = \alpha + \pi y_{t-1} + \varepsilon_t$,
or, put in both a constant and a time trend in the estimated equation, to allow for both a linear deterministic trend and a quadratic deterministic trend in $y_t$,
c) $\Delta y_t = \alpha + \pi y_{t-1} + \beta t + \varepsilon_t$.
Here $\pi = 0$ if $y_t$ is I(1). In this regression we know that $\hat\pi$ will be biased downwards in a limited sample. Thus, we can put all the risk on the negative side and perform a one-sided test, instead of a two-sided standard t-test. The one-sided t-test is $H_0: \pi = 0$, $y_t \sim I(1)$, against $H_1: \pi < 0$, $y_t \sim I(0)$. The correct 't-statistic' for testing the significance of $\hat\pi$ is tabulated in Fuller (1976), under the assumption that $y_t$ is a random walk, $\Delta y_t \sim N(0, \sigma^2)$. The correct distribution for the t-test can also be calculated from MacKinnon (1991), for the exact sample size at hand. In practice the differences are small though. The t-statistics for the constant term and the trend term are tabulated in Dickey and Fuller (1980). Notice that the null hypothesis is that $\Delta y_t = \varepsilon_t$, where $\varepsilon_t$ is white noise. The econometrician, however, will not know this in advance. He or she must therefore set up the estimated model so that there is a meaningful alternative hypothesis to the stochastic trend (or unit root hypothesis). A general alternative is to assume that $y_t$ is driven by a combination of $t$ and $t^2$.
It is therefore recommendable, if $\varepsilon_t$ is white noise, to start with model c. If the t-value on $\pi$ is significant according to the table in Fuller (1976), the null hypothesis of a unit root process is rejected. It follows then that the t-statistics for testing the significance of $\alpha$ and $\beta$ follow standard distributions. But, as long as the unit root hypothesis ($\pi = 0$) cannot be rejected, both $\alpha$ and $\beta$ must be assumed to follow non-standard distributions. Thus, under the hypothesis that $\pi = 0$, the appropriate distributions for $\alpha$ and $\beta$ are found in Dickey and Fuller (1980). In a limited sample it might be wise to compare the outcome of both model c and a.
[1] To understand why the constant represents a linear deterministic trend, go back to the discussion about the properties of the random walk process.
The test is easily extended to higher order unit roots, simply by performing the test on differenced data series.
When will the test go wrong? First, if $\varepsilon_t$ is not white noise. In principle, $\varepsilon_t$ can be an ARIMA process. In the following a number of models dealing with this situation are presented. Second, if there is more than one unit root, then testing for one unit root is likely to be misleading. Hence a good testing strategy is to start by testing for two unit roots, which is done by applying the DF-test to the first difference of the series ($\Delta y_t$). If a unit root in $\Delta y_t$ is rejected one can continue with testing for one unit root, using the series in level form $y_t$.
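The three DF regressions can be sketched with plain OLS. The example below is an illustration under stated assumptions: the function names `ols_tstats` and `df_tstat` are hypothetical, the data are simulated, and the resulting t-ratio on $y_{t-1}$ must still be compared with the Dickey-Fuller tables (for model c the asymptotic 5 per cent value is roughly -3.4), not with the standard t-distribution.

```python
import numpy as np

def ols_tstats(X, y):
    """OLS coefficients, conventional t-ratios and residuals."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    dof = X.shape[0] - X.shape[1]
    s2 = resid @ resid / dof
    cov = s2 * np.linalg.inv(X.T @ X)
    return beta, beta / np.sqrt(np.diag(cov)), resid

def df_tstat(y, model="c"):
    """t-ratio on y_{t-1} in DF regression a (no constant),
    b (constant) or c (constant and linear trend)."""
    dy = np.diff(y)
    ylag = y[:-1]
    cols = [ylag]
    if model in ("b", "c"):
        cols.append(np.ones_like(ylag))
    if model == "c":
        cols.append(np.arange(1, len(ylag) + 1, dtype=float))
    X = np.column_stack(cols)
    _, t, _ = ols_tstats(X, dy)
    return t[0]

# Simulated data (assumed for illustration): a random walk and a stationary AR(1).
rng = np.random.default_rng(1)
T = 500
random_walk = np.cumsum(rng.standard_normal(T))
shocks = rng.standard_normal(T)
ar1 = np.zeros(T)
for t in range(1, T):
    ar1[t] = 0.5 * ar1[t - 1] + shocks[t]

t_rw = df_tstat(random_walk, "c")
t_ar = df_tstat(ar1, "c")
print("DF t-ratio, random walk:", t_rw)
print("DF t-ratio, stationary AR(1):", t_ar)
```

To test for two unit roots first, as recommended above, the same function can be applied to `np.diff(y)` before it is applied to the level series.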

11.0.2 The ADF-test

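The DF-test, like all tests of I(1) versus I(0), is sensitive to deviations from the assumption $\varepsilon_t \sim NID(0, \sigma^2)$. The assumption of NID errors is critical to the simulated distributions in Fuller (1976). If there is autocorrelation in the residual process the OLS estimated residual will be inappropriate, and the residual variance estimate will be biased and inconsistent. The ADF-test seeks to solve the problem by augmenting the equations with lagged $\Delta y_t$,
$$\Delta y_t = \pi y_{t-1} + \sum_{i=1}^{k} \gamma_i \Delta y_{t-i} + \varepsilon_t,$$
$$\Delta y_t = \alpha + \pi y_{t-1} + \sum_{i=1}^{k} \gamma_i \Delta y_{t-i} + \varepsilon_t,$$
$$\Delta y_t = \alpha + \pi y_{t-1} + \beta t + \sum_{i=1}^{k} \gamma_i \Delta y_{t-i} + \varepsilon_t.$$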

The asymptotic test statistic is distributed as the DF-test, and the same recommendation applies to these equations: make sure there is a meaningful alternative hypothesis. Therefore start with the model including both a constant and a trend. The ADF test is better than the original DF-test since the augmentation leads to empirical white noise residuals. As for the DF-test, the ADF test must be set up in such a way that it has a meaningful alternative hypothesis, and higher order integration must be tested before the single unit root case.[2]
[2] Sjöö (2000b) explains in some detail how the test is used in practice.


The critical factor is to choose the length of the augmentation. Because $\Delta y_t$ is stationary, the distributions of the lags are normal, and standard tests, including Q-tests and LM tests for serial correlation in the residuals, can be used. In small samples the augmentation might play an important role for the outcome of the test. No general rule can be established, other than that the residuals should not display autocorrelation. It is therefore up to the modeller to convince the readers (the critics) that the final verdict regarding the significance, or non-significance, of $\pi$ rests on solid ground.
An additional complication is how to treat outliers in the sample. Outliers will affect the estimation, in particular the significance of the constant and the trend variable. If trends are significant, under the null of a unit root process, according to the tabulations in Dickey and Fuller (1979), the conclusion is that the estimated parameter on $y_{t-1}$ follows a normal distribution. Finding significant time trends often implies the rejection of a unit root. But, if this is caused by an outlier affecting the estimation of the trend, one has to be careful in rejecting the unit root. In the case of significant trend variables, leading to the rejection of the unit root hypothesis, some careful investigation of outliers is called for, to be secure against spurious results.
The DF and ADF tests are the most well known tests, and are easily understood by most people. However, in limited samples and with $\varepsilon_t$ not being white noise, they are often quite inconclusive. The tests should therefore be accompanied by graphs and perhaps other tests.
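The choice of augmentation length can be sketched as a simple search: add lags of $\Delta y_t$ until a Q-test no longer signals residual autocorrelation. The code below is one possible illustration, not a prescribed rule. The function names, the use of 10 autocorrelations in the Ljung-Box statistic, the maximum lag of 8 and the simulated series are all assumptions, and comparing the Q-statistic with an unadjusted chi-square(10) critical value is only a rough check.

```python
import numpy as np

CHI2_10_5PCT = 18.307  # 5% critical value of the chi-square distribution with 10 d.o.f.

def adf_regression(y, k):
    """ADF regression with constant, trend and k lagged differences.
    Returns the t-ratio on y_{t-1} and the residuals."""
    dy = np.diff(y)
    n = len(dy) - k                                  # usable observations
    lhs = dy[k:]
    cols = [y[k:-1], np.ones(n), np.arange(1, n + 1, dtype=float)]
    for i in range(1, k + 1):
        cols.append(dy[k - i:len(dy) - i])           # Delta y_{t-i}
    X = np.column_stack(cols)
    beta, *_ = np.linalg.lstsq(X, lhs, rcond=None)
    resid = lhs - X @ beta
    s2 = resid @ resid / (n - X.shape[1])
    se = np.sqrt(s2 * np.linalg.inv(X.T @ X)[0, 0])
    return beta[0] / se, resid

def ljung_box(resid, m=10):
    """Ljung-Box Q-statistic based on the first m residual autocorrelations."""
    e = resid - resid.mean()
    n = len(e)
    denom = e @ e
    q = 0.0
    for h in range(1, m + 1):
        r = (e[h:] @ e[:-h]) / denom
        q += r * r / (n - h)
    return n * (n + 2) * q

def choose_lag(y, max_k=8):
    """Smallest k whose residuals pass the Q-test (falls back to max_k)."""
    for k in range(max_k + 1):
        t_pi, resid = adf_regression(y, k)
        if ljung_box(resid) < CHI2_10_5PCT:
            return k, t_pi
    return max_k, t_pi

# Simulated I(1) series whose first difference is an AR(1) (assumed for illustration).
rng = np.random.default_rng(2)
T = 500
e = rng.standard_normal(T)
dy = np.zeros(T)
for t in range(1, T):
    dy[t] = 0.6 * dy[t - 1] + e[t]
y = np.cumsum(dy)

q0 = ljung_box(adf_regression(y, 0)[1])   # Q-statistic without augmentation
k_star, t_star = choose_lag(y)
print("Q without augmentation:", q0)
print("chosen lag length:", k_star, " ADF t-ratio:", t_star)
```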

11.0.3 The Phillips-Perron test

The ADF-test tries to solve the problem of non-white noise residuals by adding lags of the dependent variable. It should be stressed that the ADF-test is quite adequate as a data descriptive device under the maintained hypothesis that the variables in a sample are integrated of order one. There are, however, a number of tests which try to improve on some of the weaknesses of the ADF-test. Phillips and Perron (1988) suggest a non-parametric correction of the test statistic so that the Dickey-Fuller distribution can be used even in cases when the residual in the DF-test is not white noise. (The KPSS-test below is a more recent modification of the same principle.) The method starts from the estimated t-value ($t_{\hat\pi}$) and the estimated residuals ($\hat\varepsilon_t$) from the DF equation. The test statistic ($t^*$), the t-value, is modified with the following formula
$$t^* = \frac{S_{\varepsilon}}{S_{Tl}} \, t_{\hat\pi} - \frac{T \, [S_{Tl}^2 - S_{\varepsilon}^2] \, [\mathrm{std.er}(\hat\pi)/s]}{2 S_{Tl}},$$
where $s^2$ is the residual variance from the DF regression,
$$S_{\varepsilon}^2 = T^{-1} \sum_{t=1}^{T} \hat\varepsilon_t^2,$$
$$S_{Tl}^2 = T^{-1} \sum_{t=1}^{T} \hat\varepsilon_t^2 + 2 T^{-1} \sum_{j=1}^{l} \left[ 1 - \frac{j}{l+1} \right] \sum_{t=j+1}^{T} \hat\varepsilon_t \hat\varepsilon_{t-j}.$$
The last term is a non-parametric estimate of the residual variance, using Bartlett's triangular window. The critical factor is to determine the size of the lag window $l$.
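The Bartlett-window variance and the corrected t-value can be sketched as follows. This is a minimal illustration under the assumption that the correction takes the standard Phillips-Perron form; the function names and the tiny example series are hypothetical.

```python
import numpy as np

def bartlett_variance(resid, l):
    """Residual variance plus Bartlett-weighted autocovariances up to lag l."""
    e = np.asarray(resid, dtype=float)
    T = len(e)
    s2 = e @ e / T
    for j in range(1, l + 1):
        weight = 1.0 - j / (l + 1.0)          # Bartlett's triangular window
        s2 += 2.0 * weight * (e[j:] @ e[:-j]) / T
    return s2

def pp_adjusted_t(t_value, std_err, s, T, s2_eps, s2_tl):
    """Non-parametric correction of a DF t-value.
    s is the residual standard error of the DF regression."""
    s_tl = np.sqrt(s2_tl)
    scale = np.sqrt(s2_eps) / s_tl
    shift = T * (s2_tl - s2_eps) * (std_err / s) / (2.0 * s_tl)
    return scale * t_value - shift

# Tiny example with strongly negative first-order autocorrelation.
example = np.array([1.0, -1.0, 1.0, -1.0])
s2_short = bartlett_variance(example, 0)      # plain residual variance
s2_long = bartlett_variance(example, 1)       # adds one weighted autocovariance
print(s2_short, s2_long)
```

When the residuals are white noise the two variance estimates coincide in large samples and the corrected statistic reduces to the ordinary DF t-value.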


11.0.4 The LMSP-test

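Start with the following DGP,
$$y_t = \alpha + \beta t + x_t \quad \text{and} \quad x_t = \rho x_{t-1} + \varepsilon_t,$$
where $\varepsilon_t \sim NID(0, \sigma^2)$. Under a unit root $H_0: \rho = 1$. To test, run the following regression,
$$\Delta y_t = \alpha + \phi \hat S_{t-1} + u_t,$$
where $\hat S_t = \sum_{j=2}^{t} [\Delta y_j - (y_T - y_1)/(T - 1)]$. Schmidt and Phillips (1992) simulated the t-statistic for $\hat\phi$.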

11.0.5 The KPSS-test

This test is calculated by RATS 4. The DGP is assumed to be
$$y_t = \xi t + r_t + \varepsilon_t,$$
where $r_t = r_{t-1} + v_t$, $\varepsilon_t \sim NID(0, \sigma^2)$ and $v_t \sim NID(0, \sigma_v^2)$. The null hypothesis is that $y_t$ is stationary. The test is $H_0: \sigma_v^2 = 0$, against $H_1: \sigma_v^2 > 0$. Start by estimating the following equation,
$$y_t = \alpha + \beta t + e_t,$$
and use the estimated residuals to construct the following LM test statistic,
$$\eta = T^{-2} \sum_{t=1}^{T} S_t^2 / s^2(k),$$
where
$$S_t = \sum_{i=1}^{t} \hat e_i \quad \text{and} \quad s^2(k) = T^{-1} \sum_{t=1}^{T} \hat e_t^2 + 2 T^{-1} \sum_{s=1}^{k} w(s, k) \sum_{t=s+1}^{T} \hat e_t \hat e_{t-s}.$$
The critical values for the test are given in Kwiatkowski et al. (1992). A Bartlett type window, $w(s, k) = 1 - [s/(k + 1)]$, is used to correct the estimate so that the (sample) test statistic corresponds to the simulated distribution, which is based on white noise residuals. The KPSS test appears to be powerful against the alternative of a fractionally integrated series. That is, a rejection of I(0) does not lead to I(1), as in most unit root tests, but rather to an I(d) process where 0 < d < 1. These types of series are called fractionally integrated. A high value of d implies a long memory process. In contrast to an integrated series I(1), or I(2) etc., a fractionally integrated series is mean reverting. See Baillie and Bollerslev (1994).
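The LM statistic above can be sketched directly from its definition. The function name `kpss_stat`, the window length k = 8 and the simulated series are assumptions made for illustration; the statistic must be compared with the tabulated KPSS critical values (for the trend case the commonly quoted 5 per cent value is 0.146).

```python
import numpy as np

def kpss_stat(y, k, trend=True):
    """KPSS statistic: T^-2 * sum(S_t^2) / s^2(k), with a Bartlett window."""
    y = np.asarray(y, dtype=float)
    T = len(y)
    cols = [np.ones(T)]
    if trend:
        cols.append(np.arange(1, T + 1, dtype=float))
    X = np.column_stack(cols)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ beta                      # residuals from the trend regression
    S = np.cumsum(e)                      # partial sums S_t
    s2 = e @ e / T
    for s in range(1, k + 1):
        w = 1.0 - s / (k + 1.0)           # w(s, k) = 1 - s/(k+1)
        s2 += 2.0 * w * (e[s:] @ e[:-s]) / T
    return (S @ S) / (T ** 2 * s2)

# Simulated series (assumed for illustration).
rng = np.random.default_rng(5)
T = 500
trend_stationary = 1.0 + 0.05 * np.arange(1, T + 1) + rng.standard_normal(T)
random_walk = np.cumsum(rng.standard_normal(T))

eta_ts = kpss_stat(trend_stationary, k=8)
eta_rw = kpss_stat(random_walk, k=8)
print("KPSS, trend-stationary series:", eta_ts)
print("KPSS, random walk:", eta_rw)
```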

11.0.6 The G(p; q) test.

This test builds on the conclusion that for a unit root variable, the estimated residuals are inappropriate and will indicate that unrelated variables are statistically significant (spurious regression). Therefore estimate,
1: $y_t = \alpha + \beta t + \varepsilon_{1t}$,
2: $y_t = \alpha + \beta t + \gamma t^2 + \varepsilon_{2t}$,
where $t^2$ is a superfluous variable. Calculate the following test statistic,
$$G(1, 2) = (RSS_1 - RSS_2)/s^2(k),$$
where $RSS_1$ and $RSS_2$ are the residual sums of squares from model 1 and 2 respectively, and $s^2(k)$ is as above.
We can conclude that among these tests, the ADF test is robust as long as the lag structure is correctly specified. The gains from correcting the estimated residual variance seem to be small.

11.1 The Alternative Hypothesis in I(1) Tests

Rejecting one unit root does not necessarily mean that one can accept the alternative of an I(0) series. Sometimes unit root tests will reject the assumption of a unit root even though the series is clearly non-stationary. There are several alternatives when rejecting the I(1) hypothesis,
The series is actually I(0).
The series is driven by a deterministic rather than a stochastic trend.
The series contains more than one unit root.[3]
The series is driven by segmented trends, meaning that there are different deterministic trends for different sub-periods.
The series contains fractionally integrated trends. It has an ARFIMA representation (AutoRegressive Fractionally Integrated Moving Average).
The series is non-stationary, but driven by some (to us) unknown trend process.
Tests for deterministic trends and more than one unit root are straightforward from the section above and not discussed here.
The segmented trend approach was launched by Perron (1989). He argues that few series really are I(1). If we have detailed knowledge about the data generating process, we might establish that series have different deterministic trends for different time periods. The fact that these segmented trends shift over time implies that unit root tests cannot reject the hypothesis of an integrated variable. Thus, instead of detecting the correct deterministic trend(s), the test approximates the changing deterministic trend with a stochastic trend. Perron (1989) demonstrates this fact and derives a test for a known break date in the series. Banerjee et al. (1992) develop a test for an unknown break date. The problem with this approach is that we somehow have to estimate these segmented trends. Sometimes it will be possible to argue for segmented trends, like World War One and Two, etc., but in principle we are left more or less with ad hoc estimates of what might be segmented trends.
[3] Testing for integration should be done according to the Pantula Principle: since higher order integration dominates lower order integration, test from higher to lower order, and stop when it is not possible to reject the null. For instance, a test for I(1) vs. I(0) assumes that there are no I(2) processes. The presence of higher order integration might ruin the test for lower order integration; therefore start with I(2), and only if I(2) is rejected will it be meaningful to test for I(1), etc.



11.2 Fractional Integration

For the class of integrated series discussed above the order of the difference operator was assumed to be d = 1. The choice between d = 0 and d = 1 might be too restrictive in some situations, especially if unit root tests reject I(0) in favour of the I(1) hypothesis when we have theoretical information that suggests that I(1) is implausible, or highly unrealistic. For example, unit root tests might find that both the forward and the spot foreign exchange rates are I(1), and that the forward premium $(f - s)$, the log difference, is also I(1), indicating no mean reversion in this difference series, and that the forward and the spot rates are not co-integrating. The expectations part of the forward rate would therefore be extremely small or irrational in some sense, so the risk premiums are causing the I(1) behavior.
Autoregressive Fractional Difference Moving Average models represent a more general class of models than ARMA and ARIMA models, see Granger and Joyeux (1980) and Granger (1980). The ARFIMA(p, d, q) model is defined as
$$\phi(L)(1 - L)^d y_t = \mu + \theta(L)\varepsilon_t,$$
where d is the fractional differencing parameter. The difference operator $(1 - L)^d$ is defined in terms of its Maclaurin series expansion. The difference operator works in the same way as for ARIMA models: applying the operator to $y_t$ results in $(1 - L)^d y_t = z_t$, where $z_t$ has an ARMA representation. The FI operator transforms the original series into a series which has an ARMA representation. Once the long-run memory is removed, the standard techniques for identifying the ARMA process can be applied.
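The series expansion of $(1 - L)^d$ can be made concrete by computing its coefficients recursively, $w_0 = 1$ and $w_k = w_{k-1}(k - 1 - d)/k$. The sketch below is illustrative only; the function names are hypothetical and the filter is simply truncated at the start of the sample.

```python
import numpy as np

def frac_diff_weights(d, n):
    """First n coefficients of the expansion of (1 - L)^d."""
    w = np.empty(n)
    w[0] = 1.0
    for k in range(1, n):
        w[k] = w[k - 1] * (k - 1 - d) / k
    return w

def frac_diff(y, d):
    """Apply (1 - L)^d to y, truncating the expansion at the first observation."""
    y = np.asarray(y, dtype=float)
    T = len(y)
    w = frac_diff_weights(d, T)
    # Element t is w_0*y_t + w_1*y_{t-1} + ... + w_t*y_0
    return np.array([w[:t + 1] @ y[t::-1] for t in range(T)])

w_half = frac_diff_weights(0.5, 4)   # weights for d = 0.5
w_one = frac_diff_weights(1.0, 4)    # d = 1 gives the ordinary first difference
print(w_half)
print(w_one)
```

For d = 1 the weights collapse to 1 and -1 followed by zeros, so the filter reproduces the usual first difference, while for fractional d the weights decay slowly, which is the long memory property discussed in the text.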
The difference between ARIMA and ARFIMA models is that the latter allows for a more complex memory process. The Wold theorem says that any non-deterministic series has an infinite MA representation like
$$y_t = \sum_{i=0}^{\infty} \psi_i \varepsilon_{t-i},$$
where $\varepsilon_t \sim iid(0, \sigma^2)$, and $\sum_{i=0}^{\infty} \psi_i^2 < \infty$. If this series also belongs to the class of series which has an ARMA representation, the autocorrelation function will die out exponentially. For an I(1) series the autocorrelation function will display complete persistence: the theoretical autocorrelation function is unity for all lags.
Because the autocorrelation function of an ARMA process dies out exponentially, it can be said to have a relatively short memory compared to series which have autocorrelation functions which do not die out as quickly. ARFIMA series therefore represent long memory time series. The ARFIMA model allows the autocorrelation coefficients to exhibit hyperbolic patterns. For d < 1 the series is mean reverting, and for -0.5 < d < 0.5 the ARFIMA series is covariance stationary.
For a statistician who is describing the behavior of a time series an ARFIMA model might offer a better representation than the more traditional ARMA model, see Diebold and Rudebusch (1989) and Sowell (1992). For an econometrician, however, the economic understanding is of equal importance. The standard question in most economic work is whether to use levels or percentage growth rates of the data, to construct models with known distributions. That means deciding whether series are I(0) or I(1). Fractional integration does not affect these problems. It becomes important when we ask specific questions about the type of long-run memory we are dealing with, such as whether there is mean reversion in the forward premium, the real exchange rate, asset prices etc. Thus only when economic theory gives us a reason for testing something other than I(0) and I(1) is fractional integration of interest. For applications of long-memory tests in general see Lo (1991) and Cheung and Lai (1995).






Most macroeconomic and finance variables are non-stationary. This has enormous consequences for the use of statistical methods in economics research. Statistical theory assumes that variables are stationary. If they are not stationary, statistical inference is generally not possible. It doesn't matter that numerous old textbooks in econometrics and research papers have ignored the problem. The problems associated with non-stationary variables in econometrics have been known since the 1920s, but didn't get a solution until the end of the 1980s. In principle there are two ways of dealing with non-stationarity: you must either remove the non-stationarity before setting up the econometric model, or set up a model of non-stationary variables that forms a stationary relation. Typically, in neither of these cases can you use standard inference based on t-, chi-square or F-distributions.
Now, variables can be non-stationary in an infinite number of ways. In practice, there are broadly two types of non-stationary variables of interest in econometrics. The first type are variables stationary around a deterministic trend. The second type are variables stationary around a stochastic trend. Stochastic trend variables are also known as integrated variables. Most variables in economics and finance seem to be driven by stochastic trends.
The problem with stochastic trend variables (integrated variables) is that not only do they not follow standard distributions; if you try to use standard distributions you will most likely be fooled into thinking there are significant relations when in fact there is no relation. This is known as the spurious regression problem in the literature.
Historically, trends were dealt with by removing what people assumed was a linear deterministic trend. This was done in the following way. The non-stationary variable was regressed against a constant and a linear trend variable,
$$y_t = \alpha + \beta t + \tilde y_t,$$
where $t$ was a deterministic time trend, defined as $t = 1, 2, ..., T$. The residual $\tilde y_t$ in this regression represents the de-trended $y_t$ series, which was then used in regression models with other stationary or detrended variables. In the equation, $\alpha$ becomes a combination of the sample mean of $y_t$ and the average of the time variable. In general, the deterministic trend removal can be done with models including polynomial deterministic trends, such as
$$y_t = \alpha + \beta_1 t + \beta_2 t^2 + ... + \beta_k t^k + \tilde y_t.$$
This approach of fitting deterministic trends can be extended into cyclical trends, using trigonometric functions in combination with the time trend. In the literature there are various deterministic filters that aim at removing long-run (supposedly deterministic) trends, such as the so-called Hodrick-Prescott filter. However, if the series is driven by a stochastic trend the estimated variables of these models will not follow standard distributions and the regression will impose a spurious autocorrelation pattern on the spuriously detrended variable $\tilde y_t$. Thus, until you have investigated the non-stationary properties of the series and tested for stochastic trends (order of integration) it is not possible to do any econometric analysis.

Deterministic trends are seldom the best choice for economic time series. Instead the non-stationary behaviour is often better described with stochastic trends, which have no fixed trend that can be predicted from period to period. A random walk serves as the simplest example of a stochastic trend. Starting from the model
$$y_t = y_{t-1} + v_t \quad \text{where} \quad v_t \sim NID(0, \sigma^2),$$
repeated substitution backwards leads to
$$y_t = y_0 + \sum_{i=1}^{t} v_i.$$
The expression shows how the random walk variable is made up of the sum of all historical white noise shocks to the series. The sum represents the stochastic trend. The variable is non-stationary, but we cannot predict how it changes, at least not by looking at the history of the series. (See also the discussion above concerning random walks under the section about different stochastic processes.) The stochastic trend term is removed by taking the first difference of the series. In the random walk case it implies that $\Delta y_t = v_t$ is a stationary variable with constant mean and variance. Variables driven by stochastic trends are also called integrated variables because the sum process represents the integrated property of these variables.
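The equivalence between the recursive definition and the sum of shocks is easy to verify numerically. The sketch below uses an assumed starting value and simulated shocks purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
T = 200
y0 = 5.0                                   # assumed starting value
v = rng.standard_normal(T)                 # white noise shocks v_1, ..., v_T

# Recursive definition: y_t = y_{t-1} + v_t
y_rec = np.empty(T)
prev = y0
for t in range(T):
    prev = prev + v[t]
    y_rec[t] = prev

# Closed form after repeated substitution: y_t = y_0 + sum_{i=1}^{t} v_i
y_sum = y0 + np.cumsum(v)

# First differencing removes the stochastic trend and returns the shocks
dy = np.diff(np.concatenate(([y0], y_sum)))

print(np.allclose(y_rec, y_sum), np.allclose(dy, v))
```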
A generic representation is the combination of deterministic and stochastic trends,
$$y_t = \alpha + \beta t + \tau_t + \tilde y_t,$$
where $\tau_t = \tau_{t-1} + v_t$, $v_t$ is $NID(0, \sigma^2)$, $t$ is the deterministic trend and $\tilde y_t$ is a stationary process representing the stationary part of $y_t$. In this model, the stochastic trend is represented by $\sum_{i=1}^{t} v_i$.
An alternative trend representation is segmented deterministic trends, illustrated by the model
$$y_t = \beta_1 t_1 + \beta_2 t_2 + ... + \beta_k t_k + \tilde y_t,$$
where $t_1$, $t_2$, etc. are deterministic trends for different periods, such as wars, or policy regimes such as exchange rates, monetary policy etc. Segmented trends are an alternative to stochastic trends, see Perron (1989), but the problem is that the identification of these different trends might be ad hoc. Given a suitable choice of trends almost any empirical series can be made stationary, but are the different trends really picking up anything interesting that is not embraced by the assumption of stochastic trends, arising from innovations with permanent effects on the economy?


The Spurious Regression Problem

Most macroeconomic time series display non-stationarity and appear to be driven by stochastic trends. Regression with these variables leads to the danger of non-standard distributed parameter estimates, which make inference much more difficult.
The spurious regression problem was introduced in an article by Granger and Newbold in 1973. Granger and Newbold generated two random walk series, which were independent of each other by construction. Let the two variables be $x_t$ and $y_t$, with first differences $\Delta y_t \sim NID(0, \sigma_y^2)$ and $\Delta x_t \sim NID(0, \sigma_x^2)$; by construction let $y_t$ and $x_t$ be independent. Next, consider the linear regression of $y_t$ on $x_t$,
$$y_t = \alpha + \beta x_t + \varepsilon_t.$$

Since $y_t$ and $x_t$ are independent there is no relation between them, so $\beta$ must be zero, and we would expect the t-statistic of $\hat\beta$ to be centred on zero as the sample size increases, so that $t_{\hat\beta} \sim N(0, 1)$. If we repeat the regression with new independent random walks we expect that in 5 per cent of the tests we would be unlucky and erroneously assume that there is significance even though the true value of $\beta$ is zero. However, this is not what happens.
Granger and Newbold studied the empirical distribution of the regression above. They ran 1000 regressions and found that the distribution of the t-statistic of $\hat\beta$ was the opposite of what we expect. In 95 % of the regressions we find a significant relation even though the true value should be 5 %. Asymptotically the t-value of $\hat\beta$ approached 2.0. The problem got worse when more independent random walks were put into the equation. Granger and Newbold also found that the reported $R^2$ values became relatively high while the Durbin-Watson value became low.
Later in the 1980s, researchers such as Peter Phillips showed that due to the integrated properties of the variables, their sample moments converge to functions of Wiener processes (Brownian motions). The sample moments will not converge to constants, as in the case of stationary stochastic regressors. Instead, the sample moments converge to random variables which are functions of Wiener processes. In this situation, with two (or more) random walk variables regressed against each other, the t-statistics will approach 2.0 instead of 0.0. Thus, by using the t-distribution to test the null of no correlation between the variables, one will be fooled into rejecting the assumption of no correlation. This is the spurious regression problem. It is caused by parameter estimates which are not distributed according to the normal distribution, not even in the long run.
In practical work, that is when using limited samples, this will occur not only when regressing random walk variables, but also when regressing integrated variables or near-integrated variables.
Near-integrated variables are a classification of variables which, in a limited sample, look and behave like integrated variables. An autoregressive process with an autoregressive parameter close to unity (say 0.9) can be called near integrated. In these situations, the distribution theory of integrated variables is a much better approximation than the standard normal.
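The spurious regression experiment is straightforward to replicate in a small simulation. The following sketch is not Granger and Newbold's original design: the sample size, the number of replications, the seed and the function name `slope_tstat` are assumptions chosen for illustration. It regresses independent random walks on each other and counts how often the conventional 5 per cent t-test rejects, then repeats the exercise on the first differences.

```python
import numpy as np

def slope_tstat(y, x):
    """Conventional OLS t-ratio on the slope in y = a + b*x + e."""
    X = np.column_stack([np.ones(len(x)), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    s2 = resid @ resid / (len(y) - 2)
    se = np.sqrt(s2 * np.linalg.inv(X.T @ X)[1, 1])
    return beta[1] / se

rng = np.random.default_rng(42)
T, reps = 100, 500
rej_levels = 0
rej_diffs = 0
for _ in range(reps):
    dy = rng.standard_normal(T)
    dx = rng.standard_normal(T)          # independent of dy by construction
    y, x = np.cumsum(dy), np.cumsum(dx)  # two independent random walks
    rej_levels += int(abs(slope_tstat(y, x)) > 1.96)
    rej_diffs += int(abs(slope_tstat(dy, dx)) > 1.96)

share_levels = rej_levels / reps
share_diffs = rej_diffs / reps
print("rejection share, levels:", share_levels)
print("rejection share, first differences:", share_diffs)
```

The rejection share in levels should be far above the nominal 5 per cent, while the regression in first differences behaves as standard theory predicts.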


Integrated Variables and Co-integration

Normally, a linear combination of integrated variables will also be integrated of the same order as the individual variables. The exception from this rule is called co-integration, when a linear combination of integrated variables results in a lower order of integration.
So, in the linear regression above, since both $y_t$ and $x_t$ are integrated of order one, I(1), and independent, the residual term $\varepsilon_t$ will be integrated of order one, I(1), as well.
In the case when the two I(1) variables share the same stochastic trend and form an I(0) residual we say that they are co-integrating.


The intuition here is that for the two variables to form a meaningful long-run relationship, they must share the same trend. Otherwise they will be drifting away from each other as time elapses. Therefore, to build econometric models which make sense in the long run, we have to investigate the trend properties of the variables and determine the type of trend and whether variables are co-trending and co-integrating or not. In econometric work, trend properties refer to the properties of the sample and how to do inference. It is not a theoretical concept about how economic variables grow in the long run.
Once we have clarified the trend properties, it becomes possible to establish stationary relations and models, and econometric modeling can proceed as usual, and standard techniques for inference can be used.
Definition 1 A series with no deterministic component and which has a stationary and invertible autoregressive moving average (ARMA) representation after differencing d times, but which is not stationary after differencing (d - 1) times, is said to be integrated of order d, denoted $x_t \sim I(d)$.
Definition 2 The components of the vector $x_t$ are said to be co-integrated of order d, b, denoted $x_t \sim CI(d, b)$, if (i) $x_t$ is I(d) and (ii) there exists a non-zero vector $\alpha$ such that $\alpha' x_t \sim I(d - b)$, $d \ge b > 0$. The vector $\alpha$ is called the co-integrating vector. (Adapted from Engle and Granger (1987).)
Remark 1 If $x_t$ has more than two elements there can be more than one co-integrating vector $\alpha$.
Remark 2 The order of integration of the vector $x_t$ is determined by the element which has the highest order of integration. Thus, $x_t$ can in principle have variables integrated of different orders.
A related definition concerns the error correction representation following from co-integration.
Definition 3 A vector time series $x_t$ has an error-correction representation if it can be expressed as $A(L)(1 - L)x_t = -\gamma z_{t-1} + \omega_t$, where $\omega_t$ is a stationary multivariate disturbance term, with $A(0) = I$, $A(1)$ having only finite elements, $z_t = \alpha' x_t$, and $\gamma$ a non-zero vector. For the case where d = b = 1, and with co-integrating rank r, the Granger Representation Theorem holds. (Adapted from Banerjee et al. (1993).)
Remark 3 This definition and the Granger Representation Theorem (Engle and Granger, 1987) tell us that if there is co-integration then there is also an error correction representation, and there must be Granger causality in at least one direction.

Approaches to Testing for Co-integration

Under the general null hypothesis of independent and integrated variables, estimated variances and test statistics do not follow standard distributions. Therefore the way ahead is to test for co-integration, and then try to formulate a regression model (or system) in terms of stationary variables only. Traditionally there are two approaches to testing for co-integration: residual based approaches and other approaches. The first type starts with the formulation of a co-integration regression, a regression model with integrated variables. Co-integration is then determined by investigating the residual(s) from that regression. The Engle and Granger two-step procedure and the Phillips-Ouliaris test are examples of this approach. The other approach is to start from some representation of a co-integrated system (VAR, VECMA, etc.) and test for some specific characterization of co-integrated systems. Johansen's VECM approach, or tests for common trends, are examples of this approach.
The Engle and Granger two-step procedure is the easiest and most used residual based test. It is used because of its simplicity, but it is not a good test. The two-step procedure starts with the estimation of the co-integrating regression. If $y_t$ and $x_t$ are two variables integrated of order one, the first step is to estimate the following OLS regression,
$$y_t = \alpha + \beta x_t + z_t,$$
where the estimated residuals are $\hat z_t$. If the variables are co-integrating, $\hat z_t$ will be I(0). The second step is to perform an Augmented Dickey-Fuller unit root test of the estimated residual,
$$\Delta \hat z_t = \alpha_0 + \pi \hat z_{t-1} + \sum_{i=1}^{k} \gamma_i \Delta \hat z_{t-i} + \varepsilon_t.$$

If $y_t$ and $x_t$ share a common trend and co-integrate, the residual must be a stationary process. If they don't share a common trend, they do not co-integrate, the parameter $\pi$ must be zero and the residual $z_t$ must be non-stationary and integrated of the same order as $y_t$. If the null, $H_0: \pi = 0$, is rejected for the alternative $H_A: \pi < 0$, we conclude that the variables are co-integrated, and that the long-run co-integrating parameter is $\beta$. Furthermore, we can refer to the OLS regression as the co-integrating regression. We know that the residual is stationary, $\hat z_t$ is I(0), and therefore $\hat z_{t-1}$ can be used as an error correction term, identifying the long-run steady state relation between $y_t$ and $x_t$.
The relevant critical values are not the ones tabulated by Fuller (1976). Instead you have to use the simulated tables in Engle and Granger (1987), Engle and Yoo (1987), or Banerjee et al. (1993). The reason is that the unit root test is now performed, not on a univariate process, but on a variable constructed from several stochastic processes. The critical values change depending on how many explanatory variables there are in the model.
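The two steps described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not a replacement for a tested econometrics package: the helper names `ols` and `engle_granger`, the simulated data and the choice of one augmentation lag are all hypothetical. The function returns the t-ratio on $\hat{z}_{t-1}$, which still has to be compared with the simulated co-integration critical values cited in the text, not with the ordinary Dickey-Fuller tables.

```python
import numpy as np

def ols(y, X):
    """OLS coefficients, residuals and conventional standard errors."""
    coef = np.linalg.solve(X.T @ X, X.T @ y)
    resid = y - X @ coef
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))
    return coef, resid, se

def engle_granger(y, x, lags=1):
    """Return the step-one coefficients and the ADF t-ratio on z_{t-1}."""
    T = len(y)
    # Step 1: co-integrating regression y_t = alpha + beta x_t + z_t
    coef, z_hat, _ = ols(y, np.column_stack([np.ones(T), x]))
    # Step 2: dz_t = mu + pi z_{t-1} + sum_i gamma_i dz_{t-i} + e_t
    dz = np.diff(z_hat)
    n = len(dz) - lags
    cols = [np.ones(n), z_hat[lags:-1]]
    cols += [dz[lags - i:len(dz) - i] for i in range(1, lags + 1)]
    c2, _, se2 = ols(dz[lags:], np.column_stack(cols))
    return coef, c2[1] / se2[1]

# Simulated example (hypothetical data): x_t is a random walk, y_t = 1 + 2 x_t + noise
rng = np.random.default_rng(0)
x = np.cumsum(rng.normal(size=500))
y = 1.0 + 2.0 * x + rng.normal(size=500)
coef, t_ratio = engle_granger(y, x, lags=1)
```

In this simulated case the residual is white noise by construction, so the t-ratio is strongly negative and the slope estimate is close to 2.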
Remark 4 Remember that the t-statistics, and the estimated standard deviations, from the co-integrating regression must be treated with caution, even if we find co-integration. Unless $x_t$ is exogenous, the estimated parameters follow unknown non-normal distributions, even asymptotically.
Remark 5 For the outcome of the test, it will not matter which variable is chosen
to be the dependent variable. As an economist you might favour setting one variable
as dependent and understand the parameters as long-run economic parameters
(elasticities etc.)
There are a number of problems with the Engle and Granger two-step procedure. The first is that the tabulated (non-standard) test statistic assumes white noise residuals. The augmentation tries to deal with this, but in most cases it is only a crude approximation.
Second, the test assumes a common factor in the dynamic processes of $y_t$ and $x_t$. In practice this assumption is quite restrictive and the test will not behave well when it does not hold. The dynamics of the two processes and their possible co-integrating relation are usually more complex.
Third, the test assumes that there is only one co-integrating vector. If we test for co-integration between two variables this is not a problem, because then there can be only one co-integrating vector. Suppose that we add another I(1) variable ($u_t$) to the co-integrating regression equation,
$$y_t = \alpha + \beta_1 x_t + \beta_2 u_t + \nu_t.$$
If $y_t$ and $x_t$ are co-integrated, they already form one linear combination ($z_t$) which is stationary. If $u_t \sim I(1)$ is not co-integrated with the other variables, OLS will set $\beta_2$ to zero, and the estimated residual $\hat{\nu}_t$ is I(0). This is why the test will only work if there is only one co-integrating vector among the variables. If $y_t$ and $x_t$ are not co-integrated, then adding $u_t \sim I(1)$ might lead to a co-integrating relation. Thus, in this respect the test is limited, and testing must be done by creating logical chains of bi-variate co-integration hypotheses.
Other residual based tests try to solve at least the first problem by adjusting the test statistic in the second step, so that it always fulfils the criteria for testing the null correctly. Some approaches try to transform the co-integrating regression in such a way that the estimated parameters follow a standard normal distribution.
A better alternative for testing for co-integration among more than two variables is offered by Johansen's test. This test finds long-run steady-state, or co-integrating, relations in the VAR representation of a system. Let the VAR,
$$A_k(L)x_t = \Psi D_t + \varepsilon_t,$$
represent the system. The VAR is a p-dimensional system, the variables are assumed to be integrated of order d, $\{x_t\} \sim I(d)$, $D_t$ is a vector of deterministic variables (constants, dummies, seasonals and possible trends), and $\Psi$ is the associated coefficient matrix. The residual process is normally distributed white noise, $\varepsilon_t \sim NID(0, \Sigma)$. It is important to find the optimal lag length in the VAR, and to have normally distributed error terms in addition to white noise, because the test uses a full information maximum likelihood (FIML) estimator. FIML estimators are notoriously sensitive to small samples and misspecification, which is why care must be taken in the formulation of the VAR. Once the VAR has been found, it can be rewritten in error correction form,
$$\Delta x_t = \sum_{i=1}^{k-1}\Gamma_i\Delta x_{t-i} + \Pi x_{t-1} + \Psi D_t + \varepsilon_t.$$
In practical use the problem is to formulate the VAR; the program will rewrite the VAR for the user automatically. Johansen's test builds on the knowledge that if $x_t$ is I(d), co-integration implies that there exist vectors $\beta$ such that $\beta'x_t \sim I(d-b)$. In a practical situation we will assume that $x_t \sim I(1)$, so that if there is co-integration, $\beta'x_t \sim I(0)$. If there is co-integration, the matrix $\Pi$ must have reduced rank. The rank of $\Pi$ indicates the number of linearly independent rows in the matrix. Thus, if $x_t$ is a p-dimensional process, the rank (r) of the $\Pi$ matrix determines the number of co-integrating vectors ($\beta$), or the number of linear steady state relations among the variables in $\{x_t\}$. Zero rank (r = 0) implies no co-integrating vectors, full rank (r = p) means that all variables are stationary, while a reduced rank (0 < r < p) means co-integration and the existence of r co-integrating vectors among the variables.
The procedure is to estimate the eigenvalues of $\Pi$ and determine their significance.1 However, under the null of no co-integration, these estimates have non-standard distributions which depend on whether there is a deterministic trend and/or a constant term in the model. The test statistic is only known asymptotically and for a closed system without exogenous variables. In other situations the decision must be based on viewing the test statistics as approximations.
1 The test is called the Trace test and its use is explained in Sjöö, Guide to testing for ...
Once the rank of $\Pi$ is known, the matrix can be rewritten as $\Pi = \alpha\beta'$ such that $\beta'x_t$ forms stationary co-integrating relations. The $\beta$ are the co-integrating parameters, and the $\alpha$ represent the adjustment parameters. The significance of the alphas can be determined by ordinary t-tests since they are associated with stationary relations, $\beta'x_t \sim I(0)$.
Finding the VECM
In practical use the problem is to formulate the VAR; the program will rewrite the VAR for the user and present the estimated $\alpha$ and $\beta$ vectors. Sometimes it is necessary to understand how the VECM is found. Consider the 2-dimensional VAR model, where the deterministic terms have been removed for simplification,
$$y_t = a_{11}y_{t-1} + a_{12}y_{t-2} + a_{13}z_{t-1} + a_{14}z_{t-2} + e_{1t},$$
$$z_t = a_{21}z_{t-1} + a_{22}z_{t-2} + a_{23}y_{t-1} + a_{24}y_{t-2} + e_{2t}.$$
Start with the first equation and subtract $y_{t-1}$ from both sides of the equal sign. This gives
$$\Delta y_t = (a_{11} - 1)y_{t-1} + a_{12}y_{t-2} + a_{13}z_{t-1} + a_{14}z_{t-2} + e_{1t}.$$

Since the equation was correctly specified from the beginning, it can be transformed as long as we do not do anything that affects the properties of the error term. Our aim is to split all lag terms into first differences and lagged variables in such a way that the model consists of one lag at t-1 for all variables, plus first differences. We can do this by using the difference operator, $\Delta = (1 - L)$, which can be used as $\Delta y_t = y_t - y_{t-1}$, or $y_{t-1} = y_t - \Delta y_t$. Referring to the operators we have $L = (1 - \Delta)$, or $Ly_t = (y_t - \Delta y_t)$. If we apply this to all lags beyond t-1, we get for t-2 the following, $y_{t-2} = y_{t-1} - \Delta y_{t-1}$, and $z_{t-2} = z_{t-1} - \Delta z_{t-1}$. Substitute this into the equation to get,
$$\Delta y_t = (a_{11} - 1)y_{t-1} + a_{12}(y_{t-1} - \Delta y_{t-1}) + a_{13}z_{t-1} + a_{14}(z_{t-1} - \Delta z_{t-1}) + e_{1t}.$$
Collecting terms gives,
$$\Delta y_t = (-1 + a_{11} + a_{12})y_{t-1} - a_{12}\Delta y_{t-1} + (a_{13} + a_{14})z_{t-1} - a_{14}\Delta z_{t-1} + e_{1t}.$$
Performing the same operations on the second equation gives,
$$\Delta z_t = (-1 + a_{21} + a_{22})z_{t-1} - a_{22}\Delta z_{t-1} + (a_{23} + a_{24})y_{t-1} - a_{24}\Delta y_{t-1} + e_{2t}.$$

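Write the system in matrix form,
$$\Delta x_t = \Gamma_1\Delta x_{t-1} + \Pi x_{t-1} + \varepsilon_t,$$
where
$$x_t = \begin{bmatrix} y_t \\ z_t \end{bmatrix}, \quad \Delta x_t = \begin{bmatrix} \Delta y_t \\ \Delta z_t \end{bmatrix}, \quad \Gamma_1 = \begin{bmatrix} -a_{12} & -a_{14} \\ -a_{24} & -a_{22} \end{bmatrix}, \quad \Pi = \begin{bmatrix} a_{11} + a_{12} - 1 & a_{13} + a_{14} \\ a_{23} + a_{24} & a_{21} + a_{22} - 1 \end{bmatrix}, \quad \varepsilon_t = \begin{bmatrix} e_{1t} \\ e_{2t} \end{bmatrix}.$$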
Since $x_t$ is integrated of order one, it follows that $\Delta x_t$ is integrated of order zero and therefore stationary. And, since $x_t$ is non-stationary, the variables in $x_t$ grow in two dimensions unless they share the same trend. In that case we would say that they are co-integrated and share one common trend. In the case of a p-dimensional system, the system can expand in p dimensions, or in fewer than p dimensions if variables share the same trend. Under these properties a single $y_{t-1}$ or $z_{t-1}$ cannot be correlated with $\Delta y_t$ or $\Delta z_t$.
The only situation in which the rows of $\Pi$ can be different from zero is when $(\beta_{11}y_{t-1} + \beta_{12}z_{t-1})$ forms a stationary process, i.e. there exist non-zero parameters $\beta_{11}$ and $\beta_{12}$ (or $\beta_{21}$ and $\beta_{22}$) such that, when multiplied with the x:s, a stationary relation is established. The test for this is to test for the rank of the $\Pi$ matrix, the number of independent non-zero rows in $\Pi$.
A rank of zero means no co-integration; a rank of 2 in this case means that the x:s are stationary, or stationary around deterministic trends if we allowed for constants in the equation. A reduced rank, which in this case is a rank equal to 1, implies co-integration. Co-integration implies that at least one $\alpha$ parameter will be significant, so there will be (long-run) Granger causality in at least one direction. At least one variable must follow the other for them to stay together in fixed formation in the long run.
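The rewriting of the two-variable VAR into error correction form can be checked numerically. The sketch below is only an illustration: the coefficient values are hypothetical and were picked so that the levels matrix has rank one, and the names `A1`, `A2`, `G1` and `Pi` are labels introduced here. It simulates the VAR(2) in levels, rebuilds the errors from the error correction form, and computes the rank of $\Pi$.

```python
import numpy as np

# Hypothetical coefficients a11..a14 (first equation) and a21..a24 (second equation)
a11, a12, a13, a14 = 0.6, 0.2, 0.3, -0.1
a21, a22, a23, a24 = 0.5, 0.1, 0.2, 0.2

# VAR(2) in levels: x_t = A1 x_{t-1} + A2 x_{t-2} + e_t with x_t = (y_t, z_t)'
A1 = np.array([[a11, a13], [a23, a21]])
A2 = np.array([[a12, a14], [a24, a22]])

# Error correction form: dx_t = G1 dx_{t-1} + Pi x_{t-1} + e_t
Pi = A1 + A2 - np.eye(2)
G1 = -A2

rng = np.random.default_rng(1)
T = 200
e = rng.normal(size=(T, 2))
x = np.zeros((T, 2))
for t in range(2, T):
    x[t] = A1 @ x[t - 1] + A2 @ x[t - 2] + e[t]

dx = np.diff(x, axis=0)                      # dx[t-1] holds x_t - x_{t-1}
# Errors implied by the error correction form for t = 2, ..., T-1
implied = dx[1:] - dx[:-1] @ G1.T - x[1:-1] @ Pi.T
rank_Pi = np.linalg.matrix_rank(Pi, tol=1e-10)
```

With these values $\Pi = \alpha\beta'$ with $\alpha = (-0.2, 0.4)'$ and $\beta = (1, -1)'$, so the rank is one and $y_t - z_t$ is the stationary relation. The implied errors coincide with the simulated ones, which confirms that the two forms describe the same process.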
Johansen's test is better than the two-step procedure in almost all aspects. The practical problems originate from choosing a correct combination of lags and dummy variables to make the residuals come out as white noise. In a limited sample this can be difficult, and the results might change among different specifications of the system, just as they do in the two-step procedure.
It is recommended to start with the two-step procedure, to learn about the data and get some preliminary results, instead of getting stuck with the Johansen test, having problems finding a specification that leads to economically interesting results.




This chapter looks at the common trends approach and some of the economics behind co-integration, for instance the question of creating positive or negative shocks in stabilization policy.
An important characteristic of integrated variables is that they become stationary after differencing. The definition of an integrated series is: a series, or a vector of series, $y_t$ with no deterministic component, which has a stationary, invertible ARMA representation after differencing d times, is said to be integrated of order d, denoted $y_t \sim I(d)$.
It is possible to have variables driven by both stochastic and deterministic trends. In the very long run a deterministic trend will always dominate over a stochastic trend. In a limited sample, however, it becomes an empirical question whether the deterministic trend is sufficiently strong to have an effect on the distributions of the estimates of the model.1
We know, from the Wold representation theorem, that if $y_t$ is I(0), and has no deterministic process, it can be written as an infinite moving average process. (If the series has a deterministic process this can be removed before solving for the MA process.)
$$y_t = C(L)\varepsilon_t,$$
where L is the lag operator, and $\varepsilon_t \sim iid(0, \sigma^2)$. Now, suppose that $y_t$ is I(1); then its first difference is stationary and has an infinite MA process,
$$\Delta y_t = C(L)\varepsilon_t.$$
Since $\Delta = (1 - L)$, we also have that
$$y_t = \Delta^{-1}C(L)\varepsilon_t = [1/(1 - L)]C(L)\varepsilon_t,$$
where $1/(1 - L)$ represents the sum of an infinite series. For a limited sample, we get approximately,
$$y_t = y_0 + (1 + L + L^2 + \dots + L^{t-1})C(L)\varepsilon_t,$$
where $y_0$ is the initial value of the process, seen as a deterministic component conditional on everything known at time zero. The long-run solution of this expression, setting L = 1, gives $tC(1)$, and
$$y_t = y_0 + tC(1)\varepsilon_t.$$


Unless $C(1) = 0$, this process will grow infinitely large as $t \to \infty$. Looking at the second difference of $y_t \sim I(1)$ leads to
$$\Delta^2 y_t = (1 - L)C(L)\varepsilon_t,$$
where $\Delta = (1 - L)$ is applied to both sides of the expression. This series has no long-run MA representation, irrespective of whether $C(1) \neq 0$, since setting L = 1 gives $(1 - L)C(L) = (1 - 1)C(1) = 0$.
1 See Nelson and Plosser (1982) for a discussion about the proper way to model the trend in
economic time series.



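Let us see what happens with the process in the future. From above we get the MA representation for some future period t + h,
$$y_{t+h} = y_0 + \sum_{i=1}^{t+h}\Big[\sum_{j=0}^{t+h-i}C_j\Big]\varepsilon_i = y_0 + \sum_{i=1}^{t}\Big[\sum_{j=0}^{t+h-i}C_j\Big]\varepsilon_i + \sum_{i=t+1}^{t+h}\Big[\sum_{j=0}^{t+h-i}C_j\Big]\varepsilon_i.$$
The forecasts are decomposed into what is known at time t, the first double sum, and what is going to happen between t and t + h, the second double sum. The latter is unknown at time t, therefore we have to form the conditional forecast of $y_{t+h}$ at time t,
$$y_{t+h|t} = y_0 + \sum_{i=1}^{t}\Big[\sum_{j=0}^{t+h-i}C_j\Big]\varepsilon_i.$$
The effect of a shock today (at time t) on future periods is found by taking the derivative of the above expression with respect to a change in $\varepsilon_t$,
$$\partial y_{t+h|t}/\partial\varepsilon_t = \sum_{j=0}^{h}C_j \to C(1) \quad \text{as } h \to \infty.$$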

Thus, the long-run effect of a shock today can be expressed by the static long-run solution of the MA representation of $\Delta y_t$ (equal to the sum of the MA coefficients).
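The statement that the long-run effect of a shock equals the sum of the MA coefficients can be illustrated with a small calculation. The sketch assumes a hypothetical first-difference process $\Delta y_t = \varepsilon_t + \theta\varepsilon_{t-1}$ with $\theta = -0.4$; the function name `cumulative_response` is introduced here for illustration only.

```python
import numpy as np

def cumulative_response(c, horizon):
    """Effect on the level y_{t+h} of a unit shock at time t: sum_{j=0}^{h} C_j."""
    c = np.asarray(c, dtype=float)
    padded = np.concatenate([c, np.zeros(max(0, horizon + 1 - len(c)))])
    return np.cumsum(padded)[: horizon + 1]

theta = -0.4                        # hypothetical MA(1) coefficient
resp = cumulative_response([1.0, theta], horizon=10)
long_run = resp[-1]                 # equals C(1) = 1 + theta
```

The first element of `resp` is the impact effect on the level, and from horizon one onwards the response stays at $C(1) = 1 + \theta$, so part of the initial shock is permanent.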
The persistence of a shock depends on the value of C(1). If C(1) happens to be 0, there is no long-run effect of today's shock. Otherwise we have three cases: C(1) lies between 0 and 1, C(1) = 1, or C(1) is greater than unity. If C(1) is greater than 0 but less than unity, the effect of the shock declines over time, and only the fraction C(1) of it remains in the long run. If C(1) = 1, the pure random walk case, a shock will be as important for all future periods as it is today. Finally, if C(1) is greater than one, the shock is magnified into the future.
If the series are truly I(1), spectral analysis can be applied to measure the persistence of a shock exactly. The persistence of shocks has interesting implications for economic policy. If shocks are very persistent, or explosive in some cases, it might be a good policy to try to avoid negative shocks, but create positive shocks. In our stabilization policy example, this can be understood as saying that the authorities should be careful with deflationary policies, for instance, since they might result in high and persistent social costs; see Mankiw and Shapiro (198x) for a discussion of these issues.
In the following, the MA representation of systems of integrated processes is analysed. For this purpose let $y_t$ be a vector of I(1) variables. Using the lag operator identity $L = 1 - (1 - L) = 1 - \Delta$,2 and Wold's decomposition theorem gives,
$$\Delta y_t = C(L)\varepsilon_t = C(1)\varepsilon_t + C^*(L)\Delta\varepsilon_t.$$
If $y_t$ is a vector of I(1) variables, then we know from above that if the matrix $C(1) \neq 0$, any shock to the series has infinite effects on future levels of $y_t$. Let us consider a linear combination of these variables, $\beta'y_t = z_t$. Multiplication of the expression with $\beta'$ gives,
$$\Delta z_t = \beta'C(1)\varepsilon_t + \beta'C^*(L)\Delta\varepsilon_t.$$

In general, it is the case that when $y_t$ is I(1), $z_t$ is I(1) as well. Thus a linear combination of integrated variables will also be integrated, implying that $\beta'C(1)$ is different from zero. Suppose, however, that there exists a matrix $\beta$ such that $\beta'C(1) = 0$, which implies that when $\beta'$ is multiplied with $y_t$ we get a stationary process, $\beta'y_t \sim I(0)$.
2 This is the same as $y_t = \Delta y_t + y_{t-1} = [(1 - L) + L]y_t$.
As an example consider private aggregated consumption and private aggregated (disposable) income. Both variables could be random walks, but what about the difference between these variables? Is it reasonable to assume that a linear combination of them could be driven by a stochastic trend, meaning that consumption would deviate permanently from income in the long run? The answer is no. In the long run it is not likely that a person consumes more than his/her income, nor is it likely that a person will save more and more. Thus, we have to think of situations when two variables co-integrate, that is, when a linear combination of I(1) series forms a new stationary series, integrated of order zero. (A more formal definition of co-integration is given in a following section.)
In terms of the C(1) matrix, common trends or co-integration implies that there exists a matrix $\beta$ such that $\beta'C(1) = 0$, hence we get,
$$z_t = \beta'C^*(L)\varepsilon_t,$$
where $z_t$ is integrated of order zero, when $y_t$ consists of variables integrated of order one. The mathematical condition for having a vector $\beta$ such that $\beta'C(1) = 0$ is that C(1) has reduced rank. There must be at least one row representing the long run that can be solved from the other long-run relations in C(1). If C(1) has reduced rank, there can be several $\beta'$-vectors that lead to $\beta'C(1) = 0$. We can express this as follows: any vector lying in the (left) null space of C(1) is a co-integrating vector, and the co-integrating rank is the dimension of this null space. Say that $y_t$ is a vector of n variables. If all variables are non-stationary, and integrated of order one, the whole system could expand in n different directions. If some or all variables share the same trend in the long run, the system would be expanding in fewer than n dimensions.
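The null space characterisation above can be made concrete with a singular value decomposition. The following is a sketch under assumed inputs: the 3x3 matrix `C1` is hypothetical and built to have rank one, and `cointegrating_vectors` is a name introduced here. It returns a basis for all $\beta$ with $\beta'C(1) = 0$.

```python
import numpy as np

def cointegrating_vectors(C1, tol=1e-10):
    """Basis for {beta : beta' C(1) = 0}, i.e. the left null space of C(1)."""
    u, s, _ = np.linalg.svd(C1)
    rank = int(np.sum(s > tol))
    return u[:, rank:]              # columns orthogonal to the column space of C(1)

A = np.array([[1.0], [0.8], [0.5]])        # hypothetical loadings on one common trend
J = np.array([[0.5, 0.3, 0.2]])            # hypothetical map from shocks to the trend
C1 = A @ J                                  # 3x3 long-run matrix with rank 1
beta = cointegrating_vectors(C1)
```

With three variables and one common trend the basis has two columns, so the system expands in one direction and there are two independent co-integrating relations.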
How should we understand the reduced rank of C(1) in economic terms? Think of consumption and income again. If both series are I(1), constantly growing in the long run, the difference between them should be stationary in the long run. In other words they must have a common (stochastic) trend, which they both follow in the long run. This common trend could be understood as given by technological growth, which leads to growth in income and thereby also to long-run growth in consumption. Another way of expressing the same thing is to say that the common trend represents the cumulation of past technology shocks.
Stock and Watson (1988) modelled the common trends representation of $y_t$ in the following way. Starting from $\Delta y_t = C(1)\varepsilon_t + C^*(L)\Delta\varepsilon_t$, the level of $y_t$ is determined by,
$$y_t = y_0 + (1 + L + L^2 + \dots + L^{t-1})[C(1) + (1 - L)C^*(L)]\varepsilon_t = y_0 + C(1)(1 + L + L^2 + \dots + L^{t-1})\varepsilon_t + C^*(L)\varepsilon_t.$$
If we have co-integration, and therefore common trends, C(1) must be of reduced rank. The matrix C(1) can be thought of as consisting of two sub-matrices, such that $C(1) = AJ$, where J is defined through
$$\tau_t = \tau_{t-1} + J\varepsilon_t = (1 + L + L^2 + \dots + L^{t-1})J\varepsilon_t.$$
The variable $\tau_t$ represents the common trends, modelled as random walks. Setting the initial condition $\tau_0 = 0$, the level of $y_t$ is solved as
$$y_t = y_0 + A\tau_t + C^*(L)\varepsilon_t,$$
which shows that $y_t$ is driven by the common trends representation $A\tau_t$. It can also be shown that, since $C(1) = AJ$, $\beta'A = 0$, which implies that the co-integrating linear combinations of $y_t$ have no common trends. In terms of the C(1) matrix, we can talk about two types of shocks. The first type of shocks decline over time, so that the variables in the system return to their equilibrium relation. These shocks are driven by the co-integrating vectors. The second type of shocks are those which move the whole system over time without affecting the long-run equilibrium. These shocks are the common trends of the system.
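A short simulation may help to see the two types of shocks. The sketch below is an assumption-laden illustration and not the Stock and Watson estimator: it uses one hypothetical common trend with loadings `A`, replaces the stationary part $C^*(L)\varepsilon_t$ by plain white noise, and picks a vector `beta` that satisfies $\beta'A = 0$.

```python
import numpy as np

rng = np.random.default_rng(42)
T = 2000
A = np.array([1.0, 0.5])               # hypothetical loadings on a single common trend
tau = np.cumsum(rng.normal(size=T))    # random walk: tau_t = tau_{t-1} + shock
noise = rng.normal(size=(T, 2))        # stationary part, standing in for C*(L) eps_t
y = np.outer(tau, A) + noise           # y_t = A tau_t + stationary part

beta = np.array([1.0, -2.0])           # beta'A = 1*1 - 2*0.5 = 0
z = y @ beta                           # co-integrating combination, free of the trend
```

The levels in `y` wander with the random walk, while `z` only contains the stationary noise, so its sample variance stays bounded however long the sample is.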
Co-integration and common trends have interesting implications for econometric model building and inference on dynamic models. For an econometrician, however, the MA representation is not always the easiest way to approach the concept of co-integration and stationary long-run relations.




Earlier we looked at the moving average representation of a vector of integrated variables. This takes us to a definition of common trends in a system of variables with stochastic trends. For an economist it is usually more interesting to analyse a system in autoregressive format. By looking at the VAR representation we get a definition of co-integration or long-run steady state relations among the variables. Let the process $\{y_t\}$ be represented by the following k:th order vector autoregressive (VAR) model, consisting of p variables,
$$A(L)y_t = \Psi D_t + \varepsilon_t,$$
where $D_t$ is a vector of deterministic variables, including dummies and constants. The error term is $\varepsilon_t \sim NID(0, \Sigma)$, and $A(L) = \sum_{j=0}^{k}A_jL^j$ where $A_0 = I$. Thus, we are assuming that the process is multivariate normal, $y_t \sim NID(\mu_t, \Sigma)$, with conditional mean $\mu_t = -A_1y_{t-1} - \dots - A_ky_{t-k} + \Psi D_t$ (with this sign convention the lag coefficients enter the mean with negative signs), and positive definite error covariance matrix $\Sigma$.
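The system can be rewritten in error correction form, using the definition of the difference operator, $\Delta y_t \equiv y_t - y_{t-1}$,
$$\Delta y_t = \sum_{i=1}^{k-1}\Gamma_i\Delta y_{t-i} + \Pi y_{t-k} + \Psi D_t + \varepsilon_t,$$
where
$$\Gamma_i = -\Big(I + \sum_{j=1}^{i}A_j\Big), \quad i = 1, \dots, k-1,$$
and
$$\Pi = -\Big(I + \sum_{j=1}^{k}A_j\Big) = -A(1).$$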

Notice that in this example the system was rewritten such that the variables in levels ($y_{t-k}$) ended up at the k:th lag. As an alternative it is possible to rewrite the system such that the levels enter the expression at the first lag, followed by k - 1 lags of $\Delta y_{t-i}$. The two ways of rewriting the system are equivalent. The preferred form depends on one's preferences.
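The mapping from the VAR coefficients to the error correction form with the levels at the k:th lag can be verified numerically. The sketch below assumes the sign convention $A(L) = I + A_1L + \dots + A_kL^k$, so that $y_t = -A_1y_{t-1} - \dots - A_ky_{t-k} + \varepsilon_t$ when deterministic terms are left out; the coefficient values and the function name `vecm_levels_at_lag_k` are hypothetical.

```python
import numpy as np

def vecm_levels_at_lag_k(A):
    """Map A = [A_1, ..., A_k] to ([Gamma_1, ..., Gamma_{k-1}], Pi)."""
    p = A[0].shape[0]
    partial_sum = np.eye(p)
    mats = []
    for A_j in A:
        partial_sum = partial_sum + A_j
        mats.append(-partial_sum)          # -(I + A_1 + ... + A_j)
    return mats[:-1], mats[-1]

# Hypothetical VAR(3) coefficients in the convention A(L) = I + A_1 L + A_2 L^2 + A_3 L^3
A = [-np.array([[0.5, 0.1], [0.0, 0.4]]),
     -np.array([[0.2, 0.0], [0.1, 0.1]]),
     -np.array([[0.1, 0.0], [0.0, 0.1]])]
Gammas, Pi = vecm_levels_at_lag_k(A)

rng = np.random.default_rng(3)
T, k = 100, len(A)
e = rng.normal(size=(T, 2))
y = np.zeros((T, 2))
for t in range(k, T):
    y[t] = -sum(A[j] @ y[t - 1 - j] for j in range(k)) + e[t]

# Errors implied by the error correction form, for t = k, ..., T-1
dy = np.diff(y, axis=0)
implied = dy[k - 1:] - y[:-k] @ Pi.T
for i, G in enumerate(Gammas, start=1):
    implied = implied - dy[k - 1 - i:len(dy) - i] @ G.T
```

The implied errors reproduce the simulated ones, and `Pi` equals $-A(1)$. Software that writes the VAR as $y_t = A_1y_{t-1} + \dots$ uses the opposite sign on each $A_j$.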
Since $y_t$ is integrated of order one and $\Delta y_t$ is stationary, it follows that there can be at most p - 1 steady state relationships between the non-stationary variables in $y_t$. Hence, p - 1 is the largest possible number of linearly independent rows in the $\Pi$-matrix. The latter is determined by the number of significant eigenvalues in the estimated matrix $\hat{\Pi} = -\hat{A}(1)$. Let r be the rank of $\Pi$; then rank($\Pi$) = 0 implies that there are no combinations of variables that lead to stationarity. In other words, there is no co-integration. If we have rank($\Pi$) = p, the $\Pi$ matrix is said to have full rank, and all variables in $y_t$ must be stationary. Finally, reduced rank, 0 < r < p, means that there are r co-integrating vectors in the system.
Once a reduced rank has been determined, the matrix can be written as $\Pi = \alpha\beta'$, where $\beta'y_t$ represents the vectors of co-integrating relations, and $\alpha$ is a matrix of adjustment coefficients measuring the strength by which each co-integrating vector affects an element of $\Delta y_t$. Whether the co-integrating vectors $\beta'y_t$ are referred to as error correction mechanisms, steady state relations, long-run equilibrium solutions or desired values is a question of how one views the underlying economic theory.
Given estimates of the eigenvalues of $\Pi$, and of $\alpha$ and $\beta$, it becomes possible to impose various restrictions on the parameter vectors to test homogeneity conditions in the $\beta$-vectors, how $\beta'y_t$ affects $\Delta y_t$, or more general hypotheses regarding which combinations of variables form stationary vectors. The tests are performed by comparing changes in the estimated eigenvalues from the unrestricted reduced rank estimate of $\Pi$ with the outcome of a restricted estimation.
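In Johansen (1988) it is shown how to estimate the $\alpha$ and the $\beta$ vectors in the $\Pi$ matrix, given that the latter has reduced rank. The solution starts from conditioning out the short-run dynamics, as well as the effects of the dummy variables, on $\Delta y_t$ and $y_{t-k}$ respectively,
$$\Delta y_t = \sum_{i=1}^{k-1}\Gamma_{1,i}\Delta y_{t-i} + \Psi_1 D_t + R_{1t},$$
$$y_{t-k} = \sum_{i=1}^{k-1}\Gamma_{2,i}\Delta y_{t-i} + \Psi_2 D_t + R_{kt}.$$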
The system in 14.1 can now be written in terms of the residuals above as,
$$R_{1t} = \alpha\beta'R_{kt} + e_t.$$
The vectors $\alpha$ and $\beta$ can now be estimated by forming the product moment matrices $S_{11}$, $S_{kk}$ and $S_{1k}$ from the residuals $R_{1t}$ and $R_{kt}$,
$$S_{ij} = T^{-1}\sum_{t=1}^{T}R_{it}R_{jt}', \quad i, j = 1, k.$$
For fixed $\beta$ vectors, $\alpha$ is given by $\hat{\alpha}(\beta) = S_{1k}\beta(\beta'S_{kk}\beta)^{-1}$, and the sum of squares function is $\hat{\Sigma}(\beta) = S_{11} - \hat{\alpha}(\beta)(\beta'S_{kk}\beta)\hat{\alpha}(\beta)'$. Minimizing this sum of squares function leads to maximum likelihood estimates of $\alpha$ and $\beta$. The estimates of $\beta$ are found after solving the eigenvalue problem,
$$|\lambda S_{kk} - S_{k1}S_{11}^{-1}S_{1k}| = 0,$$
where $\lambda$ is a vector of eigenvalues. The solution leads to estimates of the eigenvalues ($\hat{\lambda}_1, \hat{\lambda}_2, \dots, \hat{\lambda}_p$), and the corresponding eigenvectors $\hat{V} = (\hat{v}_1, \hat{v}_2, \dots, \hat{v}_p)$, normalized around the squared residuals from equation 14.7 such that $V'S_{kk}V = I$.
The size of the eigenvalues ($\lambda_i$) tells us how strongly each linear combination of eigenvectors and variables, $v_i'y_{t-k}$, is correlated with the conditional process $R_{1t}$ ($\Delta y_t \mid \Delta y_{t-i}, D$). The number of non-zero eigenvalues (r) determines the rank of $\Pi$ and leads to the co-integrating vectors of the system, while the number of zero eigenvalues (p - r) defines the common trends in the system. These are the combinations $v_i'y_t$ that determine the directions in which the process is non-stationary.
Given that 14.1 is a well-defined statistical model, it is possible to determine the distribution of the estimated eigenvalues under different assumptions about the number of co-integrating vectors in the model. The distributions of the eigenvalues depend not only on 14.1 being a well-defined statistical model, but also on the number of variables, the inclusion of constant terms in the co-integrating vectors and deterministic trends in the equations. Distributions for different models are tabulated in Johansen (1995).
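The reduced rank regression described above can be sketched with NumPy. This is a simplified illustration and not Johansen's full procedure: deterministic terms are left out, the data are simulated from a hypothetical bivariate system with one co-integrating vector $(1, -1)'$, and the function name `johansen_eigen` is introduced here. The eigenvalue problem is solved through the equivalent ordinary problem for $S_{kk}^{-1}S_{k1}S_{11}^{-1}S_{1k}$.

```python
import numpy as np

def johansen_eigen(y, k=2):
    """Eigenvalues and eigenvectors from regressing dy_t on y_{t-k}, no deterministic terms."""
    dy = np.diff(y, axis=0)
    T = len(dy) - (k - 1)
    Z0 = dy[k - 1:]                                   # dy_t
    Zk = y[:T]                                        # y_{t-k}
    lags = [dy[k - 1 - i:len(dy) - i] for i in range(1, k)]
    if lags:
        # Condition out the short-run dynamics dy_{t-1}, ..., dy_{t-k+1}
        X = np.hstack(lags)
        R1 = Z0 - X @ np.linalg.lstsq(X, Z0, rcond=None)[0]
        Rk = Zk - X @ np.linalg.lstsq(X, Zk, rcond=None)[0]
    else:
        R1, Rk = Z0, Zk
    S11, Skk, S1k = R1.T @ R1 / T, Rk.T @ Rk / T, R1.T @ Rk / T
    # |lambda Skk - Sk1 S11^{-1} S1k| = 0 as an ordinary eigenvalue problem
    M = np.linalg.solve(Skk, S1k.T @ np.linalg.solve(S11, S1k))
    lam, V = np.linalg.eig(M)
    order = np.argsort(lam.real)[::-1]
    lam, V = lam.real[order], V.real[:, order]
    V = V / np.sqrt(np.sum(V * (Skk @ V), axis=0))    # normalise so that V' Skk V = I
    return lam, V, S11, Skk, S1k

# Hypothetical data: dy_t = alpha beta' y_{t-1} + e_t with beta = (1, -1)'
rng = np.random.default_rng(0)
Pi = np.array([[-0.2], [0.4]]) @ np.array([[1.0, -1.0]])
y = np.zeros((500, 2))
for t in range(1, 500):
    y[t] = y[t - 1] + Pi @ y[t - 1] + rng.normal(size=2)

lam, V, S11, Skk, S1k = johansen_eigen(y, k=2)
beta_hat = V[:, 0] / V[0, 0]          # first eigenvector, normalised on the first variable
```

In this simulated system the first eigenvalue is clearly away from zero while the second is close to zero, and the normalised first eigenvector is close to $(1, -1)'$. Whether an eigenvalue is significantly different from zero still has to be judged against the tabulated non-standard distributions.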


The maximized log likelihood, conditional on the short-run dynamics and the deterministic variables of the model, is
$$\ln L = \text{constant} - (T/2)\ln|S_{11}| - (T/2)\sum_{i=1}^{r}\ln(1 - \hat{\lambda}_i).$$
From this expression two likelihood ratio tests for determining the number of non-zero eigenvalues are formulated. The first test concerns the hypothesis that the number of non-zero eigenvalues is less than or equal to some given number (q), $H_0: r \leq q$, against an unrestricted model where $H_1: r \leq p$. The test is given by,
$$-2\ln(Q; q \mid p) = -T\sum_{i=q+1}^{p}\ln(1 - \hat{\lambda}_i).$$
The second test concerns the same null hypothesis, $H_0: r \leq q$, but against the alternative of exactly one more non-zero eigenvalue, $H_1: r = q + 1$, and is given by,
$$-2\ln(Q; q \mid q + 1) = -T\ln(1 - \hat{\lambda}_{q+1}).$$

Both tests follow non-standard distributions which depend on the number of

variables in the system (p), and on the presence of trends and constant terms.
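Given estimated eigenvalues, both likelihood ratio statistics are simple to compute. The sketch below assumes hypothetical eigenvalues and sample size; the function name `johansen_lr_stats` is introduced here, and the resulting numbers must be compared with the tabulated non-standard critical values, which are not reproduced.

```python
import numpy as np

def johansen_lr_stats(eigenvalues, T):
    """Trace and maximum eigenvalue statistics for q = 0, 1, ..., p-1."""
    lam = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    log_terms = np.log(1.0 - lam)
    # trace[q] tests H0: r <= q against r <= p
    trace = np.array([-T * log_terms[q:].sum() for q in range(len(lam))])
    # max_eig[q] tests H0: r <= q against r = q + 1
    max_eig = -T * log_terms
    return trace, max_eig

# Hypothetical eigenvalues from a two-variable system estimated on T = 100 observations
trace, max_eig = johansen_lr_stats([0.25, 0.01], T=100)
```

The trace statistic for a given q is the sum of the maximum eigenvalue statistics from q onwards, so for the last hypothesis (q = p - 1) the two tests coincide.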
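The co-integrating vectors are given by the eigenvectors that correspond to the r non-zero eigenvalue estimates $\hat{\lambda}_i$, such that $\hat{\beta} = (\hat{v}_1, \hat{v}_2, \dots, \hat{v}_r)$. Based on $\hat{\beta}$ the $\alpha$-vectors can be solved by,
$$\hat{\alpha} = S_{1k}\hat{\beta}(\hat{\beta}'S_{kk}\hat{\beta})^{-1}.$$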
The estimated matrix $\Pi = \alpha\beta'$ is not identified, in the sense that we can pick any non-singular matrix M (r x r), so that $\alpha M(\beta(M')^{-1})' = \alpha MM^{-1}\beta' = \alpha\beta'$. There is no unique solution for the co-integrating vectors. The identification, explaining the economic meaning of the co-integrating vectors, is something that the econometrician must impose on the estimates: first by normalizing each $\beta$-vector around a variable, and then by testing different assumptions about the vector. By looking at the signs and relative sizes of the $\hat{\alpha}$-parameters, it is in general possible to find an appropriate normalization of the $\beta$-vectors such that the outcome can be understood in terms of error correction mechanisms or long-run equilibrium relationships between economic variables. Assumptions concerning the sizes and relative signs of the parameters can be tested by comparing an unrestricted maximization with one where the restrictions have been imposed.
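The lack of identification, and the effect of a normalization, can be shown with a few matrix products. All numbers below are hypothetical (p = 3 variables, r = 2 co-integrating vectors), and the normalization chosen here, an identity matrix in the top block of $\beta$, is only one of many possible choices.

```python
import numpy as np

alpha = np.array([[-0.2, 0.1], [0.4, 0.0], [0.0, -0.3]])   # hypothetical adjustment matrix
beta = np.array([[1.0, 0.5], [-1.0, 1.0], [0.5, -2.0]])    # hypothetical co-integrating vectors
Pi = alpha @ beta.T

# Any non-singular r x r matrix M gives another pair with the same Pi
M = np.array([[2.0, 1.0], [0.5, 3.0]])
alpha_star = alpha @ M
beta_star = beta @ np.linalg.inv(M).T

# Normalisation: make the top r x r block of beta an identity matrix
beta_top = beta[:2, :]
beta_norm = beta @ np.linalg.inv(beta_top)
alpha_norm = alpha @ beta_top.T
```

Each normalised vector now has a coefficient of one on "its" variable and zero on the other normalising variable, while the product $\alpha\beta'$, and hence the likelihood, is unchanged.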
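Furthermore, to rule out the cases where $y_t$ is integrated of order 2, we must require that the matrix $\alpha_\perp'\Gamma\beta_\perp$ has full rank, where $\Gamma$ is the mean lag matrix of the VAR polynomial evaluated at unity, and $\alpha_\perp$ and $\beta_\perp$ are the orthogonal matrices to $\alpha$ and $\beta$ such that $\alpha'\alpha_\perp = \beta'\beta_\perp = 0$.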
The system in 14.1 also has a moving average form given by,
$$\Delta y_t = C(L)(\varepsilon_t + \Psi D_t). \tag{14.14}$$
Expression 14.14 can be compared with
$$\Delta z_t = \beta'C(1)\varepsilon_t + \beta'C^*(L)\Delta\varepsilon_t$$
from the previous chapter. Since C(L) can be expanded as $C(L) = C(1) + (1 - L)C^*(L)$, when $y_t$ is integrated of order one we get,
$$y_t = y_0 + C(1)\sum_{i=1}^{t}\varepsilon_i + C(1)\Psi\sum_{i=1}^{t}D_i + C^*(L)(\varepsilon_t + \Psi D_t),$$
where $C^*(L) = [C(L) - C(1)](1 - L)^{-1}$. The impact matrix C(1) shows how the non-stationary part of $y_t$ is generated from the underlying stochastic and deterministic trends. The link between the MA and the autoregressive form is shown in Johansen (1991), and is given by
$$C(1) = \beta_\perp(\alpha_\perp'\Gamma\beta_\perp)^{-1}\alpha_\perp',$$
where $\alpha_\perp$ and $\beta_\perp$ are the orthogonal vectors of $\alpha$ and $\beta$ respectively.

Equation 14.14 can be used to estimate C(1) from given estimates of $\alpha$ and $\beta$. But, since the error terms ($\varepsilon_{i,t}$) in the reduced form are correlated, the estimate of C(1) is not invariant to different ways of conditioning on current variables ($\Delta y_t$). Given this limitation and the assumption that the driving trends should not be affected by the equilibrium forces, the common trends in the system are represented by $\alpha_\perp'y_t$ or alternatively by $\alpha_\perp'\sum\varepsilon_{it}$, see Juselius (1992).
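The impact matrix can be computed from $\alpha$, $\beta$ and $\Gamma$ once orthogonal complements are available. The sketch below obtains the complements from a singular value decomposition; the function names and the numerical values are hypothetical, and $\Gamma$ is simply set to the identity matrix, which is assumed here to correspond to a system without short-run dynamics.

```python
import numpy as np

def orth_complement(a):
    """Columns spanning the space orthogonal to the columns of a (p x r, full column rank)."""
    u, _, _ = np.linalg.svd(a)
    return u[:, a.shape[1]:]

def impact_matrix(alpha, beta, Gamma):
    """C(1) = beta_perp (alpha_perp' Gamma beta_perp)^{-1} alpha_perp'."""
    a_perp, b_perp = orth_complement(alpha), orth_complement(beta)
    inner = a_perp.T @ Gamma @ b_perp
    return b_perp @ np.linalg.solve(inner, a_perp.T)

alpha = np.array([[-0.2], [0.4]])      # hypothetical adjustment coefficients
beta = np.array([[1.0], [-1.0]])       # hypothetical co-integrating vector
C1 = impact_matrix(alpha, beta, np.eye(2))
```

The result satisfies $\beta'C(1) = 0$ and $C(1)\alpha = 0$: shocks have no long-run effect on the co-integrating relation, and the long-run movements are driven by $\alpha_\perp'\sum\varepsilon_i$.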
The test procedure can be extended to incorporate variables integrated of order
2 as well. With both I(1) and I(2) processes in the system, two new co-integrating
relations are possible. There can be combinations of I(2) variables forming stationary I(0) vectors, or I(2) variables forming non-stationary I(1) vectors which
in turn cointegrate with I(1) variables to form stationary vectors.
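The error correction system in 14.1 can be written as
$$\Delta^2 y_t = \Gamma\Delta y_{t-1} + \Pi y_{t-2} + \sum_{i=1}^{k-2}\Gamma_i^*\Delta^2 y_{t-i} + \Psi D_t + \varepsilon_t,$$
where $\Gamma = \sum_{i=1}^{k-1}\Gamma_i - I + \Pi$, $\Gamma_i^* = -\sum_{j=i+1}^{k-1}\Gamma_j$, and i = 1, ..., k - 2. Here the $\Gamma_i$ are those of the error correction form with the levels at the first lag.
If $y_t$ is I(2) and $\Delta y_t$ is I(1), a reduced rank condition for the $\Pi$ matrix must be combined with a reduced rank condition for the matrix of first differences as well. Johansen (1991) shows that the condition for an I(2) process is
$$\alpha_\perp'\Gamma\beta_\perp = \varphi\eta',$$
where $\varphi$ and $\eta$ are (p - r) x s matrices with rank s. With I(2) variables $\beta'y_t$ is I(1). To make these vectors stationary they have to be combined with the vectors of first differences ($\delta\beta_\perp^{2\prime}\Delta y_t$) to form stationary processes. In the latter expression $\beta_\perp^2$ is the squared orthogonal vectors, and $\delta = (\alpha'\alpha)^{-1}\alpha'\Gamma\beta_\perp^2(\beta_\perp^{2\prime}\beta_\perp^2)^{-1}$. The squared orthogonal vectors indicate which variables are I(2).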
An I(2) model is estimated in a way similar to the I(1) model. Maximum likelihood estimation is feasible since the residual terms of an I(2) model can be assumed to be a Gaussian process. The first step is to perform a reduced rank regression for the I(1) model of $\Delta y_t$ on $y_{t-1}$, corrected for the short-run dynamics ($\Delta y_{t-1}, \dots, \Delta y_{t-k+1}$) and the deterministic components ($\Psi D_t$). This leads to estimates of $\hat{r}$, $\hat{\alpha}$ and $\hat{\beta}$.
In the second step, given the estimates of $\hat{r}$, $\hat{\alpha}$ and $\hat{\beta}$, a reduced rank regression is performed of $\hat{\alpha}_\perp'\Delta^2 y_t$ on $\hat{\beta}_\perp'\Delta y_{t-1}$, corrected for $\Delta^2 y_{t-1}, \dots, \Delta^2 y_{t-k+2}$, and the constant terms. This leads to the estimates $\hat{s}$, $\hat{\varphi}$ and $\hat{\eta}$.
An I(2) process is harder to analyze in economic terms since the parameters and the test hypotheses have different interpretations. The tests concerning the $\beta$ vectors are still valid, but are in general only valid for I(1) processes. It is, however, possible to form stationary relations by combining levels ($\hat{\beta}'y_t$) with first difference expressions ($\hat{\beta}_\perp^{2\prime}\Delta y_t$). The practical solution is to identify the I(2) terms and find ways of transforming them to I(1) relations. The transformation to an I(1) system can be done by taking first differences of I(2) variables or by taking ratios of variables, for example modeling the real money stock rather than the money stock and the price level separately.
1 An orthogonal vector is often indicated by the sign $\perp$ attached to the original vector. The vector $\beta_\perp$ is the orthogonal vector to the vector $\beta$ if $\beta_\perp'\beta = 0$.




(To be completed...)
The modelling of stochastic difference models introduces some problems which clearly violate the assumptions behind the classical linear regression model. With some care most of these problems can be solved. The most important factors are whether the data series are stationary, and whether the residuals are white noise. As long as the variables are stationary and the residual is a white noise process, OLS estimation is generally feasible. Autocorrelated residuals, however, mean that the OLS estimator is no longer consistent. In this situation the model must either be re-specified, or the whole model including the autoregressive process in the residuals must be estimated by maximum likelihood.
To understand the differences between the estimation of stochastic difference equations and the classical linear regression model, we will introduce these differences step by step. In all there are 6 models of interest here,
1 The classical linear regression model.
2 Regression with deterministic trends.
3 Models with stochastic explanatory variables.
4 Autoregressive models, lagged dependent variable.
5 Autoregressive models with integrated variables (Testing for unit roots).
6 Regression models with integrated variables (Spurious regression and co-integration).
The following sections do not present any rigorous proofs concerning the properties of the OLS estimator. The aim is only to review known problems and introduce some new ones.

15.1 Deterministic Explanatory Variables

(The Classical Linear Regression model)
Starting with $y_t = \beta x_t + \varepsilon_t$, the matrix form of this model is
$$y = X\beta + \varepsilon,$$
where y is a vector of T observations of $y_t$, X a matrix of explanatory variables, $\beta$ the parameters and $\varepsilon$ a vector of residuals of the same dimension as y. (To keep the example simple, $\beta$ is one parameter, but the example could be extended to a multivariate case.) The classical case builds upon four assumptions. First, the model is linear, or log-linear, in variables. Second, the residuals are independent, have a mean of zero and a finite variance,
$$E(\varepsilon) = 0, \quad Var(\varepsilon) = \sigma^2I.$$
This is basically a statement of correct specification. The model should be set up in such a way that the expected value of the residuals is zero.
Third, the explanatory variables are non-stochastic and therefore independent of the errors,
$$E(X'\varepsilon) = 0.$$
Finally, the explanatory variables are linearly independent such that
$$rank(X'X) = rank(X) = k,$$
which ensures that the inverse of $(X'X)$ exists. Minimizing the sum of squared residuals leads to the following OLS estimator of $\beta$,
$$\hat{\beta} = (X'X)^{-1}(X'y) = \beta + (X'X)^{-1}(X'\varepsilon).$$

If we simplify the model to one parameter ($\beta$) and one explanatory variable
($x_t$), we have for a sample of $T$ observations,
$$\hat\beta = \beta + \left[\frac{1}{T}\sum_{t=1}^{T} x_t^2\right]^{-1}\left[\frac{1}{T}\sum_{t=1}^{T} x_t\varepsilon_t\right].$$
The estimated parameter $\hat\beta$ is equal to its true value plus an additional term.
For the estimate to be unbiased the expected value of the last factor must be zero. If we assume that
the $x$'s are deterministic the problem is relatively easy. A correct specification of
the model, $E(\varepsilon) = 0$, leads to the result that $\hat\beta$ is unbiased.
The parameter has the variance,
$$\operatorname{Var}(\hat\beta) = E[(\hat\beta - \beta)(\hat\beta - \beta)'] = (x'x)^{-1}x'(\sigma^2 I)x(x'x)^{-1} = \sigma^2 (x'x)^{-1}.$$


Taking expectations of $\hat\beta$, under the assumptions above, verifies that $\hat\beta$ is an
unbiased estimate of $\beta$ (footnote 1),
$$E(\hat\beta) = E(\beta) + E[(X'X)^{-1}(X'\varepsilon)] = \beta + E[(X'X)^{-1}]\,E(X'\varepsilon),$$
where $(X'X)$ is a constant when $x_t$ is deterministic, and $E(X'\varepsilon) = X'E(\varepsilon) = 0$
if the residuals have a zero mean. Thus, under these assumptions OLS is unbiased
and also consistent (not proven here). Consistency implies that $\operatorname{Var}(\hat\beta)$ tends
to zero as $T \to \infty$. The problem with assuming that the $x$'s are non-stochastic
is, of course, that it is an unrealistic assumption in a time series setting. Typically
the explanatory variables are as stochastic as the dependent variable.
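The unbiasedness result for a deterministic regressor can be checked with a small Monte Carlo experiment. The sketch below is only an illustration under assumed values: the regressor path, the true parameter (2.0), the error standard deviation and the number of replications are all hypothetical choices, and the errors are drawn as normal for convenience.

```python
import random
import statistics


def ols_slope(x, y):
    """OLS slope in the no-intercept model y_t = beta * x_t + e_t."""
    sxx = sum(xi * xi for xi in x)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    return sxy / sxx


def simulate_fixed_x(beta=2.0, sigma=1.0, T=30, reps=5000, seed=1):
    """Re-draw only the errors; the regressor is held fixed across samples."""
    rng = random.Random(seed)
    x = [1.0 + 0.1 * t for t in range(T)]  # assumed deterministic regressor
    sxx = sum(xi * xi for xi in x)
    estimates = []
    for _ in range(reps):
        y = [beta * xi + rng.gauss(0.0, sigma) for xi in x]
        estimates.append(ols_slope(x, y))
    # Mean and variance of the estimates, plus the theoretical sigma^2 (x'x)^{-1}
    return statistics.mean(estimates), statistics.variance(estimates), sigma ** 2 / sxx


mean_b, var_b, theory_var = simulate_fixed_x()
print(mean_b, var_b, theory_var)
```

The average of the estimates should be close to the assumed true value, and their sampling variance close to $\sigma^2 (x'x)^{-1}$.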
So far we have not made any statements about the distribution of the estimates.
OLS has the advantage that it leads to unbiased and efficient estimates under quite
general assumptions. However, to make any inference on $\hat\beta$, we need to make
assumptions about its distribution. In most cases the assumption of a normal
distribution is reasonable, at least asymptotically, or as a reasonable approximation
in a limited sample, leading to
$$\varepsilon_t \sim NID(0, \sigma^2).$$
Thus, the limiting distribution of $\hat\beta$ is normal, and since $(\hat\beta - \beta)$ is a white noise
process we know that it converges to the true sample moment with the speed given
by the standard error of a white noise process, $1/T^{1/2}$.
1 The expectation of an expectation is equal to the expectation, $E[E(\hat\beta)] = E(\hat\beta)$, and the expectation
of a constant is equal to the constant. Since the true parameter can be treated as a constant we
have $E(\beta) = \beta$.



15.2 The Deterministic Trend Model

A situation where the assumption of deterministic explanatory variables can make
sense in a time series setting is when the dependent variable is driven by a deterministic trend (footnote 2). Suppose that the explanatory variable is a deterministic time trend,
$$y_t = \alpha + \beta t + \varepsilon_t,$$
where $t$ is a time trend, $t = 1, 2, 3, \ldots, T$, without stochastic variation. If the
time trend is adjusted for its mean, $\tilde t = (t - \bar t)$, the constant term ($\alpha$) will measure
the unconditional mean of $y_t$. Under the assumption that $y_t$ has a sufficiently large
deterministic trend component, relative to the sample size, the error terms from this
regression can be understood as the detrended $y_t$ series. Assume that both $y_t$ and
$t$ have been corrected for their means. OLS then leads to
$$\hat\beta = \beta + \left[\frac{1}{T}\sum_{t=1}^{T}\tilde t^{\,2}\right]^{-1}\left[\frac{1}{T}\sum_{t=1}^{T}\tilde t\,\varepsilon_t\right].$$
Taking expectations leads to the result that $\hat\beta$ is unbiased. The most important
reason why this regression works well is that there is an additional $\tilde t$ variable in
the denominator. As $\tilde t$ goes to infinity the denominator gets larger and larger
compared to the numerator, so the ratio goes to zero much faster than otherwise.
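How much faster the trend estimate converges can be illustrated by simulation. The sketch below is an illustration only: the parameter values, sample sizes and normal errors are assumptions. It uses the fact that $\sum \tilde t^{\,2} = T(T^2-1)/12$, so the standard deviation of $\hat\beta$ falls roughly with $T^{3/2}$ rather than $T^{1/2}$.

```python
import random
import statistics


def trend_slope_sd(T, alpha=1.0, beta=0.5, sigma=1.0, reps=1000, seed=2):
    """Monte Carlo standard deviation of the OLS trend coefficient."""
    rng = random.Random(seed)
    t_bar = (T + 1) / 2
    t_dev = [t - t_bar for t in range(1, T + 1)]  # mean-adjusted trend
    stt = sum(d * d for d in t_dev)
    estimates = []
    for _ in range(reps):
        y = [alpha + beta * t + rng.gauss(0.0, sigma) for t in range(1, T + 1)]
        # t_dev sums to zero, so the constant drops out of the slope formula
        estimates.append(sum(d * yi for d, yi in zip(t_dev, y)) / stt)
    return statistics.pstdev(estimates)


sd_25 = trend_slope_sd(25)
sd_100 = trend_slope_sd(100)
ratio = sd_25 / sd_100
print(sd_25, sd_100, ratio)
```

Quadrupling the sample size should cut the standard deviation by a factor of about eight, whereas a stationary regressor would give a factor of about two.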

15.3 Stochastic Explanatory Variables

Applying OLS to time series data introduces the problem of stochastic explanatory
variables. The explanatory variables can be stochastic on their own, and lags of
the dependent variable imply stochastic regressors. Let the model be,
$$y_t = \beta x_t + \varepsilon_t,$$
where $x_t$ is generated by the covariance stationary stochastic process $\{x_t\}_1^T$.
The OLS estimator leads to
$$\hat\beta = \beta + \left[\sum_{t=1}^{T} x_t x_t\right]^{-1}\left[\sum_{t=1}^{T} x_t\varepsilon_t\right].$$
Taking expectations of the expression leads to
$$E\left[\sum_{t=1}^{T} x_t x_t\right]^{-1}$$
for the first factor and
$$E\left[\sum_{t=1}^{T} x_t\varepsilon_t\right]$$
for the second factor.
In the classical linear regression case $x_t$ is assumed to be deterministic, implying
that $[X'X]$ is a constant and that $E(x_t\varepsilon_t) = x_tE(\varepsilon_t) = 0$. Here $x_t$ is a random
variable, so additional assumptions must be made for the OLS estimator.
2 Other realistic examples in economics are deterministic dummy variables and deterministic
seasonal components.



The necessary conditions are that $\{x_t\}_1^T$ is a stationary process and that $\{x_t\}_1^T$
and $\{\varepsilon_t\}_1^T$ are independent. The first condition means that we can view the covariance matrix $(X'X)$ as fixed in repeated samples. In a time series perspective
we cannot generally talk about repeated samples; instead we have to look at the
sample moments as $T \to \infty$. If $x_t$ is a stationary variable then we can state that
as $T \to \infty$ the covariance matrix will become a constant. This can be written as,
$$\frac{1}{T}\sum_{t=1}^{T} x_t x_t \to_p Q,$$
meaning that the expression will converge in probability to a constant $Q$ (footnote 3). An
alternative way to show the properties of OLS in the case of stochastic explanatory variables is to use the probability limit operator (plim), $\operatorname{plim}\,(1/T)[X'X] = Q$.
A convenient property of plim operators is that $\operatorname{plim}(x^{-1}) = [\operatorname{plim}(x)]^{-1}$. Here,
it remains to look at the numerator in the OLS expression. If $\{x_t\}_1^T$ and $\{\varepsilon_t\}_1^T$
are generated by two independent stochastic processes we have, for each pair of
observations, that $E(x_t\varepsilon_t) = E(x_t)E(\varepsilon_t)$. It can then be shown that
$$\frac{1}{T}\sum_{t=1}^{T} x_t\varepsilon_t \to_p 0,$$
or, alternatively,
$$\operatorname{plim}\,(1/T)[X'\varepsilon] = 0.$$
The intuition behind this result is that, because $\varepsilon_t$ is zero on average, we are
multiplying $x_t$ with zero. It follows then that the average of $(x_t\varepsilon_t)$ will be zero.
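The convergence of the average cross product to zero can be made concrete with a short simulation. This is a sketch under assumptions that are not in the text: the regressor is taken to be a stationary AR(1) process with coefficient 0.5, the errors are standard normal, and the two sample sizes are arbitrary.

```python
import random


def mean_cross_product(T, phi=0.5, seed=0):
    """(1/T) * sum of x_t * eps_t for independent x_t and eps_t."""
    rng = random.Random(seed)
    x = 0.0
    total = 0.0
    for _ in range(T):
        x = phi * x + rng.gauss(0.0, 1.0)  # stationary AR(1) regressor (assumed)
        eps = rng.gauss(0.0, 1.0)          # error drawn independently of x
        total += x * eps
    return total / T


# Average absolute size of the cross moment over 100 independent samples
small = [abs(mean_cross_product(100, seed=s)) for s in range(100)]
large = [abs(mean_cross_product(5000, seed=s)) for s in range(100)]
avg_small = sum(small) / len(small)
avg_large = sum(large) / len(large)
print(avg_small, avg_large)
```

The average absolute value of the cross moment should shrink markedly as the sample grows, which is the sample counterpart of the probability limit being zero.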
The practical implication is that, given a sufficiently large sample, the OLS
estimator will be unbiased, efficient and consistent even when the explanatory
variables are stochastic variables. If $\varepsilon_t \sim NID(0, \sigma^2)$, we also have, conditional
on the stochastic process $\{x_t\}_1^T$, that the estimated $\beta$ is distributed as,
$$\hat\beta \mid x_t \sim N[\beta, \sigma^2 (X'X)^{-1}], \qquad (\hat\beta - \beta) \mid x_t \sim N[0, \sigma^2 (X'X)^{-1}],$$
making $\hat\beta$ an unbiased and consistent estimate, with a normal distribution such
that standard distributions can be used for inference.
The example can be extended by two assumptions. First, let the residuals be
$\varepsilon_t \sim iid(0, \sigma^2)$; they are independent and identically distributed as before, but not
necessarily normal. Second, let the process $\{x_t\}_1^T$ be only covariance stationary in
the long run, allowing the sample covariance to vary with time in a limited sample,
$E(X'X) = (1/T)\sum_{t=1}^{T} x_t x_t = Q_t$. The processes $\{x_t\}_1^T$ and $\{\varepsilon_t\}_1^T$ are independent
as above. Under these conditions the estimated $\beta$ is,
$$\hat\beta_t = \beta + Q_t^{-1}(1/T)\sum_{t=1}^{T} x_t\varepsilon_t.$$
The estimated $\hat\beta_t$ can vary with $t$ since $Q_t$ varies with time. To establish that
OLS is a consistent estimator we need to establish that
$$\left[(1/T)\sum_{t=1}^{T} x_t x_t\right]^{-1} = Q_t^{-1} \to_p Q^{-1}.$$

3 In a multivariate model we would say that Q converges to a matrix of constants.


The condition holds if $\{x_t\}_1^T$ is covariance stationary; as $T$ goes to infinity the
estimate will converge in probability ($\to_p$) to a constant. The second condition
is that the scaled sum $(1/T)\sum_{t=1}^{T} x_t\varepsilon_t$ converges in probability to zero, which takes place
whenever $x_t$ and $\varepsilon_t$ are independent. The error process is iid, but not necessarily
normal. Under the conditions given here, the central limit theorem is sufficient
to establish that the sequence $\{\sum_{t=1}^{T} x_t\varepsilon_t\}$, suitably scaled, converges (weakly, in distribution) to a
normal distribution,
$$\left[(1/T^{1/2})\sum_{t=1}^{T} x_t\varepsilon_t\right] \to_d N(0, \sigma^2 Q),$$
so that $T^{1/2}(\hat\beta_t - \beta)$ is asymptotically distributed as
$$N(0, \sigma^2 Q^{-1}).$$
In a limited sample the normal distribution will be an approximation. The
result is necessary for using $t$, $\chi^2$ and $F$-distributions for inference on $\hat\beta$ and $\hat\sigma^2$.
To see how the last result works, recall the central limit theorem (CLT). The
CLT states that for the sample mean $\bar z_T$ of an iid process, as the sample size $T$
increases, the scaled deviation from the population mean converges weakly to a normally distributed variable,
$$T^{1/2}(\bar z_T - \mu) \Rightarrow N(0, \sigma^2),$$
where $\mu$ is the population mean of $z_t$. From the OLS estimator we have,
$$(\hat\beta_t - \beta) = \left[(1/T)\sum_{t=1}^{T} x_t x_t\right]^{-1}\left[(1/T)\sum_{t=1}^{T} x_t\varepsilon_t\right].$$
Since $(1/T) = (1/T^{1/2})(1/T^{1/2})$ the CLT can be invoked by rewriting the expression as,
$$T^{1/2}(\hat\beta_t - \beta) = \left[(1/T)\sum_{t=1}^{T} x_t x_t\right]^{-1}\left[(1/T^{1/2})\sum_{t=1}^{T} x_t\varepsilon_t\right],$$
where the LHS and the numerator on the RHS correspond to the CLT.
From the numerator on the RHS we get, as $T$ goes to infinity,
$$\left[(1/T^{1/2})\sum_{t=1}^{T} x_t\varepsilon_t\right] \Rightarrow N(0, \sigma^2 Q).$$
Moreover, we can also conclude that the rate of convergence is given by $(1/T^{1/2})$.
Dividing the RHS of the OLS estimator by $(1/T^{1/2})$ leaves $(1/T^{1/2})$ in the
denominator, which then represents the speed by which the estimate $\hat\beta_t$ converges
to its true value $\beta$.
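The role of the $T^{1/2}$ scaling can be illustrated numerically. The sketch below assumes, purely for illustration, that both $x_t$ and $\varepsilon_t$ are iid standard normal, so that $Q = 1$ and $\sigma^2 = 1$; the sample sizes and number of replications are arbitrary.

```python
import random
import statistics


def scaled_errors(T, reps=500, seed=4):
    """Draws of T^{1/2} * (beta_hat - beta) for y_t = beta * x_t + eps_t."""
    rng = random.Random(seed)
    draws = []
    for _ in range(reps):
        sxx = 0.0
        sxe = 0.0
        for _obs in range(T):
            x = rng.gauss(0.0, 1.0)
            e = rng.gauss(0.0, 1.0)
            sxx += x * x
            sxe += x * e
        # beta_hat - beta = sxe / sxx, whatever the true beta is
        draws.append(T ** 0.5 * sxe / sxx)
    return draws


var_small = statistics.variance(scaled_errors(50))
var_large = statistics.variance(scaled_errors(800))
print(var_small, var_large)
```

Without the scaling the variance of $(\hat\beta - \beta)$ would shrink towards zero; with it, the variance should stay close to $\sigma^2 Q^{-1} = 1$ at both sample sizes.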

15.4 Lagged Dependent Variables

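Let us now turn to the AR(1) model,
$$y_t = \rho y_{t-1} + \varepsilon_t,$$
where $\varepsilon_t \sim iid(0, \sigma^2)$. (The estimation of AR(p) models follows from this
example in a straightforward way.) The estimated $\rho$ is
$$\hat\rho = \left[\sum_{t=1}^{T} y_{t-1}y_{t-1}\right]^{-1}\left[\sum_{t=1}^{T} y_{t-1}y_t\right],$$
leading to
$$(\hat\rho - \rho) = \left[\sum_{t=1}^{T} y_{t-1}y_{t-1}\right]^{-1}\left[\sum_{t=1}^{T} y_{t-1}\varepsilon_t\right].$$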



This is similar to the stochastic regressor case, but here $\{y_{t-1}\}$ and $\{\varepsilon_t\}$ cannot
be assumed to be independent processes, so the expectation of the ratio on the RHS is not zero and $\hat\rho$ can be biased in a
limited sample. The dependence can be explained as follows: $\varepsilon_t$ enters
$y_t$, and $y_t$ is through the AR(1) process correlated with $y_{t+1}$, so $y_{t+1}$ is correlated
with $\varepsilon_t$. The long-run covariance (lrcov) between $y_{t-1}$ and $\varepsilon_t$ is defined as,
$$\operatorname{lrcov}(y_{t-1}, \varepsilon_t) = (1/T)\sum_{t=1}^{T} y_{t-1}\varepsilon_t + \sum_{k=1}^{\infty} E(y_{t-1}\varepsilon_{t+k}) + \sum_{k=1}^{\infty} E(y_{t-1+k}\varepsilon_t),$$
where the first term on the RHS is the sample estimate of the covariance, and the last
two terms capture leads and lags in the cross correlation between $y_{t-1}$ and $\varepsilon_t$.
As long as $y_t$ is covariance stationary and $\varepsilon_t$ is iid, the sample estimate of the
covariance will converge to its true long-run value.
This dependence from $\varepsilon_t$ to $y_{t+1}$ is not of major importance for estimation.
Since $(y_{t-1}\varepsilon_t)$ is still a martingale difference sequence w.r.t. the history of $y_t$ and
$\varepsilon_t$, we have that $E\{y_{t-1}\varepsilon_t \mid y_{t-2}, y_{t-3}, \ldots, \varepsilon_{t-1}, \varepsilon_{t-2}, \ldots\} = 0$, so it can be established,
in line with the CLT (footnote 4), that
$$(1/T^{1/2})\sum_{t=1}^{T} y_{t-1}\varepsilon_t \Rightarrow N(0, \sigma^2 Q).$$
Using the same assumptions and notation as above, the variance is given by
$E(y_{t-1}\varepsilon_t\varepsilon_t y_{t-1}) = E(\varepsilon_t^2)E(y_{t-1}y_{t-1}) = \sigma^2 Q_t$. These results are sufficient to
establish that OLS is a consistent estimator, though not necessarily unbiased in
a limited sample. It follows that the distribution of the estimated $\rho$, and its rate
of convergence, are as above. The results are the same for higher order stochastic
difference models.
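The combination of small-sample bias and consistency can be seen in a simple Monte Carlo experiment. The following is a sketch under assumed values: a true coefficient of 0.9, normal errors, a start value of zero, and two arbitrary sample sizes.

```python
import random
import statistics


def ar1_ols(T, rho, rng):
    """OLS estimate of rho in y_t = rho * y_{t-1} + eps_t, with y_0 = 0."""
    y_prev = 0.0
    num = 0.0
    den = 0.0
    for _ in range(T):
        y = rho * y_prev + rng.gauss(0.0, 1.0)
        num += y_prev * y
        den += y_prev * y_prev
        y_prev = y
    return num / den


def mean_estimate(T, rho=0.9, reps=2000, seed=5):
    """Average OLS estimate over many simulated samples of length T."""
    rng = random.Random(seed)
    return statistics.mean(ar1_ols(T, rho, rng) for _ in range(reps))


mean_small = mean_estimate(25)               # short sample
mean_large = mean_estimate(1000, reps=300)   # long sample
print(mean_small, mean_large)
```

In the short sample the average estimate should fall visibly below 0.9, while in the long sample it should be very close to the true value.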

15.5 Lagged Dependent Variables and Autocorrelation

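In this section we look at the AR(1) model with an autoregressive residual process.
Let the error process be,
$$\varepsilon_t = \phi\varepsilon_{t-1} + v_t,$$
where $v_t \sim iid(0, \sigma^2)$. In this case we get the following expression,
$$\sum y_{t-1}\varepsilon_t = \sum [y_{t-1}(\phi\varepsilon_{t-1} + v_t)] = \phi\sum y_{t-1}\varepsilon_{t-1} + \sum y_{t-1}v_t$$
$$= \phi\sum [y_{t-1}(y_{t-1} - \rho y_{t-2})] + \sum y_{t-1}v_t = \phi\sum y_{t-1}^2 - \phi\rho\sum y_{t-1}y_{t-2} + \sum y_{t-1}v_t.$$
4 This result is established by the so-called Mann-Wald theorem.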



Dividing the expression by $T$ and taking expectations gives
$$E\left[(1/T)\sum y_{t-1}\varepsilon_t\right] = \phi\operatorname{var}(y_t) - \phi\rho\operatorname{cov}(y_{t-1}, y_{t-2}) + \operatorname{cov}(y_{t-1}, v_t),$$
which establishes that the OLS estimator is biased and inconsistent. Only the
last covariance term can be assumed to go to zero as $T$ goes to infinity.
In this situation OLS is always inconsistent (footnote 5). Thus, the conclusion is that
with a lagged dependent variable OLS is only feasible if there is no serial correlation in the residual. There are two solutions in this situation: to respecify the
equation so that the serial correlation is removed from the residual process, or to turn
to an iterative ML estimation of the model $(y_t - \rho y_{t-1} - \phi\varepsilon_{t-1} = v_t)$. The latter specification implies common factor restrictions, which, if not tested, is an ad
hoc assumption. The approach was extremely popular in the late 70s and early
80s, when people used to rely on a priori assumptions in the form of adaptive
expectations or costly adjustment, as examples, to derive their estimated models.
Often economists started from a static formulation of the economic model and
then added assumptions about expectations or adjustment costs. These assumptions could then lead to an infinite lag structure with white noise residuals. To
estimate the model these authors called upon the so-called Koyck transformation
to reduce the model to a first order autoregressive stochastic difference model,
with an assumed first order serially correlated residual term.
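The inconsistency of OLS with a lagged dependent variable and autocorrelated errors can be illustrated as follows. This is a sketch under assumptions: both the AR coefficient and the error autocorrelation are set to 0.5, the innovations are normal, and the symbols `rho` and `phi` are labels chosen here. Under these assumptions $y_t$ follows an AR(2) process, and the OLS estimate converges to its first-order autocorrelation, $(\rho + \phi)/(1 + \rho\phi)$, rather than to $\rho$.

```python
import random


def ols_with_ar1_errors(T, rho=0.5, phi=0.5, seed=6):
    """OLS of y_t on y_{t-1} when the error follows eps_t = phi*eps_{t-1} + v_t."""
    rng = random.Random(seed)
    y_prev = 0.0
    eps_prev = 0.0
    num = 0.0
    den = 0.0
    for _ in range(T):
        eps = phi * eps_prev + rng.gauss(0.0, 1.0)  # autocorrelated error
        y = rho * y_prev + eps
        num += y_prev * y
        den += y_prev * y_prev
        y_prev, eps_prev = y, eps
    return num / den


rho, phi = 0.5, 0.5
rho_hat = ols_with_ar1_errors(50000, rho, phi)
plim = (rho + phi) / (1 + rho * phi)  # first autocorrelation of the implied AR(2)
print(rho_hat, plim)
```

Even with a very long sample the estimate should stay near the probability limit and far from the assumed true value, so a larger sample does not remove the bias.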

15.6 The Problems of Dependence and the Initial

An additional problem is that of dependent observations. When we derived the
estimators, in particular the MLE, we assumed that the observations are
drawn from independent distributions. A basic assumption is therefore violated, because
the observations in a typical time series model are dependent. The AR(1) can serve
as an example,
$$x_t = a_1 x_{t-1} + \varepsilon_t, \qquad \varepsilon_t \sim NID(0, \sigma^2).$$
In this model each observation of $\tilde X_t$ is dependent on the observation of $x_t$ in
the previous period. How does this affect the ML estimator? Suppose the sample
only consists of two observations $x_1$ and $x_2$. The joint density function for these
two observations can be factorised as,
$$D(x_1, x_2) = D_1(x_2 \mid x_1)D_2(x_1).$$
5 Asymptotically, though, the estimates have normal distributions, because the long-run bias
converges to a constant while the error process $v_t$ converges to $NID(0, \sigma^2)$. This is a result of
the CLT.


Extend the sample to three observations and we get,
$$D(x_1, x_2, x_3) = D_1(x_3 \mid x_2, x_1)D_2(x_2, x_1) = D_1(x_3 \mid x_2, x_1)D_2(x_2 \mid x_1)D_3(x_1).$$
With three observations, the joint probability density function is
equal to the density function of $\tilde X_3$, conditional on $\tilde X_2$ and $\tilde X_1$, multiplied by the
conditional density for $\tilde X_2$, multiplied by the marginal density for $\tilde X_1$.
It follows that for a sample of $T$ observations, the likelihood function can be
written as,
$$L(\theta; x) = \prod_{t=2}^{T} D(x_t \mid X_{t-1}; \theta)\, f(x_1),$$
where $X_{t-1}$ represents the observations up to and including $x_{t-1}$.

Now, the AR(1) model implies that the conditional density function of $x_t$,
$D(x_t \mid x_{t-1}, \ldots, x_1)$, is normal with mean $a_1 x_{t-1}$ and variance $\sigma^2$.
The log likelihood function is,
$$\log L(a_1, \sigma^2; x) = -[(T-1)/2]\log 2\pi - [(T-1)/2]\log\sigma^2 - (1/2\sigma^2)\sum_{t=2}^{T}(x_t - a_1 x_{t-1})^2 + \log D(x_1).$$
This looks like the expression for the MLE derived earlier, with the exception
of the last term, the log likelihood for the very first observation. By definition,
the first observation here contains the initial conditions for the model, meaning
everything that happened up to and including the first period of the sample. The
question is, how do we get rid of this term?
A practical solution is to assume that $x_1$ can be treated as a fixed value in
repeated realizations. (Compare with the stochastic regressor case in OLS.) In this
case $\log f(x_1)$ can be seen as a constant which can be left out of the MLE because
it will not affect the estimates of the parameters.
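Once the first observation is treated as fixed, the conditional log likelihood can be written down and maximised in closed form. The sketch below is illustrative only: the function names, the simulated data and the true coefficient of 0.6 are assumptions, not part of the text. It shows that the maximiser for $a_1$ coincides with the OLS formula and that the variance estimate is the mean squared residual.

```python
import math
import random


def conditional_loglik(x, a1, sigma2):
    """AR(1) log likelihood conditional on x[0] (the log D(x_1) term is dropped)."""
    n = len(x) - 1
    ssr = sum((x[t] - a1 * x[t - 1]) ** 2 for t in range(1, len(x)))
    return (-0.5 * n * math.log(2 * math.pi)
            - 0.5 * n * math.log(sigma2)
            - ssr / (2 * sigma2))


def conditional_mle(x):
    """Closed-form maximiser: OLS slope and mean squared residual."""
    num = sum(x[t] * x[t - 1] for t in range(1, len(x)))
    den = sum(x[t - 1] ** 2 for t in range(1, len(x)))
    a1 = num / den
    ssr = sum((x[t] - a1 * x[t - 1]) ** 2 for t in range(1, len(x)))
    return a1, ssr / (len(x) - 1)


# Simulated data from an assumed AR(1) with a1 = 0.6 and unit error variance
rng = random.Random(7)
x = [0.0]
for _ in range(500):
    x.append(0.6 * x[-1] + rng.gauss(0.0, 1.0))

a1_hat, s2_hat = conditional_mle(x)
best = conditional_loglik(x, a1_hat, s2_hat)
print(a1_hat, s2_hat, best)
```

Evaluating `conditional_loglik` at any other parameter values gives a value no larger than `best`.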
An alternative way is to assume that $\tilde X_t$ is stationary and normally distributed.
The absolute value of $a_1$ will then be less than one. The unconditional normal
distribution of $\tilde X_1$ is therefore known to have mean zero and variance $\sigma^2/(1 - a_1^2)$.
The likelihood becomes,
$$\log L(a_1, \sigma^2; x) = -(T/2)\log 2\pi - (T/2)\log\sigma^2 + (1/2)\log(1 - a_1^2) - (1/2\sigma^2)(1 - a_1^2)x_1^2 - (1/2\sigma^2)\sum_{t=2}^{T}(x_t - a_1 x_{t-1})^2.$$
Unfortunately the log likelihood is no longer quadratic in $a_1$, so the first-order conditions are nonlinear. The most convenient
solution in this case is to drop the third and the fourth terms from the likelihood,
with the argument that we are only dealing with one observation, so the asymptotic properties of the estimator should be unchanged. The conclusion would be
the same if we assume that $\tilde X_1$ is fixed in repeated samples.
Finally, the most difficult way of dealing with the situation is to use the sample
information to estimate the initial conditions of $\tilde X_t$. This would be recommended
if we are modeling non-stationary variables where the distribution of the initial
value might differ to a large extent from the following observations. (An example
of this can be found in Bergstrom (1989).)


15.7 Estimation with Integrated Variables

(To be completed and extended)
In this section we investigate the problems of estimating integrated series. An
integrated variable can be defined as follows:
A series ($x_t$) with no deterministic component and which has a stationary and
invertible autoregressive moving average (ARMA) representation after differencing
$d$ times, but which is not stationary after differencing only $d - 1$ times, is said
to be integrated of order $d$, denoted $x_t \sim I(d)$. (Banerjee et al. (1993))
In many areas where time series techniques are applied, integrated variables are
rare exceptions, which are seldom interesting to analyse. In economics this is not
the case; most macroeconomic time series appear to be integrated or nearly integrated series, see Nelson and Plosser (1982). Thus, the estimation and distribution
of sample estimates are of great importance in economics, especially since regression with integrated variables often results in spurious correlations when standard
distributions are used for inference.
The simplest example of an I(1) series is the random walk model $y_t = y_{t-1} + \varepsilon_t$,
where $\varepsilon_t \sim NID(0, \sigma^2)$. Taking the first difference of this variable results in a
stationary I(0) series according to the definition given above. If $y_t$ is generated as
an integrated series, the main problem with estimating a random walk model,
$$y_t = \rho y_{t-1} + \varepsilon_t,$$
is that the estimated $\rho$ does not follow a normal distribution, not even asymptotically. The problem here is not inconsistency, but the nonstandard distribution
of the estimated parameters. This is clearly established in Fuller (1976), where the
results from simulating the empirical distribution of the random walk model are
presented. Fuller generated data series from a driftless random walk model, and
estimated the following models,
a) $\Delta y_t = \pi y_{t-1} + \varepsilon_t$,
b) $\Delta y_t = \alpha + \pi y_{t-1} + \varepsilon_t$,
c) $\Delta y_t = \alpha + \beta(t - \bar t) + \pi y_{t-1} + \varepsilon_t$,
where $\alpha$ is a constant and $(t - \bar t)$ a mean-adjusted deterministic trend. These
equations follow from the random walk model. The reason for setting up these
three models is that the modeler will not know in practice that the data is generated
by a driftless random walk. S/he will therefore add a constant (representing the
deterministic growth trend in $y_t$) or a constant and trend. The models are easy
to understand; simply subtract $y_{t-1}$ from both sides of the random walk model,
$$y_t - y_{t-1} = \rho y_{t-1} - y_{t-1} + \varepsilon_t,$$
which leads to
$$\Delta y_t = \pi y_{t-1} + \varepsilon_t,$$
with $\pi = (\rho - 1)$. Thus, if $y_{t-1}$ is integrated of order one, $\pi = 0$. The problem here is that since
$\hat\pi$ does not follow a standard distribution the conventional t-statistic cannot be
used. This would not be a problem if $\rho$ equals, say, 0.99; then $|\rho| < 1$ and the series
would be stationary, and its asymptotic behaviour would be like the AR(1) model
above. Fuller's simulations of the empirical t distribution of the estimated $\pi$ in the
three models showed that they did not converge to the normal distribution. With
these results he established what is now known as the Dickey-Fuller distributions.
Furthermore, the divergences compared to the normal distributions are huge. So,
here is a case where the central limit theorem does not work.


The standard critical value of the t-statistic for an infinitely large sample is, for a two-sided test of
$\hat\pi \neq 0$, equal to 1.96 at the 5 % level. However, according to the simulations of
Dickey and Fuller the appropriate value of the t-statistic in model (a) is 2.23, for
an infinitely large sample. In an autoregressive model we know that the estimate
of $\rho$ is biased downward. Thus, the alternative hypothesis in models (a) to (c) is
that $\pi$ is less than zero. The associated asymptotic t-value for an estimate from
a normal distribution is therefore -1.65. Dickey and Fuller established that the
asymptotic critical values for one-sided t-tests at the 5 % level in the models (a) to
(c) are -1.95, -2.86 and -3.41 respectively. (See Fuller (1976), Table 3.2.1, page 373.)
Notice that the critical values change depending on the parameters included in the
empirical model. Also, the empirical distributions assume white noise residuals; if
this is not the case, either the model or the test statistic must be adjusted.
Moreover, as long as $\rho = 1$ or $\pi = 0$ cannot be rejected, the estimated constant
term in model (b), as well as the constant and the quadratic trend in model (c),
also follow non-normal distributions. These cases are tabulated in Dickey and
Fuller (1981). The consequence of ignoring the results of Dickey and Fuller is
obvious. If using the standard tables, one will reject the null hypothesis of $\pi = 0$
($\rho = 1.0$) too many times. It follows that if you use standard t-tests you will
end up modelling non-stationary series as if they were stationary, which in turn takes you to the spurious
regression problem. The alternative hypotheses for unit root tests are discussed
in the following chapter.
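The shift in critical values can be reproduced with a small simulation in the spirit of Fuller's experiment. The sketch below covers only the case without constant and trend, and rests on assumptions chosen for illustration: a driftless random walk started at zero, normal innovations, a sample length of 100 and 3000 replications.

```python
import random


def df_t_stat(T, rng):
    """t-statistic for pi in  dy_t = pi * y_{t-1} + eps_t  under a random walk."""
    y_prev = 0.0
    pairs = []
    for _ in range(T):
        eps = rng.gauss(0.0, 1.0)
        pairs.append((y_prev, eps))  # under the null, dy_t equals eps_t
        y_prev += eps
    syy = sum(yl * yl for yl, _dy in pairs)
    syd = sum(yl * dy for yl, dy in pairs)
    pi_hat = syd / syy
    ssr = sum((dy - pi_hat * yl) ** 2 for yl, dy in pairs)
    s2 = ssr / (T - 1)
    return pi_hat / (s2 / syy) ** 0.5


rng = random.Random(8)
reps = 3000
stats = sorted(df_t_stat(100, rng) for _ in range(reps))
q05 = stats[int(0.05 * reps)]                         # empirical 5 % quantile
share_below = sum(s < -1.645 for s in stats) / reps   # rejections with normal value
print(q05, share_below)
```

The empirical 5 % quantile should lie well below the normal value of -1.65, and using the normal critical value should reject a true unit root clearly more often than 5 % of the time.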
The explanation of why the t-statistic ends up being non-normally distributed
can be introduced as follows. As $T$ goes to infinity, the relative distance between
$y_t$ and $y_{t-1}$ becomes smaller and smaller. Increasing the sample size implies that
the random walk model goes towards a continuous time random walk model. The
asymptotic distribution of such a model is that of a Wiener process (or Brownian motion).
The OLS estimate is,
$$(\hat\rho - 1.0) = \left[\sum_{t=1}^{T} y_{t-1}^2\right]^{-1}\left[\sum_{t=1}^{T} y_{t-1}\varepsilon_t\right],$$
where, since $y_t$ is driven by a stochastic trend, $y_t = \sum_{i=1}^{t}\varepsilon_i$, the sample moments of the two factors on the RHS will not converge to constants, but to random
variables instead. These random variables will have a non-standard distribution,
often called a Dickey-Fuller distribution. We can express this as,
$$(\hat\rho - 1.0) \Rightarrow [W_{yy}(t)]^{-1}[W_{y\varepsilon}(t)],$$
where $W(t)$ indicates that the sample moment converges to a random variable
which is a function of a Wiener process and therefore distributed according to a
non-standard distribution. If the residuals are white noise then we get the so-called
Dickey-Fuller distributions.
The intuition behind this result is that an integrated variable has an infinite
memory so the correlation between $y_{t-1}$ and $\varepsilon_t$ does not disappear as $T$ grows.
The nonstandard distribution remains, and gets worse if we choose to regress two
independent integrated variables against each other. Assume that $x_t$ and $y_t$ are
two random walk variables, such that
$$y_t = y_{t-1} + \varepsilon_t \quad \text{and} \quad x_t = x_{t-1} + u_t,$$
where both $\varepsilon_t$ and $u_t$ are $NID(0, \sigma^2)$. In this case, $\beta$ would equal zero in the
model, $y_t = \beta x_t + \eta_t$.
When $y_t$ and $x_t$ are independent, one would expect the estimated t-value from this model
to be centred on zero. This is not what happens when $y_t$ and $x_t$ are also
integrated variables. In this case the empirical t-value does not settle around zero; in absolute value it will typically exceed 2.0, and it grows with the sample size, leading
to spurious correlation if a standard t-table at 5% is used to test for dependence
between the variables. The problem can be described as follows. If $\beta$ is zero,
the residual term $\eta_t$ will be I(1), having the same sample moments as $y_t$. Since $y_t$
is a random walk we know that the variance of $\eta_t$ will be time dependent and
non-stationary as $T$ goes to infinity. The sample estimate of $\sigma^2_\eta$ is therefore not
representative for the true long run variance of the $y_t$ series. The OLS estimator gives
$$\hat\beta = \beta + \left[\sum_{t=1}^{T} x_t x_t\right]^{-1}\left[\sum_{t=1}^{T} x_t\eta_t\right],$$
which, if the variables are integrated, converges to
$$(\hat\beta - \beta) \Rightarrow [B_1(t)]^{-1}[B_2(t)],$$
where $B_1$ and $B_2$ represent sample moments that are functions of random
variables, which follow a Brownian motion (Wiener process).
The intuition here is that in the long run random walk variables collapse to their
continuous time counterpart, which is the Brownian motion (the Wiener process).
The important difference is that instead of having sample moments which are
constant in the long run, we have a ratio between two random variables which are
functions of Brownian motions. In this situation the estimated
parameters end up following non-standard distributions.
It is easy to understand why this is a bit problematic; just recall that a random
walk can be written as the sum of all shocks to the series, plus the initial value.
In other words, the sample moments in this case are sums of partial sums, since
each observation of $x_t$ can be written as a sum of shocks.
The estimated parameter will still converge to its true sample moment. The
variance, however, will be different. It can be shown that in this case the sample
moment of $\hat\beta$ is $(\hat\beta - \beta) \sim NID[0, \sigma^2(t)]$, where the variance is a function of time.
The estimate of $\beta$ is still asymptotically correct and normally distributed but its
variance is the variance of a Brownian motion. It can also be shown that the
convergence of $\hat\beta$ to its true value is much faster than under OLS with stationary variables. Stock (1985)
showed that the rate of convergence is $1/T$, instead of the standard OLS rate $1/T^{1/2}$.
This is known as super convergence. Unfortunately, this is only an asymptotic
result. In most applications the short run dynamics between the variables will
seriously bias the OLS estimate in this situation.
The consequence is that if one tries to use standard tables, like t or F, to
test the significance of $\beta$, one might not be able to reject spurious results. The
true "t-values" of this model will be much higher. If one regresses one or more
independent random walks against each other, standard t and F tables become
useless and will lead the researcher to accept hypotheses of correlation when there
is no correlation whatsoever.
These results might look like a special case, but they are not. In fact they carry
over to small sample estimates involving all types of integrated and near integrated
variables. The distributions of the parameters based on strongly autocorrelated
data are closer to the ones of a random walk, than those of standard stationary
normal variables. These results stress the importance of testing for the type of
non-stationarity, order of integration, and presence of cointegration when working
with time series. Otherwise one can easily fall into the spurious regression trap.
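The spurious regression trap is easy to reproduce by simulation. The sketch below is illustrative and rests on assumptions: two independent driftless random walks with standard normal innovations, a regression without a constant, and a nominal 5 % two-sided test based on the normal critical value 1.96.

```python
import random


def spurious_t(T, rng):
    """t-value of beta in y_t = beta * x_t + eta_t for two independent random walks."""
    x = 0.0
    y = 0.0
    xs, ys = [], []
    for _ in range(T):
        x += rng.gauss(0.0, 1.0)
        y += rng.gauss(0.0, 1.0)
        xs.append(x)
        ys.append(y)
    sxx = sum(v * v for v in xs)
    sxy = sum(a * b for a, b in zip(xs, ys))
    b_hat = sxy / sxx
    ssr = sum((yi - b_hat * xi) ** 2 for xi, yi in zip(xs, ys))
    s2 = ssr / (T - 1)
    return b_hat / (s2 / sxx) ** 0.5


def rejection_rate(T, reps=1000, seed=9):
    """Share of samples in which |t| exceeds the standard value 1.96."""
    rng = random.Random(seed)
    hits = sum(abs(spurious_t(T, rng)) > 1.96 for _ in range(reps))
    return hits / reps


rate_50 = rejection_rate(50)
rate_200 = rejection_rate(200)
print(rate_50, rate_200)
```

Although the two series are unrelated by construction, the standard test should reject far more often than 5 % of the time, and more often in the longer sample.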
The problems are likely to carry over even to the situation when $\beta$ is different
from zero. In this case $\eta_t$ will be stationary, but the distribution of $\hat\beta$ is nonstandard
as long as the two residual terms $\varepsilon_t$ and $u_t$ are dependent. In general, without a
priori knowledge, the estimated standard errors from integrated variables must be
assumed to follow nonstandard distributions. The estimated equation must either
be modified, or cointegration tests must be carried out.



Often you will find that there are several alternative variables that you can put
into a model; there might be several measures of income, capital or interest rates
to choose from. Going from a general to a specific model, several models of the
same dependent variable might display white noise innovation terms and stable
parameters that all have signs and sizes that are in line with economic theory.
A typical example is given by Mankiw and Shapiro (1986), who argue that in
a money demand equation, private consumption is a better variable than income.
Thus, we are faced with two empirical models of money demand (footnote 1). The first model is
$$m_t = \beta_1 y_t + \beta_2 y_{t-1} + \beta_3 r_t,$$
and the second is
$$m_t = \gamma_1 c_t + \gamma_2 c_{t-1} + \gamma_3 r_t.$$
Which of these models is the best one, given that both can be claimed to be
good estimates of the data generating process? The better model is the one that
explains more of the systematic variation of $m_t$ and explains where other models go
wrong. Thus, the better model will encompass the not so good models. The crucial
factor is that $y_t$ and $c_t$ are two different variables, which leads to a non-nested test.
To understand the difference between nested and non-nested tests, set $\beta_2 = 0$.
This is a nested test because it involves a restriction on the first model only. Now,
set $\beta_1 = \beta_2 = 0$; this is also a nested test, because it only reduces the information
of model one. If $\gamma_1 = \gamma_2 = 0$, this is also a nested test of the second model. Thus,
setting $\beta_1 = \beta_2 = 0$, or $\gamma_1 = \gamma_2 = 0$, are only special cases of each model.
The problem that we would like to address here is whether to choose either $y_t$ or $c_t$
as the scale variable in the money demand equation. This is a non-nested test
because the test cannot be written as a restriction in terms of one model only.
The first thing to consider is that a stable model is better than an unstable
one, so if only one of the models is stable that is the one to choose. The next measure
is to compare the residual variances and choose the model with the significantly
smaller error variance.
However, variance domination is not really sufficient. PcGive therefore offers
more tests that allow the comparison of Model one versus Model two, and vice
versa. Thus, there are three possible outcomes: Model one is better, Model two is
better, or there is no significant difference between the two models.

1 For simplicity we assume that there is only one lag on income and consumption. This should
not be seen as a restriction; the lag length can vary between the first and the second model.






Autoregressive Conditional Heteroscedasticity (ARCH) means that the variance
of a process changes in a systematic way over time. Why should one bother about
heteroscedasticity in time series models? Heteroscedasticity is often viewed as
unimportant in time series modeling, except for the fact that it leads to inefficient
estimates. Recall the linear regression model,
$$y_t = \beta x_t + \varepsilon_t, \qquad \varepsilon_t \sim N(0, \sigma^2),$$
where the residual variance, the variance of $y_t$ around its conditional mean ($\sigma^2$), is
usually assumed to be constant over time. In principle, however, nothing prevents
the variance from varying over time ($\sigma_t^2$).
There are four reasons why this type of heteroscedasticity is important in time
series models. The first is that any departure from white noise residuals
is a sign of misspecification. Heteroscedasticity tests represent a way of detecting
misspecifications originating from leaving out an important explanatory variable
which is totally orthogonal to the other explanatory variables in the model.
Second, if the variance of the model is changing over time so will the forecast
intervals of the model. Hence, for the purpose of making better predictions ARCH
is of interest because it leads to better forecast confidence intervals. One example
is so-called Value at Risk (VaR) models, which are used to forecast the level of
reserves needed to meet cash flow fluctuations.
Third, the modeling of ARCH disturbances is sometimes implied by theory,
and in general it makes sense from economic theory in many situations. ARCH
represents a time series approach to the variance component, which picks up effects
not otherwise included in the model. Various types of time varying risk premiums
are examples of this. Variables such as time varying risk premiums are difficult to
observe and measure. But we can trace their effects on the variance in a model like
the one above. Examples of applications are intertemporal asset market models,
CAPM, exchange rate markets, etc.
Fourth, option prices depend critically on the expected future variance of the price
of the underlying asset. ARCH models offer a way of forecasting variances such
that pricing can be more exact, and more profitable for those who are able to
make better forecasts.
An example of an ARCH(1) model is provided by,
$$y_t = \beta x_t + \varepsilon_t, \qquad \varepsilon_t \sim N(0, h_t),$$
$$h_t = \omega + \alpha\varepsilon_{t-1}^2,$$
where the error variance depends on the lagged squared error. The first equation
is referred to as the mean equation and the second equation is referred to as the
variance equation. Together they form an ARCH model; both equations must be
estimated simultaneously. In the mean equation here, $\beta x_t$ is simply an expression
for the conditional mean of $y_t$. In a real situation this can be explanatory variables,
or an AR or ARIMA process. It will be understood that $y_t$ is stationary and I(0),
otherwise the variance will not exist.
This example is an ARCH model of order one, ARCH(1). ARCH models can
be said to represent an ARMA process in the variance. The implication is that a high
variance in period $t-1$ will be followed by higher variances in periods $t$, $t+1$, $t+2$,
etc. How long the shock persists depends, as in the ARMA model, on the size of
the parameters in combination with the lag lengths. A low variance period is likely
to be followed by another low variance period, but a shock to the process and/or
its variance will cause the variance to become higher before it settles down in the
future. A consequence of an ARCH process is that the variance can be predicted.
In other words, it is possible to predict if the future variances, and standard errors,
will be large or small. This will improve forecasting in general and is a useful tool
for the pricing of derivative instruments.
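The clustering implied by an ARCH(1) process can be illustrated by simulating the error term directly. The sketch below is only an illustration: the parameter values ($\omega = 0.2$, $\alpha = 0.3$), the normal distribution and the helper names are assumptions. It uses the standard result that the unconditional variance of an ARCH(1) error is $\omega/(1-\alpha)$ when $\alpha < 1$.

```python
import random


def simulate_arch1(n, omega=0.2, alpha=0.3, seed=10):
    """Simulate eps_t with conditional variance h_t = omega + alpha * eps_{t-1}^2."""
    rng = random.Random(seed)
    eps_prev = 0.0
    eps = []
    for _ in range(n):
        h = omega + alpha * eps_prev ** 2
        e = h ** 0.5 * rng.gauss(0.0, 1.0)
        eps.append(e)
        eps_prev = e
    return eps


def autocorr(series, lag):
    """Sample autocorrelation of a series at the given lag."""
    n = len(series)
    m = sum(series) / n
    c0 = sum((v - m) ** 2 for v in series)
    ck = sum((series[t] - m) * (series[t - lag] - m) for t in range(lag, n))
    return ck / c0


eps = simulate_arch1(50000)
sample_var = sum(e * e for e in eps) / len(eps)
uncond_var = 0.2 / (1 - 0.3)                        # omega / (1 - alpha)
acf_levels = autocorr(eps, 1)                       # levels are uncorrelated
acf_squares = autocorr([e * e for e in eps], 1)     # squares are correlated
print(sample_var, uncond_var, acf_levels, acf_squares)
```

The levels of the simulated errors should look like white noise, while their squares are clearly autocorrelated, which is what makes the variance predictable.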
An ARCH(q) process is,

xt +

= !+

1 t 1

D(0; ht )
2 t 2

q t q

+ :::


t i:



The expression for the variance shows an autoregressive process in the variance of
$\epsilon_t$. The distribution of the residual term is deliberately left undetermined. In ARCH
models normality is one option, but often the residual process will be non-normal,
leptokurtic, and display thicker tails. Thus, other distributions such
as the Student t-distribution can be a better alternative.
The t-distribution has three parameters: the mean, the variance and the degrees
of freedom. In this case the residual process is $\epsilon_t \sim St(0, h_t, \nu)$,
where $\nu$ is a positive parameter that measures the relative importance
of the peak in relation to the thickness of the tails. The Student t-distribution is
a symmetrical distribution that contains the normal distribution as a special case,
as $\nu \to \infty$.
The ARCH process can be detected by testing for ARCH and by inspecting
the PACF and ACF of the estimated squared residuals $\hat{\epsilon}_t^2$. As is the case for AR
models, ARCH has a more general form, the Generalised ARCH (GARCH), which implies
lagging the dependent variable $h_t$. A long lag structure in the ARCH process can
be substituted with lagged dependent variables to create a shorter process, just as
for ARMA processes. A GARCH(1,1) model is written as,

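$$y_t = \beta x_t + \epsilon_t, \qquad \epsilon_t \sim D(0, h_t),$$
$$h_t = \omega + \alpha_1 \epsilon_{t-1}^2 + \beta_1 h_{t-1}.$$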



The GARCH(1,1) process is a very typical process found in a number of empirical
applications of ARCH processes. The convention is to indicate the length of
the ARCH part with q, and to use the letter p to indicate the length of the lagged variance
$h_t$. The same convention assigns $\alpha_i$ to the ARCH parameters, and $\beta_i$ to the GARCH
parameters. The constant, time-independent part of the variance is usually written $\omega$;
the notation $\alpha_0$ is also used for it later in this chapter. For an asset market this type of process would
imply that there are persistent periods when asset prices fluctuate relatively little
compared with other periods where prices fluctuate more and for longer times. A
general GARCH(q,p) process is,


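$$y_t = \beta x_t + \epsilon_t, \qquad \epsilon_t \sim D(0, h_t),$$
$$h_t = \omega + \sum_{i=1}^{q} \alpha_i \epsilon_{t-i}^2 + \sum_{i=1}^{p} \beta_i h_{t-i}.$$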



ARCH and GARCH models cannot be estimated by OLS or standard regression
programs. It is necessary to use an iterative system estimation method
because the model now consists of two equations, the mean equation and the
variance equation, where the variance equation depends on estimates in the
mean equation. In the example above, the additional parameters, such as $\omega$ and $\alpha$,
must be estimated in the same model. Therefore, some iterative ML estimator
is necessary (special algorithms are also necessary). Gauss is a good program for
estimating ARCH models, but takes some investment to learn. SAS (from ver.
6.08) is quite good, EViews is also good with excellent help facilities, and RATS is
an alternative. Finally, PcGive 10 can also do ARCH and GARCH models. A
practical problem in estimation is that in a finite sample there is no guarantee
that the estimated variance ($h_t$) will be a positive number. For that
reason, software will offer you the opportunity to restrict the value of $\omega$, as well
as the sums of the $\alpha$'s and $\beta$'s, to positive numbers.


Practical Modelling Tips

In practical modelling it is necessary to start with the mean equation. A correct
specification of the mean equation is required in order to get the
variance process right. A stationary autoregressive process, relevant explanatory
variables, and possibly seasonal and other dummies must be included in the
mean equation to get rid of autocorrelation and general misspecification. This is
a relatively easy procedure for financial return series, which often behave like
martingale difference processes.
Notice that ARCH and GARCH disappear with aggregation over time and
with low frequencies in recording data. Thus, ARCH/GARCH is typically never found
for data recorded at intervals longer than a month. Monthly data, or shorter intervals, are necessary
for the modelling of ARCH/GARCH processes. Even if models estimated with
quarterly data and lower frequencies can display ARCH when testing the residuals,
it is usually never possible to build an ARCH/GARCH model with that type of data.
An ARCH process can be identified by testing for ARCH(q) structure in
combination with using ACFs and PACFs of the squared residuals from
the mean equation. Estimate the mean equation, save the estimated residuals,
square them, and inspect the ACFs and PACFs of these squared residuals
to identify a preliminary lag order for the GARCH. However, this method is highly
approximate regarding the order of q and p.
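
The identification steps above (estimate the mean equation, save and square the residuals, inspect their autocorrelations) can be sketched as follows. This is a minimal illustration on simulated data, not a recommended workflow: the AR(1) mean equation, the parameter values and the helper name `acf` are assumptions made for the example, and only the ACF (not the PACF) is computed.

```python
import numpy as np

def acf(x, nlags):
    """Sample autocorrelations of x at lags 1..nlags (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    denom = np.dot(x, x)
    return np.array([np.dot(x[k:], x[:-k]) / denom for k in range(1, nlags + 1)])

# Simulate y_t = phi*y_{t-1} + eps_t with ARCH(1) errors (illustrative values).
rng = np.random.default_rng(0)
T, phi, omega, alpha = 5000, 0.5, 1.0, 0.4
y = np.zeros(T)
eps_prev = 0.0
for t in range(1, T):
    h_t = omega + alpha * eps_prev ** 2           # variance equation
    eps_t = np.sqrt(h_t) * rng.standard_normal()
    y[t] = phi * y[t - 1] + eps_t                 # mean equation
    eps_prev = eps_t

# Step 1: estimate the mean equation by OLS (no constant, for simplicity).
phi_hat = np.dot(y[1:], y[:-1]) / np.dot(y[:-1], y[:-1])
# Step 2: save the residuals and square them.
resid = y[1:] - phi_hat * y[:-1]
# Step 3: compare the ACF of the residuals with the ACF of their squares.
acf_resid = acf(resid, 5)
acf_sq = acf(resid ** 2, 5)
print("ACF of residuals        :", np.round(acf_resid, 3))
print("ACF of squared residuals:", np.round(acf_sq, 3))
```

In this setup the residuals themselves should look roughly uncorrelated, while their squares should show clear positive autocorrelation at the first lags, which is the pattern that points to an ARCH structure.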

17.1 Some ARCH Theory

To explore ARCH models, let us start with the following AR(1) model, which
could represent an asset price,

$$y_t = \phi y_{t-1} + \epsilon_t,$$

where $E(\epsilon_t) = 0$, $Var(\epsilon_t) = \sigma^2$ and $|\phi| < 1$. (Thus, the model is stable and
$y_t$ is stationary.) Furthermore, let us assume that the unconditional mean of $y_t$ is
$E(y_t) = (1/T)\sum y_t$, which is not dependent on time.
The expected value of $y_{t+1}$, conditional on the past history of $y_t$, is

$$E_t(y_{t+1} \mid y_t) = \phi y_t,$$


which varies over time since $y_t$ is a random variable. Now turn to the variance
of $y_{t+1}$,

$$Var(y_{t+1}) = Var(\phi y_t) + Var(\epsilon_{t+1}).$$

This variance consists of two parts. First we have the unconditional variance of
$y_{t+1}$, which for an AR(1) is given by

$$Var(y_{t+1}) = \frac{\sigma^2}{1 - \phi^2}.$$

Second we have the conditional variance of $y_{t+1}$,

$$Var_t(y_{t+1} \mid y_t) = E[y_{t+1} - E(y_{t+1} \mid y_t)]^2 = \sigma^2.$$


We can see that while the conditional expectation of $y_{t+1}$ depends on the information
set $I_t = y_t$, neither the conditional variance ($Var_t$) nor the unconditional variance
($Var$) depends on $I_t = y_t$.
If we extend the forecasts k periods ahead we get, by repeated substitution,

$$y_{t+k} = \phi^k y_t + \sum_{i=1}^{k} \phi^{k-i} \epsilon_{t+i}.$$

The first term is the conditional expectation of $y_t$ k periods ahead. The second
term is the forecast error. Hence, the conditional variance of $y_t$ k periods ahead
is equal to

$$Var_t(y_{t+k}) = \sigma^2 \sum_{i=1}^{k} \phi^{2(k-i)}.$$

It can be seen that the forecast of $y_{t+k}$ depends on the information at time t.
The conditional variance, on the other hand, depends on the length of the forecast
horizon (k periods into the future), but not on the information set. Nothing says
that this conditional variance should be stable. Like the forecast of $y_t$, it could
very well depend on available information as well, and therefore change over time.
So let us turn to the simplest case, where the errors follow an ARCH(1) model.
We have the following model, $y_t = \phi y_{t-1} + \epsilon_t$ where $\epsilon_t \sim D(0, h_t)$, $E(\epsilon_t) = 0$,
$E(\epsilon_t \epsilon_{t-i}) = 0$ for $i \neq 0$, and $h_t = \omega + \alpha \epsilon_{t-1}^2$.
The process is assumed to be stable, $|\phi| < 1$, and since $\epsilon_{t-1}^2$ is non-negative we must
have $\omega > 0$ and $\alpha \geq 0$. Notice that the errors are not autocorrelated, but at the
same time they are not independent since they are correlated in higher moments
through the ARCH effect. Thus, we cannot assume that the errors really are
normally distributed. If we choose to use the normal distribution as a basis for ML
estimation, this is only an approximation. (As an alternative we could think of
using the t-distribution, since the distribution of the errors tends to have fatter tails
than that of the normal.) Looking at the conditional expectations of the mean
and the variance of this process, $E_t(y_{t+1} \mid y_t) = \phi y_t$ and
$Var_t(y_{t+1} \mid y_t) = h_{t+1} = \omega + \alpha (y_t - \phi y_{t-1})^2$.
We can see that both depend on the available information at time t. In particular,
it should be noticed that the conditional variance of $y_{t+1}$ is increased by both positive and
negative shocks in $y_t$.
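Extending the conditional variance expression k periods ahead, as above, we get

$$Var_t(y_{t+k} \mid y_t) = \sum_{i=1}^{k} \phi^{2(k-i)} E_t(h_{t+i}),$$

where $E_t(h_{t+i})$ is the conditional expectation of the error variance i periods
ahead. To solve for the latter, and express the forecast in the same way as the one
for the conditional mean, let us turn to the unconditional variance of $\epsilon_t$,
$E(\epsilon_t \epsilon_t) = \sigma^2$. In terms of $h_t$,

$$E(h_t) = \omega + \alpha E(\epsilon_{t-1}^2),$$

which is,

$$\sigma^2 = \omega + \alpha \sigma^2,$$
$$\sigma^2 - \alpha \sigma^2 = \omega,$$
$$(1 - \alpha)\sigma^2 = \omega,$$

from which we get

$$\sigma^2 = \frac{\omega}{1 - \alpha},$$

which, since $\sigma^2 > 0$ and $\omega > 0$, implies that,

$$\alpha < 1.$$

Substitute $\omega = (1 - \alpha)\sigma^2$ into $h_t$,

$$h_t = (1 - \alpha)\sigma^2 + \alpha \epsilon_{t-1}^2,$$

to get the relationship between the conditional and the unconditional variances
of $y_t$. The expected value of $h_t$ in any period i is,

$$E_t(h_{t+i}) = \sigma^2 + \alpha E_t[h_{t+i-1} - \sigma^2].$$

Repeated substitution leads to the conditional variance k periods ahead,

$$Var_t(y_{t+k} \mid y_t) = \sigma^2 \sum_{s=1}^{k} \phi^{2(k-s)} + (h_{t+1} - \sigma^2) \sum_{s=1}^{k} \phi^{2(k-s)} \alpha^{s-1}.$$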







The first term on the RHS is the long run unconditional forecast variance of
$y_t$. The second term represents the memory in the process, given by the presence
of $h_{t+1}$. If $\alpha < 1$ the influence of $(h_{t+1} - \sigma^2)$ will die out in the long run and
the second term vanishes. Thus, for long-run forecasts it is only the unconditional
forecast variance which is of importance. Under the assumption of $\alpha < 1$ the
memory in the ARCH effect dies out. (Below we will relax this assumption, and
allow for unit roots in the ARCH process).
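
As a numerical illustration of how the memory term dies out, the sketch below evaluates the k-period-ahead conditional variance for an AR(1) with ARCH(1) errors, using $E_t(h_{t+s}) = \sigma^2 + \alpha^{s-1}(h_{t+1} - \sigma^2)$ with $\sigma^2 = \omega/(1-\alpha)$. The function name `forecast_variance` and all parameter values are assumptions chosen for the example.

```python
import numpy as np

def forecast_variance(k, phi, omega, alpha, h_next):
    """k-step-ahead conditional variance of y for an AR(1) with ARCH(1) errors.

    h_next is h_{t+1}, which is known at time t. Hypothetical helper.
    """
    sigma2 = omega / (1.0 - alpha)                 # unconditional error variance
    s = np.arange(1, k + 1)
    expected_h = sigma2 + alpha ** (s - 1) * (h_next - sigma2)   # E_t(h_{t+s})
    return float(np.sum(phi ** (2 * (k - s)) * expected_h))

phi, omega, alpha = 0.5, 1.0, 0.3                  # illustrative values
sigma2 = omega / (1.0 - alpha)
long_run = sigma2 / (1.0 - phi ** 2)               # unconditional forecast variance
for k in (1, 2, 5, 20, 200):
    print(k, forecast_variance(k, phi, omega, alpha, h_next=4.0))
print("long run:", long_run)
```

Starting from a high-variance period ($h_{t+1} = 4$ here), the forecast variance moves towards the long-run value as k grows, since both $\phi^2$ and $\alpha$ are below one.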

17.2 Some Dierent Types of ARCH and GARCH

ARCH models represent a class of models where the variance is changing over time
in a systematic way. Let us now define different types of ARCH models. In all
these models there is always a mean equation, which must be correctly specified
for the ARCH process to be modeled correctly.
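1) ARCH(q); the ARCH model of order q,

$$h_t = \alpha_0 + \sum_{i=1}^{q} \alpha_i \epsilon_{t-i}^2 = \alpha_0 + A(L)\epsilon_t^2,$$

where $A(L) = \alpha_1 L + \alpha_2 L^2 + \dots + \alpha_q L^q$.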



This is the basic ARCH model from which we now introduce different effects.
2) GARCH(q, p): Generalized ARCH models.
If q is large then it is possible to get a more parsimonious representation by
adding lagged $h_t$ to the model. This is like using ARMA instead of AR models.
A GARCH(q, p) model is

$$h_t = \alpha_0 + \sum_{i=1}^{q} \alpha_i \epsilon_{t-i}^2 + \sum_{i=1}^{p} \beta_i h_{t-i} = \alpha_0 + A(L)\epsilon_t^2 + B(L)h_t,$$



where $p \geq 0$, $q > 0$, $\alpha_0 > 0$, $\alpha_i \geq 0$, and $\beta_i \geq 0$. The sum of the estimated
parameters, $A(1) + B(1) = \sum \alpha_i + \sum \beta_i$, measures the memory of the process. Values of $A(1) + B(1)$
equal to unity indicate that shocks to the variance have permanent effects, like in
a random walk model. High values of $A(1) + B(1)$, but less than unity, indicate a long
memory process: it takes a long time before shocks to the variance disappear.
If the roots of $[1 - B(L)] = 0$ are outside the unit circle the process is
invertible and,

$$h_t = \alpha_0 [1 - B(1)]^{-1} + A(L)[1 - B(L)]^{-1}\epsilon_t^2 = a + D(L)\epsilon_t^2 = a + \sum_{i=1}^{\infty} \delta_i \epsilon_{t-i}^2.$$

If D(L) < 1 then GARCH = ARCH. Moreover, if the long run solution of the
model, B(1), is < 1, the $\delta_i$ will decrease for all i > max(p, q).
GARCH models are standard tools, in particular for modeling foreign exchange
rate markets and financial market data. Often the GARCH(1,1) is the
preferred choice. GARCH models capture some empirical observations quite well. The
distributions of many financial series display fatter tails than the standard normal
distribution. GARCH models in combination with the assumption of a normal
distribution of the residual can generate such distributions. However, many series,
like foreign exchange rates, display both fatter tails and leptokurtosis (the peak
of the distribution is higher than that of the normal). A GARCH process combined with
the assumption that the errors follow the t-distribution can generate this type
of observed data.
Before continuing with different ARCH models, we can now look at an alternative
formulation of ARCH models which shows their similarities with ordinary
time series models. Define the innovations in the conditional variance as,
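$$v_t = \epsilon_t^2 - h_t.$$

The variable $v_t$ can be thought of as surprises in volatility, arising from new,
unexpected, information on the markets. The GARCH model is then,

$$(\epsilon_t^2 - v_t) = \alpha_0 + A(L)\epsilon_t^2 + B(L)(\epsilon_t^2 - v_t),$$

which can be written as,

$$[1 - B(L)]\epsilon_t^2 = \alpha_0 + A(L)\epsilon_t^2 + [1 - B(L)]v_t,$$
$$[1 - B(L) - A(L)]\epsilon_t^2 = \alpha_0 + v_t - B(L)v_t,$$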




which is an ARMA process. This shows us that we can identify a GARCH
process using the same tools as for an ARMA model, that is, by looking at the
autocorrelations and partial autocorrelations of $\hat{\epsilon}_t^2$, estimated from OLS.
Solving for the GARCH(1,1) model,

$$\epsilon_t^2 = \alpha_0 + (\alpha_1 + \beta_1)\epsilon_{t-1}^2 - \beta_1 v_{t-1} + v_t.$$

If $\alpha_1 + \beta_1 = 1$, or $\sum(\alpha_i + \beta_i) = 1$ in a GARCH(q, p) model, we get what is called
an integrated GARCH model.
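
The ARMA representation of the GARCH(1,1) can be checked numerically. The sketch below simulates a GARCH(1,1) with conditionally normal errors (an assumption, as are the parameter values and the function name `simulate_garch11`), constructs $v_t = \epsilon_t^2 - h_t$, and verifies that $\epsilon_t^2 = \alpha_0 + (\alpha_1 + \beta_1)\epsilon_{t-1}^2 - \beta_1 v_{t-1} + v_t$ holds observation by observation.

```python
import numpy as np

def simulate_garch11(T, a0, a1, b1, rng):
    """Simulate eps_t = sqrt(h_t) z_t, h_t = a0 + a1*eps_{t-1}^2 + b1*h_{t-1}."""
    eps = np.zeros(T)
    h = np.zeros(T)
    h[0] = a0 / (1.0 - a1 - b1)          # start at the unconditional variance
    eps[0] = np.sqrt(h[0]) * rng.standard_normal()
    for t in range(1, T):
        h[t] = a0 + a1 * eps[t - 1] ** 2 + b1 * h[t - 1]
        eps[t] = np.sqrt(h[t]) * rng.standard_normal()
    return eps, h

rng = np.random.default_rng(1)
a0, a1, b1 = 0.1, 0.1, 0.8               # illustrative values, a1 + b1 < 1
eps, h = simulate_garch11(1000, a0, a1, b1, rng)

v = eps ** 2 - h                          # innovations in the conditional variance
lhs = eps[1:] ** 2
rhs = a0 + (a1 + b1) * eps[:-1] ** 2 - b1 * v[:-1] + v[1:]
max_gap = np.max(np.abs(lhs - rhs))
print("largest deviation from the ARMA(1,1) form:", max_gap)
```

The deviation is zero up to rounding error, because the ARMA form is only a rewriting of the variance equation. The AR coefficient is $\alpha_1 + \beta_1$, which is why a sum close to one gives very persistent squared residuals.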


3) ARCH(q) model with explanatory variables,

$$h_t = \alpha_0 + A(L)\epsilon_t^2 + \gamma x_t,$$

where $x_t$ is a vector of explanatory variables, and $\gamma$ a vector of parameters.
In this model we have added explanatory variables into the ARCH process,
just like we can add exogenous explanatory variables into an ARMA model.
4) M-ARCH. Multivariate ARCH.
The multivariate ARCH is basically an extension of the univariate model to a
system of equations with time varying variances and covariances, like

$$H_t = \begin{bmatrix} h_{11,t} & h_{12,t} & \dots & h_{1n,t} \\ h_{21,t} & h_{22,t} & \dots & h_{2n,t} \\ \dots & \dots & \dots & \dots \\ h_{n1,t} & h_{n2,t} & \dots & h_{nn,t} \end{bmatrix}.$$

The M-ARCH is like a VAR model for a system of variables, only now the
system is extended to allow for interaction among the variances as well. Typical
applications of multivariate ARCH are CAPM models of asset portfolios.
5) ARCH in mean.
It is possible to put the ARCH process back into the conditional mean of
the process, and let it represent some variable, such as a time-varying risk premium.
In this case we get the following system,

$$y_t = \beta x_t + \delta h_t^{1/2} + \epsilon_t,$$
$$h_t = \alpha_0 + A(L)\epsilon_t^2.$$

There exist various ways of "putting the variance back" in the mean equation.
The example above assumes that it is the standard error which is the interesting
variable in the mean equation.
6) IGARCH. Integrated GARCH.
When the coefficients sum to unity we get a model with extremely long memory
(similar to the random walk model). Unlike the cases discussed earlier, the shocks
to the variance will not die out. Current information remains important for all
future forecasts. We talk about an integrated variance and persistence in variance.
A significant constant term in a GARCH process can be understood as mean
reversion of the variance. But if the variance is not mean-reverting, integrated
GARCH is an alternative, which in a GARCH(1,1) process sets the constant
to zero and restricts the two parameters to sum to unity.
7) EGARCH. Exponential GARCH and ARCH models. (Exponential
due to logs of the variables in the GARCH model.) These models have the interesting
characteristic that they allow for different reactions to negative and
positive shocks, a phenomenon observed on many financial markets. In the output
the first lagged residual indicates the effect of a positive shock, while the
second lagged residual (in absolute terms) indicates the effect of a negative shock.
8) FIGARCH. Fractionally Integrated GARCH.
This approach builds on the idea of fractional integration and allows for a
slow hyperbolic rate of decay for the lagged squared innovations in the conditional
variance function. See Baillie, Bollerslev and Mikkelsen (1996).
9) NGARCH and NARCH. Non-linear GARCH and ARCH models.
10) Common Volatility.
Introduced by Engle and Isle 1989 (and 1993), this allows you to test for common
GARCH structure in different series.


11) Other types of GARCH models.
In the literature there exist a number of X-GARCH-type models. It is
not possible to keep track of all possible twists here, but 1-10 are the relevant ones.

17.3 The Estimation of ARCH models

Let us now turn to the estimation of ARCH and GARCH models. The main
problem is the distribution of the error terms; in general they are not normally
distributed. The most used alternatives are the t-distribution and the gamma
distribution. In applications in finance and foreign exchange rates a t-distribution
is often motivated by the fact that the empirical distributions of these variables display
fatter tails than the normal distribution.
If we assume that the residuals of the model follow a normal distribution, we
have that the errors are conditionally normally distributed, or $\epsilon_t \mid \epsilon_{t-1} \sim N(0, h_t)$.
Using that assumption the following log likelihood function is maximized,

$$\log L = -\frac{T}{2}\log 2\pi - \frac{1}{2}\sum_{t=1}^{T}\left(\log h_t + \frac{\epsilon_t^2}{h_t}\right).$$

Notice that there are two equations involved here, the mean equation and the
variance equation. The process is correctly modelled only when both equations
are correctly modelled.
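
To make the likelihood concrete, here is a minimal sketch of the Gaussian log likelihood for a GARCH(1,1) variance equation, evaluated for given mean-equation errors. The function name `garch11_loglik`, the parameter ordering and the choice to initialise $h_1$ at the sample variance are assumptions made for illustration; actual software may handle start-up values differently.

```python
import numpy as np

def garch11_loglik(params, eps):
    """Gaussian log likelihood of errors eps under a GARCH(1,1) variance equation.

    params = (omega, alpha, beta). The first conditional variance is set to the
    sample variance of eps (an assumption made for this sketch).
    """
    omega, alpha, beta = params
    eps = np.asarray(eps, dtype=float)
    T = len(eps)
    h = np.empty(T)
    h[0] = np.var(eps)
    for t in range(1, T):
        h[t] = omega + alpha * eps[t - 1] ** 2 + beta * h[t - 1]
    return -0.5 * T * np.log(2.0 * np.pi) - 0.5 * np.sum(np.log(h) + eps ** 2 / h)

# With alpha = beta = 0 the variance is constant and the expression reduces to
# the ordinary Gaussian log likelihood with variance omega.
eps_demo = np.array([1.0, -1.0, 2.0, -2.0])
print(garch11_loglik((2.5, 0.0, 0.0), eps_demo))
```

In an application the errors depend on the mean-equation parameters, so the mean and variance parameters would be passed jointly to a numerical optimiser, typically with restrictions that keep every $h_t$ positive.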
To estimate ARCH and GARCH processes, non-standard algorithms are generally
needed. If $y_{t-i}$ is among the regressors some iterative method is always
required. (GAUSS, RATS, SAS provide such facilities.) There are also special
programs which deal with ARCH, GARCH and multivariate ARCH. The research
strategy is to begin by testing for ARCH, by standard test procedures. The following
LM test for q-th order ARCH is an example,

$$\hat{\epsilon}_t^2 = \alpha_0 + \alpha_1 \hat{\epsilon}_{t-1}^2 + \alpha_2 \hat{\epsilon}_{t-2}^2 + \dots + \alpha_q \hat{\epsilon}_{t-q}^2 + \gamma y_t + v_t,$$

where $TR^2 \sim \chi^2(q)$. Notice that this requires that $E(\epsilon_t) = 0$, and $E(\epsilon_t \epsilon_{t-i}) = 0$,
for $i \neq 0$.
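
A minimal sketch of the $TR^2$ statistic follows. It implements the common version of the test, in which the squared residuals are regressed on a constant and q of their own lags only; the function name `arch_lm_stat` and the simulated series are assumptions made for the example.

```python
import numpy as np

def arch_lm_stat(resid, q):
    """T*R^2 from regressing e_t^2 on a constant and q lags of e_t^2."""
    e2 = np.asarray(resid, dtype=float) ** 2
    n = len(e2)
    y = e2[q:]
    lags = [e2[q - i:n - i] for i in range(1, q + 1)]   # e2_{t-1}, ..., e2_{t-q}
    X = np.column_stack([np.ones(n - q)] + lags)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    u = y - X @ coef
    r2 = 1.0 - np.dot(u, u) / np.sum((y - y.mean()) ** 2)
    return (n - q) * r2

rng = np.random.default_rng(42)
T = 2000
white = rng.standard_normal(T)                  # no ARCH
arch = np.zeros(T)                              # ARCH(1) with omega = alpha = 0.5
for t in range(1, T):
    arch[t] = np.sqrt(0.5 + 0.5 * arch[t - 1] ** 2) * rng.standard_normal()

lm_white = arch_lm_stat(white, q=1)
lm_arch = arch_lm_stat(arch, q=1)
# Compare with the 5% critical value of chi-square(1), which is 3.84.
print("white noise:", lm_white, " ARCH(1):", lm_arch)
```

For the ARCH(1) series the statistic should be far above the critical value, while for white noise it should usually stay below it.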
If ARCH is found, or suspected, use standard time series techniques to identify
the process. The specification of an ARCH model can be tested by Lagrange
multiplier tests or likelihood ratio tests. As in time series modeling, the Box-Ljung
test on the estimated residuals from an ARCH equation serves as a misspecification
test. ARCH types of processes are seldom found in low frequency data.
High frequency data is generally needed to observe these effects: daily, weekly,
sometimes monthly data, but hardly ever quarterly or yearly data.
Finally, remember two things. First, ARCH effects imply thicker tails than
the standard normal distribution. It is not obvious that the normal distribution
should be used. On the other hand, there is no obvious alternative either.
Often the normal distribution is the best approximation, unless there is some
other information. One example of other information is that some series are
leptokurtic (a higher peak than the normal) in combination with fat tails. In that
case the t-distribution might be an alternative. Thus, using the normal density
function is often an approximation. Second, correct inference on ARCH effects
builds upon a correct specification of the mean equation. Misspecification tests of
the mean equation are therefore necessary.




The presence of expectations has consequences for econometric model building.
In particular, rational expectations have extremely important consequences. The
most pessimistic views, following from rational expectations, reduce econometric
modeling to simple data description, with little or no room for increasing our
understanding of the behavior of economic agents. Muth's (1961) original definition
of rational expectations goes very far. It assumes that agents know the true
data generating process (DGP) of the complete system. This is in contrast to the
econometrician, who must estimate what he/she thinks is the DGP. The econometrician
must also test for significant changes in his/her model before he/she can
find out whether the process has changed.
In econometrics we can only deal with a limited aspect of rational expectations,
namely expectations formed conditionally on past (observed) history. In contrast
to using econometrics, in the world of Muth and other rational expectation theorists,
agents are free to form the best expectation at any time without estimating,
or making inference from, historical data. We can describe the econometric approach
to rational expectations as follows. Let $x_t^e$ be the expected future value of
the variable $x_t$ held by the agent(s) at time t. The expectation held at time t is
$x_t^e = E(x_{t+1} \mid I_t)$, where $I_t$ is the information set containing the historical data used
to form the expectation.
Under rational expectations, by definition, the information set contains all
relevant information for determining the expectation, so that the expected difference between
the actual outcome of $x_{t+1}$ and its expectation ($x_t^e$) is zero, $E[(x_{t+1} - x_t^e) \mid I_t] = 0$.
This is a weak condition. It allows expectations to be erroneous in individual
periods, but requires that they are correct on average. Thus, in applied work the
difference between the outcome and the expectation should be a martingale difference
process. Assuming that the difference is also a white noise innovation process is
generally stronger than necessary.
If the ordinary, not expectations-based, econometric model is formulated as
$y_t = \beta x_t + e_t$, the assumption of rational expectations leads to the following,

$$y_t = \beta E\{x_{t+1} \mid I_t\} + e_t, \quad \text{or}$$
$$y_t = \beta x_t^e + e_t, \qquad (18.1)$$

where $x_t^e$ is the expected value of the variable $x_t$.¹


Rational vs. Other Types of Expectations

In earlier literature some researchers used to model other types of expectations
than rational expectations, like "myopic" or "static" expectations. These alternatives
are generally ad hoc, and not based on any reasonable assumptions about the
behavior of economic agents. Expectations other than rational imply that agents
might ignore information that would raise their utility. With anything other than
rational expectations, agents will be allowed to make systematic mistakes, implying
that they ignore profit opportunities or that they are not, for some unexplained
reason, maximizing their utility. Economic science has yet to identify such behavior
in the real world. Rational expectations becomes an equilibrium condition
in the sense that the difference between predictions and outcomes cannot be
predicted. A model which allows for predictable differences between the expectations
and the outcome is not complete without an economic explanation of what
the difference means, and why it occurs.

¹ This is a generic example where $x_t^e$ can be any variable, including $y_t^e$.
The correct way to approach the modeling of expectations is to assume that agents
form expectations so that they do not make systematic mistakes that reduce their
welfare. Information used to predict the future will be collected and processed up
to the point where the cost of gathering more information balances the revenue
from additional information. Based on this type of behavior it might, as a special
case, be optimal to use, say, today's value of a variable to predict all future values
of that variable. But these are exceptions to the rule.
In general there is a catch-22 situation in the modeling of rational expectations
behavior. If the econometrician finds from ex post data that the agents are making
systematic mistakes, this is no evidence against the rational expectations hypothesis.
Instead, the empirical finding might be the result of conditioning on the wrong
information set. Alternatively, the modeling of the expectation might be correct,
and be an unbiased and efficient estimate of the expectation held only at a certain
point in time. This argument also includes situations where there is a small probability
of an event with large consequences, such as devaluations, unpredicted changes
in the monetary regime, wars, natural disasters etc. To examine these situations
generally requires further testing of the model, where the outcome will depend to a
large extent on assumptions regarding the distributions of the processes, whether they are
linear or non-linear, etc.
The discussion about other types of expectations brings us to the concepts of
forward looking vs. backward looking behavior. The difference can be explained as
follows. Consumption based on forward looking behavior is determined on the
basis of expected future income. Consumption based on actual (existing) income
is backward looking. In practice there might not be a big difference, since your present or
recent income might be a good approximation to your future income. In some cases
rational expectations might be to base decisions on contingent rules, and revise
these rules only when the cost of deviating from the optimal/desired consumption
is too big (or when the alternative cost of being outside equilibrium is too high).


Typical Errors in the Modeling of Expectations

Without given values of the expected value there are two types of common mistakes
in econometric models of expectations-driven stochastic processes. The first
mistake is to substitute $x_t^e$ with the observed value $x_t$. This leads to an
error-in-variables problem, since $x_t = x_t^e + v_t$, where $E(v_t) = 0$. The error-in-variables
problem implies that $\beta$ will not be estimated correctly. OLS is inconsistent for the
estimation of the original parameter.
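
The size of the inconsistency can be made explicit with a short worked equation. It is a sketch under added assumptions that are not stated in the text: all variables have mean zero, and $v_t$ is uncorrelated with both $x_t^e$ and $e_t$.

```latex
y_t = \beta x_t^e + e_t = \beta x_t + (e_t - \beta v_t),
\qquad
\operatorname{plim}\hat{\beta}_{OLS}
  = \frac{\operatorname{Cov}(x_t, y_t)}{\operatorname{Var}(x_t)}
  = \beta \, \frac{\operatorname{Var}(x_t^e)}{\operatorname{Var}(x_t^e) + \operatorname{Var}(v_t)} .
```

Under these assumptions the OLS estimate is pulled towards zero, and more strongly so the larger the variance of the expectation error is relative to the variance of the expectation itself.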
The second mistake is to model the process for $x_t$ and substitute this process
into 18.1. Assume that the variable $x_t$ follows an AR(2) process, like

$$x_t = a_1 x_{t-1} + a_2 x_{t-2} + n_t, \qquad n_t \sim NID(0, \sigma_n^2).$$

Estimation of equation 18.1 leads to,

$$y_t = \beta a_1 x_{t-1} + \beta a_2 x_{t-2} + e_t = \pi_1 x_{t-1} + \pi_2 x_{t-2} + e_t.$$


This estimated model also gives the wrong results if we are interested in estimating
the (deep) behavioral parameter $\beta$. The variables $x_{t-1}$ and $x_{t-2}$ are not
weakly exogenous for the parameter of interest ($\beta$) in this case. The estimated
parameters will be a mixture of the deep behavioral parameter and the parameters
of the expectations generating process ($a_1$ and $a_2$).
Not only are the estimates biased, but policy conclusions based on this estimated
model will also be misleading. If the parameters of the marginal model ($a_1$ and
$a_2$) describe some policy reaction function, say a particular type of money supply
rule, changing this rule, i.e. changing $a_1$ and $a_2$, will also change the estimated
coefficients on $x_{t-1}$ and $x_{t-2}$ ($\pi_1 = \beta a_1$ and $\pi_2 = \beta a_2$). This is
a typical example of when super exogeneity does not hold, and when an estimated
model cannot be used to form policy recommendations.
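
The argument can be illustrated with a small simulation. In the sketch below agents form $x_t^e = a_1 x_{t-1} + a_2 x_{t-2}$, the deep parameter $\beta$ is held fixed, and the "policy rule" ($a_1$, $a_2$) is changed between two samples. The function name `reduced_form`, the noise scale and all parameter values are assumptions made for the example.

```python
import numpy as np

def reduced_form(a1, a2, beta, T, rng):
    """OLS coefficients from regressing y_t on x_{t-1} and x_{t-2}."""
    x = np.zeros(T)
    n = rng.standard_normal(T)
    for t in range(2, T):
        x[t] = a1 * x[t - 1] + a2 * x[t - 2] + n[t]      # marginal model
    x_exp = a1 * x[1:-1] + a2 * x[:-2]                   # expectation of x_t, t = 2..T-1
    y = beta * x_exp + 0.5 * rng.standard_normal(T - 2)  # conditional model
    X = np.column_stack([x[1:-1], x[:-2]])
    pi_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return pi_hat

rng = np.random.default_rng(7)
beta = 2.0                                               # deep parameter, unchanged
pi_old = reduced_form(0.5, 0.2, beta, 20000, rng)        # old policy rule
pi_new = reduced_form(0.2, 0.1, beta, 20000, rng)        # new policy rule
print("old rule:", np.round(pi_old, 3), " new rule:", np.round(pi_new, 3))
```

The estimated coefficients are close to $\beta a_1$ and $\beta a_2$ in each regime, so they shift when the rule changes even though $\beta$ does not.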
What is the solution to this dilemma of estimating "deep" behavioral parameters,
in order to understand the working of the economy better?
1. One conclusion is that econometrics will not work. The problems of correctly
specifying the expectation process, in combination with short samples, make
it impossible to use econometrics to estimate deep parameters. A better
alternative is to construct micro-based theoretical models and simulate these
models. (As an example, use calibration techniques.)
2. Sims' solution was to advocate VAR models, and avoid estimating deep
parameters. VAR models can then be used to increase our understanding
of the economy, and be used to simulate the consequences of unpredictable
events, like monetary or fiscal policy shocks, in order to optimize policy.
3. Though the rational expectations critique (Lucas, Sims and others) seems to
be devastating for structural econometric modeling, the critique has yet to
be proven. In surprisingly many situations, policy changes appear to have
small effects on estimated equations, e.g. the effects of the switch in monetary
policy in the UK in the early 1980s.
4. Finally, the assumption of rational expectations provides a priori information
that can be used to formulate an econometric model from the beginning.
There are, in principle, three ways in which one can approach this problem: i)
substitution, ii) system estimation based on the Full Information Maximum
Likelihood (FIML) estimator, or iii) use of the Generalized Method of Moments
(GMM) estimator.
Substitution means to replace the expected explanatory variable with an
expectation. This expectation could either be a survey expectation or an
expectation generated by a forecasting model, e.g. an ARIMA model. The
FIML method can be said to build the econometric forecast into an estimated
system. The GMM estimator builds on the assumption that the explanatory
variable and the residuals are orthogonal to each other. Since rational
expectations implies that the (rationally expected) explanatory variables are
orthogonal to the residuals, the GMM estimator is well suited for rational
expectations models. Because of this it is the preferred choice when it comes
to estimating rational expectations models, especially in finance applications.



Modeling Rational Expectations

(This section is very incomplete -see overheads)

The substitution approach is perhaps the easiest way of modeling rational
expectations. The approach is to find an estimate of $E\{x_t \mid I_t\}$. The simplest
approach is to let the information set contain only historical values of $x_t$. As an
example suppose that $x_t$ is an AR(1) process, so $x_t = \phi_1 x_{t-1} + v_t$ where $v_t$ is
$NID(0, \sigma^2)$. The estimated process gives the estimates $\hat{x}_t$ that can be substituted
into equation 18.1. The outcome of the substitution is

$$y_t = \beta \hat{x}_t + u_t \quad \text{where} \quad u_t = e_t - \beta(\hat{x}_t - x_t^e) = e_t - \beta(v_t - \hat{v}_t).$$

OLS will lead to an unbiased estimate of $\beta$, because $\hat{x}_t$ is weakly exogenous
w.r.t. $\beta$.
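
A minimal simulated sketch of the substitution approach follows, comparing it with the naive regression on the observed $x_t$. The data generating process (agents expecting $x_t^e = \phi_1 x_{t-1}$), the parameter values and the variable names are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(3)
T, beta, phi1 = 20000, 2.0, 0.8                  # illustrative values
v = rng.standard_normal(T)
x = np.zeros(T)
for t in range(1, T):
    x[t] = phi1 * x[t - 1] + v[t]                # marginal AR(1) model for x_t

x_exp = phi1 * x[:-1]                            # agents' expectation of x_t, t = 1..T-1
y = beta * x_exp + rng.standard_normal(T - 1)    # y_t = beta * x_t^e + e_t
x_act = x[1:]                                    # observed x_t

# Naive approach: replace x_t^e with the observed x_t (error-in-variables).
beta_naive = np.dot(x_act, y) / np.dot(x_act, x_act)

# Substitution approach: estimate the AR(1) and use its fitted values.
phi_hat = np.dot(x[1:], x[:-1]) / np.dot(x[:-1], x[:-1])
x_hat = phi_hat * x[:-1]
beta_sub = np.dot(x_hat, y) / np.dot(x_hat, x_hat)

print("naive:", round(beta_naive, 3), " substitution:", round(beta_sub, 3))
```

In this setup the naive estimate settles well below $\beta$, while the estimate based on the fitted values is close to it.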
FIML estimation builds on substituting $x_t^e$ with the actual value $x_t$ and estimating
this equation simultaneously with the marginal model for $x_t$, say the AR(1)
model assumed in the substitution example above.
GMM and Instrumental Variables techniques start with substitution of the
expected value ($x_t^e$) with the actual observation ($x_t$), and then approach the
error-in-variables problem. The key to the solution lies in the assumption that the
difference between the expectations and the actual outcome is orthogonal to the
information set used, the basic assumption for the method of moments estimator.
The variables in the marginal process and the possible exogenous variables in the
conditional model can then be used as instruments in the estimation of $\beta$.
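
The instrumental variables idea can be sketched in the simplest just-identified case, using the lagged value $x_{t-1}$ from the information set as the single instrument for $x_t$. This is an illustration on simulated data under assumed parameter values, not a full GMM implementation.

```python
import numpy as np

rng = np.random.default_rng(11)
T, beta, phi1 = 20000, 2.0, 0.8                      # illustrative values
x = np.zeros(T)
for t in range(1, T):
    x[t] = phi1 * x[t - 1] + rng.standard_normal()   # x_t = x_t^e + v_t

y = beta * phi1 * x[:-1] + rng.standard_normal(T - 1)  # y_t = beta * x_t^e + e_t
x_act = x[1:]                                        # observed x_t replaces x_t^e
z = x[:-1]                                           # instrument: x_{t-1}, in I_{t-1}

beta_ols = (x_act @ y) / (x_act @ x_act)             # inconsistent (error-in-variables)
beta_iv = (z @ y) / (z @ x_act)                      # uses the moment E[z_t * u_t] = 0
print("OLS:", round(beta_ols, 3), " IV:", round(beta_iv, 3))
```

The instrument works because the expectation error is orthogonal to everything in the information set, which is exactly the moment condition the estimator imposes.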


Testing Rational Expectations

(To be completed)
Tests concerning given values of $x_t^e$.
Given some values of the expectation process, there are three types of tests
that can be performed.
1. Test if the difference between the expectation and the outcome is a martingale
difference process, conditional on assumptions regarding risk premiums.
2. Test for news. Under the assumption of rational expectations the
expectations-driven variable should only react to unpredictable events ("news") but not to
events that can be predicted. These assumptions are directly testable as
soon as we have a forecasting model for $x_t^e$.
3. Variance bounds tests. Again, given $x_t^e$, it follows that the variance of $y_t$ in
equation 18.1 must be higher than the variance of $x_t^e$.
Encompassing tests
If a model that takes account of assumed rational expectations behavior
is the correct model, it follows that this model should encompass other models
which lack this feature. Thus, encompassing tests can be used to discriminate between
models based on rational expectations and other models.
Tests of super exogeneity
It follows from the rational expectations assumption that the parameters of
the conditional model will change whenever the parameters of the marginal model
change. First, if it can be established that the conditional model is stable while the
marginal model changes, this would be evidence against the rational expectations
assumption, at least in the form of forward looking behavior. In the same way, it
is possible to test for joint changes/shifts in the marginal and conditional models.
Is rational expectations important? The answer is that it depends on your problem.
If you really want to estimate a stochastic phenomenon derived from
theory, especially in finance, it is important to take rational expectations into
account. It has to be at least weakly rational expectations, because nobody
has found any solid evidence against weak rational expectations. However, if
you want to forecast or do standard structural modelling, you can test for
super exogeneity, and thereby also for rational expectations. Ericsson and
Hendry (1989), Ericsson and Irons (1995), and Ericsson and Hendry (1997)
do this for almost all instances of radical economic policy changes and find
no evidence of the structural breaks in the econometric models predicted by
the rational expectations theory. Thus, in practice it is not a big problem
unless you want it to be a big problem.






This section describes a research strategy for finding a well-defined statistical
model of the DGP, which also has an economic interpretation.
I. Start from theory!
Economic theory gives the parameters of interest and the relevant variables
for estimating these parameters. Furthermore, theory suggests interesting long-run
equilibria, homogeneity conditions etc. It is important to remember that theories
are constructions of the human mind. The available data, on the other hand,
is the real world. But there might not be a one-to-one mapping between the
variables of the real world and theory, no matter how good the theory might
be. Aggregation over time and individual units, adjustment costs, measurement
errors etc. will affect the estimated model.
II. Determine the order of integration and type of non-stationarity
among the variables.
Are some or all variables non-stationary? What type of non-stationarity?
The null should be integrated of order one, unless there is sufficient evidence
to reject this hypothesis. Once you know the order of integration you know
how to organise variables into meaningful statistical relations. You can test for
cointegration, or co-trending, and with this knowledge formulate stationary relations
where standard inference is possible, and where you can separate long-run relations
(or alternatively permanent shocks) and short-term relations.
The golden rule is that if a variable looks like I(1), treat it like an I(1) variable
unless you have clear evidence to reject that hypothesis.
III. Build a VAR and test for cointegration among integrated variables.
Cointegration tests aim at identifying long-run stable (stationary) economically
interesting relationships among the variables. This can be done 1) in the form of
testing specific relations such as PPP, consumption function, money demand etc.,
or 2) in the case of building and modeling systems, in a "complete system"
or by dividing your problem into separate variables such as domestic inflation,
money demand, economic growth etc. Remember the (asymptotic) property of
co-integrating relations: if you find them, they still exist even if you add more
variables to the model.
This requires building a VAR and testing for cointegration. The VAR
will be the point of departure for formulating a reduced form VECM and then a structural
VECM, or single structural equations.
The critical step is to find a suitable order of the VAR (number of lags). The
principle is to work from general to specific models, and search for parsimonious
models. For cointegration tests a lag order of 2 is the minimum and often optimal.
Sometimes identifying extreme outliers and including impulse or step dummies will help to
cure both non-normality and autocorrelation in all equations. If it is not possible
to get rid of autocorrelation with a small number of lags (perhaps in combination
with dummies and seasonals), the alternative is to focus on the second best. Autocorrelation in these equations is very bad for modelling, but it might not be possible
to achieve both no autocorrelation and a parsimonious model with sufficient
degrees of freedom for inference. In that situation, the relevant question is how
much of the variation in the left-hand side variables it is optimal to model to get a
nearly well-identified statistical model.


The second best in VAR modelling is to get rid of autocorrelation in as many
equations as possible; hopefully this will include that the vector test of no error autocorrelation is not rejected. In this case, study the F-test for the significance of
each lag across the equations in the model. Look at the LR test for comparing lag
orders in the VAR and, most important, choose the model with the smallest information criteria and the smallest residual autocorrelation.1 And, when you test the lag
structure, look at the I(1) test for cointegration and study the estimated $\beta$
for possible economically interesting co-integrating vectors, $\beta' x_{t-1}$. Quite often you
will see a stable vector coming up quite independently of the lag order and
autocorrelation in some residuals.
Once the co-integrating rank is determined it remains to identify the estimated
co-integrating vectors. If there is only one vector this is relatively simple. If
there is more than one vector, the vectors should fulfill the rank condition for
identification of co-integrating vectors. This is explained in the work of Juselius
and Johansen and in more advanced textbooks in econometric time series. The
golden rule is that the vectors should be unique (look different from each other) and,
through the alpha values, determine a left-hand variable. This is achieved by first
choosing a suitable normalization, imposing unit elasticities and/or same value
but opposite signs, and by restricting some parameters to be zero in some vectors.
(Remember that the size of the coefficients is not related to their significance.)
If co-integration is not found?
Rethink the problem. Have you forgotten some important explanatory variable?
Look for outliers and test their effects. Use dummies, trends etc. if they can
be motivated. Look for structural breaks and check the sample size.
Use first (and/or second) differences instead, to get a model with only stationary I(0) variables that leads to estimated parameters with well-defined
distributions. You then have to conclude that your model might not be good for
long-run analysis.
Continue with the modeling process to get, at least, the least bad of all possible models. If possible, show that there is strong a priori information
that justifies the model. Add that cointegration is only an asymptotic result,
and that your sample is too short.
Consider stopping the modelling, and conclude that the absence of cointegration is
an interesting conclusion in itself! (Data problems, wrong theory, missing explanatory factors etc.) Do not waste too much time on a problem where the
answers will depend on ad hoc assumptions concerning distributions,
or on unstable results which will be totally model dependent.
If you find cointegration:
Continue by testing for long-run homogeneity assumptions, weak exogeneity and identification. This can be done by using Johansen's multivariate co-integrating technique. If there is more than one vector, think
about identification of the vectors.
IV. Decide on single or simultaneous model
There are no good tests for weak exogeneity. Typically a good test of simultaneity requires the specification of the complete model to work. And then the
work is already done.
1 In PcGive 12 you need to indicate in the "Option" window under Model choice that you want
information criteria for each model. Then, when you press "Progress", you will see both the F-test for
lag order and the information criteria for the different VAR models you estimated.



If you reduce to a single equation (or very limited systems), can you motivate the
weak exogeneity assumptions?
The reduced form VECM gives you ideas about what a system might look like,
and not look like, through the estimated (significant) alpha values.
It is possible to test for predictability in the VECM by looking at the estimated
alpha values, and argue for reductions of the system.
Of course, from the reduced form VECM the logical step is to construct a simultaneous structural model based on testing the order and the rank conditions
in the model. However, this can be a bit of a challenge, especially if you are
short of time. Furthermore, identification must be done on significant parameters
(including lags), not on the underlying theoretical lag structure.
V. Set up the Error Correction Representation.
In the following we assume that you have chosen to continue with a single equation.
Use the results from Johansen's multivariate cointegration technique, then
formulate an ECM model directly.
Test for cointegration in the ADL representation of the model (PcGive
test). It is necessary to choose lag lengths long enough to get white noise
residuals. Test if the residuals are $NID(0, \sigma^2)$, and add a RESET test if possible.
Having white noise innovation error terms is a necessary condition.
If not white noise innovations?
Add more lags.
Did you forget something important?
Study outliers. Use dummies and trends to get white noise. But remember
that they should be motivated.
Or continue to the least bad of all possible models, see above.
Rethink the problem or stop. RESET test! (Perhaps you should try to
condition on some other variable instead?)
When white noise is established:
Is the equation in line with what you think can be an economically meaningful
long-run equilibrium? Check signs and sizes of parameters.

VI. Reduce the model.

Remove insignificant variables (t-values below 1.0 to begin with). Start at low
lags. Go from general to specific.
Check misspecification/specification during reductions. Run the test summary after each reduction.
In PcGive all reductions are saved under Progress.
It is all about "data mining", but done efficiently, building on the empirical
approach for ARIMA models introduced by Box-Jenkins, and new developments
in statistical theory. Modern mathematical statistical theory explains how you can
go about finding a Data Generating Process by reversing the sampling process
in classical statistics. Textbooks: Spanos, Mittelhammer.


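In the reduction process remember the following identities,

$$\Delta = 1 - L \quad \text{and} \quad 1 = \Delta + L.$$

So if you have, as an example, the terms $\beta_1 x_{t-1} - \beta_2 x_{t-2}$, where $\beta_1 \approx \beta_2$ (or no significant difference), with different signs on the lags, this is also

$$\beta_1 x_{t-1} - \beta_2 x_{t-2} = \beta_1 \Delta x_{t-1} + (\beta_1 - \beta_2) x_{t-2},$$

and if $|\beta_1| \approx |\beta_2|$ then

$$\beta_1 \Delta x_{t-1} + (\beta_1 - \beta_2) x_{t-2} \approx \beta_1 \Delta x_{t-1}.$$

Hence, you save one degree of freedom under these conditions.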
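The reparameterisation behind this reduction can be checked numerically. The sketch below is only an illustration: the coefficient names `beta1` and `beta2` and the data in `x` are hypothetical, chosen so that the two coefficients are close. It confirms that the levels form and the difference form give the same value at every date.

```python
# Hypothetical coefficients and data, chosen only for illustration.
beta1, beta2 = 0.8, 0.75
x = [1.0, 1.5, 1.2, 2.0, 2.4, 2.1]

def levels_form(t):
    # beta1 * x_{t-1} - beta2 * x_{t-2}
    return beta1 * x[t - 1] - beta2 * x[t - 2]

def difference_form(t):
    # beta1 * (x_{t-1} - x_{t-2}) + (beta1 - beta2) * x_{t-2}
    dx_lag = x[t - 1] - x[t - 2]
    return beta1 * dx_lag + (beta1 - beta2) * x[t - 2]

# Largest discrepancy between the two forms over all usable dates.
max_gap = max(abs(levels_form(t) - difference_form(t)) for t in range(2, len(x)))
print(max_gap)
```

When the two coefficients are nearly equal, the term in `x[t - 2]` is nearly zero and can be tested away, which is where the saved degree of freedom comes from.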
VII. Test the stability of the model
Use recursive estimation methods in PcGive. Remember that this is also useful
during the identification of cointegrating vectors. For instance, it will allow you
to see if you need to put (restricted) impulse dummies in the co-integrating vectors.
VIII. Test for rival models. Encompassing tests.
Does your model explain the results, and the failures, of other rival models?
Encompassing tests imply a comparison of the goodness of fit between different
models, based on different explanatory variables. The reduction process might lead
to several models with white noise residuals. To discriminate between these models
they have to be tested against each other.
IX. Test for super exogeneity (rational expectations), if you want.
Establish the stability of the conditional model without using ad hoc trends
or dummies (= criterion for stability).
Test for instability in the marginal model. If it is unstable while the conditional model is stable you have super exogeneity. If the marginal model is unstable
you can go one step further by forcing the marginal model to be stable by imposing
trends and dummies in such a way that it becomes stable. Then put these
trends and dummies into the conditional model and test if they are significant there. If not, you have super exogeneity, and can reject parts of
the assumptions in the rational expectations theory.
X. STOP when you find a model that is consistent with the data
chosen, and where the parameters make economic sense.
In other words, "a well-defined statistical model".
That is a model with white noise innovation residuals and stable parameters,
which also encompasses all other rival models and has an economic meaning. Encompassing means that
your model explains other models and picks up more of the variation in the dependent variables.
XI. Report your results, both parameters and misspecification tests.
It is not sufficient to report only $R^2$ and DW-values. Show the test summary
(corresponding) and graphs of your data, in levels and first differences, and
error terms, etc.
Be open minded and inform the reader of the tests and the problems you have
found. Don't try to prove things which one can easily reject by a simple test. The
rule is to minimize the number of assumptions behind your model, and remember
that the errors are the outcome of the formulation of the model.



Anderson, T.W. (1971) The Statistical Analysis of Time Series, John Wiley & Sons, New York.
Anderson, T.W. (1984) An Introduction to Multivariate Statistical Analysis, John Wiley & Sons, New York.
Banerjee, A., J. Dolado, J.W. Galbraith and D.F. Hendry (1993) Cointegration, Error-Correction and the Econometric Analysis of Non-stationary Data, Oxford University Press, Oxford.
Baillie, Richard T. and Tim Bollerslev (1994) The Long Memory of the Forward Premium, Journal of International Money and Finance 13 (5), p. 565-571.
Baillie, Richard T., Tim Bollerslev and Hans Ole Mikkelsen (1996) Fractionally Integrated Generalized Autoregressive Conditional Heteroscedasticity, Journal of Econometrics 74, p. 3-30.
Banerjee, A., R.L. Lumsdaine and J.H. Stock (1992) Recursive and Sequential Tests of the Unit Root and Trend Break Hypothesis: Theory and International Evidence, Journal of Business and Economic Statistics ?.
Cheung, Y. and K. Lai (1993) Finite Sample Sizes of Johansen's Likelihood Ratio Tests for Cointegration, Oxford Bulletin of Economics and Statistics 55, p.
Cheung, Y. and K. Lai (1995) A Search for Long Memory in International Stock Markets Returns, Journal of International Money and Finance 14 (4),
Davidson, James (1994) Stochastic Limit Theory, Oxford University Press,
Dickey, D. and W.A. Fuller (1979) Distribution of the Estimators for Autoregressive Time Series with a Unit Root, Journal of the American Statistical Association 74.
Diebold, F.X. and G.D. Rudebusch (1989) Long Memory and Persistence in Aggregate Output, Journal of Monetary Economics 24 (September), p. 189-209.
Eatwell, J., M. Milgate and P. Newman, eds. (1990) Econometrics, Macmillan, London.
Eatwell, J., M. Milgate and P. Newman, eds. (1990) Time Series and Statistics, Macmillan, London.
Engle, Robert F., ed. (1995) ARCH Selected Readings, Oxford University Press,
Engle, R.F. and C.W.J. Granger, eds. (1991) Long-Run Economic Relationships. Readings in Cointegration, Oxford University Press, Oxford.
Engle, R.F. and B.S. Yoo (1991) Cointegrated Economic Time Series: An Overview with New Results, in R.F. Engle and C.W.J. Granger, eds., Long-Run Economic Relationships. Readings in Cointegration (Oxford University Press,
Ericsson, Neil R. and John S. Irons (1994) Testing Exogeneity, Oxford University Press, Oxford.
Fuller, Wayne A. (1996) Introduction to Statistical Time Series, John Wiley & Sons, New York.
Freund, J.E. (1972) Mathematical Statistics, 2nd ed., Prentice/Hall, London.
Granger and Newbold (1986) Forecasting Economic Time Series, Academic Press, San Diego.
Granger, C.W.J. and T. Lee (1989) Multicointegration, Advances in Econometrics 8, p. 71-84.
Hamilton, James D. (1994) Time Series Analysis, Princeton University Press, Princeton, New Jersey.
Hargreaves, Colin P., ed. (1994) Nonstationary Time Series Analysis and Cointegration, Oxford University Press, Oxford.
Harvey, A. (1990) The Econometric Analysis of Time Series, Philip Allan, New York.
Hendry, David F. (1995) Dynamic Econometrics, Oxford University Press, Oxford.
Hylleberg, Svend (1992) Modelling Seasonality, Oxford University Press, Oxford.
Johansen, Søren (1995) Likelihood-Based Inference in Cointegrated Vector Autoregressive Models, Oxford University Press, Oxford.
Johnston, J. (1984) Econometric Methods, McGraw-Hill, Singapore.
Kwiatkowski, D., P.C.B. Phillips, P. Schmidt and Y. Shin (1992) Testing the Null Hypothesis of Stationarity Against the Alternative of a Unit Root, Journal of Econometrics 54, p. 159-178.
Lo, Andrew W. (1991) Long-Term Memory in Stock Market Prices, Econometrica 59 (5: September), p. 1279-1313.
Maddala, G.S. (1988) Introduction to Econometrics, Macmillan, New York.
Morrison, D.F. (1967) Multivariate Statistical Methods, McGraw-Hill, New
Pagan, A.R. and M.R. Wickens (1989) Econometrics: A Survey, Economic Journal, 1-113.
Park, J.Y. (1990) Testing for Unit Roots and Cointegration by Variable Addition, in T.B. Fomby and G.F. Rhodes (eds.) Co-integration, Spurious Regressions, and Unit Roots: Advances in Econometrics 8, JAI Press, New York.
Perron, Pierre (1989) The Great Crash, the Oil Price Shock and the Unit Root Hypothesis, Econometrica 57, p. 1361-1401.
Phillips, P.C.B. (1988) Reflections on Econometric Methodology, The Economic Record - Symposium on Econometric Methodology, December, p. 344-359.
Sjöö, Boo (2000) Testing for Unit Roots and Cointegration, memo.
Sowell, F.B. (1992) Modeling Long-Memory Behavior with the Fractional ARMA Model, Journal of Monetary Economics 29 (April), p. 277-302.
Spanos, A. (1986) Statistical Foundations of Econometric Modelling, Cambridge University Press, Cambridge.
Wei, William W.S. (1990) Time Series Analysis. Univariate and Multivariate Methods, Addison-Wesley Publishing Company, Redwood City.

A1 Smoothing Time Series: Lag Windows.
In the discussion about non-stationarity, different ways of removing the trend
in a time series were shown. If the trend is removed from, say, GDP we are left
with swings in the data that can be identified as business cycles. In time series
analysis such cycles are referred to as low frequency or periodic components.
Applications of smoothing filters arise in empirical studies of real business cycles,
and in modelling financial variables such as daily interest rates, where for example news
about inflation and other variables occurs only at monthly intervals and might
cause monthly cycles in the data.1 Smoothing methods, of course, are related
closely to spectral analysis. In this appendix we concentrate on two filters, or lag
windows, which represent the best, or most commonly used, methods for time
series in the time domain.
Start from a time series, $r_t$. What we are looking for is some weights $b_i$ such
that the filtered series $x_t$ is free of low frequency components,

$$x_t = \sum_{i=-k}^{k} b_i r_{t+i}.$$

In this formula the window is applied both backwards and forwards, implying
a combination of backward and forward looking behavior. Whether this is a good
or a bad thing depends totally on the series at hand, and is left to the judgment of
the econometrician. The alternative is to let the window end at time i = 0. The
literature is filled with methods of calculating the weights $b_i$; in this appendix we
will look at the two most commonly used methods: the Parzen window and the
Tukey-Hanning window.
The Parzen window is calculated using the following weights,

$$w_i = \begin{cases} 1 - 6(i/k)^2 + 6(|i|/k)^3, & |i| \le k/2, \\ 2(1 - |i|/k)^3, & k/2 \le |i| \le k, \\ 0, & |i| > k, \end{cases}$$

where k is the size of the lag window. The Parzen window tries to fit a third
degree polynomial to the original series.
An alternative is the so-called Tukey-Hanning window, calculated as,

$$w_i = \begin{cases} \tfrac{1}{2}\left[1 + \cos(\pi i/k)\right], & |i| \le k, \\ 0, & |i| > k. \end{cases}$$
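Both windows are easy to compute. The following is a minimal sketch, assuming that normalizing means rescaling the raw weights so that they sum to one, and that the filter is applied two-sidedly over offsets from -k to k. The function names and the choice k = 4 are illustrative and not part of the text.

```python
import math

def parzen_weight(i, k):
    """Raw Parzen lag-window weight for offset i and window size k."""
    u = abs(i) / k
    if u <= 0.5:
        return 1 - 6 * u**2 + 6 * u**3
    if u <= 1.0:
        return 2 * (1 - u) ** 3
    return 0.0

def tukey_hanning_weight(i, k):
    """Raw Tukey-Hanning lag-window weight for offset i and window size k."""
    u = abs(i) / k
    if u <= 1.0:
        return 0.5 * (1 + math.cos(math.pi * u))
    return 0.0

def normalized_weights(weight_fn, k):
    """Weights for offsets -k..k, rescaled to sum to one (an assumption)."""
    raw = [weight_fn(i, k) for i in range(-k, k + 1)]
    total = sum(raw)
    return [w / total for w in raw]

def smooth(series, weights):
    """Two-sided filter; only dates with a full window are returned."""
    k = (len(weights) - 1) // 2
    offsets = range(-k, k + 1)
    return [sum(w * series[t + i] for w, i in zip(weights, offsets))
            for t in range(k, len(series) - k)]

weights = normalized_weights(parzen_weight, 4)
print([round(w, 4) for w in weights])
```

Both raw weight functions equal one at offset zero and fall to zero at the edge of the window, which is the property the text recommends checking.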
As with the Parzen window, the weights need to be normalized. Under optimal
conditions, that is the correct identification of underlying cycles, the difference between $x_t$ and $r_t$ will appear as a normal distribution. The problem is to determine
the bandwidth, the size of the window, or k in the formula above. Unfortunately
there is no easy way to determine this in practice. Choosing the size of the lag
window involves a trade-off between the bias in the mean and the variance of the
smoothed series. The larger the window the smaller the variance but the higher
the bias. In practice, make sure that the weights at the end of the window are
close to zero, and then judge the best fit from comparing $x_t - r_t$. As a rule of
thumb, choose a bandwidth equal to $N^{2/5}$, the number of observations (N)
raised to the power of 2 over 5. The alternative rule is to set the bandwidth equal
to $N^{1/4}$, or make a decision based on the last significant autocorrelation. Since
the choice of the window is always ad hoc in some sense, great care is needed if
the smoothed series is going to be used to reveal correlations of great economic

Testing the Random Walk Hypothesis using the Variance Ratio Test.
For a random walk, $x_t = x_{t-1} + \varepsilon_t$, where $\varepsilon_t \sim NID(0, \sigma^2)$, we have that
the variance is $\sigma^2 t$ and that the autocovariance function is $cov(x_t, x_{t-k}) = (t - k)\sigma^2$. It follows that $var(x_t - x_{t-1}) = \sigma_1^2 = \sigma^2$, and that $var(x_t - x_{t-k}) = k\sigma_1^2$. Define $\sigma_k^2 = var(x_t - x_{t-k})/k$. For a random walk we get that the estimated variance ratio

$$VR(k) = \frac{\hat{\sigma}_k^2}{\hat{\sigma}_1^2} - 1$$

is not significantly different from zero. The estimated (unbiased)

1 To be clear, we are not saying that daily interest rates necessarily contain monthly cycles,
only that it might be the case. One example is daily observations of the Swedish overnight
interbank rate.



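variances are given as, for k = 1,

$$\hat{\sigma}_1^2 = \frac{1}{T-1}\sum_{t=1}^{T}(x_t - x_{t-1} - \hat{\mu})^2,$$

and, for k > 1,

$$\hat{\sigma}_k^2 = \frac{1}{k(T-k+1)(1-k/T)}\sum_{t=k}^{T}(x_t - x_{t-k} - k\hat{\mu})^2,$$

where $\hat{\mu} = \frac{1}{T}(x_T - x_0)$, and T is the sample size. Assuming homoscedasticity,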

the asymptotic variance of the random variable VR(k) is,

$$\phi(k) = \frac{2(2k-1)(k-1)}{3kT}.$$

Under these assumptions a test statistic is given as

$$Z(k) = \frac{VR(k)}{[\phi(k)]^{1/2}} \xrightarrow{a} N(0,1),$$

where $\xrightarrow{a}$ indicates that the test statistic converges to an asymptotic normal distribution.

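Since many time series, especially in finance, show time varying heteroscedasticity, the test statistic needs to be modified to take this into account. Lo and
MacKinlay (1988) show that a heteroscedasticity consistent estimator of the asymptotic variance is given as,

$$\phi^*(k) = \sum_{j=1}^{k-1}\left[\frac{2(k-j)}{k}\right]^2 \hat{\delta}(j),$$

where

$$\hat{\delta}(j) = \frac{\sum_{t=j+1}^{T}(x_t - x_{t-1} - \hat{\mu})^2 (x_{t-j} - x_{t-j-1} - \hat{\mu})^2}{\left[\sum_{t=1}^{T}(x_t - x_{t-1} - \hat{\mu})^2\right]^2}.$$

The heteroscedasticity consistent test statistic is therefore,

$$Z^*(k) = \frac{VR(k)}{[\phi^*(k)]^{1/2}} \xrightarrow{a} N(0,1).$$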


The test is performed by calculating sequences of VR(k) as k goes from 2 to n,
where n is some chosen fraction of the total number of observations. Since the test
statistics only hold asymptotically, Monte Carlo simulations of limited samples
are recommended. Under the null hypothesis of a random walk, it will not be
possible to reject the hypothesis that Z(k) or Z*(k) are equal to zero.
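The homoscedastic version of the test can be sketched in a few lines. The code below is an illustration under stated assumptions, not a definitive implementation: it takes the variance ratio to be centred at zero (the ratio of the two variance estimates minus one), uses the Lo and MacKinlay (1988) asymptotic variance 2(2k-1)(k-1)/(3kT), and requires k of at least 2 because the statistic is undefined for k = 1. The function name and the simulated series are hypothetical.

```python
import math
import random

def variance_ratio(x, k):
    """Return (VR(k), Z(k)) for levels x_0..x_T, assuming homoscedasticity and k >= 2."""
    T = len(x) - 1                      # number of first differences
    mu = (x[T] - x[0]) / T              # estimated drift
    var_1 = sum((x[t] - x[t - 1] - mu) ** 2 for t in range(1, T + 1)) / (T - 1)
    scale = k * (T - k + 1) * (1 - k / T)
    var_k = sum((x[t] - x[t - k] - k * mu) ** 2 for t in range(k, T + 1)) / scale
    vr = var_k / var_1 - 1.0            # centred at zero under the null
    phi = 2 * (2 * k - 1) * (k - 1) / (3 * k * T)
    return vr, vr / math.sqrt(phi)

# Simulated driftless random walk with a fixed seed, for illustration only.
random.seed(1)
walk = [0.0]
for _ in range(500):
    walk.append(walk[-1] + random.gauss(0.0, 1.0))

for k in (2, 4, 8):
    vr, z = variance_ratio(walk, k)
    print(k, round(vr, 3), round(z, 3))
```

A series whose increments alternate in sign, such as 0, 1, 0, 1, ..., is the opposite extreme: its two-period differences are all zero, so VR(2) equals -1 and the null is rejected decisively.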

20.2 Appendix III Operators

When dealing with random variables and series of data, there are some operators that
simplify the work. This chapter presents the rules of some common operators applied
to random variables and series of observations. These are the expectations operator, the variance operator, the covariance operator, the lag operator, the difference
operator, and the sum operator.2 The formal proofs behind these operators are
not given; instead the chapter states the basic rules for using the operators.
All operators serve the basic purpose of simplifying the calculations and communication involving random variables. Take the expectations operator (E), as
an example. Writing $E(x_t)$ means the same as "I will calculate the mean (or the
first moment) of the observations on the random variable $\tilde{X}$".3 But I am not telling
exactly which specific estimator I would be using, if I were to estimate the mean
from empirical data, because in this context it is not important.
One important use of operators is in investigating the properties of estimators
under different assumptions concerning the underlying process, for instance the
properties of the OLS estimator when the explanatory variables are stochastic,
when the variables in the model are trending, etc.


The Expectations Operator

The first operator is the expectations operator. This is a linear operator and is
therefore easy to apply, as shown by the following rules. In the following, let c
and k be two non-random constants, $\mu_i$ the mean of variable i and $\sigma_{ij}$
the covariance between variable i and variable j. It follows that,

$$E(c) = c,$$

$$E(c\tilde{X}) = cE(\tilde{X}) = c\mu_x,$$

$$E(k + c\tilde{X}) = k + cE(\tilde{X}) = k + c\mu_x,$$

$$E(\tilde{X} + \tilde{Y}) = E(\tilde{X}) + E(\tilde{Y}) = \mu_x + \mu_y,$$

$$E(\tilde{X}\tilde{Y}) = E(\tilde{X})E(\tilde{Y}) + cov(\tilde{X}, \tilde{Y}) = \mu_x\mu_y + \sigma_{xy},$$

where $\sigma_{xy} = 0$ if $\tilde{X}$ and $\tilde{Y}$ are two independent random variables. Compare
with the expectation of $\tilde{X}^2$,

$$E(\tilde{X}^2) = E(\tilde{X})E(\tilde{X}) + var(\tilde{X}) = \mu_x^2 + \sigma_x^2.$$
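The product rule can be verified on a small data set. This sketch treats four hypothetical paired observations as the whole population, so means, variances and covariances are computed with divisor n. The helper names are illustrative.

```python
# Hypothetical paired observations, treated as the whole population.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 1.0, 4.0, 3.0]

def mean(values):
    return sum(values) / len(values)

def cov(a, b):
    # Population covariance (divisor n); cov(a, a) is the variance.
    ma, mb = mean(a), mean(b)
    return mean([(ai - ma) * (bi - mb) for ai, bi in zip(a, b)])

e_xy = mean([x * y for x, y in zip(xs, ys)])        # E(XY)
product_rule = mean(xs) * mean(ys) + cov(xs, ys)    # E(X)E(Y) + cov(X, Y)

e_x2 = mean([x * x for x in xs])                    # E(X^2)
square_rule = mean(xs) ** 2 + cov(xs, xs)           # E(X)^2 + var(X)

print(e_xy, product_rule, e_x2, square_rule)
```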



The expectations operator is linear and straightforward to use, with one
important exception: the expectation of a ratio. This is an important exception since it represents a quite common problem: $E(\tilde{Y}/\tilde{X})$ is not equal to $E(\tilde{Y})/E(\tilde{X})$. The problem is that the numerator and the denominator are not necessarily independent. In this situation it is necessary to use
the p lim operator; alternatively, let the number of observations go to infinity
and use convergence in probability or distribution to analyze the outcome.
In the derivation of the OLS estimator, the following transformation is often
used, when X is viewed as given, $E(\tilde{Y}/\tilde{X}) = E[(1/\tilde{X})\tilde{Y}] = E(\tilde{W}\tilde{Y})$, with $\tilde{W} = 1/\tilde{X}$.
A similar problem occurs in financial economics. If F is the forward foreign
exchange rate, and S is the spot rate, $E(F/S) \neq E(F)/E(S)$. However, $E(\ln F - \ln S) = E(\ln F) - E(\ln S)$.
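A three-point example makes the ratio problem concrete. The forward and spot values below are hypothetical and treated as equally likely outcomes. The mean of the ratio differs from the ratio of the means, while the log difference behaves linearly.

```python
import math

# Hypothetical, equally likely outcomes for the forward (F) and spot (S) rates.
forward = [1.0, 2.0, 4.0]
spot = [2.0, 1.0, 4.0]

def mean(values):
    return sum(values) / len(values)

mean_of_ratio = mean([f / s for f, s in zip(forward, spot)])    # E(F/S)
ratio_of_means = mean(forward) / mean(spot)                     # E(F)/E(S)

mean_log_diff = mean([math.log(f) - math.log(s) for f, s in zip(forward, spot)])
diff_of_mean_logs = mean([math.log(f) for f in forward]) - mean([math.log(s) for s in spot])

print(mean_of_ratio, ratio_of_means)
print(mean_log_diff, diff_of_mean_logs)
```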

2 The probability limit operator is introduced in a later chapter.
3 Notice the difference between an estimator and an estimate.




The Variance Operator

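For the variance operator, var(.) or V(.), we have the following rules,

$$var(c) = 0,$$

$$var(c\tilde{X}) = c^2 var(\tilde{X}) = c^2\sigma_x^2,$$

$$var(k + c\tilde{X}) = c^2 var(\tilde{X}) = c^2\sigma_x^2,$$

$$var(\tilde{Y} + \tilde{X}) = var(\tilde{Y}) + var(\tilde{X}) + 2cov(\tilde{Y}, \tilde{X}) = \sigma_y^2 + \sigma_x^2 + 2\sigma_{yx}.$$

If $\tilde{Y}$ and $\tilde{X}$ are independent, the covariance term is zero and we get,

$$var(\tilde{Y} + \tilde{X}) = var(\tilde{Y}) + var(\tilde{X}) = \sigma_y^2 + \sigma_x^2.$$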
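The variance rules can be checked in the same way. The data and the constants `k` and `c` below are hypothetical, and population moments (divisor n) are assumed.

```python
# Hypothetical data, treated as the whole population.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 1.0, 4.0, 3.0]
k, c = 5.0, 3.0

def mean(values):
    return sum(values) / len(values)

def cov(a, b):
    ma, mb = mean(a), mean(b)
    return mean([(ai - ma) * (bi - mb) for ai, bi in zip(a, b)])

def var(a):
    return cov(a, a)

var_affine = var([k + c * x for x in xs])           # var(k + cX)
var_sum = var([y + x for y, x in zip(ys, xs)])      # var(Y + X)
var_sum_rule = var(ys) + var(xs) + 2 * cov(ys, xs)  # right-hand side of the rule

print(var_affine, c**2 * var(xs), var_sum, var_sum_rule)
```

Adding the constant leaves the variance unchanged, while the scale factor enters squared.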




The Covariance Operator

The covariance operator (cov) has already been used above. It can be thought of
as a generalization of the variance operator. Suppose we have two elements of $\tilde{X}$,
call them $\tilde{X}_i$ and $\tilde{X}_j$. The elements can be two random variables in a multivariate
process, or refer to observations at different times (i) and (j) of the same
univariate time series process. The covariance between $\tilde{X}_i$ and $\tilde{X}_j$ is

$$cov(\tilde{X}_i, \tilde{X}_j) = E\{[\tilde{X}_i - E(\tilde{X}_i)][\tilde{X}_j - E(\tilde{X}_j)]\} = \sigma_{ij}.$$

[To be completed!]
The covariance matrix of a random variable $\tilde{X}$ with p elements can be defined as

$$E\{[\tilde{X} - E(\tilde{X})][\tilde{X} - E(\tilde{X})]'\} = \begin{bmatrix} \sigma_{11} & \sigma_{12} & \cdots & \sigma_{1p} \\ \sigma_{21} & \sigma_{22} & \cdots & \sigma_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{p1} & \sigma_{p2} & \cdots & \sigma_{pp} \end{bmatrix},$$

where $\sigma_{ii} = \sigma_i^2$, the variance of the i:th element.
Like the expectations and the variance operators, there are some simple rules. If we
add constants a and b to $\tilde{X}_i$ and $\tilde{X}_j$,

$$cov(\tilde{X}_i + a, \tilde{X}_j + b) = cov(\tilde{X}_i, \tilde{X}_j).$$

If we multiply $\tilde{X}_i$ and $\tilde{X}_j$ by the constants a and b respectively, we get,

$$cov(a\tilde{X}_i, b\tilde{X}_j) = ab\, cov(\tilde{X}_i, \tilde{X}_j).$$

The covariance operator is sometimes also written as C(.).
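The two covariance rules, invariance to added constants and scaling by the product of the multipliers, can be illustrated as follows. The data and the constants `a` and `b` are hypothetical, and population moments are assumed.

```python
# Hypothetical data, treated as the whole population.
xi = [1.0, 2.0, 3.0, 4.0]
xj = [2.0, 1.0, 4.0, 3.0]
a, b = 2.0, -3.0

def mean(values):
    return sum(values) / len(values)

def cov(u, v):
    mu, mv = mean(u), mean(v)
    return mean([(ui - mu) * (vi - mv) for ui, vi in zip(u, v)])

base = cov(xi, xj)
shifted = cov([x + a for x in xi], [x + b for x in xj])   # cov(Xi + a, Xj + b)
scaled = cov([a * x for x in xi], [b * x for x in xj])    # cov(a Xi, b Xj)

print(base, shifted, scaled)
```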




The Sum Operator

In the following $\sum$ represents the sum operator. The basic definition of the sum
operator is,

$$\sum_{i=m}^{n} x_i = x_m + x_{m+1} + x_{m+2} + \dots + x_n,$$

where m and n are integers, and $m \le n$. The important characteristic of
the sum operator is that it is linear; all proofs of the following rules of the sum
operator build on this fact.
If k is a constant,

$$\sum k x_i = k \sum x_i.$$

Some important rules deal with series of integer numbers, like a deterministic
time trend t = 1, 2, ..., T. These are of interest when dealing with integrated variables and determining the order of probability, that is the order of convergence,
here indicated with O(.),

$$1 + 2 + \dots + T = \tfrac{1}{2}T(T+1) = \tfrac{1}{2}\left[(T+1)^2 - (T+1)\right] = O(T^2),$$

$$1^2 + 2^2 + \dots + T^2 = \tfrac{1}{6}T(T+1)(2T+1) = \tfrac{1}{3}\left[(T+1)^3 - \tfrac{3}{2}(T+1)^2 + \tfrac{1}{2}(T+1)\right] = O(T^3),$$

$$1^3 + 2^3 + \dots + T^3 = \tfrac{1}{4}T^2(T+1)^2 = \tfrac{1}{4}\left[(T+1)^4 - 2(T+1)^3 + (T+1)^2\right] = O(T^4).$$
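The closed forms and their orders of magnitude can be confirmed by brute force. In this sketch the range of T values and the function name are arbitrary choices. The last lines show that the sums divided by T squared, T cubed and T to the fourth settle near 1/2, 1/3 and 1/4.

```python
def closed_forms_hold(T):
    """Compare brute-force sums of t, t^2 and t^3 with their closed forms."""
    s1 = sum(t for t in range(1, T + 1))
    s2 = sum(t**2 for t in range(1, T + 1))
    s3 = sum(t**3 for t in range(1, T + 1))
    return (s1 == T * (T + 1) // 2
            and s2 == T * (T + 1) * (2 * T + 1) // 6
            and s3 == (T * (T + 1) // 2) ** 2)

all_hold = all(closed_forms_hold(T) for T in range(1, 51))

# Leading-order behaviour for a large T: ratios approach 1/2, 1/3 and 1/4.
T = 1000
ratio_1 = sum(range(1, T + 1)) / T**2
ratio_2 = sum(t**2 for t in range(1, T + 1)) / T**3
ratio_3 = sum(t**3 for t in range(1, T + 1)) / T**4
print(all_hold, ratio_1, ratio_2, ratio_3)
```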



The Plim Operator

An estimator should be unbiased, have a minimum variance and be consistent. In
limited samples these requirements will not always be met. To investigate what
happens as the sample size increases towards infinity we use probability limits.
If $\hat{\theta}$ is an estimate of the true parameter $\theta$, we say that the estimator $\hat{\theta}$
is consistent if the probability that we estimate $\theta$ as the sample size increases to
infinity is equal to one. That is, as the sample size approaches the population size,
we should end up with the parameter describing the population and nothing else.
Formally this can be stated as: the estimator $\hat{\theta}$ is a consistent estimator of
$\theta$ if, for arbitrarily small (positive) numbers $\varepsilon$ and $\delta$, there exists a sample size $n_0$
such that,

$$\Pr[\,|\hat{\theta} - \theta| < \varepsilon\,] > 1 - \delta \quad \text{for } n > n_0.$$

This can also be written as

$$\lim_{n \to \infty} \Pr[\,|\hat{\theta} - \theta| < \varepsilon\,] = 1,$$

or, in shorthand, as

$$\hat{\theta} \xrightarrow{p} \theta \quad \text{or} \quad p\lim \hat{\theta} = \theta.$$

Probability limits are useful for examining the asymptotic properties of estimators of stationary processes. There are a few simple rules to follow,

$$p\lim(ax + by) = a\, p\lim(x) + b\, p\lim(y),$$

$$p\lim(xy) = p\lim(x)\, p\lim(y),$$

$$p\lim(x/y) = [p\lim(x)]/[p\lim(y)],$$

$$p\lim(x^{-1}) = [p\lim(x)]^{-1},$$

$$p\lim(x^2) = [p\lim(x)]^2.$$

These rules can be extended to matrices as,

$$p\lim(AB) = p\lim(A)\, p\lim(B),$$

$$p\lim(A^{-1}) = [p\lim(A)]^{-1}.$$

These rules hold regardless of whether the variables are independent or not.
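The definition of consistency can be illustrated by simulation. The sketch below is a hypothetical example: it uses the sample mean of uniform(0, 1) draws as the estimator of the true mean 0.5, and reports the share of replications that fall within a tolerance `eps` for a small and a large sample size. The seed, tolerance and replication count are arbitrary choices.

```python
import random

random.seed(12345)   # fixed seed so the illustration is reproducible

def share_within(n, eps=0.05, reps=200):
    """Share of replications in which the sample mean of n uniform(0, 1)
    draws lies within eps of the true mean 0.5."""
    hits = 0
    for _ in range(reps):
        sample_mean = sum(random.random() for _ in range(n)) / n
        if abs(sample_mean - 0.5) < eps:
            hits += 1
    return hits / reps

share_small = share_within(10)
share_large = share_within(1000)
print(share_small, share_large)
```

As the sample size grows the share approaches one, which is what the probability statement in the definition requires.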


The Lag and the Difference Operators

The lag operator is defined as

$$L^n x_t = x_{t-n}.$$

It can also be used to move forward in a time series,

$$L^{-n} x_t = x_{t+n}.$$

With the lag operator it becomes possible to write long lag structures in a
simpler way.
From the lag operator follows the difference operator

$$\Delta x_t = x_t - x_{t-1},$$

such that

$$\Delta x_t = (1 - L)x_t.$$

Notice that the difference operator can be used as,

$$x_t = \Delta x_t + x_{t-1},$$

or as,

$$x_{t-1} = x_t - \Delta x_t.$$

Differencing at higher order is done as

$$\Delta^d x_t = (1 - L)^d x_t.$$

Setting d = 2 we get,

$$\Delta^2 x_t = (1 - L)^2 x_t = (1 - 2L + L^2)x_t = x_t - 2x_{t-1} + x_{t-2}.$$


The letter d indicates the order of differencing, which can be an integer such as
-2, -1, 0, 1 and 2. It is also possible to use real numbers, typically between -1.5 and
+1.5. With non-integer differencing we come to fractional integration, and so-called
long memory series. If variables are expressed in logs, which is typical
in time series, the first difference will be a close approximation to per cent growth.
The lag operator is sometimes called the backward shift operator and is then
indicated with the symbol $B^n$. The difference operator, defined with the backward
shift operator, is written as $\nabla^d = (1 - B)^d$. Econometricians use the terms lag
operator and difference operator with the symbols above. Time series statisticians
often use the backward shift notation.
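Integer differencing and the log-difference approximation can be sketched as follows. The function `diff` and both data series are hypothetical. Fractional values of d are not covered by this sketch.

```python
import math

def diff(x, d=1):
    """Apply the difference operator (1 - L) to the list x, d times (integer d >= 0)."""
    for _ in range(d):
        x = [current - previous for previous, current in zip(x, x[1:])]
    return x

levels = [1, 4, 9, 16, 25]
second_diff = diff(levels, 2)
# The same result from the expanded polynomial x_t - 2 x_{t-1} + x_{t-2}.
direct = [levels[t] - 2 * levels[t - 1] + levels[t - 2] for t in range(2, len(levels))]

# Log first differences compared with exact growth rates (2% and 1%).
series = [100.0, 102.0, 103.02]
log_growth = diff([math.log(v) for v in series])
exact_growth = [(b - a) / a for a, b in zip(series, series[1:])]

print(second_diff, direct)
print(log_growth, exact_growth)
```

Each application of the operator shortens the series by one observation, which matters when degrees of freedom are scarce.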