
Autocorrelation in Regression Analysis
Tests for Autocorrelation
Examples
Durbin-Watson Tests
Modeling Autoregressive Relationships

What causes autocorrelation?

Misspecification
Data manipulation (before receipt or after receipt)
Event inertia
Spatial ordering

Checking for Autocorrelation


Test: Durbin-Watson statistic:

d = \frac{\sum_{i=2}^{n} (e_i - e_{i-1})^2}{\sum_{i=1}^{n} e_i^2}, \quad \text{for } n \text{ and } K+1 \text{ d.f.}

 Positive         Zone of        No                Zone of        Negative
 autocorrelation  indecision     autocorrelation   indecision     autocorrelation
|________________|______________|_________________|______________|________________|
0             d-lower         d-upper      2      4-d-upper    4-d-lower          4

d near 0 or 4: autocorrelation is clearly evident.
d in a zone of indecision: ambiguous; we cannot rule out autocorrelation.
d near 2: autocorrelation is not evident.

Consider the following regression:


      Source |       SS       df       MS              Number of obs =     328
-------------+------------------------------           F(  2,   325) =   52.63
       Model |  .354067287     2  .177033643           Prob > F      =  0.0000
    Residual |  1.09315071   325  .003363541           R-squared     =  0.2447
-------------+------------------------------           Adj R-squared =  0.2400
       Total |    1.447218   327  .004425743           Root MSE      =    .058

------------------------------------------------------------------------------
       price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         ice |    .060075    .006827     8.80   0.000     .0466443    .0735056
    quantity |  -2.27e-06   2.91e-07    -7.79   0.000    -2.84e-06   -1.69e-06
       _cons |   .2783773   .0077177    36.07   0.000     .2631944    .2935602
------------------------------------------------------------------------------

Because this is time-series data, we should consider the possibility of
autocorrelation. To run the Durbin-Watson test, first we have to declare the
data as time series with the tsset command. Next we use the dwstat command.
Durbin-Watson d-statistic( 3, 328) = .2109072
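
A minimal sketch of these steps in Stata, assuming a time index variable named t (t is an assumption; the model variables follow the output above):

* declare the data as time series (t is an assumed time index variable)
tsset t

* re-fit the model, then request the Durbin-Watson d-statistic
quietly regress price ice quantity
dwstat    // synonym for estat dwatson in newer versions of Stata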

Find the D-upper and D-lower


Check a Durbin-Watson table for the values of d-upper and d-lower:
http://hadm.sph.sc.edu/courses/J716/Dw.html

For n = 20 and k = 2, α = .05, the values are:


Lower = 1.643
Upper = 1.704
Durbin's alternative test for autocorrelation
---------------------------------------------------------------------------
     lags(p) |          chi2               df                 Prob > chi2
-------------+-------------------------------------------------------------
        1    |        1292.509              1                   0.0000
---------------------------------------------------------------------------
                         H0: no serial correlation
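
A sketch of how this test is produced, assuming the model above has just been fit (durbina is the user-written command the next slide names; newer versions of Stata use estat durbinalt):

* Durbin's alternative test for serial correlation
quietly regress price ice quantity
durbina    // estat durbinalt in newer versions of Stata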

Alternatives to the d-statistic


The d-statistic is not valid in models with a lagged dependent variable.
In the case of a lagged LHS variable you must use the Durbin alternative
test (the command is durbina in Stata).

Also, the d-statistic is only for first-order autocorrelation. In other
instances you may use the Durbin alternative test.
Why would you suspect other than first-order autocorrelation?

The Runs Test


An alternative to the D-W test is a formalized examination of the signs of
the residuals. We would expect the signs of the residuals to be random in
the absence of autocorrelation.
The first step is to estimate the model and predict the residuals.

Runs continued
Next, order the signs of the residuals against time (or spatial ordering in
the case of cross-sectional data) and see if there are excessive runs of
positives or negatives. Alternatively, you can graph the residuals and look
for the same trends.

Runs test continued

The final step is to use the expected mean and deviation in a standard
t-test.
Stata does this automatically with the runtest command!
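
A minimal sketch of these steps in Stata; the residual variable name e is an assumption:

* predict the residuals, then test whether their signs occur in random order
quietly regress price ice quantity
predict e, residuals
runtest e    // counts runs above/below the threshold and reports a z statistic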

Visual diagnosis of autocorrelation (in a single series)

A correlogram is a good tool to identify if a series is autocorrelated.

[Figure: correlogram of price, lags 1-40, with Bartlett's formula for MA(q)
95% confidence bands]
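
A sketch of the Stata command that draws such a correlogram:

* autocorrelations of price with Bartlett 95% confidence bands
ac price, lags(40)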

Dealing with autocorrelation


D-W is not appropriate for auto-regressive (AR) models, where:

Y_t = b_0 + b_1 Y_{t-1} + b_2 X_2 + \ldots

In this case, we use the Durbin alternative test.
For AR models, we need to explicitly estimate the correlation between Y_t
and Y_{t-1} as a model parameter.
Techniques:
AR(1) models (closest to regression; first order only)
ARIMA (any order)

Dealing with Autocorrelation


There are several approaches to resolving problems of autocorrelation:
Lagged dependent variables
Differencing the dependent variable
GLS
ARIMA

Lagged dependent variables


The most common solution.
Simply create a new variable that equals Y at t-1, and use it as a RHS
variable.
To do this in Stata, simply use the generate command with the new variable
equal to L.variable:
gen lagy = L.y
gen laglagy = L2.y

This correction should be based on a theoretical belief about the
specification.
May cause more problems than it solves.
Also costs a degree of freedom (one lost observation).
There are several advanced techniques for dealing with this as well; a
sketch with the variables from the running example appears below.
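
A minimal sketch of this correction; lagprice is a hypothetical name:

* generate the lag of the dependent variable and add it to the RHS
gen lagprice = L.price
regress price lagprice ice quantity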

Differencing
Differencing is simply the act of subtracting the previous observation's
value from the current observation.
To do this in Stata, again use the generate command, with a capital D
(instead of the L for lags):

This process is effective; however, it is an EXPENSIVE correction.
This technique throws away long-term trends.
Assumes that rho = 1 exactly.

D1.x = x_t - x_{t-1}
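
A sketch of the differencing step in Stata; d_price is a hypothetical name:

* first-difference the dependent variable
gen d_price = D.price    // D.price equals price - L.price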

GLS and ARIMA


GLS approaches use maximum likelihood to estimate rho and correct the model.
These are good corrections, and can be replicated in OLS.

ARIMA is an acronym for Autoregressive Integrated Moving Average.
This process is a univariate filter used to cleanse variables of a variety
of pathologies before analysis.

Corrections based on Rho


There are several ways to estimate rho, the simplest being calculating it
from the residuals:

\hat{\rho} = \frac{\sum_{t=2}^{n} e_t e_{t-1}}{\sum_{t=1}^{n} e_t^2}

We then estimate the regression by transforming the regressors so that:

Y_t^* = Y_t - \hat{\rho} Y_{t-1} \quad \text{and} \quad X_t^* = X_t - \hat{\rho} X_{t-1}

This gives the regression:

Y_t^* = b_0 (1 - \hat{\rho}) + b_1 X_t^* + u_t
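
A minimal sketch of one residual-based way to calculate rho, assuming the OLS model above; e is a hypothetical residual variable:

* regress the residuals on their own lag; the slope coefficient estimates rho
quietly regress price ice quantity
predict e, residuals
regress e L.e, noconstant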

High tech solutions


Stata also offers the option of estimating the model with the AR structure
(with multiple ways of estimating rho). There is also what is known as a
Prais-Winsten regression, which generates values for the lost observation.
For the truly adventurous, there is also the option of doing a full ARIMA
model.
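
A sketch of the command that produces the iterated Prais-Winsten estimates shown on the next slide:

* iterated Prais-Winsten AR(1) regression (the default for prais)
prais price ice quantity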

Prais-Winsten regression

Prais-Winsten AR(1) regression -- iterated estimates

      Source |       SS       df       MS              Number of obs =     328
-------------+------------------------------           F(  2,   325) =   15.39
       Model |  .012722308     2  .006361154           Prob > F      =  0.0000
    Residual |  .134323736   325  .000413304           R-squared     =  0.0865
-------------+------------------------------           Adj R-squared =  0.0809
       Total |  .147046044   327  .000449682           Root MSE      =  .02033

------------------------------------------------------------------------------
       price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         ice |   .0098603   .0059994     1.64   0.101    -.0019422    .0216629
    quantity |  -1.11e-07   1.70e-07    -0.66   0.512    -4.45e-07    2.22e-07
       _cons |   .2517135   .0195727    12.86   0.000     .2132082    .2902188
-------------+----------------------------------------------------------------
         rho |   .9436986
------------------------------------------------------------------------------
Durbin-Watson statistic (original)    0.210907
Durbin-Watson statistic (transformed) 1.977062

ARIMA
The ARIMA model allows us to test the hypothesis of autocorrelation and
remove it from the data.
This is an iterative process akin to the purging we did when creating the
ystar variable.
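
A sketch of the command behind the model on the next slide, an AR(1) model for price with no covariates:

* AR(1) model for price
arima price, ar(1)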

The model
ARIMA regression

Sample: 1 to 328                                Number of obs   =        328
                                                Wald chi2(1)    =    3804.80
Log likelihood = 811.6018                       Prob > chi2     =     0.0000

------------------------------------------------------------------------------
             |                 OPG
       price |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
price        |
       _cons |   .2558135   .0207937    12.30   0.000     .2150587    .2965683
-------------+----------------------------------------------------------------
ARMA         |
          ar |
         L1. |   .9567067     .01551    61.68   0.000     .9263076    .9871058
-------------+----------------------------------------------------------------
      /sigma |   .0203009    .000342    59.35   0.000     .0196305    .0209713
------------------------------------------------------------------------------

The ar L1. coefficient is the estimate of rho; it is a significant lag.

The residuals of the ARIMA model

[Figure: correlogram of the ARIMA residuals, lags 1-40, with Bartlett's
formula for MA(q) 95% confidence bands]

There are a few significant lags a ways back. Generally we should expect
some, but this mess is probably an indicator of a seasonal trend (well
beyond the scope of this lecture)!
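
A sketch of how this residual check can be produced after the arima fit; e_hat is a hypothetical name:

* predict the ARIMA residuals and draw their correlogram
predict e_hat, residuals
ac e_hat, lags(40)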

ARIMA with a covariate


ARIMA regression

Sample: 1 to 328                                Number of obs   =        328
                                                Wald chi2(3)    =    3569.57
Log likelihood = 812.9607                       Prob > chi2     =     0.0000

------------------------------------------------------------------------------
             |                 OPG
       price |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
price        |
         ice |   .0095013   .0064945     1.46   0.143    -.0032276    .0222303
    quantity |  -1.04e-07   1.22e-07    -0.85   0.393    -3.43e-07    1.35e-07
       _cons |   .2531552   .0220777    11.47   0.000     .2098838    .2964267
-------------+----------------------------------------------------------------
ARMA         |
          ar |
         L1. |   .9542692     .01628    58.62   0.000     .9223611    .9861773
-------------+----------------------------------------------------------------
      /sigma |   .0202185   .0003471    58.25   0.000     .0195382    .0208988
------------------------------------------------------------------------------
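
A sketch of the command that would produce output like the above:

* AR(1) model for price with covariates
arima price ice quantity, ar(1)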

Final thoughts
Each correction has a best application.
If we wanted to evaluate a mean shift (a dummy-variable-only model),
calculating rho would not be a good choice; then we would want to use the
lagged dependent variable.
Also, where we want to test the effect of inertia, it is probably better to
use the lag.

Final Thoughts Continued


In small-N settings, calculating rho tends to be more accurate.
ARIMA is one of the best options; however, it is very complicated!
When dealing with time, the number of time periods and the spacing of the
observations are VERY IMPORTANT!
When using estimates of rho, a good rule of thumb is to have at least 25-30
time points, and more if the observations are too closely spaced for the
process you are observing!

Next Time:
Review for Exam
Plenary Session

Exam Posting
Available after class Wednesday
