
MULTICOLLINEARITY &

AUTOCORRELATION
Reference: Chapters 10 & 12 of DNG
Multicollinearity
The theory of causation and multiple causation
Interdependence between the Independent Variables and
variability of Dependent Variables
Parsimony and Linear Regression
Theoretical consistency and Parsimony

[Figure: Y shown together with the explanatory variables X1, X2, X3, X4 and X5]
One of the assumptions of the CLRM is
that there is no Multicollinearity
amongst the explanatory variables.
Multicollinearity refers to a perfect or
exact linear relationship among some or all
explanatory variables.
Example:

X1    X2    X*2
10    50    52
15    75    75
18    90    97
24    120   129
30    150   152

X2i = 5X1i, and X*2 was created by adding 2, 0, 7, 9 and 2
(taken from a random number table) to X2.
Here r(X1, X2) = 1 and r(X2, X*2) = 0.99
X1 & X2 show perfect multicollinearity
X2 & X*2 show near-perfect multicollinearity
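The two correlations in this example can be checked numerically. The following is a minimal sketch using NumPy on the five observations from the example table; the variable names are illustrative and not part of the slides.

```python
import numpy as np

# The five observations from the example table
x1 = np.array([10, 15, 18, 24, 30], dtype=float)
x2 = 5 * x1                                   # exact linear function of x1
x2_star = x2 + np.array([2, 0, 7, 9, 2])      # x2 plus the random-table numbers

# Off-diagonal element of the 2x2 correlation matrix
r_12 = np.corrcoef(x1, x2)[0, 1]              # perfect collinearity
r_22star = np.corrcoef(x2, x2_star)[0, 1]     # near-perfect collinearity

print(round(r_12, 4), round(r_22star, 4))
```

The first correlation is exactly 1 because X2 is a fixed multiple of X1; the second falls slightly below 1 because of the added random numbers.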
The problem of multicollinearity and its
degree in types of data
The overlap between the explanatory variables
indicates the extent of multicollinearity, as shown in
the Venn diagram.
Example:
Y = a + b1x1 + b2x2 + u
where
Y = Consumption Expenditure
X1 = Income & X2 = Wealth
Consumption expenditure depends on income (x1) and
wealth (x2)
The estimated equation from a set of data is as follows:
$\hat{Y} = 24.77 + 0.94x_1 - 0.04x_2$
t : (3.66) (1.14) (-0.52)
$R^2 = 0.96$, $\bar{R}^2 = 0.95$, F = 92.40
The individual slope coefficients are not statistically significant, although the F value
suggests a high degree of association.
The coefficient of x2 (wealth) has the wrong sign.
The fact that the F test is significant but the t
values of X1 and X2 are individually
insignificant means that the two variables are
so highly correlated that it is impossible to
isolate the individual impact of either income
or wealth on consumption.
Let us regress X2 on X1:
$\hat{X}_2 = 7.54 + 10.19X_1$
t = (0.25) (62.04), $R^2 = 0.99$
This shows near-perfect multicollinearity
between X2 and X1.
Y on X1:
$\hat{Y} = 24.24 + 0.51X_1$
t = (3.81) (14.24), $R^2 = 0.96$

Y on X2:
$\hat{Y} = 24.41 + 0.05X_2$
t = (3.55) (13.29), $R^2 = 0.96$

In the regression of Y on X2 alone, wealth has a significant impact.
Dropping the highly collinear variable has made the
other variable significant.
Sources of Multicollinearity
Data collection method employed:
Sampling over a limited range of the values taken by
the regressors in the population
Constraints on the model or in the population being
sampled:
Example: regression of electricity consumption on income and
house size. There is a constraint in the population: families with higher
incomes tend to have larger homes, and hence higher
electricity consumption.
Model specification:
Adding polynomial terms to a model when
range of X variable is small
An over-determined model:
This happens when the model has more
explanatory variables than the number of
observations.
Use of time series data:
The regressors in the model share a common trend.
Practical Consequences of Multicollinearity:
In cases of near perfect or high multicollinearity one is
likely to encounter the following consequences:

1. The OLS estimators have large variances and
covariances, making precise estimation difficult.
2. (a) Because of 1, the confidence intervals tend to be
much wider, leading to the acceptance of the
zero null hypothesis (i.e. the true population
coefficient is zero) more readily.
(b) Because of 1, the t ratios of one or more
coefficients tend to be statistically insignificant.
3. Although the t ratio(s) of one or more
coefficients is/are statistically insignificant, R2,
the overall measure of goodness of fit, can be
very high.
4. The OLS estimators and their standard errors can be
sensitive to small changes in the data.
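Consequence 1 can be made concrete for the two-regressor model Y = a + b1x1 + b2x2 + u used earlier. The standard OLS results below are not derived in these slides. As a notational assumption, $r_{12}$ is the correlation between $x_1$ and $x_2$, and the sums are over deviations from the sample means. They show how the variances and the covariance grow as $r_{12}$ approaches 1:

```latex
\operatorname{var}(\hat{b}_1) = \frac{\sigma^2}{\sum x_{1i}^2\,(1 - r_{12}^2)}, \qquad
\operatorname{var}(\hat{b}_2) = \frac{\sigma^2}{\sum x_{2i}^2\,(1 - r_{12}^2)}, \qquad
\operatorname{cov}(\hat{b}_1, \hat{b}_2) = \frac{-\,r_{12}\,\sigma^2}{(1 - r_{12}^2)\sqrt{\sum x_{1i}^2}\,\sqrt{\sum x_{2i}^2}}
```

The factor $1/(1 - r_{12}^2)$ is commonly called the variance inflation factor. With $r_{12} = 0.99$ it equals $1/(1 - 0.9801) \approx 50$, so each variance is roughly 50 times what it would be with uncorrelated regressors.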
Detection of Multicollinearity

1. High R2 but few significant t values.
2. High pairwise correlations among the regressors
(seen from the correlation matrix).
3. Examination of partial correlations.
4. Auxiliary regressions and F test (regress each Xi
on the remaining X's, find the F values and decide).
5. Eigenvalues and condition index.
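Detection methods 4 and 5 can be sketched numerically. The example below is an illustration under stated assumptions, not the slides' own procedure. The data are synthetic. The auxiliary-regression R2 is summarised as a variance inflation factor, 1/(1 - R2), instead of the F test named in item 4. The condition index is computed from the eigenvalues of the correlation matrix of the regressors, which is one of several scaling conventions.

```python
import numpy as np

rng = np.random.default_rng(0)             # fixed seed; synthetic data (assumption)
n = 50
x1 = rng.normal(50, 10, n)
x2 = 10 * x1 + rng.normal(0, 5, n)         # nearly an exact linear function of x1
x3 = rng.normal(0, 1, n)                   # regressor unrelated to x1 and x2
X = np.column_stack([x1, x2, x3])

def vif(X, j):
    """VIF of column j, from the auxiliary regression of x_j on the other columns."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])    # add an intercept
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    dev = y - y.mean()
    r2 = 1 - (resid @ resid) / (dev @ dev)            # auxiliary R-squared
    return 1 / (1 - r2)

vifs = [vif(X, j) for j in range(X.shape[1])]

# Condition index: sqrt(largest eigenvalue / smallest eigenvalue)
eig = np.linalg.eigvalsh(np.corrcoef(X, rowvar=False))
condition_index = np.sqrt(eig.max() / eig.min())

print(np.round(vifs, 1), round(condition_index, 1))
```

A common rule of thumb treats a VIF above 10, or a condition index above 30, as a sign of serious multicollinearity. In this sketch x1 and x2 should be flagged, while x3 should not.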
Remedial Measures
1. A priori information and articulation
2. Dropping a highly collinear variable
3. Transformation of Data
4. Additional information or new data with
a priori reasoning
5. Identifying the purpose of the analysis and reducing the
degree of multicollinearity accordingly, or simply acknowledging it if
the purpose is prediction/forecasting.
AUTOCORRELATION
The assumption $E(uu') = \sigma^2 I$
Each u distribution has the same variance
(homoscedastic)
All disturbances are pair wise uncorrelated
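This assumption gives
$$
E(uu') =
\begin{bmatrix}
\operatorname{Var}(u_1) & \operatorname{Cov}(u_1,u_2) & \cdots & \operatorname{Cov}(u_1,u_n) \\
\operatorname{Cov}(u_2,u_1) & \operatorname{Var}(u_2) & \cdots & \operatorname{Cov}(u_2,u_n) \\
\vdots & \vdots & \ddots & \vdots \\
\operatorname{Cov}(u_n,u_1) & \operatorname{Cov}(u_n,u_2) & \cdots & \operatorname{Var}(u_n)
\end{bmatrix}
=
\begin{bmatrix}
\sigma^2 & 0 & \cdots & 0 \\
0 & \sigma^2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \sigma^2
\end{bmatrix}
$$
$E(u_i u_j) = 0$ for $i \neq j$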
This assumption when violated leads to:
1. Heteroscedasticity
2. Autocorrelation
Covariance is the measure of how much two
random variables vary together (as distinct from
variance, which measures how much a single
variable varies.)
Covariance between two random variables say X
and Y is defined as
$\operatorname{Cov}(X, Y) = E[(X - \mu_X)(Y - \mu_Y)]$
where $\mu_X$ and $\mu_Y$ are the expected values of X and Y
respectively.
If X and Y are independent, their covariance is zero.
The assumption implies that the disturbance
term relating to any observation is not
influenced by the disturbance term relating
to any other observation.
For example:
1. Suppose we are dealing with quarterly time
series data and a regression of the
following specification (time series data):
Output (Q) = f (Labour and Capital Input)

Q      L     K     U
Q1.1   L1    K1    U1
Q1.2   L2    K2    U2
Q1.3   L3    K3    U3
Q1.4   L4    K4    U4
Q2.1   ...   ...   ...
...    ...   ...   ...
Qn.4   L4n   K4n   U4n

Output in one quarter is affected by a labour strike. Under the assumption,
there is no reason to believe that this will be carried over to U4.
2. Let
Family Consumption Expenditure = f (income)
(A regression involving Cross Section Data)

Consumption Expenditure   Income of   U
of Families               Family
F1                        I1          U1
F2                        I2          U2
...                       ...         ...
Fn                        In          Un
The effect of an increase in one family's income on its
consumption expenditure is not expected to affect the
consumption expenditure of another family.
The reality:
1. Disruption caused by a strike may affect production in subsequent quarters.
2. The consumption expenditure of one family may
influence that of another family, i.e. the urge
"to keep up with the Joneses" (the demonstration effect).

Autocorrelation is a feature in most time-series data.


In cross section data it is referred to as spatial
autocorrelation.
Important Reasons for its occurrence (Time Series)
1. Inertia:
A salient feature of most economic series is inertia.
Time series data such as PCI, price indices, production,
profit, employment etc. exhibit cycles. Starting at the
bottom of a recession, when economic recovery
starts, most of these series move upward. In this
upswing the value of the series at one point of time is
greater than its previous value. Thus, there is a
momentum built into them, and it continues until
something happens (an intervention) to slow them down.
Therefore, in regressions involving time series data,
successive observations are likely to be interdependent,
which is reflected in a systematic pattern in the ui's.
2. Specification bias:
Excluded variable(s) or incorrect functional form.
a) When some relevant variables have been excluded
from the model, their effect is absorbed by the disturbance term
and shows up as a systematic pattern in the ui's.
b) In the case of an incorrect functional form, i.e. fitting a linear
function when the true relationship is log-linear (or
vice versa), the dependent variable will be systematically
over- or under-estimated, which will
have a systematic impact on the ui's.
Example:
(Correct) $MC_i = \beta_1 + \beta_2\,\text{output}_i + \beta_3\,\text{output}_i^2 + u_i$
(Incorrect) $MC_i = b_1 + b_2\,\text{output}_i + v_i$
where $v_i = \beta_3\,\text{output}_i^2 + u_i$, so that $v_i$ will catch the
systematic effect of $\text{output}^2$ on MC, leading to serial
correlation of the $v_i$'s.
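The effect of fitting the wrong functional form can be illustrated with a small simulation. This is a sketch on synthetic data: the coefficient values, noise level and helper names are all assumptions made for illustration. The observations are ordered by output, so a systematic pattern shows up as correlation between successive residuals.

```python
import numpy as np

rng = np.random.default_rng(3)                  # fixed seed; synthetic data (assumption)
n = 100
output = np.linspace(1, 10, n)                  # observations ordered by output
# "True" quadratic marginal cost curve with illustrative coefficients
mc = 10 - 2 * output + 0.5 * output**2 + rng.normal(0, 0.5, n)

def residuals(X, y):
    """OLS residuals of y regressed on the columns of X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ coef

def lag1_corr(e):
    """Correlation between each residual and its predecessor."""
    return np.corrcoef(e[1:], e[:-1])[0, 1]

ones = np.ones(n)
v = residuals(np.column_stack([ones, output]), mc)               # omits output^2
u = residuals(np.column_stack([ones, output, output**2]), mc)    # correct form

print(round(lag1_corr(v), 2), round(lag1_corr(u), 2))
```

The residuals from the misspecified linear fit should be strongly correlated with their own lagged values, while those from the quadratic fit should show little such correlation.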
3. Cobweb Phenomenon:
The supply of many agricultural commodities reflects the so-called
cobweb phenomenon, where supply reacts to
price with a lag of one time period because supply
decisions take time to implement (gestation period).
Example: at the beginning of this year's planting of crops, farmers
are influenced by the price prevailing last year.
Suppose at the end of period t the price Pt turns out
to be lower than Pt-1. Then, in period t+1,
farmers may decide to produce less than they did in
period t.
Such phenomena are known as cobweb
phenomena, and they give a systematic pattern to
the ui's.
Similar problems arise with household expenditure, share prices
etc. In general, when a relevant lagged
variable is not included in the model, the ui's are in many cases
correlated.
4. Manipulation of time series data:
(i) Extrapolation of the values of variables such as
population gives rise to serial dependence of
successive ui's.
(ii) Very often projected population figures are used
to arrive at per capita figures for a
macro-variable. When such figures are used in forecasting
with regression, the successive ui's are serially
correlated.
Consequences (proofs are not given):
In the presence of autocorrelation in a model:
a) The residual variance is likely to underestimate
the true $\sigma^2$.
b) R2 is likely to be overestimated.
c) t tests are not valid and, if applied, are likely to give
misleading conclusions.
The OLS estimators, although linear and unbiased,
do not have minimum variance, and the usual
t and F tests become invalid.
Detection of autocorrelation:
The assumption of the CLRM relates to the population
disturbance terms, which are not directly observable.
Therefore their proxies, i.e. the residuals $\hat{u}_i$, are obtained from OLS
and examined for the presence or absence of
autocorrelation.
There are various methods. Some of them are:
1. Graphical Method
2. Runs test (a non-parametric test): examines the
signs of the residuals.
3. Durbin-Watson (DW) statistic: a decision rule is applied.
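Two of these checks can be sketched in code. The example below is an illustration on synthetic data, and the helper names are assumptions. It computes the Durbin-Watson statistic d = sum (e_t - e_{t-1})^2 / sum e_t^2 and counts the runs of same-sign residuals. It does not reproduce the tabulated bounds needed for the formal DW decision rule or for the runs test.

```python
import numpy as np

def durbin_watson(resid):
    """d = sum of squared successive differences / sum of squared residuals."""
    resid = np.asarray(resid, dtype=float)
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

def count_runs(resid):
    """Number of runs, i.e. unbroken sequences of same-sign residuals."""
    signs = np.sign(resid)
    signs = signs[signs != 0]                   # ignore exact zeros
    return 1 + int(np.sum(signs[1:] != signs[:-1]))

# Synthetic regression with AR(1) disturbances, rho = 0.8 (assumption)
rng = np.random.default_rng(1)
n, rho = 200, 0.8
u = np.zeros(n)
for t in range(1, n):
    u[t] = rho * u[t - 1] + rng.normal()

x = np.linspace(0, 10, n)
y = 2.0 + 0.5 * x + u

A = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
e = y - A @ beta                                # OLS residuals

d = durbin_watson(e)
print(round(d, 2), count_runs(e))
```

Since d is approximately 2(1 - r), where r is the lag-1 correlation of the residuals, a value near 2 suggests no first-order autocorrelation, a value near 0 suggests positive autocorrelation, and a value near 4 suggests negative autocorrelation. With positively autocorrelated residuals the number of runs is also much smaller than would be expected from a random sequence of signs.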


Remedial Measures:
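Data transformation by
a) First difference method ($X_{t+1} - X_t$) (one degree
of freedom is lost)
b) $\rho$ transformation, using an estimate of $\rho$

The transformed model becomes

$(Y_t - \rho Y_{t-1}) = \beta_1(1 - \rho) + \beta_2(X_t - \rho X_{t-1}) + \varepsilon_t$, where $\varepsilon_t = u_t - \rho u_{t-1}$

This is known as the generalised or quasi-difference
equation.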
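The quasi-difference transformation can be sketched as a two-step procedure. The slides do not say how rho is estimated. Regressing the OLS residuals on their own lag, as done here, is an assumption, in the spirit of the Cochrane-Orcutt procedure. The data are synthetic and the names are illustrative.

```python
import numpy as np

def ols(X, y):
    """OLS coefficients of y on the columns of X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

# Synthetic data with AR(1) disturbances (assumed true values)
rng = np.random.default_rng(2)
n, rho_true, beta1, beta2 = 300, 0.7, 2.0, 0.5
x = np.linspace(0, 10, n) + rng.normal(0, 1, n)
u = np.zeros(n)
for t in range(1, n):
    u[t] = rho_true * u[t - 1] + rng.normal()
y = beta1 + beta2 * x + u

# Step 1: OLS on the original model, then estimate rho from the residuals
X = np.column_stack([np.ones(n), x])
e = y - X @ ols(X, y)
rho_hat = (e[1:] @ e[:-1]) / (e[:-1] @ e[:-1])   # slope of e_t on e_{t-1}

# Step 2: quasi-difference the data (the first observation is lost) and re-estimate
y_star = y[1:] - rho_hat * y[:-1]
x_star = x[1:] - rho_hat * x[:-1]
X_star = np.column_stack([np.ones(n - 1), x_star])
a_star, b2_hat = ols(X_star, y_star)
b1_hat = a_star / (1 - rho_hat)                  # intercept estimates beta1 * (1 - rho)

# Durbin-Watson statistic of the transformed residuals
e_star = y_star - X_star @ np.array([a_star, b2_hat])
d_star = np.sum(np.diff(e_star) ** 2) / np.sum(e_star ** 2)

print(round(rho_hat, 2), round(b1_hat, 2), round(b2_hat, 2), round(d_star, 2))
```

If the transformation has removed most of the first-order autocorrelation, the Durbin-Watson statistic of the transformed residuals should be close to 2. In practice the two steps are often iterated until the estimate of rho settles.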
Exercise 3 (refer to Ch. 10 & 12 of DNG)
Use time series data in MR
Find the correlation table
See the extent of multicollinearity
Test for autocorrelation
In the presence of it, use the $\rho$ (rho) transformation
Addressing both problems, calculate the
forecast errors and select the equation which
gives the minimum forecast error.