CAMPUS BRUSSEL
Master of Business Engineering
Statistical Modelling
Regression: Multicollinearity
Outline
What is the nature of the problem?
o Collinearity between two predictors
o Multicollinearity among a set of predictors
What is the nature of the problem?
Consider the following regression model with two predictors:
$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \varepsilon_i$ with $\varepsilon_i \sim N(0, \sigma^2)$
Perfect collinearity between two predictors $X_1$ and $X_2$ occurs if $X_1$ and $X_2$ have a perfect linear relation (e.g. $X_{2i} = a + b X_{1i}$). This means that $r_{X_1 X_2} = \pm 1$.
When we compute OLS estimates for the above regression model, it can be shown that:
$\operatorname{Var}(\hat{\beta}_1) = \frac{\sigma^2}{\sum_i (X_{1i} - \bar{X}_1)^2 \,(1 - r_{X_1 X_2}^2)}$
Under perfect collinearity $r_{X_1 X_2}^2 = 1$, so this variance is undefined: OLS cannot separate the effect of $X_1$ from that of $X_2$.
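As an illustration, a minimal simulation sketch (assuming numpy; the data are hypothetical) of how the sampling variance of $\hat{\beta}_1$ grows as the correlation between the two predictors approaches one:

import numpy as np

rng = np.random.default_rng(0)
n, sigma = 200, 1.0

for r in [0.0, 0.9, 0.99, 0.999]:
    # draw two predictors whose population correlation is r
    x1 = rng.normal(size=n)
    x2 = r * x1 + np.sqrt(1 - r**2) * rng.normal(size=n)
    X = np.column_stack([np.ones(n), x1, x2])
    # OLS covariance matrix: sigma^2 * (X'X)^{-1}
    cov = sigma**2 * np.linalg.inv(X.T @ X)
    print(f"r = {r:5.3f}   Var(beta1_hat) = {cov[1, 1]:.4f}")

The printed variance grows roughly by the factor $1/(1 - r^2)$, in line with the formula above.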
What is the nature of the problem?
For instance, if the regression of a predictor $X_j$ on the other predictors yields an $R_j^2$ value close to one, there is imperfect multicollinearity.
When using OLS to estimate the above multiple regression model, it can be shown that
$\operatorname{Var}(\hat{\beta}_j) = \frac{\sigma^2}{\sum_i (X_{ji} - \bar{X}_j)^2} \cdot \frac{1}{1 - R_j^2}$
The factor $\frac{1}{1 - R_j^2}$ is the variance inflation factor (VIF) of $X_j$: the closer $R_j^2$ is to one, the larger the variance of $\hat{\beta}_j$.
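This computation is easy to carry out directly. A minimal sketch (assuming numpy; X is a hypothetical predictor matrix with one column per predictor):

import numpy as np

def vif(X, j):
    # regress column j of X on the remaining columns (plus an intercept)
    y = X[:, j]
    Z = np.column_stack([np.ones(len(y)), np.delete(X, j, axis=1)])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ beta
    r2 = 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
    return 1.0 / (1.0 - r2)  # the tolerance reported by SPSS is the reciprocal, 1/VIF

Values of vif(X, j) far above 1 signal imperfect multicollinearity for predictor j.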
When does perfect/imperfect multicollinearity occur?
A related issue is that one should avoid including a dominant variable
(i.e., a variable that is logically related to the dependent variable) as a
predictor in a regression model.
o E.g. when using regression to predict the amount spent in online sales, one
should not include a dummy which indicates that the amount spent in online
sales is larger than 0.
Note that multicollinearity concerns a strong linear relation among
predictors, whereas the concept of a dominant variable concerns a strong
(and logically implied) linear relation between predictors and the dependent
variable!
Multicollinearity often occurs when a regression model includes
predictors that measure the same theoretical concept (and which are
therefore strongly correlated).
7
Consequences of multicollinearity
The major consequences of multicollinearity are:
OLS estimates remain unbiased.
The variances and the standard errors of the estimates will increase.
As a result, multicollinearity increases the likelihood of obtaining an
unexpected sign for a coefficient.
The computed t-scores will fall, leading to large (non-significant) p-values when testing $H_0: \beta_j = 0$ versus $H_1: \beta_j \neq 0$.
Explanation: the t-statistic reads
$t = \frac{\hat{\beta}_j}{SE(\hat{\beta}_j)}$
and multicollinearity inflates $SE(\hat{\beta}_j)$ while leaving $\hat{\beta}_j$ unbiased, so $t$ shrinks toward zero.
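For example (hypothetical numbers): with $\hat{\beta}_j = 2.5$ and $SE(\hat{\beta}_j) = 1$ we get $t = 2.5$, which is significant at the 5% level; if multicollinearity doubles the standard error to 2, the same estimate gives $t = 1.25$ and the coefficient is no longer significant.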
Consequences of multicollinearity
Estimates will become very sensitive to changes in specification.
When significant multicollinearity exists, removing or adding an explanatory variable will often cause major changes in the OLS estimates.
The overall fit of the equation, the overall F-test, and the estimates for non-multicollinear variables will be largely unaffected.
o Hence, a highly significant overall F-test and a high $R^2$ value in combination with no significant individual regression coefficients are often an indication of severe multicollinearity.
o As multicollinearity does not affect the fit of the equation, it will not affect using the equation for making predictions on a test sample, as long as the pattern of multicollinearity in the test sample is the same as in the sample used for estimation.
Example: consumption function
Suppose you want to estimate the consumption function of a student using the following model:
$CO_i = \beta_0 + \beta_1 YD_i + \beta_2 LA_i + \varepsilon_i$ with $\varepsilon_i \sim N(0, \sigma^2)$
o $CO_i$: the annual consumption expenditures of the $i$-th student on items other than tuition and room and board.
o $YD_i$: the annual disposable income of the $i$-th student.
o $LA_i$: the liquid assets (savings, etc.) of the $i$-th student.
We estimate the above model with SPSS using hypothetical data. In
addition, we estimate a model that includes only annual disposable income
as a predictor.
Multicollinearity is a potential problem when including both $YD$ and $LA$ as predictors because disposable income and liquid assets are strongly correlated.
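A minimal sketch of the two estimations (assuming statsmodels; the simulated numbers below are stand-ins for the hypothetical SPSS data, not the course data):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 30
yd = rng.normal(20000, 4000, n)          # annual disposable income YD
la = 0.6 * yd + rng.normal(0, 800, n)    # liquid assets LA, strongly tied to YD
co = 2000 + 0.30 * yd + 0.05 * la + rng.normal(0, 900, n)

full = sm.OLS(co, sm.add_constant(np.column_stack([yd, la]))).fit()
reduced = sm.OLS(co, sm.add_constant(yd)).fit()
print(full.summary())     # both predictors: high R^2, significant F, inflated SEs
print(reduced.summary())  # YD only: nearly the same R^2, sharper t-statistics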
Example: consumption function
When we drop the variable $LA$ from the model, we see that we still have a highly significant overall F-test and a good global model fit.
Exercise
We use linear regression to model wage (in $) as a function of age (in years), education (in years) and work experience (in months) ($R^2 = .442$).
Exercise
The correlation matrix of the predictors reads:

              education     age    experience
education        1         -.252     -.281
age             -.252        1        .800
experience      -.281       .800       1
Exercise
Complete the VIF and tolerance values in the table of regression
coefficients.
Is there a problem of multicollinearity? What would you do to deal with
the problem?
Solution
age: VIF = 2.78 and tolerance = .36
education: VIF = 1.09 and tolerance = .92
experience: VIF = 2.83 and tolerance = .35
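These values follow from the correlation matrix alone: the VIF of predictor $j$ equals the $j$-th diagonal element of the inverse of the predictors' correlation matrix. A minimal check (assuming numpy, and using the matrix as given above):

import numpy as np

# correlation matrix of education, age, experience (from the exercise)
R = np.array([[ 1.000, -0.252, -0.281],
              [-0.252,  1.000,  0.800],
              [-0.281,  0.800,  1.000]])
vif = np.diag(np.linalg.inv(R))  # VIF_j = j-th diagonal element of R^{-1}
tol = 1.0 / vif                  # tolerance = 1 / VIF
for name, v, t in zip(["education", "age", "experience"], vif, tol):
    print(f"{name:10s}  VIF = {v:.2f}  tolerance = {t:.2f}")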
Solution
We see that there is indeed a problem of multicollinearity for the
predictors age and experience:
o For the predictors age and experience the VIF is close to 3.
o The predictors age and experience have a strong positive linear relation
(r=.80), and we can argue that these variables measure the same aspect.
o We can argue that the multicollinearity is problematic because the
predictors age and experience are nonsignificant, and the sign of the
coefficient of age goes against our expectations.
If we drop age from the model, we see that
o both the regression coefficients of education and experience have the
expected sign and are significant.
o the $R^2$ of the model does not decrease by dropping age: $R^2$ is still .442.
So a model without age seems more valid.