
FACULTY OF ECONOMICS AND BUSINESS

CAMPUS BRUSSEL
Master of Business Engineering

Statistical Modelling
Regression: Multicollinearity

Studenmund, A. H. (2013). Using Econometrics: A Practical Guide (6th ed.). Edinburgh: Pearson Education Limited.

1
Outline
 What is the nature of the problem?
o Collinearity between two predictors
o Multicollinearity among a set of predictors

 What are the consequences of multicollinearity?

 How is multicollinearity diagnosed?

 What remedies are available?

2
What is the nature of the problem?
 Consider the following regression model with two predictors:
$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \varepsilon_i$ with $\varepsilon_i \sim N(0, \sigma^2)$
 Perfect collinearity between two predictors $X_1$ and $X_2$ occurs if $X_1$ and $X_2$ have a perfect linear relation (e.g. $X_{2i} = a_0 + a_1 X_{1i}$). This means that $r_{12}^2 = 1$, with $r_{12}$ the sample correlation between $X_1$ and $X_2$.
 When we compute OLS estimates for the above regression model, it can be shown that:
$\operatorname{Var}(\hat\beta_1) = \dfrac{\sigma^2}{(1 - r_{12}^2)\sum_i (X_{1i} - \bar{X}_1)^2}$ and $\operatorname{Var}(\hat\beta_2) = \dfrac{\sigma^2}{(1 - r_{12}^2)\sum_i (X_{2i} - \bar{X}_2)^2}$
 If $X_1$ and $X_2$ are perfectly collinear (i.e. $r_{12}^2 = 1$), these variances are undefined (the denominators become zero) and the OLS estimates $\hat\beta_1$ and $\hat\beta_2$ do not exist.
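The breakdown under perfect collinearity can be illustrated numerically. The following Python sketch (not part of the original slides; the data and coefficients are made up) shows that when $X_2$ is an exact linear function of $X_1$, the matrix $X'X$ is singular, so the OLS normal equations have no unique solution.

import numpy as np

rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
x2 = 2 * x1                                  # perfect collinearity: X2 is an exact linear function of X1
y = 1 + 0.5 * x1 + 0.3 * x2 + rng.normal(size=n)

X = np.column_stack([np.ones(n), x1, x2])    # design matrix with intercept
print(np.linalg.matrix_rank(X.T @ X))        # rank 2 < 3: X'X is singular
# np.linalg.solve(X.T @ X, X.T @ y) would raise LinAlgError ("Singular matrix")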
3
What is the nature of the problem?
 Imperfect collinearity exists if $X_1$ and $X_2$ have a very strong (positive or negative) linear relation (i.e., $r_{12}^2$ is close to 1). In this case the variances $\operatorname{Var}(\hat\beta_1)$ and $\operatorname{Var}(\hat\beta_2)$ may become large, leading to unstable parameter estimates.
 We can extend the concept of collinearity to the multiple regression model
$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \dots + \beta_K X_{Ki} + \varepsilon_i$ with $\varepsilon_i \sim N(0, \sigma^2)$
 There is perfect multicollinearity if one of the predictor variables can be expressed as a perfect linear combination of the other predictor variables. In this case the OLS estimates do not exist.
 There is imperfect multicollinearity if one of the predictor variables can be approximated very well as a linear combination of the other predictors. The term multicollinearity usually refers to imperfect multicollinearity.

4
What is the nature of the problem?
 For instance, if a regression of a predictor $X_j$ on the other predictors yields an $R^2$ value close to one, there is imperfect multicollinearity.
 When using OLS to estimate the above multiple regression model, it can be shown that
$\operatorname{Var}(\hat\beta_j) = \dfrac{\sigma^2}{(1 - R_j^2)\sum_i (X_{ji} - \bar{X}_j)^2}$
 with $R_j^2$ the $R^2$ value obtained when regressing predictor $X_j$ on all the other predictor variables.
 We see that $\operatorname{Var}(\hat\beta_j)$ increases as $R_j^2$ gets closer to one (i.e., as $X_j$ can be approximated well as a linear combination of the other predictors). This means that the OLS estimate $\hat\beta_j$ becomes unstable. The factor $1/(1 - R_j^2)$ is called the variance inflation factor (VIF) and the term $1 - R_j^2$ is called the tolerance.
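As an illustration (Python, not the SPSS used in this course; the data and function name are hypothetical), the sketch below computes $R_j^2$, the tolerance and the VIF of one predictor by regressing it on the other predictors:

import numpy as np

def vif_and_tolerance(X, j):
    """Return R_j^2, tolerance and VIF of column j of predictor matrix X (no intercept column)."""
    n = X.shape[0]
    target = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(n), others])        # auxiliary regression with intercept
    coef, *_ = np.linalg.lstsq(A, target, rcond=None)
    fitted = A @ coef
    ss_res = np.sum((target - fitted) ** 2)
    ss_tot = np.sum((target - target.mean()) ** 2)
    r2_j = 1 - ss_res / ss_tot
    tolerance = 1 - r2_j
    return r2_j, tolerance, 1 / tolerance            # VIF = 1 / (1 - R_j^2)

# hypothetical example: three predictors, two of them strongly related
rng = np.random.default_rng(1)
x1 = rng.normal(size=200)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=200)
x3 = rng.normal(size=200)
X = np.column_stack([x1, x2, x3])
print(vif_and_tolerance(X, 0))                       # large VIF for x1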
5
When does perfect/imperfect multicollinearity occur?
 Perfect (multi)collinearity can occur when a predictor is defined as a linear combination of other predictors.
o E.g., suppose a company uses a survey to measure the satisfaction of clients with different aspects of online purchases (quality of the website, speed of product delivery, ...) and computes global satisfaction by taking the sum of the satisfaction scores on the different aspects. When using linear regression to predict the total yearly online spending of a customer, the company cannot include global satisfaction in addition to the satisfaction scores on all the aspects, because this would imply perfect multicollinearity.
o E.g., when using dummies to include a qualitative (categorical) variable with 3 categories in the model, we cannot include a dummy for each category (if we also include an intercept) because this would imply perfect multicollinearity (see the sketch below).
 Perfect multicollinearity is normally easily detected by checking the
definition of predictor variables.
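A minimal pandas sketch of the dummy-variable trap mentioned above (the data are made up): including a dummy for every category together with an intercept creates an exact linear dependence, which is avoided by dropping one (reference) category.

import pandas as pd

df = pd.DataFrame({"region": ["north", "south", "west", "south", "north"]})

# All 3 dummies plus an intercept would be perfectly multicollinear:
# the dummies sum to 1 in every row, i.e. to the intercept column.
full = pd.get_dummies(df["region"])                    # 3 columns: north, south, west
safe = pd.get_dummies(df["region"], drop_first=True)   # 2 columns: reference category dropped
print(full.sum(axis=1).unique())                       # [1] -> exact linear dependence with the intercept
print(safe.columns.tolist())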

6
When does perfect/imperfect multicollinearity occur?
 A related issue is that one should avoid including a dominant variable
(i.e., a variable that is logically related to the dependent variable) as a
predictor in a regression model.
o E.g. when using regression to predict the amount spent in online sales, one
should not include a dummy which indicates that the amount spent in online
sales is larger than 0.
 Note that multicollinearity concerns a strong linear relation among
predictors, whereas the concept of a dominant variable concerns a strong
(and logically implied) linear relation between predictors and the dependent
variable!
 Multicollinearity often occurs when a regression model includes
predictors that measure the same theoretical concept (and which are
therefore strongly correlated).

7
Consequences of multicollinearity
The major consequences of multicollinearity are:
 OLS estimates remain unbiased.
 The variances and the standard errors of the estimates will increase.
As a result, multicollinearity increases the likelihood of obtaining an
unexpected sign for a coefficient.
 The computed t-scores will fall, leading to large (non-significant) p-values when testing $H_0: \beta_j = 0$ versus $H_1: \beta_j \neq 0$.
Explanation: the t-statistic reads
$t = \dfrac{\hat\beta_j}{SE(\hat\beta_j)}$
and hence it will become smaller (in absolute value) if $SE(\hat\beta_j)$ increases because of multicollinearity.
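This effect can be illustrated with a small simulation (Python, purely illustrative and not the course example): the same data-generating process is used while the linear relation between the two predictors is made progressively stronger.

import numpy as np

def t_score_x1(rho, n=100, seed=0):
    """t-score of beta_1 in y = 1 + 0.5*x1 + 0.5*x2 + e, where corr(x1, x2) is roughly rho."""
    rng = np.random.default_rng(seed)
    x1 = rng.normal(size=n)
    x2 = rho * x1 + np.sqrt(1 - rho**2) * rng.normal(size=n)
    y = 1 + 0.5 * x1 + 0.5 * x2 + rng.normal(size=n)
    X = np.column_stack([np.ones(n), x1, x2])
    beta = np.linalg.solve(X.T @ X, X.T @ y)
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - X.shape[1])
    se = np.sqrt(sigma2 * np.linalg.inv(X.T @ X)[1, 1])
    return beta[1] / se

for rho in (0.0, 0.9, 0.99):
    print(rho, round(t_score_x1(rho), 2))   # the t-score typically shrinks as collinearity increases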

8
Consequences of multicollinearity
 Estimates will become very sensitive to changes in specification.
When significant multicollinearity exists, removing or adding an explanatory variable will often cause major changes in the OLS estimates.
 The overall fit of the equation, the overall F-test, and the estimation
of non-multicollinear variables will largely be unaffected.
o Hence, a highly significant overall F-test and a high $R^2$ value in combination with no significant individual regression coefficients is often an indication of severe multicollinearity.
o As multicollinearity does not affect the fit of the equation it will not affect
using the equation for making predictions on a test sample as long as the
pattern of multicollinearity in the test sample is the same as in the sample
used for estimation.

9
Example: consumption function
 Suppose you want to estimate the consumption function of a student using the following model:
$CO_i = \beta_0 + \beta_1 Yd_i + \beta_2 LA_i + \varepsilon_i$ with $\varepsilon_i \sim N(0, \sigma^2)$ and
o $CO_i$: the annual consumption expenditures of the $i$-th student on items other than tuition and room and board
o $Yd_i$: the annual disposable income of the $i$-th student
o $LA_i$: the liquid assets (savings, etc.) of the $i$-th student
 We estimate the above model with SPSS using hypothetical data. In addition, we estimate a model that includes only annual disposable income as a predictor.
 Multicollinearity is a potential problem when including both $Yd$ and $LA$ as predictors because these two variables are strongly correlated.
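The slides show SPSS output for this example. Since the hypothetical data are not reproduced here, the sketch below only mimics the situation with simulated data (all numbers are illustrative): disposable income and liquid assets are generated to be strongly correlated, and both the full and the reduced model are estimated.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 30                                         # small sample, as in the slides
yd = rng.normal(20000, 4000, size=n)           # disposable income (hypothetical)
la = 0.5 * yd + rng.normal(0, 300, size=n)     # liquid assets, strongly correlated with income
co = 5000 + 0.6 * yd + 0.1 * la + rng.normal(0, 1500, size=n)   # consumption

full = sm.OLS(co, sm.add_constant(np.column_stack([yd, la]))).fit()
reduced = sm.OLS(co, sm.add_constant(yd)).fit()

# With both predictors the standard errors are inflated; dropping the liquid-assets
# variable shrinks the standard error of the income coefficient considerably.
print(full.bse, reduced.bse)
print(full.pvalues, reduced.pvalues)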

10
Example: consumption function

For the model that includes the two predictors $Yd$ and $LA$ we see:
 A significant overall F-test and a rather good model fit (a high $R^2$). Note that the adjusted $R^2$ is substantially lower because the data set is very small.
 Despite the significant overall F-test, neither predictor is individually significant (both p-values are large).

11
Example: consumption function
When we drop the variable $LA$ from the model we see that:
 We still have a highly significant overall F-test and a good global model fit.
 The coefficient of $Yd$ changes drastically: it increases from .511 to .971. The standard error of the coefficient of $Yd$ drops drastically (from 1.031 to .157), and the p-value of the coefficient becomes highly significant (it drops from .646 to .002).
 Which equation is better? If $LA$ is part of the true model, dropping the variable may lead to omitted variable bias, but including the variable may lead to severe multicollinearity.
12
Detection of multicollinearity
 When detecting multicollinearity, the focus is on how much multicollinearity exists (and on whether it is problematic).
 The severity of multicollinearity in a given equation depends on the specific sample.
 The following diagnostics can be used to detect multicollinearity:
o High correlation coefficient between pairs of predictors. A high
correlation coefficient is only problematic if it causes unacceptably large
standard errors for the coefficients we are interested in.
o High variance inflation factors. As a rule of thumb, VIFs > 5 (or a
tolerance < 0.2) are an indication of severe multicollinearity. For models
with many predictors, one may use VIF>10 (or tolerance<0.1).
 High correlations between predictors or high VIFs are necessary (but not sufficient) indications of problematic multicollinearity. One should always study the model results further (size of the standard errors, signs of the coefficients, etc.) to assess whether the multicollinearity is problematic.
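VIFs are usually obtained directly from the regression software (SPSS in this course). As an illustration only, a Python alternative based on statsmodels could look as follows (the data frame is hypothetical):

import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# hypothetical predictor data frame with two strongly related predictors
rng = np.random.default_rng(3)
df = pd.DataFrame({"x1": rng.normal(size=200)})
df["x2"] = 0.95 * df["x1"] + 0.05 * rng.normal(size=200)
df["x3"] = rng.normal(size=200)

X = sm.add_constant(df)                                 # include the intercept, as in the regression model
vifs = {col: variance_inflation_factor(X.values, i)
        for i, col in enumerate(X.columns) if col != "const"}
print(vifs)                                             # VIFs well above 5 for x1 and x2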
13
Remedies for multicollinearity
 Do nothing: a remedy should be considered only if the multicollinearity
causes insignificant t-scores or unreliable estimated coefficients.
o It is possible to have a high correlation between two predictors, and yet to
have meaningful and significant coefficient estimates.
o Deletion of a multicollinear variable may cause omitted variable bias.
 Drop a redundant variable
o If two predictors measure the same underlying economic construct, one of
the predictors is actually redundant and can be dropped from the model.
E.g., Gross domestic product and disposable income should not be included
in the same model as both variables essentially measure “income”.
o (If possible) use theory to decide which variable to drop.
 Increase the size of the sample
o Increasing the size of the sample will reduce standard errors and increase the accuracy of the coefficient estimates, and hence may help to reduce the impact of multicollinearity on the coefficient estimates.
14
Example where multicollinearity is best left unadjusted
 A soft drink company develops a model to explain the impact of advertising on the sales of a soft drink:
$S_t = \beta_0 + \beta_1 P_t + \beta_2 A_t + \beta_3 B_t + \varepsilon_t$ with
o $S_t$: sales of the soft drink in year $t$
o $P_t$: average relative price of the soft drink in year $t$
o $A_t$: advertising expenditures of the company in year $t$
o $B_t$: advertising expenditures of the company's main competitor in year $t$
 Model estimation yields the following results:

variable    β̂        SE(β̂)    t       p
Constant    3080
P          -75000    25000    -3.00   .0062
A            4.23     1.06     3.99   .00054
B           -1.04     0.51    -2.04   .053

$R^2_{adj} = .825$, $N = 28$

 Suppose the correlation between A and B is high and the VIFs are larger than 5; even then there is no good reason to do something about the multicollinearity, because the estimated coefficients are significant in the direction implied by theory, and the overall model fit and the size of the coefficients seem acceptable.
15
Example where multicollinearity is best left unadjusted

When we drop predictor B from the model:

variable    β̂        SE(β̂)    t       p
Constant    2586
P          -78000    24000    -3.25   .0033
A            0.52     4.32     0.12   .905

$R^2_{adj} = .531$, $N = 28$

 The overall fit of the model decreases considerably ($R^2_{adj}$ drops from .825 to .531).
 The coefficient of predictor A drops from 4.23 to 0.52 and becomes non-significant.

Explanation: the expected bias in $\hat\beta_A$ is negative. Omitting B gives $E[\hat\beta_A] - \beta_A = \beta_B \cdot \alpha_A$, where $\alpha_A$ is the coefficient of A in a regression of the omitted variable B on the included predictors. Since $\beta_B < 0$ and $\alpha_A > 0$ (the two companies' advertising expenditures move together), the expected bias is negative. This negative bias is strong enough to decrease $\hat\beta_A$ until it is non-significant.

16
Exercise
 We use linear regression to model wage (in $) as a function of age (in years), education (in years) and work experience (in months) ($R^2 = .442$).
 In addition, we estimate the following auxiliary regressions of each predictor on the other predictors:
o Age = 33.890 + .088*experience - .345*education
o Education = 15.730 - .054*age - .002*experience
o Experience = -155.99 + 7.058*age - 1.035*education

17
Exercise
 The correlation matrix of the predictors reads:
Correlations (N = 473 for each pair)

                     education   experience   age in years
education              1          -.252**      -.281**
  Sig. (2-tailed)                   .000         .000
experience            -.252**      1            .802**
  Sig. (2-tailed)      .000                      .000
age in years          -.281**      .802**       1
  Sig. (2-tailed)      .000         .000

**. Correlation is significant at the 0.01 level (2-tailed).

18
Exercise
 Complete the VIF and tolerance values in the table of regression
coefficients.
 Is there a problem of multicollinearity? What would you do to deal with
the problem?

19
Solution
age: VIF ≈ 2.86 and tolerance ≈ .35
education: VIF ≈ 1.09 and tolerance ≈ .92
experience: VIF ≈ 2.81 and tolerance ≈ .36
(approximate values derived from the correlation matrix on slide 18; see the sketch below)
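These values can be checked from the correlation matrix on slide 18: for the predictors, the VIFs equal the diagonal elements of the inverse correlation matrix. A small Python sketch (row/column order assumed to be education, experience, age):

import numpy as np

# correlation matrix of the predictors (education, experience, age), from slide 18
R = np.array([[ 1.000, -0.252, -0.281],
              [-0.252,  1.000,  0.802],
              [-0.281,  0.802,  1.000]])

vif = np.diag(np.linalg.inv(R))      # VIF_j = j-th diagonal element of R^{-1}
tolerance = 1 / vif                  # tolerance = 1 - R_j^2 = 1 / VIF
for name, v, t in zip(["education", "experience", "age"], vif, tolerance):
    print(f"{name}: VIF = {v:.2f}, tolerance = {t:.2f}")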

20
Solution
 We see that there is indeed a problem of multicollinearity for the
predictors age and experience:
o For the predictors age and experience the VIF is close to 3.
o The predictors age and experience have a strong positive linear relation
(r=.80), and we can argue that these variables measure the same aspect.
o We can argue that the multicollinearity is problematic because the
predictors age and experience are nonsignificant, and the sign of the
coefficient of age goes against our expectations.
 If we drop age from the model, we see that
o both the regression coefficients of education and experience have the
expected sign and are significant.
o the $R^2$ of the model does not decrease by dropping age: $R^2$ is still .442.
 So a model without age seems more valid.

21
Solution

22
