
Multicollinearity occurs when a multiple linear regression analysis includes several variables that are significantly correlated not only with the dependent variable but also with each other. Multicollinearity can make some of the significant variables under study appear statistically insignificant. This paper discusses three primary techniques for detecting multicollinearity using questionnaire survey data on customer satisfaction. The first two techniques are the correlation coefficients and the variance inflation factor, while the third is the eigenvalue method. It is observed that product attractiveness is a more plausible cause of customer satisfaction than the other predictors. Furthermore, advanced regression procedures such as principal components regression, weighted regression, and ridge regression can be used to deal with multicollinearity.
Keywords: multicollinearity, regression analysis, variance inflation factor, eigenvalue, customer satisfaction

1. Introduction
In multiple regression analysis, the term multicollinearity refers to linear relationships among the independent variables. Collinearity means that two variables are nearly perfect linear combinations of one another. Multicollinearity occurs when the regression model includes several variables that are significantly correlated not only with the dependent variable but also with each other [1].
Multicollinearity is the occurrence of high inter-correlations among the factors in a multiple regression model. It can produce skewed or misleading results when an analyst attempts to determine how well each factor can be used to predict or understand the response variable in a statistical model [2]. In general, multicollinearity can lead to wider confidence intervals and less reliable probability values for the predictors. That is, the findings from a model with multicollinearity may not be trustworthy [2,3].
Sometimes adding more predictors to a regression analysis fails to give a clearer understanding of the model because of multicollinearity. The presence of multicollinearity increases the standard errors of the coefficients in the model, which in turn changes the results of the analysis. Multicollinearity can make some of the significant variables under study appear statistically insignificant [4]. Multicollinearity also increases the variance of the regression coefficients, making them unstable and difficult to interpret [5].
Several studies have examined the problems of multicollinearity in regression models and emphasized that its major consequences include inflated and biased standard errors and impractical interpretations of the results [6,7,8]. These studies also encourage researchers to take systematic steps to detect multicollinearity in regression analysis. To deal with multicollinearity, more advanced regression procedures such as principal components regression, weighted regression, and ridge regression can be used [1,8,9].
The present study discusses three primary techniques for detecting multicollinearity using questionnaire survey data. The first two techniques, the correlation coefficients and the variance inflation factor, are straightforward to implement, while the third method is based on the eigenvalues and eigenvectors of the standardized design matrix [1,5].
2. Materials and Methods
The structured questionnaire survey was conducted among 310 young smartphone users aged 18-27 years in October 2019. The research ethics committee of the educational institute in Kathmandu approved this study, and written consent was obtained from each respondent beforehand. The respondents were asked to rate their views in several key areas on a Likert scale from 1 to 5. The survey questionnaire included the major factors that could potentially influence the level of satisfaction with smartphones: product quality, brand experience, product feature, product attractiveness, and product price.
All statistical procedures were performed using IBM SPSS version 23.0 for Mac. Scatterplots were constructed to display the relationships between pairs of variables. Pearson's correlations were calculated for the relationships among the major factors. Regression analysis helps to provide an estimate of the relationship among the variables and the impact of the independent
variables on the outcome variables that exist within a population. Hence, regression analysis was used for the variables of interest.
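The study carried out these steps in SPSS. As a rough, non-authoritative sketch of the same workflow in Python, the snippet below fits the multiple regression under the assumption that the responses live in a CSV file named survey.csv with hypothetical column names for the five factors and the satisfaction score (none of these names come from the paper).

```python
# Illustrative sketch only: the study used IBM SPSS 23.0; this reproduces the same
# steps (multiple regression of satisfaction on the five factors) in Python.
# The file name and column names below are hypothetical placeholders.
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("survey.csv")  # hypothetical file of the 310 Likert-scale responses

predictors = ["product_quality", "brand_experience", "product_feature",
              "product_attractiveness", "product_price"]   # hypothetical names
X = sm.add_constant(df[predictors])   # design matrix with an intercept column
y = df["satisfaction"]                # hypothetical outcome column

model = sm.OLS(y, X).fit()            # ordinary least squares fit
print(model.summary())                # coefficients, standard errors, p-values
```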
2.1. Techniques for Detecting Multicollinearity
The primary techniques for detecting multicollinearity are i) correlation coefficients, ii) the variance inflation factor, and iii) the eigenvalue method.
2.1.1. Pairwise Scatterplot and Correlation Coefficients
The scatterplot is a graphical method that shows the linear relationship between pairs of independent variables. It is important to look for any scatterplots that seem to indicate a linear relationship between pairs of independent variables.
The correlation coefficient is calculated using the formula:

$$ r = \frac{n\sum XY - \left(\sum X\right)\left(\sum Y\right)}{\sqrt{\left[n\sum X^2 - \left(\sum X\right)^2\right]\left[n\sum Y^2 - \left(\sum Y\right)^2\right]}} $$
where r = correlation coefficient, n = number of observations, X = first variable in the context, and Y = second variable in the context. A high absolute correlation between a pair of variables indicates a possibility of collinearity. In general, if the absolute value of the Pearson correlation coefficient is close to 0.8, collinearity is likely to exist [1,10].
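As a minimal illustration, continuing the hypothetical data frame and column names from the earlier sketch, the following Python snippet draws the pairwise scatterplots, computes the Pearson correlations, and flags pairs whose absolute correlation reaches the 0.8 rule of thumb.

```python
# Sketch: pairwise scatterplots and Pearson correlations among the predictors,
# flagging pairs whose absolute correlation reaches the 0.8 rule of thumb.
# `df` and the column names are the hypothetical placeholders used above.
import pandas as pd

predictors = ["product_quality", "brand_experience", "product_feature",
              "product_attractiveness", "product_price"]

pd.plotting.scatter_matrix(df[predictors])          # pairwise scatterplots
corr = df[predictors].corr(method="pearson")        # correlation matrix

for i, a in enumerate(predictors):
    for b in predictors[i + 1:]:
        r = corr.loc[a, b]
        if abs(r) >= 0.8:
            print(f"possible collinearity: {a} vs {b} (r = {r:.2f})")
```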
2.1.2. Variance Inflation Factor (VIF)
The variance inflation factor measures how much the variance of an estimated regression coefficient is inflated when the independent variables are correlated. For the j-th predictor, the VIF is calculated as

$$ \mathrm{VIF}_j = \frac{1}{1 - R_j^2} = \frac{1}{\mathrm{Tolerance}_j} $$

where $R_j^2$ is the coefficient of determination obtained by regressing the j-th predictor on the remaining predictors, and the tolerance is simply the inverse of the VIF. The lower the tolerance, the more likely multicollinearity exists among the variables. A value of VIF = 1 indicates that the independent variables are not correlated with each other. If 1 < VIF < 5, the variables are moderately correlated with each other. VIF values between 5 and 10 are troubling, as they indicate highly correlated variables and the presence of multicollinearity among the predictors in the regression model, and VIF > 10 indicates that the regression coefficients are poorly estimated because of multicollinearity [9].
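A minimal NumPy sketch of this calculation, regressing each predictor on the others and applying the formula above (again on the hypothetical data frame used in the earlier sketches):

```python
# Sketch: compute the VIF for each predictor by regressing it on the remaining
# predictors and applying VIF_j = 1 / (1 - R_j^2). Uses only NumPy; the data
# frame and column names are the hypothetical ones introduced above.
import numpy as np

def vif(X):
    """Return one VIF per column of the 2-D array X."""
    n, p = X.shape
    vifs = []
    for j in range(p):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])      # intercept + other predictors
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)   # least-squares fit
        resid = y - A @ beta
        r2 = 1 - resid.var() / y.var()                 # coefficient of determination
        vifs.append(1.0 / (1.0 - r2))
    return vifs

X = df[predictors].to_numpy(dtype=float)
for name, v in zip(predictors, vif(X)):
    print(f"{name}: VIF = {v:.2f}")   # VIF above 5-10 signals problematic collinearity
```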
2.1.3. Eigenvalue Method
An eigenvalue represents the variance of a linear combination of the variables. Since the eigenvalues (λ) must sum to the number of independent variables, very small eigenvalues (close to 0.05) are indicative of multicollinearity, and even small changes in the data then lead to large changes in the regression coefficient estimates. Because eigenvalues close to 0.05 can be difficult to interpret on their own, the condition index can be used instead.
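A short sketch of these diagnostics, assuming the same hypothetical data frame as above; the condition-index cutoff in the final comment is a commonly cited rule of thumb rather than a value taken from this paper.

```python
# Sketch: eigenvalue diagnostics. The eigenvalues of the predictor correlation
# matrix sum to the number of predictors; near-zero eigenvalues signal
# multicollinearity, and the condition index sqrt(lambda_max / lambda_j)
# helps interpret them.
import numpy as np

R = df[predictors].corr().to_numpy()            # correlation matrix of the predictors
eigvals = np.sort(np.linalg.eigvalsh(R))[::-1]  # eigenvalues, largest first

cond_index = np.sqrt(eigvals[0] / eigvals)      # condition index for each eigenvalue
for lam, ci in zip(eigvals, cond_index):
    print(f"eigenvalue = {lam:.4f}, condition index = {ci:.1f}")
# Rule of thumb (assumption, not from this paper): eigenvalues near 0 or a
# condition index above roughly 30 suggest multicollinearity is distorting
# the coefficient estimates.
```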
What is Multicollinearity and why is it a problem?
Multicollinearity happens when the independent variables in a regression model are highly correlated with each other. It makes the model hard to interpret and can also create an overfitting problem. Absence of multicollinearity is a common assumption that analysts check before selecting variables for a regression model.

Learning Objectives: Multicollinearity (LMWR 7.3 and 11.1)
MATH 4387/5387 Applied Regression

After meetings on Multicollinearity (MC), students should be able to:

1. Basics

(a) Describe the problem of MC and articulate its effect on the OLS estimator of β.

(b) Describe the signs that MC is present in a given dataset.

(c) Identify three methods for detecting MC in data (variance inflation factors, condition numbers, and principal component analysis).

2. Variance Inflation Factors (VIFs)

(a) Define the VIF for a given predictor.

(b) Analyze VIFs to make decisions about MC.

3. Condition numbers, principal components (PCs) and PC regression (PCR)

(a) Define the scaled design matrix, Z.

(b) Identify Z^T Z as (proportional to) the correlation matrix of the regression predictors.

(c) Define the condition number of Z as

$$ k = \sqrt{\frac{\lambda_1}{\lambda_p}} $$

where λ1 and λp are the largest and smallest eigenvalues of Z^T Z, respectively.

(d) Interpret the condition number of Z.

(e) Define the principal components, c_j, of a given model.

(f) Identify the variance of each c_j.

(g) Describe PCR, the procedure for reducing the problem of MC and reducing the dimension of the predictor space (see the sketch after this list).
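A minimal sketch of the quantities in objective 3, assuming only a NumPy array X of raw predictors and a response vector y; the unit-length column scaling of Z and the number of retained components are illustrative choices, not requirements from these notes.

```python
# Minimal sketch: the scaled design matrix Z, its condition number k, the
# principal components c_j, and a bare-bones principal component regression.
# X (n x p raw predictors) and y (length-n response) are assumed to exist already.
import numpy as np

def pcr(X, y, n_components):
    """Return the condition number of Z and the PCR coefficients."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)                            # center each predictor
    Z = Xc / np.sqrt((Xc ** 2).sum(axis=0))            # scale columns to unit length
    # With this scaling, Z^T Z is the correlation matrix of the predictors.
    eigvals, eigvecs = np.linalg.eigh(Z.T @ Z)
    order = np.argsort(eigvals)[::-1]                  # largest eigenvalue first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    cond_number = np.sqrt(eigvals[0] / eigvals[-1])    # condition number k of Z
    C = Z @ eigvecs                                    # principal components c_j
    # var(c_j) is proportional to lambda_j, so the trailing PCs carry little signal.

    keep = C[:, :n_components]                         # drop the near-degenerate PCs
    A = np.column_stack([np.ones(n), keep])            # intercept + retained PCs
    gamma, *_ = np.linalg.lstsq(A, y, rcond=None)      # regress y on the kept PCs
    return cond_number, gamma

# Illustrative call, keeping 2 components (an arbitrary choice):
# k, gamma = pcr(X, y, n_components=2)
```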

What Is Multicollinearity?
Multicollinearity is the occurrence of high intercorrelations among two or more independent variables in a multiple
regression model. Multicollinearity can lead to skewed or misleading results when a researcher or analyst
attempts to determine how well each independent variable can be used most effectively to predict or understand
the dependent variable in a statistical model.

In general, multicollinearity can lead to wider confidence intervals that produce less reliable probabilities in terms
of the effect of independent variables in a model.
KEY TAKEAWAYS

 Multicollinearity is a statistical concept where several independent variables in a model are correlated.
 Two variables are considered to be perfectly collinear if their correlation coefficient is +/- 1.0.
 Multicollinearity among independent variables will result in less reliable statistical inferences.
 It is better to use independent variables that are not correlated or repetitive when building multiple
regression models that use two or more variables.
 The existence of multicollinearity in a data set can lead to less reliable results due to larger standard errors.

Understanding Multicollinearity
Statistical analysts use multiple regression models to predict the value of a specified dependent variable based
on the values of two or more independent variables. The dependent variable is sometimes referred to as the
outcome, target, or criterion variable.

An example is a multivariate regression model that attempts to anticipate stock returns based on items such
as price-to-earnings ratios (P/E ratios), market capitalization, past performance, or other data. The stock return is
the dependent variable and the various bits of financial data are the independent variables.

Multicollinearity in a multiple regression model indicates that collinear independent variables are related in some
fashion, although the relationship may or may not be causal. For example, past performance might be related
to market capitalization, as stocks that have performed well in the past will have increasing market values.

In other words, multicollinearity can exist when two independent variables are highly correlated. It can also
happen if an independent variable is computed from other variables in the data set or if two independent
variables provide similar and repetitive results.

Special Considerations
One of the most common ways of eliminating the problem of multicollinearity is to first identify collinear
independent variables and then remove all but one.

It is also possible to eliminate multicollinearity by combining two or more collinear variables into a single variable.
Statistical analysis can then be conducted to study the relationship between the specified dependent variable and
only a single independent variable.
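As a hedged illustration of these two remedies (not a prescription from the article), the snippet below first drops one variable of a highly correlated pair and then, alternatively, merges the pair into a single composite by averaging their standardized values; the data frame and column names are the hypothetical ones used in the earlier sketches.

```python
# Sketch of the two remedies described above, on the hypothetical survey data frame:
# (1) keep only one variable of a collinear pair, or (2) combine the pair into a
# single composite predictor by averaging their standardized (z-score) values.
import pandas as pd

# Remedy 1: drop one of the collinear pair (assumed here to be product_feature)
reduced = df.drop(columns=["product_feature"])

# Remedy 2: merge the collinear pair into one composite variable
z_quality = (df["product_quality"] - df["product_quality"].mean()) / df["product_quality"].std()
z_feature = (df["product_feature"] - df["product_feature"].mean()) / df["product_feature"].std()
combined = df.assign(quality_feature=(z_quality + z_feature) / 2)
combined = combined.drop(columns=["product_quality", "product_feature"])
```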

The statistical inferences from a model that contains multicollinearity may not be dependable.

Examples of Multicollinearity
In Investing

For investing, multicollinearity is a common consideration when performing technical analysis to predict probable
future price movements of a security, such as a stock or a commodity future.

Market analysts want to avoid using technical indicators that are collinear in that they are based on very similar or
related inputs; they tend to reveal similar predictions regarding the dependent variable of price movement.
Instead, the market analysis must be based on markedly different independent variables to ensure that they
analyze the market from different independent analytical viewpoints.

An example of a potential multicollinearity problem is performing technical analysis only using several similar
indicators.
Noted technical analyst John Bollinger, creator of the Bollinger Bands indicator, notes that "a cardinal rule for the
successful use of technical analysis requires avoiding multicollinearity amid indicators." To solve the problem,
analysts avoid using two or more technical indicators of the same type. Instead, they analyze a security using one
type of indicator, such as a momentum indicator, and then do a separate analysis using a different type of
indicator, such as a trend indicator.

For example, stochastics, the relative strength index (RSI), and Williams %R are all momentum indicators that
rely on similar inputs and are likely to produce similar results. In this case, it is better to remove all but one of the
indicators or find a way to merge several of them into just one indicator, while also adding a trend indicator that is
not likely to be highly correlated with the momentum indicator.
