
Speaking about multicollinearity, consider the model Yi = β̂1 + β̂2X2i + β̂3X3i + ûi. What happens if X2 and X3 are independent? What happens if X2 and X3 are perfectly linearly related?

If X2 and X3 are independent variables in the model Yi = β̂1 + β̂2X2i + β̂3X3i + ûi, there is no linear relationship (and hence no correlation) between them. In this case multicollinearity is not an issue: each regressor carries separate information, and β̂2 and β̂3 can be estimated without interference from each other.

However, if X2 and X3 are perfectly linearly related, it means that there is a perfect
correlation between the two variables. In this scenario, multicollinearity becomes a
problem.

When there is perfect collinearity between X2 and X3, it becomes impossible to estimate the individual coefficients β̂2 and β̂3 separately, because their effects on the dependent variable Yi cannot be distinguished from each other. Only a linear combination of the two coefficients is identifiable, so the model Yi = β̂1 + β̂2X2i + β̂3X3i + ûi cannot be estimated as written.

High but imperfect multicollinearity leads to unstable and unreliable coefficient estimates, as well as inflated standard errors. In such cases, even small changes in the data can result in large changes in the estimated coefficients, and it becomes difficult to interpret the individual effects of the independent variables on the dependent variable.
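To make the contrast concrete, here is a minimal Python sketch (the variable names, the sample size and the true coefficient values 1, 2 and 3 are invented for illustration): when X2 and X3 are generated independently, the estimate of β2 is essentially the same whether or not X3 is included in the regression, because the two regressors carry non-overlapping information.

import numpy as np

rng = np.random.default_rng(0)
n = 500
X2 = rng.normal(size=n)
X3 = rng.normal(size=n)                     # drawn independently of X2
u = rng.normal(size=n)
y = 1.0 + 2.0 * X2 + 3.0 * X3 + u

ones = np.ones(n)
X_full = np.column_stack([ones, X2, X3])    # regress y on constant, X2 and X3
b_full, *_ = np.linalg.lstsq(X_full, y, rcond=None)

X_short = np.column_stack([ones, X2])       # regress y on constant and X2 only
b_short, *_ = np.linalg.lstsq(X_short, y, rcond=None)

print("beta2_hat with X3 included:", round(b_full[1], 3))
print("beta2_hat with X3 omitted: ", round(b_short[1], 3))
# Both estimates sit close to the true value 2, because X2 and X3 are
# (almost exactly) uncorrelated in the sample.

If X3 were instead an exact linear function of X2, no such separation would be possible; the next question makes this explicit in matrix terms.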

Consider this now:


y = Xβ̂ + û
β̂ = (X'X)^(-1) X'y
cov(β̂) = σ^2 (X'X)^(-1)

In order to determine β̂ or cov(β̂) we need to invert X'X, and the inverse exists only if the determinant |X'X| is non-zero.

What happens to |X'X| when X2 and X3 are perfectly linearly related?

When X2 and X3 are perfectly linearly related, it means that there is a perfect correlation
between the two variables. In this case, the matrix X'X becomes singular, which means its
determinant (|X'X|) becomes zero.

The matrix X'X is the product of the transpose of the design matrix (X') with the design matrix itself (X), and it is used to calculate the coefficient estimates (β̂) and their covariance matrix (cov(β̂)) in ordinary least squares (OLS) estimation. Inverting X'X is necessary to obtain both β̂
and cov(β̂).

However, when X2 and X3 are perfectly linearly related, the columns of X become linearly
dependent, resulting in a singular matrix X'X. A singular matrix does not have an inverse, and
its determinant is zero. Therefore, |X'X| becomes zero, and it is not possible to calculate β̂
or cov(β̂) using the formulas above.
In such cases of perfect multicollinearity, where two or more independent variables are
perfectly correlated, it is not possible to estimate the individual coefficients and their
covariance matrix using the ordinary least squares method. Alternative methods, such as
dropping one of the perfectly correlated variables or using techniques like ridge regression
or principal component analysis, may be used to address the issue of multicollinearity.
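A short numerical sketch of this point, again with invented data: build the design matrix once with an independently generated X3 and once with X3 = 2·X2, and compare the determinant, rank and condition number of X'X.

import numpy as np

rng = np.random.default_rng(1)
n = 100
X2 = rng.normal(size=n)

# Case 1: X3 generated independently of X2
X_a = np.column_stack([np.ones(n), X2, rng.normal(size=n)])
print("independent case: |X'X| =", np.linalg.det(X_a.T @ X_a))

# Case 2: X3 = 2*X2, an exact linear function of X2
X_b = np.column_stack([np.ones(n), X2, 2.0 * X2])
XtX_b = X_b.T @ X_b
print("collinear case:   |X'X| =", np.linalg.det(XtX_b))   # zero up to rounding error
print("rank(X'X) =", np.linalg.matrix_rank(XtX_b))         # 2 instead of 3
print("condition number =", np.linalg.cond(XtX_b))         # astronomically large
# The OLS formula requires (X'X)^(-1); with a singular X'X that inverse does not exist.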

Can you expand a bit on the following:


Sources of MC
1. Data collection method
2. Constraints on the model
3. Model specification
4. Overdetermined model

1. Data collection method: Multicollinearity can arise due to the way data is collected. If
there is a systematic relationship or dependency among the independent variables during
data collection, it can lead to multicollinearity. For example, if data is collected from a survey
that asks similar or related questions, it can result in high correlation between the variables.

2. Constraints on the model: Multicollinearity can occur when there are constraints imposed
on the model. These constraints can be theoretical or practical in nature. For instance, if a
researcher is required to include specific variables in the model due to theoretical
considerations or external requirements, it can lead to multicollinearity if those variables are
highly correlated with existing variables in the model.

3. Model specification: The way the model is specified can also contribute to
multicollinearity. This includes the choice of independent variables and the functional form
of the model. Including highly correlated variables, or variables derived from the same
underlying concept, can lead to multicollinearity. In addition, polynomial terms or interaction
terms tend to be highly correlated with the original variables and can introduce
multicollinearity (see the sketch after this list).

4. Overdetermined model: Multicollinearity can occur when the number of independent
variables in the model exceeds the number of observations or is close to it. In an
overdetermined model, where the number of variables is large compared to the sample size,
it becomes more likely to have high correlations between the variables. This can happen, for
example, in econometric studies with a limited number of observations but a large number
of potential explanatory variables.
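As flagged under point 3, here is a small sketch (invented data): a regressor that takes only positive values is strongly correlated with its own square, while centering the regressor before squaring removes most of that correlation.

import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0.0, 10.0, size=1000)

raw_corr = np.corrcoef(x, x ** 2)[0, 1]               # x and x^2 move together
xc = x - x.mean()                                     # centre x first ...
centered_corr = np.corrcoef(xc, xc ** 2)[0, 1]        # ... then square it

print("corr(x, x^2)                 =", round(raw_corr, 3))       # roughly 0.97
print("corr(x - xbar, (x - xbar)^2) =", round(centered_corr, 3))  # close to 0

The same centering idea reappears below under the remedial measure "transformation of variables".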

It is important to identify and address multicollinearity because it can affect the
interpretation and stability of the model's coefficients. Techniques such as the variance
inflation factor (VIF), correlation analysis, and regularization methods like ridge regression or
lasso regression can be used to diagnose and mitigate multicollinearity in regression analysis.

Consequences of MC
1. Although the OLS estimators are BLUE, their variances are inflated.
Intuition? Impact of inflated variances?

When multicollinearity is present in a regression model, the consequences include inflated
variances of the coefficient estimators. Here's an intuitive explanation and the impact of
inflated variances:

Intuition:
Multicollinearity occurs when independent variables in a regression model are highly
correlated with each other. As a result, the information contained in the correlated variables
overlaps, making it difficult for the model to distinguish their individual effects on the
dependent variable. This leads to instability in the estimation process.
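This intuition can be made precise in the three-variable model used earlier. A standard textbook expression, writing x2i for the deviation of X2i from its sample mean and r23 for the sample correlation between X2 and X3, is

var(β̂2) = σ^2 / [ Σ x2i^2 · (1 − r23^2) ]

and analogously for var(β̂3). The factor 1/(1 − r23^2) is the variance inflation factor (VIF): it equals 1 when r23 = 0, is roughly 5 when r23 ≈ 0.9, and grows without bound as r23 approaches ±1.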

Impact of inflated variances:


1. Unreliable coefficient estimates: Inflated variances of the coefficient estimators indicate
greater uncertainty in estimating their true values. Large standard errors associated with the
coefficients mean that the estimated coefficients are less precise and more likely to deviate
from the true population values. Consequently, it becomes challenging to make reliable
inferences about the effects of the independent variables on the dependent variable.

2. Difficulty in variable selection: Inflated variances can impact the identification of
significant variables in the model. With multicollinearity, the standard errors of the
coefficients increase, making it harder to detect statistically significant relationships. This can
lead to difficulties in determining which independent variables are truly important and
should be included in the model.

3. Sensitivity to small changes: Multicollinearity can cause the coefficient estimates to be
highly sensitive to minor changes in the data. Small variations in the data can lead to
significant changes in the estimated coefficients. This sensitivity makes the model less stable
and robust, as slight changes in the data can result in different conclusions about the
relationships between variables.

4. Misinterpretation of variable importance: Multicollinearity can distort the interpretation
of variable importance. When variables are highly correlated, their effects on the dependent
variable become confounded, making it difficult to discern their individual contributions.
This can lead to erroneous conclusions about the relative importance of the variables in
explaining the variation in the dependent variable.

Overall, the inflated variances resulting from multicollinearity undermine the reliability,
stability, and interpretability of the regression model. It is essential to address
multicollinearity through techniques like variable transformation, feature selection, or
regularization methods to obtain more accurate and meaningful results.
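The inflation of variances can also be seen by simulation. The following sketch (sample size, number of replications and true coefficients are all illustrative) repeatedly draws samples with a chosen correlation between X2 and X3 and records the OLS estimate of β2: the estimates stay centred on the true value, but their spread grows as the correlation rises.

import numpy as np

rng = np.random.default_rng(3)
n, reps = 100, 2000

def beta2_estimates(rho):
    """OLS estimates of beta_2 over many simulated samples with corr(X2, X3) = rho."""
    cov = np.array([[1.0, rho], [rho, 1.0]])
    estimates = np.empty(reps)
    for r in range(reps):
        X23 = rng.multivariate_normal([0.0, 0.0], cov, size=n)
        X = np.column_stack([np.ones(n), X23])
        y = 1.0 + 2.0 * X23[:, 0] + 3.0 * X23[:, 1] + rng.normal(size=n)
        b, *_ = np.linalg.lstsq(X, y, rcond=None)
        estimates[r] = b[1]
    return estimates

for rho in (0.0, 0.9, 0.99):
    est = beta2_estimates(rho)
    print(f"rho = {rho:4.2f}:  mean = {est.mean():.3f},  sd = {est.std():.3f}")
# The mean stays near the true value 2 (OLS is still unbiased), while the spread
# grows roughly like sqrt(VIF) = 1/sqrt(1 - rho^2).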

Detection of MC
1. High R2 but few significant t ratios.
2. High pairwise correlation between explanatory variables
3. Examination of partial correlation
4. Auxiliary regressions
5. Eigenvalues and condition index
6. Variance inflation factors (and tolerance)
7. Plots

1. High R2 but few significant t-ratios: Multicollinearity can manifest as a high coefficient of
determination (R2), often accompanied by a significant overall F test, indicating that the
model fits well as a whole. Yet, when the individual t-ratios are examined, few of the
coefficients turn out to be statistically significant, and some may have unexpected signs. This
combination of a good overall fit with individually insignificant coefficients is a classic
symptom of multicollinearity.

2. High pairwise correlation between explanatory variables: Multicollinearity is often
indicated by a high pairwise correlation between two or more independent variables.
Correlation coefficients close to +1 or -1 suggest a strong linear relationship between the
variables, which can be indicative of multicollinearity.

3. Examination of partial correlation: Partial correlation analysis helps assess the relationship
between two variables while controlling for the effects of other variables. High partial
correlations between two independent variables, after accounting for the effects of other
variables, can indicate multicollinearity.

4. Auxiliary regressions: An auxiliary regression regresses one independent variable on all of
the remaining independent variables. Running such a regression for each regressor makes it
possible to assess the presence of multicollinearity: a high R2 (or a significant overall F
statistic) in an auxiliary regression indicates that the regressor in question is close to a linear
combination of the others.

5. Eigenvalues and condition index: The eigenvalues of X'X (or of the correlation matrix of
the regressors) provide a numerical measure of multicollinearity. Eigenvalues close to zero,
or equivalently a large condition index (the square root of the ratio of the largest to the
smallest eigenvalue), indicate the presence of multicollinearity; condition indices above a
threshold of roughly 30 are usually read as evidence of serious multicollinearity. A
computational sketch of this check and the next one is given at the end of this answer.

6. Variance inflation factors (VIF) and tolerance: VIF quantifies the extent of multicollinearity
by measuring how much the variance of the estimated regression coefficient is inflated due
to the correlation with other independent variables. High VIF values (typically above 5 or 10)
suggest multicollinearity, while low tolerance values (1/VIF) indicate a high degree of
collinearity.

7. Plots: Visual inspection of scatterplots or correlation matrices can provide an initial
indication of multicollinearity. Scatterplots showing strong linear relationships between
independent variables, or correlation matrices highlighting high correlations between
variables, can suggest the presence of multicollinearity.

It is important to use a combination of these techniques to thoroughly assess the presence
and severity of multicollinearity in a regression model.
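As an illustration of checks 5 and 6, here is a sketch using numpy together with statsmodels' variance_inflation_factor, on invented and deliberately near-collinear data (the 0.95/0.05 mixing weights are an arbitrary choice for the example).

import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(4)
n = 200
x2 = rng.normal(size=n)
x3 = 0.95 * x2 + 0.05 * rng.normal(size=n)   # nearly an exact copy of x2
x4 = rng.normal(size=n)                      # unrelated regressor
X = np.column_stack([np.ones(n), x2, x3, x4])

# Check 5: condition index = sqrt(largest eigenvalue / smallest eigenvalue) of X'X
eigvals = np.linalg.eigvalsh(X.T @ X)
print("condition index:", round(float(np.sqrt(eigvals.max() / eigvals.min())), 1))

# Check 6: VIF_j = 1 / (1 - R^2 of the auxiliary regression of regressor j on the others)
for idx, name in enumerate(["x2", "x3", "x4"], start=1):   # column 0 is the constant
    print(f"VIF({name}) = {variance_inflation_factor(X, idx):.1f}")
# Expect a condition index far above 30 and very large VIFs for x2 and x3,
# against a VIF near 1 for x4.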

Remedial measures
1. Use a-priori information
2. Extending/expanding data
3. Removing variables
4. Transformation of variables
5. Ridge regression/ factor analysis regression etc

1. Use a-priori information: Prior knowledge or theory about the relationships between
variables can help guide the selection and inclusion of variables in the model. By considering
the theoretical relevance and importance of variables, you can prioritize including variables
that have the most meaningful and independent impact on the dependent variable.

2. Extending/expanding data: Increasing the sample size by collecting more data can
sometimes alleviate multicollinearity. A larger dataset can provide more variability and
reduce the chances of high correlation among variables. However, this may not always be
feasible or practical.

3. Removing variables: If variables are highly correlated, one possible approach is to remove
one or more of the correlated variables from the model. This can be done based on their
theoretical significance, statistical significance, or other criteria. By eliminating redundant
variables, you can reduce multicollinearity.

4. Transformation of variables: Transforming variables can help mitigate multicollinearity.
Techniques such as standardization (z-score transformation) or centering variables, in
particular centering before forming polynomial or interaction terms, can alter the
relationships between the regressors and reduce multicollinearity. Non-linear
transformations, such as taking logarithms or square roots, may also help address issues
related to collinearity.

5. Ridge regression/Factor analysis regression: Ridge regression is a technique that
introduces a penalty term into the ordinary least squares estimation, helping to stabilize the
coefficients and reduce the impact of multicollinearity. It shrinks the coefficients towards
zero and can be particularly effective when dealing with highly correlated variables. Factor
analysis regression and the closely related principal component regression combine the
correlated variables into a smaller set of uncorrelated composite variables; these composites
are then used in the regression analysis, reducing multicollinearity.

These remedial measures aim to mitigate the negative effects of multicollinearity and
improve the stability, reliability, and interpretability of the regression model. The choice of
remedial measure depends on the specific characteristics of the dataset, the research
objectives, and the underlying assumptions of the regression model.
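To make measure 5 concrete, here is a minimal scikit-learn sketch on invented, nearly collinear data; the penalty value alpha=10 and the choice to keep a single principal component are arbitrary for the example (in practice both would be chosen by cross-validation or by inspecting the explained variance).

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.decomposition import PCA

rng = np.random.default_rng(5)
n = 200
x2 = rng.normal(size=n)
x3 = 0.95 * x2 + 0.05 * rng.normal(size=n)    # nearly collinear pair
X = np.column_stack([x2, x3])
y = 1.0 + 2.0 * x2 + 3.0 * x3 + rng.normal(size=n)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)           # penalized OLS: shrinks the coefficients
print("OLS coefficients:  ", np.round(ols.coef_, 2))
print("Ridge coefficients:", np.round(ridge.coef_, 2))

# Principal component regression: regress y on the leading principal component,
# an uncorrelated composite of the original, correlated regressors.
pca = PCA(n_components=1)
Z = pca.fit_transform(X)
pcr = LinearRegression().fit(Z, y)
print("PCR coefficient on the first component:", np.round(pcr.coef_, 2))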
