
FACULTY OF COMPUTER AND MATHEMATICAL SCIENCES

BACHELOR OF SCIENCE (HONOURS) STATISTICS

ECONOMETRICS

QMT533

ASSESSMENT 2

NAME MATRIX NUMBER GROUP

Tengku Arif Alzam Bin Tengku Jalal 2020974691 N4CS2416T1

Muhammad Nazmi Bin Yusaini 2020614422 N4CS2416T1

Lukhman Bin Zainin 2020854042 N4CS2416T1

Muhammad Azwan Bin Najib 2020483678 N4CS2416T1

LECTURER: MADAM JAIDA NAJIHAH JAMIDIN

DATE: 1 AUGUST 2022

MARCH 2022 - AUGUST 2022


1.0 MULTICOLLINEARITY

1.1 OBJECTIVE OF THE TEST

A multicollinearity test aids in determining whether a regression model contains multicollinearity. Multicollinearity refers to the situation in which two or more independent variables are highly correlated, or interrelated, with one another.

1.2 PROBLEM AND CONSEQUENCES

1.2.1 Problem

Multicollinearity in a dataset can arise from five main sources:

1. The data collection method employed. For example, sampling over a limited range of values
taken by the regressors in the population.

2. Constraints on the model or in the population being sampled.

3. Model specification. For example, adding polynomial terms to a regression model, especially when the range of X is small.

4. Improper use of dummy variables, such as failing to exclude one category: a dummy variable is included for every category or group together with an intercept (the dummy variable trap).

5. An overdetermined model. This happens when the number of independent variables exceeds the sample size.

1.2.2 Consequences Of The Problem

1. Multicollinearity does not violate the OLS assumptions. The OLS estimates are still unbiased and BLUE (Best Linear Unbiased Estimators); in general, it does not inhibit our ability to obtain a good fit, nor does it tend to affect inferences about the mean response or predictions of new observations.

2. Although BLUE, the OLS estimators have large variances and covariances, making precise estimation difficult.

3. Confidence intervals for coefficients tend to be much wider, leading to the acceptance of the
“zero null hypothesis”.

4. The t statistics of one or more coefficients tend to be very small, so the coefficients appear statistically insignificant.

5. Although the t ratios of one or more coefficients are statistically insignificant, R² can be very high.

6. The OLS estimators and their standard errors can be sensitive to small changes in the data.

7. The common interpretation of regression coefficients as measuring the change in the mean of
the response variable when the given independent variable is increased by one unit while all
other predictor variables are held constant is not fully applicable.

1.3 ASSUMPTIONS OF MULTICOLLINEARITY

● The estimated regression coefficient of any predictor depends on which other predictors are included in the model.
● The precision of the estimated regression coefficients declines as more correlated predictors are added to the model.
● The marginal contribution of any predictor variable to reducing the error sum of squares depends on the other predictors already included in the model.
● The result of testing the null hypothesis βₖ = 0 depends on which other predictors are included in the model.

1.4 DETECTING MULTICOLLINEARITY

1. A high R² value but few significant t statistics.

2. None of the t-ratios for the individual coefficients is statistically significant, yet the overall F
statistic is.

3. High pairwise correlation among regressors. A rule of thumb is that if the pairwise correlation
between two regressors is high, say, in excess of 0.8, then multicollinearity is a serious problem.

4. Scatterplot. It is a good practice to use scatter plots to see how the various variables in a
regression model are related.

5. In particular, as variables are added, look for changes in the signs of effects (e.g. switches
from positive to negative) that seem theoretically questionable.

6. Auxiliary regressions. Klein's rule of thumb suggests that multicollinearity may be a troublesome problem only if the R² obtained from an auxiliary regression is greater than the overall R².

7. Tolerance (TOL). For regressor j, TOLⱼ = 1 − Rⱼ², which is the reciprocal of its VIF.

The closer the TOL is to zero, the greater the degree of collinearity of that variable with the other independent variables.

8. Variance inflation factor (VIF). If VIF > 10, the variable is said to be highly collinear with the other regressors (a code sketch for computing TOL and VIF follows this list).

9. Eigenvalues and condition index.

Find condition number,

If k is between 100 – 1000; there is moderate to strong multicollinearity.

If k > 1000; there is severe multicollinearity.

Find condition index,

If CI is between 10 – 30; there is moderate to strong multicollinearity.

If CI > 30; there is severe multicollinearity
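The following is a minimal, hedged sketch of how points 3, 7 and 8 above could be checked in Python with pandas and statsmodels. It assumes the Passenger Car Mileage data are loaded into a DataFrame named cars with columns MPG, HP, MPGF, VOL, WT and SP; the file name cars.csv is a placeholder, not the actual source file.

    import pandas as pd
    import statsmodels.api as sm
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    cars = pd.read_csv("cars.csv")                    # assumed file name
    predictors = cars[["HP", "MPGF", "VOL", "WT", "SP"]]

    # Pairwise correlations among the regressors (flag |r| > 0.8)
    print(predictors.corr())

    # Tolerance and VIF for each regressor (VIF > 10 suggests strong collinearity)
    X = sm.add_constant(predictors)
    for i, name in enumerate(X.columns):
        if name == "const":
            continue
        vif = variance_inflation_factor(X.values, i)
        print(f"{name}: VIF = {vif:.2f}, TOL = {1 / vif:.4f}")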

1.5 FINDINGS FOR MULTICOLLINEARITY

Regression model: MPG = β0 + β1HP + β2MPGF + β3VOL + β4WT + β5SP + u

1.5.1 Correlation Matrix

Figure 1.0 Correlation Matrix

Figure 1.0 shows that there are numerous pairs of variables with correlations greater than 0.8 or less than -0.8, namely HP-MPGF, HP-WT, HP-SP, MPG-MPGF, MPG-WT, and MPGF-WT. Therefore, multicollinearity exists for all of the aforementioned pairs.

1.5.2 Checking On Variance Inflation Factor & Tolerance Value

Figure 1.1 Checking on variance inflation factor and tolerance value

As seen in Figure 1.1, the Tolerance (TOL) value for each independent variable is small (close to zero). As a result, multicollinearity exists for each of the variables. Moreover, only the variable VOL has a Variance Inflation Factor (VIF) of less than 10, so it can be concluded that all of the variables, with the exception of VOL, are highly collinear.

1.5.3 Eigenvalues And Condition Index

Figure 1.3 Eigenvalues and Condition Index

Figure 1.3 shows a relatively high Condition Number (2.752458e+08) and large variance proportions shared among the variables HP, MPGF, WT, and SP, so multicollinearity is very likely.

1.5.4 Manual calculation

Only the positive eigenvalues are considered when identifying the maximum and minimum eigenvalues. As shown in Figure 1.3, the maximum eigenvalue is 5.668374 and the minimum eigenvalue is 7.481992e-17.

Find the condition number, k = Max eigenvalue / Min eigenvalue = 5.668374 / 7.481992e-17 = 7.5760225e+16

Find the condition index, CI = √k = √(7.5760225e+16) = 275245754.466 ≈ 2.752458e+08

Since k is greater than 1000 and CI is greater than 30, there is severe multicollinearity in the linear model.
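A minimal sketch of the same eigenvalue calculation, again assuming the hypothetical cars DataFrame and cars.csv file name used earlier; note that with a near-zero smallest eigenvalue the ratio is numerically sensitive.

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm

    cars = pd.read_csv("cars.csv")                    # assumed file name
    X = sm.add_constant(cars[["HP", "MPGF", "VOL", "WT", "SP"]]).to_numpy()

    eigvals = np.linalg.eigvalsh(X.T @ X)   # eigenvalues of X'X, in ascending order
    k = eigvals.max() / eigvals.min()       # condition number
    ci = np.sqrt(k)                         # condition index
    print(f"k = {k:.6e}, CI = {ci:.3f}")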

1.6 REMEDIAL MEASURE

1. Increase the sample size. This will usually decrease the standard errors and attenuate the collinearity problem.

2. Dropping variables. When faced with severe multicollinearity, one of the simplest things is to
drop one of the collinear variables. But, if the variable really belongs in the model, this can lead
to specification error, which can be even worse than multicollinearity.

3. Transform the highly correlated variables.

4. Use multivariate statistical techniques such as factor analysis and principal components.

5. Use Ridge Regression. Ridge regression is one of several methods that have been proposed to remedy multicollinearity problems by modifying the method of least squares to allow biased estimators of the regression coefficients (a short code sketch follows this list).

6. Do nothing. Simply realize that multicollinearity is present and be aware of its consequences.
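As an illustration of remedy 5, here is a hedged ridge regression sketch using scikit-learn on the hypothetical cars DataFrame; the penalty alpha = 1.0 is an arbitrary illustrative value, not a tuned choice.

    import pandas as pd
    from sklearn.linear_model import Ridge
    from sklearn.preprocessing import StandardScaler

    cars = pd.read_csv("cars.csv")                    # assumed file name
    X = cars[["HP", "MPGF", "VOL", "WT", "SP"]]
    y = cars["MPG"]

    # Ridge shrinks the coefficients: biased, but with much smaller variance
    ridge = Ridge(alpha=1.0)
    ridge.fit(StandardScaler().fit_transform(X), y)
    print(dict(zip(X.columns, ridge.coef_)))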

1.7 CONCLUSION

Therefore, based on the multicollinearity checks above, it can be said that the data have a problem, as all of the multicollinearity diagnostics flag an issue. Every test indicates that multicollinearity exists. Therefore, it can be concluded that multicollinearity is present, and further investigation and action, such as dropping variables or observations, need to be taken to address this problem.

2.0 AUTOCORRELATION

2.1 OBJECTIVE OF THE AUTOCORRELATION TEST

The autocorrelation function is a statistical representation used to analyze the degree of similarity between a time series and a lagged version of itself. This function allows the analyst to compare the current value of a data set to its past values.

2.2 THE NATURE OF THE PROBLEM

The degree of similarity between a given time series and a lagged version of itself over
successive time intervals is mathematically represented by autocorrelation. It's conceptually
similar to the correlation between two different time series, but autocorrelation uses the same
time series twice, once in its original form and once lagged one or more time periods.
For example, the data indicates that it is more likely to rain tomorrow if it is raining today
than if it is clear today. A stock may have a large positive autocorrelation of returns when it
comes to investing, meaning that if it is "up" today, it is more likely to be up tomorrow as well.
Naturally, autocorrelation can be a useful tool for traders to utilize; particularly for technical
analysts.
A common method for detecting autocorrelation is the Durbin-Watson test. The Durbin-Watson statistic is used in regression analysis to identify autocorrelation. The test always generates a statistic between 0 and 4. Values closer to 0 and 4 imply higher levels of positive and negative autocorrelation, respectively, whereas values closer to the midpoint (2) reflect lower levels of autocorrelation.

2.3 CONSEQUENCES OF THE PROBLEM

1. Ordinary Least Squares (OLS) regression coefficients are still unbiased.
2. Ordinary Least Squares (OLS) regression coefficients are no longer efficient.
3. The usual formulas for estimating the variances of the estimators are biased.
4. Confidence intervals and hypothesis tests based on the t and F distributions are unreliable.
5. Computed variances and standard errors of forecasts may be inaccurate.

2.4 ASSUMPTIONS RELATED TO AUTOCORRELATION

2.4.1 THERE IS NO AUTOCORRELATION OF ERRORS

- A linear regression model assumes independent error terms. This indicates that
one observation's error term is unaffected by another observation's error term. If
not, it is referred to as autocorrelation. It is generally observed in time series
data. Time series data consists of observations for which data is collected at
discrete points in time. Usually, observations at adjacent time intervals will have
correlated errors.

2.5 ALL POSSIBLE TESTS/DETECTION OF THE PROBLEM

There are many statistical tests which help to identify the presence of autocorrelation.
We can also identify autocorrelation visually through ACF plots. We will discuss them one by
one.

1. Durbin-Watson Test: A very well-known test used to identify the presence of autocorrelation is the Durbin-Watson (DW) test. The DW test statistic is expressed as:

d = Σ(eₜ − eₜ₋₁)² / Σ eₜ²

where eₜ is the error term at period t. This formula returns a value that lies between 0 and 4. A value of 2 indicates no autocorrelation. A value greater than 2 and closer to 4 indicates negative autocorrelation, and a value less than 2 and closer to 0 indicates positive autocorrelation (a code sketch covering the tests in this list follows below). The null and alternative hypotheses of this test are:

H₀: No first-order autocorrelation exists among the residuals.

H₁: The residuals are autocorrelated.

2. Ljung-Box Q Test: Another very popular test is the Ljung-Box Q test. The null and
alternative hypothesis for this test is as follows:

H₀: The autocorrelations up to lag k are all 0.

H₁: The autocorrelations up to lag k differ from 0.

For this test, if the resulting p-value is less than the chosen level of significance, we reject the null hypothesis and conclude that there is autocorrelation in the residuals.

3. ACF plots: A plot of the autocorrelation of a time series by lag is called the Auto-Correlation Function, or ACF, plot. We plot the values of the correlation at each lag along with a confidence band in an ACF plot. In simple terms, it describes how well the present value of the series is related to its past values.
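A hedged sketch of the three detection methods above, applied to the residuals of the regression from Section 1; the cars DataFrame, the cars.csv file name, and the lag choices are assumptions for illustration only.

    import pandas as pd
    import matplotlib.pyplot as plt
    import statsmodels.api as sm
    from statsmodels.stats.stattools import durbin_watson
    from statsmodels.stats.diagnostic import acorr_ljungbox
    from statsmodels.graphics.tsaplots import plot_acf

    cars = pd.read_csv("cars.csv")                    # assumed file name
    X = sm.add_constant(cars[["HP", "MPGF", "VOL", "WT", "SP"]])
    model = sm.OLS(cars["MPG"], X).fit()

    print("Durbin-Watson:", durbin_watson(model.resid))   # 1. DW statistic
    print(acorr_ljungbox(model.resid, lags=[10]))         # 2. Ljung-Box Q test
    plot_acf(model.resid, lags=20)                        # 3. ACF plot
    plt.show()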

2.6 FINDINGS ON AUTOCORRELATION

2.6.1 Residual Plot

Figure 2.0 : Residuals vs Fitted

Figure 2.0 above shows the residuals plotted against the fitted values. Most of the residuals lie between -5 and 5, which suggests a roughly constant variance. However, there are two outliers, corresponding to fitted values greater than 10. The plot also suggests that there is autocorrelation in the residuals.

2.6.2 Durbin-Watson Test

Figure 2.1: Durbin-Watson Test

From Figure 2.1 above, the Durbin-Watson statistic is 0.99184. This value is low because it is below the acceptable lower bound of 1.50. Therefore, there is positive autocorrelation.

2.7 REMEDIAL MEASURE

If the diagnostic tests suggest that there is an autocorrelation problem, then we have several options, as follows:

1. Try to find out if it is pure autocorrelation or not.

2. If it is pure autocorrelation, then one can transform the original model so that the new model no longer has the problem of pure autocorrelation.

3. For large samples, we can use the Newey-West method to obtain standard errors of the OLS estimators that are corrected for autocorrelation. It is an extension of White's heteroscedasticity-consistent standard error method.
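A minimal sketch of option 3 with statsmodels: the coefficients are the usual OLS estimates, but the reported standard errors are Newey-West (HAC) corrected. The cars DataFrame is the same assumption as before, and maxlags = 1 is only an illustrative choice.

    import pandas as pd
    import statsmodels.api as sm

    cars = pd.read_csv("cars.csv")                    # assumed file name
    X = sm.add_constant(cars[["HP", "MPGF", "VOL", "WT", "SP"]])

    # Same OLS coefficients, autocorrelation-robust (HAC) standard errors
    hac = sm.OLS(cars["MPG"], X).fit(cov_type="HAC", cov_kwds={"maxlags": 1})
    print(hac.summary())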

2.8 CONCLUSION

Autocorrelation is important because it can help us uncover patterns in our data, select the best prediction model, and correctly evaluate the effectiveness of our model. From the findings, there is positive autocorrelation: the residual plot suggests it, and it is confirmed by the Durbin-Watson test, since the statistic of 0.99184 is less than 1.50. Both diagnostics show that there is positive autocorrelation.

3.0 HETEROSCEDASTICITY

3.1 OBJECTIVE OF THE HETEROSCEDASTICITY TEST

Heteroscedasticity happens when the variance of the disturbance term differs across observations. In symbols:

E(µᵢ²) = var(µᵢ) = σᵢ², i = 1, 2, …, n

When heteroscedasticity occurs, the variance of the error term is not constant across observations. The OLS assumptions are met only if the data are not heteroscedastic, that is, if the error variance is constant (homoscedasticity).

3.2 PROBLEM AND CONSEQUENCES

3.2.1 Problem

1. Following the error-learning model: as people learn, their errors of behaviour become smaller over time. In this case, σᵢ² is expected to decrease.

2. As income increases, people have more discretionary income and hence more scope for choice about the disposition of their income. Hence, σᵢ² is likely to increase with income.

3. Heteroscedasticity can also arise as a result of the presence of an outlier.


4. Other model misspecifications can produce heteroscedasticity. For example, it may be that instead of using Y, you should be using the log of Y. Instead of using X, maybe you should be using X² or both X and X². If the model were correctly specified, one might find that the patterns of heteroscedasticity disappear.
5. Skewness in the distribution of one or more regressors included in the model
6. Incorrect data transformation (e.g. ratio or first difference transformations)
7. Incorrect functional form (e.g. linear vs log-linear models)

3.2.2 Consequences of Heteroscedasticity

1. The OLS estimators are still unbiased and consistent.


2. Heteroscedasticity typically causes OLS to no longer be the minimum-variance/efficient estimator (among all the linear unbiased estimators). That is, the OLS estimators are no longer BLUE.
3. In addition, the standard errors are biased when heteroscedasticity is present: S.E.(intercept) will be too high and S.E.(slope) will be too low.
4. This in turn leads to bias in test statistics and confidence intervals; the t and F tests are no longer reliable.

3.3 ASSUMPTIONS RELATED TO HETEROSCEDASTICITY

1. The residuals should be plotted against the fitted values.
2. The telltale pattern for heteroscedasticity is that, as the fitted values increase, the variance of the residuals also increases.
3. Under homoscedasticity, the plot displays random residuals with no pattern.
4. Heteroscedasticity can also arise as a result of the presence of an outlier.

3.4 ALL POSSIBLE TESTS AND EXPLANATION OF THE FINDINGS

3.4.1 Graphical Method: Residual vs Fitted Value

Figure 3.1: Residual vs Fitted graph of Passenger Car Mileage Data

Figure 3.1 shows that heteroscedasticity exists. The points are not all randomly distributed around zero; instead, they show a decreasing and then increasing pattern.
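A minimal matplotlib sketch of how a residual-vs-fitted plot like Figure 3.1 could be produced, using the same assumed cars DataFrame and cars.csv file name as before.

    import pandas as pd
    import matplotlib.pyplot as plt
    import statsmodels.api as sm

    cars = pd.read_csv("cars.csv")                    # assumed file name
    X = sm.add_constant(cars[["HP", "MPGF", "VOL", "WT", "SP"]])
    model = sm.OLS(cars["MPG"], X).fit()

    plt.scatter(model.fittedvalues, model.resid)
    plt.axhline(0, linestyle="--")
    plt.xlabel("Fitted values")
    plt.ylabel("Residuals")
    plt.show()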

3.4.2 Formal Test

Glejser Test

Figure 3.2: Result of Glejser Test

Hypothesis:

H₀: There is no heteroscedasticity in the error variance

H₁: There is heteroscedasticity present in the error variance

By looking at the significance value (p-value), it can be concluded that:

1. If the p-value > 0.05, then there is no heteroscedasticity problem.

2. If the p-value < 0.05, then there is a heteroscedasticity problem.

Figure 3.2 shows that the p-value is 0.000000114, which is lower than α = 0.05. So, reject H₀. This indicates that heteroscedasticity exists in this dataset.
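The Glejser test is not built into statsmodels, so the following is a hand-rolled sketch of one common variant: regress the absolute OLS residuals on the regressors and inspect the overall F-test p-value. The cars DataFrame and cars.csv file name are the same assumptions as before.

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm

    cars = pd.read_csv("cars.csv")                    # assumed file name
    X = sm.add_constant(cars[["HP", "MPGF", "VOL", "WT", "SP"]])
    resid = sm.OLS(cars["MPG"], X).fit().resid

    # Auxiliary regression of |residuals| on the regressors
    glejser = sm.OLS(np.abs(resid), X).fit()
    print("Glejser p-value:", glejser.f_pvalue)       # < 0.05 => heteroscedasticity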

Goldfeld-Quandt Test

Figure 3.3: Result of Goldfeld-Quandt Test

Hypothesis:

H₀: There is no heteroscedasticity in the error variance

H₁: There is heteroscedasticity present in the error variance

By looking at the significance value (p-value), it can be concluded that:

1. If the p-value > 0.05, then there is no heteroscedasticity problem.

2. If the p-value < 0.05, then there is a heteroscedasticity problem.

Figure 3.3 shows that the p-value is 1, which is higher than α = 0.05. So, we fail to reject H₀. This indicates that no heteroscedasticity exists in this dataset.
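A hedged sketch of the Goldfeld-Quandt test with statsmodels, under the same cars DataFrame assumption; the idx and split arguments can be used to control how the sample is ordered and divided, and the defaults are used here for brevity.

    import pandas as pd
    import statsmodels.api as sm
    from statsmodels.stats.diagnostic import het_goldfeldquandt

    cars = pd.read_csv("cars.csv")                    # assumed file name
    X = sm.add_constant(cars[["HP", "MPGF", "VOL", "WT", "SP"]])

    f_stat, p_value = het_goldfeldquandt(cars["MPG"], X)[:2]
    print("Goldfeld-Quandt p-value:", p_value)        # > 0.05 => fail to reject H0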

Breusch-Pagan Test

Figure 3.4: Result of Breusch-Pagan Test

Hypothesis:

H₀: There is no heteroscedasticity in the error variance

H₁: There is heteroscedasticity present in the error variance

By looking at the significance value (p-value), it can be concluded that:

1. If the p-value > 0.05, then there is no heteroscedasticity problem.

2. If the p-value < 0.05, then there is a heteroscedasticity problem.

Figure 3.4 shows that the p-value is 0.00003023, which is lower than α = 0.05. So, reject H₀. This indicates that heteroscedasticity exists in this dataset.
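A minimal sketch of the Breusch-Pagan test with statsmodels, under the same cars DataFrame assumption.

    import pandas as pd
    import statsmodels.api as sm
    from statsmodels.stats.diagnostic import het_breuschpagan

    cars = pd.read_csv("cars.csv")                    # assumed file name
    X = sm.add_constant(cars[["HP", "MPGF", "VOL", "WT", "SP"]])
    model = sm.OLS(cars["MPG"], X).fit()

    lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(model.resid, X)
    print("Breusch-Pagan p-value:", lm_pvalue)        # < 0.05 => heteroscedasticity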

White's General Heteroscedasticity Test

Figure 3.5: Result of White's General Heteroscedasticity Test

Hypothesis:

H₀: There is no heteroscedasticity in the error variance

H₁: There is heteroscedasticity present in the error variance

By looking at the significance value (p-value), it can be concluded that:

1. If the p-value > 0.05, then there is no heteroscedasticity problem.

2. If the p-value < 0.05, then there is a heteroscedasticity problem.

Figure 3.5 shows that the p-value is 0.0000156, which is lower than α = 0.05. So, reject H₀. This indicates that heteroscedasticity exists in this dataset.
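A minimal sketch of White's general heteroscedasticity test with statsmodels, again under the assumed cars DataFrame and cars.csv file name.

    import pandas as pd
    import statsmodels.api as sm
    from statsmodels.stats.diagnostic import het_white

    cars = pd.read_csv("cars.csv")                    # assumed file name
    X = sm.add_constant(cars[["HP", "MPGF", "VOL", "WT", "SP"]])
    model = sm.OLS(cars["MPG"], X).fit()

    lm_stat, lm_pvalue, f_stat, f_pvalue = het_white(model.resid, X)
    print("White test p-value:", lm_pvalue)           # < 0.05 => heteroscedasticity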

3.5 REMEDIAL MEASURES

We have two different cases:


1. When σᵢ² is known: the method of Weighted Least Squares.

2. When σᵢ² is not known: White's heteroscedasticity-consistent standard errors, or plausible assumptions about the heteroscedasticity pattern.

METHOD OF WEIGHTED LEAST SQUARES

Weighted least squares is simply ordinary least squares in which each observation is weighted according to the expected size of its error variance, so that observations with larger error variance receive less weight.
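An illustrative WLS sketch with statsmodels: the weights below assume, purely for demonstration, that the error variance is proportional to WT squared; in practice the variance structure would have to be estimated or otherwise justified.

    import pandas as pd
    import statsmodels.api as sm

    cars = pd.read_csv("cars.csv")                    # assumed file name
    X = sm.add_constant(cars[["HP", "MPGF", "VOL", "WT", "SP"]])

    weights = 1.0 / cars["WT"] ** 2                   # weight = 1 / assumed variance
    wls = sm.WLS(cars["MPG"], X, weights=weights).fit()
    print(wls.summary())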

WHITE’S HETEROSCEDASTICITY

As noted before, heteroscedasticity causes the standard errors to be biased. Using White's heteroscedasticity-consistent standard errors changes the standard errors so that they tend to be more trustworthy (this method does not change the coefficient estimates). Hence, the test statistics will give reasonably accurate p-values.
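A minimal sketch of White's heteroscedasticity-consistent (robust) standard errors in statsmodels: the coefficients are unchanged, only the covariance estimate differs. HC1 is one of several available variants and is chosen here arbitrarily; the cars DataFrame is the same assumption as before.

    import pandas as pd
    import statsmodels.api as sm

    cars = pd.read_csv("cars.csv")                    # assumed file name
    X = sm.add_constant(cars[["HP", "MPGF", "VOL", "WT", "SP"]])

    robust = sm.OLS(cars["MPG"], X).fit(cov_type="HC1")
    print(robust.summary())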

PLAUSIBLE ASSUMPTIONS ABOUT HETEROSCEDASTICITY PATTERN

This approach redefines or transforms the variables to eliminate the heteroscedasticity. Basically, you need to figure out what the variance of the error terms or residuals depends on and make some sort of adjustment for it. For example, if the variance of the residuals seems to get larger as one of the explanatory variables increases in size, a possible solution is to divide the dependent variable by the explanatory variable whose wide range could affect the variance of the error terms.
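A hedged sketch of such a transformation: assuming, for illustration only, that the residual variance grows with WT, the model is divided through by WT and re-estimated on the assumed cars DataFrame.

    import pandas as pd
    import statsmodels.api as sm

    cars = pd.read_csv("cars.csv")                    # assumed file name

    y_star = cars["MPG"] / cars["WT"]
    X_star = cars[["HP", "MPGF", "VOL", "SP"]].div(cars["WT"], axis=0)
    X_star["inv_WT"] = 1.0 / cars["WT"]               # carries the original intercept
    # In the transformed model the new intercept estimates the original WT
    # coefficient, and the coefficient on inv_WT estimates the original intercept.
    transformed = sm.OLS(y_star, sm.add_constant(X_star)).fit()
    print(transformed.summary())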

3.6 CONCLUSION

Heteroscedasticity is important to detect and address in datasets. The existence of heteroscedasticity shows that the dataset does not have constant error variance. For this dataset, not all of the formal tests give consistent results: the Glejser test, Breusch-Pagan test, and White's general heteroscedasticity test conclude that heteroscedasticity is present, while the Goldfeld-Quandt test shows otherwise. Since the majority of the tests, together with the residual plot, detect heteroscedasticity, it can be concluded that the Passenger Car Mileage dataset is heteroscedastic, although the conflicting Goldfeld-Quandt result suggests that further checking may be warranted.

