When petrol is pumped into tanks, hydrocarbons escape. To evaluate the effectiveness of pollution controls, experiments were performed, from which the following dataset was obtained. We have 32 observations on
1. Hydrocarbons escaping (grams)
2. Tank temperature (degrees Fahrenheit)
3. Petrol temperature (degrees Fahrenheit)
4. Initial tank pressure (pounds/square inch)
5. Petrol pressure (pounds/square inch)
We note that significant multicollinearity is present in the dataset. So instead of an OLS model, we use stepwise and LASSO regressions, and finally note that the fitted model is almost free of the effects of collinearity.


October 27, 2017

When petrol is pumped into tanks, hydrocarbons escape. To evaluate the effectiveness of pollution controls, experiments were performed, from which the following dataset was obtained.

'data.frame': 32 obs. of 5 variables:
 $ Tank.temperature     : int 33 31 33 37 36 35 59 60 59 60 ...
 $ Petrol.temperature   : int 53 36 51 51 54 35 56 60 60 60 ...
 $ Initial.tank.pressure: num 3.32 3.1 3.18 3.39 3.2 3.03 4.78 4.72 4.6 4.53 ...
 $ Petrol.pressure      : num 3.42 3.26 3.18 3.08 3.41 3.03 4.57 4.72 4.41 4.53 ...
 $ Hydrocarbons.escaping: int 29 24 26 22 27 21 33 34 32 34 ...

Here, we have 32 observations on the response variable Hydrocarbons escaping (grams) and 4 explanatory variables: Tank temperature (degrees Fahrenheit), Petrol temperature (degrees Fahrenheit), Initial tank pressure (pounds/square inch) and Petrol pressure (pounds/square inch). Let us denote these by y, x1, x2, x3 and x4 respectively.

Primary Analysis:

Let us first fit an Ordinary Least Squares model of the response variable on all of the explanatory variables. This will give us some insight into the nature of the data, and we will then proceed to check the validity of the assumptions. We observe that the fit seems to be very good in terms of the Adjusted R² and the F-statistic.
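A minimal sketch of this fit in R (assuming the data have been read into a data frame, here hypothetically named petrol, with the column names shown in the str() output above):

petrol <- read.table("hydrocarbon.txt", header = TRUE) # hypothetical file name
y  <- petrol$Hydrocarbons.escaping
x1 <- petrol$Tank.temperature
x2 <- petrol$Petrol.temperature
x3 <- petrol$Initial.tank.pressure
x4 <- petrol$Petrol.pressure
model.1 <- lm(y ~ 1 + x1 + x2 + x3 + x4) # OLS on all four explanatory variables
summary(model.1)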

Call:
lm(formula = y ~ 1 + x1 + x2 + x3 + x4)

Residuals:
   Min     1Q Median     3Q    Max
-5.586 -1.221 -0.118  1.320  5.106

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  1.01502    1.86131   0.545  0.59001
x1          -0.02861    0.09060  -0.316  0.75461
x2           0.21582    0.06772   3.187  0.00362 **
x3          -4.32005    2.85097  -1.515  0.14132
x4           8.97489    2.77263   3.237  0.00319 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Multiple R-squared: 0.9261, Adjusted R-squared: 0.9151
F-statistic: 84.54 on 4 and 27 DF, p-value: 7.249e-15


Residual Analysis

Now we prepare the residual plot; the residuals do not seem to be uniformly scattered around zero.

[Figure 5: Residual plot of the OLS model (Residuals vs. Index)]

So we also examine the plot of the residuals against the fitted values.


[Figure 6: Plot of residuals vs. predicted values (Fits)]

This should have been a patternless, random scatter, which it is not. Hence we suspect that not all of the assumptions hold.
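Both diagnostics can be reproduced with a short sketch (assuming model.1 is the OLS fit above):

e.1 <- residuals(model.1) # OLS residuals
plot(e.1, ylab = "Residuals", main = "Residual Plot") # residuals vs. index
abline(h = 0, lty = 2)
plot(fitted(model.1), e.1, xlab = "Fits", ylab = "Residuals",
     main = "Plot of Residuals vs. Predicted Values") # residuals vs. fits
abline(h = 0, lty = 2)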

Model Assumptions

The assumptions of the OLS model Y = Xβ + ε are the following:
1. Errors have zero mean, i.e. E(ε_i) = 0, ∀i,
2. Errors have constant variance, i.e. V(ε_i) = σ², ∀i,
3. Errors are uncorrelated, i.e. cov(ε_i, ε_j) = 0, ∀i ≠ j,
4. Errors are normally distributed, i.e. ε ∼ N(0, σ²I_n),
5. Explanatory variables are independent, i.e. X is of full column rank.

Now, we test these assumptions one by one.

Normality of Errors

In this case, we first draw the quantile-quantile plot of the residuals. The diagram closely resembles the y = x line.


[Figure 7: Normal Q–Q plot of the OLS residuals]

Then we perform the Shapiro-Wilk normality test; the null hypothesis of the test, i.e. normality of errors, is not rejected, with a considerably high p-value. So we conclude that the errors can be assumed to come from a normal distribution.

data: e.1
W = 0.97847, p-value = 0.7539
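A sketch of this normality check (with e.1 denoting the OLS residuals, as above):

qqnorm(e.1)       # sample quantiles against normal quantiles
qqline(e.1)       # reference line the points should follow under normality
shapiro.test(e.1) # Shapiro-Wilk test; a large p-value supports normality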

Multicollinearity

Next, we plot each of the explanatory variables against one another. Some of these graphs, most notably the graph of x3 vs. x4, show a clear linear pattern.

[Figure 8: Petrol.temperature vs. Tank.temperature]
[Figure 9: Initial.tank.pressure vs. Tank.temperature]
[Figure 10: Petrol.pressure vs. Tank.temperature]
[Figure 11: Initial.tank.pressure vs. Petrol.temperature]
[Figure 12: Petrol.pressure vs. Petrol.temperature]
[Figure 13: Petrol.pressure vs. Initial.tank.pressure]

We then compute the correlation matrix, to see whether the explanatory variables are correlated or not.

          x1        x2        x3        x4
x1 1.0000000 0.7742909 0.9554116 0.9337690
x2 0.7742909 1.0000000 0.7815286 0.8374639
x3 0.9554116 0.7815286 1.0000000 0.9850748
x4 0.9337690 0.8374639 0.9850748 1.0000000

Now, we strongly suspect multicollinearity and hence calculate the VIFs, obtaining the following.

       x1        x2        x3        x4
12.997379  4.720998 71.301491 61.932647

The high values too suggest collinearity. For the final verification, we compute the condition number of X*′X*, where X* is the scaled design matrix. Its large value leads us to conclude that the extent of multicollinearity in the dataset is significant.

[1] 482.6577
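These diagnostics can be computed along the following lines (a sketch; the VIFs use the car package, and the condition number is taken as the ratio of the largest to the smallest eigenvalue of X*′X*, which is one common definition):

X <- cbind(x1, x2, x3, x4)
cor(X)                                # pairwise correlations of the regressors
car::vif(model.1)                     # variance inflation factors
X.star <- scale(X)                    # scaled design matrix
ev <- eigen(crossprod(X.star))$values # eigenvalues of X*'X*
max(ev) / min(ev)                     # condition number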

Outliers in x-direction

We know that if high-leverage or influential points are present in the dataset, they may lead to pseudo-multicollinearity, for example through masking, swamping, etc. So to avoid that situation, we first try to detect these points and check whether their removal decreases the extent of multicollinearity.


Detection of Leverage Points

First, we detect the influential points by the hat diagonals and covariance ratios, and obtain the following

detected points:

[1] 2 3 4 15 17 18 20 23
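A sketch of one way to flag such points from the hat diagonals and covariance ratios; the cut-offs below are the usual rules of thumb, assumed here since the report does not state its thresholds:

n <- length(y); p <- length(coef(model.1))
h   <- hatvalues(model.1) # leverages (hat diagonals)
cvr <- covratio(model.1)  # covariance ratios
which(h > 2 * p / n | abs(cvr - 1) > 3 * p / n) # rule-of-thumb cut-offs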

Outliers in y-direction

Before proceeding to fit models, we first detect the outliers by the DFBETAS, DFFITS and Cook's distance criteria. The detected points are the following:

[1] 4 15 18 21 23 24 25 26
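These criteria are available in base R; a sketch with the usual cut-offs (again assumed, not taken from the report):

out.dfb <- apply(abs(dfbetas(model.1)), 1, max) > 2 / sqrt(n) # DFBETAS
out.dff <- abs(dffits(model.1)) > 2 * sqrt(p / n)             # DFFITS
out.cd  <- cooks.distance(model.1) > 4 / n                    # Cook's distance
which(out.dfb | out.dff | out.cd)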

Then, to verify whether these are really outliers, we compare them against the rest of the points, which are assumed to be clean. Here we test whether the observations under scrutiny come from a distribution different from that of the normal observations. We consider the mean-shift outlier model Y = Xβ + Zγ + δ. If the null hypothesis H0: γ = 0 is rejected, we can conclude that at least some of these points are outliers, and we then test for the significance of the individual γ coefficients. We include in the clean dataset those points for which these coefficients are not significantly different from zero. Then we perform the test again and continue in the same way until we obtain a set of points for which all the coefficients are significant. We treat those points as outliers.

k.1 <- length(potential.outlier.1) # NUMBER OF INITIALLY DETECTED OUTLIERS
y.mod.1 <- c(y[-potential.outlier.1], y[potential.outlier.1]) # INITIALLY MODIFIED RESPONSE
# DESIGN MATRIX WITH A MEAN-SHIFT DUMMY COLUMN FOR EACH SUSPECTED OUTLIER
X.mod.1 <- cbind(rbind(X.1[-potential.outlier.1, ], X.1[potential.outlier.1, ]),
                 rbind(matrix(0, n - k.1, k.1), diag(k.1)))
outlier.model.1 <- lm(y.mod.1 ~ 1 + X.mod.1) # INITIAL OUTLIER SHIFT MODEL
# F-STATISTIC COMPARING THE SHIFT MODEL AGAINST THE ORIGINAL FIT
F.1 <- ((sum((residuals(model.1)) ^ 2) - sum((residuals(outlier.model.1)) ^ 2)) / k.1) /
       (sum((residuals(outlier.model.1)) ^ 2) / (n - p - k.1))
F.1 > qf(0.05, k.1, n - p - k.1, lower.tail = FALSE)

[1] TRUE

summary(outlier.model.1)

Call:
lm(formula = y.mod.1 ~ 1 + X.mod.1)

Residuals:
    Min      1Q  Median      3Q     Max
-4.4613 -0.6052  0.0000  0.4497  3.9471

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.75298    1.41066  -0.534 0.599688
X.mod.1x1   -0.11158    0.08277  -1.348 0.193478
X.mod.1x2    0.31378    0.07556   4.153 0.000541 ***
X.mod.1x3    0.10255    3.84141   0.027 0.978981
X.mod.1x4    4.71023    3.97008   1.186 0.250076
X.mod.1     -3.97634    2.67335  -1.487 0.153318
X.mod.1      2.85640    3.03530   0.941 0.358485
X.mod.1      3.30415    2.94678   1.121 0.276142
X.mod.1     -2.51355    2.23577  -1.124 0.274913
X.mod.1     -6.80742    2.46254  -2.764 0.012344 *
X.mod.1     -3.79404    2.16636  -1.751 0.096014 .
X.mod.1      4.63336    2.05466   2.255 0.036121 *
X.mod.1      5.58316    2.09783   2.661 0.015419 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Multiple R-squared: 0.9745, Adjusted R-squared: 0.9584
F-statistic: 60.46 on 12 and 19 DF, p-value: 1.606e-12

potential.outlier.2 <- potential.outlier.1[c(5, 7, 8)] # OUTLIERS DETECTED AFTER CROSSCHECK
k.2 <- length(potential.outlier.2) # NUMBER OF DETECTED OUTLIERS AFTER CROSSCHECK
y.mod.2 <- c(y[-potential.outlier.2], y[potential.outlier.2]) # MODIFIED RESPONSE AFTER CROSSCHECK
X.mod.2 <- cbind(rbind(X.1[-potential.outlier.2, ], X.1[potential.outlier.2, ]),
                 rbind(matrix(0, n - k.2, k.2), diag(k.2)))
outlier.model.2 <- lm(y.mod.2 ~ 1 + X.mod.2) # OUTLIER SHIFT MODEL AFTER CROSSCHECK
F.2 <- ((sum((residuals(model.1)) ^ 2) - sum((residuals(outlier.model.2)) ^ 2)) / k.2) /
       (sum((residuals(outlier.model.2)) ^ 2) / (n - p - k.2))
F.2 > qf(0.05, k.2, n - p - k.2, lower.tail = FALSE)

[1] TRUE

summary(outlier.model.2)

Call:
lm(formula = y.mod.2 ~ 1 + X.mod.2)

Residuals:
    Min      1Q  Median      3Q     Max
-3.5204 -0.8975  0.0000  1.0743  4.3300

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.16239    1.45093   0.112 0.911818
X.mod.2x1   -0.10068    0.07347  -1.370 0.183236
X.mod.2x2    0.21759    0.05235   4.157 0.000354 ***
X.mod.2x3   -0.31137    2.36207  -0.132 0.896226
X.mod.2x4    5.98182    2.26424   2.642 0.014282 *
X.mod.2     -7.12255    2.38953  -2.981 0.006496 **
X.mod.2      5.30706    2.21009   2.401 0.024441 *
X.mod.2      6.33178    2.23658   2.831 0.009238 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Multiple R-squared: 0.9611, Adjusted R-squared: 0.9498
F-statistic: 84.76 on 7 and 24 DF, p-value: 2.289e-15

Checking Influence

Now, we consider the design matrix without the rows corresponding to the influential points and calculate the condition number based on that. In this context, we must mention that if multicollinearity is present in the dataset, this is not the correct approach by itself.

[1] 1281.337

Even now, the condition number is far too large. So we conclude that the problem of multicollinearity is serious, and hence we opt for suitable regression methods.

Modifying Data

Before that, we should decide on our treatment of the outliers. Since our dataset is small and we have already verified the presence of severe multicollinearity in the data, we cannot afford to remove these observations completely. Instead, we predict these points from an OLS regression fitted to the rest of the points, and henceforth continue our analysis with these predicted observations.
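A sketch of this imputation step (assuming the confirmed outlier indices are held in potential.outlier.2 from the crosscheck above; y.mod.3 is the response used in what follows):

outliers <- potential.outlier.2 # points confirmed as outliers above
clean.model <- lm(y ~ 1 + x1 + x2 + x3 + x4, subset = -outliers)
y.mod.3 <- y
y.mod.3[outliers] <- predict(clean.model,
                             newdata = data.frame(x1, x2, x3, x4)[outliers, ])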

Model Fitting

To handle multicollinearity, we can proceed either by removing some of the explanatory variables, or by performing biased regression, where we minimise the mean squared error subject to some penalty term. In this assignment, we first try to select a model by stepwise regression, and then we apply LASSO regression.

Stepwise Regression

Here, we start with the null model, i.e. only an intercept term. Then we add variables one by one and calculate the AIC. At each step, we see which of the following gives us the minimum AIC value (the search itself is sketched below):
1. Adding any further variable,
2. Removing a variable already added,
3. Keeping the model the same.
If the last option gives the minimum AIC, the algorithm stops there and that is our final model. Otherwise, we repeat the same process until we reach such a stage.
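Such a search is available through step(); the trace shown below begins at the full model and allows both additions and removals, so a sketch consistent with the output is:

full.model <- lm(y.mod.3 ~ 1 + x1 + x2 + x3 + x4)
step.model <- step(full.model, direction = "both") # AIC-guided stepwise search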

Results

Start: AIC=48.27
y.mod.3 ~ 1 + x1 + x2 + x3 + x4

       Df Sum of Sq    RSS    AIC
- x3    1     0.089 105.90 46.295
<none>              105.81 48.268
- x1    1     9.204 115.01 48.937
- x4    1    34.690 140.50 55.342
- x2    1    76.949 182.76 63.757

Step: AIC=46.3
y.mod.3 ~ x1 + x2 + x4

       Df Sum of Sq    RSS    AIC
<none>              105.90 46.295
+ x3    1     0.089 105.81 48.268
- x1    1    17.254 123.15 49.125
- x2    1   112.325 218.22 67.433
- x4    1   186.416 292.31 76.787

LASSO Regression

In this method, we minimise (1/n) Σ_{i=1}^{n} (Y_i − x_i′β)² + λ Σ_{j=1}^{p} |β_j|. This is justified as multicollinearity inflates the variances of the estimated regression coefficients, and by including a constraint of the form Σ_{j=1}^{p} |β_j| ≤ c, we force the regression coefficients to take small values and can also drive some of these coefficients to zero. This is supported by the fact that, in the presence of multicollinearity, not all of the explanatory variables are actually required. Thus, even though this method no longer yields unbiased estimators, we still avoid loss, as the estimates have comparatively smaller MSEs than the unbiased ones.
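A sketch of the fitting procedure with the glmnet package (alpha = 1 selects the LASSO penalty; the optimal λ is chosen by cross-validation):

library(glmnet)
X <- cbind(x1, x2, x3, x4)
cv.fit <- cv.glmnet(X, y.mod.3, alpha = 1) # cross-validation over a lambda grid
cv.fit$lambda.min                          # the optimal lambda reported below
lasso.fit <- glmnet(X, y.mod.3, alpha = 1, lambda = cv.fit$lambda.min)
coef(lasso.fit)                            # sparse coefficient matrix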

Results

Here, we first obtain an optimal λ.

[1] 0.0633387

Then we fit the model, and the regression coefficients are obtained as follows:

5 x 1 sparse Matrix of class "dgCMatrix"
                     s0
(Intercept)  0.26954539
x1          -0.05775523
x2           0.22050927
x3           .
x4           5.02592959

Now, we plot the original y observations together with the predictions obtained from this method. Since the biased model seems to yield good results, we proceed to the residual analysis of this model. Here, we check for normality, autocorrelation and heteroscedasticity of the residuals.


[Figure 14: Plot of original observations and LASSO predictions against the observation index]

Normality Check

First, we check for normality of the errors; both the Q-Q plot and the Shapiro-Wilk test conclude in the affirmative.


[Figure 15: Normal Q–Q plot of the LASSO residuals]

data: e.2

W = 0.96997, p-value = 0.4985

Now, we check the condition number and note that it is much lower than 100; hence the model can be considered practically free from the effect of multicollinearity.

[1] 46.34992

Next, we prepare the ACF and PACF plots of the residuals, and note that none of the spikes are significant.
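A sketch of these plots (with e.2 denoting the LASSO residuals):

e.2 <- as.vector(y.mod.3 - predict(lasso.fit, newx = X)) # LASSO residuals
acf(e.2,  main = "Autocorrelation Plot of LASSO residuals")
pacf(e.2, main = "Partial Autocorrelation Plot of LASSO residuals")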


[Figure 16: Autocorrelation plot of the LASSO residuals]
[Figure 17: Partial autocorrelation plot of the LASSO residuals]

Now, we check for equal variances. We successively select subsets (windows of some fixed size) of the residuals and compute the variance within each group. If the plot of these variances against the indices reveals a pattern, we can suspect that heteroscedasticity is present in the dataset. We repeat this procedure for different subset sizes; a sketch of the computation follows the plots. From the plots, we can see an increasing pattern in each case, which indicates that the residuals are not homoscedastic.


[Figure 19: Moving variances with order 10]
[Figure 20: Moving variances with order 15]
[Figure 21: Moving variances with order 20]
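A sketch of the moving-variance computation for a window of size w (here w = 10; the report repeats it for w = 15 and w = 20):

w <- 10
mov.var <- sapply(seq_len(length(e.2) - w + 1),
                  function(i) var(e.2[i:(i + w - 1)])) # variance in each window
plot(mov.var, type = "b", xlab = "Index", ylab = "Variance",
     main = paste("Moving variances with Order", w))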

To confirm our suspicion, we test whether the variances can be modelled by the explanatory variables. We use the squared residuals as estimates of the variances, i.e. we consider the model e² = Xµ + ν (fitted below). From this model, the null hypothesis is not rejected at the 5% level of significance, but the p-value is quite low. So, based on the data, we cannot confidently conclude that our data are homoscedastic, and we would certainly prefer to run the test on a larger number of observations. Alternatively, as the regression coefficient of x2 is not significant, we drop that variable, and the test then becomes significant at the same 5% level.
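A sketch of this auxiliary regression, with the squared LASSO residuals as the response:

het.model <- lm((e.2^2) ~ 1 + x1 + x2 + x4) # variance proxies on the regressors
summary(het.model)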

Call:
lm(formula = (e.2^2) ~ 1 + x1 + x2 + x4)

Residuals:
    Min      1Q  Median      3Q     Max
-13.657  -6.869  -1.496   2.590  42.344

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -2.2856     7.8284  -0.292   0.7725
x1            0.7482     0.3029   2.470   0.0199 *
x2            0.1723     0.2460   0.700   0.4895
x4          -10.0453     4.9177  -2.043   0.0506 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 11.77 on 28 degrees of freedom
Multiple R-squared: 0.2066, Adjusted R-squared: 0.1216
F-statistic: 2.43 on 3 and 28 DF, p-value: 0.08611

Conclusion

After all these calculations, we see that the initial pressure of the tank is not included in our final model, as the other explanatory variables explain it due to multicollinearity. If the temperature of the tank increases, the amount of escaped hydrocarbons decreases, while an increase in the temperature or pressure of the petrol increases the waste. One should consider these points while taking steps against pollution, keeping in mind that these conclusions are based on a biased model affected by the problem of heteroscedasticity.

