Analysis of Hydrocarbon Data

Anirban Ray & Soumya Sahu


October 27, 2017

Description of the Dataset


When petrol is pumped into tanks, hydrocarbons escape. To evaluate the effectiveness of pollution controls,
experiments were performed. The following dataset is obtained from the experiments.
'data.frame': 32 obs. of 5 variables:
$ Tank.temperature : int 33 31 33 37 36 35 59 60 59 60 ...
$ Petrol.temperature : int 53 36 51 51 54 35 56 60 60 60 ...
$ Initial.tank.pressure: num 3.32 3.1 3.18 3.39 3.2 3.03 4.78 4.72 4.6 4.53 ...
$ Petrol.pressure : num 3.42 3.26 3.18 3.08 3.41 3.03 4.57 4.72 4.41 4.53 ...
$ Hydrocarbons.escaping: int 29 24 26 22 27 21 33 34 32 34 ...
Here we have 32 observations on the response variable, Hydrocarbons escaping (grams), and on
4 explanatory variables: Tank temperature (degrees Fahrenheit), Petrol temperature (degrees
Fahrenheit), Initial tank pressure (pounds/square inch) and Petrol pressure (pounds/square
inch). Let us denote these by y, x1, x2, x3 and x4 respectively.

Primary Analysis:
Let us first fit an ordinary least squares (OLS) model of the response variable on all of the explanatory
variables. This will give us some insight into the nature of the data, and we will then proceed to check the
validity of the assumptions. We observe that the fit seems very good in terms of the adjusted R-squared
(0.9151) and the F-statistic (84.54 on 4 and 27 degrees of freedom).

Call:
lm(formula = y ~ 1 + x1 + x2 + x3 + x4)

Residuals:
Min 1Q Median 3Q Max
-5.586 -1.221 -0.118 1.320 5.106

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.01502 1.86131 0.545 0.59001
x1 -0.02861 0.09060 -0.316 0.75461
x2 0.21582 0.06772 3.187 0.00362 **
x3 -4.32005 2.85097 -1.515 0.14132
x4 8.97489 2.77263 3.237 0.00319 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.73 on 27 degrees of freedom


Multiple R-squared: 0.9261, Adjusted R-squared: 0.9151
F-statistic: 84.54 on 4 and 27 DF, p-value: 7.249e-15
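
As a rough illustration of the fit above, the following sketch refits the same formula on only the ten observations visible in the str() output. The names hydrocarbon and model_sketch are hypothetical, and because the full data set has 32 rows the numbers it prints will not match the summary in this report.

```r
# First ten observations only, copied from the str() output in this report.
# The remaining 22 rows are not shown there, so this is NOT the authors' fit.
hydrocarbon <- data.frame(
  x1 = c(33, 31, 33, 37, 36, 35, 59, 60, 59, 60),                       # Tank temperature
  x2 = c(53, 36, 51, 51, 54, 35, 56, 60, 60, 60),                       # Petrol temperature
  x3 = c(3.32, 3.10, 3.18, 3.39, 3.20, 3.03, 4.78, 4.72, 4.60, 4.53),   # Initial tank pressure
  x4 = c(3.42, 3.26, 3.18, 3.08, 3.41, 3.03, 4.57, 4.72, 4.41, 4.53),   # Petrol pressure
  y  = c(29, 24, 26, 22, 27, 21, 33, 34, 32, 34)                        # Hydrocarbons escaping
)

# Same model formula as in the report: intercept plus all four regressors.
model_sketch <- lm(y ~ 1 + x1 + x2 + x3 + x4, data = hydrocarbon)
fit_summary <- summary(model_sketch)

# The two goodness-of-fit quantities discussed in the text.
print(fit_summary$adj.r.squared)
print(fit_summary$fstatistic)
```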

Residual Analysis
Now we prepare the residual plot, which does not seem to be uniformly scattered around zero.

[Figure 5: Residual Plot. Vertical axis: Residuals; horizontal axis: Index.]
So we also plot the residuals against the fitted values.

[Figure 6: Plot of Residuals vs. Predicted Values. Vertical axis: Residuals; horizontal axis: Fits.]
This should have been a patternless scatter, which it is not. Hence we suspect that not all of the
assumptions hold.

Model Assumptions
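The assumptions of the OLS model $\underset{\sim}{Y} = X\underset{\sim}{\beta} + \underset{\sim}{\epsilon}$ are the following:

1. Errors are unbiased (have zero mean), i.e. $E(\epsilon_i) = 0, \ \forall i$,
2. Errors have constant variance, i.e. $V(\epsilon_i) = \sigma^2, \ \forall i$,
3. Errors are uncorrelated, i.e. $\mathrm{cov}(\epsilon_i, \epsilon_j) = 0, \ \forall i \neq j$,
4. Errors are normally distributed, i.e. $\underset{\sim}{\epsilon} \sim N(\underset{\sim}{0}, \sigma^2 I_n)$,
5. Explanatory variables are linearly independent, i.e. $X$ is of full column rank.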
Now, we test these assumptions one by one.

Normality of Errors
In this case, we first draw the quantile-quantile (Q-Q) plot of the residuals. The points lie close to a
straight line.

[Figure 7: Normal Q-Q Plot. Vertical axis: Sample Quantiles; horizontal axis: Theoretical Quantiles.]
Then we perform the Shapiro-Wilk normality test. Its null hypothesis, i.e. normality of the errors, is
not rejected, with a considerably high p-value. So we can conclude that the errors can be assumed to come
from a normal distribution.

Shapiro-Wilk normality test

data: e.1
W = 0.97847, p-value = 0.7539
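
The normality check above can be sketched in base R as follows. This is only an illustration on the ten observations visible in the str() output (the names hydrocarbon, e_sketch, sw and qq are hypothetical), so the statistic it prints will differ from the value reported for the full 32 observations.

```r
# First ten observations only (from the str() output); not the full data set.
hydrocarbon <- data.frame(
  x1 = c(33, 31, 33, 37, 36, 35, 59, 60, 59, 60),
  x2 = c(53, 36, 51, 51, 54, 35, 56, 60, 60, 60),
  x3 = c(3.32, 3.10, 3.18, 3.39, 3.20, 3.03, 4.78, 4.72, 4.60, 4.53),
  x4 = c(3.42, 3.26, 3.18, 3.08, 3.41, 3.03, 4.57, 4.72, 4.41, 4.53),
  y  = c(29, 24, 26, 22, 27, 21, 33, 34, 32, 34)
)
e_sketch <- residuals(lm(y ~ x1 + x2 + x3 + x4, data = hydrocarbon))

# Shapiro-Wilk test: a large p-value means normality is not rejected.
sw <- shapiro.test(e_sketch)
print(sw)

# Coordinates of the normal Q-Q plot (theoretical vs. sample quantiles).
# Call qqnorm(e_sketch); qqline(e_sketch) instead to draw the plot.
qq <- qqnorm(e_sketch, plot.it = FALSE)
```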

Multicollinearity
Next, we plot the explanatory variables against one another. Some of these graphs, specifically the
graph of x3 against x4, show a clear linear pattern.

[Figure 8: Petrol.temperature against Tank.temperature.]
[Figure 9: Initial.tank.pressure against Tank.temperature.]
[Figure 10: Petrol.pressure against Tank.temperature.]
[Figure 11: Initial.tank.pressure against Petrol.temperature.]
[Figure 12: Petrol.pressure against Petrol.temperature.]
[Figure 13: Petrol.pressure against Initial.tank.pressure.]

We then compute the correlation matrix, to see whether the explanatory variables are correlated or not.
x1 x2 x3 x4
x1 1.0000000 0.7742909 0.9554116 0.9337690
x2 0.7742909 1.0000000 0.7815286 0.8374639
x3 0.9554116 0.7815286 1.0000000 0.9850748
x4 0.9337690 0.8374639 0.9850748 1.0000000
Now, we strongly suspect multicollinearity and hence calculate the VIFs and obtain the following.
x1 x2 x3 x4
12.997379 4.720998 71.301491 61.932647
The high values suggest collinearity too. For the final verification, we compute the condition number
of $X^{*\prime} X^{*}$, where $X^{*}$ is the scaled design matrix. Its large value leads us to conclude that the extent of
multicollinearity is significant in the dataset.
[1] 482.6577
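
The three diagnostics above (correlation matrix, VIFs, condition number) can be sketched in base R. The report does not say how the design matrix was scaled or which function produced the condition number, so this sketch assumes that the columns are centred and scaled so that $X^{*\prime} X^{*}$ is the correlation matrix, and that the condition number is the ratio of its largest to its smallest eigenvalue. It uses only the ten rows visible in the str() output, so the values will differ from those reported.

```r
# Regressors for the first ten observations only (from the str() output).
X <- cbind(
  x1 = c(33, 31, 33, 37, 36, 35, 59, 60, 59, 60),
  x2 = c(53, 36, 51, 51, 54, 35, 56, 60, 60, 60),
  x3 = c(3.32, 3.10, 3.18, 3.39, 3.20, 3.03, 4.78, 4.72, 4.60, 4.53),
  x4 = c(3.42, 3.26, 3.18, 3.08, 3.41, 3.03, 4.57, 4.72, 4.41, 4.53)
)

# Correlation matrix of the explanatory variables.
R <- cor(X)

# VIF_j is the j-th diagonal element of the inverse correlation matrix,
# which equals 1 / (1 - R_j^2) from regressing x_j on the other regressors.
vif <- diag(solve(R))
r2_x1 <- summary(lm(X[, "x1"] ~ X[, c("x2", "x3", "x4")]))$r.squared

# ASSUMPTION: condition number taken as the eigenvalue ratio of R.
eig <- eigen(R, symmetric = TRUE, only.values = TRUE)$values
condition_number <- max(eig) / min(eig)

print(round(vif, 2))
print(condition_number)
```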

Outliers in x-direction
We know that high leverage points or influential points in the dataset may induce
pseudo-multicollinearity, for example through masking or swamping. To avoid that situation, we first
try to detect these points and check whether removing them decreases the extent of
multicollinearity.

Detection of Leverage Points

First, we detect the influential points by the hat diagonals and covariance ratios, and obtain the following
detected points:
[1] 2 3 4 15 17 18 20 23
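
A minimal sketch of this kind of screening is given below. The report does not state the cutoffs used, so the conventional thresholds 2p/n for the hat diagonals and 3p/n for |COVRATIO - 1| are assumptions here, and the data are synthetic (a deterministic toy series with one point planted far out in x1), not the hydrocarbon data.

```r
# Synthetic illustration data: 32 points, the last one far away in x1.
n  <- 32
i  <- seq_len(n)
x1 <- c(seq(30, 60, length.out = n - 1), 120)   # observation 32 is the planted point
x2 <- 50 + 10 * cos(1.3 * i)
y  <- 5 + 0.3 * x1 + 0.2 * x2 + sin(1.7 * i)    # deterministic stand-in for noise

fit <- lm(y ~ x1 + x2)
p   <- length(coef(fit))                        # number of estimated coefficients

h  <- hatvalues(fit)                            # hat diagonals
cr <- covratio(fit)                             # covariance ratios

# ASSUMED conventional cutoffs; the report does not say which were used.
high_leverage <- which(h > 2 * p / n)
odd_covratio  <- which(abs(cr - 1) > 3 * p / n)
flagged <- sort(union(high_leverage, odd_covratio))
print(flagged)
```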

Outliers in y-direction
Before proceeding to fitting models, we first detect the outliers by DFBETA, DFFIT and Cook’s Distance
criteria. The detected points are the following:
[1] 4 15 18 21 23 24 25 26
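
The three criteria named above are available in base R as dffits(), cooks.distance() and dfbetas(). The sketch below applies them to synthetic data with one planted outlier; the cutoffs 2*sqrt(p/n), 4/n and 2/sqrt(n) are common rules of thumb and are assumptions, since the report does not state which thresholds were used.

```r
# Synthetic illustration data with a planted outlier at observation 10.
n <- 32
x <- seq_len(n)
y <- 3 + 0.5 * x + sin(1.7 * x)   # deterministic stand-in for noise
y[10] <- y[10] + 20               # planted shift in the response

fit <- lm(y ~ x)
p   <- length(coef(fit))

# ASSUMED rule-of-thumb cutoffs for each influence measure.
flag_dffits  <- which(abs(dffits(fit)) > 2 * sqrt(p / n))
flag_cook    <- which(cooks.distance(fit) > 4 / n)
flag_dfbetas <- which(apply(abs(dfbetas(fit)) > 2 / sqrt(n), 1, any))

# A point is a candidate outlier if any criterion flags it.
flagged <- sort(unique(c(flag_dffits, flag_cook, flag_dfbetas)))
print(flagged)
```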

Outlier Shift Model

Then, to verify whether they are really outliers, we compare these points against the rest of the points, which
are assumed to be clean. Here we test whether the observations under test come from a distribution different
from that of the normal observations. We consider the model
$\underset{\sim}{Y} = X\underset{\sim}{\beta} + Z\underset{\sim}{\gamma} + \underset{\sim}{\delta}$.
If the null hypothesis $H_0: \underset{\sim}{\gamma} = \underset{\sim}{0}$ is rejected, we can conclude that at least some of these points are
outliers, and we then test the significance of the individual $\gamma$ coefficients. We return to the clean
dataset those points whose coefficients are not significantly different from zero. Then we perform the test again
and continue in the same way until we obtain a set of points for which all the coefficients are significant. We
shall treat those points as outliers.
k.1 <- length(potential.outlier.1) # NUMBER OF INITIALLY DETECTED OUTLIERS
y.mod.1 <- c(y[-potential.outlier.1], y[potential.outlier.1]) # INITIALLY MODIFIED RESPONSE
X.mod.1 <- cbind(rbind(X.1[-potential.outlier.1, ], X.1[potential.outlier.1, ]), rbind(matrix(0, n - k.1
outlier.model.1 <- lm(y.mod.1 ~ 1 + X.mod.1) # INITIAL OUTLIER SHIFT MODEL
F.1 <- ((sum((residuals(model.1)) ^ 2) - sum((residuals(outlier.model.1)) ^ 2)) / k.1) / (sum((residuals
F.1 > qf(0.05, k.1, n - p - k.1, lower.tail = FALSE)

[1] TRUE
summary(outlier.model.1)

Call:
lm(formula = y.mod.1 ~ 1 + X.mod.1)

Residuals:
Min 1Q Median 3Q Max
-4.4613 -0.6052 0.0000 0.4497 3.9471

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.75298 1.41066 -0.534 0.599688
X.mod.1x1 -0.11158 0.08277 -1.348 0.193478
X.mod.1x2 0.31378 0.07556 4.153 0.000541 ***
X.mod.1x3 0.10255 3.84141 0.027 0.978981
X.mod.1x4 4.71023 3.97008 1.186 0.250076
X.mod.1 -3.97634 2.67335 -1.487 0.153318
X.mod.1 2.85640 3.03530 0.941 0.358485

X.mod.1 3.30415 2.94678 1.121 0.276142
X.mod.1 -2.51355 2.23577 -1.124 0.274913
X.mod.1 -6.80742 2.46254 -2.764 0.012344 *
X.mod.1 -3.79404 2.16636 -1.751 0.096014 .
X.mod.1 4.63336 2.05466 2.255 0.036121 *
X.mod.1 5.58316 2.09783 2.661 0.015419 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.912 on 19 degrees of freedom


Multiple R-squared: 0.9745, Adjusted R-squared: 0.9584
F-statistic: 60.46 on 12 and 19 DF, p-value: 1.606e-12
potential.outlier.2 <- potential.outlier.1[c(5, 7, 8)] # OUTLIERS DETECTED AFTER CROSSCHECK
k.2 <- length(potential.outlier.2) # NUMBER OF DETECTED OUTLIERS AFTER CROSSCHECK
y.mod.2 <- c(y[-potential.outlier.2], y[potential.outlier.2]) # MODIFIED RESPONSE AFTER CROSSCHECK
X.mod.2 <- cbind(rbind(X.1[-potential.outlier.2, ], X.1[potential.outlier.2, ]), rbind(matrix(0, n - k.2
outlier.model.2 <- lm(y.mod.2 ~ 1 + X.mod.2) # OUTLIER SHIFT MODEL AFTER CROSSCHECK
F.2 <- ((sum((residuals(model.1)) ^ 2) - sum((residuals(outlier.model.2)) ^ 2)) / k.2) / (sum((residuals
F.2 > qf(0.05, k.2, n - p - k.2, lower.tail = FALSE)

[1] TRUE
summary(outlier.model.2)

Call:
lm(formula = y.mod.2 ~ 1 + X.mod.2)

Residuals:
Min 1Q Median 3Q Max
-3.5204 -0.8975 0.0000 1.0743 4.3300

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.16239 1.45093 0.112 0.911818
X.mod.2x1 -0.10068 0.07347 -1.370 0.183236
X.mod.2x2 0.21759 0.05235 4.157 0.000354 ***
X.mod.2x3 -0.31137 2.36207 -0.132 0.896226
X.mod.2x4 5.98182 2.26424 2.642 0.014282 *
X.mod.2 -7.12255 2.38953 -2.981 0.006496 **
X.mod.2 5.30706 2.21009 2.401 0.024441 *
X.mod.2 6.33178 2.23658 2.831 0.009238 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.1 on 24 degrees of freedom


Multiple R-squared: 0.9611, Adjusted R-squared: 0.9498
F-statistic: 84.76 on 7 and 24 DF, p-value: 2.289e-15
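
Since some lines of the listing in this section are cut off at the page margin, the following is a self-contained sketch of the same outlier shift idea on synthetic data. It is not the authors' code: the data, the suspect set and all names are hypothetical, and the indicator columns are added without reordering the rows, which gives the same fit as stacking the suspected rows last.

```r
# Synthetic illustration data with a planted outlier at observation 10.
n <- 32
x <- seq_len(n)
y <- 3 + 0.5 * x + sin(1.7 * x)   # deterministic stand-in for noise
y[10] <- y[10] + 20

suspects <- c(5, 10, 20)          # hypothetical points flagged by diagnostics

# Z has one indicator column per suspected observation.
Z <- sapply(suspects, function(s) as.numeric(seq_len(n) == s))
colnames(Z) <- paste0("obs", suspects)

base_model  <- lm(y ~ x)          # model without shift terms
shift_model <- lm(y ~ x + Z)      # outlier shift model Y = X beta + Z gamma + delta

# Joint F-test of H0: gamma = 0 (nested-model comparison).
joint_test <- anova(base_model, shift_model)
p_joint <- joint_test[["Pr(>F)"]][2]

# Individual t-tests on the gamma coefficients.
gamma_table <- coef(summary(shift_model))[paste0("Z", colnames(Z)), , drop = FALSE]
print(p_joint)
print(gamma_table)
```

Points whose gamma coefficient is not significant would be returned to the clean set and the model refitted, as described in the text.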

Checking Influence
Now, we consider the design matrix without the rows corresponding to the influential points and calculate
the condition number based on that. But in this context, we must mention that if multicollinearity is present
in the dataset, this is not the correct approach.
[1] 1281.337
Even so, the condition number remains very large (it has in fact risen from 482.66 to 1281.34). So we can
conclude that the problem of multicollinearity is serious, and hence we will opt for suitable regression methods.

Modifying Data

Before that, we should decide how to treat the outliers. Since our dataset is small and we have
already verified the presence of severe multicollinearity in the data, we cannot afford to remove
these observations completely. Instead, we predict their responses from an OLS regression fitted to the rest of
the points, and henceforth continue our analysis with these predicted observations.
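
The replacement step can be sketched as follows, on synthetic data with a single hypothetical outlier index; the names dat, clean_fit and y_modified are illustrative and do not come from the report.

```r
# Synthetic illustration data with a planted outlier at observation 10.
n <- 32
x <- seq_len(n)
y <- 3 + 0.5 * x + sin(1.7 * x)   # deterministic stand-in for noise
y[10] <- y[10] + 20

dat <- data.frame(x = x, y = y)
outliers <- 10                    # hypothetical set of confirmed outliers

# Fit OLS on the clean observations only.
clean_fit <- lm(y ~ x, data = dat[-outliers, ])

# Replace the outlying responses by their predictions from the clean fit.
y_modified <- dat$y
y_modified[outliers] <- predict(clean_fit, newdata = dat[outliers, , drop = FALSE])

print(c(original = y[10], replaced = y_modified[10]))
```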

Model Fitting
To handle multicollinearity, we can proceed either by removing some of the explanatory variables, or by
performing biased regression, where we minimise the residual sum of squares plus a penalty term.
In this assignment, we will first try to select a model by stepwise regression, and then we shall apply
LASSO regression.

Stepwise Regression
Here, we select the model by AIC, allowing variables to enter and to leave. The run reported below starts
from the full model with all four explanatory variables. At each step, we see which of the following gives
the minimum AIC value:
1. Adding a variable that is not currently in the model,
2. Removing a variable that is currently in the model,
3. Keeping the model the same.
If the last of these gives the minimum AIC, the algorithm stops there and that is our final model.
Otherwise, we make the move with the minimum AIC and repeat the same process until we reach such a stage.
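
The search described above corresponds to base R's step() with direction = "both". The sketch below runs it on the ten observations visible in the str() output only, with the original (unmodified) responses, so the selected model need not agree with the results reported for the full modified data set.

```r
# First ten observations only (from the str() output); not the full data set.
hydrocarbon <- data.frame(
  x1 = c(33, 31, 33, 37, 36, 35, 59, 60, 59, 60),
  x2 = c(53, 36, 51, 51, 54, 35, 56, 60, 60, 60),
  x3 = c(3.32, 3.10, 3.18, 3.39, 3.20, 3.03, 4.78, 4.72, 4.60, 4.53),
  x4 = c(3.42, 3.26, 3.18, 3.08, 3.41, 3.03, 4.57, 4.72, 4.41, 4.53),
  y  = c(29, 24, 26, 22, 27, 21, 33, 34, 32, 34)
)

full_model <- lm(y ~ x1 + x2 + x3 + x4, data = hydrocarbon)

# Stepwise search by AIC; with no scope given, the starting model is the upper
# limit, so terms may be dropped and later re-added. trace = 0 silences the log.
selected <- step(full_model, direction = "both", trace = 0)

print(formula(selected))
print(extractAIC(selected)[2])    # AIC on the same scale as step() reports
```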

Results
Start: AIC=48.27
y.mod.3 ~ 1 + x1 + x2 + x3 + x4

Df Sum of Sq RSS AIC


- x3 1 0.089 105.90 46.295
<none> 105.81 48.268
- x1 1 9.204 115.01 48.937
- x4 1 34.690 140.50 55.342
- x2 1 76.949 182.76 63.757

Step: AIC=46.3
y.mod.3 ~ x1 + x2 + x4

Df Sum of Sq RSS AIC


<none> 105.90 46.295

+ x3 1 0.089 105.81 48.268
- x1 1 17.254 123.15 49.125
- x2 1 112.325 218.22 67.433
- x4 1 186.416 292.31 76.787

LASSO Regression
In this method, we minimise $\frac{1}{n}\sum_{i=1}^{n}(Y_i - x_i'\beta)^2 + \lambda\sum_{j=1}^{p}|\beta_j|$. This is justified because multicollinearity
inflates the variances of the estimated regression coefficients; by including a constraint of the form
$\sum_{j=1}^{p}|\beta_j| \le c$, we force the regression coefficients to take small values and can also shrink some of
them to exactly zero. This is supported by the fact that, in the presence of multicollinearity, not all of the
explanatory variables are actually required. Thus, although this method no longer yields unbiased estimators,
we still gain, as the estimates will have comparatively smaller MSEs than the unbiased estimates.
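
To make the objective above concrete, here is a minimal coordinate-descent sketch in base R. It is not the routine used for the results below (the sparse-matrix output there suggests a dedicated package), and it assumes the penalty is applied to coefficients of standardised regressors, with an unpenalised intercept. The function names and the data are hypothetical.

```r
# Soft-thresholding operator: shrinks z towards zero by t.
soft_threshold <- function(z, t) sign(z) * pmax(abs(z) - t, 0)

# Minimise (1/n) * sum((y - b0 - Xs %*% b)^2) + lambda * sum(abs(b)),
# where Xs holds the centred columns of X scaled to mean square 1 (ASSUMPTION).
lasso_cd <- function(X, y, lambda, max_iter = 5000, tol = 1e-10) {
  x_mean  <- colMeans(X)
  Xc      <- sweep(X, 2, x_mean)            # centre each column
  x_scale <- sqrt(colMeans(Xc^2))           # root mean square of each centred column
  Xs      <- sweep(Xc, 2, x_scale, "/")     # now mean(Xs[, j]^2) equals 1
  yc      <- y - mean(y)
  beta    <- numeric(ncol(X))
  for (iter in seq_len(max_iter)) {
    beta_old <- beta
    for (j in seq_along(beta)) {
      # Residual with the j-th regressor left out, then a one-variable update.
      partial_resid <- yc - drop(Xs[, -j, drop = FALSE] %*% beta[-j])
      rho <- mean(Xs[, j] * partial_resid)
      beta[j] <- soft_threshold(rho, lambda / 2)
    }
    if (max(abs(beta - beta_old)) < tol) break
  }
  slopes <- beta / x_scale                  # back to the original units
  c("(Intercept)" = mean(y) - sum(slopes * x_mean), slopes)
}

# Synthetic illustration data (deterministic, not the hydrocarbon data).
i <- seq_len(32)
X <- cbind(x1 = i / 10, x2 = sin(i), x3 = cos(1.3 * i))
y <- 1 + 2 * X[, "x1"] - X[, "x2"] + 0.3 * sin(2.9 * i)

b_ols_like <- lasso_cd(X, y, lambda = 0)    # no penalty: should reproduce OLS
b_shrunk   <- lasso_cd(X, y, lambda = 0.5)  # moderate penalty
b_null     <- lasso_cd(X, y, lambda = 1e3)  # huge penalty: all slopes zero
print(rbind(b_ols_like, b_shrunk, b_null))
```

With lambda = 0 the sketch should agree with lm(), and a very large lambda should remove every regressor, which is the behaviour the paragraph above relies on.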

Results
Here, we first obtain an optimal λ.
[1] 0.0633387
Then we fit the model, and the regression coefficients are obtained as follows:
5 x 1 sparse Matrix of class "dgCMatrix"
s0
(Intercept) 0.26954539
x1 -0.05775523
x2 0.22050927
x3 .
x4 5.02592959

Checking Goodness of Fitted LASSO Model


Now, we plot the original y observations together with the predictions obtained from this method. Since the
biased model appears to yield good results, we proceed to the residual analysis of this model. Here, we
check the normality, autocorrelation and heteroscedasticity of its residuals.

[Figure 14: Plot of Original Observations and LASSO Predictions. Vertical axis: y; horizontal axis: Index; legend: Original Observations, Lasso Predictions.]

Normality Check
First, we check for normality of the errors; both the Q-Q plot and the Shapiro-Wilk test support it.

[Figure 15: Normal Q-Q Plot. Vertical axis: Sample Quantiles; horizontal axis: Theoretical Quantiles.]

Shapiro-Wilk normality test

data: e.2
W = 0.96997, p-value = 0.4985

Checking for Multicollinearity


Now we compute the condition number and note that it is much lower than 100; hence the model can be regarded
as free from the effect of multicollinearity.
[1] 46.34992

Checking for Autocorrelation


Next, we prepare the ACF and PACF plots of the residuals, and note that none of the spikes is significant.
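
A spike is usually judged against the approximate 95% band of plus or minus 1.96/sqrt(n). The sketch below computes the sample ACF and PACF and lists the lags outside that band. It uses deterministic stand-in residuals, which are strongly autocorrelated by construction, so unlike the LASSO residuals in this report they will show significant lags.

```r
# Synthetic stand-in residuals (NOT the LASSO residuals of the report).
n <- 32
x <- seq_len(n)
y <- 3 + 0.5 * x + sin(1.7 * x)
e <- residuals(lm(y ~ x))

# Sample autocorrelations for lags 1..15 (lag 0 is dropped) and partial ones.
acf_vals  <- acf(e, lag.max = 15, plot = FALSE)$acf[-1]
pacf_vals <- as.numeric(pacf(e, lag.max = 15, plot = FALSE)$acf)

# Approximate 95% band under the hypothesis of no autocorrelation.
band <- qnorm(0.975) / sqrt(n)

significant_acf_lags  <- which(abs(acf_vals) > band)
significant_pacf_lags <- which(abs(pacf_vals) > band)
print(significant_acf_lags)
print(significant_pacf_lags)
```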

[Figure 16: Autocorrelation Plot of LASSO residuals. Vertical axis: ACF; horizontal axis: Lag.]
[Figure 17: Partial Autocorrelation Plot of LASSO residuals. Vertical axis: Partial ACF; horizontal axis: Lag.]

Checking for Homoscedasticity


Now, we check for equal variances. We take successive subsets (moving windows of some fixed size) of the
residuals and compute the variance of each. If the plot of these variances against the indices reveals a pattern,
we can suspect that heteroscedasticity is present in the dataset. We repeat this procedure for different window
sizes. From the plots, we can see that there is an increasing pattern in each of them. This
indicates that the residuals are not homoscedastic.
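
A minimal sketch of such moving variances is shown below. It assumes windows of k consecutive residuals, which gives n - k + 1 values per window size and so matches the index ranges of the plots for orders 10, 15 and 20 with n = 32. The function name and the stand-in residuals are hypothetical.

```r
# Variance of each window of k consecutive values of e.
moving_variance <- function(e, k) {
  starts <- seq_len(length(e) - k + 1)
  vapply(starts, function(s) var(e[s:(s + k - 1)]), numeric(1))
}

# Synthetic stand-in residuals whose spread grows with the index.
idx <- seq_len(32)
e   <- (idx / 8) * sin(1.7 * idx)

mv10 <- moving_variance(e, 10)
mv15 <- moving_variance(e, 15)
mv20 <- moving_variance(e, 20)

# An upward drift in these sequences hints at heteroscedasticity.
# Use plot(mv10, type = "b", xlab = "Index", ylab = "Variance") to draw one.
print(round(mv10, 2))
```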

[Figure 19: Moving variances with Order 10. Vertical axis: Variance; horizontal axis: Index.]
[Figure 20: Moving variances with Order 15. Vertical axis: Variance; horizontal axis: Index.]
[Figure 21: Moving variances with Order 20. Vertical axis: Variance; horizontal axis: Index.]

Breusch - Pagan Test

To confirm our suspicion, we test whether the variances can be modelled by the explanatory variables. We
use the squares of the residuals as estimates of the variances, i.e. we consider the model
$\underset{\sim}{e} = X\underset{\sim}{\mu} + \underset{\sim}{\nu}$, where $\underset{\sim}{e}$ is the vector of squared residuals.
From this model, we observe that the null hypothesis of homoscedasticity is not rejected at the 5% level of
significance, but the p-value (0.086) is low. So, based on the data, we cannot confidently conclude that the
errors are homoscedastic, and we would prefer to repeat the test with more observations. Alternatively, as the
regression coefficient of x2 is not significant, we drop that variable, and the test then becomes significant
at the same 5% level of significance.

Call:
lm(formula = (e.2^2) ~ 1 + x1 + x2 + x4)

Residuals:
Min 1Q Median 3Q Max
-13.657 -6.869 -1.496 2.590 42.344

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.2856 7.8284 -0.292 0.7725
x1 0.7482 0.3029 2.470 0.0199 *
x2 0.1723 0.2460 0.700 0.4895
x4 -10.0453 4.9177 -2.043 0.0506 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 11.77 on 28 degrees of freedom
Multiple R-squared: 0.2066, Adjusted R-squared: 0.1216
F-statistic: 2.43 on 3 and 28 DF, p-value: 0.08611
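
The auxiliary regression behind this test can be sketched as follows. The overall F-test of the auxiliary model is the version used in this section; the n times R-squared statistic compared with a chi-squared distribution is the common studentised variant, added here only for comparison. The data are synthetic stand-in residuals with growing spread, not the LASSO residuals.

```r
# Synthetic stand-in residuals whose spread grows with x.
n <- 32
x <- seq_len(n)
e <- (x / 8) * sin(1.7 * x)

# Auxiliary regression of squared residuals on the explanatory variable.
aux <- lm(I(e^2) ~ x)
aux_summary <- summary(aux)

# Overall F-test of the auxiliary model (H0: homoscedasticity).
fs  <- aux_summary$fstatistic
p_f <- unname(pf(fs["value"], fs["numdf"], fs["dendf"], lower.tail = FALSE))

# Studentised variant: n * R^2 is approximately chi-squared with
# (number of regressors) degrees of freedom under H0.
lm_stat <- n * aux_summary$r.squared
p_lm    <- pchisq(lm_stat, df = 1, lower.tail = FALSE)

print(c(F_p_value = p_f, LM_p_value = p_lm))
```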

Conclusion
After all these calculations, we see that the initial pressure of the tank is not included in our final model
because, owing to multicollinearity, its effect is already explained by the other explanatory variables. In the
fitted model, an increase in the temperature of the tank is associated with a decrease in the amount of escaped
hydrocarbon, while an increase in the temperature or the pressure of the petrol is associated with an increase in
waste. One should consider these points when taking steps against pollution, bearing in mind that
these conclusions are based on a biased model affected by heteroscedasticity.
