
4/27/23, 1:01 PM IE451 Midterm Examination Spring 2023

Answer the following questions. Write your responses below each part in the qmd file. Compress
qmd, html, and any data files you may create into a single zip file. Name the zip file with your Bilkent
student ID (e.g., 21400023.zip). Upload the zip file to the IE 451 moodle page.

You may use your notes, your textbook, or the internet.


You may NOT seek help from anybody inside or outside classroom in any form.
Be sure that qmd file and the final version of html file are submitted.

Questions

pollution.csv contains data on average pollution amounts per unit time ( PollutantAmnt ) at different
locations along with several potential risk factors. We would like to use the data to come up with a
statistical model that sheds light on the relations between the risk factors and the pollution amount.

Exploration [suggested time: 15 minutes]


1. Read data, print the first six rows, and explore the relations between variables (graphical and
tabular summaries). For full credit, you should comment on the plots and tables.

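# Packages assumed from the functions used in this report
library(tidyverse); library(car); library(corrplot); library(pander); library(magrittr)
d <- suppressMessages(read_csv("pollution.csv", show_col_types = FALSE) %>%
  select(-1)) %>%                   # drop the first column
  mutate_if(is.character, factor)   # text columns become factors
d %>% head()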

# A tibble: 6 × 7
Convection Habitants HumidDays HumidityAmnt IndustrySize PollutantAmnt Region
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct>
1 103. 315 171 24.9 907 33 A
2 98.4 1501 309 69.0 2013 129 A
3 101 351 177 29.3 273 55 B
4 111. 1163 209 70.8 1549 111 A
5 93.2 925 331 71.2 781 21 C
6 97.2 159 253 85.7 823 111 A

d %>% summary() %>% pander()

Table continues below

Convection      Habitants      HumidDays       HumidityAmnt     IndustrySize
Min.   : 86.0   Min.   : 141   Min.   : 71.0   Min.   : 13.10   Min.   :  69.0
1st Qu.:100.2   1st Qu.: 597   1st Qu.:205.0   1st Qu.: 60.92   1st Qu.: 361.0
Median :108.2   Median :1029   Median :229.0   Median : 76.48   Median : 693.0
Mean   :110.5   Mean   :1216   Mean   :226.8   Mean   : 72.54   Mean   : 925.2
3rd Qu.:117.6   3rd Qu.:1433   3rd Qu.:255.0   3rd Qu.: 85.22   3rd Qu.: 923.0
Max.   :150.0   Max.   :6737   Max.   :331.0   Max.   :118.60   Max.   :6687.0

PollutantAmnt    Region
Min.   : 15.0    A:14
1st Qu.: 25.0    B:16
Median : 51.0    C:11
Mean   : 59.1    NA
3rd Qu.: 69.0    NA
Max.   :219.0    NA

There are 7 variables: Convection, Habitants, HumidDays, HumidityAmnt, IndustrySize, PollutantAmnt,
and Region. Region is a qualitative variable, while the others are quantitative variables.

The ranges are: Convection from 86 to 150; Habitants from 141 to 6737; HumidDays from 71 to 331;
HumidityAmnt from 13.10 to 118.60; IndustrySize from 69 to 6687; PollutantAmnt from 15 to 219.

scatterplotMatrix(
  ~ Convection + Habitants + HumidDays + HumidityAmnt + IndustrySize + PollutantAmnt,
  data = d, regLine = FALSE,
  smooth = list(spread = FALSE, lty.smooth = "solid", col.smooth = "red"))

scatterplotMatrix(
  ~ Convection + Habitants + HumidDays + HumidityAmnt + IndustrySize + PollutantAmnt | Region,
  data = d, regLine = FALSE,
  smooth = list(spread = FALSE, lty.smooth = "solid", col.smooth = "red"))

Habitants, IndustrySize, Convection, and PollutantAmnt have skewed distributions.

d %>%
select(where(is.numeric)) %>%
cor() %>%
corrplot(diag=FALSE, type = "upper", method = "pie", order = "hclust")

No pair of variables is very highly correlated. The most positively correlated pair is IndustrySize
and PollutantAmnt. Other pairs are also positively correlated, such as Convection and Habitants, and
some pairs are negatively correlated, such as IndustrySize and HumidDays.

Let's see how each variable affects PollutantAmnt.

Habitants_lm    <- lm(PollutantAmnt ~ Habitants, data = d)
IndustrySize_lm <- lm(PollutantAmnt ~ IndustrySize, data = d)
Convection_lm   <- lm(PollutantAmnt ~ Convection, data = d)
HumidDays_lm    <- lm(PollutantAmnt ~ HumidDays, data = d)
HumidityAmnt_lm <- lm(PollutantAmnt ~ HumidityAmnt, data = d)
Habitants_lm

Call:
lm(formula = PollutantAmnt ~ Habitants, data = d)

Coefficients:
(Intercept) Habitants
55.793948 0.002716

IndustrySize_lm

Call:
lm(formula = PollutantAmnt ~ IndustrySize, data = d)

Coefficients:
(Intercept) IndustrySize
34.24801 0.02686

Convection_lm

Call:
lm(formula = PollutantAmnt ~ Convection, data = d)

Coefficients:
(Intercept) Convection
214.734 -1.408

HumidDays_lm

Call:
lm(formula = PollutantAmnt ~ HumidDays, data = d)

Coefficients:
(Intercept) HumidDays
-15.1267 0.3273

HumidityAmnt_lm

Call:
lm(formula = PollutantAmnt ~ HumidityAmnt, data = d)

Coefficients:
(Intercept) HumidityAmnt
51.2444 0.1083
As the slopes show, Habitants, IndustrySize, HumidDays, and HumidityAmnt are each positively
associated with PollutantAmnt, while Convection is negatively associated with PollutantAmnt.
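
A minimal sketch, assuming only base R, of why the slopes can be read this way: in a simple regression the sign of the slope equals the sign of the Pearson correlation between the predictor and the response. Since pollution.csv is not reproduced here, the sketch uses only the six rows printed by head() earlier (with Convection as rounded in that printout), so its slopes will differ from the full-data slopes above. The names first_rows, predictors, and slopes are illustrative.

```r
# The six rows shown by head(d); Convection is rounded as in the printout
first_rows <- data.frame(
  Convection    = c(103, 98.4, 101, 111, 93.2, 97.2),
  Habitants     = c(315, 1501, 351, 1163, 925, 159),
  HumidDays     = c(171, 309, 177, 209, 331, 253),
  HumidityAmnt  = c(24.9, 69.0, 29.3, 70.8, 71.2, 85.7),
  IndustrySize  = c(907, 2013, 273, 1549, 781, 823),
  PollutantAmnt = c(33, 129, 55, 111, 21, 111)
)

predictors <- setdiff(names(first_rows), "PollutantAmnt")

# Fit one simple regression per predictor and keep only the slope
slopes <- sapply(predictors, function(v) {
  fit <- lm(reformulate(v, response = "PollutantAmnt"), data = first_rows)
  unname(coef(fit)[2])
})

# Pearson correlation of each predictor with the response
cors <- cor(first_rows)[predictors, "PollutantAmnt"]

# Slope and correlation always share the same sign
data.frame(slope = slopes, correlation = cors)
```

The loop also replaces the five separate lm() calls with one expression, which makes it harder to forget a predictor.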

Let's build a model with all the quantitative predictors in it.

total_lm <- lm(PollutantAmnt ~ Habitants + IndustrySize + HumidDays + HumidityAmnt +
                 Convection, data = d)
total_lm %>% summary()

Call:
lm(formula = PollutantAmnt ~ Habitants + IndustrySize + HumidDays +
HumidityAmnt + Convection, data = d)

Residuals:
Min 1Q Median 3Q Max
-56.077 -21.538 -3.405 19.525 108.500

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 154.789101 82.063921 1.886 0.0676 .
Habitants -0.004374 0.005051 -0.866 0.3924
IndustrySize 0.025353 0.004996 5.074 1.28e-05 ***
HumidDays -0.022045 0.180982 -0.122 0.9037
HumidityAmnt 0.556838 0.408719 1.362 0.1818
Convection -1.350079 0.621210 -2.173 0.0366 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 32.86 on 35 degrees of freedom


Multiple R-squared: 0.5714, Adjusted R-squared: 0.5102
F-statistic: 9.333 on 5 and 35 DF, p-value: 1.019e-05
IndustrySize is highly significant (p = 1.28e-05), and Convection is also significant at the 5% level
(p = 0.0366), though less strongly. Hence there is evidence of a relation between PollutantAmnt and
both IndustrySize and Convection. The adjusted R^2 is 0.5102, which means these variables explain
about 51% of the variation in PollutantAmnt. The p-value of the overall F-test (1.019e-05) is less
than 0.05, so the null hypothesis that all slope coefficients are zero is rejected.
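
As a check on how these summary quantities relate, the following base R sketch recomputes the adjusted R-squared and the overall F statistic from the reported Multiple R-squared. The inputs are the values printed in the summary above (n = 41 follows from 35 residual degrees of freedom plus 6 estimated coefficients); the variable names are illustrative.

```r
# Values reported in summary(total_lm)
r2 <- 0.5714   # Multiple R-squared
n  <- 41       # observations: 35 residual df + 5 slopes + 1 intercept
p  <- 5        # number of predictors

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Overall F statistic for H0: all p slope coefficients are zero
f_stat <- (r2 / p) / ((1 - r2) / (n - p - 1))
p_val  <- pf(f_stat, df1 = p, df2 = n - p - 1, lower.tail = FALSE)

round(c(adj_r2 = adj_r2, f_stat = f_stat), 4)
p_val
```

Up to rounding of the inputs, the results should agree with the Adjusted R-squared and F-statistic lines of the summary.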

Model [suggested time: 55 minutes]


plot(total_lm)

2. Fit the best statistical model. Check diagnostics to make sure model assumptions are satisfied. If
not, try to improve your model. Check diagnostics again. Repeat until a satisfactory model is
obtained. Explain your reasoning.

The diagnostic plots suggest that this is not the best model: it is unclear whether the residual
variance is constant.

Therefore, a power transformation of the variables may help bring them closer to normality.

X <- d %>%
  select(PollutantAmnt, Habitants, IndustrySize, HumidDays, HumidityAmnt, Convection) %>%
  as.matrix()
res_pt <- powerTransform(X ~ 1)
res_pt %>% summary()

bcPower Transformations to Multinormality


Est Power Rounded Pwr Wald Lwr Bnd Wald Upr Bnd
PollutantAmnt -0.2558 0 -0.7067 0.1951
Habitants 0.0176 0 -0.2436 0.2787
IndustrySize 0.1288 0 -0.1017 0.3592
HumidDays 0.1557 0 -0.3244 0.6358
HumidityAmnt 1.6367 2 1.0931 2.1803
Convection -0.7290 0 -2.2045 0.7464

Likelihood ratio test that transformation parameters are equal to 0


(all log transformations)
LRT df pval
LR test, lambda = (0 0 0 0 0 0) 47.95536 6 1.2061e-08

Likelihood ratio test that no transformations are needed

LRT df pval
LR test, lambda = (1 1 1 1 1 1) 153.7129 6 < 2.22e-16
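Both likelihood ratio tests reject their null hypotheses, so neither "no transformation" nor "log
everything" is adequate. The rounded estimates suggest a power of 2 for HumidityAmnt and a log
transformation (power 0) for all the other variables.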

Xpt <- bcPower(X, coef(res_pt, round = TRUE)) %>%
  set_colnames(paste("pt", colnames(X), sep = "_"))
head(Xpt)

pt_PollutantAmnt pt_Habitants pt_IndustrySize pt_HumidDays pt_HumidityAmnt


[1,] 3.496508 5.752573 6.810142 5.141664 309.5050
[2,] 4.859812 7.313887 7.607381 5.733341 2378.6202
[3,] 4.007333 5.860786 5.609472 5.176150 429.9178
[4,] 4.709530 7.058758 7.345365 5.342334 2504.4042
[5,] 3.044522 6.829794 6.660575 5.802118 2535.6442
[6,] 4.709530 5.068904 6.712956 5.533389 3675.1738
pt_Convection
[1,] 4.632785
[2,] 4.589041
[3,] 4.615121
[4,] 4.707727
[5,] 4.534748
[6,] 4.576771
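
To make the transformed values easier to read, here is a small base R sketch of the Box-Cox power family that bcPower() applies: (x^lambda - 1) / lambda for a nonzero power, and log(x) when the power is 0. The helper name box_cox is illustrative and is not part of the car package; the inputs are the first row of the data as printed by head() earlier.

```r
# Box-Cox power family: (x^lambda - 1) / lambda, with log(x) at lambda = 0
box_cox <- function(x, lambda) {
  stopifnot(all(x > 0))   # the family is defined for positive values only
  if (lambda == 0) log(x) else (x^lambda - 1) / lambda
}

# First observation, using the rounded powers from the summary
box_cox(33, 0)     # PollutantAmnt, power 0 (log)
box_cox(315, 0)    # Habitants, power 0 (log)
box_cox(24.9, 2)   # HumidityAmnt, power 2
```

These three values should match the first row of the head(Xpt) output up to the printed precision, which also explains why pt_HumidityAmnt is on a much larger scale than the log-transformed columns.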

d_pt <- Xpt %>%
  as_tibble()

res_lm_sat <- lm(pt_PollutantAmnt ~ pt_Habitants + pt_IndustrySize + pt_HumidDays +
                   pt_HumidityAmnt + pt_Convection,
                 data = d_pt)
res_lm_sat

Call:
lm(formula = pt_PollutantAmnt ~ pt_Habitants + pt_IndustrySize +
pt_HumidDays + pt_HumidityAmnt + pt_Convection, data = d_pt)

Coefficients:
(Intercept) pt_Habitants pt_IndustrySize pt_HumidDays
1.336e+01 -1.978e-01 2.635e-01 4.225e-01
pt_HumidityAmnt pt_Convection
8.514e-05 -2.641e+00

par(mfrow = c(2, 2))
plot(res_lm_sat)

The model looks better, but it is still not the best. Let us check whether the variance is constant.

ncvTest(res_lm_sat)

Non-constant Variance Score Test


Variance formula: ~ fitted.values
Chisquare = 3.475041, Df = 1, p = 0.062301

The non-constant variance test cannot reject constant variance (p = 0.0623). Let us also check normality.

shapiro.test(rstudent(res_lm_sat))

Shapiro-Wilk normality test

data: rstudent(res_lm_sat)
W = 0.96323, p-value = 0.2038

Shapiro test cannot reject normality.

residualPlots(res_lm_sat)

Test stat Pr(>|Test stat|)


pt_Habitants 2.1667 0.037354 *
pt_IndustrySize 2.9202 0.006171 **
pt_HumidDays 0.2634 0.793801
pt_HumidityAmnt -1.6387 0.110505
pt_Convection -2.4081 0.021607 *
Tukey test -0.0397 0.968348
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

res_lm_sat %>% summary()

Call:
lm(formula = pt_PollutantAmnt ~ pt_Habitants + pt_IndustrySize +
pt_HumidDays + pt_HumidityAmnt + pt_Convection, data = d_pt)

Residuals:
Min 1Q Median 3Q Max
-1.41295 -0.40593 -0.07447 0.43447 0.86185

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.336e+01 1.249e+01 1.070 0.2918
pt_Habitants -1.978e-01 1.432e-01 -1.382 0.1759
pt_IndustrySize 2.635e-01 1.164e-01 2.264 0.0299 *
pt_HumidDays 4.225e-01 7.944e-01 0.532 0.5982
pt_HumidityAmnt 8.514e-05 1.556e-04 0.547 0.5878
pt_Convection -2.641e+00 1.919e+00 -1.376 0.1775


---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5717 on 35 degrees of freedom


Multiple R-squared: 0.446, Adjusted R-squared: 0.3669
F-statistic: 5.636 on 5 and 35 DF, p-value: 0.0006509

total_lm %>% summary()

Call:
lm(formula = PollutantAmnt ~ Habitants + IndustrySize + HumidDays +
HumidityAmnt + Convection, data = d)

Residuals:
Min 1Q Median 3Q Max
-56.077 -21.538 -3.405 19.525 108.500

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 154.789101 82.063921 1.886 0.0676 .
Habitants -0.004374 0.005051 -0.866 0.3924
IndustrySize 0.025353 0.004996 5.074 1.28e-05 ***
HumidDays -0.022045 0.180982 -0.122 0.9037
HumidityAmnt 0.556838 0.408719 1.362 0.1818
Convection -1.350079 0.621210 -2.173 0.0366 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 32.86 on 35 degrees of freedom


Multiple R-squared: 0.5714, Adjusted R-squared: 0.5102
F-statistic: 9.333 on 5 and 35 DF, p-value: 1.019e-05
So, the tests give no evidence against constant variance or normality. We could try other
combinations of variables to get a better result, but given the time limit this is the best model
that I could find so far.

residualPlots(res_lm_sat)

Test stat Pr(>|Test stat|)


pt_Habitants 2.1667 0.037354 *
pt_IndustrySize 2.9202 0.006171 **
pt_HumidDays 0.2634 0.793801
pt_HumidityAmnt -1.6387 0.110505
pt_Convection -2.4081 0.021607 *
Tukey test -0.0397 0.968348
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
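The tests flag pt_Habitants, pt_IndustrySize, and pt_Convection (p < 0.05). Note that these are
curvature (lack-of-fit) tests, so they point to possible nonlinearity in those terms rather than to
the significance of the predictors themselves. As a simpler alternative, we build a new model with
these three variables on the original scale.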

new_model <- lm(PollutantAmnt ~ Habitants + IndustrySize + Convection, data = d)
plot(new_model)

summary(new_model)

Call:
lm(formula = PollutantAmnt ~ Habitants + IndustrySize + Convection,
data = d)

Residuals:
Min 1Q Median 3Q Max
-53.395 -24.191 -2.792 18.938 120.742

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 150.140364 43.684024 3.437 0.00147 **
Habitants -0.001695 0.004999 -0.339 0.73650
IndustrySize 0.024881 0.005135 4.845 2.27e-05 ***
Convection -1.013342 0.391323 -2.590 0.01366 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 33.9 on 37 degrees of freedom


Multiple R-squared: 0.5176, Adjusted R-squared: 0.4785
F-statistic: 13.23 on 3 and 37 DF, p-value: 5.065e-06
IndustrySize and Convection remain significant, but Habitants does not (p = 0.7365), so we eliminate
Habitants and check again.

new_model2 <- lm(PollutantAmnt ~ IndustrySize + Convection, data = d)
plot(new_model2)

summary(new_model2)

Call:
lm(formula = PollutantAmnt ~ IndustrySize + Convection, data = d)

Residuals:
Min 1Q Median 3Q Max
-54.339 -25.098 -3.225 15.730 121.659

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 152.449670 42.644264 3.575 0.000974 ***
IndustrySize 0.024304 0.004788 5.076 1.05e-05 ***
Convection -1.048053 0.373268 -2.808 0.007831 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 33.5 on 38 degrees of freedom


Multiple R-squared: 0.5161, Adjusted R-squared: 0.4906
F-statistic: 20.27 on 2 and 38 DF, p-value: 1.024e-06
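3. How good does your model fit data? Explain.

The fit is moderate at best: the final model (new_model2) has an adjusted R-squared of 0.4906, so it
explains about half of the variation in PollutantAmnt. I tried several models and kept the last one.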

Analysis [suggested time: 50 minutes]

4. Does presence of strong air current alleviate industry size effect on pollution amount in a typical
location?

lm_all <- lm(PollutantAmnt ~ Convection + IndustrySize, data = d)
lm_ind <- lm(PollutantAmnt ~ IndustrySize, data = d)
summary(lm_all)

Call:
lm(formula = PollutantAmnt ~ Convection + IndustrySize, data = d)

Residuals:
Min 1Q Median 3Q Max
-54.339 -25.098 -3.225 15.730 121.659

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 152.449670 42.644264 3.575 0.000974 ***
Convection -1.048053 0.373268 -2.808 0.007831 **
IndustrySize 0.024304 0.004788 5.076 1.05e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 33.5 on 38 degrees of freedom


Multiple R-squared: 0.5161, Adjusted R-squared: 0.4906
F-statistic: 20.27 on 2 and 38 DF, p-value: 1.024e-06

summary(lm_ind)

Call:
lm(formula = PollutantAmnt ~ IndustrySize, data = d)

Residuals:
Min 1Q Median 3Q Max
-53.951 -25.936 -6.989 13.420 134.354

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 34.248007 7.379913 4.641 3.86e-05 ***
IndustrySize 0.026859 0.005099 5.268 5.36e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 36.34 on 39 degrees of freedom


Multiple R-squared: 0.4157, Adjusted R-squared: 0.4007
F-statistic: 27.75 on 1 and 39 DF, p-value: 5.363e-06

anova(lm_all,lm_ind)

Analysis of Variance Table

Model 1: PollutantAmnt ~ Convection + IndustrySize


Model 2: PollutantAmnt ~ IndustrySize
Res.Df RSS Df Sum of Sq F Pr(>F)
1 38 42655
2 39 51505 -1 -8849.4 7.8836 0.007831 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
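The F-test (p = 0.0078) shows that adding Convection to the model with IndustrySize significantly
improves the fit, and the Convection coefficient is negative (-1.05), so stronger air current is
associated with a lower PollutantAmnt at a given IndustrySize. This additive model does not show
whether Convection changes the IndustrySize effect itself; that would require a
Convection:IndustrySize interaction term.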
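
One common way to ask whether air current changes the industry-size effect is to add an IndustrySize:Convection interaction and compare it with the additive model. The sketch below is only an illustration under stated assumptions: pollution.csv is not available here, so it simulates 41 observations over the ranges seen in the summary table, with arbitrarily chosen coefficients. Its numeric results therefore say nothing about the real data; sim, fit_add, and fit_int are hypothetical names.

```r
set.seed(451)
n <- 41

# Simulated stand-in for the real data (ranges taken from the summary table,
# coefficients and noise level chosen arbitrarily for illustration)
sim <- data.frame(
  Convection   = runif(n, 86, 150),
  IndustrySize = runif(n, 69, 6687)
)
sim$PollutantAmnt <- 150 - 1 * sim$Convection + 0.025 * sim$IndustrySize +
  rnorm(n, sd = 30)

# Additive model: Convection shifts the level but not the IndustrySize slope
fit_add <- lm(PollutantAmnt ~ IndustrySize + Convection, data = sim)

# Interaction model: the IndustrySize slope is allowed to vary with Convection
fit_int <- lm(PollutantAmnt ~ IndustrySize * Convection, data = sim)

# Partial F-test for the single interaction coefficient
cmp <- anova(fit_add, fit_int)
cmp
coef(fit_int)["IndustrySize:Convection"]
```

On the real data, a significantly negative IndustrySize:Convection coefficient would support the claim that stronger air current alleviates the industry-size effect.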

5. How does pollution amount change with convection? Does this relation differ from region to
region?

lm_interact  <- lm(PollutantAmnt ~ Convection, data = d)
lm_interact2 <- lm(PollutantAmnt ~ Convection + Region + (Convection:Region), data = d)
lm_interact %>% summary()

Call:
lm(formula = PollutantAmnt ~ Convection, data = d)

Residuals:
Min 1Q Median 3Q Max
-62.496 -23.660 -6.610 8.912 145.361

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 214.7340 52.2226 4.112 0.000196 ***
Convection -1.4081 0.4686 -3.005 0.004624 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 42.84 on 39 degrees of freedom


Multiple R-squared: 0.188, Adjusted R-squared: 0.1672
F-statistic: 9.03 on 1 and 39 DF, p-value: 0.004624

lm_interact2 %>% summary()

Call:
lm(formula = PollutantAmnt ~ Convection + Region + (Convection:Region),
data = d)

Residuals:
Min 1Q Median 3Q Max
-57.798 -26.568 -5.644 14.337 159.131

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 236.9276 88.4927 2.677 0.0112 *
Convection -1.5634 0.7846 -1.993 0.0542 .
RegionB 97.2158 144.4482 0.673 0.5054
RegionC -92.4553 122.2767 -0.756 0.4546
Convection:RegionB -0.8812 1.2798 -0.689 0.4956
Convection:RegionC 0.7190 1.1098 0.648 0.5213


---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 43.71 on 35 degrees of freedom


Multiple R-squared: 0.2416, Adjusted R-squared: 0.1332
F-statistic: 2.229 on 5 and 35 DF, p-value: 0.07312

anova(lm_interact,lm_interact2)

Analysis of Variance Table

Model 1: PollutantAmnt ~ Convection


Model 2: PollutantAmnt ~ Convection + Region + (Convection:Region)
Res.Df RSS Df Sum of Sq F Pr(>F)
1 39 71578
2 35 66858 4 4720.4 0.6178 0.6528
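PollutantAmnt decreases as Convection increases: the slope in the simple regression is -1.41
(p = 0.0046). The interaction model gives a different estimated Convection slope for each region,
and all three are negative (about -1.56 in A, -2.44 in B, and -0.84 in C). However, neither
interaction term is significant, and the F-test comparing the two models (p = 0.6528) gives no
evidence that the relation differs from region to region.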
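
The F statistic in the ANOVA table can be reproduced by hand from the two residual sums of squares, which makes clear what the comparison measures. This base R sketch uses only the numbers printed in the anova(lm_interact, lm_interact2) table above; the variable names are illustrative.

```r
# Values from the anova(lm_interact, lm_interact2) table
rss_small <- 71578   # PollutantAmnt ~ Convection, 39 residual df
rss_big   <- 66858   # adds Region and Convection:Region, 35 residual df
df_small  <- 39
df_big    <- 35

# Partial F statistic: drop in RSS per added parameter, relative to the
# residual mean square of the larger model
f_stat <- ((rss_small - rss_big) / (df_small - df_big)) / (rss_big / df_big)
p_val  <- pf(f_stat, df_small - df_big, df_big, lower.tail = FALSE)

c(F = f_stat, p = p_val)
```

Up to rounding of the inputs, this should agree with the F and Pr(>F) columns of the table: the four extra region parameters reduce the RSS by too little to count as evidence of region-specific relations.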

6. Predict pollution amount for a location in region C having average risk factors. Give 80%
confidence interval.

lm4 <- lm(PollutantAmnt ~ Region, data = d)
lm4 %>%
  predict(newdata = data.frame(Region = "C"), interval = "confidence", level = 0.80)

fit lwr upr


1 54.81818 35.91388 73.72248

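So, the 80% confidence interval for the mean pollutant amount in region C is (35.914, 73.722). Note
that lm4 uses Region only, so this is an interval for the region C mean and does not use the average
values of the other risk factors.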
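
A sketch of how the other risk factors could enter the prediction: fit a model that includes them together with Region, then predict at their sample means with Region set to C. Since pollution.csv is not available here, the data below are simulated over the ranges seen in the summary table with arbitrarily chosen coefficients, so the numbers are illustrative only; sim, fit, new_loc, and ci80 are hypothetical names.

```r
set.seed(451)
n <- 41

# Simulated stand-in for the real data (arbitrary coefficients, illustration only)
sim <- data.frame(
  Convection   = runif(n, 86, 150),
  IndustrySize = runif(n, 69, 6687),
  Region       = factor(rep(c("A", "B", "C"), length.out = n))
)
sim$PollutantAmnt <- 150 - 1 * sim$Convection + 0.025 * sim$IndustrySize +
  rnorm(n, sd = 30)

fit <- lm(PollutantAmnt ~ Convection + IndustrySize + Region, data = sim)

# A location in region C with every numeric risk factor at its sample mean
new_loc <- data.frame(
  Convection   = mean(sim$Convection),
  IndustrySize = mean(sim$IndustrySize),
  Region       = factor("C", levels = levels(sim$Region))
)

# 80% confidence interval for the mean response at that location
ci80 <- predict(fit, newdata = new_loc, interval = "confidence", level = 0.80)
ci80
```

Using interval = "prediction" instead would give the wider interval for a single new location rather than for the mean response.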
