Answer the following questions. Write your responses below each part in the qmd file. Compress
qmd, html, and any data files you may create into a single zip file. Name the zip file with your Bilkent
student ID (e.g., 21400023.zip). Upload the zip file to the IE 451 moodle page.
Questions
pollution.csv contains data on average pollution amounts per unit time (PollutantAmnt) at different
locations along with several potential risk factors. We would like to use the data to come up with a
statistical model that sheds light on the relations between risk factors and pollution amount.
# A tibble: 6 × 7
  Convection Habitants HumidDays HumidityAmnt IndustrySize PollutantAmnt Region
       <dbl>     <dbl>     <dbl>        <dbl>        <dbl>         <dbl> <fct>
1      103.        315       171         24.9          907            33 A
2       98.4      1501       309         69.0         2013           129 A
3      101         351       177         29.3          273            55 B
4      111.       1163       209         70.8         1549           111 A
   Convection      Habitants      HumidDays      HumidityAmnt     IndustrySize
 Min.   : 86.0   Min.   : 141   Min.   : 71.0   Min.   : 13.10   Min.   :  69.0
 1st Qu.:100.2   1st Qu.: 597   1st Qu.:205.0   1st Qu.: 60.92   1st Qu.: 361.0
 Median :108.2   Median :1029   Median :229.0   Median : 76.48   Median : 693.0
 Mean   :110.5   Mean   :1216   Mean   :226.8   Mean   : 72.54   Mean   : 925.2
 3rd Qu.:117.6   3rd Qu.:1433   3rd Qu.:255.0   3rd Qu.: 85.22   3rd Qu.: 923.0
 Max.   :150.0   Max.   :6737   Max.   :331.0   Max.   :118.60   Max.   :6687.0
 PollutantAmnt    Region
 Mean   : 59.1    NA
 Max.   :219.0    NA
The range of Convection is 86 to 150. The range of Habitants is 141 to 6737. The range of
HumidDays is 71 to 331. The range of HumidityAmnt is 13.10 to 118.60. The range of IndustrySize
is 69.0 to 6687. The range of PollutantAmnt is 15.0 to 219.0.
scatterplotMatrix(
  ~ Convection + Habitants + HumidDays + HumidityAmnt + IndustrySize + PollutantAmnt,
  data = d,
  regLine = FALSE,
  smooth = list(spread = FALSE, lty.smooth = "solid", col.smooth = "red")
)
scatterplotMatrix(
  ~ Convection + Habitants + HumidDays + HumidityAmnt + IndustrySize + PollutantAmnt | Region,
  data = d,
  regLine = FALSE,
  smooth = list(spread = FALSE, lty.smooth = "solid", col.smooth = "red")
)
d %>%
select(where(is.numeric)) %>%
cor() %>%
corrplot(diag=FALSE, type = "upper", method = "pie", order = "hclust")
No pair of variables is very highly correlated. The most strongly positively correlated pair is
IndustrySize and PollutantAmnt. Other pairs, such as Convection and Habitants, are also positively
correlated, and some pairs, such as IndustrySize and HumidDays, are negatively correlated.
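The reading of the correlation plot above can be cross-checked numerically by ranking variable pairs by absolute correlation. The following is a minimal base-R sketch; because pollution.csv is not reproduced here, it runs on a synthetic stand-in data frame (an assumption), so its numbers do not match the exam data.

```r
# Synthetic stand-in for pollution.csv (assumption: the real data are not available here)
set.seed(451)
n <- 40
d <- data.frame(
  Convection   = runif(n, 86, 150),
  Habitants    = runif(n, 141, 6737),
  IndustrySize = runif(n, 69, 6687)
)
d$PollutantAmnt <- 150 - d$Convection + 0.025 * d$IndustrySize + rnorm(n, sd = 20)

# Correlation matrix of the numeric columns
cors <- cor(d[sapply(d, is.numeric)])

# Keep each pair once by taking the upper triangle only
idx <- which(upper.tri(cors), arr.ind = TRUE)
pairs_tbl <- data.frame(
  var1 = rownames(cors)[idx[, "row"]],
  var2 = colnames(cors)[idx[, "col"]],
  r    = cors[idx]
)

# Strongest relationships (positive or negative) first
pairs_tbl <- pairs_tbl[order(-abs(pairs_tbl$r)), ]
print(pairs_tbl)
```

Sorting by absolute value keeps strong negative pairs next to strong positive ones, which is easier to scan than the pie glyphs when there are many variables.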
Call:
lm(formula = PollutantAmnt ~ Habitants, data = d)
Coefficients:
(Intercept) Habitants
55.793948 0.002716
IndustrySize_lm
Call:
lm(formula = PollutantAmnt ~ IndustrySize, data = d)
file:///home/sdayanik/usr/savas/bilkent/teaching/IE 451 Applied Data Analysis/IE451_git/Spring 2023/Midterm/submissions/Den… 5/22
4/27/23, 1:01 PM IE451 Midterm Examination Spring 2023
Coefficients:
(Intercept) IndustrySize
34.24801 0.02686
Convection_lm
Call:
lm(formula = PollutantAmnt ~ Convection, data = d)
Coefficients:
(Intercept) Convection
214.734 -1.408
HumidDays_lm
Call:
lm(formula = PollutantAmnt ~ HumidDays, data = d)
Coefficients:
(Intercept) HumidDays
-15.1267 0.3273
HumidityAmnt_lm
Call:
lm(formula = PollutantAmnt ~ HumidityAmnt, data = d)
Coefficients:
(Intercept) HumidityAmnt
51.2444 0.1083
As the slopes show, Habitants, IndustrySize, HumidDays, and HumidityAmnt are each positively
associated with PollutantAmnt, while Convection is negatively associated with PollutantAmnt.
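The five single-predictor fits above can also be produced in one pass, which makes the slope signs easy to compare side by side. This is a sketch under assumptions: the data frame below is synthetic (pollution.csv is not reproduced here), and the helper vector `slopes` is a hypothetical name, not one used in the original analysis.

```r
# Synthetic stand-in for pollution.csv (assumption)
set.seed(451)
n <- 40
d <- data.frame(
  Convection   = runif(n, 86, 150),
  Habitants    = runif(n, 141, 6737),
  HumidDays    = runif(n, 71, 331),
  HumidityAmnt = runif(n, 13.1, 118.6),
  IndustrySize = runif(n, 69, 6687)
)
d$PollutantAmnt <- 150 - d$Convection + 0.025 * d$IndustrySize + rnorm(n, sd = 20)

predictors <- c("Habitants", "IndustrySize", "Convection", "HumidDays", "HumidityAmnt")

# Fit PollutantAmnt ~ <predictor> for each predictor and keep the slope
slopes <- sapply(predictors, function(p) {
  fit <- lm(reformulate(p, response = "PollutantAmnt"), data = d)
  coef(fit)[[p]]
})

# The sign of each slope gives the direction of the marginal association
print(data.frame(slope = slopes, direction = ifelse(slopes > 0, "positive", "negative")))
```

These are marginal associations only; a sign can change once the other predictors enter the model, as happens for Habitants and HumidDays in the full model that follows.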
Call:
lm(formula = PollutantAmnt ~ Habitants + IndustrySize + HumidDays +
HumidityAmnt + Convection, data = d)
Residuals:
Min 1Q Median 3Q Max
-56.077 -21.538 -3.405 19.525 108.500
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 154.789101 82.063921 1.886 0.0676 .
Habitants -0.004374 0.005051 -0.866 0.3924
IndustrySize 0.025353 0.004996 5.074 1.28e-05 ***
HumidDays -0.022045 0.180982 -0.122 0.9037
HumidityAmnt 0.556838 0.408719 1.362 0.1818
Convection -1.350079 0.621210 -2.173 0.0366 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
2. Fit the best statistical model. Check diagnostics to make sure model assumptions are satisfied. If
not, try to improve your model. Check diagnostics again. Repeat until a satisfactory model is
obtained. Explain your reasoning.
This is not yet the best model: the error variance may not be constant, so we check whether a power transformation of the variables is needed.
LRT df pval
LR test, lambda = (1 1 1 1 1 1) 153.7129 6 < 2.22e-16
The likelihood-ratio test rejects lambda = 1 for all six variables (p < 2.22e-16), so a power
transformation is needed: power 2 for HumidityAmnt and log for the others.
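The transformation described above (square HumidityAmnt, take logs of the rest) can be applied as in the sketch below. How the original `d_pt` was built is not shown, so this construction is an assumption; only the `pt_` column naming follows the model output. The data are synthetic, and `fit_pt` is a hypothetical name.

```r
# Synthetic stand-in for pollution.csv with strictly positive values (assumption)
set.seed(451)
n <- 40
d <- data.frame(
  Convection   = runif(n, 86, 150),
  Habitants    = runif(n, 141, 6737),
  HumidDays    = runif(n, 71, 331),
  HumidityAmnt = runif(n, 13.1, 118.6),
  IndustrySize = runif(n, 69, 6687)
)
d$PollutantAmnt <- exp(8 - 1.2 * log(d$Convection) +
                         0.25 * log(d$IndustrySize) + rnorm(n, sd = 0.4))

# Log-transform every variable except HumidityAmnt
log_vars <- c("PollutantAmnt", "Habitants", "IndustrySize", "HumidDays", "Convection")
d_pt <- as.data.frame(lapply(d[log_vars], log))

# HumidityAmnt gets power 2 instead of log
d_pt$HumidityAmnt <- d$HumidityAmnt^2

# Prefix the transformed columns so they are not confused with the raw ones
names(d_pt) <- paste0("pt_", names(d_pt))

fit_pt <- lm(pt_PollutantAmnt ~ ., data = d_pt)
print(coef(fit_pt))
```

A squared predictor lives on a much larger scale than a logged one, which is why its coefficient comes out tiny (of order 1e-05 in the output below) without the effect necessarily being negligible.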
Call:
lm(formula = pt_PollutantAmnt ~ pt_Habitants + pt_IndustrySize +
pt_HumidDays + pt_HumidityAmnt + pt_Convection, data = d_pt)
Coefficients:
(Intercept) pt_Habitants pt_IndustrySize pt_HumidDays
1.336e+01 -1.978e-01 2.635e-01 4.225e-01
pt_HumidityAmnt pt_Convection
8.514e-05 -2.641e+00
The model appears to have improved, but it is still not the best. Let us check whether the variance is constant.
ncvTest(res_lm_sat)
The non-constant variance test cannot reject the hypothesis of constant variance. Let us also check normality.
shapiro.test(rstudent(res_lm_sat))
data: rstudent(res_lm_sat)
W = 0.96323, p-value = 0.2038
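With p = 0.2038, the Shapiro-Wilk test above does not reject normality of the studentized residuals at the 5% level. Both diagnostic checks can be sketched in base R as follows. The score test here is a hand-written Breusch-Pagan-type statistic against the fitted values; it is meant to illustrate the idea behind a non-constant variance test and is not claimed to reproduce the `ncvTest` output. The data are synthetic (an assumption).

```r
# Synthetic stand-in data with constant error variance (assumption)
set.seed(451)
n <- 40
d <- data.frame(IndustrySize = runif(n, 69, 6687))
d$PollutantAmnt <- 30 + 0.025 * d$IndustrySize + rnorm(n, sd = 20)

fit <- lm(PollutantAmnt ~ IndustrySize, data = d)

# Normality: Shapiro-Wilk test on the studentized residuals
sw <- shapiro.test(rstudent(fit))

# Constant variance: regress scaled squared residuals on the fitted values
e2  <- residuals(fit)^2
u   <- e2 / mean(e2)                        # scaled by the ML variance estimate RSS / n
aux <- lm(u ~ fitted(fit))
reg_ss <- sum((fitted(aux) - mean(u))^2)    # regression sum of squares of the auxiliary fit
chisq  <- reg_ss / 2                        # score statistic, 1 degree of freedom
p_ncv  <- pchisq(chisq, df = 1, lower.tail = FALSE)

cat("Shapiro-Wilk p-value:", sw$p.value, "\n")
cat("Score test chi-square:", chisq, " p-value:", p_ncv, "\n")
```

In both tests a large p-value means the assumption is not rejected; it does not prove the assumption holds, so the residual plots are still worth inspecting.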
residualPlots(res_lm_sat)
Call:
lm(formula = pt_PollutantAmnt ~ pt_Habitants + pt_IndustrySize +
pt_HumidDays + pt_HumidityAmnt + pt_Convection, data = d_pt)
Residuals:
Min 1Q Median 3Q Max
-1.41295 -0.40593 -0.07447 0.43447 0.86185
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.336e+01 1.249e+01 1.070 0.2918
pt_Habitants -1.978e-01 1.432e-01 -1.382 0.1759
pt_IndustrySize 2.635e-01 1.164e-01 2.264 0.0299 *
pt_HumidDays 4.225e-01 7.944e-01 0.532 0.5982
pt_HumidityAmnt 8.514e-05 1.556e-04 0.547 0.5878
Call:
lm(formula = PollutantAmnt ~ Habitants + IndustrySize + HumidDays +
HumidityAmnt + Convection, data = d)
Residuals:
Min 1Q Median 3Q Max
-56.077 -21.538 -3.405 19.525 108.500
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 154.789101 82.063921 1.886 0.0676 .
Habitants -0.004374 0.005051 -0.866 0.3924
IndustrySize 0.025353 0.004996 5.074 1.28e-05 ***
HumidDays -0.022045 0.180982 -0.122 0.9037
HumidityAmnt 0.556838 0.408719 1.362 0.1818
Convection -1.350079 0.621210 -2.173 0.0366 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
residualPlots(res_lm_sat)
# Refit on the original scale without HumidDays and HumidityAmnt
new_model <- lm(PollutantAmnt ~ Habitants + IndustrySize + Convection, data = d)
plot(new_model)
summary(new_model)
Call:
lm(formula = PollutantAmnt ~ Habitants + IndustrySize + Convection,
data = d)
Residuals:
Min 1Q Median 3Q Max
-53.395 -24.191 -2.792 18.938 120.742
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 150.140364 43.684024 3.437 0.00147 **
Habitants -0.001695 0.004999 -0.339 0.73650
IndustrySize 0.024881 0.005135 4.845 2.27e-05 ***
Convection -1.013342 0.391323 -2.590 0.01366 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Also drop Habitants, which is not significant in new_model (p = 0.7365)
new_model2 <- lm(PollutantAmnt ~ IndustrySize + Convection, data = d)
plot(new_model2)
summary(new_model2)
Call:
lm(formula = PollutantAmnt ~ IndustrySize + Convection, data = d)
Residuals:
Min 1Q Median 3Q Max
-54.339 -25.098 -3.225 15.730 121.659
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 152.449670 42.644264 3.575 0.000974 ***
IndustrySize 0.024304 0.004788 5.076 1.05e-05 ***
Convection -1.048053 0.373268 -2.808 0.007831 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
4. Does presence of strong air current alleviate industry size effect on pollution amount in a typical
location?
Call:
lm(formula = PollutantAmnt ~ Convection + IndustrySize, data = d)
Residuals:
Min 1Q Median 3Q Max
-54.339 -25.098 -3.225 15.730 121.659
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 152.449670 42.644264 3.575 0.000974 ***
Convection -1.048053 0.373268 -2.808 0.007831 **
IndustrySize 0.024304 0.004788 5.076 1.05e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(lm_ind)
Call:
lm(formula = PollutantAmnt ~ IndustrySize, data = d)
Residuals:
Min 1Q Median 3Q Max
-53.951 -25.936 -6.989 13.420 134.354
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 34.248007 7.379913 4.641 3.86e-05 ***
IndustrySize 0.026859 0.005099 5.268 5.36e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(lm_all,lm_ind)
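The `anova` call above compares two nested models with a partial F-test. The sketch below shows that comparison on synthetic data (an assumption, since pollution.csv is not reproduced here). It assumes `lm_all` is the Convection + IndustrySize model printed above, and it adds one step that is not in the original analysis: an interaction model, `lm_int` (a hypothetical name), whose Convection:IndustrySize coefficient measures whether the industry-size slope changes with convection.

```r
# Synthetic stand-in for pollution.csv (assumption)
set.seed(451)
n <- 40
d <- data.frame(
  Convection   = runif(n, 86, 150),
  IndustrySize = runif(n, 69, 6687)
)
d$PollutantAmnt <- 150 - d$Convection + 0.025 * d$IndustrySize + rnorm(n, sd = 20)

lm_ind <- lm(PollutantAmnt ~ IndustrySize, data = d)
lm_all <- lm(PollutantAmnt ~ Convection + IndustrySize, data = d)

# Partial F-test: does adding Convection improve on IndustrySize alone?
cmp <- anova(lm_ind, lm_all)
print(cmp)

# Not in the original: let the IndustrySize slope depend on Convection
lm_int  <- lm(PollutantAmnt ~ Convection * IndustrySize, data = d)
int_row <- coef(summary(lm_int))["Convection:IndustrySize", ]
print(int_row)
```

In the additive model the IndustrySize slope is the same at every convection level, so it cannot by itself show convection weakening the industry effect. A negative and significant interaction coefficient would be the evidence for that.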
5. How does pollution amount change with convection? Does this relation differ from region to
region?
Call:
lm(formula = PollutantAmnt ~ Convection, data = d)
Residuals:
Min 1Q Median 3Q Max
-62.496 -23.660 -6.610 8.912 145.361
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 214.7340 52.2226 4.112 0.000196 ***
Convection -1.4081 0.4686 -3.005 0.004624 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Call:
lm(formula = PollutantAmnt ~ Convection + Region + (Convection:Region),
data = d)
Residuals:
Min 1Q Median 3Q Max
-57.798 -26.568 -5.644 14.337 159.131
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 236.9276 88.4927 2.677 0.0112 *
Convection -1.5634 0.7846 -1.993 0.0542 .
RegionB 97.2158 144.4482 0.673 0.5054
RegionC -92.4553 122.2767 -0.756 0.4546
Convection:RegionB -0.8812 1.2798 -0.689 0.4956
anova(lm_interact,lm_interact2)
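Whether the convection relation differs by region can be tested by comparing a model with Convection:Region terms against one without them. The definition of `lm_interact2` is not shown in the original, so the sketch below assumes it is the additive model Convection + Region. The data are synthetic (an assumption).

```r
# Synthetic stand-in for pollution.csv with three regions (assumption)
set.seed(451)
n <- 42
d <- data.frame(
  Convection = runif(n, 86, 150),
  Region     = factor(rep(c("A", "B", "C"), length.out = n))
)
d$PollutantAmnt <- 215 - 1.4 * d$Convection + rnorm(n, sd = 25)

# Separate intercept and separate Convection slope for each region
lm_interact <- lm(PollutantAmnt ~ Convection + Region + Convection:Region, data = d)

# Assumed reduced model: region shifts the intercept only, common slope
lm_interact2 <- lm(PollutantAmnt ~ Convection + Region, data = d)

# F-test of the two interaction coefficients jointly
cmp <- anova(lm_interact2, lm_interact)
print(cmp)
```

The joint F-test is preferable to reading the individual Convection:RegionB and Convection:RegionC t-tests, because those each compare one region with the baseline region A only.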
6. Predict pollution amount for a location in region C having average risk factors. Give 80%
confidence interval.
So, with 80% confidence, the pollutant amount for a location in region C with average risk factors
is between 35.914 and 73.723.
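An interval of this kind can be obtained from `predict` by supplying a one-row data frame holding the average risk factors and Region set to C. The model behind the reported interval is not shown, so the model below is an assumption, and the data are synthetic, so the numbers do not match those above. The names `new_loc`, `ci80`, and `pi80` are hypothetical.

```r
# Synthetic stand-in for pollution.csv (assumption)
set.seed(451)
n <- 42
d <- data.frame(
  Convection   = runif(n, 86, 150),
  IndustrySize = runif(n, 69, 6687),
  Region       = factor(rep(c("A", "B", "C"), length.out = n))
)
d$PollutantAmnt <- 150 - d$Convection + 0.025 * d$IndustrySize + rnorm(n, sd = 20)

# Assumed model: the two retained risk factors plus a region effect
fit <- lm(PollutantAmnt ~ IndustrySize + Convection + Region, data = d)

# A location in region C with average values of the numeric risk factors
new_loc <- data.frame(
  IndustrySize = mean(d$IndustrySize),
  Convection   = mean(d$Convection),
  Region       = factor("C", levels = levels(d$Region))
)

# 80% interval for the mean response at that location
ci80 <- predict(fit, newdata = new_loc, interval = "confidence", level = 0.80)

# 80% interval for a single new observation at that location (wider)
pi80 <- predict(fit, newdata = new_loc, interval = "prediction", level = 0.80)

print(ci80)
print(pi80)
```

The confidence interval covers the average pollution amount over all such locations, while the prediction interval covers one particular location. Which of the two answers the question depends on how "predict" is read.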