
Stat 401B Final Exam

Fall 2016

I have neither given nor received unauthorized assistance on this exam.

________________________________________________________
Name Signed Date

_________________________________________________________
Name Printed

ATTENTION!

Incorrect numerical answers unaccompanied by supporting reasoning will receive NO partial credit.

Correct numerical answers to difficult questions unaccompanied by supporting reasoning may not receive full credit.

SHOW YOUR WORK/EXPLAIN YOURSELF!

1
1. A manufacturing process produces cylindrical parts.

5 pts a) Suppose that parts are L inches long with diameter D (also in inches). Part volume is then
V = LD²π/4. Suppose that L and D can be modeled as independent continuous random variables,
L ~ U(19.99, 20.01) and D ~ U(.99, 1.00). The mean and standard deviation of part volume (EV
and √VarV respectively) are of interest and simulation will be used to evaluate these. Provide a few
lines of R code that will do this. (The R syntax for calls associated with the U(a, b) distribution is
"_unif(a,b,..)".)

8 pts b) Suppose that the parts are steel and as manufactured have weights with mean 70.6 oz and standard
deviation .1 oz. Use the central limit theorem to approximate the probability that 25 such parts
have a total weight above 1766 oz. (Hint: Rephrase the question in terms of sample average weight.)
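Following the hint, the calculation runs along these lines (a sketch, assuming the usual normal approximation for the sample mean):

```r
# Total weight > 1766 oz  <=>  sample mean > 1766/25 = 70.64 oz;
# the sample mean has standard deviation .1/sqrt(25) = .02 oz
z <- (1766/25 - 70.6) / (.1 / sqrt(25))   # standardized value, here 2
1 - pnorm(z)                              # approximately .0228
```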

2. The 2013 International Journal of Microbiology article "Design and Optimization of a Process for
Sugarcane Molasses Fermentation by Saccharomyces cerevisiae Using Response Surface
Methodology" by El-Gendy, Madian, and Amr, presents results of a study made to optimize the
performance of a bioethanol production process. This question employs some results in that paper.

Under a first single set of process conditions, n = 6 runs of the process produce yields (in gm/l) of
bioethanol with sample mean ȳ = 177.29 and sample standard deviation s = 2.03.

4 pts a) Give 95% two-sided confidence limits for the standard deviation of process yield under these
conditions. (Plug in completely, but you need not simplify.)
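A sketch of the standard chi-square-based limits (assuming normally distributed yields):

```r
# 95% two-sided limits for sigma from (n-1)s^2 / chi-square quantiles
n <- 6
s <- 2.03
c(sqrt((n - 1) * s^2 / qchisq(.975, df = n - 1)),   # lower limit, about 1.27
  sqrt((n - 1) * s^2 / qchisq(.025, df = n - 1)))   # upper limit, about 4.98
```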

2
4 pts b) Provide a number, #, such that you are 95% sure that 99% of all process yields under these
conditions are at least the value #. (Plug in completely, but you need not simplify.)

4 pts c) Suppose that a single additional run of the process made under a second set of conditions produces
a yield of y = 112. Assuming that the standard deviation of yields is the same for both sets of
conditions, significance testing will be used to evaluate whether this new setup has a different mean
yield than the first. Give the value of an appropriate test statistic and name an appropriate reference
distribution to be used in finding a p-value.

Value of the test statistic ____________________ Reference distribution ___________________
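One standard choice (a sketch, not necessarily the only acceptable statistic) treats the new run as a single observation compared to the first sample mean, with the pooled-information standard error:

```r
# t-type statistic comparing one new observation to the first sample mean,
# using s = 2.03 on n - 1 = 5 degrees of freedom
t_stat <- (112 - 177.29) / (2.03 * sqrt(1 + 1/6))
t_stat    # about -29.8; reference distribution: t with 5 df
```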

5 pts d) Suppose that several process conditions of interest differ only in the value of incubation period (x,
in h). What model assumptions for a set of (x, y) data pairs are needed to make confidence limits for
the rate of change of mean y with respect to x?

Beginning on Page 8 there is some R code and output for a MLR analysis of n = 30 process runs.
These are potentially useful for describing yield, y (in gm/l), as a function of the process variables
x1 = incubation period (h)
x2 = initial pH
x3 = incubation temperature (°C)
x4 = molasses concentration (wt %)

3
4 pts e) What is estimated by the value "Period 0.2643" reported in the table on the output? (What
does the value .2643 represent in the context of the problem?)

4 pts f) Does a model linear in the predictors x1 , x2 , x3 , and x4 provide useful ability to predict yield?
Provide some quantitative support for your answer.
YES or NO (circle one)

As it turns out, a MLR with the (14) predictors x1, x2, x3, x4, x1², x2², x3², x4², x1x2, x1x3, x1x4, x2x3, x2x4, x3x4
has R² = .9871 and a LOOCV RMSPE of 22.286. The (quadratic) model fit by least squares predicts (an
optimal) yield for x1 = 72, x2 = 5.65, x3 = 40, and x4 = 18.2. This set of processing conditions has
ŷ = 279.2 and se_ŷ = 9.698.

6 pts g) Based on the information above and the R printout, fill out the ANOVA table for computing the
overall F for the quadratic model and provide sSF for the quadratic model.

ANOVA Table
Source SS df MS F
Regression

Error

Total

sSF = _________________
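The blank entries can be recovered from the printed ANOVA total for the linear fit and R² = .9871, along these lines (a sketch):

```r
# Total SS from the printed ANOVA table for the linear model
SSTot <- 724 + 3 + 200 + 103 + 171331     # 172361
SSR <- .9871 * SSTot                      # regression SS, 14 df
SSE <- SSTot - SSR                        # error SS on 30 - 14 - 1 = 15 df
F <- (SSR / 14) / (SSE / 15)              # overall F, about 82
sSF <- sqrt(SSE / 15)                     # about 12.2
```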

4 pts h) What do comparisons between s = 2.03 (from the bottom of Page 2), sSF from above, and the
LOOCV RMSPE of 22.286 suggest about this situation?
s versus sSF                                  |  sSF versus LOOCV RMSPE

4
4 pts i) Based on the (quadratic) MLR model, give 95% prediction limits for the next yield at process
conditions x1 = 72, x2 = 5.65, x3 = 40, and x4 = 18.2 . (Plug in completely, but do not simplify.)
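A sketch of the usual prediction limits ŷ ± t√(sSF² + se_ŷ²), with the quadratic model's sSF² recomputed from R² = .9871 and the total SS on the printed ANOVA table:

```r
# squared sSF for the quadratic fit: SSE/15 with SSE = (1 - R^2) * SSTot
sSF2 <- (1 - .9871) * 172361 / 15
279.2 + c(-1, 1) * qt(.975, df = 15) * sqrt(sSF2 + 9.698^2)   # about 246 to 312
```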

4 pts j) Before recommending adoption of process conditions from part i), what steps would you take, and
why?

3. The "Appendicitis data set" on the KEEL website provides measured values of 7 medical variables
for N = 106 patients and values of a 0-1 variable indicating whether the patient had an appendicitis.
Beginning on Page 9 there is R code and output for a logistic regression analysis of these data.

4 pts a) Which of the 7 medical variables seems least helpful in predicting whether or not a patient has
appendicitis? Why?

Using bestglm() in the bestglm package and cross validation (presumably on the log likelihood
criterion) it is possible to identify a good "reduced" logistic regression model as one with the two
predictor variables At3 and At4. There is some R code and output for this model included.

4 pts b) Give two-sided 95% confidence limits for the log odds of the probability that a patient with
At3 = .207 and At4 = 0 has appendicitis. (Plug in completely, but do not simplify.)
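Taking the printed fit and standard error for the first patient (At3 = .207, At4 = 0: fit 0.481 with standard error 0.059), a sketch of the usual estimate ± t×(standard error) limits, assuming the 103 error degrees of freedom shown on the output:

```r
# 95% limits around the reduced-model fitted value for patient 1
0.481 + c(-1, 1) * qt(.975, df = 103) * 0.059   # about .36 to .60
```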

5
4. Beginning on Page 10 there is R code and output for a factorial analysis of some experimental data
on the charge lives of batteries made of 3 materials at 3 different temperatures taken from an
experimental design book of Montgomery. (Though the temperature is clearly quantitative, here treat
both factors as qualitative.) The data for this study comprise a balanced 3 × 3 factorial data set.

4 pts a) Are there statistically detectable interactions between Material and Temperature? Explain.
YES or NO (circle one)

4 pts b) Give 95% two-sided confidence limits for the difference between the Material 1 and Material 2
main effects. (Plug in completely, but you need not simplify.)
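Reading the sum-to-zero effects off the printout (MaterialA1 = -22.361, MaterialA2 = 2.806) and using MSE = 675.2 on 27 df, a sketch of the limits is:

```r
# Difference of Material 1 and Material 2 main effects; each main effect
# mean is based on 12 observations in this balanced 3x3 design
eff_diff <- -22.361 - 2.806
eff_diff + c(-1, 1) * qt(.975, df = 27) * sqrt(675.2 * (1/12 + 1/12))  # about -47 to -3.4
```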

4 pts c) Find the fitted/predicted value of battery life for a battery of Material 2 under Temperature 2 for a
"main effects only" model of life. (If this is not possible based on the given information, say why.)

5. There is a famous "Boston Housing" data set on the UCI ML Data Repository. It concerns the
median home price in counties around Boston in the late 1970s as predicted by 13 measures of
community composition. This question concerns use of data from 452 counties with complete records
and prediction of "MEDV". There is R code and output provided (in pretty much the same format as for
Lab #11) beginning on Page 11. (As a baseline, MLR of MEDV on 13 predictors produces
sSF = 4.553 and R² = .7404.) Use the output to answer the following questions.

4 pts a) Which of the predictors of MEDV do you like the best, and why?

6
4 pts b) What (if anything) convinces you that you can do better here than MLR for prediction purposes?

4 pts c) Why/how is it obvious that the ordinary MLR (OLS) predictor differs very little from the elastic net
predictor in this case? What about the particular elastic net fit (chosen by repeated cross-validation)
makes this similarity unsurprising?

4 pts d) What is the origin of the "vertical stripes" appearance of the plots in the "Tree" column of the
matrix of scatterplots of predicted values?

4 pts e) Below is a schematic of the tree predictor (plotting is "condition TRUE to the LEFT").

Give a simple description of the conditions (values of the predictors) producing the largest predicted
MEDV.

4 pts f) If you were going to "stack" two of the predictors here, which two would you consider and why?

7
R Code and Output for Bioethanol Analyses

> Biofuels
Period InitialpH Temp Conc Yield
1 72 5 40 25 88.86
2 48 6 40 20 243.32
3 48 6 30 20 178.89
4 72 6 30 20 192.03
5 24 7 20 25 220.45
6 48 6 30 25 110.22
7 48 6 20 20 200.00
8 72 5 40 15 230.93
9 48 6 30 20 177.00
10 72 5 20 15 74.39
11 48 6 30 15 112.00
12 48 7 30 20 86.00
13 24 7 20 15 65.00
14 24 5 20 15 36.73
15 48 6 30 20 176.00
16 48 6 30 20 179.00
17 72 7 20 15 12.00
18 72 5 20 25 10.00
19 24 5 40 25 5.00
20 72 7 40 15 50.00
21 24 5 20 25 14.18
22 24 7 40 15 7.00
23 72 7 40 25 17.00
24 48 5 30 20 84.00
25 24 7 40 25 65.00
26 24 6 30 20 162.00
27 72 7 20 25 54.00
28 48 6 30 20 174.00
29 24 5 40 15 39.66
30 48 6 30 20 178.90

> summary(lm(Yield~.,data=Biofuels))
Call:
lm(formula = Yield ~ ., data = Biofuels)
Residuals:
Min 1Q Median 3Q Max
-101.11 -67.31 -23.12 70.30 131.87
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 97.4164 158.0387 0.616 0.543
Period 0.2643 0.8130 0.325 0.748
InitialpH -0.4056 19.5124 -0.021 0.984
Temp 0.3334 1.9512 0.171 0.866
Conc -0.4778 3.9025 -0.122 0.904
Residual standard error: 82.78 on 25 degrees of freedom
Multiple R-squared: 0.005977, Adjusted R-squared: -0.1531
F-statistic: 0.03758 on 4 and 25 DF, p-value: 0.9971
> anova(lm(Yield~.,data=Biofuels))
Analysis of Variance Table
Response: Yield
Df Sum Sq Mean Sq F value Pr(>F)
Period 1 724 724.4 0.1057 0.7478
InitialpH 1 3 3.0 0.0004 0.9836
Temp 1 200 200.1 0.0292 0.8657
Conc 1 103 102.7 0.0150 0.9035
Residuals 25 171331 6853.2

8
R Code and Output for the Appendicitis Data Analyses

> Appendicitis[1:10,]
At1 At2 At3 At4 At5 At6 At7 Class
1 0.213 0.554 0.207 0.000 0.000 0.749 0.220 1
2 0.458 0.714 0.468 0.111 0.102 0.741 0.436 1
3 0.102 0.518 0.111 0.056 0.022 0.506 0.086 1
4 0.187 0.196 0.105 0.056 0.029 0.133 0.085 1
5 0.236 0.804 0.289 0.111 0.066 0.756 0.241 1
6 0.116 0.161 0.057 0.333 0.140 0.177 0.049 1
7 0.089 0.179 0.045 0.028 0.011 0.168 0.032 1
8 0.364 0.661 0.365 0.319 0.250 0.743 0.353 1
9 0.191 0.661 0.214 0.042 0.022 0.448 0.145 1
10 0.120 0.250 0.076 0.125 0.053 0.224 0.059 1
> summary(glm(Class~.,data=Appendicitis))

Call:
glm(formula = Class ~ ., data = Appendicitis)

Deviance Residuals:
Min 1Q Median 3Q Max
-1.11042 -0.14420 -0.08051 0.01201 1.01271
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.5459 0.2520 6.135 1.81e-08 ***
At1 -5.1174 1.7305 -2.957 0.00389 **
At2 -0.3557 0.8845 -0.402 0.68848
At3 0.3307 2.6167 0.126 0.89970
At4 -0.6134 0.3769 -1.627 0.10687
At5 0.4134 0.4555 0.907 0.36637
At6 -1.3227 0.7943 -1.665 0.09905 .
At7 4.6603 2.8139 1.656 0.10088
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for gaussian family taken to be 0.1072328)
Null deviance: 16.840 on 105 degrees of freedom
Residual deviance: 10.509 on 98 degrees of freedom
AIC: 73.825
Number of Fisher Scoring iterations: 2
> summary(glm(Class~At3+At4,data=Appendicitis))

Call:
glm(formula = Class ~ At3 + At4, data = Appendicitis)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.6499 -0.2238 -0.1075 0.1481 1.2536
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.66996 0.08292 8.080 1.30e-12 ***
At3 -0.91105 0.16350 -5.572 2.02e-07 ***
At4 -0.44852 0.16853 -2.661 0.00903 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for gaussian family taken to be 0.1188325)


Null deviance: 16.84 on 105 degrees of freedom
Residual deviance: 12.24 on 103 degrees of freedom
AIC: 79.987

Number of Fisher Scoring iterations: 2

9
>
> Predlogit<-predict(glm(Class~At3+At4,data=Appendicitis),se.fit=TRUE)$fit
> SEPredlogit<-predict(glm(Class~At3+At4,data=Appendicitis),
+ se.fit=TRUE)$se.fit
>
> cbind(Appendicitis$At3,Appendicitis$At4,Appendicitis$Class,
+ round(Predlogit,3),round(SEPredlogit,3))[1:10,]
[,1] [,2] [,3] [,4] [,5]
1 0.207 0.000 1 0.481 0.059
2 0.468 0.111 1 0.194 0.038
3 0.111 0.056 1 0.544 0.065
4 0.105 0.056 1 0.549 0.066
5 0.289 0.111 1 0.357 0.042
6 0.057 0.333 1 0.469 0.071
7 0.045 0.028 1 0.616 0.075
8 0.365 0.319 1 0.194 0.039
9 0.214 0.042 1 0.456 0.054
10 0.076 0.125 1 0.545 0.066

R Code and Output for Battery Life Study

> Batteries
Life MaterialA TempB
1 130 1 1
2 74 1 1
3 155 1 1
4 180 1 1
5 34 1 2
6 80 1 2
7 40 1 2
8 75 1 2
9 20 1 3
10 82 1 3
11 70 1 3
12 58 1 3
13 150 2 1
14 188 2 1
15 159 2 1
16 126 2 1
17 136 2 2
18 106 2 2
19 122 2 2
20 115 2 2
21 25 2 3
22 58 2 3
23 70 2 3
24 45 2 3
25 138 3 1
26 110 3 1
27 168 3 1
28 160 3 1
29 174 3 2
30 150 3 2
31 120 3 2
32 139 3 2
33 96 3 3
34 82 3 3
35 104 3 3
36 60 3 3

10
> summary(lm(Life~MaterialA*TempB,data=Batteries))
Call:
lm(formula = Life ~ MaterialA * TempB, data = Batteries)
Residuals:
Min 1Q Median 3Q Max
-60.750 -14.625 1.375 17.938 45.250
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 105.528 4.331 24.367 < 2e-16 ***
MaterialA1 -22.361 6.125 -3.651 0.00111 **
MaterialA2 2.806 6.125 0.458 0.65057
TempB1 39.306 6.125 6.418 7.1e-07 ***
TempB2 2.056 6.125 0.336 0.73975
MaterialA1:TempB1 12.278 8.662 1.417 0.16778
MaterialA2:TempB1 8.111 8.662 0.936 0.35735
MaterialA1:TempB2 -27.972 8.662 -3.229 0.00325 **
MaterialA2:TempB2 9.361 8.662 1.081 0.28936
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 25.98 on 27 degrees of freedom
Multiple R-squared: 0.7652, Adjusted R-squared: 0.6956
F-statistic: 11 on 8 and 27 DF, p-value: 9.426e-07
> anova(lm(Life~MaterialA*TempB,data=Batteries))
Analysis of Variance Table
Response: Life
Df Sum Sq Mean Sq F value Pr(>F)
MaterialA 2 10684 5341.9 7.9114 0.001976 **
TempB 2 39119 19559.4 28.9677 1.909e-07 ***
MaterialA:TempB 4 9614 2403.4 3.5595 0.018611 *
Residuals 27 18231 675.2
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

R Code and Output for Boston Housing Data Analysis

> Boston[1:5,]
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSAT MEDV
1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98 24.0
2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14 21.6
3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03 34.7
4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94 33.4
5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33 36.2
> summary(Boston)
CRIM ZN INDUS CHAS NOX
Min. :0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000 Min. :0.3850
1st Qu.:0.06988 1st Qu.: 0.00 1st Qu.: 4.93 1st Qu.:0.00000 1st Qu.:0.4470
Median :0.19103 Median : 0.00 Median : 8.14 Median :0.00000 Median :0.5190
Mean :1.42083 Mean : 12.72 Mean :10.30 Mean :0.07743 Mean :0.5408
3rd Qu.:1.21146 3rd Qu.: 20.00 3rd Qu.:18.10 3rd Qu.:0.00000 3rd Qu.:0.6050
Max. :9.96654 Max. :100.00 Max. :27.74 Max. :1.00000 Max. :0.8710
RM AGE DIS RAD TAX
Min. :3.561 Min. : 2.90 Min. : 1.130 Min. : 1.000 Min. :187.0
1st Qu.:5.927 1st Qu.: 40.95 1st Qu.: 2.355 1st Qu.: 4.000 1st Qu.:276.8
Median :6.229 Median : 71.80 Median : 3.550 Median : 5.000 Median :307.0
Mean :6.344 Mean : 65.56 Mean : 4.044 Mean : 7.823 Mean :377.4
3rd Qu.:6.635 3rd Qu.: 91.62 3rd Qu.: 5.401 3rd Qu.: 7.000 3rd Qu.:411.0
Max. :8.780 Max. :100.00 Max. :12.127 Max. :24.000 Max. :711.0
PTRATIO B LSAT MEDV
Min. :12.60 Min. : 0.32 Min. : 1.730 Min. : 6.30
1st Qu.:16.80 1st Qu.:377.72 1st Qu.: 6.588 1st Qu.:18.50
Median :18.60 Median :392.08 Median :10.250 Median :21.95
Mean :18.25 Mean :369.83 Mean :11.442 Mean :23.75
3rd Qu.:20.20 3rd Qu.:396.16 3rd Qu.:15.105 3rd Qu.:26.60
Max. :22.00 Max. :396.90 Max. :34.410 Max. :50.00

11
#k=2 for kNN prediction is chosen by repeated CV
>
> sqrt(knn.reg(train=Boston,y=Boston[,14],k=2)$PRESS/452)
[1] 4.174813
> kNNPred<-knn.reg(train=Boston,y=Boston[,14],k=2)$pred
> cbind(Boston$MEDV[1:10],kNNPred[1:10])
[,1] [,2]
[1,] 24.0 21.15
[2,] 21.6 18.00
[3,] 34.7 24.40
[4,] 33.4 32.45
[5,] 36.2 31.05
[6,] 28.7 27.45
[7,] 22.9 21.75
[8,] 27.1 18.35
[9,] 16.5 21.05
[10,] 18.9 16.95

#alpha=.002 and lambda=.3 for the elastic net are chosen by repeated CV
#producing CV RMSPE 4.606608
>
> x<-as.matrix(Boston[,1:13])
> y<-as.matrix(Boston[,14])
>
> BostonNet<-glmnet(x,y,family="gaussian",alpha=.002,lambda=.3)
> ENetPred<-predict(BostonNet,newx=x)
> cbind(Boston$MEDV[1:10],ENetPred[1:10])
[,1] [,2]
[1,] 24.0 30.16890
[2,] 21.6 25.02049
[3,] 34.7 31.60209
[4,] 33.4 29.76508
[5,] 36.2 29.27531
[6,] 28.7 25.61013
[7,] 22.9 22.62749
[8,] 27.1 19.38919
[9,] 16.5 10.82052
[10,] 18.9 18.79055

#cp=.003 is a good choice of regression tree complexity parameter
#chosen by CV and producing RMSPE 4.180514
> BestTree<-rpart(MEDV~.,data=Boston,method="anova",
+ control=rpart.control(cp=.003))
> cbind(Boston$MEDV[1:10],predict(BestTree)[1:10])
[,1] [,2]
1 24.0 23.46667
2 21.6 23.78889
3 34.7 33.30000
4 33.4 33.30000
5 36.2 33.30000
6 28.7 23.78889
7 22.9 20.35294
8 27.1 19.70732
9 16.5 19.70732
10 18.9 19.70732

#mtry=5 is a good choice for random forest parameter, chosen by CV on OOB error
> BostonRf<-randomForest(MEDV~.,data=Boston,
+ type="regression",ntree=1000,mtry=5)
> sqrt(BostonRf$mse[1000])
[1] 3.020972

12
> comppred<-cbind(y,lm(MEDV~.,data=Boston)$fitted.values,kNNPred,ENetPred,
+ predict(BestTree),BostonRf$predicted)
>
> colnames(comppred)<-c("MEDV","OLS","2NN","ENET","Tree","RF")
>
> pairs(comppred,panel=function(x,y,...){
+ points(x,y)
+ abline(0,1)},xlim=c(0,60),ylim=c(0,60))
>

> round(cor(as.matrix(comppred)),2)
MEDV OLS 2NN ENET Tree RF
MEDV 1.00 0.86 0.88 0.86 0.92 0.94
OLS 0.86 1.00 0.77 1.00 0.87 0.91
2NN 0.88 0.77 1.00 0.77 0.81 0.85
ENET 0.86 1.00 0.77 1.00 0.87 0.91
Tree 0.92 0.87 0.81 0.87 1.00 0.96
RF 0.94 0.91 0.85 0.91 0.96 1.00

13
