
CS 7290 — Fall 2019

Practice Midterm

January 30, 2020

Time: 1 hour 40 min

Name (please print):

Show all your work and calculations. Partial credit will be given for work that is partially correct.
Points will be deducted for false statements, even if the final answer is correct. Please circle your
final answer where appropriate.

This exam is closed-book. You may consult one page with your hand-written notes. Calculators
are permitted.

Honor code: I promise not to cheat on this exam. I will neither give nor receive any unauthorized
assistance. I will not share information about the exam with anyone who may be taking it at a
different time. I have not been told anything about the exam by someone who has taken it earlier.

Signature: Date:

1. Short questions

(a) (7 pts) An increase of the confidence level will cause the width of the confidence interval
to
i. Increase
ii. Decrease

(b) (7 pts) Other things being equal, quadrupling the sample size causes the width of the
confidence interval to
i. Double
ii. Halve
iii. Be one quarter as wide
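
As a study aid for parts (a) and (b), here is a minimal Python sketch of how the width of a z-based confidence interval for a mean, 2 z σ / sqrt(n), responds to the confidence level and the sample size. The value of sigma and the sample sizes are hypothetical; the z values are taken from the ∞ row of Table A.5.

```python
import math

def ci_width(sigma, n, z):
    """Width of a z-based confidence interval for a mean: 2 * z * sigma / sqrt(n)."""
    return 2.0 * z * sigma / math.sqrt(n)

sigma = 1.0                   # hypothetical population standard deviation
z_95, z_99 = 1.960, 2.576     # two-sided 95% and 99% multipliers (Table A.5, last row)

width_n = ci_width(sigma, 25, z_95)     # baseline: n = 25 at 95%
width_4n = ci_width(sigma, 100, z_95)   # sample size quadrupled
width_99 = ci_width(sigma, 25, z_99)    # same n, higher confidence level

ratio = width_4n / width_n
print(width_n, width_4n, width_99, ratio)
```

Only the ratio matters here; it does not depend on the hypothetical sigma.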

(c) (7 pts) Match each histogram to a Normal quantile-quantile plot, and justify.
[Figure: three histograms labeled A, B, and C (vertical axis: Frequency; horizontal ranges −5 to 5, 0.0 to 1.0, and 0 to 7, in that order), and three Normal quantile-quantile plots labeled I, II, and III (horizontal axis: Theoretical Quantiles, −3 to 3; vertical axis: Sample Quantiles)]

2. The director of admissions of a small college selected 120 students at random from the new
freshman class, to determine whether their college grade point average (GPA) at the end of
the freshman year can be predicted from their high school American College Testing (ACT)
score. The results of a simple linear regression fit are summarized below.

[Figure: two panels, a scatterplot of GPA (0.5 to 3.5) versus ACT (15 to 35) and a plot of Residuals (−2 to 1) versus ACT (15 to 35)]

>summary(lm(GPA~ACT, data=X))
Call:
lm(formula = GPA ~ ACT, data = X)
...
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.11405    0.32089   6.588  1.3e-09 ***
ACT          0.03883    0.01277   3.040  0.00292 **
...
Residual standard error: 0.6231 on 118 degrees of freedom

> mean(X$ACT)
[1] 24.725
> var(X$ACT)
[1] 19.99937

(a) (7 pts) Is there a linear association between ACT and GPA? Use your method of choice,
and confidence level of 95%. State the model, and justify your answer.
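
One possible way to check the arithmetic for (a) is a t-test on the slope. The sketch below recomputes the t statistic from the R output and compares it with a critical value. Table A.5 has no row for 118 degrees of freedom, so bracketing it with the df = 80 and ∞ rows is an assumption of this sketch.

```python
# Values copied from the R output above.
estimate = 0.03883     # estimated slope for ACT
std_error = 0.01277    # standard error of the slope
df = 118               # residual degrees of freedom

t_stat = estimate / std_error

# Assumption: the 97.5% critical value for df = 118 lies between the
# df = 80 row (1.990) and the infinity row (1.960) of Table A.5.
t_crit_conservative = 1.990
reject_h0 = abs(t_stat) > t_crit_conservative

print(round(t_stat, 3), reject_h0)
```

Using the larger of the two bracketing values makes the comparison conservative.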

(b) (8 pts) Obtain a 95% interval estimate of the mean freshman GPA for students whose
ACT test score is 28. Interpret your confidence interval.

(c) (8 pts) Mary Jones obtained an ACT score of 28. Predict her freshman GPA using a
95% prediction interval. Interpret the interval.
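
The intervals in (b) and (c) can be sketched from the summary statistics alone. This is a minimal illustration, assuming the usual simple linear regression formulas and an approximate critical value t = 1.98 for 118 degrees of freedom (Table A.5 only brackets it between 1.990 and 1.960).

```python
import math

# Values copied from the R output above.
b0, b1 = 2.11405, 0.03883   # intercept and slope
s = 0.6231                  # residual standard error
n = 120                     # sample size
x_bar = 24.725              # mean(X$ACT)
var_x = 19.99937            # var(X$ACT), which uses the n - 1 denominator
x0 = 28.0                   # ACT score of interest

t_crit = 1.98               # assumption: approximate 97.5% point for df = 118

sxx = (n - 1) * var_x                     # sum of squared deviations of ACT
y_hat = b0 + b1 * x0                      # fitted GPA at ACT = 28
leverage = 1.0 / n + (x0 - x_bar) ** 2 / sxx

se_mean = s * math.sqrt(leverage)         # standard error of the mean response
se_pred = s * math.sqrt(1.0 + leverage)   # standard error for a new observation

ci_mean = (y_hat - t_crit * se_mean, y_hat + t_crit * se_mean)
pi_new = (y_hat - t_crit * se_pred, y_hat + t_crit * se_pred)

print(y_hat, ci_mean, pi_new)
```

The extra "1 +" under the square root is the only difference between the two intervals, which is why the prediction interval is much wider.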

(d) (7 pts) Is the prediction in (c) practically useful? Discuss.

3. A New York ad-testing company, Video Board Tests, Inc., conducted an annual survey to study
the success of TV commercials as a function of investment. It selected the 21 most outstanding
recent commercials shown on TV. Then it asked 4,000 regular product users to cite which
commercials they had seen for that product category in the past week.
The resulting dataset appeared in the Wall Street Journal. It contains TV advertising budget
(in millions of dollars), the number of retained impressions of the advertisement, and the name
of the company that sponsored the advertisement. The data are plotted below.
[Figure: scatterplot of Number of impressions (0 to 100) versus Spending (0 to 200), with each point labeled by company: PEPSI, MC.DONALD'S, ATT.BELL, COCO−COLA, CREST, BURGER.KING, MCI, LEVI'S, FORD, POLAROID, MILLER.LITE, WENDY'S, OSCAR.MEYER, FEDERAL.EXPRESS, DIET.COLA, MEOW.MIX, CALVIN.KLEIN, STROH'S, BUD.LITE, SHASTA, KIBBLES.N.BITS]

(a) (8 pts) A simple linear regression model is considered for use with the data. Give a
mathematical description of the model, and state the assumptions. For each assumption,
comment on whether it is plausible for this dataset.

(b) (8 pts) The company considered three candidate regression models for these data:
Impress versus Spend; log(Impress) versus Spend; and log(Impress) versus log(Spend).
Residuals of the fit of these three models are plotted below. Which model do you
recommend based on the residual plots? Please give at least two reasons why.

[Figure: three residual plots. Panel 1, Impress vs Spend: Residuals (−40 to 40) versus Predicted impressions (30 to 90). Panel 2, log(Impress) vs Spend: Residuals (−1.5 to 1.5) versus Predicted log(impressions) (3.0 to 4.5). Panel 3, log(Impress) vs log(Spend): Residuals (−1.5 to 1.5) versus Predicted log(impressions) (2.5 to 4.5)]

(c) (8 pts) The analysts decided to work with the model that uses log(Spend) and log(Impress).
State the linear model and the assumptions on the log-log scale, and then the corresponding
model on the original scale. Interpret the parameters of the model.

(d) (8 pts) A partial R output of the model fit is given below. Test for association between
log(Impress) and log(Spend). State the null and the alternative hypotheses, and your
conclusion at the confidence level of 95%. Interpret the results.

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   1.2999     0.4236
logSpend      0.6135     0.1191

Residual standard error: 0.581


Multiple R-squared: 0.5829
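
The missing t value for logSpend can be reconstructed from the estimate and its standard error. The sketch below does that and, as a sanity check, recovers R-squared from the t statistic via R² = t² / (t² + df), an identity that holds in simple linear regression. It assumes all 21 commercials were used in the fit, so that df = 19.

```python
# Values copied from the partial R output above.
estimate = 0.6135      # slope for logSpend
std_error = 0.1191     # its standard error
n = 21                 # assumption: all 21 commercials enter the fit
df = n - 2             # residual degrees of freedom

t_stat = estimate / std_error
t_crit = 2.093         # Table A.5, df = 19, 97.5% column

reject_h0 = abs(t_stat) > t_crit

# Consistency check against the reported Multiple R-squared (0.5829).
r_squared_check = t_stat ** 2 / (t_stat ** 2 + df)

print(round(t_stat, 3), reject_h0, round(r_squared_check, 4))
```

The recovered R-squared agrees with the reported value up to rounding of the printed coefficients, which supports the df = 19 assumption.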

4. True or false? Circle the right answer and explain.

(a) (4 pts) Increasing the confidence level causes the width of a confidence interval to
increase
TRUE FALSE

(b) (4 pts) Probability(Type II error) = 1 - Probability(Type I error)


TRUE FALSE

(c) (4 pts) The p-value is the probability that H0 is true


TRUE FALSE

(d) (4 pts) If we know the standard error of the sample mean for each of the two independent
samples, we can figure out the standard error of the difference between the sample means,
even if we don’t know the sample sizes
TRUE FALSE

(e) (4 pts) The Pearson correlation of 0.3 can be interpreted as: X changes 0.3 standard
deviations when Y changes one standard deviation.
TRUE FALSE

5. Twelve naval hospitals in the United States study the number of monthly man-hours in the
anaesthesiology service as a function of the eligible population per thousand.

(a) The analysts considered a linear regression model that uses eligible population as the
only predictor. Give a mathematical description of this model, and fill in the ANOVA
table below.

Source    DF    Sum of Squares    Mean Square
Model      1          14346071       14346071
Error
Total     11          15094263
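
The blank entries of the ANOVA table follow by subtraction. A minimal sketch of that arithmetic, together with the resulting F statistic for the single-predictor model:

```python
# Entries given in the ANOVA table.
ss_model, df_model = 14346071, 1
ss_total, df_total = 15094263, 11

# Error row: total minus model.
ss_error = ss_total - ss_model
df_error = df_total - df_model
ms_error = ss_error / df_error

# F statistic for the overall regression.
ms_model = ss_model / df_model
f_stat = ms_model / ms_error

# With one numerator degree of freedom, the 5% critical F value equals the
# square of the two-sided t critical value (Table A.5, df = 10: 2.228).
f_crit = 2.228 ** 2

print(ss_error, df_error, ms_error, round(f_stat, 2), round(f_crit, 3))
```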

(b) Test at α = 0.05 whether eligible has a significant linear association with man-hour
in the model in (a).

(c) The Naval Office plans to build a new hospital in an area where the eligible population is
50 thousand individuals. Using the model in (a), calculate a 95% CI for a new monthly
value of man-hours in this area. The parameter estimates of the model in (a) and the
covariance matrix are given below.
Parameter Estimates
Variable     DF    Estimate
Intercept     1    180.65755
eligible      1      9.42898

Covariance of Estimates
Variable        Intercept         eligible
Intercept    16481.721867     -68.92847184
eligible     -68.92847184     0.4636704199
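
One way to use the covariance matrix is sketched below: the variance of the fitted value at x0 is Var(b0) + x0² Var(b1) + 2 x0 Cov(b0, b1), and an interval for a new observation adds the error variance. The mean squared error used here is an assumption of this sketch, derived from the ANOVA table as (15094263 − 14346071) / 10; it is not printed in the output.

```python
import math

# Parameter estimates and covariance entries copied from the output above.
b0, b1 = 180.65755, 9.42898
var_b0 = 16481.721867
var_b1 = 0.4636704199
cov_b0_b1 = -68.92847184

mse = 74819.2      # assumption: error mean square derived from the ANOVA table
x0 = 50.0          # eligible population, in thousands
t_crit = 2.228     # Table A.5, df = 10, 97.5% column

y_hat = b0 + b1 * x0
var_fit = var_b0 + x0 ** 2 * var_b1 + 2.0 * x0 * cov_b0_b1   # variance of the fitted mean
se_pred = math.sqrt(var_fit + mse)                           # adds the new-observation error

interval = (y_hat - t_crit * se_pred, y_hat + t_crit * se_pred)

print(y_hat, var_fit, se_pred, interval)
```

The negative covariance term matters: dropping it would overstate the variance of the fitted value considerably.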

(d) What is the estimated Pearson correlation between man-hour and eligible?

6. To study mortality due to malignant melanoma of the skin, annual mortality rates of white
males were recorded between 1950 and 1969 for each state in the United States (except Alaska
and Hawaii). The incidence of melanoma can be related to the amount of sunshine and,
somewhat equivalently, to the latitude of the area. Therefore researchers became interested
in the relationship between the melanoma mortality and the latitude of the states.
The scatterplot displays annual mortality (per 10 million) versus latitude in degrees. A state
is denoted "O" if it borders an ocean, and "L" otherwise. The average latitude of the states
is 38.53, and the standard deviation is 4.61. The output of a linear regression fit is given below.

[Figure: scatterplot of Annual mortality per 10 million (100 to 220) versus Latitude in degrees (30 to 45), with each state plotted as "O" or "L"]

                           Sum of        Mean
Source            DF      Squares      Square    F Value    Pr > F
Model              1        36464       36464      99.80    <.0001
Error             47        17173   365.38436
Corrected Total   48        53637

Root MSE          19.11503    R-Square    ---
Dependent Mean   152.87755
Coeff Var         12.50349

                     Parameter    Standard
Variable     DF       Estimate       Error    t Value    Pr > |t|
Intercept     1      389.18935    23.81232      16.34      <.0001
latitude      1       -5.97764     0.59837      -9.99      <.0001

(a) Does a linear model between latitude and mortality appear appropriate? Describe the estimated
relationship mathematically, and in words.

(b) List the assumptions made by this model. For each assumption, describe a test or a plot that
you’d use to assess its validity.

(c) What is the estimated Pearson correlation coefficient between latitude and mortality rate?
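
The R-Square entry is blanked in the output, but it can be recovered from the ANOVA sums of squares. In simple linear regression the Pearson correlation is the square root of R-squared, carrying the sign of the slope. A minimal sketch:

```python
import math

# Values copied from the regression output above.
ss_model = 36464       # model sum of squares
ss_total = 53637       # corrected total sum of squares
slope = -5.97764       # estimated coefficient of latitude

r_squared = ss_model / ss_total

# The correlation takes the sign of the estimated slope.
r = math.copysign(math.sqrt(r_squared), slope)

print(round(r_squared, 4), round(r, 4))
```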

(d) Construct the 95% confidence interval for the slope of the regression. What does the confidence
interval tell you?
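
A sketch of the interval arithmetic for the slope, estimate ± t × standard error. Table A.5 has no row for 47 degrees of freedom, so this sketch assumes the df = 40 critical value (2.021), which gives a slightly wider, conservative interval.

```python
# Values copied from the regression output above.
estimate = -5.97764    # slope for latitude
std_error = 0.59837    # its standard error

t_crit = 2.021         # assumption: Table A.5, df = 40 row, used in place of df = 47

half_width = t_crit * std_error
ci_slope = (estimate - half_width, estimate + half_width)

print(ci_slope)
```

The same pattern, with the estimate scaled by the latitude difference, could be adapted to a statement about the change in mortality between two latitudes.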

(e) A researcher argues that a move from Indiana (latitude 40.2) to Texas (latitude 31.5) in 1950-
1969 increased the risk of mortality from melanoma by more than 25 per 10 million. Test this
statement at a confidence level of 95%.

(f) Construct a 95% confidence interval for the average annual mortality rate in Indiana (latitude
40.2).

(g) Calculate the upper and lower limits of the 95% whole-regression confidence band for Indiana.
Explain why the width of the interval differs from the interval calculated above.

7. (a) Suppose that you want to fit a linear model to the data shown on the following scatterplot.
Which assumptions of the model are not verified for these data? What remedy will you suggest?

[Figure: scatterplot of y (vertical axis, 0 to 150) against a horizontal axis running from 0 to 10]

(b) In a simple linear regression relating variables X and Y, the range of interest of the independent
variable is between 0 and 10. You can collect 30 observations while controlling the levels of X.
If from previous experience you know that X and Y have a linear relationship, what values of
the input variables would you suggest, and why? How would you select the input variables if
you are not sure of the linear relationship?

(c) You want to conduct a lack of fit test for a simple linear regression. State the null and the
alternative hypotheses, the test statistic, and the degrees of freedom. What is the meaning of a
rejected null hypothesis in this case?

8. Short answer questions. Each part is unrelated.

(a) A researcher is designing a new experiment involving an independent variable X and a response
Y. Data from the experiment will be analyzed using a linear regression. He considers two options
for the choice of values of X:
1) X = 1, 2, 3, 5, 5, 7, 8, 9 and 2) X = 1, 1, 1, 5, 5, 9, 9, 9
Which design would you recommend and why? If there is a situation favorable to each, please
specify.
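
Two quantities that may help compare the designs are the sum of squared deviations Sxx (the slope variance is σ² / Sxx) and the number of distinct X levels (which limits how well curvature can be detected). The sketch below computes both; which criterion matters more is left to the reader.

```python
def sxx(xs):
    """Sum of squared deviations of the design points from their mean."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs)

design_1 = [1, 2, 3, 5, 5, 7, 8, 9]
design_2 = [1, 1, 1, 5, 5, 9, 9, 9]

sxx_1 = sxx(design_1)
sxx_2 = sxx(design_2)

# Number of distinct levels of X in each design.
levels_1 = len(set(design_1))
levels_2 = len(set(design_2))

print(sxx_1, sxx_2, levels_1, levels_2)
```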

(b) The figure below shows a Box-Cox plot from a simple linear regression. Explain how to use this
plot, and what, if any, actions should be taken for the data in this analysis.

[Figure: Box-Cox plot of log-Likelihood (about −314 to −304) versus lambda (−1.0 to 1.0), with the 95% level marked]

9. Each part of the following problem is unrelated.

(a) In simple linear regression we often test H0 : β1 = 0 vs Ha : β1 ≠ 0. What is the conceptual
difference between the Type I error and the p-value associated with this test?

(b) What estimation method is used to estimate the values of the parameter λ in the Box-Cox
transformation procedure? Why is this estimation method chosen?

Standard Normal Probabilities

Table entry for z is the area under the standard normal curve to the left of z.

z .00 .01 .02 .03 .04 .05 .06 .07 .08 .09

–3.4 .0003 .0003 .0003 .0003 .0003 .0003 .0003 .0003 .0003 .0002
–3.3 .0005 .0005 .0005 .0004 .0004 .0004 .0004 .0004 .0004 .0003
–3.2 .0007 .0007 .0006 .0006 .0006 .0006 .0006 .0005 .0005 .0005
–3.1 .0010 .0009 .0009 .0009 .0008 .0008 .0008 .0008 .0007 .0007
–3.0 .0013 .0013 .0013 .0012 .0012 .0011 .0011 .0011 .0010 .0010
–2.9 .0019 .0018 .0018 .0017 .0016 .0016 .0015 .0015 .0014 .0014
–2.8 .0026 .0025 .0024 .0023 .0023 .0022 .0021 .0021 .0020 .0019
–2.7 .0035 .0034 .0033 .0032 .0031 .0030 .0029 .0028 .0027 .0026
–2.6 .0047 .0045 .0044 .0043 .0041 .0040 .0039 .0038 .0037 .0036
–2.5 .0062 .0060 .0059 .0057 .0055 .0054 .0052 .0051 .0049 .0048
–2.4 .0082 .0080 .0078 .0075 .0073 .0071 .0069 .0068 .0066 .0064
–2.3 .0107 .0104 .0102 .0099 .0096 .0094 .0091 .0089 .0087 .0084
–2.2 .0139 .0136 .0132 .0129 .0125 .0122 .0119 .0116 .0113 .0110
–2.1 .0179 .0174 .0170 .0166 .0162 .0158 .0154 .0150 .0146 .0143
–2.0 .0228 .0222 .0217 .0212 .0207 .0202 .0197 .0192 .0188 .0183
–1.9 .0287 .0281 .0274 .0268 .0262 .0256 .0250 .0244 .0239 .0233
–1.8 .0359 .0351 .0344 .0336 .0329 .0322 .0314 .0307 .0301 .0294
–1.7 .0446 .0436 .0427 .0418 .0409 .0401 .0392 .0384 .0375 .0367
–1.6 .0548 .0537 .0526 .0516 .0505 .0495 .0485 .0475 .0465 .0455
–1.5 .0668 .0655 .0643 .0630 .0618 .0606 .0594 .0582 .0571 .0559
–1.4 .0808 .0793 .0778 .0764 .0749 .0735 .0721 .0708 .0694 .0681
–1.3 .0968 .0951 .0934 .0918 .0901 .0885 .0869 .0853 .0838 .0823
–1.2 .1151 .1131 .1112 .1093 .1075 .1056 .1038 .1020 .1003 .0985
–1.1 .1357 .1335 .1314 .1292 .1271 .1251 .1230 .1210 .1190 .1170
–1.0 .1587 .1562 .1539 .1515 .1492 .1469 .1446 .1423 .1401 .1379
–0.9 .1841 .1814 .1788 .1762 .1736 .1711 .1685 .1660 .1635 .1611
–0.8 .2119 .2090 .2061 .2033 .2005 .1977 .1949 .1922 .1894 .1867
–0.7 .2420 .2389 .2358 .2327 .2296 .2266 .2236 .2206 .2177 .2148
–0.6 .2743 .2709 .2676 .2643 .2611 .2578 .2546 .2514 .2483 .2451
–0.5 .3085 .3050 .3015 .2981 .2946 .2912 .2877 .2843 .2810 .2776
–0.4 .3446 .3409 .3372 .3336 .3300 .3264 .3228 .3192 .3156 .3121
–0.3 .3821 .3783 .3745 .3707 .3669 .3632 .3594 .3557 .3520 .3483
–0.2 .4207 .4168 .4129 .4090 .4052 .4013 .3974 .3936 .3897 .3859
–0.1 .4602 .4562 .4522 .4483 .4443 .4404 .4364 .4325 .4286 .4247
–0.0 .5000 .4960 .4920 .4880 .4840 .4801 .4761 .4721 .4681 .4641

Table A.5. Percentiles of the t-Distribution
df 90% 95% 97.5% 99% 99.5% 99.9%
1 3.078 6.314 12.706 31.821 63.657 318.309
2 1.886 2.920 4.303 6.965 9.925 22.327
3 1.638 2.353 3.183 4.541 5.841 10.215
4 1.533 2.132 2.777 3.747 4.604 7.173
5 1.476 2.015 2.571 3.365 4.032 5.893
6 1.440 1.943 2.447 3.143 3.708 5.208
7 1.415 1.895 2.365 2.998 3.500 4.785
8 1.397 1.860 2.306 2.897 3.355 4.501
9 1.383 1.833 2.262 2.822 3.250 4.297
10 1.372 1.812 2.228 2.764 3.169 4.144
11 1.363 1.796 2.201 2.718 3.106 4.025
12 1.356 1.782 2.179 2.681 3.055 3.930
13 1.350 1.771 2.160 2.650 3.012 3.852
14 1.345 1.761 2.145 2.625 2.977 3.787
15 1.341 1.753 2.132 2.603 2.947 3.733
16 1.337 1.746 2.120 2.584 2.921 3.686
17 1.333 1.740 2.110 2.567 2.898 3.646
18 1.330 1.734 2.101 2.552 2.879 3.611
19 1.328 1.729 2.093 2.540 2.861 3.580
20 1.325 1.725 2.086 2.528 2.845 3.552
21 1.323 1.721 2.080 2.518 2.831 3.527
22 1.321 1.717 2.074 2.508 2.819 3.505
23 1.319 1.714 2.069 2.500 2.807 3.485
24 1.318 1.711 2.064 2.492 2.797 3.467
25 1.316 1.708 2.060 2.485 2.788 3.450
26 1.315 1.706 2.056 2.479 2.779 3.435
27 1.314 1.703 2.052 2.473 2.771 3.421
28 1.313 1.701 2.048 2.467 2.763 3.408
29 1.311 1.699 2.045 2.462 2.756 3.396
30 1.310 1.697 2.042 2.457 2.750 3.385
40 1.303 1.684 2.021 2.423 2.705 3.307
80 1.292 1.664 1.990 2.374 2.639 3.195
∞ 1.282 1.645 1.960 2.326 2.576 3.090
