
Stat 302 Midterm 1

Thursday, February 4, 2016


Student Name ________________________________________________
Student Number ______________________________________

You have exactly 180 minutes to complete this exam.


This practice test has 10 pages including this one.

The only electronics allowed are non-programmable calculators.
That means no graphing calculators and no phones.

The real test also includes a t-table and a table of exponents. There are no substantial formulae to be applied,
so no formulae are provided.

Protips:
- Show your work whenever appropriate. It shows understanding, and that’s what’s being tested.
- Use the backs of pages if space is an issue.
- If you get stuck on a part, don't abandon the question. Often later parts can be answered without earlier ones.
- Try not to panic; it rarely helps.

Good luck!

Question 1 2 3 4 5 6
Score
Out of

Question 7 8 Total
Score
Out of
Problem 1, Total /

Consider the multiple linear regression outlined in this R output.


The response is birth weight in grams (BWT). The explanatory variables are age in years
(AGE), smoking status (SMOKE, 2 categories), and the mother's weight in pounds before
pregnancy (LWT).

A) We also have race (RACE, 3 categories), history of hypertension (HT, 2 categories),
and number of previous premature babies (PTL, numeric). Is there any term we could
add to the model to reduce the R-squared?

B) Predict the birth weight of an infant born to a mother who is 32 years old, weighed
146 pounds before pregnancy, and smokes.

C) If smoking had a greater effect on birth weight for older mothers, would the
interaction term be negative or positive? Why?
Problem 2, Total /

Consider a case where we are testing whether any one of eight different kinds of drugs
improves the five-year survival chance after cancer treatment.

A) Why would it be better to do a multiple regression instead of directly looking at the
proportion of five-year survivals under each drug?

B) If we were to use a regression to explore this, what type of regression is appropriate:
linear or logistic? Why?

C) If we are willing to accept a 5% chance of falsely concluding that any of the drugs
improve survival, what should alpha be for testing any particular drug against no
treatment?

D) If we are interested specifically in an improvement in survival rates, are the
hypothesis tests from the previous part one-sided or two-sided?
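
As a study aid for part C, here is a minimal sketch of one common way to split a family-wise error rate across several tests. The exam does not name a method; the Bonferroni and Sidak corrections shown here are assumptions about what is intended, and the variable names are illustrative only.

```python
# Family-wise error rate we are willing to accept across all drug tests
family_alpha = 0.05
# Number of drugs, each compared against no treatment
n_tests = 8

# Bonferroni: split the family-wise rate evenly across the tests
bonferroni_alpha = family_alpha / n_tests

# Sidak: per-test level if the tests were independent (an assumption)
sidak_alpha = 1 - (1 - family_alpha) ** (1 / n_tests)

print(f"Bonferroni per-test alpha: {bonferroni_alpha:.5f}")
print(f"Sidak per-test alpha:      {sidak_alpha:.5f}")
```

Bonferroni is the more conservative of the two, so its per-test alpha is slightly smaller than the Sidak value.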
Problem 3, Total /

Consider the following ANOVA table for made-up data on farm yield.

A) How many watering groups are there?

B) How many watering and fertilizer group combinations are there?

C) If the df used is the number of groups minus 1, why is the total df used less than the
number of group combinations minus 1?
Problem 4, Total /

Measurements were taken on the number of bees lost to a colony as a response to the
number of days after a storm. There are recordings for the bee loss for the day of the
storm and the three days after, making a total of 4 observations.

The R-squared is 0.972, and the regression equation is:

Bee loss = 42 - 3.8(Days) - 1.55(Days^2) + error

A) What does the intercept mean in this particular case? Does it make real world sense?

B) Why is the R-squared so high?

C) Would you trust this model to predict the bee loss in a different hive? Why or why
not?
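
To make the regression equation in this problem concrete, the sketch below evaluates its fitted part (ignoring the error term) at each of the four recorded days. The function name fitted_bee_loss is hypothetical; the coefficients and the observation count are the ones stated in the problem.

```python
def fitted_bee_loss(days):
    """Fitted bee loss from the stated equation, ignoring the error term."""
    return 42 - 3.8 * days - 1.55 * days ** 2

# Day 0 is the day of the storm; days 1 to 3 are the three days after it
for d in range(4):
    print(f"Day {d}: fitted bee loss = {fitted_bee_loss(d):.2f}")

# The model estimates three parameters (intercept, Days, Days^2)
# from only four observations
n_obs = 4
n_params = 3
residual_df = n_obs - n_params
print(f"Residual degrees of freedom: {residual_df}")
```

Comparing the number of estimated parameters with the number of observations is a useful habit when judging how much a high R-squared really says.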
Problem 5, Total /

Consider this one-way ANOVA on the number of breaks (numeric, continuous) as a
function of the level of tension (3 categories).

A) What proportion of variation in number of breaks is explained by tension?

B) What else should be checked to ensure the validity of this ANOVA?

C) Is there evidence that ANY pair of groups have different means?
If the answer is yes or no, mention how you know. If it is impossible to tell, mention
what information you need to determine an answer.

D) Is there evidence that EVERY group has a different mean?
If the answer is yes or no, mention how you know. If it is impossible to tell, mention
what information you need to determine an answer.
Problem 6, Total /

Consider the simple regression shown in this plot and this R output. It shows the gas
mileage (MPG) as a response to the weight of the car (weight).

A) What does the slope mean in real-world terms?

B) What does the intercept mean? Is this reasonable in real world terms?

C) The confidence interval of the mean MPG is (24.03 to 36.64) at weight = 3000 kg.
Would it be wider, narrower, or the same at weight = 1800 kg?
Problem 7, Total /

Consider the following diagnostic plots for a linear regression model.

A) What potential problems can you see from these plots?

B) What can you tell about the explanatory variables in this model from the plots?
Problem 8, Total /

Consider the R output from this multiple logistic regression on the log odds of dying
from a burn (DEATH) as a function of age and age-squared (AGE, AGE^2), whether
there was an inhalation injury (INH_INJ, 2 categories), and whether there was a direct
flame injury (FLAME).

A) Why is it appropriate to include at least one polynomial term for age?

B) What is the odds ratio of dying if you suffer an inhalation injury, holding age and
flame injury constant?

C) What are the odds of death for someone 33 years old with a direct flame injury, but
not an inhalation injury?

D) This model was developed using data from dozens of burn facilities. We are not
particularly interested in the different survival rates at these facilities, but we do
acknowledge that it's a factor that should be controlled for. How do we include the burn
facility variable without overfitting?

E) Sometimes age was missing from data, because the people were so badly burned they
could not be identified. What sort of missingness is this?

F) Sometimes the value for inhalation injury was missing because of a computer error in
the reporting software. Is this sort of missingness ignorable?
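
For parts B and C of this problem, the sketch below shows the mechanics of turning logistic regression coefficients into an odds ratio and into odds. The R output is not reproduced here, so every coefficient value is HYPOTHETICAL and chosen only for illustration; substitute the values from the actual output.

```python
import math

# HYPOTHETICAL coefficients on the log-odds scale, NOT taken from the R output
coef = {
    "intercept": -4.0,
    "age": 0.02,
    "age_sq": 0.001,
    "inh_inj": 1.5,   # 1 = inhalation injury, 0 = none
    "flame": 0.6,     # 1 = direct flame injury, 0 = none
}

def log_odds(age, inh_inj, flame):
    """Linear predictor: the log odds of death for one patient."""
    return (coef["intercept"]
            + coef["age"] * age
            + coef["age_sq"] * age ** 2
            + coef["inh_inj"] * inh_inj
            + coef["flame"] * flame)

# Odds ratio for inhalation injury, holding age and flame injury constant:
# exponentiate that variable's coefficient
odds_ratio_inh = math.exp(coef["inh_inj"])

# Odds of death for a 33-year-old with a flame injury and no inhalation injury:
# exponentiate the whole linear predictor
odds = math.exp(log_odds(age=33, inh_inj=0, flame=1))

print(f"Odds ratio (inhalation injury): {odds_ratio_inh:.3f}")
print(f"Odds of death: {odds:.4f}")
```

Because the model is linear on the log-odds scale, the odds ratio for inhalation injury does not depend on the values at which age and flame injury are held.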
