
Final Exam

Saturday, December 9, 2017 1:05 PM

Q1
Question 1
Analytics Problem/Question (20 pts.): The manager of a prestigious
hotel in Peru has hired you to help them predict the number of
registered guests each day at various hotel locations. Their
marketing specialists suggest that the best predictors for guest
bookings are: weekday vs. weekend, month of the year
(categorical), average daily marketing expenditures 30 days prior to
the booking date, breakfast included (yes/no), special event at the
hotel that day (yes/no) and location (4 locations total—Lima, Cuzco,
Arequipa and Trujillo).
Analysis Goal: Interpretation
Data: The number of total daily bookings for each hotel is relatively
large (100 to 300), so it is relatively normally distributed.
Exhibits: None
Please answer the following:
1.1. (5 pts.) Since you plan to use Location as a categorical
predictor, it would help to re-level the Location variable to
set LocationLima as the reference level. This is so because all
travelers to Peru arrive in Lima and then travel to other locations, so
Lima is a good reference level. Assume that you fitted a linear
model and the regression coefficient for LocationCuzco is −35.4
(negative) and significant. Please interpret this coefficient.
1.2. (5 pts.) Now, assume that you fit a log-linear model and the
regression coefficient for LocationCuzco is now −0.152. Please
interpret this coefficient.
1.3. (10 pts.) Suppose that you are trying to predict hotel daily
revenues (not bookings) using various predictors, including room
price. After doing some analysis, you decide to try a few log
transformations. Please provide an interpretation for each of the β
coefficients below:
a) Linear model: βPRICE = −240 (notice the negative sign)
b) Log-Linear model: βPRICE = −0.05
c) Log-Log model: βLOG(PRICE) = −1.5

1.1 With Lima as the reference level, a coefficient of -35.4 for LocationCuzco tells us that, holding the other
predictors constant, the Cuzco hotel has on average 35.4 fewer registered guests per day than the Lima hotel.
Because the coefficient is significant, this difference between Cuzco and Lima is unlikely to be due to chance.

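1.2 In a log-linear model the outcome is the log of daily guests, so the coefficient is read as a proportional
difference, not a probability. A coefficient of -0.152 for LocationCuzco means that, holding the other predictors
constant, the Cuzco hotel has roughly 15% fewer daily guests than the Lima hotel (more precisely,
exp(-0.152) - 1 ≈ -14.1%).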
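
For reference, a dummy-variable coefficient in a log-linear model is commonly converted to a percentage effect with exp(β) − 1. The short Python sketch below illustrates that conversion only; the function name `pct_effect` is hypothetical, and −0.152 is the coefficient given in question 1.2.

```python
import math

def pct_effect(beta):
    """Percent change in the outcome when a dummy switches from 0 to 1
    in a model of the form log(y) = ... + beta * dummy."""
    return (math.exp(beta) - 1.0) * 100.0

beta_cuzco = -0.152                  # coefficient from question 1.2
approx = beta_cuzco * 100.0          # quick approximation: 100 * beta
exact = pct_effect(beta_cuzco)       # exact conversion: 100 * (exp(beta) - 1)
print(f"approximate: {approx:.1f}%  exact: {exact:.1f}%")
# prints: approximate: -15.2%  exact: -14.1%
```

The approximation 100·β is close for small coefficients and drifts away from the exact value as |β| grows.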

1.3

Q2
Question 2
Analytics Problem/Question (15 pts.): Your client is a government agency and is
interested in predicting average salaries for all counties in the US, using
profession, years of experience, level of education and gender as predictors. You
fit an OLS model using all these variables to predict salaries.
Analysis Goal: Interpretation and prediction
Preliminary Analysis: You run an OLS model using all the predictors above and
prepare a residual plot, shown below.
Exhibits: See graph

Screen clipping taken: 12/9/2017 1:33 PM

2.1. Which OLS assumption appears to be violated from visual inspection of the residual
plot above? Briefly explain why. Also, briefly comment why this is a problem.
2.2. How would you test this?
2.3. If the OLS assumptions you tested above don't hold (e.g., the Breusch-Pagan test is
significant), which modeling approach would you use instead of OLS? Briefly explain why.

Q3
Question 3
Analytics Problem/Question (15 pts.): Suddenly you realize that in the previous
analysis you forgot to include the date (Year/Month) in the model. This is an
important omission because salaries tend to go up over the years.
Analysis Goal: Prediction
Data: Same as in Q2, but with an additional predictor named t which is a sequential
number indicating the period (i.e., year/month). After incorporating this t variable
into the model and sorting the data by t, the residuals when plotted against t are
shown in the plot.
Exhibits: See the residual plot



Screen clipping taken: 12/9/2017 1:49 PM

Please answer the following:


3.1. Which OLS assumption appears to be violated from visual inspection of the
residual plot above? Briefly explain why. Also, briefly comment why this is a
problem.
3.2. How would you test for this OLS violation? Briefly explain what this test does
and how you would interpret the test results.
3.3. If the OLS assumptions you tested above don't hold, which modeling approach
would you use? Briefly explain why and which transformations you would
recommend (if any).

Q4
Question 4
Analytics Problem/Question (20 pts.): Your goal is to develop a model to predict the
median value (i.e., medv) of houses in the Boston metropolitan area.
Analysis Goal: Prediction
Data: The data is from a pizza nutrition set. It contains data for 10 different brands
of pizza, along with their moisture, protein, fat, ash, sodium, carbohydrates and
calories per serving.
Exhibits: ggpairs plot; OLS regression output for a model predicting calories in
pizzas on selected predictors; the condition index (CI) for the model and the
variance inflation factor (VIF) for each of the variables



Screen clipping taken: 12/9/2017 2:03 PM

Q5
Question 5
(Interaction) Analytics Problem/Question (15 pts.): Following up on the previous
question, the pupil-to-teacher ratio (ptratio) variable is a significant and negative predictor of
median house value (medv) in thousands of dollars. This supports the hypothesis that
overcrowded schools have a negative impact on housing prices. But upon exploration of
the plot below you also noticed that there seems to be a sharp change in the pattern
of median housing values around ptratio=18.
Analysis Goal: Interpretation
Data: Same as Q5
Exhibits: See plot

Screen clipping taken: 12/9/2017 2:21 PM

Please answer the following:

5.1. Suppose that you want to model high vs. low ptratio as a predictor, rather than the
actual ptratios. So you create a new dummy variable called pt.high=1 if ptratio>18 and 0
otherwise. If the coefficient for pt.high is −1.58 and significant, briefly interpret the effect of
this coefficient in median housing values.
5.2. Now suppose that you want to test some interaction variables and you test the main
and interaction effects for pt.high (the dummy variable created above) and crim (the crime
rate in that county). You run your model and you find that both main effects for pt.high
and crim are negative and significant. But you also find that the interaction effect of
pt.high × crim is significant but positive. Please interpret the two main effects and the
interaction effect in this example.
5.3. Now, suppose that you decide to try a piecewise polynomial (without the interaction
effect) using the actual ptratio variable (not the pt.high dummy variable). So you fit a
model that contains ptratio and the term I(ptratio-18)*(ptratio>18). The coefficient
for the ptratio variable is −3.52 and significant, and the coefficient for
I(ptratio-18)*(ptratio>18) is +2.13 and significant. Please provide an interpretation for these two
coefficients.
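
A note on the arithmetic behind 5.3: in a piecewise linear specification with a knot at 18, the second coefficient is usually read as the change in slope past the knot, not as a slope in its own right. The Python sketch below works that out with the two coefficients quoted in the question. It assumes the model has the form medv ~ ptratio + I(ptratio-18)*(ptratio>18); the function name `piecewise_effect` is hypothetical, and the intercept and all other predictors are left out.

```python
def piecewise_effect(ptratio, b_base=-3.52, b_hinge=2.13, knot=18.0):
    """Contribution of ptratio to predicted medv (in $1000s), ignoring the
    intercept and other predictors. Assumed model form:
    medv ~ ptratio + I(ptratio-18)*(ptratio>18)."""
    hinge = max(ptratio - knot, 0.0)  # zero at or below the knot
    return b_base * ptratio + b_hinge * hinge

# Change in predicted medv for a one-unit increase in ptratio
slope_below = piecewise_effect(17.0) - piecewise_effect(16.0)  # below the knot
slope_above = piecewise_effect(20.0) - piecewise_effect(19.0)  # above the knot
print(round(slope_below, 2), round(slope_above, 2))
# prints: -3.52 -1.39
```

Under these assumptions the slope is −3.52 up to ptratio = 18 and −3.52 + 2.13 = −1.39 beyond it, so the relationship stays negative but flattens after the knot.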

Q6
Question 6
(Variable Selection and Cross Validation) Analytics Problem/Question (15
pts.): Following up on Q5, you suggested 3 types of modeling approaches to
address the issue of dimensionality. But you may also be able to address the
issue of dimensionality through variable selection.
Analysis Goal: Interpretation
Data: Same as Q5
Exhibits: Same as Q5
Please answer the following:
6.1. Briefly but specifically discuss how you would use collinearity
diagnostics to select an appropriate subset of variables to include in the
model.
6.2. Suppose that your analysis above yielded two possible models: a larger
model named fit.large and a smaller model named fit.reduced. Assume also
that the collinearity diagnostics of these two models are OK. Discuss
concisely but specifically how you would go about selecting the best model
between fit.reduced and fit.large using the stepwise selection method. Do
not discuss any R coding. Just explain in plain English how you would do
this. In your answer, please be specific about the process and fit statistics
you would use to add and remove variables, and how you would compare
the various models.
6.3. Now suppose that the fit.large model has moderate collinearity
problems. Also, suppose that there are a total of 4 models tested by your
stepwise procedure in 6.2 above. Briefly but accurately describe how you
would use machine learning to select the best of the 4 models based on test
cross-validation. In your answer, indicate which cross-validation method you
would use to generate training and test subsamples and why.
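
As a rough illustration of the idea in 6.3, the Python sketch below compares candidate models by their average test-fold mean squared error under k-fold cross-validation. Everything in it is an assumption made for illustration: the helper names (`kfold_indices`, `cv_mse`, `fit_mean`, `fit_line`), the choice of k = 10, the toy data, and the use of two simple candidates instead of the four models mentioned in the question.

```python
import random

def kfold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and deal the indices into k roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cv_mse(fit, xs, ys, k=10, seed=0):
    """Average test-fold mean squared error of one candidate model.
    `fit(train_x, train_y)` must return a function that predicts y from x."""
    fold_errors = []
    for test in kfold_indices(len(xs), k, seed):
        held_out = set(test)
        train = [i for i in range(len(xs)) if i not in held_out]
        predict = fit([xs[i] for i in train], [ys[i] for i in train])
        sq = [(ys[i] - predict(xs[i])) ** 2 for i in test]
        fold_errors.append(sum(sq) / len(sq))
    return sum(fold_errors) / k

def fit_mean(train_x, train_y):
    """Intercept-only model: always predicts the training mean."""
    m = sum(train_y) / len(train_y)
    return lambda x: m

def fit_line(train_x, train_y):
    """One-predictor least-squares line."""
    n = len(train_x)
    mx, my = sum(train_x) / n, sum(train_y) / n
    sxx = sum((x - mx) ** 2 for x in train_x)
    slope = sum((x - mx) * (y - my) for x, y in zip(train_x, train_y)) / sxx
    return lambda x: my + slope * (x - mx)

# Toy data, invented for illustration only (not the exam data set).
rng = random.Random(1)
xs = [float(i) for i in range(50)]
ys = [3.0 + 2.0 * x + rng.gauss(0, 1) for x in xs]

# The same seed gives every candidate the same folds, so scores are comparable.
candidates = {"mean_only": fit_mean, "line": fit_line}
scores = {name: cv_mse(fit, xs, ys, k=10) for name, fit in candidates.items()}
best = min(scores, key=scores.get)  # lowest cross-validated test MSE
print(best, scores)
```

Each observation serves as test data exactly once, and the model with the lowest average test error is preferred; the same loop extends to any number of candidate models.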

From <https://2ksb.onlinebusiness.american.edu/mod/quiz/attempt.php?attempt=5561&page=5#q0>

