Q1
Question 1
Analytics Problem/Question (20 pts.): The manager of a prestigious
hotel in Peru has hired you to help them predict the number of
registered guests each day at various hotel locations. Their
marketing specialists suggest that the best predictors for guest
bookings are: week day vs. weekend, month of the year
(categorical), average daily marketing expenditures 30 days prior to
the booking date, breakfast included (yes/no), special event at the
hotel that day (yes/no) and location (4 locations total—Lima, Cuzco,
Arequipa and Trujillo).
Analysis Goal: Interpretation
Data: The number of total daily bookings for each hotel is relatively
large (100 to 300), so it is relatively normally distributed.
Exhibits: None
Please answer the following:
1.1. (5 pts.) Since you plan to use Location as a categorical
predictor, it would help to re-level the Location variable to
set LocationLima as the reference level. This makes sense because all
travelers to Peru arrive in Lima and then travel to other locations, so
Lima is a good reference level. Assume that you fitted a linear
model and the regression coefficient for LocationCuzco is −35.4
(negative) and significant. Please interpret this coefficient.
1.2. (5 pts.) Now, assume that you fit a log-linear model and the
regression coefficient for LocationCuzco is now −0.152. Please
interpret this coefficient.
1.3. (10 pts.) Suppose that you are trying to predict hotel daily
revenues (not bookings) using various predictors, including room
price. After doing some analysis, you decide to try a few log
transformations. Please provide an interpretation for each of the β
coefficients below:
a) Linear model: βPRICE = −240 (notice the negative sign)
b) Log-Linear model: βPRICE = −0.05
c) Log-Log model: βLOG(PRICE) = −1.5
1.1 If the coefficient for LocationCuzco is -35.4 and significant, this tells us that, holding the other predictors
constant, the Cuzco location has on average 35.4 fewer registered guests per day than the reference location, Lima.
Because the coefficient is significant, this difference between Cuzco and Lima is unlikely to be due to chance.
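1.2 If I were fitting a log-linear model, the coefficient would be read as a proportional difference, not a
probability: holding the other predictors constant, the Cuzco location has roughly 15.2% fewer daily guests than
Lima (more precisely, exp(-0.152) - 1, or about 14.1% fewer).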
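For reference, the usual reading of a dummy coefficient in a log-linear model can be worked out as follows. This is a sketch that assumes the fitted model has the form ln(bookings) = β0 + βCuzco·LocationCuzco + (other terms), with Lima as the reference level; the notation is an assumption and is not taken from the exam.

```latex
\begin{aligned}
\frac{\widehat{\text{bookings}}_{\text{Cuzco}}}{\widehat{\text{bookings}}_{\text{Lima}}}
  &= e^{\beta_{\text{Cuzco}}} = e^{-0.152} \approx 0.859 \\
\text{percent difference}
  &= 100\,\bigl(e^{\beta_{\text{Cuzco}}} - 1\bigr) \approx -14.1\% \\
\text{quick approximation}
  &\approx 100\,\beta_{\text{Cuzco}} = -15.2\%
\end{aligned}
```

The quick approximation is close only when the coefficient is small in absolute value; the two readings drift apart as the coefficient grows.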
1.3
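The three coefficients in 1.3 differ mainly in the units of their effect. The sketch below shows one common way to turn each into a plain-language number. The function names are hypothetical, and the currency of price and revenue is not stated in the question, so "unit" is left generic.

```python
import math

def linear_effect(beta):
    """Linear model: change in revenue (in revenue units) per one-unit price increase."""
    return beta

def log_linear_pct_effect(beta):
    """Log-linear model: exact percent change in revenue per one-unit price increase."""
    return (math.exp(beta) - 1.0) * 100.0

def log_log_pct_effect(beta, pct_price_change=1.0):
    """Log-log model: exact percent change in revenue for a given percent change in price."""
    return ((1.0 + pct_price_change / 100.0) ** beta - 1.0) * 100.0

effects = {
    "linear": linear_effect(-240),            # revenue units per unit of price
    "log_linear": log_linear_pct_effect(-0.05),  # percent per unit of price
    "log_log": log_log_pct_effect(-1.5),         # percent per 1% of price (elasticity)
}

for model, value in effects.items():
    print(f"{model}: {value:.3f}")
```

Read this way, a) says revenue falls by 240 units for each one-unit price increase, b) says revenue falls by about 5% per one-unit price increase, and c) says revenue falls by about 1.5% for each 1% price increase, all holding the other predictors constant.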
Q2
Question 2
Analytics Problem/Question (15 pts.): Your client is a government agency and is
interested in predicting average salaries for all counties in the US, using
2.1. Which OLS assumption appears to be violated from visual inspection of the residual
plot above? Briefly explain why. Also, briefly comment why this is a problem.
2.2. How would you test this?
2.3. If the OLS assumptions you tested above don't hold (e.g., the Breusch-Pagan test is
significant), which modeling approach would you use instead of OLS? Briefly explain why.
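As an illustration of the Breusch-Pagan test mentioned in 2.3, here is a minimal pure-Python sketch for the single-predictor case. It uses the studentized (Koenker) form, LM = n·R² from regressing the squared residuals on the predictor, compared against a chi-square distribution with 1 degree of freedom. The function names and the data are made up for illustration; in practice a statistics package would handle multiple predictors.

```python
import math

def simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b

def r_squared(x, y):
    """R-squared of the simple regression of y on x."""
    a, b = simple_ols(x, y)
    my = sum(y) / len(y)
    ss_res = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

def breusch_pagan_1d(x, y):
    """Studentized Breusch-Pagan test with one predictor: returns (LM, p-value)."""
    a, b = simple_ols(x, y)
    sq_resid = [(yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y)]
    lm = len(x) * r_squared(x, sq_resid)
    # Chi-square survival function with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(lm / 2.0))
    return lm, p_value

# Made-up data whose error spread grows with x (heteroskedastic by construction).
x = list(range(1, 41))
y = [2 + 3 * xi + 0.5 * xi * (1 if xi % 2 == 0 else -1) for xi in x]

lm, p_value = breusch_pagan_1d(x, y)
print(f"LM = {lm:.2f}, p-value = {p_value:.2g}")
```

A small p-value rejects the null hypothesis of constant error variance, which is the situation 2.3 asks about.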
Q3
Question 3
Analytics Problem/Question (15 pts.): Suddenly you realize that in the previous
analysis you forgot to include the date (Year/Month) in the model. This is an
important omission because salaries tend to go up over the years.
Analysis Goal: Prediction
Data: Same as in Q2, but with an additional predictor named t which is a sequential
number indicating the period (i.e., year/month). After incorporating this t variable
into the model and sorting the data by t, the residuals when plotted against t are
shown in the plot.
Exhibits: See the residual plot
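One common way to quantify a pattern in residuals sorted by t is the Durbin-Watson statistic. The sketch below assumes that serial correlation is the concern here, which the surviving question text does not state. The residual values are made up for illustration.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic: sum of squared successive differences over sum of squares."""
    num = sum((residuals[i] - residuals[i - 1]) ** 2 for i in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den

# Made-up residuals, already sorted by t.
trending = [-3, -2, -1, 0, 1, 2, 3]   # neighbors are similar: positive serial correlation
alternating = [1, -1, 1, -1]          # neighbors flip sign: negative serial correlation

dw_trending = durbin_watson(trending)
dw_alternating = durbin_watson(alternating)
print(dw_trending, dw_alternating)
```

The statistic ranges from 0 to 4. Values near 2 suggest no first-order serial correlation, values well below 2 suggest positive correlation, and values well above 2 suggest negative correlation.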
Q4
Question 4
Analytics Problem/Question (20 pts.): Your goal is to develop a model to predict the
median value (i.e., medv) of houses in the Boston metropolitan area.
Analysis Goal: Prediction
Data: The data is from a pizza nutrition set. It contains data for 10 different brands
of pizza, along with their moisture, protein, fat, ash, sodium, carbohydrates and
calories per serving.
Exhibits: ggpairs plot; OLS regression output for a model predicting calories in
pizzas on selected predictors; the condition index (CI) for the model and the
variance inflation factor (VIF) for each of the variables
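To make the VIF exhibit concrete, here is a minimal sketch for the special case of two predictors, where VIF = 1 / (1 − r²) and r is the correlation between them. With more predictors, r² is replaced by the R² from regressing one predictor on all the others. The variable names and values are made up and are not taken from the pizza data.

```python
def correlation(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / (sxx * syy) ** 0.5

def vif_two_predictors(x1, x2):
    """VIF shared by both predictors in a two-predictor model."""
    r = correlation(x1, x2)
    return 1.0 / (1.0 - r ** 2)

# Made-up predictors: x2 is nearly a multiple of x1, x3 is uncorrelated with x1.
x1 = [1, 2, 3, 4, 5]
x2 = [2.1, 3.9, 6.2, 7.8, 10.1]
x3 = [1, -1, 0, -1, 1]

vif_high = vif_two_predictors(x1, x2)
vif_low = vif_two_predictors(x1, x3)
print(f"VIF(x1, x2) = {vif_high:.1f}, VIF(x1, x3) = {vif_low:.1f}")
```

A VIF of 1 means the predictor shares no linear information with the other one. A frequently used rule of thumb treats values above 10 as a sign of severe multicollinearity, though the cutoff varies by textbook.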
Q5
Question 5
(Interaction) Analytics Problem/Question (15 pts.): Following up on the previous
question, the pupil-to-teacher ratio (ptratio) variable is a significant and negative predictor of
median house value (medv) in thousands of dollars. This supports the hypothesis that
overcrowded schools have a negative impact on housing prices. But upon exploration of
the plot below, you also noticed that there seems to be a sharp change in the pattern
of median housing values around ptratio=18.
Analysis Goal: Interpretation
Data: Same as Q5
Exhibits: See plot
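A change in pattern around ptratio=18 is often modeled with a dummy variable for the high-ratio group and its interaction with ptratio. With only those terms in the model, the point estimates equal those from fitting a separate line on each side of the cutoff, which is what the sketch below does. The cutoff rule (ptratio >= 18), the function names, and the data are assumptions made for illustration.

```python
def simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b

def split_fit(x, y, threshold):
    """Fit one line below the threshold and one at or above it, then report
    the differences as the dummy and interaction coefficients."""
    low = [(xi, yi) for xi, yi in zip(x, y) if xi < threshold]
    high = [(xi, yi) for xi, yi in zip(x, y) if xi >= threshold]
    a_low, b_low = simple_ols([p[0] for p in low], [p[1] for p in low])
    a_high, b_high = simple_ols([p[0] for p in high], [p[1] for p in high])
    return {
        "intercept_low": a_low,
        "slope_low": b_low,
        "dummy": a_high - a_low,        # shift in intercept for the high group
        "interaction": b_high - b_low,  # extra slope for the high group
    }

# Made-up data: a gentle decline below 18 and a steeper decline from 18 up.
ptratio = [14, 15, 16, 17, 18, 19, 20, 21]
medv = [30.0, 29.5, 29.0, 28.5, 24.0, 22.0, 20.0, 18.0]

fit = split_fit(ptratio, medv, threshold=18)
print(fit)
```

In this made-up example the interaction term is negative, meaning each additional pupil per teacher is associated with a larger drop in medv once ptratio reaches 18 than below it.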
Q6
Question 6
(Variable Selection and Cross Validation) Analytics Problem/Question (15
pts.): Following up on Q5, you suggested 3 types of modeling approaches to
address the issue of dimensionality. But you may also be able to address the
issue of dimensionality through variable selection.
Analysis Goal: Interpretation
Data: Same as Q5
Exhibits: Same as Q5
Please answer the following:
6.1. Briefly but specifically discuss how you would use collinearity
diagnostics to select an appropriate subset of variables to include in the
model.
6.2. Suppose that your analysis above yielded two possible models: a larger
model named fit.large and a smaller model named fit.reduced. Assume also
that the collinearity diagnostics of these two models are OK. Discuss
concisely but specifically how you would go about selecting the best model
between fit.reduced and fit.large using the stepwise selection method. Do
not discuss any R coding. Just explain in plain English how you would do
this. In your answer, please be specific about the process and fit statistics
you would use to add and remove variables, and how you would compare
From <https://2ksb.onlinebusiness.american.edu/mod/quiz/attempt.php?attempt=5561&page=5#q0>