
Predictive Analytics Practice Problem

1. The “Carseats” dataset available in the “ISLR” library records the sales of child car
seats at 400 different stores. The dataset provides information on the following
variables:

Sales         Unit sales (in thousands) at each location
CompPrice     Price charged by competitor at each location
Income        Community income level (in thousands of dollars)
Advertising   Local advertising budget for company at each location (in thousands of
              dollars)
Population    Population size in region (in thousands)
Price         Price company charges for car seats at each site
ShelveLoc     A factor with levels “Bad”, “Medium” and “Good” indicating the quality
              of the shelving location
Age           Average age of the local population
Education     Education level at each location
Urban         A factor with levels “No” and “Yes” to indicate whether the store is in an
              urban or rural location
US            A factor with levels “No” and “Yes” to indicate whether the store is in
              the US or not

The objective is to build a model for “Sales” based on ten predictors. Answer the
following questions:
a. Build a simple linear regression model for “Sales” using the predictor “Price”.
Interpret the fitted model.
b. Suppose you are told that the price of a car seat at a particular store is $120. Based
on the fitted model in (a), what are the expected sales in that store? Provide a 95%
interval estimate that takes into account the uncertainty associated with the sales of
that particular store.
c. A particular store has set the price of a car seat at $110. The store has set a target of
selling 8 thousand units. What is the probability that the sales will be more than 8
thousand units based on the fitted model in (a)?
d. Build a simple linear regression model for “Sales” using the predictor “ShelveLoc”.
Note that “ShelveLoc” is a factor with levels “Bad”, “Medium” and “Good”. What are
the expected sales for the stores using “Bad”, “Medium” and “Good” shelving
location?
e. Is there any significant difference in Sales among different shelving locations? If so,
do all three shelving locations result in significantly different sales from each other?
Clearly provide all possible comparisons along with the p-values.
f. Build a multiple linear regression model for “Sales” using the predictors “Price” and
“ShelveLoc”. What are the expected sales for stores with a “Good” shelving location
and a price of $120?
g. Fit a multiple linear regression model for “Sales” using the predictors “CompPrice”,
“Income”, “Advertising”, “Price” and “ShelveLoc”. Interpret the results.
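
For part (c), one possible line of attack is a plug-in normal approximation: treat the fitted intercept, slope and residual standard error from (a) as if they were known, and read the exceedance probability off the normal distribution of the errors. The sketch below assumes that approach; the function name `prob_sales_exceed` is hypothetical, and the numbers passed to it are placeholders, not estimates from the Carseats data. It ignores the estimation uncertainty in the coefficients.

```python
from statistics import NormalDist


def prob_sales_exceed(intercept, slope, resid_se, price, target):
    """P(Sales > target | Price = price) under a normal-error linear model.

    Plug-in approximation: the coefficients and the residual standard
    error are treated as known, so their sampling error is ignored.
    """
    mean_sales = intercept + slope * price
    return 1.0 - NormalDist(mu=mean_sales, sigma=resid_se).cdf(target)


# Hypothetical placeholder values, NOT the fitted Carseats estimates.
b0, b1, s = 12.0, -0.04, 2.0
p = prob_sales_exceed(b0, b1, s, price=110, target=8)
print(f"P(Sales > 8 | Price = 110) = {p:.3f}")
```

Replacing the placeholders with the intercept, slope and residual standard error reported by the fitted model in (a) gives the approximate probability asked for.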
2. Consider again the “Carseats” data set discussed above. Based on the variable
“Sales”, a new dummy variable “High” is created. The variable takes a value “Yes” if
the sales exceed 8 thousand units and “No” otherwise. Thus the variable “High”
indicates “High Sales”. The objective is to predict “High Sales” based on ten
predictors. The data set is divided into two parts: a training set consisting of 200
observations and a test set consisting of the remaining 200 observations.

The following results are available for two logistic regression models fitted for predicting
“High Sales” based on the training data set.

Table 1: Logistic Regression Model (Model 1) for predicting “High Sales” using
“ShelveLoc” based on the training data set

                  Estimate   Standard Error   z value   p-value
Intercept           −2.152            0.473    −4.554     0.000
ShelveLocGood        3.222            0.579     5.566     0.000
ShelveLocMedium      1.666            0.513     3.245     0.001

Note that “ShelveLocGood” is a dummy variable which takes the value 1 if
“ShelveLoc” is “Good” and 0 otherwise. Similarly, “ShelveLocMedium” is a dummy
variable which takes the value 1 if “ShelveLoc” is “Medium” and 0 otherwise.
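
As an illustration of how the dummy coding feeds into a predicted probability, the sketch below applies the inverse logit to the Table 1 estimates. “Bad” has no dummy of its own, so it acts as the baseline level and contributes 0 to the linear predictor. The names `logistic`, `INTERCEPT` and `DUMMY_COEF` are illustrative choices, not part of the problem.

```python
import math


def logistic(eta):
    """Inverse logit: map a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))


# Estimates copied from Table 1; "Bad" is the baseline level (no dummy).
INTERCEPT = -2.152
DUMMY_COEF = {"Bad": 0.0, "Medium": 1.666, "Good": 3.222}

probs = {level: logistic(INTERCEPT + coef) for level, coef in DUMMY_COEF.items()}
for level, prob in probs.items():
    print(f"{level}: predicted P(High = Yes) = {prob:.3f}")
```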

Table 2: Logistic Regression Model (Model 2) for predicting “High Sales” using
“Price” and “ShelveLoc” based on the training data set

                  Estimate   Standard Error   z value   p-value
Intercept            3.585            1.097     3.267     0.001
Price               −0.055            0.010    −5.403     0.000
ShelveLocGood        4.509            0.747     6.040     0.000
ShelveLocMedium      2.174            0.583     3.728     0.000


Furthermore, a full logistic regression model is fitted using all the predictors based
on the training data set. The confusion matrix obtained on the test data set is
shown below.

Table 3: Confusion Matrix Based on Test Data

                                True “High” Status
                                No       Yes
Predicted “High” Status   No    107      9
                          Yes   9        75
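
The usual rates derived from a matrix like Table 3 can be sketched as follows, assuming “Yes” (high sales) is taken as the positive class. The variable names are illustrative only.

```python
# Counts from Table 3, with "Yes" (high sales) as the positive class.
tn, fn = 107, 9   # predicted "No":  true "No", true "Yes"
fp, tp = 9, 75    # predicted "Yes": true "No", true "Yes"

total = tn + fn + fp + tp

sensitivity = tp / (tp + fn)      # share of true "Yes" stores predicted "Yes"
specificity = tn / (tn + fp)      # share of true "No" stores predicted "No"
error_rate = (fp + fn) / total    # share of all test stores misclassified

print(f"sensitivity = {sensitivity:.3f}")
print(f"specificity = {specificity:.3f}")
print(f"total error rate = {error_rate:.3f}")
```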

a. Based on the fitted Model 1 (Table 1), find the predicted probabilities of “High
Sales” for different levels of “ShelveLoc”.
b. Write the equation of the fitted Model 2 (Table 2). Interpret the coefficients.
c. Suppose the owner of a particular store wants the predicted probability of “High
Sales” to be at least 0.8. It is also given that the store maintains a “Good” shelving
location for the car seats. Based on the fitted Model 2, what should be the price of
the car seats to give at least an 80% chance of “High Sales”?
d. Based on the confusion matrix in Table 3, compute the sensitivity, specificity
and total error rate for the full logistic regression model. Suggest a possible method
for improving its performance (there is no need to implement it).
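
For part (c), one way to proceed is to invert the fitted Model 2 equation: set the log-odds equal to the logit of the target probability and solve for the price. The sketch below assumes that approach and uses the Table 2 estimates; the helper name `max_price_for` is hypothetical. Because the price coefficient is negative, the value returned is an upper bound on price: any lower price gives a higher predicted probability.

```python
import math

# Estimates copied from Table 2.
INTERCEPT = 3.585
PRICE_COEF = -0.055
GOOD_COEF = 4.509


def max_price_for(prob, shelf_coef):
    """Price at which the predicted P(High = Yes) equals `prob`.

    Solves INTERCEPT + PRICE_COEF * price + shelf_coef = logit(prob).
    """
    target_logit = math.log(prob / (1.0 - prob))
    return (target_logit - INTERCEPT - shelf_coef) / PRICE_COEF


price_cap = max_price_for(0.8, GOOD_COEF)
print(f"Price must not exceed about {price_cap:.2f}")
```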
