
IE 451 Fall 2023-2024 Homework 7 Solutions


AUTHOR
Deniz Sahin

1 Question 8
In the lab, a classification tree was applied to the Carseats data set after converting Sales
into a qualitative response variable. Now we will seek to predict Sales using regression trees
and related approaches, treating the response as a quantitative variable.

library(ISLR)          # Carseats data (also available in ISLR2)
library(tree)          # regression trees: tree(), cv.tree(), prune.tree()
library(randomForest)  # bagging and random forests
library(boot)          # bootstrap confidence intervals: boot(), boot.ci()
library(caret)         # createFolds()
library(dplyr)         # select(), mutate(), %>%

data(Carseats)

(a) Split the data set into a training set and a test set.

set.seed(11)
train <- sample(1:nrow(Carseats), 0.75 * nrow(Carseats))
trainset <- Carseats[train, ]
testset <- Carseats[-train, ]

(b) Fit a regression tree to the training set. Plot the tree, and interpret the results. What test
MSE do you obtain?

tree.carseats <- tree(Sales ~ ., subset = train, data = Carseats)


summary(tree.carseats)

Regression tree:
tree(formula = Sales ~ ., data = Carseats, subset = train)
Variables actually used in tree construction:
[1] "ShelveLoc" "Price" "Age" "Income" "CompPrice"
[6] "Advertising"
Number of terminal nodes: 17
Residual mean deviance: 2.798 = 791.7 / 283
Distribution of residuals:
Min. 1st Qu. Median Mean 3rd Qu. Max.
-6.03600 -1.05700 0.01549 0.00000 1.09400 4.10900

plot(tree.carseats)
text(tree.carseats, pretty = 0)


• The tree has 17 terminal nodes and uses 6 of the variables: ShelveLoc, Price, Age, Income, CompPrice, and Advertising.

Let’s calculate the test MSE.

yhat <- predict(tree.carseats, newdata = testset)


MSE <- mean((yhat - testset$Sales) ^ 2)

So the test MSE is 3.8999454.

(c) Use cross-validation in order to determine the optimal level of tree complexity. Does
pruning the tree improve the test MSE?

set.seed(12)
cv.carseats <- cv.tree(tree.carseats)
plot(cv.carseats$size, cv.carseats$dev, type = "b")


Let’s prune the tree down to 11 terminal nodes; the plot shows that this size yields a low cross-validated deviance.
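Instead of reading the best size off the plot, it can also be pulled out programmatically (note that which.min() picks the global minimum of the CV deviance, which need not equal the 11 chosen from the plot):

best.size <- cv.carseats$size[which.min(cv.carseats$dev)]
best.size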

prune.carseats <- prune.tree(tree.carseats, best = 11)

plot(prune.carseats)
text(prune.carseats, pretty = 0)

yhatpruned <- predict(prune.carseats, newdata = testset)


MSEpruned <- mean((yhatpruned - testset$Sales) ^ 2)

The new test MSE is 3.5778147, which is lower than that of the unpruned tree.

(e) Use random forests to analyze this data. What test MSE do you obtain? Use the
importance() function to determine which variables are most important. Describe the effect
of m, the number of variables considered at each split, on the error rate obtained.

set.seed(13)
rf.carseats <- randomForest(
  Sales ~ .,
  data = Carseats,
  subset = train,
  mtry = 3,
  importance = TRUE
)
yhat.rf <- predict(rf.carseats, newdata = testset)
MSE.rf <- mean((yhat.rf - testset$Sales) ^ 2)

I set m = 3: the usual heuristics are m ≈ √p for classification and m ≈ p/3 for regression (the randomForest default), and with p = 10 both give 3 after rounding down. The resulting test MSE is 2.2509814, lower than both the unpruned and the pruned tree.
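A quick arithmetic check of both heuristics:

p <- ncol(Carseats) - 1  # 10 predictors once Sales is excluded
sqrt(p)                  # about 3.16
p / 3                    # about 3.33, so mtry = 3 either way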

importance(rf.carseats)

%IncMSE IncNodePurity
CompPrice 18.58222160 236.02145
Income 5.22858463 201.90373
Advertising 14.63888170 197.80659
Population -0.02144444 158.14864
Price 41.30020588 549.51743
ShelveLoc 50.23123406 572.88155
Age 17.20434503 278.38360
Education -0.05341501 105.90887
Urban -0.02927599 20.99830
US 3.19757698 31.72132

varImpPlot(rf.carseats)


• Price and ShelveLoc are the two most important variables. To see the effect of m, let’s fit two more forests with m = 2 and m = 8.

set.seed(13)
rf2.carseats <- randomForest(
  Sales ~ .,
  data = Carseats,
  subset = train,
  mtry = 2,
  importance = TRUE
)
yhat.rf2 <- predict(rf2.carseats, newdata = testset)
MSE.rf2 <- mean((yhat.rf2 - testset$Sales) ^ 2)

set.seed(13)
rf3.carseats <- randomForest(
  Sales ~ .,
  data = Carseats,
  subset = train,
  mtry = 8,
  importance = TRUE
)
yhat.rf3 <- predict(rf3.carseats, newdata = testset)
MSE.rf3 <- mean((yhat.rf3 - testset$Sales) ^ 2)

The test MSE is 2.5596532 for m = 2 and 1.9035168 for m = 8, so on this particular split the test error decreases as m grows toward p, i.e., toward bagging.
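To trace the whole error curve rather than three points, a minimal sketch under the same split (the exact values will vary with the seed):

set.seed(13)
mse.by.m <- numeric(10)
for (m in 1:10) {
  rf.m <- randomForest(Sales ~ ., data = Carseats, subset = train, mtry = m)
  yhat.m <- predict(rf.m, newdata = testset)
  mse.by.m[m] <- mean((yhat.m - testset$Sales) ^ 2)
}
plot(1:10, mse.by.m, type = "b", xlab = "m (mtry)", ylab = "Test MSE")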

varImpPlot(rf2.carseats)


varImpPlot(rf3.carseats)

• The four most important variables are the same in all three models: ShelveLoc, Price, CompPrice, and Age.

2 HW5-Revisit
usedcars <- read.csv("used-cars.csv")
usedcars <- select(usedcars, Price, Age_08_04, KM, Fuel_Type, HP, Met_Color,
                   Automatic, CC, Doors, Quarterly_Tax, Weight)
colnames(usedcars) <- c("Price", "Age", "Kilometers", "Fuel_Type", "HP", "Metallic",
                        "Automatic", "CC", "Doors", "QuartTax", "Weight")
usedcars <- usedcars %>% mutate(
Metallic = factor(Metallic),
Automatic = factor(Automatic),
Fuel_Type = factor(Fuel_Type)
)

(a) Fit the best decision tree and random forest models.

set.seed(13)
train <- sample(1:nrow(usedcars), 0.75 * nrow(usedcars))
trainset <- usedcars[train, ]


testset <- usedcars[-train, ]

tree.usedcars <- tree(Price ~ ., subset = train, data = usedcars)


summary(tree.usedcars)

Regression tree:
tree(formula = Price ~ ., data = usedcars, subset = train)
Variables actually used in tree construction:
[1] "Age" "Weight" "Kilometers"
Number of terminal nodes: 8
Residual mean deviance: 1815000 = 1.94e+09 / 1069
Distribution of residuals:
Min. 1st Qu. Median Mean 3rd Qu. Max.
-5844.0 -802.4 -32.2 0.0 748.0 7948.0

plot(tree.usedcars)
text(tree.usedcars, pretty = 0)

set.seed(14)
cv.usedcars <- cv.tree(tree.usedcars)
plot(cv.usedcars$size, cv.usedcars$dev, type = "b")


• The cross-validated deviance does not suggest much gain from pruning, so we keep this tree and carry on with the random forest.

set.seed(15)
rfmse <- numeric(10)
for (i in 1:10) {
  rf.usedcars <- randomForest(
    Price ~ .,
    data = usedcars,
    subset = train,
    mtry = i,
    importance = TRUE
  )
  yhat <- predict(rf.usedcars, newdata = testset)
  rfmse[i] <- mean((yhat - testset$Price) ^ 2)
}
which.min(rfmse)

[1] 5

So the lowest test MSE is obtained with m = 5.
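As a visual check, the stored errors can be plotted against m before refitting:

plot(1:10, rfmse, type = "b", xlab = "m (mtry)", ylab = "Test MSE")

We then refit the forest with m = 5: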

rf.usedcars <- randomForest(
  Price ~ .,
  data = usedcars,
  subset = train,
  mtry = 5,
  importance = TRUE
)


(b) Find the prediction in part 8 of that question with both models. How can you calculate
confidence intervals with decision tree and random forest models?
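The construction of the “average car” below reuses the get_mode() helper from the HW5 solutions. Its definition is not reproduced in this file, so a minimal version consistent with how it is called would be:

get_mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]  # most frequent value
}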

avgdata <- data.frame(
  Price = mean(usedcars$Price),
  Age = mean(usedcars$Age),
  Kilometers = mean(usedcars$Kilometers),
  Fuel_Type = as.factor(get_mode(usedcars$Fuel_Type)),
  HP = mean(usedcars$HP),
  Metallic = as.factor(get_mode(usedcars$Metallic)),
  Automatic = as.factor(get_mode(usedcars$Automatic)),
  CC = mean(usedcars$CC),
  Doors = get_mode(usedcars$Doors),
  QuartTax = mean(usedcars$QuartTax),
  Weight = mean(usedcars$Weight)
)

preds.dt <- predict(tree.usedcars, newdata = avgdata)

preds.rf <- predict(rf.usedcars, newdata = rbind(avgdata, testset)) ## row-bind with testset so the factor levels match the training data
preds.rf <- preds.rf[1] ## the first element is the prediction for avgdata

• The predictions from the decision tree and the random forest are 11002 and 10536, respectively. The regression model from HW5 gave 10406.85.

• To find a confidence interval for the decision-tree prediction, we can use the bootstrap: refit the tree on 1000 resampled data sets, predict for the same average car each time, and build an interval from the spread of those 1000 predictions.

set.seed(13)

treef <- function(data, indices) {
  train <- data[indices, ]
  tr <- tree(Price ~ ., data = train)
  pred <- predict(tr, newdata = avgdata)
  return(pred)
}

res_boot <- boot(usedcars, treef, R = 1000)

conf_interval <- boot.ci(res_boot, type = "basic")
print(conf_interval)

BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates

CALL :
boot.ci(boot.out = res_boot, type = "basic")

Intervals :
Level      Basic
95%   ( 9665, 12159 )
Calculations and Intervals on Original Scale

• A similar bootstrap could be applied to the random forest. Alternatively, since a forest already consists of many trees, we can extract each tree’s individual prediction for the average car and form an interval from the quantiles of those predictions.

rf <- randomForest(Price ~ ., mtry = 5, data = trainset, ntree = 1000)
ind <- predict(rf, newdata = rbind(avgdata, testset), predict.all = TRUE)$individual
dim(ind)

[1] 360 1000

• Each row of ind corresponds to one observation in the row-bound data and each column to one tree. Taking the first row gives the per-tree predictions for our average car, from which the 5% and 95% quantiles yield a 90% interval:

avgres <- ind[1, ]

quantile(avgres, probs = c(0.05, 0.95))

5% 95%
8619.00 12133.33

(c) Calculate 10-fold-cross-validated RMSE of multiple regression, decision tree, and random
forest models. Which model would you recommend?

set.seed(14)
folds <- createFolds(usedcars$Price, k = 10, list = TRUE)

rmsevaluesr <- numeric(10)
rmsevaluesdt <- numeric(10)
rmsevaluesrf <- numeric(10)

set.seed(15)
for (i in 1:10) {
  train_data <- usedcars[-folds[[i]], ]
  test_data <- usedcars[folds[[i]], ]

  modelr <-
    lm(
      I(Price ^ 0.23) ~ Age + I(Kilometers ^ 0.56) + Fuel_Type + I(HP ^ 0.5) +
        I(CC ^ 0.5) + I(QuartTax ^ 0.62) + I(Weight ^ -3.79),
      data = train_data
    )
  modeldt <- tree(Price ~ ., data = train_data)
  modelrf <- randomForest(Price ~ ., mtry = 5, data = train_data)

  yhatr <- predict(modelr, newdata = test_data) ^ (100 / 23) ## undo the Price^0.23 transform
  yhatdt <- predict(modeldt, newdata = test_data)
  yhatrf <- predict(modelrf, newdata = test_data)

  rmsevaluesr[i] <- sqrt(mean((yhatr - test_data$Price) ^ 2))
  rmsevaluesdt[i] <- sqrt(mean((yhatdt - test_data$Price) ^ 2))
  rmsevaluesrf[i] <- sqrt(mean((yhatrf - test_data$Price) ^ 2))
}

avgrmser <- mean(rmsevaluesr)
avgrmsedt <- mean(rmsevaluesdt)
avgrmserf <- mean(rmsevaluesrf)

c(avgrmser, avgrmsedt, avgrmserf)

[1] 1271.195 1494.340 1055.236

The random forest has the lowest average cross-validated RMSE, so it is the model I would recommend.
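To judge whether the random-forest advantage exceeds fold-to-fold noise, the spread of the per-fold RMSEs can also be inspected (a small sketch using the vectors computed above):

c(sd(rmsevaluesr), sd(rmsevaluesdt), sd(rmsevaluesrf))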

