
Untitled

Matias Donoso

11/4/2019
Problem #8: In the lab, a classification tree was applied to the Carseats data set after
converting Sales into a qualitative response variable. Now we will seek to predict Sales
using regression trees and related approaches, treating the response as a quantitative
variable.
library(ISLR)
library(tree)
library(randomForest)

## randomForest 4.6-14

## Type rfNews() to see new features/changes/bug fixes.

set.seed(1)
train = sample(1:nrow(Carseats), nrow(Carseats) / 2)
Carseats.train = Carseats[train, ]
Carseats.test = Carseats[-train, ]

(d) Use the bagging approach in order to analyze this data. What test MSE do you obtain?
Use the importance() function to determine which variables are most important.
# bagging = a random forest with mtry equal to all 10 predictors
bag.carseats = randomForest(Sales ~ ., data = Carseats.train, mtry = 10,
                            ntree = 500, importance = TRUE)
yhat.bag = predict(bag.carseats, newdata = Carseats.test)
mean((yhat.bag - Carseats.test$Sales)^2)

## [1] 2.623527

importance(bag.carseats)

##                %IncMSE IncNodePurity
## CompPrice   24.6476262     168.57576
## Income       5.3355144      90.64911
## Advertising 11.7975748     100.81807
## Population  -2.4347577      58.94510
## Price       55.1387362     500.32765
## ShelveLoc   46.3295341     376.78200
## Age         17.9245890     162.68705
## Education    0.8706811      43.04402
## Urban        0.6850649       8.77639
## US           4.3005762      17.96535
The test MSE obtained in this case is 2.62, so bagging reduces the test MSE relative to a single regression tree. The importance() function shows that Price and ShelveLoc are the most important variables (followed by CompPrice, Age, and Advertising).
(e) Use random forests to analyze this data. What test MSE do you obtain? Use the
importance() function to determine which variables are most important. Describe the
effect of m, the number of variables considered at each split, on the error rate
obtained.
# random forest with m = 3 variables considered at each split
rf.carseats = randomForest(Sales ~ ., data = Carseats.train, mtry = 3,
                           ntree = 500, importance = TRUE)
yhat.rf = predict(rf.carseats, newdata = Carseats.test)
mean((yhat.rf - Carseats.test$Sales)^2)

## [1] 3.001375

importance(rf.carseats)

##                %IncMSE IncNodePurity
## CompPrice   15.0614295     155.38762
## Income       2.8372504     125.35841
## Advertising  8.5912531     108.52715
## Population  -2.2524534     101.40095
## Price       37.9562323     398.55509
## ShelveLoc   36.9148773     289.37326
## Age         11.0749075     173.58313
## Education    0.9820296      70.20401
## Urban        0.7161522      15.45718
## US           6.1094256      33.75805

The test MSE is 3.00, which is higher than the one obtained with bagging (2.62). Running the importance() function, we see that the same two variables are still the most important ones (Price and ShelveLoc). As for the effect of m: lowering m from 10 (bagging) to 3 increases the test MSE here, so decorrelating the trees does not pay off for this data set.
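
One way to examine the effect of m across its full range is to refit the forest for each value of mtry; a minimal sketch (exact numbers will vary with the seed):

# test MSE as a function of m, the number of variables tried at each split
m.mse = rep(NA, 10)
for (m in 1:10) {
  rf.fit = randomForest(Sales ~ ., data = Carseats.train, mtry = m, ntree = 500)
  m.mse[m] = mean((predict(rf.fit, newdata = Carseats.test) -
                   Carseats.test$Sales)^2)
}
plot(1:10, m.mse, type = "b", xlab = "m (mtry)", ylab = "Test MSE")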
Problem #10: We now use boosting to predict Salary in the Hitters data set.
(a) Remove the observations for whom the salary information is unknown, and then log-
transform the salaries.
Hitters = na.omit(Hitters)
Hitters$Salary = log(Hitters$Salary)

(b) Create a training set consisting of the first 200 observations, and a test set consisting
of the remaining observations.
train = 1:200
Hitters.train = Hitters[train, ]
Hitters.test = Hitters[-train, ]

(c) Perform boosting on the training set with 1,000 trees for a range of values of the
shrinkage parameter λ. Produce a plot with different shrinkage values on the x-axis
and the corresponding training set MSE on the y-axis.
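A minimal sketch of one way to do this, assuming the gbm package is installed (the shrinkage grid below is an arbitrary choice, and exact values will vary with the seed):

library(gbm)

set.seed(1)
lambdas = 10^seq(-4, 0, by = 0.1)   # grid of shrinkage values (arbitrary choice)
train.mse = rep(NA, length(lambdas))
for (i in seq_along(lambdas)) {
  boost.hitters = gbm(Salary ~ ., data = Hitters.train, distribution = "gaussian",
                      n.trees = 1000, shrinkage = lambdas[i])
  pred.train = predict(boost.hitters, Hitters.train, n.trees = 1000)
  train.mse[i] = mean((pred.train - Hitters.train$Salary)^2)
}
plot(lambdas, train.mse, type = "b", xlab = "Shrinkage (lambda)",
     ylab = "Training MSE")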
(d) Produce a plot with different shrinkage values on the x-axis and the corresponding
test set MSE on the y-axis.
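The same loop can be reused for the test set; the sketch below refits the models over the lambdas grid from part (c) and records the test MSE instead:

set.seed(1)
test.mse = rep(NA, length(lambdas))
for (i in seq_along(lambdas)) {
  boost.hitters = gbm(Salary ~ ., data = Hitters.train, distribution = "gaussian",
                      n.trees = 1000, shrinkage = lambdas[i])
  pred.test = predict(boost.hitters, Hitters.test, n.trees = 1000)
  test.mse[i] = mean((pred.test - Hitters.test$Salary)^2)
}
plot(lambdas, test.mse, type = "b", xlab = "Shrinkage (lambda)",
     ylab = "Test MSE")
min(test.mse)   # best test MSE over the grid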

(e) Compare the test MSE of boosting to the test MSE that results from applying two of the
regression approaches seen in Chapters 3 and 6.
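One possible comparison: least squares (Chapter 3) and the lasso (Chapter 6). The lasso sketch assumes the glmnet package is installed:

# least squares regression (Chapter 3)
lm.fit = lm(Salary ~ ., data = Hitters.train)
mean((predict(lm.fit, Hitters.test) - Hitters.test$Salary)^2)

# the lasso (Chapter 6), with lambda chosen by cross-validation
library(glmnet)
x.train = model.matrix(Salary ~ ., Hitters.train)[, -1]
x.test = model.matrix(Salary ~ ., Hitters.test)[, -1]
set.seed(1)
lasso.fit = cv.glmnet(x.train, Hitters.train$Salary, alpha = 1)
lasso.pred = predict(lasso.fit, s = "lambda.min", newx = x.test)
mean((lasso.pred - Hitters.test$Salary)^2)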

(f) Which variables appear to be the most important predictors in the boosted model?
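A sketch, reusing lambdas and test.mse from the loops above to refit the boosted model at the shrinkage value that minimized the test MSE in (d), then inspecting the relative influence of each predictor:

boost.best = gbm(Salary ~ ., data = Hitters.train, distribution = "gaussian",
                 n.trees = 1000, shrinkage = lambdas[which.min(test.mse)])
summary(boost.best)   # relative influence table and bar plot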

(g) Now apply bagging to the training set. What is the test set MSE for this approach?
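A sketch of bagging on this training set, again using randomForest with mtry set to the full number of predictors (19 for Hitters, once Salary is excluded):

set.seed(1)
# bagging = random forest with mtry equal to the number of predictors
bag.hitters = randomForest(Salary ~ ., data = Hitters.train,
                           mtry = ncol(Hitters) - 1, ntree = 500)
yhat.bag = predict(bag.hitters, newdata = Hitters.test)
mean((yhat.bag - Hitters.test$Salary)^2)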
