
Problem

The file ToyotaCorolla.csv contains information on used cars (Toyota Corolla) on sale during late
summer of 2004 in the Netherlands. It has 1436 records containing details on 38 attributes,
including Price, Age, Kilometers, HP, and other specifications. The goal is to predict the price of
a used Toyota Corolla based on its specifications. (The example in Section 9.7 is a subset of this
dataset).

Data
Load ToyotaCorolla.csv and install/load any required packages. Split the data into training
(60%) and validation (40%) datasets.

library(rpart)        # rpart() for regression trees
library(rpart.plot)   # prp() for plotting trees
library(forecast)     # accuracy() for error measures

car.df <- read.csv("D:/MSBA/3-Winter 2020/560/data/ToyotaCorolla.csv")

# partition
set.seed(22)
train.index <- sample(c(1:dim(car.df)[1]), dim(car.df)[1]*0.6)
train.df <- car.df[train.index, ]
valid.df <- car.df[-train.index, ]
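A quick sanity check on the partition (row counts refer to the 1436-record file described above):

```r
# the training set should hold ~60% of the 1436 rows, validation the rest
dim(train.df)
dim(valid.df)
nrow(train.df) + nrow(valid.df) == nrow(car.df)  # TRUE: the two sets partition the data
```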

Part A
Run a regression tree (RT) with outcome variable Price and predictors Age_08_04, KM,
Fuel_Type, HP, Automatic, Doors, Quarterly_Tax, Mfg_Guarantee, Guarantee_Period, Airco,
Automatic_Airco, CD_Player, Powered_Windows, Sport_Model, and Tow_Bar. Set the
minimum number of records in a terminal node to 1, the maximum number of tree levels to
100, and cp = 0.001, to make the run least restrictive.

tr <- rpart(Price ~ Age_08_04 + KM + Fuel_Type + HP + Automatic + Doors +
              Quarterly_Tax + Mfr_Guarantee + Guarantee_Period + Airco +
              Automatic_airco + CD_Player + Powered_Windows + Sport_Model +
              Tow_Bar,
            data = train.df, method = "anova",
            minbucket = 1, maxdepth = 30, cp = 0.001)  # rpart caps maxdepth at 30
prp(tr)
Part A.i
Which appear to be the three or four most important car specifications for predicting the car’s
price?

# str(tr) would list the components of the fitted tree object:
## str(tr)

# transpose the importance vector into a column vector so the variable
# labels print alongside the importance scores
t(t(tr$variable.importance))
## [,1]
## Age_08_04 8883389342
## Automatic_airco 2919505875
## KM 2401906416
## Quarterly_Tax 1607048729
## HP 1050164969
## CD_Player 343306312
## Fuel_Type 218721574
## Guarantee_Period 208534738
## Airco 122851533
## Powered_Windows 64981981
## Doors 59287239
## Mfr_Guarantee 43522929
## Automatic 5272476

Based on the variable importance scores, Age_08_04, Automatic_airco, KM, and
Quarterly_Tax appear to be the top predictors of car price.
NOTE: Your results may vary based on your random seed selection.
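The same ranking can also be shown graphically; a minimal sketch, assuming the fitted tree `tr` from Part A is still in the workspace:

```r
# horizontal barplot of rpart's importance scores, largest bar at the top
imp <- sort(tr$variable.importance)  # ascending, so the largest plots last
par(mar = c(4, 9, 2, 1))             # widen the left margin for long labels
barplot(imp, horiz = TRUE, las = 1,
        main = "Variable Importance (Full Tree)",
        xlab = "Total reduction in sum of squared errors")
```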

Part A.ii
Compare the prediction errors of the training and validation sets by examining their RMS error
and by plotting the two boxplots. What is happening with the training set predictions? How does
the predictive performance of the validation set compare to the training set? Why does this
occur?

# calculate training and validation accuracy
# (accuracy() comes from the forecast package)
accuracy(predict(tr, train.df), train.df$Price)
## ME RMSE MAE MPE MAPE
## Test set 6.534122e-14 993.2493 780.6424 -1.094599 8.044445
accuracy(predict(tr, valid.df), valid.df$Price)
## ME RMSE MAE MPE MAPE
## Test set -6.90479 1319.292 926.2175 -1.213805 8.824279
# compute the errors: prediction minus actual price
train.err <- predict(tr, train.df) - train.df$Price
valid.err <- predict(tr, valid.df) - valid.df$Price

# create a data frame from the training and validation errors for plotting
err <- data.frame(Error = c(train.err, valid.err),
                  Set = c(rep("Training", length(train.err)),
                          rep("Validation", length(valid.err))))

# create the error boxplots
boxplot(Error ~ Set, data = err, main = "Prediction Errors",
        xlab = "Set", ylab = "Error",
        col = "blueviolet", medcol = "darkgoldenrod1", boxlty = 0,
        border = "black", whisklty = 1, staplelwd = 4,
        outpch = 13, outcex = 1, outcol = "darkslateblue")
The training errors are tightly clustered with fewer extreme outliers: with minbucket = 1
and a small cp, the nearly full-grown tree fits the training records very closely, so the
training RMSE (about 993) is lower than the validation RMSE (about 1319). This gap is the
usual signature of overfitting: an unrestricted tree memorizes training-set idiosyncrasies
that do not generalize to new data. The validation errors are worse but still in a similar
range, which suggests the 60% training split is large enough to produce a stable tree.
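The RMSE values that `accuracy()` reports can be reproduced directly from these error vectors; a small sketch (the helper name `rmse` is ours, not from a package):

```r
# RMSE is the square root of the mean squared error
rmse <- function(e) sqrt(mean(e^2))
rmse(train.err)  # matches the training RMSE reported by accuracy()
rmse(valid.err)  # matches the validation RMSE reported by accuracy()
```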

A.iii
How can we achieve predictions for the training set that are not equal to the actual prices?
To get training predictions that are not equal to the actual prices, we would restrict the
tree so that terminal nodes contain more than one record, for example by raising
minbucket, raising cp, limiting maxdepth, or pruning the full tree. Each training
prediction then becomes the average price of the cars in its leaf rather than a single
car's actual price.
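One common way to see this, sketched under the assumption that `train.df` and the packages from earlier parts are loaded: refit with a larger `minbucket` and `cp` so each leaf must hold several records.

```r
# restrictive settings stop leaves from isolating single records, so each
# fitted value is a leaf average rather than one car's actual price
tr.restricted <- rpart(Price ~ Age_08_04 + KM + Fuel_Type + HP + Automatic +
                         Doors + Quarterly_Tax + Mfr_Guarantee +
                         Guarantee_Period + Airco + Automatic_airco +
                         CD_Player + Powered_Windows + Sport_Model + Tow_Bar,
                       data = train.df, method = "anova",
                       minbucket = 20, cp = 0.01)
head(predict(tr.restricted, train.df))  # leaf-mean predictions
```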

A.iv
Prune the full tree using the cross-validation error. Compared to the full tree, what is the
predictive performance for the validation set?

# prune the tree: with no control arguments, rpart uses its defaults
# (cp = 0.01, minbucket = 7), which yields a much shallower tree
tr.shallow <- rpart(Price ~ Age_08_04 + KM + Fuel_Type + HP + Automatic +
                      Doors + Quarterly_Tax + Mfr_Guarantee +
                      Guarantee_Period + Airco + Automatic_airco +
                      CD_Player + Powered_Windows + Sport_Model + Tow_Bar,
                    data = train.df)
prp(tr.shallow)

# determine the pruned tree's accuracy on the training set
accuracy(predict(tr.shallow, train.df), train.df$Price)
## ME RMSE MAE MPE MAPE
## Test set 3.422132e-13 1342.992 1006.305 -1.64915 10.05497
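The default-settings tree above only approximates pruning. To prune the Part A full tree explicitly by its cross-validation error, one sketch (assuming `tr` and `valid.df` are still in the workspace):

```r
# rpart stores the cross-validated error (xerror) for each cp in cptable;
# pick the cp with the smallest xerror and prune the full tree back to it
best.cp <- tr$cptable[which.min(tr$cptable[, "xerror"]), "CP"]
tr.pruned <- prune(tr, cp = best.cp)
prp(tr.pruned)
accuracy(predict(tr.pruned, valid.df), valid.df$Price)  # validation performance
```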
