
Assignment

On

Decision Tree Analysis

IN PARTIAL FULFILLMENT OF THE DEGREE OF

Master of Business Administration-Intelligent Data Science (MBA-IDS: 2018-2020)

UNDER GUIDANCE OF

Prof. Keerti Jain

Anjali Vats MB18GID244


Contents
Problem Statement
DataSet
Models of Decision Tree & Model Accuracy
Using rpart
Comparing different combinations of models using mean squared error values
Using the tree package
Best Model Generated
Comparing Accuracy
Comparing Means
Problem Statement

1. Build various decision tree models using different combinations of independent variables.
2. Check the accuracy of the models.
3. Find the best model among those generated.

DataSet

The Carseats dataset is a dataframe with 400 observations on the following 11 variables:

1. Sales: unit sales (in thousands) at each location
2. CompPrice: price charged by the competitor at each location
3. Income: community income level (in thousands of dollars)
4. Advertising: local advertising budget at each location (in thousands of dollars)
5. Population: regional population (in thousands)
6. Price: price charged for car seats at each site
7. ShelveLoc: quality of the shelving location (Bad, Good, or Medium)
8. Age: age level of the local population
9. Education: education level at each location
10. Urban: Yes/No
11. US: Yes/No
Models of Decision Tree & Model Accuracy

Using rpart
1) Predicting whether the person belongs to the US or not based on the variables (Income,
Advertising, Population, Price)
############ CODE 1 ##################

install.packages("rpart")

library(rpart)

getwd()

Carseats<-read.csv("C:/Users/intone/Desktop/MBA/T4/MA/After MidTerm/Carseats.csv")

attach(Carseats)

names(Carseats)

Carseats$US <- factor(Carseats$US)   # ensure a factor outcome so rpart builds a classification tree

tree_analysis <- rpart(US ~ Income + Advertising + Population + Price, data = Carseats)

tree_analysis

install.packages("rpart.plot")

library(rpart.plot)

rpart.plot(tree_analysis,extra=1)
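Once the tree is fitted, its in-sample accuracy can be checked by comparing predicted and actual labels. The helper below is only a sketch (`accuracy_from_labels` is a hypothetical name, not part of rpart); with the model above one would pass `predict(tree_analysis, Carseats, type = "class")` and `Carseats$US`.

```r
# Sketch: confusion matrix and accuracy from predicted vs. actual labels.
# accuracy_from_labels is a hypothetical helper, not an rpart function.
accuracy_from_labels <- function(pred, actual) {
  print(table(Predicted = pred, Actual = actual))
  mean(pred == actual)   # fraction of correctly classified observations
}

# With the model above (not run here):
# pred <- predict(tree_analysis, Carseats, type = "class")
# accuracy_from_labels(pred, Carseats$US)

# Toy illustration with four labels, three of them correct:
accuracy_from_labels(c("Yes", "No", "Yes", "No"), c("Yes", "No", "No", "No"))  # 0.75
```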

2) Predicting whether the person belongs to an urban area or not based on the other parameters in
the dataset, i.e. (Income, Advertising, Education, Population, Price, Age, ShelveLoc, US, Sales)
Carseats <- Carseats[,-1]   # drop the first column (assumed to be a row-number index from the CSV)

Carseats$Urban <- factor(Carseats$Urban)   # Urban is already coded Yes/No; just ensure it is a factor

print(summary(Carseats))

set.seed(1234)

ind <- sample(2, nrow(Carseats), replace=TRUE, prob=c(0.7, 0.3))

trainData <- Carseats[ind==1,]

validationData <- Carseats[ind==2,]

tree = rpart(Urban ~ ., data=trainData, method="class")

rpart.plot(tree)

# evaluate on the validation set: confusion matrix and accuracy
pred <- predict(tree, validationData, type = "class")

table(Predicted = pred, Actual = validationData$Urban)

mean(pred == validationData$Urban)


3) Predicting ShelveLoc (Good, Medium, or Bad) based on the other parameters in the dataset
(Income, Advertising, Education, Population, Price, Age, Urban, US, Sales)
############ CODE ##################

tree_analysis <- rpart(ShelveLoc ~ Income + Advertising + Education + Population + Price + Age + Urban + US + Sales,
                       data = Carseats)

rpart.plot(tree_analysis,extra=1)

Comparing different combinations of models using mean squared error values
Different combinations of independent variables are used to create models, and their mean squared
error values are calculated from the difference between the actual and predicted values.

The lowest mean squared error values were obtained for model7 and model8, i.e. with the following
combinations:

tree_model7 = tree(High ~ Advertising + Age + Price + Education + Income + Population + US + ShelveLoc, training_data)
tree_model8 = tree(High ~ Advertising + Age + Price + Education + Income + Population + US + ShelveLoc + Urban, training_data)
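For a binary target coded as 0/1, the mean squared error between actual and predicted labels is simply the misclassification rate, which is how each candidate model can be scored. The helper below is a sketch (`mse_binary` is a hypothetical name, not from the tree package).

```r
# Sketch: MSE of 0/1-coded predictions equals the misclassification rate.
# mse_binary is a hypothetical helper used to score each candidate model.
mse_binary <- function(actual, predicted) mean((actual - predicted)^2)

# For each fitted model (not run here), something like:
# pred <- ifelse(predict(tree_model7, testing_data, type = "class") == "Yes", 1, 0)
# mse_binary(ifelse(testing_High == "Yes", 1, 0), pred)

# Toy illustration: two of four labels wrong gives an MSE of 0.5
mse_binary(c(1, 0, 1, 1), c(1, 1, 1, 0))  # 0.5
```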
Using the tree package
a) Creating a decision tree model by splitting the dataset into training and test sets.

# split the data into training and test sets

library(tree)

# High is assumed to be a binary indicator of high sales (Sales > 8),
# following the standard Carseats lab; adjust if your definition differs
Carseats$High <- factor(ifelse(Carseats$Sales > 8, "Yes", "No"))

set.seed(2)

train = sample(1:nrow(Carseats), nrow(Carseats)/2)

test = -train

training_data = Carseats[train, ]

testing_data = Carseats[test, ]

testing_High = Carseats$High[test]

# fit the tree model using the training data (Sales is excluded because it defines High)

tree_model = tree(High ~ . - Sales, training_data)

plot(tree_model)

text(tree_model, pretty = 0)
# test-set predictions and misclassification error
tree_pred = predict(tree_model, testing_data, type = "class")
mean(tree_pred != testing_High)

#PRUNE the tree


## cross-validation to check where to stop pruning
set.seed(3)

cv_tree=cv.tree(tree_model, FUN=prune.misclass)
names(cv_tree)
plot(cv_tree$size, cv_tree$dev, type="b")
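Instead of reading the best size off the plot, the tree size with the lowest cross-validated deviance can be picked programmatically. `cv_tree$size` and `cv_tree$dev` are the fields returned by `cv.tree()` above; the toy vectors below stand in for them, and `pick_best_size` is a hypothetical helper.

```r
# Sketch: choose the tree size whose cross-validated deviance is lowest.
# pick_best_size is a hypothetical helper, not a tree-package function.
pick_best_size <- function(size, dev) size[which.min(dev)]

# With the fitted object above (not run here):
# pick_best_size(cv_tree$size, cv_tree$dev)

# Toy illustration: deviance is minimised at size 9
pick_best_size(c(1, 5, 9, 14), c(170, 120, 98, 105))  # 9
```

The chosen value can then be passed as the `best` argument of `prune.misclass()`, as done below.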
##prune the tree

pruned_model=prune.misclass(tree_model, best=9)

plot(pruned_model)

text(pruned_model, pretty=0)

## check how the pruned tree performs on the test set

tree_pred=predict(pruned_model, testing_data, type="class")

mean(tree_pred !=testing_High)
Best Model Generated

Comparing Accuracy
1) The best model generated is the one in which the US (Yes/No) label is predicted from the
other variables. This model has an accuracy of around 89%.

Comparing Means
The lowest mean squared error values were obtained for model7 and model8, i.e. with the following
combinations:

tree_model7 = tree(High ~ Advertising + Age + Price + Education + Income + Population + US + ShelveLoc, training_data)
tree_model8 = tree(High ~ Advertising + Age + Price + Education + Income + Population + US + ShelveLoc + Urban, training_data)
