
JIGSAW ASSIGNMENT 1

Titanic Train Dataset

Abstract
Use of the Titanic train dataset to build and plot a decision tree that splits the passengers into
increasingly homogeneous groups and attaches a survival probability to each group

Group 1
Raghvendra Pandey | Sumit Singh | Swapnil Athaya | Abhijeet Bhadani | Monica Jain | Pratiksha Shetty
# Build a decision tree using the Titanic Train case
# The working directory is "/Users/RVP/Desktop/R Workshop"
# Choose the Titanic Train file interactively from that directory
mydata = read.csv(file.choose())
# Attach the data frame to the search path so its columns can be referenced by name
attach(mydata)
# View the loaded data
View(mydata)
# Dependent variable - Survived, 1 = YES (Survived), 0 = NO (Not Survived)
# Independent variables -
# a. Sex, either male or female
# b. Age, the age of the passenger
# c. Pclass, the socio-economic status of the passenger
#    1st = Upper, 2nd = Middle, 3rd = Lower
# d. SibSp, the number of siblings and spouses aboard
# e. Parch, the number of parents and children aboard
# f. Fare, the fare paid for the ticket
# g. Ticket, the ticket number for that passenger (digits and letters)
# h. Cabin, the cabin number for that passenger
# i. Embarked, the port of embarkation of the passenger: Cherbourg, Queenstown or Southampton
# Plotting the histogram of the dependent variable
hist(mydata$Survived)
# Install the package to work with the structured data
install.packages("dplyr")
library(dplyr)
# Summarize the data calculating the mean, median, quartiles of the dataset
summary(mydata)
install.packages("caTools")
library(caTools)
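# Splitting the data into train & test data.
# Train data is a fixed share (here 70%) of the overall dataset and is used to fit the model;
# the remaining 30% is held out as test data.
# The better the training data, the better the model fit.
# sample.split() is given the outcome column rather than the whole data frame, so that it
# returns one TRUE/FALSE flag per row and keeps the survival ratio similar in both parts.
set.seed(123)  # arbitrary seed so the split is reproducible
split = sample.split(mydata$Survived, SplitRatio = 0.7)
train = subset(mydata, split == TRUE)
train
test = subset(mydata, split == FALSE)
test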
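# A minimal sketch, using base R only, of how one might confirm that a split keeps the share
# of survivors similar in the train and test parts. The helper name split_summary and the toy
# vectors are hypothetical and are not part of the original script.

```r
# Hypothetical helper: compare outcome rates across a logical train/test split.
# `outcome` is a 0/1 vector (for example mydata$Survived) and `in_train` is a
# logical vector with one entry per row (for example the result of sample.split()).
split_summary <- function(outcome, in_train) {
  stopifnot(length(outcome) == length(in_train))
  c(
    train_share = mean(in_train),           # fraction of rows placed in train
    train_rate  = mean(outcome[in_train]),  # survival rate within train
    test_rate   = mean(outcome[!in_train])  # survival rate within test
  )
}

# Toy data for illustration only (not the Titanic file): 40 survivors, 60 non-survivors.
toy_outcome <- rep(c(1, 0), times = c(40, 60))
# A hand-built split that sends 70% of each class to train.
toy_in_train <- c(rep(c(TRUE, FALSE), times = c(28, 12)),
                  rep(c(TRUE, FALSE), times = c(42, 18)))
toy_result <- split_summary(toy_outcome, toy_in_train)
print(toy_result)
```

# With the script's own objects the call would be split_summary(mydata$Survived, split),
# assuming split holds one flag per row of mydata.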
# rpart (recursive partitioning) is installed to build the decision tree; it also provides
# prune() to simplify a fitted tree.
# We prune the decision tree to avoid overfitting the data
install.packages("rpart")
library(rpart)
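# Fit the model on the train dataset. Survived is numeric (0/1), so rpart fits a regression
# tree and the value shown in each node is the share of survivors in that node.
model = rpart(Survived ~ Sex + Age + Pclass + Fare, data = train)
model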
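# The comments above mention pruning, but the script never calls prune(). The following is a
# sketch, not the authors' method, of one common way to do it: pick the complexity parameter
# (cp) with the lowest cross-validated error in the fitted tree's cptable. The helper name
# prune_at_min_xerror is hypothetical, and the demonstration uses kyphosis, a small dataset
# shipped with rpart, rather than the Titanic file.

```r
library(rpart)

# Hypothetical helper (not part of the original script): prune a fitted rpart
# tree at the cp value with the lowest cross-validated error (xerror).
prune_at_min_xerror <- function(fit) {
  cp_table <- fit$cptable
  # Guard: if the table has no cross-validation column, return the tree unchanged.
  if (!"xerror" %in% colnames(cp_table)) return(fit)
  best_row <- which.min(cp_table[, "xerror"])
  prune(fit, cp = cp_table[best_row, "CP"])
}

# Illustration on kyphosis (bundled with rpart), not on the Titanic data.
set.seed(1)  # cross-validation folds are random, so fix the seed
demo_fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)
demo_pruned <- prune_at_min_xerror(demo_fit)
printcp(demo_pruned)
```

# In this script the equivalent call would be pruned_model = prune_at_min_xerror(model).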
install.packages("rpart.plot")
library(rpart.plot)
# Plotting the decision Tree based on the train data
rpart.plot(model)
# Listing the 1st- and 2nd-class passengers (Pclass < 3)
subset(mydata, Pclass < 3)
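# The test data is created above but not used afterwards. As a sketch only, the held-out rows
# could be scored as follows. The helper name evaluate_tree and the 0.5 threshold are
# assumptions, and the helper assumes a regression tree fitted on a numeric 0/1 outcome, as
# in this script, so that predict() returns one number per row.

```r
library(rpart)

# Hypothetical helper: turn predicted survival rates into 0/1 labels and
# compare them with the observed outcome column.
evaluate_tree <- function(fit, newdata, outcome = "Survived", threshold = 0.5) {
  prob <- predict(fit, newdata = newdata)     # numeric vector for a regression tree
  predicted <- as.integer(prob >= threshold)  # 1 = predicted to survive
  actual <- newdata[[outcome]]
  list(
    confusion = table(actual = actual, predicted = predicted),
    accuracy  = mean(predicted == actual)
  )
}

# In this script the call would be: evaluate_tree(model, test)
```

# The confusion table shows which kind of error dominates, which the single accuracy figure
# hides.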

# Conclusion

# The tree describes the survival probability of the population in terms of the independent
# variables. All percentages below are shares of the total population.
# It starts with the total population (100%), with an overall survival probability of 0.38.

Male

# Males make up 66% of the population and survive with a probability of 0.18; the remaining
# 34% (not male, i.e. female) survive with a probability of 0.76.
# Males aged >= 9.5 years with Pclass >= 2 make up 48% of the population and survive with a
# probability of 0.092. Those with Pclass = 1 make up 14% and survive with a probability of 0.37.
# Males aged >= 9.5 years with Pclass = 1 and Fare >= 26 make up 12% of the population and
# survive with a probability of 0.41.
# Males aged < 9.5 years with Pclass >= 3 (3rd class) make up 3% of the population and survive
# with a probability of 0.4. Those with Pclass = 1 or 2 make up 2% and survive with a
# probability of 1.

Not Male (i.e. Female)


# Females make up 34% of the population and survive with a probability of 0.76. Those with
# Pclass >= 3 (3rd class) make up 15% of the population and survive with a probability of 0.52.
# Of these, those with Fare >= 25 make up 3% of the population and survive with a probability
# of 0.12, while those with Fare < 25 make up 12% and survive with a probability of 0.61.
# Females with Pclass < 3 (i.e. 1 or 2) make up 19% of the population and survive with a
# probability of 0.95.
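# As a quick consistency check on the figures quoted in the conclusion, the share-weighted
# average of the child nodes should reproduce the value of their parent node. The sketch
# below uses only the rounded numbers stated above, so small rounding differences are expected.

```r
# Node shares and survival values as quoted in the conclusion.
male   <- c(share = 0.66, prob = 0.18)
female <- c(share = 0.34, prob = 0.76)

# Weighted average of the male and female nodes versus the root value of 0.38.
root_check <- male[["share"]] * male[["prob"]] + female[["share"]] * female[["prob"]]
print(round(root_check, 2))    # 0.38

# Same check one level down on the female branch (Pclass >= 3 versus Pclass < 3),
# compared with the female node value of 0.76.
female_check <- (0.15 * 0.52 + 0.19 * 0.95) / female[["share"]]
print(round(female_check, 2))  # 0.76
```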
Decision Tree
