
Class Assignment-01

Of
Business Analytics

Topic- Visualizations and Interpretation of Data

Under supervision of
Dr. Tanveer Kajla
Mittal School of Business, Lovely Professional University,
Phagwara-144401, Punjab, India, 2021

Name – RIA HEMBROM
Roll no. – RQ2140A01
Registration no. – 12102014
Course code – MGNM801

DATASET: Exploring survival on Titanic

The dataset contains 1309 observations of 12 variables, both numerical and categorical.


The 12 variables are as follows:

Variable Name   Description
PassengerId     ID of the passenger
Survived        Survived (1) or died (0)
Pclass          Passenger’s class
Name            Passenger’s name
Sex             Passenger’s sex
Age             Passenger’s age
SibSp           Number of siblings/spouses aboard
Parch           Number of parents/children aboard
Ticket          Ticket number
Fare            Fare
Cabin           Cabin
Embarked        Port of embarkation

Using these variables, we can analyze the survival of passengers on the Titanic.

Installed packages-

The following packages were installed to run my code:

Step 1: Importing a data set

Two datasets were available, train and test, which I imported into RStudio to
study the survival of passengers who had boarded the Titanic. These two datasets
were then bound together to create the main dataset, named all (screenshot
attached above).
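
As a minimal sketch of the binding step, the snippet below stacks two small stand-in data frames. The rows are a toy sample, and the way the missing Survived column is handled (filled with NA for the test rows) is an assumption, since the original code is not reproduced here. In practice the two data frames would come from the Kaggle CSV files.

```r
# Toy stand-ins for the two Kaggle files (a few illustrative rows only).
train <- data.frame(
  PassengerId = 1:3,
  Survived    = c(0, 1, 1),
  Name        = c("Braund, Mr. Owen Harris",
                  "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
                  "Heikkinen, Miss. Laina"),
  stringsAsFactors = FALSE
)
test <- data.frame(
  PassengerId = 4:5,
  Name        = c("Kelly, Mr. James", "Wilkes, Mrs. James (Ellen Needs)"),
  stringsAsFactors = FALSE
)

# Assumption: the test set has no Survived column, so add it as NA
# to make the columns match before stacking the rows.
test$Survived <- NA
all <- rbind(train, test[, names(train)])

nrow(all)  # number of rows in the combined dataset
```

Keeping the test rows with Survived set to NA lets the cleaning steps run once over every passenger.
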
Step 2: Data Cleaning

I extracted the title (Mr, Ms, Miss, Mrs, Lady, Master, etc.) from each passenger’s
name, then grouped the titles with the lowest counts under a single title called
uncommon_title.
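
One way to extract the title is a regular expression that drops everything up to the comma and everything from the first period onward. This is a sketch on a few sample names, assuming the "Surname, Title. Given names" layout of the Name column; the variable names are illustrative.

```r
# A few sample names in the "Surname, Title. Given names" layout.
name <- c("Braund, Mr. Owen Harris",
          "Heikkinen, Miss. Laina",
          "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
          "Palsson, Master. Gosta Leonard",
          "Kelly, Mr. James")

# Remove the surname part ("...., ") and everything after the first period.
title <- gsub("(.*, )|(\\..*)", "", name)

# Count how often each title occurs.
table(title)
```
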

Output: Count of each title shown as:

Final grouping of titles:
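
The grouping can be sketched by counting each title and relabelling those below a cut-off. The cut-off (min_count) and the sample titles are assumptions for illustration; the threshold actually used in the assignment is not stated.

```r
# Sample titles; in the assignment these come from the extraction step.
title <- c("Mr", "Mr", "Mr", "Mrs", "Mrs", "Miss", "Miss",
           "Master", "Master", "Dr", "Rev", "Lady")

# Assumption: a title is "uncommon" if it occurs fewer than min_count times.
min_count <- 2
counts <- table(title)
rare <- names(counts)[counts < min_count]

# Relabel every rare title with the single group name.
title[title %in% rare] <- "uncommon_title"
table(title)
```
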


Output: Count of each title shown as:

After that, I extracted the surname from each passenger’s name:
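
A minimal sketch of the surname step, assuming the surname is the text before the first comma in Name (sample names only):

```r
name <- c("Braund, Mr. Owen Harris",
          "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
          "Heikkinen, Miss. Laina")

# Drop the first comma and everything after it, keeping only the surname.
surname <- sub(",.*$", "", name)
surname
```
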

Next, I created a variable named FamName to categorize each family on the Titanic.
It is built from the surname (as shown above) and FamSize (as shown below),
which is calculated from the number of siblings/spouses and the number of
parents/children aboard.
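
The two derived variables might be computed as follows. This sketch assumes that FamSize counts the passenger as well (SibSp + Parch + 1), which is consistent with a lone traveller having FamSize = 1, and that FamName joins the surname and the family size with an underscore. The separator and the sample rows are assumptions.

```r
# Sample rows; Surname comes from the previous step.
passengers <- data.frame(
  Surname = c("Braund", "Palsson", "Heikkinen"),
  SibSp   = c(1, 3, 0),
  Parch   = c(0, 1, 0),
  stringsAsFactors = FALSE
)

# Assumption: family size includes the passenger, hence the + 1.
passengers$FamSize <- passengers$SibSp + passengers$Parch + 1

# Assumption: families are labelled as "<surname>_<family size>".
passengers$FamName <- paste(passengers$Surname, passengers$FamSize, sep = "_")
passengers
```
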

Bar graph-

Next, I study the relationship between family size and survival by plotting the
variables FamSize and Survived using ggplot2.
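
A bar graph of this kind could be drawn roughly as below. The data frame holds made-up values purely to make the sketch runnable, not results from the Titanic data, and the dodged-bar layout and axis labels are assumptions about how the plot was styled.

```r
library(ggplot2)

# Made-up illustrative values, not the real Titanic data.
passengers <- data.frame(
  FamSize  = c(1, 1, 1, 2, 2, 3, 3, 4, 5, 6),
  Survived = c(0, 0, 1, 1, 0, 1, 1, 1, 0, 0)
)

# Count passengers per family size, with one bar per survival outcome.
p <- ggplot(passengers, aes(x = FamSize, fill = factor(Survived))) +
  geom_bar(position = "dodge") +
  labs(x = "Family size", y = "Number of passengers", fill = "Survived")
print(p)
```
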
Output:

The plot shows that the chances of survival are lower for passengers with
FamSize = 1 and for those with FamSize greater than 4.

Scatter Plot

Plotting FamSize and Age
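
A scatter plot of the two variables might look like the sketch below. The values are made up for illustration, and placing Age on the x-axis is an assumption, since the original plot is not reproduced here.

```r
library(ggplot2)

# Made-up illustrative values; Age contains a missing entry on purpose.
passengers <- data.frame(
  Age     = c(22, 38, NA, 4, 35, 54, 2, 27),
  FamSize = c(2, 2, 1, 5, 1, 1, 5, 3)
)

# One point per passenger; rows with a missing Age are dropped silently.
p <- ggplot(passengers, aes(x = Age, y = FamSize)) +
  geom_point(na.rm = TRUE) +
  labs(x = "Age", y = "Family size")
print(p)
```
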


Output:

Histogram:

Plotting the frequency of variable Age:
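
The frequency plot can be sketched with a ggplot2 histogram. The ages are made up and the bin width of 10 years is an assumption; the original may have used a different binning or base R's hist().

```r
library(ggplot2)

# Made-up illustrative ages, including one missing value.
passengers <- data.frame(Age = c(22, 38, 26, 35, NA, 54, 2, 27, 14, 4))

# Histogram of Age; the 10-year bin width is an assumption.
p <- ggplot(passengers, aes(x = Age)) +
  geom_histogram(binwidth = 10, na.rm = TRUE) +
  labs(x = "Age", y = "Frequency")
print(p)
```
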

Output:

Boxplot:

I will first get rid of the rows with missing passenger IDs and then construct
a boxplot.
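
The text does not say which variables the boxplot shows, so the sketch below is only one possible reading: it drops rows whose PassengerId is missing and then plots Fare grouped by Embarked. The choice of variables and all values are assumptions made for illustration.

```r
library(ggplot2)

# Made-up illustrative rows; one PassengerId is missing on purpose.
passengers <- data.frame(
  PassengerId = c(1, 2, NA, 4, 5, 6),
  Embarked    = c("S", "C", "S", "Q", "C", "S"),
  Fare        = c(7.25, 71.28, 8.05, 7.75, 30.07, 13.00),
  stringsAsFactors = FALSE
)

# Keep only the rows that have a passenger ID.
kept <- passengers[!is.na(passengers$PassengerId), ]

# Assumption: the boxplot compares fares across ports of embarkation.
p <- ggplot(kept, aes(x = Embarked, y = Fare)) +
  geom_boxplot() +
  labs(x = "Port of embarkation", y = "Fare")
print(p)
```
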

Reference-

• https://www.kaggle.com/
