
Title: Data Wrangling (Data Preprocessing)

I. Students’ details
Author: “Individual work”
Subtitle: Practical assessment 1
Output:
  pdf_document: default
  html_notebook: default
  html_document:
    df_print: paged

II. Install library


install.packages("tidyverse")

## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)

III. Data Wrangling

III.1 Load data set


Download the file titanic.csv from GitHub: https://github.com/datasciencedojo/datasets/blob/master/titanic.csv
After extracting it, rename the file to titanic and copy it into the project. Then use the
read.csv function to read the titanic data set.
ds_titanic = read.csv(file = 'titanic.csv',header = TRUE,sep = ',')
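
As a sketch of an optional variation, the same read.csv call can take na.strings so that empty fields are read as NA. The report itself does not do this; the three-row sample file and the name sample_ds below are made up for illustration.

```r
# Hypothetical three-row sample written to a temporary file (not the real titanic.csv)
tmp <- tempfile(fileext = ".csv")
writeLines(c("PassengerId,Age,Cabin",
             "1,22,",
             "2,38,C85",
             "3,,"), tmp)

# Same call as in the report, plus na.strings so that empty fields become NA
sample_ds <- read.csv(file = tmp, header = TRUE, sep = ",",
                      na.strings = c("", "NA"))

print(sample_ds)
print(colSums(is.na(sample_ds)))  # number of missing values per column
```

With this option, empty character fields such as a missing Cabin are counted by is.na() instead of being stored as "".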

III.2 Head of data set titanic (return first 6 rows only)


The head function returns the first 6 rows by default.
head(ds_titanic)

## PassengerId Survived Pclass
## 1 1 0 3
## 2 2 1 1
## 3 3 1 3
## 4 4 1 1
## 5 5 0 3
## 6 6 0 3
## Name Sex Age SibSp Parch
## 1 Braund, Mr. Owen Harris male 22 1 0
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38 1 0
## 3 Heikkinen, Miss. Laina female 26 0 0
## 4 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 0
## 5 Allen, Mr. William Henry male 35 0 0
## 6 Moran, Mr. James male NA 0 0
## Ticket Fare Cabin Embarked WikiId
## 1 A/5 21171 7.2500 S 691
## 2 PC 17599 71.2833 C85 C 90
## 3 STON/O2. 3101282 7.9250 S 865
## 4 113803 53.1000 C123 S 127
## 5 373450 8.0500 S 627
## 6 330877 8.4583 Q 785
## Name_wiki Age_wiki
## 1 Braund, Mr. Owen Harris 22
## 2 Cumings, Mrs. Florence Briggs (née Thayer) 35
## 3 Heikkinen, Miss Laina 26
## 4 Futrelle, Mrs. Lily May (née Peel) 35
## 5 Allen, Mr. William Henry 35
## 6 Doherty, Mr. William John (aka "James Moran") 22
## Hometown Boarded
## 1 Bridgerule, Devon, England Southampton
## 2 New York, New York, US Cherbourg
## 3 Jyväskylä, Finland Southampton
## 4 Scituate, Massachusetts, US Southampton
## 5 Birmingham, West Midlands, England Southampton
## 6 Cork, Ireland Queenstown
## Destination Lifeboat Body Class
## 1 Qu'Appelle Valley, Saskatchewan, Canada 3
## 2 New York, New York, US 4 1
## 3 New York City 14? 3
## 4 Scituate, Massachusetts, US D 1
## 5 New York City 3
## 6 New York City 3

III.3 Data Description


The titanic data set describes the survival status of individual passengers on the Titanic.
The titanic data frame does not contain information about the crew, but it does contain the
actual ages of half of the passengers. The principal source for data about Titanic passengers
is the Encyclopedia Titanica. The data sets used here were begun by a variety of researchers.
One of the original sources is Eaton & Haas (1994), Titanic: Triumph and Tragedy, Patrick
Stephens Ltd, which includes a passenger list created by many researchers and edited by
Michael A. Findlay.
VARIABLE DESCRIPTIONS
Column        Description
PassengerId   Passenger Identification
survival      Survival (0 = No; 1 = Yes)
name          Name
sex           Sex
age           Age
sibsp         Number of Siblings/Spouses Aboard
parch         Number of Parents/Children Aboard
ticket        Ticket Number
fare          Passenger Fare (British pound)
cabin         Cabin
embarked      Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
boat          Lifeboat
body          Body Identification Number
home.dest     Home/Destination

IV. Manipulation

IV.1 Dimension
Check the dimensions of the titanic data set using the dim function (part of base R, so the tidyverse is not required for it).
dim(ds_titanic)

## [1] 1309 21

This means the titanic data set has 1309 rows and 21 columns in total.

IV.2 Columns
Show the column names of the data set.
colnames(ds_titanic)

## [1] "PassengerId" "Survived" "Pclass" "Name" "Sex"
## [6] "Age" "SibSp" "Parch" "Ticket" "Fare"
## [11] "Cabin" "Embarked" "WikiId" "Name_wiki" "Age_wiki"
## [16] "Hometown" "Boarded" "Destination" "Lifeboat" "Body"
## [21] "Class"

IV.3 Change columns


Rename column SibSp to SpousesAboard and Parch to ChildrenAboard, using their column indexes (7 and 8).
colnames(ds_titanic)[7] = "SpousesAboard"
colnames(ds_titanic)[8] = "ChildrenAboard"
colnames(ds_titanic)
## [1] "PassengerId" "Survived" "Pclass" "Name"
## [5] "Sex" "Age" "SpousesAboard" "ChildrenAboard"
## [9] "Ticket" "Fare" "Cabin" "Embarked"
## [13] "WikiId" "Name_wiki" "Age_wiki" "Hometown"
## [17] "Boarded" "Destination" "Lifeboat" "Body"
## [21] "Class"
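
Renaming by index depends on the column order staying the same. A minimal sketch of an alternative, assuming the columns still carry their original names, looks each column up by name instead. The small data frame df is made up for illustration and is not part of the report.

```r
# Hypothetical small data frame with the same column names as the original
df <- data.frame(PassengerId = 1:3,
                 SibSp = c(1, 1, 0),
                 Parch = c(0, 0, 2))

# Look up each column by its current name instead of its position
names(df)[names(df) == "SibSp"] <- "SpousesAboard"
names(df)[names(df) == "Parch"] <- "ChildrenAboard"

print(colnames(df))
```

If a name is not found, the assignment changes nothing, so a reordered file cannot cause the wrong column to be renamed.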

IV.4 Check datatype


Check the data type of each column using the str function.
str(ds_titanic)

## 'data.frame': 1309 obs. of 21 variables:
## $ PassengerId : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : num 0 1 1 1 0 0 0 0 1 1 ...
## $ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John
Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs.
Jacques Heath (Lily May Peel)" ...
## $ Sex : chr "male" "female" "female" "female" ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SpousesAboard : int 1 1 0 1 0 0 0 3 0 1 ...
## $ ChildrenAboard: int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803"
...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : chr "" "C85" "" "C123" ...
## $ Embarked : chr "S" "C" "S" "S" ...
## $ WikiId : num 691 90 865 127 627 ...
## $ Name_wiki : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. Florence
Briggs (née Thayer)" "Heikkinen, Miss Laina" "Futrelle, Mrs. Lily May (née
Peel)" ...
## $ Age_wiki : num 22 35 26 35 35 22 54 2 26 14 ...
## $ Hometown : chr "Bridgerule, Devon, England" "New York, New York,
US" "Jyväskylä, Finland" "Scituate, Massachusetts, US" ...
## $ Boarded : chr "Southampton" "Cherbourg" "Southampton"
"Southampton" ...
## $ Destination : chr "Qu'Appelle Valley, Saskatchewan, Canada" "New
York, New York, US" "New York City" "Scituate, Massachusetts, US" ...
## $ Lifeboat : chr "" "4" "14?" "D" ...
## $ Body : chr "" "" "" "" ...
## $ Class : int 3 1 3 1 3 3 1 3 3 2 ...
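
For a more compact view of the column types than str gives, one option is sapply with class. This is only a sketch on a made-up three-row data frame; df and col_types are illustrative names that do not appear in the report.

```r
# Hypothetical sample with one integer, one character and one numeric column
df <- data.frame(PassengerId = 1:3,
                 Sex = c("male", "female", "female"),
                 Fare = c(7.25, 71.2833, 7.925),
                 stringsAsFactors = FALSE)

# Apply class() to every column; the result is a named character vector
col_types <- sapply(df, class)
print(col_types)
```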

IV.5 Create new data subset


matrix_ds_titanic = data.matrix(head(ds_titanic[ds_titanic$Age >= 20 &
ds_titanic$Age <= 40,],10))
matrix_ds_titanic

## PassengerId Survived Pclass Name Sex Age SpousesAboard ChildrenAboard
## 1 1 0 3 3 2 22 1 0
## 2 2 1 1 4 1 38 1 0
## 3 3 1 3 6 1 26 0 0
## 4 4 1 1 5 1 35 1 0
## 5 5 0 3 1 2 35 0 0
## NA NA NA NA NA NA NA NA NA
## 9 9 1 3 7 1 27 0 2
## 13 13 0 3 8 2 20 0 0
## 14 14 0 3 2 2 39 1 5
## NA.1 NA NA NA NA NA NA NA NA
## Ticket Fare Cabin Embarked WikiId Name_wiki Age_wiki Hometown Boarded
## 1 5 7.2500 1 2 691 3 22 2 2
## 2 7 71.2833 3 1 90 4 35 5 1
## 3 8 7.9250 1 2 865 6 26 3 2
## 4 1 53.1000 2 2 127 5 35 6 2
## 5 4 8.0500 1 2 627 1 35 1 2
## NA NA NA NA NA NA NA NA NA NA
## 9 3 11.1333 1 2 902 7 26 8 2
## 13 6 8.0500 1 2 1196 8 19 7 2
## 14 2 31.2750 1 2 632 2 39 4 2
## NA.1 NA NA NA NA NA NA NA NA NA
## Destination Lifeboat Body Class
## 1 3 1 1 3
## 2 2 4 1 1
## 3 1 2 1 3
## 4 4 5 1 1
## 5 1 1 1 3
## NA NA NA NA NA
## 9 5 3 1 3
## 13 1 1 1 3
## 14 6 1 1 3
## NA.1 NA NA NA NA

str(matrix_ds_titanic)

## num [1:10, 1:21] 1 2 3 4 5 NA 9 13 14 NA ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:10] "1" "2" "3" "4" ...
## ..$ : chr [1:21] "PassengerId" "Survived" "Pclass" "Name" ...
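The subset keeps the rows where Age is between 20 and 40 inclusive, and head with the
argument 10 keeps the first 10 rows only. The goal is to observe the survival rate among
passengers in adulthood. The rows labelled NA and NA.1 come from passengers whose Age is
missing: the comparison returns NA for them, and indexing a data frame with NA produces a
row of NA values. data.matrix also replaces character columns such as Name and Sex with
integer codes, so the resulting matrix is entirely numeric.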
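
One way to keep the all-NA rows out of the subset, sketched here on a made-up six-row sample rather than the real data, is to wrap the condition in which(). The names df, adults and survival_rate are illustrative and do not appear in the report.

```r
# Hypothetical sample; the Age and Survived columns mimic the original data
df <- data.frame(PassengerId = 1:6,
                 Survived = c(0, 1, 1, 1, 0, 0),
                 Age = c(22, 38, 26, 35, 35, NA))

# which() returns only the positions where the condition is TRUE,
# so rows with a missing Age are dropped instead of becoming NA rows
adults <- df[which(df$Age >= 20 & df$Age <= 40), ]
print(adults)

# Share of survivors among the selected passengers
survival_rate <- mean(adults$Survived)
print(survival_rate)
```

Because Survived is coded 0/1, its mean over the subset is the survival rate directly.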

IV.6 Add new columns


Create a new data frame with 2 variables and 10 observations, using the columns
PassengerId and Age.
new_ds = head(ds_titanic[,c("PassengerId","Age")],10)
new_ds

## PassengerId Age
## 1 1 22
## 2 2 38
## 3 3 26
## 4 4 35
## 5 5 35
## 6 6 NA
## 7 7 54
## 8 8 2
## 9 9 27
## 10 10 14

Create a new numeric vector.


vec = c(10,5,3,6,8,9,4,1,2,7)

Add the vector to the data set as a new column using cbind().


new_ds_titanic <- cbind(new_ds,vec)
new_ds_titanic

## PassengerId Age vec


## 1 1 22 10
## 2 2 38 5
## 3 3 26 3
## 4 4 35 6
## 5 5 35 8
## 6 6 NA 9
## 7 7 54 4
## 8 8 2 1
## 9 9 27 2
## 10 10 14 7
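
The new column takes its name from the variable, here vec. As a sketch of two ways to give it a descriptive name, the argument can be named inside cbind(), or the column can be assigned with $. The column name Rank and the three-row data below are hypothetical.

```r
# Hypothetical three-row version of new_ds and vec
new_ds <- data.frame(PassengerId = 1:3, Age = c(22, 38, 26))
vec <- c(10, 5, 3)

# Naming the argument in cbind() sets the name of the new column
with_cbind <- cbind(new_ds, Rank = vec)

# Assigning with $ adds the same column to a copy of the data frame
with_dollar <- new_ds
with_dollar$Rank <- vec

print(colnames(with_cbind))
print(colnames(with_dollar))
```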

V. Reference
1. Data source: https://github.com/datasciencedojo/datasets/blob/master/titanic.csv
2. http://campus.lakeforest.edu/frank/FILES/MLFfiles/Bio150/Titanic/TitanicMETA.pdf
