
The Brief History of the Titanic's Tragic Maiden Voyage

• Voyage from Southampton to New York City, stopping at Cherbourg and Queenstown
• Began shortly after noon on 10 April 1912
• Hit iceberg at around 23:40 (ship's time) on 14 April 1912
• At the time, carrying an estimated 2,224 “souls”: about 892 crew members and 1,320 passengers
• 2 hours and 40 minutes of agony until the ship disappeared from view
• Estimated fatalities: between 1,490 and 1,635
Dataset understanding

Training dataset sample


Dataset understanding (continuation)
Feature       Description                                  Type
PassengerId   Passenger ID                                 Discrete
Survived      Target                                       Discrete
Pclass        Passenger ticket class                       Discrete
Name          Passenger name                               String
Sex           Passenger gender                             String
Age           Passenger age                                Continuous
SibSp         Family relations (siblings/spouses aboard)   Discrete
Parch         Family relations (parents/children aboard)   Discrete
Ticket        Passenger ticket number                      String
Fare          Fare paid for the ticket                     Continuous
Cabin         Passenger cabin number                       String
Embarked      Port of embarkation                          String

Training dataset information


Dataset understanding (missing values)
• What policy should be adopted to deal with the missing Cabin data, given that so few values are present?
• What policy should be adopted to deal with the missing Age data?
• How should the only two missing values in Embarked be filled?
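One possible set of answers to the questions above, sketched with pandas. The slides do not state which policy was actually adopted, so this is an assumption: drop Cabin because most of its values are missing, fill Age with the median, and fill Embarked with the most frequent port. The rows below are invented for illustration; only the column names come from the dataset.

```python
import pandas as pd

# Toy rows for illustration only; not taken from the real dataset.
df = pd.DataFrame({
    "Age": [22.0, None, 38.0, 26.0, None],
    "Cabin": [None, "C85", None, None, None],
    "Embarked": ["S", "C", None, "S", "Q"],
})

def impute(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with Cabin dropped and Age/Embarked filled (assumed policy)."""
    out = frame.drop(columns=["Cabin"])  # too sparse to fill reliably
    out["Age"] = out["Age"].fillna(out["Age"].median())
    out["Embarked"] = out["Embarked"].fillna(out["Embarked"].mode()[0])
    return out

clean = impute(df)
print(clean.isna().sum())
```

The median is used for Age rather than the mean because it is less sensitive to a few very old or very young passengers; either choice is a judgment call.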
Highlights

• Most passengers were between 20 and 50 years old
• 3rd class registers a greater number of casualties than the other classes
• In contrast, 1st class registers the lowest number of fatalities among the classes
• Most tickets cost less than £100; the higher the ticket price, the lower the fatality rate
Dataset understanding (features correlation)

• The features, in general, show a low level of correlation
• Note the moderate correlation between Pclass and Fare
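A correlation matrix like the one discussed above can be computed with pandas `DataFrame.corr`. This is a minimal sketch on invented numbers, so the values it prints are not the ones from the slides. In this toy sample the Pclass/Fare coefficient is negative because a lower class number means a more expensive ticket.

```python
import pandas as pd

# Toy numeric sample, invented for illustration.
df = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3],
    "Fare": [80.0, 60.0, 26.0, 13.0, 8.0, 7.0],
    "Age": [38, 54, 27, 14, 22, 35],
})

# Pairwise Pearson correlation between all numeric columns.
corr = df.corr(method="pearson")
pclass_fare = corr.loc["Pclass", "Fare"]
print(corr.round(2))
```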
Features Engineering (missing values and object treatment)
Features engineering (results)
Highlights

• After encoding “Sex” as a numeric feature (female = 1, male = 0), female passengers are seen to have survived in higher numbers than male passengers.
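The encoding described above (female = 1, male = 0) can be sketched as follows. The rows are invented, and the string labels `"male"` and `"female"` are an assumption about how the raw column is spelled.

```python
import pandas as pd

# Toy rows for illustration only.
df = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "male", "female"],
    "Survived": [0, 1, 1, 0, 1, 0],
})

# Replace the string labels with the numeric codes used in the slides.
df["Sex"] = df["Sex"].map({"male": 0, "female": 1})

# Share of survivors per encoded value.
rate_by_sex = df.groupby("Sex")["Survived"].mean()
print(rate_by_sex)
```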
Dataset understanding (final features correlation)

• The features, in general, still show a low level of correlation
• Note the moderate correlation between Pclass and Fare, and between Sex and Survived
Training, testing and scores
The training dataset was split into two subsets: 80% for training the model and the remaining 20% for testing.
Applying classification algorithms to these two subsets produced the scores in the following table.

Algorithm  Features List                    Best settings                               Score
KNN        Pclass\Sex\Age\SibSp\Parch\Fare  K = 7 (K-Neighbors)                         75.42%
KNN        Pclass\Sex\Age\Fare              K = 5 (K-Neighbors)                         73.18%
KNN        Pclass\Sex\Age\SibSp\Parch       K = 9 (K-Neighbors)                         79.89%
KNN        Pclass\Sex\Age                   K = 9 (K-Neighbors)                         79.89%
SVM        Pclass\Sex\Age\SibSp\Parch\Fare  C=0.01, sigma=1/gamma=0.01, kernel=linear   78.77%
SVM        Pclass\Sex\Age\Fare              C=0.03, sigma=1/gamma=0.01, kernel=linear   78.77%
SVM        Pclass\Sex\Age\SibSp\Parch       C=10, sigma=1/gamma=10, kernel=rbf          83.24%
SVM        Pclass\Sex\Age                   C=10, sigma=1/gamma=10, kernel=rbf          83.24%
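The 80/20 split and the comparison of K values can be sketched with scikit-learn as below. The feature matrix here is synthetic (`make_classification`), standing in for the prepared Titanic features, so the printed scores are not those in the table. The row count of 891 and `random_state=20` are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in: 891 rows and 6 features (assumed sizes), random values.
X, y = make_classification(
    n_samples=891, n_features=6, n_informative=4, random_state=20
)

# Hold out 20% of the rows for testing, train on the remaining 80%.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=20
)

# Fit one KNN model per candidate K and record its test accuracy.
scores = {}
for k in (5, 7, 9):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    scores[k] = model.score(X_test, y_test)

best_k = max(scores, key=scores.get)
print(len(X_train), len(X_test), scores, best_k)
```

With 891 rows the hold-out set has 179 rows, so each test prediction is worth roughly 0.56 percentage points of score. Small differences between table rows therefore correspond to only a handful of passengers.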


Training and testing (continuation)
Algorithm            Features List                    Best settings                                                           Score
Decision Tree        Pclass\Sex\Age\SibSp\Parch\Fare  maximum depth=8, criterion=entropy, maximum features=auto               86.59%
Decision Tree        Pclass\Sex\Age\Fare              maximum depth=8, criterion=gini, maximum features=auto                  83.80%
Decision Tree        Pclass\Sex\Age\SibSp\Parch       maximum depth=11, criterion=entropy, maximum features=auto              83.80%
Decision Tree        Pclass\Sex\Age                   maximum depth=8, criterion=gini, maximum features=auto                  83.24%
Random Forests       Pclass\Sex\Age\SibSp\Parch\Fare  maximum depth=11, criterion=gini, estimators=50, maximum features=auto  86.59%
Random Forests       Pclass\Sex\Age\Fare              maximum depth=8, criterion=entropy, estimators=150, maximum features=auto  86.03%
Random Forests       Pclass\Sex\Age\SibSp\Parch       maximum depth=7, criterion=entropy, estimators=150, maximum features=auto  83.80%
Random Forests       Pclass\Sex\Age                   maximum depth=7, criterion=entropy, estimators=150, maximum features=auto  84.36%
Logistic Regression  Pclass\Sex\Age\SibSp\Parch\Fare  algorithm=lbfgs, penalty=l2, maximum iterations=100, random state=20    81.01%
Logistic Regression  Pclass\Sex\Age\Fare              algorithm=liblinear, penalty=l1, maximum iterations=50, random state=20  80.45%
Logistic Regression  Pclass\Sex\Age\SibSp\Parch       algorithm=lbfgs, penalty=l2, maximum iterations=100, random state=20    81.01%
Logistic Regression  Pclass\Sex\Age                   algorithm=saga, penalty=l2, maximum iterations=50, random state=20      81.56%
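The slides do not say how the "best settings" were found. One common way, shown here as an assumption rather than the author's method, is a cross-validated grid search over the same kind of parameters (maximum depth, criterion). The data is synthetic and the candidate values are only examples. `max_features` is left at its default because, to the best of my knowledge, recent scikit-learn releases no longer accept the value `"auto"`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the prepared features; values are random.
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=4, random_state=20
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=20
)

# Candidate settings (illustrative values only).
param_grid = {"max_depth": [7, 8, 11], "criterion": ["gini", "entropy"]}

# 5-fold cross-validation on the training part picks the best combination.
search = GridSearchCV(DecisionTreeClassifier(random_state=20), param_grid, cv=5)
search.fit(X_train, y_train)

test_score = search.score(X_test, y_test)
print(search.best_params_, test_score)
```

Selecting settings by cross-validation on the training part, and scoring once on the held-out 20%, avoids tuning directly against the test rows.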


Conclusion
• Given the dataset's features, a score of 86.59% seems quite acceptable
• With this dataset, it might still be possible to improve the score
  • for example, by considering the noble title embedded in the passenger's name
• Additional features that might be needed to improve accuracy:
  • whether the passenger was awake or asleep at the time
  • passenger location
  • distance from that location to the lifeboat
  • passenger's physical and psychological condition
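The idea of using the title embedded in the passenger's name could be sketched with a regular expression. This assumes names follow a `Surname, Title. Given names` layout. The sample names below are invented, and the helper `extract_title` is hypothetical.

```python
import re

# Captures the text between the first comma and the following period.
TITLE_RE = re.compile(r",\s*([^.,]+)\.")

def extract_title(name: str) -> str:
    """Return the title found in the name, or "Unknown" if none matches."""
    match = TITLE_RE.search(name)
    return match.group(1).strip() if match else "Unknown"

# Invented names in the assumed layout.
names = [
    "Doe, Mr. John",
    "Doe, Mrs. Jane (Mary Roe)",
    "Roe, the Countess. of (Ann Example)",
]
titles = [extract_title(n) for n in names]
print(titles)
```

Rare titles would probably need to be grouped into a few categories before being encoded as a feature, since each one covers very few passengers.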
Questions?

Doubts?

Direction and production


by: Bruno Reis
bruno.reis@ipb.pt
