
ML PROJECT REPORT

Koushik Tumati

PROBLEM:
Context: The sinking of the Titanic is one of the most infamous shipwrecks in history.
Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of
1502 out of 2224 passengers and crew.
Objective: To build a predictive model that answers the question, “what sorts of people were
more likely to survive?”, using passenger data (e.g., name, age, gender, socio-economic class,
etc.).
Tools and libraries used:
Language: Python
Libraries: Numpy, Pandas, Matplotlib, Seaborn, Sklearn
DATASETS:

The dataset was obtained from Kaggle. It contains 12 columns (features) and 891 rows.

Variable       Definition                                   Key

survival       Survival                                     0 = No, 1 = Yes
Passenger ID   Serial ID number
pclass         Ticket class                                 1 = Upper, 2 = Middle, 3 = Lower
Name           Name
sex            Sex
Age            Age in years
sibsp          # of siblings / spouses aboard the Titanic
parch          # of parents / children aboard the Titanic
ticket         Ticket number
fare           Passenger fare
cabin          Cabin number
embarked       Port of Embarkation                          C = Cherbourg, Q = Queenstown, S = Southampton

APPROACH
1) DATA CLEANING:
1) Familiarising data:
First, the above-mentioned libraries are imported and the dataset is loaded with pandas' read
function. To get familiar with the dataset, the column types, basic descriptive statistics and
general column information are obtained through methods such as .head(), .describe() and
.info(). After performing these operations, the following conclusions are drawn (a sketch of
this loading and inspection step follows the list):
• “Survival” is the dependent variable that we want to predict from the other independent
variables.
• Passenger ID and ticket number are effectively random identifiers and cannot contribute to
our model.
• Sex and Embarked are nominal variables which need to be converted to dummy variables.
• Age and Fare are continuous variables. (Age is continuous rather than discrete because
infants under one year have fractional ages and some adult ages are recorded as decimal
estimates.)
• SibSp and Parch are discrete numeric variables.
• Cabin is a nominal variable with a large number of null values, which makes it
uninformative; it is therefore dropped from the dataset.
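
A minimal sketch of this loading and inspection step, assuming the Kaggle training file is
saved locally as train.csv (the filename is an assumption, not part of the report):

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns

    # Load the Kaggle training file (filename "train.csv" is assumed).
    df = pd.read_csv("train.csv")

    # First look at the data: sample rows, summary statistics,
    # column info and missing-value counts per column.
    print(df.head())
    print(df.describe())
    df.info()
    print(df.isnull().sum())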

2) Checking Outliers and Missing Data:


After examination, no significant outliers are found, but missing values are present in the
Age, Cabin and Embarked columns. Null values in the Age column are filled with the median
and those in the Embarked column with the mode, matching their quantitative and qualitative
types respectively. Cabin is dropped because of its huge number of null values. Note that if
missing records cannot be reasonably filled and are few in number, those rows can simply be
dropped to keep the model accurate.
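
A sketch of the imputation and dropping described above, using the standard Kaggle column
names:

    # Fill missing Age values with the median (quantitative column) and
    # missing Embarked values with the mode (qualitative column).
    df["Age"] = df["Age"].fillna(df["Age"].median())
    df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])

    # Drop Cabin (too many nulls) along with the uninformative identifiers.
    df = df.drop(columns=["Cabin", "PassengerId", "Ticket"])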

3) Handling datatypes and feature engineering:


Feature engineering is the creation of new, useful features from existing columns. In this
dataset we can extract the title (Mr, Dr, etc.) from the Name column into a new column, and
create a family_size variable from the SibSp and Parch variables.
Columns also need to be checked for datatypes that require special handling, such as datetime
and currency. Luckily, this dataset has no such complex datatypes. However, we do need to
convert the categorical variables into numerical dummy variables for the model's
calculations. There are several ways to do this, such as one-hot encoding; I used sklearn and
pandas functions for this purpose. The data frame is split and altered a few times in this
step to obtain a model-compatible dataset with dummy variables.
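
A sketch of these feature-engineering steps; the title-extraction regex and the exact choice
of columns to one-hot encode are assumptions, not the report's own code:

    # Extract the title (Mr, Mrs, Dr, ...) that follows the comma in Name.
    df["Title"] = df["Name"].str.extract(r" ([A-Za-z]+)\.", expand=False)

    # Family size = siblings/spouses + parents/children + the passenger.
    df["family_size"] = df["SibSp"] + df["Parch"] + 1

    # One-hot encode the nominal variables with pandas.get_dummies and
    # drop the raw Name column, which is no longer needed.
    df = pd.get_dummies(df.drop(columns=["Name"]),
                        columns=["Sex", "Embarked", "Title"],
                        drop_first=True)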

4) Train and Test Data Split:


To evaluate the model on unseen data and guard against overfitting, we divide the dataset
into an 80:20 train/test split and train the model on the training set. We then test the
model on the test set to check its accuracy. If the accuracy is not satisfactory, we make the
necessary changes to the model or choose a different algorithm until we obtain an optimised
model.
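
A sketch of the 80:20 split with scikit-learn (the random_state value is arbitrary, fixed
only for reproducibility):

    from sklearn.model_selection import train_test_split

    # Separate the target from the features.
    X = df.drop(columns=["Survived"])
    y = df["Survived"]

    # 80:20 train/test split.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)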

2) EXPLORATORY DATA ANALYSIS (EDA):


It is important to note that EDA is not always done after data cleaning is complete. It can
instead be done iteratively, extracting relationships between features and then returning to
cleaning, until satisfactory insights and a satisfactory dataset are obtained. Given the
simplicity of this particular dataset, that iteration is not needed here. In this step, we
summarise the variables and their relationship with the target variable using visualisation
packages like matplotlib and seaborn. In this dataset, we find observations such as: the
survival rate of women is much higher than that of men; higher-fare ticket holders had a
higher chance of survival; and children had a higher chance of survival, while people over 65
had very little chance. In this small dataset most of the observations are intuitive, but in
large datasets we can find intriguing, counterintuitive observations. In that case we have to
validate the dataset and find the underlying reason for the counterintuitive result.

Sample plots are attached below.
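
The original plots are not reproduced here; the following is a rough sketch of how plots of
this kind could be produced with seaborn, assuming it is run on the dataframe before one-hot
encoding (so the Sex, Survived, Fare and Age columns are still in their original form):

    # Survival rate by sex.
    sns.barplot(x="Sex", y="Survived", data=df)
    plt.title("Survival rate by sex")
    plt.show()

    # Fare distribution for survivors vs non-survivors.
    sns.boxplot(x="Survived", y="Fare", data=df)
    plt.title("Fare vs survival")
    plt.show()

    # Age distribution split by survival.
    sns.histplot(data=df, x="Age", hue="Survived", multiple="stack", bins=30)
    plt.title("Age distribution by survival")
    plt.show()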


3) BUILDING MODELS:
Now comes the simple yet crucial step of selecting the algorithm for building the predictive
model. Although the model is practically a few lines of code, the underlying statistics are
vast and complex. Due to my limited knowledge of these algorithms, I used simple Logistic
Regression and Decision Tree models.
We build the models on the training dataset and generate predictions for the test data. We
then compare the predictions with the test labels and gauge the model using the confusion
matrix, the F1 score and other metrics; I used only those two.
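
A sketch of fitting and evaluating the two models mentioned above; the specific settings
(e.g. max_iter, random_state) are assumptions rather than the report's own choices:

    from sklearn.linear_model import LogisticRegression
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import confusion_matrix, f1_score

    # Fit both models on the training data.
    log_reg = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

    # Evaluate each model on the held-out test data.
    for name, model in [("Logistic Regression", log_reg), ("Decision Tree", tree)]:
        preds = model.predict(X_test)
        print(name)
        print(confusion_matrix(y_test, preds))
        print("F1 score:", f1_score(y_test, preds))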

4) IMPROVING MODEL AND ACCURACY:


I stopped my model at the above step due to my limited knowledge of other algorithms at the
time. We can apply different algorithms and pick the best among them. We can also improve the
model's accuracy by tuning parameters and hyperparameters. After such improvements, we obtain
the final optimised model.
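
One common way to do this tuning is a cross-validated grid search; the parameter grid below
is purely illustrative:

    from sklearn.model_selection import GridSearchCV
    from sklearn.tree import DecisionTreeClassifier

    # Search over a small decision-tree grid using 5-fold cross-validation.
    param_grid = {"max_depth": [3, 5, 7, None],
                  "min_samples_split": [2, 5, 10]}
    search = GridSearchCV(DecisionTreeClassifier(random_state=42),
                          param_grid, cv=5, scoring="f1")
    search.fit(X_train, y_train)

    print("Best parameters:", search.best_params_)
    print("Best cross-validated F1 score:", search.best_score_)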

Other Key Points:

• At every point, we should be careful to avoid overfitting the model. For some algorithms,
hyperparameter tuning iterates many times to find the best parameters, which may overfit the
model. In that case we split the dataset into three sets: train, validation and test (see the
sketch after this list).
• Giving too much data to the test and validation sets leaves insufficient training data for
an accurate model, while giving too little data to the test set makes the accuracy check
unreliable. A (70, 15, 15) or (80, 10, 10) split is the usual standard.
• Sometimes more features (columns) are needed and sometimes more rows are needed for a
better model. We need to identify which case applies by observing the relevant metrics.
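
A sketch of such a three-way split (70/15/15), done here as two successive train_test_split
calls; X and y are the feature matrix and target defined earlier:

    from sklearn.model_selection import train_test_split

    # First hold out 30% of the data, then split that portion in half
    # to obtain validation and test sets of 15% each.
    X_train, X_temp, y_train, y_temp = train_test_split(
        X, y, test_size=0.30, random_state=42)
    X_val, X_test, y_val, y_test = train_test_split(
        X_temp, y_temp, test_size=0.50, random_state=42)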
