ML PROJECT REPORT
Koushik Tumati
PROBLEM:
Context: The sinking of the Titanic is one of the most infamous shipwrecks in history.
Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of
1502 out of 2224 passengers and crew.
Objective: To build a predictive model that answers the question: “what sorts of people were
more likely to survive?” using passenger data (e.g., name, age, gender, socio-economic class,
etc.).
Tools and libraries used:
Language: Python
Libraries: NumPy, pandas, Matplotlib, seaborn, scikit-learn
DATASETS:
The dataset was obtained from Kaggle. It contains 12 columns (features) and 891 rows.
APPROACH:
1) DATA CLEANING:
1) Familiarising data:
First, the above-mentioned libraries are imported and the dataset is loaded with the read
function of pandas.
To get familiar with the dataset, the column types, basic descriptive statistics and general
info are obtained through methods such as .head(), .describe() and .info(). After performing
these operations, the following conclusions are drawn (a short sketch of these steps appears
after the list):
“Survival” is the dependent variable that we want to predict from the other independent
variables.
PassengerId and ticket numbers are effectively random identifiers and cannot contribute to
our model.
Sex and Embarked are nominal variables which need to be converted to dummy variables.
Age and Fare are continuous variables. (Note that Age is not a discrete number but a
continuous variable, because children below 1 year have fractional ages and some adult ages
are estimates such as 25.6.)
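A minimal sketch of these steps, assuming the Kaggle training file is saved as train.csv (the file name and path are assumptions):

```python
import pandas as pd

# Load the Kaggle training data (the file name/path here is an assumption).
df = pd.read_csv("train.csv")

# Get familiar with the data: first rows, descriptive statistics, column types.
print(df.head())
print(df.describe())
df.info()

# Drop columns that are effectively random identifiers.
df = df.drop(columns=["PassengerId", "Ticket"])

# Convert the nominal columns Sex and Embarked into dummy variables.
df = pd.get_dummies(df, columns=["Sex", "Embarked"], drop_first=True)
```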
3) BUILDING MODELS:
Now comes the simple yet crucial step of selecting the algorithms for building the predictive
model. Although the model itself is practically a few lines of code, the underlying statistics
are vast and complex. Due to my limited knowledge of these algorithms, I have used only simple
Logistic Regression and Decision Tree models.
We build the models on our training dataset and get predictions for the test data. We then
compare the predictions with the actual test output and gauge the model using the confusion
matrix, the F1 score and several other metrics; I used only these two.
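A sketch of this step, assuming df is the cleaned DataFrame from the data-cleaning step. Dropping Name and Cabin and filling missing Age values with the median are placeholder assumptions here, not necessarily the report's own handling:

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix, f1_score

# Placeholder cleaning so the models receive purely numeric input:
# drop the free-text columns and fill missing ages with the median (an assumption).
X = df.drop(columns=["Survived", "Name", "Cabin"])
X["Age"] = X["Age"].fillna(X["Age"].median())
y = df["Survived"]

# Hold out part of the data to gauge the models on unseen rows.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

for model in (LogisticRegression(max_iter=1000), DecisionTreeClassifier(random_state=42)):
    model.fit(X_train, y_train)                 # build the model on the training data
    preds = model.predict(X_test)               # get predictions for the test data
    print(type(model).__name__)
    print(confusion_matrix(y_test, preds))      # compare predictions with the test output
    print("F1 score:", f1_score(y_test, preds))
```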
At every point, we should be careful to avoid overfitting the model. For some algorithms,
hyperparameters are tuned over several iterations to find the best values, and repeatedly
tuning against the test data can overfit the model. In that case we split the dataset into
three sets: train, validation and test.
Giving too much data to the test and validation sets leaves insufficient training data for an
accurate model, while too little test data makes the accuracy check unreliable. A (70, 15, 15)
or (80, 10, 10) split is the usual standard.
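A minimal sketch of a three-way split in roughly (70, 15, 15) proportions, assuming X and y from the previous step (the exact proportions and random_state are assumptions):

```python
from sklearn.model_selection import train_test_split

# First carve off ~70% for training, then split the remaining 30% equally
# into validation and test sets (~15% each).
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.30, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.50, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # roughly 70% / 15% / 15% of the rows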
Sometimes more features (columns) are needed and sometimes more rows are needed for a better
model. We can tell which case applies by observing certain metrics.
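One such metric (not named in the report, so this is an illustrative assumption) is a learning curve: if the validation score is still improving as the training size grows, more rows are likely to help; if the training and validation scores have converged at a low value, better or additional features may be needed instead.

```python
import numpy as np
from sklearn.model_selection import learning_curve
from sklearn.linear_model import LogisticRegression

# Compute F1 scores on growing fractions of the training data with 5-fold CV.
train_sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5, scoring="f1",
)
print("train F1:     ", train_scores.mean(axis=1))
print("validation F1:", val_scores.mean(axis=1))
```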