
A REPORT

ON 

Titanic - Machine Learning from Disaster


BY

V Ankith Reddy
20STUCHH010202

AT 

IIIT (Hyderabad)

An Internship Program-I

                                          Department of CSE 


      Icfaitech (Deemed to be University)
 HYDERABAD
 JUNE 2022

Faculty of Science & Technology, IFHE Hyderabad 

Work Station : IIIT (Hyderabad)                                     Center : Hyderabad 

Duration : 2 months                                               Date of start : 30/05/2022 

Date of Submission : 27/07/2022 

Title of the project : Titanic - Machine Learning from Disaster;

Delhi Climate (Time Series)

ID number : 20STUCHH010202

Discipline of student : DS&AI

Name and Designation of the expert : Dr. Bapi Raju (Professor, IIIT Hyderabad)

Name of the IP faculty : Dr. D.V. Nair

PREFACE
The B-Tech program is a well-structured and comprehensive technical studies program.
The primary goal of practical training at the B-Tech level is to help students acquire
abilities by supplementing the academic study of technical courses in general. The
Internship Program assists in gaining real-world knowledge about the IT industry and
provides a good introduction to the modern IT world. The B-Tech degree offers a broad
curriculum divided into specialties that give a practical understanding of engineering
principles. Training is an essential component of every professional course. In
classroom training, professors provide us with theoretical knowledge of many areas;
only through practical training did I learn what a company is, how it operates, and
what activities are carried out in its various departments, which will benefit me in
the future when I enter the practical sector. The Internship Program is a vital element
of B-Tech, and each student is required to train at a firm for two months and then
write a project report on the work done during the training. Throughout the course, I
gained a lot of experience and learned about real-life management methods and how they
differ from theoretical knowledge. Theoretical knowledge is insufficient in today's
globalized world, where market rivalry is fierce. Aside from that, one must have
practical knowledge that will aid an individual in his or her work activities, and it
is true that "Experience is the best teacher".

ACKNOWLEDGMENT:
I would like to offer my heartfelt gratitude to Dr. Bapi Raju and express my
appreciation for his timely comments during my assignment. I also offer my gratitude
to Mr. Madhukar Dwivedi and Mr. Krishna, who deserve my respect and admiration; they
were always there to assist me during the course of my internship. I would like to
express my heartfelt gratitude to Dr. D.V. Nair (faculty mentor) for his encouragement
and valuable assistance in shaping the current form of my work, as well as to those
who extended their kind assistance in completing my internship. I appreciate
everyone's assistance and direction in finishing my assignment. This report's result
required a great deal of direction and cooperation from many people, and I consider
myself quite fortunate to have received this throughout the whole process. Only with
such guidance and support have I accomplished everything.

ABSTRACT

The first project assigned was a moderately difficult data set from a Kaggle
competition in which our team put to the test the skills learned during the training
phase prior to the project. The prerequisites were a Python course (pandas & NumPy),
an introduction to machine learning, feature engineering, and the core algorithms with
their mathematical intuition. The problem statement was to create an ML model to
predict the survivability of a passenger in the Titanic shipwreck. The aforementioned
techniques were used to import the data set, clean corrupt data, preprocess it, and
train an efficient prediction model, which yielded good results on the competition
leaderboard, placing in the top 5 percent.

To understand the algorithms clearly, we were assigned a task to replicate
scikit-learn algorithms by building one from scratch in Python and implementing it on
our respective datasets. Our team built Linear Regression from scratch.

The second assigned project was based on a time series data set, where data is
recorded over a period of time and used to predict the future with forecasting
methods. The dataset worked upon was 'Delhi Climate Time Series' and the goal was to
build a forecasting model. Two approaches were followed: one with simple regression
techniques and the other using SARIMA, which is a more complex approach. The model
using simple regression techniques is ready, while the SARIMA model has yet to be
finalized.

TABLE OF CONTENTS

Title                                                                                                      Page no.

Preface….………..…………………………………………………........3

Acknowledgment...….………………………………………………..….4

Abstract…………...……………………………………………………...5

Table of contents …………………………………………………….…..6

Introduction ……………………………………………………...............7

Project 1………….……………………………………………………....8

1.0 Introduction ……….………………………………………………....8

1.1 Methodology.....……………………………………………………...11

1.2 Results………………………………………….................................17

Project 2 …………..……………………………………...............….....17

2.0 Introduction……………………………………………………….....18

2.1 Methodology………………………………………………..............…19

 2.2 Results ………..……………………………………………….....…21

Conclusion …………………………………………………………........22

References/Bibliography …………………………………………...….23

INTRODUCTION

The internship at IIIT was mainly focused on learning and implementing machine
learning.

Introduction to the Machine Learning Project: Machine learning is a subfield of
computer science that arose from the study of pattern detection in data and the
prediction of future or unknown events. In this process we primarily worked on
datasets from Kaggle and implemented various machine learning models to try out and
test our skills in machine learning.

There were a few prerequisites which had to be learnt for machine learning and which
were assigned to us in the first few weeks of our internship. The prerequisites for
creating efficient machine learning models used for predictions are as follows:

1) Python – the basic and most popular programming language used for machine
learning, along with the libraries created for machine learning algorithms.

2) NumPy and Pandas – NumPy is a Python library used to handle numerical data, and
Pandas is a Python library with many functions used to work on datasets.

3) Data Visualisation – Data visualisation mainly focuses on plotting the information
in datasets for a better understanding of the given data. Some of the most commonly
used plots in the projects were bar and line plots, scatter plots, histograms and box
plots.

4) Introduction to machine learning – In the introduction to machine learning, all the
basic topics and algorithms used to build models were covered. Learning the algorithms
and the mathematical intuition behind them was assigned to us for better
implementation of the algorithms in our models. The various algorithms learnt during
the course as prerequisites for machine learning are as follows:

- Linear Regression
- Logistic Regression
- Decision trees and Random forests
- K-means clustering
- K-nearest neighbours
- Support vector machines

The above are the groups of machine learning algorithms used for classification,
regression and clustering of given data to predict outcomes.

5) Feature Engineering – Feature engineering is a data pre-processing stage which
helps in cleaning the data before training and testing the machine learning models,
for efficient and reliable predictions. Most data in its raw form is impure and
corrupted, and requires a lot of pre-processing before it can be loaded into machine
learning models efficiently; this is done with the help of various feature engineering
techniques, such as changing data into an accepted format, assigning values based on
the mean, median or mode for NaN values, using various encoding techniques for strings
in the datasets, and so on (a small illustrative sketch follows below).
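
A minimal, illustrative sketch of these feature-engineering ideas with pandas; the
DataFrame and column names below are hypothetical examples, not taken from the project
datasets.

    import pandas as pd

    # Hypothetical raw data with a date string, a missing numeric value and a string category
    df = pd.DataFrame({
        "date": ["2022-01-01", "2022-01-02", None],
        "price": [100.0, None, 120.0],
        "city": ["Delhi", "Mumbai", "Delhi"],
    })

    df["date"] = pd.to_datetime(df["date"])               # convert strings into an accepted format
    df["price"] = df["price"].fillna(df["price"].mean())  # fill NaN values with the column mean
    df = pd.get_dummies(df, columns=["city"])             # encode a string column as dummy variables
    print(df)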

Project – 1
Titanic machine learning project

Our first project concerns the unfortunate sinking of the Titanic. The starboard
side of the ship struck an iceberg, creating a cavity below the waterline. The hull
was not punctured by the chunk of ice; rather, the hull's seams buckled and
separated, permitting water to surge in and leading to the tragic deaths of many of
the individuals who were present on the ship.

While there was an element of good fortune involved in surviving, the statistical
data suggests that some groups of individuals were more likely to survive than
others. In this competition, we were asked to address the following question:

"What sorts of individuals were more likely to survive?"

OBJECTIVE OF THE STUDY

We have prepared this report with two purposes.

A. Primary objective:

The report aims to provide information about the data and the mathematical models
used for predicting the survivability of Titanic passengers and for forecasting
Delhi's climate.

B. Secondary objective:

This report mainly concentrates on the pre-processing of data, as it is a very
crucial factor in any machine learning project. In this report each and every step of
the work is explained in detail.

SCOPE OF STUDY

There is a certain boundary that this report covers. It concentrates only on the
areas of work we are currently doing in machine learning: it mainly focuses on the
Titanic competition and the Delhi climate time series, and on the development of and
modifications to the machine learning models. The information is mainly collected
from experience gained during the internship at IIIT (Hyderabad).

Main text
Case study - 1

1.1 Data set for the Titanic

The data has been split into two groups:


1. training set (train.csv)
2. test set (test.csv)

Here the training data was used to train our machine learning models, and we
constructed models based on the following features:

1. Pclass describes the ticket class

a) 1st = Upper

b) 2nd = Middle

c) 3rd = Lower

2. Sex tells the gender of the passenger

3. Age gives the age of the passenger in years

4. SibSp counts siblings / spouses aboard the Titanic. The dataset defines family
relations in this way...

a) Sibling = brother, sister, stepbrother, stepsister

b) Spouse = husband, wife (mistresses and fiancés were ignored)

5. Parch counts parents / children aboard the Titanic. The dataset defines family
relations in this way...

a) Parent = mother, father

b) Child = daughter, son, stepdaughter, stepson

Some children travelled only with a nanny, therefore Parch = 0 for them.

6. Fare accounts for the fare paid by the passenger

7. Embarked stands for the port of embarkation, in which

a) C = Cherbourg

b) Q = Queenstown

c) S = Southampton

Methodology

First we inspected the data set manually, based on our understanding of the data.
Then, to reduce noise in the data, a few columns were dropped, namely "PassengerId",
"Name", "Ticket" and "Cabin", from both the test and train data sets. "PassengerId",
"Name" and "Ticket" were dropped because they did not significantly affect the
outcome and had low correlation with it, while "Cabin" had several missing values. A
minimal sketch of this step is shown below.

1.2 Data visualisation


Data visualisation sheds light on data by illustrating its significance in the grand
scheme of things. It shows where particular data points stand in relation to the
larger data picture.

In the next step we plotted bar graphs, pie charts, histograms, heatmaps, line
graphs, etc. using Seaborn, a Python visualisation library. This gave us better
insight into the data and helped us decide on our next steps based on intuitions from
the data visualisation tools (a small plotting sketch follows below).
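
A small plotting sketch, assuming the train DataFrame from the previous step and a
recent version of Seaborn/Matplotlib; the exact plots used in the project may differ.

    import matplotlib.pyplot as plt
    import seaborn as sns

    fig, axes = plt.subplots(1, 3, figsize=(15, 4))
    sns.countplot(x="Survived", hue="Sex", data=train, ax=axes[0])        # bar plot of survival by sex
    sns.histplot(train["Age"].dropna(), bins=30, ax=axes[1])              # age histogram
    sns.heatmap(train.corr(numeric_only=True), annot=True, ax=axes[2])    # correlation heatmap
    plt.tight_layout()
    plt.show()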

1.3 Pre processing

Data preprocessing is the process of preparing raw data for use with a
machine learning model. It is the first and most important stage in
developing a machine learning model.

When developing a machine learning project, we do not always come across clean and
prepared data, and before doing any operation on the data, it must be cleaned and
formatted. As a result, we apply a data preparation step for this.

Real-world data typically contains noise and missing values, and may be in an
unsuitable format that cannot be utilized directly by machine learning models. Data
preprocessing is a necessary job for cleaning the data and preparing it for a machine
learning model, which improves the accuracy and efficiency of the model.

· Sklearn provides label encoding for converting the levels of categorical features
into numeric values. LabelEncoder encodes the labels in the "Sex" column with the
values 0 and 1, converting the strings male and female into numbers on which we can
train our model. (An alternative, one-hot approach would instead create separate
columns for the Male and Female labels, so that whenever the passenger is male, the
value in the Male column is 1 and 0 in the Female column, and vice versa.)

· Predictive algorithms that rely on numerical inputs cannot handle open text fields
or categorical characteristics directly; this information-rich data must be processed
before being presented to a model. Tree-based and Naive Bayes models are exceptions,
but most models require numeric predictors. Later we used the Pandas get_dummies
function to create dummy variables for "Embarked" in our dataset. Dummy variables are
numeric variables that encode the categorical values 'S', 'C' and 'Q'. A dummy
variable can take two values, 0 or 1: 1 indicates that a category is present and 0
indicates that it is not. This led to the addition of three new columns in our
dataset. We needed to perform this step because our machine learning model works only
on numerical values and cannot handle strings.

· pandas.get_dummies() is used for data manipulation; it converts categorical data
into dummy or indicator variables. A small sketch of both encoding steps is shown
below.
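
A minimal sketch of the two encoding steps, assuming the train DataFrame from the
earlier sketches (the same calls would be repeated on the test set, and the missing
"Embarked" values would be filled first, as described in the next section).

    import pandas as pd
    from sklearn.preprocessing import LabelEncoder

    # LabelEncoder: "Sex" becomes a single numeric column (female -> 0, male -> 1)
    le = LabelEncoder()
    train["Sex"] = le.fit_transform(train["Sex"])

    # get_dummies: "Embarked" becomes three indicator columns (Embarked_C, Embarked_Q, Embarked_S)
    train = pd.get_dummies(train, columns=["Embarked"])
    print(train.head())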

Handling the missing values

· The problem of missing values is quite common in many real-life datasets. Missing
values can bias the outcome of machine learning models with noise and inadequate
values, which in turn reduces the accuracy of the model.

· Missing data is defined as a value or data that is not stored (or does not exist)
for some variables in a particular dataset.

The Titanic dataset itself is an example of this: some values are missing in the
"Age" and "Cabin" columns.

In the dataset, white space indicates a missing value. In pandas, missing values are
usually represented by NaN, which stands for Not a Number.

The most common way to replace missing values in a numeric column is to replace them
with the mean. If there are outliers, the mean is not appropriate; in such cases, the
outliers need to be dealt with first.

The "Embarked" column had 3 missing values that were replaced by the value 'S', as
'S' was the most frequent value in the column. Similarly, the missing values in the
"Age" column were filled with the mean value of age in the given data set, as
sketched below.
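
A minimal sketch of this imputation, assuming the train DataFrame loaded earlier (in
practice this step is done before the dummy-encoding of "Embarked" shown above).

    # Fill the 3 missing ports with the most frequent value and missing ages with the mean
    train["Embarked"] = train["Embarked"].fillna("S")
    train["Age"] = train["Age"].fillna(train["Age"].mean())

    print(train[["Embarked", "Age"]].isnull().sum())   # should now report zero missing values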

1.4 Scaling values

The idea behind the StandardScaler is to transform the data so that the mean of the
distribution is 0 and the standard deviation is 1.

For multivariate data, this is done feature-wise, that is, independently for each
column of the data.

Given the distribution of the data, the mean is subtracted from each value in the
dataset and the result is divided by the standard deviation of that column (or of the
whole dataset in the univariate case).

Understanding the mathematics behind it

The features of the input data set vary substantially over their ranges: for example,
the data had binary values for Embarked that were considerably smaller than some
values in the Age column. StandardScaler removes the mean and scales each feature to
unit variance, as sketched below.
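
A minimal sketch of this scaling step, assuming the train DataFrame prepared in the
previous sketches (encoded and imputed) still contains the "Survived" target column.

    from sklearn.preprocessing import StandardScaler

    X_train = train.drop(columns=["Survived"])     # feature matrix
    y_train = train["Survived"]                    # target labels

    # StandardScaler applies z = (x - mean) / std to each column independently
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    # the test features would be transformed with the same fitted scaler: scaler.transform(X_test)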

1.5 Training the model

Much of machine learning is classification: we want to know which class (also known
as group) an observation belongs to. The ability to accurately classify observations
is of great value for a variety of business applications and scientific studies.

Data science offers a wealth of classification algorithms such as logistic
regression, support vector machines, naive Bayes classifiers, and decision trees.
However, near the top of the classification hierarchy is the random forest
classifier.

As the name implies, a random forest consists of a number of individual decision
trees that act as an ensemble. Each tree in the random forest spits out a class
prediction, and the class with the most votes becomes the model's prediction.

The basic concept behind the random forest is simple but powerful: the wisdom of the
crowd. In data science terms, the reasons why the random forest model works so well
are:

Many relatively uncorrelated models (trees) acting as a committee are superior to any
of the individual constituent models.

It is important that the correlation between the models is low. Just as low-correlated
assets come together to form a portfolio that is greater than the sum of its parts,
uncorrelated models can generate more accurate aggregate forecasts than any individual
forecast. The reason for this great effect is that the trees protect each other from
their individual mistakes (as long as they do not all make mistakes in the same
direction). Some trees may be wrong, but many others will be right, which allows the
trees to move in the right direction as a group. Therefore, the requirements for good
performance of a random forest are: the features must carry an actual signal, so that
a model built on them does better than a random guess, and the predictions (and
therefore the errors) of the individual trees should be only slightly correlated with
each other.

Feature randomness: in a normal decision tree, when splitting a node, we consider
every possible feature and choose the one that best separates the observations into
the left and right nodes. In contrast, each tree in a random forest can only select
from a random subset of the features. This forces more variation between the trees in
the model, eventually reducing the correlation between the trees and increasing the
diversity of the ensemble.

Now we train our machine learning model on the train data set and test it on our test
data set. We used a random forest to train on our data as it had better accuracy in
comparison with algorithms like decision trees and logistic regression, which were
tried and tested before the random forest. After training our model we predicted
which passengers survived and stored the result in a new CSV file.

Random forest configuration (a sketch of the training call with this configuration
follows below):

n_estimators = 1000

max_depth = 5

random_state = 100
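
A minimal sketch of the training step with the configuration reported above, assuming
the scaled feature matrix X_train_scaled and labels y_train from the earlier sketches,
and an analogously prepared X_test_scaled for the test file.

    from sklearn.ensemble import RandomForestClassifier

    model = RandomForestClassifier(n_estimators=1000, max_depth=5, random_state=100)
    model.fit(X_train_scaled, y_train)

    predictions = model.predict(X_test_scaled)   # 0/1 survival prediction per passenger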

RESULTS:
Our first model was ready and we were placed at 9200/14,500 on the leaderboard. The
model was complete and working, but this was not the end: in machine learning, a
model can be improved substantially by using different approaches and following
optimization methods to improve its accuracy. Our team did not stop with the initial
result and kept trying to improve the score by exploring new methods and
alternatives. We tried different combinations in the preprocessing, scaling and
modelling steps, and achieved our best result with random forests and the standard
scaler with mean imputation, which placed us at rank 850 out of 14,500+ submissions.

Project 2

Delhi Climate Time Series
2.0 Introduction
In New Delhi, the capital of India, the climate is subtropical, with a very mild and
sunny winter and a very hot season from mid-March to mid-June. The air quality in
Delhi, according to a WHO survey of 1,650 world cities, is the worst of any major
city in the world, and it also affects the districts around Delhi. Air pollution in
India is estimated to kill about 2 million people every year; it is the fifth largest
killer in India. This dataset provides data from 1st January 2013 to 24th April 2017
for the city of Delhi, India. The 4 parameters here are meantemp, humidity,
wind_speed and meanpressure.

2.1 Dataset Description


The data set consists of two CSV files, a train file and a test file (a small loading
sketch follows the column list below).

Train : timeline of the recorded data: 2013-01-01 to 2017-01-01

Test : timeline of the recorded data: 2017-01-01 to 2017-04-24

Shape of the dataset:

Train - 1462 rows × 5 columns
Test - 114 rows × 5 columns

Column descriptions:
● date : date in the format YYYY-MM-DD
● meantemp : mean temperature, averaged over multiple 3-hour intervals in a day
● humidity : humidity value for the day (units are grams of water vapour per cubic
metre of air)
● wind_speed : wind speed measured in kmph
● meanpressure : pressure reading of the weather (measured in atm)
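
A minimal sketch of loading the two files; the file names below follow the Kaggle
dataset and may need to be adjusted.

    import pandas as pd

    train = pd.read_csv("DailyDelhiClimateTrain.csv", parse_dates=["date"], index_col="date")
    test = pd.read_csv("DailyDelhiClimateTest.csv", parse_dates=["date"], index_col="date")

    # roughly (1462, 4) and (114, 4) once the date column becomes the index
    print(train.shape, test.shape)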

2.2 Methodology

2.21 Data Visualisation

We plotted various graphs and charts using Seaborn and Matplotlib to get an overview
of the information in the dataset and to look at the various features in the data set
over the given time period. Here, the X axis is time and the Y axis is the respective
feature.

Fig 1 : Temperature over the time period

Fig 2 : Humidity over the time period

Fig 3 : Wind Speed over the time period

Observations Made
● From the above plots, it is clearly evident that meantemp, the mean temperature
recorded for each day, follows seasonality. Seasonality is a regular, periodic change
in the mean of the series. It is a fact that temperature changes from season to
season, so the results from the graphs are unsurprising.
● Similar to meantemp, the graphs of the other features, wind_speed and humidity,
show seasonality.

2.22 Data Pre-Processing


- The data is given as two CSV files, test and train; both of them have been imported
and examined for processing.
- The NaN values have been checked for, but none were found in the dataset.
- The data was clean.
- No corrupt or duplicate data was found.

2.23 Data Modelling

Linear Regression
In this method, simple linear regression is performed on the dataset: each value is
predicted as a linear combination of the values of the previous 80 days, as sketched
below.
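
A minimal sketch of this lag-feature approach for the meantemp column, assuming the
train DataFrame from the loading sketch above.

    import numpy as np
    from sklearn.linear_model import LinearRegression

    LAGS = 80
    values = train["meantemp"].to_numpy()

    # Each row holds the previous 80 days; the target is the following day's value
    X = np.array([values[i:i + LAGS] for i in range(len(values) - LAGS)])
    y = values[LAGS:]

    reg = LinearRegression().fit(X, y)
    next_day = reg.predict(values[-LAGS:].reshape(1, -1))   # one-step-ahead forecast
    print(next_day)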

Stationarity:
Before applying any statistical model to a time series, the series has to be
stationary, which means that, over different time periods,
a) it should have a constant mean,
b) it should have a constant variance or standard deviation, and
c) the auto-covariance should not depend on time.

ARIMA model:
ARIMA (Auto Regressive Integrated Moving Average) combines two models, AR (Auto
Regressive) and MA (Moving Average), applied to a differenced (integrated) series;
SARIMA extends this with seasonal terms (a hedged sketch follows below).

Rolling Statistics:
Plot the moving average or moving standard deviation to see whether it varies with
time. It is a visual technique.

ADF Test:
The Augmented Dickey-Fuller (ADF) test gives us various values that help in
identifying stationarity (a sketch of both checks follows below).
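
A minimal sketch of both stationarity checks, assuming the train DataFrame from the
loading sketch above.

    import matplotlib.pyplot as plt
    from statsmodels.tsa.stattools import adfuller

    series = train["meantemp"]

    # Rolling statistics: the rolling mean/std should not drift over time for a stationary series
    series.plot(label="meantemp")
    series.rolling(window=12).mean().plot(label="rolling mean")
    series.rolling(window=12).std().plot(label="rolling std")
    plt.legend()
    plt.show()

    # ADF test: a p-value below ~0.05 suggests the series is stationary
    adf_stat, p_value = adfuller(series.dropna())[:2]
    print("ADF statistic:", adf_stat, "p-value:", p_value)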

2.3 RESULTS

Forecasting graph: orange - real values, blue - predicted values

2.4 Conclusion

This report explains how I have learnt about machine learning throughout my
internship. I am confident in saying that my internship in Cognitive Science at IIIT
gave me a lot of valuable learning experience. I gained knowledge of machine learning
technology and how to use it effectively. In the projects above, I learned how to
apply data pre-processing, data visualisation, and machine learning. In order to
fully comprehend the algorithms, we also delved deeply into their mathematics and
undertook the process of building an algorithm from scratch. In addition, we worked
on several datasets as our main projects over the course of two months; all of this
has enabled us to take our first steps into the world of machine learning.

REFERENCES

1) https://www.kaggle.com/learn/python
2) https://www.kaggle.com/learn/pandas
3) https://www.kaggle.com/learn/data-visualization
4) https://www.kaggle.com/learn/intro-to-machine-learning
5) https://www.kaggle.com/datasets/sumanthvrao/daily-climate-time-series-data
6) https://www.kaggle.com/c/titanic
7) https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
8) https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression
9) https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier
10) https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

