Titanic

11/21/21, 10:23 PM titanic
Who survived the titanic?

Chris Flores
Amber Beebe
Sean Wilson
Agustin Garcia
In [1]: # using the Kaggle Titanic data set found here
# https://www.kaggle.com/c/titanic/data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import zscore
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.preprocessing import LabelEncoder

from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeClassifier, export_graphviz ,plot_tree
# no need to split data since kaggle gave us split
test = pd.read_csv("https://raw.githubusercontent.com/flores58c/CST_383_Titanic_
train = pd.read_csv("https://raw.githubusercontent.com/flores58c/CST_383_Titanic
In [2]: train.info()
train.describe
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 891 non-null int64
1 Survived 891 non-null int64
2 Pclass 891 non-null int64
3 Name 891 non-null object
4 Sex 891 non-null object
5 Age 714 non-null float64
6 SibSp 891 non-null int64
7 Parch 891 non-null int64
8 Ticket 891 non-null object
9 Fare 891 non-null float64
10 Cabin 204 non-null object
11 Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
Out[2]: <bound method NDFrame.describe of PassengerId Survived Pclass \
0 1 0 3
1 2 1 1
2 3 1 3
3 4 1 1
4 5 0 3
.. ... ... ...
localhost:8888/nbconvert/html/CST_383_Titanic_Project/titanic.ipynb?download=false 1/13
11/21/21, 10:23 PM titanic
886 887 0 2
887 888 1 1
888 889 0 3
889 890 1 1
890 891 0 3
Name Sex Age SibSp \
0 Braund, Mr. Owen Harris male 22.0 1
1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1
2 Heikkinen, Miss. Laina female 26.0 0
3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1
4 Allen, Mr. William Henry male 35.0 0
.. ... ... ... ...
886 Montvila, Rev. Juozas male 27.0 0
887 Graham, Miss. Margaret Edith female 19.0 0
888 Johnston, Miss. Catherine Helen "Carrie" female NaN 1
889 Behr, Mr. Karl Howell male 26.0 0
890 Dooley, Mr. Patrick male 32.0 0
Parch Ticket Fare Cabin Embarked
0 0 A/5 21171 7.2500 NaN S
1 0 PC 17599 71.2833 C85 C
2 0 STON/O2. 3101282 7.9250 NaN S
3 0 113803 53.1000 C123 S
4 0 373450 8.0500 NaN S
.. ... ... ... ... ...
886 0 211536 13.0000 NaN S
887 0 112053 30.0000 B42 S
888 2 W./C. 6607 23.4500 NaN S
889 0 111369 30.0000 C148 C
890 0 370376 7.7500 NaN Q
[891 rows x 12 columns]>
Predictions
We will be predicting survivability based on categorical data of Pclass, Sex, Age. We will use
machine learning methods of knn and or linear regression.
Data Wrangling
In [3]: #find null/na values
train.isna().sum()
Out[3]: PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 177
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2
dtype: int64
In [4]: #Initial preprocessing and cleaning
11/21/21, 10:23 PM titanic
# we can remove cabin column too much NaN values
train.drop("Cabin",axis=1)
#not sure if we should replace some nan columns in age with mean values
Out[4]: PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare E
Braund,
0 1 0 3 Mr. Owen male 22.0 1 0 A/5 21171 7.2500
Harris
Cumings,
Mrs. John
1 2 1 1 Bradley female 38.0 1 0 PC 17599 71.2833
(Florence
Briggs
Th...
Heikkinen,
2 3 1 3 Miss. female 26.0 0 0 STON/O2.
3101282 7.9250
Laina
Futrelle,
Mrs.
3 4 1 1 Jacques female 35.0 1 0 113803 53.1000
Heath
(Lily May
Peel)
Allen, Mr.
4 5 0 3 William male 35.0 0 0 373450 8.0500
Henry
... ... ... ... ... ... ... ... ... ... ...
Montvila,
886 887 0 2 Rev. male 27.0 0 0 211536 13.0000
Juozas
Graham,
887 888 1 1 Miss. female 19.0 0 0 112053 30.0000
Margaret
Edith
Johnston,
Miss. W./C. 23.4500
888 889 0 3 Catherine female NaN 1 2 6607
Helen
"Carrie"
Behr, Mr.
889 890 1 1 Karl male 26.0 0 0 111369 30.0000
Howell
Dooley,
890 891 0 3 Mr. male 32.0 0 0 370376 7.7500
Patrick
891 rows × 11 columns
In [5]: #changing Sex column to binary 1=male 0=female
le = LabelEncoder()
train["Sex"]=le.fit_transform(train["Sex"])
11/21/21, 10:23 PM titanic
In [6]: #Taking average, rounded age of only males in data. Will possibly import to N/A
male_age = train['Sex'] == 1
mean_for_male_age = np.round(train.loc[male_age, 'Age'].mean())
mean_for_male_age
Out[6]: 31.0
In [7]: #Taking average, rounded age of only females in data. Will possibly import to N/
female_age = train['Sex'] == 0
mean_for_female_age = np.round(train.loc[female_age, 'Age'].mean())
mean_for_female_age
Out[7]: 28.0
In [8]: train.loc[(train.Age.isnull()) & (train.Sex==0),'Age']= mean_for_female_age
train.loc[(train.Age.isnull()) & (train.Sex==1),'Age']= mean_for_male_age
#check if columns filled
train["Age"].isnull().sum()
Out[8]: 0
Data Exploration
In [9]: # data discovery
# Survived: 0 = No, 1 = Yes
test.size == train.size
train.columns
Out[9]: Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
dtype='object')
In [10]: #boxplot of men and women who survived
sns.boxplot('Sex', 'Survived', data=train, color="red")
/Users/abeebe/opt/anaconda3/lib/python3.8/site-packages/seaborn/_decorators.py:3
6: FutureWarning: Pass the following variables as keyword args: x, y. From versi
on 0.12, the only valid positional argument will be `data`, and passing other ar
guments without an explicit keyword will result in an error or misinterpretatio
n.
warnings.warn(
Out[10]: <AxesSubplot:xlabel='Sex', ylabel='Survived'>
11/21/21, 10:23 PM titanic
In [11]: # The dimensions we will be using are Survived, Sex, Pclass, & Age
# the largest number of passengers were in third class
train.groupby('Pclass').size().plot.bar()
Out[11]: <AxesSubplot:xlabel='Pclass'>
In [12]: # More people died than survived
# what was the percentage of people who died vs. survived
# train[(train.Survived == 0)]
# train['Survived'].size OR len(train)
perc_died = round(train[(train.Survived == 0)]['Survived'].size / len(train), 2)

perc_survived = round(train[(train.Survived == 1)]['Survived'].size / len(train)
print("Percentage Survived: " + str(perc_survived) + "%\nPercentage Died: " + st
# train.groupby('Survived').size().plot.bar()
Percentage Survived: 0.38%
Percentage Died: 0.62%
In [13]: pd.crosstab(train.Survived, train.Sex)
Out[13]: Sex 0 1
Survived
0 81 468
1 233 109
11/21/21, 10:23 PM titanic
In [14]: pd.crosstab(train.Survived, train.Pclass)
Out[14]: Pclass 1 2 3
Survived
0 80 97 372
1 136 87 119
In [15]: # What is the distribution of ages
import seaborn as sns
sns.distplot(train.Age)
/Users/abeebe/opt/anaconda3/lib/python3.8/site-packages/seaborn/distributions.p
y:2551: FutureWarning: `distplot` is a deprecated function and will be removed i
n a future version. Please adapt your code to use either `displot` (a figure-lev
el function with similar flexibility) or `histplot` (an axes-level function for
histograms).
warnings.warn(msg, FutureWarning)
Out[15]: <AxesSubplot:xlabel='Age', ylabel='Density'>
In [16]: # Find conditional probability of people under and over 30 who survived
# Under 30: Yes OR No

# train["over_30"] = None
# del train["over_30"]
# train.head
train["Over30"] = None
train.loc[(train.Age <= 30), 'over_30'] = 'yes'
train.loc[(train.Age > 30), 'over_30'] = 'no'
pd.crosstab(train.Survived, train.over_30)
#maybe change yes/no to 1/0
Out[16]: over_30 no yes

Survived
0 289 260
1 140 202
In [17]: # Did men pay more for their fare than women, did older passengers pay less than
11/21/21, 10:23 PM titanic
#sns.scatterplot(data = train, x = "Age", y = "Fare", hue = "Pclass")
sns.scatterplot(data = train, x = "Age", y = "Fare", hue = "Sex")
Out[17]: <AxesSubplot:xlabel='Age', ylabel='Fare'>
In [18]: corM = train.corr()
sns.heatmap(corM, annot = True)
plt.show()
In [19]: sns.regplot(x="Fare",y="Survived",data=train,logistic=True)
Out[19]: <AxesSubplot:xlabel='Fare', ylabel='Survived'>
11/21/21, 10:23 PM titanic
Paying more had a somewhat positive correlation. Not sure if that's the only reason.
In [20]: #grid of scatterplots
sns.pairplot(train)
Out[20]: <seaborn.axisgrid.PairGrid at 0x7faa75f182e0>
11/21/21, 10:23 PM titanic
Machine Learning
In [21]: predictors = ['Age']
target = 'Survived'
X = train[predictors].values
y = train[target].values
#test.csv unsusable and has no Survived data. Needed to split train.csv
X_train ,X_test, y_train,y_test = train_test_split(X,y,test_size =.8,random_stat

reg = LinearRegression()
reg.fit(X_train,y_train)
Out[21]: LinearRegression()
In [22]: plt.scatter(X_train,y_train)
plt.plot(X_test,reg.predict(X_test),linestyle='dashed',color = 'black')
Out[22]: [<matplotlib.lines.Line2D at 0x7faa59fd3130>]
11/21/21, 10:23 PM titanic
Answering the question which age range is most likely to survive. The answer is unclear and
can't be explicitly solved in this plot.
In [23]: def rmse(predicted, actual):
return np.sqrt(((predicted - actual)**2).mean())
rmse(reg.predict(X_test),y_test)
Out[23]: 0.48659958319426705
Looking at our linear model the error is about half for fitting the data. So this would not be a
good representation for our prediction.
In [24]: # Predicting odds of survival based on age, class, and sex, with KNN Regression.

predictors = ['Age','Sex','Pclass']
target = 'Survived'
from sklearn.neighbors import KNeighborsRegressor
reg = KNeighborsRegressor(n_neighbors=20)
reg.fit(X,y)
i = np.linspace(start=.5, stop = 80, num=3000)
plot_f_3 = pd.DataFrame(i, columns=['Age'])
plot_f_3['Sex'] = 0
plot_f_3['Pclass'] = 3
plot_f_2['Sex'] = 0
plot_f_1['Sex'] = 0
plot_m_3 = pd.DataFrame(i, columns=['Age'])
plot_m_3['Sex'] = 1
plot_m_3['Pclass'] = 3
plot_m_2['Sex'] = 1
plot_m_1['Sex'] = 1
11/21/21, 10:23 PM titanic
fig = plt.figure(figsize=(20, 5))
plt1= fig.add_subplot(121)
plt2= fig.add_subplot(122)
fig.suptitle('Odds of Survival Based on Age, Sex, and Class')
plt1.plot(i, reg.predict(plot_f_3), label='Third class female')
plt1.plot(i, reg.predict(plot_f_2) , color='orange', label='Second class female'

plt1.plot(i, reg.predict(plot_f_1) , color='red', label='First class female')
plt2.plot(i, reg.predict(plot_m_3), label='Third class male', color='black')
plt2.plot(i, reg.predict(plot_m_2) , color='purple', label='Second class male')
plt2.plot(i, reg.predict(plot_m_1) , color='green', label='First class male')
plt1.legend()
plt2.legend()
plt1.set_xlabel('Age')
plt1.set_ylabel('Odds of survival')
plt2.set_xlabel('Age')
plt2.set_ylabel('Odds of survival')
plt1.set_title('Females')
plt2.set_title('Males')
Out[24]: Text(0.5, 1.0, 'Males')
In [25]: #using amount paid to predict Age
predictors = ['Fare']
target = 'Age'
reg2 = LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=

reg2.fit(X,y)
plt.scatter(X,y)
plt.plot(X,reg2.predict(X),linestyle='dashed',color = 'black')
Out[25]: [<matplotlib.lines.Line2D at 0x7faa5a9fb760>]
11/21/21, 10:23 PM titanic
The older passengers were more frugal whereas passengers who were in the middle age ranges
paid a bit more.
In [26]: #using amount paid to predict Survivability
predictors = ['Fare']
target = 'Survived'
reg3 = LinearRegression()
reg3.fit(X,y)
plt.scatter(X,y)
plt.plot(X,reg3.predict(X),linestyle='dashed',color = 'black')
Out[26]: [<matplotlib.lines.Line2D at 0x7faa5aacd880>]
This regression plot shows there is a positive correlation with paying more and surviving. This is
because first class had a closer access to the life boats on the titanic.
In [27]: predictors = ['Age']
target = 'Survived'
X_train ,X_test, y_train,y_test = train_test_split(X,y,test_size =.8,random_stat

reg = DecisionTreeClassifier(max_depth=4)
reg.fit(X_train, y_train)
plot_tree(reg)
11/21/21, 10:23 PM titanic
Out[27]: [Text(111.60000000000001, 195.696, 'X[0] <= 1.415\ngini = 0.464\nsamples = 178\n

value = [113, 65]'),
Text(74.4, 152.208, 'gini = 0.0\nsamples = 2\nvalue = [0, 2]'),
Text(148.8, 152.208, 'X[0] <= 30.5\ngini = 0.46\nsamples = 176\nvalue = [113, 6

3]'),
Text(74.4, 108.72, 'X[0] <= 4.5\ngini = 0.485\nsamples = 87\nvalue = [51, 3
6]'),
Text(111.60000000000001, 65.232, 'X[0] <= 13.5\ngini = 0.49\nsamples = 84\nvalu

e = [48, 36]'),
Text(223.20000000000002, 108.72, 'X[0] <= 31.5\ngini = 0.423\nsamples = 89\nval

ue = [62, 27]'),
Text(260.40000000000003, 65.232, 'X[0] <= 60.5\ngini = 0.476\nsamples = 64\nval

ue = [39, 25]'),
Text(223.20000000000002, 21.744, 'gini = 0.484\nsamples = 61\nvalue = [36, 2

5]'),
Text(297.6, 21.744, 'gini = 0.0\nsamples = 3\nvalue = [3, 0]')]
In [ ]:

Titanic

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Titanic

Uploaded by

Copyright:

Available Formats

11/21/21, 10:23 PM titanic

Who survived the titanic?

import matplotlib.pyplot as plt

from scipy.stats import zscore

import matplotlib.pyplot as plt

import seaborn as sns

from sklearn.model_selection import train_test_split

from sklearn.linear_model import LinearRegression, LogisticRegression

from sklearn.preprocessing import LabelEncoder

from sklearn.tree import DecisionTreeClassifier, export_graphviz ,plot_tree

# no need to split data since kaggle gave us split

RangeIndex: 891 entries, 0 to 890

Data columns (total 12 columns):

# Column Non-Null Count Dtype

--- ------ -------------- -----

0 PassengerId 891 non-null int64

1 Survived 891 non-null int64

2 Pclass 891 non-null int64

3 Name 891 non-null object

4 Sex 891 non-null object

5 Age 714 non-null float64

6 SibSp 891 non-null int64

7 Parch 891 non-null int64

8 Ticket 891 non-null object

9 Fare 891 non-null float64

10 Cabin 204 non-null object

11 Embarked 889 non-null object

dtypes: float64(2), int64(5), object(5)

memory usage: 83.7+ KB

Out[2]: <bound method NDFrame.describe of PassengerId Survived Pclass \

.. ... ... ...

Name Sex Age SibSp \

0 Braund, Mr. Owen Harris male 22.0 1

1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1

2 Heikkinen, Miss. Laina female 26.0 0

3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1

4 Allen, Mr. William Henry male 35.0 0

.. ... ... ... ...

886 Montvila, Rev. Juozas male 27.0 0

887 Graham, Miss. Margaret Edith female 19.0 0

888 Johnston, Miss. Catherine Helen "Carrie" female NaN 1

889 Behr, Mr. Karl Howell male 26.0 0

890 Dooley, Mr. Patrick male 32.0 0

Parch Ticket Fare Cabin Embarked

0 0 A/5 21171 7.2500 NaN S

1 0 PC 17599 71.2833 C85 C

2 0 STON/O2. 3101282 7.9250 NaN S

3 0 113803 53.1000 C123 S

4 0 373450 8.0500 NaN S

.. ... ... ... ... ...

886 0 211536 13.0000 NaN S

887 0 112053 30.0000 B42 S

888 2 W./C. 6607 23.4500 NaN S

889 0 111369 30.0000 C148 C

890 0 370376 7.7500 NaN Q

[891 rows x 12 columns]>

In [4]: #Initial preprocessing and cleaning

mean_for_male_age = np.round(train.loc[male_age, 'Age'].mean())

mean_for_female_age = np.round(train.loc[female_age, 'Age'].mean())

In [8]: train.loc[(train.Age.isnull()) & (train.Sex==0),'Age']= mean_for_female_age

train.loc[(train.Age.isnull()) & (train.Sex==1),'Age']= mean_for_male_age

#check if columns filled

Out[9]: Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',