{

"cells": [
{
"metadata": {
"_cell_guid": "ea25cdf7-bdbc-3cf1-0737-bc51675e3374",
"_uuid": "92c2e9a9a67334c04c6cbe789a8372e5a80b2365"
},
"cell_type": "markdown",
"source": "# Titanic Data Science Solutions\n\n---\n\n### I have released a
new Python package [Speedml](https://speedml.com) which codifies the techniques
used in this notebook into an intuitive, powerful, and productive API. \n\n###
Speedml helps me jump from low 80% on the Kaggle leaderboard to high 20% within few
iterations.\n\n### One more thing... Speedml achieves this with nearly 70% fewer
lines of code!\n\n### Run and download the [Titanic Solution using Speedml]
(https://github.com/Speedml/notebooks/blob/master/titanic/titanic-solution-using-
speedml.ipynb).\n\n---\n\nThis notebook is a companion to the book [Data Science
Solutions](https://startupsci.com). The notebook walks us through a typical
workflow for solving data science competitions at sites like Kaggle.\n\nThere are
several excellent notebooks to study data science competition entries. However many
will skip some of the explanation on how the solution is developed as these
notebooks are developed by experts for experts. The objective of this notebook is
to follow a step-by-step workflow, explaining each step and rationale for every
decision we take during solution development.\n\n## Workflow stages\n\nThe
competition solution workflow goes through seven stages described in the Data
Science Solutions book.\n\n1. Question or problem definition.\n2. Acquire training
and testing data.\n3. Wrangle, prepare, cleanse the data.\n4. Analyze, identify
patterns, and explore the data.\n5. Model, predict and solve the problem.\n6.
Visualize, report, and present the problem solving steps and final solution.\n7.
Supply or submit the results.\n\nThe workflow indicates general sequence of how
each stage may follow the other. However there are use cases with exceptions.\n\n-
We may combine mulitple workflow stages. We may analyze by visualizing data.\n-
Perform a stage earlier than indicated. We may analyze data before and after
wrangling.\n- Perform a stage multiple times in our workflow. Visualize stage may
be used multiple times.\n- Drop a stage altogether. We may not need supply stage to
productize or service enable our dataset for a competition.\n\n\n## Question and
problem definition\n\nCompetition sites like Kaggle define the problem to solve or
questions to ask while providing the datasets for training your data science model
and testing the model results against a test dataset. The question or problem
definition for Titanic Survival competition is [described here at Kaggle]
(https://www.kaggle.com/c/titanic).\n\n> Knowing from a training set of samples
listing passengers who survived or did not survive the Titanic disaster, can our
model determine based on a given test dataset not containing the survival
information, if these passengers in the test dataset survived or not.\n\nWe may
also want to develop some early understanding about the domain of our problem. This
is described on the [Kaggle competition description page here]
(https://www.kaggle.com/c/titanic). Here are the highlights to note.\n\n- On April
15, 1912, during her maiden voyage, the Titanic sank after colliding with an
iceberg, killing 1502 out of 2224 passengers and crew. Translated 32% survival
rate.\n- One of the reasons that the shipwreck led to such loss of life was that
there were not enough lifeboats for the passengers and crew.\n- Although there was
some element of luck involved in surviving the sinking, some groups of people were
more likely to survive than others, such as women, children, and the upper-
class.\n\n## Workflow goals\n\nThe data science solutions workflow solves for seven
major goals.\n\n**Classifying.** We may want to classify or categorize our samples.
We may also want to understand the implications or correlation of different classes
with our solution goal.\n\n**Correlating.** One can approach the problem based on
available features within the training dataset. Which features within the dataset
contribute significantly to our solution goal? Statistically speaking is there a
[correlation](https://en.wikiversity.org/wiki/Correlation) among a feature and
solution goal? As the feature values change does the solution state change as well,
and visa-versa? This can be tested both for numerical and categorical features in
the given dataset. We may also want to determine correlation among features other
than survival for subsequent goals and workflow stages. Correlating certain
features may help in creating, completing, or correcting
features.\n\n**Converting.** For modeling stage, one needs to prepare the data.
Depending on the choice of model algorithm one may require all features to be
converted to numerical equivalent values. So for instance converting text
categorical values to numeric values.\n\n**Completing.** Data preparation may also
require us to estimate any missing values within a feature. Model algorithms may
work best when there are no missing values.\n\n**Correcting.** We may also analyze
the given training dataset for errors or possibly innacurate values within features
and try to corrent these values or exclude the samples containing the errors. One
way to do this is to detect any outliers among our samples or features. We may also
completely discard a feature if it is not contribting to the analysis or may
significantly skew the results.\n\n**Creating.** Can we create new features based
on an existing feature or a set of features, such that the new feature follows the
correlation, conversion, completeness goals.\n\n**Charting.** How to select the
right visualization plots and charts depending on nature of the data and the
solution goals."
},
{
"metadata": {
"_cell_guid": "56a3be4e-76ef-20c6-25e8-da16147cf6d7",
"_uuid": "9079d24e16a56ccd4bb2fb5c8485d9546052f1f2"
},
"cell_type": "markdown",
"source": "## Refactor Release 2017-Jan-29\n\nWe are significantly
refactoring the notebook based on (a) comments received by readers, (b) issues in
porting notebook from Jupyter kernel (2.7) to Kaggle kernel (3.5), and (c) review
of few more best practice kernels.\n\n### User comments\n\n- Combine training and
test data for certain operations like converting titles across dataset to numerical
values. (thanks @Sharan Naribole)\n- Correct observation - nearly 30% of the
passengers had siblings and/or spouses aboard. (thanks @Reinhard)\n- Correctly
interpreting logistic regresssion coefficients. (thanks @Reinhard)\n\n### Porting
issues\n\n- Specify plot dimensions, bring legend into plot.\n\n\n### Best
practices\n\n- Performing feature correlation analysis early in the project.\n-
Using multiple plots instead of overlays for readability."
},
{
"metadata": {
"_cell_guid": "5767a33c-8f18-4034-e52d-bf7a8f7d8ab8",
"_uuid": "11dae061ed4be5400a6781574f55552d024bf5ad",
"trusted": false,
"collapsed": true
},
"cell_type": "code",
"source": "# data analysis and wrangling\nimport pandas as pd\nimport numpy
as np\nimport random as rnd\n\n# visualization\nimport seaborn as sns\nimport
matplotlib.pyplot as plt\n%matplotlib inline\n\n# machine learning\nfrom
sklearn.linear_model import LogisticRegression\nfrom sklearn.svm import SVC,
LinearSVC\nfrom sklearn.ensemble import RandomForestClassifier\nfrom
sklearn.neighbors import KNeighborsClassifier\nfrom sklearn.naive_bayes import
GaussianNB\nfrom sklearn.linear_model import Perceptron\nfrom sklearn.linear_model
import SGDClassifier\nfrom sklearn.tree import DecisionTreeClassifier",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"_cell_guid": "6b5dc743-15b1-aac6-405e-081def6ecca1",
"_uuid": "e88c3e3fb9dbc9aa8b7a7e79cc9966a6b9ed40b3"
},
"cell_type": "markdown",
"source": "## Acquire data\n\nThe Python Pandas packages helps us work with
our datasets. We start by acquiring the training and testing datasets into Pandas
DataFrames. We also combine these datasets to run certain operations on both
datasets together."
},
{
"metadata": {
"_cell_guid": "e7319668-86fe-8adc-438d-0eef3fd0a982",
"_uuid": "49ac413d3421641f524fadb308ddb26f0e9cb29e",
"trusted": false,
"collapsed": true
},
"cell_type": "code",
"source": "train_df = pd.read_csv('../input/train.csv')\ntest_df =
pd.read_csv('../input/test.csv')\ncombine = [train_df, test_df]",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"_cell_guid": "3d6188f3-dc82-8ae6-dabd-83e28fcbf10d",
"_uuid": "02468e8d5b6657437b3b6206d44b5f857af889a6"
},
"cell_type": "markdown",
"source": "## Analyze by describing data\n\nPandas also helps describe the
datasets answering following questions early in our project.\n\n**Which features
are available in the dataset?**\n\nNoting the feature names for directly
manipulating or analyzing these. These feature names are described on the [Kaggle
data page here](https://www.kaggle.com/c/titanic/data)."
},
{
"metadata": {
"_cell_guid": "ce473d29-8d19-76b8-24a4-48c217286e42",
"_uuid": "c466d69893455c373a731172f48045c88229199b",
"trusted": false,
"collapsed": true
},
"cell_type": "code",
"source": "print(train_df.columns.values)",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"_cell_guid": "cd19a6f6-347f-be19-607b-dca950590b37",
"_uuid": "ca6c24182bcfa236e647145eecb80e05b2820940"
},
"cell_type": "markdown",
"source": "**Which features are categorical?**\n\nThese values classify the
samples into sets of similar samples. Within categorical features are the values
nominal, ordinal, ratio, or interval based? Among other things this helps us select
the appropriate plots for visualization.\n\n- Categorical: Survived, Sex, and
Embarked. Ordinal: Pclass.\n\n**Which features are numerical?**\n\nWhich features
are numerical? These values change from sample to sample. Within numerical features
are the values discrete, continuous, or timeseries based? Among other things this
helps us select the appropriate plots for visualization.\n\n- Continous: Age, Fare.
Discrete: SibSp, Parch."
},
{
"metadata": {
"_cell_guid": "8d7ac195-ac1a-30a4-3f3f-80b8cf2c1c0f",
"_uuid": "84c488f861ac5ed7243cc073e11b02f8eec04d94",
"trusted": false,
"collapsed": true
},
"cell_type": "code",
"source": "# preview the data\ntrain_df.head()",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"_cell_guid": "97f4e6f8-2fea-46c4-e4e8-b69062ee3d46",
"_uuid": "38798960607df977d8661123ed656496a8edafdd"
},
"cell_type": "markdown",
"source": "**Which features are mixed data types?**\n\nNumerical,
alphanumeric data within same feature. These are candidates for correcting
goal.\n\n- Ticket is a mix of numeric and alphanumeric data types. Cabin is
alphanumeric.\n\n**Which features may contain errors or typos?**\n\nThis is harder
to review for a large dataset, however reviewing a few samples from a smaller
dataset may just tell us outright, which features may require correcting.\n\n- Name
feature may contain errors or typos as there are several ways used to describe a
name including titles, round brackets, and quotes used for alternative or short
names."
},
{
"metadata": {
"_cell_guid": "f6e761c2-e2ff-d300-164c-af257083bb46",
"_uuid": "ca3035f1a8d2c9b442b4d7b819815f084a20e1cd",
"trusted": false,
"collapsed": true
},
"cell_type": "code",
"source": "train_df.tail()",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"_cell_guid": "8bfe9610-689a-29b2-26ee-f67cd4719079",
"_uuid": "cb70d90c0c173d4fa5f3c4ba1afd291fd3ffbbce"
},
"cell_type": "markdown",
"source": "**Which features contain blank, null or empty values?**\n\nThese
will require correcting.\n\n- Cabin > Age > Embarked features contain a number of
null values in that order for the training dataset.\n- Cabin > Age are incomplete
in case of test dataset.\n\n**What are the data types for various features?
**\n\nHelping us during converting goal.\n\n- Seven features are integer or floats.
Six in case of test dataset.\n- Five features are strings (object)."
},
{
"metadata": {
"_cell_guid": "9b805f69-665a-2b2e-f31d-50d87d52865d",
"_uuid": "2140db34cccc7da212a1176fa66ed197427576b9",
"trusted": false,
"collapsed": true
},
"cell_type": "code",
"source": "train_df.info()\nprint('_'*40)\ntest_df.info()",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"_cell_guid": "859102e1-10df-d451-2649-2d4571e5f082",
"_uuid": "a7e11f7f53e795d42fbe53bcb71555981b56cbbd"
},
"cell_type": "markdown",
"source": "**What is the distribution of numerical feature values across the
samples?**\n\nThis helps us determine, among other early insights, how
representative is the training dataset of the actual problem domain.\n\n- Total
samples are 891 or 40% of the actual number of passengers on board the Titanic
(2,224).\n- Survived is a categorical feature with 0 or 1 values.\n- Around 38%
samples survived representative of the actual survival rate at 32%.\n- Most
passengers (> 75%) did not travel with parents or children.\n- Nearly 30% of the
passengers had siblings and/or spouse aboard.\n- Fares varied significantly with
few passengers (<1%) paying as high as $512.\n- Few elderly passengers (<1%) within
age range 65-80."
},
{
"metadata": {
"_cell_guid": "58e387fe-86e4-e068-8307-70e37fe3f37b",
"_uuid": "4149ea4118e3a5011d37ee8dcfa35c370507326e",
"trusted": false,
"collapsed": true
},
"cell_type": "code",
"source": "train_df.describe()\n# Review survived rate using
`percentiles=[.61, .62]` knowing our problem description mentions 38% survival
rate.\n# Review Parch distribution using `percentiles=[.75, .8]`\n# SibSp
distribution `[.68, .69]`\n# Age and Fare `[.1, .2, .3, .4, .5, .6, .7, .8, .9, .
99]`",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"_cell_guid": "5462bc60-258c-76bf-0a73-9adc00a2f493",
"_uuid": "9e885b46fa913910611755dfd0bd872bbc9c149d"
},
"cell_type": "markdown",
"source": "**What is the distribution of categorical features?**\n\n- Names
are unique across the dataset (count=unique=891)\n- Sex variable as two possible
values with 65% male (top=male, freq=577/count=891).\n- Cabin values have several
dupicates across samples. Alternatively several passengers shared a cabin.\n-
Embarked takes three possible values. S port used by most passengers (top=S)\n-
Ticket feature has high ratio (22%) of duplicate values (unique=681)."
},
{
"metadata": {
"_cell_guid": "8066b378-1964-92e8-1352-dcac934c6af3",
"_uuid": "f5efd99ee95dbb79be3dfa001fba27886eb78e8b",
"trusted": false,
"collapsed": true
},
"cell_type": "code",
"source": "train_df.describe(include=['O'])",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"_cell_guid": "2cb22b88-937d-6f14-8b06-ea3361357889",
"_uuid": "c426c5eb1600367a710dfdca1fe71ed51da5d61c"
},
"cell_type": "markdown",
"source": "### Assumtions based on data analysis\n\nWe arrive at following
assumptions based on data analysis done so far. We may validate these assumptions
further before taking appropriate actions.\n\n**Correlating.**\n\nWe want to know
how well does each feature correlate with Survival. We want to do this early in our
project and match these quick correlations with modelled correlations later in the
project.\n\n**Completing.**\n\n1. We may want to complete Age feature as it is
definitely correlated to survival.\n2. We may want to complete the Embarked feature
as it may also correlate with survival or another important
feature.\n\n**Correcting.**\n\n1. Ticket feature may be dropped from our analysis
as it contains high ratio of duplicates (22%) and there may not be a correlation
between Ticket and survival.\n2. Cabin feature may be dropped as it is highly
incomplete or contains many null values both in training and test dataset.\n3.
PassengerId may be dropped from training dataset as it does not contribute to
survival.\n4. Name feature is relatively non-standard, may not contribute directly
to survival, so maybe dropped.\n\n**Creating.**\n\n1. We may want to create a new
feature called Family based on Parch and SibSp to get total count of family members
on board.\n2. We may want to engineer the Name feature to extract Title as a new
feature.\n3. We may want to create new feature for Age bands. This turns a
continous numerical feature into an ordinal categorical feature.\n4. We may also
want to create a Fare range feature if it helps our
analysis.\n\n**Classifying.**\n\nWe may also add to our assumptions based on the
problem description noted earlier.\n\n1. Women (Sex=female) were more likely to
have survived.\n2. Children (Age<?) were more likely to have survived. \n3. The
upper-class passengers (Pclass=1) were more likely to have survived."
},
{
"metadata": {
"_cell_guid": "6db63a30-1d86-266e-2799-dded03c45816",
"_uuid": "b251704afb0f22b8b2d2bdffee5213b8da736881"
},
"cell_type": "markdown",
"source": "## Analyze by pivoting features\n\nTo confirm some of our
observations and assumptions, we can quickly analyze our feature correlations by
pivoting features against each other. We can only do so at this stage for features
which do not have any empty values. It also makes sense doing so only for features
which are categorical (Sex), ordinal (Pclass) or discrete (SibSp, Parch) type.\n\n-
**Pclass** We observe significant correlation (>0.5) among Pclass=1 and Survived
(classifying #3). We decide to include this feature in our model.\n- **Sex** We
confirm the observation during problem definition that Sex=female had very high
survival rate at 74% (classifying #1).\n- **SibSp and Parch** These features have
zero correlation for certain values. It may be best to derive a feature or a set of
features from these individual features (creating #1)."
},
{
"metadata": {
"_cell_guid": "0964832a-a4be-2d6f-a89e-63526389cee9",
"_uuid": "75750fd796fe610a980795e203fb3dab99e792c9",
"trusted": false,
"collapsed": true
},
"cell_type": "code",
"source": "train_df[['Pclass', 'Survived']].groupby(['Pclass'],
as_index=False).mean().sort_values(by='Survived', ascending=False)",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"_cell_guid": "68908ba6-bfe9-5b31-cfde-6987fc0fbe9a",
"_uuid": "5c9ae750fe45cef099d0a6694c4045c45a4e4e21",
"trusted": false,
"collapsed": true
},
"cell_type": "code",
"source": "train_df[[\"Sex\", \"Survived\"]].groupby(['Sex'],
as_index=False).mean().sort_values(by='Survived', ascending=False)",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"_cell_guid": "01c06927-c5a6-342a-5aa8-2e486ec3fd7c",
"_uuid": "d4e981dc7c3a9bc76f85bace0d8ff82cd7ba85e6",
"trusted": false,
"collapsed": true
},
"cell_type": "code",
"source": "train_df[[\"SibSp\", \"Survived\"]].groupby(['SibSp'],
as_index=False).mean().sort_values(by='Survived', ascending=False)",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"_cell_guid": "e686f98b-a8c9-68f8-36a4-d4598638bbd5",
"_uuid": "d00c2df9ee4941ec2a291438e9f1072a47315abe",
"trusted": false,
"collapsed": true
},
"cell_type": "code",
"source": "train_df[[\"Parch\", \"Survived\"]].groupby(['Parch'],
as_index=False).mean().sort_values(by='Survived', ascending=False)",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"_cell_guid": "0d43550e-9eff-3859-3568-8856570eff76",
"_uuid": "44f0b389a29ad0a56c4e9bdf718d24a54c975879"
},
"cell_type": "markdown",
"source": "## Analyze by visualizing data\n\nNow we can continue confirming
some of our assumptions using visualizations for analyzing the data.\n\n###
Correlating numerical features\n\nLet us start by understanding correlations
between numerical features and our solution goal (Survived).\n\nA histogram chart
is useful for analyzing continous numerical variables like Age where banding or
ranges will help identify useful patterns. The histogram can indicate distribution
of samples using automatically defined bins or equally ranged bands. This helps us
answer questions relating to specific bands (Did infants have better survival
rate?)\n\nNote that x-axis in historgram visualizations represents the count of
samples or passengers.\n\n**Observations.**\n\n- Infants (Age <=4) had high
survival rate.\n- Oldest passengers (Age = 80) survived.\n- Large number of 15-25
year olds did not survive.\n- Most passengers are in 15-35 age
range.\n\n**Decisions.**\n\nThis simple analysis confirms our assumptions as
decisions for subsequent workflow stages.\n\n- We should consider Age (our
assumption classifying #2) in our model training.\n- Complete the Age feature for
null values (completing #1).\n- We should band age groups (creating #3)."
},
{
"metadata": {
"_cell_guid": "50294eac-263a-af78-cb7e-3778eb9ad41f",
"_uuid": "9e8efca6037855f625f1acffd08a307329e3931b",
"trusted": false,
"collapsed": true
},
"cell_type": "code",
"source": "g = sns.FacetGrid(train_df, col='Survived')\ng.map(plt.hist,
'Age', bins=20)",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"_cell_guid": "87096158-4017-9213-7225-a19aea67a800",
"_uuid": "562bfba8df2645a6bbe777f2ae6a1535717b5798"
},
"cell_type": "markdown",
"source": "### Correlating numerical and ordinal features\n\nWe can combine
multiple features for identifying correlations using a single plot. This can be
done with numerical and categorical features which have numeric
values.\n\n**Observations.**\n\n- Pclass=3 had most passengers, however most did
not survive. Confirms our classifying assumption #2.\n- Infant passengers in
Pclass=2 and Pclass=3 mostly survived. Further qualifies our classifying assumption
#2.\n- Most passengers in Pclass=1 survived. Confirms our classifying assumption
#3.\n- Pclass varies in terms of Age distribution of
passengers.\n\n**Decisions.**\n\n- Consider Pclass for model training."
},
{
"metadata": {
"_cell_guid": "916fdc6b-0190-9267-1ea9-907a3d87330d",
"_uuid": "c42012e400481d5fba53ef3dd427c6fae80e85e5",
"trusted": false,
"collapsed": true
},
"cell_type": "code",
"source": "# grid = sns.FacetGrid(train_df, col='Pclass',
hue='Survived')\ngrid = sns.FacetGrid(train_df, col='Survived', row='Pclass',
size=2.2, aspect=1.6)\ngrid.map(plt.hist, 'Age', alpha=.5,
bins=20)\ngrid.add_legend();",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"_cell_guid": "36f5a7c0-c55c-f76f-fdf8-945a32a68cb0",
"_uuid": "3858d19300fe4e176f695914557c7e78c5c87075"
},
"cell_type": "markdown",
"source": "### Correlating categorical features\n\nNow we can correlate
categorical features with our solution goal.\n\n**Observations.**\n\n- Female
passengers had much better survival rate than males. Confirms classifying (#1).\n-
Exception in Embarked=C where males had higher survival rate. This could be a
correlation between Pclass and Embarked and in turn Pclass and Survived, not
necessarily direct correlation between Embarked and Survived.\n- Males had better
survival rate in Pclass=3 when compared with Pclass=2 for C and Q ports. Completing
(#2).\n- Ports of embarkation have varying survival rates for Pclass=3 and among
male passengers. Correlating (#1).\n\n**Decisions.**\n\n- Add Sex feature to model
training.\n- Complete and add Embarked feature to model training."
},
{
"metadata": {
"_cell_guid": "db57aabd-0e26-9ff9-9ebd-56d401cdf6e8",
"_uuid": "c7c3ba27db137130b339c9a5dcfabda793a277ec",
"trusted": false,
"collapsed": true
},
"cell_type": "code",
"source": "# grid = sns.FacetGrid(train_df, col='Embarked')\ngrid =
sns.FacetGrid(train_df, row='Embarked', size=2.2,
aspect=1.6)\ngrid.map(sns.pointplot, 'Pclass', 'Survived', 'Sex',
palette='deep')\ngrid.add_legend()",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"_cell_guid": "6b3f73f4-4600-c1ce-34e0-bd7d9eeb074a",
"_uuid": "3b08460eb9226bf5a7c61fb4468ac5264af38ab6"
},
"cell_type": "markdown",
"source": "### Correlating categorical and numerical features\n\nWe may also
want to correlate categorical features (with non-numeric values) and numeric
features. We can consider correlating Embarked (Categorical non-numeric), Sex
(Categorical non-numeric), Fare (Numeric continuous), with Survived (Categorical
numeric).\n\n**Observations.**\n\n- Higher fare paying passengers had better
survival. Confirms our assumption for creating (#4) fare ranges.\n- Port of
embarkation correlates with survival rates. Confirms correlating (#1) and
completing (#2).\n\n**Decisions.**\n\n- Consider banding Fare feature."
},
{
"metadata": {
"_cell_guid": "a21f66ac-c30d-f429-cc64-1da5460d16a9",
"_uuid": "5d831e95fb0987a49e081824e29ea84ebb97399d",
"trusted": false,
"collapsed": true
},
"cell_type": "code",
"source": "# grid = sns.FacetGrid(train_df, col='Embarked', hue='Survived',
palette={0: 'k', 1: 'w'})\ngrid = sns.FacetGrid(train_df, row='Embarked',
col='Survived', size=2.2, aspect=1.6)\ngrid.map(sns.barplot, 'Sex', 'Fare',
alpha=.5, ci=None)\ngrid.add_legend()",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"_cell_guid": "cfac6291-33cc-506e-e548-6cad9408623d",
"_uuid": "04db104674a3628eb09e53b166fd6076b465ca66"
},
"cell_type": "markdown",
"source": "## Wrangle data\n\nWe have collected several assumptions and
decisions regarding our datasets and solution requirements. So far we did not have
to change a single feature or value to arrive at these. Let us now execute our
decisions and assumptions for correcting, creating, and completing goals.\n\n###
Correcting by dropping features\n\nThis is a good starting goal to execute. By
dropping features we are dealing with fewer data points. Speeds up our notebook and
eases the analysis.\n\nBased on our assumptions and decisions we want to drop the
Cabin (correcting #2) and Ticket (correcting #1) features.\n\nNote that where
applicable we perform operations on both training and testing datasets together to
stay consistent."
},
{
"metadata": {
"_cell_guid": "da057efe-88f0-bf49-917b-bb2fec418ed9",
"_uuid": "adf16b71f2045199e506eb507c1687979159cc91",
"trusted": false,
"collapsed": true
},
"cell_type": "code",
"source": "print(\"Before\", train_df.shape, test_df.shape, combine[0].shape,
combine[1].shape)\n\ntrain_df = train_df.drop(['Ticket', 'Cabin'], axis=1)\ntest_df
= test_df.drop(['Ticket', 'Cabin'], axis=1)\ncombine = [train_df,
test_df]\n\n\"After\", train_df.shape, test_df.shape, combine[0].shape,
combine[1].shape",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"_cell_guid": "6b3a1216-64b6-7fe2-50bc-e89cc964a41c",
"_uuid": "bf76358110f06e16f5ef5f53b42610b7f6724a8e"
},
"cell_type": "markdown",
"source": "### Creating new feature extracting from existing\n\nWe want to
analyze if Name feature can be engineered to extract titles and test correlation
between titles and survival, before dropping Name and PassengerId features.\n\nIn
the following code we extract Title feature using regular expressions. The RegEx
pattern `(\\w+\\.)` matches the first word which ends with a dot character within
Name feature. The `expand=False` flag returns a
DataFrame.\n\n**Observations.**\n\nWhen we plot Title, Age, and Survived, we note
the following observations.\n\n- Most titles band Age groups accurately. For
example: Master title has Age mean of 5 years.\n- Survival among Title Age bands
varies slightly.\n- Certain titles mostly survived (Mme, Lady, Sir) or did not
(Don, Rev, Jonkheer).\n\n**Decision.**\n\n- We decide to retain the new Title
feature for model training."
},
{
"metadata": {
"_cell_guid": "df7f0cd4-992c-4a79-fb19-bf6f0c024d4b",
"_uuid": "7820828858f31fefff56f7653ce639aad989c9c2",
"trusted": false,
"collapsed": true
},
"cell_type": "code",
"source": "for dataset in combine:\n dataset['Title'] =
dataset.Name.str.extract(' ([A-Za-z]+)\\.',
expand=False)\n\npd.crosstab(train_df['Title'], train_df['Sex'])",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"_cell_guid": "908c08a6-3395-19a5-0cd7-13341054012a",
"_uuid": "31a625f71481d6a846cd6548d52bd9541776dde0"
},
"cell_type": "markdown",
"source": "We can replace many titles with a more common name or classify
them as `Rare`."
},
{
"metadata": {
"_cell_guid": "553f56d7-002a-ee63-21a4-c0efad10cfe9",
"_uuid": "cda34b7ef1d484d6e0ff790f99aacd173c8e9204",
"trusted": false,
"collapsed": true
},
"cell_type": "code",
"source": "for dataset in combine:\n dataset['Title'] =
dataset['Title'].replace(['Lady', 'Countess','Capt', 'Col',\\\n \t'Don', 'Dr',
'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')\n\n dataset['Title'] =
dataset['Title'].replace('Mlle', 'Miss')\n dataset['Title'] =
dataset['Title'].replace('Ms', 'Miss')\n dataset['Title'] =
dataset['Title'].replace('Mme', 'Mrs')\n \ntrain_df[['Title',
'Survived']].groupby(['Title'], as_index=False).mean()",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"_cell_guid": "6d46be9a-812a-f334-73b9-56ed912c9eca",
"_uuid": "46a21d2e1519724cfede846f1bcd5e6e33ba5629"
},
"cell_type": "markdown",
"source": "We can convert the categorical titles to ordinal."
},
{
"metadata": {
"_cell_guid": "67444ebc-4d11-bac1-74a6-059133b6e2e8",
"_uuid": "90122df332085f5e974d1174bd100acb2c663f82",
"trusted": false,
"collapsed": true
},
"cell_type": "code",
"source": "title_mapping = {\"Mr\": 1, \"Miss\": 2, \"Mrs\": 3, \"Master\":
4, \"Rare\": 5}\nfor dataset in combine:\n dataset['Title'] =
dataset['Title'].map(title_mapping)\n dataset['Title'] =
dataset['Title'].fillna(0)\n\ntrain_df.head()",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"_cell_guid": "f27bb974-a3d7-07a1-f7e4-876f6da87e62",
"_uuid": "fd88195031947ed3a9f10a324053f3449dedcedd"
},
"cell_type": "markdown",
"source": "Now we can safely drop the Name feature from training and testing
datasets. We also do not need the PassengerId feature in the training dataset."
},
{
"metadata": {
"_cell_guid": "9d61dded-5ff0-5018-7580-aecb4ea17506",
"_uuid": "0241842c12fe1d06d1197e2ff65f89415c3b8d1b",
"trusted": false,
"collapsed": true
},
"cell_type": "code",
"source": "train_df = train_df.drop(['Name', 'PassengerId'], axis=1)\ntest_df
= test_df.drop(['Name'], axis=1)\ncombine = [train_df, test_df]\ntrain_df.shape,
test_df.shape",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"_cell_guid": "2c8e84bb-196d-bd4a-4df9-f5213561b5d3",
"_uuid": "72fd8ab697990725621449dae525603625092e5d"
},
"cell_type": "markdown",
"source": "### Converting a categorical feature\n\nNow we can convert
features which contain strings to numerical values. This is required by most model
algorithms. Doing so will also help us in achieving the feature completing
goal.\n\nLet us start by converting Sex feature to a new feature called Gender
where female=1 and male=0."
},
{
"metadata": {
"_cell_guid": "c20c1df2-157c-e5a0-3e24-15a828095c96",
"_uuid": "ea3c1dbf9ce217e6b481a17509ad55b23e4900f8",
"trusted": false,
"collapsed": true
},
"cell_type": "code",
"source": "for dataset in combine:\n dataset['Sex'] =
dataset['Sex'].map( {'female': 1, 'male': 0} ).astype(int)\n\ntrain_df.head()",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"_cell_guid": "d72cb29e-5034-1597-b459-83a9640d3d3a",
"_uuid": "b9a87778bea8f59e95ff6fa1adee4a1d25f5d1ca"
},
"cell_type": "markdown",
"source": "### Completing a numerical continuous feature\n\nNow we should
start estimating and completing features with missing or null values. We will first
do this for the Age feature.\n\nWe can consider three methods to complete a
numerical continuous feature.\n\n1. A simple way is to generate random numbers
between mean and [standard deviation]
(https://en.wikipedia.org/wiki/Standard_deviation).\n\n2. More accurate way of
guessing missing values is to use other correlated features. In our case we note
correlation among Age, Gender, and Pclass. Guess Age values using [median]
(https://en.wikipedia.org/wiki/Median) values for Age across sets of Pclass and
Gender feature combinations. So, median Age for Pclass=1 and Gender=0, Pclass=1 and
Gender=1, and so on...\n\n3. Combine methods 1 and 2. So instead of guessing age
values based on median, use random numbers between mean and standard deviation,
based on sets of Pclass and Gender combinations.\n\nMethod 1 and 3 will introduce
random noise into our models. The results from multiple executions might vary. We
will prefer method 2."
},
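{
"metadata": {},
"cell_type": "markdown",
"source": "As a minimal sketch of method 2 (an illustrative alternative, not the implementation this notebook uses below), pandas `groupby` can compute the median Age per Pclass x Sex combination in one expression. Missing Age values are ignored by `median()`."
},
{
"metadata": {
"trusted": false,
"collapsed": true
},
"cell_type": "code",
"source": "# Sketch: median Age for each Pclass x Sex combination in the training set.\n# These are the same values the loop-based implementation below computes.\ntrain_df.groupby(['Pclass', 'Sex'])['Age'].median()",
"execution_count": null,
"outputs": []
},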
{
"metadata": {
"_cell_guid": "c311c43d-6554-3b52-8ef8-533ca08b2f68",
"_uuid": "bbde14c82ef02454950b07c6e6a9f08c3b0b60f9",
"trusted": false,
"collapsed": true
},
"cell_type": "code",
"source": "# grid = sns.FacetGrid(train_df, col='Pclass', hue='Gender')\ngrid
= sns.FacetGrid(train_df, row='Pclass', col='Sex', size=2.2,
aspect=1.6)\ngrid.map(plt.hist, 'Age', alpha=.5, bins=20)\ngrid.add_legend()",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"_cell_guid": "a4f166f9-f5f9-1819-66c3-d89dd5b0d8ff",
"_uuid": "a6428ba0a9bde25b46fdb9561ce9f6e8d5b94d03"
},
"cell_type": "markdown",
"source": "Let us start by preparing an empty array to contain guessed Age
values based on Pclass x Gender combinations."
},
{
"metadata": {
"_cell_guid": "9299523c-dcf1-fb00-e52f-e2fb860a3920",
"_uuid": "0412b62ce33f99f60a44be333b9646001269d3e9",
"trusted": false,
"collapsed": true
},
"cell_type": "code",
"source": "guess_ages = np.zeros((2,3))\nguess_ages",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"_cell_guid": "ec9fed37-16b1-5518-4fa8-0a7f579dbc82",
"_uuid": "9313a3afce1c491e576e24cb64ee8292a2969840"
},
"cell_type": "markdown",
"source": "Now we iterate over Sex (0 or 1) and Pclass (1, 2, 3) to calculate
guessed values of Age for the six combinations."
},
{
"metadata": {
"_cell_guid": "a4015dfa-a0ab-65bc-0cbe-efecf1eb2569",
"_uuid": "7e5edc6e5c3c11dd14716381595ea79f99e3dfe5",
"trusted": false,
"collapsed": true
},
"cell_type": "code",
"source": "for dataset in combine:\n for i in range(0, 2):\n for j
in range(0, 3):\n guess_df = dataset[(dataset['Sex'] == i) & \\\n
(dataset['Pclass'] == j+1)]['Age'].dropna()\n\n # age_mean =
guess_df.mean()\n # age_std = guess_df.std()\n # age_guess =
rnd.uniform(age_mean - age_std, age_mean + age_std)\n\n age_guess =
guess_df.median()\n\n # Convert random age float to nearest .5 age\n
guess_ages[i,j] = int( age_guess/0.5 + 0.5 ) * 0.5\n \n for i in
range(0, 2):\n for j in range(0, 3):\n
dataset.loc[ (dataset.Age.isnull()) & (dataset.Sex == i) & (dataset.Pclass ==
j+1),\\\n 'Age'] = guess_ages[i,j]\n\n dataset['Age'] =
dataset['Age'].astype(int)\n\ntrain_df.head()",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"_cell_guid": "dbe0a8bf-40bc-c581-e10e-76f07b3b71d4",
"_uuid": "7f79de34676821287f94f3c928d570db4bb7840b"
},
"cell_type": "markdown",
"source": "Let us create Age bands and determine correlations with Survived."
},
{
"metadata": {
"_cell_guid": "725d1c84-6323-9d70-5812-baf9994d3aa1",
"_uuid": "810b5e26cfcc3306a124d4378a1eb267d5c92f1b",
"trusted": false,
"collapsed": true
},
"cell_type": "code",
"source": "train_df['AgeBand'] = pd.cut(train_df['Age'],
5)\ntrain_df[['AgeBand', 'Survived']].groupby(['AgeBand'],
as_index=False).mean().sort_values(by='AgeBand', ascending=True)",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"_cell_guid": "ba4be3a0-e524-9c57-fbec-c8ecc5cde5c6",
"_uuid": "a85b42753bb94a727cc7e3b2c1d299ab489cf529"
},
"cell_type": "markdown",
"source": "Let us replace Age with ordinals based on these bands."
},
{
"metadata": {
"_cell_guid": "797b986d-2c45-a9ee-e5b5-088de817c8b2",
"_uuid": "2027f0ac240545aa04b4480cf193328fcbe49861",
"trusted": false,
"collapsed": true
},
"cell_type": "code",
"source": "for dataset in combine: \n dataset.loc[ dataset['Age'] <=
16, 'Age'] = 0\n dataset.loc[(dataset['Age'] > 16) & (dataset['Age'] <= 32),
'Age'] = 1\n dataset.loc[(dataset['Age'] > 32) & (dataset['Age'] <= 48), 'Age']
= 2\n dataset.loc[(dataset['Age'] > 48) & (dataset['Age'] <= 64), 'Age'] = 3\n
dataset.loc[ dataset['Age'] > 64, 'Age']\ntrain_df.head()",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"_cell_guid": "004568b6-dd9a-ff89-43d5-13d4e9370b1d",
"_uuid": "44f351c9bb4362ce4b6b6474df62cbaf9b4f1ce6"
},
"cell_type": "markdown",
"source": "We can not remove the AgeBand feature."
},
{
"metadata": {
"_cell_guid": "875e55d4-51b0-5061-b72c-8a23946133a3",
"_uuid": "47bfc5ef6d6f78efc4ceba7bf4e16a683750dc04",
"trusted": false,
"collapsed": true
},
"cell_type": "code",
"source": "train_df = train_df.drop(['AgeBand'], axis=1)\ncombine =
[train_df, test_df]\ntrain_df.head()",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"_cell_guid": "1c237b76-d7ac-098f-0156-480a838a64a9",
"_uuid": "507ca831212543d0bfabada3835c572c2b0afc98"
},
"cell_type": "markdown",
"source": "### Create new feature combining existing features\n\nWe can
create a new feature for FamilySize which combines Parch and SibSp. This will
enable us to drop Parch and SibSp from our datasets."
},
{
"metadata": {
"_cell_guid": "7e6c04ed-cfaa-3139-4378-574fd095d6ba",
"_uuid": "8a0d0ec4d296f0208fe6b6881f06ced1123af23e",
"trusted": false,
"collapsed": true
},
"cell_type": "code",
"source": "for dataset in combine:\n dataset['FamilySize'] =
dataset['SibSp'] + dataset['Parch'] + 1\n\ntrain_df[['FamilySize',
'Survived']].groupby(['FamilySize'],
as_index=False).mean().sort_values(by='Survived', ascending=False)",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"_cell_guid": "842188e6-acf8-2476-ccec-9e3451e4fa86",
"_uuid": "79fb6a8c851aa144a396d14bdc6706b357edf315"
},
"cell_type": "markdown",
"source": "We can create another feature called IsAlone."
},
{
"metadata": {
"_cell_guid": "5c778c69-a9ae-1b6b-44fe-a0898d07be7a",
"_uuid": "25709ff4e4c33dace874da89c2748f57708fa9ee",
"trusted": false,
"collapsed": true
},
"cell_type": "code",
"source": "for dataset in combine:\n dataset['IsAlone'] = 0\n
dataset.loc[dataset['FamilySize'] == 1, 'IsAlone'] = 1\n\ntrain_df[['IsAlone',
'Survived']].groupby(['IsAlone'], as_index=False).mean()",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"_cell_guid": "e6b87c09-e7b2-f098-5b04-4360080d26bc",
"_uuid": "38749430ed6b2d259573c2d8ea4a9d07ad8d9988"
},
"cell_type": "markdown",
"source": "Let us drop Parch, SibSp, and FamilySize features in favor of
IsAlone."
},
{
"metadata": {
"_cell_guid": "74ee56a6-7357-f3bc-b605-6c41f8aa6566",
"_uuid": "aba50739c0d372eb549e7b77fa6c4ab6929c9dff",
"trusted": false,
"collapsed": true
},
"cell_type": "code",
"source": "train_df = train_df.drop(['Parch', 'SibSp', 'FamilySize'],
axis=1)\ntest_df = test_df.drop(['Parch', 'SibSp', 'FamilySize'], axis=1)\ncombine
= [train_df, test_df]\n\ntrain_df.head()",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"_cell_guid": "f890b730-b1fe-919e-fb07-352fbd7edd44",
"_uuid": "96a3d080f99f4fca949af55b766eb18479169f35"
},
"cell_type": "markdown",
"source": "We can also create an artificial feature combining Pclass and
Age."
},
{
"metadata": {
"_cell_guid": "305402aa-1ea1-c245-c367-056eef8fe453",
"_uuid": "bf268ced2e898d687df451ebbb5c862a070678af",
"trusted": false,
"collapsed": true
},
"cell_type": "code",
"source": "for dataset in combine:\n dataset['Age*Class'] = dataset.Age *
dataset.Pclass\n\ntrain_df.loc[:, ['Age*Class', 'Age', 'Pclass']].head(10)",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"_cell_guid": "13292c1b-020d-d9aa-525c-941331bb996a",
"_uuid": "53a666dd0bb35c97cb39c6065627c4868c4380d1"
},
"cell_type": "markdown",
"source": "### Completing a categorical feature\n\nEmbarked feature takes S,
Q, C values based on port of embarkation. Our training dataset has two missing
values. We simply fill these with the most common occurance."
},
{
"metadata": {
"_cell_guid": "bf351113-9b7f-ef56-7211-e8dd00665b18",
"_uuid": "ada906f3b978f55507e7c1f0ce9d3a307458c029",
"trusted": false,
"collapsed": true
},
"cell_type": "code",
"source": "freq_port = train_df.Embarked.dropna().mode()[0]\nfreq_port",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"_cell_guid": "51c21fcc-f066-cd80-18c8-3d140be6cbae",
"_uuid": "2cbe79649308397b221274a7e61a9781a6fddc5a",
"trusted": false,
"collapsed": true
},
"cell_type": "code",
"source": "for dataset in combine:\n dataset['Embarked'] =
dataset['Embarked'].fillna(freq_port)\n \ntrain_df[['Embarked',
'Survived']].groupby(['Embarked'],
as_index=False).mean().sort_values(by='Survived', ascending=False)",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"_cell_guid": "f6acf7b2-0db3-e583-de50-7e14b495de34",
"_uuid": "2c19342ae8773ee6dd7ba6074e6cbae8585787ce"
},
"cell_type": "markdown",
"source": "### Converting categorical feature to numeric\n\nWe can now
convert the EmbarkedFill feature by creating a new numeric Port feature."
},
{
"metadata": {
"_cell_guid": "89a91d76-2cc0-9bbb-c5c5-3c9ecae33c66",
"_uuid": "3ae7d6a990a7ff6bd5ba10516cc5cfa003dac765",
"trusted": false,
"collapsed": true
},
"cell_type": "code",
"source": "for dataset in combine:\n dataset['Embarked'] =
dataset['Embarked'].map( {'S': 0, 'C': 1, 'Q':
2} ).astype(int)\n\ntrain_df.head()",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"_cell_guid": "e3dfc817-e1c1-a274-a111-62c1c814cecf",
"_uuid": "1b0bde70dc34775222b07462af99c722c6de072e"
},
"cell_type": "markdown",
"source": "### Quick completing and converting a numeric feature\n\nWe can
now complete the Fare feature for single missing value in test dataset using mode
to get the value that occurs most frequently for this feature. We do this in a
single line of code.\n\nNote that we are not creating an intermediate new feature
or doing any further analysis for correlation to guess missing feature as we are
replacing only a single value. The completion goal achieves desired requirement for
model algorithm to operate on non-null values.\n\nWe may also want round off the
fare to two decimals as it represents currency."
},
{
"metadata": {
"_cell_guid": "3600cb86-cf5f-d87b-1b33-638dc8db1564",
"_uuid": "c05e78d9a7e8a3e69d3b5d7db3704c496d56b1f4",
"trusted": false,
"collapsed": true
},
"cell_type": "code",
"source": "test_df['Fare'].fillna(test_df['Fare'].dropna().median(),
inplace=True)\ntest_df.head()",
"execution_count": null,
"outputs": []
},
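{
"metadata": {},
"cell_type": "markdown",
"source": "A minimal sketch of the rounding mentioned above, shown on a copy so the pipeline is unaffected (illustrative only; Fare is converted to ordinal bands in the cells that follow, so the rounding is not strictly needed)."
},
{
"metadata": {
"trusted": false,
"collapsed": true
},
"cell_type": "code",
"source": "# Sketch: round Fare to two decimals since it represents currency.\n# round() returns a new Series, so test_df itself is left unchanged.\ntest_df['Fare'].round(2).head()",
"execution_count": null,
"outputs": []
},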
{
"metadata": {
"_cell_guid": "4b816bc7-d1fb-c02b-ed1d-ee34b819497d",
"_uuid": "a23693030a76da1cdd2c1d1aff9ebbca87dee892"
},
"cell_type": "markdown",
"source": "We can not create FareBand."
},
{
"metadata": {
"_cell_guid": "0e9018b1-ced5-9999-8ce1-258a0952cbf2",
"_uuid": "615c08220cc042175722591b18d2be919736d8a1",
"trusted": false,
"collapsed": true
},
"cell_type": "code",
"source": "train_df['FareBand'] = pd.qcut(train_df['Fare'],
4)\ntrain_df[['FareBand', 'Survived']].groupby(['FareBand'],
as_index=False).mean().sort_values(by='FareBand', ascending=True)",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"_cell_guid": "d65901a5-3684-6869-e904-5f1a7cce8a6d",
"_uuid": "604bf142d13e115a8bfdb1f6e65c845ababc3384"
},
"cell_type": "markdown",
"source": "Convert the Fare feature to ordinal values based on the FareBand."
},
{
"metadata": {
"_cell_guid": "385f217a-4e00-76dc-1570-1de4eec0c29c",
"_uuid": "63ec2e871863be592ef31be967f66f8216c8d57b",
"trusted": false,
"collapsed": true
},
"cell_type": "code",
"source": "for dataset in combine:\n dataset.loc[ dataset['Fare'] <= 7.91,
'Fare'] = 0\n dataset.loc[(dataset['Fare'] > 7.91) & (dataset['Fare'] <=
14.454), 'Fare'] = 1\n dataset.loc[(dataset['Fare'] > 14.454) & (dataset['Fare']
<= 31), 'Fare'] = 2\n dataset.loc[ dataset['Fare'] > 31, 'Fare'] = 3\n
dataset['Fare'] = dataset['Fare'].astype(int)\n\ntrain_df =
train_df.drop(['FareBand'], axis=1)\ncombine = [train_df, test_df]\n
\ntrain_df.head(10)",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"_cell_guid": "27272bb9-3c64-4f9a-4a3b-54f02e1c8289",
"_uuid": "9eddd2d528bc3fedd3aee842af3cbd9938667d23"
},
"cell_type": "markdown",
"source": "And the test dataset."
},
{
"metadata": {
"_cell_guid": "d2334d33-4fe5-964d-beac-6aa620066e15",
"_uuid": "cfeb079ff50a93e2c75700343d820fef49b7e2f7",
"trusted": false,
"collapsed": true
},
"cell_type": "code",
"source": "test_df.head(10)",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"_cell_guid": "69783c08-c8cc-a6ca-2a9a-5e75581c6d31",
"_uuid": "1c62ae893afd7cc94053f9ce15ac96800c45de63"
},
"cell_type": "markdown",
"source": "## Model, predict and solve\n\nNow we are ready to train a model
and predict the required solution. There are 60+ predictive modelling algorithms to
choose from. We must understand the type of problem and solution requirement to
narrow down to a select few models which we can evaluate. Our problem is a
classification and regression problem. We want to identify relationship between
output (Survived or not) with other variables or features (Gender, Age, Port...).
We are also perfoming a category of machine learning which is called supervised
learning as we are training our model with a given dataset. With these two criteria
- Supervised Learning plus Classification and Regression, we can narrow down our
choice of models to a few. These include:\n\n- Logistic Regression\n- KNN or k-
Nearest Neighbors\n- Support Vector Machines\n- Naive Bayes classifier\n- Decision
Tree\n- Random Forrest\n- Perceptron\n- Artificial neural network\n- RVM or
Relevance Vector Machine"
},
{
"metadata": {
"_cell_guid": "0acf54f9-6cf5-24b5-72d9-29b30052823a",
"_uuid": "daa105c75da2d53e4ad16bdd90ae38bc798f3840",
"trusted": false,
"collapsed": true
},
"cell_type": "code",
"source": "X_train = train_df.drop(\"Survived\", axis=1)\nY_train =
train_df[\"Survived\"]\nX_test = test_df.drop(\"PassengerId\",
axis=1).copy()\nX_train.shape, Y_train.shape, X_test.shape",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"_cell_guid": "579bc004-926a-bcfe-e9bb-c8df83356876",
"_uuid": "66a62b835b87e1a06d88b9fadcb558fcce005e5e"
},
"cell_type": "markdown",
"source": "Logistic Regression is a useful model to run early in the
workflow. Logistic regression measures the relationship between the categorical
dependent variable (feature) and one or more independent variables (features) by
estimating probabilities using a logistic function, which is the cumulative
logistic distribution. Reference [Wikipedia]
(https://en.wikipedia.org/wiki/Logistic_regression).\n\nNote the confidence score
generated by the model based on our training dataset."
},
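{
"metadata": {},
"cell_type": "markdown",
"source": "For reference, the logistic function mentioned above maps any real-valued combination of the features into the (0, 1) probability range:\n\n$$p(\\text{Survived}=1 \\mid x) = \\frac{1}{1 + e^{-(\\beta_0 + \\beta_1 x_1 + \\dots + \\beta_n x_n)}}$$\n\nwhere the $\\beta$ values are the coefficients the model learns for our features (the same coefficients we inspect a few cells below)."
},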
{
"metadata": {
"_cell_guid": "0edd9322-db0b-9c37-172d-a3a4f8dec229",
"_uuid": "871acd6dcc4e785a6267ca49f53c7843600e2ecc",
"trusted": false,
"collapsed": true
},
"cell_type": "code",
"source": "# Logistic Regression\n\nlogreg =
LogisticRegression()\nlogreg.fit(X_train, Y_train)\nY_pred =
logreg.predict(X_test)\nacc_log = round(logreg.score(X_train, Y_train) * 100,
2)\nacc_log",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"_cell_guid": "3af439ae-1f04-9236-cdc2-ec8170a0d4ee",
"_uuid": "7c672d5d1500a0766dd3cbd6f5cd0887f2f55c8f"
},
"cell_type": "markdown",
"source": "We can use Logistic Regression to validate our assumptions and
decisions for feature creating and completing goals. This can be done by
calculating the coefficient of the features in the decision function.\n\nPositive
coefficients increase the log-odds of the response (and thus increase the
probability), and negative coefficients decrease the log-odds of the response (and
thus decrease the probability).\n\n- Sex is highest positivie coefficient, implying
as the Sex value increases (male: 0 to female: 1), the probability of Survived=1
increases the most.\n- Inversely as Pclass increases, probability of Survived=1
decreases the most.\n- This way Age*Class is a good artificial feature to model as
it has second highest negative correlation with Survived.\n- So is Title as second
highest positive correlation."
},
{
"metadata": {
"_cell_guid": "e545d5aa-4767-7a41-5799-a4c5e529ce72",
"_uuid": "f0b099a20b5a5744673e4d03a0f146a3af5645a9",
"trusted": false,
"collapsed": true
},
"cell_type": "code",
"source": "coeff_df =
pd.DataFrame(train_df.columns.delete(0))\ncoeff_df.columns =
['Feature']\ncoeff_df[\"Correlation\"] =
pd.Series(logreg.coef_[0])\n\ncoeff_df.sort_values(by='Correlation',
ascending=False)",
"execution_count": null,
"outputs": []
},
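{
"metadata": {},
"cell_type": "markdown",
"source": "As an optional follow-up sketch (not part of the original workflow), we can exponentiate each coefficient to read it as an odds ratio, a standard interpretation of logistic regression coefficients: values above 1 increase the odds of Survived=1, values below 1 decrease them. This reuses the `coeff_df` DataFrame built above."
},
{
"metadata": {
"trusted": false,
"collapsed": true
},
"cell_type": "code",
"source": "# Sketch: exp(coefficient) gives the multiplicative change in the odds of\n# Survived=1 for a one-unit increase in the feature.\ncoeff_df['Odds Ratio'] = np.exp(coeff_df['Correlation'])\ncoeff_df.sort_values(by='Odds Ratio', ascending=False)",
"execution_count": null,
"outputs": []
},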
{
"metadata": {
"_cell_guid": "ac041064-1693-8584-156b-66674117e4d0",
"_uuid": "7ff209ca8d68d1d15ca6adf8e110f494a7ead42d"
},
"cell_type": "markdown",
"source": "Next we model using Support Vector Machines which are supervised
learning models with associated learning algorithms that analyze data used for
classification and regression analysis. Given a set of training samples, each
marked as belonging to one or the other of **two categories**, an SVM training
algorithm builds a model that assigns new test samples to one category or the
other, making it a non-probabilistic binary linear classifier. Reference
[Wikipedia](https://en.wikipedia.org/wiki/Support_vector_machine).\n\nNote that the model generates a confidence score which is higher than the Logistic Regression model's."
},
{
"metadata": {
"_cell_guid": "7a63bf04-a410-9c81-5310-bdef7963298f",
"_uuid": "4149f77b216ba82de159c93065bbea1929fcb238",
"trusted": false,
"collapsed": true
},
"cell_type": "code",
"source": "# Support Vector Machines\n\nsvc = SVC()\nsvc.fit(X_train,
Y_train)\nY_pred = svc.predict(X_test)\nacc_svc = round(svc.score(X_train, Y_train)
* 100, 2)\nacc_svc",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"_cell_guid": "172a6286-d495-5ac4-1a9c-5b77b74ca6d2",
"_uuid": "2065f403f470647dd62dca7634ea4fde33dd6350"
},
"cell_type": "markdown",
"source": "In pattern recognition, the k-Nearest Neighbors algorithm (or k-NN
for short) is a non-parametric method used for classification and regression. A
sample is classified by a majority vote of its neighbors, with the sample being
assigned to the class most common among its k nearest neighbors (k is a positive
integer, typically small). If k = 1, then the object is simply assigned to the
class of that single nearest neighbor. Reference [Wikipedia]
(https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm).\n\nThe KNN confidence score is better than Logistic Regression but worse than SVM."
},
{
"metadata": {
"_cell_guid": "ca14ae53-f05e-eb73-201c-064d7c3ed610",
"_uuid": "15dcc1e24c14c67591ee4319dc0cd5b1b8ec1e08",
"trusted": false,
"collapsed": true
},
"cell_type": "code",
"source": "knn = KNeighborsClassifier(n_neighbors = 3)\nknn.fit(X_train,
Y_train)\nY_pred = knn.predict(X_test)\nacc_knn = round(knn.score(X_train, Y_train)
* 100, 2)\nacc_knn",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"_cell_guid": "810f723d-2313-8dfd-e3e2-26673b9caa90",
"_uuid": "2193ef4816826d4ed27d444db617926bac15f623"
},
"cell_type": "markdown",
"source": "In machine learning, naive Bayes classifiers are a family of
simple probabilistic classifiers based on applying Bayes' theorem with strong
(naive) independence assumptions between the features. Naive Bayes classifiers are
highly scalable, requiring a number of parameters linear in the number of variables
(features) in a learning problem. Reference [Wikipedia]
(https://en.wikipedia.org/wiki/Naive_Bayes_classifier).\n\nThe confidence score generated by the model is the lowest among the models evaluated so far."
},
{
"metadata": {
"_cell_guid": "50378071-7043-ed8d-a782-70c947520dae",
"_uuid": "67860d99c1993970db7116b1a5c0bfc1b5459a0e",
"trusted": false,
"collapsed": true
},
"cell_type": "code",
"source": "# Gaussian Naive Bayes\n\ngaussian =
GaussianNB()\ngaussian.fit(X_train, Y_train)\nY_pred =
gaussian.predict(X_test)\nacc_gaussian = round(gaussian.score(X_train, Y_train) *
100, 2)\nacc_gaussian",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"_cell_guid": "1e286e19-b714-385a-fcfa-8cf5ec19956a",
"_uuid": "024bedaed5297fce5f89063d1af6d0b487ada628"
},
"cell_type": "markdown",
"source": "The perceptron is an algorithm for supervised learning of binary
classifiers (functions that can decide whether an input, represented by a vector of
numbers, belongs to some specific class or not). It is a type of linear classifier,
i.e. a classification algorithm that makes its predictions based on a linear
predictor function combining a set of weights with the feature vector. The
algorithm allows for online learning, in that it processes elements in the training
set one at a time. Reference [Wikipedia]
(https://en.wikipedia.org/wiki/Perceptron)."
},
{
"metadata": {
"_cell_guid": "ccc22a86-b7cb-c2dd-74bd-53b218d6ed0d",
"_uuid": "ae591b6596828905554eeefed6d066b00e6d25eb",
"trusted": false,
"collapsed": true
},
"cell_type": "code",
"source": "# Perceptron\n\nperceptron = Perceptron()\nperceptron.fit(X_train,
Y_train)\nY_pred = perceptron.predict(X_test)\nacc_perceptron =
round(perceptron.score(X_train, Y_train) * 100, 2)\nacc_perceptron",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"_cell_guid": "a4d56857-9432-55bb-14c0-52ebeb64d198",
"_uuid": "0851fd8bf9675dfc19dad5c49970bc39c6d8cb09",
"trusted": false,
"collapsed": true
},
"cell_type": "code",
"source": "# Linear SVC\n\nlinear_svc = LinearSVC()\nlinear_svc.fit(X_train,
Y_train)\nY_pred = linear_svc.predict(X_test)\nacc_linear_svc =
round(linear_svc.score(X_train, Y_train) * 100, 2)\nacc_linear_svc",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"_cell_guid": "dc98ed72-3aeb-861f-804d-b6e3d178bf4b",
"_uuid": "072363264b4026bcb63b957706864e8d30c898c9",
"trusted": false,
"collapsed": true
},
"cell_type": "code",
"source": "# Stochastic Gradient Descent\n\nsgd =
SGDClassifier()\nsgd.fit(X_train, Y_train)\nY_pred = sgd.predict(X_test)\nacc_sgd =
round(sgd.score(X_train, Y_train) * 100, 2)\nacc_sgd",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"_cell_guid": "bae7f8d7-9da0-f4fd-bdb1-d97e719a18d7",
"_uuid": "2fd4de9e48e31b4372e5a64b6794ab1ab3cbd8b1"
},
"cell_type": "markdown",
"source": "This model uses a decision tree as a predictive model which maps
features (tree branches) to conclusions about the target value (tree leaves). Tree
models where the target variable can take a finite set of values are called
classification trees; in these tree structures, leaves represent class labels and
branches represent conjunctions of features that lead to those class labels.
Decision trees where the target variable can take continuous values (typically real
numbers) are called regression trees. Reference [Wikipedia]
(https://en.wikipedia.org/wiki/Decision_tree_learning).\n\nThe model confidence
score is the highest among models evaluated so far."
},
{
"metadata": {
"_cell_guid": "dd85f2b7-ace2-0306-b4ec-79c68cd3fea0",
"_uuid": "21c14121407cd4d05d250e921499065ba1321feb",
"trusted": false,
"collapsed": true
},
"cell_type": "code",
"source": "# Decision Tree\n\ndecision_tree =
DecisionTreeClassifier()\ndecision_tree.fit(X_train, Y_train)\nY_pred =
decision_tree.predict(X_test)\nacc_decision_tree =
round(decision_tree.score(X_train, Y_train) * 100, 2)\nacc_decision_tree",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"_cell_guid": "85693668-0cd5-4319-7768-eddb62d2b7d0",
"_uuid": "595608f8b099219b666c4e86611cb931abb715b4"
},
"cell_type": "markdown",
"source": "The next model Random Forests is one of the most popular. Random
forests or random decision forests are an ensemble learning method for
classification, regression and other tasks, that operate by constructing a
multitude of decision trees (n_estimators=100) at training time and outputting the
class that is the mode of the classes (classification) or mean prediction
(regression) of the individual trees. Reference [Wikipedia]
(https://en.wikipedia.org/wiki/Random_forest).\n\nThe model confidence score is the
highest among models evaluated so far. We decide to use this model's output
(Y_pred) for creating our competition submission of results."
},
{
"metadata": {
"_cell_guid": "f0694a8e-b618-8ed9-6f0d-8c6fba2c4567",
"_uuid": "6bf2cd7e2e304013bd82e41b528e9f26304eabd5",
"trusted": false,
"collapsed": true
},
"cell_type": "code",
"source": "# Random Forest\n\nrandom_forest =
RandomForestClassifier(n_estimators=100)\nrandom_forest.fit(X_train,
Y_train)\nY_pred = random_forest.predict(X_test)\nrandom_forest.score(X_train,
Y_train)\nacc_random_forest = round(random_forest.score(X_train, Y_train) * 100,
2)\nacc_random_forest",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"_cell_guid": "f6c9eef8-83dd-581c-2d8e-ce932fe3a44d",
"_uuid": "e93b2e9f80cae0efbf48f2986504f7ed3e66d6b3"
},
"cell_type": "markdown",
"source": "### Model evaluation\n\nWe can now rank our evaluation of all the
models to choose the best one for our problem. While both Decision Tree and Random
Forest score the same, we choose to use Random Forest as they correct for decision
trees' habit of overfitting to their training set. "
},
{
"metadata": {
"_cell_guid": "1f3cebe0-31af-70b2-1ce4-0fd406bcdfc6",
"_uuid": "90f3e2d38ceceaf7248d2343eb3aae78ac419ecb",
"trusted": false,
"collapsed": true
},
"cell_type": "code",
"source": "models = pd.DataFrame({\n 'Model': ['Support Vector Machines',
'KNN', 'Logistic Regression', \n 'Random Forest', 'Naive Bayes',
'Perceptron', \n 'Stochastic Gradient Decent', 'Linear SVC', \n
'Decision Tree'],\n 'Score': [acc_svc, acc_knn, acc_log, \n
acc_random_forest, acc_gaussian, acc_perceptron, \n acc_sgd,
acc_linear_svc, acc_decision_tree]})\nmodels.sort_values(by='Score',
ascending=False)",
"execution_count": null,
"outputs": []
},
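{
"metadata": {},
"cell_type": "markdown",
"source": "Because the scores above are computed on the training set, they reward overfitting (the Decision Tree and Random Forest scores in particular). As an optional sanity check, a minimal k-fold cross-validation sketch (assuming a scikit-learn version that provides `sklearn.model_selection`) gives a less optimistic estimate of out-of-sample accuracy."
},
{
"metadata": {
"trusted": false,
"collapsed": true
},
"cell_type": "code",
"source": "# Sketch: 5-fold cross-validation accuracy for the chosen Random Forest.\n# Each fold is trained on 4/5 of the data and scored on the held-out 1/5.\nfrom sklearn.model_selection import cross_val_score\n\ncv_scores = cross_val_score(RandomForestClassifier(n_estimators=100), X_train, Y_train, cv=5)\nprint(cv_scores.mean(), cv_scores.std())",
"execution_count": null,
"outputs": []
},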
{
"metadata": {
"_cell_guid": "28854d36-051f-3ef0-5535-fa5ba6a9bef7",
"_uuid": "c3631aa3e9c18954eab66d2bbc925c0e925115d1",
"trusted": false,
"collapsed": true
},
"cell_type": "code",
"source": "submission = pd.DataFrame({\n \"PassengerId\":
test_df[\"PassengerId\"],\n \"Survived\": Y_pred\n })\n#
submission.to_csv('../output/submission.csv', index=False)",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"_cell_guid": "fcfc8d9f-e955-cf70-5843-1fb764c54699",
"_uuid": "36e88712d81936b4a64d75f107553bf9636442a9"
},
"cell_type": "markdown",
"source": "Our submission to the competition site Kaggle results in scoring
3,883 of 6,082 competition entries. This result is indicative while the competition
is running. This result only accounts for part of the submission dataset. Not bad
for our first attempt. Any suggestions to improve our score are most welcome."
},
{
"metadata": {
"_cell_guid": "aeec9210-f9d8-cd7c-c4cf-a87376d5f693",
"_uuid": "696a3d5594217780f36030480a0360648c369ffd"
},
"cell_type": "markdown",
"source": "## References\n\nThis notebook has been created based on great
work done solving the Titanic competition and other sources.\n\n- [A journey
through Titanic](https://www.kaggle.com/omarelgabry/titanic/a-journey-through-
titanic)\n- [Getting Started with Pandas: Kaggle's Titanic Competition]
(https://www.kaggle.com/c/titanic/details/getting-started-with-random-forests)\n-
[Titanic Best Working Classifier]
(https://www.kaggle.com/sinakhorami/titanic/titanic-best-working-classifier)"
}
],
"metadata": {
"_change_revision": 0,
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"version": "3.6.0",
"nbconvert_exporter": "python",
"codemirror_mode": {
"version": 3,
"name": "ipython"
},
"file_extension": ".py",
"mimetype": "text/x-python",
"pygments_lexer": "ipython3",
"name": "python"
},
"_is_fork": false
},
"nbformat": 4,
"nbformat_minor": 1
}
