
12/11/2019 haberman

Haberman Dataset: Exploratory Data Analysis

Data Description:

The dataset contains cases from a study that was conducted between 1958 and 1970 at the University of
Chicago's Billings Hospital on the survival of patients who had undergone surgery for breast cancer.

OBJECTIVE OF EDA:

Use this dataset to analyze the survival chances of breast cancer patients who underwent surgery, based on
the patient's age, the year of operation, and the number of positive axillary nodes detected. Survival
status is the class label.

DESCRIPTION OF FEATURES
In [5]: import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

# Raw string so the backslashes in the Windows path are not treated as escape sequences
haberman_data = pd.read_csv(r"E:\AIML notes\haberman.csv")


print(haberman_data.shape)
haberman_data["Surv_status"].value_counts()

(306, 4)
Out[5]: 1 225
2 81
Name: Surv_status, dtype: int64

In [6]: print(haberman_data.columns)

Index(['Age', 'OP_Year', 'axil_nodes', 'Surv_status'], dtype='object')

Survival Status (Surv_status):

Indicates whether the patient survived at least 5 years after surgery. Patients who survived 5 years or
longer are coded as 1; patients who survived less than 5 years are coded as 2.
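
The class counts printed above (225 patients with status 1, 81 with status 2) mean the two classes are not balanced. A minimal sketch of that arithmetic follows; the `labels` mapping and the variable names are illustrative and not part of the original notebook.

```python
# Class counts copied from the value_counts() output above.
counts = {1: 225, 2: 81}

# Hypothetical human-readable labels for the two status codes.
labels = {1: "survived 5 years or more", 2: "survived less than 5 years"}

total = sum(counts.values())
shares = {status: n / total for status, n in counts.items()}

for status, share in shares.items():
    print(f"{labels[status]}: {counts[status]} patients ({share:.1%})")
```

Roughly three in four patients belong to class 1, so any plot or summary that pools both classes is dominated by the survivors.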


Lymph Nodes (axil_nodes):

Lymph nodes are small, bean-shaped organs that act as filters along the lymph fluid channels. As lymph fluid
leaves the breast and eventually goes back into the bloodstream, the lymph nodes try to catch and trap
cancer cells before they reach other parts of the body. Cancer cells in the lymph nodes under the arm
suggest an increased risk of the cancer spreading. In this dataset the feature is the number of positive
axillary nodes detected (0 to 52).

Operation year (OP_Year):

The year in which the patient underwent surgery.

Age:

The age of the patient at the time of surgery.

Univariate Analysis:
Analyze each feature individually to see how it affects the chances of survival.


In [7]: # 1-D scatter plots of each feature, one figure per survival class

# Split the data by survival status (1: survived 5 years or more, 2: less than 5 years)
greaterthan_orequal_5years = haberman_data.loc[haberman_data["Surv_status"] == 1]
lessthan_5years = haberman_data.loc[haberman_data["Surv_status"] == 2]

plt.figure(1)
plt.title("greaterthan_orequal_5years Survivals")
plt.plot(greaterthan_orequal_5years["Age"],
         np.zeros_like(greaterthan_orequal_5years["Age"]), 'o', label='Age')
plt.plot(greaterthan_orequal_5years["OP_Year"],
         np.zeros_like(greaterthan_orequal_5years["OP_Year"]), 'o', label='OP_Year')
plt.plot(greaterthan_orequal_5years["axil_nodes"],
         np.zeros_like(greaterthan_orequal_5years["axil_nodes"]), 'o', label='axil_nodes')
plt.grid()
plt.xlabel("greaterthan_orequal_5years Survival")
plt.legend()
plt.show()

plt.figure(2)
plt.title("lessthan_5years Survivals")
plt.plot(lessthan_5years["Age"],
         np.zeros_like(lessthan_5years["Age"]), 'o', label='Age')
plt.plot(lessthan_5years["OP_Year"],
         np.zeros_like(lessthan_5years["OP_Year"]), 'o', label='OP_Year')
plt.plot(lessthan_5years["axil_nodes"],
         np.zeros_like(lessthan_5years["axil_nodes"]), 'o', label='axil_nodes')
plt.grid()
plt.xlabel("lessthan_5years Survival")
plt.legend()
plt.show()


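In [24]: # Histogram and PDF of each feature, colored by survival status

for column in haberman_data.columns[0:3]:
    # `height` replaces the `size` argument, which seaborn renamed in version 0.9.
    # `distplot` is deprecated since seaborn 0.11; `histplot` with kde=True is the closest replacement.
    sns.FacetGrid(haberman_data, hue="Surv_status", height=5) \
        .map(sns.distplot, column) \
        .add_legend()
    plt.title("Histogram of {} vs. survival status".format(column))
    plt.show()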


C:\Users\USER\Anaconda3\lib\site-packages\matplotlib\axes\_axes.py:6462: UserWarning: The 'normed' kwarg is deprecated, and has been replaced by the 'density' kwarg.
  warnings.warn("The 'normed' kwarg is deprecated, and has been "

Observations:


From the above histograms and PDFs:

1. Younger patients appear to have a better chance of survival than older patients; the survival density
is highest relative to non-survival for ages 30 to 35.
2. Patients with fewer positive axillary nodes have a better chance of survival, most clearly those with
approximately 0 to 3 nodes.
3. Beyond that, the histograms do not separate the two classes, because the histograms and PDFs of
survivors and non-survivors overlap closely.


In [58]: # PDF and CDF of each feature, computed separately for each survival class

sns.set_style("whitegrid")
for column in haberman_data.columns[0:3]:
    # Patients who survived 5 years or more
    counts, bin_edges = np.histogram(greaterthan_orequal_5years[column], bins=10,
                                     density=True)
    pdf = counts / sum(counts)  # normalize so the per-bin fractions sum to 1
    print("Pdf of {0} vs =5years: \n {1}".format(column, pdf), "\n")
    print("bin edges of {0} at >=5years: \n {1}".format(column, bin_edges), "\n")
    cdf = np.cumsum(pdf)
    plt.plot(bin_edges[1:], pdf, label='Pdf of greaterthan_orequal_5years')
    plt.plot(bin_edges[1:], cdf, label='cdf of greaterthan_orequal_5years')

    # Patients who survived less than 5 years
    counts, bin_edges = np.histogram(lessthan_5years[column], bins=10, density=True)
    pdf = counts / sum(counts)
    print("Pdf of {0} vs <5years: \n {1}".format(column, pdf), "\n")
    print("bin edges of {0} at <5years:\n {1}".format(column, bin_edges), "\n")
    cdf = np.cumsum(pdf)
    plt.plot(bin_edges[1:], pdf, label='Pdf of lessthan_5years')
    # This curve is the CDF, so it is labeled as such
    plt.plot(bin_edges[1:], cdf, label='cdf of lessthan_5years')

    plt.title("Cdf with pdf based on {}".format(column))
    plt.xlabel(column)
    plt.legend(loc='center left', bbox_to_anchor=(1, 0.5))
    plt.show()
    print("\n\n " + "*" * 85 + " \n\n")


Pdf of Age vs =5years:


[0.05333333 0.10666667 0.12444444 0.09333333 0.16444444 0.16444444
0.09333333 0.11111111 0.06222222 0.02666667]

bin edges of Age at >=5years:


[30. 34.7 39.4 44.1 48.8 53.5 58.2 62.9 67.6 72.3 77. ]

Pdf of Age vs <5years:


[0.03703704 0.12345679 0.19753086 0.19753086 0.13580247 0.12345679
0.09876543 0.04938272 0.02469136 0.01234568]

bin edges of Age at <5years:


[34. 38.9 43.8 48.7 53.6 58.5 63.4 68.3 73.2 78.1 83. ]

**************************************************************************
***********

Pdf of OP_Year vs =5years:


[0.18666667 0.10666667 0.10222222 0.07111111 0.09777778 0.10222222
0.06666667 0.09777778 0.09333333 0.07555556]

bin edges of OP_Year at >=5years:


[58. 59.1 60.2 61.3 62.4 63.5 64.6 65.7 66.8 67.9 69. ]

Pdf of OP_Year vs <5years:


[0.25925926 0.04938272 0.03703704 0.08641975 0.09876543 0.09876543
0.16049383 0.07407407 0.04938272 0.08641975]

bin edges of OP_Year at <5years:


[58. 59.1 60.2 61.3 62.4 63.5 64.6 65.7 66.8 67.9 69. ]


**************************************************************************
***********

Pdf of axil_nodes vs =5years:


[0.83555556 0.08 0.02222222 0.02666667 0.01777778 0.00444444
0.00888889 0. 0. 0.00444444]

bin edges of axil_nodes at >=5years:


[ 0. 4.6 9.2 13.8 18.4 23. 27.6 32.2 36.8 41.4 46. ]

Pdf of axil_nodes vs <5years:


[0.56790123 0.14814815 0.13580247 0.04938272 0.07407407 0.
0.01234568 0. 0. 0.01234568]

bin edges of axil_nodes at <5years:


[ 0. 5.2 10.4 15.6 20.8 26. 31.2 36.4 41.6 46.8 52. ]

**************************************************************************
***********


Observation:

From the above CDF and PDF data:

1. Age: about 38% of the patients who survived 5 years or more were younger than 48.8, and about 54% were
younger than 53.5. Among patients who survived less than 5 years, about 36% were younger than 48.7 and
about 56% were younger than 53.6. These are fractions within each class, not survival probabilities, and
the two curves are close, so age alone separates the classes only weakly.
2. Operation year has little effect on the chances of survival: in the operation year plot, the PDFs and
CDFs of the survival and non-survival data overlap closely.
3. Axillary nodes: about 84% of the survivors had at most 4 positive nodes and about 92% had at most 9.
Among non-survivors, about 57% had at most 5 nodes and about 72% had at most 10, so higher node counts
are more common among patients who survived less than 5 years.
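
The cumulative fractions can be read directly from the printed PDF values. The sketch below re-derives them for Age with `itertools.accumulate`, the standard-library counterpart of the `np.cumsum` call in the cell above. The numbers are copied from the printed output; the variable names are illustrative.

```python
from itertools import accumulate

# Per-bin fractions and upper bin edges for Age, copied from the printed output.
pdf_survived = [0.05333333, 0.10666667, 0.12444444, 0.09333333, 0.16444444,
                0.16444444, 0.09333333, 0.11111111, 0.06222222, 0.02666667]
edges_survived = [34.7, 39.4, 44.1, 48.8, 53.5, 58.2, 62.9, 67.6, 72.3, 77.0]

pdf_short = [0.03703704, 0.12345679, 0.19753086, 0.19753086, 0.13580247,
             0.12345679, 0.09876543, 0.04938272, 0.02469136, 0.01234568]
edges_short = [38.9, 43.8, 48.7, 53.6, 58.5, 63.4, 68.3, 73.2, 78.1, 83.0]

# Running totals of the per-bin fractions give the empirical CDF at each edge.
cdf_survived = list(accumulate(pdf_survived))
cdf_short = list(accumulate(pdf_short))

for edge, frac in zip(edges_survived, cdf_survived):
    print(f">=5 years: {frac:.1%} of this class is aged up to {edge}")
for edge, frac in zip(edges_short, cdf_short):
    print(f"<5 years:  {frac:.1%} of this class is aged up to {edge}")
```

Each value is a fraction of one class, so it answers "what share of survivors were this young", not "what share of patients this young survived".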

Mean, Variance and Std-dev


In [51]: print("Mean:")
for column in haberman_data.columns[0:3]:
    print(np.mean(greaterthan_orequal_5years[column]))
print("\nStd_deviation:")
for column in haberman_data.columns[0:3]:
    print(np.std(greaterthan_orequal_5years[column]))

Mean:
52.01777777777778
62.86222222222222
2.7911111111111113

Std_deviation:
10.98765547510051
3.2157452144021956
5.857258449412131

Median, Percentile, Quantile, IQR, MAD


In [57]: print("Median")
for column in haberman_data.columns[0:3]:
    print(np.median(greaterthan_orequal_5years[column]))

print("\nQuantiles:")
for column in haberman_data.columns[0:3]:
    print(np.percentile(greaterthan_orequal_5years[column], np.arange(0, 100, 25)))

print("\n90th Percentiles:")
for column in haberman_data.columns[0:3]:
    print(np.percentile(greaterthan_orequal_5years[column], 90))

from statsmodels import robust

print("\nMedian Absolute Deviation")
for column in haberman_data.columns[0:3]:
    print(robust.mad(greaterthan_orequal_5years[column]))

Median
52.0
63.0
0.0

Quantiles:
[30. 43. 52. 60.]
[58. 60. 63. 66.]
[0. 0. 0. 3.]

90th Percentiles:
67.0
67.0
8.0

Median Absolute Deviation


13.343419966550417
4.447806655516806
0.0
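
The Median Absolute Deviation values printed above are not raw deviations. By default, `statsmodels.robust.mad` divides the raw MAD by a normal-consistency constant (about 0.6745) so that the result is comparable to a standard deviation. The sketch below reproduces that scaling with the standard library only; `scaled_mad` is a hypothetical helper and the `sample` values are made up for illustration, not taken from the Haberman data.

```python
from statistics import median

# 0.75 quantile of the standard normal distribution: the default divisor
# used by statsmodels' robust.mad.
NORMAL_C = 0.6744897501960817

def scaled_mad(values, c=NORMAL_C):
    """Median absolute deviation about the median, divided by c."""
    center = median(values)
    return median(abs(v - center) for v in values) / c

# Illustrative values only (not the Haberman data).
sample = [30, 43, 52, 60, 77]
print(scaled_mad(sample, c=1.0))  # raw MAD
print(scaled_mad(sample))         # scaled MAD

# Assuming the default constant was used, the printed results correspond
# to raw MADs of 9 (Age) and 3 (OP_Year).
print(9 / NORMAL_C, 3 / NORMAL_C)
```

The MAD of 0.0 for axil_nodes is unaffected by the scaling: the median is 0 and more than half of the survivors have 0 positive nodes, so the median deviation is also 0.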

Box plot and Whiskers


In [42]: for column in haberman_data.columns[0:3]:
    sns.set_style("whitegrid")
    sns.boxplot(x='Surv_status', y=column, data=haberman_data)
    plt.title(column)
    plt.show()


Observation:

For Age:

Percentile        Survived    Non-Survived
0th               30          34
25th              43          46
50th              52          53
75th              60          61
100th             77          83

For operation year:

Percentile        Survived    Non-Survived
0th               58          58
25th              60          59
50th              63          63
75th              66          65
100th             69          69

For Axil Nodes:

Percentile        Survived    Non-Survived
0th               0           0
25th              0           1
50th              0           3
75th              3           11
Upper whisker     8           24
100th             46          52

The 100th percentiles match the last histogram bin edges printed earlier. For axil nodes, 8 and 24 appear to
be approximate upper-whisker positions read from the plot; larger node counts are drawn as outliers.
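
With seaborn's default setting (`whis=1.5`), the box plot whiskers reach the most extreme data points that lie within 1.5 x IQR of the quartiles, and anything beyond is drawn as an outlier. The sketch below applies that rule to the 25th and 75th percentiles listed above; `tukey_fences` is a hypothetical helper name.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return the (lower, upper) limits beyond which points count as outliers."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# (25th, 75th) percentiles taken from the observations above.
quartiles = {
    ("Age", "survived"): (43, 60),
    ("Age", "non-survived"): (46, 61),
    ("axil_nodes", "survived"): (0, 3),
    ("axil_nodes", "non-survived"): (1, 11),
}

fences = {key: tukey_fences(q1, q3) for key, (q1, q3) in quartiles.items()}

for (feature, group), (low, high) in fences.items():
    print(f"{feature} ({group}): whiskers stay within [{low}, {high}]")
```

For axil_nodes the upper limits are 7.5 and 26, so node counts above those values show up as individual outlier points rather than as part of the whisker.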

Violin plots


In [46]: for column in haberman_data.columns[0:3]:
    # `size` is not a violinplot parameter, so it is dropped here
    sns.violinplot(x="Surv_status", y=column, data=haberman_data)
    plt.title(column)
    plt.show()

Observations:


From the violin plots:

1. Age: the density of survivors is highest between 40 and 60; the density of non-survivors is highest
between 45 and 60.
2. Operation year: the density of survivors is high from 1958 to 1967; the density of non-survivors is
high from 1958 (the first year in the data) to 1960 and from 1962 to 1967.
3. Axil nodes: survivors are concentrated between 0 and 4 nodes; non-survivors are concentrated between
1 and 9 nodes.

Pair-plot
In [47]: plt.close()
sns.set_style("whitegrid")
# `height` replaces the `size` argument, which seaborn renamed in version 0.9
sns.pairplot(haberman_data, hue="Surv_status",
             vars=['Age', 'OP_Year', 'axil_nodes'], height=3)
plt.show()

Observation:


From the above pair plots, the survival and non-survival classes overlap closely for every pair of features,
so the data does not appear to be linearly separable.
