Data Description:
The dataset contains cases from a study conducted between 1958 and 1970 at the University of
Chicago's Billings Hospital on the survival of patients who had undergone surgery for breast cancer.
Using this dataset, analyze the survival chances of breast cancer patients who had undergone surgery,
based on the patient's age, the year of operation, the number of positive axillary nodes detected, and the
survival status.
DESCRIPTION OF FEATURES
In [5]: import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
(306, 4)
Out[5]: 1 225
2 81
Name: Surv_status, dtype: int64
In [6]: print(haberman_data.columns)
Surv_status represents whether a patient survived 5 years or longer after surgery. Patients who
survived 5 years or more are represented as 1, and patients who survived less than 5 years are
represented as 2.
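The class encoding above can be used to split the data into the two groups that the later plotting cells refer to. The following is a minimal sketch, not the notebook's original cell (which is not preserved here): the four sample rows are made up for illustration, and the variable names simply mirror those used in the plotting code.

```python
import pandas as pd

# Hypothetical sample rows for illustration only; the real dataset has 306 rows.
haberman_data = pd.DataFrame(
    {
        "Age": [30, 34, 45, 62],
        "OP_Year": [64, 59, 66, 61],
        "axil_nodes": [1, 0, 9, 23],
        "Surv_status": [1, 1, 2, 2],
    }
)

# Surv_status == 1: survived 5 years or more.
greaterthan_orequal_5years = haberman_data[haberman_data["Surv_status"] == 1]

# Surv_status == 2: survived less than 5 years.
lessthan_5years = haberman_data[haberman_data["Surv_status"] == 2]

# Number of rows per class.
print(haberman_data["Surv_status"].value_counts())
```

Boolean masks keep the original column names, so the same column loop can be run on either group.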
localhost:8888/nbconvert/html/haberman.ipynb?download=false 1/18
12/11/2019 haberman
Lymph Node (axil_nodes):
Lymph nodes are small, bean-shaped organs that act as filters along the lymph fluid channels. As lymph fluid
leaves the breast and eventually goes back into the bloodstream, the lymph nodes try to catch and trap
cancer cells before they reach other parts of the body. Having cancer cells in the lymph nodes under your arm
suggests an increased risk of the cancer spreading. In our data, this feature is the number of positive axillary nodes detected (0–52).
Age:
Univariate Analysis:
Analyze each feature individually to see how it affects the chances of survival.
plt.figure(1)
plt.title("greaterthan_orequal_5years Survivals")
plt.plot(greaterthan_orequal_5years["Age"], np.zeros_like(greaterthan_orequal_5years["Age"]), 'o', label='Age')
plt.plot(greaterthan_orequal_5years["OP_Year"], np.zeros_like(greaterthan_orequal_5years["OP_Year"]), 'o', label='OP_Year')
plt.plot(greaterthan_orequal_5years["axil_nodes"], np.zeros_like(greaterthan_orequal_5years["axil_nodes"]), 'o', label='axil_nodes')
plt.grid()
plt.xlabel("greaterthan_orequal_5years Survival")
plt.legend()
plt.show()
plt.figure(2)
plt.title("lessthan_5years Survivals")
plt.plot(lessthan_5years["Age"], np.zeros_like(lessthan_5years["Age"]), 'o', label='Age')
plt.plot(lessthan_5years["OP_Year"], np.zeros_like(lessthan_5years["OP_Year"]), 'o', label='OP_Year')
plt.plot(lessthan_5years["axil_nodes"], np.zeros_like(lessthan_5years["axil_nodes"]), 'o', label='axil_nodes')
plt.grid()
plt.xlabel("lessthan_5years Survival")
plt.legend()
plt.show()
C:\Users\USER\Anaconda3\lib\site-packages\matplotlib\axes\_axes.py:6462: UserWarning: The 'normed' kwarg is deprecated, and has been replaced by the 'density' kwarg.
  warnings.warn("The 'normed' kwarg is deprecated, and has been "
Observations:
1. Younger patients have a better chance of survival than older patients (patients aged 30-35 have the
highest chance of survival).
2. Axillary nodes also help: patients with fewer axillary nodes have a better chance of survival
(patients with approximately 0-3 axillary nodes have the best chance).
3. Beyond that, the histograms cannot separate the two classes, because the PDFs and histograms of the
survival and non-survival data closely overlap.
pdf = counts/(sum(counts))
print("Pdf of {0} vs <5years: \n {1}".format(column, pdf), "\n")
print("bin edges of {0} at <5years:\n {1}".format(column, bin_edges), "\n")
cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:], pdf, label='Pdf of lessthan_5years')
plt.plot(bin_edges[1:], cdf, label='Cdf of lessthan_5years')
plt.title("Cdf with pdf based on {}".format(column))
plt.xlabel(column)
plt.legend(loc='center left', bbox_to_anchor=(1, 0.5))
plt.show()
print("\n\n ************************************************************************************* \n\n")
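The fragment above starts after `counts` and `bin_edges` have already been computed. A minimal sketch of how those values could be obtained with `np.histogram` follows; the sample values and the choice of 10 bins are assumptions for illustration, not taken from the notebook.

```python
import numpy as np

# Hypothetical axil_nodes values for one class, for illustration only.
values = np.array([0, 0, 1, 2, 3, 4, 7, 9, 13, 23])

# counts[i] is the number of values in the i-th bin,
# which lies between bin_edges[i] and bin_edges[i + 1].
counts, bin_edges = np.histogram(values, bins=10)

# Normalize the counts so that they sum to 1 (fraction of points per bin).
pdf = counts / counts.sum()

# The running sum of the per-bin fractions gives the CDF at each right bin edge.
cdf = np.cumsum(pdf)

print(pdf)
print(cdf)
```

Because `bin_edges` has one more entry than `counts`, the plots use `bin_edges[1:]` so that each PDF and CDF value is drawn at the right edge of its bin.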
Observation:
1. Patients below age 50 account for about 42% of survivals (a lower age appears to increase the chance
of survival). About 58% of the patients who died were above age 50 (older patients appear to have a
lower chance of survival).
2. Operation year has no visible effect on the chance of survival. In the operation year vs. survival
status graph, the PDFs and CDFs of the survival and non-survival data closely overlap.
3. About 90% of the patients with 10 or fewer axillary nodes survived, and about 88% of the patients
with more than 20 axillary nodes died.
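Percentages like the ones quoted above can be read off a CDF plot, or computed directly as the fraction of a group at or below a threshold. A small sketch follows; the helper name `fraction_at_or_below` and the sample values are hypothetical.

```python
import numpy as np


def fraction_at_or_below(values, threshold):
    """Return the share of values <= threshold (the empirical CDF at threshold)."""
    values = np.asarray(values)
    # The mean of a boolean mask is the fraction of True entries.
    return float(np.mean(values <= threshold))


# Hypothetical axil_nodes values for one class, for illustration only.
sample_nodes = [0, 0, 1, 2, 3, 4, 7, 9, 13, 23]

share = fraction_at_or_below(sample_nodes, 10)
print(f"{share:.0%} of the sample has 10 or fewer nodes")
```

Applying the same helper to each class separately gives numbers that can be compared between the two groups.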
Mean:
52.01777777777778
62.86222222222222
2.7911111111111113
Std_deviation:
10.98765547510051
3.2157452144021956
5.857258449412131
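The cell that produced these numbers is not preserved. As a hedged sketch of how such per-column statistics could be computed: `np.std` defaults to the population standard deviation (`ddof=0`), while the pandas `Series.std` method defaults to the sample standard deviation (`ddof=1`), so the two can disagree slightly. The sample values below are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical column values, for illustration only.
values = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])

mean = np.mean(values.to_numpy())

# NumPy default: divide by n (population standard deviation).
population_std = np.std(values.to_numpy())

# pandas default: divide by n - 1 (sample standard deviation).
sample_std = values.std()

print(mean, population_std, sample_std)
```

For a few hundred rows the difference is small, but it is worth knowing which convention a reported value uses.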
In [57]: print("Median")
for column in haberman_data.columns[0:3]:
    print(np.median(greaterthan_orequal_5years[column]))
print("\nQuantiles:")
for column in haberman_data.columns[0:3]:
    print(np.percentile(greaterthan_orequal_5years[column], np.arange(0, 100, 25)))
print("\n90th Percentiles:")
for column in haberman_data.columns[0:3]:
    print(np.percentile(greaterthan_orequal_5years[column], 90))
Median
52.0
63.0
0.0
Quantiles:
[30. 43. 52. 60.]
[58. 60. 63. 66.]
[0. 0. 0. 3.]
90th Percentiles:
67.0
67.0
8.0
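The cell above uses `np.arange(0, 100, 25)`, which yields the 0th, 25th, 50th and 75th percentiles and stops short of 100. A sketch of computing all five values, including the maximum, on made-up ages:

```python
import numpy as np

# Hypothetical ages for one class, for illustration only.
ages = np.array([30, 40, 50, 60, 70])

# The stop value is exclusive, so this leaves out the 100th percentile.
print(np.arange(0, 100, 25))

# Using 101 as the stop value includes the 100th percentile (the maximum).
percentiles = np.arange(0, 101, 25)
quantiles = np.percentile(ages, percentiles)
print(quantiles)

# Interquartile range: spread of the middle 50% of the values.
iqr = quantiles[3] - quantiles[1]
print(iqr)
```

The interquartile range is less sensitive to a few extreme values than the standard deviation, which helps for a skewed feature such as axil_nodes.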
Observation:
For Age:
Percentile    Survived    Non-Survived
0th           30          34
25th          43          46
50th          52          53
75th          60          61
100th         78          83

For Operation Year (OP_Year):
Percentile    Survived    Non-Survived
0th           58          58
25th          60          59
50th          63          63
75th          66          65
100th         69          69

For Axillary Nodes (axil_nodes):
Percentile    Survived    Non-Survived
0th           0           0
25th          0           1
50th          0           3
75th          3           11
100th         8           24
Violin plots
Observations:
1. Age: the density of survivors is high between ages 40 and 60, and the density of non-survivors is high
between ages 45 and 60.
2. Operation year: the density of survivors is high from 1958 to 1967, and the density of non-survivors is
high from 1957 to 1960 and from 1962 to 1967.
3. Axillary nodes: the density of survivors is high in the range 0 to 4, and the density of non-survivors is
high in the range 1 to 9.
Pair-plot
In [47]: plt.close()
sns.set_style("whitegrid")
# `height` replaces the older `size` keyword (renamed in seaborn 0.9).
sns.pairplot(haberman_data, hue="Surv_status", vars=['Age', 'OP_Year', 'axil_nodes'], height=3)
plt.show()
Observation:
From the pair plots above, the survival and non-survival classes closely overlap for every pair of features.
The data does not appear to be linearly separable.