
1. Dataset
Title:
Haberman's Survival Data Set

Information:
The dataset contains cases from a study conducted between 1958 and 1970 at the
University of Chicago's Billings Hospital on the survival of patients who had undergone surgery
for breast cancer. Class 1 means the patient survived 5 years or longer; class 2 means the
patient died within 5 years.

Attribute Information:
1. Age of patient at time of operation (numerical)

2. Patient's year of operation (year - 1900, numerical)

3. Number of positive axillary nodes detected (numerical)

4. Survival status (class attribute)

-- 1 = the patient survived 5 years or longer

-- 2 = the patient died within 5 years

Dataset link:
https://archive.ics.uci.edu/ml/datasets/Haberman%27s+Survival
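
The four attributes above map directly onto a four-column, comma-separated file. The
following is a minimal sketch of reading such a file with the Python standard library. It
assumes the raw file has no header row; the column name Survival_Status and the sample rows
are illustrative assumptions, not taken from the real data.

```python
import csv
import io

# Column names are assumptions: the first three follow the names used in
# these notes, and Survival_Status is a hypothetical name for the class.
COLUMNS = ["Age", "Operation_Year", "positive_lymph_nodes", "Survival_Status"]


def load_haberman(source):
    """Read rows of four comma-separated integers into a list of dicts."""
    rows = []
    for record in csv.reader(source):
        if not record:  # skip blank lines
            continue
        values = [int(field) for field in record]
        rows.append(dict(zip(COLUMNS, values)))
    return rows


# Illustrative rows in the file's format (not the real dataset).
sample = io.StringIO("34,60,0,1\n52,65,7,2\n45,63,1,1\n")
patients = load_haberman(sample)

# Class 1 = survived 5 years or longer, class 2 = died within 5 years.
survived = [p for p in patients if p["Survival_Status"] == 1]
print(len(patients), "rows,", len(survived), "in class 1")
```

With a downloaded copy of the data, the same function can be given an open file object
instead of the in-memory sample.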

PDF Analytics:

According to the age histogram, most of the deaths fall among patients aged 40 to 60.
Patients younger than 40 are more likely to survive: in that region the plot shows only
the blue (survived) area, with no overlap from the orange (did not survive) area.

From the above PDFs (univariate analysis), neither Age nor Operation_Year is a good feature for
separating the classes, because their distributions are very similar for patients who survived
and patients who died.

From the year distribution, we can observe that the number of patients who did not survive rises
and then falls sharply between 1958 and 1960. A comparatively large number of patients operated
on in 1965 did not survive.

Using PDFs (univariate analysis):

a. Neither Age nor Operation_Year is a good feature for separating the classes, because their
distributions are very similar for patients who survived and patients who died.
b. positive_lymph_nodes is the only feature that is informative about survival status, because
its distribution differs between the two classes (labels). From that distribution we can infer
that most surviving patients had zero positive lymph nodes.
c. A comparatively large number of patients operated on in 1965 did not survive.
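
The univariate comparison can also be checked numerically. The sketch below computes, for
each class, the fraction of patients at each value of one feature, which is a discrete
stand-in for the per-class PDF. The rows are hypothetical values for illustration only, and
the label name Survival_Status is an assumption.

```python
from collections import Counter


def class_distribution(rows, feature, label="Survival_Status"):
    """Return {class: {value: fraction of that class with this value}}."""
    counts = {}
    for row in rows:
        counts.setdefault(row[label], Counter())[row[feature]] += 1
    return {
        cls: {value: n / sum(counter.values())
              for value, n in sorted(counter.items())}
        for cls, counter in counts.items()
    }


# Hypothetical rows for illustration only (not the real dataset).
rows = [
    {"positive_lymph_nodes": 0, "Survival_Status": 1},
    {"positive_lymph_nodes": 0, "Survival_Status": 1},
    {"positive_lymph_nodes": 0, "Survival_Status": 1},
    {"positive_lymph_nodes": 2, "Survival_Status": 1},
    {"positive_lymph_nodes": 0, "Survival_Status": 2},
    {"positive_lymph_nodes": 4, "Survival_Status": 2},
    {"positive_lymph_nodes": 9, "Survival_Status": 2},
    {"positive_lymph_nodes": 13, "Survival_Status": 2},
]

dist = class_distribution(rows, "positive_lymph_nodes")
for cls, fractions in sorted(dist.items()):
    print(cls, fractions)
```

A feature is useful for classification when the two per-class distributions differ clearly,
and of little use when they look alike.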

Box plot

From the box plots and violin plots, we can say that most patients who died were aged 46 to 62
and had an operation year between 59 and 65 (1959 to 1965), while most patients who survived
were aged 42 to 60 and had an operation year between 60 and 66 (1960 to 1966).
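
The ranges read off a box plot correspond to the first and third quartiles of each class. As
a rough sketch, they can be computed with the standard library as shown below. The ages are
hypothetical, the label name Survival_Status is an assumption, and plotting libraries may use
a slightly different quartile convention.

```python
from statistics import quantiles


def box_stats(values):
    """Return (Q1, median, Q3), the three lines drawn in a box plot's box."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    return q1, median, q3


def box_stats_by_class(rows, feature, label="Survival_Status"):
    """Group one feature by class and compute the box statistics per class."""
    grouped = {}
    for row in rows:
        grouped.setdefault(row[label], []).append(row[feature])
    return {cls: box_stats(vals) for cls, vals in grouped.items()}


# Hypothetical ages for illustration only (not the real dataset).
rows = (
    [{"Age": a, "Survival_Status": 1} for a in (30, 40, 50, 60, 70)]
    + [{"Age": a, "Survival_Status": 2} for a in (44, 48, 52, 56, 60)]
)

stats = box_stats_by_class(rows, "Age")
for cls, (q1, median, q3) in sorted(stats.items()):
    print(f"class {cls}: Q1={q1}, median={median}, Q3={q3}")
```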

Pair Plot (dimension reduction):

SVM for non-linearly separable dataset:

Classifying a non-linearly separable dataset using an SVM, a linear classifier:
An SVM is a linear classifier: for n-dimensional data it learns an (n - 1)-dimensional
hyperplane that separates the data into two classes. However, it can also be used to
classify a dataset that is not linearly separable. This is done by projecting the dataset
into a higher-dimensional space in which it is linearly separable.
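
The projection idea can be illustrated with a small sketch. It uses a synthetic dataset of
two concentric rings rather than the Haberman data, and the extra feature (squared distance
from the origin) is a hand-picked assumption for this example. A linear SVM is fit once on
the original 2-D points and once on the 3-D projection.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric rings: no straight line in 2-D separates the classes.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# Linear SVM on the original 2-D points.
linear_2d = SVC(kernel="linear").fit(X, y)
acc_2d = linear_2d.score(X, y)

# Project into 3-D by appending the squared distance from the origin.
# In this space a plane can separate the inner ring from the outer ring.
r2 = (X ** 2).sum(axis=1, keepdims=True)
X_3d = np.hstack([X, r2])
linear_3d = SVC(kernel="linear").fit(X_3d, y)
acc_3d = linear_3d.score(X_3d, y)

print(f"linear SVM, 2-D input: training accuracy {acc_2d:.2f}")
print(f"linear SVM, 3-D input: training accuracy {acc_3d:.2f}")
```

In practice the projection is usually not written out by hand: a kernel SVM such as
SVC(kernel="rbf") performs a comparable mapping implicitly.
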

Data Transformation (standardization):

scikit-learn's preprocessing module provides a utility class, StandardScaler, that implements
the Transformer API to compute the mean and standard deviation on a training set so as to be
able to later reapply the same transformation on the testing set.
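
A minimal sketch of that fit-on-train, reuse-on-test pattern follows. The feature rows are
hypothetical values laid out as [Age, Operation_Year, positive_lymph_nodes]; they are not
taken from the dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical feature rows: [Age, Operation_Year, positive_lymph_nodes].
X_train = np.array(
    [[30, 64, 1], [45, 60, 0], [52, 65, 7], [61, 59, 3]], dtype=float
)
X_test = np.array([[38, 62, 2], [57, 66, 0]], dtype=float)

scaler = StandardScaler()
# Learn the per-column mean and standard deviation from the training set only.
X_train_scaled = scaler.fit_transform(X_train)
# Reapply the same training statistics to the test set (no refitting).
X_test_scaled = scaler.transform(X_test)

print("training means:", scaler.mean_)
print("training stds: ", scaler.scale_)
```

Fitting the scaler on the training set alone keeps information from the test set out of the
preprocessing step.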

Resources

https://stackabuse.com/implementing-svm-and-kernel-svm-with-pythons-scikit-learn/

https://github.com/bethusaisampath/Haberman-Cancer-Survival-Dataset/blob/master/Haberman.ipynb

https://www.youtube.com/watch?v=U4vHP7KXt2Y&list=PLs7xKYqehofX6EIYD7WG6Lq1ZMM2cs6Es

https://towardsdatascience.com/pca-using-python-scikit-learn-e653f8989e60

https://github.com/mGalarnyk/Python_Tutorials/blob/master/Sklearn/PCA/PCA_Data_Visualization_Iris_Dataset_Blog.ipynb
