Attribute Information
id: id
study: 1 for all patients
sex: 1=male
age: in years
obstruct: obstruction of colon by tumour
perfor: perforation of colon
adhere: adherence to nearby organs
nodes: number of lymph nodes with detectable cancer
time: days until event or censoring
status: censoring status
differ: differentiation of tumour (1=well, 2=moderate, 3=poor)
extent: extent of local spread (1=submucosa, 2=muscle, 3=serosa, 4=contiguous structures)
surg: time from surgery to registration (0=short, 1=long)
node4: more than 4 positive lymph nodes
etype: event type (1=recurrence, 2=death)
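The coded attributes above can be turned back into readable labels with plain dictionaries. The sketch below is only an illustration: the names `DIFFER_LABELS`, `EXTENT_LABELS`, `ETYPE_LABELS` and `describe_record` are hypothetical helpers, not part of the dataset or of any library; the code-to-label pairs are copied from the attribute list.

```python
# Hypothetical lookup tables built from the attribute list above.
DIFFER_LABELS = {1: "well", 2: "moderate", 3: "poor"}
EXTENT_LABELS = {1: "submucosa", 2: "muscle", 3: "serosa", 4: "contiguous structures"}
ETYPE_LABELS = {1: "recurrence", 2: "death"}


def describe_record(differ, extent, etype):
    """Return a readable summary of one record's coded columns (illustrative helper)."""
    return (
        f"differentiation={DIFFER_LABELS[differ]}, "
        f"extent={EXTENT_LABELS[extent]}, "
        f"event={ETYPE_LABELS[etype]}"
    )


example = describe_record(differ=2, extent=3, etype=1)
print(example)
```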
Loading Libraries
import pandas as pd
from sklearn.linear_model import LogisticRegression
# Import plotting libraries
import matplotlib.pyplot as plt
# Import seaborn for statistical plots
import seaborn as sns
# Split the X and y data into a training set and a test set using
# scikit-learn's train_test_split, which shuffles the rows at random before splitting
from sklearn.model_selection import train_test_split
import numpy as np
# calculate accuracy measures and confusion matrix
from sklearn import metrics
Question 1: Import the Dataset
df=pd.read_csv('colon.csv').drop('Unnamed: 0',axis=1)
df.shape
(1858, 15)
Question 3: How many Missing Values are there? Drop all missing values.
df.isnull().sum()
df=df.dropna()
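As a small illustration of what the two calls above do, the sketch below runs them on a toy frame. The frame `toy` and its values are made up for this example and are not taken from colon.csv; only the column names are borrowed from the attribute list.

```python
import pandas as pd

# Toy data (invented for illustration): two cells are missing, in two different rows.
toy = pd.DataFrame({
    "nodes": [1.0, None, 3.0, 2.0],
    "differ": [2.0, 2.0, None, 1.0],
    "age": [43, 63, 71, 66],
})

# Missing values per column, then the overall total.
missing_per_column = toy.isnull().sum()
total_missing = int(missing_per_column.sum())

# dropna() removes every row that contains at least one missing value.
rows_before = len(toy)
toy_clean = toy.dropna()
rows_dropped = rows_before - len(toy_clean)

print(missing_per_column)
print("total missing:", total_missing, "| rows dropped:", rows_dropped)
```

Note that `dropna()` counts rows rather than cells, so the number of rows dropped can be smaller than the number of missing values when one row has several gaps.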
df.describe().T
2 888
1 888
Name: etype, dtype: int64
Question 6: With reference to the previous question, plot the levels of the dependent variable in
a plot of your choice.
sns.countplot(x=df.etype)
plt.grid()
plt.show()
Question 7: Drop the dependent variable from the Data Set and store it separately. Then split
your data into train and test data sets. The test data size should be 30% of the total data. Use
random_state=7.
X = df.drop('etype',axis=1)
Y = df.etype
test_size = 0.30
seed = 7
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=test_size, random_state=seed)
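A quick sanity check on the split sizes, as a sketch. It assumes the cleaned data has 888 + 888 rows, as the etype counts printed earlier suggest, and relies on scikit-learn rounding a fractional test size up and assigning the remaining rows to training.

```python
import math

# Assumption: row count after dropna() equals the two etype class counts printed earlier.
n_rows = 888 + 888
test_size = 0.30

# With a float test_size, the test count is rounded up; the rest go to training.
n_test = math.ceil(test_size * n_rows)
n_train = n_rows - n_test

print("train rows:", n_train, "| test rows:", n_test)
```

Under these assumptions the test set has 533 rows, which agrees with the sum of the four cells in the confusion matrix reported for Question 10.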
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
y_predict = model.predict(X_test)
Question 10: Generate the Confusion Matrix and Classification Report. What are your
observations and recommendations?
print('Confusion Matrix','\n',metrics.confusion_matrix(y_test, y_predict),'\n')
print('Classification Report','\n',metrics.classification_report(y_test, y_predict))
Confusion Matrix
[[143 115]
[115 160]]
Classification Report
precision recall f1-score support
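The per-class figures can be recomputed by hand from the confusion matrix printed above. The sketch below assumes scikit-learn's layout, with rows as true classes, columns as predicted classes, and labels in sorted order (1=recurrence, 2=death). The variable names are illustrative only.

```python
# Confusion matrix copied from the output above.
# Rows = true class, columns = predicted class, labels sorted as [1, 2].
cm = [[143, 115],
      [115, 160]]
labels = [1, 2]

total = sum(sum(row) for row in cm)
correct = sum(cm[i][i] for i in range(len(cm)))
accuracy = correct / total

per_class = {}
for i, label in enumerate(labels):
    true_pos = cm[i][i]
    predicted_as_label = sum(cm[r][i] for r in range(len(cm)))  # column sum
    actually_label = sum(cm[i])                                  # row sum (support)
    precision = true_pos / predicted_as_label
    recall = true_pos / actually_label
    f1 = 2 * precision * recall / (precision + recall)
    per_class[label] = (precision, recall, f1, actually_label)

print(f"accuracy = {correct}/{total} = {accuracy:.3f}")
for label, (p, r, f1, support) in per_class.items():
    print(f"class {label}: precision={p:.3f} recall={r:.3f} f1={f1:.3f} support={support}")
```

This gives an accuracy of 303/533, roughly 0.57, with precision and recall near 0.55 for class 1 and near 0.58 for class 2. Since the two classes are equally frequent in the full data (888 each), a model that guessed at random would score about 0.50, so this fit is only slightly better than chance.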