
Problem Statement

Five million Americans are currently living with heart disease, and the number is expected to rise. It is very
important to understand the factors that cause heart attacks so that individuals can take precautions. To
understand the causes of heart attacks, data was collected from various hospitals across the US and is provided
in US_Heart_Patients.csv. In this data set, the column Heart-Att indicates whether the person suffered a heart
attack or not.

Perform EDA on the data and build a model that predicts whether a person will suffer a heart attack or not.

Importing required packages and dataset


In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
In [2]:
from sklearn.ensemble import RandomForestClassifier
In [3]:
chd_df = pd.read_csv("US_Heart_Patients.csv")

Checking the data


In [4]:
chd_df.head(10)
Out[4]:
   Gender   age  education  currentSmoker  cigsPerDay  BP Meds  prevalentStroke  prevalentHyp  diabetes
0    Male  39.0        4.0            0.0         0.0      0.0              0.0           0.0       0.0
1  Female  46.0        2.0            0.0         0.0      0.0              0.0           0.0       0.0
2    Male  48.0        1.0            1.0        20.0      0.0              0.0           0.0       0.0
3  Female  61.0        3.0            1.0        30.0      0.0              0.0           1.0       0.0
4  Female  46.0        3.0            1.0        23.0      0.0              0.0           0.0       0.0
5  Female  43.0        2.0            0.0         0.0      0.0              0.0           1.0       0.0
6  Female  63.0        1.0            0.0         0.0      0.0              0.0           0.0       0.0
7  Female  45.0        2.0            1.0        20.0      0.0              0.0           0.0       0.0
8    Male  52.0        1.0            0.0         0.0      0.0              0.0           1.0       0.0
9    Male  43.0        1.0            1.0        30.0      0.0              0.0           1.0       0.0

   tot cholesterol  Systolic BP  Diastolic BP    BMI  heartRate  glucose  Heart-Att
0            195.0        106.0          70.0  26.97       80.0     77.0          0
1            250.0        121.0          81.0  28.73       95.0     76.0          0
2            245.0        127.5          80.0  25.34       75.0     70.0          0
3            225.0        150.0          95.0  28.58       65.0    103.0          1
4            285.0        130.0          84.0  23.10       85.0     85.0          0
5            228.0        180.0         110.0  30.30       77.0     99.0          0
6            205.0        138.0          71.0  33.11       60.0     85.0          1
7            313.0        100.0          71.0  21.68       79.0     78.0          0
8            260.0        141.5          89.0  26.36       76.0     79.0          0
9            225.0        162.0         107.0  23.61       93.0     88.0          0
In [5]:
chd_df.shape
Out[5]:
(4240, 16)
In [6]:
chd_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4240 entries, 0 to 4239
Data columns (total 16 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Gender 4232 non-null object
1 age 4238 non-null float64
2 education 4130 non-null float64
3 currentSmoker 4237 non-null float64
4 cigsPerDay 4209 non-null float64
5 BP Meds 4180 non-null float64
6 prevalentStroke 4231 non-null float64
7 prevalentHyp 4238 non-null float64
8 diabetes 4238 non-null float64
9 tot cholesterol 4180 non-null float64
10 Systolic BP 4236 non-null float64
11 Diastolic BP 4235 non-null float64
12 BMI 4216 non-null float64
13 heartRate 4236 non-null float64
14 glucose 3849 non-null float64
15 Heart-Att 4240 non-null int64
dtypes: float64(14), int64(1), object(1)
memory usage: 530.1+ KB
The Gender column is of type object, i.e. strings. It needs to be converted to numeric codes before building the RF model.
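As a quick optional check (not part of the original notebook), the object-typed columns can be listed directly; this minimal sketch assumes the chd_df frame loaded above:

object_cols = chd_df.select_dtypes(include='object').columns.tolist()
print(object_cols)   # only these columns need encoding; here it should be just ['Gender']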

EDA
Summary of the data

In [7]:
chd_df.describe(include="all")
Out[7]:
                  count unique     top  freq        mean        std     min     25%      50%     75%     max
Gender             4232      2  Female  2414         NaN        NaN     NaN     NaN      NaN     NaN     NaN
age                4238    NaN     NaN   NaN   49.579283   8.572875   32.00   42.00   49.000   56.00   70.00
education          4130    NaN     NaN   NaN    1.979903   1.019943    1.00    1.00    2.000    3.00    4.00
currentSmoker      4237    NaN     NaN   NaN    0.494218   0.500026    0.00    0.00    0.000    1.00    1.00
cigsPerDay         4209    NaN     NaN   NaN    9.001901  11.920742    0.00    0.00    0.000   20.00   70.00
BP Meds            4180    NaN     NaN   NaN    0.029665   0.169682    0.00    0.00    0.000    0.00    1.00
prevalentStroke    4231    NaN     NaN   NaN    0.005909   0.076650    0.00    0.00    0.000    0.00    1.00
prevalentHyp       4238    NaN     NaN   NaN    0.310524   0.462763    0.00    0.00    0.000    1.00    1.00
diabetes           4238    NaN     NaN   NaN    0.025720   0.158316    0.00    0.00    0.000    0.00    1.00
tot cholesterol    4180    NaN     NaN   NaN  236.677273  44.616098  107.00  206.00  234.000  263.00  696.00
Systolic BP        4236    NaN     NaN   NaN  132.362370  22.039244   83.50  117.00  128.000  144.00  295.00
Diastolic BP       4235    NaN     NaN   NaN   82.901889  11.914467   48.00   75.00   82.000   90.00  142.50
BMI                4216    NaN     NaN   NaN   25.798916   4.075256   15.54   23.07   25.395   28.04   56.80
heartRate          4236    NaN     NaN   NaN   75.867800  11.999488   44.00   68.00   75.000   83.00  143.00
glucose            3849    NaN     NaN   NaN   81.951936  23.958428   40.00   71.00   78.000   87.00  394.00
Heart-Att          4240    NaN     NaN   NaN    0.151887   0.358953    0.00    0.00    0.000    0.00    1.00

Frequency of the levels in the Gender column

In [8]:
# Get the count of Male and Female
chd_df.Gender.value_counts()
Out[8]:
Female 2414
Male 1818
Name: Gender, dtype: int64
Proportion of observations in each of the target classes

In [9]:
chd_df['Heart-Att'].value_counts()
Out[9]:
0 3596
1 644
Name: Heart-Att, dtype: int64
In [10]:
# Get the proportion in each class
chd_df['Heart-Att'].value_counts(normalize=True)
Out[10]:
0 0.848113
1 0.151887
Name: Heart-Att, dtype: float64

Check for duplicate data

In [11]:
# Are there any duplicates ?
dups = chd_df.duplicated()
print('Number of duplicate rows = %d' % (dups.sum()))
chd_df[dups]
Number of duplicate rows = 0
Out[11]:
Empty DataFrame
Columns: [Gender, age, education, currentSmoker, cigsPerDay, BP Meds, prevalentStroke, prevalentHyp, diabetes, tot cholesterol, Systolic BP, Diastolic BP, BMI, heartRate, glucose, Heart-Att]
Index: []

There are no duplicates in the data

Check for missing value in any column

In [12]:
# Are there any missing values ?
chd_df.isnull().sum()
Out[12]:
Gender 8
age 2
education 110
currentSmoker 3
cigsPerDay 31
BP Meds 60
prevalentStroke 9
prevalentHyp 2
diabetes 2
tot cholesterol 60
Systolic BP 4
Diastolic BP 5
BMI 24
heartRate 4
glucose 391
Heart-Att 0
dtype: int64
Missing values exist for all the independent variables, and a few variables (notably glucose and education) have a large number of missing values.
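To see how severe these gaps are per column, an optional sketch (not in the original notebook) is to express the missing counts as percentages of the 4240 rows:

# Percentage of missing values per column, largest first
missing_pct = chd_df.isnull().mean().mul(100).round(2).sort_values(ascending=False)
print(missing_pct)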
In [13]:
chd_df.dropna().shape
Out[13]:
(3616, 16)
If we were to drop every observation with a missing value, the number of observations would fall from 4240
to 3616 (the 16 columns are unchanged).

As the boxplot below shows, the continuous variables contain outliers.

So, we will impute the continuous variables with their median values, and drop the rows with missing values
in the categorical variables.
In [14]:
chd_df[['age', 'cigsPerDay', 'tot cholesterol', 'Systolic BP', 'Diastolic BP',
        'BMI', 'heartRate', 'glucose']].boxplot()
Out[14]:
<matplotlib.axes._subplots.AxesSubplot at 0x2cabfa5b788>

In [15]:
for column in chd_df[['age', 'cigsPerDay', 'tot cholesterol', 'Systolic BP',
                      'Diastolic BP', 'BMI', 'heartRate', 'glucose']]:
    median = chd_df[column].median()
    chd_df[column] = chd_df[column].fillna(median)
In [16]:
chd_df.isnull().sum()
Out[16]:
Gender 8
age 0
education 110
currentSmoker 3
cigsPerDay 0
BP Meds 60
prevalentStroke 9
prevalentHyp 2
diabetes 2
tot cholesterol 0
Systolic BP 0
Diastolic BP 0
BMI 0
heartRate 0
glucose 0
Heart-Att 0
dtype: int64
Only the categorical variables have missing values now
In [17]:
chd_df.dropna(inplace=True)
chd_df.shape
Out[17]:
(4055, 16)
After removing these missing values, the number of observations is now 4055

Checking for Outliers

In [18]:
# construct box plot for continuous variables
plt.figure(figsize=(15,15))
chd_df[['age', 'cigsPerDay', 'tot cholesterol', 'Systolic BP', 'Diastolic BP',
        'BMI', 'heartRate', 'glucose']].boxplot(vert=0)
Out[18]:
<matplotlib.axes._subplots.AxesSubplot at 0x2cac1bf2808>
Outliers exist for most of the continuous variables, and some variables have a large number of them

Treating the Outliers

In [19]:
# Function to calculate lower_range and upper_range using the IQR rule
def treat_outlier(col):
    # first and third quartiles of the column
    Q1, Q3 = np.percentile(col, [25, 75])
    IQR = Q3 - Q1
    # Tukey fences: anything beyond these limits is treated as an outlier
    lower_range = Q1 - (1.5 * IQR)
    upper_range = Q3 + (1.5 * IQR)
    return lower_range, upper_range
In [20]:
for feature in chd_df[['age', 'cigsPerDay', 'tot cholesterol', 'Systolic BP',
                       'Diastolic BP', 'BMI', 'heartRate', 'glucose']]:
    lr, ur = treat_outlier(chd_df[feature])
    chd_df[feature] = np.where(chd_df[feature] > ur, ur, chd_df[feature])
    chd_df[feature] = np.where(chd_df[feature] < lr, lr, chd_df[feature])
In [21]:
plt.figure(figsize=(15,15))
chd_df[['age', 'cigsPerDay', 'tot cholesterol', 'Systolic BP', 'Diastolic BP',
        'BMI', 'heartRate', 'glucose']].boxplot()
Out[21]:
<matplotlib.axes._subplots.AxesSubplot at 0x2cac21a9c08>

The boxplots show no remaining outliers after the treatment
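As an optional sanity check (not in the original notebook), the minimum and maximum of each treated column can be compared against the IQR fences; this sketch reuses treat_outlier and chd_df from above:

cont_cols = ['age', 'cigsPerDay', 'tot cholesterol', 'Systolic BP',
             'Diastolic BP', 'BMI', 'heartRate', 'glucose']
for col in cont_cols:
    lr, ur = treat_outlier(chd_df[col])
    # After capping, every value should lie inside [lr, ur]
    assert chd_df[col].between(lr, ur).all(), col
print(chd_df[cont_cols].describe().loc[['min', 'max']])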

Checking pairwise distribution of the continuous variables

In [22]:
sns.pairplot(chd_df[['age', 'cigsPerDay', 'tot cholesterol', 'Systolic BP',
                     'Diastolic BP', 'BMI', 'heartRate', 'glucose']])
Out[22]:
<seaborn.axisgrid.PairGrid at 0x2cac1dc1b48>

Checking for Correlations

In [23]:
# construct heatmap with only continuous variables
plt.figure(figsize=(10,8))
sns.set(font_scale=1.2)
sns.heatmap(chd_df[['age', 'cigsPerDay', 'tot cholesterol', 'Systolic BP',
                    'Diastolic BP', 'BMI', 'heartRate', 'glucose']].corr(),
            annot=True, vmin=-0.5, vmax=1)
Out[23]:
<matplotlib.axes._subplots.AxesSubplot at 0x2cac52bae08>

In [24]:
chd_df.head(10)
Out[24]:
   Gender   age  education  currentSmoker  cigsPerDay  BP Meds  prevalentStroke  prevalentHyp  diabetes
0    Male  39.0        4.0            0.0         0.0      0.0              0.0           0.0       0.0
1  Female  46.0        2.0            0.0         0.0      0.0              0.0           0.0       0.0
2    Male  48.0        1.0            1.0        20.0      0.0              0.0           0.0       0.0
3  Female  61.0        3.0            1.0        30.0      0.0              0.0           1.0       0.0
4  Female  46.0        3.0            1.0        23.0      0.0              0.0           0.0       0.0
5  Female  43.0        2.0            0.0         0.0      0.0              0.0           1.0       0.0
6  Female  63.0        1.0            0.0         0.0      0.0              0.0           0.0       0.0
7  Female  45.0        2.0            1.0        20.0      0.0              0.0           0.0       0.0
8    Male  52.0        1.0            0.0         0.0      0.0              0.0           1.0       0.0
9    Male  43.0        1.0            1.0        30.0      0.0              0.0           1.0       0.0

   tot cholesterol  Systolic BP  Diastolic BP    BMI  heartRate  glucose  Heart-Att
0            195.0        106.0          70.0  26.97       80.0     77.0          0
1            250.0        121.0          81.0  28.73       95.0     76.0          0
2            245.0        127.5          80.0  25.34       75.0     70.0          0
3            225.0        150.0          95.0  28.58       65.0    103.0          1
4            285.0        130.0          84.0  23.10       85.0     85.0          0
5            228.0        180.0         110.0  30.30       77.0     99.0          0
6            205.0        138.0          71.0  33.11       60.0     85.0          1
7            313.0        100.0          71.0  21.68       79.0     78.0          0
8            260.0        141.5          89.0  26.36       76.0     79.0          0
9            225.0        162.0         107.0  23.61       93.0     88.0          0

Tree-based models in scikit-learn (decision trees and random forests) accept only numeric columns; they cannot
handle string / object types. The following code loops through each column and, if the column type is object,
converts it to categorical codes, with each distinct value becoming a numeric code.
In [25]:
for feature in chd_df.columns:
    if chd_df[feature].dtype == 'object':
        chd_df[feature] = pd.Categorical(chd_df[feature]).codes
In [26]:
chd_df.head(10)
Out[26]:
   Gender   age  education  currentSmoker  cigsPerDay  BP Meds  prevalentStroke  prevalentHyp  diabetes
0       1  39.0        4.0            0.0         0.0      0.0              0.0           0.0       0.0
1       0  46.0        2.0            0.0         0.0      0.0              0.0           0.0       0.0
2       1  48.0        1.0            1.0        20.0      0.0              0.0           0.0       0.0
3       0  61.0        3.0            1.0        30.0      0.0              0.0           1.0       0.0
4       0  46.0        3.0            1.0        23.0      0.0              0.0           0.0       0.0
5       0  43.0        2.0            0.0         0.0      0.0              0.0           1.0       0.0
6       0  63.0        1.0            0.0         0.0      0.0              0.0           0.0       0.0
7       0  45.0        2.0            1.0        20.0      0.0              0.0           0.0       0.0
8       1  52.0        1.0            0.0         0.0      0.0              0.0           1.0       0.0
9       1  43.0        1.0            1.0        30.0      0.0              0.0           1.0       0.0

   tot cholesterol  Systolic BP  Diastolic BP    BMI  heartRate  glucose  Heart-Att
0            195.0        106.0          70.0  26.97       80.0     77.0          0
1            250.0        121.0          81.0  28.73       95.0     76.0          0
2            245.0        127.5          80.0  25.34       75.0     70.0          0
3            225.0        150.0          95.0  28.58       65.0    103.0          1
4            285.0        130.0          84.0  23.10       85.0     85.0          0
5            228.0        180.0         110.0  30.30       77.0     99.0          0
6            205.0        138.0          71.0  33.11       60.0     85.0          1
7            313.0        100.0          71.0  21.68       79.0     78.0          0
8            260.0        141.5          89.0  26.36       76.0     79.0          0
9            225.0        162.0         107.0  23.61       93.0     88.0          0

Female has been converted to 0, and Male to 1
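The 0/1 assignment comes from pd.Categorical, which sorts the categories alphabetically before assigning codes; a minimal illustrative sketch of the mapping (not part of the original notebook):

# Categories are sorted alphabetically, so 'Female' gets code 0 and 'Male' gets code 1
mapping = dict(enumerate(pd.Categorical(['Female', 'Male']).categories))
print(mapping)   # {0: 'Female', 1: 'Male'}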


In [27]:
chd_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4055 entries, 0 to 4239
Data columns (total 16 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Gender 4055 non-null int8
1 age 4055 non-null float64
2 education 4055 non-null float64
3 currentSmoker 4055 non-null float64
4 cigsPerDay 4055 non-null float64
5 BP Meds 4055 non-null float64
6 prevalentStroke 4055 non-null float64
7 prevalentHyp 4055 non-null float64
8 diabetes 4055 non-null float64
9 tot cholesterol 4055 non-null float64
10 Systolic BP 4055 non-null float64
11 Diastolic BP 4055 non-null float64
12 BMI 4055 non-null float64
13 heartRate 4055 non-null float64
14 glucose 4055 non-null float64
15 Heart-Att 4055 non-null int64
dtypes: float64(14), int64(1), int8(1)
memory usage: 670.8 KB

Separating the predictors (X) and the target column (y)

In [28]:
X = chd_df.drop(["Heart-Att"], axis=1)

y = chd_df.pop("Heart-Att")

Splitting data into training and test set

In [29]:
from sklearn.model_selection import train_test_split

X_train, X_test, train_labels, test_labels = train_test_split(X, y, test_size=0.30, random_state=1)
In [30]:
train_labels.value_counts()
Out[30]:
0 2424
1 414
Name: Heart-Att, dtype: int64
In [31]:
train_labels.value_counts(normalize=True)
Out[31]:
0 0.854123
1 0.145877
Name: Heart-Att, dtype: float64
In [32]:
test_labels.value_counts()
Out[32]:
0 1022
1 195
Name: Heart-Att, dtype: int64
In [33]:
test_labels.value_counts(normalize=True)
Out[33]:
0 0.83977
1 0.16023
Name: Heart-Att, dtype: float64
The target classes are distributed in roughly the same proportions in the train and test sets (about 15% vs 16% positives)
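If we wanted to guarantee matching class proportions rather than rely on chance, a stratified split could be used instead; the sketch below is a hypothetical alternative to the split above, not what this notebook ran:

X_train_s, X_test_s, ytrain_s, ytest_s = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y)   # stratify keeps the 0/1 ratio identical in both sets
print(ytrain_s.value_counts(normalize=True))
print(ytest_s.value_counts(normalize=True))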

Ensemble RandomForest Classifier


Building the Random Forest model

Importance of Random State

Fixing random_state to any integer makes the run reproducible: every time you build the model with that seed
you get the same output, just like the random_state argument in train_test_split.
In [34]:
# To understand how different random states affect the Out-of-Bag score
random_state = [0, 23, 42]
for i in random_state:
    rf = RandomForestClassifier(random_state=i, oob_score=True)
    rf.fit(X_train, train_labels)
    print(rf.oob_score_)
0.8513037350246653
0.8527131782945736
0.8548273431994362
In [35]:
# Build a RandomForestClassifier with n_estimators=100, max_features=6, and fit it on the training data
rfcl = RandomForestClassifier(n_estimators=100, max_features=6, random_state=0)
rfcl = rfcl.fit(X_train, train_labels)
In [36]:
rfcl
Out[36]:
RandomForestClassifier(max_features=6, random_state=0)

Predicting Train and Test data with the RF Model

In [37]:
ytrain_predict = rfcl.predict(X_train)
ytest_predict = rfcl.predict(X_test)
Train Accuracy
In [38]:
rfcl.score(X_train,train_labels)
Out[38]:
1.0

Evaluating model performance with confusion matrix

In [39]:
from sklearn.metrics import confusion_matrix,classification_report

Evaluating model performance on the training data

In [40]:
# Get the confusion matrix on the train data
confusion_matrix(train_labels,ytrain_predict)
sns.heatmap(confusion_matrix(train_labels,ytrain_predict),annot=True,
fmt='d',cbar=False, cmap='rainbow')
plt.xlabel('Predicted Label')
plt.ylabel('Actual Label')
plt.title('Confusion Matrix')
plt.show()
In [41]:
print(classification_report(train_labels,ytrain_predict))
precision recall f1-score support

0 1.00 1.00 1.00 2424


1 1.00 1.00 1.00 414

accuracy 1.00 2838


macro avg 1.00 1.00 1.00 2838
weighted avg 1.00 1.00 1.00 2838

In [42]:
print('Accuracy', ((2424+414)/(2424+414)))
print('Sensitivity',((414/414))) #TP/Actual yes
print('Specificity',(2424/2424)) #TN/Actual no
print('Precision',(414/414)) #TP/Predicted yes
Accuracy 1.0
Sensitivity 1.0
Specificity 1.0
Precision 1.0
In [43]:
from sklearn.metrics import roc_curve,roc_auc_score
rf_fpr, rf_tpr,_=roc_curve(train_labels,rfcl.predict_proba(X_train)[:,1])
plt.figure(figsize=(12,7))
plt.plot(rf_fpr,rf_tpr, marker='x', label='Random Forest')
plt.plot(np.arange(0,1.1,0.1),np.arange(0,1.1,0.1))
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC')
plt.show()
print('Area under Curve is',
roc_auc_score(train_labels,rfcl.predict_proba(X_train)[:,1]))

Area under Curve is 1.0

Evaluating model performance on the test data

In [44]:
confusion_matrix(test_labels,ytest_predict)
sns.heatmap(confusion_matrix(test_labels,ytest_predict),annot=True,
fmt='d',cbar=False, cmap='rainbow')
plt.xlabel('Predicted Label')
plt.ylabel('Actual Label')
plt.title('Confusion Matrix')
plt.show()
In [45]:
print(classification_report(test_labels,ytest_predict))
precision recall f1-score support

0 0.85 0.99 0.91 1022


1 0.61 0.06 0.10 195

accuracy 0.84 1217


macro avg 0.73 0.52 0.51 1217
weighted avg 0.81 0.84 0.78 1217

In [46]:
print('Accuracy', ((1019+13)/(1019+3+182+13)))
print('Sensitivity',((13/(13+182))))
print('Specificity',(1019/(1019+3)))
print('Precision',(13/(13+3)))
Accuracy 0.847986852917009
Sensitivity 0.06666666666666667
Specificity 0.99706457925636
Precision 0.8125
Test Accuracy
In [47]:
rfcl.score(X_test,test_labels)
Out[47]:
0.8430566967953985
In [48]:
#from sklearn.metrics import roc_curve,roc_auc_score
rf_fpr, rf_tpr,_=roc_curve(test_labels,rfcl.predict_proba(X_test)[:,1])
plt.figure(figsize=(12,7))
plt.plot(rf_fpr,rf_tpr, marker='x', label='Random Forest')
plt.plot(np.arange(0,1.1,0.1),np.arange(0,1.1,0.1))
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC')
plt.show()
print('Area under Curve is',
roc_auc_score(test_labels,rfcl.predict_proba(X_test)[:,1]))

Area under Curve is 0.6934969140448592

Result:
The area under the curve on the training data is 100%, which indicates that every training observation has been
classified correctly. On the test data, however, performance is only average, with an AUC of 69%, far lower
than on the training data.

Since we are building a model to predict whether a person will have heart disease, for practical purposes we are
more interested in correctly classifying class 1 (having heart disease) than class 0 (not having heart disease).

If a person who does not have heart disease is incorrectly predicted to have it, the cost and impact on their life
are far less severe than when we incorrectly predict that a person who actually has heart disease does not.
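One simple, hypothetical way to trade some specificity for higher sensitivity (not done in this notebook) is to lower the probability threshold below the default 0.5 when converting predicted probabilities into class labels:

threshold = 0.25   # assumed value, purely for illustration
proba_test = rfcl.predict_proba(X_test)[:, 1]            # probability of class 1
ytest_pred_lowthr = (proba_test >= threshold).astype(int)
print(confusion_matrix(test_labels, ytest_pred_lowthr))
print(classification_report(test_labels, ytest_pred_lowthr))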

Looking at the accuracy, sensitivity, specificity, precision and AUC of the Random Forest model, we get 100%
on the training data, whereas performance on the test data is much lower, especially for class 1. The model has
overfit the training data and is therefore weak at generalizing to and predicting new data.

In this model, we hard-coded the hyperparameter values. We can fine-tune the random forest by trying different
values for the hyperparameters and checking whether the model performance improves.

Grid Search for finding the optimal values of the hyperparameters
Note: grid search runs for a long time on larger data sets and larger parameter grids
In [49]:
from sklearn.model_selection import GridSearchCV

param_grid = {
'max_depth': [5,7,10],
'max_features': [4,6],
'min_samples_leaf': [5,10],
'min_samples_split': [50,100],
'n_estimators': [100,200,300]
}

rfcl = RandomForestClassifier(random_state=0)

grid_search = GridSearchCV(estimator = rfcl, param_grid = param_grid, cv = 10)
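As noted above, this search can be slow; a hypothetical variant (not run here) is to parallelise the cross-validation folds across all CPU cores with the n_jobs argument:

grid_search_parallel = GridSearchCV(estimator=rfcl, param_grid=param_grid,
                                    cv=10, n_jobs=-1)   # -1 uses every available core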


In [50]:
grid_search.fit(X_train, train_labels)
Out[50]:
GridSearchCV(cv=10, estimator=RandomForestClassifier(random_state=0),
param_grid={'max_depth': [5, 7, 10], 'max_features': [4, 6],
'min_samples_leaf': [5, 10],
'min_samples_split': [50, 100],
'n_estimators': [100, 200, 300]})
In [51]:
grid_search.best_params_
Out[51]:
{'max_depth': 10,
'max_features': 6,
'min_samples_leaf': 5,
'min_samples_split': 50,
'n_estimators': 200}
In [52]:
best_grid = grid_search.best_estimator_
In [53]:
best_grid
Out[53]:
RandomForestClassifier(max_depth=10, max_features=6, min_samples_leaf=5,
min_samples_split=50, n_estimators=200, random_state=0)
In [54]:
ytrain_predict = best_grid.predict(X_train)
ytest_predict = best_grid.predict(X_test)
In [55]:
confusion_matrix(train_labels,ytrain_predict)
Out[55]:
array([[2423, 1],
[ 397, 17]], dtype=int64)
In [56]:
print(classification_report(train_labels,ytrain_predict))
precision recall f1-score support

0 0.86 1.00 0.92 2424


1 0.94 0.04 0.08 414

accuracy 0.86 2838


macro avg 0.90 0.52 0.50 2838
weighted avg 0.87 0.86 0.80 2838

In [57]:
#from sklearn.metrics import roc_curve,roc_auc_score
rf_fpr, rf_tpr,_=roc_curve(train_labels,best_grid.predict_proba(X_train)[:,1])
plt.figure(figsize=(12,7))
plt.plot(rf_fpr,rf_tpr, marker='x', label='Random Forest')
plt.plot(np.arange(0,1.1,0.1),np.arange(0,1.1,0.1))
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC')
plt.show()
print('Area under Curve is',
roc_auc_score(train_labels,best_grid.predict_proba(X_train)[:,1]))
Area under Curve is 0.8770846287527304
In [58]:
confusion_matrix(test_labels,ytest_predict)
Out[58]:
array([[1022, 0],
[ 193, 2]], dtype=int64)
In [59]:
print(classification_report(test_labels,ytest_predict))
precision recall f1-score support

0 0.84 1.00 0.91 1022


1 1.00 0.01 0.02 195

accuracy 0.84 1217


macro avg 0.92 0.51 0.47 1217
weighted avg 0.87 0.84 0.77 1217

In [60]:
#from sklearn.metrics import roc_curve,roc_auc_score
rf_fpr, rf_tpr,_=roc_curve(test_labels,best_grid.predict_proba(X_test)[:,1])
plt.figure(figsize=(12,7))
plt.plot(rf_fpr,rf_tpr, marker='x', label='Random Forest')
plt.plot(np.arange(0,1.1,0.1),np.arange(0,1.1,0.1))
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC')
plt.show()
print('Area under Curve is',
roc_auc_score(test_labels,best_grid.predict_proba(X_test)[:,1]))
Area under Curve is 0.7107029956345023

Final Conclusion:
Here we can see that even when we try other values for the hyperparameters, the model performance does not
improve much.

The AUC is 87% on the train set and 70% on the test set. The overfitting problem still exists, as there is a
17-point gap between train and test. Moreover, the model is really only useful for predicting class 0, not class 1.

This is because the dataset is unbalanced, so we have a class imbalance problem.

In the real world you will often come across imbalanced datasets. To build a more robust classification model,
this class imbalance needs to be addressed before building the model; this applies to any kind of classification
model. Once the imbalance is addressed and the model is rebuilt, further tuning and optimization with grid
search should yield better performance.
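A hypothetical first step in that direction (not part of this notebook) is to re-weight the minority class inside the random forest itself; resampling techniques such as SMOTE are another common option:

rf_balanced = RandomForestClassifier(n_estimators=200, max_depth=10,
                                     class_weight='balanced',   # up-weights the rare class 1
                                     random_state=0)
rf_balanced.fit(X_train, train_labels)
print(classification_report(test_labels, rf_balanced.predict(X_test)))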

You will learn to deal with such imbalance in data using different performance improvement methods in the
Machine Learning Course.
