
Problem Statement

Five million Americans are currently living with heart disease, and the number is expected to rise. It is very
important to understand the factors that cause heart attacks so that individuals can take precautions. To
understand the causes of heart attacks, data was collected from various hospitals across the US and is provided
in US_Heart_Patients.csv. In this data set, the column Heart-Att indicates whether the person suffered a heart
attack or not.

Perform EDA on the data and build a model that predicts whether a person will suffer a heart attack or not.

Importing required packages and dataset


In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
In [2]:
from sklearn.ensemble import RandomForestClassifier
In [3]:
chd_df = pd.read_csv("US_Heart_Patients.csv")

Checking the data


In [4]:
chd_df.head(10)
Out[4]:
   Gender   age  education  currentSmoker  cigsPerDay  BP Meds  prevalentStroke  prevalentHyp  diabetes
0    Male  39.0        4.0            0.0         0.0      0.0              0.0           0.0       0.0
1  Female  46.0        2.0            0.0         0.0      0.0              0.0           0.0       0.0
2    Male  48.0        1.0            1.0        20.0      0.0              0.0           0.0       0.0
3  Female  61.0        3.0            1.0        30.0      0.0              0.0           1.0       0.0
4  Female  46.0        3.0            1.0        23.0      0.0              0.0           0.0       0.0
5  Female  43.0        2.0            0.0         0.0      0.0              0.0           1.0       0.0
6  Female  63.0        1.0            0.0         0.0      0.0              0.0           0.0       0.0
7  Female  45.0        2.0            1.0        20.0      0.0              0.0           0.0       0.0
8    Male  52.0        1.0            0.0         0.0      0.0              0.0           1.0       0.0
9    Male  43.0        1.0            1.0        30.0      0.0              0.0           1.0       0.0

   tot cholesterol  Systolic BP  Diastolic BP    BMI  heartRate  glucose  Heart-Att
0            195.0        106.0          70.0  26.97       80.0     77.0          0
1            250.0        121.0          81.0  28.73       95.0     76.0          0
2            245.0        127.5          80.0  25.34       75.0     70.0          0
3            225.0        150.0          95.0  28.58       65.0    103.0          1
4            285.0        130.0          84.0  23.10       85.0     85.0          0
5            228.0        180.0         110.0  30.30       77.0     99.0          0
6            205.0        138.0          71.0  33.11       60.0     85.0          1
7            313.0        100.0          71.0  21.68       79.0     78.0          0
8            260.0        141.5          89.0  26.36       76.0     79.0          0
9            225.0        162.0         107.0  23.61       93.0     88.0          0
In [5]:
chd_df.shape
Out[5]:
(4240, 16)
In [6]:
chd_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4240 entries, 0 to 4239
Data columns (total 16 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Gender 4232 non-null object
1 age 4238 non-null float64
2 education 4130 non-null float64
3 currentSmoker 4237 non-null float64
4 cigsPerDay 4209 non-null float64
5 BP Meds 4180 non-null float64
6 prevalentStroke 4231 non-null float64
7 prevalentHyp 4238 non-null float64
8 diabetes 4238 non-null float64
9 tot cholesterol 4180 non-null float64
10 Systolic BP 4236 non-null float64
11 Diastolic BP 4235 non-null float64
12 BMI 4216 non-null float64
13 heartRate 4236 non-null float64
14 glucose 3849 non-null float64
15 Heart-Att 4240 non-null int64
dtypes: float64(14), int64(1), object(1)
memory usage: 530.1+ KB
The Gender column is of type object, i.e. strings. It needs to be converted to numeric codes before building the RF model.
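As a quick optional check (not part of the original notebook), the object-typed columns can be listed directly; this minimal sketch assumes the chd_df frame loaded above:

object_cols = chd_df.select_dtypes(include='object').columns.tolist()
print(object_cols)   # only these columns need encoding; here it should be just ['Gender']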

EDA
Summary of the data

In [7]:
chd_df.describe(include="all")
Out[7]:
                  count unique     top  freq        mean        std     min     25%      50%     75%     max
Gender             4232      2  Female  2414         NaN        NaN     NaN     NaN      NaN     NaN     NaN
age                4238    NaN     NaN   NaN   49.579283   8.572875   32.00   42.00   49.000   56.00   70.00
education          4130    NaN     NaN   NaN    1.979903   1.019943    1.00    1.00    2.000    3.00    4.00
currentSmoker      4237    NaN     NaN   NaN    0.494218   0.500026    0.00    0.00    0.000    1.00    1.00
cigsPerDay         4209    NaN     NaN   NaN    9.001901  11.920742    0.00    0.00    0.000   20.00   70.00
BP Meds            4180    NaN     NaN   NaN    0.029665   0.169682    0.00    0.00    0.000    0.00    1.00
prevalentStroke    4231    NaN     NaN   NaN    0.005909   0.076650    0.00    0.00    0.000    0.00    1.00
prevalentHyp       4238    NaN     NaN   NaN    0.310524   0.462763    0.00    0.00    0.000    1.00    1.00
diabetes           4238    NaN     NaN   NaN    0.025720   0.158316    0.00    0.00    0.000    0.00    1.00
tot cholesterol    4180    NaN     NaN   NaN  236.677273  44.616098  107.00  206.00  234.000  263.00  696.00
Systolic BP        4236    NaN     NaN   NaN  132.362370  22.039244   83.50  117.00  128.000  144.00  295.00
Diastolic BP       4235    NaN     NaN   NaN   82.901889  11.914467   48.00   75.00   82.000   90.00  142.50
BMI                4216    NaN     NaN   NaN   25.798916   4.075256   15.54   23.07   25.395   28.04   56.80
heartRate          4236    NaN     NaN   NaN   75.867800  11.999488   44.00   68.00   75.000   83.00  143.00
glucose            3849    NaN     NaN   NaN   81.951936  23.958428   40.00   71.00   78.000   87.00  394.00
Heart-Att          4240    NaN     NaN   NaN    0.151887   0.358953    0.00    0.00    0.000    0.00    1.00

Frequency of the levels in the Gender column

In [8]:
# Get the count of Male and Female
chd_df.Gender.value_counts()
Out[8]:
Female 2414
Male 1818
Name: Gender, dtype: int64
Proportion of observations in each of the target classes

In [9]:
chd_df['Heart-Att'].value_counts()
Out[9]:
0 3596
1 644
Name: Heart-Att, dtype: int64
In [10]:
# Get the proportion in each class
chd_df['Heart-Att'].value_counts(normalize=True)
Out[10]:
0 0.848113
1 0.151887
Name: Heart-Att, dtype: float64

Check for duplicate data

In [11]:
# Are there any duplicates ?
dups = chd_df.duplicated()
print('Number of duplicate rows = %d' % (dups.sum()))
chd_df[dups]
Number of duplicate rows = 0
Out[11]:
Empty DataFrame
Columns: [Gender, age, education, currentSmoker, cigsPerDay, BP Meds, prevalentStroke, prevalentHyp, diabetes, tot cholesterol, Systolic BP, Diastolic BP, BMI, heartRate, glucose, Heart-Att]
Index: []

There are no duplicates in the data

Check for missing value in any column

In [12]:
# Are there any missing values ?
chd_df.isnull().sum()
Out[12]:
Gender 8
age 2
education 110
currentSmoker 3
cigsPerDay 31
BP Meds 60
prevalentStroke 9
prevalentHyp 2
diabetes 2
tot cholesterol 60
Systolic BP 4
Diastolic BP 5
BMI 24
heartRate 4
glucose 391
Heart-Att 0
dtype: int64
Missing values exist for all the independent variables, and a few variables (notably glucose and education) have a large number of missing values.
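To see how severe these gaps are per column, an optional sketch (not in the original notebook) is to express the missing counts as percentages of the 4240 rows:

# Percentage of missing values per column, largest first
missing_pct = chd_df.isnull().mean().mul(100).round(2).sort_values(ascending=False)
print(missing_pct)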
In [13]:
chd_df.dropna().shape
Out[13]:
(3616, 16)
If we were to drop every observation with a missing value, the number of observations would fall from 4240
to 3616 (the 16 columns are unchanged).

As the boxplot below shows, the continuous variables contain outliers.

So, we will impute the continuous variables with their median values, and drop the rows with missing values
in the categorical variables.
In [14]:
chd_df[['age', 'cigsPerDay', 'tot cholesterol', 'Systolic BP', 'Diastolic BP',
        'BMI', 'heartRate', 'glucose']].boxplot()
Out[14]:
<matplotlib.axes._subplots.AxesSubplot at 0x2cabfa5b788>

In [15]:
for column in chd_df[['age', 'cigsPerDay', 'tot cholesterol', 'Systolic BP',
                      'Diastolic BP', 'BMI', 'heartRate', 'glucose']]:
    median = chd_df[column].median()
    chd_df[column] = chd_df[column].fillna(median)
In [16]:
chd_df.isnull().sum()
Out[16]:
Gender 8
age 0
education 110
currentSmoker 3
cigsPerDay 0
BP Meds 60
prevalentStroke 9
prevalentHyp 2
diabetes 2
tot cholesterol 0
Systolic BP 0
Diastolic BP 0
BMI 0
heartRate 0
glucose 0
Heart-Att 0
dtype: int64
Only the categorical variables have missing values now
In [17]:
chd_df.dropna(inplace=True)
chd_df.shape
Out[17]:
(4055, 16)
After removing these missing values, the number of observations is now 4055

Checking for Outliers

In [18]:
# construct box plot for continuous variables
plt.figure(figsize=(15,15))
chd_df[['age', 'cigsPerDay', 'tot cholesterol', 'Systolic BP', 'Diastolic BP',
        'BMI', 'heartRate', 'glucose']].boxplot(vert=0)
Out[18]:
<matplotlib.axes._subplots.AxesSubplot at 0x2cac1bf2808>
Outliers exist for most of the continuous variables, and some variables have a large number of them

Treating the Outliers

In [19]:
# Function to calculate lower_range and upper_range using the IQR rule
def treat_outlier(col):
    # first and third quartiles of the column
    Q1, Q3 = np.percentile(col, [25, 75])
    IQR = Q3 - Q1
    # Tukey fences: anything beyond these limits is treated as an outlier
    lower_range = Q1 - (1.5 * IQR)
    upper_range = Q3 + (1.5 * IQR)
    return lower_range, upper_range
In [20]:
for feature in chd_df[['age', 'cigsPerDay', 'tot cholesterol', 'Systolic BP',
                       'Diastolic BP', 'BMI', 'heartRate', 'glucose']]:
    lr, ur = treat_outlier(chd_df[feature])
    chd_df[feature] = np.where(chd_df[feature] > ur, ur, chd_df[feature])
    chd_df[feature] = np.where(chd_df[feature] < lr, lr, chd_df[feature])
In [21]:
plt.figure(figsize=(15,15))
chd_df[['age', 'cigsPerDay', 'tot cholesterol', 'Systolic BP', 'Diastolic BP',
        'BMI', 'heartRate', 'glucose']].boxplot()
Out[21]:
<matplotlib.axes._subplots.AxesSubplot at 0x2cac21a9c08>

The boxplots show no remaining outliers after the treatment
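As an optional sanity check (not in the original notebook), the minimum and maximum of each treated column can be compared against the IQR fences; this sketch reuses treat_outlier and chd_df from above:

cont_cols = ['age', 'cigsPerDay', 'tot cholesterol', 'Systolic BP',
             'Diastolic BP', 'BMI', 'heartRate', 'glucose']
for col in cont_cols:
    lr, ur = treat_outlier(chd_df[col])
    # After capping, every value should lie inside [lr, ur]
    assert chd_df[col].between(lr, ur).all(), col
print(chd_df[cont_cols].describe().loc[['min', 'max']])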

Checking pairwise distribution of the continuous variables

In [22]:
sns.pairplot(chd_df[['age', 'cigsPerDay', 'tot cholesterol', 'Systolic BP',
                     'Diastolic BP', 'BMI', 'heartRate', 'glucose']])
Out[22]:
<seaborn.axisgrid.PairGrid at 0x2cac1dc1b48>

Checking for Correlations

In [23]:
# construct heatmap with only continuous variables
plt.figure(figsize=(10,8))
sns.set(font_scale=1.2)
sns.heatmap(chd_df[['age', 'cigsPerDay', 'tot cholesterol', 'Systolic BP',
                    'Diastolic BP', 'BMI', 'heartRate', 'glucose']].corr(),
            annot=True, vmin=-0.5, vmax=1)
Out[23]:
<matplotlib.axes._subplots.AxesSubplot at 0x2cac52bae08>

In [24]:
chd_df.head(10)
Out[24]:
   Gender   age  education  currentSmoker  cigsPerDay  BP Meds  prevalentStroke  prevalentHyp  diabetes
0    Male  39.0        4.0            0.0         0.0      0.0              0.0           0.0       0.0
1  Female  46.0        2.0            0.0         0.0      0.0              0.0           0.0       0.0
2    Male  48.0        1.0            1.0        20.0      0.0              0.0           0.0       0.0
3  Female  61.0        3.0            1.0        30.0      0.0              0.0           1.0       0.0
4  Female  46.0        3.0            1.0        23.0      0.0              0.0           0.0       0.0
5  Female  43.0        2.0            0.0         0.0      0.0              0.0           1.0       0.0
6  Female  63.0        1.0            0.0         0.0      0.0              0.0           0.0       0.0
7  Female  45.0        2.0            1.0        20.0      0.0              0.0           0.0       0.0
8    Male  52.0        1.0            0.0         0.0      0.0              0.0           1.0       0.0
9    Male  43.0        1.0            1.0        30.0      0.0              0.0           1.0       0.0

   tot cholesterol  Systolic BP  Diastolic BP    BMI  heartRate  glucose  Heart-Att
0            195.0        106.0          70.0  26.97       80.0     77.0          0
1            250.0        121.0          81.0  28.73       95.0     76.0          0
2            245.0        127.5          80.0  25.34       75.0     70.0          0
3            225.0        150.0          95.0  28.58       65.0    103.0          1
4            285.0        130.0          84.0  23.10       85.0     85.0          0
5            228.0        180.0         110.0  30.30       77.0     99.0          0
6            205.0        138.0          71.0  33.11       60.0     85.0          1
7            313.0        100.0          71.0  21.68       79.0     78.0          0
8            260.0        141.5          89.0  26.36       76.0     79.0          0
9            225.0        162.0         107.0  23.61       93.0     88.0          0

Tree-based models in scikit-learn (decision trees and random forests) accept only numeric columns; they cannot
handle string / object types. The following code loops through each column and, if the column type is object,
converts it to categorical codes, with each distinct value becoming a numeric code.
In [25]:
for feature in chd_df.columns:
    if chd_df[feature].dtype == 'object':
        chd_df[feature] = pd.Categorical(chd_df[feature]).codes
In [26]:
chd_df.head(10)
Out[26]:
   Gender   age  education  currentSmoker  cigsPerDay  BP Meds  prevalentStroke  prevalentHyp  diabetes
0       1  39.0        4.0            0.0         0.0      0.0              0.0           0.0       0.0
1       0  46.0        2.0            0.0         0.0      0.0              0.0           0.0       0.0
2       1  48.0        1.0            1.0        20.0      0.0              0.0           0.0       0.0
3       0  61.0        3.0            1.0        30.0      0.0              0.0           1.0       0.0
4       0  46.0        3.0            1.0        23.0      0.0              0.0           0.0       0.0
5       0  43.0        2.0            0.0         0.0      0.0              0.0           1.0       0.0
6       0  63.0        1.0            0.0         0.0      0.0              0.0           0.0       0.0
7       0  45.0        2.0            1.0        20.0      0.0              0.0           0.0       0.0
8       1  52.0        1.0            0.0         0.0      0.0              0.0           1.0       0.0
9       1  43.0        1.0            1.0        30.0      0.0              0.0           1.0       0.0

   tot cholesterol  Systolic BP  Diastolic BP    BMI  heartRate  glucose  Heart-Att
0            195.0        106.0          70.0  26.97       80.0     77.0          0
1            250.0        121.0          81.0  28.73       95.0     76.0          0
2            245.0        127.5          80.0  25.34       75.0     70.0          0
3            225.0        150.0          95.0  28.58       65.0    103.0          1
4            285.0        130.0          84.0  23.10       85.0     85.0          0
5            228.0        180.0         110.0  30.30       77.0     99.0          0
6            205.0        138.0          71.0  33.11       60.0     85.0          1
7            313.0        100.0          71.0  21.68       79.0     78.0          0
8            260.0        141.5          89.0  26.36       76.0     79.0          0
9            225.0        162.0         107.0  23.61       93.0     88.0          0

Female has been converted to 0, and Male to 1
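The 0/1 assignment comes from pd.Categorical, which sorts the categories alphabetically before assigning codes; a minimal illustrative sketch of the mapping (not part of the original notebook):

# Categories are sorted alphabetically, so 'Female' gets code 0 and 'Male' gets code 1
mapping = dict(enumerate(pd.Categorical(['Female', 'Male']).categories))
print(mapping)   # {0: 'Female', 1: 'Male'}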


In [27]:
chd_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4055 entries, 0 to 4239
Data columns (total 16 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Gender 4055 non-null int8
1 age 4055 non-null float64
2 education 4055 non-null float64
3 currentSmoker 4055 non-null float64
4 cigsPerDay 4055 non-null float64
5 BP Meds 4055 non-null float64
6 prevalentStroke 4055 non-null float64
7 prevalentHyp 4055 non-null float64
8 diabetes 4055 non-null float64
9 tot cholesterol 4055 non-null float64
10 Systolic BP 4055 non-null float64
11 Diastolic BP 4055 non-null float64
12 BMI 4055 non-null float64
13 heartRate 4055 non-null float64
14 glucose 4055 non-null float64
15 Heart-Att 4055 non-null int64
dtypes: float64(14), int64(1), int8(1)
memory usage: 670.8 KB

Separating the predictors (X) and the target column (y)

In [28]:
X = chd_df.drop(["Heart-Att"], axis=1)

y = chd_df.pop("Heart-Att")

Splitting data into training and test set

In [29]:
from sklearn.model_selection import train_test_split

X_train, X_test, train_labels, test_labels = train_test_split(X, y, test_size=0.30, random_state=1)
In [30]:
train_labels.value_counts()
Out[30]:
0 2424
1 414
Name: Heart-Att, dtype: int64
In [31]:
train_labels.value_counts(normalize=True)
Out[31]:
0 0.854123
1 0.145877
Name: Heart-Att, dtype: float64
In [32]:
test_labels.value_counts()
Out[32]:
0 1022
1 195
Name: Heart-Att, dtype: int64
In [33]:
test_labels.value_counts(normalize=True)
Out[33]:
0 0.83977
1 0.16023
Name: Heart-Att, dtype: float64
The target classes are distributed in roughly the same proportions in the train and test sets (about 15% vs 16% positives)
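If we wanted to guarantee matching class proportions rather than rely on chance, a stratified split could be used instead; the sketch below is a hypothetical alternative to the split above, not what this notebook ran:

X_train_s, X_test_s, ytrain_s, ytest_s = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y)   # stratify keeps the 0/1 ratio identical in both sets
print(ytrain_s.value_counts(normalize=True))
print(ytest_s.value_counts(normalize=True))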

Ensemble RandomForest Classifier


Building the Random Forest model

Importance of Random State

Fixing random_state to any integer makes the run reproducible: every time you build the model with that seed
you get the same output, just like the random_state argument in train_test_split.
In [34]:
# To understand how different random states affect the Out-of-Bag score
random_state = [0, 23, 42]
for i in random_state:
    rf = RandomForestClassifier(random_state=i, oob_score=True)
    rf.fit(X_train, train_labels)
    print(rf.oob_score_)
0.8513037350246653
0.8527131782945736
0.8548273431994362
In [35]:
# Build a RandomForestClassifier with n_estimators=100, max_features=6, and fit it on the training data
rfcl = RandomForestClassifier(n_estimators=100, max_features=6, random_state=0)
rfcl = rfcl.fit(X_train, train_labels)
In [36]:
rfcl
Out[36]:
RandomForestClassifier(max_features=6, random_state=0)

Predicting Train and Test data with the RF Model

In [37]:
ytrain_predict = rfcl.predict(X_train)
ytest_predict = rfcl.predict(X_test)
Train Accuracy
In [38]:
rfcl.score(X_train,train_labels)
Out[38]:
1.0

Evaluating model performance with confusion matrix

In [39]:
from sklearn.metrics import confusion_matrix,classification_report

Evaluating model performance on the training data

In [40]:
# Get the confusion matrix on the train data
confusion_matrix(train_labels,ytrain_predict)
sns.heatmap(confusion_matrix(train_labels,ytrain_predict),annot=True,
fmt='d',cbar=False, cmap='rainbow')
plt.xlabel('Predicted Label')
plt.ylabel('Actual Label')
plt.title('Confusion Matrix')
plt.show()
In [41]:
print(classification_report(train_labels,ytrain_predict))
precision recall f1-score support

0 1.00 1.00 1.00 2424


1 1.00 1.00 1.00 414

accuracy 1.00 2838


macro avg 1.00 1.00 1.00 2838
weighted avg 1.00 1.00 1.00 2838

In [42]:
print('Accuracy', ((2424+414)/(2424+414)))
print('Sensitivity',((414/414))) #TP/Actual yes
print('Specificity',(2424/2424)) #TN/Actual no
print('Precision',(414/414)) #TP/Predicted yes
Accuracy 1.0
Sensitivity 1.0
Specificity 1.0
Precision 1.0
In [43]:
from sklearn.metrics import roc_curve,roc_auc_score
rf_fpr, rf_tpr,_=roc_curve(train_labels,rfcl.predict_proba(X_train)[:,1])
plt.figure(figsize=(12,7))
plt.plot(rf_fpr,rf_tpr, marker='x', label='Random Forest')
plt.plot(np.arange(0,1.1,0.1),np.arange(0,1.1,0.1))
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC')
plt.show()
print('Area under Curve is',
roc_auc_score(train_labels,rfcl.predict_proba(X_train)[:,1]))

Area under Curve is 1.0

Evaluating model performance on the test data

In [44]:
confusion_matrix(test_labels,ytest_predict)
sns.heatmap(confusion_matrix(test_labels,ytest_predict),annot=True,
fmt='d',cbar=False, cmap='rainbow')
plt.xlabel('Predicted Label')
plt.ylabel('Actual Label')
plt.title('Confusion Matrix')
plt.show()
In [45]:
print(classification_report(test_labels,ytest_predict))
precision recall f1-score support

0 0.85 0.99 0.91 1022


1 0.61 0.06 0.10 195

accuracy 0.84 1217


macro avg 0.73 0.52 0.51 1217
weighted avg 0.81 0.84 0.78 1217

In [46]:
print('Accuracy', ((1019+13)/(1019+3+182+13)))
print('Sensitivity',((13/(13+182))))
print('Specificity',(1019/(1019+3)))
print('Precision',(13/(13+3)))
Accuracy 0.847986852917009
Sensitivity 0.06666666666666667
Specificity 0.99706457925636
Precision 0.8125
Test Accuracy
In [47]:
rfcl.score(X_test,test_labels)
Out[47]:
0.8430566967953985
In [48]:
#from sklearn.metrics import roc_curve,roc_auc_score
rf_fpr, rf_tpr,_=roc_curve(test_labels,rfcl.predict_proba(X_test)[:,1])
plt.figure(figsize=(12,7))
plt.plot(rf_fpr,rf_tpr, marker='x', label='Random Forest')
plt.plot(np.arange(0,1.1,0.1),np.arange(0,1.1,0.1))
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC')
plt.show()
print('Area under Curve is',
roc_auc_score(test_labels,rfcl.predict_proba(X_test)[:,1]))

Area under Curve is 0.6934969140448592

Result:
The area under the curve on the training data is 100%, which indicates that every training observation has been
classified correctly. On the test data, however, performance is only average, with an AUC of 69%, far lower
than on the training data.

Since we are building a model to predict whether a person will have heart disease, for practical purposes we are
more interested in correctly classifying class 1 (having heart disease) than class 0 (not having heart disease).

If a person who does not have heart disease is incorrectly predicted to have it, the cost and impact on their life
are far less severe than when we incorrectly predict that a person who actually has heart disease does not.
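One simple, hypothetical way to trade some specificity for higher sensitivity (not done in this notebook) is to lower the probability threshold below the default 0.5 when converting predicted probabilities into class labels:

threshold = 0.25   # assumed value, purely for illustration
proba_test = rfcl.predict_proba(X_test)[:, 1]            # probability of class 1
ytest_pred_lowthr = (proba_test >= threshold).astype(int)
print(confusion_matrix(test_labels, ytest_pred_lowthr))
print(classification_report(test_labels, ytest_pred_lowthr))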

Looking at the accuracy, sensitivity, specificity, precision and AUC of the Random Forest model, we get 100%
on the training data, whereas performance on the test data is much lower, especially for class 1. The model has
overfit the training data and is therefore weak at generalizing to and predicting new data.

In this model, we hard-coded the hyperparameter values. We can fine-tune the random forest by trying different
values for the hyperparameters and checking whether the model performance improves.

Grid Search for finding the optimal values of the hyperparameters
Note: grid search runs for a long time on larger data sets and larger parameter grids
In [49]:
from sklearn.model_selection import GridSearchCV

param_grid = {
'max_depth': [5,7,10],
'max_features': [4,6],
'min_samples_leaf': [5,10],
'min_samples_split': [50,100],
'n_estimators': [100,200,300]
}

rfcl = RandomForestClassifier(random_state=0)

grid_search = GridSearchCV(estimator = rfcl, param_grid = param_grid, cv = 10)
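As noted above, this search can be slow; a hypothetical variant (not run here) is to parallelise the cross-validation folds across all CPU cores with the n_jobs argument:

grid_search_parallel = GridSearchCV(estimator=rfcl, param_grid=param_grid,
                                    cv=10, n_jobs=-1)   # -1 uses every available core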


In [50]:
grid_search.fit(X_train, train_labels)
Out[50]:
GridSearchCV(cv=10, estimator=RandomForestClassifier(random_state=0),
param_grid={'max_depth': [5, 7, 10], 'max_features': [4, 6],
'min_samples_leaf': [5, 10],
'min_samples_split': [50, 100],
'n_estimators': [100, 200, 300]})
In [51]:
grid_search.best_params_
Out[51]:
{'max_depth': 10,
'max_features': 6,
'min_samples_leaf': 5,
'min_samples_split': 50,
'n_estimators': 200}
In [52]:
best_grid = grid_search.best_estimator_
In [53]:
best_grid
Out[53]:
RandomForestClassifier(max_depth=10, max_features=6, min_samples_leaf=5,
min_samples_split=50, n_estimators=200, random_state=0)
In [54]:
ytrain_predict = best_grid.predict(X_train)
ytest_predict = best_grid.predict(X_test)
In [55]:
confusion_matrix(train_labels,ytrain_predict)
Out[55]:
array([[2423, 1],
[ 397, 17]], dtype=int64)
In [56]:
print(classification_report(train_labels,ytrain_predict))
precision recall f1-score support

0 0.86 1.00 0.92 2424


1 0.94 0.04 0.08 414

accuracy 0.86 2838


macro avg 0.90 0.52 0.50 2838
weighted avg 0.87 0.86 0.80 2838

In [57]:
#from sklearn.metrics import roc_curve,roc_auc_score
rf_fpr, rf_tpr,_=roc_curve(train_labels,best_grid.predict_proba(X_train)[:,1])
plt.figure(figsize=(12,7))
plt.plot(rf_fpr,rf_tpr, marker='x', label='Random Forest')
plt.plot(np.arange(0,1.1,0.1),np.arange(0,1.1,0.1))
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC')
plt.show()
print('Area under Curve is',
roc_auc_score(train_labels,best_grid.predict_proba(X_train)[:,1]))
Area under Curve is 0.8770846287527304
In [58]:
confusion_matrix(test_labels,ytest_predict)
Out[58]:
array([[1022, 0],
[ 193, 2]], dtype=int64)
In [59]:
print(classification_report(test_labels,ytest_predict))
precision recall f1-score support

0 0.84 1.00 0.91 1022


1 1.00 0.01 0.02 195

accuracy 0.84 1217


macro avg 0.92 0.51 0.47 1217
weighted avg 0.87 0.84 0.77 1217

In [60]:
#from sklearn.metrics import roc_curve,roc_auc_score
rf_fpr, rf_tpr,_=roc_curve(test_labels,best_grid.predict_proba(X_test)[:,1])
plt.figure(figsize=(12,7))
plt.plot(rf_fpr,rf_tpr, marker='x', label='Random Forest')
plt.plot(np.arange(0,1.1,0.1),np.arange(0,1.1,0.1))
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC')
plt.show()
print('Area under Curve is',
roc_auc_score(test_labels,best_grid.predict_proba(X_test)[:,1]))
Area under Curve is 0.7107029956345023

Final Conclusion:
Here we can see that even when we try other values for the hyperparameters, the model performance does not
improve much.

The AUC is 87% on the train set and 70% on the test set. The overfitting problem still exists, as there is a
17-point gap between train and test. Moreover, the model is really only useful for predicting class 0, not class 1.

This is because the dataset is unbalanced, so we have a class imbalance problem.

In the real world you will often come across imbalanced datasets. To build a more robust classification model,
this class imbalance needs to be addressed before building the model; this applies to any kind of classification
model. Once the imbalance is addressed and the model is rebuilt, further tuning and optimization with grid
search should yield better performance.
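A hypothetical first step in that direction (not part of this notebook) is to re-weight the minority class inside the random forest itself; resampling techniques such as SMOTE are another common option:

rf_balanced = RandomForestClassifier(n_estimators=200, max_depth=10,
                                     class_weight='balanced',   # up-weights the rare class 1
                                     random_state=0)
rf_balanced.fit(X_train, train_labels)
print(classification_report(test_labels, rf_balanced.predict(X_test)))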

You will learn to deal with such imbalance in data using different performance improvement methods in the
Machine Learning Course.
