Title: Heart Disease Condition Classification Using Machine Learning Algorithm

Objective: To classify the heart disease condition of a patient.
THEORY
Random Forest Classifier: Random Forest is an ensemble learning method used for both
classification and regression tasks; the Random Forest Classifier is its classification form. It
belongs to the family of tree-based models and is particularly powerful and versatile. A Random
Forest is an ensemble of decision trees, where each tree is trained on a random (bootstrap)
sample of the data and considers a random subset of the features when choosing each split. The
predictions of the individual trees are then aggregated, by voting in the case of
classification, to make a final prediction.
Random Forests are widely used in various applications, including image classification,
bioinformatics, finance, and many other fields, due to their flexibility and effectiveness in
handling complex datasets.
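
The idea described above (bootstrap samples, random feature subsets, aggregated predictions) can be sketched by hand with plain decision trees. This is only an illustrative sketch, not the implementation used in this report: the helper names `fit_toy_forest` and `predict_toy_forest` and the synthetic data from `make_classification` are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier


def fit_toy_forest(X, y, n_trees=25, max_depth=4, seed=0):
    """Train n_trees decision trees, each on its own bootstrap sample of the rows."""
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(n_trees):
        rows = rng.integers(0, len(X), size=len(X))  # sample rows with replacement
        tree = DecisionTreeClassifier(
            max_depth=max_depth,
            max_features="sqrt",  # random subset of features considered at each split
            random_state=int(rng.integers(0, 2**31 - 1)),
        )
        trees.append(tree.fit(X[rows], y[rows]))
    return trees


def predict_toy_forest(trees, X):
    """Aggregate the trees' 0/1 predictions by majority vote."""
    votes = np.stack([tree.predict(X) for tree in trees])  # shape: (n_trees, n_samples)
    return (votes.mean(axis=0) >= 0.5).astype(int)


# Synthetic binary data standing in for a real dataset (assumption for illustration)
X_demo, y_demo = make_classification(n_samples=300, n_features=13, random_state=0)
toy_forest = fit_toy_forest(X_demo, y_demo)
toy_pred = predict_toy_forest(toy_forest, X_demo)
print("Training accuracy of the toy forest:", (toy_pred == y_demo).mean())
```

Note that scikit-learn's RandomForestClassifier averages the trees' predicted class probabilities rather than counting hard votes; the hard vote is used here only because it is the simplest form of aggregation.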

KNN: K-Nearest Neighbors (KNN) is one of the simplest Machine Learning algorithms based on the
Supervised Learning technique. It assumes that similar data points lie close to each other: a
new data point is assigned to the category that is most common among its K nearest neighbors
in the available data.
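
A minimal sketch of this rule, assuming Euclidean distance and a simple majority vote, is given below. The function `knn_predict` and the toy data are hypothetical and serve only to illustrate the idea.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    distances = np.linalg.norm(X_train - x_new, axis=1)  # Euclidean distance to each point
    nearest = np.argsort(distances)[:k]                  # indices of the k closest points
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two features, two well-separated classes
X_toy = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                  [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_toy, y_toy, np.array([1.1, 1.0])))  # point near the class 0 cluster
print(knn_predict(X_toy, y_toy, np.array([5.0, 5.2])))  # point near the class 1 cluster
```

Because the vote depends on raw distances, features measured on larger scales dominate unless the data is standardized first.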

Performance Evaluation metrics


Confusion Matrix: A confusion matrix is a table that summarizes the performance of a
classification algorithm. It includes the counts of true positives (TP), true negatives (TN), false
positives (FP), and false negatives (FN).
• True Positive (TP): Instances that are actually positive and predicted as positive.
• True Negative (TN): Instances that are actually negative and predicted as negative.
• False Positive (FP): Instances that are actually negative but predicted as positive.
• False Negative (FN): Instances that are actually positive but predicted as negative.
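
The four counts can be read directly from a scikit-learn confusion matrix. The sketch below uses made-up labels (1 = disease, 0 = no disease) purely for illustration; the point is the layout of the matrix, where rows are actual classes and columns are predicted classes.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 1 = disease (positive), 0 = no disease (negative)
y_true_demo = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_demo = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

# With labels=[0, 1] the matrix is laid out as
# [[TN, FP],
#  [FN, TP]]
cm_demo = confusion_matrix(y_true_demo, y_pred_demo, labels=[0, 1])
tn, fp, fn, tp = cm_demo.ravel()

print(cm_demo)
print("TP:", tp, "TN:", tn, "FP:", fp, "FN:", fn)
```

With this layout, the positive class occupies the second row and second column, so TP is the element at index [1, 1].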

Recall (Sensitivity or True Positive Rate): Recall is the ratio of correctly predicted positive
observations to the total actual positives.
• Formula: Recall = TP / (TP + FN)
• It indicates the ability of the model to capture and correctly classify positive instances.

Precision: Precision is the ratio of correctly predicted positive observations to the total
predicted positives.
• Formula: Precision = TP / (TP + FP)
• It measures the accuracy of the positive predictions and is useful when the cost of
false positives is high.

F1 score: The F1 Score is the harmonic mean of precision and recall, providing a balance
between the two metrics.
• Formula: F1 Score = 2 * (Precision * Recall) / (Precision + Recall)
• It is particularly useful when there is an uneven class distribution.
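
As a worked example of the three formulas, the sketch below computes recall, precision, and F1 score from assumed counts (TP = 3, FN = 1, FP = 2) and cross-checks them against scikit-learn's own metric functions. The labels are hypothetical and chosen only so that the arithmetic is easy to follow.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical predictions that give TP = 3, FN = 1, FP = 2, TN = 4
y_true_ex = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_ex = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
tp_ex, fn_ex, fp_ex = 3, 1, 2

recall_ex = tp_ex / (tp_ex + fn_ex)        # 3 / 4 = 0.75
precision_ex = tp_ex / (tp_ex + fp_ex)     # 3 / 5 = 0.60
f1_ex = 2 * (precision_ex * recall_ex) / (precision_ex + recall_ex)  # 0.9 / 1.35

print("Recall:", recall_ex, "| sklearn:", recall_score(y_true_ex, y_pred_ex))
print("Precision:", precision_ex, "| sklearn:", precision_score(y_true_ex, y_pred_ex))
print("F1 score:", round(f1_ex, 3), "| sklearn:", round(f1_score(y_true_ex, y_pred_ex), 3))
```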
CODE (for Random Forest)
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix

# Run in a shell or notebook cell, not as Python code:
# pip install tensorflow --user

from tensorflow.keras.optimizers import Adam


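import pandas as pd

# Raw string so the backslash in the Windows path is not read as an escape sequence
df = pd.read_csv(r"Desktop\Heart_Disease.csv")
print(df.to_string())

# Keep only rows with caa < 4 and thall > 0
df = df.query("caa < 4")
df = df.query("thall > 0")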

# Rename the columns to descriptive names
df.columns = ['age', 'sex', 'chest_pain_type', 'resting_blood_pressure', 'cholesterol',
              'fasting_blood_sugar', 'rest_ecg', 'max_heart_rate_achieved',
              'exercise_induced_angina', 'st_depression', 'st_slope',
              'num_major_vessels', 'thalassemia', 'target']

df['sex'] = df['sex'].replace({0: 'female', 1: 'male'})

df['chest_pain_type'] = df['chest_pain_type'].replace({
    1: 'typical angina', 2: 'atypical angina',
    3: 'non-anginal pain', 4: 'asymptomatic'})

df['fasting_blood_sugar'] = df['fasting_blood_sugar'].replace({
    0: 'lower than 120mg/dl', 1: 'greater than 120mg/dl'})

df['rest_ecg'] = df['rest_ecg'].replace({
    0: 'normal', 1: 'ST-T wave abnormality', 2: 'left ventricular hypertrophy'})

df['exercise_induced_angina'] = df['exercise_induced_angina'].replace({0: 'no', 1: 'yes'})

df['st_slope'] = df['st_slope'].replace({1: 'upsloping', 2: 'flat', 3: 'downsloping'})

df['thalassemia'] = df['thalassemia'].replace({
    1: 'normal', 2: 'fixed defect', 3: 'reversable defect'})

df.info()

# Store the recoded columns as object dtype so that get_dummies one-hot encodes them
categorical_cols = ['sex', 'chest_pain_type', 'fasting_blood_sugar', 'rest_ecg',
                    'exercise_induced_angina', 'st_slope', 'thalassemia']
df[categorical_cols] = df[categorical_cols].astype('object')

df.info()
df = pd.get_dummies(df, drop_first=True)

df.head()

#Training part
X_train, X_test, y_train, y_test = train_test_split(df.drop(labels='target', axis=1), df['target'],
test_size=0.2, random_state=10)

model = RandomForestClassifier(max_depth=4)

model.fit(X_train, y_train)

estimator = model.estimators_[1]
feature_names = [i for i in X_train.columns]

y_train_str = y_train.astype('str')
y_train_str[y_train_str == '0'] = 'no disease'
y_train_str[y_train_str == '1'] = 'disease'
y_train_str = y_train_str.values

y_predict = model.predict(X_test)
y_pred_quant = model.predict_proba(X_test)[:, 1]
y_pred_bin = model.predict(X_test)

# Store the result under a new name so the imported confusion_matrix function is not overwritten
cm = confusion_matrix(y_test, y_pred_bin)
print(cm)

# scikit-learn layout for labels 0/1: cm[0, 0] = TN, cm[0, 1] = FP, cm[1, 0] = FN, cm[1, 1] = TP
total = cm.sum()

sensitivity = cm[1, 1] / (cm[1, 1] + cm[1, 0])  # TP / (TP + FN)
print('Sensitivity : ', sensitivity)
specificity = cm[0, 0] / (cm[0, 0] + cm[0, 1])  # TN / (TN + FP)
print('Specificity : ', specificity)
Accuracy1 = (cm[0, 0] + cm[1, 1]) / total
print('Accuracy : ', Accuracy1)

# Calculate recall (true positive rate or sensitivity)
recall1 = cm[1, 1] / (cm[1, 1] + cm[1, 0])
print('Recall (True Positive Rate or Sensitivity):', recall1)

# Calculate precision (positive predictive value)
precision1 = cm[1, 1] / (cm[1, 1] + cm[0, 1])
print('Precision (Positive Predictive Value):', precision1)

# Calculate F1-score (harmonic mean of precision and recall)
f1_score1 = 2 * (precision1 * recall1) / (precision1 + recall1)
print('F1-score:', f1_score1)

from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt

# Get predicted probabilities for the positive class from the trained classifier 'model'
y_scores = model.predict_proba(X_test)[:, 1]

# Compute the ROC curve and the area under it (AUC)
fpr, tpr, _ = roc_curve(y_test, y_scores)
roc_auc = auc(fpr, tpr)

# Plot ROC curve


plt.figure()
lw = 2
plt.plot(fpr, tpr, color='darkorange',
lw=lw, label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc="lower right")
plt.show()
from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

# Generate confusion matrix (named conf_matrix so the imported function is not overwritten)
conf_matrix = confusion_matrix(y_test, y_pred_bin)

# Plot confusion matrix
plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, fmt="d", cmap="Blues",
            xticklabels=['Class 0', 'Class 1'], yticklabels=['Class 0', 'Class 1'])
plt.xlabel('Predicted')
plt.ylabel('True')
plt.title('Confusion Matrix')
plt.show()
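
The ROC curve plotted in the Random Forest code is obtained by sweeping a decision threshold over the predicted probabilities and recording the false positive rate and true positive rate at each threshold. The sketch below illustrates this on four made-up samples; the helper `roc_point` and the data are assumptions for illustration, while `roc_curve` and `auc` are the same scikit-learn functions used above.

```python
import numpy as np
from sklearn.metrics import auc, roc_curve

# Hypothetical true labels and predicted probabilities for the positive class
y_true_roc = np.array([0, 0, 1, 1])
y_score_roc = np.array([0.1, 0.4, 0.35, 0.8])


def roc_point(y_true, y_score, threshold):
    """Return (FPR, TPR) when every score >= threshold is predicted positive."""
    y_pred = (y_score >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    return fp / (fp + tn), tp / (tp + fn)


# Lowering the threshold moves the point from the bottom-left towards the top-right
for t in [0.8, 0.4, 0.35, 0.1]:
    print("threshold", t, "->", roc_point(y_true_roc, y_score_roc, t))

fpr_demo, tpr_demo, _ = roc_curve(y_true_roc, y_score_roc)
print("AUC:", auc(fpr_demo, tpr_demo))
```

An AUC of 0.5 corresponds to the diagonal reference line in the plot, and an AUC of 1.0 to a classifier that ranks every positive sample above every negative one.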

CODE (for KNN)


# Import necessary libraries
from sklearn.metrics import confusion_matrix, classification_report, roc_curve, auc
import seaborn as sns
import matplotlib.pyplot as plt

# Define the classifiers and their hyperparameters


from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold
# Define the base KNN model and set up the pipeline with scaling
knn_pipeline = Pipeline([('scaler', StandardScaler()),('knn', KNeighborsClassifier())])

# Hyperparameter grid for KNN


knn_param_grid = {'knn__n_neighbors': list(range(1, 12)),'knn__weights': ['uniform',
'distance'], 'knn__p': [1, 2]}

# Function to tune classifier hyperparameters


def tune_clf_hyperparameters(clf, param_grid, X_train, y_train, scoring='recall', n_splits=3):
    # Stratified folds keep the class proportions similar in every split
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=10)
    clf_grid = GridSearchCV(clf, param_grid, cv=cv, scoring=scoring, n_jobs=-1)
    clf_grid.fit(X_train, y_train)
    best_hyperparameters = clf_grid.best_params_
    return clf_grid.best_estimator_, best_hyperparameters

# Hyperparameter tuning for KNN


best_knn, best_knn_hyperparams = tune_clf_hyperparameters(knn_pipeline, knn_param_grid,
                                                          X_train, y_train)
print('KNN Optimal Hyperparameters:\n', best_knn_hyperparams)

# Get the predicted labels


y_pred_bin = best_knn.predict(X_test)

# Generate confusion matrix


conf_matrix = confusion_matrix(y_test, y_pred_bin)

# Plot confusion matrix


plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, fmt="d", cmap="Blues",
            xticklabels=['Class 0', 'Class 1'], yticklabels=['Class 0', 'Class 1'])
plt.xlabel('Predicted')
plt.ylabel('True')
plt.title('Confusion Matrix')
plt.show()

# Calculate metrics
print(classification_report(y_test, y_pred_bin))

# Calculate sensitivity, specificity, accuracy, recall, precision, and F1-score
# scikit-learn layout for labels 0/1: [0, 0] = TN, [0, 1] = FP, [1, 0] = FN, [1, 1] = TP
sensitivity = conf_matrix[1, 1] / (conf_matrix[1, 1] + conf_matrix[1, 0])  # TP / (TP + FN)
specificity = conf_matrix[0, 0] / (conf_matrix[0, 0] + conf_matrix[0, 1])  # TN / (TN + FP)
accuracy2 = (conf_matrix[0, 0] + conf_matrix[1, 1]) / conf_matrix.sum()
recall2 = conf_matrix[1, 1] / (conf_matrix[1, 1] + conf_matrix[1, 0])
precision2 = conf_matrix[1, 1] / (conf_matrix[1, 1] + conf_matrix[0, 1])
f1_score2 = 2 * (precision2 * recall2) / (precision2 + recall2)

print('Sensitivity:', sensitivity)
print('Specificity:', specificity)
print('Accuracy:', accuracy2)
print('Recall (True Positive Rate or Sensitivity):', recall2)
print('Precision (Positive Predictive Value):', precision2)
print('F1-score:', f1_score2)

# Generate ROC curve


y_scores = best_knn.predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, y_scores)
roc_auc = auc(fpr, tpr)

# Plot ROC curve


plt.figure()
plt.plot(fpr, tpr, color='darkorange', lw=3, label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=3, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc="lower right")
plt.show()

# Comparison of Precision, Recall, and F1-score between classifiers


# precision1, recall1, f1_score1 come from the Random Forest code;
# precision2, recall2, f1_score2 come from the KNN code above
categories = ["Random Forest", "KNN"]
precision = [precision1, precision2]
recall = [recall1, recall2]
f1_score = [f1_score1, f1_score2]

x = range(len(categories))
bar_width = 0.25  # three bars per group must fit within one unit on the x-axis
fig, ax = plt.subplots(figsize=(10, 6))
ax.bar(x, precision, bar_width, label='Precision', color='b')
ax.bar([p + bar_width for p in x], recall, bar_width, label='Recall', color='g')
ax.bar([p + 2 * bar_width for p in x], f1_score, bar_width, label='F1-score', color='r')

ax.set_xlabel('Category')
ax.set_ylabel('Score')
ax.set_title('Precision, Recall, and F1-score Comparison')
ax.set_xticks([p + 1 * bar_width for p in x])  # centre the tick under the middle bar
ax.set_xticklabels(categories)
ax.legend()
plt.tight_layout()
plt.show()

DISCUSSION:
In this study, we employed the Random Forest Classifier and K-Nearest Neighbors (KNN) for
heart disease classification. Random Forest, known for handling complex datasets, uses an
ensemble of decision trees to capture intricate patterns. KNN, which is simple and relies on
the similarity between data points, is suited to local patterns and non-linear decision
boundaries. Both models were evaluated using the confusion matrix, recall, precision, and F1
score, together with their ROC curves. Further refinement guided by expert input would help
ensure the reliability of these models for accurate heart disease classification.

CONCLUSION:
In summary, our study employed the Random Forest Classifier and K-Nearest Neighbors for
heart disease classification, with Random Forest suited to capturing complex patterns and
KNN effective for local similarities. Evaluation using key metrics provided a comprehensive
assessment. Ongoing refinement guided by experts will enhance the models' reliability for
accurate heart disease classification, which could contribute to improved patient outcomes.
