DWDM Lab 3

Title: Heart Disease Condition Classification Using Machine Learning Algorithms
Objective: To classify whether a patient has heart disease.
THEORY
Random Forest Classifier: A Random Forest is an ensemble learning method used for both
classification and regression tasks. It is a tree-based model built from many decision trees,
where each tree is trained on a random (bootstrap) sample of the data and considers a random
subset of the features. The predictions of the individual trees are then aggregated into a
final prediction: a majority vote for classification, an average for regression. Random
Forests are widely used in applications such as image classification, bioinformatics, and
finance because they are flexible and handle complex datasets well.
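The idea described above can be sketched in a few lines. This is only an illustration under simplifying assumptions: one-level decision stumps stand in for full decision trees, the data is a made-up toy set, and the helper names (fit_stump, fit_forest, predict_forest) are hypothetical and not part of scikit-learn.

```python
import random
from collections import Counter

def fit_stump(rows, labels, feature):
    """Find the threshold on one feature that misclassifies the fewest rows."""
    best = None
    for threshold in sorted({row[feature] for row in rows}):
        for left, right in ((0, 1), (1, 0)):  # class predicted on each side
            errors = sum(
                (left if row[feature] <= threshold else right) != label
                for row, label in zip(rows, labels)
            )
            if best is None or errors < best[0]:
                best = (errors, threshold, left, right)
    _, threshold, left, right = best
    return feature, threshold, left, right

def predict_stump(stump, row):
    feature, threshold, left, right = stump
    return left if row[feature] <= threshold else right

def fit_forest(rows, labels, n_trees=25, seed=0):
    """Train each stump on a bootstrap sample and one randomly chosen feature."""
    rng = random.Random(seed)
    n_rows, n_features = len(rows), len(rows[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n_rows) for _ in range(n_rows)]  # sample with replacement
        feature = rng.randrange(n_features)                   # random feature subset of size 1
        forest.append(
            fit_stump([rows[i] for i in idx], [labels[i] for i in idx], feature)
        )
    return forest

def predict_forest(forest, row):
    """Aggregate the individual predictions by majority vote."""
    votes = Counter(predict_stump(stump, row) for stump in forest)
    return votes.most_common(1)[0][0]

# Made-up toy data: class 0 has small feature values, class 1 has large ones
rows = [(1, 2), (2, 1), (2, 3), (3, 2), (7, 8), (8, 7), (8, 9), (9, 8)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

forest = fit_forest(rows, labels)
print(predict_forest(forest, (1, 1)), predict_forest(forest, (9, 9)))
```

In the lab code itself, RandomForestClassifier from scikit-learn does this work with full decision trees; the sketch only shows where the two sources of randomness and the vote come in.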
KNN: K-Nearest Neighbors (KNN) is one of the simplest machine learning algorithms based on
the supervised learning technique. It assumes that similar data points lie close together: a
new data point is compared with the available labelled data and assigned to the category that
is most common among its k nearest neighbors.
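A minimal sketch of this rule, assuming Euclidean distance and a simple majority vote. The function name knn_predict and the toy points are made up for illustration; the lab code uses scikit-learn's KNeighborsClassifier instead.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Return the most common label among the k training points closest to query."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Made-up two-feature toy data with two well-separated groups
train_points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (7.0, 6.5), (6.5, 7.0)]
train_labels = ["no disease", "no disease", "no disease", "disease", "disease", "disease"]

print(knn_predict(train_points, train_labels, (1.2, 1.4)))
print(knn_predict(train_points, train_labels, (6.4, 6.2)))
```

Because the prediction depends directly on distances, features on very different scales can dominate the result, which is why scaling is usually applied before KNN.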
Performance Evaluation metrics

Confusion Matrix: A confusion matrix is a table that summarizes the performance of a
classification algorithm. It includes the counts of true positives (TP), true negatives (TN), false
positives (FP), and false negatives (FN).
• True Positive (TP): Instances that are actually positive and predicted as positive.
• True Negative (TN): Instances that are actually negative and predicted as negative.
• False Positive (FP): Instances that are actually negative but predicted as positive.
• False Negative (FN): Instances that are actually positive but predicted as negative.
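The four counts defined above can be obtained by comparing true and predicted labels pair by pair. The sketch below assumes binary labels with 1 as the positive class; the function name confusion_counts and the label lists are made up for illustration.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, TN, FP and FN for a binary classification problem."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == positive and predicted == positive:
            tp += 1
        elif actual != positive and predicted != positive:
            tn += 1
        elif actual != positive and predicted == positive:
            fp += 1
        else:  # actual is positive but predicted negative
            fn += 1
    return tp, tn, fp, fn

# Made-up labels: 1 = disease, 0 = no disease
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print(tp, tn, fp, fn)  # prints: 3 3 1 1
```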
Recall (Sensitivity or True Positive Rate): Recall is the ratio of correctly predicted positive
observations to the total actual positives.
• Formula: Recall = TP / (TP + FN)
• It indicates the ability of the model to capture and correctly classify positive instances.
Precision: Precision is the ratio of correctly predicted positive observations to the total
predicted positives.
• Formula: Precision = TP / (TP + FP)
• It measures the accuracy of the positive predictions and is useful when the cost of
false positives is high.
F1 score: The F1 Score is the harmonic mean of precision and recall, providing a balance
between the two metrics.
• Formula: F1 Score = 2 * (Precision * Recall) / (Precision + Recall)
• It is particularly useful when there is an uneven class distribution.
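As a worked example of the three formulas above, the sketch below computes them from raw counts. The counts are made up, the function name precision_recall_f1 is hypothetical, and returning 0.0 when a denominator is zero is an assumed convention rather than part of the definitions.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # assumed convention: 0.0 if undefined
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * (precision * recall) / (precision + recall)
    return precision, recall, f1

# Made-up counts: 30 true positives, 10 false positives, 20 false negatives
precision, recall, f1 = precision_recall_f1(tp=30, fp=10, fn=20)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

With these counts, precision is 30/40 = 0.75, recall is 30/50 = 0.60, and the F1 score is 0.9/1.35, about 0.67, which lies between the two and closer to the lower value.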
CODE (for Random Forest)
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix

# Load the dataset (raw string so the backslash in the Windows path is not read as an escape)
df = pd.read_csv(r"Desktop\Heart_Disease.csv")
print(df.to_string())
# Filter rows: keep only those with caa < 4 and thall > 0
df = df.query("caa < 4")
df = df.query("thall > 0")
# Rename the columns to descriptive names
df.columns = ['age', 'sex', 'chest_pain_type', 'resting_blood_pressure', 'cholesterol',
              'fasting_blood_sugar', 'rest_ecg', 'max_heart_rate_achieved',
              'exercise_induced_angina', 'st_depression', 'st_slope',
              'num_major_vessels', 'thalassemia', 'target']
# Replace numeric category codes with readable labels
df['sex'] = df['sex'].replace({0: 'female', 1: 'male'})
df['chest_pain_type'] = df['chest_pain_type'].replace({
    1: 'typical angina', 2: 'atypical angina',
    3: 'non-anginal pain', 4: 'asymptomatic'})
df['fasting_blood_sugar'] = df['fasting_blood_sugar'].replace({
    0: 'lower than 120 mg/dl', 1: 'greater than 120 mg/dl'})
df['rest_ecg'] = df['rest_ecg'].replace({
    0: 'normal', 1: 'ST-T wave abnormality', 2: 'left ventricular hypertrophy'})
df['exercise_induced_angina'] = df['exercise_induced_angina'].replace({0: 'no', 1: 'yes'})
df['st_slope'] = df['st_slope'].replace({1: 'upsloping', 2: 'flat', 3: 'downsloping'})
df['thalassemia'] = df['thalassemia'].replace({
    1: 'normal', 2: 'fixed defect', 3: 'reversable defect'})
df.info()
# Treat the labelled columns as categorical (object) columns
categorical_cols = ['sex', 'chest_pain_type', 'fasting_blood_sugar', 'rest_ecg',
                    'exercise_induced_angina', 'st_slope', 'thalassemia']
df[categorical_cols] = df[categorical_cols].astype('object')
df.info()
# One-hot encode the categorical columns, dropping the first level of each
df = pd.get_dummies(df, drop_first=True)
df.head()
#Training part
X_train, X_test, y_train, y_test = train_test_split(df.drop(labels='target', axis=1), df['target'],
test_size=0.2, random_state=10)
model = RandomForestClassifier(max_depth=4)
model.fit(X_train, y_train)
estimator = model.estimators_[1]
feature_names = [i for i in X_train.columns]
y_train_str = y_train.astype('str')
y_train_str[y_train_str == '0'] = 'no disease'
y_train_str[y_train_str == '1'] = 'disease'
y_train_str = y_train_str.values
y_predict = model.predict(X_test)
y_pred_quant = model.predict_proba(X_test)[:, 1]
y_pred_bin = model.predict(X_test)
# Confusion matrix in scikit-learn's layout: rows are true classes, columns are predicted
# classes, so cm[0, 0] = TN, cm[0, 1] = FP, cm[1, 0] = FN, cm[1, 1] = TP.
# The result is stored as cm so that it does not shadow the confusion_matrix function.
cm = confusion_matrix(y_test, y_pred_bin)
cm
total = cm.sum()
# Sensitivity = TP / (TP + FN)
sensitivity = cm[1, 1] / (cm[1, 1] + cm[1, 0])
print('Sensitivity : ', sensitivity)
# Specificity = TN / (TN + FP)
specificity = cm[0, 0] / (cm[0, 0] + cm[0, 1])
print('Specificity : ', specificity)
# Accuracy = (TN + TP) / total
Accuracy1 = (cm[0, 0] + cm[1, 1]) / total
print('Accuracy : ', Accuracy1)
# Calculate recall (true positive rate or sensitivity)
recall1 = cm[1, 1] / (cm[1, 1] + cm[1, 0])
print('Recall (True Positive Rate or Sensitivity):', recall1)
# Calculate precision (positive predictive value)
precision1 = cm[1, 1] / (cm[1, 1] + cm[0, 1])
print('Precision (Positive Predictive Value):', precision1)
# Calculate F1-score (harmonic mean of precision and recall)
f1_score1 = 2 * (precision1 * recall1) / (precision1 + recall1)
print('F1-score:', f1_score1)
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt

# Get predicted probabilities for the positive class from the trained classifier 'model'
y_scores = model.predict_proba(X_test)[:, 1]
# Compute the ROC curve and the area under it (AUC)
fpr, tpr, _ = roc_curve(y_test, y_scores)
roc_auc = auc(fpr, tpr)
# Plot ROC curve

plt.figure()
lw = 2
plt.plot(fpr, tpr, color='darkorange',
lw=lw, label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc="lower right")
plt.show()
from sklearn.metrics import confusion_matrix
import seaborn as sns

# Generate the confusion matrix (rows: true class, columns: predicted class)
cm = confusion_matrix(y_test, y_pred_bin)

# Plot the confusion matrix as a heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues",
            xticklabels=['Class 0', 'Class 1'],
            yticklabels=['Class 0', 'Class 1'])
plt.xlabel('Predicted')
plt.ylabel('True')
plt.title('Confusion Matrix')
plt.show()
CODE (for KNN)

# Import necessary libraries
from sklearn.metrics import confusion_matrix, classification_report, roc_curve, auc
import seaborn as sns
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Define the base KNN model and set up the pipeline with scaling
knn_pipeline = Pipeline([('scaler', StandardScaler()), ('knn', KNeighborsClassifier())])

# Hyperparameter grid for KNN
knn_param_grid = {'knn__n_neighbors': list(range(1, 12)),
                  'knn__weights': ['uniform', 'distance'],
                  'knn__p': [1, 2]}

# Function to tune classifier hyperparameters with a stratified, cross-validated grid search
def tune_clf_hyperparameters(clf, param_grid, X_train, y_train, scoring='recall', n_splits=3):
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=10)
    clf_grid = GridSearchCV(clf, param_grid, cv=cv, scoring=scoring, n_jobs=-1)
    clf_grid.fit(X_train, y_train)
    best_hyperparameters = clf_grid.best_params_
    return clf_grid.best_estimator_, best_hyperparameters

# Hyperparameter tuning for KNN
best_knn, best_knn_hyperparams = tune_clf_hyperparameters(knn_pipeline, knn_param_grid,
                                                          X_train, y_train)
print('KNN Optimal Hyperparameters:\n', best_knn_hyperparams)
# Get the predicted labels

y_pred_bin = best_knn.predict(X_test)
# Generate confusion matrix

conf_matrix = confusion_matrix(y_test, y_pred_bin)
# Plot confusion matrix

plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, fmt="d", cmap="Blues",
            xticklabels=['Class 0', 'Class 1'],
            yticklabels=['Class 0', 'Class 1'])
plt.xlabel('Predicted')
plt.ylabel('True')
plt.title('Confusion Matrix')
plt.show()
# Calculate metrics
print(classification_report(y_test, y_pred_bin))
# Calculate sensitivity, specificity, accuracy, recall, precision, and F1-score
# (scikit-learn layout: conf_matrix[0, 0] = TN, [0, 1] = FP, [1, 0] = FN, [1, 1] = TP)
sensitivity = conf_matrix[1, 1] / (conf_matrix[1, 1] + conf_matrix[1, 0])  # TP / (TP + FN)
specificity = conf_matrix[0, 0] / (conf_matrix[0, 0] + conf_matrix[0, 1])  # TN / (TN + FP)
accuracy2 = (conf_matrix[0, 0] + conf_matrix[1, 1]) / conf_matrix.sum()
recall2 = conf_matrix[1, 1] / (conf_matrix[1, 1] + conf_matrix[1, 0])
precision2 = conf_matrix[1, 1] / (conf_matrix[1, 1] + conf_matrix[0, 1])
f1_score2 = 2 * (precision2 * recall2) / (precision2 + recall2)
print('Sensitivity:', sensitivity)
print('Specificity:', specificity)
print('Accuracy:', accuracy2)
print('Recall (True Positive Rate or Sensitivity):', recall2)
print('Precision (Positive Predictive Value):', precision2)
print('F1-score:', f1_score2)
# Generate ROC curve

y_scores = best_knn.predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, y_scores)
roc_auc = auc(fpr, tpr)
# Plot ROC curve

plt.figure()
plt.plot(fpr, tpr, color='darkorange', lw=3, label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=3, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc="lower right")
plt.show()
# Comparison of Precision, Recall, and F1-score between classifiers
categories = ["Random Forest", "KNN"]
precision = [precision1, precision2]  # precision1 and precision2 were calculated earlier
recall = [recall1, recall2]  # recall1 and recall2 were calculated earlier
f1_score = [f1_score1, f1_score2]  # f1_score1 and f1_score2 were calculated earlier
x = range(len(categories))
bar_width = 0.25  # three bars per classifier; narrow enough that the two groups do not overlap
fig, ax = plt.subplots(figsize=(10, 6))
ax.bar(x, precision, bar_width, label='Precision', color='b')
ax.bar([p + bar_width for p in x], recall, bar_width, label='Recall', color='g')
ax.bar([p + 2 * bar_width for p in x], f1_score, bar_width, label='F1-score', color='r')
ax.set_xlabel('Category')
ax.set_ylabel('Score')
ax.set_title('Precision, Recall, and F1-score Comparison')
ax.set_xticks([p + bar_width for p in x])  # centre each tick under the middle bar
ax.set_xticklabels(categories)
ax.legend()
plt.tight_layout()
plt.show()
DISCUSSION:
In this study, we used the Random Forest Classifier and K-Nearest Neighbors (KNN) for heart
disease classification. Random Forest uses an ensemble of decision trees, which lets it
capture complex, non-linear patterns in the data. KNN is simpler: it relies on the similarity
between data points, which makes it suited to local patterns and non-linear decision
boundaries. Both models were evaluated on the same held-out test set using the confusion
matrix, recall, precision, F1 score, and the ROC curve. The models would need further
refinement, guided by expert input, before their predictions can be considered reliable.
CONCLUSION:
In summary, our study applied the Random Forest Classifier and K-Nearest Neighbors to heart
disease classification. Random Forest is suited to capturing complex patterns, while KNN is
effective for local similarities. Evaluation with the confusion matrix, recall, precision,
and F1 score allowed the two models to be compared on the same test data. Ongoing refinement
guided by experts can improve the models' reliability for heart disease classification and,
in turn, help improve patient outcomes.
