Algorithm
Objective: To classify heart disease in patients.
THEORY
Random Forest Classifier: A Random Forest Classifier is an ensemble learning method used
for both classification and regression tasks. It belongs to the family of tree-based models and
is particularly powerful and versatile. Random Forests are an ensemble of decision trees, where
each tree is constructed using a random subset of the data and a random subset of the
features. The predictions from multiple trees are then aggregated to make a final prediction.
Random Forests are widely used in various applications, including image classification,
bioinformatics, finance, and many other fields, due to their flexibility and effectiveness in
handling complex datasets.
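The ensemble idea described above can be sketched with scikit-learn on a small synthetic dataset (the dataset here is generated purely for illustration; it is not the heart disease data):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for a tabular dataset (illustration only)
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Each of the 100 trees is fit on a bootstrap sample of the rows and
# considers a random subset of features at every split; the forest
# aggregates the trees' predictions by majority vote.
forest = RandomForestClassifier(n_estimators=100, max_features='sqrt',
                                random_state=0)
forest.fit(X, y)

# Individual trees may disagree; the forest returns the majority vote
votes = [tree.predict(X[:1])[0] for tree in forest.estimators_]
print('forest prediction:', forest.predict(X[:1])[0])
```

Because each tree sees a different random slice of rows and features, their errors are partly decorrelated, which is why the aggregated vote is usually more accurate than any single tree.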
KNN: K-Nearest Neighbors (KNN) is one of the simplest machine learning algorithms, based on the
supervised learning technique. It assumes similarity between the new data and the available data
and puts the new data into the category most similar to the available categories.
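This "most similar category" idea can be shown on a toy 2-D example (the points below are made up for illustration):

```python
from sklearn.neighbors import KNeighborsClassifier

# Toy 2-D points: class 0 clusters near the origin, class 1 near (5, 5)
X = [[0, 0], [1, 0], [0, 1], [5, 5], [6, 5], [5, 6]]
y = [0, 0, 0, 1, 1, 1]

# A new point is assigned the majority class among its k nearest neighbours
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)
print(knn.predict([[0.5, 0.5], [5.5, 5.5]]))  # prints [0 1]
```

The point (0.5, 0.5) lands among the class-0 cluster and (5.5, 5.5) among the class-1 cluster, so each takes the majority label of its three nearest neighbours.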
Recall (Sensitivity or True Positive Rate): Recall is the ratio of correctly predicted positive
observations to the total actual positives.
Formula: Recall = TP / (TP + FN)
It indicates the ability of the model to capture and correctly classify positive instances.
Precision: Precision is the ratio of correctly predicted positive observations to the total
predicted positives.
Formula: Precision = TP / (TP + FP)
It measures the accuracy of the positive predictions and is useful when the cost of
false positives is high.
F1 score: The F1 Score is the harmonic mean of precision and recall, providing a balance
between the two metrics.
Formula: F1 Score = 2 * (Precision * Recall) / (Precision + Recall)
It is particularly useful when there is an uneven class distribution.
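As a worked example of the three formulas (the counts below are hypothetical, chosen only to make the arithmetic easy to follow):

```python
TP, FP, FN = 8, 2, 4  # hypothetical counts for illustration

recall = TP / (TP + FN)        # 8 / 12 ≈ 0.667
precision = TP / (TP + FP)     # 8 / 10 = 0.8
f1 = 2 * precision * recall / (precision + recall)

print(round(recall, 3), round(precision, 3), round(f1, 3))
```

Note that the F1 score (≈ 0.727) sits between precision and recall but closer to the lower of the two; the harmonic mean penalizes a model that trades one metric away for the other.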
CODE (for Random Forest)
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, classification_report
# df is assumed to be loaded beforehand from the heart disease dataset
df.info()
df = pd.get_dummies(df, drop_first=True)
df.head()
#Training part
X_train, X_test, y_train, y_test = train_test_split(
    df.drop(labels='target', axis=1), df['target'],
    test_size=0.2, random_state=10)
model = RandomForestClassifier(max_depth=4)
model.fit(X_train, y_train)
# Pull out one tree and human-readable labels (useful for visualising a tree)
estimator = model.estimators_[1]
feature_names = list(X_train.columns)
y_train_str = y_train.astype('str')
y_train_str[y_train_str == '0'] = 'no disease'
y_train_str[y_train_str == '1'] = 'disease'
y_train_str = y_train_str.values
y_pred_bin = model.predict(X_test)                # hard class predictions
y_pred_quant = model.predict_proba(X_test)[:, 1]  # predicted probability of disease
cm = confusion_matrix(y_test, y_pred_bin)
total = cm.sum()
sensitivity = cm[1, 1] / (cm[1, 1] + cm[1, 0])  # TP / (TP + FN)
print('Sensitivity : ', sensitivity)
specificity = cm[0, 0] / (cm[0, 0] + cm[0, 1])  # TN / (TN + FP)
print('Specificity : ', specificity)
accuracy1 = (cm[0, 0] + cm[1, 1]) / total
print('Accuracy : ', accuracy1)
# Calculate overall metrics with sklearn helpers
from sklearn.metrics import (accuracy_score, precision_recall_fscore_support,
                             classification_report)
accuracy2 = accuracy_score(y_test, y_pred_bin)
precision2, recall2, f1_score2, _ = precision_recall_fscore_support(
    y_test, y_pred_bin, average='binary')
print(classification_report(y_test, y_pred_bin))
print('Sensitivity:', sensitivity)
print('Specificity:', specificity)
print('Accuracy:', accuracy2)
print('Recall (True Positive Rate or Sensitivity):', recall2)
print('Precision (Positive Predictive Value):', precision2)
print('F1-score:', f1_score2)
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_fscore_support
# Per-class scores for the grouped bar chart
categories = ['no disease', 'disease']
precision, recall, f1_scores, _ = precision_recall_fscore_support(y_test, y_pred_bin)
x = range(len(categories))
bar_width = 0.35
fig, ax = plt.subplots(figsize=(10, 6))
ax.bar(x, precision, bar_width, label='Precision', color='b')
ax.bar([p + bar_width for p in x], recall, bar_width, label='Recall', color='g')
ax.bar([p + 2 * bar_width for p in x], f1_scores, bar_width, label='F1-score', color='r')
ax.set_xlabel('Category')
ax.set_ylabel('Score')
ax.set_title('Precision, Recall, and F1-score Comparison')
ax.set_xticks([p + bar_width for p in x])
ax.set_xticklabels(categories)
ax.legend()
plt.tight_layout()
plt.show()
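The discussion below also reports results for KNN, for which no code is shown above. A minimal counterpart to the Random Forest pipeline might look like the sketch below (the synthetic data stands in for the heart disease features, and k = 5 is an arbitrary starting choice, not the value used in this study):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, accuracy_score

# Synthetic stand-in for the heart disease features (illustration only)
X, y = make_classification(n_samples=300, n_features=13, random_state=10)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=10)

# k = 5 is an arbitrary starting point; tune it with cross-validation
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred_knn = knn.predict(X_test)

print(classification_report(y_test, y_pred_knn))
print('Accuracy:', accuracy_score(y_test, y_pred_knn))
```

Because KNN relies on distances, feature scaling (e.g. `StandardScaler`) before fitting usually matters much more for it than for tree-based models.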
DISCUSSION:
In this study, we employed the Random Forest Classifier and K-Nearest Neighbors (KNN) for
heart disease classification. The Random Forest, known for handling complex datasets, uses
an ensemble of decision trees to capture intricate patterns. Meanwhile, KNN, with its
simplicity and reliance on data similarity, proves effective for local patterns and non-linear
boundaries. Performance metrics, including confusion matrix, recall, precision, and F1 score,
were utilized for rigorous evaluation. Regular refinement guided by expert input ensures the
reliability of these models, offering a comprehensive approach to accurate heart disease
classification.
CONCLUSION:
In summary, our study employed the Random Forest Classifier and K-Nearest Neighbors for
heart disease classification, with the Random Forest excelling in capturing complex patterns
and KNN effective for local similarities. Rigorous evaluation using key metrics provided a
comprehensive assessment. Ongoing refinement guided by experts will enhance the models'
reliability for accurate heart disease classification, contributing to improved patient
outcomes.