BTP Sixth Sem Report

Heart Disease Prediction using Hybrid Random Forest
A project report submitted in partial fulfillment of the requirements for the award of the degree of
Bachelor of Technology
in
Computer Science and Engineering
by
Akhilesh Kumar Singh (112015009)

Kale Vaibhav Vitthal (112015071)
Karde Shivam Dnyaneshwar (112015072)
Kartikey Singh (112015073)
Under the Supervision of: Dr. Mahendra Pratap Yadav
Semester: 6th
Name of Department: Department of Computer Science and Engineering
Indian Institute of Information Technology, Pune
(An Institute of National Importance by an Act of Parliament)
April 2023
BONAFIDE CERTIFICATE
This is to certify that the project report entitled “Heart Disease Prediction using Hybrid
Random Forest” submitted by Kale Vaibhav Vitthal bearing the MIS No: 112015071, Karde
Shivam Dnyaneshwar bearing the MIS No: 112015072 , Kartikey Singh bearing the MIS No:
112015073, Akhilesh Kumar Singh bearing the MIS No: 112015009, in completion of their
project work under the guidance of Dr. Mahendra Pratap Yadav is accepted for the project report
submission in partial fulfillment of the requirements for the award of the degree of Bachelor of
Technology in the Department of Computer Science and Engineering, Indian Institute of
Information Technology, Pune (IIIT Pune), during the academic year 2022-23.
Dr. Mahendra Pratap Yadav
Project Guide
Assistant Professor
Department of CSE
IIIT Pune

Dr. Sanjeev Sharma
Head of the Department
Assistant Professor
Department of CSE
IIIT Pune
Project Viva-voce held on 27/04/2023

Undertaking for Plagiarism
We Kale Vaibhav Vitthal, Karde Shivam Dnyaneshwar, Kartikey Singh, Akhilesh Kumar Singh
solemnly declare that research work presented in the report titled “Heart Disease Prediction using
Hybrid Random Forest” is solely our research work with no significant contribution from any other
person. Small contribution/help wherever taken has been duly acknowledged and that complete report
has been written by us. We understand the zero tolerance policy of Indian Institute of Information
Technology Pune towards plagiarism. Therefore we declare that no portion of our report has been
plagiarized and any material used as reference is properly referred/cited. We undertake that if we are
found guilty of any formal plagiarism in the above titled thesis even after award of the degree, the
Institute reserves the right to withdraw/revoke our B.Tech degrees.
Students’ Name and Signature with Date

Conflict of Interest
Report title: Heart Disease Prediction using Hybrid Random Forest
The authors whose names are listed immediately below certify that they have no affiliations with or
involvement in any organization or entity with any financial interest (such as honoraria; educational
grants; participation in speakers’ bureaus; membership, employment, consultancies, stock ownership,
or other equity interest; and expert testimony or patent-licensing arrangements), or non-financial
interest (such as personal or professional relationships, affiliations, knowledge or beliefs) in the subject
matter or materials discussed in this report.
Students’/ Student’s Name and Signature with Date

ACKNOWLEDGEMENT
This project would not have been possible without the help and cooperation of many. We would like to
thank the people who helped us directly and indirectly in the completion of this project work.
First and foremost, we would like to express our gratitude to our honorable Director, Prof. O.G. Kakde,
for providing his kind support in various aspects. We would like to express our gratitude to our project
guide Dr. Mahendra Pratap Yadav, Department of CSE, for providing excellent guidance,
encouragement, inspiration, constant and timely support throughout this B.Tech Project. We would like to
express our gratitude to the Head of Department (Dr. Sanjeev Sharma), Department of CSE, for
providing his kind support in various aspects. We would also like to thank all the faculty members in the
Department of CSE and our classmates for their steadfast and strong support and engagement with this
project.
Heart disease Prediction using Hybrid Random Forest
Abstract
Cardiovascular disease refers to any critical condition that impacts the heart. Heart disease is one of the
most significant causes of mortality in the world today. As these heart diseases can be life-threatening,
researchers are focusing on designing smart systems to accurately diagnose them based on electronic health
data, with the aid of machine learning algorithms. Prediction of cardiovascular disease is a critical challenge
in the area of clinical data analysis. Machine learning (ML) has been shown to be effective in assisting in
making decisions and predictions from the large quantity of data produced by the healthcare industry. This
work discusses several machine learning approaches for predicting heart diseases, using data of major
health factors from patients. In this work, we propose a novel method that aims at finding significant
features by applying machine learning techniques resulting in improving the accuracy in the prediction of
cardiovascular disease. Five classification methods are demonstrated: Multilayer Perceptron (MLP),
Support Vector Machine (SVM), Naïve Bayes (NB), Decision Tree (DT), K-Nearest Neighbors (KNN) to
build the prediction models. The prediction model is introduced with different combinations of features and
several known classification techniques mentioned before. The final model is Hybrid Random Forest based
on ensemble learning using various combinations of these five models. Data preprocessing and feature
selection steps were done before building the models. The models were evaluated based mainly on accuracy
and certain other performance metrics such as recall, precision, etc.
Keywords: Heart Disease Prediction, Machine Learning, Hybrid Random Forest, Ensemble Learning,
MLP, SVM, NB, DT, KNN.
TABLE OF CONTENTS
Abstract
(i) List of Figures/Symbols/Nomenclature
(ii) List of Tables
1 Introduction
1.1 Overview of work
1.2 Motivation of work
1.3 Literature Review
1.4 Research Gap
2 Problem Statement
2.1 Research Objectives
2.2 Methodology of work
3 Analysis And Design
4 Results and Discussion
5 Conclusion and Future Scope
6 References
7 Publication (if any)

List of Figures / Symbols/ Nomenclature
List of Figures:
1. Correlation coefficient matrix
2. Neural Network Diagram
3. Concept of hyperplane
4. Random Forest
5. Decision tree
6. K nearest neighbor model
7. Basics of Stacking
8. Stacking Data Flow Diagram

List of Tables:
1. TABLE I : HEART DISEASE DATASET DESCRIPTION
2. TABLE II : OUTLIERS IN DATASET
3. TABLE III : USING COMBINATION OF 2 ALGORITHMS
4. TABLE IV : USING COMBINATION OF 3 ALGORITHMS
5. TABLE V : USING COMBINATION OF 4 ALGORITHMS

Chapter 1
Introduction
1.1 Overview of Work
Modern technology in the healthcare industry has been developing at a rapid pace, and the
number of people being treated is also increasing. Heart diseases are among the most common diseases
handled by the healthcare industry. They depend on multiple factors and involve a large amount of patient data. This calls for a
technology that can analyze the factors and data of the patient and draw a useful
conclusion from them. Machine learning techniques have proven to be effective in this area in recent times.
Several models have been proposed and research on this topic is still ongoing. The standard machine
learning techniques are already in use in this domain. The severity of the disease is classified based on
various methods like the K-Nearest Neighbor algorithm (KNN), Decision Trees (DT), Genetic Algorithm (GA),
and Naive Bayes (NB). The nature of heart disease is complex and hence the disease must be handled
carefully. Not doing so may harm the heart.
The perspectives of medical science and data mining are used for discovering various sorts of metabolic
syndromes. We have also seen decision trees used in predicting the accuracy of events related to heart
disease [1]. Neural networks using heart rate time series have been introduced. This method uses various clinical
records for prediction, such as Left bundle branch block (LBBB), Right bundle branch block (RBBB), Atrial
fibrillation (AFIB), Normal Sinus Rhythm (NSR), Sinus bradycardia (SBR), Atrial flutter (AFL), Premature
Ventricular Contraction (PVC), and Second degree block (BII), to find out the exact condition of the patient
in relation to heart disease. The dataset with a radial basis function network (RBFN) is used for
classification. We have also seen recent developments in machine learning (ML) techniques used for the Internet
of Things (IoT). ML algorithms on network traffic data have been shown to provide accurate
identification of IoT devices connected to a network.
Heart disease is a serious health condition that affects many people worldwide and is a leading cause of
death. Various advanced technologies are used to treat this condition, but there can be differences in
expertise and knowledge among medical professionals at different centers, which can lead to poor outcomes
and even fatalities in some cases. To overcome these challenges, researchers have developed machine
learning algorithms and data mining techniques to predict heart disease by analyzing a patient's different
health parameters, such as blood pressure, cholesterol levels, ECG readings, age, gender, family history, and
lifestyle factors.
By utilizing these techniques, doctors can make more accurate predictions about the likelihood of a patient
developing heart disease, and more informed decisions about diagnosis and treatment can be made. In
hospitals, predictive models can be integrated into clinical decision support systems, which can help doctors
to diagnose and treat heart disease more effectively. The use of machine learning and data mining in
predicting heart disease has the potential to significantly improve patient outcomes, reduce misdiagnosis,
and ensure that the right treatment is provided at the right time. Medical diagnosis is a crucial
but difficult task to perform efficiently and effectively, so automating it is very helpful.
Unfortunately, not all physicians are specialists in every subject, and some places also suffer from a scarcity
of resources. Data mining can be used to find hidden patterns and knowledge that may contribute to
successful decision making. This plays a key role for healthcare professionals in making accurate decisions
and providing quality services to the public. The support that a healthcare organization provides to
professionals who have less knowledge and skill is also very important. One of the main
limitations of existing methods is their limited ability to draw accurate conclusions when needed.
Neural networks are widely regarded as an effective tool for prediction of diseases like heart disease and
brain disease. The proposed method which we use has 13 attributes for heart disease prediction. The results
show an enhanced level of performance compared to the existing methods. Carotid Artery
Stenting (CAS) has also become a prevalent treatment mode in the medical field in recent years.
CAS can prompt the occurrence of major adverse cardiovascular events (MACE) in elderly heart disease
patients, so their evaluation becomes very important. Neural network methods have been introduced that
combine not only posterior probabilities but also predicted values from multiple predecessor techniques.
This model shows strong results compared to previous works. For all experiments, the Cleveland heart
dataset is used with a Neural Network to improve the performance of heart disease prediction, as we have seen
previously.
This work mainly focuses on building such models using the machine learning techniques that are available
in the current time. Various types of machine learning techniques are implemented in this work and
compared with each other for accuracy. Finally, a Hybrid Random Forest model based on the ensemble
learning technique in machine learning is trained and built in this work. The model will input the patient’s
health data and based on that, it will predict if the patient is suffering from a heart disease or not. The
dataset used in this work is the Cleveland dataset from the UCI repository. The dataset contains 14 attributes:
1 target attribute and 13 input attributes. The dataset is preprocessed before using it for
training and testing the models. Preprocessing techniques are applied followed by Principal Component
Analysis as a feature extraction method. The accuracy of the model will play the most important role in
making it stand apart from the existing machine learning models.
1.2 Motivation of the Work

It has been discussed above that the possibility of a patient suffering from a heart disease depends upon
multiple factors and large data. Manually checking this data with multiple factors and doing all calculations
becomes a tedious task for the doctor and it also increases the chances of errors in calculations. Since
healthcare decisions affect a person’s life, diagnostic accuracy is very important. An
intelligent system using machine learning can be very useful to provide assistance to the doctor for
predicting heart disease. This reduces the calculations needed to be done manually and reduces the chances
of human error possible during the calculations.
Treating a disease at a less critical level or at the initial phase is an easier task and it increases the chance of
disease getting completely cured. The fact that early diagnosis and treatment of heart disease can be very
crucial in saving lives motivates this work. Lives of many people can be saved from the risk of heart disease
by predicting it earlier and possibly at a less critical level. Moreover, not only doctors but even a person can
use this model to predict the disease provided the required input data is known.
1.3 Literature Review

[1] The authors used the Rattle package in RStudio to perform heart disease classification on the Cleveland UCI
repository. The clustering of datasets is done on the basis of the variables and criteria of Decision Tree
features. Then, the classifiers are applied to each clustered dataset in order to estimate its performance.
Several standard performance metrics such as accuracy, precision and error in classification have been
considered for the computation of performance efficacy of this model. The reported accuracies are 86.1% for
Random Forest, 86.1% for Support Vector Machine, 87.41% for the VOTE classifier, and 88.4% for the
proposed HRFLM, which is the highest. Although the authors propose HRFLM, they
do not describe its implementation, nor do they give details of the parameters used with the
models.
[2] Garate-Escamila et al. proposed a hybrid dimensionality reduction technique combining
Chi-square and principal component analysis (CHI-PCA) to predict heart disease. CHI-PCA
with random forests (RF) showed the highest accuracy, at 98.7%
for the Cleveland dataset, 99.0% for the Hungarian dataset, and 99.4% for the Cleveland–Hungarian (CH)
dataset. The problem here is that this high accuracy is due to old and small datasets with very
few outliers to handle.
[3] Ritu et al. presented a sequential feature selection method for identifying mortality events in patients
with heart disease during treatment to find the most critical features. Numerous machine learning methods
are utilized, including LDA, KNN, RF, SVM, DT, and GBC. Experimental findings indicated that the
sequential feature selection technique achieves an accuracy of 86.67% for the random forest classifier.
[4] The authors used 14 attributes out of 75, including age, sex, chest pain, blood pressure, cholesterol, fasting
blood sugar, electrocardiographic results, maximum heart rate achieved, exercise-induced angina, ST
depression induced by exercise, slope of the peak exercise ST segment, number of major vessels, reversible
defect, and target (0 or 1). The authors then applied and compared a range of feature selection (FS) techniques. The ANOVA FS
technique achieved the highest accuracy of 83.60% with the Random Forest classifier using 8 features. The
feature subset selected by the backward feature selection technique achieved the highest classification
accuracy of 86%, precision of 87%, sensitivity of 80.76%, and f-measure of 85.71% with the DT classifier.
All the models created by combining feature selection techniques achieved more than 80% accuracy.
Furthermore, among all the feature selection categories, wrapper-based techniques, namely FFS, BFS, EFS,
and RFS, obtained higher classification performance than the other two categories, above 83%.
[5] The study aimed to use data mining techniques to detect heart disease in healthcare. The algorithms used
were KNN, Neural Networks, Naive Bayes and SVM. KNN had the highest accuracy of 85%, followed by
SVM and ANN with 66% and 51% accuracy.
[6] Takci used twelve classification algorithms from various categories and four feature selection methods
for heart attack prediction. The result shows that, without feature selection, the maximum accuracy value
was 82.59%; it increased to 84.81% with feature selection. Model accuracy of 84.81% was obtained using
naive Bayes and linear SVM. The ReliefF algorithm provides the best model accuracy among the four
alternative feature selection techniques according to the mean accuracy value.
[7] Spencer et al. conducted experiments on four frequently used heart disease datasets using four different
feature selection techniques: principal component analysis, Chi-squared testing, ReliefF, and symmetrical
uncertainty. As noted by the authors, the benefits of feature selection differ depending on the machine
learning approach employed for the cardiac datasets. For example, one of the most accurate models
discovered had an accuracy of 85.0%, a precision of 84.73%, and a recall of 85.56% when Chi-Squared
feature selection was combined with the BayesNet classifier.
| S.No. | Name of Research Paper | Author and Year | Methodology | Results |
|---|---|---|---|---|
| [1] | Effective Heart Disease Prediction Using Hybrid Machine Learning Techniques | S. Mohan, C. Thirumalai and G. Srivastava, 2019 | Hybrid Random Forest and Linear Method (HRFLM), NN, SVM | The features of the dataset were used efficiently and eighty percent accuracy was obtained on UCI datasets |
| [2] | Classification models for heart disease prediction using feature selection and PCA | A. K. Garate-Escamila, A. E. Hassani, and E. Andrés, 2020 | Chi-square and principal component analysis (CHI-PCA), Random Forest (RF) | Good accuracy was obtained on the Cleveland and Hungarian datasets, but the datasets were small and old. |
| [3] | Sequential feature selection and machine learning algorithm-based patient’s death events prediction and diagnosis in heart disease | R. Aggrawal and S. Pal, 2020 | Sequential feature selection method along with LDA, KNN, RF, SVM, DT, and GBC. | The feature selection method applied worked and fit well with the Random Forest method implemented in the same research work. |
| [4] | Comparative Study on Heart Disease Prediction Using Feature Selection Techniques on Classification Algorithms | Kaushalya Dissanayake, Md Gapar Md Johar, 2021 | All feature selection techniques along with the combination of different classifiers like SVM, DT, KNN, and RF | ANOVA and Backward Feature Selection worked best with the combination of DT and RF, with highest accuracy around eighty percent |
| [5] | Back Propagation neural network for prediction of heart disease | N. Al-milli, 2013 | KNN, SVM and Naive Bayes were used with some feature selection techniques and ANN. | KNN showed more accurate results, followed by SVM and ANN. |
| [6] | Improvement of heart attack prediction by the feature selection methods | H. Takci, 2018 | Used twelve classification algorithms from various categories and four feature selection methods for heart attack prediction. | Result shows that, without feature selection, the maximum accuracy value was 82.59%; it increased to 84.81% with feature selection. |
| [7] | Exploring feature selection and classification methods for predicting heart disease | R. Spencer, F. Thabtah, N. Abdelhamid, and M. Thompson, 2020 | Feature selection techniques such as Chi-Squared testing, PCA, ReliefF, symmetrical uncertainty. | As noted by the authors, the benefits of feature selection differ depending on the machine learning approach employed for the cardiac datasets |
1.4 Research Gaps

1) Incorporating missing risk factors
Many studies have focused on traditional risk factors such as age, gender, and blood pressure, but
other risk factors such as family history, lifestyle choices, and socio-economic status may also play a role in
the development of heart disease. Therefore, incorporating these factors in ML models may improve their
accuracy.
2) Limited real-world validation

Most studies on heart disease prediction using machine learning have been conducted on limited and
homogeneous datasets. There is a need for large-scale, real-world validation datasets to increase the
performance and prediction of these models.
3) Data reduction techniques
More work can be done by using more data related to heart disease with the help of different data reduction
techniques and extensive data analysis and trying additional algorithms or combinations of algorithms to
reach the maximum possible accuracy.
Chapter 2
Problem Statement
2.1. Research Objectives

This research work mainly focuses on developing a machine learning model for heart disease prediction
with enhanced performance compared to the existing ML models for the
same task. The objective is to develop a model that achieves better accuracy in predicting the results.
2.2. Methodology of the Work
A. Data Collection
The dataset used is the Cleveland Dataset from the UCI repository [5]. The dataset contains a total of 300
instances with 13 input attributes and one target attribute, as described in Table I.
TABLE I. HEART DISEASE DATASET DESCRIPTION
| Data element | Description | Type | Range | Remarks |
|---|---|---|---|---|
| Age | - | Numeric | 29-77 | Average is 54.37 |
| Sex | - | Binary | 0: Female, 1: Male | 32% Female, 68% Male |
| Cp | Chest pain level | Nominal | 0/1/2/3 (0: Asymptomatic, 2: Non-anginal pain, 3: Typical angina) | Majority have 0 pain |
| Trestbps | Rest blood pressure | Numeric | 94-200 | Average is 131.6 |
| Chol | Cholesterol | Numeric | 126-564 | - |
| Fbs | Fasting blood sugar level | Binary | 0: Level below 120, 1: Level above 120 | - |
| Restecg | Resting electrocardiographic results | Nominal | 0/1/2 (0: Showing probable or definite left ventricular hypertrophy, 2: Abnormal) | - |
| Thalach | Maximum heart rate achieved | Numeric | 71-202 | - |
| Exang | Exercise induced angina | Binary | 0: None, 1: Produced | - |
| Oldpeak | ST depression induced by exercise relative to rest | Numeric | 0-6.2 | Right skewed data, the majority of the population is between 0 and 0. |
| Slope | The slope of the peak exercise ST segment | Nominal | 0: Upsloping, 1: Flat, 2: Down-sloping | - |
| Ca | Number of major vessels | Nominal | 0/1/2/3/ | - |
| Thal | Defect type | Nominal | 1: Fixed defect, 2: Normal, 3: Reversible defect | There is one outlier of category 0 |
| Target | Diagnosis of heart disease | Binary | 0: No disease, 1: Disease | - |
B. Data Preprocessing
The quality of the data used to build a machine learning model has a significant impact on its
performance. Therefore, data preprocessing plays a crucial role in ensuring the quality of the data. This
process involves a variety of tasks, such as removing corrupted or missing data points and outliers, as well
as transforming the data, resampling it, and performing feature selection to enhance its quality.
1) Data Visualization and Cleaning
Initially, we conducted an assessment for any missing values, but fortunately, none were
detected. Following that, we proceeded to examine any potential outliers, and as indicated in
Table II, several were identified.
TABLE II: OUTLIERS IN DATASET
| Attribute | Outlier Values |
|---|---|
| Age | None |
| Chol | 417, 564, 394, 407, 409 |
| Trestbps | 172, 178, 180, 200, 174, 192, 178, 180 |
| Thalach | 71 |
| Oldpeak | 4.2, 6.2, 5.6, 4.2, 4.4 |
The outliers were removed using IsolationForest, with the contamination argument set to auto
and the random state set to 1.
a. Isolation Forest
Isolation Forest is an anomaly detection algorithm that works by explicitly isolating anomalies
(outliers) instead of profiling normal points. It is based on the idea that anomalies are few and far
between, so they should be more susceptible to isolation than normal points. This algorithm
is available in Python as the IsolationForest class of the scikit-learn library.
Here's an overview of how the Isolation Forest algorithm works:
1. Randomly select a feature from the dataset.
2. Randomly select a split value between the minimum and maximum value of the
selected feature.
3. Divide the data into two partitions using the selected feature and split value.
4. Repeat steps 1-3 recursively on the resulting partitions until the tree is fully grown
(each point is isolated or a depth limit is reached).
5. Repeat steps 1-4 to create multiple trees in a forest.
6. Calculate the anomaly score for each data point based on the average depth at which
it is isolated across all trees; a smaller average depth gives a higher anomaly score.
7. Data points with higher anomaly scores are considered outliers.
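The outlier-removal step described above can be sketched with scikit-learn's IsolationForest. The settings contamination="auto" and random_state=1 are the ones reported in this work; the synthetic data and the variable names are assumptions made only for illustration and do not reproduce the Cleveland values.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic stand-in for two numeric attributes (hypothetical values, not the Cleveland data).
rng = np.random.default_rng(0)
X = rng.normal(loc=[130.0, 245.0], scale=[15.0, 45.0], size=(200, 2))
X = np.vstack([X, [[200.0, 564.0]]])  # one extreme row appended on purpose

# Settings reported in the text: contamination="auto", random_state=1.
forest = IsolationForest(contamination="auto", random_state=1)
labels = forest.fit_predict(X)      # +1 = inlier, -1 = outlier
scores = forest.score_samples(X)    # lower (more negative) = more anomalous

X_clean = X[labels == 1]            # keep only the rows labelled as inliers
print(f"kept {len(X_clean)} of {len(X)} rows")
```

Note that scikit-learn reports scores with the opposite sign to the description above: the more anomalous a point, the lower its score_samples value.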
To examine the relationship between various attributes and the output, a correlation
coefficient matrix was generated. The matrix, depicted in Figure 1, displays the correlation
values that indicate both the strength and direction of the relationship between variables. A
positive correlation coefficient indicates a positive linear relationship between variables,
while a negative correlation coefficient indicates a negative linear relationship. The matrix
allows us to visualize the extent to which the variables are associated with the output
variable, which can provide valuable insights for further analysis and modeling.
Fig. 1. Correlation coefficient matrix
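As an illustration of how such a matrix can be computed, the following minimal sketch builds a Pearson correlation matrix with NumPy. The column names are borrowed from Table I, but the values are synthetic and the relationship between them is an assumption made only so that the example shows a negative correlation.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
age = rng.uniform(29, 77, n)
thalach = 220 - age + rng.normal(0, 10, n)               # made to fall with age (toy rule)
target = (thalach < np.median(thalach)).astype(float)    # toy binary output

names = ["age", "thalach", "target"]
data = np.column_stack([age, thalach, target])
corr = np.corrcoef(data, rowvar=False)                   # 3 x 3 Pearson correlation matrix

for name, row in zip(names, corr):
    print(name.ljust(8), np.round(row, 2))
```

Each entry lies between -1 and 1; the sign gives the direction of the linear relationship and the magnitude gives its strength.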
3) Data Transformation
When dealing with datasets that contain data in varying formats or when merging different
datasets, transformation techniques are often employed. In this study,
nominal features were transformed into factors to make them compatible with RStudio for
analysis.
4) Dimensionality Reduction
Dimensionality reduction is a crucial process in machine learning, which involves reducing
the number of features to enhance model performance by decreasing complexity and
preventing overfitting. This can be achieved through either feature selection or extraction
methods. Feature selection entails choosing a subset of features from the original set, and
techniques like CFS (Correlation-based Feature Selection), Chi-squared test, and ridge
regression are commonly used for this purpose. In this study, CfsSubsetEval was used to
evaluate the worth of feature subsets by considering their individual predictive ability and
degree of redundancy. On the other hand, feature extraction involves generating a new set of
features from the original set. Principal Component Analysis (PCA) is a popular feature
extraction method that projects the original data onto a lower-dimensional space.
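The projection performed by PCA can be sketched as follows. This is a minimal NumPy version based on the singular value decomposition, not the implementation used in this work; the random data, the function name pca_project, and the choice of 5 components are assumptions for illustration.

```python
import numpy as np

def pca_project(X, n_components):
    """Project X (samples x features) onto its top principal components."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing variance.
    _, singular_values, Vt = np.linalg.svd(X_centered, full_matrices=False)
    explained = singular_values**2 / np.sum(singular_values**2)
    return X_centered @ Vt[:n_components].T, explained[:n_components]

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 13))                # 13 columns mirror the 13 input attributes
Z, ratio = pca_project(X, n_components=5)    # 5 components is an arbitrary choice
print(Z.shape, ratio.round(3))
```

In practice the features are usually standardized first, since PCA is sensitive to the scale of each attribute.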
5) Data Splitting
In machine learning, it is common practice to split the data into training and testing sets. The
training set is used to train the model, while the testing set is used to evaluate its
performance by predicting the output. In this study, the hold-out method was used to split
the data into a training set comprising 80% of the data and a testing set comprising 20% of
the data.
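A hold-out split of this kind can be sketched in a few lines. The helper holdout_split and the placeholder arrays are hypothetical; scikit-learn's train_test_split with test_size=0.2 provides the same behaviour.

```python
import numpy as np

def holdout_split(X, y, test_fraction=0.2, seed=0):
    """Shuffle the row indices once, then cut them into train and test parts."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    n_test = int(round(test_fraction * len(X)))
    test_idx, train_idx = order[:n_test], order[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

X = np.arange(300 * 13).reshape(300, 13)   # 300 rows, 13 inputs (placeholder values)
y = np.arange(300) % 2
X_train, X_test, y_train, y_test = holdout_split(X, y)
print(len(X_train), len(X_test))           # 240 60
```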
C. Applied Algorithms
1) Naive Bayes (NB)
The Naive Bayes classifier is a popular type of supervised machine learning algorithm used for
classification tasks, including text classification. It falls into the category of generative learning
algorithms that aim to model the input distribution for a specific class or category. Unlike discriminative
classifiers like logistic regression, Naïve Bayes does not learn which features are most important for
differentiating between classes. Instead, it is based on the assumption that all features are conditionally
independent given the class and contribute equally to the target class. The algorithm employs Bayes' theorem to calculate the
posterior probability of an event A given an observed event B, using the prior probability of A, as shown in equation (1).
𝑃(𝐴|𝐵) = 𝑃(𝐵|𝐴)𝑃(𝐴)/𝑃(𝐵) (1)
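A small worked example of equation (1) may help. The probabilities below are hypothetical numbers chosen for illustration, not values estimated from the dataset, and P(B) is expanded with the law of total probability.

```python
def posterior(prior_a, likelihood_b_given_a, likelihood_b_given_not_a):
    """Return P(A|B) from Bayes' theorem, expanding P(B) by total probability."""
    evidence = (likelihood_b_given_a * prior_a
                + likelihood_b_given_not_a * (1.0 - prior_a))
    return likelihood_b_given_a * prior_a / evidence

# Hypothetical numbers: A = "patient has heart disease", B = "exercise-induced angina".
p = posterior(prior_a=0.45, likelihood_b_given_a=0.55, likelihood_b_given_not_a=0.14)
print(round(p, 3))
```

With these assumed inputs, observing B raises the probability of A from the prior 0.45 to roughly 0.76.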
2) Neural Network
Figure 2 illustrates that Neural Networks serve as universal approximators, capable of establishing any
correlation between a system's inputs and outputs, irrespective of its intricacy. These networks mimic the
operating principles of the human brain, where they assign weights (w) to each input, denoting its
significance to the output, during the training process.
Fig 2: Neural Network Diagram [8]
Each neuron in the network is associated with an activation function, while the network's
performance is affected by the number of neurons, layers, and activation functions used, all of which are
dependent on the application. In this study, an MLP network with 5 hidden nodes and a sigmoid activation
function was utilized.
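The forward pass of such a network can be sketched as below. Only the 5 hidden nodes and the sigmoid activation come from the text; the 13 inputs follow the dataset description, while the random weights, the single sigmoid output unit, and the function names are assumptions. No training is performed here.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mlp_forward(X, W1, b1, W2, b2):
    """One hidden layer with sigmoid units, then a sigmoid output in (0, 1)."""
    hidden = sigmoid(X @ W1 + b1)        # shape: (samples, 5)
    return sigmoid(hidden @ W2 + b2)     # shape: (samples, 1)

rng = np.random.default_rng(3)
n_inputs, n_hidden = 13, 5               # 13 inputs, 5 hidden nodes as in the text
W1 = rng.normal(scale=0.5, size=(n_inputs, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.5, size=(n_hidden, 1))
b2 = np.zeros(1)

X = rng.normal(size=(4, n_inputs))       # four made-up, already scaled patients
out = mlp_forward(X, W1, b1, W2, b2)
print(out.ravel())
```

Training would adjust W1, b1, W2 and b2 so that the output approaches the target label for each patient.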
3) SVM
Fig 3 : Concept of hyperplane [9]
4) Random Forest
Random Forest (RF) is another popular supervised machine learning algorithm that can be used for both
classification and regression tasks. It utilizes ensemble learning, a technique that combines multiple
classifiers to make accurate predictions in complex situations. RF algorithms employ multiple decision
trees to establish predictions, using techniques such as bagging (bootstrap aggregation), as illustrated in
Figure 4. By aggregating the predictions of many decision trees, RF can improve the accuracy and
robustness of the model.
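The bagging idea behind Random Forest can be sketched in simplified form. The example below uses one-split decision stumps instead of full trees and omits the random feature subsets that a real Random Forest adds, so it illustrates bootstrap aggregation only; the data and all function names are hypothetical.

```python
import numpy as np

def fit_stump(X, y):
    """Pick the single feature/threshold split with the fewest training errors."""
    best = None
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            pred = (X[:, j] > t).astype(int)
            for flip in (0, 1):                      # flip=1 inverts the predicted class
                err = np.mean((pred ^ flip) != y)
                if best is None or err < best[0]:
                    best = (err, j, t, flip)
    return best[1:]

def predict_stump(stump, X):
    j, t, flip = stump
    return (X[:, j] > t).astype(int) ^ flip

def bagged_predict(stumps, X):
    """Majority vote over all stumps."""
    votes = np.mean([predict_stump(s, X) for s in stumps], axis=0)
    return (votes >= 0.5).astype(int)

rng = np.random.default_rng(4)
X = rng.normal(size=(120, 3))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)        # toy labelling rule

stumps = []
for _ in range(25):
    idx = rng.integers(0, len(X), size=len(X))       # bootstrap sample with replacement
    stumps.append(fit_stump(X[idx], y[idx]))

acc = np.mean(bagged_predict(stumps, X) == y)
print(f"training accuracy of the bagged stumps: {acc:.2f}")
```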
Fig 4: Random Forest [10]
5) Decision trees
Decision tree is a type of supervised machine learning algorithm used for both classification and
regression tasks. It is a tree-like structure where each node represents a feature or attribute, each branch
represents a decision or rule, and each leaf node represents a class label or a numeric value. The
algorithm creates a tree by recursively splitting the data into subsets based on the most significant
feature, to minimize the impurity or increase the homogeneity of the subsets. The resulting decision
tree can be used to predict the class or value of new data points based on their feature values. Decision
trees are easy to interpret and visualize, making them useful for exploratory analysis and
decision-making. However, they can be prone to overfitting and may not perform well on complex and
noisy datasets.
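The impurity-based splitting described above can be illustrated with the Gini impurity, one common impurity measure (the measure actually used in this work is not stated, so this is an assumption). The attribute name is borrowed from Table I; the eight values and labels are made up.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0 for a pure node, 0.5 for an even two-class mix."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_threshold(values, labels):
    """Return (threshold, weighted_gini) of the best split 'value <= threshold'."""
    best = (None, gini(labels))
    for t in sorted(set(values))[:-1]:
        left = [l for v, l in zip(values, labels) if v <= t]
        right = [l for v, l in zip(values, labels) if v > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (t, score)
    return best

thalach = [110, 125, 140, 150, 162, 170, 178, 185]   # hypothetical heart rates
disease = [1, 1, 1, 0, 1, 0, 0, 0]                   # hypothetical labels
threshold, impurity = best_threshold(thalach, disease)
print(threshold, round(impurity, 3))
```

Here the parent node has impurity 0.5, and the split at 140 lowers the weighted impurity to 0.2; a tree builder would then repeat the search inside each child node.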
Fig 5: Decision tree [11]
5) KNN
The k-nearest neighbors (KNN) algorithm is a supervised learning classifier that is non-parametric. It is
used for regression or classification problems, but is more commonly used as a classification algorithm.
The basic premise of KNN is that data points that are similar to one another are often in close
proximity. KNN assigns a class label on the basis of a majority vote: the label that is most frequently
represented around a given data point is used. This is technically known as “plurality voting”. The
term “majority vote” is more commonly used in the literature, even though a strict majority requires
more than 50% of the votes, which is only guaranteed when there are two categories. When multiple
classes are present, say four categories, a class label can be assigned with a vote of greater than 25%,
without necessarily needing 50% of the vote to make a conclusion about a class.
Fig 6 : K nearest neighbour model [12]
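The voting rule can be illustrated with a short Python sketch. It assumes Euclidean distance (the report does not name a distance measure) and uses hypothetical two-dimensional points with four classes, so that the winning label holds a plurality rather than a majority.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Label `query` by plurality vote among its k nearest training points."""
    by_distance = sorted(train, key=lambda row: math.dist(row[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical 2-D points with four classes.
train = [
    ((0.0, 0.0), "A"), ((0.2, 0.1), "A"),
    ((1.0, 0.0), "B"),
    ((0.0, 1.0), "C"),
    ((1.0, 1.0), "D"),
]
label = knn_predict(train, (0.1, 0.1), k=5)
```

With k = 5 every point votes, and class "A" wins with two of the five votes (40%), which is below a strict majority but still the most frequent label.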
7) Hybrid Random Forest
A hybrid random forest is a machine learning model that combines two or more types of random
forests in a single model, with the aim of handling a wider range of data and producing more accurate
predictions. For example, one type of random forest may be optimized for numerical data, while
another may be better suited to categorical data. By combining the two, a hybrid random forest can
handle both types of data. Such hybrids can outperform single-type random forests and are a common
choice for applications where the data is complex and diverse. In this work, the hybrid is built from
combinations of ML models, namely KNN, Decision Tree, SVM (linear), Neural Network (NN) and
Naive Bayes (NB), whose predictions are combined by a Random Forest (see Chapter 3). Of all the
combinations of these five models, SVM, Decision Tree, Neural Network and KNN performed best,
with an accuracy of 90.16% (Table V).
Chapter 3
Analysis and Design
[A]. The Basics of Stacking:
Fig 7: Basics of Stacking [13]
The training dataset (X) consists of m observations and n features, giving it a dimension of m x n, as
shown in Fig 7. To improve prediction accuracy, M different models are trained on X using
cross-validation. Each model generates predictions for the outcome variable (y), and these are
collected into a second-level training dataset (Xl2) of dimension m x M, also shown in Fig 7. The M
predictions serve as new features in this second-level data. Finally, one or more second-level models
are trained on Xl2 to produce the final predictions for deployment. There are several ways to
construct the second-level data (Xl2); one of them is stacking, a technique that is particularly
suitable for small to medium-sized datasets. Stacking uses k-fold cross-validation to generate
out-of-sample predictions.
Stacking is an ensemble learning technique in which a meta-classifier combines multiple
classification models. These models, known as base classifiers, generate predictions that are used as
new features to train the meta-classifier. The training data is divided into N folds. For each fold in
turn, each of the M models is trained on the remaining N-1 folds and then predicts the held-out fold.
These predictions are collected in an out-of-sample predictions matrix, Xoos, which serves as the
level 2 training data, Xl2. Because every observation is held out exactly once, Xoos contains one
prediction per model for each of the m observations. The Xoos matrix is then used to train the
second-level model, using a chosen method, to obtain the final predictions for all data points.
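The construction of the out-of-sample matrix can be sketched as follows. This is a minimal illustration, not the project's code: the interface in which each entry of `fit_functions` returns a `predict` callable, the contiguous (unshuffled) folds, and the two toy base models are all assumptions made for the example.

```python
def k_fold_indices(m, n_folds):
    """Split row indices 0..m-1 into n_folds contiguous validation folds."""
    bounds = [round(i * m / n_folds) for i in range(n_folds + 1)]
    return [list(range(bounds[i], bounds[i + 1])) for i in range(n_folds)]

def out_of_sample_matrix(X, y, fit_functions, n_folds=5):
    """Build the m x M level-2 matrix from out-of-fold predictions.

    Each entry of fit_functions takes (X_train, y_train) and returns a
    predict(row) callable; this interface is an assumption of the sketch.
    """
    m = len(X)
    x_oos = [[None] * len(fit_functions) for _ in range(m)]
    for held_out in k_fold_indices(m, n_folds):
        held = set(held_out)
        train_idx = [i for i in range(m) if i not in held]
        X_tr = [X[i] for i in train_idx]
        y_tr = [y[i] for i in train_idx]
        for j, fit in enumerate(fit_functions):
            predict = fit(X_tr, y_tr)  # trained without the held-out fold
            for i in held_out:
                x_oos[i][j] = predict(X[i])  # row i was never seen in training
    return x_oos

# Two hypothetical base models, used only to exercise the function.
def fit_majority(X_tr, y_tr):
    majority = max(set(y_tr), key=y_tr.count)
    return lambda row: majority

def fit_threshold(X_tr, y_tr):
    cut = sum(r[0] for r in X_tr) / len(X_tr)  # mean of the first feature
    return lambda row: int(row[0] > cut)

X = [[float(v)] for v in range(10)]
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
X_l2 = out_of_sample_matrix(X, y, [fit_majority, fit_threshold], n_folds=5)
```

Here `X_l2` has m = 10 rows and M = 2 columns, one column per base model, and would be the training input of the second-level model. In practice the rows are usually shuffled before the folds are formed.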
[B]. Proposed Stacking based Model:
Fig 8: Stacking Data Flow Diagram [14]
The dataset is divided into two portions, with 80% allocated for training and 20% reserved for testing.
The base level of the proposed model employs the KNN, Naïve Bayes (NB), Neural Network (NN) and
Decision Tree (DT) algorithms, trained on the training portion. Their predictions serve as additional
features on which a meta-classifier is trained. In this study, Random Forest is used as the
meta-classifier to produce the final predictions. The flow chart of the proposed stacking model is
shown in Figure 8. The proposed approach follows three phases. Firstly, the 80% training dataset is
used to train the four base classifiers (KNN, NB, NN, and DT). Secondly, the predictions of each of
the four trained models are obtained. Thirdly, a new dataset is created from the predictions of the
base classifiers (KNN, NB, NN, and DT), resulting in a four-dimensional dataset, and the second-level
classifier, the meta-classifier, is trained on this new dataset. Random Forest is employed as the
meta-classifier in this investigation. The performance of the proposed stacking model is evaluated
and compared to that of the individual classifiers (KNN, Naïve Bayes, Neural Network, and Decision
Tree) in terms of Recall, Precision, and F-Measure.
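For reference, the evaluation measures named above can be computed from the true and predicted labels as in the sketch below. The function name and the example labels are hypothetical and are not taken from the experiments in this report.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall and F-measure for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    correct = sum(t == p for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return {"accuracy": correct / len(pairs), "precision": precision,
            "recall": recall, "f_measure": f_measure}

# Hypothetical labels: 1 = heart disease present, 0 = absent.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
```

Precision is the share of predicted positives that are correct, recall is the share of actual positives that are found, and the F-measure is their harmonic mean.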
[C]. Algorithm
Chapter 4
Results and Discussion
In this study, we utilized the UCI Heart Disease dataset, consisting of 300 patient records indicating the
presence or absence of heart disease. The dataset was randomly divided into a training set (used to train the
machine learning model) and a testing set (used to evaluate the performance of the model).
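A random split of this kind can be sketched as follows, assuming the 80/20 proportion stated in Chapter 3 and the 300 records mentioned above. The function name and the seed value are hypothetical; the seed only makes the shuffle reproducible.

```python
import random

def train_test_split_indices(n_records, test_fraction=0.2, seed=42):
    """Shuffle record indices with a fixed seed and split them into train/test."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)
    n_test = round(n_records * test_fraction)
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = train_test_split_indices(300)
```

With 300 records this yields 240 training indices and 60 testing indices, with no record appearing in both sets.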
Algorithm combinations Using Hybrid Random Forest Classifier Technique
1) Using a combination of 2 algorithms at a time:
TABLE III : USING COMBINATION OF 2 ALGORITHMS
S. No  Combination                       Accuracy (in %)  Precision (in %)  Recall (in %)
1      Naive Bayes and Decision Tree     75.40            73                63
2      Naive Bayes and Neural Network    78.68            86                82
3      Decision Tree and Neural Network  77.04            88                74
4      Naive Bayes and KNN               73.77            81                79
5      Decision Tree and KNN             81.96            87                71
6      Neural Network and KNN            83.60            75                71
7      SVM and KNN                       75.40            78                66
8      SVM and Neural Network            83.60            88                74
9      SVM and Naive Bayes               81.96            91                76
10     SVM and Decision Tree             75.40            83                63
In Table III above, two algorithms at a time are combined using a Hybrid Random Forest Classifier.
The combinations are drawn from Naive Bayes, Decision Tree, Neural Network, KNN and SVM. The
highest accuracy, 83.60%, is reached by two combinations: SVM and Neural Network (precision 88%,
recall 74%) and Neural Network and KNN (precision 75%, recall 71%). Of the two, SVM and Neural
Network has the higher precision and recall.
TABLE IV : USING COMBINATION OF 3 ALGORITHMS
S. No  Combination                   Accuracy (in %)  Precision (in %)  Recall (in %)
1      SVM, Naive Bayes, Perceptron  79.77            83                80
2      SVM, Naive Bayes, KNN         80.89            81                86
3      SVM, Naive Bayes, DT          80.89            81                86
4      SVM, Perceptron, DT           78.65            82                80
5      SVM, KNN, DT                  76.40            80                78
6      SVM, Perceptron, KNN          80.89            82                84
7      Naive Bayes, Perceptron, KNN  82.02            81                88
8      Naive Bayes, Perceptron, DT   77.52            81                78
9      Naive Bayes, KNN, DT          76.40            81                76
10     Perceptron, KNN, DT           82.02            84                84
In Table IV above, three algorithms at a time are combined using a Hybrid Random Forest Classifier.
The combinations are drawn from Naive Bayes, Decision Tree (DT), Perceptron, KNN and SVM. The
highest accuracy, 82.02%, is reached by two combinations: Perceptron, KNN and DT (precision 84%,
recall 84%) and Naive Bayes, Perceptron and KNN (precision 81%, recall 88%).
TABLE V : USING COMBINATION OF 4 ALGORITHMS
S. No  Combination                                      Accuracy (in %)  Precision (in %)  Recall (in %)
1      SVM, Decision Tree, Neural Network, KNN          90.16            86                69
2      Decision Tree, Neural Network, KNN, Naive Bayes  86.84            85                79
3      SVM, Decision Tree, KNN, Naive Bayes             85.24            78                69
4      SVM, Decision Tree, Neural Network, Naive Bayes  82.89            72                75
5      KNN, SVM, Neural Network, Naive Bayes            83.36            86                52
In Table V above, four algorithms at a time are combined using a Hybrid Random Forest Classifier.
The combinations are drawn from Naive Bayes, Decision Tree, Neural Network, KNN and SVM. The
highest accuracy is given by SVM, Decision Tree, Neural Network and KNN, at 90.16% with a
precision of 86% and a recall of 69%.
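The row counts of Tables III to V (10, 10 and 5) equal the number of ways of choosing two, three and four models out of five. A short sketch of how such combinations could be enumerated is given below; the list name `BASE_MODELS` and the helper `model_subsets` are hypothetical and only illustrate the counting.

```python
from itertools import combinations

# The five base models named in the tables above.
BASE_MODELS = ["SVM", "Decision Tree", "Neural Network", "KNN", "Naive Bayes"]

def model_subsets(size):
    """All ways to choose `size` base models, ignoring order."""
    return list(combinations(BASE_MODELS, size))

# Number of combinations for each subset size used in Tables III to V.
counts = {size: len(model_subsets(size)) for size in (2, 3, 4)}
```

Each tuple returned by `model_subsets` would then be trained as the base level of one stacked model and evaluated on the testing set.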
The results of our analysis show that the hybrid random forest classifier is a promising machine learning
technique for heart disease prediction. The best model achieved an accuracy of 90.16% on the testing
set, although its recall of 69% indicates that a share of the positive cases is still missed.
It is important to note that this analysis is based on a single, relatively small dataset, and real-world
performance may vary depending on the quality and size of the dataset used to train and test the model.
In addition, there may be other factors that influence heart disease risk that are not captured in the
dataset used for this analysis.
Overall, our results suggest that the hybrid random forest classifier may be a valuable tool for heart disease
prediction. Further research could explore the use of additional features and more advanced machine
learning techniques to improve the accuracy and robustness of heart disease prediction models.
Chapter 5
Conclusion and Future Scope
Our study demonstrates that the hybrid random forest classifier is a promising tool for predicting heart
disease using the UCI Heart Disease dataset. A hybrid random forest classifier combining several machine
learning algorithms (SVM, Decision Tree, Neural Network and KNN) was applied to the training set
to predict the heart disease status of the patients. After training, the performance of the model was evaluated
on the testing set, where it achieved an accuracy of 90.16%. The classifier uses a set of features such as
age, sex, and medical history to predict the presence or absence of heart disease in a patient.
In future work, the model's accuracy could be further improved by including additional features related to
lifestyle choices, and by using more advanced machine learning techniques such as deep learning. To
evaluate the generalizability of the hybrid random forest classifier, future research could test the model on
other datasets and patient populations to assess its effectiveness in predicting heart disease in diverse patient
groups. Additionally, the model could be validated in clinical settings to determine its potential for clinical
application in the context of cardiovascular disease prevention and management. Overall, the hybrid
random forest classifier has potential as a tool for predicting heart disease and could improve cardiovascular
disease prevention and management in the future.
References
[1] S. Mohan, C. Thirumalai and G. Srivastava, "Effective Heart Disease Prediction Using Hybrid Machine
Learning Techniques," in IEEE Access, vol. 7, pp. 81542-81554, 2019, doi:
10.1109/ACCESS.2019.2923707.
[2] A. K. Gárate-Escamila, A. H. El Hassani, and E. Andrès, “Classification models for heart disease
prediction using feature selection and PCA,” Informatics in Medicine Unlocked, vol. 19, Article ID 100330,
2020.
[3] R. Aggrawal and S. Pal, “Sequential feature selection and machine learning algorithm-based patient’s
death events prediction and diagnosis in heart disease,” SN Computer Science, vol. 1, no. 6, 2020.
[4] K. Dissanayake and M. G. Md Johar, “Comparative Study on Heart Disease Prediction Using
Feature Selection Techniques on Classification Algorithms,” Applied Computational Intelligence and Soft
Computing, vol. 2021, Article ID 5581806, 17 pages, 2021. https://doi.org/10.1155/2021/5581806
[5] N. Al-milli, “Back Propagation neural network for prediction of heart disease,” J. Theor. Appl. Inf.
Technol., vol. 56, no. 1, pp. 131–135, 2013.
[6] H. Takci, “Improvement of heart attack prediction by the feature selection methods,” Turkish Journal of
Electrical Engineering and Computer Sciences, vol. 26, pp. 1–10, 2018.
[7] R. Spencer, F. Thabtah, N. Abdelhamid, and M. Thompson, “Exploring feature selection and
classification methods for predicting heart disease,” Digital Health, vol. 6, Article ID 2055207620914777,
2020.
Pictures and Dataset
[8] S. Aurangabadkar and M. A. Potey, "Support Vector Machine based classification system for
classification of sport articles," 2014 International Conference on Issues and Challenges in Intelligent
Computing Techniques (ICICT), Ghaziabad, India, 2014, pp. 146-150, doi:
10.1109/ICICICT.2014.6781268.
[9] S. M. Lainez and D. Gonzales, “Automated Fingerlings Counting Using Convolutional Neural
Network,” 2019, pp. 67-72, doi: 10.1109/CCOMS.2019.8821746.
[10] N. Pourebrahim, S. Sultana, A. Niakanlahiji, and J.-C. Thill, “Trip distribution modeling with
Twitter data,” Computers, Environment and Urban Systems, vol. 77, 2019, doi:
10.1016/j.compenvurbsys.2019.101354.
[11] M. Kumar, R. Sharma, M. Jindal, and S. Jindal, “Performance Evaluation of Classifiers for the
Recognition of Offline Handwritten Gurumukhi Characters and Numerals: A Study,” Artificial
Intelligence Review, vol. 53, 2020, doi: 10.1007/s10462-019-09727-2.
[12] W. Zhang, “Machine Learning Approaches to Predicting Company Bankruptcy,” Journal of
Financial Risk Management, vol. 6, pp. 364-374, 2017, doi: 10.4236/jfrm.2017.64026.
[13] https://www.kdnuggets.com/2017/02/stacking-models-imropved-predictions.html
[14] M. Ali, M. N. Haider, S. A. Lashari, W. Sharif, A. Khan, and D. A. Ramli, “Stacking Classifier
with Random Forest functioning as a Meta Classifier for Diabetes Diseases Classification,” Procedia
Computer Science, vol. 207, pp. 3459-3468, 2022, ISSN 1877-0509,
https://doi.org/10.1016/j.procs.2022.09.404.
[15] https://www.kaggle.com/datasets/redwankarimsony/heart-disease-data