
Rahma and Salman Iraqi Journal of Science, 2022, Vol. 63, No. 9, pp: 3966-3976
DOI: 10.24996/ijs.2022.63.9.28

ISSN: 0067-2904

Heart Disease Classification–Based on the Best Machine Learning Model

Melad Mizher Rahma 1*, Aymen Dawood Salman 2


1
Department of Computer Science, Information Institute for Higher Studies, Iraqi Computer Informatics
Authority, Baghdad, Iraq
2
Department of Computer Engineering, University of Technology, Baghdad, Iraq

Received: 19/9/2021 Accepted: 21/11/2021 Published: 30/9/2022

Abstract
In recent years, predicting heart disease has become one of the most demanding
tasks in medicine. In modern times, one person dies from heart disease every
minute. Within the field of healthcare, data science is critical for analyzing large
amounts of data. Because predicting heart disease is such a difficult task, it is
necessary to automate the process in order to prevent the dangers connected with it
and to assist health professionals in accurately and rapidly diagnosing heart disease.
In this article, an efficient machine learning-based diagnosis system has been
developed for the diagnosis of heart disease. The system is designed using machine
learning classifiers such as Support Vector Machine (SVM), Naïve Bayes (NB), and
K-Nearest Neighbor (KNN). The proposed work depends on the UCI database from
the University of California, Irvine for the diagnosis of heart diseases. This dataset is
preprocessed before running the machine learning model to get better accuracy in
the classification of heart diseases. Furthermore, a 5-fold cross-validation operator
was employed to avoid identical values being selected throughout the model
learning and testing phase. The experimental results show that the Naive Bayes
algorithm has achieved the highest accuracy of 97% compared to other ML
algorithms implemented.

Keywords: Machine Learning, Heart Disease (HD), Naïve Bayes (NB), KNN, SVM.


__________________________________
* Email: Ms201930544@ipps.icci.edu.iq


I. Introduction
Heart disease (HD) is typically considered to be one of the most complicated and life-
threatening illnesses in humans. As a result of this disease, the heart is generally not able to
pump the specified quantity of blood to different elements of the body to carry out the body’s
ordinary activities. As a result, cardiac failure arises [1]. Within the United States, the rate of
heart sickness is quite high [2]. The signs and symptoms of HD encompass chest pain,
swollen feet, weakness of the physical body, and fatigue with associated signs such as
elevated jugular venous pressure and peripheral edema, which may be produced by functional
cardiac or noncardiac abnormalities [3]. Early detection of HD is challenging, and the
resulting diagnostic uncertainty is one of the principal problems affecting patients' quality
of life [4]. HD diagnosis and treatment are especially difficult in poor countries because of
the lack of clinical equipment, physicians, and other resources, all of which affect the
proper diagnosis and treatment of heart patients [5]. Correct and accurate
identification of a patient's heart attack risk is critical for reducing the risk of significant heart
problems and enhancing heart protection [6]. According to the European Society of
Cardiology, 26 million people worldwide have been diagnosed with HD. Last year, 3.6
million new cases were diagnosed. Half of patients with HD die within two years, and heart
disease management costs account for around 3% of healthcare expenditure [7]. An invasive
HD diagnosis is based on a review of the patient's health history, a clinical assessment report,
and a medical expert's examination of the patient's symptoms. Due to human error, many of
these techniques result in incorrect diagnoses and, in many cases, delays in diagnosis
outcomes. It is also more costly and computationally complicated, and it takes longer to
determine [8]. To overcome the challenges of invasive diagnosis, various researchers have
developed noninvasive medical decision support systems based on ML prediction models
such as KNN, SVM, NB, DT, ANN, Logistic Regression (LR), AdaBoost (AB), fuzzy logic
(FL), and rough set theory, and these are now commonly used for HD diagnosis [9], [10].
As a consequence of these ML-based expert medical decision systems, the ratio of HD
fatalities has been reduced [9]. Several research projects have focused on machine-learning-
based approaches to HD diagnosis; for example, the classification performance of several
ML methods on the Cleveland HD dataset was reported in the literature [10], [11].
Many researchers have utilized this dataset to study various classification challenges linked to
heart diseases using various machine learning classification techniques; one example is
predicting coronary artery disease at an early stage so that patients can undergo treatment
and save their lives [12]. The main contribution of this work is to find the best classifier for the classification
of heart diseases and to help physicians diagnose the heart condition of their patients with the
highest degree of accuracy and efficiency.


II. RELATED WORKS


Alotaibi [13] has developed a machine learning model that compares five distinct approaches.
The Rapid Miner tool outperformed MATLAB and Weka in terms of accuracy. This study
looked into the accuracy of the classification approaches: Decision Tree, Logistic Regression,
Random Forest, Naive Bayes, and SVM. The decision tree algorithm was found to be the
most accurate. Latha et al. [14] performed a comparative analysis to improve the predictive
accuracy of heart disease risk using ensemble techniques on the Cleveland dataset of 303
observations. They applied the brute force method to obtain all possible attribute set
combinations and trained the classifiers. They achieved a maximum increase in the accuracy
of a weak classifier of 7.26% based on the ensemble algorithm and produced an accuracy of
85.48% using a majority vote with NB, BN, RF, and MLP classifiers using an attribute set of
nine attributes. Mohan et al. [15] developed an effective hybrid random forest with a linear
model (HRFLM) to enhance the accuracy of heart disease prediction using the Cleveland
dataset with 297 records and 13 features. They concluded that the RF and LM methods
provided the best error rates. Louridi et al. [16] proposed a solution to identify the
presence/absence of heart disease by replacing missing values with the mean values during
preprocessing. They trained three machine learning algorithms, namely, NB, SVM (linear and
radial basis function), and KNN, by splitting the Cleveland dataset of 303 instances and 13
attributes into 50:50, 70:30, 75:25, and 80:20 training and testing ratios. Gupta et al. [17]
replaced the missing values based on the majority label and derived 28 features using the
Pearson correlation coefficient from the Cleveland dataset and trained LR, KNN, SVM, DT,
and RF classifiers using the factor analysis of mixed data (FAMD) method; the results based
on a weight matrix RF achieved the best accuracy of 93.44%. Perumal et al. [18] developed a
heart disease prediction model using the Cleveland dataset of 303 data instances through
feature standardization and feature reduction using PCA, where they identified and utilized
seven principal components to train the ML classifiers. They concluded that LR and SVM
provided almost similar accuracy values (87% and 85%, respectively) compared to that of k-
NN (69%). Kumar et al. [19] trained five machine learning classifiers, namely, LR, SVM,
DT, RF, and KNN, using a UCI dataset with 303 records and 10 attributes to predict
cardiovascular disease. The RF classifier achieved the highest accuracy of 85.71% with a
ROC AUC of 0.8675 compared to the other classifiers. Gazeloğlu et al. [20] evaluated 18
machine learning models and 3 feature selection techniques (correlation-based FS, chi-square,
and fuzzy rough set) to find the best prediction combination for heart disease diagnosis using
the Cleveland dataset of 303 instances and 13 variables. Sharma et al. [21] used the heart
disease dataset from the UCI machine learning repository. Using data mining strategies
such as NB, DT, LR, and RF, their system predicts the likelihood of HD and classifies
patient risk levels, allowing them to compare the output of several ML algorithms; the RF
method achieved the best accuracy of 90.16%. Pavithra et al. [22] proposed a new hybrid feature selection
technique with the combination of random forest, AdaBoost, and linear correlation (HRFLC)
using the UCI dataset of 280 instances to predict heart disease. Eleven (11) features were
selected using filter, wrapper, and embedded methods; an improvement of 2% was found for
the accuracy of the hybrid model. Kavitha et al. [23] implemented a novel hybrid model on
the Cleveland heart dataset of 303 instances and 14 features with a 70:30 ratio for training and
testing by applying DT, RF, and hybrid (DT + RF) algorithms.
III. THEORETICAL BACKGROUND
A- HEART DISEASE


The human heart is a crucial organ that serves as a pump to circulate blood throughout the
body. If the body's blood circulation is poor, organs such as the brain suffer, and if the heart
entirely stops pumping, within minutes, death happens. The heart's correct functioning is
critical for survival. Problems with the heart and its blood vessels are referred to as "heart
disease".
1. Coronary artery disease, the most common form of cardiac disease worldwide, is a
condition in which plaque builds up in the arteries of the heart, causing the heart to receive
less blood and oxygen.
2. Angina pectoris is chest pain or discomfort that occurs as a result of reduced blood flow
to the heart. Also called angina, it can be a warning sign of a heart attack. The chest
discomfort comes in waves that last a few seconds or minutes.
3. Cardiomyopathy is the weakening or change of the heart muscle's structure as a result
of inadequate cardiac pumping. Hypertension, alcohol use, viral infections, and genetic
abnormalities are all frequent causes of cardiomyopathy.
4. Arrhythmia is a problem with the heartbeat's rhythm: the heartbeat may be irregular,
slow, or fast. These irregular heartbeats are caused by a short circuit in the heart's electrical
circuitry.
5. Myocarditis is an inflammation of the heart muscle caused by viral, fungal, or
bacterial diseases. It's an uncommon disease with minimal symptoms such as joint pain, leg
swelling, or fever that isn't caused by the heart [24], [25].
B. Naïve Bayes
The Naive Bayes predictor was utilized in this research. It is also a supervised learning
approach to categorize data by calculating the likelihood of independent factors. The high
likelihood class is allocated to the whole transaction after the probability of each class is
calculated [26], [27]. In different datasets, such as instructional data mining [29] and health
data mining, NB is a popular approach for predicting classes [30]. This model may be used to
categorize a variety of datasets, such as sentiment analysis [30] and virus identification [25].
It operates by predicting a predefined class for each document based on the values of
independent variables. It calculates the likelihood of A given that B, as illustrated in the
equation below [29]. Then focus on identifying a distinct class for each feature; in this
situation, none of the other variables are interdependent [29]. The probability is calculated
using the below equation:
P(C|X) = [ P(X|C) P(C) ] / P(X)    (1)

P(C|X): Posterior probability of class C (target) given predictor X (attributes).
P(C): Prior probability of the class.
P(X|C): Likelihood, i.e., the probability of predictor X given class C.
P(X): Prior (marginal) probability of the predictor X.
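As an illustration of Eq. (1) in practice, the sketch below fits a Gaussian Naïve Bayes classifier on synthetic data. The use of scikit-learn and the toy features are assumptions for illustration only, not the authors' exact setup:

```python
# Minimal Naive Bayes sketch: the classifier computes the posterior P(C|X)
# of Eq. (1) for each class and assigns the class with the highest posterior.
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))            # 100 toy samples, 4 numeric features
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # toy binary label

model = GaussianNB().fit(X, y)
probs = model.predict_proba(X[:5])       # posterior P(C|X) for each class
preds = model.predict(X[:5])             # class with the highest posterior
print(preds)
```

Each row of `probs` sums to 1, and `preds` simply picks the class whose posterior is largest, exactly as Eq. (1) prescribes.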
C. K Nearest Neighbor Algorithm
The KNN method is an instance-based learning algorithm that is frequently utilized in real-
life scenarios, and it can solve both classification and regression problems. Because it
defers all computation until prediction time, it is also known as a lazy learner. Compared
with other classification approaches, it is a simple method with a low computing cost. In an
n-dimensional feature space, the K samples closest to each sample being assessed are found
using a common distance measure such as the Euclidean, Hamming, or Manhattan distance.
The sample's class is then decided by a majority vote among the classes of its K closest
samples [25].
equation can be given for the Euclid distance calculation method:

d = √( Σ_{i=1}^{n} (p_i − q_i)² )    (2)

3969
Rahma and Salman Iraqi Journal of Science, 2022, Vol. 63, No. 9, pp: 3966-3976

Where p stands for the sample evaluated, q stands for any sample within the training dataset,
and n stands for the feature size.
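The distance computation of Eq. (2) and the majority vote described above can be sketched in plain NumPy; the toy training points below are illustrative, not drawn from the dataset:

```python
# Euclidean distance (Eq. 2) and a majority-vote KNN sketch.
import numpy as np

def euclidean(p, q):
    """d(p, q) = sqrt(sum over the n features of (p_i - q_i)^2)."""
    return np.sqrt(np.sum((np.asarray(p) - np.asarray(q)) ** 2))

def knn_predict(X_train, y_train, sample, k=3):
    """Label 'sample' by majority vote among its k nearest training points."""
    dists = [euclidean(x, sample) for x in X_train]
    nearest = np.argsort(dists)[:k]                 # indices of the k closest
    votes = np.bincount(np.asarray(y_train)[nearest])
    return votes.argmax()                           # majority class

X_train = [[0, 0], [0, 1], [5, 5], [6, 5]]
y_train = [0, 0, 1, 1]
print(knn_predict(X_train, y_train, [5, 6], k=3))   # nearest points are class 1
```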
D. SVM Algorithm
The SVM is the final ML algorithm used in this research. It is a supervised ML model,
because the classes in the database are predefined [31]. It operates by classifying the items
in the collection into specified categories, assigning one or more classes to each transaction
in order to improve accuracy [32]. SVM has previously been used on medical data to
predict the correct class for HD patients [33].
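A minimal SVM sketch on toy, linearly separable data follows; the scikit-learn `SVC` classifier and the toy points are illustrative assumptions, not the authors' configuration:

```python
# Supervised SVM sketch: fit a linear-kernel SVC on predefined classes,
# then assign new items to one of the learned categories.
from sklearn.svm import SVC

X = [[0, 0], [1, 1], [4, 5], [5, 4]]   # toy 2-feature samples
y = [0, 0, 1, 1]                        # predefined classes
clf = SVC(kernel="linear").fit(X, y)
preds = clf.predict([[0.5, 0.5], [4.5, 4.5]])
print(list(preds))                      # [0, 1]
```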
E. Data Overview
The HD dataset used in this study was obtained from the Kaggle platform [34].
The data came from four different databases in total, but just the Cleveland data was used in
this study. It is an open dataset with many properties, but for this experiment, just 14 were
chosen, as stated and recommended by several researchers who believe that the selected 14
attributes are the most effective in predicting heart disease in a patient [27]. A total of 303
patients' records are also included in the database file. Table 1 shows the full explanation of
each property as well as the number of possible values.

Table 1 - Dataset Overview and Attribute Description

1. Age - The person's age [Min: 29, Max: 77]. Values: 29 to 77.
2. Sex - The gender of the patient; "0" denotes female, "1" denotes male. Values: 0, 1.
3. CP - The level of the patient's chest pain (CP) on arrival at the hospital; each of the four
values describes a different degree of chest discomfort. Values: 0, 1, 2, 3.
4. RestBP - The patient's resting blood pressure (BP) in the hospital [Min: 94, Max: 200].
Values: 94 to 200.
5. Chol - The cholesterol level [Min: 126, Max: 564]. Values: 126 to 564.
6. FBS - The patient's fasting blood sugar level, encoded as binary: 1 if the patient has
more than 120 mg/dl of sugar, else 0. Values: 0, 1.
7. RestECG - The resting ECG result on a scale of 0 to 2. Values: 0, 1, 2.
8. HeartBeat - The maximum heart rate recorded at the time of admission [Min: 71,
Max: 202]. Values: 71 to 202.
9. Exang - Whether exercise induces angina: "1" if yes, "0" if not. Values: 0, 1.
10. OldPeak - The patient's ST depression status. Values: real numbers between 0 and 6.2.
11. Slope - The patient's ST segment slope during peak exercise. Values: 1, 2, 3.
12. CA - The fluoroscopy status: the number of colored major vessels. Values: 0, 1, 2, 3.
13. Thal - The result of the thallium test, which is required when a patient has chest
discomfort or trouble breathing. Values: 0, 1, 2, 3.
14. Target - The class (label) column. Values: 0, 1.

IV. PROPOSED METHODOLOGY


The proposed work predicts HD by exploring the three abovementioned algorithms and
analyzing their overall performance. The goal of this work is to efficiently classify five
cases of the heart: arrhythmia, myocardial infarction, ischemic heart disease, high blood
cholesterol, and the normal heart. The values from the patient's clinical report are entered
by the health professional and fed into a model that estimates the likelihood of developing
HD. Figure 1 illustrates the whole process.

Figure 1 - Flowchart illustrating the design of the proposed classifier


A-Dataset Relabeling and Preprocessing


This study's dataset was obtained from the UCI Machine Learning Repository [34]. It is an
open dataset with a number of attributes, but just fourteen were chosen for this experiment,
as explained and proposed by many researchers, who consider these 14 attributes the most
effective for predicting heart disease in a patient. The dataset was relabeled by an algorithm
written with the Pandas library in Python, which maps the data to five categories (0-4)
according to medical consultations and international research on the most important
symptoms and tests for diagnosing heart diseases. Table 2 illustrates the relabeling
algorithm.
Table 2 - The algorithm used to relabel the dataset

Algorithm: relabeling the dataset into five classes

Input: dataset of heart diseases

Output: dataset relabeled from 2 classes to 5 classes

if (ECG = 1 and chest pain type in {1, 3} and thalach (max heart rate) < 100
    and exang (exercise-induced angina) = 1 and ST slope = 1 and chol (cholesterol) > 200)
    => Myocardial Infarction = 1

else if (cp (chest pain type) = 2 and thalach (max heart rate) < 100
    and exang (exercise-induced angina) = 1 and chol (cholesterol) < 200)
    => Ischemic = 2

else if (chol (cholesterol) < 200)
    => Cholesterol = 3

else if (thalach (max heart rate) < 100)
    => Arrhythmia = 4

else => Healthy = 0
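The relabeling rules of Table 2 can be sketched with the Pandas library the paper mentions. The column names below (restecg, cp, thalach, exang, slope, chol) follow common UCI naming conventions and are assumptions, not the authors' exact code:

```python
# A hedged pandas sketch of the five-class relabeling rules in Table 2.
import pandas as pd

def relabel(row):
    # Rule order matters: the first matching condition wins.
    if (row["restecg"] == 1 and row["cp"] in (1, 3) and row["thalach"] < 100
            and row["exang"] == 1 and row["slope"] == 1 and row["chol"] > 200):
        return 1                      # myocardial infarction
    if (row["cp"] == 2 and row["thalach"] < 100 and row["exang"] == 1
            and row["chol"] < 200):
        return 2                      # ischemic
    if row["chol"] < 200:
        return 3                      # cholesterol class
    if row["thalach"] < 100:
        return 4                      # arrhythmia
    return 0                          # healthy

df = pd.DataFrame({"restecg": [1, 0], "cp": [1, 0], "thalach": [90, 150],
                   "exang": [1, 0], "slope": [1, 2], "chol": [250, 180]})
df["target"] = df.apply(relabel, axis=1)
print(df["target"].tolist())          # [1, 3]
```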

After the data relabeling process, the dataset is preprocessed, which includes:
1- Data cleaning: eliminating or modifying data that is incorrect, incomplete, irrelevant,
redundant, or poorly organized, in order to prepare it for analysis [35].
2- Data transformation: using discretization to convert continuous data into a set of data
intervals [35].
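Discretization can be sketched with pandas' `cut`; the cholesterol thresholds and labels below are illustrative assumptions, not the bins the authors actually used:

```python
# Discretization sketch: continuous cholesterol readings are mapped
# into a small set of intervals (bin edges are illustrative only).
import pandas as pd

chol = pd.Series([126, 230, 564])                    # continuous readings
bins = pd.cut(chol, bins=[0, 200, 240, 600],
              labels=["desirable", "borderline", "high"])
print(bins.tolist())                                 # one interval label each
```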
B-Data Splitting
After the data has been prepared through preprocessing, it is split into two parts: a training
set used to train the model and a testing (validation) set used to qualify its performance. In
this work, 50% of the data was used for training and 50% for testing.
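A 50:50 split of this kind can be sketched with scikit-learn's `train_test_split` (scikit-learn and the toy arrays are assumptions for illustration):

```python
# 50:50 train/test split sketch.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)        # 10 toy samples, 2 features
y = np.array([0, 1] * 5)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=42)
print(len(X_tr), len(X_te))             # 5 training and 5 testing samples
```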
C- Classification
This research used three supervised learning classification models: Naïve Bayes, KNN, and
SVM to classify five cases of heart disease. A 5-fold cross-validation operator was employed
to avoid identical values being selected throughout the model learning and testing phase. It
assists in dividing the data into k equal groups and allows each subset to participate in the
training and testing phases. The cross-validation operator is thought to be efficient since it
repeats the learning phase k times, with each testing data choice differing from the previous.
Finally, the experiment is repeated k times and the average findings are utilized. For learning

3972
Rahma and Salman Iraqi Journal of Science, 2022, Vol. 63, No. 9, pp: 3966-3976

and testing purposes, cross-validation is a widely used operator. It offers four sampling
modes: shuffled sampling, linear sampling, stratified sampling, and automatic [36]. In this
study, shuffled sampling is used.
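The 5-fold procedure with shuffled sampling described above can be sketched with scikit-learn; the classifier, synthetic data, and scikit-learn itself are illustrative assumptions:

```python
# 5-fold cross-validation with shuffled sampling: each of the k folds serves
# once as the test set, and the k accuracy scores are averaged.
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))
y = (X[:, 0] > 0).astype(int)

cv = KFold(n_splits=5, shuffle=True, random_state=1)   # shuffled sampling
scores = cross_val_score(GaussianNB(), X, y, cv=cv, scoring="accuracy")
print(scores.mean())                                   # average of the 5 folds
```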
V. COMPARISONS AND DISCUSSION OF MODEL PERFORMANCES
In terms of accuracy, PPV, NPV, recall, specificity, and F-measure, the heart disease
classification model based on the Naïve Bayes classifier is compared with the KNN and
SVM models, as shown in Table 4. The primary parameters evaluated are the True Positive
(TP), True Negative (TN), False Positive (FP), and False Negative (FN) values.
The assessment criteria are shown below:
A-Accuracy (Acc): is a statistical measure of a classifier's ability to properly identify or rule
out a condition. It may be computed using the equation below [21].
Acc = (TP + TN) / (TP + TN + FP + FN) × 100%    (3)
B- (Recall): The sensitivity indicates the percentage of true positives that are accurately
detected. The sensitivity may be determined using the equation below [21].
Recall = TP / (TP + FN)    (4)
C- (Specificity): As shown in the equation below, this may be calculated by dividing the true
negative by the total number of negatives [21].
Specificity = TN / (TN + FP)    (5)
D- Precision (PPV): This is the likelihood that a patient who gets a positive screening test
has the disease. The PPV may be calculated as stated in the equation [21].
PPV = TP / (TP + FP)    (6)
E- Negative predictive value (NPV): This reflects the likelihood of discovering a patient
who is not at risk for heart disease, and it is calculated using the equation below [21].
NPV = TN / (TN + FN)    (7)
Table 3 shows the confusion matrix generated by the suggested model for three methods.

Table 3-Confusion Matrix Values Obtained Using Various Algorithms


Algorithm TP FP TN FN
KNN 86 9 260 9
Naïve Bayes 94 7 355 7

SVM 71 30 356 30

Table 4-Classification Algorithms Results


Algorithm Recall PPV Specificity NPV Accuracy F-measure
KNN 0.905 90.52 0.966 96.65 95.1% 1.792
Naïve Bayes 0.93 93.06 0.980 98.06 96.9% 1.841
SVM 0.70 70.92 0.922 92.22 87.4% 1.386
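As a sanity check, Eqs. (3)-(7) can be applied directly to the KNN counts in Table 3; the results reproduce the KNN row of Table 4 up to rounding. A minimal sketch:

```python
# Recomputing the evaluation metrics (Eqs. 3-7) from the KNN row of Table 3.
TP, FP, TN, FN = 86, 9, 260, 9           # KNN confusion-matrix counts

accuracy    = (TP + TN) / (TP + TN + FP + FN) * 100   # Eq. (3)
recall      = TP / (TP + FN)                          # Eq. (4)
specificity = TN / (TN + FP)                          # Eq. (5)
ppv         = TP / (TP + FP) * 100                    # Eq. (6)
npv         = TN / (TN + FN) * 100                    # Eq. (7)

print(round(accuracy, 1))   # 95.1, matching the KNN accuracy in Table 4
```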

VI. Analysis of Results


The proposed classifiers SVM, KNN, and Naïve Bayes are compared based on the
following parameters: testing time, specificity, precision, recall, NPV, and accuracy. This
comparison aims to find the best-performing classifier among the proposed classifiers for
diagnosing heart disease. The performance comparison results of the classifiers when 65%
of the UCI data is used for training and 35% for testing are shown in Figure 2. The
specificity of Naïve Bayes (98.0%) is comparatively high compared with the other
classifiers. The lowest accuracy was recorded by the SVM classifier (87.4%), followed by
KNN (95.1%), while Naïve Bayes achieved the best performance with 96.9%. Naïve Bayes
is also faster than the other classifiers at producing the diagnosis results. Moreover, the
NPV of Naïve Bayes (98.06) is higher than those of the SVM (92.22) and KNN (96.65)
classifiers. As for recall and precision, KNN records 90.5 for recall and 90.52 for precision,
Naïve Bayes produced 93.0 and 93.06, respectively, and the SVM classifier performed
worst with 70.0 and 70.92, respectively. To summarize, Naïve Bayes was the most robust
classifier among the proposed classifiers. Figure 3 shows the confusion matrix for the
Naïve Bayes classifier.

Figure 2 - Performance comparison of the three algorithms

Figure 3 - Confusion matrix of the Naïve Bayes classifier


VII. CONCLUSION
With the increasing number of deaths due to heart diseases, it has become necessary to
develop a system to predict heart diseases effectively and accurately. The motivation for the
study was to find the most efficient machine learning algorithm for the detection of heart
diseases. This study compares the accuracy scores of KNN, Naive Bayes, and Support Vector
Machine algorithms for the classification of heart diseases using the UCI machine learning
repository dataset after preprocessing it. The result of this study indicates that the Naive
Bayes algorithm is the most efficient algorithm with an accuracy score of 96.9% for the
prediction of heart disease. The main limitation encountered in this work is the inability to
diagnose other types of heart disease, such as heart valve disease, pericarditis, Wolff-
Parkinson-White syndrome, and congenital heart disease, as their diagnosis depends on
features not available in the UCI database. The work can be enhanced in the
future by developing a web application based on the Naïve Bayes algorithm as well as by
using a real database that contains more features that help classify other heart diseases.

REFERENCES
[1] A. L. Bui, T. B. Horwich, and G. C. Fonarow, “Epidemiology and risk profile of heart failure,”
Nat. Rev. Cardiol., vol. 8, no. 1, pp. 30–41, 2011.
[2] P. A. Heidenreich et al., “Forecasting the future of cardiovascular disease in the United States: a
policy statement from the American Heart Association,” Circulation, vol. 123, no. 8, pp. 933–
944, 2011.
[3] M. Durairaj and N. Ramasamy, “A comparison of the perceptive approaches for preprocessing
the data set for predicting fertility success rate,” Int. J. Control Theory Appl, vol. 9, no. 27, pp.
255–260, 2016.
[4] J. Mourao-Miranda, A. L. W. Bokde, C. Born, H. Hampel, and M. Stetter, “Classifying brain
states and determining the discriminating activation patterns: support vector machine on
functional MRI data,” Neuroimage, vol. 28, no. 4, pp. 980–995, 2005.
[5] S. Ghwanmeh, A. Mohammad, and A. Al-Ibrahim, “Innovative artificial neural networks-based
decision support system for heart diseases diagnosis,” 2013.
[6] Q. K. Al-Shayea, “Artificial neural networks in medical diagnosis,” Int. J. Comput. Sci. Issues,
vol. 8, no. 2, pp. 150–154, 2011.
[7] J. López-Sendón, “The heart failure epidemic,” Medicographia, vol. 33, no. 4, pp. 363–369,
2011.
[8] K. Vanisree and J. Singaraju, “Decision support system for congenital heart disease diagnosis
based on signs and symptoms using neural networks,” Int. J. Comput. Appl., vol. 19, no. 6, pp. 6–
12, 2011.
[9] A. Methaila, P. Kansal, H. Arya, and P. Kumar, “Early heart disease prediction using data mining
techniques,” Comput. Sci. Inf. Technol. J., vol. 24, pp. 53–59, 2014.
[10] O. W. Samuel, G. M. Asogbon, A. K. Sangaiah, P. Fang, and G. Li, “An integrated decision
support system based on ANN and Fuzzy_AHP for heart failure risk prediction,” Expert Syst.
Appl., vol. 68, pp. 163–172, 2017.
[11] R. Detrano et al., “International application of a new probability algorithm for the diagnosis of
coronary artery disease,” Am. J. Cardiol., vol. 64, no. 5, pp. 304–310, 1989.
[12] U. Raghavendra et al., “Automated technique for coronary artery disease characterization and
classification using DD-DTDWT in ultrasound images,” Biomed. Signal Process. Control, vol.
40, no. October, pp. 324–334, 2018, doi: 10.1016/j.bspc.2017.09.030.
[13] F. S. Alotaibi, “Implementation of machine learning model to predict heart failure disease,” Int. J.
Adv. Comput. Sci. Appl., vol. 10, no. 6, pp. 261–268, 2019.
[14] C. B. C. Latha and S. C. Jeeva, “Improving the accuracy of prediction of heart disease risk
based on ensemble classification techniques,” Inform. Med. Unlocked, vol. 16, p. 100203, 2019.
[15] S. Mohan, C. Thirumalai, and G. Srivastava, “Effective heart disease prediction using hybrid
machine learning techniques,” IEEE Access, vol. 7, pp. 81542–81554, 2019.
[16] N. Louridi, M. Amar, and B. El Ouahidi, “Identification of cardiovascular diseases using
machine learning,” in Proc. 7th Mediterranean Congress of Telecommunications (CMT 2019),
Fez, Morocco, 24–25 October 2019, pp. 1–6.
[17] A. Gupta, R. Kumar, H. S. Arora, and B. Raman, “MIFH: A machine intelligence framework
for heart disease diagnosis,” IEEE Access, vol. 8, pp. 14659–14674, 2019.
[18] R. Perumal, “Early prediction of coronary heart disease from Cleveland dataset using machine
learning techniques,” Int. J. Adv. Sci. Technol., vol. 29, pp. 4225–4234, 2020.
[19] N. K. Kumar, G. Sindhu, D. Prashanthi, and A. Sulthana, “Analysis and prediction of
cardiovascular disease using machine learning classifiers,” in Proc. 2020 6th Int. Conf. on
Advanced Computing and Communication Systems (ICACCS), Coimbatore, India, 6–7 March
2020, pp. 15–21.
[20] C. Gazeloğlu, “Prediction of heart disease by classifying with feature selection and machine
learning methods,” Prog. Nutr., vol. 22, pp. 660–670, 2020.
[21] V. Sharma, S. Yadav, and M. Gupta, “Heart disease prediction using machine learning
techniques,” in 2020 2nd International Conference on Advances in Computing, Communication
Control and Networking (ICACCCN), 2020, pp. 177–181.
[22] V. Pavithra and V. Jayalakshmi, “Hybrid feature selection technique for prediction of
cardiovascular diseases,” Mater. Today Proc., vol. 22, pp. 660–670, 2021.
[23] M. Kavitha, G. Gnaneswar, R. Dinesh, Y. R. Sai, and R. S. Suraj, “Heart disease prediction
using hybrid machine learning model,” in Proc. 6th International Conference on Inventive
Computation Technologies (ICICT 2021), Coimbatore, India, 20–22 January 2021, pp.
1329–1333.
[24] B. S. Kumar, “A survey on data mining techniques for prediction of heart diseases,” IOSR J.
Eng., vol. 8, pp. 22–27, 2018.
[25] S. Bashir, Z. S. Khan, F. H. Khan, A. Anjum, and K. Bashir, “Improving heart disease prediction
using feature selection approaches,” in 2019 16th international bhurban conference on applied
sciences and technology (IBCAST), 2019, pp. 619–623.
[26] F. Razaque et al., “Using naïve bayes algorithm to students’ bachelor academic performances
analysis,” in 2017 4th IEEE International Conference on Engineering Technologies and Applied
Sciences (ICETAS), 2017, pp. 1–5.
[27] C. B. Rjeily, G. Badr, A. H. El Hassani, and E. Andres, “Medical data mining for heart diseases
and the future of sequential mining in medical field,” in Machine Learning Paradigms, Springer,
2019, pp. 71–99.
[28] L. Dey, S. Chakraborty, A. Biswas, B. Bose, and S. Tiwari, “Sentiment analysis of review
datasets using naive bayes and k-nn classifier,” arXiv Prepr. arXiv1610.09982, 2016.
[29] O. Qasim and K. Al-Saedi, “Malware Detection using Data Mining Naïve Bayesian Classification
Technique with Worm Dataset,” Int. J. Adv. Res. Comput. Commun. Eng, vol. 6, no. 11, pp.
211–213, 2017.
[30] I. Babaoğlu, M. S. Kıran, E. Ülker, and M. Gündüz, “Diagnosis of coronary artery disease using
artificial bee colony and k-nearest neighbor algorithms,” Int. J. Comput. Commun. Eng., vol. 2,
no. 1, pp. 56–59, 2013.
[31] S. K. Kotha, J. Pinjala, K. Kasoju, and M. Pothineni, “Gesture Recognition System,” Int. J. Res.
Eng. Technol., 2015.
[32] P. Tabesh, G. Lim, S. Khator, and C. Dacso, “A support vector machine approach for predicting
heart conditions,” in IIE Annual Conference. Proceedings, 2010, p. 1.
[33] P. Tabesh, G. Lim, S. Khator, and C. Dacso, "A support vector machine approach for predicting
heart conditions, in Proceedings of the 2010 Industrial Engineering Research Conference, 2010,
p. 5.
[34] UCI, "Heart Disease Data Set." [Online]. Available: https://www.kaggle.com/ronitf/heart-disease-
uci. [Accessed: 20-Apr2019].
[35] E. Acuna, “preprocessing in Data Mining,” Int. Encycl. Stat. Sci., no. September, 2011, DOI:
10.1007/978-3-642-04898-2.
[36] Y. Jung and J. Hu, “A K-fold averaging cross-validation procedure,” J. Nonparametr. Stat., vol.
27, no. 2, pp. 167–179, 2015, DOI: 10.1080/10485252.2015.1010532.



See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/331589020

Heart Disease Prediction System

Research Proposal · March 2019

Kennedy Ngure Ngare
St. Paul's University

All content following this page was uploaded by Kennedy Ngure Ngare on 03 May 2019.
The user has requested enhancement of the downloaded file.


HEART DISEASE PREDICTION SYSTEM

By

KENNEDY NGURE NGARE

BSC/LMR/5551/17

A project proposal submitted for the study leading to a project report in partial fulfilment of the requirements for the award of a Bachelor of Science in Computer Science at St. Paul’s University.

Supervisor: SAMUEL MUTHEE

DATE

January – April 2019


Declaration
This is to certify that the work presented in the project entitled “Heart Disease Prediction System”, submitted by the undersigned student of the Bachelor of Computer Science programme in fulfilment of the requirements for the award of a Bachelor of Science in Computer Science, is a record of my own work carried out under the guidance and supervision of Samuel Muthee of the Department of Computer Science, and that this work has not been submitted elsewhere for the award of any other degree.

Name of student: ________Kennedy Ngure Ngare_______________________

Registration Number: _____BSC/LMR/5551/17_________________________

Sign: __________________________________________

Approval

This project was done and presented by me before the panel concerned on the 2nd May 2012 at
St Paul’s University with my approval and that of my supervisor

Supervisor Name: ____________ Samuel Muthee ______________

Signature: ___________________________________

Date: ____________________

Acknowledgement
The satisfaction that accompanies the successful completion of any task would be incomplete without mention of the people whose ceaseless cooperation made it possible, and whose constant guidance and encouragement crown all efforts with success. I am very grateful to my project supervisor, Samuel Muthee, for the guidance, inspiration and constructive suggestions that helped me in the preparation of this project. I must also mention my course mates Jastine Luka, Melissa Wairimu, Justus Gaita and Valentine Mwaura for their wonderful and skilful guidance in giving me the support needed to ensure that my project is a success. I also thank my parents and family at large for their moral and financial support in funding the project to ensure its successful completion.

Dedication.

I dedicate this project to God Almighty, my creator, my strong pillar, my source of inspiration, wisdom, knowledge and understanding. He has been the source of my strength throughout this programme, and on His wings only have I soared. I also dedicate this work to my friend Mary Waitherero, who has encouraged me all the way and whose encouragement has made sure that I give it all it takes to finish that which I have started; to my father Antony Ngare; and to all my beloved friends (Joseph Kamau, Nicolous Wakaba and Sherlene Wanjiku), who have been affected in every way possible by this quest.

All the work done in coming up with this system is dedicated to my family for being with me, and part of me, in the whole process, especially my dear dad and mum, who stood by me in all situations, even at times of financial need.

Thank You. My Love for You All Can Never Be Quantified. God Bless You.

Abstract.

The health care industry collects huge amounts of data that contain hidden information useful for making effective decisions. To provide appropriate results and support effective decisions on such data, advanced data mining techniques are used. In this study, a Heart Disease Prediction System (HDPS) is developed using the Naïve Bayes and Decision Tree algorithms to predict the risk level of heart disease. The system uses 15 medical parameters, such as age, sex, blood pressure, cholesterol and obesity, for prediction. The HDPS predicts the likelihood of patients getting heart disease and enables significant knowledge, e.g. relationships between medical factors related to heart disease and their patterns, to be established. We have employed the multilayer perceptron neural network with backpropagation as the training algorithm. The results obtained illustrate that the designed diagnostic system can effectively predict the risk level of heart disease.

Keywords: Data Mining, Naïve Bayes, Decision Tree, Backpropagation, Disease Diagnosis

Table of Contents
Declaration ............................................................................................................................................... iv
Approval ................................................................................................................................................... iv
Acknowledgement .................................................................................................................................... v
Dedication. ............................................................................................................................................... vi
Abstract. .................................................................................................................................................. vii
Table of Contents ....................................................................................................................................... viii
Chapter One. ................................................................................................................................................. 1
1.1 Background. .................................................................................................................................. 1
1.2 Background of The Study .............................................................................................................. 1
1.3 Problem Statement. ...................................................................................................................... 2
1.4 Objectives...................................................................................................................................... 3
1.4.1 Main Objectives. ................................................................................................................... 3
1.4.2 Specific Objectives. ............................................................................................................... 3
1.5 Justification. .................................................................................................................................. 3
1.6 Scope and Limitation. ................................................................................................................... 4
1.6.1 Scope. .................................................................................................................................... 4
1.6.2 Limitations............................................................................................................................. 4
CHAPTER TWO: LITERATURE REVIEW ........................................................................................................... 5
1.7 Introduction .................................................................................................................................. 5
1.8 Literature Review .......................................................................................................................... 5
1.9 Proposed Architecture. ................................................................................................................. 7
1.9.1 Naïve Bayes Classifier............................................................................................................ 8
1.9.2 Project Flow Chart............................................................................................................... 10
1.10 Conclusion And Future Work. ..................................................................................................... 20
Chapter 3: Research Methodology ............................................................................................................. 13
1.11 Research Design. ......................................................................................................................... 13
1.12 System Development Methodology. .......................................................................................... 13
1.12.1 Data Collection and Preprocessing. .................................................................................... 14
1.13 Classifiers Used for Experiments................................................................................................. 15
1.13.1 Naïve Bayesian. ................................................................................................................... 16
1.13.2 Decision Trees. .................................................................................................................... 17

1.13.3 Ensemble DM approach. ..................................................................................................... 17
1.14 Tools ............................................................................................................................................ 17
1.14.1 Software requirements: ...................................................................................................... 17
1.14.2 Hardware Requirements ..................................................................................................... 18
1.15 Budget. ........................................................................................................................................ 18
1.16 Work Plan .................................................................................................................................... 19
References. ................................................................................................................................................. 21

Chapter One.
1.1 Background.
Among all fatal diseases, heart diseases are considered the most prevalent. Medical practitioners conduct different surveys on heart diseases and gather information about patients, their symptoms and their disease progression. Increasing numbers of patients with common diseases who have typical symptoms are being reported. In this fast-moving world, people want to live a very luxurious life, so they work like machines in order to earn a lot of money and live comfortably. In this race they forget to take care of themselves: their food habits and entire lifestyle change, they are more tense, they develop blood pressure and sugar problems at a very young age, they do not give themselves enough rest, they eat whatever they get without bothering about its quality, and when sick they resort to self-medication. All of these small acts of negligence add up to a major threat: heart disease.

The term ‘heart disease’ covers the diverse diseases that affect the heart. The number of people suffering from heart disease is on the rise (health topics, 2010). Reports from the World Health Organization show that a large number of people die every year from heart disease all over the world. Heart disease is also regarded as one of the greatest killers in Africa.

Data mining has been used in a variety of applications such as marketing, customer relationship management, engineering, medical analysis, expert prediction, web mining and mobile computing. Of late, data mining has also been applied successfully to detecting healthcare fraud and abuse.

1.2 Background of The Study


Data analysis proves to be crucial in the medical field. It provides a meaningful base for critical decisions and helps to create a complete study proposal. One of the most important uses of data analysis is that, with proper statistical treatment, it helps keep human bias away from medical conclusions. Data mining is used for exploratory analysis because of the nontrivial information hidden in large volumes of data.

The health care industry collects huge amounts of data that contain hidden information which is useful for making effective decisions. To provide appropriate results and support effective decisions on this data, data mining techniques are used to improve the conclusions that can be drawn.

The heart predictor system will use data mining knowledge to give a user-oriented approach to new and hidden patterns in the data. The knowledge thus extracted can be used by healthcare experts to improve the quality of service and to reduce the extent of adverse medicine effects.

1.3 Problem Statement.


Heart disease can be managed effectively with a combination of lifestyle changes, medicine and, in some cases, surgery. With the right treatment, the symptoms of heart disease can be reduced and the functioning of the heart improved. The predicted results can be used for prevention and thus reduce the cost of surgical treatment and other expensive interventions.

The overall objective of my work is to predict the presence of heart disease accurately with few tests and attributes. The attributes considered form the primary basis for the tests and give more or less accurate results. Many more input attributes could be taken, but our goal is to predict the risk of heart disease with few attributes and faster efficiency. Decisions are often made based on doctors’ intuition and experience rather than on the knowledge-rich data hidden in data sets and databases. This practice leads to unwanted biases, errors and excessive medical costs, which affect the quality of service provided to patients.

Data mining holds great potential for the healthcare industry, enabling health systems to systematically use data and analytics to identify inefficiencies and best practices that improve care and reduce costs. According to (Wurz & Takala, 2006), the opportunities to improve care and reduce costs concurrently could apply to as much as 30% of overall healthcare spending. The successful application of data mining in highly visible fields like e-business, marketing and retail has led to its application in other industries and sectors; healthcare is among the sectors just beginning to discover it. The healthcare environment is still ‘information rich’ but ‘knowledge poor’. There is a wealth of data available within healthcare systems; however, there is a lack of effective analysis tools to discover the hidden relationships and trends in that data for African regions.

1.4 Objectives.

1.4.1 Main Objectives.


The main objective of this research is to develop a heart disease prediction system that can discover and extract hidden knowledge associated with the disease from a historical heart data set. The system aims to exploit data mining techniques on a medical data set to assist in the prediction of heart disease.

1.4.2 Specific Objectives.


• Provide a new approach to uncovering concealed patterns in the data.
• Help avoid human bias.
• Implement a Naïve Bayes classifier that classifies the disease according to the user’s input.
• Reduce the cost of medical tests.
1.5 Justification.
Clinical decisions are often made based on a doctor’s insight and experience rather than on the knowledge-rich data hidden in the data set. This practice leads to unwanted biases, errors and excessive medical costs, which affect the quality of service provided to patients. The proposed system will integrate clinical decision support with computer-based patient records (data sets). This will reduce medical errors, enhance patient safety, decrease unwanted practice variation, and improve patient outcomes. This suggestion is promising, as data modeling and analysis tools such as data mining have the potential to generate a knowledge-rich environment which can help to significantly improve the quality of clinical decisions.

There are voluminous records in the medical data domain, and because of this it has become necessary to use data mining techniques for decision support and prediction in the field of healthcare. Medical data mining therefore contributes to business intelligence, which is useful for the diagnosis of disease.

1.6 Scope and Limitation.

1.6.1 Scope.
The scope of the project is the integration of clinical decision support with computer-based patient records, which could reduce medical errors, enhance patient safety, decrease unwanted practice variation, and improve patient outcomes. This suggestion is promising, as data modeling and analysis tools such as data mining have the potential to generate a knowledge-rich environment which can help to significantly improve the quality of clinical decisions.

1.6.2 Limitations.
Medical diagnosis is a significant yet intricate task that needs to be carried out precisely and efficiently, and its automation would be highly beneficial. Clinical decisions are often made based on a doctor’s intuition and experience rather than on the knowledge-rich data hidden in the database. This practice leads to unwanted biases, errors and excessive medical costs, which affect the quality of service provided to patients. Data mining has the potential to generate a knowledge-rich environment which can help to significantly improve the quality of clinical decisions.

CHAPTER TWO: LITERATURE REVIEW
1.7 Introduction
Data mining is the process of finding previously unknown patterns and trends in databases and using that information to build predictive models. It combines statistical analysis, machine learning and database technology to extract hidden patterns and relationships from large databases. The World Health Statistics 2012 report highlights the fact that one in three adults worldwide has raised blood pressure, a condition that causes around half of all deaths from stroke and heart disease. Heart disease, also known as cardiovascular disease (CVD), encompasses a number of conditions that affect the heart, not just heart attacks. Heart disease has been the major cause of casualties in many countries, including India; in the United States it kills one person every 34 seconds. Coronary heart disease, cardiomyopathy and cardiovascular disease are some categories of heart disease. The term “cardiovascular disease” includes a wide range of conditions that affect the heart and the blood vessels and the manner in which blood is pumped and circulated through the body. Diagnosis is a complicated and important task that needs to be executed accurately and efficiently, yet it is often made based on a doctor’s experience and knowledge. This leads to unwanted results and excessive medical costs for the treatments provided to patients. An automatic medical diagnosis system would therefore be exceedingly beneficial.
1.8 Literature Review
Numerous studies have focused on the diagnosis of heart disease, applying different data mining techniques and achieving different accuracies with different methods.
(Polaraju, Durga Prasad, & Tech Scholar, 2017) proposed prediction of heart disease using a multiple regression model and showed that multiple linear regression is appropriate for predicting the chance of heart disease. The work was performed using a training data set consisting of 3000 instances with 13 different attributes, as mentioned earlier. The data set is divided into two parts: 70% of the data are used for training and 30% for testing.
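The 70/30 partitioning described above is easy to reproduce. This is an illustrative sketch only, using the Python standard library and made-up placeholder records rather than the study’s actual data set:

```python
import random

def train_test_split(rows, train_percent=70, seed=42):
    """Shuffle the rows and split them into training and testing sets."""
    rng = random.Random(seed)   # fixed seed so the split is repeatable
    shuffled = rows[:]          # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    cut = len(shuffled) * train_percent // 100  # integer arithmetic, exact
    return shuffled[:cut], shuffled[cut:]

# 3000 placeholder instances standing in for the study's data set
data = [{"id": i, "label": i % 2} for i in range(3000)]
train, test = train_test_split(data)
print(len(train), len(test))  # 2100 900
```

Shuffling before cutting avoids any ordering bias in the source file; the fixed seed keeps experiments comparable between runs.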
(Deepika & Seema, 2017) focus on techniques that can predict chronic disease by mining the data contained in historical health records using Naïve Bayes, Decision Tree, Support Vector Machine (SVM) and Artificial Neural Network (ANN). A comparative study is performed on the classifiers to measure their performance in terms of accuracy. In this experiment, SVM gives the highest accuracy rate for heart disease, whereas for diabetes Naïve Bayes gives the highest accuracy.

(Beyene & Kamat, 2018) compared different algorithms, namely Naïve Bayes, Classification Tree, KNN, Logistic Regression, SVM and ANN, with Logistic Regression giving better accuracy than the other algorithms. The same authors also suggested a heart disease prediction system using data mining techniques, in which WEKA software is used for automatic diagnosis of disease and to provide quality of service in healthcare centres. The paper used various algorithms such as SVM, Naïve Bayes, association rules, KNN, ANN and Decision Tree, and recommended SVM as effective and more accurate compared with the other data mining algorithms.
Chala Beyene recommended prediction and analysis of the occurrence of heart disease using data mining techniques. The main objective is to predict the occurrence of heart disease for early automatic diagnosis of the disease within a short time. The proposed methodology is also valuable in healthcare organisations whose experts lack specialist knowledge and skill. It uses different medical attributes, such as blood sugar, heart rate, age and sex, to identify whether a person has heart disease or not. Analyses of the data set are computed using the WEKA software.
(Soni, Ansari, & Sharma, 2011) proposed the use of a non-linear classification algorithm for heart disease prediction, together with big data tools such as the Hadoop Distributed File System (HDFS) and MapReduce along with SVM, for prediction of heart disease with an optimised attribute set. This work investigated the use of different data mining techniques for predicting heart diseases. It suggests using HDFS to store large data across different nodes and executing the SVM prediction algorithm on more than one node simultaneously. Used in this parallel fashion, SVM yielded better computation time than sequential SVM.
(Science & Faculty, 2009) suggested heart disease prediction using data mining and machine learning algorithms, with the goal of extracting hidden patterns by applying data mining techniques. Based on UCI data, J48 achieved the highest accuracy rate compared to LMT. (Purushottam, Saxena, & Sharma, 2016) proposed an efficient heart disease prediction system using data mining. This system helps medical practitioners make effective decisions based on certain parameters; with training and testing phases over those parameters, it provides 86.3% accuracy in the testing phase and 87.3% in the training phase.
(Kirmani, 2017) suggested multi-disease prediction using data mining techniques. Nowadays, data mining plays a vital role in predicting multiple diseases, and by using data mining techniques the number of tests can be reduced. This paper mainly concentrates on predicting heart disease, diabetes and breast cancer.
(Sai & Reddy, 2017) proposed heart disease prediction using the ANN algorithm in data mining. Due to the increasing expense of heart disease diagnosis, there was a need to develop a new system which can predict heart disease. The prediction model is used to predict the condition of the patient after evaluation on the basis of various parameters such as heart beat rate, blood pressure and cholesterol. The accuracy of the system is demonstrated in Java.
(A & Naik, 2016) recommended developing a prediction system that diagnoses heart disease from a patient’s medical data set. Thirteen risk-factor input attributes were considered to build the system, using a total of 300 records from the Cleveland Heart Database. After analysing the data, data cleaning and data integration were performed, and k-means and Naïve Bayes were used to predict heart disease. To extract knowledge from the database, data mining techniques such as clustering and classification methods can be used. The model predicts whether a patient has heart disease or not based on the values of the 13 attributes.
(Sultana, Haider, & Uddin, 2017) proposed an analysis of cardiovascular disease using data mining techniques to predict the disease. It is intended to provide a survey of current techniques for extracting information from data sets that will be useful for healthcare practitioners. Performance is measured by the time taken to build the decision tree for the system. The primary objective is to predict the disease with a smaller number of attributes.
1.9 Proposed Architecture.
In this system we are implementing an effective heart attack prediction system using the Naïve Bayes algorithm. Input can be given as a CSV file or by manual entry to the system. After the input is taken, the Naïve Bayes algorithm is applied to it; once the data set has been processed, an effective heart attack risk level is produced.
The proposed system will add further parameters significant to heart attack, with their weights and priority levels set by consulting expert doctors and medical specialists. The heart attack prediction system is designed to help identify different risk levels of heart attack, such as normal, low or high, and also to give prescription details related to the predicted result.

1.9.1 Naïve Bayes Classifier.
The Naïve Bayes classifier is based on Bayes’ theorem. This classifier assumes conditional independence, in which each attribute value is independent of the values of the other attributes. The theorem is as follows:
Let X = {x1, x2, ..., xn} be a set of n attributes. In Bayesian terms, X is considered as evidence and H as some hypothesis, namely that the data sample X belongs to a specific class C. We have to determine P(H|X), the probability that the hypothesis H holds given the evidence, i.e. the data sample X. According to Bayes’ theorem, P(H|X) is expressed as:
P(H|X) = P(X|H) P(H) / P(X)
Algorithm Used    Accuracy    Time Taken
Naïve Bayes       52.33%      609 m
Decision Tree     52%         719 m
K-NN              45.67%      1000 m
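Bayes’ theorem as stated above can be checked with a small worked example; the probabilities below are invented purely for illustration:

```python
# Hypothetical numbers: H = "patient has heart disease", X = "chest pain observed"
p_h = 0.10          # prior P(H)
p_x_given_h = 0.70  # likelihood P(X|H)
p_x = 0.25          # evidence P(X)

# Bayes' theorem: P(H|X) = P(X|H) * P(H) / P(X)
p_h_given_x = p_x_given_h * p_h / p_x
print(round(p_h_given_x, 2))  # 0.28
```

Observing the symptom raises the probability of disease from the 10% prior to 28%, which is exactly the kind of update the classifier performs for every attribute.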

Using Bayesian classifiers, the system will discover the concealed knowledge associated with the disease from historical records of patients with heart disease. Bayesian classifiers predict class membership probabilities, i.e. the probability that a given sample belongs to a particular class. The naïve Bayes classifier is a simple probabilistic classifier based on Bayes’ theorem, which can be used to determine the probability that a proposed diagnosis is correct given the observations. According to the naïve Bayesian classifier, the presence or absence of a particular feature of a class is considered independent of the presence or absence of any other feature. The naïve Bayes technique is applicable when the dimension of the input is high and a more efficient result is expected. The Naïve Bayes model identifies the physical characteristics and features of patients suffering from heart disease and, for each input, gives the probability of the attribute for the expected state. An advantage of Naïve Bayes is that one can work with the model without using any other Bayesian methods (Brownlee, 2016). For a set of symptoms, the classification rule is:
P(Disease | symptom1, ..., symptomn) = P(Disease) P(symptom1, ..., symptomn | Disease) / P(symptom1, ..., symptomn)
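The classification rule above can be sketched as a tiny classifier. This is an illustration with made-up symptom records, not the proposed system itself; Laplace smoothing (an addition of ours, not mentioned in the text) is used so unseen attribute values never get zero probability:

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(records):
    """records: list of (features_dict, label) pairs. Returns the counts
    needed to apply Bayes' theorem under the independence assumption."""
    class_counts = Counter(label for _, label in records)
    # feature_counts[label][feature_name][feature_value] -> observed count
    feature_counts = defaultdict(lambda: defaultdict(Counter))
    for features, label in records:
        for name, value in features.items():
            feature_counts[label][name][value] += 1
    return class_counts, feature_counts

def predict(class_counts, feature_counts, features):
    """Return the class with the highest posterior score."""
    total = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label, count in class_counts.items():
        # Work in log space to avoid underflow; start from the prior P(label).
        score = math.log(count / total)
        for name, value in features.items():
            seen = feature_counts[label][name]
            # Laplace smoothing: +1 so unseen values never get probability 0.
            num = seen[value] + 1
            den = sum(seen.values()) + len(seen) + 1
            score += math.log(num / den)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Made-up training records: ({symptoms}, label) where label 1 = heart disease
data = [
    ({"chest_pain": "yes", "high_bp": "yes"}, 1),
    ({"chest_pain": "yes", "high_bp": "no"}, 1),
    ({"chest_pain": "no", "high_bp": "yes"}, 0),
    ({"chest_pain": "no", "high_bp": "no"}, 0),
]
model = train_naive_bayes(data)
print(predict(*model, {"chest_pain": "yes", "high_bp": "yes"}))  # 1
```

Each feature contributes one multiplicative factor to the posterior, which is exactly the conditional-independence assumption described above.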

1.9.1.1 Flowchart of the Decision Tree Algorithm.

The classification tree literally creates a tree with branches, nodes and leaves that lets us take an unknown data point and move down the tree, applying the attributes of the data point until a leaf is reached and the unknown output of the data point can be determined. In order to create a good classification tree model, we need an existing data set with known outputs from which to build the model. We also divide the data set into two parts: a training set, used to create the model, and a test set, used to verify that the model is accurate and not overfitted.
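The train-and-verify procedure described above can be illustrated in miniature. For brevity this sketch builds a one-level tree (a decision stump) rather than a full classification tree, on invented records; it picks the attribute whose branches best separate the labels on the training set, then checks accuracy on a held-out test set:

```python
from collections import Counter, defaultdict

def fit_stump(rows, attributes):
    """Pick the attribute whose values best separate the labels (highest
    training accuracy when each branch predicts its majority label)."""
    best = None
    for attr in attributes:
        # Count labels under each value of this attribute
        by_value = defaultdict(Counter)
        for row in rows:
            by_value[row[attr]][row["label"]] += 1
        leaves = {v: c.most_common(1)[0][0] for v, c in by_value.items()}
        correct = sum(c[leaves[v]] for v, c in by_value.items())
        if best is None or correct > best[2]:
            best = (attr, leaves, correct)
    attr, leaves, _ = best
    default = Counter(r["label"] for r in rows).most_common(1)[0][0]
    return attr, leaves, default

def predict(stump, row):
    attr, leaves, default = stump
    return leaves.get(row[attr], default)  # unseen value -> overall majority

# Made-up records; 'label' is 1 when heart disease is present
train = [
    {"chest_pain": "yes", "smoker": "no", "label": 1},
    {"chest_pain": "yes", "smoker": "yes", "label": 1},
    {"chest_pain": "no", "smoker": "yes", "label": 0},
    {"chest_pain": "no", "smoker": "no", "label": 0},
]
test = [{"chest_pain": "yes", "smoker": "no", "label": 1},
        {"chest_pain": "no", "smoker": "yes", "label": 0}]

stump = fit_stump(train, ["chest_pain", "smoker"])
accuracy = sum(predict(stump, r) == r["label"] for r in test) / len(test)
print(stump[0], accuracy)  # chest_pain 1.0
```

A full classification tree simply repeats this splitting step recursively inside each branch; the held-out test set plays the same role either way.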

1.9.2 Project Flow Chart.

This is the proposed flow chart of how the system will work.

Data Flow Diagram:

Dataset Preprocessing → Pattern Matching → Prediction → Rule Generation → Accuracy Calculation → Result
1.9.3 Proposed Model

UCI Repository → Heart Attack Dataset → Preprocessing → Decentralization → Classification Model → Feature Selection → Evaluation and Comparison of Results → Conclusion and Suggestion

Chapter 3: Research Methodology
1.10 Research Design.
I will be using an experimental research design, which is a quantitative research method. Essentially, it is research conducted with a scientific approach, where one set of variables is kept constant while another set of variables is measured as the subject of the experiment. This is practical for heart disease prediction, as the system monitors the behaviours and patterns of a subject to determine whether the subject’s details match previously recorded data. It is an effective research method, as it is time-bound and focuses on the relationships between the variables that give the actual results.

1.11 System Development Methodology.


The methodology of software development is the method used to manage project development. Many methodology models are available, such as the Waterfall model, Incremental model, RAD model, Agile model, Iterative model, and Spiral model. However, the developer still needs to decide which one will be used in the project. A methodology model helps to manage the project efficiently and keeps the developer from running into problems during development. It also helps to achieve the objectives and scope of the project. In order to build the project, the stakeholder requirements need to be understood.

The methodology provides a framework for undertaking the proposed DM modeling. It is a system comprising steps that transform raw data into recognized data patterns to extract knowledge for users.

There are four phases involved in the spiral model:

1) Planning phase

The phase where the requirements are collected and risk is assessed. In this phase, the title of the project was discussed with the project supervisor, and from that discussion the Heart Prediction System was proposed. The requirements and risks were assessed after studying existing systems and reviewing the literature on related research.

2) Risk analysis phase

The phase where risks and alternative solutions are identified. A prototype is created at the end of this phase. If any risk is found during this phase, alternative solutions are suggested.

3) Engineering phase

In this phase, the software is built, and testing is done at the end of the phase.

4) Evaluation phase

In this phase, the user evaluates the software. This is done after the system is presented, and the user tests whether the system meets their expectations and requirements. If there is any error, the user can report the problem with the system.

1.11.1 Data Collection and Preprocessing.


The data set for this research was taken from the UCI data repository. Data accessed from the UCI Machine Learning Repository is freely available. In particular, the Cleveland and Hungarian databases have been used by many researchers and found to be suitable for developing a mining model because of their fewer missing values and outliers. The data is cleaned and preprocessed before it is submitted to the proposed algorithm for training and testing.

The UCI Machine Learning Repository is a collection of databases, domain theories, and data generators that are used by the machine learning community for the empirical analysis of machine learning algorithms.

The overall objective of our work is to predict the presence of heart disease more accurately. In this paper, the UCI repository dataset is used to get more accurate results. Two data mining classification techniques were applied, namely decision trees and Naive Bayes.

This database contains 76 attributes, but all published experiments refer to using a subset of 14 of them. In particular, the Cleveland database is the only one that has been used by ML researchers to this date. The "goal" field refers to the presence of heart disease in the patient. It is integer valued from 0 (no presence) to 4. Experiments with the Cleveland database have concentrated on simply attempting to distinguish presence (values 1, 2, 3, 4) from absence (value 0).

Attributes with categorical values were converted to numerical values, since most machine learning algorithms require numeric inputs. Additionally, dummy variables were created for variables with more than two categories. Dummy variables help neural networks learn the data more accurately.
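The dummy-variable encoding and the presence/absence binarization of the "goal" field can be sketched as follows. This is a minimal Python illustration (the attribute name "cp" and its category set are assumptions about the dataset, not taken from this document):

```python
def dummy_encode(records, column, categories):
    """Replace a multi-category column with one 0/1 dummy column per category."""
    encoded = []
    for rec in records:
        rec = dict(rec)                 # copy so the input records are untouched
        value = rec.pop(column)
        for cat in categories:
            rec[f"{column}_{cat}"] = 1 if value == cat else 0
        encoded.append(rec)
    return encoded

# Hypothetical rows with a multi-category chest-pain attribute "cp".
rows = [{"age": 63, "cp": 3}, {"age": 41, "cp": 1}]
print(dummy_encode(rows, "cp", [0, 1, 2, 3]))

# The 0-4 "goal" field is commonly binarized: 0 = absence, 1-4 = presence.
goal = [0, 2, 1, 0, 4]
binary = [1 if g > 0 else 0 for g in goal]
print(binary)   # [0, 1, 1, 0, 1]
```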

1.12 Classifiers Used for Experiments.

1.12.1 Naïve Bayesian.
It is a probabilistic classifier based on Bayes' theorem, specified by the prior probabilities of its root nodes. The Bayes theorem is given in Equation 1 and the normalization constant is given in Equation 2. It proves to be an optimal algorithm in terms of minimization of the generalized error. It can handle statistics-based machine learning for feature vectors and assigns a label to a feature vector based on the maximally probable class among the available classes {X1, X2, ..., XM}. It means that a feature vector y belongs to class Xi when its posterior probability is maximal. Equivalently, the Bayesian classification problem may be formulated by a-posteriori probabilities that assign the class label ωi to sample X such that P(ωi | X) is maximal.

Application of Bayes' rule with mutual exclusivity of diseases and conditional independence of findings is known as the Naïve Bayesian approach. It is a probabilistic classifier based on Bayes' theorem with strong independence assumptions between the features. Despite its simplicity, the Naïve Bayesian classifier performs surprisingly well and often outperforms more complex classifiers. A simple Naïve Bayesian classifier can be implemented by plugging in the following main Bayes formula:

P(X1, X2, ..., Xn | Y) = P(X1 | Y) P(X2 | Y) ... P(Xn | Y) (3)

The above-mentioned Naïve Bayesian network produces a mathematical model, which is used for modeling the complicated relations between the random variables of disease attributes and the decision outcome. The algorithm uses the formula to calculate the conditional probability of each disease condition attribute value with respect to the decision attribute value. Based on prior knowledge, the algorithm classifies the decision attribute into the assigned labels, and the conditional support is computed for each attribute variable.
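A categorical Naïve Bayes classifier built on formula (3) can be sketched in pure Python. The toy symptom data below is invented for illustration, and Laplace smoothing is an addition of this sketch, not something the text above specifies:

```python
from collections import Counter, defaultdict

class NaiveBayes:
    """Categorical Naive Bayes: picks the class maximizing P(Y) * prod_i P(Xi | Y)."""

    def fit(self, X, y):
        self.classes = sorted(set(y))
        self.priors = {c: y.count(c) / len(y) for c in self.classes}
        self.class_counts = Counter(y)
        self.values = [set(col) for col in zip(*X)]   # distinct values per feature
        self.cond = defaultdict(int)                  # (feature, value, class) -> count
        for row, label in zip(X, y):
            for i, v in enumerate(row):
                self.cond[(i, v, label)] += 1
        return self

    def predict(self, row):
        best, best_p = None, -1.0
        for c in self.classes:
            p = self.priors[c]
            for i, v in enumerate(row):
                # Laplace (+1) smoothing avoids zero probability for unseen values
                p *= (self.cond[(i, v, c)] + 1) / (self.class_counts[c] + len(self.values[i]))
            if p > best_p:
                best, best_p = c, p
        return best

# Toy symptom vectors: (chest_pain, high_cholesterol) -> label
X = [(1, 1), (1, 0), (0, 0), (0, 1), (1, 1), (0, 0)]
y = ["sick", "sick", "healthy", "healthy", "sick", "healthy"]
model = NaiveBayes().fit(X, y)
print(model.predict((1, 1)))   # sick
print(model.predict((0, 0)))   # healthy
```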

1.12.2 Decision Trees.
The decision tree approach is powerful for classification problems. There are two steps in this technique: building a tree and applying the tree to the dataset. There are many popular decision tree algorithms: CART, ID3, C4.5, CHAID, and J48. Of these, the J48 algorithm is used for this system. The J48 algorithm uses a pruning method to build a tree. Pruning is a technique that reduces the size of the tree by removing branches that overfit the data, which would otherwise lead to poor accuracy in predictions. The J48 algorithm recursively classifies data until it has been categorized as perfectly as possible. This technique gives maximum accuracy on training data. The overall concept is to build a tree that provides a balance of flexibility and accuracy.
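The split criterion used when such trees are grown can be illustrated with Gini impurity, one common choice (J48 itself uses information gain ratio; Gini is shown here only as a simple stand-in). The data is hypothetical:

```python
def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions (0 = pure node)."""
    if not labels:
        return 0.0
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def split_quality(rows, labels, feature, threshold):
    """Weighted Gini after splitting on rows[feature] <= threshold (lower is better)."""
    left = [l for r, l in zip(rows, labels) if r[feature] <= threshold]
    right = [l for r, l in zip(rows, labels) if r[feature] > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# Toy data: one feature (age); label 1 = disease present
rows = [[39], [45], [58], [63]]
labels = [0, 0, 1, 1]
print(gini(labels))                        # 0.5 (perfectly mixed node)
print(split_quality(rows, labels, 0, 50))  # 0.0 (a perfect split at age 50)
```

A tree builder would evaluate split_quality for every candidate feature and threshold and keep the best one at each node.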

1.12.3 Ensemble DM approach.


In order to obtain more reliable and accurate prediction results, the ensemble method is a well-proven approach, practiced in research for attaining highly accurate classification of data by hybridizing different classifiers. Improved prediction performance is a well-known built-in feature of the ensemble methodology. This study proposes a weighted vote-based classifier ensemble technique that overcomes the limitations of conventional DM techniques by employing an ensemble of two heterogeneous classifiers: Naive Bayesian and classification via decision tree.
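The weighted vote itself is simple to state in code. A minimal sketch, assuming the weights come from each classifier's validation accuracy (the document does not specify how they are derived):

```python
def weighted_vote(predictions, weights):
    """Combine class votes from heterogeneous classifiers by summing their weights."""
    scores = {}
    for label, w in zip(predictions, weights):
        scores[label] = scores.get(label, 0.0) + w
    return max(scores, key=scores.get)

# Hypothetical case: Naive Bayes says "disease", the decision tree says "healthy";
# the weights 0.84 and 0.79 stand in for each model's measured accuracy.
print(weighted_vote(["disease", "healthy"], [0.84, 0.79]))                # disease
print(weighted_vote(["healthy", "healthy", "disease"], [0.5, 0.4, 0.8]))  # healthy
```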

1.13 Tools
For application development, the following Software Requirements are:

Operating System: Windows 7 or any Linux Debian Distro.

Language: R and Shiny

Tools: RStudio IDE, Microsoft Excel (Optional).

Technologies used: R, Unix, Shiny.

1.13.1 Software requirements:

Operating System: any OS with clients to access the internet

Network: Wi-Fi internet or cellular network

Microsoft Visio: create and design the data flow and context diagrams

GitHub: version control

Google Chrome: medium for finding references, system testing, and displaying and running the Shiny app

1.13.2 Hardware Requirements
For application development, the following hardware requirements apply:

Processor: Intel or higher

RAM: 1024 MB

Space on disk: minimum 100 MB

For running the application:

Device: any device that can access the internet

Minimum space to execute: 20 MB

The effectiveness of the proposal is evaluated by conducting experiments with a cluster of 3 nodes with identical settings, each configured with an Intel Core i7-4770 processor (3.40 GHz, 4 cores, 8 GB RAM, running Ubuntu 18.04 LTS with a 64-bit Linux 4.31.0 kernel).

1.14 Budget.
The budget for developing the heart disease prediction system covers various software and hardware items. The application is moderately expensive to build, but if it proves as successful as the developer expects, it will bring in enough profit to cover the costs incurred.
The table below shows the planned budget, in Kenyan Shillings, to develop the system:

HARDWARE            SOFTWARE                   PRICE (KSh)
Computer            Operating System: Windows  25,000
Hard disk           Cloud storage              8,500
Internet connection Safaricom                  3,000
Total                                          36,500

1.15 Work Plan

1.16 Conclusion And Future Work.

The proposed system is GUI-based, user-friendly, scalable, reliable, and expandable. The proposed working model can also help reduce treatment costs by providing initial diagnostics in time. The model can also serve as a training tool for medical students and will be a soft diagnostic tool available to physicians and cardiologists. General physicians can utilize this tool for the initial diagnosis of cardio-patients. There are many possible improvements that could be explored to improve the scalability and accuracy of this prediction system. As we have developed a generalized system, in the future we can use it for the analysis of different data sets. The performance of the health diagnosis can be improved significantly by handling numerous class labels in the prediction process, and this can be another positive direction of research. In a DM warehouse, the dimensionality of the heart database is generally high, so the identification and selection of significant attributes for better diagnosis of heart disease remain very challenging tasks for future research.

References.
A, A. S., & Naik, C. (2016). Different Data Mining Approaches for Predicting Heart Disease, 277–
281. https://doi.org/10.15680/IJIRSET.2016.0505545

Beyene, C., & Kamat, P. (2018). Survey on prediction and analysis the occurrence of heart disease
using data mining techniques. International Journal of Pure and Applied Mathematics,
118(Special Issue 8), 165–173. Retrieved from
https://www.scopus.com/inward/record.uri?eid=2-s2.0-
85041895038&partnerID=40&md5=2f0b0c5191a82bc0c3f0daf67d73bc81

Brownlee, J. (2016). Naive Bayes for Machine Learning. Retrieved March 4, 2019, from
https://machinelearningmastery.com/naive-bayes-for-machine-learning/

Kirmani, M. (2017). Cardiovascular Disease Prediction using Data Mining Techniques. Oriental
Journal of Computer Science and Technology, 10(2), 520–528.
https://doi.org/10.13005/ojcst/10.02.38

Polaraju, K., Durga Prasad, D., & Tech Scholar, M. (2017). Prediction of Heart Disease using
Multiple Linear Regression Model. International Journal of Engineering Development and
Research, 5(4), 2321–9939. Retrieved from www.ijedr.org

Purushottam, Saxena, K., & Sharma, R. (2016). Efficient Heart Disease Prediction System. In
Procedia Computer Science (Vol. 85, pp. 962–969).
https://doi.org/10.1016/j.procs.2016.05.288

Sai, P. P., & Reddy, C. (2017). International Journal of Computer Science and Mobile Computing
HEART DISEASE PREDICTION USING ANN ALGORITHM IN DATA MINING.
International Journal of Computer Science & Mobile Computing, 6(4), 168–172. Retrieved
from www.ijcsmc.com

Science, C., & Faculty, G. M. (2009). Heart Disease Prediction Using Machine learning and Data
Mining Technique. Ijcsc 0973-7391, 7, 1–9.

Soni, J., Ansari, U., & Sharma, D. (2011). Intelligent and Effective Heart Disease Prediction
System using Weighted Associative Classifiers. Heart Disease, 3(6), 2385–2392.

Sultana, M., Haider, A., & Uddin, M. S. (2017). Analysis of data mining techniques for heart disease prediction. In 2016 3rd International Conference on Electrical Engineering and Information and Communication Technology, iCEEiCT 2016 (pp. 1–5).
https://doi.org/10.1109/CEEICT.2016.7873142
Artificial, I. T. O. (1995). Chapter 2. Neuron, 36–62. https://doi.org/10.1109/ETD.1995.403491
Bozzo, R., Conca, A., & Marangon, F. (2014). Decision support system for city logistics: Literature
review, and guidelines for an ex-ante model. Transportation Research Procedia, 3(July), 518–527.
https://doi.org/10.1016/j.trpro.2014.10.033
Çela, E. K., & Frasheri, N. (2012a). A literature review of data mining techniques used in
healthcare databases. ICT Innovations 2012 Web Proceedings, 577–582.
Çela, E. K., & Frasheri, N. (2012b). A literature review of data mining techniques used in
healthcare databases. ICT Innovations 2012 Web Proceedings, 577–582.
David, H. B. F., & Belcy, S. A. (2018). Heart Disease Prediction Using Data Mining Techniques,
6956(October), 1817–1823. https://doi.org/10.21917/ijsc.2018.0253
Desai, S. D., Giraddi, S., Narayankar, P., Pudakalakatti, N. R., & Sulegaon, S. (2019). Back-
propagation neural network versus logistic regression in heart disease classification. Advances in
Intelligent Systems and Computing, 702, 133–144. https://doi.org/10.1007/978-981-13-0680-8_13
Kiruthika Devi, S., Krishnapriya, S., & Kalita, D. (2016). Prediction of heart disease using data
mining techniques. Indian Journal of Science and Technology, 9(39), 1291–1293.
https://doi.org/10.17485/ijst/2016/v9i39/102078
Lavanya, M., & Gomathi, M. P. M. (2016). Prediction of Heart Disease using Classification
Algorithms. International Journal of Advanced Research in Computer Engineering & Technology,
5(7), 2278–1323.
Meenakshi, K., Maragatham, G., Agarwal, N., & Ghosh, I. (2018). A Data mining Technique for
Analyzing and Predicting the success of Movie. Journal of Physics: Conference Series, 1000(1).


Jawalkar et al., Journal of Engineering and Applied Science (2023) 70:122
https://doi.org/10.1186/s44147-023-00280-y

RESEARCH (Open Access)

Early prediction of heart disease with data analysis using supervised learning with stochastic gradient boosting

Anil Pandurang Jawalkar1*, Pandla Swetcha1, Nuka Manasvi1, Pakki Sreekala1, Samudrala Aishwarya1, Potru Kanaka Durga Bhavani1 and Pendem Anjani1

*Correspondence: anil.jawalkar022@gmail.com
1 Department of Information Technology, Malla Reddy Engineering College for Women (UGC-Autonomous), Maisammaguda, Hyderabad, India

Abstract
Heart diseases are consistently ranked among the top causes of mortality on a global scale. Early detection and accurate heart disease prediction can help effectively manage and prevent the disease. However, the traditional methods have failed to improve heart disease classification performance. So, this article proposes a machine learning approach for heart disease prediction (HDP) using a decision tree-based random forest (DTRF) classifier with loss optimization. Initially, preprocessing of the dataset with patient records with known labels is performed for the presence or absence of heart disease records. Then, train a DTRF classifier on the dataset using the stochastic gradient boosting (SGB) loss optimization technique and evaluate the classifier's performance using a separate test dataset. The results demonstrate that the proposed HDP-DTRF approach resulted in 86% precision, 86% recall, 85% F1-score, and 96% accuracy on publicly available real-world datasets, which are higher than traditional methods.

Keywords: Heart disease, Machine learning, Decision tree, Random forest, Stochastic gradient boosting, Loss optimization

Introduction
One person dies due to cardiovascular disease every 36 s in every country. Coronary heart disease is the leading cause of mortality in the USA, accounting for one out of every four fatalities that occur each year. This disease claims the lives of about 0.66 million people annually [1]. The expenditures associated with cardiovascular disease are significant for the healthcare system in the USA. In the years 2021 and 2022, it resulted in annual costs of around $219 billion owing to the increased demand for medical treatment and medication and the loss of productivity caused by deaths. Table 1 provides the statistics of the heart disease dataset with total heart disease cases, deaths, case fatality rate, and total vaccinations. A prompt diagnosis also aids in preventing heart failure,


Table 1 Statistics of heart disease dataset


Country Total cases Total deaths Case fatality rate Total vaccinations

USA 44,752,659 720,581 1.61% 401,670,644


India 34,157,813 453,996 1.33% 1,031,906,566
Brazil 21,534,894 600,185 2.78% 239,756,958
Russia 8,073,318 222,853 2.76% 99,150,000
Turkey 7,052,488 64,049 0.91% 110,838,084
UK 7,005,365 137,322 1.96% 113,391,940
France 6,939,471 116,512 1.68% 100,355,009
Iran 6,261,269 126,711 2.02% 21,543,821
Argentina 5,301,830 115,662 2.18% 52,038,168
Colombia 4,985,923 126,245 2.53% 43,999,110
Spain 4,988,029 87,928 1.76% 77,561,325

which is another potential cause of mortality in certain cases [2]. Since many traits put a person at risk for acquiring the ailment, it is difficult to diagnose heart disease in its earlier stages while it is still in its infancy. Diabetes, hypertension, elevated cholesterol levels, an irregular pulse rhythm, and a wide variety of other diseases are some risk factors that might contribute to this [3]. These ailments are grouped and discussed under "heart disease," an umbrella word. The symptoms of cardiac disease can differ considerably from one individual to the next and from one condition to another within the same patient [4]. The process of identifying and classifying cardiovascular diseases is a continuous one that has a chance of being fruitful when carried out by a qualified professional with appropriate knowledge and skill in the relevant sector. There are a lot of different aspects, such as age, diabetes, smoking, being overweight, and eating a diet high in junk food. There have been several variables and criteria discovered that have been shown to either cause heart disease or raise the risk of developing heart disease [5].
Most hospitals use management software to monitor the clinical and patient data they collect. It is well known these days that such systems generate a vast quantity of information on patients. These data are used for decision-making help in clinical settings rather seldom. These data are precious, yet a significant portion of their knowledge is left unused [6]. Because of the sheer volume of data involved in the process, the translation of clinical data that has been acquired into information that intelligent systems can use to assist healthcare practitioners in making decisions is a process fraught with difficulties [7]. Intelligent systems put this knowledge to use to enhance the quality of treatment provided to patients. As a result of this issue, research on the processing of medical images was carried out. Because there were not enough specialists and too many cases were misdiagnosed, an automated detection method that was both quick and effective was necessary [8].
The primary objective of the research is centered around the effective utilization of a classifier model, which aims to categorize and identify vital components within complex medical data. This categorization process is a critical step towards enabling early diagnosis of cardiovascular diseases, potentially contributing to improved patient outcomes and healthcare management [9]. However, the pursuit of disease prediction at an early stage is not without its challenges. One significant factor pertains to the inherent

complexity of the predictive methods employed in the classification process [10]. The intricate nature of these methods can lead to difficulties in interpreting the underlying decision-making processes, which might impede the integration of these models into clinical practice. Furthermore, the efficiency of disease prediction models is impacted by the time they take to execute. Swift diagnosis and intervention are crucial in medical conditions, and time-intensive models might not align with the urgency required for timely medical decisions. Researchers [11] have investigated various alternative strategies to forecast cardiovascular diseases. Perfect treatment and diagnosis have the potential to save the lives of an infinite number of individuals. The novel contribution of this work is as follows:

• Preprocessing of the HDP dataset with normalization, exploratory data analysis (EDA), data visualization, and extraction of top correlated features.
• Implementation of a DTRF classifier for training on the preprocessed dataset, which can accurately predict the presence or absence of heart disease.
• SGB loss optimization is used to reduce the losses generated during the training process, which tunes the hyperparameters of the DTRF.

The rest of the article is organized as follows: Sect. 2 gives a detailed literature survey analysis. Section 3 gives a detailed analysis of the proposed HDP-DTRF with multiple modules. Section 4 gives a detailed simulation analysis of the proposed HDP-DTRF. Section 5 concludes the article.

Literature survey
Rani et al. [12] designed a novel hybrid decision support system to diagnose cardiac ailments early. They effectively addressed the missing data challenge by employing multivariate imputations through chained equations. Additionally, their unique approach to feature selection involved a fusion of genetic algorithms (GA) and recursive feature reduction. Notably, the integration of random forest classifiers played a pivotal role in significantly enhancing the accuracy of their system. However, despite these advancements, their hybrid approach's complexity might have posed challenges in terms of interpretability and practical implementation. Kavitha et al. [13] embraced machine learning techniques to forecast cardiac diseases. They introduced a hybrid model by incorporating random forest as the base classifier. This hybridization aimed to enhance prediction accuracy; however, their decision to capture and store user input parameters for future use was intriguing but yielded suboptimal classification performance. This unique approach could be viewed as an innovative attempt to integrate patient-specific information, yet the exact impact on overall performance warrants further investigation.

Mohan et al. [14] further advanced the field by employing a hybrid model that combined random forest with a linear model to predict cardiovascular diseases. Through this amalgamation of different classification approaches and feature combinations, they achieved commendable performance with an accuracy of 88.7%. However, it is worth noting that while hybrid models show promise, the trade-offs between complexity and interpretability could influence their practical utility in real-world clinical settings. To predict heart diseases, Shah et al. [15] adopted supervised learning

techniques, including Naive Bayes, decision trees, K-nearest neighbor (KNN), and random forest algorithms. Their choice of utilizing the Cleveland database from the UCI repository as their data source added a sense of universality to their findings. However, the lack of customization in data sources might limit the applicability of their model to diverse patient populations with varying characteristics. Guo et al. [16] contributed to the field by harnessing an improved learning machine (ILM) model in conjunction with machine learning techniques. Integrating novel feature combinations and categorization methods showcased their dedication to enhancing performance and accuracy. Nonetheless, while their approach exhibits promising results, the precise impact of specific feature combinations on prediction accuracy could have been further explored. Hager Ahmed et al. [17] presented an innovative real-time prediction system for cardiac diseases using Apache Spark and Apache Kafka. This system, characterized by its three-tier architecture (offline model building, online prediction, and stream processing pipeline), highlighted its commitment to harnessing cutting-edge technologies for practical medical applications. However, the scalability and resource requirements of such real-time systems, especially in healthcare settings with limited computational resources, could be an area of concern. Kataria et al. [18] comprehensively analyzed and compared various machine learning algorithms for predicting heart disease. Their focus on analyzing the algorithms' ability to predict heart disease effectively sheds light on their dedication to identifying the most suitable model. However, their study's outcome might have been further enriched by addressing the unique challenges posed by individual attributes, such as high blood pressure and diabetes, in a more customized manner. Kannan et al. [19] meticulously evaluated machine learning algorithms to predict and diagnose cardiac sickness. By selecting 14 criteria from the UCI Cardiac Datasets, they showcased their dedication to designing a comprehensive study. Nevertheless, a deeper analysis of how these algorithms perform with specific criteria and their contributions to accurate predictions could provide more actionable insights.
Ali et al. [20] conducted a detailed analysis of supervised machine-learning algorithms for predicting cardiac disease. Their thorough evaluation of decision trees, k-nearest neighbors, and logistic regression classifiers (LRC) provided a well-rounded perspective on the strengths and limitations of each method. However, a more fine-grained analysis of how these algorithms perform under various parameter configurations and feature combinations might offer additional insights into their potential use cases. Mienye et al. [21] introduced an enhanced technique for ensemble learning, utilizing decision trees, random forests, and support vector machine classifiers. The voting system they employed to aggregate results showcased their innovative approach to combining various methods. However, the potential trade-offs between ensemble complexity and the robustness of predictions could be considered for future refinement. Dutta et al. [22] revolutionized the field by introducing convolutional neural networks (CNNs) for predicting coronary heart disease. Their approach, leveraging the power of CNNs on a large dataset of ECG signals, showcased the potential for deep learning techniques in healthcare. However, the requirement for extensive computational resources and potential challenges in model interpretability could be areas warranting further attention. Latha et al. [23] demonstrated ensemble classification

approaches. Combined with a bagging technique, their utilization of decision trees, naive Bayes, and random forest exemplified their determination to achieve robust results. Nevertheless, the potential interplay between different ensemble techniques and their effectiveness under various scenarios could be explored further.

Ishaq et al. [24] introduced the concept of using the synthetic minority oversampling technique (SMOTE) in conjunction with efficient data mining methods to improve survival prediction for heart failure patients. Their emphasis on addressing class imbalance through SMOTE showcased their awareness of real-world challenges in healthcare datasets. However, the potential impact of the SMOTE method on individual patient subgroups and its implications for model fairness could be areas of future exploration. Asadi et al. [25] proposed a unique cardiac disease detection technique based on random forest swarm optimization. Their use of a large dataset for evaluation underscored their dedication to robust testing. However, the potential influence of dataset characteristics and the algorithm's sensitivity to various parameters on prediction performance could be investigated further.

Proposed methodology
Heart disease is a significant health problem worldwide and is responsible for many deaths every year. Traditional methods for diagnosing heart disease are often time-consuming, expensive, and inaccurate. Therefore, there is a need for more accurate and efficient methods for predicting and diagnosing heart disease. The article aims to provide a detailed analysis of the proposed HDP-DTRF approach and its performance in accurately predicting the presence or absence of heart disease. The results demonstrate the effectiveness of the proposed approach, which can lead to improved diagnosis and treatment of heart disease, ultimately leading to better health outcomes for patients.
Figure 1 shows the proposed HDP-DTRF block diagram. The initial step in the proposed approach is the preprocessing of a dataset consisting of patient records with known labels indicating the presence or absence of heart disease. The dataset is then used to train a DTRF classifier with the SGB loss optimization technique. The performance of the trained classifier is evaluated using a separate publicly available real-world test dataset, and the results show that the proposed HDP-DTRF approach can accurately predict the presence or absence of heart disease. Using decision trees in the random forest classifier enables the algorithm to handle nonlinear data and make accurate predictions even with missing or noisy data. Applying the SGB loss optimization technique further enhances the algorithm's performance by improving the convergence rate and avoiding overfitting. The proposed approach can be useful in clinical decision-making processes, enabling medical professionals to predict the likelihood of heart disease in patients and take appropriate preventive measures.
The detailed operation of the proposed HDP-DTRF system is illustrated as follows:

Step 1: Data preprocessing: Gather a dataset containing patient records, where each
record includes features such as age, blood pressure, and cholesterol levels, along
with labels indicating whether the patient has heart disease. Remove duplicate
records, handle missing values (e.g., imputing missing data or removing instances
with missing values), and eliminate irrelevant or redundant features. Encode
categorical variables (like gender) into numerical values using techniques like
one-hot encoding. Scale numerical features to bring them to a common scale, which
can prevent features with larger ranges from dominating the model.

Fig. 1 Block diagram for the proposed HDP-DTRF system

Jawalkar et al. Journal of Engineering and Applied Science (2023) 70:122 Page 6 of 18
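The preprocessing in Step 1 can be sketched as follows. This is an illustrative example rather than the authors' code; the records, field names, and helper functions are invented for demonstration.

```python
# Hedged sketch of Step 1: one-hot encoding of a categorical field and
# min-max scaling of numeric fields. Data values are made up.
records = [
    {"age": 63, "sex": "male", "chol": 233},
    {"age": 41, "sex": "female", "chol": 204},
    {"age": 57, "sex": "male", "chol": 354},
]

def one_hot(value, categories):
    """Encode a categorical value as a 0/1 vector over the known categories."""
    return [1 if value == c else 0 for c in categories]

def min_max_scale(values):
    """Scale a list of numbers to the [0, 1] range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

ages = min_max_scale([r["age"] for r in records])
chols = min_max_scale([r["chol"] for r in records])
features = [
    one_hot(r["sex"], ["female", "male"]) + [a, c]
    for r, a, c in zip(records, ages, chols)
]
print(features[1])  # the female record: [1, 0, 0.0, 0.0]
```

In a real pipeline these transforms are fitted on the training data only, so that scaling parameters do not leak information from the test set.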
Step 2: Training the DTRF classifier: Initialize an empty random forest ensemble. For
each tree in the ensemble, randomly sample the training data with replacement. It
creates a bootstrapped dataset for training each tree, ensuring diversity in the data
subsets. Construct a decision tree using the bootstrapped dataset. At each node of
the tree, split the data based on the feature that provides the best separation, deter-
mined using metrics like Gini impurity or information gain. Add the constructed
decision tree to the random forest ensemble. Repeat the process to create the ensem-
ble’s desired number of decision trees.
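A minimal sketch of the bootstrap-and-vote idea in Step 2, using a toy one-feature dataset and a threshold "stump" in place of a full decision tree; the data and helper names are illustrative, not from the paper.

```python
import random
from collections import Counter

random.seed(0)
data = [(x, int(x > 5)) for x in range(10)]  # toy (feature, label) pairs

def bootstrap(sample):
    """Sample with replacement: one tree's training set."""
    return [random.choice(sample) for _ in sample]

def train_stump(sample):
    """Toy stand-in for a tree: a single threshold minimizing errors."""
    best_t, best_err = 0, len(sample) + 1
    for t in range(11):
        err = sum((x > t) != bool(y) for x, y in sample)
        if err < best_err:
            best_t, best_err = t, err
    return best_t

forest = [train_stump(bootstrap(data)) for _ in range(7)]

def predict(forest, x):
    """Majority vote over the ensemble's individual predictions."""
    votes = Counter(int(x > t) for t in forest)
    return votes.most_common(1)[0][0]

print(predict(forest, 8))  # 8 > 5, so the ensemble votes 1
```

Because each stump sees a different bootstrapped sample, the ensemble's members differ slightly, which is exactly the diversity the voting step exploits.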
Step 3: SGB optimization: Initialize the model by setting the initial prediction to the
mean of the target labels. Calculate the negative gradient of the loss function (such as
mean squared error or log loss) concerning the current model’s predictions. This gra-
dient represents the direction in which the model’s predictions need to be adjusted
to minimize the loss. Train a new decision tree using the negative gradient as the
target. This new tree will help correct the errors made by the previous model iterations.
Update the model’s predictions by adding the predictions of the new tree, scaled by a
learning rate. This step moves the model closer to the correct predictions. Repeat the
process for a predefined number of iterations. Each iteration focuses on improving
the model’s predictions based on the errors made in the previous iterations.
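The loop in Step 3 can be illustrated with a small numeric sketch: start from the mean prediction, repeatedly fit a weak learner to the residuals (the negative gradient of the squared-error loss), and add its scaled output. The data and the tiny one-split learner below are illustrative assumptions, not the paper's implementation.

```python
# Gradient-boosting sketch: each round fits the residuals and moves the
# model a learning-rate step toward them. Toy data: y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

pred = [sum(ys) / len(ys)] * len(ys)  # initial model: mean of the labels
lr = 0.5

def fit_residual(xs, residuals):
    """Toy weak learner: mean residual on each half of the x range."""
    mid = sum(xs) / len(xs)
    left = [r for x, r in zip(xs, residuals) if x <= mid]
    right = [r for x, r in zip(xs, residuals) if x > mid]
    lmean, rmean = sum(left) / len(left), sum(right) / len(right)
    return lambda x: lmean if x <= mid else rmean

for _ in range(50):
    residuals = [y - p for y, p in zip(ys, pred)]  # negative gradient of MSE
    tree = fit_residual(xs, residuals)
    pred = [p + lr * tree(x) for p, x in zip(pred, xs)]

print([round(p, 2) for p in pred])  # converges to [4.0, 4.0, 8.0, 8.0]
```

With only one split, the boosted model can do no better than the mean of each half of the data, which is what the predictions converge to; deeper trees reduce this bias.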
Step 4: Performance evaluation: Use a separate real-world test dataset that was not
used during training to evaluate the performance of the trained HDP-DTRF classifier.

DTRF classifier
The DTRF classifier, an ensemble learning model, centers around the decision tree as its
core component. As illustrated in Fig. 2, the DTRF block diagram depicts a framework
comprising multiple decision trees trained with the bagging technique. During the
classification process, when a sample requiring classification is input, the final
classification outcome is determined by a majority vote over the outputs of the
individual decision trees [26]. In classifying high-dimensional data, the DTRF model outperforms
standalone decision trees by effectively addressing overfitting, displaying robust resistance
to noise and outliers, and demonstrating exceptional scalability and parallel processing
capabilities. Notably, the strength of DTRF stems from its inherent parameter-free nature,
embodying a data-driven approach. The model requires no prior knowledge of classifica-
tion from the user and is adept at training classification rules based on observed instances.

Fig. 2 Block diagram of DTRF



This data-centric attribute enhances the model’s adaptability to various data scenarios. The
DTRF model’s essence lies in utilizing K decision trees. Each of these trees contributes a
single “vote” towards the category it deems most fitting, thereby participating in determin-
ing the class to which the independent variable X, under consideration, should be allocated.
This approach effectively harnesses the collective wisdom of multiple trees, facilitating
accurate and robust classification outcomes that capitalize on the diverse insights provided
by each decision tree. The mathematical analysis of DTRF is as follows:

{h(X, θ_k), k = 1, 2, ..., K}   (1)

Here, K represents the number of decision trees in the DTRF, and the θ_k are
independent, identically distributed random vectors.
Here, K individual decision trees are generated. Each tree provides its prediction for the
category that best fits the independent variable X . The predictions made by the K decision
trees are combined through a voting mechanism to determine the final category assign-
ment for the independent variable X . It is important to note that the given Eq. (1) indicates
the ensemble nature of the DTRF model, where multiple decision trees work collectively
to enhance predictive accuracy and robustness. The collection of θk represents the varied
parameter sets for each decision tree within the ensemble.
The following procedures must be followed to produce a DTRF:

Step 1: The K classification-and-regression trees are generated by drawing K bootstrap
sample sets from the original training set using random sampling with replacement;
this procedure is repeated until all K sample sets have been extracted.
Step 2: At each node of a tree, m features are randomly selected from the n features of
the original training set (m ≪ n). Only one of the m features is used in the node-splitting
procedure, namely the one with the greatest classification ability; DTRF measures the
information carried by each feature to determine this.
Step 3: Each tree is grown to its full depth without pruning.
Step 4: The generated trees together form the DTRF, which is used to classify newly
received data; the classification outcome is determined by the majority vote of the
tree classifiers.

Several indicators of generalization performance are inherent to DTRFs, including the
correlation between individual decision trees, the generalization error, and the overall
generalization ability of the system. A system's decision-making efficacy is determined
by how well it generalizes to fresh data drawn from the same distribution as the training
set [27]. Reducing the generalization error therefore improves both performance and
generalizability. The generalization error is defined as:

PE* = P_{X,Y}(mr(X, Y) < 0)   (2)

Here, PE* denotes the generalization error, the subscripts X and Y indicate the space
over which the probability is defined, and mr(X, Y) is the margin function, defined as
follows:

mr(X, Y) = avg_k(I(h(X, θ_k) = Y)) − max_{J ≠ Y} avg_k(I(h(X, θ_k) = J))   (3)

Here, X stands for the input sample, Y indicates the correct classification, and J indi-
cates an incorrect one. Specifically, h(·) represents a single classification model, I(·)
is an indicator function, and avg_k(·) denotes averaging over the K trees. The
margin function determines how many more votes the correct classification for sam-
ple X receives than all possible incorrect classifications. As the value of the margin
function grows, so does the classifier's confidence in its accuracy. The convergence
form of the generalization error is as follows [28]:

lim_{K→∞} PE* = P_{X,Y}( P_θ(h(X, θ) = Y) − max_{J ≠ Y} P_θ(h(X, θ) = J) < 0 )   (4)

As the number of decision trees grows, the generalization error converges to an upper
limit, as predicted by the preceding equation, and the model does not overfit. This
upper bound is estimated from the classification strength of each tree and the corre-
lation between trees; the DTRF model therefore aims to produce a forest with a small
correlation coefficient and strong classification power.
The classification strength (intensity) S is the mathematical expectation of mr(X, Y)
over the sample space:

S = E_{X,Y}[mr(X, Y)]   (5)

Here, θ and θ′ are independent and identically distributed random vectors. The corre-
lation coefficient ρ between mr(θ, X, Y) and mr(θ′, X, Y) is:

ρ = cov_{X,Y}(mr(θ, X, Y), mr(θ′, X, Y)) / (sd(θ) · sd(θ′))   (6)

Among them, sd(θ) can be expressed as follows:

sd(θ) = sqrt( (1/N) Σ_{i=1}^{N} [ mr(x_i, θ) − (1/N) Σ_{i=1}^{N} mr(x_i, θ) ]^2 )   (7)

Equation (7) is used together with Eq. (6) to quantify the degree to which the trees
h(X, θ) and h(X, θ′) are correlated on the dataset (X, Y); the larger the correlation
coefficient ρ, the stronger the dependence between trees. The upper limit of the gen-
eralization error is obtained using the following formula, which is based on the
Chebyshev inequality:

P_{X,Y}(mr(X, Y) < 0) ≤ ρ(1 − S^2) / S^2   (8)

The generalization error limit of a DTRF is positively correlated with the correlation ρ
between individual decision trees and inversely related to the classification strength S
of a single tree. That is, the greater the strength S and the lower the correlation ρ, the
lower the bound. If the DTRF is to improve its classification accuracy, this threshold
on the generalization error must be lowered.
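The quantities in Eqs. (2)-(8) can be made concrete with a small numeric sketch. The vote matrix and the assumed mean correlation ρ below are invented for illustration; they are not values from the paper.

```python
# votes[i][k] = class predicted by tree k for sample i (made-up values)
votes = [
    [1, 1, 0, 1, 1],
    [0, 0, 0, 1, 0],
    [1, 0, 1, 1, 1],
    [0, 1, 1, 0, 0],
]
labels = [1, 0, 1, 0]
K = 5

def margin(tree_votes, y):
    """mr(X, Y): vote share of the true class minus the best wrong class."""
    p_true = sum(v == y for v in tree_votes) / K
    p_wrong = max(sum(v == j for v in tree_votes) / K for j in {0, 1} - {y})
    return p_true - p_wrong

margins = [margin(v, y) for v, y in zip(votes, labels)]
S = sum(margins) / len(margins)                  # strength, Eq. (5)
pe = sum(m < 0 for m in margins) / len(margins)  # empirical PE*, Eq. (2)
rho = 0.3                                        # assumed mean correlation
bound = rho * (1 - S**2) / S**2                  # Chebyshev bound, Eq. (8)
print(round(S, 2), pe, round(bound, 2))
```

All margins here are positive, so the empirical generalization error is zero, while the bound of Eq. (8) shrinks as the strength S grows or the correlation ρ falls.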

SGB loss optimization


The SGB optimization approach has recently seen increased use in various deep-
learning applications, which demand more capable learning than conventional means
provide. In its basic form, SGB uses a single, fixed learning rate α throughout training.
To improve performance in scenarios with sparse gradients (for example, computer
vision tasks), the algorithm can also maintain per-parameter learning rates that are
updated from the average of recent gradient magnitudes for each weight (that is, how
rapidly the weight is changing), which makes the strategy effective for online and
non-stationary (noisy) problems. The partial derivatives are computed with the chain
rule; calculating the gradient of the loss with respect to the weights and biases tells
us how the loss varies as a function of those parameters. Let us assume that we have
a training dataset
with N samples, denoted as {x_i, y_i} for i = 1, 2, …, N, where x_i is the input and y_i is
the true label or target value. A decision tree with parameters θ predicts the output ŷ_i
for input x_i; the output can be any function of the parameters and the input, repre-
sented as ŷ_i = f(x_i, θ). The goal is to minimize the difference between the predicted
output ŷ_i and the true label y_i. This is typically done by defining a loss function
L(ŷ_i, y_i) that quantifies the difference between the predicted and true values. The total
loss over the entire dataset is then defined as the sum of the individual losses over all
samples:
 
L_total = Σ_i L(f(x_i, θ), y_i)   (9)

The optimization algorithm estimates the values of the parameters θ that minimize
this total loss. It is typically done using gradient descent, which updates the parameters
θ in the direction opposite to the gradient of the total loss with respect to the
parameters:

θ_new = θ_old − α ∇_θ L_total   (10)

Here, α is the learning rate, which controls the size of the parameter update, and
∇_θ L_total is the gradient of the total loss with respect to the parameters θ. SGB can
sometimes oscillate and take a long time to converge due to noisy gradients.
Momentum is a technique that helps SGB converge faster by adding a fraction of the
previous update to the current update:

v_t = β v_{t−1} + (1 − β) ∇_θ L_minibatch   (11)

θ_t = θ_{t−1} − α v_t   (12)

Here, v_t is the momentum term at iteration t, β is the momentum coefficient (typically
set to 0.9 or 0.99), and the other terms are as previously defined.
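A minimal sketch of the update rules in Eqs. (10)-(12), comparing plain gradient descent with the momentum variant on a one-parameter quadratic loss; the loss function and hyperparameter values are illustrative assumptions.

```python
# Minimize L(theta) = (theta - 3)^2 with and without momentum.
def grad(theta):
    return 2.0 * (theta - 3.0)

alpha, beta = 0.1, 0.9
theta_plain = theta_mom = v = 0.0

for _ in range(300):
    theta_plain -= alpha * grad(theta_plain)     # Eq. (10)
    v = beta * v + (1 - beta) * grad(theta_mom)  # Eq. (11)
    theta_mom -= alpha * v                       # Eq. (12)

print(round(theta_plain, 4), round(theta_mom, 4))  # both approach 3.0
```

The momentum term v accumulates an exponentially weighted average of past gradients, which damps oscillations when successive noisy gradients point in conflicting directions.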

Results and discussion


This section gives a detailed performance analysis of the proposed HDP-DTRF. The
performance of the proposed method is measured using multiple performance metrics,
which are computed for both the proposed and existing methods. All methods use the
same publicly available real-world dataset for performance estimation.

Dataset
The Cleveland Heart Disease dataset contains data on 303 patients who were evalu-
ated for heart disease. The dataset is downloaded from the open-access UCI-ML
repository. Each patient is represented by 14 attributes, which include
demographic and clinical information such as age, sex, chest pain type, resting blood
pressure, serum cholesterol level, and exercise test results. The dataset has 303
records, each corresponding to a unique patient. The data in each record includes
values for all 14 attributes, and the diagnosis of heart disease (present or absent) is
also included in the dataset. Table 2 provides a detailed description of the dataset.
Researchers and data scientists can use this dataset to develop predictive models for
heart disease diagnosis or explore relationships between the different variables in the
dataset. With 303 records, this dataset is relatively small compared to other medi-
cal datasets. However, it is still widely used in heart disease research due to its rich
attributes and long history of use in research studies.

Table 2 Description of dataset

Column                               | Description                                                                        | Min value | Max value
Age                                  | Age of the patient                                                                 | 29        | 77
Sex                                  | Sex of the patient (1 = male, 0 = female)                                          | 0         | 1
Chest pain type                      | 1 = typical angina, 2 = atypical angina, 3 = non-anginal pain, 4 = asymptomatic    | 1         | 4
Resting blood pressure               | Resting blood pressure (mm Hg)                                                     | 94        | 200
Serum cholesterol                    | Serum cholesterol (mg/dl)                                                          | 126       | 564
Fasting blood sugar                  | Whether fasting blood sugar > 120 mg/dl (1 = true, 0 = false)                      | 0         | 1
Resting electrocardiographic results | 0 = normal, 1 = ST-T wave abnormality, 2 = left ventricular hypertrophy            | 0         | 2
Maximum heart rate achieved          | Maximum heart rate achieved during exercise (beats per minute)                     | 71        | 202
Exercise-induced angina              | Whether exercise induced angina (1 = yes, 0 = no)                                  | 0         | 1
Oldpeak                              | ST depression induced by exercise relative to rest                                 | 0         | 6.2
Slope                                | Slope of the peak exercise ST segment (1 = upsloping, 2 = flat, 3 = downsloping)   | 1         | 3
Number of major vessels              | Number of major vessels (0–3) colored by fluoroscopy                               | 0         | 3
Thal                                 | Thallium stress test result (3 = normal, 6 = fixed defect, 7 = reversible defect)  | 3         | 7
Target                               | Whether the patient has heart disease (0 = no, 1 = yes)                            | 0         | 1
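Records with the 14 attributes in Table 2 can be read with a few lines of standard Python. The short column names below are commonly used for this dataset, and the two CSV rows are illustrative values in that column order, not actual dataset entries.

```python
import csv
import io

COLUMNS = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
           "thalach", "exang", "oldpeak", "slope", "ca", "thal", "target"]

# Two made-up rows standing in for the downloaded CSV file.
raw = io.StringIO(
    "63,1,1,145,233,1,2,150,0,2.3,3,0,6,0\n"
    "67,1,4,160,286,0,2,108,1,1.5,2,3,3,1\n"
)
rows = [dict(zip(COLUMNS, map(float, line))) for line in csv.reader(raw)]
with_disease = [r for r in rows if r["target"] == 1]
print(len(rows), len(with_disease))  # 2 records, 1 with heart disease
```

For the real file, the `io.StringIO` object would be replaced by an open file handle, and rows with missing markers would be filtered or imputed per Step 1.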

EDA
EDA is essential in understanding and analyzing any dataset, including the Cleveland
Heart Disease dataset. EDA involves examining the dataset’s basic properties, identifying
missing values, checking data distributions, and exploring relationships between vari-
ables. Figure 3 shows the EDA of the dataset. Figure 3 (a) shows the count for each target
class. Here, the no heart disease class contains 138 records, and the heart disease pre-
sented class contains 165 records. Figure 3 (b) shows the male and female-based record
percentages in the dataset. Here, the dataset contains 68.32% male and 31.68% female
records. Figure 3 (c) shows the percentage of records for chest pain experienced by the
patient in the dataset. Here, the dataset contains 47.19% of records in typical angina,
16.50% in atypical angina, 28.71% in non-anginal pain, and 7.59% in the asymptomatic
class. Figure 3 (d) shows the percentage of records for fasting blood sugar in the
dataset. Here, the dataset contains 85.15% of records in the fasting blood sugar
(≤ 120 mg/dl) class and 14.85% of records in the fasting blood sugar (> 120 mg/dl)
class. Figure 4 shows the heart disease frequency by age for both the no-disease and
disease classes. The output is a histogram of heart disease frequency by age, with the
counts of patients with and without heart disease shown in red and green, respectively.
The overlap between the bars shows how the frequency of heart disease varies with age
across the 29–77-year range covered by the dataset.
Figure 5 shows the frequencies for different columns of the dataset, which contains
the frequencies of chest pain type, fasting blood sugar, rest ECG, exercise-induced
angina, st_slope, and number of major vessel columns. Exploring the frequencies

Fig. 3 EDA of the dataset. a Count for each target class. b Male–female distribution. c Chest pain
experienced by patient distribution. d Fasting blood sugar distribution

Fig. 4 Heart disease frequency by age

of different variables in a dataset is crucial in understanding the data and gaining


insights about the underlying phenomena. By analyzing the frequency of values in
each variable, we can better understand the data distribution and identify poten-
tial patterns, relationships, and outliers that are important for further analysis. For
example, understanding the frequency of different chest pain types in a heart disease
dataset reveals whether certain types of chest pain are more strongly associated with
the disease than others. Similarly, analyzing the frequency of different fasting blood
sugar levels helps to identify potential risk factors for heart disease. Overall, exploring
the frequencies of variables is an important step in the EDA process, as it provides a
starting point for identifying potential relationships and patterns in the data.
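The frequency exploration described above amounts to counting value occurrences per column. A sketch with made-up chest-pain codes (1-4, per Table 2):

```python
from collections import Counter

chest_pain = [1, 1, 3, 2, 1, 4, 3, 1, 2, 1]  # illustrative codes only
freq = Counter(chest_pain)
total = len(chest_pain)
for code, count in sorted(freq.items()):
    # Print each category's count and its share of the records.
    print(f"type {code}: {count} records ({100 * count / total:.1f}%)")
```

The same counting, applied per column of the real dataset, produces the percentages reported for Fig. 3 and the bar heights in Fig. 5.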

Performance evaluation
Table 3 shows the class-specific performance evaluation of HDP-DTRF. Here, the per-
formance was measured for class-0 (no heart disease) and class-1 (heart disease pre-
sented) classes. Further, macro average and weighted average performances were also
measured. Macro average treats all classes equally, regardless of their size. It calculates
the average performance metrics across all classes, giving each class an equal weight. It
means that the performance of smaller classes will have the same impact on the metric
as larger classes. Then, the weighted average considers the size of each class. It calcu-
lates the average performance metric across all classes but gives each class a weight
proportional to its size, meaning that the performance of larger classes has a greater
impact on the metric than that of smaller classes.

Fig. 5 Frequencies for different columns of the dataset
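The difference between the two averages can be checked directly. Using the class sizes reported in the EDA (138 no-disease, 165 disease) and the class precisions from Table 3:

```python
# Macro average weights classes equally; weighted average weights by size.
sizes = {"class-0": 138, "class-1": 165}
precision = {"class-0": 0.87, "class-1": 0.85}

macro = sum(precision.values()) / len(precision)
weighted = sum(precision[c] * sizes[c] for c in sizes) / sum(sizes.values())
print(round(macro, 3), round(weighted, 3))  # 0.86 vs roughly 0.859
```

Because class-1 is slightly larger, the weighted average sits a little closer to the class-1 precision than the macro average does.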
Table 4 shows the class-0 performance comparison of various methods. Here, the
proposed HDP-DTRF improved precision by 5.75%, recall by 1.37%, F1-score by
6%, and accuracy by 2.45% compared to KNN [15]. Then, the proposed HDP-DTRF
improved precision by 3.45%, recall by 0.63%, F1-score by 3.61%, and accuracy by
1.45% compared to ILM [16]. Then, the proposed HDP-DTRF improved precision
by 2.30%, recall by 1.27%, F1-score by 3.61%, and accuracy by 1.03% compared to
LRC [20]. Table 5 shows the class-1 performance comparison of various methods.
Here, KNN [15] shows a 2.35% lower precision, a 4.40% lower recall, a 3.53% lower
F1-score, and a 1.03% lower accuracy than the proposed HDP-DTRF method. Then,
ILM shows a 2.35% lower precision, a 5.49% lower recall, a 1.14% lower F1-score, and
a 1.03% lower accuracy than the proposed HDP-DTRF method. Then, LRC [20] shows
a 4.71% lower precision, an 11.11% lower recall, a 2.27% lower F1-score, and a 1.03%
lower accuracy than the proposed HDP-DTRF method.

Table 3 Class-specific performance evaluation of proposed HDP-DTRF

Method           | Precision | Recall | F1-score | Accuracy
Class-0          | 0.87      | 0.79   | 0.83     | 0.98
Class-1          | 0.85      | 0.91   | 0.88     | 0.97
Macro average    | 0.86      | 0.85   | 0.85     | 0.95
Weighted average | 0.86      | 0.86   | 0.85     | 0.96

Table 4 Class-0 performance comparison of various methods

Method            | Precision | Recall | F1-score | Accuracy
KNN [15]          | 0.75      | 0.80   | 0.77     | 0.92
ILM [16]          | 0.80      | 0.70   | 0.75     | 0.91
LRC [20]          | 0.85      | 0.75   | 0.80     | 0.95
Proposed HDP-DTRF | 0.87      | 0.79   | 0.83     | 0.98

Table 5 Class-1 performance comparison of various methods

Method            | Precision | Recall | F1-score | Accuracy
KNN [15]          | 0.83      | 0.87   | 0.85     | 0.96
ILM [16]          | 0.81      | 0.85   | 0.87     | 0.96
LRC [20]          | 0.79      | 0.81   | 0.85     | 0.96
Proposed HDP-DTRF | 0.85      | 0.91   | 0.88     | 0.97

Table 6 shows the macro average performance comparison of various methods. Relative
to KNN [15], the proposed method improves precision by 7.5%, recall by 13.3%, F1-score
by 10.4%, and accuracy by 6.7%. Relative to ILM [16], it improves precision by 2.4%,
recall by 6.1%, F1-score by 6.0%, and accuracy by 3.2%. Relative to LRC [20], it improves
precision by 3.4%, recall by 10.0%, F1-score by 6.0%, and accuracy by 4.3%.
Table 7 shows the weighted average performance comparison of various methods. For

Table 6 Macro average performance comparison of various methods

Method            | Precision | Recall | F1-score | Accuracy
KNN [15]          | 0.80      | 0.75   | 0.77     | 0.90
ILM [16]          | 0.85      | 0.82   | 0.83     | 0.93
LRC [20]          | 0.81      | 0.80   | 0.83     | 0.92
Proposed HDP-DTRF | 0.86      | 0.85   | 0.85     | 0.95

Table 7 Weighted average performance comparison of various methods

Method            | Precision | Recall | F1-score | Accuracy
KNN [15]          | 0.81      | 0.79   | 0.79     | 0.90
ILM [16]          | 0.83      | 0.82   | 0.81     | 0.93
LRC [20]          | 0.85      | 0.81   | 0.83     | 0.93
Proposed HDP-DTRF | 0.86      | 0.86   | 0.85     | 0.96

Fig. 6 ROC curve of proposed HDP-DTRF

KNN [15], the proposed method improves precision by 6.5%, recall by 3.3%, F1-score
by 1.4%, and accuracy by 6.7%. Relative to ILM [16], it improves precision by 2.4%,
recall by 5.1%, F1-score by 6.0%, and accuracy by 3.2%. Relative to LRC [20], it improves
precision by 1.4%, recall by 1.0%, F1-score by 6.0%, and accuracy by 4.3%.
The ROC curve of the proposed HDP-DTRF is shown in Fig. 6. The curve plots the
true positive rate (TPR) against the false positive rate (FPR) across a range of threshold
values. In the context of the HDP-DTRF technique, the ROC curve illustrates how well
the model differentiates between positive and negative heart disease instances; perfor-
mance is greater when the TPR is higher and the FPR is lower. The curve is used to find
the best classification threshold, which balances sensitivity and specificity in the diag-
nostic process. A point on the ROC curve closer to the top-left corner indicates better
model performance.
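Each point on the ROC curve is a (TPR, FPR) pair computed at one threshold. A sketch with made-up scores (not the paper's model outputs):

```python
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]  # illustrative predicted scores
labels = [1, 1, 0, 1, 0, 0]              # ground-truth classes

def roc_point(scores, labels, threshold):
    """Return (TPR, FPR) when predicting 1 for score >= threshold."""
    preds = [int(s >= threshold) for s in scores]
    tp = sum(p and y for p, y in zip(preds, labels))
    fp = sum(p and not y for p, y in zip(preds, labels))
    fn = sum(not p and y for p, y in zip(preds, labels))
    tn = sum(not p and not y for p, y in zip(preds, labels))
    return tp / (tp + fn), fp / (fp + tn)

for t in (0.85, 0.5, 0.25):
    print(t, roc_point(scores, labels, t))
```

Sweeping the threshold from high to low traces the curve from (0, 0) toward (1, 1); the closer a point lies to the top-left corner, the better the trade-off between sensitivity and specificity.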

Conclusions
This article proposes a machine-learning approach for heart disease prediction. The
approach uses a DTRF classifier with SGB loss optimization and involves preprocessing
a dataset of patient records labeled for the presence or absence of heart disease. The
DTRF classifier is trained with SGB loss optimization and evaluated using a separate
test dataset. The proposed HDP-DTRF improved class-specific performance as well as
macro and weighted average performance measures. Overall, the proposed HDP-DTRF
improved precision by 2.30%, recall by 1.27%, F1-score by 3.61%, and accuracy by 1.03%
compared to traditional methodologies. Further, this work can be extended with deep
learning-based classification combined with machine learning feature analysis.

Abbreviations
HDP Heart disease prediction
DTRF Decision tree-based random forest
SGB Stochastic gradient boosting
FP False positive
FN False negative
TN True negative
TP True positive

Acknowledgements
Not applicable.

Authors’ contributions
A.P.J, P.S., and N.M. contributed to the technical content of the paper, and P.S. and S.A. contributed to the conceptual con-
tent and architectural design. P.K., D.B., and P.A. contributed to the guidance and counseling on the writing of the paper.

Funding
No funding was received by any government or private concern.

Availability of data and materials


Data sharing is not applicable to this article as no datasets were generated or analyzed during the current study.

Declarations
Competing interests
The authors declare that they have no competing interests.

Received: 31 May 2023 Accepted: 5 September 2023

References
1. Bhatt CM et al (2023) Effective heart disease prediction using machine learning techniques. Algorithms 16(2):88
2. Dileep P et al (2023) An automatic heart disease prediction using cluster-based bi-directional LSTM (C-BiLSTM)
algorithm. Neural Comput Appl 35(10):7253–7266
3. Jain A et al (2023) Optimized levy flight model for heart disease prediction using CNN framework in big data appli-
cation. Exp Syst Appl 223:119859
4. Nandy S et al (2023) An intelligent heart disease prediction system based on swarm-artificial neural network. Neural
Comput Appl 35(20):14723–14737
5. Hassan D et al (2023) Heart disease prediction based on pre-trained deep neural networks combined with principal
component analysis. Biomed Signal Proc Contr 79:104019
6. Ozcan M et al (2023) A classification and regression tree algorithm for heart disease modeling and prediction.
Healthc Anal 3:100130
7. Saranya G et al (2023) A novel feature selection approach with integrated feature sensitivity and feature correlation
for improved heart disease prediction. J Ambient Intell Humaniz Comput 14(9):12005–12019
8. Sudha VK et al (2023) Hybrid CNN and LSTM network for heart disease prediction. SN Comp Sc 4(2):172
9. Chaurasia V, et al (2023) Novel method of characterization of heart disease prediction using sequential feature
selection-based ensemble technique. Biomed Mat Dev 1–10. https://doi.org/10.1007/s44174-022-00060-x
10. Ogundepo EA et al (2023) Performance analysis of supervised classification models on heart disease prediction.
Innov Syst Software Eng 19(1):129–144
11. de Vries S et al (2023) Development and validation of risk prediction models for coronary heart disease and heart
failure after treatment for Hodgkin lymphoma. J Clin Oncol 41(1):86–95
12. Vijaya Kishore V, Kalpana V (2020) Effect of Noise on Segmentation Evaluation Parameters. In: Pant, M., Kumar
Sharma, T., Arya, R., Sahana, B., Zolfagharinia, H. (eds) Soft Computing: Theories and Applications. Advances in Intel-
ligent Systems and Computing, vol 1154. Springer, Singapore. https://doi.org/10.1007/978-981-15-4032-5_41
13. Kalpana V, Vijaya Kishore V, Praveena K (2020) A Common Framework for the Extraction of ILD Patterns from
CT Image. In: Hitendra Sarma, T., Sankar, V., Shaik, R. (eds) Emerging Trends in Electrical, Communications, and
Information Technologies. Lecture Notes in Electrical Engineering, vol 569. Springer, Singapore. https://doi.org/10.1007/978-981-13-8942-9_42
14. Annamalai M, Muthiah P (2022) An Early Prediction of Tumor in Heart by Cardiac Masses Classification in Echocardi-
ogram Images Using Robust Back Propagation Neural Network Classifier. Brazilian Archives of Biology and Technol-
ogy. 65. https://doi.org/10.1590/1678-4324-2022210316
15. Shah D et al (2020) Heart disease prediction using machine learning techniques. SN Comput Sci 1:345
16. Guo C et al (2020) Recursion enhanced random forest with an improved linear model (RERF-ILM) for heart disease
detection on the internet of medical things platform. IEEE Access 8:59247–59256
17. Ahmed H et al (2020) Heart disease identification from patients’ social posts, machine learning solution on Spark.
Future Gen Comp Syst 111:714–722
18. Katarya R et al (2021) Machine learning techniques for heart disease prediction: a comparative study and analysis.
Health Technol 11:87–97
19. Kannan R et al (2019) Machine learning algorithms with ROC curve for predicting and diagnosing the heart disease.
Springer, Soft Computing and Medical Bioinformatics
20. Ali MM et al (2021) Heart disease prediction using supervised machine learning algorithms: Performance analysis
and comparison. Comput Biol Med 136:104672
21. Mienye ID et al (2020) An improved ensemble learning approach for the prediction of heart disease risk. Inform Med
Unlocked 20:100402
22. Dutta A et al (2020) An efficient convolutional neural network for coronary heart disease prediction. Expert Syst
Appl 159:113408
23. Latha CBC et al (2019) Improving the accuracy of heart disease risk prediction based on ensemble classification
techniques. Inform Med Unlocked 16:100203
24. Ishaq A et al (2021) Improving the prediction of heart failure patients’ survival using SMOTE and effective data min-
ing techniques. IEEE Access 9:39707–39716
25. Asadi S et al (2021) Random forest swarm optimization-based for heart diseases diagnosis. J Biomed Inform
115:103690
26. Asif D et al (2023) Enhancing heart disease prediction through ensemble learning techniques with hyperparameter
optimization. Algorithms 16(6):308
27. David VAR S, Govinda E, Ganapriya K, Dhanapal R, Manikandan A (2023) "An Automatic Brain Tumors Detection
and Classification Using Deep Convolutional Neural Network with VGG-19," 2023 2nd International Conference on
Advancements in Electrical, Electronics, Communication, Computing and Automation (ICAECA), Coimbatore, India,
2023, pp. 1-5. https://doi.org/10.1109/ICAECA56562.2023.10200949
28. Radwan M et al (2023) MLHeartDisPrediction: heart disease prediction using machine learning. J Comp Commun
2(1):50-65

Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
