
Heart Disease Detection with Machine Learning

Devshree Jadeja

20BCP112

Abstract

Heart disease, commonly referred to as cardiovascular disease (CVD) or coronary heart disease (CHD), is a general term for several conditions affecting the heart and blood vessels that carry a high risk of death through heart attacks, strokes, and other serious illnesses. The World Health Organization (WHO) reports that heart disease is still the leading cause of death worldwide, accounting for 31% of fatalities, or 17.9 million deaths annually. Traditional diagnosis relies on invasive medical testing that is time-consuming and dependent on clinical judgement. Machine learning has provided precise and effective tools that have substantially changed medical diagnosis. This work analyzes Logistic Regression, SVM, KNN, Naive Bayes, and Random Forest with the aim of enhancing patient care and enabling early detection and risk assessment of heart disease.

1. Introduction to the problem
Cardiovascular disease (CVD) is a major cause of death globally. Known CVDs include coronary heart disease, peripheral arterial disease, cerebrovascular disease, rheumatic heart disease, and congenital heart disease. Records from the World Health Organization (WHO) indicate that over 17.9 million people die annually from heart disease and other heart-related issues. Heart attacks and strokes are responsible for over 80% of CVD deaths. Factors that increase the probability of heart-related problems include inadequate physical activity, unhealthy diet, excessive alcohol consumption, and smoking. Intermediate risk factors such as high blood pressure, diabetes, abnormal blood lipid levels, overweight, and obesity indicate individuals who are susceptible to cardiovascular disease. Identifying individuals who are at risk of developing such conditions is crucial. In this work, we propose a machine learning-based approach for diagnosing cardiac disease that is both effective and accurate. The system was developed using standard feature selection algorithms such as Relief, minimal redundancy maximal relevance, least absolute shrinkage and selection operator (LASSO), and local learning to remove irrelevant and redundant features. The classification algorithms used in the system include support vector machine, logistic regression, artificial neural network, K-nearest neighbor, Naive Bayes, and decision tree. To address the feature selection problem, we also propose a new, fast conditional mutual information feature selection approach. Feature selection techniques are used to improve classification accuracy and to shorten classification time. Additionally, leave-one-subject-out cross-validation has been used.

1.1. Dataset
The UCI Heart Disease dataset is a widely used resource in healthcare and machine learning for predicting heart disease, a potentially fatal medical condition. This dataset consists of 1026 patient records and 14 attributes, with a combination of numerical and categorical parameters obtained from real clinical data. The dataset captures age, gender, chest pain type, resting blood pressure, serum cholesterol, fasting blood sugar status, resting electrocardiographic findings, maximum heart rate during exercise, exercise-induced angina, ST depression induced by exercise relative to rest, the slope of the peak exercise ST segment, the number of major vessels colored by fluoroscopy, and the type of thalassemia.
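As an illustration of the record structure described above, the following minimal sketch lists the 14 attributes under the short column names used later in this paper (age, sex, cp, and so on). The descriptions paraphrase the attribute list, and the 1/0 encoding of the target as well as the helper `missing_columns` are assumptions made for the example, not part of the original work.

```python
# The 14 attributes of one patient record. The short names are the ones
# used later in this paper; the descriptions paraphrase the text above.
HEART_COLUMNS = {
    "age": "age of the patient",
    "sex": "gender",
    "cp": "chest pain type",
    "trestbps": "resting blood pressure",
    "chol": "serum cholesterol",
    "fbs": "fasting blood sugar status",
    "restecg": "resting electrocardiographic findings",
    "thalach": "maximum heart rate during exercise",
    "exang": "exercise-induced angina",
    "oldpeak": "ST depression relative to rest",
    "slope": "slope of the peak exercise ST segment",
    "ca": "major vessels seen on fluoroscopy",
    "thal": "thalassemia type",
    # Assumed encoding: 1 = disease present, 0 = disease absent.
    "target": "presence or absence of heart disease",
}


def missing_columns(record):
    """Return the expected column names that a record (a dict) lacks."""
    return [name for name in HEART_COLUMNS if name not in record]


# Hypothetical, incomplete record used only to exercise the helper.
partial_record = {"age": 63, "sex": 1, "cp": 0}
print(len(HEART_COLUMNS), "attributes;", len(missing_columns(partial_record)), "missing")
```

A check of this kind is a cheap way to catch a misnamed or dropped column before any modelling starts.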
The target variable of the dataset represents the presence or absence of cardiac disease and is the basis for binary classification tasks. The main purpose of this dataset is the development and evaluation of predictive models that estimate a patient's risk of heart disease from their attributes. The information has many uses in the healthcare industry, including risk assessment, early detection, and tailored treatment planning. This dataset can help physicians and researchers make better judgments about the diagnosis and treatment of cardiac disease, which could improve patient outcomes and the standard of healthcare.

1.2. Related Work

1. Heart Disease Prediction System using Naive Bayes

The Naive Bayes algorithm is the method employed in this research to identify and predict heart disease. The Cleveland dataset was used, and the outcomes are displayed as pie charts and bar charts, illustrating the system's accuracy on various instances in the training dataset. The research highlights the importance of using data mining tools to extract valuable insights from healthcare data, particularly for early detection of heart disease. The authors aim to categorize medical data into five classes and use the Naive Bayes classifier to predict the class labels of unknown samples. The effectiveness of their system relies on their algorithm selection and the quality of the database employed.

2. Application of Machine Learning for the Detection of Heart Disease

The study discusses in depth the use of machine learning algorithms for the diagnosis of heart disease, with an emphasis on the importance of correct datasets for successful diagnosis. The research focuses on categorizing cases as positive (indicating heart disease) or negative (no heart disease), using the "Cleveland heart disease data set 2016" from the University of California, Irvine.

Python is used to implement the machine learning methods in the study, including neural networks, logistic regression, K-nearest neighbor, and naive Bayes. The research emphasizes how Python and its libraries, including scikit-learn, NumPy, pandas, and matplotlib, have the potential to improve the diagnostic procedure. Notably, the machine learning model presented in the paper attained a remarkable accuracy rate, highlighting its effectiveness in the identification of cardiac disease.

3. Application of Machine Learning for the Detection of Heart Disease

The diagnosis of valvular heart disorders is a crucial area of medical research. In this study, the researchers investigate whether ensemble learning approaches, including bagging, boosting (AdaBoost), and random subspace, can improve the performance of Support Vector Machines (SVMs). They emphasize the significance of early identification and detection of cardiac valve diseases because of their serious negative influence on patient health. The study uses a dataset of 215 samples, including both normal and defective cardiac valves, and several diagnostic procedures and imaging methods, including echocardiography and Doppler ultrasonography. Wavelet packet decomposition and entropy-based criteria are used to extract features from Doppler heart sound signals. The base classifiers are SVMs, more specifically Least Squares Support Vector Machines (LSSVM).

The performance of the ensemble methods is evaluated using classification accuracy, sensitivity, specificity, and a confusion matrix, all of which are important for accurate medical diagnoses. The experimental findings show that both bagging and boosting improve the classification performance of SVMs, with boosting in particular achieving high sensitivity (97.3%) and specificity (96%). The random subspace ensemble does not show significant performance improvements for this specific diagnosis task. Furthermore, a comparison with previous methods for valvular heart disease detection shows that ensemble methods, especially SVM-AdaBoost, outperform certain prior approaches in terms of sensitivity and specificity. This suggests that machine learning techniques can enhance early detection and diagnosis in cardiology, ultimately contributing to improved patient care and outcomes.
4. Heart Diseases Prediction using Hybrid Machine learning model

The paper addresses the prognosis of cardiovascular disease, a major global health problem with high mortality. Early detection of cardiac conditions such as heart failure is essential to saving lives. Machine learning (ML) is emerging as a promising solution for accurate prediction and rational decision making in the healthcare industry. The study uses the Cleveland Heart Disease database and employs data analysis techniques, particularly regression and classification. Three machine learning algorithms are used: 1. Random Forest, 2. Decision Tree, and 3. a Hybrid Model combining Random Forest and Decision Tree techniques. Experimental results show an accuracy of 88.7% for cardiovascular disease prediction using the Hybrid model. The paper also highlights the development of a user-friendly interface that accepts the relevant parameters as input and predicts cardiovascular disease using the Hybrid model. Key elements of the work are the Cleveland cardiovascular database, decision trees, random forests, hybrid algorithms, and machine learning.

5. Web-based heart disease decision support system using data mining classification modeling techniques

Dr. Robert Detrano made pioneering contributions to this area by initially employing logistic regression to achieve a precision rate of 77% [5]. This marked a significant step forward in using statistical and analytical methods to predict heart disease accurately, and underscored the potential of data-driven approaches in this domain. Subsequently, Newton Cheung used further algorithms, including the Naive Bayes algorithm, to improve the classification of heart disease data, achieving an accuracy rate of 81.48% [6]. This result highlighted the power of machine learning and probabilistic methods in refining the accuracy and reliability of heart disease predictions.

6. Heart Disease Detection Using Machine Learning Majority Voting Ensemble Method

This study proposes an alternative method for predicting cardiac events in individuals using a voting ensemble method, based on simple and inexpensive clinical tests performed at nearby hospitals. Real patient data, covering both healthy and sick people, is used for model training. Combining the predictions of several machine learning models provides more accurate and reliable disease detection. The reported results show an accuracy rate of 90% when a hard (majority) voting scheme is used, a significant improvement in the prediction of cardiovascular disease that gives hope for better health decisions and patient outcomes.

Furthermore, the ensemble method holds great promise for expanding access to accurate heart disease risk assessments because it can provide precise predictions based on inexpensive and easily accessible medical tests performed at neighborhood clinics. By using real-world patient data, including both healthy and afflicted individuals, the model demonstrates its flexibility and applicability to various patient groups.

7. Score and Correlation Coefficient-Based Feature Selection for Predicting Heart Failure Diagnosis by Using Machine Learning Algorithms

This research proposes a machine-learning-based method for precise diagnosis and risk prediction to address the urgent issue of cardiovascular disease (CVD), a major cause of mortality worldwide. Coronary artery disease (CAD), one of the several kinds of CVD, is a major contributor because of the problems brought on by atherosclerosis. To reduce the risk of CVD, atherosclerosis must be identified and treated early. The study uses a variety of clinical data to create a diagnostic system, including indicators such as chest pain, serum cholesterol, and electrocardiographic results. On two different
datasets, namely Cleveland and a dataset for predicting heart failure, feature selection and feature engineering techniques are used, followed by classification algorithms with optimised hyperparameters, such as SVM, KNN, Decision Tree, Random Forest, and Logistic Regression.

Notably, the Random Forest algorithm performs exceptionally well for heart failure prediction, with reported scores of 95%, 97.62%, 95.35%, and 96.47% for accuracy, precision, recall, and F1 score respectively. Additionally, it is reported to achieve 97.68% accuracy for heart failure prediction and 100% accuracy for the Cleveland dataset. This thorough method demonstrates how machine learning can improve risk assessment and CVD diagnosis. A literature review (LR) that supports this study should go over previous studies, techniques, and their shortcomings while highlighting the novelty and importance of the suggested strategy for reaching high CVD prediction accuracy rates. It should also discuss the effects of such accurate prediction on patient outcomes and early intervention.

2. Methodology

2.1. Data Collection

This study uses the UCI Heart Disease dataset described in Section 1.1. It consists of 1026 patient records and 14 attributes, numerical and categorical, obtained from real clinical data. The target variable represents the presence or absence of cardiac disease and is the basis for the binary classification task.

2.2. Approach

The architecture diagram shows how the process moves from beginning to end. It is a visual depiction of the machine learning pipeline that shows how the data was preprocessed, how the results were obtained, and what the key characteristics of the data were.

Fig (1) Approach of the work
Link: https://www.canva.com/templates/EAFDvitQBg8-flow-chart-whiteboard-in-red-blue-basic-style/

With our data at hand, the next step is pre-processing. Here, the dataset is checked for anomalies such as missing values and duplicates, so that the analysis rests on a solid foundation.

The architecture diagram shows how the process moves from beginning to end. It shows how the
machine learning pipeline looks visually, with the following stages:

• Data entry (source)
• Preprocessing of data
• Feature selection
• Model training, with cross-validation as a component
• Prediction and model evaluation

2.3. Data Visualization

Pre-processing is the crucial next step once we have our data. The dataset is checked for irregularities such as missing values and duplicates to ensure the accuracy of our research. With a cleaned dataset, we move on to exploratory data analysis (EDA), a process in which statistics and visual plots are used to uncover patterns, relationships, anomalies, and other valuable insights in the data. Using matplotlib and seaborn, we produce plots such as histograms, pair plots, and heatmaps, each offering a different view of the data.

Data pre-processing involved the following steps:

Verifying Missing Values: There are no missing values in the dataset.
Checking Duplicate Rows: The dataset was checked for duplicate rows.
Normalization: The data were normalized using StandardScaler so that all features are on a comparable scale and carry the same weight.
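The normalization step can be illustrated with a minimal sketch of the underlying arithmetic, z = (x - mean) / std, written with the Python standard library rather than the StandardScaler class used in the study. The function name `standardize` and the sample readings are assumptions made for illustration.

```python
from statistics import fmean, pstdev


def standardize(values):
    """Rescale values to zero mean and unit variance.

    Uses the population standard deviation, which is the same
    convention as scikit-learn's StandardScaler.
    """
    mean = fmean(values)
    std = pstdev(values)
    if std == 0:
        # A constant feature carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


# Hypothetical resting blood pressure readings, for illustration only.
trestbps = [120, 130, 140, 150, 160]
scaled = standardize(trestbps)
print([round(z, 3) for z in scaled])
```

After this transformation a feature measured in large units (such as cholesterol) no longer dominates distance-based models such as KNN or SVM simply because of its scale.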

Fig(b). Flow of work

Fig(c). Pie chart for understanding the balance of classes

Fig(d). Proportion of the classes

The above figure displays the distribution of the diagnosis class, i.e. the target feature, showing the proportions of people with heart disease and people without heart disease.

The grid of subplots below is conditioned on the values of one or more categorical variables. It provides a way to visualize the distribution of one parameter within bins defined by another parameter.

Fig(f). The histogram visualization

The dataset comprises 14 attributes. For each of them, the distribution and individual statistics are presented to provide context for the heart disease diagnosis. The histogram distribution analysis in Fig(f) indicates that trestbps, chol and oldpeak follow a left-skewed distribution, thalach displays a right-skewed distribution, and age follows a nearly uniform distribution, while sex, cp, fbs, restecg, exang, slope, ca, thal and target are discrete or categorical features.
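Skew direction can also be checked numerically rather than read off a histogram. The sketch below computes the moment coefficient of skewness, g1 = m3 / m2^(3/2), where m2 and m3 are the second and third central moments. A positive value indicates a long right tail (right-skewed) and a negative value a long left tail (left-skewed). The helper names, the tolerance of 0.5, and the sample values are assumptions made for illustration and are not taken from the dataset.

```python
from statistics import fmean


def skewness(values):
    """Moment coefficient of skewness g1 = m3 / m2**1.5."""
    n = len(values)
    mean = fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    if m2 == 0:
        return 0.0  # constant data has no skew
    return m3 / m2 ** 1.5


def describe_skew(values, tol=0.5):
    """Label a sample; tol is an assumed rule-of-thumb cut-off."""
    g1 = skewness(values)
    if g1 > tol:
        return "right-skewed"  # long tail towards large values
    if g1 < -tol:
        return "left-skewed"   # long tail towards small values
    return "approximately symmetric"


# Made-up samples: one with a long right tail, and its mirror image.
long_right_tail = [1, 1, 2, 2, 3, 3, 4, 20]
long_left_tail = [-v for v in long_right_tail]
print(describe_skew(long_right_tail), "/", describe_skew(long_left_tail))
```

Applying `describe_skew` to each numeric column is a quick way to confirm the labels assigned from the histograms.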

Fig(e). Scatter plot grid

We have also visualized the frequency of each category of the categorical features, together with the number of patients with heart disease in each category.

Fig(g). Correlation Heatmap


The heart disease project uses visualizations
including histograms, pair plots, and correlation
heatmaps as a key lens. They help to better
understand potential heart disease indicators by
illuminating the distribution, linkages, and
complex interactions of several health measures.
These visual tools help with model development
and improve the interpretability of results by
revealing the structure and patterns within the
data. This ensures that the predictive insights are
in line with medical realities and can be
effectively communicated to both technical and
non-technical stakeholders.
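The quantity shown in each cell of a correlation heatmap is the Pearson correlation coefficient between two features. The following minimal sketch computes it with the standard library. The function names and the four sample rows are assumptions made for illustration; the plotting itself (done with seaborn in the study) is not reproduced here.

```python
from math import sqrt
from statistics import fmean


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists.

    Assumes neither list is constant (otherwise the denominator is zero).
    """
    mx, my = fmean(x), fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def correlation_matrix(columns):
    """Return {name_a: {name_b: r}} for a dict of named columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Made-up values; the column names follow the dataset.
data = {
    "age": [40, 50, 60, 70],
    "thalach": [180, 165, 150, 140],
    "chol": [200, 240, 210, 250],
}
matrix = correlation_matrix(data)
print(round(matrix["age"]["thalach"], 3))
```

In this toy sample the coefficient between age and thalach is strongly negative, the same direction as the relationship discussed for the age and thalach scatter plot.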

Fig: Distribution of heart disease by resting electrocardiographic results

The above figure shows the distribution of heart disease with respect to resting electrocardiographic results. It may be observed from the graph that those with a "normal" result in the restecg category typically have a lower risk of heart disease, while those with an "ST-T abnormality" result have a higher risk of heart disease.

Fig: Distribution of heart disease by Gender

The above figure shows the distribution of heart disease with respect to gender. From the figure it can be inferred that the majority of the women in the dataset have heart disease. Thus, heart disease is a major

The above figure shows the distribution of heart disease with respect to chest pain type. From the figure we can say that most of the people with typical angina (chest pain type 0) do not have heart disease.
Fig: Distribution of cholesterol wrt age

The scatter plot above illustrates the relationship between age and cholesterol levels. The visualization suggests that the majority of individuals have cholesterol levels ranging from 150 mg/dl to 350 mg/dl.

Fig: Distribution of thalach wrt age

The scatter plot of thalach against age shows a negative correlation between the two: as age increases, the maximum heart rate tends to decrease. Additionally, the majority of individuals have maximum heart rates within the range of 100 to 180 beats per minute.

3. Data Preprocessing

Data preprocessing is an essential and fundamental step in machine learning. It is the process of preparing the raw data and converting it into a format that is suitable for machine learning models. It involves a number of techniques and transformations that make the raw data ready for analysis and modeling. This procedure ensures that the data is clean, pertinent and appropriate for machine learning algorithms. Data preprocessing includes the following operations.

Handling missing values: Missing data refers to the null values that are present in the dataset, generally represented as NaN. Missing values in the dataset significantly affect a model's performance, so imputation techniques are employed to address them.

Fig: checking for missing values

Outlier detection: Outliers are data points that differ noticeably from the rest of the dataset and can badly affect the performance of the model. They are frequently anomalous values that distort the data distribution and often arise from inconsistent data input. Detecting and handling outliers is therefore very important. The methods used to handle outliers are the standard deviation and z-score methods.
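The z-score method can be sketched as follows: a value is flagged when it lies more than a chosen number of standard deviations from the mean. The threshold of 3, the function name, and the sample cholesterol values are assumptions made for illustration; the study does not state the threshold it used.

```python
from statistics import fmean, pstdev


def zscore_outliers(values, threshold=3.0):
    """Return the indices of values whose |z-score| exceeds the threshold."""
    mean = fmean(values)
    std = pstdev(values)
    if std == 0:
        return []  # all values identical, nothing can be an outlier
    return [i for i, v in enumerate(values) if abs((v - mean) / std) > threshold]


# Made-up cholesterol readings: 20 ordinary values plus one extreme value.
chol = [210, 220, 230, 240, 250] * 4 + [564]
print(zscore_outliers(chol))
```

Note that with very small samples no point can reach a z-score of 3, so the method is only meaningful when a feature has a reasonable number of observations.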
Fig: box plot of numeric features

The 'age' column of the dataset is box plotted using the sns.boxplot() function. The box plot shows the median, the quartiles, and any probable outliers, giving information about the distribution of the 'age' data. The plot's title and y-axis label are added using the plt.title() and plt.ylabel() functions, respectively.

4. Model Selection

Before modelling, we perform feature selection. The key is to include only the most useful attributes, allowing the model to operate effectively without being overburdened by unnecessary data. The attributes with the greatest statistical significance are chosen using methods such as SelectKBest.

A crucial stage in the machine learning pipeline is model selection. The objective is to determine which algorithm performs best on the provided dataset in terms of accuracy, generalizability, and other important criteria.

Selecting the best machine learning algorithm for the particular dataset and problem is a part of this process. In this project, the following machine learning algorithms are used.

Logistic Regression

Logistic regression is a widely used supervised classification technique that estimates the probability of a target class from the input features. It is closely related to linear regression but is used when the target feature is categorical: a linear combination of the inputs is passed through the logistic (sigmoid) function to give a probability. It is most frequently used for binary classification, although it can be extended to multiclass problems. It is a simple yet effective way to understand the relation between independent and dependent variables, particularly when the target is binary.

K-Nearest Neighbors

K-Nearest Neighbors (KNN) is a simple and widely adopted machine learning algorithm used for both classification and regression. KNN is a non-parametric technique, which means it makes no assumptions about the underlying data distribution. Because it does not learn a model from the training set but instead stores the data and uses it at prediction time, it is also known as a "lazy learner". The core principle of KNN is to predict the class of a new data point by identifying the K nearest labeled points in the training dataset and using their class labels to predict the class of the new instance.

Support Vector Machine

Support Vector Machine (SVM) is a well-known supervised machine learning method frequently used for classification and regression. SVM is useful for processing complex and nonlinear data. The method separates the classes using a hyperplane, with the goal of maximizing the distance between the hyperplane and the points closest to it from each class.
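Of the algorithms described in this section, the K-nearest-neighbors rule is the easiest to write out directly. The following minimal sketch implements the principle stated above (find the K closest labeled points, then take the majority label) with Euclidean distance. The function name, the value K = 3, and the toy two-feature data are assumptions made for illustration and do not reproduce the configuration used in the study.

```python
from collections import Counter
from math import dist


def knn_predict(train_X, train_y, query, k=3):
    """Predict the label of `query` from its k nearest training points."""
    # Sort the labeled points by Euclidean distance to the query point.
    neighbours = sorted(zip(train_X, train_y), key=lambda pair: dist(pair[0], query))[:k]
    # The predicted class is the most common label among the k neighbours.
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Toy data: two already-scaled features per patient, label 1 = disease.
train_X = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (1.0, 1.0), (0.9, 1.2), (1.1, 0.8)]
train_y = [0, 0, 0, 1, 1, 1]
print(knn_predict(train_X, train_y, (0.1, 0.1)), knn_predict(train_X, train_y, (1.0, 0.9)))
```

Because the prediction depends on raw distances, KNN is sensitive to feature scale, which is one reason the features are standardized beforehand.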

Naive Bayes

Naive Bayes is a straightforward yet effective algorithm used for classification tasks. It is a probabilistic classifier based on Bayes' theorem, with the "naive" assumption that the features are conditionally independent given the class.

Random Forest

Random Forest is a prominent machine learning algorithm that handles both classification and regression problems. It is an ensemble learning technique that increases prediction accuracy by combining the predictive strength of numerous decision trees, and it therefore often outperforms a single decision tree. It can process high-dimensional data efficiently without the need for feature selection or dimensionality reduction. It is also resilient to outliers and missing data.

Heart disease prediction using random forest architecture:

Decision tree ensemble: At its core, Random Forest is a decision tree ensemble. The Random Forest model constructs numerous trees during the training phase instead of depending on a single decision tree.
Bootstrap Aggregating (Bagging): For each decision tree it builds, Random Forest draws a bootstrap sample of the dataset, i.e. a sample drawn with replacement. Each tree therefore receives a slightly different subset of the data, which creates variety in how the trees make decisions.
Random Feature Selection: Random Forest does not consider all features when splitting a node. Instead, it chooses a random subset of them. This randomness further increases the diversity of the individual trees and reduces overfitting.
Voting: After all trees have been built, predictions are generated by having each tree cast a "vote" for the class. This is known as aggregate decision-making. The outcome of a binary classification, such as the prognosis of heart disease, is determined by the majority vote. For instance, if there are 100 trees and 70 of them predict "Disease" while the remaining 30 predict "No Disease", "Disease" will be the final prediction.
Managing Missing Values: Some Random Forest implementations can manage missing values by using the mode for categorical features or median imputation for numerical features.
Feature Importance: Another benefit of Random Forest is that it sheds light on the significance of each feature for prediction. Features that are frequently used for splits close to the tree roots generally contribute more to the predictions.
Hyperparameters: Random Forest has many hyperparameters that can be tuned for optimal performance. These include the number of trees in the forest (n_estimators), the maximum tree depth (max_depth), the minimum number of samples needed to split a node (min_samples_split), and many more. Appropriate settings give the best results, especially for specific problems such as cardiovascular disease prognosis.

The architecture of the Random Forest model is important for predicting heart disease. Compared with individual trees, its ensemble nature delivers a more reliable, lower-variance prediction. The randomness in data sampling and feature selection helps the model generalize to new, unseen data, which makes it a powerful tool for predicting cardiac disorders from a variety of patient features.
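The three mechanisms that give Random Forest its diversity and its final decision (bootstrap sampling, random feature subsets, and majority voting) can each be sketched in a few lines. This is a minimal illustration of those building blocks under assumed helper names, not a full Random Forest and not the implementation used in the study. The 70/30 vote reproduces the example from the text.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (some repeat, some are left out)."""
    return [rng.choice(rows) for _ in rows]


def random_feature_subset(features, k, rng):
    """Pick k distinct candidate features for one split."""
    return rng.sample(features, k)


def majority_vote(predictions):
    """Return the class predicted by the largest number of trees."""
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
rows = list(range(10))  # stand-in for ten patient records
sample = bootstrap_sample(rows, rng)
candidates = random_feature_subset(["age", "cp", "chol", "thalach", "oldpeak"], 2, rng)

# The example from the text: 70 trees vote "Disease", 30 vote "No Disease".
tree_votes = ["Disease"] * 70 + ["No Disease"] * 30
print(majority_vote(tree_votes))  # Disease
```

In a library implementation these steps are controlled by hyperparameters such as the number of trees and the number of features considered per split.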
Result

Table (1). Accuracy of all models

Model                  Accuracy %
Logistic Regression    87.01
KNN                    74.35
SVM                    85.39
Naive Bayes            84.42
Random Forest          99.02

Fig(j). ROC curves of all models

Final Thoughts on Feature Selection: The experiment demonstrated a careful method of feature selection for the prediction of heart disease. The initial findings, obtained without feature selection, made clear how important it is to clean up the data. We identified the most important attributes for predictive modelling using the SelectKBest approach. Deliberately reducing the number of dimensions removed noise from the dataset and left only the most important features. By focusing on these features, the models are more likely to produce accurate results. The feature selection improved the quality of the dataset and prepared it for the modelling phase that follows.

Final Thoughts on Pre-processing

We adopted a methodical pre-processing procedure in our investigation of heart disease prediction. Careful categorization of the data made sure that the characteristics of particular variables were correctly captured. One-hot encoding was then used to expand the categorical variables into separate columns without introducing bias. Dividing the dataset into separate training and testing sets made an in-depth evaluation of model performance possible. Additionally, standard scaling was applied after isolating the numerical features. This normalisation made sure that each feature contributed fairly, so that no feature had an unwarranted impact on the training and evaluation phases. Overall, this pre-processing created a solid foundation for trustworthy machine learning models, improving their capacity to recognise complex data patterns.
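One-hot encoding, mentioned above, replaces a categorical column with one 0/1 indicator column per category, so that a model does not read an artificial order into category codes. The sketch below shows the idea with the standard library. The function name, the column prefix, and the sample chest pain codes are assumptions made for illustration.

```python
def one_hot(values, prefix):
    """Encode a categorical column as a list of 0/1 indicator dicts."""
    categories = sorted(set(values))
    return [
        {f"{prefix}_{category}": int(value == category) for category in categories}
        for value in values
    ]


# Made-up chest pain type codes for five patients.
cp = [0, 2, 1, 0, 3]
encoded = one_hot(cp, "cp")
print(encoded[1])  # {'cp_0': 0, 'cp_1': 0, 'cp_2': 1, 'cp_3': 0}
```

Each encoded row contains exactly one 1, so the number of columns grows with the number of categories while the information content stays the same.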
https://link.springer.com/article/10.1007/s10916-011-9740-z

Model                  Precision   Recall   F1-score   ROC Area
Logistic Regression    0.85        0.92     0.88       0.87
KNN                    0.79        0.71     0.74       0.82
SVM                    0.82        0.93     0.87       0.99
Naive Bayes            0.84        0.87     0.86       0.89
Random Forest          1.00        0.98     0.99       0.99

Table(2). Comparing all the approaches (models)
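For reference, the metrics reported in Table(2) are defined from the counts of a binary confusion matrix, and the F1-score is the harmonic mean of precision and recall. The sketch below shows these definitions. The function names and the confusion counts are assumptions made for illustration and are not the counts obtained in the study; only the final line uses values from the table.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from binary confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": f1_score(precision, recall),
    }


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 45 true positives, 5 false positives,
# 10 false negatives, 40 true negatives.
metrics = classification_metrics(tp=45, fp=5, fn=10, tn=40)
print({name: round(value, 3) for name, value in metrics.items()})

# Consistency check on the Logistic Regression row of Table(2).
print(round(f1_score(0.85, 0.92), 2))  # 0.88
```

The same check can be applied to the other rows, bearing in mind that the tabulated precision and recall are already rounded to two decimals.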

5. Literature Review Comparative Table

Author/Year                                                      Method            Accuracy
Muhammad Fathurachman/2014 [14]                                  Hybrid Learning   84.00%
Divyansh Khanna, Sahu, Veeky Baths, Bharat Deshpande/2015 [15]   SVM               97%
Kemal Polat, Seral Sahan, Salih                                  KNN               89%
                                                                 Ensembling        97.00%
                                                                                   82.19%

TABLE III
ACCURACY COMPARISON WITH LATEST RESEARCH PAPER

[1]. https://ieeexplore.ieee.org/abstract/document/9358597

[2]

[4]. https://ieeexplore.ieee.org/abstract/document/9074954

[5]. https://ieeexplore.ieee.org/abstract/document/4493524
