
Student Name:

Student Number:
Topic: Application of Machine Learning in the Prediction of Heart Failure Anomalies
Word Count:

1.1 Introduction

According to the World Health Organization (WHO), the leading cause of death around the world is cardiovascular disease (CVD). CVDs are estimated to kill 17.9 million people every year, which is 31% of all deaths worldwide. Four out of every five CVD deaths are caused by heart attacks and strokes, and one-third of these deaths occur prematurely in people under the age of 70 (WHO, 2020).

Cardiovascular disease covers a wide range of conditions affecting the heart and its function. Diagnosing heart disease precisely and correctly is essential, and most of the time this is the task of a clinical expert. If these diseases are caught early, and the patient's history and lifestyle are fully taken into account, cardiovascular disease can be predicted and preventive steps taken to eliminate or suppress it (Shaji, 2019; Aldallal and Al-Moosa, 2018).

Data mining has recently become popular worldwide in a wide range of fields, and medicine is no exception. Data mining techniques are good candidates for building integrated data frameworks to study cardiovascular diseases because they can uncover hidden information in raw data. Because data mining plays such a large part in improving medical services, different approaches for analysing and predicting cardiovascular disease have been proposed (Bou Rjeily et al., 2019; Singh et al., 2018).

Heartbeat classification assigns each individual heartbeat to a class based on its morphology. As previous research has noted, when a heartbeat originates in an ectopic area its shape may change; for example, a missing P wave can cause a beat to occur at the wrong time. An abnormal heartbeat can therefore be a sign of heart disease. By classifying and annotating the type of each heartbeat on the ECG, one can easily see how often cardiac abnormalities recur, and so make the right diagnosis and treatment. Heartbeat classification has two main parts: feature extraction and model training.

Furthermore, figure 1 below shows a typical system for detecting heartbeat anomalies. Noise reduction lessens how much movement of the recording device or the patient distorts the signal. Heartbeat detection locates the heartbeats so that the heart rate can be computed. Based on a detected heartbeat location, heartbeat segmentation extracts the complete heartbeat. Heartbeat classification then looks for irregularities in the shape of the heartbeat in the ECG signal, dividing heart disease abnormalities into three groups: irregular heart rate, irregular rhythm, and ectopic rhythm. Similar to heartbeat classification, irregular heart rhythm classification examines a period of the ECG record rather than a single heartbeat shape. The sections that follow discuss relevant research from the literature relating to these five subsystems.

Fig 1: Typical heartbeat anomaly detection system using an ECG
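To make the division of labour concrete, the five subsystems can be pictured as a chain of processing stages. The sketch below outlines that chain in R; every function here is a hypothetical stub operating on a toy signal, standing in for a real implementation rather than code from this study.

# Hypothetical stubs for the five subsystems shown in Fig 1
denoise <- function(ecg) ecg # noise reduction (identity stub)
detect_beats <- function(ecg) which(diff(sign(diff(ecg))) == -2) + 1 # local maxima
segment_beats <- function(ecg, peaks) # cut a window around each detected peak
lapply(peaks, function(p) ecg[max(1, p - 50):min(length(ecg), p + 50)])
classify_beat <- function(beat) sample(c("normal", "ectopic"), 1) # placeholder
classify_rhythm <- function(labels)
if (mean(labels == "ectopic") > 0.1) "irregular" else "regular"

ecg <- abs(sin(seq(0, 20 * pi, length.out = 2000))) # toy ECG-like signal
peaks <- detect_beats(denoise(ecg))
beats <- segment_beats(ecg, peaks)
labels <- vapply(beats, classify_beat, character(1))
classify_rhythm(labels)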

2.1 Machine Learning Algorithms

Logistic regression and decision trees are the two supervised machine learning algorithms used in this report. Logistic regression is a common machine learning method belonging to the supervised learning approach (Tollest et al., 2016). It is used to forecast a categorical dependent variable from a group of independent factors (Harrell & Frank, 2010). Because logistic regression predicts the output of a categorical dependent variable, the outcome must be categorical or discrete: Yes or No, 0 or 1, true or false, and so on. Rather than returning the exact values 0 and 1, however, it returns probability values that fall between 0 and 1 (Park et al., 2017).
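As an illustration, a minimal logistic regression fit in R might look like the sketch below; the data frame df, predictor x, and binary outcome y are hypothetical, not the study's data.

# Hypothetical data: one numeric predictor x and a binary outcome y
set.seed(1)
df <- data.frame(x = rnorm(100))
df$y <- rbinom(100, 1, plogis(0.5 + 1.2 * df$x))

# Fit a logistic regression with the binomial family
fit <- glm(y ~ x, data = df, family = binomial)

# Predicted probabilities lie strictly between 0 and 1
probs <- predict(fit, type = "response")
range(probs)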

The decision tree algorithm is also a member of the supervised learning family. Unlike some other supervised learning algorithms, the decision tree approach can be used to solve both regression and classification problems (Kamiński et al., 2017). The objective of a decision tree is to build a training model that predicts the class or value of the outcome variable using simple decision rules derived from past data (the training data). To predict a record's class label, a decision tree starts at the root, compares the value of the root attribute with the record's attribute, and follows the matching branch down to a leaf (Karimi & Hamilton, 2011).
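A comparable sketch for a classification tree uses the rpart package; the data here is the same hypothetical kind as in the logistic regression sketch above.

library(rpart)

# Hypothetical data, as in the logistic regression sketch
set.seed(1)
df <- data.frame(x = rnorm(100))
df$y <- factor(rbinom(100, 1, plogis(0.5 + 1.2 * df$x)))

# A factor outcome makes rpart grow a classification tree of simple rules
tree <- rpart(y ~ x, data = df, method = "class")

# Each record is routed from the root down to a leaf to get its class label
predict(tree, newdata = df, type = "class")[1:5]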

2.2 Application Area

Logistic regression can be applied to a few areas, including:

i. Predicting how likely it is that a person will have a heart attack, as in this study.

ii. Predicting how likely a customer is to buy a product or cancel a subscription.

Decision trees can also be used in the following ways:

i. They are a good way to handle nonlinear data sets.

ii. They are a machine learning tool used in engineering, civil planning, law, and business, among other fields.

3.1 Data Source

To predict anomalies in heart disease, this study used an open-source dataset from Kaggle, available at https://www.kaggle.com/datasets/fedesoriano/heart-failure-prediction. People who have cardiovascular disease, or who are at high risk of it because of one or more risk factors such as high blood pressure, diabetes, high cholesterol, or an already established disease, need to be identified and treated early, and a machine learning model can help a great deal with this. The dataset was created by combining five existing heart datasets that had never been combined before, merged on 11 common features; according to its description, this makes it the largest heart disease dataset made available for research so far. The five component datasets are listed below, followed by a sketch of how such files might be combined:

i. Cleveland: 303 observations

ii. Hungarian: 294 observations

iii. Switzerland: 123 observations

iv. Long Beach VA: 200 observations

v. Statlog (Heart) Data Set: 270 observations
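The stacking step can be pictured as below; the file names are hypothetical stand-ins for the component datasets, which are assumed to share the same 11 column names.

# Hypothetical file names for the five component datasets
files <- c("cleveland.csv", "hungarian.csv", "switzerland.csv",
           "long_beach_va.csv", "statlog.csv")

# Read each file and stack the rows into one data frame
parts <- lapply(files, read.csv)
heart <- do.call(rbind, parts)

# Drop exact duplicate rows so each patient appears once
heart <- unique(heart)
nrow(heart)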

3.2 Data Pre-Processing

Data preprocessing is a data mining step that turns raw data into a format that is useful and easy to work with. The first stage of this process is data cleaning, which was carried out in this study using outlier detection: unreasonable data points were removed before any analysis was performed. Data transformation is another preprocessing technique used in this report. The dependent variable, HeartDisease, is a binary indicator of whether a patient suffers from heart disease; it is stored as a factor and was transformed to numeric form before the analysis was conducted. Furthermore, the first six rows of the data were displayed in the R code, and likewise the last six rows were displayed in the output. Lastly, of the 918 patients involved in the study, 508 have heart disease and 410 do not.
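Because the outlier-removal step does not appear in the code appendix, the sketch below illustrates the kind of cleaning described here; the 1.5 × IQR rule is an assumption for illustration, not necessarily the rule applied in the study.

heart <- read.csv("heart disease.csv")

# Flag unreasonable Cholesterol values with the 1.5*IQR rule (an assumed rule)
q <- quantile(heart$Cholesterol, c(0.25, 0.75))
iqr <- q[2] - q[1]
keep <- heart$Cholesterol >= q[1] - 1.5 * iqr &
heart$Cholesterol <= q[2] + 1.5 * iqr
heart_clean <- heart[keep, ]

# Ensure the binary outcome is numeric 0/1 rather than a factor
heart_clean$HeartDisease <- as.numeric(as.character(heart_clean$HeartDisease))

# Inspect the first and last rows and the class balance
head(heart_clean); tail(heart_clean)
table(heart_clean$HeartDisease)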

4.1 Results

4.1.1 Data Exploration

Fig 2: Summary statistics of all variables

The screenshot above summarizes all variables used in this study, reporting the minimum, maximum, mean, and other statistics for each one. The minimum age across all patients is 28, while the maximum age is 77.

Fig 3: Bar plot of heart disease

The graph above visualizes the proportion of patients with heart disease and those without it. According to the graph, patients with heart disease are more heavily represented in this study than those without.

Fig 4: Heart disease distribution in relation to sex.

The bar chart above shows that there are more females without heart disease than with it, while there are more males with heart disease than without it.

Fig 5: Plot of RestingBP against cholesterol level.

The ggplot2 package in R was used to visualize the relationship between patients' RestingBP and cholesterol levels together with their heart disease status. The graph shows that patients with heart disease (HeartDisease = 1) and cholesterol levels between 200 and 400 mostly have a resting BP between 100 and 150.

4.1.2 Logistic Regression

A logistic regression supervised machine learning algorithm was used to train on the heart disease dataset and detect anomalies in it. The predicted probabilities were visualized with a histogram, as shown in figure 6 below, and plotted against sex and chest pain type with a line plot, as attached in figure 7. Lastly, the accuracy with which this machine learning algorithm predicted anomalies in the dataset is shown in figure 8.

Fig 6: Histogram of predicted probabilities.

From the graph above, we can conclude that the predicted probabilities follow a bimodal distribution, which suggests that the population contains two distinct groups of patients.
Fig 7: Predicted probability against sex and chest pain type.

From the graph above, it can be inferred that the predicted probabilities of male patients were relatively higher than those of female patients. Also, patients with the ASY chest pain type have higher probabilities than those with the NAP chest pain type.

Fig 8: Accuracy for logistic regression

Accuracy is a metric that shows how well the model performs across all classes as a whole. It works well when all classes carry the same weight, and it is computed by dividing the number of correct predictions by the total number of predictions. By this measure, the logistic regression model predicts anomalies in this dataset with 42.38% accuracy.
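In code, the calculation reduces to comparing predicted and actual labels; the short sketch below uses hypothetical 0/1 vectors rather than the study's data.

# Hypothetical predicted and actual 0/1 labels
predicted <- c(1, 0, 1, 1, 0, 1)
actual <- c(1, 0, 0, 1, 1, 1)

# Accuracy = correct predictions / total predictions
mean(predicted == actual)

# Equivalently, from a confusion matrix: sum of the diagonal over the total
cm <- table(actual, predicted)
sum(diag(cm)) / sum(cm)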

4.1.3 Decision Tree

The second supervised learning algorithm used in this study is the decision tree. Figure 9 shows the fitted tree, which indicates how the outcome variable is segregated. Figure 10 shows the anomaly plot, and figure 11 shows the model accuracy, which is compared with that of the logistic regression to determine which model best predicts anomalies in this dataset.

Fig 9: Decision tree

From the decision tree above, we can observe the splits used to separate anomalous from normal records in the heart failure dataset used in this study.

Fig 10: Anomaly segments in heart disease.

Furthermore, the anomaly segmentation graph above is plotted for the patients' cholesterol levels, and the marker in the graph separates the correctly predicted anomalies from the incorrect ones.

Fig 11: Accuracy for decision tree

From the screenshot above, the accuracy of the decision tree in predicting heart disease anomalies is 87.01%, which makes it the better model for predicting heart disease anomalies according to this study.

5.0 Conclusion

Heart disease prediction systems are a valuable way to track and monitor a patient's health, and they help lower mortality even among young and middle-aged people. A heart disease prediction framework can be a useful tool: since most people do not take their regular health checks seriously, it helps them know how their health is doing.

Moreover, many people lack important information about their health status. With the results of a heart disease prediction framework at hand, people with heart disease can see how the disease is progressing and stop it from getting worse in its early stages. Then, based on the different parameters examined, patients can be given clinical advice that fits their needs.

Furthermore, how well heart disease prediction systems can diagnose people depends on how well the proposed model can recognize diagnostic patterns. So, the more accurate the model, the more reliably heart disease can be predicted.

Lastly, this study applied a supervised approach to detecting anomalies in heart patients. After the basic features were extracted, both logistic regression and a decision tree were used as machine learning algorithms to predict heart disease anomalies. Based on this study, the decision tree model is the better predictor of heart disease anomalies because it is more accurate (87.01%).

References

R-Code and Output

##Import Dataset
heart=read.csv("heart disease.csv")

##Data Exploration
#View the first 6 rows of the data
head(heart)

## Age Sex ChestPainType RestingBP Cholesterol FastingBS RestingECG MaxHR


## 1 40 M ATA 140 289 0 Normal 172
## 2 49 F NAP 160 180 0 Normal 156
## 3 37 M ATA 130 283 0 ST 98
## 4 48 F ASY 138 214 0 Normal 108
## 5 54 M NAP 150 195 0 Normal 122
## 6 39 M NAP 120 339 0 Normal 170
## ExerciseAngina Oldpeak ST_Slope HeartDisease
## 1 N 0.0 Up 0
## 2 N 1.0 Flat 1
## 3 N 0.0 Up 0
## 4 Y 1.5 Flat 1
## 5 N 0.0 Up 0
## 6 N 0.0 Up 0

#View the last 6 rows of the data


tail(heart)

## Age Sex ChestPainType RestingBP Cholesterol FastingBS RestingECG MaxHR


## 913 57 F ASY 140 241 0 Normal 123
## 914 45 M TA 110 264 0 Normal 132
## 915 68 M ASY 144 193 1 Normal 141
## 916 57 M ASY 130 131 0 Normal 115
## 917 57 F ATA 130 236 0 LVH 174
## 918 38 M NAP 138 175 0 Normal 173
## ExerciseAngina Oldpeak ST_Slope HeartDisease
## 913 Y 0.2 Flat 1
## 914 N 1.2 Flat 1
## 915 N 3.4 Flat 1
## 916 Y 1.2 Flat 1
## 917 N 0.0 Flat 1
## 918 N 0.0 Up 0

#Check distribution
table(heart$HeartDisease)

##
## 0 1
## 410 508

#Dataset summary
summary(heart)

## Age Sex ChestPainType RestingBP


## Min. :28.00 Length:918 Length:918 Min. : 0.0
## 1st Qu.:47.00 Class :character Class :character 1st Qu.:120.0
## Median :54.00 Mode :character Mode :character Median :130.0
## Mean :53.51 Mean :132.4
## 3rd Qu.:60.00 3rd Qu.:140.0
## Max. :77.00 Max. :200.0
## Cholesterol FastingBS RestingECG MaxHR
## Min. : 0.0 Min. :0.0000 Length:918 Min. : 60.0
## 1st Qu.:173.2 1st Qu.:0.0000 Class :character 1st Qu.:120.0
## Median :223.0 Median :0.0000 Mode :character Median :138.0
## Mean :198.8 Mean :0.2331 Mean :136.8
## 3rd Qu.:267.0 3rd Qu.:0.0000 3rd Qu.:156.0
## Max. :603.0 Max. :1.0000 Max. :202.0
## ExerciseAngina Oldpeak ST_Slope HeartDisease
## Length:918 Min. :-2.6000 Length:918 Min. :0.0000
## Class :character 1st Qu.: 0.0000 Class :character 1st Qu.:0.0000
## Mode :character Median : 0.6000 Mode :character Median :1.0000
## Mean : 0.8874 Mean :0.5534
## 3rd Qu.: 1.5000 3rd Qu.:1.0000
## Max. : 6.2000 Max. :1.0000

#Check minimum age


min(heart['Age'])

## [1] 28

#Check Maximum age


max(heart['Age'])

## [1] 77

#Bar plot representation of the dataset

barplot(prop.table(table(heart$HeartDisease)),
col = rainbow(2),
ylim = c(0,1),
main = "Heart Disease Distribution")

table(heart$HeartDisease,heart$Sex)

##
## F M
## 0 143 267
## 1 50 458

barplot(prop.table(table(heart$HeartDisease,heart$Sex)),
col = rainbow(2),
ylim = c(0,1),
main = "Heart Disease Distribution in relation to Sex")

##Partitioning the data into training (67%) and test sets
library(caret)
trainIndex <- createDataPartition(heart$HeartDisease, p = .67,
list = FALSE)
Train <- heart[ trainIndex,]
Test <- heart[-trainIndex,]

#Using Logistic Regression Model

xtabs(~ HeartDisease + Sex + ChestPainType, data = Train) # table of the categorical outcome

## , , ChestPainType = ASY
##
## Sex
## HeartDisease F M
## 0 15 43
## 1 24 232
##
## , , ChestPainType = ATA
##
## Sex
## HeartDisease F M
## 0 40 64
## 1 3 14

##
## , , ChestPainType = NAP
##
## Sex
## HeartDisease F M
## 0 34 70
## 1 3 42
##
## , , ChestPainType = TA
##
## Sex
## HeartDisease F M
## 0 6 11
## 1 1 14

library(ggplot2)
plot=ggplot(heart, aes(x=Cholesterol , y= RestingBP, color=HeartDisease)) +
geom_line()

plot

ggplot(heart, aes(x=HeartDisease, fill=Sex))+geom_bar()

#Logistic model
logit <- glm(HeartDisease ~. , data=Train, family = "binomial")

summary(logit)

##
## Call:
## glm(formula = HeartDisease ~ ., family = "binomial", data = Train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5152 -0.3788 0.1675 0.4335 2.4863
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.989939 1.682574 -0.588 0.556299
## Age 0.021660 0.016289 1.330 0.183603
## SexM 1.296778 0.353791 3.665 0.000247 ***
## ChestPainTypeATA -2.004780 0.393109 -5.100 3.40e-07 ***
## ChestPainTypeNAP -1.960296 0.322925 -6.070 1.28e-09 ***
## ChestPainTypeTA -1.377171 0.524808 -2.624 0.008687 **
## RestingBP 0.003231 0.006914 0.467 0.640313
## Cholesterol -0.002988 0.001316 -2.270 0.023201 *
## FastingBS 1.238916 0.338516 3.660 0.000252 ***
## RestingECGNormal -0.077002 0.338497 -0.227 0.820048
## RestingECGST -0.207983 0.436988 -0.476 0.634112

## MaxHR -0.007913 0.006197 -1.277 0.201635
## ExerciseAnginaY 1.063538 0.312326 3.405 0.000661 ***
## Oldpeak 0.257931 0.154006 1.675 0.093972 .
## ST_SlopeFlat 1.423039 0.549891 2.588 0.009658 **
## ST_SlopeUp -0.747844 0.589562 -1.268 0.204628
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 849.89 on 615 degrees of freedom
## Residual deviance: 392.30 on 600 degrees of freedom
## AIC: 424.3
##
## Number of Fisher Scoring iterations: 6

confint(logit) # CIs using profiled log-likelihood

## 2.5 % 97.5 %
## (Intercept) -4.301714604 2.3163638549
## Age -0.010144017 0.0539063326
## SexM 0.614337550 2.0042508787
## ChestPainTypeATA -2.800097689 -1.2532156231
## ChestPainTypeNAP -2.607555459 -1.3382156244
## ChestPainTypeTA -2.420584328 -0.3550049216
## RestingBP -0.010380010 0.0169941989
## Cholesterol -0.005609866 -0.0004363128
## FastingBS 0.588016304 1.9183234888
## RestingECGNormal -0.742802443 0.5872607420
## RestingECGST -1.066933078 0.6509881278
## MaxHR -0.020123276 0.0042255621
## ExerciseAnginaY 0.452323099 1.6797466920
## Oldpeak -0.041534283 0.5632453062
## ST_SlopeFlat 0.289978734 2.4676977468
## ST_SlopeUp -1.953546666 0.3751598925

#Wald Test
library(aod)
wald.test(b = coef(logit), Sigma = vcov(logit), Terms = 4:6)

## Wald test:
## ----------
##
## Chi-squared test:
## X2 = 49.4, df = 3, P(> X2) = 1.1e-10

exp(cbind(OR = coef(logit), confint(logit))) # OR and 95% CI

## OR 2.5 % 97.5 %
## (Intercept) 0.3715992 0.01354531 10.1387413

## Age 1.0218958 0.98990726 1.0553857
## SexM 3.6574949 1.84843170 7.4205329
## ChestPainTypeATA 0.1346899 0.06080412 0.2855850
## ChestPainTypeNAP 0.1408167 0.07371452 0.2623133
## ChestPainTypeTA 0.2522913 0.08886967 0.7011700
## RestingBP 1.0032361 0.98967368 1.0171394
## Cholesterol 0.9970168 0.99440584 0.9995638
## FastingBS 3.4518706 1.80041340 6.8095326
## RestingECGNormal 0.9258877 0.47577870 1.7990536
## RestingECGST 0.8122209 0.34406211 1.9174346
## MaxHR 0.9921181 0.98007785 1.0042345
## ExerciseAnginaY 2.8966006 1.57195977 5.3641970
## Oldpeak 1.2942490 0.95931645 1.7563632
## ST_SlopeFlat 4.1497106 1.33639907 11.7952599
## ST_SlopeUp 0.4733860 0.14177037 1.4552241

##Likelihood Ratio Test


logLik(logit) # see the model's log-likelihood

## 'log Lik.' -196.1482 (df=16)

with(logit, null.deviance - deviance) # find the difference in deviance

## [1] 457.598

with(logit, df.null - df.residual) # the df difference equals the number of predictor variables

## [1] 15

with(logit, pchisq(null.deviance - deviance, df.null - df.residual,
lower.tail = FALSE)) # obtain the p-value

## [1] 5.136055e-88

##Predicting probability
n.n1 <- predict(logit, Test, type = "response")

# Prediction grid over RestingBP (constructed here but not used further below)
n.n2 <- with(heart, data.frame(
HeartDisease = rep(seq(from = 200, to = 800, length.out = 100), 4),
Cholesterol = mean(Cholesterol),
RestingBP = rep(seq(from = 200, to = 800, length.out = 100), 4)))

n.n3 <- cbind(heart, predict(logit, newdata = heart, type = "link",
se = TRUE)) # keep standard errors for confidence intervals later
n.n3 <- within(n.n3, {
PredictedProb <- plogis(fit)
LL <- plogis(fit - (1.96 * se.fit))
UL <- plogis(fit + (1.96 * se.fit))
})

View(n.n3)
hist(n.n3$PredictedProb, main="Histogram of predicted probabilities",
col="red")

##Obtaining the test-set predictions
library(dplyr)
# Note: HeartDisease is coded 0/1 in this dataset, so the "Yes" comparison
# below is FALSE for every row and recodes the column to 0
Test <- Test %>% mutate(model_pred = 1*(n.n1 > .53) + 0,
HeartDisease = 1*(HeartDisease == "Yes") + 0)

#Plotting the predicted probabilities


library(ggplot2)
ggplot(n.n3, aes(x = HeartDisease, y = PredictedProb)) +
geom_ribbon(aes(ymin = LL, ymax = UL, fill = ChestPainType), alpha = 0.2) +
geom_line(aes(colour = Sex), size = 1)

##Determining the model accuracy for logistic regression
Test <- Test %>% mutate(accurate = 1*(model_pred == HeartDisease))
sum(Test$accurate)/nrow(Test)

## [1] 0.4238411

##Using Decision Tree


library(rpart)
Train$HeartDisease <- factor(Train$HeartDisease)

Tree <- rpart(HeartDisease ~ ., data = Train) # the '.' uses all remaining columns as predictors

plot(Tree, main="Decision Tree of Heart Disease")

text(Tree, pretty = 2)

#Showing the Segments of the Anomalies
plot(y=heart$Cholesterol, x=heart$MaxHR,
pch = 16, col = "blue", ylab = "Cholesterol", xlab = "Max HR")
abline(v = 136.5, col = "red", lwd = 2) # vertical line separating the segments at MaxHR = 136.5

##Making predictions with the fitted tree
library(rpart)
predict_unseen <- predict(Tree, Train, type = 'class') # note: predictions are made on the training set

##Comparing the patients that did have heart disease with those that didn't
table_heart <- table(Train$HeartDisease, predict_unseen)
table_heart

## predict_unseen
## 0 1
## 0 244 39
## 1 41 292

##Estimating the accuracy from the confusion matrix


accuracy_Test <- sum(diag(table_heart)) / sum(table_heart)
print(paste('Accuracy for test', accuracy_Test))

## [1] "Accuracy for test 0.87012987012987"

