
Student Name:

Student Number:
Topic: Application of Machine Learning in the Prediction of Heart Failure Anomalies
Word Count:

1.1 Introduction

According to the World Health Organization (WHO), the leading cause of death around the world is cardiovascular disease (CVD). CVDs are estimated to kill 17.9 million people every year, which is 31% of all deaths worldwide. Four out of every five CVD deaths are caused by heart attacks and strokes, and one-third of these deaths occur prematurely in people under the age of 70 (WHO, 2020).

Cardiovascular disease covers a wide range of conditions affecting the heart and its function. Diagnosing heart disease precisely and correctly is essential, and most of the time this is the task of a clinical expert. If these diseases are caught early, and the patient's history and lifestyle are fully taken into account, cardiovascular disease can be predicted and preventive steps taken to eliminate or suppress it (Shaji, 2019; Aldallal and Al-Moosa, 2018).

Data mining has recently become popular worldwide in a wide range of fields, and medicine is no exception. Data mining techniques are good candidates for building integrated data frameworks to study cardiovascular diseases because they can uncover hidden information in raw data. Because data mining plays such a large part in improving medical services, different approaches for analysing and predicting cardiovascular disease have been proposed (Bou Rjeily et al., 2019; Singh et al., 2018).

Heartbeat classification assigns each individual heartbeat to a class based on its morphology. As previous research has noted, when a heartbeat originates in an ectopic area its shape may change; for example, a missing P wave can cause a beat to occur at the wrong time. An abnormal heartbeat can therefore be a sign of heart disease. By classifying and annotating the type of each heartbeat on the ECG, one can easily see how often cardiac abnormalities recur, and so make the right diagnosis and treatment. Heartbeat classification has two main parts: feature extraction and model training.

Furthermore, figure 1 below shows a typical system for detecting heartbeat anomalies. Noise reduction lessens how much movement of the recording device or the patient distorts the signal. Heartbeat detection locates the heartbeats so that the heart rate can be computed. Based on a detected heartbeat location, heartbeat segmentation extracts the complete heartbeat. Heartbeat classification then looks for irregularities in the shape of the heartbeat in the ECG signal, dividing heart disease abnormalities into three groups: irregular heart rate, irregular rhythm, and ectopic rhythm. Similar to heartbeat classification, irregular heart rhythm classification examines a period of the ECG record rather than a single heartbeat shape. The sections that follow discuss relevant research from the literature relating to these five subsystems.

Fig 1: Typical heartbeat anomaly detection system using an ECG
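To make the division of labour concrete, the five subsystems can be pictured as a chain of processing stages. The sketch below outlines that chain in R; every function here is a hypothetical stub operating on a toy signal, standing in for a real implementation rather than code from this study.

# Hypothetical stubs for the five subsystems shown in Fig 1
denoise <- function(ecg) ecg # noise reduction (identity stub)
detect_beats <- function(ecg) which(diff(sign(diff(ecg))) == -2) + 1 # local maxima
segment_beats <- function(ecg, peaks) # cut a window around each detected peak
lapply(peaks, function(p) ecg[max(1, p - 50):min(length(ecg), p + 50)])
classify_beat <- function(beat) sample(c("normal", "ectopic"), 1) # placeholder
classify_rhythm <- function(labels)
if (mean(labels == "ectopic") > 0.1) "irregular" else "regular"

ecg <- abs(sin(seq(0, 20 * pi, length.out = 2000))) # toy ECG-like signal
peaks <- detect_beats(denoise(ecg))
beats <- segment_beats(ecg, peaks)
labels <- vapply(beats, classify_beat, character(1))
classify_rhythm(labels)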

2.1 Machine Learning Algorithms

Logistic regression and decision trees are the two supervised machine learning algorithms used in this report. Logistic regression is a common machine learning method belonging to the supervised learning approach (Tollest et al., 2016). It is used to forecast a categorical dependent variable from a group of independent factors (Harrell & Frank, 2010). Because logistic regression predicts the output of a categorical dependent variable, the outcome must be categorical or discrete: Yes or No, 0 or 1, true or false, and so on. Rather than returning the exact values 0 and 1, however, it returns probability values that fall between 0 and 1 (Park et al., 2017).
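As an illustration, a minimal logistic regression fit in R might look like the sketch below; the data frame df, predictor x, and binary outcome y are hypothetical, not the study's data.

# Hypothetical data: one numeric predictor x and a binary outcome y
set.seed(1)
df <- data.frame(x = rnorm(100))
df$y <- rbinom(100, 1, plogis(0.5 + 1.2 * df$x))

# Fit a logistic regression with the binomial family
fit <- glm(y ~ x, data = df, family = binomial)

# Predicted probabilities lie strictly between 0 and 1
probs <- predict(fit, type = "response")
range(probs)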

The decision tree algorithm is also a member of the supervised learning family. Unlike some other supervised learning algorithms, the decision tree approach can be used to solve both regression and classification problems (Kamiński et al., 2017). The objective of a decision tree is to build a training model that predicts the class or value of the outcome variable using simple decision rules derived from past data (the training data). To predict a record's class label, a decision tree starts at the root, compares the value of the root attribute with the record's attribute, and follows the matching branch down to a leaf (Karimi & Hamilton, 2011).
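A comparable sketch for a classification tree uses the rpart package; the data here is the same hypothetical kind as in the logistic regression sketch above.

library(rpart)

# Hypothetical data, as in the logistic regression sketch
set.seed(1)
df <- data.frame(x = rnorm(100))
df$y <- factor(rbinom(100, 1, plogis(0.5 + 1.2 * df$x)))

# A factor outcome makes rpart grow a classification tree of simple rules
tree <- rpart(y ~ x, data = df, method = "class")

# Each record is routed from the root down to a leaf to get its class label
predict(tree, newdata = df, type = "class")[1:5]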

2.2 Application Area

Logistic regression can be applied to a few areas, including:

i. Predicting how likely it is that a person will have a heart attack, as in this study.

ii. Predicting how likely a customer is to buy a product or cancel a subscription.

Decision trees can also be used in the following ways:

i. They are a good way to handle nonlinear data sets.

ii. They are a machine learning tool used in engineering, civil planning, law, and business, among other fields.

3.1 Data Source

To predict anomalies in heart disease, this study used an open-source dataset from Kaggle, available at https://www.kaggle.com/datasets/fedesoriano/heart-failure-prediction. People who have cardiovascular disease, or who are at high risk of it because of one or more risk factors such as high blood pressure, diabetes, high cholesterol, or an already established disease, need to be identified and treated early, and a machine learning model can help a great deal with this. The dataset was created by combining five existing heart datasets that had never been combined before, merged on 11 common features; according to its description, this makes it the largest heart disease dataset made available for research so far. The five component datasets are listed below, followed by a sketch of how such files might be combined:

i. Cleveland: 303 observations

ii. Hungarian: 294 observations

iii. Switzerland: 123 observations

iv. Long Beach VA: 200 observations

v. Statlog (Heart) Data Set: 270 observations
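The stacking step can be pictured as below; the file names are hypothetical stand-ins for the component datasets, which are assumed to share the same 11 column names.

# Hypothetical file names for the five component datasets
files <- c("cleveland.csv", "hungarian.csv", "switzerland.csv",
           "long_beach_va.csv", "statlog.csv")

# Read each file and stack the rows into one data frame
parts <- lapply(files, read.csv)
heart <- do.call(rbind, parts)

# Drop exact duplicate rows so each patient appears once
heart <- unique(heart)
nrow(heart)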

3.2 Data Pre-Processing

Data preprocessing is a data mining step that turns raw data into a format that is useful and easy to work with. The first stage of this process is data cleaning, which was carried out in this study using outlier detection: unreasonable data points were removed before any analysis was performed. Data transformation is another preprocessing technique used in this report. The dependent variable, HeartDisease, is a binary indicator of whether a patient suffers from heart disease; it is stored as a factor and was transformed to numeric form before the analysis was conducted. Furthermore, the first six rows of the data were displayed in the R code, and likewise the last six rows were displayed in the output. Lastly, of the 918 patients involved in the study, 508 have heart disease and 410 do not.
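Because the outlier-removal step does not appear in the code appendix, the sketch below illustrates the kind of cleaning described here; the 1.5 × IQR rule is an assumption for illustration, not necessarily the rule applied in the study.

heart <- read.csv("heart disease.csv")

# Flag unreasonable Cholesterol values with the 1.5*IQR rule (an assumed rule)
q <- quantile(heart$Cholesterol, c(0.25, 0.75))
iqr <- q[2] - q[1]
keep <- heart$Cholesterol >= q[1] - 1.5 * iqr &
heart$Cholesterol <= q[2] + 1.5 * iqr
heart_clean <- heart[keep, ]

# Ensure the binary outcome is numeric 0/1 rather than a factor
heart_clean$HeartDisease <- as.numeric(as.character(heart_clean$HeartDisease))

# Inspect the first and last rows and the class balance
head(heart_clean); tail(heart_clean)
table(heart_clean$HeartDisease)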

4.1 Results

4.1.1 Data Exploration

Fig 2: Summary statistics of all variables

The screenshot above summarizes all variables used in this study, reporting the minimum, maximum, mean, and other statistics for each one. The minimum age across all patients is 28, while the maximum age is 77.

Fig 3: Bar plot of heart disease

The graph above visualizes the proportion of patients with heart disease and those without it. According to the graph, patients with heart disease are more heavily represented in this study than those without.

Fig 4: Heart disease distribution in relation to sex.

The bar chart above shows that there are more females without heart disease than with it, while there are more males with heart disease than without it.

Fig 5: Plot of RestingBP against cholesterol level.

The ggplot2 package in R was used to visualize the relationship between patients' RestingBP and cholesterol levels together with their heart disease status. The graph shows that patients with heart disease (HeartDisease = 1) and cholesterol levels between 200 and 400 mostly have a resting BP between 100 and 150.

4.1.2 Logistic Regression

A logistic regression supervised machine learning algorithm was used to train on the heart disease dataset and detect anomalies in it. The predicted probabilities were visualized with a histogram, as shown in figure 6 below, and plotted against sex and chest pain type with a line plot, as attached in figure 7. Lastly, the accuracy with which this machine learning algorithm predicted anomalies in the dataset is shown in figure 8.

Fig 6: Histogram of predicted probabilities.

From the graph above, we can conclude that the predicted probabilities follow a bimodal distribution, which suggests that the population contains two distinct groups of patients.
Fig 7: Predicted probability against sex and chest pain type.

From the graph above, it can be inferred that the predicted probabilities of male patients were relatively higher than those of female patients. Also, patients with the ASY chest pain type have higher probabilities than those with the NAP chest pain type.

Fig 8: Accuracy for logistic regression

Accuracy is a metric that shows how well the model performs across all classes as a whole. It works well when all classes carry the same weight, and it is computed by dividing the number of correct predictions by the total number of predictions. By this measure, the logistic regression model predicts anomalies in this dataset with 42.38% accuracy.
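In code, the calculation reduces to comparing predicted and actual labels; the short sketch below uses hypothetical 0/1 vectors rather than the study's data.

# Hypothetical predicted and actual 0/1 labels
predicted <- c(1, 0, 1, 1, 0, 1)
actual <- c(1, 0, 0, 1, 1, 1)

# Accuracy = correct predictions / total predictions
mean(predicted == actual)

# Equivalently, from a confusion matrix: sum of the diagonal over the total
cm <- table(actual, predicted)
sum(diag(cm)) / sum(cm)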

4.1.3 Decision Tree

The second supervised learning algorithm used in this study is the decision tree. Figure 9 shows the fitted tree, which indicates how the outcome variable is segregated. Figure 10 shows the anomaly plot, and figure 11 shows the model accuracy, which is compared with that of the logistic regression to determine which model best predicts anomalies in this dataset.

Fig 9: Decision tree

From the decision tree above, we can observe the splits used to separate anomalous from normal records in the heart failure dataset used in this study.

Fig 10: Anomaly segments in heart disease.

Furthermore, the anomaly segmentation graph above is plotted for the patients' cholesterol levels, and the marker in the graph separates the correctly predicted anomalies from the incorrect ones.

Fig 11: Accuracy for decision tree

From the screenshot above, the accuracy of the decision tree in predicting heart disease anomalies is 87.01%, which makes it the better model for predicting heart disease anomalies according to this study.

5.0 Conclusion

Heart disease prediction systems are a valuable way to track and monitor a patient's health, and they help lower mortality even among young and middle-aged people. A heart disease prediction framework can be a useful tool: since most people do not take their regular health checks seriously, it helps them know how their health is doing.

Moreover, many people lack important information about their health status. With the results of a heart disease prediction framework at hand, people with heart disease can see how the disease is progressing and stop it from getting worse in its early stages. Then, based on the different parameters examined, patients can be given clinical advice that fits their needs.

Furthermore, how well heart disease prediction systems can diagnose people depends on how well the proposed model can recognize diagnostic patterns. So, the more accurate the model, the more reliably heart disease can be predicted.

Lastly, this study applied a supervised approach to detecting anomalies in heart patients. After the basic features were extracted, both logistic regression and a decision tree were used as machine learning algorithms to predict heart disease anomalies. Based on this study, the decision tree model is the better predictor of heart disease anomalies because it is more accurate (87.01%).

References

R-Code and Output

##Import Dataset
heart=read.csv("heart disease.csv")

##Data Exploration
#View the first 6 rows of the data
head(heart)

## Age Sex ChestPainType RestingBP Cholesterol FastingBS RestingECG MaxHR


## 1 40 M ATA 140 289 0 Normal 172
## 2 49 F NAP 160 180 0 Normal 156
## 3 37 M ATA 130 283 0 ST 98
## 4 48 F ASY 138 214 0 Normal 108
## 5 54 M NAP 150 195 0 Normal 122
## 6 39 M NAP 120 339 0 Normal 170
## ExerciseAngina Oldpeak ST_Slope HeartDisease
## 1 N 0.0 Up 0
## 2 N 1.0 Flat 1
## 3 N 0.0 Up 0
## 4 Y 1.5 Flat 1
## 5 N 0.0 Up 0
## 6 N 0.0 Up 0

#View the last 6 rows of the data


tail(heart)

## Age Sex ChestPainType RestingBP Cholesterol FastingBS RestingECG MaxHR


## 913 57 F ASY 140 241 0 Normal 123
## 914 45 M TA 110 264 0 Normal 132
## 915 68 M ASY 144 193 1 Normal 141
## 916 57 M ASY 130 131 0 Normal 115
## 917 57 F ATA 130 236 0 LVH 174
## 918 38 M NAP 138 175 0 Normal 173
## ExerciseAngina Oldpeak ST_Slope HeartDisease
## 913 Y 0.2 Flat 1
## 914 N 1.2 Flat 1
## 915 N 3.4 Flat 1
## 916 Y 1.2 Flat 1
## 917 N 0.0 Flat 1
## 918 N 0.0 Up 0

#Check distribution
table(heart$HeartDisease)

##
## 0 1
## 410 508

#Dataset summary
summary(heart)

## Age Sex ChestPainType RestingBP


## Min. :28.00 Length:918 Length:918 Min. : 0.0
## 1st Qu.:47.00 Class :character Class :character 1st Qu.:120.0
## Median :54.00 Mode :character Mode :character Median :130.0
## Mean :53.51 Mean :132.4
## 3rd Qu.:60.00 3rd Qu.:140.0
## Max. :77.00 Max. :200.0
## Cholesterol FastingBS RestingECG MaxHR
## Min. : 0.0 Min. :0.0000 Length:918 Min. : 60.0
## 1st Qu.:173.2 1st Qu.:0.0000 Class :character 1st Qu.:120.0
## Median :223.0 Median :0.0000 Mode :character Median :138.0
## Mean :198.8 Mean :0.2331 Mean :136.8
## 3rd Qu.:267.0 3rd Qu.:0.0000 3rd Qu.:156.0
## Max. :603.0 Max. :1.0000 Max. :202.0
## ExerciseAngina Oldpeak ST_Slope HeartDisease
## Length:918 Min. :-2.6000 Length:918 Min. :0.0000
## Class :character 1st Qu.: 0.0000 Class :character 1st Qu.:0.0000
## Mode :character Median : 0.6000 Mode :character Median :1.0000
## Mean : 0.8874 Mean :0.5534
## 3rd Qu.: 1.5000 3rd Qu.:1.0000
## Max. : 6.2000 Max. :1.0000

#Check minimum age


min(heart['Age'])

## [1] 28

#Check Maximum age


max(heart['Age'])

## [1] 77

#Bar plot representation of the dataset

barplot(prop.table(table(heart$HeartDisease)),
col = rainbow(2),
ylim = c(0,1),
main = "Heart Disease Distribution")

table(heart$HeartDisease,heart$Sex)

##
## F M
## 0 143 267
## 1 50 458

barplot(prop.table(table(heart$HeartDisease,heart$Sex)),
col = rainbow(2),
ylim = c(0,1),
main = "Heart Disease Distribution in relation to Sex")

##Partitioning the data into training (67%) and test sets
library(caret)
trainIndex <- createDataPartition(heart$HeartDisease, p = .67,
list = FALSE)
Train <- heart[ trainIndex,]
Test <- heart[-trainIndex,]

#Using Logistic Regression Model

xtabs(~ HeartDisease + Sex + ChestPainType, data = Train) # table of the categorical outcome

## , , ChestPainType = ASY
##
## Sex
## HeartDisease F M
## 0 15 43
## 1 24 232
##
## , , ChestPainType = ATA
##
## Sex
## HeartDisease F M
## 0 40 64
## 1 3 14

##
## , , ChestPainType = NAP
##
## Sex
## HeartDisease F M
## 0 34 70
## 1 3 42
##
## , , ChestPainType = TA
##
## Sex
## HeartDisease F M
## 0 6 11
## 1 1 14

library(ggplot2)
plot=ggplot(heart, aes(x=Cholesterol , y= RestingBP, color=HeartDisease)) +
geom_line()

plot

ggplot(heart, aes(x=HeartDisease, fill=Sex))+geom_bar()

#Logistic model
logit <- glm(HeartDisease ~. , data=Train, family = "binomial")

summary(logit)

##
## Call:
## glm(formula = HeartDisease ~ ., family = "binomial", data = Train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5152 -0.3788 0.1675 0.4335 2.4863
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.989939 1.682574 -0.588 0.556299
## Age 0.021660 0.016289 1.330 0.183603
## SexM 1.296778 0.353791 3.665 0.000247 ***
## ChestPainTypeATA -2.004780 0.393109 -5.100 3.40e-07 ***
## ChestPainTypeNAP -1.960296 0.322925 -6.070 1.28e-09 ***
## ChestPainTypeTA -1.377171 0.524808 -2.624 0.008687 **
## RestingBP 0.003231 0.006914 0.467 0.640313
## Cholesterol -0.002988 0.001316 -2.270 0.023201 *
## FastingBS 1.238916 0.338516 3.660 0.000252 ***
## RestingECGNormal -0.077002 0.338497 -0.227 0.820048
## RestingECGST -0.207983 0.436988 -0.476 0.634112

## MaxHR -0.007913 0.006197 -1.277 0.201635
## ExerciseAnginaY 1.063538 0.312326 3.405 0.000661 ***
## Oldpeak 0.257931 0.154006 1.675 0.093972 .
## ST_SlopeFlat 1.423039 0.549891 2.588 0.009658 **
## ST_SlopeUp -0.747844 0.589562 -1.268 0.204628
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 849.89 on 615 degrees of freedom
## Residual deviance: 392.30 on 600 degrees of freedom
## AIC: 424.3
##
## Number of Fisher Scoring iterations: 6

confint(logit) # CIs using profiled log-likelihood

## 2.5 % 97.5 %
## (Intercept) -4.301714604 2.3163638549
## Age -0.010144017 0.0539063326
## SexM 0.614337550 2.0042508787
## ChestPainTypeATA -2.800097689 -1.2532156231
## ChestPainTypeNAP -2.607555459 -1.3382156244
## ChestPainTypeTA -2.420584328 -0.3550049216
## RestingBP -0.010380010 0.0169941989
## Cholesterol -0.005609866 -0.0004363128
## FastingBS 0.588016304 1.9183234888
## RestingECGNormal -0.742802443 0.5872607420
## RestingECGST -1.066933078 0.6509881278
## MaxHR -0.020123276 0.0042255621
## ExerciseAnginaY 0.452323099 1.6797466920
## Oldpeak -0.041534283 0.5632453062
## ST_SlopeFlat 0.289978734 2.4676977468
## ST_SlopeUp -1.953546666 0.3751598925

#Wald Test
library(aod)
wald.test(b = coef(logit), Sigma = vcov(logit), Terms = 4:6)

## Wald test:
## ----------
##
## Chi-squared test:
## X2 = 49.4, df = 3, P(> X2) = 1.1e-10

exp(cbind(OR = coef(logit), confint(logit))) # OR and 95% CI

## OR 2.5 % 97.5 %
## (Intercept) 0.3715992 0.01354531 10.1387413

## Age 1.0218958 0.98990726 1.0553857
## SexM 3.6574949 1.84843170 7.4205329
## ChestPainTypeATA 0.1346899 0.06080412 0.2855850
## ChestPainTypeNAP 0.1408167 0.07371452 0.2623133
## ChestPainTypeTA 0.2522913 0.08886967 0.7011700
## RestingBP 1.0032361 0.98967368 1.0171394
## Cholesterol 0.9970168 0.99440584 0.9995638
## FastingBS 3.4518706 1.80041340 6.8095326
## RestingECGNormal 0.9258877 0.47577870 1.7990536
## RestingECGST 0.8122209 0.34406211 1.9174346
## MaxHR 0.9921181 0.98007785 1.0042345
## ExerciseAnginaY 2.8966006 1.57195977 5.3641970
## Oldpeak 1.2942490 0.95931645 1.7563632
## ST_SlopeFlat 4.1497106 1.33639907 11.7952599
## ST_SlopeUp 0.4733860 0.14177037 1.4552241

##Likelihood Ratio Test


logLik(logit) # see the model's log-likelihood

## 'log Lik.' -196.1482 (df=16)

with(logit, null.deviance - deviance) # find the difference in deviance

## [1] 457.598

with(logit, df.null - df.residual) # the df difference equals the number of predictor variables

## [1] 15

with(logit, pchisq(null.deviance - deviance, df.null - df.residual,
lower.tail = FALSE)) # obtain the p-value

## [1] 5.136055e-88

##Predicting probability
n.n1 <- predict(logit, Test, type = "response")

# Prediction grid over RestingBP (constructed here but not used further below)
n.n2 <- with(heart, data.frame(
HeartDisease = rep(seq(from = 200, to = 800, length.out = 100), 4),
Cholesterol = mean(Cholesterol),
RestingBP = rep(seq(from = 200, to = 800, length.out = 100), 4)))

n.n3 <- cbind(heart, predict(logit, newdata = heart, type = "link",
se = TRUE)) # keep standard errors for confidence intervals later
n.n3 <- within(n.n3, {
PredictedProb <- plogis(fit)
LL <- plogis(fit - (1.96 * se.fit))
UL <- plogis(fit + (1.96 * se.fit))
})

View(n.n3)
hist(n.n3$PredictedProb, main="Histogram of predicted probabilities",
col="red")

##Obtaining the test-set predictions
library(dplyr)
# Note: HeartDisease is coded 0/1 in this dataset, so the "Yes" comparison
# below is FALSE for every row and recodes the column to 0
Test <- Test %>% mutate(model_pred = 1*(n.n1 > .53) + 0,
HeartDisease = 1*(HeartDisease == "Yes") + 0)

#Plotting the predicted probabilities


library(ggplot2)
ggplot(n.n3, aes(x = HeartDisease, y = PredictedProb)) +
geom_ribbon(aes(ymin = LL, ymax = UL, fill = ChestPainType), alpha = 0.2) +
geom_line(aes(colour = Sex), size = 1)

##Determining the model accuracy for logistic regression
Test <- Test %>% mutate(accurate = 1*(model_pred == HeartDisease))
sum(Test$accurate)/nrow(Test)

## [1] 0.4238411

##Using Decision Tree


library(rpart)
Train$HeartDisease <- factor(Train$HeartDisease)

Tree <- rpart(HeartDisease ~ ., data = Train) # the '.' uses all remaining columns as predictors

plot(Tree, main="Decision Tree of Heart Disease")

text(Tree, pretty = 2)

#Showing the Segments of the Anomalies
plot(y=heart$Cholesterol, x=heart$MaxHR,
pch = 16, col = "blue", ylab = "Cholesterol", xlab = "Max HR")
abline(v = 136.5, col = "red", lwd = 2) # vertical line separating the segments at MaxHR = 136.5

##Making predictions with the fitted tree
library(rpart)
predict_unseen <- predict(Tree, Train, type = 'class') # note: predictions are made on the training set

##Comparing the patients that did have heart disease with those that didn't
table_heart <- table(Train$HeartDisease, predict_unseen)
table_heart

## predict_unseen
## 0 1
## 0 244 39
## 1 41 292

##Estimating the accuracy from the confusion matrix


accuracy_Test <- sum(diag(table_heart)) / sum(table_heart)
print(paste('Accuracy for test', accuracy_Test))

## [1] "Accuracy for test 0.87012987012987"

