Student Number:
Topic: Application of Machine Learning in the Prediction of Heart Failure Anomaly
Word Count:
1.1 Introduction
According to the World Health Organization (WHO), the leading cause of death around the
world is cardiovascular disease (CVD). CVD kills an estimated 17.9 million people every year,
which is 31% of all deaths worldwide. Four out of every five CVD deaths are caused by heart
attacks and strokes, and one-third of these deaths occur prematurely in people under the age of 70.
Cardiovascular diseases cover a wide range of problems with the heart and how well it works.
Heart disease therefore needs to be detected precisely and accurately, and in most cases this
assessment is carried out by a clinical expert. If these conditions are caught early, and the
patient's history and lifestyle are taken fully into account, cardiovascular diseases can be
predicted and preventive steps can be taken to eliminate or suppress them (Shaji, 2019; Aldallal
and Al-Moosa, 2018).
Data mining has recently become popular worldwide across a wide range of fields, and medicine
is no exception. Data mining techniques are good candidates for building integrated data
frameworks to study cardiovascular diseases because they can help uncover hidden information
in raw data. Because data mining plays such a large role in improving medical services, different
ways to analyse and predict cardiovascular disease have been suggested (Bou
Heartbeat classification considers how individual heartbeats occur and tries to group beats of the
same kind together. As previous research has discussed, when a heartbeat originates from an
ectopic site, the shape of the heartbeat may change; for example, a missing P wave could cause a
heartbeat to occur at the wrong time. If a heartbeat is not normal, it could be a sign of heart
disease. By grouping and annotating the type of each heartbeat on the ECG, one could easily see
the recurrence of abnormalities in the heart and make the right diagnosis and treatment decisions.
Heartbeat classification has two main parts: feature extraction and model training.
Furthermore, figure 1 below shows a typical system for determining whether something is wrong
with a heartbeat. The noise reduction step tries to reduce how much movement of the recording
device or the patient distorts the signal. Heartbeat detection tries to locate the heartbeats so that
the heart rate can be computed. Based on the known heartbeat locations, heartbeat segmentation
extracts each complete heartbeat. Heartbeat classification then looks for any irregularities in the
shape of the heartbeat on the ECG signal. This is done with a system that divides heart disease
abnormalities into three groups: irregular heart rate, irregular rhythm, and ectopic rhythm.
Similar to heartbeat classification, irregular heart rhythm classification looks at a period of the
signal on the ECG record instead of just one heartbeat shape. In the sections that follow, we
discuss relevant research from the literature related to heart disease prediction.
2.1 Machine Learning Algorithms
Logistic regression and decision tree are the two supervised machine learning algorithms used in
this report. Logistic regression is a common machine learning method that belongs to the
supervised learning approach (Tollest et al., 2016). It is used to forecast a categorical dependent
variable from a group of independent factors (Harrell & Frank, 2010). Because the output of a
categorical dependent variable is being predicted, the outcome must be categorical or discrete: it
may be Yes or No, 0 or 1, true or false, and so on. However, instead of presenting the precise
values 0 and 1, logistic regression presents probability values that fall between 0 and 1.
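To illustrate, a minimal sketch of this idea in R is shown below using the built-in glm function; the data frame and variable names (toy, x1, x2, outcome) are placeholders for illustration only, not the study's actual columns, which appear in the R-Code and Output section.

# Minimal logistic regression sketch on placeholder data
set.seed(1)
toy <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
toy$outcome <- rbinom(100, 1, plogis(0.8 * toy$x1 - 0.5 * toy$x2))

# family = "binomial" gives logistic regression
fit <- glm(outcome ~ x1 + x2, data = toy, family = "binomial")

# Predictions of type "response" are probabilities between 0 and 1
probs <- predict(fit, type = "response")
head(probs)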
The Decision Tree algorithm is also a member of the supervised learning family. Unlike some
other supervised learning algorithms, the decision tree approach can be used to solve both
regression and classification problems (Kamiski et al., 2017). The objective of employing a
decision tree is to develop a training model that predicts the class or value of the outcome
variable using simple decision rules derived from past data (training data). In decision trees, the
prediction of a record's class label begins at the tree's root, where the values of the root attribute
are compared with the record's attribute (Karimi & Hamilton, 2011).
These algorithms can be applied in several ways:
i. Predicting how likely it is that a person will have a heart attack, as in the case of this study.
ii. Predicting how likely a customer is to buy a product or cancel a subscription.
iii. They are a good way to deal with nonlinear data sets.
iv. They are machine learning tools used in engineering, civil planning, law, and business.
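As an illustration of the decision-rule idea described above, the sketch below fits a small classification tree with the rpart package on R's built-in iris data; it is a generic example of the technique, not the heart disease model itself (the study's own tree code is reproduced in the R-Code and Output section).

# Minimal decision tree sketch on built-in data (not the study's dataset)
library(rpart)

# Fit a classification tree: a set of simple if/else splits on the predictors
tree_fit <- rpart(Species ~ ., data = iris, method = "class")

# A new record is classified by following the splits from the root downwards
predict(tree_fit, head(iris), type = "class")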
To predict anomalies in heart disease, this study employed an open-source heart failure
prediction dataset. People who have cardiovascular disease or are at high risk of it (because they
have one or more risk factors such as high blood pressure, diabetes, high cholesterol, or an
already established disease) need to be found and treated early, and a machine learning model
can help a great deal with this. The dataset used here was created by combining different datasets
that were already available but had never been combined before. In it, five heart datasets are
merged on the basis of 11 common features, which makes it the largest dataset available so far
for heart disease research. The five datasets that were used to put it together are:
v. Stalog (Heart) Data Set: 270 observations
Data preprocessing is a data mining step that is used to turn raw data into a format that is useful
and easy to work with. The first stage of this process is data cleaning, which was done in this
study using outlier detection: unreasonable data points were removed before any analysis was
carried out. Data transformation is another pre-processing technique used in this report. The
dependent variable, heart disease, is a binary indicator of whether a patient suffers from heart
disease; it is stored in factor form and was transformed to numeric form before the analysis was
conducted. Furthermore, the first six rows of the data were displayed in the R code, and likewise
the last six rows were displayed in the output. Lastly, of the 918 patients involved in the study,
508 have heart disease and 410 do not.
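The pre-processing described above can be expressed in a few lines of R; this is only a sketch, and the exact outlier rule applied in the study is an assumption (an interquartile-range rule on Cholesterol is used here purely for illustration).

# Illustrative preprocessing sketch (the precise outlier rule is assumed)
heart <- read.csv("heart disease.csv")

# Data cleaning: flag outliers in one numeric column with the 1.5 * IQR rule
q <- quantile(heart$Cholesterol, c(0.25, 0.75))
iqr <- q[2] - q[1]
keep <- heart$Cholesterol >= q[1] - 1.5 * iqr & heart$Cholesterol <= q[2] + 1.5 * iqr
heart_clean <- heart[keep, ]

# Data transformation: make sure the binary outcome is numeric 0/1, not a factor
heart_clean$HeartDisease <- as.numeric(as.character(heart_clean$HeartDisease))

# Inspect the first and last six rows
head(heart_clean)
tail(heart_clean)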
4.1 Results
Fig 1: Summary statistics of all variables
The screenshot above summarises all variables used in this study: it reports the minimum value,
maximum value, mean and other statistics for each variable. The minimum age across all patients
is 28, while the maximum age is 77.
The graph above visualizes the proportion of patients with heart disease and those without the
disease; patients with heart disease are more heavily represented in this study than those without.
The bar chart above shows that there are more females without heart disease than females with
it, while there are more males with heart disease than males without it.
Fig 4: Plot between RestingBP and Cholesterol level.
The ggplot2 package in R was used to visualize the relationship between patients' RestingBP and
their cholesterol level, coloured by heart disease status. The graph shows that patients with heart
disease (coded 1.0) and cholesterol levels between 200 and 400 have a resting BP between 100
and 150.
The logistic regression supervised machine learning algorithm was used to train on the heart
disease dataset and detect anomalies in it. The predicted probabilities were visualized with a
histogram, as shown in Figure 5 below, and were also plotted against sex and chest pain type
with a line plot, as attached in Figure 6 below. Lastly, the model accuracy, i.e. how correctly this
machine learning algorithm predicted the anomalies in the dataset, is shown in Figure 7 below.
Fig 5: Histogram of predicted probability.
From the graph above, we can conclude that the predicted probabilities have a bimodal
distribution, which suggests that the data contain two distinct and independent peaks.
From the graph above it can be inferred that the predicted probabilities of male respondents were
relatively higher than those of female respondents. Also, patients with the ASY chest pain type
appear to have higher predicted probabilities than patients with the other chest pain types.
Accuracy is a metric that shows how well the model works across all classes as a whole. It works
well when all classes have the same weight, and it is calculated by dividing the number of correct
predictions by the total number of predictions. Hence, the accuracy with which the logistic
regression model predicted anomalies in this dataset is shown in Figure 7 below.
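In R this definition reduces to a single line; the two vectors below are placeholders used only to show the calculation.

# Accuracy = number of correct predictions / total number of predictions
predicted <- c(1, 0, 1, 1, 0)   # placeholder predicted labels
actual    <- c(1, 0, 0, 1, 0)   # placeholder true labels
sum(predicted == actual) / length(actual)   # 4 correct out of 5 = 0.8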
The second supervised learning algorithm used in this study is the decision tree. Figure 8 shows
the decision tree, which indicates how the outcome variable is segregated. Figure 9 shows the
anomaly plot, and Figure 10 shows the model accuracy, which will be compared with that of the
logistic regression to determine which model best predicts anomalies in this dataset.
Fig 8: Decision tree
From the decision tree above, we can observe the number of correct anomalies that occur in the
heart disease dataset.
Furthermore, the anomaly segmentation graph above is plotted for the cholesterol levels of the
patients in the study, and the marker in the graph differentiates the correctly predicted anomaly
data points from the incorrect ones.
From the screenshot above, the accuracy of the decision tree in predicting heart disease
anomalies is 87.01%, which makes it the better model for predicting heart disease anomalies
according to this study.
5.0 Conclusion
Heart disease prediction mechanisms are a valuable way to track and monitor a patient's health,
and they help to lower mortality even in young and middle-aged people. A heart disease
prediction framework can be a useful tool: since most people do not take their regular health
checks seriously, it helps them understand how their health is doing.
Moreover, many people lack important information about their health status. With the results of
a heart disease prediction framework close at hand, people with heart disease can see how the
disease is progressing and stop it from getting worse at an early stage. Then, based on the
different parameters that have been examined, patients can be given clinical recommendations
that fit their needs.
Furthermore, how well heart disease prediction systems can diagnose people depends on how
well the proposed model can recognise diagnostic patterns; the more accurate the model, the
more reliable the resulting diagnosis.
Lastly, this study set out to detect anomalies in heart patients. After the basic features were
extracted, both logistic regression and a decision tree were used as machine learning algorithms
to predict heart disease anomalies. Based on this study, the decision tree model is the better way
to predict heart disease anomalies because it achieved the higher accuracy (87.01%).
References
R-Code and Output
##Import Dataset
heart=read.csv("heart disease.csv")
##Data Exploration
#View first 6 rows of the data
head(heart)
#Check distribution
table(heart$HeartDisease)
##
## 0 1
## 410 508
#Dataset summary
summary(heart)
#The two values below appear to be the minimum and maximum patient age
#(28 and 77, as stated in the Results section)
## [1] 28
## [1] 77
barplot(prop.table(table(heart$HeartDisease)),
col = rainbow(2),
ylim = c(0,1),
main = "Heart Disease Distribution")
table(heart$HeartDisease,heart$Sex)
##
## F M
## 0 143 267
## 1 50 458
barplot(prop.table(table(heart$HeartDisease,heart$Sex)),
col = rainbow(2),
ylim = c(0,1),
main = "Heart Disease Distribution in relation to Sex")
##partitioning the data into training and test set
library(caret)
trainIndex <- createDataPartition(heart$HeartDisease, p = .67,
list = FALSE)
Train <- heart[ trainIndex,]
Test <- heart[-trainIndex,]
#Cross-tabulation of HeartDisease and Sex by ChestPainType on the training set
#(presumably produced by table(Train$HeartDisease, Train$Sex, Train$ChestPainType); the call is not shown)
## , , ChestPainType = ASY
##
## Sex
## HeartDisease F M
## 0 15 43
## 1 24 232
##
## , , ChestPainType = ATA
##
## Sex
## HeartDisease F M
## 0 40 64
## 1 3 14
##
## , , ChestPainType = NAP
##
## Sex
## HeartDisease F M
## 0 34 70
## 1 3 42
##
## , , ChestPainType = TA
##
## Sex
## HeartDisease F M
## 0 6 11
## 1 1 14
library(ggplot2)
plot <- ggplot(heart, aes(x = Cholesterol, y = RestingBP, color = HeartDisease)) +
  geom_line()
plot
#Logistic model
logit <- glm(HeartDisease ~. , data=Train, family = "binomial")
summary(logit)
##
## Call:
## glm(formula = HeartDisease ~ ., family = "binomial", data = Train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5152 -0.3788 0.1675 0.4335 2.4863
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.989939 1.682574 -0.588 0.556299
## Age 0.021660 0.016289 1.330 0.183603
## SexM 1.296778 0.353791 3.665 0.000247 ***
## ChestPainTypeATA -2.004780 0.393109 -5.100 3.40e-07 ***
## ChestPainTypeNAP -1.960296 0.322925 -6.070 1.28e-09 ***
## ChestPainTypeTA -1.377171 0.524808 -2.624 0.008687 **
## RestingBP 0.003231 0.006914 0.467 0.640313
## Cholesterol -0.002988 0.001316 -2.270 0.023201 *
## FastingBS 1.238916 0.338516 3.660 0.000252 ***
## RestingECGNormal -0.077002 0.338497 -0.227 0.820048
## RestingECGST -0.207983 0.436988 -0.476 0.634112
## MaxHR -0.007913 0.006197 -1.277 0.201635
## ExerciseAnginaY 1.063538 0.312326 3.405 0.000661 ***
## Oldpeak 0.257931 0.154006 1.675 0.093972 .
## ST_SlopeFlat 1.423039 0.549891 2.588 0.009658 **
## ST_SlopeUp -0.747844 0.589562 -1.268 0.204628
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 849.89 on 615 degrees of freedom
## Residual deviance: 392.30 on 600 degrees of freedom
## AIC: 424.3
##
## Number of Fisher Scoring iterations: 6
#Confidence intervals for the coefficients (presumably produced by confint(logit); the call is not shown)
## 2.5 % 97.5 %
## (Intercept) -4.301714604 2.3163638549
## Age -0.010144017 0.0539063326
## SexM 0.614337550 2.0042508787
## ChestPainTypeATA -2.800097689 -1.2532156231
## ChestPainTypeNAP -2.607555459 -1.3382156244
## ChestPainTypeTA -2.420584328 -0.3550049216
## RestingBP -0.010380010 0.0169941989
## Cholesterol -0.005609866 -0.0004363128
## FastingBS 0.588016304 1.9183234888
## RestingECGNormal -0.742802443 0.5872607420
## RestingECGST -1.066933078 0.6509881278
## MaxHR -0.020123276 0.0042255621
## ExerciseAnginaY 0.452323099 1.6797466920
## Oldpeak -0.041534283 0.5632453062
## ST_SlopeFlat 0.289978734 2.4676977468
## ST_SlopeUp -1.953546666 0.3751598925
#Wald Test
library(aod)
wald.test(b = coef(logit), Sigma = vcov(logit), Terms = 4:6)
## Wald test:
## ----------
##
## Chi-squared test:
## X2 = 49.4, df = 3, P(> X2) = 1.1e-10
#Odds ratios with confidence intervals (presumably exp(cbind(OR = coef(logit), confint(logit))); the call is not shown)
## OR 2.5 % 97.5 %
## (Intercept) 0.3715992 0.01354531 10.1387413
## Age 1.0218958 0.98990726 1.0553857
## SexM 3.6574949 1.84843170 7.4205329
## ChestPainTypeATA 0.1346899 0.06080412 0.2855850
## ChestPainTypeNAP 0.1408167 0.07371452 0.2623133
## ChestPainTypeTA 0.2522913 0.08886967 0.7011700
## RestingBP 1.0032361 0.98967368 1.0171394
## Cholesterol 0.9970168 0.99440584 0.9995638
## FastingBS 3.4518706 1.80041340 6.8095326
## RestingECGNormal 0.9258877 0.47577870 1.7990536
## RestingECGST 0.8122209 0.34406211 1.9174346
## MaxHR 0.9921181 0.98007785 1.0042345
## ExerciseAnginaY 2.8966006 1.57195977 5.3641970
## Oldpeak 1.2942490 0.95931645 1.7563632
## ST_SlopeFlat 4.1497106 1.33639907 11.7952599
## ST_SlopeUp 0.4733860 0.14177037 1.4552241
#Overall model test: the three values below match the drop in deviance (849.89 - 392.30),
#its degrees of freedom (615 - 600) and the corresponding chi-square p-value
## [1] 457.598
## [1] 15
## [1] 5.136055e-88
##Predicting probability
n.n1 <- predict(logit, Test, type = "response")
#n.n3 is not defined in the code shown; a minimal assumption is that it is the test set
#with the predicted probabilities attached
n.n3 <- data.frame(Test, PredictedProb = n.n1)
View(n.n3)
hist(n.n3$PredictedProb, main = "Histogram of predicted probabilities",
     col = "red")
##Obtaining the predicted classes on the test data
library(dplyr)
#NOTE: HeartDisease in this dataset is already coded 0/1, so the recoding below
#(HeartDisease == "Yes") sets every value to 0 and deflates the accuracy reported further down
Test <- Test %>% mutate(model_pred = 1*(n.n1 > .53) + 0,
                        HeartDisease = 1*(HeartDisease == "Yes") + 0)
##Determining the model accuracy for logistic regression
Test <- Test %>% mutate(accurate = 1*(model_pred == HeartDisease))
sum(Test$accurate)/nrow(Test)
## [1] 0.4238411
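#Decision tree model
#The line that fits the tree is not shown above; a minimal sketch is given here,
#assuming the rpart package and the same training split used earlier
library(rpart)
Train$HeartDisease <- as.factor(Train$HeartDisease)   #treat the outcome as a class label
Tree <- rpart(HeartDisease ~ ., data = Train, method = "class")
plot(Tree)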
text(Tree, pretty = 2)
#Showing the Segments of the Anomalies
plot(y=heart$Cholesterol, x=heart$MaxHR,
pch = 16, col = "blue", ylab = "Cholesterol", xlab = "Max HR")
abline(v =136.5, col = "red", lwd = 2) #a vertical line
##Making prediction
library(rpart)
predict_unseen <- predict(Tree, Train, type = 'class')
##Testing the patients that did have heart disease and those that didn't
table_heart <- table(Train$HeartDisease, predict_unseen)
table_heart
## predict_unseen
## 0 1
## 0 244 39
## 1 41 292
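##Model accuracy for the decision tree
#A minimal sketch of the accuracy calculation: correct classifications divided by all classifications
sum(diag(table_heart)) / sum(table_heart)
#With the confusion matrix above this is (244 + 292) / 616 = 0.8701, i.e. the 87.01% quoted in the Results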