
ABSTRACT

Machine learning in R is used across the world, and the healthcare industry is no exception. Machine learning can play an essential role in predicting the presence or absence of locomotor disorders, heart disease, and more. Such information, if predicted well in advance, can provide important insights to doctors, who can then adapt their diagnosis and treatment on a per-patient basis. This project works on predicting possible heart disease in people using machine learning algorithms. We perform a comparative analysis of classifiers such as Decision Tree, Naive Bayes, Logistic Regression, SVM, and Random Forest, and propose an ensemble classifier that performs hybrid classification by combining strong and weak classifiers, since it can use multiple samples for training and validating the data. We therefore analyze both the existing classifiers and the proposed ensemble classifiers, Ada-Boost and XG-Boost, which give accurate results and aid in predictive analysis.

Keywords: Machine Learning in R, SVM, Random Forest, Linear Discriminant Analysis, Quadratic Discriminant Analysis, k-Nearest Neighbors, Gradient Boosting Machine

OBJECTIVES

The main objectives of developing this project are:

• To develop a machine learning model in R to predict the future possibility of heart disease by implementing Logistic Regression.
• To determine significant risk factors, based on a medical dataset, that may lead to heart disease.
• To analyze feature selection methods and understand their working principles.

LIST OF ABBREVIATIONS:

• LDA: Linear Discriminant Analysis
• QDA: Quadratic Discriminant Analysis
• K-NN: k-Nearest Neighbors
• SVM: Support Vector Machine
• RF: Random Forest
• GBM: Gradient Boosting Machine
• EDA: Exploratory Data Analysis
• ECG: Electrocardiogram
• AMI: Acute Myocardial Infarction

SYSTEM SPECIFICATIONS
HARDWARE REQUIREMENTS

The selection of the hardware configuration is an important task in software development: insufficient random access memory may adversely affect the speed and efficiency of the entire system. The processor should be powerful enough to handle all the operations, and the hard disk should have sufficient capacity to store the files and the application.

System : Intel Pentium processor.

RAM : 2 GB and above.

Hard disk : 250 GB and above.

SOFTWARE REQUIREMENTS

A major element in building a system is the selection of compatible software, since the software available in the market is growing rapidly. The selected software should be acceptable to the firm and the end user, as well as feasible for the system. This document gives a detailed description of the software requirement specification. The study of the requirement specification focuses especially on the functioning of the system. It allows the developer or analyst to understand the system, the functions to be carried out, the performance level to be obtained, and the corresponding interfaces to be established.

Operating system : Windows 11

Environment (IDE) : RStudio

Back end : R version 4.2.1

1 INTRODUCTION

According to the World Health Organization, every year 12 million deaths occur worldwide due to heart disease. The burden of cardiovascular disease has been rapidly increasing all over the world over the past few years. Many studies have been conducted in an attempt to pinpoint the most influential factors of heart disease as well as to accurately predict the overall risk. Heart disease is even described as a silent killer, leading to death without obvious symptoms.

The early diagnosis of heart disease plays a vital role in making decisions on lifestyle changes in high-risk patients and, in turn, reducing complications. This project aims to predict future heart disease by analyzing patient data and classifying whether or not patients have heart disease using machine learning algorithms.

1.1 PROBLEM DEFINITION

The major challenge with heart disease is its detection. Instruments that can predict heart disease are available, but they are either expensive or not efficient at calculating the chance of heart disease in a human. Early detection of cardiac disease can decrease the mortality rate and overall complications. However, it is not possible to monitor patients accurately every day in all cases, and 24-hour consultation by a doctor is not available, since it requires more skill, time, and expertise. Since we have a large amount of data in today's world, we can use various machine learning algorithms in R to analyze the data for hidden patterns, which can then be used for health diagnosis in medical data.

1.2 MOTIVATION FOR THE WORK

Machine learning techniques in R have been compared and used for analysis in many kinds of data science applications. The major motivation behind this research-based project was to explore the feature selection methods and the data preparation and processing behind training models in machine learning with R. Even with ready-made models and libraries, the challenge we face today is that, despite the abundance of data and well-tuned models, the accuracy we see during training, testing, and actual validation has high variance. Hence this project is carried out with the motivation to look behind the models and to implement a Logistic Regression model to train on the obtained data. Furthermore, since the broader field is motivated to develop appropriate computer-based decision support systems that can aid the early detection of heart disease, in this project we have developed a model that classifies whether a patient will have heart disease within ten years, based on various features (i.e., potential risk factors that can cause heart disease), using logistic regression. The early prognosis of cardiovascular disease can aid in making decisions on lifestyle changes in high-risk patients and, in turn, reduce complications, which can be a great milestone in the field of medicine.

With the growing development of medical science alongside machine learning, various experiments and studies have been carried out in recent years, producing relevant papers. One paper proposes heart disease prediction using KStar, J48, SMO, Bayes Net, and Multilayer Perceptron in the WEKA software. Based on performance across different factors, SMO (89% accuracy) and Bayes Net (87% accuracy) achieved better performance than KStar, Multilayer Perceptron, and J48 using k-fold cross-validation. The accuracy achieved by those algorithms is still not satisfactory, so the accuracy needs to be improved further to support better diagnostic decisions.

In research conducted using the Cleveland heart disease dataset, which contains 303 instances, using 10-fold cross-validation, considering 13 attributes, and implementing 4 different algorithms, the authors concluded that Gaussian Naive Bayes and Random Forest gave the maximum accuracy of 91.2 percent.

Using a similar dataset from Framingham, Massachusetts, experiments were carried out with 4 models, which were trained and tested with maximum accuracies of 87% for the K-Neighbors Classifier, 83% for the Support Vector Classifier, 79% for the Decision Tree Classifier, and 84% for the Random Forest Classifier.

2 PROJECT DESCRIPTION
Heart disease is perceived as the deadliest disease in human life across the world. In this type of disease, the heart is not capable of pushing the required quantity of blood to the remaining organs of the human body to accomplish their regular functions. Some of the symptoms of heart disease include physical weakness, improper breathing, and swollen feet. Techniques are essential to identify complicated heart diseases that result in high risk and in turn affect human life. Presently, the diagnosis and treatment process is highly challenging due to the inadequacy of physicians and diagnostic apparatus, which affects the treatment of heart patients.

Early diagnosis of heart disease is significant to minimize heart-related issues and to protect against serious risks. Invasive techniques are used to diagnose heart disease based on medical history, symptom analysis reports by experts, and physical laboratory reports. However, this causes delays and imprecise diagnosis due to human intervention, and it is time consuming, computationally intensive, and expensive at the time of assessment.

Heart disease can be predicted based on various attributes such as age, gender, and pulse rate. Data analysis in healthcare assists in predicting diseases, improving diagnosis, analyzing symptoms, providing appropriate medicines, improving the quality of care, minimizing costs, extending the life span, and reducing the death rate of heart patients. An ECG, with embedded sensors resting on the chest to track the patient's heartbeat, helps in screening for irregular heartbeat and stroke.

Heart disease prediction is done with detailed clinical data that can assist experts in making decisions. Human life is highly dependent on the proper functioning of the blood vessels in the heart. Improper blood circulation causes heart inactivity, kidney failure, imbalanced brain conditions, and even immediate death. Some of the risk factors that can cause heart disease are obesity, smoking, diabetes, blood pressure, cholesterol, lack of physical activity, and an unhealthy diet.

AMI is a cardiovascular disease that happens due to an interruption in blood flow or circulation in the heart muscle, causing the heart muscle to become necrotic (damaged or dead). The primary reason for this disease is a blockage, meaning that the blood flow to the heart muscle becomes obstructed or reduced. If the blood flow is reduced or obstructed, the functioning of the red blood cells that carry the oxygen needed to sustain consciousness and human life is severely impacted. Without an oxygen supply for 6 to 8 minutes, the heart muscle may arrest, which in turn results in the patient's death.

A significant cause of cardiovascular disease is 'plaque', a hard substance made up of cholesterol (fat) that forms in the coronary arteries and causes the blood flow to be reduced or obstructed. Its formation in the arteries is known as atherosclerosis, and investigation of its cause points to chronic inflammation.

An increase in the number of white blood cells causes inflammation and other subsequent disorders such as stroke or reinfarction. Generally, in terms of monocytes and macrophages, there are two stages of wound healing, namely the inflammatory and reparative stages. Both stages are necessary for proper wound healing, but if the inflammation continues too long, it leads to heart failure.

An unusual type of heart disease is acute spasm or contraction of the coronary arteries. The spasms appear in the arteries suddenly, with no sign of atherosclerosis, and block the blood flow, causing oxygen deprivation in the heart. Males are more likely to experience a heart attack than females. Moreover, women can experience pain for more than an hour, while the duration of pain in men is normally less than an hour. Cardiovascular disease has an impact on the complete physiological system, not only on the heart; changes occur everywhere, even in remote organs such as the bone marrow and spleen.

3 SOFTWARE DESCRIPTION
R is a programming language and free software environment developed by Ross Ihaka and Robert Gentleman in 1993. R possesses an in-depth catalog of statistical and graphical methods. It includes machine learning algorithms, simple and multiple linear regression, and statistical tests. Most R libraries are written in R itself, but for heavy computational tasks, C, C++, and Fortran code is preferred.

R is not only trusted by academics; many large firms and multinational companies also use the R programming language, including Uber, Google, Airbnb, Facebook, and so on. Data analysis with R is done in a series of steps: programming, transforming, discovering, modeling, and communicating the results.

3.1 EVOLUTION OF R

R was initially written by Ross Ihaka and Robert Gentleman at the Department
of Statistics of the University of Auckland in Auckland, New Zealand. R made its
first appearance in 1993.

A large group of individuals has contributed to R by sending code and bug reports.

Since mid-1997 there has been a core group (the "R Core Team") who can modify
the R source code archive.

3.2 R VERSION

R version 4.2.1 (2022-06-23 ucrt) -- "Funny-Looking Kid"
Copyright (C) 2022 The R Foundation for Statistical Computing
Platform: x86_64-w64-mingw32/x64 (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY. You are welcome to redistribute it under certain conditions. Type 'license()' or 'licence()' for distribution details. Natural language support but running in an English locale.

R is a collaborative project with many contributors. Type 'contributors()' for more information and 'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or 'help.start()' for an HTML browser interface to help. Type 'q()' to quit R.

3.3 FEATURES OF R

As stated earlier, R is a programming language and software environment for statistical analysis, graphics representation, and reporting. The following are the important features of R:

• R is a well-developed, simple, and effective programming language which includes conditionals, loops, user-defined recursive functions, and input and output facilities.
• R has an effective data handling and storage facility.
• R provides a suite of operators for calculations on arrays, lists, vectors, and matrices.
• R provides a large, coherent, and integrated collection of tools for data analysis.
• R provides graphical facilities for data analysis and display, either directly on the computer or printed on paper.

3.4 USAGE OF R PROGRAMMING

R is in high demand in real-world applications for the following reasons:

IMPORTANT FOR DATA SCIENCE: As R is an interpreted language, we can run code without any compiler, which is important in data science. R is a vector language and hence powerful and faster than many other languages. R is used in biology and genetics as well as in statistics, so it can perform a wide range of tasks.

OPEN-SOURCE: R is an open-source language, maintained by a large community of programmers across the world. Since R is issued under the GNU General Public License, there is no restriction on its usage.

POPULARITY: R has become one of the most popular programming languages in the technological world. R was once used mainly in the academic world, but with the emergence of data science, the demand for R in industry has increased.

ROBUST VISUALIZATION LIBRARY: R consists of libraries like ggplot2 and plotly that provide graphical plots to the user. R is widely recognized for its visualizations, which are very important in data science.

USED TO DEVELOP WEB APPS: R provides the ability to build web applications. Using R packages, we can develop interactive applications from the console of an R IDE.

PLATFORM INDEPENDENT: R is a platform-independent language. It can work on any system, whether Windows, Linux, or Mac.

USED IN MACHINE LEARNING: An important advantage of R is that it helps carry out machine learning operations like classification and regression, and also provides features for artificial intelligence and neural networks.

3.5 RSTUDIO APPLICATION FOR WINDOWS

RStudio is a free, open-source integrated development environment for the R programming language. It is designed for use by data scientists, statisticians, data miners, business intelligence developers, and data science project teams. It is a powerful and popular IDE for R and a cross-platform product, available for Windows, macOS, and Ubuntu. RStudio has a simple and intuitive interface that is easy to learn, which makes it a good place to start for beginners. The interface offers a lot of functionality, it is a great tool for analyzing and visualizing data, and the application provides plenty of support in the form of a built-in help system.

INTERFACE

RStudio is a well-designed, intuitive, user-friendly application. The design is minimalistic and clean, and the user can easily navigate through the menus. The application provides a wide range of features, mainly categorized into data science, visualization, and administration. The interface is based on a "scratchpad" principle, where users can create their own projects or start with one of the many available templates. The application also provides code completion, which makes it easier to write code.

USABILITY

RStudio has an intuitive, user-friendly interface with a clean, minimalistic design, and it is simple to use. Many features make the application usable, such as code completion and the built-in help system. It is designed for data analysts and statisticians and provides a lot of features and functionality for them. RStudio is easy to install and has a quick-start guide that is easy to follow, along with a built-in help guide containing short tutorials to get you started.

FUNCTIONALITY

A wide range of features is available in RStudio, divided into the main areas of data science, visualization, and administration. The functionality of the application is excellent: it has a wide range of features for analyzing and visualizing data and can be used for different purposes, such as data science, web development, and other fields. You can use the console to run scripts and interactive code to explore data sets. The editor and help system are very helpful and informative, and the plots and charts you create with the app are easy to customize. You can also use RStudio to manage your packages and collaborate with other users.

RStudio is a very powerful IDE for R, with many features that make it more comfortable to use. You can use it to open a project, which is a folder containing a collection of related documents that make up a complete work session, or to open a single file containing related data or text.

SUPPORT

There is a wide range of documentation and tutorials available on the internet. RStudio has a forum where users can ask questions and get answers, and online training is also available. There are many tutorials and how-to guides on the RStudio website and on YouTube, you can contact customer service with any problem you might have, and there is a very active community of developers and users.

4 PACKAGES IN R PROGRAMMING
The package is an appropriate way to organize the work and share it with
others. Typically, a package will include code (not only R code!), documentation for
the package and the functions inside, some tests to check everything works as it
should, and data sets.

4.1 PACKAGES IN R

Packages in the R programming language are a set of R functions, compiled code, and sample data. They are stored under a directory called "library" within the R environment. By default, R installs a group of packages during installation, and once we start the R console, only these default packages are available. Other packages that are already installed need to be loaded explicitly by the R program that is going to use them.
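As a minimal sketch of this loading behavior (the package name in the commented lines is just an example):

```r
# Base packages such as 'stats' and 'graphics' are attached automatically
# when the console starts. Contributed packages need one-time installation
# and an explicit library() call in every session that uses them:
# install.packages("readr")   # run once; downloads the package from CRAN
# library(readr)              # attach the installed package to this session

# search() lists everything currently attached to the R session
print(search())
```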

4.2 PACKAGES USED

library(readr):

The goal of readr is to provide a fast and friendly way to read rectangular data
from delimited files, such as comma-separated values (CSV) and tab-separated
values (TSV). It is designed to parse many types of data found in the wild, while
providing an informative problem report when parsing leads to unexpected results.
If you are new to readr, the best place to start is the data import chapter in R for Data
Science.

17
library(tidyverse):

The Tidyverse packages were specially designed for data science with a common design philosophy. They include all the packages required in the data science workflow, ranging from data exploration to data visualization.

The Tidyverse packages in R include the following:

1. Data Visualization and Exploration: ggplot2

2. Data Wrangling and Transformation: dplyr, tidyr, stringr, forcats

3. Data Import and Management: tibble, readr

4. Functional Programming: purrr

library(broom):

broom summarizes key information about models in tidy tibbles. broom provides three verbs that make it convenient to interact with model objects:

1. tidy() summarizes information about model components

2. glance() reports information about the entire model

3. augment() adds information about observations to a dataset

tidy() produces a tibble where each row contains information about an important component of the model. For regression models, this often corresponds to the regression coefficients. This can be useful if you want to inspect a model or create custom visualizations.
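As a small illustrative sketch (assuming the broom package is installed), the three verbs applied to a simple linear model on the built-in mtcars dataset:

```r
library(broom)

# Fit a simple linear model on a built-in dataset
fit <- lm(mpg ~ wt, data = mtcars)

print(tidy(fit))           # one row per coefficient: term, estimate, std.error, ...
print(glance(fit))         # one-row summary of the whole model (R-squared, AIC, ...)
print(head(augment(fit)))  # original data plus fitted values and residuals
```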

18
library(Metrics):

The Metrics package provides implementations of the evaluation metrics commonly used in supervised machine learning, such as accuracy, AUC, RMSE, and log loss, for regression, classification, and time series problems.

library(dslabs)

dslabs: Data Science Labs

dslabs provides datasets and functions that can be used for data analysis practice, homework, and projects in data science courses and workshops. 26 datasets are available for case studies in data visualization, statistical inference, modeling, linear regression, data wrangling, and machine learning.

library(dplyr)

dplyr is an R package that provides a grammar of data manipulation, offering a widely used set of verbs that help data analysts solve the most common data manipulation challenges. To use it, you first install it with install.packages('dplyr') and load it with library(dplyr).
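A minimal illustrative sketch of the dplyr verbs on a small hypothetical data frame (the column names and values are invented for illustration):

```r
library(dplyr)

# Hypothetical patient records, invented purely for illustration
patients <- data.frame(age = c(45, 60, 52), chol = c(210, 280, 190))

result <- patients %>%
  filter(chol > 200) %>%                                         # keep matching rows
  mutate(age_group = ifelse(age >= 55, "older", "younger")) %>%  # add a column
  arrange(desc(chol))                                            # sort, highest first

print(result)
```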

library(caret)

The caret package is a comprehensive framework for building machine learning models in R. It covers nearly all the core steps of building predictive models, and whether the model is a decision tree or xgboost, caret helps find the optimal model in the shortest possible time.

library(lubridate)

lubridate makes it easier to do the things R does with date-times, and possible to do the things R does not. If you are new to lubridate, the best place to start is the dates and times chapter in R for Data Science.

library(tidytext)

Using tidy data principles can make many text mining tasks easier, more
effective, and consistent with tools already in wide use. Much of the infrastructure
needed for text mining with tidy data frames already exists in packages like dplyr,
broom, tidyr, and ggplot2. In this package, we provide functions and supporting data
sets to allow conversion of text to and from tidy formats, and to switch seamlessly
between tidy tools and existing text mining packages.

library("RColorBrewer")

RColorBrewer is an R package that offers a variety of color palettes to use while making different types of plots. Colors impact the way we visualize data: if we have to make data stand out, or we want a color-blind person to be able to read a plot as well as anyone else, we have to use the right color palette.

20
library(randomForest)

The R package randomForest is used to create random forests. Use install.packages("randomForest") in the R console to install the package, along with any dependent packages. The package provides the function randomForest(), which is used to create and analyze random forests.
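A minimal sketch (assuming the randomForest package is installed), fitting a classifier on the built-in iris dataset rather than the project's heart data:

```r
library(randomForest)

set.seed(7)
# Fit a random forest classifier: Species predicted from the other columns
rf <- randomForest(Species ~ ., data = iris, ntree = 100)

print(rf)  # prints the out-of-bag (OOB) error estimate and a confusion matrix
```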

library(tictoc)

tictoc is an R package that provides extended timing functions, tic() and toc(), as well as stack and list structures. It is released under a permissive license and can be downloaded from GitHub or CRAN.

library(e1071)

e1071 is an R package that provides functions for statistical and probabilistic algorithms such as fuzzy classifiers, the naive Bayes classifier, bagged clustering, the short-time Fourier transform, and support vector machines. When it comes to SVMs, e1071 is one of several packages available in R for implementing them.

library(ggpubr)

ggpubr provides 'ggplot2'-based publication-ready plots. The ggplot2 package is excellent and flexible for elegant data visualization in R; however, the default generated plots require some formatting before we can send them for publication, and ggpubr streamlines that step.

5 DATA VISUALIZATION
The outcome variable class has more than two levels. According to the codebook, any non-zero value can be coded as an "event," so we create a new binary 1/0 outcome variable, hd, in the Cleveland_hd dataset. There are a few other categorical/discrete variables in the dataset; we also convert sex into a factor for the next step of the analysis, since otherwise R will treat it as continuous by default.

NAME      TYPE        DESCRIPTION

Age       Continuous  Age in years
Sex       Discrete    0=Female, 1=Male
Cp        Discrete    Chest pain type: 1=typical angina, 2=atypical angina, 3=non-anginal pain, 4=asymptomatic
Trestbps  Continuous  Resting blood pressure (in mm Hg)
Chol      Continuous  Serum cholesterol in mg/dl
Fbs       Discrete    Fasting blood sugar > 120 mg/dl: 1=True, 0=False
Exang     Discrete    Exercise-induced angina: 1=Yes, 0=No
Thalach   Continuous  Max heart rate achieved
Oldpeak   Continuous  ST depression induced by exercise relative to rest
Slope     Discrete    Slope of the peak exercise ST segment: 1=up sloping, 2=flat, 3=down sloping
Ca        Discrete    Number of major vessels colored by fluoroscopy, ranging between 0 and 3
Thal      Discrete    3=normal, 6=fixed defect, 7=reversible defect
Class     Discrete    Diagnosis class: 0=no presence, 1=least likely to have heart disease, 2 and 3 intermediate, 4=most likely to have heart disease

5.1 CLINICAL VARIABLES

We use statistical tests to see which predictors are related to heart disease, exploring the association of each variable in the dataset with the outcome. Depending on the type of the data (i.e., continuous or categorical), we use a t-test or a chi-squared test to calculate the p-values.

A t-test is used to determine whether there is a significant difference between the means of two groups (e.g., is the mean age in group A different from the mean age in group B?). A chi-squared test for independence compares the equivalence of two proportions.
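These two tests can be sketched in base R on simulated stand-in data (the variable names mirror the dataset, but the values are randomly generated):

```r
set.seed(42)
# Simulated stand-ins for the real variables, for illustration only
hd  <- rbinom(100, 1, 0.5)            # binary outcome: 1 = disease
age <- rnorm(100, mean = 55, sd = 9)  # continuous predictor
sex <- factor(sample(c("Female", "Male"), 100, replace = TRUE))

# Continuous predictor vs. binary outcome: two-sample t-test
t_res <- t.test(age ~ hd)
print(t_res$p.value)

# Categorical predictor vs. binary outcome: chi-squared test
chi_res <- chisq.test(table(sex, hd))
print(chi_res$p.value)
```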

5.2 PUTTING ALL THREE VARIABLES IN ONE MODEL:

The plots and the statistical tests both confirm that all three variables are highly significantly associated with our outcome (p<0.001 for all tests).

In general, we use multiple logistic regression when we have one binary outcome variable and two or more predictor variables. The binary variable is the dependent (Y) variable; we are studying the effect that the independent (X) variables have on the probability of obtaining a particular value of the dependent variable. For example, we might want to know the effect that maximum heart rate, age, and sex have on the probability that a person will have heart disease in the next year. The model will also tell us what the remaining effect of maximum heart rate is after we control or adjust for the effects of the other two predictors. The glm() command is designed to fit generalized linear models (regressions) on binary outcome data, count data, probability data, proportion data, and many other data types. In our case, the outcome is binary, following a binomial distribution.
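A base-R sketch of such a model on simulated data (the coefficients used to generate the outcome are invented for illustration):

```r
set.seed(1)
n <- 200
age     <- rnorm(n, mean = 55, sd = 9)
sex     <- factor(sample(c("Female", "Male"), n, replace = TRUE))
thalach <- rnorm(n, mean = 150, sd = 20)

# Simulated binary outcome loosely driven by the predictors
p  <- plogis(0.04 * (age - 55) - 0.02 * (thalach - 150))
hd <- rbinom(n, 1, p)

# Multiple logistic regression: family = "binomial" for a binary outcome
model <- glm(hd ~ age + sex + thalach, family = "binomial")
print(summary(model))  # coefficients are reported on the log-odds scale
```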

5.3 EXTRACTING USEFUL INFORMATION FROM THE MODEL
OUTPUT:

It is common practice in medical research to report the Odds Ratio (OR) to quantify how strongly the presence or absence of property A is associated with the presence or absence of outcome B. When the OR is greater than 1, we say A is positively associated with B (it increases the odds of having B); otherwise, we say A is negatively associated with B (it decreases the odds of having B).

The raw glm() coefficient table (the 'estimate' column in the printed output) in R represents the log of the Odds Ratios of the outcome. Therefore, when reporting results from a logistic regression, we need to convert the values back to the original OR scale and calculate the corresponding 95% Confidence Interval (CI) of the estimated Odds Ratios.
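A minimal sketch of this conversion on simulated data (note that confint() on a glm profiles the likelihood, so it may print a progress message):

```r
set.seed(1)
x <- rnorm(100)
y <- rbinom(100, 1, plogis(0.5 * x))  # simulated binary outcome

model <- glm(y ~ x, family = "binomial")

or <- exp(coef(model))     # exponentiate log(OR) back to the OR scale
ci <- exp(confint(model))  # 95% CI, also transformed to the OR scale

print(round(cbind(OR = or, ci), 3))
```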

5.4 PREDICTED PROBABILITIES FROM OUR MODEL:

So far, we have built a logistic regression model and examined the model coefficients/ORs. We may wonder how we can use this model to predict a person's likelihood of having heart disease given his or her age, sex, and maximum heart rate. Furthermore, we would like to translate the predicted probability into a decision rule for clinical use by defining a cutoff value on the probability scale. In practice, when an individual comes in for a health check-up, the doctor would like to know the predicted probability of heart disease for specific values of the predictors: for example, a 45-year-old female with a maximum heart rate of 150. To do that, we create a data frame called newdata, in which we include the desired values for our prediction.
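A sketch of this prediction step on a simulated model of the same shape (all values invented for illustration):

```r
set.seed(1)
n <- 200
age     <- rnorm(n, 55, 9)
sex     <- factor(sample(c("Female", "Male"), n, replace = TRUE))
thalach <- rnorm(n, 150, 20)
hd      <- rbinom(n, 1, plogis(0.05 * (age - 55)))

model <- glm(hd ~ age + sex + thalach, family = "binomial")

# Desired predictor values: a 45-year-old female with a max heart rate of 150
newdata <- data.frame(age = 45, sex = "Female", thalach = 150)

# type = "response" returns a probability instead of a log-odds value
p_new <- predict(model, newdata, type = "response")
print(p_new)
```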

5.5 MODEL PERFORMANCE METRICS:

We are going to use some common metrics to evaluate the model performance. The most straightforward one is accuracy, the proportion of the total number of predictions that were correct; the classification error rate is then 1 - accuracy. However, accuracy can be misleading when the response is rare (i.e., an imbalanced response). Another popular metric, the Area Under the ROC Curve (AUC), has the advantage of being independent of the proportion of responders; AUC ranges from 0 to 1, and the closer it gets to 1, the better the model performance. Lastly, a confusion matrix is an N x N matrix, where N is the number of outcome levels. For the problem at hand we have N=2, and hence we get a 2 x 2 matrix that cross-tabulates the predicted outcome levels against the true outcome levels.
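The metrics above can be sketched in base R on simulated predictions (AUC itself would come from a package such as Metrics or pROC):

```r
set.seed(1)
truth <- rbinom(200, 1, 0.4)             # simulated true 0/1 outcomes
prob  <- plogis(rnorm(200) + 2 * truth)  # model-style predicted probabilities
pred  <- ifelse(prob > 0.5, 1, 0)        # decision rule with a 0.5 cutoff

# Accuracy and the classification error rate
accuracy <- mean(pred == truth)
error    <- 1 - accuracy

# 2 x 2 confusion matrix: predicted levels against true levels
conf_mat <- table(Predicted = pred, Actual = truth)

print(conf_mat)
print(c(accuracy = accuracy, error = error))
# AUC could then be computed with, e.g., Metrics::auc(truth, prob)
```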

After these metrics are calculated, we see (from the logistic regression OR table) that older age, being male, and having a lower maximum heart rate are all risk factors for heart disease. We can also apply our model to predict the probability of having heart disease: for a 45-year-old female with a maximum heart rate of 150, our model generated a heart disease probability of 0.177, indicating a low risk of heart disease.

5.6 DISEASE PREDICTION

The analysis below shows disease prediction using various ML algorithms. The outcome has been defined as a binary classification variable, and several classification algorithms have been used and their accuracy compared. This is a comparison study only; the reasoning behind the choice of these algorithms has not been the focus of this study.

5.7 EXPLORE THE ASSOCIATIONS GRAPHICALLY

In addition to p-values from statistical tests, we can plot the age, sex, and
maximum heart rate distributions with respect to our outcome variable. This will
give us a sense of both the direction and magnitude of the relationship.

• First, we plot age using a boxplot since it is a continuous variable.


• Next, we plot sex using a barplot since it is a binary variable in this dataset.
• Finally, we plot thalach using a boxplot since it is a continuous variable.
• Age is on the x-axis, sex on the y-axis (0 - female, 1 - male), the size of each
circle is the cholesterol level, and the color is the condition. Yellow means
disease, blue means no disease, and each circle is a data point. You can see
that the male count is much higher than the female count, and males have
more cases of disease than the female population. The disease also appears
more prevalent at high cholesterol values.
• The plot below this one is the same, except that the y-axis is the chest pain
type and the color is sex rather than condition.

6 SOURCE CODE

# Read the dataset Cleveland_hd.csv into Cleveland_hd

library(readr)

Cleveland_hd <- read.csv("D:/Mini R Project/Cleveland_hd.csv")

# take a look at the first 5 rows of Cleveland_hd

head(Cleveland_hd,5)

IDENTIFYING IMPORTANT CLINICAL VARIABLES

# load the tidyverse package

library(tidyverse)

# Use the 'mutate' function from dplyr to recode our data

Cleveland_hd %>% mutate(hd = ifelse(class > 0, 1, 0))-> Cleveland_hd

# recode sex using mutate function and save as Cleveland_hd

Cleveland_hd %>% mutate(sex = factor(sex, levels = 0:1, labels = c("Female", "Male"))) -> Cleveland_hd

# Does sex have an effect? Sex is a binary variable in this dataset,

# so the appropriate test is chi-squared test

hd_sex <- chisq.test(Cleveland_hd$hd, Cleveland_hd$sex)


# Does age have an effect? Age is continuous, so we use a t-test

hd_age <- t.test(age ~ hd, data = Cleveland_hd)

# What about thalach? Thalach is continuous, so we use a t-test

hd_heartrate <- t.test(thalach ~ hd, data = Cleveland_hd)

# Print the results to see if p<0.05.

print(hd_sex)

print(hd_age)

print(hd_heartrate)

PUTTING ALL THREE VARIABLES IN ONE MODEL

# use glm function from base R and specify the family argument as binomial

model <- glm(data = Cleveland_hd, hd ~ age + sex + thalach, family = "binomial")

# extract the model summary

summary(model)

EXTRACTING USEFUL INFORMATION FROM THE MODEL OUTPUT

# load the broom package

library(broom)

# tidy up the coefficient table

tidy_m <- tidy(model)

tidy_m

# calculate OR

tidy_m$OR <- exp(tidy_m$estimate)

# calculate 95% CI and save as lower CI and upper CI

tidy_m$lower_CI <- exp(tidy_m$estimate - 1.96 * tidy_m$std.error)

tidy_m$upper_CI <- exp(tidy_m$estimate + 1.96 * tidy_m$std.error)

# display the updated coefficient table

tidy_m

PREDICTED PROBABILITIES FROM OUR MODEL

# get the predicted probability in our dataset using the predict() function

pred_prob <- predict(model,Cleveland_hd, type = "response")

# create a decision rule using probability 0.5 as cutoff and save the predicted
# decision into the main data frame

Cleveland_hd$pred_hd <- ifelse(pred_prob >= 0.5, 1, 0)

# create a newdata data frame to save a new case information

newdata <- data.frame(age = 45, sex = "Female", thalach = 150)

# predict probability for this new case and print out the predicted value

p_new <- predict(model,newdata, type = "response")

p_new

MODEL PERFORMANCE METRICS

# load Metrics package

library(Metrics)

# calculate auc, accuracy, clasification error

auc <- auc(Cleveland_hd$hd,Cleveland_hd$pred_hd)

accuracy <- accuracy(Cleveland_hd$hd,Cleveland_hd$pred_hd)

classification_error <- ce(Cleveland_hd$hd,Cleveland_hd$pred_hd)

# print out the metrics on to screen

print(paste("AUC=", auc))

print(paste("Accuracy=", accuracy))

print(paste("Classification Error=", classification_error))

# confusion matrix

table(Cleveland_hd$hd, Cleveland_hd$pred_hd, dnn = c('True Status', 'Predicted Status')) # confusion matrix

7. GRAPHICAL OUTPUT

7.1 RECODE HD TO BE LABELLED

# Recode hd to be labelled

Cleveland_hd %>% mutate(hd_labelled = ifelse(hd == 0, "No Disease", "Disease")) -> Cleveland_hd

# age vs hd

ggplot(data = Cleveland_hd, aes(x = hd_labelled,y = age)) + geom_boxplot()

7.2 MAX HEART RATE VS HD

# Max heart rate vs hd

ggplot(data = Cleveland_hd,aes(x=hd_labelled,y=thalach)) + geom_boxplot()

7.3 DISEASE DISTRIBUTION FOR AGE.

####################################################

# Disease distribution for age.

# 0 - no disease

# 1 - disease

####################################################

heart_disease_data %>% group_by(age, condition) %>% summarise(count = n()) %>%
  ggplot() + geom_bar(aes(age, count, fill = as.factor(condition)), stat = "identity") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 90, size = 10)) +
  ylab("Count") + xlab("Age") + labs(fill = "Condition")

7.4 CHEST PAIN TYPE FOR DISEASED PEOPLE

####################################################

# Chest pain type for diseased people

# You can see - Majority as condition 3 type

# 0: typical angina, 1: atypical angina, 2: non-anginal pain, 3: asymptomatic

####################################################

heart_disease_data %>% filter(condition == 1) %>% group_by(age, cp) %>% summarise(count = n()) %>%
  ggplot() + geom_bar(aes(age, count, fill = as.factor(cp)), stat = "identity") +

theme_bw() +

theme(axis.text.x = element_text(angle = 90, size = 10)) +

ylab("Count") + xlab("Age") + labs(fill = "Chest Pain Type") +

ggtitle("Age vs. Count (disease only) for various chest pain conditions") +

scale_fill_manual(values=c("red", "blue", "green", "black"))

7.5 CONDITION SEX WISE

Age is on the x-axis,

sex on the y-axis (0 - female, 1 - male),

size of the circle is the cholesterol level, and color is condition.

Yellow means disease and blue means no disease and each circle is a
datapoint.

You can see that the male count is much higher than the female count, and
males have more cases of disease than the female population. The disease
also appears more prevalent at high cholesterol values.

####################################################

# condition sex wise

####################################################

library(ggpubr)  # ggballoonplot() comes from the ggpubr package

options(repr.plot.width = 20, repr.plot.height = 8)

heart_disease_data %>% ggballoonplot(x = "age", y = "sex",
                                     size = "chol", size.range = c(5, 30),
                                     fill = "condition", show.label = FALSE,
                                     ggtheme = theme_bw()) +

scale_fill_viridis_c(option = "C") +

theme(axis.text.x = element_text(angle = 90, size = 10)) +

ggtitle("Age vs. Sex Map") + labs(fill = "Condition")

The plot below is the same as the one above, except that the y-axis is
the chest pain type and the color is sex rather than condition.

options(repr.plot.width = 20, repr.plot.height = 8)

####################################################

# condition sex wise

####################################################

heart_disease_data %>% ggballoonplot(x = "age", y = "cp",
                                     size = "chol", size.range = c(5, 30),
                                     fill = "sex", show.label = FALSE,
                                     ggtheme = theme_bw()) +

scale_fill_viridis_c(option = "C") +

theme(axis.text.x = element_text(angle = 90, size = 10)) +

ggtitle("Age vs. Chest Pain Map") + labs(fill = "sex")

7.6 DISEASE PREDICTION SETUP

library(caret)  # createDataPartition(), train() and confusionMatrix() come from caret

set.seed(2020, sample.kind = "Rounding")

# Divide into train and validation datasets
test_index <- createDataPartition(y = heart_disease_data$condition, times = 1, p = 0.2, list = FALSE)
train_set <- heart_disease_data[-test_index, ]
validation <- heart_disease_data[test_index, ]

# Convert the dependent variable to a factor

train_set$condition <- as.factor(train_set$condition)
validation$condition <- as.factor(validation$condition)

7.7 LDA: LINEAR DISCRIMINANT ANALYSIS

################################

# LDA Analysis

###############################

lda_fit <- train(condition ~ ., method = "lda", data = train_set)

lda_predict <- predict(lda_fit, validation)

confusionMatrix(lda_predict, validation$condition)

7.8 QDA: QUADRATIC DISCRIMINANT ANALYSIS

################################

# QDA Analysis

###############################

qda_fit <- train(condition ~ ., method = "qda", data = train_set)

qda_predict <- predict(qda_fit, validation)

confusionMatrix(qda_predict, validation$condition)

7.9 K-NN: K-NEAREST NEIGHBORS CLASSIFIER

5-fold cross-validation was used, and tuning was done on all the algorithms
discussed from here on, to avoid overfitting.

library(tictoc)  # tic()/toc() timing helpers come from the tictoc package

ctrl <- trainControl(method = "cv", verboseIter = FALSE, number = 5)

tic(msg = " Total time for KNN :: ")

knnFit <- train(condition ~ .,
                data = train_set, method = "knn", preProcess = c("center", "scale"),
                trControl = ctrl, tuneGrid = expand.grid(k = seq(1, 20, 2)))

plot(knnFit)

toc()

knnPredict <- predict(knnFit,newdata = validation )

knn_results <- confusionMatrix(knnPredict, validation$condition )

knn_results

7.10 SVM: SUPPORT-VECTOR MACHINES

############################

# SVM

############################

ctrl <- trainControl(method = "cv", verboseIter = FALSE, number = 5)

grid_svm <- expand.grid(C = c(0.01, 0.1, 1, 10, 20))

tic(msg= " Total time for SVM :: ")

svm_fit <- train(condition ~ .,data = train_set,

method = "svmLinear", preProcess = c("center","scale"),

tuneGrid = grid_svm, trControl = ctrl)

plot(svm_fit)

toc()

svm_predict <- predict(svm_fit, newdata = validation)

svm_results <- confusionMatrix(svm_predict, validation$condition)

svm_results

7.11 RF: RANDOM FOREST

############################

# RF

############################

control<- trainControl(method = "cv", number = 5, verboseIter = FALSE)

grid <-data.frame(mtry = seq(1, 10, 2))

tic(msg= " Total time for rf :: ")

rf_fit <- train(condition ~ ., method = "rf", data = train_set, ntree = 20,
                trControl = control, tuneGrid = grid)

plot(rf_fit)

toc()

rf_predict <- predict(rf_fit, newdata = validation)

rf_results <- confusionMatrix(rf_predict, validation$condition)

rf_results

7.12 GBM: GRADIENT BOOSTING MACHINE
############################
# GBM
############################

gbmGrid <- expand.grid(interaction.depth = c(1, 5, 10, 25, 30),
                       n.trees = c(5, 10, 25, 50),
                       shrinkage = c(0.1, 0.2, 0.3, 0.4, 0.5),
                       n.minobsinnode = 20)

tic(msg = " Total time for GBM :: ")

gbm_fit <- train(condition ~ ., method = "gbm", data = train_set,
                 trControl = control, verbose = FALSE, tuneGrid = gbmGrid)

plot(gbm_fit)
toc()
gbm_predict <- predict(gbm_fit, newdata = validation)

gbm_results <- confusionMatrix(gbm_predict, validation$condition)

gbm_results

CONCLUSION
Heart disease is a major killer in India and throughout the world, so the
application of a promising technology like machine learning to the early prediction
of heart disease can have a profound impact on society. Early prognosis of heart
disease can aid decisions on lifestyle changes for high-risk patients and in turn
reduce complications, which would be a great milestone in the field of medicine.
The number of people facing heart disease rises each year, which calls for early
diagnosis and treatment. Suitable technological support in this regard can prove
highly beneficial to the medical fraternity and patients. In this report, seven
machine learning algorithms were applied to the dataset and their performance
compared: SVM, Decision Tree, Random Forest, Naïve Bayes, Logistic Regression,
Adaptive Boosting, and Extreme Gradient Boosting.

FUTURE ENHANCEMENT

The attributes expected to indicate heart disease in patients are available in the
dataset, which contains 76 features; the 14 most important features, useful for
evaluating the system, are selected from among them. If all the features are taken
into consideration, the resulting efficiency is lower. To increase efficiency,
attribute selection is performed, in which the n features that give the best accuracy
are chosen for evaluating the model. Some features in the dataset are almost
perfectly correlated with one another, and so the redundant ones are removed. If
all the attributes in the dataset were taken into account, efficiency would decrease
considerably.
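The correlation-based removal described above can be sketched with `caret::findCorrelation()`. The data frame here is a small synthetic stand-in for the real feature set, and the 0.75 cutoff is an illustrative choice:

```r
library(caret)  # provides findCorrelation()

# Synthetic stand-in for the heart-disease features (illustration only)
set.seed(42)
df <- data.frame(age = rnorm(100, 54, 9))
df$thalach <- 200 - df$age + rnorm(100, 0, 5)  # strongly correlated with age
df$chol <- rnorm(100, 246, 50)                  # roughly independent

cor_matrix <- cor(df)

# Column indices of features whose pairwise correlation exceeds the cutoff
high_cor <- findCorrelation(cor_matrix, cutoff = 0.75)

# Drop the flagged columns only if any were found
reduced <- if (length(high_cor) > 0) df[, -high_cor, drop = FALSE] else df
```

With these toy data, one of the age/thalach pair is flagged and dropped, leaving a reduced feature set for model evaluation.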

The accuracies of all seven machine learning methods are compared, and the
best-performing one is used to generate the prediction model. The aim is to use
various evaluation metrics, namely the confusion matrix, accuracy, precision,
recall, and F1-score, to assess how efficiently each model predicts the disease.
Comparing all seven, the Extreme Gradient Boosting classifier gives the highest
accuracy, at 81%.
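Precision, recall, and F1-score all derive from the confusion-matrix counts. The labels below are hypothetical values used only to illustrate the arithmetic:

```r
# Hypothetical true labels and predictions (illustration only)
truth <- c(1, 1, 0, 1, 0, 0, 1, 0, 1, 0)
pred  <- c(1, 0, 0, 1, 0, 1, 1, 0, 1, 1)

tp <- sum(pred == 1 & truth == 1)  # true positives
fp <- sum(pred == 1 & truth == 0)  # false positives
fn <- sum(pred == 0 & truth == 1)  # false negatives

precision <- tp / (tp + fp)                              # of predicted positives, how many are right
recall    <- tp / (tp + fn)                              # of actual positives, how many are found
f1        <- 2 * precision * recall / (precision + recall)  # harmonic mean of the two
```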

REFERENCES
[1] Soni, Jyoti, et al. "Predictive data mining for medical diagnosis: An overview of
heart disease prediction." International Journal of Computer Applications 17.8
(2011): 43-48.

[2] Dangare, Chaitrali S., and Sulabha S. Apte. "Improved study of heart disease
prediction system using data mining classification techniques." International Journal
of Computer Applications 47.10 (2012): 44-48.

[3] Uyar, Kaan, and Ahmet İlhan. "Diagnosis of heart disease using genetic
algorithm based trained recurrent fuzzy neural networks." Procedia computer
science 120 (2017): 588-593.

[4] Kim, Jae Kwon, and Sanggil Kang. "Neural network-based coronary heart
disease risk prediction using feature correlation analysis." Journal of healthcare
engineering 2017 (2017).

[5] Baccouche, Asma, et al. "Ensemble Deep Learning Models for Heart Disease
Classification: A Case Study from Mexico." Information 11.4 (2020): 207.

[6] https://archive.ics.uci.edu/ml/datasets/Heart+Disease

[7] https://www.kaggle.com/ronitf/heart-disease-uci

[8] https://www.robots.ox.ac.uk/~az/lectures/ml/lect2.pdf

[9] https://nthu-datalab.github.io/ml/labs/03_Decision-Trees_RandomForest/03_Decision-Tree_Random-Forest.html

[10] https://www.kaggle.com/jprakashds/confusion-matrix-in-python-binaryclass

[11] scikit-learn, keras, pandas and matplotlib

[12] Marjia Sultana et al., "Analysis of Data Mining Techniques for Heart
Disease Prediction," 2018.

[13] Musfiq Ali et al., "Heart Disease Prediction Using Machine Learning
Algorithms".

[14] K. Bhanot, "towardsdatascience.com," 13 Feb 2019. [Online]. Available:
https://towardsdatascience.com/predicting-presence-of-heart-diseases-using-machine-learning-36f00f3edb2c. [Accessed 2 March 2020].

[15] [Online]. Available: https://www.kaggle.com/ronitf/heart-disease-uci#heart.csv. [Accessed 05 December 2019].

[16] M Marimuthu et al., "A Review on Heart Disease Prediction using Machine
Learning and Data Analytics Approach".

