OBJECTIVES
LIST OF ABBREVIATIONS:
SYSTEM SPECIFICATIONS
HARDWARE REQUIREMENTS
SOFTWARE REQUIREMENTS
1 INTRODUCTION
Early diagnosis of heart disease plays a vital role in guiding lifestyle changes for
high-risk patients, which in turn reduces complications. This project aims to predict
future heart disease by analyzing patient data and classifying, with machine-learning
algorithms, whether or not a patient has heart disease.
The major challenge in heart disease is its detection. Instruments that can predict
heart disease are available, but they are either expensive or not efficient at
estimating a person's chance of heart disease. Early detection of cardiac disease can
decrease the mortality rate and overall complications. However, it is not possible to
monitor every patient accurately every day, and round-the-clock consultation with a
doctor is not available because it demands considerable judgment, time, and expertise.
Since a large amount of data is available today, we can use machine learning
algorithms in R to analyze the data for hidden patterns. These hidden patterns can
then be used for health diagnosis from medical data.
1.2 MOTIVATION FOR THE WORK
Machine learning techniques have long been compared and applied across many kinds of
data science applications. The major motivation behind this research-based project
was to explore the feature selection, data preparation, and processing steps that sit
behind model training. Even with ready-made models and libraries, the challenge we
face today is the data: despite its abundance and our prebuilt models, the accuracy
observed during training, testing, and actual validation varies widely. Hence this
project was carried out to look behind the models and to implement a logistic
regression model trained on the obtained data.
Furthermore, since machine learning in this area is motivated by the goal of
developing computer-based decision support systems that can aid the early detection
of heart disease, in this project we have developed a logistic regression model that
classifies whether a patient will have heart disease within ten years, based on
various features (i.e. potential risk factors for heart disease). The early prognosis
of cardiovascular disease can thus aid decisions on lifestyle changes in high-risk
patients and in turn reduce complications, which can be a great milestone in the
field of medicine.
accuracy achieved by those algorithms is still not satisfactory. Improving the
accuracy further would support better decisions when diagnosing the disease.
2 PROJECT DESCRIPTION
Heart disease is regarded as the deadliest disease worldwide. In this type of disease
the heart cannot pump the required quantity of blood to the other organs of the body
to sustain their regular functions. Symptoms of heart disease include physical
weakness, breathing difficulty, and swollen feet. Techniques for identifying
complicated heart diseases are essential, because these diseases carry a high risk
and affect human life. At present, diagnosis and treatment are highly challenging
because of the shortage of physicians and diagnostic apparatus, which affects the
treatment of heart patients.
Heart disease prediction uses detailed clinical data that can assist experts in
making decisions. Human life depends heavily on the proper functioning of the blood
vessels in the heart. Improper blood circulation can cause heart inactivity, kidney
failure, imbalance in the brain, and even immediate death. Risk factors for heart
disease include obesity, smoking, diabetes, high blood pressure, high cholesterol,
lack of physical activity, and an unhealthy diet.
AMI (acute myocardial infarction) is the cardiovascular disease that occurs when the
blood flow to the heart muscle is interrupted, causing the heart muscle to become
necrotic (damaged or dead). The primary cause is a blockage that obstructs or reduces
the blood flow to the heart muscle. When the blood flow is reduced or obstructed, the
red blood cells can no longer carry enough oxygen to sustain consciousness and life.
Without oxygen supply for 6 to 8 minutes, the heart muscle may arrest, which in turn
results in the patient’s death.
An increase in the number of white blood cells causes inflammation and subsequent
disorders such as stroke or reinfarction. In terms of monocytes and macrophages,
wound healing generally has two stages: the inflammatory stage and the reparative
stage. Both stages are necessary for proper wound healing, but if the inflammation
continues too long, it leads to heart failure.
of atherosclerosis. It blocks the blood flow, which causes oxygen deprivation in the
heart. Men are more likely than women to experience a heart attack. Moreover, women
can experience pain for more than an hour, whereas in men the pain normally lasts
less than an hour. Cardiovascular disease affects the complete physiological system,
not only the heart; changes also occur in remote organs such as the bone marrow and
spleen.
3 SOFTWARE DESCRIPTION
R is a programming language and free software developed by Ross Ihaka and Robert
Gentleman in 1993. R possesses an extensive catalog of statistical and graphical
methods. It includes machine learning algorithms, linear regression, and other
statistical techniques. Most R libraries are written in R, but for computationally
heavy tasks, C, C++, and Fortran code is preferred.
3.1 EVOLUTION OF R
R was initially written by Ross Ihaka and Robert Gentleman at the Department
of Statistics of the University of Auckland in Auckland, New Zealand. R made its
first appearance in 1993.
A large group of individuals has contributed to R by sending code and bug reports.
Since mid-1997 there has been a core group (the "R Core Team") who can modify
the R source code archive.
3.2 R VERSION
Type 'demo()' for some demos, 'help()' for on-line help, or 'help.start()' for an
HTML browser interface to help. Type 'q()' to quit R.
3.3 FEATURES OF R
• R provides graphical facilities for data analysis and display, either directly on
screen or printed on paper.
USED TO DEVELOP WEB APPS: R provides the ability to build web applications. Using an
R package, we can develop interactive applications from the console of an R IDE.
INTERFACE
navigate through the menus. The application provides a wide range of features, mainly
categorized into Data Science, Visualization, and Administration. The interface is
very user-friendly and has a clean design. It is based on a "scratchpad" principle,
where users can create their own projects or start from one of the many available
templates. The application also provides code completion, which makes it easier to
write code.
USABILITY
FUNCTIONALITY
A wide range of features is available in RStudio, divided into Data Science,
Visualization, and Administration; these are the main areas of the software. The
functionality of the application is excellent: it offers many features for analyzing
and visualizing data and can be used for purposes such as data science and web
development. You can use the console to run scripts and use interactive code to
explore data sets. The editor and help system are helpful and informative. Plots and
charts created with the application are easy to customize. You can also use RStudio
to manage your packages and collaborate with other users.
RStudio is a very powerful IDE for R, with many features that make working in it
more comfortable. You can use it to open a project, which is a folder containing a
collection of related documents that make up a complete work session, or to open a
single file containing related data or text.
SUPPORT
4 PACKAGES IN R PROGRAMMING
A package is a convenient way to organize work and share it with others. Typically,
a package includes code (not only R code!), documentation for the package and the
functions inside, some tests to check that everything works as it should, and data
sets.
4.1 PACKAGES IN R
library(readr):
The goal of readr is to provide a fast and friendly way to read rectangular data
from delimited files, such as comma-separated values (CSV) and tab-separated
values (TSV). It is designed to parse many types of data found in the wild, while
providing an informative problem report when parsing leads to unexpected results.
If you are new to readr, the best place to start is the data import chapter in R for Data
Science.
library(tidyverse):
These Tidyverse packages were specially designed for Data Science with a
common design philosophy. They include all the packages required in the data
science workflow, ranging from data exploration to data visualization.
library(broom):
tidy() produces a tibble in which each row contains information about an important
component of the model. For regression models, this often corresponds to the
regression coefficients. This can be useful if you want to inspect a model or create
custom visualizations.
library(Metrics):
The Metrics package provides implementations of evaluation metrics commonly
used in supervised machine learning. In this project it is loaded before the AUC and
accuracy of the fitted model are reported.
library(dslabs)
Datasets and functions that can be used for data analysis practice, homework
and projects in data science courses and workshops. 26 datasets are available for
case studies in data visualization, statistical inference, modeling, linear regression,
data wrangling, and machine learning.
library(dplyr)
library(caret)
predictive models. Be it a decision tree or xgboost, caret helps to find the optimal
model in the shortest possible time.
library(lubridate)
Lubridate makes it easier to do the things R does with date-times and possible
to do the things R does not. If you are new to lubridate, the best place to start is the
dates and times chapter in R for Data Science.
library(tidytext)
Using tidy data principles can make many text mining tasks easier, more
effective, and consistent with tools already in wide use. Much of the infrastructure
needed for text mining with tidy data frames already exists in packages like dplyr,
broom, tidyr, and ggplot2. In this package, we provide functions and supporting data
sets to allow conversion of text to and from tidy formats, and to switch seamlessly
between tidy tools and existing text mining packages.
library("RColorBrewer")
library(randomForest)
library(tictoc)
library(e1071)
library(ggpubr)
5 DATA VISUALIZATION
The outcome variable class has more than two levels. According to the
codebook, any non-zero value can be coded as an “event.” We create a new variable
called “hd” in the Cleveland_hd data frame to represent a binary 1/0 outcome. There
are a few other categorical/discrete variables in the dataset. We also convert sex
into a ‘factor’ for
Chol Continuous Serum cholesterol in mg/dl
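The recoding described above can be sketched in base R as follows. This is a minimal illustration, not the project's own code: the six rows are invented, the column name thalach (maximum heart rate) and the coding 1 = male, 0 = female are assumptions taken from the public Cleveland codebook, and only the names Cleveland_hd, class, and hd come from this report.

```r
# Invented rows standing in for the Cleveland data (illustration only)
Cleveland_hd <- data.frame(
  age     = c(63, 67, 41, 56, 62, 57),
  sex     = c(1, 1, 0, 1, 0, 0),          # assumed coding: 1 = male, 0 = female
  thalach = c(150, 108, 172, 178, 160, 163),
  class   = c(0, 2, 0, 0, 3, 1)           # 0 = no disease, non-zero = disease
)

# Any non-zero class value is coded as an event (1), zero stays 0
Cleveland_hd$hd <- ifelse(Cleveland_hd$class > 0, 1, 0)

# Store sex as a categorical variable instead of a number
Cleveland_hd$sex <- factor(Cleveland_hd$sex, levels = 0:1,
                           labels = c("Female", "Male"))

table(Cleveland_hd$hd)
```

Keeping the original class column alongside hd makes it easy to check the recoding against the codebook.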
5.1 CLINICAL VARIABLES
Use statistical tests to see which predictors are related to heart disease. We
can explore the associations for each variable in the dataset. Depending on the type
of the data (i.e., continuous or categorical), we use t-test or chi-squared test to
calculate the p-values.
The plots and the statistical tests both confirmed that all three variables (age,
sex, and maximum heart rate) are highly significantly associated with our outcome
(p<0.001 for all tests).
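As a hedged sketch of the tests described in this section, the snippet below runs a chi-squared test for the categorical predictor and t-tests for the continuous ones. The data are simulated, so the p-values it prints are not the project's results; the column name thalach and the simulation parameters are assumptions, while the object names hd_sex, hd_age, and hd_heartrate follow the source code section.

```r
set.seed(1)
n  <- 200
hd <- rbinom(n, 1, 0.45)                      # simulated binary outcome

# Simulated predictors, loosely shifted by outcome (illustration only)
dat <- data.frame(
  hd      = hd,
  age     = round(rnorm(n, mean = 52 + 4 * hd, sd = 9)),
  thalach = round(rnorm(n, mean = 160 - 20 * hd, sd = 20)),
  sex     = factor(ifelse(rbinom(n, 1, 0.55 + 0.25 * hd) == 1, "Male", "Female"))
)

# Categorical predictor: chi-squared test of independence
hd_sex <- chisq.test(dat$sex, dat$hd)

# Continuous predictors: two-sample t-tests comparing the two outcome groups
hd_age       <- t.test(age ~ hd, data = dat)
hd_heartrate <- t.test(thalach ~ hd, data = dat)

# Print the results to see if p < 0.05
print(hd_sex$p.value)
print(hd_age$p.value)
print(hd_heartrate$p.value)
```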
5.3 EXTRACTING USEFUL INFORMATION FROM THE MODEL OUTPUT:
The raw glm coefficient table (the ‘estimate’ column in the printed output) in
R represents the log(Odds Ratios) of the outcome. Therefore, we need to convert the
values to the original OR scale and calculate the corresponding 95% Confidence
Interval (CI) of the estimated Odds Ratios when reporting results from a logistic
regression.
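The conversion from log(Odds Ratios) to Odds Ratios with a 95% Confidence Interval can be sketched as below. This is an illustration under stated assumptions: the data are simulated, the interval is the Wald interval (estimate ± 1.96 standard errors on the log scale), and base R is used in place of broom so that the block runs without extra packages. The names tidy_m and model follow the source code section; thalach is an assumed column name.

```r
set.seed(2)
n       <- 300
age     <- round(rnorm(n, 54, 9))
sex     <- factor(sample(c("Female", "Male"), n, replace = TRUE))
thalach <- round(rnorm(n, 150, 22))

# Simulated outcome (illustration only, not the Cleveland data)
lin_pred <- -1 + 0.04 * (age - 54) + 1.2 * (sex == "Male") - 0.03 * (thalach - 150)
hd       <- rbinom(n, 1, plogis(lin_pred))

model <- glm(hd ~ age + sex + thalach, family = "binomial")

# Coefficient table: the estimates are on the log(OR) scale
est    <- coef(summary(model))
tidy_m <- data.frame(
  term      = rownames(est),
  estimate  = est[, "Estimate"],
  std.error = est[, "Std. Error"],
  row.names = NULL
)

# Exponentiate to get the OR and its Wald 95% CI
tidy_m$OR       <- exp(tidy_m$estimate)
tidy_m$lower_CI <- exp(tidy_m$estimate - 1.96 * tidy_m$std.error)
tidy_m$upper_CI <- exp(tidy_m$estimate + 1.96 * tidy_m$std.error)
tidy_m
```

An OR above 1 with a CI that excludes 1 indicates a predictor associated with higher odds of the outcome.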
So far, we have built a logistic regression model and examined the model
coefficients/ORs. We may wonder how we can use this model to predict a person’s
likelihood of having heart disease given his/her age, sex, and maximum heart rate.
Furthermore, we’d like to translate the predicted probability into a decision rule
for clinical use by defining a cutoff value on the probability scale. In practice,
when an individual comes in for a health check-up, the doctor would like to know the
predicted probability of heart disease for specific values of the predictors, for
example a 45-year-old female with a max heart rate of 150. To do that, we create a
data frame called newdata, in which we include the desired values for our
prediction.
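A minimal sketch of this prediction step follows. The model is fitted to simulated data, so the probability it prints will differ from the value reported for the real dataset; the column name thalach and the names pred_prob and pred_hd are assumptions, while newdata and p_new follow the text and the source code section.

```r
set.seed(3)
n   <- 300
dat <- data.frame(
  age     = round(rnorm(n, 54, 9)),
  sex     = factor(sample(c("Female", "Male"), n, replace = TRUE)),
  thalach = round(rnorm(n, 150, 22))
)

# Simulated outcome (illustration only)
lin_pred <- -1 + 0.04 * (dat$age - 54) + 1.2 * (dat$sex == "Male") -
  0.03 * (dat$thalach - 150)
dat$hd <- rbinom(n, 1, plogis(lin_pred))

model <- glm(hd ~ age + sex + thalach, data = dat, family = "binomial")

# Predicted probability for every row of the dataset
dat$pred_prob <- predict(model, dat, type = "response")

# Decision rule: probability of 0.5 or more is classified as disease
dat$pred_hd <- ifelse(dat$pred_prob >= 0.5, 1, 0)

# New case: 45-year-old female with a max heart rate of 150
newdata <- data.frame(
  age     = 45,
  sex     = factor("Female", levels = levels(dat$sex)),
  thalach = 150
)
p_new <- predict(model, newdata, type = "response")
p_new
```

The argument type = "response" returns probabilities; without it, predict() returns values on the log-odds scale.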
5.5 MODEL PERFORMANCE METRICS:
After these metrics are calculated, we’ll see (from the logistic regression OR
table) that older age, being male, and having a lower max heart rate are all risk
factors for heart disease. We can also apply our model to predict the probability of
having heart disease. For a 45-year-old female with a max heart rate of 150, our
model generated a heart disease probability of 0.177, indicating a low risk of heart
disease.
The analysis below shows disease prediction using various ML algorithms. The outcome
has been defined as a binary classification variable, and several classification
algorithms have been used to predict it and compare their accuracy. This is only a
comparison study; the reasoning behind the choice of these algorithms has not been
its focus.
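The performance metrics named in this section (AUC, accuracy, and the confusion matrix) can be computed by hand as sketched below. The eight observations are invented for illustration, and the AUC is calculated with the rank-sum formula in base R rather than with the Metrics package, so the block runs without extra dependencies.

```r
# Invented outcomes and predicted probabilities (illustration only)
actual    <- c(1, 0, 1, 1, 0, 0, 1, 0)
pred_prob <- c(0.9, 0.2, 0.6, 0.4, 0.3, 0.7, 0.8, 0.1)

# Apply the 0.5 cutoff to obtain predicted classes
pred_class <- ifelse(pred_prob >= 0.5, 1, 0)

# AUC via the rank-sum (Mann-Whitney) formula
n_pos <- sum(actual == 1)
n_neg <- sum(actual == 0)
auc   <- (sum(rank(pred_prob)[actual == 1]) - n_pos * (n_pos + 1) / 2) /
  (n_pos * n_neg)

# Accuracy and classification error
accuracy             <- mean(pred_class == actual)
classification_error <- 1 - accuracy

print(paste("AUC=", auc))            # 0.875 for these eight rows
print(paste("Accuracy=", accuracy))  # 0.75 for these eight rows

# confusion matrix
conf_mat <- table(predicted = pred_class, actual = actual)
conf_mat
```

AUC measures how well the probabilities rank cases above non-cases, whereas accuracy depends on the chosen cutoff.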
5.7 EXPLORE THE ASSOCIATIONS GRAPHICALLY
In addition to p-values from statistical tests, we can plot the age, sex, and
maximum heart rate distributions with respect to our outcome variable. This will
give us a sense of both the direction and magnitude of the relationship.
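The graphical exploration described above might look like the following sketch. It uses base R graphics on simulated data, which is an assumption made to keep the block dependency-free; the project's own plots appear to have been built with ggplot2. The names hd_labelled and thalach are hypothetical.

```r
set.seed(4)
n  <- 200
hd <- rbinom(n, 1, 0.45)

# Simulated data (illustration only)
dat <- data.frame(
  hd_labelled = factor(hd, levels = 0:1, labels = c("No disease", "Disease")),
  age         = round(rnorm(n, mean = 52 + 4 * hd, sd = 9)),
  thalach     = round(rnorm(n, mean = 160 - 20 * hd, sd = 20)),
  sex         = factor(ifelse(rbinom(n, 1, 0.55 + 0.25 * hd) == 1, "Male", "Female"))
)

# Continuous predictors: side-by-side boxplots by outcome
age_box <- boxplot(age ~ hd_labelled, data = dat,
                   xlab = "Heart disease", ylab = "Age")
hr_box  <- boxplot(thalach ~ hd_labelled, data = dat,
                   xlab = "Heart disease", ylab = "Max heart rate")

# Categorical predictor: proportion with and without disease within each sex
sex_tab <- prop.table(table(dat$hd_labelled, dat$sex), margin = 2)
barplot(sex_tab, beside = TRUE, legend.text = TRUE,
        xlab = "Sex", ylab = "Proportion")
```

The boxplot medians show the direction of each relationship, and the gap between boxes gives a rough sense of its magnitude.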
6 SOURCE CODE
library(readr)
head(Cleveland_hd,5)
IDENTIFYING IMPORTANT CLINICAL VARIABLES
library(tidyverse)
# Print the results to see if p<0.05.
print(hd_sex)
print(hd_age)
print(hd_heartrate)
# use glm function from base R and specify the family argument as binomial
summary(model)
EXTRACTING USEFUL INFORMATION FROM THE MODEL OUTPUT
library(broom)
tidy_m
# calculate OR
tidy_m
PREDICTED PROBABILITIES FROM OUR MODEL
# get the predicted probability in our dataset using the predict() function
# create a decision rule using probability 0.5 as cutoff and save the predicted
# decision into the main data frame
# predict probability for this new case and print out the predicted value
p_new
MODEL PERFORMANCE METRICS
library(Metrics)
print(paste("AUC=", auc))
print(paste("Accuracy=", accuracy))
# confusion matrix
7. GRAPHICAL OUTPUT
# Recode hd to be labelled
# age vs hd
7.2 MAX HEART RATE VS HD
7.3 DISEASE DISTRIBUTION FOR AGE.
####################################################
# 0 - no disease
# 1 - disease
####################################################
theme_bw() +
7.4 CHEST PAIN TYPE FOR DISEASED PEOPLE
####################################################
####################################################
theme_bw() +
ggtitle("Age vs. Count (disease only) for various chest pain conditions") +
7.5 CONDITION SEX WISE
Yellow means disease and blue means no disease, and each circle is a
data point.
The male count is much higher than the female count, and the male
population has more cases of disease than the female population. Also,
the disease appears more prevalent at high cholesterol values.
####################################################
####################################################
ggtheme = theme_bw()) +
scale_fill_viridis_c(option = "C") +
The plot below is the same as the one above, except that the y-axis
is the chest pain type and the color represents sex rather than
condition.
####################################################
####################################################
ggtheme = theme_bw()) +
scale_fill_viridis_c(option = "C") +
7.6 DISEASE PREDICTION SETUP
set.seed(2020, sample.kind = "Rounding")
# Divide into train and validation dataset
test_index <- createDataPartition(y = heart_disease_data$condition, times = 1,
                                  p = 0.2, list = FALSE)
train_set <- heart_disease_data[-test_index, ]
validation <- heart_disease_data[test_index, ]
7.7 LDA: LINEAR DISCRIMINANT ANALYSIS
################################
# LDA Analysis
###############################
confusionMatrix(lda_predict, validation$condition)
7.8 QDA: QUADRATIC DISCRIMINANT ANALYSIS
################################
# QDA Analysis
###############################
confusionMatrix(qda_predict, validation$condition)
55
56
7.9 K-NN: K-NEAREST NEIGHBORS CLASSIFIER
5-fold cross-validation was used, and tuning was performed for this and all the
subsequent algorithms discussed here to avoid over-training.
plot(knnFit)
toc()
knn_results
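The 5-fold cross-validation and tuning mentioned for k-NN can be sketched by hand as follows. This is an assumption-laden illustration, not the project's code: the listing above appears to rely on caret, whereas this block uses knn() from the class package (shipped with standard R distributions) on simulated data, and the names folds, k_grid, cv_accuracy, and best_k are hypothetical.

```r
library(class)  # provides knn()

set.seed(5)
n         <- 200
condition <- factor(rbinom(n, 1, 0.45), levels = 0:1)

# Simulated predictors (illustration only)
x <- data.frame(
  age     = rnorm(n, 52 + 4 * (condition == "1"), 9),
  thalach = rnorm(n, 160 - 20 * (condition == "1"), 20)
)
# k-NN is distance based, so put predictors on a common scale.
# Scaled once here for brevity; a stricter version would scale inside each fold.
x <- scale(x)

# Assign every row to one of 5 folds at random
folds  <- sample(rep(1:5, length.out = n))
k_grid <- seq(3, 21, by = 2)

# For each candidate k, average the held-out accuracy over the 5 folds
cv_accuracy <- sapply(k_grid, function(k) {
  fold_acc <- sapply(1:5, function(f) {
    pred <- knn(train = x[folds != f, ], test = x[folds == f, ],
                cl = condition[folds != f], k = k)
    mean(pred == condition[folds == f])
  })
  mean(fold_acc)
})

knn_results <- data.frame(k = k_grid, accuracy = cv_accuracy)
best_k      <- knn_results$k[which.max(knn_results$accuracy)]
knn_results
```

Choosing k by held-out accuracy rather than training accuracy is what guards against over-training: k = 1 always fits the training data perfectly but rarely generalizes best.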
7.10 SVM: SUPPORT-VECTOR MACHINES
############################
# SVM
############################
plot(svm_fit)
toc()
svm_results
7.11 RF: RANDOM FOREST
############################
# RF
############################
rf_fit <- train(condition ~ ., method = "rf", data = train_set, ntree = 20,
                trControl = control, tuneGrid = grid)
plot(rf_fit)
toc()
rf_results
7.12 GBM: GRADIENT BOOSTING MACHINE
############################
# GBM
############################
plot(gbm_fit)
toc()
gbm_predict <- predict(gbm_fit, newdata = validation)
gbm_results
CONCLUSION
Heart disease is a major killer in India and throughout the world, so applying a
promising technology such as machine learning to the early prediction of heart
disease can have a profound impact on society. The early prognosis of heart disease
can aid decisions on lifestyle changes in high-risk patients and in turn reduce
complications, which can be a great milestone in the field of medicine. The number
of people facing heart disease is rising each year, which calls for early diagnosis
and treatment. Suitable technological support in this regard can prove highly
beneficial to the medical fraternity and to patients. In this paper, seven machine
learning algorithms were applied to the dataset and their performance measured: SVM,
Decision Tree, Random Forest, Naïve Bayes, Logistic Regression, Adaptive Boosting,
and Extreme Gradient Boosting.
FUTURE ENHANCEMENT
The attributes expected to lead to heart disease are available in the dataset, which
contains 76 features; 14 important features useful for evaluating the system are
selected from them. If all the features are taken into consideration, the efficiency
of the system decreases considerably, so attribute selection is performed to
increase efficiency: n features are selected for evaluating the model, which gives
higher accuracy. Some features in the dataset are almost equally correlated, so they
are removed.
The accuracies of all seven machine learning methods are compared, and one
prediction model is chosen on that basis. The aim is to use evaluation metrics such
as the confusion matrix, accuracy, precision, recall, and F1-score to identify the
model that predicts the disease most effectively. Of the seven, the extreme gradient
boosting classifier gives the highest accuracy, 81%.
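The evaluation metrics listed above follow directly from the four cells of the confusion matrix. The sketch below works through them on ten invented predictions; the vectors and the resulting numbers are illustrative only and are unrelated to the 81% figure reported here.

```r
# Invented true labels and predicted labels (1 = disease, 0 = no disease)
actual    <- c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0)
predicted <- c(1, 1, 0, 0, 1, 0, 0, 0, 0, 0)

# Cells of the confusion matrix
tp <- sum(predicted == 1 & actual == 1)  # true positives
fp <- sum(predicted == 1 & actual == 0)  # false positives
fn <- sum(predicted == 0 & actual == 1)  # false negatives
tn <- sum(predicted == 0 & actual == 0)  # true negatives

accuracy  <- (tp + tn) / length(actual)   # share of all predictions that are right
precision <- tp / (tp + fp)               # share of predicted cases that are real
recall    <- tp / (tp + fn)               # share of real cases that are found
f1        <- 2 * precision * recall / (precision + recall)

round(c(accuracy = accuracy, precision = precision, recall = recall, f1 = f1), 3)
```

In a screening setting recall often matters more than accuracy, since a missed case (false negative) is usually costlier than a false alarm.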