
A project report on

HEART FAILURE PREDICTION


USING HYBRID MACHINE
LEARNING TECHNIQUES

Submitted in partial fulfilment for the award of the


degree of

Master of Computer Application

by

RUPALI KUMARI
19MCA1097

School of Computer Science and Engineering (SCOPE)


DECLARATION

I hereby declare that the thesis entitled “HEART FAILURE PREDICTION USING HYBRID MACHINE LEARNING TECHNIQUES” submitted by me, for the award of the degree of Master of Computer Application at Vellore Institute of Technology (VIT), is a record of bonafide work carried out by me under the supervision of Prof. Shyamala L.

I further declare that the work reported in this thesis has not been
submitted and will not be submitted, either in part or in full, for the award
of any other degree or diploma in this institute or any other institute or
university.

Place: Chennai

Date: Signature of the Candidate


ABSTRACT
One of the major causes of death in the world is heart failure. This disease directly affects the heart's pumping function; because of this disturbance, nutrients and oxygen are not circulated and distributed properly. Heart disease has been the leading cause of death in the world over the last decade: in the United States alone, roughly one person dies of heart disease every minute. Researchers have been using several data mining techniques to help health care professionals in the diagnosis of heart disease, and using such techniques can reduce the number of tests that are required. To reduce the number of deaths from heart disease, a quick and efficient detection technique is needed. The Decision Tree is one of the effective data mining methods used. This research compares different Decision Tree classification algorithms, seeking better performance in heart disease diagnosis using WEKA. The algorithms tested are the J48 algorithm, the Logistic Model Tree algorithm and the Random Forest algorithm. The existing dataset of heart disease patients from the Cleveland database of the UCI repository is used to test and justify the performance of the Decision Tree algorithms. This dataset consists of 303 instances and 76 attributes. Subsequently, the classification algorithm with the best potential will be suggested for use on sizeable data. The goal of this study is to apply data mining techniques to extract hidden patterns that are noteworthy for heart disease, and to predict the presence of heart disease in patients, where this presence is graded from no presence to likely presence. In this paper, a multi-level risk assessment of developing heart failure is proposed, in which five risk levels of heart failure can be predicted using machine learning models, namely the Decision Tree classifier, Linear Regression, Gradient Boosting classification, the Random Forest classifier, the Support Vector Machine, K-Nearest Neighbour, etc. In addition, we boost the early prediction of heart failure by involving three main risk factors with the heart failure dataset.
Keywords: Logistic Regression, Random Forest Classifier Algorithm, KNN

ACKNOWLEDGEMENT

It is my pleasure to express my deep sense of gratitude to Prof. Shyamala L, Professor, SCOPE, Vellore Institute of Technology, for her constant guidance, continual encouragement and understanding; more than all, she taught me patience in my endeavour. My association with her is not confined to academics only; it has also been a great opportunity to work with an intellectual and expert in the field of <area>.

I would like to express my gratitude to <Chancellor>, <VPs>, <VC>, <PRO-VC>, and <Dean Name>, <School Name>, for providing an environment to work in and for their inspiration during the tenure of the course.

I wholeheartedly thank Sivagami M., Program Chair, and all the teaching staff and members of our university for their selfless enthusiasm and timely encouragement, which prompted the acquisition of the knowledge required to complete my course of study successfully. I would also like to thank my parents for their support.

It is indeed a pleasure to thank my friends who persuaded and encouraged me to take up and complete this task. Last, but not least, I express my gratitude and appreciation to all those who have helped me directly or indirectly towards the successful completion of this project.

Place: Vellore

Date: Rupali Kumari

CONTENTS

INTRODUCTION…………………………………………………………………………….1
RELATED WORK……………………………………………………………………………2
BACKGROUND……………………………………………………………………………...3
PROBLEM STATEMENT……………………………………………………………………7

METHODOLOGY……………………………………………………………………………7
• Dataset collection
• Splitting
• Classification

DATA PRE-PROCESSING…………………………………………………………………..9

CLASSIFICATION MODELLING…………………………………………………………..9

• DECISION TREES…………………………………………………………………. 10
• KNN (K-Nearest Neighbour Algorithm).……………………………………………10
• RANDOM FOREST…………………………………………………………………11
• LOGISTIC REGRESSION…………………………………………………………..12

PERFORMANCE MEASURES……………………………………………………………..13

OVERVIEW OF PROPOSED WORK………………………………………………………13

PROPOSED WORK…………………………………………………………………………13

• Feature Scaling and attribute selection………………………………………………15


• Building and training the model……………………………………………………..15
PROPOSED SOURCE CODE HEART DISEASE PREDICTION…………………………20
RESULTS EVALUATION………………………………………………………………….49

• Setup for experimentation……………………………………………………………49


• Evaluation of results………………………………………………………………….49
• Result with Random oversampling………………………………………………….50
• Benchmarking of the proposed model………………………………………………..52

CONCLUSION………………………………………………………………………………53

REFERENCES……………………………………………………………………………….55

LIST OF FIGURES

Fig 1. Proposed working architecture


Fig 2. Data Pre-Processing

Fig 3. Feature importance graph

Fig 4 Proposed Model Accuracy

Fig 5. Correlation Matrix

Fig 6. Performance comparison with various models

LIST OF TABLES

Table 1. Literature Review

Table 2. Dataset

Table 3: Heart disease dataset detailed information.

Table 4: Range and datatype of dataset’s attributes.

Table 5: Results with random oversampling.

Table 6: Overall performance

LIST OF ACRONYMS

Convolutional Neural Networks (CNN)

World Health Organization (WHO)

Waikato Environment for Knowledge Analysis (WEKA)

Multilayer Perceptron (MLP)


Machine Learning (ML)

Decision Tree (DT)

Deep Learning (DL)

Electrocardiogram (ECG)

Support Vector machines (SVM)

Random Forest (RF)

K-Nearest Neighbour (KNN)

Logistic Regression (LR)

Gaussian Naive Bayes (GNB)

True Positive (TP)

True Negative (TN)

False Positive (FP)

False Negative (FN)

INTRODUCTION
The heart is a vital organ of the human body, responsible for pumping blood to the other organs. Heart failure (HF) is a serious disorder with high prevalence: in developed countries it affects approximately 2% of the adult population and about 8% of older subjects. Moreover, the literature shows that about 3–5% of hospital admissions have a connection with HF incidents. HF diagnosis is also very costly, because in developed countries HF accounts for 2% of total health costs. Hence, the development of non-invasive methods for HF detection based on machine learning and data mining will help improve quality of life and reduce the associated medical costs. Recently, machine learning researchers have developed numerous models based on feature transformation and machine learning methods for disease detection and mortality prediction. Earlier studies developed logistic regression, C4.5, Naive Bayes, BNNF and BNND algorithms and obtained HF classification accuracies of 71%, 81.11%, 81.48%, 80.95% and 81.11% respectively. The HF classification accuracy was improved to 84.5% by Polat et al. by developing an artificial immune system, and Polat et al. proposed another novel system which further improved the HF classification accuracy to 87%. Recently, Ali et al. developed a hybrid method in which an L1-regularized SVM was hybridized with a linear discriminant classifier; their hybrid method resulted in an HF classification accuracy of 90%. Recent research has concentrated on feature transformation and selection for improved HF prediction. In this study, we search for the optimal feature extraction algorithm by evaluating the performance of different feature extraction algorithms, namely Principal Component Analysis (PCA), Sparse PCA, Kernel PCA and Incremental PCA. These algorithms are integrated with different state-of-the-art machine learning models, namely linear regression, Gaussian Naïve Bayes and linear discriminant analysis. To evaluate the performance of the developed integrated models, four different evaluation metrics are used, i.e., Matthews Correlation Coefficient (MCC), specificity, sensitivity and accuracy. The remainder of the paper has three sections: the second section briefly explains the HF database and the developed integrated models, Section III presents the simulation results and their discussion, and the last section presents the conclusion.

Working on heart disease patients' databases is comparable to a real-life application. Doctors' knowledge is used to assign a weight to each attribute: more weight is assigned to attributes with a high impact on disease prediction. Therefore, it appears reasonable to try utilizing the knowledge and experience of several specialists, collected in databases, towards assisting the diagnosis process. It also provides healthcare professionals with an extra source of knowledge for making decisions.

The healthcare industry collects large amounts of healthcare data that need to be mined to discover hidden information for effective decision making. Motivated by the worldwide increasing mortality of heart disease patients each year and the availability of a huge number of patients' records from which to extract useful knowledge, researchers have been using data mining techniques to help health care professionals in the diagnosis of heart disease (Helma, Gottmann et al. 2000). Data mining is the exploration of large datasets to extract hidden and previously unknown patterns, relationships and knowledge that are difficult to detect with traditional statistical methods (Lee, Liao et al. 2000). Thus, data mining refers to mining or extracting knowledge from large amounts of data. Data mining applications can be used for better health policy-making, prevention of hospital errors, early detection and prevention of diseases, and prevention of avoidable hospital deaths (Ruben 2009). A heart disease prediction system can assist medical professionals in predicting heart disease based on the clinical data of patients [1]. Hence, by implementing a heart disease prediction system using data mining techniques on the various heart disease attributes, it becomes possible to predict probabilistically whether a patient will be diagnosed with heart disease. This paper presents a new model that enhances Decision Tree accuracy in identifying heart disease patients; it uses different algorithms of the Decision Tree family.

RELATED WORK
Prediction of heart disease using data mining techniques has been an ongoing effort for the past two decades. Most of the papers have implemented several data mining techniques for the diagnosis of heart disease, such as Decision Tree, Naive Bayes, neural networks, kernel density, automatically defined groups, the bagging algorithm and support vector machines, showing different levels of accuracy on multiple databases of patients from around the world. One of the points on which the papers differ is the selection of parameters on which the methods are applied; many authors have specified different parameters and databases for testing the accuracies. In particular, researchers have been investigating the application of the Decision Tree technique in the diagnosis of heart disease with considerable success. Stair-Taut et al. used the Weka tool to investigate applying Naive Bayes and J48 Decision Trees to the detection of coronary heart disease. Tu et al. used the bagging algorithm in the Weka tool and compared it with the J4.8 Decision Tree in the diagnosis of heart disease. In another work, the decision-making process for heart disease is handled by a Random Forest algorithm: based on the probability output of the decision support, the heart disease is predicted. As a result, the author concluded that the decision tree performs well, and sometimes its accuracy is similar to that of Bayesian classification. In "An Efficient Classification Tree Technique for Heart Disease Prediction", the classification tree techniques in data mining are analysed; the classification tree algorithms used and tested are Decision Stump, Random Forest and the LMT Tree algorithm, and the objective of that research was to compare the performance of different classification techniques on a heart disease dataset. ANNs have been introduced to produce the highest-accuracy predictions in the medical field: the backpropagation multilayer perceptron (MLP) of the ANN is used to predict heart disease, and the obtained results are compared with the results of existing models within the same domain and found to be improved. The data of heart disease patients collected from the UCI laboratory has been used to discover patterns with NN, DT, Support Vector Machines (SVM) and Naive Bayes; the results of these algorithms are compared for performance and accuracy, and the proposed hybrid method returns a result of 86.8% for F-measure, competing with the other existing methods. Classification without segmentation using Convolutional Neural Networks (CNN) has also been introduced: this method considers heart cycles with various start positions in the Electrocardiogram (ECG) signals during the training phase, so the CNN can generate features for various positions in the testing phase for the patient. A large amount of the data generated by the medical industry had not previously been used effectively. The new approaches presented here decrease the cost and improve the prediction of heart disease easily and effectively. The different research techniques considered in this work for the prediction and classification of heart disease using ML and deep learning (DL) techniques are highly accurate, establishing the efficacy of these methods.

BACKGROUND
Millions of people develop some form of heart disease every year, and heart disease is the biggest killer of both men and women in the United States and around the world. The World Health Organization (WHO) has estimated that twelve million deaths occur worldwide due to heart disease. Heart disease kills one person in the world roughly every 34 seconds. Medical diagnosis is a vital yet complicated task that needs to be executed efficiently and accurately. To reduce the cost of clinical tests, appropriate computer-based information and decision support should be provided. Data mining is the use of software techniques for finding patterns and consistency in sets of data. With the advent of data mining over the last two decades, there is also a big opportunity to allow computers to directly construct and classify the different attributes or classes. Knowledge of the risk factors connected with heart disease helps health care professionals to recognize patients at high risk of having heart disease. Statistical analysis has identified the risk factors associated with heart disease to be age, blood pressure, total cholesterol, diabetes, hypertension, family history of heart disease, obesity, lack of physical exercise, fasting blood sugar, etc. [3]. Researchers have been applying different data mining techniques to help health care professionals diagnose heart disease with improved accuracy. Neural networks, Naive Bayes and Decision Trees are some of the techniques used in the diagnosis of heart disease, and applying Decision Tree techniques has shown useful accuracy. But assisting health care professionals in the diagnosis of the world's biggest killer demands higher accuracy, so our research seeks to improve diagnosis accuracy and thereby health outcomes. The Decision Tree is one of the data mining techniques that cannot handle continuous variables directly, so the continuous attributes must be converted to discrete attributes; some Decision Tree implementations use binary discretization for continuous-valued features. Another important accuracy improvement is applying reduced-error pruning to the Decision Tree in the diagnosis of heart disease patients. Intuitively, more complex models might be expected to produce more accurate results, but which techniques are best? Seeking to thoroughly investigate options for accuracy improvement in heart disease diagnosis, this paper systematically compares multiple Decision Tree classifiers. This research uses the Waikato Environment for Knowledge Analysis (WEKA). The information in the UCI repository is usually provided as a database or spreadsheet; to use this data in the WEKA tool, the data sets need to be in the ARFF format (attribute-relation file format). The WEKA tool is used to pre-process the dataset. After reviewing all 76 attributes, the unimportant attributes are dropped and only the important attributes (i.e., 14 attributes in this case) are considered for analysis, to yield more accurate and better results. The 14th attribute is the predicted attribute, referred to as Class. A thorough comparison between different Decision Tree algorithms within the WEKA tool, and deriving decisions from it, helps the system to predict the likely presence of heart disease in a patient, to diagnose heart disease well in advance and to treat it at the right time.

This chosen approach is implemented using the WEKA tool. WEKA is an open-source software tool that consists of a collection of machine learning algorithms for data mining tasks. It contains tools for data pre-processing, classification, regression, clustering, association rules and visualization [4]. For testing, the classification tools and Explorer mode of WEKA are used. Decision Tree classifiers with 10-fold cross-validation as the test mode are considered for this study.

The following steps are performed in WEKA:

• Start the WEKA Explorer.

• Open the CSV dataset file and save it in ARFF format (a conversion sketch follows this list).

• Click on the Classify tab and select J48 or another tree algorithm from the Choose button (under Trees).

• Select the appropriate Test mode option.

• Click on the Start button and the result will be displayed.
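
Since WEKA expects ARFF input, the CSV file must first be converted. The following is a minimal Python sketch of such a conversion (our illustration, not part of the original toolchain); the file name heartDisease.csv is borrowed from the later source-code section, and the assumption that all retained attributes are numerically coded is ours.

import pandas as pd

def csv_to_arff(csv_path, arff_path, relation='heart_disease'):
    # Read the CSV and write an ARFF header followed by the raw rows.
    df = pd.read_csv(csv_path)
    with open(arff_path, 'w') as f:
        f.write('@RELATION ' + relation + '\n\n')
        for col in df.columns:
            # Assumption: every retained attribute is numerically coded.
            f.write('@ATTRIBUTE ' + col + ' NUMERIC\n')
        f.write('\n@DATA\n')
        df.to_csv(f, header=False, index=False)

csv_to_arff('heartDisease.csv', 'heartDisease.arff')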

Year | Author | Purpose | Algorithms Used | Result

2015 | Sharma Purushottam et al. | Efficient Heart Disease Prediction System using Decision Tree | Decision tree classifier | 86.3% accuracy in the testing phase, 87.3% in the training phase

2015 | Boshra Brahmi et al. | Prediction and Diagnosis of Heart Disease by Data Mining Technique | J48, Naïve Bayes, KNN, SVM | J48 gives better accuracy than the other three

2016 | Ashok Kumar Dwivedi et al. | Evaluate the performance of different machine learning techniques for heart disease prediction | Naïve Bayes, KNN, Logistic Regression, Classification Tree | 83%, 80%, 85% and 77% respectively

2016 | Jayamin Patel et al. | Heart Disease Prediction using Machine Learning and Data Mining Technique | J48, Logistic model tree (LMT), Random forest | J48 gives 56.76% accuracy, which is better than the LMT algorithm's 55.75%

2017 | P. Sai Chandrasekhar Reddy et al. | Heart disease prediction using the ANN algorithm in data mining | ANN | Accuracy demonstrated in a JAVA implementation

2018 | Chala Bayen et al. | Prediction and analysis of the occurrence of heart disease using data mining techniques | J48, Naïve Bayes, SVM | Gives results in a short time, which helps to provide quality of services and reduce costs to individuals

Table 1. Literature Review

PROBLEM STATEMENT

Invasive methods are typically performed only when a patient presents with certain symptoms, which are often primary indications from which even an ordinary person with little knowledge can tell that the patient is suffering from heart disease or stroke. Moreover, such methods are generally very expensive, computationally complex, and require time for assessment. In our survey of the research, we found that there is no framework that analyses certain features and indications related to the patient, their living style and parental history, which could serve as a preliminary warning to the patient. In advance, we would like to make patients aware to be careful and to take the necessary preventive steps to keep such a complex illness from entering the body and thriving. The problem we identified is that prediction alone cannot rule the disease out of the body. It needs to be treated through three important basic things: 1. medicine, 2. precautions, and 3. changing the living style by suggesting physical activity suited to the patient's different attributes. Therefore, our model predicts the level of heart disease and suggests medicinal and non-medicinal ways to get rid of it.

METHODOLOGY

Dataset collection

Kaggle is one of the most popular online community websites for data science and machine learning. Kaggle allows users to find and publish datasets; it has datasets on almost everything, so people can easily get related datasets. The dataset used here is taken from the Kaggle website. Its attribute-description table has columns for the serial number, attribute and description, and rows for the attributes, namely age, sex, chest pain, cholesterol rate, resting electrocardiographic results, fasting blood sugar, thalach, exang, oldpeak, slope, number of major vessels coloured, and thal, which denotes the defect type. The attributes in the dataset are listed in Table 2.

The dataset contains 303 records and 13 attributes, and there are no missing values in it. The dataset has an attribute called target, which denotes whether the person has heart disease or not: if the patient has heart disease, the value in the target field is 1, else the value is set to 0, indicating that the patient does not have heart disease. In this dataset, 165 patients had heart disease and 138 patients had no heart disease. The other attributes are the type of chest pain, level of blood pressure, serum cholesterol, blood sugar level, results of the electrocardiogram, maximum heart rate, exercise-induced angina, ST depression induced by exercise relative to rest, the slope of the peak exercise (ST segment), number of major vessels coloured by fluoroscopy and reversible defect.
Where the raw database contains NaN values, these cannot be processed by the program, so they need to be converted into numerical values. In this approach, the mean of each column is calculated and its NaN values are replaced by that mean.

Splitting:

The whole database is split into a training and a testing set: 80% of the data is taken for training while the remaining 20% is used for testing. A minimal sketch of the imputation and split appears below.
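
The following is a hedged sketch (not the report's original script) of the mean imputation and 80/20 split described above; the file name heartDisease.csv and the target column name are borrowed from the later source-code section.

import pandas as pd
from sklearn.model_selection import train_test_split

data = pd.read_csv('heartDisease.csv')
# Replace any NaN value with the mean of its column, as described above.
data = data.fillna(data.mean(numeric_only=True))

X = data.drop(columns=['target'])
y = data['target']
# 80% of the records for training, the remaining 20% for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)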

Classification:

The training data is used to train four different machine learning algorithms, i.e., Decision Tree, KNN, Random Forest and Logistic Regression. Each algorithm is explained in detail in the following sections.

SNO | Attribute | Description
1 | age | Age in years
2 | sex | Male or female
3 | cp | Chest pain type
4 | trestbps | Resting blood pressure
5 | chol | Serum cholesterol in mg/dl
6 | fbs | Fasting blood sugar
7 | restecg | Resting electrographic results
8 | thalach | Maximum heart rate achieved
9 | exang | Exercise-induced angina
10 | oldpeak | ST depression induced by exercise relative to rest
11 | slope | The slope of the peak exercise ST segment
12 | ca | Number of major vessels coloured
13 | thal | Defect type

Table 2. Dataset

DATA PRE-PROCESSING

Heart disease data is pre-processed after the collection of various records. The dataset contains a total of 303 patient records, of which 6 records have some missing values. Those 6 records have been removed from the dataset and the remaining 297 patient records are used in pre-processing. The multi-class variable and binary classification are introduced for the attributes of the given dataset. The multi-class variable is used to check the presence or absence of heart disease: if the patient has heart disease, the value is set to 1, else it is set to 0, indicating the absence of heart disease. The pre-processing of the data is carried out by converting medical records into diagnosis values. The results of data pre-processing for the 297 patient records indicate that 137 records show the value 1, establishing the presence of heart disease, while the remaining 160 show the value 0, indicating its absence. A small sketch of this step follows.
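
A hedged sketch of this pre-processing step, assuming the raw file marks missing entries with '?' and stores the 0–4 diagnosis in a column named num (both assumptions, not confirmed by the report):

import pandas as pd

df = pd.read_csv('heart_disease_dataset.csv', na_values='?')  # '?' as missing marker is assumed
df = df.dropna()                            # drops the 6 incomplete records (303 -> 297)
# Collapse the multi-class diagnosis into the binary target described above.
df['target'] = (df['num'] > 0).astype(int)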

CLASSIFICATION MODELLING

The clustering of datasets is done based on the variables and criteria of Decision Tree (DT)
features. Then, the classifiers are applied to each clustered dataset to estimate its performance.
The best performing models are identified from the above results based on their low rate of
error. The performance is further optimized by choosing the DT cluster with a high rate of error
and extraction of its corresponding classifier features. The performance of the classifier is
evaluated for error optimization on this data set.

The performance of a classification algorithm is generally analysed by assessing the sensitivity, specificity and accuracy of the classification. The sensitivity is the proportion of positive instances that are correctly classified as positive (i.e., the proportion of patients known to have the disease who test positive for it). The specificity is the proportion of negative instances that are correctly classified as negative (i.e., the proportion of patients known not to have the disease who test negative for it). The accuracy is the proportion of instances that are correctly classified. To quantify the reliability of the proposed model, the data is divided into training and testing sets with 10-fold stratified cross-validation. These values are defined as:

Sensitivity = TP / (TP + FN)

Specificity = TN / (TN + FP)

Accuracy = (TP + TN) / (TP + TN + FP + FN)

All measures can be computed from four quantities, namely True Positive, False Positive, False Negative and True Negative:

• True Positive (TP) is the number of instances correctly classified as positive.
• False Positive (FP) is the number of instances incorrectly classified as positive.
• False Negative (FN) is the number of instances incorrectly classified as negative.
• True Negative (TN) is the number of instances correctly classified as negative.
• F-Measure is a way of combining recall and precision scores into a single measure of
performance. Recall is the ratio of relevant instances found in the search result to
the total of all relevant instances.
• Precision is the proportion of relevant instances in the results returned.

• The Receiver Operating Characteristic (ROC) area comes from the traditional plot of this
same information in normalized form, with the true positive rate (1 − false negative rate)
plotted against the false positive rate.
• For every algorithm, cross-validation was used as the test option. Instead of
holding out a single part for testing, cross-validation repeats the training and testing process a
few times with random samples. The standard for this is 10-fold cross-validation:
the data is partitioned arbitrarily into 10 parts in which the classes are represented
in the same proportions as in the full dataset (stratification). Each part is held out in
turn and the algorithm is trained on the nine remaining parts; then its error rate is
computed on the holdout set. Finally, the 10 error estimates are averaged to yield an
overall error estimate. For J48 and Random Forest, all the tests were run with ten
different random seeds; choosing different random seeds averages out statistical
variations. A sketch of these metric computations follows this list.
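
As an illustration, the metrics above can be computed as in the sketch below (our code, using scikit-learn rather than WEKA; the feature matrix X and label vector y are assumed to be prepared as described earlier):

from sklearn.metrics import confusion_matrix
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.tree import DecisionTreeClassifier

# 10-fold stratified cross-validated predictions, as described above.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
y_pred = cross_val_predict(DecisionTreeClassifier(random_state=0), X, y, cv=cv)

tn, fp, fn, tp = confusion_matrix(y, y_pred).ravel()
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
accuracy = (tp + tn) / (tp + tn + fp + fn)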

DECISION TREES

The Decision Tree algorithm belongs to the family of supervised machine learning algorithms. It can be used for both classification and regression problems. The goal of this algorithm is to create a model that predicts the value of a target variable; the decision tree uses a tree representation to solve the problem, in which each leaf node corresponds to a class label and attributes are represented on the internal nodes of the tree.

A decision tree is a tree whose internal nodes can be taken as tests (on input data patterns) and whose leaf nodes can be taken as categories (of these patterns). These tests are filtered down through the tree to get the right output for the input pattern. Decision Tree algorithms can be applied in different fields: as a replacement for statistical procedures to find data, to extract text, to find missing data in a class, to improve search engines, and in various medical applications. Many Decision Tree algorithms have been formulated, with different accuracy and cost-effectiveness, so it is important to know which algorithm is best to use. ID3 is one of the oldest Decision Tree algorithms; it is very useful for making simple decision trees, but as complexity increases its ability to make good decision trees decreases. Hence the IDA (intelligent decision tree algorithm) and C4.5 algorithms were formulated. For training samples of data, the trees are constructed based on high-entropy inputs. These trees are simple and constructed quickly in a top-down, recursive divide-and-conquer (DAC) approach. Tree pruning is performed to remove the irrelevant samples, as sketched below.
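
A minimal scikit-learn sketch (our illustration, not the report's tuned model) of an entropy-based tree, with cost-complexity pruning standing in for the reduced-error pruning mentioned earlier; x_train, y_train, x_test and y_test are assumed to come from the split described in the methodology:

from sklearn.tree import DecisionTreeClassifier

# criterion='entropy' selects information-gain splits, as described above.
dt = DecisionTreeClassifier(criterion='entropy', ccp_alpha=0.01, random_state=0)
dt.fit(x_train, y_train)
print(dt.score(x_test, y_test))  # mean accuracy on the held-out split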

KNN (K-Nearest Neighbour Algorithm):

K-Nearest Neighbour is a supervised pattern-classification learning algorithm that helps us find which class a new input (test value) belongs to: the k nearest neighbours are chosen and the distance between them and the test value is calculated. It attempts to estimate the conditional distribution of Y given X, and classifies a given observation (test value) into the class with the highest estimated probability. K-Nearest Neighbour is used for both classification and regression. This algorithm does not learn parameters; instead it uses the data points themselves to derive the output. It is a lazy learning model: it simply stores the training data points and defers all computation to prediction time. The basic idea of this algorithm is to use the stored data points as inputs and to derive the output for a query from its nearest neighbours. A sketch of choosing k appears below.
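
A hedged sketch (ours) of how k can be chosen by cross-validation before fitting the final model:

from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

best_k, best_score = 1, 0.0
for k in range(1, 16, 2):  # odd k avoids ties in binary majority voting
    score = cross_val_score(KNeighborsClassifier(n_neighbors=k),
                            x_train, y_train, cv=5).mean()
    if score > best_score:
        best_k, best_score = k, score

knn = KNeighborsClassifier(n_neighbors=best_k).fit(x_train, y_train)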

RANDOM FOREST

This algorithm contains a set of trees; each node is part of a tree structure, and from the combined trees the output is predicted. It handles a large amount of data, gives accurate output and better efficiency, although the computation process is demanding. One reported comparison of accuracy rates between the neural network and random forest algorithms found the accuracy rate of the neural network to be higher than that of random forests.

Random forest is an ensemble classifier that consists of many decision trees; the output classes are represented by the individual trees. It is derived from random decision forests, proposed by Tin Kam Ho of Bell Labs in 1995. The method combines bagging with a random selection of features to construct decision trees with controlled variation. The tree is constructed using the algorithm discussed below.

• Let N be the number of training cases and M be the number of variables in the
classifier.
• The input variable m is used to determine the decision at a node of the tree; note that m < M.
• A training set for a tree is chosen by sampling n times with replacement from all N
available training cases; the classes of the remaining cases are predicted to estimate the error of the tree.
• For each node of the tree, m variables are chosen at random and the best split on them is calculated.
• Finally, the tree is fully grown and is not pruned. To predict a new sample, it is pushed down
the tree; when it ends up at a terminal node, it is assigned the label of the
training samples there. This procedure is iterated over all trees, and the aggregate is reported as the random
forest prediction.

Multi-classifiers are the result of combining several individual classifiers, and ensembles of classifiers have been introduced to increase performance. Random Forest (RF) is one example of such a procedure. RF is a multi-classifier formed by decision trees, where each tree ht is built from the training data and a vector Θt of random numbers, identically distributed and independent of the vectors Θ1, Θ2, ..., Θt-1 used to build the classifiers h1, h2, ..., ht-1. Every decision tree is built from a random subset of the training dataset. It uses a random vector generated from some fixed probability distribution, where the distribution can be shifted to focus on samples that are hard to classify. A random vector can be incorporated into the tree-growing process in various ways. The leaf nodes of each tree are labelled by estimates of the posterior distribution over the data class labels, and every internal node contains a test that best splits the space of data to be classified. A new, unseen instance is classified by sending it down every tree and aggregating the reached leaf distributions.

There are three methodologies for Random Forest: Forest-RI (random input selection), Forest-RC (random combination), and a blend of Forest-RI and Forest-RC.

The Random Forest procedure has some desirable qualities, for example (see the sketch after this list):

• It is easy to use, simple and easily parallelized.

• It does not require models or parameters to be chosen, apart from the number of predictors
to pick at random at each node.
• It runs efficiently on large databases and is relatively robust to outliers and
noise.
• It can handle a huge number of input variables without variable deletion, and it
gives estimates of which variables are important in the classification.
• It has an effective mechanism for estimating missing data and maintains accuracy
when a large proportion of the data is missing; it also has methods for balancing error in
class-imbalanced data sets.
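
A brief scikit-learn sketch (ours; the parameter values are illustrative), in which max_features plays the role of the m < M rule from the algorithm above:

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=100,     # number of trees grown on bootstrap samples
    max_features='sqrt',  # m variables chosen at random per node, with m < M
    random_state=1)
rf.fit(x_train, y_train)
print(rf.feature_importances_)  # the variable-importance estimates mentioned above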

LOGISTIC REGRESSION

Despite its name, Logistic Regression behaves as a classification model rather than a regression model. The algorithm gives output in the form of binary values, i.e., 0s and 1s. It is a statistical model and rests on some statistical assumptions, and the generated sample information is represented in a mathematical form. Logistic regression estimates the probability of the outcome, which has only two possible values, i.e., true or false. The sigmoid function is frequently used by this algorithm (sketched below). It is a statistical method similar to linear regression, since LR finds an equation that predicts an outcome for a binary variable, Y, from one or more response variables, X. However, unlike linear regression, the response variables can be categorical or continuous, as the model does not strictly require continuous data. To predict group membership, LR uses the log odds ratio rather than probabilities, and an iterative maximum-likelihood method rather than least squares to fit the final model. This means the researcher has more freedom when using LR, and the method may be more appropriate for non-normally distributed data or when the samples have unequal covariance matrices. Logistic regression assumes independence among variables, which is not always met in such datasets. However, as is often the case, the applicability of the method (and how well it works, e.g., the classification error) often trumps statistical assumptions. One drawback of LR is that the method cannot produce typicality probabilities (useful for forensic casework), but these values may be substituted with nonparametric methods such as ranked probabilities and ranked interindividual similarity measures (Ousley and Hefner, 2005).
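
For illustration (our sketch), the sigmoid maps a linear combination of the inputs to a probability, which is then thresholded to produce the binary output described above:

import numpy as np

def sigmoid(z):
    # Maps any real value into the (0, 1) interval.
    return 1.0 / (1.0 + np.exp(-z))

# p = sigmoid(b0 + b1*x1 + ... + bk*xk); predict class 1 when p >= 0.5.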

PERFORMANCE MEASURES

Several standard performance metrics, such as accuracy, precision and classification error, have been considered for computing the performance of this model. Accuracy in the current context means the percentage of instances correctly predicted from among all available instances. Precision is defined as the percentage of correct predictions in the positive class of instances. Classification error is defined as the percentage of incorrectly classified instances. To identify the significant features of heart disease, three performance metrics are used, which help in better understanding the behaviour of the various combinations of feature selection. The ML technique focuses on the best performing model compared to the existing models. We introduce the hybrid method, which produces high accuracy and low classification error in the prediction of heart disease. The performance of every classifier is evaluated individually and all results are adequately recorded for further investigation.

OVERVIEW OF PROPOSED WORK

The weighted voting of the Logistic Regression (LR), K-Nearest Neighbour (KNN), Random Forest (RF) and Gaussian Naive Bayes (GNB) algorithms is used to perform classification on the cardiovascular disease dataset. In fatal conditions like cardiac disease, accurate detection of the disease is essential, but the conventional approaches are not good in terms of diagnostic accuracy. In the proposed model, the data from the heart disease dataset is split into test and training data (33% test data and 67% training data). Feature scaling is performed to normalize the range of the independent variables in the data. The test data is given as the input to Decision Tree, Gaussian Naive Bayes, Bernoulli Naive Bayes, Logistic Regression, Multinomial Naive Bayes, K-Nearest Neighbour, Random Forest and Neural Network (Logistic, TanH, ReLU and Identity) classifiers. The proposed model uses 13 clinical attributes of the heart disease dataset as input and predicts the disease accurately. The proposed method, VLRAKN, has an accuracy of 89% and a precision of 91.6%. The results of VLRAKN are compared and analysed against conventional models and a few other machine learning algorithms; the accuracy, precision, F-measure, sensitivity and specificity are given in the table below. The introduced model is efficient in predicting the disease compared with other models in terms of accuracy and precision.

PROPOSED WORK

Cardiovascular diseases are significant causes of death around the world: each year approximately 17 million people die due to cardiac problems. Proper diagnosis and early prediction of the disease can reduce its adverse effects. Machine learning plays a vital role in the prediction of the disease and helps us find the hidden patterns present in the data, which are essential during prediction. Many research studies have been performed on the application of machine learning to cardiac disease prediction, and many prediction models using machine learning approaches have been proposed; one such study proposed an approach that uses structural equation mapping and fuzzy cognitive maps for prediction. The main aim of this work is to develop a model with good accuracy and precision in predicting cardiac disease in patients using machine learning. Many research studies were referred to during the construction of the model. For designing a model with reasonable accuracy, the performance of several machine learning algorithms was analysed. The proposed model is based on the weighted voting of classifiers. During the construction of the model, the parameters mentioned in the table are taken into consideration for predicting the disease; the process involved in designing the model is described below. The proposed model achieved an accuracy of 89% and a precision of 91.6% in predicting the disease.
It is important to have first aid at the time of a heart attack; many deaths from heart attacks occur because of a lack of awareness and of first aid given to patients. As lifestyles around the globe have changed, becoming a fundamental cause of various heart complications, there has been plenty of research on prediction. Here we aim to go one step further: studying the complexity of heart disease and giving medical and non-medical suggestions to get rid of it. In the proposed work we focus on analysis, prediction, accuracy after applying many algorithms, comparison, and the provision of suggestions. It is also important to know whether a person needs to be diagnosed for heart disease or not; in our work, after training on the dataset, we also examine whether the person needs to be examined. We experiment with the various classification models and check which yields the greatest accuracy.
In the presented work there are six different sections in which the work takes place, namely database selection, sample selection, attribute creation, modelling/training, knowledge extraction and, finally, medical suggestions and recommendations. The idea behind dividing the whole work into different sections is to work on each section independently, so that good results and high accuracy can be achieved.

Fig. 1. Proposed working architecture

In the proposed architecture there are six different sections, as listed above. First we select the database: it is important to work on the target database, so we select the target database from the heart disease database. After that, samples need to be selected from the pool of the dataset, and it is important to remove the noisy data present in the dataset. Since there are many attributes in the dataset, it is necessary to create the specific attributes required for training; we then extract the relevant attributes that are useful for the process. In the next section, modelling and training of the selected dataset take place: here we use seven different machine learning models one after another, so that the best model, i.e. the model with the highest accuracy, can be found. After applying the different machine learning algorithms and models, the system extracts the knowledge and finally gives the medical and non-medical suggestions. We have used supervised learning models in the proposed system.

Quantitative research needs numerical data, which can come either from the numerical data itself or from graphs; statistical methods are applied to it to extract usefulness from the data.

Qualitative research works with words and thoughts; expert opinion can bring useful information through the thoughts and feelings of the examinee, and qualitative research aims to understand the concepts, thoughts, experiences and feelings of the patients. This research paper uses both quantitative and qualitative data. We have used the University of California Irvine (UCI) dataset for this paper. Three types of data are used in this paper (continuous, ordinal and binary, as defined below).

A standard heart disease dataset from Kaggle is used, whose data was collected from four prestigious hospitals in Switzerland and the United States of America. The dataset contains 303 records and 13 attributes, with no missing values. The dataset has an attribute called target, which denotes whether the person has heart disease or not: if the patient has heart disease, the value in the target field is 1, else it is 0. In this dataset, 165 patients had heart disease and 138 patients had no heart disease. The other attributes are the type of chest pain, level of blood pressure, serum cholesterol, blood sugar level, results of the electrocardiogram, maximum heart rate, exercise-induced angina, ST depression induced by exercise relative to rest, the slope of the peak exercise (ST segment), number of major vessels coloured by fluoroscopy and reversible defect.

Feature Scaling and attribute selection: The cardiac disease data is feature-scaled to normalize the range of the independent variables or features. The cardiac disease dataset contains 303 records. All 13 attributes in the dataset are considered necessary, including age and sex; age and sex are considered because heart disease is found at an earlier age in men than in women, among other factors. These attributes are essential for predicting the disease.

Building and training the model: After attribute selection and feature scaling, the heart disease data is split into two different parts for training and testing the model. Various machine learning (ML) algorithms are used to perform prediction on the dataset; the scaling step is sketched below.
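
A short sketch of this scaling step, consistent with the later source-code section (the scaler is fitted on the training split only, and the same transform is then applied to the test split):

from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
x_train = sc.fit_transform(x_train)  # learn mean/std on the training data only
x_test = sc.transform(x_test)        # reuse them on the test data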

Continuous (#): quantitative data that can be measured.

Ordinal data: categorical data that has an order to it (0, 1, 2, 3, etc.).

Binary data: data whose unit can take on only two possible states (0 and 1).

There are 13 feature attributes identified in the dataset for heart disease prediction, which are listed below:

1. (age) age in years (#)
2. (sex) 1 = male, 0 = female (Binary)
3. (cp) chest pain type (4 values, Ordinal): Value 1: typical angina, Value 2: atypical angina,
Value 3: non-anginal pain, Value 4: asymptomatic
4. (trestbps) resting blood pressure (#)
5. (chol) serum cholesterol in mg/dl (#)
6. (fbs) fasting blood sugar > 120 mg/dl (Binary) (1 = true; 0 = false)
7. (restecg) resting electrocardiography results (values 0, 1, 2)
8. (thalach) maximum heart rate achieved (#)
9. (exang) exercise-induced angina (Binary) (1 = yes; 0 = no)
10. (oldpeak) ST depression induced by exercise relative to rest (#)
11. (slope) slope of the peak exercise ST segment (Ordinal) (Value 1: up-sloping, Value 2: flat, Value 3:
down-sloping)
12. (ca) number of major vessels (0–3, Ordinal) coloured by fluoroscopy
13. (thal) defect type (Ordinal): 3 = normal; 6 = fixed defect; 7 = reversible defect

Fig 2. Structure of data pre-processing: input data → pre-processing → splitting into training data and testing data → hybrid-method classifier → accuracy.

Attribute | Description | Data Type
Age | Age of the patient in years | Numeric
Sex | Patient's gender, represented as 1 for male and 0 for female | Nominal
Chest pain (cp) | The type of chest pain: 1 = typical angina, 2 = atypical angina, 3 = non-anginal pain, 4 = asymptomatic | Numeric
Trestbps | Level of resting blood pressure in mm Hg | Numeric
Chol | Serum cholesterol in mg/dl | Numeric
Fasting blood sugar (fbs) | Blood sugar level on fasting > 120 mg/dl; 1 if true and 0 if false | Numeric
Resting electrocardiogram results (restecg) | Resting electrocardiographic results | Numeric
Thalach | Maximum heart rate | Numeric
Exang | Angina caused by exercise (1 = yes, 0 = no) | Numeric
Oldpeak | ST depression induced by exercise relative to rest | Numeric
Slope | The slope of the peak exercise ST segment, represented by 3 values | Numeric
Ca | The number of major vessels coloured by fluoroscopy, represented by 4 values from 0 to 3 | Numeric
Thal | Heart status, represented by 3 different values: normal = 3, fixed defect = 6, reversible defect = 7 | Numeric
Target | Heart disease diagnosis, represented by 2 values: presence of disease = 1, absence = 0 | Nominal
Table 3. Heart disease dataset detailed information.

Attribute | Data Type | Range
Age | Numeric | [29 – 77]
Sex | Nominal | [0, 1]
Chest pain (cp) | Numeric | [1 – 4]
Trestbps | Numeric | [94 – 200]
Chol | Numeric | [126 – 564]
Fasting blood sugar (fbs) | Nominal | [0 – 1]
Resting electrocardiogram results (restecg) | Numeric | [0 – 2]
Thalach (maximum) | Numeric | [71 – 202]
Exang | Numeric | [0 – 1]
Oldpeak | Numeric | [0 – 6.20]
Slope | Numeric | [1 – 3]
Ca | Categorical | [5 levels]
Thal | Categorical | [4 levels]
Target | Numeric | [0 – 1]

Table 4. Range and datatype of dataset’s attributes.

The models are trained using the scikit-learn package in Python. Using ensemble methods, a voting classifier is built with the RF, LR, GNB, NNR and KNN algorithms. The ensemble voting classifier allows us to give a different weight to each classifier: in the proposed model, Logistic Regression and K-Nearest Neighbour are given weights of 3, and the weights of the remaining classifiers are 0. The voting classifier performs majority-rule voting using the predicted class labels, as sketched below.
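
A hedged sketch of such a weighted voting classifier (the estimator variable names are ours; the weights follow the description above):

from sklearn.ensemble import VotingClassifier

vlrakn = VotingClassifier(
    estimators=[('lr', model_lr), ('knn', model_knn),
                ('rf', model_rf), ('gnb', model_gnb)],
    voting='hard',         # majority-rule voting on predicted class labels
    weights=[3, 3, 0, 0])  # LR and KNN weighted 3, the remaining classifiers 0
vlrakn.fit(x_train, y_train)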
Classifiers | Accuracy | Precision | Sensitivity | Specificity | F-Measure | Error
LR | 86 | 79 | 90 | 82 | 84.4 | 14
RF | 83 | 75 | 87 | 79 | 80 | 17
DT | 80 | 79 | 79 | 80 | 79 | 20
SVM | 87 | 81 | 90 | 84 | 85 | 17
KNN | 88 | 91 | 84 | 91 | 88 | 12
XG Boost | 84 | 79 | 86 | 82 | 82 | 16
Hybrid Model | 89 | 91.6 | 86 | 91 | 88 | 11

Table 5. Overall performance (values in %).

Various metrics are considered for the evaluation of this classification model: precision, error, F-measure, accuracy, sensitivity and specificity. Accuracy is the ratio of correctly predicted samples to the total number of samples, and classification error is the proportion of misclassified samples. Sensitivity is the proportion of actual positives that are correctly predicted, F-measure is defined as the harmonic mean of precision and recall, specificity is defined as the ability of the classifier to identify negative results, and precision is the ratio of correctly predicted positive samples to the total number of samples predicted as positive. The hybrid model produces results with high accuracy, high precision and minimum classification error.

PROPOSED SOURCE CODE HEART DISEASE PREDICTION

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

filePath = 'heartDisease.csv'
data = pd.read_csv(filePath)
data.head(5)

print("(Rows, columns): " + str(data.shape))


data.columns

data.nunique(axis=0)# returns the number of unique values for each variable.


data.describe()

data['target'].value_counts()

corr = data.corr()
plt.subplots(figsize=(15,10))
# Annotated correlation heatmap of all attributes.
sns.heatmap(corr, xticklabels=corr.columns, yticklabels=corr.columns, annot=True,
            cmap=sns.diverging_palette(220, 20, as_cmap=True))

subData = data[['age','trestbps','chol','thalach','oldpeak']]
sns.pairplot(subData)

sns.catplot(x="target", y="oldpeak", hue="slope", kind="bar", data=data);
plt.title('ST depression (induced by exercise relative to rest) vs. Heart Disease',size=25)
plt.xlabel('Heart Disease',size=20)
plt.ylabel('ST depression',size=20)

plt.figure(figsize=(12,8))
sns.violinplot(x='target', y='oldpeak', hue="sex", inner='quartile', data=data)
plt.title("ST depression vs. Heart Disease", fontsize=20)
plt.xlabel("Heart Disease Target", fontsize=16)
plt.ylabel("ST depression induced by exercise relative to rest", fontsize=16)

plt.figure(figsize=(12,8))
sns.boxplot(x='target', y='thalach', hue="sex", data=data)
plt.title("Thalach Level vs. Heart Disease", fontsize=20)
plt.xlabel("Heart Disease Target", fontsize=16)
plt.ylabel("Maximum heart rate achieved (thalach)", fontsize=16)

X = data.iloc[:, :-1].values  # all attribute columns
y = data.iloc[:, -1].values   # the target column

from sklearn.model_selection import train_test_split


x_train, x_test, y_train, y_test = train_test_split(X,y,test_size = 0.2, random_state = 1)

from sklearn.preprocessing import StandardScaler


sc = StandardScaler()
x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test)

from sklearn.metrics import classification_report


from sklearn.linear_model import LogisticRegression

model1 = LogisticRegression(random_state=1) # get instance of model


model1.fit(x_train, y_train) # Train/Fit model

y_pred1 = model1.predict(x_test) # get y predictions


print(classification_report(y_test, y_pred1))

from sklearn.metrics import classification_report


from sklearn.neighbors import KNeighborsClassifier

model2 = KNeighborsClassifier() # get instance of model


model2.fit(x_train, y_train) # Train/Fit model

y_pred2 = model2.predict(x_test) # get y predictions


print(classification_report(y_test, y_pred2)) # output accuracy

from sklearn.metrics import classification_report


from sklearn.svm import SVC

model3 = SVC(random_state=1) # get instance of model


model3.fit(x_train, y_train) # Train/Fit model

y_pred3 = model3.predict(x_test) # get y predictions


print(classification_report(y_test, y_pred3)) # output accuracy

from sklearn.metrics import classification_report
from sklearn.naive_bayes import GaussianNB

model4 = GaussianNB() # get instance of model


model4.fit(x_train, y_train) # Train/Fit model

y_pred4 = model4.predict(x_test) # get y predictions


print(classification_report(y_test, y_pred4)) # output accuracy

from sklearn.metrics import classification_report


from sklearn.tree import DecisionTreeClassifier

model5 = DecisionTreeClassifier(random_state=1) # get instance of model


model5.fit(x_train, y_train) # Train/Fit model

y_pred5 = model5.predict(x_test) # get y predictions


print(classification_report(y_test, y_pred5)) # output accuracy

from sklearn.metrics import classification_report


from sklearn.ensemble import RandomForestClassifier

model6 = RandomForestClassifier(random_state=1)# get instance of model


model6.fit(x_train, y_train) # Train/Fit model

y_pred6 = model6.predict(x_test) # get y predictions


print(classification_report(y_test, y_pred6)) # output accuracy

from sklearn.metrics import confusion_matrix, accuracy_score


cm = confusion_matrix(y_test, y_pred6)
print(cm)
accuracy_score(y_test, y_pred6)

# get importance
importance = model6.feature_importances_

# summarize feature importance


for i,v in enumerate(importance):
print('Feature: %0d, Score: %.5f' % (i,v))

index= data.columns[:-1]
importance = pd.Series(model6.feature_importances_, index=index)
importance.nlargest(13).plot(kind='barh', colormap='winter')

y_pred = model6.predict(x_test)
print(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)),1))

[[0 0]
[1 1]
[0 0]
[0 0]
[0 0]
[0 0]
[0 0]
[1 1]
[0 0]
[1 1]
[1 1]
[0 0]
[1 0]
[0 0]
[0 0]
[1 0]
[1 1]
[0 0]
[1 1]
[1 0]
[1 1]
[0 0]
[1 1]
[1 1]
[1 1]
[1 1]
[0 0]
[1 1]
[1 1]
[1 1]
[1 1]
[1 1]
First we need to find the average of the six models' predictions and its accuracy:

# Sum the six 0/1 prediction vectors and floor-divide by 6: a sample is
# labelled 1 only when all six models agree that it is 1.
average = (y_pred1+y_pred2+y_pred3+y_pred4+y_pred5+y_pred6)//6
avg_accuracy = accuracy_score(y_test, average)
avg_accuracy

Applying an ensemble with the above models:

from sklearn.model_selection import train_test_split
from sklearn import model_selection
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import KFold
from sklearn.ensemble import VotingClassifier
from sklearn.ensemble import BaggingClassifier

labels = ['LogisticRegression', 'KNeighborsClassifier', 'SVC', 'GaussianNB',
          'DecisionTreeClassifier', 'RandomForestClassifier']
for i, label in zip([model1, model2, model3, model4, model5, model6], labels):
    scores = model_selection.cross_val_score(i, X, y, cv=5, scoring='accuracy')
    print('Accuracy: %0.2f (%0.2f) [%s]' % (scores.mean(), scores.std(), label))

A voting ensemble is an ensemble machine learning model that combines the


predictions from multiple other models. It is a technique that may be used to improve
model performance, ideally achieving better performance than any single model used in
the ensemble.

voting_hard = VotingClassifier(estimators=[(labels[0], model1),
                                           (labels[1], model2),
                                           (labels[2], model3),
                                           (labels[3], model4),
                                           (labels[4], model5),
                                           (labels[5], model6)],
                               voting='hard')

labels_new = ['LogisticRegression', 'KNeighborsClassifier', 'SVC', 'GaussianNB',
              'DecisionTreeClassifier', 'RandomForestClassifier', 'Voting']
for i, label in zip([model1, model2, model3, model4, model5, model6, voting_hard], labels_new):
    scores = model_selection.cross_val_score(i, X, y, cv=5, scoring='accuracy')
    print('Accuracy: %0.2f (%0.2f) [%s]' % (scores.mean(), scores.std(), label))

labels = ['LogisticRegression', 'KNeighborsClassifier', 'SVC', 'GaussianNB',
          'DecisionTreeClassifier', 'RandomForestClassifier']
for clf, label in zip([model1, model2, model3, model4, model5, model6], labels):
    bagging_clf = BaggingClassifier(clf, max_samples=0.4, max_features=10, random_state=0)
    bagging_scores = cross_val_score(bagging_clf, X, y, cv=10, n_jobs=-1)
    scores = cross_val_score(clf, X, y, cv=10, scoring='accuracy')
    print("Mean: {0:.3f}, std: (+/-) {1:.3f} [{2}]".format(scores.mean(), scores.std(), label))
    # Report the bagged version too (the original printed the plain scores twice).
    print("Bagging accuracy: %0.2f (+/- %0.2f) [%s]" % (bagging_scores.mean(), bagging_scores.std(), label))

d1 = {}
for i, model in zip([model1, model2, model3, model4, model5, model6], labels):
    scores = model_selection.cross_val_score(i, X, y, cv=5, scoring='accuracy')
    print('Accuracy: %0.2f (%0.2f) [%s]' % (scores.mean(), scores.std(), model))
    d1[model] = '%0.2f' % scores.mean()

The most accurate model is extracted after comparing all models:

def get_accurate_result():
    accuracy = list(d1.values())
    model = list(d1.keys())
    return model[accuracy.index(max(accuracy))], max(accuracy)

result = get_accurate_result()
model = result[0]
acc = result[1]
percent = float(acc) * 100
print("==" * 30)
print("{} is more accurate than the other models,\n{} is {} % accurate.".format(model, model, percent))
print("==" * 30)

Heart Disease Prediction
Separate Models
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset


dataset = pd.read_csv('heart_disease_dataset.csv')
X = dataset.iloc[:,:-1].values
y = dataset.iloc[:, 13].values

# handling missing data
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer = imputer.fit(X[:, 11:13])
X[:, 11:13] = imputer.transform(X[:, 11:13])

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 101)

# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

#EXPLORING THE DATASET

dataset.num.value_counts()

# Fitting Decision Tree Classification to the Training set

from sklearn.tree import DecisionTreeClassifier


classifier = DecisionTreeClassifier(criterion = 'entropy', random_state = 8)
classifier.fit(X_train, y_train)

# Predicting the Test set results


y_pred = classifier.predict(X_test)

#ACCURACY SCORE
from sklearn.metrics import accuracy_score
print("ACC",accuracy_score(y_test,y_pred))

##CONFUSION MATRIX
from sklearn.metrics import classification_report, confusion_matrix
cm=confusion_matrix(y_test, y_pred)

#Interpretation:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))
#ROC
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
logit_roc_auc = roc_auc_score(y_test, classifier.predict(X_test))
fpr, tpr, thresholds = roc_curve(y_test, classifier.predict_proba(X_test)[:,1])
plt.figure()
plt.plot(fpr, tpr, label='Decision Tree (area = %0.2f)' % logit_roc_auc)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.savefig('Log_ROC')
plt.show()

Newdataset = pd.read_csv('newdata.csv')
ynew=classifier.predict(Newdataset)
print("Predicted Class for newdata.csv:",ynew)
ACC 0.7631578947368421
precision recall f1-score support

0 0.73 0.84 0.78 38


1 0.81 0.68 0.74 38

accuracy 0.76 76
macro avg 0.77 0.76 0.76 76
weighted avg 0.77 0.76 0.76 76

# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset


dataset = pd.read_csv('heart_disease_dataset.csv')
X = dataset.iloc[:,:-1].values
y = dataset.iloc[:, 13].values

# handling missing data
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer = imputer.fit(X[:, 11:13])
X[:, 11:13] = imputer.transform(X[:, 11:13])

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 101)

# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

#EXPLORING THE DATASET

dataset.num.value_counts()

# Fitting K-Nearest Neighbors to the Training set


from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors=3)
classifier.fit(X_train, y_train)

# Predicting the Test set results


y_pred = classifier.predict(X_test)

#ACCURACY SCORE
from sklearn.metrics import accuracy_score
accuracy_score(y_test,y_pred)

# Making the Confusion Matrix


from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)

#Interpretation:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))

#ROC
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
logit_roc_auc = roc_auc_score(y_test, classifier.predict(X_test))
fpr, tpr, thresholds = roc_curve(y_test, classifier.predict_proba(X_test)[:,1])
plt.figure()
plt.plot(fpr, tpr, label='KNN (area = %0.2f)' % logit_roc_auc)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.savefig('Log_ROC')
plt.show()

Newdataset = pd.read_csv('newdata.csv')
ynew=classifier.predict(Newdataset)
print("Predicted Class for newdata.csv:",ynew)
precision recall f1-score support

0 0.78 0.95 0.86 38


1 0.93 0.74 0.82 38

accuracy 0.84 76
macro avg 0.86 0.84 0.84 76
weighted avg 0.86 0.84 0.84 76

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset

dataset = pd.read_csv('heart_disease_dataset.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 13].values

# handling missing data
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer = imputer.fit(X[:, 11:13])
X[:, 11:13] = imputer.transform(X[:, 11:13])

#splitting dataset into training set and test set

from sklearn.model_selection import train_test_split


X_train,X_test,Y_train,Y_test=train_test_split(X, y, test_size = 0.25, random_state = 101)

#feature scaling

from sklearn.preprocessing import StandardScaler


sc_X=StandardScaler()
X_train=sc_X.fit_transform(X_train)
X_test=sc_X.transform(X_test)

dataset.num.value_counts()

#### logistic regression

#fitting LR to training set

from sklearn.linear_model import LogisticRegression


classifier =LogisticRegression()
classifier.fit(X_train,Y_train)

#Predict the test set results

y_Class_pred=classifier.predict(X_test)
#checking the accuracy for predicted results
from sklearn.metrics import accuracy_score
accuracy_score(Y_test,y_Class_pred)

# Making the Confusion Matrix

from sklearn.metrics import confusion_matrix


cm = confusion_matrix(Y_test, y_Class_pred)
#Interpretation:

from sklearn.metrics import classification_report


print(classification_report(Y_test, y_Class_pred))

from sklearn.metrics import roc_auc_score


from sklearn.metrics import roc_curve
logit_roc_auc = roc_auc_score(Y_test, classifier.predict(X_test))
fpr, tpr, thresholds = roc_curve(Y_test, classifier.predict_proba(X_test)[:,1])
plt.figure()
plt.plot(fpr, tpr, label='Logistic Regression (area = %0.2f)' % logit_roc_auc)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.savefig('Log_ROC')
plt.show()

Newdataset = pd.read_csv('newdata.csv')
ynew=classifier.predict(Newdataset)
print("Predicted Class for newdata.csv:",ynew)
precision recall f1-score support

0 0.80 0.95 0.87 38


1 0.94 0.76 0.84 38

accuracy 0.86 76
macro avg 0.87 0.86 0.85 76
weighted avg 0.87 0.86 0.85 76

# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset


dataset = pd.read_csv('heart_disease_dataset.csv')
X = dataset.iloc[:,:-1].values
y = dataset.iloc[:, 13].values

# handling missing data
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer = imputer.fit(X[:, 11:13])
X[:, 11:13] = imputer.transform(X[:, 11:13])

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state =None)

# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

#EXPLORING THE DATASET

dataset.num.value_counts()

# Fitting Naive Bayes to the Training set


from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(X_train, y_train)

# Predicting the Test set results


y_pred = classifier.predict(X_test)

#ACCURACY SCORE
from sklearn.metrics import accuracy_score
accuracy_score(y_test,y_pred)

# Making the Confusion Matrix


from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)

#Interpretation:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))

from sklearn.metrics import roc_auc_score


from sklearn.metrics import roc_curve
logit_roc_auc = roc_auc_score(y_test, classifier.predict(X_test))
fpr, tpr, thresholds = roc_curve(y_test, classifier.predict_proba(X_test)[:,1])
plt.figure()
plt.plot(fpr, tpr, label='Naive Bayes (area = %0.2f)' % logit_roc_auc)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.savefig('Log_ROC')
plt.show()

Newdataset = pd.read_csv('newdata.csv')
ynew=classifier.predict(Newdataset)
print("Predicted Class for newdata.csv:",ynew)
precision recall f1-score support

0 0.78 0.97 0.87 37


1 0.97 0.74 0.84 39

accuracy 0.86 76
macro avg 0.87 0.86 0.85 76
weighted avg 0.88 0.86 0.85 76

# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset


dataset = pd.read_csv('heart_disease_dataset.csv')
X = dataset.iloc[:,:-1].values
y = dataset.iloc[:, 13].values

# handling missing data
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer = imputer.fit(X[:, 11:13])
X[:, 11:13] = imputer.transform(X[:, 11:13])

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 101)

# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

#EXPLORING THE DATASET

dataset.num.value_counts()

# Fitting Random Forest to the Training set


from sklearn.ensemble import RandomForestClassifier
classifier =RandomForestClassifier(n_estimators=20)
classifier.fit(X_train, y_train)

# Predicting the Test set results


y_pred = classifier.predict(X_test)

from sklearn.metrics import accuracy_score


accuracy_score(y_test,y_pred)

# Making the Confusion Matrix


from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)

#Interpretation:

from sklearn.metrics import classification_report


print(classification_report(y_test, y_pred))
#ROC
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
logit_roc_auc = roc_auc_score(y_test, classifier.predict(X_test))
fpr, tpr, thresholds = roc_curve(y_test, classifier.predict_proba(X_test)[:,1])
plt.figure()
plt.plot(fpr, tpr, label='Random Forest (area = %0.2f)' % logit_roc_auc)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.savefig('Log_ROC')
plt.show()

Newdataset = pd.read_csv('newdata.csv')
ynew=classifier.predict(Newdataset)
print("Predicted Class for newdata.csv:",ynew)
precision recall f1-score support

0 0.82 0.95 0.88 38


1 0.94 0.79 0.86 38

accuracy 0.87 76
macro avg 0.88 0.87 0.87 76
weighted avg 0.88 0.87 0.87 76

# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset


dataset = pd.read_csv('heart_disease_dataset.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 13].values

# handling missing data
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer = imputer.fit(X[:, 11:13])
X[:, 11:13] = imputer.transform(X[:, 11:13])

#splitting dataset into training set and test set

from sklearn.model_selection import train_test_split


X_train,X_test,Y_train,Y_test=train_test_split(X, y, test_size = 0.15, random_state = 101)

#feature scaling

from sklearn.preprocessing import StandardScaler


sc_X=StandardScaler()
X_train=sc_X.fit_transform(X_train)
X_test=sc_X.transform(X_test)

#EXPLORING THE DATASET

dataset.num.value_counts()

##SUPPORT VECTOR CLASSIFICATIONS

##checking for different kernels

from sklearn.svm import SVC

classifier = SVC(kernel = 'linear', random_state = 0 ,probability=True)


classifier.fit(X_train, Y_train)

# Predicting the Test set results


y_pred = classifier.predict(X_test)

from sklearn.metrics import accuracy_score


accuracy_score(Y_test,y_pred)

# Making the Confusion Matrix


from sklearn.metrics import confusion_matrix
cm = confusion_matrix(Y_test, y_pred)
#Interpretation:

from sklearn.metrics import classification_report


print(classification_report(Y_test, y_pred))

#ROC
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
logit_roc_auc = roc_auc_score(Y_test, classifier.predict(X_test))
fpr, tpr, thresholds = roc_curve(Y_test, classifier.predict_proba(X_test)[:,1])
plt.figure()
plt.plot(fpr, tpr, label='SVM (area = %0.2f)' % logit_roc_auc)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.savefig('Log_ROC')
plt.show()

Newdataset = pd.read_csv('newdata.csv')
ynew=classifier.predict(Newdataset)
print("Predicted Class for newdata.csv:",ynew)

precision recall f1-score support

0 0.84 0.96 0.90 28


1 0.93 0.72 0.81 18

accuracy 0.87 46
macro avg 0.89 0.84 0.86 46
weighted avg 0.88 0.87 0.87 46

Implemented code for finding the above model accuracies.

Hybrid Solution FOBU

The FOBU hybrid below builds model trees whose leaf nodes are logistic regression models, trains five such trees on increasing portions of the training set, and majority-votes their predictions on the test set.
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from copy import deepcopy
from sklearn.metrics import mean_squared_error

class ModelTree(object):

    def __init__(self, model, max_depth=5, min_samples_leaf=10):
        self.model = model
        self.max_depth = max_depth
        self.min_samples_leaf = min_samples_leaf
        self.tree = None

    def fit(self, X, y, verbose=False):
        model = self.model
        min_samples_leaf = self.min_samples_leaf
        max_depth = self.max_depth
        if verbose:
            print(" max_depth={}, min_samples_leaf={}...".format(max_depth, min_samples_leaf))

        def _build_tree(X, y):

            def _create_node(X, y, depth, container):
                loss_node, model_node = _fit_model(X, y, model)
                node = {"name": "node",
                        "index": container["index_node_global"],
                        "loss": loss_node,
                        "model": model_node,
                        "data": (X, y),
                        "n_samples": len(X),
                        "j_feature": None,
                        "threshold": None,
                        "children": {"left": None, "right": None},
                        "depth": depth}
                container["index_node_global"] += 1
                return node

            def _split_traverse_node(node, container):
                result = _splitter(node, model,
                                   max_depth=max_depth,
                                   min_samples_leaf=min_samples_leaf)

                if not result["did_split"]:
                    if verbose:
                        depth_spacing_str = " ".join([" "] * node["depth"])
                        print(" {}*leaf {} @ depth {}: loss={:.6f}, N={}".format(
                            depth_spacing_str, node["index"], node["depth"],
                            node["loss"], result["N"]))
                    return

                node["j_feature"] = result["j_feature"]
                node["threshold"] = result["threshold"]
                del node["data"]  # delete node stored data

                (X_left, y_left), (X_right, y_right) = result["data"]
                model_left, model_right = result["models"]

                if verbose:
                    depth_spacing_str = " ".join([" "] * node["depth"])
                    print(" {}node {} @ depth {}: loss={:.6f}, j_feature={}, threshold={:.6f}, N=({},{})".format(
                        depth_spacing_str, node["index"], node["depth"], node["loss"],
                        node["j_feature"], node["threshold"], len(X_left), len(X_right)))

                node["children"]["left"] = _create_node(X_left, y_left, node["depth"]+1, container)
                node["children"]["right"] = _create_node(X_right, y_right, node["depth"]+1, container)
                node["children"]["left"]["model"] = model_left
                node["children"]["right"]["model"] = model_right

                _split_traverse_node(node["children"]["left"], container)
                _split_traverse_node(node["children"]["right"], container)

            container = {"index_node_global": 0}  # mutable container
            root = _create_node(X, y, 0, container)  # depth 0 root node
            _split_traverse_node(root, container)  # split and traverse root node
            return root

        self.tree = _build_tree(X, y)
        return self.tree

    def predict(self, X):
        assert self.tree is not None

        def _predict(node, x):
            no_children = node["children"]["left"] is None and \
                          node["children"]["right"] is None
            if no_children:
                y_pred_x = node["model"].predict([x])[0]
                return y_pred_x
            if x[node["j_feature"]] <= node["threshold"]:  # x[j] <= threshold
                return _predict(node["children"]["left"], x)
            return _predict(node["children"]["right"], x)  # x[j] > threshold

        y_pred = np.array([_predict(self.tree, x) for x in X])
        return y_pred

    def loss(self, X, y, y_pred):
        loss = self.model.loss(X, y, y_pred)
        return loss

def _splitter(node, model, max_depth=5, min_samples_leaf=10):
    X, y = node["data"]
    depth = node["depth"]
    N, d = X.shape
    did_split = False
    loss_best = node["loss"]
    data_best = None
    models_best = None
    j_feature_best = None
    threshold_best = None

    if (depth >= 0) and (depth < max_depth):
        for j_feature in range(d):
            # try every observed value of this feature as a split threshold
            threshold_search = [X[i, j_feature] for i in range(N)]

            for threshold in threshold_search:
                (X_left, y_left), (X_right, y_right) = _split_data(j_feature, threshold, X, y)
                N_left, N_right = len(X_left), len(X_right)

                split_conditions = [N_left >= min_samples_leaf,
                                    N_right >= min_samples_leaf]
                if not all(split_conditions):
                    continue

                loss_left, model_left = _fit_model(X_left, y_left, model)
                loss_right, model_right = _fit_model(X_right, y_right, model)
                loss_split = (N_left*loss_left + N_right*loss_right) / N

                if loss_split < loss_best:
                    did_split = True
                    loss_best = loss_split
                    models_best = [model_left, model_right]
                    data_best = [(X_left, y_left), (X_right, y_right)]
                    j_feature_best = j_feature
                    threshold_best = threshold

    result = {"did_split": did_split,
              "loss": loss_best,
              "models": models_best,
              "data": data_best,
              "j_feature": j_feature_best,
              "threshold": threshold_best,
              "N": N}

    return result
def _fit_model(X, y, model):
    model_copy = deepcopy(model)  # must deepcopy the model!
    model_copy.fit(X, y)
    y_pred = model_copy.predict(X)
    loss = model_copy.loss(X, y, y_pred)
    assert loss >= 0.0
    return loss, model_copy

def _split_data(j_feature, threshold, X, y):
    idx_left = np.where(X[:, j_feature] <= threshold)[0]
    idx_right = np.delete(np.arange(0, len(X)), idx_left)
    assert len(idx_left) + len(idx_right) == len(X)
    return (X[idx_left], y[idx_left]), (X[idx_right], y[idx_right])

class logistic_regr:

    def __init__(self):
        from sklearn.linear_model import LogisticRegression
        self.model = LogisticRegression(penalty="l2", solver='liblinear')
        self.flag = False        # set when the training labels contain a single class
        self.flag_y_pred = None

    def fit(self, X, y):
        y_unique = list(set(y))
        if len(y_unique) == 1:
            # only one class present in this leaf: remember it instead of fitting
            self.flag = True
            self.flag_y_pred = y_unique[0]
        else:
            self.model.fit(X, y)

    def predict(self, X):
        if self.flag:
            return self.flag_y_pred * np.ones((len(X),), dtype=int)
        return self.model.predict(X)

    def loss(self, X, y, y_pred):
        return mean_squared_error(y, y_pred)

    def predict_proba(self, X):
        return self.model.predict_proba(X)

dataset = pd.read_csv('heart_disease_dataset.csv')
X = dataset.iloc[:,:-1].values
y = dataset.iloc[:, 13].values

from sklearn.impute import SimpleImputer

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer = imputer.fit(X[:, 11:13])
X[:, 11:13] = imputer.transform(X[:, 11:13])

from sklearn.model_selection import train_test_split


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state =9)

from sklearn.preprocessing import StandardScaler


sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

dataset.num.value_counts()

estimators = 5
y_pred = []
n_train_split = int(len(X_train) / estimators)
initial_train = 0
final_train = 0

yy_pred = []
classifier = None

for i in range(1, estimators+1):
    classifier = logistic_regr()

    # train each model tree on a growing prefix of the training set
    final_train = i * n_train_split
    temp_X_train = X_train[initial_train:final_train]
    temp_y_train = y_train[initial_train:final_train]

    L = ModelTree(classifier, max_depth=20, min_samples_leaf=10)
    node = L.fit(temp_X_train, temp_y_train, verbose=False)
    classifier = node["model"]

    y_pred_temp = L.predict(X_test)
    yy_pred.append(y_pred_temp)

# majority vote across the per-estimator predictions
for j in range(len(yy_pred[0])):
    curr = []
    for i in range(len(yy_pred)):
        curr.append(yy_pred[i][j])
    a = curr.count(0)
    b = curr.count(1)
    if a > b:
        y_pred.append(0)
    else:
        y_pred.append(1)
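The same majority vote can be written more compactly with NumPy. A sketch, assuming yy_pred holds the per-estimator prediction arrays collected above (with an odd number of estimators, ties cannot occur):

# Vectorized equivalent of the voting loop above.
votes = np.stack(yy_pred)                                    # shape: (estimators, n_test)
y_pred = (votes.sum(axis=0) > len(yy_pred) / 2).astype(int)  # per-column majority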

from sklearn.metrics import accuracy_score
accuracy_score(y_test,y_pred)

from sklearn.metrics import confusion_matrix


cm = confusion_matrix(y_test, y_pred)

from sklearn.metrics import classification_report


print(classification_report(y_test, y_pred))

from sklearn.metrics import roc_auc_score


from sklearn.metrics import roc_curve
logit_roc_auc = roc_auc_score(y_test, y_pred)
fpr, tpr, thresholds = roc_curve(y_test, classifier.predict_proba(X_test)[:,1])
plt.figure()
plt.plot(fpr, tpr, label='HYBRID_FOBU(area = %0.2f)' % logit_roc_auc)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Proposed Model')
plt.legend(loc="lower right")
plt.show()

precision recall f1-score support

0 0.91 0.87 0.89 47


1 0.81 0.86 0.83 29

accuracy 0.87 76
macro avg 0.86 0.87 0.86 76
weighted avg 0.87 0.87 0.87 76

RESULTS

Setup for experimentation: A Jupyter notebook is used to build the model and to perform the cardiac disease classification on the dataset. Real-life data collected from four prestigious hospitals in Switzerland and the United States of America is used for this work. The data is categorized under 76 attributes, of which 14 are provided in the dataset; 13 of these attributes are used for classification. The attribute named target shows the presence of the disease: 0 in the target field indicates that there is no heart disease, and 1 indicates the presence of heart disease.

Evaluation of results: The results are evaluated using the confusion matrix, the accuracy score, and the area under the ROC curve. The confusion matrix produces four counts, TP (True Positive), TN (True Negative), FP (False Positive), and FN (False Negative), from which the accuracy is generated using the accuracy score.
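For reference, sensitivity and specificity follow directly from those four counts. A minimal sketch, assuming the 2x2 confusion matrix cm computed in the code above (scikit-learn orders it [[TN, FP], [FN, TP]]):

# Deriving the evaluation metrics from the confusion matrix.
tn, fp, fn, tp = cm.ravel()
accuracy = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)   # recall of the positive class
specificity = tn / (tn + fp)   # recall of the negative class
print(accuracy, sensitivity, specificity)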

The prediction models are developed using 13 features, and the accuracy is calculated for each modelling technique. The best classification methods are given in Table 6, which compares accuracy, classification error, precision, F-measure, sensitivity, and specificity. The highest accuracy is achieved by the proposed hybrid classification method in comparison with the existing methods.

Out of the 13 features we examined, the top four significant features that helped us classify between a positive and negative diagnosis were chest pain type (cp), maximum heart rate achieved (thalach), number of major vessels (ca), and ST depression induced by exercise relative to rest (oldpeak), as shown in Fig. 3.

Fig. 3. Feature importance graph

Fig 4. Proposed Model Accuracy

Results with random oversampling:

This oversampling technique is used to supplement the training data with multiple duplicates of some of the minority-class samples, and it can be applied more than once. It is one of the simplest techniques, and it has also proven to be robust. Instead of copying every example in the minority class, some of them are randomly picked with replacement. In simple terms, existing samples are sampled with replacement to produce new samples.
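A minimal sketch of this resampling step with scikit-learn's resample utility, assuming the X_train/y_train split from the earlier code and that class 1 is the minority class (the imbalanced-learn package offers an equivalent RandomOverSampler):

# Random oversampling: draw minority-class samples with replacement
# until the two classes are balanced.
import numpy as np
from sklearn.utils import resample

minority = (y_train == 1)                            # assumption: class 1 is the minority
n_needed = int((~minority).sum() - minority.sum())   # copies needed to balance the classes
X_extra, y_extra = resample(X_train[minority], y_train[minority],
                            replace=True, n_samples=n_needed, random_state=0)
X_bal = np.vstack([X_train, X_extra])
y_bal = np.concatenate([y_train, y_extra])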

ALGORITHM             PRECISION   RECALL   ACCURACY
Logistic Regression      68         66       67.5
K-NN                     74         75       79.4
Random Forest            95         74       78
Decision Tree            86         89       65.5
Naïve Bayes              75         31       60
SVM                      74         67       79
HYBRID FOBU              91.6       81       87

Table 6. Results with random oversampling

Fig. 5. Correlation Matrix

There is a positive correlation between chest pain (cp) and target (our predictor). This makes sense, since a greater amount of chest pain is associated with a greater chance of having heart disease. Chest pain (cp) is an ordinal feature with four values: Value 1: typical angina, Value 2: atypical angina, Value 3: non-anginal pain, Value 4: asymptomatic.
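These correlations can be reproduced directly from the dataframe. A sketch, assuming the label column is named num as in the code above (some copies of this dataset name it target instead):

# Correlation of every feature with the label, plus the full matrix.
import pandas as pd
import matplotlib.pyplot as plt

dataset = pd.read_csv('heart_disease_dataset.csv')
corr = dataset.corr()
print(corr['num'].sort_values(ascending=False))   # feature-label correlations
plt.matshow(corr)
plt.colorbar()
plt.show()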

In addition, we see a negative correlation between exercise-induced angina and our predictor. This makes sense because when you exercise, your heart requires more blood, but narrowed arteries slow down blood flow.
Comparing positive and negative heart disease patients, there are vast differences in means for many of our features. Examining the details, we can observe that positive patients show a heightened average maximum heart rate achieved (thalach). In addition, positive patients exhibit about one third the amount of ST depression induced by exercise relative to rest (oldpeak).
Our hybrid machine learning algorithm can now classify patients with heart disease, so we can properly diagnose patients and get them the help they need to recover. By detecting these features early, we may prevent worse symptoms from arising later. Our Random Forest algorithm yields the highest accuracy, 80%. Any accuracy above 70% is considered good, but be careful, because an extremely high accuracy may be too good to be true (an example of overfitting). Thus, 80% is a satisfactory accuracy.

BENCHMARKING OF THE PROPOSED MODEL

Benchmarking is needed to compare the performance of the existing models with the proposed hybrid model, and to identify whether the proposed method is the best and improves accuracy. The accuracy is calculated from the number of selected features and the results the model generates. The hybrid method has no restriction on which features to use; all the features selected in this model accomplish the best results. The proposed method is applied to all 13 attributes and the classification is evaluated based on the error rate. This result shows that the selected features and the ML techniques used are effective in accurately predicting the heart disease of patients compared with known existing models.

A stack of six machine learning algorithms and their results were compared with the proposed model. The models were compared based on accuracy, precision, F-measure, sensitivity, and specificity; the comparison results are shown in Fig. 6. The hybrid model achieved the highest accuracy when compared with the Decision Tree, Random Forest, Logistic Regression, Naive Bayes (Gaussian, Bernoulli, and multinomial), and Support Vector Machine models.
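A sketch of how such a comparison can be tabulated, assuming the fitted models and the held-out x_test/y_test from the earlier code (the model1...model6 and labels names are reused from above):

# Benchmarking loop: accuracy, precision, sensitivity, specificity, F-measure.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

for name, m in zip(labels, [model1, model2, model3, model4, model5, model6]):
    p = m.predict(x_test)
    spec = recall_score(y_test, p, pos_label=0)   # specificity = recall of class 0
    print('%s: acc=%.2f prec=%.2f sens=%.2f spec=%.2f f1=%.2f' % (
        name, accuracy_score(y_test, p), precision_score(y_test, p),
        recall_score(y_test, p), spec, f1_score(y_test, p)))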

Fig. 6 Performance comparison with various models

CONCLUSION
Identifying and processing raw healthcare data on heart information will help
in the long-term saving of human lives and early detection of abnormalities in
heart conditions. Machine learning techniques were used in this work to process
raw data and provide new insight into heart disease. Heart disease prediction
is challenging and very important in the medical field. However, the mortality
rate can be drastically controlled if the disease is detected at the early
stages and preventative measures are adopted as soon as possible. Further
extension of this study toward real-world datasets, instead of just theoretical
approaches and simulations, is highly desirable. The proposed hybrid approach
combines the characteristics of Random Forest (RF) and a Linear Method (LM),
and it proved to be quite accurate in the prediction of heart disease. The
future course of this research can explore diverse mixtures of machine learning
techniques to arrive at better prediction techniques. Furthermore, new feature
selection methods can be developed to gain a broader perception of the
significant features and so increase the performance of heart disease
prediction.

REFERENCES

[1] M. S. Amin, Y. K. Chiam, and K. D. Varathan, ‘‘Identification of significant features and data mining techniques in predicting heart disease,’’ Telematics Inform., vol. 36, pp. 82–93, Mar. 2019. [Online]. Available: https://linkinghub.elsevier.com/retrieve/pii/S0736585318308876
[2] S. M. S. Shah, S. Batool, I. Khan, M. U. Ashraf, S. H. Abbas, and S. A. Hussain, ‘‘Feature
extraction through parallel probabilistic principal component analysis for heart disease
diagnosis,’’ Phys. A, Stat. Mech. Appl., vol. 482, pp. 796–807, 2017. doi:
10.1016/j.physa.2017.04.113.
[3] Y. E. Shao, C.-D. Hou, and C.-C. Chiu, ‘‘Hybrid intelligent modelling schemes for heart
disease classification,’’ Appl. Soft Comput. J., vol. 14, pp. 47–52, Jan. 2014. doi:
10.1016/j.asoc.2013.09.020.
[4] J. S. Sonawane and D. R. Patil, ‘‘Prediction of heart disease using multilayer perceptron
neural network,’’ in Proc. Int. Conf. Inf. Commun. Embed-
ded Syst., Feb. 2014, pp. 1–6.
[5] C. Sowmiya and P. Sumitra, ‘‘Analytical study of heart disease diagnosis using
classification techniques,’’ in Proc. IEEE Int. Conf. Intell. Techn. Control, Optim. Signal
Process. (INCOS), Mar. 2017, pp. 1–5.
[6] B. Tarle and S. Jena, ‘‘An artificial neural network based pattern classification algorithm
for diagnosis of heart disease,’’ in Proc. Int. Conf. Comput.,Commun., Control Automat.
(ICCUBEA), Aug. 2017, pp. 1–4.
[7] V. P. Tran and A. A. Al-Jumaily, ‘‘Non-contact Doppler radar based prediction of nocturnal
body orientations using deep neural network for chronic heart failure patients,’’ in Proc. Int.
Conf. Elect. Comput. Technol. Appl. (ICECTA), Nov. 2017, pp. 1–5.
[8] K. Uyar and A. Ilhan, ‘‘Diagnosis of heart disease using genetic algorithm based trained
recurrent fuzzy neural networks,’’ Procedia Comput. Sci., vol. 120, pp. 588–593, 2017.
[9] T. Vivekanandan and N. C. S. N. Iyengar, ‘‘Optimal feature selection using a modified
differential evolution algorithm and its effectiveness for prediction of heart disease,’’ Comput.
Biol. Med., vol. 90, pp. 125–136, Nov. 2017.
[10] S. Radhimeenakshi, ‘‘Classification and prediction of heart disease risk using data mining
techniques of support vector machine and artificial neural network,’’ in Proc. 3rd Int. Conf.
Comput. Sustain. Global Develop. (INDIACom), New Delhi, India, Mar. 2016, pp. 3107–
3111.
[11] R. Wagh and S. S. Paygude, ‘‘CDSS for heart disease prediction using risk factors,’’ Int.
J. Innov. Res. Comput., vol. 4, no. 6, pp. 12082–12089, Jun. 2016.
[12] O. W. Samuel, G. M. Asogbon, A. K. Sangaiah, P. Fang, and G. Li, ‘‘An integrated
decision support system based on ANN and Fuzzy_AHP for heart failure risk prediction,’’
Expert Syst. Appl., vol. 68, pp. 163–172,
Feb. 2017.
[13] S. Zaman and R. Toufiq, ‘‘Codon based back propagation neural network approach to
classify hypertension gene sequences,’’ in Proc. Int. Conf. Elect., Comput. Commun. Eng.
(ECCE), Feb. 2017, pp. 443–446.
[14] W. Zhang and J. Han, ‘‘Towards heart sound classification without segmentation using
convolutional neural network,’’ in Proc. Comput. Cardiol. (CinC), vol. 44, Sep. 2017, pp. 1–4.
[15] Y. Meidan, M. Bohadana, A. Shabtai, J. D. Guarnizo, M. Ochoa, N. O. Tippenhauer, and
Y. Elovici, ‘‘ProfilIoT: A machine learning approach for IoT device identification based on
network traffic analysis,’’ in Proc. Symp. Appl. Comput., Apr. 2017, pp. 506–509.
[16] J. Wu, S. Luo, S. Wang, and H. Wang, ‘‘NLES: A novel lifetime extension scheme for safety-critical cyber-physical systems using SDN and NFV,’’ IEEE Internet Things J., vol. 6, no. 2, pp. 2463–2475, Apr. 2019.
[17] J. Wu, M. Dong, K. Ota, J. Li, and Z. Guan, ‘‘Big data analysis-based secure cluster management for optimized control plane in software-defined networks,’’ IEEE Trans. Netw. Service Manag., vol. 15, no. 1, pp. 27–38, Mar. 2018.
[18] J. Wu, M. Dong, K. Ota, J. Li, and Z. Guan, ‘‘FCSS: Fog computing-based content-aware
filtering for security services in information centric social networks,’’ IEEE Trans. Emerg.
Topics Comput., to be published. doi: 10.1109/TETC.2017.2747158.
[20] G. Li, J. Wu, J. Li, K. Wang, and T. Ye, ‘‘Service popularity-based smart resources partitioning for fog computing-enabled industrial Internet of things,’’ IEEE Trans. Ind. Informat., vol. 14, no. 10, pp. 4702–4711, Oct. 2018.
[21] J. Wu, K. Ota, M. Dong, and C. Li, ‘‘A hierarchical security framework for defending
against sophisticated attacks on wireless sensor networks in smart cities,’’ IEEE Access, vol.
4, pp. 416–424, 2016.
[22] H. Li, K. Ota, and M. Dong, ‘‘Learning IoT in edge: Deep learning for the Internet of
Things with edge computing,’’ IEEE Netw., vol. 32, no. 1, pp. 96–101, Jan./Feb. 2018.

[23] M. R. Thomas and G. Y. Lip, ‘‘Novel risk markers and risk assessments for cardiovascular disease,’’ Circulation Research, vol. 120, no. 1, pp. 133–149, 2017. https://doi.org/10.1161/CIRCRESAHA.116.309955 PMID: 28057790; A. M. Alaa, T. Bolton, E. Di Angelantonio, J. H. F. Rudd, and M. van der Schaar, ‘‘Cardiovascular disease risk prediction using automated machine learning: A prospective study of 423,604 UK Biobank participants,’’ PLOS ONE, vol. 14, no. 5.

[24] S. F. Weng, J. Reps, J. Kai, J. M. Garibaldi, and N. Qureshi, ‘‘Can machine-learning improve cardiovascular risk prediction using routine clinical data?’’ PLOS ONE, Apr. 4, 2017. https://doi.org/10.1371/journal.pone.0174944

[25] R. Nakanishi, D. Dey, F. Commandeur, and P. Slomka, ‘‘Machine learning in predicting coronary heart disease and cardiovascular disease events: Results from the Multi-Ethnic Study of Atherosclerosis (MESA),’’ JACC, vol. 71, no. 11, Mar. 2018.

[26] CDC. [Online]. Available: https://www.cdc.gov/heartdisease/facts.htm

[27] S. Mohan, C. Thirumalai, and G. Srivastava, ‘‘Effective heart disease prediction using hybrid machine learning techniques,’’ IEEE Access, vol. 7, 2019. doi: 10.1109/ACCESS.2019.2923707.

[28] A. Gavhane, G. Kokkula, I. Pandya, and K. Devadkar, ‘‘Prediction of heart disease using machine learning,’’ in Proc. 2nd Int. Conf. Electron., Commun. Aerosp. Technol. (ICECA), Mar. 2018, pp. 1275–1278.

[29] M. Sultana, A. Haider, and M. S. Uddin, ‘‘Analysis of data mining techniques for heart disease prediction,’’ in Proc. 3rd Int. Conf. Electr. Eng. Inf. Commun. Technol. (ICEEICT), 2016.

[30] M. Akhil, B. L. Deekshatulu, and P. Chandra, ‘‘Classification of heart disease using K-nearest neighbor and genetic algorithm,’’ Procedia Technol., vol. 10, pp. 85–94, 2013.

[31] N. Al-milli, ‘‘Backpropogation neural network for prediction of heart disease,’’ J. Theor. Appl. Inf. Technol., vol. 56, no. 1, pp. 131–135, 2013.


