by
RUPLI KUMARI
19MCA1097
I further declare that the work reported in this thesis has not been
submitted and will not be submitted, either in part or in full, for the award
of any other degree or diploma in this institute or any other institute or
university.
Place: Chennai
ACKNOWLEDGEMENT
Place: Vellore
CONTENTS
INTRODUCTION
RELATED WORK
BACKGROUND
PROBLEM STATEMENT
METHODOLOGY
• Dataset collection
• Splitting
• Classification
DATA PRE-PROCESSING
CLASSIFICATION MODELLING
• DECISION TREES
• KNN (K-Nearest Neighbour Algorithm)
• RANDOM FOREST
• LOGISTIC REGRESSION
PERFORMANCE MEASURES
PROPOSED WORK
CONCLUSION
REFERENCES
LIST OF FIGURES
LIST OF TABLES
Table 2. Dataset
LIST OF ACRONYMS
ECG (Electrocardiogram)
LR (Logistic Regression)
TP (True Positive)
TN (True Negative)
FP (False Positive)
FN (False Negative)
INTRODUCTION
The heart is a vital organ of the human body and is responsible for pumping blood to other
organs of the body. Heart failure (HF) is a serious disorder with high prevalence. HF is
prevalent in developed countries at a rate of approximately 2% in the adult population and
about 8% in older subjects. Moreover, the literature shows that about 3-5% of hospital
admissions are connected with HF incidents. HF is also very costly:
in developed countries it accounts for 2% of the total health costs. Hence, the
development of non-invasive methods for HF detection based on machine learning and
data mining will help improve quality of life and reduce the associated medical costs.
Recently, different machine learning researchers have developed numerous models based
on feature transformation and machine learning methods for disease detection and
mortality prediction. Earlier studies developed logistic regression, C4.5, Naive Bayes,
BNNF and BNND algorithms and obtained HF classification accuracies of 71%, 81.11%,
81.48%, 80.95% and 81.11% respectively. The HF classification accuracy was improved
to 84.5% by Polat et al., who developed an artificial immune system. Polat et al. proposed
another novel system, which further improved the HF classification accuracy to 87%.
Recently, Ali et al. developed a hybrid method in which an L1-regularized SVM was
hybridized with a linear discriminant classifier. Their hybrid method resulted in an HF
classification accuracy of 90%. Recent research has concentrated on feature
transformation and selection for improved HF prediction. In this study, we search for the
optimal feature extraction algorithm by evaluating the performance of different feature
extraction algorithms, namely Principal Component Analysis (PCA), Sparse PCA, Kernel PCA and
Incremental PCA. These algorithms are integrated with five different state-of-the-art
machine learning models, including linear regression, Gaussian Naïve Bayes and linear
discriminant analysis. To evaluate the performance of the developed integrated models,
four different evaluation metrics are used, i.e., Matthews Correlation Coefficient (MCC),
specificity, sensitivity and accuracy. The remainder of the paper has three sections. The second
section briefly explains the HF database and the developed integrated models. Section III
is about simulation results and discussion of the obtained results while the last section
presents the conclusion.
The healthcare industry collects large amounts of healthcare data that need to be mined
to discover hidden information for effective decision making. Motivated by the worldwide
increasing mortality of heart disease patients each year and the availability of a huge
number of patients’ data from which to extract useful knowledge, researchers have been
using data mining techniques to help health care professionals in the diagnosis of heart
disease (Helma, Gottmann et al. 2000). Data mining is the exploration of large datasets to
extract hidden and previously unknown patterns, relationships and knowledge that are
difficult to detect with traditional statistical methods (Lee, Liao et al. 2000). Thus, data
mining refers to mining or extracting knowledge from large amounts of data. Data mining
applications will be used for better health policy-making and prevention of hospital errors,
early detection, prevention of diseases and preventable hospital deaths (Ruben 2009). A
heart disease prediction system can assist medical professionals in predicting heart disease
based on the clinical data of patients [1]. Hence, by implementing a heart disease prediction
system using data mining techniques and applying data mining to various heart
disease attributes, it becomes possible to estimate the probability that a patient will be
diagnosed with heart disease. This paper presents a new model that enhances the Decision
Tree accuracy in identifying heart disease patients. It uses different Decision Tree
algorithms.
RELATED WORK
Prediction of heart disease using data mining techniques has been an ongoing effort for
the past two decades. Most of the papers have implemented several data mining
techniques for the diagnosis of heart disease such as Decision Tree, Naive Bayes, neural
network, kernel density, automatically defined groups, bagging algorithm and support
vector machine showing different levels of accuracies on multiple databases of patients
from around the world. One of the bases on which the papers differ is the selection of
parameters on which the methods have been used. Many authors have specified different
parameters and databases for testing the accuracies. In particular, researchers have been
investigating the application of the Decision Tree technique in the diagnosis of heart
disease with considerable success. Sitar-Taut et al. used the Weka tool to investigate
applying Naive Bayes and J48 Decision Trees for the detection of coronary heart disease.
Tu et al. used the bagging algorithm in the Weka tool and compared it with the J4.8
Decision Tree in the diagnosis of heart disease. In another study, heart disease is
diagnosed by a Random Forest algorithm, and the disease is predicted based on the
probability of decision support. As a result, the author concluded that the
decision tree performs well and that its accuracy is sometimes similar to that of Bayesian
classification. The paper "An Efficient Classification Tree Technique for Heart Disease
Prediction" analyzes the classification tree techniques in data mining. The classification
tree algorithms used and tested in that paper are Decision Stump, Random Forest and the LMT
Tree algorithm. The objective of that research was to compare the outcomes of the
performance of different classification techniques for a heart disease dataset. ANN has
been introduced to produce the highest accuracy prediction in the medical field. The
backpropagation multilayer perceptron (MLP) of ANN is used to predict heart disease.
The obtained results are compared with the results of existing models within the same
domain and found to be improved. The data of heart disease patients collected from the
UCI laboratory is used to discover patterns with NN, DT, Support Vector machines SVM,
and Naive Bayes. The results are compared for performance and accuracy with these
algorithms. The proposed hybrid method returns results of 86.8% for F-measure,
competing with the other existing methods. The classification without segmentation of
Convolutional Neural Networks (CNN) is introduced. This method considers the heart
cycles with various start positions from the Electrocardiogram (ECG) signals in the
training phase. CNN can generate features with various positions in the testing phase of
the patient. A large amount of data generated by the medical industry has not been used
effectively previously. The new approaches presented here decrease the cost and improve
the prediction of heart disease easily and effectively. The different research techniques
considered in this work for the prediction and classification of heart disease using ML and
deep learning (DL) techniques are highly accurate in establishing the efficacy of these
methods.
BACKGROUND
Millions of people develop some form of heart disease every year, and heart disease is
the biggest killer of both men and women in the United States and around the world. The
World Health Organization (WHO) has estimated that twelve million deaths occur worldwide
due to heart disease. Heart disease kills one person in the world roughly every 34 seconds.
Medical diagnosis plays a vital role, yet it is a complicated task that needs to be
executed efficiently and accurately. To reduce the cost of clinical tests, appropriate
computer-based information and decision support systems should be used. Data
mining is the use of software techniques for finding patterns and consistency in sets of
data. Also, with the advent of data mining in the last two decades, there is a big opportunity
to allow computers to directly construct and classify the different attributes or classes.
Knowledge of the risk factors associated with heart disease helps health care
professionals to identify patients at high risk of having heart disease. Statistical analysis has
identified the risk factors associated with heart disease to be age, blood pressure, total
cholesterol, diabetes, hypertension, family history of heart disease, obesity, lack of
physical exercise, fasting blood sugar, etc. [3]. Researchers have been applying different
data mining techniques to help health care professionals diagnose heart disease with
improved accuracy. Neural networks, Naive Bayes and Decision Trees are some of the
techniques used in the diagnosis of heart disease. Applying Decision Tree techniques has
shown useful accuracy in the diagnosis of heart disease. But assisting health care
professionals in the diagnosis of the world’s biggest killer demands higher accuracy. Our
research seeks to improve diagnosis accuracy to improve health outcomes. Decision Tree
is one of the data mining techniques that cannot handle continuous variables directly, so
the continuous attributes must be converted to discrete attributes. Some Decision
Tree algorithms use binary discretization for continuous-valued features. Another important
accuracy improvement is applying reduced error pruning to the Decision Tree in the
diagnosis of heart disease patients. Intuitively, more complex models might be expected
to produce more accurate results, but which techniques are best? Seeking to thoroughly
investigate options for accuracy improvements in heart disease diagnosis, this paper
systematically compares multiple Decision Tree classifiers. This
research uses the Waikato Environment for Knowledge Analysis (WEKA). The data
in the UCI repository is usually presented in a database or spreadsheet. To use this data
with the WEKA tool, the data sets need to be in the ARFF format (attribute-relation file
format). The WEKA tool is used to pre-process the dataset. After reviewing all 76
different attributes, the unimportant attributes are dropped and only the important
attributes (i.e. 14 attributes in this case) are considered for analysis to yield more accurate
and better results. The 14th one is the predicted attribute, which is referred to as Class. A
thorough comparison between different decision tree algorithms within the WEKA tool,
and deriving decisions from it, would help the system to predict the likely presence
of heart disease in the patient, and would help to diagnose heart disease well in advance so
that it can be treated at the right time.
The chosen approach is implemented using the WEKA tool. WEKA is an open-source
software tool that consists of a collection of machine learning algorithms for data
mining tasks. It contains tools for data pre-processing,
classification, regression, clustering, association rules, and visualization [4]. For testing,
the classification tools and Explorer mode of WEKA are used. Decision Tree classifiers
with 10-fold cross-validation in test mode are considered for this study.
The following steps are performed in WEKA.
YEAR | AUTHOR | PURPOSE | ALGORITHMS USED | ACCURACY
2015 | Boshra Brahmi et al. | Prediction and Diagnosis of Heart Disease by Data Mining Technique | J48, Naïve Bayes, KNN, SMV | J48 gives better accuracy than other three techniques
2018 | Chala Bayen et al. | Prediction and Analysis the occurrence of Heart Disease using datamining techniques | J48, Naïve Bayes, SVM | It gives a short time result which helps to give quality of services and reduce cost to individual
PROBLEM STATEMENT
Invasive methods are typically performed only once a patient presents with certain
symptoms. These are often the primary indications from which even an ordinary person with
little knowledge can tell that the patient is experiencing heart disease or a stroke.
Also, these methods are generally very expensive and computationally complex, and they take
time to evaluate. During our research, we found that there is no framework that examines
the features and symptoms related to a patient, together with lifestyle and family history,
and that could serve as an early warning to the patient. We would like to make patients aware
in advance so that they can be careful and take the necessary preventive steps to keep such a
complex illness from developing. The problem we encountered is that prediction alone cannot
rule out the disease. It needs to be treated through three basic things:
1. Medicine 2. Precautions 3. Changing lifestyle by suggesting physical activity while
considering the patient's different attributes. Therefore, our model is to predict the level of heart
disease and suggest medicinal and non-medicinal ways to get rid of the heart disease.
METHODOLOGY
Dataset collection
Kaggle is one of the most popular online community websites for data science and machine
learning. Kaggle allows users to find and publish datasets. It hosts datasets on a wide
range of topics, so related datasets are easy to find. Here the datasets are taken from
the Kaggle website. The datasets include 2 columns and 12 rows where the columns include
the serial number, attributes and description. The rows include the attributes, namely age, sex,
chest pain, cholesterol rate, resting electrocardiographic results, fasting blood sugar, thalach,
exang, oldpeak, slope, number of major vessels coloured, and thal, which denotes the defect type.
The attributes in the dataset are listed in Table 2.
The dataset contains 303 records and 13 attributes. There are no missing values in the dataset.
In the dataset, we have an attribute called target, which denotes whether the person has heart
disease or not. If the patient has heart disease, the value in the target field is 1, else the value is
set to 0, which indicates that the patient does not have heart disease. In this dataset, 165 patients had
heart disease, and 138 patients had no heart disease. The other attributes are type of chest pain,
level of blood pressure, serum cholesterol, blood sugar level, results of the electrocardiogram,
maximum heart rate, exercise-induced angina, ST depression induced by exercise relative to
rest, the slope of the peak exercise ST segment, number of major vessels coloured by
fluoroscopy and reversible defect.
The database contains NaN values. NaN values cannot be processed by the program, hence
these values need to be converted into numerical values. In this approach, the mean of the
column is calculated and NaN values are replaced by the mean.
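The mean replacement described above can be sketched as follows. This is a minimal illustration in plain Python, not the code used in this work; the function name impute_column_mean and the toy records are hypothetical.

```python
import math
from statistics import mean

def impute_column_mean(rows, column):
    """Return copies of `rows` with NaN entries in `column` replaced by the column mean."""
    # The mean is computed over the observed (non-NaN) values only.
    observed = [r[column] for r in rows if not math.isnan(r[column])]
    col_mean = mean(observed)
    return [
        {**r, column: col_mean} if math.isnan(r[column]) else dict(r)
        for r in rows
    ]

# Hypothetical toy records; the values are illustrative, not taken from the dataset.
records = [
    {"age": 63, "chol": 233.0},
    {"age": 41, "chol": float("nan")},
    {"age": 57, "chol": 247.0},
]
filled = impute_column_mean(records, "chol")
```

Here the missing cholesterol value is replaced by the mean of the two observed values, while complete records are left unchanged.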
Splitting:
The whole database is split into a training and testing database. 80% of data is taken for training
while the remaining 20% data is used for testing.
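A minimal sketch of such an 80/20 split is given below, assuming a random shuffle with a fixed seed; the text does not state how the records are assigned to each part, so the shuffle, the seed and the helper name train_test_split are assumptions made for illustration.

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle the rows and hold out `test_fraction` of them for testing."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Stand-in for the 303 patient records (indices only, for illustration).
records = list(range(303))
train, test = train_test_split(records)
```

With 303 records, this holds out 61 records for testing and keeps 242 for training.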
Classification:
Models are trained on the training data using four different machine learning algorithms, i.e.,
Decision Tree, KNN, K-means clustering and AdaBoost. Each algorithm is explained in detail.
Table 2. Dataset
DATA PRE-PROCESSING
Heart disease data is pre-processed after the collection of various records. The dataset contains
a total of 303 patient records, of which 6 records have some missing values. Those 6 records
have been removed from the dataset and the remaining 297 patient records are used in
pre-processing. A multiclass variable and binary classification are introduced for the attributes
of the given dataset. The multi-class variable is used to check the presence or absence of heart
disease. If the patient has heart disease, the value is set to 1; otherwise the value
is set to 0, indicating the absence of heart disease in the patient. The pre-processing of data is
carried out by converting medical records into diagnosis values. The results of data
pre-processing for 297 patient records indicate that 137 records show the value of 1, establishing
the presence of heart disease, while the remaining 160 show the value of 0, indicating the
absence of heart disease.
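The two pre-processing steps above (removing records with missing values, then mapping the diagnosis to 0 or 1) can be sketched as follows. The rule that any raw diagnosis value greater than 0 means disease is an assumption made for illustration, as are the helper names and the toy records.

```python
import math

def is_complete(record):
    """True when no field of the record is NaN."""
    return not any(isinstance(v, float) and math.isnan(v) for v in record.values())

def preprocess(records, target_key="target"):
    """Drop incomplete records and map the diagnosis to 1 (disease) or 0 (no disease)."""
    cleaned = []
    for r in records:
        if not is_complete(r):
            continue  # records with missing values are removed
        # Assumption: any raw diagnosis value above 0 indicates heart disease.
        cleaned.append({**r, target_key: 1 if r[target_key] > 0 else 0})
    return cleaned

# Hypothetical toy records, not taken from the dataset.
raw = [
    {"ca": 0.0, "target": 0},
    {"ca": float("nan"), "target": 2},
    {"ca": 1.0, "target": 3},
]
cleaned = preprocess(raw)
```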
CLASSIFICATION MODELLING
The clustering of datasets is done based on the variables and criteria of Decision Tree (DT)
features. Then, the classifiers are applied to each clustered dataset to estimate its performance.
The best performing models are identified from the above results based on their low rate of
error. The performance is further optimized by choosing the DT cluster with a high rate of error
and extracting its corresponding classifier features. The performance of the classifier is
evaluated for error optimization on this data set.
All measures can be calculated from four values, namely True Positive, False
Positive, False Negative and True Negative.
• Receiver Operating Characteristic (ROC) area: it is traditional to plot this same
information in a normalized form, with 1 minus the false negative rate (the true
positive rate) plotted against the false positive rate.
• For every algorithm, the cross-validation test option was used. Rather than
holding out a part of the data for testing, cross-validation repeats the training and testing
process several times with different random samples. The standard for this is 10-fold
cross-validation. The data is partitioned randomly into 10 parts in which the classes are
represented in the same proportions as in the full dataset (stratification). Each part is held
out in turn and the algorithm is trained on the nine remaining parts; then its error rate is
computed on the holdout set. Finally, the 10 error estimates are averaged
to yield an overall error estimate. For J48 and Random Forest, all the tests were
run with ten different random seeds. Different random seeds are chosen
to average out statistical variations.
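The stratified 10-fold procedure described above can be sketched in plain Python as follows. This is an illustration of the idea rather than the WEKA implementation; the function names and the majority-class placeholder classifier are assumptions, and the label counts (165 and 138) are the class sizes reported for the dataset.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k=10, seed=0):
    """Assign each index to one of k folds so class proportions are roughly preserved."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, i in enumerate(indices):
            folds[pos % k].append(i)  # deal each class round-robin across the folds
    return folds

def cross_val_error(labels, fit_predict, k=10, seed=0):
    """Hold out each fold in turn and average the k error rates."""
    errors = []
    for held_out in stratified_folds(labels, k, seed):
        held = set(held_out)
        train = [i for i in range(len(labels)) if i not in held]
        predictions = fit_predict(train, held_out)
        wrong = sum(p != labels[i] for p, i in zip(predictions, held_out))
        errors.append(wrong / len(held_out))
    return sum(errors) / k

labels = [1] * 165 + [0] * 138

def majority_baseline(train_idx, test_idx):
    """Placeholder classifier (an assumption): always predict the majority training class."""
    ones = sum(labels[i] for i in train_idx)
    majority = 1 if ones * 2 >= len(train_idx) else 0
    return [majority] * len(test_idx)

# Repeat with ten different random seeds and average the estimates.
seed_errors = [cross_val_error(labels, majority_baseline, seed=s) for s in range(10)]
mean_error = sum(seed_errors) / len(seed_errors)
```

Any real classifier can be substituted for majority_baseline, as long as it accepts the training and held-out indices and returns one prediction per held-out index.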
DECISION TREES
The decision tree algorithm belongs to the family of supervised machine learning algorithms.
It can be used for both classification and regression problems. The goal of
this algorithm is to create a model that predicts the value of a target variable. To do so, the
decision tree uses a tree representation in which each leaf node
corresponds to a class label and attributes are represented on the internal nodes of the tree.
A decision tree is a tree whose internal nodes can be taken as tests (on input data patterns) and
whose leaf nodes can be taken as categories (of these patterns). An input pattern is filtered down
through the tree by these tests to reach the right output. Decision Tree algorithms can be
applied in many different fields. They can be used as a replacement for statistical procedures to
find data, to extract text, to find missing data in a class and to improve search engines, and they
also find various applications in medical fields. Many Decision Tree algorithms have been
formulated. They differ in accuracy and cost-effectiveness, so it is important
to know which algorithm is best to use. ID3 is one of the oldest Decision Tree algorithms.
It is very useful for building simple decision trees, but as complexity increases, its
accuracy decreases. Hence IDA (intelligent decision tree
algorithm) and C4.5 algorithms have been formulated. For training samples of data, the trees
are constructed based on high entropy inputs. These trees are simple and quickly constructed using a
top-down recursive divide and conquer (DAC) approach. Tree pruning is performed to remove
the irrelevant samples.
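To make the role of entropy concrete, the sketch below computes the entropy of a set of labels and the ID3-style information gain of splitting on one attribute. It assumes categorical attributes and uses hypothetical toy rows; it is not the exact criterion of any particular tool used in this work.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(rows, labels, attribute):
    """Reduction in entropy obtained by splitting the rows on `attribute`."""
    total = len(labels)
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attribute], []).append(label)
    remainder = sum(len(p) / total * entropy(p) for p in partitions.values())
    return entropy(labels) - remainder

# Hypothetical toy data: "exang" separates the classes perfectly, "sex" not at all.
rows = [
    {"exang": 1, "sex": 1},
    {"exang": 1, "sex": 0},
    {"exang": 0, "sex": 1},
    {"exang": 0, "sex": 0},
]
labels = [1, 1, 0, 0]
best_attribute = max(rows[0], key=lambda a: information_gain(rows, labels, a))
```

A top-down tree builder would place the attribute with the highest gain at the current node and then recurse on each partition.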
KNN (K-NEAREST NEIGHBOUR ALGORITHM)
K-nearest neighbour is a supervised pattern classification algorithm that helps us
find which class a new input (test value) belongs to: the distance between the new input and
the training points is calculated and the k nearest neighbours are chosen. It attempts to
estimate the conditional distribution of Y given X, and classifies a given observation (test
value) to the class with the highest estimated probability. K-Nearest Neighbour is used for
both classification and regression. The algorithm is non-parametric: instead of learned
parameters, it uses the data points themselves to derive the output. It follows the lazy
learning model, in which computation is deferred until prediction. The basic idea
of this algorithm is to take the stored data points as inputs and derive the output from them,
on the assumption that similar points have similar outputs.
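A minimal sketch of k-nearest-neighbour classification is shown below, assuming Euclidean distance and a simple majority vote among the k closest training points. The function name, the choice k = 3 and the toy points are illustrative assumptions.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Distances are computed at prediction time; nothing is fitted beforehand.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Hypothetical (age, cholesterol) points with labels 0 = no disease, 1 = disease.
points = [(50, 200), (52, 210), (70, 300), (68, 290), (49, 205)]
labels = [0, 0, 1, 1, 0]
low_risk = knn_predict(points, labels, (51, 204))
high_risk = knn_predict(points, labels, (69, 295))
```

In practice the features would be scaled first, since an attribute with a large numeric range otherwise dominates the distance.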
RANDOM FOREST
This algorithm contains a set of trees, each with a tree structure of nodes, and the output
is predicted from them. It handles large amounts of data, gives accurate output and offers
good efficiency, although the computation is expensive. Comparing the accuracy rates of the
neural network and random forest algorithms, the accuracy rate of the neural network algorithm
is high when compared to random forests.
Random forest is an ensemble classifier that consists of many decision trees. It outputs the
class that is the mode of the classes output by the individual trees. It is derived from the
random decision forests proposed by Tin Kam Ho of Bell Labs in 1995. This method combines
bagging with a random selection of features to construct decision trees with controlled
variation. Each tree is constructed using the following algorithm.
• Let N be the number of training cases and M be the number of variables in the
classifier.
• The number m of input variables is used to determine the decision at a node of the tree.
Note that m < M.
• Choose a training set for the tree by sampling N times with replacement from all N
available training cases. Use the remaining cases to estimate the error of the tree by
predicting their classes.
• For each node of the tree, choose m variables at random and calculate the best split
based on them.
• Finally, each tree is fully grown and not pruned. To predict a new sample, it is pushed
down the tree and assigned the label of the training samples in the terminal node it ends
up in. This procedure is iterated over all trees and the aggregated result is reported as the
random forest prediction.
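The sampling steps in the list above can be sketched as follows. The sketch covers only the bootstrap sample, the random choice of m variables and the final vote; growing the trees themselves is omitted. The value m = 3 and all function names are assumptions made for illustration.

```python
import random
from collections import Counter

def bootstrap_sample(n_cases, rng):
    """Draw n_cases indices with replacement; unused indices form the out-of-bag set."""
    in_bag = [rng.randrange(n_cases) for _ in range(n_cases)]
    out_of_bag = sorted(set(range(n_cases)) - set(in_bag))
    return in_bag, out_of_bag

def random_feature_subset(n_features, m, rng):
    """Pick m distinct candidate variables for a node, with m < n_features."""
    return rng.sample(range(n_features), m)

def forest_vote(tree_predictions):
    """Report the class predicted by the most trees."""
    return Counter(tree_predictions).most_common(1)[0][0]

rng = random.Random(42)
in_bag, out_of_bag = bootstrap_sample(303, rng)       # N = 303 training cases
candidate_features = random_feature_subset(13, 3, rng)  # M = 13, m = 3 (assumed)
prediction = forest_vote([1, 0, 1, 1, 0])             # hypothetical votes of five trees
```

The out-of-bag cases are the ones available for estimating the error of the tree grown on the corresponding bootstrap sample.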
Multi-classifiers are the result of combining several individual classifiers. Ensembles of
classifiers aimed at improving performance have been introduced. Random Forest (RF) is one
example of such techniques. RF is a multi-classifier formed by decision trees where each tree
h_t is generated from the training set and a vector Θ_t of random numbers, identically
distributed and independent of the vectors Θ_1, Θ_2, ..., Θ_{t-1} used to
generate the classifiers h_1, h_2, ..., h_{t-1}. Every decision tree is built from a random subset
of the training dataset. It uses a random vector that is generated from some fixed
probability distribution, where the probability distribution is varied to focus on samples that
are hard to classify. A random vector can be incorporated into the tree-growing process in
many ways. The leaf nodes of each tree are labelled by estimates of the posterior
distribution over the data class labels. Each internal node contains a test that best
splits the space of data to be classified. A new, unseen instance is classified by sending it
down every tree and aggregating the reached leaf distributions.
There are three approaches to Random Forest: Forest-RI (random input
selection), Forest-RC (random combination), and a mixture of Forest-RI and Forest-RC.
The Random Forest procedure has some desirable qualities, for example:
• It can handle a large number of input variables without variable deletion, and it
gives estimates of which variables are important in classification.
• It has an effective method for estimating missing data and maintains accuracy
when a large proportion of the data is missing. It also has methods for balancing error in
data sets with unequal class populations.
LOGISTIC REGRESSION
Logistic Regression, despite its name, is a classification model rather than a regression model.
This algorithm gives its output in the form of binary values, i.e. 0s and 1s. It is a
statistical model and relies on some statistical assumptions. The generated sample
information is represented in mathematical form. Logistic regression
estimates the attributes under these assumptions, and the outcome is measured as either 0
or 1. It has only two possible values, i.e., true or false. The sigmoid function is frequently used
by this algorithm. It is a statistical method similar to linear regression since LR finds an
equation that predicts an outcome for a binary variable, Y, from one or more predictor
variables, X. However, unlike linear regression, the predictor variables can be
categorical or continuous, as the model does not strictly require continuous data. To predict
group membership, LR uses the log odds ratio rather than probabilities and an
iterative maximum likelihood method rather than a least-square to fit the final model. This
means the researcher has more freedom when using LR and the method may be more
appropriate for nonnormally distributed data or when the samples have unequal covariance
matrices. Logistic regression assumes independence among variables, which is not always met
in orthoscopic datasets. However, as is often the case, the applicability of the method (and how
well it works, e.g., the classification error) often trumps statistical assumptions. One drawback
of LR is that the method cannot produce typicality probabilities (useful for forensic casework),
but these values may be substituted with nonparametric methods such as ranked probabilities
and ranked interindividual similarity measures (Ousley and Hefner, 2005).
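The link between the log odds and the sigmoid function mentioned above can be sketched as follows. The weights, bias and feature values are hypothetical and are not fitted to any data; in a real model they would be estimated by the iterative maximum likelihood method.

```python
import math

def sigmoid(z):
    """Map a log-odds value to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, bias, features):
    """The log odds are a linear function of the predictor variables."""
    log_odds = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(log_odds)

def predict(weights, bias, features, threshold=0.5):
    """Return 1 when the estimated probability reaches the threshold, else 0."""
    return 1 if predict_proba(weights, bias, features) >= threshold else 0

# Hypothetical coefficients and one scaled feature vector, for illustration only.
weights, bias = [0.8, -0.5], -0.2
features = [1.0, 0.4]
probability = predict_proba(weights, bias, features)
label = predict(weights, bias, features)
```

Taking log(p / (1 - p)) of the returned probability recovers the linear log-odds value, which is why the coefficients are interpreted on the log-odds scale.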
PERFORMANCE MEASURES
Several standard performance metrics such as accuracy, precision and classification error
have been considered for computing the performance of this model. Accuracy
in the current context means the percentage of instances correctly predicted from among
all the available instances. Precision is defined as the percentage of correct predictions among
the instances predicted as the positive class. Classification error is defined as the percentage
of instances that are predicted incorrectly. To identify the significant features of heart disease,
three performance metrics are used, which help in better understanding the behaviour of
the various combinations of the feature selection. The ML technique focuses on the best
performing model compared to the existing models. We introduce the hybrid method, which
produces high accuracy and low classification error in the prediction of heart disease. The
performance of every classifier is evaluated individually and all results are recorded for further
investigation.
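The three metrics defined above can be computed directly from the confusion-matrix counts. The sketch below uses hypothetical counts, chosen only to make the arithmetic easy to follow; they are not results from this work.

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of all instances that are predicted correctly."""
    return (tp + tn) / (tp + tn + fp + fn)

def precision(tp, fp):
    """Fraction of predicted positives that are truly positive."""
    return tp / (tp + fp)

def classification_error(tp, tn, fp, fn):
    """Fraction of all instances that are predicted incorrectly."""
    return (fp + fn) / (tp + tn + fp + fn)

# Hypothetical counts for 100 test instances.
tp, tn, fp, fn = 45, 40, 5, 10
acc = accuracy(tp, tn, fp, fn)
prec = precision(tp, fp)
err = classification_error(tp, tn, fp, fn)
```

With these counts the accuracy is 0.85, the precision is 0.90 and the classification error is 0.15; accuracy and classification error always sum to 1.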
OVERVIEW OF PROPOSED WORK
Weighted voting of the Logistic Regression (LR), K-Nearest Neighbour (KNN), Random
Forest (RF) and Gaussian Naive Bayes (GNB) algorithms is used to perform classification on the
cardiovascular disease dataset. In fatal diseases like cardiac diseases, accurate detection of the
disease is essential, but the conventional approaches are not sufficiently accurate in
diagnosing the disease. In the proposed model, the data from the heart disease dataset is split
into test and training data (33% test data and 67% training data). Feature scaling is performed
to normalize the range of independent variables in the data. The test data is given as the input
to Decision Tree, Gaussian Naive Bayes, Bernoulli Naive Bayes, Logistic Regression,
Multinomial Naive Bayes, K-Nearest Neighbour, Random Forest and Neural Network
(Logistic, TanH, ReLU, and Identity). The proposed model uses 13 clinical attributes of the
heart disease dataset as input and predicts the disease accurately. The proposed method,
VLRAKN, has an accuracy of 89% and a precision of 91.6%. The results of VLRAKN are
compared and analyzed against conventional models and a few other machine learning
algorithms. The accuracy, precision, F-measure, sensitivity and specificity are given in the
table below. The introduced model is efficient in predicting
the disease compared with other models in terms of accuracy and precision.
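The idea of weighted voting can be sketched as follows. How the weights of the proposed model are chosen is not stated here, so the weights and the individual predictions below are purely hypothetical, and the function name weighted_vote is an assumption.

```python
from collections import defaultdict

def weighted_vote(predictions, weights):
    """Sum the weight behind each predicted label and return the heaviest label."""
    totals = defaultdict(float)
    for name, label in predictions.items():
        totals[label] += weights[name]
    return max(totals, key=totals.get)

# Hypothetical per-classifier predictions for one patient and hypothetical weights.
predictions = {"LR": 1, "KNN": 0, "RF": 1, "GNB": 0}
weights = {"LR": 0.3, "KNN": 0.2, "RF": 0.35, "GNB": 0.15}
final_label = weighted_vote(predictions, weights)
```

Here the classifiers are split two against two, and the label backed by the larger total weight (LR and RF) is returned. A common design choice is to derive the weights from each classifier's validation accuracy.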
PROPOSED WORK
Cardiovascular diseases are significant causes of death around the world. Each year
approximately 17 million people die due to cardiac problems. Proper diagnosis and early
prediction of the disease can reduce the adverse effects of the disease. Machine learning plays
a vital role in the prediction of the disease. Researchers have performed many studies
on the application of machine learning to cardiac disease prediction. Machine
learning helps us to find the hidden patterns present in the data, which are essential during the
prediction of the disease. Researchers have proposed many prediction models that use
machine learning approaches. One proposed approach uses structural equation mapping and
fuzzy cognitive maps for prediction. The main aim of this work is to develop a model with
good accuracy and precision in predicting cardiac disease in patients using machine learning.
Many research studies were referred to during the construction of the model. For designing a
model with reasonable accuracy, the performance of several machine learning algorithms was
analyzed. The proposed model is based on the weighted voting of classifiers. During the
construction of the model, the parameters mentioned in the Table are taken into consideration
for predicting the disease. The process involved in designing the model. The proposed model
achieved an accuracy of 89% and 91.6% precision in predicting the disease.
It is important to receive first aid at the time of a heart attack. Many deaths due to heart
attack occur because of a lack of awareness and of first aid given to patients. The lifestyle
of people all around the globe has changed, which is the fundamental
cause of various heart complications, and a good deal of research has been done
on prediction. Here we aim to go one step further by studying the
complexity of heart disease and giving medical and non-medical suggestions to get rid of
the heart disease. In the proposed work we focus on analysis, prediction, comparison of
accuracy across many algorithms, and providing suggestions. It is also important to know
whether the person needs to be diagnosed with heart disease or not. In our work, we also
examine whether the person needs to be examined or not after training on the dataset. We
experiment with various classification models and check which yields the greatest accuracy.
In the presented work there are six different sections in which the work takes place. The six
different sections are database selection, sample selection, attribute creation,
modelling/training, knowledge extraction and, finally, medical suggestions and
recommendations. The idea behind dividing the whole work into different sections is to
work on each section independently so that good results and high accuracy can be
achieved.
The proposed architecture consists of the six sections listed above. First, the target
database is selected from the heart disease database. A sample is then selected from the pool
of data, and the noisy data present in it is removed. Since the dataset contains many
attributes, the specific attributes required for training are created, and the relevant
attributes that are useful for the process are extracted. In the next section, the selected
dataset is modelled and trained. Here we apply seven different machine learning models one
after another so that the best model, that is, the model with the highest accuracy, can be
identified. After the models have been applied, the system extracts the knowledge and finally
gives medical and non-medical suggestions. We have used supervised learning models in the
proposed system.
Quantitative research needs numerical data, which can come either from the numbers themselves
or from graphs. Statistical methods are applied to it to extract useful information.
Qualitative research deals with words and thoughts: an expert opinion is needed to draw useful
information from the thoughts and feelings of the examinee, and its purpose is to understand
the concepts, thoughts, experiences and feelings of the patients. This research paper uses
both quantitative and qualitative data. We have used the University of California Irvine
(UCI) dataset for this paper. There are 3 types of data used in this paper:
A standard heart disease dataset from Kaggle is used, whose data was collected from four
prestigious hospitals in Switzerland and the United States of America. The dataset contains
303 records and 13 attributes. There are no missing values in the dataset. The dataset has
an attribute called target, which denotes whether the person has heart disease or not. If
the patient has heart disease, the value in the target field is 1; otherwise it is 0, which
indicates that the patient does not have heart disease. In this dataset, 165 patients had
heart disease and 138 patients had no heart disease. The other attributes are type of chest
pain, level of blood pressure, serum cholesterol, blood sugar level, results of the
electrocardiogram, maximum heart rate, exercise-induced angina, ST depression induced by
exercise relative to rest, the slope of the peak exercise (ST segment), number of major
vessels coloured by fluoroscopy, and reversible defect.
Feature Scaling and attribute selection: The cardiac disease data is feature scaled to
normalize the range of the independent variables (features). The cardiac disease dataset
contains 303 records. All 13 attributes in the dataset are considered necessary, including
age and sex. Age and sex are considered because heart disease is found in men at an earlier
age than in women, among other factors. These attributes are essential for predicting the
disease.
Building and training the model: After attribute selection and feature scaling, the heart
disease data is split into two different parts for training and testing the model. Various machine
learning (ML) algorithms are used to perform prediction on the dataset.
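The scaling and splitting steps described above can be sketched as follows. This is a minimal illustration on synthetic data, not the configuration used for the proposed model: the 25% test share, the random seed and the variable names are assumptions, and the scaler is fitted on the training part only so that no information from the test records leaks into training.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 303 x 13 feature matrix (illustrative values only)
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(303, 13))
y = rng.integers(0, 2, size=303)

# Hold out 25% of the records for testing (assumed share), preserving class balance
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Fit the scaler on the training part only, then apply it to both parts
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```

After this step every training feature has zero mean and unit variance, which matters most for distance-based models such as KNN.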
Binary Data: data whose unit can take on only two possible states (0 &1)
There are 13 feature attributes identified in the dataset for the heart disease prediction and
working which are mentioned below:
1. age (#)
2. sex: 1= Male, 0= Female (Binary)
3. (cp) chest pain type (4 values -Ordinal): Value 1: typical angina, Value 2: atypical angina,
Value 3: non-anginal pain, Value 4: asymptomatic
4. (trestbps) resting blood pressure (#)
5. (chol) serum cholesterol in mg/dl (#)
6. (fbs)fasting blood sugar > 120 mg/dl (Binary) (1 = true; 0 = false)
7. (restecg) resting electrocardiography results (values 0,1,2)
8. (thalach) maximum heart rate achieved (#)
9. (exang) exercise induced angina (binary) (1 = yes; 0 = no)
10. (oldpeak) = ST depression induced by exercise relative to rest (#)
11. (slope) of the peak exercise ST segment (Ordinal) (Value 1: up sloping, Value 2: flat, Value 3:
down sloping)
12. (ca) number of major vessels (0–3, Ordinal) colored by fluoroscopy
13. (thal) defect type (Ordinal): 3 = normal; 6 = fixed defect; 7 = reversible defect
[Figure: workflow stages: input data, preprocessing, splitting]
Attribute  Description               Data Type
CP         1. Typical angina
           2. Atypical angina
           3. Non-anginal pain and
           4. Asymptomatic
Age                                  Numeric [29-77]
EXANG                                Numeric [0 - 1]
SLOPE                                Numeric [1 - 3]
CA                                   Categorical [5 level]
TARGET                               Numeric [0 - 1]
The models are trained using the scikit-learn package in Python. Using ensemble methods, a
voting classifier is built with the RF, LR, GNB, NNR, and KNN algorithms. The ensemble
voting classifier allows a different weight to be given to each classifier. In the proposed
model, logistic regression and k-nearest neighbour are each given a weight of 3, and the
weights of the remaining classifiers are 0. The voting classifier applies majority-rule
voting on the predicted class labels.
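A weighted hard-voting ensemble of this kind might look like the sketch below. It assumes scikit-learn's VotingClassifier and uses synthetic data; the estimator settings, the seed and the omission of the NNR model are simplifications for illustration, not the exact configuration of the proposed model.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary data standing in for the heart disease records
X, y = make_classification(n_samples=303, n_features=13, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

estimators = [
    ("lr", LogisticRegression(max_iter=1000)),
    ("knn", KNeighborsClassifier()),
    ("rf", RandomForestClassifier(random_state=0)),
    ("gnb", GaussianNB()),
]

# Weights follow the text: 3 for LR and KNN, 0 for the remaining classifiers
voting_clf = VotingClassifier(
    estimators=estimators, voting="hard", weights=[3, 3, 0, 0]
)
voting_clf.fit(X_train, y_train)
y_pred = voting_clf.predict(X_test)
```

With only two classifiers carrying non-zero weight, a disagreement between them produces a tie, which scikit-learn resolves in favour of the lower class label; this is worth keeping in mind when interpreting the ensemble's predictions.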
Classifiers    Accuracy  Precision  Sensitivity  Specificity  F-Measure  Error
LR             86        79         90           82           84.4       14
RF             83        75         87           79           80         17
DT             80        79         79           80           79         20
SVM            87        81         90           84           85         17
KNN            88        91         84           91           88         12
XG boost       84        79         86           82           82         16
Hybrid Model   89        91.6       86           91           88         11
Table 5. Overall performance
Various metrics are considered for the evaluation of this classification model: precision,
error, F-measure, accuracy, sensitivity, and specificity. Accuracy is the ratio of correctly
predicted samples to the total number of samples, and classification error is the proportion
of samples that are misclassified. Sensitivity is the proportion of actual positives that
are correctly predicted, F-measure is the harmonic mean of precision and recall, specificity
is the proportion of actual negatives that are correctly identified, and precision is the
ratio of correctly predicted positive samples to the total number of samples predicted as
positive. The hybrid model produces results with high accuracy and precision and the minimum
classification error.
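The metric definitions above can be written directly in terms of the confusion-matrix counts. The following sketch uses a hypothetical helper, `classification_metrics`, and made-up counts for a 76-sample test set; the counts are illustrative and are not results from this work.

```python
def classification_metrics(tp, tn, fp, fn):
    """Return the evaluation metrics (as fractions) from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)        # correct positives among predicted positives
    sensitivity = tp / (tp + fn)      # correct positives among actual positives (recall)
    specificity = tn / (tn + fp)      # correct negatives among actual negatives
    f_measure = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "error": 1 - accuracy,
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f_measure": f_measure,
    }


# Hypothetical counts (illustrative only)
metrics = classification_metrics(tp=30, tn=35, fp=5, fn=6)
for name, value in metrics.items():
    print("{}: {:.3f}".format(name, value))
```

Because error is defined as one minus accuracy, the Accuracy and Error columns of a results table should always sum to 100%.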
PROPOSED SOURCE CODE HEART DISEASE PREDICTION
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
filePath = 'heartDisease.csv'
data = pd.read_csv(filePath)
data.head(5)
data['target'].value_counts()
corr = data.corr()
plt.subplots(figsize=(15,10))
sns.heatmap(corr, xticklabels=corr.columns, yticklabels=corr.columns, annot=True,
            cmap=sns.diverging_palette(220, 20, as_cmap=True))
subData = data[['age','trestbps','chol','thalach','oldpeak']]
sns.pairplot(subData)
sns.catplot(x="target", y="oldpeak", hue="slope", kind="bar", data=data);
plt.title('ST depression (induced by exercise relative to rest) vs. Heart Disease',size=25)
plt.xlabel('Heart Disease',size=20)
plt.ylabel('ST depression',size=20)
plt.figure(figsize=(12,8))
sns.violinplot(x= 'target', y= 'oldpeak',hue="sex", inner='quartile',data= data )
plt.title("ST depression Level vs. Heart Disease",fontsize=20)
plt.xlabel("Heart Disease Target", fontsize=16)
plt.ylabel("ST depression induced by exercise relative to rest", fontsize=16)
plt.figure(figsize=(12,8))
sns.boxplot(x= 'target', y= 'thalach',hue="sex", data=data )
plt.title("Thalach Level vs. Heart Disease", fontsize=20)
plt.xlabel("Heart Disease Target",fontsize=16)
plt.ylabel("Thalach Level", fontsize=16)
X = data.iloc[:, :-1].values
y = data.iloc[:, -1].values
from sklearn.metrics import classification_report
from sklearn.naive_bayes import GaussianNB
# Feature importances of model6 (the RandomForestClassifier), labelled by column name
index= data.columns[:-1]
importance = pd.Series(model6.feature_importances_, index=index)
importance.nlargest(13).plot(kind='barh', colormap='winter')
y_pred = model6.predict(x_test)
print(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)),1))
[[0 0]
[1 1]
[0 0]
[0 0]
[0 0]
[0 0]
[0 0]
[1 1]
[0 0]
[1 1]
[1 1]
[0 0]
[1 0]
[0 0]
[0 0]
[1 0]
[1 1]
[0 0]
[1 1]
[1 0]
[1 1]
[0 0]
[1 1]
[1 1]
[1 1]
[1 1]
[0 0]
[1 1]
[1 1]
[1 1]
[1 1]
[1 1]
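# Combine the predictions of the six models above and score the result.
# Floor division by 6 yields 1 only when all six models predict 1.
from sklearn.metrics import accuracy_score
average = (y_pred1+y_pred2+y_pred3+y_pred4+y_pred5+y_pred6)//6
avg_accuracy = accuracy_score(y_test,average)
avg_accuracy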
from sklearn import model_selection
labels = ['LogisticRegression','KNeighborsClassifier','SVC','GaussianNB','DecisionTreeClassifier','RandomForestClassifier']
# 5-fold cross-validated accuracy of each individual model
for i, label in zip([model1,model2,model3,model4,model5,model6], labels):
    scores = model_selection.cross_val_score(i,X,y, cv =5, scoring ='accuracy')
    print('Accuracy: %0.2f (%0.2f) [%s]'%(scores.mean(),scores.std(),label))
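from sklearn.ensemble import VotingClassifier
voting_hard = VotingClassifier(estimators=list(zip(labels, [model1,model2,model3,model4,model5,model6])), voting ='hard' )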
from sklearn import model_selection
labels_new = ['LogisticRegression','KNeighborsClassifier','SVC','GaussianNB','DecisionTreeClassifier','RandomForestClassifier','Voting']
# 5-fold cross-validated accuracy of each model and of the voting ensemble
for i, label in zip([model1,model2,model3,model4,model5,model6,voting_hard], labels_new):
    scores = model_selection.cross_val_score(i,X,y, cv =5, scoring ='accuracy')
    print('Accuracy: %0.2f (%0.2f) [%s]'%(scores.mean(),scores.std(),label))
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
labels = ['LogisticRegression','KNeighborsClassifier','SVC','GaussianNB','DecisionTreeClassifier','RandomForestClassifier']
# Compare each base model with a bagged version of itself (10-fold cross-validation)
for clf, label in zip([model1,model2,model3,model4,model5,model6], labels):
    bagging_clf = BaggingClassifier(clf,max_samples=0.4, max_features=10, random_state=0)
    bagging_scores = cross_val_score(bagging_clf, X, y, cv=10,n_jobs=-1)
    scores = cross_val_score(clf, X, y, cv=10, scoring='accuracy')
    print("Mean: {0:.3f}, std: (+/-) {1:.3f} [{2}]".format(scores.mean(), scores.std(), label))
    print("Bagging accuracy: %0.2f (+/- %0.2f) [%s]" % (bagging_scores.mean(), bagging_scores.std(), label))
from sklearn import model_selection
d1 = {}
labels = ['LogisticRegression','KNeighborsClassifier','SVC','GaussianNB','DecisionTreeClassifier','RandomForestClassifier']
for i, model in zip([model1,model2,model3,model4,model5,model6], labels):
    scores = model_selection.cross_val_score(i,X,y, cv =5, scoring ='accuracy')
    print('Accuracy: %0.2f (%0.2f) [%s]'%(scores.mean(),scores.std(),model))
    d1[model] = scores.mean()  # store the mean accuracy as a number

def get_accurate_result():
    # Return the name and mean accuracy of the best-scoring model
    best = max(d1, key=d1.get)
    return best, d1[best]

result = get_accurate_result()
model = result[0]
acc = result[1]
percent = round(float(acc) * 100, 2)
print("=="*30)
print("{} is more accurate than the other models,\n{} is {} % accurate.".format(model,model,percent))
print("=="*30)
Heart Disease Prediction
Separate Models
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 101)
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
dataset.num.value_counts()
#ACCURACY SCORE
from sklearn.metrics import accuracy_score
print("ACC",accuracy_score(y_test,y_pred))
##CONFUSION MATRIX
from sklearn.metrics import classification_report, confusion_matrix
cm=confusion_matrix(y_test, y_pred)
#Interpretation:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))
#ROC
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
# Use the predicted probability of class 1 so the reported area matches the plotted curve
y_score = classifier.predict_proba(X_test)[:,1]
logit_roc_auc = roc_auc_score(y_test, y_score)
fpr, tpr, thresholds = roc_curve(y_test, y_score)
plt.figure()
plt.plot(fpr, tpr, label='Decision Tree (area = %0.2f)' % logit_roc_auc)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.savefig('Log_ROC')
plt.show()
Newdataset = pd.read_csv('newdata.csv')
# Apply the scaler fitted on the training data before predicting
ynew=classifier.predict(sc.transform(Newdataset.values))
print("Predicted Class for newdata.csv:",ynew)
ACC 0.7631578947368421
precision recall f1-score support
accuracy 0.76 76
macro avg 0.77 0.76 0.76 76
weighted avg 0.77 0.76 0.76 76
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 101)
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
dataset.num.value_counts()
#ACCURACY SCORE
from sklearn.metrics import accuracy_score
accuracy_score(y_test,y_pred)
#Interpretation:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))
#ROC
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
# Use the predicted probability of class 1 so the reported area matches the plotted curve
y_score = classifier.predict_proba(X_test)[:,1]
logit_roc_auc = roc_auc_score(y_test, y_score)
fpr, tpr, thresholds = roc_curve(y_test, y_score)
plt.figure()
plt.plot(fpr, tpr, label='KNN (area = %0.2f)' % logit_roc_auc)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.savefig('Log_ROC')
plt.show()
Newdataset = pd.read_csv('newdata.csv')
# Apply the scaler fitted on the training data before predicting
ynew=classifier.predict(sc.transform(Newdataset.values))
print("Predicted Class for newdata.csv:",ynew)
precision recall f1-score support
accuracy 0.84 76
macro avg 0.86 0.84 0.84 76
weighted avg 0.86 0.84 0.84 76
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
dataset = pd.read_csv('heart_disease_dataset.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 13].values
#feature scaling
dataset.num.value_counts()
y_Class_pred=classifier.predict(X_test)
#checking the accuracy for predicted results
from sklearn.metrics import accuracy_score
accuracy_score(Y_test,y_Class_pred)
Newdataset = pd.read_csv('newdata.csv')
ynew=classifier.predict(Newdataset)
print("Predicted Class for newdata.csv:",ynew)
precision recall f1-score support
accuracy 0.86 76
macro avg 0.87 0.86 0.85 76
weighted avg 0.87 0.86 0.85 76
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state =None)
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
dataset.num.value_counts()
#ACCURACY SCORE
from sklearn.metrics import accuracy_score
accuracy_score(y_test,y_pred)
#Interpretation:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))
Newdataset = pd.read_csv('newdata.csv')
# Apply the scaler fitted on the training data before predicting
ynew=classifier.predict(sc.transform(Newdataset.values))
print("Predicted Class for newdata.csv:",ynew)
precision recall f1-score support
accuracy 0.86 76
macro avg 0.87 0.86 0.85 76
weighted avg 0.88 0.86 0.85 76
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 101)
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
dataset.num.value_counts()
#Interpretation:
Newdataset = pd.read_csv('newdata.csv')
# Apply the scaler fitted on the training data before predicting
ynew=classifier.predict(sc.transform(Newdataset.values))
print("Predicted Class for newdata.csv:",ynew)
precision recall f1-score support
accuracy 0.87 76
macro avg 0.88 0.87 0.87 76
weighted avg 0.88 0.87 0.87 76
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
#feature scaling
dataset.num.value_counts()
#ROC
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
# Use the predicted probability of class 1 so the reported area matches the plotted curve
y_score = classifier.predict_proba(X_test)[:,1]
logit_roc_auc = roc_auc_score(Y_test, y_score)
fpr, tpr, thresholds = roc_curve(Y_test, y_score)
plt.figure()
plt.plot(fpr, tpr, label='SVM (area = %0.2f)' % logit_roc_auc)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.savefig('Log_ROC')
plt.show()
Newdataset = pd.read_csv('newdata.csv')
ynew=classifier.predict(Newdataset)
print("Predicted Class for newdata.csv:",ynew)
accuracy 0.87 46
macro avg 0.89 0.84 0.86 46
weighted avg 0.88 0.87 0.87 46
Code implemented to find the accuracy of the above model.
Hybrid Solution FOBU
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from sklearn.metrics import mean_squared_error
from copy import deepcopy
class ModelTree(object):
    def __init__(self, model, max_depth=5, min_samples_leaf=10):
        self.model = model
        self.max_depth = max_depth
        self.min_samples_leaf = min_samples_leaf
        self.tree = None
global index_node_global
def _split_traverse_node(node, container):
    if not result["did_split"]:
        if verbose:
            depth_spacing_str = " ".join([" "] * node["depth"])
            print(" {}*leaf {} @ depth {}: loss={:.6f}, N={}".format(
                depth_spacing_str, node["index"], node["depth"], node["loss"], result["N"]))
        return
    node["j_feature"] = result["j_feature"]
    node["threshold"] = result["threshold"]
    del node["data"]  # delete node stored data
    if verbose:
        depth_spacing_str = " ".join([" "] * node["depth"])
        print(" {}node {} @ depth {}: loss={:.6f}, j_feature={}, threshold={:.6f}, N=({},{})".format(
            depth_spacing_str, node["index"], node["depth"], node["loss"],
            node["j_feature"], node["threshold"], len(X_left), len(X_right)))
    _split_traverse_node(node["children"]["left"], container)
    _split_traverse_node(node["children"]["right"], container)
return root
self.tree = _build_tree(X, y)
return self.tree
    def predict(self, X):
        assert self.tree is not None

        def _predict(node, x):
            no_children = node["children"]["left"] is None and \
                          node["children"]["right"] is None
            if no_children:
                # Leaf node: use the model fitted on this leaf
                y_pred_x = node["model"].predict([x])[0]
                return y_pred_x
            else:
                if x[node["j_feature"]] <= node["threshold"]:  # x[j] <= threshold
                    return _predict(node["children"]["left"], x)
                else:  # x[j] > threshold
                    return _predict(node["children"]["right"], x)

        y_pred = np.array([_predict(self.tree, x) for x in X])
        return y_pred
def _splitter(node, model,max_depth=5, min_samples_leaf=10):
    X, y = node["data"]
    depth = node["depth"]
    N, d = X.shape
    did_split = False
    loss_best = node["loss"]
    data_best = None
    models_best = None
    j_feature_best = None
    threshold_best = None
if not all(split_conditions):
continue
return result
def _fit_model(X, y, model):
    model_copy = deepcopy(model)  # must deepcopy the model!
    model_copy.fit(X,y)
    y_pred = model_copy.predict(X)
    loss = model_copy.loss(X, y, y_pred)
    assert loss >= 0.0
    return loss, model_copy

class logistic_regr:
    def __init__(self):
        from sklearn.linear_model import LogisticRegression
        self.model = LogisticRegression(penalty="l2",solver='liblinear')
        self.flag = False
        self.flag_y_pred = None

    def predict_proba(self,X):
        return self.model.predict_proba(X)
dataset = pd.read_csv('heart_disease_dataset.csv')
X = dataset.iloc[:,:-1].values
y = dataset.iloc[:, 13].values
import numpy as np
from sklearn.impute import SimpleImputer
# Replace missing values with the column mean
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer=imputer.fit(X[:,11:13])
X[:,11:13]=imputer.transform(X[:,11:13])
dataset.num.value_counts()
estimators=5
y_pred=[]
n_train_split=int(len(X_train)/estimators)
initial_train=0
final_train=0
yy_pred=[]
classifier=None
# Train one model tree per estimator on a growing slice of the training data
for i in range(1,estimators+1):
    classifier =logistic_regr()
    final_train=i*n_train_split
    temp_X_train=X_train[initial_train:final_train]
    temp_y_train=y_train[initial_train:final_train]
    L=ModelTree(classifier,max_depth=20, min_samples_leaf=10)
    node=L.fit(temp_X_train,temp_y_train,verbose=False)
    classifier=node["model"]
    y_pred_temp=L.predict(X_test)
    yy_pred.append(y_pred_temp)
# Majority vote over the estimators' predictions for each test sample
for j in range(len(yy_pred[0])):
    curr=[]
    for i in range(len(yy_pred)):
        curr.append(yy_pred[i][j])
    a=curr.count(0)
    b=curr.count(1)
    if a>b:
        y_pred.append(0)
    else:
        y_pred.append(1)
from sklearn.metrics import accuracy_score
accuracy_score(y_test,y_pred)
accuracy 0.87 76
macro avg 0.86 0.87 0.86 76
weighted avg 0.87 0.87 0.87 76
RESULTS
Setup for experimentation: A Jupyter notebook is used to build the model and to perform the
cardiac disease classification on the dataset. Real-life data collected from four prestigious
hospitals in Switzerland and the United States of America is used for this work. The data is
categorized under 76 attributes, of which 14 are provided in the dataset. Of these, 13
attributes are used for classification. The attribute named target shows the presence of the
disease: 0 in the target field indicates that there is no heart disease, and 1 indicates the
presence of heart disease.
Evaluation of results: The results are evaluated using the confusion matrix, accuracy score,
and area under the ROC curve. The confusion matrix produces four counts, TP (True Positive),
TN (True Negative), FP (False Positive), and FN (False Negative), and accuracy is computed
using an accuracy score.
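The evaluation steps named above can be sketched with scikit-learn as follows. The data is synthetic and the logistic regression model is only a stand-in, so the numbers it produces are not results from this work. Note that the area under the ROC curve is computed from predicted probabilities rather than from hard class labels.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary data standing in for the heart disease records
X, y = make_classification(n_samples=303, n_features=13, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = clf.predict(X_test)

# For binary labels 0/1, ravel() returns the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_test, y_pred, labels=[0, 1]).ravel()
accuracy = accuracy_score(y_test, y_pred)

# AUC uses the predicted probability of class 1, not the hard predictions
auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])

print("TP={}, TN={}, FP={}, FN={}".format(tp, tn, fp, fn))
print("accuracy={:.3f}, AUC={:.3f}".format(accuracy, auc))
```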
The prediction models are developed using 13 features, and the accuracy is calculated for
each modelling technique. The best classification methods are given in the table, which
compares the accuracy, classification error, precision, F-measure, sensitivity and
specificity. The proposed hybrid classification method achieves the highest accuracy in
comparison with the existing methods.
Out of the 13 features we examined, the top 4 significant features that helped us classify
between a positive and a negative diagnosis were chest pain type (cp), maximum heart rate
achieved (thalach), number of major vessels (ca), and ST depression induced by exercise
relative to rest (oldpeak), as shown in figure 2.
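A ranking of this kind can be obtained from the feature importances of a fitted random forest. The sketch below uses synthetic data in which, by construction, only the cp and thalach columns determine the label; the column names mirror the dataset, but the values and the resulting ranking are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

feature_names = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
                 "thalach", "exang", "oldpeak", "slope", "ca", "thal"]

# Synthetic data: the label depends only on the "cp" and "thalach" columns
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(303, 13)), columns=feature_names)
y = ((X["cp"] + X["thalach"]) > 0).astype(int)

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Importances sum to 1; nlargest(4) keeps the four most influential features
importance = pd.Series(forest.feature_importances_, index=feature_names)
top4 = importance.nlargest(4)
print(top4)
```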
Fig 4. Proposed Model Accuracy
K-NN 74 75 79.4
Random Forest 95 74 78
Naïve Bayes 75 31 60
SVM 74 67 79
Fig. 5. Correlation Matrix
There is a positive correlation between chest pain (cp) and target (the variable we predict).
This makes sense, since a greater amount of chest pain results in a greater chance of having
heart disease. Cp (chest pain) is an ordinal feature with 4 values: Value 1: typical angina,
Value 2: atypical angina, Value 3: non-anginal pain, Value 4: asymptomatic.
In addition, we see a negative correlation between exercise-induced angina (exang) and the
target. This makes sense because when you exercise, your heart requires more blood, but
narrowed arteries slow down blood flow.
Comparing positive and negative heart disease patients, there are vast differences in the
means of many of our features. Examining the details, we can observe that positive patients
have a higher average maximum heart rate achieved (thalach). In addition, positive patients
exhibit about one third of the ST depression induced by exercise relative to rest (oldpeak)
seen in negative patients.
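A comparison of group means like the one above can be computed with a pandas group-by. The six rows below are made-up values chosen to mimic the pattern described; they are not taken from the dataset.

```python
import pandas as pd

# Small hypothetical sample (illustrative values only)
data = pd.DataFrame({
    "thalach": [150, 172, 160, 120, 132, 141],
    "oldpeak": [0.2, 0.6, 0.7, 1.5, 1.4, 1.6],
    "target":  [1, 1, 1, 0, 0, 0],
})

# Mean of each feature for negative (0) and positive (1) patients
group_means = data.groupby("target")[["thalach", "oldpeak"]].mean()

# Ratio of mean ST depression in positive patients to that in negative patients
ratio = group_means.loc[1, "oldpeak"] / group_means.loc[0, "oldpeak"]
print(group_means)
print("oldpeak ratio (positive / negative): {:.2f}".format(ratio))
```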
Our hybrid machine learning algorithm can now classify patients with heart disease. Now we
can properly diagnose patients and get them the help they need to recover. By detecting
these features early, we may prevent worse symptoms from arising later. Our Random Forest
algorithm yields the highest accuracy, 80%. Any accuracy above 70% is considered good, but
care is needed: an extremely high accuracy may be too good to be true (an example of
overfitting). Thus, 80% is the ideal accuracy.
BENCHMARKING OF THE PROPOSED MODEL
Benchmarking is needed to compare the performance of the existing models with that of the
proposed hybrid model. It is used to identify whether the proposed method is the best and
whether it improves accuracy. The accuracy is calculated from the number of features selected
and the results generated by the model. The hybrid method places no restriction on which
features are used, and all the features selected in this model contribute to the best
results. The performance of the various models is compared with that of the proposed method.
The proposed method is applied to all 13 attributes, and the models are classified based on
the error rate. This result shows that the selected features and the ML techniques used are
effective in accurately predicting heart disease in patients compared with known existing
models.
A stack of 6 machine learning algorithms and their results were compared with the proposed
model. The models were compared based on accuracy, precision, F-measure, sensitivity, and
specificity. The compared results are shown in the figure below. The hybrid model achieved
the highest accuracy when compared with Decision Tree, Random Forest, Logistic Regression,
Naive Bayes (Gaussian, Bernoulli and multinomial), Support Vector Machine, etc.
CONCLUSION
Processing raw healthcare data on the heart will help in the long-term saving of
human lives and in the early detection of abnormalities in heart conditions.
Machine learning techniques were used in this work to process raw data and
provide a new perspective on heart disease. Heart disease prediction is
challenging and very important in the medical field. However, the mortality rate
can be drastically reduced if the disease is detected at an early stage and
preventative measures are adopted as soon as possible. Further extension of this
study is highly desirable to direct the investigations to real-world datasets
instead of just theoretical approaches and simulations. The proposed hybrid
approach combines the characteristics of Random Forest (RF) and Linear Method
(LM). This method proved to be quite accurate in the prediction of heart disease.
Future research can explore diverse mixtures of machine learning techniques to
obtain better prediction techniques. Furthermore, new feature selection methods
can be developed to get a broader view of the significant features and increase
the performance of heart disease prediction.