
2018 4th International Conference on Computing Communication and Automation (ICCCA)

A Hybrid Machine Learning Approach for Prediction of Heart Diseases
Krishna Roy
Computer Science & Engineering
Tripura Institute of Technology
Narsingarh, Agartala, India
itkrishna1996@gmail.com

Sanchayita Dhar
Computer Science & Engineering
Tripura Institute of Technology
Narsingarh, Agartala, India
sanchayitadhr@gmail.com

Tanusree Dey
Computer Science & Engineering
Tripura Institute of Technology
Narsingarh, Agartala, India
deytanusree1995@gmail.com

Ankur Biswas
Computer Science & Engineering
Tripura Institute of Technology
Narsingarh, Agartala, India
abiswas.tit@gmail.com

Pritha Datta
Computer Science & Engineering
Tripura Institute of Technology
Narsingarh, Agartala, India
prithaitcse12@gmail.com

Abstract— Heart disease has been the leading cause of death worldwide over the last few decades. To prevent heart disease or coronary illness and to detect its signs early, individuals over 55 years of age should have a complete cardiovascular checkup. Researchers and specialists have developed various intelligent techniques to help health care professionals recognize cardiovascular disease. In the diagnosis and treatment of cardiovascular disease, single data mining strategies give reasonable precision and accuracy. In addition, the use of data mining procedures can reduce the number of tests that need to be carried out. To decrease the number of deaths from heart disease, a quick and efficient detection technique with better accuracy and precision is needed. The aim of this paper is to present an efficient technique for predicting heart disease using machine learning approaches. We therefore propose a hybrid approach for heart disease prediction that uses the Random forest classifier and the simple k-means algorithm. The dataset is also evaluated with two other machine learning algorithms, namely the J48 tree classifier and the Naive Bayes classifier, and the results are compared. The results attained through the Random forest classifier and the corresponding confusion matrix show the robustness of the methodology.

Keywords— Heart disease, Data mining, Machine learning, Random forest, K-means

I. INTRODUCTION

According to statistics available as of 2018, an estimated 17.9 million deaths occur worldwide every year due to cardiovascular disease (CVD), which accounts for 31% of all deaths worldwide. If existing trends persist, the annual number of deaths from CVD will rise to 22.2 million by 2030 [http://www.who.int]. Prediction using data mining techniques may provide an early and accurate diagnosis of this disease. A variety of data mining approaches, such as Decision tree [1], Neural Network [2,3], Naive Bayes [4] and the KNN algorithm [5], as well as hybrid techniques such as the neural network ensemble, i.e. a combination of neural network and ensemble-based methods [6], are used to classify, predict and cluster data in order to make accurate decisions about the risk of heart disease [7,8]. The term CVD covers numerous types of disorders that may damage the heart. The most common types of heart disease and related cardiovascular diseases are listed in Table I.

TABLE I

Diseases                   Description
Acute coronary syndromes   Obstruction of blood supply to the heart muscle
Angina                     Lack of blood flow to the heart muscle, causing chest pain
Arrhythmia                 Irregular heartbeat or cardiac dysrhythmia
Cardiomyopathy             Heart muscle disease that makes it hard for the heart to pump blood to the rest of the body
Congenital heart disease   Problem in the structure of the heart that is present at birth
Coronary heart disease     The arteries narrow, reducing blood flow to the heart

Hidden patterns and existing relationships can be extracted from large data sources using data mining techniques that merge statistical analysis, machine learning and database technology [9]. In medical centres (hospitals or clinics), data mining techniques help to identify whether an individual has a disease. They are also used for early, automatic diagnosis of patients' diseases, i.e. within a short time, and so support healthcare centres in saving lives. Prediction techniques help stakeholders take reasonable decisions, and in particular help specialists make rational decisions when treating patients. In this paper, a hybrid technique using the Random forest classifier and the simple k-means algorithm for predicting heart disease is proposed. The proposed technique is also compared with other machine learning classifiers. The rest of the paper is organised as follows: Section 2 describes the background of data mining techniques and tools; Section 3 illustrates the proposed methodology; Section 4 presents the results; and Section 5 gives concluding remarks and future scope.

978-1-5386-6947-1/18/$31.00 ©2018 IEEE


II. LITERATURE REVIEW

A. Data Mining Methods & Techniques:

Several principal data mining techniques have been developed in recent years and are used in practical applications, including association, clustering, prediction and pattern evaluation, for knowledge discovery from databases (KDD).

Classification:
Classification is among the foremost techniques of data mining and belongs to the domain of machine learning. It is a method for assigning every item in a data set to a class. Classification draws on various mathematical and statistical methodologies and techniques such as linear programming, decision trees and neural networks.

Clustering:
Clustering is a data mining technique that groups objects with similar features using automatic methods. Clustering is entirely different from classification: in clustering, the classes are defined by the technique itself and objects are placed in them, whereas in classification, objects are assigned to predefined classes. Through clustering, dense and sparse regions in object space can be recognized, and distribution patterns and interesting correlations among the data attributes can be discovered. It amounts to data segmentation [10].

Naive Bayes:
Naive Bayes is a machine learning algorithm for classification problems that is based on Bayes' probability theorem. It first became popular for text classification, which involves high-dimensional training data sets. The Naive Bayes classifier is a probabilistic classifier built on probability models with a strong independence assumption. For example, a disease may be considered a heart ailment if an individual has chest pain and abnormal blood pressure and cholesterol. A Naive Bayes classifier considers each of these features to contribute independently to the probability that the disease is a heart disease. The equation for Naive Bayes is given below:

$$P(C \mid Y) = \frac{P(Y \mid C)\,P(C)}{P(Y)} \qquad (1)$$

where Y is the instance to be predicted and C is the class value for the instance. The formula is used to determine the class to which an instance with the given features is expected to belong.

Decision Tree:
A decision tree is a supervised learning classifier that is simple to understand and interpret. It handles both numerical and categorical data. A decision tree has a tree structure with internal nodes, branches and leaf nodes. Each internal node represents a test on an attribute, each branch denotes an attribute value of the given dataset, and each leaf node shows the class, i.e. the end result. Classification proceeds from the root node to a leaf node according to the predictive attributes and the given rules. The most frequently used decision tree approaches, including CART, ID3, C4.5, J48 and CHAID, are very important in the prediction of diseases.

K-means Algorithm:
K-means is a vector quantization algorithm that generates k clusters from the given objects of the problem domain so that the objects within each cluster are as similar as possible. In addition to assigning a cluster number to each object, k-means "learns" the clusters on its own, without additional information about which cluster an observation should belong to, which is the main reason k-means is considered an unsupervised learning technique. K-means is especially efficient for large data sets.

ID3 Algorithm
The ID3 (Iterative Dichotomiser 3) algorithm is a decision tree learning algorithm. ID3 builds the tree top-down, beginning from a set of objects and a specification of their properties. In some cases ID3 fails to produce the optimal solution and yields only a local optimum. At each iteration the algorithm chooses the locally best attribute to split the dataset, following a greedy strategy. Optimality can be improved by backtracking during the search for the optimal decision tree.

Support Vector Machine
Support vector machines, sometimes also called support vector networks, are supervised learning models. In recent years, SVM has become one of the most widely used learning algorithms for classification. In this algorithm, each data item is plotted as a point in n-dimensional space, where n is the number of features, with the value of each feature being the value of a particular coordinate. Classification is then performed by finding the hyperplane that best separates the two classes. Support vectors are simply the coordinates of individual observations. SVMs can perform non-linear as well as linear classification. Support vector machines are useful in text and hypertext categorization, image classification and many other areas. The support vector machine is suited to extreme cases and showed the best performance [11].

B. Open source tools for Data Mining

The WEKA Tools:
WEKA stands for Waikato Environment for Knowledge Analysis [12,13]. It was developed at the University of Waikato in New Zealand and is written in Java. WEKA is used for developing machine learning algorithms, and various applications use it to solve real data mining problems. WEKA is open source software. It implements data preprocessing, clustering, classification, regression and association.

TANAGRA Tools:
TANAGRA is open source data mining software that provides various data mining techniques from statistical learning, machine learning and the database domain. TANAGRA is used by many researchers and students because it is easy to use and helps to analyse both real and synthetic data.
Rapid Miner
Rapid Miner, formerly known as YALE (Yet Another Learning Environment), is an environment that supports data mining and machine learning procedures such as data extraction, transformation and loading (ETL), data pre-processing, transformation and visualization, model creation, evaluation and deployment. Rapid Miner is written in the Java programming language and can be used for text mining, interactive mining, quality engineering, data flow mining, etc.
MATLAB
MATLAB stands for Matrix Laboratory. MATLAB provides a multi-paradigm numerical computing environment and was introduced as a fourth-generation programming language. MATLAB supports matrix manipulation, plotting of functions and data, implementation of algorithms, creation of user interfaces (GUIs) and interfacing with programs written in other high-level languages, including C, C++, Java, FORTRAN and Python.
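As a small illustration of the Bayes rule in equation (1) and of the independence assumption behind Naive Bayes, the sketch below computes class posteriors from hand-made probability tables. All probabilities, the feature names and the function name `naive_bayes_posterior` are invented for illustration and are not taken from the dataset used in this paper.

```python
from math import prod


def naive_bayes_posterior(priors, likelihoods, observed):
    """Return P(C | Y) for every class C under the naive independence assumption.

    priors:      {class: P(C)}
    likelihoods: {class: {feature: P(feature present | C)}}
    observed:    {feature: True or False}  (the instance Y)
    """
    joint = {}
    for cls, prior in priors.items():
        # P(Y | C) factorises into a product over features (the "naive" step).
        per_feature = [
            likelihoods[cls][name] if present else 1.0 - likelihoods[cls][name]
            for name, present in observed.items()
        ]
        joint[cls] = prod(per_feature) * prior  # numerator P(Y | C) * P(C)
    evidence = sum(joint.values())              # denominator P(Y)
    return {cls: value / evidence for cls, value in joint.items()}


# Hypothetical numbers, chosen only to keep the arithmetic easy to follow.
priors = {"positive": 0.4, "negative": 0.6}
likelihoods = {
    "positive": {"chest_pain": 0.8, "high_bp": 0.7},
    "negative": {"chest_pain": 0.2, "high_bp": 0.3},
}
posterior = naive_bayes_posterior(
    priors, likelihoods, {"chest_pain": True, "high_bp": True}
)
```

With these made-up tables the numerators are 0.4 × 0.8 × 0.7 = 0.224 for the positive class and 0.6 × 0.2 × 0.3 = 0.036 for the negative class, so the instance is assigned to the positive class.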

III. METHODOLOGY
In this paper, we develop a prediction system capable of predicting heart disease from measurements. The data were extracted from the ERIC laboratory (eric.univ-lyon2.fr/~ricco) and consist of 209 test cases, and we used the Knowledge Discovery in Database (KDD) methodology.
The dataset, consisting of the attributes (age, chest_pain, rest_bpress, blood_sugar, rest_electro, max_heart_rate, exercice_angina, disease), is further divided into 3 classes:
1. User input attributes: age, chest_pain, rest_bpress, blood_sugar, rest_electro, max_heart_rate and exercice_angina.
2. Additional attributes, including tobacco usage and past heart disease, which can be utilized for better prediction.
3. Prediction attributes. The ‘Disease’ attribute establishes the presence of the disease and takes one of two levels, ‘positive’ or ‘negative’.
The overall flow diagram is shown in figure 1.

Fig. 1: Flow diagram of methodology

The chosen dataset is checked for noise, inconsistency and missing values. The few noisy and inconsistent entries identified in the data can be corrected manually. The few missing values in the dataset can be replaced with the most probable value determined by regression, or with the global mean/mode value, and outliers were substituted with attribute mean values. An appropriate data cleaning strategy is to be performed depending on the tools selected for the preprocessing phase. An appropriate transformation technique such as attribute selection is essential to reduce the number of features a classification algorithm has to examine and to reduce errors caused by irrelevant features. A best first search (BFS) method can be utilized to select the best attributes from the 9 attributes that were available. Finally, an appropriate data mining technique was selected for developing a predictive model. For this purpose, the Decision Tree approach of the Weka machine learning software was selected.

A. Data sources
The data, collected from the ERIC Laboratory, 5 avenue Pierre Mendes France, 69676 Bron Cedex, France, are provided as input to the system in ESS format, as shown in figure 2.

Fig. 2: Raw data from Eric dataset

B. Data mining in Weka using Decision tree
1) Data preprocessing: In preprocessing, the data are manipulated so that they are suitable for further examination. We therefore prepared a basic set of data in .csv or .arff format suitable for classification, as shown in figure 3.
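The mean/mode replacement of missing values described in this section could be sketched as follows. This is a minimal illustration in plain Python, not the Weka procedure used in this work; the function name `impute_missing`, the use of `None` as the missing-value marker and the sample records are all assumptions made for the example (only the attribute names come from the dataset description).

```python
from statistics import mean, mode


def impute_missing(rows, numeric_cols, nominal_cols):
    """Return copies of rows with None replaced by the column mean or mode.

    Numeric attributes receive the global mean of the observed values,
    nominal attributes receive the most frequent observed value.
    """
    fill = {}
    for col in numeric_cols:
        fill[col] = mean(r[col] for r in rows if r[col] is not None)
    for col in nominal_cols:
        fill[col] = mode(r[col] for r in rows if r[col] is not None)
    return [
        {col: (fill[col] if val is None and col in fill else val)
         for col, val in row.items()}
        for row in rows
    ]


# Made-up records; the values are illustrative, not taken from the ERIC data.
records = [
    {"age": 43, "rest_bpress": 140, "chest_pain": "asympt"},
    {"age": 39, "rest_bpress": None, "chest_pain": "atyp_angina"},
    {"age": 50, "rest_bpress": 120, "chest_pain": None},
    {"age": 52, "rest_bpress": 130, "chest_pain": "asympt"},
]
cleaned = impute_missing(records, ["age", "rest_bpress"], ["chest_pain"])
```

The function builds new dictionaries rather than editing the input, so the raw records stay available for comparison after cleaning.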

Fig. 3: Data sets in .arff format visualised

2) Examining data by statistical methods: After the data set is loaded in Weka, information about the selected attribute is shown, such as total attributes, sum of weights and instances. Most of the attributes are numeric, so to meet the requirements of classification we must change the class from numeric to nominal using the path:

Choose → filters → supervised → attribute → NumericToNominal

The visualisation of all the attributes is shown in fig. 4.

Fig. 4: List of attributes in Weka visualization

3) Classification & Clustering: The preprocessed data in arff format are used for Random forest classification and K-means clustering. In classification through random forest, a training set was prepared to learn a model capable of categorizing data instances into known classes. The classification process involved the following steps:

• Construction of the training dataset.
• Identification of the class attribute and classes.
• Identification of relevant attributes.
• Learning the model using the training set.
• Classification of the unknown data through the model.

With k-means clustering, additional techniques were utilized to improve results:
• Normalizing the data to reduce the Euclidean distance measure, which can be accomplished by scaling the data of each attribute.
• Replacing each attribute value with the number of standard deviations it lies from the attribute mean value.

IV. RESULTS & DISCUSSION

In this section we demonstrate the prediction system using three different classifiers, namely the J48 tree classifier, the Naive Bayes classifier and the Random Forest classifier. Classification is a machine learning and data mining technique used to predict group membership for data instances. The Random Forest classifier outperformed the other two classifiers, obtaining ‘Correctly classified instances’ of 100% compared to 86% and 81% for J48 tree and Naive Bayes respectively. A comparison with the other classifiers is shown in table I. The dataset is clustered using simple K-means, where missing values are replaced with the global mean and mode. The resulting statistics are shown in figure 5.

Fig. 5: Statistics of K-Means clustering

In Weka, the Random forest classifier was chosen and its ‘correctly classified instances’ is 100%. With the help of the decision tree, the presence of heart disease in humans can be easily predicted, as shown in figure 6.

Fig. 6: Decision tree
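The attribute scaling and simple k-means clustering described above can be sketched in plain Python. The (age, max_heart_rate) pairs below are made-up values rather than records from the ERIC dataset, the helper names `standardize` and `k_means` are illustrative, and the initial centroids are fixed by hand; Weka's SimpleKMeans, the tool actually used in this work, may differ in initialisation and distance handling.

```python
from math import dist
from statistics import mean, pstdev


def standardize(points):
    """Express every value as the number of standard deviations from its attribute mean."""
    columns = list(zip(*points))
    mus = [mean(col) for col in columns]
    sigmas = [pstdev(col) or 1.0 for col in columns]  # guard against constant attributes
    return [tuple((v - m) / s for v, m, s in zip(p, mus, sigmas)) for p in points]


def k_means(points, centroids, max_iter=100):
    """Lloyd's iteration with Euclidean distance; returns (labels, centroids)."""
    centroids = list(centroids)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(len(centroids)), key=lambda j: dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:  # assignments stable, so the algorithm has converged
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[j] = tuple(mean(col) for col in zip(*members))
    return labels, centroids


# Hypothetical (age, max_heart_rate) pairs.
raw = [(40, 180), (42, 175), (45, 170), (60, 120), (63, 115), (65, 110)]
scaled = standardize(raw)
labels, centroids = k_means(scaled, centroids=[scaled[0], scaled[-1]])
```

Standardizing first keeps an attribute with a large numeric range, such as heart rate, from dominating the Euclidean distance over an attribute with a smaller range.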

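The summary statistics reported in this section (accuracy, kappa statistic, and per-class TP rate, FP rate, precision, recall and F-measure) can all be derived from confusion-matrix counts. The sketch below is a minimal Python illustration using the J48 counts from the confusion matrix in Table V; the function name `summarize` and the dictionary layout are illustrative choices, and this is not how Weka itself computes or prints its report.

```python
def summarize(matrix):
    """Summary statistics for a square confusion matrix.

    matrix[i][j] is the number of instances whose actual class is i and
    whose predicted class is j (rows = actual, columns = predicted).
    Assumes at least two classes, each occurring at least once as an
    actual label and at least once as a predicted label.
    """
    k = len(matrix)
    n = sum(sum(row) for row in matrix)
    actual = [sum(row) for row in matrix]
    predicted = [sum(matrix[i][j] for i in range(k)) for j in range(k)]

    accuracy = sum(matrix[i][i] for i in range(k)) / n
    # Agreement expected by chance, from the row and column totals.
    chance = sum(a * p for a, p in zip(actual, predicted)) / n ** 2
    kappa = (accuracy - chance) / (1 - chance)

    per_class = []
    for i in range(k):
        tp = matrix[i][i]
        recall = tp / actual[i]                        # identical to the TP rate
        precision = tp / predicted[i]
        fp_rate = (predicted[i] - tp) / (n - actual[i])
        f_measure = 2 * precision * recall / (precision + recall)
        per_class.append({
            "tp_rate": recall,
            "fp_rate": fp_rate,
            "precision": precision,
            "recall": recall,
            "f_measure": f_measure,
        })
    return {"accuracy": accuracy, "kappa": kappa, "per_class": per_class}


# J48 counts as reported in Table V: row a = positive, row b = negative.
j48 = [[73, 19], [9, 108]]
report = summarize(j48)
```

Rounded to three decimals, these counts give the J48 accuracy of 0.866 and kappa of 0.725 listed in the classification summary, and the per-class figures listed in Table II.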
A. Evaluations

All the techniques adopted were evaluated to observe how well they fulfil the goals of data mining. The algorithms were evaluated on the basis of classification accuracy, area under the ROC curve and the confusion matrix.

The following data were obtained while implementing the classification techniques through Weka. A total of 209 records were taken for evaluation to predict the existence of heart disease. The overall summary of classification using J48, Naive Bayes and Random forest is tabulated in Table I.

TABLE I: SUMMARY OF CLASSIFICATION

Classifier                         J48           Naive Bayes    Random Forest
Correctly classified instances     181 (86.6%)   170 (81.33%)   209
Incorrectly classified instances   28 (13.39%)   39 (18.66%)    0
Kappa statistic                    0.725         0.619          1
Mean absolute error                0.2047        0.2279         0.1009
Root mean squared error            0.3199        0.3742         0.1434
Relative absolute error            41.53%        46.22%         20.4674%
Root relative squared error        64.45%        75.37%         28.8811%
Total no. of instances             209           209            209

The detailed accuracy of the three classifiers is compared. Tables II, III and IV show the accuracy of J48, Naive Bayes and Random forest respectively.

TABLE II: DETAILED ACCURACY BY CLASS (J48)

TP rate   FP rate   Precision   Recall   F-Measure   Class
0.793     0.077     0.890       0.793    0.839       positive
0.923     0.207     0.850       0.923    0.885       negative
0.866     0.149     0.868       0.866    0.865

TABLE III: DETAILED ACCURACY BY CLASS (NAÏVE BAYES)

TP rate   FP rate   Precision   Recall   F-Measure   Class
0.761     0.145     0.805       0.761    0.782       positive
0.855     0.239     0.820       0.855    0.837       negative
0.813     0.198     0.813       0.866    0.813

TABLE IV: DETAILED ACCURACY BY CLASS (RANDOM FOREST)

TP rate   FP rate   Precision   Recall   F-Measure   Class
1.000     0.000     1.000       1.000    1.000       positive
1.000     0.000     1.000       1.000    1.000       negative
1.000     0.000     1.000       1.000    1.000

A confusion matrix is a performance measure that provides detailed information on the total number of correctly and incorrectly classified instances. For each classifier, the confusion matrix describing the performance of the classification model (or “classifier”) on the set of 209 records for which the true values are known is shown in table V.

TABLE V: CONFUSION MATRIX

Classifier      a     b     ← Classified as
J48             73    19    a=positive
                9     108   b=negative
Naive Bayes     70    22    a=positive
                9     108   b=negative
Random Forest   92    0     a=positive
                17    100   b=negative

V. CONCLUSION & FUTURE SCOPE

In this paper, the intent was to devise a predictive model for detection of cardiovascular heart disease using machine learning techniques utilizing varied parameters related to the heart. The dataset was pre-processed and three different supervised machine learning classification algorithms, i.e. J48 classifier, Naive Bayes and Random Forest, were applied using the Weka machine learning software. The performance of the techniques was evaluated using the benchmark metrics of accuracy, precision, recall and F-measure. The most effective model for predicting patients with heart disease appears to be the Random Forest classifier implemented on selected attributes, with a classification accuracy of 100%. Heart disease is by nature deadly, and delay in diagnosis or misdiagnosis of this ailment can lead to severe or life-threatening problems such as cardiac arrest and even death. The best model selected for predicting heart disease reached a classification accuracy of 100%. Hence machine learning and data mining procedures can be efficiently utilised to predict cardiovascular heart diseases. The conclusions of this paper can be utilised as a supporting tool by cardiologists to formulate a reliable diagnosis of heart disease.

All the validations accomplished in this work were implemented on a particular dataset of 9 attributes. Additional analysis should be carried out with more attributes and different parameter settings to improve the prediction models and develop new capabilities. In addition, the Random Forest and Simple K-means algorithms should be tested thoroughly. The main challenge in the data mining or machine learning process is the inconsistencies of data, presence of missing values, noisy

data and outliers. Therefore, statistical and machine learning methodologies must be applied to manage the data quality.

REFERENCES
[1] Anooj, P. K., "Clinical decision support system: Risk level prediction of heart disease using weighted fuzzy rules," Journal of King Saud University – Computer and Information Sciences (2012) 24, 27–40.
[2] Amin, S. U., Agarwal, K. and Beg, R., "Genetic Neural Network Based Data Mining in Prediction of Heart Disease Using Risk Factors," IEEE Conference on Information and Communication Technologies (ICT 2013), 2013.
[3] Dangare, C. S. & Apte, S. S., "A Data mining approach for prediction of heart disease using neural networks," International Journal of Computer Engineering & Technology (IJCET), Volume 3, Issue 3, October – December (2012), pp. 30-40.
[4] Indhumathi S. & Vijaybaskar G., "Web based health care detection using naive Bayes algorithm," International Journal of Advanced Research in Computer Engineering & Technology (IJARCET), Volume 4, Issue 9, pp. 3532-36, September 2015.
[5] Purusothama G. and Krishnakumari, P., "A Survey of Data mining techniques on risk prediction: Heart disease," Indian Journal of Science and Technology, 2015.
[6] A. Malav, K. Kadam, P. Kamat, "Prediction Of Heart Disease Using K Means and Artificial Neural Network as Hybrid Approach to Improve Accuracy," International Journal of Engineering and Technology, Vol 9, No 4, Aug-Sep 2017.
[7] Chitra R. & Seenivasagam, V., "Review of heart disease prediction system using data mining and hybrid intelligent techniques," ICTACT Journal on Soft Computing, July 2013, volume: 03, issue: 04, pp. 605-09.
[8] Sudhakar K. and Manimekalai M., "Study of Heart Disease Prediction using Data Mining," International Journal of Advanced Research in Computer Science and Software Engineering, Volume 4, Issue 1, pp. 1157-60, January 2014.
[9] Buttle, F., "Introduction to Customer Relationship Management," in: F. Buttle, Customer Relationship Management, Oxford: Butterworth-Heinemann, an imprint of Elsevier, 2009, pp. 1-25.
[10] Sharma, A. et al., "Application of Data Mining – A Survey Paper," International Journal of Computer Science and Information Technologies, vol. 5 (2), 2014, pp. 2023-2025.
[11] Lee, H. G., Ki Yong Noh and Keun Ho Ryu, "Mining Biosignal Data: Coronary Artery Disease Diagnosis Using Linear and Nonlinear Features of HRV," Springer-Verlag Berlin Heidelberg, 2007.
[12] Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P. and Witten, I. H., "The WEKA data mining software: an update," SIGKDD Explor. Newsl., Vol. 11, 1, 2009, pp. 10-18.
[13] Masethe H. D. and Masethe M. A., "Prediction of heart disease using classification algorithms," World Congress on Engineering and Computer Science 2014, Vol II, WCECS 2014, 22–24 October, 2014, San Francisco, USA.
