Ankur Biswas
Computer Science & Engineering
Tripura Institute of Technology
Narsingarh, Agartala, India
abiswas.tit@gmail.com

Pritha Datta
Computer Science & Engineering
Tripura Institute of Technology
Narsingarh, Agartala, India
prithaitcse12@gmail.com
Abstract— Heart disease has been the leading cause of death worldwide over the last few decades. To avoid heart disease or coronary illness and to detect its indications early, individuals over 55 years of age should have a complete cardiovascular checkup. Researchers and specialists have developed various intelligent techniques to improve the ability of health care professionals to recognise cardiovascular disease. In the detection and treatment of cardiovascular disease, single data mining strategies give reasonable precision and accuracy. Nevertheless, the use of data mining procedures can reduce the number of tests that need to be carried out. To decrease the number of deaths from heart disease, there has to be a quick and efficient detection technique that provides better accuracy and precision. The aim of this paper is to present an efficient technique for predicting heart disease using machine learning approaches. Hence we propose a hybrid approach for heart disease prediction using the Random forest classifier and the simple k-means algorithm. The dataset is also evaluated using two other machine learning algorithms, namely the J48 tree classifier and the Naive Bayes classifier, and the results are compared. The results attained through the Random forest classifier and the corresponding confusion matrix show the robustness of the methodology.

Keywords— Heart disease, Data mining, Machine learning, Random forest, K-means

I. INTRODUCTION

According to statistics available up to 2018, an estimated 17.9 million deaths occur worldwide every year due to cardiovascular disease (CVD), which accounts for 31% of all deaths worldwide. If existing trends persist, the annual number of deaths from CVD will rise to 22.2 million by 2030 [http://www.who.int]. A complete prediction using data mining techniques may provide an early and accurate diagnosis of this disease. A variety of data mining approaches such as Decision tree [1], Neural Network [2,3], Naive Bayes [4] and the KNN algorithm [5], as well as a hybrid technique called neural network ensemble, i.e. a combination of neural network and ensemble-based methods [6], are used to classify, predict and cluster data in order to make accurate decisions about the risk of heart disease [7,8]. The term CVD covers numerous types of disorders that may damage the heart. The most common types of heart disease and related cardiovascular diseases are tabulated in Table I.

TABLE I

Diseases                    Description
Acute coronary syndromes    Obstruction of blood supply to the heart muscle
Angina                      Lack of blood flow to the heart muscle, which causes chest pain
Arrhythmia                  Irregular heartbeat or cardiac dysrhythmia
Cardiomyopathy              Heart muscle disease that makes it harder for the heart to pump blood to the rest of the body
Congenital heart disease    Problem in the structure of the heart that is present at birth
Coronary heart disease      The arteries narrow, reducing blood flow to the heart

Hidden patterns and existing relationships can be extracted from large data sources using data mining techniques that merge statistical analysis, machine learning and database technology [9]. In medical centres (hospitals or clinics), data mining techniques help to identify whether an individual has a disease or not. They are also used for early, automatic diagnosis of patients' diseases, i.e. in a short time, and so support satisfactory services in healthcare centres for saving the lives of individuals. Prediction techniques help stakeholders to take reasonable decisions, and in particular help specialists to make rational decisions when treating patients. In this paper, a hybrid technique using the Random forest classifier and the simple k-means algorithm is proposed for predicting heart disease. The proposed technique is also compared with other machine learning classifiers. The rest of the paper is organised as follows: Section 2 describes the background of data mining techniques and tools; Section 3 illustrates the proposed methodology; Section 4 demonstrates the results; and Section 5 presents concluding remarks and future scope.
where Y is the instance to be predicted and C is the class value for the instance. The formula given above is used to determine the class into which the features are expected to be categorised.

Decision Tree:

A decision tree is a supervised learning classifier that is simple to understand and interpret. It handles both numerical and categorical data. A decision tree resembles a tree structure with internal nodes, branches and leaf nodes, where each branch denotes an attribute value of the given dataset. Each internal node represents a test on a given set of attributes, while the leaf nodes show the class, i.e. the end result. Based on the predictive attributes and the given rules, classification proceeds from the root node to the leaf nodes. The most frequently utilised decision tree approaches include

B. Open source tools for Data Mining

The WEKA Tools:
WEKA stands for Waikato Environment for Knowledge Analysis [12,13]. It was developed at the University of Waikato in New Zealand and is written in Java. WEKA is used for developing machine learning algorithms and for applying them to real data mining problems. WEKA is open source software. It implements data processing, clustering, classification, regression and association.

TANAGRA Tools:
TANAGRA is open source data mining software that offers various data mining techniques from the domains of statistical learning, machine learning and databases. TANAGRA is used by many researchers and students because it is very easy to use and helps to analyse both real and synthetic data.
Rapid Miner
Rapid Miner, formerly known as YALE (Yet Another Learning Environment), is an environment that supports data mining and machine learning tasks such as data extraction, transformation and loading (ETL), data pre-processing, transformation and visualisation, model building, evaluation and deployment. Rapid Miner is written in the Java programming language and can be used for mining textual data, interactive mining, quality engineering, data flow mining, etc.
MATLAB
MATLAB stands for Matrix Laboratory. It provides a multi-paradigm numerical computing environment and was introduced as a fourth-generation programming language. MATLAB supports matrix manipulation, plotting of functions and data, implementation of algorithms, creation of graphical user interfaces (GUIs) and interfacing with programs written in other high-level languages, including C, C++, Java, FORTRAN and Python.
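The root-to-leaf classification described for decision trees earlier in this section can be illustrated with a minimal Python sketch. The tree below is hand-written and purely hypothetical: the choice of splits, the threshold of 150 and the names `Node`, `Leaf`, `classify` and `toy_tree` are assumptions made for illustration, not a tree learned by J48 or taken from the paper.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Union


@dataclass
class Leaf:
    label: str  # class value shown at a leaf node


@dataclass
class Node:
    test: Callable[[dict], str]               # test on an attribute, returns a branch name
    branches: Dict[str, Union["Node", Leaf]]  # one subtree per attribute value


def classify(tree, record):
    """Walk from the root to a leaf, following the branch chosen by each test."""
    while isinstance(tree, Node):
        tree = tree.branches[tree.test(record)]
    return tree.label


# Hypothetical tree: splits and threshold are illustrative only.
toy_tree = Node(
    test=lambda r: "yes" if r["exercice_angina"] == "yes" else "no",
    branches={
        "yes": Leaf("positive"),
        "no": Node(
            test=lambda r: "high" if r["max_heart_rate"] >= 150 else "low",
            branches={"high": Leaf("negative"), "low": Leaf("positive")},
        ),
    },
)

print(classify(toy_tree, {"exercice_angina": "no", "max_heart_rate": 170}))
```

Each internal node holds a test and each branch corresponds to one outcome of that test, which mirrors the description above; a learned tree would choose its tests from the training data instead.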
III. METHODOLOGY
In this paper, we develop a prediction system capable of predicting heart disease based on measurements extracted from the ERIC laboratory (eric.univ-lyon2.fr/~ricco), consisting of 209 test cases, and we use the Knowledge Discovery in Databases (KDD) methodology.

The entire dataset consists of the following attributes: (age, chest_pain, rest_bpress, blood_sugar, rest_electro, max_heart_rate, exercice_angina, disease), which are further divided into 3 classes: (1) the user input class, which includes age, chest_pain, rest_bpress, blood_sugar, rest_electro, max_heart_rate and exercice_angina; (2) additional attributes, including tobacco usage and past heart disease, which can be utilised for better prediction; and (3) prediction attributes. The ‘disease’ attribute establishes the presence of the disease. It takes one of two levels, ‘positive’ or ‘negative’. The overall flow diagram is shown in figure 1.

Fig.1: Flow diagram of methodology

The chosen dataset is to be checked for noise, inconsistency and missing values. The few noisy and inconsistent entries identified in the data can be corrected manually. The few missing values in the dataset can be replaced with the most probable value determined by regression or with the global mean/mode value, and outliers were substituted with attribute mean values. An appropriate data cleaning strategy is to be performed depending on the tools selected for the preprocessing phase. An appropriate transformation technique such as attribute selection is essential to reduce the number of features a classification algorithm needs to examine and to reduce errors caused by irrelevant features. A best first search (BFS) method can be utilised to select the best attributes from the 9 attributes that were available.

Finally, an appropriate data mining technique was selected for developing a predictive model. For this purpose the Decision Tree approach of the Weka machine learning software was selected.

A. Data sources
Data collected from the ERIC Laboratory, 5 avenue Pierre Mendes France, 69676 Bron Cedex, France, are provided as input to the system in ESS format, as shown in figure 2.

Fig. 2: Raw data from Eric dataset

B. Data mining in Weka using Decision tree
1) Data preprocessing: In preprocessing, data are manipulated in a way that makes them suitable for further examination; hence we prepared a basic set of data in .csv or .arff format suitable for classification, as shown in figure 3.
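The cleaning and preparation steps described in this section (replacing missing values with the attribute mean or mode, then writing the data as .csv) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Weka filter used in the paper: the sample rows, the "?" missing-value marker and the helper names `impute` and `to_csv` are hypothetical.

```python
import csv
import io
from statistics import mean, mode

MISSING = "?"  # assumed marker for a missing value


def impute(rows, numeric_cols):
    """Fill missing cells with the column mean (numeric) or mode (categorical)."""
    filled = [dict(r) for r in rows]
    for col in rows[0].keys():
        present = [r[col] for r in rows if r[col] != MISSING]
        if col in numeric_cols:
            fill = round(mean(float(v) for v in present), 2)
        else:
            fill = mode(present)
        for r in filled:
            if r[col] == MISSING:
                r[col] = fill
            elif col in numeric_cols:
                r[col] = float(r[col])
    return filled


def to_csv(rows):
    """Serialise the cleaned rows as CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# Made-up rows using three of the dataset's attribute names.
sample = [
    {"age": "63", "chest_pain": "asympt", "disease": "positive"},
    {"age": "?", "chest_pain": "asympt", "disease": "negative"},
    {"age": "41", "chest_pain": "?", "disease": "negative"},
]
cleaned = impute(sample, numeric_cols={"age"})
print(to_csv(cleaned))
```

Using the global mean or mode is the simplest of the options mentioned above; the regression-based replacement would need a fitted model and is not shown here.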
• By replacing each attribute value with the number of standard deviations it lies from the attribute mean value.
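The standardisation described in this item corresponds to the usual z-score, z = (x - mean) / standard deviation. The following is a minimal Python sketch; the function name `standardize`, the use of the population standard deviation and the sample values are assumptions made for illustration.

```python
from statistics import mean, pstdev


def standardize(values):
    """Return z-scores: the number of standard deviations each value lies from the mean."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    if sigma == 0:
        # A constant attribute carries no spread; map every value to 0.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


ages = [40, 50, 60]  # made-up attribute values
print(standardize(ages))
```

After this transformation an attribute has mean 0 and standard deviation 1, so attributes measured on different scales contribute comparably to distance-based methods such as k-means.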
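The evaluation that follows reports the TP rate, FP rate, precision, recall, F-measure and kappa statistic for each classifier. As a minimal sketch of how such figures follow from a two-class confusion matrix, the Python function below computes them for the "positive" class. The function name `binary_metrics` is hypothetical, and the input counts are the J48 values reported in the paper's confusion matrix table; this is not Weka's own implementation.

```python
def binary_metrics(tp, fn, fp, tn):
    """Per-class metrics and Cohen's kappa from a 2x2 confusion matrix.

    tp/fn: actual positives classified as positive/negative
    fp/tn: actual negatives classified as positive/negative
    Assumes no row or column of the matrix is empty (no zero divisions).
    """
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)          # same quantity as the TP rate
    fp_rate = fp / (fp + tn)
    f_measure = 2 * precision * recall / (precision + recall)
    # Agreement expected by chance, from the row and column totals.
    expected = ((tp + fn) * (tp + fp) + (fp + tn) * (fn + tn)) / total ** 2
    kappa = (accuracy - expected) / (1 - expected)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "fp_rate": fp_rate,
        "f_measure": f_measure,
        "kappa": kappa,
    }


# J48 confusion matrix as reported in the paper: rows 73/19 and 9/108.
m = binary_metrics(tp=73, fn=19, fp=9, tn=108)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

With these counts the sketch gives an accuracy of about 0.866, precision 0.890, recall 0.793, F-measure 0.839 and kappa 0.725, which agrees with the J48 figures in the summary and detailed-accuracy tables.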
A. Evaluations

All the techniques adopted were evaluated to observe how well they fulfil the goals of data mining. The algorithms were evaluated on the basis of classification accuracy, area under the ROC curve and the confusion matrix.

The following data were obtained while implementing the classification techniques in Weka. A total of 209 records were taken for evaluation to predict the existence of heart disease. The overall summary of classification using J48, Naive Bayes and Random forest is tabulated in Table I.

TABLE I: SUMMARY OF CLASSIFICATION

Classifier                          J48            Naive Bayes     Random Forest
Correctly classified instances      181 (86.6%)    170 (81.33%)    209
Incorrectly classified instances    28 (13.39%)    39 (18.66%)     0
Kappa statistic                     0.725          0.619           1
Mean absolute error                 0.2047         0.2279          0.1009
Root mean squared error             0.3199         0.3742          0.1434
Relative absolute error %           41.53%         46.22%          20.4674
Root relative squared error         64.45%         75.37%          28.8811
Total no. of instances              209            209             209

The detailed accuracy of the three classifiers is compared. Tables II, III and IV show the accuracy of J48, Naive Bayes and Random forest respectively.

TABLE II: DETAILED ACCURACY BY CLASS (J48)

TP Rate    FP Rate    Precision    Recall    F-Measure    Class
0.793      0.077      0.890        0.793     0.839        positive
0.923      0.207      0.850        0.923     0.885        negative
0.866      0.149      0.868        0.866     0.865        (weighted avg.)

TABLE III: DETAILED ACCURACY BY CLASS (NAÏVE BAYES)

TP Rate    FP Rate    Precision    Recall    F-Measure    Class
0.761      0.145      0.805        0.761     0.782        positive
0.855      0.239      0.820        0.855     0.837        negative
0.813      0.198      0.813        0.866     0.813        (weighted avg.)

TABLE IV: DETAILED ACCURACY BY CLASS (RANDOM FOREST)

TP Rate    FP Rate    Precision    Recall    F-Measure    Class
1.000      0.000      1.000        1.000     1.000        positive
1.000      0.000      1.000        1.000     1.000        negative
1.000      0.000      1.000        1.000     1.000        (weighted avg.)

A confusion matrix is a performance measure that gives detailed information on the total number of correctly and incorrectly classified instances. The confusion matrix of each classifier on the set of 209 records, for which the true values are known, is shown in Table V.

TABLE V: CONFUSION MATRIX

Classifier       a      b      ← classified as
J48              73     19     a = positive
                 9      108    b = negative
Naive Bayes      70     22     a = positive
                 9      108    b = negative
Random Forest    92     0      a = positive
                 17     100    b = negative

V. CONCLUSION & FUTURE SCOPE

In this paper, the intent was to devise a predictive model for the detection of cardiovascular heart disease using machine learning techniques and varied heart-related parameters. The dataset was pre-processed and three supervised machine learning classification algorithms, i.e. the J48 classifier, Naive Bayes and Random Forest, were applied using the Weka machine learning software. The performance of each technique was evaluated using the benchmark metrics of accuracy, precision, recall and F-measure. The most effective model for predicting patients with heart disease appears to be the Random Forest classifier applied to selected attributes, with a classification accuracy of 100%. Heart disease is by nature deadly, and delay in diagnosis or misdiagnosis of this ailment can lead to severe or life-threatening problems such as cardiac arrest and even death. The best model selected for predicting heart disease could not exceed a classification accuracy of 100%. This indicates that machine learning or data mining procedures can be efficiently utilised to predict cardiovascular heart disease. The conclusions of this paper can be utilised as a supporting tool by cardiologists to formulate reliable diagnoses of heart disease.

All the validations in this work were carried out on a particular dataset of 9 attributes. Additional analysis should be carried out with more attributes and different parameter settings to improve the prediction models and develop new capabilities. In addition, the Random Forest and Simple K-means algorithms should be examined thoroughly. The main challenge in the data mining or machine learning process is the inconsistency of data and the presence of missing values, noisy
data and outliers. Therefore, statistical and machine learning methodologies must be applied to manage the data quality.

REFERENCES
[1] Anooj, P. K., “Clinical decision support system: Risk level prediction of heart disease using weighted fuzzy rules,” Journal of King Saud University – Computer and Information Sciences, 24 (2012), pp. 27–40.
[2] Amin, S. U., Agarwal, K. and Beg, R., “Genetic Neural Network Based Data Mining in Prediction of Heart Disease Using Risk Factors,” IEEE Conference on Information and Communication Technologies (ICT 2013), 2013.
[3] Dangare, C. S. and Apte, S. S., “A Data mining approach for prediction of heart disease using neural network’s,” International Journal of Computer Engineering & Technology (IJCET), Volume 3, Issue 3, October – December 2012, pp. 30-40.
[4] Indhumathi, S. and Vijaybaskar, G., “Web based health care detection using naive Bayes algorithm,” International Journal of Advanced Research in Computer Engineering & Technology (IJARCET), Volume 4, Issue 9, pp. 3532-36, September 2015.
[5] Purusothama, G. and Krishnakumari, P., “A Survey of Data mining techniques on risk prediction: Heart disease,” Indian Journal of Science and Technology, 2015.
[6] Malav, A., Kadam, K. and Kamat, P., “Prediction Of Heart Disease Using K Means and Artificial Neural Network as Hybrid Approach to Improve Accuracy,” International Journal of Engineering and Technology, Vol 9, No 4, Aug-Sep 2017.
[7] Chitra, R. and Seenivasagam, V., “Review of heart disease prediction system using data mining and hybrid intelligent techniques,” ICTACT Journal on Soft Computing, July 2013, Volume 03, Issue 04, pp. 605-09.
[8] Sudhakar, K. and Manimekalai, M., “Study of Heart Disease Prediction using Data Mining,” International Journal of Advanced Research in Computer Science and Software Engineering, Volume 4, Issue 1, pp. 1157-60, January 2014.
[9] Buttle, F., “Introduction to Customer Relationship Management,” in: F. Buttle, Customer Relationship Management, Oxford: Butterworth-Heinemann (an imprint of Elsevier), 2009, pp. 1-25.
[10] Sharma, A. et al., “Application of Data Mining – A Survey Paper,” International Journal of Computer Science and Information Technologies, vol. 5 (2), 2014, pp. 2023-2025.
[11] Lee, H. G., Noh, K. Y. and Ryu, K. H., “Mining Biosignal Data: Coronary Artery Disease diagnosis Using Linear and Nonlinear Features of HRV,” Springer-Verlag Berlin Heidelberg, 2007.
[12] Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P. and Witten, I. H., “The WEKA data mining software: an update,” SIGKDD Explor. Newsl., Vol. 11, 1, 2009, pp. 10-18.
[13] Masethe, H. D. and Masethe, M. A., “Prediction of heart disease using classification algorithms,” World Congress on Engineering and Computer Science 2014, Vol II, WCECS 2014, 22–24 October 2014, San Francisco, USA.