TNCAB-2019 Paper 16

Heart disease prediction using data mining
techniques
Bhavya Dubey1 and Dr.Kapil Kumar Nagwanshi2
1First
Author Affiliation with address
bhavyadubey.nmims@gmail.com
2Second Author Affiliation with address
dr.kapil@ieee.org
ABSTRACT
Data mining has been widely investigated in medical research, and these investigations show that the prediction
of cardiovascular disease is vital in medical science. Data from patient histories is heterogeneous, and the
various forms of data must be interpreted to predict a patient's heart disease. Cardiovascular disease has been
the leading cause of death worldwide over the past 10 years. Many researchers are using statistical and data
mining tools to help health care professionals in the diagnosis of cardiovascular disease. Using a single data
mining technique in the diagnosis of cardiovascular disease has been comprehensively investigated, showing
acceptable levels of accuracy. The patient risk level is classified using data mining classification techniques
such as Naïve Bayes, KNN, Decision Tree and Neural Network. The accuracy of the risk level is high when a larger
number of attributes is used. In this proposed work, a 13-attribute structured clinical dataset from the UCI
Machine Learning Repository has been used as the source data. Decision Tree and Naïve Bayes have been applied and
their performance on diagnosis has been compared. Naïve Bayes outperforms Decision Tree.
Keywords:
1. INTRODUCTION
As the largest single cause of death in the world, cardiovascular disease (CVD) is a significant and crucial
issue. CVD is not a single disease, but a group of diseases and injuries that affect the cardiovascular system
(the heart and blood vessels). These are most commonly diseases of the heart and of the blood vessels of the
heart and brain. In general they affect people in later life (with incidence rising sharply after the 30-44 age
range), although, according to a leading cardiologist, by around 35 years of age most people who will develop
some form of CVD already have the beginnings of the disease. Coronary heart disease (CHD), also called coronary
artery disease and atherosclerotic heart disease, is the end result of the accumulation of atheromatous plaques
within the walls of the arteries that supply the myocardium (the muscle of the heart). While the symptoms and
signs of coronary heart disease are noted in the advanced state of the disease, most individuals with coronary
heart disease show no evidence of disease for decades as the disease progresses before the first onset of
symptoms, often a "sudden" heart attack, finally arises. After decades of progression, some of these atheromatous
plaques may rupture and (along with the activation of the blood clotting system) begin to restrict blood flow.
Table 1: Attributes
Name | Type | Description
Age | Continuous | Age in years
Sex | Discrete | Value 1: male; Value 0: female
Chest Pain | Discrete | Value 1: typical type 1 angina; Value 2: typical type 2 angina; Value 3: non-anginal pain; Value 4: asymptomatic
Fasting blood sugar | Continuous | Value 1: >120 mg/dl; Value 0: <120 mg/dl
Rest ECG | Discrete | Resting electrocardiographic results. Value 0: normal; Value 1: having ST-T wave abnormality; Value 2: probable or definite left ventricular hypertrophy
Exang | Discrete | Exercise-induced angina. Value 1: yes; Value 0: no
Slope | Discrete | Slope of the peak exercise ST segment. Value 1: upsloping; Value 0: flat; Value 3: downsloping
CA | Discrete | Number of major vessels coloured by fluoroscopy, ranging from 0 to 3
Thal | Discrete | Value 3: normal; Value 6: fixed defect; Value 7: reversible defect
Trestbps | Continuous | Resting blood pressure in mmHg
Chol | Continuous | Serum cholesterol in mg/dl
Thalach | Continuous | Maximum heart rate achieved
Old peak | Continuous | ST depression induced by exercise relative to rest
Diagnosis | Continuous | Value 0: no disease; Value 1: mild; Value 2: severe
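As a rough illustration of how records with these attributes might be read into a program, the sketch below parses comma-separated rows into dictionaries. The column order, the abbreviated column names, the use of `?` for a missing value and the two sample rows are all assumptions made for this example; they are not taken from the paper.

```python
import csv
import io

# Assumed column order and names (hypothetical; adjust to the actual file).
COLUMNS = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
           "thalach", "exang", "oldpeak", "slope", "ca", "thal", "diagnosis"]

def parse_records(text):
    """Parse comma-separated rows into dicts; a '?' cell becomes None (missing)."""
    records = []
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue  # skip blank lines
        values = [None if cell.strip() == "?" else float(cell) for cell in row]
        records.append(dict(zip(COLUMNS, values)))
    return records

# Two made-up rows for illustration; the second one has a missing 'ca' value.
SAMPLE = (
    "63,1,1,145,233,1,2,150,0,2.3,3,0,6,0\n"
    "67,1,4,160,286,0,2,108,1,1.5,2,?,3,2\n"
)
records = parse_records(SAMPLE)
print(records[0]["age"], records[1]["ca"])
```

Keeping missing values as `None` rather than dropping the row at parse time leaves the cleaning decision to a later step.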
Table 2: Literature survey
S.No | Paper title | Author name | Inferences
1 | Diagnosis Of Heart Disease using data mining algorithm | Asha Rajkumar, Mrs. G. Sophia Reena | Comparison of different classifiers on training data sets showed that the Naïve Bayes classifier is the most accurate and fastest algorithm.
2 | Data mining and visualization for decision support and modelling of public health-care resources | Nada Lavrac, Marko Bohanec, Marko Debeljak, Bojan Cestnik and Andrej Kobler | Introduces the use of data mining methods to extract the desired datasets from a collection of unprocessed datasets.
3 | Data Preparation by CFS an Essential approach for decision making using C 4.5 for Medical data mining | Ashwinkumar U. M. | Describes the use of measures such as the Glasgow Coma Scale in predicting the severity of trauma, among other factors.
4 | Data Mining in Healthcare Information Systems: Case Study of a Veterans' Administration Spinal-Cord Injury Population | Margaret R. Kraft, Kevin C. Desouza and Ida Androwich | Describes the use of technologies such as Artificial Neural Networks (ANN) to predict the Length of Stay (LOS) of a patient population, in order to reduce the cost of treatment and hospital stays and to increase efficiency and effectiveness in resource allocation.
5 | Predictive Data Mining for Medical Diagnosis: An Overview of Heart Disease Prediction | Jyoti Soni, Ujma Ansari, Dipesh Sharma and Sunita Soni | Decision Tree prediction is best among the classification techniques. The accuracy of Decision Tree and Bayesian classification is better than that of the other techniques.
6 | - | Devendra Ratnaparkhi, Tushar Mahajan, Vishal Jadhav | Naïve Bayes is the most effective and accurate technique for classifying and predicting data of patients who are diagnosed with heart disease, because it is the most accurate compared with other algorithms, i.e. neural network and decision tree algorithms, and takes less time to execute.
2. Data Mining
Data mining is defined as a process used to extract usable data from a larger set of raw data. It implies
analysing data patterns in large batches of data using one or more software tools. Data mining has applications
in multiple fields, such as science and research. Data mining involves effective data collection and warehousing
as well as computer processing. For segmenting the data and evaluating the probability of future events, data
mining uses sophisticated mathematical algorithms. Data mining is also known as Knowledge Discovery in Data
(KDD).
Key features of data mining:
• Automatic pattern predictions based on trend and behaviour analysis.
• Prediction based on likely outcomes.
• Creation of decision-oriented information.
• Focus on large data sets and databases for analysis.
• Clustering based on finding and visually documented groups of facts not previously known.
Data mining can be performed on the following types of data:
• Relational databases
• Data warehouses
• Advanced databases and information repositories
• Object-oriented and object-relational databases
• Transactional and spatial databases
• Heterogeneous and legacy databases
• Multimedia and streaming databases
• Text databases
• Text mining and Web mining
Data mining techniques:

1. Classification:
Classification is used to retrieve important and relevant information about data and metadata, and to assign
data to different classes. It is similar to clustering in that it segments data records into different segments,
called classes. Unlike clustering, however, in classification the analysts know the classes in advance.
Classification analysis therefore applies algorithms that decide how new data should be classified.
For example, Outlook email uses certain algorithms to characterize an email as legitimate or spam.
2. Clustering:
Cluster analysis is a data mining technique to identify data that are similar to each other. This process helps
to understand the differences and similarities between the data. A cluster is a group of data objects: objects
within the same cluster are similar to one another, while they differ from objects in other clusters.
3. Regression:
In statistical terms, regression analysis is used to identify and analyse the relationship between variables. It
helps you understand how the characteristic value of the dependent variable changes when any of the independent
variables is varied. One variable is therefore dependent on another, but not vice versa. It is generally used
for prediction and forecasting.
4. Association Rules:
This technique identifies interesting relations between different variables in the database. It is also used to
uncover hidden patterns in the data. Association rules are useful for examining and predicting behaviour, and
are commonly recommended in the retail industry.
5. Outlier Detection:
This type of data mining technique refers to the observation of data items in the dataset which do not match an
expected pattern or expected behaviour. This technique can be used in a variety of domains, such as intrusion
detection, fraud or fault detection, etc. Outlier detection is also called Outlier Analysis or Outlier Mining.
An outlier is an item that deviates from the common average within a dataset or a combination of data. These
items are statistically distant from the rest of the data, which indicates that something out of the ordinary
has happened and requires additional attention.
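One simple way to make "statistically distant from the rest of the data" concrete is a z-score rule. The sketch below is an illustration only: the threshold and the sample cholesterol-like values are made up, and the paper does not state that this rule was used.

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Return the values whose distance from the mean exceeds `threshold` standard deviations."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)  # population standard deviation
    if sd == 0:
        return []  # all values identical, nothing can be an outlier
    return [v for v in values if abs(v - mean) / sd > threshold]

# Made-up sample: one value sits far from the others.
sample = [210, 225, 240, 198, 231, 244, 219, 564]
print(zscore_outliers(sample, threshold=2.0))
```

With very small samples a single extreme value inflates the standard deviation, so a lower threshold (2.0 here) is used for the illustration.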
6. Sequential Patterns:
This data mining technique helps to discover or identify similar patterns or trends in transaction data over a
certain period. It is an important part of data mining.
7. Prediction:
Prediction uses a combination of other data mining techniques such as trends, sequential patterns, clustering
and classification. It analyses past events or instances in the right sequence to predict a future event. It is
used to discover the relationship between independent and dependent variables.
2.2 OPEN SOURCE TOOLS FOR DATA MINING
2.2.1 Weka Tool

WEKA is a data mining workbench developed at the University of Waikato in New Zealand that implements data
mining algorithms in the Java language. WEKA is a state-of-the-art facility for developing machine learning
techniques and applying them to real-world data mining problems. It is a collection of machine learning
algorithms for data mining tasks. The algorithms are applied directly to a dataset. WEKA implements algorithms
for data pre-processing, regression, classification, clustering and association rules; it also includes
visualization tools. New machine learning schemes can also be developed with this package. WEKA is open source
software. The data file normally used by Weka is in ARFF file format, which includes special tags to indicate
the different parts of the data file.
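To show what the ARFF tags look like, the sketch below builds a small ARFF document as a string. The helper name `to_arff`, the relation name and the three attributes are hypothetical choices for this example; only the `@relation`, `@attribute` and `@data` tags and the `?` marker for a missing value come from the ARFF format itself.

```python
def to_arff(relation, attributes, rows):
    """Build ARFF text.

    attributes: list of (name, kind) where kind is "numeric" or a list of nominal values.
    rows: list of value lists; None is written as '?' (missing).
    """
    lines = [f"@relation {relation}", ""]
    for name, kind in attributes:
        if kind == "numeric":
            lines.append(f"@attribute {name} numeric")
        else:
            # Nominal attribute: the allowed values are listed in braces.
            lines.append(f"@attribute {name} {{{','.join(kind)}}}")
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join("?" if v is None else str(v) for v in row))
    return "\n".join(lines) + "\n"

arff_text = to_arff(
    "heart",                                   # hypothetical relation name
    [("age", "numeric"), ("sex", ["0", "1"]), ("chol", "numeric")],
    [[63, 1, 233], [67, 1, None]],             # made-up rows
)
print(arff_text)
```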
2.2.2 Tanagra
Tanagra is free data mining software for academic and research purposes. It offers several data mining methods
from exploratory data analysis, statistical learning, machine learning and the database area. Tanagra is an
open source project: every researcher can access the source code and add his own algorithms, as long as he
agrees and conforms to the software distribution license. The main purpose of the Tanagra project is to give
researchers and students easy-to-use data mining software that conforms to the present norms of software
development in this domain and allows them to analyse either real or synthetic data.
2.2.3 MATLAB
MATLAB is a high-level language and interactive environment for numerical computation, programming and
visualization. Using MATLAB we can analyse data, develop algorithms and create applications and models. The
language, tools and built-in math functions allow us to explore multiple approaches and reach a solution faster
than with spreadsheets or traditional programming languages such as C/C++ or Java.
2.2.4 .NET Framework
The .NET Framework is a software framework developed by Microsoft that runs primarily on Microsoft Windows
and provides language interoperability across several programming languages. For developers, the .NET
Framework provides a consistent and comprehensive programming model for building applications that have
visually appealing user experiences and seamless and secure communication.
2.2.5 Rapid Miner
Rapid Miner is described as a world-leading open source system for data mining. It is available as a
stand-alone application for data analysis and as a data mining engine for integration into other products.
Thousands of applications of Rapid Miner in more than forty countries give their users a competitive edge.
3. RELATED WORK
Researchers have previously attempted to predict the presence of heart disease using various features. Verma,
Srivastava, and Negi [8] developed a model to identify coronary artery disease. They describe a hybrid method
for which the authors used a dataset from the Department of Cardiology at Gandhi Medical School. The dataset
comprises 335 records with twenty-six attributes. In their implementation the data was pre-processed by
correlating the features using particle swarm optimization. Using the multilayer perceptron (MLP) they obtained
an accuracy of 77%. We have used the commonly used datasets of the UCI Machine Learning Repository, which
contain the Cleveland, Hungarian and Long Beach VA datasets. El-Baily et al. [9] conducted research on these by
selecting five common variables. Two training methods were used, namely the C4.5 decision tree and the Fast
Decision Tree (FDT). The accuracy recorded was 69.5% for Long Beach VA using FDT and 78.54% using C4.5.
Table 3: Comparison
Author | Year | Technique | Dataset | Accuracy
Cheung et al. [10] | 2001 | Naïve Bayes | Cleveland | 81.48%
Yan et al. [11] | 2003 | Naïve Bayes | Cleveland | 78.56%
Andreva P. [12] | 2006 | Naïve Bayes | Cleveland | 95%
Palaniappan et al. | 2007 | Naïve Bayes | - | 95%
Palaniappan et al. | 2007 | Decision Tree | - | 94.93%
Tantimongcolwat et al. | 2008 | - | - | 80.4%
Tantimongcolwat et al. | 2008 | Multilayer Perceptron | - | 74.5%
Hara et al. | 2008 | Automatically Defined Groups | - | 67.8%
Sitar-Taut et al. [13] | 2009 | Naïve Bayes | Cleveland | 62.03%
Sitar-Taut et al. [13] | 2009 | Decision Trees | Cleveland | 60.40%
Rajkumar et al. [2] | 2010 | Naïve Bayes | Cleveland | 52.33%
Rajkumar et al. [2] | 2010 | KNN | Cleveland | 45.67%
Rajkumar et al. [2] | 2010 | Decision list | Cleveland | 52%
Sunithasajja [1] | 2010 | Naïve Bayes | Cleveland | 63.97%
Sunithasajja [1] | 2010 | Naïve Bayes | Hungarian | 65.74%
Sunithasajja [1] | 2010 | Naïve Bayes | VA Long Beach | 38.42%
Srinivas et al. [14] | 2010 | Naïve Bayes | Cleveland | 84.14%
Showman et al. [15] | 2012 | Naïve Bayes | Cleveland | 81.48%
4. PROPOSED METHOD
Data mining is primarily one component of Knowledge Discovery in Databases (KDD). Many people treat
data mining as a synonym for KDD, since it is an essential part of the KDD process. Knowledge discovery
as a process is depicted in Figure 1 and consists of an iterative sequence of the following steps.
● Data Cleaning - To remove noise and irrelevant data.
● Data Integration - Where multiple data sources may be combined.
● Data Selection - Where data relevant to the analysis task are retrieved from the database.
● Data Transformation - Where data are transformed or consolidated into forms appropriate for mining by
performing summary or aggregation operations.
● Data Mining - An essential process where intelligent methods are applied in order to extract data
patterns.
● Pattern Evaluation - To identify the truly interesting patterns representing knowledge based on some
interestingness measures.
● Knowledge Presentation - Where knowledge representation techniques are used to present the mined
knowledge to the user.
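The first steps of the sequence above (cleaning, selection and transformation) can be sketched as small functions applied in order. This is an illustration under stated assumptions: the function names, the field names and the choice of min-max scaling as the transformation are hypothetical and are not described in the paper.

```python
def clean(records):
    """Data cleaning: drop records that contain a missing (None) value."""
    return [r for r in records if all(v is not None for v in r.values())]

def select(records, keys):
    """Data selection: keep only the attributes relevant to the task."""
    return [{k: r[k] for k in keys} for r in records]

def min_max_scale(records, key):
    """Data transformation: rescale one attribute to the range [0, 1]."""
    values = [r[key] for r in records]
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid division by zero when all values are equal
    return [{**r, key: (r[key] - lo) / span} for r in records]

# Made-up records; the second has a missing cholesterol value.
raw = [
    {"age": 40, "chol": 200, "note": "a"},
    {"age": 60, "chol": None, "note": "b"},
    {"age": 50, "chol": 300, "note": "c"},
]
prepared = min_max_scale(select(clean(raw), ["age", "chol"]), "age")
print(prepared)
```

Each step returns a new list, so the stages can be reordered or repeated, which matches the iterative nature of the process.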
Figure 1: diagrammatic representation of our proposed method

Figure2: range of different attributes
The data mining step may interact with the user or a knowledge base. The interesting patterns are presented to
the user and may be stored as new knowledge in the knowledge base. Data mining is the process of discovering
interesting knowledge from large amounts of data stored in databases, data warehouses, or other information
repositories.
4.1 Choosing Data Mining Task
This step selects the goal of the KDD process. The available goals include classification, regression,
clustering, etc. We chose Naïve Bayes (classification), Logistic Regression (regression) and KNN (K-Nearest
Neighbour).
4.1.1 Logistic Regression

Logistic regression is a classification algorithm used to assign observations to a discrete set of Classes. Unlike line-
ar regression which outputs continuous number values, logistic regression transforms its output using the logistic
sigmoid function to return a probability value which can then be mapped to two or more discrete classes.
Types of logistic regression:
1. Binary (Pass/Fail)
2. Multi (Cats, Dogs, Sheep)
Sigmoid function: S(z) = 1 / (1 + e^(-z))
Decision Boundary: p ≥ 0.5, class = 1; p < 0.5, class = 0
Figure: Decision Boundary
Cost Function
Vectorised cost function
For Multiclass - Instead of y=0,1 we will expand our definition so that y=0,1...n. Basically we re-run binary classi-
fication multiple times, once for each class.
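The sigmoid function and the decision boundary above can be sketched in a few lines. The cost function equations did not survive in this copy of the text, so the cost below assumes the standard binary cross-entropy (log loss) commonly used with logistic regression; treat it as an assumption rather than the authors' exact formula.

```python
import math

def sigmoid(z):
    """Logistic sigmoid S(z) = 1 / (1 + e^(-z)), written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_class(p, threshold=0.5):
    """Decision boundary: class 1 if p >= threshold, otherwise class 0."""
    return 1 if p >= threshold else 0

def cross_entropy_cost(probs, labels):
    """Assumed cost: mean of -[y*log(p) + (1-y)*log(1-p)] over all observations."""
    eps = 1e-12  # keeps log() away from zero
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(labels)

print(sigmoid(0.0), predict_class(sigmoid(2.0)), predict_class(sigmoid(-2.0)))
```

For the multiclass case described above, the same binary model would be trained once per class and the class with the highest probability chosen.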
4.1.1 Naïve Bayes
Bayes’ Theorem is stated as:

P(h|d) = (P(d|h) * P(h)) / P(d)
P(h|d) is the probability of hypothesis h given the data d. This is called the posterior
probability.
P(d|h) is the probability of data d given that the hypothesis h was true.
P(h) is the probability of hypothesis h being true (regardless of the data). This is called
the prior probability of h.
P(d) is the probability of the data (regardless of the hypothesis).
We are interested in calculating the posterior probability P(h|d) from the prior probability
P(h) together with P(d) and P(d|h). After calculating the posterior probability for a number of
different hypotheses, we select the hypothesis with the highest probability. This is the most
probable hypothesis and may formally be called the maximum a posteriori (MAP) hypothesis.
This can be written as:
MAP(h) = max(P(h|d)) or
MAP(h) = max((P(d|h) * P(h)) / P(d)) or
MAP(h) = max(P(d|h) * P(h))

P(d) is a normalizing term which allows us to calculate the probability. We can drop it
when we are interested in the most probable hypothesis, as it is constant and only used to
normalize. Back to classification, if we have an equal number of instances in each class in our
training data, then the probability of each class (e.g. P(h)) will be equal. Again, this would be
a constant term in our equation, and we could drop it so that we end up with:
MAP(h) = max(P(d|h))
Naive Bayes is a classification algorithm for binary (two-class) and multi-class classification
problems. The technique is easiest to understand when described using binary or categorical
input values. It is called naive Bayes or idiot Bayes because the calculation of the
probabilities for each hypothesis is simplified to make it tractable. Rather than attempting to
calculate the joint probability of the attribute values P(d1, d2, d3|h), the attributes are
assumed to be conditionally independent given the target value, and the probability is calculated
as P(d1|h) * P(d2|h) and so on. This is a very strong assumption that is most unlikely to hold in
real data, i.e. that the attributes do not interact. Nevertheless, the approach performs
surprisingly well on data where this assumption does not hold.
MAP(h) = max(P(d|h) * P(h))
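As a small worked example of the MAP rule, the sketch below scores two hypotheses by P(d|h) * P(h) and picks the larger. The prior and likelihood values are made up for illustration and do not come from the paper's data.

```python
def map_hypothesis(priors, likelihoods):
    """Return the hypothesis h that maximizes P(d|h) * P(h), plus all unnormalized scores."""
    scores = {h: likelihoods[h] * priors[h] for h in priors}
    best = max(scores, key=scores.get)
    return best, scores

# Made-up values: P(h) and P(d|h) for one observed data item d.
priors = {"disease": 0.3, "no disease": 0.7}
likelihoods = {"disease": 0.8, "no disease": 0.2}

best, scores = map_hypothesis(priors, likelihoods)
# Dividing by P(d) = sum of the scores recovers the posterior, but does not change the winner.
posterior = scores[best] / sum(scores.values())
print(best, round(posterior, 3))
```

Here the scores are 0.24 and 0.14, so "disease" is selected even though its prior is lower.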
Gaussian Naïve Bayes:

For real-valued inputs, the mean and standard deviation of each input variable are estimated
for each class. The mean is calculated as:
mean(x) = 1/n * sum(x)
Where n is the number of instances and x are the values for an input variable in your training
data. We can calculate the standard deviation using the following equation:
standard deviation(x) = sqrt(1/n * sum((xi - mean(x))^2))
This is the square root of the average squared difference of each value of x from the mean
value of x, where n is the number of instances, sqrt() is the square root function, sum() is the
sum function, xi is a specific value of the x variable for the i'th instance, mean(x) is
described above, and ^2 is the square. When making predictions, these parameters are plugged
into the Gaussian PDF with a new input for the variable, and in return the Gaussian PDF
provides an estimate of the probability of that new input value for that class.
pdf(x, mean, sd) = (1 / (sqrt(2 * PI) * sd)) * exp(-((x - mean)^2 / (2 * sd^2)))
Where pdf(x) is the Gaussian Probability Density Function (PDF), sqrt() is the square root,
mean and sd are the mean and standard deviation calculated above, PI is the numerical
constant, exp() raises the numerical constant e (Euler's number) to a power, and x is the input
value for the input variable.
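The three formulas above can be combined into a minimal Gaussian Naïve Bayes classifier. This is a sketch, not the authors' implementation: the `fit` and `predict` helpers and the two-feature sample data are hypothetical, and the code assumes every feature has a non-zero standard deviation within each class.

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def standard_deviation(xs):
    """Square root of the average squared difference from the mean (divides by n)."""
    m = mean(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs))

def gaussian_pdf(x, mu, sd):
    return (1.0 / (math.sqrt(2 * math.pi) * sd)) * math.exp(-((x - mu) ** 2) / (2 * sd ** 2))

def fit(rows, labels):
    """For each class, store its prior and the (mean, sd) of every feature."""
    model = {}
    for label in set(labels):
        class_rows = [r for r, y in zip(rows, labels) if y == label]
        stats = [(mean(col), standard_deviation(col)) for col in zip(*class_rows)]
        model[label] = (len(class_rows) / len(rows), stats)
    return model

def predict(model, row):
    """MAP rule: choose the class that maximizes P(h) * product of P(d_i|h)."""
    scores = {}
    for label, (prior, stats) in model.items():
        score = prior
        for x, (mu, sd) in zip(row, stats):
            score *= gaussian_pdf(x, mu, sd)
        scores[label] = score
    return max(scores, key=scores.get)

# Made-up training data: two features per row, class 0 or 1.
rows = [[170, 1.0], [165, 0.5], [160, 0.8], [120, 2.5], [125, 3.0], [130, 2.0]]
labels = [0, 0, 0, 1, 1, 1]
model = fit(rows, labels)
print(predict(model, [168, 0.7]), predict(model, [122, 2.6]))
```

With many features the product of small densities can underflow; summing logarithms instead of multiplying is a common remedy.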
4.1.2 K-Nearest Neighbour
We can implement a KNN model by following the below steps:
1. Load the data

2. Initialize the value of k
3. To get the predicted class, iterate from 1 to the total number of training data points:
• Calculate the distance between the test data and each row of training data. Here
we use Euclidean distance as our distance metric since it is the most popular
method. Other metrics that can be used are Chebyshev, cosine, etc.
• Sort the calculated distances in ascending order based on distance values
• Get the top k rows from the sorted array
• Get the most frequent class of these rows
• Return the predicted class
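The steps above can be sketched directly in code. The function names and the two-dimensional sample points are hypothetical; ties between classes are resolved by whichever class `Counter` encounters first, which is an implementation detail the paper does not specify.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_rows, train_labels, test_row, k=3):
    # Distance from the test row to every training row.
    distances = [(euclidean(row, test_row), label)
                 for row, label in zip(train_rows, train_labels)]
    # Sort in ascending order of distance and keep the k nearest.
    distances.sort(key=lambda pair: pair[0])
    top_k = [label for _, label in distances[:k]]
    # The most frequent class among the k nearest neighbours is the prediction.
    return Counter(top_k).most_common(1)[0][0]

# Made-up training data: two well-separated groups.
train_rows = [[1, 1], [1, 2], [2, 1], [8, 8], [9, 8], [8, 9]]
train_labels = [0, 0, 0, 1, 1, 1]
print(knn_predict(train_rows, train_labels, [2, 2]),
      knn_predict(train_rows, train_labels, [7, 9]))
```

Because Euclidean distance is sensitive to scale, attributes measured in different units are usually normalized before KNN is applied.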
5. RESULT:
Using Naïve Bayes, the maximum accuracy, precision and recall obtained on the Cleveland dataset are 85%, 86% and
86% respectively. On the Hungarian dataset they are 80%, 82% and 81% respectively, and on the VA Long Beach
dataset they are 77%, 76% and 78% respectively. Next, we applied the Naïve Bayes algorithm to all three
datasets combined; the maximum accuracy, precision and recall obtained are 85%, 86% and 86% respectively.
After using the Naïve Bayes classification algorithm on the three datasets, we applied the Logistic Regression
algorithm to them. The maximum accuracy, precision and recall obtained on the Cleveland dataset are 89%, 90%
and 89% respectively. On the Hungarian dataset they are 82%, 83% and 82% respectively, and on the VA Long Beach
dataset they are 77%, 76% and 78% respectively. We then applied the Logistic Regression algorithm to all three
datasets combined; the maximum accuracy, precision and recall obtained are 84%, 87% and 87% respectively. The
overall comparison of features in the Cleveland dataset is shown in Figure 3. The accuracy, precision and
recall on all datasets for Naïve Bayes and Logistic Regression are shown in Table 5 and Table 6, and the
confusion matrices for Naïve Bayes and Logistic Regression follow Table 6.
Figure3: overall comparison of the features in the Cleveland dataset

Table 4: Numeric description of attributes
Statistic | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target
count | 303 | 303 | 303 | 303 | 303 | 303 | 303 | 303 | 303 | 303 | 303 | 303 | 303 | 303
mean | 54.76238 | 0.683168 | 0.966997 | 131.6238 | 246.264 | 0.148515 | 0.528053 | 149.6469 | 0.326733 | 1.039604 | 1.39934 | 0.7239373 | 2.313531 | 0.54554
std | 8.475694 | 0.466011 | 1.032052 | 17.53814 | 51.83075 | 0.356198 | 0.52586 | 22.90516 | 0.469794 | 1.161075 | 0.616226 | 1.022606 | 0.612277 | 0.498835
min | 34 | 0 | 0 | 94 | 126 | 0 | 0 | 71 | 0 | 0 | 0 | 0 | 0 | 0
25% | 48 | 0 | 0 | 120 | 211 | 0 | 0 | 133.5 | 0 | 0 | 1 | 0 | 2 | 0
50% | 55 | 1 | 1 | 130 | 240 | 0 | 1 | 153 | 0 | 0.8 | 1 | 0 | 2 | 1
75% | 61 | 1 | 2 | 140 | 274.5 | 0 | 1 | 166 | 1 | 1.6 | 2 | 1 | 3 | 1
max | 77 | 1 | 3 | 200 | 564 | 1 | 2 | 202 | 1 | 6.2 | 2 | 4 | 3 | 1
Table 5: Naïve Bayes descriptive analysis
Dataset | Accuracy | Precision | Recall
Cleveland | 85% | 86% | 86%
Hungarian | 80% | 82% | 81%
VA Long Beach | 77% | 76% | 78%
Combined | 85% | 86% | 86%
Table 6: Logistic Regression descriptive analysis
Dataset | Accuracy | Precision | Recall
Cleveland | 89% | 90% | 89%
Hungarian | 82% | 83% | 82%
VA Long Beach | 77% | 76% | 78%
Combined | 84% | 87% | 87%
CONFUSION MATRIX:
Confusion matrix for Logistic Regression:
The accuracy score for Logistic Regression is: 85.25%
Precision for Logistic Regression is: 0.85
Recall for Logistic Regression is: 0.88
F-score for Logistic Regression is: 0.86
Confusion matrix for Naive Bayes:
The accuracy score for Naive Bayes is: 83.47%
Precision for Naive Bayes is: 0.83
Recall for Naive Bayes is: 0.91
F-score for Naive Bayes is: 0.87
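The scores listed above all derive from the four cells of a binary confusion matrix. The matrices themselves are not reproduced in this copy of the text, so the counts in the sketch below are made up; only the final two lines use numbers reported in the paper, to show that the stated F-scores follow from the stated precision and recall.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F-score from binary confusion matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return accuracy, precision, recall, f_score(precision, recall)

def f_score(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts, for illustration only.
accuracy, precision, recall, f1 = classification_metrics(tp=40, fp=10, fn=5, tn=45)
print(round(accuracy, 2), round(precision, 2), round(recall, 2), round(f1, 2))

# Consistency check against the precision and recall values reported above.
print(round(f_score(0.85, 0.88), 2), round(f_score(0.83, 0.91), 2))
```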
5.1 COMPARATIVE ANALYSIS OF ALGORITHMS USED USING GRAPHS
For the Cleveland dataset, the accuracy of the Naïve Bayes model was measured using a (90, 10) division of the
data into training and testing sets. For the Hungarian dataset the division was (75, 25), and for the VA Long
Beach dataset it was (60, 40). Learning curves generated in our system were used to determine the ratio of
training to testing data that gives the best accuracy.
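A division such as (90, 10) can be produced by shuffling the records and cutting the list at the chosen fraction. The sketch below is one possible way to do this; the function name, the fixed seed and the placeholder records are assumptions, and the paper does not describe how its splits were drawn.

```python
import random

def train_test_split(rows, train_fraction, seed=0):
    """Shuffle a copy of the rows and split it into (train, test) at the given fraction."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

rows = list(range(100))  # placeholder records
for fraction in (0.90, 0.75, 0.60):
    train, test = train_test_split(rows, fraction)
    print(fraction, len(train), len(test))
```

Repeating the split for several fractions and recording the accuracy at each one gives the points of a learning curve.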
Fig 4:Precision
Fig 5: Recall
Fig 6 :Naive Bayes Accuracy

Fig 7: COMPARATIVE ANALYSIS OF ALGORITHMS USED USING GRAPHS
6. CONCLUSION
The objective of our work is to provide a study of combined data mining techniques that can be used in
automated heart disease prediction when the user inputs the data. Various techniques and classifiers developed
in the past few years for effective and efficient heart disease diagnosis are discussed throughout this work.
The analysis shows that different techniques are used across the papers, each taking a different number of
attributes.
This paper describes techniques that can be used for classification, together with the mathematical methods
behind them, in order to build a software solution with the best possible prediction accuracy. We used the
Naïve Bayes classifier and Logistic Regression to determine whether the patient whose values the user has
entered is suffering from heart disease. We used three datasets, on the basis of which we trained the
classifiers. The user's input is compared against the trained classifier and a confusion matrix is created for
each technique; the final confusion matrix is then created and the result is displayed on the screen, showing
whether the patient has heart disease and, if so, at what level, i.e. severe, mild, or no heart disease.
7. REFERENCES
[1] Dua, D. and Karra Taniskidou, E. (2017). UCI Machine Learning Repository
[http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and
Computer Science.
[2] Rajkumar, A. and G.S. Reena, Diagnosis Of Heart Disease Using Data Mining Algorithm. Global Journal of
Computer Science and Technology, 2010. Vol. 10 (Issue 10).
[3] Kelley, Deeanna. "Heart disease: Causes, prevention, and current research." JCCC Honors Journal 5.2 (2014):
[4] Yan, H., et al., Development of a decision support system for heart disease diagnosis using multilayer
perceptron. Proceedings of the 2003 International Symposium on, 2003. vol. 5: pp. V-709-V-712.
[5] Sitar-Taut, V.A., et al., Using machine learning algorithms in cardiovascular disease risk evaluation.
Journal of Applied Computer Science & Mathematics, 2009.
[6] Heller, R.F., et al., How well can we predict coronary heart disease? Findings in the United Kingdom Heart
Disease Prevention Project. British Medical Journal, 1984.
[7] Babič, František, et al. "Predictive and descriptive analysis for heart disease diagnosis." Computer Science and In-
formation Systems (FedCSIS), 2017 Federated Conference on. IEEE, 2017.
[8] L. Verma, S. Srivastava, and P.C. Negi, "A Hybrid Data Mining Model to Predict Coronary Artery Disease Cases
Using Non-Invasive Clinical Data", Journal of Medical Systems, vol. 40, no. 178, 2016, doi: 10.1007/s10916-016-0536-z.
[9] Miss. Chaitrali S. Dangare, Dr.Mrs.Sulabha S. Apte, A data mining approach for prediction of heart disease using
neural networks, international journal of computer engineering and technology, 2012.
[10] N. AdityaSundar, P. PushpaLatha, M. Rama Chandra,performance analysis of classification data mining tech-
niques over heart diseases database, international journal of engineering science and advanced technology, 2012.
[11] Shadab Adam Pattekari and AsmaParveen, prediction system for heart disease using naïve bayes, International
Journal of Advanced Computer and Mathematical Sciences, 2012.
[12] LathaParthiban and R.Subramanian, Intelligent Heart Disease Prediction System using ANFIS and Genetic Algo-
rithm, International Journal of Biological and Medical Sciences, 2008.
[13] JesminNahar, Tasadduq Imam, Kevin S. Tickle, Yi-Ping Phoebe Chen, Association rule mining to detect factors
which contribute to heart disease in males and females, Elsevier, 2013.
[14] Nada Lavrac, Selected techniques for data mining in medicine, Elsevier, 1999.
[15] Tanawut Tantimongcolwat, Thanakorn Naenna, Identification of ischemic heart disease via machine learning
analysis on Magnetocardiography, Elsevier, 2008.
[16] Resul Das, Ibrahim Turkoglu, AbdulkadirSengur, Effective diagnosis of heart disease through neural networks
ensembles, Elsevier, 2009.
[17] Resul Das, Ibrahim Turkoglu, AbdulkadirSengur Diagnosis of valvular heart disease through neural networks
ensembles, Elsevier, 2009.
[18] Oleg Yu. Atkov, Coronary heart disease diagnosis by artificial neural networks including genetic polymor-
phisms and clinical parameters, Elsevier, 2012.
[19] Marcel A.J. van Gerven, Predicting carcinoid heart disease with the noisy-threshold classifier, Elsevier, 2007.
[20] Matjaz' Kukar, Analysing and improving the diagnosis of ischaemic heart disease with machine learning, Elsevier,
1999.
[21] HumarKahramanli, NovruzAllahverdi, Design of a hybrid system for diabetes and heart diseases, Elsevier, 2008.
[22] JesminNahar, Tasadduq Imam, Computational intelligence for heart disease diagnosis: A medical knowledge
driven approach, Elsevier, 2013.
[23] Nan-Chen Hsieh &Lun-Ping Hung & Chun-Che Shih, Intelligent Postoperative Morbidity Prediction of Heart
Disease Using Artificial Intelligence Techniques, J Med Syst, 2012.
[24] Adebayo Peter Idowu, Data Mining Techniques for Predicting Immunize-able Diseases: Nigeria as a Case
Study, International Journal of Applied Information Systems, 2013.
[25] MonaliDey, SiddharthSwarupRautaray, Study and Analysis of Data mining Algorithms for Healthcare Decision
Support System, International Journal of Computer Science and Information Technologies(2014).
