
Project on

HEART DISEASE PREDICTION USING
VARIOUS CLASSIFICATION ALGORITHMS
AND THEIR COMPARATIVE ANALYSIS
by

TEAM:
Reg. no Name
16BCE0247 Siddharth Shah
16BCE0460 Abhinav Kumar
16BCE2027 Sunny
16BCE2150 Suryansh Singh

Submitted to:

Professor: Dr. Ilanthenral Kandasamy

Report submitted for the


Project Review of

Course Code: CSE3013 – Artificial Intelligence


Slot: B2 + TB2
1. Abstract:
Supervised and unsupervised learning are becoming increasingly popular, and a growing number
of classification and clustering algorithms are in common use. Such algorithms are trained on
a given dataset so that they can recognise new data and assign it to the best-suited class or
cluster; they can also be used to predict outcomes from given data. We plan to use some of the
classification algorithms of supervised learning to predict heart disease and to find the best
algorithm for this purpose.

In this project, we implement various classification algorithms, namely KNN, K-means, SVM
and logistic regression. We plan to use an open-source dataset on heart disease, available on
Kaggle. This dataset contains 14 attributes, each with a numeric value.

First, we will implement each of these algorithms separately and train it on this dataset. We
will then use a test dataset to evaluate the algorithms (KNN, K-means, SVM and logistic
regression). We will measure the accuracy and efficiency of each algorithm and compare them to
find the best-suited algorithm for this purpose.

We will calculate precision, recall, specificity and F-measure, and use these measures to
compare the algorithms. Our intention in this project is to find the best-suited of the given
algorithms for predicting heart disease.

Keywords: classification, F-measure, K-NN, K-means, logistic regression, precision, recall,
supervised learning, SVM

2. Introduction:
“What we lack in knowledge, we make up with data.”

This quotation aptly emphasises the growing popularity of data science. Classification and
clustering of data are an important part of the data analysis process. There are broadly two
approaches to them: supervised learning and unsupervised learning. One fundamental difference
between the two is the use of labelled sample data as a training reference for further
classification. In supervised learning, these samples define the classes and are used to train
the algorithm, whereas in unsupervised learning the algorithm simply groups the data into
clusters based on some attributes. We will use KNN, K-means, SVM and logistic regression as
classification algorithms in our project.

KNN (K-nearest neighbours) is a classification algorithm which classifies data based on
distance; Euclidean distance is used here. The distance from the point to be classified to
every training point is calculated using the Euclidean formula. These distances identify the
k nearest neighbours of the point and the classes they belong to. The point is then placed in
the most frequent class among those neighbours.
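
The KNN procedure described above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the project's implementation: the function names `euclidean` and `knn_predict`, the value k=3 and the two-feature sample points are all assumptions made up for the example.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Sort the training points by distance to the query and keep the k closest.
    neighbours = sorted(
        zip(train_X, train_y),
        key=lambda pair: euclidean(pair[0], query),
    )[:k]
    # The predicted class is the most frequent label among those neighbours.
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Made-up two-feature points for illustration only (not the heart dataset).
train_X = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (7.0, 7.5), (6.5, 8.0)]
train_y = [0, 0, 0, 1, 1, 1]

print(knn_predict(train_X, train_y, (1.2, 1.4)))
print(knn_predict(train_X, train_y, (6.8, 7.0)))
```

Only the distances from the query point to the training points are needed, so no full pairwise distance matrix is built.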

Support Vector Machine (SVM) is a supervised machine learning algorithm which can be used for
both classification and regression problems. It is, however, mostly used for classification.
In this algorithm, we plot every data item as a point in n-dimensional space (where n is the
number of features), with the value of each feature being the value of a particular
coordinate. We then perform classification by finding the hyperplane that separates the two
classes best. Support vectors are the coordinates of the individual observations that lie
closest to this hyperplane. The Support Vector Machine is the frontier (hyperplane/line) which
best segregates the two classes.
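
As a rough illustration of finding a separating hyperplane, the sketch below trains a linear SVM by batch sub-gradient descent on the regularised hinge loss. It is an assumption-laden toy, not the method used in this project: the function names, the hyperparameters (`lam`, `lr`, `epochs`), the -1/+1 label encoding and the six sample points are all invented for the example. A library implementation (for instance scikit-learn's `SVC`) would normally be used in practice.

```python
def train_linear_svm(X, y, lam=0.01, lr=0.01, epochs=200):
    """Batch sub-gradient descent on the L2-regularised hinge loss.

    X is a list of feature tuples; y holds labels -1 or +1.
    Returns the weights w and bias b of the hyperplane w.x + b = 0.
    """
    n, dim = len(X), len(X[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        sum_yx = [0.0] * dim   # sum of y_i * x_i over margin violators
        sum_y = 0.0            # sum of y_i over margin violators
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:     # inside the margin or on the wrong side
                for j in range(dim):
                    sum_yx[j] += yi * xi[j]
                sum_y += yi
        # One descent step on lam/2 * |w|^2 + mean hinge loss.
        w = [wj - lr * (lam * wj - sj / n) for wj, sj in zip(w, sum_yx)]
        b = b + lr * sum_y / n
    return w, b

def svm_predict(w, b, x):
    """Return +1 or -1 depending on which side of the hyperplane x lies."""
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if score >= 0 else -1

# Made-up, linearly separable points for illustration only.
X = [(-2.0, -1.0), (-1.0, -2.0), (-2.0, -2.0), (2.0, 1.0), (1.0, 2.0), (2.0, 2.0)]
y = [-1, -1, -1, 1, 1, 1]

w, b = train_linear_svm(X, y)
print([svm_predict(w, b, x) for x in X])
```

The training points that end up with a margin close to 1 are the ones that determine the hyperplane, which is the role the text assigns to support vectors.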

In the K-means algorithm, we start by assuming random means; the number of assumed means
equals the number of classes (K). Each data point is assigned to the class whose mean is
nearest to it. Once all points are assigned, the mean of each class is recalculated, and the
process is repeated iteratively until the classes no longer change.
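
The assign-and-update loop of K-means can be sketched as follows. This is a minimal sketch under stated assumptions: the function names and sample points are hypothetical, and the initial means are fixed by hand (rather than chosen at random, as the text describes) so that the run is reproducible.

```python
def squared_distance(a, b):
    """Squared Euclidean distance; enough for finding the nearest mean."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, initial_means, max_iter=100):
    """Alternate assignment and mean-update steps until the means stop moving."""
    means = [tuple(m) for m in initial_means]
    clusters = []
    for _ in range(max_iter):
        # Assignment step: put each point in the cluster of its nearest mean.
        clusters = [[] for _ in means]
        for p in points:
            nearest = min(range(len(means)),
                          key=lambda i: squared_distance(p, means[i]))
            clusters[nearest].append(p)
        # Update step: recompute each mean (an empty cluster keeps its old mean).
        new_means = [
            tuple(sum(coord) / len(members) for coord in zip(*members))
            if members else old
            for old, members in zip(means, clusters)
        ]
        if new_means == means:   # converged: assignments can no longer change
            break
        means = new_means
    return means, clusters

# Made-up two-feature points for illustration only.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (7.0, 7.5), (6.5, 8.0)]
means, clusters = kmeans(points, initial_means=[(1.0, 1.0), (6.0, 6.0)])
print(means)
print(clusters)
```

Note that the cluster indices produced this way are arbitrary; mapping each cluster to the class label 0 or 1 is a separate step that this sketch does not cover.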

Logistic regression fits the data with a sigmoidal (logistic) curve rather than a line and
outputs an approximation of the probability of the output given the input. It uses the value
of a regression line, passed through the logistic function, as that probability approximation.
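
To make the idea concrete, the sketch below passes a linear score through the sigmoid and fits the weights by plain gradient descent on the log-loss. It is a minimal sketch, not the project's code: the function names, learning rate, epoch count and the one-feature sample data are assumptions chosen only for illustration.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real score to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(w, b, x):
    """Estimated probability that x belongs to class 1."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

def train_logistic(X, y, lr=0.1, epochs=500):
    """Fit weights w and bias b by batch gradient descent on the mean log-loss."""
    n, dim = len(X), len(X[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for xi, yi in zip(X, y):
            # Derivative of the log-loss with respect to the linear score.
            error = predict_proba(w, b, xi) - yi
            for j in range(dim):
                grad_w[j] += error * xi[j] / n
            grad_b += error / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

# Made-up one-feature data for illustration only: negative x -> 0, positive x -> 1.
X = [(-2.0,), (-1.0,), (1.0,), (2.0,)]
y = [0, 0, 1, 1]

w, b = train_logistic(X, y)
print(predict_proba(w, b, (1.5,)), predict_proba(w, b, (-1.5,)))
```

A record is assigned to class 1 when the estimated probability exceeds a threshold, commonly 0.5.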

Objectives of the project

Objective-1 To use classification algorithms of supervised learning to predict heart disease.

Objective-2 To implement various classification algorithms, namely KNN, K-means, SVM and
logistic regression.

Objective-3 To find the accuracy and efficiency of each algorithm and compare them.

Objective-4 To calculate precision, recall, specificity and F-measure, and to use these
measures to compare the algorithms and find the best-suited one for this purpose.

Objective-5 To be able to tell the best algorithm to predict the classification for a given
dataset.

3. Literature Review Summary Table


A Support Vector Machine (SVM) and an integer-coded genetic algorithm (GA) were used for heart
disease classification by Sumit Bhatia, Praveen Prakash and G.N. Pillai [1]. The Simple Support
Vector Machine (SSVM) algorithm was used to determine the support vectors in a fast, iterative
manner. An integer-coded genetic algorithm was applied to the Cleveland heart disease database
to select the important and relevant features and discard the irrelevant and redundant ones.

Prediction of heart disease using a neural network was proposed by Dangare et al. [2]. Feature
selection is used to predict the disease. Their method obtained an accuracy of 92.5% with 13
features and 100% with 15 features, a difference of 7.5 percentage points between the
13-feature and 15-feature models.

Jabbar et al. proposed a method using associative classification and feature subset selection
for the risk score of heart disease [3]. The authors used information gain, symmetrical
uncertainty and a genetic algorithm as feature selection measures. Their method obtained an
accuracy of 95% with hybrid feature selection. A heart disease dataset with 11 features was
collected for the experimental analysis.

Assessment of coronary heart event risk factors was proposed by Karaolis et al. [5]. The
authors investigated two types of risk factors, modifiable and non-modifiable. 528 samples
were collected and data mining analysis was done using C4.5. The highest accuracy obtained by
their model was 75% for the PCI and CABG models. The authors used the C4.5 classifier without
feature selection measures. The accuracy obtained by this approach is lower than that of the
other approaches.

Diagnosis of heart disease using regression trees was proposed by Amir [6]. The authors
collected a dataset of 116 heart sound signals and applied a regression tree. Their model is
proposed to classify phonocardiogram (PCG) data. The authors calculated a likelihood ratio to
classify the disease. Their method obtained an accuracy of 99%. In 2014, the authors of [7]
proposed a framework to predict coronary heart disease using a multilayer perceptron. Their
method uses 13 clinical features as input and achieved an accuracy of 98%. The literature
reviewed here has not used effective feature selection measures to improve accuracy, and the
authors used weak classifiers to predict the disease. In this paper, we integrated PSO with a
KNN classifier to obtain effective results.

| Authors and Year (Reference) | Title (Study) | Concept / Theoretical model / Framework | Methodology used / Implementation | Dataset details / Analysis | Relevant Finding | Limitations / Future Research / Gaps identified |
|---|---|---|---|---|---|---|
| [1] | SVM Based Decision Support System for Heart Disease Classification | Classification using simple SVM | SVM techniques | Heart disease dataset | Prediction of heart disease | Less accuracy in some cases |
| [2] | A data mining approach for predicting heart disease using neural networks | Feature selection | Neural networks | Heart disease dataset | Prediction of heart disease | More complex algorithm |
| [3] | Prediction of risk score for heart disease using associative classification and hybrid feature subset selection | Associative classification and feature subset selection | Information gain, symmetrical uncertainty | Heart disease dataset | Associative prediction of heart disease | Better approach can be used |
| [5] | A Hybrid Data Mining Model to Predict Coronary Artery Disease Cases Using Non-Invasive Clinical Data | Data mining analysis using C4.5 | PCI and CABG models | 528 samples related to the diseases | Prediction of coronary artery disease | Methods can be generalised to be used for any heart disease prediction. |
| [6] | Early diagnosis of heart disease using classification and regression trees | Classification and regression trees | Classification and regression trees | 116 heart sound signals data sets | Early prediction of heart diseases using the obtained trees | Methods can be integrated with other approaches for better results. |

4. Proposed work and implementation


Methodology adopted:
The main task in this project is to perform supervised learning on the selected dataset (in the
form of a CSV file). To achieve this, we split the dataset into a training set and a test set,
use the training set to train our algorithms and use the test set to analyse the results.
The main steps in the methodology are:

i. Selection of a proper dataset- A dataset consisting of various attributes that affect the
heart and can lead to heart disease, with classification into two classes, 0 or 1
(representing no or yes).
ii. Splitting the dataset- The next task is to split the dataset into training and testing
sets. We select 75% of the data to train the algorithms and the rest of the data for testing.
iii. Training the algorithms- We have selected a few algorithms (k-NN, SVM, logistic
regression, K-means clustering) and we use the training set to train these algorithms.
iv. Testing of algorithms- In this step, we compute each algorithm's predictions for all the
records in the testing dataset.
v. Calculation of accuracy- After this, we measure the performance of all the algorithms
using precision, recall, F-measure etc. We compare the results for all the algorithms and
find the best-suited algorithm for classification of the given dataset.
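
The splitting step (75% training, 25% testing) can be sketched as below. This is a hedged illustration only: the helper name `split_rows`, the fixed random seed and the stand-in list of 100 records are assumptions, not details taken from the project.

```python
import random

def split_rows(rows, train_fraction=0.75, seed=42):
    """Shuffle a copy of the rows and cut it into training and test parts."""
    shuffled = rows[:]
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Stand-in for 100 dataset records (hypothetical; real rows would be feature lists).
rows = list(range(100))
train_rows, test_rows = split_rows(rows)
print(len(train_rows), len(test_rows))
```

Shuffling before cutting avoids a split that merely follows the order in which the records happen to be stored in the file.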

Hardware and software requirements:

i. Python IDE
ii. Heart disease dataset (from Kaggle)
iii. MS Office (for CSV files)
iv. Python libraries (pandas, NumPy and various machine learning libraries)

5. Dataset used / Tools used:


We have selected a heart disease dataset from Kaggle which classifies each record as 0 or 1
based on many attributes. Some of the attributes in the dataset are-

• Age
• Sex
• Cp
• trestbps
• chol
• fbs
• restecg
• thalach
• exang
All these attributes affect heart disease in one way or another, and we will try to classify
each record as 0 or 1 for heart disease based on them.
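
A CSV file with such columns could be read into feature rows and labels roughly as sketched here. Everything specific in this sketch is an assumption: the column name `target`, the helper name `load_rows` and the two sample rows, whose values are invented purely to show the layout and are not taken from the real dataset. Only the attributes listed above are included.

```python
import csv
import io

# Two invented rows that only illustrate the file layout (NOT real records).
SAMPLE_CSV = """age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,target
50,1,2,130,250,0,1,160,0,1
60,0,0,140,300,1,0,120,1,0
"""

def load_rows(csv_text, target_column="target"):
    """Parse CSV text into numeric feature rows and integer class labels."""
    reader = csv.DictReader(io.StringIO(csv_text))
    features, labels = [], []
    for record in reader:
        # Remove the class label first; the remaining columns are the features.
        labels.append(int(record.pop(target_column)))
        features.append([float(value) for value in record.values()])
    return features, labels

features, labels = load_rows(SAMPLE_CSV)
print(features[0], labels[0])
```

With a real file, the text would come from opening the CSV on disk instead of the in-memory sample string.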

6. Results
We trained all the algorithms, tested them against the testing dataset and calculated the
accuracy of each algorithm. To assess each algorithm, we used measures such as precision,
recall and F-measure. Once we had the accuracy for each algorithm, we compared the accuracies,
identified the best-suited algorithm for the given dataset and used that algorithm to predict
the results for new data. We were thus able to tell the best algorithm for predicting the
classification for the dataset.

Results of our findings are shown in the following tables and graphs:

Target         precision   recall   f1-score   support
0              0.73        0.89     0.80       36
1              0.91        0.78     0.84       55
avg / total    0.84        0.82     0.83       91

Table 1: KNN classification
Accuracy: 0.8241758241758241
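
The measures reported in Table 1 follow from the counts of correct and incorrect predictions. The sketch below shows the formulas, treating class 1 as the positive class. The four counts are not stated in the report: they are inferred here from the rounded recall and support values in Table 1 (43 of 55 class-1 records and 32 of 36 class-0 records predicted correctly), so they should be read as a reconstruction. The function name `binary_metrics` is likewise an assumption.

```python
def binary_metrics(tp, fp, tn, fn):
    """Standard binary classification measures from confusion-matrix counts."""
    precision = tp / (tp + fp)        # share of predicted positives that are right
    recall = tp / (tp + fn)           # share of actual positives that are found
    specificity = tn / (tn + fp)      # share of actual negatives that are found
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return {
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
        "accuracy": accuracy,
    }

# Counts inferred from Table 1 (an assumption, see the note above).
knn = binary_metrics(tp=43, fp=4, tn=32, fn=12)
for name, value in knn.items():
    print(name, round(value, 2))
```

Rounded to two decimals, these counts give the class-1 precision, recall and f1-score of Table 1, and the specificity equals the class-0 recall.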

Target         precision   recall   f1-score   support
0              0.78        0.65     0.71       49
1              0.66        0.79     0.72       42
avg / total    0.72        0.71     0.71       91

Table 2: Naïve Bayes
Accuracy: 0.7142857142857143

Target         precision   recall   f1-score   support
0              0.7         0.86     0.78       36
1              0.89        0.76     0.82       55
weighted avg   0.82        0.8      0.8        91

Table 3: SVM
Accuracy: 0.8021978021978022

Target         precision   recall   f1-score   support
0              0.73        0.84     0.78       38
1              0.87        0.77     0.82       53
weighted avg   0.81        0.8      0.8        91

Table 4: Logistic Regression
Accuracy: 0.8021978021978022

Target         precision   recall   f1-score   support
0              0.61        0.43     0.5        63
1              0.23        0.39     0.29       28
weighted avg   0.5         0.42     0.44       91

Table 5: K-Means
Accuracy: 0.4175824175824176

1) Outcome in terms of graphs (result)

[Graph 1: Accuracy comparison (chart title: Accuracy of algorithms). Bars for Naïve Bayes, K-means, K-NN, logistic regression and SVM; axis from 0 to 0.9.]

[Graph 2: Comparison of support. Bars for D T, KNN, K-means, L R, N B and SVM, for classes 1 and 0; axis from 0 to 70.]

[Graph 3: Comparison of precision. Bars for D T, KNN, K-means, L R, N B and SVM, for classes 1 and 0; axis from 0 to 1.]

[Graph 4: Comparison of recall. Bars for N B, L R, K-means, KNN, D T and SVM, for classes 1 and 0; axis from 0 to 1.]

[Graph 5: f1-score of algorithms. Bars for D T, KNN, K-means, L R, SVM and N B, for classes 1 and 0; axis from 0 to 0.9.]


References
[1] S. Bhatia, P. Prakash and G.N. Pillai, SVM based Decision Support System for Heart Disease
Classification with Integer-coded Genetic Algorithm to select critical features, Proceedings of the
World Congress on Engineering and Computer Science, San Francisco, USA, pp.34-38, 2008.

[2] Dangare A. Data mining approach for prediction of heart disease using neural network. IJCET
2012; 3: 30-40.

[3] Jabbar MA, Deekshatulu BL, Priti C. Prediction of risk score for heart disease using associative
classification and hybrid feature selection. IEEE ISDA 2012; 628-634.

[4] Masethe HD, Masathe MA. Prediction of heart disease using classification algorithm. Wcess 2014.

[5] Karaolis. Assessment of risk factor of coronary heart events based on data mining. IEEE
Transactions IT Med 2016; 559-566.

[6] Amir M. Early diagnosis of heart disease using classification and regression trees. University
Caglian Italy 2013; 1-4.

[7] Sonwane J. Prediction of heart disease using multilayer perceptron neural network. ICICES 2014;
1-6.
