Ai Finalreport b2
TEAM:
Reg. no Name
16BCE0247 Siddharth Shah
16BCE0460 Abhinav Kumar
16BCE2027 Sunny
16BCE2150 Suryansh Singh
Submitted to:
In this project, we implement and compare several machine learning algorithms, namely KNN,
K-means, SVM, and logistic regression. We use an open-source dataset on heart disease, available
on Kaggle. The dataset contains 14 attributes, each with a numeric value.
First, we implement each of these algorithms separately and train it on the training portion of
the dataset. We then evaluate the algorithms (KNN, K-means, SVM and logistic regression) on a
held-out test dataset and compare their accuracy and efficiency.
We also calculate precision, recall, specificity and F-measure, and use these measures to
compare the algorithms.
Our aim is to identify which of these algorithms is best suited to predicting heart disease.
2. Introduction:
“What we lack in knowledge, we make up with data.”
This quotation aptly captures the growing popularity of data science. Classification and
clustering of data are important parts of the data analysis process. There are broadly two
approaches to them: supervised learning and unsupervised learning. The fundamental difference
between the two is whether labelled sample data are available as a reference for training.
In supervised learning, these labelled samples define the classes and are used to train the
algorithm. In unsupervised learning, the algorithm simply groups the data into clusters based
on their attributes. We use KNN, K-means, SVM and logistic regression in our project; of these,
K-means is an unsupervised clustering algorithm, while the others are supervised classifiers.
Support Vector Machine (SVM) is a supervised machine learning algorithm which can be
used for both classification and regression problems. In practice, it is mostly used for
classification. In this algorithm, we plot every data item as a point in n-dimensional
space (where n is the number of features) with the value of each feature being
the value of a particular coordinate. We then perform classification by finding the
hyperplane that separates the two classes best. Support vectors are simply the
coordinates of individual observations. The Support Vector Machine is the frontier
(hyperplane/line) which best segregates the two classes.
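The idea of a separating hyperplane can be illustrated with a minimal sketch. It is not the implementation used in this project: it assumes a linear SVM trained by subgradient descent on the regularised hinge loss, labels coded as -1 and +1, toy two-dimensional data, and illustrative values for `lam`, `lr` and `epochs`.

```python
def train_linear_svm(points, labels, lam=0.01, lr=0.1, epochs=200):
    """Fit w and b of the hyperplane w.x + b = 0 (labels must be -1 or +1)."""
    w = [0.0] * len(points[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(points, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:
                # Point lies inside the margin or on the wrong side:
                # shrink w slightly and push the hyperplane away from the point.
                w = [wi - lr * (lam * wi - y * xi) for wi, xi in zip(w, x)]
                b += lr * y
            else:
                # Point is safely classified: only apply the regularisation shrink.
                w = [wi - lr * lam * wi for wi in w]
    return w, b


def predict(w, b, x):
    """Return +1 or -1 depending on the side of the hyperplane x falls on."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1


# Toy, linearly separable data (an assumption for illustration only).
points = [(-2.0, -1.0), (-1.0, -2.0), (-2.0, -2.0), (2.0, 1.0), (1.0, 2.0), (2.0, 2.0)]
labels = [-1, -1, -1, 1, 1, 1]

w, b = train_linear_svm(points, labels)
predictions = [predict(w, b, p) for p in points]
```

The points whose margin ends up close to 1 are the ones that keep triggering updates; they play the role of the support vectors in this sketch.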
In the K-means algorithm, we start from randomly chosen means. The number of means is equal to
the number of clusters (K). Each data point is assigned to the cluster whose mean is nearest
to it. After every point has been assigned, the mean of each cluster is recalculated, and the
assignment step is repeated with the new means. This process is iterated until the cluster
assignments no longer change.
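The assign-and-update loop described above can be sketched as follows. This is a minimal illustration rather than the project's code: the function name `k_means` is hypothetical, the data are toy points, and the initial means are passed in explicitly (instead of being drawn at random) so that the result is reproducible.

```python
import math


def k_means(points, initial_means, max_iter=100):
    """Cluster points around len(initial_means) means; returns (means, labels)."""
    means = [tuple(m) for m in initial_means]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest mean.
        new_labels = [
            min(range(len(means)), key=lambda k: math.dist(p, means[k]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the clusters are final
        labels = new_labels
        # Update step: recompute each mean from the points assigned to it.
        for k in range(len(means)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # an empty cluster keeps its previous mean
                means[k] = tuple(sum(c) / len(members) for c in zip(*members))
    return means, labels


# Two well-separated toy groups (illustrative data only).
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
means, labels = k_means(points, initial_means=[(1.0, 1.0), (9.0, 8.0)])
```

Because K-means never sees the class labels, its cluster indices are arbitrary; comparing them with the true classes requires deciding which cluster corresponds to which class.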
Logistic regression fits the data with a sigmoidal (logistic) curve rather than a straight line
and outputs an estimate of the probability of the output class given the input. The value of
the linear regression function is passed through the sigmoid to obtain this probability estimate.
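As a minimal sketch of this idea, the example below fits a one-feature logistic model by batch gradient descent on the log-loss. The toy data, the function name `fit_logistic`, and the learning rate and epoch count are assumptions chosen for illustration; they are not taken from the project.

```python
import math


def sigmoid(z):
    """Map a linear score to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, learning_rate=0.1, epochs=20000):
    """Fit p(y=1|x) = sigmoid(w*x + b) by batch gradient descent on the log-loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Difference between the predicted probability and the true label.
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Toy data: small x values belong to class 0, large ones to class 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0, 0, 0, 1, 1, 1]

w, b = fit_logistic(xs, ys)
probabilities = [sigmoid(w * x + b) for x in xs]
predictions = [1 if p >= 0.5 else 0 for p in probabilities]
```

Thresholding the probability at 0.5 turns the estimate into a class label; a different threshold trades precision against recall.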
Objective-3 We will find the accuracy and efficiency of each algorithm and compare them.
Objective-4 We will calculate precision, recall, specificity and F-measure, and use these
measures to compare our algorithms.
Objective-5 We will identify the algorithm best suited to classifying the given dataset.
Prediction of heart disease using a neural network was proposed by Dangare et al. [2], with
feature selection used to predict the disease. Their method obtained an accuracy of 92.5% with 13
features and 100% with 15 features, an improvement of 7.5 percentage points when the two
additional features are included.
Jabbar et al. proposed a method using associative classification and feature subset selection
to compute a risk score for heart disease [3]. The authors used information gain, symmetrical
uncertainty, and a genetic algorithm as feature selection measures. Their method obtained an
accuracy of 95% with hybrid feature selection. A heart disease dataset with 11 features was
collected for the experimental analysis.
Assessment of risk factors for coronary heart events was proposed by Karaolis et al. [5]. The
authors investigated two types of risk factors, modifiable and non-modifiable. 528 samples were
collected, and data mining analysis was done using C4.5. The highest accuracy obtained by
their model was 75%, for the PCI and CABG models. The authors used the C4.5 classifier without
feature selection measures, and the accuracy obtained by this approach is lower than that of
the other approaches.
Diagnosis of heart disease using regression trees was proposed by Amir [6]. The authors collected
a dataset of 116 heart sound signals and applied a regression tree. Their model classifies
phonocardiogram (PCG) data, using a likelihood ratio to classify the disease, and obtained an
accuracy of 99%. In 2014, the authors of [7] proposed a framework to predict heart disease
using a multilayer perceptron. Their method uses 13 clinical features as input and achieved an
accuracy of 98%. The literature mentioned in this related work has not used effective feature
selection measures to improve accuracy, and the authors used weak classifiers to predict the
disease. In this paper, we integrated PSO with a KNN classifier to obtain effective results.
Authors and Year (Reference) | Title (Study) | Concept / Theoretical model / Framework | Methodology used / Implementation | Dataset details / Analysis | Relevant Finding | Limitations / Future Research / Gaps identified
[1] | SVM Based Decision Support System for Heart Disease Classification | Classification using simple SVM | SVM techniques | Heart disease dataset | Prediction of heart disease | Less accuracy in some cases
[2] | A data mining approach for predicting heart disease using neural networks | Feature selection | Neural networks | Heart disease dataset | Prediction of heart disease | More complex algorithm
[3] | Prediction of risk score for heart disease using associative classification and hybrid feature subset selection | Associative classification and feature subset selection | Information gain, symmetrical uncertainty | Heart disease dataset | Associative prediction of heart disease | Better approach can be used
i. Selection of a proper dataset- A dataset consisting of various attributes that affect the
heart and can lead to heart disease, with each record classified into one of two classes, 0 or 1
(representing no or yes).
ii. Splitting the dataset- The next task is to split the dataset into training and testing
sets. We select 75% of the data to train the algorithms and the rest of the data for testing.
iii. Training the algorithms- We have selected a few algorithms (k-NN, SVM, logistic
regression, K-means clustering) and we use the training set to train them.
iv. Testing the algorithms- In this step, we calculate each algorithm's predictions for all
the data in the testing set.
v. Calculation of accuracy- After this, we measure the performance of all the algorithms
using accuracy, precision, recall, F-measure etc. We compare the results for all the
algorithms and find the best suited algorithm for classification of the given dataset.
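Steps ii to v can be sketched end to end on toy data. The example below is an illustration under stated assumptions, not the project's code: it uses a hand-written k-NN classifier as the single example algorithm, synthetic two-feature rows instead of the Kaggle dataset, and hypothetical helper names (`train_test_split`, `knn_predict`, `accuracy`).

```python
import math
import random
from collections import Counter


def train_test_split(rows, train_fraction=0.75, seed=0):
    """Shuffle the rows and keep train_fraction of them for training."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def knn_predict(train, features, k=3):
    """Majority vote among the k training rows nearest to features."""
    nearest = sorted(train, key=lambda row: math.dist(row[0], features))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def accuracy(train, test, k=3):
    """Fraction of test rows whose predicted label matches the true label."""
    correct = sum(knn_predict(train, feats, k) == label for feats, label in test)
    return correct / len(test)


# Synthetic rows of the form ((feature_1, feature_2), label); illustrative only.
rows = [((float(i % 3), float(i // 3)), 0) for i in range(8)]
rows += [((10.0 + i % 3, 10.0 + i // 3), 1) for i in range(8)]

train, test = train_test_split(rows)  # 75% train, 25% test
score = accuracy(train, test)
```

On real data the same split would be reused for every algorithm, so that the accuracies being compared come from identical test rows.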
i. Python IDE
ii. Heart disease dataset (from Kaggle)
iii. MS Office (CSV files)
iv. Python libraries (pandas, NumPy and various machine learning libraries)
• Age
• Sex
• Cp
• trestbps
• chol
• fbs
• restecg
• thalach
• exang
All of these attributes affect heart disease in one way or another, and we use them to
classify each record as 0 or 1 (absence or presence of heart disease).
6. Results
We trained all the algorithms and tested them against the testing dataset, then calculated the
accuracy of each algorithm. To assess performance, we also used measures such as precision,
recall and F-measure. Once we had the accuracy for each algorithm, we compared the
accuracies, identified the best suited algorithm for the given dataset, and used that
algorithm to predict the results for new data.
Our findings are shown in the following tables and graphs:
Accuracy 0.8241758241758241
Accuracy 0.7142857142857143
Target | precision | recall | f1-score | support
0 | 0.70 | 0.86 | 0.78 | 36
1 | 0.89 | 0.76 | 0.82 | 55
weighted avg | 0.82 | 0.80 | 0.80 | 91
Table 3: SVM
Accuracy: 0.8021978021978022
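The measures in Table 3 can be recomputed from a confusion matrix. The counts below are not given in the report; they are reconstructed as an assumption, chosen because they are consistent with the supports (36 and 55), the rounded precision and recall values, and the reported accuracy. Specificity, which the report lists as a measure but does not tabulate, is included for completeness.

```python
def class_report(tp, fp, fn, tn):
    """Return (precision, recall, specificity, f1) for one positive class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, specificity, f1


# Assumed counts n<true><predicted>, reconstructed from Table 3 (not from the report):
# true class 0: 31 predicted 0, 5 predicted 1; true class 1: 13 predicted 0, 42 predicted 1.
n00, n01, n10, n11 = 31, 5, 13, 42

report_0 = class_report(tp=n00, fp=n10, fn=n01, tn=n11)  # class 0 as positive
report_1 = class_report(tp=n11, fp=n01, fn=n10, tn=n00)  # class 1 as positive
accuracy = (n00 + n11) / (n00 + n01 + n10 + n11)

# Support-weighted average of recall; this always equals the overall accuracy.
support_0, support_1 = n00 + n01, n10 + n11
weighted_recall = (report_0[1] * support_0 + report_1[1] * support_1) / (support_0 + support_1)
```

With these counts the accuracy is 73/91, and the weighted-average recall coincides with it, which is why the 0.80 in the weighted avg row matches the accuracy line of Table 3.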
Table 5: K-Means
Accuracy: 0.4175824175824176
[Figure: bar charts comparing support, precision, recall and f1-score for target classes 1 and 0 across the series D T, KNN, K MEAN, L R (logistic regression), N B (naïve Bayes) and SVM; the support axis runs from 0 to 70.]
References
[1] S. Bhatia, P. Prakash and G.N. Pillai, SVM based Decision Support System for Heart Disease
Classification with Integer-coded Genetic Algorithm to select critical features, Proceedings of the
World Congress on Engineering and Computer Science, San Francisco, USA, pp. 34-38, 2008.
[2] Dangare A. Data mining approach for prediction of heart disease using neural network. IJCET
2012; 3: 30-40.
[3] Jabbar MA, Deekshatulu BL, Priti C. Prediction of risk score for heart disease using associative
classification and hybrid feature selection. IEEE ISDA 2012; 628-634.
[4] Masethe HD, Masathe MA. Prediction of heart disease using classification algorithm. Wcess 2014.
[5] Karaolis. Assessment of risk factor of coronary heart events based on data mining. IEEE
Transactions IT Med 2016; 559-566.
[6] Amir M. Early diagnosis of heart disease using classification and regression trees. University
Caglian Italy 2013; 1-4.
[7] Sonwane J. Prediction of heart disease using multilayer perceptron neural network. ICICES 2014;
1-6.