International Conference on Energy, Communication, Data Analytics and Soft Computing (ICECDS-2017)

Heart Attack Prediction System


Sushmita Manikandan
Department of Computer Science and Engineering
PES Institute of Technology, Bangalore South Campus
Bengaluru, India

Abstract—Heart attack is one of the most pressing problems of the health care industry. In general, a patient's reports have to be carefully scrutinized by doctors to make a diagnosis of heart failure. This research study is an attempt to reduce the effort and time put in by the doctor by automating the risk prediction with the help of a binary classifier. A prototype implementation of such a system with an easy-to-use user interface is presented in this paper. The graphical user interface is web based, and the Naïve Bayes algorithm was used to build the classifier. The resulting system gave an accuracy score of 81.25%.

Keywords—prediction; data analysis; regression; machine learning.

I. INTRODUCTION

The World Health Organization has stated ischemic and hypertensive heart attack to be one of the prime causes of death. This illness becomes particularly life threatening when not monitored. This work attempts to address the issue by building a prototype of an interactive prediction system that reports the vulnerability of an individual to heart attack, measured as a risk factor. The implementation of this prototype uses the dataset obtained from UCI's machine learning repository, Rapid Miner for cleansing the dataset and Anaconda v2.7 packages to construct the classifier. The results of the prediction are made available to the user through a web interface with an easy-to-use graphical user interface. Rapid Miner was also used to establish the best fitting algorithm for the given dataset: four algorithms, namely Naïve Bayes, Decision Trees, K-Nearest Neighbour and Random Forest, were compared by building their processes in Rapid Miner. The outcome of this activity showed that Naïve Bayes gives the highest accuracy on the given dataset.

II. LITERATURE SURVEY

A. Medical Facts

Available data show that almost 50% of the deaths related to heart and blood vessel diseases are on account of coronary heart disease, which includes heart attack. An estimated 325,000 people die every year due to coronary attack before hospitalization or before receiving emergency medical support.

An estimated 260,000 deaths occur every year among older adults due to congestive heart failure, a major chronic disease.

A cardiac disease can manifest as a stroke or a heart attack and is a growing problem. It contributes significantly to the number of deaths; the World Health Organization states that 1 out of every 5 individuals aged over 40 is susceptible to heart diseases. The introduction is concerned with an in-depth assessment of heart failure risk, aimed at establishing prevention opportunities.

A recent study by the Indian Council of Medical Research (ICMR) and the Registrar General of India (RGI) mentions that about 25% of deaths in the age band of 30-70 years occur because of different heart related problems.

B. Data Mining and Analysis

Machine learning comes under the umbrella of artificial intelligence (AI). It enables computers to learn without being explicitly programmed. Machine learning aims at developing computer programs that can change when exposed to new sets of data. Machine learning algorithms are classified as supervised or unsupervised. The field evolved from the study of pattern recognition and from computational learning theory.

Data mining is the art and science of searching data for patterns and establishing relationships among them. It is concerned with the extraction of data for human comprehension. Data science and machine learning algorithms were used in this research work to perform supervised learning in order to build a binary classifier that predicts the susceptibility of an individual to heart attack in the form of a risk factor.

III. RELATED WORKS

In previous efforts, several state-of-the-art classification models were compared on a dataset obtained from AFIC, Pakistan [1]. Naïve Bayes showed the highest accuracy in this case. Furthermore, SVM has been applied to the dataset of the National Health and Nutrition Examination Survey and obtained an accuracy of 83%. Compared with the other general techniques, SVM, with the highest accuracy, and Boosting stood out as the best methods for prediction [2]. Multiple mathematical models were created and analyzed in the earlier works. Data mining tools have been shown to be of immense help in the process of analysis and prediction. A prototype of an Intelligent Heart Disease Prediction System (IHDPS) was developed which establishes relationships between hidden patterns and heart diseases and makes a prediction [3]. Naïve Bayes was used for this purpose and the results were reported to be highly accurate.

978-1-5386-1887-5/17/$31.00 ©2017 IEEE


In another model, the authors used a clustering technique to obtain 3 clusters using 14 different attributes. SVM, Random Forest, Logistic Regression and similar methods can be used for this purpose [4]. As far as the clustering method is concerned, Logistic Regression proved to be the most accurate.

Among other efforts, a user-interface-based cardiac monitoring system was developed using mobile gateways and monitoring servers [5]. Different patterns for different subgroups of patients were established, and logistic regression was used for this purpose.

IV. PROPOSED SYSTEM

In this study, a prototype of a system comprising a binary classification model that predicts the risk factor of an individual based on his/her medical data is proposed. The system is equipped with a comprehensive graphical user interface that is easy to use and understand. The classification follows supervised learning, and the dataset used was obtained from the University of California, Irvine's machine learning repository. The reference is limited to the Cleveland dataset, which was collected as unstructured data in the form of medical reports and converted to a structured dataset. The dataset comprises 14 attributes in total, of which 13 are predictor variables and one is a binary response variable. Therefore, the dataset represents a binary classification problem. Naïve Bayes was used for the classification process. Finally, an easy-to-use web interface was developed so that the system can be used by persons with little or no technical knowledge, thereby abstracting away the core functionality and implementation details of the system.

V. IMPLEMENTATION

A. Data preprocessing

Rapid Miner was used for pre-processing the data. Rapid Miner is a data science software platform that provides a unified environment for predictive analysis, text mining, deep learning and machine learning. It was used to clean the dataset, i.e., to handle missing values and exclude outliers. The process started with loading the dataset into the design section of Rapid Miner using the "Retrieve" operator, followed by the "Set Role" operator to specify the class label, and finally the "Replace Missing Values" operator to handle missing values by substituting the mean of the other values present for that particular feature in the dataset.

TABLE I. SAMPLE OF THE PROCESSED DATASET

Table 1 illustrates a subset of the dataset that was considered for classification. The final numerical dataset was in .csv format and consisted of the following features:

1. Age
2. Sex
3. Chest pain
4. Blood Pressure at Rest
5. Cholesterol
6. Fasting blood sugar
7. Electrocardiographic results at Rest
8. Peak heart rate achieved
9. Exercise induced angina
10. ST depression induced by exercise relative to rest
11. Slope of the peak exercise ST segment
12. Number of major vessels
13. Thal
14. Num (response variable)

B. Classification

Fig. 1. Classification and accuracy computation
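The flow of this subsection (impute missing values with the column mean, fit a Gaussian Naïve Bayes model on the training portion, then score the test portion) can be sketched from scratch as follows. This is a minimal illustration, not the author's code: the paper used Rapid Miner for imputation and the scikit-learn Gaussian Naïve Bayes implementation. The function names, the `var_floor` parameter and the two-feature toy data are all assumptions made for this example.

```python
import math
from collections import defaultdict


def impute_column_means(rows):
    """Replace None entries with the mean of the values present in that column."""
    means = []
    for column in zip(*rows):
        present = [v for v in column if v is not None]
        means.append(sum(present) / len(present))
    return [[m if v is None else v for v, m in zip(row, means)] for row in rows]


def fit_gaussian_nb(X, y, var_floor=1e-9):
    """Estimate a log prior plus per-feature mean and variance for each class."""
    rows_by_class = defaultdict(list)
    for row, label in zip(X, y):
        rows_by_class[label].append(row)
    model = {}
    for label, rows in rows_by_class.items():
        n = len(rows)
        means = [sum(col) / n for col in zip(*rows)]
        # var_floor (an assumption) keeps every variance strictly positive.
        variances = [
            sum((v - m) ** 2 for v in col) / n + var_floor
            for col, m in zip(zip(*rows), means)
        ]
        model[label] = (math.log(n / len(X)), means, variances)
    return model


def predict(model, row):
    """Return the class with the highest log posterior for one sample."""
    def log_posterior(label):
        log_prior, means, variances = model[label]
        score = log_prior
        for x, m, v in zip(row, means, variances):
            # Log of the normal density for feature value x.
            score += -0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
        return score
    return max(model, key=log_posterior)


# Hypothetical toy data: [age, resting blood pressure]; label 1 = high risk.
raw_train = [[45, 120], [50, None], [39, 118], [62, 160], [67, 155], [58, 150]]
y_train = [0, 0, 0, 1, 1, 1]
X_test = [[41, 122], [64, 158]]
y_test = [0, 1]

X_train = impute_column_means(raw_train)
model = fit_gaussian_nb(X_train, y_train)
predictions = [predict(model, row) for row in X_test]
accuracy = sum(p == t for p, t in zip(predictions, y_test)) / len(y_test)
print(predictions, accuracy)
```

Working in log space avoids numerical underflow when the per-feature densities are multiplied, which matters once all 13 predictor variables are used instead of the two shown here.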

Fig 1 shows the methodology for classification and for subsequently computing the associated metrics. For classification, the Gaussian Naïve Bayes algorithm from the scikit-learn package was used. Scikit-learn bundles multiple data mining tools. The processed dataset was split into training and test data. The training data was converted into a NumPy array and the Gaussian Naïve Bayes algorithm was applied. This training step resulted in a model to which the test data was fed, and the accuracy and other metrics were calculated.

C. User Interface

Fig. 2. Accessing classifier via web interface

Fig 2 shows the flow diagram for using the web interface to obtain the risk factor of the patient. An interactive web interface was developed to access the classifier and check the risk factor. The backend consists of a CGI script written in Python. It passes the medical form data as input to the trained classification model and predicts the risk factor of the individual, i.e., high risk or low risk.

The sequence of events taking place while using the system is depicted below:

Fig. 3. Sequence diagram for the prediction system

Fig 3 shows the sequence diagram of the system. The predictor receives the form data from the web interface and passes it to the classification model. Once the model makes a prediction, it is displayed to the user and simultaneously recorded in the local database.

VI. RESULTS AND ANALYSIS

The working prototype of the system is shown below:

Fig. 4. Predicting low for an individual

Fig. 5. Predicting high for an individual

Figures 4 and 5 show the risk factors, low and high respectively, for two different individuals.

Consider the following confusion matrix:

TABLE II. CONFUSION MATRIX

Table 2 shows the confusion matrix from which the following metrics are calculated. The confusion matrix consists of true negatives (TN), true positives (TP), false negatives (FN) and false positives (FP). The metrics calculated from it are crucial to any classification system.
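The metrics reported in this section are all simple ratios of the four confusion-matrix counts, which the short sketch below makes explicit. The counts used here (TP = 11, TN = 41, FP = 8, FN = 4) are an assumption: the table itself is not reproduced in the text, and these are the smallest integers consistent with the reported accuracy, recall, false positive rate, specificity and precision. The function name is likewise hypothetical.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute standard binary-classification metrics from confusion-matrix counts."""
    total = tp + tn + fp + fn
    recall = tp / (tp + fn)        # true positive rate
    precision = tp / (tp + fp)     # TP over predicted positives
    return {
        "accuracy": (tp + tn) / total,
        "recall": recall,
        "false_positive_rate": fp / (fp + tn),
        "specificity": tn / (tn + fp),   # TN over actual negatives
        "precision": precision,
        # F1 is the harmonic mean of precision and recall.
        "f1": 2 * precision * recall / (precision + recall),
    }


# Assumed counts (inferred from the reported ratios, not taken from the table).
metrics = classification_metrics(tp=11, tn=41, fp=8, fn=4)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Note that specificity and the false positive rate always sum to 1, which gives a quick consistency check on any reported pair of values.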

i. Accuracy = (TP+TN)/total = 0.8125 = 81.25% (1)

ii. True Positive Rate = Recall = TP/(TP+FN) = 0.7333 = 73.33% (2)

iii. False Positive Rate = FP/(FP+TN) = 0.1633 = 16.33% (3)

iv. Specificity = TN/Actual Negatives = TN/(FP+TN) = 0.8367 = 83.67% (4)

v. Precision = TP/Predicted Positives = TP/(TP+FP) = 0.5789 = 57.89% (5)

vi. F-Score = 2/((1/recall)+(1/precision)) = 2/3.0911 = 0.6470 (6)

VII. CONCLUSION

A prototype of a system that classifies an individual by risk factor has been implemented. The dataset was retrieved from UCI's machine learning repository [6] and pre-processed. The final dataset consists of 13 predictor variables (features) and one response variable named num. A num value of 0 implies less than 50% diameter narrowing of the blood vessels, i.e., the prediction is 'low risk individual', and a value of 1 implies more than 50% diameter narrowing, i.e., the prediction is 'high risk individual'. Since the data follows a normal distribution, the Gaussian Naïve Bayes algorithm was used for the classification. Amongst all the state-of-the-art classification algorithms applied on the dataset, Naïve Bayes gave the best accuracy, 81.25%. The classifier can be accessed through a web interface for the convenience of the user.

As a part of the future scope of this system, newly introduced classification algorithms can be used in order to improve the accuracy.

Acknowledgment

I thank the University of California, Irvine for providing the comprehensive dataset that was used for this work. I also extend my sincere thanks to Ms Saraswathi Punagin and Ms Keerti Torvi, Assistant Professors at PESIT, Bangalore South Campus, for their guidance.

References

[1] M. Saqlain, W. Hussain, N. A. Saqib and M. A. Khan, "Identification of heart failure by using unstructured data of cardiac patients," 2016 45th International Conference on Parallel Processing Workshops (ICPPW), Philadelphia, PA, 2016, pp. 426-431. doi: 10.1109/ICPPW.2016.66
[2] Y. Wei, T. Liu, R. Valdez, M. Gwinn and M. J. Khoury, "Application of support vector machine modeling for prediction of common diseases: the case of diabetes and pre-diabetes," BMC Medical Informatics and Decision Making, vol. 10, 2010.
[3] S. Palaniappan and R. Awang, "Intelligent heart disease prediction system using data mining techniques," 2008 IEEE/ACS International Conference on Computer Systems and Applications, Doha, 2008, pp. 108-115. doi: 10.1109/AICCSA.2008.4493524.
[4] M. Hertzong and B. Poezehl, "Cluster analysis of symptom occurrence to identify sub groups of heart failure patients: A pilot study," Journal of Cardiovascular Nursing, vol. 25, pp. 273-283, July/August 2010.
[5] K. Kwon, H. Hwang, H. Kang, K. G. Woo and K. Shim, "A remote cardiac monitoring system for preventive care," 2013 IEEE International Conference on Consumer Electronics (ICCE), Las Vegas, NV, 2013, pp. 197-200. doi: 10.1109/ICCE.2013.6486857.
[6] Dataset obtained from https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data
