Submitted by
B. HARIKRISHNAN
REG. NO.: 18LCA510
I hereby declare that this project, entitled “DETECTION OF CHRONIC HEART DISEASE
FAILURE USING HYBRID MACHINE LEARNING MODEL”, submitted to the Government Arts
College (Autonomous), Coimbatore in partial fulfillment of the requirements for the award of the Degree
of Master of Computer Applications, is a bonafide record of original project work done by me under the
supervision and guidance of Dr. K. ARTHI, MCA, M.Phil., Ph.D., Assistant Professor, Post
Graduate and Research Department of Computer Applications, Government Arts College
(Autonomous), Coimbatore – 641 018.
B. HARIKRISHNAN
(18LCA510)
CERTIFICATE
This is to certify that the project entitled “DETECTION OF CHRONIC HEART DISEASE
FAILURE USING HYBRID MACHINE LEARNING MODEL” is a record of original project
work done by B. HARIKRISHNAN (18LCA510), submitted to Bharathiar University in partial
fulfillment of the requirements for the award of the Degree of MASTER OF COMPUTER
APPLICATIONS.
ACKNOWLEDGEMENT
I take this opportunity to express my deep sense of gratitude and heartfelt thanks to our beloved
Principal, Dr. K. CHITRA, M.Sc., M.Phil., Ph.D., Government Arts College (Autonomous),
Coimbatore, who gave permission to carry out the project and provided all the facilities needed to
complete it successfully.
I also express my sincere and profound thanks to the Head of the Department,
Dr. R. A. ROSELINE, M.Sc., M.Phil., Ph.D., Associate Professor and Head, Post Graduate and
Research Department of Computer Applications, Government Arts College (Autonomous), Coimbatore,
for her valuable support and suggestions in my project work.
I would also like to express my sincere gratitude to my internal guide,
Dr. K. ARTHI, MCA, M.Phil., Ph.D., Assistant Professor, Department of Computer Applications,
who shaped my project into a better one with her valuable guidance towards its completion.
Finally, I express my sincere gratitude to my parents and friends, who supported my career and
without whose sustained support I could not have made my debut in Computer Applications.
ABSTRACT
Heart disease is one of the most significant causes of mortality in the world today. Prediction of
cardiovascular disease is a critical challenge in the area of clinical data analysis. Machine learning (ML)
has been shown to be effective in assisting decision making and prediction from the large quantity
of data produced by the healthcare industry, and ML techniques have been used in recent
developments in different areas of the medical industry. This work proposes a method that aims at
detecting heart disease by applying machine learning classification techniques to the Cleveland heart
disease dataset. Three machine learning algorithms are implemented: 1. Decision Tree, 2. Random
Forest, and 3. a Hybrid model (a hybrid of Decision Tree and Random Forest). Experimental results
show an accuracy level of 88.7% for the heart disease prediction model with the hybrid model. An
interface is designed to take input parameters from the user and predict heart disease using the hybrid
model of Decision Tree and Random Forest.
LIST OF ABBREVIATIONS
ML – Machine learning
CVD – Cardiovascular diseases
CHDD – Cleveland Heart Disease Database
KNN – K-Nearest Neighbor Algorithm
DT – Decision Trees
GA – Genetic algorithm
NB – Naive Bayes
CADSS – Computer Aided Decision Support System
CNN – Convolutional Neural Networks
ECG – Electrocardiogram
DL – Deep learning
TABLE OF CONTENTS
DECLARATION
CERTIFICATE
ACKNOWLEDGEMENT
ABSTRACT
LIST OF ABBREVIATIONS
CHAPTER 1: INTRODUCTION
Overview of the Project
Objective of the Project
CHAPTER 2: LITERATURE SURVEY
Overview of Literature Survey
CHAPTER 3: SYSTEM ANALYSIS
Objective
Existing System
Proposed System
Problem Statement
Requirement Analysis
System Requirement
CHAPTER 4: SYSTEM DESIGN
Architecture Design
UML Diagrams
CHAPTER 5: IMPLEMENTATION
Implementation Methodology
Data Dictionary
Modules
CHAPTER 6: EXPERIMENTAL RESULTS AND ANALYSIS
CHAPTER 7: CONCLUSION AND FUTURE WORK
APPENDIX – A
REFERENCES
APPENDIX – B
1. Source Code
2. Sample Screenshots
CHAPTER 1
INTRODUCTION
Data mining (DM) is the extraction of useful information from large data sets in order to predict
or describe the data using techniques such as classification, clustering, association, etc. Data mining
has found extensive applicability in the healthcare industry, such as in classifying optimum treatment
methods, predicting disease risk factors, and finding efficient cost structures of patient care. Research
using data mining models has been applied to diseases such as diabetes, asthma, cardiovascular
diseases, AIDS, etc. Various data mining techniques such as naïve Bayesian classification, artificial
neural networks, support vector machines, decision trees, logistic regression, etc. have been used to
develop models in healthcare research.
1.1. OVERVIEW OF THE PROJECT
An estimated 17 million people die of cardiovascular diseases (CVD) every year. Although such diseases
are controllable, early prognosis and an evaluation of the patient’s risk are necessary to curb the high
mortality rates they present. Common cardiovascular diseases include coronary heart disease,
cardiomyopathy, hypertensive heart disease, heart failure, etc. Common causes of heart diseases include
smoking, diabetes, lack of physical activity, hypertension, high cholesterol diet, etc.
Research in the field of cardiovascular diseases using data mining has been an ongoing effort involving
prediction, treatment, and risk score analysis with high levels of accuracy. Multiple CVD surveys have
been conducted, the most prominent being the data set from the Cleveland Heart Clinic. The
Cleveland Heart Disease Database (CHDD) has as such been considered the de facto database for heart
disease research. Using the parameters recommended from this database, this paper proposes a framework to
apply logistic regression, support vector machines, and decision trees to attain individual predictions,
which are in turn used in rule-based algorithms. The result of each rule from this system is then compared
on the basis of accuracy, sensitivity, and specificity.
The methodology aims to accomplish two goals: the first is to present a predictive
framework for heart disease, and the second is to compare the efficiency of merging the outcomes of
multiple models as opposed to using a single model.
It is difficult to identify heart disease because of several contributory risk factors such as diabetes, high
blood pressure, high cholesterol, abnormal pulse rate and many other factors. Various techniques in data
mining and neural networks have been employed to find out the severity of heart disease among humans.
The severity of the disease is classified using various methods such as the K-Nearest Neighbour Algorithm
(KNN), Decision Trees (DT), Genetic Algorithm (GA), and Naive Bayes (NB). The nature of heart
disease is complex and hence the disease must be handled carefully. Not doing so may affect the heart
or cause premature death. The perspectives of medical science and data mining are used for discovering
various sorts of metabolic syndromes. Data mining with classification plays a significant role in the
prediction of heart disease and data investigation.
Previously, decision trees have been used to predict events related to heart disease.
Various known data mining methods have been used for knowledge abstraction in the
prediction of heart disease. In this line of work, numerous studies have been carried out to produce
prediction models using not only distinct techniques but also combinations of two or more techniques.
These amalgamated techniques are commonly known as hybrid methods. Neural networks using heart
rate time series have also been introduced. This method uses various clinical records for prediction, such as Left Bundle
Branch Block (LBBB), Right Bundle Branch Block (RBBB), Atrial Fibrillation (AFIB), Normal Sinus
Rhythm (NSR), Sinus Bradycardia (SBR), Atrial Flutter (AFL), Premature Ventricular
Contraction (PVC), and Second-degree Block (BII), to find out the exact condition of the patient in
relation to heart disease. The dataset is classified with a radial basis function network (RBFN),
where 70% of the data is used for training and the remaining 30% is used for classification.
A Computer Aided Decision Support System (CADSS) in the field of medicine and research has also been
introduced. In previous work, the use of data mining techniques in the healthcare industry has been
shown to take less time for the prediction of disease and to give more accurate results [16]. One model proposes the
diagnosis of heart disease using the GA. This method uses effective association rules inferred with the
GA, using tournament selection, crossover and mutation, which results in a new proposed fitness
function. For experimental validation, the model uses the well-known Cleveland dataset, which is
collected from the UCI machine learning repository. The results later prove to
be prominent when compared to some of the known supervised learning techniques. The powerful
evolutionary algorithm Particle Swarm Optimization (PSO) has been introduced to generate rules
for heart disease. The rules are applied randomly with encoding techniques, which results in an
overall improvement in accuracy. Heart disease is predicted based on symptoms, namely pulse rate,
sex, age, and many others. An ML algorithm with neural networks has been introduced, whose results are
reported to be more accurate and reliable. Neural networks are generally regarded as the best tool for
the prediction of diseases like heart disease and brain disease. The proposed method uses 13 attributes
for heart disease prediction, and the results show an enhanced level of performance compared to
existing methods. Carotid Artery Stenting (CAS) has also become a prevalent treatment mode
in the medical field in recent years. CAS prompts the occurrence of major adverse
cardiovascular events (MACE) in elderly heart disease patients, so their evaluation becomes very
important. One model generates results using an Artificial Neural Network (ANN), which produces good
performance in the prediction of heart disease. Neural network methods have been introduced that combine
not only posterior probabilities but also predicted values from multiple predecessor techniques. This
model achieves an accuracy level of up to 89.01%, which is a strong result compared to previous works.
For all experiments, the Cleveland heart dataset is used with a Neural Network (NN) to improve the
performance of heart disease prediction, as discussed previously. Recent developments in
machine learning (ML) techniques have also been used for the Internet of Things (IoT). ML algorithms
applied to network traffic data have been shown to provide accurate identification of IoT devices connected to a
network. Meidan et al. collected and labelled network traffic data from nine distinct IoT devices,
PCs and smartphones. Using supervised learning, they trained a multi-stage meta classifier. In the first
stage, the classifier can distinguish between traffic generated by IoT and non-IoT devices. In the second
stage, each IoT device is associated with a specific IoT device class. Deep learning is a promising
approach for extracting accurate information from raw sensor data from IoT devices deployed in
complex environments. Because of its multilayer structure, deep learning is also appropriate for the edge
computing environment.
1.2. OBJECTIVE OF THE PROJECT
In this work, a technique called the Hybrid Random Forest with Linear Model (HRFLM) is introduced.
The main objective of this research is to improve the accuracy of heart disease prediction.
Many earlier studies restrict feature selection for algorithmic use. In
contrast, the HRFLM method uses all features without any restrictions on feature selection. Experiments
are conducted to identify the features used by a machine learning algorithm with a hybrid
method. The experimental results show that the proposed hybrid method has a stronger capability to predict
heart disease than existing methods.
CHAPTER 2
LITERATURE SURVEY
proposes a classification model which uses random forest as the classifier and chi-square and genetic
algorithm as feature selection measures to predict heart disease. The experimental results have shown that the
approach improves classification accuracy compared to other classification approaches, and the presented
model can be successfully used by healthcare professionals for predicting heart disease.
A Data mining Model for predicting the Coronary Heart Disease using Random Forest Classifier
A. S. Abdullah and R. R. Rajalaxmi, in Proc. Int. Conf. Recent Trends Comput. Methods,
Commun. Controls, Apr. 2012, pp. 22-25.
Coronary Heart Disease (CHD) is a common form of disease affecting the heart and an important cause
for premature death. From the point of view of medical sciences, data mining is involved in discovering
various sorts of metabolic syndromes. Classification techniques in data mining play a significant role in
prediction and data exploration. Classification technique such as Decision Trees has been used in
predicting the accuracy and events related to CHD. In this paper, a Data mining model has been
developed using Random Forest classifier to improve the prediction accuracy and to investigate various
events related to CHD. This model can help the medical practitioners for predicting CHD with its various
events and how it might be related with different segments of the population. The events investigated
are Angina, Acute Myocardial Infarction (AMI), Percutaneous Coronary Intervention (PCI), and
Coronary Artery Bypass Graft surgery (CABG). Experimental results have shown that classification
using Random Forest Classification algorithm can be successfully used in predicting the events and risk
factors related to CHD.
Using PSO algorithm for producing best rules in diagnosis of heart disease
A. H. Alkeshuosh, M. Z. Moghadam, I. A. Mansoori and M. Abdar, "Using PSO Algorithm for
Producing Best Rules in Diagnosis of Heart Disease," 2017 International Conference on Computer
and Applications (ICCA), Doha, 2017, pp. 306-311.
Heart disease is still a growing global health issue. In the health care system, limited human experience
and expertise in manual diagnosis leads to inaccurate diagnosis, and the information about various
illnesses is either inadequate or lacking in accuracy as it is collected from various types of medical
equipment. Since the correct prediction of a person's condition is of great importance, equipping medical
science with intelligent tools for diagnosing and treating illness can reduce doctors' mistakes and
financial losses. In this paper, the Particle Swarm Optimization (PSO) algorithm, which is one of the
most powerful evolutionary algorithms, is used to generate rules for heart disease. First the random rules
are encoded and then they are optimized based on their accuracy using the PSO algorithm. Finally, the
results are compared with the C4.5 algorithm.
Backpropagation neural network for prediction of heart disease
Al-Milli, Nabeel. (2013). Backpropogation neural network for prediction of heart disease. 56. 131-135.
Recently, several software packages, tools and algorithms have been proposed by researchers for
developing effective medical decision support systems. Moreover, new algorithms and new tools
continue to be developed and presented day by day. Diagnosing heart disease is one of the important issues,
and many researchers have worked on developing intelligent medical decision support systems to improve
the ability of physicians. The neural network is a widely used tool for predicting heart disease diagnosis.
In this research paper, a heart disease prediction system is developed using a neural network. The proposed
system uses 13 medical attributes for heart disease prediction. The experiments conducted in this work
have shown the good performance of the proposed algorithm compared to similar state-of-the-art
approaches.
CHAPTER 3
SYSTEM ANALYSIS
3.1. OBJECTIVE
The objective of this project is to predict heart disease with an automated medical diagnosis system based
on machine learning. The system uses a hybrid model that combines the Decision Tree and Random Forest
classifiers, with the aim of improving classification for heart disease prediction.
3.2. EXISTING SYSTEM
One existing study applies a neural network to self-applied questionnaire (SAQ) data to develop
a heart disease prediction system. The work was validated by checking the result
of the neural network against the “Dundee Rank Factor Score”, which statistically relates three risk factors
(blood pressure, smoking and blood cholesterol), together with sex and age, to the risk of having
heart disease. The study used a multi-layered feedforward neural network trained with the
backpropagation algorithm.
Drawbacks
• The study considers not only the common risk factors of the disease but also the other data collected in
the SAQ.
• Mai Shouman et al. worked on the application of k-Nearest-Neighbors (k-NN) in the diagnosis of heart
disease, where k-NN gave higher accuracy.
• Applying integrated voting could not enhance the k-NN accuracy in the diagnosis of heart disease
patients, unlike decision tree classifiers, where voting increases accuracy. Voting is an aggregation
technique used to combine the decisions of multiple classifiers. However, the accuracy of
k-NN with voting reduced to 92.7%.
• G. Purusothaman surveyed and compared different classification techniques for heart disease
prediction. Instead of applying a single model such as a decision tree, artificial neural network or
Naïve Bayes, the authors focus on the working of hybrid models, i.e., models which combine more
than one classification technique.
• However, applying a hybrid model has difficulties.
3.3. PROPOSED SYSTEM
The proposed system predicts heart disease using an automated medical diagnosis system based on
machine learning. A hybrid model is used for the prediction system.
The Cleveland database is used for the heart disease prediction system because it is the
database most commonly used by ML researchers.
The dataset contains 303 instances and 76 attributes, but only 14 of them are referred to by all published
studies.
The "goal" field, which takes values from 0 (absence) to 4, denotes whether heart disease is present in
the patient. Studies on the Cleveland database have focused on distinguishing absence (value 0) from
presence (values 1 to 4).
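The absence/presence distinction described above can be sketched as a simple binarization of the "goal" field. This is a minimal illustration only: the sample values and the variable names goal and target are assumptions, not taken from the project code.

```python
import pandas as pd

# Hypothetical sample of the "goal" field; real values come from the dataset.
goal = pd.Series([0, 2, 1, 0, 4, 3, 0], name="goal")

# Keep 0 (absence) as 0 and collapse values 1 to 4 (presence) into 1.
target = (goal > 0).astype(int)

print(target.tolist())
```

With this encoding the task becomes a two-class problem, which is the form the classifiers in this report are described as solving.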
Advantages
• The proposed heart disease prediction system uses hybrid model.
• Patients can go for treatment based on the report generated
• Helps patient to take preventive measures in advance.
3.4. PROBLEM STATEMENT
Heart disease is one of the prevalent diseases that can reduce the lifespan of human beings
nowadays. It is a disease that affects the function of the heart. An estimate of a person’s risk
for coronary heart disease is important for many aspects of health promotion and clinical medicine. A
risk prediction model may be obtained through multivariate regression analysis of a longitudinal study.
Because digital technologies are growing rapidly, healthcare centres store huge amounts of data in their
databases, which are very complex and challenging to analyse. Data mining techniques and machine learning
algorithms play vital roles in the analysis of different data in medical centres. The techniques and algorithms
can be applied directly to a dataset to create models or to draw vital conclusions and inferences
from the dataset.
3.5. REQUIREMENT ANALYSIS
3.5.1. FUNCTIONAL REQUIREMENTS
Data collection
The data collection process involves the selection of quality data for analysis. The model uses Heart
disease dataset taken from uci.edu for machine learning implementation. The job of a data analyst is to
find ways and sources of collecting relevant and comprehensive data, interpreting it, and analyzing
results with the help of statistical techniques.
Data visualization
A large amount of information represented in graphic form is easier to understand and analyze. Some
companies specify that a data analyst must know how to create slides, diagrams, charts, and templates.
In our approach, the heart disease rates are shown in the data visualization part.
Data preprocessing
The purpose of preprocessing is to convert raw data into a form that fits machine learning. Structured
and clean data allows a data scientist to get more precise results from an applied machine learning model.
The technique includes data formatting, cleaning, and sampling.
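The three preprocessing steps named above (formatting, cleaning, sampling) can be sketched with pandas as follows. The column names, the sample values and the use of "?" as a missing-value marker are assumptions made for illustration; they are not taken from the project code.

```python
import pandas as pd

# Hypothetical raw records; "?" is assumed here to mark a missing value.
raw = pd.DataFrame({
    "age": ["63", "67", "41", "56"],
    "chol": ["233", "?", "204", "236"],
    "target": [0, 1, 0, 1],
})

# Formatting: convert every column to numbers; unparsable entries become NaN.
formatted = raw.apply(pd.to_numeric, errors="coerce")

# Cleaning: drop rows that contain missing values.
cleaned = formatted.dropna().reset_index(drop=True)

# Sampling: draw a reproducible random subset of the cleaned rows.
sample = cleaned.sample(n=2, random_state=0)

print(cleaned)
```

Dropping incomplete rows is the simplest cleaning choice; imputing the missing values would be an alternative when the dataset is small.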
Dataset splitting
A dataset used for machine learning should be partitioned into three subsets: training, test, and
validation sets.
Training set. A data scientist uses a training set to train a model and define the optimal parameters it has
to learn from data.
Test set. A test set is needed for an evaluation of the trained model and its capability for generalization.
The latter means a model’s ability to identify patterns in new, unseen data after having been trained on
training data. It’s crucial to use different subsets for training and testing to avoid model overfitting,
which is the inability to generalize as described above.
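A three-way partition of this kind can be sketched with two calls to scikit-learn's train_test_split. The synthetic data and the 60/20/20 proportions are assumptions for illustration; the report does not state the split ratios it used.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 records with 13 features and a binary label.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 13))
y = np.array([0, 1] * 50)

# First hold out 20% of the records as the test set.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Then split the remainder 75/25 into training and validation sets.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=0
)

print(len(X_train), len(X_val), len(X_test))
```

The stratify argument keeps the class proportions similar in every subset, which matters for small medical datasets.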
Model training
After a data scientist has preprocessed the collected data and split it into train and test sets, model
training can proceed. This process entails “feeding” the algorithm with training data. The algorithm
processes the data and outputs a model that is able to find a target value (attribute) in new data, that is, the answer you
want to get with predictive analysis. The purpose of model training is to develop a model.
Model evaluation and testing
The goal of this step is to develop the simplest model able to formulate a target value fast and well
enough. A data scientist can achieve this goal through model tuning. That’s the optimization of model
parameters to achieve an algorithm’s best performance.
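Model tuning of this kind is often done with a cross-validated grid search. The sketch below uses scikit-learn's GridSearchCV on synthetic data; the parameter grid and the choice of a decision tree are assumptions for illustration, not the tuning procedure used in this project.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a 13-attribute heart disease table.
X, y = make_classification(n_samples=200, n_features=13, random_state=0)

# Candidate parameter values to compare (illustrative choices).
param_grid = {"max_depth": [2, 3, 5], "min_samples_leaf": [1, 5]}

# Evaluate every combination with 5-fold cross-validation.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

Restricting the depth and leaf size favours simpler trees, in line with the goal of the simplest model that still predicts well.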
Non-functional requirements
The following is a list of non-functional requirements. The specific details will need to be defined by
internal stakeholders.
❖ Response Time
❖ Availability
❖ Stability
❖ Maintainability
❖ Usability
3.6. SYSTEM REQUIREMENTS
The system requirements include hardware and software requirements, which are provided below.
3.6.1. HARDWARE REQUIREMENTS
Processor : Any Processor above 500 MHz.
Ram : 4 GB
Hard Disk : 4 GB
Input device : Standard Keyboard and Mouse.
Output device : VGA and High Resolution Monitor.
3.6.2. SOFTWARE SPECIFICATION
Operating System : Windows 7 or higher
Programming : Python 3.6 and related libraries
CHAPTER 4
SYSTEM DESIGN
This chapter gives an overview of the architecture design, the dataset used for implementation, the
algorithms used, and the UML designs.
4.1. ARCHITECTURE DESIGN
Figure 4.1: Architecture Diagram
The figure above represents the architecture of the proposed system, in which all modules of the work are
represented: user input, dataset collection, model training, and prediction.
4.2. UML DIAGRAMS
The design is a plan or drawing produced to show the look and function or workings of an object
before it is made. Unified Modeling Language (UML) is a standardized modeling language enabling
developers to specify, visualize, construct and document artifacts of a software system. Thus, UML
makes these artifacts scalable, secure and robust in execution. UML is an important aspect involved in
object-oriented software development. It uses graphic notation to create visual models of software
systems.
The different types of UML diagram are as follows.
• Use Case Diagram
• Class Diagram
• Activity Diagram
• Sequence Diagram
• Collaboration Diagram
• Component Diagram
• Deployment Diagram
4.2.1. USE CASE DIAGRAM
Figure: Use case diagram (Load Dataset, Apply ML algorithm, Heart Disease Prediction, Find Accuracy)
4.2.2. SEQUENCE DIAGRAM
Figure: Sequence diagram (Load, Input, DT / RF / Hybrid, Prediction)
4.2.3. ACTIVITY DIAGRAM
Figure: Activity diagram (Load dataset, Split dataset, Train model, Plot / analyze)
4.2.4. DEPLOYMENT DIAGRAM
Figure: Deployment diagram (Client, Dataset input, Middleware, Pre-process, Train set, ML, Test input, Server, Predicted output, Trained model)
CHAPTER 5
IMPLEMENTATION
Variable name    Short description    Variable name    Short description
5.3. MODULES
The modules included in our implementation are as follows
❖ Decision Tree
❖ Random forest
❖ Hybrid model
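The text does not spell out how the hybrid model combines the Decision Tree and the Random Forest. One plausible reading, shown here purely as an assumption, is a soft-voting ensemble that averages the class probabilities of the two classifiers. The sketch uses scikit-learn's VotingClassifier on synthetic data and is not the project's Hybrid module.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a 13-attribute heart disease table.
X, y = make_classification(n_samples=300, n_features=13, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Assumed combination rule: average the predicted class probabilities
# of a decision tree and a random forest (soft voting).
hybrid = VotingClassifier(
    estimators=[
        ("dt", DecisionTreeClassifier(max_depth=4, random_state=0)),
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
    ],
    voting="soft",
)
hybrid.fit(X_train, y_train)

accuracy = hybrid.score(X_test, y_test)
print(f"Hybrid accuracy on held-out data: {accuracy:.3f}")
```

Other combination rules, such as stacking or feeding the tree's output to the forest as an extra feature, would be equally consistent with the description given here.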
5.3.1. Decision Tree
A decision tree is a type of supervised learning algorithm that is mostly used in classification problems. It
works for both categorical and continuous input and output variables. In this technique, the model splits
the sample into two or more homogeneous sets (or sub-populations) based on the most significant splitter /
differentiator among the input variables. In a decision tree, each internal node represents a test on an attribute, each branch
depicts the outcome of the test, and each leaf represents the decision made after evaluating the attributes.
A decision tree works in the following manner:
❖ Place the best attribute of the dataset at the root of the tree.
❖ Split the training set into subsets. Subsets should be made in such a way that each subset contains
data with the same value for an attribute.
❖ Repeat the first two steps on each subset until you find leaf nodes in all the branches of the tree.
To predict a class label for a record, the model starts from the root of the tree and
compares the value of the root attribute with the record’s attribute. On the basis of this comparison, it follows the
branch corresponding to that value and jumps to the next node.
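Placing the best attribute at the root requires a criterion for "best", which the text does not name. A common choice, assumed here, is information gain as used by the ID3 algorithm. The sketch below computes it for two hypothetical attributes on a four-record toy dataset; the attribute names and records are invented for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def information_gain(rows, labels, attribute):
    """Reduction in entropy obtained by splitting on one attribute."""
    total = len(rows)
    remainder = 0.0
    for value in {row[attribute] for row in rows}:
        subset = [lab for row, lab in zip(rows, labels) if row[attribute] == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder


# Hypothetical records: two categorical attributes and a binary label.
rows = [
    {"chest_pain": "yes", "smoker": "yes"},
    {"chest_pain": "yes", "smoker": "no"},
    {"chest_pain": "no", "smoker": "yes"},
    {"chest_pain": "no", "smoker": "no"},
]
labels = [1, 1, 0, 0]

gains = {name: information_gain(rows, labels, name) for name in ("chest_pain", "smoker")}
best = max(gains, key=gains.get)
print(best, gains)
```

In this toy data chest_pain separates the two classes perfectly, so it would be placed at the root; the same computation is then repeated on each resulting subset.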
5.3.2. Random Forest Model
1. Given there are n cases in the training dataset. From these n cases, sub-samples are chosen at random
with replacement. These random sub-samples chosen from the training dataset are used to build
individual trees.
2. Assuming there are k variables for input, a number m is chosen such that m < k. m variables are
selected randomly out of k variables at each node. The split which is the best of these m variables is
chosen to split the node. The value of m is kept unchanged while the forest is grown.
3. Each tree is grown as large as possible without pruning.
4. The class of the new object is predicted based upon the majority of votes received from the
combination of all the decision trees.
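The four steps above can be sketched by hand with scikit-learn decision trees as the base learners. The synthetic data, the number of trees (25) and the value m = 3 are assumptions for illustration; in practice scikit-learn's RandomForestClassifier packages the same procedure.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: n cases with k input variables.
X, y = make_classification(n_samples=200, n_features=13, random_state=0)
n, k = X.shape
m = 3  # number of variables tried at each node, with m < k
rng = np.random.default_rng(0)

trees = []
for i in range(25):
    # Step 1: draw n cases at random with replacement (bootstrap sample).
    idx = rng.integers(0, n, size=n)
    # Step 2: max_features=m makes each node consider m random variables.
    # Step 3: no depth limit and no pruning, so each tree grows fully.
    tree = DecisionTreeClassifier(max_features=m, random_state=i)
    tree.fit(X[idx], y[idx])
    trees.append(tree)

# Step 4: majority vote over the trees (labels are 0/1, and the number
# of trees is odd, so the mean vote is never exactly 0.5).
votes = np.stack([tree.predict(X) for tree in trees])
forest_pred = (votes.mean(axis=0) > 0.5).astype(int)

print("Agreement with training labels:", (forest_pred == y).mean())
```

The agreement printed here is measured on the training data itself, so it overstates how well the forest would do on unseen records.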
CHAPTER 6
EXPERIMENTAL RESULTS AND ANALYSIS
The proposed work is implemented in Python 3.6.4 with the libraries scikit-learn, pandas, matplotlib and
other required libraries. The heart disease dataset downloaded from uci.edu is used for the study.
The Decision Tree and Random Forest machine learning algorithms are applied to identify heart
disease. To improve the work and add novelty, a hybrid model of Decision Tree and Random Forest is
also implemented. The results show that heart
disease detection is efficient using the Random Forest algorithm and the hybrid model. Random Forest achieves
76% accuracy, Decision Tree achieves around 79% accuracy, and the hybrid model achieves 76% accuracy.
The following table shows the accuracy obtained in our experimental study.
The below figure shows the accuracy comparison of our proposed work.
Figure 6.3: Evaluation metrics for Hybrid algorithm
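The figure itself is not reproduced in this text, so the exact metrics it reports are not known. Assuming they include accuracy, sensitivity and specificity, the measures named in Chapter 1, they can be derived from a confusion matrix as in the sketch below. The label vectors are hypothetical and are not results from this study.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels (1 = disease, 0 = no disease).
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)  # share of diseased patients detected
specificity = tn / (tn + fp)  # share of healthy patients cleared

print(accuracy, sensitivity, specificity)
```

In a diagnostic setting sensitivity is usually the more critical figure, since a missed case is costlier than a false alarm.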
CHAPTER 7
CONCLUSION AND FUTURE WORK
In conclusion, as identified through the literature review, there is a need for combinational and more
complex models to increase the accuracy of predicting the early onset of cardiovascular diseases.
The proposed framework uses a combination of Decision Tree and Random Forest for heart disease
prediction. The system is trained and tested using the Cleveland Heart Disease database in order to attain the
most efficient model. Future development could apply deep learning models such as CNN
or DNN to heart disease prediction. It is also planned to treat the task as a multi-class problem in order to
identify the level of the disease.
APPENDIX – A
REFERENCES
[1] Mackay,J., Mensah,G. 2004 “Atlas of Heart Disease and Stroke” Nonserial Publication, ISBN-13
9789241562768 ISBN-10 9241562765.
[2] Robert Detrano 1989 “Cleveland Heart Disease Database” V.A. Medical Center, Long Beach and
Cleveland Clinic Foundation.
[3] Yanwei Xing, Jie Wang and Zhihong Zhao Yonghong Gao 2007 “Combination data mining
methods with new medical data to predicting outcome of Coronary Heart Disease” Convergence
Information Technology, 2007. International Conference November 2007, pp 868-872.
[4] Jianxin Chen, Guangcheng Xi, Yanwei Xing, Jing Chen, and Jie Wang 2007 “Predicting Syndrome
by NEI Specifications: A Comparison of Five Data Mining Algorithms in Coronary Heart Disease”
Life System Modeling and Simulation Lecture Notes in Computer Science, pp 129-135.
[5] Jyoti Soni, Ujma Ansari, Dipesh Sharma 2011 “Predictive Data Mining for Medical Diagnosis: An
Overview of Heart Disease Prediction” International Journal of Computer Applications, doi
10.5120/2237-2860.
[6] Mai Shouman, Tim Turner, Rob Stocker 2012 “Using Data Mining Techniques In Heart Disease
Diagnoses And Treatment” Electronics, Communications and Computers (JECECC), 2012 Japan-
Egypt Conference March 2012, pp 173-177.
[7] Robert Detrano, Andras Janosi, Walter Steinbrunn, Matthias Pfisterer, Johann-Jakob Schmid, Sarbjit
Sandhu, Kern H. Guppy, Stella Lee, Victor Froelicher 1989 “International application of a new
probability algorithm for the diagnosis of coronary artery disease” The American Journal of
Cardiology, pp 304-310.
[8] Polat, K., S. Sahan, and S. Gunes 2007 “Automatic detection of heart disease using an artificial
immune recognition system (AIRS) with fuzzy resource allocation mechanism and k-nn (nearest
neighbour) based weighting preprocessing” Expert Systems with Applications 2007, pp 625-631.
[9] Ozsen, S., Gunes, S. 2009 “Attribute weighting via genetic algorithms for attribute weighted
artificial immune system (AWAIS) and its application to heart disease and liver disorders problems”
Expert Systems with Applications, pp 386-392.
[10] Resul Das, Ibrahim Turkoglu, and Abdulkadir Sengur 2009 “Effective diagnosis of heart disease
through neural networks ensembles” Expert Systems with Applications, pp 7675–7680.
[11] L. Breiman, Random Forests, Machine Learning, Vol. 45, Kluwer Academic Publishers (2001),
pp. 5-32.
[12] Rish, Irina (2001). An empirical study of the naive Bayes classifier. IJCAI Workshop on Empirical
Methods in AI.
[13] Breiman. L. [1998b] Randomizing Outputs To Increase Prediction Accuracy. Technical Report 518,
May 1998, Statistics Department, UCD (in press Machine Learning).
[14] Quinlan, J. R. 1986. Induction of Decision Trees. Mach. Learn. 1, 1 (Mar. 1986), 81-106.
APPENDIX – B
1. SOURCE CODE
Main.py:
import csv
import tkinter as tk
import tkinter.messagebox as tm  # assumed source of tm.showinfo used below
from tkinter import Message, Text, filedialog

import matplotlib.pyplot as plt  # assumed source of plt used in preprocess()
import numpy as np
import pandas as pd
from keras.models import load_model
from PIL import Image, ImageTk
from sklearn.preprocessing import StandardScaler

# Project-local modules
import NeuralNetwork as NN
#import predict as pred
import RFALG as RF
import DTALG as DT
import Hybrid as hy
import predict as pr
def clear():
    """Empty every input field of the form."""
    print("Clear1")
    for entry in (txt, txt1, txt2, txt3, txt4, txt5, txt6, txt7,
                  txt8, txt9, txt10, txt11, txt12, txt13, txt14):
        entry.delete(0, 'end')
window = tk.Tk()
window.title("Heart Disease Prediction")
window.geometry('1280x720')
bgcolor="#ffe6e6"
bgcolor1="#e60000"
fgcolor="#660000"
window.configure(background="#ffe6e6")
#window.attributes('-fullscreen', True)
window.grid_rowconfigure(0, weight=1)
window.grid_columnconfigure(0, weight=1)
txt4 = tk.Entry(window,width=15 ,bg=bgcolor ,fg=fgcolor,font=('times', 15, ' bold ') )
txt4.place(x=300, y=350)
txt11 = tk.Entry(window,width=15,bg=bgcolor ,fg=fgcolor,font=('times', 15, ' bold '))
txt11.place(x=850, y=350)
elbl7 = tk.Label(window, text="Ex:0/2",width=7 ,height=1 ,fg=fgcolor ,bg=bgcolor
,font=('times', 15, ' bold ') )
elbl7.place(x=500, y=500)
def browse():
    # Ask for a dataset file and show the chosen path in the path entry.
    path = filedialog.askopenfilename()
    print(path)
    txt.delete(0, 'end')
    txt.insert('end', path)
    if path == "":
        tm.showinfo("Input error", "Select Dataset")
def preprocess():
    path = txt.get()
    if path != "":
        print("preprocess")
        # read synthetic cleveland dataset from full cleveland.data
        df_main = pd.read_table(path, sep=',')
        # divide each column by its maximum value
        df_main /= df_main.max()
        # plot histograms
        # (fig, axs and y_all come from a part of the listing that is not reproduced here)
        #axs.set_xlabel('No Heart Disease Heart Disease')
        axs.set_title('DataSet')
        axs.grid()
        axs.hist(y_all)
        axs.get_children()[0].set_color('g')
        axs.get_children()[2].set_color('c')
        axs.get_children()[5].set_color('b')
        axs.get_children()[7].set_color('y')
        axs.get_children()[9].set_color('r')
        fig.savefig('results/Preprocess.png')
        plt.pause(5)
        plt.show(block=False)
        plt.close()
        tm.showinfo("Input", "Preprocess Successfully Finished")
    else:
        tm.showinfo("Input error", "Select Dataset")
def RFprocess():
    # Run the Random Forest module on the selected dataset.
    sym = txt.get()
    if sym != "":
        RF.process(sym)
        tm.showinfo("Input", "RandomForest Successfully Finished")
    else:
        tm.showinfo("Input error", "Select Dataset")

def DTprocess():
    # Run the Decision Tree module on the selected dataset.
    sym = txt.get()
    if sym != "":
        DT.process(sym)
        print("DT")
        tm.showinfo("Input", "DT Successfully Finished")
    else:
        tm.showinfo("Input error", "Select Dataset")

def hybridmodel():
    # Run the hybrid model module on the selected dataset.
    sym = txt.get()
    if sym != "":
        hy.process(sym)
        tm.showinfo("Input", "Hybrid Successfully Finished")
    else:
        tm.showinfo("Input error", "Select Dataset")
def predictprocess():
    print("predict")
    txt14.delete(0, 'end')
    #txt1.insert('end', "60")
    # read the 13 attribute values entered by the user
    a1 = txt1.get()
    a2 = txt2.get()
    a3 = txt3.get()
    a4 = txt4.get()
    a5 = txt5.get()
    a6 = txt6.get()
    a7 = txt7.get()
    a8 = txt8.get()
    a9 = txt9.get()
    a10 = txt10.get()
    a11 = txt11.get()
    a12 = txt12.get()
    a13 = txt13.get()
    # report the first empty field, if any
    if a1 == "":
        tm.showinfo("Insert error", "Enter Age")
    elif a2 == "":
        tm.showinfo("Insert error", "Enter Sex")
    elif a3 == "":
        tm.showinfo("Insert error", "Enter Cp")
    elif a4 == "":
        tm.showinfo("Insert error", "Enter tresp")
    elif a5 == "":
        tm.showinfo("Insert error", "Enter Chol")
    elif a6 == "":
        tm.showinfo("Insert error", "Enter fbs")
    elif a7 == "":
        tm.showinfo("Insert error", "Enter restecg")
    elif a8 == "":
        tm.showinfo("Insert error", "Enter thalach")
    elif a9 == "":
        tm.showinfo("Insert error", "Enter exang")
    elif a10 == "":
        tm.showinfo("Insert error", "Enter oldpeak")
    elif a11 == "":
        tm.showinfo("Insert error", "Enter slope")
    elif a12 == "":
        tm.showinfo("Insert error", "Enter ca")
    elif a13 == "":
        tm.showinfo("Insert error", "Enter thal")
    else:
        #new_pred = model.predict_classes(np.array([[58,1,3,112,230,0,2,165,0,2.5,2,1,7]])) #4
        #new_pred = model.predict_classes(np.array([[a1,a2,a3,a4,a5,a6,a7,a8,a9,a10,a11,a12,a13]]))
        new_pred = pr.process([a1,a2,a3,a4,a5,a6,a7,a8,a9,a10,a11,a12,a13])
        res = int(new_pred[0])
        print(res)
        # class 0 means no disease; any other class is reported as possible disease
        if res == 0:
            print("sdsd")
            txt14.insert('end', "No Heart Disease")
        else:
            print("sds no")
            txt14.insert('end', "Heart Disease Possible")
        #nn=np.array([[a1,a2,a3,a4,a5,a6,a7,a8,a9,a10,a11,a12,a13]])
        #print(nn)
        #pred.process(nn)
quitWindow = tk.Button(window, text="Quit", command=window.destroy, fg=fgcolor,
                       bg=bgcolor1, width=17, height=2, activebackground="Red",
                       font=('times', 15, ' bold '))
quitWindow.place(x=1060, y=600)
window.mainloop()
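The empty-field checks in predictprocess() can also be written as a table of field labels, which makes it straightforward to convert each entry to a number before it is passed to pr.process(). The following is a minimal sketch under stated assumptions, not the code used in the project: FIELD_LABELS and parse_features are hypothetical names, and the sketch assumes that every field should parse as a floating-point number.

```python
# Hypothetical helper: FIELD_LABELS and parse_features are illustrative names
# and are not part of the project's modules.
FIELD_LABELS = ["Age", "Sex", "Cp", "tresp", "Chol", "fbs", "restecg",
                "thalach", "exang", "oldpeak", "slope", "ca", "thal"]


def parse_features(raw_values):
    """Return the 13 inputs as floats, or raise ValueError naming the bad field."""
    if len(raw_values) != len(FIELD_LABELS):
        raise ValueError("Expected %d values" % len(FIELD_LABELS))
    features = []
    for label, raw in zip(FIELD_LABELS, raw_values):
        text = raw.strip()
        if text == "":
            # Same wording as the messages shown by predictprocess().
            raise ValueError("Enter " + label)
        try:
            features.append(float(text))
        except ValueError:
            raise ValueError("Invalid number for " + label)
    return features


if __name__ == "__main__":
    # Sample values taken from the commented-out test call in predictprocess().
    sample = ["58", "1", "3", "112", "230", "0", "2", "165", "0", "2.5", "2", "1", "7"]
    print(parse_features(sample))
```

Inside predictprocess(), such a helper could be called in a try block, with the text of the ValueError passed to tm.showinfo("Insert error", ...).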
RFALG.py:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import (accuracy_score, mean_absolute_error,
                             mean_squared_error, r2_score)

def process(path):
    # columns 0-12 are the input attributes, column 13 is the class label
    dataset = pd.read_csv(path)
    X = dataset.iloc[:, 0:13].values
    y = dataset.iloc[:, 13].values
    # The train/test split does not appear in the listing; the 80/20 split
    # and the fixed seed below are assumptions.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0)
    model2 = RandomForestClassifier()
    model2.fit(X_train, y_train)
    y_pred = model2.predict(X_test)
    print("predicted")
    print(y_pred)
    print("test")
    print(y_test)
    # save the predictions, numbered from 1
    with open("results/resultRF.csv", "w") as result2:
        result2.write("ID,Predicted Value" + "\n")
        for j in range(len(y_pred)):
            result2.write(str(j+1) + "," + str(y_pred[j]) + "\n")
    mse = mean_squared_error(y_test, y_pred)
    mae = mean_absolute_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    print(" ")
    print("MSE VALUE FOR RandomForest IS %f " % mse)
    print("MAE VALUE FOR RandomForest IS %f " % mae)
    print("R-SQUARED VALUE FOR RandomForest IS %f " % r2)
    rms = np.sqrt(mean_squared_error(y_test, y_pred))
    print("RMSE VALUE FOR RandomForest IS %f " % rms)
    ac = accuracy_score(y_test, y_pred)
    print("ACCURACY VALUE RandomForest IS %f" % ac)
    print(" ")
    # save the metrics as Parameter,Value rows
    with open('results/RFMetrics.csv', 'w') as result2:
        result2.write("Parameter,Value" + "\n")
        result2.write("MSE" + "," + str(mse) + "\n")
        result2.write("MAE" + "," + str(mae) + "\n")
        result2.write("R-SQUARED" + "," + str(r2) + "\n")
        result2.write("RMSE" + "," + str(rms) + "\n")
        result2.write("ACCURACY" + "," + str(ac) + "\n")
    # plot the saved metrics as a bar chart
    df = pd.read_csv('results/RFMetrics.csv')
    acc = df["Value"]
    alc = df["Parameter"]
    colors = ["#1f77b4", "#ff7f0e", "#2ca02c", "#d62728", "#8c564b"]
    fig = plt.figure()
    plt.bar(alc, acc, color=colors)
    plt.xlabel('Parameter')
    plt.ylabel('Value')
    plt.title('Random Forest Metrics Value')
    fig.savefig('results/RFMetricsValue.png')
    plt.pause(5)
    plt.show(block=False)
    plt.close()
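The five values written to results/RFMetrics.csv follow standard definitions. The sketch below recomputes them in plain Python so that the formulas are visible. It is an illustration only: metrics_report is a hypothetical name, returning NaN for an undefined R-squared is a choice made for this sketch, and the sklearn.metrics functions called in the listing remain the practical option. Since MSE, MAE, RMSE and R-squared treat the class labels as numbers, accuracy is the value that is easiest to interpret for a classifier.

```python
import math


def metrics_report(y_true, y_pred):
    """Compute MSE, MAE, R-squared, RMSE and accuracy for two equal-length sequences."""
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)            # residual sum of squares
    mse = ss_res / n
    mae = sum(abs(e) for e in errors) / n
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    # R-squared is undefined when all true values are identical (ss_tot == 0);
    # this sketch returns NaN in that case.
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")
    accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / n
    return {"MSE": mse, "MAE": mae, "R-SQUARED": r2,
            "RMSE": math.sqrt(mse), "ACCURACY": accuracy}


if __name__ == "__main__":
    # Four test labels with one wrong prediction.
    report = metrics_report([0, 1, 1, 0], [0, 1, 0, 0])
    for name, value in report.items():
        print(name, value)
```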
2. SCREEN SHOTS
The following screen shows the application home page.
The following screen shows the results of the decision tree algorithm.
The following screen shows the results of the random forest algorithm.
The following screen shows the results of the hybrid model.