
DETECTION OF HEART DISEASE USING

HYBRID MACHINE LEARNING ALGORITHM

A Project Report Submitted to the Government Arts College (Autonomous), Coimbatore.


In partial fulfillment of the requirements
for the Award of the Degree of

MASTER OF COMPUTER APPLICATIONS

Submitted by
B. HARIKRISHNAN
REG. NO.: 18LCA510

Under the Guidance of

Dr. K. ARTHI, MCA, M.Phil., Ph.D.,


Assistant Professor

POST GRADUATE AND RESEARCH DEPARTMENT OF COMPUTER APPLICATIONS


GOVERNMENT ARTS COLLEGE (AUTONOMOUS), COIMBATORE – 641 018
Re-accredited with “A” Grade by NAAC. Affiliated to Bharathiar University.
DECLARATION

I hereby declare that this project, entitled “DETECTION OF CHRONIC HEART DISEASE
FAILURE USING HYBRID MACHINE LEARNING MODEL” submitted to the Government Arts
College (Autonomous), Coimbatore in partial fulfillment of the requirements for the award of the Degree
of Master of Computer Applications is a bonafide record of original project work done by me under the
supervision and guidance of Dr. K. ARTHI, MCA, M.Phil., Ph.D., Assistant Professor, Post
Graduate and Research Department of Computer Applications, Government Arts College
(Autonomous), Coimbatore – 641 018.

Place : Coimbatore Signature of the Candidate


Date :

B. HARIKRISHNAN
(18LCA510)

[i]
CERTIFICATE

This is to certify that the project entitled “DETECTION OF CHRONIC HEART DISEASE
FAILURE USING HYBRID MACHINE LEARNING MODEL” is a record of an original project
work done by B. HARIKRISHNAN (18LCA510) submitted to Bharathiar University in partial
fulfillment of the requirement for the award of the Degree of MASTER OF COMPUTER
APPLICATIONS.

Head of the Department Signature of the Guide

Submitted for the project viva-voce examination held on _______________

Internal Examiner External Examiner

[ii]
ACKNOWLEDGEMENT

I take this opportunity to express my deep sense of gratitude and heartfelt thanks to our beloved
Principal, Dr. K. CHITRA, M.Sc., M.Phil., Ph.D., Government Arts College (Autonomous),
Coimbatore, who gave permission to carry out the project and provided all the facilities needed to
complete it successfully.
I also express my sincere and profound thanks to the Head of the Department,
Dr. R. A. ROSELINE, M.Sc., M.Phil., Ph.D., Associate Professor and Head, Post Graduate and
Research Department of Computer Applications, Government Arts College (Autonomous), Coimbatore,
for her valuable support and suggestions in my project work.
I would also like to express my sincere gratitude to my internal guide,
Dr. K. ARTHI, MCA, M.Phil., Ph.D., Assistant Professor, Department of Computer Applications,
who shaped my project into a better one with her valuable guidance towards the completion of this
project.
Finally, I express my sincere gratitude to my parents and friends, who helped my career and
without whose sustained support I could not have made my debut in Computer Applications.

[iii]
ABSTRACT

Heart disease is one of the most significant causes of mortality in the world today. Prediction of
cardiovascular disease is a critical challenge in the area of clinical data analysis. Machine learning (ML)
has been shown to be effective in assisting decisions and predictions from the large quantity
of data produced by the healthcare industry, and ML techniques have been used in recent
developments in different areas of the medical industry. This work proposes a novel method that aims at
detecting heart disease by applying machine learning techniques. The prediction model uses classification
techniques on the Cleveland heart disease dataset. In the implementation,
three machine learning algorithms are used: 1. Decision Tree, 2. Random Forest and 3. a Hybrid
model (a hybrid of Decision Tree and Random Forest). Experimental results show an accuracy level of
88.7% for heart disease prediction with the hybrid model. An interface is designed
to get the input parameters from the user and predict heart disease using the hybrid model of Decision
Tree and Random Forest.

[iv]
LIST OF ABBREVIATIONS

ML – Machine learning
CVD – Cardiovascular diseases
CHDD – Cleveland Heart Disease Database
KNN – K-Nearest Neighbor Algorithm
DT – Decision Trees
GA – Genetic algorithm
NB – Naive Bayes
CADSS – Computer Aided Decision Support System
CNN – Convolutional Neural Networks
ECG – Electrocardiogram
DL – Deep learning

[v]
TABLE OF CONTENTS
TITLE PAGE NO.
DECLARATION i
CERTIFICATE ii
ACKNOWLEDGEMENT iii
ABSTRACT iv
LIST OF ABBREVIATIONS v
CHAPTER-1: INTRODUCTION 1–3
Overview of the Project 1
Objective of the Project 3
CHAPTER – 2: LITERATURE SURVEY 4–6
Overview of Literature Survey 4
CHAPTER – 3: SYSTEM ANALYSIS 7 – 11
Objective 7
Existing System 7
Proposed System 8
Problem Statement 8
Requirement Analysis 9
System Requirement 11
CHAPTER – 4: SYSTEM DESIGN 12 – 18
Architecture Design 12– 13
UML Diagrams 14 – 18
CHAPTER – 5: IMPLEMENTATION 19 – 21
Implementation Methodology 19
Data Dictionary 19
Modules 20
CHAPTER- 6: EXPERIMENTAL RESULTS AND ANALYSIS 22– 24
CHAPTER-7: CONCLUSION AND FUTURE WORK 25
APPENDIX - A: 26
REFERENCES 26
APPENDIX – B: 28– 42
1. Source Code 28
2. Sample Screens Shots 39
CHAPTER 1
INTRODUCTION

Data mining (DM) is the extraction of useful information from large data sets that results in predicting
or describing the data using techniques such as classification, clustering, association, etc. Data mining
has found extensive applicability in the healthcare industry such as in classifying optimum treatment
methods, predicting disease risk factors, and finding efficient cost structures of patient care. Data mining
models have been applied in research on diseases such as diabetes, asthma, cardiovascular
diseases, AIDS, etc. Various techniques of data mining such as naïve Bayesian classification, artificial
neural networks, support vector machines, decision trees, logistic regression, etc. have been used to
develop models in healthcare research.
1.1. OVERVIEW OF THE PROJECT
An estimated 17 million people die of cardiovascular diseases (CVD) every year. Although such diseases
are controllable, early prognosis and an evaluation of a patient’s risk are necessary to curb the high
mortality rates they present. Common cardiovascular diseases include coronary heart disease,
cardiomyopathy, hypertensive heart disease, heart failure, etc. Common causes of heart disease include
smoking, diabetes, lack of physical activity, hypertension, a high-cholesterol diet, etc.
Research in the field of cardiovascular diseases using data mining has been an ongoing effort involving
prediction, treatment, and risk score analysis with high levels of accuracy. Multiple CVD surveys have
been conducted, the most prominent being the data set from the Cleveland Heart Clinic. The
Cleveland Heart Disease Database (CHDD) has as such been considered the de facto database for heart
disease research. Using the parameters from this database, this paper proposes a framework to
apply logistic regression, support vector machines, and decision trees to attain individual predictions
which are in turn used in rule-based algorithms. The result of each rule from this system is then compared
on the basis of accuracy, sensitivity, and specificity.
The methodology aims to accomplish two goals: the first is to present a predictive
framework for heart disease, and the second is to compare the efficiency of merging the outcomes of
multiple models as opposed to using a single model.
It is difficult to identify heart disease because of several contributory risk factors such as diabetes, high
blood pressure, high cholesterol, abnormal pulse rate and many other factors. Various techniques in data
mining and neural networks have been employed to find out the severity of heart disease among humans.
The severity of the disease is classified using methods such as the K-Nearest Neighbour algorithm
(KNN), Decision Trees (DT), Genetic Algorithm (GA), and Naive Bayes (NB). The nature of heart
disease is complex and hence the disease must be handled carefully. Not doing so may affect the heart
or cause premature death. The perspectives of medical science and data mining are used for discovering
various sorts of metabolic syndromes. Data mining with classification plays a significant role in the
prediction of heart disease and data investigation.
Previously, decision trees have been used to predict events related to heart disease.
Various known data mining methods have been used for knowledge abstraction in the
prediction of heart disease. Numerous studies have been carried out to produce a prediction
model using not only distinct techniques but also combinations of two or more techniques. These amalgamated
techniques are commonly known as hybrid methods. Neural networks using heart rate time series
have been introduced. This method uses various clinical records for prediction, such as Left bundle
branch block (LBBB), Right bundle branch block (RBBB), Atrial fibrillation (AFIB), Normal Sinus
Rhythm (NSR), Sinus bradycardia (SBR), Atrial flutter (AFL), Premature Ventricular
Contraction (PVC), and Second-degree block (BII), to find out the exact condition of the patient in
relation to heart disease. The dataset is classified with a radial basis function network (RBFN),
where 70% of the data is used for training and the remaining 30% is used for classification.
A Computer Aided Decision Support System (CADSS) in the field of medicine and research has also been
introduced. In previous work, the use of data mining techniques in the healthcare industry has been
shown to take less time for the prediction of disease with more accurate results [16]. One model proposes the
diagnosis of heart disease using the GA. This method uses effective association rules inferred with the
GA through tournament selection, crossover and mutation, which results in a new proposed fitness
function. For experimental validation, the model uses the well-known Cleveland dataset, which is
collected from the UCI machine learning repository. The results prove to
be prominent when compared to some of the known supervised learning techniques.
Particle Swarm Optimization (PSO), one of the most powerful evolutionary algorithms, has been introduced
and some rules are generated for heart disease. The rules are applied randomly with encoding techniques,
which results in an overall improvement of the accuracy. Heart disease is predicted based on symptoms,
namely pulse rate, sex, age, and many others. ML algorithms with neural networks have been introduced, whose
results are more accurate and reliable. Neural networks are generally regarded as the best tool for
prediction of diseases like heart disease and brain disease. One proposed method uses 13 attributes
for heart disease prediction. The results show an enhanced level of performance compared to the existing
methods.
Carotid Artery Stenting (CAS) has also become a prevalent treatment mode
in the medical field in recent years. CAS prompts the occurrence of major adverse
cardiovascular events (MACE) in elderly heart disease patients, so their evaluation becomes very
important. One model generates results using an Artificial Neural Network (ANN), which produces good
performance in the prediction of heart disease. Neural network methods have been introduced which combine
not only posterior probabilities but also predicted values from multiple predecessor techniques. This
model achieves an accuracy level of up to 89.01%, which is a strong result compared to previous works.
For all experiments, the Cleveland heart dataset is used with a neural network (NN) to improve the
performance of heart disease prediction.
Recent developments in machine learning techniques have also been applied to the Internet of Things (IoT).
ML algorithms applied to network traffic data have been shown to provide accurate identification of IoT
devices connected to a network. Meidan et al. collected and labelled network traffic data from nine distinct IoT devices,
PCs and smartphones. Using supervised learning, they trained a multi-stage meta classifier. In the first
stage, the classifier can distinguish between traffic generated by IoT and non-IoT devices. In the second
stage, each IoT device is associated with a specific IoT device class. Deep learning is a promising
approach for extracting accurate information from raw sensor data from IoT devices deployed in
complex environments. Because of its multilayer structure, deep learning is also appropriate for the edge
computing environment.
1.2. OBJECTIVE OF THE PROJECT
In this work, a technique called the Hybrid Random Forest with Linear Model (HRFLM) is introduced.
The main objective of this research is to improve the accuracy of heart disease prediction.
Many studies have been conducted that restrict feature selection for algorithmic use. In
contrast, the HRFLM method uses all features without any restriction on feature selection. Experiments
are conducted to identify the features of a machine learning algorithm with a hybrid
method. The experimental results show that the proposed hybrid method has a stronger capability to predict
heart disease compared to existing methods.

[3]
CHAPTER 2
LITERATURE SURVEY

2.1. OVERVIEW OF LITERATURE SURVEY


This chapter gives an overview of the literature survey and presents some of the relevant work
done by researchers.
Researchers have studied many techniques for the heart disease prediction problem; a few
of them are discussed below.
There is ample related work in the fields directly related to this paper. ANN has been introduced to
produce the highest accuracy prediction in the medical field. The back propagation multilayer perceptron
(MLP) of ANN is used to predict heart disease. The obtained results are compared with the results of
existing models within the same domain and found to be improved. The data of heart disease patients
collected from the UCI laboratory is used to discover patterns with NN, DT, Support Vector Machines
(SVM), and Naive Bayes. The results are compared for performance and accuracy with these algorithms.
The proposed hybrid method returns results of 86.8% for F-measure, competing with the other existing
methods. Classification without segmentation using Convolutional Neural Networks (CNN) has been
introduced. This method considers the heart cycles with various start positions from the
Electrocardiogram (ECG) signals in the training phase. CNN is able to generate features with various
positions in the testing phase of the patient.
A large amount of data generated by the medical industry has not been used effectively previously. The
new approaches presented here decrease the cost and improve the prediction of heart disease in an easy
and effective way. The research techniques considered in this work for the prediction and
classification of heart disease using ML and deep learning (DL) are highly accurate, which
establishes the efficacy of these methods.
Intelligent heart disease prediction system using random forest and evolutionary approach
Akhil, Jabbar & Deekshatulu, Bulusu & Chandra, Priti. (2016). Intelligent heart disease prediction
system using random forest and evolutionary approach. Journal of Network and Innovative
Computing. 4. 175-184.
Heart disease is a leading cause of premature death in the world. Predicting the outcome of disease is
a challenging task. Data mining is used to automatically infer diagnostic rules and help specialists
make the diagnosis process more reliable. Several data mining techniques are used by researchers to help
health care professionals predict heart disease. Random forest is an ensemble learning algorithm that is
highly accurate and suitable for medical applications. The chi-square feature selection measure is used to
evaluate variables and determine whether they are correlated or not. The paper
proposes a classification model which uses random forest as the classifier and chi-square and genetic algorithm as
feature selection measures to predict heart disease. The experimental results show that the
approach improves classification accuracy compared to other classification approaches, and that the presented
model can be successfully used by health care professionals for predicting heart disease.
A Data mining Model for predicting the Coronary Heart Disease using Random
Forest Classifier
A. S. Abdullah and R. R. Rajalaxmi, in Proc. Int. Conf. Recent Trends Comput. Methods,
Commun. Controls, Apr. 2012, pp. 22-25.
Coronary Heart Disease (CHD) is a common form of disease affecting the heart and an important cause
for premature death. From the point of view of medical sciences, data mining is involved in discovering
various sorts of metabolic syndromes. Classification techniques in data mining play a significant role in
prediction and data exploration. Classification technique such as Decision Trees has been used in
predicting the accuracy and events related to CHD. In this paper, a Data mining model has been
developed using Random Forest classifier to improve the prediction accuracy and to investigate various
events related to CHD. This model can help the medical practitioners for predicting CHD with its various
events and how it might be related with different segments of the population. The events investigated
are Angina, Acute Myocardial Infarction (AMI), Percutaneous Coronary Intervention (PCI), and
Coronary Artery Bypass Graft surgery (CABG). Experimental results have shown that classification
using Random Forest Classification algorithm can be successfully used in predicting the events and risk
factors related to CHD.
Using PSO algorithm for producing best rules in diagnosis of heart disease
A. H. Alkeshuosh, M. Z. Moghadam, I. A. Mansoori and M. Abdar, "Using PSO Algorithm for
Producing Best Rules in Diagnosis of Heart Disease," 2017 International Conference on Computer
and Applications (ICCA), Doha, 2017, pp. 306-311.
Heart disease is still a growing global health issue. In the health care system, limiting human experience
and expertise in manual diagnosis leads to inaccurate diagnosis, and the information about various
illnesses is either inadequate or lacking in accuracy as they are collected from various types of medical
equipment. Since the correct prediction of a person's condition is of great importance, equipping medical
science with intelligent tools for diagnosing and treating illness can reduce doctors' mistakes and
financial losses. In this paper, the Particle Swarm Optimization (PSO) algorithm, which is one of the
most powerful evolutionary algorithms, is used to generate rules for heart disease. First the random rules
are encoded and then they are optimized based on their accuracy using the PSO algorithm. Finally, the results
are compared with the C4.5 algorithm.

[5]
Backpropagation neural network for prediction of heart disease
Al-Milli, Nabeel. (2013). Backpropogation neural network for prediction of heart disease. 56. 131-
135.
Recently, several software tools and algorithms have been proposed by researchers for
developing effective medical decision support systems. Moreover, new algorithms and new tools
continue to be developed day by day. Diagnosing heart disease is one of the important issues,
and many researchers have worked to develop intelligent medical decision support systems to improve
the ability of physicians. The neural network is a widely used tool for predicting heart disease diagnosis.
In this research paper, a heart disease prediction system is developed using a neural network. The proposed
system uses 13 medical attributes for heart disease prediction. The experiments conducted in this work
show the good performance of the proposed algorithm compared to similar state-of-the-art approaches.

[6]
CHAPTER – 3
SYSTEM ANALYSIS

3.1. OBJECTIVE
The objective of this project is to predict heart disease with an automated medical diagnosis system based
on machine learning. The system uses a hybrid model, which this project takes to be the best classification
algorithm for heart disease prediction.
3.2. EXISTING SYSTEM
One existing study applies a neural network to self-applied questionnaire (SAQ) data to develop
a heart disease prediction system. The work was validated by checking the result
of the neural network against the “Dundee Rank Factor Score”, which statistically relates 3 risk factors
(blood pressure, smoking and blood cholesterol) together with sex and age to the risk of having
heart disease. The study used a multi-layered feedforward neural network trained with the
Backpropagation Algorithm.
Drawbacks
• The study clarifies not only the common risk factors of the disease but also the other data collected in
the SAQ.
• Mai Shouman et al. worked on the application of k-Nearest-Neighbors (k-NN) in the diagnosis of heart
disease, where k-NN gave higher accuracy.
• Applying integrated voting could not enhance the k-NN accuracy in the diagnosis of heart disease
patients, unlike decision tree classifiers, where voting increases accuracy. Voting is an aggregation
technique used to combine the decisions of multiple classifiers. However, the accuracy of k-NN
with voting reduced to 92.7%.
• G. Purusothaman surveyed and compared different classification techniques for heart disease
prediction. Instead of applying a single model such as a decision tree, artificial neural network or
Naïve Bayes, the authors focus on the working of hybrid models, i.e., models which combine more
than one classification technique.
• However, applying a hybrid model has difficulties.

[7]
3.3. PROPOSED SYSTEM
The proposed system predicts heart disease with an automated medical diagnosis system based on
machine learning. A hybrid model is used for the prediction system.
The Cleveland database is used for the heart disease prediction system because it is the
database most commonly used by ML researchers.
The dataset contains 303 instances and 76 attributes, but only 14 of them are referred to by all published
studies.
The "goal" field, which takes values from 0 (absence) to 4, denotes whether heart disease is present in
the patient. Studies on the Cleveland database have focused on distinguishing absence (value 0) from
presence (values 1 to 4).
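The reduction of the "goal" field to a binary label can be sketched as follows. This is a minimal illustration, not the project's code: the column name `num` follows the data dictionary in Chapter 5, the column name `target` is hypothetical, and the sample values are made up.

```python
import pandas as pd

# Made-up sample of the "goal" field; `num` is the name used in the data dictionary.
df = pd.DataFrame({"num": [0, 2, 1, 0, 4, 3, 0]})

# 0 stays 0 (absence); the values 1 to 4 all become 1 (presence).
# `target` is a hypothetical column name chosen for this sketch.
df["target"] = (df["num"] > 0).astype(int)

print(df["target"].tolist())
```

With this mapping, every classifier discussed later only has to separate two classes.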
Advantages
• The proposed heart disease prediction system uses hybrid model.
• Patients can go for treatment based on the report generated
• Helps patient to take preventive measures in advance.
3.4. PROBLEM STATEMENT
Heart disease is one of the prevalent diseases that reduce the lifespan of human beings
nowadays. It is a disease that affects the function of the heart. An estimate of a person’s risk
for coronary heart disease is important for many aspects of health promotion and clinical medicine. A
risk prediction model may be obtained through multivariate regression analysis of a longitudinal study.
Because digital technologies are growing rapidly, healthcare centres store huge amounts of data in their
databases that are very complex and challenging to analyse. Data mining techniques and machine learning
algorithms play vital roles in the analysis of different data in medical centres. The techniques and algorithms
can be applied directly to a dataset to create models or to draw vital conclusions and inferences
from the dataset.

[8]
3.5. REQUIREMENT ANALYSIS
3.5.1. FUNCTIONAL REQUIREMENTS
Data collection
The data collection process involves the selection of quality data for analysis. The model uses the heart
disease dataset taken from uci.edu for machine learning implementation. The job of a data analyst is to
find ways and sources of collecting relevant and comprehensive data, interpreting it, and analyzing
results with the help of statistical techniques.
Data visualization
A large amount of information represented in graphic form is easier to understand and analyze. Some
companies specify that a data analyst must know how to create slides, diagrams, charts, and templates.
In our approach, the heart disease rates are shown in the data visualization part.

Figure 3.0: Data Visualization of Heart disease rate

[9]
Data preprocessing
The purpose of preprocessing is to convert raw data into a form that fits machine learning. Structured
and clean data allows a data scientist to get more precise results from an applied machine learning model.
The technique includes data formatting, cleaning, and sampling.
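A minimal sketch of the formatting and cleaning steps is given below. It assumes that missing entries in the raw file are written as "?" and that filling them with the column median is acceptable; neither detail is stated in this report, and the rows shown are made up.

```python
import pandas as pd

# Made-up rows; the "?" marker for missing values is an assumption of this sketch.
raw = pd.DataFrame({
    "age": [63, 67, 41, 56],
    "ca": ["0", "3", "?", "1"],
    "thal": ["6", "?", "3", "3"],
})

# Formatting: make every column numeric; entries that cannot be parsed become NaN.
clean = raw.apply(pd.to_numeric, errors="coerce")

# Cleaning: fill each missing value with the median of its column (assumed strategy).
clean = clean.fillna(clean.median())

print(clean)
```

Dropping the incomplete rows instead of imputing them would be an equally simple alternative when only a few records are affected.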
Dataset splitting
A dataset used for machine learning should be partitioned into three subsets: training, test, and
validation sets.
Training set. A data scientist uses a training set to train a model and define the optimal parameters it has
to learn from data.
Test set. A test set is needed to evaluate the trained model and its capability for generalization.
The latter means a model’s ability to identify patterns in new, unseen data after having been trained on
the training data. It’s crucial to use different subsets for training and testing to avoid model overfitting,
which is the incapacity for generalization mentioned above.
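The three-way partition can be sketched with two calls to scikit-learn's `train_test_split`. The 60/20/20 proportions, the random seed and the synthetic data are assumptions made for illustration; the report does not state the split used in the project.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 records with 13 features and balanced binary labels.
rng = np.random.RandomState(0)
X = rng.rand(100, 13)
y = np.tile([0, 1], 50)

# First hold out 20% of the records as the test set.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Then take 25% of the remainder as the validation set (0.25 * 80 = 20 records).
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=0
)

print(len(X_train), len(X_val), len(X_test))
```

The `stratify` argument keeps the proportion of the two classes roughly equal in every subset, which matters for a dataset of only a few hundred records.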
Model training
After a data scientist has preprocessed the collected data and split it into train and test sets, they can proceed with
model training. This process entails “feeding” the algorithm with training data. The algorithm
processes the data and outputs a model that is able to find a target value (attribute) in new data, that is, the answer you
want to get with predictive analysis. The purpose of model training is to develop a model.
Model evaluation and testing
The goal of this step is to develop the simplest model able to formulate a target value fast and well
enough. A data scientist can achieve this goal through model tuning, that is, the optimization of model
parameters to achieve an algorithm’s best performance.
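One common way to carry out such tuning is a cross-validated grid search. The sketch below assumes scikit-learn's `GridSearchCV`, a decision tree as the model, `max_depth` as the only tuned parameter, and synthetic data; all of these are illustrative choices rather than the settings used in the project.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the 13 input attributes.
X, y = make_classification(n_samples=200, n_features=13, random_state=0)

# Try several tree depths and keep the one with the best 5-fold accuracy.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 3, 5, None]},  # assumed candidate values
    cv=5,
    scoring="accuracy",
)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

A shallow tree that scores about as well as a deep one would be preferred here, in line with the goal of the simplest adequate model.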
Non-functional requirements
The following is a list of non-functional requirements. The specific details will need to be defined by
internal stakeholders.
❖ Response Time
❖ Availability
❖ Stability
❖ Maintainability
❖ Usability

[10]
3.6. SYSTEM REQUIREMENTS
The system requirements include hardware and software requirements, which are provided below.
3.6.1. HARDWARE REQUIREMENTS
Processor : Any Processor above 500 MHz.
RAM : 4 GB
Hard Disk : 4 GB
Input device : Standard Keyboard and Mouse.
Output device : VGA and High Resolution Monitor.
3.6.2. SOFTWARE SPECIFICATION
Operating System : Windows 7 or higher
Programming : Python 3.6 and related libraries

[11]
CHAPTER 4
SYSTEM DESIGN

This chapter gives an overview of the architecture design, the dataset used for implementation, the
algorithms used and the UML designs.
4.1. ARCHITECTURE DESIGN

Figure 4.0: Architecture Diagram


The above figure represents the architecture of the proposed system, in which all modules of the work
are represented. The user input, dataset collection, model training and recognition are shown.

[12]
Figure 4.1: Architecture Diagram
The above figure represents the architecture of the proposed system, in which all modules of the work are
represented. The user input, dataset collection, model training and prediction are shown.

[13]
4.2. UML DIAGRAMS
The design is a plan or drawing produced to show the look and function or workings of an object
before it is made. Unified Modeling language (UML) is a standardized modeling language enabling
developers to specify, visualize, construct and document artifacts of a software system. Thus, UML
makes these artifacts scalable, secure and robust in execution. UML is an important aspect involved in
object-oriented software development. It uses graphic notation to create visual models of software
systems.
The different types of UML diagram are as follows.
• Use Case Diagram
• Class Diagram
• Activity Diagram
• Sequence Diagram
• Collaboration Diagram
• Component Diagram
• Deployment Diagram

[14]
4.2.1. USE CASE DIAGRAM

[Use case diagram. Use cases: Load Dataset; Apply ML algorithm (Decision Tree, Random Forest, Hybrid Model); Heart Disease Prediction; Find Accuracy]

Figure 4.2: Use case Diagram


The above figure represents the use case diagram, in which the dataset uploaded by the user is pre-processed
and the algorithms are applied. The results are analyzed for heart disease prediction of binary type.

[15]
4.2.2. SEQUENCE DIAGRAM

[Sequence diagram. Lifelines: Dataset, Train set, Test set, Model, Result. Messages: Load; Input; DT / RF / Hybrid; Prediction]

Figure 4.3: Sequence Diagram


A sequence diagram shows, as parallel vertical lines, different processes or objects that live
simultaneously and, as horizontal arrows, the messages exchanged between them, in the order in which they
occur. The above figure is the sequence diagram, which represents the proposed system’s sequence of
data flow.

[16]
4.2.3. ACTIVITY DIAGRAM

[Activity diagram. Flow: Load dataset; Split dataset (Train set, Test set); Train model; Heart disease prediction; Plot / analyze]

Figure 4.4: Activity diagram


The above figure represents the activity diagram of the proposed system. It shows the complete flow of
activity, from dataset loading through the sequence of all modules.

[17]
4.2.4. DEPLOYMENT DIAGRAM

[Deployment diagram. Nodes: Client, Middleware, Server. Labels: Dataset input, Pre-process, Train set, ML, Test input, Trained model, Predicted output]

Figure 4.5: Deployment Diagram


In the deployment diagram the UML models the physical deployment of artifacts on nodes. The nodes
appear as boxes, and the artifacts allocated to each node appear as rectangles within the boxes. Nodes
may have subnodes, which appear as nested boxes. A single node in a deployment diagram may
conceptually represent multiple physical nodes, such as a cluster of database servers.

[18]
CHAPTER 5
IMPLEMENTATION

5.1. IMPLEMENTATION METHODOLOGY


The proposed work is implemented in Python 3.6.4 with the libraries scikit-learn, pandas, matplotlib and
other mandatory libraries. The dataset is downloaded from uci.edu. The downloaded data contains binary
classes of heart disease. Machine learning algorithms, namely decision tree and random forest,
are applied along with the hybrid model.
5.2. DATA DICTIONARY
The dataset is collected with the attributes age, sex, cp, trestbps, chol, fbs, restecg, thalach, exang, oldpeak,
slope, ca, thal, pred_attribute. A sample of the collected data is shown in the figure below.

Figure 5.0: Dataset used for Study


The dataset variable names are described below

Variable name    Short description

Age              Age of patient
Sex              Sex, 1 for male
cp               chest pain
trestbps         resting blood pressure
chol             serum cholesterol
fbs              fasting blood sugar larger 120mg/dl (1 true)
restecg          resting electroc. result (1 anomality)
thalach          maximum heart rate achieved
exang            exercise induced angina (1 yes)
oldpeak          ST depression induc. ex.
slope            slope of peak exercise ST
ca               number of major vessel
thal             no explanation provided, but probably thalassemia (3 normal; 6 fixed defect; 7 reversable defect)
num              diagnosis of heart disease (angiographic disease status)

[19]
5.3. MODULES
The modules included in our implementation are as follows
❖ Decision Tree
❖ Random forest
❖ Hybrid model
5.3.1. Decision Tree
A decision tree is a supervised learning algorithm that is mostly used in classification problems. It
works for both categorical and continuous input and output variables. In this technique, the model splits
the sample into two or more homogeneous sets (or sub-populations) based on the most significant splitter /
differentiator among the input variables. In a decision tree, each internal node represents a test on an
attribute, each branch depicts an outcome of that test, and each leaf represents the decision made after
evaluating the attributes.
A decision tree is built in the following manner:
❖ Place the best attribute of the dataset at the root of the tree.
❖ Split the training set into subsets such that each subset contains data with the same value for
that attribute.
❖ Repeat the two steps above on each subset until leaf nodes are found in all branches of the tree.
To predict a class label for a record, the model starts from the root of the tree and compares the value
of the root attribute with the record's attribute. On the basis of this comparison, it follows the
branch corresponding to that value and moves to the next node, repeating until it reaches a leaf.
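
The build-and-predict procedure described above can be sketched with scikit-learn's DecisionTreeClassifier. The six records and the choice of the Gini criterion are assumptions for illustration only; the project's own DTALG module is not reproduced in this report.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Made-up records: [age, thalach]; label 1 = disease, 0 = no disease.
X = [[45, 170], [50, 165], [62, 120], [67, 110], [58, 125], [40, 180]]
y = [0, 0, 1, 1, 1, 0]

tree = DecisionTreeClassifier(criterion="gini", random_state=0)
tree.fit(X, y)

# Each printed line is either an internal-node test or a leaf decision.
print(export_text(tree, feature_names=["age", "thalach"]))

# Prediction walks from the root to a leaf, following the test outcomes.
new_record = [[65, 115]]
print(tree.predict(new_record))
```

Because the toy classes are cleanly separable on either attribute, a single test at the root is enough here; on real data the tree grows deeper.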

Figure 5.1: Flow chart of Decision Tree algorithm

5.3.2. Random Forest Model
1. Suppose there are n cases in the training dataset. From these n cases, sub-samples are chosen at random
with replacement. These random sub-samples are used to build the individual trees.
2. Assuming there are k input variables, a number m is chosen such that m < k. At each node, m variables
are selected randomly out of the k variables, and the best split among these m variables is used to
split the node. The value of m is kept unchanged while the forest is grown.
3. Each tree is grown as large as possible without pruning.
4. The class of a new object is predicted by the majority of the votes received from all the decision
trees combined.
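
The four steps above can be sketched directly on top of scikit-learn decision trees. This is an illustrative sketch under assumptions (synthetic data, 25 trees, m = 2 of k = 4 variables, and the hypothetical helper names fit_forest and predict_forest); the project itself uses RandomForestClassifier, as listed in Appendix B.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_forest(X, y, n_trees=25, m=2, seed=0):
    """Grow n_trees unpruned trees, each on its own bootstrap sample."""
    rng = np.random.default_rng(seed)
    n = len(X)
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)          # step 1: sample n cases with replacement
        tree = DecisionTreeClassifier(
            max_features=m,                       # step 2: m of k variables tried per node
            random_state=int(rng.integers(1 << 30)),
        )
        tree.fit(X[idx], y[idx])                  # step 3: grown fully, no pruning by default
        trees.append(tree)
    return trees

def predict_forest(trees, X):
    """Step 4: majority vote over the individual tree predictions."""
    votes = np.stack([t.predict(X) for t in trees]).astype(int)  # (n_trees, n_samples)
    return np.array([np.bincount(col).argmax() for col in votes.T])

# Synthetic two-class data with k = 4 input variables (assumption).
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

trees = fit_forest(X, y)
preds = predict_forest(trees, X)
print("training accuracy:", (preds == y).mean())
```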

Figure 5.2: Flow chart of Random Forest

5.3.3. Hybrid Model
We developed this model by combining the decision tree and random forest algorithms. The combined model
works on the class probabilities produced by the random forest: these probabilities are added to the
training data, which is then fed to the decision tree algorithm. Similarly, the decision tree
probabilities are identified and fed to the test data. Finally, the values are predicted.
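
One possible reading of this description is a simple stacking scheme: the random forest's class probability is appended to the inputs as an extra column, and a decision tree is trained on the augmented data. The sketch below follows that reading. It is an assumption rather than the project's Hybrid module (which is not listed in this report), and it uses synthetic data in place of the heart dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the 13-attribute dataset (assumption).
X, y = make_classification(n_samples=300, n_features=13, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Stage 1: random forest trained on the original attributes.
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)

# Append the forest's P(class = 1) as a 14th column to both splits.
train_aug = np.column_stack([X_train, rf.predict_proba(X_train)[:, 1]])
test_aug = np.column_stack([X_test, rf.predict_proba(X_test)[:, 1]])

# Stage 2: decision tree trained on the augmented training data.
dt = DecisionTreeClassifier(random_state=0)
dt.fit(train_aug, y_train)

y_pred = dt.predict(test_aug)
print("hybrid accuracy:", accuracy_score(y_test, y_pred))
```

In this arrangement the forest's probabilities on its own training data are very confident, so the tree tends to lean heavily on the added column and the hybrid's predictions stay close to the forest's.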

CHAPTER 6
EXPERIMENTAL RESULTS AND ANALYSIS

The proposed work is implemented in Python 3.6.4 using scikit-learn, pandas, matplotlib and other
supporting libraries. The heart disease dataset downloaded from the UCI Machine Learning Repository
(uci.edu) is used for the study. Two machine learning algorithms, Decision Tree and Random Forest, are
applied to identify heart disease. To add novelty to the work, a hybrid model of Decision Tree and
Random Forest is also implemented. The results show that all three models detect heart disease with
accuracies between 76% and 79%: the Decision Tree achieves around 79% accuracy, while the Random Forest
and the hybrid model each achieve 76% accuracy.
The following table shows the accuracies obtained in our experimental study.

Algorithm                                 Accuracy (%)
Decision Tree                             79
Random Forest                             76
Hybrid (Decision Tree + Random Forest)    76

Table: Experimental Results of proposed system
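
For reference, accuracy here is the percentage of test records whose predicted class matches the true class. The sketch below computes it by hand and with scikit-learn's accuracy_score on ten made-up labels; these are not the project's predictions.

```python
from sklearn.metrics import accuracy_score

# Made-up labels for ten test records.
y_test = [0, 1, 1, 0, 1, 0, 0, 1, 1, 0]
y_pred = [0, 1, 0, 0, 1, 0, 1, 1, 1, 0]

# Count matching positions: 8 of the 10 predictions agree with the truth.
correct = sum(t == p for t, p in zip(y_test, y_pred))
manual = 100.0 * correct / len(y_test)
library = 100.0 * accuracy_score(y_test, y_pred)

print(round(manual, 2), round(library, 2))
```

Note that the train/test split in RFALG.py (Appendix B) is made without a fixed random_state, so the reported figures can vary slightly from run to run.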

The figures below show the evaluation metrics obtained for each algorithm in our proposed work.

Figure 6.1: Evaluation metrics for Decision Tree algorithm

Figure 6.2: Evaluation metrics for Random Forest algorithm

Figure 6.3: Evaluation metrics for Hybrid algorithm

CHAPTER 7
CONCLUSION AND FUTURE WORK

In conclusion, as identified through the literature review, there is a need for combinational and more
complex models to increase the accuracy of predicting the early onset of cardiovascular diseases.
The proposed framework combines Decision Tree and Random Forest for heart disease prediction. The
system is trained and tested on the Cleveland Heart Disease database in order to identify the most
efficient model. Future development could apply deep learning models such as CNN or DNN to heart
disease prediction. We also plan to treat the task as a multi-class problem in order to identify the
level of the disease.

APPENDIX – A
REFERENCES

[1] Mackay,J., Mensah,G. 2004 “Atlas of Heart Disease and Stroke” Nonserial Publication, ISBN-13
9789241562768 ISBN-10 9241562765.
[2] Robert Detrano 1989 “Cleveland Heart Disease Database” V.A. Medical Center, Long Beach and
Cleveland Clinic Foundation.
[3] Yanwei Xing, Jie Wang, Zhihong Zhao and Yonghong Gao 2007 “Combination data mining
methods with new medical data to predicting outcome of Coronary Heart Disease” Convergence
Information Technology, 2007. International Conference November 2007, pp 868-872.
[4] Jianxin Chen, Guangcheng Xi, Yanwei Xing, Jing Chen, and Jie Wang 2007 “Predicting Syndrome
by NEI Specifications: A Comparison of Five Data Mining Algorithms in Coronary Heart Disease”
Life System Modeling and Simulation Lecture Notes in Computer Science, pp 129-135.
[5] Jyoti Soni, Ujma Ansari, Dipesh Sharma 2011 “Predictive Data Mining for Medical Diagnosis: An
Overview of Heart Disease Prediction” International Journal of Computer Applications, doi
10.5120/2237-2860.
[6] Mai Shouman, Tim Turner, Rob Stocker 2012 “Using Data Mining Techniques In Heart Disease
Diagnoses And Treatment” Electronics, Communications and Computers (JECECC), 2012 Japan-
Egypt Conference March 2012, pp 173-177.
[7] Robert Detrano, Andras Janosi, Walter Steinbrunn, Matthias Pfisterer, Johann-Jakob Schmid, Sarbjit
Sandhu, Kern H. Guppy, Stella Lee, Victor Froelicher 1989 “International application of a new
probability algorithm for the diagnosis of coronary artery disease” The American Journal of
Cardiology, pp 304-310.
[8] Polat, K., S. Sahan, and S. Gunes 2007 “Automatic detection of heart disease using an artificial
immune recognition system (AIRS) with fuzzy resource allocation mechanism and k-nn (nearest
neighbour) based weighting preprocessing” Expert Systems with Applications 2007, pp 625-631.
[9] Ozsen, S., Gunes, S. 2009 “Attribute weighting via genetic algorithms for attribute weighted
artificial immune system (AWAIS) and its application to heart disease and liver disorders problems”
Expert Systems with Applications, pp 386-392.
[10] Resul Das, Ibrahim Turkoglu, and Abdulkadir Sengur 2009 “Effective diagnosis of heart disease
through neural networks ensembles” Expert Systems with Applications, pp 7675–7680.
[11] L. Breiman, Random Forests, Machine Learning, Vol. 45, Kluwer Academic Publishers (2001),
pp. 5-32.

[12] Rish, Irina (2001). An empirical study of the naive Bayes classifier. IJCAI Workshop on Empirical
Methods in AI.
[13] Breiman, L. [1998b] Randomizing Outputs To Increase Prediction Accuracy. Technical Report 518,
May 1998, Statistics Department, UCB (in press Machine Learning).
[14] Quinlan, J. R. 1986. Induction of Decision Trees. Mach. Learn. 1, 1 (Mar. 1986), 81-106.

APPENDIX – B
1. SOURCE CODE

Main.py:
# Standard library
import csv
import tkinter as tk
import tkinter.font as font
import tkinter.messagebox as tm
import tkinter.ttk as ttk
from tkinter import Message, Text, filedialog

# Third-party libraries
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from keras.models import load_model
from PIL import Image, ImageTk
from sklearn.preprocessing import StandardScaler

# Project modules, one per algorithm
import NeuralNetwork as NN
import RFALG as RF
import DTALG as DT
import Hybrid as hy
import predict as pr

def clear():
    """Empty every entry field on the form."""
    print("Clear1")
    for entry in (txt, txt1, txt2, txt3, txt4, txt5, txt6, txt7, txt8, txt9,
                  txt10, txt11, txt12, txt13, txt14):
        entry.delete(0, 'end')

window = tk.Tk()

window.title("Heart Disease Prediction")

window.geometry('1280x720')
bgcolor="#ffe6e6"
bgcolor1="#e60000"
fgcolor="#660000"

window.configure(background="#ffe6e6")
#window.attributes('-fullscreen', True)

window.grid_rowconfigure(0, weight=1)
window.grid_columnconfigure(0, weight=1)

# Title banner
message1 = tk.Label(window, text="Heart Disease Prediction", bg=bgcolor, fg=fgcolor,
                    width=50, height=2, font=('times', 30, 'italic bold underline'))
message1.place(x=100, y=20)

# Dataset path
lbl = tk.Label(window, text="Data Set", width=15, height=1, fg=fgcolor, bg=bgcolor,
               font=('times', 15, ' bold '))
lbl.place(x=350, y=150)

txt = tk.Entry(window, width=15, bg=bgcolor, fg=fgcolor, font=('times', 15, ' bold '))
txt.place(x=600, y=150)

# Input attributes, left column
lbl1 = tk.Label(window, text="Age", width=15, height=1, fg=fgcolor, bg=bgcolor,
                font=('times', 15, ' bold '))
lbl1.place(x=50, y=200)

txt1 = tk.Entry(window, width=15, bg=bgcolor, fg=fgcolor, font=('times', 15, ' bold '))
txt1.place(x=300, y=200)

lbl2 = tk.Label(window, text="Sex", width=15, height=1, fg=fgcolor, bg=bgcolor,
                font=('times', 15, ' bold '))
lbl2.place(x=50, y=250)

txt2 = tk.Entry(window, width=15, bg=bgcolor, fg=fgcolor, font=('times', 15, ' bold '))
txt2.place(x=300, y=250)

lbl3 = tk.Label(window, text="CP", width=15, height=1, fg=fgcolor, bg=bgcolor,
                font=('times', 15, ' bold '))
lbl3.place(x=50, y=300)

txt3 = tk.Entry(window, width=15, bg=bgcolor, fg=fgcolor, font=('times', 15, ' bold '))
txt3.place(x=300, y=300)

lbl4 = tk.Label(window, text="tresp", width=15, height=1, fg=fgcolor, bg=bgcolor,
                font=('times', 15, ' bold '))
lbl4.place(x=50, y=350)

txt4 = tk.Entry(window, width=15, bg=bgcolor, fg=fgcolor, font=('times', 15, ' bold '))
txt4.place(x=300, y=350)

lbl5 = tk.Label(window, text="chol", width=15, height=1, fg=fgcolor, bg=bgcolor,
                font=('times', 15, ' bold '))
lbl5.place(x=50, y=400)

txt5 = tk.Entry(window, width=15, bg=bgcolor, fg=fgcolor, font=('times', 15, ' bold '))
txt5.place(x=300, y=400)

lbl6 = tk.Label(window, text="fbs", width=15, height=1, fg=fgcolor, bg=bgcolor,
                font=('times', 15, ' bold '))
lbl6.place(x=50, y=450)

txt6 = tk.Entry(window, width=15, bg=bgcolor, fg=fgcolor, font=('times', 15, ' bold '))
txt6.place(x=300, y=450)

lbl7 = tk.Label(window, text="restecg", width=15, height=1, fg=fgcolor, bg=bgcolor,
                font=('times', 15, ' bold '))
lbl7.place(x=50, y=500)

txt7 = tk.Entry(window, width=15, bg=bgcolor, fg=fgcolor, font=('times', 15, ' bold '))
txt7.place(x=300, y=500)

# Input attributes, right column
lbl8 = tk.Label(window, text="thalach", width=15, height=1, fg=fgcolor, bg=bgcolor,
                font=('times', 15, ' bold '))
lbl8.place(x=600, y=200)

txt8 = tk.Entry(window, width=15, bg=bgcolor, fg=fgcolor, font=('times', 15, ' bold '))
txt8.place(x=850, y=200)

lbl9 = tk.Label(window, text="exang", width=15, height=1, fg=fgcolor, bg=bgcolor,
                font=('times', 15, ' bold '))
lbl9.place(x=600, y=250)

txt9 = tk.Entry(window, width=15, bg=bgcolor, fg=fgcolor, font=('times', 15, ' bold '))
txt9.place(x=850, y=250)

lbl10 = tk.Label(window, text="oldpeak", width=15, height=1, fg=fgcolor, bg=bgcolor,
                 font=('times', 15, ' bold '))
lbl10.place(x=600, y=300)

txt10 = tk.Entry(window, width=15, bg=bgcolor, fg=fgcolor, font=('times', 15, ' bold '))
txt10.place(x=850, y=300)

lbl11 = tk.Label(window, text="slope", width=15, height=1, fg=fgcolor, bg=bgcolor,
                 font=('times', 15, ' bold '))
lbl11.place(x=600, y=350)

txt11 = tk.Entry(window, width=15, bg=bgcolor, fg=fgcolor, font=('times', 15, ' bold '))
txt11.place(x=850, y=350)

lbl12 = tk.Label(window, text="ca", width=15, height=1, fg=fgcolor, bg=bgcolor,
                 font=('times', 15, ' bold '))
lbl12.place(x=600, y=400)

txt12 = tk.Entry(window, width=15, bg=bgcolor, fg=fgcolor, font=('times', 15, ' bold '))
txt12.place(x=850, y=400)

lbl13 = tk.Label(window, text="thal", width=15, height=1, fg=fgcolor, bg=bgcolor,
                 font=('times', 15, ' bold '))
lbl13.place(x=600, y=450)

txt13 = tk.Entry(window, width=15, bg=bgcolor, fg=fgcolor, font=('times', 15, ' bold '))
txt13.place(x=850, y=450)

# Prediction output
lbl14 = tk.Label(window, text="Predicted Value", width=15, height=1, fg=fgcolor, bg=bgcolor,
                 font=('times', 15, ' bold '))
lbl14.place(x=600, y=500)

txt14 = tk.Entry(window, width=25, bg=bgcolor, fg=fgcolor, font=('times', 15, ' bold '))
txt14.place(x=850, y=500)

# Example-value hints shown next to the left-column inputs
elbl1 = tk.Label(window, text="Ex:60", width=7, height=1, fg=fgcolor, bg=bgcolor,
                 font=('times', 15, ' bold '))
elbl1.place(x=500, y=200)

# The data dictionary codes sex as 1 for male, so the hint reads 1-M/0-F
elbl2 = tk.Label(window, text="1-M/0-F", width=7, height=1, fg=fgcolor, bg=bgcolor,
                 font=('times', 15, ' bold '))
elbl2.place(x=500, y=250)

elbl3 = tk.Label(window, text="Ex:1-4", width=7, height=1, fg=fgcolor, bg=bgcolor,
                 font=('times', 15, ' bold '))
elbl3.place(x=500, y=300)

elbl4 = tk.Label(window, text="Ex:90-180", width=7, height=1, fg=fgcolor, bg=bgcolor,
                 font=('times', 15, ' bold '))
elbl4.place(x=500, y=350)

elbl5 = tk.Label(window, text="Ex:180-320", width=7, height=1, fg=fgcolor, bg=bgcolor,
                 font=('times', 15, ' bold '))
elbl5.place(x=500, y=400)

elbl6 = tk.Label(window, text="Ex:0/1", width=7, height=1, fg=fgcolor, bg=bgcolor,
                 font=('times', 15, ' bold '))
elbl6.place(x=500, y=450)

elbl7 = tk.Label(window, text="Ex:0/2", width=7, height=1, fg=fgcolor, bg=bgcolor,
                 font=('times', 15, ' bold '))
elbl7.place(x=500, y=500)

# Example-value hints shown next to the right-column inputs
elbl8 = tk.Label(window, text="Ex:90-200", width=7, height=1, fg=fgcolor, bg=bgcolor,
                 font=('times', 15, ' bold '))
elbl8.place(x=1020, y=200)

elbl9 = tk.Label(window, text="Ex:0/1", width=7, height=1, fg=fgcolor, bg=bgcolor,
                 font=('times', 15, ' bold '))
elbl9.place(x=1020, y=250)

elbl10 = tk.Label(window, text="Ex:0.0-4.0", width=7, height=1, fg=fgcolor, bg=bgcolor,
                  font=('times', 15, ' bold '))
elbl10.place(x=1020, y=300)

elbl11 = tk.Label(window, text="Ex:1-3", width=7, height=1, fg=fgcolor, bg=bgcolor,
                  font=('times', 15, ' bold '))
elbl11.place(x=1020, y=350)

elbl12 = tk.Label(window, text="Ex:0-3", width=7, height=1, fg=fgcolor, bg=bgcolor,
                  font=('times', 15, ' bold '))
elbl12.place(x=1020, y=400)

elbl13 = tk.Label(window, text="Ex:3/6/7", width=7, height=1, fg=fgcolor, bg=bgcolor,
                  font=('times', 15, ' bold '))
elbl13.place(x=1020, y=450)

def browse():
    """Ask for a dataset file and show its path in the Data Set field."""
    path = filedialog.askopenfilename()
    print(path)
    txt.delete(0, 'end')
    txt.insert('end', path)
    if path == "":
        tm.showinfo("Input error", "Select Dataset")

def preprocess():
    """Normalise the dataset and plot the distribution of the target classes."""
    path = txt.get()
    if path != "":
        print("preprocess")
        # read synthetic cleveland dataset from full cleveland.data
        df_main = pd.read_csv(path)

        # Neural Net transfer function likes to work with floats;
        # astype returns a new frame, so the result has to be assigned
        df_main = df_main.astype(float)

        # Normalize values to range [0:1]
        df_main /= df_main.max()

        # split data into independent and dependent variables
        y_all = df_main['num']
        X_all = df_main.drop(columns='num')

        fig, axs = plt.subplots(nrows=1, ncols=1, sharey=False, figsize=(10, 5))

        # plot histograms
        #axs.set_xlabel('No Heart Disease Heart Disease')
        axs.set_title('DataSet')
        axs.grid()
        axs.hist(y_all)
        # give selected histogram bars their own colour
        axs.get_children()[0].set_color('g')
        axs.get_children()[2].set_color('c')
        axs.get_children()[5].set_color('b')
        axs.get_children()[7].set_color('y')
        axs.get_children()[9].set_color('r')
        fig.savefig('results/Preprocess.png')
        plt.pause(5)
        plt.show(block=False)
        plt.close()
        tm.showinfo("Input", "Preprocess Successfully Finished")
    else:
        tm.showinfo("Input error", "Select Dataset")

def RFprocess():
    """Run the random forest module on the selected dataset."""
    sym = txt.get()
    if sym != "":
        RF.process(sym)
        tm.showinfo("Input", "RandomForest Successfully Finished")
    else:
        tm.showinfo("Input error", "Select Dataset")

def DTprocess():
    """Run the decision tree module on the selected dataset."""
    sym = txt.get()
    if sym != "":
        DT.process(sym)
        print("DT")
        tm.showinfo("Input", "DT Successfully Finished")
    else:
        tm.showinfo("Input error", "Select Dataset")

def hybridmodel():
    """Run the hybrid module on the selected dataset."""
    sym = txt.get()
    if sym != "":
        hy.process(sym)
        tm.showinfo("Input", "Hybrid Successfully Finished")
    else:
        tm.showinfo("Input error", "Select Dataset")

def predictprocess():
    """Validate the 13 inputs and display the predicted class."""
    print("predict")
    txt14.delete(0, 'end')
    #txt1.insert('end', "60")
    fields = [("Age", txt1), ("Sex", txt2), ("Cp", txt3), ("tresp", txt4),
              ("Chol", txt5), ("fbs", txt6), ("restecg", txt7), ("thalach", txt8),
              ("exang", txt9), ("oldpeak", txt10), ("slope", txt11), ("ca", txt12),
              ("thal", txt13)]
    values = [entry.get() for _, entry in fields]

    # report the first empty field, in form order, and stop
    for (name, _), value in zip(fields, values):
        if value == "":
            tm.showinfo("Insert error", "Enter " + name)
            return

    #new_pred = model.predict_classes(np.array([[58,1,3,112,230,0,2,165,0,2.5,2,1,7]])) #4
    new_pred = pr.process(values)
    res = int(new_pred[0])
    print(res)

    if res == 0:
        txt14.insert('end', "No Heart Disease")
    else:
        txt14.insert('end', "Heart Disease Possible")

# Dataset selection and form reset
br = tk.Button(window, text="Browse", command=browse, fg=fgcolor, bg=bgcolor1,
               width=10, height=1, activebackground="Red", font=('times', 15, ' bold '))
br.place(x=800, y=140)

clearButton = tk.Button(window, text="Clear", command=clear, fg=fgcolor, bg=bgcolor1,
                        width=10, height=1, activebackground="Red", font=('times', 15, ' bold '))
clearButton.place(x=950, y=140)

# Action buttons along the bottom of the window
process = tk.Button(window, text="Preprocess", command=preprocess, fg=fgcolor, bg=bgcolor1,
                    width=17, height=2, activebackground="Red", font=('times', 15, ' bold '))
process.place(x=50, y=600)

DTbutton = tk.Button(window, text="Decision Tree", command=DTprocess, fg=fgcolor, bg=bgcolor1,
                     width=17, height=2, activebackground="Red", font=('times', 15, ' bold '))
DTbutton.place(x=250, y=600)

RFbutton = tk.Button(window, text="Random Forest", command=RFprocess, fg=fgcolor, bg=bgcolor1,
                     width=17, height=2, activebackground="Red", font=('times', 15, ' bold '))
RFbutton.place(x=460, y=600)

HYbutton = tk.Button(window, text="Hybrid", command=hybridmodel, fg=fgcolor, bg=bgcolor1,
                     width=17, height=2, activebackground="Red", font=('times', 15, ' bold '))
HYbutton.place(x=650, y=600)

predict = tk.Button(window, text="Predict", command=predictprocess, fg=fgcolor, bg=bgcolor1,
                    width=17, height=2, activebackground="Red", font=('times', 15, ' bold '))
predict.place(x=850, y=600)

quitWindow = tk.Button(window, text="Quit", command=window.destroy, fg=fgcolor, bg=bgcolor1,
                       width=17, height=2, activebackground="Red", font=('times', 15, ' bold '))
quitWindow.place(x=1060, y=600)

window.mainloop()

RFALG.py
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, mean_absolute_error,
                             mean_squared_error, r2_score)
from sklearn.model_selection import train_test_split

def process(path):
    """Train a random forest on the dataset at `path` and save its results."""
    dataset = pd.read_csv(path)
    X = dataset.iloc[:, 0:13].values   # the 13 input attributes
    y = dataset.iloc[:, 13].values     # the class label

    X_train, X_test, y_train, y_test = train_test_split(X, y)

    model2 = RandomForestClassifier()
    model2.fit(X_train, y_train)
    y_pred = model2.predict(X_test)
    print("predicted")
    print(y_pred)
    print("test")
    print(y_test)

    # save the predictions, one row per test record
    with open("results/resultRF.csv", "w") as result2:
        result2.write("ID,Predicted Value" + "\n")
        for j in range(len(y_pred)):
            result2.write(str(j+1) + "," + str(y_pred[j]) + "\n")

    mse = mean_squared_error(y_test, y_pred)
    mae = mean_absolute_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    rms = np.sqrt(mse)
    ac = accuracy_score(y_test, y_pred)

    print(" ")
    print("MSE VALUE FOR RandomForest IS %f " % mse)
    print("MAE VALUE FOR RandomForest IS %f " % mae)
    print("R-SQUARED VALUE FOR RandomForest IS %f " % r2)
    print("RMSE VALUE FOR RandomForest IS %f " % rms)
    print("ACCURACY VALUE RandomForest IS %f" % ac)
    print(" ")

    # save the metrics so that they can be plotted
    with open('results/RFMetrics.csv', 'w') as result2:
        result2.write("Parameter,Value" + "\n")
        result2.write("MSE" + "," + str(mse) + "\n")
        result2.write("MAE" + "," + str(mae) + "\n")
        result2.write("R-SQUARED" + "," + str(r2) + "\n")
        result2.write("RMSE" + "," + str(rms) + "\n")
        result2.write("ACCURACY" + "," + str(ac) + "\n")

    df = pd.read_csv('results/RFMetrics.csv')
    acc = df["Value"]
    alc = df["Parameter"]
    colors = ["#1f77b4", "#ff7f0e", "#2ca02c", "#d62728", "#8c564b"]

    # bar chart with one bar per metric
    fig = plt.figure()
    plt.bar(alc, acc, color=colors)
    plt.xlabel('Parameter')
    plt.ylabel('Value')
    plt.title(' Random Forest Metrics Value')
    fig.savefig('results/RFMetricsValue.png')
    plt.pause(5)
    plt.show(block=False)
    plt.close()

2. SCREEN SHOTS
The following screen shows the application home page

The following screen shows the result of pre-process

The following screen shows the result of decision tree algorithm

The following screen shows the results of random forest algorithm

The following screen shows the results of Hybrid model
