
USAGE OF DATA MINING TECHNIQUES

IN PREDICTING HEART DISEASES –


DECISION TREE & RANDOM FOREST
ALGORITHMS

Under the esteemed guidance of: Mrs. V. LAKSHMI, Assistant Professor

Presented by: K. ANITHA, 16131F0022

1
ABSTRACT

Nowadays, heart disease is one of the main causes of death among all diseases. Due to the lack of resources in the medical field, the prediction of heart disease is a major problem. For early diagnosis and treatment, classification algorithms such as the Decision Tree and Random Forest algorithms are used. These data mining techniques are compared on accuracy in order to predict heart disease.

The main aim is to predict heart disease based on the dataset values. Heart disease takes various forms, such as coronary heart disease, heart attack, arrhythmia, and heart failure. The architecture consists of three steps. First, a dataset of 13 attributes is collected. Second, the classification techniques are applied to the dataset using the Decision Tree and Random Forest algorithms. Finally, the accuracy of both algorithms is collected, and the better algorithm for predicting heart disease is identified.
2
INTRODUCTION

Many new data mining techniques and algorithms have been used for predicting heart disease. Data mining has proved its efficiency in many areas by achieving improved accuracy and performance, especially in the medical field. Classification is one of the data mining techniques used to predict group membership for data instances.

The Decision Tree algorithm is a supervised algorithm which solves problems by using a tree-like representation. Each internal node of the tree represents an attribute and each leaf node represents a class label.

The Random Forest algorithm is a supervised algorithm. As the name suggests, the algorithm creates a forest with a number of trees.
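
As a small illustration of the tree-like representation (a sketch, not part of the original slides), the toy tree below is encoded as a nested Python dictionary: each internal node tests an attribute, each branch corresponds to an attribute value, and each leaf holds a class label. The attribute names and split values here are made up for illustration only.

```python
# A minimal sketch of the tree-like representation described above.
# Internal nodes test an attribute, leaves hold a class label.
# The attributes and split values are illustrative only.

toy_tree = {
    "cp": {                      # internal node: chest pain type
        0: "No Heart Disease",   # leaf: class label
        1: {
            "chol>240": {        # internal node: illustrative threshold test
                True: "Heart Disease",
                False: "No Heart Disease",
            }
        },
    }
}

def predict(tree, instance):
    """Walk from the root to a leaf, following the instance's attribute values."""
    if not isinstance(tree, dict):
        return tree                      # reached a leaf (class label)
    attribute = next(iter(tree))         # attribute tested at this node
    branch = tree[attribute][instance[attribute]]
    return predict(branch, instance)

print(predict(toy_tree, {"cp": 1, "chol>240": True}))   # -> Heart Disease
```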

3
INTRODUCTION (CONT.)

Heart disease is seen in all classes of people in recent times. Cardiovascular disease is the leading cause of death, so continued efforts are being made to predict the possibility of getting heart disease. Two types of cardiovascular disease are most common in people with diabetes:
1. Coronary artery disease
2. Cerebral vascular disease

Some symptoms of heart disease are pain in the chest, shoulders, arms or jaw, shortness of breath, giddiness and nausea. The major risk factors for coronary heart disease are high blood pressure and diabetes, which may weaken the heart.

4
EXISTING SYSTEM

Several techniques are used to detect heart disease based on a large number of attributes. In order to reduce the attributes of a large dataset, a KNN classification approach can be used to shorten the attribute list, but it takes a long time to perform the classification, and the reduced attribute set may not give accurate predictions of heart disease for a particular dataset. Naïve Bayes is one of the most popular classification algorithms used in data mining. Using the Naïve Bayes algorithm, heart disease can be diagnosed from the list of attributes by calculating a yes/no probability. Its disadvantages, however, are its limited accuracy and its strong feature-independence assumption.

5
PROPOSED SYSTEM

In the proposed system, the Decision Tree and Random Forest algorithms are used to predict heart disease. A decision tree takes the form of a tree structure that depicts the decisions made by calculating entropy and information gain.

The Random Forest algorithm increases the predictive power and also helps to prevent over-fitting. It is an ensemble of randomized decision trees: each decision tree gives a vote for the prediction of the target variable, and the random forest chooses the prediction that gets the most votes. Because the system uses multiple random decision trees, it achieves better accuracy.
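
A minimal sketch of how such an accuracy comparison could be run with scikit-learn (not the project's actual code): the file name heart.csv, the target column num and the train/test split below are assumptions to be adapted to the real dataset.

```python
# Minimal sketch: compare a single decision tree against a random forest
# on the heart-disease dataset. The file name "heart.csv" and the target
# column "num" are assumptions; adjust them to the actual dataset.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

data = pd.read_csv("heart.csv")              # 13 attributes + target
X = data.drop(columns=["num"])               # predictor attributes
y = data["num"]                              # 1 = heart disease, 0 = no heart disease
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

tree = DecisionTreeClassifier(criterion="entropy", random_state=42)
forest = RandomForestClassifier(n_estimators=100, random_state=42)

for name, model in [("Decision Tree", tree), ("Random Forest", forest)]:
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"{name} accuracy: {acc:.3f}")
```

Training both models on the same split keeps the two accuracy figures directly comparable.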

6
ALGORITHM AND IMPLEMENTATION

7
DECISION TREE ALGORITHM
Step 1: Compute the entropy of the dataset.

Step 2: For every attribute in the dataset:

Step 2.1: Calculate the entropy for all of its categorical values.

Step 2.2: Take the average information entropy for the current attribute.

Step 2.3: Calculate the information gain for the current attribute.

Step 3: Pick the highest-gain attribute and make it the root node for further classification.

Step 4: Repeat on each branch until the desired tree is obtained.

Step 5: End
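
These steps roughly correspond to what scikit-learn's DecisionTreeClassifier does when criterion="entropy" is selected. The sketch below (toy values only, not the project's dataset) fits such a tree and prints the splits that result from repeatedly picking the highest-gain attribute:

```python
# Rough sketch: fit an entropy-based decision tree on a tiny toy table and
# print the tree that results from repeatedly picking the highest-gain split.
# The toy values below are illustrative, not the project's dataset.
from sklearn.tree import DecisionTreeClassifier, export_text

# columns: cp, chol, slope ; target: 1 = heart disease, 0 = none
X = [[1, 250, 2], [1, 230, 2], [0, 200, 1], [0, 180, 1],
     [1, 260, 3], [0, 210, 2], [1, 240, 1], [0, 190, 3]]
y = [1, 1, 0, 0, 1, 0, 1, 0]

clf = DecisionTreeClassifier(criterion="entropy", random_state=0)  # Steps 1-4
clf.fit(X, y)

print(export_text(clf, feature_names=["cp", "chol", "slope"]))     # the learned tree
```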

8
DECISION TREE ALGORITHM (CONT.)
Entropy:

$H(S) = \sum_{c \in C} -p(c)\,\log_2 p(c)$

where
S – the current dataset for which entropy is calculated,
C – the set of classes in S,
p(c) – the proportion of the number of elements in class c to the number of elements in set S.

Information Gain:

$IG(A, S) = H(S) - \sum_{t \in T} p(t)\,H(t)$

where
H(S) – the entropy of set S,
T – the subsets created by splitting set S on attribute A,
p(t) – the proportion of the number of elements in t to the number of elements in set S,
H(t) – the entropy of subset t.
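
A direct Python rendering of these two formulas (a sketch, not the project's code; the toy label and attribute lists are illustrative):

```python
# Direct implementation of H(S) and IG(A, S) as defined above.
from collections import Counter
from math import log2

def entropy(labels):
    """H(S) = -sum over classes c of p(c) * log2(p(c))."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(attribute_values, labels):
    """IG(A, S) = H(S) - sum over subsets t (one per attribute value) of p(t) * H(t)."""
    total = len(labels)
    remainder = 0.0
    for value in set(attribute_values):
        subset = [lbl for a, lbl in zip(attribute_values, labels) if a == value]
        remainder += (len(subset) / total) * entropy(subset)
    return entropy(labels) - remainder

# Toy illustration: 7 "yes" and 8 "no" labels split on a binary attribute
labels = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
attr   = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]   # illustrative attribute values
print(round(entropy(labels), 3))                  # ~0.997
print(round(information_gain(attr, labels), 3))
```

For instance, with 7 positive and 8 negative labels, entropy(labels) evaluates to about 0.997, matching the counts used in the worked example later in the slides.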

9
RANDOM FOREST ALGORITHM
Step 1: Randomly select "k" features from the total "m" features, where k << m.
Step 2: Among the "k" features, calculate the node "d" using the best split point.
Step 3: Split the node into child nodes using the best split.
Step 4: Build the forest by repeating steps 1 to 3 "n" times to create "n" trees.
Step 5: Take the test features and use the rules of each randomly created tree to predict the outcome, and store the predicted outcome (target).
Step 6: Calculate the votes for each predicted target.
Step 7: Consider the highest-voted predicted target as the final prediction of the random forest algorithm.
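
A compact sketch of these steps (not the project's code): bootstrap row sampling plus a random subset of k features per split, with scikit-learn's DecisionTreeClassifier as the base tree and a simple majority vote. The toy data stands in for the 13-attribute heart dataset.

```python
# Sketch of the random-forest steps: bootstrap the rows, restrict each tree to a
# random subset of k features per split (max_features), then take a majority vote.
# Toy data; in the project this would be the 13-attribute heart dataset.
import numpy as np
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 13))                  # 100 rows, m = 13 features
y = (X[:, 0] + X[:, 2] > 0).astype(int)         # toy target

n_trees, k = 25, 4                              # n trees, k << m features per split
forest = []
for _ in range(n_trees):
    rows = rng.integers(0, len(X), size=len(X))          # bootstrap sample of rows
    tree = DecisionTreeClassifier(max_features=k)        # Steps 1-3: k random features per split
    tree.fit(X[rows], y[rows])
    forest.append(tree)                                  # Step 4: build the forest

def forest_predict(x):
    votes = [int(tree.predict(x.reshape(1, -1))[0]) for tree in forest]  # Step 5
    return Counter(votes).most_common(1)[0][0]                           # Steps 6-7: majority vote

print(forest_predict(X[0]), y[0])
```

scikit-learn's RandomForestClassifier packages essentially this bagging-and-voting idea (it averages the trees' predicted class probabilities rather than counting hard votes).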

10
COMPLETE ATTRIBUTES INFORMATION
1 id: patient identification number
2 ccf: social security number (I replaced this with a dummy value of 0)
3 age: age in years
4 sex: sex (1 = male; 0 = female)
5 painloc: chest pain location (1 = substernal; 0 = otherwise)
6 painexer (1 = provoked by exertion; 0 = otherwise)
7 relrest (1 = relieved after rest; 0 = otherwise)
8 pncaden (sum of 5, 6, and 7)
9 cp: chest pain type
   -- Value 1: typical angina
   -- Value 2: non-anginal pain
   -- Value 3: asymptomatic
10 trestbps: resting blood pressure (in mm Hg on admission to the hospital)
11 htn
12 chol: serum cholesterol in mg/dl
13 smoke: I believe this is 1 = yes; 0 = no (is or is not a smoker)
14 cigs (cigarettes per day)
15 years (number of years as a smoker)
16 fbs: fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
17 dm (1 = history of diabetes; 0 = no such history)
18 famhist: family history of coronary artery disease (1 = yes; 0 = no)
19 restecg: resting electrocardiographic results
   -- Value 0: normal
   -- Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
   -- Value 2: showing probable or definite left ventricular hypertrophy
20 ekgmo (month of exercise ECG reading)
21 ekgday (day of exercise ECG reading)
22 ekgyr (year of exercise ECG reading)
23 dig (digitalis used during exercise ECG: 1 = yes; 0 = no)
24 prop (beta blocker used during exercise ECG: 1 = yes; 0 = no)
25 nitr (nitrates used during exercise ECG: 1 = yes; 0 = no)
26 pro (calcium channel blocker used during exercise ECG: 1 = yes; 0 = no)
27 diuretic (diuretic used during exercise ECG: 1 = yes; 0 = no)
28 proto: exercise protocol
29 thaldur: duration of exercise test in minutes
30 thaltime: time when ST measure depression was noted
31 met: mets achieved
32 thalach: maximum heart rate achieved
33 thalrest: resting heart rate
11
COMPLETE ATTRIBUTES INFORMATION (CONT.)
34 tpeakbps: peak exercise blood pressure (first of 2 parts)
35 tpeakbpd: peak exercise blood pressure (second of 2 parts)
36 dummy
37 trestbpd: resting blood pressure
38 exang: exercise induced angina (1 = yes; 0 = no)
39 xhypo: (1 = yes; 0 = no)
40 oldpeak: ST depression induced by exercise relative to rest
41 slope: the slope of the peak exercise ST segment
   -- Value 1: upsloping
   -- Value 2: flat
   -- Value 3: downsloping
42 rldv5: height at rest
43 rldv5e: height at peak exercise
44 ca: number of major vessels (0-3) colored by fluoroscopy
45 restckm: irrelevant
46 exerckm: irrelevant
47 restef: rest radionuclide ejection fraction
48 restwm: rest wall motion abnormality
49 exeref: exercise radionuclide ejection fraction
50 exerwm: exercise wall motion
51 thal: 0 = normal; 1 = fixed defect; 2 = reversible defect
52 thalsev: not used
53 thalpul: not used
54 earlobe: not used
55 cmo: month of cardiac cath
56 cday: day of cardiac cath
57 cyr: year of cardiac cath
58 num: diagnosis of heart disease (angiographic disease status)
   -- Value 0: < 50% diameter narrowing
   -- Value 1: > 50% diameter narrowing
   (in any major vessel: attributes 59 through 68 are vessels)
59 lmt
60 ladprox
61 laddist
62 diag
63 cxmain
64 ramus
65 om1
66 om2
67 rcaprox
68 rcadist
69 lvx1: not used
70 lvx2: not used
71 lvx3: not used
72 lvx4: not used
73 lvf: not used
74 cathef: not used
75 junk: not used
76 name: last name of patient
12
ATTRIBUTES USED TO PREDICT HEART DISEASES
1. #3 (age)
2. #4 (gender)
3. #9 (cp)
4. #12 (chol)
5. #16 (fbs)
6. #19 (restecg)
7. #32 (thalach) - maximum heart rate achieved
8. #38 (exang)
9. #40 (oldpeak)
10. #41 (slope)
11. #44 (ca)
12. #51 (thal)
13. #58 (num) (the predicted attribute)
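
A short sketch of selecting exactly these 13 attributes with pandas (not the project's code; the file name cleveland.csv, the column spellings, and the 0/1 collapse of num are assumptions about how the raw data is stored):

```python
# Sketch: keep only the 13 attributes listed above (12 predictors + "num" target).
# The file name and column spellings are assumptions about the raw data layout.
import pandas as pd

COLUMNS = ["age", "sex", "cp", "chol", "fbs", "restecg", "thalach",
           "exang", "oldpeak", "slope", "ca", "thal", "num"]

data = pd.read_csv("cleveland.csv")
data = data[COLUMNS].apply(pd.to_numeric, errors="coerce").dropna()  # drop rows with "?" / missing values
data["num"] = (data["num"] > 0).astype(int)    # ensure a 0/1 target (assumption: raw file may code severity)

X, y = data.drop(columns=["num"]), data["num"]
print(X.shape, y.value_counts().to_dict())
```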
13
EXAMPLE
Sample Data:

From the above sample data, out of 15 instances, 7 say yes (1) and 8 say no (0).
The entropy is therefore
H(S) = -(7/15) log2(7/15) - (8/15) log2(8/15) ≈ 0.997
The highest information gain value is 0.73, obtained for CP.
As CP has the highest information gain value, it is selected as the root node.

14
EXAMPLE (CONT.)

15
EXAMPLE (CONT.)
Consider the first instance of the dataset, where cp = 1, slope = 2 and chol > 240.
Final Result: Patient with Heart Disease.

Consider the second instance of the dataset, where cp = 1, slope = 2, chol < 240 and restecg = 1.
Final Result: Patient with Heart Disease.

Consider the 15th instance of the dataset, where cp = 0, sex = 1 and age < 40.
Final Result: Patient with No Heart Disease.

Similarly, new instances can be classified using the tree built from the above dataset.

16
RANDOM FOREST EXAMPLE

Tree - 1 Tree - 2 Tree - 3

17
RANDOM FOREST EXAMPLE (CONT.)
For the second instance in the dataset:
Tree - 1 Result: Yes
Tree – 2 Result: Yes
Tree – 3 Result: Yes
Prediction Result : Patient with Heart Disease.

For the 13th instance in the dataset:


Tree - 1 Result: Yes
Tree – 2 Result: No
Tree – 3 Result: No
Prediction Result : Patient with No Heart Disease.

18
SYSTEM REQUIREMENTS AND
SPECIFICATION

19
Functional Requirements:
The functional requirements of a system define the functions of a software system or its
components. A function is described in terms of a set of inputs, the behavior of the system, and outputs.

Initial Input :

Dataset with High Dimensional Attributes

Intermediate Input :

Cleaned Dataset

Number of attributes to be filtered

Output :

Predicted Results

20
Non-functional requirements:
Non-functional requirements deal with the characteristics of the system which cannot be expressed as

functions, such as the maintainability, portability and usability of the system.

i) Usability: The project can be used to group patients into heart-disease-affected and not-affected

categories.

ii) Reliability: The system can handle large comma-separated (CSV) text files.

iii) Performance: Classification and prediction are done efficiently.

iv) Supportability: Supported on any OS with Python installed.

v) Implementation: Implemented in Python using PyCharm and Anaconda.

21
SOFTWARE REQUIREMENTS

Operating System : Windows 7 and above

Programming language : Python 3

Environment : NetBeans 8.1 and PyCharm.

22
HARDWARE REQUIREMENTS

Processor : Pentium IV or above


RAM : 512 MB
Hard disk : 20 GB
Monitor : User choice

23
SYSTEM DESIGN

24
Data Flow Diagram:

25
UML DIAGRAMS

26
Use Case Diagram:

27
Sequence Diagram:

Sequence diagram for Upload:

28
Sequence diagram for Decision Tree:

29
Sequence diagram for Decision Tree:

30
Sequence diagram for class segment:

31
Activity Diagram:

32
OUTPUT SCREENS

33
Home Page:

34
Uploading Dataset:

35
Dataset:

36
Dataset Result:

37
Home Page:

38
Prediction of Heart Disease by taking user data:

39
Prediction of Heart Disease by taking user data:

40
Prediction of Heart Disease by taking user data:

41
Accuracy Result:

42
Conclusion:

Prediction of heart disease using data mining approaches (the Decision Tree and Random Forest algorithms) gives higher efficiency and reduces complexity through attribute reduction. The Random Forest algorithm performs well and classifies the heart disease dataset into two classes better than traditional methods.
The proposed work reduces the cost of various medical tests and helps patients take precautionary measures well in advance. In the future, the same method can also be applied to predicting and diagnosing other types of disease.

43
THANK YOU…

44
