
Presentation on Decision Trees

and Random Forest


Presented By
Sal Saad Al Deen Taher
ID: 20ME128
Part 1- Decision Trees
Decision Tree
• A decision tree is a decision support tool that uses a tree-like graph or model of
decisions and their possible consequences, including chance event outcomes,
resource costs, and utility.
• One way to display an algorithm that only contains conditional control statements.
• A decision tree is a flowchart-like structure in which each internal node represents
a “test” on an attribute (e.g. whether a coin flip comes up heads or tails), each
branch represents the outcome of the test, and each leaf node represents a class
label (decision taken after computing all attributes).
• Decision Tree algorithms are often referred to as CART (Classification and Regression
Trees).
“The possible solutions to a given problem emerge as the leaves of a tree, each node
representing a point of deliberation and decision.”
- Niklaus Wirth (1934 — ), Programming language designer
Example of Decision Tree
Training data:

AGE   MARRIED   CHILDREN   CAR CHOICE
31    No        No         Sports
29    No        No         Sports
33    Yes       No         Sedan
35    Yes       No         Sedan
40    Yes       Yes        Minivan
25    No        No         Sports
36    Yes       No         Sedan
35    Yes       Yes        ?

The resulting tree (root node at the top, leaves at the bottom):

• Root node: "Over 30 Years?"
  • NO → Sports Car (leaf)
  • YES → "Married?" (decision/internal node)
    • NO → Sports Car (leaf)
    • YES → "Have Children?" (decision/internal node)
      • YES → Minivan Car (leaf)
      • NO → Sedan Car (leaf)

The diagram also labels the parts of the tree: the root node, parent and child nodes,
decision/internal nodes, terminal/leaf nodes, and branches/sub-trees.
Common Terms of A Decision Tree
• Root Node: It represents the entire population or sample, which further gets
divided into two or more homogeneous sets.
• Splitting: The process of dividing a node into two or more sub-nodes.
• Decision Node: When a sub-node splits into further sub-nodes, it is called a
decision node.
• Leaf/Terminal Node: Nodes that do not split are called leaf or terminal nodes.
• Pruning: Removing sub-nodes of a decision node is called pruning. It is the
opposite of splitting.
• Branch/Sub-Tree: A subsection of the entire tree is called a branch or sub-tree.
• Parent and Child Node: A node that is divided into sub-nodes is called the
parent node of those sub-nodes, and the sub-nodes are its children.
Decision Tree Algorithm:
• Data may consist of categorical data, numeric data, or a combination of both. Let's
look at some categorical data and try to build a decision tree.
Outlook Temperature Humidity Wind Played Golf
Sunny Hot High Weak No
Sunny Hot High Weak No
Overcast Hot High Weak Yes
Rain Mild High Strong Yes
Rain Cool Normal Weak Yes
Rain Cool Normal Strong No
Overcast Cool Normal Strong Yes
Sunny Mild High Weak No
Sunny Cool Normal Weak Yes
Rain Mild Normal Weak Yes
Sunny Mild Normal Weak Yes
Overcast Mild High Strong Yes
Overcast Hot Normal Weak Yes
Rain Mild High Weak No
Decision Tree Algorithm
Counting how often golf was played for each value of each feature:

Played Golf (class variable): Yes 9, No 5, Total 14

Outlook      Yes  No  Total        Temperature  Yes  No  Total
Sunny          2   3      5        Hot            2   2      4
Overcast       4   0      4        Mild           4   2      6
Rain           3   2      5        Cool           3   1      4
Total                    14        Total                    14

Humidity     Yes  No  Total        Wind         Yes  No  Total
High           3   4      7        Weak           6   4     10
Normal         6   1      7        Strong         3   1      4
Total                    14        Total                    14

How to find the root node?
1. Gini Index
2. Entropy
Decision Tree Algorithm:
• Gini Index: The Gini Index is a measure of how often a randomly chosen element from the set
would be incorrectly labeled if it were randomly labeled according to the distribution of
labels in the subset.

Or simply, Gini Index = 1 − P("yes")² − P("no")²

Gini Index is lower bounded by 0, with 0 occurring if the data set contains only one class.
• Entropy: In machine learning, entropy is a measure of the randomness in the information
being processed. The higher the entropy, the harder it is to draw any conclusions from that
information.
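As a quick illustration (not part of the original slides), here is a minimal Python sketch of both measures, applied to the "Played Golf" class counts (9 Yes, 5 No):

```python
# Minimal sketch: Gini Index and entropy for a vector of class counts.
import numpy as np

def gini(counts):
    """Gini Index = 1 - sum(p_i^2) over the class probabilities."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(counts):
    """Entropy = -sum(p_i * log2(p_i)); empty classes are skipped."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

print(gini([9, 5]))     # ~0.459, the class-variable Gini used on the next slide
print(entropy([9, 5]))  # ~0.940
```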
Decision Tree Algorithm: Categorical Data
Played Golf (class variable): Yes 9, No 5, Total 14

Outlook      Yes  No  Total        Temperature  Yes  No  Total
Sunny          2   3      5        Hot            2   2      4
Overcast       4   0      4        Mild           4   2      6
Rain           3   2      5        Cool           3   1      4
Total                    14        Total                    14

Humidity     Yes  No  Total        Wind         Yes  No  Total
High           3   4      7        Weak           6   4     10
Normal         6   1      7        Strong         3   1      4
Total                    14        Total                    14

Gini Index = 1 − P("yes")² − P("no")²

Gini Index for the class variable = 1 − [(9/14)² + (5/14)²] = 0.459

Gini Index for Outlook = (5/14)·[1 − (2/5)² − (3/5)²] + (4/14)·[1 − (4/4)² − (0/4)²]
                       + (5/14)·[1 − (3/5)² − (2/5)²] = 0.342

Gini Index for Outlook     = 0.342
Gini Index for Temperature = 0.440
Gini Index for Humidity    = 0.367
Gini Index for Wind        = 0.450

Outlook has the lowest Gini Index, so Outlook should be the root node.
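For reference, a small sketch (an assumed implementation, not the slides' own code) that reproduces these per-feature weighted Gini values from the play-golf table with pandas:

```python
# Weighted Gini Index per feature for the play-golf table.
import pandas as pd

data = pd.DataFrame({
    "Outlook":     ["Sunny","Sunny","Overcast","Rain","Rain","Rain","Overcast",
                    "Sunny","Sunny","Rain","Sunny","Overcast","Overcast","Rain"],
    "Temperature": ["Hot","Hot","Hot","Mild","Cool","Cool","Cool",
                    "Mild","Cool","Mild","Mild","Mild","Hot","Mild"],
    "Humidity":    ["High","High","High","High","Normal","Normal","Normal",
                    "High","Normal","Normal","Normal","High","Normal","High"],
    "Wind":        ["Weak","Weak","Weak","Strong","Weak","Strong","Strong",
                    "Weak","Weak","Weak","Weak","Strong","Weak","Weak"],
    "PlayedGolf":  ["No","No","Yes","Yes","Yes","No","Yes",
                    "No","Yes","Yes","Yes","Yes","Yes","No"],
})

def weighted_gini(df, feature, target="PlayedGolf"):
    # Weighted sum of the Gini impurities of the subsets created by each feature value.
    total = len(df)
    score = 0.0
    for value, subset in df.groupby(feature):
        p = subset[target].value_counts(normalize=True)
        score += (len(subset) / total) * (1.0 - (p ** 2).sum())
    return score

for col in ["Outlook", "Temperature", "Humidity", "Wind"]:
    print(col, round(weighted_gini(data, col), 3))
# Matches the values on this slide up to rounding: Outlook ~0.343, Temperature ~0.440,
# Humidity ~0.367, Wind ~0.450.
```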
Decision Tree Algorithm
With Outlook at the root, the question "Is the Outlook Overcast?" splits the data:

• YES → all 4 Overcast rows have Played Golf = Yes (4 Yes, 0 No), so this branch is a leaf.
• NO → the 10 remaining Sunny and Rain rows are split further. Recomputing the Gini Index
  on this subset gives:
  Gini Index for Outlook     = 0.342
  Gini Index for Temperature = 0.267
  Gini Index for Humidity    = 0.229
  Gini Index for Wind        = 0.357
  Humidity has the lowest Gini Index, so the next question is "Is the Humidity Normal?"

Normal Humidity subset, split by "Is the Wind Weak?":
Outlook  Temperature  Humidity  Wind    Played
Rain     Cool         Normal    Weak    Yes
Rain     Cool         Normal    Strong  No
Sunny    Cool         Normal    Weak    Yes
Rain     Mild         Normal    Weak    Yes
Sunny    Mild         Normal    Weak    Yes
→ Wind Weak: 4 Yes, 0 No        Wind not Weak: 0 Yes, 1 No

High Humidity subset, split by "Is the Wind Strong?":
Outlook  Temperature  Humidity  Wind    Played
Sunny    Hot          High      Weak    No
Sunny    Hot          High      Weak    No
Rain     Mild         High      Strong  Yes
Sunny    Mild         High      Weak    No
Rain     Mild         High      Weak    No
→ Wind Strong: 1 Yes, 0 No      Wind not Strong: 0 Yes, 4 No
Practice Program:1
• Preferred Software: Jupyter Notebook from Anaconda Package
• Required File: play.csv, downloadable from Slack or the web server. Copy the
file location because it will be required to run the program.
• Required Modules: Numpy, Pandas, Scikit Learn etc.
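A hedged sketch of what Practice Program 1 might look like; it assumes play.csv has the columns shown in the table above (Outlook, Temperature, Humidity, Wind, Played Golf), so adjust the file path and column names to your copy:

```python
# Sketch: fit a decision tree on the categorical play-golf data.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("play.csv")                           # use your own file location here
X = pd.get_dummies(df.drop(columns=["Played Golf"]))   # one-hot encode the categorical features
y = df["Played Golf"]                                  # column name assumed; check df.columns

clf = DecisionTreeClassifier(criterion="gini", random_state=0)
clf.fit(X, y)
print(clf.score(X, y))                                 # training accuracy
```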
Decision Tree Algorithm: Numeric Data
• The "Iris" dataset. Originally published at UCI Machine Learning Repository: Iris
Data Set, this small dataset from 1936 is often used for testing out machine
learning algorithms and visualizations (for example, Scatter Plot). Each row of the
table represents an iris flower, including its species and dimensions of its
botanical parts, sepal and petal, in centimeters.
• This dataset is built into the Scikit-Learn module.
• For splitting the dataset and finding the root node:
• First, we have to sort out the classes and their occurrences. In the Iris dataset
there are 3 classes of flowers, named Setosa, Versicolor and Virginica.
• Second, we sort a suitable feature in ascending order and take the average of
every two adjacent values, computing the Gini Index for each such average value.
• We split the dataset at the average value with the lowest Gini Index.
Decision Tree Algorithm: Numeric Data
For example, the Iris dataset is sorted according to Petal Length in ascending order,
and the average of every two adjacent lengths is taken as a candidate threshold. From
this we can find:

Petal Length <= 2.45 cm                Petal Length <= 3.15 cm
  YES: Setosa 50, not Setosa 0           YES: Setosa 50, not Setosa 1
  NO:  Setosa 0,  not Setosa 100         NO:  Setosa 0,  not Setosa 99
  Gini Index = 0                         Gini Index = 0.013

So the entire dataset will be split according to Petal Length <= 2.45 cm, and this should
be the root node. Similarly, for the remaining dataset, the procedure is repeated
choosing the Petal Width column.
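The threshold search described above can be sketched in a few lines of Python (an illustrative implementation using the built-in Iris data, collapsing duplicate values for brevity):

```python
# Sketch: find the numeric threshold with the lowest weighted Gini Index.
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
petal_length = iris.data[:, 2]          # third column is petal length (cm)
is_setosa = (iris.target == 0)          # binary target: Setosa vs not Setosa

def weighted_gini(mask, y):
    # Gini of the two groups created by the split, weighted by group size.
    total = len(y)
    score = 0.0
    for group in (y[mask], y[~mask]):
        if len(group) == 0:
            continue
        p = np.bincount(group.astype(int), minlength=2) / len(group)
        score += (len(group) / total) * (1.0 - np.sum(p ** 2))
    return score

values = np.sort(np.unique(petal_length))
midpoints = (values[:-1] + values[1:]) / 2   # average of every two adjacent values
best = min(midpoints, key=lambda t: weighted_gini(petal_length <= t, is_setosa))
print(best)   # a threshold of 2.45 cm, as on this slide
```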
Decision Tree Algorithm: Numeric Data

Petal Width <= 1.75 cm                     Petal Width <= 1.85 cm
  YES: Versicolor 49, not Versicolor 5       YES: Versicolor 50, not Versicolor 16
  NO:  Versicolor 1,  not Versicolor 45      NO:  Versicolor 0,  not Versicolor 34
  Gini Index = 0.110                         Gini Index = 0.24

So, Petal Width <=1.75 cm should be selected as the second internal node. For further
splitting we can choose Petal Length & repeat the process.
Decision Tree Algorithm: Numeric Data
The final decision tree could look like this:

• Root node: Petal Length <= 2.45 cm?
  • YES → Setosa leaf (Setosa 50, not Setosa 0)
  • NO → Petal Width <= 1.75 cm?
    • YES → Petal Length <= 4.95 cm?
      • YES → Versicolor leaf (Versicolor 47, not Versicolor 1)
      • NO  → Versicolor leaf (Versicolor 2, not Versicolor 4)
    • NO → Petal Length <= 4.85 cm?
      • YES → Virginica leaf (Virginica 2, not Virginica 1)
      • NO  → Virginica leaf (Virginica 43, not Virginica 0)
Practice Program 2
• Preferred Software: Jupyter Notebook from Anaconda Package
• Required Database: the Iris dataset, which is built into the Scikit-Learn
module. For calculation purposes, the file is also attached in Slack and on the
web server.
• Required Modules: Numpy, Pandas, Scikit Learn etc.
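A minimal sketch of Practice Program 2: fitting a DecisionTreeClassifier on the built-in Iris data and printing the learned tree, which should resemble the tree two slides back (exact thresholds may differ slightly with scikit-learn version and settings):

```python
# Sketch: decision tree on the Iris dataset with a text dump of the learned tree.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
clf = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
clf.fit(iris.data, iris.target)

print(export_text(clf, feature_names=list(iris.feature_names)))       # tree structure
print(dict(zip(iris.feature_names, clf.feature_importances_)))        # per-feature importance
```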
Decision Tree Algorithm: Mixed Data
• For splitting the dataset and finding the root node:
• First, we sort out the classes and their occurrences, and select the best
feature to split the dataset on.
• Second, we sort a suitable numeric feature in ascending order and take the
average of every two adjacent values, computing the Gini Index for each such
average value.
• We split the dataset at the average value with the lowest Gini Index.
• For subsequent branch nodes we follow either the numeric-data or the
categorical-data procedure, deciding by the Gini Index value or by the
information gain computed from entropy.
• A node becomes a leaf when only one class remains. Splitting of parent
nodes and pruning of branch nodes depend on the value of the Gini Index.
Practice Program 3
• Preferred Software: Jupyter Notebook from Anaconda Package
• Required Database: download it from the link
https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/Cogni
tiveClass/ML0101ENv3/labs/drug200.csv
• Required Modules: Numpy, Pandas, Scikit Learn etc.
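A hedged sketch of Practice Program 3. The column name used as the target below ("Drug") is an assumption about drug200.csv, so check the columns with df.head() before running the rest:

```python
# Sketch: decision tree on the mixed categorical/numeric drug dataset.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("drug200.csv")                   # or read directly from the URL above
X = pd.get_dummies(df.drop(columns=["Drug"]))     # one-hot encode the categorical features
y = df["Drug"]                                    # target column name assumed

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=3)
clf = DecisionTreeClassifier(criterion="entropy", max_depth=4)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))                  # accuracy on the held-out split
```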
Advantages & Disadvantages of Decision Tree
Advantages
• Easy to understand
• Useful in data exploration
• Implicitly performs feature selection
• Little effort needed for data preparation
• Less data cleaning required
• Data type is not a constraint
• Non-parametric method
• Non-linear relationships between parameters do not affect tree performance

Disadvantages
• Overfitting
• Not fit for continuous variables
• Variance can give different results
• A complex algorithm cannot guarantee an optimal decision tree
• Decision tree learners create biased trees if some classes dominate
• Low prediction accuracy for a dataset as compared to other machine learning algorithms
Part 2- Random Forest
Random Forests
• Random forest, as its name implies, consists of a large number of individual decision
trees that operate as an ensemble.
• Each individual tree in the random forest spits out
a class prediction and the class with the most votes
becomes the model’s prediction
• A large number of relatively uncorrelated models
(trees) operating as a committee will outperform
any of the individual constituent models.
• Prerequisites for random forest to perform
well are:
• There needs to be some actual signal in the
features so that models built using those
features do better than random guessing.
• The predictions made by individual trees
need to have low correlation to each other.
Random Forests Algorithm:
Bagging (Bootstrap Aggregation):
• Decisions trees are very sensitive to the data they are trained on — small changes
to the training set can result in significantly different tree structures. Random
forest takes advantage of this by allowing each individual tree to randomly
sample from the dataset with replacement, resulting in different trees. This
process is known as bagging.
• Bagging does not mean splitting the data into smaller parts and training each tree
on a different smaller part.
• Instead of using the original training data, bagging gives each tree a random
sample of the same size as the original data, drawn with replacement.
• For example, if the original data was [1, 2, 3, 4, 5, 6], then one of the trees can
get the list [1, 2, 2, 3, 6, 6]. Notice that both lists are of length six and that
"2" and "6" are both repeated in the randomly selected training data given to
that tree.
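A tiny NumPy sketch (illustrative only) of drawing such a bootstrapped sample and identifying the rows left out:

```python
# Sketch: bootstrap sampling -- same size as the original data, drawn with replacement.
import numpy as np

rng = np.random.default_rng(seed=42)
original = np.array([1, 2, 3, 4, 5, 6])
bootstrap = rng.choice(original, size=len(original), replace=True)
out_of_bag = np.setdiff1d(original, bootstrap)   # values never drawn for this tree

print(bootstrap)    # e.g. something like [1 2 2 3 6 6] -- some values repeat
print(out_of_bag)   # the left-out values become out-of-bag samples for this tree
```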
Random Forests Algorithm:
Feature Randomness:
• In a normal decision tree, when it is time
to split a node, we consider every
possible feature and pick the one that
produces the most separation between
the observations in the left node vs.
those in the right node.
• In contrast, each tree in a random forest
can pick only from a random subset of
features.
• This forces even more variation amongst
the trees in the model and ultimately
results in lower correlation across trees
and more diversification.
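Both ideas, bagging and feature randomness, are exposed as parameters of scikit-learn's RandomForestClassifier; the sketch below just illustrates them on the built-in Iris data:

```python
# Sketch: a random forest with bagging and per-split feature randomness.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
forest = RandomForestClassifier(
    n_estimators=100,      # number of trees in the ensemble
    bootstrap=True,        # each tree trains on a bootstrapped sample (bagging)
    max_features="sqrt",   # each split considers only a random subset of features
    random_state=0,
)
forest.fit(iris.data, iris.target)
print(forest.score(iris.data, iris.target))
```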
Example: Step 1: Create A Bootstrapped Dataset

Original Dataset:
Chest Pain  Good Blood Circulation  Blocked Arteries  Weight  Heart Disease
No          No                      No                125     No
Yes         Yes                     Yes               180     Yes
Yes         Yes                     No                210     No
Yes         No                      Yes               167     Yes

Bootstrapped Dataset (sampled with replacement; the third row was never drawn, so it
becomes the Out-of-Bag dataset):
Chest Pain  Good Blood Circulation  Blocked Arteries  Weight  Heart Disease
No          No                      No                125     No
Yes         Yes                     Yes               180     Yes
Yes         No                      Yes               167     Yes
Yes         No                      Yes               167     Yes

Step 2: Create a decision tree using the bootstrapped dataset, using a random subset of
variables at each step (here the first split is on Good Circulation).
Step 3: Repeat Steps 1 & 2 to create different bootstrapped datasets & decision trees.
Example:
Though we could make hundreds of decision trees from bootstrapped datasets, suppose in
this example we made 6 different datasets and trees. Now we will check a new data point:

Chest Pain  Good Blood Circulation  Blocked Arteries  Weight  Heart Disease
No          No                      No                168     Yes

We take this data point and run it through the six trees of the random forest we made.
Running the six trees separately, the votes are: Heart Disease Yes 5, No 1.

Now suppose we take two out-of-bag samples to compare:

Out-of-bag sample 1: Chest Pain No, Good Blood Circulation No, Blocked Arteries No,
Weight 168, Heart Disease Yes → tree votes: Yes 4, No 2

Out-of-bag sample 2: Chest Pain No, Good Blood Circulation No, Blocked Arteries No,
Weight 125, Heart Disease No → tree votes: Yes 5, No 1

The proportion of out-of-bag samples that are incorrectly classified is the "Out-of-Bag
error". We can measure the accuracy of the random forest by the Out-of-Bag error.
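In scikit-learn the out-of-bag error can be obtained directly; a short illustrative sketch (the Iris data here is only a stand-in):

```python
# Sketch: out-of-bag accuracy and error with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
forest = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
forest.fit(iris.data, iris.target)

print(forest.oob_score_)        # out-of-bag accuracy: each sample scored only by trees that never saw it
print(1 - forest.oob_score_)    # out-of-bag error
```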
Advantages & Disadvantages of Random Forest
Advantages
• Produces a highly accurate classifier
• Runs efficiently on large databases
• Estimates important variables in classification
• Generates an internal unbiased estimate of the generalization error as the forest
building progresses
• Has an effective method for estimating missing data and maintains accuracy when a
large proportion of the data are missing
• Generated forests can be saved for future use on other data

Disadvantages
• Random forests have been observed to overfit for some datasets with noisy
classification/regression tasks
• For data including categorical variables with different numbers of levels, random
forests are biased in favor of those attributes with more levels. Therefore, the
variable importance scores from random forest are not reliable for this type of data.
Practice Program for Random Forests
• Preferred Software: Jupyter Notebook from Anaconda Package
• Required Modules: Numpy, Pandas, Scikit Learn etc.
• For Random Forests classifier and Evaluation of the Algorithm we will
use Practice Program 3
• For feature importance we will use Practice Program 2
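A hedged sketch of the feature-importance part, shown on the Iris data from Practice Program 2 (the drug200.csv version would follow the same pattern):

```python
# Sketch: ranking features by random-forest importance.
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(iris.data, iris.target)

importances = pd.Series(forest.feature_importances_, index=iris.feature_names)
print(importances.sort_values(ascending=False))   # the petal measurements usually dominate
```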
References
• Hands-On Machine Learning with Scikit-Learn & TensorFlow by Aurélien Géron
• Master Machine Learning Algorithms: Discover How They Work and Implement Them From Scratch by
Jason Brownlee
• Python Machine Learning by Sebastian Raschka & Vahid Mirjalili
• https://en.wikipedia.org/wiki/Decision_tree
• https://towardsdatascience.com/decision-trees-in-machine-learning-641b9c4e8052
• https://www.youtube.com/user/joshstarmer
• https://en.wikipedia.org/wiki/Random_forest
• https://towardsdatascience.com/understanding-random-forest-58381e0602d2
• https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
• https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
