
Decision Trees

(Classification Trees & Regression Trees)

Copyright © 2019 by Simplifying Skills. Contact: 9579708361 / 7798283335 / 8390096208


Tree Methods

Let's start off with a thought experiment to give some motivation behind using a decision tree method.
Tree Methods

Imagine that I play tennis every Saturday and I always invite a friend to come with me.
Sometimes my friend shows up, sometimes not.
For him it depends on a variety of factors, such as weather, temperature, humidity, and wind.
I start keeping track of these features and whether or not he showed up to play with me.
Tree Methods

I want to use this data to predict whether or not he will show up to play.
An intuitive way to do this is through a Decision Tree.
Tree Methods

In this tree we have:

● Nodes
  ○ A split on the value of a certain attribute
● Edges
  ○ The outcome of a split, leading to the next node
Tree Methods

In this tree we have:

● Root
  ○ The node that performs the first split
● Leaves
  ○ Terminal nodes that predict the outcome
Intuition Behind Splits

Imaginary data with 3 features (X, Y, and Z) and two possible classes.
Intuition Behind Splits

Splitting on Y gives us a clear separation between classes


Intuition Behind Splits

We could have also tried splitting on other features first:


Intuition Behind Splits
Entropy and Information Gain are the mathematical methods for choosing the best split. Refer to the reading assignment.
Decision Tree – Classification

A decision tree builds classification or regression models in the form of a tree structure. It breaks down a dataset into smaller and smaller subsets while, at the same time, an associated decision tree is incrementally developed. The final result is a tree with decision nodes and leaf nodes. A decision node (e.g., Outlook) has two or more branches (e.g., Sunny, Overcast, and Rainy). A leaf node (e.g., Play) represents a classification or decision. The topmost decision node in a tree, which corresponds to the best predictor, is called the root node. Decision trees can handle both categorical and numerical data.
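As an illustration (not part of the original slides), here is a minimal scikit-learn sketch of fitting a classification tree on a tiny, made-up version of the play-tennis data; the column names and values are hypothetical.

```python
# Minimal sketch: a classification tree on toy "play tennis" data (hypothetical values).
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

data = pd.DataFrame({
    "Outlook":  ["Sunny", "Sunny", "Overcast", "Rainy", "Rainy", "Overcast"],
    "Humidity": ["High",  "Normal", "High",    "High",  "Normal", "Normal"],
    "Windy":    [False,   True,     False,     True,    False,    True],
    "Play":     ["No",    "Yes",    "Yes",     "No",    "Yes",    "Yes"],
})

# One-hot encode the categorical predictors; scikit-learn trees expect numeric input.
X = pd.get_dummies(data.drop(columns="Play"))
y = data["Play"]

# criterion="entropy" mirrors the entropy-based splitting discussed below.
clf = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
print(clf.predict(X[:2]))
```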
Algorithm

The core algorithm for building decision trees, called ID3 (Iterative Dichotomiser 3) and developed by J. R. Quinlan, employs a top-down, greedy search through the space of possible branches with no backtracking. ID3 uses Entropy and Information Gain to construct a decision tree. In the ZeroR model there is no predictor; in the OneR model we try to find the single best predictor; naive Bayes includes all predictors using Bayes' rule and the assumption of independence between predictors; a decision tree, by contrast, includes all predictors and allows for dependence between predictors.

Entropy

A decision tree is built top-down from a root node and involves partitioning the data into subsets that contain instances with similar values (homogeneous). The ID3 algorithm uses entropy to calculate the homogeneity of a sample. If the sample is completely homogeneous the entropy is zero, and if the sample is equally divided the entropy is one.
To build a decision tree, we need to calculate two types of entropy using frequency tables, as follows:

a) Entropy using the frequency table of one attribute:


b) Entropy using the frequency table of two attributes:
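The slides show these two formulas as images; reconstructed in standard ID3 notation, they are

$$E(S) = \sum_{i=1}^{c} -p_i \log_2 p_i$$

for the frequency table of one attribute (with $p_i$ the proportion of class $i$), and

$$E(T, X) = \sum_{c \in X} P(c)\,E(c)$$

for the frequency table of two attributes, where $P(c)$ is the proportion of records taking value $c$ of attribute $X$ and $E(c)$ is the entropy of that subset.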
Information Gain

The information gain is based on the decrease in entropy after a dataset is split on an attribute. Constructing a decision tree is all about finding the attribute that returns the highest information gain (i.e., the most homogeneous branches).

Step 1: Calculate entropy of the target.

Step 2: The dataset is then split on the different attributes. The entropy for each branch is calculated and added proportionally to get the total entropy for the split. The resulting entropy is subtracted from the entropy before the split. The result is the Information Gain, or decrease in entropy.
Step 3: Choose the attribute with the largest information gain as the decision node, divide the dataset by its branches, and repeat the same process on every branch.

Step 4a: A branch with an entropy of 0 is a leaf node.

Step 4b: A branch with an entropy greater than 0 needs further splitting.

Step 5: The ID3 algorithm is run recursively on the non-leaf branches until all data is classified.
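To make Steps 1 and 2 concrete, here is a short Python sketch (not from the slides) that computes entropy and the information gain of one candidate split; the counts used are illustrative only.

```python
# Sketch: entropy of a target and information gain of a candidate split.
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy of a list of class labels: -sum(p * log2(p))."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(labels, branches):
    """Entropy of the target minus the weighted entropy of the split branches."""
    total = len(labels)
    weighted = sum(len(b) / total * entropy(b) for b in branches)
    return entropy(labels) - weighted

# Hypothetical target: 9 "Yes" / 5 "No", split by some attribute into two branches.
target   = ["Yes"] * 9 + ["No"] * 5
branch_a = ["Yes"] * 6 + ["No"] * 1
branch_b = ["Yes"] * 3 + ["No"] * 4
print(round(information_gain(target, [branch_a, branch_b]), 3))
```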
Decision Tree to Decision Rules

A decision tree can easily be transformed into a set of rules by mapping from the root node to the leaf nodes one by one.
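As one way to illustrate this, scikit-learn's export_text prints a fitted tree as nested if/else conditions; `clf` and `X` here are assumed to be the classifier and feature matrix from the earlier sketch.

```python
# Sketch: printing root-to-leaf decision rules for the tree fitted earlier.
from sklearn.tree import export_text

print(export_text(clf, feature_names=list(X.columns)))
```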
Random Forests
To improve performance, we can use many trees, with a random sample of features chosen as split candidates.
● A new random sample of features is chosen for every single tree at every single split.
● For classification, m is typically chosen to be the square root of p.
Here m is the number of randomly selected features that can be searched at a split point and p is the number of input variables. For example, if a dataset had 25 input variables for a classification problem, then m = 5.
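For illustration (not from the slides), a random forest with m = sqrt(p) can be requested in scikit-learn via max_features="sqrt"; the data (X, y) is assumed to be the toy play-tennis data from the earlier sketch.

```python
# Sketch: a random forest that samples sqrt(p) features at each split.
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
rf.fit(X, y)          # X, y as in the earlier decision-tree sketch
print(rf.predict(X[:2]))
```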
Random Forests

What's the point?


● Suppose there is one very strong feature in the
data set. When using “bagged” trees, most of the
trees will use that feature as the top split,
resulting in an ensemble of similar trees that are
highly correlated.
Random Forests

What's the point?

● Averaging highly correlated quantities does not significantly reduce variance.
● By randomly leaving out candidate features from each split, Random Forests "decorrelate" the trees, so that the averaging process can reduce the variance of the resulting model.
Partitioning Data & Majority Vote
No Split :: Majority Vote :: Accuracy
One Split
One More Split
A variance value of zero indicates that all values within a set of numbers are identical; all non-zero variances are positive. A large variance indicates that the numbers in the set are far from the mean and from each other, while a small variance indicates the opposite.
Making & Choosing Rules
Quantifying Better Splits
No Split
One Split
One more split
Another Measure: Gini Impurity

Gini impurity is a measure of misclassification, which applies in a multiclass classifier context.
The Gini coefficient, by contrast, applies to binary classification and requires a classifier that can in some way rank examples according to the likelihood of being in the positive class.
What is Gini Index?

The Gini index, or Gini impurity, measures the degree or probability of a particular variable being wrongly classified when it is chosen at random. But what is actually meant by 'impurity'? If all the elements belong to a single class, then the node can be called pure. The Gini index varies between 0 and 1, where 0 denotes that all elements belong to a single class (or that only one class exists) and 1 denotes that the elements are randomly distributed across various classes. A Gini index of 0.5 denotes elements equally distributed across some classes.
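The formula itself is not reproduced on the slide; the standard definition, consistent with the "Gini Impurity = 1 - Gini" relation given later, is

$$\text{Gini impurity} = 1 - \sum_{i=1}^{c} p_i^2$$

where $p_i$ is the proportion of elements of class $i$ in the node.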
Making Rules: To Tennis or Not to Tennis
Selected Rules and Decision Tree
Rules with Continuous Predictors
Decision trees work with continuous variables as well. The way they work is by the principle of reduction of variance.

Let us take an example where you have age as the target variable. Say you compute the variance of age and it comes out to be x. Next, the decision tree looks at various splits and calculates the total weighted variance of each of these splits. It chooses the split which provides the minimum variance.
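The following short sketch (illustrative numbers, not from the slides) shows how candidate splits on a continuous target such as age would be compared by their weighted variance.

```python
# Sketch: score candidate splits by weighted variance; pick the minimum.
import numpy as np

def weighted_variance(groups):
    total = sum(len(g) for g in groups)
    return sum(len(g) / total * np.var(g) for g in groups)

ages = np.array([23, 25, 31, 35, 42, 47, 55, 60])      # hypothetical target values
split_a = [ages[ages < 40], ages[ages >= 40]]           # candidate split 1
split_b = [ages[ages < 50], ages[ages >= 50]]           # candidate split 2
print(weighted_variance(split_a), weighted_variance(split_b))
```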
Regression Tree
Regression Tree: Example
When to stop
Drawbacks & Remedies
Drawbacks
Cross-Validation: Pruning Your Tree
Random Forests
Noise in the data
Random Forest
Some Terminologies
Important Terminology related to Decision Trees
Let’s look at the basic terminology used with Decision trees:

Root Node: It represents the entire population or sample, and this further gets divided into two or more homogeneous sets.

Splitting: It is the process of dividing a node into two or more sub-nodes.

Decision Node: When a sub-node splits into further sub-nodes, it is called a decision node.

Leaf/Terminal Node: Nodes that do not split are called leaf or terminal nodes.

Pruning: When we remove sub-nodes of a decision node, this process is called pruning. You can think of it as the opposite of splitting.

Branch/Sub-Tree: A sub-section of the entire tree is called a branch or sub-tree.

Parent and Child Node: A node that is divided into sub-nodes is called the parent node of those sub-nodes, whereas the sub-nodes are the children of the parent node.

These are the terms commonly used for decision trees. As every algorithm has advantages and disadvantages, below are the important factors one should know.
Advantages:

Easy to Understand: Decision tree output is very easy to understand, even for people from a non-analytical background. It does not require any statistical knowledge to read and interpret. Its graphical representation is very intuitive, and users can easily relate it to their hypotheses.

Useful in Data exploration: A decision tree is one of the fastest ways to identify the most significant variables and the relations between two or more variables. With the help of decision trees, we can create new variables/features that have better power to predict the target variable. It can also be used in the data exploration stage. For example, if we are working on a problem with information available in hundreds of variables, a decision tree will help to identify the most significant ones.

Less data cleaning required: It requires less data cleaning compared to some other modeling techniques, and it is fairly robust to outliers and missing values.

Data type is not a constraint: It can handle both numerical and categorical variables.

Non-Parametric Method: A decision tree is considered to be a non-parametric method. This means that decision trees make no assumptions about the spatial distribution of the data or the classifier structure.
Disadvantages

Overfitting: Overfitting is one of the most practical difficulties for decision tree models. This problem can be addressed by setting constraints on model parameters and by pruning (discussed in detail below).

Not fit for continuous variables: While working with continuous numerical variables, a decision tree loses information when it groups the variables into different categories.
Regression Trees vs. Classification Trees
• Regression trees are used when the dependent variable is continuous. Classification trees are used when the dependent variable is categorical.

• In the case of a regression tree, the value obtained by a terminal node in the training data is the mean response of the observations falling in that region. Thus, if an unseen observation falls in that region, we make its prediction using that mean value.

• In the case of a classification tree, the value (class) obtained by a terminal node in the training data is the mode of the observations falling in that region. Thus, if an unseen observation falls in that region, we make its prediction using that mode value.

• Both trees divide the predictor space (the independent variables) into distinct and non-overlapping regions. For the sake of simplicity, you can think of these regions as high-dimensional boxes.

• Both trees follow a top-down greedy approach known as recursive binary splitting. We call it 'top-down' because it begins at the top of the tree, when all the observations are in a single region, and successively splits the predictor space into two new branches down the tree. It is known as 'greedy' because the algorithm cares only about the current split (it looks for the best variable available), not about future splits that would lead to a better tree.
• This splitting process is continued until a user-defined stopping criterion is reached. For example, we can tell the algorithm to stop once the number of observations per node becomes less than 50.

• In both cases, the splitting process results in a fully grown tree once the stopping criterion is reached. But a fully grown tree is likely to overfit the data, leading to poor accuracy on unseen data. This brings in 'pruning', one of the techniques used to tackle overfitting (see the sketch after this list). We'll learn more about it in a following section.
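As a sketch of how stopping criteria and pruning look in practice (the parameter values and dataset are illustrative, not taken from the slides), scikit-learn exposes them as constructor arguments:

```python
# Sketch: stopping criteria and cost-complexity pruning in scikit-learn.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(
    min_samples_split=50,   # stop splitting nodes with fewer than 50 observations
    ccp_alpha=0.01,         # cost-complexity pruning strength (0 = no pruning)
    random_state=0,
).fit(X, y)
print(tree.get_depth(), tree.get_n_leaves())
```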
How does a tree decide where to split?

The decision of making strategic splits heavily affects a tree's accuracy. The decision criteria are different for classification and regression trees.

Decision trees use multiple algorithms to decide whether to split a node into two or more sub-nodes. The creation of sub-nodes increases the homogeneity of the resultant sub-nodes. In other words, we can say that the purity of the node increases with respect to the target variable. The decision tree splits the nodes on all available variables and then selects the split that results in the most homogeneous sub-nodes.

The algorithm selection also depends on the type of target variable. Let's look at the four most commonly used algorithms in decision trees:
Gini:

Gini says that if we select two items from a population at random, then they must be of the same class, and the probability of this is 1 if the population is pure.

• It works with a categorical target variable ("Success" or "Failure").
• It performs only binary splits.
• The higher the value of Gini, the higher the homogeneity.
• CART (Classification and Regression Tree) uses the Gini method to create binary splits.

Steps to calculate Gini for a split:

1. Calculate Gini for each sub-node, using the sum of the squares of the probabilities of success and failure (p^2 + q^2).
2. Calculate Gini for the split using the weighted Gini score of each node of that split.
Example: Referring to the example used above, we want to segregate the students based on the target variable (playing cricket or not). In the snapshot below, we split the population using two input variables, Gender and Class. Now, I want to identify which split produces more homogeneous sub-nodes using Gini.
Split on Gender:

1. Gini for sub-node Female = (0.2)*(0.2) + (0.8)*(0.8) = 0.68
2. Gini for sub-node Male = (0.65)*(0.65) + (0.35)*(0.35) = 0.55
3. Weighted Gini for the split on Gender = (10/30)*0.68 + (20/30)*0.55 = 0.59

Similarly, for the split on Class:

1. Gini for sub-node Class IX = (0.43)*(0.43) + (0.57)*(0.57) = 0.51
2. Gini for sub-node Class X = (0.56)*(0.56) + (0.44)*(0.44) = 0.51
3. Weighted Gini for the split on Class = (14/30)*0.51 + (16/30)*0.51 = 0.51

Above, you can see that the Gini score for the split on Gender is higher than that for the split on Class; hence, the node split will take place on Gender.

You might often come across the term 'Gini Impurity', which is determined by subtracting the Gini value from 1. So, mathematically, Gini Impurity = 1 - Gini Value.

Gini Impurity = 1-Gini
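The arithmetic above can be reproduced with a few lines of Python; the sketch below simply re-computes the slide's numbers.

```python
# Sketch: reproducing the Gini calculations for the Gender and Class splits.
def gini_score(p, q):
    """Gini as used on these slides: p^2 + q^2 (higher = more homogeneous)."""
    return p ** 2 + q ** 2

# Gender split: 10 Female (20% play cricket), 20 Male (65% play cricket).
gini_gender = (10 / 30) * gini_score(0.2, 0.8) + (20 / 30) * gini_score(0.65, 0.35)

# Class split: 14 in Class IX (43% play), 16 in Class X (56% play).
gini_class = (14 / 30) * gini_score(0.43, 0.57) + (16 / 30) * gini_score(0.56, 0.44)

print(round(gini_gender, 2), round(gini_class, 2))  # 0.59 vs 0.51 -> split on Gender
```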


Some Extra….
Chi-Square

It is an algorithm to find out the statistical significance of the differences between sub-nodes and the parent node. We measure it by the sum of squares of the standardized differences between the observed and expected frequencies of the target variable.

• It works with a categorical target variable ("Success" or "Failure").
• It can perform two or more splits.
• The higher the value of chi-square, the higher the statistical significance of the differences between the sub-node and the parent node.
• The chi-square of each node is calculated using the formula:
  Chi-square = ((Actual - Expected)^2 / Expected)^(1/2)
• It generates a tree called CHAID (Chi-square Automatic Interaction Detector).
Steps to calculate chi-square for a split:

1. Calculate the chi-square for each individual node by calculating the deviation for both Success and Failure.
2. Calculate the chi-square of the split as the sum of all chi-square values for Success and Failure across the nodes of the split.

Example: Let's work with the same example that we used above to calculate Gini.
Split on Gender:

1. First, populate the Female node with the actual values for "Play Cricket" and "Not Play Cricket"; here these are 2 and 8 respectively.
2. Calculate the expected values for "Play Cricket" and "Not Play Cricket"; here it would be 5 for both, because the parent node has a probability of 50% and we apply the same probability to the Female count (10).
3. Calculate the deviations using the formula Actual - Expected: for "Play Cricket" it is (2 - 5 = -3) and for "Not Play Cricket" it is (8 - 5 = 3).
4. Calculate the chi-square of the node for "Play Cricket" and "Not Play Cricket" using the formula ((Actual - Expected)^2 / Expected)^(1/2). You can refer to the table below for the calculation.
5. Follow similar steps to calculate the chi-square values for the Male node.
6. Now add all the chi-square values to calculate the chi-square for the split on Gender.
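A small sketch of this calculation is given below. The Female counts (2 play, 8 not) are stated above; the Male counts (13 play, 7 not) are an assumption derived from the 65%/35% figures used in the Gini example, with expected values of 50% taken from the parent node.

```python
# Sketch: chi-square for the Gender split using the slide's formula
# sqrt((Actual - Expected)^2 / Expected).
from math import sqrt

def chi_value(actual, expected):
    return sqrt((actual - expected) ** 2 / expected)

# (actual_play, actual_not, expected_play, expected_not) for each sub-node
female = (2, 8, 5, 5)
male   = (13, 7, 10, 10)   # assumed counts, derived from the Gini example

chi_gender = sum(
    chi_value(play, e_play) + chi_value(not_play, e_not)
    for play, not_play, e_play, e_not in (female, male)
)
print(round(chi_gender, 2))
```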
Split on Class:

Perform similar calculation steps for the split on Class and you will come up with the table below.

Above, you can see that chi-square also identifies the Gender split as more significant compared to the Class split.
Information Gain:

Look at the image below and think about which node can be described most easily. I am sure your answer is C, because it requires less information, as all its values are similar. On the other hand, B requires more information to describe, and A requires the maximum information. In other words, we can say that C is a pure node, B is less impure, and A is more impure.
Now we can conclude that a less impure node requires less information to describe it, and a more impure node requires more information. Information theory defines a measure of this degree of disorganization in a system, known as entropy. If the sample is completely homogeneous, then the entropy is zero, and if the sample is equally divided (50%-50%), it has an entropy of one.
Entropy can be calculated using the formula:
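(The slide shows the formula as an image; in standard two-class form it is $\text{Entropy} = -p \log_2 p - q \log_2 q$.)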

Here p and q are the probabilities of success and failure, respectively, in that node.

Entropy is also used with categorical target variables. We choose the split that has the lowest entropy compared to the parent node and the other splits. The lower the entropy, the better.
Reduction in Variance

Till now, we have discussed the algorithms for categorical target variables. Reduction in variance is an algorithm used for continuous target variables (regression problems). This algorithm uses the standard formula of variance to choose the best split. The split with the lower variance is selected as the criterion to split the population:
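(The slide shows the formula as an image; in standard form it is $\text{Variance} = \frac{\sum (X - \bar{X})^2}{n}$.)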

Above, X-bar is the mean of the values, X is an actual value, and n is the number of values.

Steps to calculate variance:

• Calculate the variance for each node.
• Calculate the variance for each split as the weighted average of each node's variance.
Let’s Implement
