A decision tree builds classification or regression models in the form of a tree structure. It
breaks a dataset down into smaller and smaller subsets while, at the same time, an
associated decision tree is incrementally developed. The final result is a tree with decision
nodes and leaf nodes. A decision node (e.g., Outlook) has two or more branches (e.g.,
Sunny, Overcast, and Rainy). A leaf node (e.g., Play) represents a classification or decision.
The topmost decision node in a tree, which corresponds to the best predictor, is called the
root node. Decision trees can handle both categorical and numerical data.
Algorithm
The core algorithm for building decision trees, called ID3 (Iterative Dichotomiser 3) and
developed by J. R. Quinlan, employs a top-down, greedy search through the space of possible
branches with no backtracking. ID3 uses entropy and information gain to construct a decision
tree. In the ZeroR model there is no predictor; in the OneR model we try to find the single best
predictor; naive Bayes includes all predictors using Bayes' rule and the assumption of
independence between predictors; a decision tree includes all predictors while allowing for
dependence between predictors.
Entropy
A decision tree is built top-down from a root node and involves partitioning the data into
subsets that contain instances with similar values (homogeneous). The ID3 algorithm uses
entropy to calculate the homogeneity of a sample. If the sample is completely homogeneous
the entropy is zero, and if the sample is equally divided the entropy is one.
To build a decision tree, we need to calculate two types of entropy using frequency tables as
follows:
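As a minimal sketch of the first of these, the entropy of a sample can be computed from its class frequencies (the helper name and example labels below are illustrative, not from the original):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())

print(entropy(["Yes"] * 10))              # completely homogeneous sample
print(entropy(["Yes"] * 5 + ["No"] * 5))  # equally divided sample
```

The first call returns 0 and the second returns 1, matching the two boundary cases described above.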
The information gain is based on the decrease in entropy after a dataset is split on an
attribute. Constructing a decision tree is all about finding the attribute that returns the highest
information gain (i.e., the most homogeneous branches).
Step 1: Calculate the entropy of the target.
Step 2: The dataset is then split on the different attributes. The entropy for each branch is
calculated and added proportionally to get the total entropy for the split. The resulting entropy
is subtracted from the entropy before the split. The result is the information gain, or decrease
in entropy.
Step 3: Choose the attribute with the largest information gain as the decision node, divide the
dataset by its branches, and repeat the same process on every branch.
Step 4a: A branch with entropy of 0 is a leaf node.
Step 4b: A branch with entropy more than 0 needs further splitting.
Step 5: The ID3 algorithm is run recursively on the non-leaf branches, until all data is
classified.
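The entropy and information-gain steps above can be sketched as follows; the toy Outlook/Play rows are illustrative and are not the full dataset from the example:

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of the target from its class frequencies."""
    n = len(labels)
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())

def information_gain(rows, attr, target):
    """Entropy before the split minus the weighted entropy after splitting on attr."""
    before = entropy([r[target] for r in rows])
    n = len(rows)
    after = 0.0
    for value in {r[attr] for r in rows}:
        subset = [r[target] for r in rows if r[attr] == value]
        after += len(subset) / n * entropy(subset)
    return before - after

rows = [
    {"Outlook": "Sunny", "Play": "No"},
    {"Outlook": "Sunny", "Play": "No"},
    {"Outlook": "Overcast", "Play": "Yes"},
    {"Outlook": "Rainy", "Play": "Yes"},
]
# Splitting on Outlook makes every branch homogeneous, so the gain equals
# the parent entropy of 1.0.
print(information_gain(rows, "Outlook", "Play"))
```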
Decision Tree to Decision Rules
A decision tree can easily be transformed into a set of rules by mapping the paths from the
root node to the leaf nodes one by one.
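This mapping can be sketched with a tiny hand-built tree; the nested-dict representation and the Outlook/Play values below are illustrative assumptions:

```python
def tree_to_rules(node, conditions=(), rules=None):
    """Walk a nested-dict tree and emit one IF...THEN rule per root-to-leaf path."""
    if rules is None:
        rules = []
    if not isinstance(node, dict):  # leaf: a classification
        rules.append("IF " + " AND ".join(conditions) + f" THEN Play = {node}")
        return rules
    for (attr, value), child in node.items():
        tree_to_rules(child, conditions + (f"{attr} = {value}",), rules)
    return rules

# Hypothetical one-level tree for the Outlook/Play example.
tree = {
    ("Outlook", "Sunny"): "No",
    ("Outlook", "Overcast"): "Yes",
    ("Outlook", "Rainy"): "Yes",
}
for rule in tree_to_rules(tree):
    print(rule)
```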
Random Forests
To improve performance, we can use many trees, with
a random sample of features chosen as the candidates at each split.
● A new random sample of features is chosen for
every single tree at every single split.
● For classification, m is typically chosen to be
the square root of p,
where m is the number of randomly selected features that can be searched at a split
point and p is the number of input variables. For example, if a dataset had 25 input
variables for a classification problem, then m = 5.
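This choice of m can be sketched as follows (the function name is an assumption for illustration):

```python
import math
import random

def random_feature_subset(p, seed=None):
    """Choose m = sqrt(p) candidate features to search at one split point."""
    m = int(math.sqrt(p))
    rng = random.Random(seed)
    return sorted(rng.sample(range(p), m))

# p = 25 input variables -> m = 5 candidate features per split;
# a fresh subset is drawn for every split of every tree.
print(random_feature_subset(25, seed=42))
```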
Gini Index
The Gini index, or Gini impurity, measures the probability of a particular element being
wrongly classified when it is chosen at random. But what is actually meant by 'impurity'? If
all the elements belong to a single class, the node can be called pure. The Gini index
varies between 0 and 1, where 0 denotes that all elements belong to a single class (or
that only one class exists), and 1 denotes that the elements are randomly distributed across
various classes. A Gini index of 0.5 denotes elements equally distributed between two classes.
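A minimal sketch of the Gini impurity of a node's labels (the helper and sample labels are illustrative):

```python
from collections import Counter

def gini_impurity(labels):
    """1 minus the sum of squared class probabilities in the node."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

print(gini_impurity(["a"] * 8))              # pure node -> 0.0
print(gini_impurity(["a"] * 5 + ["b"] * 5))  # two equal classes -> 0.5
```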
Making Rules: To Tennis or Not to Tennis
Selected Rules and Decision Tree
Rules with Continuous Predictors
Decision trees work with continuous variables as well. They work by the principle of
reduction of variance.
Let us take an example where you have age as the target variable. Say you compute the
variance of age and it comes out to be x. Next, the decision tree looks at various candidate
splits and calculates the total weighted variance of each of these splits. It chooses the split
which provides the minimum variance.
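The variance-reduction comparison can be sketched as follows; the ages and split points below are made up for illustration:

```python
def variance(values):
    """Population variance of a list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def weighted_split_variance(left, right):
    """Total weighted variance of a candidate binary split."""
    n = len(left) + len(right)
    return len(left) / n * variance(left) + len(right) / n * variance(right)

ages = [20, 22, 25, 40, 42, 45]
# The tree prefers the split whose weighted variance is lowest.
balanced = weighted_split_variance(ages[:3], ages[3:])
lopsided = weighted_split_variance(ages[:1], ages[1:])
print(balanced, lopsided)
```

Here the balanced split separates the two age clusters, so its weighted variance is much lower than the lopsided one's.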
Regression Tree
Regression Tree: Example
When to stop
Drawbacks & Remedies
Drawbacks
Cross-Validation: Pruning Your Tree
Random Forests
Noise in the data
Random Forest
Important Terminology Related to Decision Trees
Let's look at the basic terminology used with decision trees:
Root Node: It represents the entire population or sample, and it further gets
divided into two or more homogeneous sets.
Leaf/Terminal Node: Nodes that do not split are called leaf or terminal nodes.
Pruning: When we remove the sub-nodes of a decision node, the process is
called pruning. You can think of it as the opposite of splitting.
Parent and Child Node: A node which is divided into sub-nodes is called the
parent node of those sub-nodes, whereas the sub-nodes are the children of the parent node.
These are the terms commonly used for decision trees. As every algorithm has
advantages and disadvantages, below are the important factors one should know.
Advantages:-
Easy to understand: Decision tree output is very easy to understand, even for people from a
non-analytical background. It does not require any statistical knowledge to read and interpret.
Its graphical representation is very intuitive, and users can easily relate it to their hypotheses.
Useful in data exploration: A decision tree is one of the fastest ways to identify the most
significant variables and the relations between two or more variables. With the help of decision
trees, we can create new variables/features that have better power to predict the target
variable. It can also be used in the data exploration stage: for example, if we are working on a
problem with information available in hundreds of variables, a decision tree will help identify
the most significant ones.
Less data cleaning required: It requires less data cleaning compared to some other modeling
techniques, being fairly robust to outliers and missing values.
Data type is not a constraint: It can handle both numerical and categorical variables.
• In the case of a regression tree, the value obtained at a terminal node in the training data is
the mean response of the observations falling in that region. Thus, if an unseen observation
falls in that region, we'll make its prediction with the mean value.
• In the case of a classification tree, the value (class) obtained at a terminal node in the
training data is the mode of the observations falling in that region. Thus, if an unseen
observation falls in that region, we'll make its prediction with the mode value.
• Both trees divide the predictor space (the independent variables) into distinct and non-
overlapping regions. For the sake of simplicity, you can think of these regions as
high-dimensional boxes.
• Both trees follow a top-down greedy approach known as recursive binary splitting. We
call it 'top-down' because it begins at the top of the tree, where all the observations lie
in a single region, and successively splits the predictor space into two new branches
down the tree. It is known as 'greedy' because the algorithm cares only about the
current split (it looks for the best variable available), not about future splits that
would lead to a better tree.
• This splitting process is continued until a user-defined
stopping criterion is reached. For example, we can tell
the algorithm to stop once the number of observations per
node becomes less than 50.
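Putting these pieces together, recursive binary splitting with a minimum-observations stopping rule can be sketched for a single predictor; all names and data below are illustrative assumptions, not the author's implementation:

```python
def variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def build_regression_tree(xs, ys, min_samples=2):
    """Greedy recursive binary splitting on one predictor.

    Stops when a node cannot give each child min_samples observations,
    and predicts the mean response at each leaf.
    """
    if len(ys) < 2 * min_samples:
        return {"leaf": sum(ys) / len(ys)}
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    xs = [xs[i] for i in order]
    ys = [ys[i] for i in order]
    best_i, best_score = None, None
    # Greedy: evaluate only the current split, by weighted variance.
    for i in range(min_samples, len(xs) - min_samples + 1):
        score = (i * variance(ys[:i]) + (len(ys) - i) * variance(ys[i:])) / len(ys)
        if best_score is None or score < best_score:
            best_i, best_score = i, score
    i = best_i
    return {
        "threshold": xs[i - 1],
        "left": build_regression_tree(xs[:i], ys[:i], min_samples),
        "right": build_regression_tree(xs[i:], ys[i:], min_samples),
    }

tree = build_regression_tree([1, 2, 3, 10, 11, 12], [5, 5, 5, 20, 20, 20])
print(tree)
```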
The decision to make strategic splits heavily affects a tree's accuracy, and the
decision criteria differ for classification and regression trees. The algorithm
selection also depends on the type of target variable. Let's look
at the four most commonly used algorithms in decision trees:
Gini:-
Above, you can see that the Gini score for the split on Gender is higher than that for the split
on Class; hence, the node split will take place on Gender.
You might often come across the term 'Gini impurity', which is determined by subtracting the
Gini value from 1. So mathematically we can say: Gini Impurity = 1 - Gini Value.
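The weighted Gini value of one candidate split can be sketched as follows; the per-branch counts below are hypothetical, not the table from the example:

```python
def gini_value(groups):
    """Weighted Gini value of a split: for each branch, its population weight
    times the sum of its squared class proportions."""
    n = sum(sum(group) for group in groups)
    total = 0.0
    for group in groups:
        size = sum(group)
        total += size / n * sum((count / size) ** 2 for count in group)
    return total

# Each branch is (play, not_play); the numbers are made up for illustration.
gender_split = [(2, 8), (13, 7)]
print(gini_value(gender_split))      # Gini value of the split
print(1 - gini_value(gender_split))  # Gini impurity = 1 - Gini value
```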
First, we populate the Female node: fill in the actual values for "Play Cricket" and "Not Play
Cricket", which here are 2 and 8 respectively.
Calculate the expected values for "Play Cricket" and "Not Play Cricket"; here it would be 5 for
both, because the parent node has a probability of 50% and we apply the same probability to
the Female count (10).
Calculate the deviations using the formula Actual - Expected: for "Play Cricket" it is
2 - 5 = -3, and for "Not Play Cricket" it is 8 - 5 = 3.
Calculate the Chi-square of the node for "Play Cricket" and "Not Play Cricket" using the
formula Chi-square = ((Actual - Expected)^2 / Expected)^(1/2). You can refer to the table
below for the calculation.
Follow similar steps to calculate the Chi-square value for the Male node.
Now add all the Chi-square values to calculate the Chi-square for the split on Gender.
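The per-cell calculation can be sketched as follows, using the actual and expected values for the Female node given above:

```python
import math

def chi_square_cell(actual, expected):
    """Per-cell value: ((actual - expected)^2 / expected)^(1/2)."""
    return math.sqrt((actual - expected) ** 2 / expected)

# Female node: actual (2, 8) vs expected (5, 5) for Play / Not Play Cricket.
female = chi_square_cell(2, 5) + chi_square_cell(8, 5)
print(female)  # the Female node's contribution to the split's Chi-square
```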
Split on Class:
Look at the image below and think about which node can be described most easily. I
am sure your answer is C, because it requires less information, as all its
values are similar. On the other hand, B requires more information to
describe, and A requires the maximum information. In other words, we
can say that C is a pure node, B is less impure, and A is more impure.
Now we can draw the conclusion that a less impure node requires less information
to describe it, while a more impure node requires more information. Information
theory provides a measure of this degree of disorganization in a system, known
as entropy. If the sample is completely homogeneous, then the entropy is zero,
and if the sample is equally divided (50%-50%), it has an entropy of one.
Entropy can be calculated using the formula Entropy = -p log2(p) - q log2(q), where p and q
are the probabilities of success and failure, respectively, in that node.
Till now, we have discussed algorithms for a categorical target variable.
Reduction in variance is an algorithm used for continuous target variables
(regression problems). This algorithm uses the standard formula of variance,
Variance = sum((X - mean)^2) / n, where X is each actual value and n is the number of
values, to choose the best split: the split with the lower weighted variance is selected
as the criterion to split the population.