
DECISION TREES

Introduction

It is a method that induces concepts from examples (inductive learning), proposed by J. R. Quinlan, who published "Induction of Decision Trees" in 1986

It is one of the most widely used and practical learning methods

The learning is supervised: i.e., the classes or categories of the data instances are known

It represents concepts as decision trees (which can be rewritten as if-then rules)
1
DECISION TREES

Introduction

The target function can be Boolean, discrete-valued, or numeric

• Classification trees: used to classify the given data set into categories. To use classification trees, the target (response) variable must be categorical, such as yes/no or true/false

• Regression trees: used for prediction problems, where the target or response variable is a numeric value such as a stock value, a commodity price, and so on
2
DECISION TREES

Decision Tree Representation

1. Each internal (non-leaf) node corresponds to a test of an attribute

2. Each branch corresponds to an attribute value

3. Each leaf node assigns a classification

3
DECISION TREES

4
DECISION TREES

Example

5
DECISION TREES

Example
[Figure: a decision tree for the concept PlayTennis. The root node tests Outlook (Sunny / Overcast / Rain); the Sunny branch then tests Humidity (High / Normal) and the Rain branch tests Wind (Strong / Weak).]


An unknown observation is classified by testing its attributes
and reaching a leaf node
6
DECISION TREES

Decision Tree Representation

Decision trees represent a disjunction of conjunctions of constraints on the attribute values of instances

Each path from the tree root to a leaf corresponds to a conjunction of attribute tests (one rule for classification)

The tree itself corresponds to a disjunction of these conjunctions (a set of rules for classification)
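For example, assuming the usual form of the PlayTennis tree shown earlier (where Overcast days, Sunny days with Normal humidity, and Rain days with Weak wind are labelled Yes), the tree can be written as:

(Outlook = Sunny AND Humidity = Normal)
OR (Outlook = Overcast)
OR (Outlook = Rain AND Wind = Weak)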

7
DECISION TREES

Decision Tree Representation

8
DECISION TREES

Basic Decision Tree Learning Algorithm

Most algorithms for growing decision trees are variants of a basic algorithm

An example of this core algorithm is the ID3 algorithm developed by Quinlan (1986)

It employs a top-down, greedy search through the space of possible decision trees

9
DECISION TREES

Basic Decision Tree Learning Algorithm


First of all, we select the best attribute to be tested at the root of the tree

To make this selection, each attribute is evaluated using a heuristic or statistical test (e.g., Gini impurity, or information gain/entropy) to determine how well it alone classifies the training examples.

Conditions for Stopping Partitioning

• All samples for a given node belong to the same class.
• There are no remaining attributes for further partitioning; in this case, majority voting is employed for classifying the leaf.
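As an illustration only (not part of the original slides), the following is a minimal Python sketch of this recursive procedure; the dictionary-based tree representation and the tiny usage example at the end are assumptions made for demonstration.

```python
import math
from collections import Counter

def entropy(labels):
    """H = -sum p_i * log2(p_i) over the class frequencies in labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(examples, labels, attribute):
    """Entropy of the parent minus the weighted entropy of the subsets induced by attribute."""
    total = len(labels)
    gain = entropy(labels)
    for value in {e[attribute] for e in examples}:
        subset = [l for e, l in zip(examples, labels) if e[attribute] == value]
        gain -= (len(subset) / total) * entropy(subset)
    return gain

def id3(examples, labels, attributes):
    """examples: list of dicts mapping attribute -> value; labels: parallel list of classes."""
    if len(set(labels)) == 1:                        # stop: all samples in this node share a class
        return labels[0]
    if not attributes:                               # stop: no attributes left to test
        return Counter(labels).most_common(1)[0][0]  # majority vote classifies the leaf
    best = max(attributes, key=lambda a: information_gain(examples, labels, a))
    tree = {best: {}}
    for value in {e[best] for e in examples}:
        pairs = [(e, l) for e, l in zip(examples, labels) if e[best] == value]
        sub_examples, sub_labels = (list(t) for t in zip(*pairs))
        tree[best][value] = id3(sub_examples, sub_labels,
                                [a for a in attributes if a != best])
    return tree

# Tiny made-up usage: two animals described by two binary attributes
X = [{'plays fetch': 'yes', 'is grumpy': 'no'}, {'plays fetch': 'no', 'is grumpy': 'yes'}]
print(id3(X, ['dog', 'cat'], ['plays fetch', 'is grumpy']))
```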
10
DECISION TREES

Training Decision Trees

Let's create a decision tree using an algorithm called Iterative Dichotomiser 3 (ID3). Invented by Ross Quinlan, ID3 was one of the first algorithms used to train decision trees.

Assume that you have to classify animals as cats or dogs on the basis of a few attributes, namely plays fetch, is grumpy, and favorite food.

The data set contains 14 samples.

11
DECISION TREES

The Data Set

12
DECISION TREES

Some observations about the Data Set

• Cats are generally grumpier than dogs.

• Most dogs play fetch and most cats refuse

• Dogs prefer dog food and bacon

• Cats only like cat food and bacon

• The is grumpy and plays fetch explanatory variables can be easily converted to binary-valued features.

13
DECISION TREES

Some observations about the Data Set


Favorite food is a categorical variable that has three possible values; we will one-hot encode it.

From this table, we can manually construct classification rules. For example, an animal that is grumpy and likes cat food must be a cat, while an animal that plays fetch and likes bacon must be a dog.

Constructing these classification rules by hand, even for a small data set, is cumbersome. Instead, we will learn these rules by creating a decision tree.

14
DECISION TREES

Basic Decision Tree Learning Algorithm

To classify new animals, the decision tree will examine an explanatory variable at each node. The edge it follows to the next node will depend on the outcome of the test.

For example, the first node might ask whether or not the animal
likes to play fetch. If the animal does, we will follow the edge to
the left child node; if not, we will follow the edge to the right
child node.

Eventually an edge will connect to a leaf node that indicates whether the animal is a cat or a dog.
15
DECISION TREES: Take a look again !!!

Example
[Figure: the PlayTennis decision tree again, with root node Outlook (Sunny / Overcast / Rain), the Sunny branch testing Humidity (High / Normal), and the Rain branch testing Wind (Strong / Weak).]

Note that the uncertainty is higher at upper nodes and lower at lower nodes.

16
DECISION TREES

One of the ways to measure the uncertainty in a variable is to use entropy, given by the following equation:

H(X) = − Σ (i = 1 to n) P(xi) logb P(xi)

where n is the number of outcomes and P(xi) is the probability of outcome xi. Common values for b are 2, e, and 10.
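A quick Python check of this formula (an illustration only, not part of the original slides):

```python
import math

def entropy(probabilities, base=2):
    """H(X) = -sum P(x_i) * log_b P(x_i); zero-probability outcomes contribute nothing."""
    return -sum(p * math.log(p, base) for p in probabilities if p > 0)

print(entropy([0.5, 0.5]))      # fair coin: 1.0 bit
print(entropy([6/14, 8/14]))    # 6 dogs and 8 cats: ~0.985 bits
```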

17
DECISION TREES

Entropy Example:

A single toss of a fair coin has only two outcomes: heads and
tails. The probability that the coin will land on heads is 0.5,
and the probability that it will land on tails is 0.5.

The entropy of the coin toss is equal to the following:

H(X) = −(0.5 log2 0.5 + 0.5 log2 0.5) = 1.0

That is, only one bit is required to represent the two equally
probable outcomes, heads and tails.

18
DECISION TREES

Entropy

If the coin has the same face on both sides, the variable
representing its outcome has 0 bits of entropy; that is, we are
always certain of the outcome and the variable will never
represent new information.

19
DECISION TREES

Entropy for Animal Classification Problem


Let's calculate the entropy of classifying an unknown animal.

If an equal number of dogs and cats comprise our animal classification training data and we do not know anything else about the animal, the entropy of the decision is equal to one.

Our training data, however, contains six dogs and eight cats. If we do not know anything else about the unknown animal, the entropy of the decision is given by the following:

H(X) = −(6/14 log2(6/14) + 8/14 log2(8/14)) ≈ 0.9852

20
DECISION TREES

Tree Construction

Now let's find the explanatory variable that will be most helpful
in classifying the animal; that is, let's find the explanatory
variable that reduces the entropy the most. We can test the
plays fetch explanatory variable and divide the training instances
into animals that play fetch and animals that don't. This produces
the two following subsets:

21
DECISION TREES

Tree Construction (continued..)

The left child node contains a subset of the training data with seven cats and two dogs that do not like to play fetch. The entropy at this node is given by the following:

H = −(7/9 log2(7/9) + 2/9 log2(2/9)) ≈ 0.7642

The right child contains a subset with one cat and four dogs that do like to play fetch. The entropy at this node is given by the following:

H = −(1/5 log2(1/5) + 4/5 log2(4/5)) ≈ 0.7219

22
DECISION TREES

Tree Construction (continued..)


Instead of testing the plays fetch explanatory variable, we could test the is grumpy explanatory variable. This test produces the following tree.

The instances that fail the test follow the left edge, and the instances that pass the test follow the right edge.

23
DECISION TREES

Tree Construction (continued..)

We could also divide the instances into animals that prefer cat
food and animals that don't, to produce the following tree:

24
DECISION TREES

Tree Construction (continued..)

25
DECISION TREES

Which Attribute is the Best Classifier?: Definition of Entropy


It is worth mentioning here that as we go down from the root node, we would like to see a decreasing trend in the entropy.

Note that we could also choose favorite food = dog food or favorite food = bacon as the test for the root node

26
DECISION TREES

Which Attribute is the Best Classifier?: Definition of Entropy

Question: How do we decide the best possible attribute for the root node?

Answer 1: Consider taking the average of the entropies of the two sibling nodes.

However, this may result in a sub-optimal tree, with one subset having entropy of almost 0 and the other having entropy close to 1.

27
DECISION TREES

Information Gain

We will measure the reduction in entropy using a metric called information gain, which is the difference between the entropy of the parent node, H(T), and the weighted average of the children nodes' entropies:

IG(T, a) = H(T) − Σv (|Tv| / |T|) H(Tv)

where the sum runs over the values v of the attribute a being tested and Tv is the subset of instances for which a = v.
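To make the weighting concrete, here is a small Python sketch (an illustration, not from the slides) that reproduces the plays fetch split discussed earlier (7 cats and 2 dogs versus 1 cat and 4 dogs):

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(parent, children):
    """IG = H(parent) minus the size-weighted average of the children's entropies."""
    total = len(parent)
    return entropy(parent) - sum(len(c) / total * entropy(c) for c in children)

parent   = ['cat'] * 8 + ['dog'] * 6      # full training set: 8 cats, 6 dogs
no_fetch = ['cat'] * 7 + ['dog'] * 2      # animals that do not play fetch
fetch    = ['cat'] * 1 + ['dog'] * 4      # animals that do play fetch
print(information_gain(parent, [no_fetch, fetch]))   # ~0.2361
```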

28
DECISION TREES

Information Gain

The following table contains the information gains for all of the tests. In this case, the cat food test is still the best, as it produces the largest information gain.

29
DECISION TREES

Tree Construction (Continued..)

After deciding in favor of the cat food test, we have to add another node to the tree.

It is to be noted that one of the child nodes has an entropy value of 0: it contains only cats. The other (left) node contains 2 cats and 6 dogs; we will add new nodes here.


30
DECISION TREES

Next Node
It must be remembered that this node has 8 samples. It contains two cats and six dogs whose favorite food is not cat food (it may be dog food or bacon).

The entropy of this node has already been calculated: 0.8113

To add a new node, we shall evaluate the entropy and the information gain for all the remaining explanatory variables.

31
DECISION TREES

Plays fetch?
From the data set, find the relevant samples: from all animals that do not like cat food, we make two subsets, the first containing those that do not play fetch and the second containing those that do.

No cat food, does not play fetch:
2 dogs, 2 cats, Entropy = −[(2/4) log2(2/4) + (2/4) log2(2/4)] = 1

No cat food, plays fetch:
4 dogs, 0 cats, Entropy = −[(0/4) log2(0/4) + (4/4) log2(4/4)] = 0 (taking 0 log 0 = 0)

Information Gain = 0.8113 − (1.0 × 4/8 + 0 × 4/8) = 0.3113


32
DECISION TREES

Is grumpy?
From the data set, find the relevant samples: from all animals that do not like cat food, we make two subsets, the first containing those that are not grumpy and the second containing those that are grumpy.

No cat food, not grumpy:
4 dogs, 0 cats, Entropy = −[(4/4) log2(4/4) + (0/4) log2(0/4)] = 0

No cat food, grumpy:
2 dogs, 2 cats, Entropy = −[(2/4) log2(2/4) + (2/4) log2(2/4)] = 1

Information Gain = 0.8113 − (0 × 4/8 + 1.0 × 4/8) = 0.3113


33
DECISION TREES

The following table contains the information gains for all of the possible tests:

34
DECISION TREES

ID3 breaks ties by selecting one of the best tests arbitrarily. We will select the is grumpy test, which splits its parent's eight instances into a leaf node containing four dogs and a node containing two cats and two dogs. The following is a diagram of the current tree:

35
DECISION TREES

We will now select another explanatory variable to test the child node's four instances. The remaining tests, favorite food=bacon, favorite food=dog food, and plays fetch, all produce a leaf node containing one dog or cat and a node containing the remaining animals. The remaining tests produce equal information gains, as shown in the following table:

36
DECISION TREES

We will arbitrarily select the plays fetch test to produce a leaf node containing one dog and a node containing two cats and a dog.
Two explanatory variables remain; we can test for animals that
like bacon, or we can test for animals that like dog food.

Both of the tests will produce the same subsets and create a leaf
node containing one dog and a leaf node containing two cats.

We will arbitrarily choose to test for animals that like dog food.

The following is a diagram of the completed decision tree:


37
DECISION TREES

38
DECISION TREES

Let’s classify some animals using the decision tree.

39
DECISION TREES

Other algorithms can be used to train decision trees. C4.5 is a modified version of ID3 that can be used with continuous explanatory variables and can accommodate missing values for features.

C4.5 also can prune trees. Pruning reduces the size of a tree
by replacing branches that classify few instances with leaf
nodes.

CART, the learning algorithm used by scikit-learn's implementation of decision trees, is another such algorithm.
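For reference, a minimal scikit-learn sketch (illustration only; the tiny feature matrix below is made up and is not the 14-sample data set from the slides). The columns stand for the one-hot style features plays fetch, is grumpy, likes cat food, likes dog food, and likes bacon:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

feature_names = ['plays fetch', 'is grumpy', 'cat food', 'dog food', 'bacon']
X = [
    [0, 1, 1, 0, 0],   # grumpy, likes cat food   -> cat
    [1, 0, 0, 1, 0],   # plays fetch, dog food    -> dog
    [0, 1, 0, 0, 1],   # grumpy, likes bacon      -> cat
    [1, 0, 0, 0, 1],   # plays fetch, bacon       -> dog
]
y = ['cat', 'dog', 'cat', 'dog']

clf = DecisionTreeClassifier(criterion='entropy')   # information gain, as in ID3; default is Gini
clf.fit(X, y)
print(export_text(clf, feature_names=feature_names))
print(clf.predict([[0, 1, 1, 0, 0]]))               # a grumpy cat-food lover -> ['cat']
```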
40
Performance Metrics

A variety of metrics exist to evaluate the performance of binary classifiers.

The most common metrics are the confusion matrix, accuracy, precision, recall, F1 measure, and ROC AUC score.

All of these measures depend on the concepts of true positives, true negatives, false positives, and false negatives:

True Negatives (TN): Actual FALSE, predicted as FALSE


False Positives (FP): Actual FALSE, predicted as TRUE (Type I error)
False Negatives (FN): Actual TRUE, predicted as FALSE (Type II error)
True Positives (TP): Actual TRUE, predicted as TRUE

41
Performance Metrics

Accuracy: what percentage of predictions were correct?
Accuracy = (TP + TN) / (TP + TN + FP + FN)

Precision: how many of the samples predicted as positive are actually positive?
Precision = TP / (TP + FP)

Recall: how many of the positive samples are captured by the positive predictions?
Recall = TP / (TP + FN)
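A quick illustration using scikit-learn's metrics module (the example labels below are made up):

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score, f1_score)

y_true = [1, 1, 1, 0, 0, 0, 1, 0]   # actual classes (1 = positive)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # classifier output

print(confusion_matrix(y_true, y_pred))   # rows: actual, columns: predicted
print(accuracy_score(y_true, y_pred))     # (TP + TN) / total = 6/8 = 0.75
print(precision_score(y_true, y_pred))    # TP / (TP + FP) = 3/4 = 0.75
print(recall_score(y_true, y_pred))       # TP / (TP + FN) = 3/4 = 0.75
print(f1_score(y_true, y_pred))           # harmonic mean of precision and recall = 0.75
```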

42
Performance Metrics

F Score: the harmonic mean of precision and recall:
F1 = 2 × (precision × recall) / (precision + recall)

ROC: the Receiver Operating Characteristic curve, which plots the true positive rate against the false positive rate at different decision thresholds.

AUC: the area under the ROC curve.

43
Performance Metrics

Model Evaluation Metrics


Confusion Matrix:
The confusion matrix is a table that is used for describing the performance of a classification model.

The matrix shown is for the binary case.

For a multiclass problem (say, 5 classes) it becomes a matrix with more rows and columns (5×5).
44
Advantages of Decision Trees

1. No requirement of zero mean and unit variance.

Some algorithms require feature scaling (standardization or normalization) before application, e.g. Ridge and Lasso regression, KNN, SVM, and logistic regression; others, such as gradient descent based methods, may require it for quick convergence. Decision trees do not require feature scaling.

2. Can tolerate missing data

For example, consider a Boolean attribute A. Let there be 10 known values for A, of which three have the value True and the remaining 7 have the value False. So the probability that A(x) = True is 0.3, and the probability that A(x) = False is 0.7.

When an instance with a missing value of A is encountered, a fraction 0.3 of it is distributed down the branch for A = True, and a fraction 0.7 down the other branch. These probability values are used for computing the information gain, and can be used again if a second missing attribute value needs to be tested. The same methodology can be applied during learning when we need to fill in unknowns for the new branches. The C4.5 algorithm uses this mechanism for handling missing values.
45
Disadvantages of Decision Trees

1. Overfitting: an unpruned tree can grow to fit the training data too closely; pruning (as in C4.5) helps to counter this.

46
