
DECISION TREES

Introduction

It is a method that induces concepts from examples (inductive learning), proposed by J. R. Quinlan, who published "Induction of Decision Trees" in 1986

It is one of the most widely used and practical learning methods

The learning is supervised: i.e., the classes or categories of the data instances are known

It represents concepts as decision trees (which can be rewritten as if-then rules)
1
DECISION TREES

Introduction

The target function can be Boolean, discrete-valued, or numeric

• Classification trees: used to classify the given data set into categories. To use classification trees, the target (response) variable must be categorical, such as yes/no or true/false

• Regression trees: used for prediction problems, where the target or response variable is a numeric value such as a stock value, a commodity price, and so on
2
DECISION TREES

Decision Tree Representation

1. Each internal (non-leaf) node corresponds to a test of an attribute

2. Each branch corresponds to an attribute value

3. Each leaf node assigns a classification

3
DECISION TREES

4
DECISION TREES

Example

5
DECISION TREES

Example
[Figure: a decision tree for the concept PlayTennis. The root node tests Outlook (Sunny / Overcast / Rain); the Sunny branch then tests Humidity (High / Normal) and the Rain branch tests Wind (Strong / Weak).]


An unknown observation is classified by testing its attributes
and reaching a leaf node
6
DECISION TREES

Decision Tree Representation

Decision trees represent a disjunction of conjunctions of constraints on the attribute values of instances

Each path from the tree root to a leaf corresponds to a conjunction of attribute tests (one rule for classification)

The tree itself corresponds to a disjunction of these conjunctions (a set of rules for classification)
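For example, assuming the usual form of the PlayTennis tree shown earlier (where Overcast days, Sunny days with Normal humidity, and Rain days with Weak wind are labelled Yes), the tree can be written as:

(Outlook = Sunny AND Humidity = Normal)
OR (Outlook = Overcast)
OR (Outlook = Rain AND Wind = Weak)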

7
DECISION TREES

Decision Tree Representation

8
DECISION TREES

Basic Decision Tree Learning Algorithm

Most algorithms for growing decision trees are variants of a basic algorithm

An example of this core algorithm is the ID3 algorithm developed by Quinlan (1986)

It employs a top-down, greedy search through the space of possible decision trees

9
DECISION TREES

Basic Decision Tree Learning Algorithm


First of all, we select the best attribute to be tested at the root of the tree

To make this selection, each attribute is evaluated using a heuristic or statistical test (e.g., Gini impurity, or information gain/entropy) to determine how well it alone classifies the training examples.

Conditions for Stopping Partitioning

• All samples for a given node belong to the same class.
• There are no remaining attributes for further partitioning; in this case, majority voting is employed for classifying the leaf.
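As an illustration only (not part of the original slides), the following is a minimal Python sketch of this recursive procedure; the dictionary-based tree representation and the tiny usage example at the end are assumptions made for demonstration.

```python
import math
from collections import Counter

def entropy(labels):
    """H = -sum p_i * log2(p_i) over the class frequencies in labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(examples, labels, attribute):
    """Entropy of the parent minus the weighted entropy of the subsets induced by attribute."""
    total = len(labels)
    gain = entropy(labels)
    for value in {e[attribute] for e in examples}:
        subset = [l for e, l in zip(examples, labels) if e[attribute] == value]
        gain -= (len(subset) / total) * entropy(subset)
    return gain

def id3(examples, labels, attributes):
    """examples: list of dicts mapping attribute -> value; labels: parallel list of classes."""
    if len(set(labels)) == 1:                        # stop: all samples in this node share a class
        return labels[0]
    if not attributes:                               # stop: no attributes left to test
        return Counter(labels).most_common(1)[0][0]  # majority vote classifies the leaf
    best = max(attributes, key=lambda a: information_gain(examples, labels, a))
    tree = {best: {}}
    for value in {e[best] for e in examples}:
        pairs = [(e, l) for e, l in zip(examples, labels) if e[best] == value]
        sub_examples, sub_labels = (list(t) for t in zip(*pairs))
        tree[best][value] = id3(sub_examples, sub_labels,
                                [a for a in attributes if a != best])
    return tree

# Tiny made-up usage: two animals described by two binary attributes
X = [{'plays fetch': 'yes', 'is grumpy': 'no'}, {'plays fetch': 'no', 'is grumpy': 'yes'}]
print(id3(X, ['dog', 'cat'], ['plays fetch', 'is grumpy']))
```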
10
DECISION TREES

Training Decision Trees

Let's create a decision tree using an algorithm called Iterative Dichotomiser 3 (ID3). Invented by Ross Quinlan, ID3 was one of the first algorithms used to train decision trees.

Assume that you have to classify animals as cats or dogs on the basis of a few attributes, namely plays fetch, is grumpy, and favorite food.

The data set contains 14 samples.

11
DECISION TREES

The Data Set

12
DECISION TREES

Some observations about the Data Set

• Cats are generally grumpier than dogs.

• Most dogs play fetch and most cats refuse

• Dogs prefer dog food and bacon

• Cats only like cat food and bacon

• The is grumpy and plays fetch explanatory variables can be easily converted to binary-valued features.

13
DECISION TREES

Some observations about the Data Set


Favorite food is a categorical variable that has three possible values; we will one-hot encode it.

From this table, we can manually construct classification rules. For example, an animal that is grumpy and likes cat food must be a cat, while an animal that plays fetch and likes bacon must be a dog.

Constructing these classification rules by hand, even for a small data set, is cumbersome. Instead, we will learn these rules by creating a decision tree.

14
DECISION TREES

Basic Decision Tree Learning Algorithm

To classify new animals, the decision tree will examine an explanatory variable at each node. The edge it follows to the next node will depend on the outcome of the test.

For example, the first node might ask whether or not the animal
likes to play fetch. If the animal does, we will follow the edge to
the left child node; if not, we will follow the edge to the right
child node.

Eventually an edge will connect to a leaf node that indicates whether the animal is a cat or a dog.
15
DECISION TREES: Take a look again !!!

Example
[Figure: the PlayTennis decision tree again, with root node Outlook (Sunny / Overcast / Rain), the Sunny branch testing Humidity (High / Normal), and the Rain branch testing Wind (Strong / Weak).]

Note that the uncertainty is higher at upper nodes and lower at lower nodes.

16
DECISION TREES

One of the ways to measure the uncertainty in a variable is to use entropy, given by the following equation:

H(X) = − Σ (i = 1 to n) P(xi) logb P(xi)

where n is the number of outcomes and P(xi) is the probability of outcome xi. Common values for b are 2, e, and 10.
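A quick Python check of this formula (an illustration only, not part of the original slides):

```python
import math

def entropy(probabilities, base=2):
    """H(X) = -sum P(x_i) * log_b P(x_i); zero-probability outcomes contribute nothing."""
    return -sum(p * math.log(p, base) for p in probabilities if p > 0)

print(entropy([0.5, 0.5]))      # fair coin: 1.0 bit
print(entropy([6/14, 8/14]))    # 6 dogs and 8 cats: ~0.985 bits
```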

17
DECISION TREES

Entropy Example:

A single toss of a fair coin has only two outcomes: heads and
tails. The probability that the coin will land on heads is 0.5,
and the probability that it will land on tails is 0.5.

The entropy of the coin toss is equal to the following:

H(X) = −(0.5 log2 0.5 + 0.5 log2 0.5) = 1.0

That is, only one bit is required to represent the two equally
probable outcomes, heads and tails.

18
DECISION TREES

Entropy

If the coin has the same face on both sides, the variable
representing its outcome has 0 bits of entropy; that is, we are
always certain of the outcome and the variable will never
represent new information.

19
DECISION TREES

Entropy for Animal Classification Problem


Let's calculate the entropy of classifying an unknown animal.

If an equal number of dogs and cats comprise our animal classification training data and we do not know anything else about the animal, the entropy of the decision is equal to one.

Our training data, however, contains six dogs and eight cats. If we do not know anything else about the unknown animal, the entropy of the decision is given by the following:

H(X) = −(6/14 log2(6/14) + 8/14 log2(8/14)) ≈ 0.9852

20
DECISION TREES

Tree Construction

Now let's find the explanatory variable that will be most helpful
in classifying the animal; that is, let's find the explanatory
variable that reduces the entropy the most. We can test the
plays fetch explanatory variable and divide the training instances
into animals that play fetch and animals that don't. This produces
the two following subsets:

21
DECISION TREES

Tree Construction (continued..)

The left child node contains a subset of the training data with seven cats and two dogs that do not like to play fetch. The entropy at this node is given by the following:

H = −(7/9 log2(7/9) + 2/9 log2(2/9)) ≈ 0.7642

The right child contains a subset with one cat and four dogs that do like to play fetch. The entropy at this node is given by the following:

H = −(1/5 log2(1/5) + 4/5 log2(4/5)) ≈ 0.7219

22
DECISION TREES

Tree Construction (continued..)


Instead of testing the plays fetch explanatory variable, we could test the is grumpy explanatory variable. This test produces the following tree.

The instances that fail the test follow the left edge, and the instances that pass the test follow the right edge.

23
DECISION TREES

Tree Construction (continued..)

We could also divide the instances into animals that prefer cat
food and animals that don't, to produce the following tree:

24
DECISION TREES

Tree Construction (continued..)

25
DECISION TREES

Which Attribute is the Best Classifier?: Definition of Entropy


It is worth mentioning here that as we go down from the root node, we would like to see a decreasing trend in the entropy.

Note that we could also choose favorite food = dog food or favorite food = bacon as the test for the root node

26
DECISION TREES

Which Attribute is the Best Classifier?: Definition of Entropy

Question: How do we decide the best possible attribute for the root node?

Answer 1: Consider taking the average of the entropies of the two sibling nodes.

However, this may result in a sub-optimal tree, with one subset having entropy of almost 0 and the other having entropy close to 1.

27
DECISION TREES

Information Gain

We will measure the reduction in entropy using a metric called information gain, which is the difference between the entropy of the parent node, H(T), and the weighted average of the children nodes' entropies:

IG(T, a) = H(T) − Σv (|Tv| / |T|) H(Tv)

where the sum runs over the values v of the attribute a being tested and Tv is the subset of instances for which a = v.
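To make the weighting concrete, here is a small Python sketch (an illustration, not from the slides) that reproduces the plays fetch split discussed earlier (7 cats and 2 dogs versus 1 cat and 4 dogs):

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(parent, children):
    """IG = H(parent) minus the size-weighted average of the children's entropies."""
    total = len(parent)
    return entropy(parent) - sum(len(c) / total * entropy(c) for c in children)

parent   = ['cat'] * 8 + ['dog'] * 6      # full training set: 8 cats, 6 dogs
no_fetch = ['cat'] * 7 + ['dog'] * 2      # animals that do not play fetch
fetch    = ['cat'] * 1 + ['dog'] * 4      # animals that do play fetch
print(information_gain(parent, [no_fetch, fetch]))   # ~0.2361
```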

28
DECISION TREES

Information Gain

The following table contains the information gains for all of the tests. In this case, the cat food test is still the best, as it produces the largest information gain.

29
DECISION TREES

Tree Construction (Continued..)

After deciding in favor of the cat food test, we have to add another node to the tree.

It is to be noted that one of the child nodes has an entropy value of 0: it contains only cats. The other (left) node contains 2 cats and 6 dogs; we will add new nodes here.


30
DECISION TREES

Next Node
It must be remembered that this node has 8 samples. It contains two cats and six dogs whose favorite food is not cat food (it may be dog food or bacon).

The entropy of this node has already been calculated: 0.8113

To add a new node, we shall evaluate the entropy and the information gain for all the remaining explanatory variables.

31
DECISION TREES

Plays fetch?
From the data set, find the relevant samples: from all animals that do not like cat food, we make two subsets, the first containing those that do not play fetch and the second containing those that do.

No cat food, does not play fetch:
2 dogs, 2 cats, Entropy = −[(2/4) log2(2/4) + (2/4) log2(2/4)] = 1

No cat food, plays fetch:
4 dogs, 0 cats, Entropy = −[(0/4) log2(0/4) + (4/4) log2(4/4)] = 0 (taking 0 log 0 = 0)

Information Gain = 0.8113 − (1.0 × 4/8 + 0 × 4/8) = 0.3113


32
DECISION TREES

Is grumpy?
From the data set, find the relevant samples: from all animals that do not like cat food, we make two subsets, the first containing those that are not grumpy and the second containing those that are grumpy.

No cat food, not grumpy:
4 dogs, 0 cats, Entropy = −[(4/4) log2(4/4) + (0/4) log2(0/4)] = 0

No cat food, grumpy:
2 dogs, 2 cats, Entropy = −[(2/4) log2(2/4) + (2/4) log2(2/4)] = 1

Information Gain = 0.8113 − (0 × 4/8 + 1.0 × 4/8) = 0.3113


33
DECISION TREES

The following table contains the information gains for all of the possible tests:

34
DECISION TREES

ID3 breaks ties by selecting one of the best tests arbitrarily. We will select the is grumpy test, which splits its parent's eight instances into a leaf node containing four dogs and a node containing two cats and two dogs. The following is a diagram of the current tree:

35
DECISION TREES

We will now select another explanatory variable to test the child node's four instances. The remaining tests, favorite food=bacon, favorite food=dog food, and plays fetch, all produce a leaf node containing one dog or cat and a node containing the remaining animals. The remaining tests produce equal information gains, as shown in the following table:

36
DECISION TREES

We will arbitrarily select the plays fetch test to produce a leaf node containing one dog and a node containing two cats and a dog.
Two explanatory variables remain; we can test for animals that
like bacon, or we can test for animals that like dog food.

Both of the tests will produce the same subsets and create a leaf
node containing one dog and a leaf node containing two cats.

We will arbitrarily choose to test for animals that like dog food.

The following is a diagram of the completed decision tree:


37
DECISION TREES

38
DECISION TREES

Let’s classify some animals using the decision tree.

39
DECISION TREES

Other algorithms can be used to train decision trees. C4.5 is a modified version of ID3 that can be used with continuous explanatory variables and can accommodate missing values for features.

C4.5 also can prune trees. Pruning reduces the size of a tree
by replacing branches that classify few instances with leaf
nodes.

CART, the learning algorithm used by scikit-learn's implementation of decision trees, is another such algorithm.
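For reference, a minimal scikit-learn sketch (illustration only; the tiny feature matrix below is made up and is not the 14-sample data set from the slides). The columns stand for the one-hot style features plays fetch, is grumpy, likes cat food, likes dog food, and likes bacon:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

feature_names = ['plays fetch', 'is grumpy', 'cat food', 'dog food', 'bacon']
X = [
    [0, 1, 1, 0, 0],   # grumpy, likes cat food   -> cat
    [1, 0, 0, 1, 0],   # plays fetch, dog food    -> dog
    [0, 1, 0, 0, 1],   # grumpy, likes bacon      -> cat
    [1, 0, 0, 0, 1],   # plays fetch, bacon       -> dog
]
y = ['cat', 'dog', 'cat', 'dog']

clf = DecisionTreeClassifier(criterion='entropy')   # information gain, as in ID3; default is Gini
clf.fit(X, y)
print(export_text(clf, feature_names=feature_names))
print(clf.predict([[0, 1, 1, 0, 0]]))               # a grumpy cat-food lover -> ['cat']
```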
40
Performance Metrics

A variety of metrics exist to evaluate the performance of binary classifiers.

The most common metrics are the confusion matrix, accuracy, precision, recall, F1 measure, and ROC AUC score.

All of these measures depend on the concepts of true positives, true negatives, false positives, and false negatives:

True Negatives (TN): Actual FALSE, predicted as FALSE


False Positives (FP): Actual FALSE, predicted as TRUE (Type I error)
False Negatives (FN): Actual TRUE, predicted as FALSE (Type II error)
True Positives (TP): Actual TRUE, predicted as TRUE

41
Performance Metrics

Accuracy: what percentage of predictions were correct?
Accuracy = (TP + TN) / (TP + TN + FP + FN)

Precision: how many of the samples predicted as positive are actually positive?
Precision = TP / (TP + FP)

Recall: how many of the positive samples are captured by the positive predictions?
Recall = TP / (TP + FN)
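A quick illustration using scikit-learn's metrics module (the example labels below are made up):

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score, f1_score)

y_true = [1, 1, 1, 0, 0, 0, 1, 0]   # actual classes (1 = positive)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # classifier output

print(confusion_matrix(y_true, y_pred))   # rows: actual, columns: predicted
print(accuracy_score(y_true, y_pred))     # (TP + TN) / total = 6/8 = 0.75
print(precision_score(y_true, y_pred))    # TP / (TP + FP) = 3/4 = 0.75
print(recall_score(y_true, y_pred))       # TP / (TP + FN) = 3/4 = 0.75
print(f1_score(y_true, y_pred))           # harmonic mean of precision and recall = 0.75
```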

42
Performance Metrics

F Score: the harmonic mean of precision and recall:
F1 = 2 × (precision × recall) / (precision + recall)

ROC: the Receiver Operating Characteristic curve, which plots the true positive rate against the false positive rate at different decision thresholds.

AUC: the area under the ROC curve.

43
Performance Metrics

Model Evaluation Metrics


Confusion Matrix:
The confusion matrix is a table that is used for describing the performance of a classification model.

The matrix shown is for the binary case.

For a multiclass problem (say, 5 classes) it becomes a matrix with more rows and columns (5×5).
44
Advantages of Decision Trees

1. No requirement of zero mean and unit variance.

Some algorithms require feature scaling (standardization or normalization) before application, e.g. Ridge and Lasso regression, KNN, SVM, and logistic regression; others, such as gradient descent based methods, may require it for quick convergence. Decision trees do not require feature scaling.

2. Can tolerate missing data

For example, consider a Boolean attribute A. Let there be 10 known values for A, of which three have the value True and the remaining 7 have the value False. So the probability that A(x) = True is 0.3, and the probability that A(x) = False is 0.7.

When an instance with a missing value of A is encountered, a fraction 0.3 of it is distributed down the branch for A = True, and a fraction 0.7 down the other branch. These probability values are used for computing the information gain, and can be used again if a second missing attribute value needs to be tested. The same methodology can be applied during learning when we need to fill in unknowns for the new branches. The C4.5 algorithm uses this mechanism for handling missing values.
45
Disadvantages of Decision Trees

1. Overfitting: an unpruned tree can grow to fit the training data too closely; pruning (as in C4.5) helps to counter this.

46
