
Decision Trees

LECTURER:
Humera Farooq, Ph.D.
Computer Sciences Department,
Bahria University (Karachi Campus)
Outline
- Decision trees for classification (ID3 algorithm)
- Overfitting
- Some experimental issues


Decision Trees
 A hierarchical data structure that represents data by implementing a divide and
conquer strategy
 Can be used as a non-parametric classification and regression method, and as a
Boolean Function
 Given a collection of examples, learn a decision tree that represents it and use this
representation to classify new examples
 Can be viewed as a way to compactly represent a lot of data.
 Output is a discrete category. Real valued outputs are possible (regression trees)
 There are efficient algorithms for processing large amounts of data (but not too
many features)
 There are methods for handling noisy data (classification noise and attribute noise)
and for handling missing attribute values
Decision Trees Terminologies
Root Node: Represents the entire population or sample; it is further divided into two or more homogeneous sets.
Splitting: The process of dividing a node into two or more sub-nodes.
Decision Node: A sub-node that splits into further sub-nodes.
Leaf / Terminal Node: A node that does not split.
Pruning: Removing sub-nodes of a decision node; the opposite of splitting.
Branch / Sub-Tree: A subsection of the entire tree.
Parent and Child Node: A node that is divided into sub-nodes is the parent of those sub-nodes; the sub-nodes are its children. (For example, a node A that splits into B and C is the parent of B and C.)


Decision Trees
- A decision tree is a function that
  - takes a vector of attribute values as its input, and
  - returns a "decision" as its output.
- Both input and output values can be measured on nominal, ordinal, interval, or ratio scales, and can be discrete or continuous.

- The decision is formed via a sequence of tests:
  - each internal node of the tree represents a test,
  - the branches are labeled with the possible outcomes of the test, and
  - each leaf node represents a decision to be returned by the tree.
How to start a DT?
- In the beginning, the whole training set is considered as the root.
- Feature values are preferred to be categorical. If the values are continuous, they are discretized prior to building the model.
- Records are distributed recursively on the basis of attribute values.
- The order in which attributes are placed as the root or as internal nodes of the tree is decided using a statistical measure.

- Decision trees follow a Sum of Products (SOP) representation.
- The Sum of Products (SOP) is also known as Disjunctive Normal Form (DNF).
- For a given class, every branch from the root to a leaf node of that class is a conjunction (product) of attribute values; the different branches ending in that class form a disjunction (sum).
Decision Trees Representation
[Figure: a decision tree with root Color (branches Blue, Red, Green); the Blue and Green branches test Shape; each leaf returns a class label (+/-).]

- As Boolean functions, decision trees can represent any Boolean function.
- A tree can be rewritten as rules in Disjunctive Normal Form (DNF):
    Green ∧ square → positive
    Blue ∧ circle → positive
    Blue ∧ square → positive
- The disjunction of these rules is equivalent to the decision tree:
    (green ∧ square) ∨ (blue ∧ circle) ∨ (blue ∧ square)
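A small illustration (my own, not from the slides): the DNF rules above and the tree's sequence of tests compute the same Boolean function. The colour and shape names are the ones in the example.

```python
def positive_dnf(color, shape):
    """(green ∧ square) ∨ (blue ∧ circle) ∨ (blue ∧ square)"""
    return ((color == "green" and shape == "square")
            or (color == "blue" and shape == "circle")
            or (color == "blue" and shape == "square"))

def positive_tree(color, shape):
    """The same function written as the tree's sequence of tests."""
    if color == "blue":                      # test Shape under Blue
        return shape in ("circle", "square")
    if color == "green":                     # test Shape under Green
        return shape == "square"
    return False                             # Red leaf

# The two forms agree on every input:
assert all(positive_dnf(c, s) == positive_tree(c, s)
           for c in ("blue", "red", "green")
           for s in ("circle", "square", "triangle"))
```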
Will I play tennis today?
Assume we are interested in determining whether to play tennis (+/-) given
certain nominal features, below:
 Features
 Outlook (O): {Sun, Overcast, Rain}
 Temperature (T): {Hot, Mild, Cool}
 Humidity (H): {High, Normal, Low}
 Wind (W): {Strong, Weak}

 Labels
 Binary classification task: Y = {+, -}
Working of DT
 The decision of making strategic splits heavily affects a tree’s accuracy. The decision criteria are
different for classification and regression trees.
 Decision trees use multiple algorithms to decide to split a node into two or more sub-nodes.
 The creation of sub-nodes increases the homogeneity of resultant sub-nodes.
 In other words, we can say that the purity of the node increases with respect to the target variable.
- The decision tree splits the nodes on all available variables and then selects the split which results in the most homogeneous sub-nodes.
- The algorithm selection also depends on the type of target variable. Some algorithms used to build decision trees:

ID3 → (Iterative Dichotomiser 3)
C4.5 → (successor of ID3)
CART → (Classification And Regression Trees)
CHAID → (Chi-square Automatic Interaction Detection; performs multi-level splits when computing classification trees)
MARS → (Multivariate Adaptive Regression Splines)
Will I play tennis today?
 #  O  T  H  W  Play?        Legend:
 1  S  H  H  W   -           Outlook: S(unny), O(vercast), R(ainy)
 2  S  H  H  S   -           Temperature: H(ot), M(ild), C(ool)
 3  O  H  H  W   +           Humidity: H(igh), N(ormal), L(ow)
 4  R  M  H  W   +           Wind: S(trong), W(eak)
 5  R  C  N  W   +
 6  R  C  N  S   -
 7  O  C  N  S   +
 8  S  M  H  W   -
 9  S  C  N  W   +
10  R  M  N  W   +
11  S  M  N  S   +
12  O  M  H  S   +
13  O  H  N  W   +
14  R  M  H  S   -
Basic Decision Tree Learning Algorithm
- Data is processed in batch (i.e. all the data is available).
- Recursively build a decision tree top down.

Using the 14 tennis examples above, the resulting tree is:

Outlook
├── Sunny    → Humidity: High → No,  Normal → Yes
├── Overcast → Yes
└── Rain     → Wind: Strong → No,  Weak → Yes
Basic Decision Tree Algorithm
 Let S be the set of Examples
 Label is the target attribute (the prediction)
 Attributes is the set of measured attributes
ID3(S, Attributes, Label):
  If all examples in S have the same label, return a single-node tree with that label.
  Otherwise:
    Create a Root node for the tree.
    A = the attribute in Attributes that best classifies S
    For each possible value v of A:
      Add a new tree branch corresponding to A = v.
      Let Sv be the subset of examples in S with A = v.
      If Sv is empty: add a leaf node with the most common value of Label in S.
      Else: below this branch add the subtree ID3(Sv, Attributes - {A}, Label).
  Return Root
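A minimal runnable sketch of this procedure in Python (my own illustration, not the lecturer's code). Examples are assumed to be dicts mapping attribute names to values; the function and variable names are mine. For brevity, branches are only created for values observed in the data, so the empty-subset case in the pseudocode never arises.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels."""
    counts, total = Counter(labels), len(labels)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def information_gain(examples, attr, label):
    """Expected reduction in entropy from splitting on attr."""
    base = entropy([ex[label] for ex in examples])
    remainder = 0.0
    for v in set(ex[attr] for ex in examples):
        subset = [ex for ex in examples if ex[attr] == v]
        remainder += len(subset) / len(examples) * entropy([ex[label] for ex in subset])
    return base - remainder

def id3(examples, attributes, label):
    labels = [ex[label] for ex in examples]
    if len(set(labels)) == 1:          # all examples share a label -> leaf
        return labels[0]
    if not attributes:                 # no attributes left -> majority leaf
        return Counter(labels).most_common(1)[0][0]
    a = max(attributes, key=lambda x: information_gain(examples, x, label))
    tree = {a: {}}
    for v in set(ex[a] for ex in examples):
        subset = [ex for ex in examples if ex[a] == v]
        tree[a][v] = id3(subset, [x for x in attributes if x != a], label)
    return tree

# Tiny usage example on four of the tennis rows:
data = [
    {"Outlook": "Sunny",    "Wind": "Weak",   "Play": "-"},
    {"Outlook": "Sunny",    "Wind": "Strong", "Play": "-"},
    {"Outlook": "Overcast", "Wind": "Weak",   "Play": "+"},
    {"Outlook": "Rain",     "Wind": "Weak",   "Play": "+"},
]
print(id3(data, ["Outlook", "Wind"], "Play"))
# e.g. {'Outlook': {'Sunny': '-', 'Overcast': '+', 'Rain': '+'}}
```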
Picking the Root Attribute
 The goal is to have the resulting decision tree as small as possible (Occam’s
Razor)
 But, finding the minimal decision tree consistent with the data is NP-hard
 The recursive algorithm is a greedy heuristic search for a simple tree, but cannot
guarantee optimality.
 The main decision in the algorithm is the selection of the next attribute to
condition on.
Picking the Root Attribute
Consider data with two Boolean attributes (A,B).
< (A=0,B=0), - >: 50 examples
< (A=0,B=1), - >: 50 examples
< (A=1,B=0), - >: 0 examples
< (A=1,B=1), + >: 100 examples

What should be the first attribute we select?

- Splitting on A gives purely labeled nodes: A=1 is all positive, A=0 is all negative.
- Splitting on B does not give purely labeled nodes.

What if we instead have <(A=1,B=0), ->: 3 examples?
Picking the Root Attribute
Consider data with two Boolean attributes (A,B).
  < (A=0,B=0), - >: 50 examples
  < (A=0,B=1), - >: 50 examples
  < (A=1,B=0), - >: 3 examples
  < (A=1,B=1), + >: 100 examples

The two candidate trees look structurally similar; which attribute should we choose?
- Splitting on A: A=1 contains 100 positives and 3 negatives; A=0 contains 100 negatives.
- Splitting on B: B=1 contains 100 positives and 50 negatives; B=0 contains 53 negatives.
Advantage A. But... we need a way to quantify this.
Picking the Root Attribute
- We want attributes that split the examples into sets that are relatively pure in one label; this way we are closer to a leaf node.
- The most popular heuristic is based on information gain, and originated with the ID3 system of Quinlan.
Entropy
Entropy (impurity, disorder) of a set of examples S, relative to a binary classification, is:

  Entropy(S) = -p+ log2(p+) - p- log2(p-)

where p+ is the proportion of positive examples in S and p- is the proportion of negative examples.

- If all the examples belong to the same category: Entropy = 0
- If all the examples are equally mixed (0.5, 0.5): Entropy = 1

- High entropy – high level of uncertainty.
- Low entropy – no uncertainty.
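A quick numeric check of this formula in Python (my own, not from the slides), reproducing the entropy values used later in the lecture:

```python
import math

def entropy(p_pos, p_neg):
    """Binary entropy; 0*log(0) is taken to be 0."""
    return -sum(p * math.log2(p) for p in (p_pos, p_neg) if p > 0)

print(entropy(0.5, 0.5))      # 1.0     (equally mixed)
print(entropy(1.0, 0.0))      # 0.0     (pure)
print(entropy(9/14, 5/14))    # ~0.940  (the full tennis data set)
print(entropy(3/7, 4/7))      # ~0.985  (Humidity = High)
print(entropy(6/7, 1/7))      # ~0.592  (Humidity = Normal)
```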
Information Gain
- The information gain of an attribute a is the expected reduction in entropy caused by partitioning on this attribute:

  Gain(S, a) = Entropy(S) - Σ_{v ∈ values(a)} (|Sv| / |S|) · Entropy(Sv)

- where Sv is the subset of S for which attribute a has value v, and the entropy of the partition is calculated by weighting the entropy of each part by its size relative to the original set.

- Partitions of low entropy (highly imbalanced class proportions) lead to high gain.
Will I play tennis today?
Current entropy of the full training set:
  p = 9/14 (positive), n = 5/14 (negative)

  H(Y) = -(9/14) log2(9/14) - (5/14) log2(5/14) ≈ 0.94
Information Gain: Outlook
Outlook = Sunny:    p = 2/5, n = 3/5    H_S = 0.971
Outlook = Overcast: p = 4/4, n = 0      H_O = 0
Outlook = Rainy:    p = 3/5, n = 2/5    H_R = 0.971

Expected entropy:
  (5/14) × 0.971 + (4/14) × 0 + (5/14) × 0.971 = 0.694

Information gain:
  0.940 - 0.694 = 0.246
Information Gain: Humidity
Humidity = High:   p = 3/7, n = 4/7    H_High = 0.985
Humidity = Normal: p = 6/7, n = 1/7    H_Normal = 0.592

Expected entropy:
  (7/14) × 0.985 + (7/14) × 0.592 = 0.789

Information gain:
  0.940 - 0.789 = 0.151
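These numbers can be checked with a few lines of Python (my own sketch, not part of the slides):

```python
import math

def H(p, n):
    """Binary entropy from counts of positive and negative examples."""
    total = p + n
    return -sum(c / total * math.log2(c / total) for c in (p, n) if c > 0)

base = H(9, 5)                                           # ~0.940
# Outlook: Sunny (2+,3-), Overcast (4+,0-), Rainy (3+,2-)
exp_outlook = 5/14 * H(2, 3) + 4/14 * H(4, 0) + 5/14 * H(3, 2)
# Humidity: High (3+,4-), Normal (6+,1-)
exp_humidity = 7/14 * H(3, 4) + 7/14 * H(6, 1)
print(base - exp_outlook)    # ≈ 0.247 (≈ 0.246 with the slides' rounded intermediates)
print(base - exp_humidity)   # ≈ 0.152 (≈ 0.151 with the slides' rounded intermediates)
```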
Which feature to split on?

Information gain for each candidate feature:
  Outlook:     0.246
  Humidity:    0.151
  Wind:        0.048
  Temperature: 0.029

→ Split on Outlook
An Illustrative Example (III)
Gain(S, Outlook)     = 0.246
Gain(S, Humidity)    = 0.151
Gain(S, Wind)        = 0.048
Gain(S, Temperature) = 0.029
→ Outlook becomes the root of the tree.
An Illustrative Example (III)
After splitting on Outlook:

Outlook
├── Sunny:    examples 1,2,8,9,11  (2+, 3-)  → ?
├── Overcast: examples 3,7,12,13   (4+, 0-)  → Yes
└── Rain:     examples 4,5,6,10,14 (3+, 2-)  → ?
An Illustrative Example (III)
Outlook
├── Sunny:    1,2,8,9,11  (2+, 3-)  → ?
├── Overcast: 3,7,12,13   (4+, 0-)  → Yes
└── Rain:     4,5,6,10,14 (3+, 2-)  → ?

Continue splitting until:
- every attribute is included in the path, or
- all examples in the leaf have the same label.
An Illustrative Example (IV)
Within the Sunny branch (examples 1, 2, 8, 9, 11; entropy ≈ 0.97):

  Gain(S_sunny, Humidity)    = 0.97 - (3/5)·0 - (2/5)·0            = 0.97
  Gain(S_sunny, Temperature) = 0.97 - (2/5)·0 - (2/5)·1 - (1/5)·0  = 0.57
  Gain(S_sunny, Wind)        = 0.97 - (2/5)·1 - (3/5)·0.92         = 0.02

Day  Outlook  Temperature  Humidity  Wind    PlayTennis
1    Sunny    Hot          High      Weak    No
2    Sunny    Hot          High      Strong  No
8    Sunny    Mild         High      Weak    No
9    Sunny    Cool         Normal    Weak    Yes
11   Sunny    Mild         Normal    Strong  Yes
An Illustrative Example (V)
Since Gain(S_sunny, Humidity) is highest, split the Sunny branch on Humidity:

Outlook
├── Sunny    → Humidity: High → No,  Normal → Yes
├── Overcast → Yes
└── Rain:    4,5,6,10,14 (3+, 2-)  → ?
induceDecisionTree(S)
1. Does S uniquely define a class?
   if all s ∈ S have the same label y: return S;

2. Find the feature with the most information gain:
   i = argmax_i Gain(S, X_i)

3. Add children to S:
   for k in Values(X_i):
     S_k = {s ∈ S | x_i = k}
     addChild(S, S_k)
     induceDecisionTree(S_k)
   return S;
An Illustrative Example (VI)
The final tree:

Outlook
├── Sunny    → Humidity: High → No,  Normal → Yes
├── Overcast → Yes
└── Rain     → Wind: Strong → No,  Weak → Yes
Hypothesis Space in Decision Tree Induction
- Conducts a search of the space of decision trees, which can represent all possible discrete functions (pros and cons).
- Goal: to find the best decision tree
  - "Best" could be "smallest depth"
  - "Best" could be "minimizing the expected number of tests"
- Finding a minimal decision tree consistent with a set of data is NP-hard.
- Performs a greedy heuristic search: hill climbing without backtracking.
- Makes statistically based decisions using all the data.
Example
Classify: Outlook = Sunny, Temperature = Hot, Humidity = Normal, Wind = Strong (label: No)

Outlook
├── Sunny    → Humidity: High → No,  Normal → Yes
├── Overcast → Yes
└── Rain     → Wind: Strong → No,  Weak → Yes

The tree routes this example Sunny → Humidity = Normal and predicts Yes, even though its label is No.
Another Example
Usage of mobile phones by individuals, based on their income, age, education, and marital status. Each attribute has been categorized into 2 or 3 types.

RID Income Age Education Marital Status Usage
1 Low Old University Married Low
2 Medium Young College Single Medium
3 Low Old University Married Low
4 High Young University Single High
5 Low Old University Married Low
6 High Young College Single Medium
7 Medium Young College Married Medium
8 Medium Old High School Single Low
9 High Old University Single High
10 Low Old High School Married Low
11 Medium Young College Married Medium
12 Medium Old High School Single Low
13 High Old University Single High
14 Low Old High School Married Low
15 Medium Young College Married Medium
Overfitting the Data
- Learning a tree that classifies the training data perfectly may not lead to the tree with the best generalization performance.
  - There may be noise in the training data that the tree is fitting.
  - The algorithm might be making decisions based on very little data.
- A decision tree overfits the training data when its accuracy on the training data goes up but its accuracy on unseen (test) data goes down.

[Figure: accuracy on the training set keeps rising with the complexity (size) of the tree, while accuracy on the test set eventually falls; empirical error plotted against model complexity.]

- Model complexity (informally): how many parameters do we have to learn?
- For decision trees: complexity = number of nodes.
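To make "complexity = number of nodes" concrete, here is a small scikit-learn sketch (library and dataset chosen by me for illustration, not part of the slides) that grows deeper and deeper trees and reports their node counts and training accuracy:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
for depth in (1, 2, 3, None):                      # None = grow until leaves are pure
    clf = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X, y)
    print(f"max_depth={depth}: {clf.tree_.node_count} nodes, "
          f"training accuracy = {clf.score(X, y):.3f}")
```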
Reasons for overfitting

- Too much variance in the training data
  - The training data is not a representative sample of the instance space.
  - We split on features that are actually irrelevant.

- Too much noise in the training data
  - Noise = some feature values or class labels are incorrect.
  - We learn to predict the noise.

In both cases, overfitting results from our drive to minimize the empirical error while learning, and from the ability of decision trees to do so.
Overfitting - Example
Suppose a new (noisy) training example arrives: Outlook = Sunny, Temperature = Hot, Humidity = Normal, Wind = Strong, label = No.

The current tree routes it to Sunny → Humidity = Normal → Yes and misclassifies it. To fit it, the Normal leaf is replaced by a further test on Wind:

Outlook
├── Sunny    → Humidity: High → No,  Normal → Wind: Strong → No, Weak → Yes
├── Overcast → Yes
└── Rain     → Wind: Strong → No,  Weak → Yes

This can always be done, but it may fit noise or other coincidental regularities.
Pruning a decision tree

- Prune = remove leaves and assign the majority label of the parent to all items.
- Prune the children of a node S if:
  - all children are leaves, and
  - the accuracy on the validation set does not decrease if we assign the most frequent class label to all items at S.
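The validation-set rule above (reduced-error pruning) can be written as a short Python sketch (my own, not the lecturer's code); the nested-dict tree representation and all names are assumptions for illustration:

```python
# A tree is either a class label (leaf) or a dict:
#   {"attr": attribute_name, "majority": label, "branches": {value: subtree}}

def predict(tree, x):
    """Classify one example (a dict of attribute values)."""
    while isinstance(tree, dict):
        tree = tree["branches"].get(x.get(tree["attr"]), tree["majority"])
    return tree

def prune(tree, val_data):
    """Replace a subtree with its node's majority label whenever that does not
    hurt accuracy on the validation examples that reach the node."""
    if not isinstance(tree, dict) or not val_data:
        return tree
    # First prune the children, each on the validation examples routed to it.
    for v in list(tree["branches"]):
        subset = [(x, y) for x, y in val_data if x.get(tree["attr"]) == v]
        tree["branches"][v] = prune(tree["branches"][v], subset)
    # Compare the (pruned) subtree against a single majority-label leaf.
    subtree_correct = sum(predict(tree, x) == y for x, y in val_data)
    leaf_correct = sum(y == tree["majority"] for _, y in val_data)
    return tree["majority"] if leaf_correct >= subtree_correct else tree

# Toy usage on the Sunny branch of the tennis tree:
sunny = {"attr": "Humidity", "majority": "No",
         "branches": {"High": "No", "Normal": "Yes"}}
validation = [({"Humidity": "High"}, "No"), ({"Humidity": "Normal"}, "Yes")]
print(prune(sunny, validation))   # subtree kept: it beats a single majority leaf
```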
Trees and Rules
- Decision trees can be represented as rules:
  - If (Outlook = Sunny) and (Humidity = Normal) then Yes
  - If (Outlook = Rain) and (Wind = Strong) then No
- Sometimes pruning can be done at the rule level: rules are generalized by erasing a condition (a different operation from removing subtrees).
Avoiding Overfitting
 Two basic approaches
 Pre-pruning: Stop growing the tree at some point during construction when it is
determined that there is not enough data to make reliable choices.
 Post-pruning: Grow the full tree and then remove nodes that seem not to have
sufficient evidence.
 Methods for evaluating subtrees to prune
 Cross-validation: Reserve hold-out set to evaluate utility
 Statistical testing: Test if the observed regularity can be dismissed as likely to
occur by chance
 Minimum Description Length: Is the additional complexity of the hypothesis
smaller than remembering the exceptions?
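For comparison, a hedged scikit-learn sketch (library and dataset assumed available, chosen by me): scikit-learn grows binary trees and post-prunes with cost-complexity pruning (ccp_alpha) rather than the reduced-error pruning sketched earlier, but it illustrates the pre- vs post-pruning controls.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Pre-pruning: stop growing early via depth / leaf-size limits.
pre = DecisionTreeClassifier(criterion="entropy", max_depth=3,
                             min_samples_leaf=5, random_state=0).fit(X_tr, y_tr)

# Post-pruning: grow the tree fully, then prune with cost-complexity pruning.
post = DecisionTreeClassifier(criterion="entropy", ccp_alpha=0.01,
                              random_state=0).fit(X_tr, y_tr)

print("pre-pruned  depth:", pre.get_depth(),  "test acc:", pre.score(X_te, y_te))
print("post-pruned depth:", post.get_depth(), "test acc:", post.score(X_te, y_te))
```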
N-fold cross validation

- Instead of a single training/test split:
  - Split the data into N equal-sized parts.
  - Train and test N different classifiers, each time holding out a different part as the test set.
  - Report the average accuracy and the standard deviation of the accuracy.
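A minimal sketch using scikit-learn's cross_val_score (library assumed available; the dataset is chosen only for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
print(f"10-fold accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```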
Evaluation: significance tests

- You have two different classifiers, A and B.
- You train and test them on the same data set using N-fold cross-validation.
- For the n-th fold you obtain accuracy(A, n) and accuracy(B, n), and their difference
  p_n = accuracy(A, n) - accuracy(B, n).
- Is the difference between A's and B's accuracies significant?
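One common way to answer this is a paired t-test on the per-fold differences. The slides do not prescribe a specific test; this is a sketch assuming scikit-learn and SciPy are available, with the classifiers and dataset chosen by me:

```python
from scipy.stats import ttest_rel
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
cv = KFold(n_splits=10, shuffle=True, random_state=0)   # same folds for A and B
acc_a = cross_val_score(DecisionTreeClassifier(max_depth=2, random_state=0), X, y, cv=cv)
acc_b = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)

t, p = ttest_rel(acc_a, acc_b)       # paired t-test on per-fold differences
print("per-fold differences:", acc_a - acc_b)
print(f"t = {t:.2f}, p = {p:.3f}")   # small p -> difference unlikely to be chance
```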
Decision Trees - Summary
 Hypothesis Space:
 Variable size (contains all functions)

 Deterministic; Discrete and Continuous attributes

 Search Algorithm
 ID3 - batch

 Issues:
 What is the goal?

 When to stop? How to guarantee good generalization?
