Classification: Basic Concepts, Decision Trees
Classification Learning: Definition

l Given a collection of records (training set)
  – Each record contains a set of attributes; one of the attributes is the class
l Find a model for the class attribute as a function of the values of the other attributes
l Goal: previously unseen records should be assigned a class as accurately as possible
  – Use a test set to estimate the accuracy of the model
  – Often, the given data set is divided into training and test sets, with the training
    set used to build the model and the test set used to validate it
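This train-then-validate workflow can be sketched in a few lines; the snippet below is a minimal illustration assuming scikit-learn, and the synthetic data and 70/30 split are assumptions, not part of the slides.

```python
# Minimal sketch of the train/test workflow, assuming scikit-learn is available.
# The synthetic data and the 70/30 split ratio are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Records: each row of X is a set of attributes, y holds the class attribute.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# Divide the data into a training set (build the model) and a test set (validate it).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier().fit(X_train, y_train)      # induction: learn the model
print("estimated accuracy:", model.score(X_test, y_test))   # deduction: apply to unseen records
```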
Illustrating Classification Learning

[Figure: the training set is fed to a learning algorithm, which induces (learns) a model;
the model is then applied to the test set (deduction) to predict the unknown classes.]

Training Set
Tid  Attrib1  Attrib2  Attrib3  Class
 1   Yes      Large    125K     No
 2   No       Medium   100K     No
 3   No       Small     70K     No
 4   Yes      Medium   120K     No
 5   No       Large     95K     Yes
 6   No       Medium    60K     No
 7   Yes      Large    220K     No
 8   No       Small     85K     Yes
 9   No       Medium    75K     No
10   No       Small     90K     Yes

Test Set
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small     55K     ?
12   Yes      Medium    80K     ?
13   Yes      Large    110K     ?
14   No       Small     95K     ?
15   No       Large     67K     ?
Examples of Classification Task

l Predicting tumor cells as benign or malignant
l Classifying credit card transactions as legitimate or fraudulent
l Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil
l Categorizing news stories as finance, weather, entertainment, sports, etc.
Classification Learning Techniques

l Decision tree-based methods
l Rule-based methods
l Instance-based methods
l Probability-based methods
l Neural networks
l Support vector machines
l Logic-based methods
Example of a Decision Tree

Training Data

Tid  Refund  Marital Status  Taxable Income  Cheat
 1   Yes     Single          125K            No
 2   No      Married         100K            No
 3   No      Single           70K            No
 4   Yes     Married         120K            No
 5   No      Divorced         95K            Yes
 6   No      Married          60K            No
 7   Yes     Divorced        220K            No
 8   No      Single           85K            Yes
 9   No      Married          75K            No
10   No      Single           90K            Yes

Model: Decision Tree

Refund?
  Yes: NO
  No: MarSt?
    Single, Divorced: TaxInc?
      < 80K: NO
      > 80K: YES
    Married: NO

Apply Model to Test Data

Test Data
Refund  Marital Status  Taxable Income  Cheat
No      Married         80K             ?

Start at the root of the tree and follow the branch that matches the test record at each
node: Refund = No, so take the No branch to MarSt; Marital Status = Married, so take the
Married branch, which leads to the leaf NO. Assign Cheat to "No".

Refund?
  Yes: NO
  No: MarSt?
    Single, Divorced: TaxInc?
      < 80K: NO
      > 80K: YES
    Married: NO
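As a sketch, the traversal can be hand-coded as nested conditionals; the function name, dictionary keys, and the numeric treatment of Taxable Income below are illustrative assumptions.

```python
# Minimal sketch of applying the learned tree by hand-coding its rules.
def classify_cheat(record):
    """Traverse the decision tree: Refund -> MarSt -> TaxInc."""
    if record["Refund"] == "Yes":
        return "No"
    if record["Marital Status"] == "Married":
        return "No"
    # Single or Divorced: split on Taxable Income at 80K (boundary treated as NO here)
    return "Yes" if record["Taxable Income"] > 80_000 else "No"

test_record = {"Refund": "No", "Marital Status": "Married", "Taxable Income": 80_000}
print(classify_cheat(test_record))  # -> "No", matching the traversal above
```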
Decision Tree Learning: ID3

Function ID3(Training-set, Attributes)
– If all elements in Training-set are in the same class, then return a leaf node
  labeled with that class
– Else if Attributes is empty, then return a leaf node labeled with the majority
  class in Training-set
– Else if Training-set is empty, then return a leaf node labeled with the default
  (majority) class
– Else
  ◆ Select and remove A from Attributes
  ◆ Make A the root of the current tree
  ◆ For each value V of A
    – Create a branch of the current tree labeled by V
    – Partition_V ← elements of Training-set with value V for A
    – Recursively call ID3(Partition_V, Attributes)
    – Attach the result to branch V
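Below is a compact Python sketch of this recursion. It selects the splitting attribute by information gain, as later slides describe; the function names and the nested-dict tree encoding are assumptions.

```python
# Compact sketch of the ID3 recursion; attribute selection uses information gain.
from collections import Counter
from math import log2

def entropy(rows, target):
    """Entropy of the class distribution of a set of records."""
    counts = Counter(r[target] for r in rows)
    n = len(rows)
    return sum(-(c / n) * log2(c / n) for c in counts.values())

def id3(rows, attributes, target, default=None):
    if not rows:                                   # empty training set
        return default
    classes = [r[target] for r in rows]
    majority = Counter(classes).most_common(1)[0][0]
    if len(set(classes)) == 1:                     # all elements in the same class
        return classes[0]
    if not attributes:                             # no attributes left
        return majority

    def gain(a):                                   # information gain of splitting on a
        remainder = 0.0
        for v in set(r[a] for r in rows):
            part = [r for r in rows if r[a] == v]
            remainder += len(part) / len(rows) * entropy(part, target)
        return entropy(rows, target) - remainder

    best = max(attributes, key=gain)               # select A ...
    tree = {best: {}}                              # ... and make it the root
    for v in set(r[best] for r in rows):           # for each value V of A
        partition = [r for r in rows if r[best] == v]
        subtree = id3(partition, [a for a in attributes if a != best], target, majority)
        tree[best][v] = subtree                    # attach the result to branch V
    return tree
```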
Illustrative Training Set

Risk Assessment for Loan Applications

Client #  Credit History  Debt Level  Collateral  Income Level  RISK LEVEL
 1        Bad             High        None        Low           HIGH
 2        Unknown         High        None        Medium        HIGH
 3        Unknown         Low         None        Medium        MODERATE
 4        Unknown         Low         None        Low           HIGH
 5        Unknown         Low         None        High          LOW
 6        Unknown         Low         Adequate    High          LOW
 7        Bad             Low         None        Low           HIGH
 8        Bad             Low         Adequate    High          MODERATE
 9        Good            Low         None        High          LOW
10        Good            High        Adequate    High          LOW
11        Good            High        None        Low           HIGH
12        Good            High        None        Medium        MODERATE
13        Good            High        None        High          LOW
14        Bad             High        None        Medium        HIGH
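For experimentation, the table can be written as Python records and passed to an ID3 implementation such as the sketch above; the short key names are illustrative.

```python
# The loan-risk training set as Python records; key names mirror the table's columns.
loans = [
    {"credit": "Bad",     "debt": "High", "collateral": "None",     "income": "Low",    "risk": "HIGH"},
    {"credit": "Unknown", "debt": "High", "collateral": "None",     "income": "Medium", "risk": "HIGH"},
    {"credit": "Unknown", "debt": "Low",  "collateral": "None",     "income": "Medium", "risk": "MODERATE"},
    {"credit": "Unknown", "debt": "Low",  "collateral": "None",     "income": "Low",    "risk": "HIGH"},
    {"credit": "Unknown", "debt": "Low",  "collateral": "None",     "income": "High",   "risk": "LOW"},
    {"credit": "Unknown", "debt": "Low",  "collateral": "Adequate", "income": "High",   "risk": "LOW"},
    {"credit": "Bad",     "debt": "Low",  "collateral": "None",     "income": "Low",    "risk": "HIGH"},
    {"credit": "Bad",     "debt": "Low",  "collateral": "Adequate", "income": "High",   "risk": "MODERATE"},
    {"credit": "Good",    "debt": "Low",  "collateral": "None",     "income": "High",   "risk": "LOW"},
    {"credit": "Good",    "debt": "High", "collateral": "Adequate", "income": "High",   "risk": "LOW"},
    {"credit": "Good",    "debt": "High", "collateral": "None",     "income": "Low",    "risk": "HIGH"},
    {"credit": "Good",    "debt": "High", "collateral": "None",     "income": "Medium", "risk": "MODERATE"},
    {"credit": "Good",    "debt": "High", "collateral": "None",     "income": "High",   "risk": "LOW"},
    {"credit": "Bad",     "debt": "High", "collateral": "None",     "income": "Medium", "risk": "HIGH"},
]
attributes = ["credit", "debt", "collateral", "income"]

# Assuming the id3 sketch above is in scope:
tree = id3(loans, attributes, target="risk")
print(tree)
```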
ID3 Example (I)

1) Choose Income as the root of the tree. Its values partition the training set:
   Low: {1, 4, 7, 11}   Medium: {2, 3, 12, 14}   High: {5, 6, 8, 9, 10, 13}

2) Low branch {1, 4, 7, 11}: all examples are in the same class, HIGH.
   Return a leaf node labeled HIGH.

3) Medium branch {2, 3, 12, 14}: choose Debt Level as the root of the subtree.
   Low: {3}   High: {2, 12, 14}
   3a) {3}: all examples are in the same class, MODERATE. Return a leaf node
       labeled MODERATE.
   3b) {2, 12, 14}: continued on the next slide.
ID3 Example (II)

3b) For {2, 12, 14}, choose Credit History as the root of the subtree:
    Unknown: {2}   Bad: {14}   Good: {12}
    3b'-3b''') All examples in each partition are in the same class. Return leaf
    nodes: Unknown: HIGH, Bad: HIGH, Good: MODERATE.

4) High branch {5, 6, 8, 9, 10, 13}: choose Credit History as the root of the subtree:
   Unknown: {5, 6}   Bad: {8}   Good: {9, 10, 13}
   4a-4c) All examples in each partition are in the same class. Return leaf nodes:
   Unknown: LOW, Bad: MODERATE, Good: LOW.
ID3 Example (III)

Attach the subtrees at the appropriate places. The final tree:

Income?
  Low: HIGH
  Medium: Debt Level?
    Low: MODERATE
    High: Credit History?
      Unknown: HIGH
      Bad: HIGH
      Good: MODERATE
  High: Credit History?
    Unknown: LOW
    Bad: MODERATE
    Good: LOW


Another Example

Assume A1 is a binary feature (Gender: M/F)
Assume A2 is a nominal feature (Color: R/G/B)

[Figure: two decision trees built over A1 and A2 and the corresponding partitions of
the feature space.]

Decision surfaces are axis-aligned hyper-rectangles.
Non-Uniqueness

Decision trees are not unique:
– Given a set of training instances T, there generally exist a number of decision
  trees that are consistent with (or fit) T

For example, the following tree also fits the earlier training data (Tid 1-10), yet it
differs from the tree shown before:

MarSt?
  Married: NO
  Single, Divorced: Refund?
    Yes: NO
    No: TaxInc?
      < 80K: NO
      > 80K: YES
ID3’s Question

Given a training set, which of all of the decision trees consistent with that training
set should we pick?

More precisely:
Given a training set, which of all of the decision trees consistent with that training
set has the greatest likelihood of correctly classifying unseen instances of the
population?
ID3’s (Approximate) Bias

ID3 (and family) prefers simpler decision trees


Occam’s Razor Principle:
– “It is vain to do with more what can be
done with less...Entities should not be
multiplied beyond necessity.”
Intuitively:
– Always accept the simplest answer that fits
the data, avoid unnecessary constraints
– Simpler trees are more general
ID3’s Question Revisited

l ID3 builds a decision tree by recursively selecting attributes and splitting the
  training data based on the values of those attributes

Practically:
Given a training set, how do we select attributes so that the resulting tree is as
small as possible, i.e., contains as few attributes as possible?
Not All Attributes Are Created Equal

l Each attribute of an instance may be thought of as contributing a certain amount
  of information to its classification
  – Think 20-Questions:
    ◆ What are good questions?
    ◆ Ones whose answers maximize the information gained
  – For example, to determine the shape of an object:
    ◆ Number of sides contributes a certain amount of information
    ◆ Color contributes a different amount of information

l ID3 measures the information gained by making each attribute the root of the
  current subtree, and subsequently chooses the attribute that produces the greatest
  information gain
Entropy (as information)

l Entropy at a given node t:

  $Entropy(t) = -\sum_{j} p(j \mid t)\,\log_2 p(j \mid t)$

  (where p(j|t) is the relative frequency of class j at node t)

l Based on Shannon's information theory
  – For simplicity, assume only 2 classes, Yes and No
  – Assume t is a set of messages sent to a receiver that must guess their class
  – If p(Yes|t) = 1 (resp., p(No|t) = 1), then the receiver guesses a new example as
    Yes (resp., No). No message need be sent.
  – If p(Yes|t) = p(No|t) = 0.5, then the receiver cannot guess and must be told the
    class of a new example. A 1-bit message must be sent.
  – If 0 < p(Yes|t) < 1, then the receiver needs less than 1 bit on average to know
    the class of a new example.
Entropy (as homogeneity)

l Think chemistry/physics
  – Entropy is a measure of disorder (lack of homogeneity)
  – Minimum (0.0) when homogeneous / perfect order
  – Maximum (1.0 for 2 classes, in general log C) when most heterogeneous /
    complete chaos
l In ID3
– Minimum (0.0) when all records belong to one class,
implying most information
– Maximum (log C) when records are equally distributed
among all classes, implying least information
– Intuitively, the smaller the entropy the purer the partition
Examples of Computing Entropy

$Entropy(t) = -\sum_{j} p(j \mid t)\,\log_2 p(j \mid t)$

C1: 0, C2: 6   P(C1) = 0/6 = 0   P(C2) = 6/6 = 1
               Entropy = – 0 log2 0 – 1 log2 1 = – 0 – 0 = 0

C1: 1, C2: 5   P(C1) = 1/6   P(C2) = 5/6
               Entropy = – (1/6) log2 (1/6) – (5/6) log2 (5/6) = 0.65

C1: 2, C2: 4   P(C1) = 2/6   P(C2) = 4/6
               Entropy = – (2/6) log2 (2/6) – (4/6) log2 (4/6) = 0.92
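These numbers can be checked with a few lines of Python; the entropy helper below is an illustrative sketch over class counts.

```python
# A small check of the worked examples above; the helper name is illustrative.
from math import log2

def entropy(counts):
    """Entropy of a node from its class counts (0*log 0 treated as 0)."""
    n = sum(counts)
    return sum(-(c / n) * log2(c / n) for c in counts if c > 0)

print(round(entropy([0, 6]), 2))  # 0.0
print(round(entropy([1, 5]), 2))  # 0.65
print(round(entropy([2, 4]), 2))  # 0.92
```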
Information Gain

l Information Gain:

  $GAIN_{split} = Entropy(p) - \sum_{i=1}^{k} \frac{n_i}{n}\,Entropy(i)$

  (where parent node p is split into k partitions, and n_i is the number of records
  in partition i)

  – Measures the reduction in entropy achieved because of the split → maximize
  – ID3 chooses to split on the attribute that results in the largest reduction,
    i.e., maximizes GAIN
  – Disadvantage: tends to prefer splits that result in a large number of partitions,
    each being small but pure
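A minimal sketch of GAIN_split computed from class counts; the helper names and example counts are illustrative.

```python
# Sketch of information gain for one candidate split, given class counts.
from math import log2

def entropy(counts):
    n = sum(counts)
    return sum(-(c / n) * log2(c / n) for c in counts if c > 0)

def information_gain(parent_counts, partition_counts):
    """parent_counts: class counts at the parent node.
    partition_counts: one list of class counts per child partition."""
    n = sum(parent_counts)
    weighted = sum(sum(part) / n * entropy(part) for part in partition_counts)
    return entropy(parent_counts) - weighted

# Example: a parent with counts [3, 3] split into pure children [3, 0] and [0, 3].
print(information_gain([3, 3], [[3, 0], [0, 3]]))  # 1.0
```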
Computing Gain

Before splitting: the parent node has class counts C0: N00 and C1: N01, with entropy E0.

Candidate split A? (Yes/No) produces nodes N1 (C0: N10, C1: N11) and N2 (C0: N20,
C1: N21), with entropies E1 and E2 and weighted average E12.
Candidate split B? (Yes/No) produces nodes N3 (C0: N30, C1: N31) and N4 (C0: N40,
C1: N41), with entropies E3 and E4 and weighted average E34.

Compare Gain = E0 – E12 vs. E0 – E34 and choose the split with the larger gain.
Gain Ratio

l Gain Ratio:

  $GainRATIO_{split} = \frac{GAIN_{split}}{SplitINFO}$, where
  $SplitINFO = -\sum_{i=1}^{k} \frac{n_i}{n}\,\log_2 \frac{n_i}{n}$

  (where parent node p is split into k partitions, and n_i is the number of records
  in partition i)

  – Designed to overcome the disadvantage of GAIN
  – Adjusts GAIN by the entropy of the partitioning (SplitINFO)
  – Higher-entropy partitioning (a large number of small partitions) is penalized
  – Used by C4.5 (an extension of ID3)
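A corresponding sketch of the gain ratio; entropy is restated so the snippet runs on its own, and the names are illustrative.

```python
# Sketch of gain ratio over class counts; entropy is restated for self-containment.
from math import log2

def entropy(counts):
    n = sum(counts)
    return sum(-(c / n) * log2(c / n) for c in counts if c > 0)

def gain_ratio(parent_counts, partition_counts):
    n = sum(parent_counts)
    sizes = [sum(part) for part in partition_counts]
    gain = entropy(parent_counts) - sum(s / n * entropy(part)
                                        for s, part in zip(sizes, partition_counts))
    split_info = sum(-(s / n) * log2(s / n) for s in sizes if s > 0)
    return gain / split_info

# A 2-way even split into pure partitions: gain 1.0, SplitINFO 1.0, ratio 1.0.
print(gain_ratio([3, 3], [[3, 0], [0, 3]]))
```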
Other Splitting Criterion: GINI Index

l GINI index for a given node t:

  $GINI(t) = 1 - \sum_{j} [p(j \mid t)]^2$

  $GINI_{split} = \sum_{i=1}^{k} \frac{n_i}{n}\,GINI(i)$

  – Maximum (1 – 1/n_c) when records are equally distributed among all classes,
    implying least interesting information
  – Minimum (0.0) when all records belong to one class, implying most interesting
    information
  – Minimize the GINI_split value
  – Used by CART, SLIQ, SPRINT
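A minimal sketch of GINI and GINI_split over class counts; names and example counts are illustrative.

```python
# Sketch of the GINI index of a node and the GINI of a split.
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_split(partition_counts):
    n = sum(sum(part) for part in partition_counts)
    return sum(sum(part) / n * gini(part) for part in partition_counts)

print(gini([3, 3]))                  # 0.5  (maximum for 2 classes: 1 - 1/2)
print(gini([6, 0]))                  # 0.0  (pure node)
print(gini_split([[3, 0], [0, 3]]))  # 0.0  -> the preferred (pure) split
```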
How to Specify Test Condition?

l Depends on attribute types


– Nominal
– Ordinal
– Continuous

l Depends on number of ways to split


– Binary split
– Multi-way split
Splitting Based on Nominal Attributes

l Multi-way split: use as many partitions as distinct values
  CarType: {Family}, {Sports}, {Luxury}

l Binary split: divide the values into two subsets
  CarType: {Sports, Luxury} vs. {Family}   OR   {Family, Luxury} vs. {Sports}
  Need to find the optimal partitioning!
Splitting Based on Continuous Attributes

l Different ways of handling
  – Multi-way split: form an ordinal categorical attribute
    ◆ Static – discretize once at the beginning
    ◆ Dynamic – repeat on each new partition
  – Binary split: (A < v) or (A ≥ v)
    ◆ How to choose v?

l Need to find the optimal partitioning!
l Can use GAIN or GINI!
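One common way to choose v is to evaluate the midpoints between consecutive sorted values and keep the one with the highest gain; the sketch below assumes this strategy and reuses the Taxable Income values from the earlier table as toy data.

```python
# Sketch: choose a binary split threshold v for a continuous attribute by gain.
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    pairs = sorted(zip(values, labels))
    xs = [v for v, _ in pairs]
    ys = [y for _, y in pairs]
    parent = entropy(ys)
    best_v, best_gain = None, -1.0
    for i in range(1, len(xs)):
        if xs[i] == xs[i - 1]:
            continue
        v = (xs[i] + xs[i - 1]) / 2     # candidate: A < v  vs.  A >= v
        left, right = ys[:i], ys[i:]
        gain = (parent
                - len(left) / len(ys) * entropy(left)
                - len(right) / len(ys) * entropy(right))
        if gain > best_gain:
            best_v, best_gain = v, gain
    return best_v, best_gain

# Taxable Income and Cheat values from the earlier training data, sorted by income.
incomes = [60, 70, 75, 85, 90, 95, 100, 120, 125, 220]
cheat   = ["No", "No", "No", "Yes", "Yes", "Yes", "No", "No", "No", "No"]
print(best_threshold(incomes, cheat))   # (best v, its information gain)
```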


Overfitting and Underfitting

Overfitting:
– Given a model space H, a specific model h ∈ H is said to overfit the training data
  if there exists some alternative model h' ∈ H, such that h has smaller error than h'
  over the training examples, but h' has smaller error than h over the entire
  distribution of instances

Underfitting:
– The model is too simple, so that both training and test errors are large
Detecting Overfitting

[Figure illustrating overfitting.]
Overfitting in Decision Tree Learning

l Overfitting results in decision trees that are more complex than necessary
  – Tree growth went too far
  – The number of instances gets smaller as we build the tree (e.g., several leaves
    match a single example)

l Training error no longer provides a good estimate of how well the tree will
  perform on previously unseen records
Avoiding Tree Overfitting – Solution 1

l Pre-Pruning (Early Stopping Rule)
  – Stop the algorithm before it becomes a fully-grown tree
  – Typical stopping conditions for a node:
    ◆ Stop if all instances belong to the same class
    ◆ Stop if all the attribute values are the same
  – More restrictive conditions:
    ◆ Stop if the number of instances is less than some user-specified threshold
    ◆ Stop if the class distribution of the instances is independent of the
      available features (e.g., using the χ² test)
    ◆ Stop if expanding the current node does not improve the impurity
      measures (e.g., GINI or GAIN)
Avoiding Tree Overfitting – Solution 2

l Post-pruning
– Split dataset into training and validation sets
– Grow full decision tree on training set
– While the accuracy on the validation set
increases:
◆Evaluate the impact of pruning each subtree, replacing
its root by a leaf labeled with the majority class for that
subtree
◆Replace subtree that most increases validation set
accuracy (greedy approach)
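A rough sketch of this post-pruning loop, written for the nested-dict trees used in the earlier ID3 sketch; the helper names, the tree encoding, and the handling of unseen attribute values are assumptions.

```python
# Sketch of greedy post-pruning on a nested-dict tree: {attribute: {value: subtree}}.
from collections import Counter

def classify(tree, record, default):
    while isinstance(tree, dict):
        attr = next(iter(tree))
        tree = tree[attr].get(record[attr], default)   # unseen value -> default class
    return tree

def accuracy(tree, rows, target, default):
    return sum(classify(tree, r, default) == r[target] for r in rows) / len(rows)

def subtrees(tree, path=()):
    """Yield (path, subtree) for every internal node; path is a tuple of (attr, value)."""
    if isinstance(tree, dict):
        yield path, tree
        attr = next(iter(tree))
        for v, child in tree[attr].items():
            yield from subtrees(child, path + ((attr, v),))

def replace(tree, path, leaf):
    """Return a copy of tree with the subtree at path replaced by a leaf label."""
    if not path:
        return leaf
    attr, v = path[0]
    children = dict(tree[attr])
    children[v] = replace(tree[attr][v], path[1:], leaf)
    return {attr: children}

def majority(rows, target):
    return Counter(r[target] for r in rows).most_common(1)[0][0]

def prune(tree, train, valid, target):
    default = majority(train, target)
    while isinstance(tree, dict):
        base = accuracy(tree, valid, target, default)
        best_candidate, best_acc = None, base
        for path, _ in subtrees(tree):
            # Training records that reach this subtree determine its majority-class leaf.
            reach = [r for r in train if all(r[a] == v for a, v in path)] or train
            candidate = replace(tree, path, majority(reach, target))
            acc = accuracy(candidate, valid, target, default)
            if acc > best_acc:
                best_candidate, best_acc = candidate, acc
        if best_candidate is None:    # no pruning step improves validation accuracy
            return tree
        tree = best_candidate         # greedily apply the best prune and repeat
    return tree
```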
Decision Tree Based Classification

l Advantages:
– Inexpensive to construct
– Extremely fast at classifying unknown records
– Easy to interpret for small-sized trees
– Good accuracy (careful: No Free Lunch)
l Disadvantages:
– Axis-parallel decision boundaries
– Redundancy
– Need data to fit in memory
– Need to retrain with new data
Oblique Decision Trees

[Figure: a two-class data set separated by the oblique boundary x + y < 1.]

• Test condition may involve multiple attributes
• More expressive representation
• Finding the optimal test condition is computationally expensive
Subtree Replication

[Figure: a decision tree in which the same subtree (rooted at S) appears under
multiple branches.]

• Same subtree appears in multiple branches
Decision Trees and Big Data

l For each node in the decision tree we need to scan the entire data set

l Could just do multiple MapReduce steps across the entire data set
  – Ouch!

l Have each processor build a decision tree based on local data
  – Combine decision trees
  – Better?
Combining Decision Trees

l Meta Decision Tree


– Somehow decide which decision tree to use
– Hierarchical
– More training needed
l Voting
– Run classification data through all decision
trees and then vote
l Merge Decision Trees
