
Decision Tree Learning

Today’s Agenda
• What is a Decision Tree?
• ID3 Algorithm
• Entropy and Information Gain
• Tree Construction
• Examples
• Summary
Decision Trees
● A decision tree consists of:
  Nodes:
  ● test the value of a certain attribute
  Edges:
  ● correspond to the outcomes of the test
  ● connect to the next node or leaf
  Leaves:
  ● terminal nodes that predict the outcome (class label)

To classify an example:
1. start at the root
2. perform the test at the current node
3. follow the edge corresponding to the outcome
4. go to 2. unless the node is a leaf
5. predict the outcome associated with the leaf
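
A minimal Python sketch of this traversal, assuming the tree is stored as nested dictionaries (each internal node maps an attribute name to a dictionary of outcome → subtree; anything else is a leaf label). The representation and names are illustrative assumptions, not a fixed format.

```python
def classify(node, example):
    """Classify an example by walking the tree from the root to a leaf."""
    # An internal node is a one-entry dict {attribute: {outcome: subtree, ...}};
    # anything else is a leaf holding the predicted class.
    while isinstance(node, dict):
        attribute, branches = next(iter(node.items()))  # 2. perform the test
        node = branches[example[attribute]]             # 3. follow the matching edge
    return node                                         # 5. predict the leaf's outcome
```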
Decision Tree Learning

In Decision Tree Learning, a new example is classified by submitting it to a series of tests that determine the class label of the example. These tests are organized in a hierarchical structure called a decision tree.

The training examples are used for choosing appropriate tests in the decision tree. Typically, a tree is built from top to bottom, where tests that maximize the information gain about the classification are selected first.

[Figure: training examples are used to build the decision tree, which then classifies a new example.]

A Sample Task (Training & Testing Data)
Day Temperature Outlook Humidity Windy Play Golf?
07-05 hot sunny high false no
07-06 hot sunny high true no
07-07 hot overcast high false yes
07-09 cool rain normal false yes
07-10 cool overcast normal true yes
07-12 mild sunny high false no
07-14 cool sunny normal false yes
07-15 mild rain normal false yes
07-20 mild sunny normal true yes
07-21 mild overcast high true yes
07-22 hot overcast normal false yes
07-23 mild rain high true no
07-26 cool rain normal true no
07-30 mild rain high false yes

today cool sunny normal false ?


tomorrow mild sunny normal false ?
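
For the sketches that follow, one convenient (purely illustrative) way to hold this training data in Python is a list of (attribute dictionary, class label) pairs:

```python
# Training examples from the table above, as (attributes, play_golf) pairs.
weather_data = [
    ({"Temperature": "hot",  "Outlook": "sunny",    "Humidity": "high",   "Windy": False}, "no"),
    ({"Temperature": "hot",  "Outlook": "sunny",    "Humidity": "high",   "Windy": True},  "no"),
    ({"Temperature": "hot",  "Outlook": "overcast", "Humidity": "high",   "Windy": False}, "yes"),
    ({"Temperature": "cool", "Outlook": "rain",     "Humidity": "normal", "Windy": False}, "yes"),
    ({"Temperature": "cool", "Outlook": "overcast", "Humidity": "normal", "Windy": True},  "yes"),
    ({"Temperature": "mild", "Outlook": "sunny",    "Humidity": "high",   "Windy": False}, "no"),
    ({"Temperature": "cool", "Outlook": "sunny",    "Humidity": "normal", "Windy": False}, "yes"),
    ({"Temperature": "mild", "Outlook": "rain",     "Humidity": "normal", "Windy": False}, "yes"),
    ({"Temperature": "mild", "Outlook": "sunny",    "Humidity": "normal", "Windy": True},  "yes"),
    ({"Temperature": "mild", "Outlook": "overcast", "Humidity": "high",   "Windy": True},  "yes"),
    ({"Temperature": "hot",  "Outlook": "overcast", "Humidity": "normal", "Windy": False}, "yes"),
    ({"Temperature": "mild", "Outlook": "rain",     "Humidity": "high",   "Windy": True},  "no"),
    ({"Temperature": "cool", "Outlook": "rain",     "Humidity": "normal", "Windy": True},  "no"),
    ({"Temperature": "mild", "Outlook": "rain",     "Humidity": "high",   "Windy": False}, "yes"),
]
```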

Decision Tree Example

Outlook
  sunny    → Humidity
               high   → No
               normal → Yes
  overcast → Yes
  rain     → Windy
               true   → No
               false  → Yes
Decision Tree Learning

tomorrow mild sunny normal false ?

Following the tree: Outlook = sunny → Humidity = normal → Yes, play golf tomorrow.
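
The same walk can be spelled out in Python, encoding the example tree above in the nested-dictionary form used earlier (an assumed representation, not a prescribed one):

```python
decision_tree = {
    "Outlook": {
        "sunny":    {"Humidity": {"high": "no", "normal": "yes"}},
        "overcast": "yes",
        "rain":     {"Windy": {True: "no", False: "yes"}},
    }
}

tomorrow = {"Temperature": "mild", "Outlook": "sunny", "Humidity": "normal", "Windy": False}

node = decision_tree
while isinstance(node, dict):                        # stop when a leaf is reached
    attribute, branches = next(iter(node.items()))   # test at the current node
    node = branches[tomorrow[attribute]]             # follow the matching edge
print(node)  # -> "yes"
```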

Divide-And-Conquer Algorithms
● Family of decision tree learning algorithms
  TDIDT: Top-Down Induction of Decision Trees
● Learn trees in a top-down fashion:
  divide the problem into subproblems, then solve each subproblem

Basic Divide-And-Conquer Algorithm:

1. select a test for the root node; create a branch for each possible outcome of the test
2. split the instances into subsets, one for each branch extending from the node
3. repeat recursively for each branch, using only the instances that reach the branch
4. stop the recursion for a branch if all of its instances have the same class

Decision Tree (ID3 Pseudocode)
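
A minimal Python sketch of the basic divide-and-conquer procedure described above; the attribute-selection rule (for ID3, the information gain developed on the following slides) is passed in as a parameter, and names such as build_tree and select_attribute are illustrative assumptions.

```python
from collections import Counter

def build_tree(examples, attributes, select_attribute):
    """Top-down induction of a decision tree (TDIDT sketch).

    examples:         list of (attribute_dict, class_label) pairs
    attributes:       attribute names still available for testing
    select_attribute: function(examples, attributes) -> attribute to test next
    """
    labels = [label for _, label in examples]
    if len(set(labels)) == 1 or not attributes:        # step 4: pure node (or nothing left to test)
        return Counter(labels).most_common(1)[0][0]    # leaf: the (majority) class
    best = select_attribute(examples, attributes)      # step 1: choose the test for this node
    branches = {}
    for value in {ex[best] for ex, _ in examples}:     # step 2: one subset per outcome
        subset = [(ex, label) for ex, label in examples if ex[best] == value]
        rest = [a for a in attributes if a != best]
        branches[value] = build_tree(subset, rest, select_attribute)  # step 3: recurse
    return {best: branches}
```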
A Different Decision Tree

● also explains all of the training data



will it generalize well to new data?
Which attribute to select as the root?

A criterion for attribute selection
• Which is the best attribute?
  – The one that will result in the smallest tree
  – Heuristic: choose the attribute that produces the "purest" nodes
• Popular impurity criterion: information gain
  – Information gain increases with the average purity of the subsets that an attribute produces
• Strategy: choose the attribute that results in the greatest information gain

Computing information
• Information is measured in bits
  – Given a probability distribution, the info required to predict an event is the distribution's entropy
  – Entropy gives the information required in bits (this can involve fractions of bits!)

Consider entropy H(p)

[Figure: entropy H(p) plotted against the proportion p of "yes" examples]
• pure, 100% yes → entropy is 0 bits
• not pure at all, 40% yes → almost 1 bit of information is required to distinguish yes and no
Entropy
• Formula for computing the entropy:

  $\mathrm{entropy}(p_1, p_2, \ldots, p_n) = -p_1 \log p_1 - p_2 \log p_2 - \cdots - p_n \log p_n$
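
As a small illustration, the formula can be evaluated directly in Python using base-2 logarithms, so the result is in bits; the convention that 0 · log(0) counts as zero (used on a later slide) is applied here as well. The function name is illustrative.

```python
import math

def entropy(probabilities):
    """entropy(p1, ..., pn) = -sum(p_i * log2(p_i)), treating 0*log(0) as 0."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

print(entropy([0.5, 0.5]))  # 1.0 bit: an evenly mixed node
print(entropy([1.0, 0.0]))  # -0.0, i.e. 0 bits: a pure node
```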

Entropy of a split
• Information in a split with x items of one class and y items of the second class:

  $\mathrm{info}([x, y]) = \mathrm{entropy}\left(\frac{x}{x+y}, \frac{y}{x+y}\right) = -\frac{x}{x+y}\log\frac{x}{x+y} - \frac{y}{x+y}\log\frac{y}{x+y}$

Example: attribute "Outlook"
• "Outlook" = "Sunny": 2 and 3 split

  $\mathrm{info}([2,3]) = \mathrm{entropy}(2/5, 3/5) = -\frac{2}{5}\log\frac{2}{5} - \frac{3}{5}\log\frac{3}{5} = 0.971 \text{ bits}$

Outlook = Overcast
• "Outlook" = "Overcast": 4/0 split

  $\mathrm{info}([4,0]) = \mathrm{entropy}(1, 0) = -1\log(1) - 0\log(0) = 0 \text{ bits}$

  Note: log(0) is not defined, but we evaluate 0 · log(0) as zero.

Outlook = Rainy
• "Outlook" = "Rainy":

  $\mathrm{info}([3,2]) = \mathrm{entropy}(3/5, 2/5) = -\frac{3}{5}\log\frac{3}{5} - \frac{2}{5}\log\frac{2}{5} = 0.971 \text{ bits}$

Expected Information
Expected information for attribute:

  $\mathrm{info}([3,2],[4,0],[3,2]) = \frac{5}{14} \times 0.971 + \frac{4}{14} \times 0 + \frac{5}{14} \times 0.971 = 0.693 \text{ bits}$
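
A short sketch of this weighted average; expected_info and info are illustrative helper names.

```python
import math

def info(x, y):
    total = x + y
    return -sum(c / total * math.log2(c / total) for c in (x, y) if c > 0)

def expected_info(splits):
    """Average information over the subsets, weighted by subset size."""
    total = sum(x + y for x, y in splits)
    return sum((x + y) / total * info(x, y) for x, y in splits)

print(expected_info([(3, 2), (4, 0), (3, 2)]))  # ~0.6935, the 0.693 bits above
```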

Computing the information gain
• Information gain: (information before split) - (information after split)

  $\mathrm{gain}(\mathrm{Outlook}) = \mathrm{info}([9,5]) - \mathrm{info}([2,3],[4,0],[3,2]) = 0.940 - 0.693 = 0.247 \text{ bits}$

• Information gain for the attributes from the weather data:

  gain(Outlook)     = 0.247 bits
  gain(Temperature) = 0.029 bits
  gain(Humidity)    = 0.152 bits
  gain(Windy)       = 0.048 bits
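
These numbers can be reproduced with a small sketch that combines the ideas above; it assumes the weather_data list defined earlier, and the function names are illustrative.

```python
import math
from collections import Counter

def entropy(labels):
    counts = Counter(labels)
    total = len(labels)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def information_gain(examples, attribute):
    """(information before the split) minus (weighted information after the split)."""
    labels = [label for _, label in examples]
    before = entropy(labels)                                  # e.g. info([9,5]) = 0.940 bits
    after = 0.0
    for value in {ex[attribute] for ex, _ in examples}:
        subset = [label for ex, label in examples if ex[attribute] == value]
        after += len(subset) / len(examples) * entropy(subset)
    return before - after

for attribute in ["Outlook", "Temperature", "Humidity", "Windy"]:
    print(attribute, round(information_gain(weather_data, attribute), 3))
# Outlook 0.247, Temperature 0.029, Humidity 0.152, Windy 0.048
```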
Example (Ctd.)

Outlook is selected as the root node.
• For the sunny and rain branches, further splitting is necessary.
• Outlook = overcast contains only examples of class yes.
Example (Ctd.)

Splitting the Outlook = sunny branch further:

  Gain(Temperature) = 0.571 bits
  Gain(Humidity)    = 0.971 bits
  Gain(Windy)       = 0.020 bits

→ Humidity is selected.
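
These gains can be checked with the same sketch, restricted to the examples that reach the sunny branch (again assuming the weather_data list and information_gain function from the earlier sketches):

```python
# Re-score the remaining attributes on the Outlook = sunny subset.
sunny = [(ex, label) for ex, label in weather_data if ex["Outlook"] == "sunny"]
for attribute in ["Temperature", "Humidity", "Windy"]:
    print(attribute, round(information_gain(sunny, attribute), 3))
# Temperature 0.571, Humidity 0.971, Windy 0.02
```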

Example (Ctd.)

Humidity is selected for the sunny branch.
• Its leaves (high → no, normal → yes) are pure → no further expansion necessary.
• The rain branch still requires further splitting.
Final decision tree

[Figure: the final decision tree, identical to the Decision Tree Example shown earlier: Outlook at the root, Humidity under the sunny branch, Windy under the rain branch.]
