
Supervised Learning:

Classification-I
M M Awais

SPJCM
Decision tree induction

Classification - Decision Tree 2


Introduction
 It is a method that induces concepts from
examples (inductive learning)
 Most widely used & practical learning
method
 The learning is supervised: i.e. the classes
or categories of the data instances are
known
 It represents concepts as decision trees
(which can be rewritten as if-then rules)

Classification - Decision Tree 3


Introduction
 Decision tree learning is one of the most
widely used techniques for classification.
 Its classification accuracy is competitive with
other methods, and
 it is very efficient.
 The classification model is a tree, called
decision tree.
 C4.5 by Ross Quinlan is perhaps the best
known system. It can be downloaded from
the Web.

Classification - Decision Tree 4


Introduction

 The target function can be Boolean or discrete valued

Classification - Decision Tree 5


Decision Trees
 Example: “is it a good day to play golf?”
 A set of attributes and their possible values:
 outlook: sunny, overcast, rain
 temperature: cool, mild, hot
 humidity: high, normal
 windy: true, false
 A particular instance in the training set might be:
 <overcast, hot, normal, false>: play
 In this case, the target class is a binary attribute, so each instance represents a positive or a negative example.

Classification - Decision Tree 6


Decision Tree Representation

1. Each node corresponds to an attribute
2. Each branch corresponds to an attribute value
3. Each leaf node assigns a classification

Classification - Decision Tree 7


Example

Classification - Decision Tree 8


Example

 A Decision Tree for the concept PlayTennis:
 Outlook at the root, with branches Sunny, Overcast and Rain; the Sunny branch tests Humidity (High / Normal) and the Rain branch tests Wind (Strong / Weak).
 An unknown observation is classified by testing its attributes and reaching a leaf node.


Classification - Decision Tree 9
Using Decision Trees for Classification
 Examples can be classified as follows:
 1. look at the example's value for the feature specified at the current node
 2. move along the edge labeled with this value
 3. if you reach a leaf, return the label of the leaf
 4. otherwise, repeat from step 1

 Example (a decision tree to decide whether to go on a picnic):
 outlook at the root; sunny → humidity (high → N, normal → P); overcast → P; rain → windy (true → N, false → P)
 So a new instance <rainy, hot, normal, true>: ? will be classified as “no play”.

Classification - Decision Tree 10
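A minimal sketch of this classification procedure (Python assumed; not part of the original slides): the picnic tree above is stored as nested dictionaries, where an internal node maps an attribute name to a dictionary of value → subtree and a leaf is just a class label.

```python
# Hypothetical dict-of-dicts encoding of the picnic tree above.
picnic_tree = {
    "outlook": {
        "sunny": {"humidity": {"high": "N", "normal": "P"}},
        "overcast": "P",
        "rain": {"windy": {"true": "N", "false": "P"}},
    }
}

def classify(tree, instance):
    """Test the attribute at the current node, follow the branch labeled with
    the instance's value, and repeat until a leaf (a class label) is reached."""
    while isinstance(tree, dict):
        attribute = next(iter(tree))                  # attribute tested here
        tree = tree[attribute][instance[attribute]]   # follow matching branch
    return tree

# The new instance from the slide: <rainy, hot, normal, true>
instance = {"outlook": "rain", "temperature": "hot",
            "humidity": "normal", "windy": "true"}
print(classify(picnic_tree, instance))  # -> 'N', i.e. "no play"
```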


Decision Trees and Decision Rules
If attributes are continuous, internal nodes may test against a threshold.
(Tree: outlook at the root; sunny → humidity (> 75% → no, <= 75% → yes); overcast → yes; rain → windy (> 20 → no, <= 20 → yes).)

Each path in the tree represents a decision rule:
Rule 1: If (outlook = “sunny”) AND (humidity <= 0.75) Then (play = “yes”)
Rule 2: If (outlook = “rainy”) AND (wind > 20) Then (play = “no”)
Rule 3: If (outlook = “overcast”) Then (play = “yes”)
...

Classification - Decision Tree 11


 DECISION TREES

 Basic Decision Tree Learning Algorithm


 Most algorithms for growing decision trees
are variants of a basic algorithm

 An example of this core algorithm is the ID3


algorithm developed by Quinlan (1986)

 It employs a top-down, greedy search through the space of possible decision trees
12
 DECISION TREES

 Basic Decision Tree Learning Algorithm


 First of all, we select the best attribute to be tested at the root of the tree
 For making this selection, each attribute is evaluated using a statistical test to determine how well it alone classifies the training examples

13
 DECISION TREES

 Basic Decision Tree Learning Algorithm


 We have:
 14 observations (D1 … D14)
 4 attributes: Outlook, Temperature, Humidity, Wind
 2 classes (Yes, No)

14
 DECISION TREES

 Basic Decision Tree Learning Algorithm


 Outlook at the root, with branches Sunny, Overcast and Rain, partitions the 14 observations into three groups according to their Outlook value.

15
 DECISION TREES

 Basic Decision Tree Learning Algorithm


 The selection process is then repeated using
the training examples associated with each
descendant node to select the best attribute
to test at that point in the tree

16
 DECISION TREES
 Outlook at the root partitions the observations into the Sunny, Overcast and Rain groups.
 What is the “best” attribute to test at this point? The possible choices are Temperature, Wind & Humidity.

17
 DECISION TREES

 Basic Decision Tree Learning Algorithm


 This forms a greedy search for an acceptable decision tree, in which the algorithm never backtracks to reconsider earlier choices

18
 DECISION TREES

Which Attribute is the Best Classifier?


 The central choice in the ID3 algorithm is
selecting which attribute to test at each node
in the tree
 We would like to select the attribute which is

most useful for classifying examples


 For this we need a good quantitative measure

 For this purpose a statistical property, such as information gain or the Gini index, is used

19
Top-Down Decision Tree Generation
 The basic approach usually consists of two phases:
 Tree construction

 At the start, all the training examples are at the


root
 Examples are partitioned recursively based on selected attributes
 Tree pruning

 remove tree branches that may reflect noise in


the training data and lead to errors when
classifying test data
 improve classification accuracy

Classification - Decision Tree 20


Top-Down Decision Tree Generation
 Basic Steps in Decision Tree Construction
 Tree starts as a single node representing all data
 If the samples are all of the same class, then the node becomes a leaf labeled with that class label
 Otherwise, select the feature that best separates the samples into individual classes
 Recursion stops when:
 Samples in a node belong to the same (majority) class
 There are no remaining attributes on which to split
 (How to select the feature?)

Classification - Decision Tree 21


How to find Feature to split?

 Many methods are available but our focus will be on the following:
 Information Theory (Information Gain)
 Gain Ratio
 Gini Index

Classification - Decision Tree 22


Information
High Uncertainty

No Uncertainty
Classification - Decision Tree 23
Valuable Information

 Which information is more valuable:
 that of the high-uncertainty region, or
 that of the no-uncertainty region?
 (The high-uncertainty region.)

Classification - Decision Tree 24


Information theory
 Information theory provides a mathematical basis
for measuring the information content.
 To understand the notion of information, think
about it as providing the answer to a question, for
example, whether a coin will come up heads.
 If one already has a good guess about the answer,
then the actual answer is less informative.
 If one already knows that the coin is rigged so that it will come up heads with probability 0.99, then a message (advance information) about the actual outcome of a flip is worth less than it would be for an honest coin (50-50).

Classification - Decision Tree 25


Information theory (cont …)
 For a fair (honest) coin, you have no information, and you are willing to pay more (say, in terms of $) for advance information - the less you know, the more valuable the information.
 Information theory uses this same intuition,
but instead of measuring the value for
information in dollars, it measures information
contents in bits.
 One bit of information is enough to answer a
yes/no question about which one has no
idea, such as the flip of a fair coin

Classification - Decision Tree 26


Information Basic

Classification - Decision Tree 27


Entropy

Classification - Decision Tree 28


Classification - Decision Tree 29
Classification - Decision Tree 30
Classification - Decision Tree 31
Information: Basics
 Information (Entropy) is:
 E = −pi log2 pi,
 where pi is the probability of an event i
 (−pi log2 pi is always +ve)
 For multiple events:
 E(I) = Σi −pi log2 pi
 Suppose you toss a fair coin; find the information (entropy) when the probability of head or tail is 0.5 each.
 possible events: 2, pi = 0.5
 E(I) = −0.5 log2 0.5 − 0.5 log2 0.5 = 1.0
 If the coin is biased, i.e., the chance of heads is 0.75 and of tails is 0.25, then E(I) = −0.75 log2 0.75 − 0.25 log2 0.25 < 1.0
Classification - Decision Tree 32
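A small sketch of the formula above (Python assumed; not from the slides). The printed values match the coin example on this slide and the die example on the next one.

```python
import math

def entropy(probabilities):
    """E = -sum(p_i * log2(p_i)) in bits; zero-probability events are skipped."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

print(entropy([0.5, 0.5]))    # fair coin   -> 1.0
print(entropy([0.75, 0.25]))  # biased coin -> ~0.81 (< 1.0)
print(entropy([1/6] * 6))     # fair die    -> ~2.585
```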
Information: Basics
 Suppose you have a die and you roll it; find the entropy of the outcome if the probabilities of each event, i.e., of getting 1 to 6, are equal.
 possible events: 6, pi = 1/6
 E(I) = 6 · (−1/6) log2 (1/6) = 2.585
 If the die is biased, i.e., the chance of a ‘6’ is 0.75, then what is the entropy?
 p(6) = 0.75
 p(all others) = 0.25
 p(any other single number) = 0.25/5 = 0.05 (equally divided among 1 to 5)
 then E(I) = −0.75 log2 0.75 − 5 · (0.05) log2 (0.05) = 1.39

Classification - Decision Tree 33


Information: Basics
 (Same die example as above.) Note: as the probability of an event increases, the uncertainty, and hence the entropy, decreases.

Classification - Decision Tree 34


Information: Basics
 (Same die example as above.) Note: so, in making a decision tree, we choose the variable (feature) whose known value reduces the uncertainty.

Classification - Decision Tree 35


Decision Trees
The most notable types of decision tree algorithms are:
 Iterative Dichotomiser 3 (ID3): This algorithm uses Information Gain to decide which attribute is to be used to classify the current subset of the data. For each level of the tree, information gain is calculated for the remaining data recursively.
 C4.5: This algorithm is the successor of the ID3 algorithm. It uses either Information Gain or Gain Ratio to decide upon the classifying attribute. It is a direct improvement on the ID3 algorithm, as it can handle both continuous and missing attribute values.
 Classification and Regression Tree (CART): It is a dynamic learning algorithm which can produce a regression tree as well as a classification tree, depending upon the dependent variable.

Classification - Decision Tree 36


DT: Entropy – A measuring Value
 Entropy is a concept that originated in thermodynamics but later found its way into information theory.
 In the decision tree construction process, the definition of entropy as a measure of disorder suits well.
 If the class values of the data in a node are equally divided among the possible class values, we say entropy (disorder) is maximum.
 If the class values of the data in a node are the same for all data, entropy (disorder) is minimum.

Classification - Decision Tree 37


DT: Entropy – A measuring Value

 A decision tree is built top-down from a root


node and involves partitioning the data into
subsets that contain instances with similar
values (homogenous).
 ID3 algorithm uses entropy to calculate the
homogeneity of a sample.
 If the sample is completely homogeneous the entropy is zero, and if the sample is equally divided it has an entropy of one.

Classification - Decision Tree 38


Entropy

 (Figure: the entropy curve; it reaches its maximum value of 1 at the center, p = 0.5.)

Classification - Decision Tree 39


Information theory: Entropy measure
 The entropy formula:
 entropy(D) = −Σ_{j=1}^{|C|} Pr(cj) · log2 Pr(cj), where Σ_{j=1}^{|C|} Pr(cj) = 1

 Pr(cj) is the probability of class cj in data set D


 We use entropy as a measure of impurity or
disorder of data set D. (or, a measure of
information in a tree)

Classification - Decision Tree 40


Entropy measure: E = −(p/s)·log2(p/s) − (n/s)·log2(n/s),
where p = number of positive examples, n = number of negative examples, s = total number of examples

 As the data become purer and purer, the entropy value


becomes smaller and smaller. This is useful for classification
Classification - Decision Tree 41
Information gain
 Given a set of examples D, we first compute its entropy over the |C| classes:
 entropy(D) = −Σ_{j=1}^{|C|} Pr(cj) · log2 Pr(cj)
 If we choose attribute Ai, with v values, as the root of the current tree, this will partition D into v subsets D1, D2, …, Dv. The expected entropy if Ai is used as the current root is:
 entropy_{Ai}(D) = Σ_{j=1}^{v} (|Dj| / |D|) · entropy(Dj)

Classification - Decision Tree 42


Information Gain

Classification - Decision Tree 43


Information gain (cont …)
 Information gained by selecting attribute Ai to branch or to partition the data is
 gain(D, Ai) = entropy(D) − entropy_{Ai}(D)
 We choose the attribute with the highest gain to branch/split the current tree.
 As the information gain increases for a variable, the uncertainty in decision making reduces.

Classification - Decision Tree 44


Day outlook temp humidity wind play
D1 sunny hot high weak No
D2 sunny hot high strong No
D3 overcast hot high weak Yes
D4 rain mild high weak Yes
D5 rain cool normal weak Yes
D6 rain cool normal strong No
D7 overcast cool normal strong Yes
D8 sunny mild high weak No
D9 sunny cool normal weak Yes
D10 rain mild normal weak Yes
D11 sunny mild normal strong Yes
D12 overcast mild high strong Yes
D13 overcast hot normal weak Yes
D14 rain mild high strong No
Classification - Decision Tree 45
 To build a decision tree, we need to calculate two types
of entropy using frequency tables as follows:
 a) Entropy using the frequency table of one attribute:

Note: the class probabilities are P(Yes) = 9/14 = 0.64 and P(No) = 5/14 = 0.36

Classification - Decision Tree 46


How to calculate log base 2?
 To calculate a log2 value: log2(x) = log(x)/log(2); for example, log2(0.36) = log10(0.36)/log10(2)
 http://logbase2.blogspot.com/2008/08/log-calculator.html

Classification - Decision Tree 47


b) Entropy using the frequency table of two
attributes :

Classification - Decision Tree 48


 Calculate the following Entropies
 E(PlayGolf, temp)
 P(Hot). E(Yes, No)+P(Mild).E(Yes, No)+P(Cool).E(Yes, No)
 Help: first build the frequency table

 E(PlayGolf, humidity)

 E(PlayGolf, windy)

Classification - Decision Tree 49
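A sketch of the requested calculation (Python assumed; not from the slides): it builds the frequency tables implicitly and computes E(PlayGolf, attribute) and the corresponding information gains for the 14-row table of slide 45.

```python
import math
from collections import Counter, defaultdict

# (outlook, temp, humidity, wind, play) rows from slide 45
data = [
    ("sunny","hot","high","weak","No"),     ("sunny","hot","high","strong","No"),
    ("overcast","hot","high","weak","Yes"), ("rain","mild","high","weak","Yes"),
    ("rain","cool","normal","weak","Yes"),  ("rain","cool","normal","strong","No"),
    ("overcast","cool","normal","strong","Yes"), ("sunny","mild","high","weak","No"),
    ("sunny","cool","normal","weak","Yes"), ("rain","mild","normal","weak","Yes"),
    ("sunny","mild","normal","strong","Yes"), ("overcast","mild","high","strong","Yes"),
    ("overcast","hot","normal","weak","Yes"), ("rain","mild","high","strong","No"),
]
attributes = ["outlook", "temp", "humidity", "wind"]

def entropy(labels):
    total = len(labels)
    return -sum(c/total * math.log2(c/total) for c in Counter(labels).values())

def conditional_entropy(rows, attr_index):
    """E(Play, attribute) = sum over values of P(value) * E(labels | value)."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[attr_index]].append(row[-1])
    return sum(len(g)/len(rows) * entropy(g) for g in groups.values())

base = entropy([r[-1] for r in data])           # E(PlayGolf) ~ 0.940
for i, name in enumerate(attributes):
    print(name, round(base - conditional_entropy(data, i), 3))
# outlook has the largest gain (~0.247), humidity ~0.152, wind ~0.048, temp ~0.029
```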


Information Gain

 The information gain is based on the decrease in entropy after a dataset is split on an attribute.
 Constructing a decision tree is all about finding the attribute that returns the highest information gain (i.e., the most homogeneous branches).

Classification - Decision Tree 50


Information Gain
 Step 3: Choose the attribute with the largest information gain as the decision node.
 Step 4a: A branch with entropy of 0 is a leaf node.
 Step 4b: A branch with entropy more than 0 needs further splitting.
 Step 5: The ID3 algorithm is run recursively on the non-leaf branches, until all data is classified.
Decision Tree to Decision Rules
Example
Owns House | Married | Gender | Employed | Credit History | Risk Class
Yes Yes M Yes A B
No No F Yes A A
Yes Yes F Yes B C
Yes No M No B B
No Yes F Yes B C
No No F Yes B A
No No M No B B
Yes No F Yes A A
No Yes F Yes A C
Yes Yes F Yes A C

Classification - Decision Tree 62


Choosing the “Best” Feature
Candidate root tests: Own House? (Yes / No), Credit Rating (A / B), Married? (Yes / No), Gender (M / F)

Classification - Decision Tree 63


Choosing the “Best” Feature — Own House? (Yes / No)
(Using the 10-example credit-risk table from the previous slide.)
 Find the overall entropy first:
 Total samples: 10
 Class A: 3, Class B: 3, Class C: 4
 Entropy(D) = −(3/10)log(3/10) − (3/10)log(3/10) − (4/10)log(4/10) = 1.57
 Owns House has two values, Yes (5 instances) and No (5 instances), total 10; the probability of each is 0.5
 (Of the 5 “Owns House = Yes” examples, 1 is class A, 2 are class B and 2 are class C.)
 Find entropy(Dj) for Yes and No and add the two, weighted by their probabilities:
 E(yes) = −(1/5)log(1/5) − (2/5)log(2/5) − (2/5)log(2/5) = 1.52
 E(no) = −(2/5)log(2/5) − (1/5)log(1/5) − (2/5)log(2/5) = 1.52
 E(Dj) = 0.5·E(yes) + 0.5·E(no) = 1.52
 Gain(D, Own House) = 1.57 − 1.52 = 0.05

Classification - Decision Tree 65




(The same credit-risk table as before.)
Similarly, find the information gain for all the other variables:
 Own House: 0.05
 Married: 0.72
 Gender: 0.88 ← selected as the root node (highest gain)
 Employed: 0.45
 Credit Rating: 0.05

Classification - Decision Tree 72


Choosing the “Best” Feature
Gender
 M: Class A: 0, Class B: 3, Class C: 0
 F: Class A: 3, Class B: 0, Class C: 4
 M branch: no further split is required here; it identifies class B fully.
 F branch: a further split is required here; it cannot identify A and C fully.
 Apply the same procedure again on the other variables, leaving out the column for Gender and the rows for class B, as class B has been fully determined.

Classification - Decision Tree 73




Choosing the “Best” Feature — Own House?
(Remaining data: the 7 Gender = F examples, with 3 in class A and 4 in class C.)
 E(D) = −(3/7)log2(3/7) − (4/7)log2(4/7) ≈ 0.99
 Conditional entropies E(Dj):
 Own House: 0.96
 Married: 0.00
 Etc…
 Married is the best node, as its E(Dj) = 0; hence its information gain will be maximum.
Classification - Decision Tree 75
Completing the DT
(The same credit-risk table.)
Gender
 M → Class B: 3
 F → Class A: 3, Class C: 4 → split on Married
Married
 Yes → Class C: 4
 No → Class A: 3

Classification - Decision Tree 76


Completing the DT
Gender
 M → Class B: 3
 F → Class A: 3, Class C: 4 → split on Married
Married
 Yes → Class C: 4
 No → Class A: 3

Rules:
 R1: If Gender = M then Class B
 R2: If Gender = F and Married = Yes then Class C
 R3: If Gender = F and Married = No then Class A

Classification - Decision Tree 77


Table 6.1 Class‐labeled training tuples from AllElectronics customer database.

78
Classification - Decision Tree 79
Classification - Decision Tree 80
Classification - Decision Tree 81
Trees Construction Algorithm (ID3)
 Decision Tree Learning Method (ID3)
 Input: a set of examples S, a set of features F, and a target set T (the target class T represents the type of instance we want to classify, e.g., whether “to play golf”)
 1. If every element of S is already in T, return “yes”; if no element of S is in T, return “no”
 2. Otherwise, choose the best feature f from F (if there are no features remaining, then return failure)
 3. Extend the tree from f by adding a new branch for each attribute value of f
 4. Distribute the training examples to the leaf nodes (so each leaf node S is now the set of examples at that node, and F is the remaining set of features not yet selected)
 5. Repeat steps 1–5 for each leaf node
 Main Question:
 how do we choose the best feature at each step?

Note: the ID3 algorithm only deals with categorical attributes, but can be extended (as in C4.5) to handle continuous attributes.
Classification - Decision Tree 82
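A compact sketch of this recursive procedure (Python assumed; a simplified illustration, not the exact ID3/C4.5 code). Each training example is assumed to be a dict of categorical feature values plus a "class" key.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    n = len(labels)
    return -sum(c/n * math.log2(c/n) for c in Counter(labels).values())

def best_feature(rows, features):
    """Step 2: choose the feature with the highest information gain."""
    labels = [r["class"] for r in rows]
    def gain(f):
        groups = defaultdict(list)
        for r in rows:
            groups[r[f]].append(r["class"])
        remainder = sum(len(g)/len(rows) * entropy(g) for g in groups.values())
        return entropy(labels) - remainder
    return max(features, key=gain)

def id3(rows, features):
    labels = [r["class"] for r in rows]
    if len(set(labels)) == 1:            # step 1: all examples in one class
        return labels[0]
    if not features:                     # no features left: majority class
        return Counter(labels).most_common(1)[0][0]
    f = best_feature(rows, features)
    subsets = defaultdict(list)
    for r in rows:                       # steps 3-4: one branch per value of f
        subsets[r[f]].append(r)
    return {f: {value: id3(subset, [x for x in features if x != f])
                for value, subset in subsets.items()}}   # step 5: recurse
```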
Choosing the “Best” Feature
 Using Information Gain to find the “best” (most discriminating) feature
 Entropy E(I) of a set of instances I containing p positive and n negative examples:
 E(I) = −(p/(p+n))·log2(p/(p+n)) − (n/(p+n))·log2(n/(p+n))
 Gain(A, I) is the expected reduction in entropy due to feature (attribute) A:
 Gain(A, I) = E(I) − Σ_{descendant j} ((pj + nj)/(p + n)) · E(Ij)
 where the jth descendant of I is the set of instances with value vj for A

 Example: S: [9+, 5−], test Outlook? E = −(9/14)·log(9/14) − (5/14)·log(5/14) = 0.940
 Branches: sunny → [2+, 3−]; overcast → [4+, 0−] (“yes”, since all examples are positive); rainy → [3+, 2−]

Classification - Decision Tree 83


Decision Tree Learning - Example
(Dataset as on slide 45.) S: [9+, 5−] (E = 0.940)
 Split on humidity?
 high → [3+, 4−] (E = 0.985); normal → [6+, 1−] (E = 0.592)
 Gain(S, humidity) = 0.940 − (7/14)·0.985 − (7/14)·0.592 = 0.151
 Split on wind?
 weak → [6+, 2−] (E = 0.811); strong → [3+, 3−] (E = 1.00)
 Gain(S, wind) = 0.940 − (8/14)·0.811 − (6/14)·1.0 = 0.048
 So, classifying examples by humidity provides more information gain than by wind. In this case, however, you can verify that outlook has the largest information gain, so it will be selected as the root.

Classification - Decision Tree 84


Decision Tree Learning - Example
(Dataset as on slide 45.) S: [9+, 5−]
 Split on outlook {D1, D2, …, D14}:
 sunny → [2+, 3−] (E = 0.970); overcast → [4+, 0−] (E = 0); rainy → [3+, 2−] (E = 0.970)
 Gain(S, outlook) = 0.940 − (5/14)·0.970 − (4/14)·0 − (5/14)·0.970 ≈ 0.247
 So outlook has the largest information gain and is selected as the root.

Classification - Decision Tree 85


Decision Tree Learning - Example
 Partially learned decision tree
S: [9+, 5−], Outlook at the root {D1, D2, …, D14}:
 sunny → [2+, 3−] (E = 0.970), examples {D1, D2, D8, D9, D11} → ?
 overcast → [4+, 0−] (E = 0), examples {D3, D7, D12, D13} → yes
 rainy → [3+, 2−] (E = 0.970), examples {D4, D5, D6, D10, D14} → ?

 Which attribute should be tested at the sunny node?
 Ssunny = {D1, D2, D8, D9, D11}
 Gain(Ssunny, humidity) = 0.970 − (3/5)·0.0 − (2/5)·0.0 = 0.970
 Gain(Ssunny, temp) = 0.970 − (2/5)·0.0 − (2/5)·1.0 − (1/5)·0.0 = 0.570
 Gain(Ssunny, wind) = 0.970 − (2/5)·1.0 − (3/5)·0.918 = 0.019

Classification - Decision Tree 86


Highly-branching attributes

 Problematic: attributes with a large number of


values (extreme case: ID code)
 Subsets are more likely to be pure if there is
a large number of values
 Information gain is biased towards choosing
attributes with a large number of values
 This may result in overfitting (selection of an attribute that is non-optimal for prediction)

Classification - Decision Tree 87


Gain Ratio for Attribute Selection (C4.5)

Classification - Decision Tree 88


Another alternative to avoid selecting attributes with large domains

Classification - Decision Tree 89


Gini Index (CART, SLIQ, IBM IntelligentMiner)

Classification - Decision Tree 90


Contd..!!!

Classification - Decision Tree 91


Classification - Decision Tree 92
Comparing Attribute Selection Measures

Classification - Decision Tree 93


Split Algorithm with Gini Index
 Basic concept taken from economics, given by Corrado Gini (1884 to 1965)
 The index varies from 0 to 1
 ZERO means no uncertainty
 ONE means maximum uncertainty
 Income distribution (Gini index) of various countries:
 Brazil 0.59
 India 0.32
 China 0.45
 USA 0.41
 Japan 0.25 (most evenly distributed income)

Classification - Decision Tree 94


Gini Index
 The Gini index is a measure of impurity developed by the Italian statistician Corrado Gini in 1912.
 It is usually used to measure income inequality, but can be used to measure any form of uneven distribution.
 The Gini index is a number between 0 and 1, where 0 corresponds to perfect equality (where everyone has the same income) and 1 corresponds to perfect inequality (where one person has all the income and everyone else has zero income).

 GINI(t) = 1 − Σ_j p(j | t)²

Classification - Decision Tree 95


Diversity and Gini Index
high diversity, low purity:
 G = 1 − (3/8)² − (3/8)² − (1/8)² − (1/8)² = 0.69 (E = 1.811)
low diversity, high purity:
 G = 1 − (6/7)² − (1/7)² = 0.24 (E = 0.592)


Classification - Decision Tree 96
Choosing the “Best” Feature — Own House? (Yes / No)
(Using the Gini index on the same 10-example credit-risk table.)
 Find the overall G first:
 Total samples: 10
 Gt = 1 − (3/10)² − (3/10)² − (4/10)² = 0.66
 Attribute: Own House
 G(y) = 1 − (1/5)² − (2/5)² − (2/5)² = 0.64
 G(n) = 0.64
 Total G = 0.5·G(y) + 0.5·G(n) = 0.64
 Attribute: Married — Total G = 0.5·G(y) + 0.5·G(n) = 0.40
 Attribute: Gender — G = 0.511
 Attribute: Employed — G = 0.475
 Attribute: Credit Rating — G = 0.64
 Gain = Gt − Gi; the gains are: Own House 0.02, Married 0.26, Gender 0.302, …

Classification - Decision Tree 97
Choosing the “Best” Feature
(The same credit-risk table.)
Gender (M / F)
 Choose Gender
 Apply the same procedure again on the other variables, leaving out the column for Gender and the rows for class B, as it has been fully determined.
 Check whether you get the same DT or not.

Classification - Decision Tree 98


Categorical Attributes: Computing
Gini Index
• For each distinct value, gather counts for each class in the set.
• Use the count matrix to make decisions

Multi-way split vs. two-way split (find the best partition of values)
(Example count matrices for Outlook: a multi-way split on {Overcast, Rain, Sunny} versus the binary partitions {Overcast} vs {Rain, Sunny} and {Overcast, Rain} vs {Sunny}, each with its per-value class counts, per-subset Gini, and weighted Gini of 0.34, 0.36 and 0.391 respectively.)

Classification - Decision Tree 99


Continuous Attributes: Computing
Gini Index…

Cheat: No No No Yes Yes Yes No No No No
Taxable Income (sorted): 60 70 75 85 90 95 100 120 125 220
Candidate split positions: 55 65 72 80 87 92 97 110 122 172 230
Gini at each split: 0.420 0.400 0.375 0.343 0.417 0.400 0.300 0.343 0.375 0.400 0.420
(The class counts ≤ / > each candidate position give the Gini values above; the best split is Taxable Income ≤ 97, with Gini = 0.300.)

Classification - Decision Tree 100


Gini index (CART)
 E.g., two classes, Pos and Neg, and dataset S
with p Pos-elements and n Neg-elements.
 fp = p / (p+n), fn = n / (p+n)
 gini(S) = 1 − fp² − fn²
 If dataset S is split into S1, S2 then
 gini_split(S1, S2) = gini(S1)·(p1+n1)/(p+n) + gini(S2)·(p2+n2)/(p+n)

Classification - Decision Tree 101
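A minimal sketch of these two formulas (Python assumed; not from the slides). The checks reuse the diversity example of slide 96 and the outlook split of slide 104.

```python
def gini(labels):
    """gini(S) = 1 - sum of squared class fractions."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def gini_split(subsets):
    """Weighted Gini of a split: sum of gini(Si) * |Si| / |S| over the subsets."""
    total = sum(len(s) for s in subsets)
    return sum(gini(s) * len(s) / total for s in subsets)

print(round(gini(["a"]*3 + ["b"]*3 + ["c", "d"]), 2))   # high diversity -> 0.69
print(round(gini(["a"]*6 + ["b"]), 2))                  # low diversity  -> 0.24

# Outlook at the root of the play-tennis data: {overcast} vs {sunny, rain}
print(round(gini_split([["P"]*4, ["P"]*5 + ["N"]*5]), 4))   # -> 0.3571
```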


Gini index - play tennis example
(The 14-example play-tennis dataset with classes P/N, as before.)
 Two top best splits at the root node:
 Split on outlook: S1 = {overcast} (4 Pos, 0 Neg; P 100%), S2 = {sunny, rain}
 Split on humidity: S1 = {normal} (6 Pos, 1 Neg; P 86%), S2 = {high}

Classification - Decision Tree 102


Calculations
 Outlook
 Sunny or Rainy: Yes = 5, No = 5, Gini = 0.5
 Overcast: Yes = 4, No = 0, Gini = 0
 Weighted Gini of the split = 0.36
 Temperature
 Hot or Cold: Yes = 5, No = 3, Gini = 0.47
 Mild: Yes = 4, No = 2, Gini = 0.44
 Weighted Gini of the split = 0.46
 Humidity
 High: Yes = 3, No = 4, Gini = 0.49
 Normal: Yes = 6, No = 1, Gini = 0.25
 Weighted Gini of the split = 0.37
 Windy
 FALSE: Yes = 6, No = 2, Gini = 0.38
 TRUE: Yes = 3, No = 3, Gini = 0.5
 Weighted Gini of the split = 0.43
 (Outlook gives the lowest weighted Gini, so it is chosen.)

Classification - Decision Tree 103


Calculations at Node 0
 Outlook

 5  2  5  2  1
GINI (outlook  sunny  rainy )  1        
 10   10   2
 4  2  0  2 
GINI (outlook  overcast )  1         0
 4   4  
 10   1   4  
GINI ( split based on outlook )    *     (0)   0.3571
 14   2   14  

Classification - Decision Tree 104


Temperature

 5  2  3  2 
GINI (temperatur e  hot  cold ) 1         0.46875
 8   8  
 4  2  2  2 
GINI (temperatur e  mild ) 1         0.44
 6   6  
 8   6 
GINI ( split based on temperatur e)    * 0.46875    * 0.44   0.456
 14   14  

Classification - Decision Tree 105


Humidity

 3  2  4  2  24
GINI (humidity  high)  1        
 7   7   49
 4  2  2  2  12
GINI (humidity  normal )  1        
 6   6   49
 7   24   7   12  
GINI ( split based on humidity )    *      *     0.37
 14   49   14   49  

Classification - Decision Tree 106


Windy

 6  2  2  2 
GINI ( windy  FALSE )  1         0.375
 8   8  
 3  2  3  2 
GINI ( windy  TRUE )  1         0. 5
 6   6  
 8  6 
GINI ( split based on windy)    * 0.375    * 0.5  0.43
 14   14  

Classification - Decision Tree 107


(Figure: the resulting decision tree — outlook at the root, N = 14. Overcast → 4 yes, 0 no. Rain/Sunny → humidity, N = 10. One humidity branch leads to windy (4 yes, 1 no, N = 5): false → 3 yes, 0 no; true → outlook, N = 2, with Rain → 1 no and Sunny → 1 yes.)

Classification - Decision Tree 108


Classification - Decision Tree 109
Dealing With Continuous Variables
 Partition continuous attribute into a discrete set of
intervals
 sort the examples according to the continuous attribute A
 identify adjacent examples that differ in their target classification

 generate a set of candidate thresholds midway

 problem: may generate too many intervals

 Another Solution:
 take a minimum threshold M of the examples of the majority class in each

adjacent partition; then merge adjacent partitions with the same majority class
Example: M = 3; candidate thresholds at 70.5 and 77.5
Temperature: 64 65 68 69 70 71 72 72 75 75 80 81 83 85
Play?:       yes no yes yes yes no no yes yes yes no yes yes no
Adjacent partitions with the same majority class are merged.
Final mapping: temperature ≤ 77.5 ==> “yes”; temperature > 77.5 ==> “---”

Classification - Decision Tree 110


Improving on Information Gain
 Info. Gain tends to favor attributes with a large number of values
 larger distribution ==> lower entropy ==> larger Gain

 Quinlan suggests using Gain Ratio


 penalize for large number of values

 SplitInfo(A, S) = −Σ_i (|Si| / |S|) · log2(|Si| / |S|)
 GainRatio(A, S) = Gain(A, S) / SplitInfo(A, S)
 Example: “outlook” splits S: [9+, 5−] into S1: [4+, 0−] (overcast), S2: [2+, 3−] (sunny), S3: [3+, 2−] (rainy)
 SplitInfo(outlook, S) = −(4/14)·log(4/14) − (5/14)·log(5/14) − (5/14)·log(5/14) = 1.577
 GainRatio(outlook, S) = 0.246 / 1.577 = 0.156

Classification - Decision Tree 111
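A small sketch of the split-information correction above (Python assumed; not from the slides). The numbers reproduce the outlook example on this slide.

```python
import math

def split_info(subset_sizes):
    """SplitInfo(A, S) = -sum(|Si|/|S| * log2(|Si|/|S|))."""
    total = sum(subset_sizes)
    return -sum(s/total * math.log2(s/total) for s in subset_sizes if s)

si = split_info([4, 5, 5])        # outlook partitions S into 4/5/5 examples
print(round(si, 3))               # -> 1.577
print(round(0.246 / si, 3))       # GainRatio(outlook, S) -> ~0.156
```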


Gain Ratios of Decision Variables
 Outlook
 Info = 0.693
 Gain = 0.940 − 0.693 = 0.247
 Split info = info([5, 4, 5]) = 1.577
 Gain ratio = 0.247/1.577 = 0.156
 Temperature
 Info = 0.911
 Gain = 0.940 − 0.911 = 0.029
 Split info = info([4, 6, 4]) = 1.362
 Gain ratio = 0.029/1.362 = 0.021
 Humidity
 Info = 0.788
 Gain = 0.940 − 0.788 = 0.152
 Split info = info([7, 7]) = 1
 Gain ratio = 0.152/1 = 0.152
 Windy
 Info = 0.892
 Gain = 0.940 − 0.892 = 0.048
 Split info = info([8, 6]) = 0.985
 Gain ratio = 0.048/0.985 = 0.049

Classification - Decision Tree 112


Over-fitting in Classification
 A tree generated may over-fit the training examples due to noise or too small
a set of training data
 Two approaches to avoid over-fitting:
 (Stop earlier): Stop growing the tree earlier

 (Post-prune): Allow over-fit and then post-prune the tree

 Approaches to determine the correct final tree size:


 Separate training and testing sets or use cross-validation

 Use all the data for training, but apply a statistical test (e.g., chi-square) to

estimate whether expanding or pruning a node may improve over entire


distribution
 Use Minimum Description Length (MDL) principle: halting growth of the

tree when the encoding is minimized.


 Rule post-pruning (C4.5): converting to rules before pruning

Classification - Decision Tree 113


The loan data (reproduced)
Approved or not

Classification - Decision Tree 114


A decision tree from the loan data
 Decision nodes and leaf nodes (classes)

Classification - Decision Tree 115


Use the decision tree

No

Classification - Decision Tree 116


Is the decision tree unique?
 No. Here is a simpler tree.
 We want a smaller yet accurate tree.
 It is easier to understand and tends to perform better.

 Finding the best tree is


NP-hard.
 All current tree building
algorithms are heuristic
algorithms

Classification - Decision Tree 117


From a decision tree to a set of rules
 A decision tree can
be converted to a
set of rules
 Each path from the
root to a leaf is a
rule.

Classification - Decision Tree 118
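A minimal sketch of this conversion (Python assumed; not from the slides), using the same nested-dictionary tree encoding as earlier and the Gender/Married tree from slide 77.

```python
def tree_to_rules(tree, conditions=()):
    """Each root-to-leaf path becomes one rule: IF cond AND ... THEN class."""
    if not isinstance(tree, dict):                    # leaf: emit one rule
        conds = " AND ".join(f"{a} = {v}" for a, v in conditions) or "TRUE"
        return [f"IF {conds} THEN class = {tree}"]
    rules = []
    attribute = next(iter(tree))
    for value, subtree in tree[attribute].items():
        rules += tree_to_rules(subtree, conditions + ((attribute, value),))
    return rules

tree = {"Gender": {"M": "B",
                   "F": {"Married": {"Yes": "C", "No": "A"}}}}
for rule in tree_to_rules(tree):
    print(rule)
# IF Gender = M THEN class = B
# IF Gender = F AND Married = Yes THEN class = C
# IF Gender = F AND Married = No THEN class = A
```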


Algorithm for decision tree learning
 Basic algorithm (a greedy divide-and-conquer algorithm)
 Assume attributes are categorical now (continuous attributes
can be handled too)
 Tree is constructed in a top-down recursive manner
 At start, all the training examples are at the root
 Examples are partitioned recursively based on selected
attributes
 Attributes are selected on the basis of an impurity function
(e.g., information gain)
 Conditions for stopping partitioning
 All examples for a given node belong to the same class
 There are no remaining attributes for further partitioning –
majority class is the leaf
 There are no examples left

Classification - Decision Tree 119


Decision tree learning algorithm

Classification - Decision Tree 120


Choose an attribute to partition data
 The key to building a decision tree - which
attribute to choose in order to branch.
 The objective is to reduce impurity or
uncertainty in data as much as possible.
 A subset of data is pure if all instances belong to
the same class.
 The heuristic in C4.5 is to choose the attribute
with the maximum Information Gain or Gain
Ratio based on information theory.

Classification - Decision Tree 121


The loan data (reproduced)
Approved or not

Classification - Decision Tree 122


Two possible roots, which is better?

 Fig. (B) seems to be better.

Classification - Decision Tree 123


An example
entropy(D) = −(6/15)·log2(6/15) − (9/15)·log2(9/15) = 0.971

entropy_{Own_house}(D) = (6/15)·entropy(D1) + (9/15)·entropy(D2)
 = (6/15)·0 + (9/15)·0.918
 = 0.551

entropy_{Age}(D) = (5/15)·entropy(D1) + (5/15)·entropy(D2) + (5/15)·entropy(D3)
 = (5/15)·0.971 + (5/15)·0.971 + (5/15)·0.722
 = 0.888

Age: young → 2 Yes, 3 No, entropy(Di) = 0.971; middle → 3 Yes, 2 No, entropy(Di) = 0.971; old → 4 Yes, 1 No, entropy(Di) = 0.722

 Own_house is the best choice for the root.

Classification - Decision Tree 124


We build the final tree

 We can use information gain ratio to evaluate the


impurity as well

Classification - Decision Tree 125


Handling continuous attributes
 Handle continuous attribute by splitting into
two intervals (can be more) at each node.
 How to find the best threshold to divide?
 Use information gain or gain ratio again
 Sort all the values of a continuous attribute in increasing order {v1, v2, …, vr},
 One possible threshold between two adjacent
values vi and vi+1. Try all possible thresholds and
find the one that maximizes the gain (or gain
ratio).

Classification - Decision Tree 126
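A sketch of this threshold search (Python assumed; not from the slides): sort the values, try a threshold between each pair of adjacent distinct values, and keep the one with the largest information gain.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c/n * math.log2(c/n) for c in Counter(labels).values())

def best_threshold(values, labels):
    """Return (threshold, gain) maximizing the gain of a two-way split."""
    pairs = sorted(zip(values, labels))
    base, n = entropy(labels), len(labels)
    best = (None, -1.0)
    for i in range(n - 1):
        if pairs[i][0] == pairs[i + 1][0]:
            continue                        # no threshold between equal values
        t = (pairs[i][0] + pairs[i + 1][0]) / 2
        left  = [c for v, c in pairs if v <= t]
        right = [c for v, c in pairs if v > t]
        gain = base - len(left)/n * entropy(left) - len(right)/n * entropy(right)
        if gain > best[1]:
            best = (t, gain)
    return best

# Temperature / Play values from the earlier continuous-variable slide:
temps = [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
plays = ["yes","no","yes","yes","yes","no","no","yes","yes","yes","no","yes","yes","no"]
print(best_threshold(temps, plays))   # (best threshold, information gain)
```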


An example in a continuous space

Classification - Decision Tree 127


Avoid overfitting in classification
 Overfitting: A tree may overfit the training data
 Good accuracy on training data but poor on test data
 Symptoms: tree too deep and too many branches,
some may reflect anomalies due to noise or outliers
 Two approaches to avoid overfitting
 Pre-pruning: Halt tree construction early
 Difficult to decide because we do not know what may
happen subsequently if we keep growing the tree.
 Post-pruning: Remove branches or sub-trees from a
“fully grown” tree.
 This method is commonly used. C4.5 uses a statistical
method to estimates the errors at each node for pruning.
 A validation set may be used for pruning as well.

Classification - Decision Tree 128


An example Likely to overfit the data

Classification - Decision Tree 129


Other issues in decision tree learning

 From tree to rules, and rule pruning


 Handling of missing values
 Handling skewed distributions
 Handling attributes and classes with different
costs.
 Attribute construction (adding a new one)
 etc.

Classification - Decision Tree 130


DT Example (1)

Name Gender Height Output1 Output2
Kristina F 1.6 m Short Medium
Jim M 2 m Tall Medium
Maggie F 1.9 m Medium Tall
Martha F 1.88 m Medium Tall
Stephanie F 1.7 m Short Medium
Bob M 1.85 m Medium Medium
Kathy F 1.6 m Short Medium
Dave M 1.7 m Short Medium
Worth M 2.2 m Tall Tall
Steven M 2.1 m Tall Tall
Debbie F 1.8 m Medium Medium
Todd M 1.95 m Medium Medium
Kim F 1.9 m Medium Tall
Amy F 1.8 m Medium Medium
Wynette F 1.75 m Medium Medium

 Considering the data in the table and the correct classification in Output1, we have:
 Short (4/15)
 Medium (8/15)
 Tall (3/15)
 Entropy = 4/15 log(15/4) + 8/15 log(15/8) + 3/15 log(15/3) = 0.4384
 Choosing gender as the splitting attribute we have:
 Entropy(F) = 3/9 log(9/3) + 6/9 log(9/6) = 0.2764
 Entropy(M) = 1/6 log(6/1) + 2/6 log(6/2) + 3/6 log(6/3) = 0.4392

Classification - Decision Tree 131


DT Example (2)
 The algorithm must determine what the gain
in information is by using this split.
 To do this, we calculate the weighted sum of
these last two entropies to get:
((9/15) 0.2764) + ((6/15) 0.4392) = 0.34152

 The gain in entropy by using the gender


attribute is thus:
0.4384 – 0.34152 = 0.09688

Classification - Decision Tree 132


DT Example (3)
(Same table as in DT Example (1).)
 Looking at the height attribute, we divide it into ranges:
 (0, 1.6], (1.6, 1.7], (1.7, 1.8], (1.8, 1.9], (1.9, 2.0], (2.0, ∞)
 Now we can compute the entropy of each range:
 2 in (0, 1.6] → (2/2)(0) + 0 + 0 = 0
 2 in (1.6, 1.7] → (2/2)(0) + 0 + 0 = 0
 3 in (1.7, 1.8] → (0 + (3/3)(0) + 0) = 0
 4 in (1.8, 1.9] → (0 + (4/4)(0) + 0) = 0
 2 in (1.9, 2.0] → (0 + (1/2)(0.301) + (1/2)(0.301)) = 0.301
 2 in (2.0, ∞) → (0 + 0 + (2/2)(0)) = 0

Classification - Decision Tree 133


DT Example (4)
 All the states are completely ordered (entropy 0)
except for the (1.9, 2.0] state.
 The gain in entropy by using the height attribute is:
0.4384-2/15(0.301)=0.3983

 Thus, this has the greater gain, and we choose this


over gender as the first splitting attribute

Classification - Decision Tree 134


DT Example (5)

(First split on Height: ≤ 1.6 m → Short; (1.6, 1.7] → Short; (1.7, 1.8] → Medium; (1.8, 1.9] → Medium; (1.9, 2.0] → ?; > 2.0 m → Tall.)

 The (1.9, 2.0] branch is too large; a further subdivision on height is needed:
 Height ≤ 1.95 m → Medium, > 1.95 m → Tall

 We can optimize the tree:
 Height ≤ 1.7 m → Short; (1.7, 1.95] → Medium; > 1.95 m → Tall

Classification - Decision Tree 135


Quinlan’s ID3 and C4.5 decision tree
algorithms
Given dataset T
Attribute1 Attribute2 Attribute3 Class
A 70 True CLASS1
A 90 True CLASS2
A 85 False CLASS2
A 95 False CLASS2
A 70 False CLASS1
B 90 True CLASS1
B 78 False CLASS1
B 65 True CLASS1
B 75 False CLASS1
C 80 True CLASS2
C 70 True CLASS2
C 80 False CLASS1
C 80 False CLASS1
C 96 False CLASS1
Classification - Decision Tree 136
Quinlan’s ID3 and C4.5 decision tree
algorithms
Consider test on attribute 1
freq(class,value) CLASS1 CLASS2
A 2 3 5
B 4 0 4
C 3 2 5
9 5 14

Info(T) 0.4098 0.5305 0.9403

Info(S) CLASS1 CLASS2 Info(S) Weight


A 0.5288 0.4422 0.9710 0.3571
B 0.0000 0.0000 0.0000 0.2857
C 0.4422 0.5288 0.9710 0.3571
0.6935

Gain .9403 - .6935 = 0.2467


Classification - Decision Tree 137
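A quick numeric check of the table above (Python assumed; not from the slides): it reproduces Info(T) = 0.9403, Info_x(T) = 0.6935 and Gain = 0.2467 for a test on Attribute 1.

```python
import math
from collections import Counter, defaultdict

def info(labels):
    n = len(labels)
    return -sum(c/n * math.log2(c/n) for c in Counter(labels).values())

# (Attribute1, Class) pairs from the 14-row dataset T
rows = [("A","C1"),("A","C2"),("A","C2"),("A","C2"),("A","C1"),
        ("B","C1"),("B","C1"),("B","C1"),("B","C1"),
        ("C","C2"),("C","C2"),("C","C1"),("C","C1"),("C","C1")]

groups = defaultdict(list)
for value, cls in rows:
    groups[value].append(cls)

info_T = info([cls for _, cls in rows])                               # 0.9403
info_x = sum(len(g)/len(rows) * info(g) for g in groups.values())     # 0.6935
print(round(info_T, 4), round(info_x, 4), round(info_T - info_x, 4))  # gain 0.2467
```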
Quinlan’s ID3 and C4.5 decision tree
algorithms
Consider test on attribute 3
freq(class,value) CLASS1 CLASS2
True 3 3 6
False 6 2 8
9 5 14

Info(T) 0.4098 0.5305 0.9403

Info(S) CLASS1 CLASS2 Info(S) Weight


True 0.5000 0.5000 1.0000 0.4286
False 0.3113 0.5000 0.8113 0.5714
0.8922

Gain .9403 - .8922 = 0.0481


Classification - Decision Tree 138
Quinlan’s ID3 and C4.5 decision tree algorithms
 Summary
 Gain(Attribute1) = 0.2467
 Gain(Attribute3) = 0.0481
 In ID3 Attribute 2 not considered
 Since it is numeric
 Thus split on Attribute 1 – highest gain

Classification - Decision Tree 139


C4.5 decision tree algorithms
 What about numeric attribute 2 (as done by C4.5)?
 Consider as categorical
 Then gain = 0.4039 – should split on it
 But – 9 branches, of which 6 with only one

instance
 Tree too wide – not compact
 Since it is really numerical – what to do with a
different value?
 Use threshold Z, and split into two subsets
 Y <= Z and Y > Z
 More complex tests, assuming discrete values and
variable number of subsets
Classification - Decision Tree 140
C4.5 decision tree algorithms
 C4.5 and continuous attribute
 Sort values into v1,…,vm
 Try Zi = vi or Zi = (vi + vi+1) / 2 for i=1,…,m-1
 C4.5 uses Z = vi – more explainable decision rule
 Select splitting value Z*
 So that gain(Z*) = max {gain(Zi), i=1,…,m-1)}
 For last example – Attribute 2 (see next slide)
 Z* = 80
 Gain = 0.1022
 So even with this approach – would have split on Attribute
1

Classification - Decision Tree 141


C4.5 decision tree algorithms
Attribute 2 freq(class,value) Info(S)
Zi Att3 <= Zi Att3 > Zi CLASS1 CLASS2 CLASS1 CLASS2 Total Weight Info(Tx) Gain
65 1 1 0 0.0000 0.0000 0.0000 0.0714
13 8 5 0.4310 0.5302 0.9612 0.9286 0.8926 0.0477
70 4 3 1 0.3113 0.5000 0.8113 0.2857
10 6 4 0.4422 0.5288 0.9710 0.7143 0.9253 0.0150
75 5 4 1 0.2575 0.4644 0.7219 0.3571
9 5 4 0.4711 0.5200 0.9911 0.6429 0.8950 0.0453
78 6 5 1 0.2192 0.4308 0.6500 0.4286
8 4 4 0.5000 0.5000 1.0000 0.5714 0.8500 0.0903
80 9 7 2 0.2820 0.4822 0.7642 0.6429
5 2 3 0.5288 0.4422 0.9710 0.3571 0.8380 0.1022
85 10 7 3 0.3602 0.5211 0.8813 0.7143
4 2 2 0.5000 0.5000 1.0000 0.2857 0.9152 0.0251
90 12 8 4 0.3900 0.5283 0.9183 0.8571
2 1 1 0.5000 0.5000 1.0000 0.1429 0.9300 0.0103
95 13 8 5 0.4310 0.5302 0.9612 0.9286
1 1 0 0.0000 0.0000 0.0000 0.0714 0.8926 0.0477

Classification - Decision Tree 142


C4.5 decision tree algorithms

All same class – so T2 is a leaf


Classification - Decision Tree 143
C4.5 decision tree algorithms
Consider test on attribute 3 FOR SUBSET T1
freq(class,value) CLASS1 CLASS2
True 1 1 2
False 1 2 3
2 3 5

Info(T) 0.5288 0.4422 0.9710 Book has 0.940

Info(S) CLASS1 CLASS2 Info(S) Weight


True 0.5000 0.5000 1.0000 0.4000
False 0.5283 0.3900 0.9183 0.6000
0.9510

Gain .971 - .951 = 0.0200

Attribute 2 freq(class,value) Info(S)


Zi Att3 <= Zi Att3 > Zi CLASS1 CLASS2 CLASS1 CLASS2 Total Weight Info(Tx) Gain
70 2 2 0 0.0000 0.0000 0.0000 0.4000
3 0 3 0.0000 0.0000 0.0000 0.6000 0.0000 0.9710
85 3 2 1 0.3900 0.5283 0.9183 0.6000
2 0 2 0.0000 0.0000 0.0000 0.4000 0.5510 0.4200
90 4 2 2 0.5000 0.5000 1.0000 0.8000
1 0 1 0.0000 0.0000 0.0000 0.2000 0.8000 0.1710

Max gain on Attribute 2 - split on Z* = 70


Classification - Decision Tree 144
C4.5 decision tree algorithms
Consider test on attribute 3 FOR SUBSET T3
freq(class,value) CLASS1 CLASS2
True 0 2 2
False 3 0 3
3 2 5

Info(T) 0.4422 0.5288 0.9710 Book has 0.940

Info(S) CLASS1 CLASS2 Info(S) Weight


True 0.0000 0.0000 0.0000 0.4000
False 0.0000 0.0000 0.0000 0.6000
0.0000

Gain .971 - 0.000 = 0.9710

Attribute 2 freq(class,value) Info(S)


Zi Att3 <= Zi Att3 > Zi CLASS1 CLASS2 CLASS1 CLASS2 Total Weight Info(Tx) Gain
70 1 0 1 0.0000 0.0000 0.0000 0.2000
4 3 1 0.3113 0.5000 0.8113 0.8000 0.6490 0.3219
80 4 2 2 0.5000 0.5000 1.0000 0.8000
1 1 0 0.0000 0.0000 0.0000 0.2000 0.8000 0.1710

Max gain on Attribute 3


Classification - Decision Tree 145
C4.5 decision tree algorithms

Classification - Decision Tree 146


C4.5 decision tree algorithms

Classification - Decision Tree 147


C4.5 decision tree algorithms
 We used the entropy of T after splitting into T1, …, Tn:
 Info(Tj) = −Σ_{i=1}^{k} (freq(Ci, Tj) / |Tj|) · log2(freq(Ci, Tj) / |Tj|)
 Info_x(T) = Σ_{j=1}^{n} (|Tj| / |T|) · Info(Tj)
 Gain(X) = Info(T) − Info_x(T)
 This is biased in favor of tests X with many outcomes
 A split on ID will generate one subset for each unique value – i.e., one for every instance – with each subset containing a single instance
 It has maximal gain, as Info_x(T) = 0
 But the result is a one-level tree with one branch for each instance
 Thus divide by the number of branches – to measure average gain
Classification - Decision Tree 148


C4.5 decision tree algorithms
 So define the entropy of the split itself:
 Split-info(X) = −Σ_{j=1}^{n} (|Tj| / |T|) · log2(|Tj| / |T|)
 the potential information generated by splitting T into T1, …, Tn
 similar to the definition of Info(Tj)
 Use the entropy of T after splitting into T1, …, Tn as before:
 Info(Tj) = −Σ_{i=1}^{k} (freq(Ci, Tj) / |Tj|) · log2(freq(Ci, Tj) / |Tj|)
 Info_x(T) = Σ_{j=1}^{n} (|Tj| / |T|) · Info(Tj)
 Gain(X) = Info(T) − Info_x(T)
 Selection criterion:
 Gain-ratio(X) = Gain(X) / Split-info(X)
 the proportion of information generated by a “useful” compact split
 Select X* so that Gain-ratio(X*) = max over attributes X of Gain-ratio(X)

Classification - Decision Tree 149


C4.5 decision tree algorithms
Splitting the root
Attribute1 Attribute2 Attribute3
Gain(X) 0.2467 0.1022 0.0481
|T1| 5 9 6
|T2| 4 5 8
|T3| 5
|T| 14 14 14
|T1|/|T|*log(|T1|/|T|) 0.5305 0.4098 0.5239
|T2|/|T|*log(|T2|/|T|) 0.5164 0.5305 0.4613
|T3|/|T|*log(|T3|/|T|) 0.5305
Split-info(X) 1.5774 0.9403 0.9852
Gain-ratio(X) 0.1564 0.1087 0.0488
Still split on Attribute 1

Classification - Decision Tree 150


C4.5 decision tree algorithms
 Missing data
 Unknown
 Not recorded
 Data entry error
 What to do with missing data?
 Eliminate instances with missing data
 Only useful when there are few
 Replace missing data with some values
 Fixed values, mean, mode, from distribution
 Modify algorithm to work with missing data

Classification - Decision Tree 151


C4.5 decision tree algorithms

 Issues with modified algorithm


 How compare subsets with different number of
unknown values
 With what class to associate instances with
unknown values
 C4.5 replaces unknown values
 Based on the distribution (=relative frequency) of
known values

Classification - Decision Tree 152


C4.5 decision tree algorithms
 For Split-info(X)
 Add one subset for the missing values
 That is – if there are n known classes, use T n+1 for missing
values
 For Info(T) and Infox(T) for a certain attribute
 Use only known values
 Compute F = (number instances with a known value) /
(total number of instances in data set)
 Use Gain(X) = F ∙ [Info(T) – Info x(T)]

Classification - Decision Tree 153


C4.5 decision tree algorithms
Given dataset T
Attribute1 Attribute2 Attribute3 Class
A 70 True CLASS1
A 90 True CLASS2
A 85 False CLASS2
A 95 False CLASS2
A 70 False CLASS1
????? 90 True CLASS1
B 78 False CLASS1
B 65 True CLASS1
B 75 False CLASS1
C 80 True CLASS2
C 70 True CLASS2
C 80 False CLASS1
C 80 False CLASS1
C 96 False CLASS1

Classification - Decision Tree 154


C4.5 decision tree algorithms
Given dataset T
Attribute1 Attribute2 Attribute3 Class
Consider test on attribute 1 A 70 True CLASS1
freq(class,value) CLASS1 CLASS2 A 90 True CLASS2
A 85 False CLASS2
A 2 3 5 A 95 False CLASS2
B 3 0 3 A 70 False CLASS1
C 3 2 5 ????? 90 True CLASS1
B 78 False CLASS1
8 5 13 B 65 True CLASS1
B 75 False CLASS1
C 80 True CLASS2
Factor F = 13 / 14 0.9286 C 70 True CLASS2
C 80 False CLASS1
Info(T) 0.4310 0.5302 0.9612 C 80 False CLASS1
C 96 False CLASS1

Info(S) CLASS1 CLASS2 Info(S) Weight


A 0.5288 0.4422 0.9710 0.3846
B 0.0000 0.0000 0.0000 0.2308
C 0.4422 0.5288 0.9710 0.3846
0.7469

Original Gain equation .9612 - .7469 = 0.2144

New Gain Equation F*Original-Gain 0.1990

 The weight is calculated as n / (N − m),
 where n = the number of tuples with the given attribute value, N = the total number of tuples, and m = the number of tuples with a missing value for the attribute

Classification - Decision Tree 155
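A sketch reproducing this modified gain (Python assumed; not from the slides): only the known values contribute to Info(T) and Info_x(T), and the result is scaled by F = 13/14, giving ≈ 0.199 as on this slide.

```python
import math
from collections import Counter, defaultdict

def info(labels):
    n = len(labels)
    return -sum(c/n * math.log2(c/n) for c in Counter(labels).values())

# Attribute1 / Class pairs for the 14 rows; None marks the missing value
rows = [("A","C1"),("A","C2"),("A","C2"),("A","C2"),("A","C1"),
        (None,"C1"),("B","C1"),("B","C1"),("B","C1"),
        ("C","C2"),("C","C2"),("C","C1"),("C","C1"),("C","C1")]

known = [(v, c) for v, c in rows if v is not None]
F = len(known) / len(rows)                        # 13/14 ~ 0.9286

groups = defaultdict(list)
for value, cls in known:
    groups[value].append(cls)

info_T = info([c for _, c in known])                                  # 0.9612
info_x = sum(len(g)/len(known) * info(g) for g in groups.values())    # 0.7469
print(round(F * (info_T - info_x), 4))   # Gain(X) = F * [Info - Info_x] ~ 0.199
```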


C4.5 decision tree algorithms

Splitting the root


Attribute1 Attribute2 Attribute3
Gain(X) 0.1990 0.0587 -0.0205
|T1| 5 9 6
|T2| 3 5 8
|T3| 5
????? 1
|T| 13 14 14
|T1|/|T|*log(|T1|/|T|) 0.5302 0.4098 0.5239
|T2|/|T|*log(|T2|/|T|) 0.4882 0.5305 0.4613
|T3|/|T|*log(|T3|/|T|) 0.5302
Unknown 0.2846
Split-info(X) 1.8332 0.9403 0.9852
Gain-ratio(X) 0.1086 0.0625 -0.0208
Still split on Attribute 1
Classification - Decision Tree 156
C4.5 decision tree algorithms

 At this point, with unknown values


 Test attributes selected for each node
 Subsets defined for instances with known values
 But what to do with the unknown?
 C4.5 assigns it to ALL the subsets T1,…,Tn
 With probability (or weight)
 P(Ti) = w = |Ti known values| / |T known values|

Classification - Decision Tree 157


C4.5 decision tree algorithms

Classification - Decision Tree 158


C4.5 decision tree algorithms

Given dataset T
Attribute1 Attribute2 Attribute3 Class
A 70 True CLASS1
A 90 True CLASS2
A 85 False CLASS2
A 95 False CLASS2
A 70 False CLASS1
????? 90 True CLASS1
B 78 False CLASS1
B 65 True CLASS1
B 75 False CLASS1
C 80 True CLASS2
C 70 True CLASS2
C 80 False CLASS1
C 80 False CLASS1
C 96 False CLASS1

Classification - Decision Tree 159


C4.5 decision tree algorithms
 Classification = CLASS2 (3.4 / 0.4) means:
 3.4 = |updated Ti| = 3 + 5/13 = 3.3846
 0.4 = the number of instances without a (known) value in Ti
 Thus 3 / 3.3846 = 88.64% belong to CLASS2
 The balance (~12%) is the error rate
 (it belongs to the other classes – in this case CLASS1)
(The dataset T is the same as on the previous slide.)

Classification - Decision Tree 160


C4.5 decision tree algorithms

 Prediction
 Same approach – with probabilities – is used
 If values of attributes known – class is well
defined
 Else all paths from the root explored
 Probability of each class is determined for all classes
 Which is a sum of probabilities along paths
 Class with highest probability is selected

Classification - Decision Tree 161
