
Supervised Learning:

Classification-I
M M Awais

SPJCM
Decision tree induction

Classification - Decision Tree 2


Introduction
 It is a method that induces concepts from
examples (inductive learning)
 Most widely used & practical learning
method
 The learning is supervised: i.e. the classes
or categories of the data instances are
known
 It represents concepts as decision trees
(which can be rewritten as if-then rules)

Classification - Decision Tree 3


Introduction
 Decision tree learning is one of the most
widely used techniques for classification.
 Its classification accuracy is competitive with
other methods, and
 it is very efficient.
 The classification model is a tree, called
decision tree.
 C4.5 by Ross Quinlan is perhaps the best
known system. It can be downloaded from
the Web.

Classification - Decision Tree 4


Introduction

 The target function can be Boolean or discrete valued

Classification - Decision Tree 5


Decision Trees
 Example: “is it a good day to play golf?”
 A set of attributes and their possible values:
 outlook: sunny, overcast, rain
 temperature: cool, mild, hot
 humidity: high, normal
 windy: true, false
 A particular instance in the training set might be:
 <overcast, hot, normal, false>: play
 In this case, the target class is a binary attribute, so each instance represents a positive or a negative example.

Classification - Decision Tree 6


Decision Tree Representation

1. Each node corresponds to an attribute
2. Each branch corresponds to an attribute value
3. Each leaf node assigns a classification

Classification - Decision Tree 7


Example

Classification - Decision Tree 8


Example

 A Decision Tree for the concept PlayTennis:
 Outlook at the root, with branches Sunny, Overcast and Rain; the Sunny branch tests Humidity (High / Normal) and the Rain branch tests Wind (Strong / Weak).
 An unknown observation is classified by testing its attributes and reaching a leaf node.


Classification - Decision Tree 9
Using Decision Trees for Classification
 Examples can be classified as follows:
 1. look at the example's value for the feature specified at the current node
 2. move along the edge labeled with this value
 3. if you reach a leaf, return the label of the leaf
 4. otherwise, repeat from step 1

 Example (a decision tree to decide whether to go on a picnic):
 outlook at the root; sunny → humidity (high → N, normal → P); overcast → P; rain → windy (true → N, false → P)
 So a new instance <rainy, hot, normal, true>: ? will be classified as “no play”.

Classification - Decision Tree 10
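A minimal sketch of this classification procedure (Python assumed; not part of the original slides): the picnic tree above is stored as nested dictionaries, where an internal node maps an attribute name to a dictionary of value → subtree and a leaf is just a class label.

```python
# Hypothetical dict-of-dicts encoding of the picnic tree above.
picnic_tree = {
    "outlook": {
        "sunny": {"humidity": {"high": "N", "normal": "P"}},
        "overcast": "P",
        "rain": {"windy": {"true": "N", "false": "P"}},
    }
}

def classify(tree, instance):
    """Test the attribute at the current node, follow the branch labeled with
    the instance's value, and repeat until a leaf (a class label) is reached."""
    while isinstance(tree, dict):
        attribute = next(iter(tree))                  # attribute tested here
        tree = tree[attribute][instance[attribute]]   # follow matching branch
    return tree

# The new instance from the slide: <rainy, hot, normal, true>
instance = {"outlook": "rain", "temperature": "hot",
            "humidity": "normal", "windy": "true"}
print(classify(picnic_tree, instance))  # -> 'N', i.e. "no play"
```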


Decision Trees and Decision Rules
If attributes are continuous, internal nodes may test against a threshold.
(Tree: outlook at the root; sunny → humidity (> 75% → no, <= 75% → yes); overcast → yes; rain → windy (> 20 → no, <= 20 → yes).)

Each path in the tree represents a decision rule:
Rule 1: If (outlook = “sunny”) AND (humidity <= 0.75) Then (play = “yes”)
Rule 2: If (outlook = “rainy”) AND (wind > 20) Then (play = “no”)
Rule 3: If (outlook = “overcast”) Then (play = “yes”)
...

Classification - Decision Tree 11


 DECISION TREES

 Basic Decision Tree Learning Algorithm


 Most algorithms for growing decision trees
are variants of a basic algorithm

 An example of this core algorithm is the ID3


algorithm developed by Quinlan (1986)

 It employs a top-down, greedy search through the space of possible decision trees
12
 DECISION TREES

 Basic Decision Tree Learning Algorithm


 First of all, we select the best attribute to be tested at the root of the tree
 For making this selection, each attribute is evaluated using a statistical test to determine how well it alone classifies the training examples

13
 DECISION TREES

 Basic Decision Tree Learning Algorithm


 We have:
 14 observations (D1 … D14)
 4 attributes: Outlook, Temperature, Humidity, Wind
 2 classes (Yes, No)

14
 DECISION TREES

 Basic Decision Tree Learning Algorithm


 Outlook at the root, with branches Sunny, Overcast and Rain, partitions the 14 observations into three groups according to their Outlook value.

15
 DECISION TREES

 Basic Decision Tree Learning Algorithm


 The selection process is then repeated using
the training examples associated with each
descendant node to select the best attribute
to test at that point in the tree

16
 DECISION TREES
 Outlook at the root partitions the observations into the Sunny, Overcast and Rain groups.
 What is the “best” attribute to test at this point? The possible choices are Temperature, Wind & Humidity.

17
 DECISION TREES

 Basic Decision Tree Learning Algorithm


 This forms a greedy search for an acceptable decision tree, in which the algorithm never backtracks to reconsider earlier choices

18
 DECISION TREES

Which Attribute is the Best Classifier?


 The central choice in the ID3 algorithm is
selecting which attribute to test at each node
in the tree
 We would like to select the attribute which is

most useful for classifying examples


 For this we need a good quantitative measure

 For this purpose a statistical property, such as information gain or the Gini index, is used

19
Top-Down Decision Tree Generation
 The basic approach usually consists of two phases:
 Tree construction

 At the start, all the training examples are at the


root
 Examples are partitioned recursively based on selected attributes
 Tree pruning

 remove tree branches that may reflect noise in


the training data and lead to errors when
classifying test data
 improve classification accuracy

Classification - Decision Tree 20


Top-Down Decision Tree Generation
 Basic Steps in Decision Tree Construction
 Tree starts as a single node representing all data
 If the samples are all of the same class, then the node becomes a leaf labeled with that class label
 Otherwise, select the feature that best separates the samples into individual classes
 Recursion stops when:
 Samples in a node belong to the same (majority) class
 There are no remaining attributes on which to split
 (How to select the feature?)

Classification - Decision Tree 21


How to find Feature to split?

 Many methods are available but our focus will be on the following:
 Information Theory (Information Gain)
 Gain Ratio
 Gini Index

Classification - Decision Tree 22


Information
High Uncertainty

No Uncertainty
Classification - Decision Tree 23
Valuable Information

 Which information is more valuable:
 that of the high-uncertainty region, or
 that of the no-uncertainty region?
 (The high-uncertainty region.)

Classification - Decision Tree 24


Information theory
 Information theory provides a mathematical basis
for measuring the information content.
 To understand the notion of information, think
about it as providing the answer to a question, for
example, whether a coin will come up heads.
 If one already has a good guess about the answer,
then the actual answer is less informative.
 If one already knows that the coin is rigged so that it will come up heads with probability 0.99, then a message (advance information) about the actual outcome of a flip is worth less than it would be for an honest coin (50-50).

Classification - Decision Tree 25


Information theory (cont …)
 For a fair (honest) coin, you have no information, and you are willing to pay more (say, in terms of $) for advance information - the less you know, the more valuable the information.
 Information theory uses this same intuition,
but instead of measuring the value for
information in dollars, it measures information
contents in bits.
 One bit of information is enough to answer a
yes/no question about which one has no
idea, such as the flip of a fair coin

Classification - Decision Tree 26


Information Basic

Classification - Decision Tree 27


Entropy

Classification - Decision Tree 28


Classification - Decision Tree 29
Classification - Decision Tree 30
Classification - Decision Tree 31
Information: Basics
 Information (Entropy) is:
 E = −pi log2 pi,
 where pi is the probability of an event i
 (−pi log2 pi is always +ve)
 For multiple events:
 E(I) = Σi −pi log2 pi
 Suppose you toss a fair coin; find the information (entropy) when the probability of head or tail is 0.5 each.
 possible events: 2, pi = 0.5
 E(I) = −0.5 log2 0.5 − 0.5 log2 0.5 = 1.0
 If the coin is biased, i.e., the chance of heads is 0.75 and of tails is 0.25, then E(I) = −0.75 log2 0.75 − 0.25 log2 0.25 < 1.0
Classification - Decision Tree 32
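A small sketch of the formula above (Python assumed; not from the slides). The printed values match the coin example on this slide and the die example on the next one.

```python
import math

def entropy(probabilities):
    """E = -sum(p_i * log2(p_i)) in bits; zero-probability events are skipped."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

print(entropy([0.5, 0.5]))    # fair coin   -> 1.0
print(entropy([0.75, 0.25]))  # biased coin -> ~0.81 (< 1.0)
print(entropy([1/6] * 6))     # fair die    -> ~2.585
```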
Information: Basics
 Suppose you have a die and you roll it; find the entropy of the outcome if the probabilities of each event, i.e., of getting 1 to 6, are equal.
 possible events: 6, pi = 1/6
 E(I) = 6 · (−1/6) log2 (1/6) = 2.585
 If the die is biased, i.e., the chance of a ‘6’ is 0.75, then what is the entropy?
 p(6) = 0.75
 p(all others) = 0.25
 p(any other single number) = 0.25/5 = 0.05 (equally divided among 1 to 5)
 then E(I) = −0.75 log2 0.75 − 5 · (0.05) log2 (0.05) = 1.39

Classification - Decision Tree 33


Information: Basics
 (Same die example as above.) Note: as the probability of an event increases, the uncertainty, and hence the entropy, decreases.

Classification - Decision Tree 34


Information: Basics
 (Same die example as above.) Note: so, in making a decision tree, we choose the variable (feature) whose known value reduces the uncertainty.

Classification - Decision Tree 35


Decision Trees
The most notable types of decision tree algorithms are:
 Iterative Dichotomiser 3 (ID3): This algorithm uses Information Gain to decide which attribute is to be used to classify the current subset of the data. For each level of the tree, information gain is calculated for the remaining data recursively.
 C4.5: This algorithm is the successor of the ID3 algorithm. It uses either Information Gain or Gain Ratio to decide upon the classifying attribute. It is a direct improvement on the ID3 algorithm, as it can handle both continuous and missing attribute values.
 Classification and Regression Tree (CART): It is a dynamic learning algorithm which can produce a regression tree as well as a classification tree, depending upon the dependent variable.

Classification - Decision Tree 36


DT: Entropy – A measuring Value
 Entropy is a concept that originated in thermodynamics but later found its way into information theory.
 In the decision tree construction process, the definition of entropy as a measure of disorder suits well.
 If the class values of the data in a node are equally divided among the possible class values, we say entropy (disorder) is maximum.
 If the class values of the data in a node are the same for all data, entropy (disorder) is minimum.

Classification - Decision Tree 37


DT: Entropy – A measuring Value

 A decision tree is built top-down from a root


node and involves partitioning the data into
subsets that contain instances with similar
values (homogenous).
 ID3 algorithm uses entropy to calculate the
homogeneity of a sample.
 If the sample is completely homogeneous the entropy is zero, and if the sample is equally divided it has an entropy of one.

Classification - Decision Tree 38


Entropy

 (Figure: the entropy curve; it reaches its maximum value of 1 at the center, p = 0.5.)

Classification - Decision Tree 39


Information theory: Entropy measure
 The entropy formula:
 entropy(D) = −Σ_{j=1}^{|C|} Pr(cj) · log2 Pr(cj), where Σ_{j=1}^{|C|} Pr(cj) = 1

 Pr(cj) is the probability of class cj in data set D


 We use entropy as a measure of impurity or
disorder of data set D. (or, a measure of
information in a tree)

Classification - Decision Tree 40


Entropy measure: E = −(p/s)·log2(p/s) − (n/s)·log2(n/s),
where p = number of positive examples, n = number of negative examples, s = total number of examples

 As the data become purer and purer, the entropy value


becomes smaller and smaller. This is useful for classification
Classification - Decision Tree 41
Information gain
 Given a set of examples D, we first compute its entropy over the |C| classes:
 entropy(D) = −Σ_{j=1}^{|C|} Pr(cj) · log2 Pr(cj)
 If we choose attribute Ai, with v values, as the root of the current tree, this will partition D into v subsets D1, D2, …, Dv. The expected entropy if Ai is used as the current root is:
 entropy_{Ai}(D) = Σ_{j=1}^{v} (|Dj| / |D|) · entropy(Dj)

Classification - Decision Tree 42


Information Gain

Classification - Decision Tree 43


Information gain (cont …)
 Information gained by selecting attribute Ai to branch or to partition the data is
 gain(D, Ai) = entropy(D) − entropy_{Ai}(D)
 We choose the attribute with the highest gain to branch/split the current tree.
 As the information gain increases for a variable, the uncertainty in decision making reduces.

Classification - Decision Tree 44


Day outlook temp humidity wind play
D1 sunny hot high weak No
D2 sunny hot high strong No
D3 overcast hot high weak Yes
D4 rain mild high weak Yes
D5 rain cool normal weak Yes
D6 rain cool normal strong No
D7 overcast cool normal strong Yes
D8 sunny mild high weak No
D9 sunny cool normal weak Yes
D10 rain mild normal weak Yes
D11 sunny mild normal strong Yes
D12 overcast mild high strong Yes
D13 overcast hot normal weak Yes
D14 rain mild high strong No
Classification - Decision Tree 45
 To build a decision tree, we need to calculate two types
of entropy using frequency tables as follows:
 a) Entropy using the frequency table of one attribute:

Note: the class probabilities are P(Yes) = 9/14 = 0.64 and P(No) = 5/14 = 0.36

Classification - Decision Tree 46


How to calculate log base 2?
 To calculate a log2 value: log2(x) = log(x)/log(2); for example, log2(0.36) = log10(0.36)/log10(2)
 http://logbase2.blogspot.com/2008/08/log-calculator.html

Classification - Decision Tree 47


b) Entropy using the frequency table of two
attributes :

Classification - Decision Tree 48


 Calculate the following Entropies
 E(PlayGolf, temp)
 P(Hot). E(Yes, No)+P(Mild).E(Yes, No)+P(Cool).E(Yes, No)
 Help: first build the frequency table

 E(PlayGolf, humidity)

 E(PlayGolf, windy)

Classification - Decision Tree 49
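A sketch of the requested calculation (Python assumed; not from the slides): it builds the frequency tables implicitly and computes E(PlayGolf, attribute) and the corresponding information gains for the 14-row table of slide 45.

```python
import math
from collections import Counter, defaultdict

# (outlook, temp, humidity, wind, play) rows from slide 45
data = [
    ("sunny","hot","high","weak","No"),     ("sunny","hot","high","strong","No"),
    ("overcast","hot","high","weak","Yes"), ("rain","mild","high","weak","Yes"),
    ("rain","cool","normal","weak","Yes"),  ("rain","cool","normal","strong","No"),
    ("overcast","cool","normal","strong","Yes"), ("sunny","mild","high","weak","No"),
    ("sunny","cool","normal","weak","Yes"), ("rain","mild","normal","weak","Yes"),
    ("sunny","mild","normal","strong","Yes"), ("overcast","mild","high","strong","Yes"),
    ("overcast","hot","normal","weak","Yes"), ("rain","mild","high","strong","No"),
]
attributes = ["outlook", "temp", "humidity", "wind"]

def entropy(labels):
    total = len(labels)
    return -sum(c/total * math.log2(c/total) for c in Counter(labels).values())

def conditional_entropy(rows, attr_index):
    """E(Play, attribute) = sum over values of P(value) * E(labels | value)."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[attr_index]].append(row[-1])
    return sum(len(g)/len(rows) * entropy(g) for g in groups.values())

base = entropy([r[-1] for r in data])           # E(PlayGolf) ~ 0.940
for i, name in enumerate(attributes):
    print(name, round(base - conditional_entropy(data, i), 3))
# outlook has the largest gain (~0.247), humidity ~0.152, wind ~0.048, temp ~0.029
```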


Information Gain

 The information gain is based on the decrease in entropy after a dataset is split on an attribute.
 Constructing a decision tree is all about finding the attribute that returns the highest information gain (i.e., the most homogeneous branches).

Classification - Decision Tree 50


Information Gain
 Step 3: Choose the attribute with the largest information gain as the decision node.
 Step 4a: A branch with entropy of 0 is a leaf node.
 Step 4b: A branch with entropy more than 0 needs further splitting.
 Step 5: The ID3 algorithm is run recursively on the non-leaf branches, until all data is classified.
Decision Tree to Decision Rules
Example
Owns House | Married | Gender | Employed | Credit History | Risk Class
Yes Yes M Yes A B
No No F Yes A A
Yes Yes F Yes B C
Yes No M No B B
No Yes F Yes B C
No No F Yes B A
No No M No B B
Yes No F Yes A A
No Yes F Yes A C
Yes Yes F Yes A C

Classification - Decision Tree 62


Choosing the “Best” Feature
Candidate root tests: Own House? (Yes / No), Credit Rating (A / B), Married? (Yes / No), Gender (M / F)

Classification - Decision Tree 63


Choosing the “Best” Feature — Own House? (Yes / No)
(Using the 10-example credit-risk table from the previous slide.)
 Find the overall entropy first:
 Total samples: 10
 Class A: 3, Class B: 3, Class C: 4
 Entropy(D) = −(3/10)log(3/10) − (3/10)log(3/10) − (4/10)log(4/10) = 1.57
 Owns House has two values, Yes (5 instances) and No (5 instances), total 10; the probability of each is 0.5
 (Of the 5 “Owns House = Yes” examples, 1 is class A, 2 are class B and 2 are class C.)
 Find entropy(Dj) for Yes and No and add the two, weighted by their probabilities:
 E(yes) = −(1/5)log(1/5) − (2/5)log(2/5) − (2/5)log(2/5) = 1.52
 E(no) = −(2/5)log(2/5) − (1/5)log(1/5) − (2/5)log(2/5) = 1.52
 E(Dj) = 0.5·E(yes) + 0.5·E(no) = 1.52
 Gain(D, Own House) = 1.57 − 1.52 = 0.05

Classification - Decision Tree 65




(The same credit-risk table as before.)
Similarly, find the information gain for all the other variables:
 Own House: 0.05
 Married: 0.72
 Gender: 0.88 ← selected as the root node (highest gain)
 Employed: 0.45
 Credit Rating: 0.05

Classification - Decision Tree 72


Choosing the “Best” Feature
Gender
 M: Class A: 0, Class B: 3, Class C: 0
 F: Class A: 3, Class B: 0, Class C: 4
 M branch: no further split is required here; it identifies class B fully.
 F branch: a further split is required here; it cannot identify A and C fully.
 Apply the same procedure again on the other variables, leaving out the column for Gender and the rows for class B, as class B has been fully determined.

Classification - Decision Tree 73




Choosing the “Best” Feature — Own House?
(Remaining data: the 7 Gender = F examples, with 3 in class A and 4 in class C.)
 E(D) = −(3/7)log2(3/7) − (4/7)log2(4/7) ≈ 0.99
 Conditional entropies E(Dj):
 Own House: 0.96
 Married: 0.00
 Etc…
 Married is the best node, as its E(Dj) = 0; hence its information gain will be maximum.
Classification - Decision Tree 75
Completing the DT
(The same credit-risk table.)
Gender
 M → Class B: 3
 F → Class A: 3, Class C: 4 → split on Married
Married
 Yes → Class C: 4
 No → Class A: 3

Classification - Decision Tree 76


Completing the DT
Gender
 M → Class B: 3
 F → Class A: 3, Class C: 4 → split on Married
Married
 Yes → Class C: 4
 No → Class A: 3

Rules:
 R1: If Gender = M then Class B
 R2: If Gender = F and Married = Yes then Class C
 R3: If Gender = F and Married = No then Class A

Classification - Decision Tree 77


Table 6.1 Class‐labeled training tuples from AllElectronics customer database.

78
Classification - Decision Tree 79
Classification - Decision Tree 80
Classification - Decision Tree 81
Trees Construction Algorithm (ID3)
 Decision Tree Learning Method (ID3)
 Input: a set of examples S, a set of features F, and a target set T (the target class T represents the type of instance we want to classify, e.g., whether “to play golf”)
 1. If every element of S is already in T, return “yes”; if no element of S is in T, return “no”
 2. Otherwise, choose the best feature f from F (if there are no features remaining, then return failure)
 3. Extend the tree from f by adding a new branch for each attribute value of f
 4. Distribute the training examples to the leaf nodes (so each leaf node S is now the set of examples at that node, and F is the remaining set of features not yet selected)
 5. Repeat steps 1–5 for each leaf node
 Main Question:
 how do we choose the best feature at each step?

Note: the ID3 algorithm only deals with categorical attributes, but can be extended (as in C4.5) to handle continuous attributes.
Classification - Decision Tree 82
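A compact sketch of this recursive procedure (Python assumed; a simplified illustration, not the exact ID3/C4.5 code). Each training example is assumed to be a dict of categorical feature values plus a "class" key.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    n = len(labels)
    return -sum(c/n * math.log2(c/n) for c in Counter(labels).values())

def best_feature(rows, features):
    """Step 2: choose the feature with the highest information gain."""
    labels = [r["class"] for r in rows]
    def gain(f):
        groups = defaultdict(list)
        for r in rows:
            groups[r[f]].append(r["class"])
        remainder = sum(len(g)/len(rows) * entropy(g) for g in groups.values())
        return entropy(labels) - remainder
    return max(features, key=gain)

def id3(rows, features):
    labels = [r["class"] for r in rows]
    if len(set(labels)) == 1:            # step 1: all examples in one class
        return labels[0]
    if not features:                     # no features left: majority class
        return Counter(labels).most_common(1)[0][0]
    f = best_feature(rows, features)
    subsets = defaultdict(list)
    for r in rows:                       # steps 3-4: one branch per value of f
        subsets[r[f]].append(r)
    return {f: {value: id3(subset, [x for x in features if x != f])
                for value, subset in subsets.items()}}   # step 5: recurse
```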
Choosing the “Best” Feature
 Using Information Gain to find the “best” (most discriminating) feature
 Entropy E(I) of a set of instances I containing p positive and n negative examples:
 E(I) = −(p/(p+n))·log2(p/(p+n)) − (n/(p+n))·log2(n/(p+n))
 Gain(A, I) is the expected reduction in entropy due to feature (attribute) A:
 Gain(A, I) = E(I) − Σ_{descendant j} ((pj + nj)/(p + n)) · E(Ij)
 where the jth descendant of I is the set of instances with value vj for A

 Example: S: [9+, 5−], test Outlook? E = −(9/14)·log(9/14) − (5/14)·log(5/14) = 0.940
 Branches: sunny → [2+, 3−]; overcast → [4+, 0−] (“yes”, since all examples are positive); rainy → [3+, 2−]

Classification - Decision Tree 83


Decision Tree Learning - Example
(Dataset as on slide 45.) S: [9+, 5−] (E = 0.940)
 Split on humidity?
 high → [3+, 4−] (E = 0.985); normal → [6+, 1−] (E = 0.592)
 Gain(S, humidity) = 0.940 − (7/14)·0.985 − (7/14)·0.592 = 0.151
 Split on wind?
 weak → [6+, 2−] (E = 0.811); strong → [3+, 3−] (E = 1.00)
 Gain(S, wind) = 0.940 − (8/14)·0.811 − (6/14)·1.0 = 0.048
 So, classifying examples by humidity provides more information gain than by wind. In this case, however, you can verify that outlook has the largest information gain, so it will be selected as the root.

Classification - Decision Tree 84


Decision Tree Learning - Example
(Dataset as on slide 45.) S: [9+, 5−]
 Split on outlook {D1, D2, …, D14}:
 sunny → [2+, 3−] (E = 0.970); overcast → [4+, 0−] (E = 0); rainy → [3+, 2−] (E = 0.970)
 Gain(S, outlook) = 0.940 − (5/14)·0.970 − (4/14)·0 − (5/14)·0.970 ≈ 0.247
 So outlook has the largest information gain and is selected as the root.

Classification - Decision Tree 85


Decision Tree Learning - Example
 Partially learned decision tree
S: [9+, 5−], Outlook at the root {D1, D2, …, D14}:
 sunny → [2+, 3−] (E = 0.970), examples {D1, D2, D8, D9, D11} → ?
 overcast → [4+, 0−] (E = 0), examples {D3, D7, D12, D13} → yes
 rainy → [3+, 2−] (E = 0.970), examples {D4, D5, D6, D10, D14} → ?

 Which attribute should be tested at the sunny node?
 Ssunny = {D1, D2, D8, D9, D11}
 Gain(Ssunny, humidity) = 0.970 − (3/5)·0.0 − (2/5)·0.0 = 0.970
 Gain(Ssunny, temp) = 0.970 − (2/5)·0.0 − (2/5)·1.0 − (1/5)·0.0 = 0.570
 Gain(Ssunny, wind) = 0.970 − (2/5)·1.0 − (3/5)·0.918 = 0.019

Classification - Decision Tree 86


Highly-branching attributes

 Problematic: attributes with a large number of


values (extreme case: ID code)
 Subsets are more likely to be pure if there is
a large number of values
 Information gain is biased towards choosing
attributes with a large number of values
 This may result in overfitting (selection of an attribute that is non-optimal for prediction)

Classification - Decision Tree 87


Gain Ratio for Attribute Selection (C4.5)

Classification - Decision Tree 88


Another alternative to avoid selecting attributes with large domains

Classification - Decision Tree 89


Gini Index (CART, SLIQ, IBM IntelligentMiner)

Classification - Decision Tree 90


Contd..!!!

Classification - Decision Tree 91


Classification - Decision Tree 92
Comparing Attribute Selection Measures

Classification - Decision Tree 93


Split Algorithm with Gini Index
 Basic concept taken from economics, given by Corrado Gini (1884 to 1965)
 The index varies from 0 to 1
 ZERO means no uncertainty
 ONE means maximum uncertainty
 Income distribution (Gini index) of various countries:
 Brazil 0.59
 India 0.32
 China 0.45
 USA 0.41
 Japan 0.25 (most evenly distributed income)

Classification - Decision Tree 94


Gini Index
 The Gini index is a measure of impurity developed by the Italian statistician Corrado Gini in 1912.
 It is usually used to measure income inequality, but can be used to measure any form of uneven distribution.
 The Gini index is a number between 0 and 1, where 0 corresponds to perfect equality (where everyone has the same income) and 1 corresponds to perfect inequality (where one person has all the income and everyone else has zero income).

 GINI(t) = 1 − Σ_j p(j | t)²

Classification - Decision Tree 95


Diversity and Gini Index
high diversity, low purity:
 G = 1 − (3/8)² − (3/8)² − (1/8)² − (1/8)² = 0.69 (E = 1.811)
low diversity, high purity:
 G = 1 − (6/7)² − (1/7)² = 0.24 (E = 0.592)


Classification - Decision Tree 96
Choosing the “Best” Feature — Own House? (Yes / No)
(Using the Gini index on the same 10-example credit-risk table.)
 Find the overall G first:
 Total samples: 10
 Gt = 1 − (3/10)² − (3/10)² − (4/10)² = 0.66
 Attribute: Own House
 G(y) = 1 − (1/5)² − (2/5)² − (2/5)² = 0.64
 G(n) = 0.64
 Total G = 0.5·G(y) + 0.5·G(n) = 0.64
 Attribute: Married — Total G = 0.5·G(y) + 0.5·G(n) = 0.40
 Attribute: Gender — G = 0.511
 Attribute: Employed — G = 0.475
 Attribute: Credit Rating — G = 0.64
 Gain = Gt − Gi; the gains are: Own House 0.02, Married 0.26, Gender 0.302, …

Classification - Decision Tree 97
Choosing the “Best” Feature
(The same credit-risk table.)
Gender (M / F)
 Choose Gender
 Apply the same procedure again on the other variables, leaving out the column for Gender and the rows for class B, as it has been fully determined.
 Check whether you get the same DT or not.

Classification - Decision Tree 98


Categorical Attributes: Computing
Gini Index
• For each distinct value, gather counts for each class in the set.
• Use the count matrix to make decisions

Multi-way split vs. two-way split (find the best partition of values)
(Example count matrices for Outlook: a multi-way split on {Overcast, Rain, Sunny} versus the binary partitions {Overcast} vs {Rain, Sunny} and {Overcast, Rain} vs {Sunny}, each with its per-value class counts, per-subset Gini, and weighted Gini of 0.34, 0.36 and 0.391 respectively.)

Classification - Decision Tree 99


Continuous Attributes: Computing
Gini Index…

Cheat: No No No Yes Yes Yes No No No No
Taxable Income (sorted): 60 70 75 85 90 95 100 120 125 220
Candidate split positions: 55 65 72 80 87 92 97 110 122 172 230
Gini at each split: 0.420 0.400 0.375 0.343 0.417 0.400 0.300 0.343 0.375 0.400 0.420
(The class counts ≤ / > each candidate position give the Gini values above; the best split is Taxable Income ≤ 97, with Gini = 0.300.)

Classification - Decision Tree 100


Gini index (CART)
 E.g., two classes, Pos and Neg, and dataset S
with p Pos-elements and n Neg-elements.
 fp = p / (p+n), fn = n / (p+n)
 gini(S) = 1 − fp² − fn²
 If dataset S is split into S1, S2 then
 gini_split(S1, S2) = gini(S1)·(p1+n1)/(p+n) + gini(S2)·(p2+n2)/(p+n)

Classification - Decision Tree 101
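A minimal sketch of these two formulas (Python assumed; not from the slides). The checks reuse the diversity example of slide 96 and the outlook split of slide 104.

```python
def gini(labels):
    """gini(S) = 1 - sum of squared class fractions."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def gini_split(subsets):
    """Weighted Gini of a split: sum of gini(Si) * |Si| / |S| over the subsets."""
    total = sum(len(s) for s in subsets)
    return sum(gini(s) * len(s) / total for s in subsets)

print(round(gini(["a"]*3 + ["b"]*3 + ["c", "d"]), 2))   # high diversity -> 0.69
print(round(gini(["a"]*6 + ["b"]), 2))                  # low diversity  -> 0.24

# Outlook at the root of the play-tennis data: {overcast} vs {sunny, rain}
print(round(gini_split([["P"]*4, ["P"]*5 + ["N"]*5]), 4))   # -> 0.3571
```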


Gini index - play tennis example
(The 14-example play-tennis dataset with classes P/N, as before.)
 Two top best splits at the root node:
 Split on outlook: S1 = {overcast} (4 Pos, 0 Neg; P 100%), S2 = {sunny, rain}
 Split on humidity: S1 = {normal} (6 Pos, 1 Neg; P 86%), S2 = {high}

Classification - Decision Tree 102


Calculations
 Outlook
 Sunny or Rainy: Yes = 5, No = 5, Gini = 0.5
 Overcast: Yes = 4, No = 0, Gini = 0
 Weighted Gini of the split = 0.36
 Temperature
 Hot or Cold: Yes = 5, No = 3, Gini = 0.47
 Mild: Yes = 4, No = 2, Gini = 0.44
 Weighted Gini of the split = 0.46
 Humidity
 High: Yes = 3, No = 4, Gini = 0.49
 Normal: Yes = 6, No = 1, Gini = 0.25
 Weighted Gini of the split = 0.37
 Windy
 FALSE: Yes = 6, No = 2, Gini = 0.38
 TRUE: Yes = 3, No = 3, Gini = 0.5
 Weighted Gini of the split = 0.43
 (Outlook gives the lowest weighted Gini, so it is chosen.)

Classification - Decision Tree 103


Calculations at Node 0
 Outlook

 5  2  5  2  1
GINI (outlook  sunny  rainy )  1        
 10   10   2
 4  2  0  2 
GINI (outlook  overcast )  1         0
 4   4  
 10   1   4  
GINI ( split based on outlook )    *     (0)   0.3571
 14   2   14  

Classification - Decision Tree 104


Temperature

 5  2  3  2 
GINI (temperatur e  hot  cold ) 1         0.46875
 8   8  
 4  2  2  2 
GINI (temperatur e  mild ) 1         0.44
 6   6  
 8   6 
GINI ( split based on temperatur e)    * 0.46875    * 0.44   0.456
 14   14  

Classification - Decision Tree 105


Humidity

 3  2  4  2  24
GINI (humidity  high)  1        
 7   7   49
 4  2  2  2  12
GINI (humidity  normal )  1        
 6   6   49
 7   24   7   12  
GINI ( split based on humidity )    *      *     0.37
 14   49   14   49  

Classification - Decision Tree 106


Windy

 6  2  2  2 
GINI ( windy  FALSE )  1         0.375
 8   8  
 3  2  3  2 
GINI ( windy  TRUE )  1         0. 5
 6   6  
 8  6 
GINI ( split based on windy)    * 0.375    * 0.5  0.43
 14   14  

Classification - Decision Tree 107


(Figure: the resulting decision tree — outlook at the root, N = 14. Overcast → 4 yes, 0 no. Rain/Sunny → humidity, N = 10. One humidity branch leads to windy (4 yes, 1 no, N = 5): false → 3 yes, 0 no; true → outlook, N = 2, with Rain → 1 no and Sunny → 1 yes.)

Classification - Decision Tree 108


Classification - Decision Tree 109
Dealing With Continuous Variables
 Partition continuous attribute into a discrete set of
intervals
 sort the examples according to the continuous attribute A
 identify adjacent examples that differ in their target classification

 generate a set of candidate thresholds midway

 problem: may generate too many intervals

 Another Solution:
 take a minimum threshold M of the examples of the majority class in each

adjacent partition; then merge adjacent partitions with the same majority class
Example: M = 3; candidate thresholds at 70.5 and 77.5
Temperature: 64 65 68 69 70 71 72 72 75 75 80 81 83 85
Play?:       yes no yes yes yes no no yes yes yes no yes yes no
Adjacent partitions with the same majority class are merged.
Final mapping: temperature ≤ 77.5 ==> “yes”; temperature > 77.5 ==> “---”

Classification - Decision Tree 110


Improving on Information Gain
 Info. Gain tends to favor attributes with a large number of values
 larger distribution ==> lower entropy ==> larger Gain

 Quinlan suggests using Gain Ratio


 penalize for large number of values

 SplitInfo(A, S) = −Σ_i (|Si| / |S|) · log2(|Si| / |S|)
 GainRatio(A, S) = Gain(A, S) / SplitInfo(A, S)
 Example: “outlook” splits S: [9+, 5−] into S1: [4+, 0−] (overcast), S2: [2+, 3−] (sunny), S3: [3+, 2−] (rainy)
 SplitInfo(outlook, S) = −(4/14)·log(4/14) − (5/14)·log(5/14) − (5/14)·log(5/14) = 1.577
 GainRatio(outlook, S) = 0.246 / 1.577 = 0.156

Classification - Decision Tree 111
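A small sketch of the split-information correction above (Python assumed; not from the slides). The numbers reproduce the outlook example on this slide.

```python
import math

def split_info(subset_sizes):
    """SplitInfo(A, S) = -sum(|Si|/|S| * log2(|Si|/|S|))."""
    total = sum(subset_sizes)
    return -sum(s/total * math.log2(s/total) for s in subset_sizes if s)

si = split_info([4, 5, 5])        # outlook partitions S into 4/5/5 examples
print(round(si, 3))               # -> 1.577
print(round(0.246 / si, 3))       # GainRatio(outlook, S) -> ~0.156
```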


Gain Ratios of Decision Variables
 Outlook
 Info = 0.693
 Gain = 0.940 − 0.693 = 0.247
 Split info = info([5, 4, 5]) = 1.577
 Gain ratio = 0.247/1.577 = 0.156
 Temperature
 Info = 0.911
 Gain = 0.940 − 0.911 = 0.029
 Split info = info([4, 6, 4]) = 1.362
 Gain ratio = 0.029/1.362 = 0.021
 Humidity
 Info = 0.788
 Gain = 0.940 − 0.788 = 0.152
 Split info = info([7, 7]) = 1
 Gain ratio = 0.152/1 = 0.152
 Windy
 Info = 0.892
 Gain = 0.940 − 0.892 = 0.048
 Split info = info([8, 6]) = 0.985
 Gain ratio = 0.048/0.985 = 0.049

Classification - Decision Tree 112


Over-fitting in Classification
 A tree generated may over-fit the training examples due to noise or too small
a set of training data
 Two approaches to avoid over-fitting:
 (Stop earlier): Stop growing the tree earlier

 (Post-prune): Allow over-fit and then post-prune the tree

 Approaches to determine the correct final tree size:


 Separate training and testing sets or use cross-validation

 Use all the data for training, but apply a statistical test (e.g., chi-square) to

estimate whether expanding or pruning a node may improve over entire


distribution
 Use Minimum Description Length (MDL) principle: halting growth of the

tree when the encoding is minimized.


 Rule post-pruning (C4.5): converting to rules before pruning

Classification - Decision Tree 113


The loan data (reproduced)
Approved or not

Classification - Decision Tree 114


A decision tree from the loan data
 Decision nodes and leaf nodes (classes)

Classification - Decision Tree 115


Use the decision tree

No

Classification - Decision Tree 116


Is the decision tree unique?
 No. Here is a simpler tree.
 We want a smaller yet accurate tree.
 It is easier to understand and tends to perform better.

 Finding the best tree is


NP-hard.
 All current tree building
algorithms are heuristic
algorithms

Classification - Decision Tree 117


From a decision tree to a set of rules
 A decision tree can
be converted to a
set of rules
 Each path from the
root to a leaf is a
rule.

Classification - Decision Tree 118
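A minimal sketch of this conversion (Python assumed; not from the slides), using the same nested-dictionary tree encoding as earlier and the Gender/Married tree from slide 77.

```python
def tree_to_rules(tree, conditions=()):
    """Each root-to-leaf path becomes one rule: IF cond AND ... THEN class."""
    if not isinstance(tree, dict):                    # leaf: emit one rule
        conds = " AND ".join(f"{a} = {v}" for a, v in conditions) or "TRUE"
        return [f"IF {conds} THEN class = {tree}"]
    rules = []
    attribute = next(iter(tree))
    for value, subtree in tree[attribute].items():
        rules += tree_to_rules(subtree, conditions + ((attribute, value),))
    return rules

tree = {"Gender": {"M": "B",
                   "F": {"Married": {"Yes": "C", "No": "A"}}}}
for rule in tree_to_rules(tree):
    print(rule)
# IF Gender = M THEN class = B
# IF Gender = F AND Married = Yes THEN class = C
# IF Gender = F AND Married = No THEN class = A
```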


Algorithm for decision tree learning
 Basic algorithm (a greedy divide-and-conquer algorithm)
 Assume attributes are categorical now (continuous attributes
can be handled too)
 Tree is constructed in a top-down recursive manner
 At start, all the training examples are at the root
 Examples are partitioned recursively based on selected
attributes
 Attributes are selected on the basis of an impurity function
(e.g., information gain)
 Conditions for stopping partitioning
 All examples for a given node belong to the same class
 There are no remaining attributes for further partitioning –
majority class is the leaf
 There are no examples left

Classification - Decision Tree 119


Decision tree learning algorithm

Classification - Decision Tree 120


Choose an attribute to partition data
 The key to building a decision tree - which
attribute to choose in order to branch.
 The objective is to reduce impurity or
uncertainty in data as much as possible.
 A subset of data is pure if all instances belong to
the same class.
 The heuristic in C4.5 is to choose the attribute
with the maximum Information Gain or Gain
Ratio based on information theory.

Classification - Decision Tree 121


The loan data (reproduced)
Approved or not

Classification - Decision Tree 122


Two possible roots, which is better?

 Fig. (B) seems to be better.

Classification - Decision Tree 123


An example
entropy(D) = −(6/15)·log2(6/15) − (9/15)·log2(9/15) = 0.971

entropy_{Own_house}(D) = (6/15)·entropy(D1) + (9/15)·entropy(D2)
 = (6/15)·0 + (9/15)·0.918
 = 0.551

entropy_{Age}(D) = (5/15)·entropy(D1) + (5/15)·entropy(D2) + (5/15)·entropy(D3)
 = (5/15)·0.971 + (5/15)·0.971 + (5/15)·0.722
 = 0.888

Age: young → 2 Yes, 3 No, entropy(Di) = 0.971; middle → 3 Yes, 2 No, entropy(Di) = 0.971; old → 4 Yes, 1 No, entropy(Di) = 0.722

 Own_house is the best choice for the root.

Classification - Decision Tree 124


We build the final tree

 We can use information gain ratio to evaluate the


impurity as well

Classification - Decision Tree 125


Handling continuous attributes
 Handle continuous attribute by splitting into
two intervals (can be more) at each node.
 How to find the best threshold to divide?
 Use information gain or gain ratio again
 Sort all the values of a continuous attribute in increasing order {v1, v2, …, vr},
 One possible threshold between two adjacent
values vi and vi+1. Try all possible thresholds and
find the one that maximizes the gain (or gain
ratio).

Classification - Decision Tree 126
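A sketch of this threshold search (Python assumed; not from the slides): sort the values, try a threshold between each pair of adjacent distinct values, and keep the one with the largest information gain.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c/n * math.log2(c/n) for c in Counter(labels).values())

def best_threshold(values, labels):
    """Return (threshold, gain) maximizing the gain of a two-way split."""
    pairs = sorted(zip(values, labels))
    base, n = entropy(labels), len(labels)
    best = (None, -1.0)
    for i in range(n - 1):
        if pairs[i][0] == pairs[i + 1][0]:
            continue                        # no threshold between equal values
        t = (pairs[i][0] + pairs[i + 1][0]) / 2
        left  = [c for v, c in pairs if v <= t]
        right = [c for v, c in pairs if v > t]
        gain = base - len(left)/n * entropy(left) - len(right)/n * entropy(right)
        if gain > best[1]:
            best = (t, gain)
    return best

# Temperature / Play values from the earlier continuous-variable slide:
temps = [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
plays = ["yes","no","yes","yes","yes","no","no","yes","yes","yes","no","yes","yes","no"]
print(best_threshold(temps, plays))   # (best threshold, information gain)
```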


An example in a continuous space

Classification - Decision Tree 127


Avoid overfitting in classification
 Overfitting: A tree may overfit the training data
 Good accuracy on training data but poor on test data
 Symptoms: tree too deep and too many branches,
some may reflect anomalies due to noise or outliers
 Two approaches to avoid overfitting
 Pre-pruning: Halt tree construction early
 Difficult to decide because we do not know what may
happen subsequently if we keep growing the tree.
 Post-pruning: Remove branches or sub-trees from a
“fully grown” tree.
 This method is commonly used. C4.5 uses a statistical
method to estimates the errors at each node for pruning.
 A validation set may be used for pruning as well.

Classification - Decision Tree 128


An example Likely to overfit the data

Classification - Decision Tree 129


Other issues in decision tree learning

 From tree to rules, and rule pruning


 Handling of missing values
 Handling skewed distributions
 Handling attributes and classes with different
costs.
 Attribute construction (adding a new one)
 etc.

Classification - Decision Tree 130


DT Example (1)

Name Gender Height Output1 Output2
Kristina F 1.6 m Short Medium
Jim M 2 m Tall Medium
Maggie F 1.9 m Medium Tall
Martha F 1.88 m Medium Tall
Stephanie F 1.7 m Short Medium
Bob M 1.85 m Medium Medium
Kathy F 1.6 m Short Medium
Dave M 1.7 m Short Medium
Worth M 2.2 m Tall Tall
Steven M 2.1 m Tall Tall
Debbie F 1.8 m Medium Medium
Todd M 1.95 m Medium Medium
Kim F 1.9 m Medium Tall
Amy F 1.8 m Medium Medium
Wynette F 1.75 m Medium Medium

 Considering the data in the table and the correct classification in Output1, we have:
 Short (4/15)
 Medium (8/15)
 Tall (3/15)
 Entropy = 4/15 log(15/4) + 8/15 log(15/8) + 3/15 log(15/3) = 0.4384
 Choosing gender as the splitting attribute we have:
 Entropy(F) = 3/9 log(9/3) + 6/9 log(9/6) = 0.2764
 Entropy(M) = 1/6 log(6/1) + 2/6 log(6/2) + 3/6 log(6/3) = 0.4392

Classification - Decision Tree 131


DT Example (2)
 The algorithm must determine what the gain
in information is by using this split.
 To do this, we calculate the weighted sum of
these last two entropies to get:
((9/15) 0.2764) + ((6/15) 0.4392) = 0.34152

 The gain in entropy by using the gender


attribute is thus:
0.4384 – 0.34152 = 0.09688

Classification - Decision Tree 132


DT Example (3)
(Same table as in DT Example (1).)
 Looking at the height attribute, we divide it into ranges:
 (0, 1.6], (1.6, 1.7], (1.7, 1.8], (1.8, 1.9], (1.9, 2.0], (2.0, ∞)
 Now we can compute the entropy of each range:
 2 in (0, 1.6] → (2/2)(0) + 0 + 0 = 0
 2 in (1.6, 1.7] → (2/2)(0) + 0 + 0 = 0
 3 in (1.7, 1.8] → (0 + (3/3)(0) + 0) = 0
 4 in (1.8, 1.9] → (0 + (4/4)(0) + 0) = 0
 2 in (1.9, 2.0] → (0 + (1/2)(0.301) + (1/2)(0.301)) = 0.301
 2 in (2.0, ∞) → (0 + 0 + (2/2)(0)) = 0

Classification - Decision Tree 133


DT Example (4)
 All the states are completely ordered (entropy 0)
except for the (1.9, 2.0] state.
 The gain in entropy by using the height attribute is:
0.4384-2/15(0.301)=0.3983

 Thus, this has the greater gain, and we choose this


over gender as the first splitting attribute

Classification - Decision Tree 134


DT Example (5)

(First split on Height: ≤ 1.6 m → Short; (1.6, 1.7] → Short; (1.7, 1.8] → Medium; (1.8, 1.9] → Medium; (1.9, 2.0] → ?; > 2.0 m → Tall.)

 The (1.9, 2.0] branch is too large; a further subdivision on height is needed:
 Height ≤ 1.95 m → Medium, > 1.95 m → Tall

 We can optimize the tree:
 Height ≤ 1.7 m → Short; (1.7, 1.95] → Medium; > 1.95 m → Tall

Classification - Decision Tree 135


Quinlan’s ID3 and C4.5 decision tree
algorithms
Given dataset T
Attribute1 Attribute2 Attribute3 Class
A 70 True CLASS1
A 90 True CLASS2
A 85 False CLASS2
A 95 False CLASS2
A 70 False CLASS1
B 90 True CLASS1
B 78 False CLASS1
B 65 True CLASS1
B 75 False CLASS1
C 80 True CLASS2
C 70 True CLASS2
C 80 False CLASS1
C 80 False CLASS1
C 96 False CLASS1
Classification - Decision Tree 136
Quinlan’s ID3 and C4.5 decision tree
algorithms
Consider test on attribute 1
freq(class,value) CLASS1 CLASS2
A 2 3 5
B 4 0 4
C 3 2 5
9 5 14

Info(T) 0.4098 0.5305 0.9403

Info(S) CLASS1 CLASS2 Info(S) Weight


A 0.5288 0.4422 0.9710 0.3571
B 0.0000 0.0000 0.0000 0.2857
C 0.4422 0.5288 0.9710 0.3571
0.6935

Gain .9403 - .6935 = 0.2467


Classification - Decision Tree 137
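A quick numeric check of the table above (Python assumed; not from the slides): it reproduces Info(T) = 0.9403, Info_x(T) = 0.6935 and Gain = 0.2467 for a test on Attribute 1.

```python
import math
from collections import Counter, defaultdict

def info(labels):
    n = len(labels)
    return -sum(c/n * math.log2(c/n) for c in Counter(labels).values())

# (Attribute1, Class) pairs from the 14-row dataset T
rows = [("A","C1"),("A","C2"),("A","C2"),("A","C2"),("A","C1"),
        ("B","C1"),("B","C1"),("B","C1"),("B","C1"),
        ("C","C2"),("C","C2"),("C","C1"),("C","C1"),("C","C1")]

groups = defaultdict(list)
for value, cls in rows:
    groups[value].append(cls)

info_T = info([cls for _, cls in rows])                               # 0.9403
info_x = sum(len(g)/len(rows) * info(g) for g in groups.values())     # 0.6935
print(round(info_T, 4), round(info_x, 4), round(info_T - info_x, 4))  # gain 0.2467
```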
Quinlan’s ID3 and C4.5 decision tree
algorithms
Consider test on attribute 3
freq(class,value) CLASS1 CLASS2
True 3 3 6
False 6 2 8
9 5 14

Info(T) 0.4098 0.5305 0.9403

Info(S) CLASS1 CLASS2 Info(S) Weight


True 0.5000 0.5000 1.0000 0.4286
False 0.3113 0.5000 0.8113 0.5714
0.8922

Gain .9403 - .8922 = 0.0481


Classification - Decision Tree 138
Quinlan’s ID3 and C4.5 decision tree algorithms
 Summary
 Gain(Attribute1) = 0.2467
 Gain(Attribute3) = 0.0481
 In ID3 Attribute 2 not considered
 Since it is numeric
 Thus split on Attribute 1 – highest gain

Classification - Decision Tree 139


C4.5 decision tree algorithms
 What about numeric attribute 2 (as done by C4.5)?
 Consider as categorical
 Then gain = 0.4039 – should split on it
 But – 9 branches, of which 6 with only one

instance
 Tree too wide – not compact
 Since it is really numerical – what to do with a
different value?
 Use threshold Z, and split into two subsets
 Y <= Z and Y > Z
 More complex tests, assuming discrete values and
variable number of subsets
Classification - Decision Tree 140
C4.5 decision tree algorithms
 C4.5 and continuous attribute
 Sort values into v1,…,vm
 Try Zi = vi or Zi = (vi + vi+1) / 2 for i=1,…,m-1
 C4.5 uses Z = vi – more explainable decision rule
 Select splitting value Z*
 So that gain(Z*) = max {gain(Zi), i=1,…,m-1)}
 For last example – Attribute 2 (see next slide)
 Z* = 80
 Gain = 0.1022
 So even with this approach – would have split on Attribute
1

Classification - Decision Tree 141


C4.5 decision tree algorithms
Attribute 2 freq(class,value) Info(S)
Zi Att3 <= Zi Att3 > Zi CLASS1 CLASS2 CLASS1 CLASS2 Total Weight Info(Tx) Gain
65 1 1 0 0.0000 0.0000 0.0000 0.0714
13 8 5 0.4310 0.5302 0.9612 0.9286 0.8926 0.0477
70 4 3 1 0.3113 0.5000 0.8113 0.2857
10 6 4 0.4422 0.5288 0.9710 0.7143 0.9253 0.0150
75 5 4 1 0.2575 0.4644 0.7219 0.3571
9 5 4 0.4711 0.5200 0.9911 0.6429 0.8950 0.0453
78 6 5 1 0.2192 0.4308 0.6500 0.4286
8 4 4 0.5000 0.5000 1.0000 0.5714 0.8500 0.0903
80 9 7 2 0.2820 0.4822 0.7642 0.6429
5 2 3 0.5288 0.4422 0.9710 0.3571 0.8380 0.1022
85 10 7 3 0.3602 0.5211 0.8813 0.7143
4 2 2 0.5000 0.5000 1.0000 0.2857 0.9152 0.0251
90 12 8 4 0.3900 0.5283 0.9183 0.8571
2 1 1 0.5000 0.5000 1.0000 0.1429 0.9300 0.0103
95 13 8 5 0.4310 0.5302 0.9612 0.9286
1 1 0 0.0000 0.0000 0.0000 0.0714 0.8926 0.0477

Classification - Decision Tree 142


C4.5 decision tree algorithms

All same class – so T2 is a leaf


Classification - Decision Tree 143
C4.5 decision tree algorithms
Consider test on attribute 3 FOR SUBSET T1
freq(class,value) CLASS1 CLASS2
True 1 1 2
False 1 2 3
2 3 5

Info(T) 0.5288 0.4422 0.9710 Book has 0.940

Info(S) CLASS1 CLASS2 Info(S) Weight


True 0.5000 0.5000 1.0000 0.4000
False 0.5283 0.3900 0.9183 0.6000
0.9510

Gain .971 - .951 = 0.0200

Attribute 2 freq(class,value) Info(S)


Zi Att3 <= Zi Att3 > Zi CLASS1 CLASS2 CLASS1 CLASS2 Total Weight Info(Tx) Gain
70 2 2 0 0.0000 0.0000 0.0000 0.4000
3 0 3 0.0000 0.0000 0.0000 0.6000 0.0000 0.9710
85 3 2 1 0.3900 0.5283 0.9183 0.6000
2 0 2 0.0000 0.0000 0.0000 0.4000 0.5510 0.4200
90 4 2 2 0.5000 0.5000 1.0000 0.8000
1 0 1 0.0000 0.0000 0.0000 0.2000 0.8000 0.1710

Max gain on Attribute 2 - split on Z* = 70


Classification - Decision Tree 144
C4.5 decision tree algorithms
Consider test on attribute 3 FOR SUBSET T3
freq(class,value) CLASS1 CLASS2
True 0 2 2
False 3 0 3
3 2 5

Info(T) 0.4422 0.5288 0.9710 Book has 0.940

Info(S) CLASS1 CLASS2 Info(S) Weight


True 0.0000 0.0000 0.0000 0.4000
False 0.0000 0.0000 0.0000 0.6000
0.0000

Gain .971 - 0.000 = 0.9710

Attribute 2 freq(class,value) Info(S)


Zi Att3 <= Zi Att3 > Zi CLASS1 CLASS2 CLASS1 CLASS2 Total Weight Info(Tx) Gain
70 1 0 1 0.0000 0.0000 0.0000 0.2000
4 3 1 0.3113 0.5000 0.8113 0.8000 0.6490 0.3219
80 4 2 2 0.5000 0.5000 1.0000 0.8000
1 1 0 0.0000 0.0000 0.0000 0.2000 0.8000 0.1710

Max gain on Attribute 3


Classification - Decision Tree 145
C4.5 decision tree algorithms

Classification - Decision Tree 146


C4.5 decision tree algorithms

Classification - Decision Tree 147


C4.5 decision tree algorithms
 We used the entropy of T after splitting into T1, …, Tn:
 Info(Tj) = −Σ_{i=1}^{k} (freq(Ci, Tj) / |Tj|) · log2(freq(Ci, Tj) / |Tj|)
 Info_x(T) = Σ_{j=1}^{n} (|Tj| / |T|) · Info(Tj)
 Gain(X) = Info(T) − Info_x(T)
 This is biased in favor of tests X with many outcomes
 A split on ID will generate one subset for each unique value – i.e., one for every instance – with each subset containing a single instance
 It has maximal gain, as Info_x(T) = 0
 But the result is a one-level tree with one branch for each instance
 Thus divide by the number of branches – to measure average gain
Classification - Decision Tree 148


C4.5 decision tree algorithms
 So define the entropy of the split itself:
 Split-info(X) = −Σ_{j=1}^{n} (|Tj| / |T|) · log2(|Tj| / |T|)
 the potential information generated by splitting T into T1, …, Tn
 similar to the definition of Info(Tj)
 Use the entropy of T after splitting into T1, …, Tn as before:
 Info(Tj) = −Σ_{i=1}^{k} (freq(Ci, Tj) / |Tj|) · log2(freq(Ci, Tj) / |Tj|)
 Info_x(T) = Σ_{j=1}^{n} (|Tj| / |T|) · Info(Tj)
 Gain(X) = Info(T) − Info_x(T)
 Selection criterion:
 Gain-ratio(X) = Gain(X) / Split-info(X)
 the proportion of information generated by a “useful” compact split
 Select X* so that Gain-ratio(X*) = max over attributes X of Gain-ratio(X)

Classification - Decision Tree 149


C4.5 decision tree algorithms
Splitting the root
Attribute1 Attribute2 Attribute3
Gain(X) 0.2467 0.1022 0.0481
|T1| 5 9 6
|T2| 4 5 8
|T3| 5
|T| 14 14 14
|T1|/|T|*log(|T1|/|T|) 0.5305 0.4098 0.5239
|T2|/|T|*log(|T2|/|T|) 0.5164 0.5305 0.4613
|T3|/|T|*log(|T3|/|T|) 0.5305
Split-info(X) 1.5774 0.9403 0.9852
Gain-ratio(X) 0.1564 0.1087 0.0488
Still split on Attribute 1

Classification - Decision Tree 150


C4.5 decision tree algorithms
 Missing data
 Unknown
 Not recorded
 Data entry error
 What to do with missing data?
 Eliminate instances with missing data
 Only useful when there are few
 Replace missing data with some values
 Fixed values, mean, mode, from distribution
 Modify algorithm to work with missing data

Classification - Decision Tree 151


C4.5 decision tree algorithms

 Issues with modified algorithm


 How compare subsets with different number of
unknown values
 With what class to associate instances with
unknown values
 C4.5 replaces unknown values
 Based on the distribution (=relative frequency) of
known values

Classification - Decision Tree 152


C4.5 decision tree algorithms
 For Split-info(X)
 Add one subset for the missing values
 That is – if there are n known classes, use T n+1 for missing
values
 For Info(T) and Infox(T) for a certain attribute
 Use only known values
 Compute F = (number instances with a known value) /
(total number of instances in data set)
 Use Gain(X) = F ∙ [Info(T) – Info x(T)]

Classification - Decision Tree 153


C4.5 decision tree algorithms
Given dataset T
Attribute1 Attribute2 Attribute3 Class
A 70 True CLASS1
A 90 True CLASS2
A 85 False CLASS2
A 95 False CLASS2
A 70 False CLASS1
????? 90 True CLASS1
B 78 False CLASS1
B 65 True CLASS1
B 75 False CLASS1
C 80 True CLASS2
C 70 True CLASS2
C 80 False CLASS1
C 80 False CLASS1
C 96 False CLASS1

Classification - Decision Tree 154


C4.5 decision tree algorithms
Given dataset T
Attribute1 Attribute2 Attribute3 Class
Consider test on attribute 1 A 70 True CLASS1
freq(class,value) CLASS1 CLASS2 A 90 True CLASS2
A 85 False CLASS2
A 2 3 5 A 95 False CLASS2
B 3 0 3 A 70 False CLASS1
C 3 2 5 ????? 90 True CLASS1
B 78 False CLASS1
8 5 13 B 65 True CLASS1
B 75 False CLASS1
C 80 True CLASS2
Factor F = 13 / 14 0.9286 C 70 True CLASS2
C 80 False CLASS1
Info(T) 0.4310 0.5302 0.9612 C 80 False CLASS1
C 96 False CLASS1

Info(S) CLASS1 CLASS2 Info(S) Weight


A 0.5288 0.4422 0.9710 0.3846
B 0.0000 0.0000 0.0000 0.2308
C 0.4422 0.5288 0.9710 0.3846
0.7469

Original Gain equation .9612 - .7469 = 0.2144

New Gain Equation F*Original-Gain 0.1990

 The weight is calculated as n / (N − m),
 where n = the number of tuples with the given attribute value, N = the total number of tuples, and m = the number of tuples with a missing value for the attribute

Classification - Decision Tree 155
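A sketch reproducing this modified gain (Python assumed; not from the slides): only the known values contribute to Info(T) and Info_x(T), and the result is scaled by F = 13/14, giving ≈ 0.199 as on this slide.

```python
import math
from collections import Counter, defaultdict

def info(labels):
    n = len(labels)
    return -sum(c/n * math.log2(c/n) for c in Counter(labels).values())

# Attribute1 / Class pairs for the 14 rows; None marks the missing value
rows = [("A","C1"),("A","C2"),("A","C2"),("A","C2"),("A","C1"),
        (None,"C1"),("B","C1"),("B","C1"),("B","C1"),
        ("C","C2"),("C","C2"),("C","C1"),("C","C1"),("C","C1")]

known = [(v, c) for v, c in rows if v is not None]
F = len(known) / len(rows)                        # 13/14 ~ 0.9286

groups = defaultdict(list)
for value, cls in known:
    groups[value].append(cls)

info_T = info([c for _, c in known])                                  # 0.9612
info_x = sum(len(g)/len(known) * info(g) for g in groups.values())    # 0.7469
print(round(F * (info_T - info_x), 4))   # Gain(X) = F * [Info - Info_x] ~ 0.199
```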


C4.5 decision tree algorithms

Splitting the root


Attribute1 Attribute2 Attribute3
Gain(X) 0.1990 0.0587 -0.0205
|T1| 5 9 6
|T2| 3 5 8
|T3| 5
????? 1
|T| 13 14 14
|T1|/|T|*log(|T1|/|T|) 0.5302 0.4098 0.5239
|T2|/|T|*log(|T2|/|T|) 0.4882 0.5305 0.4613
|T3|/|T|*log(|T3|/|T|) 0.5302
Unknown 0.2846
Split-info(X) 1.8332 0.9403 0.9852
Gain-ratio(X) 0.1086 0.0625 -0.0208
Still split on Attribute 1
Classification - Decision Tree 156
C4.5 decision tree algorithms

 At this point, with unknown values


 Test attributes selected for each node
 Subsets defined for instances with known values
 But what to do with the unknown?
 C4.5 assigns it to ALL the subsets T1,…,Tn
 With probability (or weight)
 P(Ti) = w = |Ti known values| / |T known values|

Classification - Decision Tree 157


C4.5 decision tree algorithms

Classification - Decision Tree 158


C4.5 decision tree algorithms

Given dataset T
Attribute1 Attribute2 Attribute3 Class
A 70 True CLASS1
A 90 True CLASS2
A 85 False CLASS2
A 95 False CLASS2
A 70 False CLASS1
????? 90 True CLASS1
B 78 False CLASS1
B 65 True CLASS1
B 75 False CLASS1
C 80 True CLASS2
C 70 True CLASS2
C 80 False CLASS1
C 80 False CLASS1
C 96 False CLASS1

Classification - Decision Tree 159


C4.5 decision tree algorithms
 Classification = CLASS2 (3.4 / 0.4) means:
 3.4 = |updated Ti| = 3 + 5/13 = 3.3846
 0.4 = the number of instances without a (known) value in Ti
 Thus 3 / 3.3846 = 88.64% belong to CLASS2
 The balance (~12%) is the error rate
 (it belongs to the other classes – in this case CLASS1)
(The dataset T is the same as on the previous slide.)

Classification - Decision Tree 160


C4.5 decision tree algorithms

 Prediction
 Same approach – with probabilities – is used
 If values of attributes known – class is well
defined
 Else all paths from the root explored
 Probability of each class is determined for all classes
 Which is a sum of probabilities along paths
 Class with highest probability is selected

Classification - Decision Tree 161
