• DESCRIPTIVE:
• Decision Trees (ID3, C4.5)
• Rough Sets
• Genetic Algorithms
• Classification by Association
• STATISTICAL:
• Neural Networks
• Bayesian Networks
Classification Data
• A decision tree is
A flow-chart-like tree structure;
Each internal node denotes a test on an attribute;
Each branch represents a value of the node's attribute;
Each leaf node represents a class label or a class
distribution
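The structure just described can be sketched as a small Python class. This is an illustrative sketch, not code from the lecture; the names (`DTNode`, `branches`, `classify`) are invented.

```python
# Illustrative sketch of the tree structure described above (names are invented).
class DTNode:
    """A decision-tree node: internal nodes test an attribute,
    leaves carry a class label."""
    def __init__(self, attribute=None, label=None):
        self.attribute = attribute  # attribute tested at an internal node (None at a leaf)
        self.branches = {}          # attribute value -> child DTNode
        self.label = label          # class label at a leaf

    def classify(self, record):
        """Follow the branch matching the record's attribute value until a leaf."""
        if self.label is not None:
            return self.label
        return self.branches[record[self.attribute]].classify(record)

# A one-level tree with root attribute "age", echoing the slides' example
root = DTNode(attribute="age")
root.branches["31..40"] = DTNode(label="yes")
root.branches["<=30"] = DTNode(label="no")
print(root.classify({"age": "31..40"}))  # -> yes
```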
DECISION TREE
An Example
[Figure: an example decision tree with root attribute "age" and branches <=30, 31..40, >40]
Crucial point
A good choice of the root attribute and of the internal node
attributes is crucial.
A bad choice may result, in the worst case, in just
another knowledge representation:
the relational table re-written as a tree with the class
attributes (decision attributes) as the leaves.
[Figure: tree after choosing root "age": branches <=30 and >40 still undecided; branch 31…40 → class=yes]
Building The Tree: we chose “student” on the <=30 branch
[Figure: root "age"; the <=30 branch is split on "student" (no / yes); the 31…40 branch is the leaf class=yes; the >40 branch still holds its data table]
Data on the >40 branch (income, student, credit, class):
medium no fair yes
low yes fair yes
low yes excellent no
medium yes fair yes
medium no excellent no
Data under student=no (income, credit, class): high fair no; high excellent no; medium fair no
Data under student=yes (income, credit, class): low fair yes; medium excellent yes
Building The Tree: we chose “student” on the <=30 branch
[Figure: root "age"; <=30 → student (no → class=no, yes → class=yes); 31…40 → class=yes; the >40 branch still holds its data table]
Data on the >40 branch (income, student, credit, class):
medium no fair yes
low yes fair yes
low yes excellent no
medium yes fair yes
medium no excellent no
Building The Tree: we chose “credit” on the >40 branch
[Figure: root "age"; <=30 → student (no → class=no, yes → class=yes); 31…40 → class=yes; >40 → credit (excellent / fair)]
Data under credit=excellent (income, student, class): low yes no; medium no no
Data under credit=fair (income, student, class): medium no yes; low yes yes; medium yes yes
Finished Tree for class=“buys”
[Figure: root "age"; <=30 → student (no → buys=no, yes → buys=yes); 31…40 → buys=yes; >40 → credit (excellent → buys=no, fair → buys=yes)]
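The finished tree can equally be written as a set of IF-THEN rules, one per root-to-leaf path. A minimal sketch in Python (attribute names and values follow the slides; the function name is invented):

```python
def buys(record):
    """Classify a record with the finished tree for class 'buys' (root: age)."""
    if record["age"] == "<=30":
        # the <=30 branch tests 'student'
        return "yes" if record["student"] == "yes" else "no"
    if record["age"] == "31..40":
        return "yes"  # pure leaf: buys=yes
    # the >40 branch tests 'credit'
    return "yes" if record["credit"] == "fair" else "no"

print(buys({"age": "<=30", "student": "no", "credit": "fair"}))       # -> no
print(buys({"age": ">40", "student": "yes", "credit": "excellent"}))  # -> no
```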
Extracting Classification Rules from Trees
• Each path from the root to a leaf can be read off as one IF-THEN classification rule.
Basic Idea of ID3/C4.5 Algorithm
• IF the samples (records in the data table) at a node are all in the
same class, THEN the node becomes a leaf
and is labeled with that class
• OTHERWISE
• the algorithm uses an entropy-based measure known as
information gain as a heuristic for selecting the attribute
that will best separate the samples, i.e.
split the data table into individual classes
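The recursive scheme just described can be sketched in Python as follows. This is an illustrative sketch of ID3's idea, not the lecture's pseudocode; the helper names are invented, and records are assumed to be dicts of categorical attributes.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels (I(p, n) generalised)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain(records, labels, attr):
    """Information gain of splitting the table on attribute attr."""
    n = len(records)
    remainder = 0.0
    for v in set(r[attr] for r in records):
        sub = [lab for r, lab in zip(records, labels) if r[attr] == v]
        remainder += len(sub) / n * entropy(sub)
    return entropy(labels) - remainder

def id3(records, labels, attrs):
    if len(set(labels)) == 1:     # all samples in one class -> leaf
        return labels[0]
    if not attrs:                 # no attributes left -> majority vote
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: gain(records, labels, a))
    tree = {best: {}}
    for v in set(r[best] for r in records):
        sub = [(r, lab) for r, lab in zip(records, labels) if r[best] == v]
        tree[best][v] = id3([r for r, _ in sub], [lab for _, lab in sub],
                            [a for a in attrs if a != best])
    return tree

# The <=30 subset of the slides' table: ID3 picks 'student', giving pure leaves
recs = [{"income": "high", "student": "no", "credit": "fair"},
        {"income": "high", "student": "no", "credit": "excellent"},
        {"income": "medium", "student": "no", "credit": "fair"},
        {"income": "low", "student": "yes", "credit": "fair"},
        {"income": "medium", "student": "yes", "credit": "excellent"}]
print(id3(recs, ["no", "no", "no", "yes", "yes"], ["income", "student", "credit"]))
```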
• Class P: buys_computer = “yes”
• Class N: buys_computer = “no”
• I(p, n) = I(9, 5) = 0.940
• Compute the entropy for age:

age     pi  ni  I(pi, ni)
<=30    2   3   0.971
31…40   4   0   0
>40     3   2   0.971

E(age) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694

Hence
Gain(age) = I(p, n) − E(age) = 0.246

Similarly
Gain(income) = 0.029
Gain(student) = 0.151
Gain(credit_rating) = 0.048
The attribute “age” becomes the root.
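These numbers are easy to re-check; the sketch below recomputes I(9, 5), E(age) and Gain(age) (the exact gain is about 0.2467, which the slides truncate to 0.246):

```python
import math

def I(p, n):
    """Binary entropy of a node with p positive and n negative samples."""
    total = p + n
    return -sum((c / total) * math.log2(c / total) for c in (p, n) if c)

print(round(I(9, 5), 3))  # -> 0.94

# Weighted entropy of the split on age, then the information gain
E_age = 5/14 * I(2, 3) + 4/14 * I(4, 0) + 5/14 * I(3, 2)
print(round(E_age, 3))            # -> 0.694
print(round(I(9, 5) - E_age, 3))  # -> 0.247 (truncated to 0.246 in the slides)
```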
Decision Tree Induction, Predictive Accuracy
and Information Gain
EXAMPLES
Decision Tree Construction
Example 2
Example 2
Test data (Age, Student, Credit, Class; Income = High on all four records):
>40 No Exc No
<=30 No Fair No
<=30 No Exc No
31-40 No Fair Yes
[Figure: candidate tree with root Income (Low / Med / High); the Med branch splits on Age and the High branch on Credit (Fair / Exc); one node still holds the records (Age, Credit, Class): (<=30, Fair, No), (<=30, Exc, No), (31-40, Fair, Yes)]
CORRECT? – INCORRECT?
[Figure: candidate tree with root Income (Low / Med / High); the High branch splits on Age (31-40 / <=30 / >40) and then on Credit (Fair / Exc); leaf tables list (Stud, Class) pairs; node entropies 0.91 (Ind:5), 0.5, 0.5 and 0.91 (Ind:6, split on Student) are marked]
CORRECT? – INCORRECT?
[Figure: candidate tree with root Income (Low / Med / High); branches split on Age (31-40 / >40 / <=30), Credit_Rating (Fair / Exc) and Student (Yes / No)]
CORRECT? – INCORRECT?
[Figure: candidate tree with root Credit_Rating (Fair / Exc); each branch splits on Income (Low / Med / High) and then on Student (Yes / No); leaf tables list (Age, Class) pairs such as (31-40, Yes), (<=30, No), (>40, Yes)]
CORRECT? – INCORRECT?
[Figure: the tree with root Credit_Rating (Fair / Exc) and Income / Student splits, now with some leaves resolved to YES / NO; the remaining (Age, Class) tables are shown]
CORRECT? – INCORRECT?
CORRECT? – INCORRECT?
[Figure: candidate tree with root Credit_Rating (Fair / Exc); branches split on Income (Low / Med / High), Student (Yes / No) and Age (<=30 / 31-40 / >40); leaves labeled YES / NO]
Predictive accuracy:
1. Against Lecture Notes: 4/6 = 66.66%
2. Against Tree 1 rules with root attribute Income: 3/6 = 50%
3. Against Tree 2 rules with root attribute Credit: 5/6 = 83.33%
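Predictive accuracy here is just the fraction of test records whose known class the rules reproduce. A minimal sketch; the one-rule classifier and the three test records below are hypothetical, for illustration only:

```python
def accuracy(rule, test_records):
    """Fraction of (record, class) pairs the rule classifies correctly."""
    hits = sum(1 for rec, cls in test_records if rule(rec) == cls)
    return hits / len(test_records)

# Hypothetical one-rule classifier and test set, for illustration only
rule = lambda rec: "Yes" if rec["Age"] == "31-40" else "No"
test = [({"Age": "31-40"}, "Yes"),
        ({"Age": "<=30"}, "No"),
        ({"Age": ">40"}, "Yes")]
print(round(accuracy(rule, test), 4))  # -> 0.6667 (2 of 3 correct)
```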
EXERCISE 2
Test data: the same four records as in Example 2 (Age, Student, Credit, Class; Income = High):
>40 No Exc No
<=30 No Fair No
<=30 No Exc No
31-40 No Fair Yes
[Figure: the candidate tree with root Income (Low / Med / High) from Example 2, with its Age and Credit subtrees, leaf tables of (Stud, Class) pairs and node entropies 0.91 (Ind:5), 0.5, 0.5]
[Figure: candidate tree with root Income (Low / Med / High); branches split on Age (31-40 / >40 / <=30), Credit (Fair / Exc) and Student]
CORRECT
CORRECT
[Figure: tree with root Credit_Rating (Fair / Exc); each branch splits on Income (Low / Med / High) and then on Student (Yes / No); leaf tables list (Age, Class) pairs]
• Branches that receive No Records are labeled by Majority Voting (here: YES).
CORRECTED
[Figure: corrected tree with root Credit_Rating (Fair / Exc); branches split on Income (Low / Med / High) and Student (Yes / No); resolved leaves labeled YES / NO; remaining (Age, Class) tables shown]
CORRECT
[Figure: candidate tree with root Credit_Rating (Fair / Exc); branches split on Income (Low / Med / High), Student (Yes / No) and Age (<=30 / 31-40 / >40); leaves labeled YES / NO]
• Majority voting is used where no attributes are left on the AGE leaves.
Predictive accuracy:
1. Against Lecture Notes: 4/6 = 66.66%
2. Against Tree 1 rules with root attribute Income: 3/6 = 50%
3. Against Tree 2 rules with root attribute Credit: 4/6 = 66.66%
4. Against OLD Tree 2 rules with root attribute Credit: 5/6 = 83.33%
Calculation of information gain at each level of the tree with
root attribute Income

1. Index 0 (root): I(p, n) = I(9, 5) = 0.940

2. Index 1:
Income  Pi  Ni  I(Pi, Ni)
Low     3   1   0.811
Med     4   2   0.918
High    2   2   1

3. Index 2:
Credit  Pi  Ni  I(Pi, Ni)
Fair    2   1   0.918
Exc     2   1   0.918
EXERCISE:
Construct a correct tree with your own choice of attributes
and evaluate:
1. the correctness of your rules, i.e.
the predictive accuracy with respect to the TRAINING
data;
2. the predictive accuracy with respect to the test data from
Exercise 2.
• Remember
• the TERMINATION CONDITIONS!
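The termination conditions used throughout these slides (pure node, no attributes left, no records on a branch) can be gathered into one sketch; the function name and return convention are invented:

```python
from collections import Counter

def leaf_or_none(labels, attrs_left, parent_labels):
    """Termination conditions for tree building, as used in the slides:
    1. all samples in one class -> leaf with that class;
    2. no attributes left       -> leaf by majority voting;
    3. no samples on a branch   -> leaf by majority voting over the parent."""
    if labels and len(set(labels)) == 1:
        return labels[0]
    if labels and not attrs_left:
        return Counter(labels).most_common(1)[0][0]
    if not labels:
        return Counter(parent_labels).most_common(1)[0][0]
    return None  # none applies: keep splitting

print(leaf_or_none(["yes", "yes"], ["age"], []))       # -> yes (pure node)
print(leaf_or_none(["yes", "no", "yes"], [], []))      # -> yes (majority vote)
print(leaf_or_none([], ["age"], ["no", "no", "yes"]))  # -> no (empty branch)
```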