Clase12 13
Classification 2
• Many algorithms:
– Hunt's Algorithm (one of the earliest)
– CART
– ID3, C4.5
– SLIQ, SPRINT
Tree Induction
• Greedy strategy.
– Split the records based on an attribute test that optimizes a certain criterion.
• Issues
– Determine how to split the records
• How to specify the attribute test condition?
• How to determine the best split?
– Determine when to stop splitting
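To make the greedy strategy concrete, here is a minimal sketch of recursive tree induction in Python. The function names, the toy data, the Gini criterion, and the depth limit are my own illustrative choices, not something prescribed by the slides: at each node, pick the attribute whose split minimizes the weighted impurity of the children, then recurse until a stopping condition holds.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels, n_attrs):
    """Pick the attribute whose multi-way split minimizes weighted child impurity."""
    n = len(labels)
    best = None
    for a in range(n_attrs):
        groups = {}
        for row, y in zip(rows, labels):
            groups.setdefault(row[a], []).append(y)
        weighted = sum(len(g) / n * gini(g) for g in groups.values())
        if best is None or weighted < best[1]:
            best = (a, weighted)
    return best  # (attribute index, weighted child impurity)

def grow_tree(rows, labels, n_attrs, depth=0, max_depth=3):
    # Stop when the node is pure or the depth limit is reached.
    if len(set(labels)) == 1 or depth == max_depth:
        return Counter(labels).most_common(1)[0][0]  # leaf: majority class
    attr, _ = best_split(rows, labels, n_attrs)
    node = {'attr': attr, 'children': {}}
    groups = {}
    for row, y in zip(rows, labels):
        groups.setdefault(row[attr], []).append((row, y))
    for value, members in groups.items():
        r, l = zip(*members)
        node['children'][value] = grow_tree(list(r), list(l), n_attrs,
                                            depth + 1, max_depth)
    return node
```

This is only the skeleton of the idea; the algorithms listed earlier differ mainly in the impurity measure, the split types they allow, and how they prune.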
26-10-2020
• Multi-way split: use as many partitions as there are distinct values.
[Figure: multi-way split on CarType into Family, Sports, Luxury]
[Figure: binary splits grouping the ordinal attribute Size (Small, Medium, Large) into two subsets]
• Binary split: divides values into two subsets; need to find the optimal partitioning.
• Greedy approach: nodes with a homogeneous class distribution are preferred, so a measure of node impurity is needed.
– Non-homogeneous class distribution: high degree of impurity.
– Homogeneous class distribution: low degree of impurity.
[Figure: comparing candidate splits on attributes A and B. P is the impurity measured before splitting; M1 and M2 are the weighted impurities of the children after splitting on A and on B. Gain = P − M1 vs P − M2]
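A hedged sketch of that comparison (the function names and the class counts below are invented for illustration, with Gini as the impurity measure): compute the parent impurity P, the weighted child impurity after each candidate split, and keep the attribute with the larger gain.

```python
def gini_counts(counts):
    """Gini impurity from per-class record counts at a node."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gain(parent_counts, child_count_lists):
    """Gain = P - M, where M is the weighted impurity of the children."""
    n = sum(parent_counts)
    weighted = sum(sum(c) / n * gini_counts(c) for c in child_count_lists)
    return gini_counts(parent_counts) - weighted

# Hypothetical parent node with 6 records of each class.
parent = [6, 6]
gain_a = gain(parent, [[4, 2], [2, 4]])  # split on A: mixed children
gain_b = gain(parent, [[6, 0], [0, 6]])  # split on B: pure children
# B yields the larger gain, so the greedy strategy picks B.
```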
• Entropy:
Entropy(t) = − Σ_j p(j|t) log₂ p(j|t)
• Misclassification error:
Error(t) = 1 − max_i P(i|t)
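The two measures above can be checked with a few lines of Python (a sketch; the function names are mine):

```python
import math

def entropy(probs):
    """Entropy(t) = -sum_j p(j|t) * log2 p(j|t); terms with p = 0 contribute 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def misclass_error(probs):
    """Error(t) = 1 - max_i P(i|t)."""
    return 1.0 - max(probs)

# A pure node has zero impurity; a 50/50 node maximizes both measures.
```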
• Gini index, GINI(t) = 1 − Σ_j [p(j|t)]², for a two-class node:
C1 = 0, C2 = 6: Gini = 0.000
C1 = 1, C2 = 5: Gini = 0.278
C1 = 2, C2 = 4: Gini = 0.444
C1 = 3, C2 = 3: Gini = 0.500
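The table's values follow directly from the Gini formula; this short check (a sketch, with my own function name) reproduces them:

```python
def gini(counts):
    """GINI(t) = 1 - sum_j p(j|t)^2, computed from per-class counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

# Class counts (C1, C2) from the table above.
values = [round(gini(c), 3) for c in [(0, 6), (1, 5), (2, 4), (3, 3)]]
# values == [0.0, 0.278, 0.444, 0.5]
```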
• Information Gain:
GAIN_split = Entropy(p) − Σ_{i=1..k} (n_i / n) · Entropy(i)
where parent node p is split into k partitions and n_i is the number of records in partition i.
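As a quick illustrative check (the function names and counts are my own), a split that produces pure children recovers the full parent entropy as gain, while a trivial split gains nothing:

```python
import math

def entropy(counts):
    """Entropy from per-class record counts."""
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c > 0)

def info_gain(parent_counts, child_count_lists):
    """GAIN_split = Entropy(p) - sum_i (n_i / n) * Entropy(i)."""
    n = sum(parent_counts)
    return entropy(parent_counts) - sum(
        sum(c) / n * entropy(c) for c in child_count_lists
    )
```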
• Gain Ratio:
GainRATIO_split = GAIN_split / SplitINFO
SplitINFO = − Σ_{i=1..k} (n_i / n) log₂ (n_i / n)
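A small sketch (my own function names) showing how SplitINFO penalizes splitting into many small partitions, which is the point of the ratio:

```python
import math

def split_info(child_sizes):
    """SplitINFO = -sum_i (n_i / n) * log2(n_i / n)."""
    n = sum(child_sizes)
    return -sum(s / n * math.log2(s / n) for s in child_sizes if s > 0)

def gain_ratio(gain, child_sizes):
    """GainRATIO_split = GAIN_split / SplitINFO."""
    return gain / split_info(child_sizes)

# The same gain is worth less when spread over more partitions:
two_way = gain_ratio(1.0, [6, 6])          # SplitINFO = 1.0 -> ratio 1.0
four_way = gain_ratio(1.0, [3, 3, 3, 3])   # SplitINFO = 2.0 -> ratio 0.5
```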
Error(t) = 1 − max_i P(i|t)
Stopping Criteria for Tree Induction
• Stop expanding a node when all the records belong to the same class
• Stop expanding a node when all the records have similar attribute values
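Those two conditions translate directly into a stopping check; a minimal sketch (the function name and the tuple-per-record representation are my own):

```python
def should_stop(rows, labels):
    """Stop expanding when all records share one class,
    or when all records have identical attribute values."""
    if len(set(labels)) <= 1:            # node is pure
        return True
    if all(r == rows[0] for r in rows):  # no attribute can distinguish the records
        return True
    return False
```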
[Figures: steps for building and applying the classifier; the example predicts Credit Rating with classes 1 (TRUE) and 2 (FALSE)]
Typically, the training set contains 70% - 90% of the original data.
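A hedged sketch of a holdout split at the 70% end of that range (the function name and fixed seed are my own choices): shuffle the records, keep the first 70% for training, and hold out the rest for testing.

```python
import random

def train_test_split(records, train_frac=0.7, seed=0):
    """Shuffle the records and hold out (1 - train_frac) of them for testing."""
    rng = random.Random(seed)         # fixed seed for a reproducible split
    shuffled = records[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]
```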
[Figure: model evaluation output. accuracy: 67.00%, AUC = 0.721]
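Both figures of merit reported above can be computed from scratch; a sketch (my own function names, not tied to any particular tool): accuracy counts exact matches, and AUC is the probability that a randomly chosen positive is scored above a randomly chosen negative, with ties counting half.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def auc(y_true, scores):
    """Pairwise (Mann-Whitney) estimate of the area under the ROC curve."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))
```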