Classification
Chiara Renso
KDD-LAB
ISTI- CNR, Pisa, Italy
CLASSIFICATION: DEFINITION
Given a collection of records (the training set), where each record contains a set of attributes and one of the attributes is the class, find a model for the class attribute as a function of the values of the other attributes. Goal: previously unseen records should be assigned a class as accurately as possible; a test set is used to estimate the accuracy of the model.
Data Mining Course - UFPE - June 2012
ILLUSTRATING CLASSIFICATION TASK
EXAMPLES OF CLASSIFICATION TASK
Predicting tumor cells as benign or malignant; classifying credit card transactions as legitimate or fraudulent; categorizing news stories as finance, weather, entertainment, or sports.
CLASSIFICATION TECHNIQUES
These slides focus on decision tree based methods.
EXAMPLE OF A DECISION TREE
Splitting attributes for the running "Cheat" data set (attributes: Refund, MarSt, TaxInc; class: Cheat). One tree that fits the training data:

  Refund?
    Yes -> NO
    No  -> MarSt?
             Single, Divorced -> TaxInc?
                                   < 80K -> NO
                                   > 80K -> YES
             Married -> NO

A second tree fits the same data with MarSt at the root (Married -> NO; Single, Divorced -> test TaxInc as above). There can be more than one tree that fits the same data!
DECISION TREE CLASSIFICATION TASK
APPLY MODEL TO TEST DATA
Start from the root of the tree. At each internal node follow the branch that matches the test record's attribute value, until a leaf is reached. For the example test record (Refund = No, MarSt = Married, TaxInc = 80K): the traversal takes Refund = No, then MarSt = Married, reaching a leaf. Assign Cheat to "No".

[Figure: the decision tree (Refund -> MarSt -> TaxInc) traversed step by step for the test record.]
DECISION TREE INDUCTION
Many algorithms:
- Hunt's Algorithm (one of the earliest)
- CART
GENERAL STRUCTURE OF HUNT'S ALGORITHM
Let Dt be the set of training records associated with node t, and let y = {y1, y2, ..., yc} be the class labels.
General procedure:
- If Dt contains records that all belong to the same class yt, then t is a leaf node labeled yt.
- If Dt contains records that belong to more than one class, use an attribute test condition to split the records into smaller subsets, and recursively apply the procedure to each subset.

[Figure: Hunt's algorithm growing the tree on the Cheat data, splitting first on Refund (Yes -> leaf "Don't Cheat"), then refining the "No" branch until the leaves are pure.]
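The procedure above can be sketched in a few lines of Python. This is a minimal illustration, not the original formulation: it uses the Gini index (introduced later in these slides) as the impurity measure, an exhaustive search over nominal attributes as the test condition, and hypothetical toy records.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a set of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def hunt(records, labels, attrs):
    """Grow a decision tree following Hunt's general procedure.

    records: list of dicts mapping attribute name -> nominal value.
    Returns a leaf label, or a nested dict {attr: {value: subtree}}."""
    # Case 1: all records belong to the same class -> leaf
    if len(set(labels)) == 1:
        return labels[0]
    # Case 2: no attributes left to test -> leaf labeled with majority class
    if not attrs:
        return Counter(labels).most_common(1)[0][0]
    # Case 3: split on the attribute whose partition minimizes weighted impurity
    def score(a):
        values = set(r[a] for r in records)
        return sum(
            (len(part) / len(labels)) * gini(part)
            for v in values
            for part in [[l for r, l in zip(records, labels) if r[a] == v]]
        )
    best = min(attrs, key=score)
    tree = {best: {}}
    for v in set(r[best] for r in records):
        sub = [(r, l) for r, l in zip(records, labels) if r[best] == v]
        sub_r, sub_l = [r for r, _ in sub], [l for _, l in sub]
        tree[best][v] = hunt(sub_r, sub_l, [a for a in attrs if a != best])
    return tree

# Hypothetical toy records in the spirit of the Refund / MarSt example
recs = [{"Refund": "Yes", "MarSt": "Single"},
        {"Refund": "No",  "MarSt": "Married"},
        {"Refund": "No",  "MarSt": "Single"},
        {"Refund": "No",  "MarSt": "Divorced"}]
labs = ["No", "No", "Yes", "Yes"]
print(hunt(recs, labs, ["Refund", "MarSt"]))
```

Each recursive call removes the attribute it split on, so the recursion always terminates.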
TREE INDUCTION
Greedy strategy: split the records based on an attribute test that optimizes a certain criterion.
Issues: how to specify the attribute test condition, how to determine the best split, and when to stop splitting.
HOW TO SPECIFY TEST CONDITION?
Depends on the attribute type: nominal, ordinal, or continuous. Depends on the number of ways to split: 2-way split or multi-way split.
SPLITTING BASED ON NOMINAL ATTRIBUTES
Multi-way split: use as many partitions as there are distinct values, e.g. CarType in {Family, Sports, Luxury}.
Binary split: divide the values into two subsets, e.g. {Sports, Luxury} vs {Family}, or {Family, Luxury} vs {Sports}.
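The binary splits of a nominal attribute can be enumerated directly; `binary_splits` below is a hypothetical helper, and the count 2**(k-1) - 1 for a k-valued attribute is a standard combinatorial fact.

```python
from itertools import combinations

def binary_splits(values):
    """Enumerate the distinct binary splits of a nominal attribute.

    A k-valued nominal attribute admits 2**(k-1) - 1 distinct binary splits."""
    values = sorted(values)
    splits = set()
    for r in range(1, len(values)):
        for left in combinations(values, r):
            right = tuple(v for v in values if v not in left)
            # (A | B) and (B | A) are the same split; keep a canonical form
            splits.add(tuple(sorted([left, right])))
    return sorted(splits)

for left, right in binary_splits(["Family", "Sports", "Luxury"]):
    print(set(left), "vs", set(right))
```

For CarType with three values this prints the three splits named on the slide.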
SPLITTING BASED ON ORDINAL ATTRIBUTES
Multi-way split: use as many partitions as there are distinct values.
Binary split: divide the values into two subsets that preserve the order, e.g. Size split as {Small, Medium} vs {Large}, or {Medium, Large} vs {Small}. A split such as {Small, Large} vs {Medium} violates the order.
SPLITTING BASED ON CONTINUOUS ATTRIBUTES
Different ways of handling:
- Discretization to form an ordinal categorical attribute.
- Binary decision, (A < v) or (A >= v): consider all possible split points v and choose the best cut; this can be computationally expensive.
HOW TO DETERMINE THE BEST SPLIT
Greedy approach: nodes with a homogeneous class distribution are preferred.
Need a measure of node impurity.
MEASURES OF NODE IMPURITY
- Gini index
- Entropy
HOW TO FIND THE BEST SPLIT
Before splitting, the node has impurity M0. A candidate split A? produces children with impurities M1 and M2, whose weighted average is M12; a candidate split B? produces children with impurities M3 and M4, with weighted average M34.
Gain = M0 - M12 vs M0 - M34: choose the attribute test with the higher gain, i.e. the lower weighted child impurity.
MEASURE OF IMPURITY: GINI
Gini index for a given node t:
  GINI(t) = 1 - sum_j [p(j | t)]^2
where p(j | t) is the relative frequency of class j at node t. The maximum, 1 - 1/nc (nc = number of classes), occurs when records are equally distributed among all classes; the minimum, 0, occurs when all records belong to one class.
EXAMPLES FOR COMPUTING GINI
Split on B?: the Yes branch leads to node N1 with class counts (C1 = 5, C2 = 2); the No branch leads to node N2 with class counts (C1 = 1, C2 = 4).
  Gini(N1) = 1 - (5/7)^2 - (2/7)^2 = 0.408
  Gini(N2) = 1 - (1/5)^2 - (4/5)^2 = 0.320
  Gini(children) = 7/12 x 0.408 + 5/12 x 0.320 = 0.371
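The computation above can be checked in a few lines of Python (the function names are illustrative):

```python
def gini(counts):
    """Gini impurity of a node from its class counts: 1 - sum_j p_j**2."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def weighted_gini(children):
    """Impurity of a split: child Gini values weighted by child size."""
    total = sum(sum(c) for c in children)
    return sum(sum(c) / total * gini(c) for c in children)

# Split on B?: N1 holds (C1=5, C2=2), N2 holds (C1=1, C2=4)
print(round(gini([5, 2]), 3))                     # -> 0.408
print(round(gini([1, 4]), 3))                     # -> 0.32
print(round(weighted_gini([[5, 2], [1, 4]]), 3))  # -> 0.371
```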
CATEGORICAL ATTRIBUTES: COMPUTING GINI INDEX
For each distinct value, gather the class counts in a count matrix and use it to evaluate the candidate (two-way or multi-way) splits.
CONTINUOUS ATTRIBUTES: COMPUTING GINI INDEX
For an efficient computation: sort the attribute values, then linearly scan them, updating the class counts and computing the Gini index at each candidate split position (the midpoints between adjacent sorted values). Choose the split position with the smallest Gini index.
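A sketch of this sort-and-scan procedure in Python. The function name is illustrative, and the income/label values mirror the running Cheat example (taxable income in thousands).

```python
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts) if n else 0.0

def best_split(values, labels, classes=("Yes", "No")):
    """Scan midpoints of the sorted values; return (cut, weighted Gini)."""
    pairs = sorted(zip(values, labels))
    total = {c: labels.count(c) for c in classes}
    left = {c: 0 for c in classes}
    best = (None, float("inf"))
    for i in range(len(pairs) - 1):
        left[pairs[i][1]] += 1          # record i moves to the left side
        if pairs[i][0] == pairs[i + 1][0]:
            continue                    # cannot split between equal values
        cut = (pairs[i][0] + pairs[i + 1][0]) / 2
        n_left, n_right = i + 1, len(pairs) - (i + 1)
        right = [total[c] - left[c] for c in classes]
        w = (n_left * gini(list(left.values()))
             + n_right * gini(right)) / len(pairs)
        if w < best[1]:
            best = (cut, w)
    return best

incomes = [60, 70, 75, 85, 90, 95, 100, 120, 125, 220]
cheat   = ["No", "No", "No", "Yes", "Yes", "Yes", "No", "No", "No", "No"]
print(best_split(incomes, cheat))  # -> (97.5, 0.3)
```

The single linear scan updates the left/right class counts incrementally instead of re-partitioning the records at every candidate cut.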
ALTERNATIVE SPLITTING CRITERIA BASED ON INFORMATION
Entropy at a given node t:
  Entropy(t) = - sum_j p(j | t) log2 p(j | t)
where p(j | t) is the relative frequency of class j at node t. Entropy-based computations are similar to the Gini index computations: the maximum, log2(nc), occurs when records are equally distributed among all classes; the minimum, 0, when all records belong to one class.
EXAMPLES FOR COMPUTING ENTROPY
For a node with six records and two classes:
  P(C1) = 0/6, P(C2) = 6/6 -> Entropy = 0
  P(C1) = 1/6, P(C2) = 5/6 -> Entropy = -(1/6) log2(1/6) - (5/6) log2(5/6) = 0.65
  P(C1) = 2/6, P(C2) = 4/6 -> Entropy = -(2/6) log2(2/6) - (4/6) log2(4/6) = 0.92
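The same numbers can be reproduced in Python (the helper is illustrative; by convention 0 * log2(0) is taken as 0):

```python
from math import log2

def entropy(counts):
    """Entropy of a node from its class counts: -sum_j p_j * log2(p_j)."""
    n = sum(counts)
    # classes with zero count contribute 0 by convention
    return sum(-(c / n) * log2(c / n) for c in counts if c)

for counts in ([0, 6], [1, 5], [2, 4]):
    print(counts, round(entropy(counts), 2))
```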
STOPPING CRITERIA FOR TREE INDUCTION
- Stop expanding a node when all its records belong to the same class.
- Stop expanding a node when all its records have similar attribute values.
- Early termination (see overfitting, below).
DECISION TREE BASED CLASSIFICATION
Advantages:
- Inexpensive to construct
- Extremely fast at classifying unknown records
ESTIMATING GENERALIZATION ERRORS
UNDERFITTING AND OVERFITTING
Underfitting: the model is too simple, so both training and test errors are large. Overfitting: the model fits the training data too closely and results in decision trees that are more complex than necessary; the training error keeps decreasing while the test error starts to rise.
OCCAM'S RAZOR
Given two models with similar generalization errors, prefer the simpler model: a more complex model has a greater chance of fitting noise in the data.
HOW TO ADDRESS OVERFITTING
Pre-pruning (early stopping): stop the algorithm before the tree is fully grown, e.g. when the number of records at a node falls below a threshold or when splitting no longer improves the impurity measure.
HOW TO ADDRESS OVERFITTING…
Post-pruning:
- Grow the decision tree in its entirety.
- Trim the nodes of the decision tree in a bottom-up fashion; if the generalization error improves after trimming, replace the sub-tree with a leaf node.
MODEL EVALUATION
METRICS FOR PERFORMANCE EVALUATION
Confusion matrix (counts of test records):

                        PREDICTED CLASS
                        Class=Yes   Class=No
  ACTUAL   Class=Yes    a (TP)      b (FN)
  CLASS    Class=No     c (FP)      d (TN)

a: TP (true positive), b: FN (false negative), c: FP (false positive), d: TN (true negative)
METRICS FOR PERFORMANCE EVALUATION…
The most widely used metric:

  Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)
LIMITATION OF ACCURACY
Consider a 2-class problem with 9990 records of class 0 and 10 records of class 1. A model that predicts everything to be class 0 has accuracy 9990/10000 = 99.9%, yet it is useless: it does not detect a single class-1 record.
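A tiny numeric check of this scenario (the class counts are the hypothetical ones above):

```python
# Skewed test set: 9990 negatives (class 0), 10 positives (class 1)
n_class0, n_class1 = 9990, 10

# A trivial model that always predicts class 0:
tp, fn = 0, n_class1   # every class-1 record is missed
tn, fp = n_class0, 0   # every class-0 record is trivially "correct"

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall_class1 = tp / (tp + fn)
print(accuracy, recall_class1)  # -> 0.999 0.0
```

High accuracy coexists with zero recall on the minority class, which is exactly the limitation the slide points out.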
COST MATRIX
C(i | j): the cost of misclassifying a record of class j as class i. The cost matrix has the same layout as the confusion matrix (ACTUAL CLASS rows x PREDICTED CLASS columns).
COMPUTING COST OF CLASSIFICATION
Cost matrix: C(+ | +) = -1, C(- | +) = 100, C(+ | -) = 1, C(- | -) = 0.

  Model M1                         Model M2
             PREDICTED                        PREDICTED
             +      -                         +      -
  ACTUAL +   150    40             ACTUAL +   250    45
  CLASS  -   60     250            CLASS  -   5      200

Accuracy(M1) = 80%, Cost(M1) = 3910. Accuracy(M2) = 90%, Cost(M2) = 4255: the more accurate model has the higher cost.
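The two cost computations can be reproduced in Python; the dictionaries encode the matrices above (indexed [actual][predicted]), and the helper names are illustrative.

```python
# Cost matrix C(i | j), indexed [actual][predicted]
COST = {"+": {"+": -1, "-": 100},   # missing a "+" record costs 100
        "-": {"+": 1,  "-": 0}}

def total_cost(confusion):
    """Total cost: sum of count x cost over all (actual, predicted) cells."""
    return sum(confusion[a][p] * COST[a][p]
               for a in confusion for p in confusion[a])

def accuracy(confusion):
    correct = sum(confusion[c][c] for c in confusion)
    total = sum(sum(row.values()) for row in confusion.values())
    return correct / total

m1 = {"+": {"+": 150, "-": 40}, "-": {"+": 60, "-": 250}}
m2 = {"+": {"+": 250, "-": 45}, "-": {"+": 5, "-": 200}}
print(accuracy(m1), total_cost(m1))  # -> 0.8 3910
print(accuracy(m2), total_cost(m2))  # -> 0.9 4255
```

Because a missed "+" record is 100 times as expensive as a false alarm, M2's 45 false negatives dominate its better accuracy.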