Professional Documents
Culture Documents
Data Mining Classification: Basic Concepts and Techniques: Lecture Notes For Chapter 3
Data Mining Classification: Basic Concepts and Techniques: Lecture Notes For Chapter 3
l Task:
– Learn a model that maps each attribute set x
into one of the predefined class labels y
Apply
Tid Attrib1 Attrib2 Attrib3 Class Model
11 No Small 55K ?
12 Yes Medium 80K ?
13 Yes Large 110K ?
14 No Small 95K ?
15 No Large 67K ?
10
Base Classifiers
– Decision Tree based Methods
– Rule-based Methods
– Nearest-neighbor
– Neural Networks
– Deep Learning
– Naïve Bayes and Bayesian Belief Networks
– Support Vector Machines
Ensemble Classifiers
– Boosting, Bagging, Random Forests
6/15/2021 Introduction to Data Mining, 2nd Edition 5
Example of a Decision Tree
Splitting Attributes
Home Marital Annual Defaulted
ID
Owner Status Income Borrower
1 Yes Single 125K No Home
2 No Married 100K No Owner
Yes No
3 No Single 70K No
4 Yes Married 120K No NO MarSt
5 No Divorced 95K Yes Single, Divorced Married
6 No Married 60K No
Income NO
7 Yes Divorced 220K No
< 80K > 80K
8 No Single 85K Yes
9 No Married 75K No NO YES
10 No Single 90K Yes
10
MarSt Single,
Married Divorced
Home Marital Annual Defaulted
ID
Owner Status Income Borrower
NO Home
1 Yes Single 125K No
Yes Owner No
2 No Married 100K No
3 No Single 70K No NO Income
4 Yes Married 120K No < 80K > 80K
5 No Divorced 95K Yes
NO YES
6 No Married 60K No
7 Yes Divorced 220K No
8 No Single 85K Yes
9 No Married 75K No There could be more than one tree that
fits the same data!
10 No Single 90K Yes
10
Test Data
Start from the root of tree.
Home Marital Annual Defaulted
Owner Status Income Borrower
No Married 80K ?
Home 10
Yes Owner No
NO MarSt
Income NO
< 80K > 80K
NO YES
Test Data
Home Marital Annual Defaulted
Owner Status Income Borrower
No Married 80K ?
Home 10
Yes Owner No
NO MarSt
Income NO
< 80K > 80K
NO YES
Test Data
Home Marital Annual Defaulted
Owner Status Income Borrower
No Married 80K ?
Home 10
Yes Owner No
NO MarSt
Income NO
< 80K > 80K
NO YES
Test Data
Home Marital Annual Defaulted
Owner Status Income Borrower
No Married 80K ?
Home 10
Yes Owner No
NO MarSt
Income NO
< 80K > 80K
NO YES
Test Data
Home Marital Annual Defaulted
Owner Status Income Borrower
No Married 80K ?
Home 10
Yes Owner No
NO MarSt
Income NO
< 80K > 80K
NO YES
Test Data
Home Marital Annual Defaulted
Owner Status Income Borrower
No Married 80K ?
Home 10
Yes Owner No
NO MarSt
NO YES
Apply Decision
Tid Attrib1 Attrib2 Attrib3 Class
Model Tree
11 No Small 55K ?
12 Yes Medium 80K ?
13 Yes Large 110K ?
14 No Small 95K ?
15 No Large 67K ?
10
Many Algorithms:
– Hunt’s Algorithm (one of the earliest)
– CART
– ID3, C4.5
– SLIQ,SPRINT
(3,0)
(3,0)
(3,0)
(1,3) (3,0)
(1,0) (0,3)
(3,0)
(3,0)
(3,0)
(1,3) (3,0)
(1,0) (0,3)
(3,0)
(3,0)
(3,0)
(1,3) (3,0)
(1,0) (0,3)
(3,0)
(3,0)
(3,0)
(1,3) (3,0)
(1,0) (0,3)
Multi-way split:
– Use as many partitions as
distinct values.
Binary split:
– Divides values into two subsets
l Multi-way split:
– Use as many partitions
as distinct values
l Binary split:
– Divides values into two
subsets
– Preserve order
property among
attribute values This grouping
violates order
property
l Greedy approach:
– Nodes with purer class distribution are
preferred
l Gini Index
GINI (t ) 1 [ p( j | t )] 2
l Entropy
Entropy (t ) p ( j | t ) log p ( j | t )
j
l Misclassification error
Error (t ) 1 max P (i | t ) i
A? B?
Yes No Yes No
M1 M2
Gain = P – M1 vs P – M2
6/15/2021 Introduction to Data Mining, 2nd Edition 31
Measure of Impurity: GINI
GINI (t ) 1 [ p ( j | t )]2
j
GINI (t ) 1 [ p ( j | t )]2
j
C1 0 C1 1 C1 2 C1 3
C2 6 C2 5 C2 4 C2 3
Gini=0.000 Gini=0.278 Gini=0.444 Gini=0.500
GINI (t ) 1 [ p ( j | t )]2
j
No 0 7 1 6 2 5 3 4 3 4 3 4 3 4 4 3 5 2 6 1 7 0
Gini 0.420 0.400 0.375 0.343 0.417 0.400 0.300 0.343 0.375 0.400 0.420
No 0 7 1 6 2 5 3 4 3 4 3 4 3 4 4 3 5 2 6 1 7 0
Gini 0.420 0.400 0.375 0.343 0.417 0.400 0.300 0.343 0.375 0.400 0.420
No 0 7 1 6 2 5 3 4 3 4 3 4 3 4 4 3 5 2 6 1 7 0
Gini 0.420 0.400 0.375 0.343 0.417 0.400 0.300 0.343 0.375 0.400 0.420
No 0 7 1 6 2 5 3 4 3 4 3 4 3 4 4 3 5 2 6 1 7 0
Gini 0.420 0.400 0.375 0.343 0.417 0.400 0.300 0.343 0.375 0.400 0.420
No 0 7 1 6 2 5 3 4 3 4 3 4 3 4 4 3 5 2 6 1 7 0
Gini 0.420 0.400 0.375 0.343 0.417 0.400 0.300 0.343 0.375 0.400 0.420
Entropy (t ) p ( j | t ) log p ( j | t )
j 2
l Information Gain:
n
GAIN Entropy ( p )
k
Entropy (i ) i
n
split i 1
l Gain Ratio:
GAIN n n
SplitINFO log
k
GainRATIO split
Split i i
SplitINFO n n i 1
l Gain Ratio:
GAIN n n
SplitINFO log
k
GainRATIO split
Split i i
SplitINFO n n i 1
Error (t ) 1 max P (i | t ) i
Error (t ) 1 max P (i | t ) i
A? Parent
C1 7
Yes No
C2 3
Node N1 Node N2 Gini = 0.42
Gini(N1) N1 N2
= 1 – (3/3)2 – (0/3)2 Gini(Children)
C1 3 4 = 3/10 * 0
=0
C2 0 3 + 7/10 * 0.489
Gini(N2) Gini=0.342 = 0.342
= 1 – (4/7)2 – (3/7)2
= 0.489 Gini improves but
error remains the
same!!
A? Parent
C1 7
Yes No
C2 3
Node N1 Node N2 Gini = 0.42
N1 N2 N1 N2
C1 3 4 C1 3 4
C2 0 3 C2 1 2
Gini=0.342 Gini=0.416
Z Z
X Y
6/15/2021 Introduction to Data Mining, 2nd Edition 57
Limitations of single attribute-based decision boundaries