Professional Documents
Culture Documents
Task:
– Learn a model that maps each attribute set x
into one of the predefined class labels y
3 No Small 70K No
6 No Medium 60K No
Training Set
Apply
Tid Attrib1 Attrib2 Attrib3 Class Model
11 No Small 55K ?
15 No Large 67K ?
10
Test Set
Base Classifiers
– Decision Tree based Methods
– Rule-based Methods
– Nearest-neighbor
– Neural Networks
– Deep Learning
– Naïve Bayes and Bayesian Belief Networks
– Support Vector Machines
Ensemble Classifiers
– Boosting, Bagging, Random Forests
02/03/2020 Introduction to Data Mining, 2 nd Edition 5
Example of a Decision Tree
c al c al us
i i o
or or nu
t e g
t e g
nti
a ss
ca ca co cl
Splitting Attributes
Home Marital Annual Defaulted
ID
Owner Status Income Borrower
1 Yes Single 125K No Home
2 No Married 100K No Owner
Yes No
3 No Single 70K No
4 Yes Married 120K No NO MarSt
5 No Divorced 95K Yes Single, Divorced Married
6 No Married 60K No
Income NO
7 Yes Divorced 220K No
8 No Single 85K Yes
< 80K > 80K
9 No Married 75K No NO YES
10 No Single 90K Yes
10
c al c al us
i i o
or or nu
teg
teg
nti
a ss
l
ca ca co c MarSt Single,
Married Divorced
Home Marital Annual Defaulted
ID
Owner Status Income Borrower
1 Yes Single 125K No
NO Home
Yes Owner No
2 No Married 100K No
3 No Single 70K No NO Income
4 Yes Married 120K No < 80K > 80K
5 No Divorced 95K Yes
NO YES
6 No Married 60K No
7 Yes Divorced 220K No
8 No Single 85K Yes
9 No Married 75K No There could be more than one tree that
10 No Single 90K Yes
fits the same data!
10
Test Data
Start from the root of tree.
Home Marital Annual Defaulted
Owner Status Income Borrower
No Married 80K ?
Home
10
Yes Owner No
NO MarSt
Single, Divorced Married
Income NO
< 80K > 80K
NO YES
Test Data
Home Marital Annual Defaulted
Owner Status Income Borrower
No Married 80K ?
Home
10
Yes Owner No
NO MarSt
Single, Divorced Married
Income NO
< 80K > 80K
NO YES
Test Data
Home Marital Annual Defaulted
Owner Status Income Borrower
No Married 80K ?
Home 10
Yes Owner No
NO MarSt
Single, Divorced Married
Income NO
< 80K > 80K
NO YES
Test Data
Home Marital Annual Defaulted
Owner Status Income Borrower
No Married 80K ?
Home 10
Yes Owner No
NO MarSt
Single, Divorced Married
Income NO
< 80K > 80K
NO YES
Test Data
Home Marital Annual Defaulted
Owner Status Income Borrower
No Married 80K ?
Home
10
Yes Owner No
NO MarSt
Single, Divorced Married
Income NO
< 80K > 80K
NO YES
Test Data
Home Marital Annual Defaulted
Owner Status Income Borrower
No Married 80K ?
Home 10
Yes Owner No
NO MarSt
Single, Divorced Married Assign Defaulted to
“No”
Income NO
< 80K > 80K
NO YES
6 No Medium 60K No
Training Set
Apply Decision
Tid Attrib1 Attrib2 Attrib3 Class
Model Tree
11 No Small 55K ?
15 No Large 67K ?
10
Test Set
Many Algorithms:
– Hunt’s Algorithm (one of the earliest)
– CART
– ID3, C4.5
– SLIQ,SPRINT
Dt
belong to more than one
class, use an attribute test
to split the data into smaller ?
subsets. Recursively apply
the procedure to each
subset.
02/03/2020 Introduction to Data Mining, 2 nd Edition 16
Hunt’s Algorithm
Home Marital Annual Defaulted
Home ID
Owner Status Income Borrower
Owner
1 Yes Single 125K No
Yes No
Defaulted = No 2 No Married 100K No
Defaulted = No Defaulted = No 3 No Single 70K No
(7,3)
(3,0) (4,3) 4 Yes Married 120K No
Status
(3,0) Single,
Married
Defaulted = No Marital Divorced
Status
Annual Defaulted = No
(3,0) Single,
Married
Divorced Income
(3,0)
Defaulted = Yes Defaulted = No < 80K >= 80K
Status
(3,0) Single,
Married
Defaulted = No Marital Divorced
Status
Annual Defaulted = No
(3,0) Single,
Married
Divorced Income
(3,0)
Defaulted = Yes Defaulted = No < 80K >= 80K
Status
(3,0) Single,
Married
Defaulted = No Marital Divorced
Status
Annual Defaulted = No
(3,0) Single,
Married
Divorced Income
(3,0)
Defaulted = Yes Defaulted = No < 80K >= 80K
Status
(3,0) Single,
Married
Defaulted = No Marital Divorced
Status
Annual Defaulted = No
(3,0) Single,
Married
Divorced Income
(3,0)
Defaulted = Yes Defaulted = No < 80K >= 80K
Multi-way split:
Marital
– Use as many partitions as Status
distinct values.
Binary split:
– Divides values into two subsets
Shirt Shirt
Binary split: Size Size
{Small, {Medium,
Large} Extra Large}
02/03/2020 Introduction to Data Mining, 2 nd Edition 24
Test Condition for Continuous Attributes
Annual Annual
Income Income?
> 80K?
< 10K > 80K
Yes No
Greedy approach:
– Nodes with purer class distribution are
preferred
C0: 5 C0: 9
C1: 5 C1: 1
Entropy
𝑐− 1
𝐸𝑛𝑡𝑟𝑜𝑝𝑦=− ∑ 𝑝𝑖 (𝑡 ) 𝑙𝑜 𝑔2 𝑝𝑖 (𝑡)
𝑖= 0
Misclassification error
𝐶𝑙𝑎𝑠𝑠𝑖𝑓𝑖𝑐𝑎𝑡𝑖𝑜𝑛
𝑒𝑟𝑟𝑜𝑟=1 − max [ 𝑝¿¿ 𝑖 (𝑡 )]¿
Gain = P - M
A? B?
Yes No Yes No
M1 M2
Gain = P – M1 vs P – M2
02/03/2020 Introduction to Data Mining, 2 nd Edition 31
Measure of Impurity: GINI
Gini
Index for a given node
𝑐 −1
2
𝐺𝑖𝑛𝑖 𝐼𝑛𝑑𝑒𝑥=1− ∑ 𝑝𝑖 ( 𝑡 )
𝑖=0
Where is the frequency of class at node , and is the total
number of classes
C1 0 C1 1 C1 2 C1 3
C2 6 C2 5 C2 4 C2 3
Gini=0.000 Gini=0.278 Gini=0.444 Gini=0.500
Annual Income ?
gather count matrix and compute
its Gini index
≤ 80 > 80
– Computationally Inefficient!
Repetition of work. Defaulted Yes 0 3
Defaulted No 3 4
No 0 7 1 6 2 5 3 4 3 4 3 4 3 4 4 3 5 2 6 1 7 0
Gini 0.420 0.400 0.375 0.343 0.417 0.400 0.300 0.343 0.375 0.400 0.420
No 0 7 1 6 2 5 3 4 3 4 3 4 3 4 4 3 5 2 6 1 7 0
Gini 0.420 0.400 0.375 0.343 0.417 0.400 0.300 0.343 0.375 0.400 0.420
No 0 7 1 6 2 5 3 4 3 4 3 4 3 4 4 3 5 2 6 1 7 0
Gini 0.420 0.400 0.375 0.343 0.417 0.400 0.300 0.343 0.375 0.400 0.420
No 0 7 1 6 2 5 3 4 3 4 3 4 3 4 4 3 5 2 6 1 7 0
Gini 0.420 0.400 0.375 0.343 0.417 0.400 0.300 0.343 0.375 0.400 0.420
No 0 7 1 6 2 5 3 4 3 4 3 4 3 4 4 3 5 2 6 1 7 0
Gini 0.420 0.400 0.375 0.343 0.417 0.400 0.300 0.343 0.375 0.400 0.420
Entropy at a given node
𝑐− 1
𝐸𝑛𝑡𝑟𝑜𝑝𝑦=− ∑ 𝑝𝑖 (𝑡 ) 𝑙𝑜 𝑔2 𝑝𝑖 (𝑡)
𝑖= 0
Where is the frequency of class at node , and is the total number of classes
Information
Gain:
𝑘
𝑛𝑖
𝐺𝑎𝑖 𝑛 𝑠𝑝𝑙𝑖𝑡 = 𝐸𝑛𝑡𝑟𝑜𝑝𝑦 ( 𝑝 ) − ∑ 𝐸𝑛𝑡𝑟𝑜𝑝𝑦 (𝑖)
𝑖=1 𝑛
Gain Ratio:
𝑘
𝐺𝑎𝑖 𝑛𝑠𝑝𝑙𝑖𝑡 𝑛𝑖 𝑛𝑖
𝐺𝑎𝑖𝑛 𝑅𝑎𝑡𝑖𝑜= 𝑆𝑝𝑙𝑖𝑡 𝐼𝑛𝑓𝑜=− ∑ 𝑙𝑜 𝑔2
𝑆𝑝𝑙𝑖𝑡 𝐼𝑛𝑓𝑜 𝑖=1 𝑛 𝑛
Gain Ratio:
𝑘
𝐺𝑎𝑖 𝑛𝑠𝑝𝑙𝑖𝑡 𝑛𝑖 𝑛𝑖
𝐺𝑎𝑖𝑛 𝑅𝑎𝑡𝑖𝑜= 𝑆𝑝𝑙𝑖𝑡 𝐼𝑛𝑓𝑜=∑ 𝑙𝑜 𝑔 2
𝑆𝑝𝑙𝑖𝑡 𝐼𝑛𝑓𝑜 𝑖=1 𝑛 𝑛
A? Parent
C1 7
Yes No
C2 3
Node N1 Node N2 Gini = 0.42
Gini(N1) N1 N2
= 1 – (3/3)2 – (0/3)2 Gini(Children)
C1 3 4 = 3/10 * 0
=0
C2 0 3 + 7/10 * 0.489
Gini(N2) Gini=0.342 = 0.342
= 1 – (4/7)2 – (3/7)2
= 0.489 Gini improves but
error remains the
same!!
A? Parent
C1 7
Yes No
C2 3
Node N1 Node N2 Gini = 0.42
N1 N2 N1 N2
C1 3 4 C1 3 4
C2 0 3 C2 1 2
Gini=0.342 Gini=0.416
Z Z
X Y
02/03/2020 Introduction to Data Mining, 2 nd Edition 57
Limitations of single attribute-based decision boundaries