
26-10-2020

Classification 2

Going Deeper into Decision Tree Induction

• Many Algorithms:
– Hunt’s Algorithm (one of the earliest)
– CART
– ID3, C4.5
– SLIQ, SPRINT

Tree Induction

• Greedy strategy.
– Split the records based on an attribute test that optimizes a certain criterion.

• Issues
– Determine how to split the records
• How to specify the attribute test condition?
• How to determine the best split?
– Determine when to stop splitting


How to Specify Test Condition?

• Depends on attribute types


– Nominal
– Ordinal
– Continuous

• Depends on number of ways to split


– 2-way split
– Multi-way split

Splitting Based on Nominal Attributes

• Multi-way split: Use as many partitions as distinct values.

Example: CarType → {Family}, {Sports}, {Luxury}

• Binary split: Divides values into two subsets.


Need to find optimal partitioning.

Example: CarType → {Sports, Luxury} vs. {Family}, OR CarType → {Family, Luxury} vs. {Sports}


Splitting Based on Ordinal Attributes


• Multi-way split: Use as many partitions as distinct values.

Example: Size → {Small}, {Medium}, {Large}
• Binary split: Divides values into two subsets.
Need to find optimal partitioning.

Example: Size → {Small, Medium} vs. {Large}, OR Size → {Small} vs. {Medium, Large}
• What about this split? Size → {Small, Large} vs. {Medium}
(Grouping non-adjacent values like this does not preserve the order of the ordinal attribute.)

Splitting Based on Continuous Attributes

• Different ways of handling


– Discretization to form an ordinal categorical attribute
• Static – discretize once at the beginning
• Dynamic – ranges can be found by equal-interval bucketing, equal-frequency bucketing (percentiles), or clustering (see the sketch after this list)

– Binary Decision: (A < v) or (A ≥ v)
• considers all possible splits and finds the best cut
• can be more compute-intensive
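As a concrete illustration of static discretization, here is a minimal Python sketch (NumPy only); the attribute values below are made up for illustration and are not from the slides.

```python
# A minimal sketch of static discretization of a continuous attribute into an
# ordinal one; the income values are illustrative.
import numpy as np

income = np.array([60, 70, 75, 85, 90, 95, 100, 120, 125, 220], dtype=float)

# Equal-interval bucketing: 4 bins of equal width over the observed range.
width_edges = np.linspace(income.min(), income.max(), num=5)
equal_width_bins = np.digitize(income, width_edges[1:-1])

# Equal-frequency bucketing (percentiles): 4 bins with roughly the same count.
freq_edges = np.quantile(income, [0.25, 0.50, 0.75])
equal_freq_bins = np.digitize(income, freq_edges)

print(equal_width_bins)  # ordinal bin label (0-3) per record
print(equal_freq_bins)
```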


Tree Induction

• Greedy strategy.
– Split the records based on an attribute test that optimizes a certain criterion.

• Issues
– Determine how to split the records
• How to specify the attribute test condition?
• How to determine the best split?
– Determine when to stop splitting


How to determine the Best Split

Before Splitting: 10 records of class 0, 10 records of class 1

Which test condition is the best?


How to determine the Best Split

• Greedy approach:
– Nodes with homogeneous/purer class distribution are preferred
• Need a measure of node impurity:

Non-homogeneous (high degree of impurity) vs. Homogeneous (low degree of impurity)


Finding the Best Split

1. Compute impurity measure (P) before splitting


2. Compute impurity measure (M) after splitting
– Compute impurity measure of each child node
– M is the weighted impurity of the children
3. Choose the attribute test condition that produces the highest gain
Gain = P – M
or equivalently, lowest impurity measure after splitting (M)


Finding the Best Split


Before Splitting: the parent node has class counts (C0: N00, C1: N01) and impurity P.

Candidate split A? (Yes / No) produces Node N1 (C0: N10, C1: N11) and Node N2 (C0: N20, C1: N21), with child impurities M11 and M12 and weighted impurity M1.

Candidate split B? (Yes / No) produces Node N3 (C0: N30, C1: N31) and Node N4 (C0: N40, C1: N41), with child impurities M21 and M22 and weighted impurity M2.

Gain = P – M1 vs. P – M2


Measures of Node Impurity


• Gini Index
GINI(t) = 1 − Σ_j [p(j|t)]²

• Entropy
Entropy(t) = − Σ_j p(j|t) log p(j|t)

• Misclassification error
Error(t) = 1 − max_i P(i|t)


Measure of Impurity: GINI


• Gini Index for a given node t :
GINI(t) = 1 − Σ_j [p(j|t)]²

(NOTE: p( j | t) is the relative frequency of class j at node t).

– Maximum (1 − 1/n_c, where n_c is the number of classes) when records are equally distributed among all classes, implying the least interesting information
– Minimum (0.0) when all records belong to one class, implying the most interesting information

C1 = 0, C2 = 6: Gini = 0.000
C1 = 1, C2 = 5: Gini = 0.278
C1 = 2, C2 = 4: Gini = 0.444
C1 = 3, C2 = 3: Gini = 0.500


Examples for computing GINI


GINI(t) = 1 − Σ_j [p(j|t)]²

C1 = 0, C2 = 6: P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
Gini = 1 − P(C1)² − P(C2)² = 1 − 0 − 1 = 0

C1 = 1, C2 = 5: P(C1) = 1/6, P(C2) = 5/6
Gini = 1 − (1/6)² − (5/6)² = 0.278

C1 = 2, C2 = 4: P(C1) = 2/6, P(C2) = 4/6
Gini = 1 − (2/6)² − (4/6)² = 0.444
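The same calculations in code: a minimal Python sketch of the node-level Gini computation above (the function name is my own).

```python
# A minimal sketch of the node-level Gini index from per-class record counts.
def gini(counts):
    n = sum(counts)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts)

print(gini([0, 6]))            # 0.0
print(round(gini([1, 5]), 3))  # 0.278
print(round(gini([2, 4]), 3))  # 0.444
```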


Splitting Based on GINI

• Used in CART, SLIQ, SPRINT.


• When a node p is split into k partitions (children), the quality of the split is computed as:

GINI_split = Σ_{i=1..k} (n_i / n) · GINI(i)

where n_i = number of records at child i, and n = number of records at node p.
• Choose the attribute that minimizes the weighted average Gini index of the children


Binary Attributes: Computing GINI Index


• Splits into two partitions
• Effect of weighting partitions: larger and purer partitions are sought.

Parent: C1 = 7, C2 = 5, Gini = 0.486

Split B? (Yes → Node N1, No → Node N2):
N1: C1 = 5, C2 = 1; Gini(N1) = 1 − (5/6)² − (1/6)² = 0.278
N2: C1 = 2, C2 = 4; Gini(N2) = 1 − (2/6)² − (4/6)² = 0.444

Weighted Gini of the split = 6/12 × 0.278 + 6/12 × 0.444 = 0.361
Gain = 0.486 − 0.361 = 0.125
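A minimal Python sketch verifying the worked example above (parent counts [7, 5] split into N1 = [5, 1] and N2 = [2, 4]); the helper mirrors the gini() sketch shown earlier.

```python
# Verifies the weighted Gini and the gain for the split above.
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

parent, n1, n2 = [7, 5], [5, 1], [2, 4]
weighted = (sum(n1) * gini(n1) + sum(n2) * gini(n2)) / sum(parent)

print(round(gini(parent), 3))             # 0.486
print(round(weighted, 3))                 # 0.361
print(round(gini(parent) - weighted, 3))  # gain = 0.125
```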


Alternative Splitting Criteria based on INFO

• Entropy at a given node t:


Entropy(t) = − Σ_j p(j|t) log p(j|t)

(NOTE: p( j | t) is the relative frequency of class j at node t).


– Measures homogeneity of a node.
• Maximum (log n_c) when records are equally distributed among all classes, implying the least information
• Minimum (0.0) when all records belong to one class, implying the most information
– Entropy-based computations are similar to the GINI index computations


Examples for computing Entropy


Entropy(t) = − Σ_j p(j|t) log₂ p(j|t)

C1 = 0, C2 = 6: P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
Entropy = − 0 log 0 − 1 log 1 = − 0 − 0 = 0

C1 = 1, C2 = 5: P(C1) = 1/6, P(C2) = 5/6
Entropy = − (1/6) log₂ (1/6) − (5/6) log₂ (5/6) = 0.65

C1 = 2, C2 = 4: P(C1) = 2/6, P(C2) = 4/6
Entropy = − (2/6) log₂ (2/6) − (4/6) log₂ (4/6) = 0.92
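A minimal Python sketch of the node-level entropy computation above (base-2 logarithm, with 0·log 0 treated as 0).

```python
# Entropy (base 2) of a node from its per-class record counts.
import math

def entropy(counts):
    n = sum(counts)
    # skip zero counts: 0 * log 0 is taken as 0
    return sum(-c / n * math.log2(c / n) for c in counts if c > 0)

print(entropy([0, 6]))            # 0.0
print(round(entropy([1, 5]), 2))  # 0.65
print(round(entropy([2, 4]), 2))  # 0.92
```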


Splitting Based on INFO...

• Information Gain:
GAIN_split = Entropy(p) − Σ_{i=1..k} (n_i / n) · Entropy(i)

Parent node p is split into k partitions; n_i is the number of records in partition i.
– Measures the reduction in entropy achieved because of the split. Choose the split that achieves the most reduction (maximizes GAIN)
– Used in ID3 and C4.5
– Disadvantage: tends to prefer splits that result in a large number of partitions, each being small but pure
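A minimal Python sketch of GAIN_split, reusing the entropy computation above; for illustration it plugs in the same parent/children counts as the earlier Gini example.

```python
# Information gain of a split from parent and child class counts.
import math

def entropy(counts):
    n = sum(counts)
    return sum(-c / n * math.log2(c / n) for c in counts if c > 0)

def information_gain(parent_counts, children_counts):
    n = sum(parent_counts)
    weighted = sum(sum(child) / n * entropy(child) for child in children_counts)
    return entropy(parent_counts) - weighted

# Parent [7, 5] split into children [5, 1] and [2, 4] (illustrative counts).
print(round(information_gain([7, 5], [[5, 1], [2, 4]]), 3))  # ~0.196
```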


Splitting Based on INFO...

• Gain Ratio:

GainRATIO_split = GAIN_split / SplitINFO

SplitINFO = − Σ_{i=1..k} (n_i / n) log (n_i / n)

Parent node p is split into k partitions; n_i is the number of records in partition i.

– Adjusts Information Gain by the entropy of the partitioning (SplitINFO). Higher-entropy partitioning (a large number of small partitions) is penalized!
– Used in C4.5
– Designed to overcome the disadvantage of Information Gain
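A minimal Python sketch of SplitINFO and the gain ratio; the gain value plugged in below is illustrative, not taken from the slides.

```python
# SplitINFO and Gain Ratio from the sizes of the child partitions.
import math

def split_info(child_sizes):
    n = sum(child_sizes)
    return sum(-ni / n * math.log2(ni / n) for ni in child_sizes if ni > 0)

def gain_ratio(gain, child_sizes):
    return gain / split_info(child_sizes)

# A 2-way split into equal halves has SplitINFO = 1, so the ratio equals the gain;
# a 4-way split into equal quarters has SplitINFO = 2, halving the same gain.
print(gain_ratio(0.196, [6, 6]))        # 0.196
print(gain_ratio(0.196, [3, 3, 3, 3]))  # 0.098
```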


Splitting Criteria based on Classification Error

• Classification error at a node t :

Error(t) = 1 − max_i P(i|t)

• Measures misclassification error made by a node.


• Maximum (1 − 1/n_c) when records are equally distributed among all classes, implying the least interesting information
• Minimum (0.0) when all records belong to one class, implying the most interesting information


Examples for Computing Error


Error(t) = 1 − max_i P(i|t)

C1 = 0, C2 = 6: P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
Error = 1 − max(0, 1) = 1 − 1 = 0

C1 = 1, C2 = 5: P(C1) = 1/6, P(C2) = 5/6
Error = 1 − max(1/6, 5/6) = 1 − 5/6 = 1/6

C1 = 2, C2 = 4: P(C1) = 2/6, P(C2) = 4/6
Error = 1 − max(2/6, 4/6) = 1 − 4/6 = 1/3
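A minimal Python sketch of the classification error computation above.

```python
# Classification error of a node from its per-class record counts.
def classification_error(counts):
    return 1.0 - max(counts) / sum(counts)

print(classification_error([0, 6]))            # 0.0
print(round(classification_error([1, 5]), 3))  # 1/6 ~ 0.167
print(round(classification_error([2, 4]), 3))  # 1/3 ~ 0.333
```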


Tree Induction

• Greedy strategy.
– Split the records based on an attribute test that optimizes a certain criterion.

• Issues
– Determine how to split the records
• How to specify the attribute test condition?
• How to determine the best split?
– Determine when to stop splitting


Stopping Criteria for Tree Induction

• Stop expanding a node when all the records belong to the same class

• Stop expanding a node when all the records have similar attribute values

• Early termination (to be discussed later)


Handling Missing Attribute Values

• Missing values affect decision tree construction in three different ways:


– Affects how impurity measures are computed
– Affects how an instance with a missing value is distributed to the child nodes
– Affects how a test instance with a missing value is classified


Decision Trees: Example 2 in RapidMiner


Problem: Prospect filtering – identify which prospects to extend credit to.
Dataset: the German Credit dataset from the University of California, Irvine Machine Learning (UCI-ML) data repository.


Steps


Step 1: Data Preparation


• The raw data consists of 1000 samples and a total of 20 attributes and 1 label (target) attribute.
• There are seven numeric attributes, and the rest are categorical or qualitative, including the label, which is a binominal variable.
• The label attribute is called Credit Rating and can take the value of 1 (good) or 2 (bad).
• In the data, 70% of the samples fall into the good credit rating class.
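A hedged Python sketch of a quick sanity check on these numbers, assuming the German Credit data has been exported to a CSV file; the file name and column name below are illustrative and not part of the RapidMiner process.

```python
import pandas as pd

# Illustrative file/column names; adjust to however the UCI data was exported.
df = pd.read_csv("german_credit.csv")

print(df.shape)  # expect (1000, 21): 20 attributes plus the label
print(df["Credit Rating"].value_counts(normalize=True))  # expect ~70% in class 1 (good)
```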


Step 1: Data Preparation (continued)

The Credit Rating label is recoded as a binominal attribute: 1 → TRUE (good), 2 → FALSE (bad).


Step 2: Divide dataset


As with all supervised model building, data must be separated into
two sets:
1. one for training or developing an acceptable model
2. and the other for testing or ensuring that the model would work
equally well on a different dataset.

The standard practice is to split the available data into a training


set and a testing set.

Typically, the training set contains 70% - 90% of the original data.
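The slides perform this split with RapidMiner's Split Validation operator; a rough Python equivalent (scikit-learn, reusing the DataFrame from the earlier sketch) might look like this.

```python
from sklearn.model_selection import train_test_split

X = df.drop(columns=["Credit Rating"])
y = df["Credit Rating"]

# 70% training / 30% testing, stratified to preserve the 70/30 class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, stratify=y, random_state=42
)
```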


Step 3: Modeling Operator and Parameters


The Validation operator allows one to build a model and apply it to validation data in the same step. This means that two operations - model building and model evaluation - must be configured using the same operator.


Step 4: Configuring the Decision Tree Model


The parameters of the Decision Tree and Performance operators must be set.
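For comparison, a hedged scikit-learn sketch of configuring and training an analogous tree; the parameter values are illustrative and do not correspond to specific RapidMiner settings.

```python
from sklearn.tree import DecisionTreeClassifier

# Assumes categorical attributes were numerically (e.g., one-hot) encoded beforehand.
tree = DecisionTreeClassifier(
    criterion="gini",      # impurity measure; "entropy" would use information gain
    max_depth=5,           # illustrative depth limit for interpretability
    min_samples_leaf=10,   # illustrative minimum leaf size
)
tree.fit(X_train, y_train)  # X_train / y_train from the split sketch above
```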



Step 5: Process Execution and Interpretation

accuracy: 67.00%
AUC = 0.721
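The accuracy and AUC above are read off RapidMiner's Performance operator; a hedged sketch of the same measurements in scikit-learn follows (re-running will not necessarily reproduce 67.00% / 0.721).

```python
from sklearn.metrics import accuracy_score, roc_auc_score

y_pred = tree.predict(X_test)
y_score = tree.predict_proba(X_test)[:, 1]  # score for tree.classes_[1] (the "greater" label)

print("accuracy:", accuracy_score(y_test, y_pred))
print("AUC:", roc_auc_score(y_test, y_score))
```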


Decision Tree Based Classification


Advantages:
– Inexpensive to construct
– Extremely fast at classifying unknown records
– Easy to interpret for small-sized trees
– Robust to noise (especially when methods to avoid overfitting are employed)
– Can easily handle redundant or irrelevant attributes (unless the attributes are interacting)

Disadvantages:
– Space of possible decision trees is exponentially large. Greedy approaches are often unable to find the best tree.
– Does not take into account interactions between attributes
– Each decision boundary involves only a single attribute
