
26-10-2020

Classification 2

Going Deeper into Decision Tree Induction

• Many Algorithms:
– Hunt’s Algorithm (one of the earliest)
– CART
– ID3, C4.5
– SLIQ, SPRINT

Tree Induction

• Greedy strategy.
– Split the records based on an attribute test that optimizes a certain criterion.

• Issues
– Determine how to split the records
• How to specify the attribute test condition?
• How to determine the best split?
– Determine when to stop splitting


How to Specify Test Condition?

• Depends on attribute types


– Nominal
– Ordinal
– Continuous

• Depends on number of ways to split


– 2-way split
– Multi-way split

Splitting Based on Nominal Attributes

• Multi-way split: Use as many partitions as distinct values.

Example: CarType → {Family}, {Sports}, {Luxury}

• Binary split: Divides values into two subsets.


Need to find optimal partitioning.

Example: CarType → {Sports, Luxury} vs. {Family}, OR CarType → {Family, Luxury} vs. {Sports}


Splitting Based on Ordinal Attributes


• Multi-way split: Use as many partitions as distinct values.

Example: Size → {Small}, {Medium}, {Large}
• Binary split: Divides values into two subsets.
Need to find optimal partitioning.

Example: Size → {Small, Medium} vs. {Large}, OR Size → {Small} vs. {Medium, Large}
• What about this split? Size → {Small, Large} vs. {Medium}
(Grouping non-adjacent values like this does not preserve the order of the ordinal attribute.)

Splitting Based on Continuous Attributes

• Different ways of handling


– Discretization to form an ordinal categorical attribute
• Static – discretize once at the beginning
• Dynamic – ranges can be found by equal-interval bucketing, equal-frequency bucketing (percentiles), or clustering (see the sketch after this list)

– Binary Decision: (A < v) or (A ≥ v)
• considers all possible splits and finds the best cut
• can be more compute-intensive
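As a concrete illustration of static discretization, here is a minimal Python sketch (NumPy only); the attribute values below are made up for illustration and are not from the slides.

```python
# A minimal sketch of static discretization of a continuous attribute into an
# ordinal one; the income values are illustrative.
import numpy as np

income = np.array([60, 70, 75, 85, 90, 95, 100, 120, 125, 220], dtype=float)

# Equal-interval bucketing: 4 bins of equal width over the observed range.
width_edges = np.linspace(income.min(), income.max(), num=5)
equal_width_bins = np.digitize(income, width_edges[1:-1])

# Equal-frequency bucketing (percentiles): 4 bins with roughly the same count.
freq_edges = np.quantile(income, [0.25, 0.50, 0.75])
equal_freq_bins = np.digitize(income, freq_edges)

print(equal_width_bins)  # ordinal bin label (0-3) per record
print(equal_freq_bins)
```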


Tree Induction

• Greedy strategy.
– Split the records based on an attribute test that optimizes a certain criterion.

• Issues
– Determine how to split the records
• How to specify the attribute test condition?
• How to determine the best split?
– Determine when to stop splitting


How to determine the Best Split

Before Splitting: 10 records of class 0, 10 records of class 1

Which test condition is the best?


How to determine the Best Split

• Greedy approach:
– Nodes with homogeneous/purer class distribution are preferred
• Need a measure of node impurity:

Non-homogeneous (high degree of impurity) vs. Homogeneous (low degree of impurity)


Finding the Best Split

1. Compute impurity measure (P) before splitting


2. Compute impurity measure (M) after splitting
– Compute impurity measure of each child node
– M is the weighted impurity of the children
3. Choose the attribute test condition that produces the highest gain
Gain = P – M
or equivalently, lowest impurity measure after splitting (M)


Finding the Best Split


Before Splitting: the parent node has class counts (C0: N00, C1: N01) and impurity P.

Candidate split A? (Yes / No) produces Node N1 (C0: N10, C1: N11) and Node N2 (C0: N20, C1: N21), with child impurities M11 and M12 and weighted impurity M1.

Candidate split B? (Yes / No) produces Node N3 (C0: N30, C1: N31) and Node N4 (C0: N40, C1: N41), with child impurities M21 and M22 and weighted impurity M2.

Gain = P – M1 vs. P – M2


Measures of Node Impurity


• Gini Index
GINI(t) = 1 − Σ_j [p(j|t)]²

• Entropy
Entropy(t) = − Σ_j p(j|t) log p(j|t)

• Misclassification error
Error(t) = 1 − max_i P(i|t)


Measure of Impurity: GINI


• Gini Index for a given node t :
GINI(t) = 1 − Σ_j [p(j|t)]²

(NOTE: p( j | t) is the relative frequency of class j at node t).

– Maximum (1 − 1/n_c, where n_c is the number of classes) when records are equally distributed among all classes, implying the least interesting information
– Minimum (0.0) when all records belong to one class, implying the most interesting information

C1 = 0, C2 = 6: Gini = 0.000
C1 = 1, C2 = 5: Gini = 0.278
C1 = 2, C2 = 4: Gini = 0.444
C1 = 3, C2 = 3: Gini = 0.500


Examples for computing GINI


GINI(t) = 1 − Σ_j [p(j|t)]²

C1 = 0, C2 = 6: P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
Gini = 1 − P(C1)² − P(C2)² = 1 − 0 − 1 = 0

C1 = 1, C2 = 5: P(C1) = 1/6, P(C2) = 5/6
Gini = 1 − (1/6)² − (5/6)² = 0.278

C1 = 2, C2 = 4: P(C1) = 2/6, P(C2) = 4/6
Gini = 1 − (2/6)² − (4/6)² = 0.444
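The same calculations in code: a minimal Python sketch of the node-level Gini computation above (the function name is my own).

```python
# A minimal sketch of the node-level Gini index from per-class record counts.
def gini(counts):
    n = sum(counts)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts)

print(gini([0, 6]))            # 0.0
print(round(gini([1, 5]), 3))  # 0.278
print(round(gini([2, 4]), 3))  # 0.444
```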


Splitting Based on GINI

• Used in CART, SLIQ, SPRINT.


• When a node p is split into k partitions (children), the quality of the split is computed as:

GINI_split = Σ_{i=1..k} (n_i / n) · GINI(i)

where n_i = number of records at child i, and n = number of records at node p.
• Choose the attribute that minimizes the weighted average Gini index of the children


Binary Attributes: Computing GINI Index


• Splits into two partitions
• Effect of weighting partitions: larger and purer partitions are sought.

Parent: C1 = 7, C2 = 5, Gini = 0.486

Split B? (Yes → Node N1, No → Node N2):
N1: C1 = 5, C2 = 1; Gini(N1) = 1 − (5/6)² − (1/6)² = 0.278
N2: C1 = 2, C2 = 4; Gini(N2) = 1 − (2/6)² − (4/6)² = 0.444

Weighted Gini of the split = 6/12 × 0.278 + 6/12 × 0.444 = 0.361
Gain = 0.486 − 0.361 = 0.125
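A minimal Python sketch verifying the worked example above (parent counts [7, 5] split into N1 = [5, 1] and N2 = [2, 4]); the helper mirrors the gini() sketch shown earlier.

```python
# Verifies the weighted Gini and the gain for the split above.
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

parent, n1, n2 = [7, 5], [5, 1], [2, 4]
weighted = (sum(n1) * gini(n1) + sum(n2) * gini(n2)) / sum(parent)

print(round(gini(parent), 3))             # 0.486
print(round(weighted, 3))                 # 0.361
print(round(gini(parent) - weighted, 3))  # gain = 0.125
```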


Alternative Splitting Criteria based on INFO

• Entropy at a given node t:


Entropy(t) = − Σ_j p(j|t) log p(j|t)

(NOTE: p( j | t) is the relative frequency of class j at node t).


– Measures homogeneity of a node.
• Maximum (log n_c) when records are equally distributed among all classes, implying the least information
• Minimum (0.0) when all records belong to one class, implying the most information
– Entropy-based computations are similar to the GINI index computations


Examples for computing Entropy


Entropy(t) = − Σ_j p(j|t) log₂ p(j|t)

C1 = 0, C2 = 6: P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
Entropy = − 0 log 0 − 1 log 1 = − 0 − 0 = 0

C1 = 1, C2 = 5: P(C1) = 1/6, P(C2) = 5/6
Entropy = − (1/6) log₂ (1/6) − (5/6) log₂ (5/6) = 0.65

C1 = 2, C2 = 4: P(C1) = 2/6, P(C2) = 4/6
Entropy = − (2/6) log₂ (2/6) − (4/6) log₂ (4/6) = 0.92
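A minimal Python sketch of the node-level entropy computation above (base-2 logarithm, with 0·log 0 treated as 0).

```python
# Entropy (base 2) of a node from its per-class record counts.
import math

def entropy(counts):
    n = sum(counts)
    # skip zero counts: 0 * log 0 is taken as 0
    return sum(-c / n * math.log2(c / n) for c in counts if c > 0)

print(entropy([0, 6]))            # 0.0
print(round(entropy([1, 5]), 2))  # 0.65
print(round(entropy([2, 4]), 2))  # 0.92
```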


Splitting Based on INFO...

• Information Gain:
GAIN_split = Entropy(p) − Σ_{i=1..k} (n_i / n) · Entropy(i)

Parent node p is split into k partitions; n_i is the number of records in partition i.
– Measures the reduction in entropy achieved because of the split. Choose the split that achieves the most reduction (maximizes GAIN)
– Used in ID3 and C4.5
– Disadvantage: tends to prefer splits that result in a large number of partitions, each being small but pure
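A minimal Python sketch of GAIN_split, reusing the entropy computation above; for illustration it plugs in the same parent/children counts as the earlier Gini example.

```python
# Information gain of a split from parent and child class counts.
import math

def entropy(counts):
    n = sum(counts)
    return sum(-c / n * math.log2(c / n) for c in counts if c > 0)

def information_gain(parent_counts, children_counts):
    n = sum(parent_counts)
    weighted = sum(sum(child) / n * entropy(child) for child in children_counts)
    return entropy(parent_counts) - weighted

# Parent [7, 5] split into children [5, 1] and [2, 4] (illustrative counts).
print(round(information_gain([7, 5], [[5, 1], [2, 4]]), 3))  # ~0.196
```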


Splitting Based on INFO...

• Gain Ratio:

GainRATIO_split = GAIN_split / SplitINFO

SplitINFO = − Σ_{i=1..k} (n_i / n) log (n_i / n)

Parent node p is split into k partitions; n_i is the number of records in partition i.

– Adjusts Information Gain by the entropy of the partitioning (SplitINFO). Higher-entropy partitioning (a large number of small partitions) is penalized!
– Used in C4.5
– Designed to overcome the disadvantage of Information Gain
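A minimal Python sketch of SplitINFO and the gain ratio; the gain value plugged in below is illustrative, not taken from the slides.

```python
# SplitINFO and Gain Ratio from the sizes of the child partitions.
import math

def split_info(child_sizes):
    n = sum(child_sizes)
    return sum(-ni / n * math.log2(ni / n) for ni in child_sizes if ni > 0)

def gain_ratio(gain, child_sizes):
    return gain / split_info(child_sizes)

# A 2-way split into equal halves has SplitINFO = 1, so the ratio equals the gain;
# a 4-way split into equal quarters has SplitINFO = 2, halving the same gain.
print(gain_ratio(0.196, [6, 6]))        # 0.196
print(gain_ratio(0.196, [3, 3, 3, 3]))  # 0.098
```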


Splitting Criteria based on Classification Error

• Classification error at a node t :

Error(t) = 1 − max_i P(i|t)

• Measures misclassification error made by a node.


• Maximum (1 − 1/n_c) when records are equally distributed among all classes, implying the least interesting information
• Minimum (0.0) when all records belong to one class, implying the most interesting information


Examples for Computing Error


Error(t) = 1 − max_i P(i|t)

C1 = 0, C2 = 6: P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
Error = 1 − max(0, 1) = 1 − 1 = 0

C1 = 1, C2 = 5: P(C1) = 1/6, P(C2) = 5/6
Error = 1 − max(1/6, 5/6) = 1 − 5/6 = 1/6

C1 = 2, C2 = 4: P(C1) = 2/6, P(C2) = 4/6
Error = 1 − max(2/6, 4/6) = 1 − 4/6 = 1/3
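A minimal Python sketch of the classification error computation above.

```python
# Classification error of a node from its per-class record counts.
def classification_error(counts):
    return 1.0 - max(counts) / sum(counts)

print(classification_error([0, 6]))            # 0.0
print(round(classification_error([1, 5]), 3))  # 1/6 ~ 0.167
print(round(classification_error([2, 4]), 3))  # 1/3 ~ 0.333
```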


Tree Induction

• Greedy strategy.
– Split the records based on an attribute test that optimizes a certain criterion.

• Issues
– Determine how to split the records
• How to specify the attribute test condition?
• How to determine the best split?
– Determine when to stop splitting


Stopping Criteria for Tree Induction

• Stop expanding a node when all the records belong to the same class

• Stop expanding a node when all the records have similar attribute values

• Early termination (to be discussed later)


Handling Missing Attribute Values

• Missing values affect decision tree construction in three different ways:


– Affects how impurity measures are computed
– Affects how an instance with a missing value is distributed to the child nodes
– Affects how a test instance with a missing value is classified


Decision Trees: Example 2 in RapidMiner


Problem: Prospect filtering – identify which prospects to extend credit to.
Dataset: the German Credit dataset from the University of California, Irvine Machine Learning (UCI-ML) data repository.


Steps


Step 1: Data Preparation


• The raw data consists of 1000 samples and a total of 20 attributes and 1 label (target) attribute.
• There are seven numeric attributes, and the rest are categorical or qualitative, including the label, which is a binominal variable.
• The label attribute is called Credit Rating and can take the value of 1 (good) or 2 (bad).
• In the data, 70% of the samples fall into the good credit rating class.
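A hedged Python sketch of a quick sanity check on these numbers, assuming the German Credit data has been exported to a CSV file; the file name and column name below are illustrative and not part of the RapidMiner process.

```python
import pandas as pd

# Illustrative file/column names; adjust to however the UCI data was exported.
df = pd.read_csv("german_credit.csv")

print(df.shape)  # expect (1000, 21): 20 attributes plus the label
print(df["Credit Rating"].value_counts(normalize=True))  # expect ~70% in class 1 (good)
```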


Step 1: Data Preparation (continued)

The Credit Rating label is recoded as a binominal attribute: 1 → TRUE (good), 2 → FALSE (bad).


Step 2: Divide dataset


As with all supervised model building, data must be separated into
two sets:
1. one for training or developing an acceptable model
2. and the other for testing or ensuring that the model would work
equally well on a different dataset.

The standard practice is to split the available data into a training


set and a testing set.

Typically, the training set contains 70% - 90% of the original data.
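The slides perform this split with RapidMiner's Split Validation operator; a rough Python equivalent (scikit-learn, reusing the DataFrame from the earlier sketch) might look like this.

```python
from sklearn.model_selection import train_test_split

X = df.drop(columns=["Credit Rating"])
y = df["Credit Rating"]

# 70% training / 30% testing, stratified to preserve the 70/30 class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, stratify=y, random_state=42
)
```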


Step 3: Modeling Operator and Parameters


The Validation operator allows one to build a model and apply it to validation data in the same step. This means that two operations - model building and model evaluation - must be configured using the same operator.


Step 4: Configuring the Decision Tree Model


The parameters of the Decision Tree and Performance operators must be set.
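For comparison, a hedged scikit-learn sketch of configuring and training an analogous tree; the parameter values are illustrative and do not correspond to specific RapidMiner settings.

```python
from sklearn.tree import DecisionTreeClassifier

# Assumes categorical attributes were numerically (e.g., one-hot) encoded beforehand.
tree = DecisionTreeClassifier(
    criterion="gini",      # impurity measure; "entropy" would use information gain
    max_depth=5,           # illustrative depth limit for interpretability
    min_samples_leaf=10,   # illustrative minimum leaf size
)
tree.fit(X_train, y_train)  # X_train / y_train from the split sketch above
```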



Step 5: Process Execution and Interpretation

accuracy: 67.00%
AUC = 0.721
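The accuracy and AUC above are read off RapidMiner's Performance operator; a hedged sketch of the same measurements in scikit-learn follows (re-running will not necessarily reproduce 67.00% / 0.721).

```python
from sklearn.metrics import accuracy_score, roc_auc_score

y_pred = tree.predict(X_test)
y_score = tree.predict_proba(X_test)[:, 1]  # score for tree.classes_[1] (the "greater" label)

print("accuracy:", accuracy_score(y_test, y_pred))
print("AUC:", roc_auc_score(y_test, y_score))
```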


Decision Tree Based Classification


Advantages:
– Inexpensive to construct
– Extremely fast at classifying unknown records
– Easy to interpret for small-sized trees
– Robust to noise (especially when methods to avoid overfitting are employed)
– Can easily handle redundant or irrelevant attributes (unless the attributes are interacting)

Disadvantages:
– Space of possible decision trees is exponentially large. Greedy approaches are often unable to find the best tree.
– Does not take into account interactions between attributes
– Each decision boundary involves only a single attribute
