
Classification: Basic Concepts and Decision Trees

Classification: Definition
 Given a collection of records (training set).
 Each record contains a set of attributes; one of the attributes is the class.
 Find a model for the class attribute as a function of the values of the other attributes.
 Goal: previously unseen records should be assigned a class as accurately as possible.
 A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
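
For readers who want to try this workflow end to end, here is a minimal sketch. It assumes scikit-learn and a bundled labeled dataset purely for illustration; any attribute matrix X and class vector y would do. This is not part of the original slides.

  # Sketch of the train/validate workflow described above (assumes scikit-learn).
  from sklearn.datasets import load_breast_cancer
  from sklearn.model_selection import train_test_split
  from sklearn.tree import DecisionTreeClassifier
  from sklearn.metrics import accuracy_score

  X, y = load_breast_cancer(return_X_y=True)      # any labeled dataset works here

  # Divide the records into a training set (to build the model)
  # and a test set (to validate it).
  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

  model = DecisionTreeClassifier()                # learning algorithm
  model.fit(X_train, y_train)                     # induction: learn the model

  y_pred = model.predict(X_test)                  # deduction: apply the model to unseen records
  print("Test accuracy:", accuracy_score(y_test, y_pred))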
Illustrating Classification Task

Training Set (class labels known):

 Tid  Attrib1  Attrib2  Attrib3  Class
 1    Yes      Large    125K     No
 2    No       Medium   100K     No
 3    No       Small    70K      No
 4    Yes      Medium   120K     No
 5    No       Large    95K      Yes
 6    No       Medium   60K      No
 7    Yes      Large    220K     No
 8    No       Small    85K      Yes
 9    No       Medium   75K      No
 10   No       Small    90K      Yes

Test Set (class unknown):

 Tid  Attrib1  Attrib2  Attrib3  Class
 11   No       Small    55K      ?
 12   Yes      Medium   80K      ?
 13   Yes      Large    110K     ?
 14   No       Small    95K      ?
 15   No       Large    67K      ?

Workflow: the training set is fed to a learning algorithm, which learns a model (induction); the model is then applied to the test set to predict the unknown classes (deduction).
Examples of Classification Task
 Predicting tumor cells as benign or malignant
 Classifying credit card transactions as legitimate or fraudulent
 Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil
 Categorizing news stories as finance, weather, entertainment, sports, etc.
Example of a Decision Tree

Training Data:

 Tid  Refund  Marital Status  Taxable Income  Cheat
 1    Yes     Single          125K            No
 2    No      Married         100K            No
 3    No      Single          70K             No
 4    Yes     Married         120K            No
 5    No      Divorced        95K             Yes
 6    No      Married         60K             No
 7    Yes     Divorced        220K            No
 8    No      Single          85K             Yes
 9    No      Married         75K             No
 10   No      Single          90K             Yes

Model: Decision Tree (splitting attributes shown at the internal nodes):

 Refund?
  Yes -> NO
  No  -> MarSt?
          Single, Divorced -> TaxInc?
                               < 80K -> NO
                               > 80K -> YES
          Married          -> NO
Another Example of Decision Tree

Built from the same training data as before:

 MarSt?
  Married          -> NO
  Single, Divorced -> Refund?
                       Yes -> NO
                       No  -> TaxInc?
                               < 80K -> NO
                               > 80K -> YES

There could be more than one tree that fits the same data!
Decision Tree Classification Task

Same workflow as before, with a decision tree as the model: the training set (Tids 1-10) is fed to a tree induction algorithm, which learns a decision tree (induction); the tree is then applied to the test set (Tids 11-15) to predict the unknown classes (deduction).
Apply Model to Test Data

Test record:

 Refund  Marital Status  Taxable Income  Cheat
 No      Married         80K             ?

Start from the root of the tree and follow the branches that match the record:

 Refund?                  Refund = No, so take the "No" branch
  Yes -> NO
  No  -> MarSt?           Marital Status = Married, so take the "Married" branch
          Single, Divorced -> TaxInc?
                               < 80K -> NO
                               > 80K -> YES
          Married          -> NO

The record reaches the leaf labeled NO: assign Cheat to "No".
Confusion Matrix
 A confusion matrix (Kohavi and Provost, 1998) contains information about actual and predicted classifications done by a classification system. The performance of such systems is commonly evaluated using the data in the matrix.
Confusion matrix
 a is the number of correct predictions that an instance is negative,
 b is the number of incorrect predictions that an instance is positive,
 c is the number of incorrect predictions that an instance is negative, and
 d is the number of correct predictions that an instance is positive.
Confusion Matrix

                     Predicted
                     Negative   Positive
 Actual   Negative   a          b
          Positive   c          d
Confusion matrix
 Several standard terms have been defined for the 2-class matrix:
 The accuracy (AC) is the proportion of the total number of predictions that were correct:
      AC = (a + d) / (a + b + c + d)
 The recall or true positive rate (TP) is the proportion of positive cases that were correctly identified:
      TP = d / (c + d)
 The false positive rate (FP) is the proportion of negative cases that were incorrectly classified as positive:
      FP = b / (a + b)
 The true negative rate (TN) is the proportion of negative cases that were classified correctly:
      TN = a / (a + b)
 The false negative rate (FN) is the proportion of positive cases that were incorrectly classified as negative:
      FN = c / (c + d)
 Finally, precision (P) is the proportion of the predicted positive cases that were correct:
      P = d / (b + d)
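
A short sketch of these metrics, using the a/b/c/d cell names from the matrix above (the function name is illustrative; the sketch assumes no denominator is zero):

  # Confusion-matrix metrics for the 2-class case (rows = actual, columns = predicted).
  def confusion_metrics(a, b, c, d):
      total = a + b + c + d
      return {
          "accuracy":            (a + d) / total,   # AC
          "recall_tp_rate":      d / (c + d),       # TP
          "false_positive_rate": b / (a + b),       # FP
          "true_negative_rate":  a / (a + b),       # TN
          "false_negative_rate": c / (c + d),       # FN
          "precision":           d / (b + d),       # P
      }

  # Example: 50 true negatives, 5 false positives, 10 false negatives, 35 true positives.
  print(confusion_metrics(a=50, b=5, c=10, d=35))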
Decision Tree Induction
 Many Algorithms:
 Hunt’s Algorithm (one of the earliest)
 SLIQ
General Structure of Hunt’s Algorithm
 Let Dt be the set of training records that reach a node t
 General Procedure:
  If Dt contains records that all belong to the same class yt, then t is a leaf node labeled as yt
  If Dt is an empty set, then t is a leaf node labeled by the default class, yd
  If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets. Recursively apply the procedure to each subset.

Illustrated on the training data (Tid, Refund, Marital Status, Taxable Income, Cheat):

 1   Yes  Single    125K  No
 2   No   Married   100K  No
 3   No   Single    70K   No
 4   Yes  Married   120K  No
 5   No   Divorced  95K   Yes
 6   No   Married   60K   No
 7   Yes  Divorced  220K  No
 8   No   Single    85K   Yes
 9   No   Married   75K   No
 10  No   Single    90K   Yes
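
A minimal Python sketch of this recursive structure is given below. The dict-based records, the choose_split callback, and the majority-class fallback are illustrative assumptions, not part of the slides.

  # Sketch of Hunt's general procedure. `records` is a list of dicts, `target`
  # is the class attribute, and `choose_split(records, target)` returns either
  # None or (attribute_test, {branch_label: subset_of_records}).
  from collections import Counter

  def hunts(records, target, choose_split, default_class):
      if not records:                                 # empty D_t -> leaf labeled with default class y_d
          return {"leaf": default_class}
      classes = {r[target] for r in records}
      if len(classes) == 1:                           # all records in the same class y_t -> leaf
          return {"leaf": classes.pop()}
      majority = Counter(r[target] for r in records).most_common(1)[0][0]
      split = choose_split(records, target)           # attribute test chosen by some criterion (e.g. Gini)
      if split is None:                               # no useful test left -> majority-class leaf
          return {"leaf": majority}
      test, partitions = split
      return {"test": test,
              "children": {branch: hunts(subset, target, choose_split, majority)
                           for branch, subset in partitions.items()}}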
Hunt’s Algorithm

Applying the procedure to the training data above (Cheat is the class; "Don’t Cheat" is the default class):

 Step 1: all records in a single node
   Don’t Cheat

 Step 2: split on Refund
   Refund?
    Yes -> Don’t Cheat
    No  -> Don’t Cheat

 Step 3: split the Refund = No subset on Marital Status
   Refund?
    Yes -> Don’t Cheat
    No  -> Marital Status?
            Single, Divorced -> Cheat
            Married          -> Don’t Cheat

 Step 4: split the Single/Divorced subset on Taxable Income
   Refund?
    Yes -> Don’t Cheat
    No  -> Marital Status?
            Single, Divorced -> Taxable Income?
                                 < 80K  -> Don’t Cheat
                                 >= 80K -> Cheat
            Married          -> Don’t Cheat
Tree Induction
 Greedy strategy.
  Split the records based on an attribute test that optimizes a certain criterion.

 Issues
  Determine how to split the records
   How to specify the attribute test condition?
   How to determine the best split?
  Determine when to stop splitting
How to Specify Test Condition?
 Depends on attribute types
  Nominal
  Ordinal
  Continuous

 Depends on the number of ways to split
  2-way split
  Multi-way split
Splitting Based on Nominal Attributes
 Multi-way split: use as many partitions as there are distinct values.
      CarType: Family | Sports | Luxury

 Binary split: divides the values into two subsets; need to find the optimal partitioning.
      CarType: {Sports, Luxury} | {Family}    OR    CarType: {Family, Luxury} | {Sports}
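
Finding the optimal binary partitioning means trying the candidate subsets. A tiny sketch of enumerating them (for k distinct values there are 2^(k-1) - 1 non-trivial two-way splits); the function name is illustrative:

  # Enumerate the candidate binary partitions of a nominal attribute.
  from itertools import combinations

  def binary_partitions(values):
      values = sorted(values)
      anchor, rest = values[0], values[1:]     # fix one value to avoid mirrored duplicates
      for r in range(len(rest) + 1):
          for combo in combinations(rest, r):
              left = {anchor, *combo}
              right = set(values) - left
              if right:                        # skip the trivial "everything on one side" split
                  yield left, right

  for left, right in binary_partitions({"Family", "Sports", "Luxury"}):
      print(left, "vs", right)                 # 3 candidate splits for 3 values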
Splitting Based on Ordinal Attributes
 Multi-way split: use as many partitions as there are distinct values.
      Size: Small | Medium | Large

 Binary split: divides the values into two subsets; need to find the optimal partitioning.
      Size: {Small, Medium} | {Large}    OR    Size: {Medium, Large} | {Small}

 What about this split?  Size: {Small, Large} | {Medium}
      (This grouping violates the order property of an ordinal attribute.)
Splitting Based on Continuous Attributes
 Different ways of handling
  Discretization to form an ordinal categorical attribute
   Static – discretize once at the beginning
   Dynamic – ranges can be found by equal interval bucketing, equal frequency bucketing (percentiles), or clustering.

  Binary decision: (A < v) or (A >= v)
   Consider all possible splits and find the best one
Splitting Based on Continuous Attributes

 (i) Binary split:     Taxable Income > 80K?  (Yes / No)
 (ii) Multi-way split: Taxable Income?  (< 10K, [10K,25K), [25K,50K), [50K,80K), > 80K)
Tree Induction
 Greedy strategy.
  Split the records based on an attribute test that optimizes a certain criterion.

 Issues
  Determine how to split the records
   How to specify the attribute test condition?
   How to determine the best split?
  Determine when to stop splitting
How to determine the Best Split

Before splitting: 10 records of class 0 and 10 records of class 1.

Candidate attribute tests:
 Own Car?  (Yes / No)                 child class counts C0:6, C1:4 and C0:4, C1:6
 Car Type?  (Family / Sports / Luxury)  child class counts C0:1, C1:3 / C0:8, C1:0 / C0:1, C1:7
 Student ID?  (c1 ... c20)            one pure, single-record child per ID

Which test condition is the best?
How to determine the Best Split
 Greedy approach:
  Nodes with a homogeneous class distribution are preferred
 Need a measure of node impurity:

   C0: 5, C1: 5  -> non-homogeneous, high degree of impurity
   C0: 9, C1: 1  -> homogeneous, low degree of impurity
Measures of Node Impurity
 Gini Index

 Entropy

 Misclassification error
How to Find the Best Split

Before splitting: class counts C0 = N00, C1 = N01; impurity = M0.

Candidate split A? (Yes/No) produces nodes N1 and N2 with counts (N10, N11) and (N20, N21) and impurities M1 and M2; their weighted combination is M12. Candidate split B? produces nodes N3 and N4 with impurities M3 and M4, combined into M34.

 Gain = M0 – M12 (for split A)  vs  M0 – M34 (for split B)
Measure of Impurity: GINI
 Gini Index for a given node t:

      GINI(t) = 1 − Σ_j [ p(j | t) ]²

   (NOTE: p(j | t) is the relative frequency of class j at node t.)

 Maximum (1 − 1/nc) when records are equally distributed among all classes, implying least interesting information
 Minimum (0.0) when all records belong to one class, implying most interesting information

   C1: 0        C1: 1        C1: 2        C1: 3
   C2: 6        C2: 5        C2: 4        C2: 3
   Gini=0.000   Gini=0.278   Gini=0.444   Gini=0.500
Examples for computing GINI

      GINI(t) = 1 − Σ_j [ p(j | t) ]²

 C1: 0, C2: 6   P(C1) = 0/6 = 0,  P(C2) = 6/6 = 1
                Gini = 1 − P(C1)² − P(C2)² = 1 − 0 − 1 = 0

 C1: 1, C2: 5   P(C1) = 1/6,  P(C2) = 5/6
                Gini = 1 − (1/6)² − (5/6)² = 0.278

 C1: 2, C2: 4   P(C1) = 2/6,  P(C2) = 4/6
                Gini = 1 − (2/6)² − (4/6)² = 0.444
Splitting Based on GINI
 Used in CART, SLIQ, SPRINT.
 When a node p is split into k partitions (children), the quality of the split is computed as

      GINI_split = Σ_{i=1..k} (n_i / n) · GINI(i)

   where n_i = number of records at child i, and n = number of records at node p.
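
A short sketch of these two formulas (function names are illustrative):

  # Node impurity from class counts, and the weighted impurity GINI_split of a candidate split.
  def gini(counts):
      """counts: list of class counts at a node, e.g. [1, 5]."""
      n = sum(counts)
      if n == 0:
          return 0.0
      return 1.0 - sum((c / n) ** 2 for c in counts)

  def gini_split(children_counts):
      """children_counts: one count list per child, e.g. [[5, 2], [1, 4]]."""
      n = sum(sum(child) for child in children_counts)
      return sum(sum(child) / n * gini(child) for child in children_counts)

  print(gini([0, 6]), gini([1, 5]), gini([2, 4]), gini([3, 3]))  # 0.0, 0.278, 0.444, 0.5
  print(gini_split([[5, 2], [1, 4]]))                            # weighted impurity of a two-way split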
Binary Attributes: Computing GINI Index
 Splits into two partitions
 Effect of weighing partitions:
  Larger and purer partitions are sought for.

 Parent: C1 = 6, C2 = 6, Gini = 0.500

 Split B? (Yes -> Node N1, No -> Node N2):

           N1   N2
      C1   5    1
      C2   2    4

 Gini(N1) = 1 − (5/7)² − (2/7)² = 0.408
 Gini(N2) = 1 − (1/5)² − (4/5)² = 0.320
 Gini(Children) = 7/12 × 0.408 + 5/12 × 0.320 = 0.371
Categorical Attributes: Computing Gini Index
 For each distinct value, gather counts for each class in the dataset
 Use the count matrix to make decisions

 Multi-way split:
      CarType   Family  Sports  Luxury
      C1        1       2       1
      C2        4       1       1
      Gini      0.393

 Two-way split (find the best partition of values):
      CarType   {Sports, Luxury}  {Family}
      C1        3                 1
      C2        2                 4
      Gini      0.400

      CarType   {Family, Luxury}  {Sports}
      C1        2                 2
      C2        1                 5
      Gini      0.419
Continuous Attributes: Computing Gini Index
 Use binary decisions based on one value, e.g. Taxable Income > 80K? (Yes / No), illustrated on the training data above
 Several choices for the splitting value
  Number of possible splitting values = number of distinct values
 Each splitting value v has a count matrix associated with it
  Class counts in each of the partitions, A < v and A >= v
 Simple method to choose the best v
  For each v, scan the database to gather the count matrix and compute its Gini index
  Computationally inefficient! Repetition of work.
Continuous Attributes: Computing Gini Index...
 For efficient computation: for each attribute,
  Sort the attribute on values
  Linearly scan these values, each time updating the count matrix and computing the Gini index
  Choose the split position that has the least Gini index

 Example (Taxable Income, class = Cheat):

  Sorted values:   60    70    75    85    90    95    100   120   125   220
  Class (Cheat):   No    No    No    Yes   Yes   Yes   No    No    No    No

  Split positions: 55    65    72    80    87    92    97    110   122   172   230
  Yes (<= | >):    0|3   0|3   0|3   0|3   1|2   2|1   3|0   3|0   3|0   3|0   3|0
  No  (<= | >):    0|7   1|6   2|5   3|4   3|4   3|4   3|4   4|3   5|2   6|1   7|0
  Gini:            0.420 0.400 0.375 0.343 0.417 0.400 0.300 0.343 0.375 0.400 0.420

  The best split is Taxable Income <= 97, with Gini = 0.300.
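
The sort-and-scan procedure can be written directly. A minimal sketch (names are illustrative, not from the slides); candidate thresholds are midpoints between adjacent sorted values, matching the split positions in the table:

  # Best binary split for one continuous attribute via one linear scan of sorted values.
  def best_threshold(values, labels):
      pairs = sorted(zip(values, labels))
      classes = sorted(set(labels))
      left = {c: 0 for c in classes}                      # counts for A <= threshold
      right = {c: labels.count(c) for c in classes}       # counts for A >  threshold

      def gini(counts):
          n = sum(counts.values())
          return 0.0 if n == 0 else 1.0 - sum((v / n) ** 2 for v in counts.values())

      best = (float("inf"), None)                         # (weighted Gini, threshold)
      for i in range(len(pairs) - 1):
          value, cls = pairs[i]
          left[cls] += 1                                  # move one record from right to left
          right[cls] -= 1
          if value == pairs[i + 1][0]:
              continue                                    # no valid split between equal values
          threshold = (value + pairs[i + 1][0]) / 2
          n_left, n_right, n = i + 1, len(pairs) - i - 1, len(pairs)
          weighted = n_left / n * gini(left) + n_right / n * gini(right)
          best = min(best, (weighted, threshold))
      return best

  income = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
  cheat  = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]
  print(best_threshold(income, cheat))    # expect a split near 97 with Gini = 0.300, as in the table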
Alternative Splitting Criteria based on INFO
 Entropy at a given node t:

      Entropy(t) = − Σ_j p(j | t) log p(j | t)

   (NOTE: p(j | t) is the relative frequency of class j at node t.)

 Measures the homogeneity of a node.
  Maximum (log nc) when records are equally distributed among all classes, implying least information
  Minimum (0.0) when all records belong to one class, implying most information
 Entropy-based computations are similar to the GINI index computations
Examples for computing Entropy

      Entropy(t) = − Σ_j p(j | t) log2 p(j | t)

 C1: 0, C2: 6   P(C1) = 0/6 = 0,  P(C2) = 6/6 = 1
                Entropy = − 0 log 0 − 1 log 1 = − 0 − 0 = 0

 C1: 1, C2: 5   P(C1) = 1/6,  P(C2) = 5/6
                Entropy = − (1/6) log2(1/6) − (5/6) log2(5/6) = 0.65

 C1: 2, C2: 4   P(C1) = 2/6,  P(C2) = 4/6
                Entropy = − (2/6) log2(2/6) − (4/6) log2(4/6) = 0.92
Splitting Based on INFO...
 Information Gain:

      GAIN_split = Entropy(p) − Σ_{i=1..k} (n_i / n) · Entropy(i)

   Parent node p is split into k partitions; n_i is the number of records in partition i
 Measures the reduction in entropy achieved because of the split. Choose the split that achieves the most reduction (maximizes GAIN)
 Used in ID3 and C4.5
 Disadvantage: tends to prefer splits that result in a large number of partitions, each being small but pure.
Splitting Based on INFO...
 Gain Ratio:

      GainRATIO_split = GAIN_split / SplitINFO,   where  SplitINFO = − Σ_{i=1..k} (n_i / n) log (n_i / n)

   Parent node p is split into k partitions; n_i is the number of records in partition i
 Adjusts Information Gain by the entropy of the partitioning (SplitINFO). Higher-entropy partitioning (a large number of small partitions) is penalized!
 Used in C4.5
 Designed to overcome the disadvantage of Information Gain
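
A compact sketch of the INFO-based criteria above (function names are illustrative; the sketch assumes at least two non-empty partitions):

  from math import log2

  def entropy(counts):
      n = sum(counts)
      return -sum(c / n * log2(c / n) for c in counts if c > 0)

  def information_gain(parent_counts, children_counts):
      n = sum(parent_counts)
      weighted = sum(sum(child) / n * entropy(child) for child in children_counts)
      return entropy(parent_counts) - weighted

  def gain_ratio(parent_counts, children_counts):
      n = sum(parent_counts)
      split_info = -sum(sum(c) / n * log2(sum(c) / n) for c in children_counts if sum(c) > 0)
      return information_gain(parent_counts, children_counts) / split_info

  parent = [6, 6]                   # e.g. 6 records of C1 and 6 of C2
  children = [[5, 2], [1, 4]]       # a candidate two-way split
  print(information_gain(parent, children), gain_ratio(parent, children))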
Splitting Criteria based on Classification Error
 Classification error at a node t:

      Error(t) = 1 − max_i P(i | t)

 Measures the misclassification error made by a node.
  Maximum (1 − 1/nc) when records are equally distributed among all classes, implying least interesting information
  Minimum (0.0) when all records belong to one class, implying most interesting information
Examples for Computing Error

      Error(t) = 1 − max_i P(i | t)

 C1: 0, C2: 6   P(C1) = 0/6 = 0,  P(C2) = 6/6 = 1
                Error = 1 − max(0, 1) = 1 − 1 = 0

 C1: 1, C2: 5   P(C1) = 1/6,  P(C2) = 5/6
                Error = 1 − max(1/6, 5/6) = 1 − 5/6 = 1/6

 C1: 2, C2: 4   P(C1) = 2/6,  P(C2) = 4/6
                Error = 1 − max(2/6, 4/6) = 1 − 4/6 = 1/3
Comparison among Splitting Criteria

For a 2-class problem: Misclassification Error vs Gini

 Parent: C1 = 7, C2 = 3, Gini = 0.42

 Split A? (Yes -> Node N1, No -> Node N2):

           N1   N2
      C1   3    4
      C2   0    3

 Gini(N1) = 1 − (3/3)² − (0/3)² = 0
 Gini(N2) = 1 − (4/7)² − (3/7)² = 0.489
 Gini(Children) = 3/10 × 0 + 7/10 × 0.489 = 0.342

 The Gini of the children (0.342) is lower than the parent's (0.42), so Gini favors this split; the misclassification error, by contrast, is 3/10 both before and after the split.
Tree Induction
 Greedy strategy.
  Split the records based on an attribute test that optimizes a certain criterion.

 Issues
  Determine how to split the records
   How to specify the attribute test condition?
   How to determine the best split?
  Determine when to stop splitting
Stopping Criteria for Tree Induction
 Stop expanding a node when all the records belong to the same class

 Stop expanding a node when all the records have similar attribute values

 Early termination (to be discussed later)
Why Decision Tree Model?
 Relatively fast compared to other classification models
 Obtains similar, and sometimes better, accuracy than other models
 Simple and easy to understand
 Can be converted into simple and easy-to-understand classification rules
A Decision Tree

 Age < 25?
  Yes -> High
  No  -> Car Type in {sports}?
          Yes -> High
          No  -> Low
Decision Tree Classification
 A decision tree is created in two phases:
 Tree Building Phase
 Repeatedly partition the training data until all
the examples in each partition belong to one
class or the partition is sufficiently small
 Tree Pruning Phase
 Remove dependency on statistical noise or
variation that may be particular only to the
training set

Tree Building Phase
 General tree-growth algorithm (binary tree):

   Partition(Data S)
     if (all points in S are of the same class) then
       return;
     for each attribute A do
       evaluate splits on attribute A;
     use the best split to partition S into S1 and S2;
     Partition(S1);
     Partition(S2);
Tree Building Phase (cont.)
 The form of the split depends on the type of the attribute
 Splits for numeric attributes are of the form A <= v, where v is a real number
 Splits for categorical attributes are of the form A ∈ S’, where S’ is a subset of all possible values of A
Splitting Index
 Alternative splits for an attribute are compared using a splitting index
 Examples of splitting indices:
  Entropy:  entropy(T) = − Σ_j p_j × log2(p_j)
  Gini Index:  gini(T) = 1 − Σ_j p_j²
   (p_j is the relative frequency of class j in T)
The Best Split
 Suppose the splitting index is I(), and a split partitions S into S1 and S2
 The best split is the split that maximizes the following value:

      I(S) − ( |S1|/|S| × I(S1) + |S2|/|S| × I(S2) )
Tree Pruning Phase
 Examine the initial tree built
 Choose the subtree with the least estimated error rate
 Two approaches for error estimation:
  Use the original training dataset (e.g. cross-validation)
  Use an independent dataset
SLIQ - Overview
 Capable of classifying disk-resident datasets
 Scalable for large datasets
 Uses a pre-sorting technique to reduce the cost of evaluating numeric attributes
 Uses a breadth-first tree-growing strategy
 Uses an inexpensive tree-pruning algorithm based on the Minimum Description Length (MDL) principle
Data Structure
 A list (class list) for the class label
  Each entry has two fields: the class label and a reference to a leaf node of the decision tree
  Memory-resident
 A list for each attribute
  Each entry has two fields: the attribute value and an index into the class list
  Written to disk if necessary
An illustration of the Data Structure

 Age attribute list          Car Type attribute list      Class list
 (value, class list index)   (value, class list index)    (index: class, leaf)
 23, 1                       Family, 1                    1: High, N1
 17, 2                       Sports, 2                    2: High, N1
 43, 3                       Sports, 3                    3: High, N1
 68, 4                       Family, 4                    4: Low,  N1
 32, 5                       Truck, 5                     5: Low,  N1
 20, 6                       Family, 6                    6: High, N1
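
An in-memory Python sketch of these lists (plain lists and dicts used for illustration; disk residency and the tree itself are omitted):

  # One attribute list per attribute (value + class-list index), one class list
  # (class label + leaf reference), as illustrated above.
  class_list = {               # class-list index -> {class label, leaf reference}
      1: {"label": "High", "leaf": "N1"},
      2: {"label": "High", "leaf": "N1"},
      3: {"label": "High", "leaf": "N1"},
      4: {"label": "Low",  "leaf": "N1"},
      5: {"label": "Low",  "leaf": "N1"},
      6: {"label": "High", "leaf": "N1"},
  }

  age_list = [(23, 1), (17, 2), (43, 3), (68, 4), (32, 5), (20, 6)]   # (value, class-list index)
  car_list = [("Family", 1), ("Sports", 2), ("Sports", 3),
              ("Family", 4), ("Truck", 5), ("Family", 6)]

  # Pre-sorting (next slide): numeric attribute lists are sorted by value once,
  # before tree building; the class list is reached through the stored index.
  age_list.sort(key=lambda entry: entry[0])
  print(age_list)              # [(17, 2), (20, 6), (23, 1), (32, 5), (43, 3), (68, 4)]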
Pre-sorting
 Sorting of data is required to find the splits for numeric attributes
 Previous algorithms sort the data at every node in the tree
 Using the separate-list data structure, SLIQ sorts the data only once, at the beginning of the tree-building phase
After Pre-sorting

 Age attribute list (sorted)  Car Type attribute list      Class list
 (value, class list index)    (value, class list index)    (index: class, leaf)
 17, 2                        Family, 1                    1: High, N1
 20, 6                        Sports, 2                    2: High, N1
 23, 1                        Sports, 3                    3: High, N1
 32, 5                        Family, 4                    4: Low,  N1
 43, 3                        Truck, 5                     5: Low,  N1
 68, 4                        Family, 6                    6: High, N1
Node Split
 SLIQ uses a breadth-first tree-growing strategy
  In one pass over the data, splits for all the leaves of the current tree can be evaluated
 SLIQ uses the gini splitting index to evaluate splits
  The frequency distribution of class values in the data partitions is required
Class Histogram
 A class histogram is used to keep the
frequency distribution of class values
for each attribute in each leaf node
 For numeric attributes, the class
histogram is a list of <class, frequency>
 For categorical attributes, the class
histogram is a list of <attribute value,
class, frequency>

Evaluate Split

   for each attribute A
     traverse the attribute list of A
     for each value v in the attribute list
       find the corresponding class and leaf node l
       update the class histogram in the leaf l
       if A is a numeric attribute then
         compute the splitting index for the test (A <= v) for leaf l
     if A is a categorical attribute then
       for each leaf of the tree do
         find the subset of A with the best split
Subsetting for Categorical Attributes

   if the cardinality of S is less than a threshold then
     all of the subsets of S are evaluated
   else
     start with an empty subset S’
     repeat
       add to S’ the element of S that gives the best split
     until there is no improvement
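
A small Python sketch of the greedy branch above. The evaluate(subset) callback, which returns the splitting-index value of the test "A in subset" (higher = better here), is an illustrative assumption:

  def greedy_subset(values, evaluate):
      remaining = set(values)
      chosen, best_score = set(), float("-inf")
      while remaining:
          # pick the element whose addition gives the best split
          candidate, score = max(((v, evaluate(chosen | {v})) for v in remaining),
                                 key=lambda pair: pair[1])
          if score <= best_score:
              break                        # no improvement: stop growing the subset
          chosen.add(candidate)
          remaining.remove(candidate)
          best_score = score
      return chosen

  # Toy usage with a made-up scoring function that prefers singleton subsets:
  print(greedy_subset({"Family", "Sports", "Luxury"}, evaluate=lambda s: -abs(len(s) - 1)))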
Partition the data
 Partitioning can be done by updating the leaf reference of each entry in the class list
 Algorithm:

   for each attribute A used in a split
     traverse the attribute list of A
     for each value v in the list
       find the corresponding class label and leaf l
       find the new node, n, to which v belongs
         by applying the splitting test at l
       update the leaf reference to n
Example of Evaluating Splits

 Sorted Age list     Class list
 (value, index)      (index: class, leaf)
 17, 2               1: High, N1
 20, 6               2: High, N1
 23, 1               3: High, N1
 32, 5               4: Low,  N1
 43, 3               5: Low,  N1
 68, 4               6: High, N1

 Class histograms for leaf N1 (L = records satisfying the test, R = the rest):

 Initial histogram:           High  Low
                         L    0     0
                         R    4     2

 Evaluate split (Age <= 17):  High  Low
                         L    1     0
                         R    3     2

 Evaluate split (Age <= 32):  High  Low
                         L    3     1
                         R    1     1
Example of Updating Class List

 Node N1 is split on (Age <= 23) into new children N2 and N3.

 Sorted Age list     Class list (leaf references being updated)
 (value, index)      (index: class, leaf)
 17, 2               1: High, N2
 20, 6               2: High, N2
 23, 1               3: High, N1
 32, 5               4: Low,  N1
 43, 3               5: Low,  N1
 68, 4               6: High, N2

 Entries 1, 2 and 6 (the records with Age <= 23) now reference N2; the remaining entries are updated to N3 (the new value) as the attribute list is traversed.
MDL Principle
 Given a model, M, and the data, D
 The MDL principle states that the best model for encoding the data is the one that minimizes Cost(M, D) = Cost(D|M) + Cost(M)
  Cost(D|M) is the cost, in number of bits, of encoding the data given a model M
  Cost(M) is the cost of encoding the model M
MDL Pruning Algorithm
 The models are the set of trees obtained by pruning the initial decision tree T
 The data is the training set S
 The goal is to find the subtree of T that best describes the training set S (i.e. with the minimum cost)
 The algorithm evaluates the cost at each decision tree node to determine whether to convert the node into a leaf, prune the left or the right child, or leave the node intact.
Encoding Scheme
 Cost(S|T) is defined as the sum of all classification errors
 Cost(M) includes
  The cost of describing the tree
   number of bits used to encode each node
  The cost of describing the splits
   For numeric attributes, the cost is 1 bit
   For categorical attributes, the cost is ln(nA), where nA is the total number of tests of the form A ∈ S’ used
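
A hedged sketch of this costing, for intuition only. The dict-based tree representation, the per-node bit cost NODE_BITS, and the field names are illustrative assumptions, not the paper's exact encoding:

  from math import log

  NODE_BITS = 2   # assumed bits to encode each node; not specified above

  def data_cost(node):
      """Cost(S|T): total classification errors made by the leaves."""
      if "children" not in node:
          return node["errors"]
      return sum(data_cost(child) for child in node["children"])

  def model_cost(node):
      """Cost(M): bits for the nodes plus the cost of describing each split."""
      if "children" not in node:
          return NODE_BITS
      split = node["split"]
      split_cost = 1 if split["type"] == "numeric" else log(split["n_subset_tests"])
      return NODE_BITS + split_cost + sum(model_cost(child) for child in node["children"])

  def mdl_cost(tree):
      return data_cost(tree) + model_cost(tree)

  toy_tree = {"split": {"type": "numeric"},
              "children": [{"errors": 0}, {"errors": 2}]}
  print(mdl_cost(toy_tree))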

Performance (Scalability)

 (The original slide presents scalability charts, which are not reproduced here.)
