Classification: Definition
Given a collection of records (the training set).
Each record contains a set of attributes; one of the attributes is the class.
Find a model for the class attribute as a function of the values of the other attributes.
Goal: previously unseen records should be assigned a class as accurately as possible.
A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets: the training set is used to build the model and the test set to validate it.
Illustrating Classification Task

Training Set:
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
6    No       Medium   60K      No

A learning algorithm induces a model from the Training Set.

Test Set:
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
15   No       Large    67K      ?

The model is then applied to the Test Set.
Examples of Classification Task
Predicting tumor cells as benign or malignant
Apply Model to Test Data

Test record:
Refund  Marital Status  Taxable Income  Cheat
No      Married         80K             ?

Start from the root of the tree and follow the branches that match the record:

Refund?
  Yes → NO
  No  → MarSt?
          Single, Divorced → TaxInc?
                               < 80K → NO
                               > 80K → YES
          Married → NO

For this record: Refund = No, then MarSt = Married, so we assign Cheat to "No".
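To make the walk concrete, here is a minimal Python sketch (not from the slides) that encodes the tree above and classifies the test record; the field names and the classify helper are illustrative choices:

def classify(record):
    """Walk the Refund / MarSt / TaxInc tree for one record."""
    if record["Refund"] == "Yes":
        return "No"                         # Refund = Yes -> leaf NO
    if record["MarSt"] == "Married":
        return "No"                         # Married -> leaf NO
    # Single or Divorced: test Taxable Income (slide labels: < 80K / > 80K)
    return "Yes" if record["TaxInc"] > 80_000 else "No"

test = {"Refund": "No", "MarSt": "Married", "TaxInc": 80_000}
print(classify(test))                       # -> No (Cheat is assigned "No")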
Decision Tree Classification Task

Training Set:
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
6    No       Medium   60K      No

A tree induction algorithm builds a decision tree (the model) from the Training Set.

Test Set:
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
15   No       Large    67K      ?

The decision tree is then applied to the Test Set.
Confusion Matrix
A confusion matrix (Kohavi and Provost, 1998) contains information about the actual and predicted classifications produced by a classification system. The performance of such systems is commonly evaluated using the data in the matrix.
Confusion matrix
a is the number of correct predictions that an instance is negative,
b is the number of incorrect predictions that an instance is positive,
c is the number of incorrect predictions that an instance is negative, and
d is the number of correct predictions that an instance is positive.
Confusion Matrix

                     Predicted
                     Negative  Positive
Actual   Negative    a         b
         Positive    c         d
Confusion matrix
Several standard terms have been defined for the 2-class matrix:

The accuracy (AC) is the proportion of the total number of predictions that were correct:
AC = (a + d) / (a + b + c + d)

The recall or true positive rate (TP) is the proportion of positive cases that were correctly identified:
TP = d / (c + d)

The false positive rate (FP) is the proportion of negative cases that were incorrectly classified as positive:
FP = b / (a + b)

The true negative rate (TN) is the proportion of negative cases that were classified correctly:
TN = a / (a + b)

The false negative rate (FN) is the proportion of positive cases that were incorrectly classified as negative:
FN = c / (c + d)

Finally, precision (P) is the proportion of the predicted positive cases that were correct:
P = d / (b + d)
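The six metrics follow directly from the four counts. A small Python sketch (the function name and example counts are illustrative, not from the slides):

def metrics(a, b, c, d):
    """a = TN, b = FP, c = FN, d = TP, as in the confusion matrix above."""
    return {
        "accuracy AC":  (a + d) / (a + b + c + d),
        "recall TP":    d / (c + d),
        "FP rate":      b / (a + b),
        "TN rate":      a / (a + b),
        "FN rate":      c / (c + d),
        "precision P":  d / (b + d),
    }

print(metrics(a=50, b=10, c=5, d=35))       # arbitrary example counts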
Decision Tree Induction
Many Algorithms:
Hunt’s Algorithm (one of the earliest)
SLIQ
General Structure of Hunt's Algorithm

Training data:
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes

The algorithm grows the tree step by step: it starts with a single leaf predicting "Don't Cheat", then recursively splits impure nodes, e.g. first on Refund (Yes → Don't Cheat; No → Don't Cheat, to be refined further).
Issues
Determine how to split the records
How to specify the attribute test condition?
How to determine the best split?
Determine when to stop splitting
Tree Induction
Greedy strategy:
Split the records based on an attribute test that optimizes a chosen criterion.
Issues
Determine how to split the records
How to specify the attribute test condition?
How to determine the best split?
Determine when to stop splitting
How to Specify the Test Condition?
Depends on attribute types:
Nominal
Ordinal
Continuous

Example of a binary split on the nominal attribute CarType:
{Sports, Luxury} vs. {Family}, OR {Family, Luxury} vs. {Sports}
Splitting Based on Ordinal Attributes
Multi-way split: use as many partitions as there are distinct values, e.g. Size: Small / Medium / Large.
What about the binary split {Small, Large} vs. {Medium}? (It groups non-adjacent values, so it does not respect the order of the attribute.)
Splitting Based on Continuous Attributes
Binary split: a test such as (Taxable Income > 80K?) with branches Yes / No.
Multi-way split: partition Taxable Income into ranges, e.g. < 10K, ..., > 80K.
Issues
Determine how to split the records
How to specify the attribute test condition?
How to determine the best split?
Determine when to stop splitting
How to Determine the Best Split
Before splitting: 10 records of class 0 and 10 records of class 1.
Candidate attribute tests: Own Car?, Car Type?, Student ID?
Which test condition is best? We need a measure of node impurity, e.g.:
Gini index
Entropy
Misclassification error
How to Find the Best Split
Before splitting, the node has class counts C0 = N00, C1 = N01 and impurity M0.
Split A produces nodes N1 (C0 = N10, C1 = N11) and N2 (C0 = N20, C1 = N21), whose impurities M1 and M2 combine into the weighted impurity M12.
Split B produces nodes N3 (C0 = N30, C1 = N31) and N4 (C0 = N40, C1 = N41), whose impurities M3 and M4 combine into M34.
Compare Gain = M0 − M12 vs. M0 − M34 and choose the split with the larger gain.
Measure of Impurity: GINI
Gini index for a given node t:
GINI(t) = 1 − Σ_j [p(j|t)]²
where p(j|t) is the relative frequency of class j at node t.

C1      0       1       2       3
C2      6       5       4       3
Gini    0.000   0.278   0.444   0.500
Examples for Computing GINI
GINI(t) = 1 − Σ_j [p(j|t)]²

Parent node: C1 = 6, C2 = 6, Gini = 0.500.
Split B produces nodes N1 and N2:

        N1   N2
C1      5    1
C2      2    4

Gini(N1) = 1 − (5/7)² − (2/7)² = 0.408
Gini(N2) = 1 − (1/5)² − (4/5)² = 0.320
Gini(Children) = 7/12 × 0.408 + 5/12 × 0.320 = 0.371
(The class proportions use each child's own size: N1 holds 7 records, N2 holds 5.)
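The same computation in Python, as a sketch (function names are illustrative):

def gini(counts):
    """Gini index of a node from its class counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_split(children):
    """Weighted Gini of a split; children = list of class-count lists."""
    n = sum(sum(node) for node in children)
    return sum(sum(node) / n * gini(node) for node in children)

print(round(gini([5, 2]), 3))                  # N1 -> 0.408
print(round(gini([1, 4]), 3))                  # N2 -> 0.32
print(round(gini_split([[5, 2], [1, 4]]), 3))  # children -> 0.371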
Categorical Attributes: Computing Gini Index
For each distinct value, gather the counts of each class in the dataset.
Use the count matrix to make decisions.

Multi-way split:
CarType   Family  Sports  Luxury
C1        1       2       1
C2        4       1       1
Gini = 0.393

Two-way splits (find the best partition of values):
CarType   {Sports, Luxury}  {Family}
C1        3                 1
C2        2                 4
Gini = 0.400

CarType   {Family, Luxury}  {Sports}
C1        2                 2
C2        5                 1
Gini = 0.419
Continuous Attributes: Computing Gini Index
Use binary decisions based on one value.
Several choices for the splitting value: the number of possible splitting values equals the number of distinct values.
Each splitting value v has an associated count matrix: class counts in each of the partitions A < v and A ≥ v.
A simple method to choose the best v: for each v, scan the database to gather the count matrix and compute its Gini index. Sorting the values first lets all candidates be evaluated in a single scan, as in the table (and the sketch) below.

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Sorted values of Taxable Income, candidate split positions (midpoints), and class counts on each side (≤, >):

Split pos.  55    65    72    80    87    92    97    110   122   172   230
Yes (≤,>)   0,3   0,3   0,3   0,3   1,2   2,1   3,0   3,0   3,0   3,0   3,0
No  (≤,>)   0,7   1,6   2,5   3,4   3,4   3,4   3,4   4,3   5,2   6,1   7,0
Gini        0.420 0.400 0.375 0.343 0.417 0.400 0.300 0.343 0.375 0.400 0.420

The best split is at 97 (Gini = 0.300).
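A Python sketch of the sorted scan over the ten records above (variable names are illustrative): sort once, sweep the midpoints, and update the count matrix incrementally.

def gini(yes, no):
    n = yes + no
    return 0.0 if n == 0 else 1.0 - (yes / n) ** 2 - (no / n) ** 2

incomes = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]   # Taxable Income in K
cheat   = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]

pairs = sorted(zip(incomes, cheat))
tot_yes, tot_no = cheat.count("Yes"), cheat.count("No")
best, ly, ln = None, 0, 0
for i in range(len(pairs) - 1):                 # sweep candidate thresholds
    ly += pairs[i][1] == "Yes"; ln += pairs[i][1] == "No"
    v = (pairs[i][0] + pairs[i + 1][0]) / 2     # midpoint between neighbors
    nl, nr = ly + ln, len(pairs) - ly - ln
    w = (nl * gini(ly, ln) + nr * gini(tot_yes - ly, tot_no - ln)) / len(pairs)
    if best is None or w < best[1]:
        best = (v, w)
print(best)       # -> (97.5, 0.3): the Gini-0.300 split found in the table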
Alternative Splitting Criteria Based on INFO
Entropy at a given node t:
Entropy(t) = −Σ_j p(j|t) log₂ p(j|t)

The gain ratio normalizes information gain by the split information of a partition into k nodes of sizes n_i:
SplitINFO = −Σ_{i=1..k} (n_i/n) log₂(n_i/n)

A further GINI example (a split yielding one pure child):

        N1   N2
C1      3    4
C2      0    3

Gini(N1) = 1 − (3/3)² − (0/3)² = 0
Gini(N2) = 1 − (4/7)² − (3/7)² = 0.489
Gini(Children) = 3/10 × 0 + 7/10 × 0.489 = 0.342
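The two INFO quantities above are one-liners in Python; a sketch (names illustrative):

from math import log2

def entropy(counts):
    """Entropy of a node from its class counts."""
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)

def split_info(sizes):
    """SplitINFO for a partition into nodes of the given sizes."""
    n = sum(sizes)
    return -sum((ni / n) * log2(ni / n) for ni in sizes if ni > 0)

print(entropy([3, 3]))                 # 1.0 bit: evenly mixed two-class node
print(round(split_info([7, 5]), 3))    # ~0.98 for a near-balanced binary split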
Tree Induction
Greedy strategy:
Split the records based on an attribute test that optimizes a chosen criterion.
Issues
Determine how to split the records
How to specify the attribute test condition?
How to determine the best split?
Determine when to stop splitting
Stopping Criteria for Tree Induction
Stop expanding a node when all the records belong to the same class.
A Decision Tree

Age < 25?
  Yes → High
  No  → Car Type ∈ {Sports}?
          Yes → High
          No  → Low
Decision Tree Classification
A decision tree is created in two phases:
Tree Building Phase:
Repeatedly partition the training data until all the examples in each partition belong to one class or the partition is sufficiently small.
Tree Pruning Phase:
Remove dependency on statistical noise or variation that may be particular only to the training set.
Tree Building Phase
General tree-growth algorithm (binary tree):

Partition(Data S)
  if (all points in S are of the same class) then
    return;
  for each attribute A do
    evaluate splits on attribute A;
  use the best split to partition S into S1 and S2;
  Partition(S1);
  Partition(S2);
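A runnable Python rendering of this pseudocode, under some assumptions not in the slides (records are (features, label) tuples; the "best split" is the binary threshold test with the lowest weighted Gini):

def gini(labels):
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(S):
    """Try every (attribute, threshold) pair; return the lowest-impurity split."""
    best = None
    for a in range(len(S[0][0])):
        for v in sorted({x[0][a] for x in S}):
            S1 = [x for x in S if x[0][a] <= v]
            S2 = [x for x in S if x[0][a] > v]
            if not S1 or not S2:
                continue
            w = (len(S1) * gini([y for _, y in S1]) +
                 len(S2) * gini([y for _, y in S2])) / len(S)
            if best is None or w < best[0]:
                best = (w, a, v, S1, S2)
    return best

def partition(S, depth=0):
    labels = [y for _, y in S]
    split = best_split(S)
    if len(set(labels)) == 1 or split is None:     # pure or unsplittable: leaf
        print("  " * depth + "leaf:", max(set(labels), key=labels.count))
        return
    _, a, v, S1, S2 = split
    print("  " * depth + f"attr[{a}] <= {v}?")
    partition(S1, depth + 1)
    partition(S2, depth + 1)

# the insurance-risk example used in the SLIQ slides below:
partition([((23,), "High"), ((17,), "High"), ((43,), "High"),
           ((68,), "Low"), ((32,), "Low"), ((20,), "High")])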
Tree Building Phase (cont.)
The form of the split depends on the type of the attribute:
Splits for numeric attributes are of the form A ≤ v, where v is a real number.
Splits for categorical attributes are of the form A ∈ S', where S' is a subset of all possible values of A.
Splitting Index
Alternative splits for an attribute are compared using a splitting index.
Examples of splitting indices:
Entropy: entropy(T) = −Σ_j p_j log₂(p_j)
Gini index: gini(T) = 1 − Σ_j p_j²
The Best Split
Suppose the splitting index is I(), and a split partitions S into S1 and S2.
The best split is the one that maximizes:
I(S) − (|S1|/|S| × I(S1) + |S2|/|S| × I(S2))
Tree Pruning Phase
Examine the initial tree built.
Choose the subtree with the least estimated error rate.
Two approaches for error estimation:
Use the original training dataset (e.g. cross-validation)
Use an independent dataset
SLIQ - Overview
Capable of classifying disk-resident datasets
Scalable to large datasets
Uses a pre-sorting technique to reduce the cost of evaluating numeric attributes
Uses a breadth-first tree-growing strategy
Uses an inexpensive tree-pruning algorithm based on the Minimum Description Length (MDL) principle
Data Structure
A list (the class list) for the class label:
Each entry has two fields: the class label and a reference to a leaf node of the decision tree.
Memory-resident.
A list for each attribute:
Each entry has two fields: the attribute value and an index into the class list.
Written to disk if necessary.
An Illustration of the Data Structure

Age attribute list (value, class list index):
(23, 1), (17, 2), (43, 3), (68, 4), (32, 5), (20, 6)

Car Type attribute list (value, class list index):
(Family, 1), (Sports, 2), (Sports, 3), (Family, 4), (Truck, 5), (Family, 6)

Class list (index: class, leaf):
1: High, N1   2: High, N1   3: High, N1
4: Low, N1    5: Low, N1    6: High, N1
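In Python these lists might look as follows; a sketch of the layout only (the real SLIQ attribute lists live on disk):

# attribute lists: (value, class list index); class list: index -> [class, leaf]
age_list = [(23, 1), (17, 2), (43, 3), (68, 4), (32, 5), (20, 6)]
car_list = [("Family", 1), ("Sports", 2), ("Sports", 3),
            ("Family", 4), ("Truck", 5), ("Family", 6)]
class_list = {i: [c, "N1"] for i, c in
              enumerate(["High", "High", "High", "Low", "Low", "High"], start=1)}

# Pre-sorting (next slide): each attribute list is sorted once by value;
# the carried class-list index preserves the link to class label and leaf.
age_list.sort(key=lambda e: e[0])
print(age_list)   # [(17, 2), (20, 6), (23, 1), (32, 5), (43, 3), (68, 4)]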
Pre-sorting
Sorting the data is required to find splits for numeric attributes.
Previous algorithms sort the data at every node in the tree.
Using the separate-list data structure, SLIQ sorts the data only once, at the beginning of the tree-building phase.
After Pre-sorting

Age attribute list (value, class list index):
(17, 2), (20, 6), (23, 1), (32, 5), (43, 3), (68, 4)

Car Type attribute list (value, class list index):
(Family, 1), (Sports, 2), (Sports, 3), (Family, 4), (Truck, 5), (Family, 6)

Class list (index: class, leaf):
1: High, N1   2: High, N1   3: High, N1
4: Low, N1    5: Low, N1    6: High, N1
Node Split
SLIQ uses a breadth-first tree-growing strategy:
In one pass over the data, splits for all the leaves of the current tree can be evaluated.
SLIQ uses the Gini splitting index to evaluate splits:
The frequency distribution of class values in the data partitions is required.
Class Histogram
A class histogram is used to keep the frequency distribution of class values for each attribute in each leaf node.
For numeric attributes, the class histogram is a list of <class, frequency> pairs.
For categorical attributes, the class histogram is a list of <attribute value, class, frequency> triples.
Evaluate Split

for each attribute A do
  traverse the attribute list of A
  for each value v in the attribute list do
    find the corresponding class and leaf node l via the class list
    update the class histogram in leaf l
    if A is a numeric attribute then
      compute the splitting index for the test (A ≤ v) at leaf l
  if A is a categorical attribute then
    for each leaf of the tree do
      find the subset of A with the best split
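For a numeric attribute this is a single ordered scan. A Python sketch over the pre-sorted Age list (the histogram is kept as Left/Right class counters; names are illustrative):

from collections import Counter

age_list = [(17, 2), (20, 6), (23, 1), (32, 5), (43, 3), (68, 4)]
class_list = {1: ["High", "N1"], 2: ["High", "N1"], 3: ["High", "N1"],
              4: ["Low", "N1"], 5: ["Low", "N1"], 6: ["High", "N1"]}

def weighted_gini(left, right):
    def g(h):
        n = sum(h.values())
        return 0.0 if n == 0 else 1.0 - sum((c / n) ** 2 for c in h.values())
    n = sum(left.values()) + sum(right.values())
    return (sum(left.values()) * g(left) + sum(right.values()) * g(right)) / n

left = Counter()
right = Counter(cls for cls, leaf in class_list.values() if leaf == "N1")
for value, idx in age_list:              # one traversal of the attribute list
    cls, leaf = class_list[idx]          # look up class and leaf via class list
    left[cls] += 1; right[cls] -= 1      # record moves to the Left of the test
    print(f"(Age <= {value}): Gini = {weighted_gini(left, right):.3f}")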
Subsetting for Categorical Attributes

if the cardinality of S is less than a threshold then
  evaluate all subsets of S
else
  start with an empty subset S'
  repeat
    add to S' the element of S that gives the best split
  until there is no improvement
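A Python sketch of the greedy fallback (the score function is a stand-in for "weighted Gini of the split induced by the subset", which the slides leave abstract):

def greedy_subset(values, score):
    """Grow S' one element at a time while the splitting index improves."""
    chosen, best = set(), float("inf")
    while True:
        candidates = [(score(chosen | {v}), v) for v in values - chosen]
        if not candidates:
            return chosen, best
        s, v = min(candidates)
        if s >= best:                     # no improvement: stop
            return chosen, best
        chosen, best = chosen | {v}, s

# toy scorer (hypothetical) just to exercise the loop:
print(greedy_subset({"Family", "Sports", "Truck"},
                    lambda S: 0.3 if "Sports" in S else 0.5))
# -> ({'Sports'}, 0.3)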
Partition the Data
Partitioning can be done by updating the leaf reference of each entry in the class list.
Algorithm:

for each attribute A used in a split do
  traverse the attribute list of A
  for each value v in the list do
    find the corresponding class label and leaf l
    find the new node, n, to which v belongs by applying the splitting test at l
    update the leaf reference to n
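A sketch of this update for the split used in the "Example of Updating Class List" below, (Age ≤ 23) at N1 with children N2 and N3 (list contents as in the illustration above):

age_list = [(17, 2), (20, 6), (23, 1), (32, 5), (43, 3), (68, 4)]
class_list = {1: ["High", "N1"], 2: ["High", "N1"], 3: ["High", "N1"],
              4: ["Low", "N1"], 5: ["Low", "N1"], 6: ["High", "N1"]}

for value, idx in age_list:                  # traverse the Age attribute list
    if class_list[idx][1] == "N1":           # apply the splitting test at N1
        class_list[idx][1] = "N2" if value <= 23 else "N3"

print(class_list)
# entries 2, 6, 1 (ages 17, 20, 23) -> N2; entries 5, 3, 4 (ages 32, 43, 68) -> N3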
Example of Evaluating Splits

Sorted Age attribute list (age → class list index, with the class looked up from the class list; all entries currently at leaf N1):
17 → 2 (High), 20 → 6 (High), 23 → 1 (High), 32 → 5 (Low), 43 → 3 (High), 68 → 4 (Low)

Initial histogram (L = records left of the split, R = records right of it; columns H(igh), L(ow)):
     H  L
L    0  0
R    4  2

Evaluate split (Age ≤ 17):
     H  L
L    1  0
R    3  2

Evaluate split (Age ≤ 32):
     H  L
L    3  1
R    1  1
Example of Updating Class List
The split at node N1 is (Age ≤ 23), creating children N2 and N3.
Traversing the sorted Age list and applying the test at N1, the leaf reference of each class-list entry is updated: the entries for ages 17, 20, and 23 (class list indices 2, 6, 1) now point to N2, while the entries for ages 32, 43, and 68 (indices 5, 3, 4) receive the new value N3.

Class list after the update (index: class, leaf):
1: High, N2   2: High, N2   3: High, N3
4: Low, N3    5: Low, N3    6: High, N2
MDL Principle
Given a model M and the data D, the MDL principle states that the best model for encoding the data is the one that minimizes
Cost(M, D) = Cost(D|M) + Cost(M)
Cost(D|M) is the cost, in number of bits, of encoding the data given a model M.
Cost(M) is the cost of encoding the model M.
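One way to turn the principle into numbers, as a hedged sketch (the per-node bit cost here is an assumption; the slides only fix the split costs, see "Encoding Scheme" below):

from math import log

def mdl_cost(n_errors, n_nodes, n_numeric_splits, categorical_nA_list,
             bits_per_node=1.0):
    cost_data = n_errors                           # Cost(D|M): classification errors
    cost_splits = n_numeric_splits * 1.0           # 1 bit per numeric split
    cost_splits += sum(log(nA) for nA in categorical_nA_list)   # ln(nA) each
    return cost_data + n_nodes * bits_per_node + cost_splits    # + Cost(M)

# e.g. a 5-node subtree with 2 numeric splits, one categorical attribute
# with nA = 3 tests, misclassifying 4 training records:
print(mdl_cost(n_errors=4, n_nodes=5, n_numeric_splits=2, categorical_nA_list=[3]))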
MDL Pruning Algorithm
The models are the set of trees obtained by pruning the initial decision tree T.
The data is the training set S.
The goal is to find the subtree of T that best describes the training set S (i.e. the one with the minimum cost).
The algorithm evaluates the cost at each decision-tree node to determine whether to convert the node into a leaf, prune the left or the right child, or leave the node intact.
Encoding Scheme
Cost(S|T) is defined as the sum of all classification errors.
Cost(M) includes:
The cost of describing the tree (the number of bits used to encode each node)
The cost of describing the splits:
For numeric attributes, the cost is 1 bit.
For categorical attributes, the cost is ln(nA), where nA is the total number of tests of the form A ∈ S' used.
Performance (Scalability)