Data Mining
Tools & Techniques
Lecture 12
Testing
Apply the trained classifier to unseen data, e.g. the tuple (Jeff, Professor, 4) — Tenured?

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes
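The testing step can be sketched as applying an induced rule to an unseen tuple. The rule below (rank = "Professor" OR years > 6 implies tenured) is an assumption for illustration, not stated on this slide; on test data such a rule may misclassify some tuples, which is exactly what the testing phase measures.

```python
# Hypothetical rule assumed for illustration; it need not classify
# every test tuple correctly -- testing estimates its accuracy.
def predict_tenured(name, rank, years):
    """Classify an unseen tuple with the induced rule."""
    return "yes" if rank == "Professor" or years > 6 else "no"

print(predict_tenured("Jeff", "Professor", 4))  # -> yes
```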
Classification
[Figure: linearly separable vs. not linearly separable data]
Decision Trees
• Divides the feature space with axis-aligned decision boundaries
• Each rectangular region is labeled with one class
• Idea is to divide the entire X-space into rectangles such that each rectangle is as homogeneous, or "pure," as possible
• Pure = containing records that belong to just one class
• Recursive partitioning of the p-dimensional space of the predictor variables into non-overlapping multidimensional rectangles
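A minimal sketch of what such a partition looks like once learned, assuming a toy 2-D feature space with hypothetical thresholds (0.5 and 0.3): each internal test compares one coordinate to a threshold, so the plane is carved into non-overlapping axis-aligned rectangles, each carrying one class label.

```python
# Toy decision tree over a 2-D feature space; thresholds are
# illustrative assumptions, not learned from any real data.
def classify(point):
    x, y = point
    if x < 0.5:           # vertical split: left half-plane is one rectangle
        return "A"
    else:
        if y < 0.3:       # horizontal split inside the right half-plane
            return "B"
        return "A"

print(classify((0.2, 0.9)))  # left rectangle -> 'A'
print(classify((0.8, 0.1)))  # lower-right rectangle -> 'B'
```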
[Figure: decision tree with leaf nodes carrying class labels; a pure node contains records of a single class]
Idea:
1. Use counts at the leaves to define probability distributions, so we can measure uncertainty.
2. A good attribute splits the examples into subsets that are (ideally) pure.
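Step 1 above can be sketched directly: turn the class counts at a leaf into a probability distribution, and call a leaf pure when only one class remains. The function names are illustrative, not part of the slides.

```python
from collections import Counter

def leaf_distribution(labels):
    """Turn class counts at a leaf into a probability distribution."""
    counts = Counter(labels)
    n = len(labels)
    return {c: k / n for c, k in counts.items()}

def is_pure(labels):
    """A leaf is pure if all its records share one class."""
    return len(set(labels)) == 1

print(leaf_distribution(["yes", "yes", "no", "yes"]))  # {'yes': 0.75, 'no': 0.25}
print(is_pure(["yes", "yes"]))                          # True
```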
Choosing a Good Attribute
• Restaurant example
• Test Patrons or Type first?
• Popular measures
– information gain, gain ratio, Gini index
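One of the listed measures, the Gini index, is simple enough to sketch here: it is 1 minus the sum of squared class proportions, so it is 0 for a pure node and largest when classes are evenly mixed.

```python
def gini(labels):
    """Gini index: 1 - sum of squared class proportions."""
    n = len(labels)
    props = [labels.count(c) / n for c in set(labels)]
    return 1.0 - sum(p * p for p in props)

print(gini(["yes"] * 5))                  # 0.0 -- pure node
print(gini(["yes", "no", "yes", "no"]))   # 0.5 -- maximally impure for 2 classes
```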
Information Gain
• ID3 uses information gain as its attribute selection measure
• Node N represents tuples of partition D
• Attribute with the highest information gain is chosen as the
splitting attribute for node N
• Objective: to partition on an attribute that would do the “best
classification,” so that the amount of information still required
to finish classifying the tuples is minimal
• Minimizes the expected number of tests needed to classify a given
tuple and guarantees that a simple (but not necessarily the simplest) tree is found
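The objective above can be sketched numerically: compute the entropy of D, subtract the weighted entropy remaining after a candidate split, and prefer the attribute with the largest difference. The 9-yes/5-no class distribution and the binary split below are illustrative assumptions, not data from the slides.

```python
from math import log2

def info(labels):
    """Entropy of a class-labeled partition."""
    n = len(labels)
    return -sum((labels.count(c) / n) * log2(labels.count(c) / n)
                for c in set(labels))

def info_gain(labels, partitions):
    """Information gain = entropy of D minus weighted entropy after the split."""
    n = len(labels)
    expected = sum(len(p) / n * info(p) for p in partitions)
    return info(labels) - expected

D = ["yes"] * 9 + ["no"] * 5                        # assumed class distribution
left = ["yes"] * 6 + ["no"] * 1                     # hypothetical binary split
right = ["yes"] * 3 + ["no"] * 4
print(round(info_gain(D, [left, right]), 3))        # -> 0.152
```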
Notations
• D: data partition, a training set of class‐labeled tuples
• m: number of distinct values of the class-label attribute, defining m
distinct classes Ci (i = 1, …, m)
• Ci,D: set of tuples of class Ci in D
• |Dj|: number of tuples in Dj, the j-th partition of D after a split
• |Ci,D|: number of tuples in Ci,D
Information Gain
• Let pi be the probability that an arbitrary tuple in D belongs to class Ci,
estimated by |Ci, D|/|D|
• Expected information (entropy) needed to classify a tuple in D:
Info(D) = −Σi=1..m pi log2(pi)