
eMBA933

Data Mining
Tools & Techniques
Lecture 13

Dr. Faiz Hamid


Associate Professor
Department of IME
IIT Kanpur
fhamid@iitk.ac.in
Information Gain
• Continuous‐Valued Attribute, A
• Must determine the best split point for A
– Sort the values of A in increasing order
– Typically, the midpoint between each pair of adjacent values is
considered as a possible split point
• (aᵢ + aᵢ₊₁)/2 is the midpoint between the values aᵢ and aᵢ₊₁
– Given v values of A, then v‐1 possible splits are evaluated
– For each possible split‐point for A, evaluate Info_A(D), where the number
of partitions is two, that is, v = 2 (or j = 1, 2); a sketch follows this list
– The point with the minimum expected information requirement for A is
selected as the split point for A
• Split:
– D1 is the set of tuples in D satisfying A ≤ split‐point, and D2 is the
set of tuples in D satisfying A > split‐point
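
A minimal sketch of this split‐point search, assuming plain Python; the attribute values and class labels in the example are invented for illustration, not lecture data:

from collections import Counter
from math import log2

def entropy(labels):
    """Expected information (entropy) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_split_point(values, labels):
    """Return (split_point, Info_A(D)) with the minimum expected information."""
    pairs = sorted(zip(values, labels))            # sort the values of A in increasing order
    n = len(pairs)
    best = (None, float("inf"))
    for i in range(n - 1):
        if pairs[i][0] == pairs[i + 1][0]:
            continue                               # equal adjacent values give no new midpoint
        mid = (pairs[i][0] + pairs[i + 1][0]) / 2  # candidate split point (a_i + a_{i+1}) / 2
        d1 = [lab for v, lab in pairs if v <= mid]     # D1: A <= split-point
        d2 = [lab for v, lab in pairs if v > mid]      # D2: A > split-point
        info = (len(d1) / n) * entropy(d1) + (len(d2) / n) * entropy(d2)
        if info < best[1]:
            best = (mid, info)
    return best

# e.g. best_split_point([40, 48, 60, 72, 80], ["no", "no", "yes", "yes", "no"])
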
Comparing Attribute Selection Measures
• Information gain:
• biased towards multivalued attributes
• Gain ratio:
• tends to prefer unbalanced splits in which one partition is much
smaller than the others
• Gini index:
• biased to multivalued attributes
• has difficulty when # of classes is large
• tends to favor tests that result in equal‐sized partitions and purity
in both partitions
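
For a concrete comparison, a small sketch that computes all three measures on the same candidate split; the class counts used in the example are invented for illustration:

from math import log2

def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c)

def gini(counts):
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def compare_measures(parent, partitions):
    """parent: class counts at the node; partitions: class counts in each branch."""
    n = sum(parent)
    info_a = sum(sum(p) / n * entropy(p) for p in partitions)
    gain = entropy(parent) - info_a                      # information gain
    split_info = entropy([sum(p) for p in partitions])   # penalises many / unbalanced branches
    gain_ratio = gain / split_info if split_info else float("inf")
    delta_gini = gini(parent) - sum(sum(p) / n * gini(p) for p in partitions)
    return gain, gain_ratio, delta_gini

# e.g. a 3-way split of a node with class counts [9, 5]:
print(compare_measures([9, 5], [[4, 0], [2, 3], [3, 2]]))
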
Other Attribute Selection Measures
• CHAID: a popular decision tree algorithm, measure based on χ2 test for
independence
• C‐SEP: performs better than info. gain and gini index in certain cases
• G‐statistic: close approximation to χ2 distribution
• MDL (Minimal Description Length) principle
– least bias toward multivalued attributes
– The best tree is the one that requires the fewest # of bits to both (1) encode the
tree, and (2) encode the exceptions to the tree
• Multivariate splits
– partition tuples based on a combination of attributes, rather than on a single attribute
– a form of attribute (or feature) construction

– CART: finds multivariate splits based on a linear combination of attributes
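
A rough illustration of a multivariate split: the test uses a linear combination of attributes rather than a single attribute (the attribute names, weights, and threshold are assumed for the example, not values CART would actually learn):

def multivariate_test(income, debt, weights=(0.6, -0.3), threshold=1.0):
    """Send a tuple left if 0.6*income - 0.3*debt <= 1.0, otherwise right."""
    score = weights[0] * income + weights[1] * debt
    return "left" if score <= threshold else "right"
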


Attribute Selection Measures
• Which attribute selection measure is the best?
– Most give good results, but none is significantly superior to the others
– “If the first learner performs better than the second learner on some learning
situations, the first learner must perform worse than the second learner on
other learning situations”
– time complexity of decision tree induction generally increases exponentially
with tree height
• measures that tend to produce shallower trees (e.g., with multiway rather than
binary splits, and that favor more balanced splits) may be preferred
– shallow trees tend to have a large number of leaves and higher error rates
Decision Tree Learning Algorithm
• Aim: Find the smallest tree consistent with the training samples.
• Idea: Recursively choose “most significant” attribute as root of (sub)tree
• Quinlan, 1993

– e.g., if best_attribute is discrete‐valued and multiway splits are allowed, a branch is grown for each of its known values and the attribute is removed from the attribute list (see the sketch below)
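
A simplified sketch of the recursive scheme, covering the stopping cases described on the following slides; this is not Quinlan's exact C4.5 code, and best_attribute is a hypothetical helper based on information gain:

from collections import Counter
from math import log2

def majority_class(samples):
    return Counter(label for _, label in samples).most_common(1)[0][0]

def best_attribute(samples, attributes):
    # hypothetical helper: choose the attribute with the lowest Info_A(D),
    # i.e. the highest information gain
    def entropy(subset):
        n = len(subset)
        counts = Counter(label for _, label in subset).values()
        return -sum((c / n) * log2(c / n) for c in counts)
    def info_after_split(attr):
        groups = {}
        for feats, label in samples:
            groups.setdefault(feats[attr], []).append((feats, label))
        return sum(len(g) / len(samples) * entropy(g) for g in groups.values())
    return min(attributes, key=info_after_split)

def build_tree(samples, attributes, parent_majority=None):
    # samples: list of (feature_dict, class_label) pairs
    if not samples:                              # no samples left: use the parent's majority class
        return parent_majority
    labels = {label for _, label in samples}
    if len(labels) == 1:                         # pure node: make it a leaf
        return labels.pop()
    if not attributes:                           # no attributes left: majority vote
        return majority_class(samples)
    best = best_attribute(samples, attributes)
    tree = {best: {}}                            # multiway split on a discrete-valued attribute
    for value in {feats[best] for feats, _ in samples}:
        subset = [(f, c) for f, c in samples if f[best] == value]
        tree[best][value] = build_tree(subset,
                                       [a for a in attributes if a != best],
                                       majority_class(samples))
    return tree

# e.g. build_tree([({"outlook": "sunny"}, "no"), ({"outlook": "rain"}, "yes")], ["outlook"])
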


Decision Tree Learning
• If the samples at a node belong to different classes and attributes remain to split on, choose the best attribute and split the samples on it
Decision Tree Learning
• After splitting the samples on an attribute:
– remaining samples become decision tree problems
themselves (or subtrees)
– fewer samples in each child node
– one less attribute
• Recursive approach
Decision Tree Learning
• If all the samples at a node are of the same class, make the node a leaf and label it with that class
Decision Tree Learning
• If there are no samples left at a node
– Implies no such sample has been observed
– Assign label as the majority class at the node’s parent
Decision Tree Learning
• If there are no attributes left and a node is not pure
– implies these samples have exactly the same feature
values but different classifications
– this may happen because:
1. some of the data could be incorrect, or
2. the attributes do not give enough information to describe the
situation fully (i.e. lack other useful attributes), or
3. the problem is truly non‐deterministic, i.e., given two samples
describing exactly the same conditions, we may make different
decisions
– Solution: Call it a leaf node and assign the majority vote as
the label
Splitting Attribute: Types
• discrete‐valued
Splitting Attribute: Types
• discrete‐valued and a binary tree must be produced
Splitting Attribute: Types
• continuous‐valued
Splitting Attribute: Types
• Ordinal attribute
– Binary split or multiway split: allowed as long as it does not violate the order property of the attribute values
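
The test forms above can be summarised in a small sketch (the attribute names, values, and threshold are invented for illustration):

def split_tests(record):
    # discrete-valued, multiway split: one branch per known value
    colour_branch = record["colour"]                        # e.g. "red" / "green" / "blue"
    # discrete-valued, binary tree required: test membership in a subset S_A
    colour_binary = record["colour"] in {"red", "green"}
    # continuous-valued: binary threshold test at the chosen split point
    income_binary = record["income"] <= 42.5
    # ordinal: groupings must respect the order, e.g. {small, medium} vs {large}
    size_binary = record["size"] in {"small", "medium"}
    return colour_branch, colour_binary, income_binary, size_binary
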
Decision Tree Pruning
Decision Tree Pruning
• A decision tree may grow very deep and branch excessively until every leaf is pure
– many of the branches will reflect anomalies in the training data due to noise or
outliers
– each leaf will represent a very specific set of attribute combinations that are seen in
the training data
– poor accuracy for unseen samples
– overfitting
• Pruning
– remove the least‐reliable branches
– lower ends (the leaves) of the tree are “snipped” until the tree is much smaller
– statistical measures typically used to prune branches
• Pruned trees
– smaller
– less complex
– easier to comprehend
– usually faster and better at correctly classifying independent test data
Decision Tree Pruning

[Figure: an unpruned decision tree and its pruned version]
Decision Tree Pruning
• Two approaches
• Prepruning
– Tree is “pruned” by halting its construction early
1. Do not split a node if improvement in the information measure is not
statistically significant
2. Do not split a node if the class distribution of instances is independent
of the available attributes (using a chi‐squared test)
3. Do not split a node if number of instances is less than some user‐
specified threshold
– Upon halting, the node becomes a leaf
– Leaf may hold the most frequent class among the subset tuples
– Difficult to choose an appropriate threshold
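
A sketch of the three prepruning checks above, assuming SciPy is available; the significance level and minimum node size are illustrative choices, not values from the lecture:

import numpy as np
from scipy.stats import chi2_contingency

def should_stop(contingency, n_samples, min_samples=20, alpha=0.05):
    """contingency: attribute-value x class count table at the node."""
    if n_samples < min_samples:        # rule 3: too few instances at the node
        return True
    chi2, p_value, dof, expected = chi2_contingency(np.asarray(contingency))
    return p_value > alpha             # rules 1-2: split is not statistically significant

# e.g. should_stop([[8, 2], [7, 3]], n_samples=20) -- the rows look similar, so halt
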
Decision Tree Pruning
• Postpruning
– Remove subtrees from a “fully grown” tree
– Replace the subtree with a leaf, use majority vote to label the
class
– cost complexity pruning algorithm used in CART
• A function of number of leaves in the tree and error rate
• T − t denotes the tree obtained by pruning the subtree t from the tree T

• Cross‐validation approach
• For each internal node, N, compute cost complexities with and without
subtree at N
• Subtree is pruned if it results in a smaller cost complexity
• Among the eligible subtrees, choose the one to prune which lowers the error
the most
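
In the standard CART formulation the cost complexity of a tree T is R_alpha(T) = R(T) + alpha * |leaves(T)|, where R(T) is the error rate and alpha >= 0 penalises each leaf. A minimal sketch of cost-complexity postpruning with scikit-learn's ccp_alpha mechanism; the dataset and the single held-out validation set (rather than full cross-validation) are illustrative simplifications:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Candidate alphas: each one corresponds to pruning away one more "weakest" subtree
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

best_alpha, best_score = 0.0, -1.0
for alpha in path.ccp_alphas:
    pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    score = pruned.score(X_val, y_val)    # accuracy on held-out data drives the choice
    if score >= best_score:               # on ties, prefer the larger alpha (smaller tree)
        best_alpha, best_score = alpha, score

print(best_alpha, best_score)
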
Decision Tree Pruning
• Postpruning

Other Issues
• Repetition: occurs when an attribute is repeatedly
tested along a given branch of the tree
Other Issues
• Replication: duplicate subtrees exist within the tree

• Multivariate splits can prevent these problems
