Data Mining
Tools & Techniques
Lecture 13
[Figure: an unpruned decision tree alongside its pruned version]
Decision Tree Pruning
• Two approaches
• Prepruning
– Tree is “pruned” by halting its construction early
1. Do not split a node if the improvement in the information measure is not
statistically significant
2. Do not split a node if the class distribution of the instances is independent
of the available attributes (using a chi‐squared test)
3. Do not split a node if the number of instances is less than some user‐
specified threshold
– Upon halting, the node becomes a leaf
– The leaf is labeled with the most frequent class among the node's subset of tuples
– Difficult to choose an appropriate threshold
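Check 2 above can be sketched as a standalone chi‐squared test of independence between a candidate split and the class label. The function names and the 0.05 critical value below are illustrative assumptions, not part of any particular library:

```python
# Prepruning check: only split if the candidate split's branch/class
# contingency table shows a statistically significant association.

def chi_squared_statistic(table):
    """table[i][j] = count of class j among instances sent to branch i."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            if expected > 0:
                stat += (observed - expected) ** 2 / expected
    return stat

def should_split(table, critical_value=3.841):
    # 3.841 is the chi-squared critical value at p = 0.05 with 1 degree
    # of freedom (a 2x2 table); adjust for larger tables.
    return chi_squared_statistic(table) > critical_value
```

For example, `should_split([[20, 0], [0, 20]])` is true (the split separates the classes perfectly), while `should_split([[10, 10], [10, 10]])` is false (the class distribution is independent of the split), so that node would become a leaf.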
Decision Tree Pruning
• Postpruning
– Remove subtrees from a “fully grown” tree
– Replace the subtree with a leaf, use majority vote to label the
class
– Cost complexity pruning: the algorithm used in CART
• Cost complexity is a function of the number of leaves in the tree and the error rate
• T − t denotes the tree obtained by pruning the subtree t from the tree T
• Uses a cross‐validation approach
• For each internal node N, compute the cost complexities with and without the
subtree rooted at N
• The subtree is pruned if pruning yields a smaller cost complexity
• Among the eligible subtrees, prune the one that lowers the error
the most
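The comparison above can be sketched in a few lines. The nested-dict tree representation, the `leaf_errors` field (the errors the node would make if collapsed to a majority-class leaf), and the penalty weight `alpha` are illustrative assumptions, not CART's actual implementation:

```python
# Cost complexity = misclassification errors + alpha * (number of leaves).
# A subtree is pruned when collapsing it to a leaf lowers this quantity.

def leaves(tree):
    if tree["children"] is None:          # leaf node
        return 1
    return sum(leaves(c) for c in tree["children"])

def errors(tree):
    if tree["children"] is None:
        return tree["errors"]
    return sum(errors(c) for c in tree["children"])

def cost_complexity(tree, alpha):
    return errors(tree) + alpha * leaves(tree)

def prune_gain(node, alpha):
    """Positive if replacing node's subtree with a single leaf
    (labeled by majority vote) lowers the cost complexity."""
    kept = cost_complexity(node, alpha)
    collapsed = node["leaf_errors"] + alpha * 1
    return kept - collapsed
```

With a small alpha, keeping an accurate subtree wins; as alpha grows, the leaf-count penalty dominates and the subtree is pruned, which is why the algorithm sweeps alpha and validates each candidate tree by cross‐validation.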
Decision Tree Pruning
• Postpruning
[Figure: postpruning example shown in three numbered steps]
Other Issues
• Repetition: occurs when an attribute is repeatedly
tested along a given branch of the tree
Other Issues
• Replication: duplicate subtrees exist within the tree