
NBTree: A Naive Bayes/Decision-Tree Hybrid


Darin Morrison

April 17, 2007


Outline

1. Motivation
   - Problem and Solutions
2. Consideration of Existing Solutions
   - Naive-Bayes Classifiers
   - Decision-Trees
   - Learning Curves
3. New Solution: NBTree
   - Definition of NBTree
   - Performance
4. Summary


Problem and Solutions

Automatic Induction of Classifiers

Problem
How can we generate a classifier from an arbitrarily sized database of labeled instances, where the attributes are not necessarily independent?

Solutions
1. Naive-Bayes Classifiers
2. Decision-Trees (C4.5)
3. ?


Naive-Bayes Classifiers

Pros
1. Fast
2. Induced classifiers are easy to interpret
3. Robust to irrelevant attributes
4. Uses evidence from many attributes

Cons
1. Assumes independence of attributes
2. Low performance ceiling on large databases
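The independence assumption behind these trade-offs can be made concrete. A Naive-Bayes classifier scores each label by multiplying its prior with one likelihood per attribute, treating the attributes as independent given the label. A minimal sketch for discrete attributes (hypothetical function names, not code from the paper):

```python
from collections import Counter, defaultdict

def train_nb(instances, labels):
    """Estimate P(label) and per-attribute counts for P(value | label)."""
    label_counts = Counter(labels)
    cond = defaultdict(int)  # (attribute index, value, label) -> count
    for x, y in zip(instances, labels):
        for i, v in enumerate(x):
            cond[(i, v, y)] += 1
    return label_counts, cond, len(labels)

def predict_nb(x, label_counts, cond, n):
    """Pick the label maximizing P(label) * prod_i P(x_i | label)."""
    best_label, best_score = None, -1.0
    for label, count in label_counts.items():
        score = count / n  # prior P(label)
        for i, v in enumerate(x):
            # Laplace smoothing so unseen attribute values don't zero the product
            score *= (cond[(i, v, label)] + 1) / (count + 2)
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

The product structure is exactly why irrelevant attributes do little harm (their likelihoods are similar across labels) and why correlated attributes do hurt: correlated evidence is counted as if it were independent.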


Decision-Trees

Pros
1. Fast
2. Segmentation of data

Cons
1. Fragmentation as the number of splits becomes large
2. Interpretability goes down as the number of splits increases


Comparison between NB and C4.5 Learning Curves

Error bars represent 95% confidence intervals on accuracy.
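The slides do not say how those intervals were computed; a standard choice (an assumption here, not confirmed by the source) is the normal approximation to a binomial proportion over the test instances:

```python
import math

def accuracy_ci(correct, total, z=1.96):
    """Approximate 95% confidence interval for accuracy = correct / total,
    using the normal approximation to the binomial (z = 1.96 for 95%)."""
    p = correct / total
    half_width = z * math.sqrt(p * (1 - p) / total)
    return max(0.0, p - half_width), min(1.0, p + half_width)
```

For example, 850 correct out of 1000 gives an interval of roughly (0.828, 0.872); the bars narrow as the test set grows, which is why overlapping intervals on small datasets make the NB/C4.5 comparison inconclusive there.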



Definition of NBTree

Algorithm
dom(makeNBTree) = Set LabeledInstance
cod(makeNBTree) = Tree Split NBC

Input: a set T of labeled instances.
1. For each attribute Xi, evaluate the utility, u(Xi), of a split on attribute Xi. For continuous attributes, a threshold is also found at this stage.
2. Let j = argmax_i(u_i), i.e., the attribute with the highest utility.
3. If u_j is not significantly better than the utility of the current node, create a Naive-Bayes classifier for the current node and return.
4. Partition the set of instances T according to the test on Xj. If Xj is continuous, a threshold split is used; if Xj is discrete, a multi-way split is made for all possible values.
5. For each child, call the algorithm recursively on the portion of T that matches the test leading to the child.
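The steps above can be sketched as a recursion over discrete attributes. This is an illustration, not the paper's implementation: the utility functions are passed in as parameters (in the paper they are 5-fold cross-validated Naive-Bayes accuracy estimates), continuous thresholds are omitted, and leaves are marked with a string rather than a trained classifier:

```python
def make_nbtree(T, attrs, node_utility, split_utility, min_instances=30):
    """Build an NBTree over discrete attributes.

    T: list of (attribute_tuple, label) instances.
    attrs: indices of attributes still available for splitting.
    node_utility(T): estimated accuracy of a Naive-Bayes leaf trained on T.
    split_utility(T, i): weighted accuracy of NB leaves under a split on i.
    Returns "NB-leaf" or a dict describing an internal split node.
    """
    if not attrs:
        return "NB-leaf"
    # Steps 1-2: evaluate each candidate split, keep the best attribute j
    j = max(attrs, key=lambda i: split_utility(T, i))
    u_node, u_split = node_utility(T), split_utility(T, j)
    # Step 3: stop unless the split is significantly better than the node:
    # relative error reduction > 5% and at least min_instances instances
    relative_reduction = ((1 - u_node) - (1 - u_split)) / max(1 - u_node, 1e-9)
    if len(T) < min_instances or relative_reduction <= 0.05:
        return "NB-leaf"
    # Step 4: multi-way split on all observed values of attribute j
    partitions = {}
    for x, y in T:
        partitions.setdefault(x[j], []).append((x, y))
    remaining = [i for i in attrs if i != j]
    # Step 5: recurse on each child partition
    return {"split_on": j,
            "children": {v: make_nbtree(sub, remaining, node_utility,
                                        split_utility, min_instances)
                         for v, sub in partitions.items()}}
```

In a full implementation each "NB-leaf" would hold a Naive-Bayes classifier trained on the instances reaching that node, which is what makes the result a decision tree with NBC leaves rather than class labels.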
Utility
Utility of Node
Computed by discretizing the data and computing a 5-fold cross-validation accuracy estimate of using a Naive-Bayes classifier at the node.

Utility of Split
Computed as the weighted sum of the utilities of the child nodes, where the weight given to a node is proportional to the number of instances that reach it.

Significance
A split is significant iff the relative reduction in error is greater than 5% and there are at least 30 instances in the node.

Intuition
Attempt to approximate whether the generalization accuracy of a Naive-Bayes classifier at each leaf is higher than that of a single Naive-Bayes classifier at the current node.
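The weighted-sum and significance rules translate directly into code. A small sketch (hypothetical names; node_utility stands in for the 5-fold cross-validated NBC accuracy estimate, and node_acc is assumed to be below 1):

```python
def split_utility(partitions, node_utility):
    """Utility of a split: weighted sum of per-node utilities, where each
    node's weight is the fraction of instances that reach it."""
    total = sum(len(p) for p in partitions)
    return sum(len(p) / total * node_utility(p) for p in partitions)

def is_significant(node_acc, split_acc, n_instances):
    """Split is significant iff the relative reduction in error exceeds 5%
    and at least 30 instances reach the node."""
    relative_reduction = ((1 - node_acc) - (1 - split_acc)) / (1 - node_acc)
    return relative_reduction > 0.05 and n_instances >= 30
```

For example, a split that takes accuracy from 80% to 83% cuts error from 0.20 to 0.17, a 15% relative reduction, so it is significant given at least 30 instances; the same gain at a node with only 20 instances fails the instance-count check.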
Graphs


Statistics
Average Accuracy
  C4.5          81.91%
  Naive-Bayes   81.69%
  NBTree        84.47%

Number of Nodes per Tree
  Dataset   C4.5   NBTree
  letter    2109      251
  adult     2213      137
  DNA        131        3
  led24       49        1


Summary

NBTree appears to be a viable approach to inducing classifiers where:

- Many attributes are relevant for classification
- Attributes are not necessarily independent
- The database is large
- Interpretability of the classifier is important

In practice, NBTrees are shown to scale to large databases and, in general, to outperform Decision-Trees and Naive-Bayes classifiers alone.


Appendix

For Further Reading

References

R. Kohavi. "Scaling Up the Accuracy of Naive-Bayes Classifiers: a Decision-Tree Hybrid." In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, 1996.
