
C5.1.3

Decision Tree Discovery

Ron Kohavi
ronnyk@CS.Stanford.EDU
Data Mining
Blue Martini Software
2600 Campus Dr. Suite 175
San Mateo, CA 94403

Ross Quinlan
quinlan@cse.unsw.edu.au
School of Computer Science and Engineering
University of New South Wales
Samuels Building, G08
Sydney 2052, Australia

Keywords: classification, induction, recursive partitioning, overfitting avoidance, cost-complexity pruning, pessimistic pruning, splitting criteria, decision rules

Abstract

We describe the two most commonly used systems for induction of decision trees for classification: C4.5 and CART. We highlight the methods and different decisions made in each system with respect to splitting criteria, pruning, noise handling, and other differentiating features. We describe how rules can be derived from decision trees and point to some differences in the induction of regression trees. We conclude with some pointers to advanced techniques, including ensemble methods, oblique splits, grafting, and coping with large datasets.

C5.1.3.1 C4.5

C4.5 belongs to a succession of decision tree learners that trace their origins back to the work of Hunt and others in the late 1950s and early 1960s (Hunt 1962). Its immediate predecessors were ID3 (Quinlan 1979), a simple system consisting initially of about 600 lines of Pascal, and C4 (Quinlan 1987). C4.5 has grown to about 9,000 lines of C, available on diskette with Quinlan (1993). Although C4.5 has been superseded by C5.0, a commercial system from RuleQuest Research, this discussion will focus on C4.5 since its source code is readily available.

Input and Output

Input to C4.5 consists of a collection of training cases, each having a tuple of values for a fixed set of attributes (or independent variables) A = {A1, A2, ..., Ak} and a class attribute (or dependent variable). An attribute Aa is described as continuous or discrete according to whether its values are numeric or nominal. The class attribute C is discrete and has values C1, C2, ..., Cx. The goal is to learn from the training cases a function

  DOM(A1) × DOM(A2) × ... × DOM(Ak) → DOM(C)

that maps from the attribute values to a predicted class. The distinguishing characteristic of learning systems is the form in which this function is expressed [link to section B2.4]. We focus here on decision trees, a recursive structure that is either a leaf node labelled with a class value, or a test node that has two or more outcomes, each linked to a subtree. Figure C5.1.3.1 shows a simple example in which the tests appear in ovals, the leaves in boxes, and the test outcomes are labels on the links.

[Figure C5.1.3.1: A simple decision tree. The root tests "previous day = ?" with outcomes fall, steady, and rise; one branch contains a further test on "100 points" with outcomes true and false; the leaves are labelled sell, hold, and buy.]

To classify a case using a decision tree, imagine a marker that is initially at the top (root) of the tree. If the marker is at a test node, the outcome of that test is determined and the marker moved to the top of the subtree for that outcome. If the marker is at a leaf, the label associated with that leaf becomes the predicted class.
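The marker-walking procedure above can be sketched in a few lines. This is an illustrative sketch only, not C4.5's actual data structures; the dictionary-based node layout and the attribute names ("previous_day", "dropped_100_points") are assumptions.

```python
# Hypothetical node layout: a leaf carries a class label; a test node names
# the attribute it tests and maps each outcome to a subtree.

def classify(node, case):
    """Walk from the root, following the outcome of each test, until a
    leaf is reached; the leaf's label is the predicted class."""
    while not node["leaf"]:
        outcome = case[node["test_attribute"]]  # determine the test outcome
        node = node["subtrees"][outcome]        # move the marker to that subtree
    return node["label"]

# A small tree loosely in the spirit of Figure C5.1.3.1 (labels are made up):
tree = {
    "leaf": False, "test_attribute": "previous_day",
    "subtrees": {
        "fall": {"leaf": False, "test_attribute": "dropped_100_points",
                 "subtrees": {True: {"leaf": True, "label": "sell"},
                              False: {"leaf": True, "label": "hold"}}},
        "steady": {"leaf": True, "label": "sell"},
        "rise": {"leaf": True, "label": "buy"},
    },
}
print(classify(tree, {"previous_day": "rise"}))  # prints "buy"
```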

Divide and Conquer

Decision tree learners use a method known as divide and conquer to construct a suitable tree from a training set S of cases:

- If all the cases in S belong to the same class (Cj, say), the decision tree is a leaf labelled with Cj.

- Otherwise, let B be some test with outcomes b1, b2, ..., bt that produces a non-trivial partition of S, and denote by Si the set of cases in S that has outcome bi of B. The decision tree is a node containing the test B, with the link for each outcome bi leading to the subtree Ti, where Ti is the result of growing a decision tree for the cases in Si.

Candidate Tests

C4.5 uses tests of three types, each involving only a single attribute Aa. Decision regions in the instance space are thus bounded by hyperplanes, each orthogonal to one of the attribute axes. If Aa is a discrete attribute with z values, possible tests are:

- "Aa = ?" with z outcomes, one for each value of Aa. (This is the default.)

- "Aa in G?" with g outcomes, 2 ≤ g ≤ z, where G = {G1, G2, ..., Gg} is a partition of the values of attribute Aa. Tests of this kind are found by a greedy search for a partition G that maximizes the value of the splitting criterion (discussed below).

If Aa has numeric values, the form of the test is "Aa ≤ θ" with outcomes true and false, where θ is a constant threshold. Possible values of θ are found by sorting the distinct values of Aa that appear in S, then identifying one threshold between each pair of adjacent values. (So, if the cases in S have d distinct values for Aa, d-1 thresholds are considered.)

Selecting Tests

In the divide and conquer algorithm, any test B that partitions S non-trivially will lead to a decision tree, but different Bs give different trees. Most learning systems attempt to keep the tree as small as possible because smaller trees are more easily understood and, by Occam's Razor arguments, are likely to have higher predictive accuracy (see, for instance, Quinlan & Rivest (1989)). Since it is infeasible to guarantee the minimality of the tree (Hyafil & Rivest 1976), C4.5 relies on greedy search, selecting the candidate test that maximizes a heuristic splitting criterion. Two such criteria are used in C4.5: information gain and gain ratio.

Let RF(Cj, S) denote the relative frequency of cases in S that belong to class Cj. The information content of a message that identifies the class of a case in S is then

  I(S) = - Σ_{j=1..x} RF(Cj, S) log(RF(Cj, S)).

After S is partitioned into subsets S1, S2, ..., St by a test B, the information gained is then

  G(S, B) = I(S) - Σ_{i=1..t} (|Si| / |S|) I(Si).    (C5.1.3.1)

The gain criterion chooses the test B that maximizes G(S, B). A problem with this criterion is that it favors tests with numerous outcomes; for example, G(S, B) is maximized by a test in which each Si contains a single case. The gain ratio criterion sidesteps this problem by also taking into account the potential information from the partition itself:

  P(S, B) = - Σ_{i=1..t} (|Si| / |S|) log(|Si| / |S|).    (C5.1.3.2)

Gain ratio then chooses, from among the tests with at least average gain, the test B that maximizes G(S, B) / P(S, B).
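Equations C5.1.3.1 and C5.1.3.2 can be computed directly from class counts. A minimal sketch, assuming a list-based case representation and base-2 logarithms; the function names are illustrative, not C4.5's:

```python
import math
from collections import Counter

def info(classes):
    """I(S): expected information (entropy) of the class distribution."""
    n = len(classes)
    return -sum((c / n) * math.log2(c / n) for c in Counter(classes).values())

def gain_and_ratio(classes, outcomes):
    """G(S, B) and G(S, B) / P(S, B) for a test B, given each case's class
    and its outcome of B as two parallel lists."""
    n = len(classes)
    subsets = {}
    for cls, out in zip(classes, outcomes):
        subsets.setdefault(out, []).append(cls)
    gain = info(classes) - sum(len(s) / n * info(s) for s in subsets.values())
    split_info = -sum(len(s) / n * math.log2(len(s) / n)
                      for s in subsets.values())
    return gain, (gain / split_info if split_info else 0.0)

# A two-outcome test that separates the classes perfectly:
g, r = gain_and_ratio(["y", "y", "n", "n"], ["a", "a", "b", "b"])
# A test with one outcome per case has the same gain but a lower gain ratio,
# illustrating the bias that the gain ratio criterion corrects:
g2, r2 = gain_and_ratio(["y", "y", "n", "n"], ["a", "b", "c", "d"])
```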

Missing Values

Missing attribute values are a common occurrence in data, either through errors made when the values were recorded or because they were judged irrelevant to the particular case. Such lacunae affect both the way that a decision tree is constructed and its use to classify a new case.

When a decision tree is constructed from a set S of training cases, the divide and conquer algorithm selects a test on which to partition S. Let B be a potential test based on attribute Aa with outcomes b1, b2, ..., bt. Denote by S0 the subset of cases in S whose values of Aa are unknown, and hence whose outcome of B cannot be determined, and, as before, let Si denote those cases with (known) outcome bi of B. The information gained by B is reduced because we learn nothing about the cases in S0; Equation C5.1.3.1 now becomes

  G(S, B) = ((|S| - |S0|) / |S|) × G(S - S0, B).

The split information is increased to reflect the additional "outcome" of the test (namely, the fact that it cannot be determined for the cases in S0); Equation C5.1.3.2 is modified to

  P(S, B) = - (|S0| / |S|) log(|S0| / |S|) - Σ_{i=1..t} (|Si| / |S|) log(|Si| / |S|).

Both changes have the effect of reducing the desirability of tests involving attributes with a high proportion of missing values.

When a test B has been chosen, C4.5 does not build a separate decision tree for the cases in S0. Instead, they are notionally fragmented into fractional cases and added to the subsets corresponding to known outcomes: the cases in S0 are added to each Si with weight |Si| / |S - S0|.

Missing attribute values also complicate the use of the decision tree to classify a case. Instead of a single class, the initial result of the classification is a class probability distribution, determined as follows. Let CP(T, Y) denote the result of classifying case Y with decision tree T:

- If T is a leaf, CP(T, Y) is the relative frequency of training cases that reach T.

- If T is a tree whose root is test B, and the outcome of B on case Y is known (bi, say), then CP(T, Y) = CP(Ti, Y) where, as before, Ti is the decision tree for outcome bi.

- Otherwise, all outcomes of B are explored and combined probabilistically, giving

  CP(T, Y) = Σ_{i=1..t} (|Si| / |S - S0|) CP(Ti, Y).

Note that the weight of the subtree Ti depends on the proportion of training cases that have outcome bi, interpreted as the prior probability of that outcome.
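The fractional-case bookkeeping for training can be sketched as follows. This is an illustrative sketch, not C4.5's code: the (case, weight) pair representation and the use of None to signal an unknown outcome are assumptions.

```python
def partition_with_missing(cases, outcome_of):
    """Partition weighted (case, weight) pairs on a test. A case whose test
    outcome is unknown (None) is fragmented into every subset Si with
    weight w * |Si| / |S - S0|, where sizes are measured in total weight."""
    subsets, unknown = {}, []
    for case, w in cases:
        out = outcome_of(case)
        if out is None:
            unknown.append((case, w))
        else:
            subsets.setdefault(out, []).append((case, w))
    known_w = sum(w for s in subsets.values() for _, w in s)   # |S - S0|
    si_w = {out: sum(w for _, w in s) for out, s in subsets.items()}  # |Si|
    for case, w in unknown:
        for out, s in subsets.items():
            s.append((case, w * si_w[out] / known_w))  # fractional case
    return subsets

# Three cases with known outcomes (a, a, b) and one with a missing value:
cases = [({"x": "a"}, 1.0), ({"x": "a"}, 1.0), ({"x": "b"}, 1.0),
         ({"x": None}, 1.0)]
parts = partition_with_missing(cases, lambda c: c["x"])
# The unknown case contributes weight 2/3 to subset "a" and 1/3 to "b".
```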

When the class probability distribution resulting from classifying case Y with the decision tree has been determined, the class with the highest probability is chosen as the predicted class.

Avoiding Overfitting

The divide and conquer algorithm partitions the data until every leaf contains cases of a single class, or until further partitioning is impossible because two cases have the same values for each attribute but belong to different classes. Consequently, if there are no conflicting cases, the decision tree will correctly classify all training cases. This so-called overfitting is generally thought to lead to a loss of predictive accuracy in most applications (Quinlan 1986).

Overfitting can be avoided by a stopping criterion that prevents some sets of training cases from being subdivided (usually on the basis of a statistical test of the significance of the best test), or by removing some of the structure of the decision tree after it has been produced. Most authors agree that the latter is preferable since it allows potential interactions among attributes to be explored before deciding whether the result is worth keeping. C4.5 employs a mechanism of the latter kind; before discussing it, we introduce the heuristic on which it is based.

Estimating True Error Rates

Consider some classifier Z formed from a subset S of training cases, and suppose that Z misclassifies M of the cases in S. The true error rate of Z is its error rate over the entire universe from which the training cases were sampled.

The true error rate is usually markedly higher than the classifier's resubstitution error rate on the training cases (here M/|S|), which might be near zero for an unpruned decision tree. The true error rate is often estimated by measuring Z's error rate on a collection of unseen cases that were not used in its construction; this is the best strategy when a substantial set of unseen cases is available. In many applications, though, data is scarce and all of it is needed to construct the classifier. C4.5 estimates the true error rate of Z using only the values M and |S| from the training set, as follows.

Now, the classifier Z can be viewed as causing M error events in |S| "trials". If an event occurs M times in N trials, the ratio M/N is an estimate of the probability p of the event. We can go further and derive confidence limits for p: for a given confidence CF, an upper limit pr can be found such that p ≤ pr with probability 1-CF. Following (Diem 1962, page 185), pr satisfies the following equations:

  for M = 0:  (1 - pr)^N = CF
  for M > 0:  Σ_{i=0..M} C(N, i) pr^i (1 - pr)^(N-i) = CF

(The same source gives a quickly-computable approximation for pr in the latter case.) In the following, we will use UCF(M, N) to denote the error bound pr above. Since Z was constructed to fit the cases in S, and so tends to minimize the apparent error rate, the upper bound pr is used as a more conservative estimate of the error rate of Z on unseen cases. C4.5 uses a default CF value of 0.25, but this can be altered to cause higher or lower levels of pruning.

Pruning Decision Trees

After a decision tree is produced by the divide and conquer algorithm, C4.5 prunes it in a single bottom-up pass. Let T be a non-leaf decision tree, produced from a training set S, of the form of a root test B whose outcomes b1, b2, ..., bt lead to subtrees T1, T2, ..., Tt,
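The bound UCF(M, N) can be computed numerically by inverting the binomial tail probability above. C4.5 itself uses the quick approximation mentioned in the text, so the bisection below is an illustrative sketch rather than its actual routine:

```python
import math

def binom_cdf(m, n, p):
    """P(X <= m) for X ~ Binomial(n, p): the left-hand side of the
    defining equations for pr."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i)
               for i in range(m + 1))

def ucf(m, n, cf=0.25):
    """Upper confidence limit U_CF(M, N): the error probability pr at which
    observing <= m errors in n trials has probability exactly cf."""
    lo, hi = m / n, 1.0            # binom_cdf decreases as p grows
    for _ in range(60):            # bisection to ample precision
        mid = (lo + hi) / 2
        if binom_cdf(m, n, mid) > cf:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# For M = 0 the bound has the closed form 1 - CF**(1/N), which the
# bisection reproduces:
bound = ucf(0, 10)  # about 0.129 for CF = 0.25
```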

where each Ti has already been pruned. Let L be a leaf labelled with the most frequent class in S, and let Tf be the subtree corresponding to the most frequent outcome of B. Let the numbers of cases in S misclassified by T, L, and Tf be ET, EL, and ETf respectively. C4.5's tree pruning algorithm considers the three corresponding estimated error rates UCF(ET, |S|), UCF(EL, |S|), and UCF(ETf, |S|). Depending on which is lowest, C4.5 leaves T unchanged, replaces T by the leaf L, or replaces T by its subtree Tf. This form of pruning is computationally efficient and gives quite reasonable results in most applications.

C5.1.3.2 CART

CART, an acronym for Classification And Regression Trees, is described in the book by Breiman, Friedman, Olshen & Stone (1984). CART® is also the name of the system currently implementing the methodology described in that book; it is sold by Salford Systems. The use of trees in the statistical community dates back to AID (Automatic Interaction Detection) by Morgan & Sonquist (1963), and to later work on THAID by Morgan and Messenger in the early 1970s. The basic methodology of divide and conquer described for C4.5 is also used in CART. The differences are in the tree structure, the splitting criteria, the pruning method, and the way missing values are handled.

Tree Structure

CART constructs trees that have only binary splits. This restriction simplifies the splitting criterion because there need not be a penalty for multi-way splits (Kononenko 1995b). Furthermore, if the label is binary, the binary split restriction allows CART to optimally partition a categorical attribute (minimizing any concave splitting criterion) into two subsets of values in time that is linear in the number of attribute values (Breiman et al. 1984, Theorem 4.5). The restriction has its disadvantages, however. The tree may be less interpretable, with multiple splits occurring on the same attribute at adjacent levels, and there may be no good binary split on an attribute that has a good multi-way split (Kononenko 1995a), which may lead to inferior trees.

Splitting Criteria

CART uses the Gini diversity index as a splitting criterion. Let RF(Cj, S) denote the relative frequency of cases in S that belong to class Cj. The Gini index is defined as:

  Igini(S) = 1 - Σ_{j=1..x} RF(Cj, S)^2

and the gain due to a split is computed as in Equation C5.1.3.1, with Igini in place of I.

The interesting observation about the Gini diversity index is that it minimizes the resubstitution estimate for the mean squared error. A class probability tree predicts a class distribution for an example instead of a single class. For each class j, let Cj(e) be the indicator variable that is one if the class for the example e is j and zero otherwise, and let Pj(e) represent the probability assigned to class j for example e by the probabilistic classifier. The common measure of performance for a class probability tree is the mean squared error, or MSE, defined as:

  MSE = E_e [ Σ_{j=1..x} (Cj(e) - Pj(e))^2 ]

where the expectation is over all examples.

CART also supports the twoing splitting criterion, which can be used for multi-class problems. At each node, the classes are separated into two superclasses containing disjoint and mutually exhaustive classes. A splitting criterion for a two-class problem is then used to find the attribute and the two superclasses that optimize the two-class criterion. The approach gives "strategic" splits in the sense that several classes that are similar are grouped together.
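The Gini index and the weighted impurity of a candidate binary split can be computed directly from class labels. A minimal sketch; the function names and list-of-labels representation are assumptions:

```python
from collections import Counter

def gini(classes):
    """Gini diversity index Igini(S) = 1 - sum_j RF(Cj, S)^2."""
    n = len(classes)
    return 1.0 - sum((c / n) ** 2 for c in Counter(classes).values())

def gini_of_split(left, right):
    """Weighted Gini impurity of a binary split; CART prefers the split
    (attribute and cut point) minimizing this quantity."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

print(gini(["buy", "buy", "sell", "sell"]))  # prints 0.5
# A split that separates the classes perfectly has impurity 0:
pure = gini_of_split(["buy", "buy"], ["sell", "sell"])
```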

Pruning

CART uses a pruning technique called minimal cost complexity pruning, which assumes that the bias in the resubstitution error of a tree increases linearly with the number of leaf nodes. The cost assigned to a subtree is the sum of two terms: the resubstitution error and the number of leaves times a complexity parameter α. Formally,

  R_α(T) = R(T) + α × number-of-leaves.

It can be shown that, for every value of α, there exists a unique smallest tree minimizing R_α (Breiman et al. 1984, Proposition 3.7). Note that, although α runs through a continuum of values, there are at most a finite number of possible subtrees. There is thus a sequence of trees T1, T2, ..., Tt minimizing R_α, created by varying α from zero to infinity. The trees are nested: each tree is contained in the previous one. An efficient "weakest-link" pruning algorithm can be constructed to compute Tk+1 from Tk.

Given the sequence of trees, selecting the best α to minimize the true error can be done by setting aside a holdout set (e.g., a third of the data) and constructing T1, T2, ..., Tt from the data, excluding the holdout set. The examples in the holdout set can then be classified using each tree, giving an estimate of the true error of each tree. The α matching the tree that minimizes the error can then be used as the pruning parameter to prune the tree built from the whole dataset. If the dataset is small, however, the error estimates have high variance and precious data (held out) are not used in building the initial tree. CART therefore uses 10-fold cross-validation (Stone 1974, Kohavi 1995) to improve the error estimates and utilize more data. The procedure for pruning using 10-fold cross-validation is more complex, since multiple trees must be built and pruned.

The pruning step described above is relatively fast, but because a holdout set was used, a second tree is usually built using the whole dataset in order to make efficient use of all the data; building a second tree effectively doubles the induction time. The time complexity of the pruning step when 10-fold cross-validation is used is a factor of 10 more expensive than C4.5's pruning, but it does tend to produce smaller trees (Oates & Jensen 1997). We refer the reader to Breiman et al. (1984) for details.
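The quantity driving weakest-link pruning is g(t) = (R(t) - R(T_t)) / (leaves(T_t) - 1), the value of α at which collapsing internal node t to a leaf first stops increasing the cost R_α; the node with the smallest g(t) is pruned first. A sketch on a hypothetical dict-based tree, assuming resubstitution errors are stored as misclassification counts, with each node carrying the error it would incur as a leaf:

```python
def leaves_and_error(node):
    """Return (number of leaves, resubstitution error R) of a subtree."""
    if not node.get("children"):
        return 1, node["error"]
    totals = [leaves_and_error(c) for c in node["children"]]
    return sum(l for l, _ in totals), sum(e for _, e in totals)

def weakest_link_alpha(node):
    """g(t) = (R(t) - R(T_t)) / (leaves(T_t) - 1): the complexity
    parameter at which collapsing node t into a leaf becomes worthwhile."""
    n_leaves, subtree_err = leaves_and_error(node)
    return (node["error"] - subtree_err) / (n_leaves - 1)

# A node that misclassifies 10 cases as a leaf, whose two children
# together misclassify only 5:
root = {"error": 10, "children": [{"error": 2}, {"error": 3}]}
alpha = weakest_link_alpha(root)  # (10 - 5) / (2 - 1) = 5.0
```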

Even when cross-validation is used, the pruning parameter α is sometimes unstable. Moreover, to improve comprehensibility, it is sometimes preferable to choose a smaller tree that has comparable accuracy to the best tree. CART therefore employs the "1 SE rule," which chooses the smallest tree whose estimated mean error rate is within one standard error (estimated standard deviation) of the estimated mean error rate of the best tree.

Missing Values

Unlike C4.5, CART does not penalize the splitting criterion during tree construction if examples have unknown values for the attribute used in the split; the criterion uses only those instances for which the value is known. During classification, CART finds several surrogate splits that can be used instead of the original split. The surrogates cannot be chosen based on the original splitting criterion, because the subtree at each node is constructed based on the original split selected; the surrogate splits are therefore chosen to maximize a measure of predictive association with the original split. This procedure works well if there are attributes that are highly correlated with the chosen attribute. In CART, the first surrogate split based on a known attribute value is used.

Regression Trees

As its name implies, CART also supports building regression trees. The regression tree structure is similar to a classification tree, except that each leaf predicts a real number. The resubstitution estimate is the mean squared error:

  R(S) = (1/n) Σ_i (yi - h(ti))^2

where yi is the real-valued label for example ti and h(ti) is the (real-valued) prediction. The splitting criterion is chosen to minimize the resubstitution estimate, and pruning is done in a manner similar to the cost complexity pruning described above. Regression trees are somewhat simpler than classification trees because the growing and pruning criteria used in CART are the same. In CART, each leaf predicts a constant value; model trees generalize this by building a model at each leaf (Frank, Wang, Inglis, Holmes & Witten 1998).
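The 1 SE rule reduces to a simple selection over the cross-validated candidate trees. A sketch, assuming each candidate is summarized as a (number of leaves, cross-validated error, standard error) tuple; this representation is an illustration, not CART's internal form:

```python
def one_se_rule(candidates):
    """Pick the smallest tree whose estimated error is within one
    standard error of the lowest-error tree's estimate."""
    best = min(candidates, key=lambda t: t[1])       # lowest mean error
    threshold = best[1] + best[2]                    # error + 1 SE
    within = [t for t in candidates if t[1] <= threshold]
    return min(within, key=lambda t: t[0])           # smallest such tree

# (n_leaves, cv_error, std_error) for three pruned trees:
candidates = [(17, 0.10, 0.02), (9, 0.11, 0.02), (3, 0.16, 0.03)]
chosen = one_se_rule(candidates)  # (9, 0.11, 0.02): smaller, comparable error
```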

C4.5 also contains a mechanism to re-express decision trees as ordered lists of if-then rules. Each path from the root of the tree to a leaf gives the conditions that must be satisfied if a case is to be classified by that leaf. C4.5 generalizes this prototype rule by dropping any conditions that are irrelevant to the class, guided again by the heuristic for estimating true error rates. The set of rules is reduced further based on the MDL principle described above (see C8.2.1). There are usually substantially fewer final rules than there are leaves on the tree, and yet the accuracy of the tree and the derived rules is similar. Rules have the added advantage of being more easily understood by people.

C5.1.3.3 Advanced Methods

Decision trees have been extended in many ways. We provide a short list of pointers to important topics.

1. Many different pruning methods have been proposed in the literature. Quinlan & Rivest (1989), Mehta, Rissanen & Agrawal (1995), and Wallace & Patrick (1993) describe MDL (minimum description length) and MML (minimum message length) based pruning methods. Kearns & Mansour (1998) provide a theoretically-justified pruning algorithm. Esposito, Malerba & Semeraro (1995) provide a comparison of pruning and grafting methods.

2. The induction of oblique decision trees allows tests at the nodes to be linear combinations of attributes. Murthy, Kasif & Salzberg (1994) describe an induction system that allows such tests, and Breiman et al. (1984) also describe some mechanisms for supporting such splits in CART. The main advantage of oblique splits is their ability to create splits that are not axis-orthogonal; the disadvantage is loss of comprehension.

3. Ensemble methods that build multiple trees can dramatically reduce the error, but usually result in huge structures that are incomprehensible. Breiman (1996) describes a Bagging (Bootstrap Aggregating) procedure. Schapire (1990) introduced boosting, which was later enhanced in Freund & Schapire (1995). Empirical comparisons were done in Quinlan (1996) and Bauer & Kohavi (1999).

4. Kohavi & Kunz (1997) describe option trees, which provide the advantages of voting methods yet keep a tree-like structure that can be shown to users.

5. Most implementations of decision trees require loading the data into memory. Shafer, Agrawal & Mehta (1996) describe the SPRINT algorithm, which can scale out of core.
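The prototype rules (one per root-to-leaf path, before C4.5's condition-dropping and MDL reduction steps) can be enumerated recursively. A sketch over a hypothetical dict-based node layout; the attribute and class names are made up:

```python
def tree_to_rules(node, conditions=()):
    """Each root-to-leaf path becomes one prototype if-then rule: the
    conjunction of test outcomes along the path implies the leaf's class."""
    if "label" in node:                      # leaf: emit one finished rule
        return [(list(conditions), node["label"])]
    rules = []
    for outcome, child in node["subtrees"].items():
        rules += tree_to_rules(child,
                               conditions + ((node["test"], outcome),))
    return rules

tree = {"test": "previous_day", "subtrees": {
    "rise": {"label": "buy"},
    "steady": {"label": "hold"},
    "fall": {"label": "sell"},
}}
rules = tree_to_rules(tree)
# e.g. ([("previous_day", "rise")], "buy") reads as
# "if previous_day = rise then buy".
```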

Variants and other methods for scaling to larger datasets are described in Freitas & Lavington (1998). Several commercial systems implementing decision trees also provide parallel implementations.

6. Utgoff (1997) describes how to update decision trees incrementally as more data is made available.

7. Oblivious decision trees conduct the same split across a whole level (Kohavi & Li 1995) and can be converted into a graph or a decision table (Kohavi & Sommerfield 1998).

8. Lazy decision trees (Friedman, Kohavi & Yun 1996) conceptually choose the best tree for a given test instance; in practice, only a path needs to be constructed.

9. Decision trees choose a split and do not revisit choices. Lookahead methods are described in Murthy & Salzberg (1995).

10. Most algorithms assume 0-1 costs for mistakes. Descriptions of how to generalize to loss matrices are given in Breiman et al. (1984), and a comparison is given in Pazzani, Merz, Murphy, Ali, Hume & Brunk (1994). More information can be found in Turney (1997).

References

Bauer, E. & Kohavi, R. (1999). 'An empirical comparison of voting classification algorithms: Bagging, boosting, and variants', Machine Learning 36, 105-139.

Breiman, L. (1996). 'Bagging predictors', Machine Learning 24, 123-140.

Breiman, L., Friedman, J. H., Olshen, R. A. & Stone, C. J. (1984). Classification and Regression Trees, Wadsworth International Group.

Diem, K., ed. (1962). Documenta Geigy Scientific Tables, Geigy Pharmaceuticals, New York.

Esposito, F., Malerba, D. & Semeraro, G. (1995). Simplifying decision trees by pruning and grafting: New results, in N. Lavrac & S. Wrobel, eds, 'Machine Learning: ECML-95 (Proc. European Conf. on Machine Learning, 1995)', Lecture Notes in Artificial Intelligence 914, Springer Verlag, Berlin, Heidelberg, New York, pp. 287-290.

Frank, E., Wang, Y., Inglis, S., Holmes, G. & Witten, I. (1998). 'Using model trees for classification', Machine Learning 32(1), 63-76.

Freitas, A. & Lavington, S. (1998). Mining Very Large Databases With Parallel Processing, Kluwer Academic Publishers.

Freund, Y. & Schapire, R. (1995). A decision-theoretic generalization of on-line learning and an application to boosting, in 'Proceedings of the Second European Conference on Computational Learning Theory', Springer-Verlag, pp. 23-37. To appear in Journal of Computer and System Sciences.

Friedman, J., Kohavi, R. & Yun, Y. (1996). Lazy decision trees, in 'Proceedings of the Thirteenth National Conference on Artificial Intelligence', AAAI Press and the MIT Press, pp. 717-724.

Hunt, E. B. (1962). Concept Learning: An Information Processing Problem, Wiley.

Hyafil, L. & Rivest, R. L. (1976). 'Constructing optimal binary decision trees is NP-complete', Information Processing Letters 5(1), 15-17.

Kearns, M. & Mansour, Y. (1998). A fast, bottom-up decision tree pruning algorithm with near-optimal generalization, in J. Shavlik, ed., 'Machine Learning: Proceedings of the Fifteenth International Conference', Morgan Kaufmann Publishers, Inc., pp. 269-277.

Kohavi, R. (1995). A study of cross-validation and bootstrap for accuracy estimation and model selection, in C. Mellish, ed., 'Proceedings of the 14th International Joint Conference on Artificial Intelligence', Morgan Kaufmann, pp. 1137-1143. http://robotics.stanford.edu/~ronnyk

Kohavi, R. & Kunz, C. (1997). Option decision trees with majority votes, in D. Fisher, ed., 'Machine Learning: Proceedings of the Fourteenth International Conference', Morgan Kaufmann Publishers, Inc., pp. 161-169.

Kohavi, R. & Li, C.-H. (1995). Oblivious decision trees, graphs, and top-down pruning, in C. Mellish, ed., 'Proceedings of the 14th International Joint Conference on Artificial Intelligence', Morgan Kaufmann, pp. 1071-1077.

Kohavi, R. & Sommerfield, D. (1998). Targeting business users with decision table classifiers, in R. Agrawal, P. Stolorz & G. Piatetsky-Shapiro, eds, 'Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining', AAAI Press, pp. 249-253. Available at http://robotics.stanford.edu/users/ronnyk

Kononenko, I. (1995a). A counter example to the stronger version of the binary tree hypothesis, in 'ECML-95 workshop on Statistics, machine learning, and knowledge discovery in databases', pp. 31-36.

Kononenko, I. (1995b). On biases in estimating the multivalued attributes, in C. Mellish, ed., 'Proceedings of the 14th International Joint Conference on Artificial Intelligence', Morgan Kaufmann. http://ai.fri.uni-lj.si/papers/index.html

Mehta, M., Rissanen, J. & Agrawal, R. (1995). MDL-based decision tree pruning, in U. Fayyad & R. Uthurusamy, eds, 'Proceedings of the First International Conference on Knowledge Discovery and Data Mining', pp. 216-221.

Morgan, J. N. & Sonquist, J. A. (1963). 'Problems in the analysis of survey data, and a proposal', Journal of the American Statistical Association 58, 415-434.

Murthy, S. K., Kasif, S. & Salzberg, S. (1994). 'A system for the induction of oblique decision trees', Journal of Artificial Intelligence Research 2, 1-33.

Murthy, S. & Salzberg, S. (1995). Lookahead and pathology in decision tree induction, in C. Mellish, ed., 'Proceedings of the 14th International Joint Conference on Artificial Intelligence', Morgan Kaufmann, pp. 1025-1031.

Oates, T. & Jensen, D. (1997). The effects of training set size on decision tree complexity, in D. Fisher, ed., 'Machine Learning: Proceedings of the Fourteenth International Conference', Morgan Kaufmann, pp. 254-262.

Pazzani, M., Merz, C., Murphy, P., Ali, K., Hume, T. & Brunk, C. (1994). Reducing misclassification costs, in 'Machine Learning: Proceedings of the Eleventh International Conference', Morgan Kaufmann.

Quinlan, J. R. (1979). Discovering rules from large collections of examples: A case study, in D. Michie, ed., 'Expert Systems in the Micro Electronic Age', Edinburgh University Press.

Quinlan, J. R. (1986). 'Induction of decision trees', Machine Learning 1, 81-106. Reprinted in Shavlik and Dietterich (eds.), Readings in Machine Learning.

Quinlan, J. R. (1987). Inductive knowledge acquisition: A case study, in 'Applications of Expert Systems', Addison-Wesley, chapter 9, pp. 157-173.

5. R. 197{227. R.5: Programs for Machine Learning. J. 16 . E. (1993). E. (1996). Wallace. L. C4. (1997). Bagging. 227{248.5. Turney. Quinlan. J. P. Utgo . (1989). `The strength of weak learnability'. pp. M. R.. Journal of the Royal Statistical Society B 36.nrc. Machine Learning 29. and c4. Stone. & Mehta. in `Proceedings of the Thirteenth National Conference on Arti cial Intelligence'. (1990). R. Quinlan. & Patrick. California. J. 725{730. AAAI Press and the MIT Press. Cost-sensitive learning. (1996). & Rivest. Shafer. http://ai. J. Machine Learning 5(2). `Cross-validatory choice and assessment of statistical predictions'. R. 7{22.ca/bibliographies/cost-sensitive. 111{147. J. Machine Learning 11. M. C. (1997). `Decision tree induction based on e cient tree restructuring'. P.Quinlan.iit. `Inferring decision trees using the minimum description length principle'. boosting. in `Proceedings of the 22nd International Conference on Very Large Databases (VLDB)'. Sprint: a scalable prallel classi er for data mining. `Coding decision trees'.html . Schapire. (1993). (1974). Information and Computation 80. San Mateo. R. Agrawal. Morgan Kaufmann.
