Evaluating Probability Estimates from Decision Trees

Nitesh V. Chawla and David A. Cieslak
Department of Computer Science and Engineering
University of Notre Dame, IN 46556

Abstract

Decision trees, a popular choice for classification, are limited in the quality of the probability estimates they provide. Typically, smoothing methods such as the Laplace or m-estimate are applied at the decision tree leaves to overcome the systematic bias introduced by the frequency-based estimates. An ensemble of decision trees has also been shown to reduce the bias and variance in the leaf estimates, resulting in better calibrated probabilistic predictions. In this work, we evaluate the calibration or quality of these estimates using various loss measures. We also examine the relationship between the quality of such estimates and the resulting rank-ordering of test instances. Our results quantify the impact of smoothing in terms of the loss measures, and its coupled relationship with the AUC measure.

Introduction

Decision trees typically produce crisp classifications; that is, the leaves carry decisions for individual classes. However, this is insufficient for various applications, which require a score output from a supervised learning method to rank-order the instances. For instance, consider the classification of pixels in mammogram images as possibly cancerous (Chawla et al. 2002). A typical mammography dataset might contain 98% normal pixels and 2% abnormal pixels. A simple default strategy of guessing the majority class would give a predictive accuracy of 98%. Ideally, a fairly high rate of correct cancerous predictions is required, while allowing a small to moderate error rate for the majority class. It is more costly to predict a cancerous case as non-cancerous than the converse. Thus, a probabilistic estimate or ranking of cancerous cases can be decisive for the practitioner: the cost of further tests can be decreased by thresholding the patients at a particular rank. Secondly, probabilistic estimates allow one to threshold class membership at values < 0.5.

Thus, the classes assigned at the leaves of a decision tree have to be appropriately converted to reliable probabilistic estimates. However, the leaf frequencies can require smoothing to improve the "quality" of the estimates (Provost & Domingos 2003; Pazzani et al. 1994; Smyth, Gray, & Fayyad 1995; Bradley 1997; Ferri, Flach, & Hernandez-Orallo 2003; Margineantu & Dietterich 2001). A classifier is considered to be well-calibrated if the predicted probability approaches the empirical probability as the number of predictions goes to infinity (DeGroot & Fienberg 1983). Previous work has pointed out that probability estimates derived from leaf frequencies are not appropriate for ranking test examples (Zadrozny & Elkan 2001); this is mitigated by applying smoothing to the leaf estimates. On the other hand, a Naive Bayes classifier can produce poor probability estimates but still result in a more useful ranking than probabilistic decision trees (Domingos & Pazzani 1996). This calls for a careful evaluation of the "quality of probability estimates".

The focus of our paper is to measure the probability estimates using different losses. We also want to quantify the improvement offered by the smoothing methods when applied to the leaf frequencies: is there a significant decrement in the losses, implying an improvement in the quality (or calibration) of the estimates? Finally, we empirically motivate the relationship between the quality of estimates produced by decision trees and the rank-ordering of the test instances. That is, does the AUC improve once the model is well-calibrated? We believe it is important to study and quantify the relationship between the calibration of decision trees and the resulting rank-ordering. We implemented the following loss measures (the equations are provided in the subsequent sections):

• Negative Cross Entropy (NCE). This measure was also utilized for evaluating losses in the NIPS 2003 Challenge on Evaluating Predictive Uncertainty (Candella 2004).
• Quadratic Loss (QL).
• Error (0/1 Loss: 01L).

We then correlate these with ROC curve analysis (Bradley 1997; Provost & Fawcett 2001). We use ROC curves and the resulting AUC to demonstrate the sensitivity of ranking to the probabilistic estimates. Due to space constraints the ROC curves are not directly presented, but the AUC is included as an indicator of the rank-ordering achieved on the test examples. This leads us to the question: Is there an empirically justified correlation between the loss measures and AUC?

Copyright 2006, American Association for Artificial Intelligence (www.aaai.org). All rights reserved.
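The notion of well-calibration above (the predicted probability approaching the empirical probability) can be checked empirically by binning predictions and comparing each bin's mean predicted probability with its empirical positive rate. The function below is an illustrative sketch of that idea, not part of the paper's methodology:

```python
def calibration_table(labels, probs, bins=10):
    # Group predictions into equal-width bins and compare the mean
    # predicted probability with the empirical positive rate per bin.
    # For a well-calibrated model the two columns should agree.
    table = []
    for b in range(bins):
        lo, hi = b / bins, (b + 1) / bins
        members = [(y, p) for y, p in zip(labels, probs)
                   if lo <= p < hi or (b == bins - 1 and p == 1.0)]
        if members:
            mean_pred = sum(p for _, p in members) / len(members)
            empirical = sum(y for y, _ in members) / len(members)
            table.append((round(mean_pred, 3), round(empirical, 3)))
    return table

print(calibration_table([1, 0, 1, 0], [0.9, 0.1, 0.8, 0.2]))
# → [(0.1, 0.0), (0.2, 0.0), (0.8, 1.0), (0.9, 1.0)]
```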

Probabilistic Decision Trees with C4.5

Aiming to perfectly classify the given set of training examples, a decision tree may overfit the training set. Overfitting is typically circumvented by deploying various pruning methodologies. While pruning improves the generalization of a decision tree, it can give poorer probability estimates, as all the examples belonging to a given leaf are assigned the same estimate. Pruning is equivalent to coalescing the different decision regions obtained by thresholding at feature values; this can result in coarser probability estimates at the leaves. Moreover, pruning deploys methods that typically maximize accuracy, while the resulting estimates can be systematically skewed towards 0 and 1, as the leaves are essentially dominated by one class. We therefore let the trees grow fully to get precise estimates.

Consider the confusion matrix given in Figure 1, where TP is the number of true positives at a leaf and FP is the number of false positives. The probabilistic (frequency-based) estimate at a decision tree leaf for a class y is:

P(y|x) = TP / (TP + FP)    (1)

Figure 1: Confusion Matrix

                  Predicted Negative   Predicted Positive
Actual Negative          TN                   FP
Actual Positive          FN                   TP

However, simply using the frequency derived from the correct counts of classes at a leaf might not give sound probabilistic estimates (Provost & Domingos 2003; Zadrozny & Elkan 2001). A small leaf can potentially give optimistic estimates for classification purposes. For instance, consider leaves with the following (TP, FP) distributions: (5, 0) and (50, 0). The frequency-based estimate gives the same probability of 1 to both leaves, even though the evidence for the second is much stronger; assigning a probabilistic estimate of 1 to the (5, 0) leaf is not very sound. In addition, the relative coverage of the leaves and the original class distribution are not taken into consideration.

Smoothing Leaf Frequencies

Smoothing the frequency-based estimates can mitigate the aforementioned problems (Provost & Domingos 2003). One way of improving the probability estimates given by an unpruned decision tree is to smooth them to make them less extreme, for example by using the Laplace estimate (Provost & Domingos 2003), which can be written as follows:

P(y|x) = (TP + 1) / (TP + FP + C)    (2)

The Laplace estimate introduces a prior probability of 1/C for each class, where C is the number of classes (two for our case). Considering again the two pathological cases of TP = 5 and TP = 50 (with FP = 0), the Laplace estimates are 0.86 and 0.98, respectively, which are more reliable given the evidence. However, Laplace estimates might not be very appropriate for highly unbalanced datasets (Zadrozny & Elkan 2001). In that scenario, it can be useful to incorporate the prior of the positive class to smooth the probabilities, so that the estimates are shifted towards the minority-class base rate (b). The m-estimate (Cussens 1993) can be used as follows (Zadrozny & Elkan 2001):

P(y|x) = (TP + bm) / (TP + FP + m)    (3)

where b is the base rate or prior of the positive class, and m is the parameter controlling the shift towards b. Zadrozny and Elkan (2001) suggest choosing m, given b, such that bm = 10.

Niculescu-Mizil & Caruana (2005) explore two smoothing methods not surveyed in this paper: Platt Calibration and Isotonic Regression. Both are powerful calibration methods, which rely on minimizing a loss by searching an argument space to find improved probability estimates. A comparison between these two and the other smoothing methods is part of our future work.

Bagged Decision Trees

We use ensemble methods to further "smooth out" the probability estimates at the leaves. Bagging (Breiman 1996) has been shown to improve classifier accuracy. Bagging basically aggregates predictions (by voting or averaging) from classifiers learned on multiple bootstraps of the data. For notational purposes, let p̂_k(y|x) indicate the probability assigned by tree k to a test example x; p̂ can either be the leaf-frequency-based estimate or an estimate smoothed by Laplace or m-estimate. Each tree has a potentially different representation of the original data set, and the hyperplanes constructed for each tree will be different, as each tree is essentially constructed from a bootstrap, thus resulting in a different function for each tree. Each leaf will potentially have a different P(y|x) due to the different training set composition, while the classification assigned by the individual decision trees is effectively invariant for test examples. Averaging these estimates will improve their quality: the variance component of the error, such as a probabilistic estimate of 1 for the (5, 0) leaf, will be reduced by voting or averaging, and the overfitting will also be countered. The classification can either be done by taking the most popular class attached to the test example or by aggregating the probability estimates computed from each of the subspaces; g_y(x) averages over the probabilities assigned by the different trees to a test example, which can be written as follows:

g_y(x) = (1/K) Σ_{k=1}^{K} p̂_k(y|x)    (4)
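As a concrete illustration of equations (1)-(3), the three leaf estimates can be computed from the (TP, FP) counts at a leaf. This is an illustrative sketch, with the bm = 10 heuristic used as the default for choosing m:

```python
def frequency_estimate(tp, fp):
    # Equation (1): raw leaf frequency; skewed towards 0/1 for small leaves.
    return tp / (tp + fp)

def laplace_estimate(tp, fp, num_classes=2):
    # Equation (2): introduces a prior of 1/C for each class.
    return (tp + 1) / (tp + fp + num_classes)

def m_estimate(tp, fp, base_rate, m=None):
    # Equation (3): shifts the estimate towards the positive-class base rate b.
    # Zadrozny & Elkan (2001) suggest choosing m such that b*m = 10.
    if m is None:
        m = 10 / base_rate
    return (tp + base_rate * m) / (tp + fp + m)

# The two pathological leaves from the text: (TP, FP) = (5, 0) and (50, 0).
print(frequency_estimate(5, 0), frequency_estimate(50, 0))            # 1.0 1.0
print(round(laplace_estimate(5, 0), 2), round(laplace_estimate(50, 0), 2))  # 0.86 0.98
```

Note how the Laplace estimate separates the two leaves that the raw frequency treats identically.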

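The aggregation in equation (4) can be sketched as follows; the per-tree probabilities below are illustrative stand-ins for leaf estimates produced by trees grown on different bootstraps, not the output of a real tree induction:

```python
def bagged_estimate(per_tree_probs):
    # Equation (4): g_y(x) = (1/K) * sum_k p_k(y|x), the average of the
    # estimates assigned to one test example by the K bagged trees.
    return sum(per_tree_probs) / len(per_tree_probs)

# Hypothetical Laplace-smoothed leaf estimates for one test example
# from K = 5 bagged trees, each varying with its bootstrap sample:
probs = [0.86, 0.98, 0.75, 0.92, 0.66]
print(round(bagged_estimate(probs), 3))  # 0.834
```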
Loss Measures

As mentioned in the Introduction, we used the following three loss measures to evaluate the quality of the probability estimates. We will assume a two-class case (y ∈ {0, 1}), where x_i is a test instance and c is the actual class of x_i.

• NCE: The NCE measure is the average Negative Cross Entropy of predicting the true labels of the testing set:

NCE = -(1/n) { Σ_{i|y=1} log(p(y = 1|x_i)) + Σ_{i|y=0} log(1 - p(y = 1|x_i)) }

NCE essentially measures the proximity of the predicted values to the actual class values. It can be considered as the measure that must be minimized to obtain the maximum-likelihood probability estimates, rewarding predictions that make the best estimates of the true probability distribution. The lower the loss, the better: the minimum loss is 0, but the maximum can be infinity if p(y = 1|x_i) = 0 or 1 - p(y = 1|x_i) = 0. One word of caution with NCE is that it is undefined for log(0).

• QL: The QL measure is the average Quadratic Loss incurred on each instance in the test set:

QL = (1/n) Σ_i { 1 - 2 p(y = c|x_i) + Σ_{j∈{0,1}} p(y = j|x_i)² }

It accounts not only for the probability assigned to the actual class, but also for the probability assigned to the other possible class: the squared term sums over all possible probability values assigned to the test instance. The quadratic loss is averaged over all the test instances, and the lower the QL, the more confidence we have in predicting the actual class. The best case occurs when p(y = 1|x_i) = 1 for true label y = 1, which gives 1 - 2×1 + (1 + 0) = 0; the worst case occurs when p(y = 1|x_i) = 0, which gives 1 - 2×0 + (1 + 0) = 2.

• 01L: This is the classification error. As in the related work with probability estimation decision trees, the estimated probabilities are thresholded at 0.5 for classification.

Experiments with Unbalanced Datasets

A dataset is unbalanced if the classes are not approximately equally represented (Chawla et al. 2002; Japkowicz & Stephen 2002). There have been attempts to deal with unbalanced datasets in domains such as fraudulent telephone calls (Fawcett & Provost 1996), telecommunications management (Ezawa, Singh, & Norton 1996), text classification (Dumais et al. 1998; Mladenić & Grobelnik 1999; Cohen 1995), and detection of oil spills in satellite images (Kubat, Holte, & Matwin 1998). Distribution- or cost-sensitive applications can require a ranking or a probabilistic estimate of the instances. This brings us to another question: What is the right probabilistic estimate for unbalanced datasets?

Zadrozny & Elkan (2001) find that bagging does not always improve the probability estimates for large unbalanced datasets. However, we show that even for large and unbalanced datasets there is an improvement in the quality of the probabilistic estimates, particularly using ensemble methods such as bagging: the individual classifiers can be weaker than the aggregate or even the global classifier, and aggregation reduces the variance component of the error, thereby reducing the overall error (Breiman 1996). Our findings are in agreement with the work of Provost & Domingos (2003) and Bauer & Kohavi (1999).

Results

The main goal of our evaluation is to understand and demonstrate a) the impact of smoothing on the probability estimates, with respect to the different losses; b) the relationship between the quality of probability estimates produced by decision trees and the rank-ordering, defined by AUC; and c) the accuracy of the point predictions (error estimate or 0/1 loss). We utilize the NCE and QL losses to indicate the quality of the probability predictions.

We used C4.5 decision trees for our experiments (Quinlan 1992) and let the trees grow fully to get precise estimates. Table 1 summarizes the datasets. We divided our datasets into 70% and 30% splits for training and testing, respectively (hold-out method), and generated 30 bags for each of the datasets. Since the trees constructed from bags are unpruned, we primarily focus on the results obtained from unpruned decision trees; however, we do include some of the results using pruned trees without smoothing.

We set the unpruned decision trees with leaf-frequency-based estimates as our baseline, and then calculate the relative difference between the different smoothing techniques and this baseline:

RelativeDifference = (TreeMethod - UnprunedTree) / UnprunedTree

where TreeMethod is the measure generated from one of {Laplace, m-estimate, Bagging}, and UnprunedTree is the measure derived from the unsmoothed leaf-frequency estimate. We look for the following trends. If the difference for NCE or QL is negative, then smoothing the estimates is improving the quality of our estimates and reducing the loss. If the difference for error is positive, then the point predictions (accuracy) are deteriorating.

Figure 2 shows the distribution of p(y_i = +1|x_i) for all the positive-class (+1) examples of the mammography dataset. Each leaf is, in essence, defining its own region of probability; since the regions can be of different shapes and sizes, the estimates are broadly distributed. Similar trends to Figure 2 were observed for the other datasets. There is clear evidence of the impact of applying smoothing to the leaf estimates, which overcomes the skewness (bias) in the estimates.
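The three losses can be sketched directly from these definitions; here `labels` holds the true classes and `probs` the predicted values of p(y = 1|x_i). This is an illustrative implementation, not the authors' evaluation code:

```python
import math

def nce(labels, probs):
    # Average negative cross entropy; undefined if any log(0) occurs.
    return -sum(math.log(p) if y == 1 else math.log(1 - p)
                for y, p in zip(labels, probs)) / len(labels)

def quadratic_loss(labels, probs):
    # 1 - 2*p(actual class) + sum of squared probabilities over both classes,
    # averaged over the test instances.
    total = 0.0
    for y, p in zip(labels, probs):
        p_actual = p if y == 1 else 1 - p
        total += 1 - 2 * p_actual + (p ** 2 + (1 - p) ** 2)
    return total / len(labels)

def zero_one_loss(labels, probs):
    # Threshold the estimates at 0.5 to get point predictions.
    return sum(int(p >= 0.5) != y for y, p in zip(labels, probs)) / len(labels)

# A perfectly confident, correct prediction gives the best-case QL:
print(quadratic_loss([1], [1.0]))  # 0.0 (the worst case, p = 0.0, gives 2.0)
```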

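The AUC used as the rank-ordering measure equals P(Xp > Xn), the probability that a randomly drawn positive example is scored above a randomly drawn negative one. A minimal sketch of that pairwise computation (illustrative; ties counted as one half):

```python
def auc(labels, probs):
    # AUC = P(score of a random positive > score of a random negative),
    # computed over all positive-negative pairs, ties counted as 1/2.
    pos = [p for y, p in zip(labels, probs) if y == 1]
    neg = [p for y, p in zip(labels, probs) if y == 0]
    wins = sum((pp > pn) + 0.5 * (pp == pn) for pp in pos for pn in neg)
    return wins / (len(pos) * len(neg))

print(auc([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1]))  # 0.75
```

Note that only the ordering of the scores matters here, which is why poorly calibrated but well-ordered estimates can still give a high AUC.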
Figure 2: a) Probability distribution using the leaf frequencies as estimates; b) probability distribution after smoothing the leaf frequencies with the Laplace estimate; c) probability distribution using bagging.

Table 1: Datasets

Dataset        Number of Features   Number of Examples   Proportion of Minority (positive) Class Examples
Phoneme        5                    5,404                0.29
Adult          14                   48,842               0.24
Satimage       36                   6,435                0.097
Forest Cover   54                   38,501               0.071
Mammography    6                    11,183               0.023

Figure 3 shows the relative differences of the smoothing methods from the baseline. The results are quite compelling, and there are various observations from this figure. Smoothing invariably and consistently reduces the NCE loss. As one might expect, m-estimate does not result in improved performance over Laplace for any of the datasets; in fact, for the two most skewed datasets, mammography and covtype, m-estimate leads to an increase in the quadratic loss, as the probabilities are shifted towards the base rate. The Bagged-Laplace trees are the most consistent in their quality of probabilistic predictions based on both QL and NCE. The error rate shows an interesting trend: the Laplace estimate results in no change in error for 4 out of 5 datasets, while bagging always reduces the error rate.

We also notice a very compelling trend from these figures: AUC increases for the probability estimation decision trees. Thus, the quality of the estimates directly impacts the resulting rank-order of examples, or P(Xp > Xn), where Xp is a positive-class example. Table 2 shows the correlation coefficient between the different loss measures and AUC. There is a high negative correlation between NCE and AUC, which adds evidence to our observation that as NCE decreases, the rank-ordering of the test examples improves; an improvement in the quality of the estimates leads to an improved rank-ordering. The NCE measure is very tightly inversely correlated with the resulting AUC. The error estimate (01L) is tightly correlated with QL, but there were no such trends between 01L and AUC.

Table 2: Correlation Among the Losses and AUC

        NCE       QL        01L       AUC
NCE     1         0.8245    0.6217    -0.8663
QL      0.8245    1         0.5044    -0.4931
01L     0.6217    0.5044    1         -0.4668
AUC     -0.8663   -0.4931   -0.4668   1

Table 3 shows the results from applying default pruning. Without smoothing, the NCE of pruned trees is consistently lower than that of unpruned trees, and the AUCs of pruned trees are also better than those of unpruned trees. However, once the leaves are smoothed by the Laplace estimate, there is a definite advantage in using unpruned trees.

Summary

We show that decision trees are a viable strategy for probability estimates that can be used for rank-ordering the test examples. Smoothing by the Laplace estimate, the m-estimate, and ensembles is able to overcome the bias in the estimates arising from the axis-parallel splits of decision trees, resulting in smoother estimates and an improved rank-ordering. We demonstrated that the rank-ordering of the test instances is related to the quality of the probability estimates: the quality of the estimates directly impacts the resulting rank-ordering. Bagging effectively reduces the variance and bias in estimation. For most of the applications requiring unbalanced datasets, having reliable probability estimates is very important. Based on our results, we would recommend bagging or other ensemble generation methods with decision trees for improving the calibration of the probability estimates.

Figure 3: (a)-(e) Relative differences of the different smoothing methods from the unsmoothed leaf-frequency-based estimates. The convention in the figures is as follows: Laplace is the leaf-frequency-based estimate smoothed by the Laplace method; m-estimate is the leaf-frequency-based estimate smoothed by the m-estimate; Bagged is the averaged leaf-frequency estimate over the 30 bags; Bagged-Laplace is the averaged Laplace-smoothed estimate over the 30 bags.

Table 3: Comparison of probability estimates produced with unpruned and pruned trees. For each dataset (Phoneme, Adult, Satimage, Forest Cover, Mammography), NCE and AUC are reported for the frequency-based and Laplace estimates, each with unpruned and pruned trees.

References

Bauer, E., and Kohavi, R. 1999. An empirical comparison of voting classification algorithms: Bagging, boosting and variants. Machine Learning 36(1,2):105–139.

Bradley, A. 1997. The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition 30(6):1145–1159.

Breiman, L. 1996. Bagging predictors. Machine Learning 24(2):123–140.

Candella, J. 2004. Evaluating Predictive Uncertainty Challenge. NIPS 2004; Department of Computer Science, Katholieke Universiteit Leuven.

Chawla, N. V.; Bowyer, K.; Hall, L.; and Kegelmeyer, W. P. 2002. SMOTE: Synthetic Minority Oversampling TEchnique. Journal of Artificial Intelligence Research 16:321–357.

Cohen, W. 1995. Learning to classify English text with ILP methods. In Proceedings of the 5th International Workshop on Inductive Logic Programming, 3–24.

Cussens, J. 1993. Bayes and pseudo-Bayes estimates of conditional probabilities and their reliabilities. In Proceedings of the European Conference on Machine Learning.

DeGroot, M., and Fienberg, S. 1983. The comparison and evaluation of forecasters. Statistician 32:12–22.

Domingos, P., and Pazzani, M. 1996. Beyond independence: Conditions for the optimality of the simple Bayesian classifier. In ICML-96, 105–112. Bari, Italy: Morgan Kaufmann.

Dumais, S.; Platt, J.; Heckerman, D.; and Sahami, M. 1998. Inductive learning algorithms and representations for text categorization. In Proceedings of the Seventh International Conference on Information and Knowledge Management, 148–155.

Ezawa, K.; Singh, M.; and Norton, S. 1996. Learning goal oriented Bayesian networks for telecommunications risk management. In Proceedings of the International Conference on Machine Learning, 139–147.

Fawcett, T., and Provost, F. 1996. Combining data mining with machine learning for effective user profiling. In Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, 8–13. Portland, OR: AAAI.

Ferri, C.; Flach, P.; and Hernandez-Orallo, J. 2003. Improving the AUC of probabilistic estimation trees. In European Conference on Machine Learning, 121–132.

Japkowicz, N., and Stephen, S. 2002. The class imbalance problem: A systematic study. Intelligent Data Analysis 6(5):429–449.

Kubat, M.; Holte, R.; and Matwin, S. 1998. Machine learning for the detection of oil spills in satellite radar images. Machine Learning 30:195–215.

Margineantu, D., and Dietterich, T. 2001. Improved class probability estimates from decision tree models. In Nonlinear Estimation and Classification, 169–184.

Mladenić, D., and Grobelnik, M. 1999. Feature selection for unbalanced class distribution and Naive Bayes. In Proceedings of the 16th International Conference on Machine Learning, 258–267.

Niculescu-Mizil, A., and Caruana, R. 2005. Predicting good probabilities with supervised learning. In Proceedings of the Twenty-Second International Conference on Machine Learning, 625–632.

Pazzani, M.; Merz, C.; Murphy, P.; Ali, K.; Hume, T.; and Brunk, C. 1994. Reducing misclassification costs. In Proceedings of the Eleventh International Conference on Machine Learning, 217–225.

Provost, F., and Domingos, P. 2003. Tree induction for probability-based ranking. Machine Learning 52(3):199–215.

Provost, F., and Fawcett, T. 2001. Robust classification for imprecise environments. Machine Learning 42(3):203–231.

Quinlan, J. 1992. C4.5: Programs for Machine Learning. San Mateo, CA: Morgan Kaufmann.

Smyth, P.; Gray, A.; and Fayyad, U. 1995. Retrofitting decision tree classifiers using kernel density estimation. In Proceedings of the Twelfth International Conference on Machine Learning, 506–514.

Zadrozny, B., and Elkan, C. 2001. Learning and making decisions when costs and probabilities are both unknown. In Proceedings of the Seventh International Conference on Knowledge Discovery and Data Mining, 204–213.