effective when boosting full decision trees. Significantly better performance is obtained by boosting full decision trees with exponential loss, and then calibrating their predictions using either Platt Scaling or Isotonic Regression. Calibration with Platt Scaling or Isotonic Regression is so effective that after calibration boosted decision trees predict better probabilities than any other learning method we have compared them to, including neural nets, bagged trees, random forests, and calibrated SVMs.

Boosting and Calibration

In this section we empirically examine the relationship between boosting's predictions and posterior probabilities. One way to visualize this relationship when the true posterior probabilities are not known is through reliability diagrams (DeGroot & Fienberg 1982). To construct a reliability diagram, the prediction space is discretized into ten bins. Cases with predicted value between 0 and 0.1 are put in the first bin, between 0.1 and 0.2 in the second bin, etc. For each bin, the mean predicted value is plotted against the true fraction of positive cases in the bin. If the model is well calibrated the points will fall near the diagonal line.

The bottom row of Figure 1 shows reliability plots on a large test set after 1, 4, 8, 32, 128, and 1024 stages of boosting Bayesian smoothed decision trees (Buntine 1992). The top of the figure shows histograms of the predicted values for the same models. The histograms show that as the number of steps of boosting increases, the predicted values are pushed away from 0 and 1 and tend to collect on either side of the decision surface. This shift away from 0 and 1 hurts calibration and yields sigmoid-shaped reliability plots.

Figure 2 shows histograms and reliability diagrams for boosted decision trees after 1024 steps of boosting on eight test problems. (See the Empirical Results section for more detail about these problems.) The figures present results measured on large independent test sets not used for training. For seven of the eight data sets the predicted values after boosting do not approach 0 or 1. The one exception is LETTER.P1, a highly skewed data set that has only 3% positive class. On this problem some of the predicted values do approach 0, though careful examination of the histogram shows that there is a sharp drop in the number of cases predicted to have probability near 0.

All the reliability plots in Figure 2 display sigmoid-shaped reliability diagrams, motivating the use of a sigmoid to map the predictions to calibrated probabilities. The functions fitted with Platt's method and Isotonic Regression are shown in the middle and bottom rows of the figure.

Calibration

In this section we describe three methods for calibrating predictions from AdaBoost: Logistic Correction, Platt Scaling, and Isotonic Regression.

Logistic Correction

Before describing Logistic Correction, it is useful to briefly review AdaBoost. Start with each example in the train set (xi, yi) having equal weight. At each step i a weak learner hi is trained on the weighted train set. The error of hi determines the model weight αi and the future weight of each training example. There are two equivalent formulations. The first formulation, also used by Friedman, Hastie, and Tibshirani (2000), assumes yi ∈ {−1, 1} and hi ∈ {−1, 1}. The output of the boosted model is:

    F(x) = Σ_{i=1}^T αi hi(x)    (1)

Friedman et al. show that AdaBoost builds an additive logistic regression model for minimizing E(exp(−yF(x))). They show that E(exp(−yF(x))) is minimized by:

    F(x) = (1/2) log( P(y = 1|x) / P(y = −1|x) )    (2)

This suggests applying a logistic correction in order to get back the conditional probability:

    P(y = 1|x) = 1 / (1 + exp(−2F(x)))    (3)

As we will see in the Empirical Results section, this logistic correction works well when boosting simple base learners such as decision stumps. However, if the base learners are powerful enough that the training data becomes fully separable, after correction the predictions will become only 0's and 1's (Rosset, Zhu, & Hastie 2004) and thus have poor calibration.

Platt Calibration

An equivalent formulation of AdaBoost assumes yi ∈ {0, 1} and hi ∈ {0, 1}. The output of the boosted model is:

    f(x) = ( Σ_{i=1}^T αi hi(x) ) / ( Σ_{i=1}^T αi )    (4)

We use this model for Platt Scaling and Isotonic Regression, and treat f(x) as the raw (uncalibrated) prediction.

Platt (1999) uses a sigmoid to map SVM outputs on [−∞, +∞] to posterior probabilities on [0, 1]. The sigmoidal shape of the reliability diagrams in Figure 2 suggests that the same calibration method should be effective at calibrating boosting predictions. In this section we closely follow the description of Platt's method in (Platt 1999).

Let the output of boosting be f(x) given by equation 4. To get calibrated probabilities, pass the output of boosting through a sigmoid:

    P(y = 1|f) = 1 / (1 + exp(Af + B))    (5)

where A and B are fitted using maximum likelihood estimation from a calibration set (fi, yi). Gradient descent is used to find the A and B that are the solution to:

    argmin_{A,B} { −Σi [ yi log(pi) + (1 − yi) log(1 − pi) ] }    (6)

where

    pi = 1 / (1 + exp(Afi + B))    (7)

Two questions arise: where does the sigmoid training set (fi, yi) come from, and how to avoid overfitting to it.

One answer to the first question is to use the same train set as boosting: for each example (xi, yi) in the boosting train
[Figure 1 plots: six panels for 1, 4, 8, 32, 128, and 1024 steps of boosting; top row shows histograms of predicted values, bottom row shows reliability diagrams (Mean Predicted Value vs. Fraction of Positives).]

Figure 1: Effect of boosting on the predicted values. Histograms of the predicted values (top) and reliability diagrams (bottom) on the test set for boosted trees at different steps of boosting on the COV_TYPE problem.
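The reliability-diagram construction described in the Boosting and Calibration section (ten equal-width bins; per-bin mean prediction versus fraction of positives) can be sketched in a few lines. This is an illustrative sketch of ours, not the authors' code; `reliability_curve` is a name we introduce:

```python
def reliability_curve(y_true, y_pred, n_bins=10):
    """Mean predicted value and true fraction of positives per bin.
    Bins are [0, 0.1), [0.1, 0.2), ..., [0.9, 1.0]."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(y_pred, y_true):
        i = min(int(p * n_bins), n_bins - 1)  # put p == 1.0 in the last bin
        bins[i].append((p, y))
    curve = []
    for members in bins:
        if members:  # empty bins contribute no point
            mean_pred = sum(p for p, _ in members) / len(members)
            frac_pos = sum(y for _, y in members) / len(members)
            curve.append((mean_pred, frac_pos))
    return curve
```

Plotting the mean predicted value against the fraction of positives and comparing to the diagonal gives diagrams like the bottom rows of Figures 1 and 2.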
[Figure 2 plots: eight columns (COV_TYPE, ADULT, LETTER.P1, LETTER.P2, MEDIS, SLAC, HS, MG); histograms of predicted values (top) and reliability diagrams with the fitted calibration functions (middle and bottom rows), Mean Predicted Value vs. Fraction of Positives.]

Figure 2: Histograms of predicted values and reliability diagrams for boosted decision trees.
set, use (f(xi), yi) as a training example for the sigmoid. Unfortunately, this can introduce unwanted bias in the sigmoid training set and can lead to poor results (Platt 1999).

An alternate solution is to split the data into a training and a validation set. After boosting is trained on the training set, the predictions on the validation set are used to fit the sigmoid. Cross validation can be used to allow both boosting and the sigmoid to be trained on the full data set. The data is split into C parts. For each fold one part is held aside for use as an independent calibration validation set while boosting is performed using the other C−1 parts. The union of the C validation sets is then used to fit the sigmoid parameters. Following Platt, experiments in this paper use 3-fold cross-validation to estimate the sigmoid parameters.

As for the second question, an out-of-sample model is used to avoid overfitting to the calibration set. If there are N+ positive examples and N− negative examples in the calibration set, for each example Platt Calibration uses target values y+ and y− (instead of 1 and 0, respectively), where

    y+ = (N+ + 1) / (N+ + 2);    y− = 1 / (N− + 2)    (8)

For a more detailed treatment, and a justification of these target values, see (Platt 1999). The middle row of Figure 2 shows the sigmoids fitted with Platt Scaling.

Isotonic Regression

An alternative to Platt Scaling is to use Isotonic Regression (Robertson, Wright, & Dykstra 1988). Zadrozny and Elkan (2002; 2001) successfully used Isotonic Regression to calibrate predictions from SVMs, Naive Bayes, boosted Naive Bayes, and decision trees. Isotonic Regression assumes only that:

    yi = m(fi) + εi    (9)

where m is an isotonic (monotonically increasing) function. Given a training set (fi, yi), the Isotonic Regression problem is finding the isotonic function m̂ such that:

    m̂ = argmin_z Σ (yi − z(fi))²    (10)

A piecewise constant solution m̂ can be found in linear time by using the pair-adjacent violators (PAV) algorithm (Ayer et al. 1955) presented in Table 1.

As in the case of Platt calibration, if we use the boosting train set (xi, yi) to get the train set (f(xi), yi) for Isotonic
Table 2: Cross-entropy performance of boosted decision stumps and boosted decision trees before and after calibration. Qualitatively similar results are obtained for squared error.

                          BOOSTED STUMPS                           BOOSTED TREES
PROBLEM      RAW     PLATT   ISO     LOGIST  LOGLOSS   RAW     PLATT   ISO     LOGIST  LOGLOSS
COV_TYPE     0.7571  0.6705  0.6714  0.6785  0.6932    0.6241  0.5113  0.5179  0.7787  0.7062
ADULT        0.5891  0.4570  0.4632  0.4585  0.4905    0.5439  0.4957  0.5072  0.5330  0.5094
LETTER.P1    0.2825  0.0797  0.0837  0.0820  0.0869    0.0975  0.0375  0.0378  0.1172  0.0799
LETTER.P2    0.8018  0.5437  0.5485  0.5452  0.5789    0.3901  0.1451  0.1412  0.5255  0.3461
MEDIS        0.4573  0.3829  0.3754  0.3836  0.4411    0.4757  0.3877  0.3885  0.5596  0.5462
SLAC         0.8698  0.8304  0.8351  0.8114  0.8435    0.8270  0.7888  0.7786  0.8626  0.8450
HS           0.6110  0.3351  0.3365  0.3824  0.3407    0.4084  0.2429  0.2501  0.5988  0.4780
MG           0.4868  0.4107  0.4125  0.4096  0.4568    0.5021  0.4286  0.4297  0.5026  0.4830
MEAN         0.6069  0.4637  0.4658  0.4689  0.4914    0.4836  0.3797  0.3814  0.5597  0.4992
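The cross-entropy reported in Table 2 is the average negative log-likelihood of the labels under the predicted probabilities; the squared error mentioned in the caption is the mean of (y − p)². A minimal sketch, with helper names of our own choosing (predictions are clipped so the log is always defined):

```python
import math

def cross_entropy(y_true, p_pred, eps=1e-15):
    """Average negative log-likelihood of binary labels under predictions."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite at p = 0 or 1
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

def squared_error(y_true, p_pred):
    """Mean squared error between labels and predicted probabilities."""
    return sum((y - p) ** 2 for y, p in zip(y_true, p_pred)) / len(y_true)
```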
[Figure 3 plots: eight columns (COV_TYPE, ADULT, LETTER.P1, LETTER.P2, MEDIS, SLAC, HS, MG); histograms of predicted values (top) and reliability diagrams (bottom), Mean Predicted Value vs. Fraction of Positives.]

Figure 3: Histograms of predicted values and reliability diagrams for boosted trees calibrated with Platt's method.
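The sigmoids behind Figure 3 are obtained by minimizing equation 6 with the smoothed targets of equation 8. A sketch of ours using plain batch gradient descent (the paper only states that gradient descent is used; Platt's published pseudo-code employs a more careful minimizer, and the learning rate and iteration count below are arbitrary choices):

```python
import math

def fit_platt(f, y, iters=5000, lr=0.5):
    """Fit A, B in P(y=1|f) = 1 / (1 + exp(A*f + B)) by gradient descent
    on the mean negative log-likelihood of equation 6."""
    n_pos = sum(y)
    n_neg = len(y) - n_pos
    # Smoothed out-of-sample targets (equation 8) instead of raw 0/1 labels.
    t_pos = (n_pos + 1.0) / (n_pos + 2.0)
    t_neg = 1.0 / (n_neg + 2.0)
    targets = [t_pos if yi == 1 else t_neg for yi in y]
    A, B = 0.0, 0.0
    n = float(len(f))
    for _ in range(iters):
        gA = gB = 0.0
        for fi, ti in zip(f, targets):
            p = 1.0 / (1.0 + math.exp(A * fi + B))
            gA += (ti - p) * fi / n  # d(mean NLL)/dA
            gB += (ti - p) / n       # d(mean NLL)/dB
        A -= lr * gA
        B -= lr * gB
    return A, B
```

Given calibration pairs (fi, yi), `A, B = fit_platt(f, y)` and then `1 / (1 + math.exp(A * f_new + B))` is the calibrated probability for a new prediction f_new.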
[Figure 4 plots: eight columns (COV_TYPE, ADULT, LETTER.P1, LETTER.P2, MEDIS, SLAC, HS, MG); histograms of predicted values (top) and reliability diagrams (bottom), Mean Predicted Value vs. Fraction of Positives.]

Figure 4: Histograms of predicted values and reliability diagrams for boosted trees calibrated with Isotonic Regression.
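The piecewise constant fits behind Figure 4 come from the pair-adjacent violators algorithm; the paper presents its version in Table 1, which is not reproduced in this excerpt. A pooling sketch of our own, assuming the targets are already sorted by increasing f:

```python
def pav(y):
    """Pair-adjacent violators: the non-decreasing fit to y (ordered by
    increasing f) minimizing squared error. Adjacent values that violate
    monotonicity are pooled into their mean."""
    blocks = []  # each block is [sum, count]
    for v in y:
        blocks.append([float(v), 1])
        # Merge while the last block's mean is below the previous block's mean.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fit = []
    for s, c in blocks:
        fit.extend([s / c] * c)
    return fit
```

To calibrate a new prediction, look up the fitted value of the nearest training f at or below it; the result is monotone in f by construction.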
function on an independent test set.

The right side of Table 2 presents the performance of the best boosted trees before and after calibration. To prevent the results from depending on one specific tree style, we boost ten different styles of trees. We use all the tree types in Buntine's IND package (1991) (ID3, CART, CART0, C4.5, MML, SMML, BAYES) as well as three new tree types that should predict better probabilities: unpruned C4.5 with Laplacian smoothing (Provost & Domingos 2003); unpruned C4.5 with Bayesian smoothing; and MML trees with Laplacian smoothing. We consider the boosted models after 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024 and 2048 steps of boosting.

As expected, Logistic Correction is not competitive when boosting full decision trees. The other two calibration methods (Platt Scaling and Isotonic Regression) provide a significant improvement in the quality of the predicted probabilities. Both methods reduce cross-entropy about 21% and squared error about 13%.

Surprisingly, when boosting directly optimizes log-loss, boosted trees have poor performance. Because the base-level models are so expressive, boosting with log-loss quickly separates the two classes on the train set and pushes predictions toward 0 and 1, hurting calibration. This happens despite the fact that we regularize boosting by varying the number of iterations of boosting (Rosset, Zhu, & Hastie 2004).

Comparing the results in the left and right sides of Table 2, we see that boosted trees significantly outperform boosted stumps on five of the eight problems. On average over the eight problems, boosted trees yield about 13% lower squared error and 18% lower cross-entropy than boosted stumps. Boosted stumps, however, do win by a small margin on three problems, and have the nice property that the theoretically suggested Logistic Correction works well.²

To illustrate how calibration transforms the predictions, we show histograms and reliability diagrams for the eight problems for boosted trees after 1024 steps of boosting, after Platt Scaling (Figure 3) and after Isotonic Regression (Figure 4). The figures show that calibration undoes the shift in probability mass caused by boosting: after calibration many more cases have predicted probabilities near 0 and 1. The reliability plots are closer to the diagonal, and the S shape characteristic of boosting's predictions is gone. On each problem, transforming the predictions using either Platt Scaling or Isotonic Regression yields a significant improvement in the quality of the predicted probabilities, leading to much lower squared error and cross-entropy. The main difference between Isotonic Regression and Platt Scaling for boosting can be seen when comparing the histograms in the two figures. Because Isotonic Regression generates a piecewise constant function, the histograms are quite coarse, while the histograms generated by Platt Scaling are smooth and easier to interpret.

To determine how the performance of Platt Scaling and Isotonic Regression changes with the amount of data available for calibration, we vary the size of the calibration set from 32 to 8192 by factors of two. Figure 5 shows the average squared error over the eight test problems for the best uncalibrated and the best calibrated boosted trees. We perform ten trials for each problem.

[Figure 5 plot: Squared Error (0.27 to 0.32) vs. Calibration Set Size (10 to 10000, log scale), with curves for Unscaled, Platt, and Isotonic.]

Figure 5: Learning curves for Platt Scaling and Isotonic Regression for boosted trees (average across 8 problems).

The nearly horizontal line in the graph shows the squared error prior to calibration. This line is not perfectly horizontal only because the test sets change as more data is moved into the calibration sets. The plot shows the squared error after calibration with Platt's method or Isotonic Regression as the size of the calibration set varies from small to large. The learning curves show that the performance of boosted trees is improved by calibration even for very small calibration set sizes. When the calibration set is small (less than about 2000 cases), Platt Scaling outperforms Isotonic Regression. This happens because Isotonic Regression is less constrained than Platt Scaling, so it is easier for it to overfit when the calibration set is small. Platt's method also has some overfitting control built in.

² We also tried boosting 2-level decision trees. Boosted 2-level trees outperformed boosted 1-level stumps, but did not perform as well as boosting full trees. Moreover, 2-level trees are complex enough base-level models that Logistic Correction is no longer as effective as Platt Scaling or Isotonic Regression.

Conclusions

In this paper we empirically demonstrated why AdaBoost yields poorly calibrated predictions. To correct this problem, we experimented with using a variant of boosting that directly optimizes log-loss, as well as with three post-training calibration methods: Logistic Correction, justified by Friedman et al.'s analysis; Platt Scaling, justified by the sigmoidal shape of the reliability plots; and Isotonic Regression, a general non-parametric calibration method.

One surprising result is that boosting with log-loss instead of the usual exponential loss works well when the base level models are weak models such as 1-level stumps, but is not competitive when boosting more powerful models such as full decision trees. Similarly, Logistic Correction works well for boosted stumps, but gives poor results when boosting full decision trees (or even 2-level trees).

The other two calibration methods, Platt Scaling and Isotonic Regression, are very effective at mapping boosting's predictions to calibrated posterior probabilities, regardless of how complex the base level models are. After calibration with either of these two methods, boosted decision trees have better reliability diagrams, and significantly improved squared error and cross-entropy.

When compared to other learning algorithms, boosted decision trees that have been calibrated with Platt Scaling or Isotonic Regression yield the best average squared error and cross-entropy performance (results are not included due to lack of space). This, combined with their excellent performance on other measures such as accuracy, precision, and area under the ROC curve (Caruana & Niculescu-Mizil 2006), may make calibrated boosted decision trees the model of choice for many applications.

Acknowledgments

Thanks to Bianca Zadrozny and Charles Elkan for use of their Isotonic Regression code. Thanks to Charles Young et al. at SLAC (Stanford Linear Accelerator) for the SLAC data set. Thanks to Tony Gualtieri at Goddard Space Center for help with the Indian Pines Data set. This work was supported by NSF Award 0412930.

References

Ayer, M.; Brunk, H.; Ewing, G.; Reid, W.; and Silverman, E. 1955. An empirical distribution function for sampling with incomplete information. Annals of Mathematical Statistics 5(26):641–647.

Blake, C., and Merz, C. 1998. UCI repository of machine learning databases.

Buntine, W., and Caruana, R. 1991. Introduction to IND and recursive partitioning. Technical Report FIA-91-28, NASA Ames Research Center.

Buntine, W. 1992. Learning classification trees. Statistics and Computing 2:63–73.

Caruana, R., and Niculescu-Mizil, A. 2006. An empirical comparison of supervised learning algorithms. In ICML '06: Proceedings of the 23rd International Conference on Machine Learning.

Collins, M.; Schapire, R. E.; and Singer, Y. 2002. Logistic regression, AdaBoost and Bregman distances. Machine Learning 48(1-3).

DeGroot, M., and Fienberg, S. 1982. The comparison and evaluation of forecasters. Statistician 32(1):12–22.

Friedman, J.; Hastie, T.; and Tibshirani, R. 2000. Additive logistic regression: a statistical view of boosting. The Annals of Statistics 38(2).

Gualtieri, A.; Chettri, S. R.; Cromp, R.; and Johnson, L. 1999. Support vector machine classifiers as applied to AVIRIS data. In Proc. Eighth JPL Airborne Geoscience Workshop.

Lebanon, G., and Lafferty, J. 2001. Boosting and maximum likelihood for exponential models. In Advances in Neural Information Processing Systems.

Platt, J. 1999. Probabilistic outputs for support vector machines and comparison to regularized likelihood methods. In Adv. in Large Margin Classifiers.

Provost, F., and Domingos, P. 2003. Tree induction for probability-based rankings. Machine Learning.

Robertson, T.; Wright, F.; and Dykstra, R. 1988. Order Restricted Statistical Inference. New York: John Wiley and Sons.

Rosset, S.; Zhu, J.; and Hastie, T. 2004. Boosting as a regularized path to a maximum margin classifier. J. Mach. Learn. Res. 5.

Schapire, R.; Freund, Y.; Bartlett, P.; and Lee, W. 1998. Boosting the margin: A new explanation for the effectiveness of voting methods. Annals of Statistics 26(5):1651–1686.

Zadrozny, B., and Elkan, C. 2001. Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers. In ICML.

Zadrozny, B., and Elkan, C. 2002. Transforming classifier scores into accurate multiclass probability estimates. In KDD.