Predicting Good Probabilities With Supervised Learning
Alexandru Niculescu-Mizil ALEXN@CS.CORNELL.EDU
Rich Caruana
Department of Computer Science, Cornell University, Ithaca NY 14853
Abstract
We examine the relationship between the predictions made by different learning algorithms and true posterior probabilities. We show that maximum margin methods such as boosted trees and boosted stumps push probability mass away from 0 and 1, yielding a characteristic sigmoid-shaped distortion in the predicted probabilities. Models such as Naive Bayes, which make unrealistic independence assumptions, push probabilities toward 0 and 1. Other models such as neural nets and bagged trees do not have these biases and predict well calibrated probabilities. We experiment with two ways of correcting the biased probabilities predicted by some learning methods: Platt Scaling and Isotonic Regression. We qualitatively examine what kinds of distortions these calibration methods are suitable for and quantitatively examine how much data they need to be effective. The empirical results show that after calibration boosted trees, random forests, and SVMs predict the best probabilities.
1. Introduction
In many applications it is important to predict well calibrated probabilities; good accuracy or area under the ROC curve are not sufficient. This paper examines the probabilities predicted by ten supervised learning algorithms: SVMs, neural nets, decision trees, memory-based learning, bagged trees, random forests, boosted trees, boosted stumps, naive bayes and logistic regression. We show how maximum margin methods such as SVMs, boosted trees, and boosted stumps tend to push predicted probabilities away from 0 and 1. This hurts the quality of the probabilities they predict and yields a characteristic sigmoid-shaped distortion in the predicted probabilities. Other methods such as naive bayes have the opposite bias and tend to push predictions closer to 0 and 1. And some learning methods such as bagged trees and neural nets have little or no bias and predict well-calibrated probabilities.
Appearing in Proceedings of the 22nd International Conference on Machine Learning, Bonn, Germany, 2005. Copyright 2005 by the author(s)/owner(s).
After examining the distortion (or lack of distortion) characteristic of each learning method, we experiment with two calibration methods for correcting these distortions.
Platt Scaling: a method for transforming SVM outputs from $[-\infty, +\infty]$ to posterior probabilities (Platt, 1999)
Isotonic Regression: the method used by Zadrozny and Elkan (2002; 2001) to calibrate predictions from boosted naive bayes, SVM, and decision tree models
Platt Scaling is most effective when the distortion in the predicted probabilities is sigmoid-shaped. Isotonic Regression is a more powerful calibration method that can correct any monotonic distortion. Unfortunately, this extra power comes at a price. A learning curve analysis shows that Isotonic Regression is more prone to overfitting, and thus performs worse than Platt Scaling, when data is scarce.
Finally, we examine how good the probabilities predicted by each learning method are after each method's predictions have been calibrated. Experiments with eight classification problems suggest that random forests, neural nets and bagged decision trees are the best learning methods for predicting well-calibrated probabilities prior to calibration, but after calibration the best methods are boosted trees, random forests and SVMs.
2. Calibration Methods
In this section we describe the two methods for mapping model predictions to posterior probabilities: Platt Calibration and Isotonic Regression. Unfortunately, these methods are designed for binary classification and it is not trivial to extend them to multiclass problems. One way to deal with multiclass problems is to transform them to binary problems, calibrate the binary models and recombine the predictions (Zadrozny & Elkan, 2002).
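The recombination step can be illustrated with a minimal sketch. It assumes a one-vs-rest decomposition in which each class has its own calibrated binary estimate, and it recombines them by simple normalization. Both the decomposition and the helper name `combine_one_vs_rest` are assumptions made for illustration; the text does not say which coupling scheme is used, and other schemes exist.

```python
def combine_one_vs_rest(binary_probs):
    """Normalize calibrated one-vs-rest estimates so they sum to one.

    binary_probs maps each class label to the calibrated probability
    produced by that class's binary (class vs. rest) model.
    """
    total = sum(binary_probs.values())
    if total == 0:
        # Degenerate case (assumption): fall back to a uniform distribution.
        return {c: 1.0 / len(binary_probs) for c in binary_probs}
    return {c: p / total for c, p in binary_probs.items()}


# Three hypothetical classes whose binary estimates do not sum to one.
multiclass = combine_one_vs_rest({"a": 0.6, "b": 0.3, "c": 0.3})
```

After normalization the three estimates sum to one and keep their relative order.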
2.1. Platt Calibration
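Platt (1999) proposed transforming SVM predictions to posterior probabilities by passing them through a sigmoid. We will see in Section 4 that a sigmoid transformation is also justified for boosted trees and boosted stumps.
Let the output of a learning method be $f(x)$. To get calibrated probabilities, pass the output through a sigmoid:
$$P(y = 1 \mid f) = \frac{1}{1 + \exp(Af + B)} \qquad (1)$$
where the parameters $A$ and $B$ are fitted using maximum likelihood estimation from a fitting training set $(f_i, y_i)$. Gradient descent is used to find $A$ and $B$ such that they are the solution to:
$$\operatorname*{argmin}_{A,B} \Big\{ -\sum_i \big[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \big] \Big\}, \qquad (2)$$
where
$$p_i = \frac{1}{1 + \exp(A f_i + B)} \qquad (3)$$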
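The fitting procedure in Equations (1) to (3) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it uses plain batch gradient descent with a fixed step size, and the function names, step size, iteration count and toy data are all assumptions.

```python
import math


def sigmoid_prob(f, A, B):
    """P(y=1 | f) = 1 / (1 + exp(A*f + B)), arranged to avoid overflow."""
    z = A * f + B
    if z >= 0:
        e = math.exp(-z)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(z))


def neg_log_likelihood(A, B, scores, targets, eps=1e-12):
    """Objective of Equation (2) for given A and B."""
    total = 0.0
    for f, y in zip(scores, targets):
        p = min(max(sigmoid_prob(f, A, B), eps), 1.0 - eps)
        total -= y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return total


def fit_platt(scores, targets, lr=0.1, n_iter=2000):
    """Fit A and B by gradient descent (step size and iterations are assumptions)."""
    A, B = 0.0, 0.0
    n = len(scores)
    for _ in range(n_iter):
        grad_A = grad_B = 0.0
        for f, y in zip(scores, targets):
            p = sigmoid_prob(f, A, B)
            # With z = A*f + B, the derivative of the loss w.r.t. z is (y - p).
            grad_A += (y - p) * f
            grad_B += y - p
        A -= lr * grad_A / n
        B -= lr * grad_B / n
    return A, B


# Toy, non-separable data: larger scores are more often positive.
scores = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
labels = [0, 0, 1, 0, 1, 1]
A, B = fit_platt(scores, labels)
```

Because the sigmoid is written with `exp(A*f + B)` in the denominator, a model whose larger outputs indicate the positive class ends up with a negative `A`.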
Two questions arise: where does the sigmoid training set come from, and how can we avoid overfitting to this training set?
If we use the same data set that was used to train the model we want to calibrate, we introduce unwanted bias. For example, if the model learns to discriminate the train set perfectly and orders all the negative examples before the positive examples, then the sigmoid transformation will output just a 0,1 function. So we need to use an independent calibration set in order to get good posterior probabilities. This, however, is not a drawback, since the same set can be used for model and parameter selection.
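To avoid overfitting to the sigmoid train set, an out-of-sample model is used. If there are $N_+$ positive examples and $N_-$ negative examples in the train set, for each training example Platt Calibration uses target values $y_+$ and $y_-$ (instead of 1 and 0, respectively), where
$$y_+ = \frac{N_+ + 1}{N_+ + 2}; \qquad y_- = \frac{1}{N_- + 2} \qquad (4)$$
For a more detailed treatment, and a justification of these particular target values, see (Platt, 1999).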
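The smoothed targets of Equation (4) are simple to compute. The helper below is a small sketch; the name `platt_targets` is hypothetical, and labels are assumed to be coded as 1 for positive and 0 for negative.

```python
def platt_targets(labels):
    """Replace 1/0 labels with the smoothed targets y+ and y- of Equation (4)."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    y_pos = (n_pos + 1.0) / (n_pos + 2.0)
    y_neg = 1.0 / (n_neg + 2.0)
    return [y_pos if y == 1 else y_neg for y in labels]


# Three positives and one negative: y+ = 4/5 and y- = 1/3.
targets = platt_targets([1, 1, 1, 0])
```

The targets move toward 1 and 0 as the number of examples in each class grows, so the smoothing matters most for small calibration sets.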
2.2. Isotonic Regression
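The sigmoid transformation works well for some learning methods, but it is not appropriate for others. Zadrozny and Elkan (2002; 2001) successfully used a more general method based on Isotonic Regression (Robertson et al., 1988) to calibrate predictions from SVMs, Naive Bayes, boosted Naive Bayes, and decision trees. This method is more general in that the only restriction is that the mapping function be isotonic (monotonically increasing). That is, given the predictions $f_i$ from a model and the true targets $y_i$, the basic assumption in Isotonic Regression is that:
$$y_i = m(f_i) + \epsilon_i \qquad (5)$$
where $m$ is an isotonic (monotonically increasing) function. Then, given a train set $(f_i, y_i)$, the Isotonic Regression problem is finding the isotonic function $\hat{m}$ such that
$$\hat{m} = \operatorname*{argmin}_{z} \sum_i \big( y_i - z(f_i) \big)^2 \qquad (6)$$
One algorithm that finds a stepwise constant solution for the Isotonic Regression problem is the pair-adjacent violators (PAV) algorithm (Ayer et al., 1955) presented in Table 1.
As in the case of Platt calibration, if we use the model training set $(x_i, y_i)$ to get the training set $(f(x_i), y_i)$ for Isotonic Regression, we introduce unwanted bias. So we use an independent validation set to train the isotonic function.
Table 1. PAV algorithm for estimating posterior probabilities from uncalibrated model predictions.
1 Input: training set $(f_i, y_i)$ sorted according to $f_i$
2 Initialize $\hat{m}_{i,i} = y_i$, $w_{i,i} = 1$
3 While $\exists\, i$ s.t. $\hat{m}_{k,i-1} \ge \hat{m}_{i,l}$
    Set $w_{k,l} = w_{k,i-1} + w_{i,l}$
    Set $\hat{m}_{k,l} = (w_{k,i-1}\hat{m}_{k,i-1} + w_{i,l}\hat{m}_{i,l}) / w_{k,l}$
    Replace $\hat{m}_{k,i-1}$ and $\hat{m}_{i,l}$ with $\hat{m}_{k,l}$
4 Output the stepwise constant function: $\hat{m}(f) = \hat{m}_{i,j}$, for $f_i < f \le f_j$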
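The pair-adjacent violators procedure can be sketched with a stack of blocks that are merged whenever an earlier block's mean is at least as large as the next one. This is an illustrative implementation under stated assumptions, not the code used in the experiments: the function names are hypothetical, scores are assumed to be distinct, and scores above the largest calibration score are mapped to the last block.

```python
import bisect


def pav_fit(scores, labels):
    """Return the fitted step function as a list of (upper_score, value) blocks."""
    pairs = sorted(zip(scores, labels))
    blocks = []  # each block is [mean, weight, upper score]
    for f, y in pairs:
        blocks.append([float(y), 1.0, f])
        # Pool adjacent blocks while they violate the increasing order.
        while len(blocks) > 1 and blocks[-2][0] >= blocks[-1][0]:
            m_right, w_right, upper = blocks.pop()
            m_left, w_left, _ = blocks.pop()
            w = w_left + w_right
            blocks.append([(w_left * m_left + w_right * m_right) / w, w, upper])
    return [(upper, mean) for mean, _, upper in blocks]


def pav_predict(model, f):
    """Evaluate the step function: the block whose upper score is the first >= f."""
    uppers = [upper for upper, _ in model]
    i = min(bisect.bisect_left(uppers, f), len(model) - 1)
    return model[i][1]


# Toy calibration set: the labels at scores 0.2 and 0.3 are out of order.
model = pav_fit([0.1, 0.2, 0.3, 0.4, 0.5, 0.6], [0, 1, 0, 1, 1, 1])
```

On this toy input the out-of-order pair is pooled into a single block with value 0.5, and the fitted values are non-decreasing in the score.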
3. Data Sets
We compare algorithms on 8 binary classification problems. ADULT, COV-TYPE and LETTER are from the UCI Repository (Blake & Merz, 1998). COV-TYPE has been converted to a binary problem by treating the largest class as positive and the rest as negative. We converted LETTER to boolean two ways. LETTER.P1 treats the letter "O" as positive and the remaining 25 letters as negative, yielding a very unbalanced problem. LETTER.P2 uses letters A-M as positives and N-Z as negatives, yielding a difficult, but well balanced, problem. HS is the IndianPine92 data set (Gualtieri et al., 1999) where the difficult class Soybean-mintill is the positive class. SLAC is a problem from the Stanford Linear Accelerator. MEDIS and MG are medical data sets. The data sets are summarized in Table 2.
Table 2. Description of problems.
4. Qualitative Analysis of Predictions
In this section we qualitatively examine the calibration of the different learning algorithms. For each algorithm we use many variations and parameter settings to train different models. For example, we train models using ten decision tree styles, neural nets of many sizes, SVMs with many kernels, etc. After training, we apply Platt Scaling and Isotonic Regression to calibrate all models. Each model is trained on the same random sample of 4000 cases and calibrated on independent samples of 1000 cases. For the figures in this section we select, for each problem, and for each learning algorithm, the model that has the best calibration before or after scaling.
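On real problems where the true conditional probabilities are not known, model calibration can be visualized with reliability diagrams (DeGroot & Fienberg, 1982). First, the prediction space is discretized into ten bins. Cases with predicted value between 0 and 0.1 fall in the first bin, between 0.1 and 0.2 in the second bin, etc. For each bin, the mean predicted value is plotted against the true fraction of positive cases. If the model is well calibrated the points will fall near the diagonal line.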
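The points of a reliability diagram can be computed as in the sketch below. The function name and the handling of bin boundaries (each bin includes its lower edge, and a prediction of exactly 1.0 goes in the last bin) are assumptions; the text does not specify how boundary values are assigned.

```python
def reliability_diagram(probs, labels, n_bins=10):
    """Return (mean predicted value, fraction of positives, count) per non-empty bin."""
    sums = [0.0] * n_bins
    positives = [0] * n_bins
    counts = [0] * n_bins
    for p, y in zip(probs, labels):
        # Bin b covers [b/n_bins, (b+1)/n_bins); p = 1.0 is placed in the last bin.
        b = min(int(p * n_bins), n_bins - 1)
        sums[b] += p
        positives[b] += y
        counts[b] += 1
    return [
        (sums[b] / counts[b], positives[b] / counts[b], counts[b])
        for b in range(n_bins)
        if counts[b] > 0
    ]


# Four toy predictions with 0/1 labels.
points = reliability_diagram([0.05, 0.15, 0.15, 0.95], [0, 0, 1, 1])
```

Plotting the first element of each tuple against the second gives the diagram; a well calibrated model produces points close to the diagonal.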
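We first examine the predictions made by boosted trees. Figure 1 shows histograms of the predicted values (top row) and reliability diagrams (middle and bottom rows) for boosted trees on the eight test problems on large test sets not used for training or calibration. An interesting aspect of the reliability plots in Figure 1 is that they display a sigmoidal shape on seven of the eight problems, motivating the use of a sigmoid to transform predictions into calibrated probabilities. The reliability plots in the middle row of the figure show sigmoids fitted using Platt's method. The reliability plots in the bottom row of the figure show the function fitted with Isotonic Regression.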
Examining the histograms of predicted values (top row in Figure 1), note that almost all the values predicted by boosted trees lie in the central region with few predictions approaching 0 or 1. The one exception is LETTER.P1, a highly skewed data set that has only 3% positive class. On this problem some predicted values do approach 0, though careful examination of the histogram shows that even on this problem there is a sharp drop in the number of cases predicted to have probability near 0. This shifting of the predictions toward the center of the histogram causes the sigmoid-shaped reliability plots of boosted trees.
To show how calibration transforms predictions, we plot histograms and reliability diagrams for the eight problems for boosted trees after Platt Calibration (Figure 2) and Isotonic Regression (Figure 3). The figures show that calibration undoes the shift in probability mass caused by boosting: after calibration many more cases have predicted probabilities near 0 and 1. The reliability diagrams are closer to the diagonal, and the S-shape characteristic of boosted tree predictions is gone. On each problem, transforming predictions using Platt Scaling or Isotonic Regression yields a significant improvement in the predicted probabilities, leading to much lower squared error and log-loss. One difference between Isotonic Regression and Platt Scaling is apparent in the histograms: because Isotonic Regression generates a piecewise constant function, the histograms are coarse, while the histograms generated by Platt Scaling are smoother. See (Niculescu-Mizil & Caruana, 2005) for a more thorough analysis of boosting from the point of view of predicting well-calibrated probabilities.
Figure 6 shows the prediction histograms for the ten learning methods on the SLAC problem before calibration and after calibration with Platt's method. Reliability diagrams showing the fitted functions for Platt's method and Isotonic Regression also are shown. Boosted stumps and SVMs also exhibit distinctive sigmoid-shaped reliability plots (second and third rows, respectively, of Figure 6). Boosted stumps and SVMs exhibit similar behavior on the other seven problems. As in the case of boosted trees, the sigmoidal shape of the reliability plots co-occurs with the concentration of mass in the center of the histograms of predicted values, with boosted stumps being the most extreme. It is interesting to note that the learning methods that exhibit this behavior are maximum margin methods. The sigmoid-shaped reliability plot that results from predictions being pushed away from 0 and 1 appears to be characteristic of max margin methods.
Figure 1. Histograms of predicted values and reliability diagrams for boosted trees.
Figure 2. Histograms of predicted values and reliability diagrams for boosted trees calibrated with Platt's method.
Figure 3. Histograms of predicted values and reliability diagrams for boosted trees calibrated with Isotonic Regression.
Figure 4, which shows histograms of predicted values and reliability plots for neural nets, tells a very different story. The reliability plots closely follow the diagonal line, indicating that neural nets are well calibrated to begin with and do not need post-training calibration. Only the COV-TYPE problem appears to benefit a little from calibration. On the other problems both calibration methods appear to be striving to approximate the diagonal line, a task that is not natural to either of them. Because of this, scaling might hurt neural net calibration a little. The sigmoids trained with Platt's method have trouble fitting the tails properly, effectively pushing predictions away from 0 and 1, as can be seen in the histograms in Figure 5. The histograms for uncalibrated neural nets in Figure 4 look similar to the histograms for boosted trees after Platt Scaling in Figure 2, giving us confidence that the histograms reflect the underlying structure of the problems. For example, we could conclude that the LETTER and HS problems, given the available features, have well defined classes with a small number of cases in the "gray" region, while in the SLAC problem the two classes have high overlap with significant uncertainty for most cases. It is interesting to note that neural networks with a single sigmoid output unit can be viewed as a linear classifier (in the span of its hidden units) with a sigmoid at the output that calibrates the predictions. In this respect neural nets are similar to SVMs and boosted trees after they have been calibrated using Platt's method.
1 SVM predictions are scaled to [0, 1] by $(x - \min)/(\max - \min)$.
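The rescaling of SVM outputs to [0, 1] by (x - min)/(max - min), used so that uncalibrated SVM predictions can be shown on the same axes as probabilities, amounts to the following. The function name is hypothetical, and mapping a constant input to 0.5 is an assumption added only to avoid division by zero.

```python
def min_max_scale(values):
    """Linearly rescale raw model outputs to the interval [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: with no spread there is nothing to rank, so return 0.5.
        return [0.5 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Raw SVM-style margins on both sides of the decision boundary.
scaled = min_max_scale([-2.0, 0.0, 2.0])
```

This rescaling preserves the ordering of the predictions but does not by itself make them calibrated probabilities.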
Figure 5. Histograms of predicted values and reliability diagrams for neural nets calibrated with Platt's method.
Examining the histograms and reliability diagrams for logistic regression and bagged trees shows that they behave similarly to neural nets. Both learning algorithms are well calibrated initially and post-calibration does not help them on most problems. Bagged trees are helped a little by post-calibration on the MEDIS and LETTER.P2 problems. While it is not surprising that logistic regression predicts such well-calibrated probabilities, it is interesting that bagging decision trees also yields well-calibrated models. Given that bagged trees are well calibrated, we can deduce that regular decision trees also are well calibrated on average, in the sense that if decision trees are trained on different samples of the data and their predictions averaged, the average will be well calibrated. Unfortunately, a single decision tree has high variance and this variance affects its calibration. Platt Scaling is not able to deal with this high variance, but Isotonic Regression can help fix some of the problems created by variance. Rows five, six and seven in Figure 6 show the histograms (before and after calibration) and reliability diagrams for logistic regression, bagged trees, and decision trees on the SLAC problem.
Random forests are less clear cut. RF models are well calibrated on some problems, but are poorly calibrated on LETTER.P2, and not well calibrated on HS, COV-TYPE, MEDIS and LETTER.P1. It is interesting that on these problems RFs seem to exhibit, although to a lesser extent, the same behavior as the max margin methods: predicted values are slightly pushed toward the middle of the histogram and the reliability plots show a sigmoidal shape (more accentuated on the LETTER problems and less so on COV-TYPE, MEDIS and HS). Methods such as bagging and random forests that average predictions from a base set of models can have difficulty making predictions near 0 and 1 because variance in the underlying base models will bias predictions that should be near zero or one away from these values. Because predictions are restricted to the interval [0, 1], errors caused by variance tend to be one-sided near zero and one. For example, if a model should predict $p = 0$ for a case, the only way bagging can achieve this is if all bagged trees predict zero. If we add noise to the trees that bagging is averaging over, this noise will cause some trees to predict values larger than 0 for this case, thus moving the average prediction of the bagged ensemble away from 0. We observe this effect most strongly with random forests because the base-level trees trained with random forests have relatively high variance due to feature subsetting. Post-calibration seems to help mitigate this problem.
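The one-sided effect of variance near 0 can be illustrated with a small simulation. This is only a toy model under loud assumptions: each base model's prediction is taken to be the ideal value plus Gaussian noise, clipped to [0, 1], which is not how real trees behave. The function name, noise level and ensemble size are likewise hypothetical.

```python
import random


def bagged_prediction(true_p, noise_sd, n_trees, rng):
    """Average n_trees noisy base predictions, each clipped to [0, 1]."""
    preds = []
    for _ in range(n_trees):
        noisy = true_p + rng.gauss(0.0, noise_sd)  # assumed noise model
        preds.append(min(1.0, max(0.0, noisy)))
    return sum(preds) / n_trees


rng = random.Random(0)
avg_at_zero = bagged_prediction(0.0, 0.2, 1000, rng)  # ideal prediction is 0
avg_at_half = bagged_prediction(0.5, 0.2, 1000, rng)  # ideal prediction is 0.5
```

Near 0 the clipped noise can only push predictions upward, so the average is pulled away from 0, while in the middle of the interval the noise largely cancels out.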
Figure 6. Histograms and reliability diagrams for SLAC.
Because Naive Bayes makes the unrealistic assumption that the attributes are conditionally independent given the class, it tends to push predicted values toward 0 and 1. This is the opposite behavior from the max margin methods and creates reliability plots that have an inverted sigmoid shape. While Platt Scaling still helps to improve calibration, it is clear that a sigmoid is not the right transformation to calibrate Naive Bayes models. Isotonic Regression is a better choice to calibrate these models.
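A small worked example shows why the independence assumption pushes naive Bayes predictions toward the extremes. It assumes a hypothetical, deliberately extreme case in which five attributes are exact copies of one attribute whose likelihood ratio is 2; the function name is also an assumption.

```python
def naive_bayes_posterior(prior_pos, likelihood_ratios):
    """Posterior P(y=1 | x) when each attribute's likelihood ratio is multiplied in."""
    odds = prior_pos / (1.0 - prior_pos)
    for ratio in likelihood_ratios:
        odds *= ratio
    return odds / (1.0 + odds)


# Counting the evidence once gives the correct posterior in this toy case.
true_posterior = naive_bayes_posterior(0.5, [2.0])
# Naive Bayes treats the five identical copies as independent evidence.
overconfident = naive_bayes_posterior(0.5, [2.0] * 5)
```

With one copy the posterior is 2/3, while multiplying the same evidence five times gives 32/33, much closer to 1 than the data supports.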
Returning to Figure 6, we see that the histograms of the predicted values before calibration (first column) from the ten different models display wide variation. The max margin methods (SVM, boosted trees, and boosted stumps) have the predicted values massed in the center of the histograms, causing a sigmoidal shape in the reliability plots. Both Platt Scaling and Isotonic Regression are effective at fitting this sigmoidal shape. After calibration the prediction histograms extend further into the tails near predicted values of 0 and 1.
For methods that are well calibrated (neural nets, bagged trees, random forests, and logistic regression), calibration with Platt Scaling actually moves probability mass away from 0 and 1. It is clear from looking at the reliability diagrams for these methods that the sigmoid has difficulty fitting the predictions in the tails of these well-calibrated methods.
Overall, if one examines the probability histograms before and after calibration, it is clear that the histograms are much more similar to each other after Platt Scaling. Calibration significantly reduces the differences between the probabilities predicted by the different models. Of course, calibration is unable to fully correct the predictions from the inferior models such as decision trees and naive bayes.
5. Learning Curve Analysis
In this section we present a learning curve analysis of the two calibration methods, Platt Scaling and Isotonic Regression. The goal is to determine how effective these calibration methods are as the amount of data available for calibration varies. For this analysis we use the same models as in Section 4, but here we vary the size of the calibration set from 32 cases to 8192 cases by factors of two. To measure calibration performance we examine the squared error of the models.
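The calibration set sizes and the error measure can be written down directly. In this sketch the squared error is averaged over cases, which is an assumption about the exact normalization, and the names are hypothetical.

```python
def squared_error(probs, labels):
    """Mean squared difference between predicted probabilities and 0/1 labels."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


# Calibration set sizes from 32 to 8192 cases, doubling at each step.
calibration_sizes = [32 * 2 ** k for k in range(9)]

# An uninformative prediction of 0.5 on one negative and one positive case.
baseline_error = squared_error([0.5, 0.5], [0, 1])
```

A learning curve is then obtained by fitting each calibration method on a set of each size and recording the squared error on held-out cases.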
The plots in Figure 7 show the average squared error over the eight test problems. For each problem, we perform ten trials. Error bars are shown on the plots, but are so narrow that they may be difficult to see. Calibration learning curves are shown for nine of the ten learning methods (decision trees are left out).
Figure 7. Learning curves for Platt Scaling and Isotonic Regression (averages across 8 problems).
The nearly horizontal lines in the graphs show the squared error prior to calibration. These lines are not perfectly horizontal only because the test sets change as more data is moved into the calibration sets. Each plot shows the squared error after calibration with Platt's method or Isotonic Regression as the size of the calibration set varies from small to large. When the calibration set is small (less than about 200-1000 cases), Platt Scaling outperforms Isotonic Regression with all nine learning methods. This happens because Isotonic Regression is less constrained than Platt Scaling, so it is easier for it to overfit when the calibration set is small. Platt's method also has some overfitting control built in (see Section 2). As the size of the calibration set increases, the learning curves for Platt Scaling and Isotonic Regression join, or even cross. When there are 1000 or more points in the calibration set, Isotonic Regression always yields performance as good as, or better than, Platt Scaling.
For learning methods that make well calibrated predictions such as neural nets, bagged trees, and logistic regression, neither Platt Scaling nor Isotonic Regression yields much improvement in performance even when the calibration set is very large. With these methods calibration is not beneficial, and actually hurts performance when the calibration sets are small.
For the max margin methods, boosted trees, boosted stumps and SVMs, calibration provides an improvement even when the calibration set is small. In Section 4 we saw that a sigmoid is a good match for boosted trees, boosted stumps, and SVMs. As expected, for these methods Platt Scaling performs better than Isotonic Regression for small to medium sized calibration sets (less than 1000 cases), and is virtually indistinguishable for larger calibration sets.
As expected, calibration improves the performance of Naive Bayes models for almost all calibration set sizes, with Isotonic Regression outperforming Platt Scaling when there is more data. For the rest of the models, KNN, RF and DT (not shown), post-calibration helps once the calibration sets are large enough.
6. Empirical Comparison
As before, for each learning algorithm we train different models using different parameter settings and calibrate each model with Isotonic Regression and Platt Scaling. Models are trained on 4k samples and calibrated on independent 1k samples. For each data set, learning algorithm, and calibration method, we select the model with the best performance using the same 1k points used for calibration.
Figure 8 shows the squared error (top) and log-loss (bottom) for each learning method before and after calibration. Each bar averages over five trials on each of the eight problems. Error bars representing 1 standard deviation for the means are shown. The probabilities predicted by four learning methods (boosted trees, SVMs, boosted stumps, and naive bayes) are dramatically improved by calibration. Calibration does not help bagged trees, and actually hurts neural nets. Before calibration, the best models are random forests, bagged trees, and neural nets. After calibration, however, boosted trees, random forests, and SVMs predict the best probabilities.
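Log-loss, the second measure reported alongside squared error, can be computed as in this sketch. Averaging over cases, using the natural logarithm, and clipping probabilities away from exactly 0 and 1 are assumptions; the function name is hypothetical.

```python
import math


def log_loss(probs, labels, eps=1e-15):
    """Average negative log-likelihood of 0/1 labels under predicted probabilities."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0) for extreme predictions
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(probs)


hedged = log_loss([0.5, 0.5], [0, 1])            # uninformative predictions
confident_wrong = log_loss([0.99, 0.01], [0, 1])  # confident and wrong
```

Unlike squared error, log-loss grows without bound as a wrong prediction approaches 0 or 1, which is one reason models that push predictions to the extremes can score poorly on it.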
7. Conclusions
In this paper we examined the probabilities predicted by ten different learning methods. Maximum margin methods such as boosting and SVMs yield characteristic distortions in their predictions. Other methods such as naive bayes make predictions with the opposite distortion. And methods such as neural nets and bagged trees predict well-calibrated probabilities. We examined the effectiveness of Platt Scaling and Isotonic Regression for calibrating the predictions made by different learning methods. Platt Scaling is most effective when the data is small, but Isotonic Regression is more powerful when there is sufficient data to prevent overfitting. After calibration, the models that predict the best probabilities are boosted trees, random forests, SVMs, uncalibrated bagged trees and uncalibrated neural nets.
Acknowledgments
Thanks to B. Zadrozny and C. Elkan for the Isotonic Regression code, C. Young et al. at Stanford Linear Accelerator for the SLAC data, and A. Gualtieri at Goddard Space Center for help with the Indian Pines Data. This work was supported by NSF Award 0412930.
References
Ayer, M., Brunk, H., Ewing, G., Reid, W., & Silverman, E. (1955). An empirical distribution function for sampling with incomplete information. Annals of Mathematical Statistics, 26, 641-647.
Blake, C., & Merz, C. (1998). UCI repository of machine learning databases.
DeGroot, M., & Fienberg, S. (1982). The comparison and evaluation of forecasters. Statistician, 32, 12-22.
Gualtieri, A., Chettri, S. R., Cromp, R., & Johnson, L. (1999). Support vector machine classifiers as applied to AVIRIS data. Proc. Eighth JPL Airborne Geoscience Workshop.
Niculescu-Mizil, A., & Caruana, R. (2005). Obtaining calibrated probabilities from boosting. Proc. 21st Conference on Uncertainty in Artificial Intelligence (UAI '05). AUAI Press.
Platt, J. (1999). Probabilistic outputs for support vector machines and comparison to regularized likelihood methods. Advances in Large Margin Classifiers (pp. 61-74).
Robertson, T., Wright, F., & Dykstra, R. (1988). Order restricted statistical inference. New York: John Wiley and Sons.
Zadrozny, B., & Elkan, C. (2001). Obtaining calibrated probability estimates from decision trees and naive bayesian classifiers. ICML (pp. 609-616).
Zadrozny, B., & Elkan, C. (2002). Transforming classifier scores into accurate multiclass probability estimates. KDD (pp. 694-699).