Predicting Good Probabilities With Supervised Learning

Alexandru Niculescu-Mizil (ALEXN@CS.CORNELL.EDU)
Rich Caruana
Department Of Computer Science, Cornell University, Ithaca NY 14853

Appearing in Proceedings of the 22nd International Conference on Machine Learning, Bonn, Germany, 2005. Copyright 2005 by the author(s)/owner(s).

Abstract

We examine the relationship between the predictions made by different learning algorithms and true posterior probabilities. We show that maximum margin methods such as boosted trees and boosted stumps push probability mass away from 0 and 1, yielding a characteristic sigmoid-shaped distortion in the predicted probabilities. Models such as Naive Bayes, which make unrealistic independence assumptions, push probabilities toward 0 and 1. Other models such as neural nets and bagged trees do not have these biases and predict well-calibrated probabilities. We experiment with two ways of correcting the biased probabilities predicted by some learning methods: Platt Scaling and Isotonic Regression. We qualitatively examine what kinds of distortions these calibration methods are suitable for and quantitatively examine how much data they need to be effective. The empirical results show that after calibration boosted trees, random forests, and SVMs predict the best probabilities.

1. Introduction

In many applications it is important to predict well-calibrated probabilities; good accuracy or area under the ROC curve are not sufficient. This paper examines the probabilities predicted by ten supervised learning algorithms: SVMs, neural nets, decision trees, memory-based learning, bagged trees, random forests, boosted trees, boosted stumps, naive bayes and logistic regression. We show how maximum margin methods such as SVMs, boosted trees, and boosted stumps tend to push predicted probabilities away from 0 and 1. This hurts the quality of the probabilities they predict and yields a characteristic sigmoid-shaped distortion in the predicted probabilities. Other methods such as naive bayes have the opposite bias and tend to push predictions closer to 0 and 1. And some learning methods such as bagged trees and neural nets have little or no bias and predict well-calibrated probabilities.

After examining the distortion (or lack of it) characteristic of each learning method, we experiment with two calibration methods for correcting these distortions.

Platt Scaling: a method for transforming SVM outputs from [-\infty, +\infty] to posterior probabilities (Platt, 1999).

Isotonic Regression: the method used by Zadrozny and Elkan (2002; 2001) to calibrate predictions from boosted naive bayes, SVM, and decision tree models.

Platt Scaling is most effective when the distortion in the predicted probabilities is sigmoid-shaped. Isotonic Regression is a more powerful calibration method that can correct any monotonic distortion. Unfortunately, this extra power comes at a price. A learning curve analysis shows that Isotonic Regression is more prone to overfitting, and thus performs worse than Platt Scaling, when data is scarce.

Finally, we examine how good the probabilities predicted by each learning method are after each method's predictions have been calibrated. Experiments with eight classification problems suggest that random forests, neural nets and bagged decision trees are the best learning methods for predicting well-calibrated probabilities prior to calibration, but after calibration the best methods are boosted trees, random forests and SVMs.

2. Calibration Methods

In this section we describe the two methods for mapping model predictions to posterior probabilities: Platt Calibration and Isotonic Regression. Unfortunately, these methods are designed for binary classification and it is not trivial to extend them to multiclass problems. One way to deal with multiclass problems is to transform them to binary problems, calibrate the binary models, and recombine the predictions (Zadrozny & Elkan, 2002).

2.1. Platt Calibration

Platt (1999) proposed transforming SVM predictions to posterior probabilities by passing them through a sigmoid. We will see in Section 4 that a sigmoid transformation is also justified for boosted trees and boosted stumps.

Let the output of a learning method be f(x). To get calibrated probabilities, pass the output through a sigmoid:

$$P(y = 1 \mid f) = \frac{1}{1 + \exp(A f + B)} \qquad (1)$$

where the parameters A and B are fitted using maximum likelihood estimation from a fitting training set (f_i, y_i). Gradient descent is used to find A and B such that they are the solution to:

$$\operatorname*{argmin}_{A,B} \left\{ -\sum_i \left[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right] \right\} \qquad (2)$$

where

$$p_i = \frac{1}{1 + \exp(A f_i + B)} \qquad (3)$$

Two questions arise: where does the sigmoid train set come from?
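And how can overfitting to this training set be avoided?

If we use the same data set that was used to train the model we want to calibrate, we introduce unwanted bias. For example, if the model learns to discriminate the train set perfectly and orders all the negative examples before the positive examples, then the sigmoid transformation will output just a 0,1 function. So we need to use an independent calibration set in order to get good posterior probabilities. This, however, is not a drawback, since the same set can be used for model and parameter selection.

To avoid overfitting to the sigmoid train set, an out-of-sample model is used. If there are N_+ positive examples and N_- negative examples in the train set, for each training example Platt Calibration uses target values y_+ and y_- (instead of 1 and 0, respectively), where

$$y_+ = \frac{N_+ + 1}{N_+ + 2}, \qquad y_- = \frac{1}{N_- + 2} \qquad (4)$$

For a more detailed treatment, and a justification of these particular target values, see (Platt, 1999).

2.2. Isotonic Regression

The sigmoid transformation works well for some learning methods, but it is not appropriate for others. Zadrozny and Elkan (2002; 2001) successfully used a more general method based on Isotonic Regression (Robertson et al., 1988) to calibrate predictions from SVMs, Naive Bayes, boosted Naive Bayes, and decision trees. This method is more general in that the only restriction is that the mapping function be isotonic (monotonically increasing). That is, given the predictions f_i from a model and the true targets y_i, the basic assumption in Isotonic Regression is that:

$$y_i = m(f_i) + \epsilon_i \qquad (5)$$

where m is an isotonic (monotonically increasing) function. Then, given a train set (f_i, y_i), the Isotonic Regression problem is finding the isotonic function \hat{m} such that

$$\hat{m} = \operatorname*{argmin}_{z} \sum_i \left( y_i - z(f_i) \right)^2 \qquad (6)$$

One algorithm that finds a stepwise constant solution for the Isotonic Regression problem is the pair-adjacent violators (PAV) algorithm (Ayer et al., 1955) presented in Table 1.

Table 1. PAV algorithm for estimating posterior probabilities from uncalibrated model predictions.
1. Input: training set (f_i, y_i) sorted according to f_i
2. Initialize \hat{m}_{i,i} = y_i, w_{i,i} = 1
3. While there exists i such that \hat{m}_{k,i-1} \geq \hat{m}_{i,l}:
   Set w_{k,l} = w_{k,i-1} + w_{i,l}
   Set \hat{m}_{k,l} = (w_{k,i-1} \hat{m}_{k,i-1} + w_{i,l} \hat{m}_{i,l}) / w_{k,l}
   Replace \hat{m}_{k,i-1} and \hat{m}_{i,l} with \hat{m}_{k,l}
4. Output the stepwise constant function: \hat{m}(f) = \hat{m}_{i,j}, for f_i < f \leq f_j

As in the case of Platt calibration, if we use the model training set (x_i, y_i) to get the training set (f(x_i), y_i) for Isotonic Regression, we introduce unwanted bias. So an independent validation set is used to train the isotonic function.

3. Data Sets

We compare algorithms on 8 binary classification problems.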
ADULT, COV_TYPE and LETTER are from the UCI Repository (Blake & Merz, 1998). COV_TYPE has been converted to a binary problem by treating the largest class as positive and the rest as negative. We converted LETTER to boolean two ways. LETTER.P1 treats the letter "O" as positive and the remaining 25 letters as negative, yielding a very unbalanced problem. LETTER.P2 uses letters A-M as positives and N-Z as negatives, yielding a difficult, but well balanced, problem. HS is the IndianPine92 data set (Gualtieri et al., 1999) where the difficult class Soybean-mintill is the positive class. SLAC is a problem from the Stanford Linear Accelerator. MEDIS and MG are medical data sets. The data sets are summarized in Table 2.

Table 2. Description of problems.

4. Qualitative Analysis of Predictions

In this section we qualitatively examine the calibration of the different learning algorithms. For each algorithm we use many variations and parameter settings to train different models. For example, we train models using ten decision tree styles, neural nets of many sizes, SVMs with many kernels, etc. After training, we apply Platt Scaling and Isotonic Regression to calibrate all models. Each model is trained on the same random sample of 4000 cases and calibrated on independent samples of 1000 cases. For the figures in this section we select, for each problem, and for each learning algorithm, the model that has the best calibration before or after scaling.

On real problems where the true conditional probabilities are not known, model calibration can be visualized with reliability diagrams (DeGroot & Fienberg, 1982). First, the prediction space is discretized into ten bins. Cases with predicted value between 0 and 0.1 fall in the first bin, between 0.1 and 0.2 in the second bin, etc. For each bin, the mean predicted value is plotted against the true fraction of positive cases. If the model is well calibrated the points will fall near the diagonal line.

We first examine the predictions made by boosted trees. Figure 1 shows histograms of the predicted values (top row) and reliability diagrams (middle and bottom rows) for boosted trees on the eight test problems on large test sets not used for training or calibration. An interesting aspect of the reliability plots in Figure 1 is that they display a sigmoidal shape on seven of the eight problems, motivating the use of a sigmoid to transform predictions into calibrated probabilities.
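The binning procedure behind a reliability diagram can be sketched in a few lines. This is a minimal illustration rather than the code used for the figures: the function name `reliability_diagram`, the choice to send a prediction that lies exactly on a bin edge to the higher bin, and the synthetic data in the demo are all assumptions.

```python
import numpy as np

def reliability_diagram(pred, y, n_bins=10):
    """Return (mean predicted value, fraction of positives, count) per non-empty bin."""
    pred = np.asarray(pred, dtype=float)
    y = np.asarray(y, dtype=float)
    # Equal-width bins on [0, 1]; a prediction of exactly 1.0 goes in the last bin.
    # Sending edge values to the higher bin is an assumption, not taken from the paper.
    idx = np.minimum((pred * n_bins).astype(int), n_bins - 1)
    mean_pred, frac_pos, counts = [], [], []
    for b in range(n_bins):
        mask = idx == b
        if mask.any():  # empty bins are skipped
            mean_pred.append(pred[mask].mean())
            frac_pos.append(y[mask].mean())
            counts.append(int(mask.sum()))
    return np.array(mean_pred), np.array(frac_pos), np.array(counts)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    p_true = rng.uniform(0, 1, 20000)                    # synthetic true probabilities
    y = (rng.uniform(0, 1, 20000) < p_true).astype(int)  # labels drawn from them
    squashed = 0.5 + 0.4 * (p_true - 0.5)                # predictions pulled away from 0 and 1
    for name, pred in [("calibrated", p_true), ("squashed", squashed)]:
        mp, fp, n = reliability_diagram(pred, y)
        print(name)
        for a, b, c in zip(mp, fp, n):
            print(f"  mean pred {a:.2f}  frac pos {b:.2f}  n={c}")
```

Plotting the first returned array against the second gives the diagram; points near the diagonal indicate good calibration, while the squashed predictions in the demo depart from it in the tails.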
The reliability plots in the middle row of Figure 1 show sigmoids fitted using Platt's method. The reliability plots in the bottom row show the function fitted with Isotonic Regression.

Examining the histograms of predicted values (top row in Figure 1), note that almost all the values predicted by boosted trees lie in the central region with few predictions approaching 0 or 1. The one exception is LETTER.P1, a highly skewed data set that has only 3% positive class. On this problem some predicted values do approach 0, though careful examination of the histogram shows that even on this problem there is a sharp drop in the number of cases predicted to have probability near 0. This shifting of the predictions toward the center of the histogram causes the sigmoid-shaped reliability plots of boosted trees.

To show how calibration transforms predictions, we plot histograms and reliability diagrams for the eight problems for boosted trees after Platt Calibration (Figure 2) and Isotonic Regression (Figure 3). The figures show that calibration undoes the shift in probability mass caused by boosting: after calibration many more cases have predicted probabilities near 0 and 1. The reliability diagrams are closer to the diagonal, and the S-shape characteristic of boosted tree predictions is gone. On each problem, transforming predictions using Platt Scaling or Isotonic Regression yields a significant improvement in the predicted probabilities, leading to much lower squared error and log-loss. One difference between Isotonic Regression and Platt Scaling is apparent in the histograms: because Isotonic Regression generates a piecewise constant function, the histograms are coarse, while the histograms generated by Platt Scaling are smoother. See (Niculescu-Mizil & Caruana, 2005) for a more thorough analysis of boosting from the point of view of predicting well-calibrated probabilities.

Figure 6 shows the prediction histograms for the ten learning methods on the SLAC problem before calibration, and after calibration with Platt's method.
Reliability diagrams showing the fitted functions for Platt's method and Isotonic Regression are also shown. Boosted stumps and SVMs also exhibit distinctive sigmoid-shaped reliability plots (second and third rows, respectively, of Figure 6). (SVM predictions are scaled to [0,1] by (x - min)/(max - min).) Boosted stumps and SVMs exhibit similar behavior on the other seven problems. As in the case of boosted trees, the sigmoidal shape of the reliability plots co-occurs with the concentration of mass in the center of the histograms of predicted values, with boosted stumps being the most extreme. It is interesting to note that the learning methods that exhibit this behavior are maximum margin methods. The sigmoid-shaped reliability plot that results from predictions being pushed away from 0 and 1 appears to be characteristic of max margin methods.

Figure 1. Histograms of predicted values and reliability diagrams for boosted decision trees.
Figure 2. Histograms of predicted values and reliability diagrams for boosted trees calibrated with Platt's method.
Figure 3. Histograms of predicted values and reliability diagrams for boosted trees calibrated with Isotonic Regression.

Figure 4, which shows histograms of predicted values and reliability plots for neural nets, tells a very different story. The reliability plots closely follow the diagonal line, indicating that neural nets are well calibrated to begin with and do not need post-training calibration. Only the COV_TYPE problem appears to benefit a little from calibration. On the other problems both calibration methods appear to be striving to approximate the diagonal line, a task that is not natural to either of them. Because of this, scaling might hurt neural net calibration a little. The sigmoids trained with Platt's method have trouble fitting the tails properly, effectively pushing predictions away from 0 and 1, as can be seen in the histograms in Figure 5. The histograms for uncalibrated neural nets in Figure 4 look similar to the histograms for boosted trees after Platt Scaling in Figure 2, giving us confidence that the histograms reflect the underlying structure of the problems. For example, we could conclude that the LETTER and HS problems, given the available features, have well defined classes with a small number of cases in the "gray" region, while in the SLAC problem the two classes have high overlap with significant uncertainty for most cases. It is interesting to note that neural networks with a single sigmoid output unit can be viewed as a linear classifier (in the span of its hidden units) with a sigmoid at the output that calibrates the predictions.
In this respect neural nets are similar to SVMs and boosted trees after they have been calibrated using Platt's method.

Figure 5. Histograms of predicted values and reliability diagrams for neural nets calibrated with Platt's method.

Examining the histograms and reliability diagrams for logistic regression and bagged trees shows that they behave similarly to neural nets. Both learning algorithms are well calibrated initially and post-calibration does not help them on most problems. Bagged trees are helped a little by post-calibration on the MEDIS and LETTER.P2 problems. While it is not surprising that logistic regression predicts such well-calibrated probabilities, it is interesting that bagging decision trees also yields well-calibrated models. Given that bagged trees are well calibrated, we can deduce that regular decision trees also are well calibrated on average, in the sense that if decision trees are trained on different samples of the data and their predictions averaged, the average will be well calibrated. Unfortunately, a single decision tree has high variance and this variance affects its calibration. Platt Scaling is not able to deal with this high variance, but Isotonic Regression can help fix some of the problems created by variance. Rows five, six and seven in Figure 6 show the histograms (before and after calibration) and reliability diagrams for logistic regression, bagged trees, and decision trees on the SLAC problem.

Random forests are less clear cut. RF models are well calibrated on some problems, but are poorly calibrated on LETTER.P2, and not well calibrated on HS, COV_TYPE, MEDIS and LETTER.P1. It is interesting that on these problems RFs seem to exhibit, although to a lesser extent, the same behavior as the max margin methods: predicted values are slightly pushed toward the middle of the histogram and the reliability plots show a sigmoidal shape (more accentuated on the LETTER problems and less so on COV_TYPE, MEDIS and HS).
Methods such as bagging and random forests that average predictions from a base set of models can have difficulty making predictions near 0 and 1 because variance in the underlying base models will bias predictions that should be near zero or one away from these values. Because predictions are restricted to the interval [0,1], errors caused by variance tend to be one-sided near zero and one. For example, if a model should predict p = 0 for a case, the only way bagging can achieve this is if all bagged trees predict zero. If we add noise to the trees that bagging is averaging over, this noise will cause some trees to predict values larger than 0 for this case, thus moving the average prediction of the bagged ensemble away from 0. We observe this effect most strongly with random forests because the base-level trees trained with random forests have relatively high variance due to feature subsetting. Post-calibration seems to help mitigate this problem.

Because Naive Bayes makes the unrealistic assumption that the attributes are conditionally independent given the class, it tends to push predicted values toward 0 and 1. This is the opposite behavior from the max margin methods and creates reliability plots that have an inverted sigmoid shape. While Platt Scaling still helps to improve calibration, it is clear that a sigmoid is not the right transformation to calibrate Naive Bayes models. Isotonic Regression is a better choice to calibrate these models.

Figure 6. Histograms and reliability diagrams for SLAC.

Returning to Figure 6, we see that the histograms of the predicted values before calibration (first column) from the ten different models display wide variation. The max margin methods (SVM, boosted trees, and boosted stumps) have the predicted values massed in the center of the histograms, causing a sigmoidal shape in the reliability plots. Both Platt Scaling and Isotonic Regression are effective at fitting this sigmoidal shape. After calibration the prediction histograms extend further into the tails near predicted values of 0 and 1.

For methods that are well calibrated (neural nets, bagged trees, random forests, and logistic regression), calibration with Platt Scaling actually moves probability mass away from 0 and 1.
It is clear from looking at the reliability diagrams for these methods that the sigmoid has difficulty fitting the predictions in the tails of these well-calibrated methods.

Overall, if one examines the probability histograms before and after calibration, it is clear that the histograms are much more similar to each other after Platt Scaling. Calibration significantly reduces the differences between the probabilities predicted by the different models. Of course, calibration is unable to fully correct the predictions from the inferior models such as decision trees and naive bayes.

5. Learning Curve Analysis

In this section we present a learning curve analysis of the two calibration methods, Platt Scaling and Isotonic Regression. The goal is to determine how effective these calibration methods are as the amount of data available for calibration varies. For this analysis we use the same models as in Section 4, but here we vary the size of the calibration set from 32 cases to 8192 cases by factors of two. To measure calibration performance we examine the squared error of the models.

The plots in Figure 7 show the average squared error over the eight test problems. For each problem, we perform ten trials. Error bars are shown on the plots, but are so narrow that they may be difficult to see. Calibration learning curves are shown for nine of the ten learning methods (decision trees are left out).

Figure 7. Learning curves for Platt Scaling and Isotonic Regression (averages across 8 problems).

The nearly horizontal lines in the graphs show the squared error prior to calibration. These lines are not perfectly horizontal only because the test sets change as more data is moved into the calibration sets. Each plot shows the squared error after calibration with Platt's method or Isotonic Regression as the size of the calibration set varies from small to large. When the calibration set is small (less than about 200-1000 cases), Platt Scaling outperforms Isotonic Regression with all nine learning methods. This happens because Isotonic Regression is less constrained than Platt Scaling, so it is easier for it to overfit when the calibration set is small. Platt's method also has some overfitting control built in (see Section 2). As the size of the calibration set increases, the learning curves for Platt Scaling and Isotonic Regression join, or even cross. When there are 1000 or more points in the calibration set, Isotonic Regression always yields performance as good as, or better than, Platt Scaling.

For learning methods that make well-calibrated predictions such as neural nets, bagged trees, and logistic regression, neither Platt Scaling nor Isotonic Regression yields much improvement in performance even when the calibration set is very large. With these methods calibration is not beneficial, and actually hurts performance when the calibration sets are small.

For the max margin methods, boosted trees, boosted stumps and SVMs, calibration provides an improvement even when the calibration set is small. In Section 4 we saw that a sigmoid is a good match for boosted trees, boosted stumps, and SVMs. As expected, for these methods Platt Scaling performs better than Isotonic Regression for small to medium sized calibration sets (less than 1000 cases), and is virtually indistinguishable for larger calibration sets.

As expected, calibration improves the performance of Naive Bayes models for almost all calibration set sizes, with Isotonic Regression outperforming Platt Scaling when there is more data. For the rest of the models, KNN, RF and DT (not shown), post-calibration helps once the calibration sets are large enough.

6. Empirical Comparison

As before, for each learning algorithm we train different models using different parameter settings and calibrate each model with Isotonic Regression and Platt Scaling. Models are trained on 4k samples and calibrated on independent 1k samples.
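Fitting the calibration map on data the model was not trained on can be sketched for Platt Scaling as follows. This is a hedged illustration, not the authors' implementation: it follows the sigmoid, objective and smoothed targets of Section 2.1 (Equations 1 to 4), but it uses Newton steps with step halving instead of the gradient descent mentioned there. The names `fit_platt` and `platt_predict` are hypothetical and the demo data are synthetic.

```python
import numpy as np

def fit_platt(f, y, n_iter=100):
    """Fit A, B in p = 1 / (1 + exp(A*f + B)) on a held-out calibration set."""
    f = np.asarray(f, dtype=float)
    y = np.asarray(y)
    n_pos = int(np.sum(y == 1))
    n_neg = len(y) - n_pos
    # Smoothed targets y+ and y- (Eq. 4) instead of hard 1/0 labels.
    t = np.where(y == 1, (n_pos + 1.0) / (n_pos + 2.0), 1.0 / (n_neg + 2.0))
    X = np.column_stack([f, np.ones_like(f)])  # columns multiply A and B

    def loss(theta):
        z = X @ theta
        # Cross-entropy of Eq. 2 with p = 1 / (1 + exp(z)), in a numerically stable form.
        return np.sum(np.logaddexp(0.0, z) - (1.0 - t) * z)

    theta = np.zeros(2)  # starting point A = 0, B = 0 (an assumption)
    for _ in range(n_iter):
        z = X @ theta
        p = np.exp(-np.logaddexp(0.0, z))  # p = 1 / (1 + exp(z))
        grad = X.T @ (t - p)
        hess = (X * (p * (1.0 - p))[:, None]).T @ X + 1e-12 * np.eye(2)
        step = np.linalg.solve(hess, grad)
        scale, current = 1.0, loss(theta)
        # Halve the Newton step until the loss does not increase.
        while loss(theta - scale * step) > current and scale > 1e-8:
            scale *= 0.5
        theta = theta - scale * step
        if np.max(np.abs(scale * step)) < 1e-10:
            break
    return theta[0], theta[1]

def platt_predict(f, A, B):
    """Map raw scores to calibrated probabilities with Eq. 1."""
    z = A * np.asarray(f, dtype=float) + B
    return np.exp(-np.logaddexp(0.0, z))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    scores = rng.normal(size=1000)  # stand-in for model outputs on a calibration set
    labels = (rng.uniform(size=1000) < 1 / (1 + np.exp(-3 * scores))).astype(int)
    A, B = fit_platt(scores, labels)
    print("A =", A, "B =", B)
    print(platt_predict([-1.0, 0.0, 1.0], A, B))
```

Because the targets are smoothed away from 0 and 1, the fitted A stays finite even when the calibration scores separate the two classes perfectly, which is the overfitting control referred to in Section 5.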
For each data set, learning algorithm, and calibration method, we select the model with the best performance using the same 1k points used for calibration.

Figure 8 shows the squared error (top) and log-loss (bottom) for each learning method before and after calibration. Each bar averages over five trials on each of the eight problems. Error bars representing 1 standard deviation for the means are shown. The probabilities predicted by four learning methods (boosted trees, SVMs, boosted stumps, and naive bayes) are dramatically improved by calibration. Calibration does not help bagged trees, and actually hurts neural nets. Before calibration, the best models are random forests, bagged trees, and neural nets. After calibration, however, boosted trees, random forests, and SVMs predict the best probabilities.

7. Conclusions

In this paper we examined the probabilities predicted by ten different learning methods. Maximum margin methods such as boosting and SVMs yield characteristic distortions in their predictions. Other methods such as naive bayes make predictions with the opposite distortion. And methods such as neural nets and bagged trees predict well-calibrated probabilities. We examined the effectiveness of Platt Scaling and Isotonic Regression for calibrating the predictions made by different learning methods. Platt Scaling is most effective when the data is small, but Isotonic Regression is more powerful when there is sufficient data to prevent overfitting. After calibration, the models that predict the best probabilities are boosted trees, random forests, SVMs, uncalibrated bagged trees and uncalibrated neural nets.

Acknowledgments

Thanks to B. Zadrozny and C. Elkan for the Isotonic Regression code, C. Young et al. at Stanford Linear Accelerator for the SLAC data, and A. Gualtieri at Goddard Space Center for help with the Indian Pines Data. This work was supported by NSF Award 0412930.

References

Ayer, M., Brunk, H., Ewing, G., Reid, W., & Silverman, E. (1955). An empirical distribution function for sampling with incomplete information. Annals of Mathematical Statistics, 26, 641-647.

Blake, C., & Merz, C. (1998). UCI repository of machine learning databases.

DeGroot, M., & Fienberg, S. (1982). The comparison and evaluation of forecasters. Statistician, 32, 12-22.

Gualtieri, A., Chettri, S. R., Cromp, R., & Johnson, L. (1999). Support vector machine classifiers as applied to AVIRIS data. Proc. Eighth JPL Airborne Geoscience Workshop.

Niculescu-Mizil, A., & Caruana, R. (2005). Obtaining calibrated probabilities from boosting. Proc. 21st Conference on Uncertainty in Artificial Intelligence (UAI '05). AUAI Press.

Platt, J. (1999). Probabilistic outputs for support vector machines and comparison to regularized likelihood methods. Advances in Large Margin Classifiers (pp. 61-74).

Robertson, T., Wright, F., & Dykstra, R. (1988). Order restricted statistical inference. New York: John Wiley and Sons.

Zadrozny, B., & Elkan, C. (2001). Obtaining calibrated probability estimates from decision trees and naive bayesian classifiers. ICML (pp. 609-616).

Zadrozny, B., & Elkan, C. (2002). Transforming classifier scores into accurate multiclass probability estimates. KDD (pp. 694-699).
