THIS SCALABLE BLACK-BOX APPROACH FOR BUILDING EFFICIENT FRAUD DETECTORS CAN SIGNIFICANTLY REDUCE LOSS DUE TO ILLEGITIMATE BEHAVIOR. IN MANY CASES, THE AUTHORS' METHODS OUTPERFORM A WELL-KNOWN, STATE-OF-THE-ART COMMERCIAL FRAUD-DETECTION SYSTEM.

CREDIT CARD TRANSACTIONS CONTINUE to grow in number, taking an ever-larger share of the US payment system and leading to a higher rate of stolen account numbers and subsequent losses by banks. Improved fraud detection thus has become essential to maintain the viability of the US payment system. Banks have used early fraud warning systems for some years.

Large-scale data-mining techniques can improve on the state of the art in commercial practice. Developing scalable techniques that analyze massive amounts of transaction data and efficiently compute fraud detectors in a timely manner is an important problem, especially for e-commerce. Besides scalability and efficiency, the fraud-detection task exhibits technical problems that include skewed distributions of training data and nonuniform cost per error, neither of which has been widely studied in the knowledge-discovery and data-mining community.

In this article, we survey and evaluate a number of techniques that address these three main issues concurrently. Our proposed methods of combining multiple learned fraud detectors under a "cost model" are general and demonstrably useful; our empirical results demonstrate that we can significantly reduce loss due to fraud through distributed data mining of fraud models.

Our approach

In today's increasingly electronic society and with the rapid advances of electronic commerce on the Internet, the use of credit cards for purchases has become convenient and necessary. Credit card transactions have become the de facto standard for Internet and Web-based e-commerce. The US government estimates that credit cards accounted for approximately US $13 billion in Internet sales during 1998. This figure is expected to grow rapidly each year. However, the growing number of credit card transactions provides more opportunity for thieves to steal credit card numbers and subsequently commit fraud. When banks lose money because of credit card fraud, cardholders pay for all of that loss through higher interest rates, higher fees, and reduced benefits. Hence, it is in both the banks' and the cardholders' interest to reduce illegitimate use of credit cards by early fraud detection. For many years, the credit card industry has studied computing models for automated detection systems; recently, these models have been the subject of academic research, especially with respect to e-commerce.

The credit card fraud-detection domain presents a number of challenging issues for data mining:

• There are millions of credit card transactions processed each day. Mining such massive amounts of data requires highly efficient techniques that scale.
• The data are highly skewed: many more transactions are legitimate than fraudulent. Typical accuracy-based mining techniques can generate highly accurate fraud detectors by simply predicting that all transactions are legitimate, although this is equivalent to not detecting fraud at all.
• Each transaction record has a different dollar amount and thus has a variable potential loss, rather than a fixed misclassification cost per error type, as is commonly assumed in cost-based mining techniques.

Our approach addresses the efficiency and scalability issues in several ways. We divide a large data set of labeled transactions (either fraudulent or legitimate) into smaller subsets, apply mining techniques to generate classifiers in parallel, and combine the resultant base models by metalearning from the classifiers' behavior to generate a metaclassifier.[1] Our approach treats the classifiers as black boxes so that we can employ a variety of learning algorithms. Besides extensibility, combining multiple models computed over all available data produces metaclassifiers that can offset the loss of predictive performance that usually occurs when mining from data subsets or sampling. Furthermore, when we use the learned classifiers (for example, during transaction authorization), the base classifiers can execute in parallel, with the metaclassifier then combining their results. So, our approach is highly efficient in generating these models and also relatively efficient in applying them.

Another parallel approach focuses on par-
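The divide, learn-in-parallel, and metalearn scheme described in "Our approach" can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`partition`, `learn_stump`, `train_base_classifiers`, `metalearn`) are hypothetical, the base learner is a toy single-threshold rule standing in for any black-box learning algorithm, and the metalearner is a simple lookup table from base predictions to the most frequent true label on separate validation data.

```python
from collections import Counter, defaultdict
from concurrent.futures import ThreadPoolExecutor


def partition(data, labels, k):
    """Divide the labeled transactions into k smaller subsets (round-robin)."""
    return [(data[i::k], labels[i::k]) for i in range(k)]


def learn_stump(subset):
    """Toy base learner (assumption): best 'feature >= threshold' rule on one subset."""
    data, labels = subset
    best = (-1, 0, 0.0)  # (number correct, feature index, threshold)
    for f in range(len(data[0])):
        for t in sorted({x[f] for x in data}):
            correct = sum(int(x[f] >= t) == y for x, y in zip(data, labels))
            if correct > best[0]:
                best = (correct, f, t)
    _, f, t = best
    # Black box: record -> 0 (legitimate) or 1 (fraudulent).
    return lambda x: int(x[f] >= t)


def train_base_classifiers(subsets, learner=learn_stump):
    """Generate one base classifier per subset; subsets are independent, so map in parallel."""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(learner, subsets))


def metalearn(base, val_data, val_labels):
    """Learn how the base classifiers' predictions map to the true label."""
    votes = defaultdict(Counter)
    for x, y in zip(val_data, val_labels):
        votes[tuple(c(x) for c in base)][y] += 1
    table = {key: counts.most_common(1)[0][0] for key, counts in votes.items()}

    def metaclassifier(x):
        key = tuple(c(x) for c in base)
        # Prediction pattern never seen in validation: fall back to a majority vote.
        return table.get(key, int(2 * sum(key) > len(key)))

    return metaclassifier
```

Because the metalearner only sees the base classifiers' outputs, any learning algorithm can be swapped in for `learn_stump` without changing the combining step, which is the point of treating classifiers as black boxes.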
Table 2. Cost and savings in the credit card fraud domain using class-combiner (cost ± 95% confidence interval).

Cost             Savings
20.07 ± .13      46
23.64 ± .96      36
29.08 ± 1.60     21
Experiments and results. To evaluate our multiclassifier class-combiner approach to skewed class distributions, we performed a set of experiments using the credit card fraud data from Chase. We used transactions from the first eight months (10/95-5/96) for training, the ninth month (6/96) for validating, and the twelfth month (9/96) for testing. (Because credit card transactions have a natural two-month business cycle, in which the time to bill a customer is one month, followed by a one-month payment period, the true label of a transaction cannot be determined in less than two months' time. Hence, models built from data in one month cannot be rationally applied for fraud detection in the next month. We therefore test our models on data that is at least two months newer.)

Based on the empirical results from the effects of class distributions, the desired distribution is 50:50. Because the given distribution is 20:80, four subsets are generated from each month for a total of 32 subsets. We applied four learning algorithms (C4.5, CART, Ripper, and Bayes) to each subset and generated 128 base classifiers. Based on our experience with training metaclassifiers, Bayes is generally more effective and efficient, so it is the metalearner for all the experiments reported here.

Furthermore, to investigate if our approach is indeed fruitful, we ran experiments on the class-combiner strategy directly applied to the original data sets from the first eight months (that is, they have the given 20:80 distribution). We also evaluated how individual classifiers generated from each month perform without class-combining.

Table 2 shows the cost and savings from the class-combiner strategy using the 50:50 distribution (128 base classifiers), the average of individual CART classifiers generated using the desired distribution (10 classifiers), class-combiner using the given distribution (32 base classifiers: 8 months × 4 learning algorithms), and the average of individual classifiers using the given distribution (the average of 32 classifiers). (We did not perform experiments on simply replicating the minority instances to achieve 50:50 in one single data set because this approach increases the training-set size and is not appropriate in domains with large amounts of data, one of the three primary issues we address here.) Compared to the other three methods, class-combining on subsets with a 50:50 fraud distribution clearly achieves a significant increase in savings: at least $110,000 for the month (6/96). When the overhead is $50, more than half of the losses were prevented.

Surprisingly, we also observe that when the [...] 50:50 distribution (generated by ignoring some data) achieved significantly more savings than combining classifiers trained from all eight months' data with the given distribution. This reaffirms the importance of employing the appropriate training class distribution in this domain. Class-combiner also contributed to the performance improvement. Consequently, utilizing the desired training distribution and class-combiner provides a synergistic approach to data mining with nonuniform class and cost distributions. Perhaps more importantly, how do our techniques perform compared to the bank's existing fraud-detection system? We label the current system "COTS" (commercial off-the-shelf system) in Table 2. COTS achieved significantly less savings than our techniques in the three overhead amounts we report in this table.

This comparison might not be entirely accurate because COTS has much more training data than we have and it might be optimized to a different cost model (which might even be the simple error rate). Furthermore, unlike COTS, our techniques are general for problems with skewed distributions and do not utilize any domain knowledge in detecting credit card fraud; the only exception is the cost model used for evaluation and search guidance. Nevertheless, COTS' performance on the test data provides some indication of how the existing fraud-detection system behaves in the real world.

We also evaluated our method with more skewed distributions (by downsampling minority instances): 10:90, 1:99, and 1:999. As we discussed earlier, the desired distribution is not necessarily 50:50; for instance, the desired distribution is 30:70 when the given distribution is 10:90. With 10:90 distributions, our method reduced the cost significantly more than COTS. With 1:99 distributions, our method did not outperform COTS. Neither method achieved any savings with 1:999 distributions.

To characterize the condition when our techniques are effective, we calculate $R$, the ratio of the overhead amount to the average cost: $R = \text{Overhead} / \text{Average cost}$. Our approach is significantly more effective than the deployed COTS when $R < 6$. Neither method is effective when $R > 24$. So, under a reasonable cost model with a fixed overhead cost in challenging transactions as potentially fraudulent, when the number of fraudulent transactions is a very small percentage of the total, it is financially undesirable to detect fraud. The loss due to this fraud is yet another cost of conducting business.

However, filtering out "easy" (or low-risk) transactions (the data we received were possibly filtered by a similar process) can reduce a high overhead-to-loss ratio. The filtering process can use fraud detectors that are built based on individual customer profiles, which are now in use by many credit card companies. These individual profiles characterize the customers' purchasing behavior. For example, if a customer regularly buys groceries at a particular supermarket or has set up a monthly payment for phone bills, these transactions are close to no risk; hence, purchases of similar characteristics can be safely authorized without further checking. Reducing the overhead through streamlining business operations and increased automation will also lower the ratio.

Knowledge sharing through bridging

Much of the prior work on combining multiple models assumes that all models originate from different (not necessarily distinct) subsets of a single data set as a means to increase accuracy (for example, by imposing probability distributions over the instances of the training set, or by stratified sampling, subsampling, and so forth) and not as a means to integrate distributed information. Although the JAM system addresses the latter problem by employing metalearning techniques, integrating classification models derived from distinct and distributed databases might not always be feasible.

In all cases considered so far, all classification models are assumed to originate from databases of identical schemata. Because classifiers depend directly on the underlying data's format, minor differences in the schemata between databases derive incompatible classifiers; that is, a classifier cannot be applied on data of different formats. Yet these classifiers may target the same concept. We seek to bridge these disparate classifiers in some principled fashion.

The banks seek to be able to exchange their classifiers and thus incorporate useful information in their system that would otherwise be inaccessible to both. Indeed, for each credit card transaction, both institutions record similar information; however, they also include specific fields containing important information that each has acquired separately and that provides predictive value in determining fraudulent transaction patterns. To facilitate the exchange of knowledge (represented as remotely learned models) and take advantage of incompatible and otherwise useless classifiers, we need to devise methods that bridge the differences imposed by the different schemata.

Database compatibility. The incompatible schema problem impedes JAM from taking advantage of all available databases. Let's consider two data sites $A$ and $B$ with databases $DB_A$ and $DB_B$, having similar but not identical schemata. Without loss of generality, we assume that

$\mathit{Schema}(DB_A) = \{A_1, A_2, \ldots, A_n, A_{n+1}, C\}$
$\mathit{Schema}(DB_B) = \{B_1, B_2, \ldots, B_n, B_{n+1}, C\}$

where $A_i$ and $B_i$ denote the $i$th attribute of $DB_A$ and $DB_B$, respectively, and $C$ the class label (for example, the fraud/legitimate label in the credit card fraud illustration) of each instance. Without loss of generality, we further assume that $A_i = B_i$, $1 \le i \le n$. As for the $A_{n+1}$ and $B_{n+1}$ attributes, there are two possibilities:

1. $A_{n+1} \ne B_{n+1}$: The two attributes are of entirely different types drawn from distinct domains. The problem can then be reduced to two dual problems where one database has one more attribute than the other; that is,

   $\mathit{Schema}(DB_A) = \{A_1, A_2, \ldots, A_n, A_{n+1}, C\}$
   $\mathit{Schema}(DB_B) = \{B_1, B_2, \ldots, B_n, C\}$

   where we assume that attribute $B_{n+1}$ is not present in $DB_B$. (The dual problem has $DB_B$ composed with $B_{n+1}$, but $A_{n+1}$ is not available to $A$.)

2. $A_{n+1} \approx B_{n+1}$: The two attributes are of similar type but slightly different semantics; that is, there might be a map from the domain of one type to the domain of the other. For example, $A_{n+1}$ and $B_{n+1}$ are fields with time-dependent information but of different duration (that is, $A_{n+1}$ might denote the number of times an event occurred within a window of half an hour and $B_{n+1}$ might denote the number of times the same event occurred but within 10 minutes).

In both cases (attribute $A_{n+1}$ is either not present in $DB_B$ or semantically different from the corresponding $B_{n+1}$), the classifiers $C_{Aj}$ derived from $DB_A$ are not compatible with $DB_B$'s data and therefore cannot be directly used in $DB_B$'s site and vice versa. But the purpose of using a distributed data-mining system and deploying learning agents and metalearning their classifier agents is to be able to combine information from different sources.

Bridging methods. There are two methods for handling the missing attributes.[5]

Method I: Learn a local model for the missing attribute and exchange. Database $DB_B$ imports, along with the remote classifier agent, a bridging agent from database $DB_A$ that is trained to compute values for the missing attribute $A_{n+1}$ in $DB_B$'s data. Next, $DB_B$ applies the bridging agent on its own data to estimate the missing values. In many cases, however, this might not be possible or desirable by $DB_A$ (for example, in case the attribute is proprietary). The alternative for database $DB_B$ is to simply add the missing attribute to its data set and fill it with null values. Even though the missing attribute might have high predictive value for $DB_A$, it is of no value to $DB_B$. After all, $DB_B$ did not include it in its schema, and presumably other attributes (including the common ones) have predictive value.

Method II: Learn a local model without the missing attribute and exchange. In this approach, database $DB_A$ can learn two local models: one with the attribute $A_{n+1}$ that can be used locally by the metalearning agents and one without it that can be subsequently exchanged. Learning a second classifier agent without the missing attribute, or with the attributes that belong to the intersection of the attributes of the two databases' data sets, implies that the second classifier uses only the attributes that are common among the participating sites and no issue exists for its integration at other databases. But, remote classifiers imported by database $DB_A$ (and assured not to involve predictions over the missing attributes) can still be locally integrated with the original model that employs $A_{n+1}$. In this case, the remote classifiers simply ignore the local data set's missing attributes.

Both approaches address the incompatible schema problem, and metalearning over these models should proceed in a straightforward manner. Compared to Zbigniew Ras's related work,[6] our approach is more general because our techniques support both categorical and continuous attributes and are not limited to a specific syntactic case or purely logical consistency of generated rule models. Instead, these bridging techniques can employ machine- or statistical-learning algorithms to compute arbitrary models for the missing values.

Pruning

An ensemble of classifiers can be unnecessarily complex, meaning that many classifiers might be redundant, wasting resources and reducing system throughput. (Throughput here denotes the rate at which a metaclassifier can pipe through and label a stream of data items.) We study the efficiency of metaclassifiers by investigating the effects of pruning (discarding certain base classifiers) on their performance. Determining the optimal set of classifiers for metalearning is a combinatorial problem. Hence, the objective of pruning is to utilize heuristic methods to search for partially grown metaclassifiers (metaclassifiers with pruned subtrees) that are more efficient and scalable and at the same time achieve comparable or better predictive performance

IEEE INTELLIGENT SYSTEMS, NOVEMBER/DECEMBER 1999
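Method I from the bridging discussion can be illustrated with a small sketch. The assumptions are loud ones: the attribute names (`amount`, `hourly_count`), the helper names (`train_bridging_agent`, `bridge`), and the choice of a one-variable least-squares fit as the bridging agent are all hypothetical and chosen only for brevity. The article itself says only that a bridging agent is trained at site A to compute values for the missing attribute and is then applied by site B to its own data.

```python
def train_bridging_agent(rows_a, common, missing):
    """At site A: fit missing ~ slope * common + intercept by least squares."""
    xs = [row[common] for row in rows_a]
    ys = [row[missing] for row in rows_a]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = cov_xy / var_x if var_x else 0.0
    intercept = mean_y - slope * mean_x
    # The returned agent needs only the common attribute, so site B can run it.
    return lambda row: slope * row[common] + intercept


def bridge(rows_b, agent, missing):
    """At site B: return copies of the local rows with the missing attribute estimated."""
    return [{**row, missing: agent(row)} for row in rows_b]


# Site A has both attributes; site B lacks `hourly_count` (all names hypothetical).
rows_a = [{"amount": float(a), "hourly_count": a / 10 + 1} for a in (10, 20, 30, 40)]
rows_b = [{"amount": 50.0}, {"amount": 5.0}]

agent = train_bridging_agent(rows_a, common="amount", missing="hourly_count")
bridged_b = bridge(rows_b, agent, missing="hourly_count")

# A remote classifier from site A that depends on the attribute B lacked.
remote_classifier_a = lambda row: int(row["hourly_count"] > 5)
predictions = [remote_classifier_a(row) for row in bridged_b]
```

When site A cannot share a bridging agent, the fallback described under Method I would correspond to filling the attribute with null values instead of calling `agent`.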
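The pruning idea, discarding some base classifiers while keeping predictive performance comparable, can be illustrated with a simple greedy search. This is one possible heuristic chosen purely for illustration and should not be read as the authors' pruning method: the names `vote`, `accuracy`, and `greedy_prune` are hypothetical, the selected classifiers are combined by majority vote rather than by a learned metaclassifier, and plain accuracy on validation data stands in for whatever evaluation measure a deployment would use.

```python
def vote(selected, x):
    """Majority vote of the selected base classifiers (ties count as fraud)."""
    return int(2 * sum(c(x) for c in selected) >= len(selected))


def accuracy(selected, data, labels):
    """Fraction of validation records the voted ensemble labels correctly."""
    return sum(vote(selected, x) == y for x, y in zip(data, labels)) / len(data)


def greedy_prune(base, data, labels, keep):
    """Greedily pick `keep` classifiers, each time adding the one that helps most."""
    selected, remaining = [], list(base)
    while remaining and len(selected) < keep:
        best = max(remaining, key=lambda c: accuracy(selected + [c], data, labels))
        selected.append(best)
        remaining.remove(best)
    return selected
```

A greedy search evaluates on the order of `keep * len(base)` candidate ensembles instead of every subset, which is why heuristics are attractive when choosing the optimal set is a combinatorial problem.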
single base classifiers in every category. Moreover, by bridging the two databases, we managed to further improve the metalearning system's performance.

However, combining classifiers' agents from the two banks directly (without bridging) is not very effective, no doubt because the attribute missing from the First Union data set is significant in modeling the Chase data set. Hence, the First Union classifiers are not as effective as the Chase classifiers on the Chase data, and the Chase classifiers cannot perform at full strength at the First Union sites without the bridging agents.

This table also shows the invaluable contribution of pruning. In all cases, pruning succeeded in computing metaclassifiers with similar or better fraud-detection capabilities, while reducing their size and thus improving their efficiency. We have provided a detailed description of the pruning methods and a comparative study between predictive performance and metaclassifier throughput elsewhere.[7]

ONE LIMITATION OF OUR APPROACH to skewed distributions is the need to run preliminary experiments to determine the desired training distribution based on a defined cost model. This process can be automated, but it is unavoidable because the desired distribution highly depends on the cost model and the learning algorithm. Currently, for simplicity reasons, all the base learners use the same desired distribution; using an individualized training distribution for each base learner could improve the performance. Furthermore, because thieves also learn and fraud patterns evolve over time, some classifiers are more relevant than others at a particular time. Therefore, an adaptive classifier-selection method is essential. Unlike a monolithic approach of learning one classifier using incremental learning, our modular multiclassifier approach facilitates adaptation over time and removes out-of-date knowledge.

Our experience in credit card fraud detection has also affected other important applications. Encouraged by our results, we shifted our attention to the growing problem of intrusion detection in network- and host-based computer systems. Here we seek to perform the same sort of task as in the credit card fraud domain. We seek to build models to distinguish between bad (intrusions or attacks) and good (normal) connections or processes. By first applying feature-extraction algorithms, followed by the application of machine-learning algorithms (such as Ripper) to learn and combine multiple models for different types of intrusions, we have achieved remarkably good success in detecting intrusions.[9] This work, as well as the results reported in this article, demonstrates convincingly that distributed data-mining techniques that combine multiple models produce effective fraud and intrusion detectors.

References

1. P. Chan and S. Stolfo, "Metalearning for Multistrategy and Parallel Learning," Proc. Second Int'l Workshop Multistrategy Learning, Center for Artificial Intelligence, George Mason Univ., Fairfax, Va., 1993, pp. 150-165.
2. S. Stolfo et al., "JAM: Java Agents for Metalearning over Distributed Databases," Proc. Third Int'l Conf. Knowledge Discovery and Data Mining, AAAI Press, Menlo Park, Calif., 1997, pp. 74-81.
3. D. Wolpert, "Stacked Generalization," Neural Networks, Vol. 5, 1992, pp. 241-259.
4. P. Chan and S. Stolfo, "Toward Scalable Learning with Nonuniform Class and Cost Distributions: A Case Study in Credit Card Fraud Detection," Proc. Fourth Int'l Conf. Knowledge Discovery and Data Mining, AAAI Press, Menlo Park, Calif., 1998, pp. 164-168.
5. A. Prodromidis and S. Stolfo, "Mining Databases with Different Schemas: Integrating Incompatible Classifiers," Proc. Fourth Int'l Conf. Knowledge Discovery and Data Mining, AAAI Press, Menlo Park, Calif., 1998, pp. 314-318.
6. Z. Ras, "Answering Nonstandard Queries in Distributed Knowledge-Based Systems," Rough Sets in Knowledge Discovery, Studies in Fuzziness and Soft Computing, L. Polkowski and A. Skowron, eds., Physica Verlag, New York, 1998, pp. 98-108.
7. A. Prodromidis and S. Stolfo, "Pruning Meta-classifiers in a Distributed Data Mining System," Proc. First Nat'l Conf. New Information Technologies, Editions of New Tech., Athens, 1998, pp. 151-160.
8. D. Margineantu and T. Dietterich, "Pruning Adaptive Boosting," Proc. 14th Int'l Conf. Machine Learning, Morgan Kaufmann, San Francisco, 1997, pp. 211-218.
9. W. Lee, S. Stolfo, and K. Mok, "Mining in a Data-Flow Environment: Experience in Network Intrusion Detection," Proc. Fifth Int'l Conf. Knowledge Discovery and Data Mining, AAAI Press, Menlo Park, Calif., 1999, pp. 114-124.

Philip K. Chan is an assistant professor of computer science at the Florida Institute of Technology. His research interests include scalable adaptive methods, machine learning, data mining, distributed and parallel computing, and intelligent systems. He received his PhD, MS, and BS in computer science from Columbia University, Vanderbilt University, and Southwest Texas State University, respectively. Contact him at Computer Science, Florida Tech, Melbourne, FL 32901; pkc@cs.fit.edu; www.cs.fit.edu/~pkc.

Wei Fan is a PhD candidate in computer science at Columbia University. His research interests are in machine learning, data mining, distributed systems, information retrieval, and collaborative filtering. He received his MSc and B.Eng from Tsinghua University and MPhil from Columbia University. Contact him at the Computer Science Dept., Columbia Univ., New York, NY 10027; wfan@cs.columbia.edu; www.cs.columbia.edu/~wfan.

Andreas L. Prodromidis is a director in research and development at iPrivacy. His research interests include data mining, machine learning, electronic commerce, and distributed systems. He received a PhD in computer science from Columbia University and a Diploma in electrical engineering from the National Technical University of Athens. He is a member of the IEEE and AAAI. Contact him at iPrivacy, 599 Lexington Ave., #2300, New York, NY 10022; andreas@iprivacy.com; www.cs.columbia.edu/~andreas.

Salvatore J. Stolfo is a professor of computer science at Columbia University and codirector of the USC/ISI and Columbia Center for Applied Research in Digital Government Information Systems. His most recent research has focused on distributed data-mining systems with applications to fraud and intrusion detection in network information systems. He received his PhD from NYU's Courant Institute. Contact him at the Computer Science Dept., Columbia Univ., New York, NY 10027; sal@cs.columbia.edu; www.cs.columbia.edu/sal; www.cs.columbia.edu/~sal.