
Distributed Data Mining in Credit Card Fraud Detection

Philip K. Chan, Florida Institute of Technology
Wei Fan, Andreas L. Prodromidis, and Salvatore J. Stolfo, Columbia University

THIS SCALABLE BLACK-BOX APPROACH FOR BUILDING EFFICIENT FRAUD DETECTORS CAN SIGNIFICANTLY REDUCE LOSS DUE TO ILLEGITIMATE BEHAVIOR. IN MANY CASES, THE AUTHORS' METHODS OUTPERFORM A WELL-KNOWN, STATE-OF-THE-ART COMMERCIAL FRAUD-DETECTION SYSTEM.

CREDIT CARD TRANSACTIONS continue to grow in number, taking an ever-larger share of the US payment system and leading to a higher rate of stolen account numbers and subsequent losses by banks. Improved fraud detection thus has become essential to maintain the viability of the US payment system. Banks have used early fraud warning systems for some years.
Large-scale data-mining techniques can improve on the state of the art in commercial practice. Scalable techniques to analyze massive amounts of transaction data and efficiently compute fraud detectors in a timely manner address an important problem, especially for e-commerce. Besides scalability and efficiency, the fraud-detection task exhibits technical problems that include skewed distributions of training data and nonuniform cost per error, both of which have not been widely studied in the knowledge-discovery and data-mining community.

In this article, we survey and evaluate a number of techniques that address these three main issues concurrently. Our proposed methods of combining multiple learned fraud detectors under a "cost model" are general and demonstrably useful; our empirical results demonstrate that we can significantly reduce loss due to fraud through distributed data mining of fraud models.

Our approach

In today's increasingly electronic society and with the rapid advances of electronic commerce on the Internet, the use of credit cards for purchases has become convenient and necessary. Credit card transactions have become the de facto standard for Internet and Web-based e-commerce. The US government estimates that credit cards accounted for approximately US $11 billion in Internet sales during 1998. This figure is expected to grow rapidly each year. However, the growing number of credit card transactions provides more opportunity for thieves to steal credit card numbers and subsequently commit fraud. When banks lose money because of credit card fraud, cardholders pay for all of that loss through higher interest rates, higher fees, and reduced benefits. Hence, it is in both the banks' and the cardholders' interest to reduce illegitimate use of credit cards by early fraud detection. For many years, the credit card industry has studied computing models for automated detection systems; recently, these models have been the subject of academic research, especially with respect to e-commerce.

The credit card fraud-detection domain presents a number of challenging issues for data mining:

- There are millions of credit card transactions processed each day. Mining such massive amounts of data requires highly efficient techniques that scale.
- The data are highly skewed: many more transactions are legitimate than fraudulent. Typical accuracy-based mining techniques can generate highly accurate fraud detectors by simply predicting that all transactions are legitimate, although this is equivalent to not detecting fraud at all.
- Each transaction record has a different dollar amount and thus a variable potential loss, rather than a fixed misclassification cost per error type, as is commonly assumed in cost-based mining techniques.

Our approach addresses the efficiency and scalability issues in several ways. We divide a large data set of labeled transactions (either fraudulent or legitimate) into smaller subsets, apply mining techniques to generate classifiers in parallel, and combine the resultant base models by metalearning from the classifiers' behavior to generate a metaclassifier. Our approach treats the classifiers as black boxes so that we can employ a variety of learning algorithms. Besides extensibility, combining multiple models computed over all available data produces metaclassifiers that can offset the loss of predictive performance that usually occurs when mining from data subsets or sampling. Furthermore, when we use the learned classifiers (for example, during transaction authorization), the base classifiers can execute in parallel, with the metaclassifier then combining their results. So, our approach is highly efficient in generating these models and also relatively efficient in applying them.

Another parallel approach focuses on parallelizing a particular algorithm on a particular parallel architecture. However, a new algorithm or architecture requires a substantial amount of parallel-programming work. Although our architecture- and algorithm-independent approach is not as efficient as some fine-grained parallelization approaches, it lets users plug different off-the-shelf learning programs into a parallel and distributed environment with relative ease, and it eliminates the need for expensive parallel hardware.

Furthermore, because our approach could generate a potentially large number of classifiers from the concurrently processed data subsets, and therefore potentially require more computational resources during detection, we investigate pruning methods that identify redundant classifiers and remove them from the ensemble without significantly degrading the predictive performance. This pruning technique increases the learned detectors' computational performance and throughput.

The issue of skewed distributions has not been studied widely because many of the data sets used in research do not exhibit this characteristic. We address skewness by partitioning the data set into subsets with a desired distribution, applying mining techniques to the subsets, and combining the mined classifiers by metalearning (as we have already discussed). Other researchers attempt to remove unnecessary instances from the majority class; instances in the borderline region (noise or redundant exemplars) are candidates for removal. In contrast, our approach keeps all the data for mining and does not change the underlying mining algorithms.

We address the issue of nonuniform cost by developing the appropriate cost model for the credit card fraud domain and biasing our methods toward reducing cost. This cost model determines the desired distribution just mentioned. AdaCost (a cost-sensitive version of AdaBoost) relies on the cost model for updating weights in the training distribution. (For more on AdaCost, see the "AdaCost algorithm" sidebar.) Naturally, this cost model also defines the primary evaluation criterion for our techniques. Furthermore, we investigate techniques to improve the cost performance of a bank's fraud detector by importing remote classifiers from other banks and combining this remotely learned knowledge with locally stored classifiers. The law and competitive concerns restrict banks from sharing information about their customers with other banks. However, they may share black-box fraud-detection models. Our distributed data-mining approach provides a direct and efficient solution to sharing knowledge without sharing data. We also address possible incompatibility of data schemata among different banks.

We designed and developed an agent-based distributed environment to demonstrate our distributed and parallel data-mining techniques. The JAM (Java Agents for Metalearning) system not only provides distributed data-mining capabilities, it also lets users monitor and visualize the various learning agents and derived models in real time. Researchers have studied a variety of algorithms and techniques for combining multiple computed models. The JAM system provides generic features to easily implement any of these combining techniques (along with a large collection of base-learning algorithms), and it has been broadly available for use. The JAM system is available for download at http://www.cs.columbia.edu/~sal/JAM/PROJECT.

Table 1. Cost model assuming a fixed overhead.

OUTCOME                            COST
Miss (false negative, FN)          tranamt
False alarm (false positive, FP)   overhead if tranamt > overhead; 0 if tranamt <= overhead
Hit (true positive, TP)            overhead if tranamt > overhead; tranamt if tranamt <= overhead
Normal (true negative, TN)         0
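As a concrete reading of Table 1, the per-transaction cost and the cumulative and average costs defined in the next section can be sketched in a few lines. This is an illustrative sketch, not the authors' code; the outcome codes and the $75 overhead are assumptions, since the banks' real overhead values are confidential.

```python
def transaction_cost(outcome: str, tranamt: float, overhead: float) -> float:
    """Cost of one transaction under Table 1's fixed-overhead model."""
    if outcome == "FN":   # miss: the fraud goes undetected, full amount lost
        return tranamt
    if outcome == "FP":   # false alarm: investigate only if it is worthwhile
        return overhead if tranamt > overhead else 0.0
    if outcome == "TP":   # hit: pay the overhead if investigated, else lose the amount
        return overhead if tranamt > overhead else tranamt
    if outcome == "TN":   # normal transaction: no cost
        return 0.0
    raise ValueError(f"unknown outcome: {outcome}")

def cumulative_cost(outcomes, amounts, overhead):
    """CumulativeCost = sum of Cost(i) over all transactions."""
    return sum(transaction_cost(o, a, overhead) for o, a in zip(outcomes, amounts))

# Toy example with an assumed $75 overhead (real values are secret).
outcomes = ["TN", "FN", "TP", "FP", "TP"]
amounts  = [120.0, 500.0, 300.0, 40.0, 60.0]
total = cumulative_cost(outcomes, amounts, 75.0)   # 0 + 500 + 75 + 0 + 60 = 635
average = total / len(outcomes)                    # AverageCost = CumulativeCost / n
ratio = 75.0 / average  # R = Overhead/AverageCost; the article finds mining pays off when R < 6
```

Note how small transactions never cost more than their own amount: below the overhead threshold, investigating (or even flagging) them cannot be worthwhile, which is exactly why the article's effectiveness ratio R matters.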
Credit card data and cost models

Chase Bank and First Union Bank, members of the Financial Services Technology Consortium (FSTC), provided us with real credit card data for this study. The two data sets contain credit card transactions labeled as fraudulent or legitimate. Each bank supplied 500,000 records spanning one year, with a 20% fraud and 80% nonfraud distribution for Chase Bank and 15% versus 85% for First Union Bank. In practice, fraudulent transactions are much less frequent than the 15% to 20% observed in the data given to us; these data might have been cases where the banks have difficulty in determining legitimacy correctly. In some of our experiments, we deliberately create more skewed distributions to evaluate the effectiveness of our techniques under more extreme conditions.

Bank personnel developed the schemata of the databases over years of experience and continuous analysis to capture important information for fraud detection. We cannot reveal the details of the schema beyond what we have described elsewhere. The records of one schema have a fixed length of 117 bytes each and about 30 attributes, including the binary class label (fraudulent/legitimate transaction). Some fields are numeric and the rest categorical. Because account identification is not present in the data, we cannot group transactions into accounts. Therefore, instead of learning behavior models of individual customer accounts, we build overall models that try to differentiate legitimate transactions from fraudulent ones. Our models are customer-independent and can serve as a second line of defense, the first being customer-dependent models.

Most machine-learning literature concentrates on model accuracy (either training error or generalization error on hold-out test data, computed as overall accuracy, true-positive or false-positive rates, or return-on-cost analysis). This domain provides a considerably different metric to evaluate the learned models' performance: models are evaluated and rated by a cost model. Because of the different dollar amount of each credit card transaction and other factors, the cost of failing to detect a fraud varies with each transaction. Hence, the cost model for this domain relies on the sum and average of the loss caused by fraud. We define

    CumulativeCost = Σᵢ Cost(i)
    AverageCost = CumulativeCost / n

where Cost(i) is the cost associated with transaction i, and n is the total number of transactions.

After consulting with a bank representative, we jointly settled on a simplified cost model that closely reflects reality. Because it takes time and personnel to investigate a potentially fraudulent transaction, each investigation incurs an overhead. Other related costs (for example, the operational resources needed for the fraud-detection system) are consolidated into the overhead. So, if the amount of a transaction is smaller than the overhead, investigating the transaction is not worthwhile even if it is suspicious. For example, if it takes $10 to investigate a potential loss of $1, it is more economical not to investigate it. Therefore, assuming a fixed overhead, we devised the cost model shown in Table 1 for each transaction, where tranamt is the amount of a credit card transaction. The overhead threshold, for obvious reasons, is a closely guarded secret and varies over time. The values used here are probably reasonable bounds for this data set, but the actual figures are probably significantly lower. We evaluated all our empirical studies using this cost model.

Skewed distributions

Given a skewed distribution, we would like to generate a training set of labeled transactions with a desired distribution, without removing any data, that maximizes classifier performance. In this domain, we found that determining the desired distribution is an experimental art and requires extensive empirical tests to find the most effective training distribution.

In our approach, we first create data subsets with the desired distribution (determined by extensive sampling experiments). Then we generate classifiers from these subsets and combine them by metalearning from their classification behavior. For example, if the given skewed distribution is 20:80 and the desired distribution is 50:50, we randomly divide the majority instances into four partitions and form four data subsets by merging the minority instances with each of the four partitions of majority instances. That is, the minority instances replicate across the four data subsets to generate the desired 50:50 distribution in each distributed training set.

For concreteness, let N be the size of the data set with a distribution of x:y (x is the percentage of the minority class) and let u:v be the desired distribution. The number of minority instances is N × x, and the desired number of majority instances in a subset is N × x × v/u. The number of subsets is the number of majority instances (N × y) divided by the number of desired majority instances in each subset; that is, N × y divided by N × x × v/u, or y/x × u/v. So, we have y/x × u/v subsets, each of which has N × x minority instances and N × x × v/u majority instances.

The next step is to apply a learning algorithm or algorithms to each subset. Because the learning processes on the subsets are independent, the subsets can be distributed to different processors and each learning process can run in parallel. For massive amounts of data, our approach can substantially improve the speed of learning algorithms with superlinear time complexity. The generated classifiers are combined by metalearning from their classification behavior. We have described several metalearning strategies elsewhere.

To simplify our discussion, we only describe the class-combiner (or stacking) strategy. This strategy composes a metalevel training set by using the base classifiers' predictions on a validation set as attribute values and the actual classification as the class label. This training set then serves for training a metaclassifier. For integrating subsets, the class-combiner strategy is more effective than voting-based techniques. When the learned models are used during online fraud detection, transactions feed into the learned base classifiers and the metaclassifier then combines their predictions. Again, the base classifiers are independent and can execute in parallel on different processors. In addition, our approach can prune redundant base classifiers without affecting the cost performance, making it relatively efficient in the credit card authorization process.
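The subset-generation arithmetic and the class-combiner (stacking) idea above can be sketched as follows. This is an illustrative sketch, not the JAM implementation; the tuple-based records and the `make_subsets` and `metalevel_training_set` names are assumptions.

```python
import random

def make_subsets(minority, majority, u, v):
    """Replicate the minority instances across partitions of the majority class
    so each subset has the desired minority:majority ratio u:v."""
    per_subset_majority = int(len(minority) * v / u)   # N*x*v/u majority per subset
    majority = majority[:]                             # don't mutate the caller's list
    random.shuffle(majority)
    n_subsets = len(majority) // per_subset_majority   # = y/x * u/v
    return [minority + majority[i * per_subset_majority:(i + 1) * per_subset_majority]
            for i in range(n_subsets)]

# Given 20:80 and desired 50:50, y/x * u/v = 80/20 * 50/50 = 4 subsets.
minority = [("fraud", i) for i in range(200)]   # 20% of N = 1000
majority = [("legit", i) for i in range(800)]   # 80% of N = 1000
subsets = make_subsets(minority, majority, 50, 50)
assert len(subsets) == 4 and all(len(s) == 400 for s in subsets)

def metalevel_training_set(base_classifiers, validation_set):
    """Class-combiner (stacking): the base classifiers' predictions become the
    attributes of a metalevel example; the true label stays as the class."""
    return [([clf(x) for clf in base_classifiers], label)
            for x, label in validation_set]
```

Each subset can then be handed to an independent learner (C4.5, CART, Ripper, Bayes, and so on), and the metalevel training set is what the metalearner, here unspecified, would fit.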

Table 2. Cost and savings in the credit card fraud domain using class-combiner (cost ± 95% confidence interval).

COST ($)        SAVINGS (%)
20.07 ± 0.13    46
23.64 ± 0.96    36
29.08 ± 1.60    21

Experiments and results. To evaluate our class-combiner approach to skewed class distributions, we performed a set of experiments using the credit card fraud data from Chase. We used transactions from the first eight months (10/95-5/96) for training, the ninth month (6/96) for validating, and the twelfth month (9/96) for testing. (Credit card transactions have a natural two-month business cycle: the time to bill a customer is one month, followed by a one-month payment period, so the true label of a transaction cannot be determined in less than two months' time. Hence, models built from one month's data cannot rationally be applied to fraud detection in the next month. We therefore test our models on data that is at least two months newer.)

Based on the empirical results on the effects of class distributions, the desired distribution is 50:50. Because the given distribution is 20:80, four subsets are generated from each month, for a total of 32 subsets. We applied four learning algorithms (C4.5, CART, Ripper, and Bayes) to each subset and generated 128 base classifiers. Based on our experience with training metaclassifiers, Bayes is generally more effective and efficient, so it is the metalearner for all the experiments reported here.

Furthermore, to investigate whether our approach is indeed fruitful, we ran experiments on the class-combiner strategy directly applied to the original data sets from the first eight months (that is, with the given 20:80 distribution). We also evaluated how individual classifiers generated from each month perform without class-combining. Table 2 shows the cost and savings from the class-combiner strategy using the 50:50 distribution (128 base classifiers), the average of individual CART classifiers generated using the desired distribution (10 classifiers), class-combiner using the given distribution (32 base classifiers: 8 months × 4 learning algorithms), and the average of individual classifiers using the given distribution (the average of 32 classifiers). (We did not perform experiments on simply replicating the minority instances to achieve 50:50 in one single data set, because this approach increases the training-set size and is not appropriate in domains with large amounts of data, one of the three primary issues we address here.) Compared to the other three methods, class-combining on subsets with a 50:50 fraud distribution clearly achieves a significant increase in savings: at least $110,000 for the month (6/96). When the overhead is $50, more than half of the losses were prevented.

Surprisingly, we also observe that classifiers learned with the 50:50 distribution (generated by ignoring some data) achieved significantly more savings than combining classifiers trained from all eight months' data with the given distribution. This reaffirms the importance of employing the appropriate training class distribution in this domain. Class-combiner also contributed to the performance improvement. Consequently, utilizing the desired training distribution and class-combiner provides a synergistic approach to data mining with nonuniform class and cost distributions. Perhaps more importantly, how do our techniques perform compared to the bank's existing fraud-detection system? We label the current system "COTS" (commercial off-the-shelf system) in Table 2. COTS achieved significantly smaller savings than our techniques at the three overhead amounts we report in this table.

This comparison might not be entirely accurate, because COTS uses much more training data than we have, and it might be optimized to a different cost model (which might even be the simple error rate). Furthermore, unlike COTS, our techniques are general for problems with skewed distributions and do not utilize any domain knowledge in detecting credit card fraud; the only exception is the cost model used for evaluation and search guidance. Nevertheless, COTS' performance on the test data provides some indication of how the existing fraud-detection system behaves in the real world.

We also evaluated our method with more skewed distributions (created by downsampling minority instances): 10:90, 1:99, and 1:999. As we discussed earlier, the desired distribution is not necessarily 50:50; for instance, the desired distribution is 30:70 when the given distribution is 10:90. With 10:90 distributions, our method reduced the cost significantly more than COTS. With 1:99 distributions, our method did not outperform COTS. Both methods did not achieve any savings with 1:999 distributions.

To characterize the conditions under which our techniques are effective, we calculate R, the ratio of the overhead amount to the average cost: R = Overhead / AverageCost. Our approach is significantly more effective than the deployed COTS when R < 6. Both methods are not effective when R > 24. So, under a reasonable cost model with a fixed overhead cost for challenging potentially fraudulent transactions, when the number of fraudulent transactions is a very small percentage of the total, it is financially undesirable to detect fraud; the loss due to this fraud is yet another cost of conducting business. However, filtering out "easy" (or low-risk) transactions (the data we received were possibly filtered by a similar process) can reduce a high overhead-to-loss ratio. The filtering process can use fraud detectors built on individual customer profiles, which are now in use by many credit card companies. These individual profiles characterize the customers' purchasing behavior. For example, if a customer regularly buys groceries at a particular supermarket or has set up a monthly payment for phone bills, these transactions are close to no risk; hence, purchases with similar characteristics can be safely authorized without further checking. Reducing the overhead through streamlining business operations and increased automation will also lower the ratio.

IEEE Intelligent Systems, November/December 1999

Knowledge sharing through bridging

Much of the prior work on combining multiple models assumes that all models originate from different (not necessarily distinct) subsets of a single data set as a means to increase accuracy (for example, by imposing probability distributions over the instances of the training set, or by stratified sampling, subsampling, and so forth) and not as a means to integrate distributed information. Although the JAM system addresses the latter problem by employing metalearning techniques, integrating classification models derived from distinct and distributed databases might not always be feasible.

In all cases considered so far, all classification models are assumed to originate from databases of identical schemata. Because classifiers depend directly on the underlying data's format, minor differences in the schemata between databases derive incompatible classifiers; that is, a classifier cannot be applied to data of a different format. Yet these classifiers may target the same concept. We seek to bridge these disparate classifiers in some principled fashion.

The banks seek to be able to exchange their classifiers and thus incorporate useful information in their systems that would otherwise be inaccessible to both. Indeed, for each credit card transaction, both institutions record similar information; however, they also include specific fields containing important information that each has acquired separately and that provides predictive value in determining fraudulent transaction patterns. To facilitate the exchange of knowledge (represented as remotely learned models) and take advantage of incompatible and otherwise useless classifiers, we need to devise methods that bridge the differences imposed by the different schemata.

Database compatibility. The incompatible schema problem impedes JAM from taking advantage of all available databases. Let's consider two data sites A and B with databases DB_A and DB_B having similar but not identical schemata. Without loss of generality, we assume that

    Schema(DB_A) = {A1, A2, ..., An, An+1, C}
    Schema(DB_B) = {B1, B2, ..., Bn, Bn+1, C}

where Ai and Bi denote the ith attribute of DB_A and DB_B, respectively, and C the class label (for example, the fraud/legitimate label in the credit card fraud illustration) of each instance. Without loss of generality, we further assume that Ai = Bi, 1 ≤ i ≤ n. As for the An+1 and Bn+1 attributes, there are two possibilities:

1. An+1 ≠ Bn+1: The two attributes are of entirely different types drawn from distinct domains. The problem can then be reduced to two dual problems where one database has one more attribute than the other; that is,

    Schema(DB_A) = {A1, A2, ..., An, An+1, C}
    Schema(DB_B) = {B1, B2, ..., Bn, C}

where we assume that attribute Bn+1 is not present in DB_B. (The dual problem has DB_B composed with Bn+1, but An+1 is not available to A.)

2. An+1 = Bn+1: The two attributes are of similar type but slightly different semantics; that is, there might be a map from the domain of one type to the domain of the other. For example, An+1 and Bn+1 might be fields with time-dependent information but of different duration (that is, An+1 might denote the number of times an event occurred within a window of half an hour and Bn+1 might denote the number of times the same event occurred but within 10 minutes).

In both cases (attribute An+1 is either not present in DB_B or semantically different from the corresponding Bn+1), the classifiers derived from DB_A are not compatible with DB_B's data and therefore cannot be directly used at DB_B's site, and vice versa. But the purpose of using a distributed data-mining system and deploying learning agents and metalearning their classifier agents is to be able to combine information from different sources.

Bridging methods. There are two methods for handling the missing attributes.

Method I: Learn a local model for the missing attribute and exchange. Database DB_B imports, along with the remote classifier agent, a bridging agent from database DB_A that is trained to compute values for the missing attribute An+1 in DB_B's data. Next, DB_B applies the bridging agent on its own data to estimate the missing values. In many cases, however, this might not be possible or desirable for DB_A (for example, in case the attribute is proprietary). The alternative for database DB_B is to simply add the missing attribute to its data set and fill it with null values. Even though the missing attribute might have high predictive value for DB_A, it is of no value to DB_B. After all, DB_B did not include it in its schema, and presumably other attributes (including the common ones) have predictive value.

Method II: Learn a local model without the missing attribute and exchange. In this approach, database DB_A can learn two local models: one with the attribute An+1 that can be used locally by the metalearning agents, and one without it that can be subsequently exchanged. Learning a second classifier agent without the missing attribute, or with only the attributes that belong to the intersection of the attributes of the two databases' data sets, implies that the second classifier uses only the attributes that are common among the participating sites, so no issue exists for its integration at other databases. But remote classifiers imported by database DB_B (and assured not to involve predictions over the missing attributes) can still be locally integrated with the original model that employs An+1. In this case, the remote classifiers simply ignore the local data set's missing attributes.

Both approaches address the incompatible schema problem, and metalearning over these models should proceed in a straightforward manner. Compared to Zbigniew Ras's related work, our approach is more general because our techniques support both categorical and continuous attributes and are not limited to a specific syntactic case or to purely logical consistency of generated rule models. Instead, these bridging techniques can employ machine- or statistical-learning algorithms to compute arbitrary models for the missing values.
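Method I can be sketched as follows. The per-key mean estimator standing in for the bridging agent's learner is an assumption for illustration; any regression or classification algorithm could fill that role.

```python
def train_bridging_agent(rows_a):
    """At DB_A: learn to estimate the extra attribute (An+1) from the common
    attributes. This stand-in predicts the mean of An+1 per value of the
    first common attribute, falling back to the overall mean."""
    sums, counts = {}, {}
    for common, extra in rows_a:          # common: tuple of shared attribute values
        key = common[0]
        sums[key] = sums.get(key, 0.0) + extra
        counts[key] = counts.get(key, 0) + 1
    overall_mean = sum(sums.values()) / sum(counts.values())
    table = {k: sums[k] / counts[k] for k in sums}
    return lambda common: table.get(common[0], overall_mean)

def bridge(rows_b, bridging_agent):
    """At DB_B: append the estimated missing attribute so that DB_A's
    classifiers can be applied to DB_B's data."""
    return [common + (bridging_agent(common),) for common in rows_b]

# DB_A rows: ((common attributes...), value of the extra attribute An+1)
rows_a = [(("web", 1), 3.0), (("web", 0), 5.0), (("pos", 1), 10.0)]
agent = train_bridging_agent(rows_a)
rows_b = [("web", 7), ("atm", 2)]
augmented = bridge(rows_b, agent)   # "web" -> 4.0, unseen "atm" -> overall mean 6.0
```

The key design point is that only the bridging model crosses the site boundary, so DB_A's raw data stay private, in keeping with the sharing-knowledge-without-sharing-data constraint.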

Pruning

An ensemble of classifiers can be unnecessarily complex, meaning that many classifiers might be redundant, wasting resources and reducing system throughput. (Throughput here denotes the rate at which a metaclassifier can pipe through and label a stream of data items.) We study the efficiency of metaclassifiers by investigating the effects of pruning (discarding certain base classifiers) on their performance. Determining the optimal set of classifiers for metalearning is a combinatorial problem. Hence, the objective of pruning is to utilize heuristic methods to search for partially grown metaclassifiers (metaclassifiers with pruned subtrees) that are more efficient and scalable and at the same time achieve comparable or better predictive performance than fully grown (unpruned) metaclassifiers. To this end, we introduced two stages for pruning metaclassifiers: the pretraining and posttraining pruning stages. Both levels are essential and complementary to each other with respect to the improvement of the system's accuracy and efficiency.
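A minimal sketch of the metric-based flavor of pretraining pruning: score each candidate base classifier independently on held-out data and keep only the top k before metalearning. The accuracy scorer and the cutoff k are illustrative assumptions; the authors rank classifiers by metrics tied to their cost model instead.

```python
def prune_pretraining(classifiers, validation_set, score, k):
    """Metric-based pretraining pruning: evaluate each candidate classifier
    independently on held-out data and keep the k most promising ones
    before any metaclassifier is trained."""
    ranked = sorted(classifiers,
                    key=lambda clf: score(clf, validation_set),
                    reverse=True)
    return ranked[:k]

# Example scorer: fraction of validation examples labeled correctly.
def accuracy(clf, validation_set):
    return sum(clf(x) == y for x, y in validation_set) / len(validation_set)

always_fraud = lambda x: "fraud"
always_legit = lambda x: "legit"
validation = [(1, "legit"), (2, "legit"), (3, "fraud")]
kept = prune_pretraining([always_fraud, always_legit], validation, accuracy, k=1)
assert kept == [always_legit]   # 2/3 correct beats 1/3
```

Because each candidate is scored in isolation, this heuristic is cheap but blind to redundancy between classifiers, which is what the diversity-based and coverage-based variants address.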

Pretraining pruning refers to the filtering of the classifiers before they are combined. Instead of combining classifiers in a brute-force manner, we introduce a preliminary stage for analyzing the available classifiers and qualifying them for inclusion in the combined metaclassifier. Only those classifiers that appear (according to one or more predefined metrics) to be most promising participate in the final metaclassifier. Here, we adopt a black-box approach that evaluates the set of classifiers based only on their input and output behavior, not their internal structure. Conversely, posttraining pruning denotes the evaluation and pruning of constituent base classifiers after a complete metaclassifier has been constructed. We have implemented and experimented with three pretraining and two posttraining pruning algorithms, each with different search heuristics.

The first pretraining pruning algorithm ranks and selects its classifiers by evaluating each candidate classifier independently (metric-based), and the second algorithm decides by examining the classifiers in correlation with each other (diversity-based). The third relies on the independent performance of the classifiers and the manner in which they predict with respect to each other and with respect to the underlying data set (coverage- and specialty-based). The first posttraining pruning algorithm is based on a cost-complexity pruning technique (a technique the CART decision-tree learning algorithm uses that seeks to minimize the cost and size of its tree while reducing the misclassification rate). The second is based on the correlation between the classifiers and the metaclassifier. Compared to Dragos Margineantu and Thomas Dietterich's approach, ours considers a more general setting.

The data were distributed across six different data sites (each site stores two months of data), and we prepared the set of candidate base classifiers, that is, the original set of base classifiers the pruning algorithm is called to evaluate. We computed these classifiers by applying five learning algorithms (Bayes, C4.5, ID3, CART, and Ripper) to each month of data, creating 60 base classifiers (10 classifiers per data site). Next, we had each data site import the remote base classifiers (50 in total) that we subsequently used in the pruning and metalearning phases, thus ensuring that each classifier would not be tested unfairly on known data. Specifically, we had each site use half of its local data (one month) to test, prune, and metalearn the base classifiers and the other half to evaluate the pruned and unpruned metaclassifier's overall performance. In essence, this experiment's setting corresponds to a parallel sixfold cross-validation.

Finally, we had the two simulated banks exchange their classifier agents. In addition to its 10 local and 50 internal classifiers (those imported from peer data sites of the same bank), each site also imported 60 external classifiers (from the other bank). Thus, we populated each simulated Chase data site with 60 (10 + 50) Chase classifiers and 60 First Union classifiers, and we populated each First Union site with 60 (10 + 50) First Union classifiers and 60 Chase classifiers. Again, the sites used half of their local data (one month) to test, prune, and metalearn the base classifiers and the other half to evaluate the pruned or unpruned metaclassifier's overall performance.

For the first incompatibility, we mapped the First Union data values to the Chase data's semantics. For the second incompatibility, we deployed bridging agents to compute the missing values. When predicting, the First Union classifiers simply disregarded the real values provided at the Chase data sites, while the Chase classifiers relied on both the common attributes and the predictions of the bridging agents to deliver a prediction at the First Union data sites.

Table 3. Results on knowledge sharing and pruning.

                                            CHASE                 FIRST UNION
CLASSIFICATION METHOD                  SIZE   SAVINGS ($K)   SIZE   SAVINGS ($K)
COTS scoring system from Chase         N/A    682            N/A    N/A
Best base classifier over entire set   1      762            1      803
Best base classifier over one subset   1      736            1      770
Metaclassifier                         50     818            50     944
Metaclassifier                         32     870            21     943
Metaclassifier (+ First Union)         110    800            N/A    N/A
Metaclassifier (+ First Union)         63     877            N/A    N/A
Metaclassifier (+ Chase - bridging)    N/A    N/A            110    942
Metaclassifier (+ Chase + bridging)    N/A    N/A            110    963
Metaclassifier (+ Chase + bridging)    N/A    N/A            56     962

Table 3 summarizes our results for the Chase and First Union banks, displaying the savings for each fraud predictor examined. The column denoted as size indicates the number of base classifiers used in the ensemble classification system. The first row of Table 3 shows the best possible performance of Chase's own COTS authorization and detection system on this data set. The next two rows present the performance of the best base classifiers over the entire set and over a single month's data, while the last four Chase rows detail the performance of the unpruned (sizes of 50 and 110) and pruned (sizes of 32 and 63) metaclassifiers. The first two of these metaclassifiers combine only internal (from Chase) base classifiers, while the last two combine both internal and external (from Chase and First Union) base classifiers. We did not use bridging agents in these experiments, because Chase defined all attributes used by the First Union classifier agents.

Table 3 records similar data for the First Union data set, with the exception of First Union's COTS authorization and detection system.
fiers can he ohtailled by apply- mance. To ensure fairncss, we did not usc thc performance (it was not made available to us),
different learning algorithms 10 local classifiers in metalearning. and the additional resiills ohtaincd when
over (possibly) distinct datahascs. Further- Thc two databases, huwcvcr, had lhe Col- cmploying special bridging agents born Chase
more, instead of voting (such as AdaBoost) lowing schema differences: to computc the values of First Union's miss-
over thc predictions of ing altsihutcs. Most obviously, these expcri-
classificalion, wc use melalearning to com- * Chasc and First Union dcfined a (nearly ments show the superior performance of meta-
bine thc individual classificrs' predictions. identimil) fcature with dirferent scinan- learning ovcr the single-modcl appsoachcs
tics. and over lhe trxlitional authorization and
Evaluation of knowledge sharing and * Chasc includes two (continuous) features dctcction systems (at leiist for the given data
pruning. First, we distributed the data scts no1 present in the Pirsl Union data. sets). Thc mctaclassifiers outpcrformed thc

73
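The bridging mechanism just described can be sketched roughly as follows. This is a minimal stand-in: the attribute names and the per-category mean estimator are invented for illustration, in place of the learned models the article's bridging agents actually used.

```python
# Simplified sketch of a bridging agent (attribute names and the mean-based
# estimator are invented; the real agents were learned models shipped along
# with the classifier agents). The agent is trained at the donor site (Chase)
# to estimate a Chase-only attribute from an attribute common to both schemas,
# so imported Chase classifiers can still run on First Union records.

from collections import defaultdict
from statistics import mean

class BridgingAgent:
    """Estimates a missing attribute from an attribute common to both banks."""

    def __init__(self, missing_attr):
        self.missing_attr = missing_attr
        self.key_attr = None
        self.means = {}
        self.global_mean = 0.0

    def fit(self, donor_records, key_attr):
        # Learn E[missing_attr | key_attr] from the donor (Chase) data.
        self.key_attr = key_attr
        buckets = defaultdict(list)
        for rec in donor_records:
            buckets[rec[key_attr]].append(rec[self.missing_attr])
        self.means = {k: mean(v) for k, v in buckets.items()}
        self.global_mean = mean(r[self.missing_attr] for r in donor_records)
        return self

    def transform(self, record):
        # Fill in the missing attribute so an imported classifier can predict.
        filled = dict(record)
        filled[self.missing_attr] = self.means.get(
            record.get(self.key_attr), self.global_mean)
        return filled

# Trained at the Chase site, then shipped alongside the Chase classifiers.
chase_records = [{"merchant_type": "online", "risk_score": 0.5},
                 {"merchant_type": "online", "risk_score": 1.5},
                 {"merchant_type": "retail", "risk_score": 0.2}]
agent = BridgingAgent("risk_score").fit(chase_records, key_attr="merchant_type")
fu_record = {"merchant_type": "online", "amount": 120.0}
print(agent.transform(fu_record)["risk_score"])  # 1.0
```

A First Union site would apply `transform` to each transaction before invoking an imported Chase classifier agent, matching the behavior described above.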
Table 3 summarizes our results for the Chase and First Union banks, displaying the savings for each fraud predictor examined. The column denoted as size indicates the number of base classifiers used in the ensemble classification system. The first row of Table 3 shows the best possible performance of Chase's own COTS authorization and detection system on this data set. The next two rows present the performance of the best base classifiers over the entire set and over a single month's data, while the last four rows detail the performance of the unpruned (size of 50 and 110) and pruned metaclassifiers (size of 32 and 63). The first two of these metaclassifiers combine only internal (from Chase) base classifiers, while the last two combine both internal and external (from Chase and First Union) base classifiers. We did not use bridging agents in these experiments, because Chase defined all attributes used by the First Union classifier agents.

Table 3 records similar data for the First Union data set, with the exception of First Union's COTS authorization and detection performance (it was not made available to us), and the additional results obtained when employing special bridging agents from Chase to compute the values of First Union's missing attributes. Most obviously, these experiments show the superior performance of metalearning over the single-model approaches and over the traditional authorization and detection systems (at least for the given data sets). The metaclassifiers outperformed the single base classifiers in every category. Moreover, by bridging the two databases, we managed to further improve the metalearning system's performance.

However, combining classifier agents from the two banks directly (without bridging) is not very effective, no doubt because the attribute missing from the First Union data set is significant in modeling the Chase data set. Hence, the First Union classifiers are not as effective as the Chase classifiers on the Chase data, and the Chase classifiers cannot perform at full strength at the First Union sites without the bridging agents. This table also shows the invaluable contribution of pruning. In all cases, pruning succeeded in computing metaclassifiers with similar or better fraud-detection capabilities, while reducing their size and thus improving their efficiency. We have provided a detailed description of the pruning methods and a comparative study between predictive performance and metaclassifier throughput elsewhere.7
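The metric-based pretraining pruning mentioned earlier admits a very small sketch. The accuracy metric, toy classifiers, and data below are invented stand-ins; the article's algorithms also used diversity-, coverage-, and specialty-based heuristics.

```python
# Minimal sketch of metric-based pretraining pruning: each candidate base
# classifier is scored independently on held-out data, and only the top k
# survive to the metalearning phase. Toy classifiers and the plain accuracy
# metric are placeholders for the article's candidates and savings metric.

def prune_by_metric(classifiers, held_out, k):
    """Rank candidate classifiers by an independent metric and keep the top k."""
    def score(clf):
        return sum(clf(x) == y for x, y in held_out) / len(held_out)
    return sorted(classifiers, key=score, reverse=True)[:k]

# Toy candidates over one numeric attribute (label 1 = fraud).
always_fraud = lambda x: 1
gt_100 = lambda x: 1 if x > 100 else 0
gt_1000 = lambda x: 1 if x > 1000 else 0
held_out = [(50, 0), (80, 0), (300, 1), (2000, 1)]

kept = prune_by_metric([always_fraud, gt_100, gt_1000], held_out, k=2)
print(len(kept))  # 2
```

The pruned set, rather than all candidates, is then handed to the metalearning phase, which is what shrank the metaclassifiers from sizes 110 and 50 down to 63 and 32 in Table 3.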
One limitation of our approach to skewed distributions is the need to run preliminary experiments to determine the desired training distribution based on a defined cost model. This process can be automated, but it is unavoidable because the desired distribution highly depends on the cost model and the learning algorithm. Currently, for simplicity reasons, all the base learners use the same desired distribution; using an individualized training distribution for each base learner could improve the performance. Furthermore, because thieves also learn and fraud patterns evolve over time, some classifiers are more relevant than others at a particular time. Therefore, an adaptive classifier-selection method is essential. Unlike a monolithic approach of learning one classifier using incremental learning, our modular multiclassifier approach facilitates adaptation over time and removes out-of-date knowledge.
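As a rough illustration of such adaptive selection under a modular ensemble (the accuracy cutoff and toy classifiers are invented for this sketch), periodically re-scoring classifiers on the most recent labeled data and retiring stale ones amounts to a simple filter:

```python
# Hedged sketch of adaptive classifier selection: the ensemble is re-evaluated
# on the most recent month of labeled transactions, and classifiers whose
# score has decayed are retired. The scoring function and cutoff are invented;
# a deployed system would score by a cost model (savings), not raw accuracy.

def refresh_ensemble(ensemble, recent_month, min_accuracy=0.6):
    """Drop out-of-date classifiers; modularity makes this a simple filter."""
    def accuracy(clf):
        return sum(clf(x) == y for x, y in recent_month) / len(recent_month)
    return [clf for clf in ensemble if accuracy(clf) >= min_accuracy]

# An old classifier tuned to a fraud pattern that has since evolved:
old_pattern = lambda amount: 1 if amount > 5000 else 0  # thieves now stay small
new_pattern = lambda amount: 1 if 100 < amount < 500 else 0
recent = [(250, 1), (320, 1), (40, 0), (7000, 0), (410, 1)]

kept = refresh_ensemble([old_pattern, new_pattern], recent)
print(len(kept))  # the stale classifier is removed -> 1
```

Because each base classifier is a self-contained agent, removing one never requires retraining the others; only the metalearning step is repeated over the surviving set.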
Our experience in credit card fraud detection has also affected other important applications. Encouraged by our results, we shifted our attention to the growing problem of intrusion detection in network- and host-based computer systems. Here we seek to perform the same sort of task as in the credit card fraud domain. We seek to build models to distinguish between bad (intrusions or attacks) and good (normal) connections or processes. By first applying feature-extraction algorithms, followed by the application of machine-learning algorithms (such as Ripper) to learn and combine multiple models for different types of intrusions, we have achieved remarkably good success in detecting intrusions.9 This work, as well as the results reported in this article, demonstrates convincingly that distributed data-mining techniques that combine multiple models produce effective fraud and intrusion detectors.

References

1. P. Chan and S. Stolfo, "Metalearning for Multistrategy and Parallel Learning," Proc. Second Int'l Workshop Multistrategy Learning, Center for Artificial Intelligence, George Mason Univ., Fairfax, Va., 1993, pp. 150-165.

2. S. Stolfo et al., "JAM: Java Agents for Metalearning over Distributed Databases," Proc. Third Int'l Conf. Knowledge Discovery and Data Mining, AAAI Press, Menlo Park, Calif., 1997, pp. 74-81.

3. D. Wolpert, "Stacked Generalization," Neural Networks, Vol. 5, 1992, pp. 241-259.

4. P. Chan and S. Stolfo, "Toward Scalable Learning with Nonuniform Class and Cost Distributions: A Case Study in Credit Card Fraud Detection," Proc. Fourth Int'l Conf. Knowledge Discovery and Data Mining, AAAI Press, Menlo Park, Calif., 1998, pp. 164-168.

5. A. Prodromidis and S. Stolfo, "Mining Databases with Different Schemas: Integrating Incompatible Classifiers," Proc. Fourth Int'l Conf. Knowledge Discovery and Data Mining, AAAI Press, Menlo Park, Calif., 1998, pp. 314-318.

6. Z. Ras, "Answering Nonstandard Queries in Distributed Knowledge-Based Systems," Rough Sets in Knowledge Discovery, Studies in Fuzziness and Soft Computing, L. Polkowski and A. Skowron, eds., Physica-Verlag, New York, 1998, pp. 98-108.

7. A. Prodromidis and S. Stolfo, "Pruning Metaclassifiers in a Distributed Data Mining System," Proc. First Nat'l Conf. New Information Technologies, Editions of New Tech., Athens, 1998, pp. 151-160.

8. D. Margineantu and T. Dietterich, "Pruning Adaptive Boosting," Proc. 14th Int'l Conf. Machine Learning, Morgan Kaufmann, San Francisco, 1997, pp. 211-218.

9. W. Lee, S. Stolfo, and K. Mok, "Mining in a Data-Flow Environment: Experience in Network Intrusion Detection," Proc. Fifth Int'l Conf. Knowledge Discovery and Data Mining, AAAI Press, Menlo Park, Calif., 1999, pp. 114-124.

Philip K. Chan is an assistant professor of computer science at the Florida Institute of Technology. His research interests include scalable adaptive methods, machine learning, data mining, distributed and parallel computing, and intelligent systems. He received his PhD, MS, and BS in computer science from Columbia University, Vanderbilt University, and Southwest Texas State University, respectively. Contact him at Computer Science, Florida Tech, Melbourne, FL 32901; pkc@cs.fit.edu; www.cs.fit.edu/~pkc.

Wei Fan is a PhD candidate in computer science at Columbia University. His research interests are in machine learning, data mining, distributed systems, information retrieval, and collaborative filtering. He received his MS and BEng from Tsinghua University and MPhil from Columbia University. Contact him at the Computer Science Dept., Columbia Univ., New York, NY 10027; wfan@cs.columbia.edu; www.cs.columbia.edu/~wfan.

Andreas L. Prodromidis is a director in research and development at iPrivacy. His research interests include data mining, machine learning, electronic commerce, and distributed systems. He received a PhD in computer science from Columbia University and a Diploma in electrical engineering from the National Technical University of Athens. He is a member of the IEEE and AAAI. Contact him at iPrivacy, 599 Lexington Ave., #2300, New York, NY 10022; andreas@iprivacy.com; www.cs.columbia.edu/~andreas.

Salvatore J. Stolfo is a professor of computer science at Columbia University and codirector of the USC/ISI and Columbia Center for Applied Research in Digital Government Information Systems. His most recent research has focused on distributed data-mining systems with applications to fraud and intrusion detection in network information systems. He received his PhD from NYU's Courant Institute. Contact him at the Computer Science Dept., Columbia Univ., New York, NY 10027; sal@cs.columbia.edu; www.cs.columbia.edu/~sal.

IEEE INTELLIGENT SYSTEMS
