Credit Card Fraud Detection System

V. Filippov
Institute of Control Sciences Moscow, Russia Email: filippovs@umail.ru

L. Mukhanov
Institute of Electronic Controlling Machines Moscow, Russia Email: lmukhanov@fraudprevention.ru

B. Shchukin
Institute of Control Sciences Moscow, Russia Email: tsh@cyber.mephi.ru

Abstract—The use of credit cards is prevalent in modern society. However, the number of credit card fraud cases keeps increasing despite the worldwide adoption of chip cards and the existing protection systems, which makes fraud detection an important problem. This paper gives a general description of the developed fraud detection system and compares its models based on artificial intelligence. The last section presents the results of evaluative testing and the corresponding conclusions.

I. INTRODUCTION

The use of credit cards is prevalent in modern society. As in other related fields, financial fraud persists despite the worldwide adoption of chip cards and the existing protection systems. This is why most software developers are trying to improve existing methods of fraud detection in processing systems. The majority of such methods are rule-based models, which allow bank employees to create rules describing suspicious transactions. However, the number of transactions per day is large and new types of fraud appear quickly, so it is very difficult to track new types of fraud and to create the corresponding rules in time. Doing so would require a significant increase in the number of employees. Such problems can be avoided by using artificial intelligence. The task has its own constraint, though: complex models are not acceptable because of authorization time limits [1],[2]. Bayesian Networks are suitable for this type of detection, but results from previous research showed that a special method of representing the input data (the attributes of a transaction) should be used for effective classification [3]. For transaction monitoring by bank employees the clustering model was developed. This model provides fast analysis of transactions by attributes.

This paper considers a general description of the developed credit card fraud detection system, the clustering model, the Naive Bayesian Classifier and the model based on Bayesian Networks with the data representation method. Finally, conclusions about the results of the models' evaluative testing are made.

II. FRAUD DETECTION SYSTEM

In this system two modules (FDS ONLINEP and FDS OFFLINEP) are used for fraud detection (transaction classification). The FDS ONLINEP module is used for on-line fraud detection, i.e. fraud detection during the authorization of transactions in a bank processing system.

Fig. 1. Structure of the Fraud Detection System

Different models for fraud detection can be used in this module. If a transaction is recognized as fraudulent in this module, a corresponding message is sent to the processing system and the transaction can be declined. Classification takes some time, and the time of transaction authorization is limited. This is why some models cannot be applied in the FDS ONLINEP module: they would exceed the time limits. These models are used in the FDS OFFLINEP module, which allows the system to detect fraud among transactions that have already been authorized and were classified in the FDS ONLINEP module. An FDS Data Warehouse stores incoming transactions, statistical data for the corresponding models, results of classification and generic parameters. The FDS ALERT module alerts credit card holders by SMS or email when the FDS ONLINEP module recognizes fraud. The FDS BUILDSTATP module builds the statistical data. Some models (for example, the model based on Bayesian Networks) require the statistical data to be rebuilt with a defined period, and this module allows transactions to be classified while the statistical data are being built (Fig. 1).

The following models are used for fraud detection:

1) A model based on data clusterization regions of the parameters' values (the clustering model). In this model the parameters of legal and fraudulent transactions are analysed in order to find regions of data clusterization for each parameter. During classification, the parameters of an incoming transaction are compared with these regions of data clusterization.

Authorized licensed use limited to: Pune Institute of Computer Technology. Downloaded on January 6, 2010 at 06:14 from IEEE Xplore. Restrictions apply.

If the parameters of the transaction are typical for legal transactions, the transaction is recognized as legal. Alternatively, if the parameters of the transaction are typical for fraud, the transaction is recognized as fraudulent.

2) A model based on Bayesian Networks. This model gives a probabilistic estimate of how well a classified transaction conforms to legal and to fraudulent transactions.

3) A model based on heuristic rules. This model allows the user to specify suspicious parameters of a transaction that are typical for fraud. It is also possible to specify a suspicious sequence of transactions.

4) A model based on limits. In general, these limits allow the user to restrict the amount of transactions per month for one card, or the amount of transactions from one financial institution per month (or per day).

5) A model based on minimal differences in time between the geographic places where transactions have taken place. A transaction is recognized as fraudulent if the time between the last transaction and the previous one is less than the time needed to get from the place of the previous transaction to the place of the last one.

In this system 24 real parameters of transactions are used for classification. All of them are discrete (18 input attributes are described in Table I).

TABLE I
INPUT ATTRIBUTES

Input attribute                               Example of attribute value
Message type                                  1031
Type of transaction                           700
Network identification                        25
Day of registration in system                 2
Time of registration in system (in minutes)   720
Day of registration in device                 1
Time of registration in device (in minutes)   730
Amount of transaction                         1000
Currency of transaction                       840
Terminal type                                 1
Language code (for ATM only)                  7
Acquirer institution identification           109428
Acquirer institution country code             643
Merchant identification                       456783
Card data input method                        2
Card present flag                             1
Cardholder present flag                       1
Cardholder authentication method              1

III. THE CLUSTERING MODEL

This model is based on the use of the parameters' data clusterization regions. To find these regions, a corresponding analysis has to be carried out first. In order to determine the regions we first find the maximum difference ($DIFF_{max}$) between the values of an attribute in the training data. This difference is split into $N_{interval}$ segments, where $N_{interval}$ is the binary logarithm of the number of attribute values $N_{points}$. This calculation of $N_{interval}$ is based on the assumption that a twofold increase of $N_{points}$ increases $N_{interval}$ by one. $N_{interval}$ can also be found in another way. For each segment, the average value and the corresponding deviation of the attribute values that fall into it are calculated. Thus $N_{interval}$ centers and corresponding deviations appear that describe all values of the attribute in the training data (Fig. 2). For example, if the values from Table II are observed in the training data, then the regions of clusterization from Table III will be found (with $N_{points} = 8$ the range is split into three segments, and the middle one receives no values). This information is collected for each attribute of a transaction during the learning process (the building of statistical data), separately for legal and fraudulent transactions.

TABLE II
VALUES OF ONE ATTRIBUTE

Number of value   Value
1                 1200
2                 1220
3                 1260
4                 1270
5                 720
6                 730
7                 780
8                 800

TABLE III
FOUND REGIONS OF DATA CLUSTERIZATION

Number of region   Center of region   Maximum deviation
1                  757.5              42.5
2                  1237.5             37.5

Fig. 2. Regions of data clusterization

During classification, each parameter of a transaction is compared with the regions of data clusterization found for this parameter. For example, the parameters of a transaction are first compared with the regions of data clusterization for legal transactions. If the transaction is not typical for legal transactions, it is compared with the regions of data clusterization for fraudulent transactions. The transaction is recognized as fraudulent if it is found to be a typical fraudulent transaction.
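The region search of Section III can be sketched in Python. This is a minimal illustration rather than the authors' implementation: the name `find_regions`, the rounding of the binary logarithm to a whole number of segments, the skipping of empty segments, and the use of the maximum absolute deviation (the quantity reported in Table III) are all assumptions.

```python
import math


def find_regions(values):
    """Return (center, max_deviation) pairs for the values of one attribute."""
    n_points = len(values)
    if n_points == 0:
        return []
    # N_interval = log2(N_points); rounding to an integer is an assumption.
    n_interval = max(1, round(math.log2(n_points)))
    low, high = min(values), max(values)
    width = (high - low) / n_interval  # DIFF_max split into equal segments
    segments = [[] for _ in range(n_interval)]
    for v in values:
        if width == 0:
            idx = 0  # all values are identical
        else:
            # Clamp so that the maximum value lands in the last segment.
            idx = min(int((v - low) / width), n_interval - 1)
        segments[idx].append(v)
    regions = []
    for seg in segments:
        if not seg:
            continue  # assumption: an empty segment produces no region
        center = sum(seg) / len(seg)
        deviation = max(abs(v - center) for v in seg)
        regions.append((center, deviation))
    return regions


TABLE_II = [1200, 1220, 1260, 1270, 720, 730, 780, 800]
regions = find_regions(TABLE_II)
print(regions)
```

For the eight values of Table II the sketch returns the two regions listed in Table III, (757.5, 42.5) and (1237.5, 37.5).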

If the value of a transaction parameter hits any region (the deviation of the parameter value from the center of the region is less than the corresponding deviation), the parameter is recognized as typical and the result of classification for this parameter ($Class_i$) is 1. If the value of the parameter does not hit any of the corresponding regions, the parameter is recognized as not typical and $Class_i$ is -1. The final result of classification of the whole transaction is the linear combination of the classification results for each parameter:

$$Result = \omega_1 \times Class_1 + \omega_2 \times Class_2 + \ldots + \omega_n \times Class_n$$

Here $Class_i$ is the result of comparing parameter $i$ with the corresponding regions of data clusterization, $\omega_i$ is the weight factor expressing the importance of the classification result of parameter $i$ for the classification of the whole transaction [4], and $n$ is the number of parameters. If $Result$ is greater than 0 the transaction is recognized as typical. If $Result$ is less than 0 the transaction is recognized as not typical. The absolute value of $Result$ expresses the accuracy of the transaction classification. Initially it is supposed that $\omega_i = 1/n$. Weight factors can be changed by bank employees according to the importance of the corresponding parameter for the classification of the whole transaction. To reduce the influence of a parameter on the classification, its weight factor should be decreased. If any weight factors are changed, all weight factors should be normalized:

$$\omega_i = \frac{\omega_i}{\sum_{j=1}^{n} \omega_j}$$

IV. THE NAIVE BAYESIAN CLASSIFIER

It is better to start the research of Bayesian Networks for credit card transaction classification with the Naive Bayesian Classifier because of its simplicity. The Naive Bayesian Classifier is one of the forms of Bayesian Networks, in which the conditional independence of the attributes (except for the class attribute) is assumed. The final decision about a transaction class is made after the class probability estimation [5]:

$$P(C = c \mid X = x) = \frac{P(C = c) \times P(X = x \mid C = c)}{P(X = x)}$$

Here $C$ is the random variable denoting the class of an instance (transaction) and $X$ is a vector of random variables denoting the observed attribute value vector. In this calculation $X = x$ represents the event that $X_1 = x_1 \wedge X_2 = x_2 \wedge \ldots \wedge X_n = x_n$. Because of the assumption about the attributes' conditional independence:

$$P(X = x \mid C = c) = \prod_{i} P(X_i = x_i \mid C = c)$$

Different approaches can be used to compute $P(X_i = x_i \mid C = c)$. For example, under the assumption that all attributes are discrete this probability can be calculated as (the discrete distribution):

$$P(X_i = x \mid C = c) = \frac{K_c + 1}{N + 1}$$

Here $K_c$ is the number of occurrences of the value $x$ in the learning data set for an attribute $i$ and a transaction class $c$, and $N$ is the number of instances in the learning data set. In this approach numeric attributes should be transformed into discrete ones [6]. It is also possible to use the normal distribution [5],[7],[8]:

$$P(X_i = x \mid C = c) = g(x, \mu_c^i, D_c^i)$$

$$g(x, \mu_c^i, D_c^i) = \frac{1}{\sqrt{2 \times \pi} \times D_c^i} \times \exp\left(\frac{-(x - \mu_c^i)^2}{2 \times (D_c^i)^2}\right)$$

where

$$\mu_c^i = \frac{1}{n} \times \sum_{j=1}^{n} x_{cj}^i, \qquad D_c^i = \sqrt{\frac{1}{n} \times \sum_{j=1}^{n} (x_{cj}^i - \mu_c^i)^2}$$

Here $\mu_c^i$ is the mean value of an attribute $i$ for a class $c$, $D_c^i$ is the standard deviation of the attribute $i$ values for a class $c$, and $n$ in these two sums is the number of training values of attribute $i$ observed for class $c$. The disadvantage of this approach is its assumption that the attribute values follow the normal distribution. It is better to calculate the class probability using the kernel density estimation [5]:

$$P(X_i = x \mid C = c) = \frac{1}{n} \times \sum_{j=1}^{n} g(x, x_j, D), \qquad D = \frac{1}{\sqrt{n}}$$

In this case the Gaussian is averaged over the set of all attribute values $x_j$ that have been observed in the training data. Such an approach gives a closer estimate of the real attribute distribution. Its disadvantage is that all values observed in the training data have to be stored, which is difficult for a considerable number of transactions.

V. BAYESIAN BELIEF NETWORKS AND THE MDL PRINCIPLE

The assumption about the conditional independence of attributes has a great influence on the efficacy of Bayesian Network classification [9]. This is connected with the strong impact that different attributes have on each other. For example, the currency of a transaction has an impact on the amount of the transaction. Therefore, the structure of the dependence network between the used attributes of a transaction should be found first [11]. To get this information Rissanen's Minimal Description Length principle is used [12],[10],[13]. This principle is based on the quantitative characteristic $MDL$:

$$MDL = \frac{\log_2 N}{2} \times |B| - LL$$
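Looking back at Section IV, the three per-attribute estimators can be compared side by side. The Python sketch below is illustrative only: the function names are hypothetical, the eight values of Table II are treated as the training values of a single class (an assumption made purely for the example), and a zero standard deviation is not handled.

```python
import math


def discrete_prob(x, class_values, n_instances):
    """Discrete estimate (K_c + 1) / (N + 1)."""
    k_c = sum(1 for v in class_values if v == x)  # occurrences of x for the class
    return (k_c + 1) / (n_instances + 1)


def gaussian(x, mu, d):
    """Normal density g(x, mu, D) with mean mu and standard deviation d."""
    return math.exp(-((x - mu) ** 2) / (2 * d * d)) / (math.sqrt(2 * math.pi) * d)


def normal_prob(x, class_values):
    """Normal-distribution estimate with the class mean and standard deviation."""
    n = len(class_values)
    mu = sum(class_values) / n
    d = math.sqrt(sum((v - mu) ** 2 for v in class_values) / n)
    return gaussian(x, mu, d)


def kernel_prob(x, class_values):
    """Kernel density estimate: Gaussians centred on every stored value, D = 1/sqrt(n)."""
    n = len(class_values)
    d = 1 / math.sqrt(n)
    return sum(gaussian(x, v, d) for v in class_values) / n


# Table II values, treated here as one class (an assumption for illustration).
values = [1200, 1220, 1260, 1270, 720, 730, 780, 800]

print(discrete_prob(720, values, len(values)))
print(normal_prob(757.5, values), normal_prob(997.5, values))
print(kernel_prob(720, values), kernel_prob(725, values))
```

On these values the normal estimate is larger at 997.5, where nothing was observed, than at 757.5, the center of an observed cluster, while the kernel estimate is practically zero for the unobserved value 725 that lies between two stored values.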

In the $MDL$ expression, $N$ is the number of instances in the learning data set and $|B|$ is the number of dependence edges in the Bayesian Network. The term $\frac{\log_2 N}{2} \times |B|$ describes the number of bits required to keep the corresponding network in memory. $LL$ is the final characteristic of the dependence between all used attributes in the formed network:

$$LL = N \times \sum_{i=1}^{n} \sum_{\Pi_{X_i}} I(X_i, \Pi_{X_i})$$

In this expression the outer sum runs over the attributes ($n$ is the number of attributes, while the factor $N$ is the number of instances as before), and $\Pi_{X_i}$ is the set of attributes on which the attribute $X_i$ depends. For example, if $X_1$ depends on $X_2$ and $X_3$ then $\Pi_{X_1} = \{X_2, X_3\}$. $I(X, Y)$ is the characteristic of the dependence between $X$ and $Y$:

$$I(X, Y) = \sum_{X,Y} P(x, y) \times \log_2 \frac{P(x, y)}{P(x) \times P(y)}$$

$LL$ is calculated for all possible combinations of attributes. Searching for the optimal conditional dependence network between attributes means searching for the minimum $MDL$. A detailed algorithm of the network search for Bayesian Networks using Rissanen's Minimal Description Length is described in reference [10]. It should be mentioned that the search for the dependence network structure is an independent task. The system performs this task before the training processes, using representative data for all bank cards.

The approach based on the discrete distribution is used for the class probability calculation. However, the results of the conducted evaluative testing showed that the Bayesian Networks method is not effective enough. The main problem is that the parameters of a transaction do not follow any common distribution. For example, the type of transaction can have values that group only around 800 or around 1000. Therefore it is very difficult to find a suitable distribution. This is why the following original method for input data representation was developed.

The key idea of the new method is to use attributes that correspond to the real attributes of a transaction but can take only two values (0 or 1). An input attribute has the value 1 when the real value of this parameter has been observed in the training data (otherwise 0). Thus, to get the input value of an attribute, all observed real values of this attribute in the training data must be known. This requires storing all observed values, which is not desirable. To solve this problem the developed clusterization algorithm can be used. In this case an input attribute has the value 1 when the real value of this parameter hits some region of data clusterization, and the value 0 otherwise. The class probability is calculated as (the discrete distribution):

$$P(X_i = x \mid C = c) = \frac{K_c + 1}{N + 1}$$

Here, for an attribute $i$ and a transaction class $c$, $K_c$ is the number of cases in which the value of the transaction parameter hit the corresponding regions of data clusterization if $x$ is 1, or did not hit them if $x$ is 0.

Such a method is suitable for the Naive Bayesian Classifier only. For Bayesian Networks it is better to use the number of the cluster that the value of an attribute hits during classification. This makes it possible to consider the dependence between clusters of values for different attributes. To calculate the class probability, all observed combinations for each region have to be stored (for example, for the dependence between attributes $X_1$, $X_2$, $X_3$). A more robust approach can be used. It is possible to transform dependent attributes into one independent attribute. Such an approach has been chosen for the evaluative testing.

The training process for the Naive Bayesian Classifier and Bayesian Networks using this data representation method is split into two tasks. The first task is to find the regions of data clusterization for each of the attributes. The second task is to build the statistical data for the Naive Bayesian Classifier or Bayesian Networks. Thus all training data consist of two parts: one is intended for the search for data clusterization regions and the other for the building of statistical data.

Fig. 3. Results of the Naive Bayes method classification using different approaches for $P_{legal}$ calculation

Fig. 4. Results of the Naive Bayes method classification using different approaches for $P_{fraud}$ calculation

VI. RESULTS

For the evaluative testing two sets of transactions were generated. The first set was generated for the training process and the second set for the testing process.
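The input representation method of Section V, which underlies the tests reported in this section, can be sketched in Python as follows. The sketch is hedged in several ways: the function names, the two-attribute toy data and the use of one shared set of regions per attribute are assumptions made for illustration, a value lying exactly on a region boundary is counted as a hit, and the class prior is taken as the class share of the training set.

```python
def hits(value, regions):
    """True if value lies within the deviation of at least one region center."""
    return any(abs(value - center) <= deviation for center, deviation in regions)


def encode(transaction, regions_per_attribute):
    """Replace every real attribute value by 1 (hits a region) or 0 (does not)."""
    return [1 if hits(v, r) else 0 for v, r in zip(transaction, regions_per_attribute)]


def cond_prob(x, i, rows_of_class, n_instances):
    """Discrete estimate (K_c + 1) / (N + 1) for encoded attribute i."""
    k_c = sum(1 for row in rows_of_class if row[i] == x)
    return (k_c + 1) / (n_instances + 1)


def class_score(encoded, rows_of_class, n_instances):
    """Unnormalized P(C = c) * prod_i P(X_i = x_i | C = c)."""
    score = len(rows_of_class) / n_instances  # class prior (assumption)
    for i, x in enumerate(encoded):
        score *= cond_prob(x, i, rows_of_class, n_instances)
    return score


# Hypothetical example: the first attribute reuses the regions of Table III,
# the second attribute has a single made-up region around the value 840.
regions_per_attribute = [[(757.5, 42.5), (1237.5, 37.5)], [(840.0, 0.0)]]
legal_rows = [encode(t, regions_per_attribute)
              for t in [(720, 840), (1260, 840), (800, 840)]]
fraud_rows = [encode(t, regions_per_attribute)
              for t in [(950, 978), (1000, 840)]]
n = len(legal_rows) + len(fraud_rows)

incoming = encode((1230, 840), regions_per_attribute)
p_legal = class_score(incoming, legal_rows, n)
p_fraud = class_score(incoming, fraud_rows, n)
print(incoming, p_legal, p_fraud)
```

Because every encoded attribute can take only the values 0 and 1, the counts behind the discrete estimate stay meaningful even for real values that never appeared in the training data, as long as they fall inside a region.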

All transactions have been generated for one card and correspond to the real use of a credit card in one of the banks during three months. The training data consist of 83 transactions, of which 52 correspond to legal transactions and 31 correspond to fraudulent transactions. The legal part contains 21 purchases that have been generated using 8 different merchants through 8 different financial organizations. Of these purchases, 16 have been made with the "local country" parameter using the local currency and the others with the "foreign country" parameter using a foreign currency. The other part of the legal transactions consists of cash withdrawals. In this part 24 transactions have been generated using 5 different cash point machines of the issuer bank, and 7 cash withdrawals were made through cash point machines of foreign banks. The fraudulent transactions have been generated with a specific financial institution, country code and time. These transactions were generated on the basis of a real fraudulent transaction.

The testing data consist of 9 transactions with attributes that have been observed in the legal transactions from the training data and 2 transactions that correspond to fraud. Legal transactions have the numbers 1 to 9 and fraudulent transactions the numbers 10 and 11. It should be noted that transactions 7 and 8 each have one attribute whose value is not observed in the training set. Three different tests were conducted using these sets.

The first test was intended to compare the Naive Bayesian Classifier based on the normal distribution, the discrete distribution, the kernel density estimation and the developed input data representation method (Fig. 3 and Fig. 4). In these figures $P_{legal}$ is the probability that a transaction is legal and $P_{fraud}$ is the probability that a transaction is fraudulent. The results of the Naive Bayes classification using the normal distribution, the kernel density estimation and the discrete distribution are not acceptable for this type of fraud detection. Most of the legal probabilities for the Naive Bayesian Classifier using these probability estimation methods are too low and are therefore incorrect. The reasons differ from method to method. For the normal distribution, these probabilities are low because the real distribution of the values in the training set does not correspond to the normal distribution for each attribute. Therefore only values that fall near the center of the distribution of the corresponding attribute receive high probabilities, while the values that have been observed in the training data may not fall near this center. For the discrete distribution and the kernel density estimation, the reason is that the probability for each attribute depends on the number of different values of this attribute that have been observed in the training set. If the number of different values of some attribute in the training set increases, the dispersion of the attribute values grows. Consequently the probabilities for these values will be low, which is not acceptable for this type of detection. The assumption about the conditional independence of different attributes also has a bad effect on the final probability calculation, because under this assumption the probabilities for different attributes are multiplied. Thus the Naive Bayesian Classifier using the normal distribution, the discrete distribution or the kernel density estimation is not suitable for this type of detection.

The results of classification using the Naive Bayesian Classifier based on the developed input representation method and the discrete distribution for probability estimation, however, are acceptable. In this case only two values are possible for the input attributes (0 and 1), and this is why the probabilities calculated using the discrete distribution are correct for this testing. This is the main reason that legal transactions have been classified as legal and fraudulent transactions have been classified as fraudulent with high probabilities.

Fig. 5. Results of $P_{legal}$ calculation for Bayesian networks and the Naive Bayesian Classifier

Fig. 6. Results of $P_{fraud}$ calculation for Bayesian networks and the Naive Bayesian Classifier

The second test was intended to compare the Naive Bayesian Classifier and Bayesian Networks. For both models the input representation method was used. Before the training, a network of attribute dependence was found for the model based on Bayesian Networks using Rissanen's Minimal Description Length principle. This search was based on real transactions from one of the banks. Comparative testing of the Naive Bayesian Classifier and Bayesian Networks showed that it is better to use Bayesian Networks for fraud detection (Fig. 5 and Fig. 6).

For all legal transactions (numbers 1-9), $P_{legal}$ was determined to be greater than or equal to 0.65 using Bayesian Networks. The probabilities $P_{legal}$ calculated for transactions 7 and 8 using the Naive Bayesian Classifier are less than 0.3. This probability is rather low for transactions that have only one attribute whose value was not observed in the training data.

The third test was intended to estimate the result of classification for the clustering model. These results can be acknowledged as acceptable (Fig. 7). It should be noted that the results for this model are comparable with the results of the Bayesian Networks testing because special weight factors were selected for this testing. However, this does not mean that the clustering model is as effective as the model based on Bayesian Networks for fraud detection in general. This model allows bank employees to monitor incoming transactions quickly. But the accuracy of classification for this model is not sufficient, because the correlation between attributes is not taken into account in this model. The same problem is observed with the Naive Bayesian Classifier, and because of this that model is less accurate than the model based on Bayesian Networks.

Fig. 7. Results of testing for the clustering model

Finally, it should be mentioned that the evaluative testing was conducted to find out whether the Naive Bayesian Classifier, Bayesian Networks and the clustering model are acceptable for this type of detection. This testing was not intended to estimate the classification accuracy of these methods in general. Such an estimate can be obtained only with a considerable number of real transaction sets.

VII. CONCLUSION

A general description of the developed fraud detection system and a comparison of its base models have been presented. The developed clustering model was also described in this paper. To compare these models, a special evaluative testing was conducted. This evaluative testing was intended to simulate a typical use of credit cards. The obtained results show that the Naive Bayesian Classifier based on the discrete distribution, the normal distribution or the kernel density estimation is not suitable for this type of fraud detection. The main problem is that the real distribution of values for each attribute does not correspond to any common distribution. This is why the input representation method was developed, which increases the effectiveness of the Naive Bayesian Classifier and the Bayesian Networks method for this type of fraud detection. The results of the conducted evaluative testing indicate that Bayesian Networks based on the input representation method and the developed clustering model can be used in a real fraud detection system.

REFERENCES

[1] R. Brause, T. Langsdorf, and M. Hepp, "Neural data mining for credit card fraud detection," in Proc. of the 11th IEEE International Conference on Tools with Artificial Intelligence, Evanston, 1999, pp. 103–106.
[2] J. Abello, P. Pardalos, and M. Resende, Eds., Handbook of massive data sets. Norwell, MA, USA: Kluwer Academic Publishers, 2002.
[3] L. Mukhanov, "Using bayesian belief networks for credit card fraud detection," in Proc. of the IASTED International Conference on Artificial Intelligence and Applications, Innsbruck, Austria, Feb. 2008, pp. 221–225.
[4] J. Friedman, T. Hastie, and R. Tibshirani, "Additive logistic regression: a statistical view of boosting," The Annals of Statistics, vol. 28, no. 2, pp. 337–407, 2000.
[5] G. John and P. Langley, "Estimating continuous distributions in bayesian classifiers," in Proc. of the Eleventh Conference on Uncertainty in Artificial Intelligence, San Mateo, 1995, pp. 338–345.
[6] J. Dougherty, R. Kohavi, and M. Sahami, "Supervised and unsupervised discretization of continuous features," in International Conference on Machine Learning, San Francisco, 1995, pp. 194–202.
[7] D. Geiger and D. Heckerman, "Learning gaussian networks," in Proc. 10th Conference on Uncertainty in Artificial Intelligence, San Francisco, 1994, pp. 235–243.
[8] P. K. Chan and S. Stolfo, "Toward scalable learning with non-uniform class and cost distribution: A case study in credit card fraud detection," in Proc. of the Fourth International Conference on Knowledge Discovery and Data Mining, New York, 1998, pp. 164–168.
[9] W. Lam and F. Bacchus, "Learning bayesian belief networks: An approach based on the mdl principle," Computational Intelligence, vol. 10, no. 4, pp. 269–293, 1994.
[10] J. Suzuki, "Learning bayesian belief networks based on the mdl principle: An efficient algorithm using the branch and bound technique," in Proc. of the International Conference on Machine Learning, Bari, Italy, 1996, pp. 463–470.
[11] D. Heckerman, D. Geiger, and D. Chickering, "Learning bayesian networks: The combination of knowledge and statistical data," Machine Learning, vol. 20, no. 3, pp. 197–243, 1995.
[12] J. Suzuki, "A construction of bayesian networks from databases on an mdl principle," in Proc. of the Ninth Conference on Uncertainty in Artificial Intelligence, Washington D.C., 1993, pp. 266–273.
[13] R. Bouckaert, "Probabilistic network construction using the minimum description length principle," Lecture Notes in Computer Science, vol. 747, pp. 41–48, 1993.
