
SPIS2015, 16-17 Dec. 2015, Amirkabir University of Technology, Tehran, IRAN

HIGH PERFORMANCE IMPLEMENTATION OF TAX FRAUD DETECTION ALGORITHM

Mehdi Samee Rad
Islamic Azad University, Rasht Branch, Faculty of Engineering
Rasht, Iran
SameeRad@iauRasht.ac.ir

Asadollah Shahbahrami
Department of Computer Engineering, Faculty of Engineering, University of Guilan
Rasht, Iran
shahbahrami@guilan.ac.ir

Abstract: Tax fraud covers a wide range of practices: denying facts, declaring false information, and conducting financial business outside the legal framework. With the growth of tax systems and the large volume of data stored in them, there is a need for a tool that can process the stored data and give users the information extracted from it. Under current tax policies, especially value-added tax, the rate of tax fraud is increasing. Recent studies tend to use similar, standard methods to detect tax fraud, including association rules, clustering, neural networks, decision trees, Bayesian networks, regression and genetic algorithms. Because tax databases are large, most of these fraud detection methods are computationally intensive. To increase the performance of fraud detection algorithms such as Bayesian networks, this paper applies parallelism techniques. We used the parallel technologies of Microsoft .NET, parallel loops and PLINQ, on an Intel Xeon server with 16 X7755 dual-core processors and 32 GB of memory. Implementation results on a real database show a speedup of up to 9.2x.

Keywords: Data Mining, Tax Fraud Detection, Bayesian Networks, Parallelism Techniques

I. INTRODUCTION
Modern hardware systems are capable of parallel processing. Using these capabilities successfully, through multi-threaded programming, parallel algorithms and parallel processing techniques, provides a strong level of operational capability [1, 2]. An intelligent system can extract information from various tax databases, identify tax fraud and inform the related organizations. Such systems can also be used to assess taxes accurately in order to increase tax remittance [3, 4, 5]. Tax is considered the main source of income for countries [6, 7]. Because tax specialists rely on traditional auditing strategies, a considerable amount of government tax income is lost [8, 9, 10]. Data mining techniques can be used for tax fraud detection, but these algorithms are computationally intensive on large tax databases.
In this paper, parallelism techniques are applied to data mining algorithms in order to increase their performance. In other words, a parallel tax fraud detection system is presented. Implementation results on real data show that a performance improvement of 9.2x is achieved using the parallel patterns available in the .NET framework. Bayesian networks are the algorithm parallelized here; they use conditional probability distributions to identify and predict tax fraud among taxpayers. The proposed system has two stages, training and testing. Two parallel techniques were used: .NET parallel loops and Parallel LINQ [11, 12, 13, 14].
This paper is organized as follows. Section II discusses related work and data mining algorithms. Section III presents the proposed method and the implementation results, and Section IV concludes.

II. RELATED WORKS
Data mining algorithms such as the multilayer feed-forward neural network, Support Vector Machine (SVM), Genetic Programming (GP), Group Method of Data Handling (GMDH), Logistic Regression (LR) and Probabilistic Neural Networks (PNN) are used in fraud detection systems. Tax data is usually large, and running these algorithms on it is computationally intensive. Some related works in the field of tax fraud detection are presented in Table 1.

TABLE 1. OVERALL ANALYSIS AND EVALUATION OF TAX FRAUD RESEARCHES
Ref | Tools and techniques | Environment | Approach | Knowledge
[1] | Clementine and SPSS (non-parallel) | Canada, Chile and the United States | Clustering and neural network | Number and sum of the amount of the bills
[2] | DM miner (non-parallel) | Value-added tax of Germany | Data mining and association rules | The information claimed in the assertions
[3] | T-statistics (non-parallel) | Experimental data | Genetics and regression | Available fields in the assertions
[4] | Clementine and SPSS (non-parallel) | Information of 49 related articles | Regression and neural network | Published articles in this field

III. PROPOSED METHOD
The main purpose of the training algorithm is to discover the rules that govern the various labels, based on the features of the records. Figure 1 depicts the stages of the training algorithm used in this study, in which building the model in parallel is the main factor that helps detection.

978-1-5090-0139-2/15/$31.00 ©2015 IEEE


The assumption in [1] is that people's activities are reflected in their bills; in [2], that everyone must submit value-added assertions; in [3], that fraud is identified by studying records and files; and in [4], that tax fraud is considered in its place of use. Among the available classification techniques, the Bayesian technique was used here and implemented in parallel.

[Figure 1: Experimental data, Training the model, Generate model, Apply model, Training data; the use of parallel processing]
FIGURE 1. TRAINING, BUILDING AND APPLYING MODEL IN TAX DATA MINING

A. Bayesian networks
Bayesian networks are statistical classifiers that can describe chains of dependency in data, in which the input features are assumed to be independent of each other and not to affect each other. This is a supervised technique. A Bayesian network is a directed graph whose nodes represent the variables (X1, ..., Xn). It is used to compute the probability that Xi occurs given that its parents have already occurred. P(Xi) denotes the probability of each node, Parents(Xi) denotes the parents of that node, and the total probability equals the product of all these conditional probabilities:

\[ P(X_1, \ldots, X_n) = \prod_{i=1}^{n} P\bigl(X_i \mid \mathrm{Parents}(X_i)\bigr) \tag{1} \]

The conditional probability distribution of fraud, based on studies of the amount of income tax, the amount of purchases, goodwill and the probability of sale, over a statistical sample of the careers registered in the financial department, is shown in Figure 2.
The studies showed that taxpayers who have a complementary sheet account for 57.9% of fraud.

FIGURE 2. BAYESIAN TREE FOR IDENTIFYING TAX FRAUD

There are three ways to analyze Figure 2 in parallel. The first is causal reasoning, or prediction: the effect is obtained from the cause, and the conditions are given. The second is evidential reasoning, in which the results are known. The third is intercausal reasoning, in which the various parameters are allowed to influence each other, vertically.
The Bayesian method represents the input features as a vector. Formula 2 is used to compute the probability of tax fraud among the taxpayers; that is, each fraud is considered one element of this vector.

(2)

In computing the probability with the simple Bayesian formula, the vector X stands for the input features, such as occupation and income, and there are M output classes:

\[ P(C_i \mid X) = \frac{P(X \mid C_i)\, P(C_i)}{P(X)}, \qquad i = 1, \ldots, M \tag{3} \]

The probability of each class is also computed separately. Since P(X) in the formula above is constant, it is the same for every class. The probability is then computed for all the possibilities, and the maximum of the obtained probabilities indicates the class to which the vector belongs. That is, both the probability of occurrence and of non-occurrence of fraud are computed, and the vector X is assigned to the class Ci with the largest value. If i = 1 denotes the fraud class, the members of the fraud class are thereby identified. To this end, the fraud detector program was implemented as shown in Figure 3. Regardless of the inputs, analyzing the outputs of the fraud detector program gives:

(4)

The most influential fields in the parameters are the first series of fields: "remittance of declarations", "handing assertions", "handing registries" and/or "not handing either one of them". Using the program structure shown in Figure 3, the rate of fraud was computed in parallel.

[Figure 3: TAX DBs, tax data and other input; Tax Fraud Detector System; Data Mining Algorithm; Loop Level Parallelism; Query Level Parallelism; tax forgers; reports needed by the tax department to issue complementary tax assessments (discovering fact denial)]
FIGURE 3. STRUCTURE OF FRAUD DETECTOR AND PARALLELISM SYSTEM, MADE BY MEANS OF .NET AND LINQ

B. Tax fraud detection algorithm on the parallel platform
The parallel execution steps of the tax fraud detection algorithm are as follows:
1) Fetching the tables in order to detect career frauds.
2) Selecting a data sample from the tables (5% of all information in career assertions).
3) Constructing a behavioral model for people committing tax fraud, using multilayer Bayesian.
4) Matching the behavioral model of people committing tax fraud against other information defined by the tax departments.

5) Evaluating the results of matching the behavioral model of people committing tax fraud against all the data.
6) Returning to step 3 if the error threshold is exceeded.
7) Detecting fraud, producing reports and notifying the tax departments.
Based on Figure 4, a label is assigned to each feature of the taxpayer, and these labels are used in constructing the model. The labels are used in Formula 5. The rate of financial fraud can then be computed under the conditions below.

FIGURE 4. LABELING PARALLEL BAYESIAN LAYERS IN ORDER TO BE INSERTED IN THE FORMULAE

X = (Notification = 1, Declaration = 0, Financial Offices = 1, No surrender = 1)
P (Fraud=1 | X) = P (N=1 | A = 0, B = 1, C = 0, D = 1)

(5)

C. Parallel algorithm of tax fraud detection
Parallel thinking, analysis and design can solve significant problems of today's software, and can be regarded as a remedy where serial software falls short. Based on the investigations done at this stage, parallelism is possible at three levels:
1) Parallelism at the level of the loops used in the program.
2) Parallelism at the level of the queries used in the program.
3) Parallelism at the level of infrastructure, network and operating system.

D. First level: loop level parallelism
The loops in the body of the program were time consuming in serial mode; this changed completely once their structure was changed.

    // Serial version: every taxpayer is scored one after another.
    for (int i = 0; i < Code_Hoze.Capacity; i++)
        for (int j = 0; j < TaxPaery.Length; j++)
            calculateMultiLevelBaysianFormules();

Using the parallelism techniques of .NET, the loops in the program were parallelized to achieve higher speed, as follows:

    // Parallel version: the inner loop over taxpayers runs under Parallel.For.
    for (int i = 0; i < Code_Hoze.Capacity; i++)
        Parallel.For(0, TaxPaery.Length, j => {
            ret = calculateMultiLevelBaysianFormules(
                f1Arr[j], f2Arr[j], f3Arr[j], f4Arr[j]);
        });

E. Second level: queries level parallelism
The serial code at the center of the fraud detector program runs continuously to fetch the tax data from the database, process it and store it again, which is time consuming.

    // Serial version: one query execution per pair of rows.
    for (int i = 0; i < dgvTaxPayer.RowCount; i++)
        for (int j = 0; j < ghatee_inf_RowsCount; j++)
            KernelComDo.Excute(QueryString);

Using PLINQ, query-level parallelism was implemented as in the code below, which reduced the running time of the program significantly.

    // Query version: count the rows whose file number matches the taxpayer.
    for (int i = 0; i < dgvTaxPayer.RowCount; i++) {
        Int32 result1 = ghatee_inf_Rows.Where(p =>
            p.Field<string>("K_Parvand") ==
            dgvMoadi.Rows[i].Cells[2].Value).Count();
    }

F. Third level: infrastructure level parallelism
After the tax fraud detection system was analyzed, designed and implemented, it was run on an Intel Xeon server with 16 dual-core X5570 processors and 64 GB of RAM. The results are reported below.

G. Compute performance of proposed method of parallelism
In this study, some of the program's loops were parallelized, in the section that extracts the Bayesian tables, as were some of the program's queries in the fraud detection section. The final program was then run at the higher processing speed.

TABLE 2. THE RESULTS OF THE LOOP AND QUERY PARALLELISM (SECONDS)
Number of taxpayers | Time in serial mode (s) | Time in loop level parallel (s) | Time in query level parallel (s)
1000 | 505 | 33 | 14
2000 | 539 | 45 | 18
3000 | 561 | 72 | 27
4000 | 598 | 122 | 34
5000 | 659 | 130 | 40
6000 | 691 | 155 | 59
7000 | 730 | 182 | 97
8000 | 872 | 201 | 114
9000 | 954 | 227 | 189
10000 | 1114 | 241 | 193
SUM | 7223 | 1408 | 785

\[ \text{Speedup} = \frac{\sum T_{\text{serial}}}{\sum T_{\text{parallel}}} \tag{6} \]

The speedup achieved through loop parallelism can be seen in Table 2, and Figure 5 shows the corresponding speedup graph. Note that the numerical results of the serial and parallel versions were the same; they differ only in the time consumed.
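As a worked check on the SUM row of Table 2, the aggregate speedups follow from dividing the summed serial time by each summed parallel time. The symbols S_loop and S_query are labels introduced here for illustration and are not notation from the paper.

```latex
\begin{align*}
S_{\text{loop}}  &= \frac{7223}{1408} \approx 5.13, \\
S_{\text{query}} &= \frac{7223}{785} \approx 9.20, \\
\frac{S_{\text{query}}}{S_{\text{loop}}} &= \frac{1408}{785} \approx 1.79 .
\end{align*}
```

The query-level value agrees with the speedup of up to 9.2x reported in the abstract, and the last ratio is the aggregate gain of query parallelism over loop parallelism.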

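The two software levels of parallelism described in Sections D and E can be sketched together in a small, self-contained C# program. This is only a minimal sketch under stated assumptions: `Score`, `ScoreAll` and `CountRecords` are hypothetical names, `Score` is a stand-in that multiplies four per-feature likelihoods rather than the authors' multilevel Bayesian computation, and the in-memory arrays replace the tax database. Unlike a single shared result variable, each loop iteration here writes to its own array slot, which avoids a data race.

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;

public static class ParallelLevelsSketch
{
    // Hypothetical stand-in for the Bayesian scoring routine:
    // it simply multiplies four per-feature likelihoods.
    public static double Score(double f1, double f2, double f3, double f4)
    {
        return f1 * f2 * f3 * f4;
    }

    // Loop-level parallelism: each iteration writes only to its own slot,
    // so no locking is needed around the results array.
    public static double[] ScoreAll(double[] f1, double[] f2, double[] f3, double[] f4)
    {
        var scores = new double[f1.Length];
        Parallel.For(0, f1.Length, j =>
        {
            scores[j] = Score(f1[j], f2[j], f3[j], f4[j]);
        });
        return scores;
    }

    // Query-level parallelism: AsParallel() hands the filter-and-count to PLINQ.
    public static int CountRecords(string[] fileNumbers, string fileNumber)
    {
        return fileNumbers.AsParallel().Count(n => n == fileNumber);
    }

    public static void Main()
    {
        // Illustrative feature likelihoods for two taxpayers (made-up values).
        double[] f1 = { 0.5, 0.9 };
        double[] f2 = { 0.5, 0.8 };
        double[] f3 = { 0.5, 0.7 };
        double[] f4 = { 0.5, 0.6 };
        Console.WriteLine(string.Join(", ", ScoreAll(f1, f2, f3, f4)));

        // Illustrative file numbers (made-up values).
        string[] records = { "A-1", "B-2", "A-1", "C-3" };
        Console.WriteLine(CountRecords(records, "A-1"));
    }
}
```

For inputs this small the parallel versions will not be faster than serial code; the pattern pays off only when each iteration or query does enough work to outweigh the scheduling overhead.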
The results gained from parallelism in the program queries are also shown in Table 2. The important point is that the sum of the durations of all experiments in serial mode was divided by the sum of the durations of all experiments in parallel mode (the third and fourth columns of Table 2), as in Formula 6. Finally, the mean overall performance gain of each method was computed.
In the final comparison of the two methods, parallelism in loops and parallelism in queries, query parallelism (PLINQ) outperforms loop parallelism because the tax fraud identification and prediction program operates at the data level and is fundamentally database-driven.

FIGURE 5. SPEEDUP LINQ OVER LOOP PARALLELISM

In this study, loop parallelism was implemented with Parallel.For, while query parallelism was implemented with Parallel LINQ in .NET. Parallelism at the execution level was achieved through parallel processing on the available multi-core hardware and proper use of the capabilities of multi-core processors. Using the data of 15 financial departments, 10028 experiments were conducted. Out of 10028 taxpayers, 995 were clearly committing fraud. The output of fraud detection in this section is shown in Table 3.

TABLE 3. OUTPUT OF SYSTEM BY MEANS OF LOOP AND QUERIES PARALLELISM
Department code | Tax payers | Forgers | Number of processed tuples of tax payers from their 10-year history file, for identification
H-1 | 303 | 9 | 2,877,094
H-2 | 1264 | 170 | 203,091,683
H-3 | 806 | 46 | 13,038,888
H-4 | 1117 | 90 | 7,260,277
H-5 | 802 | 71 | 6,973,117
H-6 | 476 | 42 | 10,420,630
H-7 | 771 | 110 | 12,249,956
H-8 | 603 | 89 | 45,498,883
H-9 | 670 | 34 | 19,956,647
H-10 | 333 | 44 | 4,048,641
H-11 | 501 | 92 | 57,167,908
H-12 | 733 | 35 | 29,967,092
H-13 | 473 | 26 | 9,692,546
H-14 | 578 | 101 | 27,129,424
H-15 | 598 | 36 | 9,598,905
Sum: | 10028 | 995 | 458,971,690

Using the data in Table 3 and the value of the variable from the previous step, we increased the amount of prediction. The dotted area in Figure 6 shows the area of fraud prediction.

FIGURE 6. PREDICT AMOUNT OF FRAUD BY MEANS OF EXAMINING DATA

IV. CONCLUSIONS
Intelligent tax fraud detection systems identify fraud in the available tax data with a lower error rate and, depending on the tax department chosen by the user, can report the list of frauds committed in that department to the relevant specialists.
Because of the large volume of tax data, running the program in serial mode was time consuming. In this study we therefore relied on new parallelism technology and applied it to parts of the procedure to reduce the running time of the program. This system can be applied to the structure of the tax system.

REFERENCES
[1] P.C. González, J.D. Velásquez, "Characterization and detection of taxpayers with false invoices using data mining techniques," Expert Systems with Applications, vol. 40, no. 5, pp. 1427-1436, 2013.
[2] R.S. Wu, C.S. Ou, H. Lin, S.I. Chang, D.C. Yen, "Using data mining technique to enhance tax evasion detection performance," Expert Systems with Applications, vol. 39, no. 10, pp. 8769-8777, 2012.
[3] P. Ravisankar, V. Ravi, G.R. Rao, I. Bose, "Detection of financial statement fraud and feature selection using data mining technique," Decision Support Systems, vol. 50, no. 2, pp. 491-500, 2011.
[4] E.W.T. Ngai, Y. Hu, Y.H. Wong, Y. Chen, X. Sun, "The application of data mining techniques in financial fraud detection: A classification framework and an academic review of literature," Decision Support Systems, vol. 50, no. 3, pp. 559-569, 2011.
[5] V. Ajay, D.V. Ashoka, V.N. Aradya, "Application of Data Mining Techniques for Defect Detection and Classification," in Proceedings of the 3rd International Conference on Frontiers of Intelligent Computing: Theory and Applications (FICTA), 2015.
[6] M.S. Abadeh, S. Mahmoodi, M. Taherparvar, "Application Data Mining," Niyaz Danesh Press, 2012.
[7] J. Shahrabi, V. Shakor, "Data mining Concepts," Metalon Press, 2007.
[8] A. Ahmadi, A. Mohebbi, "Business Intelligence: data mining and optimization," Amirkabir University Press, 2013.
[9] Central Michigan University, "Computational Data Mining Techniques in Automotive Insurance Fraud Detection," Data Science Journal, 2012.
[10] F. Nonyelum, "Data Mining Application in credit card fraud detection system," Journal of Engineering Science and Technology, vol. 6, 2011.
[11] Akash Verenkar, "Using .NET Parallel Programming Model to Achieve Data Parallelism in Multi-tier," Microsoft Corporation, 2015.
[12] D. Leijen, W. Schulte, S. Burckhardt, "The Design of a Task Parallel Library," Microsoft Corporation, 2015.
[13] S. Okur, D. Dig, "How do Developers Use Parallel Libraries?," MSDN Microsoft Corporation, 2015.
[14] I. Ostrovsky, "Parallel Programming in .NET 4," Parallel Computing Platform Group, 2015.
