
2009 World Congress on Computer Science and Information Engineering

Fraud Detection in Tax Declaration Using Ensemble ISGNN

Kehan Zhang¹, Aiguo Li², Baowei Song¹

¹ Department of Marine, Northwestern Polytechnical University, Xi'an, 710072, Shaanxi, China
² Department of Computer Science and Technology, Xi'an University of Science and Technology, Xi'an, 710054, Shaanxi, China

Abstract

Fraud detection in tax declaration plays an important role in tax assessment. This paper presents the use of an ensemble ISGNN (Iteration learning Self-Generating Neural Network) to solve the problem of fraud detection in tax declaration. An ensemble ISGNN is trained on the financial data of sampled enterprises, and the trained ensemble is then employed to detect whether the tax declared by an enterprise is legitimate or not. Experimental results show that the proposed approach is effective: its classification precision is 96.7742% on 31 test samples, 3.22 points higher than that of SGNN, while the number of samples used to train each ISGNN of the ensemble is one third that of SGNN.

Keywords: ISGNN; Ensemble ISGNN; Fraud Detection

1. Introduction

Fraud detection in tax declaration (FDTD) plays an important role in tax assessment. According to the results of FDTD, the tax administration decides which enterprises are the key targets of tax inspection. However, the factors that affect FDTD are complex, so FDTD is a challenging problem. An approach to FDTD based on SGNN was proposed by Wang et al. [1], who showed that their method achieves higher classification precision than C5.0.

SGNN (Self-Generating Neural Network) has been used in many fields, such as classification, clustering, prediction, and recognition, because of its efficiency [2]. ISGNN (Iteration learning Self-Generating Neural Network) is an improved method based on SGNN that cuts down the size of the SGNT and improves classification precision [3]. This paper presents the use of an ensemble ISGNN to solve the FDTD problem. The experimental results show that this method outperforms SGNN in classification precision: the precision of the proposed method is 96.7742%, 3.22 points higher than that of SGNN.

2. Related work

SGNN was first proposed by Wen in 1992 [4]. Inoue and Narihisa then carried out extensive research based on SGNN and showed that SGNN is efficient in classification, clustering, prediction, and recognition [5, 6]. SGNN avoids having to specify the following in advance: (1) the number of network layers, (2) the number of neurons in each layer, and (3) the connection weights between consecutive layers.

SGNN is trained by generating a Self-Generating Neural Tree (SGNT) from the training samples. Once all training samples have been inserted into the SGNT as leaf neurons, the SGNT is fully generated. The weights of the leaf neurons are the attribute values of the training samples, while the weight of each non-leaf neuron is the mean of the weights of its children. In this way SGNN dynamically decides both the weights between consecutive layers of the SGNT and the structure of the network. The learning process of SGNN has two main steps: generating the SGNT and optimizing it.
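To make the tree-construction rule concrete, here is a minimal Python sketch of SGNT generation. It is our illustration, not the authors' implementation: the class name, the walk-down-the-closest-branch insertion rule, and the bottom-up mean update are assumptions based on the description above and on [4].

import numpy as np

class Neuron:
    """A node of a Self-Generating Neural Tree (SGNT)."""
    def __init__(self, weight):
        self.weight = np.asarray(weight, dtype=float)
        self.children = []

def nearest_child(node, x):
    """Return the child whose weight vector is closest to sample x."""
    return min(node.children, key=lambda c: np.linalg.norm(c.weight - x))

def insert(root, x):
    """Insert sample x as a leaf, walking down the closest branch.

    Non-leaf weights are kept as the mean of their children's weights,
    as described above.
    """
    node, path = root, [root]
    while node.children:
        node = nearest_child(node, x)
        path.append(node)
    # 'node' is the closest leaf: give it a copy of itself plus the new
    # sample as children, turning it into a non-leaf neuron.
    node.children = [Neuron(node.weight), Neuron(x)]
    for n in reversed(path):  # refresh the means from the bottom up
        if n.children:
            n.weight = np.mean([c.weight for c in n.children], axis=0)

def generate_sgnt(samples):
    """Generate an SGNT from training samples (step 1 of SGNN learning)."""
    root = Neuron(samples[0])
    for x in samples[1:]:
        insert(root, x)
    return root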

However, the number of node neurons in an SGNT is so large that it limits classification precision and demands a large amount of memory. That is a shortcoming of SGNN. Li et al. proposed an improved method named ISGNN based on classical SGNN [3]. ISGNN differs from SGNN in that it optimizes and prunes the SGNT at the same time as it inserts each training sample. In this way, ISGNN reduces the size of the SGNT and is superior to SGNN in classification precision.

3. Fraud Detection in Tax Declaration

In fraud detection in tax declaration (FDTD), the tax administration detects whether the tax declarations of enterprises are truthful, based on the enterprises' financial data. Wang et al. gave formal descriptions of FDTD as follows [1]:

Definition 1: x is a set of D enterprises that belong to the same area, the same industry, and the same level, $x = (x_1, x_2, \ldots, x_D)$. Here the enterprises have two levels: common taxpayer and uncommon taxpayer.

Definition 2: E is an attribute set comprising the financial information declared by the enterprises for taxation, $E = (e_1, e_2, \ldots, e_q)$.

Definition 3: V is a sample data set of the financial information declared by the enterprises for taxation:

$$V = \begin{bmatrix} v_1 \\ \vdots \\ v_D \end{bmatrix} = \begin{bmatrix} v_{11} & \cdots & v_{1q} \\ \vdots & & \vdots \\ v_{D1} & \cdots & v_{Dq} \end{bmatrix} \qquad (1)$$

where $v_i$ is the input vector composed of $x_i$'s attributes, $v_{ij}$ is an attribute value of $x_i$, and $i = 1, 2, \ldots, D$.

FDTD is to judge whether each $v_i$ is truthful or not, according to V.
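As a concrete illustration of Definition 3, V can be stored as a D × q matrix with one row per enterprise. The attribute names and values below are invented for illustration only (the real attribute set, described in Section 5, has q = 71 entries):

import numpy as np

# Hypothetical attribute columns:
# registered capital (10k RMB), business area (m^2), declared monthly tax (RMB).
V = np.array([
    [500.0,   120.0,  8300.0],   # v_1: attributes of enterprise x_1
    [80.0,     45.0,  1100.0],   # v_2: attributes of enterprise x_2
    [2400.0,  900.0, 41000.0],   # v_3: attributes of enterprise x_3
])
D, q = V.shape                   # D enterprises, q financial attributes each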
performance of meeting higher classification precision
and shorter times-consumed on classification problem
4. Ensemble ISGNN
simultaneously.
4.1 ISGNN The algorithm of ensemble ISGNN has 6 main steps
as follows:
The learning process of ISGNN includes two main ① Finding out all different kinds of samples in
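A runnable companion to the pseudo-code, reusing Neuron and insert from the sketch in Section 2. The bodies of vwp, hwp, and merge are placeholders (the real operations are defined in [4]); the point is only the control flow in which ISGNN interleaves optimization with insertion:

def vwp(root):
    """Placeholder: make the SGNT vertically well placed (see [4])."""

def hwp(root):
    """Placeholder: make the SGNT horizontally well placed (see [4])."""

def merge(root):
    """Placeholder: prune dead nodes (see [4])."""

def train_isgnn(samples):
    """ISGNN training: optimize and prune after every insertion,
    rather than once after the whole tree has been built."""
    root = Neuron(samples[0])   # copy(n1, e1)
    for x in samples[1:]:
        insert(root, x)         # insert(tree, ei)
        vwp(root)
        hwp(root)
        merge(root)             # keeps the SGNT small as it grows
    return root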
4.2 Ensemble ISGNN

A neural network ensemble is a learning paradigm in which several networks are jointly used to solve a problem. In the early 1990s, Hansen and Salamon showed that the generalization ability of a neural network system can be significantly improved by combining several neural networks, i.e., training several networks and combining their results in some way [7]. Later, Inoue, Zhou, and others reported many studies of neural network ensembles and showed that an ensemble is superior to an individual network [8, 9].

In this paper, ISGNNs are employed as the classifiers of a neural network ensemble to solve the classification problem. Massive data is difficult to manage in practical applications: the more training samples are used to train an ISGNN, the higher the classification precision that can be achieved, but the time consumed becomes unacceptable. The ensemble ISGNN proposed in this paper achieves both higher classification precision and shorter time consumption on the classification problem.

The algorithm of ensemble ISGNN has 6 main steps, sketched in code after the list:

① Find all the different kinds of samples in the training dataset and label their classes. Then group all training samples according to their class labels.

② Compute the proportion of each class of samples in the training dataset; rate_t denotes the proportion of the t-th class in the training dataset.

③ Decide the number Q of samples used to train one ISGNN.

④ Randomly select samples from each class according to rate_t, and combine these samples into a new training dataset of size Q.

⑤ Use the new training dataset from step ④ to train an ISGNN; an SGNT is generated.

⑥ Repeat steps ④ and ⑤. The number of iterations is decided by CN, the number of ISGNNs in the ensemble ISGNN.
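As referenced above, here is a minimal Python sketch of the six steps; train_isgnn is the Section 4.1 sketch, Q and CN follow the notation of the steps, and the class-label bookkeeping inside each tree (needed later for prediction) is omitted for brevity:

import random
from collections import defaultdict

def train_ensemble(samples, labels, Q, CN):
    """Train CN ISGNNs, each on Q samples drawn class-proportionally."""
    # Steps 1-2: group samples by class and compute each class proportion rate_t.
    by_class = defaultdict(list)
    for x, t in zip(samples, labels):
        by_class[t].append(x)
    rates = {t: len(xs) / len(samples) for t, xs in by_class.items()}

    ensemble = []
    for _ in range(CN):                       # step 6: repeat CN times
        subset = []                           # steps 3-4: stratified draw of size Q
        for t, xs in by_class.items():
            k = max(1, round(Q * rates[t]))
            subset.extend(random.sample(xs, min(k, len(xs))))
        random.shuffle(subset)                # insertion order shapes the tree
        ensemble.append(train_isgnn(subset))  # step 5: train one ISGNN
    return ensemble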
Once each ISGNN of the ensemble has been trained independently, the trained ensemble can predict the class of each testing sample (see the sketch below). A testing sample is input to the ensemble and CN classification results are obtained. If all CN ISGNNs produce the same result for the sample, the final classification is decided immediately. Otherwise, we find the winner among the CN SGNTs: the winner is the tree whose leaf neuron lies at the minimum distance from the testing sample, and the sample is assigned the winner's class. Finally, the class of every sample in the testing set is predicted by the ensemble and the classification precision on the testing set is calculated.
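A sketch of this decision rule, under the assumption (ours, not spelled out in the paper) that every leaf neuron stores the class label of its training sample in a .label attribute:

import numpy as np

def nearest_leaf(node, x):
    """Return (distance, leaf) for the leaf of this SGNT closest to sample x."""
    if not node.children:
        return np.linalg.norm(node.weight - x), node
    return min((nearest_leaf(c, x) for c in node.children),
               key=lambda pair: pair[0])

def predict(ensemble, x):
    """Ensemble decision rule: unanimity, else the nearest-leaf winner."""
    results = [nearest_leaf(tree, x) for tree in ensemble]
    votes = [leaf.label for _, leaf in results]   # .label: assumed bookkeeping
    if len(set(votes)) == 1:                      # all CN ISGNNs agree
        return votes[0]
    _, winner = min(results, key=lambda pair: pair[0])  # minimum distance wins
    return winner.label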
5. Experiments

We chose 61 sampled enterprises as experimental objects to compare the FDTD performance of ensemble ISGNN and SGNN. These 61 enterprises are representative of the retail business of Qingdao city, China: their business areas range from dozens of square meters to nearly 30,000 square meters, their registered capital ranges from hundreds of thousands of RMB to over 230 million RMB, and their staff counts range from several to several hundred people. The financial information of each enterprise, such as registered capital, business area, number of staff, total amount of value-added tax to be paid monthly, etc., composes the attribute set of one sample. Every sample has 71 attributes as its input vector and one attribute as its output. The 61 enterprises are divided into two sets: the first 30 enterprises are training samples; the other 31 are testing samples.

Precision of fraud detection is defined as

$$\mathrm{Precision} = \frac{M - \mathrm{Err}}{M} \times 100\% \qquad (2)$$

where Err is the number of wrongly classified samples and M is the number of testing samples; here M = 31.

Whether a classification result is correct is judged as follows:

if $|\hat{y}_i - y_i| \le a$, then the classification result is correct, $\qquad (3)$

where $y_i$ is the real value of the i-th sample, expressed as 1 or 0 (1 denotes that the enterprise is legitimate, 0 that it is illegitimate); $\hat{y}_i$ is the output of the ensemble ISGNN for the i-th sample; and a is a constant whose range is 0 to 0.5, here a = 0.4.
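Equations (2) and (3) fold into a few lines of Python; this sketch assumes the 0/1 class encoding described above:

def precision(y_true, y_pred, a=0.4):
    """Eq. (2): (M - Err)/M * 100, with Eq. (3) as the correctness test:
    a prediction y_hat counts as correct when |y_hat - y| <= a."""
    M = len(y_true)
    Err = sum(1 for y, y_hat in zip(y_true, y_pred) if abs(y_hat - y) > a)
    return (M - Err) / M * 100.0

# 30 of the 31 testing samples correct gives (31 - 1)/31 * 100 = 96.7742%.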
Following the procedure of Section 4.2, the FDTD result of ensemble ISGNN was obtained; the FDTD result of SGNN was obtained using the algorithm proposed by Wang et al. [1]. Table 1 lists the FDTD results of ensemble ISGNN and SGNN.

In the experiment, one third of the training data was employed to train each ISGNN, and there are 3 ISGNNs in the ensemble. The experimental results show that the classification precision of ensemble ISGNN is 96.7742% on the 31 testing samples, 3.22 points higher than that of SGNN, while the number of training samples used per ISGNN is one third that of SGNN. Ensemble ISGNN is thus efficient for FDTD; however, it consumes more time than SGNN.

Table 1. Performance evaluation of ensemble ISGNN and SGNN on the tax dataset

                  Precision (%)   Err   Number of training samples   Time consumed (sec)
Ensemble ISGNN    96.7742         1     10                           0.7110
SGNN              93.5484         2     30                           0.1875

6. Conclusion

In this paper we propose a new method that uses a neural network ensemble of ISGNNs to solve the FDTD problem, and we compare ensemble ISGNN with SGNN on the financial data of 61 sampled enterprises. Experimental results show that ensemble ISGNN is effective: its classification precision is 3.22 points higher than that of SGNN on the 31 testing samples, and the number of training samples per ISGNN is one third that of SGNN. However, the classification precision of ensemble ISGNN is unsteady because the training samples are selected randomly.

Acknowledgement

This work was partially supported by the Natural Science Fund of the Department of Education of Shaanxi Province, China under Grant No. 07JK314, and by the Science Research Fund of Xi'an University of Science and Technology.

References

[1] Wang, S.W., Li, A.G. Fraud detection in tax declaration using SGNN. Journal of Xi'an University of Science and Technology, 2004, 24(4):470-474.
[2] Inoue, H., Narihisa, H. Efficient Pruning Method for Ensemble Self-Generating Neural Networks. Journal of Systemics, Cybernetics and Informatics, 2003, 1(6):72-77.
[3] Li, A.G., Yong, H., Li, Z.H. Iteration Learning SGNN. Proc. of IEEE ICNN&B'05, Beijing, 2005, pp. 1912-1916.
[4] Wen, W., Jennings, A., Liu, H. Learning a neural tree. Proc. of Int'l Joint Conf. on Neural Networks, Beijing, 1992, pp. 751-756.
[5] Inoue, H., Narihisa, H. Efficiency of Self-Generating Neural Network Applied to Pattern Recognition. Mathematical and Computer Modelling, 2003, 38(11-13):1225-1232.
[6] Inoue, H., Fukunaga, Y., Narihisa, H. Efficient Hybrid Neural Network for Chaotic Time Series Prediction. IEICE Transactions on Information and Systems, 2002, J85-D-II(4):689-694.
[7] Hansen, L.K., Salamon, P. Neural network ensembles. IEEE Trans. Pattern Analysis and Machine Intelligence, 1990, 12(10):993-1001.
[8] Inoue, H., Narihisa, H. Improving Generalization Ability of Self-Generating Neural Networks Through Ensemble Averaging. Proc. of 4th SICE Symposium on Decentralized Autonomous Systems, SICE, Okinawa, Japan, 2000, pp. 177-180.
[9] Zhou, Z.H., Chen, S.F. Neural Network Ensemble. Chinese Journal of Computers, 2002, 25(1):1-8.
