Professional Documents
Culture Documents
Random Forest
Du Shaohui1 GuanWen Qiu1
Saint-Petersburg State University University of California, Irvine
Saint-Petersburg State, Russia California, United State
1641449599@qq.com guanwenq@uci.edu
Hongjun Yu3
Huafeng Mai2
Beijing Foreign Studies University
University of arizona
Beijing, China
Arizona, United State
yuhongj un_apply @163.com.
huafengmai@email.arizona.edu
Abstract— In the evolution of the electronic money system, behavior patterns of real users and fraud behavior patterns
frequent transaction fraud has been a shadow behind the in the data, so as to help the subsequent classification
prosperity. It not only endangers the property security of users, model better achieve the purpose of detecting fraudulent
but also hinders the development of digital finance in the world. transactions. In [3], Fang et al. Proposed a framework for
With the development of data mining and machine learning, fraud detection based on CNN to capture the inherent
some mature technologies are gradually applied to the detection patterns of fraud learned from the marked data. Wang [5]
of transaction fraud. This paper proposes a transaction fraud proposed a data mining algorithm based on UCI public
detection model based on random forest. The experimental dataset. Bhusari and Patil proposed a hidden Markov
results of IEEE CIS fraud dataset show that the method of this
model [10] to help obtain high fraud coverage and low
model is better than the benchmark model, such as logistic
false alarm rate. The research of work [11] mainly focuses
regression, support vector machine. Finally, the accuracy of our
on the most important feature engineering part of data
model reached 97.4%, and the AUC ROC score was 92.7%.
mining. Researchers extend the transaction aggregation
Keywords- Fraud detection, Data mining, Online Transaction strategy, and propose to create a new set of features by
using von Mises distribution on the basis of analyzing the
I. In t r o d u c t io n periodic behavior of transaction time.
In recent years, the number of e-commerce users is However, these methods have some defects, such as
increasing day by day, and the scale of online transactions insufficient data or limited feature processing. In reference
is also expanding. Fraud criminals often use various [5], due to the limited amount of data, it is difficult to
channels to steal card information and transfer a large establish a robust and accurate fraud detection system for
amount of money in the shortest time, causing a lot of practical scenarios. In reference [12], it only considers
property losses to users and banks. Fraudulent transactions financial data, which may not be enough for a good data
are usually small probability events hidden in a large mining algorithm.
amount of data, and the transaction form is extremely
flexible. So, machine learning and data mining are useful B. Our Contribution
to build detection system of fraudulent transactions. The Referring to the fraud transaction model based on
core detection algorithms of the system are mainly based xgboost [13], this paper proposes a fraud detection
on classification [1-5]. In order to face a large amount of algorithm suitable for IEEE-CIS fraud data set based on
data, data mining related processing methods are random forest [13]. Xgboost [12-21] is an improved
introduced into transaction fraud task [6-9]. In this paper, gradient boost model. The data set contains more than 1
we establish a fraud detection system based on the million samples, each sample contains more than 400
classification model of random forest and the data characteristic variables, including financial characteristics
processing related to feature engineering. and non-financial characteristics. Sufficient amount of data
and data type will provide a reliable basis for the
A. Related Work subsequent classification model. O f course, the processing
Data mining is a process of exploring information process is more complex, so we need to clean the data
hidden in a large number of data through algorithms. carefully and select the most valuable features carefully. In
Through the cleaning, correction, extraction, selection and our work, we first carry out the truth of the data to
summary of a large number of data features, the hidden eliminate some outliers and excessive missing data. In the
knowledge behind the data is obtained. For our problem, it further feature engineering cycle, the data will be
is to extract the difference information between the transformed and the statistical data such as maximum,
Authorized licensed use limited to: National Inst of Training & Indust Eng - Mumbai. Downloaded on August 12,2021 at 08:03:33 UTC from IEEE Xplore. Restrictions apply.
mean and standard deviation will be extracted. Then, TABLE 1 Tr a n s a c t io n Ta b l e
Recursive feature elimination (RFECV) is used to
eliminate some unimportant features. Name Description Type
145
Authorized licensed use limited to: National Inst of Training & Indust Eng - Mumbai. Downloaded on August 12,2021 at 08:03:33 UTC from IEEE Xplore. Restrictions apply.
TABLE 3 Performance of different models
IV. Ex p e r im e n t s
V. Co n c l u s io n s
In order to show the superiority of the model, we In this paper, we propose a new approach utilizing
compared the accuracy and AUC ROC score with other random forest to detect fraud based on IEEE-CIS Fraud
models. AUC ROC score is actually the area under the dataset. Just like all machine learning pipeline, data
receiver operating characteristic curve, which is created by preprocessing especially feature extraction is critical to the
drawing the relationship between true positive rate (TPR) performance of our model. By using RFECV we can
and false positive rate (FPR) under different threshold reduce many features that are redundant or highly
settings. The formulas for TPR and FPR are defined as correlated which can easily bias our model. It is shown that
follows: the random forest performs better on this dataset when
comparing to such of SVM and Logistic Regression. We
TP
TPR = (1) infer that this is probably caused by the extreme class
TP + F N unbalance inherent to the IEEE-CIS Fraud data.
FP a c k n o w l ed g em en t
FPR = (2)
FP + TN The author Du Shaohui and co-author GuanWen Qiu
TP was the true positive prediction, FN was the false thanks Huafeng Mai for help in experiments, Hongjun Yu
negative prediction, FP was the false positive prediction, for idea contribution on feature engineering and Cai Heng
and TN was the true negative prediction. for meaningful discussions.
The configuration matrix corresponding to the model Re f e r en c es
can also be obtained (Fig. 2). As can be seen from table 3,
[1] Duan, L., Xu, L., Liu, Y., & Lee, J. (2009). Clusterbased outlier
the random forest model is superior to the other two detection. Annals of Operations Research, 168(1), 151-168.
models in terms of AUC ROC score and accuracy.
[2] Minastireanu, E. A., & Mesnita, G. (2019). Light gbm machine learning
algorithm to online click fraud detection. J. Inform. Assur. Cybersecur,
2019.
[3] Fang, Y., Zhang, Y., & Huang, C. Credit Card Fraud Detection Based
on Machine Learning.
[4] Maes, S., Tuyls, K., Vanschoenwinkel, B., & Manderick, B. (2002,
January). Credit card fraud detection using Bayesian and neural
146
Authorized licensed use limited to: National Inst of Training & Indust Eng - Mumbai. Downloaded on August 12,2021 at 08:03:33 UTC from IEEE Xplore. Restrictions apply.
networks. In Proceedings o f the 1st international naiso congress on
neuro fuzzy technologies (pp. 261- 270).
[5] Wang, M., Yu, J., & Ji, Z. (2018). Credit Fraud Risk Detection Based on
XGBoost-LR Hybrid Model.
[6] Dhingra, S. (2019). Comparative Analysis o f algorithms for Credit Card
Fraud Detection using Data Mining: A Review. Journal o f Advanced
Database Management & Systems, 6(2), 12-17.
[7] Minastireanu, E. A., & Mesnita, G. (2019). Light gbm machine learning
algorithm to online click fraud detection. J. Inform. Assur.
Cybersecur, 2019.
[8] Bhusari, V., & Patil, S. (2016). Study of hidden markov model in credit
card fraudulent detection. In 2016 World Conference on Futuristic
Trends in Research and Innovation for Social Welfare (Startup
Conclave) (pp. 1-4). IEEE.
[9] Bahnsen, A. C., Aouada, D., Stojanovic, A., & Ottersten, B. (2016).
Feature engineering strategies for credit card fraud detection. Expert
Systems with Applications, 51, 134-142.
[10] Carneiro, N., Figueira, G., & Costa, M. (2017). A data mining based
system for credit-card fraud detection in e-tail. Decision Support
Systems, 95, 91-101.
[11] Zhang, Y. , Tong, J. , Wang, Z. , & Gao, F. . (2020). Customer
Transaction Fraud Detection Using Xgboost Model. 2020 International
Conference on Computer Engineering and Application (ICCEA).
[12] Torlay L , Perrone-Bertolotti M , Thomas E , et al. Machine learning-
XGBoost analysis of language networks to classify patients with
epilepsy[J]. Brain Informatics, 2017.
[13] Ji X , Tong W , Liu Z , et al. Five-Feature Model for Developing the
Classifier for Synergistic vs. Antagonistic Drug Combinations Built by
XGBoost[J]. Frontiers in Genetics, 2013, 10.
[14] Yin Y , Sun Y , Zhao F , et al. Improved XGBoost model based on
genetic algorithm[J]. International Journal of Computer Applications in
Technology, 2020, 62(3):240.
[15] Ren X , Guo H , Li S , et al. A Novel Image Classification Method with
CNN-XGBoost Model[C]// International Workshop on Digital
Watermarking. 2017.
[16] Zhang D, Qian L, Mao B, et al. A data-driven design for fault detection
o f wind turbines using random forests and XGboost[J]. IEEE Access,
2018, 6: 21020-21031.
[17] Gumus M, Kiran M S. Crude oil price forecasting using
XGBoost[C]//2017 International Conference on Computer Science and
Engineering (UBMK). IEEE, 2017: 1100-1103.
[18] Chen T, Guestrin C. Xgboost: A scalable tree boosting
system[C]//Proceedings of the 22nd acm sigkdd international
conference on knowledge discovery and data mining. 2016: 785-794.
[19] Zhong J, Sun Y, Peng W, et al. XGBFEMF: An XGBoost-based
framework for essential protein prediction[J]. IEEE Transactions on
NanoBioscience, 2018, 17(3): 243-250.
[20] Zamani Joharestani M, Cao C, Ni X, et al. PM2. 5 Prediction based on
random forest, XGBoost, and deep learning using multisource remote
sensing data[J]. Atmosphere, 2019, 10(7): 373.
[21] Torlay L , Perrone-Bertolotti M , Thomas E , et al. Machine learning-
XGBoost analysis of language networks to classify patients with
epilepsy[J]. Brain Informatics, 2017.
147
Authorized licensed use limited to: National Inst of Training & Indust Eng - Mumbai. Downloaded on August 12,2021 at 08:03:33 UTC from IEEE Xplore. Restrictions apply.