
Computational Economics

https://doi.org/10.1007/s10614-020-10090-6

Using Machine Learning Approach to Evaluate the Excessive Financialization Risks of Trading Enterprises

Zhennan Wu1

Accepted: 30 December 2020


© The Author(s), under exclusive licence to Springer Science+Business Media, LLC part of Springer Nature 2021

Abstract
As Internet technology develops and spreads widely, using the Internet for finan-
cial management has become a new type of financial technology. Despite Internet
finance’s convenience and profits, severe financial risks, such as chaotic reputation
management, bad loans, and malicious deception, also appear. Hence, in order to
enhance the ability of trading financial enterprises to respond to over-financiali-
zation risks, the machine learning algorithms are utilized to build a decision tree
model, a random forest model, and a gradient boosting model; the average fusion
method is utilized to build a fusion control model. The performances and risk pre-
diction indicators of the proposed algorithm under the models mentioned above are
analyzed. Finally, by analyzing a trading enterprise’s loan data within six months,
the optimized risk control model’s actual impacts are evaluated. The results show
that the support vector machine (SVM) is trained more quickly than the other
models when the data set is in the smaller range of 1G-5G, averaging 20 min. The
fusion model (FM) consumes less time when the data set is in the broader range of
5G-30G, averaging 35 min. Different models have unique advantages in different
respects; the precision, recall rate, and accuracy of the fusion algorithm are higher,
at 79.35%, 39.28%, and 78.28%, respectively. The precision of the random forest
algorithm (RFA), 72.48%, is also high. The performance of the risk
control model is improved through model fusion, in an effort to improve the ability
of trade finance enterprises to withstand financial loan risks, which provides a refer-
ence for the risk control of financial enterprises.

Keywords  Big data · Data mining · Machine learning · Over-financialization · Financial risk

* Zhennan Wu
245717326@qq.com
1 School of International Relations, Renmin University of China, Beijing, China


Content courtesy of Springer Nature, terms of use apply. Rights reserved.



1 Introduction

Currently, Internet finance has flourished, showing various operating mechanisms


and business models (Fuyong 2016). Financial institutions can break through time
and geographical constraints, speed up business processing through Internet
technology, and provide faster financial services on the Internet for customers with
financing needs (Gomber et al. 2018). However, the rise of Internet finance has also
created problems such as credit risk and user fraud, making the improvement of risk
control capabilities via a risk assessment system a necessity (Norris et al. 2019). In
the “big data era,” how to use data to meet actual business needs and improve business
efficiency and accuracy is a problem that business personnel have explored (Yang
et al. 2018). Therefore, establishing a suitable enterprise risk assessment system to
predict lenders’ credit risks is the most crucial task for the development of trad-
ing enterprises (Tupa et al. 2017). Internet finance provides much convenience for
people’s lives, but fraud risks and fraud incidents have also increased with Internet
finance development. Fraud methods are endless, such as phone fraud, Trojan horse
viruses, and phishing websites. Internet financial fraud involves enormous money
and victims. Besides, Internet financial frauds have formed a complete industry
chain, with the characteristics of large scale, strong action, and high coordination.
Under the accelerated application of big data, criminal acts permeate all aspects of
the Internet financial business, earning hundreds of billions of CNY in finance every
year (Alsayed and Bilgrami 2017; Campus 2018). Therefore, studying the risk of
excessive financialization of enterprises is also an important research direction of
risk control in Internet finance.
Before the advent of the big data era, the risk control model of enterprises
has played a role in credit management; however, due to the insufficient understand-
ing of the data, its accuracy is low, making it impossible to evaluate the borrowers
comprehensively (Florio and Leoni 2017). As the era of big data is approaching,
the problems of single dimensionality and limited evaluation capabilities of tradi-
tional risk control models have gradually been exposed with the increase of data
dimensions. Research methods using big data for mining have gradually been applied
to the financial field. As a critical link of the industry, financial risk models have
also become a key research direction (Florio and Leoni 2017). The advantage of big
data is that the data are sufficient for analysis, involving a wide range of dimensions.
With the establishment of the cloud computing platform, big data’s distributed com-
puting power has been significantly improved; meanwhile, the data foundation and
computing power are more efficient. Therefore, models built on big data technol-
ogy, which have significant quantity and accuracy changes, are more effective than
the traditional risk control models (Hasan et  al. 2019; Ding et  al. 2019). Big data
improve the credit information system and help trade enterprises reduce credit risks
(Ivanov et al. 2019; Urbinati et al. 2019). Big data utilizes network data to improve
the scoring model and optimize the approval process by integrating algorithms;
eventually, it forms a sustainable closed-loop management method (Saura et  al.
2019). Therefore, big data technology applies to the financial field’s risk assessment,
which has become one of the research hot spots in this field.


2 Related Works

The risk control model concept has been put forward very early, and its function
is to provide early warning of future risks through data analysis. When the Inter-
net industry was emerging, financial institutions, such as peer-to-peer (P2P) and
microfinance, have entered the market as supplements to the traditional financial
industry (Aldridge 2019). Most of the initial Internet financial products were
offline products moved online. They were not much different from traditional
products, except for lower quotas, looser restrictions, and more flexible terms and
repayment methods (Yue et al. 2016). If the risk control link has not been valued,
the situation of the “market ahead of risk control” will appear, and the overall
overdue rate and bad debt rate of the industry will far exceed that of banks (Chen
et al. 2019a, b, c). Under such circumstances, risk control has gradually attracted
the attention of people from all walks of life, which has also become the most
significant bottleneck affecting the development of the networked financial indus-
try. Financial risk control under big data employs data analysis and models for
risk assessment; based on the evaluation scores, it predicts the repayment abil-
ity, willingness to repay, and fraud risk of the lenders, thereby using scientific
risk prevention and control (Zhang et al. 2020a, b). In the era of big data, loads
of data are combined with data processing methods. Big data can improve the
credit reporting system, help financial institutions provide financial products, and
reduce credit risks. Lv et  al. (2019) used big data processing in WebVRGIS’s
BIM storage to improve the utilization of various energy sources, reduce energy
waste, and reduce pollutant emissions (Abra et  al. 2018). Big data technology
uses massive data for calculation, analyzes the inherent laws of information,
improves the scoring model, and optimizes the approval. After effective rules are
obtained, big data further optimizes the scoring model and the approval, forming
a virtuous circle.
Machine learning is an application of artificial intelligence, enabling the sys-
tem to automatically learn and improve from experience without explicit program-
ming (Gu et  al. 2020). The utilization of big data mining and machine learning
technology can solve the risk problems of trading enterprises. Zhu et al. (2019)
proposed an enhanced hybrid integration algorithm that significantly improved
the accuracy of credit risk predictions in small- and medium-sized enterprises
by combining the classic random space algorithm and MultiBoosting algorithm
(Zhu et al. 2019). Yang (2020) built a variability model adapted to early warning
and control of financial risks, minimized the risks of financial enterprises, and
maximized the enterprises’ benefits based on algorithms of big data and machine
learning (Yang 2020). Kim et al. (2019a, b) proposed a machine learning finan-
cial risk detection and classification method based on feature selection, which
reduced enterprises’ financial risks by 10% (Kim et al. 2019a, b). Gulsoy and
Kulluk (2019) proposed an objective risk measurement method for the customer
loan process of small and medium-sized enterprises; besides, they performed the
classification task of data mining on data collected from banks’ current customer
credit evaluation processes (Gulsoy and Kulluk 2019). The above analysis shows


that most scholars are currently using different machine learning algorithms to
build enterprise risk control models. Due to the diversification of enterprise risk
control, enterprise risk control models built from the perspective of combining
data mining and machine learning are rarely reported.
Therefore, network financial data are collected and processed based on previous
works with blockchain technology. An enterprise risk control model under differ-
ent algorithm conditions is constructed from data mining and machine learning per-
spectives. Afterward, the original model parameters are optimized. The differences
among the data mining and machine learning algorithms are analyzed by substituting
trading enterprises’ financial loan data. Finally, all risk control models are evaluated
comprehensively. The results will provide a theoretical basis for solving the current problem of
over-financialization in trade-oriented enterprises.

3 Methods

3.1 Traditional Risk Control Model of Trading Financial Enterprises

In China, the risk control of financial enterprises initiates from the traditional
banking industry through manual methods. During the risk control process, cor-
responding standards are formulated to rate different customers. The risk control
of financial loans is mainly an accurate credit evaluation of loan users. The spe-
cific risk control model is shown in Fig. 1. From an external perspective, the loan
users are restricted and regulated from four aspects: user information input, finan-
cial rule formulation, risk monitoring, and risk management. The specific loan
amount and repayment time are determined by the contract, report, and specific
practice process. The traditional financial risk control has the problems of less
data, narrow scope, and small dimensions.

Fig. 1  Traditional risk control model of trading financial enterprises (input and risk identification, risk-based policy development, monitoring, and risk management activities, linked through contract acquisition and financial and operational reporting)

This requires a combination of data
mining and machine learning technologies. The subsequent algorithms are based
on the traditional risk control models, and only the corresponding data processing
methods are different.

3.2 Construction of Risk Control Model Based on Machine Learning Algorithms

Standard machine learning algorithms include a support vector machine (SVM)


and logistic regression. Different machine learning algorithms are utilized to build
the risk control models based on the traditional risk control model. (1) SVM is a
supervised learning model accompanied by learning algorithms. This algorithm is
utilized for data classification and regression analysis. Figure 2 shows the structure
of the SVM model. The SVM model maps the space points so that each category
is divided by different gaps as much as possible. This model then maps these
points to another space to predict the category to which they belong according to
the position where they fall into the gap. During training, each sample in the data
set is denoted as D_i = (x_i, y_i). The specific calculations are as follows:
$$\min_{w,b}\ \frac{1}{2}\|\omega\|^2 \tag{1}$$

$$\text{s.t.}\quad y_i(\omega \cdot x_i + b) - 1 \ge 0 \tag{2}$$
where s.t. means subject to the stated constraint, ω is the normal vector, x_i is the
feature vector of the i-th sample, b is the intercept, and y_i is used as the
classification marker to supervise the training. For each inequality constraint, a
Lagrange multiplier α_i ≥ 0, i = 1, 2, 3 … N is introduced. The specific function is as
follows:

Fig. 2  Structure of the SVM model (two classes in the x1–x2 plane separated by a hyperplane, with margin width 2/|w| and offset b/|w|)


$$L(\omega, b, \alpha) = \frac{1}{2}\|\omega\|^2 - \sum_{i=1}^{N} \alpha_i \left[ y_i(\omega \cdot x_i + b) - 1 \right] \tag{3}$$

Through derivation, the following will be obtained:


$$\frac{\partial L}{\partial \omega} = \omega - \sum_{i=1}^{N} \alpha_i y_i x_i \tag{4}$$

$$\frac{\partial L}{\partial b} = -\sum_{i=1}^{N} \alpha_i y_i \tag{5}$$

After substituting it into the Lagrangian function, the optimization function of


SVM is as follows. Through this operation, the algorithm of machine learning is
completed:
$$\min_{\alpha}\ \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) - \sum_{i=1}^{N} \alpha_i \tag{6}$$

$$\text{s.t.}\quad \sum_{i=1}^{N} \alpha_i y_i = 0, \qquad \alpha_i \ge 0,\ i = 1, 2, \dots, N \tag{7}$$
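As an illustration of the classifier described above, the following minimal sketch trains a linear SVM with scikit-learn on synthetic data; the feature meanings and the toy labeling rule are assumptions for illustration only, not the paper’s actual data set or pipeline.

```python
# Minimal sketch (synthetic data, invented feature meanings): fitting a
# linear SVM, whose solver addresses the dual problem of Eqs. (6)-(7).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))                    # e.g. income, debt ratio, age, history
y = (X[:, 0] - 0.5 * X[:, 1] > 0).astype(int)    # 1 = loan not overdue (toy rule)

clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)
print(clf.score(X, y))                           # training accuracy
```

For a linear kernel, the fitted `clf.coef_` corresponds to the normal vector ω and `clf.intercept_` to the intercept b of Eqs. (1)-(2).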

(2) Logistic regression algorithm (LRA) is a machine learning model for classifica-
tion problems. It is a predictive analysis algorithm based on the concept of prob-
ability. Figure 3 shows the structure of the logistic regression algorithm. As a clas-
sification model, the conditional probability distribution represents the specific
information (Kim et al. 2019a, b) of users under different conditions. The model is
defined as:

$$P(Y = 1 \mid x) = \frac{e^{(w^{T} \cdot x + b)}}{1 + e^{(w^{T} \cdot x + b)}} \tag{8}$$

Fig. 3  Structure of the LSA model (the sigmoid S-curve maps the weighted sum to a probability, with a threshold value separating the two classes)


$$P(Y = 0 \mid x) = \frac{1}{1 + e^{(w^{T} \cdot x + b)}} \tag{9}$$

where P(Y|x) represents the conditional probability distribution, x = (x_1, x_2 …
x_n) is the independent variable related to the users in the risk control model, the
superscript T denotes the transpose, z = w^T·x + b, and y is generally 0 or 1 to
indicate that the loan of a user is not overdue or overdue. By introducing the
Sigmoid function, the range of h(z) is (0, 1):

$$h(z) = \frac{1}{1 + e^{-z}} \tag{10}$$
By substituting the weight θ = (w, b) into the above equation, the probability
h_θ(x) after the weighted sum of each feature value of the sample is:

$$P(y = 1 \mid x; \theta) = h_{\theta}(x) \tag{11}$$

$$P(y = 0 \mid x; \theta) = 1 - h_{\theta}(x) \tag{12}$$

They are integrated into:

$$P(y \mid x; \theta) = h_{\theta}(x)^{y} \left(1 - h_{\theta}(x)\right)^{1-y} \tag{13}$$


For all samples, the probability of occurrence is:
$$L(\theta) = \prod_{i=1}^{m} P(y^{(i)} \mid x^{(i)}; \theta) = \prod_{i=1}^{m} h_{\theta}(x^{(i)})^{y^{(i)}} \left(1 - h_{\theta}(x^{(i)})\right)^{1-y^{(i)}} \tag{14}$$

The log-likelihood value is:


$$l(\theta) = \log L(\theta) = \sum_{i=1}^{m} \left( y^{(i)} \log h_{\theta}(x^{(i)}) + (1 - y^{(i)}) \log\left(1 - h_{\theta}(x^{(i)})\right) \right) \tag{15}$$

Through this calculation method, the solution of the equation is finally obtained.
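The sigmoid and log-likelihood calculations above can be sketched numerically as follows; the sample matrix, labels, and weights are invented purely for illustration.

```python
# Sketch of Eq. (10) (sigmoid) and Eq. (15) (log-likelihood); toy values.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))                         # Eq. (10)

def log_likelihood(w, b, X, y):
    h = sigmoid(X @ w + b)                                  # h_theta(x) per sample
    return np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))  # Eq. (15)

X = np.array([[0.5, 1.2], [-1.0, 0.3], [2.0, -0.7]])
y = np.array([1, 0, 1])
w = np.array([0.8, -0.2])
print(log_likelihood(w, 0.1, X, y))    # a single negative scalar
```

Maximizing this log-likelihood over (w, b), e.g. by gradient ascent, yields the fitted logistic regression model.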

3.3 Construction of Risk Control Model Based on Data Mining Algorithms

Standard data mining algorithms include the decision tree algorithm


(DTA), random forest algorithm (RFA), and light gradient boosting machine
(LGBM). Different data mining algorithms are utilized to build the risk control
models based on the traditional risk control model. (1) DTA uses a decision tree
to create a training model. This model learns from previous data (training data)
to infer simple decision rules that predict the target variable’s class or value
(Chanmee and Kesorn 2020). Figure  4 shows the structure of the DTA model.
The DTA model classifies targets mainly according to the gains. The more feature
information the general data has, the greater the gain is, and the stronger the clas-
sification capability of the algorithm is. The specific equation is as follows:


Fig. 4  Structure of the DTA model (a root node splits into decision nodes, which terminate in leaf nodes)

$$H(X) = -\sum_{i=1}^{m} p_i \log p_i \tag{16}$$

where H(X) is the DTA’s entropy value, and p_i is the probability that the random
variable takes its i-th value. Under given conditions, the mathematical expectation
of the conditional probability distribution of Y given X is calculated as follows:
$$H(Y \mid X) = \sum_{i=1}^{m} p_i H(Y \mid X = x_i) \tag{17}$$

The gain P(D, A) of feature A to training set D is defined as:

$$P(D, A) = H(D) - H(D \mid A) \tag{18}$$
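Equations (16)-(18) can be checked with a small numerical sketch; the toy label array and the partition induced by a feature are invented for illustration.

```python
# Sketch of entropy (Eq. 16) and information gain (Eqs. 17-18) with numpy.
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))                          # Eq. (16), base-2 log

def information_gain(labels, groups):
    m = sum(len(g) for g in groups)
    h_cond = sum(len(g) / m * entropy(g) for g in groups)   # Eq. (17)
    return entropy(labels) - h_cond                         # Eq. (18)

y = np.array([1, 1, 1, 0, 0, 0, 1, 0])
split = [y[:4], y[4:]]                 # partition induced by one feature
print(information_gain(y, split))      # about 0.189
```

A feature that produces purer partitions yields a larger gain, and the DTA prefers it when splitting a node.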

(2) RFA is a data mining method used for classification, regression, and other
tasks. It constructs many decision trees during training and predicts by analyzing
the mean of their outputs (Georganos et al. 2019; Schonlau and Zou 2020). Figure 5
shows the structure of the RFA model. This model is an integrated learning model
that combines multiple decision trees through Bagging, thereby obtaining a more
accurate and stable model prediction. An ordinary decision tree model usually
selects the optimal features from the n samples on a node to divide the tree.
However, the RFA model is usually constructed through cross-validation by
randomly selecting sample features on each node.
(3) LGBM is a gradient boosting framework based on decision trees. It has
the characteristics of being fast, distributed, and high-performance (Chen et al. 2019a,
b, c; Ustuner and Balik Sanli 2019). Within the Boosting family of models, it is an
efficient implementation of GBDT, like XGBoost. In principle, it is similar to GBDT
and XGBoost: both utilize the negative gradient of the loss function as the
residual approximation of the current decision tree to fit the new decision tree.
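The three tree-based models can be sketched with scikit-learn as follows. The data are synthetic, and scikit-learn’s GradientBoostingClassifier stands in for LGBM here, since both fit trees to the negative gradient of the loss; this is an illustrative sketch, not the paper’s actual pipeline.

```python
# Illustrative sketch: fitting the three tree-based models on synthetic data
# (GradientBoostingClassifier is used as a stand-in for LGBM).
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

rng = np.random.default_rng(42)
X = rng.normal(size=(300, 5))
y = (X[:, 0] + X[:, 1] ** 2 > 1).astype(int)        # nonlinear toy target

models = {
    "DTA": DecisionTreeClassifier(max_depth=4, random_state=0),
    "RFA": RandomForestClassifier(n_estimators=100, random_state=0),
    "GBM": GradientBoostingClassifier(n_estimators=100, random_state=0),
}
for name, model in models.items():
    model.fit(X[:225], y[:225])                     # 75% train / 25% test
    print(name, model.score(X[225:], y[225:]))      # test-set accuracy
```

The ensembles (RFA, GBM) typically generalize better than the single depth-limited tree on such nonlinear targets.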


Fig. 5  Structure of the RFA model (each tree in the forest predicts a class from a random feature subset; majority voting determines the final class)

Fig. 6  Risk control model based on fusion algorithm of data mining and machine learning (the predictions of the data mining algorithms DTA, RFA, and LGBM and the machine learning algorithms SVM and LSA are averaged into the fusion prediction)

4 Research Methodology and Model

4.1 Construction of Risk Control Model Based on Fusion Algorithm of Data


Mining and Machine Learning

Model fusion combines several basic models to form a new model. Here, model fusion
uses the average fusion method. As shown in Fig. 6, it is a fusion model (FM) based
on data mining and machine learning algorithms. The three data mining algorithms
and the two machine learning algorithms are given equal weights; that is, each
algorithm accounts for 20%. The average prediction of each model is summarized
as FM’s prediction, and the conclusion is drawn.
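The average fusion step can be sketched as follows; the five base models’ predicted default probabilities below are invented purely to show the equal 20% weighting.

```python
# Sketch of the average fusion method: equal 20% weights over five base
# models (DTA, RFA, LGBM, SVM, LSA). Probability values are invented.
import numpy as np

base_predictions = np.array([
    [0.82, 0.10, 0.55, 0.30],   # DTA
    [0.78, 0.15, 0.60, 0.25],   # RFA
    [0.85, 0.12, 0.58, 0.28],   # LGBM
    [0.70, 0.20, 0.50, 0.35],   # SVM
    [0.75, 0.18, 0.52, 0.33],   # LSA
])
weights = np.full(5, 0.2)              # each algorithm accounts for 20%
fused = weights @ base_predictions     # FM's averaged default probabilities
labels = (fused >= 0.5).astype(int)    # 1 = predicted default
print(fused, labels)
```

Averaging smooths out the individual models’ disagreements, which is why FM’s scores in the results below are more stable than any single model’s.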

4.2 Measurement Indicators of the Enterprise Risk Control Model

(1) Error rate (ER): the ratio of wrongly predicted samples to the total samples in the
model’s prediction results. Accuracy (A): the ratio of correctly classified samples
to the total samples in the model’s prediction results. Their equations are:
$$E(f \mid D) = \frac{1}{m} \sum_{i=1}^{m} \mathbb{I}\left(f(x_i) \ne y_i\right) \tag{19}$$

$$A(f \mid D) = \frac{1}{m} \sum_{i=1}^{m} \mathbb{I}\left(f(x_i) = y_i\right) = 1 - E(f \mid D) \tag{20}$$

where f represents the constructed model, D is the data set, m is the number of
samples, f(x_i) represents the predicted value of the sample, and y_i represents
the true result of the sample.

(2) Precision (P): the proportion of individuals predicted to default who actually
default. Recall rate (R): the proportion of actual defaulters who are correctly
predicted to default. The higher these values are, the more accurate the model is.
If the precision of default prediction is increased, the financial risks of trading
enterprises are resolved more effectively. The calculations are as follows:
$$P = \frac{TP}{TP + FP} \tag{21}$$

$$R = \frac{TP}{TP + FN} \tag{22}$$

$$F1 = \frac{2 \times P \times R}{P + R} \tag{23}$$
where TP is the number of true positives (defaults predicted correctly), FP is the
number of false positives, FN is the number of false negatives, and F1 is the
harmonic mean of the precision rate and recall rate, which combines the two
indicators.
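Equations (21)-(23) amount to the following short computation; the confusion-matrix counts are invented for illustration.

```python
# Sketch of Eqs. (21)-(23) with invented confusion-matrix counts.
def precision_recall_f1(tp, fp, fn):
    p = tp / (tp + fp)              # Eq. (21), precision
    r = tp / (tp + fn)              # Eq. (22), recall
    f1 = 2 * p * r / (p + r)        # Eq. (23), harmonic mean of P and R
    return p, r, f1

p, r, f1 = precision_recall_f1(tp=60, fp=15, fn=40)
print(p, r, f1)                     # precision 0.8, recall 0.6
```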


(3) Receiver operating characteristic (ROC) curve: In the signal detection theories,
the ROC curve is an analysis tool of the coordinate pattern. This curve sorts the
sample data according to the prediction results of the model. Then, it uses the
positive category prediction method to calculate the true positive rate and the false
positive rate. Finally, it obtains the characteristic curve (Michael et al. 2019; Smith
et al. 2019). The area under the curve (AUC) is the area under the ROC curve
surrounded by the curve and the coordinate. Its value will not be greater than 1.
The calculations are as follows:
$$TPR = \frac{TP}{TP + FN}, \qquad FPR = \frac{FP}{FP + TN} \tag{24}$$

$$AUC = \frac{1}{2} \sum_{i=1}^{m-1} (x_{i+1} - x_i)(y_i + y_{i+1}) \tag{25}$$

where m is the number of samples, TN is the number of true negatives, x_i
represents the false positive rate, and y_i represents the true positive rate.
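Equation (25) is the trapezoidal rule applied to the points of the ROC curve, as the following sketch shows; the (FPR, TPR) points are invented for illustration.

```python
# Sketch of Eq. (25): trapezoidal area under an ROC curve; toy points.
import numpy as np

def auc_trapezoid(fpr, tpr):
    fpr, tpr = np.asarray(fpr), np.asarray(tpr)
    # sum of trapezoid areas between consecutive ROC points
    return 0.5 * np.sum((fpr[1:] - fpr[:-1]) * (tpr[1:] + tpr[:-1]))

fpr = [0.0, 0.1, 0.4, 1.0]          # x_i, false positive rates
tpr = [0.0, 0.6, 0.9, 1.0]          # y_i, true positive rates
print(auc_trapezoid(fpr, tpr))      # about 0.825
```

A random classifier’s ROC lies on the diagonal and yields an AUC of 0.5; values closer to 1 indicate better ranking of defaulters above non-defaulters.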

5 Results and Discussions

5.1 Data Collection and Processing

Blockchain technology is utilized for data collection and processing. Blockchain


technology is decentralized and distributed. The public ledger cannot be tampered
with. Usually, the blockchain is jointly managed by a peer-to-peer network that
abides by the protocol used for communication and verification between nodes.
After recording, the data in any given block cannot be retroactively changed
unless most users have agreed (Cong and He 2019).
This technology is utilized to obtain data related to financial products on the
network. The specific process is shown in Fig.  7. It mainly includes four parts:
data source, data collection, data preprocessing, and data storage (Rathee et  al.
2019). In data collection and processing, it is necessary to match the data of the

Fig. 7  Blockchain-based data collection and preprocessing (data sources such as Torstatus, Ipinfor, and Bitcoincharts feed Java and Python data scrapers; the generated CSV files are preprocessed and stored in an Oracle 11.2 database)


financial enterprises and the users. After filtering the blockchain data, all the data
are positively matched according to some rules. Also, it is confirmed that each
piece of data is valid and effective. Generally, data preprocessing and processing
include missing value processing and outlier processing. The model cannot learn
the relationship between variables correctly when samples contain missing values,
which weakens the model’s generalization ability and leads to wrong predicted
values or classification results. Methods for missing value processing are deletion
and filling. If the amount of missing data is too large and it contains no particularly
important information, the missing data will be deleted directly. For a small
amount of missing data, the filling method is used. There are two commonly used
filling methods: one is to fill in with the mode, average or median, and the other is
to fill in with the model. Mode, mean, and median can fill in the missing values;
however, since the mean is affected by extreme values, the mode and median are
mostly used. In actual applications, according to the characteristics of the mode
and the median, the missing values of discrete variables are filled with the mode,
and the missing values of continuous variables are filled with the median. Data sets
of considerable size can be compressed without affecting the subsequent analysis.
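The filling strategy described above can be sketched with pandas; the column names and values are assumptions for illustration, not the paper’s actual schema.

```python
# Sketch of the imputation rule above: mode for discrete variables,
# median for continuous ones (column names are hypothetical).
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "education": ["college", "college", None, "high school"],  # discrete
    "monthly_spend": [1200.0, np.nan, 800.0, 15000.0],         # continuous
})
df["education"] = df["education"].fillna(df["education"].mode()[0])
# median rather than mean, because the mean is pulled by extreme values
df["monthly_spend"] = df["monthly_spend"].fillna(df["monthly_spend"].median())
print(df)
```

Note how the extreme value 15000.0 would distort a mean-based fill, while the median (1200.0) is unaffected.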
The collected data are utilized for model construction. Before mining the sam-
ple data and building the model, the original data are divided into a training set
(mainly for model construction) and a test set (mainly for model verification) to
facilitate the comprehensive measurement of trading enterprises’ financial risks,
where the training set accounts for 75% and the test set accounts for 25%.
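The 75%/25% split with balanced positive and negative samples can be sketched with scikit-learn’s stratified split; the arrays below are placeholders, not the paper’s data.

```python
# Sketch of the 75% / 25% train-test split; stratification keeps positive
# and negative samples balanced between the sets. Placeholder arrays.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(40).reshape(20, 2)        # placeholder feature matrix
y = np.array([0, 1] * 10)               # placeholder overdue labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)
print(len(X_train), len(X_test))        # 15 5
```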

5.2 Experimental Environment and Data Source

(1) Experimental environment: the operating system of the experimental computer


is Ubuntu 14.04 OS. The Spark framework is utilized for simulation, with
TensorFlow 1.4.0. The Spark on YARN operating mode is adopted; the
programming language is Python 3.6.3. The experimental computer hardware
includes a Graphics Processing Unit (GPU), an NVIDIA 1080Ti with 11G of video
memory; the internal memory is 16G.
(2) Data source: the data for model verification come from a trading enterprise A
founded in 2017. The data for the verification of the risk control model comes
from the cooperation projects of enterprise A with Chinese banks. The data
mainly include the user loan records of a particular product, which contain
necessary personal information and consumption data. These data cover a wide
range of areas and contain many dimensions, which are generally applicable.
Therefore, they are the most relevant data for building a risk control model. Strict
confidentiality measures for the personal data of users are adopted.


Model verification data includes four aspects of information. (1) The user’s
personal information includes name, age, ID number, phone number, and educa-
tional background. Personal information contains a total of 20,000 pieces of data
and 4,256 features. (2) The user’s consumption information, including user name,
membership level, number of favorable comments, and monthly consumption
amount on major online shopping malls such as Taobao, JD, and PDD, totaling
20,000 pieces of data and 5269 features. (3) The enterprise information, includ-
ing users applying for loans, Internet loan enterprises, the number of credit cards,
and the number of property insurance, totaling 20,000 pieces of data and 3658
features. (4) The communication operator information, including available bal-
ance, available points, and monthly call charges, totaling 20,000 pieces of data
and 5128 features. The total number of involved features is 18,311, and the total
number of sample data pieces is 80,000. These data mainly come from consumption
data from May to December 2019. When the data are segmented, the positive
and negative samples of the two sets should be balanced to prevent the model from
overfitting and losing generalization ability. For the training set, the Logistic
Regression library in Sklearn is imported, and the default parameters of the logistic
regression algorithm are used to build the model in Jupyter Notebook.

5.3 Training Performance Evaluation of Different Risk Control Models Based


on Data Mining and Machine Learning

Figure 8 shows the training time of different risk control models based on data mining
and machine learning. As shown in the figure, different models have different
advantages on different data sets. When the data set size is small (1G-2G), the
training times of the models differ little. When the data set’s size is 2G-5G, the
training times of FM and LRA are longer, while that of the LGBM is shorter. When
the data set size is 5G-10G, the training times of FM and LGBM are prolonged.
When the data set size is 10G-30G, the training times of the various models tend
to be consistent. The SVM remains stable over the entire data set; the fluctuation
of its training time is not very large. When the data set is smaller, SVM obtains the
best results; when the data set is larger, FM’s training time is shorter.

Fig. 8  Results of training time of different data mining and machine learning risk control models (training time in minutes of SVM, LSA, DTA, RFA, LGBM, and FM for data set sizes from 1G to 30G)


Fig. 9  Accuracy and precision of different data mining and machine learning risk control models

Fig. 10  Recall rate results of different data mining and machine learning risk control models (recall of SVM, LSA, DTA, RFA, LGBM, and FM on the training set, test set, initial model, and tuned model)

5.4 Performance Evaluation of Different Data Mining and Machine Learning Risk


Control Models

Figure  9 shows the accuracy and precision of different risk control models
based on data mining and machine learning. As shown in the figure, in the train-
ing set, SVM, FM, and LGBM models’ accuracy is high. However, it decreases
significantly in the test set. The parameters of the original model are optimized
to improve the accuracy of each model. Compared with the original model, the
SVM model’s accuracy is improved significantly, while the LSA and LGBM models
have little change. This suggests that the SVM model based on parameter
optimization has excellent accuracy. In the training set, the precision scores of the
RFA and SVM models are higher. In the test set, all models, except the Bayes-
ian network, have decreased values. After the parameter optimization, the overall
evaluation of all models increases slightly, among which the precision of RFA,
FM, and LGBM models has increased significantly. The results show that param-
eter optimization has a better effect on the RFA model’s precision while the worst
effect on FM precision.
Figure 10 shows the recall rate results for different risk control models based
on data mining and machine learning. As shown in the figure, the recall rate of
RFA, DTA, and LGBM models is higher; however, it decreases significantly in


the test set. The parameters of the original model are optimized to improve the
accuracy of each model. Compared with the original model, the LSA model’s
recall rate is improved slightly, while other models have little change. However,
the recall rate of the FM is significantly higher than other models. The results
show that the parameter optimization has the best effect on the accuracy of FM.
Figure 11 shows the AUC and F1-score results of different risk control models
based on data mining and machine learning. As shown in the figure, in the
training set, the AUC of the FM, DTA, and LGBM models is higher; however,
it decreases significantly in the test set. The parameters of the original models
are optimized to improve the accuracy of each model. Compared with the original
model, the AUC of the DTA model improves significantly, while that of the
LSA and SVM models changes little. The results show that parameter optimization
has a better effect on the AUC of the DTA model. The F1-score, the harmonic
mean of precision and recall, is high for the LGBM, DTA, and RFA models
in the training set. In the test set, all model values, except the FM's, decrease.
After parameter optimization, the comprehensive evaluation of the FM increases
slightly, and the effects on the other models are not noticeable; however, the
overall score of the LGBM decreases markedly. The results show that parameter
optimization has the best effect on the F1-score of the FM.
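The metric suite used above (precision, recall, F1-score, AUC) can be computed on training and test splits as in the sketch below; the data and the logistic-regression stand-in model are synthetic illustrations, not the paper's loan data:

```python
# Illustrative computation of the evaluation suite (precision, recall,
# F1-score, AUC) on training vs. test splits. Synthetic data and model.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=800, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

for split, Xs, ys in [("train", X_tr, y_tr), ("test", X_te, y_te)]:
    pred = model.predict(Xs)                 # hard labels for precision/recall/F1
    prob = model.predict_proba(Xs)[:, 1]     # scores for AUC
    print(f"{split}: precision={precision_score(ys, pred):.3f} "
          f"recall={recall_score(ys, pred):.3f} "
          f"F1={f1_score(ys, pred):.3f} "
          f"AUC={roc_auc_score(ys, prob):.3f}")
```

Comparing the two printed lines reproduces in miniature the train-to-test drop discussed in this section.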

5.5 ROC Evaluation of Different Data Mining and Machine Learning Risk Control Models

Figure 12 shows the ROC evaluation results of different data mining and machine
learning risk control models. First, the data for each model are divided into a
training set and a test set. Then, the ROC curve of each model is computed to
obtain the average ROC value, and the curves are recomputed after parameter
optimization. As shown in Fig. 12, the overall ROC values on the training set are
larger because more data make the models more accurate. However, the gap
between the training set and the test set is large, indicating that the models'
overall prediction performance is not acceptable.

Fig. 11  AUC and F1-Score results of different data mining and machine learning risk control models

Fig. 12  ROC evaluation of different data mining and machine learning risk control models

After parameter optimization, as shown in Fig. 12b, the curves of the single
original models and the optimized model differ little and intersect one another,
indicating that the optimized model is more stable and confirming the
effectiveness of parameter optimization in the model construction process.
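The train-to-test ROC gap described here can be reproduced in miniature: an unpruned decision tree memorizes its training data, so its training-set AUC typically far exceeds its test-set AUC. The data are synthetic stand-ins for the enterprise loan data:

```python
# Sketch of the ROC gap between training and test sets: an unpruned
# decision tree overfits, so its training AUC exceeds its test AUC,
# the pattern the ROC evaluation above describes. Synthetic data.
from sklearn.datasets import make_classification
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, random_state=2)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=2)
tree = DecisionTreeClassifier(random_state=2).fit(X_tr, y_tr)  # unpruned

for split, Xs, ys in [("train", X_tr, y_tr), ("test", X_te, y_te)]:
    fpr, tpr, _ = roc_curve(ys, tree.predict_proba(Xs)[:, 1])
    print(f"{split} AUC = {auc(fpr, tpr):.3f}")
```

Constraining the tree (e.g. `max_depth`) plays the role of the parameter optimization above: it narrows the gap between the two curves.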

5.6 Actual Verification Results of Different Data Mining and Machine Learning Risk Control Models

All the optimized risk control models are applied to the real financial data of a
trading enterprise, and the results of the various indicators are scored
comprehensively; the results obtained are shown in Table 1. As shown in the
table, among the single models, the AUC of the FM is the highest (0.7935),
followed by the SVM model (0.7831). The AUC of most data mining and machine
learning models is around 0.7; therefore, these models are applicable to
predicting the risk control of trading enterprises. The F1-score, which evaluates
precision and recall jointly, is highest for the FM (0.4764), followed by the
LGBM model (0.4336). Therefore, the SVM, LGBM, and FM are more accurate in
the risk control evaluation of trading enterprises.

Table 1  Comprehensive comparison results of different data mining and machine learning risk control models

Performance  SVM      LSA      DTA      RFA      LGBM     FM
AUC          0.7831   0.7790   0.6974   0.7746   0.7746   0.7935
F1-Score     0.4055   0.3198   0.2335   0.3376   0.4336   0.4764
Accuracy     0.7884   0.7765   0.7561   0.7828   0.7877   0.7828
Precision    0.6913   0.6818   0.5579   0.7248   0.6591   0.6059
Recall       0.2869   0.2089   0.1476   0.2201   0.3231   0.3928
Overview     0.59104  0.55320  0.47850  0.56798  0.59562  0.61028

In terms of accuracy, the highest is the SVM (78.84 %), followed by the LGBM
(78.77 %). The accuracy
of the remaining models is above 75 %; therefore, all models meet the accuracy
requirement. In terms of precision, the highest is the RFA (72.48 %), followed
by the SVM (69.13 %). In terms of recall rate, the highest is the FM (39.28 %),
followed by the LGBM (32.31 %). In summary, each model has unique advantages
on different performance measures: the FM has the highest AUC, recall rate, and
F1-score, while the RFA has the highest precision.
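The "Overview" row in Table 1 appears to be the unweighted mean of the five metrics above it; this is an observation from the table rather than something the paper states. The check below reproduces the SVM and FM entries:

```python
# The "Overview" row of Table 1 appears to be the unweighted mean of the
# five metrics above it. This check reproduces the SVM and FM entries.
table = {
    "SVM": [0.7831, 0.4055, 0.7884, 0.6913, 0.2869],  # AUC, F1, Acc, Prec, Rec
    "FM":  [0.7935, 0.4764, 0.7828, 0.6059, 0.3928],
}
for model, scores in table.items():
    print(model, round(sum(scores) / len(scores), 5))
# prints SVM 0.59104 and FM 0.61028, matching the Overview row
```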

6 Conclusions

In the context of big data, the fusion model is utilized to predict trading
enterprises' financial risks. With the help of data collection, preprocessing,
and blockchain-based methods, risk control models of trading enterprises are
constructed based on the SVM, decision trees, random forests, gradient boosting,
and an integrated algorithm, and the parameters of the original models are
optimized. By analyzing the financial loan data of trade-oriented enterprises,
the differences between the algorithm models in practical application and in
performance are analyzed comprehensively. When the data set is smaller, the SVM
trains fastest across the different data sets; when the data set is larger, the
FM consumes the shortest time. Each model has unique advantages on different
performance measures: the FM has the highest AUC, recall rate, and F1-score, and
the RFA also has high precision. Although all
models’ performances have been analyzed as comprehensively as possible, defects
due to the objective limitations such as funding exist, which will be improved in the
following aspects. (1) While evaluating the models’ performances, only the stand-
ard indicators, such as AUC and precision, are considered to determine the model’s
quality. However, the internal relations among these indicators are not explained
from a general perspective. (2) The results suggest that the constructed risk control
models are quite different for the data of different enterprises because the mining of
data may not be deep enough. Also, since the information on credit reporting is eval-
uated from several aspects, including the social data of individuals, its results are
critical for the construction of the model. If the data of credit reporting are analyzed
thoroughly, users’ consumption ability and consumption psychology will be evalu-
ated comprehensively. (3) The data used to construct the models mainly come from
the business data of trading enterprises. Meanwhile, the time span only includes a
few months. However, trying different data mining and machine learning algorithms
for model training takes a longer time. Therefore, how to reduce the running time
and improve the efficiency of algorithm learning during massive data analysis is a
problem that requires further investigation. The above are the research directions in
the future.
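The average fusion underlying the FM can be sketched as follows: the fused risk score is the equal-weight mean of the base models' predicted probabilities. The base learners and synthetic data below are illustrative assumptions; the paper's exact base models and weighting may differ:

```python
# Minimal sketch of the average-fusion model (FM): the fused risk score is
# the equal-weight mean of the base models' predicted probabilities.
# Base learners and data are illustrative, not the paper's configuration.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=800, random_state=3)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=3)

bases = [DecisionTreeClassifier(max_depth=4, random_state=3),
         RandomForestClassifier(n_estimators=100, random_state=3),
         GradientBoostingClassifier(random_state=3)]
probs = [m.fit(X_tr, y_tr).predict_proba(X_te)[:, 1] for m in bases]
fused = np.mean(probs, axis=0)  # average fusion of the base predictions

print("fused test AUC:", round(roc_auc_score(y_te, fused), 3))
```

Weighted averaging or stacking are natural variants; the equal-weight mean matches the "average fusion method" named in the abstract.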


Compliance with ethical standards 

Conflict of interest  All Authors declare that they have no conflict of interest.

Human and Animal Rights  This article does not contain any studies with human participants or animals
performed by any of the authors.

Informed Consent  Informed consent was obtained from all individual participants included in the study.


Publisher’s note  Springer Nature remains neutral with regard to jurisdictional claims in published
maps and institutional affiliations.
