
Session 6: eCommerce WSDM ’21, March 8–12, 2021, Virtual Event, Israel

Credit Risk and Limits Forecasting in E-Commerce Consumer Lending Service via Multi-view-aware Mixture-of-experts Nets

Ting Liang∗, Alibaba Group, Hangzhou, China, kuiyu.lt@alibaba-inc.com
Guanxiong Zeng∗, Alibaba Group, Hangzhou, China, moshi.zgx@alibaba-inc.com
Qiwei Zhong, Alibaba Group, Hangzhou, China, yunwei.zqw@alibaba-inc.com
Jianfeng Chi, Alibaba Group, Hangzhou, China, bianfu.cjf@alibaba-inc.com
Jinghua Feng†, Alibaba Group, Hangzhou, China, jinghua.fengjh@alibaba-inc.com
Xiang Ao†‡, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China, aoxiang@ict.ac.cn
Jiayu Tang, Alibaba Group, Hangzhou, China, jiayu.tangjy@alibaba-inc.com

ABSTRACT
Consumer lending service is escalating in E-Commerce platforms due to its capability in enhancing buyers’ purchasing power, improving average order value, and increasing revenue of the platforms. Credit risk forecasting and credit limits setting are two fundamental problems in E-Commerce/online consumer lending services. Currently, the majority of institutes rely on two-step methods: they first build a rating model to evaluate credit risk, and then design heuristic strategies to set credit limits, which requires a large amount of prior knowledge and lacks theoretical justification. In this paper, we propose an end-to-end multi-view and multi-task learning based approach named MvMoE (Multi-view-aware Mixture-of-Experts network) to solve these two problems simultaneously. First, a multi-view network with a hierarchical attention mechanism is constructed to distill users’ heterogeneous financial information into shared hidden representations. Then, we jointly train these two tasks with a view-aware multi-gate mixture-of-experts network and a subsequent progressive network to improve their performances. We investigate the effectiveness of MvMoE on a real-world dataset containing 5.44 million users. Experimental results show that the proposed model improves AP by over 5.60% on credit risk forecasting and MAE by over 9.52% on credit limits setting compared with conventional methods. Meanwhile, MvMoE has good interpretability, which better underpins the imperative demands in financial industries.

CCS CONCEPTS
• Information systems → Electronic commerce; • Computing methodologies → Multi-task learning.

KEYWORDS
Credit Risk Forecasting, Credit Limits Setting, Multi-view Learning, Multi-task Learning

ACM Reference Format:
Ting Liang, Guanxiong Zeng, Qiwei Zhong, Jianfeng Chi, Jinghua Feng, Xiang Ao, and Jiayu Tang. 2021. Credit Risk and Limits Forecasting in E-Commerce Consumer Lending Service via Multi-view-aware Mixture-of-experts Nets. In Proceedings of the Fourteenth ACM International Conference on Web Search and Data Mining (WSDM ’21), March 8–12, 2021, Virtual Event, Israel. ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/3437963.3441743

∗ These authors contributed equally to this work.
† Corresponding authors.
‡ Key Lab of Intelligent Information Processing of Chinese Academy of Sciences (CAS). Also at University of Chinese Academy of Sciences, China.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
WSDM ’21, March 8–12, 2021, Virtual Event, Israel
© 2021 Association for Computing Machinery.
ACM ISBN 978-1-4503-8297-7/21/03. . . $15.00
https://doi.org/10.1145/3437963.3441743

1 INTRODUCTION
Recent years have witnessed an accelerated growth of consumer finance. A multi-tier consumer financial service system that is mainly composed of commercial banks, licensed consumer finance companies, and online FinTech lending platforms is escalating [8]. Among them, the scale of online FinTech lending platforms, taking advantage of alternative data sources, big data, and machine learning technologies, increased nearly 400 times¹. However, this year (2020), the outbreak of COVID-19 brought great challenges to major financial institutions, because the cumulative effect of continuous economic disasters caused by the epidemic may eventually lead to a wave of default².

¹ http://www.nifd.cn/Financial
² https://www.bloomberg.com/news/articles/2020-08-05/more-bailout-cash-may-just-delay-wave-of-credit-card-defaults

It thus becomes increasingly imperative for
financial institutions to manage risks and grant credit at the same time.

For the E-Commerce industry, in more detail, consumer lending service is emerging due to its capability in enhancing buyers’ purchasing power, improving average order value, and increasing revenue of the platform [19]. Ant Credit Pay, the most popular payment tool in Taobao, has surpassed 700 million users according to Alibaba’s 2019 quarterly financial report³. PayPal Credit’s transaction amount reached a 50 billion milestone in February 2019⁴. These two lending services have played a great positive role in their respective E-Commerce platforms [15].

³ http://www.xinhuanet.com/english/2019-01/09/c_137731690.htm
⁴ https://www.paypal.com/stories/us

Buyers approved by an E-Commerce consumer lending service can pay by credit through signing an electronic contract according to their historical credit pledges. The debt is settled after the buyer repays the loan within the time stipulated in the contract [6, 21, 35]. Figure 1 illustrates a toy example of the business model of such a lending service. It generally consists of two tasks. The first is Credit Risk Forecasting (CRF), which helps to decide whether a buyer should be approved and the discount coefficient when setting credit limits. The second is Credit Limits Setting (CLS), which decides the amount that a buyer can borrow when he/she purchases on this E-Commerce platform.

[Figure 1 (image not reproduced). Labels in the figure: Buyers, Platform, Credit Risk Forecasting (Approved / Not approved), Credit Limits Setting (Purchase, Limit).]
Figure 1: A toy example of E-Commerce Consumer Lending Service. The platform needs to decide whether a buyer can be approved, and how much credit limit (gold coins shown in the figure) should be allocated to each approved buyer according to his/her risk probability (the numbers on the right of the approved buyers).

Considering the purpose of increasing revenue and controlling the risk exposure of the platform, buyers’ purchasing demand is a benchmark for setting credit limits. Unnecessarily high credit limits, for example limits much higher than buyers’ purchase demand, will increase the risk exposure. Credit limits that are too low, on the other hand, might lead to the churn of customers. In a nutshell, the platform should give no credit limits to defaulters, and allocate credit limits to other customers according to their purchase demand.

Under such a service mode, CRF and CLS have become two widely known and most critical issues for every E-Commerce consumer lending service. In traditional practices, these two tasks are solved separately [16]. Though we have observed emerging machine learning based research for CRF [14, 31, 35] in the AI research field, CLS has often been overlooked by the same research community. A general way to finish the CLS task is to design limits setting strategies based on heuristic human expertise, which suffers from several common disadvantages, e.g., lack of theoretical justification, robustness, and flexibility [4]. In most circumstances, the only connection between CRF and CLS is that the output of CRF is considered as a part of the input for the heuristic function of predicting CLS. Little relevant research has been conducted on solving these two tasks at the same time. However, CRF forecasts the probability of a potential default, and CLS is largely dependent on the result of CRF and the buyers’ purchase demand. These two tasks together determine the risk exposure and revenue of the platform, which gives a natural motivation to explore the correlation between them.

In this paper, we aim to simultaneously consider CRF and CLS to help E-Commerce consumer lending platforms maximize the expected profits while minimizing the risk exposure. There are some special challenges to be resolved:

1) Although these two tasks seem to be related, their inside patterns and coherent connections remain under-explored. For example, though credit limits could be regarded as a signal of the future purchase potential of a consumer, a high purchase demand doesn’t indicate a low credit risk and vice versa. Sometimes consumers’ expected future purchases cannot be evaluated correctly due to the impact of occasional incidents [29].
2) Different from bank financial services, it’s difficult for online lending platforms to collect private personal information such as customers’ assets or income, which are closely related to risk assessment.
3) How to improve the performance of both tasks at the same time while keeping robustness, flexibility, and interpretability of the model is another intractable challenge.

To this end, we consider CRF and CLS together from the perspective of multi-task learning, and propose a Multi-view-aware Mixture-of-Experts network coined MvMoE to remedy these special challenges. First, we combine heterogeneous multi-view data sources including user profile, sequential behaviors, and social relations in the platform to perform comprehensive user modeling. Multi-Layer Perceptron (MLP), Bi-directional Long Short-Term Memory (BiLSTM), and Graph Neural Network (GNN) are adopted to encode features on each view, respectively. A hierarchical attention mechanism is devised to judge the importance of intra-view and inter-view features. Second, we propose a novel neural network structure called view-aware mixture-of-experts to catch preferable information for different tasks. This structure also helps us understand the role of different views. Finally, we utilize the output of the CRF task to guide the CLS task by a progressive network between the tower of each task according to financial prior knowledge.

The contributions of this work are summarized as follows:

• This is the very first attempt to regard CRF and CLS as a holistic framework, and define it as a multi-task learning problem which is composed of a default detection task and a credit limit forecasting task.
• We propose MvMoE to remedy the special challenges of our problem. MvMoE is composed of two network structures: the
first is a multi-view network that fuses various data sources with a shared hidden representation through a hierarchical attention layer. The second is a multi-task network built upon the proposed view-aware mixture-of-experts network which catches important features to buoy the performance of the two tasks simultaneously. The importance of different views for a specific task could also be highlighted.
• Experiments on a real-world dataset demonstrate the effectiveness of MvMoE. It not only outperforms the compared SOTA approaches but also has as good interpretability as traditional tree-based methods.

The remainder of this paper is organized as follows. Section 2 surveys the related research in literature. Section 3 introduces the business model and the problem statement, and Section 4 details the proposed MvMoE approach. Sections 5 and 6 illustrate the experiment settings and demonstrate the experimental results. Section 7 concludes the paper.

2 RELATED WORK
In this section, we review the related work from two aspects, namely credit risk forecasting and credit limits setting.

2.1 Credit Risk Forecasting
Research on credit risk forecasting started in 1968, when Altman [2] developed a linear model based on several financial features to predict bankruptcy. Since then, banks began to build their own rating models. Hand and Henley [12] summarized the development of statistical methods in this period. In these traditional financial scenarios, such as credit cards, most people take the initiative to apply, and the bank can obtain the user’s detailed information such as assets and income. However, in online consumer finance, it is often impossible to obtain detailed user information due to various restrictions. With the coming of the big data era, researchers started to use machine learning technology to predict default probabilities. Malekipirbazari and Aksakalli [24] proposed a random forest based method to assess lending risk and got better performance than FICO (Fair Isaac Corporation) and LC (Lending Club). Recent works introduced deep neural networks to detect specific types of default [35], e.g., fraud [20, 26, 31], cash-out [14] and malicious [22, 32], and they showed that relation data is a good supplement for default identification [35].

2.2 Credit Limits Setting
Different from credit risk forecasting, credit limits setting has few open research works since it is greatly dependent upon the preference of platforms, which is to seek profits or to control risks. The most commonly used methods include the expected profit method [1], the risk points method [17, 25, 30] and the profitability indicators method [3], etc. Besides, Bazzi and Hasna [4] and Cheng and Cirillo [5] tried to model the limits based on the applicant’s credit history confirmed by an expert approach. However, none of these methods can be implemented without computing a credit score first.

Although CRF and CLS each have corresponding research, the industry generally follows the two-step methodology and ignores the rich interrelations between them. In this paper, our purpose is to integrate the two tasks in a unified framework to explore the correlation between them and simultaneously bolster their performances.

3 BUSINESS MODEL AND PROBLEM STATEMENT
In this section, we first introduce the business setting in our work, then present the heterogeneous data sources and the problem statements for the CRF and CLS tasks, respectively.

Business Setting. In this work, our business setting is based on a consumer lending service provided by an E-Commerce platform to its buyers. The purpose of this service is to maximize profits under the premise of controlling the risk of the platform.

Recall that the consumer lending service platform needs to decide whether a buyer can be approved, and how much credit limit should be allocated to each approved buyer. The first task is answered by CRF, and the second is resolved by CLS. To be more specific, in this work, CRF aims to produce the probability of one-month delinquency on each buyer’s credit payments in the next month. The reason why we choose one-month delinquency as the indicator is that the cost of collection increases significantly after a one-month delay. Meanwhile, 70% of one-month delinquent users may become three-month delinquent, which is the general upper bound on delinquency in the online consumer lending service industry. CLS, as the task of maximizing the potential profits, aims to predict each buyer’s purchase demand in the next month. Meanwhile, the platform should grant no credit limit to a buyer if he/she is a defaulter.

For instance, as shown in Figure 2, if a buyer pays on the E-Commerce platform by credit, a due date is set according to the lending service agreement. One-month delinquency means the buyer fails to repay the loan by one month after the due date, and the buyer is thus regarded as a defaulter. The purpose of CRF is to predict the probability of a given user being a defaulter (see $y_1$ in the figure). The purpose of CLS is to predict the one-month purchase demand (see $y_2$ in the figure) of the same buyer if he/she is a benign user. Otherwise, we restrict the credit limit to 0 when he/she is a defaulter.

[Figure 2 (image not reproduced). A timeline running from Credit Payments to Due Date to one month after the due date. User-level labels: $y_1$ is 1 or 0 (default or not); $y_2$ is the one-month purchase amount for benign users and 0 for defaulters.]
Figure 2: Timeline of $y_1$ and $y_2$.

Heterogeneous Multi-view Information. We adopt multiple views of heterogeneous information that is generally available in online E-Commerce platforms. We denote by $\mathbf{V}_u$ the view of user profiles, by $\mathbf{V}_b$ the view of users’ sequential behaviors, and by $\mathbf{V}_r$ the view of the relationships between users. The heterogeneous multi-view information utilized in this paper is then denoted as $\mathbf{V} = \{\mathbf{V}_u, \mathbf{V}_b, \mathbf{V}_r\}$, consisting of the original input features.

Problem Statement of CRF. According to the general definition in online consumer lending service platforms, we assign a label $y_1 \in \{0, 1\}$ on each buyer $u \in \mathcal{U}$ to indicate whether he/she is a
defaulter or not, where $\mathcal{U}$ represents the buyers in the training set. The CRF task is to predict the default probability $p_v$ of buyer $v$ in the testing set, which is considered as a binary classification problem.

Problem Statement of CLS. The CLS problem is analogous to CRF; however, the target of CLS, denoted by $y_2 \in \mathbb{R}$, indicates the customer’s purchase demand in the next month, so CLS can generally be considered as a regression problem. Besides, $y_2$ is set to 0 if a user is annotated as a defaulter, since credit limits are solely provided to benign users.

To sum up, in this paper, we treat CRF as a binary classification task and CLS as a regression task according to our business setting. Our method is to jointly train these two tasks based on the heterogeneous multi-view input.

4 THE MODEL
In this section, we present the proposed MvMoE (Multi-view-aware Mixture-of-Experts network). Its overall architecture is shown in Figure 3 (a).

4.1 Multi-view Network
4.1.1 Embedding Layer.
As mentioned above, we have built multi-view features including $\mathbf{V}_u$, $\mathbf{V}_b$, $\mathbf{V}_r$ under the premise of complying with security and privacy policies. We utilize different classic networks to deal with these multi-view inputs. In particular, we use MLP, BiLSTM [10], and GNN [11] to encode the statistical user profiles, user sequential behaviors, and user relation information, respectively. The encoded representations, which all have the same dimension, are denoted as follows:

$$\mathbf{E}_{V_u} = \mathrm{MLP}(\mathbf{V}_u), \qquad \mathbf{E}_{V_b} = \mathrm{BiLSTM}(\mathbf{V}_b), \qquad \mathbf{E}_{V_r} = \mathrm{GNN}(\mathbf{V}_r) \tag{1}$$

4.1.2 Hierarchical Attention Layer.
In order to better integrate the information of $\mathbf{E}_{V_u}$, $\mathbf{E}_{V_b}$, $\mathbf{E}_{V_r}$, we apply a hierarchical attention mechanism, where the bottom attention layer focuses on the internal part of each view, while the top attention layer learns the relationship between different views, as shown in Figure 3 (a).

(a) Intra-view Attention. For each view, we calculate the attention score of the embedding features and use the element-wise product to get a new output with the same dimension:

$$\mathrm{IntraAtt}(\mathbf{h}) = \mathbf{h} \odot \mathrm{Softmax}(\mathbf{W}_a \mathbf{h}) \tag{2}$$

where $\mathbf{W}_a \in \mathbb{R}^{d \times d}$ is a trainable weight matrix, $\mathbf{h} \in \mathbb{R}^{d}$ is the output of the embedding layer, $\odot$ is the element-wise product, and $\mathbf{h}$ could be $\mathbf{E}_{V_u}$, $\mathbf{E}_{V_b}$ or $\mathbf{E}_{V_r}$. For example, we can get the distilled representation of the user profile view:

$$\mathbf{V}_u^{\mathrm{Intra}} = \mathrm{IntraAtt}(\mathbf{E}_{V_u}) \tag{3}$$

(b) Inter-view Attention. We design the inter-view attention to model the asynchronous interactions between different views, as shown in Figure 3 (b). It is calculated as follows, where $\mathbf{v}_1$ and $\mathbf{v}_2$ are the outputs of the intra-view attention mechanism of two views:

$$\mathrm{InterAtt}(\mathbf{v}_1, \mathbf{v}_2) = A_1(\mathbf{v}_2, \mathbf{v}_1, \mathbf{v}_1) \odot A_2(\mathbf{v}_1, \mathbf{v}_2, \mathbf{v}_2)$$
$$A_1(\mathbf{v}_2, \mathbf{v}_1, \mathbf{v}_1) = \mathrm{Softmax}\!\left(\frac{\mathbf{v}_2 \mathbf{v}_1^{\top}}{\sqrt{d}}\right)\mathbf{v}_1 \tag{4}$$
$$A_2(\mathbf{v}_1, \mathbf{v}_2, \mathbf{v}_2) = \mathrm{Softmax}\!\left(\frac{\mathbf{v}_1 \mathbf{v}_2^{\top}}{\sqrt{d}}\right)\mathbf{v}_2$$

where $A_1, A_2 \in \mathbb{R}^{n \times d}$ are attentive representations of $\mathbf{v}_1$ and $\mathbf{v}_2$, $n$ is the number of samples, $d$ is the dimension of the intra-view attention, $\mathrm{Softmax}(\cdot)$ operates by column, and $\odot$ is the element-wise product. For example, when considering $\mathbf{V}_u^{\mathrm{Intra}}$ and $\mathbf{V}_b^{\mathrm{Intra}}$, we get:

$$\mathbf{V}_{(u,b)}^{\mathrm{Inter}} = \mathrm{InterAtt}\left(\mathbf{V}_u^{\mathrm{Intra}}, \mathbf{V}_b^{\mathrm{Intra}}\right) \tag{5}$$

Finally, we generate $\mathbf{V}_{(u,b)}^{\mathrm{Inter}}$, $\mathbf{V}_{(u,r)}^{\mathrm{Inter}}$ and $\mathbf{V}_{(b,r)}^{\mathrm{Inter}}$ for every pair of intra-view attentions via the inter-view attention layer.

4.2 Multi-task Network
4.2.1 View-aware MMoE Layer.
(a) Multi-gate Mixture-of-Experts. The Multi-gate Mixture-of-Experts (MMoE) model, which is shown in Figure 3 (c), is built upon the widely used Shared-Bottom and Mixture-of-Experts structure [23]. It is composed of the expert network and the gating network. The former contains MLPs with the same structure, called experts, which learn different knowledge separately. The latter computes a softmax vector for each task to achieve the selective use of experts.

(b) View-aware MMoE. In this paper, we construct the View-aware MMoE structure, which uses the MMoE structure for every view separately. As a result, we can better judge the importance of different views for the two tasks according to the network outputs, which makes the network more interpretable. For example, when the input is $\mathbf{V}_u^{\mathrm{Intra}}$ in task $i$, the formula is

$$\mathrm{VM}^{i}_{V_u^{\mathrm{Intra}}} = \sum_{k=1}^{E} g_k^{i}\left(\mathbf{V}_u^{\mathrm{Intra}}\right) f_k\left(\mathbf{V}_u^{\mathrm{Intra}}\right) \tag{6}$$

where $g^{i}$ is the gating network for task $i$, $f_k$ is the output of the $k$-th expert, and $E$ is the number of experts.

After the view-aware MMoE, we get $\mathrm{VM}^{i}_{V_u^{\mathrm{Intra}}}$, $\mathrm{VM}^{i}_{V_b^{\mathrm{Intra}}}$ and $\mathrm{VM}^{i}_{V_r^{\mathrm{Intra}}}$ for all intra-views. In addition, for inter-views, the results are represented as $\mathrm{VM}^{i}_{V_{(u,b)}^{\mathrm{Inter}}}$, $\mathrm{VM}^{i}_{V_{(u,r)}^{\mathrm{Inter}}}$ and $\mathrm{VM}^{i}_{V_{(b,r)}^{\mathrm{Inter}}}$.

(c) Attention Layer. To extract important information, we concatenate the outputs of all view-aware MMoE layers and feed them to an attention layer. The following formula calculates the attention score $\mathrm{Att}_s$ and the corresponding score vector:

$$\mathrm{Att}_s(\mathbf{h}^{i}) = \mathrm{Softmax}(\mathbf{W}_a^{i} \mathbf{h}^{i}), \qquad \mathrm{Att}(\mathbf{h}^{i}) = \mathbf{h}^{i} \odot \mathrm{Att}_s(\mathbf{h}^{i}) \tag{7}$$

where $i$ is the index of the task, $\mathbf{W}_a^{i} \in \mathbb{R}^{d \times d}$ is a trainable weight matrix, and $\odot$ is the element-wise product. For example,

$$\mathbf{h}^{\mathrm{CRF}} = \left[\mathrm{VM}^{\mathrm{CRF}}_{V_u^{\mathrm{Intra}}}, \mathrm{VM}^{\mathrm{CRF}}_{V_b^{\mathrm{Intra}}}, \mathrm{VM}^{\mathrm{CRF}}_{V_r^{\mathrm{Intra}}}, \mathrm{VM}^{\mathrm{CRF}}_{V_{(u,b)}^{\mathrm{Inter}}}, \mathrm{VM}^{\mathrm{CRF}}_{V_{(u,r)}^{\mathrm{Inter}}}, \mathrm{VM}^{\mathrm{CRF}}_{V_{(b,r)}^{\mathrm{Inter}}}\right] \tag{8}$$

is the concatenated output of all six view-aware MMoE layers, $\mathrm{Att}_s(\mathbf{h}^{\mathrm{CRF}})$ is the attention score, and $\mathrm{Att}(\mathbf{h}^{\mathrm{CRF}})$ is the final output after the attention layer of the CRF task.
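
The hierarchical attention of Section 4.1.2 can be illustrated with a small numerical sketch. The code below is not the authors' implementation: it uses plain Python lists instead of a deep learning framework, the function names (`intra_att`, `cross_att`, `inter_att`) are hypothetical, and it takes "Softmax operates by column" in Eq. (4) literally by normalising each column of the score matrix, which is only one possible reading.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def transpose(A):
    return [list(col) for col in zip(*A)]


def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def intra_att(h, W_a):
    """Eq. (2): h * Softmax(W_a h), element-wise, for one embedding vector h."""
    scores = softmax([sum(w * x for w, x in zip(row, h)) for row in W_a])
    return [x * s for x, s in zip(h, scores)]


def softmax_by_column(S):
    """Normalise every column of S so that it sums to one (assumed reading)."""
    return transpose([softmax(list(col)) for col in zip(*S)])


def cross_att(query, key_value):
    """Softmax(query key_value^T / sqrt(d)) key_value, as in A1 and A2 of Eq. (4)."""
    d = len(query[0])
    scores = matmul(query, transpose(key_value))
    scaled = [[s / math.sqrt(d) for s in row] for row in scores]
    return matmul(softmax_by_column(scaled), key_value)


def inter_att(v1, v2):
    """Eq. (4): A1(v2, v1, v1) element-wise times A2(v1, v2, v2)."""
    a1 = cross_att(v2, v1)
    a2 = cross_att(v1, v2)
    return [[x * y for x, y in zip(r1, r2)] for r1, r2 in zip(a1, a2)]


# Toy inputs: n = 2 samples, d = 2 dimensions per view (illustrative values only).
v_u = [[1.0, 0.0], [0.0, 1.0]]
v_b = [[0.5, 0.5], [1.0, -1.0]]
v_ub = inter_att(v_u, v_b)  # plays the role of V_(u,b)^Inter in Eq. (5)
```

Because Eq. (4) multiplies the two attentive representations element-wise, the result has the same $n \times d$ shape as each input and does not depend on the order in which the two views are passed.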


[Figure 3 (image not reproduced), panels (a), (b) and (c).]
Figure 3: (a) The model architecture of MvMoE. MvMoE is mainly divided into two network structures: the left is a multi-view network including the Embedding Layer and the Hierarchical Attention Layer, and the right is a multi-task network including the View-aware MMoE Layer and the Progressive Layer. (b) Inter-view attention mechanism. (c) The structure of the MMoE module.
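
The gated mixture in Eq. (6) of Section 4.2.1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: each expert is a single ReLU layer rather than a full MLP, weights are random, and the class name `ViewMMoE`, the view keys, and the toy dimensions are hypothetical. Only the number of experts (6) is taken from the implementation details given in the paper.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def linear(W, x):
    """Matrix-vector product W x, with W stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


class ViewMMoE:
    """One MMoE block for a single view: shared experts, one softmax gate per task."""

    def __init__(self, in_dim, out_dim, num_experts=6, tasks=("CRF", "CLS"), seed=0):
        rng = random.Random(seed)

        def rand_matrix(rows, cols):
            return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

        # Assumption: each expert is a single ReLU layer standing in for an MLP.
        self.experts = [rand_matrix(out_dim, in_dim) for _ in range(num_experts)]
        self.gates = {t: rand_matrix(num_experts, in_dim) for t in tasks}

    def gate_weights(self, task, x):
        """g^i(x): softmax over experts for the given task."""
        return softmax(linear(self.gates[task], x))

    def forward(self, task, x):
        """Eq. (6): sum over k of g^i_k(x) * f_k(x)."""
        g = self.gate_weights(task, x)
        outs = [[max(0.0, v) for v in linear(W, x)] for W in self.experts]
        out_dim = len(outs[0])
        return [sum(g[k] * outs[k][j] for k in range(len(outs))) for j in range(out_dim)]


# "View-aware" here means one independent MMoE block per view (toy 4-d inputs).
views = {"V_u_intra": [0.2, -0.1, 0.4, 0.3], "V_b_intra": [0.5, 0.1, -0.3, 0.0]}
layers = {name: ViewMMoE(in_dim=4, out_dim=3, seed=i) for i, name in enumerate(views)}
vm_crf = {name: layers[name].forward("CRF", x) for name, x in views.items()}
gates_crf = {name: layers[name].gate_weights("CRF", x) for name, x in views.items()}
```

Keeping a separate block per view is what allows the per-task, per-view outputs to be compared afterwards, which is the interpretability argument made in the text.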

After the above structures, we feed the concatenation of the embedding features to an MMoE to get the residual information, then concatenate it with the output of the attention layer for each task respectively. The final output of the View-aware MMoE Layer is

$$\mathrm{VaMMoE}^{i}_{\mathrm{Layer}} = \left[\mathrm{VM}^{i}_{([\mathbf{E}_{V_u}, \mathbf{E}_{V_b}, \mathbf{E}_{V_r}])}, \mathrm{Att}(\mathbf{h}^{i})\right] \tag{9}$$

4.2.2 Progressive Layer.
Intuitively, the credit score is a key index for setting credit limits. To capitalize on this prior knowledge, we pass information from CRF to CLS in two phases. Firstly, we employ a progressive layer between the towers of the two tasks to dynamically transfer compatible hidden representations [28]. Without loss of generality, the output of the lateral connections is simply added to the output of the same layer of the CLS tower via an MLP. Secondly, we introduce the probability of non-default ($1 - \hat{y}_1$, where $\hat{y}_1$ is the predicted risk probability of one user) to CLS through a Bayesian-like approach [33]. Formally, the output of the CRF task is

$$\hat{y}_1 = \mathrm{MLP}\left(\mathrm{VaMMoE}^{\mathrm{CRF}}_{\mathrm{Layer}}\right) \tag{10}$$

while the output of the CLS task is

$$\hat{y}_2 = \mathrm{MLP}\left(\mathrm{VaMMoE}^{\mathrm{CLS}}_{\mathrm{Layer}} \,\middle|\, \mathrm{VaMMoE}^{\mathrm{CRF}}_{\mathrm{Layer}}\right) \tag{11}$$

4.3 Model Training
Our model is trained on two losses with regularization. The loss function is defined as:

$$\mathcal{L}_1(\Theta) = -\frac{1}{N}\sum_{i=1}^{N}\left(y_1^{i}\log(\hat{y}_1^{i}) + (1 - y_1^{i})\log(1 - \hat{y}_1^{i})\right)$$
$$\mathcal{L}_2(\Theta) = \frac{1}{N}\sum_{i=1}^{N}\left(y_2^{i} - \hat{y}_2^{i}\right)^2 \tag{12}$$
$$\mathcal{L}(\Theta) = \alpha\,\mathcal{L}_1(\Theta) + (1 - \alpha)\,\mathcal{L}_2(\Theta) + \lambda\,\|\Theta\|_2^2$$

where $N$ is the number of samples, $y_1$ and $y_2$ are the ground truth of CRF and CLS respectively, $\hat{y}_1$ is the predicted risk probability of one user, $\hat{y}_2$ is his/her predicted limit, $\Theta$ is the parameter set of the proposed model, $\alpha$ is the weighting factor, and $\lambda$ is the coefficient of the L2 regularizer.

5 EXPERIMENTS SETUP
5.1 Dataset
We collect a real-world dataset from an online E-Commerce consumer lending service provided by Alibaba which serves both personal and enterprise buyers. The dataset contains 4.37 million users (from 2018/08/01 to 2019/01/31) for training and 1.07 million users (from 2019/04/01 to 2019/05/31) for testing, split chronologically. It is noteworthy that the interval between the training set and the testing set should be not less than one month, since the data for the next month is required when defining the labels of the CRF and CLS tasks. To demonstrate the methods’ robustness, we further split the testing set into two subsets, namely Testing 1 (from 2019/04/01 to 2019/04/30) and Testing 2 (from 2019/05/01 to 2019/05/31). The statistics of the data are shown in Table 1. The feature sets used in MvMoE are described in Table 2. In order to transform the skewed distribution into a Gaussian-like distribution during training, we apply $\log_{10}(\cdot)$ to the original credit limits in both training and testing sets.

5.2 Compared Methods
The representative conventional two-step methods, recent state-of-the-art multi-task methods, and variants of the proposed model are compared to demonstrate the effectiveness of our model and analyze the effect of model structures and different input views.
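
As a concrete reading of the joint objective in Eq. (12) of Section 4.3, the following sketch evaluates the loss on plain Python lists. It is an illustration only: the function name `joint_loss`, the clipping constant `eps`, and the default `alpha=0.5` are assumptions (the text does not report the value of $\alpha$), while `lam=0.01` follows the regularizer setting reported in the implementation details.

```python
import math


def joint_loss(y1, y1_hat, y2, y2_hat, params, alpha=0.5, lam=0.01, eps=1e-7):
    """alpha * BCE(y1, y1_hat) + (1 - alpha) * MSE(y2, y2_hat) + lam * ||params||^2.

    y1, y1_hat : default labels (0/1) and predicted default probabilities
    y2, y2_hat : true and predicted (log10-transformed) credit limits
    params     : flat list of model weights entering the L2 penalty
    """
    n = len(y1)
    bce = 0.0
    for t, p in zip(y1, y1_hat):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0); not part of Eq. (12)
        bce -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    bce /= n
    mse = sum((t - p) ** 2 for t, p in zip(y2, y2_hat)) / n
    l2 = sum(w * w for w in params)
    return alpha * bce + (1.0 - alpha) * mse + lam * l2


# One benign user predicted at 50% risk, limit off by one log10 unit, one weight of 1.0.
example = joint_loss(y1=[0], y1_hat=[0.5], y2=[2.0], y2_hat=[1.0], params=[1.0])
```

With `alpha` close to 1 the objective is dominated by the classification term, and with `alpha` close to 0 by the regression term, so this single factor controls the trade-off between the two tasks.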


Table 1: The statistical information of the dataset.

Dataset     #Defaulter   #Total User   #Default Rate
Training        21,857     4,378,322          0.499%
Testing 1        2,356       508,969          0.463%
Testing 2        2,854       556,662          0.513%

Table 2: The statistics of feature sets used in MvMoE.

            User Profile        Behavior Info.   Relation Info.
Feature     Member Type         Browsing         Sharing
            Hist. Consumption   Clicking         Device
            VIP Level           Paying           Trading
            ...                 ...              ...
Dimension   100                 200              45

(a) Two-step Methods
• GBDT: It is a competitive gradient boosting model that has been widely used in industrial environments [10, 34]. Specifically, we model the CRF task first and then feed the prediction score of the CRF task into the model of the CLS task. For a fair comparison, besides user profile, we also add relation and behavior features through feature engineering to the input of GBDT.
• MvMoE_CLS, MvMoE_CRF: We remove the corresponding progressive layer in our MvMoE model, respectively, and derive two single-task methods.

(b) Multi-task Methods
• Shared-Bottom [23]: A classic multi-task network which shares one bottom network among all tasks and maintains a separate tower network for each task.
• MMoE [23]: An improvement of the shared-bottom network which uses shared experts and task-specific gates to tackle conflicts between tasks.
• MvMoE\HA, MvMoE\MS, MvMoE\PS: Three submodels of MvMoE obtained by removing the hierarchical attention layer, the view-aware MMoE structure, or the progressive structure, respectively.
• MvMoE\U, MvMoE\B, MvMoE\R: Three submodels of MvMoE ignoring view $\mathbf{V}_u$, $\mathbf{V}_b$, or $\mathbf{V}_r$, respectively.
• MvMoE: Our proposed full method.

5.3 Implementation Details
We implement the proposed model on the Keras platform [7]. We randomly initialize the model parameters with a He initializer [13] and choose Adam [18] as the optimizer. During training, the positive examples are upsampled to keep the positive rate at around 10% in our dataset. We set the batch size to 128, the learning rate to 0.001, and the L2 regularizer parameter $\lambda = 0.01$ to prevent overfitting. We adopt early stopping when the loss doesn’t improve for 5 epochs. We use 6 experts in every view-aware MMoE structure. The GBDT models have 500 trees and a maximum depth of 5.

5.4 Metrics
For the CRF task, the first metric AUC (the area under the ROC curve) is a common evaluation index to measure the quality of the binary classification model, which is defined as:

$$\mathrm{AUC} = \frac{\sum_{u \in \mathcal{U}^{+}} \mathrm{rank}_u - \frac{|\mathcal{U}^{+}| \times (|\mathcal{U}^{+}| + 1)}{2}}{|\mathcal{U}^{+}| \times |\mathcal{U}^{-}|} \tag{13}$$

Here, $\mathcal{U}^{+}$ and $\mathcal{U}^{-}$ denote the positive and negative sets in the testing set, respectively, and $\mathrm{rank}_u$ indicates the rank of user $u$ by prediction score. Since the positive rate in financial scenarios is low in general (about 0.5% in our dataset), two more widely adopted metrics are used in this paper: AP (Average Precision, the area under the precision-recall curve) and R@P$_N$ (Recall when Precision equals $N$). Specifically, we set $N = 0.5$, which is a 100-fold lift over the base positive rate in our dataset (50% vs. 0.5%). Higher AUC, AP, and R@P$_{0.5}$ indicate better performance of the approach. R@P$_{0.5}$ also reflects the ability to detect top-ranked positive samples and balance the impact on the real-world business system.

For the CLS task, we adopt MSE (Mean Squared Error) and MAE (Mean Absolute Error). The MSE is computed on the currency amount after the $\log_{10}(\cdot)$ operation, which transforms the skewed distribution into a Gaussian-like distribution during training. The MAE is computed on the currency amount, to visualize the business effect intuitively. The closer the values of MSE and MAE are to zero, the better.

6 EXPERIMENTS RESULTS
6.1 Main Results
Table 3 shows the performances of all compared methods. We mainly discuss the experimental results on Testing 1 for brevity. Major findings are summarized as follows:

6.1.1 MvMoE vs. Two-step Methods.
We can observe that MvMoE outperforms the two-step methods by a significant margin. The AP and MAE improve by 5.60% and 9.52% compared with GBDT, respectively. The improvement of R@P$_{0.5}$ by 15.46% shows that MvMoE can detect more defaulters under high precision. Moreover, MvMoE, which trains the two tasks simultaneously, gets better results than its variants, namely MvMoE_CRF and MvMoE_CLS. This demonstrates that our full model better captures the dependence between the CRF and CLS tasks. Besides, MvMoE_CRF and MvMoE_CLS perform better than GBDT, improving AP by over 3.69% and MAE by over 2.83% via adopting unstructured features (e.g., behavior sequences and relationships).

In addition, due to the imbalance between default and benign samples in the financial field, the AUC of GBDT is already very high. While the AUC improvement is relatively less obvious, AP and R@P$_{0.5}$ might be more suitable for this scenario [9].

6.1.2 MvMoE vs. Multi-task Methods.
First, MMoE performs better than Shared-Bottom, especially on the CLS task. The higher AP of MMoE indicates that it can better distinguish defaulters than Shared-Bottom. MMoE’s MAE is 2.65% smaller and its R@P$_{0.5}$ is 3.04% higher than those of Shared-Bottom, since the multi-gate structure in MMoE has a certain mitigation effect on conflicts caused by task differences. Second, our model MvMoE is more advanced than MMoE, the state-of-the-art multi-task method, with about 3.24% increased AP and 3.10% reduced MAE. There might be two main reasons. One
Table 3: Comparisons of different methods on the two testing sets.

                                    Testing 1                               Testing 2
                                    CRF                     CLS             CRF                     CLS
Group                Method         AUC     AP      R@P0.5  MSE     MAE     AUC     AP      R@P0.5  MSE     MAE
Two-step Baseline    GBDT_CRF       0.8916  0.4253  0.3761  -       -       0.8867  0.4400  0.4169  -       -
                     MvMoE_CRF      0.9125  0.4410  0.4155  -       -       0.9053  0.4540  0.4317  -       -
                     GBDT_CLS       -       -       -       0.5805  3,500   -       -       -       0.6087  3,383
                     MvMoE_CLS      -       -       -       0.5395  3,401   -       -       -       0.5611  3,288
Multi-task Baseline  Shared-Bottom  0.9074  0.4324  0.4048  0.5506  3,357   0.9002  0.4459  0.4320  0.5701  3,343
                     MMoE           0.9074  0.4350  0.4172  0.5380  3,268   0.9044  0.4462  0.4355  0.5675  3,163
Model Ablation       MvMoE\HA       0.9081  0.4187  0.3959  0.5342  3,256   0.9009  0.4313  0.4057  0.5587  3,151
                     MvMoE\MS       0.9091  0.4311  0.4099  0.5346  3,243   0.9002  0.4464  0.4250  0.5560  3,129
                     MvMoE\PS       0.9088  0.4264  0.4070  0.5516  3,412   0.9061  0.4506  0.4226  0.5712  3,298
View Ablation        MvMoE\U        0.8945  0.3975  0.3509  0.5564  3,218   0.8865  0.4104  0.3498  0.5743  3,107
                     MvMoE\B        0.9074  0.4319  0.3925  0.5785  3,431   0.9012  0.4434  0.3973  0.6060  3,295
                     MvMoE\R        0.8954  0.3835  0.3076  0.5556  3,246   0.8842  0.3895  0.2852  0.5837  3,105
Full Model           MvMoE          0.9129  0.4491  0.4342  0.5336  3,167   0.9079  0.4656  0.4524  0.5530  2,969

Figure 4: (a) Precision-recall characteristic curve in different situations for CRF task. (b) The scatter plots of the predicted risk and limits in two-steps GBDT model, Shared-Bottom, MMoE, and MvMoE from the left to the right respectively in Testing 1.
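
As a minimal sketch of how the AP and R@P0.5 columns of Table 3 and the precision-recall curves of Figure 4 (a) relate to ranked predictions, the following pure-Python example computes both metrics on toy data. It assumes that R@P0.5 denotes the highest recall reachable while precision stays at or above 0.5, and that AP is the step-wise area under the precision-recall curve. This reading, the function names, and the toy labels and scores are illustrative assumptions and are not taken from the paper.

```python
from typing import List, Tuple


def pr_points(y_true: List[int], scores: List[float]) -> List[Tuple[float, float]]:
    """Return (precision, recall) after each example, ranked by descending score.

    Assumes at least one positive label and distinct scores (no tie handling).
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    total_pos = sum(y_true)
    points = []
    true_pos = 0
    for rank, idx in enumerate(order, start=1):
        true_pos += y_true[idx]
        points.append((true_pos / rank, true_pos / total_pos))
    return points


def average_precision(y_true: List[int], scores: List[float]) -> float:
    """Step-wise area under the PR curve: sum of precision times recall increment."""
    ap, prev_recall = 0.0, 0.0
    for precision, recall in pr_points(y_true, scores):
        ap += precision * (recall - prev_recall)
        prev_recall = recall
    return ap


def recall_at_precision(y_true: List[int], scores: List[float],
                        min_precision: float = 0.5) -> float:
    """Largest recall among operating points with precision >= min_precision."""
    feasible = [r for p, r in pr_points(y_true, scores) if p >= min_precision]
    return max(feasible, default=0.0)


# Toy data (hypothetical): 1 marks a defaulter, scores are predicted risks.
y_true = [1, 1, 0, 1, 0, 0, 0, 0, 1, 0]
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10]

print("AP      =", round(average_precision(y_true, scores), 4))
print("R@P0.5  =", recall_at_precision(y_true, scores, 0.5))
```

Because both metrics depend only on how the positives are ranked, they are sensitive to changes among the top-scored users even when AUC barely moves, which is in line with the remark that AP and R@P0.5 may suit an imbalanced setting better.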

is that MvMoE distills more concise information from multiple views by utilizing a hierarchical attention mechanism. The other is that the CRF information is jointly trained with and introduced into the CLS task through the view-aware MMoE module and the progressive structure, respectively.
Moreover, as shown in Figure 4 (a), MvMoE's PR curve on the CRF task is always at the top, reflecting its best performance in different situations. In addition, the similar results on Testing 2 shown in Table 3 indicate the stability and robustness of our MvMoE. Overall, the MvMoE network proposed in this paper consistently outperforms all baseline methods for both CRF and CLS tasks.

6.2 Ablation Test
6.2.1 Effects of structures.
As shown in Table 3, the metrics all deteriorate to some extent no matter which component of MvMoE we remove. Specifically, for the CRF task, hierarchical attention contributes the most, since the AP and R@P0.5 of MvMoE\HA are reduced by 6.78% and 8.81% respectively. Furthermore, the MAE of MvMoE\PS, which removes the progressive structure, increases most significantly, by 7.75%. This shows that the forecasting of risks plays an important role in the CLS task.
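
The percentage changes quoted in the comparison and ablation discussion can be reproduced from the Testing 1 columns of Table 3. The sketch below is only a worked check under the assumption that each percentage is a change relative to the reference model's value; the helper name `relative_change` and the dictionary layout are illustrative choices, not part of the paper.

```python
def relative_change(new: float, base: float) -> float:
    """Signed change of `new` relative to `base`, in percent."""
    return (new - base) / base * 100.0


# Testing 1 values copied from Table 3 (MAE written without thousands separators).
testing1 = {
    "Shared-Bottom": {"AP": 0.4324, "R@P0.5": 0.4048, "MAE": 3357},
    "MMoE":          {"AP": 0.4350, "R@P0.5": 0.4172, "MAE": 3268},
    "MvMoE\\HA":     {"AP": 0.4187, "R@P0.5": 0.3959, "MAE": 3256},
    "MvMoE\\PS":     {"AP": 0.4264, "R@P0.5": 0.4070, "MAE": 3412},
    "MvMoE":         {"AP": 0.4491, "R@P0.5": 0.4342, "MAE": 3167},
}

# (model, reference, metric) triples behind the percentages quoted in the text.
comparisons = [
    ("MMoE", "Shared-Bottom", "MAE"),
    ("MvMoE", "MMoE", "AP"),
    ("MvMoE", "MMoE", "MAE"),
    ("MvMoE\\HA", "MvMoE", "AP"),
    ("MvMoE\\HA", "MvMoE", "R@P0.5"),
    ("MvMoE\\PS", "MvMoE", "MAE"),
]

changes = {
    (model, ref, metric): relative_change(testing1[model][metric], testing1[ref][metric])
    for model, ref, metric in comparisons
}

for (model, ref, metric), pct in changes.items():
    print(f"{metric:7s} {model:10s} vs {ref:14s}: {pct:+.2f}%")
```

The computed values agree with the figures quoted in the text up to small rounding differences in the second decimal place.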


Figure 5: Expert utilization rate distribution in different gating networks for the two tasks. The meaning of all symbols is as follows: E1–E6: Expert 1–6, respectively.

Figure 6: Attention score $\mathrm{Att}_s$ of different views.
For a deeper analysis, the three technical components of our MvMoE model, namely HA, MS, and PS, can be categorized into two groups. First, HA and MS aim to utilize multi-view data to capture complex user representations for better CRF prediction. Second, PS is designed to combine the two tasks by feeding the output of the CRF task to the CLS task to improve its performance. Hence, from the result of our ablation test, masking either HA or MS leads to worse performance on CRF, and HA dominates the overall performance. The measures on CLS are still better than those of MMoE due to the PS component. When masking PS, the degradation on CLS indicates the effectiveness of PS in supporting the CLS task. Meanwhile, HA+MS can also outperform MMoE on the CRF task since they fuse multi-view data for prediction.
6.2.2 Effects of views.
We then remove each view's information and its corresponding structure from MvMoE in turn to demonstrate the effectiveness of different views. From the last four rows in Table 3, we can see that all metrics get worse when any view-specific information is removed. Specifically, for the CRF task, the AP and R@P0.5 of MvMoE\𝑅 deteriorate by 14.61% and 29.17% respectively. Moreover, the MSE of MvMoE\𝐵 increases most significantly, by 8.42%, when the behavior view is removed. This indicates that the user's relational network plays a more significant role in the CRF task, while purchasing behaviors have a more important impact on the CLS task in our dataset. The results reflect the importance of modeling multiple views, since every view contributes positively to our tasks. Besides, these ablation tests also show that our model can still work when a data source is missing.

6.3 Visualization
6.3.1 Relation between risk and limits.
First, we analyze the relation between the risk and the limits by visualizing the predictions of the compared approaches. Figure 4 (b) shows the scatter plots of predicted risk (cf. $\hat{y}_1$) and limits (cf. $\hat{y}_2$) in the two-step GBDT model, Shared-Bottom, MMoE and our MvMoE on the two testing sets. Here the 𝑥-axis denotes the predicted risk probabilities, and the 𝑦-axis denotes the predicted limits in log-scale. It can be seen from the figure that, in GBDT, the median value of credit limits is almost the same in different sections of risk, while all multi-task models show clear correlations between the predicted risk and the forecasted credit limits. More precisely, the higher the risk, the lower the credit limits, which is consistent with the expectations in the business setting. The credit limits predicted by our model are more significantly constrained in the higher risk section compared with the other baseline multi-task models. For example, the variance of the credit limits predicted by our approach is much smaller and the average value is much closer to 0 when the risk probability is 0.8. We emphasize that it guarantees a smaller risk exposure under the condition that our MvMoE model has better overall performance on both CRF and CLS tasks. These observations illustrate that multi-task models can better capture the relationship between risk and limits [27], and among them, our model performs best.

6.3.2 Utilization rate of experts.
We next visualize the expert utilization rate distribution in different gating networks of a typical defaulter with high purchase demand, as shown in Figure 5.
The greater the height in the figure, the more important the corresponding expert. Each task has different preferences for experts within the view-aware MMoE structure. For instance, E2 in $\mathrm{VM}^{\mathrm{Intra}}_{\boldsymbol{V}_b}$ plays a more important role for the CRF task, while E4 is more important for the CLS task. More interestingly, E3 is almost the only expert that works in $\mathrm{VM}^{\mathrm{Inter}}_{\boldsymbol{V}_{(u,b)}}$ for the CRF task. Similar phenomena can be found in the CLS task. We speculate that different experts learn the characteristics of users in different dimensions. Taking this user as an example, E4 learns that the frequency of user purchases is very high in $\mathrm{VM}^{\mathrm{Intra}}_{\boldsymbol{V}_b}$, and E2 learns that user purchases have a downward trend. Moreover, the CRF task is more concerned with how to distinguish between benign users and defaulters, while the CLS task pays more attention to the accuracy of the purchase demand prediction of benign users. A higher score of E4 in $\mathrm{VM}^{\mathrm{Intra}}_{\boldsymbol{V}_b}$ for the CLS task reflects the compatibility of view-aware MMoE. This also shows that the model's experts have indeed learned useful knowledge.
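
To make the notion of per-task expert utilization concrete, the following sketch computes gate weights over six experts for the two tasks. It assumes the standard multi-gate mixture-of-experts formulation [23], in which each task owns a gate that applies a softmax to a linear projection of the input, and it treats the resulting gate weights as the utilization rates. The input vector, its dimension, and the random gate matrices are hypothetical placeholders; they are not learned parameters and do not reproduce Figure 5.

```python
import math
import random
from typing import Dict, List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gate_weights(x: List[float], w_gate: List[List[float]]) -> List[float]:
    """One logit per expert (row of w_gate dotted with x), followed by a softmax."""
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in w_gate]
    return softmax(logits)


random.seed(0)
N_EXPERTS = 6  # six experts per view-aware MMoE structure, as in the setup
DIM = 4        # arbitrary size of the (hypothetical) user representation

# Hypothetical input representation of one user within one view.
x = [random.gauss(0.0, 1.0) for _ in range(DIM)]

# One gate per task; random values stand in for learned gate parameters.
gates: Dict[str, List[List[float]]] = {
    task: [[random.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(N_EXPERTS)]
    for task in ("CRF", "CLS")
}

# Utilization of expert i for a task = that task's gate weight on expert i.
utilization = {task: gate_weights(x, w) for task, w in gates.items()}

for task, weights in utilization.items():
    row = " ".join(f"E{i + 1}={w:.2f}" for i, w in enumerate(weights))
    print(f"{task}: {row}")
```

Since each task has its own gate, the same input can place most of its weight on different experts for CRF and for CLS, which is the kind of task-specific preference the visualization describes.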


6.3.3 Attention value of views.
Finally, we visualize the scores of the two attention layers of the view-aware MMoE layer, as shown in Figure 6. The higher the score, the more important the corresponding view. The results show that both CRF and CLS tasks have their preferred data views. Specifically, the CRF task obtains information from more views and focuses on $\mathrm{VM}^{\mathrm{Intra}}_{\boldsymbol{V}_r}$, since gang crimes are quite common in the financial scenario. Individual users may look benign, but many abnormal structures are detected once the users are connected through the network structures (e.g., ring networks and spindle networks). On the other hand, the CLS task pays more attention to $\mathrm{VM}^{\mathrm{Intra}}_{\boldsymbol{V}_b}$, since the user's purchasing behaviors reflect his/her activity and procurement potential well.
From the visualizations, we recognize that MvMoE has better interpretability on the results, especially compared with the original MMoE structure. The interpretability of MvMoE supports its deployment on online platforms since it provides strong evidence and may strengthen the confidence of financial decision makers.

7 CONCLUSION
In this paper, we investigated the correlation between credit risk forecasting and credit limits setting tasks in the financial scenario and proposed a novel multi-task learning neural network, MvMoE, to bolster their performances simultaneously. MvMoE consumes heterogeneous multi-view information and is equipped with a view-aware multi-gate mixture-of-experts structure to perform multi-task learning and facilitate interpretability. It thus considers the interrelationships between multi-view information and the relationship between different tasks. We performed various experiments on a real-world industrial dataset to evaluate the performance of MvMoE. The experimental results showed the superiority of MvMoE in different settings. Meanwhile, the interpretability of MvMoE is visualized by a set of empirical evaluations.

ACKNOWLEDGMENTS
This work is supported by Alibaba Group through Alibaba Innovative Research Program. Xiang Ao is partially supported by the National Natural Science Foundation of China under Grant No. 61976204, 92046003 and U1811461, the Project of Youth Innovation Promotion Association CAS and Beijing Nova Program Z201100006820062.

REFERENCES
[1] Bertrand-H Abtey. 2002. Comment évaluer les risques liés aux investissements. Dunod.
[2] Edward I Altman. 1968. Financial ratios, discriminant analysis and the prediction of corporate bankruptcy. The Journal of Finance 23, 4 (1968), 589–609.
[3] Edward I Altman, Andrea Resti, and Andrea Sironi. 2005. Recovery Risk: The next Challenge in Credit Risk Management. Risk Books.
[4] Mehdi Bazzi and Chamlal Hasna. 2015. Rating models and its Applications: Setting Credit Limits. Journal of Applied Finance and Banking 5, 5 (2015), 201.
[5] Dan Cheng and Pasquale Cirillo. 2018. A reinforced urn process modeling of recovery rates and recovery times. Journal of Banking & Finance 96 (2018), 1–17.
[6] Jianfeng Chi, Guanxiong Zeng, Qiwei Zhong, Ting Liang, Jinghua Feng, Xiang Ao, and Jiayu Tang. 2020. Learning to Undersampling for Class Imbalanced Credit Risk Forecasting. In ICDM.
[7] François Chollet et al. 2015. Keras. https://github.com/fchollet/keras.
[8] Stijn Claessens, Jon Frost, Grant Turner, and Feng Zhu. 2018. Fintech credit markets around the world: size, drivers and policy issues. BIS Quarterly Review September (2018).
[9] Jesse Davis and Mark Goadrich. 2006. The relationship between Precision-Recall and ROC curves. In ICML. ACM, 233–240.
[10] Long Guo, Lifeng Hua, Rongfei Jia, Binqiang Zhao, Xiaobo Wang, and Bin Cui. 2019. Buying or Browsing?: Predicting Real-time Purchasing Intent using Attention-based Deep Network with Multiple Behavior. In KDD. ACM, 1984–1992.
[11] Will Hamilton, Zhitao Ying, and Jure Leskovec. 2017. Inductive Representation Learning on Large Graphs. In NIPS. 1024–1034.
[12] David J Hand and William E Henley. 1997. Statistical classification methods in consumer credit scoring: a review. Journal of the Royal Statistical Society: Series A (Statistics in Society) 160, 3 (1997), 523–541.
[13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In ICCV. 1026–1034.
[14] Binbin Hu, Zhiqiang Zhang, Chuan Shi, Jun Zhou, Xiaolong Li, and Yuan Qi. 2019. Cash-out User Detection based on Attributed Heterogeneous Information Network with a Hierarchical Attention Mechanism. In AAAI. 946–953.
[15] Yi Huang, Ye Li, and Hongzhe Shan. 2018. Fintech and Firm Selection: Evidence from E-commerce Platform Lending. (2018).
[16] Yusuf Tansel İç. 2012. Development of a credit limit allocation model for banks using an integrated Fuzzy TOPSIS and linear programming. Expert Systems with Applications 39, 5 (2012), 5309–5316.
[17] Tor Jacobson and Kasper Roszbach. 2003. Bank lending policy, credit scoring and value-at-risk. Journal of Banking & Finance 27, 4 (2003), 615–633.
[18] Diederik P Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. In ICLR.
[19] Xue Li, Hongdi Zhang, Qian Wang, Xiaogang Chen, Juan Shi, and Qian Jia. 2019. The Influence of Online Personal Consumer Credit Products on Consumers’ Impulse Purchasing Intention: A case study of Ant Credit Pay. In ICEBT. 59–66.
[20] Can Liu, Qiwei Zhong, Xiang Ao, Li Sun, Wangli Lin, Jinghua Feng, Qing He, and Jiayu Tang. 2020. Fraud Transactions Detection via Behavior Tree with Local Intention Calibration. In KDD.
[21] Yang Liu, Xiang Ao, Qiwei Zhong, Jinghua Feng, Jiayu Tang, and Qing He. 2020. Alike and Unlike: Resolving Class Imbalance Problem in Financial Credit Risk Assessment. In CIKM. 2125–2128.
[22] Ziqi Liu, Chaochao Chen, Xinxing Yang, Jun Zhou, Xiaolong Li, and Le Song. 2018. Heterogeneous Graph Neural Networks for Malicious Account Detection. In CIKM. 2077–2085.
[23] Jiaqi Ma, Zhe Zhao, Xinyang Yi, Jilin Chen, Lichan Hong, and Ed H Chi. 2018. Modeling task relationships in multi-task learning with multi-gate mixture-of-experts. In KDD. ACM, 1930–1939.
[24] Milad Malekipirbazari and Vural Aksakalli. 2015. Risk assessment in social lending via random forests. Expert Systems with Applications 42, 10 (2015), 4621–4631.
[25] Sami Mestiri and Abdeljelil Farhat. 2019. Using non-parametric count model for credit scoring. Available at SSRN 3464812 (2019).
[26] Carsten AW Paasch. 2008. Credit Card Fraud Detection Using Artificial Neural Networks Tuned by Genetic Algorithms. Ph.D. Thesis. Hong Kong University of Science and Technology (2008).
[27] Sebastian Ruder. 2017. An overview of multi-task learning in deep neural networks. arXiv preprint arXiv:1706.05098 (2017).
[28] Andrei A Rusu, Neil C Rabinowitz, Guillaume Desjardins, Hubert Soyer, James Kirkpatrick, Koray Kavukcuoglu, Razvan Pascanu, and Raia Hadsell. 2016. Progressive neural networks. arXiv preprint arXiv:1606.04671 (2016).
[29] Dilip Soman and Amar Cheema. 2002. The effect of credit on spending decisions: The role of the credit limit and credibility. Marketing Science 21, 1 (2002), 32–53.
[30] Lyn C Thomas, David B Edelman, and Jonathan N Crook. 2002. Credit scoring and its applications. SIAM.
[31] Daixin Wang, Jianbin Lin, Peng Cui, Quanhui Jia, Zhen Wang, Yanming Fang, Quan Yu, Jun Zhou, Shuang Yang, and Yuan Qi. 2019. A Semi-supervised Graph Attentive Network for Financial Fraud Detection. In ICDM. 598–607.
[32] Jianyu Wang, Rui Wen, Chunming Wu, Yu Huang, and Jian Xion. 2019. FdGars: Fraudster Detection via Graph Convolutional Networks in Online App Review System. In WWW. ACM, 310–316.
[33] Qi Wang, Zhihui Ji, Huasheng Liu, and Binqiang Zhao. 2019. Deep Bayesian Multi-Target Learning for Recommender Systems. arXiv preprint arXiv:1902.09154 (2019).
[34] Shen Xin, Martin Ester, Jiajun Bu, Chengwei Yao, Zhao Li, Xun Zhou, Yizhou Ye, and Can Wang. 2019. Multi-task based Sales Predictions for Online Promotions. In CIKM. 2823–2831.
[35] Qiwei Zhong, Yang Liu, Xiang Ao, Binbin Hu, Jinghua Feng, Jiayu Tang, and Qing He. 2020. Financial Defaulter Detection on Online Credit Payment via Multi-view Attributed Heterogeneous Information Network. In WWW. 785–795.
