Ying Gao, Haolang Xiao, Choujun Zhan, Lingrui Liang, Wentian Cai et al.
PII: S0020-0255(23)01032-0
DOI: https://doi.org/10.1016/j.ins.2023.119447
Reference: INS 119447
Please cite this article as: Y. Gao, H. Xiao, C. Zhan et al., CATE: Contrastive augmentation and tree-enhanced embedding for credit scoring, Information
Sciences, 119447, doi: https://doi.org/10.1016/j.ins.2023.119447.
This is a PDF file of an article that has undergone enhancements after acceptance, such as the addition of a cover page and metadata, and formatting for
readability, but it is not yet the definitive version of record. This version will undergo additional copyediting, typesetting and review before it is published in its
final form, but we are providing this version to give early visibility of the article. Please note that, during the production process, errors may be discovered which
could affect the content, and all legal disclaimers that apply to the journal pertain.
1. Introduction
Credit loaning to customers is one of the primary functions of financial institutions. However, credit risk is
a significant hazard to the regular operations of these institutions [1], as evidenced by non-performing loans in
commercial banks in China, which reached 2.8 trillion yuan by the end of the fourth quarter of 2021, equivalent
to 1.73% of all loans made by the banking sector. The consequences of excessive credit risk can be dire, leading to the
bankruptcy of associated businesses [2] and negatively impacting the operating performance of financial institutions
[3]. In this context, credit risk assessment, also known as "credit scoring", is a crucial stage in credit risk management,
as it alleviates the threat posed by information asymmetry.
Petrides et al. [4] demonstrated that a reliable credit scoring model significantly impacts the risk management
and profitability of financial institutions. Conventional credit scoring is a statistical approach that generates a score
indicating the credibility of a loan application based on transaction and borrower-related data [5]. This score is obtained
by calculating the expected probability of default (PD) and can be simplified to a classification task. While statistical
analysis has been historically dominant in credit scoring due to its easy implementation and clear interpretability,
it is not without limitations. Specifically, statistical analysis relies on strong assumptions, such as linear separability
and standard distribution [6]. Consequently, it may encounter limitations in specific scenarios, such as cases where
variables fail to exhibit a linear relationship [7] or when large datasets are involved [8].
Given the large scale, high complexity, and nonlinear nature of credit data, it is challenging to develop an accurate
credit scoring model through statistical analysis. To meet the development needs of the credit industry, machine
learning (ML)-based methods are applied to credit scoring. Baesens et al. [9] demonstrated the advantages of ML-
based methods over traditional statistical analysis in a variety of applications.
Ensemble methods have gained widespread application in the field of credit scoring due to their flexibility and
superior performance. Ensemble models combine various individual models to create an improved overall model. In
∗ Corresponding author.
E-mail addresses: gaoying@scut.edu.cn (Y. Gao); holland.shaw.chn@gmail.com (H. Xiao); zchoujun2@gmail.com (C. Zhan); huxp@bit.edu.cn (X. Hu)
conjunction with the ensemble learning framework, the selection of suitable training samples serves as an additional
technique to enhance classification performance [10]. Comparative studies conducted by Niu et al. [11] and Shen et al. [12]
indicate that ensemble models generally exhibit superior performance compared to individual models. Consequently,
ensemble models have become one of the most active research areas in the field of credit scoring in recent years [13].
Although tree-based ensemble models have been shown to outperform individual decision trees in terms of
evaluation accuracy [14], they lack the explanatory information necessary for decision-making [15]. In the field of
credit scoring, interpretability is a crucial aspect of model performance [16]. Cross-features, which combine intervals
of multiple feature variables, have proven effective in modeling feature interactions and representing structured data,
providing richer semantic information than individual features. However, existing studies on the post-interpretation
of complex models concentrate primarily on the relationship between individual features and credit risk [17, 18], and
rarely consider the interactions between features. To bridge this gap in the literature, this study aims to develop a credit
scoring model that achieves high performance and interpretability. In terms of performance, we anticipate that the
proposed model can match the accuracy level of state-of-the-art credit scoring models. In terms of interpretability, our
objective is to devise a model that can locate the primary cross-features during evaluations.
This study suggests a credit scoring model based on contrastive augmentation and tree-enhanced embedding
(CATE) mechanisms, aiming to address the aforementioned problems. Inspired by the insights of Wang et al.
[19], we leverage tree-based ensemble models to transform the initial features of credit transactions into cross-features.
The main contributions of this study include the following three aspects:
1. This work introduces a novel approach for credit scoring that combines the strengths of both tree-based ensemble
models and an embedding-based model. The embedding-based model is known for its strong generalization ability,
while the tree-based ensemble models can automatically generate interpretable cross-features. The additive
attention mechanism can provide intrinsic local interpretability for CATE by identifying the local cross-features
that receive greater attention scores during evaluation. Furthermore, we propose a decision path information
fusion mechanism to enable the embedding-based model to learn the construction pattern of the tree-based
ensemble models. This mechanism reduces the gap between the two components and facilitates information
propagation, addressing the limitations that arise from separately modeling the embedding-based model and
tree-based ensemble models;
2. This work devises a dual-task learning technique that combines the classification task with the contrastive
learning framework. Multiple tree-based ensemble models are utilized to generate augmented samples from
various perspectives. This augmentation enriches the training data and allows for more effective utilization
of contrastive learning. The contrastive task clusters elements belonging to the same class while separating
those from different classes in the embedding space. By incorporating dual-task learning, the classification task
becomes more effective, enabling CATE to learn more discriminative representation vectors corresponding to
different classes of transactions. Consequently, CATE demonstrates an enhanced capability in identifying risky
transactions;
3. The proposed model has exhibited promising performance across four credit transaction datasets, showcasing its
effectiveness in credit scoring. Furthermore, an in-depth analysis of the intrinsic local interpretability of CATE
is conducted through a specific case study. The findings reveal that the proposed model identifies and utilizes
distinct cross-features, highlighting its interpretability and providing valuable insights into its decision-making
process.
According to the experimental findings from four public credit datasets, the proposed solution combines the
advantages of the embedding-based method and the tree-based ensemble models. As a result, it creates a credit scoring
model with superior evaluation performance and interpretability.
2. Related work
This section provides an overview of related research developments in credit scoring, specifically focusing on
classical individual models and ensemble methods. Subsequently, it explores feature transformation based on
tree-based ensemble models, henceforth referred to as "tree-enhanced" models. Additionally, the section presents
relevant research on recurrent neural networks and contrastive learning, which offers further insight into the
innovative approaches being employed in the field of credit scoring.
linearize the original features, while forgeNet is employed to handle the transformed high-dimensional data and uncover
the underlying relationships between features. Wu et al. [37] introduced the tree-enhanced deep adaptive network
(TEDAN) to address challenges such as overfitting and large training gradient variance. These studies demonstrate
that tree-enhanced methods can automatically learn decision rules from data and model valid and interpretable high-
order cross-features. In this study, a decision path information fusion method is designed to preserve the structural
information of DTs, allowing the local cross-features to retain as much structural information as possible.
3. Preliminaries
This section outlines the process of constructing cross-features, a vital component of the credit scoring methodology
discussed in Section 4.
However, embedding-based representation learning models capture the interaction effects between features during
training in an opaque manner, which fails to meet our interpretability requirements. To address this limitation, a
common industrial solution for making cross-features explicit and interpretable is to manually create cross-features
that are then fed into an interpretable method like LR. In doing so, LR can learn the importance of each cross-feature.
For example, cross-features can be generated by combining intervals of feature variables 𝑥age and 𝑥term to produce
a second-order cross-feature [𝑥age > 18] ∧ [𝑥t36 = 1], where the 𝑥t36 variable indicates whether the loan term is 36
months. However, the manual creation of high-order cross-features using this approach poses a challenge in terms of
scalability. A large number of feature variables must be intertwined to model high-order cross-features, resulting in
an exponential increase in complexity that is difficult to manage manually. Although complexity can be controlled
to a certain extent through fine feature engineering, such as crossing only important features, this approach requires
extensive relevant domain knowledge and lacks cross-domain adaptability.
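As a minimal illustration of this manual approach, the crossed indicator can be built by hand and fed to an LR model, which then learns one weight per cross-feature. This is only a sketch with synthetic data; the feature names (`age`, `term36`) follow the example in the text:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 200
age = rng.integers(18, 70, size=n)    # applicant age
term36 = rng.integers(0, 2, size=n)   # 1 if the loan term is 36 months

# Manually crossed second-order feature: [age > 18] AND [term == 36 months]
cross = ((age > 18) & (term36 == 1)).astype(float).reshape(-1, 1)

# Synthetic default labels loosely tied to the cross-feature
y = (rng.random(n) < 0.2 + 0.3 * cross.ravel()).astype(int)

lr = LogisticRegression().fit(cross, y)
# lr.coef_ holds one learned importance weight per cross-feature
```

Scaling this to high-order crosses requires enumerating combinations of intervals, which is exactly the exponential blow-up the text describes.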
[Fig. 1: An example forest of DTs. Each internal node splits a feature variable 𝑥𝑖 against a threshold 𝑎𝑗, with "yes"/"no" decision edges leading to child nodes and, ultimately, leaf nodes.]
In contrast to embedding-based representation learning, tree-based models do not acquire embedding vectors but
rather learn decision rules from empirical data. The appeal of this strategy lies in its effectiveness and interpretability.
Fig. 1 illustrates a DT model 𝑇 = {𝑉 , 𝐸}, where 𝑉 represents the set of nodes and 𝐸 represents the set of edges. The
nodes can be partitioned into three subsets 𝑉 = 𝑉𝑅 ∪ 𝑉𝐼 ∪ 𝑉𝐿 , where 𝑉𝑅 denotes the set consisting solely of the root
node 𝑣𝑅 , 𝑉𝐼 denotes the set of internal nodes, and 𝑉𝐿 denotes the set of leaf nodes. Each internal node 𝑣𝑖 ∈ 𝑉𝐼 in the
tree splits a feature variable 𝑥𝑖 ∈ 𝒙 by utilizing two decision edges. When dealing with numerical feature variables like
income, the node chooses a threshold 𝑎𝑗 ∈ ℝ and splits the feature into [𝑥𝑖 ≤ 𝑎𝑗] and [𝑥𝑖 > 𝑎𝑗]; when dealing with
categorical feature variables like gender, one-hot encoding is used to convert them to binary variables first. The node
then splits the feature into [𝑥𝑖 = 𝑎𝑗] and [𝑥𝑖 ≠ 𝑎𝑗] based on whether or not it is equal to a certain value.
The path connecting 𝑣𝑅 and any leaf node 𝑣𝑙 ∈ 𝑉𝐿 represents a decision rule, referred to as "decision path" in this
paper, which can also be viewed as a cross-feature. Cross-features combine the local intervals of multiple features. In
Fig. 1, for example, node 𝑣7 represents the cross-feature [𝑥0 > 𝑎0 ] ∧ [𝑥2 > 𝑎0 ] ∧ [𝑥4 > 𝑎3 ]. Given the initial feature
vector of a transaction, the DT determines which leaf node the transaction will reach. DT can be thought of as mapping
feature vectors to leaf nodes based on the unique structure of the tree. In this mechanism, the path of the activated leaf
node can be regarded as the most concerned cross-feature in the decision process of the DT. Consequently, tree-based
models are considered inherently self-interpretable. This method of generating cross-features avoids labor-intensive
feature engineering and effectively resolves the issues of difficult expansion and lack of cross-domain adaptability that
arise when cross-features are manually produced.
An individual DT may be inadequate to capture complex patterns in credit transaction data due to the limitation of
mapping only a single cross-feature for each transaction. To overcome this, a popular solution is to use a tree-based
ensemble model to generate a more diverse set of cross-features. In this study, we extract cross-features from the raw
data of credit transactions using pre-trained tree-based ensemble models. Although tree-based ensemble models are not
explicitly designed for cross-feature extraction, it is reasonable to assume that the leaf nodes represent effective cross-
features for credit scoring, given that each DT is trained and optimized for the classification task. As an illustration, we
utilize GBDT, which enhances the overall performance by integrating multiple additive trees. Assuming that the forest
consists of 𝜏 DTs and the output of the 𝑡-th DT model is denoted as 𝑦̂_DT^(𝑡), the GBDT can be expressed by Eq. (2):

𝑦̂_GBDT(𝒙) = ∑_{𝑡=1}^{𝜏} 𝑦̂_DT^(𝑡)(𝒙). (2)
GBDT can be conceptualized as a set of decision trees 𝑄 = {𝑄1 , … , 𝑄𝜏 }, wherein each tree 𝑄𝑡 is responsible for
mapping the initial feature vector 𝒙 to a specific leaf node 𝑄𝑡 (𝒙). The number of leaves in the 𝑡-th tree is denoted by 𝐿𝑡 .
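If the GBDT is fit with scikit-learn, the mapping 𝑄𝑡(𝒙) from a feature vector to the activated leaf of each tree can be read off with `apply`. A sketch on synthetic data (dataset and hyperparameters are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=300, n_features=5, random_state=0)

tau = 10  # number of trees in the forest
gb = GradientBoostingClassifier(n_estimators=tau, max_depth=3, random_state=0).fit(X, y)

# Q_t(x): index of the leaf that x reaches in the t-th tree.
# For binary classification, apply() returns shape (n_samples, n_estimators, 1).
leaves = gb.apply(X)[:, :, 0].astype(int)
# leaves has one activated leaf index per tree per transaction: shape (300, tau)
```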
Table 1
The semantics of feature variables (V) and thresholds (T) of the GBDT in Fig. 1.
V     Semantic                                     T     Value
𝑥0    Number Of Times 90 Days Late                 𝑎0    0.5000
𝑥1    Revolving Utilization Of Unsecured Lines     𝑎1    0.0609
𝑥2    Number Of Time 60-89 Days Past Due           𝑎2    1.5000
𝑥3    Number Of Time 30-59 Days Past Due           𝑎3    0.0410
𝑥4    Debt Ratio                                   𝑎4    0.5010
We consider the leaf node activated by the initial feature vector as the corresponding local cross-feature, and represent
it using a one-hot vector denoted as 𝒇 𝑡 :
𝑓_𝑡^(𝑖) = 1 if 𝑖 = 𝑄_𝑡(𝒙), and 𝑓_𝑡^(𝑖) = 0 otherwise, for 𝑖 ∈ {1, … , 𝐿_𝑡}. (3)
In contrast to the vanilla GBDT, wherein the evaluation weights of all activated leaf nodes are aggregated to obtain
the final output, our proposed method involves preserving and concatenating all activated leaf nodes to produce the
global cross-feature (hereafter referred to as the "cross-feature"). The total number of leaf nodes in the forest is denoted
as 𝑁_𝐿 = ∑_{𝑡=1}^{𝜏} 𝐿_𝑡. We represent the resulting cross-feature as a multi-hot vector 𝒒 ∈ {0, 1}^{𝑁_𝐿}, obtained by
concatenating the one-hot vectors of all 𝜏 trees:

𝒒 = [𝒇_1; … ; 𝒇_𝜏], (4)
where 𝒒 denotes a sparse vector, with binary elements that assume the value of 1 corresponding to the leaf nodes
that are activated by the initial feature vector in each tree. In contrast, the elements equal to 0 represent all of the
non-activated leaf nodes in the forest.
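A sketch of assembling the multi-hot cross-feature 𝒒 by one-hot encoding each tree's activated leaf index and concatenating the results. Synthetic data; as a simplification, the leaf-index vocabulary of each tree is taken from the leaves observed in the training data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import OneHotEncoder

X, y = make_classification(n_samples=300, n_features=5, random_state=0)
gb = GradientBoostingClassifier(n_estimators=5, max_depth=3, random_state=0).fit(X, y)

leaves = gb.apply(X)[:, :, 0].astype(int)  # (n_samples, tau) activated leaf ids

# One-hot encode each tree's column and concatenate: q in {0, 1}^{N_L}
enc = OneHotEncoder()
q = enc.fit_transform(leaves).toarray()

# Each row of q has exactly tau ones: one activated leaf per tree
assert np.all(q.sum(axis=1) == 5)
```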
The GBDT illustrated in Fig. 1 comprises two subtrees, denoted as 𝑄1 and 𝑄2 , both consisting of 8 leaf nodes.
Assuming that 𝒙 eventually reaches the eighth leaf node of 𝑄1 and the sixth leaf node of 𝑄2 , the corresponding
cross-feature 𝒒 is represented as [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0]. The specific semantics of the feature variables
{𝑥0 , … , 𝑥4 } and thresholds {𝑎0 , … , 𝑎4 } are provided in Table 1. Furthermore, the semantic information of the two
local cross-features derived from 𝒙 can be described as follows: 1) The number of payments that are overdue for more
than 90 days exceeds 0.5, the number of payments overdue for 60 to 89 days is greater than 0.5, and the proportion of
debt is greater than 4.1%; 2) The number of payments that are overdue for more than 90 days exceeds 0.5 but not more
than 1.5, and the number of payments overdue for 30 to 59 days is greater than 1.5.
It is important to acknowledge that various tree-based ensemble models, including RF, extreme gradient boosting
(XGBoost), and light gradient boosting machine (LightGBM), can produce cross-features similar to those generated by
GBDT. Compared with XGBoost and LightGBM, RF demonstrates greater randomness, which leads
to the creation of a more diverse set of cross-features; on the other hand, GBDT leverages a more comprehensive
decision path, resulting in the generation of cross-features with richer semantic information. In this study, RF and
GBDT are the selected models to generate cross-features.
4. Methodology
In this section, we elaborate on the model based on contrastive augmentation and tree-enhanced embedding
mechanisms. We propose three mechanisms to enhance the collaboration between the tree-based ensemble models
and the embedding-based model: 1) Data augmentation technique: We consider constructing cross-features by tree-
based ensemble models to be the cropping of the initial features. Multiple tree-based ensemble models are pre-trained
to enrich the diversity of the training data, resulting in more robust representations with improved generalization
capability; 2) Decision path information fusion technique: We aggregate the embedding vectors corresponding to each
node on the decision path as its representation to preserve as much structural information as possible about the decision
trees; 3) Dual-task learning technique: In addition to the classification loss, we employ the supervised contrastive loss
to encourage the embedding-based model to generate distinct representations across different categories. As illustrated
in Fig. 2, CATE is primarily composed of four stages:
[Fig. 2: The CATE architecture. Transaction features (e.g., age, income, amount, debt ratio, days past due) are fed to pre-trained GBDT and RF models (data augmentation); the activated decision paths are fused by LSTM blocks (decision path information fusion); a shared attention network aggregates the path fusion vectors into the representations 𝒓_GBDT and 𝒓_RF; a classification network outputs 𝑦̂_cls, while projection networks produce proj_GBDT and proj_RF for the contrastive task (dual-task learning).]
1. Multiple tree-based ensemble models are pre-trained for data augmentation to generate diverse cross-features
for each credit transaction;
2. The embedding vectors corresponding to each node on the decision path are aggregated so that the representation
corresponding to the decision path retains as much tree structure information as possible;
3. The interactions between the initial feature and the local cross-features are learned through an additive attention
mechanism, where the local cross-features are assigned various attention weights according to the initial feature
vector for each transaction;
4. Dual-task learning is employed. The first task aims to improve the separability of the learned representations
across distinct categories and increases their similarity within the same category through supervised contrastive
learning. The second task involves optimizing the model parameters through a classification task to improve the
suitability of the learned representations for credit scoring scenarios.
Each of the four stages will be described in detail in the following sections.
𝒒_GBDT = GBDT(𝒙 | 𝑄_GBDT),  𝒒_RF = RF(𝒙 | 𝑄_RF), (5)

where 𝑁_𝐿^𝐺 and 𝑁_𝐿^𝑅 denote the number of leaves in the GBDT and RF, respectively, while 𝑄_GBDT and 𝑄_RF denote
the sets of DTs corresponding to the GBDT and RF, respectively.
where 𝜙(⋅) denotes the function to remove all zero row vectors from the matrix. As 𝒒 is a sparse vector with very
few nonzero elements, the resulting embedding matrix only contains the embedding vectors that correspond to the
activated leaf nodes.
Intuitively, we posit that the embedding vectors of local cross-features corresponding to neighboring leaf nodes
will be similar due to the partial overlap between the decision paths on which they lie. To elaborate, assume there
are 𝑁 nodes in the forest; each node 𝑣_𝑛 ∈ ⋃_{𝑡=1}^{𝜏} 𝑉^(𝑡) is mapped to a learnable embedding vector 𝒆_𝑛 ∈ ℝ^𝑑. To
enhance the connection between the embedding vectors and the raw data, the embedding vectors are initialized with
the mean of the samples reaching the corresponding nodes in the forest. Each leaf node 𝑣𝑙 is then mapped to the path
embedding matrix 𝑷 𝑙 ∈ ℝ|𝑃 (𝑙)|×𝑑 , which results from concatenating the embedding vectors corresponding to all nodes
along the decision path:
where 𝑃 (𝑙) denotes the index set of nodes in the decision path corresponding to the leaf node 𝑣𝑙 . To accommodate
varying path lengths |𝑃 (𝑙)|, the row vectors are aggregated into a path fusion vector. To retain the tree structure
information to the greatest extent possible, LSTM is adopted to learn the construction process of the DT for aggregation:
𝒃_𝑜^(𝑛) = LSTM(𝒆_(𝑛), 𝒃_𝑜^(𝑛−1)), (8)

where 𝒆_(𝑛) denotes the 𝑛-th row in the path embedding matrix and 𝒃_𝑜^(𝑛) denotes the corresponding output of the
LSTM block. The mechanism is illustrated in Fig. 3a. Let 𝐿 denote the final row index in the path embedding matrix.
The resulting path fusion vector 𝒑_𝑙 ∈ ℝ^𝑑 is represented as follows:

𝒑_𝑙 = 𝒃_𝑜^(𝐿). (9)
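The path fusion step can be sketched with a standard LSTM, feeding the node embeddings in path order and taking the final output as the fusion vector. A minimal PyTorch sketch, assuming PyTorch; dimensions are illustrative and the embeddings are random stand-ins for learned vectors:

```python
import torch
import torch.nn as nn

d = 16         # embedding dimension
path_len = 4   # |P(l)|: number of nodes on the decision path, root to leaf

# Path embedding matrix P_l: one d-dim embedding per node on the path
P_l = torch.randn(path_len, 1, d)  # (seq_len, batch, d)

# Eq. (8)-(9): run the LSTM over the path and keep the final output
lstm = nn.LSTM(input_size=d, hidden_size=d)
outputs, _ = lstm(P_l)             # (seq_len, batch, d)
p_l = outputs[-1, 0]               # b_o^(L): path fusion vector for leaf v_l
assert p_l.shape == (16,)
```

Because adjacent leaves share a prefix of their paths, their fusion vectors are computed from overlapping LSTM inputs, which is how the tree structure enters the representation.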
By applying the above method, the path fusion vectors corresponding to adjacent leaf nodes contain information
about the overlapping nodes between their decision paths, thereby incorporating the tree structure information into the
path fusion vectors. We replace dense embedding vectors with path fusion vectors to construct the path fusion matrix
𝑬 𝑃 ∈ ℝ𝜏×𝑑 :
𝑬_𝑃 = 𝜙([𝑞_1 𝒑_1, … , 𝑞_{𝑁_𝐿} 𝒑_{𝑁_𝐿}]), ∀ 𝑞_𝑙 ≠ 0 and 𝑞_𝑙 ∈ 𝒒, (10)
The path fusion matrix describes the high-level semantic information of all activated local cross-features and can be
utilized in the downstream classification task. This approach strengthens the connection between the upstream tree-
based ensemble models and the downstream embedding-based model.
[Fig. 3: (a) The LSTM block used for decision path information fusion, taking the path embedding row 𝒆^(𝑛) and the previous output 𝒃_𝑜^(𝑛−1) as input and producing 𝒃_𝑜^(𝑛); (b) the attention network, which maps 𝒙 and 𝒑_𝑙 through hidden layers to the attention score 𝑎′_𝑥𝑙; and an illustration of the contrastive objective, which maximizes the similarity between representations of the same class (𝒁_𝑖, 𝒁_𝑗) and minimizes it across different classes (𝒁_𝑎, 𝒁_𝑏).]
where 𝐿(𝒙) denotes the index set of leaf nodes activated by 𝒙, with |𝐿(𝒙)| = 𝜏; 𝑾 ∈ ℝ^{𝑑_ℎ×(𝑑_𝑥+𝑑)} and 𝒃 ∈ ℝ^{𝑑_ℎ} represent
the weight matrix and bias vector of the hidden layer, respectively; 𝑑_ℎ denotes the dimension of the hidden vector; the
output of the hidden layer is mapped to the attention weight by the vector 𝒉 ∈ ℝ^{𝑑_ℎ}; and 𝜎(⋅) represents the ReLU
activation function. The attention weights are then normalized using the softmax function.
Fig. 3b illustrates our attention network. Notably, the attention network is shared to calculate the attention weights
for the local cross-features output by different tree-based ensemble models. These attention weights indicate which
local cross-features receive greater consideration during the evaluation process. We aggregate path fusion vectors
using attention weights to obtain the representation 𝒓 ∈ ℝ𝑑 corresponding to each transaction:
𝒓 = Enc(𝒙, 𝑬_𝑃) = ∑_{𝑙∈𝐿(𝒙)} 𝑎_{𝑥𝑙} 𝒑_𝑙, (12)
By obtaining the representation through a weighted sum, the path fusion vectors that receive smaller attention weights
have a relatively minimal effect on the ultimately generated representation. The incorporation of embedding and
attention mechanisms provides CATE with robust representation capabilities and guarantees the effectiveness of the
model.
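The additive attention aggregation can be sketched in NumPy, with random values standing in for the learned parameters 𝑾, 𝒃, and 𝒉 and for the path fusion vectors; the dimensions are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
tau, d, d_x, d_h = 5, 8, 10, 12   # trees, embedding dim, feature dim, hidden dim

x = rng.normal(size=d_x)             # initial feature vector
P = rng.normal(size=(tau, d))        # path fusion vectors p_l for activated leaves

W = rng.normal(size=(d_h, d_x + d))  # hidden-layer weight matrix
b = rng.normal(size=d_h)             # hidden-layer bias
h = rng.normal(size=d_h)             # projection to a scalar attention score

relu = lambda v: np.maximum(v, 0.0)

# Unnormalized additive attention score per activated leaf, then softmax
scores = np.array([h @ relu(W @ np.concatenate([x, p]) + b) for p in P])
a = np.exp(scores - scores.max())
a /= a.sum()

# Eq. (12): weighted sum of path fusion vectors -> representation r
r = (a[:, None] * P).sum(axis=0)
assert np.isclose(a.sum(), 1.0) and r.shape == (d,)
```

The normalized weights `a` are exactly what the case study inspects: larger weights flag the local cross-features the model attends to.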
where 𝐼 ≡ {1, … , 2𝑁_𝑠} denotes the index set of augmented samples, 𝐴(𝑖) ≡ 𝐼 ⧵ {𝑖} denotes the index set of the other
augmented samples excluding sample 𝑖, 𝑆(𝑖) ≡ {𝑠 ∈ 𝐴(𝑖) : 𝑦_𝑠 = 𝑦_𝑖} denotes the index set of the other augmented
samples sharing the label of sample 𝑖, and 𝜏_𝑐 ∈ ℝ^+ is a scalar temperature parameter.
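These index sets match the standard supervised contrastive loss over normalized projections. A minimal NumPy sketch with synthetic projections; the function name and data are hypothetical:

```python
import numpy as np

def supcon_loss(Z, y, tau_c=0.1):
    """Supervised contrastive loss over L2-normalized projection vectors Z."""
    Z = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    sim = Z @ Z.T / tau_c                    # pairwise similarities / temperature
    n = len(y)
    loss = 0.0
    for i in range(n):
        A = [a for a in range(n) if a != i]  # A(i): all other samples
        S = [s for s in A if y[s] == y[i]]   # S(i): same-label samples
        if not S:
            continue
        denom = np.sum(np.exp(sim[i, A]))
        loss += -np.mean([np.log(np.exp(sim[i, s]) / denom) for s in S])
    return loss / n

rng = np.random.default_rng(0)
Z = rng.normal(size=(8, 4))                  # projections of 2*N_s augmented samples
y = np.array([0, 0, 1, 1, 0, 1, 0, 1])
loss = supcon_loss(Z, y)                     # non-negative scalar
```

Minimizing this loss pulls same-class projections together and pushes different-class projections apart in the embedding space, as described above.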
The contrastive learning framework is commonly used to pre-train the encoder: the contrastive loss is minimized
first, and a classification network is then trained while the encoder parameters are kept fixed. However, the initial
embedding vectors are not realistic data, so relying solely on the contrastive objective may lead to local optima that are
irrelevant to the classification task, ultimately impairing the classification ability of the model. To ensure optimal
performance, it is therefore necessary to perform the contrastive learning and the classification task concurrently.
Specifically, the classification task can be defined as follows:
𝑦̂ = sigmoid(𝑏_0 + 𝒃_1^⊤ 𝒓_GBDT + 𝒃_2^⊤ 𝒓_RF), (15)
where 𝑏0 denotes the bias term, 𝒃1 ∈ ℝ𝑑 and 𝒃2 ∈ ℝ𝑑 denote the parameters of two LR models, respectively.
𝒓GBDT and 𝒓RF are the representations corresponding to the output cross-features of GBDT and RF, respectively. The
classification layer of CATE follows a shallow additive model structure that enables the assessment of the contribution
of individual components, thereby enhancing the interpretability of the model. We use the cross-entropy loss function
as the classification objective function:
ℒ_𝑝 = − ∑_{𝑖∈𝐼} [𝑦_𝑖 log(𝑦̂_𝑖) + (1 − 𝑦_𝑖) log(1 − 𝑦̂_𝑖)]. (16)
Assuming that 𝜆 denotes the ℓ2 regularization hyperparameter to avoid overfitting and 𝜃 denotes all learnable model
parameters, the final objective function for CATE is given by:

ℒ = ℒ_𝑐 + ℒ_𝑝 + 𝜆‖𝜃‖_2. (17)
The CATE model is composed of two cascade models. In the first stage, GBDT and RF models are pre-trained
for the extraction of cross-features. In the second stage, the mini-batch gradient descent technique, combined with the
Adam algorithm, is employed to optimize the embedding-based classification model. This two-stage process provides
a holistic approach that effectively leverages the strengths of both tree-based ensemble models and embedding-based
method, resulting in improved performance for classification tasks.
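The shallow additive classification layer of Eq. (15) can be sketched in a few lines, with random values standing in for the learned representations and LR parameters; each branch's contribution can be read off separately, which is the interpretability property the text highlights:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8
r_gbdt = rng.normal(size=d)   # representation from the GBDT cross-features
r_rf = rng.normal(size=d)     # representation from the RF cross-features

b0 = 0.1                      # bias term
b1 = rng.normal(size=d)       # LR weights for the GBDT branch
b2 = rng.normal(size=d)       # LR weights for the RF branch

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Eq. (15): additive layer; b1 @ r_gbdt and b2 @ r_rf are inspectable
# per-branch contributions to the default probability
y_hat = sigmoid(b0 + b1 @ r_gbdt + b2 @ r_rf)
assert 0.0 < y_hat < 1.0
```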
Table 2
Description of datasets.
5. Experimental study
5.1. Data description
This study employs four public credit datasets to conduct credit scoring experiments, as illustrated in Table 2.
Prosper dataset 1 is obtained from the reputable Prosper online lending platform in the United States. Lending Club
(LC) dataset 2 , spanning a period from 2007 to 2017, is a valuable source of information for academic research, as
it offers insights into the workings of Lending Club, the largest peer-to-peer (P2P) online lending platform in the
United States. Give me some credit (Give) dataset 3 is a public credit scoring dataset available on Kaggle competition
platform. Car loan (CarLoan) dataset 4 is available on the Developer Competition of Xunfei Open Platform.
The data preprocessing methodology utilized in this study involves several distinct steps. Firstly, the default and
normal labels, represented by the numerical values 1 and 0 respectively, are identified; samples with similar labels are
merged into one of these two label types, while samples with irrelevant labels are eliminated.
Secondly, features exhibiting a high missing rate of over 50% are removed, followed by the removal of samples still
containing missing values. Thirdly, categorical features with fewer than 20 distinct values undergo one-hot encoding,
while those with more than 20 distinct values are transformed using frequency encoding. Furthermore, the study incorporates
additional preprocessing techniques for specific datasets. However, due to space limitations, a detailed account of
these techniques is not provided in this study.
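The missing-rate and encoding rules above can be sketched with pandas on a toy frame; the column names and data are hypothetical, while the 50% missing-rate threshold and the 20-value encoding cutoff follow the text:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 100
df = pd.DataFrame({
    "income": rng.normal(5000, 1500, n),
    "grade": rng.choice(list("ABCD"), n),                    # low-cardinality categorical
    "zip": rng.choice([f"z{i:02d}" for i in range(40)], n),  # high-cardinality categorical
    "mostly_missing": np.where(rng.random(n) < 0.8, np.nan, 1.0),
})

# Drop features with a missing rate over 50%, then rows that still contain NaNs
df = df.loc[:, df.isna().mean() <= 0.5].dropna()

# One-hot encode categoricals with fewer than 20 distinct values; frequency-encode the rest
cat_cols = [c for c in df.columns if not pd.api.types.is_numeric_dtype(df[c])]
for col in cat_cols:
    if df[col].nunique() < 20:
        df = pd.get_dummies(df, columns=[col])
    else:
        df[col] = df[col].map(df[col].value_counts(normalize=True))

assert "mostly_missing" not in df.columns
```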
In this study, a rigorous methodology is adopted to validate the model by employing the five-fold cross-validation
test. The rationale behind this approach is to mitigate the impact of random data partitioning and to generate robust
evaluation outcomes. The original dataset is divided into five approximately equal subsets, and during each cycle of
the cross-validation process, one subset is selected as the test set while the other four subsets are combined to form the
training set. This process is repeated five times, ensuring that every subset is utilized as the test set once. Finally, the
mean of the evaluation results obtained from the five cycles is used as the conclusive outcome of the experiment.
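The protocol above can be sketched with scikit-learn on synthetic, imbalanced data. A stratified split is one reasonable instantiation of "five approximately equal subsets" (the paper does not state whether its folds are stratified), and a plain GBDT stands in for the evaluated models:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=500, n_features=10, weights=[0.8], random_state=0)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
aucs = []
for train_idx, test_idx in skf.split(X, y):
    model = GradientBoostingClassifier(random_state=0).fit(X[train_idx], y[train_idx])
    proba = model.predict_proba(X[test_idx])[:, 1]
    aucs.append(roc_auc_score(y[test_idx], proba))

# Conclusive outcome: the mean over the five folds
mean_auc = float(np.mean(aucs))
assert len(aucs) == 5
```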
correlation coefficient (MCC), and area under the receiver operating characteristic (ROC) curve (AUC) are employed
to measure the overall performance of the model. The ROC curve is obtained by setting different thresholds on the
decision function used to compute the false positive rate (FPR) and the true positive rate (TPR), and AUC is computed
using the trapezoidal rule. The definitions of the other indicators are as follows:
Rec = TP / (TP + FN),

F-score = (2 × Rec × precision) / (Rec + precision) = 2TP / (2TP + FN + FP),

G-mean = √(Rec × specificity) = √( (TP × TN) / ((TP + FN) × (TN + FP)) ),

Acc = (TP + TN) / (TP + FN + TN + FP),

BAcc = (Rec + specificity) / 2 = (TP × (TN + FP) + TN × (TP + FN)) / (2 × (TP + FN) × (TN + FP)),

MCC = (TP × TN − FP × FN) / √((TP + FP) × (TP + FN) × (TN + FP) × (TN + FN)). (18)
The higher the Rec, the stronger the recognition ability of the model for default samples. The F-score, which is
the harmonic average of precision and recall, serves as an indicator to balance the recognition ability and accuracy of
the model for default samples. A higher F-score signifies a stronger ability of the model to evaluate default samples.
The G-mean score measures the overall identification ability of the model. A higher value indicates that the model has
a relatively balanced identification ability for both default and normal samples. Acc represents the ratio of correctly
evaluated samples by the model. However, it is important to note that Acc may not be a suitable evaluation metric for
highly imbalanced datasets, as it can produce falsely high results. In such cases, BAcc and AUC can provide a better
overall evaluation metric. A higher AUC value signifies a stronger ability of the model to distinguish between different
categories of samples. BAcc is similar to G-mean in that it reflects relatively balanced accuracy rates for default and
normal cases. The MCC is essentially a correlation coefficient, with 1 representing a perfect model, 0 representing a
random prediction, and -1 representing the exact opposite prediction. Overall, higher values of these metrics indicate
that the classification model is more robust and performs better.
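The definitions in Eq. (18) can be checked with a small helper; the function name and the toy confusion counts are hypothetical:

```python
import numpy as np

def metrics(tp, fn, tn, fp):
    """Compute the Eq. (18) indicators from confusion-matrix counts."""
    rec = tp / (tp + fn)
    spec = tn / (tn + fp)
    prec = tp / (tp + fp)
    f = 2 * rec * prec / (rec + prec)
    gmean = np.sqrt(rec * spec)
    acc = (tp + tn) / (tp + fn + tn + fp)
    bacc = (rec + spec) / 2
    mcc = (tp * tn - fp * fn) / np.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return rec, f, gmean, acc, bacc, mcc

# Toy counts: 40 defaults caught, 10 missed, 90 normals correct, 20 falsely flagged
rec, f, gmean, acc, bacc, mcc = metrics(tp=40, fn=10, tn=90, fp=20)
assert rec == 40 / 50 and acc == 130 / 160
```

On an imbalanced split like this toy one, Acc looks healthy while Rec, BAcc, and MCC expose how many defaults are actually caught, which is why the text favors them for skewed datasets.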
Table 3
Parameters for grid search of baseline credit scoring models.
However, since the LC and Give datasets have a larger sample size, the performance of the proposed model may not
differ significantly from other models on these two datasets, but it still exhibits superior performance. Conversely, the
proposed model fails to achieve optimal performance when applied to the CarLoan dataset, which will be expounded
upon in subsequent analysis. It should be noted that the LC and CarLoan datasets have a large number of samples,
and the kernel-based SVM model requires substantial computation. Therefore, the SVM model was not trained to its
full potential on these datasets, resulting in relatively poor performance. When SVM is applied to the Give dataset,
a part of the ROC curve is below the diagonal, indicating that for highly imbalanced datasets, SVM has difficulty
identifying default samples, resulting in misclassifying some default samples with a relatively low probability. The
outcomes of the Prosper, LC, and Give datasets highlight the benefits of the contrastive augmentation and tree-enhanced
embedding mechanisms in improving the performance of classical GBDT.
Table 4 offers a performance comparison of various credit scoring models on datasets with lower imbalanced
rates. Among the models, CATE stands out as the superior performer on all indicators for the Prosper dataset. This
finding highlights the effectiveness of CATE as a suitable option for credit scoring on the Prosper dataset. GBDT+LR,
TEM, XGB+forgeNet and TEDAN outperform GBDT on all indicators, implying that the cross-feature extraction
mechanism and leaf node re-weighting mechanism are effective. However, CATE performs even better than these
four, indicating that the contrastive augmentation mechanism can further improve the ability of the model to extract
separable representations. It is worth noting that AugBoost performs worse than GBDT on all metrics, and RF performs
worse still. This suggests that the quality of the cross-features generated by RF can limit the performance of
AugBoost.
The experimental results obtained from the LC dataset are analogous to those from the Prosper dataset, with
the exception that the CATE model only shows a marginal improvement in terms of performance indicators over
other models. The LC dataset is characterized by a substantial number of training samples, which poses significant
computational challenges to the kernel-based SVM model. As a result, the model was not trained to the optimal level,
leading to relatively inferior performance. In contrast to the results obtained from the Prosper dataset, all models, except
for RF, demonstrate satisfactory performance. This disparity in performance can be attributed to the random selection
of features in imbalanced datasets, which may result in the omission of crucial features. Consequently, this omission
may hinder the ability of the model to identify default samples accurately.
Table 5 presents a comparative analysis of the performance of different credit scoring models on datasets that exhibit
higher imbalanced rates. The results indicate that the CATE model outperforms other models on all indicators in the
Give dataset, except for Acc where the difference is negligible at 0.03%. These findings suggest that the contrastive
augmentation mechanism effectively improves the ability of the model to identify default samples in highly imbalanced
datasets. In terms of individual credit scoring models, LR demonstrates superior performance, suggesting a possible
linear relationship between user characteristics and transaction default in this particular dataset. The performance
comparison between TEM and GBDT+LR suggests that embedding cross-features into dense vectors in a low-
dimensional space may not always be effective and could lead to inferior results compared to using cross-features
directly. Nonetheless, the performance of CATE provides evidence that the contrastive augmentation mechanism is a
potent solution to address the challenges posed by the aforementioned problems.
Based on the analysis of the CarLoan dataset, the CATE model exhibits superior performance in terms of Rec,
F-score, G-mean, BAcc, and MCC. However, its performance on Acc and AUC remains at a moderate level. Further
examination of the Rec performance of each model reveals the difficulty of identifying default samples in this dataset.
The performance of RF and XGB+forgeNet indicates that certain features exhibit strong correlations with default
transactions; consequently, filtering out these features may cause the model to fail to identify risky transactions.
Although the imbalance rate of this dataset is no higher than that of the Give dataset, default and normal samples
possess similar characteristics, making it challenging for the model to differentiate between them.
The exceptional Rec performance of the CATE model can be attributed to the contrastive augmentation mechanism,
which effectively brings elements of the same class closer together while distancing them from most elements
of different classes in the embedding space. However, when the two types of elements have comparable original
features, this mechanism inadvertently moves some normal samples to one end of the default samples, resulting in
the misclassification of these samples by the model. Therefore, the model cannot accurately determine whether these
samples are default with high probability, leading to a decrease in the Acc and AUC of the model. Due to the numerous
samples and features in the CarLoan dataset and the significant amount of computation required for kernel-based
SVM, the SVM model is not trained to the optimum on this dataset, resulting in poor performance that approximates
random prediction.
Table 4
Performance comparison on different datasets with lower imbalanced rate.
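The pull-together/push-apart behavior attributed above to the contrastive augmentation mechanism is the hallmark of a supervised contrastive objective. The sketch below implements a generic SupCon-style loss in the spirit of Khosla et al. [47]; it is an illustrative stand-in, not the authors' exact formulation:

```python
import numpy as np

def supervised_contrastive_loss(z, labels, temp=0.1):
    """Generic supervised contrastive loss: each anchor embedding is pulled
    toward same-class embeddings and pushed away from all other samples."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # unit-normalise rows
    sim = z @ z.T / temp                               # pairwise similarities
    n = len(labels)
    not_self = ~np.eye(n, dtype=bool)                  # exclude self-pairs
    sim_max = sim.max(axis=1, keepdims=True)           # numerical stability
    denom = (np.exp(sim - sim_max) * not_self).sum(axis=1, keepdims=True)
    log_prob = sim - sim_max - np.log(denom)           # log-softmax over others
    pos = (labels[:, None] == labels[None, :]) & not_self
    per_anchor = -(log_prob * pos).sum(axis=1) / pos.sum(axis=1)
    return per_anchor.mean()

labels = np.array([0, 0, 1, 1])
# Embeddings where classes are cleanly separated vs. interleaved:
tight = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
mixed = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
```

Evaluating the loss on `tight` yields a value near zero, while `mixed` is heavily penalised, mirroring the discussion: when the two classes share similar original features, some normal samples can end up near the default cluster and incur exactly this penalty in reverse.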
The experimental findings on Prosper, LC, and Give demonstrate that the stronger models exhibit Acc and AUC
scores surpassing 85%. Nonetheless, the discrimination among models is constrained by the imbalanced distribution
of samples across categories within the datasets. In this regard, the F-score emerges as a more suitable evaluation
metric, offering a more accurate assessment of the effectiveness of the model in recognizing risky loans. Compared
with the baseline model and other tree-enhanced models, the CATE model displays superior credit scoring
performance. Moreover, on the CarLoan dataset, the CATE model shows a better ability to accurately identify default
samples. These findings suggest that the CATE model may be a promising approach for credit scoring.
Table 5
Performance comparison on different datasets with higher imbalanced rate.
history of the borrower, the platform found that there had been a large number of inquiries (TotalInquiries > 15.5),
which may suggest financial instability or a high level of credit-seeking behavior. Additionally, the borrower has
opened other credit transactions (TradesOpenedLast6Months > 0.5) and has arrearages in other credit transactions
(AmountDelinquent > 29.5) within 6 months before the reviewing. These factors could indicate a history of
delinquency and suggest a higher risk of default for the borrower.
In comparison, the normal case is characterized by regular transactional patterns that exhibit a low loan interest
rate (LenderYield ≤ 0.142, BorrowerRate ≤ 0.125) and a short loan period (Term ≤ 24). Additionally, the borrower
tends to have a higher income (IncomeRange > 2.5, i.e., more than $50,000), a high credit card available limit
(AvailableBankcardCredit > 11009), and a low debt-to-income ratio (DebtToIncomeRatio ≤ 0.195), with debts
constituting only a minor proportion of their overall income. Furthermore, the borrower typically holds a relatively
common occupation (Occupation > 1498.5) and has worked in the same field for an extended period (Employment
Duration > 3.5). The credit history of the borrower also reveals a limited number of inquiries (TotalInquiries ≤ 4.5).
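The judgment conditions above can be re-encoded as explicit boolean rules, which is what makes the local cross-features auditable by stakeholders. The helper functions below are hypothetical (only the feature names and thresholds come from the text; the functions themselves are not part of the CATE model):

```python
def default_risk_flags(borrower):
    """Thresholds reported for the default case (illustrative re-encoding)."""
    return {
        "many_inquiries": borrower["TotalInquiries"] > 15.5,
        "recent_trades":  borrower["TradesOpenedLast6Months"] > 0.5,
        "arrearages":     borrower["AmountDelinquent"] > 29.5,
    }

def normal_profile(borrower):
    """Thresholds reported for the normal case (illustrative re-encoding)."""
    return (borrower["LenderYield"] <= 0.142
            and borrower["BorrowerRate"] <= 0.125
            and borrower["Term"] <= 24
            and borrower["IncomeRange"] > 2.5
            and borrower["AvailableBankcardCredit"] > 11009
            and borrower["DebtToIncomeRatio"] <= 0.195
            and borrower["TotalInquiries"] <= 4.5)

# Hypothetical borrowers illustrating each profile:
risky = {"TotalInquiries": 21, "TradesOpenedLast6Months": 2,
         "AmountDelinquent": 350}
safe = {"LenderYield": 0.10, "BorrowerRate": 0.09, "Term": 12,
        "IncomeRange": 4, "AvailableBankcardCredit": 15000,
        "DebtToIncomeRatio": 0.12, "TotalInquiries": 2}
```

A borrower matching `risky` triggers all three default flags, while one matching `safe` satisfies every normal-case condition; changing a single feature (e.g. raising `TotalInquiries`) breaks the normal profile, showing how each local cross-feature contributes to the explanation.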
The findings of this study demonstrate the effectiveness of the CATE model in identifying reasonable judgment
conditions for credit scoring. The observed differences in the characteristic interval between the two types of
transactions provide further validation of the ability of the model to identify such conditions accurately. Therefore,
it can be concluded that the CATE model can effectively recognize the judgment conditions that determine whether a
credit transaction is risky or not. Furthermore, the use of local cross-features with higher attention scores can provide
a detailed explanation of the evaluation results of the model.
Table 6
Comparison of default and normal transactions w.r.t their corresponding local cross-features.
Table 7
Ablation study of CATE on different datasets.
identify default transactions in the credit scoring task. Though misidentifying too many normal transactions as defaults
could cause financial institutions to miss opportunities, difficulty in identifying default transactions can result in
significant losses.
The experimental results conducted on four datasets demonstrate that each component of the CATE model exhibits
a significant role in accomplishing the credit scoring task. This finding serves as a testament to the efficacy of the
CATE design.
[Figure: performance metrics of the model (Rec, F-score, G-mean, Acc, BAcc, MCC, AUC; in %) plotted against training length (year).]
models themselves. The findings suggest that the performance of CATE is erratic when the number of trees is small.
Moreover, a slight decline in CATE performance is noted when 𝜏 ∈ {48, 64, 96}. As the number of trees increases, the
performance of CATE improves and stabilizes, indicating that an adequate number of trees is required to furnish
sufficient information to the subsequent attention mechanism. Furthermore, due to the presence of the attention
mechanism, additional trees beyond a certain number provide little new information. In this study, 𝜏 = 112 is the
preferred setting for the Prosper dataset, as it balances model performance and computational efficiency.
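The selection logic described above amounts to choosing the smallest tree count whose score sits within a tolerance of the best observed score. A minimal sketch (the helper and the F-scores below are hypothetical, invented for illustration):

```python
def smallest_stable_tau(scores, tol=0.002):
    """Return the smallest tree count whose score is within `tol` of the best,
    trading a negligible performance loss for lower computational cost."""
    best = max(scores.values())
    return min(t for t, s in scores.items() if s >= best - tol)

# Hypothetical F-scores over a grid of tree counts (not the paper's numbers):
f_scores = {16: 0.71, 48: 0.78, 64: 0.77, 96: 0.775, 112: 0.790, 128: 0.791}
tau = smallest_stable_tau(f_scores)
```

With these made-up scores the helper returns 112, matching the trade-off rationale: once performance plateaus, the extra trees beyond that point add cost without new information for the attention mechanism.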
Moreover, CATE can be easily incorporated into practical credit scoring applications because its interpretable
decision-making effectively assists stakeholders in identifying transactions that may default.
However, it is important to acknowledge the limitations of the CATE model. Firstly, the shallow LSTM model
may face challenges in effectively learning the construction patterns of decision trees when the decision paths become
lengthy. Consequently, this limitation restricts the tree-based ensemble model from extending the depth of decision
trees, potentially impacting its overall performance. Secondly, although initializing the embedding vector with the
mean of the subsample set strengthens the association between the embedding vector and the original dataset, it can
also constrain the ability of the embedding-based model to extract features effectively. Furthermore, in addition to the aforementioned
limitations, the proposed method may exhibit weakness in predicting normal transactions under circumstances of
extremely imbalanced rates in the dataset or when there is an overlap in the underlying distribution between different
classes. The future work of this study comprises three aspects. Firstly, we aim to explore improved decision path
information fusion methods to further enhance the ability to extract tree structure information. In particular, we
intend to investigate the co-learning of tree-based and embedding-based models to further facilitate information
dissemination between the two components, leading to better performance in real-world applications. Secondly, to
address the weakness of the CATE model, we plan to incorporate imbalanced learning techniques, such as resampling
and customized ensemble methods. Finally, we aim to leverage the abundance of information available on the Internet
to introduce additional user attributes, such as social, consumption, and behavior information. This will enable us to
expand the interpretability of the model and enable it to make more accurate predictions based on a broader range of
factors.
Acknowledgments
This work is supported by the Guangzhou Science and Technology Program key projects (202103010005), the
National Natural Science Foundation of China (61876066).
References
[1] K. Buehler, A. Freeman, R. Hulme, The new arsenal of risk management, Harv. Bus. Rev. 86 (2008) 92–100+137.
[2] S. Maldonado, G. Peters, R. Weber, Credit scoring using three-way decisions with probabilistic rough sets, Inf Sci 507 (2020) 700–714.
[3] J. Sun, J. Lang, H. Fujita, H. Li, Imbalanced enterprise credit evaluation with DTE-SBD: Decision tree ensemble based on SMOTE and
bagging with differentiated sampling rates, Inf Sci 425 (2018) 76–91.
[4] G. Petrides, D. Moldovan, L. Coenen, T. Guns, W. Verbeke, Cost-sensitive learning for profit-driven credit scoring, J. Oper. Res. Soc. 73 (2022)
338–350.
[5] S. Carta, A. Ferreira, D. Reforgiato Recupero, M. Saia, R. Saia, A combined entropy-based approach for a proactive credit scoring, Eng Appl
Artif Intell 87 (2020) 103292.
[6] N. Chen, B. Ribeiro, A. Chen, Financial credit risk assessment: a recent review, Artif Intell Rev 45 (2016) 1–23.
[7] P. Pławiak, M. Abdar, U. Rajendra Acharya, Application of new deep genetic cascade ensemble of SVM classifiers to predict the Australian
credit scoring, Appl. Soft Comput. J. 84 (2019) 105740.
[8] N. Le, T.-T. Huynh, E. Yapp, H.-Y. Yeh, Identification of clathrin proteins by incorporating hyperparameter optimization in deep learning and
PSSM profiles, Comput. Methods Programs Biomed. 177 (2019) 81–88.
[9] B. Baesens, T. Van Gestel, S. Viaene, M. Stepanova, J. Suykens, J. Vanthienen, Benchmarking state-of-the-art classification algorithms for
credit scoring, J Oper Res Soc 54 (2003) 627–635.
[10] D. Tripathi, A. Shukla, B. Reddy, G. Bopche, D. Chandramohan, Credit Scoring Models Using Ensemble Learning and Classification
Approaches: A Comprehensive Survey, Wireless Pers Commun 123 (2022) 785–812.
[11] K. Niu, Z. Zhang, Y. Liu, R. Li, Resampling ensemble model based on data distribution for imbalanced credit risk evaluation in P2P lending,
Inf Sci 536 (2020) 120–134.
[12] F. Shen, X. Zhao, G. Kou, F. Alsaadi, A new deep learning ensemble credit risk evaluation model with an improved synthetic minority
oversampling technique, Appl. Soft Comput. 98 (2021) 106852.
[13] X. Feng, Z. Xiao, B. Zhong, Y. Dong, J. Qiu, Dynamic weighted ensemble classification for credit scoring using Markov Chain, Appl Intell
49 (2019) 555–568.
[14] M. Papouskova, P. Hajek, Two-stage consumer credit risk modelling using heterogeneous ensemble learning, Decis Support Syst 118 (2019)
33–45.
[15] L.-A. Dong, X. Ye, G. Yang, Two-stage rule extraction method based on tree ensemble model for interpretable loan evaluation, Inf Sci 573
(2021) 46–64.
[16] J. Tomczak, M. Zie¸ba, Classification Restricted Boltzmann Machine for comprehensible credit scoring model, Expert Sys Appl 42 (2015)
1789–1796.
[17] W. Wang, C. Lesner, A. Ran, M. Rukonic, J. Xue, E. Shiu, Using small business banking data for explainable credit risk scoring, in: AAAI -
AAAI Conf. Artif. Intell., AAAI press, New York, 2020, pp. 13396–13401.
[18] V. Moscato, A. Picariello, G. Sperlí, A benchmark of machine learning approaches for credit score prediction, Expert Sys Appl 165 (2021)
113986.
[19] X. Wang, X. He, F. Feng, L. Nie, T.-S. Chua, TEM: Tree-enhanced embedding model for explainable recommendation, in: Web Conf. - Proc.
World Wide Web Conf., WWW, Association for Computing Machinery, Inc, 2018, pp. 1543–1552. doi:10.1145/3178876.3186066.
[20] S. Lessmann, B. Baesens, H.-V. Seow, L. Thomas, Benchmarking state-of-the-art classification algorithms for credit scoring: An update of
research, Eur J Oper Res 247 (2015) 124–136.
[21] S. Sohn, D. Kim, J. Yoon, Technology credit scoring model with fuzzy logistic regression, Appl. Soft Comput. J. 43 (2016) 150–158.
[22] J. López, S. Maldonado, Profit-based credit scoring based on robust optimization and feature selection, Inf Sci 500 (2019) 190–202.
[23] S. Sohn, J. Kim, Decision tree-based technology credit scoring for start-up firms: Korean case, Expert Sys Appl 39 (2012) 4007–4012.
[24] Y. Tian, B. Bian, X. Tang, J. Zhou, A new non-kernel quadratic surface approach for imbalanced data classification in online credit scoring,
Inf Sci 563 (2021) 150–165.
[25] T. Li, G. Kou, Y. Peng, A new representation learning approach for credit data analysis, Inf Sci 627 (2023) 115 – 131.
[26] A. Blanco, R. Pino-Mejías, J. Lara, S. Rayo, Credit scoring models for the microfinance industry using neural networks: Evidence from Peru,
Expert Sys Appl 40 (2013) 356–364.
[27] H. He, W. Zhang, S. Zhang, A novel ensemble method for credit scoring: Adaption of different imbalance ratios, Expert Sys Appl 98 (2018)
105–117.
[28] Y. Xia, C. Liu, Y. Li, N. Liu, A boosted decision tree approach using Bayesian hyper-parameter optimization for credit scoring, Expert Sys
Appl 78 (2017) 225–241.
[29] W. Liu, H. Fan, M. Xia, Step-wise multi-grained augmented gradient boosting decision trees for credit scoring, Eng Appl Artif Intell 97
(2021) 104036.
[30] C.-F. Tsai, Y.-F. Hsu, D. Yen, A comparative study of classifier ensembles for bankruptcy prediction, Appl. Soft Comput. J. 24 (2014)
977–984.
[31] P. Pławiak, M. Abdar, J. Pławiak, V. Makarenkov, U. Acharya, DGHNL: A new deep genetic hierarchical network of learners for prediction
of credit scoring, Inf Sci 516 (2020) 401–418.
[32] V. Djeundje, J. Crook, R. Calabrese, M. Hamid, Enhancing credit scoring with alternative data, Expert Sys Appl 163 (2021) 113766.
[33] Y. Song, Y. Wang, X. Ye, R. Zaretzki, C. Liu, Loan default prediction using a credit rating-specific and multi-objective ensemble learning
scheme, Inf Sci 629 (2023) 599 – 617.
[34] X. He, J. Pan, O. Jin, T. Xu, B. Liu, T. Xu, Y. Shi, A. Atallah, R. Herbrich, S. Bowers, J. Candela, Practical lessons from predicting clicks on
ads at Facebook, in: Proc. ACM SIGKDD Int. Conf. Knowl. Discov. Data Min., Association for Computing Machinery, New York, 2014, pp.
1–9. doi:10.1145/2648584.2648589.
[35] W. Liu, H. Fan, M. Xia, Credit scoring based on tree-enhanced gradient boosting decision trees, Expert Sys Appl 189 (2022) 116034.
[36] J. Liu, S. Zhang, H. Fan, A two-stage hybrid credit risk prediction model based on XGBoost and graph-based deep neural network, Expert
Sys Appl 195 (2022) 116624.
[37] Y. Wu, D. Zhu, X. Wang, Tree enhanced deep adaptive network for cancer prediction with high dimension low sample size microarray data,
Appl. Soft Comput. 136 (2023) 110078.
[38] J. Elman, Finding structure in time, Cogn. Sci. 14 (1990) 179–211.
[39] M. Jordan, Attractor dynamics and parallelism in a connectionist sequential machine, in: Proc. 8th Annu. Conf., Cognitive Sci. Soc., MIT
Press, 1986, pp. 531–546.
[40] J. F. Kolen, S. C. Kremer, Gradient Flow in Recurrent Nets: The Difficulty of Learning Long-Term Dependencies, 2001, pp. 237–243.
doi:10.1109/9780470544037.ch14.
[41] S. Hochreiter, J. Schmidhuber, Long Short-Term Memory, Neural Comp. 9 (1997) 1735–1780.
[42] P. Bachman, R. Devon Hjelm, W. Buchwalter, Learning representations by maximizing mutual information across views, in: Adv. neural inf.
proces. syst., Neural information processing systems foundation, Vancouver, 2019, pp. 15535–15545.
[43] O. Henaff, A. Srinivas, J. Fauw, A. Razavi, C. Doersch, S. Eslami, A. Eslami, Data-Efficient image recognition with contrastive predictive
coding, in: Int. Conf. Machin. Learn., ICML, International Machine Learning Society (IMLS), Virtual, Online, 2020, pp. 4130–4140.
[44] K. He, H. Fan, Y. Wu, S. Xie, R. Girshick, Momentum Contrast for Unsupervised Visual Representation Learning, in: Proc IEEE Comput
Soc Conf Comput Vision Pattern Recognit, IEEE Computer Society, Virtual, Online, 2020, pp. 9726–9735. doi:10.1109/CVPR42600.2020.
00975.
[45] I. Misra, L. van der Maaten, Self-supervised learning of pretext-invariant representations, in: Proc IEEE Comput Soc Conf Comput Vision
Pattern Recognit, IEEE Computer Society, Virtual, Online, 2020, pp. 6706–6716. doi:10.1109/CVPR42600.2020.00674.
[46] T. Chen, S. Kornblith, M. Norouzi, G. Hinton, A simple framework for contrastive learning of visual representations, in: Int. Conf. Machin.
Learn., ICML, International Machine Learning Society (IMLS), Virtual, Online, 2020, pp. 1575–1585.
[47] P. Khosla, P. Teterwak, C. Wang, A. Sarna, Y. Tian, P. Isola, A. Maschinot, C. Liu, D. Krishnan, Supervised contrastive learning, in: Adv.
neural inf. proces. syst., Neural information processing systems foundation, Virtual, Online, 2020, pp. 18661–18673.
☒ The authors declare that they have no known competing financial interests or personal relationships
that could have appeared to influence the work reported in this paper.
Ying Gao: Resources, Funding acquisition, Project administration, Supervision, Writing – review.
Haolang Xiao: Investigation, Conceptualization, Methodology, Data curation, Software, Formal
analysis, Validation, Visualization, Writing – original draft.
Choujun Zhan: Data curation, Writing – review & editing.
Lingrui Liang: Investigation, Data curation.
Wentian Cai: Writing - review & editing.
Xiping Hu: Supervision, Writing – review.