
Journal Pre-proof

CATE: Contrastive augmentation and tree-enhanced embedding for credit scoring

Ying Gao, Haolang Xiao, Choujun Zhan, Lingrui Liang, Wentian Cai et al.

PII: S0020-0255(23)01032-0
DOI: https://doi.org/10.1016/j.ins.2023.119447
Reference: INS 119447

To appear in: Information Sciences

Received date: 13 March 2023


Revised date: 1 August 2023
Accepted date: 3 August 2023

Please cite this article as: Y. Gao, H. Xiao, C. Zhan et al., CATE: Contrastive augmentation and tree-enhanced embedding for credit scoring, Information
Sciences, 119447, doi: https://doi.org/10.1016/j.ins.2023.119447.

This is a PDF file of an article that has undergone enhancements after acceptance, such as the addition of a cover page and metadata, and formatting for
readability, but it is not yet the definitive version of record. This version will undergo additional copyediting, typesetting and review before it is published in its
final form, but we are providing this version to give early visibility of the article. Please note that, during the production process, errors may be discovered which
could affect the content, and all legal disclaimers that apply to the journal pertain.

© 2023 Published by Elsevier.


CATE: Contrastive Augmentation and Tree-enhanced Embedding
for Credit Scoring
Ying Gaoa , Haolang Xiaoa , Choujun Zhanb , Lingrui Lianga , Wentian Caia and Xiping Huc,d,∗
a School of Computer Science and Engineering, South China University of Technology, Guangzhou, 510006, China
b School of Computing, South China Normal University, Guangzhou, 510631, China
c Beijing Institute of Technology, Beijing, 100081, China
d Shenzhen MSU-BIT University, Shenzhen, 518172, China

ARTICLE INFO

Keywords: tree-based model; feature extraction; attention mechanism; supervised contrastive learning; credit scoring

ABSTRACT

Credit transactions are vital financial activities that yield substantial economic benefits. To further improve lending decisions, stakeholders require accurate and interpretable credit scoring methods. While the majority of previous studies have focused on the relationship between individual features and credit risk, only a few have investigated cross-features. Notably, cross-features can not only represent structured data effectively but also provide richer semantic information than individual features. Nevertheless, most previous methods for learning cross-feature effects from credit data have been implicit and unexplainable. This paper proposes a new credit scoring model based on contrastive augmentation and tree-enhanced embedding mechanisms, termed CATE. The proposed model automatically constructs explainable cross-features by using tree-based models to learn decision rules from the data. Moreover, the importance of each local cross-feature is derived through an attention mechanism. Finally, the credit score of a user is evaluated using embedding vectors. Experimental results on four public datasets demonstrated the interpretability of our proposed method, which outperformed 13 state-of-the-art benchmark methods.

1. Introduction
Credit loaning to customers is one of the primary functions of financial institutions. However, credit risk is
a significant hazard to the regular operations of these institutions [1], as evidenced by non-performing loans in
commercial banks in China, which reached 2.8 trillion yuan by the end of the fourth quarter of 2021, equivalent
to 1.73% of all loans made by the banking sector. The consequences of excessive credit risk can be dire, leading to the
bankruptcy of associated businesses [2] and negatively impacting the operating performance of financial institutions
[3]. In this context, credit risk assessment, also known as "credit scoring", is a crucial stage in credit risk management,
as it alleviates the threat posed by information asymmetry.
Petrides et al. [4] demonstrated that a reliable credit scoring model significantly impacts the risk management
and profitability of financial institutions. Conventional credit scoring is a statistical approach that generates a score
indicating the credibility of a loan application based on transaction and borrower-related data [5]. This score is obtained
by calculating the expected probability of default (PD) and can be simplified to a classification task. While statistical
analysis has been historically dominant in credit scoring due to its easy implementation and clear interpretability,
it is not without limitations. Specifically, statistical analysis relies on strong assumptions, such as linear separability
and standard distribution [6]. Consequently, it may encounter limitations in specific scenarios, such as cases where
variables fail to exhibit a linear relationship [7] or when large datasets are involved [8].
Given the large scale, high complexity, and nonlinear nature of credit data, it is challenging to develop an accurate
credit scoring model through statistical analysis. To meet the development needs of the credit industry, machine
learning (ML)-based methods are applied to credit scoring. Baesens et al. [9] demonstrated the advantages of ML-
based methods over traditional statistical analysis in a variety of applications.
Ensemble methods have gained widespread application in the field of credit scoring due to their flexibility and
superior performance. Ensemble models combine various individual models to create an improved overall model. In
∗ Corresponding author.
gaoying@scut.edu.cn (Y. Gao); holland.shaw.chn@gmail.com (H. Xiao); zchoujun2@gmail.com (C. Zhan); huxp@bit.edu.cn (X. Hu)


conjunction with the ensemble learning framework, the selection of suitable training samples serves as an additional
technique to enhance classification performance [10]. Comparative studies conducted by Niu et al. [11], Shen et al. [12]
indicate that ensemble models generally exhibit superior performance compared to individual models. Consequently,
ensemble models have become one of the most active research areas in the field of credit scoring in recent years [13].
Although tree-based ensemble models have been shown to outperform individual decision trees in terms of
evaluation accuracy [14], they lack the explanatory information necessary for decision-making [15]. In the field of
credit scoring, interpretability is a crucial aspect of model performance [16]. Cross-features, which combine intervals
of multiple feature variables, have proven effective in modeling feature interactions and representing structured data,
providing richer semantic information than individual features. However, existing studies on the post-interpretation
of complex models concentrate primarily on the relationship between individual features and credit risk [17, 18], and
rarely notice the interaction between features. To bridge this gap in the literature, this study aims to develop a credit
scoring model that achieves high performance and interpretability. In terms of performance, we anticipate that the
proposed model can match the accuracy level of state-of-the-art credit scoring models. In terms of interpretability, our
objective is to devise a model that can locate the primary cross-features during evaluations.
This study proposes a credit scoring model based on contrastive augmentation and tree-enhanced embedding
(CATE) mechanisms, aiming to address the aforementioned problems. Inspired by the insights of Wang et al.
[19], we leverage tree-based ensemble models to transform the initial features of credit transactions into cross-features.
The main contributions of this study include the following three aspects:
1. This work introduces a novel approach for credit scoring that combines the strengths of both tree-based ensemble
models and an embedding-based model. The embedding-based model is known for its strong generalization ability,
while the tree-based ensemble models can automatically generate interpretable cross-features. The additive
attention mechanism can provide intrinsic local interpretability for CATE by identifying the local cross-features
that receive greater attention scores during evaluation. Furthermore, we propose a decision path information
fusion mechanism to enable the embedding-based model to learn the construction pattern of the tree-based
ensemble models. This mechanism reduces the gap between the two components and facilitates information
propagation, addressing the limitations that arise from separately modeling the embedding-based model and
tree-based ensemble models;
2. This work devises a dual-task learning technique that combines the classification task with the contrastive
learning framework. Multiple tree-based ensemble models are utilized to generate augmented samples from
various perspectives. This augmentation enriches the training data and allows for more effective utilization
of contrastive learning. The contrastive task clusters elements belonging to the same class while separating
those from different classes in the embedding space. By incorporating dual-task learning, the classification task
becomes more effective, enabling CATE to learn more discriminative representation vectors corresponding to
different classes of transactions. Consequently, CATE demonstrates an enhanced capability in identifying risky
transactions;
3. The proposed model has exhibited promising performance across four credit transaction datasets, showcasing its
effectiveness in credit scoring. Furthermore, an in-depth analysis of the intrinsic local interpretability of CATE
is conducted through a specific case study. The findings reveal that the proposed model identifies and utilizes
distinct cross-features, highlighting its interpretability and providing valuable insights into its decision-making
process.
According to the experimental findings from four public credit datasets, the proposed solution combines the
advantages of the embedding-based method and the tree-based ensemble models. As a result, it creates a credit scoring
model with superior evaluation performance and interpretability.

2. Related work
This section provides an overview of related research developments in credit scoring, focusing first on
classical individual models and ensemble methods. Subsequently, it explores feature
transformation based on tree-based ensemble models, henceforth referred to as "tree-enhanced" models. Additionally,
the section reviews relevant research on recurrent neural networks and contrastive learning,
which provides further insight into the innovative approaches being employed in the field of credit scoring.


2.1. Credit scoring


Researchers use predictive models called scorecards to estimate the likelihood that a candidate will default in
the future. Among these models, logistic regression (LR) stands as an industry standard for effectively testing the
performance of new methods in credit scoring [20]. Sohn et al. [21] developed an interpretable credit scoring model
based on the fuzzy LR by transforming verbal evaluation items in credit transaction data into triangular fuzzy numbers.
López and Maldonado [22] introduced a cost-sensitive learning framework for profit-oriented credit scoring, which
considers various cost trade-offs to enhance alignment with business requirements, improving economical and practical
aspects of credit scoring models. To address the challenges posed by large amounts of data and complex calculations in
credit scoring, researchers have applied ML-based methods such as decision tree (DT), support vector machine (SVM),
neural network (NN), etc. Sohn and Kim [23] proposed a DT for evaluating the creditworthiness of startups and outlined
the most important evaluation factors. Tian et al. [24] proposed an SVM based on fuzzy homocentric quadratic surfaces,
demonstrating its effectiveness in addressing the class imbalance issue in credit data. Additionally, Li et al. [25]
introduced a novel representation learning approach for credit data analysis which aims to capture complex patterns and
relationships within credit data. In comparing various models, Blanco et al. [26] fitted and compared fourteen distinct
multi-layer perceptron (MLP) models on the credit scoring dataset of Peruvian microfinance institutions. The results
indicated that MLPs outperformed classical credit scoring techniques such as LR and linear discriminant analysis, as
evidenced by lower misclassification rates.
The superiority of ensemble methods over individual models can be attributed to the fusion of diverse base
models. These methods are typically classified into two categories: parallel ensemble methods such as bagging, and
serial ensemble methods such as boosting. To enrich the diversity of individual models and decrease the overall model
evaluation error, the random forest (RF) constructs diverse training objects through the bagging strategy [27] and
is considered a representative parallel ensemble credit scoring model. The gradient boosting decision tree (GBDT),
on the other hand, is a representative serial ensemble credit scoring model that improves model performance through
stepwise loss optimization while maintaining the training object constant [28, 29]. Tsai et al. [30] explored various
ensemble techniques for bankruptcy prediction and found that boosting ensemble methods outperform those based on
bagging. Sun et al. [3] proposed a credit scoring algorithm that integrated bagging with synthetic minority oversampling
technology (SMOTE) to generate diverse training data by varying sampling rates to enrich the diversity of individual
models. Pławiak et al. [31] presented a novel deep genetic hierarchical learning network for credit scoring, combining
deep learning and genetic algorithm within a hierarchical structure. Liu et al. [29] proposed a stepwise multi-grained
augmented GBDT credit scoring model, which combined the advantages of bagging and boosting by enriching the
input features. Dong et al. [15] proposed a two-step rule extraction method based on tree-based ensemble models to
automatically construct credit scoring decision rules. Djeundje et al. [32] investigated the utilization of alternative
data to enhance credit scoring practices, proposing a credit scoring framework that integrates conventional credit
data with non-traditional sources like mobile payments and social media activity. In a recent study, Song et al. [33]
proposed a novel approach to predict loan defaults with a highly imbalanced class distribution. They employed a credit
rating-specific and multi-objective ensemble learning methodology, effectively addressing the challenges posed by
class imbalance in credit scoring.
However, most of these studies have primarily focused on the evaluation performance of the model, while
neglecting the interpretability of the model. In contrast to these previous works, the proposed CATE model takes
a different approach. By leveraging tree-based ensemble models, CATE effectively captures the interaction between
features and automatically generates cross-features that prove advantageous for credit scoring. The use of the attention
mechanism allows for the determination of the significance of each cross-feature, thus enabling the provision of
interpretability for the model.

2.2. Tree-enhanced model


The tree-enhanced method, which is capable of feature segmentation and combination, converts the initial features
into cross-features that contain interaction information between features. He et al. [34] proposed GBDT+LR, a model
that combines GBDT and LR for predicting Facebook ad clicks, with the latter model re-weighting the leaves.
Empirical results show that this combination method outperforms either method used alone. Building on the work
of He et al. [34], Wang et al. [19] introduced embedding to enable the transformed cross-features to encode high-level
semantic information, which is referred to as the tree-enhanced embedding method (TEM). Liu et al. [35] devised
AugBoost, training a new RF in each stage of GBDT training to enrich the diversity of training data through feature
transformation. Liu et al. [36] proposed a two-stage hybrid model, termed XGB+forgeNet, which utilizes XGBoost to


linearize the original features, while forgeNet is employed to handle the transformed high-dimensional data and uncover
the underlying relationships between features. Wu et al. [37] introduced the tree-enhanced deep adaptive network
(TEDAN) to address challenges such as overfitting and large training gradient variance. These studies demonstrate
that tree-enhanced methods can automatically learn decision rules from data and model valid and interpretable high-
order cross-features. In this study, a decision path information fusion method is designed to preserve the structural
information of DTs, allowing the local cross-features to retain as much structural information as possible.

2.3. Recurrent neural networks


The feedforward neural network represents a class of neural network models that map current input to output
without cycle connection. In contrast, the recurrent neural network (RNN) incorporates "memory" cells that store past
internal outputs, allowing for recurrent connections and the utilization of contextual information during training. This
characteristic makes RNN a natural choice for learning patterns from sequential data. The earliest RNN models can
be traced back to those proposed by Elman [38] and Jordan [39]. However, training RNNs can be challenging due
to the issues of exploding or vanishing gradients [40]. To address this deficiency in learning dependencies between
long-term sequences, Hochreiter and Schmidhuber [41] introduced the Long Short-Term Memory Neural Network
(LSTM). This model regulates the flow of information through a "gate" mechanism, which selectively retains or erases
specific information in the sequence to transfer more essential information along the chain. In this study, LSTM is
employed to learn the construction process from the decision paths of DTs to generate embedding vectors that retain
the structural information of DTs as much as possible.

2.4. Contrastive learning


The field of unsupervised visual representation learning encompasses two major categories: generative and
discriminative methods. In recent years, discriminative methods based on latent space have shown remarkable
effectiveness. However, most of these approaches necessitate the use of specialized architectures [42, 43] or repositories
[44, 45] to be effective. A notable exception is a framework proposed by Chen et al. [46] known as SimCLR, which
employs a contrastive self-supervised learning algorithm to improve the diversity of training data through random data
augmentation operations. Remarkably, this approach can match the performance of supervised models without the
need for specialized architectures or repositories. Expanding on the SimCLR framework, Khosla et al. [47] introduced
a supervised contrastive loss (SupCon) that extends it to fully supervised learning. SupCon effectively utilizes label
information to bring elements of the same class closer together and to separate elements of different classes in the embedding space.
In this study, SupCon is introduced to obtain representation vectors with a superior classification effect.

3. Preliminaries
This section outlines the process of constructing cross-features, a vital component of the credit scoring methodology
discussed in Section 4.

3.1. Embedding-based methods


Embedding-based representation learning models can learn representation vectors for credit scoring from raw data.
Among these models, factorization stands out as a simple yet effective embedding model. Let 𝒙 ∈ ℝ𝑑𝑥 denote the
feature vector, 𝒓 ∈ ℝ𝑑 denote the representation vector output by the encoder, Enc(⋅) denote the encoder (typically
built with NNs), 𝒂 ∈ ℝ𝑑 denote the weight vector that maps the representation vector to the evaluation result, and
𝑏 denote the bias term. Here, 𝑑 and 𝑑𝑥 denote the embedding size and the dimension of the initial feature vector,
respectively. The factorization process can be expressed as follows:

𝑦̂MF (𝒙) = 𝑏 + 𝒂⊤ 𝒓 = 𝑏 + 𝒂⊤ Enc(𝒙). (1)

However, embedding-based representation learning models capture the interaction effects between features during
training in an opaque manner, which fails to meet our interpretability requirements. To address this limitation, a
common industrial solution for making cross-features explicit and interpretable is to manually create cross-features
that are then fed into an interpretable method like LR. In doing so, LR can learn the importance of each cross-feature.
For example, cross-features can be generated by combining intervals of feature variables 𝑥age and 𝑥term to produce
a second-order cross-feature [𝑥age > 18] ∧ [𝑥t36 = 1], where the 𝑥t36 variable indicates whether the loan term is 36
months. However, the manual creation of high-order cross-features using this approach poses a challenge in terms of


scalability. A large number of feature variables must be intertwined to model high-order cross-features, resulting in
an exponential increase in complexity that is difficult to manage manually. Although complexity can be controlled
to a certain extent through fine feature engineering, such as crossing only important features, this approach requires
extensive relevant domain knowledge and lacks cross-domain adaptability.
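To make the industrial practice described above concrete, the following sketch (hypothetical column names; pandas and scikit-learn assumed, not the authors' implementation) manually constructs the second-order cross-feature [𝑥age > 18] ∧ [𝑥t36 = 1] and feeds it to LR, whose learned coefficient then reflects the importance of that cross-feature.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical raw credit data with an age column and a loan-term column.
df = pd.DataFrame({
    "age":     [25, 42, 67, 19, 55, 31],
    "term":    [36, 60, 36, 36, 60, 36],
    "default": [0, 0, 1, 1, 0, 1],
})

# Manually constructed second-order cross-feature: [age > 18] AND [term == 36 months].
df["age_gt18_and_t36"] = ((df["age"] > 18) & (df["term"] == 36)).astype(int)

X = df[["age", "term", "age_gt18_and_t36"]]
y = df["default"]

lr = LogisticRegression().fit(X, y)
# The coefficient learned for the cross-feature indicates its importance to the score.
print(dict(zip(X.columns, lr.coef_[0])))
```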

3.2. Tree-based methods


Fig. 1. An example of GBDT with two subtrees 𝑄1 and 𝑄2 , in which each internal node splits a feature variable 𝑥𝑖 against a threshold 𝑎𝑗 and the 16 leaf nodes are denoted 𝑣0 , … , 𝑣15 .

In contrast to embedding-based representation learning, tree-based models do not acquire embedding vectors but
rather learn decision rules from empirical data. The appeal of this strategy lies in its effectiveness and interpretability.
Fig. 1 illustrates a DT model 𝑇 = {𝑉 , 𝐸}, where 𝑉 represents the set of nodes and 𝐸 represents the set of edges. The
nodes can be partitioned into three subsets 𝑉 = 𝑉𝑅 ∪ 𝑉𝐼 ∪ 𝑉𝐿 , where 𝑉𝑅 denotes the set consisting solely of the root
node 𝑣𝑅 , 𝑉𝐼 denotes the set of internal nodes, and 𝑉𝐿 denotes the set of leaf nodes. Each internal node 𝑣𝑖 ∈ 𝑉𝐼 in the
tree splits a feature variable 𝑥𝑖 ∈ 𝒙 by utilizing two decision edges. When dealing with numerical feature variables like
income, the node chooses a threshold 𝑎𝑗 ∈ ℝ, and splits the feature into [𝑥𝑖 ≤ 𝑎𝑗 ] and [𝑥𝑖 > 𝑎𝑗 ]; When dealing with
categorical feature variables like gender, one-hot encoding is used to convert them to binary variables first. The node
then splits the feature into [𝑥𝑖 = 𝑎𝑗 ] and [𝑥𝑖 ≠ 𝑎𝑗 ] based on whether or not it is equal to a certain value.
The path connecting 𝑣𝑅 and any leaf node 𝑣𝑙 ∈ 𝑉𝐿 represents a decision rule, referred to as "decision path" in this
paper, which can also be viewed as a cross-feature. Cross-features combine the local intervals of multiple features. In
Fig. 1, for example, node 𝑣7 represents the cross-feature [𝑥0 > 𝑎0 ] ∧ [𝑥2 > 𝑎0 ] ∧ [𝑥4 > 𝑎3 ]. Given the initial feature
vector of a transaction, the DT determines which leaf node the transaction will reach. DT can be thought of as mapping
feature vectors to leaf nodes based on the unique structure of the tree. In this mechanism, the path of the activated leaf
node can be regarded as the most concerned cross-feature in the decision process of the DT. Consequently, tree-based
models are considered inherently self-interpretable. This method of generating cross-features avoids labor-intensive
feature engineering and effectively resolves the issues of difficult expansion and lack of cross-domain adaptability that
arise when cross-features are manually produced.
An individual DT may be inadequate to capture complex patterns in credit transaction data due to the limitation of
mapping only a single cross-feature for each transaction. To overcome this, a popular solution is to use a tree-based
ensemble model to generate a more diverse set of cross-features. In this study, we extract cross-features from the raw
data of credit transactions using pre-trained tree-based ensemble models. Although tree-based ensemble models are not
explicitly designed for cross-feature extraction, it is reasonable to assume that the leaf nodes represent effective cross-
features for credit scoring, given that each DT is trained and optimized for the classification task. As an illustration, we
utilize GBDT, which enhances the overall performance by integrating multiple additive trees. Assuming that the forest
consists of 𝜏 DTs, the output of the 𝑡-th DT model is denoted as $\hat{y}_{\mathrm{DT}}^{(t)}$, and the GBDT can be expressed by Eq. (2):

$\hat{y}_{\mathrm{GBDT}}(\boldsymbol{x}) = \sum_{t=1}^{\tau} \hat{y}_{\mathrm{DT}}^{(t)}(\boldsymbol{x}).$ (2)

GBDT can be conceptualized as a set of decision trees 𝑄 = {𝑄1 , … , 𝑄𝜏 }, wherein each tree 𝑄𝑡 is responsible for
mapping the initial feature vector 𝒙 to a specific leaf node 𝑄𝑡 (𝒙). The number of leaves in the 𝑡-th tree is denoted by 𝐿𝑡 .


Table 1
The semantics of feature variables (V) and thresholds (T) of the GBDT in Fig. 1.

V Semantic T Semantic
𝑥0 Number Of Times 90 Days Late 𝑎0 0.5000
𝑥1 Revolving Utilization Of Unsecured Lines 𝑎1 0.0609
𝑥2 Number Of Time 60-89 Days Past Due 𝑎2 1.5000
𝑥3 Number Of Time 30-59 Days Past Due 𝑎3 0.0410
𝑥4 Debt Ratio 𝑎4 0.5010

We consider the leaf node activated by the initial feature vector as the corresponding local cross-feature, and represent
it using a one-hot vector denoted as 𝒇 𝑡 :
$f_t^{(i)} = \begin{cases} 1, & \text{if } i = Q_t(\boldsymbol{x}), \\ 0, & \text{if } i \neq Q_t(\boldsymbol{x}), \end{cases} \quad i \in \{1, \dots, L_t\}.$ (3)

In contrast to the vanilla GBDT, wherein the evaluation weights of all activated leaf nodes are aggregated to obtain
the final output, our proposed method involves preserving and concatenating all activated leaf nodes to produce the
global cross-feature (hereafter referred to as the "cross-feature"). The total number of leaf nodes in the forest is denoted
as $N_L = \sum_{t=1}^{\tau} L_t$. We represent the resulting cross-feature as a multi-hot vector $\boldsymbol{q} \in \{0, 1\}^{N_L}$:

𝒒 = GBDT(𝒙|𝑄) = concat(𝒇 1 , … , 𝒇 𝜏 ), (4)

where 𝒒 denotes a sparse vector, with binary elements that assume the value of 1 corresponding to the leaf nodes
that are activated by the initial feature vector in each tree. In contrast, the elements equal to 0 represent all of the
non-activated leaf nodes in the forest.
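As a concrete illustration of Eqs. (3)–(4), the sketch below (scikit-learn assumed; not the authors' implementation) obtains the leaf index 𝑄𝑡(𝒙) reached in every tree via apply(), one-hot encodes each tree's leaf index, and concatenates the results into the multi-hot cross-feature vector 𝒒.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import OneHotEncoder

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# Pre-trained tree-based ensemble (tau = 64 trees of maximum depth 4, as in Section 5.3.1).
gbdt = GradientBoostingClassifier(n_estimators=64, max_depth=4, random_state=0).fit(X, y)

# Leaf index Q_t(x) reached in each of the tau trees: shape (n_samples, tau).
leaves = gbdt.apply(X)[:, :, 0]

# One-hot encode each tree's leaf index (Eq. 3) and concatenate the columns (Eq. 4),
# yielding a sparse multi-hot vector q with exactly tau non-zero entries per sample.
encoder = OneHotEncoder(handle_unknown="ignore").fit(leaves)
q = encoder.transform(leaves)   # shape (n_samples, ~N_L); counts leaves observed during fitting
print(q.shape, q[0].nnz)
```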
The GBDT illustrated in Fig. 1 comprises two subtrees, denoted as 𝑄1 and 𝑄2 , both consisting of 8 leaf nodes.
Assuming that 𝒙 eventually reaches the eighth leaf node of 𝑄1 and the sixth leaf node of 𝑄2 , the corresponding
cross-feature 𝒒 is represented as [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0]. The specific semantics of the feature variables
{𝑥0 , … , 𝑥4 } and thresholds {𝑎0 , … , 𝑎4 } are provided in Table 1. Furthermore, the semantic information of the two
local cross-features derived from 𝒙 can be described as follows: 1) The number of payments that are overdue for more
than 90 days exceeds 0.5, the number of payments overdue for 60 to 89 days is greater than 0.5, and the proportion of
debt is greater than 4.1%; 2) The number of payments that are overdue for more than 90 days exceeds 0.5 but not more
than 1.5, and the number of payments overdue for 30 to 59 days is greater than 1.5.
It is important to acknowledge that various tree-based ensemble models, including RF, extreme gradient boosting
(XGBoost), and light gradient boosting machine (LightGBM), can produce cross-features similar to those generated by
GBDT. Compared with XGBoost and LightGBM, RF demonstrates greater randomness, leading
to the creation of a more diverse set of cross-features; on the other hand, GBDT leverages a more comprehensive
decision path, resulting in the generation of cross-features with richer semantic information. In this study, RF and
GBDT are the selected models to generate cross-features.

4. Methodology
In this section, we elaborate on the model based on contrastive augmentation and tree-enhanced embedding
mechanisms. We propose three mechanisms to enhance the collaboration between the tree-based ensemble models
and the embedding-based model: 1) Data augmentation technique: We consider the construction of cross-features by tree-
based ensemble models to be analogous to cropping the initial features. Multiple tree-based ensemble models are pre-trained
to enrich the diversity of the training data, resulting in more robust representations with improved generalization
capability; 2) Decision path information fusion technique: We aggregate the embedding vectors corresponding to each
node on the decision path as its representation to preserve as much structural information as possible about the decision
trees; 3) Dual-task learning technique: In addition to the classification loss, we employ the supervised contrastive loss
to encourage the embedding-based model to generate distinct representations across different categories. As illustrated
in Fig. 2, CATE is primarily composed of four stages:


Fig. 2. The architecture of the CATE model, which comprises four stages: data augmentation, decision path information fusion, additive attention, and dual-task learning.

1. Multiple tree-based ensemble models are pre-trained for data augmentation to generate diverse cross-features
for each credit transaction;
2. The embedding vectors corresponding to each node on the decision path are aggregated so that the representation
corresponding to the decision path retains as much tree structure information as possible;
3. The interactions between the initial feature and the local cross-features are learned through an additive attention
mechanism, where the local cross-features are assigned various attention weights according to the initial feature
vector for each transaction;
4. Dual-task learning is employed. The first task aims to improve the separability of the learned representations
across distinct categories and increases their similarity within the same category through supervised contrastive
learning. The second task involves optimizing the model parameters through a classification task to improve the
suitability of the learned representations for credit scoring scenarios.
Each of the four stages will be described in detail in the following sections.

4.1. Tree-based data augmentation


The practice of data augmentation has gained widespread popularity in representation learning within the
realm of image analysis [42]. In this study, we propose a data augmentation operation, which employs tree-based
ensemble models as a cross-feature construction method that bears similarity to the image cropping approach. This
operation combines specific feature variables to represent the initial feature vector 𝒙 and generates new, transformed
representations of the same data. Crucially, cross-features generated by different tree-based ensemble models for the
same transaction may vary significantly, revealing important distinctions between the initial feature vector and its
augmented counterparts. To simulate random pruning operations, we employ a range of tree-based ensemble models
and select two, GBDT and RF, to generate cross-features for each feature vector. These cross-features serve as distinct
perspectives on the same transaction and are denoted by 𝒒 GBDT ∈ {0, 1}𝑁𝐿𝐺 and 𝒒 RF ∈ {0, 1}𝑁𝐿𝑅 , respectively:

𝒒 GBDT = GBDT(𝒙|𝑄GBDT ),
𝒒 RF = RF(𝒙|𝑄RF ), (5)


where 𝑁𝐿𝐺 and 𝑁𝐿𝑅 denote the number of leaves in the GBDT and RF, respectively, while 𝑄GBDT and 𝑄RF denote
the set of DTs corresponding to the GBDT and RF, respectively.
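A brief sketch of Eq. (5) under the same scikit-learn assumptions: two pre-trained ensembles, GBDT and RF, yield two cross-feature "views" 𝒒GBDT and 𝒒RF of the same transactions, which later serve as augmented samples for contrastive learning.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.preprocessing import OneHotEncoder

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

def fit_view(model, X, y):
    """Pre-train one tree-based ensemble and return a function mapping x to its multi-hot q."""
    model.fit(X, y)
    leaves = model.apply(X)
    if leaves.ndim == 3:                      # GBDT: (n, tau, 1) -> (n, tau)
        leaves = leaves[:, :, 0]
    enc = OneHotEncoder(handle_unknown="ignore").fit(leaves)
    def to_q(X_new):
        l = model.apply(X_new)
        return enc.transform(l[:, :, 0] if l.ndim == 3 else l)
    return to_q

# Two augmented views of the same credit transactions (Eq. 5).
to_q_gbdt = fit_view(GradientBoostingClassifier(n_estimators=64, max_depth=4), X, y)
to_q_rf = fit_view(RandomForestClassifier(n_estimators=64, max_depth=4), X, y)
q_gbdt, q_rf = to_q_gbdt(X), to_q_rf(X)
```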

4.2. Decision path information fusion


In this subsection, we treat the cross-features generated by different tree-based ensemble models uniformly,
denoting them all as 𝒒. To address the problem of model interpretability, sparse linear methods can be utilized to
determine the importance of local cross-features. Specifically, the local cross-feature with the highest score is selected
as the explanation for evaluation [34]. To further incorporate high-level semantic information, each leaf node in the
forest is associated with a learnable dense embedding vector denoted as 𝒆𝑙 ∈ ℝ𝑑 [19], where 𝑑 represents the dimension
of the embedding vector. The embedding mechanism provides two significant benefits: 1) Frequently co-occurring local
cross-features may be correlated, and embedding learning can model this potential correlation. This can enable relevant
local cross-features to be mapped to close elements in the embedding space, thereby alleviating the sparsity problem
of cross-features; 2) Since the embedding vectors are learnable during training, additional embedding information can
be incorporated, providing flexibility to the model.
After obtaining the multi-hot vector 𝒒 from the tree-based ensemble model, the cross-feature is transformed into
an embedding matrix 𝑬 ∈ ℝ𝜏×𝑑 , by collecting the corresponding embedding vectors of non-zero elements:
𝑬 = 𝜙([𝑞1 𝒆1 , … , 𝑞𝑁𝐿 𝒆𝑁𝐿 ]), ∀𝑞𝑙 ≠ 0 and 𝑞𝑙 ∈ 𝒒, (6)

where 𝜙(⋅) denotes the function to remove all zero row vectors from the matrix. As 𝒒 is a sparse vector with very
few nonzero elements, the resulting embedding matrix only contains the embedding vectors that correspond to the
activated leaf nodes.
Intuitively, we posit that the embedding vectors of local cross-features corresponding to neighboring leaf nodes
will be similar due to the partial overlap between the decision paths they are located in. To elaborate, assuming there

are 𝑁 nodes in the forest, and each node 𝑣𝑛 ∈ 𝜏𝑡=1 𝑉 (𝑡) is mapped into a learnable embedding vector 𝒆𝑛 ∈ ℝ𝑑 . To
enhance the connection between the embedding vectors and the raw data, the embedding vectors are initialized with
the mean of the samples reaching the corresponding nodes in the forest. Each leaf node 𝑣𝑙 is then mapped to the path
embedding matrix 𝑷 𝑙 ∈ ℝ|𝑃 (𝑙)|×𝑑 , which results from concatenating the embedding vectors corresponding to all nodes
along the decision path:

𝑷 𝑙 = concat(… , 𝒆𝑛 , … ), ∀𝑛 ∈ 𝑃 (𝑙), (7)

where 𝑃 (𝑙) denotes the index set of nodes in the decision path corresponding to the leaf node 𝑣𝑙 . To accommodate
varying path lengths |𝑃 (𝑙)|, the row vectors are aggregated into a path fusion vector. To retain the tree structure
information to the greatest extent possible, LSTM is adopted to learn the construction process of the DT for aggregation:

$\boldsymbol{b}_o^{(n)} = \mathrm{LSTM}(\boldsymbol{e}_{(n)}, \boldsymbol{b}_o^{(n-1)}),$ (8)

where 𝒆(𝑛) denotes the 𝑛-th row in the path embedding matrix and $\boldsymbol{b}_o^{(n)}$ denotes the corresponding output of the LSTM
block. The mechanism is illustrated in Fig. 3a. Let 𝐿 denote the final row index in the path embedding matrix. The
resulting path fusion vector 𝒑𝑙 ∈ ℝ𝑑 is represented as follows:

$\boldsymbol{p}_l = \boldsymbol{b}_o^{(L)}.$ (9)

By applying the above method, the path fusion vectors corresponding to adjacent leaf nodes contain information
about the overlapping nodes between their decision paths, thereby incorporating the tree structure information into the
path fusion vectors. We replace dense embedding vectors with path fusion vectors to construct the path fusion matrix
𝑬 𝑃 ∈ ℝ𝜏×𝑑 :
𝑬 𝑃 = 𝜙([𝑞1 𝒑1 , … , 𝑞𝑁𝐿 𝒑𝑁𝐿 ]), ∀𝑞𝑙 ≠ 0 and 𝑞𝑙 ∈ 𝒒. (10)

The path fusion matrix describes the high-level semantic information of all activated local cross-features and can be
utilized in the downstream classification task. This approach strengthens the connection between the upstream tree-
based ensemble models and the downstream embedding-based model.
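The path fusion step (Eqs. 7–9) can be sketched in PyTorch as follows. The sketch assumes that decision paths have already been extracted from the pre-trained forest as lists of node indices (root to leaf) and, for simplicity, that they are padded to a common length; the mean-based initialization of the node embeddings described above is omitted here.

```python
import torch
import torch.nn as nn

class PathFusion(nn.Module):
    """Aggregate node embeddings along a decision path with an LSTM (Eqs. 7-9)."""
    def __init__(self, num_nodes: int, d: int):
        super().__init__()
        self.node_emb = nn.Embedding(num_nodes, d)      # one learnable vector e_n per tree node
        self.lstm = nn.LSTM(input_size=d, hidden_size=d, batch_first=True)

    def forward(self, path_node_ids: torch.Tensor) -> torch.Tensor:
        # path_node_ids: (batch, path_len) node indices from the root to the leaf.
        path_emb = self.node_emb(path_node_ids)         # path embedding matrix P_l, Eq. (7)
        out, _ = self.lstm(path_emb)                    # b_o^{(n)} for every step, Eq. (8)
        return out[:, -1, :]                            # p_l = b_o^{(L)}, Eq. (9)

# Example: fuse two decision paths of length 4 from a forest with 500 nodes in total.
fusion = PathFusion(num_nodes=500, d=32)
paths = torch.tensor([[0, 3, 10, 21], [0, 4, 12, 27]])
p = fusion(paths)                                       # (2, 32) path fusion vectors
```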


Fig. 3. Illustration of components in CATE: (a) the decision path information fusion mechanism, (b) the attention network, and (c) the contrastive learning framework.

4.3. Additive attention mechanism


The approach of mapping cross-features into path fusion matrices is effective in capturing high-level semantic
information and identifying potential correlations between local cross-features. However, a significant drawback of this
approach is its limitation in accurately modeling data when the same cross-feature is attributed to different transactions.
This limitation poses a particular challenge in real-world applications, as transactions with identical cross-features may
have distinct default risks. For instance, a second-order local cross-feature such as [𝑥age > 18] ∧ [𝑥𝑡36 = 1] can be
assigned to both a 30-year-old customer and a 65-year-old customer. Nevertheless, their actual default risks differ
significantly, as the older customer is more likely to have a singular source of income and is more prone to various
diseases, leading to unexpectedly high consumption and thereby increasing default risk. In this case, the local cross-
feature [𝑥age > 18] ∧ [𝑥𝑡36 = 1] holds greater significance for the 30-year-old customer, while other local cross-features
such as [𝑥age > 60] ∧ [𝑥𝑡36 = 1] might be more important for the 65-year-old customer.
In this study, we assign varying attention weights to activated local cross-features to capture the variations in
their significance when assessing different transactions. To ensure scalability and generalization, we propose modeling
attention weight 𝑎𝑥𝑙 as a function of both the initial feature vector and the path fusion vector, instead of relying
solely on learning from the raw data. To this end, we utilize an MLP, called the attention network, to learn the attention
weights, which can be formulated as follows:
$a'_{xl} = \boldsymbol{h}^{\top} \cdot \sigma\left(\boldsymbol{W} \cdot \mathrm{concat}(\boldsymbol{x}, \boldsymbol{p}_l) + \boldsymbol{b}\right), \quad l \in L(\boldsymbol{x}),$
$a_{xl} = \frac{\exp(a'_{xl})}{\sum_{l' \in L(\boldsymbol{x})} \exp(a'_{xl'})},$ (11)

where 𝐿(𝒙) denotes the index set of leaf nodes activated by 𝒙 and |𝐿(𝒙)| = 𝜏, 𝑾 ∈ ℝ𝑑ℎ ×(𝑑𝑥 +𝑑) and 𝒃 ∈ ℝ𝑑ℎ represent
the weight matrix and bias vector of the hidden layer respectively, 𝑑ℎ denotes the dimension of the hidden vector, the
output of the hidden layer is mapped to the attention weight by vector 𝒉 ∈ ℝ𝑑ℎ and 𝜎(⋅) represents the ReLU activation
function. The attention weight is then normalized using the softmax function.
Fig. 3b illustrates our attention network. Notably, the attention network is shared to calculate the attention weights
for the local cross-features output by different tree-based ensemble models. These attention weights indicate which
local cross-features receive greater consideration during the evaluation process. We aggregate path fusion vectors
using attention weights to obtain the representation 𝒓 ∈ ℝ𝑑 corresponding to each transaction:

$\boldsymbol{r} = \mathrm{Enc}(\boldsymbol{x}, \boldsymbol{E}_P) = \sum_{l \in L(\boldsymbol{x})} a_{xl} \boldsymbol{p}_l.$ (12)


By obtaining the representation through a weighted sum, the path fusion vectors that receive smaller attention weights
have a relatively minimal effect on the ultimately generated representation. The incorporation of embedding and
attention mechanisms provides CATE with robust representation capabilities and guarantees the effectiveness of the
model.
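The additive attention of Eqs. (11)–(12) can be sketched in PyTorch as below (shapes and names are illustrative): an MLP scores every activated path fusion vector conditioned on the raw feature vector, the scores are softmax-normalized, and the attention-weighted sum produces the transaction representation 𝒓.

```python
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    """Score each activated local cross-feature given the raw features (Eq. 11)."""
    def __init__(self, d_x: int, d: int, d_h: int = 64):
        super().__init__()
        self.hidden = nn.Linear(d_x + d, d_h)        # W and b
        self.score = nn.Linear(d_h, 1, bias=False)   # h

    def forward(self, x: torch.Tensor, P: torch.Tensor):
        # x: (batch, d_x) raw features; P: (batch, tau, d) path fusion vectors.
        x_rep = x.unsqueeze(1).expand(-1, P.size(1), -1)
        a = self.score(torch.relu(self.hidden(torch.cat([x_rep, P], dim=-1))))
        a = torch.softmax(a.squeeze(-1), dim=-1)     # attention weights a_xl
        r = torch.bmm(a.unsqueeze(1), P).squeeze(1)  # r = sum_l a_xl * p_l, Eq. (12)
        return r, a

attn = AdditiveAttention(d_x=10, d=32)
r, a = attn(torch.randn(8, 10), torch.randn(8, 64, 32))   # 64 activated leaves per view
```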

4.4. Dual-task learning mechanism


In this work, we introduce supervised contrastive learning to make more efficient use of labels. The method aims
to improve the effectiveness of representation by clustering elements belonging to the same class while separating
those from different classes in the embedding space. To achieve this goal, we employ a two-layer shallow NN as the
projection head 𝑔(⋅) to map the representation 𝒓 to the hidden vector 𝒛 ∈ ℝ𝑑 :
𝒛 = 𝑔(𝒓) = 𝑾 (2) 𝜎(𝑾 (1) 𝒓), (13)
where 𝑾 (1) ∈ ℝ𝑑×𝑑 and 𝑾 (2) ∈ ℝ𝑑×𝑑 denote the weight matrices of the hidden layers. Fig. 3c illustrates our
contrastive learning framework. It is important to note that the projection head is shared across the outputs of various
tree-based ensemble models.
Assume that 𝑁𝑠 samples are selected in each iteration for training as a mini-batch. We consider the transformed
cross-features generated by the tree-based ensemble models as augmented data, resulting in a total of 2𝑁𝑠 samples. To
quantify the similarity between two vectors 𝒗 and 𝒖, we use the dot product of their 𝓁2 normalized forms, denoted by
$\mathrm{sim}(\boldsymbol{u}, \boldsymbol{v}) = \frac{\boldsymbol{u}^{\top}\boldsymbol{v}}{\|\boldsymbol{u}\| \|\boldsymbol{v}\|}$. For contrastive learning, we utilize the augmented samples and define the supervised contrastive
loss function as follows:
$\mathcal{L}_c = \sum_{i \in I} \frac{-1}{|S(i)|} \sum_{s \in S(i)} \log \frac{\exp(\mathrm{sim}(\boldsymbol{z}_i, \boldsymbol{z}_s) / \tau_c)}{\sum_{a \in A(i)} \exp(\mathrm{sim}(\boldsymbol{z}_i, \boldsymbol{z}_a) / \tau_c)},$ (14)

where 𝐼 ≡ {1, … , 2𝑁𝑠 } denotes the index set of augmented samples, 𝐴(𝑖) ≡ 𝐼 ⧵ {𝑖} denotes the index set of other
augmented samples excluding sample 𝑖, 𝑆(𝑖) ≡ {𝑠 ∈ 𝐴(𝑖) ∶ 𝑦𝑠 = 𝑦𝑖 } denotes the index set of other augmented
samples with the same label as sample 𝑖, and 𝜏𝑐 ∈ ℝ+ is a scalar temperature parameter.
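A compact PyTorch implementation of the supervised contrastive loss in Eq. (14) is sketched below (following Khosla et al. [47]; z holds the projections of the 2𝑁𝑠 augmented samples and is ℓ2-normalized inside the function).

```python
import torch
import torch.nn.functional as F

def supcon_loss(z: torch.Tensor, labels: torch.Tensor, tau_c: float = 0.1) -> torch.Tensor:
    """Supervised contrastive loss (Eq. 14). z: (2*N_s, d) projections, labels: (2*N_s,)."""
    z = F.normalize(z, dim=-1)                        # so the dot product equals cosine similarity
    sim = z @ z.t() / tau_c                           # sim(z_i, z_a) / tau_c
    n = z.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float("-inf"))   # exclude i itself, i.e. A(i) = I \ {i}
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask   # S(i)
    n_pos = pos_mask.sum(dim=1).clamp(min=1)
    loss = -log_prob.masked_fill(~pos_mask, 0.0).sum(dim=1) / n_pos
    return loss.sum()                                 # Eq. (14) sums over every i in I
```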
The utilization of the contrastive learning framework is a common approach for pre-training the encoder. Such
a framework relies on the contrastive loss to pre-train the encoder, followed by training a classification network
while keeping the encoder parameters fixed during the classification task. However, the initial embedding vectors
are not realistic data, and as a result, relying solely on the contrastive objective may lead to local optima that are
irrelevant to the classification task, ultimately impairing the classification ability of the model. Therefore, to ensure
optimal performance, it is necessary to perform both the contrastive learning and the classification task concurrently.
Specifically, the classification task can be defined as follows:

$\hat{y} = \mathrm{sigmoid}(b_0 + \boldsymbol{b}_1^{\top} \boldsymbol{r}_{\mathrm{GBDT}} + \boldsymbol{b}_2^{\top} \boldsymbol{r}_{\mathrm{RF}}),$ (15)

where 𝑏0 denotes the bias term, 𝒃1 ∈ ℝ𝑑 and 𝒃2 ∈ ℝ𝑑 denote the parameters of two LR models, respectively.
𝒓GBDT and 𝒓RF are the representations corresponding to the output cross-features of GBDT and RF, respectively. The
classification layer of CATE follows a shallow additive model structure that enables the assessment of the contribution
of individual components, thereby enhancing the interpretability of the model. We use the cross-entropy loss function
as the classification objective function:

$\mathcal{L}_p = -\sum_{i \in I} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right].$ (16)

Let 𝜆 denote the 𝓁2 regularization hyperparameter used to avoid overfitting and 𝜃 denote all learnable model
parameters; the final objective function for CATE is then given by:

$\mathcal{L} = \mathcal{L}_c + \mathcal{L}_p + \lambda \|\theta\|^2.$ (17)
The CATE model is trained in two cascaded stages. In the first stage, GBDT and RF models are pre-trained
for the extraction of cross-features. In the second stage, the mini-batch gradient descent technique, combined with the
Adam algorithm, is employed to optimize the embedding-based classification model. This two-stage process provides
a holistic approach that effectively leverages the strengths of both tree-based ensemble models and embedding-based
method, resulting in improved performance for classification tasks.
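A sketch of the classification head and the dual-task objective (Eqs. 15–17) is given below, assuming the representations 𝒓GBDT and 𝒓RF and the projections 𝒛 produced by the components sketched above; the ℓ2 penalty 𝜆‖𝜃‖² is delegated to the optimizer's weight decay.

```python
import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    """y_hat = sigmoid(b0 + b1^T r_GBDT + b2^T r_RF), Eq. (15)."""
    def __init__(self, d: int):
        super().__init__()
        self.lin_gbdt = nn.Linear(d, 1)              # b1 (its bias plays the role of b0)
        self.lin_rf = nn.Linear(d, 1, bias=False)    # b2

    def forward(self, r_gbdt, r_rf):
        return torch.sigmoid(self.lin_gbdt(r_gbdt) + self.lin_rf(r_rf)).squeeze(-1)

head = ClassificationHead(d=32)
bce = nn.BCELoss(reduction="sum")                    # cross-entropy summed over the batch, Eq. (16)

def cate_objective(r_gbdt, r_rf, z, y):
    # z is assumed to stack the projections of both views: z = cat([z_gbdt, z_rf]).
    l_p = bce(head(r_gbdt, r_rf), y.float())         # classification loss
    l_c = supcon_loss(z, y.repeat(2))                # contrastive loss from the sketch above
    return l_c + l_p                                 # Eq. (17); lambda*||theta||^2 via weight decay

# optimizer = torch.optim.Adam(model_parameters, lr=1e-3, weight_decay=1e-3)  # lambda = 0.001
```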


Table 2
Description of datasets.

Dataset Features Numerical features Categorical features Samples Default Default rate
Prosper 47 39 8 41,541 11,878 28.59%
LC 59 52 7 1,089,032 202,677 18.61%
Give 10 10 0 116,557 7,983 6.85%
CarLoan 49 45 4 149,988 26,544 17.70%

5. Experimental study
5.1. Data description
This study employs four public credit datasets to conduct credit scoring experiments, as illustrated in Table 2.
Prosper dataset 1 is obtained from the reputable Prosper online lending platform in the United States. Lending Club
(LC) dataset 2 , spanning a period from 2007 to 2017, is a valuable source of information for academic research, as
it offers insights into the workings of Lending Club, the largest peer-to-peer (P2P) online lending platform in the
United States. Give me some credit (Give) dataset 3 is a public credit scoring dataset available on the Kaggle competition
platform. Car loan (CarLoan) dataset 4 is available from the Developer Competition of the Xunfei Open Platform.

1 https://www.kaggle.com/datasets/yousuf28/prosper-loan
2 https://www.lendingclub.com/
3 https://www.kaggle.com/c/GiveMeSomeCredit/overview
4 https://challenge.xfyun.cn/topic/info?type=car-loan
The data preprocessing methodology utilized in this study involves several distinct steps. Firstly, the default and
normal labels, represented by numerical values of 1 and 0 respectively, are screened. Subsequently, samples with
similar labels are merged into one of these two types of labels, while samples with irrelevant labels are eliminated.
Secondly, features exhibiting a high missing rate of over 50% are removed, followed by the removal of samples still
containing missing values. Thirdly, categorical features with fewer than 20 categories undergo one-hot encoding, while
those with more than 20 categories are transformed using frequency encoding. Furthermore, the study incorporates
additional preprocessing techniques for specific datasets. However, due to space limitations, a detailed account of
these techniques is not provided in this study.
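The preprocessing steps described above can be sketched with pandas roughly as follows (the label column name is hypothetical; dataset-specific steps are omitted, as in the paper).

```python
import pandas as pd

def preprocess(df: pd.DataFrame, label_col: str = "default") -> pd.DataFrame:
    # Step 1: keep only transactions whose label has been mapped to 0 (normal) or 1 (default).
    df = df[df[label_col].isin([0, 1])]

    # Step 2: drop features with more than 50% missing values, then drop incomplete rows.
    df = df.loc[:, df.isna().mean() <= 0.5].dropna()

    # Step 3: one-hot encode low-cardinality categoricals, frequency-encode the rest.
    for col in df.select_dtypes(include=["object", "category"]).columns:
        if df[col].nunique() < 20:
            df = pd.get_dummies(df, columns=[col])
        else:
            df[col] = df[col].map(df[col].value_counts(normalize=True))
    return df
```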
In this study, a rigorous methodology is adopted to validate the model by employing the five-fold cross-validation
test. The rationale behind this approach is to mitigate the impact of random data partitioning and to generate robust
evaluation outcomes. The original dataset is divided into five approximately equal subsets, and during each cycle of
the cross-validation process, one subset is selected as the test set while the other four subsets are combined to form the
training set. This process is repeated five times, ensuring that every subset is utilized as the test set once. Finally, the
mean of the evaluation results obtained from the five cycles is used as the conclusive outcome of the experiment.
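A sketch of the five-fold cross-validation protocol with scikit-learn is shown below; stratified splits are used here as a reasonable choice for imbalanced credit data, although the paper does not state whether the folds are stratified.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score

def cross_validate(model_factory, X, y, n_splits=5, seed=0):
    scores = []
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, test_idx in folds.split(X, y):
        model = model_factory()                       # a fresh model per fold
        model.fit(X[train_idx], y[train_idx])
        prob = model.predict_proba(X[test_idx])[:, 1]
        scores.append(roc_auc_score(y[test_idx], prob))
    return np.mean(scores)                            # mean of the five folds is reported
```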

5.2. Performance metrics


Credit scoring can be effectively framed as a binary classification problem to predict default probability. Conse-
quently, evaluation outcomes can be classified into four distinct categories: true positive (TP), false positive (FP), true
negative (TN), and false negative (FN). TP refers to instances where actual defaults are accurately classified as such,
while FP describes instances where normal transactions are erroneously labeled as defaults. Similarly, TN denotes
instances where normal transactions are correctly identified as such, and FN reflects instances where actual defaults
are incorrectly labeled as normal ones.
This study employs widely recognized performance evaluation metrics in the field of credit scoring to assess the
effectiveness of the model in predicting credit risk. In practical applications, there are two important considerations
when evaluating the performance of credit scoring models. Firstly, misclassifying risky transactions can result in
significant economic losses. Therefore, the metrics of recall (Rec) and balanced F-score (F-score) are chosen to assess
the ability of the model to predict risky transactions. Secondly, credit datasets often suffer from class imbalance. To
account for this nature and provide a more comprehensive assessment, G-mean and balanced accuracy (BAcc) are
selected to evaluate model performance. Furthermore, predicting too many normal transactions as risky transactions
can lead to missed investment opportunities. To capture this aspect, metrics such as accuracy (Acc), Matthews
correlation coefficient (MCC), and area under the receiver operating characteristic (ROC) curve (AUC) are employed
to measure the overall performance of the model. The ROC curve is obtained by setting different thresholds on the
decision function used to compute the false positive rate (FPR) and the true positive rate (TPR), and AUC is computed
using the trapezoidal rule. The definitions of the other indicators are as follows:
TP
Rec = ,
TP + FN
2 × Rec × precision 2 × TP
F-score = = ,
Rec + precision 2 × TP + FN + FP

√ TP × TN
G-mean = Rec × specif icity = ,
(TP + FN) × (TN + FP)
(18)
TP + TN
Acc = ,
TP + FN + TN + FP
Rec + specif icity TP × (TN + FP) + TN × (TP + FN)
BAcc = = ,
2 2 × (TP + FN) × (TN + FP)
TP × TN − FP × FN
MCC = √ ,
(TP + FP) × (TP + FN) × (TN + FP) × (TN + FN)
The higher the Rec, the stronger the recognition ability of the model for default samples. The F-score, which is
the harmonic average of precision and recall, serves as an indicator to balance the recognition ability and accuracy of
the model for default samples. A higher F-score signifies a stronger ability of the model to evaluate default samples.
The G-mean score measures the overall identification ability of the model. A higher value indicates that the model has
a relatively balanced identification ability for both default and normal samples. Acc represents the ratio of correctly
evaluated samples by the model. However, it is important to note that Acc may not be a suitable evaluation metric for
highly imbalanced datasets, as it can produce falsely high results. In such cases, BAcc and AUC can provide a better
overall evaluation metric. A higher AUC value signifies a stronger ability of the model to distinguish between different
categories of samples. BAcc is similar to G-mean in that it reflects relatively balanced accuracy rates for default and
normal cases. The MCC is essentially a correlation coefficient, with 1 representing a perfect model, 0 representing a
random prediction, and -1 representing the exact opposite prediction. Overall, higher values of these metrics indicate
that the classification model is more robust and performs better.
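For reference, the metrics in Eq. (18) can be computed directly from the confusion matrix, as in the following sketch (scikit-learn assumed; y_true and y_pred are binary labels and predictions, and y_score the predicted default probabilities).

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

def credit_metrics(y_true, y_pred, y_score):
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    rec = tp / (tp + fn)
    spec = tn / (tn + fp)
    prec = tp / (tp + fp)
    return {
        "Rec": rec,
        "F-score": 2 * prec * rec / (prec + rec),
        "G-mean": np.sqrt(rec * spec),
        "Acc": (tp + tn) / (tp + tn + fp + fn),
        "BAcc": (rec + spec) / 2,
        "MCC": (tp * tn - fp * fn) / np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
        "AUC": roc_auc_score(y_true, y_score),   # trapezoidal rule over the ROC curve
    }
```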

5.3. Results and discussion


5.3.1. Performance comparison
To evaluate the effectiveness of the proposed approach, this study employs four individual models (LR, DT,
SVM, NN) and four ensemble models (AdaBoost, Bagging, GBDT, and RF) to conduct comparative experiments.
LR, DT, SVM, and NN are commonly used as baseline models in credit scoring research. All eight models possess
hyperparameters that can significantly impact their performance. To fine-tune these hyperparameters, a grid search
optimization procedure is utilized. Table 3 outlines the relevant hyperparameters of
the baseline models, including the corresponding searching space, while non-listed hyperparameters follow the default
settings of the algorithm packages. It is worth mentioning that the hyperparameters are adjusted for each experiment
according to the specific dataset being used.
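As an example of this tuning procedure, the sketch below runs the grid search from Table 3 for the RF baseline with scikit-learn; the same pattern applies to the other classifiers (synthetic data is used here purely for illustration).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X_train, y_train = make_classification(n_samples=500, n_features=10, random_state=0)

param_grid = {                                   # searching space from Table 3
    "n_estimators": [32, 64, 96, 128],
    "max_depth": [2, 4, 6, 8],
}
search = GridSearchCV(RandomForestClassifier(), param_grid, scoring="roc_auc", cv=5)
search.fit(X_train, y_train)
print(search.best_params_)                       # best setting for the current dataset
```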
To further evaluate the effectiveness of the proposed tree-enhanced model, GBDT+LR, TEM, AugBoost,
XGB+forgeNet and TEDAN are included in the comparative experiment. To ensure a fair comparison, all tree-
enhanced methods leverage identical pre-trained tree-based ensemble models for constructing cross-features and
employ the same classification objective function for training the downstream model. Specifically, the number of
decision trees in the tree-based ensemble models is set to 64, and the maximum depth of decision trees in the tree-based
ensemble models is set to 4. The regularization hyperparameter 𝜆 is set to 0.001.
To validate the effectiveness of CATE in improving the performance of tree-based ensemble models, we have
chosen the ROC curve as the standard for comparative analysis. The ROC curve is a commonly used evaluation
metric for binary classification tasks, providing a visual representation of model performance by plotting the FPR
on the X-axis and the TPR on the Y-axis. This curve allows for the evaluation of the classification model
across different thresholds, with a larger area under the curve indicating superior model performance. As illustrated in
Fig. 4, the proposed CATE model performs significantly better than other credit scoring models on the Prosper dataset.


Table 3
Parameters for grid search of baseline credit scoring models.

Classifier Parameter Function Searching space


LR C Control the strength of regularization { 10−3 , 10−2 , 10−1 , 1, 10, 102 , 103 }
solver Algorithm to use in the optimization problem { newton-cg, lbfgs, liblinear, sag }
DT criterion The function to measure the quality of a split { gini, entropy }
splitter The strategy used to choose the split at each node { best, random }
max_depth The maximum depth of the tree { 2, 4, 6, 8 }
NN num_hidden The number of hidden layers { 1, 2, 3 }
d_ratio Dropout ratio { 0, 0.1, 0.2, 0.3, 0.4 }
Bagging n_estimators The number of base estimators in the ensemble { 32, 64, 96, 128 }
max_samples The ratio of samples to draw each stage { 0.25, 0.5, 0.75, 1.0 }
max_features The ratio of features to draw each stage { 0.25, 0.5, 0.75, 1.0 }
AdaBoost n_estimators The maximum of base estimators { 32, 64, 128 }
learning_rate Contribution of each base estimator { 0.1, 1, 2 }
algorithm The algorithm to implement boosting { SAMME, SAMME.R }
RF n_estimators The number of decision trees in the forest { 32, 64, 96, 128 }
max_depth The maximum depth of the tree { 2, 4, 6, 8 }
GBDT n_estimators The number of boosting stages to perform { 32, 64, 96, 128 }
max_depth Maximum depth of the individual CART { 2, 4, 6, 8 }

However, since the LC and Give datasets have a larger sample size, the performance of the proposed model may not
differ significantly from other models on these two datasets, but it still exhibits superior performance. Conversely, the
proposed model fails to achieve optimal performance when applied to the CarLoan dataset, which will be expounded
upon in subsequent analysis. It should be noted that the LC and CarLoan datasets have a large number of samples,
and the kernel-based SVM model requires substantial computation. Therefore, the SVM model was not trained to its
full potential on these datasets, resulting in relatively poor performance. When SVM is applied to the Give dataset,
a part of the ROC curve is below the diagonal, indicating that for highly imbalanced datasets, SVM has difficulty
identifying default samples, resulting in misclassifying some default samples with a relatively low probability. The
outcomes of the Prosper, LC, and Give datasets highlight the benefits of contrast augmentation and tree-enhanced
embedding mechanisms in improving the performance of classical GBDT.
Table 4 offers a performance comparison of various credit scoring models on datasets with lower imbalance
rates. Among the models, CATE stands out as the superior performer on all indicators for the Prosper dataset. This
finding highlights the effectiveness of CATE as a suitable option for credit scoring on the Prosper dataset. GBDT+LR,
TEM, XGB+forgeNet and TEDAN outperform GBDT on all indicators, implying that the cross-feature extraction
mechanism and leaf node re-weighting mechanism are effective. However, CATE performs even better than these
four, indicating that the contrast augmentation mechanism can further improve the ability of the model to extract
separable representations. It is worth noting that AugBoost performs worse than GBDT on all metrics, and RF performs
even worse. This suggests that the quality of the cross-features generated by RF can impact the performance of
AugBoost.
The experimental results obtained from the LC dataset are analogous to those from the Prosper dataset, with
the exception that the CATE model only shows a marginal improvement in terms of performance indicators over
other models. The LC dataset is characterized by a substantial number of training samples, which poses significant
computational challenges to the kernel-based SVM model. As a result, the model was not trained to the optimal level,
leading to relatively inferior performance. In contrast to the results obtained on the Prosper dataset, all models except RF demonstrate satisfactory performance. The poor result of RF can be attributed to its random selection of features, which on imbalanced datasets may omit crucial features and thereby hinder the ability of the model to identify default samples accurately.
Table 5 presents a comparative analysis of the performance of the different credit scoring models on the datasets with higher imbalance rates. The results indicate that the CATE model outperforms the other models on all indicators in the
Give dataset, except for Acc where the difference is negligible at 0.03%. These findings suggest that the contrastive
augmentation mechanism effectively improves the ability of the model to identify default samples in highly imbalanced

[Fig. 4. ROC curves of the models on the four credit scoring datasets: (a) Prosper, (b) LC, (c) Give, (d) CarLoan. Each panel plots the true positive rate (%) against the false positive rate (%); the diagonal marks random guessing.]

datasets. In terms of individual credit scoring models, LR demonstrates superior performance, suggesting a possible
linear relationship between user characteristics and transaction default in this particular dataset. The performance
comparison between TEM and GBDT+LR suggests that embedding cross-features into dense vectors in a low-
dimensional space may not always be effective and could lead to inferior results compared to using cross-features
directly. Nonetheless, the performance of CATE provides evidence that the contrastive augmentation mechanism is a
potent solution to address the challenges posed by the aforementioned problems.
On the CarLoan dataset, the CATE model exhibits superior performance in terms of Rec, F-score, G-mean, BAcc, and MCC, whereas its Acc and AUC remain at a moderate level. Further examination of the Rec values of each model reveals how difficult it is to identify default samples in this dataset. The performance of RF and XGB+forgeNet indicates that certain features are strongly correlated with default transactions, so filtering out these features may cause a model to miss risky transactions. Moreover, although the imbalance rate of this dataset is no higher than that of the Give dataset, all models struggle, suggesting that default and normal samples possess similar characteristics that make them difficult to differentiate.
The exceptional Rec performance of the CATE model can be attributed to the contrastive augmentation mechanism, which pulls elements of the same class closer together in the embedding space while pushing them away from most elements of the other class. However, when the two classes have comparable original features, this mechanism inadvertently moves some normal samples toward the default samples, and the model misclassifies them. As a result, the model cannot confidently decide whether these samples are defaults, which lowers its Acc and AUC. In addition, because the CarLoan dataset contains numerous samples and features and the kernel-based SVM requires a significant amount of computation, the SVM was again not trained to the optimum on this dataset, resulting in poor, near-random predictions.

Table 4
Performance comparison on the datasets with lower imbalance rates.

Dataset model Rec F-score G-mean Acc BAcc MCC AUC


Prosper LR 0.3355 0.4387 0.5562 0.7545 0.6289 0.3250 0.7655
DT 0.3505 0.4612 0.5716 0.7658 0.6413 0.3590 0.7512
SVM 0.3055 0.4170 0.5347 0.7558 0.6208 0.3216 0.6208
NN 0.3229 0.4295 0.5473 0.7548 0.6253 0.3226 0.7628
AdaBoost 0.3449 0.4580 0.5679 0.7667 0.6403 0.3609 0.7554
Bagging 0.4407 0.5507 0.6422 0.7944 0.6884 0.4514 0.8246
GBDT 0.4911 0.5985 0.6794 0.8116 0.7155 0.5034 0.8667
RF 0.2681 0.3944 0.5081 0.7648 0.6159 0.3452 0.7658
GBDT+LR 0.7325 0.7439 0.8143 0.8559 0.8189 0.6439 0.9140
TEM 0.7159 0.7392 0.8078 0.8556 0.8137 0.6403 0.9148
AugBoost 0.4593 0.5604 0.6528 0.7939 0.6936 0.4527 0.8320
XGB+forgeNet 0.7141 0.7370 0.8062 0.8543 0.8123 0.6372 0.9151
TEDAN 0.7078 0.7321 0.8024 0.8519 0.8087 0.6307 0.9119
CATE 0.7620 0.7747 0.8363 0.8734 0.8400 0.6870 0.9321
LC LR 0.7532 0.7857 0.8515 0.9236 0.8579 0.7403 0.9589
DT 0.7508 0.7548 0.8416 0.9097 0.8484 0.7024 0.9417
SVM 0.7539 0.7851 0.8516 0.9233 0.8579 0.7395 0.8579
NN 0.7443 0.7829 0.8471 0.9233 0.8542 0.7380 0.9584
AdaBoost 0.7653 0.7788 0.8546 0.9192 0.8598 0.7296 0.9591
Bagging 0.7803 0.8243 0.8719 0.9382 0.8772 0.7888 0.9662
GBDT 0.7889 0.8185 0.8740 0.9350 0.8786 0.7798 0.9680
RF 0.2385 0.3816 0.4873 0.8567 0.6181 0.4377 0.9466
GBDT+LR 0.7893 0.8278 0.8764 0.9390 0.8812 0.7922 0.9708
TEM 0.7897 0.8277 0.8766 0.9389 0.8813 0.7921 0.9708
AugBoost 0.7739 0.8098 0.8658 0.9324 0.8713 0.7700 0.9655
XGB+forgeNet 0.7771 0.8215 0.8698 0.9373 0.8755 0.7858 0.9694
TEDAN 0.7884 0.8266 0.8758 0.9385 0.8806 0.7907 0.9706
CATE 0.7921 0.8314 0.8784 0.9403 0.8831 0.7968 0.9720

The experimental findings on Prosper, LC, and Give demonstrate that the stronger models achieve Acc and AUC scores surpassing 85%. Nonetheless, these metrics discriminate only weakly among models because of the imbalanced distribution of samples across the classes in these datasets. In this regard, the F-score emerges as a more suitable evaluation metric, offering a more accurate assessment of how effectively a model recognizes risky loans. Compared with the baseline models and the other tree-enhanced models, the CATE model displays superior credit scoring performance. Moreover, on the CarLoan dataset, the CATE model shows a better ability to accurately identify default samples. These findings suggest that the CATE model may be a promising approach for credit scoring.
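Because several of these indicators are less common than Acc and AUC, the following minimal scikit-learn sketch shows one way the metrics reported in Tables 4 and 5 can be computed; the fixed 0.5 decision threshold and the variable names are illustrative assumptions rather than the exact evaluation code used in this study.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, balanced_accuracy_score, f1_score,
                             matthews_corrcoef, recall_score, roc_auc_score)

def scoring_metrics(y_true, y_prob, threshold=0.5):
    """Compute Rec, F-score, G-mean, Acc, BAcc, MCC and AUC with default as the positive class."""
    y_pred = (y_prob >= threshold).astype(int)
    rec = recall_score(y_true, y_pred)                 # sensitivity on default samples
    spec = recall_score(y_true, y_pred, pos_label=0)   # specificity on normal samples
    return {
        "Rec": rec,
        "F-score": f1_score(y_true, y_pred),
        "G-mean": np.sqrt(rec * spec),
        "Acc": accuracy_score(y_true, y_pred),
        "BAcc": balanced_accuracy_score(y_true, y_pred),
        "MCC": matthews_corrcoef(y_true, y_pred),
        "AUC": roc_auc_score(y_true, y_prob),
    }
```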

5.3.2. Interpretability of CATE


The experiments in Section 5.3.1 have demonstrated the noteworthy evaluation performance of CATE in credit
scoring. However, the crucial decision-making factors that CATE identifies in assessing credit risk remain unknown.
To illustrate the interpretability of CATE, we present Table 6, which outlines a basic information comparison between
the selected default and normal cases. Furthermore, Table 6 highlights the top two items corresponding to the highest
attention scores in the local cross-features of GBDT and RF outputs, respectively.
The presented findings in Table 6 provide insights into the default case in terms of various loan-related attributes
and borrower characteristics. Specifically, the local cross-features indicate that the transaction has a higher loan interest
rate (LenderYield > 0.25, BorrowerRate > 0.2), longer loan periods (Term > 24), and larger monthly payments
(MonthlyLoanPayment > 36.205). Furthermore, the borrower belongs to a relatively rare occupation (Occupation
> 113.5), which could potentially affect the ability of the borrower to repay the loan. Upon reviewing the credit

Table 5
Performance comparison on the datasets with higher imbalance rates.

Dataset model Rec F-score G-mean Acc BAcc MCC AUC


Give LR 0.1604 0.2522 0.3987 0.9349 0.5761 0.2848 0.8414
DT 0.0787 0.1402 0.2799 0.9339 0.5377 0.2093 0.7869
SVM 0.1400 0.2265 0.3726 0.9346 0.5665 0.2669 0.6369
NN 0.1615 0.2524 0.3999 0.9346 0.5764 0.2821 0.8466
AdaBoost 0.1665 0.2556 0.4058 0.9336 0.5783 0.2778 0.8450
Bagging 0.1174 0.1959 0.3413 0.9341 0.5558 0.2439 0.8276
GBDT 0.1712 0.2637 0.4117 0.9346 0.5809 0.2893 0.8536
RF 0.0688 0.1242 0.2618 0.9336 0.5330 0.1951 0.8450
GBDT+LR 0.1755 0.2583 0.4161 0.9310 0.5810 0.2642 0.8356
TEM 0.1507 0.2409 0.3867 0.9350 0.5717 0.2786 0.8526
AugBoost 0.1628 0.2542 0.4016 0.9346 0.5771 0.2837 0.8524
XGB+forgeNet 0.1210 0.1993 0.3453 0.9338 0.5573 0.2436 0.8515
TEDAN 0.1572 0.2470 0.3946 0.9344 0.5744 0.2778 0.8533
CATE 0.1794 0.2734 0.4215 0.9347 0.5848 0.2963 0.8549
CarLoan LR 0.0030 0.0060 0.0548 0.8228 0.5011 0.0228 0.6476
DT 0.0012 0.0024 0.0223 0.8229 0.5004 0.0152 0.5921
SVM 0.0011 0.0022 0.0326 0.8228 0.5003 0.0087 0.5003
NN 0.0027 0.0053 0.0513 0.8228 0.5009 0.0201 0.6483
AdaBoost 0.0000 0.0000 0.0000 0.8230 0.5000 -0.0027 0.6345
Bagging 0.0132 0.0256 0.1145 0.8217 0.5044 0.0432 0.6197
GBDT 0.0024 0.0047 0.0486 0.8231 0.5010 0.0266 0.6517
RF 0.0000 0.0000 0.0000 0.8230 0.5000 0.0000 0.6280
GBDT+LR 0.0282 0.0522 0.1668 0.8190 0.5086 0.0558 0.6439
TEM 0.0060 0.0118 0.0764 0.8231 0.5024 0.0410 0.6570
AugBoost 0.0025 0.0050 0.0502 0.8231 0.5011 0.0289 0.6497
XGB+forgeNet 0.0000 0.0000 0.0000 0.8230 0.5000 0.0000 0.6425
TEDAN 0.0228 0.0433 0.1503 0.8216 0.5081 0.0640 0.6482
CATE 0.0398 0.0716 0.1977 0.8177 0.5124 0.0689 0.6373

history of the borrower, the platform found that there had been a large number of inquiries (TotalInquiries > 15.5),
which may suggest financial instability or a high level of credit-seeking behavior. Additionally, the borrower has
opened other credit transactions (TradesOpenedLast6Months > 0.5) and has arrears on other credit transactions (AmountDelinquent > 29.5) within the six months before the review. These factors could indicate a history of delinquency and suggest a higher risk of default for the borrower.
In comparison, the normal case is characterized by regular transactional patterns that exhibit a low loan interest
rate (LenderYield ≤ 0.142, BorrowerRate ≤ 0.125) and a short loan period (Term ≤ 24). Additionally, the borrower
tends to have a higher income (IncomeRange > 2.5, i.e., more than $50,000), a high credit card available limit
(AvailableBankcardCredit > 11009), and a low debt-to-income ratio (DebtToIncomeRatio ≤ 0.195), with debts
constituting only a minor proportion of their overall income. Furthermore, the borrower typically holds a relatively
common occupation (Occupation > 1498.5) and has worked in the same field for an extended period (EmploymentStatusDuration > 3.5). The credit history of the borrower also reveals a limited number of inquiries (TotalInquiries ≤ 4.5).
The findings of this study demonstrate the effectiveness of the CATE model in identifying reasonable judgment
conditions for credit scoring. The observed differences in the characteristic interval between the two types of
transactions provide further validation of the ability of the model to identify such conditions accurately. Therefore,
it can be concluded that the CATE model can effectively recognize the judgment conditions that determine whether a
credit transaction is risky or not. Furthermore, the use of local cross-features with higher attention scores can provide
a detailed explanation of the evaluation results of the model.
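A local cross-feature such as those listed in Table 6 is simply the conjunction of split conditions along one root-to-leaf decision path. As an illustration only (not the exact CATE implementation), the sketch below extracts such a rule for a single sample from a fitted scikit-learn decision tree; `feature_names` is assumed to hold the column names of the credit dataset.

```python
def path_rule(tree, x, feature_names):
    """Return the decision-path conditions (a local cross-feature) satisfied by sample x.
    `tree` is a fitted sklearn decision tree; `x` is a 1-D NumPy array of feature values."""
    t = tree.tree_
    visited = tree.decision_path(x.reshape(1, -1)).indices   # nodes visited by x, root to leaf
    conditions = []
    for node in visited:
        if t.children_left[node] == t.children_right[node]:  # skip the leaf itself
            continue
        name = feature_names[t.feature[node]]
        threshold = t.threshold[node]
        if x[t.feature[node]] <= threshold:                  # which branch was taken
            conditions.append(f"{name} <= {threshold:.3f}")
        else:
            conditions.append(f"{name} > {threshold:.3f}")
    return " & ".join(conditions)
```

Applied to every member tree of a fitted ensemble (for example, each regressor in `gbdt.estimators_[:, 0]` of a binary GradientBoostingClassifier), this enumerates the local cross-features of a sample, which can then be ranked by their attention scores as in Table 6.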

Table 6
Comparison of default and normal transactions w.r.t their corresponding local cross-features.

Normal case Default case


Transaction loan amount $3,500 loan amount $1,500
& interest rate 8.2% & interest rate 29%
CF of GBDT 1): LoanNumber ≤ 94282.5 & BorrowerRate ≤ 1): BorrowerRate > 0.2 & LoanNumber ≤ 21158.5
0.125 & LenderYield > 0.25 & Occupation > 113.5
& AvailableBankcardCredit > 11009 2): Term > 24 & LoanNumber ≤ 94282.5
& DebtToIncomeRatio ≤ 0.195 & BorrowerRate > 0.125
2): TotalInquiries ≤ 4.5 & LoanNumber ≤ 48408.5 & TradesOpenedLast6Months > 0.5
& LenderYield ≤ 0.333 & Occupation > 1498.5
CF of RF 1): AvailableBankcardCredit > 175.5 1): TotalInquiries > 9.5 & LenderYield > 0.13
& IncomeRange > 2.5 & LenderYield ≤ 0.142 & DateCreditPulled ≤ 2008
& ListingCategory(numeric) > 0.5 & AmountDelinquent > 29.5
2): Term ≤ 24 & BorrowerRate ≤ 0.155 2): LoanNumber ≤ 34685.5
& EmploymentStatus_Other ≤ 0.5 & MonthlyLoanPayment > 36.205
& EmploymentStatusDuration > 3.5 & LenderYield > 0.14 & TotalInquiries > 15.5
𝑦̂ 0.0032 0.9456

5.3.3. Ablation experiments


To investigate the importance of each component in the CATE model, this work conducts an ablation study to
evaluate the performance of the model after removing each component. The ablation study consists of four experimental
conditions:
• Full Model: The complete CATE model with all its components included;
• w/o Tree-based Augmentation (TA): The tree-based ensemble models used for data augmentation are removed,
leaving only one GBDT model;
• w/o Path Information Fusion (PF): Only embedding vectors corresponding to leaf nodes are used without
decision path information fusing;
• w/o Supervised Contrastive Loss (CL): Only cross-entropy loss is used to train the model without introducing
supervised contrastive loss;
Table 7 displays the results of the ablation experiments conducted on various datasets to evaluate the effectiveness
of the CATE model. The results indicate that the full model outperforms the other ablation models on all performance
indicators for the Prosper, LC, and Give datasets. The findings demonstrate the effectiveness of each module
incorporated into the CATE model. It should be noted that the LC and Give datasets, which comprise a greater number of training samples, allow each model to be fitted more fully. Additionally, their larger number of test samples reduces the fluctuation of the evaluation metrics.
Based on the CarLoan dataset, the full model exhibits a slight decrease in Acc and AUC metrics. The analysis in
Section 5.3.1 suggests that the default and normal samples in this dataset share similar characteristics, presenting
a challenge for the model to differentiate default samples. Consequently, improving the ability of the model to
recognize default samples may result in misclassifying more normal samples, thereby decreasing Acc and AUC
indicators. The data augmentation mechanism facilitates the representation learning ability of the embedding-based
model by providing diverse cross-features, resulting in more effective representations. Removing this mechanism has
a certain impact on model performance. The decision path information fusion operation integrates the information of
overlapping nodes between decision paths for adjacent leaves. This process preserves the tree structure information
to a greater extent and is the primary means to enhance the representation learning ability of CATE. Removing
this mechanism has a considerable impact on model performance. The contrastive loss mechanism improves the distinguishability between default and normal samples and is usually applied in combination with the data augmentation operations; removing it alone has minimal impact on model performance.

Table 7
Ablation study of CATE on different datasets.

Dataset model Rec F-score G-mean Acc BAcc MCC AUC


Prosper Full Model 0.7620 0.7747 0.8363 0.8734 0.8400 0.6870 0.9321
w/o TA 0.7322 0.7500 0.8170 0.8605 0.8220 0.6537 0.9207
w/o PF 0.7279 0.7508 0.8163 0.8618 0.8217 0.6560 0.9201
w/o CL 0.7413 0.7544 0.8215 0.8621 0.8259 0.6588 0.9233
LC Full Model 0.7921 0.8314 0.8784 0.9403 0.8831 0.7968 0.9720
w/o TA 0.7921 0.8307 0.8783 0.9400 0.8830 0.7958 0.9716
w/o PF 0.7920 0.8304 0.8781 0.9399 0.8828 0.7955 0.9715
w/o CL 0.7891 0.8294 0.8767 0.9397 0.8816 0.7944 0.9716
Give Full Model 0.1794 0.2734 0.4215 0.9347 0.5848 0.2963 0.8549
w/o TA 0.1648 0.2556 0.4038 0.9344 0.5778 0.2826 0.8540
w/o PF 0.1743 0.2662 0.4153 0.9342 0.5822 0.2885 0.8523
w/o CL 0.1740 0.2673 0.4152 0.9347 0.5823 0.2925 0.8546
CarLoan Full Model 0.0398 0.0716 0.1977 0.8177 0.5124 0.0689 0.6373
w/o TA 0.0354 0.0645 0.1867 0.8184 0.5111 0.0658 0.6399
w/o PF 0.0309 0.0570 0.1746 0.8193 0.5099 0.0627 0.6432
w/o CL 0.0379 0.0687 0.1932 0.8181 0.5119 0.0677 0.6382

It is critical to accurately identify default transactions in the credit scoring task: although misclassifying too many normal transactions as defaults could cause financial institutions to miss opportunities, failing to identify default transactions can result in significant losses.
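For completeness, the supervised contrastive loss that the w/o CL setting removes takes the standard supervised contrastive form; the PyTorch sketch below is a minimal illustration with an assumed temperature of 0.1 and is not the exact loss implementation of CATE.

```python
import torch
import torch.nn.functional as F

def sup_con_loss(embeddings, labels, temperature=0.1):
    """Pull same-class embeddings together and push different-class embeddings apart."""
    z = F.normalize(embeddings, dim=1)                      # (N, d) unit-norm embeddings
    sim = z @ z.T / temperature                             # pairwise similarities
    self_mask = torch.eye(len(z), dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float("-inf"))         # ignore self-pairs
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    pos_count = pos_mask.sum(dim=1).clamp(min=1)            # avoid division by zero
    # Average log-probability over the positives of each anchor sample.
    loss = -log_prob.masked_fill(~pos_mask, 0.0).sum(dim=1) / pos_count
    return loss.mean()
```

In training, such a term would typically be added to the cross-entropy loss with a weighting coefficient, which is consistent with the observation that removing it alone has only a small effect.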
The experimental results conducted on four datasets demonstrate that each component of the CATE model exhibits
a significant role in accomplishing the credit scoring task. This finding serves as a testament to the efficacy of the
CATE design.

5.3.4. Effects on training length


The objective of this section is to investigate the effect of varying amounts of training data on the CATE model
performance. The training data is collected from the LC dataset over five years from June 2012 to June 2017, with
the length of the data being adjusted every three months for experimental purposes. A pre-sampled test set is utilized
and remains constant throughout the experiment. The results depicted in Fig. 5a demonstrate that model performance
experiences a significant improvement until approximately three years of historical data is incorporated, after which
performance gains diminish progressively. This observation can be attributed to the reduced acquisition of new
information with the use of more extensive training data, which leads to a gradual decrease in performance gains.
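The experiment can be organized as a loop that repeatedly re-trains on a growing window of history and scores a fixed, pre-sampled test set. The sketch below is purely illustrative: the `issue_d` and `default` column names, the use of a plain GBDT as a stand-in classifier, and the AUC-only evaluation are assumptions rather than details of the original setup.

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

def auc_by_training_length(df, test_X, test_y, date_col="issue_d", end="2017-06-30"):
    """Re-train on windows of 3, 6, ..., 60 months ending at `end`; evaluate on a fixed test set."""
    aucs = {}
    for months in range(3, 61, 3):
        start = pd.Timestamp(end) - pd.DateOffset(months=months)
        window = df[(df[date_col] >= start) & (df[date_col] <= pd.Timestamp(end))]
        clf = GradientBoostingClassifier(random_state=0)
        clf.fit(window.drop(columns=[date_col, "default"]), window["default"])
        aucs[months] = roc_auc_score(test_y, clf.predict_proba(test_X)[:, 1])
    return aucs
```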

5.3.5. Hyperparameter research


This section aims to investigate the influence of certain parameters in CATE on the performance of the model.
To achieve this objective, we conduct hyperparameter tuning experiments. The outcomes of these experiments
demonstrate consistent findings across all four datasets, with the Prosper dataset being selected as a representative
example for the sake of conciseness.
Fig. 5b illustrates the effect of modifying the depth 𝑇𝑑 of DTs in tree-based ensemble models on the performance of
CATE. The results reveal that the performance of CATE deteriorates slightly when the depth is shallow. The best model
performance is achieved when 𝑇𝑑 = 5, beyond which the performance fluctuates. In terms of the Rec metric, 𝑇𝑑 = 6
yields better performance. CATE can benefit from deeper DTs, unlike methods that use only leaf-node information, because it utilizes the node information along the entire decision path. As the tree depth increases, the decision rules become more intricate and the path fusion vector is fused from more embedding vectors. However, a simple single-layer LSTM may not be adequate to fuse increasingly complex decision rules effectively, resulting in a decrease in model performance. In this study, 𝑇𝑑 = 5 is the preferred setting for the Prosper dataset, balancing model performance and computational efficiency.
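To make the role of path fusion concrete, the rough PyTorch sketch below fuses the node embeddings along one root-to-leaf path with a single-layer LSTM; the dimensions, the padding assumption, and the use of the final hidden state are illustrative choices rather than the exact CATE architecture.

```python
import torch
import torch.nn as nn

class PathFusion(nn.Module):
    """Fuse the embeddings of the nodes on a decision path into a single vector."""
    def __init__(self, num_nodes, emb_dim=32, hidden_dim=32):
        super().__init__()
        self.node_emb = nn.Embedding(num_nodes, emb_dim)     # one embedding per tree node
        self.lstm = nn.LSTM(emb_dim, hidden_dim, num_layers=1, batch_first=True)

    def forward(self, path_node_ids):
        # path_node_ids: (batch, path_len) node indices from root to leaf,
        # assumed padded to a common length for batching.
        seq = self.node_emb(path_node_ids)                   # (batch, path_len, emb_dim)
        _, (h_n, _) = self.lstm(seq)
        return h_n[-1]                                       # (batch, hidden_dim) fused path vector
```

As the paths grow with 𝑇𝑑, this single recurrent layer has to compress more node embeddings into one vector, which is consistent with the performance drop observed for deep trees.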
Fig. 5c portrays the impact of altering the number of trees in the tree-based ensemble models on the performance of CATE. It is noteworthy that the figure presents the performance of CATE rather than that of the tree-based ensemble models themselves.

[Fig. 5. Experiments on training length and hyperparameter tuning: (a) performance by training length (years), (b) performance by tree depth, (c) performance by number of trees; each panel reports Rec, F-score, G-mean, Acc, BAcc, MCC, and AUC in percent.]

The findings suggest that the performance of CATE is erratic when the number of trees is small. Moreover, a slight decline in CATE performance is noted when 𝜏 ∈ {48, 64, 96}. As the number of trees increases, the performance of CATE improves and stabilizes, indicating that an adequate number of trees is required to furnish sufficient information to the subsequent attention mechanism. Furthermore, due to the presence of the attention mechanism, additional trees beyond a certain number do not provide new information. In this study, 𝜏 = 112 is the preferred setting for the Prosper dataset that balances both model performance and computational efficiency.
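This saturation behaviour is what one would expect from attention pooling: cross-features that receive low attention weights contribute little to the pooled representation. The sketch below shows such an attention layer over the per-tree embedding vectors; the dimensions and names are illustrative and do not reproduce the exact CATE implementation.

```python
import torch
import torch.nn as nn

class CrossFeatureAttention(nn.Module):
    """Attention pooling over per-tree cross-feature embeddings; the weights also
    serve as local importance scores for interpretation (cf. Table 6)."""
    def __init__(self, emb_dim=32, att_dim=16):
        super().__init__()
        self.proj = nn.Linear(emb_dim, att_dim)
        self.query = nn.Linear(att_dim, 1, bias=False)

    def forward(self, tree_vectors):
        # tree_vectors: (batch, num_trees, emb_dim), one fused vector per tree.
        scores = self.query(torch.tanh(self.proj(tree_vectors)))  # (batch, num_trees, 1)
        weights = torch.softmax(scores, dim=1)                    # attention over cross-features
        pooled = (weights * tree_vectors).sum(dim=1)              # (batch, emb_dim)
        return pooled, weights.squeeze(-1)
```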

6. Conclusion and future work


The emergence of social lending platforms has addressed the needs of the growing credit industry, while also
disrupting traditional credit risk assessment services. In light of the increasing complexity of credit scenarios, financial
institutions and individual investors require more efficient and interpretable methods to identify risky transactions.
This study proposes a novel credit risk assessment model based on contrastive augmentation and tree-enhanced
embedding mechanism, namely CATE. The CATE model leverages the decision rule learning ability of tree models to
automatically construct interpretable cross-features, which are effective in modeling the interaction between features
while providing richer semantic information than individual features. It then measures the importance of each local
cross-feature using the attention mechanism, which provides intrinsic local interpretability of the model. Finally,
the decision path information fusion mechanism and the contrastive learning framework further improve the model
performance in evaluating the credit risk of users based on fused embedding vectors. This study applied the CATE
model to four public credit transaction datasets and confirmed its ability to achieve high accuracy in credit scoring.

Moreover, owing to its excellent interpretability of decision-making, CATE can be easily incorporated into practical credit scoring applications and can effectively assist stakeholders in identifying transactions that may default.
However, it is important to acknowledge the limitations of the CATE model. Firstly, the shallow LSTM model
may face challenges in effectively learning the construction patterns of decision trees when the decision paths become
lengthy. Consequently, this limitation restricts the tree-based ensemble model from extending the depth of decision
trees, potentially impacting its overall performance. Secondly, although initializing the embedding vector based on the
mean of the subsample set strengthens the association between the embedding vector and the original dataset, it can also
constrain the ability of the embedding-based model to extract features effectively. Furthermore, in addition to the aforementioned limitations, the proposed method may exhibit weakness in predicting normal transactions when the dataset is extremely imbalanced or when there is overlap in the underlying distributions of the different
classes. The future work of this study comprises three aspects. Firstly, we aim to explore improved decision path
information fusion methods to further enhance the ability to extract tree structure information. In particular, we
intend to investigate the co-learning of tree-based and embedding-based models to further facilitate information
dissemination between the two components, leading to better performance in real-world applications. Secondly, to
address the weakness of the CATE model, we plan to incorporate imbalanced learning techniques, such as resampling
and customized ensemble methods. Finally, we aim to leverage the abundance of information available on the Internet
to introduce additional user attributes, such as social, consumption, and behavioral information. This will expand the interpretability of the model and enable it to make more accurate predictions based on a broader range of factors.

CRediT authorship contribution statement


Ying Gao: Resources, Funding acquisition, Project administration, Supervision, Writing - review. Haolang Xiao:
Investigation, Conceptualization, Methodology, Data curation, Software, Formal analysis, Validation, Visualization,
Writing - original draft. Choujun Zhan: Data curation, Writing - review & editing. Lingrui Liang: Investigation,
Data curation. Wentian Cai: Writing - review & editing. Xiping Hu: Supervision, Writing - review.

Declaration of Competing Interest


The authors declare that they have no known competing financial interests or personal relationships that could have
appeared to influence the work reported in this paper.

Acknowledgments
This work is supported by the Guangzhou Science and Technology Program key projects (202103010005) and the National Natural Science Foundation of China (61876066).


