During the midterm consultation and class meetings with Dr. Jennifer Llovido, she
The class variable of the credit risk loan eligibility dataset contains a class
imbalance wherein 80% of the class is 0 (Not eligible) while 20% is 1 (Eligible). To
2. Explain the process of how you partitioned the dataset in making the model
The researchers added the details of performing the holdout and cross-validation methods in the paper. The researchers conducted training-testing splits of 90-10, 80-20, 70-30, and 60-40 in the holdout method, and 10-fold, 20-fold, 30-fold, and 50-fold splits in the cross-validation method.
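The two partitioning schemes described above can be sketched in a few lines of pure Python (an illustrative sketch only; the study itself used R's partitioning utilities):

```python
import random

def holdout_split(rows, test_ratio, seed=42):
    """Shuffle the rows and split them once, e.g. test_ratio=0.1 for a 90-10 split."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)

def kfold_indices(n_rows, k):
    """Assign every row index to one of k folds; each fold serves once as the test set."""
    folds = [[] for _ in range(k)]
    for i in range(n_rows):
        folds[i % k].append(i)
    return folds

rows = list(range(100))
train, test = holdout_split(rows, test_ratio=0.2)  # an 80-20 holdout split
folds = kfold_indices(len(rows), k=10)             # 10-fold cross-validation
```

The holdout method evaluates on one fixed test partition, while k-fold cross-validation rotates the test fold so every row is tested exactly once.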
The researchers applied the CRISP-DM methodology in making the case study.
Preparation, Modelling, Evaluation, and Deployment. The researchers did not include the
classification, Linear Support Vector Machine (LSVM), Random Forest, Naïve Bayes
Classifier, Conditional Inference Tree, C4.5, and Gradient Boosted Decision Tree.
5. Apply association rule mining and clustering analysis to the dataset, if applicable
The researchers applied association rule mining to the dataset and found 177,280 rules. The researchers found that 27.145% of the rules have a lift score greater than 1, 58.223% have a lift score equal to 1, and 14.623% have a lift score less than 1.
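The lift score used above can be computed with a small pure-Python helper (the item names below are hypothetical toy values, not items from the actual dataset):

```python
def lift(transactions, antecedent, consequent):
    """Lift = support(A and B) / (support(A) * support(B)).
    > 1 means positive association, 1 means independence, < 1 negative association."""
    n = len(transactions)
    both = sum(1 for t in transactions if antecedent <= t and consequent <= t) / n
    supp_a = sum(1 for t in transactions if antecedent <= t) / n
    supp_b = sum(1 for t in transactions if consequent <= t) / n
    return both / (supp_a * supp_b)

# Hypothetical toy baskets standing in for discretized loan records
baskets = [
    {"low_grade", "not_eligible"},
    {"low_grade", "not_eligible"},
    {"high_grade", "eligible"},
    {"high_grade", "not_eligible"},
]
print(lift(baskets, {"low_grade"}, {"not_eligible"}))  # 4/3: a positive association
```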
The researchers also tried clustering analysis on the dataset but discovered that most clustering algorithms are not quite suitable for it since it contains both numeric and categorical variables. The researchers tried using K-Prototypes clustering from the clustMixType package since that algorithm is applicable to mixed-type data. After implementing the algorithm, the researchers found no significant insights in the results since the algorithm outputs clustering per variable, not for the whole dataset.
ABSTRACT
Credit risk assessment is a critical task for financial institutions that lend money to borrowers. By accurately assessing the risk of their loans, lenders can minimize their losses and make informed decisions about which loan applications to approve. This paper proposes a novel approach to credit risk assessment that
combines traditional credit scoring models with state-of-the-art machine learning algorithms.
The researchers used a Credit Risk Loan Eligibility dataset with features that include the
borrower’s demographic data, financial statements, and credit history. The researchers
discovered the five most important variables by getting the information gain ratio value of each variable, then classified loan eligibility using ten supervised machine learning algorithms, namely: Classification and
Regression Tree (CART), K Nearest Neighbors (kNN), Rule-based classification, Linear Support
Vector Machine (LSVM), Random Forest, Naïve Bayes Classifier, Artificial Neural Network
(ANN), Conditional Inference Tree, C4.5, and Gradient Boosted Decision Tree. The Random
Forest algorithm performed the best, with an accuracy of 89.85% using a cross-validated set of
50-folds. The worst performing algorithm was Naïve Bayes Classifier which had an accuracy of
50%.
keywords – credit risk, binary classification, CART, kNN, Rule-based classification, Linear SVM, Random Forest, Naïve Bayes Classifier, ANN, Conditional Inference Tree, C4.5, Gradient Boosted Decision Tree

INTRODUCTION

History demonstrates that one of the main factors contributing to bank distress is
the concentration of credit risk in asset portfolios. This holds for both specific
marketing is that information to both bank and account holders is asymmetrical. That is,
account holders know more about their investments than their banks do. As future and
present users of these cash reservoirs and banks, it is our duty to gain knowledge on how
managerial and tactical bank credit works to have an in-depth understanding of our
commitments. Credit risk typically refers to the possibility that a lender will not be able to receive the principal and interest owed, which would disrupt cash flows and raise collection costs.
The borrower's general capacity to repay a loan in accordance with its original terms is
used to determine credit risks. When determining the credit risk of a consumer loan,
lenders consider the five Cs: credit history, capacity to repay, capital, the loan’s conditions, and associated collateral.
With this, the study mainly aims to classify a borrower’s credit risk loan data
using supervised machine learning algorithms such as Classification and Regression Tree (CART), K Nearest Neighbors (kNN), Rule-based classification, Linear Support Vector Machine (LSVM), Random Forest, Naïve Bayes Classifier, Artificial Neural Network (ANN), Conditional Inference Tree, C4.5, and Gradient Boosted Decision Tree in determining loan eligibility.
This case study aims to explore the credit risk loan eligibility dataset. Specifically, it aims
to:
1. Identify which variables have the most significant importance in identifying credit risk loan eligibility.
kappa score.
Related Studies
According to Konovalova, Kristovska, and Kudinska, the analysis of the credit risk and rating of borrowers is relevant to all banks involved in
lending to individuals and legal entities. To ensure the effectiveness of credit risk
management in commercial banks, it was deemed necessary to develop the kinds of terms
and conditions for those bank clients who take loans that would both attract potential
borrowers and guarantee loan repayment. However, to develop a separate set of terms and
conditions for every individual borrower, existing and potential bank clients should be
grouped according to their similarities and differences. After that, a separate set of terms
and conditions needs to be worked out for each group in accordance with the
characteristic features of the group members. The classification of bank clients into
distinct groups should proceed according to the method of classification that unites
disparate system elements into homogeneous groups based on the similarities of the
elements in question. This method of classification needs to reflect the structure of the
source data and to ensure the most adequate division of the data into groups.
They took into account the statistics reflecting the bank customers’ violations of
the contract conditions and the damage caused to the bank by each such violation. The
magnitude of the risk as the amount of damage (risk defined as the customer’s failure to
make principal payments on time) can be seen as a regressive dependence on such factors
as the average loan size, the period for which the loan is issued, and a number of other
characteristics of each customer class. Such a model would enable the forecasting of the
Li et al. utilized Logistic Regression (LR) due to its interpretability and efficiency. They showcased a hybrid scoring model that applies Logistic Regression with data augmentation, which obtained a more accurate credit score. Despite LR being widely used, it remains limited in capturing nonlinear relationships. Other methods were therefore investigated, such as Decision Tree (DT) and Support Vector Machine (SVM). They
used Decision Tree and Artificial Neural Network to build the credit model, and results
indicated that it is a successful technology. They proposed a novel credit model using the
clustered support vector machine (CSVM) and demonstrated that CSVM could achieve better classification performance.
However, much of the research was based on traditional machine learning methods and
up until now could not address sequential input data. The use of deep learning techniques
to build a credit risk model has seen significant increases in the reported accuracy on
benchmarking data sets. They compared the classification abilities of machine and deep
learning models, and results show that the tree-based models are more stable than the
models based on multilayer artificial neural networks. Another work used an LSTM to predict default probability on sequential input data: it proposed a dynamic forecasting system to predict the default probability of a company, utilizing a long short-term memory model to incorporate daily news from social media. Babaev et al. combined NLP embedding techniques and RNNs to mine sequential transaction data.
According to Hassani et al., in financial risk management, credit card evaluation plays a vital role for banks. Banks often need to evaluate customer credit to guard against risk and competition, and identifying the influential factors of credit card risk can help them do so. Moreover, feature selection using machine learning algorithms can evaluate these factors with higher accuracy, while traditional techniques require essential assumptions. Real-world data such as financial data are unbalanced, and machine learning algorithms do not classify them well. In their paper, the data is balanced by the SMOTE method, and then the binary firefly algorithm is used for identifying the effective features of the credit cards. Their study also investigates various machine learning algorithms such as KNN, Fuzzy KNN, Random Forest, Decision Tree, and SVM for the effective classification of credit cards.
Another study reviews the existing algorithms for unbalanced data classification, particularly the limitations of standard methods when dealing with unbalanced datasets, and proposes the use of preprocessing before classification. The study also mentions the use of the SMOTE algorithm for unbalanced data classification and its limitations in not taking into account the properties of the data itself.
To address this limitation, the study proposes a new oversampling method that
considers the sample distribution of the minority class. By doing so, the new samples
generated will be more representative and will better describe the minority class, leading
to improved classification accuracy. The proposed oversampling method can help avoid overfitting of the model that can result from the random selection of sample points.
The study can be used as a basis for further research on the optimization of sampling methods for unbalanced data classification. Specifically, the study can be extended to explore other oversampling methods that take
into account the properties of the data itself, as well as undersampling methods that can
be used to reduce the number of samples in the majority class. The findings of this study
can also be applied to other fields that involve the classification of unbalanced datasets.
Mohammad Ali states that the use of statistical techniques for categorizing loan
applications as good or bad has become increasingly important due to the high demand
for credit. It is crucial to use a classification method that can accurately predict loan
profitability. In this study, Mohammad Ali compares the predictive capabilities of three
classification techniques, namely logistic regression, CART, and random forests, using an
80:20 learning:test split on German credit data. He evaluates the performance of each
model by calculating the probability of default for each observation in the test set and
classifying them as good or bad loans based on a threshold. He uses several thresholds to
compare the performance of each classifier on five model suitability statistics: accuracy, sensitivity, specificity, precision, and negative predictive value.
The results show that none of the classifiers performed the best in all five statistics across the probability of default thresholds. CART was the most accurate, precise, and specific at higher thresholds, while random forests had the best negative predictive value.
Mimi Mukherjee and Matloob Khushi noted that real-world datasets often have
imbalanced classes, with some classes having significantly fewer instances than others. Over-sampling techniques such as SMOTE have been proposed to balance the dataset, particularly for continuous features. However, when datasets have both nominal and continuous features, SMOTE-NC is the only over-sampling technique available to balance the data. The authors present a new technique, SMOTE-ENC (Encoded Nominal and Continuous), that encodes nominal features as numeric values to reflect the change of
association with the minority class. The authors show that the classification model using
SMOTE-ENC provides better predictions than the model using SMOTE-NC when the
dataset has a substantial number of nominal features and when there is an association
between categorical features and the target class. Furthermore, SMOTE-ENC can be
applied to both mixed datasets and nominal-only datasets, addressing a major limitation
of SMOTE-NC.
Will Badr recommends splitting the dataset into training and testing sets before balancing the data, to ensure that the test dataset remains unbiased and provides a true evaluation of the model. Balancing the data before splitting may risk introducing bias into the test set, since some synthetically generated data points would be well-known from the training set. He stated that it is important for the test set to be as unbiased as possible. Under-sampling, on the other hand, can remove valuable information and alter the overall dataset distribution, which can impact the quality of the model. Therefore, under-sampling should not be the first approach to consider for imbalanced datasets. He also added that it is important to recognize the
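The ordering Badr recommends can be sketched as follows (a pure-Python illustration, with naive duplication standing in for SMOTE):

```python
import random

def split_then_balance(rows, labels, test_ratio=0.2, seed=1):
    """Split first, then oversample only the training portion, so the test set
    keeps the real-world class distribution."""
    rng = random.Random(seed)
    idx = list(range(len(rows)))
    rng.shuffle(idx)
    n_test = int(len(idx) * test_ratio)
    test_idx, train_idx = idx[:n_test], idx[n_test:]

    train = [(rows[i], labels[i]) for i in train_idx]
    test = [(rows[i], labels[i]) for i in test_idx]  # left untouched

    # Naive duplication oversampling of the minority class (SMOTE stand-in)
    pos = [t for t in train if t[1] == 1]
    neg = [t for t in train if t[1] == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    while len(minority) < len(majority):
        minority.append(rng.choice(minority))
    return majority + minority, test
```

Because balancing happens after the split, no synthetic or duplicated point can leak into the held-out test set.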
The figure below shows the Cross-Industry Standard Process for Data Mining (CRISP-DM) Model. It consists of six phases, each designed to guide the data mining process. The use of the CRISP-DM model helps to ensure that the data mining project is structured and systematic.
A. Business understanding
The credit risk loan eligibility dataset contains information on loan applicants and
their eligibility for credit loans. It contains various information such as their age, income,
employment status, credit score, loan amount, and other relevant asset information.
Banks or lending institutions can use this dataset to make informed decisions to check if
an applicant is eligible for a credit loan. The class variable of the dataset is the
loan_status variable; this is the determining factor of whether the applicant is eligible or not.
B. Data understanding
In order to fully understand the credit risk loan eligibility dataset, the researchers examined the description of each column:

loan_amnt – the listed amount of the loan applied for by the borrower
funded_amnt – the total amount committed to that loan at that point in time
funded_amnt_inv – the total amount committed by investors for that loan at that point in time
emp_title – the job title supplied by the borrower when applying for the loan
zip_code – the first 3 numbers of the zip code provided by the borrower in the loan application
dti – a ratio calculated using the borrower’s total monthly debt payments on the total debt obligations, excluding mortgage and the requested LC loan, divided by the borrower’s self-reported monthly income
open_acc – the number of open credit lines in the borrower’s credit file
total_acc – the total number of credit lines in the borrower’s credit file
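As an illustration of the dti definition above, the ratio can be computed as follows (the borrower figures are hypothetical, not rows from the dataset):

```python
def debt_to_income(monthly_debt_payments, monthly_income):
    """dti: total monthly debt obligations (excluding mortgage and the requested
    loan) divided by the borrower's self-reported monthly income."""
    if monthly_income <= 0:
        raise ValueError("monthly income must be positive")
    return monthly_debt_payments / monthly_income

# A borrower paying $900/month on debts with a $3,000 self-reported monthly income
print(debt_to_income(900, 3000))  # 0.3
```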
The dataset contains missing values that must be handled before training the model.
Missing values can be imputed using various techniques, such as mean-median imputation, or
removed entirely if they are insignificant. It also contains categorical variables such as
employment status, education, and marital status. The data set includes numeric factors such as
age, income, and loan amount, which may have diverse scopes and measurements, impacting the
efficiency of machine learning algorithms. The original data has 63,999 cases and 45 variables.
Loan Status is the target variable. There are 2 levels, 0 being eligible and 1 being not eligible. The latter only forms about 20% of the data. Therefore, the class variable of the original dataset is unbalanced.
The researchers dropped the columns that are irrelevant to the classifier, such as
member_id, zip_code, and addr_state. Columns where the majority of data have missing
values (at about 50%) were also dropped, such as batch_enrolled and emp_title, leaving the researchers with a new data frame with only 31 variables. After this, the researchers transformed the
data types of the columns to be more appropriate for model learning. Ordinal values such
as loan_status (the class variable), grade, verification status, and others are turned into
factors instead of characters. The remaining missing values were handled by dropping the rows that contain them.
Fig. 4 shows that the original distribution of the class variable is unbalanced,
wherein 80% are 0 and 20% are 1. In order to remove the bias towards the negative value
in training the model, the researchers tried using the Synthetic Minority Oversampling
Technique (SMOTE) and the undersampling technique in balancing the class variable.
The researchers decided to choose the balanced dataset produced by using SMOTE (See
Appendix A, Figures 7-8) because it produces a more distributed class variable and does
not reduce the overall data compared to using the undersampling technique.
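The core idea of SMOTE, interpolating new minority points between real neighbours instead of merely duplicating rows, can be sketched in pure Python (illustrative only; the study used an R implementation):

```python
import random

def smote_like(minority, n_new, k=3, seed=0):
    """Generate synthetic minority samples by interpolating between a random
    minority point and one of its k nearest neighbours (the core SMOTE idea)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        a = rng.choice(minority)
        # k nearest neighbours of a by squared Euclidean distance (excluding a)
        neighbours = sorted(
            (p for p in minority if p is not a),
            key=lambda p: sum((x - y) ** 2 for x, y in zip(a, p)),
        )[:k]
        b = rng.choice(neighbours)
        gap = rng.random()  # position along the segment between a and b
        synthetic.append(tuple(x + gap * (y - x) for x, y in zip(a, b)))
    return synthetic

minority = [(1.0, 2.0), (1.2, 2.1), (0.9, 1.8), (1.1, 2.3)]
new_points = smote_like(minority, n_new=4)
```

Each synthetic point lies on a segment between two real minority points, so the minority region grows without exact duplicates.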
Feature Preparation
Feature selection is the method of reducing input variables by using only relevant
data and getting rid of noise in the data. The researchers obtained the univariate feature importance score of each variable in order to select only the relevant features. To get these scores, the researchers tried different methods, namely: chi-square statistics, information gain ratio, and correlation/entropy with best-first search.

For the chi-square test statistic, the researchers first discretized the numeric and integer variables to convert them into factors with 5 levels. After this, the algorithm outputs the top five variables with the highest scores.
Computing the univariate feature importance score using the information gain
ratio, the algorithm outputs the top five variables with the highest gain which are
The cfs method stated that the variables int_rate, initial_list_status, recoveries, and last_week_pay are the most important features. In the black-box feature selection, which used both forward-search and best-first search strategies to find the most important features, the five most important features are loan_amnt, term, int_rate, emp_length, and initial_list_status.

The researchers decided to pick the top 5 features using the information gain ratio method.
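The information gain ratio used for the final selection can be computed as follows (a pure-Python sketch on toy data, not the R implementation the researchers used):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a label list."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(feature_values, labels):
    """Information gain of splitting on the feature, divided by the split's
    intrinsic (split) information, as in C4.5-style attribute ranking."""
    n = len(labels)
    base = entropy(labels)
    cond = 0.0
    split_info = 0.0
    for value in set(feature_values):
        subset = [l for f, l in zip(feature_values, labels) if f == value]
        w = len(subset) / n
        cond += w * entropy(subset)
        split_info -= w * math.log2(w)
    gain = base - cond
    return gain / split_info if split_info > 0 else 0.0

# Toy data: a feature that perfectly separates the class
labels = [0, 0, 1, 1]
feature = ["a", "a", "b", "b"]
print(gain_ratio(feature, labels))  # 1.0
```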
D. Modeling
The researchers created models using ten algorithms: Classification and Regression Tree (CART), K Nearest Neighbors (kNN), Rule-based classification, Linear Support Vector Machine (LSVM), Random Forest, Naïve Bayes Classifier, Artificial Neural Network (ANN), Conditional Inference Tree, C4.5, and Gradient Boosted Decision Tree.
The researchers only used the balanced dataset in making models using the ten
algorithms mentioned above, because a model produced using the unbalanced dataset is biased toward the majority class.
The researchers trained the various models using different testing/training splits
for hold-out validation, namely 10-90, 20-80, 30-70, and 40-60 distributions.
Additionally, the researchers also tried varying fold counts for cross-validation, with 10,
15, 20, and 50 folds, to see if the training and test distribution would have any effect on
the performance of the data mining models. Each algorithm was trained with our data set
at least 16 times, 8 each on the balanced and unbalanced datasets, with the exception of
The CART algorithm is used to build a decision tree that predicts the loan
eligibility status of an applicant based on the given attributes. The algorithm would
recursively split the dataset based on the values of the attributes and minimize the Gini
impurity to determine the best split. The resulting decision tree could be used to classify
new loan applicants based on their attributes. (See Appendix A, Figures 14-17)
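The Gini-minimising split described above can be sketched as follows (illustrative pure Python, not rpart's implementation; the column values are hypothetical):

```python
def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_numeric_split(values, labels):
    """Find the threshold minimising the weighted Gini impurity of the two
    sides, i.e. a CART-style split on one numeric attribute."""
    n = len(labels)
    best = (None, float("inf"))
    for t in sorted(set(values))[:-1]:
        left = [l for v, l in zip(values, labels) if v <= t]
        right = [l for v, l in zip(values, labels) if v > t]
        score = len(left) / n * gini(left) + len(right) / n * gini(right)
        if score < best[1]:
            best = (t, score)
    return best  # (threshold, weighted impurity)

values = [10, 20, 30, 40]  # a hypothetical numeric column
labels = [0, 0, 1, 1]
print(best_numeric_split(values, labels))  # (20, 0.0): a perfect split
```

CART repeats this search over every candidate attribute and recurses on the two resulting partitions.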
Creating a tree with default settings and pre-pruning resulted in a decision tree
that only uses two variables — last_week_pay and term. (See Appendix A, Fig. 5).
Meanwhile, when attempting to build a full tree with all 31 variables and a cp value of 0,
it takes roughly five minutes for the tree to be rendered as a png file. Hence, the
researchers decided the next course of action is to limit the number of columns that will be used in building the tree.
Next, the researchers tried to evaluate the models created, both the partial tree
with pre-pruning and the full tree. First, we created a sample set with only 15 rows and
removed the values for loan status. Then we compared the predictions from the actual
values. The predictions of the default tree with pre-pruning performed terribly, even
though the testing and training data are the same. The model incorrectly classifies all
instances as ‘0’. Meanwhile, the full tree performed significantly better, getting all predictions correct.
The KNN algorithm is used to predict loan eligibility by identifying the k nearest
neighbors of a new loan applicant based on their attributes. The algorithm would
calculate the distance between the new applicant and each existing applicant in the
dataset and select the k nearest neighbors. The majority class of the k nearest neighbors
would then be used to predict the loan eligibility status of the new applicant. (See
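The distance-and-vote procedure can be sketched in pure Python (illustrative, not the caret implementation; the points are hypothetical):

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among the k nearest training points."""
    dists = sorted(
        (math.dist(p, query), label) for p, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

points = [(1, 1), (1, 2), (8, 8), (9, 8), (8, 9)]
labels = [0, 0, 1, 1, 1]
print(knn_predict(points, labels, (8.5, 8.5), k=3))  # 1
```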
Rule-based classification
The training and testing datasets are created by subsetting the original dataset. The model is trained using the “PART” algorithm in order to evaluate the performance of the trained model and provide predictions based on a set of predefined rules. Instead of learning patterns from data,
rule-based classifiers use a set of if-then rules to determine the class or category of a
given instance.
The linear SVM algorithm aims to find the optimal hyperplane by solving an
optimization problem that involves minimizing the classification errors and maximizing
the margin. It finds the best hyperplane that maximally separates the data points of
different classes. The hyperplane is a decision boundary that divides the feature space
into two regions, one for each class. The algorithm constructs a hyperplane in a higher-dimensional
feature space that corresponds to a linear decision boundary in the original feature space.
Random Forest
Random Forest is an ensemble method that combines the predictions of multiple decision trees to make more accurate and robust predictions. A collection of decision trees is built by randomly selecting subsets of the training data, with replacement, to create multiple subsets of the data. Each subset is then used to train an individual decision tree. This process helps to
introduce randomness and diversity into the models, which can reduce overfitting and
improve generalization.
During the construction of each decision tree, at each split, a random subset of
features is considered rather than using all available features. This further adds
randomness and reduces correlation among the trees. By combining the predictions of
multiple trees, the Random Forest algorithm can make more accurate and reliable predictions.
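The bagging-and-voting mechanics can be sketched as follows (an illustrative fragment, not the internals of the R randomForest package):

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement: the bagging step."""
    return [rng.choice(rows) for _ in rows]

def forest_vote(tree_predictions):
    """Majority vote across the individual trees' predictions for one instance."""
    return Counter(tree_predictions).most_common(1)[0][0]

rng = random.Random(7)
data = list(range(10))
sample = bootstrap_sample(data, rng)  # same size as the original, may repeat rows
print(forest_vote([1, 0, 1, 1, 0]))   # 1
```

Each tree trains on its own bootstrap sample, and the forest's final class is the vote over all trees.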
A tuning grid called RFgrid is created. The grid specifies the values to be tested
for the “mtry” parameter, which controls the number of features considered at each split
of the decision trees in the Random Forest. In this case, the “mtry” values range from 1 to 5.

Naïve Bayes Classifier

The Naïve Bayes classifier is a probabilistic model that assumes independence between features. It is often used for classification tasks and works well with categorical data.

Artificial Neural Network

Artificial neural networks are computing systems that process data in a manner resembling the human brain. The package commonly used for artificial neural networks in R is called "nnet." It provides functions and tools for building and training neural networks. A grid of size and decay values, determined by the expand.grid function, is used to fine-tune the neural network model.
Conditional Inference Tree

Conditional inference trees can handle both categorical and continuous predictor variables, as well as handle missing values. The "ctree" method is used for training the conditional inference tree model. The researchers implement conditional inference trees for the classification task, starting by setting a seed value for reproducibility and creating fold indices using the createFolds function. The conditional inference tree model is trained using the train function, where the loan_status variable is predicted.
C4.5
The J48 algorithm in R from the package caret and RWeka is used with the
method set to 'J48'. This constructs decision trees based on the C4.5 algorithm, which
uses attribute selection, recursive splitting, handling of missing values, and pruning to build the classifier.

Gradient Boosted Decision Tree

In gradient boosting, each new tree is built to correct the mistakes made by the previous trees. This algorithm is used by including the packages “xgboost” and “dplyr”, with the method “xgbTree”.
The xgboost model is trained using the train function, with hyperparameters tuned through a tuning grid. The model is then evaluated with a function which computes performance metrics based on the predicted and observed values from the testing datasets.
RESULTS AND DISCUSSION
Various classification algorithms were evaluated using both the holdout method and
cross-validation. Across the 10 classification models trained, we consistently found that the
models whose dataset was split with cross-validation performed slightly better compared to their
hold-out counterparts. Additionally, a high fold count for cross validation yielded marginally
higher accuracy and kappa compared to low fold counts, for the models using cross-validation
for splitting training and testing sets, at the cost of higher computing time (See Appendix B, Table
2.1). Also, several algorithms trained with the cross-validation method (conditional inference trees, gradient boosted decision trees, neural network, SVM, and especially Naive Bayes) had kappa performance lower than 0.5, indicating at most moderate agreement between the predicted labels and the actual labels.
As for the models trained with datasets partitioned using hold-out method, the
test-training splits 10-90 and 30-70 tend to have higher accuracy and kappa. The 40-60 split was
consistently the worst division of the test and train sets. (See Appendix B, Table 2.2).
The researchers observed that Naive Bayes algorithm performed terribly when used as a
classifier for our data set. It had a kappa of 0 and an accuracy of 50% because the Naive Bayes model failed to distinguish between the two classes. The best-performing model experimented with was the random forest classifier, reaching 89.85% accuracy, closely followed by KNN with 89.7% accuracy, both trained using the cross-validated set with 50 folds.

The researchers also tuned each model’s hyperparameters to see how different values affect the model’s performance. For rpart decision trees, the
complexity parameter is best set at zero (See Appendix A, Fig. 18). For KNN, we observed that
the value of k or number of neighbors is best set at 1, because the accuracy and kappa drops with
higher k values. In random forest, the mtry parameter, which controls how much randomness is in the decision tree creation process, is best set at 5 to fully utilize all the available attributes.
For the rule-based classifier, the confidence threshold parameter determines the minimum
confidence required for a rule to be considered valid and used for classification. 0.5 and 0.255
threshold values give the highest accuracy rating. Pruning was also incorporated to improve the
performance by removing rules that do not significantly contribute to the overall accuracy.
The results found in this study will serve as a guide as to how banks or lending
organizations approve future loan applicants. The variables with the most importance will serve
as the top features to look for in an applicant. Also, the algorithm that was most effective in
processing this type of dataset, Random Forest Classifier, will serve as a basis in creating systems that automate the process of classifying the eligibility of an applicant for a loan.
CONCLUSION
In this paper, we studied whether a customer is likely to default on a loan or be able to pay it off, through the five (5) features identified with the highest information gain ratio value. We trained 10 different classification models through both hold-out and cross-validated partition methods and conclude the following:
1. Cross-validation is a better partitioning method for test and train sets to create a
classification model, when compared to the hold-out method. Using different folds for
testing and training noticeably increases the accuracy of classification models, compared
to using only a fixed set and number of testing and training partitions.
2. It is important to make sure the class variable is balanced, otherwise the accuracy metric
for the various classification models would be misleading because of the skewed
distribution. We found the kappa statistic an important evaluation metric for the performance of the classifier, especially when the data is unbalanced, compared to the percentage of cases classified accurately, since kappa compares the observed accuracy with the accuracy expected by chance.
3. The random forest algorithm produces well performing classifiers that use both
categorical and numerical values to infer a predicted class. Even with an unbalanced data
set, it performs relatively better compared to other algorithms that suffer from low kappa scores.
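The kappa comparison described in point 2 can be made concrete with a short pure-Python computation:

```python
from collections import Counter

def cohens_kappa(actual, predicted):
    """kappa = (observed agreement - chance agreement) / (1 - chance agreement)."""
    n = len(actual)
    p_o = sum(a == p for a, p in zip(actual, predicted)) / n
    actual_freq = Counter(actual)
    pred_freq = Counter(predicted)
    p_e = sum(actual_freq[c] * pred_freq.get(c, 0) for c in actual_freq) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Skewed data: predicting the majority class everywhere looks 80% accurate...
actual = [0] * 80 + [1] * 20
predicted = [0] * 100
print(cohens_kappa(actual, predicted))  # 0.0: no agreement beyond chance
```

This is exactly why an 80%-accurate majority-class predictor on the unbalanced dataset is worthless: its kappa is zero.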
This study is open for future revisions and assessments by future researchers tackling the
same topic. The researchers recommend procuring a more detailed and concrete dataset, which produces greater statistical power, less sampling variability, and the ability to discover more patterns and
events in detecting credit loan eligibility. Also, the researchers recommend collecting real-world
and up-to-date data to evaluate the model, to continuously refine its performance based on
real-world events.
Further research on credit risk loan eligibility can advance the field and contribute to more accurate and reliable credit risk assessment.

REFERENCES
Hassani, et al. (June 19, 2021). “Credit Risk Assessment Using Learning Algorithms for Feature Selection”. https://www.tandfonline.com/doi/full/10.1080/16168658.2021.1925021

Konovalova, N., Kristovska, I., & Kudinska, M. (2016). Credit Risk Management in Commercial Banks. Polish Journal of Management Studies, 13(2), 90–100. https://doi.org/10.17512/pjms.2016.13.2.09

Li, et al. (July 2020). “A Two-Stage Dynamic Credit Risk Assessment System”. https://dl.acm.org/doi/10.1145/3417188.3417193

Sulin, et al. (2001). “The credit-risk decision mechanism on fixed loan interest rate with imperfect information”. https://ieeexplore.ieee.org/document/6077383

The Investopedia Team (March 15, 2022). “Credit Risk: Definition, Role of Ratings, and Examples”. https://www.investopedia.com/terms/c/creditrisk.asp
List of Figures
Figure 10: Univariate feature importance score of variables using chi-square tests
Figure 11: Univariate feature importance score using gain ratio
Figure 20: Summary statistics of the various models, tested through Cross-validation
Figure 21: Model comparison of Hold-out results
Figure 22: Summary statistics of the various models, tested through Hold-out
Figure 23: Lift value per rule
Appendix B
List of Tables
Naive Bayes, cross-validation:
10 folds – NA, 0, NA, 50, 0
15 folds – NA, 0, NA, 50, 0
20 folds – NA, 0, NA, 50, 0
50 folds – NA, 0, NA, 50, 0

Artificial Neural Networks, cross-validation:
10 folds (Size = 8, Decay = 0.1) – 61.51, 58.36, 59.89, 60.92, .2184

Naive Bayes, hold-out:
10-90 – NA, 0, NA, 0.5, 0
20-80 – NA, 0, NA, 0.5, 0
30-70 – NA, 0, NA, 0.5, 0
40-60 – NA, 0, NA, 0.5, 0