
Revision List

During the midterm consultation and class meetings with Dr. Jennifer Llovido, she

suggested the following:

1. Solve the class imbalance of the dataset

The class variable of the credit risk loan eligibility dataset contains a class

imbalance wherein 80% of the class is 0 (Not eligible) while 20% is 1 (Eligible). To

resolve this, the researchers performed Synthetic Minority Oversampling Technique

(SMOTE) to balance the classes.

2. Explain the process of how you partitioned the dataset in making the model

The researchers added the details of performing the holdout and cross-validation

method in the paper. The researchers conducted 90-10, 80-20, 70-30, and 60-40 training-testing splits in the holdout method and conducted 10-fold, 15-fold, 20-fold, and 50-fold cross-validation.

3. Apply the CRISP-DM methodology

The researchers applied the CRISP-DM methodology in making the case study.

The phases of the methodology are Business Understanding, Data Understanding, Data Preparation, Modelling, Evaluation, and Deployment. The researchers did not include the

Deployment phase since this study is only for research purposes.


4. Apply alternative classification techniques

The researchers conducted different classification techniques such as

Classification and Regression Tree (CART), K Nearest Neighbors (kNN), Rule-based

classification, Linear Support Vector Machine (LSVM), Random Forest, Naïve Bayes

Classifier, Conditional Inference Tree, C4.5, and Gradient Boosted Decision Tree.

5. Apply association rule mining and clustering analysis to the dataset, if applicable

The researchers applied association rule mining to the dataset and found 177,280 rules. The researchers found that 27.145% of the rules have a lift score greater than 1, 58.223% have a lift score equal to 1, and 14.623% have a lift score less than 1. (See Appendix A, Fig. 23)

The researchers also tried clustering analysis on the dataset but discovered that such algorithms are not well suited to it, since it contains both numeric and categorical variables. The researchers tried K-Prototypes clustering from the clustMixType package, since that algorithm is applicable to mixed-type data. After implementing the algorithm, the researchers found no significant insights in the results, since the algorithm outputs clusterings per variable rather than for the whole dataset.
ABSTRACT

Credit risk assessment is a critical task for financial institutions that lend money to

individuals and businesses. By accurately identifying the likelihood of borrowers defaulting on

their loans, lenders can minimize their losses and make informed decisions about which loan

applications to approve. This paper proposes a novel approach to credit risk assessment that

combines traditional credit scoring models with state-of-the-art machine learning algorithms.

The researchers used a Credit Risk Loan Eligibility dataset with features that include the

borrower’s demographic data, financial statements, and credit history. The researchers

discovered the five most important variables by obtaining the information gain ratio value of each variable; these are recoveries, collection_recovery_fee, last_week_pay, int_rate, and initial_list_status. The researchers performed binary classification of the borrower's eligibility by using ten supervised machine learning algorithms, namely Classification and

Regression Tree (CART), K Nearest Neighbors (kNN), Rule-based classification, Linear Support

Vector Machine (LSVM), Random Forest, Naïve Bayes Classifier, Artificial Neural Network

(ANN), Conditional Inference Tree, C4.5, and Gradient Boosted Decision Tree. The Random

Forest algorithm performed the best, with an accuracy of 89.85% using 50-fold cross-validation. The worst-performing algorithm was the Naïve Bayes Classifier, which had an accuracy of 50%.

keywords – credit risk, binary classification, CART, kNN, Rule-based classification, Linear

SVM, Random Forest, Naïve Bayes Classifier, ANN, Conditional Inference Tree, C4.5, Gradient

Boosted Decision Tree, loan


INTRODUCTION

History demonstrates that one of the main factors contributing to bank distress is

the concentration of credit risk in asset portfolios. This holds for both specific

organizations and banking systems as a whole. An essential characteristic of credit markets is that information between banks and account holders is asymmetric; that is, account holders know more about their investments than their banks do. As present and future users of these cash reservoirs and banks, it is our duty to learn how managerial and tactical bank credit works, in order to gain an in-depth understanding of our savings, interest, taxes, and loans.

Credit risk refers to the possibility of suffering a loss as a consequence of a borrower's failure to make loan payments or fulfill contractual commitments. It typically refers to the possibility that a lender will not be able to receive the principal and interest owed, which would disrupt cash flows and raise collection costs.

The borrower's general capacity to repay a loan in accordance with its original terms is

used to determine credit risks. When determining the credit risk of a consumer loan,

lenders consider the five Cs: credit history, capacity to repay, capital, the loan's

conditions, and associated collateral.

With this, the study mainly aims to classify a borrower’s credit risk loan data

using supervised machine learning algorithms such as Classification and Regression Tree

(CART), K Nearest Neighbors (kNN), Rule-based classification, Linear Support Vector

Machine (LSVM), Random Forest, Naïve Bayes Classifier, Artificial Neural Network
(ANN), Conditional Inference Tree, C4.5, and Gradient Boosted Decision Tree to determine whether a borrower is eligible to be granted a loan.

Objectives of the Study

This case study aims to explore the credit risk loan eligibility dataset. Specifically, it aims

to:

1. Identify which variables are most important in determining credit risk loan eligibility.

2. Identify which classification algorithm performs best in terms of accuracy and

kappa score.
Related Studies

A. Credit Risk Management in Commercial Banks

According to Konovalova et al., carrying out a quantitative assessment and

analysis of the credit risk and rating of borrowers is relevant to all banks involved in

lending to individuals and legal entities. To ensure the effectiveness of credit risk

management in commercial banks, it was deemed necessary to develop the kinds of terms

and conditions for those bank clients who take loans that would both attract potential

borrowers and guarantee loan repayment. However, to develop a separate set of terms and

conditions for every individual borrower, existing and potential bank clients should be

grouped according to their similarities and differences. After that, a separate set of terms

and conditions needs to be worked out for each group in accordance with the

characteristic features of the group members. The classification of bank clients into

distinct groups should proceed according to the method of classification that unites

disparate system elements into homogeneous groups based on the similarities of the

elements in question. This method of classification needs to reflect the structure of the

source data and to ensure the most adequate division of the data into groups.

They took into account the statistics reflecting the bank customers’ violations of

the contract conditions and the damage caused to the bank by each such violation. The

magnitude of the risk as the amount of damage (risk defined as the customer’s failure to

make principal payments on time) can be seen as a regressive dependence on such factors

as the average loan size, the period for which the loan is issued, and a number of other

factors. Specification and identification of such regressions should be performed based on


the information about the damage caused by each client and about the credit

characteristics of each customer class. Such a model would enable the forecasting of the

risk posed by each potential client.

B. A Two-Stage Dynamic Credit Risk Assessment System

Li et al. utilized Logistic Regression (LR) due to its interpretability and efficiency. They showcased a hybrid scoring model that augments Logistic Regression with additional data, obtaining a more accurate credit score. However, despite its wide use, LR remains limited in its ability to capture nonlinear relationships.

To better understand the relationship behind data, more sophisticated methods

were investigated, such as Decision Tree (DT) and Support Vector Machine (SVM). They

used Decision Tree and Artificial Neural Network to build the credit model, and results

indicated that it is a successful technology. They proposed a novel credit model using the

clustered support vector machine (CSVM) and demonstrated that CSVM could achieve

high classification performance while remaining relatively cheap computationally.

However, much of the research was based on traditional machine learning methods and

up until now could not address sequential input data. The use of deep learning techniques

to build a credit risk model has seen significant increases in the reported accuracy on

benchmarking data sets. They compared the classification abilities of machine and deep

learning models, and results show that the tree-based models are more stable than the

models based on multilayer artificial neural networks. They then applied LSTM to predict default probability on sequential input data, proposing a dynamic forecasting system that predicts the default probability of a company by using a long short-term memory model to incorporate daily news from social media. Babaev et al. combined NLP embedding techniques and RNNs to mine sequential transaction data.

C. Credit Risk Assessment Using Learning Algorithms for Feature Selection

According to Hassani et al., in financial risk management, credit card evaluation is a vital task for banks. Banks often need to evaluate customer credit to guard against risk and competition, and identifying the influential factors of a credit card can help them do so. Moreover, feature selection using machine learning algorithms can evaluate these factors with higher accuracy, whereas traditional techniques require strong assumptions. Real-world data such as financial data are unbalanced, and machine learning algorithms do not classify them well. In their paper, the data is balanced by the SMOTE method, and the binary firefly algorithm is then used to identify the effective features of the credit cards. The study also investigates various machine learning algorithms, such as KNN, Fuzzy KNN, Random Forest, Decision Tree, and SVM, for effectively classifying credit cards.

D. Application of Big Data Unbalanced Classification Algorithm in Credit Risk

Analysis of Insurance Companies

The study of Xian Wu and Huan Liu on the classification of unbalanced datasets examines existing algorithms for unbalanced data classification, particularly the SMOTE algorithm. The article highlights the limitations of traditional classification

methods when dealing with unbalanced datasets and proposes the use of preprocessing

techniques and optimized classifiers to improve the accuracy of minority class

classification. The study also mentions the use of the SMOTE algorithm for unbalanced
data classification and its limitations in not taking into account the properties of the data

itself.

To address this limitation, the study proposes a new oversampling method that

considers the sample distribution of the minority class. By doing so, the new samples

generated will be more representative and will better describe the minority class, leading

to improved classification accuracy. The proposed oversampling method can help avoid

overfitting of the model that can result from the random selection of sample points

between two sample points used by the SMOTE algorithm.

The study can be used as a basis for further research on the optimization of

classifiers and the development of oversampling methods for unbalanced datasets.

Specifically, the study can be extended to explore other oversampling methods that take

into account the properties of the data itself, as well as undersampling methods that can

be used to reduce the number of samples in the majority class. The findings of this study

can also be applied to other fields that involve the classification of unbalanced datasets,

such as credit scoring and fraud detection.

E. Performance of Three Classification Techniques in Classifying Credit Applications

Into Good Loans and Bad Loans: A Comparison

Mohammad Ali states that the use of statistical techniques for categorizing loan

applications as good or bad has become increasingly important due to the high demand

for credit. It is crucial to use a classification method that can accurately predict loan

profitability. In this study, Mohammad Ali compares the predictive capabilities of three

classification techniques, namely logistic regression, CART, and random forests, using an
80:20 learning:test split on German credit data. He evaluates the performance of each

model by calculating the probability of default for each observation in the test set and

classifying them as good or bad loans based on a threshold. He uses several thresholds to

compare the performance of each classifier on five model suitability statistics: accuracy,

precision, negative predictive value, recall, and specificity.

The results show that none of the classifiers performed the best in all five

cross-validation statistics. However, logistic regression performed the best at low

probability of default thresholds, while CART was the most accurate, precise, and

specific at higher thresholds. Random forests had the best negative predictive value and

recall at higher thresholds.

F. SMOTE-ENC: A novel SMOTE-based method to generate synthetic data for

nominal and continuous features

Mimi Mukherjee and Matloob Khushi noted that real-world datasets often have

imbalanced classes, with some classes having significantly fewer instances than others.

To address this problem, various synthetic minority over-sampling methods (SMOTE)

have been proposed to balance the dataset, particularly for continuous features. However,

when datasets have both nominal and continuous features, SMOTE-NC is the only

over-sampling technique available to balance the data. The authors present a new

over-sampling method called SMOTE-ENC (SMOTE – Encoded Nominal and

Continuous) that encodes nominal features as numeric values to reflect the change of

association with the minority class. The authors show that the classification model using

SMOTE-ENC provides better predictions than the model using SMOTE-NC when the
dataset has a substantial number of nominal features and when there is an association

between categorical features and the target class. Furthermore, SMOTE-ENC can be

applied to both mixed datasets and nominal-only datasets, addressing a major limitation

of SMOTE-NC.

G. Different Ways to Handle Imbalanced Datasets.

According to Will Badr, the dataset should be split into training and testing sets before balancing the data, to ensure that the test dataset remains unbiased and provides a true evaluation of the model. Balancing the data before splitting risks introducing bias into the test set, since some of the synthetically generated data points would be well known from the training set. He stated that it is important for the test set to be as objective as possible. However, according to his study, under-sampling techniques may remove valuable information and alter the overall dataset distribution, which can impact the quality of the model; therefore, under-sampling should not be the first approach considered for imbalanced datasets. He also added that it is important to recognize that the performance of ML models on imbalanced datasets will be constrained by their ability to predict rare and minority points.


METHODS

The figure below shows the Cross-Industry Standard Process for Data Mining

(CRISP-DM) Model. It consists of six phases, each designed to guide the data mining process. The

use of the CRISP-DM model in data mining helps to ensure that the data mining project is

structured, manageable, and produces actionable results.

Fig. 1: CRISP-DM Model

A. Business understanding

The credit risk loan eligibility dataset contains information on loan applicants and

their eligibility for credit loans. It contains various information such as their age, income,

employment status, credit score, loan amount, and other relevant asset information.

Banks or lending institutions can use this dataset to make informed decisions to check if

an applicant is eligible for a credit loan. The class variable of the dataset is loan_status, which determines whether the applicant is eligible: a value of 0 denotes "Not eligible," and 1 denotes "Eligible."

B. Data understanding

In order to fully understand the credit risk loan eligibility dataset, the researchers obtained and used the following column descriptions:

- member_id: a unique LC assigned I.D. for the borrower member
- loan_amnt: the listed amount of the loan applied for by the borrower
- funded_amnt: the total amount committed to that loan at that point in time
- funded_amnt_inv: the total amount committed by investors for that loan at that point in time
- term: the number of payments on the loan
- batch_enrolled: a series of codes given to a certain batch on when their loan was recorded/enrolled
- int_rate: interest rate on the loan
- grade: a classification system that involves assigning a quality score to a loan based on a borrower's credit history, quality of the collateral, and the likelihood of repayment of the principal and interest
- sub_grade: a specified grading system for the loans
- emp_title: the job title supplied by the borrower when applying for the loan
- emp_length: employment length in years
- home_ownership: the home ownership status provided by the borrower during registration or obtained from the credit report
- annual_inc: the self-reported annual income provided by the borrower during registration
- verification_status: indicates the verification of annual income/income
- pymnt_plan: indicates if a payment plan has been put in place for the loan
- desc: description of purpose of loan
- purpose: a category provided by the borrower for the loan request
- title: the loan title provided by the borrower
- zip_code: the first 3 numbers of the zip code provided by the borrower in the loan application
- addr_state: the state provided by the borrower in the loan application
- dti: a ratio calculated using the borrower's total monthly debt payments on the total debt obligations, excluding mortgage and the requested LC loan, divided by the borrower's self-reported monthly income
- delinq_2yrs: the number of delinquencies in the past 2 years
- inq_last_6mths: the number of inquiries in the last 6 months
- mths_since_last_delinq: the number of months since the last delinquency
- mths_since_last_record: the number of months since the last public record
- open_acc: the number of open credit lines in the borrower's credit file
- pub_rec: the number of derogatory public records
- revol_bal: total credit revolving balance
- revol_util: revolving line utilization rate, or the amount of credit the borrower is using relative to all available revolving credit
- total_acc: the total number of credit lines in the borrower's credit file
- initial_list_status: the initial listing status of the loan (f, w)
- total_rec_int: interest received to date
- total_rec_late_fee: late fees received to date
- recoveries: post charge off gross recovery
- collection_recovery_fee: post charge off collection fee
- collections_12_mths_ex_med: number of collections in 12 months excluding medical collections
- mths_since_last_major_derog: months since most recent 90-day or worse rating
- application_type: indicates whether the loan is an individual application or a joint application with two co-borrowers
- verification_status_joint: indicates if the co-borrowers' joint income was verified by LC, not verified, or if the income source was verified
- last_week_pay: last week payment was received
- acc_now_delinq: the number of accounts on which the borrower is now delinquent
- tot_coll_amt: total collection amounts ever owed
- tot_cur_bal: total current balance of all accounts
- total_rev_hi_lim: total revolving high credit/credit limit
- loan_status: current status of the loan

Table 1. Column Descriptions

The dataset contains missing values that must be handled before training the model.

Missing values can be imputed using various techniques, such as mean-median imputation, or

removed entirely if they are insignificant. It also contains categorical variables such as

employment status, education, and marital status. The data set includes numeric factors such as

age, income, and loan amount, which may have diverse scopes and measurements, impacting the

efficiency of machine learning algorithms. The original data has 63,999 cases and 45 variables.

(See Appendix A, Fig. 2)

Loan_status is the target variable. It has two levels: 0 ("Not eligible") and 1 ("Eligible"). The latter forms only about 20% of the data; therefore, the class variable of the original dataset is unbalanced. (See Appendix A, Fig. 4)


C. Data preparation

Handling missing data

The researchers dropped the columns that are irrelevant to the classifier, such as member_id, zip_code, and addr_state. Columns where the majority of the data (about 50% or more) is missing were also dropped, such as batch_enrolled, emp_title, mths_since_last_record, mths_since_last_major_derog, etc. This left the researchers with a new data frame of only 31 variables. After this, the researchers transformed the data types of the columns to be more appropriate for model learning. Ordinal values such as loan_status (the class variable), grade, verification_status, and others were turned into factors instead of characters. The remaining missing values were handled by dropping the rows containing them. (See Appendix A, Fig. 3)
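A minimal sketch of these cleaning steps is shown below; the object name loan_raw and the use of dplyr are assumptions for illustration, not the researchers' exact code.

```r
# A sketch of the cleaning steps above, assuming the raw CSV has been
# read into `loan_raw`.
library(dplyr)

loan_df <- loan_raw %>%
  # Drop identifier-like columns irrelevant to the classifier
  select(-member_id, -zip_code, -addr_state) %>%
  # Drop columns where roughly half or more of the values are missing
  select(where(~ mean(is.na(.x)) < 0.5)) %>%
  # Turn ordinal character columns, including the class, into factors
  mutate(across(c(loan_status, grade, verification_status), as.factor)) %>%
  # Drop the remaining rows that still contain missing values
  na.omit()
```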

Balancing the class variable

Fig. 4 shows that the original distribution of the class variable is unbalanced,

wherein 80% are 0 and 20% are 1. In order to remove the bias towards the negative value

in training the model, the researchers tried using the Synthetic Minority Oversampling

Technique (SMOTE) and the undersampling technique in balancing the class variable.

The researchers chose the balanced dataset produced by SMOTE (See Appendix A, Figures 7-8) because it produces a more evenly distributed class variable and, unlike undersampling, does not reduce the overall amount of data.
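As an illustration, the class can be balanced with the SMOTE implementation from the DMwR package; the package choice and parameter values here are assumptions (smotefamily or themis are alternatives), not necessarily the researchers' setup.

```r
# A sketch of balancing loan_status with SMOTE via the DMwR package
# (archived on CRAN; smotefamily or themis are alternatives).
library(DMwR)

set.seed(123)
table(loan_df$loan_status)   # roughly 80% "0", 20% "1"

# perc.over = 100 doubles the minority class; perc.under = 200 keeps
# twice the number of newly created minority cases from the majority
# class, yielding an approximately 50-50 class distribution.
balanced_df <- SMOTE(loan_status ~ ., data = as.data.frame(loan_df),
                     perc.over = 100, perc.under = 200)
table(balanced_df$loan_status)
```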
Feature Preparation

Feature selection is the method of reducing the input variables by keeping only relevant data and getting rid of noise. The researchers obtained the univariate feature importance score of each variable in order to select only the relevant features. To get these scores, the researchers tried different methods, namely chi-square statistics, the information gain ratio, correlation/entropy with best-first search (cfs), and black-box feature selection.

For the chi-square test statistic, the researchers first discretized the numeric and integer variables, converting them into factors with five levels. After this, the algorithm outputs the top

five variables, which are last_week_pay, initial_list_status, term, verification_status, and

dti. (See Appendix A, Figures 9-10).

Computing the univariate feature importance score using the information gain

ratio, the algorithm outputs the top five variables with the highest gain which are

recoveries, collection_recovery_fee, last_week_pay, int_rate, and initial_list_status. (See

Appendix A, Fig. 11).

The cfs method stated that the variables int_rate, initial_list_status, recoveries,

and last_week_pay, respectively, are the most important features. In the black-box feature

selection, the five most important features are loan_amnt, term, int_rate, emp_length, and

home_ownership as least important. Additionally, the researchers tried greedy search

strategies to find the most important features. Both forward-search and best-first search

identified last_week_pay as the most important feature. Hill-climbing search identified


recoveries as the most important attribute, followed by last_week_pay, then

initial_list_status.

The researchers decided to pick the top 5 features using the information gain ratio

method of the balanced dataset. The top 5 features are recoveries,

collection_recovery_fee, last_week_pay, int_rate, and initial_list_status.
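A sketch of this ranking with the FSelector package follows; the package is an assumption (equivalent functions exist in FSelectorRcpp), and balanced_df carries over from the earlier sketch.

```r
# A sketch of ranking predictors by information gain ratio and keeping
# the top five.
library(FSelector)

weights <- gain.ratio(loan_status ~ ., data = balanced_df)
top5 <- cutoff.k(weights, 5)   # e.g. recoveries, collection_recovery_fee, ...
print(top5)

# Retain only the selected features plus the class variable
model_df <- balanced_df[, c(top5, "loan_status")]
```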

D. Modeling

The researchers used ten classification algorithms in performing binary

classification of the dataset, namely Classification and Regression Tree (CART), K

Nearest Neighbors (kNN), Rule-based classification, Linear Support Vector Machine

(LSVM), Random Forest, Naïve Bayes Classifier, Artificial Neural Network (ANN),

Conditional Inference Tree, C4.5, and Gradient Boosted Decision Tree.

The researchers focused on the balanced dataset in making models with the ten algorithms mentioned above, because a model produced from the unbalanced dataset is biased towards the majority class, loan_status = 0.

The researchers trained the various models using different testing/training splits

for hold-out validation, namely 10-90, 20-80, 30-70, and 40-60 distributions.

Additionally, the researchers tried varying fold counts for cross-validation, with 10, 15, 20, and 50 folds, to see whether the training and test distribution would have any effect on the performance of the data mining models. Each algorithm was trained with the data set at least 16 times, 8 times each on the balanced and unbalanced datasets, with the exception of the artificial neural networks.
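A sketch of the two validation setups with the caret package is shown below; the fold counts and split ratios mirror those listed above, and model_df carries over from the feature-selection sketch.

```r
# A sketch of the hold-out and cross-validation setups with caret.
library(caret)
set.seed(123)

# Hold-out: e.g. a 20-80 test-train split (also run as 10-90, 30-70, 40-60)
in_train  <- createDataPartition(model_df$loan_status, p = 0.8, list = FALSE)
train_set <- model_df[in_train, ]
test_set  <- model_df[-in_train, ]

# Cross-validation: e.g. 10 folds (also run with 15, 20, and 50 folds)
ctrl <- trainControl(method = "cv", number = 10)
```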


Classification and Regression Tree (CART)

The CART algorithm is used to build a decision tree that predicts the loan

eligibility status of an applicant based on the given attributes. The algorithm would

recursively split the dataset based on the values of the attributes and minimize the Gini

impurity to determine the best split. The resulting decision tree could be used to classify

new loan applicants based on their attributes. (See Appendix A, Figures 14-17)

Creating a tree with default settings and pre-pruning resulted in a decision tree

that only uses two variables — last_week_pay and term. (See Appendix A, Fig. 5).

Meanwhile, when attempting to build a full tree with all 31 variables and a cp value of 0,

it took roughly five minutes for the tree to be rendered as a PNG file. Hence, the researchers decided to limit the number of columns used for model building. (See Appendix A, Fig. 6)

Next, the researchers evaluated both models: the partial tree with pre-pruning and the full tree. First, they created a sample set of only 15 rows, removed the values for loan_status, and compared the predictions against the actual values. The default tree with pre-pruning performed terribly even though the testing and training data were the same, incorrectly classifying all instances as '0'. Meanwhile, the full tree performed significantly better, correctly predicting all instances from the sampled partition.
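A sketch of the CART training call via caret's "rpart" method follows; the cp grid mirrors the values reported in Table 2.1, and the object names carry over from the earlier sketches.

```r
# A sketch of training the CART model; the cp grid mirrors Table 2.1.
cart_fit <- train(loan_status ~ ., data = train_set,
                  method    = "rpart",
                  trControl = ctrl,
                  tuneGrid  = expand.grid(cp = c(0, 0.001, 0.002, 0.04)))
print(cart_fit)   # accuracy and kappa per cp value
```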

K Nearest Neighbors (kNN)

The KNN algorithm is used to predict loan eligibility by identifying the k nearest

neighbors of a new loan applicant based on their attributes. The algorithm would
calculate the distance between the new applicant and each existing applicant in the

dataset and select the k nearest neighbors. The majority class of the k nearest neighbors

would then be used to predict the loan eligibility status of the new applicant. (See

Appendix A, Figures 12-13)
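A sketch of the kNN model via caret follows; the k grid mirrors the values in Table 2.1, and the centering/scaling preprocessing is an assumption (kNN is distance-based, so unscaled features would dominate).

```r
# A sketch of the kNN model; the k grid mirrors Table 2.1.
knn_fit <- train(loan_status ~ ., data = train_set,
                 method     = "knn",
                 preProcess = c("center", "scale"),  # assumed scaling step
                 trControl  = ctrl,
                 tuneGrid   = expand.grid(k = c(1, 3, 5, 10)))
```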

Rule-based classification

The training and testing datasets are created by subsetting the original dataset.

Through cross-validation, the researchers created a rule-based classifier using the

“PART” algorithm, in order to evaluate the performance of the trained model and provide

predicted values to produce a decision list of the trained model.

Rule-based classification is a type of classification algorithm that makes

predictions based on a set of predefined rules. Instead of learning patterns from data,

rule-based classifiers use a set of if-then rules to determine the class or category of a

given instance.
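A sketch of the PART rule-based classifier via RWeka through caret is shown below; the threshold values mirror those in the results tables, and the exact tuning grid is an assumption.

```r
# A sketch of the PART rule-based classifier; threshold values mirror
# those reported in Tables 2.1 and 2.2.
library(RWeka)
part_fit <- train(loan_status ~ ., data = train_set,
                  method    = "PART",
                  trControl = ctrl,
                  tuneGrid  = expand.grid(threshold = c(0.255, 0.5),
                                          pruned    = "yes"))
part_fit$finalModel   # prints the decision list of if-then rules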

Linear Support Vector Machine

The linear SVM algorithm aims to find the optimal hyperplane by solving an

optimization problem that involves minimizing the classification errors and maximizing

the margin. It finds the best hyperplane that maximally separates the data points of

different classes. The hyperplane is a decision boundary that divides the feature space

into two regions, one for each class. In effect, the algorithm constructs a hyperplane that serves as a linear decision boundary in the original feature space.
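A sketch of the linear SVM via caret's "svmLinear" method (kernlab backend) follows; the cost parameter C = 1 and the scaling step are assumed defaults, not the researchers' confirmed settings.

```r
# A sketch of the linear SVM; C = 1 is an assumed default cost value.
svm_fit <- train(loan_status ~ ., data = train_set,
                 method     = "svmLinear",
                 preProcess = c("center", "scale"),
                 trControl  = ctrl,
                 tuneGrid   = expand.grid(C = 1))
```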
Random Forest

Random Forest (method=”rf”) works by combining the predictions of multiple

decision trees to make more accurate and robust predictions. A collection of decision

trees is constructed using a technique called "bagging." Bagging involves randomly

selecting subsets of the training data, with replacement, to create multiple subsets of the

data. Each subset is then used to train an individual decision tree. This process helps to

introduce randomness and diversity into the models, which can reduce overfitting and

improve generalization.

During the construction of each decision tree, at each split, a random subset of

features is considered rather than using all available features. This further adds

randomness and reduces correlation among the trees. By combining the predictions of

multiple trees, the Random Forest algorithm can make more accurate and reliable

predictions compared to a single decision tree.

A tuning grid called RFgrid is created. The grid specifies the values to be tested

for the “mtry” parameter, which controls the number of features considered at each split

of the decision trees in the Random Forest. In this case, the “mtry” values range from 1 to

5, because the dataset only has 5 attributes at most.
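A sketch of this Random Forest setup follows, using the RFgrid described above with the object names from the earlier sketches.

```r
# A sketch of the Random Forest setup; RFgrid spans mtry = 1 to 5
# because at most five predictors remain after feature selection.
RFgrid <- expand.grid(mtry = 1:5)
rf_fit <- train(loan_status ~ ., data = train_set,
                method    = "rf",
                trControl = ctrl,
                tuneGrid  = RFgrid)
```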

Naïve Bayes Classifier

Naive Bayes was implemented through caret's method "nb". It is a simple

probabilistic classifier based on Bayes' theorem with the assumption of independence

between features. It is often used for classification tasks and works well with categorical

and text data.
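A sketch of the Naive Bayes call follows, using caret's "nb" method (klaR backend) with default tuning.

```r
# A sketch of the Naive Bayes model via caret's "nb" method.
nb_fit <- train(loan_status ~ ., data = train_set,
                method    = "nb",
                trControl = ctrl)
```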


Artificial Neural Network

A neural network is an artificial intelligence technique that enables computers to

process data in a manner resembling the human brain. The package commonly used for

artificial neural networks in R is called "nnet." It provides functions and tools for

building, training, and evaluating neural network models. A grid of hyperparameter

values, determined by the expand.grid function, is used to fine-tune the neural network

model.
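A sketch of the neural network via caret's "nnet" method follows; the size and decay values mirror those reported in Tables 2.1 and 2.2, but the full grid shown here is an assumption.

```r
# A sketch of the neural network; size/decay values mirror the tables.
library(nnet)
nnet_grid <- expand.grid(size = c(8, 10), decay = c(0.1, 0.4))
ann_fit <- train(loan_status ~ ., data = train_set,
                 method    = "nnet",
                 trControl = ctrl,
                 tuneGrid  = nnet_grid,
                 trace     = FALSE)   # suppress per-iteration training output
```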

Conditional Inference Tree

Conditional inference trees offer flexibility in handling both categorical and

continuous predictor variables, as well as handling missing values. The "ctree" method is

used for training the conditional inference tree model. The researcher implements

conditional inference trees for classification tasks. It starts by setting a seed value for

reproducibility and creates fold indices using the createFolds. The conditional inference

tree model is trained using the train function, where the loan_status variable is predicted

based on other variables in the dataset.
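A sketch of the conditional inference tree setup follows; mincriterion = 0.01 matches the setting in the results tables, and the fold indices are built with createFolds as described above.

```r
# A sketch of the conditional inference tree with explicit fold indices.
set.seed(123)
fold_idx <- createFolds(train_set$loan_status, k = 10, returnTrain = TRUE)
ctree_fit <- train(loan_status ~ ., data = train_set,
                   method    = "ctree",
                   trControl = trainControl(method = "cv", index = fold_idx),
                   tuneGrid  = expand.grid(mincriterion = 0.01))
```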

C4.5

The J48 algorithm in R from the package caret and RWeka is used with the

method set to 'J48'. This constructs decision trees based on the C4.5 algorithm, which

uses attribute selection, recursive splitting, handling of missing values, and pruning to

create an interpretable model for classification tasks.
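A sketch of the C4.5 model via RWeka's J48 through caret follows; the C (confidence) and M (minimum instances per leaf) values mirror those in the results tables.

```r
# A sketch of the C4.5 model via caret's "J48" method.
j48_fit <- train(loan_status ~ ., data = train_set,
                 method    = "J48",
                 trControl = ctrl,
                 tuneGrid  = expand.grid(C = c(0.3775, 0.5), M = 1))
```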


Gradient Boosted Decision Tree

When using the XGBoost model for classification, it follows a boosting

framework where an ensemble of decision trees is constructed iteratively. Each decision

tree is built to correct the mistakes made by the previous trees. This algorithm is used by

including the package “xgboost” and “dplyr”, with the method “xgbTree”.

The xgboost model is trained using the train function, with hyperparameters tuned

through cross-validation. The resulting model is evaluated using the confusionMatrix

function, which computes performance metrics based on predicted and observed values.

Train-test split is performed using createDataPartition, creating separate training and

testing datasets.
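A sketch of the gradient boosted tree via caret's "xgbTree" method follows; nrounds = 20 matches the tables, while the remaining grid values are illustrative defaults rather than the researchers' exact settings.

```r
# A sketch of the XGBoost model and its evaluation.
library(xgboost)
library(dplyr)

xgb_grid <- expand.grid(nrounds = 20, max_depth = 6, eta = 0.3,
                        gamma = 0, colsample_bytree = 1,
                        min_child_weight = 1, subsample = 1)
xgb_fit <- train(loan_status ~ ., data = train_set,
                 method    = "xgbTree",
                 trControl = ctrl,
                 tuneGrid  = xgb_grid)

# Evaluate on the held-out split created with createDataPartition
pred <- predict(xgb_fit, newdata = test_set)
confusionMatrix(pred, test_set$loan_status)
```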
RESULTS AND DISCUSSION

Various classification algorithms were evaluated using both the holdout method and

cross-validation. Across the 10 classification models trained, we consistently found that the

models whose dataset was split with cross-validation performed slightly better compared to their

hold-out counterparts. Additionally, a high fold count for cross validation yielded marginally

higher accuracy and kappa compared to low fold counts, for the models using cross-validation

for splitting training and testing sets, at the cost of higher computing time (See Appendix B, Table

2.1). Also, among the algorithms trained with the cross-validation method, the kappa values of the conditional inference trees, gradient boosted decision trees, neural network, SVM, and especially Naive Bayes were lower than 0.5, indicating at best moderate agreement between the predicted labels and the true labels.

As for the models trained with datasets partitioned using the hold-out method, the

test-training splits 10-90 and 30-70 tend to have higher accuracy and kappa. The 40-60 split was

consistently the worst division of the test and train sets. (See Appendix B, Table 2.2). The

performance of the 10 models across hold-out and cross-validation is similar.

The researchers observed that the Naive Bayes algorithm performed terribly when used as a classifier for this data set: it had a kappa of 0 and an accuracy of 50% because the model simply predicted everything to have a loan_status of 0. The best performing algorithm was the random forest classifier, reaching 89.85% accuracy, closely followed by kNN with 89.7% accuracy, both trained using 50-fold cross-validation. (See Appendix A, Fig. 19)


For some classification algorithms, the researchers tried tweaking the tuning parameters

to see how different values affect the model’s performance. For rpart decision trees, the

complexity parameter is best set at zero (See Appendix A, Fig. 18). For KNN, we observed that

the value of k, or number of neighbors, is best set at 1, because accuracy and kappa drop with higher k values. In random forest, the mtry parameter, which controls how much randomness enters the tree-building process, is best set at 5 to fully utilize all the available attributes. For the rule-based classifier, the confidence threshold parameter determines the minimum confidence required for a rule to be considered valid and used for classification; threshold values of 0.5 and 0.255 give the highest accuracy ratings. Pruning was also incorporated to improve the

performance by removing rules that do not significantly contribute to the overall accuracy.

The results found in this study will serve as a guide as to how banks or lending

organizations approve future loan applicants. The variables with the most importance will serve

as the top features to look for in an applicant. Also, the algorithm that was most effective in

processing this type of dataset, the Random Forest classifier, will serve as a basis for creating systems that automate the process of classifying an applicant's eligibility for a loan.
CONCLUSION

In this paper, we studied whether a customer is likely to default on a loan or be able to pay it off, using the five (5) features identified with the highest information gain ratio values,

namely, recoveries, collection_recovery_fee, last_week_pay, int_rate, and initial_list_status. We

trained 10 different classification models through both hold-out and cross-validated partition

sets, and arrived at the following conclusions:

1. Cross-validation is a better partitioning method for test and train sets to create a

classification model, when compared to the hold-out method. Using different folds for

testing and training noticeably increases the accuracy of classification models, compared

to using only a fixed set and number of testing and training partitions.

2. It is important to make sure the class variable is balanced, otherwise the accuracy metric

for the various classification models would be misleading because of the skewed

distribution. We found the kappa statistic an important evaluation metric for the

performance of the classifier especially when the data is unbalanced, compared to the

percentage of cases classified accurately, since kappa compares the observed accuracy (the proportion of cases classified accurately) to the accuracy expected by chance; a worked formula follows this list.

3. The random forest algorithm produces well performing classifiers that use both

categorical and numerical values to infer a predicted class. Even with an unbalanced data

set, it performs relatively better compared to other algorithms that suffer with low kappa

values when training with unbalanced data.
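For reference, the kappa statistic mentioned in conclusion 2 compares the observed accuracy $p_o$ (the proportion of cases classified accurately) with the accuracy $p_e$ expected from chance agreement:

$$\kappa = \frac{p_o - p_e}{1 - p_e}$$

As a worked example from this study's results, the Naive Bayes model predicted every case as 0 on the balanced set, so $p_o = 0.5$ and $p_e = 0.5$, giving $\kappa = 0$ even though the accuracy is 50%.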


FUTURE WORKS/RECOMMENDATIONS

This study is open for future revisions and assessments by future researchers tackling the

same topic. The researchers recommend procuring a more detailed and concrete dataset, which would produce greater statistical power and less sampling variability and would reveal more patterns and events in detecting credit loan eligibility. The researchers also recommend collecting real-world and up-to-date data to evaluate the model, so that its performance can be continuously refined based on real-world events.

By considering these recommendations, future studies on binary classification of credit

risk loan eligibility can advance the field and contribute to more accurate and reliable credit risk

assessment, enabling better decision-making in the lending industry.


Bibliography

Alon et al. (March 2019). "Credit Risk Research: Review and Agenda". https://www.researchgate.net/publication/323430569_Credit_Risk_Research_Review_and_Agenda

Badr, Will (February 22, 2019). "Different Ways to Handle Imbalanced Datasets". https://towardsdatascience.com/having-an-imbalanced-dataset-here-is-how-you-can-solve-it-1640568947eb

Dutta, Bhumika (July 27, 2021). "A Classification and Regression Tree (CART) Algorithm". https://www.analyticssteps.com/blogs/classification-and-regression-tree-cart-algorithm

"Glossary of Key Credit and Lending Terms". https://www.google.com/url?sa=t&source=web&rct=j&url=https://cbatraininginstitute.org/wp-content/uploads/2018/03/glossary_ofTerms.pdf&ved=2ahUKEwizhbja_Nb9AhVMfXAKHQBMCIgQFnoECC4QAQ&usg=AOvVaw2yT8ddmFEJbMYS20tBoJo3

Hassani et al. (June 19, 2021). "Credit Risk Assessment Using Learning Algorithms for Feature Selection". https://www.tandfonline.com/doi/full/10.1080/16168658.2021.1925021

"K-Nearest Neighbor (KNN) Algorithm for Machine Learning". https://www.javatpoint.com/k-nearest-neighbor-algorithm-for-machine-learning

Konovalova, N., Kristovska, I., & Kudinska, M. (2016). "Credit Risk Management in Commercial Banks". Polish Journal of Management Studies, 13(2), 90-100. https://doi.org/10.17512/pjms.2016.13.2.09

Li et al. (July 2020). "A Two-Stage Dynamic Credit Risk Assessment System". https://dl.acm.org/doi/10.1145/3417188.3417193

"Loan Terms: Specific Terms Defined & How to Negotiate Them". https://www.investopedia.com/loan-terms-5075341

Mukherjee, Mimi, & Khushi, Matloob (2021). "SMOTE-ENC: A novel SMOTE-based method to generate synthetic data for nominal and continuous features". https://arxiv.org/ftp/arxiv/papers/2103/2103.07612.pdf

Ali, Mohammad (2015). "Performance of Three Classification Techniques in Classifying Credit Applications Into Good Loans and Bad Loans: A Comparison". https://www.diva-portal.org/smash/get/diva2:824593/FULLTEXT01.pdf

Ofer (2019). "LendingClub_data_Defaults". https://www.kaggle.com/code/danofer/lendingclub-data-defaults/notebook

Hussain, Shadab (2019). "Credit Risk Loan Eligibility". https://www.kaggle.com/datasets/shadabhussain/credit-risk-loan-eliginility

Ghosh, Soumyadip, & Juneja, Sandeep (December 2006). "Computing worst-case tail probabilities in credit risk". https://dl.acm.org/doi/10.5555/1218112.1218162

Sulin et al. (2001). "The credit-risk decision mechanism on fixed loan interest rate with imperfect information". https://ieeexplore.ieee.org/document/6077383

The Investopedia Team (March 15, 2022). "Credit Risk: Definition, Role of Ratings, and Examples". https://www.investopedia.com/terms/c/creditrisk.asp

Varrey et al. (November 18, 2022). "Predictive Model to Compute Eligibility Test for Loans". https://ieeexplore.ieee.org/document/9951727

Wu, Xian, & Liu, Huan (March 25, 2022). "Application of Big Data Unbalanced Classification Algorithm in Credit Risk Analysis of Insurance Companies". https://www.hindawi.com/journals/jmath/2022/3899801/
APPENDICES
Appendix A

List of Figures

Figure 1: CRISP-DM Model


Figure 2: Variables of the data set
Figure 3: Data after dropping rows with missing values
Figure 4: 80-20 Original Class Distribution

Figure 5: Building a model without feature selection/subsetting


Figure 6: Tree of the data set with complexity parameter = 0

Figure 7: Distribution after SMOTE - NC (80 training)

Figure 8: Class Distribution for TRAIN and TEST


Figure 9: Numeric variables as factors after discretization

Figure 10: Univariate feature importance score of variables using chi-square tests
Figure 11: Univariate feature importance score using gain ratio

Figure 12: kNN 10 folds using unbalanced dataset


Figure 13: kNN 10 folds using balanced dataset

Figure 14: CART 10 folds using unbalanced dataset


Figure 15: CART 10 folds using balanced dataset

Figure 16: CART 15 folds using unbalanced dataset

Figure 17: CART 15 folds using balanced dataset


Figure 18: Tuning rpart decision tree with complexity parameter (cp)
Figure 19: Model comparison of Cross-validated results

Figure 20: Summary statistics of the various models, tested through Cross-validation
Figure 21: Model comparison of Hold-out results

Figure 22: Summary statistics of the various models, tested through Hold-out
Figure 23: Lift value per rule
Appendix B

List of Tables

Cross-Validation, BALANCED (SMOTE)

| Model | Folds | Parameters | Precision | Recall | F1 | Accuracy | Kappa |
|---|---|---|---|---|---|---|---|
| RPART Decision Tree | 10 | | 84.5 | 88.8 | 86.6 | 86.2 | .725 |
| RPART Decision Tree | 15 | | 84.6 | 88.5 | 86.5 | 86.2 | .723 |
| RPART Decision Tree | 20 | | 84.8 | 88.1 | 86.4 | 86.2 | .723 |
| RPART Decision Tree | 50 | cp = 0 | 84.7 | 88.7 | 86.7 | 86.4 | .727 |
| RPART Decision Tree | 50 | cp = 0.001 | 70 | 69.6 | 91.0 | 69.8 | .396 |
| RPART Decision Tree | 50 | cp = 0.002 | 64.1 | 58.8 | 61.3 | 62.9 | .259 |
| RPART Decision Tree | 50 | cp = 0.04 | 59.7 | 56.2 | 57.9 | 59.1 | .182 |
| KNN | 10 | k = 1 | 86.6 | 93.6 | 90 | 89.6 | .791 |
| KNN | 15 | k = 1 | 86.6 | 93.5 | 89.9 | 89.5 | .791 |
| KNN | 20 | k = 1 | 86.8 | 93.6 | 90.1 | 89.7 | .794 |
| KNN | 50 | k = 1 | 86.9 | 93.4 | 90.1 | 89.7 | .794 |
| KNN | 10 | k = 3 | 82.0 | 92.8 | 87.1 | 86.2 | .724 |
| KNN | 15 | k = 3 | 81.8 | 92.6 | 86.9 | 86 | .72 |
| KNN | 20 | k = 3 | 82.2 | 92.7 | 87.1 | 86.3 | .726 |
| KNN | 50 | k = 3 | 82.2 | 92.4 | 87.0 | 86.2 | .724 |
| KNN | 10 | k = 5 | 79.4 | 92.0 | 85.2 | 84 | .681 |
| KNN | 15 | k = 5 | 79.2 | 91.9 | 85.1 | 83.9 | .678 |
| KNN | 20 | k = 5 | 79.5 | 92.1 | 85.3 | 84.2 | .683 |
| KNN | 50 | k = 5 | 79.7 | 91.9 | 85.3 | 84.2 | .685 |
| KNN | 10 | k = 10 | 75.8 | 90.3 | 82.4 | 80.7 | .614 |
| KNN | 15 | k = 10 | 75.6 | 90.2 | 82.3 | 80.5 | .611 |
| KNN | 20 | k = 10 | 75.8 | 90.3 | 82.5 | 80.8 | .615 |
| KNN | 50 | k = 10 | 75.9 | 90.1 | 82.4 | 80.8 | .615 |
| Rule-Based Classifier | 10 | th = .5, pruned | 76.8 | 84.6 | 80.5 | 79.5 | .591 |
| Rule-Based Classifier | 15 | th = .255, pruned | 77.9 | 81.6 | 79.7 | 79.3 | .585 |
| Rule-Based Classifier | 20 | th = .255, pruned | 77.5 | 83.4 | 80.3 | 79.6 | .592 |
| Rule-Based Classifier | 50 | th = .5, pruned | 76.86 | 84.32 | 80.42 | 79.47 | .5893 |
| Linear SVM | 10 | | 60.1 | 57.5 | 58.8 | 59.7 | .193 |
| Linear SVM | 15 | | 60.1 | 57.56 | 58.8 | 59.68 | .1935 |
| Linear SVM | 20 | | 59.90 | 57.40 | 58.62 | 59.49 | .1898 |
| Linear SVM | 50 | | 60.65 | 57.94 | 59.27 | 60.17 | .2035 |
| Random Forest | 10 | mtry = 5 | 86.6 | 93.7 | 90 | 89.6 | .792 |
| Random Forest | 15 | mtry = 5 | 86.6 | 93.5 | 89.9 | 89.5 | .79 |
| Random Forest | 20 | mtry = 5 | 86.59 | 93.58 | 89.95 | 89.54 | .7909 |
| Random Forest | 50 | mtry = 1 | 61.15 | 58.24 | 59.66 | 60.62 | .2124 |
| Random Forest | 50 | mtry = 3 | 68.68 | 63.51 | 65.99 | 67.27 | .3455 |
| Random Forest | 50 | mtry = 5 | 86.83 | 93.94 | 90.25 | 89.85 | .797 |
| Naive Bayes | 10 | | NA | 0 | NA | 50 | 0 |
| Naive Bayes | 15 | | NA | 0 | NA | 50 | 0 |
| Naive Bayes | 20 | | NA | 0 | NA | 50 | 0 |
| Naive Bayes | 50 | | NA | 0 | NA | 50 | 0 |
| Artificial Neural Network | 10 | size = 8, decay = 0.1 | 61.51 | 58.36 | 59.89 | 60.92 | .2184 |
| Conditional Inference Tree | 10 | mincriterion = .01 | 67.21 | 81.80 | 73.97 | 70.95 | .4190 |
| Conditional Inference Tree | 15 | mincriterion = .01 | 67.35 | 81.89 | 79.91 | 71.0 | .4219 |
| Conditional Inference Tree | 20 | mincriterion = .01 | 66.99 | 82.36 | 73.88 | 70.89 | .4178 |
| Conditional Inference Tree | 50 | mincriterion = .01 | 67.57 | 82.47 | 74.28 | 71.44 | .4289 |
| C4.5 Fit | 10 | C = 0.5, M = 1 | 84 | 88.47 | 86.18 | 85.81 | .7163 |
| C4.5 Fit | 15 | C = 0.5, M = 1 | 84.18 | 89.26 | 86.65 | 86.25 | .7249 |
| C4.5 Fit | 20 | C = 0.5, M = 1 | 84.35 | 88.60 | 86.42 | 86.08 | .7217 |
| C4.5 Fit | 50 | C = 0.5, M = 1 | 84.56 | 89.44 | 86.93 | 86.56 | .7311 |
| Gradient Boosted Decision Tree | 10 | nrounds = 20 | 62.72 | 61.05 | 61.88 | 62.38 | .2477 |
| Gradient Boosted Decision Tree | 15 | nrounds = 20 | 62.78 | 60.01 | 61.36 | 62.21 | .2443 |
| Gradient Boosted Decision Tree | 20 | nrounds = 20 | 63.06 | 60.02 | 61.50 | 62.43 | .2487 |
| Gradient Boosted Decision Tree | 50 | nrounds = 20 | 63.22 | 60.15 | 61.65 | 62.58 | .2516 |

Table 2.1: Model performance of Cross-Validation partitioning

Hold-Out Method, BALANCED (SMOTE)

| Model | Test-Train Split | Parameters | Precision | Recall | F1 | Accuracy | Kappa |
|---|---|---|---|---|---|---|---|
| RPART Decision Tree | 10-90 | | 81.90 | 88.08 | 84.88 | 84.31 | .6861 |
| RPART Decision Tree | 20-80 | | 81.94 | 87.45 | 84.60 | 84.08 | .6817 |
| RPART Decision Tree | 30-70 | | 82.85 | 86.88 | 84.81 | 84.44 | .6889 |
| RPART Decision Tree | 40-60 | | 81.85 | 86.55 | 84.13 | 83.68 | .6735 |
| KNN | 10-90 | k = 1 | 82.82 | 91.46 | 86.93 | 86.25 | .7249 |
| KNN | 20-80 | k = 1 | 82.36 | 92.11 | 86.96 | 86.19 | .7238 |
| KNN | 30-70 | k = 1 | 82.46 | 91.27 | 86.64 | 85.93 | .7186 |
| KNN | 40-60 | k = 1 | 81.39 | 90.90 | 85.88 | 85.06 | .7012 |
| Rule-Based Classifier | 10-90 | th = 0.5 | 86.08 | 64.67 | 73.85 | 77.11 | .5421 |
| Rule-Based Classifier | 20-80 | th = 0.5 | 64.78 | 96.70 | 77.58 | 72.06 | .4412 |
| Rule-Based Classifier | 30-70 | th = 0.5 | 75.28 | 86.75 | 80.61 | 79.13 | .5826 |
| Rule-Based Classifier | 40-60 | th = 0.5 | 71.96 | 79.17 | 75.39 | 74.16 | .4832 |
| SVM | 10-90 | | 59.24 | 55.99 | 57.57 | 58.73 | .1749 |
| SVM | 20-80 | | 59.93 | 57.18 | 58.52 | 59.47 | .1895 |
| SVM | 30-70 | | 60.2 | 57.14 | 58.63 | 59.68 | .1936 |
| SVM | 40-60 | | 59.85 | 57.28 | 58.54 | 59.43 | .1886 |
| Random Forest | 10-90 | mtry = 5 | 83.85 | 91.99 | 87.73 | 87.14 | .7428 |
| Random Forest | 20-80 | mtry = 5 | 83.43 | 92.28 | 87.63 | 86.98 | .7396 |
| Random Forest | 30-70 | mtry = 5 | 83.70 | 91.86 | 87.59 | 86.99 | .7397 |
| Random Forest | 40-60 | mtry = 5 | 83.25 | 91.36 | 87.12 | 86.49 | .7297 |
| Naive Bayes | 10-90 | | NA | 0 | NA | 50 | 0 |
| Naive Bayes | 20-80 | | NA | 0 | NA | 50 | 0 |
| Naive Bayes | 30-70 | | NA | 0 | NA | 50 | 0 |
| Naive Bayes | 40-60 | | NA | 0 | NA | 50 | 0 |
| Artificial Neural Network | 20-80 | size = 10, decay = 0.4 | 60.9 | 58.0 | 59.4 | 60.4 | .208 |
| Conditional Inference Tree | 10-90 | mincriterion = .01 | 67.29 | 80.69 | 73.38 | 70.73 | .4147 |
| Conditional Inference Tree | 20-80 | mincriterion = .01 | 68.27 | 79.71 | 73.55 | 71.33 | .4266 |
| Conditional Inference Tree | 30-70 | mincriterion = .01 | 67.10 | 80.89 | 73.35 | 70.61 | .4123 |
| Conditional Inference Tree | 40-60 | mincriterion = .01 | 64.45 | 76.18 | 69.83 | 67.08 | .3417 |
| C4.5 Fit | 10-90 | C = .3775, M = 1 | 82.63 | 87.88 | 85.17 | 84.70 | .6941 |
| C4.5 Fit | 20-80 | C = 0.5, M = 1 | 83.04 | 85.92 | 84.45 | 84.18 | .6836 |
| C4.5 Fit | 30-70 | C = .3775, M = 1 | 80.35 | 86.70 | 83.40 | 82.75 | .6550 |
| C4.5 Fit | 40-60 | C = .3775, M = 1 | 79.81 | 85.64 | 82.62 | 81.99 | .6399 |
| Gradient Boosted Decision Tree | 10-90 | nrounds = 20 | 62.97 | 62.22 | 62.59 | 62.81 | .2563 |
| Gradient Boosted Decision Tree | 20-80 | nrounds = 20 | 62.48 | 60.36 | 62.09 | 61.404 | .2512 |
| Gradient Boosted Decision Tree | 30-70 | nrounds = 20 | 63.14 | 58.84 | 60.92 | 62.24 | .2449 |
| Gradient Boosted Decision Tree | 40-60 | nrounds = 20 | 62.28 | 60.98 | 61.62 | 62.02 | .2405 |

Table 2.2: Model performance of Hold-Out method partitioning
