During the midterm consultation and class meetings with Dr. Jennifer Llovido, she
The class variable of the credit risk loan eligibility dataset contains a class
imbalance wherein 80% of the class is 0 (Not eligible) while 20% is 1 (Eligible). To
2. Explain the process of how you partitioned the dataset in making the model
The researchers added the details of performing the holdout and cross-validation methods in the paper. The researchers conducted training-testing splits of 90-10, 80-20, 70-30, and 60-40 in the holdout method, and 10-fold, 20-fold, 30-fold, and 50-fold splits in the cross-validation method.
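The two partitioning schemes described above can be sketched in a few lines of pure Python (an illustrative sketch only; the study itself used R's partitioning utilities):

```python
import random

def holdout_split(rows, test_ratio, seed=42):
    """Shuffle the rows and split them once, e.g. test_ratio=0.1 for a 90-10 split."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)

def kfold_indices(n_rows, k):
    """Assign every row index to one of k folds; each fold serves once as the test set."""
    folds = [[] for _ in range(k)]
    for i in range(n_rows):
        folds[i % k].append(i)
    return folds

rows = list(range(100))
train, test = holdout_split(rows, test_ratio=0.2)  # an 80-20 holdout split
folds = kfold_indices(len(rows), k=10)             # 10-fold cross-validation
```

The holdout method evaluates on one fixed test partition, while k-fold cross-validation rotates the test fold so every row is tested exactly once.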
The researchers applied the CRISP-DM methodology in making the case study.
Preparation, Modelling, Evaluation, and Deployment. The researchers did not include the
classification, Linear Support Vector Machine (LSVM), Random Forest, Naïve Bayes
Classifier, Conditional Inference Tree, C4.5, and Gradient Boosted Decision Tree.
5. Apply association rule mining and clustering analysis to the dataset, if applicable
The researchers applied association rule mining to the dataset and found 177,280 rules. The researchers found that 27.145% of the rules have a lift score greater than 1, 58.223% have a lift score equal to 1, and 14.623% have a lift score less than 1.
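The lift score used above can be computed with a small pure-Python helper (the item names below are hypothetical toy values, not items from the actual dataset):

```python
def lift(transactions, antecedent, consequent):
    """Lift = support(A and B) / (support(A) * support(B)).
    > 1 means positive association, 1 means independence, < 1 negative association."""
    n = len(transactions)
    both = sum(1 for t in transactions if antecedent <= t and consequent <= t) / n
    supp_a = sum(1 for t in transactions if antecedent <= t) / n
    supp_b = sum(1 for t in transactions if consequent <= t) / n
    return both / (supp_a * supp_b)

# Hypothetical toy baskets standing in for discretized loan records
baskets = [
    {"low_grade", "not_eligible"},
    {"low_grade", "not_eligible"},
    {"high_grade", "eligible"},
    {"high_grade", "not_eligible"},
]
print(lift(baskets, {"low_grade"}, {"not_eligible"}))  # 4/3: a positive association
```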
The researchers also tried clustering analysis on the dataset but discovered that most clustering algorithms are not quite suitable for it since it contains both numeric and categorical variables. The researchers tried using K-Prototypes clustering from the clustMixType package since that algorithm is applicable to mixed-type data. After implementing the algorithm, the researchers found no significant insights in the results since the algorithm outputs clustering per variable, not for the whole dataset.
ABSTRACT
Credit risk assessment is a critical task for financial institutions that lend money to borrowers. By accurately assessing the risk of their loans, lenders can minimize their losses and make informed decisions about which loan applications to approve. This paper proposes a novel approach to credit risk assessment that
combines traditional credit scoring models with state-of-the-art machine learning algorithms.
The researchers used a Credit Risk Loan Eligibility dataset with features that include the
borrower’s demographic data, financial statements, and credit history. The researchers
discovered the five most important variables by getting the information gain ratio value of each variable, then classified loan eligibility using ten supervised machine learning algorithms, namely: Classification and
Regression Tree (CART), K Nearest Neighbors (kNN), Rule-based classification, Linear Support
Vector Machine (LSVM), Random Forest, Naïve Bayes Classifier, Artificial Neural Network
(ANN), Conditional Inference Tree, C4.5, and Gradient Boosted Decision Tree. The Random
Forest algorithm performed the best, with an accuracy of 89.85% using a cross-validated set of
50-folds. The worst performing algorithm was Naïve Bayes Classifier which had an accuracy of
50%.
keywords – credit risk, binary classification, CART, kNN, Rule-based classification, Linear SVM, Random Forest, Naïve Bayes Classifier, ANN, Conditional Inference Tree, C4.5, Gradient Boosted Decision Tree

INTRODUCTION

History demonstrates that one of the main factors contributing to bank distress is
the concentration of credit risk in asset portfolios. This holds for both specific
marketing is that information to both bank and account holders is asymmetrical. That is,
account holders know more about their investments than their banks do. As future and
present users of these cash reservoirs and banks, it is our duty to gain knowledge on how
managerial and tactical bank credit works to have an in-depth understanding of our
commitments. Credit risk typically refers to the possibility that a lender will not be able to receive the principal and interest owed, which would disrupt cash flows and raise collection costs.
The borrower's general capacity to repay a loan in accordance with its original terms is
used to determine credit risks. When determining the credit risk of a consumer loan,
lenders consider the five Cs: credit history, capacity to repay, capital, the loan’s conditions, and associated collateral.
With this, the study mainly aims to classify a borrower’s credit risk loan data
using supervised machine learning algorithms such as Classification and Regression Tree (CART), K Nearest Neighbors (kNN), Rule-based classification, Linear Support Vector Machine (LSVM), Random Forest, Naïve Bayes Classifier, Artificial Neural Network (ANN), Conditional Inference Tree, C4.5, and Gradient Boosted Decision Tree in determining loan eligibility.
This case study aims to explore the credit risk loan eligibility dataset. Specifically, it aims
to:
1. Identify which variables have the most significant importance in identifying credit risk loan eligibility.
kappa score.
Related Studies
According to Konovalova, Kristovska, and Kudinska, the analysis of the credit risk and rating of borrowers is relevant to all banks involved in
lending to individuals and legal entities. To ensure the effectiveness of credit risk
management in commercial banks, it was deemed necessary to develop the kinds of terms
and conditions for those bank clients who take loans that would both attract potential
borrowers and guarantee loan repayment. However, to develop a separate set of terms and
conditions for every individual borrower, existing and potential bank clients should be
grouped according to their similarities and differences. After that, a separate set of terms
and conditions needs to be worked out for each group in accordance with the
characteristic features of the group members. The classification of bank clients into
distinct groups should proceed according to the method of classification that unites
disparate system elements into homogeneous groups based on the similarities of the
elements in question. This method of classification needs to reflect the structure of the
source data and to ensure the most adequate division of the data into groups.
They took into account the statistics reflecting the bank customers’ violations of
the contract conditions and the damage caused to the bank by each such violation. The
magnitude of the risk as the amount of damage (risk defined as the customer’s failure to
make principal payments on time) can be seen as a regressive dependence on such factors
as the average loan size, the period for which the loan is issued, and a number of other
characteristics of each customer class. Such a model would enable the forecasting of the
Li et al. utilized Logistic Regression (LR) due to its interpretability and efficiency. They showcased a hybrid scoring model that applies Logistic Regression with data augmentation, which obtained a more accurate credit score. Despite LR being widely used, it remains limited in capturing nonlinear relationships. Other methods were therefore investigated, such as Decision Tree (DT) and Support Vector Machine (SVM). They
used Decision Tree and Artificial Neural Network to build the credit model, and results
indicated that it is a successful technology. They proposed a novel credit model using the
clustered support vector machine (CSVM) and demonstrated that CSVM could achieve better classification performance.
However, much of the research was based on traditional machine learning methods and
up until now could not address sequential input data. The use of deep learning techniques
to build a credit risk model has seen significant increases in the reported accuracy on
benchmarking data sets. They compared the classification abilities of machine and deep
learning models, and results show that the tree-based models are more stable than the
models based on multilayer artificial neural networks. Another work used an LSTM to predict default probability on sequential input data: it proposed a dynamic forecasting system to predict the default probability of a company, utilizing a long short-term memory model to incorporate daily news from social media. Babaev et al. combined NLP embedding techniques and RNNs to mine sequential transaction data.
According to Hassani et al., in financial risk management, credit card evaluation plays a vital role for banks. Banks often need to evaluate customer credit to guard against risk and competition, and identifying the influential factors of credit card risk can help them do so. Moreover, feature selection using machine learning algorithms can evaluate these factors with higher accuracy, while traditional techniques require essential assumptions. Real-world data such as financial data are unbalanced, and machine learning algorithms do not classify them well. In their paper, the data is balanced by the SMOTE method, and then the binary firefly algorithm is used for identifying the effective features of the credit cards. Their study also investigates various machine learning algorithms such as KNN, Fuzzy KNN, Random Forest, Decision Tree, and SVM for the effective classification of credit cards.
Another study reviews the existing algorithms for unbalanced data classification, particularly the limitations of standard methods when dealing with unbalanced datasets, and proposes the use of preprocessing before classification. The study also mentions the use of the SMOTE algorithm for unbalanced data classification and its limitations in not taking into account the properties of the data itself.
To address this limitation, the study proposes a new oversampling method that
considers the sample distribution of the minority class. By doing so, the new samples
generated will be more representative and will better describe the minority class, leading
to improved classification accuracy. The proposed oversampling method can help avoid overfitting of the model that can result from the random selection of sample points.
The study can be used as a basis for further research on the optimization of sampling methods for unbalanced data classification. Specifically, the study can be extended to explore other oversampling methods that take
into account the properties of the data itself, as well as undersampling methods that can
be used to reduce the number of samples in the majority class. The findings of this study
can also be applied to other fields that involve the classification of unbalanced datasets.
Mohammad Ali states that the use of statistical techniques for categorizing loan
applications as good or bad has become increasingly important due to the high demand
for credit. It is crucial to use a classification method that can accurately predict loan
profitability. In this study, Mohammad Ali compares the predictive capabilities of three
classification techniques, namely logistic regression, CART, and random forests, using an
80:20 learning:test split on German credit data. He evaluates the performance of each
model by calculating the probability of default for each observation in the test set and
classifying them as good or bad loans based on a threshold. He uses several thresholds to
compare the performance of each classifier on five model suitability statistics: accuracy, sensitivity, specificity, precision, and negative predictive value.
The results show that none of the classifiers performed the best in all five statistics across the probability of default thresholds. CART was the most accurate, precise, and specific at higher thresholds, while random forests had the best negative predictive value.
Mimi Mukherjee and Matloob Khushi noted that real-world datasets often have
imbalanced classes, with some classes having significantly fewer instances than others. Over-sampling techniques such as SMOTE have been proposed to balance the dataset, particularly for continuous features. However, when datasets have both nominal and continuous features, SMOTE-NC is the only over-sampling technique available to balance the data. The authors present a new technique, SMOTE-ENC (Encoded Nominal and Continuous), that encodes nominal features as numeric values to reflect the change of
association with the minority class. The authors show that the classification model using
SMOTE-ENC provides better predictions than the model using SMOTE-NC when the
dataset has a substantial number of nominal features and when there is an association
between categorical features and the target class. Furthermore, SMOTE-ENC can be
applied to both mixed datasets and nominal-only datasets, addressing a major limitation
of SMOTE-NC.
Will Badr recommends splitting the dataset into training and testing sets before balancing the data, to ensure that the test dataset remains unbiased and provides a true evaluation of the model. Balancing the data before splitting may risk introducing bias into the test set, since some synthetically generated data points would be well-known from the training set. He stated that it is important for the test set to be as unbiased as possible. Under-sampling, on the other hand, can remove valuable information and alter the overall dataset distribution, which can impact the quality of the model. Therefore, under-sampling should not be the first approach to consider for imbalanced datasets. He also added that it is important to recognize the
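The ordering Badr recommends can be sketched as follows (a pure-Python illustration, with naive duplication standing in for SMOTE):

```python
import random

def split_then_balance(rows, labels, test_ratio=0.2, seed=1):
    """Split first, then oversample only the training portion, so the test set
    keeps the real-world class distribution."""
    rng = random.Random(seed)
    idx = list(range(len(rows)))
    rng.shuffle(idx)
    n_test = int(len(idx) * test_ratio)
    test_idx, train_idx = idx[:n_test], idx[n_test:]

    train = [(rows[i], labels[i]) for i in train_idx]
    test = [(rows[i], labels[i]) for i in test_idx]  # left untouched

    # Naive duplication oversampling of the minority class (SMOTE stand-in)
    pos = [t for t in train if t[1] == 1]
    neg = [t for t in train if t[1] == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    while len(minority) < len(majority):
        minority.append(rng.choice(minority))
    return majority + minority, test
```

Because balancing happens after the split, no synthetic or duplicated point can leak into the held-out test set.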
The figure below shows the Cross-Industry Standard Process for Data Mining (CRISP-DM) Model. It consists of six phases, each designed to guide the data mining process. The use of the CRISP-DM model helps to ensure that the data mining project is structured and systematic.
A. Business understanding
The credit risk loan eligibility dataset contains information on loan applicants and
their eligibility for credit loans. It contains various information such as their age, income,
employment status, credit score, loan amount, and other relevant asset information.
Banks or lending institutions can use this dataset to make informed decisions to check if
an applicant is eligible for a credit loan. The class variable of the dataset is the
loan_status variable; this is the determining factor of whether the applicant is eligible or not.
B. Data understanding
In order to fully understand the credit risk loan eligibility dataset, the researchers examined the description of each column:

loan_amnt – the listed amount of the loan applied for by the borrower
funded_amnt – the total amount committed to that loan at that point in time
funded_amnt_inv – the total amount committed by investors for that loan at that point in time
emp_title – the job title supplied by the borrower when applying for the loan
zip_code – the first 3 numbers of the zip code provided by the borrower in the loan application
dti – a ratio calculated using the borrower’s total monthly debt payments on the total debt obligations, excluding mortgage and the requested LC loan, divided by the borrower’s self-reported monthly income
open_acc – the number of open credit lines in the borrower’s credit file
total_acc – the total number of credit lines in the borrower’s credit file
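As an illustration of the dti definition above, the ratio can be computed as follows (the borrower figures are hypothetical, not rows from the dataset):

```python
def debt_to_income(monthly_debt_payments, monthly_income):
    """dti: total monthly debt obligations (excluding mortgage and the requested
    loan) divided by the borrower's self-reported monthly income."""
    if monthly_income <= 0:
        raise ValueError("monthly income must be positive")
    return monthly_debt_payments / monthly_income

# A borrower paying $900/month on debts with a $3,000 self-reported monthly income
print(debt_to_income(900, 3000))  # 0.3
```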
The dataset contains missing values that must be handled before training the model.
Missing values can be imputed using various techniques, such as mean-median imputation, or
removed entirely if they are insignificant. It also contains categorical variables such as
employment status, education, and marital status. The data set includes numeric factors such as
age, income, and loan amount, which may have diverse scopes and measurements, impacting the
efficiency of machine learning algorithms. The original data has 63,999 cases and 45 variables.
Loan Status is the target variable. There are 2 levels, 0 being eligible and 1 being not eligible. The latter only forms about 20% of the data. Therefore, the class variable of the original dataset is unbalanced.
The researchers dropped the columns that are irrelevant to the classifier, such as
member_id, zip_code, and addr_state. Columns where the majority of data have missing
values (at about 50%) were also dropped, such as batch_enrolled and emp_title, leaving the researchers with a new data frame with only 31 variables. After this, the researchers transformed the
data types of the columns to be more appropriate for model learning. Ordinal values such
as loan_status (the class variable), grade, verification status, and others are turned into
factors instead of characters. The remaining missing values were handled by dropping the rows that contain them.
Fig. 4 shows that the original distribution of the class variable is unbalanced,
wherein 80% are 0 and 20% are 1. In order to remove the bias towards the negative value
in training the model, the researchers tried using the Synthetic Minority Oversampling
Technique (SMOTE) and the undersampling technique in balancing the class variable.
The researchers decided to choose the balanced dataset produced by using SMOTE (See
Appendix A, Figures 7-8) because it produces a more distributed class variable and does
not reduce the overall data compared to using the undersampling technique.
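The core idea of SMOTE, interpolating new minority points between real neighbours instead of merely duplicating rows, can be sketched in pure Python (illustrative only; the study used an R implementation):

```python
import random

def smote_like(minority, n_new, k=3, seed=0):
    """Generate synthetic minority samples by interpolating between a random
    minority point and one of its k nearest neighbours (the core SMOTE idea)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        a = rng.choice(minority)
        # k nearest neighbours of a by squared Euclidean distance (excluding a)
        neighbours = sorted(
            (p for p in minority if p is not a),
            key=lambda p: sum((x - y) ** 2 for x, y in zip(a, p)),
        )[:k]
        b = rng.choice(neighbours)
        gap = rng.random()  # position along the segment between a and b
        synthetic.append(tuple(x + gap * (y - x) for x, y in zip(a, b)))
    return synthetic

minority = [(1.0, 2.0), (1.2, 2.1), (0.9, 1.8), (1.1, 2.3)]
new_points = smote_like(minority, n_new=4)
```

Each synthetic point lies on a segment between two real minority points, so the minority region grows without exact duplicates.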
Feature Preparation
Feature selection is the method of reducing input variables by using only relevant
data and getting rid of noise in the data. The researchers obtained the univariate feature importance score of each variable in order to select only the relevant features. To get these scores, the researchers tried different methods, namely: chi-square statistics, information gain ratio, and correlation/entropy with best-first search.

For the chi-square test statistic, the researchers first discretized the numeric and integer variables to convert them into factors with 5 levels. After this, the algorithm outputs the top five variables with the highest scores.
Computing the univariate feature importance score using the information gain
ratio, the algorithm outputs the top five variables with the highest gain which are
The cfs method stated that the variables int_rate, initial_list_status, recoveries, and last_week_pay are the most important features. In the black-box feature selection, which used both forward-search and best-first search strategies to find the most important features, the five most important features are loan_amnt, term, int_rate, emp_length, and initial_list_status.

The researchers decided to pick the top 5 features using the information gain ratio method.
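The information gain ratio used for the final selection can be computed as follows (a pure-Python sketch on toy data, not the R implementation the researchers used):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a label list."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(feature_values, labels):
    """Information gain of splitting on the feature, divided by the split's
    intrinsic (split) information, as in C4.5-style attribute ranking."""
    n = len(labels)
    base = entropy(labels)
    cond = 0.0
    split_info = 0.0
    for value in set(feature_values):
        subset = [l for f, l in zip(feature_values, labels) if f == value]
        w = len(subset) / n
        cond += w * entropy(subset)
        split_info -= w * math.log2(w)
    gain = base - cond
    return gain / split_info if split_info > 0 else 0.0

# Toy data: a feature that perfectly separates the class
labels = [0, 0, 1, 1]
feature = ["a", "a", "b", "b"]
print(gain_ratio(feature, labels))  # 1.0
```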
D. Modeling
The researchers created models using ten algorithms: Classification and Regression Tree (CART), K Nearest Neighbors (kNN), Rule-based classification, Linear Support Vector Machine (LSVM), Random Forest, Naïve Bayes Classifier, Artificial Neural Network (ANN), Conditional Inference Tree, C4.5, and Gradient Boosted Decision Tree.
The researchers only used the balanced dataset in making models using the ten
algorithms mentioned above, because a model produced using the unbalanced dataset is biased toward the majority class.
The researchers trained the various models using different testing/training splits
for hold-out validation, namely 10-90, 20-80, 30-70, and 40-60 distributions.
Additionally, the researchers also tried varying fold counts for cross-validation, with 10,
15, 20, and 50 folds, to see if the training and test distribution would have any effect on
the performance of the data mining models. Each algorithm was trained with our data set
at least 16 times, 8 each on the balanced and unbalanced datasets, with the exception of
The CART algorithm is used to build a decision tree that predicts the loan
eligibility status of an applicant based on the given attributes. The algorithm would
recursively split the dataset based on the values of the attributes and minimize the Gini
impurity to determine the best split. The resulting decision tree could be used to classify
new loan applicants based on their attributes. (See Appendix A, Figures 14-17)
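The Gini-minimising split described above can be sketched as follows (illustrative pure Python, not rpart's implementation; the column values are hypothetical):

```python
def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_numeric_split(values, labels):
    """Find the threshold minimising the weighted Gini impurity of the two
    sides, i.e. a CART-style split on one numeric attribute."""
    n = len(labels)
    best = (None, float("inf"))
    for t in sorted(set(values))[:-1]:
        left = [l for v, l in zip(values, labels) if v <= t]
        right = [l for v, l in zip(values, labels) if v > t]
        score = len(left) / n * gini(left) + len(right) / n * gini(right)
        if score < best[1]:
            best = (t, score)
    return best  # (threshold, weighted impurity)

values = [10, 20, 30, 40]  # a hypothetical numeric column
labels = [0, 0, 1, 1]
print(best_numeric_split(values, labels))  # (20, 0.0): a perfect split
```

CART repeats this search over every candidate attribute and recurses on the two resulting partitions.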
Creating a tree with default settings and pre-pruning resulted in a decision tree
that only uses two variables — last_week_pay and term. (See Appendix A, Fig. 5).
Meanwhile, when attempting to build a full tree with all 31 variables and a cp value of 0,
it takes roughly five minutes for the tree to be rendered as a png file. Hence, the
researchers decided the next course of action is to limit the number of columns that will be used in building the tree.
Next, the researchers tried to evaluate the models created, both the partial tree
with pre-pruning and the full tree. First, we created a sample set with only 15 rows and
removed the values for loan status. Then we compared the predictions from the actual
values. The predictions of the default tree with pre-pruning performed terribly, even
though the testing and training data are the same. The model incorrectly classifies all
instances as ‘0’. Meanwhile, the full tree performed significantly better, getting all predictions correct.
The KNN algorithm is used to predict loan eligibility by identifying the k nearest
neighbors of a new loan applicant based on their attributes. The algorithm would
calculate the distance between the new applicant and each existing applicant in the
dataset and select the k nearest neighbors. The majority class of the k nearest neighbors
would then be used to predict the loan eligibility status of the new applicant. (See
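The distance-and-vote procedure can be sketched in pure Python (illustrative, not the caret implementation; the points are hypothetical):

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among the k nearest training points."""
    dists = sorted(
        (math.dist(p, query), label) for p, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

points = [(1, 1), (1, 2), (8, 8), (9, 8), (8, 9)]
labels = [0, 0, 1, 1, 1]
print(knn_predict(points, labels, (8.5, 8.5), k=3))  # 1
```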
Rule-based classification
The training and testing datasets are created by subsetting the original dataset. The model is trained using the “PART” algorithm in order to evaluate the performance of the trained model and provide predictions based on a set of predefined rules. Instead of learning patterns from data,
rule-based classifiers use a set of if-then rules to determine the class or category of a
given instance.
The linear SVM algorithm aims to find the optimal hyperplane by solving an
optimization problem that involves minimizing the classification errors and maximizing
the margin. It finds the best hyperplane that maximally separates the data points of
different classes. The hyperplane is a decision boundary that divides the feature space
into two regions, one for each class. The algorithm constructs a hyperplane in a higher-dimensional
feature space that corresponds to a linear decision boundary in the original feature space.
Random Forest
Random Forest is an ensemble method that combines the predictions of multiple decision trees to make more accurate and robust predictions. A collection of decision trees is built by randomly selecting subsets of the training data, with replacement, to create multiple subsets of the data. Each subset is then used to train an individual decision tree. This process helps to
introduce randomness and diversity into the models, which can reduce overfitting and
improve generalization.
During the construction of each decision tree, at each split, a random subset of
features is considered rather than using all available features. This further adds
randomness and reduces correlation among the trees. By combining the predictions of
multiple trees, the Random Forest algorithm can make more accurate and reliable predictions.
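The bagging-and-voting mechanics can be sketched as follows (an illustrative fragment, not the internals of the R randomForest package):

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement: the bagging step."""
    return [rng.choice(rows) for _ in rows]

def forest_vote(tree_predictions):
    """Majority vote across the individual trees' predictions for one instance."""
    return Counter(tree_predictions).most_common(1)[0][0]

rng = random.Random(7)
data = list(range(10))
sample = bootstrap_sample(data, rng)  # same size as the original, may repeat rows
print(forest_vote([1, 0, 1, 1, 0]))   # 1
```

Each tree trains on its own bootstrap sample, and the forest's final class is the vote over all trees.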
A tuning grid called RFgrid is created. The grid specifies the values to be tested
for the “mtry” parameter, which controls the number of features considered at each split
of the decision trees in the Random Forest. In this case, the “mtry” values range from 1 to 5.

Naïve Bayes Classifier

The Naïve Bayes classifier is a probabilistic model that assumes independence between features. It is often used for classification tasks and works well with categorical data.

Artificial Neural Network

Artificial neural networks are computing systems that process data in a manner resembling the human brain. The package commonly used for artificial neural networks in R is called "nnet." It provides functions and tools for building and training neural networks. A grid of size and decay values, determined by the expand.grid function, is used to fine-tune the neural network model.
Conditional Inference Tree

Conditional inference trees can handle both categorical and continuous predictor variables, as well as handle missing values. The "ctree" method is used for training the conditional inference tree model. The researchers implement conditional inference trees for the classification task, starting by setting a seed value for reproducibility and creating fold indices using the createFolds function. The conditional inference tree model is trained using the train function, where the loan_status variable is predicted.
C4.5
The J48 algorithm in R from the package caret and RWeka is used with the
method set to 'J48'. This constructs decision trees based on the C4.5 algorithm, which
uses attribute selection, recursive splitting, handling of missing values, and pruning to build the classifier.

Gradient Boosted Decision Tree

In gradient boosting, each new tree is built to correct the mistakes made by the previous trees. This algorithm is used by including the packages “xgboost” and “dplyr”, with the method “xgbTree”.
The xgboost model is trained using the train function, with hyperparameters tuned through a tuning grid. The model is then evaluated with a function which computes performance metrics based on the predicted and observed values from the testing datasets.
RESULTS AND DISCUSSION
Various classification algorithms were evaluated using both the holdout method and
cross-validation. Across the 10 classification models trained, we consistently found that the
models whose dataset was split with cross-validation performed slightly better compared to their
hold-out counterparts. Additionally, a high fold count for cross validation yielded marginally
higher accuracy and kappa compared to low fold counts, for the models using cross-validation
for splitting training and testing sets, at the cost of higher computing time (See Appendix B, Table
2.1). Also, several algorithms trained with the cross-validation method (conditional inference trees, gradient boosted decision trees, neural network, SVM, and especially Naive Bayes) had kappa performance lower than 0.5, indicating at most moderate agreement between the predicted labels and the actual labels.
As for the models trained with datasets partitioned using hold-out method, the
test-training splits 10-90 and 30-70 tend to have higher accuracy and kappa. The 40-60 split was
consistently the worst division of the test and train sets. (See Appendix B, Table 2.2).
The researchers observed that Naive Bayes algorithm performed terribly when used as a
classifier for our data set. It had a kappa of 0 and an accuracy of 50% because the Naive Bayes model failed to distinguish between the two classes. The best-performing model experimented with was the random forest classifier, reaching 89.85% accuracy, closely followed by KNN with 89.7% accuracy, both trained using the cross-validated set with 50 folds.

The researchers also tuned each model’s hyperparameters to see how different values affect the model’s performance. For rpart decision trees, the
complexity parameter is best set at zero (See Appendix A, Fig. 18). For KNN, we observed that
the value of k or number of neighbors is best set at 1, because the accuracy and kappa drops with
higher k values. In random forest, the mtry parameter, which controls how much randomness is in the decision tree creation process, is best set at 5 to fully utilize all the available attributes.
For the rule-based classifier, the confidence threshold parameter determines the minimum
confidence required for a rule to be considered valid and used for classification. 0.5 and 0.255
threshold values give the highest accuracy rating. Pruning was also incorporated to improve the
performance by removing rules that do not significantly contribute to the overall accuracy.
The results found in this study will serve as a guide as to how banks or lending
organizations approve future loan applicants. The variables with the most importance will serve
as the top features to look for in an applicant. Also, the algorithm that was most effective in
processing this type of dataset, Random Forest Classifier, will serve as a basis in creating systems that automate the process of classifying the eligibility of an applicant for a loan.
CONCLUSION
In this paper, we studied whether a customer is likely to default on a loan or be able to pay it off, through the five (5) features identified with the highest information gain ratio value. We trained 10 different classification models through both hold-out and cross-validated partition methods and conclude the following:
1. Cross-validation is a better partitioning method for test and train sets to create a
classification model, when compared to the hold-out method. Using different folds for
testing and training noticeably increases the accuracy of classification models, compared
to using only a fixed set and number of testing and training partitions.
2. It is important to make sure the class variable is balanced, otherwise the accuracy metric
for the various classification models would be misleading because of the skewed
distribution. We found the kappa statistic an important evaluation metric for the performance of the classifier, especially when the data is unbalanced, compared to the percentage of cases classified accurately, since kappa compares the observed accuracy with the accuracy expected by chance.
3. The random forest algorithm produces well performing classifiers that use both
categorical and numerical values to infer a predicted class. Even with an unbalanced data
set, it performs relatively better compared to other algorithms that suffer from low kappa scores.
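The kappa comparison described in point 2 can be made concrete with a short pure-Python computation:

```python
from collections import Counter

def cohens_kappa(actual, predicted):
    """kappa = (observed agreement - chance agreement) / (1 - chance agreement)."""
    n = len(actual)
    p_o = sum(a == p for a, p in zip(actual, predicted)) / n
    actual_freq = Counter(actual)
    pred_freq = Counter(predicted)
    p_e = sum(actual_freq[c] * pred_freq.get(c, 0) for c in actual_freq) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Skewed data: predicting the majority class everywhere looks 80% accurate...
actual = [0] * 80 + [1] * 20
predicted = [0] * 100
print(cohens_kappa(actual, predicted))  # 0.0: no agreement beyond chance
```

This is exactly why an 80%-accurate majority-class predictor on the unbalanced dataset is worthless: its kappa is zero.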
This study is open for future revisions and assessments by future researchers tackling the
same topic. The researchers recommend procuring a more detailed and concrete dataset, which produces greater statistical power, less sampling variability, and the ability to discover more patterns and
events in detecting credit loan eligibility. Also, the researchers recommend collecting real-world
and up-to-date data to evaluate the model, to continuously refine its performance based on
real-world events.
Further research on credit risk loan eligibility can advance the field and contribute to more accurate and reliable credit risk assessment.

REFERENCES
Hassani, et al. (June 19, 2021). “Credit Risk Assessment Using Learning Algorithms for Feature Selection”. https://www.tandfonline.com/doi/full/10.1080/16168658.2021.1925021

Konovalova, N., Kristovska, I., & Kudinska, M. (2016). Credit Risk Management in Commercial Banks. Polish Journal of Management Studies, 13(2), 90–100. https://doi.org/10.17512/pjms.2016.13.2.09

Li, et al. (July 2020). “A Two-Stage Dynamic Credit Risk Assessment System”. https://dl.acm.org/doi/10.1145/3417188.3417193

Sulin, et al. (2001). “The credit-risk decision mechanism on fixed loan interest rate with imperfect information”. https://ieeexplore.ieee.org/document/6077383

The Investopedia Team (March 15, 2022). “Credit Risk: Definition, Role of Ratings, and Examples”. https://www.investopedia.com/terms/c/creditrisk.asp
List of Figures
Figure 10: Univariate feature importance score of variables using chi-square tests
Figure 11: Univariate feature importance score using gain ratio
Figure 20: Summary statistics of the various models, tested through Cross-validation
Figure 21: Model comparison of Hold-out results
Figure 22: Summary statistics of the various models, tested through Hold-out
Figure 23: Lift value per rule
Appendix B
List of Tables
Naive Bayes, cross-validation:
10 folds – NA, 0, NA, 50, 0
15 folds – NA, 0, NA, 50, 0
20 folds – NA, 0, NA, 50, 0
50 folds – NA, 0, NA, 50, 0

Artificial Neural Networks, cross-validation:
10 folds (Size = 8, Decay = 0.1) – 61.51, 58.36, 59.89, 60.92, .2184

Naive Bayes, hold-out:
10-90 – NA, 0, NA, 0.5, 0
20-80 – NA, 0, NA, 0.5, 0
30-70 – NA, 0, NA, 0.5, 0
40-60 – NA, 0, NA, 0.5, 0