A comparison of decision trees and random forests for loan approval prediction


Yousra Iman1,2* and Dr Sobia Khalid (Supervisor)2,3
1* Software Engineering, Fatima Jinnah Women University, Rawalpindi, Pakistan.

*Corresponding author. E-mail: yousraiman22@gmail.com

Abstract
Machine learning models that predict loan approval are critical to the banking and
finance industries. Artificial intelligence (AI) techniques have revolutionized credit
risk evaluation, allowing for more accurate and efficient decision-making. This study
develops Random Forest and Decision Tree models for loan default prediction based on
historical data. The methodology covers data preprocessing, careful model selection,
and evaluation using metrics such as accuracy and precision. We assess how well these
machine learning methods, specifically decision trees and random forests, identify
potential loan defaults. According to the results, the random forest algorithm
outperforms the decision tree, with an accuracy of 0.80 compared to 0.75. This
suggests that random forests may be more accurate than decision trees in forecasting
loan defaults, which could give financial institutions valuable information to
improve their evaluation of loan risk.

Keywords: Loan approval, Machine Learning, AI, Credit Risk, Random Forest,
Decision Trees.

1 Introduction
Individuals around the globe depend on banks in one way or another to provide loans
for a range of uses, including overcoming financial obstacles and accomplishing
personal goals. The ever-changing financial environment and increasing industry
competition have made taking out a loan a necessity. Moreover, banking institutions of
all sizes depend on loan distribution to make a profit, which they need to run their
operations and to sustain themselves through difficult times. Loans are the main
source of income for the banking sector and also carry its highest level of financial
risk. The interest a bank receives on its loans accounts for the majority of its income.
Although lending money is frequently beneficial for both parties, it carries
important risks. These risks, commonly referred to as “credit risk,” arise from a
borrower’s failure to repay the loan by the deadline agreed upon by the lender and
the borrower [1]. It is therefore vital to ascertain whether a client is creditworthy
before authorizing a loan. The “5C principle,” which refers to Character, Capital,
Capacity, Collateral, and Conditions, is a tool used by banking authorities to assess
borrowers in traditional lending procedures [2]. This evaluation rests mainly on the
assessor’s personal experience and understanding of the customer, and the strategy
comes with many limitations. Even after a rigorous verification and validation
process, banks and major financial institutions still cannot guarantee that the
selected applicant will be able to repay the loan on time.
In the past, banks hired highly skilled individuals whose sole duty was to evaluate
loan candidates and, after a thorough investigation, decide whether they were
eligible. A candidate’s “credit score,” a numerical value, dictated whether the loan
would be approved or denied. The credit score helps authorities judge whether the
borrower will repay the loan by the deadline, based on the borrower’s payment
history, credit report, and background [3]. The loan approval process therefore
needed experts and statistical analysis to determine an applicant’s reliability
precisely. More recently, academics and banking authorities have chosen to train
classifiers based on various machine learning and deep learning algorithms to predict
an applicant’s credit score automatically from their credit history and other
historical data. This has made the process of selecting eligible applicants before
approving a loan much simpler.
In light of these circumstances, the aim of this paper is to examine the use of
machine learning models in the loan provision process and to determine an approach
that helps banks consistently identify applicants who will repay their loans, at
significantly lower credit risk. We constructed the model using Random Forest and
Decision Tree classifiers. Each is applied separately to examine the dataset, look
for trends, and draw inferences. Based on that analysis, we estimate the likelihood
that a new applicant will default on a loan.

2 Literature Review
According to the authors, prediction starts with data processing and cleaning,
missing-value replacement, modeling, and experimental analysis of the data set, and
then moves on to model evaluation and testing [1]. Another study undertakes a thorough
analysis and synthesis of the body of research, with a particular emphasis on deep
learning applications in the banking industry. One of its strongest points is that it
offers a thorough and current summary of deep learning applications, making it an
invaluable resource for researchers and banks looking to use these technologies. Its
potential weaknesses include biases in the reviewed literature, the challenge of
generalizing findings across different banking environments, and the difficulty of
evaluating the practical efficacy of deep learning applications. By examining and
synthesizing the various uses of deep learning in banking, that paper seeks to close
a gap in the literature and provide guidance for future research and implementation
strategies in this quickly changing technological environment [2].
To predict loan approvals from historical candidate data, the next methodology uses
several classification algorithms, including logistic regression, a random forest
classifier, and a support vector machine classifier. The main goal is to use this
historical data to train machine learning models that decide whether or not to
approve a loan for a new applicant. Its strengths include helping banks optimize loan
approval procedures and improve loan recovery by using machine learning to analyze
historical patterns for precise predictions. Its drawbacks could include possible
algorithmic biases, limited interpretability of intricate models such as random
forests, and the need for a well-balanced dataset to yield accurate predictions. The
goal of the paper is to enhance loan decision-making in the banking industry by
using machine learning models trained on past data [3].
Another methodology focuses on using AI/ML technologies to improve the accuracy of
loan defaulter prediction. Specifically, it leverages data science techniques such as
Logistic Regression, Support Vector Machines (SVM), Neural Networks, and Random
Forest for predictive analytics, which makes it possible to forecast credit scores
for loan disbursements. Applying several machine learning techniques to improve
precision is a strength that enables financial institutions to optimize the credit
process and lower the risk of loan defaults. Drawbacks could include the difficulty
of interpreting more intricate models such as Neural Networks and the limited ability
of certain algorithms to handle imbalanced datasets. Although implementation
challenges and adaptability across diverse markets may pose limitations, the paper
highlights the transformative potential of such technologies for global financial
institutions while exploring a high-level loan origination process and an alternative
credit scoring model [4].
Applying Machine Learning (ML) algorithms to a historical loan dataset, the next
methodology looks for patterns that can be used to predict future loan defaulters.
The analysis uses historical customer data, including age, income, loan amount, and
length of employment. To identify influential features and forecast loan defaults,
several machine learning algorithms are used, including Random Forest, Support
Vector Machine, K-Nearest Neighbor, and Logistic Regression. When these algorithms
are compared using common metrics, Random Forest proves to be the most accurate.
Random Forest’s strength is its capacity to manage high-dimensional data, minimize
overfitting, and assign feature importance. Nevertheless, because of its ensemble
nature, it may be less interpretable and potentially computationally inefficient for
larger datasets [5].
To address loan approval in the banking industry, another approach creates a
real-time binary classification model based on a deep neural network. The proposed
model has several strengths, such as its emphasis on real-time processing and its use
of deep learning to improve accuracy, precision, and recall when categorizing loan
applicants as good or bad risks. Potential drawbacks include the difficulty of
integrating real-time systems into banking infrastructure, the need for reliable and
current data for real-time forecasts, and possible ethical concerns about the use of
personal information. Compared with traditional binary classifiers, this real-time
approach to loan approval demonstrates improved performance and emphasizes the
significance of prompt decision-making in banking processes [6].
A further study evaluates borrower characteristics to help banking authorities choose
qualified borrowers for loans. It compares the Random Forest and Decision Tree
algorithms for loan prediction on the same dataset. The Random Forest model
outperforms Decision Trees, achieving an accuracy of 80% compared to 73%. Its
thorough comparative analysis and its practical value for evaluating banking risk are
among its strong points. Its limitations may stem from the small number of loan
default cases in the dataset and the possible misidentification of non-defaulters as
defaulters. Updating the dataset and addressing misclassification could help future
research achieve better model precision and more accurate predictions [7].
By randomly selecting data and features, the Random Forest ensemble learning
technique creates a large number of decision trees. It reduces overfitting and
improves accuracy by combining the predictions of these trees to produce the final
output. Its main advantages are resilience to outliers, the ability to handle large
datasets with high dimensionality, and robustness against overfitting. It provides
information about the significance of features, manages missing data well, and adapts
easily to both regression and classification tasks. Nevertheless, Random Forest has
drawbacks: prediction can be slow because many trees must be evaluated, it may
capture complex relationships in data less well than more complex models, and a
larger number of trees reduces interpretability. Without proper tuning, it can also
overfit noisy datasets [8].

3 Methodology and Dataset


3.1 Dataset
The dataset, obtained from UCI, consists of 1000 records in 9 columns and offers
insight into loan approval prediction. Its author gathered information from several
banks and modified it in accordance with the specifications. The primary attributes
are the applicants’ age, occupation, residence, checking and savings accounts, credit
limit, and purpose. This data helps in understanding the dynamics of loan approval
decisions, the effects of income, work, and education on loan eligibility, and the
possibility of building predictive models to improve the loan approval process [9].

3.2 Methodology
Our model development followed these stages:
• Information Gathering
• Preprocessing of Data
• Data Cleaning
• Model Selection
• Model Assessment and Scoring
Having the appropriate data in place was essential to developing the model. For this
reason we obtained the data from UCI, one of the most reputable sources of data [9].
The selected data set was divided into training and test sets in a 70:30 ratio.
Preparing the data was a crucial next step in creating the AI model: preprocessing
converts unprocessed data into a format that is easier to understand, more practical,
and more effective to work with. Our dataset includes information gathered from
social media platforms and online sources in addition to historical customer data
from the financial institution.
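The 70:30 division described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the tooling used in the study; the function name, the fixed seed, and the integer row ids standing in for the 1000 records are assumptions made for the example.

```python
import random

def split_70_30(records, seed=0):
    """Shuffle a copy of `records` and return (train, test) in a 70:30 ratio."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    cut = len(shuffled) * 7 // 10          # integer arithmetic avoids float rounding
    return shuffled[:cut], shuffled[cut:]

# Hypothetical stand-in for the 1000-record dataset: one integer id per row.
rows = list(range(1000))
train, test = split_70_30(rows)
print(len(train), len(test))
```

Shuffling before cutting keeps any ordering in the source file from leaking into the test set.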
The main purpose of data cleaning is to replace or remove missing values from the
data set. After that, we carried out exploratory data analysis, using statistical
concepts such as the probability density function and the normal distribution to
examine the dependent and independent variables and to find anomalies and outliers.
To evaluate credit risk, we used the Random Forest and Decision Tree classification
techniques.
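As an illustration of replacing missing values, the sketch below fills gaps with the median for numeric columns and the most frequent value for categorical ones. The text does not state which replacement rule was used, so this choice is an assumption, and the column names and values are hypothetical.

```python
from statistics import median, mode

def impute_column(values):
    """Replace None entries with the median (numeric) or mode (categorical)
    of the observed values; a column with no observed values is returned as-is."""
    observed = [v for v in values if v is not None]
    if not observed:
        return list(values)
    numeric = all(isinstance(v, (int, float)) for v in observed)
    fill = median(observed) if numeric else mode(observed)
    return [fill if v is None else v for v in values]

# Hypothetical columns with one missing entry each.
ages = [25, None, 40, 31]
savings = ["little", None, "little", "rich"]
filled_ages = impute_column(ages)
filled_savings = impute_column(savings)
print(filled_ages, filled_savings)
```

Removing rows with missing values is the alternative mentioned in the text; imputation keeps all 1000 records at the cost of introducing estimated values.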
The Score Model and Evaluate Model steps are used to examine precision and accuracy.
The model was selected with consideration for the benefits of the tools and methods
as well as our primary goal of creating a model that is impartial and equitable upon
assessment. The important metrics we monitored were:
• Confusion matrix
• Accuracy
• Precision
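Accuracy and precision both follow from the four confusion-matrix counts: accuracy is (TP + TN) / (TP + FP + TN + FN) and precision is TP / (TP + FP). The sketch below computes them on a small hypothetical set of labels, where 1 is taken to mean the positive class; the labels are made up for illustration and are not the study's results.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, tn, fn) for a binary classification task."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn

def accuracy(tp, fp, tn, fn):
    """Share of all predictions that are correct."""
    return (tp + tn) / (tp + fp + tn + fn)

def precision(tp, fp):
    """Share of positive predictions that are correct (0.0 if none were made)."""
    return tp / (tp + fp) if (tp + fp) else 0.0

# Hypothetical ground truth and predictions for ten applicants.
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 1]
tp, fp, tn, fn = confusion_counts(y_true, y_pred)
acc = accuracy(tp, fp, tn, fn)
prec = precision(tp, fp)
print(tp, fp, tn, fn, acc, round(prec, 3))
```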
For the AI model, an accuracy score of 0.80 is considered good. Loans are approved
for candidates with a high overall score; candidates with a medium score and a higher
risk score pay a higher interest rate on the authorized loan amount. Our model aimed
to offer insights that would enable the greatest proportion of the populace to obtain
loans from financial institutions.

3.3 System Architecture

3.4 Equations
3.4.1 Decision Tree
Decision trees are a flexible algorithm that can be applied to both regression and
classification tasks. They are among the most widely used classification algorithms
and consist of a root node, multiple branches, and leaf nodes. Using a Recursive
Partitioning Algorithm (RPA) to classify the instances, the algorithm creates a
structure resembling a tree [10]. Each internal node represents a test on an
attribute, each branch represents an outcome of that test, and each leaf node
represents a class label.
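One step of recursive partitioning can be sketched as choosing the attribute test that best separates the classes. The example below scores candidate thresholds on a single numeric attribute with Gini impurity, a common splitting criterion; the text does not say which criterion was used, so Gini is an assumption here, and the attribute values and labels are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means the node is pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(values, labels):
    """Return (threshold, weighted_gini) for the best `value <= threshold` test.
    The threshold is None if no test improves on the unsplit node."""
    n = len(labels)
    best = (None, gini(labels))
    for threshold in sorted(set(values))[:-1]:  # the largest value cannot split
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (threshold, score)
    return best

# Hypothetical attribute values and class labels for six applicants.
credit_amount = [500, 1200, 3000, 4500, 8000, 9500]
outcome = ["good", "good", "good", "bad", "bad", "bad"]
threshold, impurity = best_split(credit_amount, outcome)
print(threshold, impurity)
```

A full tree would apply the same search to each resulting subset in turn, stopping when a node is pure or too small to split.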

3.4.2 Random Forest
Random Forest is the trademarked name for an ensemble of decision trees. The group of
decision trees in a Random Forest is referred to as the “forest.” When a new object
is classified according to its attributes, each tree provides a classification,
which counts as a “vote” for that class, and the forest selects the classification
with the most votes. Each tree is grown as follows. If the training dataset contains
P cases, a sample of P cases is drawn at random with replacement; this sample serves
as the tree’s training dataset. If there are N input features, a number n << N is
specified so that, at each node, n features are chosen at random from the N, and the
node is split using the best split on these n features. The value of n is kept
constant throughout the forest’s growth.
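The three random-forest ingredients described above (a bootstrap sample of the cases, a random subset of features at each node, and a majority vote across trees) can be sketched in a few lines. This is an illustrative outline rather than a full forest implementation; the feature names, the seed, and the vote labels are hypothetical.

```python
import random
from collections import Counter

def bootstrap_sample(cases, rng):
    """Draw len(cases) cases at random with replacement."""
    return [rng.choice(cases) for _ in cases]

def random_feature_subset(feature_names, n, rng):
    """Pick n distinct features to consider when splitting a node."""
    return rng.sample(feature_names, n)

def majority_vote(votes):
    """Return the class that received the most votes from the trees."""
    return Counter(votes).most_common(1)[0][0]

rng = random.Random(42)  # seeded for reproducibility
cases = list(range(10))  # hypothetical case ids
features = ["age", "job", "housing", "savings",
            "checking", "credit_amount", "purpose"]  # hypothetical names

sample = bootstrap_sample(cases, rng)
subset = random_feature_subset(features, 3, rng)
decision = majority_vote(["approve", "reject", "approve"])
print(sample, subset, decision)
```

Because sampling is with replacement, some cases usually appear more than once in a tree's training set while others are left out.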

Random Forest offers several benefits over other machine learning algorithms,
including:
• Defense against overfitting.
• Accurate classification or regression.
• Greater effectiveness when working with big databases.

4 Results
In this paper, we used two machine learning algorithms, Random Forest and Decision
Trees, to develop a model for evaluating credit risk and predicting loan outcomes.
The classification report, confusion matrix, and accuracy of each model are shown
below.

4.1 Decision Tree


Our accuracy score from the Decision Tree classifier was 75%.

Fig. 1 Classification Report for Decision Tree.

Fig. 2 Confusion Matrix of Decision Tree.

4.2 Random Forest


The accuracy score provided by the Random Forest Classifier was 80%.

Fig. 3 Classification Report for Random Forest.

Fig. 4 Confusion Matrix of Random Forest.

It is evident from the confusion matrices and classification reports above for both
models that, for loan prediction on the provided dataset, the Random Forest algorithm
outperforms Decision Trees.

4.3 Graphs
Plotting a correlation between the features helped us determine whether any additional
changes were needed to improve the model’s performance.

Fig. 5 Correlation heatmap between all the features.

The scatterplots below show the distributions of the data points and how each
algorithm handled the features of the dataset, providing information about the
predictive behavior of the two algorithms.

Fig. 6 Scatter plot between all the features.

4.4 Comparative Study


In this comparative analysis of multiple research papers, we evaluated the precision
and accuracy metrics of decision tree and random forest algorithms. This analysis
showed that, in comparison to the decision tree method, the random forest algorithm
consistently demonstrated higher accuracy rates and comparable precision across a
number of studies.

Author               Citation   Accuracy   Precision
Lai et al.           [11]       0.50       Nil
Madaan et al.        [7]        0.73       0.82
Tumuluru et al.      [5]        0.81       0.75
Ndayisenga et al.    [13]       0.75       0.73
Yousra Iman          Nil        0.80       0.70

5 Conclusion
The study concludes by highlighting the effectiveness of machine learning models,
Random Forest and Decision Trees in particular, for loan approval prediction. Based
on historical data, the comparative analysis demonstrated the Random Forest
algorithm’s superior performance over Decision Trees in accurately predicting loan
defaults. This study shows how AI-driven approaches can improve the credit risk
assessment procedures used by banks. The results highlight the importance of using
modern computational methods to predict loan approval accurately and consistently,
which helps financial institutions make better decisions and reduce their exposure
to credit risk.
Moreover, the strong performance of the random forest model in loan default
prediction indicates its practicality, opening the door to improved risk management
practices in the banking sector. Financial institutions looking to improve the
efficiency of their loan approval procedures and reduce credit risk can benefit from
these insights. As machine learning advances, algorithms like random forests can aid
the improvement of credit evaluation systems, leading eventually to more stable and
dependable loan approval practices.

References
[1] Ghildiyal, G.S..R.V. B.: Analyze of different algorithms of machine learning for
loan approval. In Smart Trends in Computing and Communications (pp. 719-727).
Springer Singapore. (2022)

[2] Hassani, H.X.S.E..G.M. H.: Deep learning and implementations in banking.
Annals of Data Science 7, 433–446 (2020)

[3] Singh, Y.A.A.R..P.G.N. V.: Prediction of modernized loan approval system based
on machine learning approach. In 2021 International Conference on Intelligent
Technologies (CONIT) (pp. 1-4). IEEE. (2021)

[4] Darapaneni, K.A.D.A.S.M.S.S..P.A.R. N.: Loan Prediction Software for Financial
Institutions. In 2022 Interdisciplinary Research in Technology and Management
(IRTM) (pp. 1-8). IEEE. (2022)

[5] Tumuluru, B.L.R.L.M.B.S.C.H.M.H..S.N. P.: Comparative analysis of customer
loan approval prediction using machine learning algorithms. In 2022 Second
International Conference on Artificial Intelligence and Smart Energy (ICAIS)
(pp. 349-353). IEEE. (2022)

[6] Abakarim, L.M..A.A. Y. (ed.): Towards an Efficient Real-time Approach to Loan
Credit Approval Using Deep Learning. In 2018 9th International Symposium on
Signal, Image, Video and Communications (ISIVC) (pp. 306-313). IEEE. (2018)

[7] Madaan, K.A.K.C.J.R..N.P. M.: Loan default prediction using decision trees and
random forest: A comparative study. In IOP Conference Series: Materials Science
and Engineering (Vol. 1022, No. 1, p. 012042). IOP Publishing. (2021)

[8] Liu, W.Y..Z.J.. Y.: New Machine Learning Algorithm: Random Forest. In
Information Computing and Applications:

[9] LEARNING, U.M.: Database Contents License (DbCL) V1.0

[10] Bhargav, .S.K.. P.: A Machine Learning Method for Predicting Loan Approval by
Comparing the Random Forest and Decision Tree Algorithms. Journal of Survey
in Fisheries Sciences, 10(1S), 1803-1813. (2023)

[11] Lai, A. L.: Loan default prediction with machine learning techniques.
In 2020 International Conference on Computer Communication and Network
Security (CCNS) (pp. 5-9). IEEE. (2020)

[12] Sheikh, G.A.K..K.T. M. A.: An approach for prediction of loan approval using
machine learning algorithm. In 2020 International Conference on Electronics and
Sustainable Communication Systems (ICESC) (pp. 490-494). IEEE. (2020)

[13] Ndayisenga, T..: Bank loan approval prediction using machine learning techniques
(Doctoral dissertation).
