
Comparative Analysis of Loan Prediction Models

A PROJECT REPORT
SUBMITTED IN PARTIAL FULFILLMENT OF THE
REQUIREMENTS FOR THE AWARD OF THE DEGREE
OF
BACHELOR OF TECHNOLOGY
IN
SOFTWARE ENGINEERING
Submitted by:
DIVIJ GERA (2K19/SE/039)
KARAN BAJAJ (2K19/SE/060)
Under the supervision of

Dr. Abhilasha

SOFTWARE ENGINEERING
DELHI TECHNOLOGICAL UNIVERSITY
(Formerly Delhi College of Engineering)
Bawana Road, Delhi-110042


CANDIDATE’S DECLARATION

We, Divij Gera (2K19/SE/039) and Karan Bajaj (2K19/SE/060), students of B.Tech.
(Software Engineering), hereby declare that the project Dissertation titled “Comparative
analysis of Loan Prediction Models” which is submitted by us to the Department of Software
Engineering, Delhi Technological University, Delhi in partial fulfillment of the requirement
for the award of the degree of Bachelor of Technology, is original and not copied from any
source without proper citation. This work has not previously formed the basis for the award
of any Degree, Diploma, Associateship, Fellowship or other similar title or recognition.

Place: Delhi

Date: 09 December, 2022

Divij Gera                    Karan Bajaj


CERTIFICATE

I hereby certify that the Project Dissertation titled “Comparative analysis of Loan Prediction
Models” which is submitted by Divij Gera (2K19/SE/039) and Karan Bajaj (2K19/SE/060),
Software Engineering, Delhi Technological University, Delhi in partial fulfillment of the
requirement for the award of the degree of Bachelor of Technology, is a record of the project
work carried out by the students under my supervision. To the best of my knowledge this
work has not been submitted in part or full for any Degree or Diploma to this University or
elsewhere.

Dr. Abhilasha
SUPERVISOR
Place: Delhi
Date: 15 December, 2022


ACKNOWLEDGEMENT

Presentation, inspiration and motivation have always played a key role in the success of any
venture in life.

We express our sincere thanks to Prof. J.P. Saini, Vice Chancellor, Delhi Technological
University, Delhi.
We express our deepest gratitude to Prof. O.P. Verma, HOD Electronics and
Communication, Delhi Technological University, Delhi, for encouraging us to aim for the
highest peak and for providing us the opportunity to prepare this project. We are immensely
obliged to our friends for their elevating inspiration, encouraging guidance and kind support
in the completion of our project.
We acknowledge our indebtedness and deep sense of gratitude to our teachers Dr.
Manjeet Kumar and Ms. Lavi Tanwar, whose valuable guidance and kind supervision
throughout the course shaped the present work.
Last but not least, our parents are also an important inspiration for us. So, with due
regards, we express our gratitude to them.


ABSTRACT

The major source of profit for banks is the money they earn from giving loans. Although a lot of
people apply for loans, it is increasingly hard to select a genuine applicant who will repay the
loan in its entirety and on time. When applications are processed manually, misjudgements and
human error may occur. Therefore, loan prediction systems have been developed using
various machine learning techniques, so that the system automatically selects the eligible candidates.
This helps both bank staff and applicants, and drastically reduces the time taken to sanction a
loan. In this report, we consider five such research papers and compare the
performance of their machine learning algorithms to understand their practical utility in this
application and to determine the most effective algorithm.


CONTENTS

Candidate’s Declaration
Certificate
Acknowledgement
Abstract
Contents
List of Figures
List of Tables
Chapter 1: Introduction
Chapter 2: Literature Review
Chapter 3: Comparative Analysis
Chapter 4: Results
Chapter 5: Conclusion
Chapter 6: Future Scope
References


LIST OF FIGURES

1. Figure 1. Decision Tree


2. Figure 2. Random Forest
3. Figure 3. Support Vector Machines
4. Figure 4. K-Nearest Neighbours
5. Figure 5. Logistic Regression
6. Figure 6. Feed-forward Neural Networks
7. Figure 7. Proposed future system pipeline


LIST OF TABLES

1. Table 1. Description of Dataset 1


2. Table 2. Description of Dataset 2
3. Table 3. Comparison of the two datasets
4. Table 4. Comparison of models


Chapter 1: Introduction

What is loan prediction?

Loan prediction is a classification problem in which we predict whether a loan will be approved or
not. In problems of this kind, we predict a discrete value based on a given set of
independent variable(s). These variables may include information such as the applicant's income,
credit history, sex, marital status, etc.

What is a loan prediction system?

A loan prediction system provides an interface through which an applicant's loan application is
assessed for approval. Applicants provide the system with their personal information
and, based on this information, the system reports whether a loan can be granted.

A prediction model uses data mining, statistics, and probability to forecast an outcome. Every
model has some variables, known as predictors, that are likely to influence future results. Data
is collected from various sources before a statistical model is built. The model may be a simple
linear equation or a sophisticated neural network built using complex software. As more data
becomes available, the model becomes more refined and its error decreases, so that it can
predict with less risk and in less time. The prediction model
helps banks by minimizing the risk associated with the loan approval process and helps
applicants by decreasing the time taken by the process.

Why is loan prediction necessary?

The cost of assets is increasing day by day, and the capital required to purchase an asset outright
is very high, so purchasing it out of one's savings is often not possible. The easiest way to obtain
the required funds is to apply for a loan. However, taking a loan is a very time-consuming process:
the application must go through many stages, and approval is still not guaranteed.
To decrease the approval time and the risk associated with the loan, many loan
prediction models have been introduced.


Chapter 2: Literature Review

Pandey et al.

The authors, Nitesh Pandey, Ramanand Gupta, Sagar Uniyal and Vishal Kumar [1], use multiple
machine learning algorithms (Random Forest, Support Vector Machine, Decision Tree, Logistic
Regression) in their paper and show how these approaches can be used in real-world loan
approval problems. Their paper uses a multi-variable dataset to predict whether the loan should be
given or not. The pre-processing techniques used in the paper include
imputation, binning and one-hot encoding. The paper concludes that each of these algorithms
obtained a precision rate between 70% and 80%, with the Support Vector Machine model
producing better results than the other models.

Shinde et al.

The authors, Anant Shinde, Yash Patil, Ishan Kotian, Abhinav Shinde and Reshma Gulwani [2],
use a single machine learning model, Logistic Regression with stratified k-fold cross-validation,
and show how this approach can be used in real-world loan approval
problems. Their paper uses a multi-variable dataset to predict whether the loan should be given or
not. The preprocessing used in the paper consists of basic data cleaning.
To create the logistic classification model that predicts loan status, over 350 samples were
collected and evaluated. The model obtained a maximum accuracy of about 82 percent.

Pramod et al.

The authors, Ms. Kathe Rutika Pramod, Ms. Panhale Sakshi Dattatray, Ms. Avhad Pooja Prakash,
Ms. Dapse Punam Laxman and Mr. Ghorpade Dinesh B. [3], use a single machine learning model,
the Decision Tree algorithm, and demonstrate how this approach can be
used in real-world loan approval problems. Their paper uses a multi-variable dataset to predict
whether the loan should be given or not. The preprocessing in the paper focuses on handling
missing data: the analytical process starts with data cleaning and
processing, followed by missing-value imputation with the mice package.
The paper reports a best accuracy of 0.811 on the public test set using the Decision
Tree algorithm.


Hassan et al.

The authors, Amira Kamil Ibrahim Hassan and Ajith Abraham [4], construct a loan default prediction
model by using three different algorithms to train a supervised two-layer feed-forward network.
Before training, two attribute filtering functions were applied, resulting in
two datasets with reduced attributes in addition to the original dataset. A real-world German bank
credit application dataset consisting of 20 attributes (7 numerical, 13 categorical) was used.
A two-stage experiment was designed. In the first stage, two attribute filtering functions
(PLsFilter and ConsistencySubsetEval) were applied to the dataset, resulting in three different
datasets: the original dataset with 24 attributes, the second with 20 attributes and the third with 9
attributes.
In the second stage of the experiment, a supervised two-layer feed-forward network with sigmoid
hidden neurons and output neurons was used. The architecture was finalized with 25 neurons. The
input layer has 24, 20 or 9 neurons, depending on the dataset, and the output layer has 1 neuron.
The network was trained using the SCG, OSS and LM algorithms. A default split of 60% of the data
for training, 20% for testing and the remaining 20% for validation was used over 10000 epochs.
The results suggested an imbalance between the accuracy for defaulters and that for
non-defaulters; however, the study opens up many possibilities for the use of neural networks in
this application.
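
The 60/20/20 split described above can be sketched as follows. This is only an illustration, not the procedure of Hassan et al.: the function name `split_indices`, the shuffling step and the fixed seed are our own assumptions, and the 1000 instances correspond to the size reported for this dataset.

```python
import random

def split_indices(n, train=0.6, test=0.2, seed=0):
    """Shuffle range(n) and cut it into train / test / validation index lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)   # fixed seed: assumption, for reproducibility
    n_train = round(n * train)
    n_test = round(n * test)
    return (indices[:n_train],
            indices[n_train:n_train + n_test],
            indices[n_train + n_test:])     # whatever remains is used for validation

train_idx, test_idx, val_idx = split_indices(1000)
```

Every instance lands in exactly one of the three parts, so the evaluation data is never seen during training.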

Bhanu et al.

The authors, L. Udaya Bhanu and Dr. S. Narayana [5], use five machine learning models (Decision
Tree, Random Forest, KNN, SVM and Logistic Regression) and show
how these approaches can be used in real-world loan approval problems. A dataset from Kaggle
was used, which consists of variables such as sex, marital status, education, self-employed,
loan status, applicant income, coapplicant income, etc. Because the dataset may contain
missing values and noisy data, it had to be cleaned before use. A data cleaning method
was therefore applied to handle null values, along with MinMaxScaler. The paper concludes that
regression tree algorithms show the best performance among the five algorithms chosen.

Chapter 3: Comparative Analysis

Chapter 3.1 Datasets

The first step in any machine learning problem is the selection of a suitable dataset. The research
papers involved in this comparative analysis used multiple datasets consisting of various
quantitative and qualitative parameters. These datasets may be described as follows:

Pandey et al., Shinde et al., Pramod et al. and Bhanu et al. all use a pre-existing online dataset,
the Loan Prediction Dataset from Kaggle, to perform the classification task. The
parameters used for this purpose are: Sex, Marital Status, Education, Number of Dependents,
Salary, Loan Amount, Credit History and Location. According to the in-paper analysis, most of
the parameters in the dataset were evenly distributed and followed a normal distribution. The
dataset consists of 367 unique rows and is hence suitable only for models that
can work with smaller amounts of data, i.e. neural networks and deep learning models would not
be expected to perform well on this dataset.

Table 1. Description of Dataset 1


Hassan et al. use a real-world German bank credit application dataset, which consists of 20
attributes (7 numerical, 13 categorical). The categorical attributes are coded to form 24 attributes.
The number of instances is 1000. The last attribute (21st in the original dataset and 25th in the
coded dataset) is the output: “should the customer be granted the loan, yes/no”.

Table 2. Description of Dataset 2

Therefore, the models and algorithms discussed in this comparative analysis have the same basis
for comparison, since both datasets broadly use the same features and parameters. The only
difference is that the neural networks have been trained on a dataset with a much
larger number of instances, which is typical of deep learning models. Moreover, all of the
remaining models have been trained on the same dataset, so there will be no difference in
their performance on account of a superior dataset. The following table summarizes the differences
between the two datasets considered.

Basis                      Dataset 1    Dataset 2
Number of parameters       13           20
Number of instances        367          1000
Categorical parameters     6            13
Quantitative parameters    7            7

Table 3. Comparison of the two datasets used


Chapter 3.2 Data Preprocessing

Data preprocessing is a process of preparing the raw data and making it suitable for a machine
learning model. It is a crucial step while creating a machine learning model. It may include any
number of processes depending on the features and quality of the dataset. The various techniques
that have been used in the studied papers are:

Imputation: Imputation refers to a class of methods that estimate missing values, either using
assumptions about the distribution of the data (for example, mean and median imputation) or using
assumptions about the relationship between auxiliary variables (or x variables) and the target y
variable. Examples include mean imputation, substitution and
extrapolation. All five papers studied used imputation techniques to fill in the gaps in the data
points.
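
As an illustration of mean and median imputation, a minimal sketch is given below. The helper `impute` and the sample values are hypothetical and are not taken from any of the five papers.

```python
from statistics import mean, median

def impute(values, strategy="mean"):
    """Replace None entries with the mean or median of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    fill = mean(observed) if strategy == "mean" else median(observed)
    return [fill if v is None else v for v in values]

# Hypothetical loan amounts with missing entries marked as None.
imputed_mean = impute([120, None, 100, 140, None], "mean")
imputed_median = impute([120, None, 100, 500], "median")
```

The median variant is less sensitive to extreme values such as the 500 in the second column.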

Binning: Binning is a way to group a range of continuous values into a smaller number of "bins".
For example, given data about a group of people, one might arrange their ages into a
smaller number of age intervals. In this problem statement, binning may be used to group together
numbers of dependents or income categories. Among the five papers, only Pandey et al. use
binning techniques as part of their feature engineering.
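
A minimal sketch of binning applicant income into categories is shown below. The thresholds and labels are assumptions chosen purely for illustration; the papers do not state the bin edges they used.

```python
from bisect import bisect_right

# Hypothetical income thresholds and labels, chosen only for illustration.
INCOME_EDGES = [2500, 4000, 6000]
INCOME_LABELS = ["low", "average", "high", "very high"]

def bin_value(value, edges=INCOME_EDGES, labels=INCOME_LABELS):
    """Map a continuous value to the label of the interval it falls in."""
    # bisect_right counts how many edges are <= value, which is the bin index.
    return labels[bisect_right(edges, value)]

incomes = [1800, 2500, 3999, 4583, 9000]
income_bins = [bin_value(x) for x in incomes]
```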

Encoding: Encoding is a technique for converting categorical variables into numerical values so
that they can be easily fitted to a machine learning model. Each of the models considered performs
some form of encoding to be able to take the categorical data into account.
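
One common form of encoding is one-hot encoding, which Pandey et al. are reported to use. The sketch below is our own minimal version; the function name `one_hot` and the example column values are hypothetical.

```python
def one_hot(values):
    """Return (categories, rows) where each row is a 0/1 indicator vector."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1          # exactly one position is set per value
        rows.append(row)
    return categories, rows

# Hypothetical values for a location-style categorical column.
areas = ["Urban", "Rural", "Semiurban", "Urban"]
categories, encoded = one_hot(areas)
```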

Normalization: Normalization is a technique often applied as part of data preparation for machine
learning. The goal of normalization is to change the values of numeric columns in the dataset to
use a common scale (usually 0-1), without distorting differences in the ranges of values or losing
information. Normalization is useful for all of the models but is absolutely essential for
optimization of the neural networks considered in Hassan et al.
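
Scaling to the 0-1 range is usually done with min-max normalization. A minimal sketch follows, with hypothetical income values; mapping a constant column to zeros is our own choice to avoid division by zero.

```python
def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: no spread to preserve
    return [(v - lo) / (hi - lo) for v in values]

applicant_income = [2000, 4000, 6000, 10000]  # hypothetical values
scaled = min_max_scale(applicant_income)
```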

Attribute Filtering: This feature engineering step is also called "filter-based feature
selection" because selected metrics are used to find irrelevant attributes, so that redundant
columns can be filtered out before training. Choosing the right features can potentially improve the
accuracy and efficiency of classification. Attribute filtering has been applied for all 3 neural
networks considered in Hassan et al. Two attribute filtering functions (PLsFilter and
ConsistencySubsetEval) were applied to the dataset, resulting in three different datasets: the
original dataset with 24 attributes, the second with 20 attributes and the third with 9 attributes.
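
The general idea of filter-based selection can be sketched with a simple correlation filter. Note that this is a stand-in technique chosen for brevity: it is not PLsFilter or ConsistencySubsetEval, and the column names, data and threshold below are hypothetical.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return 0.0 if sx == 0 or sy == 0 else cov / (sx * sy)

def filter_attributes(columns, target, threshold):
    """Keep the names of columns whose |correlation| with target >= threshold."""
    return [name for name, xs in columns.items()
            if abs(pearson(xs, target)) >= threshold]

# Hypothetical toy data: 1 = loan approved, 0 = rejected.
target = [1, 1, 0, 0, 1, 0]
columns = {
    "credit_history": [1, 1, 0, 0, 1, 0],   # perfectly aligned with target
    "dependents":     [0, 1, 0, 1, 2, 2],   # uncorrelated in this toy sample
    "constant":       [3, 3, 3, 3, 3, 3],   # carries no information
}
selected = filter_attributes(columns, target, threshold=0.5)
```

Only the attribute that carries information about the target survives the filter, which mirrors how the reduced 20-attribute and 9-attribute datasets are smaller versions of the original.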


Chapter 3.3 Machine Learning Models

Model selection and training form the next step in the data science process. Choosing the right
model for a given problem is essential to get the best possible results. Moreover, since
the loan prediction problem is a classification task, there is no dearth of
models available to perform it, which makes it all the more important to compare and contrast the
models against each other in order to draw valuable insights. The 5 papers
combined have used 6 kinds of machine learning algorithms for credit risk classification. A brief
summary of the functioning of each of these models is as follows:

Chapter 3.3.1 Decision Tree (1,3,5):


Decision Tree (1,3,5): A Decision tree is a flowchart-like tree structure, where each internal node
denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node
(terminal node) holds a class label. A tree can be “learned” or trained by splitting the source set
into subsets based on an attribute value test. This process is repeated on each derived subset in a
recursive manner called recursive partitioning. The recursion is complete when all samples in the
subset at a node have the same value of the target variable, or when splitting no longer adds value to the
predictions. The construction of a decision tree classifier does not require any domain knowledge
or parameter setting, and therefore is appropriate for exploratory knowledge discovery. Decision
trees can handle high-dimensional data and in general, decision tree classifiers have good
accuracy.
Example: Decision Tree formed in Shinde et al.

Figure 1: Decision Tree
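
The attribute value test at a single node can be sketched as follows. We assume Gini impurity as the splitting criterion, which is one common choice; the papers do not state which criterion they used, and the data and function names are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0 means the node is pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_threshold(values, labels):
    """Return (threshold, weighted_gini) of the best split 'value <= threshold'."""
    best = (None, gini(labels))                 # start from the unsplit node
    for t in sorted(set(values))[:-1]:          # the largest value cannot split
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (t, score)
    return best

# Hypothetical toy data: credit history flag and loan status (Y/N).
credit_history = [0, 0, 1, 1, 1, 0]
loan_status = ["N", "N", "Y", "Y", "Y", "N"]
threshold, impurity = best_threshold(credit_history, loan_status)
```

A full tree would apply `best_threshold` recursively to each resulting subset until the subsets are pure or no split reduces the impurity.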


Chapter 3.3.2 Random Forest (1,5):

Random forest (1,5): Random Forest is a popular supervised machine learning algorithm.
It can be used for both classification and regression problems.
It is based on the concept of ensemble learning, which is the process of combining multiple
classifiers to solve a complex problem and to improve the performance of the model. As the name
suggests, "Random Forest is a classifier that contains a number of decision trees on various subsets
of the given dataset and takes the average to improve the predictive accuracy of that dataset."
Instead of relying on one decision tree, the random forest takes the prediction from each tree and
predicts the final output based on the majority vote of these predictions. A greater number of
trees in the forest generally leads to higher accuracy and reduces the risk of overfitting. Its working
can be seen in the diagram below:

Figure 2: Random Forest
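
The two ingredients described above, training each tree on a random subset and combining the trees by majority vote, can be sketched as follows. The tree training itself is omitted; the rows, votes and function names are hypothetical.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement: the subset one tree trains on."""
    return [rng.choice(rows) for _ in rows]

def majority_vote(predictions):
    """Return the class predicted by the largest number of trees."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(0)                        # fixed seed for reproducibility
rows = [("applicant", i) for i in range(5)]   # hypothetical training rows
sample = bootstrap_sample(rows, rng)

# Hypothetical per-tree predictions for one applicant.
tree_votes = ["Y", "N", "Y", "Y", "N"]
forest_prediction = majority_vote(tree_votes)
```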


Chapter 3.3.3 Support Vector Machines (1,5):

Support Vector Machines (1,5): Support Vector Machine (SVM) is a supervised machine
learning algorithm used for both classification and regression, although it is best suited for
classification tasks. The objective of the SVM algorithm is to find a hyperplane in an N-dimensional
space that distinctly classifies the data points. The dimension of the hyperplane depends upon the
number of features. The best hyperplane is the one that represents the largest separation or margin
between the two classes.


Figure 3: Support Vector Machines
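
Once a hyperplane w.x + b = 0 is known, classification and the margin width follow directly. The sketch below assumes a linear SVM and uses hypothetical weights; finding the weights that maximize the margin is the training step, which is not shown.

```python
from math import sqrt

def decision_value(w, b, x):
    """Signed score w.x + b; its sign gives the side of the hyperplane."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def classify(w, b, x):
    """Return +1 or -1 depending on the side of the hyperplane x lies on."""
    return 1 if decision_value(w, b, x) >= 0 else -1

def margin_width(w):
    """Distance between the margin hyperplanes w.x + b = +1 and w.x + b = -1."""
    return 2.0 / sqrt(sum(wi * wi for wi in w))

# Hypothetical weights; a real SVM would learn w and b from training data.
w, b = [3.0, 4.0], -5.0
```

Since the margin equals 2 / ||w||, maximizing the separation between the classes is equivalent to minimizing the norm of w.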

Chapter 3.3.4 K-Nearest Neighbours (5):

K-Nearest Neighbours (5): K-Nearest Neighbours (KNN) is one of the essential classification
algorithms in machine learning. It belongs to the supervised learning domain and finds extensive
application in pattern recognition, data mining and intrusion detection. Given some prior data (also
called training data), in which points are classified into groups identified by an attribute, and
given an unclassified point, the KNN algorithm assigns the point to a group by observing which
group its nearest neighbours belong to.


Figure 4: K-Nearest Neighbours
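
A minimal sketch of KNN classification is given below. We assume Euclidean distance and k = 3; the training points and labels are hypothetical.

```python
from collections import Counter
from math import dist

def knn_predict(train_points, train_labels, query, k=3):
    """Label the query by majority vote among its k nearest training points."""
    nearest = sorted(zip(train_points, train_labels),
                     key=lambda pair: dist(pair[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Hypothetical, already-normalized (income, loan amount) pairs.
points = [(0.1, 0.2), (0.2, 0.1), (0.15, 0.15), (0.9, 0.8), (0.8, 0.9)]
labels = ["N", "N", "N", "Y", "Y"]
```

Because the vote depends on distances, features should be normalized first; otherwise a feature with a large numeric range dominates the result.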

Chapter 3.3.5 Logistic Regression (5, 2, 1):

Logistic Regression (5, 2, 1): Logistic regression is a supervised classification algorithm. In a
classification problem, the target variable (or output), y, can take only discrete values for a given
set of features (or inputs), X. Despite being used for classification, logistic regression is a
regression model: it predicts the probability that a given data entry belongs to the category
numbered as “1”. Just as linear regression assumes that the data follows a linear function,
logistic regression models the data using the sigmoid function.

Figure 5: Logistic Regression
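
The prediction step of logistic regression can be sketched as follows. The coefficients are hypothetical; in practice they are fitted to the training data, which is not shown here, and the 0.5 decision threshold is an assumption.

```python
from math import exp

def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + exp(-z))

def predict_proba(weights, bias, features):
    """Estimated probability that the example belongs to class 1."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)

def predict(weights, bias, features, threshold=0.5):
    """Turn the probability into a class label using a decision threshold."""
    return 1 if predict_proba(weights, bias, features) >= threshold else 0

# Hypothetical coefficients; in practice they are fitted to training data.
weights, bias = [2.0, -1.0], -0.5
```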


Chapter 3.3.6 Feed-forward Neural Networks (4):

Feed-forward Neural Networks (4): Neural networks are a set of algorithms, modeled loosely
after the human brain, that are designed to recognize patterns. They interpret sensory data through a
kind of machine perception, labeling or clustering raw input. The patterns they recognize are
numerical, contained in vectors, into which all real-world data, be it images, sound, text or time
series, must be translated.
Neural networks help us cluster and classify. They can be considered as a clustering and
classification layer on top of the data you store and manage. They help to group unlabeled data
according to similarities among the example inputs, and they classify data when they have a
labeled dataset to train on.

A feed-forward neural network is an artificial neural network in which the connections between
nodes do not form a cycle. The feed-forward model is the simplest form of neural network, as
information is only processed in one direction. While the data may pass through multiple hidden
nodes, it always moves in one direction and never backwards. In this model, a series of inputs enter
the layer and are multiplied by the weights. The values are then added together to get a sum of the
weighted input values. If the sum of the values is above a specific threshold, usually set at zero, the
value produced is often 1, whereas if the sum falls below the threshold, the output value is -1. This
process is continued layer by layer. The end result is then compared to the expected result and the
weights are modified depending on the deviation of the actual result from the expected result.

Figure 6: Feed-forward Neural Networks
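
The forward pass of a small feed-forward network can be sketched as follows. We use sigmoid activations, as reported for the network in Hassan et al., instead of the hard threshold described above; the layer sizes and weights are hypothetical, and the weight update step is not shown.

```python
from math import exp

def sigmoid(z):
    """Logistic activation: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + exp(-z))

def dense(inputs, weights, biases):
    """One fully connected layer; weights[j] holds the incoming weights of neuron j."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def forward(inputs, hidden_w, hidden_b, output_w, output_b):
    """Input -> hidden layer -> output layer, never looping back."""
    hidden = dense(inputs, hidden_w, hidden_b)
    return dense(hidden, output_w, output_b)

# Hypothetical tiny network: 3 inputs, 2 hidden neurons, 1 output neuron.
hidden_w = [[0.5, -0.2, 0.1], [-0.3, 0.8, 0.4]]
hidden_b = [0.0, 0.1]
output_w = [[1.0, -1.0]]
output_b = [0.0]
output = forward([1.0, 0.0, 1.0], hidden_w, hidden_b, output_w, output_b)
```

The single output value lies between 0 and 1 and can be read as the predicted probability of one of the two classes.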


Chapter 4 Results

Each of the considered models has been compared on the basis of its accuracy and F1 score,
which are defined as:

Accuracy: It may be defined as the number of correct predictions as a proportion of the total
number of predictions. Accuracy is one of the most popular and widely used
metrics for comparison. Although it is a useful metric, the results may be misleading if the dataset
itself is imbalanced.

F1 score: In statistical analysis of binary classification, the F-score or F-measure is a measure of a
test's accuracy. It is calculated from the precision and recall of the test, where the precision is the
number of true positive results divided by the number of all positive results, including those not
identified correctly, and the recall is the number of true positive results divided by the number of
all samples that should have been identified as positive. Precision is also known as positive
predictive value, and recall is also known as sensitivity in diagnostic binary classification. The F1
score is the harmonic mean of the precision and recall. The highest possible value of an F-score is
1.0, indicating perfect precision and recall, and the lowest possible value is 0, if either precision or
recall is zero.
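
Both metrics can be computed directly from the true and predicted labels. The sketch below is a minimal version with hypothetical labels; returning 0 when a denominator is zero is our own convention.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall and F1 for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    accuracy = correct / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical labels: 1 = loan approved, 0 = rejected.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
```

In this example there are 3 true positives, 1 false positive and 1 false negative, so accuracy, precision, recall and F1 all equal 0.75.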

S.No.  Algorithm                        Accuracy (%)  F1-Score  Average Accuracy (%)  Average F1-Score
1.     Random Forest (1)                76.2          0.58      79.1                  0.695
2.     Random Forest (5)                82            0.81
3.     SVM (1)                          79.7          0.61      78.85                 0.68
4.     SVM (5)                          78            0.75
5.     Logistic Regression (1)          75.6          0.64      73.56                 0.73
6.     Logistic Regression (2)          72.1          0.82
7.     Logistic Regression (5)          73            0.73
8.     Decision Tree (1)                70            0.55      71                    0.635
9.     Decision Tree (5)                72            0.72
10.    KNN (5)                          59            0.53      59                    0.53
11.    Neural Network (24 attributes)   89            0.85      90.6                  0.873
12.    Neural Network (20 attributes)   97            0.97
13.    Neural Network (9 attributes)    86            0.80

The number in parentheses after each algorithm is the reference number of the paper it is taken
from. Averages are computed over all rows of the same algorithm and are listed on the first row of
each group.

Table 4. Comparison of models
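
The per-algorithm averages can be recomputed from the individual rows of Table 4, as sketched below. The values are transcribed from the table; the rounding to 2 and 3 decimals is our own choice, so the last digit may differ slightly from the tabulated averages.

```python
from statistics import mean

# (accuracy %, F1) per paper, transcribed from Table 4.
results = {
    "Random Forest":       [(76.2, 0.58), (82, 0.81)],
    "SVM":                 [(79.7, 0.61), (78, 0.75)],
    "Logistic Regression": [(75.6, 0.64), (72.1, 0.82), (73, 0.73)],
    "Decision Tree":       [(70, 0.55), (72, 0.72)],
    "KNN":                 [(59, 0.53)],
    "Neural Network":      [(89, 0.85), (97, 0.97), (86, 0.80)],
}

def average(rows):
    """Mean accuracy (2 decimals) and mean F1 (3 decimals) over one algorithm's rows."""
    return (round(mean([acc for acc, _ in rows]), 2),
            round(mean([f1 for _, f1 in rows]), 3))

averages = {name: average(rows) for name, rows in results.items()}

# Best average accuracy among the models trained on the Kaggle dataset.
kaggle_only = {n: a for n, a in averages.items() if n != "Neural Network"}
best_kaggle = max(kaggle_only, key=lambda n: kaggle_only[n][0])
```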

The analysis concludes that deep learning-based feed-forward neural networks with attribute
filtering functions gave the best results, in terms of both accuracy and F1 score. However, it is
important to note that this model used a different dataset from the other algorithms considered,
which may have given it an added edge. Nevertheless, this serves as a proof of concept
that further development of deep learning models for this problem could be fruitful.
Of the ten models trained on the Loan Prediction Dataset from Kaggle, the Random Forest
classifiers showed the best results, with an average
accuracy of 79.1% and an average F1 score of 0.695. Support Vector Machines were a close second,
with nearly the same values on both metrics.
Furthermore, these results point to the superiority of Dataset 2 over Dataset 1.
A larger dataset and the existence of more parameters of comparison could be an
added advantage, and further data collection should be promoted in this domain.


Chapter 5 Conclusion

Over the course of this project, we studied multiple research papers on credit risk
classification, more popularly known as loan prediction. Loan prediction is a classification problem
in which we predict whether a loan will be approved or not. By predicting loan defaulters, the
bank can reduce its Non-Performing Assets. We looked into the various machine learning models
and algorithms used by the mentioned papers and the techniques that were used to produce these
results.
The process of prediction starts with the cleaning and processing of data and the imputation of
missing values, followed by experimental analysis of the dataset, model building, evaluation of the
model and testing on test data. The analysis concludes that deep learning-based feed-forward neural
networks with attribute filtering functions gave the best results, in terms of both accuracy and F1 score.
However, it is important to note that this model used a different dataset from the other algorithms
considered, which may have given it an added edge. Nevertheless, this serves as a proof of
concept that further development of deep learning models for this problem could be
fruitful.
Of the ten models trained on the Loan Prediction Dataset from Kaggle, the Random Forest
classifiers showed the best results, with an average
accuracy of 79.1% and an average F1 score of 0.695. Support Vector Machines were a close second,
with nearly the same values on both metrics. While these results are a great start, the models still need
to be refined before they can completely replace the current manual credit risk analysis systems.
However, they are good enough to be incorporated as a step in the current systems, since the
current loan default rate of Indian banks averages 24%. This may help save human effort as
well as allow banks to reduce loan defaults.


Chapter 6 Future Scope

The comparative analysis performed shows highly promising results with neural networks. In the
future, we plan to expand these findings by:
- Implementing feed-forward neural network models on other datasets.
- Verifying whether the results were caused only by a better dataset or by a more efficient
algorithm.
- Building a complete system by incorporating a front-end, back-end and database manager
along with the trained model.
The following figure depicts an initial proposed pipeline for the aforementioned system:

Figure 7. Proposed future system pipeline


References
1. Pandey N, Gupta R, Uniyal S, Kumar V. “Loan Approval Prediction using Machine
Learning Algorithms Approach”. International Journal for Innovative Research in
Technology 2021; 8(1): pp. 898-902.

2. Shinde, A., Patil, Y., Kotian, I., Shinde, A., & Gulwani, R. (2022). Loan prediction system
using machine learning. ITM Web of Conferences, 44.
https://doi.org/10.1051/itmconf/20224403019
3. Pramod, K., Dattatray P., Prakash A., “An Approach for Prediction of Loan Approval
Using Machine Learning Algorithm”. International Journal of Creative Research Thoughts
(IJCRT) 9(6).
4. A. K. I. Hassan and A. Abraham, "Modeling consumer loan default prediction using
ensemble neural networks," 2013 International Conference on Computing,
Electrical and Electronic Engineering (ICCEEE), 2013, pp. 719-724, doi:
10.1109/ICCEEE.2013.6634029.
5. Udaya Bhanu, L., & Narayana, D. S. (2021). Customer loan prediction using supervised
learning technique. International Journal of Scientific and Research Publications (IJSRP),
11(6), 403–407. https://doi.org/10.29322/ijsrp.11.06.2021.p11453
