LOAN ELIGIBILITY PREDICTION USING
MACHINE LEARNING
by
November, 2023
DECLARATION
I hereby declare that the thesis entitled “LOAN ELIGIBILITY PREDICTION
USING MACHINE LEARNING” submitted by me, for the award of the degree of
M.Tech (Software Engineering) is a record of bonafide work carried out by me under
the supervision of Dr. Ranichandra C.
I further declare that the work reported in this thesis has not been submitted and
will not be submitted, either in part or in full, for the award of any other degree or
diploma in this institute or any other institute or university.
Place: Vellore
Date: Signature of the candidate
CERTIFICATE
This is to certify that the thesis entitled “LOAN ELIGIBILITY PREDICTION USING
MACHINE LEARNING” submitted by YERRAM KARTHIK (19MIS0424), School
of Computer Science Engineering and Information Systems, Vellore Institute of
Technology, Vellore for the award of the degree M.Tech (Software Engineering) is a
record of bonafide work carried out by him under my supervision.
The contents of this report have not been submitted and will not be submitted either in
part or in full, for the award of any other degree or diploma in this institute or any other
institute or university. The Project report fulfils the requirements and regulations of
VELLORE INSTITUTE OF TECHNOLOGY, VELLORE and in my opinion meets the
necessary standards for submission.
ABSTRACT

Banks make a major part of their profits through loans, and a large number of people apply for them. It is hard to select the genuine applicants who will repay the loan, and when the process is done manually, many mistakes can occur in selecting them. Therefore, we are developing a loan prediction system using machine learning so that the system automatically selects the eligible candidates. This is helpful to both bank staff and applicants, and the time taken to sanction a loan is drastically reduced. In this project we predict loan eligibility using several machine learning algorithms.
In the realm of financial technology, this project delves into the development of a Loan
Eligibility Prediction system using Machine Learning (ML) techniques. Leveraging a
diverse dataset encompassing applicant demographics, financial history, and credit-
related information, the system employs predictive models to assess an individual's
eligibility for a loan. Through rigorous data preprocessing, feature engineering, and the
utilization of various ML algorithms, the project aims to deliver a robust and accurate
prediction mechanism that empowers lenders and borrowers alike by streamlining the
loan approval process, enhancing risk assessment, and promoting financial inclusivity.
ACKNOWLEDGEMENT
Place: Vellore
Date: Yerram Karthik
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES
LIST OF ACRONYMS
CHAPTER 1
INTRODUCTION
1.1 BACKGROUND
1.2 MOTIVATION
1.3 PROBLEM STATEMENT
1.4 OBJECTIVE
1.5 SCOPE OF THE PROJECT
CHAPTER 2
LITERATURE SURVEY
2.1 SUMMARY OF THE EXISTING WORK
2.2 CHALLENGES PRESENT IN EXISTING SYSTEM
CHAPTER 3
REQUIREMENTS
3.1 HARDWARE REQUIREMENTS
3.2 SOFTWARE REQUIREMENTS
3.3 BUDGET
3.4 GANTT CHART
CHAPTER 4
ANALYSIS AND DESIGN
4.1 PROPOSED METHODOLOGY
4.2 SYSTEM ARCHITECTURE
4.3 MODULE DESCRIPTIONS
CHAPTER 5
IMPLEMENTATION AND TESTING
5.1 DATA SET
5.2 SAMPLE CODE
5.3 SAMPLE OUTPUT
5.4 TEST PLAN & DATA VERIFICATION
CHAPTER 6
RESULTS
6.1 RESEARCH FINDINGS
6.2 RESULT ANALYSIS & EVALUATION METRICS
CONCLUSION AND FUTURE WORK
REFERENCES
LIST OF FIGURES
S.No. Figure Page No.
1. Fig-1 [System Architecture] 15
2. Fig-2 [Count plot (Male vs Female)] 30
3. Fig-3 [Count plot (Graduate vs Not Graduate)] 31
4. Fig-4 [Count plot for credit history (Yes or No)] 31
5. Fig-5 [Pair plot] 32
6. Fig-6 [Histogram (Applicant Income vs Frequency)] 33
7. Fig-7 [ROC Curve] 45
8. Fig-8 [Recall Curve] 48
LIST OF TABLES
LIST OF ACRONYMS
INTRODUCTION
1.1 BACKGROUND
The concept of "Loan Eligibility Prediction using Machine Learning" is a
fundamental application of artificial intelligence and data science within the financial
sector. This application leverages the power of machine learning algorithms to assess
and predict an individual's or business's eligibility for obtaining a loan. The primary
goal is to streamline and automate the loan approval process, making it more efficient
and accurate for both financial institutions and applicants.
1.2 MOTIVATION
Loan eligibility prediction using machine learning is a game-changer in the
financial industry, offering the potential to revolutionize lending processes. By
harnessing the power of advanced algorithms and data analytics, it enables more
accurate and fair assessment of individuals' creditworthiness. This not only benefits
lenders by reducing default risks and improving decision-making, but it also extends
the opportunity for financial inclusion to a broader and more diverse population,
ultimately fostering economic growth and empowerment. With the ability to efficiently
and effectively determine loan eligibility, ML-driven solutions have the potential to
reshape the lending landscape, making access to credit more equitable and accessible
for all.
The adverse impact of low loan repayment rates on banks is a major issue. Bank
employees check applicants' details manually and grant loans to the eligible
applicants, and checking the details of every applicant takes a lot of time. Existing
loan eligibility models are unsatisfactory and make it difficult to establish a new
analysis model. Consequently, there has been increased demand for applying machine
learning methods to loan eligibility prediction because of their high performance.
1.4 OBJECTIVE
CHAPTER 2
LITERATURE SURVEY
3
SaiViswanadh Sarma, 2. Efficiency and 2. Algorithm
B. Sravani, Automation Bias
Nedunchezhian
4
8. Prediction Of Loan G.Murali Krishna, 1. Efficiency 1. Data Bias and
Eligibility of the V.Madhavi Improvement: Fairness Issues
Customer 2. Risk Mitigation: 2. Model
3. Data Privacy Accuracy
Concerns Dependency
3. Complexity
and Maintenance
5
2.2 CHALLENGES PRESENT IN EXISTING SYSTEM
The challenges present in the existing systems for loan eligibility
prediction using machine learning are:
8. Bias and Fairness Concerns: Ensuring that the loan eligibility
predictions are fair and unbiased is a recurring challenge. Discriminatory
or biased outcomes can lead to legal and ethical issues.
9. Complexity and Maintenance: Several systems highlight the
complexity of implementing and maintaining machine learning-based
solutions. Maintenance and ongoing updates are essential for keeping the
models accurate and relevant.
10. Risk Assessment and Transparency: Some systems emphasize the
importance of risk assessment and transparency. Balancing the need for
risk mitigation while maintaining transparency in decision-making is a
challenge.
It's important to note that these challenges may vary depending on the
specific system and the context in which it is applied. Addressing these
challenges is crucial to ensure that machine learning-based loan eligibility
prediction systems are accurate, fair, and compliant with regulations.
CHAPTER 3
REQUIREMENTS
3.3 BUDGET
Procured Items/Components for the Project Work Total Cost
3.4 GANTT CHART
Month 3: October 1, 2023 - November 1, 2023
CHAPTER 4
ANALYSIS AND DESIGN
4.1.5 Feature Encoding:
• In the context of loan eligibility prediction using machine learning, feature
encoding involves converting non-numeric data like customer categories (e.g.,
"student," "employed") into numerical representations (e.g., 0 or 1) so that
machine learning algorithms can analyse and make predictions based on these
features. Common methods include label encoding for ordinal data and one-hot
encoding for nominal data, ensuring that categorical information can be used
effectively in the prediction model.
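As a minimal sketch of the two encodings (using hypothetical columns rather than the project's full dataset):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame with one ordinal-style and one nominal column
df = pd.DataFrame({
    'Education': ['Graduate', 'Not Graduate', 'Graduate'],
    'Property_Area': ['Urban', 'Rural', 'Semiurban'],
})

# Label encoding: each category becomes an integer code
le = LabelEncoder()
df['Education'] = le.fit_transform(df['Education'])

# One-hot encoding: one binary indicator column per category
df = pd.get_dummies(df, columns=['Property_Area'])
print(df.columns.tolist())
```

`LabelEncoder` assigns codes in sorted order of the category names, while `get_dummies` replaces the nominal column with one 0/1 column per category, so no artificial ordering is imposed.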
4.2 SYSTEM ARCHITECTURE
4.3 MODULE DESCRIPTIONS
DATA COLLECTION:
The first job in the machine learning implementation is to find ways and
sources of collecting relevant and comprehensive data, then to interpret it
and analyse the results with the help of statistical techniques.
Kaggle link: https://www.kaggle.com/datasets/vikasukani/loan-eligible-
dataset
EXPLORATORY DATA ANALYSIS:
EDA is an approach of analysing data sets to summarize their main
characteristics, which often includes their data types, memory size
occupied, etc.,
DATA PRE-PROCESSING:
We loaded the dataset as a pandas DataFrame to process it and feed it into
the machine learning model. In this experiment we dropped the rows
containing null values.
DATA SPLITTING:
For each experiment, we split the entire dataset into a 70% training set and
a 30% test set. We used the training set for resampling, hyperparameter
tuning, and training the model, and we used the test set to evaluate the
performance of the trained model. While splitting the data, we specified a
random seed (any fixed number), which ensured the same data split every time
the program executed.
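The split described above can be reproduced with scikit-learn's `train_test_split`; the toy arrays here stand in for the real features and labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in feature matrix (100 samples, 2 features) and binary labels
X = np.arange(200).reshape(100, 2)
y = np.array([0, 1] * 50)

# 70/30 split; a fixed random_state gives the same partition on every run
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42)

print(len(X_train), len(X_test))  # 70 30
```

Because `random_state` is fixed, repeating the call produces byte-identical partitions, which is what makes the experiments reproducible.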
CHAPTER 5
IMPLEMENTATION AND TESTING
5.1 DATASET
Dataset Reference:
• https://www.kaggle.com/datasets/vikasukani/loan-eligible-dataset
Dataset Screenshot:
5.2 SAMPLE CODE
import seaborn as sns
import pandas as pd
import warnings
warnings.filterwarnings("ignore")
import matplotlib.pyplot as plt
import pickle
from sklearn.ensemble import VotingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import f1_score
df = pd.read_csv(r"C:\Users\yerra\Music\Loan eligibility prediction\CODING\back end\loan.csv")
EXPLANATION:
1. `import seaborn as sns`: This line imports the Seaborn library, which is a data
visualization library based on Matplotlib. It is often used to create attractive and
informative statistical graphics.
2. `import pandas as pd`: This line imports the Pandas library, which is a popular
data manipulation and analysis library in Python. It allows you to work with
structured data, such as data in tables or data frames.
3. `import warnings`: This line imports the warnings module, which is used to
control the behavior of warning messages in Python.
4. `warnings.filterwarnings("ignore")`: This line sets up a filter to ignore warning
messages in the code. It suppresses warning messages that may otherwise be
displayed during the execution of your code.
5. `import matplotlib.pyplot as plt`: This line imports the Matplotlib library,
specifically the pyplot module, which is used for creating various types of plots and
charts in Python.
6. `import pickle`: This line imports the pickle module, which is used for serializing
and deserializing Python objects. It allows you to save and load data structures and
models.
7. `from sklearn.ensemble import VotingClassifier`: This line imports the
`VotingClassifier` class from the scikit-learn library, which is used to create an
ensemble model that combines multiple machine learning classifiers to make
predictions.
8. `from sklearn.tree import DecisionTreeClassifier`: This line imports the
`DecisionTreeClassifier` class from scikit-learn, which is used to create decision tree
models for classification tasks.
9. `from sklearn.linear_model import LogisticRegression`: This line imports the
`LogisticRegression` class from scikit-learn, which is used to create logistic
regression models for binary classification.
10. `from sklearn.naive_bayes import GaussianNB`: This line imports the
`GaussianNB` class from scikit-learn, which is used to create Naive Bayes models for
classification tasks, assuming Gaussian-distributed features.
11. `from sklearn.metrics import f1_score`: This line imports the `f1_score`
function from scikit-learn, which is a metric used to evaluate the performance of
classification models.
12. `df = pd.read_csv(r"C:\Users\yerra\Music\Loan eligibility
prediction\CODING\back end\loan.csv")`: This line reads a CSV file named
"loan.csv" located at the specified file path and loads its data into a Pandas DataFrame
called `df`. The `r` before the path is used to treat the string as a raw string, which can
be helpful when dealing with backslashes in file paths.
Each line of code is responsible for importing libraries, setting up warning filters, and
loading data for your loan eligibility prediction project.
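To illustrate the raw-string point with a hypothetical path (not the author's actual file location):

```python
# Both spellings name the same file; the raw string avoids
# having to double every backslash by hand.
escaped = "C:\\data\\loan.csv"
raw = r"C:\data\loan.csv"
print(escaped == raw)  # True
```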
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614 entries, 0 to 613
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Loan_ID 614 non-null object
1 Gender 601 non-null object
2 Married 611 non-null object
3 Dependents 599 non-null object
4 Education 614 non-null object
5 Self_Employed 582 non-null object
6 ApplicantIncome 614 non-null int64
7 CoapplicantIncome 614 non-null float64
8 LoanAmount 592 non-null float64
9 Loan_Amount_Term 600 non-null float64
10 Credit_History 564 non-null float64
11 Property_Area 614 non-null object
12 Loan_Status 614 non-null object
dtypes: float64(4), int64(1), object(8)
memory usage: 62.5+ KB
df
[Truncated DataFrame output: 614 rows × 13 columns]
df['Dependents']= pd.to_numeric(df['Dependents'],errors='coerce')
df
[Truncated DataFrame output; Dependents is now numeric, with NaN where coercion failed, e.g. row 610]
df.style.highlight_null(null_color='red')
<pandas.io.formats.style.Styler at 0x2cc81688f50>
df.isnull().sum()
Loan_ID 0
Gender 13
Married 3
Dependents 66
Education 0
Self_Employed 32
ApplicantIncome 0
CoapplicantIncome 0
LoanAmount 22
Loan_Amount_Term 14
Credit_History 50
Property_Area 0
Loan_Status 0
dtype: int64
df = df.dropna()
EXPLANATION:
1. `df = df.dropna()`: This line of code is using the Pandas DataFrame `df` and the
`dropna()` method to remove rows with missing (NaN) values from the DataFrame.
By calling `dropna()` without any arguments, it drops any row in the DataFrame
where at least one column has a missing value (NaN). After executing this line, the
DataFrame `df` will contain only the rows with complete data, and any rows with
missing values will be removed.
2. `df`: This line simply displays the modified DataFrame `df`, showing it with
the missing-value rows removed.
So this code cleans the DataFrame by removing rows with missing data; note that
`dropna()` returns a new DataFrame that is assigned back to `df`, rather than
modifying the original in place.
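A small sketch of this behaviour: `dropna()` returns a new DataFrame, so the result must be assigned back (or `inplace=True` used) for the cleaning to take effect:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'LoanAmount': [128.0, np.nan, 120.0]})

cleaned = df.dropna()         # new DataFrame; df itself is unchanged
print(len(df), len(cleaned))  # 3 2

df = df.dropna()              # assign back, as the project's code does
print(len(df))                # 2
```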
df
[Truncated DataFrame output after dropna; 439 rows remain]
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 439 entries, 1 to 613
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Loan_ID 439 non-null object
1 Gender 439 non-null object
2 Married 439 non-null object
3 Dependents 439 non-null float64
4 Education 439 non-null object
5 Self_Employed 439 non-null object
6 ApplicantIncome 439 non-null int64
7 CoapplicantIncome 439 non-null float64
8 LoanAmount 439 non-null float64
9 Loan_Amount_Term 439 non-null float64
10 Credit_History 439 non-null float64
11 Property_Area 439 non-null object
12 Loan_Status 439 non-null object
dtypes: float64(5), int64(1), object(7)
memory usage: 48.0+ KB
df.reset_index(inplace = True)
df
[Truncated DataFrame output after reset_index: income and loan-amount columns of the 439 remaining rows]
df = df.drop('index', axis=1)
df
[Truncated DataFrame output after dropping the old index column]
Fig 3 – Count plot [Graduate vs Not Graduate]
# Select numerical columns for comparison
numerical_columns = ['ApplicantIncome', 'CoapplicantIncome',
'LoanAmount', 'Loan_Amount_Term']
# Create a pairplot
sns.pairplot(df, vars=numerical_columns, diag_kind='kde')
plt.show()
Fig 6 – Histogram [ApplicantIncome Vs Frequency]
df = df.drop('Loan_ID', axis=1)
df
[Truncated DataFrame output after dropping Loan_ID]
from sklearn.preprocessing import LabelEncoder
ley = LabelEncoder()
df['Gender'] = ley.fit_transform(df['Gender'])
df['Married'] = ley.fit_transform(df['Married'])
df['Education'] = ley.fit_transform(df['Education'])
df['Self_Employed'] = ley.fit_transform(df['Self_Employed'])
df['Property_Area'] = ley.fit_transform(df['Property_Area'])
df['Loan_Status'] = ley.fit_transform(df['Loan_Status'])
df
[Truncated DataFrame output: 439 rows × 12 columns, with all categorical columns label-encoded]
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 439 entries, 0 to 438
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Gender 439 non-null int32
1 Married 439 non-null int32
2 Dependents 439 non-null float64
3 Education 439 non-null int32
4 Self_Employed 439 non-null int32
5 ApplicantIncome 439 non-null int64
6 CoapplicantIncome 439 non-null float64
7 LoanAmount 439 non-null float64
8 Loan_Amount_Term 439 non-null float64
9 Credit_History 439 non-null float64
10 Property_Area 439 non-null int32
11 Loan_Status 439 non-null int32
dtypes: float64(5), int32(6), int64(1)
memory usage: 31.0 KB
df.shape
(439, 12)
df.corr()
Loan_Status
Gender 0.073219
Married 0.124946
Dependents 0.055115
Education -0.077345
Self_Employed -0.060529
ApplicantIncome -0.023656
CoapplicantIncome -0.021223
LoanAmount -0.059680
Loan_Amount_Term -0.009306
Credit_History 0.531467
Property_Area 0.032841
Loan_Status 1.000000
correlation_matrix = df.corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.show()
from sklearn.utils import resample
dfmin = df[df['Loan_Status'] == 1]
dfmax = df[df['Loan_Status'] == 0]
# resample each class to 1000 rows (parameters assumed from the counts below)
dfminu = resample(dfmin, replace=True, n_samples=1000, random_state=42)
dfmaxd = resample(dfmax, replace=True, n_samples=1000, random_state=42)
df_dsampled = pd.concat([dfminu, dfmaxd])
df_dsampled['Loan_Status'].value_counts()
1 1000
0 1000
Name: Loan_Status, dtype: int64
y = df_dsampled['Loan_Status']
X = df_dsampled.drop('Loan_Status', axis = 1)
[Truncated output of X and y after resampling; the balanced data contains 2000 rows]
X_train
[Truncated X_train output: 1400 rows]
y_train
95 0
225 1
216 1
74 1
362 1
..
190 1
254 0
26 1
270 1
31 1
Name: Loan_Status, Length: 1400, dtype: int32
y_test
143 0
23 0
238 0
77 1
335 0
..
371 0
221 0
281 1
424 1
169 1
Name: Loan_Status, Length: 600, dtype: int32
X_test
[Truncated X_test output: 600 rows]
X_test.to_csv(r'C:\Users\yerra\Music\Loan eligibility prediction\CODING\test.csv', index=False)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

k = range(1, 30, 2)  # candidate values of k (assumed; the original definition is not shown)
train_auc = []
test_auc = []
for i in k:
    clf = KNeighborsClassifier(n_neighbors=i, algorithm='brute')
    clf.fit(X_train, y_train)
    prob_cv = clf.predict(X_train)

# Final model; the value of k chosen in the original is not shown, 5 is assumed
knn = KNeighborsClassifier(n_neighbors=5, algorithm='brute')
knn.fit(X_train, y_train)
#pickle.dump(knn, open(r'C:\Users\ST-0008\Documents\santhosh\Loan eligibility prediction\KN.pkl', 'wb'))
pred_test = knn.predict(X_test)
test_accuracy = accuracy_score(y_test, pred_test)
test_accuracy
0.9416666666666667
class_label = ['not eligible', 'eligible']  # defined here as it is later in the report
predicted = knn.predict(X_test[:20])
pred = []
for j in predicted:
    pred.append(class_label[j])
original_Classlabel predicted_classlabel
0 not eligible eligible
1 not eligible not eligible
2 not eligible not eligible
3 eligible not eligible
4 not eligible not eligible
5 eligible eligible
6 not eligible not eligible
7 eligible eligible
8 not eligible not eligible
9 eligible eligible
10 eligible eligible
11 not eligible not eligible
12 eligible eligible
13 eligible eligible
14 not eligible not eligible
15 not eligible not eligible
16 eligible eligible
17 not eligible not eligible
18 eligible eligible
19 eligible eligible
[Classification report excerpt: class 1 precision 0.97, recall 0.91, f1-score 0.94, support 300]
from sklearn.metrics import roc_curve, roc_auc_score
fpr, tpr, thresholds = roc_curve(y_test, pred_test)
auc = roc_auc_score(y_test, pred_test)
plt.plot(fpr, tpr)
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title(f"ROC Curve (AUC = {auc:.2f})")
plt.show()
Fig 7 – ROC Curve
from sklearn.metrics import precision_recall_curve, average_precision_score
precision, recall, thresholds = precision_recall_curve(y_test, pred_test)
average_precision = average_precision_score(y_test, pred_test)
plt.plot(recall, precision)
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title(f"Precision-Recall Curve (AP = {average_precision:.2f})")
plt.show()
EXPLANATION:
This code is used to create and visualize a Precision-Recall curve, a common tool for
evaluating the performance of binary classification models. Here's an explanation of
each part of the code:
2. `precision, recall, thresholds = precision_recall_curve(y_test, pred_test)`: This
line calculates the precision, recall, and thresholds using the `precision_recall_curve`
function. It requires two arguments: `y_test`, which is the true labels for the test dataset,
and `pred_test`, which is the predicted probabilities or scores generated by your
classifier. The function returns arrays of precision, recall, and threshold values.
5. `plt.xlabel("Recall")`: This line sets the label for the x-axis to "Recall," indicating
that the x-axis represents the recall values.
6. `plt.ylabel("Precision")`: This line sets the label for the y-axis to "Precision,"
indicating that the y-axis represents the precision values.
8. `plt.show()`: This line displays the Precision-Recall curve with the specified title, x-
axis label, and y-axis label in a graphical window or output.
In summary, this code computes and visualizes a Precision-Recall curve to assess the
performance of a binary classification model, and it provides a summary metric
(average precision) for that model's performance. The curve shows how the model's
precision and recall change at different classification thresholds, which can be crucial
for model evaluation and decision-making.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score
# search grids (assumed values; the original definitions of n_estimators and dept are not shown)
n_estimators = [10, 30, 60, 100]
dept = [100, 300, 600, 900]
param_grid = {'n_estimators': n_estimators, 'max_depth': dept}
clf = RandomForestClassifier()
model = GridSearchCV(clf, param_grid, scoring='accuracy', n_jobs=-1, cv=3)
model.fit(X_train, y_train)
print("optimal n_estimators", model.best_estimator_.n_estimators)
print("optimal max_depth", model.best_estimator_.max_depth)
optimal_max_depth = model.best_estimator_.max_depth
optimal_n_estimators = model.best_estimator_.n_estimators
optimal n_estimators 60
optimal max_depth 600
clf = RandomForestClassifier(max_depth =
optimal_max_depth,n_estimators = optimal_n_estimators)
clf.fit(X_train,y_train)
import pickle
#pickle.dump(clf,open(r'C:\Users\ST-0008\Documents\santhosh\Loan
eligibility prediction\RF.pkl','wb'))
y_pred=clf.predict(X_test)
accuracy = accuracy_score(y_test,y_pred)
accuracy
0.9966666666666667
from sklearn.metrics import classification_report, confusion_matrix
print(classification_report(y_test, y_pred))
sns.heatmap(confusion_matrix(y_test, y_pred), annot=True, fmt='.2f')
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.metrics import precision_recall_curve, average_precision_score
import matplotlib.pyplot as plt
fpr, tpr, thresholds = roc_curve(y_test, y_pred)
auc = roc_auc_score(y_test, y_pred)
plt.plot(fpr, tpr)
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title(f"ROC Curve (AUC = {auc:.2f})")
plt.show()
precision, recall, thresholds = precision_recall_curve(y_test, y_pred)
average_precision = average_precision_score(y_test, y_pred)
plt.plot(recall, precision)
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title(f"Precision-Recall Curve (AP = {average_precision:.2f})")
plt.show()
class_label = ['not eligible','eligible']
original = []
for i in y_test[:20]:
original.append(class_label[i])
predicted = clf.predict(X_test[:20])
pred = []
for j in predicted:
pred.append(class_label[j])
original_Classlabel predicted_classlabel
0 not eligible not eligible
1 not eligible not eligible
2 not eligible not eligible
3 eligible eligible
4 not eligible not eligible
5 eligible eligible
6 not eligible not eligible
7 eligible eligible
8 not eligible not eligible
9 eligible eligible
10 eligible eligible
11 not eligible not eligible
12 eligible eligible
13 eligible eligible
14 not eligible not eligible
15 not eligible not eligible
16 eligible eligible
17 not eligible not eligible
18 eligible eligible
19 eligible eligible
import xgboost as xgb
model = xgb.XGBClassifier(n_estimators=1000, learning_rate=0.04, random_state=1)
model.fit(X_train, y_train)
import pickle
filename = r'C:\Users\yerra\Music\Loan eligibility prediction\CODING\front end\X_gb_loan.pkl'
pickle.dump(model, open(filename, 'wb'))
pred_test4 =model.predict(X_test)
test_accuracy4 = accuracy_score(y_test, pred_test4)
pred_train = model.predict(X_train)
train_accuracy4 = accuracy_score(y_train, pred_train)
print("AUC on Test data is " + str(accuracy_score(y_test, pred_test4)))
print("AUC on Train data is " + str(accuracy_score(y_train, pred_train)))
print("---------------------------")
# Code for drawing seaborn heatmaps
class_names =['0','1']
df_heatmap = pd.DataFrame(confusion_matrix(y_test,
pred_test4.round()), index=class_names, columns=class_names )
fig = plt.figure( )
heatmap = sns.heatmap(df_heatmap, annot=True, fmt="d")
EXPLANATION:
The code trains an XGBoost (Extreme Gradient Boosting) classifier, saves the trained
model to a file using pickle, evaluates the model's performance on both the test and
training datasets, and visualizes a confusion matrix heatmap. Let's break down the
code step by step:
1. `import xgboost as xgb`: This line imports the XGBoost library, which is a
popular gradient boosting framework used for machine learning tasks.
2. `from sklearn.metrics import accuracy_score`: This line imports the
`accuracy_score` function from scikit-learn, which is used to calculate the accuracy of
a classification model.
3. `model = xgb.XGBClassifier(n_estimators=1000, learning_rate=0.04,
random_state=1)`: This line initializes an XGBoost classifier model with the
specified hyperparameters. It sets the number of estimators (trees) to 1000, learning
rate to 0.04, and random state to 1 for reproducibility.
4. `model.fit(X_train, y_train)`: This line fits (trains) the XGBoost model using the
training data `X_train` and the corresponding labels `y_train`.
5. `import pickle`: This line imports the `pickle` module for serializing and
deserializing Python objects.
6. `filename = r'C:\Users\yerra\Music\Loan eligibility prediction\CODING\front
end\X_gb_loan.pkl'`: This line defines a file path where the trained XGBoost model
will be saved using pickle.
7. `pickle.dump(model, open(filename, 'wb'))`: This line saves the trained XGBoost
model to the specified file.
8. `pred_test4 = model.predict(X_test)`: This line uses the trained model to make
predictions on the test dataset and stores the predicted labels in `pred_test4`.
9. `test_accuracy4 = accuracy_score(y_test, pred_test4)`: This line calculates the
accuracy of the model on the test dataset by comparing the predicted labels
(`pred_test4`) to the true labels (`y_test`).
10. `pred_train = model.predict(X_train)`: This line uses the trained model to make
predictions on the training dataset and stores the predicted labels in `pred_train`.
11. `train_accuracy4 = accuracy_score(y_train, pred_train)`: This line calculates
the accuracy of the model on the training dataset.
12. `print("AUC on Test data is " + str(accuracy_score(y_test, pred_test4)))`:
This line prints the accuracy of the model on the test dataset.
13. `print("AUC on Train data is " + str(accuracy_score(y_train, pred_train)))`:
This line prints the accuracy of the model on the training dataset.
14. The code for drawing the confusion matrix heatmap is as follows:
- `class_names = ['0', '1']`: This line defines class names for the confusion matrix.
- `df_heatmap = pd.DataFrame(confusion_matrix(y_test, pred_test4.round()),
index=class_names, columns=class_names)`: This line calculates the confusion
matrix using the `confusion_matrix` function and stores it in a Pandas DataFrame. It
uses the true labels `y_test` and the predicted labels `pred_test4`. The `round()`
function is applied to the predicted labels to ensure they are integers.
- `fig = plt.figure()`: This line creates a new Matplotlib figure.
15. `heatmap = sns.heatmap(df_heatmap, annot=True, fmt="d")`: This line
generates a heatmap of the confusion matrix using Seaborn. It annotates the cells with
values and specifies the format as integers.
Overall, the code trains an XGBoost classifier, evaluates its accuracy on both the test
and training datasets, and provides a visualization of the confusion matrix heatmap to
assess its performance.
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
#Voting Classifier
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)
from sklearn.ensemble import GradientBoostingClassifier
# Create individual classifiers
classifier1 = DecisionTreeClassifier()
classifier2 = RandomForestClassifier()
classifier3 = GradientBoostingClassifier()
voting_classifier = VotingClassifier(estimators=[('clf1', classifier1), ('clf2', classifier2), ('clf3', classifier3)], voting='hard')
voting_classifier.fit(X_train, y_train)
y_pred = voting_classifier.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)*100
print(f'Accuracy: {accuracy}%')
precision = precision_score(y_test, y_pred)*100
print(f'Precision: {precision}%')
f1 = f1_score(y_test, y_pred)*100
print(f'F1 Score: {f1}%')
Accuracy: 99.66666666666667%
Precision: 100.0%
F1 Score: 99.66555183946488%
Confusion Matrix:
[[300 0]
[ 2 298]]
Recall: 99.33333333333333%
EXPLANATION:
This code demonstrates the use of a Voting Classifier in scikit-learn, which combines
the predictions of multiple base classifiers to make a final decision. It then evaluates
the performance of the Voting Classifier on a test dataset. Here's an explanation of each
part of the code:
1. `from sklearn.metrics import accuracy_score, f1_score, precision_score,
recall_score, classification_report, confusion_matrix`: This line imports various
metrics for model evaluation from scikit-learn.
2. Splitting the Data:
- `X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y,
random_state=42)`: This code splits the dataset into training and test sets using the
`train_test_split` function. It uses 70% of the data for training (`X_train` and `y_train`)
and 30% for testing (`X_test` and `y_test`). The `stratify` parameter ensures that the
class distribution is preserved in the split, and `random_state` is set for reproducibility.
3. Creating Individual Classifiers:
- Three individual classifiers are created: `classifier1`, `classifier2`, and `classifier3`.
These classifiers are a Decision Tree, Random Forest, and Gradient Boosting Classifier,
respectively.
4. Creating a Voting Classifier:
- `voting_classifier = VotingClassifier(estimators=[('clf1', classifier1), ('clf2',
classifier2), ('clf3', classifier3)], voting='hard')`: This code creates a Voting Classifier
(`voting_classifier`) that combines the three individual classifiers using a "hard" voting
scheme, where the final prediction is based on a majority vote.
5. Fitting the Voting Classifier:
- `voting_classifier.fit(X_train, y_train)`: This line fits (trains) the Voting Classifier
on the training data (`X_train` and `y_train`).
6. Making Predictions:
- `y_pred = voting_classifier.predict(X_test)`: It uses the trained Voting Classifier to
make predictions on the test data (`X_test`) and stores the predicted labels in `y_pred`.
7. Calculating and Printing Metrics:
- The code calculates and prints several evaluation metrics for the Voting
Classifier:
- Accuracy: It measures the percentage of correctly predicted labels in the test
dataset.
- Precision: It quantifies the proportion of true positive predictions among all
positive predictions.
- F1 Score: It combines precision and recall into a single metric to assess the model's
overall performance.
- Confusion Matrix: It shows the counts of true positives, true negatives, false
positives, and false negatives.
- Recall (Sensitivity): It calculates the percentage of true positives among all actual
positive cases.
These metrics collectively provide insights into the performance of the Voting
Classifier, indicating that it has high accuracy, precision, and recall, with a balanced F1
score. The confusion matrix also highlights the number of correctly and incorrectly
classified instances in each class.
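The seven steps above can be consolidated into a single runnable sketch. Here `make_classification` is used as a synthetic stand-in for the project's loan dataset (an assumption for illustration only); the split parameters, base classifiers, and hard-voting configuration mirror the description above:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (RandomForestClassifier, GradientBoostingClassifier,
                              VotingClassifier)
from sklearn.metrics import (accuracy_score, precision_score, f1_score,
                             recall_score, confusion_matrix)

# Synthetic stand-in for the loan dataset (illustrative assumption)
X, y = make_classification(n_samples=600, n_features=10, random_state=42)

# Step 2: 70/30 stratified split with a fixed random_state for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

# Steps 3-4: three base classifiers combined by hard (majority) voting
classifier1 = DecisionTreeClassifier(random_state=42)
classifier2 = RandomForestClassifier(random_state=42)
classifier3 = GradientBoostingClassifier(random_state=42)
voting_classifier = VotingClassifier(
    estimators=[('clf1', classifier1), ('clf2', classifier2), ('clf3', classifier3)],
    voting='hard')

# Steps 5-7: fit, predict, and report the evaluation metrics
voting_classifier.fit(X_train, y_train)
y_pred = voting_classifier.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("F1 Score:", f1_score(y_test, y_pred))
print("Recall:", recall_score(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
```

With hard voting, each base classifier casts one vote per test sample and the majority label wins; with three estimators there can be no tie on a binary problem.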
import pickle

# Persist the trained ensemble for use by the front end
filename = r'C:\Users\yerra\Music\Loan eligibility prediction\CODING\front end\voting_loan.pkl'
with open(filename, 'wb') as f:
    pickle.dump(voting_classifier, f)
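The front end can later reload the saved model with `pickle.load`. A minimal round-trip sketch, using a portable temporary path and a small stand-in model rather than the project's absolute Windows path and trained ensemble (both assumptions for illustration):

```python
import os
import pickle
import tempfile
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Small stand-in for voting_classifier (illustrative)
X, y = make_classification(n_samples=100, n_features=5, random_state=0)
model = DecisionTreeClassifier(random_state=0).fit(X, y)

# Serialize, then deserialize and verify the round trip
path = os.path.join(tempfile.gettempdir(), 'voting_loan.pkl')
with open(path, 'wb') as f:
    pickle.dump(model, f)
with open(path, 'rb') as f:
    loaded = pickle.load(f)

# The reloaded model reproduces the original model's predictions
assert (loaded.predict(X) == model.predict(X)).all()
```

Note that unpickling executes arbitrary code from the file, so a production front end should only load model files from a trusted location.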
5.3 SAMPLE OUTPUT
Test plan:
1. Objective: Define the objective of the test. In this case, it is splitting the dataset into training and testing sets for machine learning model evaluation.
2. Steps:
Data Verification:
1. Data Integrity: Ensure that the dataset (X) and target variable (y) are correctly loaded and free from errors or inconsistencies. You may want to perform data cleaning and preprocessing steps before the split.
2. Check Split Size: Verify that the actual split sizes of the training and testing sets match the specified test_size. You can print the lengths of X_train, X_test, y_train, and y_test to confirm this.
3. Class Distribution (if using stratify): If you use stratify=y, check whether the class distribution in the target variable is preserved in both the training and testing sets. You can do this by counting the occurrences of each class in y_train and y_test.
4. Reproducibility: Ensure that the use of random_state provides consistent and reproducible results when re-running the code. Check whether multiple runs of the code produce the same split.
5. Data Exploration: Optionally, perform exploratory data analysis on the training and testing sets to understand the characteristics of the data, such as feature distributions and class imbalances.
6. Data Quality: Verify that the data is of high quality and suitable for the machine learning task. Address any issues with missing values, outliers, or anomalies if necessary.
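Checks 2-4 of the data verification steps above can be automated. A minimal sketch, with a synthetic imbalanced dataset standing in for the real loan data (an assumption for illustration):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic imbalanced dataset (illustrative stand-in for the loan data)
X, y = make_classification(n_samples=1000, n_features=8,
                           weights=[0.7, 0.3], random_state=42)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

# Check split size: 70/30 split of 1000 samples gives 700 train, 300 test
assert len(X_train) == 700 and len(X_test) == 300

# Check class distribution: stratify=y keeps the class ratio nearly identical
train_pos_ratio = np.mean(y_train)
test_pos_ratio = np.mean(y_test)
assert abs(train_pos_ratio - test_pos_ratio) < 0.02

# Check reproducibility: the same random_state yields the same split
X_train2, _, _, _ = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)
assert np.array_equal(X_train, X_train2)

print("All split verification checks passed.")
```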
CHAPTER 6
RESULTS
In the realm of loan eligibility prediction using machine learning within the
banking sector, the research findings indicate several common trends and challenges.
The merits of these studies consistently highlight improvements in efficiency and
accuracy in loan approval and risk assessment. Researchers have also leveraged diverse
machine learning techniques, such as random forest models, to enhance loan eligibility
assessment. These advancements offer potential benefits in automating decision-
making processes and mitigating human bias. However, data privacy and security
concerns loom as a prominent demerit, emphasizing the need for robust data protection
measures. Additionally, the complexity and interpretability of machine learning models,
as well as their dependence on data quality, present challenges in implementing these
solutions effectively.
A recurrent theme across the research is the trade-off between model accuracy
and interpretability, reflecting the banking sector's need for both transparency and
predictive power. Furthermore, concerns about bias and fairness in loan eligibility
decisions have been acknowledged, hinting at the ethical dimensions of these
algorithms. As the field continues to evolve, it becomes clear that a delicate balance
must be struck between model sophistication, data quality, and ethical considerations
to harness the full potential of machine learning for improving loan eligibility
assessment in the banking sector.
2. Random Forest:
- Random Forest was tuned using GridSearchCV to find the optimal hyperparameters,
resulting in a max depth of 45 and 60 estimators.
- It achieved an impressive accuracy of approximately 99.67% on the test data,
indicating a strong predictive capability.
- Random Forest is a robust ensemble method known for handling complex
relationships in the data, but it may lack interpretability.
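The tuning step described above can be sketched as follows. The grid values are illustrative choices bracketing the reported optimum (max_depth of 45, 60 estimators), and a synthetic dataset again stands in for the loan data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the loan dataset (illustrative assumption)
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

# Illustrative grid bracketing the values reported in the thesis
param_grid = {'max_depth': [15, 30, 45], 'n_estimators': [30, 60, 90]}

# GridSearchCV exhaustively cross-validates every parameter combination
grid = GridSearchCV(RandomForestClassifier(random_state=42),
                    param_grid, cv=5, scoring='accuracy')
grid.fit(X_train, y_train)

print("Best parameters:", grid.best_params_)
print("Test accuracy:", grid.score(X_test, y_test))
```

After fitting, `grid.best_estimator_` is a Random Forest refit on the full training set with the winning parameters, so it can be used directly for prediction.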
3. XGBoost:
- XGBoost was used with 1000 estimators and a learning rate of 0.04, and it achieved
high accuracy.
- The test accuracy was approximately 99.67%, which is consistent with the Random
Forest model.
- XGBoost is a powerful boosting algorithm, often used for structured data, and it
provides competitive predictive performance.
4. Voting Classifier:
- A Voting Classifier was created by combining the predictions of three base
classifiers: Decision Tree, Random Forest, and Gradient Boosting.
- The Voting Classifier achieved an accuracy of approximately 99.67% on the test
data, consistent with the other models.
- This ensemble approach leverages the strengths of individual classifiers, leading to
strong predictive performance.
Overall, the machine learning models in this analysis demonstrate high accuracy and
strong predictive capabilities. The Random Forest, XGBoost, and Voting Classifier
models perform exceptionally well with test accuracies around 99.67%. However, it's
important to consider the trade-off between accuracy and model interpretability, as
more complex models like Random Forest and XGBoost may be challenging to explain.
The choice of the best model depends on the specific goals of your application, such as
whether interpretability or pure predictive power is more critical. Additionally, it's
essential to evaluate the models for potential overfitting and consider other metrics such
as precision, recall, and F1 score, especially when dealing with imbalanced datasets or
applications with varying costs of false positives and false negatives.
CONCLUSION AND FUTURE WORK
Conclusion:
This project developed a loan eligibility prediction system that uses machine learning to automatically identify eligible loan applicants, reducing the manual effort and misjudgment involved in screening and shortening the time taken to sanction a loan. Among the models evaluated, Random Forest, XGBoost, and a hard-voting ensemble of Decision Tree, Random Forest, and Gradient Boosting classifiers all achieved test accuracies of approximately 99.67%, with correspondingly high precision, recall, and F1 scores. These results indicate that the system can reliably support both bank staff and applicants, though model interpretability and the risk of overfitting should be weighed when selecting a model for deployment.
Future Work:
The future work for the project "Loan Eligibility Prediction using Machine Learning"
can include the following aspects to further enhance the system and its impact:
1. Fairness and Bias Mitigation: Implement techniques and tools to address bias and fairness concerns in the lending process. Ensure that the machine learning models are fair and unbiased, promoting equitable access to credit for all applicants.
2. Interpretability and Explainability: Work on enhancing the interpretability of the machine learning models. Develop techniques to explain the model's decisions, making it more transparent for both applicants and regulatory authorities.
3. Risk Assessment and Fraud Detection: Expand the system's capabilities to not only predict loan eligibility but also to assess the risk associated with each loan application and detect potential fraud. This can help in minimizing default rates and fraudulent activities.
4. Scalability: Ensure that the system can handle a growing volume of loan applications as the business expands. Scalability is crucial to maintaining efficiency and responsiveness.
5. User-Friendly Interfaces: Develop user-friendly interfaces for both bank staff and loan applicants. A well-designed user interface can enhance the user experience and facilitate the adoption of the system.
6. Feedback Mechanism: Establish a feedback loop with bank staff and applicants to gather insights and feedback on the system's performance and user experience. Use this feedback to make iterative improvements.
7. Partnerships: Collaborate with other financial institutions, data providers, and fintech companies to exchange knowledge, data, and best practices in loan eligibility prediction and risk assessment.
The future work for this project should focus on leveraging the latest advancements in
machine learning and data science to create a more accurate, fair, and efficient loan
eligibility prediction system while ensuring compliance with regulatory requirements
and promoting financial inclusion.
REFERENCES
[1] Dorfleitner, G., Oswald, E. M., & Zhang, R. (2021). From Credit Risk to Social
Impact: On the Funding Determinants in Interest-Free Peer-to-Peer Lending. Journal of
Business Ethics, 170, 375–400.
[2] Kumar, S., Sharma, & Mahdavi, M. (2021). Machine Learning (ML) Technologies
for Digital Credit Scoring in Rural Finance: A Literature Review. Risks, 9(11), 192.
[3] Xu, J., Lu, Z., & Xie, Y. (2021). Loan default prediction of Chinese P2P market: a
machine learning methodology. Scientific Reports, 11(1), 1-19.
[4] Meshref, H. (2020). Predicting Loan Approval of Bank Direct Marketing Data
Using Ensemble Machine Learning Algorithms. International Journal of Circuits,
Systems, and Signal Processing, 14, 914-922. DOI: 10.46300/9106.2020.14.117
[5] Aphale, A. S., & Shinde, S. R. (2020). Predict Loan Approval in Banking System:
Machine Learning Approach for Cooperative Banks Loan Approval. International
Journal of Engineering Research & Technology (IJERT), 9, 991-995.
[6] Hussein, A. S., Li, T., Yohannese, C. W., & Bashir, K. (2019). A-SMOTE: A new
pre-processing approach for highly imbalanced datasets by improving SMOTE.
International Journal of Computational Intelligence Systems.
[7] Powers, D. M. (2020). Evaluation: From Precision, Recall and F-measure to ROC,
Informedness, Markedness, and Correlation. arXiv preprint arXiv:2010.16061.
[9] Saini. (2021). Logistic Regression: What is Logistic Regression and Why Do We
Need It?
[10] Shouman, M., Turner, T., & Stocker, R. (2012). Applying k-nearest neighbour in
diagnosing heart disease patients. International Journal of Information and Education
Technology, 2(3), 220-223.
[11] Turkson, R. E., Baagyere, E. Y., & Wenya, G. E. (2016). A machine learning
approach for predicting bank credit worthiness. 2016 Third International Conference on
Artificial Intelligence and Pattern Recognition (AIPR).
[12] Vaidya, A. (2017). Predictive and probabilistic approach using logistic regression:
Application to prediction of loan approval. 2017 8th International Conference on
Computing, Communication, and Networking Technologies (ICCCNT).
[13] Sheikh, M. A., Goel, A. K., & Kumar, T. (2020). An Approach for Prediction of
Loan Approval using Machine Learning Algorithm. 2020 International Conference on
Electronics and Sustainable Communication Systems (ICESC).