LOAN ELIGIBILITY PREDICTION USING
MACHINE LEARNING
by
November, 2023
DECLARATION
I hereby declare that the thesis entitled “LOAN ELIGIBILITY PREDICTION
USING MACHINE LEARNING” submitted by me, for the award of the degree of
M.Tech (Software Engineering) is a record of bonafide work carried out by me under
the supervision of Dr. Ranichandra C.
I further declare that the work reported in this thesis has not been submitted and
will not be submitted, either in part or in full, for the award of any other degree or
diploma in this institute or any other institute or university.
Place: Vellore
Date: Signature of the candidate
CERTIFICATE
This is to certify that the thesis entitled “LOAN ELIGIBILITY PREDICTION USING
MACHINE LEARNING” submitted by YERRAM KARTHIK (19MIS0424), School
of Computer Science Engineering and Information Systems, Vellore Institute of
Technology, Vellore for the award of the degree M.Tech (Software Engineering) is a
record of bonafide work carried out by him under my supervision.
The contents of this report have not been submitted and will not be submitted either in
part or in full, for the award of any other degree or diploma in this institute or any other
institute or university. The Project report fulfils the requirements and regulations of
VELLORE INSTITUTE OF TECHNOLOGY, VELLORE and in my opinion meets the
necessary standards for submission.
ABSTRACT

Banks make a major part of their profits through loans, and a large number of people apply for them. It is hard to select the genuine applicants who will repay the loan, and when the process is done manually, many mistakes can occur in selecting them. Therefore, we are developing a loan prediction system using machine learning so that the system automatically selects the eligible candidates. This is helpful to both bank staff and applicants, and the time taken to sanction a loan is drastically reduced. In this project we predict loan eligibility using several machine learning algorithms.
In the realm of financial technology, this project delves into the development of a Loan
Eligibility Prediction system using Machine Learning (ML) techniques. Leveraging a
diverse dataset encompassing applicant demographics, financial history, and credit-
related information, the system employs predictive models to assess an individual's
eligibility for a loan. Through rigorous data preprocessing, feature engineering, and the
utilization of various ML algorithms, the project aims to deliver a robust and accurate
prediction mechanism that empowers lenders and borrowers alike by streamlining the
loan approval process, enhancing risk assessment, and promoting financial inclusivity.
ACKNOWLEDGEMENT
Place: Vellore
Date: Yerram Karthik
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES
LIST OF ACRONYMS
CHAPTER 1
INTRODUCTION
1.1 BACKGROUND
1.2 MOTIVATION
1.3 PROBLEM STATEMENT
1.4 OBJECTIVE
1.5 SCOPE OF THE PROJECT
CHAPTER 2
LITERATURE SURVEY
2.1 SUMMARY OF THE EXISTING WORK
2.2 CHALLENGES PRESENT IN EXISTING SYSTEM
CHAPTER 3
REQUIREMENTS
3.1 HARDWARE REQUIREMENTS
3.2 SOFTWARE REQUIREMENTS
3.3 BUDGET
3.4 GANTT CHART
CHAPTER 4
ANALYSIS AND DESIGN
4.1 PROPOSED METHODOLOGY
4.2 SYSTEM ARCHITECTURE
4.3 MODULE DESCRIPTIONS
CHAPTER 5
IMPLEMENTATION AND TESTING
5.1 DATA SET
5.2 SAMPLE CODE
5.3 SAMPLE OUTPUT
5.4 TEST PLAN & DATA VERIFICATION
CHAPTER 6
RESULTS
6.1 RESEARCH FINDINGS
6.2 RESULT ANALYSIS & EVALUATION METRICS
CONCLUSION AND FUTURE WORK
REFERENCES
LIST OF FIGURES
S.No. Figure Page No.
1. Fig-1 [System Architecture] 15
2. Fig-2 [Count plot (Male vs Female)] 30
3. Fig-3 [Count plot (Graduate vs Not Graduate)] 31
4. Fig-4 [Count plot for credit history (Yes or No)] 31
5. Fig-5 [Pair plot] 32
6. Fig-6 [Histogram (Applicant Income vs Frequency)] 33
7. Fig-7 [ROC Curve] 45
8. Fig-8 [Recall Curve] 48
LIST OF TABLES
LIST OF ACRONYMS
INTRODUCTION
1.1 BACKGROUND
The concept of "Loan Eligibility Prediction using Machine Learning" is a
fundamental application of artificial intelligence and data science within the financial
sector. This application leverages the power of machine learning algorithms to assess
and predict an individual's or business's eligibility for obtaining a loan. The primary
goal is to streamline and automate the loan approval process, making it more efficient
and accurate for both financial institutions and applicants.
1.2 MOTIVATION
Loan eligibility prediction using machine learning is a game-changer in the
financial industry, offering the potential to revolutionize lending processes. By
harnessing the power of advanced algorithms and data analytics, it enables more
accurate and fair assessment of individuals' creditworthiness. This not only benefits
lenders by reducing default risks and improving decision-making, but it also extends
the opportunity for financial inclusion to a broader and more diverse population,
ultimately fostering economic growth and empowerment. With the ability to efficiently
and effectively determine loan eligibility, ML-driven solutions have the potential to
reshape the lending landscape, making access to credit more equitable and accessible
for all.
The adverse impact of low loan repayment rates on banks is a major issue. Bank
employees check applicants' details manually and grant loans to the eligible
applicants, and checking the details of every applicant takes a lot of time. Existing
loan eligibility models are unsatisfactory and make it difficult to establish a new
analysis model. Consequently, there has been increased demand for applying machine
learning methods to loan eligibility prediction because of their high performance.
1.4 OBJECTIVE
CHAPTER 2
LITERATURE SURVEY
3
SaiViswanadh Sarma, 2. Efficiency and 2. Algorithm
B. Sravani, Automation Bias
Nedunchezhian
4
8. Prediction Of Loan G.Murali Krishna, 1. Efficiency 1. Data Bias and
Eligibility of the V.Madhavi Improvement: Fairness Issues
Customer 2. Risk Mitigation: 2. Model
3. Data Privacy Accuracy
Concerns Dependency
3. Complexity
and Maintenance
5
2.2 CHALLENGES PRESENT IN EXISTING SYSTEM
The challenges present in the existing systems for loan eligibility
prediction using machine learning are:
8. Bias and Fairness Concerns: Ensuring that the loan eligibility
predictions are fair and unbiased is a recurring challenge. Discriminatory
or biased outcomes can lead to legal and ethical issues.
9. Complexity and Maintenance: Several systems highlight the
complexity of implementing and maintaining machine learning-based
solutions. Maintenance and ongoing updates are essential for keeping the
models accurate and relevant.
10. Risk Assessment and Transparency: Some systems emphasize the
importance of risk assessment and transparency. Balancing the need for
risk mitigation while maintaining transparency in decision-making is a
challenge.
It's important to note that these challenges may vary depending on the
specific system and the context in which it is applied. Addressing these
challenges is crucial to ensure that machine learning-based loan eligibility
prediction systems are accurate, fair, and compliant with regulations.
CHAPTER 3
REQUIREMENTS
3.3 BUDGET
Procured Items/Components for the Project Work Total Cost
3.4 GANTT CHART
Month 3: October 1, 2023 - November 1, 2023
CHAPTER 4
ANALYSIS AND DESIGN
4.1.5 Feature Encoding:
• In the context of loan eligibility prediction using machine learning, feature
encoding involves converting non-numeric data like customer categories (e.g.,
"student," "employed") into numerical representations (e.g., 0 or 1) so that
machine learning algorithms can analyse and make predictions based on these
features. Common methods include label encoding for ordinal data and one-hot
encoding for nominal data, ensuring that categorical information can be used
effectively in the prediction model.
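As a minimal sketch of the two encodings (using hypothetical columns rather than the project's full dataset):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame with one ordinal-style and one nominal column
df = pd.DataFrame({
    'Education': ['Graduate', 'Not Graduate', 'Graduate'],
    'Property_Area': ['Urban', 'Rural', 'Semiurban'],
})

# Label encoding: each category becomes an integer code
le = LabelEncoder()
df['Education'] = le.fit_transform(df['Education'])

# One-hot encoding: one binary indicator column per category
df = pd.get_dummies(df, columns=['Property_Area'])
print(df.columns.tolist())
```

`LabelEncoder` assigns codes in sorted order of the category names, while `get_dummies` replaces the nominal column with one 0/1 column per category, so no artificial ordering is imposed.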
4.2 SYSTEM ARCHITECTURE
4.3 MODULE DESCRIPTIONS
DATA COLLECTION:
The first job in the machine learning implementation is to find ways and
sources of collecting relevant and comprehensive data, then to interpret it
and analyse the results with the help of statistical techniques.
Kaggle link: https://www.kaggle.com/datasets/vikasukani/loan-eligible-
dataset
EXPLORATORY DATA ANALYSIS:
EDA is an approach of analysing data sets to summarize their main
characteristics, which often includes their data types, memory size
occupied, etc.,
DATA PRE-PROCESSING:
We loaded the dataset as a pandas DataFrame to process it and feed it into
the machine learning model. In this experiment we dropped the rows
containing null values.
DATA SPLITTING:
For each experiment, we split the entire dataset into a 70% training set and
a 30% test set. We used the training set for resampling, hyperparameter
tuning, and training the model, and we used the test set to evaluate the
performance of the trained model. While splitting the data, we specified a
random seed (any fixed number), which ensured the same data split every time
the program executed.
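The split described above can be reproduced with scikit-learn's `train_test_split`; the toy arrays here stand in for the real features and labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in feature matrix (100 samples, 2 features) and binary labels
X = np.arange(200).reshape(100, 2)
y = np.array([0, 1] * 50)

# 70/30 split; a fixed random_state gives the same partition on every run
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42)

print(len(X_train), len(X_test))  # 70 30
```

Because `random_state` is fixed, repeating the call produces byte-identical partitions, which is what makes the experiments reproducible.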
CHAPTER 5
IMPLEMENTATION AND TESTING
5.1 DATASET
Dataset Reference:
• https://www.kaggle.com/datasets/vikasukani/loan-eligible-dataset
Dataset Screenshot:
5.2 SAMPLE CODE
import seaborn as sns
import pandas as pd
import warnings
warnings.filterwarnings("ignore")
import matplotlib.pyplot as plt
import pickle
from sklearn.ensemble import VotingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import f1_score
df = pd.read_csv(r"C:\Users\yerra\Music\Loan eligibility prediction\CODING\back end\loan.csv")
EXPLANATION:
1. `import seaborn as sns`: This line imports the Seaborn library, which is a data
visualization library based on Matplotlib. It is often used to create attractive and
informative statistical graphics.
2. `import pandas as pd`: This line imports the Pandas library, which is a popular
data manipulation and analysis library in Python. It allows you to work with
structured data, such as data in tables or data frames.
3. `import warnings`: This line imports the warnings module, which is used to
control the behavior of warning messages in Python.
4. `warnings.filterwarnings("ignore")`: This line sets up a filter to ignore warning
messages in the code. It suppresses warning messages that may otherwise be
displayed during the execution of your code.
5. `import matplotlib.pyplot as plt`: This line imports the Matplotlib library,
specifically the pyplot module, which is used for creating various types of plots and
charts in Python.
6. `import pickle`: This line imports the pickle module, which is used for serializing
and deserializing Python objects. It allows you to save and load data structures and
models.
7. `from sklearn.ensemble import VotingClassifier`: This line imports the
`VotingClassifier` class from the scikit-learn library, which is used to create an
ensemble model that combines multiple machine learning classifiers to make
predictions.
8. `from sklearn.tree import DecisionTreeClassifier`: This line imports the
`DecisionTreeClassifier` class from scikit-learn, which is used to create decision tree
models for classification tasks.
9. `from sklearn.linear_model import LogisticRegression`: This line imports the
`LogisticRegression` class from scikit-learn, which is used to create logistic
regression models for binary classification.
10. `from sklearn.naive_bayes import GaussianNB`: This line imports the
`GaussianNB` class from scikit-learn, which is used to create Naive Bayes models for
classification tasks, assuming Gaussian-distributed features.
11. `from sklearn.metrics import f1_score`: This line imports the `f1_score`
function from scikit-learn, which is a metric used to evaluate the performance of
classification models.
12. `df = pd.read_csv(r"C:\Users\yerra\Music\Loan eligibility
prediction\CODING\back end\loan.csv")`: This line reads a CSV file named
"loan.csv" located at the specified file path and loads its data into a Pandas DataFrame
called `df`. The `r` before the path is used to treat the string as a raw string, which can
be helpful when dealing with backslashes in file paths.
Each line of code is responsible for importing libraries, setting up warning filters, and
loading data for your loan eligibility prediction project.
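To illustrate the raw-string point with a hypothetical path (not the author's actual file location):

```python
# Both spellings name the same file; the raw string avoids
# having to double every backslash by hand.
escaped = "C:\\data\\loan.csv"
raw = r"C:\data\loan.csv"
print(escaped == raw)  # True
```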
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614 entries, 0 to 613
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Loan_ID 614 non-null object
1 Gender 601 non-null object
2 Married 611 non-null object
3 Dependents 599 non-null object
4 Education 614 non-null object
5 Self_Employed 582 non-null object
6 ApplicantIncome 614 non-null int64
7 CoapplicantIncome 614 non-null float64
8 LoanAmount 592 non-null float64
9 Loan_Amount_Term 600 non-null float64
10 Credit_History 564 non-null float64
11 Property_Area 614 non-null object
12 Loan_Status 614 non-null object
dtypes: float64(4), int64(1), object(8)
memory usage: 62.5+ KB
df
[Truncated DataFrame output: 614 rows × 13 columns]
df['Dependents']= pd.to_numeric(df['Dependents'],errors='coerce')
df
[Truncated DataFrame output; Dependents is now numeric, with NaN where coercion failed, e.g. row 610]
df.style.highlight_null(null_color='red')
<pandas.io.formats.style.Styler at 0x2cc81688f50>
df.isnull().sum()
Loan_ID 0
Gender 13
Married 3
Dependents 66
Education 0
Self_Employed 32
ApplicantIncome 0
CoapplicantIncome 0
LoanAmount 22
Loan_Amount_Term 14
Credit_History 50
Property_Area 0
Loan_Status 0
dtype: int64
df = df.dropna()
EXPLANATION:
1. `df = df.dropna()`: This line of code is using the Pandas DataFrame `df` and the
`dropna()` method to remove rows with missing (NaN) values from the DataFrame.
By calling `dropna()` without any arguments, it drops any row in the DataFrame
where at least one column has a missing value (NaN). After executing this line, the
DataFrame `df` will contain only the rows with complete data, and any rows with
missing values will be removed.
2. `df`: This line simply displays the modified DataFrame `df`, showing it with
the missing-value rows removed.
So this code cleans the DataFrame by removing rows with missing data; note that
`dropna()` returns a new DataFrame that is assigned back to `df`, rather than
modifying the original in place.
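A small sketch of this behaviour: `dropna()` returns a new DataFrame, so the result must be assigned back (or `inplace=True` used) for the cleaning to take effect:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'LoanAmount': [128.0, np.nan, 120.0]})

cleaned = df.dropna()         # new DataFrame; df itself is unchanged
print(len(df), len(cleaned))  # 3 2

df = df.dropna()              # assign back, as the project's code does
print(len(df))                # 2
```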
df
[Truncated DataFrame output after dropna; 439 rows remain]
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 439 entries, 1 to 613
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Loan_ID 439 non-null object
1 Gender 439 non-null object
2 Married 439 non-null object
3 Dependents 439 non-null float64
4 Education 439 non-null object
5 Self_Employed 439 non-null object
6 ApplicantIncome 439 non-null int64
7 CoapplicantIncome 439 non-null float64
8 LoanAmount 439 non-null float64
9 Loan_Amount_Term 439 non-null float64
10 Credit_History 439 non-null float64
11 Property_Area 439 non-null object
12 Loan_Status 439 non-null object
dtypes: float64(5), int64(1), object(7)
memory usage: 48.0+ KB
df.reset_index(inplace = True)
df
[Truncated DataFrame output after reset_index: income and loan-amount columns of the 439 remaining rows]
df = df.drop('index', axis=1)
df
[Truncated DataFrame output after dropping the old index column]
Fig 3 – Count plot [Graduate vs Not Graduate]
# Select numerical columns for comparison
numerical_columns = ['ApplicantIncome', 'CoapplicantIncome',
'LoanAmount', 'Loan_Amount_Term']
# Create a pairplot
sns.pairplot(df, vars=numerical_columns, diag_kind='kde')
plt.show()
Fig 6 – Histogram [ApplicantIncome Vs Frequency]
df = df.drop('Loan_ID', axis=1)
df
[Truncated DataFrame output after dropping Loan_ID]
from sklearn.preprocessing import LabelEncoder
ley = LabelEncoder()
df['Gender'] = ley.fit_transform(df['Gender'])
df['Married'] = ley.fit_transform(df['Married'])
df['Education'] = ley.fit_transform(df['Education'])
df['Self_Employed'] = ley.fit_transform(df['Self_Employed'])
df['Property_Area'] = ley.fit_transform(df['Property_Area'])
df['Loan_Status'] = ley.fit_transform(df['Loan_Status'])
df
[Truncated DataFrame output: 439 rows × 12 columns, with all categorical columns label-encoded]
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 439 entries, 0 to 438
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Gender 439 non-null int32
1 Married 439 non-null int32
2 Dependents 439 non-null float64
3 Education 439 non-null int32
4 Self_Employed 439 non-null int32
5 ApplicantIncome 439 non-null int64
6 CoapplicantIncome 439 non-null float64
7 LoanAmount 439 non-null float64
8 Loan_Amount_Term 439 non-null float64
9 Credit_History 439 non-null float64
10 Property_Area 439 non-null int32
11 Loan_Status 439 non-null int32
dtypes: float64(5), int32(6), int64(1)
memory usage: 31.0 KB
df.shape
(439, 12)
df.corr()
Loan_Status
Gender 0.073219
Married 0.124946
Dependents 0.055115
Education -0.077345
Self_Employed -0.060529
ApplicantIncome -0.023656
CoapplicantIncome -0.021223
LoanAmount -0.059680
Loan_Amount_Term -0.009306
Credit_History 0.531467
Property_Area 0.032841
Loan_Status 1.000000
correlation_matrix = df.corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.show()
from sklearn.utils import resample
dfmin = df[df['Loan_Status'] == 1]
dfmax = df[df['Loan_Status'] == 0]
# resample each class to 1000 rows (parameters assumed from the counts below)
dfminu = resample(dfmin, replace=True, n_samples=1000, random_state=42)
dfmaxd = resample(dfmax, replace=True, n_samples=1000, random_state=42)
df_dsampled = pd.concat([dfminu, dfmaxd])
df_dsampled['Loan_Status'].value_counts()
1 1000
0 1000
Name: Loan_Status, dtype: int64
y = df_dsampled['Loan_Status']
X = df_dsampled.drop('Loan_Status', axis = 1)
[Truncated output of X and y after resampling; the balanced data contains 2000 rows]
X_train
[Truncated X_train output: 1400 rows]
y_train
95 0
225 1
216 1
74 1
362 1
..
190 1
254 0
26 1
270 1
31 1
Name: Loan_Status, Length: 1400, dtype: int32
y_test
143 0
23 0
238 0
77 1
335 0
..
371 0
221 0
281 1
424 1
169 1
Name: Loan_Status, Length: 600, dtype: int32
X_test
[Truncated X_test output: 600 rows]
X_test.to_csv(r'C:\Users\yerra\Music\Loan eligibility prediction\CODING\test.csv', index=False)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

k = range(1, 30, 2)  # candidate values of k (assumed; the original definition is not shown)
train_auc = []
test_auc = []
for i in k:
    clf = KNeighborsClassifier(n_neighbors=i, algorithm='brute')
    clf.fit(X_train, y_train)
    prob_cv = clf.predict(X_train)

# Final model; the value of k chosen in the original is not shown, 5 is assumed
knn = KNeighborsClassifier(n_neighbors=5, algorithm='brute')
knn.fit(X_train, y_train)
#pickle.dump(knn, open(r'C:\Users\ST-0008\Documents\santhosh\Loan eligibility prediction\KN.pkl', 'wb'))
pred_test = knn.predict(X_test)
test_accuracy = accuracy_score(y_test, pred_test)
test_accuracy
0.9416666666666667
class_label = ['not eligible', 'eligible']  # defined here as it is later in the report
predicted = knn.predict(X_test[:20])
pred = []
for j in predicted:
    pred.append(class_label[j])
original_Classlabel predicted_classlabel
0 not eligible eligible
1 not eligible not eligible
2 not eligible not eligible
3 eligible not eligible
4 not eligible not eligible
5 eligible eligible
6 not eligible not eligible
7 eligible eligible
8 not eligible not eligible
9 eligible eligible
10 eligible eligible
11 not eligible not eligible
12 eligible eligible
13 eligible eligible
14 not eligible not eligible
15 not eligible not eligible
16 eligible eligible
17 not eligible not eligible
18 eligible eligible
19 eligible eligible
[Classification report excerpt: class 1 precision 0.97, recall 0.91, f1-score 0.94, support 300]
from sklearn.metrics import roc_curve, roc_auc_score
fpr, tpr, thresholds = roc_curve(y_test, pred_test)
auc = roc_auc_score(y_test, pred_test)
plt.plot(fpr, tpr)
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title(f"ROC Curve (AUC = {auc:.2f})")
plt.show()
Fig 7 – ROC Curve
from sklearn.metrics import precision_recall_curve, average_precision_score
precision, recall, thresholds = precision_recall_curve(y_test, pred_test)
average_precision = average_precision_score(y_test, pred_test)
plt.plot(recall, precision)
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title(f"Precision-Recall Curve (AP = {average_precision:.2f})")
plt.show()
EXPLANATION:
This code is used to create and visualize a Precision-Recall curve, a common tool for
evaluating the performance of binary classification models. Here's an explanation of
each part of the code:
2. `precision, recall, thresholds = precision_recall_curve(y_test, pred_test)`: This
line calculates the precision, recall, and thresholds using the `precision_recall_curve`
function. It requires two arguments: `y_test`, which is the true labels for the test dataset,
and `pred_test`, which is the predicted probabilities or scores generated by your
classifier. The function returns arrays of precision, recall, and threshold values.
5. `plt.xlabel("Recall")`: This line sets the label for the x-axis to "Recall," indicating
that the x-axis represents the recall values.
6. `plt.ylabel("Precision")`: This line sets the label for the y-axis to "Precision,"
indicating that the y-axis represents the precision values.
8. `plt.show()`: This line displays the Precision-Recall curve with the specified title, x-
axis label, and y-axis label in a graphical window or output.
In summary, this code computes and visualizes a Precision-Recall curve to assess the
performance of a binary classification model, and it provides a summary metric
(average precision) for that model's performance. The curve shows how the model's
precision and recall change at different classification thresholds, which can be crucial
for model evaluation and decision-making.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score
# search grids (assumed values; the original definitions of n_estimators and dept are not shown)
n_estimators = [10, 30, 60, 100]
dept = [100, 300, 600, 900]
param_grid = {'n_estimators': n_estimators, 'max_depth': dept}
clf = RandomForestClassifier()
model = GridSearchCV(clf, param_grid, scoring='accuracy', n_jobs=-1, cv=3)
model.fit(X_train, y_train)
print("optimal n_estimators", model.best_estimator_.n_estimators)
print("optimal max_depth", model.best_estimator_.max_depth)
optimal_max_depth = model.best_estimator_.max_depth
optimal_n_estimators = model.best_estimator_.n_estimators
optimal n_estimators 60
optimal max_depth 600
clf = RandomForestClassifier(max_depth =
optimal_max_depth,n_estimators = optimal_n_estimators)
clf.fit(X_train,y_train)
import pickle
#pickle.dump(clf,open(r'C:\Users\ST-0008\Documents\santhosh\Loan
eligibility prediction\RF.pkl','wb'))
y_pred=clf.predict(X_test)
accuracy = accuracy_score(y_test,y_pred)
accuracy
0.9966666666666667
from sklearn.metrics import classification_report, confusion_matrix
print(classification_report(y_test, y_pred))
sns.heatmap(confusion_matrix(y_test, y_pred), annot=True, fmt='.2f')
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.metrics import precision_recall_curve, average_precision_score
import matplotlib.pyplot as plt
fpr, tpr, thresholds = roc_curve(y_test, y_pred)
auc = roc_auc_score(y_test, y_pred)
plt.plot(fpr, tpr)
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title(f"ROC Curve (AUC = {auc:.2f})")
plt.show()
precision, recall, thresholds = precision_recall_curve(y_test, y_pred)
average_precision = average_precision_score(y_test, y_pred)
plt.plot(recall, precision)
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title(f"Precision-Recall Curve (AP = {average_precision:.2f})")
plt.show()
class_label = ['not eligible','eligible']
original = []
for i in y_test[:20]:
original.append(class_label[i])
predicted = clf.predict(X_test[:20])
pred = []
for j in predicted:
pred.append(class_label[j])
original_Classlabel predicted_classlabel
0 not eligible not eligible
1 not eligible not eligible
2 not eligible not eligible
3 eligible eligible
4 not eligible not eligible
5 eligible eligible
6 not eligible not eligible
7 eligible eligible
8 not eligible not eligible
9 eligible eligible
10 eligible eligible
11 not eligible not eligible
12 eligible eligible
13 eligible eligible
14 not eligible not eligible
15 not eligible not eligible
16 eligible eligible
17 not eligible not eligible
18 eligible eligible
19 eligible eligible
import xgboost as xgb
model = xgb.XGBClassifier(n_estimators=1000, learning_rate=0.04, random_state=1)
model.fit(X_train, y_train)
import pickle
filename = r'C:\Users\yerra\Music\Loan eligibility prediction\CODING\front end\X_gb_loan.pkl'
pickle.dump(model, open(filename, 'wb'))
pred_test4 =model.predict(X_test)
test_accuracy4 = accuracy_score(y_test, pred_test4)
pred_train = model.predict(X_train)
train_accuracy4 = accuracy_score(y_train, pred_train)
print("AUC on Test data is " + str(accuracy_score(y_test, pred_test4)))
print("AUC on Train data is " + str(accuracy_score(y_train, pred_train)))
print("---------------------------")
# Code for drawing seaborn heatmaps
class_names =['0','1']
df_heatmap = pd.DataFrame(confusion_matrix(y_test,
pred_test4.round()), index=class_names, columns=class_names )
fig = plt.figure( )
heatmap = sns.heatmap(df_heatmap, annot=True, fmt="d")
EXPLANATION:
The code trains an XGBoost (Extreme Gradient Boosting) classifier, saves the trained
model to a file using pickle, evaluates the model's performance on both the test and
training datasets, and visualizes a confusion matrix heatmap. Let's break down the
code step by step:
1. `import xgboost as xgb`: This line imports the XGBoost library, which is a
popular gradient boosting framework used for machine learning tasks.
2. `from sklearn.metrics import accuracy_score`: This line imports the
`accuracy_score` function from scikit-learn, which is used to calculate the accuracy of
a classification model.
3. `model = xgb.XGBClassifier(n_estimators=1000, learning_rate=0.04,
random_state=1)`: This line initializes an XGBoost classifier model with the
specified hyperparameters. It sets the number of estimators (trees) to 1000, learning
rate to 0.04, and random state to 1 for reproducibility.
4. `model.fit(X_train, y_train)`: This line fits (trains) the XGBoost model using the
training data `X_train` and the corresponding labels `y_train`.
5. `import pickle`: This line imports the `pickle` module for serializing and
deserializing Python objects.
6. `filename = r'C:\Users\yerra\Music\Loan eligibility prediction\CODING\front
end\X_gb_loan.pkl'`: This line defines a file path where the trained XGBoost model
will be saved using pickle.
7. `pickle.dump(model, open(filename, 'wb'))`: This line saves the trained XGBoost
model to the specified file.
8. `pred_test4 = model.predict(X_test)`: This line uses the trained model to make
predictions on the test dataset and stores the predicted labels in `pred_test4`.
9. `test_accuracy4 = accuracy_score(y_test, pred_test4)`: This line calculates the
accuracy of the model on the test dataset by comparing the predicted labels
(`pred_test4`) to the true labels (`y_test`).
10. `pred_train = model.predict(X_train)`: This line uses the trained model to make
predictions on the training dataset and stores the predicted labels in `pred_train`.
11. `train_accuracy4 = accuracy_score(y_train, pred_train)`: This line calculates
the accuracy of the model on the training dataset.
12. `print("AUC on Test data is " + str(accuracy_score(y_test, pred_test4)))`:
This line prints the accuracy of the model on the test dataset.
13. `print("AUC on Train data is " + str(accuracy_score(y_train, pred_train)))`:
This line prints the accuracy of the model on the training dataset.
14. The code for drawing the confusion matrix heatmap is as follows:
- `class_names = ['0', '1']`: This line defines class names for the confusion matrix.
- `df_heatmap = pd.DataFrame(confusion_matrix(y_test, pred_test4.round()),
index=class_names, columns=class_names)`: This line calculates the confusion
matrix using the `confusion_matrix` function and stores it in a Pandas DataFrame. It
uses the true labels `y_test` and the predicted labels `pred_test4`. The `round()`
function is applied to the predicted labels to ensure they are integers.
- `fig = plt.figure()`: This line creates a new Matplotlib figure.
15. `heatmap = sns.heatmap(df_heatmap, annot=True, fmt="d")`: This line
generates a heatmap of the confusion matrix using Seaborn. It annotates the cells with
values and specifies the format as integers.
Overall, the code trains an XGBoost classifier, evaluates its accuracy on both the test
and training datasets, and provides a visualization of the confusion matrix heatmap to
assess its performance.
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
#Voting Classifier
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)
from sklearn.ensemble import GradientBoostingClassifier
# Create individual classifiers
classifier1 = DecisionTreeClassifier()
classifier2 = RandomForestClassifier()
classifier3 = GradientBoostingClassifier()
voting_classifier = VotingClassifier(estimators=[('clf1', classifier1), ('clf2', classifier2), ('clf3', classifier3)], voting='hard')
voting_classifier.fit(X_train, y_train)
y_pred = voting_classifier.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)*100
print(f'Accuracy: {accuracy}%')
precision = precision_score(y_test, y_pred)*100
print(f'Precision: {precision}%')
f1 = f1_score(y_test, y_pred)*100
print(f'F1 Score: {f1}%')
Accuracy: 99.66666666666667%
Precision: 100.0%
F1 Score: 99.66555183946488%
Confusion Matrix:
[[300 0]
[ 2 298]]
Recall: 99.33333333333333%
EXPLANATION:
This code demonstrates the use of a Voting Classifier in scikit-learn, which combines
the predictions of multiple base classifiers to make a final decision. It then evaluates
the performance of the Voting Classifier on a test dataset. Here's an explanation of each
part of the code:
1. `from sklearn.metrics import accuracy_score, f1_score, precision_score,
recall_score, classification_report, confusion_matrix`: This line imports various
metrics for model evaluation from scikit-learn.
2. Splitting the Data:
- `X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y,
random_state=42)`: This code splits the dataset into training and test sets using the
`train_test_split` function. It uses 70% of the data for training (`X_train` and `y_train`)
and 30% for testing (`X_test` and `y_test`). The `stratify` parameter ensures that the
class distribution is preserved in the split, and `random_state` is set for reproducibility.
3. Creating Individual Classifiers:
- Three individual classifiers are created: `classifier1`, `classifier2`, and `classifier3`.
These classifiers are a Decision Tree, Random Forest, and Gradient Boosting Classifier,
respectively.
4. Creating a Voting Classifier:
- `voting_classifier = VotingClassifier(estimators=[('clf1', classifier1), ('clf2',
classifier2), ('clf3', classifier3)], voting='hard')`: This code creates a Voting Classifier
(`voting_classifier`) that combines the three individual classifiers using a "hard" voting
scheme, where the final prediction is based on a majority vote.
5. Fitting the Voting Classifier:
- `voting_classifier.fit(X_train, y_train)`: This line fits (trains) the Voting Classifier
on the training data (`X_train` and `y_train`).
6. Making Predictions:
- `y_pred = voting_classifier.predict(X_test)`: It uses the trained Voting Classifier to
make predictions on the test data (`X_test`) and stores the predicted labels in `y_pred`.
7. Calculating and Printing Metrics:
- The code calculates and prints several evaluation metrics for the Voting
Classifier:
- Accuracy: It measures the percentage of correctly predicted labels in the test
dataset.
- Precision: It quantifies the proportion of true positive predictions among all
positive predictions.
- F1 Score: It combines precision and recall into a single metric to assess the model's
overall performance.
- Confusion Matrix: It shows the counts of true positives, true negatives, false
positives, and false negatives.
- Recall (Sensitivity): It calculates the percentage of true positives among all actual
positive cases.
These metrics collectively provide insights into the performance of the Voting
Classifier, indicating that it has high accuracy, precision, and recall, with a balanced F1
score. The confusion matrix also highlights the number of correctly and incorrectly
classified instances in each class.
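The seven steps above can be consolidated into a single runnable sketch. Here `make_classification` is used as a synthetic stand-in for the project's loan dataset (an assumption for illustration only); the split parameters, base classifiers, and hard-voting configuration mirror the description above:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (RandomForestClassifier, GradientBoostingClassifier,
                              VotingClassifier)
from sklearn.metrics import (accuracy_score, precision_score, f1_score,
                             recall_score, confusion_matrix)

# Synthetic stand-in for the loan dataset (illustrative assumption)
X, y = make_classification(n_samples=600, n_features=10, random_state=42)

# Step 2: 70/30 stratified split with a fixed random_state for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

# Steps 3-4: three base classifiers combined by hard (majority) voting
classifier1 = DecisionTreeClassifier(random_state=42)
classifier2 = RandomForestClassifier(random_state=42)
classifier3 = GradientBoostingClassifier(random_state=42)
voting_classifier = VotingClassifier(
    estimators=[('clf1', classifier1), ('clf2', classifier2), ('clf3', classifier3)],
    voting='hard')

# Steps 5-7: fit, predict, and report the evaluation metrics
voting_classifier.fit(X_train, y_train)
y_pred = voting_classifier.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("F1 Score:", f1_score(y_test, y_pred))
print("Recall:", recall_score(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
```

With hard voting, each base classifier casts one vote per test sample and the majority label wins; with three estimators there can be no tie on a binary problem.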
import pickle

# Persist the trained ensemble for use by the front end
filename = r'C:\Users\yerra\Music\Loan eligibility prediction\CODING\front end\voting_loan.pkl'
with open(filename, 'wb') as f:
    pickle.dump(voting_classifier, f)
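The front end can later reload the saved model with `pickle.load`. A minimal round-trip sketch, using a portable temporary path and a small stand-in model rather than the project's absolute Windows path and trained ensemble (both assumptions for illustration):

```python
import os
import pickle
import tempfile
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Small stand-in for voting_classifier (illustrative)
X, y = make_classification(n_samples=100, n_features=5, random_state=0)
model = DecisionTreeClassifier(random_state=0).fit(X, y)

# Serialize, then deserialize and verify the round trip
path = os.path.join(tempfile.gettempdir(), 'voting_loan.pkl')
with open(path, 'wb') as f:
    pickle.dump(model, f)
with open(path, 'rb') as f:
    loaded = pickle.load(f)

# The reloaded model reproduces the original model's predictions
assert (loaded.predict(X) == model.predict(X)).all()
```

Note that unpickling executes arbitrary code from the file, so a production front end should only load model files from a trusted location.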
5.3 SAMPLE OUTPUT
Test plan:
1. Objective: Define the objective of the test. In this case, it is splitting the dataset into training and testing sets for machine learning model evaluation.
2. Steps:
Data Verification:
1. Data Integrity: Ensure that the dataset (X) and target variable (y) are correctly loaded and free from errors or inconsistencies. You may want to perform data cleaning and preprocessing steps before the split.
2. Check Split Size: Verify that the actual split sizes of the training and testing sets match the specified test_size. You can print the lengths of X_train, X_test, y_train, and y_test to confirm this.
3. Class Distribution (if using stratify): If you use stratify=y, check whether the class distribution in the target variable is preserved in both the training and testing sets. You can do this by counting the occurrences of each class in y_train and y_test.
4. Reproducibility: Ensure that the use of random_state provides consistent and reproducible results when re-running the code. Check whether multiple runs of the code produce the same split.
5. Data Exploration: Optionally, perform exploratory data analysis on the training and testing sets to understand the characteristics of the data, such as feature distributions and class imbalances.
6. Data Quality: Verify that the data is of high quality and suitable for the machine learning task. Address any issues with missing values, outliers, or anomalies if necessary.
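Checks 2-4 of the data verification steps above can be automated. A minimal sketch, with a synthetic imbalanced dataset standing in for the real loan data (an assumption for illustration):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic imbalanced dataset (illustrative stand-in for the loan data)
X, y = make_classification(n_samples=1000, n_features=8,
                           weights=[0.7, 0.3], random_state=42)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

# Check split size: 70/30 split of 1000 samples gives 700 train, 300 test
assert len(X_train) == 700 and len(X_test) == 300

# Check class distribution: stratify=y keeps the class ratio nearly identical
train_pos_ratio = np.mean(y_train)
test_pos_ratio = np.mean(y_test)
assert abs(train_pos_ratio - test_pos_ratio) < 0.02

# Check reproducibility: the same random_state yields the same split
X_train2, _, _, _ = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)
assert np.array_equal(X_train, X_train2)

print("All split verification checks passed.")
```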
CHAPTER 6
RESULTS
In the realm of loan eligibility prediction using machine learning within the
banking sector, the research findings indicate several common trends and challenges.
The merits of these studies consistently highlight improvements in efficiency and
accuracy in loan approval and risk assessment. Researchers have also leveraged diverse
machine learning techniques, such as random forest models, to enhance loan eligibility
assessment. These advancements offer potential benefits in automating decision-
making processes and mitigating human bias. However, data privacy and security
concerns loom as a prominent demerit, emphasizing the need for robust data protection
measures. Additionally, the complexity and interpretability of machine learning models,
as well as their dependence on data quality, present challenges in implementing these
solutions effectively.
A recurrent theme across the research is the trade-off between model accuracy
and interpretability, reflecting the banking sector's need for both transparency and
predictive power. Furthermore, concerns about bias and fairness in loan eligibility
decisions have been acknowledged, hinting at the ethical dimensions of these
algorithms. As the field continues to evolve, it becomes clear that a delicate balance
must be struck between model sophistication, data quality, and ethical considerations
to harness the full potential of machine learning for improving loan eligibility
assessment in the banking sector.
2. Random Forest:
- Random Forest was tuned using GridSearchCV to find the optimal hyperparameters,
resulting in a max depth of 45 and 60 estimators.
- It achieved an impressive accuracy of approximately 99.67% on the test data,
indicating a strong predictive capability.
- Random Forest is a robust ensemble method known for handling complex
relationships in the data, but it may lack interpretability.
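The tuning step described above can be sketched as follows. The grid values are illustrative choices bracketing the reported optimum (max_depth of 45, 60 estimators), and a synthetic dataset again stands in for the loan data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the loan dataset (illustrative assumption)
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

# Illustrative grid bracketing the values reported in the thesis
param_grid = {'max_depth': [15, 30, 45], 'n_estimators': [30, 60, 90]}

# GridSearchCV exhaustively cross-validates every parameter combination
grid = GridSearchCV(RandomForestClassifier(random_state=42),
                    param_grid, cv=5, scoring='accuracy')
grid.fit(X_train, y_train)

print("Best parameters:", grid.best_params_)
print("Test accuracy:", grid.score(X_test, y_test))
```

After fitting, `grid.best_estimator_` is a Random Forest refit on the full training set with the winning parameters, so it can be used directly for prediction.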
3. XGBoost:
- XGBoost was used with 1000 estimators and a learning rate of 0.04, and it achieved
high accuracy.
- The test accuracy was approximately 99.67%, which is consistent with the Random
Forest model.
- XGBoost is a powerful boosting algorithm, often used for structured data, and it
provides competitive predictive performance.
4. Voting Classifier:
- A Voting Classifier was created by combining the predictions of three base
classifiers: Decision Tree, Random Forest, and Gradient Boosting.
- The Voting Classifier achieved an accuracy of approximately 99.67% on the test
data, consistent with the other models.
- This ensemble approach leverages the strengths of individual classifiers, leading to
strong predictive performance.
Overall, the machine learning models in this analysis demonstrate high accuracy and
strong predictive capabilities. The Random Forest, XGBoost, and Voting Classifier
models perform exceptionally well with test accuracies around 99.67%. However, it's
important to consider the trade-off between accuracy and model interpretability, as
more complex models like Random Forest and XGBoost may be challenging to explain.
The choice of the best model depends on the specific goals of your application, such as
whether interpretability or pure predictive power is more critical. Additionally, it's
essential to evaluate the models for potential overfitting and consider other metrics such
as precision, recall, and F1 score, especially when dealing with imbalanced datasets or
applications with varying costs of false positives and false negatives.
CONCLUSION AND FUTURE WORK
Conclusion:
This project developed a loan eligibility prediction system that uses machine learning to automatically identify eligible loan applicants, reducing the manual effort and misjudgment involved in screening and shortening the time taken to sanction a loan. Among the models evaluated, Random Forest, XGBoost, and a hard-voting ensemble of Decision Tree, Random Forest, and Gradient Boosting classifiers all achieved test accuracies of approximately 99.67%, with correspondingly high precision, recall, and F1 scores. These results indicate that the system can reliably support both bank staff and applicants, though model interpretability and the risk of overfitting should be weighed when selecting a model for deployment.
Future Work:
The future work for the project "Loan Eligibility Prediction using Machine Learning"
can include the following aspects to further enhance the system and its impact:
1. Fairness and Bias Mitigation: Implement techniques and tools to address bias and fairness concerns in the lending process. Ensure that the machine learning models are fair and unbiased, promoting equitable access to credit for all applicants.
2. Interpretability and Explainability: Work on enhancing the interpretability of the machine learning models. Develop techniques to explain the model's decisions, making it more transparent for both applicants and regulatory authorities.
3. Risk Assessment and Fraud Detection: Expand the system's capabilities to not only predict loan eligibility but also to assess the risk associated with each loan application and detect potential fraud. This can help in minimizing default rates and fraudulent activities.
4. Scalability: Ensure that the system can handle a growing volume of loan applications as the business expands. Scalability is crucial to maintaining efficiency and responsiveness.
5. User-Friendly Interfaces: Develop user-friendly interfaces for both bank staff and loan applicants. A well-designed user interface can enhance the user experience and facilitate the adoption of the system.
6. Feedback Mechanism: Establish a feedback loop with bank staff and applicants to gather insights and feedback on the system's performance and user experience. Use this feedback to make iterative improvements.
7. Partnerships: Collaborate with other financial institutions, data providers, and fintech companies to exchange knowledge, data, and best practices in loan eligibility prediction and risk assessment.
The future work for this project should focus on leveraging the latest advancements in
machine learning and data science to create a more accurate, fair, and efficient loan
eligibility prediction system while ensuring compliance with regulatory requirements
and promoting financial inclusion.
REFERENCES
[1] Dorfleitner, G., Oswald, E. M., & Zhang, R. (2021). From Credit Risk to Social
Impact: On the Funding Determinants in Interest-Free Peer-to-Peer Lending. Journal of
Business Ethics, 170, 375–400.
[2] Kumar, S., Sharma, & Mahdavi, M. (2021). Machine Learning (ML) Technologies
for Digital Credit Scoring in Rural Finance: A Literature Review. Risks, 9(11), 192.
[3] Xu, J., Lu, Z., & Xie, Y. (2021). Loan default prediction of Chinese P2P market: a
machine learning methodology. Scientific Reports, 11(1), 1-19.
[4] Meshref, H. (2020). Predicting Loan Approval of Bank Direct Marketing Data
Using Ensemble Machine Learning Algorithms. International Journal of Circuits,
Systems, and Signal Processing, 14, 914-922. DOI: 10.46300/9106.2020.14.117
[5] Aphale, A. S., & Shinde, S. R. (2020). Predict Loan Approval in Banking System:
Machine Learning Approach for Cooperative Banks Loan Approval. International
Journal of Engineering Research & Technology (IJERT), 9, 991-995.
[6] Hussein, A. S., Li, T., Yohannese, C. W., & Bashir, K. (2019). A-SMOTE: A new
pre-processing approach for highly imbalanced datasets by improving SMOTE.
International Journal of Computational Intelligence Systems.
[7] Powers, D. M. (2020). Evaluation: From Precision, Recall and F-measure to ROC,
Informedness, Markedness, and Correlation. arXiv preprint arXiv:2010.16061.
[9] Saini. (2021). Logistic Regression: What is Logistic Regression and Why Do We
Need It?
[10] Shouman, M., Turner, T., & Stocker, R. (2012). Applying k-nearest neighbour in
diagnosing heart disease patients. International Journal of Information and Education
Technology, 2(3), 220-223.
[11] Turkson, R. E., Baagyere, E. Y., & Wenya, G. E. (2016). A machine learning
approach for predicting bank credit worthiness. 2016 Third International Conference on
Artificial Intelligence and Pattern Recognition (AIPR).
[12] Vaidya, A. (2017). Predictive and probabilistic approach using logistic regression:
Application to prediction of loan approval. 2017 8th International Conference on
Computing, Communication, and Networking Technologies (ICCCNT).
[13] Sheikh, M. A., Goel, A. K., & Kumar, T. (2020). An Approach for Prediction of
Loan Approval using Machine Learning Algorithm. 2020 International Conference on
Electronics and Sustainable Communication Systems (ICESC).