Experiment – 3
Aim:
Write a program in Python to predict whether a loan will be approved or not.
Theory:
The goal of this project is to predict, from the data collected from loan applicants and after
preprocessing it, which applicants are eligible to receive a loan.
Dataset Used:
In the Dataset we find the following features:
1. Loan ID: the identifier code of each applicant.
2. Gender: Male or Female.
3. Married: the marital status of the applicant.
4. Dependents: the number of dependents the applicant has.
5. Education: the level of education, Graduate or Not Graduate.
6. Self Employed: Yes or No.
7. Applicant Income
8. Co-applicant Income
9. Loan Amount
10. Loan Amount Term
11. Credit History: Yes or No.
12. Property Area: Urban, Semiurban, or Rural area of the applicant's property.
13. Loan Status: Yes or No (the dependent variable, representing the class).
We know how to build, train & test our Machine Learning Model and find its accuracy using
the given data. But how do we measure its real-world performance?
We use a Confusion Matrix for this. A Confusion matrix is an N x N matrix used for
evaluating the performance of a classification model, where N is the number of target classes.
The matrix compares the actual target values with those predicted by the machine learning
model. This gives us a holistic view of how well our classification model is performing and
what kinds of errors it is making.
For a binary classifier it is a table with 4 different combinations of predicted and actual values.
It is extremely useful for measuring Recall, Precision, Specificity, Accuracy, and, most
importantly, the AUC-ROC curve.
Let's understand TP, FP, FN, TN:
True Positive: You predicted positive and it's true.
True Negative: You predicted negative and it's true.
False Positive (Type 1 Error): You predicted positive and it's false.
False Negative (Type 2 Error): You predicted negative and it's false.
Figure 3.1
We describe predicted values as Positive and Negative and actual values as True and False.
Figure 3.2
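The four quantities above can be computed directly with scikit-learn's confusion_matrix(). A minimal sketch with made-up labels (1 = approved, 0 = rejected); the data here is purely illustrative, not from the loan dataset:

```python
from sklearn.metrics import confusion_matrix

# Toy example: actual vs. predicted class labels (1 = approved, 0 = rejected)
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
y_pred = [1, 0, 0, 1, 1, 1, 0, 0, 1, 0]

# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()
print(cm)                                   # [[3 1]
                                            #  [2 4]]
print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")
```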
Label Encoding: Most Supervised Machine Learning algorithms work on numerical data
only. If we have categorical data in our dataset, we need to convert it into numeric form. We
use the Pandas replace() function to manually encode our data, and the LabelEncoder() class
from scikit-learn to automate this process.
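Both approaches can be sketched on a small hypothetical frame (the column names mirror the loan dataset, but the three rows are invented for illustration):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical mini-frame with two categorical columns
df = pd.DataFrame({"Gender": ["Male", "Female", "Male"],
                   "Property_Area": ["Urban", "Rural", "Semiurban"]})

# Manual encoding with replace(): we choose the codes ourselves
df = df.replace({"Gender": {"Male": 1, "Female": 0}})

# Automated encoding with LabelEncoder: codes follow alphabetical order
# (Rural=0, Semiurban=1, Urban=2)
le = LabelEncoder()
df["Property_Area"] = le.fit_transform(df["Property_Area"])
print(df)
```

Note that LabelEncoder assigns codes by sorted order, so the mapping may differ from a hand-chosen one; for ordinal features the manual replace() keeps control over the ordering.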
3. Program to predict if a loan will get approved or not.
In [19]: import pandas as pd
         df = pd.read_csv('loan_data_set.csv')
         df
Out[19]: (preview of the DataFrame: Loan_ID, Gender, Married, Dependents, Education,
         Self_Employed, ApplicantIncome, CoapplicantIncome, LoanAmount, ...)
In [21]: df.shape
Out[21]: (614, 13)
In [22]: df.describe()
Out[22]: (summary statistics for the numeric columns: ApplicantIncome, CoapplicantIncome,
         LoanAmount, Loan_Amount_Term, Credit_History)
In [23]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614 entries, 0 to 613
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Loan_ID 614 non-null object
1 Gender 601 non-null object
2 Married 611 non-null object
3 Dependents 599 non-null object
4 Education 614 non-null object
5 Self_Employed 582 non-null object
6 ApplicantIncome 614 non-null int64
7 CoapplicantIncome 614 non-null float64
8 LoanAmount 592 non-null float64
9 Loan_Amount_Term 600 non-null float64
10 Credit_History 564 non-null float64
11 Property_Area 614 non-null object
12 Loan_Status 614 non-null object
dtypes: float64(4), int64(1), object(8)
memory usage: 62.5+ KB
In [24]: df.isnull().sum()
Out[24]:
Loan_ID 0
Gender 13
Married 3
Dependents 15
Education 0
Self_Employed 32
ApplicantIncome 0
CoapplicantIncome 0
LoanAmount 22
Loan_Amount_Term 14
Credit_History 50
Property_Area 0
Loan_Status 0
dtype: int64
# Gender Encoding
df = df.replace({"Gender": {"Male": 1, "Female": 0}})
# Married Encoding
df = df.replace({"Married": {"Yes": 1, "No": 0}})
# Education Encoding
df['Education'].value_counts()
df = df.replace({"Education": {"Graduate": 1, "Not Graduate": 0}})
print(df)
Property_Area Loan_Status
0 1 1
1 0 0
2 1 1
3 1 1
4 1 1
.. ... ...
609 0 1
610 0 1
611 1 1
612 1 1
613 2 0
In [26]: df
Out[26]: (DataFrame after encoding: Gender, Married, Dependents, Education, Self_Employed,
         ApplicantIncome, CoapplicantIncome, LoanAmount, ...)
In [27]: df.describe()
In the analysis we found that the data has many missing/null values. We fill each null value with the median of
the column in which it occurs, using the fillna() function with the column's median() as its
argument.
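The fill step can be sketched on one hypothetical column (the values here are invented, not from the dataset):

```python
import pandas as pd

# A column with two missing entries; the median of the non-null
# values [100, 150, 200] is 150.0
df = pd.DataFrame({"LoanAmount": [100.0, None, 150.0, 200.0, None]})

# Replace every NaN with the column median
df["LoanAmount"] = df["LoanAmount"].fillna(df["LoanAmount"].median())
print(df["LoanAmount"].tolist())   # [100.0, 150.0, 150.0, 200.0, 150.0]
```

The median is preferred over the mean here because income and loan-amount columns are typically skewed, and the median is robust to outliers.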
In [29]: df.isnull().sum()
Out[29]:
Gender 0
Married 0
Dependents 0
Education 0
Self_Employed 0
ApplicantIncome 0
CoapplicantIncome 0
LoanAmount 0
Loan_Amount_Term 0
Credit_History 0
Property_Area 0
Loan_Status 0
dtype: int64
What is Correlation?
The mutual relationship, covariation, or association between two or more variables is called Correlation. It is
not concerned with either the changes in x or y individually, but with the measurement of simultaneous
variations in both variables.
The Seaborn Heatmap gives a Visual Representation of correlation between the variables.
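What df.corr() actually returns is a square matrix of pairwise Pearson coefficients; the heatmap simply colours this matrix. A minimal sketch with invented columns (y rises with x, z falls with x):

```python
import pandas as pd

# Two perfectly correlated columns and one perfectly anti-correlated column
df = pd.DataFrame({"x": [1, 2, 3, 4],
                   "y": [2, 4, 6, 8],    # y = 2x  -> correlation +1
                   "z": [4, 3, 2, 1]})   # z falls as x rises -> correlation -1

corr = df.corr()
print(corr)
# Passing this matrix to sns.heatmap(corr, annot=True) would draw
# the same kind of figure as in the cell below.
```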
In [30]: sns.set(rc={'figure.figsize':(15,8)})
         sns.heatmap(df.corr(), annot=True, cmap="rocket")
         plt.show()
         # correlationdrop() is a user-defined helper; it appears to drop features
         # whose correlation with the target falls below the given threshold (0.05)
         df = correlationdrop(df, 0.05)
         print(df)
Married Dependents Education Self_Employed ApplicantIncome \
0 0.0 0.0 1 0.0 5849
1 1.0 1.0 1 0.0 4583
2 1.0 0.0 1 1.0 3000
3 1.0 0.0 0 0.0 2583
4 0.0 0.0 1 0.0 6000
.. ... ... ... ... ...
609 0.0 0.0 1 0.0 2900
610 1.0 3.0 1 0.0 4106
611 1.0 1.0 1 0.0 8072
612 1.0 2.0 1 0.0 7583
613 0.0 0.0 1 1.0 4583
Property_Area Loan_Status
0 1 1
1 0 0
2 1 1
3 1 1
4 1 1
.. ... ...
609 0 1
610 0 1
611 1 1
612 1 1
613 2 0
Model Building
Out[36]: 0.8292682926829268
SVM Classifier
Support Vector Machine (SVM) is one of the most popular Supervised Learning algorithms; it is used for
both Classification and Regression problems. In the SVM algorithm, we plot each data item as a point in n-
dimensional space (where n is the number of features) with the value of each feature being the value
of a particular coordinate.
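A minimal sketch of training an SVM classifier with scikit-learn's SVC; the synthetic data from make_classification stands in for the preprocessed loan features, so the exact accuracy will differ from the report's numbers:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the preprocessed loan features
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# RBF kernel is the SVC default and handles non-linear boundaries
clf = SVC(kernel="rbf")
clf.fit(X_train, y_train)
acc = clf.score(X_test, y_test)
print(f"Test accuracy: {acc:.2f}")
```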
[[14 19]
 [ 2 88]]
Accuracy: 80.44%
Standard Deviation: 3.52%
Predicted   0   1  All
True
0          14  19   33
1           2  88   90
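The metrics can be read straight off this matrix. Assuming the scikit-learn convention (rows = actual class, columns = predicted class), a quick worked check:

```python
# Confusion matrix [[14, 19], [2, 88]] from the run above
tn, fp = 14, 19   # actual 0: correctly rejected / falsely approved
fn, tp = 2, 88    # actual 1: falsely rejected / correctly approved

total = tn + fp + fn + tp              # 123 test samples
accuracy  = (tp + tn) / total          # (88 + 14) / 123 ~ 0.829
precision = tp / (tp + fp)             # 88 / 107 ~ 0.822
recall    = tp / (tp + fn)             # 88 / 90  ~ 0.978
print(accuracy, precision, recall)
```

The high recall means very few eligible applicants are rejected, while the lower precision reflects the 19 ineligible applicants the model would approve.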