
October 20, 2022

Experiment – 3

Aim:
Write a program in python to predict if a loan will get approved or not.

Theory:
The goal of this project is to preprocess the data collected from loan applicants and then
predict, from that information, which applicants will be eligible to receive a loan.

Dataset Used:
In the Dataset we find the following features:
1. Loan ID, the identifier code of each applicant.
2. Gender, Male or Female for each applicant.
3. Married, the marital status of the applicant.
4. Dependents, the number of dependents the applicant has.
5. Education, the level of education, graduate or non-graduate.
6. Self Employed, Yes or No.
7. Applicant Income
8. Co-applicant Income
9. Loan Amount
10. Loan Amount Term
11. Credit History, Yes or No (recorded as 1 or 0).
12. Property Area, urban, semi urban or rural area of the applicant's property.
13. Loan Status, Yes or No (the dependent variable, which represents the class)

We know how to build, train & test our Machine Learning Model and find its accuracy using
the given data. But how do we measure its real-world performance?
We use a Confusion Matrix for this. A Confusion matrix is an N x N matrix used for
evaluating the performance of a classification model, where N is the number of target classes.
The matrix compares the actual target values with those predicted by the machine learning
model. This gives us a holistic view of how well our classification model is performing and
what kinds of errors it is making.
For a binary problem such as this one, it is a table with 4 different combinations of predicted
and actual values. It is extremely useful for measuring Recall, Precision, Specificity, and
Accuracy, and it underlies AUC-ROC curves.
Let’s understand TP, FP, FN, TN:
True Positive:
Interpretation: You predicted positive and it’s true.
True Negative:
Interpretation: You predicted negative and it’s true.
False Positive: (Type 1 Error)
Interpretation: You predicted positive and it’s false.
False Negative: (Type 2 Error)
Interpretation: You predicted negative and it’s false.

Figure 3.1

Positive and Negative describe the predicted class, while True and False indicate whether
that prediction matches the actual value.
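The four outcomes can be counted directly from a pair of label lists. The sketch below uses made-up labels (1 = approved, 0 = rejected) purely for illustration; they are not taken from the loan dataset.

```python
# Hypothetical labels for illustration only (1 = approved, 0 = rejected).
actual    = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 1, 0, 1, 0]

pairs = list(zip(actual, predicted))

# Each outcome is defined by the (actual, predicted) combination.
tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # predicted positive, correct
tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # predicted negative, correct
fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # predicted positive, wrong (Type 1)
fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # predicted negative, wrong (Type 2)

print("TP:", tp, "TN:", tn, "FP:", fp, "FN:", fn)
```

The four counts always add up to the number of samples, which is a quick sanity check on any confusion matrix.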

Figure 3.2

Label Encoding: Most Supervised Machine Learning Algorithms work on numerical data
only. If we have categorical data in our dataset, we need to convert it into numeric form. We
can use the Pandas replace() function to encode our data manually, or the LabelEncoder()
class from scikit-learn to automate this process.
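The two encoding routes can be compared on a tiny table. This is a minimal sketch on a hypothetical three-row sample. The manual route is written with Series.map and an explicit dictionary, which plays the same role as the dictionary passed to replace() in the program. LabelEncoder assigns integers to the classes in sorted order, so the mapping it picks is not under our control.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical three-row sample, not taken from the loan dataset.
sample = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Married": ["Yes", "No", "Yes"],
})

# Manual encoding: we decide which category gets which number.
manual = sample["Married"].map({"Yes": 1, "No": 0})

# Automatic encoding: classes are sorted, then numbered from 0.
encoder = LabelEncoder()
automatic = encoder.fit_transform(sample["Gender"])

print(list(manual))
print(list(automatic), list(encoder.classes_))
```

Manual encoding is the safer choice when the numeric order should carry meaning, because the mapping is written down explicitly.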
3. Program to predict if a loan will get approved or not.

Setting up the Environment and Loading the Dataset


In [18]: import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [19]: df=pd.read_csv('loan_data_set.csv')
df

Out[19]: Loan_ID Gender Married Dependents Education Self_Employed ApplicantIncome CoapplicantIncome Loan

0 LP001002 Male No 0 Graduate No 5849 0.0

1 LP001003 Male Yes 1 Graduate No 4583 1508.0

2 LP001005 Male Yes 0 Graduate Yes 3000 0.0

3 LP001006 Male Yes 0 Not Graduate No 2583 2358.0

4 LP001008 Male No 0 Graduate No 6000 0.0

... ... ... ... ... ... ... ... ...

609 LP002978 Female No 0 Graduate No 2900 0.0

610 LP002979 Male Yes 3+ Graduate No 4106 0.0

611 LP002983 Male Yes 1 Graduate No 8072 240.0

612 LP002984 Male Yes 2 Graduate No 7583 0.0

613 LP002990 Female No 0 Graduate Yes 4583 0.0

614 rows × 13 columns

Analyzing the Data


In [20]: df.head()

Out[20]: Loan_ID Gender Married Dependents Education Self_Employed ApplicantIncome CoapplicantIncome LoanA

0 LP001002 Male No 0 Graduate No 5849 0.0

1 LP001003 Male Yes 1 Graduate No 4583 1508.0

2 LP001005 Male Yes 0 Graduate Yes 3000 0.0

3 LP001006 Male Yes 0 Not Graduate No 2583 2358.0

4 LP001008 Male No 0 Graduate No 6000 0.0

In [21]: df.shape

Out[21]: (614, 13)

In [22]: df.describe()
Out[22]: ApplicantIncome CoapplicantIncome LoanAmount Loan_Amount_Term Credit_History

count 614.000000 614.000000 592.000000 600.00000 564.000000

mean 5403.459283 1621.245798 146.412162 342.00000 0.842199

std 6109.041673 2926.248369 85.587325 65.12041 0.364878

min 150.000000 0.000000 9.000000 12.00000 0.000000

25% 2877.500000 0.000000 100.000000 360.00000 1.000000

50% 3812.500000 1188.500000 128.000000 360.00000 1.000000

75% 5795.000000 2297.250000 168.000000 360.00000 1.000000

max 81000.000000 41667.000000 700.000000 480.00000 1.000000

In [23]: df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614 entries, 0 to 613
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Loan_ID 614 non-null object
1 Gender 601 non-null object
2 Married 611 non-null object
3 Dependents 599 non-null object
4 Education 614 non-null object
5 Self_Employed 582 non-null object
6 ApplicantIncome 614 non-null int64
7 CoapplicantIncome 614 non-null float64
8 LoanAmount 592 non-null float64
9 Loan_Amount_Term 600 non-null float64
10 Credit_History 564 non-null float64
11 Property_Area 614 non-null object
12 Loan_Status 614 non-null object
dtypes: float64(4), int64(1), object(8)
memory usage: 62.5+ KB

In [24]: df.isnull().sum()

Out[24]: Loan_ID 0
Gender 13
Married 3
Dependents 15
Education 0
Self_Employed 32
ApplicantIncome 0
CoapplicantIncome 0
LoanAmount 22
Loan_Amount_Term 14
Credit_History 50
Property_Area 0
Loan_Status 0
dtype: int64

Processing the Data


Label Encoding
In [25]: # Loan Status encoding
df = df.replace({"Loan_Status": {'Y': 1, 'N': 0}})

# Gender encoding
df = df.replace({"Gender": {"Male": 1, "Female": 0}})

# Married encoding
df = df.replace({"Married": {"Yes": 1, "No": 0}})

# Replace the 3+ in Dependents and make the column numeric
df['Dependents'] = df['Dependents'].replace('3+', '3')
df['Dependents'] = pd.to_numeric(df['Dependents'], errors='coerce')

# Self Employed encoding (value_counts() counts how often each value occurs)
df['Self_Employed'].value_counts()
df = df.replace({"Self_Employed": {"Yes": 1, "No": 0}})

# Education encoding
df['Education'].value_counts()
df = df.replace({"Education": {"Graduate": 1, "Not Graduate": 0}})

# Drop the Loan ID column
df = df.drop('Loan_ID', axis=1)

# Property Area encoding
df['Property_Area'].value_counts()
df['Property_Area'] = df['Property_Area'].map({'Rural': 0, 'Urban': 1, 'Semiurban': 2})

print(df)

Gender Married Dependents Education Self_Employed ApplicantIncome \


0 1.0 0.0 0.0 1 0.0 5849
1 1.0 1.0 1.0 1 0.0 4583
2 1.0 1.0 0.0 1 1.0 3000
3 1.0 1.0 0.0 0 0.0 2583
4 1.0 0.0 0.0 1 0.0 6000
.. ... ... ... ... ... ...
609 0.0 0.0 0.0 1 0.0 2900
610 1.0 1.0 3.0 1 0.0 4106
611 1.0 1.0 1.0 1 0.0 8072
612 1.0 1.0 2.0 1 0.0 7583
613 0.0 0.0 0.0 1 1.0 4583

CoapplicantIncome LoanAmount Loan_Amount_Term Credit_History \


0 0.0 NaN 360.0 1.0
1 1508.0 128.0 360.0 1.0
2 0.0 66.0 360.0 1.0
3 2358.0 120.0 360.0 1.0
4 0.0 141.0 360.0 1.0
.. ... ... ... ...
609 0.0 71.0 360.0 1.0
610 0.0 40.0 180.0 1.0
611 240.0 253.0 360.0 1.0
612 0.0 187.0 360.0 1.0
613 0.0 133.0 360.0 0.0

Property_Area Loan_Status
0 1 1
1 0 0
2 1 1
3 1 1
4 1 1
.. ... ...
609 0 1
610 0 1
611 1 1
612 1 1
613 2 0

[614 rows x 12 columns]

In [26]: df
Out[26]: Gender Married Dependents Education Self_Employed ApplicantIncome CoapplicantIncome LoanAmount

0 1.0 0.0 0.0 1 0.0 5849 0.0 NaN

1 1.0 1.0 1.0 1 0.0 4583 1508.0 128.0

2 1.0 1.0 0.0 1 1.0 3000 0.0 66.0

3 1.0 1.0 0.0 0 0.0 2583 2358.0 120.0

4 1.0 0.0 0.0 1 0.0 6000 0.0 141.0

... ... ... ... ... ... ... ... ...

609 0.0 0.0 0.0 1 0.0 2900 0.0 71.0

610 1.0 1.0 3.0 1 0.0 4106 0.0 40.0

611 1.0 1.0 1.0 1 0.0 8072 240.0 253.0

612 1.0 1.0 2.0 1 0.0 7583 0.0 187.0

613 0.0 0.0 0.0 1 1.0 4583 0.0 133.0

614 rows × 12 columns

In [27]: df.describe()

Out[27]: Gender Married Dependents Education Self_Employed ApplicantIncome CoapplicantIncome LoanA

count 601.000000 611.000000 599.000000 614.000000 582.000000 614.000000 614.000000 592

mean 0.813644 0.651391 0.762938 0.781759 0.140893 5403.459283 1621.245798 146

std 0.389718 0.476920 1.015216 0.413389 0.348211 6109.041673 2926.248369 85

min 0.000000 0.000000 0.000000 0.000000 0.000000 150.000000 0.000000 9

25% 1.000000 0.000000 0.000000 1.000000 0.000000 2877.500000 0.000000 100

50% 1.000000 1.000000 0.000000 1.000000 0.000000 3812.500000 1188.500000 128

75% 1.000000 1.000000 2.000000 1.000000 0.000000 5795.000000 2297.250000 168

max 1.000000 1.000000 3.000000 1.000000 1.000000 81000.000000 41667.000000 700

In the analysis we found that the data has a lot of missing (null) values. We fill the null values in each column
with the median of that column. We use the fillna() function and pass it the result of the median() function as its
parameter.

In [28]: df.fillna(df.median(), inplace=True)

# Make sure every column has a numeric dtype
columns = df.columns
for column in columns:
    df[column] = pd.to_numeric(df[column], errors='coerce')
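The effect of filling with the median can be checked on a small table first. The sketch below uses a hypothetical four-row frame (the column names are borrowed from the dataset, the values are invented) and the non-inplace form of fillna() so that the original stays available for comparison.

```python
import numpy as np
import pandas as pd

# Hypothetical values for illustration; only the column names match the dataset.
toy = pd.DataFrame({
    "LoanAmount": [100.0, np.nan, 140.0, 120.0],
    "Credit_History": [1.0, 1.0, np.nan, 0.0],
})

# median() skips NaN by default and returns one value per column.
medians = toy.median()

# fillna() with a Series fills each column with its own matching value.
filled = toy.fillna(medians)

print(medians)
print(filled)
```

Each column receives its own median, so a missing LoanAmount is never filled with a Credit_History value.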

In [29]: df.isnull().sum()

Out[29]: Gender 0
Married 0
Dependents 0
Education 0
Self_Employed 0
ApplicantIncome 0
CoapplicantIncome 0
LoanAmount 0
Loan_Amount_Term 0
Credit_History 0
Property_Area 0
Loan_Status 0
dtype: int64

What is Correlation
Correlation is the mutual relationship, covariation, or association between two or more variables. It is
not concerned with the changes in x or y individually, but with the measurement of simultaneous
variations in both variables.
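As a rough illustration of "simultaneous variation", the sketch below computes a correlation coefficient by hand. It assumes the Pearson coefficient, which is what DataFrame.corr() computes by default. The helper name pearson and the sample numbers are invented for this example.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation: co-variation of x and y divided by their spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviation of each x from its mean
    dy = y - y.mean()  # deviation of each y from its mean
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

# Hypothetical values: y rises with x in the first pair and falls in the second.
r_up = pearson([2, 4, 6, 8], [1, 2, 3, 4])
r_down = pearson([1, 2, 3, 4], [8, 6, 4, 2])

print(r_up, r_down)
```

A value near +1 means the two variables rise together, a value near -1 means one falls as the other rises, and a value near 0 means no linear relationship.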

Application in Machine Learning


Correlation is widely used in machine learning during data analysis and data mining. It
can reveal problems in a given set of features, such as redundant or uninformative columns, that could
otherwise harm the model during fitting. Data with non-correlated features has several benefits,
such as:
1. The algorithm learns faster
2. Interpretability is higher
3. Bias is lower

The Seaborn heatmap gives a visual representation of the correlation between the variables.

In [30]: sns.set(rc={'figure.figsize':(15,8)})
sns.heatmap(df.corr(),annot=True,cmap="rocket")
plt.show()

In [31]: # Drop features whose absolute correlation with Loan_Status is below sl
def correlationdrop(df, sl):
    columns = df.columns
    for column in columns:
        C = abs(df[column].corr(df['Loan_Status']))
        if C < sl:
            df = df.drop(columns=[column])
    return df

df = correlationdrop(df, 0.05)

print(df)
Married Dependents Education Self_Employed ApplicantIncome \
0 0.0 0.0 1 0.0 5849
1 1.0 1.0 1 0.0 4583
2 1.0 0.0 1 1.0 3000
3 1.0 0.0 0 0.0 2583
4 0.0 0.0 1 0.0 6000
.. ... ... ... ... ...
609 0.0 0.0 1 0.0 2900
610 1.0 3.0 1 0.0 4106
611 1.0 1.0 1 0.0 8072
612 1.0 2.0 1 0.0 7583
613 0.0 0.0 1 1.0 4583

CoapplicantIncome LoanAmount Loan_Amount_Term Credit_History \


0 0.0 128.0 360.0 1.0
1 1508.0 128.0 360.0 1.0
2 0.0 66.0 360.0 1.0
3 2358.0 120.0 360.0 1.0
4 0.0 141.0 360.0 1.0
.. ... ... ... ...
609 0.0 71.0 360.0 1.0
610 0.0 40.0 180.0 1.0
611 240.0 253.0 360.0 1.0
612 0.0 187.0 360.0 1.0
613 0.0 133.0 360.0 0.0

Property_Area Loan_Status
0 1 1
1 0 0
2 1 1
3 1 1
4 1 1
.. ... ...
609 0 1
610 0 1
611 1 1
612 1 1
613 2 0

[614 rows x 11 columns]
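The same kind of filtering can also be written without an explicit loop. The sketch below is an alternative under stated assumptions, not the program's own method: the helper name drop_weak_features and the toy columns are hypothetical. One difference is that a constant column, whose correlation is NaN, is dropped here, whereas the loop version would keep it.

```python
import pandas as pd

def drop_weak_features(frame, target, threshold):
    """Keep columns whose |correlation| with the target is at least threshold."""
    strength = frame.corr()[target].abs()
    keep = strength[strength >= threshold].index  # the target itself has strength 1
    return frame[keep]

# Hypothetical toy data: "income" follows the target, "noise" does not.
toy = pd.DataFrame({
    "income": [1, 2, 3, 4, 5, 6],
    "noise": [1, 2, 3, 3, 2, 1],
    "Loan_Status": [0, 0, 0, 1, 1, 1],
})

selected = drop_weak_features(toy, "Loan_Status", 0.05)
print(list(selected.columns))
```

Because the function returns a new frame, the original data stays intact and different thresholds can be tried side by side.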

Model Building

Separating the Variables


In [32]: x = df.iloc[:,:-1].values
y = df.iloc[:,-1].values

Scaling the Data


Data Scaling is a data preprocessing step for numerical features. The scaler methods used to achieve this
are:
1. The fit(data) method computes the statistics needed to scale each feature (the minimum and maximum for
MinMaxScaler, the mean and standard deviation for StandardScaler).
2. The transform(data) method performs the scaling using the statistics computed by the fit()
method.
3. The fit_transform() method does both fit and transform.
In the MinMaxScaler, the minimum of each feature is mapped to zero and the maximum to one. It
transforms data by scaling features to a given range, by default 0 to 1, without changing the shape of the
original distribution.
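The split between fit() and transform() can be seen on a single made-up income column. In this sketch the numbers are hypothetical. It fits the scaler on one array, reuses the learned minimum and maximum on a new value, and compares the result with the formula (x - min) / (max - min).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical incomes; one feature, three rows.
train = np.array([[150.0], [3000.0], [5850.0]])
new_value = np.array([[4425.0]])

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train)  # learns min/max, then scales
new_scaled = scaler.transform(new_value)    # reuses the learned min/max

# The same scaling written out by hand.
col_min = train.min(axis=0)
col_max = train.max(axis=0)
manual = (train - col_min) / (col_max - col_min)

print(train_scaled.ravel(), new_scaled.ravel())
```

Calling transform() on new data, rather than fit_transform(), keeps the new values on the same scale as the data the scaler was fitted on.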

In [33]: from sklearn.preprocessing import MinMaxScaler


sc = MinMaxScaler()
X = sc.fit_transform(x)

Splitting the Data

In [34]: from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=0)

Logistic Regression Machine Learning Model


In [35]: from sklearn.linear_model import LogisticRegression
model=LogisticRegression()
model.fit(X_train,y_train)
z=model.predict(X_test)

Testing the Accuracy


In [36]: from sklearn.metrics import accuracy_score
accuracy_score(y_test,z)

Out[36]: 0.8292682926829268

SVM Classifier
Support Vector Machine (SVM) is one of the most popular Supervised Learning algorithms and is used for
Classification as well as Regression problems. In the SVM algorithm, we plot each data item as a point in
n-dimensional space (where n is the number of features), with the value of each feature being the value
of a particular coordinate. The classifier then looks for the boundary that separates the classes with the
largest margin.

In [37]: from sklearn.svm import SVC


classifier=SVC(kernel='rbf',gamma=0.2)
classifier.fit(X_train,y_train)

#Predicting the test set results


y_pred=classifier.predict(X_test)

Making Confusion Matrix


In [43]: from sklearn.metrics import confusion_matrix
cm=confusion_matrix(y_test,y_pred)
print(cm)

#Applying k-fold Cross Validation


from sklearn.model_selection import cross_val_score
accuracies=cross_val_score(estimator=classifier,X=X_train,y=y_train)
print("Accuracy:{:.2f}%".format(accuracies.mean()*100))
print("Standard Deviation:{:.2f}%".format(accuracies.std()*100))

[[14 19]
[ 2 88]]
Accuracy:80.44%
Standard Deviation:3.52%
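When cross_val_score() is called without a cv argument, scikit-learn uses 5 folds. The sketch below makes the folds explicit with StratifiedKFold so that the number of splits and the shuffling are visible. It runs on synthetic data from make_classification, not on the loan dataset, so its scores are not comparable to the ones above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in data (assumption: 200 samples, 5 features).
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# Five stratified folds: each fold keeps roughly the same class balance.
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

scores = cross_val_score(SVC(kernel="rbf", gamma=0.2), X_demo, y_demo, cv=folds)

print("Per-fold accuracy:", scores)
print("Mean: {:.2f}%  Std: {:.2f}%".format(scores.mean() * 100, scores.std() * 100))
```

The standard deviation across folds indicates how much the accuracy depends on which rows happen to land in the validation fold.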

Classification Report using Confusion Matrix


In [42]: pd.crosstab(y_test, y_pred, rownames=['True'], colnames=['Predicted'], margins=True)

Out[42]: Predicted 0 1 All

True

0 14 19 33

1 2 88 90

All 16 107 123
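The table above is enough to derive the metrics named in the theory section. The sketch below copies the four counts from the reported matrix. It relies on scikit-learn's layout, in which rows are actual classes and columns are predicted classes, so the matrix reads [[TN, FP], [FN, TP]].

```python
# Counts copied from the confusion matrix reported above.
tn, fp = 14, 19   # actual 0: predicted 0, predicted 1
fn, tp = 2, 88    # actual 1: predicted 0, predicted 1

total = tn + fp + fn + tp

accuracy = (tp + tn) / total     # share of all predictions that are correct
precision = tp / (tp + fp)       # share of predicted approvals that are right
recall = tp / (tp + fn)          # share of actual approvals that are found
specificity = tn / (tn + fp)     # share of actual rejections that are found
f1 = 2 * precision * recall / (precision + recall)

for name, value in [("Accuracy", accuracy), ("Precision", precision),
                    ("Recall", recall), ("Specificity", specificity), ("F1", f1)]:
    print("{}: {:.4f}".format(name, value))
```

Recall is high (88 of 90 approvals found) while specificity is low (14 of 33 rejections found), so the classifier leans towards predicting approval.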
