Loan Predictor

Preliminary analysis of datasets:

The training dataset has 614 entries. The test dataset has 367 rows for which predictions are to be made.

Loan_Status is the target variable that we need to predict. In approximately 69% of cases the loan
is approved, and in 31% of cases it is not approved.

Loan ID:

The loan ID is unique to each individual and has no relation to the eligibility criteria for
approving a loan.

Gender:

Factor with 2 levels.

In the training dataset there are 112 females, 489 males, and 13 blanks. The corresponding figures
for the test data are 70 females, 286 males, and 11 NA. The gender ratio is similar in both datasets.
We will replace the missing values in proportion to the observed female-to-male ratio, taken either
from each dataset separately or from the merged datasets.

Married:

Factor with 2 levels.

In the training dataset there are 213 individuals who are not married and 398 who are married. There
are 3 blanks. In the test dataset the corresponding figures are: Not married: 134, Married: 233, NA: 0.
Here also we will use the ratio of Not married to Married in the training dataset. Therefore, of the
3 blanks, 1 individual will be treated as not married and 2 as married. We will examine further
whether other variables, such as Dependents, can give us a lead.
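The ratio method described above can be sketched as a proportional split of the blanks. This is only an illustration: the function name `allocate_missing` and the largest-remainder rounding rule are assumptions, and the sketch decides only how many blanks go to each level, not which rows receive them.

```python
from math import floor


def allocate_missing(counts, n_missing):
    """Split n_missing blanks across levels in proportion to observed counts.

    Assumption: largest-remainder rounding, so the allocations always
    sum to n_missing.
    """
    total = sum(counts.values())
    exact = {level: n_missing * n / total for level, n in counts.items()}
    alloc = {level: floor(x) for level, x in exact.items()}
    leftover = n_missing - sum(alloc.values())
    # Give the remaining blanks to the levels with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda lv: exact[lv] - alloc[lv], reverse=True)
    for level in by_remainder[:leftover]:
        alloc[level] += 1
    return alloc


# Married in the training data: 213 "No", 398 "Yes", 3 blanks.
married_alloc = allocate_missing({"No": 213, "Yes": 398}, 3)
print(married_alloc)
```

With the training counts for Married, the exact shares are about 1.05 and 1.95, which round to 1 not married and 2 married, matching the split stated above.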

Dependents:

Factor with 4 levels: 0, 1, 2, 3+

This shows whether the individual has any parents or children who depend on them. It can be an
important factor in approving a loan: the more dependents there are, the more income is spent on
them, and the less likely the individual is to pay their dues on time.

Here, there are 345 individuals with no dependents, while the others have one, two, or three or more
dependents. There are 15 blanks, and we will replace them with 0, assuming those individuals have no
dependents.

Education:

Factor with 2 levels: Graduate, Not graduate

This shows whether the individual is a graduate or not. There are 480 graduates and 134
non-graduates. In the test set, the counts are 283 graduates and 84 non-graduates. There are no
missing values.

The table below cross-tabulates marital status against education (the 3 individuals with blank
marital status are excluded). Being a graduate and not married appears to raise the chances of
approval, while being married and not a graduate gives a lower chance of approval because of lower
income and less security.

Married status    Graduate    Not Graduate
Yes               309         89
No                168         45
Total             477         134

Self-Employment:

Factor with 2 levels

Self-employment is an important factor because the amount of income and job security can be taken
as important parameters in approving an individual's loan. There are 500 individuals who are not
self-employed and 82 who are self-employed, with NA: 32. In the test data the counts are:
Self-employed: 37, Not self-employed: 307, NA: 23. The missing values will be replaced by the ratio
method.

Applicant Income:

Numeric variable. NA: 0

Income           Loan Status N    Loan Status Y    Grand Total
<0 or (blank)    11               11               22
0-99             43               96               139
100-199          107              261              368
200-299          19               36               55
300-399          8                7                15
400-499          2                7                9
500-599          2                -                2
600-699          -                3                3
700-800          -                1                1
Grand Total      192              422              614

Here, the table suggests that higher income gives a greater chance of loan approval, although the
highest bands contain very few cases. There are no missing values in this variable, but it has
outliers.

Co-Applicant Income:

Numeric variable. NA: 0

This is very similar to the Applicant Income variable. The co-applicant's income improves the
chances of repayment and hence becomes an important factor in granting the loan. It also has
outliers in both datasets.

Since the income variables for both Applicant and Co-Applicant take large values, we will try using
their logarithms as model parameters.
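A minimal sketch of the log transform follows. It assumes that some co-applicant incomes may be 0, so it uses `log1p` (the logarithm of 1 + x) rather than a plain logarithm, which is undefined at 0; that substitution is a choice made here, not something the analysis specifies. The income values are illustrative, not taken from the dataset.

```python
import math


def log_income(value):
    # log1p(x) = log(1 + x) stays finite when the income is 0.
    return math.log1p(value)


# Illustrative (made-up) incomes.
applicant_income = [2500, 5000, 40000]
coapplicant_income = [0, 1500, 0]

log_applicant = [log_income(v) for v in applicant_income]
log_coapplicant = [log_income(v) for v in coapplicant_income]
print(log_applicant, log_coapplicant)
```

The transform compresses the large values, so an outlier such as 40000 sits much closer to the rest of the distribution on the log scale.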

Loan Amount:

Numeric Variable.

Train data, NA – 22

Test data, NA - 5

The amount of the loan for which an individual applies is captured in this variable. The missing
values will be replaced by the mean loan amount for the locality of the house (Property_Area). For
the blanks, the means of the urban, semi-urban, and rural classes are 121, 127.30, and 139.74
respectively. There are many outliers in both datasets.
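The group-mean replacement described above can be sketched as follows. The row layout (a list of dictionaries), the column name `LoanAmount`, and the sample values are assumptions made for illustration; only `Property_Area` is named in the analysis.

```python
from collections import defaultdict


def impute_by_group_mean(rows, group_key, value_key):
    """Return copies of rows with missing values replaced by their group's mean.

    Assumption: every group has at least one observed (non-None) value.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        if row[value_key] is not None:
            sums[row[group_key]] += row[value_key]
            counts[row[group_key]] += 1
    means = {group: sums[group] / counts[group] for group in sums}
    return [
        {**row, value_key: means[row[group_key]]} if row[value_key] is None else dict(row)
        for row in rows
    ]


# Made-up rows; None marks a missing loan amount.
rows = [
    {"Property_Area": "Urban", "LoanAmount": 100.0},
    {"Property_Area": "Urban", "LoanAmount": 140.0},
    {"Property_Area": "Urban", "LoanAmount": None},
    {"Property_Area": "Rural", "LoanAmount": 150.0},
    {"Property_Area": "Rural", "LoanAmount": None},
]
filled = impute_by_group_mean(rows, "Property_Area", "LoanAmount")
print(filled)
```

Because the variable has many outliers, a group median could be swapped in for the mean; that would be a variation on the approach described here.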

Loan Amount Term:

Numeric Variable.

Train Data, NA - 14

Test data, NA -6

It shows the duration of the loan. As most of the loans are applied for a term of 360 months, we
will replace blank values with 360 months.

Loan Amount Term    Count of Loan Amount
12                  1
36                  2
60                  2
84                  4
120                 3
180                 42
240                 3
300                 13
360                 493
480                 15
(blank)             14
Grand Total         592
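Replacing the blanks with the most frequent term amounts to mode imputation, sketched below. The function name `impute_with_mode` and the sample list are illustrative assumptions, not part of the analysis.

```python
from collections import Counter


def impute_with_mode(values):
    """Replace None entries with the most frequent observed value."""
    observed = [v for v in values if v is not None]
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v is None else v for v in values]


# Made-up loan terms; None marks a blank.
terms = [360, 180, None, 360, 480, None, 360]
filled_terms = impute_with_mode(terms)
print(filled_terms)
```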

Credit History:

A numeric variable, which will be converted to factor with 2 levels.

It shows whether the individual has taken a loan before: 1 if so, otherwise 0. There are 475
individuals with 1, 89 with 0, and 50 missing values. In the test data, NA: 29. This is a really
tricky one. We observe that

Property:

Factor with 3 levels. NA: 0


The property is classified into three categories: urban, semi-urban, and rural.

Property     % Train dataset    % Test dataset
Rural        29.15              30.25
Semiurban    37.95              31.60
Urban        32.90              38.15

We find that the distribution differs noticeably between the two datasets.

This should prove to be an important variable for deciding whether the loan amount is in proportion
to the market value of the property. The approval status will be decided on this basis.

Model Plan:

Since this is a binary classification problem, we intend to use logistic regression, fitted as a
GLM. The metric will be accuracy, since the competition evaluates the test dataset on correct
predictions. In real life, however, we would want to avoid false positives, because the cost can be
much higher later on if an approved loan becomes a non-performing asset (NPA). In such a case we
would use precision as the metric. We will try to optimise the model for each of these cases
separately.
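The difference between the two metrics can be illustrated with a small sketch. It assumes the labels are the strings "Y" (approved) and "N" (not approved), and the example predictions are made up. Accuracy counts every correct prediction, while precision looks only at predicted approvals and so penalises false positives directly.

```python
def confusion_counts(y_true, y_pred, positive="Y"):
    """Return (tp, fp, tn, fn) treating `positive` as the approved class."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1  # loan approved by the model but actually rejected
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def accuracy(tp, fp, tn, fn):
    return (tp + tn) / (tp + fp + tn + fn)


def precision(tp, fp):
    # Defined as 0.0 here when nothing is predicted positive.
    return tp / (tp + fp) if (tp + fp) else 0.0


# Made-up labels and predictions.
y_true = ["Y", "Y", "N", "N", "Y", "N"]
y_pred = ["Y", "Y", "Y", "N", "N", "N"]
tp, fp, tn, fn = confusion_counts(y_true, y_pred)
print(accuracy(tp, fp, tn, fn), precision(tp, fp))
```

With a GLM that outputs probabilities, raising the approval threshold above 0.5 is one common way to trade some accuracy for higher precision.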

Generation of New variables:

We intend to experiment with the following:

1. Total income = Applicant income + Co-applicant income
2. Per capita income = Total income / No. of dependents
3. Loan amount / Income ratio
4. Log of the incomes
5. Log of loan amount
6. EMI = Loan amount / Loan term
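The candidate variables listed above can be sketched for a single record as follows. The function and key names are hypothetical. Two choices are assumptions rather than part of the plan: per capita income is divided by dependents + 1 (counting the applicant) so that the 0-dependents case does not divide by zero, and the individual incomes use `log1p` so that a co-applicant income of 0 stays finite. The sketch also assumes total income, loan amount, and loan term are all positive, and that the "3+" level has already been mapped to a number.

```python
import math


def engineer_features(applicant_income, coapplicant_income, dependents,
                      loan_amount, loan_term):
    total_income = applicant_income + coapplicant_income
    # Assumption: household size = dependents + 1, avoiding division by zero.
    per_capita_income = total_income / (dependents + 1)
    return {
        "total_income": total_income,
        "per_capita_income": per_capita_income,
        "loan_to_income": loan_amount / total_income,
        # log1p keeps a zero co-applicant income finite (an assumption).
        "log_applicant_income": math.log1p(applicant_income),
        "log_coapplicant_income": math.log1p(coapplicant_income),
        "log_loan_amount": math.log(loan_amount),
        # EMI as defined in the list: loan amount divided by loan term.
        "emi": loan_amount / loan_term,
    }


# Made-up record.
features = engineer_features(4000, 1000, 1, 100, 360)
print(features)
```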
