The training dataset has 614 entries. The test dataset has 367 rows for which predictions are to be made.
Loan_status is the target variable that we need to predict. In approximately 69% of cases the loan
was approved, and in 31% of cases it was not.
Loan ID:
The loan ID is unique to each individual and has no relation to the eligibility criteria for
approving a loan.
Gender:
The training dataset has 112 females, 489 males, and 13 blanks. The corresponding figures for the test
data are 70 females, 286 males, and 11 NA. The gender ratio is nearly the same in both sets. We will
fill the missing values in proportion to the observed female-to-male ratio, computed either per
dataset or on the merged data.
Married:
In the training dataset, 213 individuals are not married, 398 are married, and 3 are blank. In the
test dataset the figures are: not married 134, married 233, NA 0. Here too we will fill the blanks
using the ratio of not married to married in the training data, which is roughly 1:2. The 3 blanks
will therefore be filled as 1 not married and 2 married. We will examine further whether other
variables, such as Dependents, can give us a lead.
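The ratio method described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `allocate_by_ratio` and the largest-remainder rounding rule are assumptions chosen so that the allocated counts always add up to the number of blanks.

```python
def allocate_by_ratio(observed, n_missing):
    """Split n_missing blanks across categories in proportion to observed counts.

    Uses largest-remainder rounding (an assumption, not stated in the report)
    so that the allocated counts sum exactly to n_missing.
    """
    total = sum(observed.values())
    exact = {k: n_missing * v / total for k, v in observed.items()}
    alloc = {k: int(x) for k, x in exact.items()}  # floor of each exact share
    leftover = n_missing - sum(alloc.values())
    # Hand the remaining units to the categories with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda k: exact[k] - alloc[k], reverse=True)
    for k in by_remainder[:leftover]:
        alloc[k] += 1
    return alloc


# Counts taken from the training data described in the report.
married_fill = allocate_by_ratio({"No": 213, "Yes": 398}, 3)
gender_fill = allocate_by_ratio({"Female": 112, "Male": 489}, 13)
print(married_fill)  # {'No': 1, 'Yes': 2}
print(gender_fill)   # {'Female': 2, 'Male': 11}
```

Which individual rows receive which label is a separate choice (for example, random assignment), and the report leaves it open.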
Dependents:
This shows whether an individual has parents or children who depend on them. It can be an important
factor in loan approval: the more dependents there are, the more income is spent on them, and the
individual may not be able to pay dues on time.
Here, 345 individuals have no dependents, while the others have one, two, or three or more. There
are 15 blanks, which we will replace with 0 on the assumption that they have no dependents.
Education:
This shows whether the individual is a graduate or not. There are 480 graduates and 134
non-graduates. In the test set, the respective counts are 283 and 84. There are no missing values.
The table below cross-tabulates marital status against education. Being a graduate and not married
raises the chances of approval, while being a non-graduate and married lowers them, because of
lower income and less security.
Married status    Graduate    Not Graduate
Yes               309         89
No                168         45
Self-Employment:
Self-employment is an important factor because the amount of income and job security are important
parameters in approving an individual's loan. There are 500 individuals who are not self-employed,
82 who are self-employed, and 32 NA. In the test data the counts are: self-employed 37, not
self-employed 307, NA 23. The missing values will be replaced by the ratio method.
Applicant Income:
Numeric variable. NA: 0
Higher income gives a greater chance of loan approval. There are no missing values in this
variable, but it has outliers.
Co-Applicant Income:
Numeric variable. NA: 0
This is very similar to the Applicant Income variable. A co-applicant's income improves the chances
of repayment and is therefore an important factor in granting the loan. It also has outliers in
both sets.
Since the income values for both Applicant and Co-Applicant are large numbers, we will try using
their logarithms as model parameters.
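A minimal sketch of the log transform follows. The income values are made up for illustration. Using `log1p` (the logarithm of 1 + x) is an assumption added here so that a co-applicant income of zero stays defined; the report only says "logarithmic value".

```python
import math


def log_income(values):
    """Return log(1 + x) for each income; log1p keeps zero incomes finite."""
    return [math.log1p(v) for v in values]


# Hypothetical incomes, not taken from the dataset.
applicant_income = [2500, 5800, 81000]
coapplicant_income = [0, 1500, 4200]

log_applicant = log_income(applicant_income)
log_coapplicant = log_income(coapplicant_income)

# The transform compresses the range, which reduces the pull of outliers.
print(max(applicant_income) / min(applicant_income))
print(max(log_applicant) / min(log_applicant))
```

The transform preserves the ordering of incomes while shrinking the gap between typical values and outliers.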
Loan Amount:
Numeric variable.
Train data, NA: 22
Test data, NA: 5
This variable captures the loan amount the individual applies for. Missing values will be replaced
with the mean loan amount for the locality of the house (Property_Area). The means used to fill
blanks for urban, semi-urban, and rural areas are 121, 127.30, and 139.74 respectively. There are
many outliers in both datasets.
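Imputing by locality mean can be sketched as below. The rows are hypothetical and the function name `impute_by_group_mean` is an assumption for illustration. The fallback to the overall mean for a group with no observed values is also an added assumption, not something the report specifies.

```python
from collections import defaultdict


def impute_by_group_mean(rows, group_key, value_key):
    """Fill missing (None) values with the mean of the row's group."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        if row[value_key] is not None:
            sums[row[group_key]] += row[value_key]
            counts[row[group_key]] += 1
    group_means = {g: sums[g] / counts[g] for g in sums}
    # Fallback (assumption): overall mean if a group has no observed values.
    overall = sum(sums.values()) / sum(counts.values())

    filled = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows are left unchanged
        if new_row[value_key] is None:
            new_row[value_key] = group_means.get(new_row[group_key], overall)
        filled.append(new_row)
    return filled


# Hypothetical rows; only the column names follow the dataset.
loans = [
    {"Property_Area": "Urban", "LoanAmount": 100},
    {"Property_Area": "Urban", "LoanAmount": 140},
    {"Property_Area": "Urban", "LoanAmount": None},
    {"Property_Area": "Rural", "LoanAmount": 150},
    {"Property_Area": "Rural", "LoanAmount": None},
    {"Property_Area": "Semiurban", "LoanAmount": 130},
]
filled_loans = impute_by_group_mean(loans, "Property_Area", "LoanAmount")
```

Because the variable has many outliers, a group median may be a more robust choice than the mean; that would be a change from the approach stated above.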
Loan Amount Term:
Numeric variable.
Train data, NA: 14
Test data, NA: 6
It shows the duration of the loan. As most loans are applied for with a term of 360 months, we
will replace blank values with 360.
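Filling blanks with the most common value (the mode) can be sketched as follows. The term values below are hypothetical, and `impute_with_mode` is an illustrative name.

```python
from collections import Counter


def impute_with_mode(values):
    """Replace missing (None) entries with the most frequent observed value."""
    observed = [v for v in values if v is not None]
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v is None else v for v in values]


# Hypothetical loan terms with two blanks.
terms = [360, 180, 360, None, 360, 120, None]
filled_terms = impute_with_mode(terms)
print(filled_terms)  # [360, 180, 360, 360, 360, 120, 360]
```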
Credit History:
It shows whether the individual has taken a loan before. If so, the value is 1; otherwise it is 0.
There are 475 individuals with 1, 89 with 0, and 50 missing values. In the test data, NA: 29. This
is a really tricky one. We observe that
Property:
This should prove to be an important variable for deciding whether the loan amount is in proportion
to the market value of the property. The approval status will be decided on this basis.
Model Plan:
Since this is a binary classification problem, we intend to use logistic regression (a GLM). The
metric will be accuracy, since the competition evaluates the test dataset on correct predictions.
In real life, however, we would want to avoid false positives, since their cost may be much higher
later on if the loan becomes a non-performing asset (NPA). In that case we would use precision as
the metric. We will try to optimise the model for each of these cases separately.
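The difference between the two metrics can be illustrated with a small sketch. The labels and predicted probabilities below are made up (1 means the loan is approved); they stand in for the output of a fitted model and are not results from this dataset. The thresholds 0.5 and 0.72 are likewise arbitrary choices for the example.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def accuracy(y_true, y_pred):
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + fp + tn + fn)


def precision(y_true, y_pred):
    tp, fp, _, _ = confusion_counts(y_true, y_pred)
    return tp / (tp + fp) if (tp + fp) else 0.0


def classify(probs, threshold):
    """Approve (1) when the predicted probability reaches the threshold."""
    return [1 if p >= threshold else 0 for p in probs]


# Hypothetical true labels and model probabilities.
y_true = [1, 1, 1, 1, 0, 0, 1, 0, 1, 1]
probs = [0.9, 0.8, 0.75, 0.6, 0.65, 0.3, 0.55, 0.7, 0.4, 0.95]

default_pred = classify(probs, 0.5)
strict_pred = classify(probs, 0.72)

print(accuracy(y_true, default_pred), precision(y_true, default_pred))  # 0.7 0.75
print(accuracy(y_true, strict_pred), precision(y_true, strict_pred))    # 0.7 1.0
```

In this toy example the stricter threshold removes both false positives while accuracy stays the same, which is why the choice of metric can lead to a different operating point for the same model.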