Banking Personal Loan

Prediction Model Building USING PYTHON


REPORT BY – TUSHAR DHANDE
Prediction Model : Personal Loan Report by Tushar Dhande

Note: Please refer to the ‘.ipynb’ or HTML file for the code.
Q.1. Read the column description and ensure you understand each attribute well.

Q.2. Perform univariate analysis of each and every attribute - use an appropriate plot for a given
attribute and mention your insights (5 points)

The distribution plot for Age shows around 3 peaks. However, the distribution of people
across age groups is largely uniform. The mean and median are close to each other and there are no
missing values in the dataset.
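
The checks described above (mean versus median, skewness, missing values) can be sketched with pandas. This is only an illustration: the toy values below are invented and the `summarize` helper is a hypothetical name, not code from the notebook.

```python
import pandas as pd

# Hypothetical toy sample; the real dataset is not reproduced in this report.
df = pd.DataFrame({
    "Age": [25, 35, 45, 45, 55, 65],
    "Income": [20, 30, 40, 50, 60, 220],
})


def summarize(frame: pd.DataFrame) -> pd.DataFrame:
    """Return mean, median, skewness and missing-value count per column."""
    return pd.DataFrame({
        "mean": frame.mean(),
        "median": frame.median(),
        "skew": frame.skew(),
        "missing": frame.isna().sum(),
    })


summary = summarize(df)
print(summary)
```

A mean close to the median together with a skewness near zero suggests a roughly symmetric attribute, while a mean well above the median and a positive skewness point to a right-skewed one.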

Experience also shows 3 peaks and a distribution similar to Age, since experience is directly
related to age. Again, the mean and median for Experience are very close to each other.

Income is right-skewed, i.e. there is a large number of people with lower incomes.

There is a considerable difference between the number of families with 3 members and those with
1 member. Families with only 1 member have the highest count; these might be bachelors.

The average spending on credit cards is right-skewed, indicating that most people spend
relatively little on their credit cards each month.

There is a considerably larger number of undergraduates in our dataset than graduates and
working professionals.

The Mortgage distribution is highly right-skewed with a large peak near 0. This indicates that a
large number of customers have a very low mortgage and very few people have a large mortgage.

Very few people took the loan, and few people have a securities account.

Very few people have a CD account. Also, the number of people using internet banking
facilities is higher than the number who do not, which could be attributed to the fact that the
data contains educated people.
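
For the categorical and binary attributes, the count-based observations above can be checked numerically with category shares. The sketch below uses invented values, and the column name `Online` for internet banking is an assumption.

```python
import pandas as pd

# Hypothetical toy values; "Online" as the internet-banking column name is an assumption.
df = pd.DataFrame({
    "Family": [1, 1, 1, 2, 2, 3, 4, 4],
    "CD Account": [0, 0, 0, 0, 0, 0, 0, 1],
    "Online": [1, 1, 1, 1, 1, 0, 0, 0],
})

# Share of each category per column, which is what a count plot shows visually.
shares = {
    col: df[col].value_counts(normalize=True).sort_index()
    for col in df.columns
}
for col, share in shares.items():
    print(col, share.to_dict())
```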

Q.3. Perform correlation analysis among all the variables - you can use Pairplot and Correlation
coefficients of every attribute with every other attribute (5 points)

As expected, we see a high correlation between Age and Experience. Going forward, before building
the model, we will drop one of these two variables from the dataset.
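
One way to find such highly correlated pairs programmatically is sketched below. The data is synthetic (Experience is deliberately built from Age), and the 0.9 cut-off is an illustrative assumption rather than a value taken from the notebook.

```python
import numpy as np
import pandas as pd

# Synthetic data (an assumption): Experience is built as Age minus roughly 25 years.
rng = np.random.default_rng(0)
age = rng.integers(25, 66, size=500)
df = pd.DataFrame({
    "Age": age,
    "Experience": age - 25 + rng.integers(-2, 3, size=500),
    "Income": rng.gamma(shape=2.0, scale=35.0, size=500),
})

corr = df.corr()  # Pearson correlation by default

# Keep only the upper triangle so that each pair is listed once, then filter.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()
high = pairs[pairs.abs() > 0.9]  # 0.9 is an illustrative threshold
print(high)
```

Only one variable from each flagged pair needs to be kept as a model input.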

Q.4. One hot encode the Education variable (3 points)
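
A minimal sketch of one-hot encoding with `pandas.get_dummies` follows. The integer codes 1 to 3 for Education and the toy rows are assumptions made for illustration.

```python
import pandas as pd

# Toy frame; the integer codes 1-3 for Education are an assumption.
df = pd.DataFrame({"Education": [1, 2, 3, 1], "Income": [49, 34, 11, 100]})

# Replace the Education column with one indicator column per level.
encoded = pd.get_dummies(df, columns=["Education"], prefix="Education", dtype=int)
print(encoded.columns.tolist())
```

Passing `drop_first=True` would drop one indicator column, which can help linear models by avoiding perfectly collinear inputs.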

Q.5. Separate the data into dependent and independent variables and create training and test
sets out of them (X_train, y_train, X_test, y_test) (2 points)

We drop Age (as it is correlated with Experience), ID and ZIP Code from the independent variables.
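
The drop-and-split step can be sketched as follows. The data is synthetic, and the target column name `Personal Loan`, the 70/30 split and the stratification are assumptions; the report does not state which settings were used.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the loan data (all values are made up).
rng = np.random.default_rng(42)
n = 200
df = pd.DataFrame({
    "ID": np.arange(1, n + 1),
    "Age": rng.integers(25, 66, size=n),
    "Experience": rng.integers(0, 41, size=n),
    "Income": rng.integers(10, 200, size=n),
    "ZIP Code": rng.integers(90000, 96000, size=n),
    "Personal Loan": np.array([1] * 20 + [0] * 180),  # about 10% positives
})

target = "Personal Loan"  # assumed name of the dependent variable
X = df.drop(columns=["ID", "Age", "ZIP Code", target])
y = df[target]

# stratify=y keeps the share of loan takers the same in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1
)
print(X_train.shape, X_test.shape, y_train.mean(), y_test.mean())
```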

Q.6. Use StandardScaler() from sklearn to transform the training and test data into scaled values
(fit the StandardScaler object to the train data and transform train and test data using this object,
making sure that the test set does not influence the values of the train set) (5 points)
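
The fit-on-train, transform-both pattern the question describes can be sketched on a tiny invented array:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Tiny invented arrays standing in for the real feature matrices.
X_train = np.array([[10.0, 1.0], [20.0, 2.0], [30.0, 3.0], [40.0, 4.0]])
X_test = np.array([[25.0, 2.5], [100.0, 10.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns mean and std from the train set only
X_test_scaled = scaler.transform(X_test)        # reuses the train statistics, no refit

print(scaler.mean_)
print(X_test_scaled)
```

Because the scaler never sees the test rows during fitting, the scaled test values may fall well outside the range of the scaled training values, which is expected.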

Q.7. Write a function which takes a model, X_train, X_test, y_train and y_test as input and returns
the accuracy, recall, precision, specificity, f1_score of the model trained on the train set and
evaluated on the test set (5 points)
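
One possible shape for such a function is sketched below. The name `evaluate_model`, the dictionary return type and the synthetic data are assumptions; the notebook's actual implementation may differ. Specificity is derived from the confusion matrix as TN / (TN + FP).

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)
from sklearn.model_selection import train_test_split


def evaluate_model(model, X_train, X_test, y_train, y_test):
    """Fit on the train set and score on the test set (binary labels 0/1 assumed)."""
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    # Specificity = TN / (TN + FP), taken from the confusion matrix.
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred, labels=[0, 1]).ravel()
    return {
        "accuracy": accuracy_score(y_test, y_pred),
        "recall": recall_score(y_test, y_pred, zero_division=0),
        "precision": precision_score(y_test, y_pred, zero_division=0),
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,
        "f1_score": f1_score(y_test, y_pred, zero_division=0),
    }


# Synthetic, imbalanced data purely for demonstration.
X, y = make_classification(n_samples=400, n_features=6, weights=[0.9, 0.1],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)
metrics = evaluate_model(LogisticRegression(max_iter=1000),
                         X_train, X_test, y_train, y_test)
print(metrics)
```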

Q.8. Employ multiple Classification models (Logistic, K-NN, Naïve Bayes etc.) and use the function
from step 7 to train and get the metrics of the model (15 points)

Q.9. Create a dataframe with the columns - “Model”, “accuracy”, “recall”, “precision”,
“specificity”, “f1_score”. Populate the dataframe accordingly (5 points)
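
The two steps above, training several classifiers and collecting their metrics in a dataframe, can be sketched together. Everything here is illustrative: the data is synthetic, the models use default hyperparameters (the report does not state the ones actually used), and `get_metrics` is a compact hypothetical stand-in for the metric function requested in Q.7.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic, imbalanced data standing in for the scaled loan features.
X, y = make_classification(n_samples=600, n_features=8, weights=[0.9, 0.1],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)


def safe_div(num, den):
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def get_metrics(model):
    """Fit on the train split and compute the five metrics on the test split."""
    model.fit(X_train, y_train)
    tn, fp, fn, tp = confusion_matrix(
        y_test, model.predict(X_test), labels=[0, 1]
    ).ravel()
    precision, recall = safe_div(tp, tp + fp), safe_div(tp, tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "recall": recall,
        "precision": precision,
        "specificity": safe_div(tn, tn + fp),
        "f1_score": safe_div(2 * precision * recall, precision + recall),
    }


models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "K-NN": KNeighborsClassifier(),
    "Naive Bayes": GaussianNB(),
    "SVM": SVC(),
}
rows = [{"Model": name, **get_metrics(model)} for name, model in models.items()]
results = pd.DataFrame(
    rows,
    columns=["Model", "accuracy", "recall", "precision", "specificity", "f1_score"],
)
print(results.round(3))
```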

Q.10. Give your reasoning on which is the best model in this case (5 points)

We have tried 4 different models in this case to classify whether a customer will take a loan or
not, using the attributes Experience, Family, CC Avg, CD Account, Education, Mortgage, Securities
Account and Credit Card.

We can see that on the training data KNN gives a perfect fit of 1.0; however, this might be a case
of over-fitting the data. Logistic Regression and SVM perform better on training accuracy than
Naïve Bayes.
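
The report does not say which K-NN settings were used. As a hedged illustration only: a training score of exactly 1.0 is what one would see with `n_neighbors=1`, because every training point is then its own nearest neighbour. The sketch below compares train and test accuracy for a few values of k on synthetic data; a large gap between the two is the usual sign of over-fitting.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic, noisy and imbalanced data; not the loan dataset.
X, y = make_classification(n_samples=500, n_features=8, weights=[0.9, 0.1],
                           flip_y=0.05, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

scores = {}
for k in (1, 5, 15):  # illustrative values of k
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    scores[k] = (knn.score(X_train, y_train), knn.score(X_test, y_test))
    print(f"k={k}: train={scores[k][0]:.3f}, test={scores[k][1]:.3f}")
```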

Logistic Regression and SVM have better accuracy on the test data as well. In our case, since the
data is imbalanced, i.e. only a small fraction (~10%) of all people take the loan, we need to account
for precision and recall as well to have a robust model, and the f1 score gives a good indication of
both. Looking at f1, we can clearly see that the SVM performs best, and it has the highest
specificity as well.
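
A small worked example shows why accuracy alone is not enough at a ~10% positive rate. The counts are hypothetical: a test set of 1,000 customers and a trivial model that predicts "no loan" for everyone.

```python
# Hypothetical test set of 1,000 customers, 10% of whom took the loan.
n_total, n_pos = 1000, 100
n_neg = n_total - n_pos

# A trivial model that predicts "no loan" for every customer.
tp, fp = 0, 0
fn, tn = n_pos, n_neg

accuracy = (tp + tn) / n_total
recall = tp / (tp + fn)
specificity = tn / (tn + fp)
precision = tp / (tp + fp) if (tp + fp) else 0.0
f1 = (2 * precision * recall / (precision + recall)
      if (precision + recall) else 0.0)

print(accuracy, recall, specificity, f1)  # 0.9 0.0 1.0 0.0
```

The trivial model reaches 90% accuracy and perfect specificity while finding no loan takers at all, so recall, precision and f1 are the metrics that expose it.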

Hence, we will choose the Support Vector Machine model for our prediction of customers who might
take a loan, and target campaigns accordingly.
