
Data Mining - Project 3

Thera Bank - Loan Purchase Modeling

Presented by: Sanan Sahadevan Olachery.

Submission Date: Jan 24th 2020.

Contents

• R codes

• Assignment Summary

• Question 1 with Working

• Question 2 with Working

• Question 3 with Working

Sr. No   R Code         Description
1        dim            Checks the dimensions of the data set
2        any(is.na)     Checks for missing values in the data set
3        sapply         Checks which columns have missing values
4        rpart
5        rpart.plot
6        randomForest
7        ranger
8        ROCit
9        dplyr
10       factoextra
11       Metrics

Note: Snapshots of most of the R code used in the assignment are pasted in the relevant sections.

Assessment Case Study:

This case is about a bank (Thera Bank) which has a growing customer base. The majority of these
customers are liability customers (depositors) with varying size of deposits. The number of customers
who are also borrowers (asset customers) is quite small, and the bank is interested in expanding this
base rapidly to bring in more loan business and in the process, earn more through the interest on
loans. In particular, the management wants to explore ways of converting its liability customers to
personal loan customers (while retaining them as depositors). A campaign that the bank ran last year
for liability customers showed a healthy conversion rate of over 9% success. This has encouraged the
retail marketing department to devise campaigns with better target marketing to increase the success
ratio with a minimal budget. The department wants to build a model that will help them identify the
potential customers who have a higher probability of purchasing the loan. This will increase the success
ratio while at the same time reduce the cost of the campaign.

The dataset has data on 5000 customers. The data include customer demographic information (age,
income, etc.), the customer's relationship with the bank (mortgage, securities account, etc.), and the
customer response to the last personal loan campaign (Personal Loan). Among these 5000 customers,
only 480 (= 9.6%) accepted the personal loan that was offered to them in the earlier campaign.

You are brought in as a consultant and your job is to build the best model which can classify the
right customers who have a higher probability of purchasing the loan. You are expected to do the
following:
• EDA of the data available. Showcase the results using appropriate graphs - (10 Marks)

• Build appropriate models on both the test and train data (CART & Random Forest). Interpret all
the model outputs and do the necessary modifications wherever eligible (such as pruning) - (30
Marks)

• Check the performance of all the models that you have built (test and train). Use all the model
performance measures you have learned so far. Share your remarks on which model performs
the best. - (20 Marks)

Q1} EDA of the data available. Showcase the results using appropriate graphs.
Solution:

The given data set was analyzed in RStudio; it contains 5000 rows and 14 variables (Fig 1). Missing values were
checked for, identified, and replaced. The summary and structure of the data set were also checked (Fig 2 & 3).

A preliminary analysis suggests that the ID and ZIP code columns are of little use for modeling, so they can be
removed from the dataset.

Experience had negative values, which were rectified by replacing them with positive values.
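The checks and clean-up steps described above can be sketched in base R as follows. This is a minimal illustration on a made-up six-row data frame: the column names, the median imputation, and the use of abs() for the negative Experience values are assumptions, since the report does not state exactly how the values were replaced.

```r
# Toy stand-in for the bank data; column names and values are illustrative assumptions.
bank <- data.frame(
  ID         = 1:6,
  Age        = c(25, 45, 39, 35, 35, 37),
  Experience = c(1, 19, -2, 9, 8, 13),
  Income     = c(49, 34, 11, 100, 45, NA),
  ZIP.Code   = c(91107, 90089, 94720, 94112, 91330, 92121)
)

dim(bank)                                                 # number of rows and columns
any(is.na(bank))                                          # TRUE if any value is missing
na_per_column <- sapply(bank, function(x) sum(is.na(x)))  # missing count per column

# Assumption: impute missing Income with the column median.
bank$Income[is.na(bank$Income)] <- median(bank$Income, na.rm = TRUE)

# Drop identifier-like columns that are not useful for modeling.
bank <- bank[, !(names(bank) %in% c("ID", "ZIP.Code"))]

# Assumption: negative experience is a sign error, so take the absolute value.
bank$Experience <- abs(bank$Experience)
```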

With the help of graphical representations, the following points were observed in the dataset:

A) Missing values were eliminated from the original dataset, and the new dataset does not contain any missing
values. (Fig 3.4)
B) The distribution of all numerical variables can be seen in the histograms. (Fig 3.4)
C) Outliers present in the dataset can be seen in the box plots. (Fig 3.5a & 3.5b)
• Credit Card & Mortgage have outliers at all levels of Education.
• Income has outliers for the Graduate and Advanced/Professional education levels.
• Similarly, outliers are present in Credit Card, Mortgage and Income for customers who did not take the
personal loan (class No).
D) Credit Card & Mortgage are good indicators and can be targeted.

We can further analyze the dataset by clustering and checking the optimal number of clusters (Fig 3.6 & Fig 3.7).
For a dataset of this size, k-means clustering is appropriate and is used for the analysis.

The dataset is divided into 3 clusters using k-means with 3 centers and nstart = 10. The number of clusters was
chosen using the within-cluster sum of squares and silhouette methods. The resulting cluster sizes are 2149, 2012 and 839.
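A minimal sketch of this clustering step, using only base R's kmeans() on synthetic data. The data, the seed, and the range of k values are assumptions for illustration; the silhouette check mentioned above would need an additional package and is not shown.

```r
set.seed(42)
# Synthetic numeric data with three loose groups (illustrative only).
x <- rbind(
  matrix(rnorm(100, mean = 0), ncol = 2),
  matrix(rnorm(100, mean = 5), ncol = 2),
  matrix(rnorm(100, mean = 10), ncol = 2)
)
x_scaled <- scale(x)  # k-means is distance-based, so put columns on one scale

# Total within-cluster sum of squares for k = 1..6; the bend in this curve
# suggests a reasonable number of clusters.
wss <- sapply(1:6, function(k) kmeans(x_scaled, centers = k, nstart = 10)$tot.withinss)

# Final model: 3 centers, 10 random restarts.
km <- kmeans(x_scaled, centers = 3, nstart = 10)
km$size  # number of observations in each cluster
```

Running kmeans() with nstart = 10 keeps the best of ten random initializations, which makes the reported cluster sizes less sensitive to the starting centers.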

Fig (1)

Fig (2)

Fig(3.1)

Fig(3.2)

Fig(3.3)

Fig(3.4)

Histogram (Fig 3.4)

Fig(3.5a)

Fig(3.5b)

Fig(3.6)

Fig(3.7)

Q2} Build appropriate models on both the test and train data (CART & Random Forest). Interpret all the
model outputs and do the necessary modifications wherever eligible (such as pruning)

Solution:
The dataset is split in a 70/30 ratio into train and test data sets using RStudio.

Sr. No   Particulars       Class 0 (no loans)   Class 1 (loans)
1        Therabank.Train   3151                 359
2        Therabank.Test    1369                 131
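A 70/30 split of this kind can be sketched in base R as below. The toy data frame, the seed, and the object names bank_train and bank_test are assumptions; the report's own split code is not reproduced here.

```r
set.seed(123)
# Toy data: 1000 customers, roughly 10% positive (illustrative assumption).
bank <- data.frame(
  Income        = round(runif(1000, 10, 200)),
  Personal.Loan = factor(rbinom(1000, 1, 0.1), levels = c(0, 1))
)

# Draw 70% of the row indices at random for training; the rest form the test set.
train_idx  <- sample(seq_len(nrow(bank)), size = floor(0.7 * nrow(bank)))
bank_train <- bank[train_idx, ]
bank_test  <- bank[-train_idx, ]

# Class counts per partition, analogous to the table above.
table(bank_train$Personal.Loan)
table(bank_test$Personal.Loan)
```

With a rare positive class, it is worth checking that both partitions keep a similar share of class 1, as the two table() calls do.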

CART Modelling

A CART model is fitted to the training data to learn a classification tree.

The classification tree is then plotted.

Variable importance for the tree splits is computed:

Variable          Importance
Education         232.137107
Income            188.541598
Family.members    142.501489
CC Avg            106.606257
CD.Account         56.904176
Mortgage           27.306276
Experience          3.445512
Age                 3.437672
Online              1.75104

Observations from the CART classification tree:

• The tree splits on whether Income is above or below $115k.
• The tree algorithm splits the data on Education, Income, Family Members, CD Account and CC Average.
• The complexity parameter is lowered to 0.05, compared with 0.2.

The CART tree is pruned to control overfitting.
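The fit-then-prune workflow can be sketched with the rpart package as follows. The synthetic data, the rule that generates the Loan column, and the choice of pruning at the cross-validated minimum are all assumptions for illustration and may differ from the settings used in the report.

```r
library(rpart)  # ships with standard R distributions as a recommended package

set.seed(1)
n <- 600
# Synthetic data; the rule generating Loan is an assumption for illustration.
d <- data.frame(
  Income    = runif(n, 10, 200),
  Education = factor(sample(1:3, n, replace = TRUE)),
  CCAvg     = runif(n, 0, 10)
)
p <- ifelse(d$Income > 115 & d$Education != "1", 0.9, 0.05)
d$Loan <- factor(rbinom(n, 1, p), levels = c(0, 1))

# Grow a deliberately large tree (cp = 0 disables the complexity penalty).
full_tree <- rpart(Loan ~ ., data = d, method = "class",
                   control = rpart.control(cp = 0, minsplit = 10, xval = 10))

# Choose the cp with the lowest cross-validated error and prune back to it.
cp_table <- full_tree$cptable
best_cp  <- cp_table[which.min(cp_table[, "xerror"]), "CP"]
pruned   <- prune(full_tree, cp = best_cp)

pruned$variable.importance  # importance scores per predictor
```

Pruning at the cp that minimizes the cross-validated error is one common choice; a fixed value such as the 0.05 mentioned above could be passed to prune() instead.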

CART Model Prediction

The bank would rather be cautious about disbursing loans to likely defaulters than worry about rejecting a
genuine customer, so we set the classification threshold at 0.70. Predicted probabilities of 0.7 or more are
assigned to class 1, and all others to class 0.
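The thresholding rule can be illustrated in a few lines of base R. The probabilities below are hypothetical; with an rpart classification tree they would typically come from predict(model, newdata, type = "prob").

```r
# Hypothetical predicted probabilities of class 1 for six customers.
prob_loan <- c(0.95, 0.70, 0.69, 0.10, 0.82, 0.40)

# Probabilities at or above the threshold are assigned to class 1.
threshold  <- 0.70
pred_class <- ifelse(prob_loan >= threshold, 1, 0)
pred_class  # 1 1 0 0 1 0
```

Raising the threshold above the default 0.5 trades some missed loan takers for fewer false positives.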

Confusion Matrix to Check the Performance of CART Model Prediction

With a confusion matrix we can check the performance of the models.
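As a small worked example of how the usual measures are read off a confusion matrix, the sketch below uses ten made-up labels; the vectors and variable names are assumptions, not results from the Thera Bank models.

```r
# Hypothetical true and predicted classes for ten customers.
actual    <- factor(c(0, 0, 0, 0, 0, 0, 1, 1, 1, 0), levels = c(0, 1))
predicted <- factor(c(0, 0, 0, 1, 0, 0, 1, 0, 1, 0), levels = c(0, 1))

# Rows are actual classes, columns are predicted classes.
cm <- table(Actual = actual, Predicted = predicted)

accuracy    <- sum(diag(cm)) / sum(cm)        # share of correct predictions
error_rate  <- 1 - accuracy                   # share of wrong predictions
sensitivity <- cm["1", "1"] / sum(cm["1", ])  # true positive rate
specificity <- cm["0", "0"] / sum(cm["0", ])  # true negative rate
```

Accuracy and error rate always sum to 1, so a model with a small error rate has a correspondingly high accuracy.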


It is observed that the pruned CART tree has an accuracy of 1.67% after its complexity parameter is adjusted.

Applying Random Forest Model to the dataset


Random forest is an ensemble method. It combines many decision trees, each grown on a random bootstrap sample
of the data with a random subset of variables considered at each split. For this case study we have used the
randomForest and ranger packages.
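To make the ensemble idea and the out-of-bag (OOB) error concrete, here is a hand-rolled sketch that bags rpart trees on synthetic data. It is not the randomForest or ranger implementation: the data, the number of trees, and the per-tree (rather than per-split) variable sampling are simplifying assumptions.

```r
library(rpart)

set.seed(7)
n <- 400
# Synthetic data: y depends on x1 and x2 only; x3 is noise.
d <- data.frame(x1 = runif(n), x2 = runif(n), x3 = runif(n))
d$y <- factor(ifelse(d$x1 + d$x2 > 1, 1, 0), levels = c(0, 1))

n_trees <- 25
votes   <- matrix(0, nrow = n, ncol = 2, dimnames = list(NULL, c("0", "1")))

for (b in seq_len(n_trees)) {
  boot <- sample(n, replace = TRUE)   # bootstrap sample of row indices
  oob  <- setdiff(seq_len(n), boot)   # rows this tree never saw
  # Random subset of predictors for the whole tree (a simplification:
  # a true random forest redraws the subset at every split).
  vars <- sample(c("x1", "x2", "x3"), 2)
  fit  <- rpart(reformulate(vars, response = "y"), data = d[boot, ], method = "class")
  pred <- as.character(predict(fit, d[oob, ], type = "class"))
  idx  <- cbind(oob, match(pred, colnames(votes)))
  votes[idx] <- votes[idx] + 1        # one vote per out-of-bag row
}

# Majority vote over the trees for which each row was out of bag.
has_vote  <- rowSums(votes) > 0
oob_pred  <- colnames(votes)[max.col(votes[has_vote, ], ties.method = "first")]
oob_error <- mean(oob_pred != as.character(d$y[has_vote]))
oob_error
```

Because each row is scored only by trees that did not train on it, the OOB error acts as a built-in validation estimate without a separate hold-out set.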

Random Forest Modeling

Out-of-bag error = 0.01314286

From the out-of-bag (OOB) error for class 0, class 1 and overall, we can infer that between 250 and 350 trees
should suffice: training fewer trees still achieves a good result.

Plotting of Random Forest Model

Prediction of Random forest

Tuning the Random Forest

Applying Ranger Package Model to the Dataset

The ranger package is a fast implementation of random forests, and when trained through caret it has few
parameters to tune. mtry is the number of variables considered at each split; the other parameters, such as the
split rule and minimum node size, are handled by the tuning procedure.

Since the case study is a classification problem, the tuning procedure automatically chooses the best split
rule and minimum node size.

library(caret)
Range.model <- train(personal.loan ~ ., data = TheraBank.train, tuneLength = 3,
                     method = "ranger",
                     trControl = trainControl(method = "cv", number = 5, verboseIter = FALSE))

Tuning of the Ranger Grid

Refined Ranger model

The model is refined with 511 trees and mtry = 4.

Prediction of Ranger

Confusion Matrix of Ranger

Plotting of ROC Curve

By plotting the ROC curve we can check the sensitivity and specificity rates.
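The points of a ROC curve can be computed by hand in base R, as sketched below. The report lists the ROCit package for this step; the version here avoids extra packages, and the scores and labels are made up for illustration.

```r
# Hypothetical scores and true labels (1 = took the loan).
score <- c(0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10, 0.05)
label <- c(1,    1,    0,    1,    0,    1,    0,    0,    0,    0)

# Sweep the threshold from strictest to most lenient.
thresholds <- c(Inf, sort(unique(score), decreasing = TRUE))

# Sensitivity (true positive rate) and 1 - specificity (false positive rate).
tpr <- sapply(thresholds, function(t) sum(score >= t & label == 1) / sum(label == 1))
fpr <- sapply(thresholds, function(t) sum(score >= t & label == 0) / sum(label == 0))

# Area under the curve by the trapezoidal rule.
auc <- sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)
auc
```

Calling plot(fpr, tpr, type = "l") on these vectors draws the curve; each point shows the sensitivity obtained at a given false positive rate.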

Q3} Check the performance of all the models that you have built (test and train). Use all the model
performance measures you have learned so far. Share your remarks on which model performs the best.
Solution:
As a consultant, our objective was to build the best predictive model to classify the customers who have a
higher probability of purchasing the loan. After checking the performance of all the models (test and
train), we conclude the following:

The CART model does not perform as well as the random forest and the ranger random forest.

The output of the ranger random forest should be preferred, as it has 98% accuracy.

This helps ensure that there are no defaulters among the customers who are also borrowers (asset customers)
and that our interest earnings from them remain good.

Particulars             OOB errors %   Accuracy %
CART                    0.21           1.67
Random Forest           1.2            1.8
Tuned Random Forest     1.17           90.3
Ranger Random Forest    1.31           98.8
