Content
R codes

Sr. No   R Codes         Description
1        dim             Checks the dimensions of the data set
2        any(is.na)      Checks for missing values in the data set
3        sapply          Checks which columns have missing values
4        rpart
5        rpart.plot
6        randomForest
7        ranger
8        ROCit
9        dplyr
10       factoextra
11       Metrics
Note: Snapshots of most of the R code used in the assignment are pasted in the relevant sections.
Assessment Case Study:
This case is about a bank (Thera Bank) which has a growing customer base. The majority of these
customers are liability customers (depositors) with varying size of deposits. The number of customers
who are also borrowers (asset customers) is quite small, and the bank is interested in expanding this
base rapidly to bring in more loan business and in the process, earn more through the interest on
loans. In particular, the management wants to explore ways of converting its liability customers to
personal loan customers (while retaining them as depositors). A campaign that the bank ran last year
for liability customers showed a healthy conversion rate of over 9% success. This has encouraged the
retail marketing department to devise campaigns with better target marketing to increase the success
ratio with a minimal budget. The department wants to build a model that will help them identify the
potential customers who have a higher probability of purchasing the loan. This will increase the success
ratio while at the same time reduce the cost of the campaign.
The dataset has data on 5000 customers. The data include customer demographic information (age,
income, etc.), the customer's relationship with the bank (mortgage, securities account, etc.), and the
customer response to the last personal loan campaign (Personal Loan). Among these 5000 customers,
only 480 (= 9.6%) accepted the personal loan that was offered to them in the earlier campaign.
You are brought in as a consultant and your job is to build the best model which can classify the
right customers who have a higher probability of purchasing the loan. You are expected to do the
following:
EDA of the data available. Showcase the results using appropriate graphs - (10 Marks)
Build appropriate models on both the test and train data (CART & Random Forest). Interpret all
the model outputs and do the necessary modifications wherever eligible (such as pruning) - (30
Marks)
Check the performance of all the models that you have built (test and train). Use all the model
performance measures you have learned so far. Share your remarks on which model performs
the best. - (20 Marks)
Q1} EDA of the data available. Showcase the results using appropriate graphs.
Solution:
The given data set was analyzed using RStudio; it has 5000 rows and 14 variables (Fig 1). Missing values were
checked, identified and replaced in the data set. The summary and structure of the data set were also
checked (Fig 2 & 3).
After a preliminary analysis of the data set we can infer that the ID and ZIP code variables are not of much use,
so they can be removed from the dataset.
The Experience variable had negative values, which were rectified and replaced with positive values.
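The preliminary checks described above can be sketched in R as follows. This is only an illustration on a small made-up data frame, not the assignment's actual code: the data frame name `bank`, its column names and its values are hypothetical. The report does not say how missing or negative values were replaced, so the median imputation and the use of abs() below are assumptions.

```r
# Hypothetical stand-in for the Thera Bank data (names and values are made up)
bank <- data.frame(
  ID         = 1:6,
  Age        = c(25, 45, 39, 35, 35, 37),
  Experience = c(1, 19, -2, 9, 8, 13),
  Income     = c(49, 34, 11, 100, 45, 29),
  ZIP.Code   = c(91107, 90089, 94720, 94112, 91330, 92121),
  Family     = c(4, 3, NA, 1, 4, NA)
)

dim(bank)                                              # rows and columns
any(is.na(bank))                                       # TRUE if any value is missing
na_per_col <- sapply(bank, function(x) sum(is.na(x)))  # missing count per column
na_per_col

# Assumed imputation: replace missing Family values with the column median
bank$Family[is.na(bank$Family)] <- median(bank$Family, na.rm = TRUE)

# Drop identifier-like columns that carry no predictive information
bank <- bank[, !(names(bank) %in% c("ID", "ZIP.Code"))]

# Assumed fix: treat negative experience as a sign error
bank$Experience <- abs(bank$Experience)
```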
With the help of graphical representations, the following points were observed in the dataset:
A) Missing values were eliminated from the original dataset, and the new dataset does not contain any missing
values. (Fig 3.4)
B) The distribution of all numerical variables can be seen in the histograms. (Fig 3.4)
C) Outliers present in the dataset can be seen in the box plots. (Fig 3.5a & 3.5b)
   - Credit Card and Mortgage have outliers at all levels of Education.
   - Income has outliers for the Graduate and Advanced/Professional education levels.
   - Outliers are also present in Credit Card, Mortgage and Income for customers who did not take the
     personal loan (Personal Loan class "No").
D) Credit Card and Mortgage are good indicators and can be targeted.
We can further analyze the dataset by clustering it and checking the optimal number of clusters. (Fig 3.6 & Fig 3.7)
For a data set of this size, K-means clustering is appropriate and is used for the analysis.
The optimal number of clusters was checked using the within-cluster sum of squares and silhouette methods.
K-means was then run with 3 centers and nstart = 10, dividing the dataset into clusters of sizes 2149, 2012 and 839.
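A minimal sketch of this clustering step in base R is given below. The data are simulated purely for illustration, so the cluster sizes will not match those reported above. Only the within-cluster sum of squares check is shown, and scaling the variables first is an assumption about the preprocessing.

```r
set.seed(123)  # k-means uses random starts, so fix the seed

# Hypothetical numeric data: three loose groups in two dimensions
x <- rbind(
  matrix(rnorm(100, mean = 0), ncol = 2),
  matrix(rnorm(100, mean = 4), ncol = 2),
  matrix(rnorm(100, mean = 8), ncol = 2)
)
x <- scale(x)  # put variables on a common scale before clustering

# Total within-cluster sum of squares for k = 1..6 (the "elbow" check)
wss <- sapply(1:6, function(k) kmeans(x, centers = k, nstart = 10)$tot.withinss)

# Final model with 3 centers and 10 random starts
km <- kmeans(x, centers = 3, nstart = 10)
km$size  # number of observations in each cluster
```

Plotting `wss` against k and looking for the bend in the curve is the usual way to read the elbow check.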
Fig (1)
Fig (2)
Fig (3.1)
Fig (3.2)
Fig (3.3)
Fig (3.4)
Fig (3.5a)
Fig (3.5b)
Fig (3.6)
Fig (3.7)
Q2} Build appropriate models on both the test and train data (CART & Random Forest). Interpret all the
model outputs and do the necessary modifications wherever eligible (such as pruning)
Solution:
The dataset is split in a 70/30 ratio into train and test data sets using RStudio.
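One common way to make such a split in base R is sketched below. The data frame `bank` and its two columns are simulated placeholders, and the seed value is arbitrary; the report does not show which splitting function was actually used.

```r
set.seed(42)  # make the random split reproducible

# Hypothetical data frame standing in for the cleaned bank data
n <- 5000
bank <- data.frame(
  Income        = round(runif(n, 8, 224)),
  Personal.Loan = factor(rbinom(n, 1, 0.096))
)

# Sample 70% of the row indices for training; the rest form the test set
train_idx  <- sample(seq_len(n), size = round(0.7 * n))
bank_train <- bank[train_idx, ]
bank_test  <- bank[-train_idx, ]

nrow(bank_train)  # 3500
nrow(bank_test)   # 1500
```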
CART Modelling
Variable importance for the tree splits is examined.
A split can be seen on Income at $115k (greater than versus less than).
The tree algorithm splits the data on Education, Income, Family Members, CD Account and CC Average.
The complexity parameter is lowered to 0.05, compared with 0.2.
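The sketch below shows how a CART model can be grown, inspected and pruned with the rpart package. The training data and the rule that generates the loan label are invented for illustration. Growing the tree with cp = 0 and pruning at the cp with the lowest cross-validated error is one standard approach and is an assumption here; it is not necessarily the procedure behind the 0.05 value quoted above.

```r
library(rpart)  # recommended package shipped with standard R installations

set.seed(1)
# Hypothetical training data; the loan rule below is invented for illustration
n <- 2000
train <- data.frame(
  Income    = runif(n, 8, 224),
  Education = sample(1:3, n, replace = TRUE),
  CCAvg     = runif(n, 0, 10)
)
train$Personal.Loan <- factor(ifelse(train$Income > 115 & train$Education > 1, 1, 0))

# Grow a deliberately large classification tree (cp = 0 disables early stopping)
fit <- rpart(Personal.Loan ~ ., data = train, method = "class",
             control = rpart.control(cp = 0, minsplit = 20))

printcp(fit)              # cross-validated error for each candidate cp
fit$variable.importance   # importance of each predictor in the splits

# Prune back at the cp with the lowest cross-validated error
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best_cp)
```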
The bank would rather guard against disbursing loans to potential defaulters than avoid rejecting a
genuine customer, so we set the classification threshold at 0.70. Predicted probabilities of 0.70 or more are
assigned to class 1, and all others to class 0.
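The thresholding rule can be illustrated with a few made-up probabilities. The vectors `prob_1` and `actual` below are hypothetical. With an rpart model, the class 1 probabilities would typically come from predict(model, newdata, type = "prob")[, "1"].

```r
# Hypothetical predicted probabilities of class 1 and the true labels
prob_1 <- c(0.95, 0.72, 0.70, 0.65, 0.40, 0.10, 0.05, 0.81)
actual <- factor(c(1, 1, 0, 1, 0, 0, 0, 1), levels = c(0, 1))

# Probabilities of 0.70 or more become class 1, everything else class 0
threshold <- 0.70
predicted <- factor(ifelse(prob_1 >= threshold, 1, 0), levels = c(0, 1))

# Confusion matrix: rows are predictions, columns are actual classes
cm <- table(Predicted = predicted, Actual = actual)
cm

accuracy    <- sum(diag(cm)) / sum(cm)
sensitivity <- cm["1", "1"] / sum(cm[, "1"])  # true positives / actual positives
specificity <- cm["0", "0"] / sum(cm[, "0"])  # true negatives / actual negatives
```

A higher threshold lowers the number of false positives at the cost of missing some true positives, which matches the cautious stance described above.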
Confusion matrix to check the performance of the CART model predictions
Out-of-bag (OOB) error = 0.01314286
From the OOB error curves for class 0, class 1 and overall, we can infer that 250 to 350 trees
should suffice, which reduces training time while still achieving good results.
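How the OOB error is read from a fitted forest can be sketched with the randomForest package. The data and the labelling rule are simulated, and ntree = 501 is an arbitrary choice for illustration, so the resulting error will differ from the figure reported above.

```r
library(randomForest)  # install.packages("randomForest") if not available

set.seed(7)
# Hypothetical training data; the labelling rule is invented for illustration
n <- 1000
train <- data.frame(
  Income = runif(n, 8, 224),
  CCAvg  = runif(n, 0, 10),
  Family = sample(1:4, n, replace = TRUE)
)
train$Personal.Loan <- factor(ifelse(train$Income > 115 & train$CCAvg > 3, 1, 0))

rf <- randomForest(Personal.Loan ~ ., data = train, ntree = 501, importance = TRUE)

# err.rate has one row per tree: overall OOB error plus one column per class
head(rf$err.rate)
oob_final <- rf$err.rate[rf$ntree, "OOB"]  # OOB error using all trees

# Error curves against number of trees; a flat tail suggests fewer trees suffice
plot(rf)
```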
Prediction of Random forest
Tuning the Random Forest
Applying Ranger Package Model to the Dataset
The ranger package is a fast implementation of random forests, and when used through caret it has only a few
parameters to tune. mtry is the number of variables considered as candidates at each split; the other tuned
parameters are the split rule and the minimum node size.
Since the case study is a classification problem, a split rule and minimum node size suited to classification
are selected during tuning.
library(caret)
# 5-fold cross-validated ranger model, tuned via caret
Range.model <- train(personal.loan ~ ., data = TheraBank.train, tuneLength = 3, method = "ranger",
                     trControl = trainControl(method = "cv", number = 5, verboseIter = FALSE))
Tuning of Ranger Grid
Refined Ranger model
Prediction of Ranger
Plotting of ROC Curve
By plotting the ROC curve we can check the sensitivity and specificity rates.
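The relationship between the ROC curve, sensitivity and specificity can be sketched in base R as below. The report lists the ROCit package for this step; the hand-rolled version here is a stand-in used only to make the calculation visible, and the scores and labels are hypothetical.

```r
# Hypothetical scores (predicted probability of class 1) and true labels
score  <- c(0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2)
actual <- c(1,   1,   0,   1,   1,    0,   0,   0)

# Evaluate every distinct score as a cut-off, from strictest to most lenient
cuts <- c(Inf, sort(unique(score), decreasing = TRUE))
tpr  <- sapply(cuts, function(t) sum(score >= t & actual == 1) / sum(actual == 1))
fpr  <- sapply(cuts, function(t) sum(score >= t & actual == 0) / sum(actual == 0))

# Area under the curve by the trapezoidal rule
auc <- sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)

plot(fpr, tpr, type = "b", xlab = "1 - Specificity (FPR)",
     ylab = "Sensitivity (TPR)", main = "ROC curve")
abline(0, 1, lty = 2)  # reference line for a random classifier
```

Each point on the curve is one cut-off: its height is the sensitivity and its horizontal position is one minus the specificity.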
Q3} Check the performance of all the models that you have built (test and train). Use all the model
performance measures you have learned so far. Share your remarks on which model performs the best.
Solution:
As consultants, our objective was to build the best predictive model for classifying the customers
who have a higher probability of purchasing the loan. After checking the performance of all the models (test and
train), we conclude the following:
The performance of the CART model is not as good as that of the Random Forest and Ranger
Random Forest models.
The aim is to ensure that there are no defaulters among the customers who are also borrowers (asset customers)
and that the earnings from them in the form of interest are good.
Particulars             OOB errors %   Accuracy %
CART                    0.21           1.67
Random Forest           1.2            1.8
Tuned Random Forest     1.17           90.3
Ranger Random Forest    1.31           98.8