
Bank Data Analysis Report

Chandra Prakash S
CGT19010

Background:
A Portuguese bank sells long-term deposits through targeted telemarketing phone calls. Within a campaign, human agents phone a list of clients to sell the deposit; alternatively, if a client calls the contact centre for any other reason, he or she is asked to subscribe to the deposit. The result of each contact is therefore binary: successful or unsuccessful (y).

Objective:
The purpose of the modeling is to carry out a classification analysis that profiles the customers for whom a contact would be successful and those for whom it would be unsuccessful, using three analysis techniques: Logistic Regression, Decision Tree and Random Forest. The techniques are then compared to find which performs best in this scenario.

Methodology:
The sequential activities carried out in the analysis are shown below:
• Data Understanding
• Data Preparation
• Model Building
• Model Validation
• Performance Comparison

Data Understanding

The data set consists of a total of 45211 observations on 17 variables (including 1 dependent variable). Each record includes 16 explanatory variables describing the client contacted, and 1 response variable indicating whether the client subscribed to a term deposit.

The structure of the data shows a combination of numerical (integer) and categorical (factor) variables, in both continuous and discrete form.

The data was inspected for duplicate entries, which showed that there are no duplicates, and then checked for missing values, which showed that the data is free of those as well. This allowed the data-imputation step to be skipped, which would have been tedious with so many of the variables being categorical (factor) variables.

The dependent variable y was then converted from yes/no to binary 1/0 respectively, which makes the further analysis simpler.
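In R this recoding is a one-liner, as in the annexure code:

bank$y = ifelse(bank$y == 'yes', 1, 0) # recode yes/no to 1/0
bank$y = as.factor(bank$y) # treat the recoded y as a factor for classification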

Further, univariate analysis was done to study the nature of each variable and the distribution of the data through bar charts, box plots and histograms. Some variables, such as age and balance, contain outliers.

The graph below shows the numerical variables with their outliers. These outliers might affect the predictive ability of the regression model, but given how widely the data is distributed there will always be some outliers in it; moreover, the decision tree and the random forest are capable of handling outliers.
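The box plots used for this inspection follow the annexure code:

qplot(bank$y, bank$age, data = bank, geom = "boxplot", xlab = "Subscription Status", ylab = "Age")
qplot(bank$y, bank$balance, data = bank, geom = "boxplot", xlab = "Subscription Status", ylab = "Balance")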

Data Preparation:
The data analyzed above had a few discrepancies, which were corrected at this stage. The job variable had a very few entries labelled "unknown"; since these are of no use in the model, they were removed. (For instance, we cannot state that the bank will succeed if it attempts to sell the term deposit to a person with an unknown job.)

The variable contact had too many "unknown" values. Since all the clients were contacted anyway, the mode of contact is not important, and without complete information on the modes of contact such data cannot be imputed either; the variable was therefore dropped from the analysis. If a decision tree told you that a client will not subscribe when the mode of contact is unknown, that rule would make no difference to you and would be of no help in the future either.

There are also "unknown" values in other variables such as poutcome. These cannot be removed, since they occur in significant numbers and removing them would handicap the analysis through the loss of many data points.

The bar chart of the dependent variable also shows that the response yes (1) is far rarer than the response no (0). A model trained on such data would be skewed towards predicting no (0), which is not the purpose of the study; we require a model that can predict yes and no with equal ability. To ensure this, the training data should have an equal spread of yes and no.

The correlation matrix shows that there is not much correlation among the numeric variables, although there is a strong correlation between y and the duration of the contact. The variables previous and pdays are also strongly correlated, so using either one of them in the model would be sufficient.
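A compact sketch of this check, a slight variant of the annexure's column-by-column approach:

cordata <- bank[, sapply(bank, is.numeric)] # keep the numeric columns only
cordata$y <- as.numeric(bank$y) # append the recoded response as numeric
cor(cordata) # pairwise correlation matrix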

The data set is split into two parts, one for training the model and the other for validating its predictive ability; the former is called the training data and the latter the test data. The main data was split into training and test sets in the ratio 0.7 (70:30).
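The split is done with sample.split from the caTools package, as in the annexure:

library(caTools)
set.seed(010) # for reproducibility
Ptr <- sample.split(bank$y, SplitRatio = 0.7) # stratified split preserving the yes/no ratio
train <- subset(bank, Ptr == TRUE)
test <- subset(bank, Ptr == FALSE)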
The training data contains 27768 entries of no (0) and 3678 entries of yes (1). This data cannot be used as such to build the model, so it needs to be balanced first. Balancing can be achieved by undersampling, oversampling or synthetic methods.

For oversampling, the yes (1) records are sampled with replacement until they match the number of zeros and are then combined with the data for the 0s. This increases the number of data points in the training data to 55536.
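A sketch of the oversampling step, following the annexure:

train1 <- subset(train, y == 1) # minority class (yes)
train0 <- subset(train, y == 0) # majority class (no)
osmp <- sample(1:nrow(train1), nrow(train0), replace = TRUE) # resample the 1s with replacement
newtrainos <- rbind(train1[osmp, ], train0) # balanced training set of 55536 rows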

Model Preparation

To predict, given the sixteen variables, whether or not the client will subscribe to the bank term deposit, three algorithms are used:
• Logistic Regression
• Decision Tree
• Random Forest

Logistic Regression
Logistic regression is a machine learning algorithm used for classification problems; it is a predictive analysis algorithm based on the concept of probability. In binomial logistic regression the dependent variable has only two possible values, 1 and 0. For example, these may represent success or failure, yes or no, win or loss, etc. It is one of the simplest ML algorithms and is used for various classification problems such as spam detection, diabetes prediction and cancer detection.
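A minimal sketch of the model fit used here, as in the annexure:

modo <- glm(y ~ ., data = newtrainos, family = "binomial") # binomial logistic regression
pro <- predict(modo, test, type = "response") # predicted probabilities on the test data
clo <- ifelse(pro >= 0.5, 1, 0) # classify at the 0.5 cut-off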

Decision Tree
A decision tree is a flowchart-like structure in which each internal node represents a "test" on an attribute, each branch represents the outcome of the test, and each leaf node represents a class label. The paths from root to leaf represent classification rules. Algorithms for constructing decision trees usually work top-down, choosing at each step the variable that best splits the set of items. Different algorithms use different metrics for measuring "best"; these generally measure the homogeneity of the target within the subsets. The metric is applied to each candidate subset, and the resulting values are combined (e.g., averaged) to provide a measure of the quality of the split.
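A minimal sketch of how such a tree is grown here with the rpart package, as in the annexure:

library(rpart)
library(rpart.plot)
DTmodos <- rpart(y ~ ., data = newtrainos, method = "class") # grow the classification tree
rpart.plot(DTmodos, cex = 0.6) # plot the fitted tree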

Random Forest
A random forest is an ensemble of decision trees. Each tree is grown on a bootstrap sample of the training data, and at each split only a random subset of the variables is considered as a candidate. The predictions of the individual trees are then aggregated, for classification by majority vote. This reduces the variance and overfitting of a single deep tree and usually gives better generalization than an individual decision tree.
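A minimal sketch of the forest built here, as in the annexure:

library(randomForest)
rfos <- randomForest(y ~ ., data = newtrainos, importance = TRUE, ntree = 1000) # 1000-tree forest
prdrfos <- predict(rfos, test, type = "prob") # class probabilities on the test data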

Model Validation:

Logistic regression:
The oversampled training data was used to fit a binomial logistic regression model with glm; the model was then tested on the test data and the confusion matrix created, which shows an accuracy of 0.8376.

The model predicts correctly about 84% of the time.

From the model summary it can be seen that age and pdays are insignificant to the model, as they have a p-value > 0.05.

The AIC value is 45288

The performance of the model was also evaluated using the ROC curve. The area under the ROC curve was found to be 0.9047, indicating a significantly good model.
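The ROC curve and AUC are computed with the ROCR package, as in the annexure:

library(ROCR)
LRpredictionos <- prediction(pro, test$y) # ROCR prediction object
LRROCOS <- performance(LRpredictionos, "tpr", "fpr") # ROC curve (TPR vs FPR)
plot(LRROCOS)
performance(LRpredictionos, "auc") # area under the curve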

Decision Tree:

The decision tree was modeled with the rpart package using the oversampled training data, and the model was tested in the same way as the logistic regression. The confusion matrix shows the accuracy of the model to be 0.75.

The model predicts correctly about 75% of the time.


The decision tree is shown below.

From the decision tree it is evident that if the contact duration is greater than 472, the customer is likely (probability 0.84) to take the term deposit; this occurred 27% of the time in the test data.

It can also be seen that customers without a housing loan are easily convinced to take the term deposit when the contact duration is between 261 and 472; this occurred 14% of the time in the test data, with a corresponding probability of 0.72. A customer with a housing loan is difficult to convince to take the term deposit.

For calls made in the months of Aug, Jan, May, Jun, Jul and Nov with a shorter contact duration, only those customers who have already been convinced by a previous campaign are likely to subscribe, with a corresponding probability of 0.86 (very certain; 2% of the test data). The other customers are unlikely to buy the term deposit (probability 0.38); this leaf covers 38% of the test data.

The performance of the model was also evaluated using the ROC curve. The area under the ROC curve was found to be 0.8346, indicating a significantly good model.

Random Forest:
The oversampled training data was used to build a random forest model; the model was then tested on the test data and the confusion matrix created, which shows an accuracy of 0.9004.

The model predicts correctly about 90% of the time.

The performance of the model was also evaluated using the ROC curve. The area under the ROC curve was found to be 0.9295, indicating a significantly good model.

The variable importance in the random forest was calculated based on the Gini impurity index to check which attributes affect the model most. The varImpPlot() function was used on the model to inspect variable importance visually; the result is shown in the graph below. From the graph it can be seen that the duration of the contact is the most significant variable in the random forest, which is in line with the result of the decision tree analysis.
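The plot is produced directly from the fitted forest, as in the annexure:

varImpPlot(rfos) # Gini-based variable importance of the fitted forest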

Comparative study

Of the three methods, the Random Forest algorithm gave the best prediction results, followed by Logistic Regression, with the Decision Tree last.

The following table summarizes the performance parameters:

Parameter              Logistic regression   Decision tree   Random forest
Accuracy               0.848                 0.757           0.900
Area Under the Curve   0.904                 0.834           0.929
Precision              0.804                 0.866           0.671
Recall                 0.402                 0.309           0.563

A peculiar observation emerges: ranked by accuracy and by area under the curve, the order from best to worst performance is

Random Forest > Logistic Regression > Decision Tree

but from the precision point of view the order reverses,

Decision Tree > Logistic Regression > Random Forest

while recall follows the same order as accuracy.
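For reference, a sketch of how precision and recall follow from the confusion matrix, taking yes (1) as the positive class and using the logistic regression predictions as an example (the table orientation matches the annexure's table(test$y, clo)):

tbo <- table(test$y, clo) # rows = actual, columns = predicted
precision <- tbo[2, 2] / sum(tbo[, 2]) # TP / (TP + FP)
recall <- tbo[2, 2] / sum(tbo[2, ]) # TP / (TP + FN)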


Annexure:

rm(list = ls())
bank=read.csv("bankdata.csv",header = TRUE)
library(knitr)
library(ggplot2)
library(tidyverse)
library(DMwR)
library(plyr)
library(ggpubr)
library(dplyr)
library(gridExtra)
library(randomForest)
library(xgboost)
library(caTools)
library(kableExtra)
library(caret)
library(corrplot)
library(ROSE)
library(rpart)
library(rpart.plot)
library(RColorBrewer)
library(rattle)
library(ROCR)# for ROC curves; loaded up front so prediction() is available when first used
str(bank)
head(bank)
nrow(bank)
sum(duplicated(bank))#checking for duplicated data
sum(!complete.cases(bank))#checking for missing data
apply(bank,2,function(x)sum(is.na(x)))#checking for missing data by variables
bank = bank %>% distinct() # removing duplicated rows, if any
nrow(bank)
bank$y = ifelse(bank$y=='yes',1,0)#recoding the yes and no in the final variable
bank=bank %>% mutate_if(is.character, as.factor)#making other characters as factor
bank$y=as.factor(bank$y)#making y as factor
###Plots
qplot(bank$y, bank$age, data=bank, geom="boxplot", xlab="Subscription Status", ylab="Age")
qplot(bank$y, bank$balance, data=bank, geom="boxplot", xlab="Subscription Status",
ylab="Balance")
ggplot(bank, aes(x = job)) + geom_bar() +geom_text(stat='count', aes(label=..count..),
position=position_dodge(width=0.9), vjust=-0.25) + theme_bw() + theme(axis.text.x =
element_text(angle = 90, hjust = 1))
ggplot(bank, aes(x = marital)) + geom_bar() + geom_text(stat='count', aes(label=..count..),
position=position_dodge(width=0.9), vjust=-0.25) + theme_bw() + theme(axis.text.x =
element_text(angle = 90, hjust = 1))
ggplot(bank, aes(x = default)) + geom_bar() + geom_text(stat='count', aes(label=..count..),
position=position_dodge(width=0.9), vjust=-0.25) + theme_bw() + theme(axis.text.x =
element_text(angle = 90, hjust = 1))
ggplot(bank, aes(x = education)) + geom_bar() + geom_text(stat='count', aes(label=..count..),
position=position_dodge(width=0.9), vjust=-0.25) + theme_bw() + theme(axis.text.x =
element_text(angle = 90, hjust = 1))
ggplot(bank, aes(x = day)) + geom_bar() + geom_text(stat='count', aes(label=..count..),
position=position_dodge(width=0.9), vjust=-0.25) + theme_bw() + theme(axis.text.x =
element_text(angle = 90, hjust = 1))
ggplot(bank, aes(x = month)) + geom_bar() + geom_text(stat='count', aes(label=..count..),
position=position_dodge(width=0.9), vjust=-0.25) +theme_bw() + theme(axis.text.x =
element_text(angle = 90, hjust = 1))
ggplot(bank) + geom_bar(aes(x = month), col = "white") +
facet_grid(y~., scales = "free") + theme_bw() + theme(axis.text.x = element_text(angle = 90, hjust
= 1))
ggplot(bank, aes(x = housing)) + geom_bar() + geom_text(stat='count', aes(label=..count..),
position=position_dodge(width=0.9), vjust=-0.25) +theme_bw() + theme(axis.text.x =
element_text(angle = 90, hjust = 1))
ggplot(bank, aes(x = loan)) + geom_bar() +geom_text(stat='count', aes(label=..count..),
position=position_dodge(width=0.9), vjust=-0.25) + theme_bw() + theme(axis.text.x =
element_text(angle = 90, hjust = 1))
ggplot(bank) + geom_bar(aes(x = loan), col = "white") +
facet_grid(y~., scales = "free") + theme_bw() + theme(axis.text.x = element_text(angle = 90,
hjust = 1))
ggplot(bank, aes(x = contact)) + geom_bar() +geom_text(stat='count', aes(label=..count..),
position=position_dodge(width=0.9), vjust=-0.25) + theme_bw() + theme(axis.text.x =
element_text(angle = 90, hjust = 1))
ggplot(bank, aes(x = duration)) + geom_bar() + theme_bw() + theme(axis.text.x =
element_text(angle = 90, hjust = 1))
ggplot(bank, aes(x = campaign)) + geom_histogram(binwidth = 1) +geom_text(stat='count',
aes(label=..count..), position=position_dodge(width=0.9), vjust=-0.25) + theme_bw() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
ggplot(bank, aes(x = pdays)) + geom_histogram(binwidth = 50) +geom_text(stat='count',
aes(label=..count..), position=position_dodge(width=0.9), vjust=-0.25) + theme_bw() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
ggplot(bank, aes(x = previous)) + geom_histogram(binwidth = 1)+geom_text(stat='count',
aes(label=..count..), position=position_dodge(width=0.9), vjust=-0.25) + theme_bw() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
ggplot(bank) + geom_histogram(aes(x = previous), binwidth = 1) +
facet_grid(y~., scales = "free") + theme_bw() + theme(axis.text.x = element_text(angle = 90,
hjust = 1))
ggplot(bank, aes(x = poutcome)) + geom_bar() +geom_text(stat='count', aes(label=..count..),
position=position_dodge(width=0.9), vjust=-0.25) + theme_bw() + theme(axis.text.x =
element_text(angle = 90, hjust = 1))
#removing Unknown data
#education
nrow(bank[bank$education =="unknown", ])
ggplot(bank) + geom_bar(aes(x = education), col = "white") +
facet_grid(y~., scales = "free") + theme_bw() + theme(axis.text.x = element_text(angle = 90,
hjust = 1))
# 'unknown' education entries are significant in number and proportionately distributed, so they are not removed
#Job
nrow(bank[bank$job =="unknown", ])# since the no of unknown is less - it is removed
bank <- bank %>% filter(job != "unknown")
nrow(bank[bank$job =="unknown", ])#after removal it is 0
#Data Study
summary(bank$y)
ggplot(bank,aes(job))+geom_bar(aes(fill=y))

# Removing the variable contact from model consideration

bank$contact=NULL

#corelation matrix amongst the numeric variables

cordata<-bank
cordata$job=NULL
cordata$marital=NULL
cordata$education=NULL
cordata$default=NULL
cordata$housing=NULL
cordata$loan=NULL
cordata$month=NULL
cordata$poutcome=NULL
cordata$y=as.numeric(cordata$y)
cor(cordata)

install.packages("Hmisc")
library("Hmisc")

mydata.rcorr = rcorr(as.matrix(cordata))
mydata.rcorr

#Highly skewed data - y has 39922 no values and 5289 yes values.

# Training and testing the data


set.seed(010)
Ptr<-sample.split(bank$y, SplitRatio = 0.7)
train<-subset(bank, Ptr==TRUE)
test<-subset(bank, Ptr==FALSE)
train1=subset(train,y==1)
nrow(train1)
train0=subset(train,y==0)
nrow(train0)
####Oversampling
osmp=sample(1:nrow(train1),nrow(train0),replace=TRUE)
head(osmp)
ngtr=train1[osmp,]
nrow(ngtr)
newtrainos=rbind(ngtr,train0)
summary(newtrainos)
table(newtrainos$y)
modo=glm(y~.,data=newtrainos,family = "binomial")
pro=predict(modo,test,type="response")
clo=ifelse(pro>=0.5,1,0)
clo1<-as.factor(clo)
tbo=table(test$y,clo)
tbo
confusionMatrix(clo1,test$y)# caret convention: predictions first, reference second
####
LRpredictionos=prediction(pro,test$y)
LRROCOS=performance(LRpredictionos,"tpr","fpr")
plot(LRROCOS)
LRaucos=performance(LRpredictionos,"auc")
LRaucos
###Decision Tree
DTmodos=rpart(y~.,data=newtrainos,method="class")
fancyRpartPlot(DTmodos,cex=0.5)
prp(DTmodos, type = 2, extra = 104, fallen.leaves = TRUE, main="Decision Tree", cex=0.5)
rpart.plot(DTmodos, cex=0.6)
DTprdtos=predict(DTmodos,test,type="prob")
DTprdtoscl=predict(DTmodos,test,type="class")
head(DTprdtos)
DTPrdtFos=DTprdtos[,2]
library(ROCR)
DTpredictionos=prediction(DTPrdtFos,test$y)
confusionMatrix(DTprdtoscl,test$y)# predictions first, reference second
tbdt<-table(test$y,DTprdtoscl)
tbdt
DTROCos=performance(DTpredictionos,"tpr","fpr")
DTROCos
plot(DTROCos)
DTaucos=performance(DTpredictionos,"auc")
DTaucos
summary(DTmodos)
#####random forest
rfos <- randomForest( y ~ ., data=newtrainos,importance=TRUE,ntree=1000)
varImpPlot(rfos)
prdrfos<-predict(rfos,test,type="prob")
rfpredictionos=prediction(prdrfos[,2],test$y)
rfpred2<-predict(rfos,test,type="class")
rfROCos=performance(rfpredictionos,"tpr","fpr")
rfaucos=performance(rfpredictionos,"auc")
rfaucos
plot(rfROCos)
cor(newtrainos[, sapply(newtrainos, is.numeric)])# correlation over numeric columns only; cor() fails on factors
confusionMatrix(rfpred2,test$y)# predictions first, reference second
tbrf=(table(test$y,rfpred2))
tbrf

