This case is about a bank (MyBank) which has a growing customer base. For a
bank it is very important to study customer details such as gender,
transaction history, account balance, active loan details, number of
dependents, etc. to build effective marketing strategies for its products.
The total number of rows remains the same at 20000 and the number of columns
changes to 38.
Next we remove AGE_BKT and ACC_OP_DATE, as they carry the same information as
AGE and LEN_OF_RLTN_IN_MNTH.
data1$AGE_BKT <- NULL
data1$ACC_OP_DATE <- NULL
With the above code, the total number of rows remains the same, but the total
number of variables drops to 36.
Further, TARGET, FLG_HAS_CC, FLG_HAS_ANY_CHGS, FLG_HAS_NOMINEE and
FLG_HAS_OLD_LOAN are stored as integer data types but they are basically
categorical, so we need to convert them to factors using the command
as.factor.
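The conversion step can be sketched as follows. This is a minimal illustration on a small made-up data frame: `toy`, `flag_cols` and all values are assumptions for the example, and only the column names come from the data set.

```r
# Toy stand-in for data1; the values are made up for illustration.
toy <- data.frame(
  TARGET           = c(0L, 1L, 0L, 0L),
  FLG_HAS_CC       = c(1L, 0L, 0L, 1L),
  FLG_HAS_ANY_CHGS = c(0L, 0L, 1L, 0L),
  FLG_HAS_NOMINEE  = c(1L, 1L, 1L, 0L),
  FLG_HAS_OLD_LOAN = c(0L, 1L, 0L, 0L),
  AGE              = c(27L, 38L, 45L, 31L)
)

# Columns that are coded as integers but are really categories.
flag_cols <- c("TARGET", "FLG_HAS_CC", "FLG_HAS_ANY_CHGS",
               "FLG_HAS_NOMINEE", "FLG_HAS_OLD_LOAN")

# Apply as.factor to each flag column; other columns are left untouched.
toy[flag_cols] <- lapply(toy[flag_cols], as.factor)

# Check the resulting column classes.
sapply(toy, class)
```

Converting all flag columns in one assignment avoids repeating the same as.factor line five times; on the real data the same pattern would be applied to data1.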
colSums(is.na(data1))
                  TARGET                      AGE                   GENDER
                       0                        0                        0
                 BALANCE               OCCUPATION                      SCR
                       0                        0                        0
          HOLDING_PERIOD                 ACC_TYPE      LEN_OF_RLTN_IN_MNTH
                       0                        0                        0
         NO_OF_L_CR_TXNS          NO_OF_L_DR_TXNS         TOT_NO_OF_L_TXNS
                       0                        0                        0
NO_OF_BR_CSH_WDL_DR_TXNS        NO_OF_ATM_DR_TXNS        NO_OF_NET_DR_TXNS
                       0                        0                        0
       NO_OF_MOB_DR_TXNS        NO_OF_CHQ_DR_TXNS               FLG_HAS_CC
                       0                        0                        0
              AMT_ATM_DR        AMT_BR_CSH_WDL_DR               AMT_CHQ_DR
                       0                        0                        0
              AMT_NET_DR               AMT_MOB_DR                 AMT_L_DR
                       0                        0                        0
        FLG_HAS_ANY_CHGS  AMT_OTH_BK_ATM_USG_CHGS     AMT_MIN_BAL_NMC_CHGS
                       0                        0                        0
   NO_OF_IW_CHQ_BNC_TXNS    NO_OF_OW_CHQ_BNC_TXNS      AVG_AMT_PER_ATM_TXN
                       0                        0                        0
 AVG_AMT_PER_CSH_WDL_TXN      AVG_AMT_PER_CHQ_TXN      AVG_AMT_PER_NET_TXN
                       0                        0                        0
     AVG_AMT_PER_MOB_TXN          FLG_HAS_NOMINEE         FLG_HAS_OLD_LOAN
                       0                        0                        0
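With 36 variables the full output is long to scan. As a hedged sketch, the same check can be narrowed to only the columns that actually contain missing values. The data frame `toy` and its values are made up for illustration; on the real data the expression would be run on data1.

```r
# Toy data frame with a few missing values (made-up values).
toy <- data.frame(
  AGE     = c(27, NA, 45, 31),
  BALANCE = c(1200.5, 300, NA, NA),
  SCR     = c(500, 620, 710, 480)
)

# Count missing values per column, then keep only the non-zero counts.
na_counts <- colSums(is.na(toy))
na_counts[na_counts > 0]

# A single yes/no check for the whole data frame.
anyNA(toy)
```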
# Further, let's explore the data set using some graphs. This will give us
# insights into the distribution of the numeric variables as well as the
# important features impacting the response variable.
library(ggplot2)  # provides ggplot(), geom_histogram(), geom_vline(), geom_boxplot()

# Histograms; the dashed line marks the median of the plotted variable
ggp1 <- ggplot(data = data1, aes(x = AGE)) +
  geom_histogram(fill = "lightblue", binwidth = 5, colour = "black") +
  geom_vline(aes(xintercept = median(AGE)), linetype = "dashed")
ggp1
ggp2 <- ggplot(data = data1, aes(x = HOLDING_PERIOD)) +
  geom_histogram(fill = "green", binwidth = 5, colour = "black") +
  geom_vline(aes(xintercept = median(HOLDING_PERIOD)), linetype = "dashed")
ggp2
ggp3 <- ggplot(data = data1, aes(x = LEN_OF_RLTN_IN_MNTH)) +
  geom_histogram(fill = "grey", binwidth = 5, colour = "black") +
  geom_vline(aes(xintercept = median(LEN_OF_RLTN_IN_MNTH)), linetype = "dashed")
ggp3
ggp4 <- ggplot(data = data1, aes(x = NO_OF_L_CR_TXNS)) +
  geom_histogram(fill = "orange", binwidth = 5, colour = "black") +
  geom_vline(aes(xintercept = median(NO_OF_L_CR_TXNS)), linetype = "dashed")
ggp4
ggp5 <- ggplot(data = data1, aes(x = NO_OF_L_DR_TXNS)) +
  geom_histogram(fill = "pink", binwidth = 5, colour = "black") +
  geom_vline(aes(xintercept = median(NO_OF_L_DR_TXNS)), linetype = "dashed")
ggp5

# Horizontal box plot of age
ggp6 <- ggplot(data = data1, aes(x = AGE)) +
  geom_boxplot(fill = "lightblue")
ggp6
The frequency distribution for age shows that the number of targeted
customers is highest in the 26-30 age group.
Holding period and length of relationship with the bank are more or less
evenly distributed.
Most customers have between 0 and 15 credit transactions and fewer than 10
debit transactions.
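The age-group reading can also be checked numerically rather than by eye. The sketch below bins ages into five-year buckets and reports the most frequent one. The vector `age` holds made-up values standing in for data1$AGE, and the bucket boundaries are an assumption chosen so that 26-30 falls in a single bucket.

```r
# Made-up ages standing in for data1$AGE.
age <- c(22, 26, 27, 28, 29, 30, 30, 33, 38, 41, 45, 52)

# Five-year buckets: (20,25], (25,30], ... so ages 26-30 share one bucket.
age_bucket <- cut(age, breaks = seq(20, 55, by = 5))

# Frequency of each bucket and the bucket with the most customers.
bucket_counts <- table(age_bucket)
bucket_counts
names(which.max(bucket_counts))
```

Note that cut() closes intervals on the right by default, so the bucket labelled (25,30] covers ages 26 to 30 for whole-year ages.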
boxplot(data1$AGE, col = "blue", xlab = "Age", horizontal = TRUE)
boxplot(data1$HOLDING_PERIOD, col = "orange", xlab = "HOLDING_PERIOD", horizontal = TRUE)
boxplot(data1$LEN_OF_RLTN_IN_MNTH, col = "green", xlab = "LEN_OF_RLTN_IN_MNTH", horizontal = TRUE)
boxplot(data1$NO_OF_L_CR_TXNS, col = "pink", xlab = "NO_OF_L_CR_TXNS", horizontal = TRUE)
boxplot(data1$NO_OF_L_DR_TXNS, col = "grey", xlab = "NO_OF_L_DR_TXNS", horizontal = TRUE)
The box plots above show the following medians:
age is around 38 years;
holding period, i.e. the ability to hold money in the account, is about 15 months;
length of relationship with the bank is about 125 months;
number of credit transactions is about 10 and number of debit transactions is
about 5.
There are many outliers for the number of credit and debit transactions.
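The medians and outlier counts read off the box plots can be computed directly. The following is a minimal sketch on made-up transaction counts: `toy` and its values are assumptions, and boxplot.stats() is used because it applies the same 1.5 * IQR whisker rule that boxplot() uses by default.

```r
# Made-up transaction counts standing in for columns of data1.
toy <- data.frame(
  NO_OF_L_CR_TXNS = c(2, 5, 8, 10, 10, 11, 12, 14, 15, 60),
  NO_OF_L_DR_TXNS = c(1, 2, 4, 5, 5, 5, 6, 7, 8, 40)
)

# Medians, i.e. the centre line of each box plot.
medians <- sapply(toy, median)
medians

# Number of points that boxplot() would draw as outliers
# (beyond 1.5 times the box length from either hinge).
outlier_counts <- sapply(toy, function(x) length(boxplot.stats(x)$out))
outlier_counts
```

On the real data the same two sapply() calls could be run on the numeric columns of data1 to put numbers behind the statement about outliers.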