
#Problem Definition

This case is about a bank (MyBank) with a growing customer base. For a
bank, it is important to study customer details such as gender,
transaction history, account balance, active loan details and number of
dependents in order to build effective marketing strategies for its products.

MyBank conducted a campaign to increase the conversion ratio of one of its
products, "Personal Loans", and the details of all the targeted customers
were collated. In this campaign, 20000 customers were offered a Personal
Loan at an interest rate of 10% per annum, and 2512 of them expressed a
need for a personal loan.

These customers are labelled TARGET=1 and the remaining customers are
labelled TARGET=0.
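
The two counts quoted above fix the campaign's conversion rate and the class balance of TARGET. A minimal sketch of that arithmetic, using only the figures from the problem statement (the variable names are illustrative):

```r
# Figures quoted in the problem statement
targeted  <- 20000   # customers who received the offer
responded <- 2512    # customers labelled TARGET = 1

conversion_rate <- responded / targeted
non_responders  <- targeted - responded   # customers labelled TARGET = 0

cat(sprintf("Conversion rate: %.2f%%\n", 100 * conversion_rate))
cat(sprintf("TARGET = 0 customers: %d\n", non_responders))
```

Roughly one customer in eight responded, so the response variable is imbalanced, which is worth keeping in mind when reading the plots later on.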
Use the Personal Loan Campaign Data and perform the tasks below.
#First part of the problem
Perform EDA on the available data and showcase the results using
appropriate graphs.

First we need to check the dataset, its type and its structure.


There are 20000 rows and 40 variables in total:
> dim(data)
[1] 20000 40

Next step is to check the structure of the data


> str(data)
'data.frame': 20000 obs. of 40 variables:
$ CUST_ID                 : Factor w/ 20000 levels "C1","C10","C100",..: 17699 16532 11027 17984 2363 11747 18115 15556 15216 12494 ...
$ TARGET                  : int 0 0 0 0 0 0 0 0 0 0 ...
$ AGE                     : int 27 47 40 53 36 42 30 53 42 30 ...
$ GENDER                  : Factor w/ 3 levels "F","M","O": 2 2 2 2 2 1 2 1 1 2 ...
$ BALANCE                 : num 3384 287489 18217 71720 1671623 ...
$ OCCUPATION              : Factor w/ 4 levels "PROF","SAL","SELF-EMP",..: 3 2 3 2 1 1 1 2 3 1 ...
$ AGE_BKT                 : Factor w/ 7 levels "<25",">50","26-30",..: 3 7 5 2 5 6 3 2 6 3 ...
$ SCR                     : int 776 324 603 196 167 493 479 562 105 170 ...
$ HOLDING_PERIOD          : int 30 28 2 13 24 26 14 25 15 13 ...
$ ACC_TYPE                : Factor w/ 2 levels "CA","SA": 2 2 2 1 2 2 2 1 2 2 ...
$ ACC_OP_DATE             : Factor w/ 4869 levels "01-01-2000","01-01-2001",..: 3270 1807 3575 994 2861 863 4533 3160 258 335 ...
$ LEN_OF_RLTN_IN_MNTH     : int 146 104 61 107 185 192 177 99 88 111 ...
$ NO_OF_L_CR_TXNS         : int 7 8 10 36 20 5 6 14 18 14 ...
$ NO_OF_L_DR_TXNS         : int 3 2 5 14 1 2 6 3 14 8 ...
$ TOT_NO_OF_L_TXNS        : int 10 10 15 50 21 7 12 17 32 22 ...
$ NO_OF_BR_CSH_WDL_DR_TXNS: int 0 0 1 4 1 1 0 3 6 3 ...
$ NO_OF_ATM_DR_TXNS       : int 1 1 1 2 0 1 1 0 2 1 ...
$ NO_OF_NET_DR_TXNS       : int 2 1 1 3 0 0 1 0 4 0 ...
$ NO_OF_MOB_DR_TXNS       : int 0 0 0 1 0 0 0 0 1 0 ...
$ NO_OF_CHQ_DR_TXNS       : int 0 0 2 4 0 0 4 0 1 4 ...
$ FLG_HAS_CC              : int 0 0 0 0 0 1 0 0 1 0 ...
$ AMT_ATM_DR              : int 13100 6600 11200 26100 0 18500 6200 0 35400 18000 ...
$ AMT_BR_CSH_WDL_DR       : int 0 0 561120 673590 808480 379310 0 945160 198430 869880 ...
$ AMT_CHQ_DR              : int 0 0 49320 60780 0 0 10580 0 51490 32610 ...
$ AMT_NET_DR              : num 973557 799813 997570 741506 0 ...
$ AMT_MOB_DR              : int 0 0 0 71388 0 0 0 0 170332 0 ...
$ AMT_L_DR                : num 986657 806413 1619210 1573364 808480 ...
$ FLG_HAS_ANY_CHGS        : int 0 1 1 0 0 0 1 0 0 0 ...
$ AMT_OTH_BK_ATM_USG_CHGS : int 0 0 0 0 0 0 0 0 0 0 ...
$ AMT_MIN_BAL_NMC_CHGS    : int 0 0 0 0 0 0 0 0 0 0 ...
$ NO_OF_IW_CHQ_BNC_TXNS   : int 0 0 0 0 0 0 0 0 0 0 ...
$ NO_OF_OW_CHQ_BNC_TXNS   : int 0 0 1 0 0 0 0 0 0 0 ...
$ AVG_AMT_PER_ATM_TXN     : num 13100 6600 11200 13050 0 ...
$ AVG_AMT_PER_CSH_WDL_TXN : num 0 0 561120 168398 808480 ...
$ AVG_AMT_PER_CHQ_TXN     : num 0 0 24660 15195 0 ...
$ AVG_AMT_PER_NET_TXN     : num 486779 799813 997570 247169 0 ...
$ AVG_AMT_PER_MOB_TXN     : num 0 0 0 71388 0 ...
$ FLG_HAS_NOMINEE         : int 1 1 1 1 1 1 0 1 1 0 ...
$ FLG_HAS_OLD_LOAN        : int 1 0 1 0 0 1 1 1 1 0 ...
$ random                  : num 1.14e-05 1.11e-04 1.20e-04 1.37e-04 1.74e-04 ...

The data is a mix of integer, numeric and factor variables.
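
The mix of types can also be tallied programmatically instead of being read off the str() listing. A small sketch, assuming a toy data frame (`toy`, with made-up values) standing in for the real `data` object:

```r
# Toy data frame standing in for `data`; the values are made up
toy <- data.frame(
  CUST_ID = factor(c("C1", "C2", "C3")),
  TARGET  = c(0L, 1L, 0L),
  AGE     = c(27L, 47L, 40L),
  BALANCE = c(3384.5, 287489.0, 18217.2)
)

# Class of every column, then a count per class
col_classes  <- sapply(toy, class)
class_counts <- table(col_classes)
print(class_counts)
```

Running the same two lines on the full dataset would give the number of integer, numeric and factor columns in one table.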


Based on the data exploration, the customer ID (CUST_ID) and random columns
are not required in the dataset: neither is a predictor (x) nor the
response (y) variable.
#removing unwanted columns
> data1 <- data[,-c(1,40)]

The number of rows remains 20000 and the number of columns drops to 38.
AGE_BKT and ACC_OP_DATE are removed as well, since they carry the same
information as AGE and LEN_OF_RLTN_IN_MNTH respectively.
> data1$AGE_BKT <- NULL
> data1$ACC_OP_DATE <- NULL

With the above code, the number of rows remains the same and the number of
variables drops to 36.
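
Dropping columns by position (1 and 40) depends on the column order never changing. As an alternative sketch, not the approach used in this report, the same four columns can be dropped by name; the toy data frame below is hypothetical and only mimics a few of the real column names:

```r
# Toy data frame with a handful of the real column names; values are made up
toy <- data.frame(
  CUST_ID             = c("C1", "C2"),
  TARGET              = c(0L, 1L),
  AGE                 = c(27L, 47L),
  AGE_BKT             = c("26-30", "46-50"),
  ACC_OP_DATE         = c("01-01-2000", "01-01-2001"),
  LEN_OF_RLTN_IN_MNTH = c(146L, 104L),
  random              = c(0.11, 0.42)
)

drop_cols <- c("CUST_ID", "random", "AGE_BKT", "ACC_OP_DATE")

# Fail early if a name is misspelled or already removed
stopifnot(all(drop_cols %in% names(toy)))

# Keep every column whose name is not in drop_cols
toy1 <- toy[, !(names(toy) %in% drop_cols), drop = FALSE]
dim(toy1)
```

The row count is unchanged and only the column count drops, which is the same check the report makes with 40, 38 and 36 variables.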
Further, TARGET, FLG_HAS_CC, FLG_HAS_ANY_CHGS, FLG_HAS_NOMINEE and
FLG_HAS_OLD_LOAN are stored as integer data types but are really
categorical, so we need to convert them using the command as.factor.

#converting into categorical for required variables


> data1$TARGET <- as.factor(data1$TARGET)
> data1$FLG_HAS_CC <- as.factor(data1$FLG_HAS_CC)
> data1$FLG_HAS_ANY_CHGS <- as.factor(data1$FLG_HAS_ANY_CHGS)
> data1$FLG_HAS_NOMINEE <- as.factor(data1$FLG_HAS_NOMINEE)
> data1$FLG_HAS_OLD_LOAN <- as.factor(data1$FLG_HAS_OLD_LOAN)
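
The repeated as.factor calls can also be written as one step over a vector of column names. A minimal sketch on a hypothetical toy data frame (`toy` and `flag_cols` are illustrative names, and only two of the five flag columns are mimicked):

```r
# Toy data frame; values are made up
toy <- data.frame(
  TARGET     = c(0L, 1L, 0L),
  FLG_HAS_CC = c(0L, 0L, 1L),
  AGE        = c(27L, 47L, 40L)
)

# Columns that are stored as integers but are really categorical
flag_cols <- c("TARGET", "FLG_HAS_CC")

# Convert all of them in one assignment; other columns are untouched
toy[flag_cols] <- lapply(toy[flag_cols], as.factor)
str(toy)
```

On the real data, `flag_cols` would list all five variables named in the text.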

#check the str again


> str(data1)
The four unwanted columns have been removed and the data types have been
corrected in the new dataset (data1).
Next, check for missing values in any of the columns:

> colSums(is.na(data1))
                  TARGET                      AGE                   GENDER
                       0                        0                        0
                 BALANCE               OCCUPATION                      SCR
                       0                        0                        0
          HOLDING_PERIOD                 ACC_TYPE      LEN_OF_RLTN_IN_MNTH
                       0                        0                        0
         NO_OF_L_CR_TXNS          NO_OF_L_DR_TXNS         TOT_NO_OF_L_TXNS
                       0                        0                        0
NO_OF_BR_CSH_WDL_DR_TXNS        NO_OF_ATM_DR_TXNS        NO_OF_NET_DR_TXNS
                       0                        0                        0
       NO_OF_MOB_DR_TXNS        NO_OF_CHQ_DR_TXNS               FLG_HAS_CC
                       0                        0                        0
              AMT_ATM_DR        AMT_BR_CSH_WDL_DR               AMT_CHQ_DR
                       0                        0                        0
              AMT_NET_DR               AMT_MOB_DR                 AMT_L_DR
                       0                        0                        0
        FLG_HAS_ANY_CHGS  AMT_OTH_BK_ATM_USG_CHGS     AMT_MIN_BAL_NMC_CHGS
                       0                        0                        0
   NO_OF_IW_CHQ_BNC_TXNS    NO_OF_OW_CHQ_BNC_TXNS      AVG_AMT_PER_ATM_TXN
                       0                        0                        0
 AVG_AMT_PER_CSH_WDL_TXN      AVG_AMT_PER_CHQ_TXN      AVG_AMT_PER_NET_TXN
                       0                        0                        0
     AVG_AMT_PER_MOB_TXN          FLG_HAS_NOMINEE         FLG_HAS_OLD_LOAN
                       0                        0                        0

There are no missing values in the dataset.
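
For contrast, here is a sketch of what the same check looks like when values are missing. The toy data frame is hypothetical, with NA values planted deliberately; nothing here reflects the real dataset, which has none:

```r
# Toy data frame with deliberately planted NA values
toy <- data.frame(
  AGE     = c(27L, NA, 40L, 53L),
  BALANCE = c(3384, 287489, NA, NA),
  GENDER  = factor(c("M", "M", "F", "M"))
)

# Missing values per column, as in the report
na_counts <- colSums(is.na(toy))

# Show only the columns that actually have missing values
print(na_counts[na_counts > 0])

# Row numbers with at least one missing value
incomplete_rows <- which(!complete.cases(toy))
print(incomplete_rows)
```

Filtering to the non-zero counts is handy on wide tables like this one, where 36 zeros are harder to scan than an empty result.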

#Next, let's explore the dataset using some graphs. This gives insight into
the distribution of the numeric variables as well as the important features
affecting the response variable
> library(ggplot2)
> # the dashed line marks the median of the variable being plotted
> ggp1 <- ggplot(data = data1, aes(x = AGE)) +
+   geom_histogram(fill = "lightblue", binwidth = 5, colour = "black") +
+   geom_vline(aes(xintercept = median(AGE)), linetype = "dashed")
> ggp1
> ggp2 <- ggplot(data = data1, aes(x = HOLDING_PERIOD)) +
+   geom_histogram(fill = "green", binwidth = 5, colour = "black") +
+   geom_vline(aes(xintercept = median(HOLDING_PERIOD)), linetype = "dashed")
> ggp2
> ggp3 <- ggplot(data = data1, aes(x = LEN_OF_RLTN_IN_MNTH)) +
+   geom_histogram(fill = "grey", binwidth = 5, colour = "black") +
+   geom_vline(aes(xintercept = median(LEN_OF_RLTN_IN_MNTH)), linetype = "dashed")
> ggp3
> ggp4 <- ggplot(data = data1, aes(x = NO_OF_L_CR_TXNS)) +
+   geom_histogram(fill = "orange", binwidth = 5, colour = "black") +
+   geom_vline(aes(xintercept = median(NO_OF_L_CR_TXNS)), linetype = "dashed")
> ggp4
> ggp5 <- ggplot(data = data1, aes(x = NO_OF_L_DR_TXNS)) +
+   geom_histogram(fill = "pink", binwidth = 5, colour = "black") +
+   geom_vline(aes(xintercept = median(NO_OF_L_DR_TXNS)), linetype = "dashed")
> ggp5
> ggp6 <- ggplot(data = data1, aes(x = AGE)) +
+   geom_boxplot(fill = "lightblue")
> ggp6
• The frequency distribution of AGE shows that the largest number of
targeted customers fall in the 26-30 age group.
• Holding period and length of relationship with the bank are more or less
evenly distributed.
• Most customers have between 0 and 15 credit transactions and fewer than
10 debit transactions.
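
A reading such as "the 26-30 group is the largest" can be cross-checked numerically by binning AGE and counting. A minimal sketch, assuming band edges of my own choosing (the report's AGE_BKT definition is not shown in full) and using only the first ten AGE values printed by str(data), so the counts are illustrative rather than a result for the full dataset:

```r
# First ten AGE values from the str(data) listing; not the full column
ages <- c(27L, 47L, 40L, 53L, 36L, 42L, 30L, 53L, 42L, 30L)

# Assumed band edges; intervals are closed on the right, e.g. (25, 30]
breaks <- c(-Inf, 25, 30, 35, 40, 45, 50, Inf)
labels <- c("<=25", "26-30", "31-35", "36-40", "41-45", "46-50", ">50")

age_band    <- cut(ages, breaks = breaks, labels = labels, right = TRUE)
band_counts <- table(age_band)
print(band_counts)

# Band with the highest count in this small sample
largest_band <- names(which.max(band_counts))
print(largest_band)
```

Applied to data1$AGE, the same table would give exact counts per band to set beside the histogram.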

> boxplot(data1$AGE, col = "Blue", xlab = "Age", horizontal = TRUE)
> boxplot(data1$HOLDING_PERIOD, col = "orange", xlab = "HOLDING_PERIOD", horizontal = TRUE)
> boxplot(data1$LEN_OF_RLTN_IN_MNTH, col = "green", xlab = "LEN_OF_RLTN_IN_MNTH", horizontal = TRUE)
> boxplot(data1$NO_OF_L_CR_TXNS, col = "pink", xlab = "NO_OF_L_CR_TXNS", horizontal = TRUE)
> boxplot(data1$NO_OF_L_DR_TXNS, col = "grey", xlab = "NO_OF_L_DR_TXNS", horizontal = TRUE)

• The box plots above show a median age of around 38 years,
• a median holding period (i.e. how long money is held in the account) of
about 15 months,
• a median length of relationship with the bank of about 125 months,
• and medians of about 10 credit transactions and 5 debit transactions.
• There are many outliers in the number of credit and debit transactions.
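
The points a box plot draws as outliers are those lying more than 1.5 times the box height beyond the hinges, and base R exposes that rule through boxplot.stats(). A short sketch using only the first ten NO_OF_L_CR_TXNS values printed by str(data), so the numbers are illustrative and differ from the full-data medians quoted above:

```r
# First ten NO_OF_L_CR_TXNS values from the str(data) listing
txns <- c(7, 8, 10, 36, 20, 5, 6, 14, 18, 14)

# Default coef = 1.5 is the same rule boxplot() uses for its whiskers
bp <- boxplot.stats(txns)

# Whisker low, lower hinge, median, upper hinge, whisker high
print(bp$stats)

# Values flagged as outliers
print(bp$out)
```

In this sample only the value 36 is flagged. Calling boxplot.stats() on each full column would list and count the outliers that the plots show as individual points.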
