Bank Rpubs

Portuguese Bank Marketing
Data
WQD7004/RRookie/Yong Keh Soon-WQD180065, Vikas Mann-WQD180051, L-ven
Lew Teck Wei-WQD180056, Lim Shien Long-WQD180027
14 December 2018
• Basic R Setup
  o R Libraries Used
  o Import Dataset
• Understanding Data
  o Overview of Data Attributes
  o Data Validation
    ▪ Check for Duplicate Rows
    ▪ Check for Missing Data
  o Data Cleaning
    ▪ Remove Rows Where All Columns Are Missing
    ▪ Save Deduplicated Rows
    ▪ Impute Missing Values
    ▪ Final Removal of Duplicated Rows
  o Verify Cleanliness
    ▪ Verify Deduplication
    ▪ Verify Missing Value
• About The Cleaned Dataset
  o Data Structure of Cleaned Dataset
  o Recoding ‘yes’ to binary
  o Sample Rows
    ▪ Size of Dataset
    ▪ Sample Observations
    ▪ Data Summary
  o Outcome Imbalance
• Exploratory Data Analysis
  o Univariate Analysis
    ▪ Age Distribution
    ▪ Age Distribution vs Marital Status That Subscribes Term Deposit
    ▪ Age vs Subscription
    ▪ Balance vs Subscription
    ▪ Education vs Subscription
    ▪ Subscription based on Number of Contact during Campaign
    ▪ Duration
    ▪ Scatterplot of Duration by Age
    ▪ Scatterplot of Duration by Campaign
    ▪ Scatterplot Matrix
  o Split the Training / Testing data and Scale
    ▪ Custom Function For Binary Class Performance Evaluation
    ▪ Create a function for plotting distribution
  o Model 1 - Build Model by fitting Decision Tree Classification
  o Evaluate the Decision Tree Prediction model
  o Model 2 - Build Model by fitting Logistic Regression
  o Evaluate the Logistic Regression Prediction model
  o Compare the models
Basic R Setup
R Libraries Used
Here are the R libraries used in this analysis.
library(knitr) # report generation and kable() tables
library(tidyverse) # data manipulation and plotting (dplyr, ggplot2, forcats)
library(data.table) # fast file reading
library(caret) # classification and regression training utilities
library(ROCR) # ROC curves and AUC
library(kableExtra) # HTML table formatting for kable()
library(gridExtra) # arranging ggplot objects in a grid
library(rpart) # decision tree
library(rpart.plot) # decision tree plotting
library(caTools) # sample.split() for the train/test split
Import Dataset
Import CSV into dataframe.
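# stringsAsFactors = TRUE keeps text columns as factors on R >= 4.0,
# matching the summary() and str() output shown in this report
banks = read.table('data/bank-full.csv', sep = ',', header = TRUE, stringsAsFactors = TRUE)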
Understanding Data
Overview of Data Attributes
Attribute Information:
1. age (numeric)
2. job: type of job
3. marital: marital status
4. education
5. default: has credit in default?
6. balance: average yearly account balance (numeric; in euros in the original UCI data)
7. housing: has housing loan?
8. loan: has personal loan?
9. contact: contact communication type (cellular, telephone, unknown)
10. day: last contact day of the month
11. month: last contact month
12. duration: last contact duration (numeric; in seconds in the original UCI data)
13. campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
14. pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric: -1 means client was not previously contacted)
15. previous: number of contacts performed before this campaign and for this client (numeric)
16. poutcome: outcome of the previous marketing campaign (categorical: ‘failure’, ‘other’, ‘success’, ‘unknown’)
17. y: has the client subscribed to a term deposit? (binary: ‘yes’, ‘no’)
Statistical Summary
summary(banks)
## age job marital education
## Min. :18.00 blue-collar:9844 divorced: 5315 primary : 6966
## 1st Qu.:33.00 management :9632 married :27579 secondary:23539
## Median :39.00 technician :7720 single :13027 tertiary :13525
## Mean :40.95 admin. :5252 NA's : 646 unknown : 1879
## 3rd Qu.:48.00 services :4210 NA's : 658
## Max. :95.00 (Other) :9260
## NA's :641 NA's : 649
## default balance housing loan contact
## no :45125 Min. : -8019 no :20400 no :38591 cellular :29749
## yes : 825 1st Qu.: 73 yes :25489 yes : 7324 telephone: 2953
## NA's: 617 Median : 448 NA's: 678 NA's: 652 unknown :13215
## Mean : 1363 NA's : 650
## 3rd Qu.: 1426
## Max. :102127
## NA's :658
## day month duration campaign
## Min. : 1.0 may :14017 Min. : 0.0 Min. : 1.000
## 1st Qu.: 8.0 jul : 6989 1st Qu.: 103.0 1st Qu.: 1.000
## Median :16.0 aug : 6359 Median : 180.0 Median : 2.000
## Mean :15.8 jun : 5426 Mean : 258.1 Mean : 2.767
## 3rd Qu.:21.0 nov : 4038 3rd Qu.: 319.0 3rd Qu.: 3.000
## Max. :31.0 (Other): 9086 Max. :4918.0 Max. :63.000
## NA's :632 NA's : 652 NA's :648 NA's :657
## pdays previous poutcome y
## Min. : -1.00 Min. : 0.0000 failure: 4994 no :40547
## 1st Qu.: -1.00 1st Qu.: 0.0000 other : 1871 yes : 5385
## Median : -1.00 Median : 0.0000 success: 1538 NA's: 635
## Mean : 40.31 Mean : 0.5795 unknown:37551
## 3rd Qu.: -1.00 3rd Qu.: 0.0000 NA's : 613
## Max. :871.00 Max. :275.0000
## NA's :654 NA's :649
NA can be observed in all 17 attributes.
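The summary above shows pdays with a minimum, median and third quartile of -1, which suggests that -1 acts as a sentinel for "no earlier contact" rather than a real number of days. The following is a minimal sketch, on a small made-up vector, of separating that sentinel from genuine values before averaging. The names pdays_example, contacted_before and days_since_contact are illustrative assumptions and are not part of the original analysis.

```r
# Made-up values: -1 is assumed to mean "not previously contacted"
pdays_example <- c(-1, -1, 92, -1, 182, NA)

# TRUE only where a real, non-missing day count is present
contacted_before <- !is.na(pdays_example) & pdays_example != -1

# Keep genuine day counts, blank out the sentinel and the missing value
days_since_contact <- ifelse(contacted_before, pdays_example, NA)

print(contacted_before)
print(mean(days_since_contact, na.rm = TRUE))
```

Averaging pdays without this separation mixes the sentinel into the mean, which is worth keeping in mind when reading the pdays mean in the summary.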
Data Validation
Check for Duplicate Rows
sum(duplicated(banks))
## [1] 1450
Check for Missing Data

How Many Rows Contain Missing Data
sum(!complete.cases(banks))
## [1] 1259
How Many Rows Are Completely Missing Values In All Columns
all.empty = rowSums(is.na(banks))==ncol(banks)
sum(all.empty)
## [1] 61
Missing Value By Variable
sapply(banks, function(x) sum(is.na(x)))

## age job marital education default balance housing
## 641 649 646 658 617 658 678
## loan contact day month duration campaign pdays
## 652 650 632 652 648 657 654
## previous poutcome y
## 649 613 635
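The checks above can be tried on a small made-up data frame to see how the counts relate to each other. This is only an illustrative sketch: the toy data frame and its values are assumptions, not rows from bank-full.csv.

```r
# Toy data: row 2 repeats row 1, rows 3 and 5 are entirely empty
toy <- data.frame(
  age = c(30, 30, NA, 45, NA),
  job = c("admin.", "admin.", NA, NA, NA),
  y   = c("no", "no", NA, "yes", NA),
  stringsAsFactors = TRUE
)

n_duplicated  <- sum(duplicated(toy))               # later copies of an earlier row
n_incomplete  <- sum(!complete.cases(toy))          # rows with at least one NA
all_empty     <- rowSums(is.na(toy)) == ncol(toy)   # rows with NA in every column
na_per_column <- sapply(toy, function(x) sum(is.na(x)))

print(c(duplicated = n_duplicated, incomplete = n_incomplete,
        all_empty = sum(all_empty)))
print(na_per_column)
```

In this toy example the second all-empty row is also counted by duplicated(), so the duplicate count and the missing-data counts can overlap.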
Data Cleaning
Remove Rows Where All Columns Are Missing
# drop only the rows in which every column is NA (all.empty from the previous step)
banks.clean = banks[!all.empty,]
Save Deduplicated Rows

Saved deduplicated data to new variable ‘banks.clean’
banks.clean = banks.clean %>% distinct
Number of Rows After Dedup
nrow(banks.clean)
## [1] 45116
Impute Missing Values

Create New Column To Indicate Missing Detection
banks.clean$missing = !complete.cases(banks.clean)
Missing Numeric Value Treatment
Replace with Average
banks.clean$age[is.na(banks.clean$age)] = mean(banks$age, na.rm=T)

banks.clean$day[is.na(banks.clean$day)] = mean(banks$day, na.rm=T)
banks.clean$duration[is.na(banks.clean$duration)] = mean(banks$duration,
na.rm=T)
banks.clean$previous[is.na(banks.clean$previous)] = mean(banks$previous,
na.rm=T)
banks.clean$campaign[is.na(banks.clean$campaign)] = mean(banks$campaign,
na.rm=T)
Replace with Mode - The distributions of the variables below are highly skewed towards a specific value, hence we impute their missing values with the mode.
hist(banks.clean$balance)
hist(banks.clean$pdays)
banks.clean$pdays[is.na(banks.clean$pdays)] = as.numeric(names(sort(-
table(banks$pdays)))[1])
banks.clean$balance[is.na(banks.clean$balance)] = as.numeric(names(sort(-
table(banks$balance)))[1])
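The mode one-liner used above sorts the frequency table and takes the name of the most frequent value. A small helper makes the same idea easier to read and reuse. This is a sketch under the assumption that a single most frequent value exists; mode_value and pdays_example are hypothetical names, not part of the original report.

```r
# Return the most frequent non-missing value of a numeric vector
mode_value <- function(x) {
  counts <- table(x)                            # table() drops NA by default
  as.numeric(names(counts)[which.max(counts)])  # first value with the top count
}

# Made-up example: -1 is the most frequent value
pdays_example <- c(-1, -1, -1, 92, 182, NA)
pdays_example[is.na(pdays_example)] <- mode_value(pdays_example)
print(pdays_example)
```

If two values tie for the highest count, which.max() returns the first one in sorted order, so the result depends on that tie-breaking rule.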
Missing Categorical Treatment
Replace with Special Category
banks.clean$job = fct_explicit_na(banks.clean$job, "missing")

banks.clean$marital = fct_explicit_na(banks.clean$marital, "missing")
banks.clean$education = fct_explicit_na(banks.clean$education, "missing")
banks.clean$default = fct_explicit_na(banks.clean$default, "missing")
banks.clean$loan = fct_explicit_na(banks.clean$loan, "missing")
banks.clean$contact = fct_explicit_na(banks.clean$contact, "missing")
banks.clean$poutcome = fct_explicit_na(banks.clean$poutcome, "missing")
banks.clean$y = fct_explicit_na(banks.clean$y, "missing")
banks.clean$housing = fct_explicit_na(banks.clean$housing, "missing")
banks.clean$month = fct_explicit_na(banks.clean$month, "missing")
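Recent forcats releases (1.0.0 onward) mark fct_explicit_na() as deprecated in favour of fct_na_value_to_level(). As an alternative that does not depend on the forcats version, the same recoding can be sketched in base R and applied to all factor columns in one pass. The helper add_missing_level and the toy data frame are illustrative assumptions, not code from the original report.

```r
# Turn NA entries of a factor into an explicit level (base R only)
add_missing_level <- function(f, level = "missing") {
  f <- as.factor(f)
  levels(f) <- c(levels(f), level)  # append the new level at the end
  f[is.na(f)] <- level              # recode NA entries to that level
  f
}

# Made-up example with two factor columns
toy <- data.frame(
  marital = factor(c("married", NA, "single")),
  loan    = factor(c("no", "yes", NA))
)
toy[] <- lapply(toy, add_missing_level)

print(levels(toy$marital))
print(toy)
```

Appending the level at the end mirrors the level order shown in the output below, where "missing" is always the last level.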
Final Removal of Duplicated Rows

After imputation, some rows became identical, so they need to be deduplicated again.
banks.clean = banks.clean %>% distinct
Verify Cleanliness
Verify Deduplication
Number of Rows Before Dedup
nrow(banks)
## [1] 46567
Number of Rows After Dedup
The number of rows has been reduced after deduplication.
nrow(banks.clean)
## [1] 45112
Number of Deduplicated Rows
There are no more duplicated rows.
sum(duplicated(banks.clean))
## [1] 0
Verify Missing Value

All missing values have been treated.
sapply(banks.clean, function(x) sum(is.na(x)))

## age job marital education default balance housing
## 0 0 0 0 0 0 0
## loan contact day month duration campaign pdays
## 0 0 0 0 0 0 0
## previous poutcome y missing
## 0 0 0 0
New ‘missing’ levels have been introduced.
levels(banks.clean$job)
## [1] "admin." "blue-collar" "entrepreneur" "housemaid"
## [5] "management" "retired" "self-employed" "services"
## [9] "student" "technician" "unemployed" "unknown"
## [13] "missing"
levels(banks.clean$marital)
## [1] "divorced" "married" "single" "missing"
levels(banks.clean$education)
## [1] "primary" "secondary" "tertiary" "unknown" "missing"
levels(banks.clean$default)
## [1] "no" "yes" "missing"
levels(banks.clean$loan)
## [1] "no" "yes" "missing"
levels(banks.clean$contact)
## [1] "cellular" "telephone" "unknown" "missing"
levels(banks.clean$poutcome)
## [1] "failure" "other" "success" "unknown" "missing"
levels(banks.clean$y)
## [1] "no" "yes" "missing"
levels(banks.clean$housing)
## [1] "no" "yes" "missing"
levels(banks.clean$month)
## [1] "apr" "aug" "dec" "feb" "jan" "jul" "jun"
## [8] "mar" "may" "nov" "oct" "sep" "missing"
How Many Rows Had Missing Data Before Cleaning
sum(banks.clean$missing)
## [1] 1121
Display Those Rows That Had Missing Data Before
banks.clean %>% filter(missing==T) %>% head %>% kable

| age | job | marital | education | default | balance | housing | loan | contact | day | month | duration | campaign | pdays | previous | poutcome | y | missing |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 28.00000 | blue-collar | married | missing | missing | 0 | yes | missing | missing | 15.80327 | may | 258.0954 | 1.00000 | -1 | 0.5794895 | missing | no | TRUE |
| 46.00000 | missing | missing | missing | no | 0 | yes | missing | missing | 15.80327 | may | 258.0954 | 2.76663 | -1 | 0.5794895 | unknown | no | TRUE |
| 32.00000 | management | missing | tertiary | no | 0 | missing | no | missing | 5.00000 | may | 179.0000 | 1.00000 | -1 | 0.0000000 | missing | no | TRUE |
| 40.94495 | missing | married | tertiary | no | 523 | yes | no | unknown | 5.00000 | missing | 849.0000 | 2.00000 | -1 | 0.0000000 | missing | no | TRUE |
| 40.94495 | missing | married | missing | no | 19 | yes | no | missing | 15.80327 | may | 252.0000 | 1.00000 | -1 | 0.5794895 | unknown | missing | TRUE |
| 35.00000 | services | missing | secondary | no | 59 | yes | no | unknown | 5.00000 | may | 1077.0000 | 2.76663 | -1 | 0.5794895 | missing | no | TRUE |
Summary Statistic After Cleaning
summary(banks.clean)
## Mean :40.94 admin. :5094 missing : 529 unknown : 1829
## 3rd Qu.:48.00 services :4096 missing : 547
## Max. :95.00 retired :2240
## (Other) :7280
## default balance housing loan
## no :43797 Min. : -8019 no :19803 no :37439
## yes : 809 1st Qu.: 62 yes :24746 yes : 7133
## missing: 506 Median : 436 missing: 563 missing: 540
## Mean : 1347
## 3rd Qu.: 1407
## Max. :102127
##
## contact day month duration
## cellular :28876 Min. : 1.0 may :13585 Min. : 0.0
## telephone: 2866 1st Qu.: 8.0 jul : 6785 1st Qu.: 104.0
## unknown :12836 Median :16.0 aug : 6161 Median : 182.0
## missing : 534 Mean :15.8 jun : 5268 Mean : 258.2
## 3rd Qu.:21.0 nov : 3920 3rd Qu.: 317.0
## Max. :31.0 apr : 2884 Max. :4918.0
## (Other): 6509
## campaign pdays previous poutcome
## Min. : 1.000 Min. : -1.00 Min. : 0.0000 failure: 4847
## 1st Qu.: 1.000 1st Qu.: -1.00 1st Qu.: 0.0000 other : 1813
## Median : 2.000 Median : -1.00 Median : 0.0000 success: 1486
## Mean : 2.766 Mean : 39.69 Mean : 0.5794 unknown:36461
## 3rd Qu.: 3.000 3rd Qu.: -1.00 3rd Qu.: 0.0000 missing: 505
## Max. :63.000 Max. :871.00 Max. :275.0000
##
## y missing
## no :39367 Mode :logical
## yes : 5224 FALSE:43991
## missing: 521 TRUE :1121
##
##
##
##
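All NA values have been treated. Categorical NAs were replaced with an explicit ‘missing’ level. Numeric NAs were replaced with the mean of the attribute, except for pdays and balance, which were replaced with the mode.
Save file to banks.csv
Save the cleaned dataset, then rename the dataframe to banks.
# row.names = FALSE avoids writing an extra row-number column
write.csv(banks.clean, file = "./data/banks.csv", row.names = FALSE)
banks = banks.clean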
About The Cleaned Dataset

Data Structure of Cleaned Dataset
str(banks)
## 'data.frame': 45112 obs. of 18 variables:
## $ age : num 58 44 33 47 33 35 28 42 58 43 ...
## $ job : Factor w/ 13 levels "admin.","blue-collar",..: 5 10 3 2 12
5 5 3 6 10 ...
## $ marital : Factor w/ 4 levels "divorced","married",..: 2 3 2 2 3 2 3
1 2 3 ...
## $ education: Factor w/ 5 levels "primary","secondary",..: 3 2 2 4 4 3 3
3 1 2 ...
## $ default : Factor w/ 3 levels "no","yes","missing": 1 1 1 1 1 1 1 2 1
1 ...
## $ balance : num 2143 29 2 1506 1 ...
## $ housing : Factor w/ 3 levels "no","yes","missing": 2 2 2 2 1 2 2 2 2
2 ...
## $ loan : Factor w/ 3 levels "no","yes","missing": 1 1 2 1 1 1 2 1 1
1 ...
## $ contact : Factor w/ 4 levels "cellular","telephone",..: 3 3 3 3 3 3
3 3 3 3 ...
## $ day : num 5 5 5 5 5 5 5 5 5 5 ...
## $ month : Factor w/ 13 levels "apr","aug","dec",..: 9 9 9 9 9 9 9 9
9 9 ...
## $ duration : num 261 151 76 92 198 139 217 380 50 55 ...
## $ campaign : num 1 1 1 1 1 1 1 1 1 1 ...
## $ pdays : num -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
## $ previous : num 0 0 0 0 0 0 0 0 0 0 ...
## $ poutcome : Factor w/ 5 levels "failure","other",..: 4 4 4 4 4 4 4 4 4
4 ...
## $ y : Factor w/ 3 levels "no","yes","missing": 1 1 1 1 1 1 1 1 1
1 ...
## $ missing : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
Recoding ‘yes’ to binary

banks$y = ifelse(banks$y=='yes',1,0)
str(banks)
## 'data.frame': 45112 obs. of 18 variables:
## $ age : num 58 44 33 47 33 35 28 42 58 43 ...
## $ job : Factor w/ 13 levels "admin.","blue-collar",..: 5 10 3 2 12
5 5 3 6 10 ...
## $ marital : Factor w/ 4 levels "divorced","married",..: 2 3 2 2 3 2 3
1 2 3 ...
## $ education: Factor w/ 5 levels "primary","secondary",..: 3 2 2 4 4 3 3
3 1 2 ...
## $ default : Factor w/ 3 levels "no","yes","missing": 1 1 1 1 1 1 1 2 1
1 ...
## $ balance : num 2143 29 2 1506 1 ...
## $ housing : Factor w/ 3 levels "no","yes","missing": 2 2 2 2 1 2 2 2 2
2 ...
## $ loan : Factor w/ 3 levels "no","yes","missing": 1 1 2 1 1 1 2 1 1
1 ...
## $ contact : Factor w/ 4 levels "cellular","telephone",..: 3 3 3 3 3 3
3 3 3 3 ...
## $ day : num 5 5 5 5 5 5 5 5 5 5 ...
## $ month : Factor w/ 13 levels "apr","aug","dec",..: 9 9 9 9 9 9 9 9
9 9 ...
## $ duration : num 261 151 76 92 198 139 217 380 50 55 ...
## $ campaign : num 1 1 1 1 1 1 1 1 1 1 ...
## $ pdays : num -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
## $ previous : num 0 0 0 0 0 0 0 0 0 0 ...
## $ poutcome : Factor w/ 5 levels "failure","other",..: 4 4 4 4 4 4 4 4 4
4 ...
## $ y : num 0 0 0 0 0 0 0 0 0 0 ...
## $ missing : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
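Check that ‘y’ has been recoded from ‘yes’/‘no’ to 1/0: str() now reports y as num. Note that ifelse() maps every value other than ‘yes’ to 0, so the 521 rows whose y is ‘missing’ are counted as ‘no’.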
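Because the recoding compares y only against 'yes', any other level, including the 'missing' level created during imputation, ends up as 0. A minimal sketch of the difference between that behaviour and a recoding that keeps unknown outcomes as NA is shown here. The vectors are made up, and y_keep_unknown is an alternative offered for comparison, not what the report does.

```r
# Made-up outcome factor with the three levels used in the cleaned data
y_factor <- factor(c("no", "yes", "missing", "no"),
                   levels = c("no", "yes", "missing"))

# Behaviour of the report's recoding: everything except "yes" becomes 0
y_all_else_zero <- ifelse(y_factor == "yes", 1, 0)

# Alternative: keep unknown outcomes as NA so they can be filtered out later
y_keep_unknown <- ifelse(y_factor == "missing", NA,
                         ifelse(y_factor == "yes", 1, 0))

print(y_all_else_zero)
print(y_keep_unknown)
```

With the second form, rows with an unknown outcome could be dropped before modelling instead of being treated as non-subscribers.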
Sample Rows
Size of Dataset
nrow(banks)
## [1] 45112
ncol(banks)
## [1] 18
Sample Observations
head(banks)
  age   job          marital education default balance housing
  <dbl> <fctr>       <fctr>  <fctr>    <fctr>  <dbl>   <fctr>
1 58    management   married tertiary  no      2143    yes
2 44    technician   single  secondary no      29      yes
3 33    entrepreneur married secondary no      2       yes
4 47    blue-collar  married unknown   no      1506    yes
5 33    unknown      single  unknown   no      1       no
6 35    management   married tertiary  no      231     yes
Data Summary
summary(banks)
## Mean :40.94 admin. :5094 missing : 529 unknown : 1829
## 3rd Qu.:48.00 services :4096 missing : 547
## Max. :95.00 retired :2240
## (Other) :7280
## default balance housing loan
## no :43797 Min. : -8019 no :19803 no :37439
## yes : 809 1st Qu.: 62 yes :24746 yes : 7133
## missing: 506 Median : 436 missing: 563 missing: 540
## Mean : 1347
## 3rd Qu.: 1407
## Max. :102127
##
## contact day month duration
## cellular :28876 Min. : 1.0 may :13585 Min. : 0.0
## telephone: 2866 1st Qu.: 8.0 jul : 6785 1st Qu.: 104.0
## unknown :12836 Median :16.0 aug : 6161 Median : 182.0
## missing : 534 Mean :15.8 jun : 5268 Mean : 258.2
## 3rd Qu.:21.0 nov : 3920 3rd Qu.: 317.0
## Max. :31.0 apr : 2884 Max. :4918.0
## (Other): 6509
## campaign pdays previous poutcome
## Min. : 1.000 Min. : -1.00 Min. : 0.0000 failure: 4847
## 1st Qu.: 1.000 1st Qu.: -1.00 1st Qu.: 0.0000 other : 1813
## Median : 2.000 Median : -1.00 Median : 0.0000 success: 1486
## Mean : 2.766 Mean : 39.69 Mean : 0.5794 unknown:36461
## 3rd Qu.: 3.000 3rd Qu.: -1.00 3rd Qu.: 0.0000 missing: 505
## Max. :63.000 Max. :871.00 Max. :275.0000
##
## y missing
## Min. :0.0000 Mode :logical
## 1st Qu.:0.0000 FALSE:43991
## Median :0.0000 TRUE :1121
## Mean :0.1158
## 3rd Qu.:0.0000
## Max. :1.0000
##
Outcome Imbalance
Observe that the outcome (y) is heavily skewed towards ‘no’: over 88% of the rows are 0.
prop.table(table(banks$y))
##
## 0 1
## 0.8841993 0.1158007
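The imbalance sets a baseline for reading accuracy figures: a model that always predicts 0 is already right for every row coded 0. The sketch below recomputes that baseline from the level counts printed in the cleaned summary (39367 'no', 5224 'yes', 521 'missing', with 'missing' recoded to 0). The variable names are illustrative.

```r
# Level counts of y taken from summary(banks.clean)
n_no      <- 39367
n_yes     <- 5224
n_missing <- 521    # recoded to 0 together with "no"

n_zero <- n_no + n_missing
baseline_accuracy <- n_zero / (n_zero + n_yes)

# Accuracy of a trivial classifier that always predicts 0
print(round(baseline_accuracy, 7))
```

Any model accuracy should be compared against this value rather than against 50%, and metrics such as recall and F-score say more about how well subscribers are identified.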
Exploratory Data Analysis

Univariate Analysis
Age Distribution
The bulk of clients are between the ages of 33 (1st quartile) and 48 (3rd quartile), with a mean of about 41, shown on the histogram as a red vertical line.
The boxplot of age describes essentially the same statistics, but it also shows outliers at the top of the age range (the upper whisker limit is 48 + 1.5 × 15 = 70.5).
summary(banks$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 18.00 33.00 39.00 40.94 48.00 95.00
gg = ggplot(banks)
p1 = gg + geom_histogram(aes(x=age), color="black", fill="white", binwidth = 5) +
ggtitle('Age Distribution (red mean line)') +
ylab('Count') +
xlab('Age') +
# color is set outside aes() so the line is drawn in literal red
geom_vline(aes(xintercept = mean(age)), color = "red") +
scale_x_continuous(breaks = seq(0,100,5)) +
theme(legend.position = "none")
p2 = gg + geom_boxplot(aes(x='', y=age)) +
ggtitle('Age Boxplot') +
ylab('Age')
grid.arrange(p1, p2, ncol = 2)

Age Distribution vs Marital Status That Subscribes Term Deposit
The bulk of clients are married or divorced. The number of ‘divorced’ and ‘married’ clients drops sharply above age 60, while the number of single clients drops above age 40.
p3 <- ggplot(banks, aes(x=age, fill=marital)) +

geom_histogram(binwidth = 2, alpha=0.7) +
facet_grid(cols = vars(y)) +
expand_limits(x=c(0,100)) +
ggtitle("Age Distribution by Marital Status")
p3
Age vs Subscription
Most clients that subscribe are between age 25 to 45. Mean age for all clients is above 40
years of age.
mu <- banks %>% group_by(y) %>% summarize(grp.mean=mean(age))
ggplot(banks, aes(x=age)) +
geom_histogram(color = "blue", fill = "blue", binwidth = 5) +
facet_grid(cols=vars(y)) +
ggtitle('Age Distribution by Subscription') + ylab('Count') + xlab('Age') +
geom_vline(data=mu, aes(xintercept=grp.mean), color="red", linetype="dashed")
Balance vs Subscription
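Balance is the client's average yearly account balance, not a loan balance. The histograms show its distribution for non-subscribers (y = 0) and subscribers (y = 1), with a dashed red line at each group's mean.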
mu2 <- banks %>% group_by(y) %>% summarize(grp.mean2=mean(balance))
ggplot (banks, aes(x=balance)) +

geom_histogram(color = "blue", fill = "blue") +
facet_grid(cols=vars(y)) +
ggtitle('Balance Histogram') + ylab('Count') + xlab('Balance') +
geom_vline(data=mu2, aes(xintercept=grp.mean2), color="red",
linetype="dashed")
Education vs Subscription
Higher education is associated with a higher rate of term deposit subscription. Most clients who subscribe have ‘secondary’ or ‘tertiary’ education. Tertiary-educated clients have the highest subscription rate (15%) among the clients called.
ggplot(data = banks.clean, aes(x=education, fill=y)) +

geom_bar() +
ggtitle("Term Deposit Subscription based on Education Level") +
xlab(" Education Level") +
guides(fill=guide_legend(title="Subscription of Term Deposit"))
banks.clean %>%
group_by(education) %>%
summarize(pct.yes = mean(y=="yes")*100) %>%
arrange(desc(pct.yes))
education (5 rows, sorted by pct.yes in descending order): tertiary, unknown, secondary, primary, missing
Subscription based on Number of Contact during Campaign

The bar chart shows very few subscriptions beyond about 7 contacts during the campaign. Future campaigns could improve resource utilization by capping the number of contacts per client, and could focus on the first 3 contacts, which have the higher subscription rates.
ggplot(data=banks.clean, aes(x=campaign, fill=y))+

geom_histogram()+
ggtitle("Subscription based on Number of Contact during the Campaign")+
xlab("Number of Contact during the Campaign")+
xlim(c(min=1,max=30)) +
guides(fill=guide_legend(title="Subscription of Term Deposit"))
banks.clean %>%
group_by(campaign) %>%
summarize(contact.cnt = n(), pct.con.yes = mean(y=="yes")*100) %>%
arrange(desc(contact.cnt)) %>%
head()
campaign contact.cnt
<dbl> <int>
1 17302
2 12325
3 5439
4 3472
5 1742
6 1274
6 rows
Duration
range(banks$duration)
## [1] 0 4918
summary(banks$duration)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 104.0 182.0 258.2 317.0 4918.0
banks %>% select(duration) %>% arrange(desc(duration)) %>% head

mu2 <- banks %>% group_by(y) %>% summarize(grp2.mean=mean(duration))
# y is numeric (0/1) at this point, so convert it to a factor for a discrete fill
p6 <- ggplot(banks, aes(x=duration, fill = factor(y))) +
geom_histogram(binwidth = 2) +
coord_cartesian(xlim = c(0,5000), ylim = c(0,400))
p6 + geom_vline(data = mu2, aes(xintercept = grp2.mean), color = "red", linetype = "dashed")
Scatterplot of Duration by Age

There are fewer clients above the age of 60. Call duration looks similar across ages.
banks %>%
ggplot(aes(age, duration)) +
geom_point() +
ggtitle("Scatterplot of Duration vs Age for Subscription of Term Deposit")
Scatterplot of Duration by Campaign

Call duration is similar for the first 10 contacts during the campaign. Successful subscriptions (y=1) occur mostly within the first 10 contacts, and far fewer after that.
# campaign < 63 drops the maximum contact count (63) from the plot
banks %>% filter(campaign < 63) %>%
ggplot(aes(campaign, duration)) +
geom_point() +
ggtitle("Scatterplot of Duration vs Campaign for Subscription of Term Deposit")
Scatterplot Matrix
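Because of the large number of attributes (17 in total), two groups of four were chosen for the scatterplot matrices; balance appears in both groups, so seven distinct attributes are plotted. No clear correlation pattern can be observed, partly because several of the selected attributes (month, housing, loan) are categorical.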
banks_select1 <- banks %>% select(duration, month, day, balance)

pairs(banks_select1)
banks_select2 <- banks %>% select(balance, housing, loan, campaign)

pairs(banks_select2)
Split the Training / Testing data and Scale

Split the dataset into training and testing sets and scale the numeric variables.
# split into training and testing

set.seed(123)
split = sample.split(banks$y,SplitRatio = 0.70)
training_set = subset(banks, split == TRUE)
test_set = subset(banks, split == FALSE)
# scale
training_set[c(1,6,10,12,13)] = scale(training_set[c(1,6,10,12,13)])
test_set[c(1,6,10,12,13)] = scale(test_set[c(1,6,10,12,13)])
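The code above scales the test set with its own mean and standard deviation. A common alternative is to reuse the centring and scaling values learned from the training set, so that both sets are on the same scale. The sketch below shows that variant on made-up data; the data frames, num_cols and the other names are assumptions for illustration and are not part of the original report.

```r
# Made-up training and test data with two numeric columns
train <- data.frame(age = c(20, 30, 40, 50), balance = c(0, 100, 200, 300))
test  <- data.frame(age = c(25, 60),         balance = c(50, 400))
num_cols <- c("age", "balance")

# Scale the training columns and keep the parameters that scale() stored
train_scaled <- scale(train[num_cols])
centers <- attr(train_scaled, "scaled:center")
scales  <- attr(train_scaled, "scaled:scale")

# Apply the training parameters to the test columns
test_scaled <- scale(test[num_cols], center = centers, scale = scales)

print(round(train_scaled, 3))
print(round(test_scaled, 3))
```

With this approach a test value is expressed relative to the training distribution, which is the distribution the model was fitted on.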
Custom Function For Binary Class Performance Evaluation
binclass_eval = function (actual, predict) {

cm = table(as.integer(actual), as.integer(predict),
dnn=c('Actual','Predicted'))
ac = (cm['1','1']+cm['0','0'])/(cm['0','1'] + cm['1','0'] + cm['1','1'] +
cm['0','0'])
pr = cm['1','1']/(cm['0','1'] + cm['1','1'])
rc = cm['1','1']/(cm['1','0'] + cm['1','1'])
fs = 2* pr*rc/(pr+rc)
list(cm=cm, recall=rc, precision=pr, fscore=fs, accuracy=ac)
}
Create a function for plotting distribution

Function to be used later for plotting the prediction distribution
plot_pred_type_distribution <- function(df, threshold) {

v <- rep(NA, nrow(df))
v <- ifelse(df$pred >= threshold & df$y == 1, "TP", v)
v <- ifelse(df$pred >= threshold & df$y == 0, "FP", v)
v <- ifelse(df$pred < threshold & df$y == 1, "FN", v)
v <- ifelse(df$pred < threshold & df$y == 0, "TN", v)
df$pred_type <- v
# y is numeric (0/1), so it is converted to a factor to get one violin per class
ggplot(data=df, aes(x=factor(y), y=pred)) +
geom_violin(fill='black', color=NA) +
geom_jitter(aes(color=pred_type), alpha=0.6) +
geom_hline(yintercept=threshold, color="red", alpha=0.6) +
scale_color_discrete(name = "type") +
labs(title=sprintf("Threshold at %.2f", threshold), x="y")
}
Model 1 - Build Model by fitting Decision Tree Classification

Build the Decision Tree model and plot
# fit the decision tree classification

classifier = rpart(formula = y ~ .,
data = training_set, method = "class")
# plot
prp(classifier, type = 2, extra = 104, fallen.leaves = TRUE, main = "Decision Tree")
Evaluate the Decision Tree Prediction model

Predict the Decision Tree model on testing data and evaluate by finding the accuracy and
calculating/plotting ROC curve.
# predict test data by probability

pred.DT = predict(classifier, newdata = test_set[-17], type = 'prob')
# find the threshold for prediction optimization

predictions_DT <- data.frame(y = test_set$y, pred = NA)
predictions_DT$pred <- pred.DT[,2]
plot_pred_type_distribution(predictions_DT,0.36)
# choose the best threshold as 0.36
test.eval.DT = binclass_eval(test_set[, 17], pred.DT[,2] > 0.36)
# Making the Confusion Matrix

test.eval.DT$cm
## Predicted
## Actual 0 1
## 0 11614 352
## 1 964 603
# calculate accuracy, precision etc.
acc_DT=test.eval.DT$accuracy
prc_DT=test.eval.DT$precision
recall_DT=test.eval.DT$recall
fscore_DT=test.eval.DT$fscore
# print evaluation
cat("Accuracy: ", acc_DT,
"\nPrecision: ", prc_DT,
"\nRecall: ", recall_DT,
"\nFScore: ", fscore_DT)
## Accuracy: 0.9027562
## Precision: 0.6314136
## Recall: 0.3848117
## FScore: 0.4781919
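As a worked check, the four metrics can be recomputed by hand from the confusion matrix printed above (TN = 11614, FP = 352, FN = 964, TP = 603). The sketch below rebuilds that matrix and applies the same formulas as binclass_eval; the variable names are illustrative.

```r
# Confusion matrix from the Decision Tree evaluation (rows = actual, cols = predicted)
cm <- matrix(c(11614, 964, 352, 603), nrow = 2,
             dimnames = list(Actual = c("0", "1"), Predicted = c("0", "1")))

tp <- cm["1", "1"]; tn <- cm["0", "0"]
fp <- cm["0", "1"]; fn <- cm["1", "0"]

accuracy  <- (tp + tn) / sum(cm)     # share of all predictions that are correct
precision <- tp / (tp + fp)          # share of predicted subscribers that are real
recall    <- tp / (tp + fn)          # share of real subscribers that are found
fscore    <- 2 * precision * recall / (precision + recall)

print(round(c(accuracy = accuracy, precision = precision,
              recall = recall, fscore = fscore), 7))
```

The recall value makes the effect of the class imbalance visible: accuracy is about 0.90 even though fewer than four in ten actual subscribers are identified.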
# calculate ROC curve
rocr.pred = prediction(predictions = pred.DT[,2], labels = test_set$y)
rocr.perf = performance(rocr.pred, measure = "tpr", x.measure = "fpr")
rocr.auc = as.numeric(performance(rocr.pred, "auc")@y.values)
# print ROC AUC

rocr.auc
## [1] 0.7590981
# plot ROC curve
plot(rocr.perf,
lwd = 3, colorize = TRUE,
print.cutoffs.at = seq(0, 1, by = 0.1),
text.adj = c(-0.2, 1.7),
main = 'ROC Curve')
mtext(paste('Decision Tree - auc : ', round(rocr.auc, 5)))
abline(0, 1, col = "red", lty = 2)
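The AUC reported by ROCR can be read as the probability that a randomly chosen subscriber receives a higher score than a randomly chosen non-subscriber. A minimal base R sketch of that interpretation, using the rank-sum (Mann-Whitney) form, is given here. The function auc_rank and the example vectors are illustrative assumptions and are not part of the original report.

```r
# AUC from ranks: rank-sum of the positive class, shifted and normalised
auc_rank <- function(labels, scores) {
  r <- rank(scores)                 # average ranks handle tied scores
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Made-up example: 3 of the 4 positive/negative pairs are ordered correctly
labels <- c(0, 0, 1, 1)
scores <- c(0.1, 0.4, 0.35, 0.8)
print(auc_rank(labels, scores))
```

Because it depends only on the ordering of the scores, this measure does not change with the classification threshold, unlike accuracy, precision and recall.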
Model 2 - Build Model by fitting Logistic Regression

Alternatively, build another model by fitting Logistic Regression on the training dataset.
# creating the classifier

classifier.lm = glm(formula = y ~ .,
family = binomial,
data = training_set)
Evaluate the Logistic Regression Prediction model

Predict the Logistic Regression model on testing data and evaluate by finding the accuracy
and calculating/plotting ROC curve.
pred_lm = predict(classifier.lm, type='response', newdata=test_set[-17])
# plot the prediction distribution

predictions_LR <- data.frame(y = test_set$y, pred = NA)
predictions_LR$pred <- pred_lm
plot_pred_type_distribution(predictions_LR,0.30)
# choose the best threshold as 0.30

test.eval.LR = binclass_eval(test_set[, 17], pred_lm > 0.30)
# Making the Confusion Matrix

test.eval.LR$cm
## Predicted
## Actual 0 1
## 0 11373 593
## 1 709 858
# calculate accuracy, precision etc.
acc_LR=test.eval.LR$accuracy
prc_LR=test.eval.LR$precision
recall_LR=test.eval.LR$recall
fscore_LR=test.eval.LR$fscore
# print evaluation
cat("Accuracy: ", acc_LR,
"\nPrecision: ", prc_LR,
"\nRecall: ", recall_LR,
"\nFScore: ", fscore_LR)
## Accuracy: 0.9037907
## Precision: 0.5913163
## Recall: 0.5475431
## FScore: 0.5685885
# calculate ROC
rocr.pred.lr = prediction(predictions = pred_lm, labels = test_set$y)
rocr.perf.lr = performance(rocr.pred.lr, measure = "tpr", x.measure =
"fpr")
rocr.auc.lr = as.numeric(performance(rocr.pred.lr, "auc")@y.values)
# print ROC AUC

rocr.auc.lr
## [1] 0.9078594
# plot ROC curve
plot(rocr.perf.lr,
text.adj = c(-0.2, 1.7),
main = 'ROC Curve')
mtext(paste('Logistic Regression - auc : ', round(rocr.auc.lr, 5)))
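The thresholds 0.36 and 0.30 are stated above without showing how they were found. One possible way to pick a threshold is to sweep a grid of candidates and keep the one with the highest F-score, ideally on a validation set rather than on the test set. The sketch below illustrates this on made-up labels and probabilities; f_score_at and all data in it are assumptions, not the procedure used in the report.

```r
# F-score of the positive class at a given probability threshold
f_score_at <- function(actual, prob, threshold) {
  predicted <- as.integer(prob >= threshold)
  tp <- sum(predicted == 1 & actual == 1)
  fp <- sum(predicted == 1 & actual == 0)
  fn <- sum(predicted == 0 & actual == 1)
  if (tp == 0) return(0)            # avoid 0/0 when nothing is predicted correctly
  2 * tp / (2 * tp + fp + fn)
}

# Made-up labels and predicted probabilities
actual <- c(0, 0, 0, 0, 1, 0, 1, 1)
prob   <- c(0.05, 0.10, 0.20, 0.30, 0.40, 0.60, 0.70, 0.90)

# Candidate thresholds placed between the probability values
thresholds <- seq(0.15, 0.85, by = 0.1)
scores <- sapply(thresholds, function(t) f_score_at(actual, prob, t))
best <- thresholds[which.max(scores)]

print(data.frame(threshold = thresholds, fscore = round(scores, 3)))
print(best)
```

Other criteria, such as a minimum recall or a cost per contact, could replace the F-score in the same loop.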
Compare the models

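The two prediction models are compared to choose the better one. As per the comparison below, the two models have almost the same accuracy (0.9038 vs 0.9028), while Logistic Regression has clearly higher recall, F-score and ROC AUC; the Decision Tree has slightly higher precision.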
# compare Accuracy and ROC

compare <- data.frame(Method = c('Decision Tree', 'Logistic Regression'),
Accuracy = NA, Precision = NA, Recall = NA, FScore = NA, 'ROC AUC' = NA)
compare$Accuracy <- c(acc_DT,acc_LR)
compare$ROC.AUC <- c(rocr.auc,rocr.auc.lr)
compare$Precision <- c(prc_DT, prc_LR)
compare$Recall <- c(recall_DT, recall_LR)
compare$FScore <- c(fscore_DT, fscore_LR)
kable_styling(kable(compare),c("striped","bordered"), full_width = F)
Method               Accuracy   Precision  Recall     FScore     ROC.AUC
Decision Tree        0.9027562  0.6314136  0.3848117  0.4781919  0.7590981
Logistic Regression  0.9037907  0.5913163  0.5475431  0.5685885  0.9078594
# plot both ROC curves side by side for an easy comparison

par(mfrow=c(1,2))
# plot ROC curve for Decision Tree

plot(rocr.perf,
text.adj = c(-0.2, 1.7),
main = 'ROC Curve')
mtext(paste('Decision Tree - auc : ', round(rocr.auc, 5)))
# plot ROC curve for Logistic Regression

plot(rocr.perf.lr,
text.adj = c(-0.2, 1.7),
main = 'ROC Curve')
mtext(paste('Logistic Regression - auc : ', round(rocr.auc.lr, 5)))
