
Portuguese Bank Marketing

Data
WQD7004 / RRookie / Yong Keh Soon-WQD180065, Vikas Mann-WQD180051, Lew Teck Wei-WQD180056, Lim Shien Long-WQD180027
14 December 2018

 Basic R Setup
o R Libraries Used
o Import Dataset
 Understanding Data
o Overview of Data Attributes
o Data Validation
 Check for Duplicate Rows
 Check for Missing Data
o Data Cleaning
 Remove Rows That Are Missing Values in All Columns
 Save Deduplicated Rows
 Impute Missing Values
 Final Removal of Duplicated Rows
o Verify Cleanliness
 Verify Deduplication
 Verify Missing Value
 About The Cleaned Dataset
o Data Structure of Cleaned Dataset
o Recoding ‘yes’ to binary
o Sample Rows
 Size of Dataset
 Sample Observations
 Data Summary
o Outcome Imbalance
 Exploratory Data Analysis
o Univariate Analysis
 Age Distribution
 Age Distribution vs Marital Status That Subscribes Term Deposit
 Age vs Subscription
 Balance vs Subscription
 Education vs Subscription
 Subscription based on Number of Contact during Campaign
 Duration
 Scatterplot of Duration by Age
 Scatterplot of Duration by Campaign
 Scatterplot Matrix
o Split the Training / Testing data and Scale
 Custom Function For Binary Class Performance Evaluation
 Create a function for plotting distribution
o Model 1 - Build Model by fitting Decision Tree Classification
o Evaluate the Decision Tree Prediction model
o Model 2 - Build Model by fitting Logistic Regression
o Evaluate the Logistic Regression Prediction model
o Compare the models
Basic R Setup
R Libraries Used
Here are the R libraries used in this analysis.

library(knitr) # web widget


library(tidyverse) # data manipulation
library(data.table) # fast file reading
library(caret) # rocr analysis
library(ROCR) # rocr analysis
library(kableExtra) # nice table html formatting
library(gridExtra) # arranging ggplot in grid
library(rpart) # decision tree
library(rpart.plot) # decision tree plotting
library(caTools) # split

Import Dataset
Import CSV into dataframe.

banks = read.table('data/bank-full.csv',sep=',',header = T)
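As a quick sanity check on the import step, a minimal sketch like the one below (using a temporary two-row file rather than the real bank-full.csv) shows how read.table parses a comma-separated file with a header row:

```r
# Minimal sketch: read.table with sep=',' and header=TRUE treats the first
# line as column names. A throwaway temp file stands in for the dataset.
tmp <- tempfile(fileext = ".csv")
writeLines(c("age,job,y", "30,admin.,no", "45,services,yes"), tmp)
demo <- read.table(tmp, sep = ",", header = TRUE)
stopifnot(nrow(demo) == 2, ncol(demo) == 3)
```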

Understanding Data
Overview of Data Attributes
Attribute Information:

1. Age (numeric)
2. Type of job
3. Marital status
4. Education
5. Default: has credit in default?
6. Balance: average yearly balance (numeric)
7. Housing: has housing loan?
8. Loan: has personal loan?
9. Contact: contact communication type (cellular, telephone)
10. Month: last contact month
11. Day: last contact day of the month
12. Duration: last contact duration, in seconds (numeric)
13. Campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
14. Pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric: -1 means the client was not previously contacted)
15. Previous: number of contacts performed before this campaign and for this client (numeric)
16. Poutcome: outcome of the previous marketing campaign (categorical: ‘failure’, ‘other’, ‘success’, ‘unknown’)
17. y: has the client subscribed a term deposit? (binary: ‘yes’, ‘no’)

Statistical Summary

summary(banks)
## age job marital education
## Min. :18.00 blue-collar:9844 divorced: 5315 primary : 6966
## 1st Qu.:33.00 management :9632 married :27579 secondary:23539
## Median :39.00 technician :7720 single :13027 tertiary :13525
## Mean :40.95 admin. :5252 NA's : 646 unknown : 1879
## 3rd Qu.:48.00 services :4210 NA's : 658
## Max. :95.00 (Other) :9260
## NA's :641 NA's : 649
## default balance housing loan contact
## no :45125 Min. : -8019 no :20400 no :38591 cellular :29749
## yes : 825 1st Qu.: 73 yes :25489 yes : 7324 telephone: 2953
## NA's: 617 Median : 448 NA's: 678 NA's: 652 unknown :13215
## Mean : 1363 NA's : 650
## 3rd Qu.: 1426
## Max. :102127
## NA's :658
## day month duration campaign
## Min. : 1.0 may :14017 Min. : 0.0 Min. : 1.000
## 1st Qu.: 8.0 jul : 6989 1st Qu.: 103.0 1st Qu.: 1.000
## Median :16.0 aug : 6359 Median : 180.0 Median : 2.000
## Mean :15.8 jun : 5426 Mean : 258.1 Mean : 2.767
## 3rd Qu.:21.0 nov : 4038 3rd Qu.: 319.0 3rd Qu.: 3.000
## Max. :31.0 (Other): 9086 Max. :4918.0 Max. :63.000
## NA's :632 NA's : 652 NA's :648 NA's :657
## pdays previous poutcome y
## Min. : -1.00 Min. : 0.0000 failure: 4994 no :40547
## 1st Qu.: -1.00 1st Qu.: 0.0000 other : 1871 yes : 5385
## Median : -1.00 Median : 0.0000 success: 1538 NA's: 635
## Mean : 40.31 Mean : 0.5795 unknown:37551
## 3rd Qu.: -1.00 3rd Qu.: 0.0000 NA's : 613
## Max. :871.00 Max. :275.0000
## NA's :654 NA's :649

NAs can be observed in all 17 attributes.

Data Validation
Check for Duplicate Rows

sum(duplicated(banks))
## [1] 1450

Check for Missing Data


How Many Rows Contain Missing Data

sum(!complete.cases(banks))
## [1] 1259

How Many Rows Are Missing Values in All Columns

all.empty = rowSums(is.na(banks))==ncol(banks)
sum(all.empty)
## [1] 61
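The all-empty detection works because rowSums(is.na(...)) counts NAs per row, and comparing the count against ncol flags rows that are missing in every column. A toy sketch:

```r
# Only the second row is NA in all columns, so only it is flagged.
df <- data.frame(a = c(1, NA, NA), b = c(2, NA, 3))
all.empty.demo <- rowSums(is.na(df)) == ncol(df)
stopifnot(sum(all.empty.demo) == 1, all.empty.demo[[2]])
```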

Missing Value By Variable

sapply(banks, function(x) sum(is.na(x)))


## age job marital education default balance housing
## 641 649 646 658 617 658 678
## loan contact day month duration campaign pdays
## 652 650 632 652 648 657 654
## previous poutcome y
## 649 613 635

Data Cleaning
Remove Rows That Are Missing Values in All Columns

banks.clean = banks[!all.empty,]

Save Deduplicated Rows


Save the deduplicated data to a new variable, ‘banks.clean’.
banks.clean = banks.clean %>% distinct

Number of Rows After Dedup

nrow(banks.clean)
## [1] 45116
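dplyr's distinct() keeps the first occurrence of each unique row, which is what drives the row-count drop above. A toy sketch:

```r
library(dplyr)
# Two identical rows collapse into one; the unique row survives.
df <- data.frame(x = c(1, 1, 2), y = c("a", "a", "b"))
dedup <- df %>% distinct()
stopifnot(nrow(dedup) == 2)
```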

Impute Missing Values


Create a New Column to Flag Rows That Had Missing Values

banks.clean$missing = !complete.cases(banks.clean)

Missing Numeric Value Treatment

Replace with Average

banks.clean$age[is.na(banks.clean$age)] = mean(banks$age, na.rm=T)
banks.clean$day[is.na(banks.clean$day)] = mean(banks$day, na.rm=T)
banks.clean$duration[is.na(banks.clean$duration)] = mean(banks$duration, na.rm=T)
banks.clean$previous[is.na(banks.clean$previous)] = mean(banks$previous, na.rm=T)
banks.clean$campaign[is.na(banks.clean$campaign)] = mean(banks$campaign, na.rm=T)

Replace with Mode - The distributions of the variables below are highly skewed toward a specific value, hence we impute their missing values with the mode.

hist(banks.clean$balance)

hist(banks.clean$pdays)

banks.clean$pdays[is.na(banks.clean$pdays)] = as.numeric(names(sort(-table(banks$pdays)))[1])
banks.clean$balance[is.na(banks.clean$balance)] = as.numeric(names(sort(-table(banks$balance)))[1])
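The mode trick used here works because table() counts occurrences of each value, sort(-table(...)) orders them by descending frequency, and names(...)[1] extracts the most frequent value. A toy sketch:

```r
# -1 occurs most often (as pdays does in this dataset), so it becomes the
# imputed mode; table() ignores NAs by default.
x <- c(-1, -1, -1, 5, 5, 12, NA, NA)
mode_val <- as.numeric(names(sort(-table(x)))[1])
x[is.na(x)] <- mode_val
stopifnot(mode_val == -1, sum(is.na(x)) == 0)
```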

Missing Categorical Treatment

Replace with Special Category

banks.clean$job = fct_explicit_na(banks.clean$job, "missing")


banks.clean$marital = fct_explicit_na(banks.clean$marital, "missing")
banks.clean$education = fct_explicit_na(banks.clean$education, "missing")
banks.clean$default = fct_explicit_na(banks.clean$default, "missing")
banks.clean$loan = fct_explicit_na(banks.clean$loan, "missing")
banks.clean$contact = fct_explicit_na(banks.clean$contact, "missing")
banks.clean$poutcome = fct_explicit_na(banks.clean$poutcome, "missing")
banks.clean$y = fct_explicit_na(banks.clean$y, "missing")
banks.clean$housing = fct_explicit_na(banks.clean$housing, "missing")
banks.clean$month = fct_explicit_na(banks.clean$month, "missing")
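fct_explicit_na (from forcats, loaded via tidyverse) appends an explicit level and assigns it to every NA, so no NA survives in the factor. A toy sketch:

```r
library(forcats)
# The NA entry becomes the "missing" level rather than staying NA.
f <- factor(c("a", "b", NA))
f2 <- fct_explicit_na(f, "missing")
stopifnot(identical(levels(f2), c("a", "b", "missing")), sum(is.na(f2)) == 0)
```

In newer forcats releases this helper is superseded by fct_na_value_to_level, which serves the same purpose here.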

Final Removal of Duplicated Rows


After imputation, certain rows became identical and hence need to be deduplicated again.

banks.clean = banks.clean %>% distinct

Verify Cleanliness
Verify Deduplication
Number of Rows Before Dedup

nrow(banks)
## [1] 46567

Number of Rows After Dedup

The number of rows has been reduced after deduplication.

nrow(banks.clean)
## [1] 45112

Number of Deduplicated Rows

There are no more duplicated rows.

sum(duplicated(banks.clean))
## [1] 0

Verify Missing Value


All missing values have been treated.

sapply(banks.clean, function(x) sum(is.na(x)))


## age job marital education default balance housing
## 0 0 0 0 0 0 0
## loan contact day month duration campaign pdays
## 0 0 0 0 0 0 0
## previous poutcome y missing
## 0 0 0 0

A new ‘missing’ level has been introduced in each factor.

levels(banks.clean$job)
## [1] "admin." "blue-collar" "entrepreneur" "housemaid"
## [5] "management" "retired" "self-employed" "services"
## [9] "student" "technician" "unemployed" "unknown"
## [13] "missing"
levels(banks.clean$marital)
## [1] "divorced" "married" "single" "missing"
levels(banks.clean$education)
## [1] "primary" "secondary" "tertiary" "unknown" "missing"
levels(banks.clean$default)
## [1] "no" "yes" "missing"
levels(banks.clean$loan)
## [1] "no" "yes" "missing"
levels(banks.clean$contact)
## [1] "cellular" "telephone" "unknown" "missing"
levels(banks.clean$poutcome)
## [1] "failure" "other" "success" "unknown" "missing"
levels(banks.clean$y)
## [1] "no" "yes" "missing"
levels(banks.clean$housing)
## [1] "no" "yes" "missing"
levels(banks.clean$month)
## [1] "apr" "aug" "dec" "feb" "jan" "jul" "jun"
## [8] "mar" "may" "nov" "oct" "sep" "missing"

How Many Rows Had Missing Data Before Cleaning

sum(banks.clean$missing)
## [1] 1121

Display Those Rows That Had Missing Data Before

banks.clean %>% filter(missing==T) %>% head %>% kable


(table: six sample rows that had missing data before cleaning; their numeric NAs now hold the column means, e.g. age 40.94, day 15.80, duration 258.09, previous 0.5794; their categorical NAs hold the ‘missing’ level; missing = TRUE for all six)
Summary Statistics After Cleaning

summary(banks.clean)
## age job marital education
## Min. :18.00 blue-collar:9595 divorced: 5139 primary : 6758
## 1st Qu.:33.00 management :9322 married :26815 secondary:22855
## Median :39.00 technician :7485 single :12629 tertiary :13123
## Mean :40.94 admin. :5094 missing : 529 unknown : 1829
## 3rd Qu.:48.00 services :4096 missing : 547
## Max. :95.00 retired :2240
## (Other) :7280
## default balance housing loan
## no :43797 Min. : -8019 no :19803 no :37439
## yes : 809 1st Qu.: 62 yes :24746 yes : 7133
## missing: 506 Median : 436 missing: 563 missing: 540
## Mean : 1347
## 3rd Qu.: 1407
## Max. :102127
##
## contact day month duration
## cellular :28876 Min. : 1.0 may :13585 Min. : 0.0
## telephone: 2866 1st Qu.: 8.0 jul : 6785 1st Qu.: 104.0
## unknown :12836 Median :16.0 aug : 6161 Median : 182.0
## missing : 534 Mean :15.8 jun : 5268 Mean : 258.2
## 3rd Qu.:21.0 nov : 3920 3rd Qu.: 317.0
## Max. :31.0 apr : 2884 Max. :4918.0
## (Other): 6509
## campaign pdays previous poutcome
## Min. : 1.000 Min. : -1.00 Min. : 0.0000 failure: 4847
## 1st Qu.: 1.000 1st Qu.: -1.00 1st Qu.: 0.0000 other : 1813
## Median : 2.000 Median : -1.00 Median : 0.0000 success: 1486
## Mean : 2.766 Mean : 39.69 Mean : 0.5794 unknown:36461
## 3rd Qu.: 3.000 3rd Qu.: -1.00 3rd Qu.: 0.0000 missing: 505
## Max. :63.000 Max. :871.00 Max. :275.0000
##
## y missing
## no :39367 Mode :logical
## yes : 5224 FALSE:43991
## missing: 521 TRUE :1121
##
##
##
##

All NAs have been removed. Categorical NAs were imputed with a ‘missing’ level; numerical NAs were replaced with the mean of the attribute, except the skewed balance and pdays, which were imputed with the mode.

Save the cleaned dataset to banks.csv and rename the dataframe back to banks.

write.csv(banks.clean, file = "./data/banks.csv")

banks = banks.clean

About The Cleaned Dataset


Data Structure of Cleaned Dataset
str(banks)
## 'data.frame': 45112 obs. of 18 variables:
## $ age : num 58 44 33 47 33 35 28 42 58 43 ...
## $ job : Factor w/ 13 levels "admin.","blue-collar",..: 5 10 3 2 12
5 5 3 6 10 ...
## $ marital : Factor w/ 4 levels "divorced","married",..: 2 3 2 2 3 2 3
1 2 3 ...
## $ education: Factor w/ 5 levels "primary","secondary",..: 3 2 2 4 4 3 3
3 1 2 ...
## $ default : Factor w/ 3 levels "no","yes","missing": 1 1 1 1 1 1 1 2 1
1 ...
## $ balance : num 2143 29 2 1506 1 ...
## $ housing : Factor w/ 3 levels "no","yes","missing": 2 2 2 2 1 2 2 2 2
2 ...
## $ loan : Factor w/ 3 levels "no","yes","missing": 1 1 2 1 1 1 2 1 1
1 ...
## $ contact : Factor w/ 4 levels "cellular","telephone",..: 3 3 3 3 3 3
3 3 3 3 ...
## $ day : num 5 5 5 5 5 5 5 5 5 5 ...
## $ month : Factor w/ 13 levels "apr","aug","dec",..: 9 9 9 9 9 9 9 9
9 9 ...
## $ duration : num 261 151 76 92 198 139 217 380 50 55 ...
## $ campaign : num 1 1 1 1 1 1 1 1 1 1 ...
## $ pdays : num -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
## $ previous : num 0 0 0 0 0 0 0 0 0 0 ...
## $ poutcome : Factor w/ 5 levels "failure","other",..: 4 4 4 4 4 4 4 4 4
4 ...
## $ y : Factor w/ 3 levels "no","yes","missing": 1 1 1 1 1 1 1 1 1
1 ...
## $ missing : logi FALSE FALSE FALSE FALSE FALSE FALSE ...

Recoding ‘yes’ to binary


banks$y = ifelse(banks$y=='yes',1,0)
str(banks)
## 'data.frame': 45112 obs. of 18 variables:
## $ age : num 58 44 33 47 33 35 28 42 58 43 ...
## $ job : Factor w/ 13 levels "admin.","blue-collar",..: 5 10 3 2 12
5 5 3 6 10 ...
## $ marital : Factor w/ 4 levels "divorced","married",..: 2 3 2 2 3 2 3
1 2 3 ...
## $ education: Factor w/ 5 levels "primary","secondary",..: 3 2 2 4 4 3 3
3 1 2 ...
## $ default : Factor w/ 3 levels "no","yes","missing": 1 1 1 1 1 1 1 2 1
1 ...
## $ balance : num 2143 29 2 1506 1 ...
## $ housing : Factor w/ 3 levels "no","yes","missing": 2 2 2 2 1 2 2 2 2
2 ...
## $ loan : Factor w/ 3 levels "no","yes","missing": 1 1 2 1 1 1 2 1 1
1 ...
## $ contact : Factor w/ 4 levels "cellular","telephone",..: 3 3 3 3 3 3
3 3 3 3 ...
## $ day : num 5 5 5 5 5 5 5 5 5 5 ...
## $ month : Factor w/ 13 levels "apr","aug","dec",..: 9 9 9 9 9 9 9 9
9 9 ...
## $ duration : num 261 151 76 92 198 139 217 380 50 55 ...
## $ campaign : num 1 1 1 1 1 1 1 1 1 1 ...
## $ pdays : num -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
## $ previous : num 0 0 0 0 0 0 0 0 0 0 ...
## $ poutcome : Factor w/ 5 levels "failure","other",..: 4 4 4 4 4 4 4 4 4
4 ...
## $ y : num 0 0 0 0 0 0 0 0 0 0 ...
## $ missing : logi FALSE FALSE FALSE FALSE FALSE FALSE ...

The y attribute has been recoded from a factor to a numeric binary: 1 for ‘yes’, 0 otherwise (the ‘missing’ level also maps to 0).
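The recode can be checked on a small factor; note that any level other than ‘yes’ (including ‘missing’) maps to 0:

```r
# ifelse on a factor comparison returns a plain numeric 0/1 vector.
y <- factor(c("no", "yes", "missing", "yes"))
y_bin <- ifelse(y == "yes", 1, 0)
stopifnot(identical(y_bin, c(0, 1, 0, 1)))
```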

Sample Rows
Size of Dataset

nrow(banks)
## [1] 45112
ncol(banks)
## [1] 18

Sample Observations

head(banks)

  age job marital education default balance housing


  <dbl> <fctr> <fctr> <fctr> <fctr> <dbl> <fctr>
1 58 management married tertiary no 2143 yes

2 44 technician single secondary no 29 yes

3 33 entrepreneur married secondary no 2 yes

4 47 blue-collar married unknown no 1506 yes

5 33 unknown single unknown no 1 no

6 35 management married tertiary no 231 yes



Data Summary

summary(banks)
## age job marital education
## Min. :18.00 blue-collar:9595 divorced: 5139 primary : 6758
## 1st Qu.:33.00 management :9322 married :26815 secondary:22855
## Median :39.00 technician :7485 single :12629 tertiary :13123
## Mean :40.94 admin. :5094 missing : 529 unknown : 1829
## 3rd Qu.:48.00 services :4096 missing : 547
## Max. :95.00 retired :2240
## (Other) :7280
## default balance housing loan
## no :43797 Min. : -8019 no :19803 no :37439
## yes : 809 1st Qu.: 62 yes :24746 yes : 7133
## missing: 506 Median : 436 missing: 563 missing: 540
## Mean : 1347
## 3rd Qu.: 1407
## Max. :102127
##
## contact day month duration
## cellular :28876 Min. : 1.0 may :13585 Min. : 0.0
## telephone: 2866 1st Qu.: 8.0 jul : 6785 1st Qu.: 104.0
## unknown :12836 Median :16.0 aug : 6161 Median : 182.0
## missing : 534 Mean :15.8 jun : 5268 Mean : 258.2
## 3rd Qu.:21.0 nov : 3920 3rd Qu.: 317.0
## Max. :31.0 apr : 2884 Max. :4918.0
## (Other): 6509
## campaign pdays previous poutcome
## Min. : 1.000 Min. : -1.00 Min. : 0.0000 failure: 4847
## 1st Qu.: 1.000 1st Qu.: -1.00 1st Qu.: 0.0000 other : 1813
## Median : 2.000 Median : -1.00 Median : 0.0000 success: 1486
## Mean : 2.766 Mean : 39.69 Mean : 0.5794 unknown:36461
## 3rd Qu.: 3.000 3rd Qu.: -1.00 3rd Qu.: 0.0000 missing: 505
## Max. :63.000 Max. :871.00 Max. :275.0000
##
## y missing
## Min. :0.0000 Mode :logical
## 1st Qu.:0.0000 FALSE:43991
## Median :0.0000 TRUE :1121
## Mean :0.1158
## 3rd Qu.:0.0000
## Max. :1.0000
##

Outcome Imbalance
Observe that the predicted outcome (y) is skewed towards ‘no’ (0), at over 88%.

prop.table(table(banks$y))
##
## 0 1
## 0.8841993 0.1158007

Exploratory Data Analysis


Univariate Analysis
Age Distribution
The bulk of clients are between the ages of 33 (1st quartile) and 48 (3rd quartile), with the mean of about 41 shown as the red vertical line on the histogram.

The boxplot of age describes essentially the same statistics, but we can also see outliers above the age of 65.

summary(banks$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 18.00 33.00 39.00 40.94 48.00 95.00
gg = ggplot(banks)
p1 = gg + geom_histogram(aes(x=age), color="black", fill="white", binwidth = 5) +
  ggtitle('Age Distribution (red mean line)') +
  ylab('Count') +
  xlab('Age') +
  geom_vline(aes(xintercept = mean(age), color = "red")) +
  scale_x_continuous(breaks = seq(0,100,5)) +
  theme(legend.position = "none")

p2 = gg + geom_boxplot(aes(x='', y=age)) +
  ggtitle('Age Boxplot') +
  ylab('Age')

grid.arrange(p1, p2, ncol = 2)


Age Distribution vs Marital Status That Subscribes Term Deposit
The bulk of clients are married or divorced.  Sharp drop of clients above age 60 with
marital status ‘divorced’ and ‘married’. *Single clients drop in numbers above age 40.

p3 <- ggplot(banks, aes(x=age, fill=marital)) +
  geom_histogram(binwidth = 2, alpha=0.7) +
  facet_grid(cols = vars(y)) +
  expand_limits(x=c(0,100)) +
  scale_x_continuous(breaks = seq(0,100,10)) +
  ggtitle("Age Distribution by Marital Status")

p3

Age vs Subscription
Most clients that subscribe are between ages 25 and 45. The mean age for all clients is above 40.

mu <- banks %>% group_by(y) %>% summarize(grp.mean=mean(age))

ggplot(banks, aes(x=age)) +
  geom_histogram(color = "blue", fill = "blue", binwidth = 5) +
  facet_grid(cols=vars(y)) +
  ggtitle('Age Distribution by Subscription') + ylab('Count') + xlab('Age') +
  scale_x_continuous(breaks = seq(0,100,5)) +
  geom_vline(data=mu, aes(xintercept=grp.mean), color="red", linetype="dashed")

Balance vs Subscription
Clients that subscribe to term deposits have lower balances.

mu2 <- banks %>% group_by(y) %>% summarize(grp.mean2=mean(balance))

ggplot(banks, aes(x=balance)) +
  geom_histogram(color = "blue", fill = "blue") +
  facet_grid(cols=vars(y)) +
  ggtitle('Balance Histogram') + ylab('Count') + xlab('Balance') +
  geom_vline(data=mu2, aes(xintercept=grp.mean2), color="red", linetype="dashed")

Education vs Subscription
Higher education appears to contribute to a higher subscription rate for term deposits. Most clients who subscribe have ‘secondary’ or ‘tertiary’ education. Tertiary-educated clients have the highest subscription rate (15%) among clients called.

ggplot(data = banks.clean, aes(x=education, fill=y)) +
  geom_bar() +
  ggtitle("Term Deposit Subscription based on Education Level") +
  xlab("Education Level") +
  guides(fill=guide_legend(title="Subscription of Term Deposit"))

banks.clean %>%
group_by(education) %>%
summarize(pct.yes = mean(y=="yes")*100) %>%
arrange(desc(pct.yes))

(table: education levels ordered by pct.yes, highest first: tertiary, unknown, secondary, primary, missing)

Subscription based on Number of Contact during Campaign


It can be observed from the bar chart that there are virtually no subscriptions beyond 7 contacts during the campaign. Future campaigns could improve resource utilization by setting a limit on contacts per client, focusing on the first 3 contacts, which have a higher subscription rate.

ggplot(data=banks.clean, aes(x=campaign, fill=y)) +
  geom_histogram() +
  ggtitle("Subscription based on Number of Contact during the Campaign") +
  xlab("Number of Contact during the Campaign") +
  xlim(c(min=1, max=30)) +
  guides(fill=guide_legend(title="Subscription of Term Deposit"))

banks.clean %>%
group_by(campaign) %>%
summarize(contact.cnt = n(), pct.con.yes = mean(y=="yes")*100) %>%
arrange(desc(contact.cnt)) %>%
head()

##   campaign contact.cnt
## 1        1       17302
## 2        2       12325
## 3        3        5439
## 4        4        3472
## 5        5        1742
## 6        6        1274
(pct.con.yes column omitted)

Duration

range(banks$duration)
## [1] 0 4918
summary(banks$duration)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 104.0 182.0 258.2 317.0 4918.0
banks %>% select(duration) %>% arrange(desc(duration)) %>% head
mu2 <- banks %>% group_by(y) %>% summarize(grp2.mean=mean(duration))

p6 <- ggplot(banks, aes(x=duration, fill = y)) +
  geom_histogram(binwidth = 2) +
  facet_grid(cols = vars(y)) +
  coord_cartesian(xlim = c(0,5000), ylim = c(0,400))

p6 + geom_vline(data = mu2, aes(xintercept = grp2.mean), color = "red", linetype = "dashed")

Scatterplot of Duration by Age


There are fewer clients after the age of 60. Call duration looks similar across ages.

banks %>%
  ggplot(aes(age, duration)) +
  geom_point() +
  facet_grid(cols = vars(y)) +
  scale_x_continuous(breaks = seq(0,100,10)) +
  ggtitle("Scatterplot of Duration vs Age for Subscription of Term Deposit")

Scatterplot of Duration by Campaign


Call duration is similar for the first 10 contacts during a campaign. Successful subscriptions (y=1) occur mostly within the first 10 contacts, and much less frequently after that.

banks %>% filter(campaign < 63) %>%
  ggplot(aes(campaign, duration)) +
  geom_point() +
  facet_grid(cols = vars(y)) +
  ggtitle("Scatterplot of Duration vs Campaign for Subscription of Term Deposit")

Scatterplot Matrix
Due to the large number of attributes (17 in total), 8 were chosen for correlation. No clear correlation pattern can be observed, as most attributes are categorical.

banks_select1 <- banks %>% select(duration, month, day, balance)


pairs(banks_select1)

banks_select2 <- banks %>% select(balance, housing, loan, campaign)


pairs(banks_select2)

Split the Training / Testing data and Scale


Split the dataset into training and testing sets and scale the numerical variables.

# split into training and testing


set.seed(123)
split = sample.split(banks$y,SplitRatio = 0.70)
training_set = subset(banks, split == TRUE)
test_set = subset(banks, split == FALSE)

# scale the numeric columns (age, balance, day, duration, campaign)
training_set[c(1,6,10,12,13)] = scale(training_set[c(1,6,10,12,13)])
test_set[c(1,6,10,12,13)] = scale(test_set[c(1,6,10,12,13)])
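caTools::sample.split is stratified: it preserves the outcome ratio in both partitions, which matters for an imbalanced target like y. A toy sketch with a 20% positive rate:

```r
library(caTools)
set.seed(123)
y <- rep(c(0, 1), times = c(80, 20))  # 20% positives
split.demo <- sample.split(y, SplitRatio = 0.70)
train <- y[split.demo]
test  <- y[!split.demo]
stopifnot(length(train) == 70, length(test) == 30)
# class ratio roughly preserved in both partitions
stopifnot(abs(mean(train) - 0.2) < 0.05, abs(mean(test) - 0.2) < 0.05)
```

Note that the chunk above scales the test set with its own statistics; a common alternative is to reuse the training set's centering and scaling parameters to avoid information leakage.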

Custom Function For Binary Class Performance Evaluation

binclass_eval = function (actual, predict) {
  cm = table(as.integer(actual), as.integer(predict), dnn=c('Actual','Predicted'))
  ac = (cm['1','1'] + cm['0','0']) / (cm['0','1'] + cm['1','0'] + cm['1','1'] + cm['0','0'])
  pr = cm['1','1'] / (cm['0','1'] + cm['1','1'])
  rc = cm['1','1'] / (cm['1','0'] + cm['1','1'])
  fs = 2 * pr * rc / (pr + rc)
  list(cm=cm, recall=rc, precision=pr, fscore=fs, accuracy=ac)
}
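The metrics returned by this helper can be verified by hand on a toy confusion matrix (2 TP, 1 FN, 1 FP, 4 TN):

```r
actual  <- c(1, 1, 1, 0, 0, 0, 0, 0)
predict <- c(1, 1, 0, 1, 0, 0, 0, 0)
# TP = 2, FN = 1, FP = 1, TN = 4
accuracy  <- (2 + 4) / 8                 # (TP + TN) / total
precision <- 2 / (2 + 1)                 # TP / (TP + FP)
recall    <- 2 / (2 + 1)                 # TP / (TP + FN)
fscore    <- 2 * precision * recall / (precision + recall)
stopifnot(isTRUE(all.equal(accuracy, 0.75)), isTRUE(all.equal(fscore, 2/3)))
```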

Create a function for plotting distribution


Function to be used later for plotting the prediction distribution

plot_pred_type_distribution <- function(df, threshold) {
  v <- rep(NA, nrow(df))
  v <- ifelse(df$pred >= threshold & df$y == 1, "TP", v)
  v <- ifelse(df$pred >= threshold & df$y == 0, "FP", v)
  v <- ifelse(df$pred < threshold & df$y == 1, "FN", v)
  v <- ifelse(df$pred < threshold & df$y == 0, "TN", v)

  df$pred_type <- v

  ggplot(data=df, aes(x=y, y=pred)) +
    geom_violin(fill='black', color=NA) +
    geom_jitter(aes(color=pred_type), alpha=0.6) +
    geom_hline(yintercept=threshold, color="red", alpha=0.6) +
    scale_color_discrete(name = "type") +
    labs(title=sprintf("Threshold at %.2f", threshold))
}

Model 1 - Build Model by fitting Decision Tree Classification


Build the Decision Tree model and plot

# fit the decision tree classification
classifier = rpart(formula = y ~ ., data = training_set, method = "class")

# plot
prp(classifier, type = 2, extra = 104, fallen.leaves = TRUE, main="Decision Tree")

Evaluate the Decision Tree Prediction model


Apply the Decision Tree model to the testing data and evaluate it by computing the accuracy and calculating/plotting the ROC curve.

# predict test data by probability


pred.DT = predict(classifier, newdata = test_set[-17], type = 'prob')

# find the threshold for prediction optimization


predictions_DT <- data.frame(y = test_set$y, pred = NA)
predictions_DT$pred <- pred.DT[,2]
plot_pred_type_distribution(predictions_DT,0.36)
# choose the best threshold as 0.36
test.eval.DT = binclass_eval(test_set[, 17], pred.DT[,2] > 0.36)

# Making the Confusion Matrix


test.eval.DT$cm
## Predicted
## Actual 0 1
## 0 11614 352
## 1 964 603
# calculate accuracy, precision etc.
acc_DT=test.eval.DT$accuracy
prc_DT=test.eval.DT$precision
recall_DT=test.eval.DT$recall
fscore_DT=test.eval.DT$fscore

# print evaluation
cat("Accuracy: ", acc_DT,
"\nPrecision: ", prc_DT,
"\nRecall: ", recall_DT,
"\nFScore: ", fscore_DT)
## Accuracy: 0.9027562
## Precision: 0.6314136
## Recall: 0.3848117
## FScore: 0.4781919
# calculate ROC curve
rocr.pred = prediction(predictions = pred.DT[,2], labels = test_set$y)
rocr.perf = performance(rocr.pred, measure = "tpr", x.measure = "fpr")
rocr.auc = as.numeric(performance(rocr.pred, "auc")@y.values)

# print ROC AUC


rocr.auc
## [1] 0.7590981
# plot ROC curve
plot(rocr.perf,
lwd = 3, colorize = TRUE,
print.cutoffs.at = seq(0, 1, by = 0.1),
text.adj = c(-0.2, 1.7),
main = 'ROC Curve')
mtext(paste('Decision Tree - auc : ', round(rocr.auc, 5)))
abline(0, 1, col = "red", lty = 2)
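ROCR's prediction/performance pair used above can be exercised on a tiny example; with scores that separate the classes perfectly, the AUC is exactly 1:

```r
library(ROCR)
# Both positives score above both negatives, so ranking is perfect.
p <- prediction(predictions = c(0.9, 0.8, 0.3, 0.2), labels = c(1, 1, 0, 0))
auc <- as.numeric(performance(p, "auc")@y.values)
stopifnot(auc == 1)
```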

Model 2 - Build Model by fitting Logistic Regression


Alternatively, build another model by fitting a Logistic Regression algorithm to the training dataset.

# creating the classifier


classifier.lm = glm(formula = y ~ .,
family = binomial,
data = training_set)

Evaluate the Logistic Regression Prediction model


Apply the Logistic Regression model to the testing data and evaluate it by computing the accuracy and calculating/plotting the ROC curve.

pred_lm = predict(classifier.lm, type='response', newdata=test_set[-17])

# plot the prediction distribution


predictions_LR <- data.frame(y = test_set$y, pred = NA)
predictions_LR$pred <- pred_lm
plot_pred_type_distribution(predictions_LR,0.30)

# choose the best threshold as 0.30


test.eval.LR = binclass_eval(test_set[, 17], pred_lm > 0.30)

# Making the Confusion Matrix


test.eval.LR$cm
## Predicted
## Actual 0 1
## 0 11373 593
## 1 709 858
# calculate accuracy, precision etc.
acc_LR=test.eval.LR$accuracy
prc_LR=test.eval.LR$precision
recall_LR=test.eval.LR$recall
fscore_LR=test.eval.LR$fscore

# print evaluation
cat("Accuracy: ", acc_LR,
"\nPrecision: ", prc_LR,
"\nRecall: ", recall_LR,
"\nFScore: ", fscore_LR)
## Accuracy: 0.9037907
## Precision: 0.5913163
## Recall: 0.5475431
## FScore: 0.5685885
# calculate ROC
rocr.pred.lr = prediction(predictions = pred_lm, labels = test_set$y)
rocr.perf.lr = performance(rocr.pred.lr, measure = "tpr", x.measure = "fpr")
rocr.auc.lr = as.numeric(performance(rocr.pred.lr, "auc")@y.values)

# print ROC AUC


rocr.auc.lr
## [1] 0.9078594
# plot ROC curve
plot(rocr.perf.lr,
lwd = 3, colorize = TRUE,
print.cutoffs.at = seq(0, 1, by = 0.1),
text.adj = c(-0.2, 1.7),
main = 'ROC Curve')
mtext(paste('Logistic Regression - auc : ', round(rocr.auc.lr, 5)))
abline(0, 1, col = "red", lty = 2)

Compare the models


The two prediction models are compared to choose the better one. As per the comparison below, Logistic Regression provides higher accuracy and AUC.

# compare Accuracy and ROC


compare <- data.frame(Method = c('Decision Tree', 'Logistic Regression'),
                      Accuracy = NA, Precision = NA, Recall = NA, FScore = NA, 'ROC AUC' = NA)
compare$Accuracy <- c(acc_DT,acc_LR)
compare$ROC.AUC <- c(rocr.auc,rocr.auc.lr)
compare$Precision <- c(prc_DT, prc_LR)
compare$Recall <- c(recall_DT, recall_LR)
compare$FScore <- c(fscore_DT, fscore_LR)
kable_styling(kable(compare),c("striped","bordered"), full_width = F)

Method               Accuracy   Precision  Recall     FScore     ROC.AUC
Decision Tree        0.9027562  0.6314136  0.3848117  0.4781919  0.7590981
Logistic Regression  0.9037907  0.5913163  0.5475431  0.5685885  0.9078594

# plot both ROC curves side by side for an easy comparison


par(mfrow=c(1,2))

# plot ROC curve for Decision Tree


plot(rocr.perf,
lwd = 3, colorize = TRUE,
print.cutoffs.at = seq(0, 1, by = 0.1),
text.adj = c(-0.2, 1.7),
main = 'ROC Curve')
mtext(paste('Decision Tree - auc : ', round(rocr.auc, 5)))
abline(0, 1, col = "red", lty = 2)

# plot ROC curve for Logistic Regression


plot(rocr.perf.lr,
lwd = 3, colorize = TRUE,
print.cutoffs.at = seq(0, 1, by = 0.1),
text.adj = c(-0.2, 1.7),
main = 'ROC Curve')
mtext(paste('Logistic Regression - auc : ', round(rocr.auc.lr, 5)))
abline(0, 1, col = "red", lty = 2)
