Bank Rpubs
Data
WQD7004/RRookie/Yong Keh Soon-WQD180065, Vikas Mann-WQD180051, L-ven
Lew Teck Wei-WQD180056, Lim Shien Long-WQD180027
14 December 2018
Basic R Setup
o R Libraries Used
o Import Dataset
Understanding Data
o Overview of Data Attributes
o Data Validation
Check for Duplicate Rows
Check for Missing Data
o Data Cleaning
Remove Rows Where All Columns Are Missing
Save Deduplicated Rows
Impute Missing Values
Final Removal of Duplicated Rows
o Verify Cleanliness
Verify Deduplication
Verify Missing Value
About The Cleaned Dataset
o Data Structure of Cleaned Dataset
o Recoding ‘yes’ to binary
o Sample Rows
Size of Dataset
Sample Observations
Data Summary
o Outcome Imbalance
Exploratory Data Analysis
o Univariate Analysis
Age Distribution
Age Distribution vs Marital Status That Subscribes Term Deposit
Age vs Subscription
Balance vs Subscription
Education vs Subscription
Subscription based on Number of Contact during Campaign
Duration
Scatterplot of Duration by Age
Scatterplot of Duration by Campaign
Scatterplot Matrix
o Split the Training / Testing data and Scale
Custom Function For Binary Class Performance Evaluation
Create a function for plotting distribution
o Model 1 - Build Model by fitting Decision Tree Classification
o Evaluate the Decision Tree Prediction model
o Model 2 - Build Model by fitting Logistic Regression
o Evaluate the Logistic Regression Prediction model
o Compare the models
Basic R Setup
R Libraries Used
Here are the R libraries used in this analysis.
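The package list itself is not shown in this section. Judging only from the functions called later in the document (ggplot, the %>% pipe with group_by and summarize, prp, prediction and performance), the set below is a plausible guess rather than the authors' confirmed list. The sketch loads whichever of these packages are installed and reports the rest.

```r
# Packages inferred from functions used later in this document.
# This list is an assumption, not the authors' original setup chunk.
pkgs <- c("ggplot2", "dplyr", "rpart", "rpart.plot", "ROCR")

# requireNamespace() returns TRUE/FALSE without attaching the package
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!installed]

if (length(missing_pkgs) > 0) {
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
}

# Attach only the packages that are available
for (p in pkgs[installed]) {
  library(p, character.only = TRUE)
}
```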
Import Dataset
Import CSV into dataframe.
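# stringsAsFactors = TRUE keeps text columns as factors (needed for levels() below on R >= 4.0)
banks = read.table('data/bank-full.csv', sep = ',', header = TRUE, stringsAsFactors = TRUE)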
Understanding Data
Overview of Data Attributes
Attribute Information:
1. Age (numeric)
2. Job: type of job (categorical)
3. Marital: marital status (categorical)
4. Education (categorical: ‘primary’, ‘secondary’, ‘tertiary’, ‘unknown’)
5. Default: has credit in default? (binary)
6. Balance: average yearly balance, in euros (numeric; can be negative)
7. Housing: has housing loan? (binary)
8. Loan: has personal loan? (binary)
9. Contact: contact communication type (categorical: ‘cellular’, ‘telephone’, ‘unknown’)
10. Day: last contact day of the month (numeric)
11. Month: last contact month (categorical)
12. Duration: last contact duration, in seconds (numeric)
13. campaign: number of contacts performed during this campaign for this client (numeric, includes last contact)
14. pdays: number of days since the client was last contacted in a previous campaign (numeric; -1 means the client was not previously contacted)
15. previous: number of contacts performed before this campaign for this client (numeric)
16. poutcome: outcome of the previous marketing campaign (categorical: ‘failure’, ‘other’, ‘success’, ‘unknown’)
17. y: has the client subscribed to a term deposit? (binary: ‘yes’, ‘no’)
Statistical Summary
summary(banks)
## age job marital education
## Min. :18.00 blue-collar:9844 divorced: 5315 primary : 6966
## 1st Qu.:33.00 management :9632 married :27579 secondary:23539
## Median :39.00 technician :7720 single :13027 tertiary :13525
## Mean :40.95 admin. :5252 NA's : 646 unknown : 1879
## 3rd Qu.:48.00 services :4210 NA's : 658
## Max. :95.00 (Other) :9260
## NA's :641 NA's : 649
## default balance housing loan contact
## no :45125 Min. : -8019 no :20400 no :38591 cellular :29749
## yes : 825 1st Qu.: 73 yes :25489 yes : 7324 telephone: 2953
## NA's: 617 Median : 448 NA's: 678 NA's: 652 unknown :13215
## Mean : 1363 NA's : 650
## 3rd Qu.: 1426
## Max. :102127
## NA's :658
## day month duration campaign
## Min. : 1.0 may :14017 Min. : 0.0 Min. : 1.000
## 1st Qu.: 8.0 jul : 6989 1st Qu.: 103.0 1st Qu.: 1.000
## Median :16.0 aug : 6359 Median : 180.0 Median : 2.000
## Mean :15.8 jun : 5426 Mean : 258.1 Mean : 2.767
## 3rd Qu.:21.0 nov : 4038 3rd Qu.: 319.0 3rd Qu.: 3.000
## Max. :31.0 (Other): 9086 Max. :4918.0 Max. :63.000
## NA's :632 NA's : 652 NA's :648 NA's :657
## pdays previous poutcome y
## Min. : -1.00 Min. : 0.0000 failure: 4994 no :40547
## 1st Qu.: -1.00 1st Qu.: 0.0000 other : 1871 yes : 5385
## Median : -1.00 Median : 0.0000 success: 1538 NA's: 635
## Mean : 40.31 Mean : 0.5795 unknown:37551
## 3rd Qu.: -1.00 3rd Qu.: 0.0000 NA's : 613
## Max. :871.00 Max. :275.0000
## NA's :654 NA's :649
Data Validation
Check for Duplicate Rows
sum(duplicated(banks))
## [1] 1450
Check for Missing Data
sum(!complete.cases(banks))
## [1] 1259
all.empty = rowSums(is.na(banks))==ncol(banks)
sum(all.empty)
## [1] 61
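The three checks above count different things: rows that repeat an earlier row, rows with at least one missing value, and rows where every column is missing. A minimal sketch on a made-up data frame (the values are hypothetical and chosen only to trigger each case) shows how the counts differ.

```r
# Toy data frame (hypothetical values): row 2 duplicates row 1,
# rows 3 and 4 are partly missing, row 5 is entirely missing.
toy <- data.frame(
  age     = c(30, 30, 45, NA, NA),
  balance = c(100, 100, NA, 250, NA)
)

n_dup        <- sum(duplicated(toy))               # rows identical to an earlier row
n_incomplete <- sum(!complete.cases(toy))          # rows with at least one NA
all_empty    <- rowSums(is.na(toy)) == ncol(toy)   # TRUE where every column is NA
n_all_empty  <- sum(all_empty)

c(duplicates = n_dup, incomplete = n_incomplete, all_empty = n_all_empty)
```

Rows that are entirely empty carry no information and can be dropped outright, while partly missing rows can still be kept and imputed.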
Data Cleaning
Remove Rows Where All Columns Are Missing
banks.clean = banks[!all.empty,]
nrow(banks.clean)
## [1] 45116
banks.clean$missing = !complete.cases(banks.clean)
Replace with Most Frequent Value (Mode)
hist(banks.clean$balance)
hist(banks.clean$pdays)
# sort(-table(x)) orders values by descending frequency; the first name is the mode
banks.clean$pdays[is.na(banks.clean$pdays)] = as.numeric(names(sort(-table(banks$pdays)))[1])
banks.clean$balance[is.na(banks.clean$balance)] = as.numeric(names(sort(-table(banks$balance)))[1])
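The two assignments above fill missing values with the most frequent observed value of each column. The same idea can be written as a small reusable helper; mode_value and impute_mode are hypothetical names introduced here for illustration, and the toy vector is made up.

```r
# Hypothetical helper: most frequent non-missing value of a numeric vector.
# table() drops NA by default, so missing entries do not affect the count.
mode_value <- function(x) {
  counts <- table(x)
  as.numeric(names(counts)[which.max(counts)])
}

# Hypothetical helper: replace every NA with the mode of the observed values.
impute_mode <- function(x) {
  x[is.na(x)] <- mode_value(x)
  x
}

pdays_toy <- c(-1, -1, 40, NA, -1, 180, NA)
pdays_filled <- impute_mode(pdays_toy)
pdays_filled
```

For a column such as pdays, where one value dominates, the mode is a natural fill value. For a skewed continuous column such as balance, the median is a common alternative.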
Verify Cleanliness
Verify Deduplication
Number of Rows Before Dedup
nrow(banks)
## [1] 46567
nrow(banks.clean)
## [1] 45112
sum(duplicated(banks.clean))
## [1] 0
levels(banks.clean$job)
## [1] "admin." "blue-collar" "entrepreneur" "housemaid"
## [5] "management" "retired" "self-employed" "services"
## [9] "student" "technician" "unemployed" "unknown"
## [13] "missing"
levels(banks.clean$marital)
## [1] "divorced" "married" "single" "missing"
levels(banks.clean$education)
## [1] "primary" "secondary" "tertiary" "unknown" "missing"
levels(banks.clean$default)
## [1] "no" "yes" "missing"
levels(banks.clean$loan)
## [1] "no" "yes" "missing"
levels(banks.clean$contact)
## [1] "cellular" "telephone" "unknown" "missing"
levels(banks.clean$poutcome)
## [1] "failure" "other" "success" "unknown" "missing"
levels(banks.clean$y)
## [1] "no" "yes" "missing"
levels(banks.clean$housing)
## [1] "no" "yes" "missing"
levels(banks.clean$month)
## [1] "apr" "aug" "dec" "feb" "jan" "jul" "jun"
## [8] "mar" "may" "nov" "oct" "sep" "missing"
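Each factor above ends with an extra "missing" level, which suggests that NA entries in the categorical columns were recoded into an explicit category. The code for that step is not shown; one way it could be done in base R is sketched below, with na_to_level as a hypothetical helper name and a made-up input vector.

```r
# Hypothetical helper: turn NA entries of a factor into an explicit level.
na_to_level <- function(f, label = "missing") {
  f <- factor(f)
  levels(f) <- c(levels(f), label)   # append the new level after the existing ones
  f[is.na(f)] <- label
  f
}

marital_toy <- factor(c("married", NA, "single", "divorced", NA))
marital_filled <- na_to_level(marital_toy)
levels(marital_filled)
table(marital_filled)
```

Appending the level last reproduces the ordering seen in the output above, where "missing" follows the original levels.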
sum(banks.clean$missing)
## [1] 1121
summary(banks.clean)
## age job marital education
## Min. :18.00 blue-collar:9595 divorced: 5139 primary : 6758
## 1st Qu.:33.00 management :9322 married :26815 secondary:22855
## Median :39.00 technician :7485 single :12629 tertiary :13123
## Mean :40.94 admin. :5094 missing : 529 unknown : 1829
## 3rd Qu.:48.00 services :4096 missing : 547
## Max. :95.00 retired :2240
## (Other) :7280
## default balance housing loan
## no :43797 Min. : -8019 no :19803 no :37439
## yes : 809 1st Qu.: 62 yes :24746 yes : 7133
## missing: 506 Median : 436 missing: 563 missing: 540
## Mean : 1347
## 3rd Qu.: 1407
## Max. :102127
##
## contact day month duration
## cellular :28876 Min. : 1.0 may :13585 Min. : 0.0
## telephone: 2866 1st Qu.: 8.0 jul : 6785 1st Qu.: 104.0
## unknown :12836 Median :16.0 aug : 6161 Median : 182.0
## missing : 534 Mean :15.8 jun : 5268 Mean : 258.2
## 3rd Qu.:21.0 nov : 3920 3rd Qu.: 317.0
## Max. :31.0 apr : 2884 Max. :4918.0
## (Other): 6509
## campaign pdays previous poutcome
## Min. : 1.000 Min. : -1.00 Min. : 0.0000 failure: 4847
## 1st Qu.: 1.000 1st Qu.: -1.00 1st Qu.: 0.0000 other : 1813
## Median : 2.000 Median : -1.00 Median : 0.0000 success: 1486
## Mean : 2.766 Mean : 39.69 Mean : 0.5794 unknown:36461
## 3rd Qu.: 3.000 3rd Qu.: -1.00 3rd Qu.: 0.0000 missing: 505
## Max. :63.000 Max. :871.00 Max. :275.0000
##
## y missing
## no :39367 Mode :logical
## yes : 5224 FALSE:43991
## missing: 521 TRUE :1121
##
##
##
##
banks = banks.clean
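In the data summary that follows, y appears as a numeric column with mean 0.1158, which equals 5224 / 45112, the count of ‘yes’ rows over all rows. That is consistent with a recoding in which ‘yes’ becomes 1 and every other value, including ‘missing’, becomes 0. The recoding code is not shown, so the following is only a minimal sketch of such a step on a made-up vector.

```r
# Toy outcome vector (hypothetical values) with the three levels seen above
y_toy <- factor(c("no", "yes", "missing", "no", "yes"))

# 'yes' -> 1, anything else (including 'missing') -> 0
y_bin <- ifelse(y_toy == "yes", 1, 0)
y_bin

# Cross-check against the counts reported in this document
reported_rate <- 5224 / 45112
round(reported_rate, 4)
```

Mapping ‘missing’ to 0 treats an unknown outcome as a non-subscription; dropping those rows instead would be the other obvious choice.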
Sample Rows
Size of Dataset
nrow(banks)
## [1] 45112
ncol(banks)
## [1] 18
Sample Observations
head(banks)
Data Summary
summary(banks)
## age job marital education
## Min. :18.00 blue-collar:9595 divorced: 5139 primary : 6758
## 1st Qu.:33.00 management :9322 married :26815 secondary:22855
## Median :39.00 technician :7485 single :12629 tertiary :13123
## Mean :40.94 admin. :5094 missing : 529 unknown : 1829
## 3rd Qu.:48.00 services :4096 missing : 547
## Max. :95.00 retired :2240
## (Other) :7280
## default balance housing loan
## no :43797 Min. : -8019 no :19803 no :37439
## yes : 809 1st Qu.: 62 yes :24746 yes : 7133
## missing: 506 Median : 436 missing: 563 missing: 540
## Mean : 1347
## 3rd Qu.: 1407
## Max. :102127
##
## contact day month duration
## cellular :28876 Min. : 1.0 may :13585 Min. : 0.0
## telephone: 2866 1st Qu.: 8.0 jul : 6785 1st Qu.: 104.0
## unknown :12836 Median :16.0 aug : 6161 Median : 182.0
## missing : 534 Mean :15.8 jun : 5268 Mean : 258.2
## 3rd Qu.:21.0 nov : 3920 3rd Qu.: 317.0
## Max. :31.0 apr : 2884 Max. :4918.0
## (Other): 6509
## campaign pdays previous poutcome
## Min. : 1.000 Min. : -1.00 Min. : 0.0000 failure: 4847
## 1st Qu.: 1.000 1st Qu.: -1.00 1st Qu.: 0.0000 other : 1813
## Median : 2.000 Median : -1.00 Median : 0.0000 success: 1486
## Mean : 2.766 Mean : 39.69 Mean : 0.5794 unknown:36461
## 3rd Qu.: 3.000 3rd Qu.: -1.00 3rd Qu.: 0.0000 missing: 505
## Max. :63.000 Max. :871.00 Max. :275.0000
##
## y missing
## Min. :0.0000 Mode :logical
## 1st Qu.:0.0000 FALSE:43991
## Median :0.0000 TRUE :1121
## Mean :0.1158
## 3rd Qu.:0.0000
## Max. :1.0000
##
Outcome Imbalance
The outcome variable (y) is heavily imbalanced: ‘no’ (coded 0) accounts for over 88% of observations.
prop.table(table(banks$y))
##
## 0 1
## 0.8841993 0.1158007
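An imbalance of this size matters when reading accuracy figures. A rule that always predicts ‘no’ would already be right about 88% of the time while finding no subscribers at all. The sketch below rebuilds an outcome vector from the counts reported in this document (5224 ‘yes’ out of 45112 rows) and scores that trivial rule. It assumes the class proportions of the full dataset, which may differ slightly from any particular test split.

```r
# Rebuild the outcome vector from the reported counts
n_total <- 45112
n_yes   <- 5224
y <- rep(c(0, 1), times = c(n_total - n_yes, n_yes))

# Trivial rule: always predict the majority class (0 = 'no')
pred <- rep(0, length(y))

baseline_accuracy <- mean(pred == y)
baseline_recall   <- sum(pred == 1 & y == 1) / sum(y == 1)

c(accuracy = baseline_accuracy, recall = baseline_recall)
```

Precision, recall and F-score are therefore more informative than accuracy alone for this problem.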
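Exploratory Data Analysis
Univariate Analysis
Age Distribution
The boxplot of age conveys essentially the same statistics as the summary below, and also shows outliers at the upper end of the age range. With the default 1.5 × IQR whisker rule and the quartiles reported below (Q1 = 33, Q3 = 48), ages above 70.5 are drawn as outliers.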
summary(banks$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 18.00 33.00 39.00 40.94 48.00 95.00
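The quartiles printed above are enough to work out where a boxplot starts drawing points as outliers under the 1.5 × IQR whisker rule, which is the geom_boxplot default. A small sketch of that arithmetic, using only the values from summary(banks$age):

```r
# Quartiles and range taken from the summary output above
q1 <- 33
q3 <- 48
age_min <- 18
age_max <- 95

iqr <- q3 - q1                   # interquartile range
upper_fence <- q3 + 1.5 * iqr    # values above this are plotted as outliers
lower_fence <- q1 - 1.5 * iqr    # values below this are plotted as outliers

c(iqr = iqr, lower_fence = lower_fence, upper_fence = upper_fence)

# The minimum age lies above the lower fence, so outliers can only be high ages
has_low_outliers  <- age_min < lower_fence
has_high_outliers <- age_max > upper_fence
```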
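# Base plot object shared by both charts
gg = ggplot(banks)
# Histogram of age in 5-year bins with a red vertical line at the mean
p1 = gg + geom_histogram(aes(x = age), color = "black", fill = "white", binwidth = 5) +
  ggtitle('Age Distribution (red mean line)') +
  ylab('Count') +
  xlab('Age') +
  # color is set outside aes() so the line is drawn in literal red
  geom_vline(aes(xintercept = mean(age)), color = "red") +
  scale_x_continuous(breaks = seq(0, 100, 5)) +
  theme(legend.position = "none")
# Boxplot of age
p2 = gg + geom_boxplot(aes(x = '', y = age)) +
  ggtitle('Age Boxplot') +
  ylab('Age')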
p3
Age vs Subscription
Most clients who subscribe are between 25 and 45 years old. The mean age across all clients is just under 41.
Balance vs Subscription
Clients who subscribe to term deposits have lower balances.
Education vs Subscription
Higher education is associated with a higher rate of term deposit subscription. Most clients who subscribe have ‘secondary’ or ‘tertiary’ education. Tertiary-educated clients have the highest subscription rate, about 15% of those contacted.
banks.clean %>%
group_by(education) %>%
summarize(pct.yes = mean(y=="yes")*100) %>%
arrange(desc(pct.yes))
education (<fctr>), in descending order of pct.yes: tertiary, unknown, secondary, primary, missing (5 rows)
Subscription based on Number of Contact during Campaign
banks.clean %>%
group_by(campaign) %>%
summarize(contact.cnt = n(), pct.con.yes = mean(y=="yes")*100) %>%
arrange(desc(contact.cnt)) %>%
head()
campaign contact.cnt
<dbl> <int>
1 17302
2 12325
3 5439
4 3472
5 1742
6 1274
6 rows
Duration
range(banks$duration)
## [1] 0 4918
summary(banks$duration)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 104.0 182.0 258.2 317.0 4918.0
banks %>% select(duration) %>% arrange(desc(duration)) %>% head
mu2 <- banks %>% group_by(y) %>% summarize(grp2.mean=mean(duration))
banks %>%
ggplot(aes(age, duration)) +
geom_point() +
facet_grid(cols = vars(y)) +
scale_x_continuous(breaks = seq(0,100,10)) +
ggtitle("Scatterplot of Duration vs Age for Subscription of Term Deposit")
Scatterplot Matrix
Because of the large number of attributes (17 in total), 8 were chosen for the scatterplot matrix. No clear correlation pattern can be observed, as most attributes are categorical.
Split the Training / Testing data and Scale
# scale
training_set[c(1,6,10,12,13)] = scale(training_set[c(1,6,10,12,13)])
test_set[c(1,6,10,12,13)] = scale(test_set[c(1,6,10,12,13)])
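The two lines above standardize the training and test sets separately, each with its own mean and standard deviation. A common alternative is to estimate the centering and scaling values on the training set only and reuse them for the test set, so that both sets are expressed on the same scale. The sketch below illustrates that variant on made-up data frames; train_toy, test_toy and num_cols are hypothetical names, and this is not the procedure used for the results reported in this document.

```r
# Toy training and test sets (hypothetical values)
train_toy <- data.frame(age = c(25, 35, 45, 55), balance = c(0, 100, 200, 700))
test_toy  <- data.frame(age = c(30, 60),         balance = c(50, 400))
num_cols  <- c("age", "balance")

# Fit the scaling on the training data; scale() stores the parameters as attributes
train_scaled <- scale(train_toy[num_cols])
ctr <- attr(train_scaled, "scaled:center")   # training means
scl <- attr(train_scaled, "scaled:scale")    # training standard deviations

# Apply the training parameters to the test data
test_scaled <- scale(test_toy[num_cols], center = ctr, scale = scl)

train_scaled
test_scaled
```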
df$pred_type <- v
Model 1 - Build Model by fitting Decision Tree Classification
# plot
prp(classifier, type = 2, extra = 104, fallen.leaves = TRUE, main = "Decision Tree")
Evaluate the Decision Tree Prediction model
# print evaluation
cat("Accuracy: ", acc_DT,
"\nPrecision: ", prc_DT,
"\nRecall: ", recall_DT,
"\nFScore: ", fscore_DT)
## Accuracy: 0.9027562
## Precision: 0.6314136
## Recall: 0.3848117
## FScore: 0.4781919
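The function that produced these four numbers is not shown. The reported F-score does match the harmonic mean of the reported precision and recall, so a standard confusion-matrix computation fits the output. The sketch below is one such computation on made-up labels; binary_metrics is a hypothetical name, not the authors' function.

```r
# Hypothetical helper: standard binary classification metrics for 0/1 vectors
binary_metrics <- function(actual, predicted) {
  tp <- sum(predicted == 1 & actual == 1)   # true positives
  fp <- sum(predicted == 1 & actual == 0)   # false positives
  fn <- sum(predicted == 0 & actual == 1)   # false negatives
  tn <- sum(predicted == 0 & actual == 0)   # true negatives
  precision <- tp / (tp + fp)
  recall    <- tp / (tp + fn)
  c(accuracy  = (tp + tn) / length(actual),
    precision = precision,
    recall    = recall,
    fscore    = 2 * precision * recall / (precision + recall))
}

# Toy labels (hypothetical): 3 positives, 5 negatives
actual_toy    <- c(1, 1, 1, 0, 0, 0, 0, 0)
predicted_toy <- c(1, 0, 0, 1, 0, 0, 0, 0)
m <- binary_metrics(actual_toy, predicted_toy)
m

# Consistency check on the decision tree figures reported above
f_check_dt <- 2 * 0.6314136 * 0.3848117 / (0.6314136 + 0.3848117)
round(f_check_dt, 4)
```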
# calculate ROC curve
rocr.pred = prediction(predictions = pred.DT[,2], labels = test_set$y)
rocr.perf = performance(rocr.pred, measure = "tpr", x.measure = "fpr")
rocr.auc = as.numeric(performance(rocr.pred, "auc")@y.values)
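The AUC extracted above can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. That interpretation gives a way to compute it in base R from ranks alone, which can serve as a cross-check on the ROCR value. auc_rank is a hypothetical helper and the scores are made up.

```r
# Hypothetical helper: AUC via the rank-sum (Mann-Whitney) formulation
auc_rank <- function(scores, labels) {
  r     <- rank(scores)          # average ranks, so ties count as half
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy scores (hypothetical): one positive is ranked below one negative
scores_toy <- c(0.9, 0.8, 0.35, 0.6, 0.2, 0.1)
labels_toy <- c(1,   1,   1,    0,   0,   0)

auc_toy <- auc_rank(scores_toy, labels_toy)
auc_toy
```

In the toy data, 8 of the 9 positive-negative pairs are ordered correctly, so the result is 8/9.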
Evaluate the Logistic Regression Prediction model
# print evaluation
cat("Accuracy: ", acc_LR,
"\nPrecision: ", prc_LR,
"\nRecall: ", recall_LR,
"\nFScore: ", fscore_LR)
## Accuracy: 0.9037907
## Precision: 0.5913163
## Recall: 0.5475431
## FScore: 0.5685885
# calculate ROC
rocr.pred.lr = prediction(predictions = pred_lm, labels = test_set$y)
rocr.perf.lr = performance(rocr.pred.lr, measure = "tpr", x.measure = "fpr")
rocr.auc.lr = as.numeric(performance(rocr.pred.lr, "auc")@y.values)
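To compare the two models, the metrics printed earlier can be placed side by side. The sketch below only tabulates the figures already reported in this document; the AUC values are not included because their printed output is not shown.

```r
# Metrics copied from the evaluation output reported above
comparison <- data.frame(
  model     = c("Decision Tree", "Logistic Regression"),
  accuracy  = c(0.9027562, 0.9037907),
  precision = c(0.6314136, 0.5913163),
  recall    = c(0.3848117, 0.5475431),
  fscore    = c(0.4781919, 0.5685885),
  stringsAsFactors = FALSE
)
comparison

# Model with the higher value for each metric
best_by_metric <- sapply(comparison[-1], function(v) comparison$model[which.max(v)])
best_by_metric
```

Accuracy is nearly identical for the two models (about 0.90 each). The decision tree has the higher precision, while logistic regression has the higher recall and F-score, which matters more here given how few clients subscribe.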