a. Fit models to the data for (1) k-nearest neighbors with k = 3, (2) Naive
Bayes, and (3) classification trees. Use Personal Loan as the outcome
variable. Report the validation confusion matrix for each of the three
models. (10 points)
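The code below assumes train.df and valid.df already exist. A minimal sketch of how they could be built, assuming a 60/40 partition of the Universal Bank data (the file name, the dropped columns, and the seed are assumptions, not taken from the original):

```r
# Load the Universal Bank data (file name is an assumption)
bank.df <- read.csv("UniversalBank.csv")
bank.df <- bank.df[, -c(1, 5)]  # drop ID and ZIP Code columns

# 60/40 training/validation partition (2000 validation records,
# consistent with the /2000 error-rate denominators used later)
set.seed(1)
train.index <- sample(rownames(bank.df), 0.6 * nrow(bank.df))
valid.index <- setdiff(rownames(bank.df), train.index)
train.df <- bank.df[train.index, ]
valid.df <- bank.df[valid.index, ]
```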
install.packages("FNN")
install.packages("caret")
library(FNN)
library(caret)
# k-NN with k = 3: columns 1-11 are the predictors, column 12 is Personal.Loan
knn.pred.valid <- knn(train = train.df[, 1:11], test = valid.df[, 1:11],
                      cl = train.df[, 12], k = 3)
confusionMatrix(knn.pred.valid, as.factor(valid.df[, 12]))
library(e1071)
# Naive Bayes model
nb <- naiveBayes(Personal.Loan ~ ., data = train.df)
nb
# training confusion matrix
pred.class <- predict(nb, newdata = train.df)
confusionMatrix(pred.class, as.factor(train.df$Personal.Loan))
# validation confusion matrix
pred.class2 <- predict(nb, newdata = valid.df)
confusionMatrix(pred.class2, as.factor(valid.df$Personal.Loan))
# classification tree
library(rpart)
library(rpart.plot)
default.ct <- rpart(Personal.Loan ~ ., data = train.df, method = "class")
# plot tree
prp(default.ct, type = 1, extra = 1, under = TRUE, split.font = 1, varlen = -10)
# validation confusion matrix for the tree
ct.pred.valid <- predict(default.ct, valid.df, type = "class")
confusionMatrix(ct.pred.valid, as.factor(valid.df$Personal.Loan))
In this case, the classification tree shows the best accuracy, with a value of 0.9835.
b. Create a data frame with the actual outcome, predicted outcome, and
probability of being a "1" for each of the three models. Report the first 10
rows of this data frame. (5 points)
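No code is shown for this step. A minimal sketch of such a data frame (the column names are assumptions, and the original df3 evidently uses a different column layout, since the part-c loop indexes columns 2, 4, 5, 7, 8 and 9):

```r
# Actual outcome, predicted class, and P(Personal.Loan = 1) per model
knn.out <- knn(train = train.df[, 1:11], test = valid.df[, 1:11],
               cl = train.df[, 12], k = 3, prob = TRUE)
ens.df <- data.frame(
  actual    = valid.df$Personal.Loan,
  knn.class = knn.out,
  knn.prob  = attr(knn.out, "prob"),  # note: vote share of the winning class
  nb.class  = predict(nb, newdata = valid.df),
  nb.prob   = predict(nb, newdata = valid.df, type = "raw")[, 2],
  ct.class  = predict(default.ct, valid.df, type = "class"),
  ct.prob   = predict(default.ct, valid.df, type = "prob")[, 2]
)
head(ens.df, 10)
```

One caveat on the design: FNN::knn returns, via the "prob" attribute, the proportion of votes for the *winning* class, so rows predicted 0 need the complement (1 - prob) to represent the probability of being a "1".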
c. Add two columns to this data frame for (1) a majority vote of predicted
outcomes, and (2) the average of the predicted probabilities. Using the
classifications generated by these two methods, derive a confusion matrix
for each method and report the overall accuracy. (10 points)
for (i in 1:nrow(df3)) {  # all 2000 validation rows (the original 1:200 appears to be a typo)
  # average of the three predicted probabilities; the parentheses are needed,
  # otherwise only the last term is divided by 3
  df3[i, 11] <- (as.numeric(df3[i, 4]) + as.numeric(df3[i, 7]) + as.numeric(df3[i, 9])) / 3
  # majority vote: the factors are coded 1/2, so subtracting 1 gives 0/1;
  # a sum greater than 1 means at least two models predicted "1"
  if ((as.numeric(df3[i, 2]) - 1) + (as.numeric(df3[i, 5]) - 1) + (as.numeric(df3[i, 8]) - 1) > 1) {
    df3[i, 10] <- 1
  } else {
    df3[i, 10] <- 0
  }
}
head(df3, 10)
confusionMatrix(as.factor(df3$majority_class), as.factor(valid.df$Personal.Loan))
confusionMatrix(as.factor(ifelse(df3$ave_prob > 0.5, 1, 0)), as.factor(valid.df$Personal.Loan))
d. Compare the error rates for the three individual methods and the two
ensemble methods. (5 points)
# error rate = misclassified counts / validation size (2000)
erknn  <- (71 + 108) / 2000   # k-NN
ernb   <- (157 + 74) / 2000   # Naive Bayes
erct   <- (23 + 10) / 2000    # classification tree
ervote <- (3 + 177) / 2000    # majority-vote ensemble
erprob <- (20 + 173) / 2000   # average-probability ensemble
erknn
ernb
erct
ervote
erprob
The classification tree has the lowest error rate (0.0165). The k-NN model (0.0895), the majority vote (0.0900), and the average probability (0.0965) are close to one another, while Naive Bayes has the highest error rate (0.1155).
PART B (5 points)
Use Bagging and Boosted Trees and compare their performance with all the
methodologies we used in PART A.
# bagging
library(adabag)
bank.df$Personal.Loan <- as.factor(bank.df$Personal.Loan)
train.df$Personal.Loan <- as.factor(train.df$Personal.Loan)
bag <- bagging(Personal.Loan ~ ., data = train.df)
pred <- predict(bag, valid.df, type = "class")
confusionMatrix(as.factor(pred$class), as.factor(valid.df$Personal.Loan))
# boosting
boost <- boosting(Personal.Loan ~ ., data = train.df)
pred1 <- predict(boost, valid.df, type = "class")
confusionMatrix(as.factor(pred1$class), as.factor(valid.df$Personal.Loan))
In this case, we can see that bagging and boosting achieve higher accuracy than the previous
models, with boosting having the highest accuracy.
(5 points)
We are interested in predicting the loan acceptance behavior of three new
customers with the following profiles. Use each of the developed methods to
classify these three new customers.
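No k-NN code is shown for this step. A minimal sketch, assuming the three profiles are stored in a data frame new.df with the same 11 predictor columns as the training data (the name new.df matches its later use with the Naive Bayes model):

```r
# k-NN classification (k = 3) of the three new customers
new.pred <- knn(train = train.df[, 1:11], test = new.df[, 1:11],
                cl = train.df[, 12], k = 3)
new.pred
```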
For k-NN, the model predicts that the first and second new customers will not accept the personal
loan and that the third will.
pred.class <- predict(nb, newdata = new.df)
Naive Bayes and the classification tree give the same classifications.
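The corresponding classification-tree prediction can be sketched the same way, again assuming new.df holds the three profiles:

```r
# classification-tree prediction for the new customers
predict(default.ct, new.df, type = "class")
```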
All the models lead to the same conclusion: customers 1 and 2 will not accept the personal loan,
while customer 3 will accept the bank's offer.