
Telecom Customer Churn Prediction Assessment

Index
1. Objective
2. EDA (Exploratory Data Analysis)
3. Logistic Regression
4. K-Nearest Neighbour Classifier
5. Naive Bayes Classifier
6. Conclusion

1. Objective:
In this project, we simulate one such case of customer churn, working with data on
postpaid customers who are on a contract. The data contains information about each
customer's usage behavior, contract details and payment details, and also indicates
which customers cancelled their service. Based on this historical data, we need to
build a model that can predict whether a customer will cancel their service in the
future or not.

Importing Libraries
Packages required and loaded:

toload_libraries <- c("lattice", "rpart", "rpart.plot", "caret", "randomForest",
                      "scatterplot3d", "e1071", "reshape2", "rpsychi", "car", "psych",
                      "corrplot", "forecast", "GPArotation", "psy", "MVN",
                      "DataExplorer", "ppcor", "Metrics", "foreign", "MASS",
                      "nortest", "Hmisc", "factoextra", "nFactors")
# install any packages that are not already present, then load them all
new.packages <- toload_libraries[!(toload_libraries %in% installed.packages()[, "Package"])]
if (length(new.packages)) install.packages(new.packages)
lapply(toload_libraries, require, character.only = TRUE)

Reading the data

setwd("C:/Users/itadmin/Desktop/Greatlakes/R/datasets")
library(readxl)
churn_data <- read_excel("cellphone.xlsx", sheet = "Data")
2. EDA (Exploratory Data Analysis):

# total number of rows and columns
> dim(churn_data)
[1] 3333 11

Checking for missing values – no missing values:

> sum(is.na(churn_data))
[1] 0

> any(is.na(churn_data))
[1] FALSE

> sapply(churn_data, function(x) sum(is.na(x)))
          Churn    AccountWeeks ContractRenewal        DataPlan       DataUsage
              0               0               0               0               0
  CustServCalls         DayMins        DayCalls   MonthlyCharge      OverageFee
              0               0               0               0               0
       RoamMins
              0

str(churn_data)
tibble [3,333 x 11] (S3: tbl_df/tbl/data.frame)
$ Churn : num [1:3333] 0 0 0 0 0 0 0 0 0 0 ...
$ AccountWeeks : num [1:3333] 128 107 137 84 75 118 121 147 117 141 ...
$ ContractRenewal: num [1:3333] 1 1 1 0 0 0 1 0 1 0 ...
$ DataPlan : num [1:3333] 1 1 0 0 0 0 1 0 0 1 ...
$ DataUsage : num [1:3333] 2.7 3.7 0 0 0 0 2.03 0 0.19 3.02 ...
$ CustServCalls : num [1:3333] 1 1 0 2 3 0 3 0 1 0 ...
$ DayMins : num [1:3333] 265 162 243 299 167 ...
$ DayCalls : num [1:3333] 110 123 114 71 113 98 88 79 97 84 ...
$ MonthlyCharge : num [1:3333] 89 82 52 57 41 57 87.3 36 63.9 93.2 ...
$ OverageFee : num [1:3333] 9.87 9.78 6.06 3.1 7.42 ...
$ RoamMins : num [1:3333] 10 13.7 12.2 6.6 10.1 6.3 7.5 7.1 8.7 11.2 ...
Converting Churn, ContractRenewal and DataPlan to factors, as they only take the
values 0 or 1:

col=c("ContractRenewal","DataPlan","Churn")
> churn_data[,col]=lapply(churn_data[,col],factor)
> str(churn_data)
tibble [3,333 x 11] (S3: tbl_df/tbl/data.frame)
$ Churn : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ AccountWeeks : num [1:3333] 128 107 137 84 75 118 121 147 117 141 ...
$ ContractRenewal: Factor w/ 2 levels "0","1": 2 2 2 1 1 1 2 1 2 1 ...
$ DataPlan : Factor w/ 2 levels "0","1": 2 2 1 1 1 1 2 1 1 2 ...
$ DataUsage : num [1:3333] 2.7 3.7 0 0 0 0 2.03 0 0.19 3.02 ...
$ CustServCalls : num [1:3333] 1 1 0 2 3 0 3 0 1 0 ...
$ DayMins : num [1:3333] 265 162 243 299 167 ...
$ DayCalls : num [1:3333] 110 123 114 71 113 98 88 79 97 84 ...
$ MonthlyCharge : num [1:3333] 89 82 52 57 41 57 87.3 36 63.9 93.2 ...
$ OverageFee : num [1:3333] 9.87 9.78 6.06 3.1 7.42 ...
$ RoamMins : num [1:3333] 10 13.7 12.2 6.6 10.1 6.3 7.5 7.1 8.7 11.2 ...

Univariate Analysis

> table(churn_data$Churn)

   0    1
2850  483

> prop.table(table(churn_data$Churn))

        0         1
0.8550855 0.1449145
> summary(churn_data)
 Churn    AccountWeeks   ContractRenewal DataPlan  DataUsage       CustServCalls   DayMins      
 0:2850   Min.   :  1.0  0: 323          0:2411    Min.   :0.0000  Min.   :0.000   Min.   :  0.0
 1: 483   1st Qu.: 74.0  1:3010          1: 922    1st Qu.:0.0000  1st Qu.:1.000   1st Qu.:143.7
          Median :101.0                            Median :0.0000  Median :1.000   Median :179.4
          Mean   :101.1                            Mean   :0.8165  Mean   :1.563   Mean   :179.8
          3rd Qu.:127.0                            3rd Qu.:1.7800  3rd Qu.:2.000   3rd Qu.:216.4
          Max.   :243.0                            Max.   :5.4000  Max.   :9.000   Max.   :350.8
 MonthlyCharge    OverageFee      RoamMins     
 Min.   : 14.00  Min.   : 0.00  Min.   : 0.00
 1st Qu.: 45.00  1st Qu.: 8.33  1st Qu.: 8.50
 Median : 53.50  Median :10.07  Median :10.30
 Mean   : 56.31  Mean   :10.05  Mean   :10.24
 3rd Qu.: 66.20  3rd Qu.:11.77  3rd Qu.:12.10
 Max.   :111.30  Max.   :18.19  Max.   :20.00


### Introductory plots of the dataset

plot_intro(churn_data)
plot_histogram(churn_data, geom_histogram_args = list(fill = "blue"),
               theme_config = list(axis.line = element_line(size = 1, colour = "green"),
                                   strip.background = element_rect(color = "red", fill = "yellow")))
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

> plot_density(churn_data, geom_density_args = list(fill = "gold", alpha = 0.4))

Outlier check with respect to Churn:

> plot_boxplot(churn_data, by = "Churn", geom_boxplot_args = list("outlier.color" = "red", fill = "blue"))


Observations:
1. The response variable Churn is an imbalanced class, with 2850 customers labelled
No ('0') and 483 labelled Yes ('1'); 483/3333 have churned, a churn rate of 14.5%.
2. DataUsage has many outliers among the churners (class 1).
3. DayMins and MonthlyCharge have many outliers in the non-churner category
(class 0); a quick count of these outliers is shown below.
4. There are no missing values in the data.
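
As a rough way to quantify the boxplot observations above (an added helper, not part
of the original run; it counts overall outliers per column by the 1.5 x IQR rule,
not split by Churn):

num_cols <- sapply(churn_data, is.numeric)
sapply(churn_data[, num_cols], function(x) {
  q   <- quantile(x, c(0.25, 0.75))                  # 1st and 3rd quartiles
  iqr <- diff(q)
  sum(x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr)   # standard boxplot outlier rule
})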

Bivariate Analysis

library(ggplot2)     # plotting
library(gridExtra)   # grid.arrange, needed to lay out the four plots

p1 = ggplot(churn_data, aes(AccountWeeks, fill = Churn)) + geom_density(alpha = 0.4)
p2 = ggplot(churn_data, aes(MonthlyCharge, fill = Churn)) + geom_density(alpha = 0.4)
p3 = ggplot(churn_data, aes(CustServCalls, fill = Churn)) + geom_bar(position = "dodge")
# bin RoamMins into 2-minute buckets for a grouped bar chart
roamMins <- cut(churn_data$RoamMins, include.lowest = TRUE, breaks = seq(0, 20, by = 2))
p4 = ggplot(churn_data, aes(roamMins, fill = Churn)) + geom_bar(position = "dodge")

grid.arrange(p1, p2, p3, p4, ncol = 2, nrow = 2)
Correlation:

library(dplyr)   # for %>% and select_if
churn_data.numeric = churn_data %>% select_if(is.numeric)
corrplot(round(cor(churn_data.numeric), 2), type = "upper", method = "number")

> by(churn_data, churn_data$Churn, summary)
churn_data$Churn: 0
 Churn    AccountWeeks   ContractRenewal DataPlan  DataUsage       CustServCalls  DayMins        DayCalls     
 0:2850   Min.   :  1.0  0: 186          0:2008    Min.   :0.0000  Min.   :0.00   Min.   :  0.0  Min.   :  0.0
 1:   0   1st Qu.: 73.0  1:2664          1: 842    1st Qu.:0.0000  1st Qu.:1.00   1st Qu.:142.8  1st Qu.: 87.0
          Median :100.0                            Median :0.0000  Median :1.00   Median :177.2  Median :100.0
          Mean   :100.8                            Mean   :0.8622  Mean   :1.45   Mean   :175.2  Mean   :100.3
          3rd Qu.:127.0                            3rd Qu.:2.0000  3rd Qu.:2.00   3rd Qu.:210.3  3rd Qu.:114.0
          Max.   :243.0                            Max.   :4.7500  Max.   :8.00   Max.   :315.6  Max.   :163.0
 MonthlyCharge    OverageFee       RoamMins     
 Min.   : 15.70  Min.   : 0.000  Min.   : 0.00
 1st Qu.: 45.00  1st Qu.: 8.230  1st Qu.: 8.40
 Median : 53.00  Median : 9.980  Median :10.20
 Mean   : 55.82  Mean   : 9.955  Mean   :10.16
 3rd Qu.: 64.67  3rd Qu.:11.660  3rd Qu.:12.00
 Max.   :111.30  Max.   :18.090  Max.   :18.90
-------------------------------------------------------------------------------
churn_data$Churn: 1
 Churn    AccountWeeks   ContractRenewal DataPlan  DataUsage      CustServCalls  DayMins        DayCalls       MonthlyCharge  
 0:  0    Min.   :  1.0  0:137           0:403     Min.   :0.000  Min.   :0.00   Min.   :  0.0  Min.   :  0.0  Min.   : 14.00
 1:483    1st Qu.: 76.0  1:346           1: 80     1st Qu.:0.000  1st Qu.:1.00   1st Qu.:153.2  1st Qu.: 87.5  1st Qu.: 45.00
          Median :103.0                            Median :0.000  Median :2.00   Median :217.6  Median :103.0  Median : 63.00
          Mean   :102.7                            Mean   :0.547  Mean   :2.23   Mean   :206.9  Mean   :101.3  Mean   : 59.19
          3rd Qu.:127.0                            3rd Qu.:0.295  3rd Qu.:4.00   3rd Qu.:265.9  3rd Qu.:116.5  3rd Qu.: 69.00
          Max.   :225.0                            Max.   :5.400  Max.   :9.00   Max.   :350.8  Max.   :165.0  Max.   :110.00
 OverageFee      RoamMins    
 Min.   : 3.55  Min.   : 2.0
 1st Qu.: 8.86  1st Qu.: 8.8
 Median :10.57  Median :10.6
 Mean   :10.62  Mean   :10.7
 3rd Qu.:12.47  3rd Qu.:12.8
 Max.   :18.19  Max.   :20.0

Observations:

1. From the histograms, almost all continuous predictors (AccountWeeks, DayCalls,
DayMins, OverageFee, RoamMins) are approximately normally distributed.
2. MonthlyCharge is slightly left-skewed, which can be ignored.
3. Customers who churn and those who don't have very similar distributions for
AccountWeeks, with a mean of ~103 weeks for churners (1) and ~101 weeks for
non-churners (0).
4. On average, customers who churn use more day minutes (207 mins) than those who
don't (175 mins).
5. On the other hand, churning customers' average data usage (0.55 GB) is lower than
that of non-churning customers (0.86 GB).
6. Churning customers make more customer-service calls, skewing toward the 5-10 call
bracket, versus the 0-5 call bracket for non-churners.
7. Monthly charges are also higher for churned customers than for non-churned ones,
concentrated in the 60-75 range.
8. The data shows a very strong correlation between MonthlyCharge and DataUsage,
which is expected since the charge is driven by usage. So we can drop one of the
pair after evaluation (see the quick check below).
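
To make observation 8 concrete (a small added check, reusing the numeric subset
built in the correlation step above):

cor(churn_data.numeric$MonthlyCharge, churn_data.numeric$DataUsage)
cor(churn_data.numeric$MonthlyCharge, churn_data.numeric$DayMins)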
3. Logistic Regression

Splitting the data

library(caTools)   # provides sample.split
set.seed(144)      # input any random number, fixed here for reproducibility
> spl = sample.split(churn_data$Churn, SplitRatio = 0.7)
> train = subset(churn_data, spl == T)
> class(spl)
[1] "logical"
> dim(train)
[1] 2333 11
> test = subset(churn_data, spl == F)
> dim(test)
[1] 1000 11
> prop.table(table(train$Churn))

0 1
0.8551222 0.1448778
> prop.table(table(test$Churn))

0 1
0.855 0.145
> table(train$Churn)

0 1
1995 338
> table(test$Churn)

0 1
855 145

# first fit on the full data to study coefficients and multicollinearity
model1 = glm(Churn ~ ., data = churn_data, family = "binomial")


> summary(model1)

Call:
glm(formula = Churn ~ ., family = "binomial", data = churn_data)

Deviance Residuals:
Min 1Q Median 3Q Max
-2.0058 -0.5112 -0.3477 -0.2093 2.9981

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -5.9510252 0.5486763 -10.846 < 2e-16 ***
AccountWeeks 0.0006525 0.0013873 0.470 0.638112
ContractRenewal1 -1.9855172 0.1436107 -13.826 < 2e-16 ***
DataPlan1 -1.1841611 0.5363668 -2.208 0.027262 *
DataUsage 0.3636565 1.9231751 0.189 0.850021
CustServCalls 0.5081349 0.0389682 13.040 < 2e-16 ***
DayMins 0.0174407 0.0324841 0.537 0.591337
DayCalls 0.0036523 0.0027497 1.328 0.184097
MonthlyCharge -0.0275526 0.1909074 -0.144 0.885244
OverageFee 0.1868114 0.3256902 0.574 0.566248
RoamMins 0.0789226 0.0220522 3.579 0.000345 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 2758.3 on 3332 degrees of freedom


Residual deviance: 2188.4 on 3322 degrees of freedom
AIC: 2210.4

Number of Fisher Scoring iterations: 5

vif(model1)
   AccountWeeks ContractRenewal        DataPlan       DataUsage   CustServCalls         DayMins
       1.003246        1.058705       14.087816     1601.163095        1.081250      952.539781
       DayCalls   MonthlyCharge      OverageFee        RoamMins
       1.004592     2829.804947      211.716226        1.193368

The multicollinearity has inflated the VIF values of the correlated variables, making
the coefficient estimates unreliable. We will use a stepwise variable-reduction
function based on VIF values, loaded from vif_func.R.
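
The vif_func.R file itself is not reproduced in this report; a minimal sketch of an
assumed equivalent, which repeatedly drops the variable with the highest VIF until
every VIF falls below the threshold:

vif_step <- function(in_frame, thresh = 5, trace = TRUE) {
  keep <- names(in_frame)
  # coerce 0/1 factors back to numeric so lm() treats every column alike
  dat  <- data.frame(lapply(in_frame, function(x) as.numeric(as.character(x))))
  repeat {
    vifs <- sapply(keep, function(v) {
      r2 <- summary(lm(reformulate(setdiff(keep, v), response = v),
                       data = dat[keep]))$r.squared
      1 / (1 - r2)                        # VIF = 1 / (1 - R^2)
    })
    if (trace) print(round(vifs, 2))
    if (max(vifs) < thresh) break
    worst <- names(which.max(vifs))
    if (trace) cat("removed:", worst, round(max(vifs), 3), "\n\n")
    keep <- setdiff(keep, worst)
  }
  keep                                    # names of the retained variables
}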

vif_func <- dget("vif_func.R")

> # source = "C:/Users/itadmin/Desktop/Greatlakes/R/projects/Telecom Customer Churn Prediction Assessment"
> # churn_data1 (assumed): the copy of the data before the factor conversion, so all columns are numeric
> vif_func(in_frame = churn_data1[, -1], thresh = 5, trace = T)
var vif
AccountWeeks 1.00379056966792
ContractRenewal 1.0072163639093
DataPlan 12.4734695151247
DataUsage 1964.8002067194
CustServCalls 1.00194507320884
DayMins 1031.49060758217
DayCalls 1.00293512970177
MonthlyCharge 3243.30055507161
OverageFee 224.639750372869
RoamMins 1.34658276919068

removed: MonthlyCharge 3243.301

var vif
AccountWeeks 1.00349668958413
ContractRenewal 1.00651641850144
DataPlan 12.4695602982927
DataUsage 12.8138032553607
CustServCalls 1.00177759301348
DayMins 1.00333362434266
DayCalls 1.00292948433251
OverageFee 1.0016574227944
RoamMins 1.346470768479

removed: DataUsage 12.8138

[1] "AccountWeeks" "ContractRenewal" "DataPlan" "CustServCalls" "DayMins"


"DayCalls" "OverageFee"
[8] "RoamMins"
> # re-running on the factored data frame: the factor columns (ContractRenewal,
> # DataPlan) return NA VIFs here, but the same variables end up removed
> vif_func(in_frame = churn_data[, -1], thresh = 5, trace = T)
var vif
AccountWeeks 1.00379056966792
ContractRenewal <NA>
DataPlan <NA>
DataUsage 1964.8002067194
CustServCalls 1.00194507320884
DayMins 1031.49060758217
DayCalls 1.00293512970177
MonthlyCharge 3243.30055507161
OverageFee 224.639750372869
RoamMins 1.34658276919068

removed: MonthlyCharge 3243.301

var vif
AccountWeeks 1.00349668958413
ContractRenewal <NA>
DataPlan <NA>
DataUsage 12.8138032553607
CustServCalls 1.00177759301348
DayMins 1.00333362434266
DayCalls 1.00292948433251
OverageFee 1.0016574227944
RoamMins 1.346470768479

removed: DataUsage 12.8138

[1] "AccountWeeks" "ContractRenewal" "DataPlan" "CustServCalls" "DayMins"


"DayCalls" "OverageFee"
[8] "RoamMins"

Model 2: We will drop the MonthlyCharge and DataUsage variables and create a new
model.

model2 <- glm(Churn ~ . -MonthlyCharge - DataUsage , data= churn_data, family=binomial)


> summary(model2)

Call:
glm(formula = Churn ~ . - MonthlyCharge - DataUsage, family = binomial,
data = churn_data)

Deviance Residuals:
Min 1Q Median 3Q Max
-2.0113 -0.5099 -0.3496 -0.2100 2.9978

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -5.9945437 0.5377103 -11.148 < 2e-16 ***
AccountWeeks 0.0006621 0.0013873 0.477 0.633
ContractRenewal1 -1.9880114 0.1435423 -13.850 < 2e-16 ***
DataPlan1 -0.9353165 0.1441298 -6.489 8.62e-11 ***
CustServCalls 0.5072934 0.0389173 13.035 < 2e-16 ***
DayMins 0.0127543 0.0010725 11.892 < 2e-16 ***
DayCalls 0.0036213 0.0027486 1.318 0.188
OverageFee 0.1398147 0.0226568 6.171 6.79e-10 ***
RoamMins 0.0831284 0.0203211 4.091 4.30e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 2758.3 on 3332 degrees of freedom


Residual deviance: 2188.6 on 3324 degrees of freedom
AIC: 2206.6

Number of Fisher Scoring iterations: 5

Model 3: AccountWeeks and DayCalls are insignificant in the model, so we will remove
these variables and create a new model.

model3 <- glm(Churn ~ . - MonthlyCharge - DataUsage - AccountWeeks - DayCalls,
              data = churn_data, family = binomial)
> summary(model3)

Call:
glm(formula = Churn ~ . - MonthlyCharge - DataUsage - AccountWeeks -
DayCalls, family = binomial, data = churn_data)

Deviance Residuals:
Min 1Q Median 3Q Max
-1.9932 -0.5154 -0.3480 -0.2095 2.9906

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -5.552897 0.432757 -12.831 < 2e-16 ***
ContractRenewal1 -1.989219 0.143452 -13.867 < 2e-16 ***
DataPlan1 -0.934814 0.144015 -6.491 8.52e-11 ***
CustServCalls 0.505651 0.038834 13.021 < 2e-16 ***
DayMins 0.012774 0.001073 11.907 < 2e-16 ***
OverageFee 0.138612 0.022648 6.120 9.34e-10 ***
RoamMins 0.083476 0.020304 4.111 3.93e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 2758.3 on 3332 degrees of freedom


Residual deviance: 2190.6 on 3326 degrees of freedom
AIC: 2204.6

Number of Fisher Scoring iterations: 5

vif(model3)
ContractRenewal DataPlan CustServCalls DayMins OverageFee RoamMins
1.056179 1.018476 1.076219 1.039028 1.022948 1.010395


Building the model on train data

# refit the final specification on the 70% training split
model1.train <- glm(Churn ~ . - MonthlyCharge - DataUsage - AccountWeeks - DayCalls,
                    data = train, family = binomial)
> summary(model1.train)

Call:
glm(formula = Churn ~ . - MonthlyCharge - DataUsage - AccountWeeks -
DayCalls, family = binomial, data = train)

Deviance Residuals:
Min 1Q Median 3Q Max
-1.9988 -0.5148 -0.3418 -0.1996 2.8680

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -5.91228 0.52113 -11.345 < 2e-16 ***
ContractRenewal1 -1.93645 0.17175 -11.275 < 2e-16 ***
DataPlan1 -0.92904 0.17012 -5.461 4.73e-08 ***
CustServCalls 0.52940 0.04747 11.153 < 2e-16 ***
DayMins 0.01343 0.00129 10.409 < 2e-16 ***
OverageFee 0.14563 0.02717 5.360 8.32e-08 ***
RoamMins 0.08804 0.02406 3.660 0.000252 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 1930.4 on 2332 degrees of freedom


Residual deviance: 1519.6 on 2326 degrees of freedom
AIC: 1533.6

Number of Fisher Scoring iterations: 5

> vif(model1.train)
ContractRenewal        DataPlan   CustServCalls         DayMins      OverageFee        RoamMins
       1.059151        1.021584        1.083952        1.050825        1.024412        1.007917

Testing the model

library(blorr)   # provides blr_confusion_matrix
predict.test = predict(model1.train, type = "response", newdata = test)
blr_confusion_matrix(model1.train, data = test)
Confusion Matrix and Statistics

Reference
Prediction 0 1
0 834 123
1 21 22

Accuracy : 0.856
95% CI : (0.8327, 0.8772)
No Information Rate : 0.855
P-Value [Acc > NIR] : 0.4863

Kappa : 0.1796

Mcnemar's Test P-Value : <2e-16

Sensitivity : 0.1517
Specificity : 0.9754
Pos Pred Value : 0.5116
Neg Pred Value : 0.8715
Prevalence : 0.1450
Detection Rate : 0.0220
Detection Prevalence : 0.0430
Balanced Accuracy : 0.5636

'Positive' Class : 1

# Confusion matrix with a threshold of 0.2

logitR.predicted1 = ifelse(predict.test > 0.2, 1, 0)


> logitR.predF1 = factor(logitR.predicted1, levels = c(0,1))
>
> logitR.CM1 = confusionMatrix(logitR.predF1,test$Churn, positive = "1")
> logitR.CM1
Confusion Matrix and Statistics

Reference
Prediction 0 1
0 720 55
1 135 90

Accuracy : 0.81
95% CI : (0.7843, 0.8339)
No Information Rate : 0.855
P-Value [Acc > NIR] : 1

Kappa : 0.3765

Mcnemar's Test P-Value : 9.969e-09


Sensitivity : 0.6207
Specificity : 0.8421
Pos Pred Value : 0.4000
Neg Pred Value : 0.9290
Prevalence : 0.1450
Detection Rate : 0.0900
Detection Prevalence : 0.2250
Balanced Accuracy : 0.7314

'Positive' Class : 1

library(ROCR)   # provides prediction() and performance()
ROCRpred = prediction(predict.test, test$Churn)
Auc = as.numeric(performance(ROCRpred, "auc")@y.values)
> Auc
[1] 0.8145513
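
The ROC curve behind this AUC can also be drawn with ROCR's standard tpr/fpr pattern
(an added illustration, not part of the original output):

ROCRperf = performance(ROCRpred, "tpr", "fpr")
plot(ROCRperf, colorize = TRUE, print.cutoffs.at = seq(0.1, 0.9, by = 0.1))
abline(0, 1, lty = 2)   # diagonal = a random classifier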
Observation:

1. Logistic regression performs poorly in the general (0.5-threshold) model, with a
positive predictive value of 51% and a sensitivity of just 15%.
2. Of course this model can be improved through better selection of predictors and
their interaction effects, but the general case is the worst performer.
3. This LR model also suffers from the accuracy paradox, which becomes apparent when
the threshold probability is decreased from 0.5 to, say, 0.2.
4. By lowering the threshold we improved the sensitivity of the model to 0.62 from
the earlier 0.15, of course compromising on overall accuracy, which drops to 0.81
from 0.856. The specificity of the model is also compromised. The threshold can be
reduced further to improve sensitivity even more; a sweep over candidate cutoffs is
sketched below.
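
A quick way to explore that trade-off (an added sketch, reusing predict.test and
test from above) is to sweep candidate cutoffs and tabulate sensitivity and
specificity:

cutoffs <- seq(0.10, 0.50, by = 0.05)
t(sapply(cutoffs, function(p) {
  pred <- factor(ifelse(predict.test > p, 1, 0), levels = c(0, 1))
  cm   <- table(pred, test$Churn)                    # rows = predicted, cols = actual
  c(cutoff      = p,
    sensitivity = cm["1", "1"] / sum(cm[, "1"]),     # TP / (TP + FN)
    specificity = cm["0", "0"] / sum(cm[, "0"]))     # TN / (TN + FP)
}))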
4. K-Nearest Neighbour Classifier

We tried various configurations of the k-NN method.

# install.packages("class")   # run once if the package is not installed
library(class)

# note: class::knn() expects numeric predictors, so the factor columns
# (Churn, ContractRenewal, DataPlan) must be converted back to numeric
# before this call; use an odd k to avoid ties
model.knn = knn(train, test, train$Churn, k = 5)

table(test$Churn, model.knn)
knn.pred = model.knn   # knn() returns the predicted labels directly

# repeated 10-fold cross-validation with centering and scaling,
# tuning k over 10 candidate values
trctl = trainControl(method = "repeatedcv", number = 10, repeats = 3)

set.seed(1111)
model.knn1 = train(Churn ~ ., data = train, method = "knn",
                   trControl = trctl, preProcess = c("center", "scale"),
                   tuneLength = 10)
model.knn1
knn.pred1 = predict(model.knn1, test)

# plain 10-fold cross-validation, without centering/scaling
trctl1 = trainControl(method = "cv", number = 10)
model.knn2 = train(Churn ~ ., data = train, method = "knn",
                   trControl = trctl1)
model.knn2
knn.pred2 = predict(model.knn2, test)

knn.CM  = confusionMatrix(knn.pred,  test$Churn, positive = "1")
knn.CM1 = confusionMatrix(knn.pred1, test$Churn, positive = "1")
knn.CM2 = confusionMatrix(knn.pred2, test$Churn, positive = "1")

knn.CM
Confusion Matrix and Statistics

Reference
Prediction 0 1
0 837 87
1 18 58

Accuracy : 0.895
95% CI : (0.8743, 0.9133)
No Information Rate : 0.855
P-Value [Acc > NIR] : 0.0001125

Kappa : 0.4723

Mcnemar's Test P-Value : 3.22e-11

Sensitivity : 0.4000
Specificity : 0.9789
Pos Pred Value : 0.7632
Neg Pred Value : 0.9058
Prevalence : 0.1450
Detection Rate : 0.0580
Detection Prevalence : 0.0760
Balanced Accuracy : 0.6895

'Positive' Class : 1

> knn.CM1
Confusion Matrix and Statistics

Reference
Prediction 0 1
0 837 87
1 18 58

Accuracy : 0.895
95% CI : (0.8743, 0.9133)
No Information Rate : 0.855
P-Value [Acc > NIR] : 0.0001125

Kappa : 0.4723

Mcnemar's Test P-Value : 3.22e-11

Sensitivity : 0.4000
Specificity : 0.9789
Pos Pred Value : 0.7632
Neg Pred Value : 0.9058
Prevalence : 0.1450
Detection Rate : 0.0580
Detection Prevalence : 0.0760
Balanced Accuracy : 0.6895

'Positive' Class : 1

> knn.CM2
Confusion Matrix and Statistics

Reference
Prediction 0 1
0 840 117
1 15 28

Accuracy : 0.868
95% CI : (0.8454, 0.8884)
No Information Rate : 0.855
P-Value [Acc > NIR] : 0.1301

Kappa : 0.248

Mcnemar's Test P-Value : <2e-16

Sensitivity : 0.1931
Specificity : 0.9825
Pos Pred Value : 0.6512
Neg Pred Value : 0.8777
Prevalence : 0.1450
Detection Rate : 0.0280
Detection Prevalence : 0.0430
Balanced Accuracy : 0.5878

'Positive' Class : 1


Observation
1. The confusion matrix shows the model has a high accuracy of 89.5%, and its
positive predictive value (for the churn class) is around 76%, which is good enough
in real-world scenarios but can be improved with further tuning; one option is
sketched below.
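
One possible refinement (an assumed sketch reusing the same caret setup as above) is
to search a wider, odd-only grid of k values instead of relying on the default
tuneLength:

set.seed(1111)
model.knn3 = train(Churn ~ ., data = train, method = "knn",
                   trControl = trctl, preProcess = c("center", "scale"),
                   tuneGrid = expand.grid(k = seq(3, 31, by = 2)))
model.knn3$bestTune   # the k selected by cross-validation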
5. Naive Bayes Classifier

library(e1071)
> NB.fit = naiveBayes(Churn ~ ., data = train)
> NB.fit

Naive Bayes Classifier for Discrete Predictors

Call:
naiveBayes.default(x = X, y = Y, laplace = laplace)

A-priori probabilities:
Y
0 1
0.8551222 0.1448778

Conditional probabilities:
AccountWeeks
Y [,1] [,2]
0 101.4090 40.11039
1 101.0799 39.43145

ContractRenewal
Y 0 1
0 0.06716792 0.93283208
1 0.28106509 0.71893491

DataPlan
Y 0 1
0 0.6957393 0.3042607
1 0.8284024 0.1715976

DataUsage
Y [,1] [,2]
0 0.8834687 1.294567
1 0.5587574 1.153906

CustServCalls
Y [,1] [,2]
0 1.456642 1.166870
1 2.242604 1.780296

DayMins
Y [,1] [,2]
0 176.2034 49.65592
1 209.0583 70.06561

DayCalls
Y [,1] [,2]
0 99.91679 19.86590
1 100.94675 20.87259

MonthlyCharge
Y [,1] [,2]
0 56.13193 16.51706
1 59.68225 16.04013

OverageFee
Y [,1] [,2]
0 9.913263 2.540241
1 10.622722 2.558384

RoamMins
Y [,1] [,2]
0 10.12566 2.847187
1 10.77604 2.732145

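In the conditional-probability tables above, column [,1] is the class-conditional
mean and [,2] the standard deviation of each Gaussian. As an added single-predictor
illustration, the posterior for DayMins = 250 combines the priors with dnorm()
likelihoods:

prior <- c(`0` = 0.8551222, `1` = 0.1448778)       # a-priori probabilities above
lik   <- c(`0` = dnorm(250, 176.2034, 49.65592),   # P(DayMins = 250 | Churn = 0)
           `1` = dnorm(250, 209.0583, 70.06561))   # P(DayMins = 250 | Churn = 1)
post  <- prior * lik / sum(prior * lik)            # Bayes' rule
round(post, 3)                                     # P(Churn | DayMins = 250)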

NB.pred = predict(NB.fit, test, type = "class")


> NB.CM = confusionMatrix(NB.pred, test$Churn, positive = "1")
> NB.CM
Confusion Matrix and Statistics

Reference
Prediction 0 1
0 828 101
1 27 44

Accuracy : 0.872
95% CI : (0.8497, 0.8921)
No Information Rate : 0.855
P-Value [Acc > NIR] : 0.06738

Kappa : 0.345

Mcnemar's Test P-Value : 1.101e-10

Sensitivity : 0.3034
Specificity : 0.9684
Pos Pred Value : 0.6197
Neg Pred Value : 0.8913
Prevalence : 0.1450
Detection Rate : 0.0440
Detection Prevalence : 0.0710
Balanced Accuracy : 0.6359

'Positive' Class : 1

Observation
1. Accuracy is 87%, but the positive predictive value is 62%, which is quite low.
2. Sensitivity, TP / (TP + FN) = 44 / (44 + 101) ≈ 0.30, is just 30%, another
hallmark of untrustworthiness. A lower posterior cutoff, sketched below, trades some
accuracy back for sensitivity.
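
A hedged sketch of that idea: naiveBayes() can return posterior probabilities
(type = "raw"), so we can cut at a lower threshold than the default 0.5, just as we
did for logistic regression (this run is an addition, not part of the original
output):

NB.raw   = predict(NB.fit, test, type = "raw")     # posterior probability matrix
NB.pred2 = factor(ifelse(NB.raw[, "1"] > 0.2, 1, 0), levels = c(0, 1))
confusionMatrix(NB.pred2, test$Churn, positive = "1")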
Conclusion
1. KNN
a. k-NN performs the best, with a positive predictive value of about 76% in the
general-case model, where the formula takes all 10 predictors irrespective of
whether they are continuous or categorical.
b. The intended, or any further refined/tuned, model should be able to catch the
churners based on the data provided. Of course, the dataset is lopsided in favour of
non-churners, whereas our target is to find churners from the behaviour hidden in
the data.

2. Naive Bayes has essentially no hyperparameters to tune, but k-NN and logistic
regression can be improved by fine-tuning the train-control parameters, and logistic
regression can additionally use an up/down-sampling approach to counteract the class
imbalance (a sketch follows below).
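
A sketch of that up-sampling idea using caret::upSample (an assumed option, not part
of the original run; it replicates minority-class rows until both classes are the
same size):

set.seed(144)
train.up = upSample(x = train[, setdiff(names(train), "Churn")],
                    y = train$Churn, yname = "Churn")
table(train.up$Churn)   # both classes now balanced at 1995 rows
model.up = glm(Churn ~ . - MonthlyCharge - DataUsage - AccountWeeks - DayCalls,
               data = train.up, family = binomial)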

So, based on accuracy and positive predictive power, we recommend using the KNN
model.
