Predicting The Churn in Telecom Industry

Telecom Customer Churn Prediction Assessment
Index
1.Objective
2.EDA analysis
3.Logistics Regression
4. K - Nearest Neighbour Classifier
5. Naive Bayes Classifier
6.Conclusion
1.Objective:
In this project, we simulate one such case of customer churn where we work on a
data of postpaid customers with a contract. The data has information about the
customer usage behavior, contract details and the payment details. The data also
indicates which were the customers who canceled their service. Based on this past
data, we need to build a model which can predict whether a customer will cancel
their service in the future or not.
Importing Libraries
Packages required and loaded:
toload_libraries <-
c("lattice","rpart","rpart.plot","caret","randomForest","scatterplot3d","e1071","reshape
2","rpsychi", "car", "psych", "corrplot", "forecast", "GPArotation", "psy", "MVN",
"DataExplorer", "ppcor", "Metrics", "foreign", "MASS", "lattice", "nortest",
"Hmisc","factoextra", "nFactors")
new.packages <- toload_libraries[!(toload_libraries %in%
installed.packages()[,"Package"])]
if(length(new.packages)) install.packages(new.packages)
lapply(toload_libraries, require, character.only= TRUE)
Reading the data

setwd("C:/Users/itadmin/Desktop/Greatlakes/R/datasets")
library(readxl)
churn_data <- read_excel("cellphone.xlsx",
sheet = "Data")
2.EDA(Exploratory Data Analysis):
> dim(churn_data)- Total Number of Rows and Columns

[1] 3333 11
Checking for missing values –No missing values

> sum(is.na(churn_data))
[1] 0
> any(is.na(churn_data))
[1] FALSE
sapply(churn_data,function(x) sum(is.na(x)))
Churn AccountWeeks ContractRenewal DataPlan DataUsage
CustServCalls DayMins DayCalls
0 0 0 0 0 0 0 0
MonthlyCharge OverageFee RoamMins
0 0 0
str(churn_data)
tibble [3,333 x 11] (S3: tbl_df/tbl/data.frame)
$ Churn : num [1:3333] 0 0 0 0 0 0 0 0 0 0 ...
$ AccountWeeks : num [1:3333] 128 107 137 84 75 118 121 147 117 141 ...
$ ContractRenewal: num [1:3333] 1 1 1 0 0 0 1 0 1 0 ...
$ DataPlan : num [1:3333] 1 1 0 0 0 0 1 0 0 1 ...
$ DataUsage : num [1:3333] 2.7 3.7 0 0 0 0 2.03 0 0.19 3.02 ...
$ CustServCalls : num [1:3333] 1 1 0 2 3 0 3 0 1 0 ...
$ DayMins : num [1:3333] 265 162 243 299 167 ...
$ DayCalls : num [1:3333] 110 123 114 71 113 98 88 79 97 84 ...
$ MonthlyCharge : num [1:3333] 89 82 52 57 41 57 87.3 36 63.9 93.2 ...
$ OverageFee : num [1:3333] 9.87 9.78 6.06 3.1 7.42 ...
$ RoamMins : num [1:3333] 10 13.7 12.2 6.6 10.1 6.3 7.5 7.1 8.7 11.2 ...>
Converting Churn , Contract Renewal and Data Plan as factored variables as they
have value as 0 or 1
col=c("ContractRenewal","DataPlan","Churn")
> churn_data[,col]=lapply(churn_data[,col],factor)
> str(churn_data)
tibble [3,333 x 11] (S3: tbl_df/tbl/data.frame)
$ Churn : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ AccountWeeks : num [1:3333] 128 107 137 84 75 118 121 147 117 141 ...
$ ContractRenewal: Factor w/ 2 levels "0","1": 2 2 2 1 1 1 2 1 2 1 ...
$ DataPlan : Factor w/ 2 levels "0","1": 2 2 1 1 1 1 2 1 1 2 ...
$ DataUsage : num [1:3333] 2.7 3.7 0 0 0 0 2.03 0 0.19 3.02 ...
$ CustServCalls : num [1:3333] 1 1 0 2 3 0 3 0 1 0 ...
$ DayMins : num [1:3333] 265 162 243 299 167 ...
$ DayCalls : num [1:3333] 110 123 114 71 113 98 88 79 97 84 ...
$ MonthlyCharge : num [1:3333] 89 82 52 57 41 57 87.3 36 63.9 93.2 ...
$ OverageFee : num [1:3333] 9.87 9.78 6.06 3.1 7.42 ...
$ RoamMins : num [1:3333] 10 13.7 12.2 6.6 10.1 6.3 7.5 7.1 8.7 11.2 ...
Univariate Analysis
table((churn_data$Churn))
0 1
2850 483
>
prop.table((table((churn_data$Churn))))
0 1
0.8550855 0.1449145
summary(churn_data)
Churn AccountWeeks ContractRenewal DataPlan DataUsage CustServCalls DayMins D
0:2850 Min. : 1.0 0: 323 0:2411 Min. :0.0000 Min. :0.000 Min. : 0.0 Min.
1: 483 1st Qu.: 74.0 1:3010 1: 922 1st Qu.:0.0000 1st Qu.:1.000 1st Qu.:143.7 1st
Median :101.0 Median :0.0000 Median :1.000 Median :179.4 Medi
Mean :101.1 Mean :0.8165 Mean :1.563 Mean :179.8 Mean
3rd Qu.:127.0 3rd Qu.:1.7800 3rd Qu.:2.000 3rd Qu.:216.4 3rd
Max. :243.0 Max. :5.4000 Max. :9.000 Max. :350.8 Max.
Min. : 14.00 Min. : 0.00 Min. : 0.00
1st Qu.: 45.00 1st Qu.: 8.33 1st Qu.: 8.50
Median : 53.50 Median :10.07 Median :10.30
Mean : 56.31 Mean :10.05 Mean :10.24
3rd Qu.: 66.20 3rd Qu.:11.77 3rd Qu.:12.10
Max. :111.30 Max. :18.19 Max. :20.00
>
### Introductory plot of datset
plot_intro(churn_data)
plot_histogram(churn_data,geom_histogram_args = list(fill="blue"),
+ theme_config = list(axis.line = element_line(size = 1, colour = "green"),
strip.background = element_rect(color = "red", fill = "yellow")))
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
> plot_density(churn_data,,geom_density_args = list(fill="gold",alpha=0.4))

Outliers Check with respect to churn
> plot_boxplot(churn_data, by="Churn" , geom_boxplot_args = list("outlier.color" = "red", fill="blue"))

Observations:
1. Response Variable Churn is an imbalanced class with 2850 as No or ‘0’ and
483 as Yes or ‘1’ 14.5% is the churn rate, 483/3333 have churned.
2. Data usage has many outliers for the Churners (class 1)
3. Day Mins and Monthly Charge has many outliers in the Non-Churner category
(Class 0)
4. No missing values in the data
Bivariate Analysis
p1 = ggplot(churn_data, aes(AccountWeeks, fill=Churn)) + geom_density(alpha=0.4)

> p2 = ggplot(churn_data, aes(MonthlyCharge, fill=Churn)) + geom_density(alpha=0.4)
> p3 = ggplot(churn_data, aes(CustServCalls, fill=Churn))+geom_bar(position = "dodge")
> roamMins <- cut(churn_data$RoamMins, include.lowest = TRUE, breaks = seq(0, 20, by = 2))
> p4=ggplot(churn_data, aes(roamMins,fill = Churn)) + geom_bar(position="dodge")
>
> grid.arrange(p1, p2, p3, p4, ncol = 2, nrow = 2)
Correlation:
churn_data.numeric= churn_data %>% select_if(is.numeric)

> corrplot(round(cor(churn_data.numeric),2),type = "upper",method = "number")
by(churn_data,churn_data$Churn,summary)
churn_data$Churn: 0
Churn AccountWeeks ContractRenewal DataPlan DataUsage CustServCalls DayM
ins DayCalls
0:2850 Min. : 1.0 0: 186 0:2008 Min. :0.0000 Min. :0.00 Min.
: 0.0 Min. : 0.0
1: 0 1st Qu.: 73.0 1:2664 1: 842 1st Qu.:0.0000 1st Qu.:1.00 1st Qu.
:142.8 1st Qu.: 87.0
Median :100.0 Median :0.0000 Median :1.00 Median
:177.2 Median :100.0
Mean :100.8 Mean :0.8622 Mean :1.45 Mean
:175.2 Mean :100.3
3rd Qu.:127.0 3rd Qu.:2.0000 3rd Qu.:2.00 3rd Qu.
:210.3 3rd Qu.:114.0
Max. :243.0 Max. :4.7500 Max. :8.00 Max.
:315.6 Max. :163.0
Min. : 15.70 Min. : 0.000 Min. : 0.00
1st Qu.: 45.00 1st Qu.: 8.230 1st Qu.: 8.40
Median : 53.00 Median : 9.980 Median :10.20
Mean : 55.82 Mean : 9.955 Mean :10.16
3rd Qu.: 64.67 3rd Qu.:11.660 3rd Qu.:12.00
Max. :111.30 Max. :18.090 Max. :18.90
------------------------------------------------------------------------------------------
-------
churn_data$Churn: 1
Churn AccountWeeks ContractRenewal DataPlan DataUsage CustServCalls DayMin
s DayCalls MonthlyCharge
0: 0 Min. : 1.0 0:137 0:403 Min. :0.000 Min. :0.00 Min. :
0.0 Min. : 0.0 Min. : 14.00
1:483 1st Qu.: 76.0 1:346 1: 80 1st Qu.:0.000 1st Qu.:1.00 1st Qu.:1
53.2 1st Qu.: 87.5 1st Qu.: 45.00
Median :103.0 Median :0.000 Median :2.00 Median :2
17.6 Median :103.0 Median : 63.00
Mean :102.7 Mean :0.547 Mean :2.23 Mean :2
06.9 Mean :101.3 Mean : 59.19
3rd Qu.:127.0 3rd Qu.:0.295 3rd Qu.:4.00 3rd Qu.:2
65.9 3rd Qu.:116.5 3rd Qu.: 69.00
Max. :225.0 Max. :5.400 Max. :9.00 Max. :3
50.8 Max. :165.0 Max. :110.00
OverageFee RoamMins
Min. : 3.55 Min. : 2.0
1st Qu.: 8.86 1st Qu.: 8.8
Median :10.57 Median :10.6
Mean :10.62 Mean :10.7
3rd Qu.:12.47 3rd Qu.:12.8
Max. :18.19 Max. :20.0
Observations:
1. From Histograms almost all continuous predictors like Account Weeks, Day
Calls/Mins, OverageFee, Roam mins have normal distributions
2. Monthly charge has its distribution skewed to a bit left which can be ignored
3. Customers who churn vs who dont are mostly have similar distribution for the
Account weeks with mean of Churn(1) = 103 Weeks and Not Churn(0) ~ 101
Weeks
4. On an Average Customers who Churn are utilizing more Day Minutes(207
mins) than who don’t (175 mins)
5. On the other hand Churning customers data usage (0.54 GB)on an average is
less compared to Non-Churning ones ( 0.86 GB)
6. Churning Customers call Customer Service more in the bracket of ( 5 - 10
calls) v/s the bracket of (0-5 Calls)
7. Monthly Charges are also more for Churn customers compared to Non-Churn
in the 60 - 75 monetary amounts
8. Data suggests there is very strong correlation between Monthly charges and
data usage which is quite obvious . So we can replace one variable with
another after evaluation
3. Logistics Regression
Splitting the data

set.seed(144) #Input any random number
> spl = sample.split(churn_data$Churn, SplitRatio = 0.7)
> train = subset(churn_data, spl == T)
> class(spl)
[1] "logical"
> dim(train)
[1] 2333 11
> test = subset(churn_data, spl == F)
> dim(test)
[1] 1000 11
>
> prop.table(table(train$Churn))
0 1
0.8551222 0.1448778
> prop.table(table(test$Churn))
0 1
0.855 0.145
>
> table(train$Churn)
0 1
1995 338
> table(test$Churn)
0 1
855 145
model1=glm(Churn~.,data = churn_data,family = "binomial")

> summary(model1)
Call:
glm(formula = Churn ~ ., family = "binomial", data = churn_data)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.0058 -0.5112 -0.3477 -0.2093 2.9981
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -5.9510252 0.5486763 -10.846 < 2e-16 ***
AccountWeeks 0.0006525 0.0013873 0.470 0.638112
ContractRenewal1 -1.9855172 0.1436107 -13.826 < 2e-16 ***
DataPlan1 -1.1841611 0.5363668 -2.208 0.027262 *
DataUsage 0.3636565 1.9231751 0.189 0.850021
CustServCalls 0.5081349 0.0389682 13.040 < 2e-16 ***
DayMins 0.0174407 0.0324841 0.537 0.591337
DayCalls 0.0036523 0.0027497 1.328 0.184097
MonthlyCharge -0.0275526 0.1909074 -0.144 0.885244
OverageFee 0.1868114 0.3256902 0.574 0.566248
RoamMins 0.0789226 0.0220522 3.579 0.000345 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 2758.3 on 3332 degrees of freedom

Residual deviance: 2188.4 on 3322 degrees of freedom
AIC: 2210.4
Number of Fisher Scoring iterations: 5
vif(model1)
AccountWeeks ContractRenewal DataPlan DataUsage CustServCalls Da
yMins DayCalls MonthlyCharge
1.003246 1.058705 14.087816 1601.163095 1.081250 952.5
39781 1.004592 2829.804947
OverageFee RoamMins
211.716226 1.193368
The multicolliniearity has caused the inflated VIF values for correlated variables,
making the model unreliable. We will use a stepwise variable reduction function
using VIF values
vif_func <- dget("vif_func.R")

> #source= "C:/Users/itadmin/Desktop/Greatlakes/R/projects/Telecom Custom
er Churn Prediction Assessment")
> vif_func(in_frame=churn_data1[,-1],thresh=5,trace=T)
var vif
AccountWeeks 1.00379056966792
ContractRenewal 1.0072163639093
DataPlan 12.4734695151247
DataUsage 1964.8002067194
CustServCalls 1.00194507320884
DayMins 1031.49060758217
DayCalls 1.00293512970177
MonthlyCharge 3243.30055507161
OverageFee 224.639750372869
RoamMins 1.34658276919068
removed: MonthlyCharge 3243.301
var vif
ContractRenewal 1.00651641850144
DataPlan 12.4695602982927
DataUsage 12.8138032553607
DayMins 1.00333362434266
DayCalls 1.00292948433251
OverageFee 1.0016574227944
RoamMins 1.346470768479
removed: DataUsage 12.8138
[1] "AccountWeeks" "ContractRenewal" "DataPlan" "CustServCalls" "DayMins"

"DayCalls" "OverageFee"
[8] "RoamMins"
> vif_func(in_frame=churn_data[,-1],thresh=5,trace=T)
var vif
ContractRenewal <NA>
DataPlan <NA>
DataUsage 1964.8002067194
DayMins 1031.49060758217
DayCalls 1.00293512970177
MonthlyCharge 3243.30055507161
OverageFee 224.639750372869
RoamMins 1.34658276919068
removed: MonthlyCharge 3243.301
var vif
ContractRenewal <NA>
DataPlan <NA>
DataUsage 12.8138032553607
DayMins 1.00333362434266
DayCalls 1.00292948433251
OverageFee 1.0016574227944
RoamMins 1.346470768479
removed: DataUsage 12.8138
[1] "AccountWeeks" "ContractRenewal" "DataPlan" "CustServCalls" "DayMins"

"DayCalls" "OverageFee"
[8] "RoamMins"
Model 2: We will not use the MonthlyCharge and DataUsage variables and create a
new model.
model2 <- glm(Churn ~ . -MonthlyCharge - DataUsage , data= churn_data, family=binomial)

> summary(model2)
Call:
glm(formula = Churn ~ . - MonthlyCharge - DataUsage, family = binomial,
data = churn_data)
Deviance Residuals:
-2.0113 -0.5099 -0.3496 -0.2100 2.9978
Coefficients:
(Intercept) -5.9945437 0.5377103 -11.148 < 2e-16 ***
AccountWeeks 0.0006621 0.0013873 0.477 0.633
DataPlan1 -0.9353165 0.1441298 -6.489 8.62e-11 ***
CustServCalls 0.5072934 0.0389173 13.035 < 2e-16 ***
DayMins 0.0127543 0.0010725 11.892 < 2e-16 ***
DayCalls 0.0036213 0.0027486 1.318 0.188
OverageFee 0.1398147 0.0226568 6.171 6.79e-10 ***
RoamMins 0.0831284 0.0203211 4.091 4.30e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

AIC: 2206.6
Model 3: AccountWeeks and DayCalls are insignificant for the model, we will remove
these variables a create a new model
model3 <- glm(Churn ~ . -MonthlyCharge - DataUsage -AccountWeeks -DayCalls, data= churn_data, family=bin
> summary(model3)
Call:
glm(formula = Churn ~ . - MonthlyCharge - DataUsage - AccountWeeks -
DayCalls, family = binomial, data = churn_data)
Deviance Residuals:
-1.9932 -0.5154 -0.3480 -0.2095 2.9906
Coefficients:
(Intercept) -5.552897 0.432757 -12.831 < 2e-16 ***
DataPlan1 -0.934814 0.144015 -6.491 8.52e-11 ***
CustServCalls 0.505651 0.038834 13.021 < 2e-16 ***
DayMins 0.012774 0.001073 11.907 < 2e-16 ***
OverageFee 0.138612 0.022648 6.120 9.34e-10 ***
RoamMins 0.083476 0.020304 4.111 3.93e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

AIC: 2204.6
vif(model3)
ContractRenewal DataPlan CustServCalls DayMins OverageFee RoamMins
1.056179 1.018476 1.076219 1.039028 1.022948 1.010395
>
Building the model on train data
summary(model1.train)
Call:
glm(formula = Churn ~ . - MonthlyCharge - DataUsage - AccountWeeks -
DayCalls, family = binomial, data = train)
Deviance Residuals:
-1.9988 -0.5148 -0.3418 -0.1996 2.8680
Coefficients:
(Intercept) -5.91228 0.52113 -11.345 < 2e-16 ***
DataPlan1 -0.92904 0.17012 -5.461 4.73e-08 ***
CustServCalls 0.52940 0.04747 11.153 < 2e-16 ***
DayMins 0.01343 0.00129 10.409 < 2e-16 ***
OverageFee 0.14563 0.02717 5.360 8.32e-08 ***
RoamMins 0.08804 0.02406 3.660 0.000252 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

AIC: 1533.6
> vif(model1.train)
ContractRenewal DataPlan CustServCalls DayMins OverageFee Roa
mMins
1.059151 1.021584 1.083952 1.050825 1.024412 1.0
07917
Testing the model

predict.test=predict(model1.train,type = "response",newdata = test)
blr_confusion_matrix(model1.train,data = test)
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 834 123
1 21 22
Accuracy : 0.856
95% CI : (0.8327, 0.8772)
No Information Rate : 0.855
P-Value [Acc > NIR] : 0.4863
Kappa : 0.1796
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.1517
Specificity : 0.9754
Pos Pred Value : 0.5116
Neg Pred Value : 0.8715
Prevalence : 0.1450
Detection Rate : 0.0220
Detection Prevalence : 0.0430
Balanced Accuracy : 0.5636
'Positive' Class : 1
# Confusion matrix with threshold of 0.2
logitR.predicted1 = ifelse(predict.test > 0.2 , 1, 0)

> logitR.predF1 = factor(logitR.predicted1, levels = c(0,1))
>
> logitR.CM1 = confusionMatrix(logitR.predF1,test$Churn, positive = "1")
> logitR.CM1
Reference
Prediction 0 1
0 720 55
1 135 90
Accuracy : 0.81
95% CI : (0.7843, 0.8339)
P-Value [Acc > NIR] : 1
Kappa : 0.3765
Mcnemar's Test P-Value : 9.969e-09

Prevalence : 0.1450
ROCRpred = prediction(predict.test,test$Churn)
>
> Auc=as.numeric(performance(ROCRpred, "auc")@y.values)
> Auc
[1] 0.8145513
Observation:
1. Logistic Regression also performs poorly in case of general model with

positive pred rate of 51% and Sensitivity of just 15%
2. Ofcourse this model can be improved through better selection of predictors

and their interaction effects but the general case is worst performer
3. This LR model also suffers from accuracy paradox such that if threshold
probability is decreses from 0.5 to say 0.2
4. By lowering the threshold, we have improved the sensitivity of the model to

0.62 from earlier 0.15, of course compromising on the overall accuracy
which stands lower at 0.8 from 0.85. The specifity of the model is also
compromised. The threshold can be further reduced to further improve the
sensitivity
4. K - Nearest Neighbour Classifier
Have used various combinations for KNN method
library(class)
install.packages("class")
model.knn=knn(train,test,train$Churn ,k=5) # have odd numebr of K

table(test$Churn,model.knn)
model.knn
knn.pred=predict(model.knn,test)
?knn.fit
trctl = trainControl(method = "repeatedcv", number = 10, repeats = 3)
set.seed(1111)
model.knn1= train(Churn~., data = train, method="knn",
trControl= trctl, preProcess = c("center", "scale"),
tuneLength= 10)
model.knn1
knn.pred1=predict(model.knn1,test)
trctl1 = trainControl(method = "cv", number = 10)

model.knn2= train(Churn~., data = train, method="knn",
trControl= trctl1)
model.knn2
knn.pred2=predict(model.knn2,test)
knn.CM = confusionMatrix(knn.pred,test$Churn, positive = "1")
knn.CM1 = confusionMatrix(knn.pred1,test$Churn, positive = "1")
knn.CM2 = confusionMatrix(knn.pred2,test$Churn, positive = "1")
knn.CM
knn.CM1
knn.CM2
knn.CM
Reference
Prediction 0 1
0 837 87
1 18 58
Accuracy : 0.895
95% CI : (0.8743, 0.9133)
P-Value [Acc > NIR] : 0.0001125
Kappa : 0.4723
Prevalence : 0.1450
> knn.CM1
Reference
Prediction 0 1
0 837 87
1 18 58
Accuracy : 0.895
95% CI : (0.8743, 0.9133)
P-Value [Acc > NIR] : 0.0001125
Kappa : 0.4723
Prevalence : 0.1450
> knn.CM2
Reference
Prediction 0 1
0 840 117
1 15 28
Accuracy : 0.868
95% CI : (0.8454, 0.8884)
Kappa : 0.248
Mcnemar's Test P-Value : <2e-16
Prevalence : 0.1450
>
Observation
1. Confusion matrix suggests that model has very high accuracy of 90% but
its positive class prediction rate (Churning rate) is around 76 % which is
good enough in real time scenarios but can be improved with further
tuning
5. Naive Bayes Classifier
ibrary(e1071)
> NB.fit = naiveBayes(Churn~., data = train)
> NB.fit
Naive Bayes Classifier for Discrete Predictors
Call:
naiveBayes.default(x = X, y = Y, laplace = laplace)
A-priori probabilities:
Y
0 1
0.8551222 0.1448778
Conditional probabilities:
AccountWeeks
Y [,1] [,2]
0 101.4090 40.11039
1 101.0799 39.43145
ContractRenewal
Y 0 1
0 0.06716792 0.93283208
1 0.28106509 0.71893491
DataPlan
Y 0 1
0 0.6957393 0.3042607
1 0.8284024 0.1715976
DataUsage
Y [,1] [,2]
0 0.8834687 1.294567
1 0.5587574 1.153906
CustServCalls
Y [,1] [,2]
0 1.456642 1.166870
1 2.242604 1.780296
DayMins
Y [,1] [,2]
0 176.2034 49.65592
1 209.0583 70.06561
DayCalls
Y [,1] [,2]
0 99.91679 19.86590
1 100.94675 20.87259
MonthlyCharge
Y [,1] [,2]
0 56.13193 16.51706
1 59.68225 16.04013
OverageFee
Y [,1] [,2]
0 9.913263 2.540241
1 10.622722 2.558384
RoamMins
Y [,1] [,2]
0 10.12566 2.847187
1 10.77604 2.732145
>
NB.pred = predict(NB.fit, test, type = "class")

> NB.CM = confusionMatrix(NB.pred, test$Churn, positive = "1")
> NB.CM
Reference
Prediction 0 1
0 828 101
1 27 44
Accuracy : 0.872
95% CI : (0.8497, 0.8921)
Kappa : 0.345
Prevalence : 0.1450
Observation
1. Accuracy is 87% but its positive prediction rate is 62% which is quite low
2. Sensitivity which is TP/ TP + FN is just 30 % another hallmark of
untrustworthiness
Conclusion
1. KNN
a. k-NN performs the best with Positive pred rate of 81% in the general
case model where the formula intends to take all the 10 predictors
irrespective of their type whether continous or categorical
b. The intended or any refined / tuned target model should be able to

catch the Churners based on the data provided . Ofcourse the dataset
is lopsided in favor of more-NonChurners rather than our intended
target of finding Churners based on their behavior hidden in the
dataset.
2. Naive Bayes has no parameters to tune , but k-NN and Logit Regr can be
improved by fine tuning the train control parameters and also deploying the
up/down sampling approach for Logistic regression to counteract the class
imbalance
So based on accuracy and positive predictive power we recommend using KNN

model

Predicting The Churn in Telecom Industry

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Predicting The Churn in Telecom Industry

Uploaded by

Copyright:

Available Formats

Telecom Customer Churn Prediction Assessment

Reading the data

> dim(churn_data)- Total Number of Rows and Columns

Checking for missing values –No missing values

### Introductory plot of datset

> plot_density(churn_data,,geom_density_args = list(fill="gold",alpha=0.4))

> plot_boxplot(churn_data, by="Churn" , geom_boxplot_args = list("outlier.color" = "red", fill="blue"))

p1 = ggplot(churn_data, aes(AccountWeeks, fill=Churn)) + geom_density(alpha=0.4)

churn_data.numeric= churn_data %>% select_if(is.numeric)

Splitting the data

model1=glm(Churn~.,data = churn_data,family = "binomial")

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 2758.3 on 3332 degrees of freedom

Number of Fisher Scoring iterations: 5

vif_func <- dget("vif_func.R")

removed: MonthlyCharge 3243.301

removed: DataUsage 12.8138

[1] "AccountWeeks" "ContractRenewal" "DataPlan" "CustServCalls" "DayMins"

removed: MonthlyCharge 3243.301

removed: DataUsage 12.8138

[1] "AccountWeeks" "ContractRenewal" "DataPlan" "CustServCalls" "DayMins"

model2 <- glm(Churn ~ . -MonthlyCharge - DataUsage , data= churn_data, family=binomial)

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 2758.3 on 3332 degrees of freedom

Number of Fisher Scoring iterations: 5

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 2758.3 on 3332 degrees of freedom

Number of Fisher Scoring iterations: 5

Building the model on train data

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 1930.4 on 2332 degrees of freedom

Number of Fisher Scoring iterations: 5

Testing the model

Mcnemar's Test P-Value : <2e-16

# Confusion matrix with threshold of 0.2

logitR.predicted1 = ifelse(predict.test > 0.2 , 1, 0)

Mcnemar's Test P-Value : 9.969e-09

1. Logistic Regression also performs poorly in case of general model with

2. Ofcourse this model can be improved through better selection of predictors

4. By lowering the threshold, we have improved the sensitivity of the model to

Have used various combinations for KNN method

model.knn=knn(train,test,train$Churn ,k=5) # have odd numebr of K

trctl = trainControl(method = "repeatedcv", number = 10, repeats = 3)

trctl1 = trainControl(method = "cv", number = 10)

Mcnemar's Test P-Value : 3.22e-11

Mcnemar's Test P-Value : 3.22e-11

Mcnemar's Test P-Value : <2e-16

Naive Bayes Classifier for Discrete Predictors

NB.pred = predict(NB.fit, test, type = "class")

Mcnemar's Test P-Value : 1.101e-10

b. The intended or any refined / tuned target model should be able to

So based on accuracy and positive predictive power we recommend using KNN

You might also like