Index
1. Objective
2. EDA Analysis
3. Logistic Regression
4. K-Nearest Neighbour Classifier
5. Naive Bayes Classifier
6. Conclusion
1. Objective
In this project, we simulate a customer churn case using data on postpaid
customers with a contract. The data covers customer usage behavior, contract
details and payment details, and indicates which customers canceled their
service. Based on this historical data, we need to build a model that predicts
whether a customer will cancel their service in the future.
Importing Libraries
Packages required and loaded:
# ROCR and class are added here because later sections call them.
toload_libraries <- c("lattice", "rpart", "rpart.plot", "caret", "randomForest",
                      "scatterplot3d", "e1071", "reshape2", "rpsychi", "car",
                      "psych", "corrplot", "forecast", "GPArotation", "psy",
                      "MVN", "DataExplorer", "ppcor", "Metrics", "foreign",
                      "MASS", "nortest", "Hmisc", "factoextra", "nFactors",
                      "ROCR", "class")
# Install any packages that are missing, then load everything.
new.packages <- toload_libraries[!(toload_libraries %in% installed.packages()[, "Package"])]
if (length(new.packages)) install.packages(new.packages)
lapply(toload_libraries, require, character.only = TRUE)
> any(is.na(churn_data))
[1] FALSE
> sapply(churn_data, function(x) sum(is.na(x)))
          Churn    AccountWeeks ContractRenewal        DataPlan       DataUsage   CustServCalls
              0               0               0               0               0               0
        DayMins        DayCalls   MonthlyCharge      OverageFee        RoamMins
              0               0               0               0               0
> str(churn_data)
tibble [3,333 x 11] (S3: tbl_df/tbl/data.frame)
$ Churn : num [1:3333] 0 0 0 0 0 0 0 0 0 0 ...
$ AccountWeeks : num [1:3333] 128 107 137 84 75 118 121 147 117 141 ...
$ ContractRenewal: num [1:3333] 1 1 1 0 0 0 1 0 1 0 ...
$ DataPlan : num [1:3333] 1 1 0 0 0 0 1 0 0 1 ...
$ DataUsage : num [1:3333] 2.7 3.7 0 0 0 0 2.03 0 0.19 3.02 ...
$ CustServCalls : num [1:3333] 1 1 0 2 3 0 3 0 1 0 ...
$ DayMins : num [1:3333] 265 162 243 299 167 ...
$ DayCalls : num [1:3333] 110 123 114 71 113 98 88 79 97 84 ...
$ MonthlyCharge : num [1:3333] 89 82 52 57 41 57 87.3 36 63.9 93.2 ...
$ OverageFee : num [1:3333] 9.87 9.78 6.06 3.1 7.42 ...
$ RoamMins : num [1:3333] 10 13.7 12.2 6.6 10.1 6.3 7.5 7.1 8.7 11.2 ...
Churn, ContractRenewal and DataPlan take only the values 0 and 1, so we convert
them to factors:
> col = c("ContractRenewal", "DataPlan", "Churn")
> churn_data[, col] = lapply(churn_data[, col], factor)
> str(churn_data)
tibble [3,333 x 11] (S3: tbl_df/tbl/data.frame)
$ Churn : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ AccountWeeks : num [1:3333] 128 107 137 84 75 118 121 147 117 141 ...
$ ContractRenewal: Factor w/ 2 levels "0","1": 2 2 2 1 1 1 2 1 2 1 ...
$ DataPlan : Factor w/ 2 levels "0","1": 2 2 1 1 1 1 2 1 1 2 ...
$ DataUsage : num [1:3333] 2.7 3.7 0 0 0 0 2.03 0 0.19 3.02 ...
$ CustServCalls : num [1:3333] 1 1 0 2 3 0 3 0 1 0 ...
$ DayMins : num [1:3333] 265 162 243 299 167 ...
$ DayCalls : num [1:3333] 110 123 114 71 113 98 88 79 97 84 ...
$ MonthlyCharge : num [1:3333] 89 82 52 57 41 57 87.3 36 63.9 93.2 ...
$ OverageFee : num [1:3333] 9.87 9.78 6.06 3.1 7.42 ...
$ RoamMins : num [1:3333] 10 13.7 12.2 6.6 10.1 6.3 7.5 7.1 8.7 11.2 ...
Univariate Analysis
> table(churn_data$Churn)
   0    1
2850  483
> prop.table(table(churn_data$Churn))
        0         1
0.8550855 0.1449145
summary(churn_data)
Churn AccountWeeks ContractRenewal DataPlan DataUsage CustServCalls DayMins D
0:2850 Min. : 1.0 0: 323 0:2411 Min. :0.0000 Min. :0.000 Min. : 0.0 Min.
1: 483 1st Qu.: 74.0 1:3010 1: 922 1st Qu.:0.0000 1st Qu.:1.000 1st Qu.:143.7 1st
Median :101.0 Median :0.0000 Median :1.000 Median :179.4 Medi
Mean :101.1 Mean :0.8165 Mean :1.563 Mean :179.8 Mean
3rd Qu.:127.0 3rd Qu.:1.7800 3rd Qu.:2.000 3rd Qu.:216.4 3rd
Max. :243.0 Max. :5.4000 Max. :9.000 Max. :350.8 Max.
MonthlyCharge OverageFee RoamMins
Min. : 14.00 Min. : 0.00 Min. : 0.00
1st Qu.: 45.00 1st Qu.: 8.33 1st Qu.: 8.50
Median : 53.50 Median :10.07 Median :10.30
Mean : 56.31 Mean :10.05 Mean :10.24
3rd Qu.: 66.20 3rd Qu.:11.77 3rd Qu.:12.10
Max. :111.30 Max. :18.19 Max. :20.00
> plot_intro(churn_data)
> plot_histogram(churn_data, geom_histogram_args = list(fill = "blue"),
+   theme_config = list(axis.line = element_line(size = 1, colour = "green"),
+                       strip.background = element_rect(color = "red", fill = "yellow")))
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Bivariate Analysis
> by(churn_data, churn_data$Churn, summary)
churn_data$Churn: 0
 Churn   AccountWeeks    ContractRenewal DataPlan DataUsage        CustServCalls
 0:2850  Min.   :  1.0   0: 186          0:2008   Min.   :0.0000   Min.   :0.00
 1:   0  1st Qu.: 73.0   1:2664          1: 842   1st Qu.:0.0000   1st Qu.:1.00
         Median :100.0                            Median :0.0000   Median :1.00
         Mean   :100.8                            Mean   :0.8622   Mean   :1.45
         3rd Qu.:127.0                            3rd Qu.:2.0000   3rd Qu.:2.00
         Max.   :243.0                            Max.   :4.7500   Max.   :8.00
 DayMins         DayCalls        MonthlyCharge    OverageFee       RoamMins
 Min.   :  0.0   Min.   :  0.0   Min.   : 15.70   Min.   : 0.000   Min.   : 0.00
 1st Qu.:142.8   1st Qu.: 87.0   1st Qu.: 45.00   1st Qu.: 8.230   1st Qu.: 8.40
 Median :177.2   Median :100.0   Median : 53.00   Median : 9.980   Median :10.20
 Mean   :175.2   Mean   :100.3   Mean   : 55.82   Mean   : 9.955   Mean   :10.16
 3rd Qu.:210.3   3rd Qu.:114.0   3rd Qu.: 64.67   3rd Qu.:11.660   3rd Qu.:12.00
 Max.   :315.6   Max.   :163.0   Max.   :111.30   Max.   :18.090   Max.   :18.90
--------------------------------------------------------------------------------
churn_data$Churn: 1
 Churn   AccountWeeks    ContractRenewal DataPlan DataUsage       CustServCalls
 0:  0   Min.   :  1.0   0:137           0:403    Min.   :0.000   Min.   :0.00
 1:483   1st Qu.: 76.0   1:346           1: 80    1st Qu.:0.000   1st Qu.:1.00
         Median :103.0                            Median :0.000   Median :2.00
         Mean   :102.7                            Mean   :0.547   Mean   :2.23
         3rd Qu.:127.0                            3rd Qu.:0.295   3rd Qu.:4.00
         Max.   :225.0                            Max.   :5.400   Max.   :9.00
 DayMins         DayCalls        MonthlyCharge    OverageFee      RoamMins
 Min.   :  0.0   Min.   :  0.0   Min.   : 14.00   Min.   : 3.55   Min.   : 2.0
 1st Qu.:153.2   1st Qu.: 87.5   1st Qu.: 45.00   1st Qu.: 8.86   1st Qu.: 8.8
 Median :217.6   Median :103.0   Median : 63.00   Median :10.57   Median :10.6
 Mean   :206.9   Mean   :101.3   Mean   : 59.19   Mean   :10.62   Mean   :10.7
 3rd Qu.:265.9   3rd Qu.:116.5   3rd Qu.: 69.00   3rd Qu.:12.47   3rd Qu.:12.8
 Max.   :350.8   Max.   :165.0   Max.   :110.00   Max.   :18.19   Max.   :20.0
Observations:
1. From the histograms, almost all continuous predictors (Account Weeks, Day
Calls/Mins, OverageFee, Roam Mins) have approximately normal distributions.
2. Monthly charge is slightly right-skewed (mean 56.31 vs. median 53.50), which
can be ignored.
3. Customers who churn and those who do not have similar distributions of
Account Weeks, with a mean of about 103 weeks for Churn (1) and about 101
weeks for Not Churn (0).
4. On average, customers who churn use more Day Minutes (207 mins) than those
who don't (175 mins).
5. On the other hand, churning customers' average data usage (0.55 GB) is
lower than that of non-churning customers (0.86 GB).
6. Churning customers call customer service more often (mean 2.23 vs. 1.45
calls) and are relatively more common in the 5-10 call bracket than in the
0-5 call bracket.
7. Monthly charges are also higher for churning customers than for
non-churning ones (median 63 vs. 53), particularly in the 60-75 range.
8. The data suggest a very strong correlation between monthly charges and
data usage, which is expected. One of the two variables can therefore stand
in for the other after evaluation.
3. Logistic Regression
> prop.table(table(train$Churn))
        0         1
0.8551222 0.1448778
> prop.table(table(test$Churn))
    0     1
0.855 0.145
> table(train$Churn)
   0    1
1995  338
> table(test$Churn)
  0   1
855 145
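The code that produced this roughly 70/30 split is not shown. A minimal sketch of one way to obtain such a split in base R follows, assuming the split was stratified by class so that the churn rate stays the same in both parts. The helper `stratified_idx` and the synthetic label vector are illustrative assumptions, not the original code, which may have used a packaged function instead.

```r
# Synthetic labels with the same class counts as the churn data (2850 / 483).
churn <- factor(c(rep(0, 2850), rep(1, 483)))

set.seed(1111)  # fixed only for reproducibility

# Hypothetical helper: sample a fixed fraction of row indices within each class.
stratified_idx <- function(y, p = 0.7) {
  per_class <- lapply(split(seq_along(y), y), function(idx) {
    idx[sample.int(length(idx), round(p * length(idx)))]
  })
  sort(unlist(per_class, use.names = FALSE))
}

train_idx <- stratified_idx(churn, p = 0.7)
train_y   <- churn[train_idx]
test_y    <- churn[-train_idx]

table(train_y)
table(test_y)
```

Sampling within each class, rather than over all rows at once, is what keeps the 14.5% churn rate nearly identical in the training and test sets.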
Call:
glm(formula = Churn ~ ., family = "binomial", data = churn_data)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.0058 -0.5112 -0.3477 -0.2093 2.9981
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -5.9510252 0.5486763 -10.846 < 2e-16 ***
AccountWeeks 0.0006525 0.0013873 0.470 0.638112
ContractRenewal1 -1.9855172 0.1436107 -13.826 < 2e-16 ***
DataPlan1 -1.1841611 0.5363668 -2.208 0.027262 *
DataUsage 0.3636565 1.9231751 0.189 0.850021
CustServCalls 0.5081349 0.0389682 13.040 < 2e-16 ***
DayMins 0.0174407 0.0324841 0.537 0.591337
DayCalls 0.0036523 0.0027497 1.328 0.184097
MonthlyCharge -0.0275526 0.1909074 -0.144 0.885244
OverageFee 0.1868114 0.3256902 0.574 0.566248
RoamMins 0.0789226 0.0220522 3.579 0.000345 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> vif(model1)
   AccountWeeks ContractRenewal        DataPlan       DataUsage   CustServCalls
       1.003246        1.058705       14.087816     1601.163095        1.081250
        DayMins        DayCalls   MonthlyCharge      OverageFee        RoamMins
     952.539781        1.004592     2829.804947      211.716226        1.193368
Multicollinearity has inflated the VIF values of the correlated variables,
making the model unreliable. We will use a stepwise variable-reduction function
based on VIF values.
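To make the VIF figures concrete, here is a minimal sketch of the textbook definition, VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on the remaining predictors. The data are synthetic and the helper `vif_by_hand` is hypothetical. Note also that `vif()` applied to a `glm` may be computed differently from this linear-model version, so the numbers are not expected to match the output above.

```r
set.seed(42)
n  <- 500
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)  # nearly a copy of x1: strongly collinear
x3 <- rnorm(n)                 # unrelated to the other two
X  <- data.frame(x1, x2, x3)

# Hypothetical helper: regress each predictor on all the others and
# convert the resulting R-squared into a variance inflation factor.
vif_by_hand <- function(X) {
  sapply(names(X), function(v) {
    fit <- lm(reformulate(setdiff(names(X), v), response = v), data = X)
    1 / (1 - summary(fit)$r.squared)
  })
}

vifs <- vif_by_hand(X)
round(vifs, 2)
```

The collinear pair `x1` and `x2` gets a large VIF while `x3` stays close to 1, the same pattern that separates DataUsage, DayMins, MonthlyCharge and OverageFee from the remaining predictors in the output above.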
var vif
AccountWeeks 1.00349668958413
ContractRenewal 1.00651641850144
DataPlan 12.4695602982927
DataUsage 12.8138032553607
CustServCalls 1.00177759301348
DayMins 1.00333362434266
DayCalls 1.00292948433251
OverageFee 1.0016574227944
RoamMins 1.346470768479
var vif
AccountWeeks 1.00349668958413
ContractRenewal <NA>
DataPlan <NA>
DataUsage 12.8138032553607
CustServCalls 1.00177759301348
DayMins 1.00333362434266
DayCalls 1.00292948433251
OverageFee 1.0016574227944
RoamMins 1.346470768479
Model 2: We will not use the MonthlyCharge and DataUsage variables and create a
new model.
Call:
glm(formula = Churn ~ . - MonthlyCharge - DataUsage, family = binomial,
data = churn_data)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.0113 -0.5099 -0.3496 -0.2100 2.9978
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -5.9945437 0.5377103 -11.148 < 2e-16 ***
AccountWeeks 0.0006621 0.0013873 0.477 0.633
ContractRenewal1 -1.9880114 0.1435423 -13.850 < 2e-16 ***
DataPlan1 -0.9353165 0.1441298 -6.489 8.62e-11 ***
CustServCalls 0.5072934 0.0389173 13.035 < 2e-16 ***
DayMins 0.0127543 0.0010725 11.892 < 2e-16 ***
DayCalls 0.0036213 0.0027486 1.318 0.188
OverageFee 0.1398147 0.0226568 6.171 6.79e-10 ***
RoamMins 0.0831284 0.0203211 4.091 4.30e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Model 3: AccountWeeks and DayCalls are insignificant in the model, so we remove
these variables and create a new model.
> model3 <- glm(Churn ~ . - MonthlyCharge - DataUsage - AccountWeeks - DayCalls, data = churn_data, family = binomial)
> summary(model3)
Call:
glm(formula = Churn ~ . - MonthlyCharge - DataUsage - AccountWeeks -
DayCalls, family = binomial, data = churn_data)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.9932 -0.5154 -0.3480 -0.2095 2.9906
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -5.552897 0.432757 -12.831 < 2e-16 ***
ContractRenewal1 -1.989219 0.143452 -13.867 < 2e-16 ***
DataPlan1 -0.934814 0.144015 -6.491 8.52e-11 ***
CustServCalls 0.505651 0.038834 13.021 < 2e-16 ***
DayMins 0.012774 0.001073 11.907 < 2e-16 ***
OverageFee 0.138612 0.022648 6.120 9.34e-10 ***
RoamMins 0.083476 0.020304 4.111 3.93e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
vif(model3)
ContractRenewal DataPlan CustServCalls DayMins OverageFee RoamMins
1.056179 1.018476 1.076219 1.039028 1.022948 1.010395
>
summary(model1.train)
Call:
glm(formula = Churn ~ . - MonthlyCharge - DataUsage - AccountWeeks -
DayCalls, family = binomial, data = train)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.9988 -0.5148 -0.3418 -0.1996 2.8680
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -5.91228 0.52113 -11.345 < 2e-16 ***
ContractRenewal1 -1.93645 0.17175 -11.275 < 2e-16 ***
DataPlan1 -0.92904 0.17012 -5.461 4.73e-08 ***
CustServCalls 0.52940 0.04747 11.153 < 2e-16 ***
DayMins 0.01343 0.00129 10.409 < 2e-16 ***
OverageFee 0.14563 0.02717 5.360 8.32e-08 ***
RoamMins 0.08804 0.02406 3.660 0.000252 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> vif(model1.train)
ContractRenewal        DataPlan   CustServCalls         DayMins      OverageFee        RoamMins
       1.059151        1.021584        1.083952        1.050825        1.024412        1.007917
Reference
Prediction 0 1
0 834 123
1 21 22
Accuracy : 0.856
95% CI : (0.8327, 0.8772)
No Information Rate : 0.855
P-Value [Acc > NIR] : 0.4863
Kappa : 0.1796
Sensitivity : 0.1517
Specificity : 0.9754
Pos Pred Value : 0.5116
Neg Pred Value : 0.8715
Prevalence : 0.1450
Detection Rate : 0.0220
Detection Prevalence : 0.0430
Balanced Accuracy : 0.5636
'Positive' Class : 1
Reference
Prediction 0 1
0 720 55
1 135 90
Accuracy : 0.81
95% CI : (0.7843, 0.8339)
No Information Rate : 0.855
P-Value [Acc > NIR] : 1
Kappa : 0.3765
'Positive' Class : 1
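The second matrix's output stops before sensitivity and specificity, but every headline metric can be recomputed from the four cell counts. The sketch below does this for both matrices; `cm_metrics` is a hypothetical helper, and class 1 (churn) is treated as the positive class, as in the output above.

```r
# Hypothetical helper: derive headline metrics from the four cell counts,
# treating class 1 (churn) as the positive class.
cm_metrics <- function(tn, fn, fp, tp) {
  sens <- tp / (tp + fn)
  spec <- tn / (tn + fp)
  c(accuracy    = (tp + tn) / (tp + tn + fp + fn),
    sensitivity = sens,
    specificity = spec,
    ppv         = tp / (tp + fp),
    balanced    = (sens + spec) / 2)
}

# Counts copied from the two confusion matrices above
# (rows = prediction, columns = reference).
first  <- cm_metrics(tn = 834, fn = 123, fp = 21,  tp = 22)
second <- cm_metrics(tn = 720, fn = 55,  fp = 135, tp = 90)

round(rbind(first, second), 4)
```

For the first matrix this reproduces the printed values (sensitivity 0.1517, specificity 0.9754). For the second it gives a sensitivity of about 0.62 at an accuracy of 0.81, so the model with lower accuracy finds roughly four times as many churners.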
ROCRpred = prediction(predict.test,test$Churn)
>
> Auc=as.numeric(performance(ROCRpred, "auc")@y.values)
> Auc
[1] 0.8145513
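The AUC can be read as the probability that a randomly chosen churner receives a higher predicted score than a randomly chosen non-churner. A minimal base-R sketch of that interpretation, using the rank-sum identity on toy data, is given below. `auc_by_rank` is a hypothetical helper and is not how the value above was computed.

```r
# Hypothetical helper: AUC via the rank-sum (Mann-Whitney) identity.
# `score` is the predicted probability of class 1; `label` holds 0/1.
auc_by_rank <- function(score, label) {
  r     <- rank(score)          # average ranks handle tied scores
  n_pos <- sum(label == 1)
  n_neg <- sum(label == 0)
  (sum(r[label == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy data, not taken from the churn model.
score <- c(0.10, 0.40, 0.35, 0.80)
label <- c(0, 0, 1, 1)

# 3 of the 4 (positive, negative) pairs are ordered correctly, so AUC = 0.75.
auc_by_rank(score, label)
```

Because the AUC depends only on how the scores are ranked, it does not change with the classification threshold, which makes it a useful complement to the threshold-dependent confusion matrices.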
Observation:
3. This LR model also suffers from the accuracy paradox, such that if the
threshold probability is decreased from 0.5 to, say, 0.2
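How a threshold turns predicted probabilities into class labels can be sketched as follows. The helper `classify_at` and the probabilities are illustrative assumptions and are not output from the model in this report.

```r
# Hypothetical helper: turn predicted churn probabilities into class labels.
classify_at <- function(prob, threshold = 0.5) {
  factor(as.integer(prob > threshold), levels = c(0, 1))
}

# Toy probabilities, not output from the model above.
prob <- c(0.05, 0.15, 0.25, 0.45, 0.60, 0.90)

table(classify_at(prob, 0.5))  # 2 customers flagged as churners
table(classify_at(prob, 0.2))  # 4 customers flagged as churners
```

Lowering the threshold flags more customers as churners, which tends to raise sensitivity at the cost of more false positives and, with a rare positive class, often a lower overall accuracy.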
4. K-Nearest Neighbour Classifier
install.packages("class")  # only needed if the package is not yet installed
library(class)
?knn
set.seed(1111)
model.knn1= train(Churn~., data = train, method="knn",
trControl= trctl, preProcess = c("center", "scale"),
tuneLength= 10)
model.knn1
knn.pred1=predict(model.knn1,test)
> knn.CM
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 837 87
1 18 58
Accuracy : 0.895
95% CI : (0.8743, 0.9133)
No Information Rate : 0.855
P-Value [Acc > NIR] : 0.0001125
Kappa : 0.4723
Sensitivity : 0.4000
Specificity : 0.9789
Pos Pred Value : 0.7632
Neg Pred Value : 0.9058
Prevalence : 0.1450
Detection Rate : 0.0580
Detection Prevalence : 0.0760
Balanced Accuracy : 0.6895
'Positive' Class : 1
> knn.CM1
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 837 87
1 18 58
Accuracy : 0.895
95% CI : (0.8743, 0.9133)
No Information Rate : 0.855
P-Value [Acc > NIR] : 0.0001125
Kappa : 0.4723
Sensitivity : 0.4000
Specificity : 0.9789
Pos Pred Value : 0.7632
Neg Pred Value : 0.9058
Prevalence : 0.1450
Detection Rate : 0.0580
Detection Prevalence : 0.0760
Balanced Accuracy : 0.6895
'Positive' Class : 1
> knn.CM2
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 840 117
1 15 28
Accuracy : 0.868
95% CI : (0.8454, 0.8884)
No Information Rate : 0.855
P-Value [Acc > NIR] : 0.1301
Kappa : 0.248
Sensitivity : 0.1931
Specificity : 0.9825
Pos Pred Value : 0.6512
Neg Pred Value : 0.8777
Prevalence : 0.1450
Detection Rate : 0.0280
Detection Prevalence : 0.0430
Balanced Accuracy : 0.5878
'Positive' Class : 1
>
Observation
1. The confusion matrix shows a high accuracy of about 90% (0.895) and a
positive predictive value of about 76% for the churn class, which is
good enough for real-world scenarios. Sensitivity is only 40%, however,
so the model can be improved with further tuning.
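The `train()` call above centers and scales the predictors before fitting. The reason is that k-NN relies on distances, so a predictor measured in hundreds (such as DayMins) would otherwise dominate one measured in single digits (such as CustServCalls). A minimal base-R sketch of the idea follows; `knn_by_hand` and the toy data are illustrative assumptions and do not reproduce caret's implementation.

```r
# Hypothetical helper: k-NN with centering and scaling learned on the training set.
knn_by_hand <- function(train_x, train_y, test_x, k = 3) {
  mu  <- colMeans(train_x)
  sdv <- apply(train_x, 2, sd)
  tr  <- scale(train_x, center = mu, scale = sdv)
  te  <- scale(test_x,  center = mu, scale = sdv)  # reuse the training statistics
  pred <- apply(te, 1, function(obs) {
    d  <- sqrt(colSums((t(tr) - obs)^2))     # Euclidean distance to every training row
    nn <- train_y[order(d)[seq_len(k)]]      # labels of the k nearest neighbours
    names(which.max(table(nn)))              # majority vote (ties go to the first level)
  })
  factor(pred, levels = levels(train_y))
}

# Toy data: two well-separated groups whose features live on different scales.
train_x <- cbind(mins  = c(100, 110, 120, 300, 310, 320),
                 calls = c(1, 0, 1, 5, 6, 5))
train_y <- factor(c(0, 0, 0, 1, 1, 1))
test_x  <- cbind(mins = c(105, 315), calls = c(1, 6))

knn_by_hand(train_x, train_y, test_x, k = 3)
```

The test rows are scaled with the training means and standard deviations, so no information from the test set leaks into the preprocessing step.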
5. Naive Bayes Classifier
> library(e1071)
> NB.fit = naiveBayes(Churn~., data = train)
> NB.fit
Call:
naiveBayes.default(x = X, y = Y, laplace = laplace)
A-priori probabilities:
Y
0 1
0.8551222 0.1448778
Conditional probabilities:
AccountWeeks
Y [,1] [,2]
0 101.4090 40.11039
1 101.0799 39.43145
ContractRenewal
Y 0 1
0 0.06716792 0.93283208
1 0.28106509 0.71893491
DataPlan
Y 0 1
0 0.6957393 0.3042607
1 0.8284024 0.1715976
DataUsage
Y [,1] [,2]
0 0.8834687 1.294567
1 0.5587574 1.153906
CustServCalls
Y [,1] [,2]
0 1.456642 1.166870
1 2.242604 1.780296
DayMins
Y [,1] [,2]
0 176.2034 49.65592
1 209.0583 70.06561
DayCalls
Y [,1] [,2]
0 99.91679 19.86590
1 100.94675 20.87259
MonthlyCharge
Y [,1] [,2]
0 56.13193 16.51706
1 59.68225 16.04013
OverageFee
Y [,1] [,2]
0 9.913263 2.540241
1 10.622722 2.558384
RoamMins
Y [,1] [,2]
0 10.12566 2.847187
1 10.77604 2.732145
>
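For numeric predictors, the two columns in each table above are the per-class mean and standard deviation. Naive Bayes combines them with the a-priori probabilities through Bayes' rule. The sketch below works this through for a single predictor, DayMins, using the values printed above. `posterior_daymins` is a hypothetical helper, and ignoring the other nine predictors is a simplification made only for illustration.

```r
# Values copied from the NB.fit printout above.
prior    <- c("0" = 0.8551222, "1" = 0.1448778)
day_mean <- c("0" = 176.2034,  "1" = 209.0583)
day_sd   <- c("0" = 49.65592,  "1" = 70.06561)

# Hypothetical helper: posterior P(Churn | DayMins) using only this one predictor.
posterior_daymins <- function(x) {
  joint <- prior * dnorm(x, mean = day_mean, sd = day_sd)  # prior x Gaussian likelihood
  joint / sum(joint)                                       # normalise over both classes
}

round(posterior_daymins(300), 3)  # heavy daytime user
round(posterior_daymins(150), 3)  # light daytime user
```

With DayMins alone, a customer at 300 minutes has a churn posterior of roughly 0.54, while one at 150 minutes sits below 0.10. The full model multiplies in one such likelihood term per predictor under the assumption that the predictors are conditionally independent given the class.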
Reference
Prediction 0 1
0 828 101
1 27 44
Accuracy : 0.872
95% CI : (0.8497, 0.8921)
No Information Rate : 0.855
P-Value [Acc > NIR] : 0.06738
Kappa : 0.345
Sensitivity : 0.3034
Specificity : 0.9684
Pos Pred Value : 0.6197
Neg Pred Value : 0.8913
Prevalence : 0.1450
Detection Rate : 0.0440
Detection Prevalence : 0.0710
Balanced Accuracy : 0.6359
'Positive' Class : 1
Observation
1. Accuracy is 87%, but the positive predictive value is 62%, which is quite low.
2. Sensitivity, TP / (TP + FN), is just 30%, so the model misses most churners,
another sign that it is unreliable.
6. Conclusion
1. KNN
a. k-NN performs best, with a positive predictive value of about 76% in the
general-case model, where the formula takes all 10 predictors regardless
of whether they are continuous or categorical.
2. Naive Bayes, as fitted here, has no parameters to tune, but k-NN and logistic
regression can be improved by fine-tuning the train control parameters and by
applying up/down sampling for logistic regression to counteract the class
imbalance.
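The down-sampling idea mentioned above can be sketched in base R as follows. The helper `downsample_idx` and the synthetic labels are illustrative assumptions. In practice caret offers `downSample()` and `upSample()`, as well as a `sampling` option in `trainControl()`, which would be the more usual route.

```r
# Synthetic labels with the same class counts as the training set (1995 / 338).
y <- factor(c(rep(0, 1995), rep(1, 338)))

set.seed(1111)  # fixed only for reproducibility

# Hypothetical helper: keep every minority-class row and an equally sized
# random sample of each larger class.
downsample_idx <- function(y) {
  n_min <- min(table(y))
  idx <- lapply(split(seq_along(y), y), function(i) i[sample.int(length(i), n_min)])
  sort(unlist(idx, use.names = FALSE))
}

balanced_y <- y[downsample_idx(y)]
table(balanced_y)
```

Only the training set should be resampled. The test set keeps its original 85.5/14.5 class mix so that the reported metrics still reflect the real churn rate.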