
Mini Project – Telecom Customer Churn

Sravanthi.M

Table of Contents
1. Project Objective
2. Assumptions
3. Exploratory Data Analysis – Step by step approach
   3.1. Environment Set up and Data Import
      3.1.1. Install necessary Packages and Invoke Libraries
      3.1.2. Set up working Directory
      3.1.3. Import and Read the Dataset
   3.2. Variable Identification
4. Conclusion
5. Detailed Explanation of Findings

1. EDA - Exploratory Data Analysis 

1.1 EDA - Basic data summary, Univariate, Bivariate analysis, graphs

1.2 EDA - Check for Outliers and missing values and check the summary of the dataset

1.3 EDA - Check for Multicollinearity - Plot the graph based on Multicollinearity & treat it.

1.4 EDA - Summarize the insights you get from EDA.

2. Build Models and compare them to get to the best one

2.1 Applying Logistic Regression

2.2 Interpret Logistic Regression

2.3 Applying KNN Model

2.4 Interpret KNN Model

2.5 Applying Naive Bayes Model

2.6 Interpret Naive Bayes Model

2.7 Confusion matrix interpretation for all models

2.8 Interpretation of other Model Performance Measures for logistic <KS, AUC, GINI>

2.9 Remarks on Model validation exercise <Which model performed the best>

3. Actionable Insights and Recommendations

4. Source Code
1 Project Objective
The objective of this report is to explore the Telecom Customer Churn dataset in R and generate insights about it. This exploration report will consist of the following:

 Importing the dataset into R
 Understanding the structure of the dataset
 Graphical exploration
 Descriptive statistics
 Insights from the dataset

2 Assumptions

 Based on past data, we need to predict what percentage of customers will cancel their services in the future.
 The data also indicates which customers have already cancelled their services.

3 Exploratory Data Analysis – Step by step approach


A typical data exploration activity consists of the following steps:

1. Environment set-up and data import
2. Univariate analysis
3. Bivariate analysis
4. Check collinearity
5. Build models and compare them for the best one

The dataset has 11 variables used for marketing segmentation in the context of product service management. We shall follow these steps in exploring the provided dataset.

3.1 Environment Set up and Data Import


3.1.1 Install necessary Packages and Invoke Libraries
Use this section to install the necessary packages and invoke the associated libraries. Keeping all the packages in one place increases code readability. For installation we use:

install.packages("package_name")
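
For example, the packages used later in this report can be installed once and then loaded per session (a minimal sketch; the package names are those invoked in the Source Code section):

install.packages(c("funModeling", "tidyverse", "Hmisc", "corrplot", "car", "caTools",
                   "lmtest", "pscl", "caret", "ROCR", "ineq", "class", "e1071"))
library(corrplot)  # load a package for the current session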

3.1.2 Set up working Directory


Setting a working directory at the start of the R session makes importing and exporting data files and code files easier. The working directory is simply the location/folder on the PC where you keep the data, code etc. related to the project. For setting it up and checking it we use the syntax below:

Syntax → setwd() & getwd()
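
For example (a minimal sketch; the path is the one used in the Source Code section and should be replaced with your own project folder):

setwd("D:/College Data/Predictive Modeling/Project 4")  # point R at the project folder
getwd()                                                 # confirm the current working directory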

Please refer to the Source Code section for the code.


3.1.3 Import and Read the Dataset
The given dataset is in .csv format. Hence, the command ‘read.csv’ is used for importing the
file.

Please refer to the Source Code section for the code.
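
A minimal sketch (the file name is the one used in the Source Code section; header = TRUE tells R that the first row contains column names):

data = read.csv("Cellphone.csv", header = TRUE)  # import the churn dataset
head(data)                                       # preview the first six rows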

3.2 Variable Identification


We are using:

 setwd(): for setting the working directory

 getwd(): returns an absolute file path representing the current working directory

 dim(): returns the dimensions (the number of rows and columns)

 str(): displays the internal structure of the dataset, variable by variable

 names(): to find the names of the columns

 summary(): a generic function used to produce result summaries; it invokes particular methods depending on the class of its first argument

 attach(): to attach the dataset so that columns can be referenced by name

 hist(): to plot a histogram

 boxplot(): to plot a boxplot
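
For example, a minimal inspection sequence on the imported data object:

dim(data)      # number of rows and columns
names(data)    # column names
str(data)      # structure: type and sample values of each variable
summary(data)  # five-number summary of every column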

4 Conclusion
From the given problem, we found that all the models achieve comparable predictive accuracy of about 69.7% and are nearly identical in terms of results. Given this, I would recommend the logistic model, since it is much more interpretable than the other models.
5 Detailed Explanation of Findings
1. EDA - Exploratory Data Analysis 
1.1 EDA - Basic data summary, Univariate, Bivariate analysis, graphs
1.2 EDA - Check for Outliers and missing values and check the summary of the dataset
1.3 EDA - Check for Multicollinearity - Plot the graph based on Multicollinearity & treat it.
1.4 EDA - Summarize the insights you get from EDA.

Ans: str(): It is used to find out the structure of the data. It is an alternative way to display a summary of the data, showing the basic structure of each variable.

hist(): A histogram represents the frequencies of a variable's values, grouped into ranges.

boxplot():

Range: if you are interested in the spread of all the data, it is represented on a boxplot by the distance between the smallest value and the largest value, including any outliers.

cor(): computes the correlation between variables; applied to the whole dataset it returns the correlation matrix.

 We need to check for missing values.

Syntax: sum(is.na(data))
This returns the total number of missing values in the dataset.
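
A minimal sketch, counting missing values overall and per column:

sum(is.na(data))      # total number of NA values in the dataset
colSums(is.na(data))  # NA count for each column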

summary():
Below is the summary of our given dataset.

Basic EDA is performed on the given dataset and below is the output:

Univariate Analysis:
We check the summary of each variable, and by plotting boxplots we check for outliers in each variable.

 AccountWeeks has a few outliers.

 DataUsage also has some outliers.

 CustServCalls also has some outliers.

 DayMins also has outliers.

 DayCalls also has outliers.

 MonthlyCharge also has outliers.

 OverageFee also has outliers.

 RoamMins has outliers.
## Need to check for multicollinearity
 Is there evidence of multicollinearity? Showcase your analysis.
 First, we create the correlation matrix and plot the correlations for the Telecom Customer Churn dataset.
 Then we check the multicollinearity of the independent variables using VIF, as shown in the sketch below.
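
A minimal sketch of both checks (corrplot and car packages, as in the Source Code section; data_num is the all-numeric copy of the data defined there):

library(corrplot)
library(car)
cor_data = cor(data_num)               # pairwise correlations of the numeric variables
corrplot(cor_data, method = "number")  # plot the correlation matrix with coefficients
vif(lm(Churn ~ ., data = data_num))    # VIF values well above 5-10 flag multicollinearity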

2. Build Models and compare them to get to the best one
2.1 Applying Logistic Regression
2.2 Interpret Logistic Regression
2.3 Applying KNN Model
2.4 Interpret KNN Model
2.5 Applying Naive Bayes Model
2.6 Interpret Naive Bayes Model
2.7 Confusion matrix interpretation for all models
2.8 Interpretation of other Model Performance Measures for logistic <KS, AUC, GINI>
2.9 Remarks on Model validation exercise <Which model performed the best>

Ans: Telecom churn prediction consists of detecting which customers are likely to cancel a subscription to a service based on how they use the service. We want to predict the answer to the following question, asked for each current customer: “Is this customer going to leave us within the next few months?” There are only two possible answers, yes or no, which makes this a binary classification task. For that we will apply Logistic Regression, a KNN model and a Naive Bayes model. Before starting, we need to split the data into training and test sets.

Logistic Regression:
For the given model, we perform six steps in logistic regression to assess the overall validity of the model:
1. Log likelihood Test
2. McFadden Rsq
3. Individual Slopes Significance Test
4. Explanatory Power of odds
5. Classification / Confusion Matrix
6. ROC Curve

Log likelihood Test:

The logarithm is a monotonically increasing function: if the value on the x-axis increases, the value on the y-axis also increases. This matters because it guarantees that the maximum of the log-likelihood occurs at the same point as the maximum of the original likelihood function.
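
A minimal sketch (lrtest() from the lmtest package, as in the Source Code section, compares the fitted model against the intercept-only model):

library(lmtest)
lrtest(logit)  # a small p-value means the model is significantly better than the null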

McFadden Rsq:
As a rule of thumb, a McFadden pseudo R-squared between 0.2 and 0.4 indicates a very good model fit.
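
A minimal sketch (pR2() from the pscl package, as in the Source Code section):

library(pscl)
pR2(logit)  # the 'McFadden' entry is the pseudo R-squared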

Individual Slopes Significance Test:
summary(logit) reports a z-statistic and p-value for each coefficient; predictors with p-values below 0.05 are individually significant.

Explanatory Power of odds:
Exponentiating the logistic coefficients gives the odds ratios: the multiplicative change in the odds of churn for a one-unit increase in each predictor.
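
A minimal sketch (using the fitted logit object from the Source Code section):

round(exp(coef(logit)), 2)  # odds ratios; values above 1 raise the odds of churn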

Classification / Confusion Matrix:

A confusion matrix is a table that is often used to describe the performance of a classification model
on a set of test data for which the true values are known.
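
A minimal sketch (confusionMatrix() from the caret package; the 0.14 cut-off is the one chosen for the training data in the Source Code section):

library(caret)
pred = ifelse(predict(logit, type = "response") < 0.14, 0, 1)  # classify training observations
confusionMatrix(table(Actual = train$Churn_fact, Predicted = pred))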

KNN:
The k-nearest-neighbours classifier assigns each observation the majority class among its k nearest training observations; because it is distance-based, the predictors are first min-max normalised (k = 19 is used in the Source Code section).
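
A minimal sketch of the core call (knn() from the class package, on the normalised 70/30 split built in the Source Code section; column 1 of the normalised data holds the Churn_fact labels):

library(class)
knn.pred = knn(train = data.matrix(norm.train[, -1]),  # predictor columns only
               test  = data.matrix(norm.test[, -1]),
               cl    = norm.train[, 1],                # training labels
               k     = 19)
table(Actual = norm.test$Churn_fact, Predicted = knn.pred)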

3. Actionable Insights and Recommendations

Ans: Customers that have signed up recently on a month-to-month contract with a single telephone line and who pay with an alternative method to electronic check are the most likely to churn. Resources should be focused on these customers to move them to products that are indicators of brand loyalty. Marketing and retention teams should prioritise the following products in descending order of importance:

1. Two-year contract
2. One-year contract
3. Paperless billing
4. Payment by electronic check
5. A second telephone line

4. Source code

## Setting up working directory and getting working directory

setwd("D:/College Data/Predictive Modeling/Project 4")
getwd()  # confirm the current working directory


data=read.csv("Cellphone.csv",header=TRUE)
str(data)
attach(data)      # allow columns to be referenced by name
library(Hmisc)    # provides hist() for whole data frames (hist.data.frame)
hist(data)        # histogram of every numeric variable
boxplot(data)     # boxplot of every variable
cor(data)         # correlation matrix of all numeric columns
cor(Churn,ContractRenewal)
## Convert the binary variables to factors
data$Churn_fact=as.factor(data$Churn)
data$ContractRenewal_fact=as.factor(data$ContractRenewal)
data$DataPlan_fact=as.factor(data$DataPlan)
summary(data)
sum(is.na(data))  # total count of missing values
str(data)

data_fact=data[,-c(1,3,4)]   # drop the numeric copies of Churn, ContractRenewal, DataPlan
data_num=data[,-c(12,13,14)] # drop the factor copies, keeping everything numeric
str(data_fact)
str(data_num)

###Basic EDA
library(funModeling)
library(tidyverse)
library(Hmisc)

basic_eda <- function(data)
{
  print(summary(data))        # five-number summaries
  df_status(data)             # funModeling: type, zeros and NAs per column
  freq(data)                  # funModeling: frequency tables for categorical columns
  print(profiling_num(data))  # funModeling: detailed numeric profiling
  hist(data)                  # Hmisc: histograms for every numeric column
  print(describe(data))       # Hmisc: descriptive statistics
}
basic_eda(data)

#Univariate analysis
summary(AccountWeeks)
boxplot(AccountWeeks,main = "Accountweeks",col = "orange",border = "blue")
summary(DataUsage)
boxplot(DataUsage,main = "Datausage",col = "Green",border = "Red")
summary(CustServCalls)
boxplot(CustServCalls,main = "custsercalls", col = "Pink",border = "blue")
summary(DayMins)
boxplot(DayMins,main = "DayMins",col = "orange",border = "blue")
summary(DayCalls)
boxplot(DayCalls,main = "Daycalls",col = "orange",border = "blue")
summary(MonthlyCharge)
boxplot(MonthlyCharge,main = "Monthlycharge",col = "orange",border = "blue")
summary(OverageFee)
boxplot(OverageFee,main = "Overagefee",col = "orange",border = "blue")
summary(RoamMins)
boxplot(RoamMins,main = "RoamMins",col = "orange",border = "blue")

summary(Churn)

hist(DataUsage)
hist(CustServCalls)
hist(MonthlyCharge)

plot(DataUsage,MonthlyCharge)
plot(MonthlyCharge,DayMins)
boxplot(data)
sum(is.na(data))     # re-check for missing values
data=na.omit(data)   # drop any rows with missing values
names(data)

#collinearity
str(data)
library(corrplot)
cor_data=cor(data[,-c(12,13,14)])
corrplot(cor_data , method='number')
library(car)

lm(Churn~.,data=data_num)
vif(lm(Churn~.,data=data_num))

vif(glm(Churn_fact~.,data=data_fact,family=binomial))
summary(glm(ContractRenewal_fact~.,data=data_fact,family=binomial))

summary(glm(Churn_fact~AccountWeeks+DataUsage+CustServCalls+DayMins+DayCalls+
              MonthlyCharge+OverageFee+RoamMins+ContractRenewal_fact,
            data=data_fact,family=binomial))

vif(glm(Churn_fact~AccountWeeks+DataUsage+CustServCalls+DayMins+DayCalls+
          MonthlyCharge+OverageFee+RoamMins+ContractRenewal_fact,
        data=data_fact,family=binomial))

cor(data_num)

library(caTools)
set.seed(1234)  # fix the seed before splitting so the partition is reproducible
sample = sample.split(data_fact$Churn_fact, SplitRatio = .70)
train = subset(data_fact, sample == TRUE)   # 70% training set
test = subset(data_fact, sample == FALSE)   # 30% test set
####Logistic Regression
# 6 steps in logistic

# Step 1: Overall Validity of the Model #

# 1) Log likelihood Test

logit=glm(Churn_fact~AccountWeeks+CustServCalls+DayCalls+RoamMins+ContractRenewal_fact,
          data=train,family=binomial)
summary(logit)

library(car)
vif(logit)
library(lmtest)
lrtest(logit)

#step 2: McFadden Rsq


options(scipen=999)
library(pscl)
print(pR2(logit))
#step 3: Indiviudal Slopes Significance Test
summary(logit)
#Estimates for the variables
print(logit)

#Step 4: Explanatory Power of odds

round(exp(coef(logit)),2)

probability_Scores=format(exp(coef(logit))/(exp(coef(logit))+1),scientific = FALSE)

print(probability_Scores)

#step 5 : Classification / Confusion Matrix

Pred.logit=predict(logit,type = "response")   # fitted probabilities on the training data
summary(Pred.logit)
summary(train$Churn_fact)
plot(train$Churn_fact,Pred.logit)
Pred.logit=ifelse(Pred.logit<0.14,0,1)        # classify with a 0.14 cut-off
table(Actual=train$Churn_fact,Predicted=Pred.logit)  # confusion matrix against actual churn

#step 6: ROC Curve

library(Deducer)
rocplot(logit)

library(caret)
##Testing the model on the Test Data

Pred.logit.test=predict(logit,type = "response",newdata = test)


summary(Pred.logit.test)
summary(test$Churn_fact)
plot(test$Churn_fact,Pred.logit.test)
Pred.logit.test.factor=ifelse(Pred.logit.test<0.20,0,1)
confusionMatrix(table(Actual=test$Churn_fact,Pred.logit.test.factor))

library(blorr) # to build and validate binary logistic models

blr_step_aic_both(logit, details = FALSE)

#ModelPerformanceParameter
#Train
train$prediction = predict(logit, train, type="response")
library(ROCR)
library(ineq)
predObj = prediction(train$prediction, train$Churn_fact)
perf = performance(predObj, "tpr", "fpr")
plot(perf)
KS = max(perf@y.values[[1]]-perf@x.values[[1]])  # Kolmogorov-Smirnov statistic
auc = performance(predObj,"auc")
auc = as.numeric(auc@y.values)                   # area under the ROC curve
gini = ineq(train$prediction, type="Gini")       # Gini coefficient of the predicted scores
KS; auc; gini                                    # display the three measures

#Test
test$prediction = predict(logit, test, type="response")
library(ROCR)
library(ineq)
predObj = prediction(test$prediction, test$Churn_fact)
perf = performance(predObj, "tpr", "fpr")
plot(perf)
KS = max(perf@y.values[[1]]-perf@x.values[[1]])  # Kolmogorov-Smirnov statistic
auc = performance(predObj,"auc")
auc = as.numeric(auc@y.values)                   # area under the ROC curve
gini = ineq(test$prediction, type="Gini")        # Gini coefficient of the predicted scores
KS; auc; gini                                    # display the three measures

#Use KNN Classifier


#normalize the test & train data

norm=function(x){(x-min(x))/(max(x)-min(x))}                  # min-max scaling to [0, 1]
norm.data=as.data.frame(lapply(data_fact[,-c(9,10,11)],norm)) # normalise the numeric columns
norm.data_fact=cbind(data_fact[,c(9,10,11)],norm.data)        # re-attach the factor columns

#split the normalized dataset


library(caTools)
sample = sample.split(norm.data_fact$Churn_fact, SplitRatio = .70)
norm.train = subset(norm.data_fact, sample == TRUE)
norm.test = subset(norm.data_fact, sample == FALSE)

library(class)
# knn() needs numeric inputs; data.matrix() converts the remaining factor columns to numeric codes
knn.pred = knn(data.matrix(norm.train[,-1]), data.matrix(norm.test[,-1]), norm.train[,1], k = 19)

table.knn = table(norm.test$Churn_fact, knn.pred)


table.knn
sum(diag(table.knn)/sum(table.knn))
confusionMatrix(table.knn)

#Naive Bayes

library(e1071)
NB = naiveBayes(Churn_fact~AccountWeeks+CustServCalls+DayCalls+RoamMins+ContractRenewal_fact,
                data = train)
predNB = predict(NB, test, type = "class")
tab.NB = table(test$Churn_fact, predNB)  # actual churn (column 9 of data_fact) vs predictions
sum(diag(tab.NB)/sum(tab.NB))            # overall accuracy
confusionMatrix(tab.NB)
tab.NB
