
Mini Project – Telecom Customer Churn

Sravanthi.M

Table of Contents
1. Project Objective
2. Assumptions
3. Exploratory Data Analysis – Step by step approach
   3.1. Environment Set up and Data Import
      3.1.1. Install necessary Packages and Invoke Libraries
      3.1.2. Set up working Directory
      3.1.3. Import and Read the Dataset
   3.2. Variable Identification
4. Conclusion
5. Detailed Explanation of Findings

1. EDA - Exploratory Data Analysis 

1.1 EDA - Basic data summary, Univariate, Bivariate analysis, graphs

1.2 EDA - Check for Outliers and missing values and check the summary of the dataset

1.3 EDA - Check for Multicollinearity - Plot the graph based on Multicollinearity & treat it.

1.4 EDA - Summarize the insights you get from EDA.

2. Build Models and compare them to get to the best one

2.1 Applying Logistic Regression

2.2 Interpret Logistic Regression

2.3 Applying KNN Model

2.4 Interpret KNN Model

2.5 Applying Naive Bayes Model

2.6 Interpret Naive Bayes Model

2.7 Confusion matrix interpretation for all models

2.8 Interpretation of other Model Performance Measures for logistic <KS, AUC, GINI>

2.9 Remarks on Model validation exercise <Which model performed the best>

3. Actionable Insights and Recommendations

4. Source Code
1 Project Objective
The objective of this report is to explore the Telecom Customer Churn dataset in R and generate insights about it. This exploration report will consist of the following:

 Importing the dataset into R
 Understanding the structure of the dataset
 Graphical exploration
 Descriptive statistics
 Insights from the dataset

2 Assumptions

 Based on past data, we need to predict what percentage of customers will cancel their services in the future.
 The data also indicates which customers have already cancelled their services.

3 Exploratory Data Analysis – Step by step approach


A typical data exploration activity consists of the following steps:

1. Environment set-up and data import
2. Univariate analysis
3. Bivariate analysis
4. Check collinearity
5. Build models and compare them for the best one

The dataset has 11 variables used for marketing segmentation in the context of product service management. We shall follow these steps in exploring the provided dataset.

3.1 Environment Set up and Data Import


3.1.1 Install necessary Packages and Invoke Libraries
Use this section to install the necessary packages and invoke the associated libraries. Keeping all the packages in one place increases code readability. For installation we use:

install.packages("package_name")
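
For example, the packages used later in this report can be installed once and then loaded per session (a minimal sketch; the package names are those invoked in the Source Code section):

install.packages(c("funModeling", "tidyverse", "Hmisc", "corrplot", "car", "caTools",
                   "lmtest", "pscl", "caret", "ROCR", "ineq", "class", "e1071"))
library(corrplot)  # load a package for the current session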

3.1.2 Set up working Directory


Setting a working directory at the start of the R session makes importing and exporting data files and code files easier. The working directory is simply the location/folder on the PC where you keep the data, code etc. related to the project. For setting it up and checking it we use the syntax below:

Syntax → setwd() & getwd()
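
For example (a minimal sketch; the path is the one used in the Source Code section and should be replaced with your own project folder):

setwd("D:/College Data/Predictive Modeling/Project 4")  # point R at the project folder
getwd()                                                 # confirm the current working directory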

Please refer to the Source Code section for the code.


3.1.3 Import and Read the Dataset
The given dataset is in .csv format. Hence, the command ‘read.csv’ is used for importing the
file.

Please refer to the Source Code section for the code.
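
A minimal sketch (the file name is the one used in the Source Code section; header = TRUE tells R that the first row contains column names):

data = read.csv("Cellphone.csv", header = TRUE)  # import the churn dataset
head(data)                                       # preview the first six rows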

3.2 Variable Identification


We are using:

 setwd(): for setting the working directory

 getwd(): returns an absolute file path representing the current working directory

 dim(): returns the dimensions (the number of rows and columns)

 str(): displays the internal structure of the dataset, variable by variable

 names(): to find the names of the columns

 summary(): a generic function used to produce result summaries; it invokes particular methods depending on the class of its first argument

 attach(): to attach the dataset so that columns can be referenced by name

 hist(): to plot a histogram

 boxplot(): to plot a boxplot
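
For example, a minimal inspection sequence on the imported data object:

dim(data)      # number of rows and columns
names(data)    # column names
str(data)      # structure: type and sample values of each variable
summary(data)  # five-number summary of every column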

4 Conclusion
From the given problem, we found that all the models achieve comparable predictive accuracy of about 69.7% and are nearly identical in terms of results. Given this, I would recommend the logistic model, since it is much more interpretable than the other models.
5 Detailed Explanation of Findings
1. EDA - Exploratory Data Analysis 
1.1 EDA - Basic data summary, Univariate, Bivariate analysis, graphs
1.2 EDA - Check for Outliers and missing values and check the summary of the dataset
1.3 EDA - Check for Multicollinearity - Plot the graph based on Multicollinearity & treat it.
1.4 EDA - Summarize the insights you get from EDA.

Ans: str(): It is used to find out the structure of the data. It is an alternative way to display a summary of the data, showing the basic structure of each variable.

hist(): A histogram represents the frequencies of a variable's values, grouped into ranges.

boxplot():

Range: if you are interested in the spread of all the data, it is represented on a boxplot by the distance between the smallest value and the largest value, including any outliers.

cor(): computes the correlation between variables; applied to the whole dataset it returns the correlation matrix.

 We need to check for missing values.

Syntax: sum(is.na(data))
This returns the total number of missing values in the dataset.
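
A minimal sketch, counting missing values overall and per column:

sum(is.na(data))      # total number of NA values in the dataset
colSums(is.na(data))  # NA count for each column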

summary():
Below is the summary of our given dataset.

Basic EDA is performed on the given dataset and below is the output:

Univariate Analysis:
We check the summary of each variable, and by plotting boxplots we check for outliers in each variable.

 AccountWeeks has a few outliers.

 DataUsage also has some outliers.

 CustServCalls also has some outliers.

 DayMins also has outliers.

 DayCalls also has outliers.

 MonthlyCharge also has outliers.

 OverageFee also has outliers.

 RoamMins has outliers.
## Need to check for multicollinearity
 Is there evidence of multicollinearity? Showcase your analysis.
 First, we create the correlation matrix and plot the correlations for the Telecom Customer Churn dataset.
 Then we check the multicollinearity of the independent variables using VIF, as shown in the sketch below.
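
A minimal sketch of both checks (corrplot and car packages, as in the Source Code section; data_num is the all-numeric copy of the data defined there):

library(corrplot)
library(car)
cor_data = cor(data_num)               # pairwise correlations of the numeric variables
corrplot(cor_data, method = "number")  # plot the correlation matrix with coefficients
vif(lm(Churn ~ ., data = data_num))    # VIF values well above 5-10 flag multicollinearity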

2. Build Models and compare them to get to the best one
2.1 Applying Logistic Regression
2.2 Interpret Logistic Regression
2.3 Applying KNN Model
2.4 Interpret KNN Model
2.5 Applying Naive Bayes Model
2.6 Interpret Naive Bayes Model
2.7 Confusion matrix interpretation for all models
2.8 Interpretation of other Model Performance Measures for logistic <KS, AUC, GINI>
2.9 Remarks on Model validation exercise <Which model performed the best>

Ans: Telecom churn prediction consists of detecting which customers are likely to cancel a subscription to a service based on how they use the service. We want to predict the answer to the following question, asked for each current customer: “Is this customer going to leave us within the next few months?” There are only two possible answers, yes or no, which makes this a binary classification task. For that we will apply Logistic Regression, a KNN model and a Naive Bayes model. Before starting, we need to split the data into training and test sets.

Logistic Regression:
For the given model, we perform six steps in logistic regression to assess the overall validity of the model:
1. Log likelihood Test
2. McFadden Rsq
3. Individual Slopes Significance Test
4. Explanatory Power of odds
5. Classification / Confusion Matrix
6. ROC Curve

Log likelihood Test:

The logarithm is a monotonically increasing function: if the value on the x-axis increases, the value on the y-axis also increases. This matters because it guarantees that the maximum of the log-likelihood occurs at the same point as the maximum of the original likelihood function.
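
A minimal sketch (lrtest() from the lmtest package, as in the Source Code section, compares the fitted model against the intercept-only model):

library(lmtest)
lrtest(logit)  # a small p-value means the model is significantly better than the null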

McFadden Rsq:
As a rule of thumb, a McFadden pseudo R-squared between 0.2 and 0.4 indicates a very good model fit.
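
A minimal sketch (pR2() from the pscl package, as in the Source Code section):

library(pscl)
pR2(logit)  # the 'McFadden' entry is the pseudo R-squared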

Individual Slopes Significance Test:
summary(logit) reports a z-statistic and p-value for each coefficient; predictors with p-values below 0.05 are individually significant.

Explanatory Power of odds:
Exponentiating the logistic coefficients gives the odds ratios: the multiplicative change in the odds of churn for a one-unit increase in each predictor.
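
A minimal sketch (using the fitted logit object from the Source Code section):

round(exp(coef(logit)), 2)  # odds ratios; values above 1 raise the odds of churn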

Classification / Confusion Matrix:

A confusion matrix is a table that is often used to describe the performance of a classification model
on a set of test data for which the true values are known.
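
A minimal sketch (confusionMatrix() from the caret package; the 0.14 cut-off is the one chosen for the training data in the Source Code section):

library(caret)
pred = ifelse(predict(logit, type = "response") < 0.14, 0, 1)  # classify training observations
confusionMatrix(table(Actual = train$Churn_fact, Predicted = pred))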

KNN:
The k-nearest-neighbours classifier assigns each observation the majority class among its k nearest training observations; because it is distance-based, the predictors are first min-max normalised (k = 19 is used in the Source Code section).
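
A minimal sketch of the core call (knn() from the class package, on the normalised 70/30 split built in the Source Code section; column 1 of the normalised data holds the Churn_fact labels):

library(class)
knn.pred = knn(train = data.matrix(norm.train[, -1]),  # predictor columns only
               test  = data.matrix(norm.test[, -1]),
               cl    = norm.train[, 1],                # training labels
               k     = 19)
table(Actual = norm.test$Churn_fact, Predicted = knn.pred)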

3. Actionable Insights and Recommendations

Ans: Customers that have signed up recently on a month-to-month contract with a single telephone line and who pay with an alternative method to electronic check are the most likely to churn. Resources should be focused on these customers to move them to products that are indicators of brand loyalty. Marketing and retention teams should prioritise the following products in descending order of importance:

1. Two-year contract
2. One-year contract
3. Paperless billing
4. Payment by electronic check
5. A second telephone line

4. Source code

## Setting up working directory and getting working directory

setwd("D:/College Data/Predictive Modeling/Project 4")
getwd()  # confirm the current working directory


data=read.csv("Cellphone.csv",header=TRUE)
str(data)
attach(data)      # allow columns to be referenced by name
library(Hmisc)    # provides hist() for whole data frames (hist.data.frame)
hist(data)        # histogram of every numeric variable
boxplot(data)     # boxplot of every variable
cor(data)         # correlation matrix of all numeric columns
cor(Churn,ContractRenewal)
## Convert the binary variables to factors
data$Churn_fact=as.factor(data$Churn)
data$ContractRenewal_fact=as.factor(data$ContractRenewal)
data$DataPlan_fact=as.factor(data$DataPlan)
summary(data)
sum(is.na(data))  # total count of missing values
str(data)

data_fact=data[,-c(1,3,4)]   # drop the numeric copies of Churn, ContractRenewal, DataPlan
data_num=data[,-c(12,13,14)] # drop the factor copies, keeping everything numeric
str(data_fact)
str(data_num)

###Basic EDA
library(funModeling)
library(tidyverse)
library(Hmisc)

basic_eda <- function(data)
{
  print(summary(data))        # five-number summaries
  df_status(data)             # funModeling: type, zeros and NAs per column
  freq(data)                  # funModeling: frequency tables for categorical columns
  print(profiling_num(data))  # funModeling: detailed numeric profiling
  hist(data)                  # Hmisc: histograms for every numeric column
  print(describe(data))       # Hmisc: descriptive statistics
}
basic_eda(data)

#Univariate analysis
summary(AccountWeeks)
boxplot(AccountWeeks,main = "Accountweeks",col = "orange",border = "blue")
summary(DataUsage)
boxplot(DataUsage,main = "Datausage",col = "Green",border = "Red")
summary(CustServCalls)
boxplot(CustServCalls,main = "custsercalls", col = "Pink",border = "blue")
summary(DayMins)
boxplot(DayMins,main = "DayMins",col = "orange",border = "blue")
summary(DayCalls)
boxplot(DayCalls,main = "Daycalls",col = "orange",border = "blue")
summary(MonthlyCharge)
boxplot(MonthlyCharge,main = "Monthlycharge",col = "orange",border = "blue")
summary(OverageFee)
boxplot(OverageFee,main = "Overagefee",col = "orange",border = "blue")
summary(RoamMins)
boxplot(RoamMins,main = "RoamMins",col = "orange",border = "blue")

summary(Churn)

hist(DataUsage)
hist(CustServCalls)
hist(MonthlyCharge)

plot(DataUsage,MonthlyCharge)
plot(MonthlyCharge,DayMins)
boxplot(data)
sum(is.na(data))     # re-check for missing values
data=na.omit(data)   # drop any rows with missing values
names(data)

#collinearity
str(data)
library(corrplot)
cor_data=cor(data[,-c(12,13,14)])
corrplot(cor_data , method='number')
library(car)

lm(Churn~.,data=data_num)
vif(lm(Churn~.,data=data_num))

vif(glm(Churn_fact~.,data=data_fact,family=binomial))
summary(glm(ContractRenewal_fact~.,data=data_fact,family=binomial))

summary(glm(Churn_fact~AccountWeeks+DataUsage+CustServCalls+DayMins+DayCalls+
              MonthlyCharge+OverageFee+RoamMins+ContractRenewal_fact,
            data=data_fact,family=binomial))

vif(glm(Churn_fact~AccountWeeks+DataUsage+CustServCalls+DayMins+DayCalls+
          MonthlyCharge+OverageFee+RoamMins+ContractRenewal_fact,
        data=data_fact,family=binomial))

cor(data_num)

library(caTools)
set.seed(1234)  # fix the seed before splitting so the partition is reproducible
sample = sample.split(data_fact$Churn_fact, SplitRatio = .70)
train = subset(data_fact, sample == TRUE)   # 70% training set
test = subset(data_fact, sample == FALSE)   # 30% test set
####Logistic Regression
# 6 steps in logistic

# Step 1: Overall Validity of the Model #

# 1) Log likelihood Test

logit=glm(Churn_fact~AccountWeeks+CustServCalls+DayCalls+RoamMins+ContractRenewal_fact,
          data=train,family=binomial)
summary(logit)

library(car)
vif(logit)
library(lmtest)
lrtest(logit)

#step 2: McFadden Rsq


options(scipen=999)
library(pscl)
print(pR2(logit))
#step 3: Indiviudal Slopes Significance Test
summary(logit)
#Estimates for the variables
print(logit)

#Step 4: Explanatory Power of odds

round(exp(coef(logit)),2)

probability_Scores=format(exp(coef(logit))/(exp(coef(logit))+1),scientific = FALSE)

print(probability_Scores)

#step 5 : Classification / Confusion Matrix

Pred.logit=predict(logit,type = "response")   # fitted probabilities on the training data
summary(Pred.logit)
summary(train$Churn_fact)
plot(train$Churn_fact,Pred.logit)
Pred.logit=ifelse(Pred.logit<0.14,0,1)        # classify with a 0.14 cut-off
table(Actual=train$Churn_fact,Predicted=Pred.logit)  # confusion matrix against actual churn

#step 6: ROC Curve

library(Deducer)
rocplot(logit)

library(caret)
##Testing the model on the Test Data

Pred.logit.test=predict(logit,type = "response",newdata = test)


summary(Pred.logit.test)
summary(test$Churn_fact)
plot(test$Churn_fact,Pred.logit.test)
Pred.logit.test.factor=ifelse(Pred.logit.test<0.20,0,1)
confusionMatrix(table(Actual=test$Churn_fact,Pred.logit.test.factor))

library(blorr) # to build and validate binary logistic models

blr_step_aic_both(logit, details = FALSE)

#ModelPerformanceParameter
#Train
train$prediction = predict(logit, train, type="response")
library(ROCR)
library(ineq)
predObj = prediction(train$prediction, train$Churn_fact)
perf = performance(predObj, "tpr", "fpr")
plot(perf)
KS = max(perf@y.values[[1]]-perf@x.values[[1]])  # Kolmogorov-Smirnov statistic
auc = performance(predObj,"auc")
auc = as.numeric(auc@y.values)                   # area under the ROC curve
gini = ineq(train$prediction, type="Gini")       # Gini coefficient of the predicted scores
KS; auc; gini                                    # display the three measures

#Test
test$prediction = predict(logit, test, type="response")
library(ROCR)
library(ineq)
predObj = prediction(test$prediction, test$Churn_fact)
perf = performance(predObj, "tpr", "fpr")
plot(perf)
KS = max(perf@y.values[[1]]-perf@x.values[[1]])  # Kolmogorov-Smirnov statistic
auc = performance(predObj,"auc")
auc = as.numeric(auc@y.values)                   # area under the ROC curve
gini = ineq(test$prediction, type="Gini")        # Gini coefficient of the predicted scores
KS; auc; gini                                    # display the three measures

#Use KNN Classifier


#normalize the test & train data

norm=function(x){(x-min(x))/(max(x)-min(x))}                  # min-max scaling to [0, 1]
norm.data=as.data.frame(lapply(data_fact[,-c(9,10,11)],norm)) # normalise the numeric columns
norm.data_fact=cbind(data_fact[,c(9,10,11)],norm.data)        # re-attach the factor columns

#split the normalized dataset


library(caTools)
sample = sample.split(norm.data_fact$Churn_fact, SplitRatio = .70)
norm.train = subset(norm.data_fact, sample == TRUE)
norm.test = subset(norm.data_fact, sample == FALSE)

library(class)
# knn() needs numeric inputs; data.matrix() converts the remaining factor columns to numeric codes
knn.pred = knn(data.matrix(norm.train[,-1]), data.matrix(norm.test[,-1]), norm.train[,1], k = 19)

table.knn = table(norm.test$Churn_fact, knn.pred)


table.knn
sum(diag(table.knn)/sum(table.knn))
confusionMatrix(table.knn)

#Naive Bayes

library(e1071)
NB = naiveBayes(Churn_fact~AccountWeeks+CustServCalls+DayCalls+RoamMins+ContractRenewal_fact,
                data = train)
predNB = predict(NB, test, type = "class")
tab.NB = table(test$Churn_fact, predNB)  # actual churn (column 9 of data_fact) vs predictions
sum(diag(tab.NB)/sum(tab.NB))            # overall accuracy
confusionMatrix(tab.NB)
tab.NB
