Sravanthi.M
Table of Contents
1. Project Objective
2. Assumptions
3. Exploratory Data Analysis – Step by step approach
3.1. Environment Set up and Data Import
3.1.1. Install necessary Packages and Invoke Libraries
3.1.2. Set up working Directory
3.1.3. Import and Read the Dataset
3.2. Variable Identification
4. Conclusion
5. Detailed Explanation of Findings
1.2 EDA - Check for Outliers and missing values and check the summary of the dataset
1.3 EDA - Check for Multicollinearity - Plot the graph based on Multicollinearity & treat it
2. Build Models and compare them to get to the best one
2.8 Interpretation of other Model Performance Measures for logistic <KS, AUC, GINI>
2.9 Remarks on Model validation exercise <Which model performed the best>
4. Source Code
1 Project Objective
The objective of this report is to explore the Telecom Customer Churn data set in R and generate insights about it. The exploration covers environment set-up and data import, variable identification, exploratory data analysis, model building and comparison, and actionable recommendations.
2 Assumptions
Based on the past data, we need to predict what percentage of customers will cancel their services in the future.
The data also indicates which customers have cancelled their services.
install.packages("package name"): installs the named package from CRAN.
getwd(): returns an absolute file path representing the current working directory.
dim(): returns the dimensions of an object (e.g. the number of rows and columns).
summary(): a generic function used to produce result summaries of various model fitting functions; it invokes particular methods which depend on the class of the first argument.
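The set-up steps above can be sketched as follows. The file name and working-directory path are assumptions (the report does not state them), so a small simulated data frame is used to keep the sketch runnable:

```r
# Sketch of environment set-up and data import; file name and path are assumed.
# install.packages("corrplot")        # run once per machine
# setwd("~/projects/telecom-churn")   # set the working directory (path assumed)
# data <- read.csv("Cellphone.csv")   # import and read the dataset (name assumed)

# Simulated stand-in for the churn data so the remaining calls run as-is:
set.seed(1)
data <- data.frame(
  Churn         = rbinom(50, 1, 0.15),  # 1 = customer cancelled
  AccountWeeks  = rpois(50, 100),
  CustServCalls = rpois(50, 2)
)
dim(data)   # number of rows and columns
str(data)   # structure of each variable
```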
4 Conclusion
From the given problem we found that all models show significantly greater predictive accuracy than naively predicting the same outcome for every customer, which gives an accuracy of 69.7%; the models are nearly identical in terms of results.
If maximum accuracy is the goal, then I would recommend the logistic model, since it is much more interpretable than the other models.
5 Detailed Explanation of Findings
1. EDA - Exploratory Data Analysis
1.1 EDA - Basic data summary, Univariate, Bivariate analysis, graphs
1.2 EDA - Check for Outliers and missing values and check the summary of the dataset
1.3 EDA - Check for Multicollinearity - Plot the graph based on Multicollinearity & treat it.
1.4 EDA - Summarize the insights you get from EDA.
Ans: str(): used to inspect the structure of the data. It is an alternative way to display a summary of the data, giving information about the basic structure of each variable.
boxplot():
Range: if you are interested in the spread of all the data, it is represented on a boxplot by the distance between the smallest value and the largest value, including any outliers.
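A minimal sketch of this reading of a boxplot, on simulated data (the real plots in the report use the churn variables); points beyond 1.5 × IQR from the quartiles are drawn as outliers:

```r
# Simulated AccountWeeks-like variable with three deliberately extreme values
set.seed(1)
AccountWeeks <- c(rnorm(97, mean = 100, sd = 15), 180, 190, 10)
boxplot(AccountWeeks, main = "AccountWeeks (simulated)", col = "orange")
# boxplot.stats() lists the same points the plot draws beyond the whiskers:
boxplot.stats(AccountWeeks)$out
```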
cor(): computes the correlation coefficients between the numeric variables.
summary():
Below is the summary of our given data set.
Basic EDA is performed on the given data set and below is the output:
Univariate Analysis:
We check the summary of each variable and, by plotting, check for outliers in each variable.
AccountWeeks has a few outliers.
CustServCalls also has some outliers.
DayCalls also has outliers.
OverageFee also has outliers.
## Check for multicollinearity
Is there evidence of multicollinearity? Showcase your analysis.
First, we create the correlation matrix and plot the correlations for the Telecom Customer Churn data set. Then we check the multicollinearity of the independent variables using VIF.
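The VIF check can be sketched by hand as 1 / (1 − R²ⱼ), where R²ⱼ comes from regressing predictor j on the remaining predictors; the report's source code uses car::vif, which gives the same numbers for a linear model. The data here is simulated, with x2 built to be collinear with x1:

```r
# Manual VIF: 1 / (1 - R^2_j) for each predictor j
set.seed(2)
n  <- 200
x1 <- rnorm(n)
x2 <- 0.9 * x1 + rnorm(n, sd = 0.3)   # deliberately collinear with x1
x3 <- rnorm(n)
preds <- data.frame(x1, x2, x3)

vif_manual <- function(df) {
  sapply(names(df), function(v) {
    # regress v on all other columns, then invert 1 - R^2
    r2 <- summary(lm(reformulate(setdiff(names(df), v), v), data = df))$r.squared
    1 / (1 - r2)
  })
}
vif_manual(preds)   # x1 and x2 come out inflated; x3 stays near 1
```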
2. Build Models and compare them to get to the best one
2.1 Applying Logistic Regression
2.2 Interpret Logistic Regression
2.3 Applying KNN Model
2.4 Interpret KNN Model
2.5 Applying Naive Bayes Model
2.6 Interpret Naive Bayes Model
2.7 Confusion matrix interpretation for all models
2.8 Interpretation of other Model Performance Measures for logistic <KS, AUC, GINI>
2.9 Remarks on Model validation exercise <Which model performed the best>
Ans: Predicting telecom churn prediction consists of detecting which customers are likely to cancel a
subscription to a service based on how they use the service. We want to predict the answer to the
following question, asked for each current customer: “Is this customer going to leave us within the
next few months?” There are only two possible answers, yes or no, and it is what we call a binary
classification task. For that we will apply Logistic Regression, a KNN model and a Naive Bayes model. Before starting, we need to split the data into training and test samples.
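The 70/30 split can be sketched with base R's sample() (the source code section uses caTools::sample.split instead; the data frame here is simulated):

```r
# 70/30 train/test split on a simulated churn-like data frame
set.seed(1234)
df  <- data.frame(Churn = rbinom(100, 1, 0.2), AccountWeeks = rpois(100, 100))
idx <- sample(seq_len(nrow(df)), size = 0.7 * nrow(df))
train <- df[idx, ]    # 70 rows used to fit the models
test  <- df[-idx, ]   # 30 held-out rows used to validate them
```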
Logistic Regression:
For the given model, we perform six steps in logistic regression to assess the overall validity of the model:
1. Log likelihood Test
2. McFadden Rsq
3. Individual Slopes Significance Test
4. Explanatory Power of odds
5. Classification / Confusion Matrix
6. ROC Curve
McFadden Rsq:
As a rule of thumb, a McFadden pseudo R-squared ranging from 0.2 to 0.4 indicates very good model fit.
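The statistic can be computed directly as 1 − logLik(fitted model) / logLik(intercept-only model); a sketch on simulated binary data:

```r
# McFadden pseudo R-squared on simulated data
set.seed(3)
n <- 300
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-1 + 2 * x))          # true logistic relationship
fit  <- glm(y ~ x, family = binomial)          # fitted model
null <- glm(y ~ 1, family = binomial)          # intercept-only model
mcfadden <- 1 - as.numeric(logLik(fit)) / as.numeric(logLik(null))
mcfadden   # values of 0.2-0.4 suggest a very good fit by the rule of thumb
```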
Explanatory Power of odds:
The exponentiated coefficients give the multiplicative change in the odds of churn for a one-unit change in each predictor.
Classification / Confusion Matrix:
A confusion matrix is a table that is often used to describe the performance of a classification model on a set of test data for which the true values are known.
KNN:
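A hedged KNN sketch with the class package (which ships with R): normalise the predictors to [0, 1] first, since KNN is distance-based, then classify each test row by majority vote of its k nearest training rows. The data is simulated; the report applies the same idea to the normalised churn variables:

```r
library(class)  # provides knn()

set.seed(5)
n  <- 150
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- factor(ifelse(x1 + x2 + rnorm(n, sd = 0.5) > 0, "Yes", "No"))

norm01 <- function(x) (x - min(x)) / (max(x) - min(x))   # min-max scaling
X   <- data.frame(x1 = norm01(x1), x2 = norm01(x2))
idx <- sample(seq_len(n), size = 100)                    # 100 train / 50 test

knn.pred <- knn(train = X[idx, ], test = X[-idx, ], cl = y[idx], k = 19)
mean(knn.pred == y[-idx])   # held-out accuracy
```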
3. Actionable Insights and Recommendations
Ans: Customers that have signed up recently on a month-to-month contract with a single telephone
line and who pay with an alternative method to electronic check are the most likely to churn.
Resources should be focussed on these customers to move them to products that are indicators of
brand loyalty. Marketing and retention teams should prioritise the following products in descending order of importance:
1. Two-year contract
2. One-year contract
3. Paperless billing
4. Payment by electronic check
5. A second telephone line
4. Source code
data_fact=data[,-c(1,3,4)]
data_num=data[,-c(12,13,14)]
str(data_fact)
str(data_num)
###Basic EDA
library(funModeling)
library(tidyverse)
library(Hmisc)
# Reconstructed helper (the original definition was lost in extraction; this is
# the standard funModeling-style basic EDA wrapper implied by the call below)
basic_eda <- function(data) {
  glimpse(data)
  df_status(data)
  freq(data)
  profiling_num(data)
  plot_num(data)
  describe(data)
}
basic_eda(data)
#Univariate analysis
attach(data)  # so variables such as AccountWeeks can be referenced directly
summary(AccountWeeks)
boxplot(AccountWeeks,main = "Accountweeks",col = "orange",border = "blue")
summary(DataUsage)
boxplot(DataUsage,main = "Datausage",col = "Green",border = "Red")
summary(CustServCalls)
boxplot(CustServCalls,main = "custsercalls", col = "Pink",border = "blue")
summary(DayMins)
boxplot(DayMins,main = "DayMins",col = "orange",border = "blue")
summary(DayCalls)
boxplot(DayCalls,main = "Daycalls",col = "orange",border = "blue")
summary(MonthlyCharge)
boxplot(MonthlyCharge,main = "Monthlycharge",col = "orange",border = "blue")
summary(OverageFee)
boxplot(OverageFee,main = "Overagefee",col = "orange",border = "blue")
summary(RoamMins)
boxplot(RoamMins,main = "RoamMins",col = "orange",border = "blue")
summary(Churn)
hist(DataUsage)
hist(CustServCalls)
hist(MonthlyCharge)
plot(DataUsage,MonthlyCharge)
plot(MonthlyCharge,DayMins)
boxplot(data)
colSums(is.na(data))  # count missing values per column
data=na.omit(data)
names(data)
#collinearity
str(data)
library(corrplot)
cor_data=cor(data[,-c(12,13,14)])
corrplot(cor_data , method='number')
library(car)
lm(Churn~.,data=data_num)
vif(lm(Churn~.,data=data_num))
vif(glm(Churn_fact~.,data=data_fact,family=binomial))
summary(glm(ContractRenewal_fact~.,data=data_fact,family=binomial))
summary(glm(Churn_fact~AccountWeeks+DataUsage+CustServCalls+DayMins+DayCalls+
              MonthlyCharge+OverageFee+RoamMins+ContractRenewal_fact,
            data=data_fact,family=binomial))
vif(glm(Churn_fact~AccountWeeks+DataUsage+CustServCalls+DayMins+DayCalls+
          MonthlyCharge+OverageFee+RoamMins+ContractRenewal_fact,
        data=data_fact,family=binomial))
cor(data_num)
library(caTools)
sample = sample.split(data_fact$Churn_fact, SplitRatio = .70)
train = subset(data_fact, sample == TRUE)
test = subset(data_fact, sample == FALSE)
####Logistic Regression
set.seed(1234)
# 6 steps in logistic
logit=glm(Churn_fact~AccountWeeks+CustServCalls+DayCalls+RoamMins+ContractRenewal_fact,
          data=train,family=binomial)
summary(logit)
library(car)
vif(logit)
library(lmtest)
lrtest(logit)
round(exp(coef(logit)),2)
probability_Scores=format(exp(coef(logit))/(exp(coef(logit))+1),scientific = FALSE)
print(probability_Scores)
Pred.logit=predict(logit,newdata=train,type = "response")  # predict.glm takes newdata, not data
summary(Pred.logit)
summary(train$Churn_fact)
plot(train$Churn_fact,Pred.logit)
Pred.logit=ifelse(Pred.logit<0.14,0,1)
table(Actual=train$Churn_fact,Predicted=Pred.logit)  # the model predicts Churn_fact
library(Deducer)
rocplot(logit)
library(caret)
##Testing the model on the Test Data
#ModelPerformanceParameter
#Train
train$prediction = predict(logit, train, type="response")
library(ROCR)
library(ineq)
predObj = prediction(train$prediction, train$Churn_fact)
perf = performance(predObj, "tpr", "fpr")
plot(perf)
KS = max(perf@y.values[[1]]-perf@x.values[[1]])
auc = performance(predObj,"auc");
auc = as.numeric(auc@y.values)
gini = ineq(train$prediction, type="Gini")
#Test
test$prediction = predict(logit, test, type="response")
library(ROCR)
library(ineq)
predObj = prediction(test$prediction, test$Churn_fact)
perf = performance(predObj, "tpr", "fpr")
plot(perf)
KS = max(perf@y.values[[1]]-perf@x.values[[1]])
auc = performance(predObj,"auc");
auc = as.numeric(auc@y.values)
gini = ineq(test$prediction, type="Gini")
norm=function(x){(x-min(x))/(max(x)-min(x))}
norm.data=as.data.frame(lapply(data_fact[,-c(9,10,11)],norm))
norm.data_fact=cbind(data_fact[,c(9,10,11)],norm.data)
library(class)
# Split the normalised data with the same sample vector created earlier
norm.train = norm.data_fact[sample == TRUE, ]
norm.test  = norm.data_fact[sample == FALSE, ]
knn.pred = knn(norm.train[,-1], norm.test[,-1], norm.train[,1], k = 19)
#Naive Bayes
library(e1071)
NB = naiveBayes(Churn_fact~AccountWeeks+CustServCalls+DayCalls+RoamMins+ContractRenewal_fact,
                data = train)
predNB = predict(NB, test, type = "class")
tab.NB = table(test[,9], predNB)
sum(diag(tab.NB)/sum(tab.NB))
confusionMatrix(tab.NB)
tab.NB