
Customer Churn Prediction

REPORT

TABLE OF CONTENTS
1. Introduction
2. Data overview
    2.1 Data description
    2.2 Overview of the training dataset
    2.3 Overview of the test dataset
3. Imported libraries
4. Data visualization
    4.1 Customer churn in data
    4.2 Variable distribution
5. Data pre-processing
    5.1 Training variable summary
    5.2 Correlation matrix
    5.3 Visualizing data with principal components
    5.4 Binary variable distribution in customer churn
6. Model building
    6.1 Logistic Regression
        6.1.1 Synthetic Minority Oversampling Technique (SMOTE)
        6.1.2 Recursive Feature Elimination
    6.2 Decision tree classifier
    6.3 K-NN Classifier
    6.4 Random Forest Classifier
    6.5 Gaussian Naïve Bayes
    6.6 Support Vector Machine
        6.6.1 Linear
        6.6.2 RBF
    6.7 Gaussian Process Classifier
    6.8 Gradient Boosting Classifier
    6.9 Multi-layer Perceptron Classifier
7. Compare model metrics
    7.1 Training data
        7.1.1 Confusion matrices
        7.1.2 ROC curves
    7.2 Test data
        7.2.1 Confusion matrices
        7.2.2 ROC curves


1. INTRODUCTION
Customer churn, also known as customer attrition, customer turnover, or customer
defection, is the loss of clients or customers.
Telephone service companies, Internet service providers, pay TV companies, insurance
firms, and alarm monitoring services often use customer attrition analysis and customer
attrition rates as key business metrics, because the cost of retaining an existing
customer is far less than the cost of acquiring a new one.
Companies in these sectors often have customer service branches that attempt to win
back defecting clients, because recovered long-term customers can be worth much more
to a company than newly recruited clients.
Companies usually distinguish between voluntary churn and involuntary churn.
Voluntary churn occurs when the customer decides to switch to another company or
service provider. Involuntary churn occurs due to circumstances such as a customer's
relocation to a long-term care facility, death, or relocation to a distant location. In
most applications, involuntary reasons for churn are excluded from the analytical models.
Analysts tend to concentrate on voluntary churn, because it typically arises from aspects
of the company-customer relationship that companies control, such as how billing
interactions are handled or how after-sales help is provided.
Predictive analytics uses churn prediction models that estimate each customer's
propensity to churn. Because these models generate a small, prioritized list of
potential defectors, they are effective at focusing customer retention marketing programs
on the subset of the customer base that is most vulnerable to churn.


2. DATA OVERVIEW
2.1 Data description
State: The state the customer resides in.
Account length: The length of the
Area code: The area code of the place the customer resides in.
International plan: Whether the customer has subscribed to an international plan.
Voice mail plan: Whether the customer has subscribed to a voice mail plan.
Number vmail messages: The number of messages the customer has used with the voice
mail plan.
Total day minutes: Total minutes used by the customer in the day.
Total day calls: Total number of calls made by the customer in the day.
Total day charge: Total charge for the calls and minutes used by the customer in the day.
Total eve minutes: Total minutes used by the customer in the evening.
Total eve calls: Total number of calls made by the customer in the evening.
Total eve charge: Total charge for the calls and minutes used by the customer in the
evening.
Total night minutes: Total minutes used by the customer in the night.
Total night calls: Total number of calls made by the customer in the night.
Total night charge: Total charge for the calls and minutes used by the customer in the
night.
Total intl minutes: Total international minutes used by the customer.
Total intl calls: Total international calls made by the customer.
Total intl charge: Total international charge for the minutes and calls by the customer.
Customer service calls: Service calls made by the customer to the service center.


2.2 Overview of the training dataset:

• Rows: 2666
• Number of features: 20
• Missing values: 0
• Features:
['State', 'Account length', 'Area code', 'International plan', 'Voice mail plan',
'Number vmail messages', 'Total day minutes', 'Total day calls', 'Total day charge',
'Total eve minutes', 'Total eve calls', 'Total eve charge', 'Total night minutes',
'Total night calls', 'Total night charge', 'Total intl minutes', 'Total intl calls',
'Total intl charge', 'Customer service calls', 'Churn']
• Unique values:

Feature                   Unique values
State                     51
Account length            205
Area code                 3
International plan        2
Voice mail plan           2
Number vmail messages     42
Total day minutes         1489
Total day calls           115
Total day charge          1489
Total eve minutes         1442
Total eve calls           120
Total eve charge          1301
Total night minutes       1444
Total night calls         118
Total night charge        885
Total intl minutes        158
Total intl calls          21
Total intl charge         158
Customer service calls    10
Churn                     2


2.3 Overview of the test dataset:


• Rows: 667
• Number of features: 20
• Missing values: 0
• Features:
['State', 'Account length', 'Area code', 'International plan', 'Voice mail plan',
'Number vmail messages', 'Total day minutes', 'Total day calls', 'Total day charge',
'Total eve minutes', 'Total eve calls', 'Total eve charge', 'Total night minutes',
'Total night calls', 'Total night charge', 'Total intl minutes', 'Total intl calls',
'Total intl charge', 'Customer service calls', 'Churn']
• Unique values:

Feature                   Unique values
State                     51
Account length            179
Area code                 3
International plan        2
Voice mail plan           2
Number vmail messages     37
Total day minutes         562
Total day calls           100
Total day charge          562
Total eve minutes         557
Total eve calls           94
Total eve charge          528
Total night minutes       568
Total night calls         96
Total night charge        453
Total intl minutes        132
Total intl calls          17
Total intl charge         132
Customer service calls    9
Churn                     2


3. IMPORTED LIBRARIES:
# Numerical arrays and tabular data
import numpy as np
import pandas as pd
from math import *

3.1 Visualization

# Static plots and image handling
import matplotlib.pyplot as plt
from PIL import Image
import seaborn as sns
import itertools
import io

# Interactive plots (Plotly in notebook mode)
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
from plotly.subplots import make_subplots
import plotly.figure_factory as ff


4. DATA VISUALIZATION

4.1 Customer churn in data

• Here we can observe that 85.4% of the customers in the training data did not churn
(Churn is false) and the remaining 14.6% did (Churn is true).
• Now let’s observe how the variables are distributed.


4.2 Variable distribution

• Several of the numerical variables are highly correlated: (Total day minutes and Total
day charge), (Total eve minutes and Total eve charge), (Total night minutes and Total
night charge), and (Total intl minutes and Total intl charge). We need to keep only one
variable from each pair.


5. DATA PREPROCESSING

• Data pre-processing is the process of preparing the raw data and making it suitable
for the machine learning models we will use.

• Here, we remove redundant data by dropping the unnecessary columns: 'State',
'Area code', 'Total day charge', 'Total eve charge', 'Total night charge', and
'Total intl charge'.

• After dropping the redundant columns, we define the target column, which is Churn
in this scenario.

• Next, we separate the categorical and numerical columns. We do this to identify the
values associated with them and to organize the data before we tackle the problem.

• We use LabelEncoder to encode labels as integer values between 0 and n_classes-1.
Label encoding converts the labels into a numeric, machine-readable form. LabelEncoder
provides several methods:

1) fit(y): Fit label encoder.
2) fit_transform(y): Fit label encoder and return encoded labels.
3) get_params([deep]): Get parameters for this estimator.
4) inverse_transform(y): Transform labels back to original encoding.
5) set_output(*[, transform]): Set output container.
6) set_params(**params): Set the parameters of this estimator.
7) transform(y): Transform labels to normalized encoding.

• Finally, the data is scaled, which makes it easier for a model to learn from the
features.
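
The pre-processing steps above can be sketched roughly as follows. This is a minimal
illustration on a tiny made-up sample, not the report's actual code: the rows are
invented, only a few of the dataset's columns are included, and StandardScaler is an
assumption, since the report does not say which scaler was used.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Tiny made-up sample using a few of the report's column names.
df = pd.DataFrame({
    "State": ["KS", "OH", "NJ", "OH"],
    "International plan": ["No", "Yes", "No", "Yes"],
    "Total day minutes": [265.1, 161.6, 243.4, 299.4],
    "Total day charge": [45.07, 27.47, 41.38, 50.90],
    "Churn": [False, False, True, False],
})

# Drop the columns the report treats as redundant (only those present here).
df = df.drop(columns=["State", "Total day charge"])

# Separate target, categorical, and numerical columns.
target_col = "Churn"
cat_cols = ["International plan"]
num_cols = ["Total day minutes"]

# Encode each categorical column and the target as integers 0..n_classes-1.
encoder = LabelEncoder()
for col in cat_cols + [target_col]:
    df[col] = encoder.fit_transform(df[col])

# Scale numerical columns to zero mean and unit variance (assumed scaler).
df[num_cols] = StandardScaler().fit_transform(df[num_cols])

print(df)
```

A separate encoder per column would be needed if the encoded values had to be mapped
back later with inverse_transform.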


5.1 Training variable summary


5.2 Correlation Matrix

• Correlation describes how two or more variables are related to each other.

• Correlation is a statistical technique that determines how one variable moves or
changes in relation to another variable. It gives us an idea of the degree of the
relationship between the two variables. It is a bivariate measure that describes the
association between a pair of variables. In most businesses it is useful to express
one subject in terms of its relationship with others.
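
As an illustration of how such a coefficient can be computed, the sketch below
implements the Pearson correlation coefficient from scratch. The report does not state
which coefficient its matrix uses, so Pearson is an assumption here, and the sample
values and the fixed per-minute rate of 0.17 are made up for the example.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Made-up values: the charge is derived from the minutes at a hypothetical fixed rate.
minutes = [265.1, 161.6, 243.4, 299.4, 166.7]
charge = [round(m * 0.17, 2) for m in minutes]
calls = [110, 123, 114, 71, 113]

r_minutes_charge = pearson(minutes, charge)  # very close to 1: redundant pair
r_minutes_calls = pearson(minutes, calls)    # some value between -1 and 1

print(r_minutes_charge, r_minutes_calls)
```

A coefficient near 1 or -1 for a pair of columns suggests that one of the two carries
little extra information, which is the reasoning behind dropping the charge columns.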


5.3 Visualizing data with principal components

• Principal component analysis (PCA) is an unsupervised machine learning technique.
Perhaps its most popular use is dimensionality reduction. Besides using PCA as a data
preparation technique, we can also use it to help visualize data. With the data
visualized, it is easier to gain insights and decide on the next step in our machine
learning models.
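
A minimal sketch of projecting data onto its first two principal components, so that
it can be drawn as a 2-D scatter plot, is shown below. It uses NumPy's SVD on random
made-up data; the report's own implementation is not shown in the text and may differ.

```python
import numpy as np

# Made-up data: 100 rows, 5 features, with two features strongly related.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X[:, 1] = 2.0 * X[:, 0] + 0.1 * X[:, 1]

# PCA via SVD: center the data, then the rows of Vt are the principal directions.
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

components = Vt[:2]                    # first two principal directions
X_2d = X_centered @ components.T       # 2-D coordinates for plotting
explained = (S ** 2) / np.sum(S ** 2)  # share of variance per component

print(X_2d.shape, explained[:2])
```

The two columns of X_2d can then be passed to a scatter plot, coloured by the churn
label, to see whether the classes separate visually.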


5.4 Binary variable distribution in customer churn


6. MODEL BUILDING

The models we used here are:

1) Logistic Regression
2) Decision Tree Classifier
3) KNN Classifier
4) Random Forest Classifier
5) Gaussian Naïve Bayes
6) SVM
7) Gaussian process Classifier
8) Gradient Boosting Classifier
9) Multi-layer perceptron Classifier

• After training each model, we compare their accuracy, precision, and recall scores.
We use precision when we want predictions of 1 to be as correct as possible, and
recall when we want the model to spot as many real 1s as possible.
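
To make the difference between these scores concrete, the sketch below computes
accuracy, precision, and recall from the confusion-matrix counts. The label vectors
are made up for illustration, with 1 standing for a churned customer.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, and recall for binary labels (1 = churn)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # churners caught
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # churners missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # correct non-churn
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Of all predicted 1s, how many were really 1?
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        # Of all real 1s, how many did the model find?
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

# Made-up labels and predictions.
y_true = [1, 0, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 0, 0, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

With an imbalanced target such as churn, accuracy alone can look high even when recall
is poor, which is why the three scores are read together.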


6.1 Logistic Regression


• In statistics, the logistic model (or logit model) is a statistical model that models
the probability of an event taking place by having the log-odds for the event be
a linear combination of one or more independent variables. In regression
analysis, logistic regression (or logit regression) estimates the parameters of a
logistic model (the coefficients in the linear combination).

• Here we plot the ROC curve and confusion matrix, which help us understand how well
the model separates the two classes. Next, we split the main training dataset and plot
the subsets.

• Accuracy Score: 0.8215892053973014

• Area under curve: 0.5529555149299208
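
The sketch below illustrates the logistic model described above and the two reported
metrics. It turns a linear combination of features into a probability through the
log-odds, then computes accuracy at a 0.5 threshold and the area under the ROC curve
as the fraction of (churner, non-churner) pairs that are ranked correctly. The
coefficients and data points are hypothetical and are not fitted to the report's data.

```python
from math import exp

def predict_proba(x, weights, bias):
    """Logistic model: the log-odds are a linear combination of the features."""
    log_odds = bias + sum(w * v for w, v in zip(weights, x))
    return 1.0 / (1.0 + exp(-log_odds))

def accuracy(y_true, probs, threshold=0.5):
    preds = [1 if p >= threshold else 0 for p in probs]
    return sum(int(t == p) for t, p in zip(y_true, preds)) / len(y_true)

def roc_auc(y_true, probs):
    """AUC as the share of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [p for t, p in zip(y_true, probs) if t == 1]
    neg = [p for t, p in zip(y_true, probs) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical coefficients and made-up customers (two features each).
weights, bias = [0.8, 0.5], -3.0
X = [[1, 0], [4, 1], [2, 3], [0, 1], [5, 2], [1, 1]]
y = [0, 1, 0, 0, 1, 1]

probs = [predict_proba(x, weights, bias) for x in X]
acc = accuracy(y, probs)
auc = roc_auc(y, probs)
print(acc, auc)
```

Accuracy depends on the chosen threshold, while the area under the curve depends only
on how the probabilities rank the customers, so the two scores can diverge.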


Threshold Plot


6.1.1 Synthetic Minority Oversampling Technique (SMOTE)

SMOTE balances the classes by creating synthetic minority-class samples:

• Randomly pick a point from the minority class.
• Compute the k-nearest neighbours (for some pre-specified k) for this point.
• Add k new points, each somewhere between the chosen point and one of its
neighbours.
• Accuracy Score: 0.767616191904048
• Area under curve: 0.7249619134673979
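
The SMOTE steps listed above can be sketched in a few lines. This is a simplified
illustration that follows the description in this section (one synthetic point
between the chosen point and each of its k neighbours); library implementations differ
in detail. The minority-class points are made up.

```python
import random
from math import dist

def smote_points(minority, k, rng):
    """One SMOTE step: pick a point, find k neighbours, interpolate towards each."""
    chosen = rng.choice(minority)
    others = [p for p in minority if p is not chosen]
    neighbours = sorted(others, key=lambda p: dist(chosen, p))[:k]
    synthetic = []
    for nb in neighbours:
        gap = rng.random()  # position along the segment, between 0 and 1
        synthetic.append(tuple(c + gap * (n - c) for c, n in zip(chosen, nb)))
    return chosen, neighbours, synthetic

rng = random.Random(0)
# Made-up minority-class points with two features each.
minority = [(1.0, 1.0), (1.5, 1.2), (2.0, 0.8), (5.0, 5.0), (1.2, 1.9)]
chosen, neighbours, synthetic = smote_points(minority, k=3, rng=rng)
print(chosen, synthetic)
```

Repeating this step until the minority class is as large as desired gives a balanced
training set. The synthetic points should be added to the training split only.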


6.1.2 Recursive Feature Elimination

• Recursive Feature Elimination (RFE) is based on the idea of repeatedly constructing
a model and choosing either the best or worst performing feature, setting that
feature aside, and then repeating the process with the rest of the features. This
process is applied until all features in the dataset are exhausted. The goal of RFE
is to select features by recursively considering smaller and smaller sets of
features.

• Accuracy Score: 0.8200899550224887

• Area under curve: 0.552041438147471
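
As a rough illustration of the procedure, the sketch below runs scikit-learn's RFE
around a logistic regression on synthetic data. The dataset, the number of features to
keep (3), and the estimator settings are assumptions for the example and are not the
report's configuration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data: 8 features, of which 3 are informative.
X, y = make_classification(
    n_samples=200, n_features=8, n_informative=3, n_redundant=0, random_state=0
)

# Remove one feature per round (step=1) until 3 features remain.
selector = RFE(
    estimator=LogisticRegression(max_iter=1000),
    n_features_to_select=3,
    step=1,
)
selector.fit(X, y)

selected = [i for i, keep in enumerate(selector.support_) if keep]
ranking = selector.ranking_.tolist()  # 1 = kept; larger = eliminated earlier
print(selected, ranking)
```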


6.2 Decision tree classifier

• Decision tree classifiers are used successfully in many diverse areas. Their most
important feature is the capability of capturing descriptive decision-making
knowledge from the supplied data. A decision tree can be generated from a training
set.
• Accuracy Score: 0.904047976011994
• Area under curve: 0.7983851310176721


6.3 K-NN Classifier

• The K-NN algorithm stores all the available data and classifies a new data point
based on its similarity to the stored points. This means that when new data appears,
it can easily be classified into a well-suited category by using the K-NN algorithm.
• The K-NN algorithm can be used for regression as well as for classification, but
it is mostly used for classification problems.
• K-NN is a non-parametric algorithm, which means it does not make any
assumption about the underlying data.
• It is also called a lazy learner algorithm because it does not learn from the
training set immediately. Instead, it stores the dataset and, at the time of
classification, performs an action on the dataset.
• Accuracy Score: 0.848575712143928
• Area under curve: 0.5856718464351005
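
The behaviour described above can be sketched from scratch: nothing is learned up
front, and each new point is labelled by a majority vote among its k closest stored
points. Euclidean distance, k=3, and the data points are assumptions made for this
example.

```python
from collections import Counter
from math import dist

def knn_predict(train_X, train_y, query, k=3):
    """Label a query point by majority vote among its k nearest stored points."""
    nearest = sorted(zip(train_X, train_y), key=lambda pair: dist(pair[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Made-up stored points: two clusters, labelled 0 and 1.
train_X = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (4.0, 4.2), (4.1, 3.9), (3.8, 4.0)]
train_y = [0, 0, 0, 1, 1, 1]

print(knn_predict(train_X, train_y, (1.1, 1.0)))  # near the first cluster
print(knn_predict(train_X, train_y, (4.0, 4.0)))  # near the second cluster
```

Because distances drive the vote, features on very different scales should be scaled
first, as done in the pre-processing step.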


6.4 Random Forest Classifier

• Random forest is a meta estimator that fits several decision tree classifiers on
various sub-samples of the dataset and uses averaging to improve the predictive
accuracy and control over-fitting. By default, each sub-sample has the same size as
the original input sample, but the samples are drawn with replacement.
• Accuracy Score: 0.9115442278860569
• Area under curve: 0.7736822059719684
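
The two ideas in this description, sampling with replacement and averaging, can be
sketched without any tree code. The row indices and the per-tree churn probabilities
below are hypothetical stand-ins for real trees.

```python
import random

def bootstrap_sample(rows, rng):
    """Draw len(rows) items with replacement, so some rows repeat and some are left out."""
    return [rng.choice(rows) for _ in rows]

def average_probability(tree_probs):
    """Combine the trees by averaging their predicted probabilities."""
    return sum(tree_probs) / len(tree_probs)

rng = random.Random(42)
rows = list(range(10))  # stand-in for 10 training rows

# Each tree would be trained on its own bootstrap sample.
samples = [bootstrap_sample(rows, rng) for _ in range(3)]

# Hypothetical churn probabilities from three trees for one customer.
tree_probs = [0.9, 0.4, 0.7]
avg = average_probability(tree_probs)
forest_prediction = 1 if avg >= 0.5 else 0
print(samples, avg, forest_prediction)
```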


6.5 Gaussian Naïve Bayes

• Naïve Bayes is a probabilistic machine learning algorithm used for many
classification tasks and is based on Bayes' theorem. Gaussian Naïve Bayes is an
extension of naïve Bayes to continuous features. While other functions can be used to
estimate the data distribution, the Gaussian (normal) distribution is the simplest to
implement, as you only need to calculate the mean and standard deviation of each
feature for each class from the training data.
• Accuracy Score: 0.8200899550224887
• Area under curve: 0.6366087751371116
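
A from-scratch sketch of this idea follows: fitting amounts to storing a prior plus
the mean and standard deviation of each feature per class, and prediction multiplies
the prior by the Gaussian density of each feature. The data points are made up, and
the sketch does no variance smoothing, so it assumes every feature varies within each
class.

```python
from math import exp, pi, sqrt
from statistics import mean, pstdev

def gaussian_pdf(x, mu, sigma):
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sqrt(2 * pi) * sigma)

def fit(X, y):
    """Store, per class, the prior and the (mean, std) of every feature."""
    model = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        stats = [(mean(col), pstdev(col)) for col in zip(*rows)]
        model[label] = (len(rows) / len(X), stats)
    return model

def predict(model, x):
    """Pick the class with the largest prior times product of feature densities."""
    scores = {}
    for label, (prior, stats) in model.items():
        score = prior
        for value, (mu, sigma) in zip(x, stats):
            score *= gaussian_pdf(value, mu, sigma)
        scores[label] = score
    return max(scores, key=scores.get)

# Made-up rows: (customer service calls, total day minutes), label 1 = churn.
X = [(1.0, 200.0), (0.0, 180.0), (2.0, 210.0), (5.0, 300.0), (4.0, 280.0), (6.0, 310.0)]
y = [0, 0, 0, 1, 1, 1]
model = fit(X, y)
print(predict(model, (1.0, 190.0)), predict(model, (5.0, 295.0)))
```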


6.6 Support Vector Machine


“Support Vector Machine” (SVM) is a supervised machine learning algorithm that
can be used for both classification and regression challenges, although it is mostly
used in classification problems. In this algorithm, we plot each data item as a point
in n-dimensional space (where n is the number of features), with the value of each
feature being the value of a particular coordinate. Then, we perform classification by
finding the hyperplane that best differentiates the two classes.
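
Once a hyperplane has been found, classifying a point only requires checking which
side of it the point falls on. The sketch below shows that final step; the weight
vector and offset are hypothetical rather than learned, and the training procedure
that finds them is not shown.

```python
def decision_value(x, w, b):
    """Signed score of x relative to the hyperplane w . x + b = 0."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def classify(x, w, b):
    """Class 1 on the non-negative side of the hyperplane, class 0 otherwise."""
    return 1 if decision_value(x, w, b) >= 0 else 0

# Hypothetical hyperplane in 2-D: the line x1 + x2 = 5.
w, b = [1.0, 1.0], -5.0

print(classify([4.0, 4.0], w, b))  # above the line
print(classify([1.0, 1.0], w, b))  # below the line
```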

6.6.1 Support Vector Machine (linear)

• Linear SVM is used for linearly separable data. If a dataset can be classified
into two classes by using a single straight line, the data is termed linearly
separable, and the classifier used is called a linear SVM classifier.
• Accuracy Score: 0.8200899550224887
• Area under curve: 0.5


6.6.2 Support Vector Machine (rbf)

• Non-linear SVM is used for data that is not linearly separable. If a dataset
cannot be classified by using a straight line, the data is termed non-linear, and
the classifier used is called a non-linear SVM classifier.
• Accuracy Score: 0.9355322338830585
• Area under curve: 0.846854052407069
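
The "rbf" in this section's title refers to the radial basis function kernel, which
measures similarity by distance rather than by a dot product. A minimal sketch of the
two kernel functions is shown below for comparison; the gamma value of 0.5 is an
arbitrary choice for the example and not the setting used in the report.

```python
from math import exp

def linear_kernel(x, y):
    """Dot product: the similarity measure behind a linear SVM."""
    return sum(a * b for a, b in zip(x, y))

def rbf_kernel(x, y, gamma=0.5):
    """RBF kernel: exp(-gamma * squared distance); 1 for identical points."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return exp(-gamma * sq_dist)

same = rbf_kernel((1.0, 2.0), (1.0, 2.0))    # identical points
near = rbf_kernel((0.0, 0.0), (1.0, 1.0))    # squared distance 2
far = rbf_kernel((0.0, 0.0), (10.0, 10.0))   # far apart, similarity near 0
print(same, near, far, linear_kernel((1.0, 2.0), (3.0, 4.0)))
```

Because the RBF similarity decays with distance, the resulting decision boundary can
curve around clusters instead of being a single straight line.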


6.7 Gaussian Process Classifier


• Gaussian Processes are a generalization of the Gaussian probability distribution
and can be used as the basis for sophisticated non-parametric machine learning
algorithms for classification and regression.
• Accuracy Score: 0.848575712143928
• Area under curve: 0.5824192565508837


6.8 Gradient Boosting Classifier


• Gradient boosting is a method that stands out for its prediction speed and
accuracy, particularly with large and complex datasets. From Kaggle
competitions to machine learning solutions for business, this algorithm has
often produced some of the best results. Errors play a major role in any
machine learning algorithm. There are mainly two types of error: bias error
and variance error. The gradient boosting algorithm helps us minimize the bias
error of the model.
• Accuracy Score: 0.9310344827586207
• Area under curve: 0.824596282754418
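
The way boosting reduces bias can be sketched with a deliberately simplified version:
start from a constant prediction, then repeatedly fit a one-split tree (a stump) to
the current residuals and add a fraction of its output. This sketch uses squared error
on a single made-up feature, whereas a gradient boosting classifier uses a
classification loss, and the learning rate and number of rounds are arbitrary.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split that best fits the residuals (squared error)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = sum((r - left_mean) ** 2 for r in left)
        error += sum((r - right_mean) ** 2 for r in right)
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Each round fits a stump to what the current ensemble still gets wrong."""
    base = sum(ys) / len(ys)
    preds = [base] * len(xs)
    stumps, errors = [], []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
        errors.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))
    return base, stumps, errors

# Made-up data: one feature, churn labels used as numeric targets.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 0, 1, 0, 1, 1, 1]
base, stumps, errors = boost(xs, ys)
print(errors[0], errors[-1])
```

The training error shrinks round after round, which is the bias reduction mentioned
above; the learning rate and the number of rounds control how far this is pushed
before over-fitting sets in.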


6.9 Multi-layer Perceptron Classifier


• The multi-layer perceptron is also known as MLP. It consists of fully connected
dense layers, which transform an input of any dimension to the desired dimension.
A multi-layer perceptron is a neural network that has multiple layers. To create
a neural network, we combine neurons together so that the outputs of some
neurons are inputs of other neurons.
• Accuracy Score: 0.904047976011994
• Area under curve: 0.7561014625228518
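
The sketch below shows a forward pass through two dense layers to illustrate how the
outputs of one layer of neurons become the inputs of the next. The weights, biases,
layer sizes, and the sigmoid activation are hypothetical choices for the example, and
training by backpropagation is not shown.

```python
from math import exp

def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

def dense(inputs, weights, biases):
    """Fully connected layer: each neuron sees every input."""
    return [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]

# Hypothetical parameters: 3 inputs -> 2 hidden neurons -> 1 output neuron.
hidden_w = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]]
hidden_b = [0.0, 0.1]
out_w = [[1.0, -1.0]]
out_b = [0.0]

x = [1.0, 2.0, 3.0]
hidden = dense(x, hidden_w, hidden_b)    # outputs of the hidden neurons
output = dense(hidden, out_w, out_b)     # hidden outputs feed the output neuron
print(hidden, output)
```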


7. COMPARE MODEL METRICS (Training data)


7.1.1 Confusion matrices for models


7.1.2 ROC curves for models


7.1.3 Model performances over the test dataset


7.2 COMPARE MODEL METRICS (Test data)


7.2.1 Confusion matrices for models


7.2.2 ROC curves for models
