
EXPLORATORY DATA ANALYSIS ON CUSTOMER CHURN MODEL
[Python Certification Course, 2021]

Submitted to:
PROF. NIRENDU KONAR

Submitted by:
Swati Choudhary
(20BSP2597)

DATE OF SUBMISSION: 15/06/2021


Customer churn, also called customer migration or customer loss, is treated as a major concern
because of the costs associated with it. When customers switch from their current service
provider to another, the costs fall on the losing company and not on the customers. The highly
competitive nature of the telecommunication sector and the absence of a differentiation strategy
in the products and services offered are what make subscribers churn from one company to
another. Customers are always looking for innovative and original products; if their current
provider cannot meet their needs, their loyalty and retention are in question. As disloyalty
increases, the churn rate tends to rise, which reduces the firm's value. Churn is considered a
"profit killer": as the customer base shrinks, the associated revenues go down.
In the "VF_customer_churn" dataset, out of 2134 responses, 580 were churn customers (customers
who switch networks frequently, within 6 months or less) and 1554 were loyal customers.
The target respondents were selected and grouped into two major types:
1) Loyal
2) Churners (frequent switchers in the network)
The variables in the dataset are:
['Sl. No. of respondendts', 'Identifier Loyal or Churn', 'Gender', 'A1', 'A2', 'A3', 'A4', 'Z1', 'Z2', 'Z3',
'Z4', 'Z5', 'Rural /Urban', 'V1', 'V2', 'V3', 'E1', 'E2', 'O1', 'O2', 'O3', 'O4', 'Muliti SIM', 'Muliti SIM
Phone', 'Network', 'Call charges / Price', 'Internet Speed', 'Innovative VAS Services ', 'Uncalled
for activation of Vas services', 'Access to Credit balance or racharge facility ', 'Ease of use / Self
Service', 'Security/ Privacy', 'Brand Loyalty', 'Promotional Programs', 'Peer Pressure', 'Customer
Service']

There are two objectives in this project: understanding the data through EDA, and developing an
ML model to predict customer churn.

EXPLORATORY DATA ANALYSIS:


The Vodafone customer churn dataset contains only categorical variables, and the EDA of
categorical variables covers:

• Cross tabulation
• Pie chart
• Chi-square test of independence
• Cat plot
• Histogram

1) CROSS TABULATION:
A.

INTERPRETATION:

In this crosstab, A4 represents users younger than 18, A3 users aged 18 to 25, A2 users aged
25 to 40 and A1 users aged 40 to 60. The last age group, over 60, has no column of its own: a
user for whom all four indicators are 0 belongs to it. Here 0 represents a loyal user and 1
represents a user who has churned out of the company.
The crosstab shows how many users in each age group are loyal and how many have churned. For
example, 612 of the 1554 loyal users are in the 18 to 25 group, and 198 of the 580 churned
users are in the 25 to 40 group.
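
The counting behind such a crosstab can be sketched as follows. This is a minimal illustration using only the Python standard library, not the code used in the project; the six records and the names `rows`, `age_group` and `crosstab` are invented for the example. With pandas, `pandas.crosstab` builds the same kind of table directly from a DataFrame.

```python
from collections import Counter

# Hypothetical rows: (identifier, A1, A2, A3, A4); identifier 0 = loyal, 1 = churn.
# These six records are invented for illustration and are not taken from the dataset.
rows = [
    (0, 0, 0, 1, 0),  # loyal, 18 to 25
    (0, 0, 0, 1, 0),  # loyal, 18 to 25
    (1, 0, 1, 0, 0),  # churn, 25 to 40
    (0, 1, 0, 0, 0),  # loyal, 40 to 60
    (1, 0, 0, 0, 0),  # churn, all indicators 0 -> over 60
    (0, 0, 0, 0, 1),  # loyal, under 18
]

AGE_LABELS = ("40-60", "25-40", "18-25", "under 18")  # order matches A1, A2, A3, A4


def age_group(indicators):
    """Decode the four 0/1 indicators; all zeros means the reference group."""
    for label, flag in zip(AGE_LABELS, indicators):
        if flag == 1:
            return label
    return "over 60"


def crosstab(records):
    """Count records per (identifier, age group) pair."""
    return Counter((ident, age_group(flags)) for ident, *flags in records)


table = crosstab(rows)
for (ident, group), count in sorted(table.items()):
    print("churn" if ident else "loyal", group, count)
```

The same decoding applies to the zone, value, education and occupation indicators, each of which also leaves one category without a column of its own.
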
B.

INTERPRETATION:

In this crosstab, Z5 represents users from Begusarai, Z4 Bhagalpur, Z3 Dhanbad, Z2 Muzaffarpur
and Z1 Patna; a user who does not fall under any of these zones belongs to the Ranchi zone.

This crosstab shows how many users in each zone are loyal and how many have churned out of the
company. Of the 1554 loyal users, 420 are from zone 4 (Bhagalpur), and of the 580 churned
users, 125 are from zone 2 (Muzaffarpur).
C.

INTERPRETATION:

Here V3, V2, V1 and V4 represent LVC, MVC, HVC and UHVC (low, medium, high and ultra high value
customers) respectively. According to this crosstab, 514 of the 1554 loyal users are LVC, and
198 of the 580 churned users are also LVC.

D.

INTERPRETATION:

This crosstab shows the educational qualification of users, where E1 and E2 represent Graduates
and Matriculation respectively, and the remaining users are Post Graduates. According to this
table, 604 of the 1554 loyal users have Matriculation as their qualification, and so do 264 of
the 580 churned users.

E.
INTERPRETATION:

This crosstab shows which occupations are more loyal and which have mostly churned out of the
company. Here O4, O3, O2 and O1 represent Agriculture, Business, Home maker and Services
respectively, and the remaining users are Students. Of the 1554 loyal users, 428 are in
Business, and of the 580 churned users, 178 have Services as their occupation.

F.

This is the master table (below); from it we can find the count for any combination of
variables.

2) CHI-SQUARE TESTING:

HYPOTHESES:
• There is no relationship between area (rural/urban) and churn status (loyal/churn)
• There is no relationship between peer pressure and identifier (loyal/churn)
• There is no relationship between brand loyalty and identifier (loyal/churn)
• There is no relationship between customer service and identifier (loyal/churn)
• There is no relationship between call charges/prices and identifier (loyal/churn)

HYPOTHESIS 1:

• H0: There is no relationship between area (rural/urban) and churn status (loyal/churn)
• Ha: There is a relationship between area (rural/urban) and churn status (loyal/churn)

INTERPRETATION:

This chi-square test indicates a dependency between area and the identifier: whether a user lives in
a rural or an urban area is associated with whether that user is loyal or churns. The p-value is less
than alpha (0.05, i.e. a 5% significance level or 95% confidence level), so we reject the null
hypothesis and accept the alternative hypothesis.
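
The decision rule used in these tests can be sketched in a few lines. The example below is a dependency-free illustration, not the project's code: the observed counts are hypothetical (the report's own area-by-identifier counts are not reproduced in the text), and the function names are made up for the sketch. It computes the Pearson chi-square statistic from observed and expected counts and, for a 2x2 table (one degree of freedom), the p-value. Library routines such as `scipy.stats.chi2_contingency` perform the same test, and for 2x2 tables they apply a continuity correction by default that this sketch omits.

```python
import math


def chi_square_independence(table):
    """Return (statistic, degrees of freedom) for a table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under H0 (independence of rows and columns).
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


def p_value_one_dof(statistic):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2.0))


# Hypothetical counts (rows: rural, urban; columns: loyal, churn), not from the report.
observed = [[60, 40], [40, 60]]
stat, dof = chi_square_independence(observed)
p = p_value_one_dof(stat)
alpha = 0.05
print(f"chi2={stat:.2f}, dof={dof}, p={p:.4f}, reject H0: {p < alpha}")
```

For variables with more than two categories the statistic and degrees of freedom are computed the same way, but the p-value then needs the chi-square distribution for that number of degrees of freedom.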

HYPOTHESIS 2:

• H0: There is no relationship between peer pressure and identifier (loyal/churn)
• Ha: There is a relationship between peer pressure and identifier (loyal/churn)

INTERPRETATION:

This chi-square test indicates a dependency between peer pressure and the identifier, so peer
pressure on a user is associated with that user's loyal or churn status. The p-value is less than
alpha (0.05, i.e. a 5% significance level or 95% confidence level), so we reject the null hypothesis
and accept the alternative hypothesis.

HYPOTHESIS 3:

• H0: There is no relationship between brand loyalty and identifier (loyal/churn)
• Ha: There is a relationship between brand loyalty and identifier (loyal/churn)

INTERPRETATION:

This chi-square test indicates a dependency between brand loyalty and the identifier: a user with
strong brand loyalty tends to have loyal status, but if that brand loyalty changes and the user shifts
to another company, the user is no longer loyal and counts as churned. Here too the p-value is less
than alpha (0.05, i.e. a 5% significance level or 95% confidence level), so we reject the null
hypothesis and accept the alternative hypothesis.

HYPOTHESIS 4:

• H0: There is no relationship between customer service and identifier (loyal/churn)
• Ha: There is a relationship between customer service and identifier (loyal/churn)

INTERPRETATION:

This chi-square test indicates a dependency between customer service and the identifier, which shows
that whether a user stays loyal or churns is related to the customer service provided by the company.
Here too the p-value is less than alpha (0.05, i.e. a 5% significance level or 95% confidence level),
so we reject the null hypothesis and accept the alternative hypothesis.

HYPOTHESIS 5:

• H0: There is no relationship between call charges/prices and identifier (loyal/churn)
• Ha: There is a relationship between call charges/prices and identifier (loyal/churn)

INTERPRETATION:

This chi-square test indicates a dependency between call charges/prices and the identifier, which
suggests that a change in call charges/prices, such as an increase, can shift a user's status. Here
too the p-value is less than alpha (0.05, i.e. a 5% significance level or 95% confidence level), so
we reject the null hypothesis and accept the alternative hypothesis.

3) CAT PLOT:
A scatter plot is normally used to show the relationship between two numerical variables. A cat
plot is a type of scatter plot used to show the relationship between two categorical variables.

INTERPRETATION:

This cat plot shows the relationship between internet speed and the identifier (loyal/churn), where
0 represents loyal and 1 represents churn. A denser cluster of points means that more people belong
to that category, and here the points are densest in the loyal category. From this we can say that
more users stay with the company because of its internet speed and are loyal to it.

INTERPRETATION:

This cat plot shows the relationship between internet speed and network. The numbers 2, 3, 4 and 5
represent the different networks available in the area. A denser cluster of points means that more
people belong to that category, so we can say that more users are inclined towards network 4
because of its good internet speed.

INTERPRETATION:

This cat plot shows the relationship between security/privacy and the identifier (loyal/churn).
Here also 0 represents a loyal user and 1 a churned user. The majority of the points fall in
category 1, which suggests that security/privacy is a major reason behind users churning out of
the company.

INTERPRETATION:

This cat plot shows the relationship between network and the identifier (loyal/churn). The points
are denser on the loyal side, so more users fall in the loyal category. This also suggests that the
company's network is good, which is why users are loyal to it.

4) HISTOGRAMS:

INTERPRETATION:

From this histogram we can read how many customers fall in each category. For example, for gender
there are 1809 male customers and 325 female customers.

Table of counts and percentage values for some variables:

5) PIE CHART:

[Pie charts: Gender, Age, Zone, Location, Customer Based on Value, Educational Qualification,
Occupation, Multi SIM, Multi SIM Phone]

INTERPRETATION:

Of the 2134 respondents in this dataset, 84.8 percent were male and 15.2 percent were female. With
regard to age, 35.4 percent of the respondents were between 25 and 40 years of age, followed by 31.4
percent between 18 and 25 years, 17.4 percent under 18 years, 10.4 percent between 40 and 60 years
and 5.2 percent over 60 years. The zone-wise distribution showed that 24.8 percent of the respondents
were from Bhagalpur, 19.9 percent from Ranchi, 17.9 percent from Begusarai, 14.2 percent from Patna,
13.9 percent from Muzaffarpur and 9.4 percent from Dhanbad. By location, 57.4 percent of the
respondents were from rural areas and 42.6 percent from urban areas. By value, 33.4 percent of the
respondents were Low Value Customers, 28.9 percent Medium Value Customers, 20.2 percent High Value
Customers and 17.5 percent Ultra High Value Customers. By educational qualification, 40.7 percent of
the respondents had Higher Secondary or below, followed by 35.7 percent Graduates and 23.6 percent
Post-Graduates. By occupation, 26.6 percent of the respondents were in business, 23.9 percent in
agriculture, 23.4 percent in the service sector, 20.8 percent were students and 5.2 percent were home
makers. With regard to multi SIM use, 56.1 percent of the respondents were using multiple SIMs and
43.9 percent were not. With regard to multi-SIM phones, 54.3 percent were using one and 45.7 percent
were not.
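
The percentages above follow directly from the category counts. As a small check, the sketch below recomputes the gender shares from the counts quoted for the histogram (1809 male, 325 female) and the loyal/churn shares from the dataset description (1554 loyal, 580 churn); the helper name `percentage_shares` is made up for this example.

```python
def percentage_shares(counts):
    """Convert category counts to percentages of the total, rounded to one decimal."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


gender = percentage_shares({"male": 1809, "female": 325})
identifier = percentage_shares({"loyal": 1554, "churn": 580})
print(gender)
print(identifier)
```

The gender result reproduces the 84.8 and 15.2 percent reported in the interpretation.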

LOGISTIC REGRESSION MODEL


CONFUSION MATRIX:
INTERPRETATION:

• In this matrix, with loyal (class 0) treated as the positive class, 456 is the number of true
positives: users who are loyal both in the actual data and in the prediction.
• 5 is the number of false negatives: users who are actually loyal but are predicted to have
churned out of the company.
• 15 is the number of false positives: users who have actually churned out of the company but
are predicted to be loyal.
• 165 is the number of true negatives: users who fall under category 1 (churn) both in the
actual data and in the prediction.
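
How such a matrix is tallied from actual and predicted labels can be sketched as below. The eight labels are invented for illustration, and the function is a hand-rolled stand-in rather than the project's code; it uses the same layout as the matrix described above (rows are actual classes, columns are predicted classes), which is also the layout returned by `sklearn.metrics.confusion_matrix`.

```python
def confusion_matrix(actual, predicted, labels=(0, 1)):
    """Rows are actual classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix


# Invented labels for eight test users: 0 = loyal, 1 = churn.
actual = [0, 0, 0, 0, 1, 1, 1, 0]
predicted = [0, 0, 1, 0, 1, 0, 1, 0]
cm = confusion_matrix(actual, predicted)
print(cm)
```

In this layout the report's logistic regression matrix reads `[[456, 5], [15, 165]]`.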

CLASSIFICATION REPORT:

INTERPRETATION:

The classification report lists the key metrics of a classification problem. It covers two
classes, 0 and 1; here both classes have the same precision, but class 0 has higher recall
than class 1.

• Recall answers "how many members of this class did you find, out of all the members of this
class?". It tells us how much of a class the model could catch.
• Precision answers "of the users predicted to be in this class, how many really are in it?",
i.e. how precise the prediction is.
• The F1-score is the harmonic mean of precision and recall. It reaches its best value at 1 and
its worst at 0.
• The support is the number of occurrences of the given class in the test set. Here we have 461
users of class 0 and 180 users of class 1, a split of roughly 72% to 28%, which is reasonably
balanced.

INTERPRETATION:

• The classification rate (accuracy) is about 96.9% (621 of 641 test users classified correctly),
which is considered good accuracy.
• Precision is about being precise: when the model makes a prediction, how often is it correct?
Here, when the logistic regression model predicts that a user is going to churn out of the
company, that user actually churns about 97% of the time (165 of 170).
• Recall: of the users in the test set who actually churn, the logistic regression model
identifies about 91.7% (165 of 180).
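
The figures above can be recomputed from the four counts in the logistic regression confusion matrix. The sketch below does this per class; it is an illustration with a made-up helper name (`class_metrics`), not the project's code, and assumes the matrix layout of rows as actual classes and columns as predicted classes. With scikit-learn, `sklearn.metrics.classification_report` prints the same quantities from the label arrays.

```python
def class_metrics(cm, cls):
    """Precision, recall, F1 and support for class `cls` (rows actual, columns predicted)."""
    tp = cm[cls][cls]
    predicted_as_cls = sum(row[cls] for row in cm)
    support = sum(cm[cls])
    precision = tp / predicted_as_cls
    recall = tp / support
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1, support


# Counts quoted in the confusion-matrix interpretation for the logistic regression model.
cm_logistic = [[456, 5], [15, 165]]
for cls in (0, 1):
    p, r, f1, n = class_metrics(cm_logistic, cls)
    print(f"class {cls}: precision={p:.3f} recall={r:.3f} f1={f1:.3f} support={n}")
accuracy = (cm_logistic[0][0] + cm_logistic[1][1]) / sum(map(sum, cm_logistic))
print(f"accuracy={accuracy:.3f}")
```

Passing the neural network matrix to the same function gives the corresponding figures for that model.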

NEURAL NETWORK MODEL


CONFUSION MATRIX:
INTERPRETATION:

• In this matrix, with loyal (class 0) treated as the positive class, 458 is the number of true
positives: users who are loyal both in the actual data and in the prediction.
• 3 is the number of false negatives: users who are actually loyal but are predicted to have
churned out of the company.
• 19 is the number of false positives: users who have actually churned out of the company but
are predicted to be loyal.
• 161 is the number of true negatives: users who fall under category 1 (churn) both in the
actual data and in the prediction.

CLASSIFICATION REPORT:
INTERPRETATION:

The classification report lists the key metrics of a classification problem. It covers two
classes, 0 and 1. Going by the confusion matrix above, class 1 has somewhat higher precision
(161 of 164, about 98%) than class 0 (458 of 477, about 96%), while class 0 has higher recall
(458 of 461, about 99%) than class 1 (161 of 180, about 89%).

• Recall answers "how many members of this class did you find, out of all the members of this
class?". It tells us how much of a class the model could catch.
• Precision answers "of the users predicted to be in this class, how many really are in it?",
i.e. how precise the prediction is.
• The F1-score is the harmonic mean of precision and recall. It reaches its best value at 1 and
its worst at 0.
• The support is the number of occurrences of the given class in the test set. Here we have 461
users of class 0 and 180 users of class 1, a split of roughly 72% to 28%, which is reasonably
balanced.

COMPARISON AND RESULT FROM THE TWO MODELS

Comparing the two models, the logistic regression model gives a higher recall for the churn class
than the neural network model (165 of 180 churners identified, about 91.7%, against 161 of 180,
about 89.4%). Recall tells us how many of the positive class the model could catch, so we always
want the recall value to be high.

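Even after trying different parameters, the neural network's recall for the churn class stays below
that of the logistic regression model. Its precision for the loyal class is also slightly lower
(458 of 477 against 456 of 471), although its precision for the churn class is slightly higher
(161 of 164 against 165 of 170).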
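
The comparison can be reproduced from the two confusion matrices quoted earlier. The sketch below is an illustration only (the helper `churn_scores` is a made-up name) and treats churn, class 1, as the class of interest, with rows as actual classes and columns as predicted classes.

```python
def churn_scores(cm):
    """Recall and precision for the churn class (index 1); rows actual, columns predicted."""
    true_churn = cm[1][1]
    recall = true_churn / (cm[1][0] + cm[1][1])
    precision = true_churn / (cm[0][1] + cm[1][1])
    return recall, precision


models = {
    "logistic regression": [[456, 5], [15, 165]],
    "neural network": [[458, 3], [19, 161]],
}
scores = {name: churn_scores(cm) for name, cm in models.items()}
for name, (recall, precision) in scores.items():
    print(f"{name}: churn recall={recall:.3f}, churn precision={precision:.3f}")
best = max(scores, key=lambda name: scores[name][0])
print("higher churn recall:", best)
```
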
Which metric (accuracy, precision or recall) is most important for model selection in this
business context, and why?

Ans - Recall is the most important metric for model selection in this business context because it
measures how completely the model identifies the relevant cases. It is also called sensitivity
or the true positive rate. Of all the users who actually churn out of the company, recall tells
us how many we correctly identified as going to churn.

This matters in a business context because every churner the model misses is a customer the
company gets no chance to retain. In a subscription-based business, even a small rate of monthly
or quarterly churn compounds quickly over time: just 1 percent monthly churn translates to almost
12 percent yearly churn. Given that it is far more expensive to acquire a new customer than to
retain an existing one, businesses with high churn rates will quickly find themselves in a
financial hole as they have to devote more and more resources to new customer acquisition.
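
The compounding can be worked out directly. Assuming a constant monthly churn rate applied to the customers remaining each month (an assumption of this sketch, with a made-up function name), the share of the starting base lost after a year is 1 - (1 - rate)^12:

```python
def annual_churn(monthly_rate, months=12):
    """Share of the starting customer base lost after `months` of constant monthly churn."""
    return 1 - (1 - monthly_rate) ** months


rate = annual_churn(0.01)
print(f"{rate:.1%}")  # prints 11.4%
```

With compounding, 1 percent monthly churn therefore comes to about 11.4 percent a year, slightly below the simple 12 x 1 percent figure because each month's loss applies to a smaller remaining base.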

An example helps to show why recall matters. According to statistics, the global
telecommunication industry is recording huge losses, amounting to billions of dollars, due to
churn. The rule of thumb known to marketers is that it costs 5 times more to acquire a new
customer than to retain an existing one. So it is preferable for companies not to lose track of
their existing customer base and to focus on actions and measures that reduce churn. Sometimes
a new customer may churn before the company has even recovered the whole acquisition cost.

Since it is difficult to detect potential churners, telecom firms need to take the necessary
actions to identify those who intend to churn before they act on that intention and cause
profits to fall. Customers switch easily when competitors offer what they consider to be in
their best interest.
