MODEL
[Python Certification Course,2021]
Submitted to:
PROF.NIRENDU KONAR
Submitted by:
Swati choudhary
(20BSP2597)
This project has two objectives: understanding the data through exploratory data analysis (EDA) and developing a machine learning (ML) model to predict customer churn.
Cross tabulation
Pie chart
Chi-square test of independence
Cat plot
Histogram
1) CROSS TABULATION:
A.
INTERPRETATION:
In this crosstab, A4 represents users younger than 18 years, A3 users aged 18 to 25, A2 users aged 25 to 40 and A1 users aged 40 to 60. The last age group, more than 60 years, has no column of its own: a user for whom A1 to A4 are all 0 belongs to it. In the churn column, 0 represents a loyal user and 1 represents a user who churned out of the company.
The crosstab shows how many users in each age group are loyal and how many have churned. Of the 1554 loyal users, 612 are aged 18 to 25, and of the 580 churned users, 198 are aged 25 to 40.
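To make the idea of a cross tabulation concrete, here is a minimal sketch in plain Python. The records are hypothetical and are not taken from the project data, and the helper name `crosstab` is an assumption for illustration. In practice a library routine such as `pandas.crosstab` would typically be used to produce the same kind of table.

```python
from collections import Counter

# Hypothetical records: (age_code, churn_flag). Not the project's data.
# churn_flag: 0 = loyal, 1 = churned.
records = [
    ("A3", 0), ("A3", 0), ("A2", 1), ("A2", 0),
    ("A1", 1), ("A4", 0), ("A3", 1), ("A2", 1),
]


def crosstab(rows):
    """Count how often each (category, flag) pair occurs."""
    counts = Counter(rows)
    categories = sorted({category for category, _ in rows})
    flags = sorted({flag for _, flag in rows})
    # One row per category, one column per flag value; missing pairs count as 0.
    return {c: {f: counts[(c, f)] for f in flags} for c in categories}


table = crosstab(records)
for category, row in table.items():
    print(category, row)
```

Each row of the result gives, for one age code, the number of loyal (0) and churned (1) users, which is how the tables in this section are read.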
B.
INTERPRETATION:
This crosstab shows how many users in each zone are loyal to the company and how many have churned. Of the 1554 loyal users, 420 are from zone 4 (Bhagalpur), and of the 580 churned users, 125 are from zone 2 (Muzzafarpur).
C.
INTERPRETATION:
Here V3, V2, V1 and V4 represent LVC, MVC, HVC and UHVC (low, medium, high and ultra high value customers) respectively. According to this crosstab, 514 of the 1554 loyal users and 198 of the 580 churned users fall under LVC.
D.
Interpretation:
This crosstab describes the educational qualification of users: E1 represents Graduates, E2 represents Matriculation, and the remaining users are Post Graduates. According to the table, 604 of the 1554 loyal users and 264 of the 580 churned users have Matriculation as their qualification.
E.
Interpretation:
This crosstab shows which occupations are more loyal and which churn most. Here O4, O3, O2 and O1 represent Agriculture, Business, Home maker and Services respectively, and the remaining users are Students. Of the 1554 loyal users, 428 are in Business, and of the 580 churned users, 178 have Services as their occupation.
F.
This is the master table (below). From it we can find the count for any combination of variables.
2) CHI – SQUARE TESTING:
NULL HYPOTHESES:
There is no relationship between area (rural/urban) and identifier (loyal/churn)
There is no relationship between peer pressure and identifier (loyal/churn)
There is no relationship between brand loyalty and identifier (loyal/churn)
There is no relationship between customer service and identifier (loyal/churn)
There is no relationship between call charges/prices and identifier (loyal/churn)
HYPOTHESIS 1:
INTERPRETATION:
According to this chi-square test, there is a dependency between area and identifier, so a user's area of living (rural or urban) is associated with whether they are loyal or churned. The p-value is less than alpha (0.05, i.e. a 95% confidence level), so we reject the null hypothesis and accept the alternate hypothesis.
HYPOTHESIS 2:
INTERPRETATION:
According to this chi-square test, there is a dependency between peer pressure and identifier, so peer pressure on a user may change their identifier status. The p-value is less than alpha (0.05, i.e. a 95% confidence level), so we reject the null hypothesis and accept the alternate hypothesis.
HYPOTHESIS 3:
INTERPRETATION:
According to this chi-square test, there is a dependency between brand loyalty and identifier. A user with high brand loyalty tends to have loyal status, but if that loyalty changes and the user shifts to another company, the user is no longer loyal and may churn out of the company. Here too the p-value is less than alpha (0.05, i.e. a 95% confidence level), so we reject the null hypothesis and accept the alternate hypothesis.
HYPOTHESIS 4:
INTERPRETATION:
According to this chi-square test, there is a dependency between customer service and identifier, which shows that whether a user stays loyal or churns depends on the customer service provided by the company. Here too the p-value is less than alpha (0.05, i.e. a 95% confidence level), so we reject the null hypothesis and accept the alternate hypothesis.
HYPOTHESIS 5:
According to this chi-square test, there is a dependency between call charges/prices and identifier, which suggests that an increase in call charges/prices may change the status of the user. Here too the p-value is less than alpha (0.05, i.e. a 95% confidence level), so we reject the null hypothesis and accept the alternate hypothesis.
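The decision rule used in the five tests above (reject the null hypothesis when the p-value is below alpha = 0.05) can be sketched for a 2x2 table as follows. This is a minimal illustration, not the code used in the project: the counts are hypothetical, the function name `chi_square_2x2` is an assumption, and the p-value formula holds only for 1 degree of freedom. The sketch applies no continuity correction, whereas `scipy.stats.chi2_contingency` applies Yates' correction to 2x2 tables by default, so its statistic would differ slightly.

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square test of independence for a 2x2 table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the two variables were independent.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected

    # With 1 degree of freedom, P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Hypothetical counts: rows = rural, urban; columns = loyal, churned.
observed = [[60, 40], [30, 70]]
statistic, p_value = chi_square_2x2(observed)

alpha = 0.05
reject_null = p_value < alpha
print(f"chi-square = {statistic:.3f}, p-value = {p_value:.6f}, reject H0: {reject_null}")
```

A small p-value indicates that the observed counts are unlikely under independence. It shows an association between the two variables, not that one causes the other.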
3. Cat plot – A scatter plot shows the relationship between two numerical variables. A cat plot (categorical plot) is its counterpart for categorical data: it is used to show relationships that involve one or more categorical variables.
INTERPRETATION-
INTERPRETATION:
INTERPRETATION:
4. Histogram
Interpretation –
From this histogram we can read how many customers fall in each category. For example, for gender there are 1809 male customers and 325 female customers.
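The percentage values reported for each variable follow directly from such counts. As a short worked example using the gender counts quoted above (the variable names are illustrative):

```python
# Gender counts as reported in the histogram interpretation.
counts = {"Male": 1809, "Female": 325}

total = sum(counts.values())  # number of respondents

# Share of each category, as a percentage rounded to one decimal place.
percentages = {label: round(100 * n / total, 1) for label, n in counts.items()}

print(total)
print(percentages)
```

The total comes to 2134 respondents, with 84.8 percent male and 15.2 percent female.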
Table of counts and percentage values for some variables:
5. Pie chart:
[Pie charts: Gender, Age, Zone, Location]
Interpretation :-
There were 2134 respondents in this dataset. Of these, 84.8 percent were male and 15.2 percent were female. With regard to age, 35.4 percent of the respondents were between 25 and 40 years of age, followed by 31.4 percent between 18 and 25 years, 17.4 percent below 18 years, 10.4 percent between 40 and 60 years and 5.2 percent above 60 years.
The zone-wise distribution showed that 24.8 percent of the respondents were from Bhagalpur, 19.9 percent from Ranchi, 17.9 percent from Begusarai, 14.2 percent from Patna, 13.9 percent from Muzzafarpur and 9.4 percent from Dhanbad. By location, 57.4 percent of the respondents were from a Rural location and 42.6 percent from an Urban location.
Classified by Value, 33.4 percent of the respondents were Low Value Customers, 28.9 percent were Medium Value Customers, 20.2 percent were High Value Customers and 17.5 percent were Ultra High Value Customers. By educational qualification, 40.7 percent of the respondents had Higher Secondary and below, followed by 35.7 percent Graduates and 23.6 percent Post-Graduates. By occupation, 26.6 percent of the respondents were doing business, 23.9 percent were in agriculture, 23.4 percent were in the service sector, 20.8 percent were students and 5.2 percent were Home Makers.
With regard to the use of multi sim, 56.1 percent of the respondents were using it and 43.9 percent were not. With regard to the use of a multi-sim phone, 54.3 percent were using one and 45.7 percent were not.
CLASSIFICATION REPORT:
INTERPRETATION:
CLASSIFICATION REPORT:
INTERPRETATION:
Comparing the two models, the logistic regression model gives a much higher recall than the neural network model. Recall tells us how many of the positive class the model was able to catch, so we want it to be as high as possible. Even after trying different parameters for the neural network, its recall remains lower than that of the logistic regression model, and its precision is lower as well.
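For reference, the metrics compared above can be computed from true and predicted labels as in the sketch below. The labels are hypothetical and do not reproduce either model's classification report, and the function name `classification_metrics` is an assumption for illustration. Churn (1) is treated as the positive class.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision and recall for a binary classifier."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    # Precision: of the users predicted to churn, how many actually churned.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of the users who actually churned, how many were caught.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Hypothetical labels: 1 = churned, 0 = loyal.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

accuracy, precision, recall = classification_metrics(y_true, y_pred)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```

In this toy case one churner out of four is missed, so recall is 0.75 even though accuracy is 0.80. On data where churners are a minority, accuracy can stay high while recall is poor.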
Which parameter (Accuracy, Precision or Recall) is most important for model selection in this business context, and why?
Ans - Recall is the most important parameter for model selection in this business context because it measures how many of the relevant cases our model is able to identify. It is also called Sensitivity or True Positive Rate. Thus, of all the users who actually churn out of the company, recall tells us what share we correctly identified as going to churn.
This matters because, in a subscription-based business, even a small rate of monthly or quarterly churn compounds quickly over time. Just 1 percent monthly churn translates to almost 12 percent yearly churn. Given that it is far more expensive to acquire a new customer than to retain an existing one, businesses with high churn rates will quickly find themselves in a financial hole, as they have to devote more and more resources to new customer acquisition.
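The compounding of 1 percent monthly churn can be checked with a short calculation. This sketch assumes a constant monthly churn rate applied to the customers remaining each month, and the function name `annual_churn` is illustrative.

```python
def annual_churn(monthly_rate, months=12):
    """Fraction of the starting customers lost after `months` of constant churn."""
    retained = (1 - monthly_rate) ** months  # share still subscribed at the end
    return 1 - retained


annual = annual_churn(0.01)
print(f"{annual:.4f}")
```

Under this assumption the yearly churn comes to roughly 11.4 percent, slightly below the 12 percent obtained by simply multiplying 1 percent by 12, because each month's churn applies to a smaller remaining base.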
An example helps to show why recall matters. According to statistics, the global telecommunication industry is recording losses that amount to billions of dollars due to churn. The rule of thumb known by marketers is that it costs 5 times more to acquire a new customer than to retain an existing one. So it is preferable for companies not to lose track of their existing customer base and to focus on actions and measures that reduce churn. Sometimes a new customer may churn before the company has recovered the whole acquisition cost.
Since potential churners are difficult to detect, Telco firms need to identify customers who intend to churn before they act on that intention and reduce profits. Customers switch easily when competitors offer what they consider to be in their best interest.