
Part A.

Churn Analysis for the mobile operator “KCELL”

Business understanding
KCELL (“the Company”) is a mobile operator based in Kazakhstan that devotes substantial marketing and sales
resources to attracting new customers and retaining existing ones. Over the last three years, the Company has
experienced a decrease in sales. Management would therefore like to understand customer behavior and accurately
predict which customers are more likely to churn in the future.
Data understanding
The Company has structured client data records based on
historical information about which customers ultimately left
and which kept using its services. Through a process
known as training, this historical data can be used to build a
machine learning model of churn for this telecom operator.
After developing and validating the model, we can use the
profile details of any given customer to forecast whether
they will leave or stay.
Hypotheses:
- A higher volume of calls to customer care in a certain time frame suggests that a customer is having several
issues, and that there is a high likelihood that they will leave.
- If a customer has a high overall billed cost and is dissatisfied with the existing service, they are more inclined to
look for another operator.
Data preparation and exploration
To ensure the quality of the dataset, the “Garbage In, Garbage Out” rule should be kept in mind. The data
should be complete, clean and accurate. We should ensure that there are no missing or null values
in the data and no duplicate or conflicting values for the same dimension.
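The checks described above can be sketched in plain Python. This is only an illustration: the records and field names (`customer_id`, `service_calls`, `total_bill`, `churn`) are hypothetical and are not taken from the Company's data. In practice a data-frame library such as pandas would usually handle this step.

```python
# Hypothetical customer records; field names and values are illustrative only.
records = [
    {"customer_id": 1, "service_calls": 2, "total_bill": 45.0, "churn": 0},
    {"customer_id": 2, "service_calls": None, "total_bill": 80.5, "churn": 1},
    {"customer_id": 3, "service_calls": 5, "total_bill": None, "churn": 1},
    {"customer_id": 1, "service_calls": 2, "total_bill": 45.0, "churn": 0},  # exact duplicate
]


def count_missing(rows):
    """Return the number of missing (None) values per field."""
    counts = {}
    for row in rows:
        for field, value in row.items():
            counts[field] = counts.get(field, 0) + (1 if value is None else 0)
    return counts


def drop_duplicates(rows):
    """Keep only the first occurrence of each identical record."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(sorted(row.items(), key=lambda item: item[0]))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


missing = count_missing(records)
deduplicated = drop_duplicates(records)
# Keep only complete records (no missing values in any field).
clean = [row for row in deduplicated if None not in row.values()]

print("Missing values per field:", missing)
print("Rows after removing duplicates:", len(deduplicated))
print("Complete rows:", len(clean))
```

Dropping incomplete rows is only one option; whether to drop or impute missing values depends on how much data would be lost.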
To gain key insight into the data we should understand the relationship between features and target
variables. In addition, feature engineering should be performed to represent and categorize
customers based on the features that likely make them churn.
Modeling technique
To predict whether a customer will churn, a classification approach is used. Specifically,
logistic regression models the relationship between the target feature, Churn, and the remaining features, and
estimates the probability that a customer belongs to each class. Based on the values
of the variables, this probability is converted into an output of 1 or 0, corresponding to the customer leaving
the Company or continuing with it.
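As a minimal sketch of how logistic regression turns feature values into a class, the snippet below applies the sigmoid function to a weighted sum and then thresholds the result. The coefficients, the two features and the 0.5 threshold are assumptions chosen for illustration, not values fitted to the Company's data.

```python
import math


def churn_probability(features, weights, bias):
    """Logistic model: sigmoid of a weighted sum of the feature values."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def predict_class(probability, threshold=0.5):
    """Map a probability to class 1 (churn) or 0 (stay)."""
    return 1 if probability >= threshold else 0


# Hypothetical coefficients for [customer-care calls, total bill in hundreds].
weights = [0.6, 0.8]
bias = -3.0

p_low = churn_probability([1, 0.5], weights, bias)   # few calls, low bill
p_high = churn_probability([5, 1.5], weights, bias)  # many calls, high bill

print(round(p_low, 3), predict_class(p_low))
print(round(p_high, 3), predict_class(p_high))
```

The threshold need not be 0.5; lowering it flags more customers as likely churners at the cost of more false alarms.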
Evaluation Techniques Required
To evaluate the effectiveness of the model, we need to look at how frequently its
predictions are correct, since we are trying to anticipate whether a customer will churn.
To do this, predicted labels will be compared with actual labels on both the training
data and the test data. We will also report the accuracy score, confusion matrix,
and AUC for the model.
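The three measures named above can be computed by hand as follows. This is a sketch on made-up labels and scores; the AUC here is computed as the share of (churner, non-churner) pairs in which the churner receives the higher score, with ties counted as one half.

```python
def confusion_matrix(y_true, y_pred):
    """Count true/false positives and negatives, with churn (1) as the positive class."""
    pairs = list(zip(y_true, y_pred))
    return {
        "tp": sum(1 for t, p in pairs if t == 1 and p == 1),
        "tn": sum(1 for t, p in pairs if t == 0 and p == 0),
        "fp": sum(1 for t, p in pairs if t == 0 and p == 1),
        "fn": sum(1 for t, p in pairs if t == 1 and p == 0),
    }


def accuracy(y_true, y_pred):
    """Share of predictions that match the actual label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def auc(y_true, scores):
    """Probability that a random churner scores higher than a random non-churner."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical actual labels and predicted churn probabilities.
y_true = [1, 0, 1, 0, 0, 1]
scores = [0.9, 0.2, 0.4, 0.6, 0.1, 0.8]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

cm = confusion_matrix(y_true, y_pred)
acc = accuracy(y_true, y_pred)
area = auc(y_true, scores)
print(cm, round(acc, 3), round(area, 3))
```

Accuracy alone can be misleading when churners are a small minority, which is one reason to report the confusion matrix and AUC alongside it.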
Conclusion
Knowing the status of existing customers is pivotal to the Company’s success. With an accurate churn prediction
model, the management can take proactive actions to prevent churn and sustain growth efforts.

Part B.
The given dataset contains casino information in 17 columns and approximately 7.3 thousand rows.

Using Python, the following is observed:


The number of observations used is 7,305 and only 1 is unused. The dataset has 6 categorical and 9 numerical
variables, as follows:

The completeness of the dataset was checked in Python, and no missing values were found:


To summarize the data in visualizations and show its distribution, the seaborn library was used in
Python:

The main objective of the analysis is to predict the Gross Revenue per machine using a Multiple Linear
Regression (MLR) model, where the response variable is the Gross Revenue per machine and the predictors are
GrossRevenue, NoMachines, Section, MachineName, Model, Diff_w_Upper, PlaysPerMachine, Casino.

The MLR model equation is represented below:
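The fitted coefficients are not reproduced in this text. For reference only, the general form of an MLR model with p predictors is shown here; the symbols are generic, and categorical predictors such as Section or Casino would typically enter as one or more indicator (dummy) variables.

```latex
y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip} + \varepsilon_i,
\qquad i = 1, \dots, n
```

Here y_i is the Gross Revenue per machine for observation i, x_{ij} is the value of predictor j, the beta terms are the coefficients to be estimated, and epsilon_i is the error term.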


The model is evaluated with fit measures such as R square, MSE and Root MSE, which have the following
values:

The R square value seems high: the model explains approximately 76% of the variability in the data.
The mean squared error (MSE) gauges the degree of inaccuracy in a statistical model: it is the average
squared difference between the predicted and observed values.
The Root MSE indicates that, on average over the range of the data, the difference between the predicted and
actual values is approximately $4,528; this figure may be imprecise and should be investigated further.
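The three fit measures can be illustrated on a small made-up example. The numbers below are hypothetical and unrelated to the casino dataset. The sketch divides the squared error by the number of observations; statistical packages often divide by the residual degrees of freedom instead, so reported values can differ slightly.

```python
import math


def regression_fit_measures(actual, predicted):
    """Return R square, MSE and Root MSE for paired actual/predicted values."""
    n = len(actual)
    mean_actual = sum(actual) / n
    # Sum of squared errors between observed and predicted values.
    sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Total sum of squares around the mean of the observed values.
    sst = sum((a - mean_actual) ** 2 for a in actual)
    mse = sse / n  # assumption: divide by n, not by residual degrees of freedom
    return {"r2": 1 - sse / sst, "mse": mse, "rmse": math.sqrt(mse)}


# Hypothetical observed and predicted values.
actual = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 37.0]

fit = regression_fit_measures(actual, predicted)
print(fit)
```

Because the Root MSE is in the same units as the response, it is the easiest of the three to interpret as a typical prediction error.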
The default p-value threshold of .05 is used as the cutoff for the significance of a variable. Purple horizontal bars
that reach the right side of the plot indicate effects with p-values lower than .05. Blue bars indicate effects with
p-values higher than .05. The distribution of effects across different p-value ranges on the negative log10 scale is
displayed in the bar graph below.
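To make the negative log10 scale concrete, the sketch below converts a few p-values and compares them with the .05 cutoff. The effect names and p-values are hypothetical placeholders, not results from the fitted model.

```python
import math

ALPHA = 0.05  # significance cutoff used in the text


def neg_log10(p_value):
    """Transform a p-value to the negative log10 scale (smaller p gives a longer bar)."""
    return -math.log10(p_value)


# Hypothetical effects and p-values, for illustration only.
p_values = {"effect_a": 0.001, "effect_b": 0.2, "effect_c": 0.04}

threshold = neg_log10(ALPHA)
bars = {name: neg_log10(p) for name, p in p_values.items()}
significant = sorted(name for name, p in p_values.items() if p < ALPHA)

print("Cutoff on the -log10 scale:", round(threshold, 3))
for name, height in bars.items():
    print(name, round(height, 3), "significant" if height > threshold else "not significant")
```

On this scale the cutoff sits at roughly 1.3, so any bar extending past that point corresponds to a p-value below .05.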
The Fit Summary indicates that there are only six significant variables. In addition, the
Studentized Residual Plot shows some minor positive and negative outliers in the residuals.
With regard to the Assessment, the model predictions seem consistent with the observed response outcomes.

Applying different variable selection methods and selection criteria did not substantially improve the
model, and the adjusted R square did not increase. It did, however, lead to a specification that was more logical and
practical from an exploratory standpoint.
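For context on why the adjusted R square may not increase when variables are added or removed, the sketch below applies the standard adjustment formula. The R square of 0.76 and the 7,305 observations come from the text above; the predictor counts are assumptions, since categorical variables expand into several model terms.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R square: penalizes R square for the number of model terms."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


R2 = 0.76      # approximate R square reported in the text
N_OBS = 7305   # observations used, as reported in the text

# Hypothetical numbers of model terms, for illustration only.
adj_small = adjusted_r2(R2, N_OBS, 8)
adj_large = adjusted_r2(R2, N_OBS, 50)

print(round(adj_small, 5), round(adj_large, 5))
```

With this many observations the penalty is small, so a simpler specification with a similar R square yields nearly the same adjusted value.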

No limitations of this model were identified, since it includes only linear effects.
