Peer Reviewed
ISSN: 2350-8914 (Online), 2350-8906 (Print)
Year: 2022 Month: October Volume: 12
Abstract
The telecommunications industry is one of the fastest-evolving business sectors. Unlike most others, it demands huge financial investment at the very onset of business, and the return on that investment is driven by the number of customers garnered over time. With increasing global and national competition, retaining existing customers has become a major challenge for all telecommunication companies. Telecommunication operators therefore have a constant concern: identifying the customers who are at risk of churning. An analysis of call detail records inside a telecom provides insight into how customer behavior is affected by the available services and into whether customers are potential churners. In this research paper, we propose an ensemble-based machine learning model built with the stacking technique, using Logistic Regression and Random Forest as base classifiers and XGBoost as the final classifier, for churn prediction on the call detail records of a fictional telecommunication company containing 7043 customer records. The accuracy and F1-score obtained by the proposed model on this publicly available dataset are 80.88% and 62.69% respectively.
Keywords
Machine Learning, XGBoost, Ensemble, Customer Churn, Stacking Classifier
1. Introduction

The customer churn problem has attracted many researchers from different backgrounds, but their target always remains the same: to predict the customers who are going to churn and prevent them from doing so. An effective churn prediction model not only strengthens the financial position of a company but also fosters a good relationship between customers and the company by helping decision makers address issues related to customers' demands, thus increasing customer loyalty toward the company. Various studies exist, using both qualitative analysis, like the survey of customer churn prediction literature by V. Umayaparvathi et al. [1], and quantitative methods, like the machine learning studies conducted by J. Pamina et al. [2] and Shuli Wu et al. [3] or the collinearity testing done by Piotr Sulikowski et al. [4]. However, because algorithms depend on their datasets, there is no single best algorithm that works well on all datasets. Churn is a binary classification problem.

In this research, the XGBoost classifier has been chosen as the final classifier in the stacking model because it can inherently address the class imbalance problem through the scale_pos_weight parameter, with which the user can assign an appropriate weight to the positive or negative class, and because it can make use of parallel processing for relatively faster computation.

2. Literature Review

A significant amount of research has been done on churn prediction in the telecommunications industry. Most of it focuses on standalone machine learning algorithms, while some studies present a comparative analysis of both standalone and hybrid models.

A comparative study of ten different machine learning algorithms was done in the research work of Sahar F. Sabbeh (2018) [5] on a dataset of 3333 records, obtaining an accuracy of 96.39% with Random Forest. In the research of Fahd Idrissi Khamlichi et al. [6], the best of the suggested standalone models performed well with an accuracy of 95%.

[Figure: Proposed workflow — Data → Data Preprocessing → Cross-Validation → Hyperparameter Tuning + Ensemble of XGBoost with LR & RF → Churn Prediction Model]

An XGBoost Based Ensemble Model for Customer Churn Prediction in Telecommunications Industry

They have also made use of Jupyter Notebook and the Pandas, NumPy, Matplotlib, Seaborn and Sklearn libraries. The steps are as follows:

a) Load the csv/excel data into a dataframe.

b) Check the shape and general info.

c) Check for inconsistencies in the data types and feature names.

It is found that the raw data has an inconsistency in the TotalCharges column: the entries are numbers but are read as object type, so the column has been converted to a numeric datatype. In the SeniorCitizen column, numbers have been entered; these are converted back to a categorical type.

The data contains no outliers. The MonthlyCharges column is negatively skewed and TotalCharges is positively skewed. Thus, MinMaxScaler from Sklearn is chosen for scaling/normalization. MinMaxScaler scales the data into the range [0, 1]. The equation is given as:

    x_scaled = (x − min(x)) / (max(x) − min(x))    (1)
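The dtype fixes and the min–max normalization of equation (1) can be sketched with Pandas as follows. The toy frame below is hypothetical and only mimics the quirks described above (TotalCharges read in as object type, SeniorCitizen entered as 0/1 numbers):

```python
import pandas as pd

# Hypothetical toy frame mimicking the raw Telco CSV quirks described above.
df = pd.DataFrame({
    "SeniorCitizen": [0, 1, 0, 1],
    "MonthlyCharges": [29.85, 56.95, 53.85, 42.30],
    "TotalCharges": ["29.85", "1889.5", " ", "1840.75"],  # object dtype, one blank entry
})

# Step c) fix dtype inconsistencies:
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")  # object -> float, blank -> NaN
df["SeniorCitizen"] = df["SeniorCitizen"].map({0: "No", 1: "Yes"})       # number -> categorical labels

# Equation (1): scale each numeric column into [0, 1].
for col in ["MonthlyCharges", "TotalCharges"]:
    lo, hi = df[col].min(), df[col].max()
    df[col] = (df[col] - lo) / (hi - lo)
```

Sklearn's MinMaxScaler with its default feature_range=(0, 1) applies the same formula column by column.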
Proceedings of 12th IOE Graduate Conference
Recall, also known as true positive rate or sensitivity, is the probability that the model correctly identifies a true positive case out of the total number of actual positives. It is a useful metric for cases where False Negatives are costly.

    Recall = TP / (TP + FN)    (4)

F1-Score is the harmonic mean of precision and recall. It gives a combined idea of the precision and recall metrics.

    F1-Score = (2 × Precision × Recall) / (Precision + Recall)    (5)

where TP is True Positive, TN is True Negative, FP is False Positive and FN is False Negative.

Table 2: Hyperparameters of Logistic Regression (Base Estimator 1)

    Hyperparameter    Value
    C                 2
    penalty           l2
    solver            lbfgs
    class_weight      {1: 1.2}

Table 3: Hyperparameters of Random Forest (Base Estimator 2)

    Hyperparameter    Value
    n_estimators      200
    max_depth         6
    max_leaf_nodes    50
    class_weight      {1: 1.2}
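Equations (4) and (5) are easy to sanity-check in plain Python; the confusion counts and the precision value below are illustrative, not taken from the paper:

```python
def recall(tp: int, fn: int) -> float:
    """Equation (4): TP / (TP + FN) - share of actual positives found."""
    return tp / (tp + fn)

def f1_score(precision: float, rec: float) -> float:
    """Equation (5): harmonic mean of precision and recall."""
    return 2 * precision * rec / (precision + rec)

# Illustrative counts: 55 churners correctly flagged, 42 missed.
r = recall(55, 42)       # ≈ 0.567
f = f1_score(0.65, r)    # ≈ 0.606, with an assumed precision of 0.65
```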
[Figure 4: ROC curves of the base classifiers and the Stacking Classifier — True Positive Rate vs. False Positive Rate]
The ROC-AUC curves of the base classifiers and the Stacking Classifier are shown in Figure 4. The Area Under Curve obtained by the final model is 86.109%, which is higher than the AUC (84.51%) obtained by Shuli Wu et al. [3] on the same dataset using the AdaBoost algorithm without the use of SMOTE.

The results of 10-fold StratifiedKFold cross-validation for the final model are shown in Table 6.

Table 6: Results of 10-fold StratifiedKFold CV

    Fold  Fit time  Score time  Test accuracy  Test recall  Test precision  Test F1
    0     5.972     0.100       0.794          0.572        0.622           0.596
    1     5.941     0.084       0.813          0.572        0.673           0.618
    2     6.003     0.085       0.799          0.559        0.638           0.596
    3     5.676     0.070       0.814          0.610        0.663           0.635
    4     5.704     0.090       0.784          0.508        0.613           0.556
    5     5.582     0.092       0.787          0.572        0.605           0.588
    6     5.666     0.072       0.809          0.583        0.661           0.619
    7     5.725     0.102       0.809          0.567        0.667           0.613
    8     5.604     0.092       0.805          0.567        0.654           0.607
    9     5.468     0.078       0.804          0.561        0.652           0.603

5. Conclusion and Future Work

This paper proposes an ensemble-based method of churn prediction in the telecommunications industry using the stacking technique. The final estimator in the stacking classifier is XGBoost, and the base estimators are Logistic Regression and Random Forest. Compared with the performance metrics obtained by single classifiers, those obtained by the ensemble method are better in all aspects, as shown in Table 5. Also, the proposed model has obtained an accuracy of 80.88% and an F1-score of 62.69%, outperforming the researches done by J. Pamina et al. [2] and Shuli Wu et al. [3], as shown in Figure 5.

Feature importance has not been considered in this research paper; a further study of the churn prediction model can be carried out on that basis. Moreover, as machine learning algorithms depend on the nature of the dataset, the proposed model may not guarantee the same results on every dataset, but by implementing the stacking technique, the overall performance of a standalone model can be improved.
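The stacking architecture summarized above can be sketched with scikit-learn's StackingClassifier, using the base-estimator settings of Tables 2 and 3. Since the XGBoost hyperparameter table is not reproduced here, the scale_pos_weight value is an illustrative assumption, and the xgboost-free fallback exists only so the sketch runs everywhere:

```python
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression

# Final estimator: the paper uses XGBoost. The fallback below is a stand-in
# for environments without the xgboost package, NOT the paper's final model.
try:
    from xgboost import XGBClassifier
    final = XGBClassifier(scale_pos_weight=1.2, eval_metric="logloss")  # weight value is illustrative
except ImportError:
    from sklearn.ensemble import GradientBoostingClassifier
    final = GradientBoostingClassifier()

base_estimators = [
    # Table 2: Logistic Regression hyperparameters
    ("lr", LogisticRegression(C=2, penalty="l2", solver="lbfgs",
                              class_weight={1: 1.2})),
    # Table 3: Random Forest hyperparameters
    ("rf", RandomForestClassifier(n_estimators=200, max_depth=6,
                                  max_leaf_nodes=50, class_weight={1: 1.2})),
]

# Base predictions are fed to the final estimator via internal cross-validation.
model = StackingClassifier(estimators=base_estimators, final_estimator=final, cv=5)
```

After model.fit(X_train, y_train), per-fold scores in the style of Table 6 can be produced with cross_validate(model, X, y, cv=StratifiedKFold(n_splits=10), scoring=("accuracy", "recall", "precision", "f1")).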
classifier for predicting churn in telecommunication. Jour of Adv Research in Dynamical & Control Systems, 11, 2019.
[3] Shuli Wu, Wei-Chuen Yau, Thian-Song Ong, and Siew-Chin Chong. Integrated churn prediction and customer segmentation framework for telco business. IEEE Access, 9:62118–62136, 2021.
[4] Piotr Sulikowski and Tomasz Zdziebko. Churn factors identification from real-world data in the telecommunications industry: case study. Procedia Computer Science, 192:4800–4809, 2021.
[5] Sahar F Sabbeh. Machine-learning techniques for customer retention: A comparative study. International Journal of Advanced Computer Science and Applications, 9(2), 2018.
[6] Fahd Idrissi Khamlichi, Doha Zaim, and Khalid Khalifa. A new model based on global hybridization of machine learning techniques for "customer churn prediction". In 2019 Third International Conference on Intelligent Computing in Data Sciences (ICDS), pages 1–4. IEEE, 2019.
[7] Qi Tang, Guoen Xia, Xianquan Zhang, and Feng Long. A customer churn prediction model based on xgboost and mlp. In 2020 International Conference on Computer Engineering and Application (ICCEA), pages 608–612. IEEE, 2020.
[8] Hemlata Jain, Ajay Khunteta, and Sumit Srivastava. Churn prediction in telecommunication using logistic regression and logit boost. Procedia Computer Science, 167:101–112, 2020.
[9] Telco customer churn. https://www.kaggle.com/datasets/blastchar/telco-customer-churn, 2017. [Online; accessed 2022].