
2022 IEEE International Conference on Big Data and Smart Computing (BigComp)

Comparative Methods for Personalized Customer Churn Prediction with Sequential Data

Ahmet Tuğrul Bayrak, Güven Yücetürk, Musa Berat Bahadır, Sare Melek Yalçınkaya, Melike Demirdağ, İsmail Utku Sayan
Data Science and Innovation, Ata Technology Platforms, İstanbul, Turkey
tugrulb@atp.com.tr, guveny@atp.com.tr, musab@atp.com.tr, sarey@atp.com.tr, meliked@atp.com.tr, utkus@atp.com.tr

978-1-6654-2197-3/22/$31.00 ©2022 IEEE | DOI: 10.1109/BigComp54360.2022.00050

Abstract—Competition in the business environment has been increasing with the developed technology. There are various alternative service providers for customers, and those providers would like to retain their current customers. To reach that goal, churn prediction is a convenient method. However, predicting churning customers is not a trivial task. It is more demanding for some sectors, like fast-food, since there can be numerous reasons why a customer stops using a service. In this study, the challenging situation of customer churn in the fast-food industry is analyzed, and a fast-food chain's data is used. The data is formed sequentially according to customers' personal churn periods. Several recurrent neural network models, such as gated recurrent units and long short-term memory, are built using the sequential data to predict the churn stages of customers, and they are compared with other standard classification methods. Apart from recurrent neural network models, a hybrid model consisting of a convolutional network and long short-term memory is applied.

Index Terms—churn analysis, sequential data, deep learning, recurrent neural network, convolutional neural network

I. INTRODUCTION

With the developing technology, business competition has increased dramatically. Nowadays, there are many companies that offer similar services to customers. Therefore, when customers are not satisfied with a service or find more suitable products, they may turn to other options. For this reason, companies would like to retain their customers, and churn analysis has become more crucial than ever [1]. Even a modest increase in churn rate might reduce a company's ability to grow. The analysis has an essential role in business-driven data analytics [2]. Thanks to churn analysis, companies can predict which customers they will lose before losing them. As a result, it is possible to examine the cause of customer dissatisfaction and prevent the loss of these customers [3].

Churn analysis is critical to the fast-food industry, where competition is extremely high. The analysis is considerably difficult in this industry, as people's habits can change independently of the service they receive. Besides the traditional, rule-based methods, numerous advanced prediction models have been applied to customer churn analysis. Xi et al. build a recurrent neural network (RNN) using customer behavior as part of customer churn prediction; they also treat the problem as a regression problem [4]. Alternatively, Adwan et al. estimate churning customers via a multilayer perceptron (MLP) with backpropagation, using data from a telecommunications company [5]. Similarly, Ullah et al. work on telecom data and use the random forests (RF) method, then compare it with other classification algorithms [6], whereas Coussement and Van den Poel implement an SVM model to predict customer churn [7]. Wei et al. also use telecom data [8]. Using sequential data with long short-term memory (LSTM) is another popular approach to churn analysis [9]. Çelik and Osmanoğlu, on the other hand, develop customer churn models using data from different sectors and compare them [10]. As a different approach, Bose and Chen build a hybrid model in which unsupervised clustering and supervised models are combined [11]. Tama et al. address the problem as a classification problem and use data from the fast-food industry, calculating customer satisfaction and the probability of customer churn with decision tree and artificial neural network models [12]. Similarly, Bayrak and Bahadır work on fast-food chain data, applying LSTM to sequential data [13]–[15]. Migueis et al., in contrast, create a sequence model using the Markov method and compare it with other models [16]. In a different domain, Günther et al. work on insurance data to build a churn model [17]. Burez et al. study class imbalance in customer churn prediction; the results of their study indicate that under-sampling might lead to improved prediction accuracy [18].

In this study, we use a dataset that contains transactional data from a fast-food chain. The transaction data is formed as sequential data to be able to catch the patterns. Instead of a rule-based model, we use RNN methods such as gated recurrent unit (GRU), LSTM, and bi-directional long short-term memory (BI-LSTM), as well as a combined model of LSTM and convolutional neural networks (CNN), and compare them with the traditional ones.

2375-9356/22/$31.00 ©2022 IEEE | DOI 10.1109/BigComp54360.2022.00050
TABLE I: PATTERN SAMPLES

Samples | Regular order frequency | Customer pattern
[9, 3, 7, 5, 6, 19, 12, 10, 15, 2, 5, 7, 3] | 15 | 110111001110
[111, 13, 9, 4, 0, 21, 1, 10, 6, 3, 19, 19] | 21 | 110111101111
[2, 2, 7, 4, 11, 49, 27, 16, 1, 141, 55, 30] | 55 | 111111111110

Fig. 1: Average-order-day histogram of all customers

II. METHOD

A. Data

The data used for the study consists of the sales records of a fast-food chain. The data does not include any demographic information about customers and contains 69582643 transaction records. The data is grouped customer-wise, and 12755590 rows are created. The features used for the study are sales id, customer id, sales date, and quantity. To increase the applicability of the method, the properties are kept as basic as possible. To remove outlier values on some orders, the Z-score is applied; this eliminates 102439 rows and makes the data more consistent.

B. Definition of Customer Churn

As stated earlier, when customers leave a service after a while, they churn. For customers to be considered churned, they have to be regular customers in the first place; otherwise, it might not be possible to detect a pattern. The rule for being a regular customer may vary between sectors, but the general rule is to have a specific number of orders and then no orders for a period of time. In this study, the minimum requirement to be considered a regular customer is to have at least five sequential orders within the average-order-day frequency, which is the mean of the day differences between orders. Otherwise, customers are not considered regular and are eliminated. The average-order-day frequency for all customers (both regular and non-regular) can be observed in Figure 1a. After the non-regular customers are eliminated, 205728 regular customers remain. The average-order-day frequency is calculated individually for each regular customer, and the average over all customers is 60.209 (Figure 1b). To measure that value, the day differences between orders are found; the values farther than the standard deviation from the mean are eliminated, and the remaining maximum value is assigned as the average-order-day frequency for that customer.

Fig. 2: An example pattern of a multiple-time churn

After that process, a customer's regularity pattern is created by checking whether they have an order within their average-order-day frequency. If the day difference between two successive orders is less than the customer's average-order-day frequency, the customer pattern has the value "1"; otherwise, "0". Therefore, for n+1 orders the pattern consists of n values. Example regularity patterns for different order frequencies can be viewed in Table I. Another example, for a customer who has churned, can be viewed in Figure 2: the customer has 24 orders and a 30-day average-order-day frequency, with two regularity periods between orders 1-9 and 13-24.

A customer can churn several times, which occurs after a regularity period. In the figure mentioned, the row Customer Pattern shows that the customer has two regularity periods greater than the threshold value of five and has accordingly churned twice. Since the customer order pattern is binary, the formulation and data structure can be applied to all customers.

C. Preparing the Training Data

In the study, the different stages of a churning process are labeled as shown in Figure 3. Phase 03 corresponds to a customer's last order in a churn, that is, in a regularity period. Similarly, Phase 01 indicates that the customer is far from the churning process, while with Phase 02 the customer is in the middle of the churning process. Each stage includes the previous one cumulatively. Since a customer can churn more than once, previous churn profiles of customers are labeled separately with the time stamp. By this, the data is augmented, and the models are trained not just with a customer's present profile but with each profile from when the customer churned. After calculating the features of a customer for the different churn profiles, the data is labeled. The labels are {1, 2, 3}, where the values represent the different churn stages; the label values also correspond to the phase numbers, and the value "3" means that a customer has churned. While labeling the data, previous phases of churn profiles are calculated by shifting back three orders, to keep the labels from being too similar, as in Figure 3.
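The per-customer frequency and pattern computations described above can be sketched as follows. This is a minimal illustration under our own reading of the text (function names and the choice of population standard deviation are our assumptions, not from the paper's codebase); note that the sample values in Table I do not fully pin down the rule, so the example data below is synthetic.

```python
import statistics

def average_order_day_frequency(order_days):
    """Per-customer frequency: take day differences between successive
    orders, drop values farther than one standard deviation from the
    mean, and assign the remaining maximum (as described in Sec. II-B)."""
    diffs = [b - a for a, b in zip(order_days, order_days[1:])]
    mean = statistics.mean(diffs)
    std = statistics.pstdev(diffs)  # assumption: population std dev
    kept = [d for d in diffs if abs(d - mean) <= std]
    return max(kept) if kept else max(diffs)

def regularity_pattern(day_diffs, frequency):
    """n day differences -> n-character binary pattern: '1' when the
    gap between two successive orders stays under the customer's
    average-order-day frequency, '0' otherwise."""
    return "".join("1" if d < frequency else "0" for d in day_diffs)
```

For example, a customer ordering on days 0, 10, 20, 30, 100 has day differences [10, 10, 10, 70]; the 70-day gap is trimmed as an outlier and the assigned frequency is 10.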

Fig. 3: Sample phase structure

TABLE II: MODEL FEATURES

Feature | Explanation
max amount | maximum amount spent on an order
min amount | minimum amount spent on an order
avg amount | average amount spent on an order
max day diff | maximum day difference between orders
min day diff | minimum day difference between orders
avg day diff | average day difference between orders
days since previous | days since the previous order
number of orders | total number of orders
number of churns | total number of times churned

D. Non-sequential Models

For the study, commonly used non-sequential classification methods are built as baselines [19]. The models built are decision tree (DT), RF, k-nearest neighbors (KNN), logistic regression (LR), support vector classifier (SVC), and MLP. While labeling the data, three different labels are applied whenever a customer has churned, even in the past: "3" is near churn, "1" is far from being churned, and "2" is in the middle of the churning process. The labeling is executed as displayed in Figure 3; therefore, the phase numbers correspond to the labels. The features in the training set are shown in Table II.

E. Sequential Models

As mentioned earlier, the transactional data is formed sequentially as in Figure 5. Several RNN models are built for the study, such as LSTM, BI-LSTM, and GRU. Apart from these models, a CNN-LSTM model is built as well. Both LSTM and CNN have shown improvements over a wide variety of learning tasks, and they are complementary in their modeling capabilities: LSTM is good at temporal modeling, whereas CNN is good at reducing frequency variations. They can be combined into one unified model. The CNN-LSTM architecture uses CNN layers for feature extraction on the input data, combined with LSTMs for sequence prediction [20]. A CNN-LSTM model can be defined by adding CNN layers first, followed by LSTM layers and a dense layer. The pipeline of the model is displayed in Table III. The architecture can be considered as two different models: the CNN performs feature extraction, and the LSTM renders the features across time steps.

Fig. 5: Sequential input structure

The labelling of the training data is similar to that used for the traditional models. In Figure 4, Phase 01, Phase 02 and Phase 03 are shown as labels indicating the churn status. Since each phase consists of sequential data, these phases can be expressed sequentially. To create a sequential model, the data is transformed into the structure in Figure 5, where c represents the customer, r the regularity period of the customer, p the phase value, and s the rank value. For each phase, the current profile and the two previous profiles (Phase 03 - Sequence 03, Phase 03 - Sequence 02, Phase 03 - Sequence 01, etc.) are calculated and stored in array form, as shown in Figure 5. For example, c05 r02 p03 s03 is the third sequence value of the third phase of the second regularity period of the fifth customer; it also corresponds to the first line in Figure 4. A similar structure is built for all customers and calculated with the attributes in Table II.
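The transformation into the Figure 5 structure amounts to stacking, for each labeled phase, the current feature profile and its two predecessors into a fixed-length window, yielding the 3D input that Keras RNN layers expect. A minimal sketch, assuming chronologically ordered per-customer profiles (the helper name and shapes are ours, not the paper's):

```python
import numpy as np

def to_sequences(profiles, seq_len=3):
    """profiles: chronologically ordered feature vectors for one
    customer's regularity period (rows = profiles, cols = the Table II
    attributes). Returns a (num_samples, seq_len, num_features) array,
    one sliding window of seq_len profiles per labeled phase."""
    profiles = np.asarray(profiles, dtype=float)
    windows = [profiles[i - seq_len + 1 : i + 1]
               for i in range(seq_len - 1, len(profiles))]
    return np.stack(windows)
```

With five profiles of, say, two features each, this yields three windows of shape (3, 2), the first covering profiles 1-3, the second 2-4, and so on.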

Fig. 4: Phases and sequences

III. EXPERIMENTS

The experiments are executed in the Python programming language using the Sklearn and Keras libraries. The number of rows for each label is 212453, and the labeled data set contains 637359 rows in total. 10% of the data is separated as the test set; after this split, the training set includes 573623 rows. The parameters used in the models can be viewed in Table IV. For the non-sequential models, 10-fold cross-validation is executed during training, and grid search is applied for parameter tuning. The layers of the CNN-LSTM model are given in Table III.
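Assembled in Keras with the Table III parameters, the CNN-LSTM pipeline could look like the following sketch. The layer parameters come from Table III; the input shape, the `padding='same'` setting (added so the pooling layer still has timesteps to work with on short sequences), and the loss function are our assumptions, since the paper does not state them.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv1D, MaxPooling1D, LSTM, Dense

def build_cnn_lstm(timesteps=3, n_features=9, n_classes=3):
    """CNN front-end for feature extraction, LSTM for the sequence,
    dense head for the three churn stages (per Table III). Input shape
    and padding are assumptions; labels {1,2,3} would be mapped to
    0..2 before training with this loss."""
    model = Sequential([
        Conv1D(filters=32, kernel_size=3, activation="relu",
               padding="same",                       # assumption, not in Table III
               input_shape=(timesteps, n_features)),  # feature extraction
        MaxPooling1D(pool_size=2),                    # downsample in time
        LSTM(100, dropout=0.1, activation="sigmoid"), # temporal modeling
        Dense(n_classes, activation="sigmoid"),       # three churn stages
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",  # assumed loss
                  metrics=["accuracy"])
    return model
```

Calling `build_cnn_lstm()` yields a four-layer model whose output has one value per churn stage, matching the Dense units=3 row of Table III.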

TABLE III: CNN-LSTM LAYER DETAILS

Layers | Parameters
Conv1D | filters=32, kernel size=3, activation='relu'
MaxPooling1D | pool size=2
LSTM | hidden nodes=100, dropout=0.1, activation='sigmoid', optimizer='adam'
Dense | units=3, activation='sigmoid'

TABLE IV: MODEL PARAMETERS

Model | Parameters
DT | max depth=5
KNN | algorithm='auto', metric='minkowski', n neighbors=17, p=2, weights='distance'
RF | bootstrap=False, max depth=70, max features=10, min samples leaf=5, min samples split=8, n estimators=100, random state=10
LR | C=0.001, penalty='l2'
MLP | alpha=0.05, hidden layer sizes=150, dropout=0.1, activation='relu', solver='adam', max iter=100
SVM | C=10, gamma=0.01, kernel='rbf', random state=10
GRU | hidden nodes=250, dropout=0.1, recurrent dropout=0.1, activation='sigmoid', optimizer='adam'
LSTM | hidden nodes=250, dropout=0.2, recurrent dropout=0.2, activation='relu', optimizer='adam'

The results of the models can be seen in Table V. The results show that the RNN models are more successful than the traditional, non-sequential models. Among the sequential models, the hybrid CNN-LSTM model is the most successful. Since a relatively large data set is used for training, the more advanced models outperform the non-sequential ones.

TABLE V: MODEL RESULTS

Method | Precision (%) | Recall (%) | F-1 Score (%)
DT | 63.21 | 61.15 | 62.16
RF | 70.03 | 68.19 | 69.09
KNN | 65.72 | 62.64 | 64.14
SVC | 70.89 | 71.64 | 71.26
LR | 67.38 | 68.24 | 67.80
MLP | 72.31 | 73.14 | 72.72
GRU | 75.09 | 75.24 | 75.16
LSTM | 77.05 | 77.12 | 77.08
BI-LSTM | 77.68 | 77.24 | 77.45
CNN-LSTM | 78.02 | 78.06 | 78.03

IV. CONCLUSION

In this study, we use a fast-food chain's transaction data to predict churning customers. We label the data whenever a churn occurs so that we do not miss any potential label, and we treat the problem as a classification problem with three labels representing the stages of being churned. We build RNN models using sequential data and compare them with traditional classification methods. As shown in the Experiments section, the RNN models are more successful than the traditional, non-sequential methods, since RNN models can better link and detect the relations in sequential data. Among the sequential models, CNN-LSTM is the most successful one. The pipeline in the study is straightforward and can be applied to other sequential data in different sectors by defining the churn rules.

As a future study, this work can also be used as an input for a recommender system by detecting customers' products before or after a churn period. Apart from that, permitted customer demographic data can also be added to the feature set.

REFERENCES

[1] A. Sharma and D. P. K. Panigrahi, "A neural network based approach for predicting customer churn in cellular network services," International Journal of Computer Applications, vol. 27, no. 11, pp. 26-31, August 2011.
[2] S. Nalchigar and E. Yu, "Business-driven data analytics: A conceptual modeling framework," Data & Knowledge Engineering, vol. 117, pp. 359-372, 2018.
[3] M. Farquad, V. Ravi, and S. B. Raju, "Churn prediction using comprehensible support vector machine: An analytical CRM application," Applied Soft Computing, vol. 19, pp. 31-40, 2014.
[4] M. Xi, Z. Luo, N. Wang, and J. Yin, "A latent feelings-aware RNN model for user churn prediction with behavioral data," ArXiv, vol. abs/1911.02224, 2019.
[5] O. Adwan, H. Faris, K. Jaradat, O. Harfoushi, N. Ghatasheh, and K. Abdullah, "Predicting customer churn in telecom industry using multilayer perceptron neural networks: Modeling and analysis," 2014.
[6] I. Ullah, B. Raza, A. K. Malik, M. Imran, S. U. Islam, and S. W. Kim, "A churn prediction model using random forest: Analysis of machine learning techniques for churn prediction and factor identification in telecom sector," IEEE Access, vol. 7, pp. 60134-60149, 2019.
[7] K. Coussement and D. Van den Poel, "Churn prediction in subscription services: An application of support vector machines while comparing two parameter-selection techniques," Expert Systems with Applications, vol. 34, no. 1, pp. 313-327, 2008.
[8] C.-P. Wei and I.-T. Chiu, "Turning telecommunications call details to churn prediction: a data mining approach," Expert Systems with Applications, vol. 23, no. 2, pp. 103-112, 2002.
[9] C. Mena, A. D. Caigny, K. Coussement, K. W. D. Bock, and S. Lessmann, "Churn prediction with sequential data and deep neural networks: a comparative analysis," ArXiv, vol. abs/1909.11114, 2019.
[10] Ö. Çelik and U. Osmanoğlu, "Comparing to techniques used in customer churn analysis," Journal of Multidisciplinary Developments, vol. 4, no. 1, pp. 30-38, 2019.
[11] I. Bose and X. Chen, "Hybrid models using unsupervised clustering for prediction of customer churn," Journal of Organizational Computing and Electronic Commerce, vol. 19, no. 2, pp. 133-151, 2009.
[12] B. A. Tama, "Data mining for predicting customer satisfaction in fast-food," 2015.
[13] A. T. Bayrak, A. A. Aktaş, O. Susuz, and O. Tunalı, "Churn prediction with sequential data using long short term memory," in 2020 4th International Symposium on Multidisciplinary Studies and Innovative Technologies (ISMSIT), 2020, pp. 1-4.
[14] A. T. Bayrak, A. A. Aktaş, O. Tunalı, O. Susuz, and N. Abbak, "Personalized customer churn analysis with long short-term memory," in 2021 IEEE International Conference on Big Data and Smart Computing (BigComp), 2021, pp. 79-82.
[15] M. B. Bahadır, A. T. Bayrak, G. Yücetürk, and P. Ergun, "A comparative study for employee churn prediction," in 2021 29th Signal Processing and Communications Applications Conference (SIU), 2021, pp. 1-4.
[16] V. L. Miguéis, D. V. D. Poel, A. Camanho, and J. F. E. Cunha, "Predicting partial customer churn using Markov for discrimination for modeling first purchase sequences," Working Papers of Faculty of Economics and Business Administration, Ghent University, Belgium 12/806, Aug. 2012.
[17] C.-C. Günther, I. F. Tvete, K. Aas, G. I. Sandnes, and Ø. Borgan, "Modelling and predicting customer churn from an insurance company," Scandinavian Actuarial Journal, vol. 2014, no. 1, pp. 58-71, 2014.
[18] J. Burez and D. Van den Poel, "Handling class imbalance in customer churn prediction," Expert Systems with Applications, vol. 36, no. 3, part 1, pp. 4626-4636, 2009.
[19] S. B. Kotsiantis, I. D. Zaharakis, and P. E. Pintelas, "Machine learning: a review of classification and combining techniques," Artificial Intelligence Review, vol. 26, no. 3, pp. 159-190, Nov 2006.
[20] T. N. Sainath, O. Vinyals, A. Senior, and H. Sak, "Convolutional, long short-term memory, fully connected deep neural networks," in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 4580-4584.

