
Customer Lifetime Value Analysis Based on Machine Learning

Xinqian Dai
University of Massachusetts Amherst, USA
daixinqian2022@163.com

ABSTRACT

Customer lifetime value (CLV) is a powerful tool for determining the value of customers and identifying those most likely to churn or to make their first purchase, especially for e-commerce companies. This article reviews machine learning models for analyzing CLV and suggests some potential directions for future research. Data of 8099 samples were collected and analyzed with four kinds of machine learning methods: Linear Regression, Support Vector Machine, Random Forest, and Neural Network. The correlations between features show that CLV is mainly affected by monthly premium auto, total claim amount, and coverage. Analysis with machine learning models achieves high precision, and Random Forest performs best. CLV prediction and customer segmentation are vital in today's business field. Marketers can take advantage of the huge amount of available data and machine learning models to portray customer behavior. Collecting browsing and purchase histories is also beneficial for providing the best offers to individual customers.

CCS CONCEPTS

• Theory of computation → Theory and algorithms for application domains; Machine learning theory; Unsupervised learning and clustering.

KEYWORDS

Customer Lifetime Value, E-commerce, Machine Learning, Regression

ACM Reference Format:
Xinqian Dai. 2022. Customer Lifetime Value Analysis Based on Machine Learning. In 2022 the 6th International Conference on Information System and Data Mining (ICISDM 2022), May 27–29, 2022, Silicon Valley, CA, USA. ACM, New York, NY, USA, 5 pages. https://doi.org/10.1145/3546157.3546160

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
ICISDM 2022, May 27–29, 2022, Silicon Valley, CA, USA
© 2022 Association for Computing Machinery.
ACM ISBN 978-1-4503-9625-7/22/05...$15.00
https://doi.org/10.1145/3546157.3546160

1 INTRODUCTION

Modern business is heavily based on customer service, and companies increasingly earn income from establishing and maintaining long-term relationships with customers. Especially for e-commerce, customer lifetime value (CLV) is one of the most important indicators because it is used to estimate the financial viability of a business [1-2]. Customer lifetime value is the total amount of value a customer provides to a business over the whole period of their buying and selling relationship. High CLV is a sign of high brand loyalty and of the recurring economic value of existing customers.

Machine learning is the study of analyzing patterns in data and making predictions based on those patterns, and it has recently made significant headway into business applications [3]. Unlike many empirical methods used in marketing, machine learning techniques can accommodate an extremely large number of variables and promise high scalability, which is increasingly important for marketers because many of these algorithms need to run in real time. Machine learning tools can handle raw sequential data, so they are more efficient and at the same time deliver results with higher accuracy. The advantages are more obvious when datasets are very large: sampling is not needed when the entire customer base is available. Sophistication in modeling has also made it possible for marketers to devise targeted strategies for individual customers [4]. Machine learning methods can also automatically learn which factors affect user behavior and how they interact with each other in the underlying purchasing process, in order to explain user behavior virtually in real time. For example, machine learning methods can be used to explain how the discount rate affects purchase decisions while shopping online.

2 RELATED WORK

Early statistical models of CLV were constrained by a lack of data and centered on the Negative Binomial Distribution [5]. With the emergence of large-scale e-commerce platforms, researchers started to tackle this problem with machine learning. Vanderveld et al. split customers with predicted CLV greater than zero into five groups and developed specific strategies for each group. They used these simple machine learning methods to predict customer behavior [6]. SkipGram with Negative Sampling (SGNS) is popular in neural embedding models [7]. Barkan and Koenigstein used SGNS to analyze product information viewed by a single customer several times [8]. Chen et al. showed that Neural Networks can be used to predict the CLV of individual players in video games, and that convolutional Neural Network structures perform best [9]. These studies have shown that basic machine learning methods, such as Neural Networks and regression models, apply well to the analysis of CLV because of their higher precision and accuracy.

3 DATA RESEARCH

The dataset used in this article is downloaded from Kaggle (https://www.kaggle.com/arashnic/marketing-seris-customer-lifetime-value) and has a sample size of 8099. The 23 features and their statistics are shown in Table 1.

Figure 1 shows the correlations between all the features. Coverage, total claim amount, and employment status all have a relatively strong relation with monthly premium auto, because greater coverage, a larger claim amount, and better employment status are more likely to come with a larger amount of monthly automatic renewal of


Table 1: Features of the dataset and their descriptions

Features Description
State The state where the customer comes from
Response Yes/No
Coverage Basic/Extended/Premium
Education Education Level: Bachelor/College/High School or Below
Employment Status Employed/Unemployed
Gender Female/Male
Income Annual Salary
Location Code Suburban/Rural/Urban
Marital Status Married/Single
Monthly Premium Auto Number of monthly automatic renewal of premium subscription
Months Since Last Claim Number of months since last claim
Months Since Policy Inception Number of months since policy inception
Number of Open Complaints Discrete Variable
Number of Policies Discrete Variable
Policy Type Personal Auto/ Corporate Auto/ Special Auto
Policy Name of policies
Renew Offer Type Offer1-4
Sales Channel Agent/ Branch/ Web/ Call Center
Total Claim Amount Discrete Variable
Vehicle Class Two-Door Car/ Four-Door Car/SUV/ Luxury SUV
Vehicle Size Midsize/Small/Large
Customer Lifetime Value Calculated CLV

premium subscription. Policy type has a strong positive relation with policy, since every policy has a specific policy type. Employment status has a strong negative relation with income, because unemployed people have much lower income than those who are employed.

State, response, months since last claim, months since policy inception, number of policies, and number of open complaints have extremely small correlation coefficients with all other features, so they are not related to the other variables.

Monthly premium auto, total claim amount, and coverage are strongly related to CLV, so they are the main factors that reflect customer behavior and affect CLV. State, response, gender, location code, months since policy inception, sales channel, and vehicle size have extremely small correlation coefficients with CLV, so they do not have high analytical value.

4 MODELS

4.1 Linear Regression

Linear regression is the most basic supervised learning method. In machine learning, such models are used to capture the linear relationship between features and labels, as shown in Figure 2. It is often more effective to use a combination of three or more independent variables for prediction and estimation, since a phenomenon is usually affected by multiple factors. The formula of multiple linear regression is:

y = w_0 + w_1 x_1 + w_2 x_2 + \ldots + w_p x_p

The estimator of the coefficient vector is obtained by the least squares method. Define the loss function as the sum of squared fitting residuals over all N samples:

J(w) = \sum_{i=1}^{N} \left( y_i - w_0 - \sum_{j=1}^{P} w_j x_{ij} \right)^2

Here x_{ij} represents the j-th feature (factor) of the i-th sample, and y_i represents the label of the i-th sample. The model coefficients are estimated as the value of w that minimizes the loss function:

\hat{w} = \arg\min_{w} J(w)

Regularization is an approach that accepts some bias in exchange for lower variance. It adds a penalty term after the least squares loss function. When the penalty term is the sum of squares of the coefficients w, the method is called ridge regression (also known as L2 regularization), and the loss function is:

J(w) = \sum_{i=1}^{N} \left( y_i - w_0 - \sum_{j=1}^{P} w_j x_{ij} \right)^2 + \lambda \sum_{j=1}^{P} w_j^2

When the penalty term is the sum of absolute values of the coefficients w, the method is called Lasso regression (also known as L1 regularization), and the loss function is:

J(w) = \sum_{i=1}^{N} \left( y_i - w_0 - \sum_{j=1}^{P} w_j x_{ij} \right)^2 + \lambda \sum_{j=1}^{P} |w_j|
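As a sketch of how these three estimators behave, the following uses scikit-learn on synthetic data. This is illustrative only, not the paper's code or dataset; the feature layout and coefficient values are made up for the example.

```python
# Sketch: OLS, ridge (L2) and lasso (L1) regression as defined above.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                 # 100 samples, 3 features
# True relationship uses only the first two features; the third is noise.
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

ols = LinearRegression().fit(X, y)    # minimizes the squared-error loss J(w)
ridge = Ridge(alpha=1.0).fit(X, y)    # adds lambda * sum(w_j^2)
lasso = Lasso(alpha=0.1).fit(X, y)    # adds lambda * sum(|w_j|)

print(ols.coef_, ridge.coef_, lasso.coef_)
```

The L1 penalty tends to drive the coefficient of the irrelevant third feature toward exactly zero, which is why Lasso is often used for feature selection.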


Figure 1: Correlations between features

Figure 3: Support Vector Machine


Figure 2: Linear Regression
4.2 Support Vector Machine

Support Vector Machine (SVM) is used for classification and regression on data. SVM maximizes the width of the gap between categories by representing training examples as points in a coordinate system (Figure 3). Predictions about the category of new examples can then be made based on where they fall.

If we use w_1 x_1 + w_2 x_2 + \ldots + w_p x_p = b to describe the line in the middle of the graph, we only need to solve an optimization problem that maximizes the margin b. In the real world, it is not always possible to find a plane that perfectly divides the data into two parts, so we add a slack term \varepsilon_i to the constraints:

y_i \left( w_1 x_{i1} + w_2 x_{i2} + \ldots + w_p x_{ip} \right) \geq b (1 - \varepsilon_i)

\sum_{i=1}^{n} \varepsilon_i \leq C
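A minimal sketch of the soft-margin idea, using scikit-learn's `SVC` on made-up two-cluster data (not the paper's setup). In scikit-learn the parameter `C` controls how heavily margin violations are penalized: a small `C` tolerates more total slack, a large `C` tolerates less.

```python
# Sketch: soft-margin linear SVM; C trades margin width against slack.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 1, size=(50, 2)),   # class -1 cluster
               rng.normal(+2, 1, size=(50, 2))])  # class +1 cluster
y = np.array([-1] * 50 + [+1] * 50)

soft = SVC(kernel="linear", C=0.1).fit(X, y)    # wide margin, more slack
hard = SVC(kernel="linear", C=100.0).fit(X, y)  # narrow margin, little slack

print(soft.score(X, y), hard.score(X, y))
```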


The commonly used kernels of support vector machines fall into three types: linear kernels, polynomial kernels, and Gaussian kernels. Their formulas are as follows:

K(x_i, x_{i'}) = \sum_{j=1}^{p} x_{ij} x_{i'j}

K(x_i, x_{i'}) = \left( 1 + \sum_{j=1}^{p} x_{ij} x_{i'j} \right)^d

K(x_i, x_{i'}) = \exp\left( -\gamma \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2 \right)

SVM is effective when there are more dimensions than samples, and it performs well when the margin of separation between classes is clear.

Figure 4: Random Forest
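The three kernels map directly onto scikit-learn's `SVR` via its `kernel` argument. The following sketch fits a noisy sine curve; the data and hyperparameter values are ours, chosen only for illustration. Note that scikit-learn's polynomial kernel is (gamma * <x, x'> + coef0)^d, so `gamma=1.0, coef0=1.0` reproduces the (1 + <x, x'>)^d form above.

```python
# Sketch: the linear, polynomial and Gaussian (RBF) kernels in SVR.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

models = {
    "linear": SVR(kernel="linear"),                             # sum_j x_ij x_i'j
    "poly": SVR(kernel="poly", degree=3, gamma=1.0, coef0=1.0), # (1 + <x, x'>)^3
    "rbf": SVR(kernel="rbf", gamma=0.5),                        # exp(-gamma ||x - x'||^2)
}
scores = {name: m.fit(X, y).score(X, y) for name, m in models.items()}
print(scores)
```

On this nonlinear target the RBF kernel fits far better than the linear one, which is the point of kernelizing.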

4.3 Random Forest


Decision Tree is one of the methods closest to daily life. It summarizes the decision-making process in the form of a tree. Each upper-level node is split into next-level leaf nodes by a certain rule, and the terminal leaf nodes are the final classification results [10]. The principle of node splitting is to maximize the information gain after the split. The "information" is defined by entropy or Gini impurity. The concept of entropy is derived from information theory; the information entropy is:

i(p) = -\sum_{j=1}^{n} p(\omega_j) \log_2 p(\omega_j)

Figure 5: Neural Network

The main advantages of decision trees are their high efficiency and their ability to handle non-numerical features.
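The entropy impurity above is simple to compute directly; a minimal sketch (the function name is ours, not from any particular library):

```python
# Sketch: information entropy i(p) = -sum_j p(w_j) * log2 p(w_j).
import math

def entropy(probs):
    """Entropy in bits of a class-probability distribution; zero-probability
    classes are skipped, since p * log2(p) -> 0 as p -> 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))  # a 50/50 split is maximally impure: 1.0 bit
print(entropy([1.0, 0.0]))  # a pure node has zero entropy
```

A split is chosen to maximize the drop from the parent node's entropy to the weighted entropy of its children, i.e. the information gain.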
A Random Forest is obtained by integrating multiple decision trees through Bagging (Figure 4). The first step of the random forest algorithm in building each decision tree is called "row sampling": sampling with replacement from all training samples to obtain a Bootstrap data set. The second step is called "column sampling": m features are randomly selected from all M features (m is less than M), and those m features of the Bootstrap data set are used as the new training set to train a decision tree. All N decision trees are combined by voting at the end [11]. Random Forest reduces overfitting and improves on the accuracy of single decision trees, and it is more flexible for complex regression problems.

4.4 Neural Network

Neural Network, as shown in Figure 5, was originally used only in the field of artificial intelligence, but it has attracted attention and praise in many fields in recent years. Neurons in Neural Network algorithms simulate the structure of neurons in the real world. By connecting multiple neurons layer by layer, a Neural Network with hidden layers is obtained [12]. Neural Networks are very powerful, and a multi-layer Neural Network can approximate any function. The goal of model training is to find the optimal weight parameters θ that minimize the mean square error of the prediction.

5 RESULTS AND DISCUSSION

Parameters in machine learning are essential to the prediction results. An important parameter in linear regression is whether to calculate the intercept for the model. In the support vector machine, the penalty parameter indicates how strongly the model tries to avoid misclassifying training examples; another parameter specifies the kernel type to be used in the algorithm. The criterion and splitter in the decision tree define the quality of each split. For the tree itself, the maximum depth of the tree and the minimum number of samples required at a leaf node are the two most important parameters [13]. Besides the same parameters as the decision tree, random forest has another essential parameter that sets the number of random trees in the forest. In the Neural Network, the i-th element of the hidden layer sizes parameter gives the number of neurons in the i-th hidden layer, and 'alpha' is the L2 penalty (regularization term) parameter. The best choices of parameters for each model are shown in Table 2.

The deviations of these models are measured through MSE, Friedman MSE, and MAE; the equations of MSE and MAE are:

MSE = \frac{1}{n} \sum_{i=1}^{n} \left( Y_i - \hat{Y}_i \right)^2
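These error measures can be computed with scikit-learn's metrics module; a sketch on made-up predictions, not the paper's data:

```python
# Sketch: MSE, MAE and R^2 as reported in Table 3, on toy values.
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

mse = mean_squared_error(y_true, y_pred)   # (1/n) * sum((Y_i - Yhat_i)^2)
mae = mean_absolute_error(y_true, y_pred)  # (1/n) * sum(|Y_i - Yhat_i|)
r2 = r2_score(y_true, y_pred)              # 1 - SS_res / SS_tot

print(mse, mae, r2)  # mse = 0.875, mae = 0.75
```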


Table 2: Best Choices of parameters for each model

Model Choices for Parameters


Linear Regression ’fit_intercept’: False
Decision Tree ’criterion’: ’friedman_mse’, ’max_depth’: 8, ’max_features’: 15,
’min_samples_leaf’: 5, ’splitter’: ’best’
Random Forest ’max_depth’: 15, ’max_features’: 9, ’min_samples_leaf’: 3,
’n_estimators’: 160
Neural Network ’learning_rate’: 0.2, ’max_depth’: 10, ’min_child_weight’: 3,
’n_estimators’: 60
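Parameter choices like those in Table 2 are typically found by a grid search. A sketch with scikit-learn's `GridSearchCV` on synthetic data; the grid values here are illustrative and not the paper's actual search space.

```python
# Sketch: hyperparameter search for a random forest regressor.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 5))
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(scale=0.1, size=300)

grid = {
    "n_estimators": [50, 160],     # number of trees in the forest
    "max_depth": [5, 15],          # maximum depth of each tree
    "min_samples_leaf": [1, 3],    # minimum samples at a leaf node
}
search = GridSearchCV(RandomForestRegressor(random_state=0), grid, cv=3)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```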

Table 3: Return results of each model

Model               MSE       R2     MAE
Linear Regression   3.9×10^7  0.14   3937.42
Decision Tree       1.6×10^7  0.65   1671.02
Random Forest       1.4×10^7  0.70   1521.55
Neural Network      4.0×10^7  0.13   3951.53

MAE = \frac{1}{n} \sum_{i=1}^{n} |e_i| = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|

According to the results for each model shown in Table 3, Linear Regression and Neural Network have similar values of MSE and MAE, and Decision Tree and Random Forest have much smaller ones. Random Forest has the largest value of R2, 0.70, slightly above that of Decision Tree. Linear Regression and Neural Network have R2 values around 0.14. Considering these three measures, Random Forest is the best model because it has the smallest MSE and MAE and the largest R2.

6 CONCLUSIONS

CLV has become an essential concept in both academia and practice in recent years because of the value of transaction data and the significance of customer behavior. This article collects CLV data of customers from different states and analyzes the features using classical machine learning models. It can be seen from the correlation coefficients that monthly premium auto, total claim amount, and coverage are the main factors affecting CLV, because they have a strong relation with it. In contrast, state, gender, sales channel, and vehicle size do not have high analytical value. Among the four machine learning models discussed in this article, Random Forest is the best for analyzing CLV because it has the smallest MSE and MAE and the largest R2.

There are some limitations in this article that may inspire future study in this field. The sample size of 8099 is sufficient for drawing a conclusion, but it would be better if more CLV datasets could be analyzed together. This article applies four machine learning models, and the largest R2, 0.70, is obtained with random forest. Models with higher precision, such as XGBoost, should be tried. The four models have their own characteristics in predicting results, so model fusion could be considered in future research to further improve the accuracy and analyze the important factors.

REFERENCES

[1] Vanderveld, A., Pandey, A., Han, A., & Parekh, R. (2016). An engagement-based customer lifetime value system for E-commerce. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. https://doi.org/10.1145/2939672.2939693
[2] Canhoto, A. I., & Clear, F. (2020). Artificial Intelligence and machine learning as business tools: A framework for diagnosing value destruction potential. Business Horizons, 63(2), 183–193. https://doi.org/10.1016/j.bushor.2019.11.003
[3] Mizik, N., & Hanssens, D. (2018). Handbook of Marketing Analytics. https://doi.org/10.4337/9781784716752
[4] Gupta, S., Hanssens, D., Hardie, B., Kahn, W., Kumar, V., Lin, N., Ravishanker, N., & Sriram, S. (2006). Modeling customer lifetime value. Journal of Service Research, 9(2), 139–155. https://doi.org/10.1177/1094670506293810
[5] Morrison, D. G., & Schmittlein, D. C. (1988). Generalizing the NBD model for customer purchases: What are the implications and is it worth the effort? Journal of Business & Economic Statistics, 6(2), 145–159. https://doi.org/10.1080/07350015.1988.10509648
[6] Vanderveld, A., Pandey, A., Han, A., & Parekh, R. (2016). An engagement-based customer lifetime value system for E-commerce. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. https://doi.org/10.1145/2939672.2939693
[7] Xu, P., Hu, W., Wu, J., Liu, W., Du, B., & Yang, J. (2019). Social Trust Network embedding. 2019 IEEE International Conference on Data Mining (ICDM). https://doi.org/10.1109/icdm.2019.00078
[8] Barkan, O., & Koenigstein, N. (2016). ITEM2VEC: Neural item embedding for collaborative filtering. 2016 IEEE 26th International Workshop on Machine Learning for Signal Processing (MLSP). https://doi.org/10.1109/mlsp.2016.7738886
[9] Chen, P. P., Guitart, A., del Rio, A. F., & Perianez, A. (2018). Customer lifetime value in video games using Deep Learning and parametric models. 2018 IEEE International Conference on Big Data (Big Data). https://doi.org/10.1109/bigdata.2018.8622151
[10] Myles, A. J., Feudale, R. N., Liu, Y., Woody, N. A., & Brown, S. D. (2004). An introduction to decision tree modeling. Journal of Chemometrics, 18(6), 275–285. https://doi.org/10.1002/cem.873
[11] Hasan, M. A., Nasser, M., Ahmad, S., & Molla, K. I. (2016). Feature selection for intrusion detection using Random Forest. Journal of Information Security, 07(03), 129–140. https://doi.org/10.4236/jis.2016.73009
[12] Albawi, S., Mohammed, T. A., & Al-Zawi, S. (2017). Understanding of a convolutional neural network. 2017 International Conference on Engineering and Technology (ICET). https://doi.org/10.1109/icengtechnol.2017.8308186
[13] Hussain, S. (2013). Relationships among various parameters for decision tree optimization. Studies in Computational Intelligence, 393–410. https://doi.org/10.1007/978-3-319-01866-9_13
