Xinqian Dai
University of Massachusetts Amherst, USA
daixinqian2022@163.com
ICISDM 2022, May 27–29, 2022, Silicon Valley, CA, USA Xinqian Dai
Features                         Description
State                            The state where the customer comes from
Response                         Yes/No
Coverage                         Basic/Extended/Premium
Education                        Education level: Bachelor/College/High School or Below
Employment Status                Employed/Unemployed
Gender                           Female/Male
Income                           Annual salary
Location Code                    Suburban/Rural/Urban
Marital Status                   Married/Single
Monthly Premium Auto             Number of monthly automatic renewals of the premium subscription
Months Since Last Claim          Number of months since the last claim
Months Since Policy Inception    Number of months since policy inception
Number of Open Complaints        Discrete variable
Number of Policies               Discrete variable
Policy Type                      Personal Auto/Corporate Auto/Special Auto
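Most of these features are categorical, so they must be encoded numerically before a correlation analysis or any of the models below can use them. A minimal sketch with pandas (the two-row frame and its values are invented for illustration; column names follow the table):

```python
# Sketch: one-hot encode categorical features like those in the table above.
# The tiny DataFrame here is made up for illustration only.
import pandas as pd

df = pd.DataFrame({
    "Coverage": ["Basic", "Premium"],
    "Employment Status": ["Employed", "Unemployed"],
    "Income": [56274, 0],
    "Monthly Premium Auto": [69, 94],
})

# get_dummies leaves numeric columns alone and expands categorical ones
encoded = pd.get_dummies(df, columns=["Coverage", "Employment Status"])
print(sorted(encoded.columns))
```

After encoding, every column is numeric and can enter a correlation matrix or a regression model directly.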
…premium subscription. Policy has a strong positive relation with policy type, since every policy has a specific policy type. Employment status has a strong negative relation with income, because unemployed people have much lower incomes than those who are employed.

State, response, months since last claim, months since policy inception, number of policies, and number of open complaints have extremely small correlation coefficients with every other feature, so they are not related to each other.

Monthly premium auto, total claim amount, and coverage are strongly related to CLV, so they are the main factors that reflect customer behaviors and affect CLV. State, response, gender, location code, months since policy inception, sales channel, and vehicle size have extremely small correlation coefficients with CLV, so they do not have high analytical value.

4 MODELS

4.1 Linear Regression

Linear regression is the most basic supervised learning method. In machine learning, models are used to reflect the linear relationship between features and labels, as shown in Figure 2. It is often more effective to use a combination of three or more independent variables for prediction and estimation, since a phenomenon is usually affected by multiple factors. The formula of multiple linear regression is:

y = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_p x_p

The estimator of the coefficient vector is obtained by the least squares method. Define the loss function as the sum of squares of the fitting residuals of all N samples:

J(w) = \sum_{i=1}^{N} \Big( y_i - w_0 - \sum_{j=1}^{P} w_j x_{ij} \Big)^2

where x_{ij} represents the j-th feature (factor) of the i-th sample and y_i represents the label of the i-th sample. The estimate of the model coefficients is the value of w at which the loss function is smallest:

\hat{w} = \arg\min_{w} J(w)

Regularization is an approach that prefers some bias over high variance; it adds a penalty term to the least squares loss function. When the penalty term is the sum of the squares of the coefficients w, the method is called ridge regression (also known as L2 regularization), and the loss function is:

J(w) = \sum_{i=1}^{N} \Big( y_i - w_0 - \sum_{j=1}^{P} w_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{P} w_j^2

When the penalty term is the sum of the absolute values of the coefficients w, the method is called Lasso regression (also known as L1 regularization), and the loss function is:

J(w) = \sum_{i=1}^{N} \Big( y_i - w_0 - \sum_{j=1}^{P} w_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{P} |w_j|
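These three loss functions correspond to scikit-learn's LinearRegression, Ridge, and Lasso estimators, where the alpha parameter plays the role of λ. A minimal sketch on synthetic data (the data and the alpha values are made up for illustration):

```python
# Sketch: ordinary least squares, ridge (L2), and lasso (L1) on synthetic data.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                   # 200 samples, 5 features
true_w = np.array([3.0, 0.0, -2.0, 0.0, 1.5])   # sparse true coefficients
y = 4.0 + X @ true_w + rng.normal(scale=0.1, size=200)

ols = LinearRegression().fit(X, y)    # minimizes the squared-error loss J(w)
ridge = Ridge(alpha=1.0).fit(X, y)    # adds lambda * sum(w_j^2)
lasso = Lasso(alpha=0.1).fit(X, y)    # adds lambda * sum(|w_j|)

# The L1 penalty tends to shrink irrelevant coefficients to exactly zero.
print(np.round(ols.coef_, 2))
print(np.round(lasso.coef_, 2))
```

The L1 penalty's tendency to zero out coefficients is why Lasso doubles as a feature-selection tool, while ridge only shrinks coefficients toward zero.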
Customer Lifetime Value Analysis Based on Machine Learning
Polynomial kernel:

K(x_i, x_{i'}) = \Big( 1 + \sum_{j=1}^{p} x_{ij} x_{i'j} \Big)^d

Radial basis function (RBF) kernel:

K(x_i, x_{i'}) = \exp\Big( -\gamma \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2 \Big)
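Written out directly, each kernel is a one-liner. A sketch in NumPy (function names and the d and gamma values are illustrative):

```python
# Sketch: the polynomial and RBF kernels written directly in NumPy.
# x and x2 are single samples (1-D feature vectors).
import numpy as np

def polynomial_kernel(x, x2, d=2):
    # K(x, x') = (1 + sum_j x_j * x'_j)^d
    return (1.0 + np.dot(x, x2)) ** d

def rbf_kernel(x, x2, gamma=0.5):
    # K(x, x') = exp(-gamma * sum_j (x_j - x'_j)^2)
    return np.exp(-gamma * np.sum((x - x2) ** 2))

a = np.array([1.0, 2.0])
b = np.array([0.0, 1.0])
print(polynomial_kernel(a, b))  # (1 + 2)^2 = 9.0
print(rbf_kernel(a, b))         # exp(-0.5 * 2) = exp(-1)
```

Both functions compute the inner product of the two samples after an implicit nonlinear mapping, which is what lets a linear SVM separate nonlinear data.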
SVM is effective when there are more dimensions than samples, and it performs well when the margin of separation between classes is clear.

The main advantages of decision trees are their high efficiency and their ability to handle non-numerical features.

A Random Forest is obtained by integrating multiple decision trees through Bagging (Figure 4). The first step of the random forest algorithm in building each decision tree is called "row sampling": sampling with replacement from all training samples to obtain a Bootstrap data set. The second step is called "column sampling": m features are selected at random from all M features (m is less than M), and these m features of the Bootstrap data set are used as the new training set to train a decision tree. All N decision trees are combined by voting at the end [11]. Random Forest reduces overfitting and improves the accuracy of decision trees, and it is more flexible for complex regression problems.

Figure 4: Random Forest

4.4 Neural Network

A Neural Network, as shown in Figure 5, was originally used only in the field of artificial intelligence, but it has attracted attention and praise in many fields in recent years. Neurons in Neural Network algorithms simulate the structure of neurons in the real world. By connecting multiple neurons layer by layer, a Neural Network with hidden layers is obtained [12]. Neural Networks are very powerful, and a multi-layer Neural Network can approximate any function. The goal of model training is to find the optimal weight parameter θ that minimizes the mean square error of the prediction.

Figure 5: Neural Network

5 RESULTS AND DISCUSSION

Parameters in machine learning are essential to the quality of the predictions. An important parameter in linear regression is whether to calculate an intercept for the model. In the support vector machine, the penalty parameter indicates how strongly the model tries to avoid misclassifying training examples; another parameter specifies the kernel type used by the algorithm. The criterion and splitter in the decision tree define the quality of each split. For the tree itself, the maximum depth and the minimum number of samples required at a leaf node are the two most important parameters [13]. Besides the same parameters as the decision tree, random forest has another essential parameter, the number of trees in the forest. In the Neural Network, the i-th element of the hidden layer sizes gives the number of neurons in the i-th hidden layer, and alpha is the L2 penalty (regularization term) parameter. The best choices of parameters for each model are shown in Table 2.

The deviations of these models are measured through MSE, Friedman MSE, and MAE. The equations of MSE and MAE are:

MSE = \frac{1}{n} \sum_{i=1}^{n} \big( Y_i - \hat{Y}_i \big)^2
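The fit-and-score procedure behind this comparison can be sketched with scikit-learn. The synthetic data, the parameter values, and the resulting scores below are illustrative only, not the paper's:

```python
# Sketch: fit several of the models discussed above and compare them with
# MSE, MAE, and R^2. Synthetic data stands in for the CLV dataset.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 6))
# A deliberately nonlinear target, which linear regression cannot capture
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.1, size=1000)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(max_depth=8, min_samples_leaf=5,
                                           random_state=0),
    # n_estimators = number of trees; max_features < M enables column sampling
    "Random Forest": RandomForestRegressor(n_estimators=100,
                                           max_features="sqrt",
                                           random_state=0),
}

scores = {}
for name, model in models.items():
    pred = model.fit(X_tr, y_tr).predict(X_te)
    scores[name] = (mean_squared_error(y_te, pred),   # MSE
                    mean_absolute_error(y_te, pred),  # MAE
                    r2_score(y_te, pred))             # R^2
    print(name, scores[name])
```

On a nonlinear target like this one, the tree-based models typically post lower MSE/MAE and higher R² than linear regression, which mirrors the ranking reported in this section.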
MAE = \frac{1}{n} \sum_{i=1}^{n} |e_i| = \frac{1}{n} \sum_{i=1}^{n} |y_i - x_i|

where e_i = y_i - x_i is the prediction error of the i-th sample.

According to the results for each model shown in Table 3, Linear Regression and Neural Network have close values of MSE and MAE, while Decision Tree and Random Forest have much smaller ones. Random Forest has the largest value of R², 0.70, slightly larger than that of Decision Tree. Linear Regression and Neural Network have values of R² around 0.14. Considering the above three factors, Random Forest is the best model because it has the smallest values of MSE and MAE and the largest value of R².

6 CONCLUSIONS

CLV has become an essential concept in both academia and practice in recent years because of the value of transaction data and the significance of customer behaviors. This article collects CLV data of customers from different states and analyzes the features using classical machine learning models. It can be seen from the correlation coefficients that monthly premium auto, total claim amount, and coverage are the main factors that affect CLV, because they have a strong relation with CLV. In contrast, state, gender, sales channel, and vehicle size do not have high analytical value. Among the four machine learning models discussed in this article, Random Forest is the best for analyzing CLV because it has the smallest values of MSE and MAE and the largest value of R².

There are some limitations in this article that may inspire future study in this field. The sample size of 8099 is sufficient for drawing a conclusion, but it would be better if more CLV datasets could be analyzed together. This article applies four machine learning models, and the largest value of R², 0.70, is obtained using random forest; models with higher precision, such as XGBoost, should also be tried. The four models have their own characteristics in predicting results, so model fusion could be considered in future research to further improve the accuracy and analyze the important factors.

REFERENCES
[1] Vanderveld, A., Pandey, A., Han, A., & Parekh, R. (2016). An engagement-based customer lifetime value system for E-commerce. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. https://doi.org/10.1145/2939672.2939693
[2] Canhoto, A. I., & Clear, F. (2020). Artificial intelligence and machine learning as business tools: A framework for diagnosing value destruction potential. Business Horizons, 63(2), 183–193. https://doi.org/10.1016/j.bushor.2019.11.003
[3] Mizik, N., & Hanssens, D. (2018). Handbook of Marketing Analytics. https://doi.org/10.4337/9781784716752
[4] Gupta, S., Hanssens, D., Hardie, B., Kahn, W., Kumar, V., Lin, N., Ravishanker, N., & Sriram, S. (2006). Modeling customer lifetime value. Journal of Service Research, 9(2), 139–155. https://doi.org/10.1177/1094670506293810
[5] Morrison, D. G., & Schmittlein, D. C. (1988). Generalizing the NBD model for customer purchases: What are the implications and is it worth the effort? Journal of Business & Economic Statistics, 6(2), 145–159. https://doi.org/10.1080/07350015.1988.10509648
[6] Vanderveld, A., Pandey, A., Han, A., & Parekh, R. (2016). An engagement-based customer lifetime value system for E-commerce. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. https://doi.org/10.1145/2939672.2939693
[7] Xu, P., Hu, W., Wu, J., Liu, W., Du, B., & Yang, J. (2019). Social trust network embedding. 2019 IEEE International Conference on Data Mining (ICDM). https://doi.org/10.1109/icdm.2019.00078
[8] Barkan, O., & Koenigstein, N. (2016). ITEM2VEC: Neural item embedding for collaborative filtering. 2016 IEEE 26th International Workshop on Machine Learning for Signal Processing (MLSP). https://doi.org/10.1109/mlsp.2016.7738886
[9] Chen, P. P., Guitart, A., del Rio, A. F., & Perianez, A. (2018). Customer lifetime value in video games using deep learning and parametric models. 2018 IEEE International Conference on Big Data (Big Data). https://doi.org/10.1109/bigdata.2018.8622151
[10] Myles, A. J., Feudale, R. N., Liu, Y., Woody, N. A., & Brown, S. D. (2004). An introduction to decision tree modeling. Journal of Chemometrics, 18(6), 275–285. https://doi.org/10.1002/cem.873
[11] Hasan, M. A., Nasser, M., Ahmad, S., & Molla, K. I. (2016). Feature selection for intrusion detection using Random Forest. Journal of Information Security, 7(3), 129–140. https://doi.org/10.4236/jis.2016.73009
[12] Albawi, S., Mohammed, T. A., & Al-Zawi, S. (2017). Understanding of a convolutional neural network. 2017 International Conference on Engineering and Technology (ICET). https://doi.org/10.1109/icengtechnol.2017.8308186
[13] Hussain, S. (2013). Relationships among various parameters for decision tree optimization. Studies in Computational Intelligence, 393–410. https://doi.org/10.1007/978-3-319-01866-9_13