
Applicability of Machine Learning Techniques in Predicting Customer Defection

Niken Prasasti 1,2, Hayato Ohwada 2
1 School of Business and Management, Bandung Institute of Technology, Indonesia
2 Department of Industrial Administration, Tokyo University of Science, Japan
niken.prasasti@sbm-itb.ac.id, ohwada@rs.tus.ac.jp

Abstract—Machine learning is an established method of predicting customer defection in a contractual business. Despite this, there has been no systematic comparison and evaluation of the different machine learning techniques that have been used. We provide a comprehensive comparison of different machine learning techniques on three data sets of a software company to predict customer defection. The evaluation criteria for the techniques consist of the understandability of the model, the convenience of using the model, the time efficiency of building the learning model, and the performance in predicting customer defection.

Keywords—customer defection; machine learning; classification; J48 decision tree; random forest; neural network; SVM

I. INTRODUCTION

Machine learning techniques have reached a stage where companies and industries are adopting them in a wide range of applications. The major focus of machine learning research is to extract information automatically from data by computational and statistical methods. In a broad perspective, machine learning is about giving software the ability to build knowledge from experience, derived from the patterns and rules extracted from a large volume of data [1].

Nowadays, research in machine learning gives companies the opportunity to develop their business strategies. For instance, in the insurance, mass-media, and telecommunications industries, machine learning is applied to identify customers with a high probability of defecting from a given service. It does so by looking at the information derived from the usage patterns of past customers. Previous techniques for predicting customer defection include logistic regression [2], decision trees [3], support vector machines (SVM) [4], artificial neural networks [5], and random forests [6]. In our previous paper [7], we investigated customer defection prediction using SVM and the J48 decision tree. Both classifiers performed well in the prediction model.
While recent research has focused on evaluating the performance of individual machine learning techniques, there has been no comparison of their other features, such as understandability, convenience, time efficiency, and visualization. This paper presents a comprehensive comparison of machine learning techniques specifically for predicting customer defection. It evaluates not only the performance but also the features of machine learning mentioned above, which recent literature has lacked. Based on the results of the experiments, a recommendation is provided as to which machine learning techniques should be considered when predicting customer defection.
The remainder of this paper is organized as follows. Section 2 reviews the problem description. Section 3 defines the data sets and the variables used in the machine learning procedures. Section 4 presents the machine learning techniques used in this paper. Section 5 provides the results and the comparison of the machine learning techniques used in predicting customer defection. Section 6 consists of the result tabulation and discussion. Finally, the conclusion is provided in the last section.
II. PROBLEM DESCRIPTION

The term defection is widely used in businesses with a contractual customer base. A characteristic of contractual business is that usage and retention are related processes: customers need to renew their contracts to keep access to the service [8]. We focused on applying machine learning techniques to analyze customer defection in a software company as one example of a contractual business. There is a one-year contract between a customer and the company. The company offers three main products that vary by price; these will be defined as Low-Price, Mid-Price, and High-Price.

Figure 1. Customer defection in the confirmation period

The company has an e-commerce site that sends an auto-renewal confirmation e-mail to each customer at least twice between zero and fifty days before their renewal time. The customer has to choose whether to opt in or to opt out. Choosing to opt in indicates that the customer would like to be contacted with a particular form, in this case a renewal form. In contrast, choosing to opt out indicates that they would prefer not to be, or in other words it is a form of defection. Fig. 1 shows the number of customers who defect in the period between fifty days before renewal and the renewal day itself.

Typically, the customer defection problem can be predicted by machine learning using customers' basic demographics and records of usage information. In this case, we predicted customer defection using historical data of customers' opt-in and opt-out activity. The data sets and variables used are described in the following section.
III. DATA SETS

The data sets used in the experiments were provided by the software company. We executed the learning procedures on data sets for the three different products mentioned earlier: Low-Price, Mid-Price, and High-Price. Each data set has over 20,000 records for 2007 through 2013, with six predictor variables. One issue in the data is that some customers tend to opt in for another product from the same company after they opt out from the previous one (which should not be counted as defection), while the e-commerce site only records the opt-out data. Therefore, data preparation is quite important in this research.

The original records contain the pattern of cancellation of customers after they choose the opt-out option. Before applying the data to the prediction models, we prepared the data so that only records representing real defection (where a customer who chose to opt out does not opt in for another product) were used. The final variables used in the learning procedures are listed in Table 1.
TABLE I. VARIABLES USED IN MACHINE LEARNING PROCEDURES

Variables          | Definition
UPDATE_COUNT       | Total count of renewals and purchases (first purchase is excluded).
CC_PRODUCT_PRICE   | Price of the most recently purchased product.
OPTIONAL_FLAG      | Whether the customer used an optional service.
ORG_FLAG           | Type of customer: personal or organization.
MAIL_STATUS        | Delivery status of e-mail.
CLASS              | Type of customer: defecting or retained.

UPDATE_COUNT is calculated during data preparation and describes the total count of a customer's renewal and purchase records, not including the first purchase. CLASS is the main variable that defines whether or not a customer is classified as defecting. The class distribution for each data set is presented in Table 2.

TABLE II. CLASS DISTRIBUTION FOR EACH DATA SET

Product    | Positive | Negative
Low-Price  | 13,709   | 5,302
Mid-Price  | 8,013    | 1,764
High-Price | 10,961   | 2,265

IV. MACHINE LEARNING PROCEDURES

Several machine learning techniques are applicable to predicting customer defection. Intuitively, defection prediction is a simple classification problem. It can be solved by learning a classifier that discriminates between customers based on the variables of the customer records. A set of labelled training examples is given to the learner, and the classifier is then evaluated on a set of unseen instances. We applied widely used learning techniques to predicting customer defection: decision trees, neural networks, and support vector machines (SVM). We used the WEKA J48, RandomForest, MultiLayerPerceptron, and SMO classifiers. We tuned the parameters of all machine learning techniques to achieve the best performance on the given data sets. In much previous research, machine learning algorithms are not tuned at all if the performance of the defection prediction is already sufficient with the default parameters set by the learning tools.
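To make the experimental setup concrete, the following is a minimal sketch of how such an evaluation can be scripted with the WEKA Java API; the file name defection.arff and the position of the class attribute are illustrative assumptions, not details recorded from our experiments.

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class DefectionEvaluation {
        public static void main(String[] args) throws Exception {
            // Load one of the product data sets (hypothetical file name).
            Instances data = DataSource.read("defection.arff");
            // Assume the CLASS attribute is the last attribute.
            data.setClassIndex(data.numAttributes() - 1);

            // Any WEKA classifier can be plugged in here; J48 is shown as an example.
            J48 classifier = new J48();

            // 10-fold cross-validation, as used for the results in Table 4.
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(classifier, data, 10, new Random(1));

            System.out.println(eval.toSummaryString());      // accuracy etc.
            System.out.println(eval.toClassDetailsString()); // per-class precision/recall
        }
    }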
A. J48 Decision Tree

A decision tree is categorized as a predictive machine-learning technique that decides the target value (dependent variable) of a new sample based on various attribute values of the available data [9]. Like other decision tree techniques, the WEKA J48 decision tree follows a simple algorithm. Using the attributes of the available training data, it creates a decision tree to classify new items. It analyzes which attribute discriminates the various instances most clearly and looks for the attribute that gives the highest information gain. The process continues until a clear decision is reached on what combination of attributes gives a particular target value, and it stops when it runs out of attributes.
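For reference, the information gain that J48 uses to select splitting attributes is defined in terms of entropy; the standard formulation (not spelled out in the text above) is

    IG(S, A) = H(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|} H(S_v), \qquad H(S) = -\sum_{i} p_i \log_2 p_i

where S is the set of training instances, A an attribute, S_v the subset of S taking value v on A, and p_i the proportion of instances in class i.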
B. Random Forests

Random forests combine three main ideas: trees, bootstrapping, and aggregation. The technique consists of bagging unpruned decision tree learners with a randomized selection of features at each split [10]. It follows the same algorithm for both classification and regression. First, it draws ntree bootstrap samples from the original data. For each bootstrap sample, it grows an unpruned classification or regression tree. Each tree gives a classification and votes for a class. Finally, the forest classifies the case according to the label with the most votes over all trees in the forest [11].
C. Neural Networks

Neural networks can be classified into single-layer perceptrons and multilayer perceptrons (MLP). They have a remarkable ability to derive meaning from complicated data and can generally be used to extract patterns and detect complex relationships that are not easily noticed by other techniques. We used the MultiLayerPerceptron function in WEKA. An MLP neural network is a non-linear predictive model in which the inputs are transformed into outputs by using weights, bias terms, and activation functions [12]. The MLP neural network is considered in this paper because non-linear relationships were found in some previous research on customer defection.
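Concretely, the transformation performed by a single unit of such a network can be written in the standard form (not given explicitly in the original)

    y = \varphi\Big(\sum_{i=1}^{n} w_i x_i + b\Big)

where x_1, ..., x_n are the inputs, w_i the weights, b the bias term, and \varphi the activation function; an MLP stacks layers of such units.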
D. Support Vector Machines (SVM)

We used the WEKA sequential minimal optimization (SMO) algorithm for training the support vector classifier. It is one of the most widely used algorithms for large-margin classification with SVMs. SVM is a classification technique, related to neural network technology, based on statistical learning theory [13]. It looks for an optimal linear hyperplane such that the margin of separation between the positive and the negative class is maximized. In practice, most data are not linearly separable, so to make separation feasible, a transformation is applied using a kernel function. It transforms the input into a higher-dimensional feature space by a non-linear mapping [14].

A decision on the kernel function is needed when implementing SVM. The kernel defines the function class we are working with. Instead of using a linear, sigmoid, or polynomial kernel, we used the squared exponential (RBF) kernel, since it is generally more flexible than the other kernels and can therefore model more functions with its function space.
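For completeness, the RBF kernel we selected has the standard form

    K(\mathbf{x}, \mathbf{x}') = \exp\left(-\gamma \, \lVert \mathbf{x} - \mathbf{x}' \rVert^{2}\right)

where \gamma is the kernel-width parameter whose tuning is discussed in Section 5.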
V. RESULTS

As mentioned in the first section, we would like to provide a comprehensive comparison of machine learning techniques for predicting customer defection. To do so, we evaluate the techniques by four criteria: understandability of the model, convenience of using the model, time efficiency in building the learning model, and performance in predicting customer defection.
A. Understandability of the Model

The understandability of a machine learning model is hard to formalize, as it is a very subjective concept. Nevertheless, in measuring understandability, we based our judgment on the following questions:
- Is it easy to know whether the model works or not?
- Does the learning algorithm help us to understand the model better?
- Are the results of the technique easily interpreted?

Decision trees are well known for their simplicity and understandability. A decision tree is produced by algorithms that identify various ways of splitting a data set into branches (segments). It follows a simple and understandable algorithm, described in the previous section. The visualization of the J48 decision tree output is clear and readable.
The J48 decision tree is one of the learners whose tree structure can be visualized. Fig. 2 presents the decision tree constructed by the J48 classifier and indicates how the classifier uses the attributes to make a decision.

Figure 2. Visualization of J48 decision tree classification results

The leaf nodes indicate which class an instance will be assigned to should that node be reached. The numbers in brackets after the leaf nodes indicate the number of instances assigned to that node, followed by how many of those instances are incorrectly classified as a result. With other classifiers, some other output is given that indicates how the decisions are made, e.g. a rule set. RandomForest produces an ensemble of trees (not just one as in J48), so its output only provides the calculation of the learning performance.

For neural networks, WEKA has its own graphical user interface (GUI) option that can be set to true before the learning process starts, to help us better understand the model that we will run, as seen in Fig. 3. The model of the neural network prediction using the MultiLayerPerceptron algorithm is shown in Fig. 4.

Figure 3. The GUI of MultiLayerPerceptron at the beginning of running the model

Figure 4. The learning model of MultiLayerPerceptron

Figure 5. Part of the output of SMO on the customer defection data

The SMO algorithm implements sequential minimal optimization for training a support vector classifier using kernel functions; here we used the RBF kernel. Fig. 5 shows part of the output of SMO on the customer defection data. Since the customer defection data contain two class values, a binary SMO model has been output: one hyperplane separating the pair of class values. Moreover, the hyperplane is expressed as a function of the attribute values in the original space [20].

B. Convenience of Using the Model

Each learning method used for customer defection modeling consists of a set of algorithms and requires parameters to be set in order to achieve the expected results. In this paper, the convenience of using each model is represented by the ease of tuning its parameters before running the algorithm. From a machine learning perspective, classification can be defined as a method of searching for a function that maps the space of attributes of the domain to the target classes [15].

Decision trees are probably the most common learning method used for the customer defection problem. Generally, in the WEKA J48 decision tree, the default parameter values already gave the best performance across all data sets. Though previous research [16] found that reduced-error pruning (using the -R -N 3 flags) can improve J48 model performance, in this customer defection prediction case the default values gave better performance.
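As a sketch of what this comparison looks like in the WEKA Java API (the setters below are the J48 options corresponding to the -R and -N command-line flags), the two variants can be constructed as follows:

    import weka.classifiers.trees.J48;

    public class J48Variants {
        // Default J48: C4.5 pruning with confidence factor 0.25.
        public static J48 defaultTree() {
            return new J48();
        }

        // Reduced-error pruning, equivalent to the flags "-R -N 3".
        public static J48 reducedErrorPruningTree() {
            J48 tree = new J48();
            tree.setReducedErrorPruning(true); // -R
            tree.setNumFolds(3);               // -N 3: folds reserved for pruning
            return tree;
        }
    }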

Like other decision trees, random forests (RF) have very few parameters to tune and can be used quite efficiently with the default parameters. Using the WEKA RandomForest, we changed one main parameter of RF, the number of trees. We found that by increasing the number of trees from the default to 500 (a value recommended in [17]), the performance increased quite well.
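A minimal configuration sketch follows; note that this uses the setter of the WEKA 3.6/3.7 API, and newer versions may expose the option differently.

    import weka.classifiers.trees.RandomForest;

    public class ForestSetup {
        public static RandomForest largeForest() {
            RandomForest forest = new RandomForest();
            // WEKA 3.6/3.7 API; in WEKA 3.8+ the same "-I" option is
            // exposed as setNumIterations(500) instead.
            forest.setNumTrees(500);
            return forest;
        }
    }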
SMO is a more complicated classifier to tune. In WEKA there are two parameters that can be tuned: the complexity value (C) of SMO and the gamma value of the kernel used by SMO. To find the best parameters for the model, we used the GridSearch function in WEKA, which allows us to optimize two parameters of an algorithm by setting a maximum, minimum, base, and step value for how much each parameter is increased in each test [18]. The main advantage of GridSearch is that it is not limited to the first-level parameters of the base classifier; we can specify paths to the properties that we want to optimize.
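For readers without the GridSearch package installed, essentially the same search can be approximated by a plain nested loop over C and gamma; this is a manual alternative, not the GridSearch call itself, and the candidate values below are illustrative assumptions rather than the grid we actually used.

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.functions.SMO;
    import weka.classifiers.functions.supportVector.RBFKernel;
    import weka.core.Instances;

    public class ManualGrid {
        // Returns the SMO configuration with the highest cross-validated accuracy.
        public static SMO search(Instances data) throws Exception {
            double[] cValues = {0.1, 1.0, 10.0};     // illustrative grid
            double[] gammaValues = {0.01, 0.1, 1.0}; // illustrative grid
            SMO best = null;
            double bestAccuracy = -1.0;
            for (double c : cValues) {
                for (double gamma : gammaValues) {
                    SMO smo = new SMO();
                    smo.setC(c);                    // complexity parameter
                    RBFKernel kernel = new RBFKernel();
                    kernel.setGamma(gamma);         // kernel-width parameter
                    smo.setKernel(kernel);
                    Evaluation eval = new Evaluation(data);
                    eval.crossValidateModel(smo, data, 10, new Random(1));
                    if (eval.pctCorrect() > bestAccuracy) {
                        bestAccuracy = eval.pctCorrect();
                        best = smo;
                    }
                }
            }
            return best;
        }
    }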
The default parameters of the WEKA MultiLayerPerceptron are quite sensible for the model. However, for MLP, deciding upon the learning rate is very important [19]. Hence, we changed the learning-rate parameter -L to 0.1 and 0.5, and it turned out that the default -L 0.3 gives optimum performance.
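A minimal configuration sketch in the WEKA Java API (the GUI flag corresponds to the visualization discussed above in Section 5.A):

    import weka.classifiers.functions.MultilayerPerceptron;

    public class MlpSetup {
        public static MultilayerPerceptron tunedMlp() {
            MultilayerPerceptron mlp = new MultilayerPerceptron();
            mlp.setLearningRate(0.3); // -L; we also tried 0.1 and 0.5
            mlp.setGUI(true);         // show the network GUI before training starts
            return mlp;
        }
    }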
C. Time Efficiency in Building the Model

Time is an important consideration when using machine learning techniques to predict customer defection. We compared the time needed to build the learning model of each classifier using WEKA. On all three data sets, the decision trees needed the least time to build the model and to calculate the performance. Of the two decision tree learners, J48 is speedier than RandomForest, especially after we tuned the number of trees in RandomForest to a value larger than the default. MultilayerPerceptron needs more time than the decision trees, but it is still acceptable, taking less than 10 seconds per run on every data set. The longest time is needed by the SMO support vector machine: it took more than 5 minutes to build the model after we switched the kernel function to the RBF kernel.
TABLE III. TIME NEEDED BY EACH CLASSIFIER TO BUILD THE MODEL (SECONDS)

Product    | J48  | RF   | MLP | SVM
Low-Price  | 0.11 | 4.35 | 5.6 | 280.7
Mid-Price  | 0.13 | 5.66 | 4.3 | 299.8
High-Price | 0.13 | 5.44 | 4.3 | 342.4

D. Performance in Predicting Customer Defection

A classification task involves deciding which of a set of categories or labels should be assigned to some data according to its attributes. In predicting customer defection, there are two possible classes: defect or retain. Commonly, the performance of a classifier is measured by accuracy. If, on a data set, a classifier can correctly guess the label of half of the examples, its accuracy is said to be 50%. However, to avoid concluding that one classifier model is better than another on the basis of accuracy alone, we also calculate the precision and recall of each classifier, as defined below.
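In terms of true/false positives and negatives (TP, TN, FP, FN), these standard measures are

    \mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}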
TABLE IV. COMPARISON OF CLASSIFIER PERFORMANCE

Product    | Classifier | Accuracy | Recall | Precision
Low-Price  | J48        | 72.12%   | 83.91% | 74.10%
           | RF         | 72.28%   | 84.21% | 74.14%
           | MLP        | 68.81%   | 80.51% | 72.04%
           | SMO        | 68.81%   | 84.93% | 70.42%
Mid-Price  | J48        | 81.95%   | 85.80% | 88.14%
           | RF         | 82.32%   | 86.12% | 88.22%
           | MLP        | 78.73%   | 91.32% | 80.83%
           | SMO        | 82.28%   | 90.41% | 80.92%
High-Price | J48        | 82.87%   | 76.39% | 92.87%
           | RF         | 83.13%   | 77.68% | 92.61%
           | MLP        | 68.57%   | 67.57% | 76.54%
           | SMO        | 82.71%   | 75.21% | 91.51%

Table 4 compares the accuracy, recall, and precision scores of the four classifiers on the three data sets. The table presents the results of 10-fold cross-validation experiments. It can safely be concluded that no single model had the highest accuracy on all three data sets. As we can see, the accuracies of the four classifiers on the Low-Price product data set remain similar. Rather, the performance of each algorithm differed depending on the characteristics and type of the data. Nevertheless, the decision trees and SVM give more stable results.
VI. DISCUSSION

We summarize the results on all evaluation criteria for all classifier techniques in Table 5 (High represents a good value and Low a poor one). To the best of our knowledge and judging by the results of the experiments, the J48 decision tree gives the highest understandability (from both the algorithm and the result visualization), convenience of use, and time efficiency. Its high performance is another reason to consider applying this model to predicting customer defection.

Though random forests give high accuracy for prediction on all data sets, in practice they have lower understandability than the J48 decision tree in this defection prediction case. Nevertheless, their convenience of use and time efficiency are the advantages of this decision tree model. Some recent research has applied the random forest model to cases where the number of predictor variables is high.

The neural network model seems to be unsuitable for predicting customer defection using data sets with the characteristics described in the third section. It shows lower performance on all data sets, though it has high values of understandability and time efficiency.

The last classifier, SMO, as the support vector machine tool, gives the highest prediction performance. Support vector machine methods are well known for their good learning performance. However, SMO is a more complicated classifier than the others. One of its weaknesses is the time needed to run and build the model, especially with a huge amount of input data.
TABLE V. SUMMARY OF EVALUATION CRITERIA FOR ALL CLASSIFIERS

Criteria          | J48    | RF     | MLP   | SMO
Understandability | Higher | Low    | High  | Low
Convenience       | Higher | Higher | Low   | Low
Time efficiency   | Higher | Higher | High  | Low
Performance       | High   | High   | Lower | Higher

VII. CONCLUSION

Machine learning is an established method of predicting customer defection in a contractual business. We applied several machine learning classifier techniques to predict customer defection in a software company and provided a comprehensive comparison of four classifiers: the J48 decision tree, random forests, neural networks, and support vector machines. Four evaluation criteria were used in the comparison: the understandability of the learning model, the convenience of using the model, the prediction performance, and the time efficiency.

Finally, we come to the result that, in predicting customer defection, each classifier has its own best criteria. In this paper, owing to their compatibility with the data sets, we concluded that the J48 decision tree and support vector machine models work excellently. However, these findings are limited to customer defection cases with similar data sets; the results may turn out differently on other data sets with other prediction variables.

REFERENCES

[1] Mitchell, T. (1997). Machine Learning. McGraw Hill.
[2] Nie, G., Rowe, W., Zhang, L., Tian, Y., & Shi, Y. (2011). Credit card churn forecasting by logistic regression and decision tree. Expert Systems with Applications, 38(12), 15273-15285. doi:10.1016/j.eswa.2011.06.028
[3] Bin, L., Peiji, S., & Juan, L. (2007). Customer Churn Prediction Based on the Decision Tree in Personal Handyphone System Service. 2007 International Conference on Service Systems and Service Management, 1-5. doi:10.1109/ICSSSM.2007.4280145
[4] Coussement, K., & Van den Poel, D. (2006). Churn Prediction in Subscription Services: An Application of Support Vector Machines While Comparing Two Parameter-Selection Techniques.
[5] Sharma, A. (2011). A Neural Network based Approach for Predicting Customer Churn in Cellular Network Services. International Journal of Computer Applications, 27(11), 26-31.
[6] Ying, W., Li, X., Xie, Y., & Johnson, E. (2008). Preventing Customer Churn by Using Random Forests Modeling. IEEE International Conference on Information Reuse and Integration, 3, 429-435.
[7] Prasasti, N., Okada, M., Kanamori, K., & Ohwada, H. (2013, forthcoming). Customer Lifetime Value and Defection Possibility Prediction Model using Machine Learning: An Application to a Cloud-based Software Company. Lecture Notes in Computer Science, 8398.
[8] Ascarza, E., & Hardie, B. G. S. (2013). A Joint Model of Usage and Churn in Contractual Settings. Journal of Marketing Science, (February), 1-5.
[9] Padhye, A. (n.d.). Chapter 5: Classification Method. Retrieved from http://www.d.umn.edu/~padhy005/Chapter5.html
[10] Montillo, A. A. (n.d.). Random Forests. Retrieved from http://www.dabi.temple.edu/~hbling/8590.002/Montillo_RandomForests_4-2-2009.pdf
[11] Coussement, K., & Van den Poel, D. (2008). Improving Customer Attrition Prediction by Integrating Emotions from. Gent.
[12] Glady, N., Baesens, B., & Croux, C. (2009). Modeling Churn Using Customer Lifetime Value. European Journal of Operational Research, 197(1), 402-411. doi:10.1016/j.ejor.2008.06.027
[13] Vapnik, V. N. (1995). The Nature of Statistical Learning Theory. New York: Springer Verlag.
[14] Coussement, K., & Van den Poel, D. (2006). Churn Prediction in Subscription Services: An Application of Support Vector Machines While Comparing Two Parameter-Selection Techniques.
[15] Koblar, V. (2012). Optimizing Parameters of Machine Learning Algorithms.
[16] Kotthoff, L., Gent, I. P., & Miguel, I. (n.d.). An Evaluation of Machine Learning in Algorithm Selection for Search Problems, 1-15.
[17] Janitza, A. B. S. (2012). Overview of Random Forest Methodology and Practical Guidance with Emphasis on Computational Biology and Bioinformatics, (129).
[18] Weka - Optimizing Parameters. (n.d.). Retrieved from http://weka.wikispaces.com/Optimizing+parameters
[19] Why MultiLayer Perceptron / Neural Network? (n.d.). Retrieved from http://courses.media.mit.edu/2006fall/mas622j/Projects/manu-rita-MAS_Proj/MLP.pdf
[20] Witten, I. H., Frank, E., & Hall, M. A. (2011). Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann.
