
Classifying in a Predicted Future -
Predicting Customer Churn with User Level Predictions

Marco André Wedemeyer


STUDENT NUMBER: 2001451

THESIS SUBMITTED IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
MASTER OF SCIENCE IN DATA SCIENCE & SOCIETY
DEPARTMENT OF COGNITIVE SCIENCE & ARTIFICIAL INTELLIGENCE
SCHOOL OF HUMANITIES AND DIGITAL SCIENCES
TILBURG UNIVERSITY

Thesis committee:
Pieter Spronck
Yash Satsangi

Tilburg University
School of Humanities and Digital Sciences
Department of Cognitive Science & Artificial Intelligence
Tilburg, The Netherlands
June 2020
Preface

I want to thank my supervisor Pieter Spronck for guiding me through this process. His
experience and feedback have been instrumental to the success of my thesis. I also want
to thank Jef Vanlaer for his continued support as my external supervisor throughout the
project. Without our frequent and smooth communication this project would not have
been possible. Lastly, I want to thank Joan van de Wetering for welcoming me back after
my internship to write my thesis with TrendMiner and enabling me to work with real
life data.
Classifying in a Predicted Future - Predicting
Customer Churn with User Level Predictions

Marco André Wedemeyer

The field of Business to Business churn prediction differs from Business to Consumer
churn prediction in important ways, yet it remains under-researched. Inspired by new
methodologies recently proposed in the literature, this thesis explores the possibilities
of encoding knowledge from user level churn prediction models into features used for customer
churn prediction. Two methodologies are proposed to predict customer features one month ahead
using user level model predictions as inputs, which allows for customer churn prediction in the
future. Data in the B2B SaaS context provided by the company TrendMiner is used to test these
ideas. Several classical machine learning algorithms are used for greater generalisability. The two
methodologies perform on par with current methods proposed for user-customer mappings for
customer feature creation and provide face value validity for the idea of leveraging the granularity
of user data to predict customer features for customer churn classification.

1. Introduction

Retaining customers has become ever more important for firms as competition increases
due to the increased costs associated with acquiring new customers as opposed to
retaining old ones. Understanding what drives customer satisfaction and dissatisfaction
is thus an important competitive necessity. Being able to predict which customers are
at risk of churning creates value by mitigating lost revenue and reducing customer
acquisition costs. Much attention has been placed on this problem in highly competitive
industries like telecommunications and banking. In these Business to Consumer (B2C)
settings switching is easy for customers. The factors impacting a customer's decision to
churn have been extensively researched and modeled in this context. In the Business to
Business (B2B) domain research is still scarce as data collection has been difficult and
limited. The lower number of customers and limited access to usage data in this setting
make predictive analytics challenging.
With the introduction of SaaS more data has become available in more B2B sectors,
leading to an increased interest for churn prediction in this domain.
The SaaS model sees software as a service rather than a commodity. Rather than
selling a one-time license to the end product, the customer signs up to a subscription
service for a limited duration of time. The subscription model is cheaper for the customer
in the short run with smaller upfront investment needed. The continued expenses are
justified by the outsourcing of administrative tasks and continued development of the
software service. The model is also attractive to vendors as the regular revenue stream
provides medium term stability. In the long run, however, the revenue streams are
closely linked to customer satisfaction, which requires close connections to maintain.
Since SaaS services are usually hosted by the vendor, analytical data about the usage
behavior of the customer can be mined. Based on this information customer satisfaction
can be estimated and churn models constructed (Frank and Pittges 2009).

Data Science & Society 2020

The relationships between B2C firms and their customers are different to those of
B2B firms. One such difference is that in B2B purchases the end user and the customer
tend to be different individuals. The end users also tend to outnumber the final
purchasing decision maker. The relationship between end user and customer has been
explored by Figalist et al. (2019). The authors describe a method of merging the data
known about the customer with the usage data generated by the end users. By inte-
grating the variables via a mapping scheme based on customer phases, the authors are
better able to predict customer churn. The authors claim the method to be generalisable
and thus will be adapted for this thesis.
Inspired by the idea of merging the data streams of user and customer level data
into one prediction task, this thesis proposes two methodologies for generating predic-
tions of a customer level feature one month into the future from user level data. Inspired
by encoder-decoder model architectures, these methodologies will attempt to encode
user level knowledge into customer features to improve customer churn prediction.
To test the viability and generalisability of the proposed methods, a set of machine
learning (ML) algorithms will be trained. The capacity of each of the algorithms to
predict churn will be tested prior to evaluating the proposed methodologies in order to
serve as a comparison. These prior evaluations will be performed using a set of features
inspired by literature and constrained by data availability. Access to data is provided by
the SaaS company TrendMiner. The performance of these features will also be evaluated.
Lastly the performance of proposed methodologies will be tested. These research aims
result in the following research questions:

1. How well do the different ML algorithms predict user and customer churn
in a SaaS context?
2. How well do the adapted features perform for predicting user and
customer churn in a SaaS context?
3. How well do the proposed methodologies perform for predicting
customer churn in a SaaS context?

The following chapters of this thesis are broken down into a review of the extant
literature (Section 2), an overview of the algorithms and methodologies used (Section
3), the experimental setup (Section 4), the results (Section 5), a discussion of the findings
(Section 6), and some concluding remarks (Section 7).

2. Related Work

Maintaining positive relationships with customers is of great importance to B2B firms
as customers tend to be large and few. Managing this relationship well has been the
focus of considerable research and has broadly been named Customer Relationship
Management. Although a clear definition is yet to be agreed upon, the general consensus
is that the two pillars of the research area are acquisition and
retention (Richards and Jones 2008). With increased global competition, the advantage
of retaining existing customers over acquiring new ones has become more evident
(Rauyruen and Miller 2007). In the B2C field it has been found that small changes in
retention rates can have large impacts on the profitability of a firm (Van den Poel and
Larivière 2004). Although not directly translatable, this insight still holds true in the
B2B context. Each customer represents a significant proportion of a B2B firm’s revenue.


Customer retention is thus a significant factor in firm profitability (Rauyruen and Miller
2007).
Aiding in customer retention efforts is the study of churn prediction. The data
informed process of identifying customers who are most likely to churn supports reten-
tion campaigns as performed by Jahromi, Stakhovych, and Ewing (2014), which increase
retention rates and thus profitability. Coussement and De Bock (2013) define churn
prediction as “the process of calculating the probability of future churning behavior [..]
based on past information/prior behavior”. In an effort to improve the profitability of
retention campaigns, it has been proposed that prediction performance also account for
type I errors (precision) in addition to accuracy (Tsai and Lu 2009). Money spent on retention
efforts for loyal customers can yield better returns elsewhere.
Customer churn models have been used by telecommunications companies for
many years (Figalist et al. 2019) and an extensive literature analysing these practices
has accumulated in the Business to Consumer (B2C) sector (Chen, Hu, and Hsieh 2015).
Chen, Hu, and Hsieh (2015), however, point out that the B2B customer churn prediction
field is underdeveloped. Due to the limited number of customers, the collection of big
data is difficult for B2B firms, as opposed to B2C (Wiersema 2013), resulting in a lack
of capabilities to effectively store and analyse their big data (Jahromi, Stakhovych, and
Ewing 2014). The exception to this are B2B firms that need to gain a deep understanding
of “their customers’ customers and end-users” (Wiersema 2013).
In the pursuit of optimizing churn prediction two avenues have been identified
by Zhu et al. (2018). The first is the choice of algorithm used to attempt the task. A
non-exhaustive overview of the algorithms used is provided in section 3.1. The second
option is the input to these models: the data. Different models and frameworks have
been proposed to categorise customer behavior into measurable features. The ones that
inspired the variable operationalizations in this thesis are outlined in section 2.2. One
framework stood out in particular as it highlighted an important aspect of B2B SaaS
churn, which is the disconnect between those who purchase and those who use the
product. This framework inspired the contribution of this thesis it outlined in section
2.3.

2.1 Algorithms

Many different algorithmic approaches from the fields of machine learning and statistics
have been attempted for user and customer churn prediction. Logistic regression (LR) is
often used for churn prediction in the literature due to its computational cost-effectiveness. However,
due to its linear nature it fails to accurately model complex relationships. For this reason
it often serves as a baseline for other models (Jahromi, Stakhovych, and Ewing 2014).
Chen, Hu, and Hsieh (2015) showed in their literature review that decision trees (DTs)
are also very commonly used for churn prediction. Their high interpretability proves
valuable when trying to create actionable insights from the learned model parameters.
Although computationally simple, DTs as well as LRs can, in some cases, hold
their own in terms of performance against more computationally complex algorithms
such as neural networks (NNs) and support vector classifiers (SVCs) (Haver 2017). DT
and LR will thus be implemented as baselines
in this thesis.
Random Forests (RFs) will also be evaluated, as they are one of the most common
ensemble methods used in B2C churn prediction but have received little to no attention
in the B2B context according to Haver (2017).
Lastly, Support Vector Machines (SVMs) will be evaluated to provide a more flexible
classifier in comparison to the logistic regression.


These four are among a larger selection of algorithms that have been studied in the
context of churn prediction. A selection is made rather than a single algorithm, as
no dominant model has emerged (Zhu et al. 2018). The methodology explored in this
thesis will thus be tested on this selection of algorithms to increase generalisability.

2.2 Frameworks and Theoretical Models

The second aspect to improving churn prediction is to improve the features used in the
prediction task. Inspiration for such features has been taken from different streams of
research.
A very common framework is the Recency, Frequency, Monetary Value (RFM) model
from customer value analysis research. This model is commonly used to segment cus-
tomers into high value and low value customers (Chen, Hu, and Hsieh 2015). Customers
are given ordinal integer rankings between 1 and 5 for each variable and then ordered
to identify the most valuable customers. The model has received many modifications
(Chang and Tsai 2011; Wei et al. 2012; Yeh, Yang, and Ting 2009), however, these have
produced mixed results (Chen, Hu, and Hsieh 2015). Chen, Hu, and Hsieh (2015) found
support for the additional variable Length in the LRFMP model, but no support for
the variable Profit.
The features of prior usage behavior outlined by the CUSAMS framework (Bolton,
Lemon, and Verhoef 2004) have been used for predicting customer retention (v. Wan-
genheim, Wünderlich, and Schumann 2017). These features are the Length (i.e. dura-
tion), Depth (i.e. usage), and Breadth (e.g. cross-buying) of the relationship between the
organisation and the customer in terms of the customer's behavior. Additionally, the
marketing instrument price will be used as a feature in this thesis on recommendation
of the external party.
Churn can be modeled on an individual level, however, the process does not happen
in an isolated environment. When a social network is built around a product or service,
the health of the community is an important aspect in retaining customers. Including the
states of more than one user can improve churn prediction performance. The Spreading
Activation (SPA) model suggests that churn intentions are shared within a network.
Depending on the health of the network these intentions can be mitigated by other
users or spread to them. Generally, graph theory is used to model these relationships,
however, due to the scope of this thesis, this is not a feasible implementation. As the
concept is useful, a simplified operationalization will be used as described in section
4.1.1.
Chen, Fan, and Sun (2012) propose a novel data mining technique to model both
static and longitudinal features for churn prediction. This was achieved using a
three-phase trained SVM, which goes beyond the scope of this thesis. The longitudinal
development of features is simplified to the aggregate of the feature on a monthly basis.

2.3 The Customer and End User Connection

Figalist et al. (2019) introduce an overlooked aspect to the B2B context of customer
churn. The customer and the end user of the product are not always the same people
and thus the perceived value of the product is different for the different stakeholders
involved. The authors believed that due to the communication channels between end
users and customers, end-user dissatisfaction will influence the customer's decision to
churn. To enhance their prediction algorithm the authors created a methodology to
map the end user usage data to the customer data based on customer phases. The data
streams are categorised in two ways: 1) Static vs. Dynamic and 2) Direct vs. Derived.
The first categorisation identifies whether the data changes over time or not and the
second identifies whether the data was derived from low level data.

3. Methodology

The set of ML algorithms used to evaluate the proposed features are outlined in section
3.1. The proposed methodologies for creating the customer level features from user level
predictions are explained in section 3.2.

3.1 Algorithms

In the attempt to model customer and user churn no dominant model has been found
in the literature (Zhu et al. 2018). As such, multiple commonly used algorithms will
be tested to predict user churn, customer churn, and the new methodology of adding
user predictions to customer features. Each algorithm will be introduced and a quick
explanation provided.

3.1.1 Decision Tree. The DT implementation used in this thesis follows the CART deci-
sion tree, which tackles the classification problem with a binary recursive partitioning
approach (Steinberg 2009). Beginning with one initial binary decision rule (root) the
following data partitions are again split further in a binary way into branches. This
recursive process continues until no more splitting is possible or a condition is met to
stop splitting early, which results in a leaf node that is used for prediction.
Splitting rules - Decision trees are built using greedy decision rules, meaning that
the immediate best solution to a split is chosen without regard for a globally optimal
solution. A commonly used splitting rule uses the Gini measure of impurity. This
measure originates in information theory (Tsallis 1988) and measures the probability
of an instance being wrongly classified if it were labeled randomly according to the
distribution of the data subset. The final Gini impurity is the sum, over all classes, of
the probability of correct classification multiplied by the probability of misclassification
(Steinberg 2009).
Overfitting - In an effort to reduce overfitting decision trees can be pruned. CART
uses cost complexity pruning (CCP) to this effect. CCP iteratively evaluates the
misclassification rate at each terminal node and prunes the weakest links until all terminal
nodes meet a threshold criterion. A second method for reducing overfitting is to limit
the minimum number of instances a leaf should have. This avoids the tree creating
decision rules based on few instances which don’t generalise well beyond the training
set (Steinberg 2009).
These two methods are not the only methods of reducing overfitting but they are
the ones that will be used to tune the DT in this thesis.
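As a minimal sketch, the two overfitting controls map onto scikit-learn's CART implementation (the library cited later in this thesis); the data and hyperparameter values below are illustrative placeholders, not the settings tuned here:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the churn feature set (7 features, like the user data).
X, y = make_classification(n_samples=500, n_features=7, random_state=0)

# ccp_alpha > 0 activates cost complexity pruning; min_samples_leaf keeps the
# tree from creating decision rules based on very few instances.
tree = DecisionTreeClassifier(criterion="gini", ccp_alpha=0.01,
                              min_samples_leaf=20, random_state=0)
tree.fit(X, y)
```

Larger values of either hyperparameter yield smaller, more heavily regularised trees.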

3.1.2 Random Forest. A weakness of DTs is their tendency to overfit on their train-
ing data. In an effort to overcome this limitation, RFs use the perturb-and-combine
technique to generate more generalisable predictions. This thesis uses two sources of
randomness to perturb each DT in the ensemble: 1) The data used by each DT is subject
to bootstrapping with replacement and 2) a limit is set to the maximum depth of the
trees.


These changes increase the bias of the ensemble as each tree has a limited view of
the data, however, this limits each tree’s ability to overfit, thus decreasing the variance
(Pedregosa et al. 2011).
A weakness of the DT and RF are their orthogonal decision boundaries. As ex-
ploratory data analysis showed, many pairwise feature combinations on both the user
and customer level lend themselves to diagonal or elliptical decision boundaries, which
will pose a difficulty for these two algorithms.
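The two perturbation sources described above can be sketched with scikit-learn's Random Forest; the ensemble size and depth limit are illustrative values:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=7, random_state=0)

# bootstrap=True resamples the training data with replacement for each tree;
# max_depth caps how far each tree can grow. Both perturb the individual
# trees so the ensemble averages out their variance.
forest = RandomForestClassifier(n_estimators=100, bootstrap=True,
                                max_depth=5, random_state=0)
forest.fit(X, y)
```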

3.1.3 Logistic Regression Classifier. The logistic classifier makes use of the logistic
regression model, which models the logarithm of the odds (log odds) of a binary
dependent variable by a linear combination of independent variables. The log odds are
transformed to probabilities using the logistic curve. The regression model is converted
into a binary classifier by setting a threshold probability value. Probabilities above this
threshold are classified as positive instances and those below as negative instances
(Duman, Ekinci, and Tanriverdi 2012).
To avoid overfitting, regularization can be applied. Common methods include L1
and L2 regularization, which add the sum of absolute (L1) or squared (L2) coefficients
to the loss function. Depending on the regularization parameter, the additional loss
induced by the regularization will decrease the size of the weights and reduce the
model's variance. On occasion L1 can result in non-unique solutions, which can be
overcome by combining the two to form elastic net regularization (Mitov and Claassen
2013). The regularization parameter and the ratio between L1 and L2 in the elastic net
regularization are the two hyperparameters that will be tuned in this thesis.
Fitting the logistic regression requires an iterative solver, as it has no closed form
solution, unlike, for example, linear least squares regression. This thesis will use the SAGA solver as it
supports elastic net regularization (Defazio Ambiata, Bach, and Lacoste-Julien 2014). To
improve the performance of the solver the data is scaled.
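A minimal sketch of this setup in scikit-learn; the mixing ratio and regularization strength are illustrative, not the tuned values:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=7, random_state=0)

# SAGA supports the elastic net penalty; l1_ratio mixes L1 and L2 and C is
# the inverse regularization strength. Scaling helps the solver converge.
clf = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty="elasticnet", solver="saga",
                       l1_ratio=0.5, C=1.0, max_iter=5000),
)
clf.fit(X, y)
probs = clf.predict_proba(X)
```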

3.1.4 Support Vector Machine. The classification goal of SVMs is to find a hyperplane
that will separate the training examples in a way that will generalise best. The simplest
implementation of this goal is the maximum margin classifier which uses marginal
instances (support vectors) of label clusters to support the search for a margin
maximising hyperplane. As this approach is sensitive to outliers, soft margins can be created
by tolerating some misclassified instances, which provides a better bias-variance tradeoff.
This approach remains limited to linearly separable problems. SVMs overcome this
hurdle by implementing the kernel trick, which finds a suitable decision hyperplane
in higher dimensional transformations of the training data (Boser, Guyon, and Vapnik
1992). Commonly used kernels include the linear, polynomial, sigmoid and RBF kernels.
Due to the scope of this paper the default kernel (RBF) is used and tuned.
The two hyperparameters available when using RBF are C and Gamma. C is the
regularization parameter that controls the complexity of the decision hyperplane by
adjusting the strength of a penalty term for misclassification. Gamma determines the
range of influence a training instance has on the prediction of another.
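These two hyperparameters correspond directly to scikit-learn's SVC arguments; the values shown are illustrative defaults rather than tuned settings:

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=7, random_state=0)

# C penalises misclassification (controlling the softness of the margin);
# gamma sets how far the influence of a single training instance reaches.
clf = make_pipeline(StandardScaler(),
                    SVC(kernel="rbf", C=1.0, gamma="scale"))
clf.fit(X, y)
preds = clf.predict(X)
```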

3.1.5 Naive Bayes. Naive Bayes is a linear model that predicts the probability of classes
based on Bayes' theorem. Splitting by labels, it models each feature using a Gaussian
distribution. For each instance to be predicted, the posterior of each class is calculated
by multiplying the likelihoods of each feature occurrence together with the prior prob-
ability of the class. To convert the Naive Bayes model into a classifier the maximum a
posteriori rule is used (the label with the largest probability is chosen) (Buckinx et al.
2002).
The model assumes the independence of the features and that they can be modeled
using Gaussian distributions. These assumptions are not met by the data set; however,
they rarely are in real data sets.
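A minimal sketch with scikit-learn's Gaussian variant, on synthetic data that (unlike the real data set) actually satisfies the Gaussian assumption:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Two synthetic classes whose features are, by construction, Gaussian.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (100, 3)), rng.normal(2.0, 1.0, (100, 3))])
y = np.array([0] * 100 + [1] * 100)

# GaussianNB fits one Gaussian per feature and class; predict applies the
# maximum a posteriori rule to the resulting class posteriors.
nb = GaussianNB().fit(X, y)
posteriors = nb.predict_proba(X)
```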

3.1.6 K-Nearest Neighbor. The KNN algorithm generates predictions based on the
majority label found in the k nearest instances to the given instance (Govindarajan
and Chandrasekaran 2010). Distances can be computed in different ways, for example,
Manhattan distance or Euclidean distance. The choice of metric depends on the domain
to which KNN is applied. In this thesis, Euclidean distance will be used. The main
hyperparameter of KNN considered in this thesis is the number of neighbors to consider
for label voting.
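The two choices discussed (distance metric and k) can be sketched as follows; k=5 is an illustrative value, not the tuned one:

```python
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=500, n_features=7, random_state=0)

# n_neighbors is the k voted over; metric="euclidean" matches the distance
# named in the text.
knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean").fit(X, y)
preds = knn.predict(X[:10])
```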

3.2 Proposed Methodologies

The tool purchased by clients of TrendMiner tends to be used by a team of engineers and
specialists, each being given their own account under the location license. This setup
means that the individual who makes the purchase decision is unlikely to also be using
the software themselves. The choice to extend the license is thus made by an individual
who does not directly experience the product. In this context it is interesting to explore
the possibilities of connecting the two perspectives with each other. Figalist et al. (2019)
propose a mapping methodology to bring these two streams of data together in order
to improve customer churn prediction. This methodology is adapted to the use case of
TrendMiner and its clients. The mappings are done via a month variable as opposed to
the customer phases as the data was only available for one phase (pilot phase).
Inspired by the idea of mapping user data to customer data, this thesis proposes the
idea of encoding the knowledge of user churn and feeding it forward to a customer
model rather than using the aggregate usage behavior. The difference lies in when
the data is aggregated and is visualized in Figure 1. In essence this thesis attempts to
answer the question of whether classification in a predicted future is feasible. Figalist
et al. (2019) aggregate and predict on data in the present, while this thesis proposes
aggregating in the future. Two methodologies to aggregate the user level predictions
are explored in this thesis.

3.2.1 First Methodology. The first method aggregates the binary user churn predictions
on a customer level. The sum of these predictions equates to the number of users
expected to churn in the next month. By subtracting this amount from the current
amount of active users, the number of active users for the next month is predicted. To
generalise this approach to teams of all sizes, the amount is divided by the maximum
number of users to date for the customer, resulting in the proportion of active users
predicted next month. This process is shown in equation 1.

P_{t+1,c} = ( A_{t,c} - \sum_{i=1}^{A_{t,c}} Pred_{t,c,i} ) / \max A_{<t,c}    (1)

where P_{t+1,c} = proportion of active users in month t+1 for customer c,
A_{t,c} = number of active users in month t of customer c,
Pred_{t,c,i} = user level churn prediction in month t of customer c for user i,
\max A_{<t,c} = maximum number of active users of customer c up to month t.

Figure 1
Overview of the Proposed Methodologies compared to the mapping scheme proposed by
Figalist et al. (2019).

By aggregating the predictions from all of the active users as opposed to extrapo-
lating from customer level data, more information can be condensed into one feature.
A drawback of this approach is the propagation of errors from one model to the next.
When the user level model misclassifies an instance, the customer level model will be
trained on erroneous data.
Some underlying assumptions are that the user level models will be able to accu-
rately predict user churn, that the maximum number of active users seen to date will not
change significantly in the last month of the pilot phase and most importantly, that the
usage of the service at the user level will impact the purchase decision of the customer.
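The computation above can be sketched on a toy customer; the column names and values are hypothetical stand-ins, not TrendMiner's actual schema:

```python
import pandas as pd

# Hypothetical user level predictions for one customer in month t.
users = pd.DataFrame({
    "customer": ["c1"] * 4,
    "month": [3] * 4,
    "pred": [1, 0, 0, 1],   # binary churn prediction per active user
})
max_users_to_date = 5       # maximum active users seen for c1 before month t

active = len(users)                        # A_{t,c} = 4
predicted_churners = users["pred"].sum()   # sum of Pred_{t,c,i} = 2

# Equation 1: predicted proportion of active users in month t+1.
p_next = (active - predicted_churners) / max_users_to_date  # (4 - 2) / 5 = 0.4
```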

3.2.2 Second Methodology. The second method requires the user level model to output
the probability that a user will churn rather than the binary prediction. These probabil-
ities are then summed to approximate the number of users churned. Just like Method 1,
Method 2 then subtracts this prediction from the current active user count and divides
by the maximum users to date to predict the Social Impact feature one month into the
future, as described in equation 2.

P_{t+1,c} = ( A_{t,c} - \sum_{i=1}^{A_{t,c}} Prob_{t,c,i} ) / \max A_{<t,c}    (2)

where P_{t+1,c} = proportion of active users in month t+1 for customer c,
A_{t,c} = number of active users in month t of customer c,
Prob_{t,c,i} = predicted user level churn probability in month t of customer c for user i,
\max A_{<t,c} = maximum number of active users of customer c up to month t.


By using probabilities rather than binary values the drawback of the previously
shown approach is reduced. Cases in which the user level model is unsure will result in
probabilities closer to the threshold. This uncertainty can be taken into consideration by
the customer level model. The problem is not fully addressed, however, as there is still
the possibility that the user level model misclassifies an instance with high confidence.
This kind of bias is difficult to remove and requires tuning of the user level model to
avoid.
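On the same toy customer used for Method 1 (hypothetical names and values), the probability-based variant looks as follows:

```python
import pandas as pd

# Same toy customer as in Method 1, but with churn probabilities instead of
# binary predictions.
users = pd.DataFrame({
    "customer": ["c1"] * 4,
    "month": [3] * 4,
    "prob": [0.9, 0.2, 0.1, 0.6],   # per-user churn probability
})
max_users_to_date = 5

active = len(users)                      # A_{t,c} = 4
expected_churners = users["prob"].sum()  # sum of Prob_{t,c,i} = 1.8

# Equation 2: uncertainty near the threshold is carried forward rather
# than being rounded away as in the binary variant.
p_next = (active - expected_churners) / max_users_to_date  # (4 - 1.8) / 5 = 0.44
```

Note how the two users with moderate probabilities (0.2 and 0.6) shift the predicted proportion relative to Method 1, which rounds them to 0 and 1.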

4. Experimental Setup

TrendMiner offers self service analytics software to firms with complex resource pro-
cesses such as oil, gas, mining and chemical companies. The historian databases that
collect data from these processes are connected to TrendMiner servers. Clients have
access to the TrendMiner software to analyse their data via an internet browser. When
the customer agrees, usage behavior of end users is tracked by customer success software.
When certain interface elements are clicked, a feature of the product is used and the action is
recorded together with other end user information such as the timestamp, location, browser
type, and operating system. This raw data was queried via JQL from the customer
success platform for the period from 31-10-2018 to 31-03-2020. Together with
further data provided by TrendMiner this data was preprocessed into two datasets: 1)
User data and 2) Customer data. It is important to note that the number of users found
in the user data exceeds the number of users linked to customer accounts present in the
customer data.

4.1 Variable Operationalization

The features used in this thesis are inspired by those proposed in the churn prediction
literature as outlined in section 2.2. As existing data was used, the operationalization
of the variables needed to be adapted to the resources available. Table 1 provides an
overview of the mapping of the literature models and their variables to the features
available or derivable from the provided data. More specific operationalizations of these
features follow in sections 4.1.1 and 4.1.2.
Some of the proposed mappings show a large departure from the original
operationalization. For instance, Social Impact as proposed by the SPA model is typically
derived through social network analysis. As no network data is available this variable
cannot be replicated exactly; instead, the idea was translated to the given context. Client
suggestions were also included.

4.1.1 User Data. After preprocessing, the user data set contained 13,496 instances and 11
variables (3 keys, 7 features and 1 label). The 3 key features are Username, Customer,
and Month, which are the unique user ID, the associated customer account, and the
month in which the data was collected. The 7 features are Basic Events, Analytics
Events, Unique Events, Active days, Recency, Length, and Social Impact. The target
feature is Churn. The user data was aggregated by months by recommendation of the
external partner. The user feature operationalizations are summarised in Table 2.
The feature Username required preprocessing due to changing methods of data
storage in Mixpanel. During the analysed timeframe the User IDs were replaced with
Distinct IDs. A mapping between the two IDs was thus necessary. As certain features


Table 1
Mapping of literature models and available features.

Framework            Variable         Feature (User level)              Feature (Customer level)
CUSAMS               Breadth          Unique events                     Avg. unique events
                                      Analytics events                  Avg. analytics events
                     Depth            Basic events                      Avg. basic events
                     Length           Current date - first usage date   End of month - pilot start
RFM Model            Recency          Days since last activity          Avg. days since last activity
                     Frequency        Monthly active days               Avg. monthly active days
SPA Model            Social Impact    Proportion of monthly active users to maximum active users seen
Client Suggestions   Interventions                                      Cumulative # of training sessions
                     Cost             User cost                         Contracted user cost

continued to be tracked under the old IDs, an overlap in data recording under both IDs
needed to be resolved.
The feature Month provided by Mixpanel is formatted in milliseconds since
01/01/1970 and was transformed to days since 01/01/1970 for compatibility with the
other files provided by the external partner.
The features Basic Events and Analytics Events were transformed into natural
logarithms due to their large positive skewness. This transformation is also justified
practically, as the difference between 10 and 50 clicks is more relevant to churn prediction
than between 1000 and 1040 clicks. The distribution after the transformation for basic events
resembles a normal distribution. To avoid negative infinity values when taking the
natural log of zero, all counts were increased by one.
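Adding one before taking the natural log is equivalent to numpy's log1p; a small sketch with illustrative click counts:

```python
import numpy as np

clicks = np.array([0, 10, 50, 1000, 1040])

# log1p(x) = log(x + 1), so a count of zero maps to 0 instead of -inf.
log_clicks = np.log1p(clicks)

# After the transform, the 10 -> 50 gap is larger than the 1000 -> 1040 gap,
# matching the practical justification in the text.
small_count_gap = log_clicks[2] - log_clicks[1]
large_count_gap = log_clicks[4] - log_clicks[3]
```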
The target variable churn can be operationalized in several ways. A simple labeling
scheme would be to label users as churners if they have no activity in the following
month. This method, however, does not consider users returning after some time of
inactivity, such as a user going on holiday and returning to work. Another
method would be to label all months of a user who eventually churns as examples
of churn behavior. This deterministic approach assumes that users are not influenced
by interventions. Predicting churn would thus become an exercise of cost estimation
rather than a tool for user retention strategies. The operationalization used in this thesis
observes whether the month of the instance in question is the last month with activity
for this user. If so, they are considered churned. An exception, of course, is the last
month of observed data. The last month's worth of data is only used for label generation
and not for training purposes, as no label can be generated for it.
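Under stated assumptions about the data layout (one row per user per active month; the column names are hypothetical), the labeling scheme can be sketched in pandas:

```python
import pandas as pd

# Toy activity log: one row per user per active month (hypothetical data).
df = pd.DataFrame({
    "user": ["a", "a", "a", "b", "b", "c"],
    "month": [1, 2, 3, 1, 3, 2],
})

# A user-month is labeled as churn if it is that user's last active month.
last_active = df.groupby("user")["month"].transform("max")
df["churn"] = df["month"] == last_active

# The final observed month only supplies labels for earlier months; no
# label can be generated for it, so it is excluded from training.
last_month_observed = df["month"].max()
train = df[df["month"] < last_month_observed]
```

User "c" is last seen in month 2, so that month is labeled as churn and kept for training; the rows of the final observed month are dropped.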
Histograms of the features (Fig. 2) show that some features have promising differences
in distributions when grouping by the target feature. Although most features have

M. A. Wedemeyer Classifying in a Predicted Future

Table 2
User feature operationalizations.

Feature | Operationalization
Username | Unique user ID.
Customer | Name of the business customer.
Month | Number of days since 01-01-1970 from the first day of the month.
Basic Events | Natural log of the number of clicks categorised as basic functionalities by the external partner.
Analytics Events | Natural log of the number of clicks categorised as advanced functionalities by the external partner.
Unique Events | Number of unique features used by a user.
Active Days | Number of days with recorded activity within the month.
Recency | Difference in days between the last day of the month and the last day with recorded activity.
Length | Difference in days between the last day with recorded activity and the first day with recorded activity.
Social Impact | Proportion of users active per customer to the maximum number of active users observed in the past per customer.
Churn | True if the user remains inactive in the remaining months.

large overlapping distributions, they can still aid in classification in combination with
other features.

4.1.2 Customer Data. Customers make their decision to churn on predetermined con-
tract expiration dates. The data set contains information about 23 customers making
their decision to renew or churn after completing the pilot phase of the customer life
cycle at TrendMiner. The data is aggregated on a monthly basis from the raw user
data and from additionally provided information such as training dates and contract
values. The user-aggregated features are averages of the respective user features. Each
average is calculated by dividing the sum of all values for the active users by the
maximum number of users to date; these averages thus account for inactive or churned
users. Customers that did not agree to data tracking by TrendMiner were excluded
due to missing data, reducing the number of customers to 20. Only the data in the month
of the decision is included, which reduces the number of instances to 20.
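A minimal pandas sketch of this aggregation scheme, with toy values and hypothetical column names:

```python
import pandas as pd

# Toy user-month records for a single customer (hypothetical values).
users = pd.DataFrame({
    "month": [1, 1, 2, 2, 3],
    "user": ["u1", "u2", "u1", "u3", "u1"],
    "active_days": [10, 4, 8, 2, 6],
})

monthly = users.groupby("month").agg(
    total_active_days=("active_days", "sum"),
    active_users=("user", "nunique"),
)

# Maximum number of monthly active users observed up to each month.
monthly["max_users_to_date"] = monthly["active_users"].cummax()

# Dividing the monthly sum by the maximum users to date means inactive
# or churned users still weigh down the average.
monthly["avg_active_days"] = monthly["total_active_days"] / monthly["max_users_to_date"]

# Social Impact: proportion of active users to the maximum seen so far.
monthly["social_impact"] = monthly["active_users"] / monthly["max_users_to_date"]
```

In month 3 only one of the two previously seen concurrent users is active, so the average active days and Social Impact both drop.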
The dataset contains 13 features: two identifying features, Customer and Month; 10
predictive features; and the target feature Decision. Histograms of the feature distribu-
tions can be seen in Figure 3. The customer feature operationalizations are summarised
in Table 3.
The customer features Basic Events and Analytics Events are transformed with the
natural logarithm for the same reason the user level features were.
The features generated by the proposed methodologies of this thesis are added to
the feature set and evaluated separately. Each algorithm uses the proposed features
generated by their user level counterpart. Poor performance on the user level will thus
lead to worse performance on the customer level due to the propagation of errors.


Figure 2
Histograms of user level features grouped by the target label

As customer renewal decisions were mostly made on the last day of the month, the
data of that month was associated with the decision label.

4.1.3 Variable Transformations. Principal Component Analysis (PCA) is a common
transformation of high dimensional data to extract features with the highest amount
of variation. Although helpful, PCA has its drawbacks. The principal components (PCs)
that result from the transformation are difficult to interpret, and they are blind to the
distribution of the target among the data.
PCA was attempted on the user level data. Figure 4 shows the distributions of
the PCs grouped by the target feature. Although the histograms looked promising, the
performance did not improve. For this reason, and due to their lack of interpretability,
the original features were chosen for the prediction task.
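The PCA experiment can be reproduced along these lines with scikit-learn, here on stand-in data (features are standardised first, since PCA is scale sensitive):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Stand-in for the user level feature matrix (7 hypothetical features).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 7))
X[:, 1] = 2 * X[:, 0] + rng.normal(scale=0.1, size=200)  # correlated pair

# PCA is scale sensitive, so the features are standardised first.
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=5)
X_pc = pca.fit_transform(X_scaled)

# Each PC mixes all original features, which is what makes the components
# hard to interpret; PCA is also blind to the target labels.
explained = pca.explained_variance_ratio_
```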

4.2 Models and Hyperparameter Tuning

Several algorithms were used to predict both user and customer churn. These are the
scikit-learn (v0.22.2) implementations of the decision tree (DT), logistic regression (LR),
random forest (RF), support vector machine (SVM), K-nearest neighbors (KNN), and
naive Bayes (NB). The parameters of each were tuned using grid search and 10 fold
cross validation. The algorithms were tuned on SMOTE upsampled training data with


Table 3
Customer feature operationalization

Feature Operationalization
Customer Name of the business customer.
Month Number of days since 01-01-1970 from the first day of the
month.
Basic Events Natural log of the monthly number of basic feature clicks of
all users per customer divided by the maximum active users to
date.
Analytics Events Natural log of the monthly number of advanced feature clicks
of all users per customer divided by the maximum active users
to date.
Unique Events Summation of the user feature Unique Events divided by the
maximum number of users to date.
Active Days Summation of the user feature Active Days divided by the max-
imum number of users to date.
Recency Summation of the user feature Recency divided by the maxi-
mum number of users to date.
Length Current month minus pilot start start date.
Social Impact Proportion of users active to the maximum number of active
users to date.
Cumulative Cumulative number of days at which training sessions occurred
Trainings at a customer.
User Cost Contract value divided by active users.
Contracted User Contract value divided by maximum users seen to date.
Cost
Proposed Feature generated by the proposed methods.
Feature
Decision “Renew” if customer renews the contract, “Churn”otherwise.

a blanket ratio of 0.8 for all algorithms. The parameters that achieved the highest F1
score were chosen. The resulting hyperparameters are reported in Table 4.
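The tuning setup corresponds roughly to the following sketch, shown here without the SMOTE step, on a toy imbalanced dataset, and with an illustrative parameter grid rather than the full ranges of Table 4:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Toy imbalanced dataset (~14% positives, as on the user level).
X, y = make_classification(n_samples=300, weights=[0.86], random_state=0)

grid = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"ccp_alpha": [0.0, 0.005, 0.012], "max_depth": [3, 5, None]},
    scoring="f1",   # the parameters with the highest F1 score are kept
    cv=10,          # 10 fold cross validation as on the user level
)
grid.fit(X, y)
best = grid.best_params_
```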
The minimum samples in a leaf node parameter for the DT only made a difference
when CCP was very low. The optimal solution thus made the choice of this parameter
redundant and the default (1) was chosen.
Due to the small sample size and the 4 fold cross validation, the hyperparameters
for the customer level algorithms were not stable over different random states. Addi-
tionally, multiple best solutions were possible, to the extent that selecting hyperparam-
eters took on an arbitrary nature. To combat this, the average results over ten different
random states were used for hyperparameter selection where necessary. Due to the small
sample size this remained computationally tractable. All other hyperparameter tuning
was performed using random state zero.


Figure 3
Histograms of customer level features grouped by the target label

4.3 Sampling

The target features Churn (user level) and Decision (customer level) were unbalanced
with 14% and 30% positive (churned) instances respectively. Initial testing showed
that upsampling the minority class helped the models learn the characteristics of the
positive class better. Synthetic Minority Over-sampling Technique (SMOTE) was used
to upsample the minority class. Special handling of the features Basic Events and
Analytics Events was required to ensure realistic upsampling. SMOTE upsampling was
performed before the log transformation and the newly interpolated data points were
rounded to the closest integer, as the features represent counts. Notably, the algorithms
differed in their optimal SMOTE ratio. Therefore the predictions of each algorithm are
made using its own optimal SMOTE ratio. These values are found by iterating over a
range of ratios in steps of 0.05 as shown in Figures 5 and 6. The trade-offs in terms of
accuracy and F1 score depending on this ratio parameter are discussed in section 6.1.2
(Figure 15). The number of neighbors from which the new data points were synthesised
is 5 for the user level and 2 for the customer level.

Figure 4
Histograms of User Level Principal Components

Figure 5
User level SMOTE upsampling ratio tuning
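The ratio sweep behind Figures 5 and 6 works along these lines. A minimal nearest-neighbour interpolation stands in here for the imbalanced-learn SMOTE implementation used in the thesis; as described above, interpolated counts are rounded back to integers:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_like(X_min, n_new, k=5, rng=None):
    """Interpolate n_new synthetic minority samples between nearest
    neighbours (a stand-in for the imbalanced-learn SMOTE implementation)."""
    rng = rng if rng is not None else np.random.default_rng(0)
    if n_new == 0:
        return np.empty((0, X_min.shape[1]))
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)
    base = rng.integers(0, len(X_min), n_new)
    nbr = idx[base, rng.integers(1, k + 1, n_new)]  # column 0 is the point itself
    lam = rng.random((n_new, 1))
    return X_min[base] + lam * (X_min[nbr] - X_min[base])

rng = np.random.default_rng(0)
X_maj = rng.poisson(5.0, (100, 3)).astype(float)  # majority (non churn) counts
X_min = rng.poisson(2.0, (20, 3)).astype(float)   # minority (churn) counts

sizes = {}
for ratio in np.arange(0.2, 1.01, 0.05):          # sweep ratios in steps of 0.05
    n_new = max(int(round(ratio * len(X_maj))) - len(X_min), 0)
    # Upsampling happens on the raw counts (before the log transform) and
    # interpolated values are rounded back to the nearest integer.
    X_new = np.round(smote_like(X_min, n_new, rng=rng))
    sizes[round(float(ratio), 2)] = len(X_min) + len(X_new)
```

At a ratio of 1.0 the minority class is grown to the size of the majority class; at 0.2 no new samples are needed.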


Table 4
Hyperparameter tuning for each algorithm was performed using grid search. Ranges are given as (start, stop, scale).

Algorithm | Hyperparameter | Range (Users) | Range (Customers) | Final (Users) | Final (Customers)
DT | Minimum samples in leaf | (1, 300, linear) | (1, 11, linear) | 1 (default) | 1
DT | CCP alpha | (0, 0.02, linear) | (0, 0.45, linear) | 0.012 | 0
RF | Max depth | (1, 7, linear) | (1, 9, linear) | 7 | 1
RF | N trees | (1, 51, linear) | (1, 49, linear) | 21 | 43
LR | C | (0.01, 2, log) | (0.01, 2, linear) | 1 | 1
LR | Elastic net L1 ratio | (0, 1, linear) | (0, 1, linear) | 0 | 0
SVM | Gamma | (10^-5, 10^-1, log) | (10^-15, 10^-3, log) | 10^-3 | 10^-8
SVM | C | (10^-2, 10^3, log) | (10^-1, 10^11, log) | 10 | 10^8
KNN | Neighbors | (1, 301, linear) | (1, 17, linear) | 163 | 13
NB | Var. smoothing | (10^-3, 10^-1, log) | (10^-10, 10^-1, log) | 10^-1 | 10^-7

Figure 6
Customer level SMOTE upsampling ratio tuning

4.4 Evaluation Methods

The performance of each algorithm is evaluated using common metrics such as accuracy,
precision and recall, as well as receiver operating characteristic (ROC) curves and
their area under the curve (AUC). Values of the confusion matrix are also reported.
The capacity of each feature to predict churn is evaluated by training each model on
a single feature at a time. The F1 score is reported. This method is used, as opposed to,
for example, a chi-square test, as it provides model specific results. The usefulness of the
feature is tested rather than inferred. A downside to this approach is that feature groups
might collaboratively outperform each feature’s individual contribution.
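The single-feature evaluation can be sketched as follows, with toy data and LR as the example model (in the thesis every algorithm is evaluated this way):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Toy stand-in for the user data; feature names are hypothetical.
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
feature_names = [f"f{i}" for i in range(X.shape[1])]

# Train on one feature at a time and report the cross-validated F1 score,
# so each feature's individual predictive ability is tested, not inferred.
single_f1 = {
    name: cross_val_score(
        LogisticRegression(), X[:, [i]], y, scoring="f1", cv=10
    ).mean()
    for i, name in enumerate(feature_names)
}
```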
The contribution of each feature is also evaluated using accumulated local effects
(ALE) plots for the user level features and partial dependence plots (PDPs) for the
customer level features. PDPs show the average prediction value over all instances
when varying a single feature over its range. The changes in average prediction values
show how a model uses a particular feature to make predictions. A downside of
PDPs is their assumption of feature independence. Violations of this assumption force
the algorithms to make predictions outside of the range of values on which they were
trained and penalized for misclassification. ALE plots solve this issue by considering
the conditional as opposed to the marginal probabilities. Depending on the value of the
inspected feature, models only make predictions using those instances that are realistic
given the correlations. As there are very few customer level instances and ALE plots
induce sparsity, the customer models produced unstable results. Thus, PDPs are used
at the customer level instead of ALE plots.
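A sketch of how such a PDP is computed with scikit-learn's `partial_dependence`, on toy data with a random forest standing in for the thesis models:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import partial_dependence

# Toy stand-in data; a random forest stands in for the customer models.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)

# Sweep feature 0 over its range and average the predicted probability
# over all instances; averaging over the marginal distribution is where
# the feature-independence assumption enters.
pd_result = partial_dependence(model, X, features=[0], kind="average")
avg_pred = pd_result["average"][0]  # one averaged prediction per grid point
```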
As the features generated by the proposed methodologies are predictions of a
feature in the original feature set (Social Impact), the target value is known. The ability
of the user level models to predict the feature can thus be evaluated.
The question of whether classifying in a predicted future is sensible is tested by
comparing the performance of a feature aggregated in the present (feature_t, where t is
the current moment in time) and one aggregated in the future (feature_t+1, where t+1 is
the next time step). Although the prediction of feature_t+1 was derived with all of the
user level features, the classification task is still performed on only this singular feature.
The fairest comparison is thus with the performance of feature_t. The performances of
the customer level algorithms using feature_t, the predictions of feature_t+1 and the
actual feature_t+1 are reported.

5. Results

In the context of contractual SaaS B2B customer relations, the main research aims that
this thesis addresses are

1. How well do the different ML algorithms predict user and customer churn
in a SaaS context?
2. How well do the adapted features perform for predicting user and
customer churn in a SaaS context?
3. How well do the proposed methodologies perform for predicting
customer churn in a SaaS context?

Each of these research aims is addressed in its own section.

5.1 Algorithmic Performances

The algorithmic performance is presented for both the user and the customer level
features and targets, without the addition of the proposed features. The baseline per-
formance of the algorithms is thus tested and their merit at predicting churn shown.
The confusion matrix values are reported in Table 5. The derived performance mea-
sures accuracy, precision, recall and F1 score are shown in Table 6. These results were


Table 5
Confusion matrix values for user and customer level models. Largest values in a column are
bolded.

Algorithm | User TN | User FP | User FN | User TP | Customer TN | Customer FP | Customer FN | Customer TP
DT | 6162 | 5427 | 364 | 1543 | 13 | 1 | 2 | 4
RF | 6331 | 5258 | 388 | 1519 | 13 | 1 | 0 | 6
LR | 8511 | 3078 | 735 | 1172 | 13 | 1 | 0 | 6
SVM | 7372 | 4217 | 677 | 1230 | 14 | 0 | 0 | 6
NB | 10962 | 627 | 1521 | 386 | 14 | 0 | 1 | 5
KNN | 7056 | 4533 | 530 | 1377 | 13 | 1 | 0 | 6

Table 6
User and customer prediction results per algorithm. Largest values per column are bolded.

Level | Algorithm | Accuracy | Precision | Recall | F1 Score
User | DT | 0.571 | 0.221 | 0.809 | 0.348
User | RF | 0.582 | 0.224 | 0.797 | 0.350
User | LR | 0.717 | 0.276 | 0.615 | 0.381
User | SVM | 0.637 | 0.226 | 0.645 | 0.335
User | NB | 0.841 | 0.381 | 0.202 | 0.264
User | KNN | 0.625 | 0.233 | 0.722 | 0.352
Customer | DT | 0.850 | 0.800 | 0.667 | 0.727
Customer | RF | 0.950 | 0.857 | 1.000 | 0.923
Customer | LR | 0.950 | 0.857 | 1.000 | 0.923
Customer | SVM | 1.000 | 1.000 | 1.000 | 1.000
Customer | NB | 0.950 | 1.000 | 0.833 | 0.909
Customer | KNN | 0.950 | 0.857 | 1.000 | 0.923

generated from 10 fold and 4 fold cross validation for the user level and customer level
respectively.
At the user level, NB achieved the highest true negative rate of 94.6% while DT
achieved the lowest with 53.2%. Conversely, DT achieved the highest true positive rate
of 80.9% and NB scored the lowest with 20.2%. At the customer level, SVM and NB
achieved perfect true negative rates, and RF, LR, SVM and KNN achieved perfect true
positive rates.
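These rates follow directly from the confusion matrix values in Table 5; for example, at the user level:

```python
# Confusion matrix values taken from Table 5 (user level).
tn, fp, fn, tp = 10962, 627, 1521, 386    # NB
nb_tnr = tn / (tn + fp)                   # true negative rate, ~0.946
nb_tpr = tp / (tp + fn)                   # true positive rate, ~0.202

tn, fp, fn, tp = 6162, 5427, 364, 1543    # DT
dt_tnr = tn / (tn + fp)                   # ~0.532
dt_tpr = tp / (tp + fn)                   # ~0.809
```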
Of the user level models, LR achieved the highest F1 score, while of the customer
level models SVM achieved the highest F1 score. While at the user level there tended to
be a trade-off between precision and recall scores, this did not appear to be the case at the
customer level. At the user level the LR outperformed the SVM, which can be partially
attributed to the limited set of hyperparameters in the search grid of the SVM. A set
of hyperparameters could potentially outperform the LR, but finding these was not
computationally tractable given the resources.


The ROC curves and AUC values were found for the models at the user and
customer level and are shown in Figure 7. The values are derived from the probability
estimates of each classifier.

Figure 7
Receiver Operating Characteristic curves for all algorithms for user level prediction

For the user level models, the highest AUC was achieved by LR (0.75) and the
lowest by DT (0.67). At the customer level the highest AUC was achieved by LR and
SVM (1.00) and the lowest by DT (0.80). At both the user and customer level the DT
performed the worst, and LR was among the best performing algorithms.
All algorithms perform at a satisfactory level and will continue to be used for the
feature and methodology evaluations.

5.2 Feature Evaluation Results

The features used in the churn prediction tasks were informed by theoretical models
from the literature. Although this lends them legitimacy as potential features to explore,
it does not guarantee performance. The contribution of each feature is
evaluated using permutation feature importance (PFI) (Figure 8), and each feature is
evaluated separately with each algorithm for its ability to predict churn (Figure 9).
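PFI can be computed with scikit-learn's `permutation_importance`: shuffle one feature at a time and measure the resulting drop in score. A sketch on stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Toy stand-in for the thesis data and models.
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = LogisticRegression().fit(X_tr, y_tr)

# Shuffle one feature at a time and measure the resulting drop in F1.
pfi = permutation_importance(model, X_te, y_te, scoring="f1",
                             n_repeats=10, random_state=0)
importances = pfi.importances_mean  # mean importance per feature
```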

Figure 8
Permutation Feature Importance for user and customer models.


The PFIs for the user level models showed that all but one depended heavily on
the feature Active Days. The NB model relied instead on the feature Basic Events. This
is notable, as NB achieved the highest precision and lowest recall (and the lowest F1 score).
A possible explanation for this reliance is that, besides having a fairly
good split in the label distributions when inspecting the histograms, Basic Events is also
the feature closest to a normal distribution. As the Gaussian naive Bayes implementation
was used, it seems intuitive that the algorithm would take best to this feature. See Figure 25
in the appendix for further analysis.
For the PFIs on the customer level, the most commonly used feature was Social Impact.
Social Impact was very important to the DT and RF for decision making. Looking at
the histograms of the customer features in Figure 3, it is clear that this feature appears
to have the most separated distributions when grouped by the target. Interestingly, the
most important features according to PFI are not the same for the user and customer
level. At different levels of aggregation, different features become important. At the cus-
tomer level the proportion of active users (Social Impact) becomes the most important
feature.
The individual feature performance is also evaluated. Each algorithm is trained
using one feature at a time. This way, each feature's individual ability to predict churn
can be evaluated. Figure 9 shows the results.
At the user level the results align with the PFIs fairly well. Active Days performs
the best, and naive Bayes relies on all features to a broader extent than the other
algorithms. Basic Events is the second best feature. The strongest feature in terms of PFI
is also the strongest in terms of individual performance for both the user and customer
level. At the user level the models struggle to make sense of features other than Active
Days, which is reflected in the PFIs. It may be the case that the hyperparameter tuning
performed on all features does not generalise well to each individual feature. This is
mostly the case for models with regularization parameters such as the DT, RF, LR and
SVM. The customer models show more flexibility and most features are able to
perform well.

Figure 9
Individual feature performance for user and customer level per algorithm.

5.3 Results of the Proposed Methodologies

The two proposed methodologies took different approaches to predicting the ratio
of active users in the following month. The mean absolute error (MAE) for each model
is reported in Table 7.


Table 7
Mean Absolute Error of the proposed methods. Smallest values per column are bolded.

Model | Method 1 | Method 2
DT | 0.216 | 0.204
RF | 0.218 | 0.200
LR | 0.197 | 0.202
SVM | 0.236 | 0.229
NB | 0.187 | 0.180
KNN | 0.186 | 0.204
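The reported metric is the plain mean absolute error between the predicted and actual next-month active-user ratios; with hypothetical values:

```python
from sklearn.metrics import mean_absolute_error

# Hypothetical predicted vs. actual next-month active-user ratios.
actual = [0.50, 0.80, 0.30, 1.00]
predicted = [0.60, 0.70, 0.50, 0.90]
mae = mean_absolute_error(actual, predicted)  # mean of the absolute errors
```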

The two methodologies yielded a range of values, shown in the histograms in
Figures 10 and 11. The distributions of the features derived from the two methodologies
appear to be fairly separable. This provides an indication that the models will be
able to discern churners from non churners with the features generated from these
methodologies.

Figure 10
Method 1 (Binary) Histograms

The customer level models achieved relatively high performance scores
with the 9 features available. The separate addition of the proposed features did not
improve these existing models, as they relied on better features to make their predic-
tions. The results shown in Table 8 are thus the performance of each algorithm trained
and evaluated using only the proposed features individually. Each model uses the
proposed feature generated by its user level counterpart. Differences in performance of
the customer models are thus due to differences in prediction ability of the user level
models and algorithmic performance on the customer level.
Hyperparameter tuning was performed per method to achieve comparable perfor-
mance to the original models, as initial results showed that the original tuning did not
produce optimal results with the proposed features.


Figure 11
Method 2 (Probabilistic) Histograms

Table 8
Performance of each algorithm trained and evaluated using only the proposed features.

Method | Algorithm | Accuracy | Precision | Recall | F1 Score
Method 1 | DT | 0.800 | 0.750 | 0.500 | 0.600
Method 1 | RF | 0.650 | 0.462 | 1.000 | 0.632
Method 1 | LR | 0.750 | 0.556 | 0.833 | 0.667
Method 1 | SVM | 0.600 | 0.333 | 0.333 | 0.333
Method 1 | NB | 0.850 | 0.667 | 1.000 | 0.800
Method 1 | KNN | 0.750 | 0.545 | 1.000 | 0.706
Method 2 | DT | 0.750 | 0.556 | 0.833 | 0.667
Method 2 | RF | 0.750 | 0.556 | 0.833 | 0.667
Method 2 | LR | 0.750 | 0.556 | 0.833 | 0.667
Method 2 | SVM | 0.850 | 0.667 | 1.000 | 0.800
Method 2 | NB | 0.850 | 0.667 | 1.000 | 0.800
Method 2 | KNN | 0.800 | 0.600 | 1.000 | 0.750

The results show that both methodologies are able to support most customer models
in discerning customers that will churn from those that won't. Method 2 outperforms
Method 1 by a small margin for most algorithms. Notably, NB is the best performing
algorithm for both methods, with F1 scores of 0.8 and perfect recall. This is interesting,
as NB was the worst performing algorithm at the user level in terms of F1 score (0.264)
and the best in terms of precision (0.381).
Performances of the customer feature Social Impact, the features generated by the
two proposed methodologies, and the next month's Social Impact are shown in Figure
12. The results show that the proposed methodologies perform comparably to, yet slightly
worse than, the feature obtained by the mapping scheme proposed by Figalist et al. (2019).


As the models do not provide a uniform result about which method performs better,
an analysis of the underlying data is performed. The separability of the labels is tested
by evaluating the ROC curve and AUC of a simple linear classifier¹ in Figure 13. The
feature next month's Social Impact is more separable than the current month's Social
Impact feature, as it has a higher AUC value.
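The "simple linear classifier" amounts to thresholding the raw feature value; scikit-learn's `roc_curve` performs exactly this sweep. A sketch with hypothetical Social Impact values:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

rng = np.random.default_rng(0)
# Hypothetical Social Impact values: renewing customers (label 1) tend to
# have higher values than churning customers (label 0).
y = np.array([1] * 14 + [0] * 6)
scores = np.concatenate([rng.uniform(0.5, 1.0, 14), rng.uniform(0.1, 0.7, 6)])

# Using the raw feature value as the score sweeps a decision threshold
# over the whole feature range, yielding an FPR/TPR pair at every point.
fpr, tpr, thresholds = roc_curve(y, scores)
auc = roc_auc_score(y, scores)
```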

Figure 12
Performances of the Customer feature Social Impact, the two proposed methodologies and the
customer feature Social Impact of the following month for the selected algorithms.

6. Discussion

The results obtained regarding the three research questions are discussed in sections
6.1, 6.2 and 6.3 respectively.

6.1 Evaluating Algorithmic Performance

The performance of the user level models is reflective of the real world nature of the
data set. With large overlaps in distributions, this classification task included instances
which were impossible to label correctly without overfitting. At the customer level the
few instances meant that overlaps in distributions were rarer and could be avoided by
the higher dimensionality of the feature space. With the majority of customer models
achieving near perfect results, this thesis thus suffers from the common limitation in the
B2B literature that the sample size is very small.

6.1.1 Hyperparameter Tuning. Identifying which hyperparameters result in the best
performance proved rather difficult in the case of the customer level models. The
very limited amount of data made it possible for multiple different parameter combi-
nations to achieve the same performance. The choice of hyperparameters was improved
by rerunning the cross validations with different random states; however, this brought

¹ The classifier reports the false positive rate and true positive rate at every point of the feature range.

Figure 13
ROC curve and AUC values of the Simple linear classifier

marginal returns. Thus, the choice of hyperparameters is based on an interpretation
of the general trend in the surrounding performance landscape rather than an optimal
solution. It is thus advised that, if more instances are added to the customer dataset, the
hyperparameters be re-tuned.

6.1.2 Precision and Recall Trade Off for Varying Upsampling Ratios. Varying the ratio
of the minority class to the majority class with SMOTE results in different learning
behaviors in the algorithms. As the class balance shifts, the attention given to the
minority class changes, as it becomes increasingly important to classify it correctly in order
to minimise loss. As the distributions of the user level features grouped by the targets
have a large overlap (Figure 2), a trade-off between precision and recall is inevitable.
Interestingly, the scale of this trade-off is different for every algorithm. Some are
more sensitive than others. Figure 14 shows the precision-recall trade-off for a defined
range of SMOTE ratios on the user data (0.2 to 1.0 in steps of 0.05) and the customer data
(0.6 to 1.0 in steps of 0.05).
Figure 14 shows that at the user level some models are already adapted to the
imbalanced data. NB does not benefit much from upsampling the data, as it already exhibits
relatively high precision values. LR and KNN appear the most susceptible to class
balance changes, which leads to a greater flexibility in prediction. This flexibility appears
to have greatly benefited the performance of the LR.
The ability of the models to capitalise on the precision-recall trade-off is visualized
in Figure 15. The user level LR model was best able to make use of upsampled data and
achieved a maximum F1 score of 0.38, while attaining an accuracy of around 71%. The
picture is less clear for the customer level. Increases in precision were not accompanied


Figure 14
Precision Recall curve per algorithm

by reduced recall. Due to the small sample size it was possible for the algorithms to
achieve high scores in both.

Figure 15

6.2 Evaluating Feature Performance

The feature performances at the user level did not align with expectations from examining
the histograms. The feature Basic Events appeared to be a good candidate; however,
the PFI and single feature performance showed that the majority of models did not
suffer when it was randomly shuffled, and only the DT and RF were able to make use of
the feature in the individual feature evaluation. The most used feature was instead Active
Days. Intuitively this makes sense, as the number of days that a user is active should
correlate with their choice to churn. Users who do not use the software often probably
do not see the value they can gain and are more likely to churn. The RFM model would
predict exactly such a relationship. Conversely, the feature Analytics Events, which
measured the amount of advanced analytical features used by a user, was not used,
as many negative as well as positive instances had zero values. As the use of analytical
features is not necessary to derive value from the software, it is a weak predictor of
churn. This trend among models was found at both the user and the customer level.
As noted in section 5.2, the two levels of analysis do not share the same top per-
forming features. At the user level Active Days performed best, while at the customer
level the feature Social Impact outperformed the rest. The important consideration to
make here is that the contexts of user churn and customer churn are different. Customer
churn holds far greater consequences that, if anticipated, will preemptively impact user
usage behavior. The feature Social Impact might be signaling the users' beliefs about
their management's decision to churn. It is more likely that the users' collective attitude
towards the product influences management's decision to churn or not (24). The feature
Active Days, on the other hand, signals the retained interest of the users in the software.
Users who return more frequently are seen to derive higher value, which in turn should
lower the risk of customer churn if this value is communicated to the decision maker.

6.2.1 Evaluating Accumulated Local Effects and Partial Dependence Plots. Although
PFI allows for the evaluation of performance decreases when a single feature is ran-
domly reshuffled, it says little about how the model uses the feature to make predic-
tions. In order to understand how a feature contributes to the final verdict, one can
use ALE and PDP plots. Due to the small sample size of the customer data, ALE plots
became unstable in large regions of the feature space, and thus PDPs are used. The
algorithms used to generate these plots are the respective best algorithms per level (LR
for the user level and NB for the customer level).
The ALE plots of the user features are shown in Figure 16. Monte Carlo simulations
(50 runs) were used to test the robustness of the results. Unlike what one might assume from
the PFI, the LR made use of all of the features to some extent to make its predictions.
The feature Basic Events shows a clear trend of decreasing relative churn prediction
over the range of the feature. An explanation for this is that the more users use the software
and the more events they log over the month, the less likely they are to churn the fol-
lowing month. The feature Analytics Events appears to have the opposite relationship,
which has no theoretical backing. The algorithm has learned to predict users who
use analytical features as more likely to churn the following month. The Monte Carlo
simulations show that this relationship is far less robust, but it retains the same sign in all
simulations. A potential explanation is that users who accumulate a large amount
of analytical feature usage do so out of frustration. Users who do not manage to “make
the software work” after many attempts might be more likely to stop using it in the fol-
lowing month. Lower usage levels of analytical features would thus indicate competent
usage of the software. This interpretation is supported by the trend of the feature Unique
Events, which captures the unique count of features used. Unique Events correlates to
some extent with Analytics Events, as users need to also use analytical features to attain
a high Unique Events count. The y intercept of the ALE plot indicates that users who
use more than 5 unique features are less likely to churn.
The temporal features Active Days, Recency and Length all follow practical intu-
ition. Users who are active fewer than 4 times a month (once a week) are more likely to
churn the following month. The more recent the latest day of activity was, the less likely
a user is to churn the following month. As ALE plots visualize local effects and thus
account for the conditional probabilities, this effect is seen even beyond the obvious
correlation with Active Days. The plot thus shows that regular usage, as opposed to
short bursts, indicates healthy usage behavior. The trend in the feature Length supports
this idea, as users who have a longer history with the software are less likely to churn.


Figure 16
ALE plots for LR user level predictions on training data

The final feature, Social Impact, has a positive trend, which is counter to the trend
proposed by the theory from which it was operationalized. Theory states that intentions
to churn spread through a community and incite other members to churn as well.
An explanation for the opposite trend is that when the proportion of active users is
high, the only possible change is downward. Practical backing comes from an interpretation
by the client. All users join the initial training sessions during onboarding; however,
the largest drop-off of users occurs after these sessions. The users who continue using the
software after the initial training tend to see the value in the software and are less likely
to churn. Lower Social Impact scores at the user level thus indicate membership of a
core of power users. This idea is supported when looking at cohort attrition. Figure 17
visualises the attrition of each monthly user cohort over time. Initial drop-off is highest,
but cohort sizes level off after a few months. A stable base of users forms, all
characterised by low Social Impact scores.
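A cohort attrition view of this kind can, in principle, be built with a simple group-by over an activity log. The toy example below is only a sketch: the column names (`user`, `month`) are assumptions and not the schema of the TrendMiner data.

```python
import pandas as pd

# Hypothetical activity log: one row per user per month of activity
log = pd.DataFrame({
    "user": ["a", "a", "a", "b", "b", "c", "d", "d"],
    "month": pd.PeriodIndex(
        ["2020-01", "2020-02", "2020-03", "2020-01", "2020-02",
         "2020-02", "2020-02", "2020-03"], freq="M"),
})

# A user's cohort is their first active month
log["cohort"] = log.groupby("user")["month"].transform("min")

# Active users per cohort per month; rows shrink as cohorts attrite
cohorts = (log.groupby(["cohort", "month"])["user"]
           .nunique().unstack(fill_value=0))
```

Reading each row left to right then gives the attrition curve of one monthly cohort.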
The customer PDP plots are generated using NB. This algorithm was chosen for
its interpretable plots. The aggregated user features are not guaranteed to follow the
same trends as their user level counterparts. For instance, Analytics Events seems to
have little to no impact on the prediction. Following the interpretation of the user level
feature, the frustrations of an individual are possibly not enough to sway the opinion of
the team or the decision maker if others derive value from the software.
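Partial dependence for a single feature can be sketched by overwriting that feature with grid values for every row and averaging the model's predicted churn probability. The snippet below is an illustrative reimplementation on synthetic data, not the thesis's pipeline; the synthetic labels are constructed so that low values of the first feature indicate churn.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

def partial_dependence_1d(model, X, feature, grid):
    """PDP: overwrite one feature with each grid value for all rows
    and average the predicted churn probability."""
    values = []
    for v in grid:
        X_mod = X.copy()
        X_mod[:, feature] = v
        values.append(model.predict_proba(X_mod)[:, 1].mean())
    return np.array(values)

# Synthetic stand-in: churn (label 1) is driven by low values of feature 0
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 3))
y = (X[:, 0] + 0.1 * rng.normal(size=400) < 0).astype(int)

model = GaussianNB().fit(X, y)
grid = np.linspace(-2, 2, 9)
pdp = partial_dependence_1d(model, X, feature=0, grid=grid)
```

Unlike ALE, this averages over the marginal distribution of the other features, which is why PDP and ALE curves for correlated features can disagree.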
Data Science & Society 2020

Figure 17
Attrition of monthly user cohorts

Figure 18
Partial dependence plots for customer level Naive Bayes model

The feature Recency also follows the opposite trend at the customer level. It is
possible that a low average recency indicates coordinated usage at the end of the
month, which could be a sign of teams cleaning up the digital workspace before
terminating its use. This idea is supported by Figure 19. In the previous month the
original trend holds, indicating that users of customers who will churn spend larger
amounts of time away from the software than the users of renewing customers. In the
month of decision making, however, the trend reverses and the users of churning
customers become active within the week before the decision date.
Most notably, the interpretation of the Social Impact feature changes drastically
from the user to the customer level. At the customer level, the lower the Social Impact
score the higher the chance of churn with a threshold value of around 0.6. Although


Figure 19
Distribution of Customer Recency Feature distribution by label and Month from Prediction

individual users appear unaffected by the decision of their colleagues, at the customer
level it is an indication of how much value is being derived by the team from the
purchased license. When few users are using the license the customer is deriving less
value than they are paying for. This concept is intended to be captured in the feature
User Cost; however, its impact on the prediction is negligible.
The strongest change in prediction outcomes is observed in the feature Contracted
User Cost. As Contracted User Cost increases churn probability goes to zero. There are
two interpretations for these results: 1) Customers who understand the value of the
software are willing to pay more and 2) Customers who spent more are less willing to
give up on the investment (sunk cost fallacy). Also notable is the relationship
between Contracted User Cost and the number of estimated user licences, which
follows a general 1/x pattern (Figure 20). Notably, the three customers who
seem to have underpaid for the licences all churned. This supports the idea that a
customer's willingness to pay for the licence is an indication of their assessment of the
software's value. It is no guarantee, though, as three other customers who followed the
price relationship also churned; other factors naturally influence customer churn as well.

6.3 Evaluating Proposed Methodologies

The results of the methodologies showed that some algorithms were better suited to
predicting the next month's Social Impact than others. NB stood out by achieving among
the lowest MAEs of all algorithms. NB stands out in Figure 21 as well, and it becomes
clearer why it performed well. Its probabilistic predictions are very close
to its binary predictions, showing that the algorithm appears fairly certain about its
predictions. Due to its high precision and low recall, most predictions fall into the
negative class. This conservative approach to churn prediction appears to have paid off.
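The closeness of probabilistic to binary predictions can be quantified as the mean absolute gap between each predicted probability and the class it rounds to. The sketch below uses hypothetical prediction vectors, not the actual model outputs; the metric name is an illustrative choice.

```python
import numpy as np

def prediction_sharpness(proba):
    """Mean absolute gap between probabilistic predictions and the
    binary class each one rounds to; 0 means maximally certain output."""
    binary = (proba >= 0.5).astype(float)
    return float(np.mean(np.abs(proba - binary)))

# Hypothetical outputs: a confident model versus a hedging one
confident = np.array([0.02, 0.05, 0.97, 0.01, 0.97])
hedging = np.array([0.45, 0.55, 0.60, 0.40, 0.52])
```

A low value of this gap corresponds to the NB histogram shape described above, where probability mass concentrates near 0 and 1.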
From Figure 21 we can also see that the DT and RF do not lend themselves well
to probabilistic predictions. The cost complexity pruning most likely left few branches
remaining, which explains the large spikes in the respective histograms. This
phenomenon left the two models as the worst performers when using Method 2 features,
alongside, surprisingly, LR. Although LR had achieved the highest F1 score at the user
level, it performed relatively poorly when using its own predictions at the customer
level. As the model with the second highest precision, this comes as a surprise.

Figure 20
Relationship between Contracted User Cost and Estimated User Licenses (churned customers in yellow)

Figure 21
Histograms of user level binary and probabilistic churn predictions per algorithm.

6.3.1 Comparison of Methodologies. As the target value of the predictions is known,
the error of the predictions with respect to the target is visualized in Figure 22. No
differentiation was made between the algorithms in this representation. A simple linear
regression was fitted to the results. The predictions of both methods lie in similar
regions. At the low end of the target the models overpredict the following month's
proportion of active users. The range of this error is smaller than at the high end of
the feature. This can be explained by the fact that the proposed methodologies are
not capable of considering returning users. As the model architecture only considers
one month at a time, there is simply no data from which to extrapolate a returning
user. Implementing a sensible return estimation would benefit the performance of the
proposed methodologies, as it would alleviate the large errors in the high ranges of the
predicted feature and reduce the overlap in distributions between the label groups.

Figure 22
Distribution of predictions over the range of the target feature (Churners in yellow and renewing
customers in black).

The results showed that Method 2 outperformed Method 1 by a small margin in
terms of F1 score in churn prediction and in terms of MAE with respect to the target
feature. Although the margins are small, the slight improvements in estimating the
feature one month into the future allowed the customer models to make better
predictions. It is plausible that Method 2 outperformed Method 1 due to the ability of
the algorithms to encode uncertainty in their predictions; however, Figure 23 shows
that the distribution of errors is fairly similar between the two methods. From the
results and the analysis of Figures 22 and 23, neither method is significantly better
than the other.

6.3.2 Criticism of the Proposed Methodologies. The proposed methodologies estimate
the proportion of active users in the month after the customer has made their decision
to renew or churn. The feature is inspired by the idea of communication between users
of the software, and the inclusion of user data in customer churn prediction at all
relies on the idea that users and customers communicate as well. A point of criticism is
that the flow of information is bidirectional, and thus the customer's intention to
churn can be communicated to the users before it is officially decided upon. This way
information about the target leaks into the data, and the algorithms are modeling the
target with itself rather than with a useful predictor.
In order to address this point of criticism one must ensure that the customer's
decision to churn does not influence the user data. As there is no way of knowing this for
sure, one can approximate this one-way flow of communication by using the data from
the previous month. The assumption is that the customer will only have made up their
mind in the last month of the pilot phase, while users can choose to churn throughout
the entire process. As Figure 24 shows, the distributions of the Social Impact feature


Figure 23
Distribution of errors of the two proposed methods over the range of the target. Logit trend
approximated with grey dashed line. Feasible region for errors bounded by two grey lines.

remain well separable in both months before the decision moment. This shows that user
sentiment is established before the customer’s decision to churn and thus no leaking of
the target into the data is found.
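One way to make this separability check explicit is a two-sample test on the feature distributions per label in each month before the decision. The sketch below uses synthetic Social Impact samples and a Kolmogorov–Smirnov test; the month labels, sample sizes and distribution parameters are illustrative assumptions, not the thesis data.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(4)

# Synthetic Social Impact samples per label, two months and one month
# before the decision moment (stand-ins, not the thesis dataset)
renewing = {"t-2": rng.normal(0.70, 0.1, 80), "t-1": rng.normal(0.72, 0.1, 80)}
churning = {"t-2": rng.normal(0.45, 0.1, 80), "t-1": rng.normal(0.40, 0.1, 80)}

# If the label groups already separate well before the decision month,
# the signal precedes the decision and is unlikely to be target leakage
pvalues = {m: ks_2samp(renewing[m], churning[m]).pvalue
           for m in ("t-2", "t-1")}
```

Small p-values in both earlier months would support the conclusion drawn from Figure 24 that user sentiment is established before the customer's decision.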

Figure 24
Distribution of customer feature Social Impact over time. Dashed line represents the moment of
decision making.


6.4 Impact for Academia

This thesis builds upon the recent work conducted by Figalist et al. (2019) of mapping
user and customer data together via a mapping scheme, by using user data to make
predictions about a customer level feature one month into the future. The idea proved
to have face validity, as it achieved comparable results to the mapping scheme
presented by Figalist et al. (2019). By making use of the full granularity of the user data
to encode knowledge of usage behavior into a customer level feature, the algorithms
were able to classify customers to an adequate degree.
Future research into this idea should focus on generalising the approach to other
features besides the proportion of active users. The proposed methodologies should be
formalised into general hybrid models consisting of two stages: 1) User level models
predicting user behavior into the future followed by aggregation to the customer level,
and 2) Customer level models predicting customer churn with the addition of the
predicted features.
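The two-stage structure described above can be sketched end to end. Everything below is synthetic and illustrative: the column names, the user activity labels and the median-based churn label are assumptions standing in for the thesis's actual features and targets.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)

# Stage 1: a user level model predicts next-month activity per user
users = pd.DataFrame({
    "customer_id": rng.integers(0, 20, size=300),
    "active_days": rng.integers(0, 31, size=300),
})
# Illustrative label: users active more than 4 days stay active
users["active_next_month"] = (users["active_days"] > 4).astype(int)

stage1 = LogisticRegression().fit(
    users[["active_days"]], users["active_next_month"])
users["p_active"] = stage1.predict_proba(users[["active_days"]])[:, 1]

# Aggregate probabilistic user predictions into a predicted customer
# level feature (next month's proportion of active users)
customers = (users.groupby("customer_id")["p_active"].mean()
             .rename("predicted_social_impact").reset_index())

# Stage 2: a customer level churn model uses the predicted feature.
# Illustrative churn label: below-median predicted activity churns.
median = customers["predicted_social_impact"].median()
customers["churn"] = (customers["predicted_social_impact"] < median).astype(int)
stage2 = LogisticRegression().fit(
    customers[["predicted_social_impact"]], customers["churn"])
```

Keeping the probabilistic output of stage 1 (rather than rounding to binary) is what lets the aggregation carry uncertainty forward, which mirrors the distinction between the two proposed methods.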
Besides exploring the proposed methodologies, this thesis contributes to the literature
by testing the performance of theoretical models in practice. The features proposed
by the CUSAMS model all contributed to the prediction of user churn. The
operationalization of breadth in terms of the unique number of features used outperformed the
Analytics Events operationalization. This showed that the competency of a user is better
understood as the range of features a user uses rather than the number of advanced
features used.
The RFM model encompasses the idea of habitual use. Frequent, recent and maintained
usage are signals that a user will remain active in the future. The decision
boundaries found with the RFM features support this idea and provide further evidence
for the importance of regular use for a healthy user base.
The adaptation of the social impact feature proved to be more useful for the customer
level prediction than the user level prediction. Nonetheless, it can be said that
the social impact of work colleagues differs from that of friends or family in other
consumer settings. The usage of a tool for work is far less subject to popular opinion
than a social media website. In instances where social impact scores were low, users
continued using the software due to the value it provided. Aggregating this idea to
the customer level changes the perspective. It is the objective of a manager to make
project decisions that maximise profits. The purchase of a tool such as TrendMiner is a
project decision. If too few users are deriving value from the tool then the profit of the
purchasing decision will become negative and the project is churned. The connection
between user and customer churn becomes very apparent here as it is not enough to
simply convince a manager of the value of the software but rather the needs of the end
users must be met in order for a pilot to be successful.
The largest limitation to the analysis of customer churn in this work was the small
sample size. Although intuitive relationships were found that can be backed by industry
experience, the confidence of the results remains low. New data points will have large
impacts on how the customer models will find their decision boundaries. This problem
is fundamental to the B2B context as customers are large and few. Introducing more
data into the customer churn process through user level data is thus a very desirable
goal.
Another limitation of the thesis is that although the hierarchical nature of the
customer-user relationship is acknowledged and explored, it is not considered in a
statistical sense in the modelling approach. Hierarchical modeling shows that
within-group and between-group effects influence the results of individuals. By performing
disaggregation on the customer level feature Social Impact and including it at the user
level, the models ignore the between-group variations and violate the assumption of
the independence of errors (Woltman et al. 2012). In essence, the association of a user to
their respective company was lost in the user level modeling stage. As company culture
plays an important role in employee performance and behavior (Parker et al. 2003),
between group variations are fairly likely. Studies in the B2C context have considered
the hierarchical effects, for example, between consumers and demographic variables
(Seo, Ranganathan, and Babad 2008), however, in the B2B context these effects are
yet to be explored. Considering these hierarchical elements would benefit prediction
performance.

6.5 Recommendations for Industry

The user models do not perform at a level recommended for practical use and should
thus serve primarily for understanding user churn. The same recommendation applies
to the customer level models due to the small sample size. Considering these
two limitations, there are two main takeaways for industry besides the insight into user
churn.

1. The proportion of active users is a very useful feature in estimating how
   much value a customer is deriving from the pilot program. The boxplots of
   the Social Impact distributions at different months before and after the
   decision date in Figure 24 showed that a threshold value of around 40% at
   1 to 2 months before the decision date is indicative of the customer not
   gaining enough value from the pilot program.
2. The amount a customer is willing to pay is indicative of how successful
   the pilot will be. Figure 20 showed that customers who underpay for the
   software licenses are far less likely to renew the contract. It is thus
   worth considering whether to start underpaid pilots that likely won't end
   in enterprise contracts.

7. Conclusion

This thesis explored the idea of predicting customer features one time step into the
future using user level predictions in an effort to improve customer churn prediction.
The idea was tested on one customer feature, namely the proportion of active users. The
prediction results showed that the proposed methodologies do not outperform current
methods of mapping user data to customer data. The main reason is the propagation
of errors from the user level models to the customer model. Allowing for uncertainty
in the user predictions improved performance marginally, meaning other ways need to
be found to improve user churn prediction. Results achieved by a theoretically perfect
user prediction, found by using the target feature, showed higher separability of labels
in the future than in the present, meaning that improving user level predictions
is a worthwhile effort.
The generalisability of these findings was improved by the use of six classical
machine learning algorithms. The best user level performance was attained by the Logistic
Regression classifier; however, the feature generated by the Naive Bayes predictions
allowed for the best customer level performance.


This thesis also operationalized seven variables from three models proposed in the
literature and evaluated them on the available data. Frequency from the RFM
model (Active Days) and Depth from the CUSAMS framework (Basic Events) were
identified via Permutation Feature Importance and Accumulated Local Effect plots
as the most useful features in the user level feature set. The Social Impact
variable from the SPA model in combination with the cost features proposed by the external
party led to the highest pairwise performance. An important insight derived was that
the features useful for user churn prediction and for customer churn prediction
were very different. The importance of the respective features reflected the context in
which each decision was made.


References
Bolton, Ruth N., Katherine N. Lemon, and Peter C. Verhoef. 2004. The Theoretical
Underpinnings of Customer Asset Management: A Framework and Propositions for Future
Research. Journal of the Academy of Marketing Science, 32(3):271–292.
Boser, Bernhard E., Isabelle M. Guyon, and Vladimir N. Vapnik. 1992. Training algorithm for
optimal margin classifiers. In Proceedings of the Fifth Annual ACM Workshop on Computational
Learning Theory, pages 144–152, Publ by ACM, New York, New York, USA.
Buckinx, W, B Baesens, D Van den Poel, P Van Kenhove, and J Vanthienen. 2002. Using Machine
Learning Techniques To Predict Defection Of Top Clients. WIT Transactions on Information and
Communication Technologies, 28.
Chang, Hui Chu and Hsiao Ping Tsai. 2011. Group RFM analysis as a novel framework to
discover better customer consumption behavior. Expert Systems with Applications,
38(12):14499–14513.
Chen, Kuanchin, Ya Han Hu, and Yi Cheng Hsieh. 2015. Predicting customer churn from
valuable B2B customers in the logistics industry: a case study. Information Systems and
e-Business Management, 13(3):475–494.
Chen, Zhen Yu, Zhi Ping Fan, and Minghe Sun. 2012. A hierarchical multiple kernel support
vector machine for customer churn prediction using longitudinal behavioral data. European
Journal of Operational Research, 223(2):461–472.
Coussement, Kristof and Koen W. De Bock. 2013. Customer churn prediction in the online
gambling industry: The beneficial effect of ensemble learning. Journal of Business Research,
66(9):1629–1636.
Defazio, Aaron, Francis Bach, and Simon Lacoste-Julien. 2014. SAGA: A fast incremental
gradient method with support for non-strongly convex composite objectives. In Advances in
Neural Information Processing Systems 27.
Duman, Ekrem, Yeliz Ekinci, and Aydin Tanriverdi. 2012. Comparing alternative classifiers for
database marketing: The case of imbalanced datasets. Expert Systems with Applications,
39(1):48–53.
Figalist, Iris, Christoph Elsner, Jan Bosch, and Helena Holmström Olsson. 2019. Customer churn
prediction in B2B contexts. In Lecture Notes in Business Information Processing, volume 370
LNBIP, pages 378–386, Springer.
Frank, Ben and Jeff Pittges. 2009. Analyzing Customer Churn in the Software as a Service (SaaS)
Industry. In Southeastern InfORMS Conference Proceedings, pages 481–488.
Govindarajan, M. and RM Chandrasekaran. 2010. Evaluation of k-Nearest Neighbor classifier
performance for direct marketing. Expert Systems with Applications, 37(1):253–258.
Haver, Jana Van. 2017. Benchmarking analytical techniques for churn modelling in a B2B context.
Ph.D. thesis, Universiteit Gent, Gent.
Jahromi, Ali Tamaddoni, Stanislav Stakhovych, and Michael Ewing. 2014. Managing B2B
customer churn, retention and profitability. Industrial Marketing Management, 43(7):1258–1268.
Mitov, Venelin and Manfred Claassen. 2013. A Fused Elastic Net Logistic Regression Model for
Multi-Task Binary Classification.
Parker, Christopher P., Boris B. Baltes, Scott A. Young, Joseph W. Huff, Robert A. Altmann,
Heather A. Lacost, and Joanne E. Roberts. 2003. Relationships between psychological climate
perceptions and work outcomes: A meta-analytic review. Journal of Organizational Behavior,
24(4):389–416.
Pedregosa, Fabian, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion,
Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg,
Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot,
and Édouard Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine
Learning Research, 12:2825–2830.
Rauyruen, Papassapa and Kenneth E. Miller. 2007. Relationship quality as a predictor of B2B
customer loyalty. Journal of Business Research, 60(1):21–31.
Richards, Keith A. and Eli Jones. 2008. Customer relationship management: Finding value
drivers. Industrial Marketing Management, 37(2):120–130.
Seo, Dong Back, C. Ranganathan, and Yair Babad. 2008. Two-level model of customer retention
in the US mobile telecommunications service market. Telecommunications Policy,
32(3-4):182–196.
Steinberg, Dan. 2009. The Top Ten Algorithms in Data Mining. Taylor & Francis Group, LLC.


Tsai, Chih Fong and Yu Hsin Lu. 2009. Customer churn prediction by hybrid neural networks.
Expert Systems with Applications, 36(10):12547–12553.
Tsallis, Constantino. 1988. Possible generalization of Boltzmann-Gibbs statistics. Journal of
Statistical Physics, 52(1-2):479–487.
Van den Poel, Dirk and Bart Larivière. 2004. Customer attrition analysis for financial services
using proportional hazard models. In European Journal of Operational Research, volume 157,
pages 196–217, North-Holland.
v. Wangenheim, Florian, Nancy V. Wünderlich, and Jan H. Schumann. 2017. Renew or cancel?
Drivers of customer renewal decisions for IT-based service contracts. Journal of Business
Research, 79:181–188.
Wei, Jo Ting, Shih Yen Lin, Chih Chien Weng, and Hsin Hung Wu. 2012. A case study of
applying LRFM model in market segmentation of a children’s dental clinic. Expert Systems
with Applications, 39(5):5529–5533.
Wiersema, Fred. 2013. The B2B Agenda: The current state of B2B marketing and a look ahead.
Woltman, Heather, Andrea Feldstain, J Christine Mackay, and Meredith Rocchi. 2012. An
introduction to hierarchical linear modeling. Technical Report 1.
Yeh, I. Cheng, King Jang Yang, and Tao Ming Ting. 2009. Knowledge discovery on RFM model
using Bernoulli sequence. Expert Systems with Applications, 36(3 PART 2):5866–5871.
Zhu, Bing, Bart Baesens, Aimée Backiel, and Seppe K L M Vanden Broucke. 2018. Benchmarking
sampling techniques for imbalance learning in churn prediction. Journal of the Operational
Research Society, 69(1):49–65.


Figure 25
Bhattacharyya Distances for User Level Data

Figure 25 shows the distributions of the user level features alongside the distribution
of data generated from a normal distribution with each feature's mean and standard
deviation. The Bhattacharyya distance is calculated for each feature. The feature Basic
Events has a very small distance and thus meets the normality assumption of the Gaussian
Naive Bayes algorithm very well. This assumption matters because the algorithm makes
decisions by inferring the likelihood of labels under a Gaussian distribution. The
features Active Days, Recency and Length appear to follow an exponential distribution,
and the feature Unique Events appears to follow a binomial distribution.
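The distance computation behind Figure 25 can be sketched as a histogram-based Bhattacharyya distance between a feature sample and a Gaussian reference fitted to its mean and standard deviation. The data below are synthetic stand-ins; bin count, sample sizes and distribution shapes are illustrative assumptions.

```python
import numpy as np

def bhattacharyya_distance(x, y, n_bins=30):
    """Histogram-based Bhattacharyya distance between two samples:
    D_B = -ln(sum_i sqrt(p_i * q_i)) over shared bins."""
    edges = np.histogram_bin_edges(np.concatenate([x, y]), bins=n_bins)
    p, _ = np.histogram(x, bins=edges)
    q, _ = np.histogram(y, bins=edges)
    p = p / p.sum()                       # normalize counts to probabilities
    q = q / q.sum()
    bc = np.sum(np.sqrt(p * q))           # Bhattacharyya coefficient
    return -np.log(bc)

rng = np.random.default_rng(3)
feature = rng.normal(5, 2, size=2000)          # near-Gaussian feature
# Gaussian reference with the feature's own mean and standard deviation
gauss_ref = rng.normal(feature.mean(), feature.std(), size=2000)
skewed = rng.exponential(2.0, size=2000)       # e.g. a Recency-like shape

d_normal = bhattacharyya_distance(feature, gauss_ref)
d_skewed = bhattacharyya_distance(feature, skewed)
```

A distance near zero, as for the near-Gaussian feature, indicates the normality assumption holds well, while clearly skewed features produce markedly larger distances.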

