
Debt Analytics: Proactive prediction of debtors in the

telecommunications industry

Ana Henriques Narciso

Thesis to obtain the Master of Science Degree in

Information Systems and Computer Engineering

Supervisors: Prof. Francisco António Chaves Saraiva de Melo


Prof. José Alberto Rodrigues Pereira Sardinha

Examination Committee

Chairperson: Prof. Miguel Nuno Dias Alves Pupo Correia

Supervisor: Prof. Francisco António Chaves Saraiva de Melo

Member of the Committee: Prof. Paulo Jorge Fernandes Carreira

November 2015

Abstract
Telecommunications businesses sometimes face new customers who subscribe to services with no real
intention of paying for them. This special class of fraudulent customers – never-payers – is responsible
for significant revenue losses, despite being a tiny subset of all subscribers. Besides the unpaid monthly
bills, additional resources are spent unnecessarily on service activation, CRM processes and
collection management.

This thesis was developed in collaboration with a telecommunications company whose goal is to predict
the never-payer population, consisting of post-paid customers who are never going to pay for the newly
subscribed services. The main challenge is to predict the outcome, even before the customer’s account
is activated. At that point, too little customer data is available for analysis. For those cases, the first
month of behaviour can be integrated to improve predictions.

The final platform is built on Microsoft BI stack tools, following the CRISP-DM methodology. The
integration module is capable of loading, cleaning and summarising large amounts of input data that
provide information about new customers. The analytical module then selects a specific set of relevant
attributes to train several predictive models. Those models were tested on new, unknown customers
to estimate the likelihood that they will never pay their debts. Ad-hoc exploration of the
input data and results is also possible using tools such as Excel, Power Pivot and Power View. The
solution was evaluated using data mining performance measures.

Keywords: Telecommunications, Fraud, Never-payer, Data Mining, Predictive Model, SQL Server

Resumo
Telecommunications companies sometimes face new customers who subscribe to services with no real
intention of paying for them. This special class of fraudulent customers – never-payers – is responsible
for significant revenue losses, despite being a tiny subset of all customers. Besides the unpaid monthly
bills, additional resources are spent unnecessarily during subscription, CRM processes and collection
management.

This thesis was developed in collaboration with a telecommunications company whose goal is to predict
the never-payer population, consisting of post-paid customers who will never pay for the newly
subscribed services. The main challenge is to predict the outcome even before the customer account
is activated. At that point, very little customer data is available for analysis. In those cases, the first
month of behaviour can be integrated in order to improve predictions.

The final platform is built using Microsoft BI tools, following the CRISP-DM methodology. The
integration module is responsible for loading, cleaning and summarising large amounts of data that
provide information about new customers. The analytical module then selects a specific set of relevant
attributes to train several predictive models. Those models were tested on new customers, computing
the probability that they will never pay their debts. Ad-hoc exploration of the input data and results is
also possible using tools such as Microsoft Excel, Power Pivot and Power View. The solution was
evaluated with performance metrics used in data mining.

Keywords: Telecommunications, Fraud, Never-Payers, Data Mining, Predictive Model, SQL Server

Acknowledgments
First, I have to thank my thesis supervisors, Professor Francisco Melo and Professor Alberto Sardinha.
I would like to thank you for your support and understanding over this past year. Most of my energy was
spent on my day job, but it was Rui Santos who always remembered to plan things and to make the
extra effort, and for that I am forever grateful. I would also like to thank Pedro Estanislau and Luís Batista
for putting up with my data cravings and the telecom company that made everything possible.

I thank my family for their patience and encouragement. I am also grateful to my partner, who has supported
me since my first year at this Institution and has taught me so much about perseverance and
hard work. Finally, my friends, for understanding my sudden absence in our social gatherings.

Table of Contents
Abstract ................................................................................................................................................. iii

Resumo................................................................................................................................................... v

Acknowledgments ............................................................................................................................... vii

Table of Contents ................................................................................................................................. ix

List of figures ........................................................................................................................................ xi

List of Tables ....................................................................................................................................... xiii

List of Acronyms ................................................................................................................................. xv

1. Introduction .................................................................................................................................. 17

1.1. Problem Definition ............................................................................................................... 17

1.2. Objectives ............................................................................................................................ 17

1.3. Document Outline ................................................................................................................ 18

2. Fraud Detection: Context and Related Work ............................................................................ 19

2.1. Telecommunications CRM ................................................................................................... 19

2.2. Fraud Detection ................................................................................................................... 21


2.2.1. Telecom Fraud ........................................................................................................ 21
2.2.2. Related Work .......................................................................................................... 24
3. A Telecom Company: A Case Study .......................................................................................... 29

3.1. Business Model ................................................................................................................... 29

3.2. Data Model .......................................................................................................................... 32

4. Solution ......................................................................................................................................... 41

4.1. Solution Overview ................................................................................................................ 41

4.2. Methodology ........................................................................................................................ 42


4.2.1. Business Understanding ......................................................................................... 42
4.2.2. Data Understanding and Preparation ..................................................................... 44
4.2.3. Modelling ................................................................................................................ 55
4.2.4. Deployment ............................................................................................................. 59
5. Validation and Results ................................................................................................................ 63

5.1. Validation Plan ..................................................................................................................... 63

5.2. Results ................................................................................................................................. 65

6. Conclusion ................................................................................................................................... 69

6.1. Contributions ....................................................................................................................... 69

6.2. Future Work ......................................................................................................................... 69

7. References.................................................................................................................................... 71

List of figures
Figure 1 – Diagram of the typical customer lifecycle. Adapted from [3]. ............................................... 19

Figure 2 – Diagram of the bad payer lifecycle. ...................................................................................... 20

Figure 3 – Global Telecom Fraud Loss in billions of US dollars. Data retrieved from CFCA surveys [9],
[11]–[15]. ................................................................................................................................................ 21

Figure 4 – Fraud Loss and Detection Risk. Adapted from [16]. ............................................................ 23

Figure 5 - Fraud management organisation – people, processes and technology [8]. ......................... 23

Figure 6 – Architecture of the patented system that collects customer data to compute the likelihood of
being a never-payer [17]........................................................................................................................ 25

Figure 7 – Flowchart of the patented system that calculates a never-pay score to determine the approval
of credit applications [17]. ...................................................................................................................... 25

Figure 8 – Diagram of the patented first-party fraud detection system [18]. ......................................... 26

Figure 9 – Business process (AS-IS) describing the customer lifecycle of a never-payer. ................... 30

Figure 10 - Diagram of the three main entities of the telecom company’s data model. ........................ 32

Figure 11 - Diagram of the entities available at subscription time. ........................................................ 34

Figure 12 – The Segment hierarchy. ..................................................................................................... 34

Figure 13 - Diagram of all the entities available before and after the customer account is activated. .. 36

Figure 14 - Entity-Relationship Diagram (Chen’s Database Notation) of the complete data model. .... 39

Figure 15 – Entity-Relationship Diagram (Crow’s Foot Notation) of the input data provided by the IT
department. ........................................................................................................................................... 40

Figure 16 – Application Architecture Overview ...................................................................................... 41

Figure 17 – CRISP-DM, the standard approach for developing data mining applications [20]. ............ 42

Figure 18 – Microsoft’s proposed data mining process methodology [22], heavily based on CRISP-DM
[20]. ........................................................................................................................................................ 42

Figure 19 – Overview of the SSIS package responsible for extracting data from flat files. .................. 45

Figure 20 – Data loading examples implemented by SSIS packages. ................................................. 46

Figure 21 - Entity-Relationship Diagram (Crow’s Foot Notation) of the input data after it was prepared.
............................................................................................................................................................... 47

Figure 22 – Entity-Relationship Diagram (adapted from Crow’s Foot Notation) of the Case Table
including the predictive attribute. ........................................................................................................... 49

Figure 23 – Never-payer distribution across the cities around Lisbon (Power Map) ............................ 51

Figure 24 – Plots comparing the content of pricing plans for consumers and businesses. .................. 53

Figure 25 – Plots comparing the type of service usage, for the NP0 and NP1 population. .................. 53

Figure 26 – Plots comparing the number of risk evaluations before activation and the latest score across
datasets. ................................................................................................................................................ 54

Figure 27 – Mining model for Consumers using oversampling and different algorithms. ..................... 57

Figure 28 – Example of the mining model viewer for a Naïve Bayes algorithm.................................... 57

Figure 29 – Example of the mining model viewer for a Microsoft Decision Tree algorithm. ................. 58

Figure 30 – Example of a lift chart for a hybrid sampling data mining model. ...................................... 58

Figure 31 – Data mining process orchestrated by Integration Services (SSIS). ................................... 59

Figure 32 – Business process (TO-BE) describing the customer lifecycle of a never-payer. ............... 62

List of Tables

Table 1 - Comparison of the related work described in this document. ................................................ 28

Table 2 – Client entity attributes. ........................................................................................................... 33

Table 3 – Account entity attributes. ........................................................................................................ 33

Table 4 - Service entity attributes. ......................................................................................................... 33

Table 5 – Risk Evaluation attributes. ..................................................................................................... 35

Table 6 – Postal entity attributes. .......................................................................................................... 35

Table 7 – Pricing Plan entity attributes. ................................................................................................. 36

Table 8 – Usage entity attributes. .......................................................................................................... 37

Table 9 – Billing table entity. .................................................................................................................. 37

Table 10 – Payment entity attributes. .................................................................................................... 38

Table 11 – Campaign entity attributes ................................................................................................... 38

Table 12 – Never-payer distribution across the cities around Lisbon (table view) ................................ 51

Table 13 – Never-payer distribution across all Portuguese districts. .................................................... 52

Table 14 – Correlation analysis between usage attributes. ................................................................... 54

Table 15 – Confusion matrix for the never-payer classifier. ................................................................... 63

Table 16 – Training and testing set for Consumer and Business datasets. .......................................... 64

Table 17 – Validation results for all combinations of segments, algorithms, sampling strategies and data
types. ..................................................................................................................................................... 68

List of Acronyms

BI Business Intelligence

CDR Call Detail Record

CRM Customer Relationship Management

DM Data Mining

DW Data Warehouse

ERP Enterprise Resource Planning

ETL Extract, Transform, Load

GSM Global System for Mobile Communications

IT Information Technology

KPI Key Performance Indicator

M2M Machine to Machine Communications

MBB Mobile Broadband

MSSQL Microsoft SQL Server

MVNO Mobile Virtual Network Operator

NP Never-Payer

NP0 A customer who is not a never-payer

NP1 A customer who is a never-payer

SME Small and Medium Enterprises

SOHO Small Office Home Office

SQL Structured Query Language

SSAS (Microsoft) SQL Server Analysis Services

SSIS (Microsoft) SQL Server Integration Services

1. Introduction

1.1. Problem Definition


Every business tries to minimise the set of customers who do not make their payments on time. This
debt is not always easy to recover, depending on the willingness and financial capacity of the delinquent
customers, as well as many other factors. Organisations have to spend considerable money and resources to
recover bad payments. One of the strategies to avoid this kind of risk is to intervene proactively, before
a customer runs into debt.

Telecommunications businesses sometimes face new customers who subscribe to services with no
intention of paying for them – a form of subscription fraud. These so-called never-payers are responsible
for the loss of large amounts each month, not only because of the bills that will never be paid, but also
because of the costs and resources associated with activating the subscribed services for the fraudulent customer.

The main challenge of this work is to predict this particular class of post-paid customers who are never
going to pay for the newly subscribed services - the so-called never-payers.

This system was developed in collaboration with a telecommunications company that needed to identify
the typical profile of a risky customer and to determine, for every new potential customer, the
probability of being a never-payer subscriber. Although these fraudulent customers are a very small
fraction of the thousands of new subscriptions every month, they represent
substantial losses that could be avoided or, at least, mitigated.

1.2. Objectives
This work will perform a preliminary analysis of the data provided by the data warehouse of a
telecommunications company, such as customer data, pricing plans, risk evaluations, usage and billing
history, as well as other common reference attributes and historical data typical of the telecom industry.
This exploratory analysis will help identify the set of attributes that best profiles a never-payer customer
and define a predictive model. Then, supervised mining models will be capable of profiling past debtors
and learning their common characteristics, enhancing the proactive detection of potential debtors.

State-of-the-art systems rely on both customer characteristics and behaviour. Only two patented systems
rely solely on customer characteristics instead of behaviour, an approach that is much more difficult and prone to
prediction errors. For instance, it becomes very challenging to judge whether a new customer is risky
based purely on demographic attributes and with no past behaviour. This system tries to rely only on
the customer attributes available upon acquisition, minimising the behavioural data needed to predict the outcome.

The final solution comprises a platform built entirely on Microsoft BI stack tools. Integration components
are capable of loading, cleaning and summarising large amounts of input data that provide information

about new customers. Then, analytical components select a specific set of relevant attributes to train
several predictive models that are responsible for scoring unknown customers and deciding whether
they are likely to never pay their debts. Ad-hoc exploration of the data and results is also
possible using tools such as Excel, Power Pivot and Power BI.

1.3. Document Outline


The rest of this thesis is organised as follows:

Chapter 2 presents the most relevant concepts that are important for understanding this work. It
describes the generic relationship between the customer and a telecommunications company – the
customer lifecycle. Then, it explains how fraudulent customers affect telecom companies, pointing out
the most relevant fraud detection systems for this thesis.

Chapter 3 introduces the main subject of this thesis – a telecommunications company. This chapter
focuses on the business model that supports the customer lifecycle described in Chapter 2, and
highlights the challenges of this thesis. The data model supporting the business is also detailed.

Chapter 4 presents the implemented solution, beginning with an overview of the data mining system
and how components interact. Then, it explains all the development phases, from business
understanding, data understanding and preparation, to modelling.

Chapter 5 describes the datasets used in the validation phase, as well as the validation methodology
employed in the different experiments. Additionally, it discusses the results obtained in each strategy.

Finally, Chapter 6 presents the main conclusions of this work and provides an overview of all
contributions of this thesis. It also presents a discussion of future directions that this thesis can point to.

2. Fraud Detection: Context and
Related Work
This chapter introduces the context of this thesis starting with an overview of the telecommunications
industry. Section 2.1 details the general relationship between the company and customer, identifying
several challenges and opportunities. Section 2.2 introduces fraud in the industry as well as fraud
detection systems.

2.1. Telecommunications CRM


CRM (Customer Relationship Management) is defined as the strategy for building, managing, and
strengthening loyal and long-lasting customer relationships [1]. Just like in any other industry, the CRM
process of a telecom company should be a customer-centric approach based on customer insight.

The term customer lifecycle refers to the various stages of the relationship between a customer and a
business. It is a framework for understanding customer behaviour and it becomes vital to comprehend
it because it directly relates to long-term customer value [2].

[Diagram: a Prospect becomes a New Customer upon agreement, an Established Customer upon activation, and a Former Customer upon churn; winback campaigns may bring former customers back. The Acquisition phase precedes Relationship Management.]

Figure 1 – Diagram of the typical customer lifecycle. Adapted from [3].

As Figure 1 shows, the relationship begins with a prospect, who is a person or company in the target
market but not yet a customer. Upon agreement and after providing personal information, a contract is signed,
and the prospect becomes a new customer. This customer becomes an established customer after interacting with
the company, for instance, when a service is activated and used.

The relationship management phase should be the longest and most profitable phase in this lifecycle, in which
the customer pays for the subscribed services, may be the target of a campaign and may even make a
complaint. However, sometimes the customer churns, becoming a former customer who could be
won back through some campaign. Churn is a term used in the telecommunications service industry to
denote the customer movement from one provider to another [4]. Churning may be voluntary or forced,
for instance, when a customer incurs bad debt.

This thesis focuses on understanding what enables a telecommunications company to predict the
likelihood that a customer will simply not pay his bills on time. Therefore, it is necessary to analyse
the lifecycle of a bad payer. Figure 2 depicts an instantiation of the customer lifecycle (Figure 1), but for
customers who run into debt.

[Diagram: bad payer lifecycle – a Prospect becomes a New Customer upon agreement and an Established Customer upon activation; when a bill becomes past due he turns into a Bad Payer, who either pays the debt or churns and becomes a Former Customer (possibly won back later). The phases are Acquisition, Intermediate and Recovery.]

Figure 2 – Diagram of the bad payer lifecycle.

It may be called the bad payer lifecycle, featuring a hypothetical telecom customer who signs a contract
and uses the subscribed services, but whose bills are past due most of the time. This cycle features three main
phases:

1. Acquisition, similar to the previous diagram. It represents the entry point for a prospect who
wants to subscribe to some service. At this moment, the prospect supplies the telecom company
with his information to draw up a contract and activate an account. The telecom company can
perform some kind of risk evaluation, such as a background check, to avoid potential bad payers.
2. Intermediate, during which the customer uses the subscribed services and bills are issued. It should be
lengthy and profitable, unless he stops paying and enters the next phase.
3. Recovery, when the bad payer endures a set of strategies carried out by the company in order
to recover the bad debt. He could, ultimately, churn and leave voluntarily, or even be forced to
churn because of his irrecoverable debt.

Many companies are increasingly using data mining techniques for CRM, which helps not only
address individual customers' needs [5], but also predict customer behaviour.

The key challenge of this thesis is to detect, at the earliest opportunity, the set of customers who will not
pay any bill after signing a contract (never-payers). Ideally, this would be feasible during Acquisition,
as soon as the company obtains some information about the prospect that matches a known
never-payer pattern. However, if the results are inconclusive, one may look into a short period of
behaviour during the Intermediate phase.

2.2. Fraud Detection

2.2.1. Telecom Fraud


Fraud, a dishonest attempt to convince an innocent party that a legitimate transaction is occurring when,
in fact, it is not [6], is as old as humanity itself [7]. In the last century, fraud matured in the area of
transactional businesses, such as the telecommunications industry [6]. Fraud in the telecom industry
means the use of telecommunications products or services without the intention of paying [8], [9].

Fraud incidents increased dramatically with the expansion of modern technology and the Internet,
leading to the loss of billions of dollars worldwide each year [7]. Fraud negatively impacts everyone,
especially customers, since fraud losses increase communications carriers’ operating costs [9].
Moreover, fraud can cause distress, loss of service and loss of customer confidence [10].

Current statistics, released by the Communications Fraud Control Association (CFCA) this year, point
to a global loss of about 38 billion (USD), which represents almost 2% of telecom revenues. According to
CFCA, subscription fraud has made it into the top five methods for committing fraud, adding up to more
than 6 billion (USD) of loss worldwide.

Figure 3 plots the estimated telecom fraud loss since 1999 [9], [11]–[15]. For ten years, fraud loss
increased continuously, peaking at $74 billion. The noticeable decrease in 2011 is attributed to growth
in global revenues outpacing the growth in fraud losses as well as improved anti-fraud programmes
implemented by operators and an increase in collaboration between professionals within the industry.

[Bar chart: Global Telecom Fraud Loss in USD billions; losses rise from 12 in 1999 (with intermediate values of 37.5 and 57.2) to a peak of 74 in 2009, then fall to 40.1 in 2011, 46.3 in 2013 and 38.1 in 2015.]

Figure 3 – Global Telecom Fraud Loss in billions of US dollars. Data retrieved from CFCA surveys [9], [11]–[15].

Telecom companies, customers and third parties may commit fraud against each other. This thesis
focuses on fraud perpetrated by customers against telecom firms. Fraud, particularly telecom fraud,
appears to be becoming more socially acceptable [10].

Telecommunications fraud is attractive to offenders for many reasons that have been challenging
telecom carriers [6], [10], [16]:

 Detection risk is low. The sheer volume of transactions increases the probability of fraud going
unnoticed because it is such a small proportion of the overall business. Moreover, the level of
punishment is relatively small.
 There are more telecom carriers every day. As more carriers are created, the amount of
intentional fraud increases. Bad payers can simply move between carriers to avoid credit
checks.
 No special equipment is required (usually). Many frauds do not require IT skills but exploit
business procedures, such as selling international calls or subscribing to a product with no
intention of paying.

Industry experts estimate that there are more than 200 types of fraud [8]. The nature of fraud committed
against telecom carriers can be classified into three broad categories [6]:

 Technical fraud. It involves attacks against loopholes in communications technology. It typically


needs initial technical knowledge and ability, though once a weakness has been exploited, it
can be quickly distributed in a form that non-technical people can use, for example, card cloning.
 Contractual fraud. A fraud that generates revenue through the proper use of a service while
having no intention of paying for it, for instance, subscription fraud and premium rate fraud.
 Procedural fraud. Attacks against the procedures implemented to minimise exposure to fraud or to
grant access to the system, for instance, roaming fraud and social engineering.

Some of the most common varieties of fraud in the telecom industry include subscription fraud, when
someone signs up for service (e.g., a new post-paid contract) with no intent to pay, and identity fraud,
when the fraudster masquerades as another customer making or selling calls on this account [6].
Besides exploiting technological loopholes, fraudsters can use social engineering, exploiting the human
interactions, for instance, pretending to be a phone repair person and gaining access to a customer’s
account. Furthermore, when telecom companies launch new services, fraudsters realise they could
purchase them at a low price and then resell them illegally at a higher price to consumers who were
unaware of the service. Even when companies implement regulations to promote fairness, this ends up
spawning new types of fraud.

Subscription-based relationships take the form of an ongoing billing relationship where customers have
agreed to pay for a service over time [3]. Subscription fraud violates the relationship agreement and can
happen in two different ways. On the one hand, the customer may be consciously fraudulent; in fact,
fraud will often masquerade as a usage management problem. On the other hand, a legitimate customer
account may be hacked by someone fraudulent. There is no sharp line between intent to pay and the
ability to pay [6].

Figure 4 displays the risk of detection as perceived by a potential fraudster plotted against loss resulting
from the activities of those fraudsters [16].

[Plot: perceived Risk of Detection (low to high) against Loss due to Fraud (low to high), following an exponential-decay curve; most organisations start at low detection risk and high loss, and moving up the curve through investigations implies increasing prevention costs.]

Figure 4 – Fraud Loss and Detection Risk. Adapted from [16].

The exponential decay is a realistic portrayal of reality, given that a considerable beneficial effect on loss
due to fraud can be achieved for a relatively small outlay, usually by improving fraud detection
processes.

It is important to differentiate between fraud prevention and fraud detection. Fraud prevention aims to
stop fraud from occurring in the first place, whereas fraud detection involves identifying fraud as quickly
as possible once it has been perpetrated [7]. Fraud detection comes into play once fraud prevention has
failed. Throughout this document, fraud detection is referred to in a broad sense, meaning identifying fraud
at the first opportunity, even if it has not yet happened, that is, prevention.

Fraud detection (or prevention) presents itself as a significant challenge for telecom companies
concerning the volumes of data involved [8]. Daily, a company with 5 million customers can generate
hundreds of millions of transaction records. Telecommunications fraud is not static; new techniques
evolve as businesses put up defences against existing ones [6]. Besides, fighting fraud is complicated
by the existence of multiple telecom carriers. In the game of fraud detection, when one telecom company
is better than its competitors at detecting and stopping fraud, the fraudsters are inclined to move to the
competition.

[Diagram: a fraud management organisation combines the right people with the right processes (deterrence, prevention, detection, mitigation, policy analysis) and the right technology (an enterprise fraud management solution).]

Figure 5 - Fraud management organisation – people, processes and technology [8].

One step towards beating fraudsters is to build a secure “golden database” [8]. That is, using the right
technology to construct a corporate data warehouse (DW), with adequate levels of security, which can
be used across multiple organisational departments, including CRM, fraud management, revenue
assurance and business intelligence. Another step involves prioritising the building of a fraud
management organisation that establishes key control points in the customer lifecycle and combines
the right processes, the right technology and the right people in the right places (Figure 5).

This thesis aims to build an enterprise fraud management solution that uses data mining techniques to
help the organisation prevent subscription fraud.

2.2.2. Related Work


This section presents a brief summary of the two most relevant implementations of fraud detection
systems in the context of this work. These systems share similarities with the system this thesis aims to
build, such as:

 Detecting forms of subscription fraud perpetrated by the customers, wherein fraudsters simply
do not pay their debts;
 The moment of detection should be as early as possible, even before the customer acquisition;
 Behavioural data is not accessible, only customer data and credit databases are used.

In the end, a table summarises the comparison between each fraud detection strategy.

System and method for automated detection of never-pay datasets

Celka and Rojas patented a method for the automated detection of never-pay datasets, known in the
credit services industry as credit rollers [17]. They define the never-pay population as those customers who
make a request for credit and obtain the credit instrument but, over the life of the account, never make a
payment. It is designed as a tool to help financial service providers know whether an applicant
is likely to never pay after obtaining the credit instrument.

The architecture of this system is shown in Figure 6 and comprises data sources and a never-pay
module that runs the detection algorithm. Data sources contain customer data such as credit bureau
(i.e. collection agencies) data, tradeline data, historical balance data, demographic data and additional
data provided by the customer when applying for credit. The module is configured to obtain customer
records from the sources and compare them with already-proven never-pay profiles. Then, a
likelihood score of matching a never-pay profile is calculated.

Figure 6 – Architecture of the patented system that collects customer data to compute the likelihood of being a
never-payer [17].

Figure 7 presents the overall flow of the patented system and shows that several predictive models can
be combined with each other to compute a final score. If this score is below a given threshold, the
application is approved. Conversely, if the score is higher than the threshold, the application is sent
for manual analysis.

Figure 7 – Flowchart of the patented system that calculates a never-pay score to determine the approval of credit
applications [17].
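
To make the decision flow of Figure 7 concrete, a minimal sketch is given below; the averaging of model outputs, the threshold value and all names are illustrative assumptions, since the patent does not specify them.

# Hedged sketch of the never-pay scoring flow of Figure 7. Averaging the model
# outputs and the 0.5 threshold are illustrative assumptions, not the patent's.
def never_pay_score(model_scores):
    """Combine the scores produced by several predictive models."""
    return sum(model_scores) / len(model_scores)

def route_application(model_scores, threshold=0.5):
    """Approve automatically below the threshold, otherwise send to manual analysis."""
    return "approved" if never_pay_score(model_scores) < threshold else "manual analysis"

print(route_application([0.12, 0.08, 0.20]))   # approved
print(route_application([0.70, 0.55, 0.81]))   # manual analysis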

The patent does not specify how the predictive model is built; that is, how the never-pay profiles are
created from actual credit rollers. It is suggested that profiles are composed of business rules that are
compared with prospects and output a likelihood score.

The authors claim this system is not limited to the financial services industry but is also fit for other
industries, including the telecommunications services industry.

First-party fraud detection system

Mahdi et al. published a patent on a method for detecting first-party fraud using a supervised model that
calculates a risk score for current applications for credit or goods, using identity information provided by
the consumer [18].

In first-party fraud, the fraudster uses his true identity to fill in an application for obtaining credit, goods, or
services, without the intention to fulfil payment obligations. In other words, first-party fraud does not
involve a stolen identity; the fraudster is willing to ruin his own credit to defraud the victim. The
telecommunications industry is also affected, for instance, because it offers heavily subsidised smartphones
to those who pass a credit check. Fraudsters see this as an opportunity to sign up for as many plans as
possible with as many carriers as possible in a short period of time, in order to get as many smartphones
as possible at a low price. Fraudsters present themselves indeed as themselves (i.e., the first party),
but they have no intention of paying for the goods as contractually obligated.

Since organisations attempt to check the identity of applicants and first-party fraudsters obviously satisfy
these criteria, first-party fraud is tough to detect and prevent. This type of fraud depends solely on the
will of the applicants, and whether they actually intend to pay after they get the credit.

Figure 8 shows a diagram of the system modules as well as the information flow describing the method
for predicting first-party fraud.

Figure 8 – Diagram of the patented first-party fraud detection system [18].

First, it receives the customer application containing consumer identity data such as social security
number (SSN), name, address, phone number, date of birth, and others. The search module is
responsible for matching the current application with prior individual applications provided by a historical
module, using linking keys. When identical, or similar, identity information is used frequently and in close
temporal proximity for the same or another commodity, this is evidence of first-party fraud.

Then, the generation module is responsible for producing markers (i.e. metrics) that are indicative of
first-party fraud based on the identity linking keys. Examples of such markers include the number of
applications in the last week/month/year linked by address, SSN, or phone, or the number of unique
emails used in the last week/month/year linked by the various identity elements described.
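
To illustrate the marker generation, the sketch below counts recent applications that share a linking key; the record layout and the seven-day window are illustrative assumptions rather than the patented implementation.

from datetime import date, timedelta

# Illustrative velocity markers linked by identity keys (SSN, address, phone).
# The application record layout and the 7-day window are assumptions.
history = [
    {"date": date(2015, 5, 1), "ssn": "123", "address": "A", "phone": "911"},
    {"date": date(2015, 5, 3), "ssn": "123", "address": "B", "phone": "911"},
    {"date": date(2015, 5, 4), "ssn": "999", "address": "A", "phone": "912"},
]

def marker(current, prior, key, window_days=7):
    """Number of prior applications in the window sharing the same linking key."""
    cutoff = current["date"] - timedelta(days=window_days)
    return sum(1 for app in prior
               if cutoff <= app["date"] < current["date"] and app[key] == current[key])

new_app = {"date": date(2015, 5, 5), "ssn": "123", "address": "A", "phone": "911"}
features = {key: marker(new_app, history, key) for key in ("ssn", "address", "phone")}
print(features)   # {'ssn': 2, 'address': 2, 'phone': 2}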

Finally, the predictive module computes a risk score based on the markers, wherein the risk score
represents the chance that the current application constitutes first-party fraud. This module uses standard
supervised machine learning algorithms built from labelled examples of previous fraud attempts. These
algorithms can include neural networks, support vector machines, boosted trees or regressions.

Summary

After detailing the most relevant state of the art regarding fraud detection systems, it pays to compare
them all, considering several dimensions that distinguish each system, such as their target market,
data sources and predictive model.

Table 1 presents this comparison, and some conclusions can be drawn from it.

Target market – Celka and Rojas 2008: credit services from finance, telecom, retail and other industries; Mahdi et al. 2014: generic (credit services).

Targeted fraudulent population – Celka and Rojas 2008: never-payers; Mahdi et al. 2014: first-party fraudsters.

Data sources – Celka and Rojas 2008: customer data, tradeline data (balance history), credit bureau scores, demographic data, public records; Mahdi et al. 2014: customer data, ID network, demographic data, public records.

Prediction timing – Celka and Rojas 2008: before account approval; Mahdi et al. 2014: before account approval.

Predictive model type – Celka and Rojas 2008: classification (supervised); Mahdi et al. 2014: classification (supervised).

Predictive algorithms – Celka and Rojas 2008: rule-based (?); Mahdi et al. 2014: neural networks, support vector machines, boosted trees or regressions.

Result – Celka and Rojas 2008: risk score of being fraudulent; Mahdi et al. 2014: risk score of being fraudulent.

Mitigation strategy – Celka and Rojas 2008: manual review; Mahdi et al. 2014: N/A.

Table 1 - Comparison of the related work described in this document.

The system for detecting never-payers in the telecommunications industry will likely include data
sources such as demographic data, data provided by the customer and also some historical data, similarly
to the systems described above. The source data will be used to train a classification algorithm to detect potential
never-payer customers, preferably before account approval. The final output will be the likelihood of not
paying any debts in the near future.
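
Purely as an illustration of this approach (the models in this thesis are built with SSAS rather than scikit-learn), the sketch below trains a classifier on acquisition-time attributes and outputs a never-payer likelihood; the column names, toy data and algorithm choice are assumptions.

# Illustrative only: a decision tree trained on acquisition-time attributes to
# output a never-payer likelihood. Column names, the toy data and the algorithm
# choice are assumptions; the thesis itself uses the Microsoft BI stack.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

train = pd.DataFrame({
    "segment":       [0, 0, 1, 1, 0, 1],   # 0 = Consumer, 1 = Business
    "risk_score":    [0, 2, 0, 1, 3, 0],   # latest risk evaluation score
    "n_evaluations": [1, 4, 1, 2, 5, 1],   # evaluations before activation
    "never_payer":   [0, 1, 0, 0, 1, 0],   # known outcome (NP1 = 1, NP0 = 0)
})

model = DecisionTreeClassifier(max_depth=3, random_state=0)
model.fit(train.drop(columns="never_payer"), train["never_payer"])

new_customer = pd.DataFrame({"segment": [0], "risk_score": [2], "n_evaluations": [3]})
print(model.predict_proba(new_customer)[:, 1])   # likelihood of being a never-payer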

3. A Telecom Company: A Case Study
The main subject of this study is a Portuguese telecommunications company, whose challenge is to
detect fraudulent customers who never pay their debts. Firstly, it is important to understand its business
model and all the steps the customer goes through, from the time he intends to sign a contract until he
runs into debt, that is, the customer lifecycle (Section 3.1). Secondly, to support the business model,
this company has implemented a DW-like data model that represents the input of this study and will also
be described (Section 3.2).

3.1. Business Model


The central entity of this business model is the Customer, who undergoes several phases during the
already mentioned customer lifecycle (Section 2.1). Considering the goal of this study is to identify, as
soon as possible, customers who will never pay their bills, it becomes pertinent to describe each lifecycle
phase. In addition to defining what makes a customer a never-payer, we need to pinpoint the precise
moment in time when we would have enough information to decide on his future.

The business process detailed in Figure 9 summarises all three customer lifecycle phases and their steps.
Each lane represents a distinct stakeholder, and the three phases Activation, Intermediate and Recovery
map directly onto the bad payer lifecycle.

[Flowchart with lanes for Customer, Activations, Risk Evaluation, Billing and Collection Management: the prospect fills in a sign-up form; his fiscal number triggers an automatic risk evaluation, possibly followed by a manual credit evaluation; if approved, a contract is drafted, accepted and the account activated, otherwise the prospect is rejected. After activation, services are used and an invoice is issued one bill cycle later; if the bill is not paid before the due date, a debt collection case is opened, notifications are sent at intervals that depend on the credit rating and, if the debt remains unpaid, the account is deactivated.]

Figure 9 – Business process (AS-IS) describing the customer lifecycle of a never-payer.

The prospective customer begins the Activation phase when he shows an interest in subscribing to a post-
paid service provided by the telecom company. After the prospect fills in a sign-up form that provides customer
information, the Activations Department is responsible for processing the new account information and
requesting a Risk Evaluation on the potential client.

The prospective customer undergoes a Risk Evaluation, comprising criteria that have to be fulfilled so
that he is considered an eligible client. The prospect’s fiscal number is supplied to evaluate each
eligibility criterion, which includes questions like the examples described below:

 Is the prospect a returning customer, and if so, did he leave any debts in the past?
 Is the prospect featured in the database of previous debtors shared between the major Portuguese
telecom operators? Debtors are automatically removed from this shared database if
certain conditions apply, for instance, if the debt amount is less than 20% of the national
minimum wage, if they are in an insolvency state, or if the debt has prescribed or has
been relieved.
 Is the prospect signalled for fraudulent or suspicious behaviour?

 Is the prospect contained in a white list?

The Risk Evaluation is an automated process, and if the results indicate that the client is eligible, its
results are then supplied to the activations’ assistant responsible for drafting the new contract.
Otherwise, the assistant can ask for a Manual Credit Evaluation that is performed by an activations’
specialist who will investigate historical data on the client (if available).

The activations’ specialist gathers all the historical information available on the potential client and informs the
activations’ assistant about the client’s eligibility. If the manual evaluation is positive, the assistant can
proceed to draft a new contract. On the other hand, a negative decision by the specialist can suggest
alternative methods for minimising the risk, such as the regularisation of the debt before entering into a
new contract. This could include methods such as the payment of a bond that is recurrently deducted from
subsequent invoices. Moreover, even if the evaluation is negative, the specialist can override the
decision, providing a meaningful comment. Sometimes the specialist does not have sufficient privileges
to override a negative eligibility; if so, he can delegate the decision upwards to his manager.

At this point, the prospect can either be rejected or approved. If he passes the risk evaluation or
liquidates the existing debt, the Activations Department drafts a new contract that is provided to the
prospect for acceptance. Once the contract is accepted, the Activations Department approves the activation
of a customer account that is associated with the post-paid contract.

When the contract is accepted by the customer and entered into, the account associated with the
contract is assigned to one of the following billing cycles:

 Cycle 1 – from the 1st day of the month to the last day of the same month.
 Cycle 9 – from the 9th day of the month to the 8th of the next month.
 Cycle 16 – from the 16th day of the month to the 15th of the next month.
 Cycle 23 – from the 23rd day of the month to the 22nd of the next month.

For instance, a customer who enters into a contract on the 3rd day of the month will be assigned to Cycle
9, that is, the cycle that immediately follows the activation date. In that case, the billing cycle
will be closed on the 9th and, depending on the billing process speed, an invoice will be issued
a few days afterwards, for example on the 13th (bill statement date). The bill due date is typically calculated by
adding at least 11 working days (the legally acceptable minimum), as the sketch below illustrates.
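
The sketch below illustrates the cycle assignment and due date calculation just described; the function names and the simplistic working-day arithmetic (which ignores public holidays) are assumptions.

from datetime import date, timedelta

CYCLE_DAYS = [1, 9, 16, 23]   # billing cycles used by the telecom company

def assign_bill_cycle(activation):
    """Return the cycle day that immediately follows the activation date."""
    for day in CYCLE_DAYS:
        if activation.day < day:
            return day
    return CYCLE_DAYS[0]   # wraps around to Cycle 1 of the next month

def add_working_days(start, working_days):
    """Naive working-day arithmetic: skips weekends but ignores holidays."""
    current = start
    while working_days > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:   # Monday to Friday
            working_days -= 1
    return current

activation = date(2015, 3, 3)
print(assign_bill_cycle(activation))          # 9
statement_date = date(2015, 3, 13)            # bill statement date
print(add_working_days(statement_date, 11))   # earliest legally acceptable due date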

After contract activation, the customer enters the Intermediate phase and begins to use the subscribed
services. The Billing system is responsible for issuing an invoice one bill cycle after the usage of the
subscribed services. If the first bill becomes past due and the customer does not make any payment,
he enters the Recovery phase, which is controlled by the Collection Management system. This system
automatically opens a debt collection case and, depending on the credit rating of the debtor, warnings
such as SMS alerts and letters will be sent at different times, and even the subscribed services may be
suspended (hotline). If the debt is not liquidated after a certain amount of time, the client account is
deactivated, becoming a never-payer.

At the end of recovery, if a client account is deactivated and none of the bills were liquidated, then it is
regarded as a never-payer. A never-payer is a client account that was deactivated for any given reason
and has not paid its bills, i.e. the billed amount equals the open amount (current and past due debt) –
this is the future we want to predict.
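
The sketch below illustrates this labelling rule on a simple account record; the field names are hypothetical, while the rule itself (a deactivated account whose billed amount equals the open amount) follows the definition above.

# Sketch of the never-payer label described above. Field names are assumptions;
# the positive-billing guard merely excludes accounts that were never billed.
def is_never_payer(account):
    deactivated = account["status"] == "deactivated"
    nothing_paid = account["billed_amount"] == account["open_amount"]
    return deactivated and nothing_paid and account["billed_amount"] > 0

print(is_never_payer({"status": "deactivated", "billed_amount": 120.0, "open_amount": 120.0}))  # True
print(is_never_payer({"status": "deactivated", "billed_amount": 120.0, "open_amount": 35.0}))   # False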

After this point and several failed attempts to liquidate the client’s debt, the Collection Management system
automatically runs a well-defined algorithm of collection actions that could include sending letters and legal
notices, and even delegating the collection process to collection agencies and their lawyers. It is also
important to note that the customer can delay the payment of his debts simply by reporting a claim, for
instance, declaring that the contracted service is not working as intended. For that reason, never-payer
customers are capable of using contracted services for several months without paying a single bill.

This thesis does not focus on the collection actions phase, but only on the period up until the customer is deactivated
and still has all bills unpaid. It would be ideal if the system were capable of helping the activations’
assistant avoid a potential never-payer simply by looking at the provided information or, at least, of
detecting, after some usage of services, risky behaviour indicating the likelihood of not paying any
bills.

3.2. Data Model


The business model stated above is supported by a data model which is implemented across several
operational systems. All these systems converge to one central database, the data warehouse (DW).
The DW-like model is the main input for detecting the never-payer population.

Customer (Client/Account/Service)

As the above section established, the Customer is the central entity of the business model. The data
model supporting the business depicts the customer as the association between three main data
entities, as seen in the diagram below. The entry point is the Client, which aggregates at least one
Account, which in turn subscribes to at least one Service.

[Diagram: Client 1 — 1..* Account 1 — 1..* Service]

Figure 10 - Diagram of the three main entities of the telecom company’s data model.

As an example, a business (e.g. a large corporation) may have one Account for each one of its
employees, each of which in turn subscribes to one or more Services (e.g. voice, fibre, TV). On the other hand,
a consumer can be represented by a Client entity, which in turn may have one or more Accounts, for
instance, one for each of his family members.

The following tables detail some of the attributes that are available for these three entities and may be
relevant for detecting patterns of never-paying customers.

Client A client is associated with at least an Account.

Segment A client is segmented as a Business or a Consumer client.

Location The only accessible demographic information is City and Postal code.

Fiscal Number Vital to the risk evaluation analysis and billing system, since it could provide
insights about past debts.
Table 2 – Client entity attributes.

Account An Account belongs to a Client and subscribes to at least one Service.

Creation date When the account was created.

Deactivation date When the account was deactivated, i.e. the status attribute is “deactivated”.

Status An account may be activated, deactivated or hotline (on the verge of deactivation).

Location This location is linked to bills. It represents the real location of the customer,
including City and Postal Code.

Fiscal Number Vital to the risk evaluation analysis and billing system, since it could provide
insights about past debts.
Table 3 – Account entity attributes.

Service A Service belongs to an Account and has Pricing Plans associated.

Pricing Plan The pricing plan(s) attached to the service.


Table 4 - Service entity attributes.

Client Segment

Figure 11 adds another four important entities to the model. Firstly, clients belong to a Segment, which
has its own classification hierarchy. Secondly, a customer undergoes a Risk Evaluation each time he
wants to activate an account. Thirdly, geographical data can be extracted using Postal information from
the customer account. Lastly, the subscribed services include one or more Pricing Plans, which will be
charged differently according to their rate.

[Diagram: the Client–Account–Service chain from Figure 10, extended with Risk Evaluation and Segment linked to Client, Postal linked to Account, and Pricing Plan linked to Service.]

Figure 11 - Diagram of the entities available at subscription time.

A customer belongs to a specific Segment, which is stored at Client level. The client is segmented as
Business or Consumer, for instance, the public sector and small and medium enterprises (SME) belong
to the business segment while the general public can belong to the consumer segment.

[Hierarchy diagram: the Segment entity splits into Consumer and Business branches; lower levels include Consumer, Self-employed, SME & SOHO (SME, SOHO), MVNO, Corporate, Public Sector, Large Corporations and Group Enterprises.]

Figure 12 – The Segment hierarchy.

The types of services (and pricing plans) offered are specific to each segment. Additionally, each
segment can be classified as a three-level hierarchy as shown in Figure 12.

Risk Evaluation

Prior to account activation, the client undergoes a Risk Evaluation (detailed in the previous section)
based on its fiscal number to determine, in theory, if the account is to be activated. Each time a customer
needs to activate a new account, at least one new risk evaluation will be performed.

Risk Evaluation Before every account activation, the risk is assessed.

Fiscal Number The fiscal number of the customer being evaluated.

Criterion Each evaluation comprises nine different criteria as described in Section 3.1
above and each one is represented by its number (between 1 and 9), name
and result of the evaluation for that criterion. A criterion scoring zero is indicated
as passed, but if it scores 1 it fails and the reason is also registered.

Evaluation Score Non-risky customers score zero, which is computed by summing up all nine
criteria scores. Sometimes the activations’ specialist can pass the evaluation
even if the real score is above zero (fail); if so, the reason is registered (see the sketch after this table).

Creation info The timestamp of the evaluation as well as the login name of the activations’
specialist when this evaluation was created using the Risk Evaluation system.

Update info Each time an evaluation is updated, this logs the timestamp and login name of
the activations’ specialist who performed changes.
Table 5 – Risk Evaluation attributes.
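
The sketch below illustrates the evaluation score and the manual override described in Table 5; the nine criteria are assumed to be 0 (passed) / 1 (failed) flags, and the override flag is a simplification of the specialist’s decision.

# Sketch of the risk evaluation score: the sum of nine pass/fail criteria.
def evaluation_score(criteria):
    assert len(criteria) == 9        # one flag per criterion, 0 = passed, 1 = failed
    return sum(criteria)

def is_eligible(criteria, manual_override=False):
    # A zero score means a non-risky customer; the activations' specialist may
    # still pass a failing evaluation, provided a reason is registered.
    return evaluation_score(criteria) == 0 or manual_override

print(is_eligible([0, 0, 1, 0, 0, 0, 0, 0, 0]))                         # False
print(is_eligible([0, 0, 1, 0, 0, 0, 0, 0, 0], manual_override=True))   # True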

Postal

Some of the contact information provided by the Account entity includes the postal code and city, which
roughly map with the national database of postal codes provided by the Portuguese Post Office (CTT)
[19]. The Postal entity comprises postcode data associated with the municipality and district names,
providing geographical insights that might be potential predictors of debt patterns.

Postal An account is associated with postal and city contact information.

Postal Code 4-digit postal code.

Postal name The name that is placed after the postal code (town or municipality name).

Town The name of the town.

Municipality The name of the municipality.

District The district name such as Lisboa, Évora, Porto and Aveiro.
Table 6 – Postal entity attributes.

Pricing Plan

This system only analyses customers who subscribe to post-paid Pricing Plans; that is, customers who
use the services provided by the telecom company and pay the bills generated at the end of each month.
Examples of Pricing Plans include, for instance, mobile voice post-paid plans, or even a triple-play
pricing plan that includes television, Internet and mobile services.

Pricing Plan A Pricing Plan belongs to a Service.

Name The name of the pricing plan.

Flag GSM Whether it is a GSM (mobile communications) price plan or not.

Segment Consumer or Business, similar to Client Segment.

Hierarchy Classifies the pricing plan using a simple hierarchy of three levels:

 Level 1 – Similar to client segment. Usually, a business client


subscribes business pricing plans, and so on.
 Level 2 – This system focuses on post-paid pricing plans.
 Level 3 – Describes the content such as GSM (mobile), fixed, MBB
(mobile broadband), M2M (machine-to-machine) communications.
Table 7 – Pricing Plan entity attributes.

All seven entities in Figure 11 represent the data that is present the moment before the customer is
accepted by the telecom company and a new account is created. In short, these entities are the ones
affected during the customer Acquisition phase (see Section 2.1), and each of them has unique
attributes that characterise a telecom customer a priori; these attributes are the only ones
on hand at subscription time.

Preferably, risky customers should be detected at the first opportunity, but we could look further into their
behaviour right after customer accounts are activated. Figure 13 adds four entities that will eventually
become available after an account is created, and services are subscribed.

Figure 13 - Diagram of all the entities available before and after the customer account is activated.

Usage

The most evident sign of customer behaviour is the usage of the subscribed services. Usage logs each
customer behaviour regarding, for example, mobile calls and data usage, storing metrics that help
understand customer behaviour and, ultimately, generate charges. In this system, and due to space and
processing power limitations, usage data was aggregated on a daily basis, so a service has a
corresponding usage record each day, as shown in Table 8.

Usage A service generates usage records.

Service The service this usage daily event refers to.

Event Date The day of the event.

Event Description Metrics such as megabytes, kilobytes and seconds.

Event Units The amount of units spent for the given event, as well as rounded units.
Suppose a service is charged every 30 seconds, and 35 seconds were spent,
so the rounded units would be 60.

Total Calls The number of events for the given event type. For two calls of 45 seconds
each, this metric would be 2.
Table 8 – Usage entity attributes.

Billing

The service usage generates charges associated with its account. Customer accounts are billed every
month by the Billing system, following the billing cycles described in the previous section. Every
billing cycle, a new bill is generated, and Billing keeps track of the charged amount, the due date and
the amount that has already been paid. The Billing data entity is essential to classify existing customers
into the never-payer population.

Billing Each billing cycle an account is given a new bill, including paid services.

Account The account this bill refers to.

Bill Cycle The day of the month on which the bill is issued. Each bill is issued monthly,
always on the same day. For instance, if an account belongs to Bill Cycle 3, its
bills are issued on the third day of each month.

Due Date The bill must be paid before this date, otherwise the account runs into debt.

Last Payment Date The last time the bill was paid.

Amounts Each bill keeps track of several types of amounts:

 Charged amount, the initial amount that was charged for the given billing cycle.
 Original invoice due total, the total (cumulative) amount that is in debt.
 Billed payment, the amount that has already been paid.
 Open amount, the amount that is left to pay.
Table 9 – Billing table entity.

Payment

Each time a customer pays up, his account generates a new record which is stored in the Payment
entity.

Payment Each payment amortises the current account debt.

Account The account this payment refers to.

Payment Month The month of payment. Payment date is aggregated by month.

Amount The amount that was paid.


Table 10 – Payment entity attributes.

Campaign

Lastly, salespeople often contact customers (or prospects) to offer additional services, and these
contacts are logged in the Campaign entity.

Campaign Accounts and services may be contacted by salespeople.

Account / Service The account and service of the campaign contact.

Contact Date The date of the campaign contact.

Campaign Name The name of the campaign.


Table 11 – Campaign entity attributes

After this brief explanation of how the data model supports the business model of this telecom, it
becomes clearer which data attributes and metrics will be analysed and tested to verify whether
never-paying customers are, in fact, predictable.

Figure 14 shows the complete diagram of the data model supporting the business entities. For more
detail concerning the data structure of the data sources that serve as input to the system, see Figure
15.

[Figure 14 relates the Client, Segment, Account, Service, Pricing Plan, Risk Evaluation, Billing, Payment, Usage and Campaign entities and their cardinalities.]
Figure 14 - Entity-Relationship Diagram (Chen’s Database Notation) of the complete data model.

Figure 15 – Entity-Relationship Diagram (Crow’s Foot Notation) of the input data provided by the IT department.

4. Solution
In this chapter, the implemented solution is described in general terms regarding its architecture (Section
4.1). Then, each step of the methodology for building data mining applications [20] is detailed (Section
4.2): understanding the business problem; understanding and preparing the data; and creating models.
Chapter 5 is solely dedicated to explaining how the models were assessed and the results.

4.1. Solution Overview


In order to proactively predict the never-payer population, a data mining application is proposed. This
application has to be capable of processing input data that can help predict if a customer is likely to
never pay his bills.

The general overview of the system is illustrated in Figure 16.

Figure 16 – Application Architecture Overview

The main application is composed of three main layers:

 Sources, where customer and behavioural data from across the different dimensions of the
customer is stored, as described in Section 3.2. CSV flat files are the common interface between
the source systems and the application. Source data includes the ERP, CRM and Billing systems,
as well as CDRs and other sources.
 ETL, responsible for extracting the data, transforming it and loading the model set, which feeds
data mining algorithms with different sampling strategies, complexities and data types. This
component is supported by SQL Server and automated by SSIS.
 Data Mining, where predictive models are trained and tested, and predictions are stored and
evaluated. The analytical component is built on SSAS and orchestrated by SSIS.

Additionally, reporting tools such as Microsoft Excel, Power View, Power BI and Power Pivot integrate
seamlessly with the SQL Server database. A number of views and stored procedures are available to
the user, containing results and statistics. This enables the creation of ad-hoc exploratory analyses.

4.2. Methodology
Considering that this thesis aims to build a data mining application, it makes sense to guide its
development around a widely used methodology such as CRISP-DM (Cross Industry Standard Process
for Data Mining). CRISP-DM is a data mining methodology and process model that describes a common
approach for conducting data mining projects [20], [21]. Figure 17 depicts the main phases of this
methodology. Furthermore, the approach recommended by Microsoft is also loosely based on this
methodology [22]; each Microsoft BI stack tool plays a role, as can be seen in Figure 18.

Figure 17 – CRISP-DM, the standard approach for developing data mining applications [20].
Figure 18 – Microsoft’s proposed data mining process methodology [22], heavily based on CRISP-DM [20].

The following sections will detail each step of the methodology for building the data mining application:

 Business understanding: Section 4.2.1 focuses on the business goals and how they were
translated into a data mining problem definition.
 Data understanding and preparation: Section 4.2.2 introduces the source data, first insights,
challenges, and how data was prepared for modelling.
 Modelling: Section 4.2.3 presents the predictive models that were applied.
 Evaluation: Chapter 5 defines a validation plan and describes the results obtained.
 Deployment: Section 4.2.4 explains how the data mining application was operationalised and
deployed.

4.2.1. Business Understanding


The main subject of this thesis is a telecom company whose goal is to predict, as soon as possible, whether a
new customer will become a never-payer in the near future. A never-payer is someone who subscribes
services or products but does not intend to pay for that subscription.

The first step consisted of attending several meetings to identify the aforementioned requirements, detail
the business processes and stakeholders involved, and establish assumptions and constraints.

This company delineated simple requirements, including:

1. Describing the typical profile of a never-payer customer.
2. Determining, for every new potential customer, the probability (i.e. risk score) of being a never-payer subscriber in the future.
3. Predicting never-payer accounts preferably during their acquisition.
4. The prediction can also be tested using a small amount of behavioural data (e.g. usage history).
5. Operationalising the learning and testing process.

The complete detail of the business model was previously presented in Section 3.1, including all the
steps the customer goes through from the time he intends to sign a contract until he becomes a never-payer.

Several assumptions were established:

1. At least one year of data must be provided for analysis.


2. Data sources are supplied by the company IT Department in the form of flat files1. This simplifies
the data loading process as flat files are a common interface between all the company systems
and this data mining application.
3. Data is provided by the IT Department with whatever degree of aggregation is feasible for
extraction, even if it is not ideal for the application. Because of the sheer volume of data to be
extracted and the processing time needed to transform it, some data was not filtered/aggregated
as it should be. This can introduce aggregation-level and data quality problems that have to be
solved during data preparation.
4. The universe of customers is limited to those who have post-paid pricing plans. The main goal
is to detect customers who never pay their monthly bills, that is, post-paid contracts.
5. The aggregation level required for data mining is established at the account-level.

It was decided that the whole solution should be developed using Microsoft BI tools, mainly because
this software was already available to the company. Besides, Microsoft provides a full data mining
stack, wherein all the tools integrate well with one another. Thus, the software setup included:

 A server running Microsoft SQL Server 2012 with the following components installed:
o Database server running Microsoft SQL Server 2012 (MSSQL).
o Integration server running Microsoft SQL Integration Services 2012 (SSIS).
o Analytic server running Microsoft SQL Server Analysis Services 2012 (SSAS).
 Microsoft Excel 2013 with the following components installed: Power View, Power Map, Power
Pivot and Power Query.

1 Data files containing text records with a fixed number of fields.

The hardware for developing the solution included a PC with an Intel i7 @ 2.9 GHz processor with 8GB
of RAM. Microsoft SQL Server 2012 was configured with 3GB of RAM, so the remaining memory was
used by other software components.

In conclusion, the never-payer data mining problem was defined as a supervised classification problem.
A supervised data mining algorithm learns data patterns contained in examples provided by the user,
in this case examples of never-payer accounts. It is a classification problem because the model will predict
events described by categorical labels such as “yes” or “no” [5] to answer one simple question: will this
be a never-payer customer? Also, a probability score is calculated, describing the likelihood of that event
happening.

4.2.2. Data Understanding and Preparation


The second phase of the CRISP-DM methodology begins with understanding and preparing the data. These
two steps should be followed continuously, falling back to the previous phase whenever some business
input is needed.

Firstly, it is important to distinguish between customer data and behavioural data [23]. The former
refers to first-level attributes that are present during the acquisition process, for instance, customer’s
city, services and pricing plans subscribed, as well as risk evaluation scores. The latter refers to
second-level attributes that become available after the customer’s application has been approved, and he begins
to use the subscribed services. Examples of behavioural data include usage metrics by service, and the
bills generated every month. The overview of the business data model is shown in Figure 14; customer
data is green-marked, whereas behavioural data is blue-marked. For a detailed description of each data
entity, see Section 3.2.

Data Extraction

The IT Department provided the first batch of data containing customer data. This included a set of
customers whose accounts were activated during a two-year period, from April 2013 until the beginning
of March 2015. This added up to approximately 613.000 clients with 937.000 accounts and 2.877.000
services with corresponding pricing plans. Segment and pricing plan were included, as well as publicly
available geographical data from the Portuguese Post Office (CTT) [19]. Later, one year of risk
evaluations were added to the picture, from September 2013 until September 2014. More than 17 million
records covered almost 2 million unique risk evaluations. Additionally, behavioural data was
provided. This included more than 16 million campaign contacts, and 4.6 million payments for 6.3 million
bills. Daily usage of services added up to 20 million records. Because it was only possible to extract one
year of risk evaluations, the training and test dataset was limited by this period (September 2013 until
September 2014).

Notice that this sheer volume of data had already been filtered to include only the customers of interest. Some data entities,
such as usage and campaigns, needed to be re-filtered to keep only the first month after the account activation
date. This way, it was possible to store and process the 20 million usage records. It was established
that behavioural entities should be limited to thirty days after account activation. In addition, data such
as campaigns and usage was provided aggregated daily, which made it very difficult to load and process. For
that reason, it was afterwards summarised by month.

The process of collecting initial data into a staging area was implemented using SSIS. Figure 19
shows the collapsed view of the package wherein each entity is loaded from flat files, and certain data
dependencies have to be met. The parallelization degree is maximised to speed up the extraction.

Figure 19 – Overview of the SSIS package responsible for extracting data from flat files.

Several strategies for loading large amounts of data were used. The first one was using the highly
efficient bulk insert2 T-SQL command. A number of stored procedures were developed to bulk load the
data and apply data transformations at once.

During every bulk loading process, some actions were taken to maximise storage efficiency and
processing speed:

 Trimming and uniformising string attributes;


 Checking and removing duplicate records considering each entity’s business key;
 Summarising data with the adequate aggregation level (daily vs. monthly);
 Creating database indexes to speed up lookups.
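As an illustration of these steps, the sketch below shows how one such stored procedure could look. It is only a minimal example under assumed table, file and column names (stg.RiskEvaluation, the business key and the .fmt path are hypothetical), not the exact procedure used in the platform.

-- Minimal sketch with assumed names: load one monthly risk evaluation file,
-- uniformise strings, drop duplicates by business key and index the lookup key.
BULK INSERT stg.RiskEvaluation
FROM 'C:\Input\Data\RiskEvaluation\RISKEVAL_201401.csv'
WITH (FORMATFILE = 'C:\Input\Data\RiskEvaluation\RiskEvaluation.fmt',
      FIRSTROW = 2, TABLOCK);

-- Trimming and uniformising string attributes.
UPDATE stg.RiskEvaluation
SET    FiscalNumber = LTRIM(RTRIM(FiscalNumber)),
       CreateLogin  = UPPER(LTRIM(RTRIM(CreateLogin)));

-- Removing duplicate records, keeping one row per (assumed) business key.
;WITH Ranked AS (
    SELECT *, ROW_NUMBER() OVER (
                  PARTITION BY FiscalNumber, CritNumber, CAST(CreateDate AS date)
                  ORDER BY CreateDate DESC) AS rn
    FROM stg.RiskEvaluation)
DELETE FROM Ranked WHERE rn > 1;

-- Index to speed up later lookups and joins.
CREATE INDEX IX_RiskEval_FiscalNumber ON stg.RiskEvaluation (FiscalNumber);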

Figure 20 shows two examples of how large volumes of data were loaded into the application database.
For instance, risk evaluation data was spread across several monthly files, each one containing all risk
evaluations for accounts that were activated in that given month. Data was extracted using SSIS bulk load
which, although less efficient than the T-SQL BULK INSERT, is considerably fast for each 200MB file. The
main problem was the dozens of millions of records, most of them duplicated. Therefore, the loading process
had to eliminate those duplicates and, in the end, summarise the data to keep only one evaluation per day.
Usage loading is similar to risk evaluation loading, but since each monthly file was sized up to 5GB, and only
3GB were available for SQL Server, the special BULK INSERT T-SQL command was required.

2 Bulk Insert (T-SQL): https://msdn.microsoft.com/en-us/library/ms188365(v=sql.110).aspx

Figure 20 – Data loading examples implemented by SSIS packages.

The complete data model that was loaded into the database is depicted in Figure 15. Data loading also
ensured business constraints such as primary and foreign keys. For the sake of simplicity, several views
were set up to build the data model shown in Figure 21.

Figure 21 - Entity-Relationship Diagram (Crow’s Foot Notation) of the input data after it was prepared.

Classification

Before any further data exploration, the dataset had to be classified for each account, concerning the
attribute we aim to predict. The classification step created an extra derived binary attribute using the
never-payer condition defined in Section 3.1; accounts that are deactivated and had all of their bills
unpaid (1).

\[
\text{AccountStatus} = \text{'Deactivated'} \;\wedge\; \sum(\text{OpenAmount}) > 0 \;\wedge\; \sum(\text{BilledPayment}) = 0 \quad (1)
\]
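A minimal sketch of how condition (1) could be materialised in T-SQL is shown below; the table and column names (dbo.Account, dbo.Billing, FlagNP) are assumptions for illustration, not the exact objects of the platform.

-- Minimal sketch with assumed names: flag as never-payer (FlagNP = 1) every
-- deactivated account that still has an open amount and never made any billed payment.
UPDATE acc
SET    FlagNP = CASE WHEN acc.[Status] = 'Deactivated'
                      AND b.TotalOpen > 0
                      AND b.TotalPaid = 0
                     THEN 1 ELSE 0 END
FROM   dbo.Account AS acc
JOIN  (SELECT AccountID,
              SUM(OpenAmount)    AS TotalOpen,
              SUM(BilledPayment) AS TotalPaid
       FROM   dbo.Billing
       GROUP BY AccountID) AS b
  ON   b.AccountID = acc.AccountID;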

Then, the first step towards predictive modelling was taken by creating a case table, also known as the
model set. The model set is the data that is used to build the data mining models [2]. Each case table
record is a case; that is, a customer’s account with relevant attributes for predictive modelling. Figure
22 shows the main case table of this system, containing all possible attributes that were tested for
relevance. Notice that the Account ID is marked as the case table primary key; this means the case
table is a summary of all attributes and metrics at the account-level (assumption 5 of Section 4.2.1).

Besides picking categorical and numerical attributes from the original dataset, several derived attributes
were also calculated, such as the number of services subscribed, the number of risk evaluations, and
the average number of calls in the first month of subscription. This case table was composed
of almost 400.000 accounts that were activated during a one-year period and were classified. The
FlagNP binary attribute was 1 for never-payer accounts, and 0 otherwise.
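The sketch below illustrates, under assumed table and column names, how such account-level derived attributes could be summarised; each OUTER APPLY aggregates one behavioural entity independently, avoiding the row multiplication that a single chain of joins would cause.

-- Minimal sketch with assumed names: one case per account, with a few derived attributes.
SELECT  a.AccountID,
        a.FlagNP,
        svc.N_ServiceID,                     -- number of services subscribed
        rsk.N_RiskEval,                      -- number of risk evaluations
        usg.AVG_TotalCalls_Event             -- average daily calls in the first month
FROM    dbo.Account AS a
OUTER APPLY (SELECT COUNT(*) AS N_ServiceID
             FROM dbo.[Service] s
             WHERE s.AccountID = a.AccountID) AS svc
OUTER APPLY (SELECT COUNT(*) AS N_RiskEval
             FROM dbo.RiskEvaluation r
             WHERE r.FiscalNumber = a.FiscalNumber) AS rsk
OUTER APPLY (SELECT AVG(CAST(u.TotalCalls AS float)) AS AVG_TotalCalls_Event
             FROM dbo.[Usage] u
             JOIN dbo.[Service] s ON s.ServiceID = u.ServiceID
             WHERE s.AccountID = a.AccountID) AS usg;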

Note: From this point on, the term “NP1” will refer to the population classified as never-payer;
otherwise, it will be referred to as “NP0” (the population without never-payers).

[Figure 22 groups the case table attributes by entity: Client, Account, Service/PricingPlan, RiskEvaluation, Usage and the Never-Payer target attribute (FlagNP).]
Figure 22 – Entity-Relationship Diagram (adapted from Crow’s Foot Notation) of the Case Table including the predictive attribute.
Data Quality

It was detected that almost 28% of accounts had invalid or missing attribute values. One example of
poor data quality was the account’s city and postal code attributes. They were marked as “deleted”, “not
available” or were missing in about 267.000 accounts. The account’s address should always be present,
since it is mandatory for associating with bills, very much like the fiscal number. For that reason, further
analysis was made to understand why more than one-quarter of the customers were invalid.

Because the data had come from the company’s DW, this data quality issue was discussed with the
BI Department, and it was concluded that those customers were, in fact, invalid: accounts with invalid
statuses that should not be considered. For that reason, a new attribute flag was provided, indicating
whether the account was valid. Thus, the dataset was narrowed down to about 670.000 accounts, which
avoided future problems when evaluating the mining models.

Input data was provided in batches, and its contents depend on the extraction date. When extraction
mistakes occurred, such as missing or miscalculated fields, only the affected entities were re-extracted,
and without historical data, due to their large volume and the time needed to extract them again. Because
there is no historical data, further extractions always reflect the present time of the extraction, and they
need to be synchronised with older data.

For instance, suppose the batch of Clients, Accounts and Services needs to be re-extracted. Even if
account creation dates are limited between dates A and B to simulate the same time span as the first
extraction, many Accounts will have a different set of Services, or they could even belong to different
Clients. Then, there will be (new) Services without Pricing Plans, and Pricing Plans that used to belong to a
Service will now be orphan records.

There are two possible methods of avoiding this kind of time incoherence. The first one, and the easiest
to implement, is to re-extract everything, ensuring every record refers to the same present time; however,
this was completely impractical considering the time and processing power required for the task. The
alternative is to set up a sophisticated mechanism capable of re-extracting only the entities/records
strictly needed to maintain the model coherence. Considering the resources required by the first approach
and the complexity required by the alternative, the automatic resolution of this problem was considered
out of scope. However, when this situation occurred, special care was taken, in a more manual,
case-by-case way.

Finally, although campaign data was provided, it was not possible to incorporate campaign contacts in
the data mining model, due to aggregation level incompatibilities.

Exploration Analysis

Exploration analysis was now possible. Excel and its add-ins Power View, Power Pivot and Power
Map, helped to visualise how the data behaved.

Globally, there are ~2% of never-payers, which is a reasonable percentage when it comes to fraud.
Firstly, 86% of the global dataset is composed of consumers, and from those consumers, ~2% are never-payers.
From the remaining 14% of businesses, ~1% never pay their bills. Moreover, business and
consumer segments have very distinct profiles, which translate into different attribute metrics. For that
reason, they should be analysed separately.

Power Map helped to better understand the location of each customer (account). Attributes such as the
account’s country, district, city, town and postal code proved to be very helpful when mapping the
distribution of the never-payer population.

Figure 23 shows an example of the proportional distribution of never-payers across account cities. The
cities with the biggest yellow bar are those with the greatest relative amount of never-payers (NP).
For instance, a customer from Sintra is ~80% more likely to be a never-payer than a customer from
Cascais. Table 12 shows the detail of this distribution, wherein Sintra holds 3.14% of NPs whereas
Cascais has only 1.74%.

Figure 23 – Never-payer distribution across the cities around Lisbon (Power Map).
Table 12 – Never-payer distribution across the cities around Lisbon (table view).

Looking at the big picture, the likelihood of being a never-payer varies across different levels of
geography. In the distribution across districts presented in Table 13, the top districts are those with
a higher probability of holding never-payers. It is important to consider the size of each population:
even though some locations show a high propensity towards debt, they lose importance if their
population size is not relevant. Besides, the customer’s business segment (Consumer or Business)
appeared to be very meaningful across locations.

Table 13 – Never-payer distribution across all Portuguese districts.

A more careful analysis of the location data yields some interesting conclusions. Globally, ~2% of
consumers are never-payers, whereas only ~1% of businesses are. When analysing by district, this gap
becomes even clearer, including:

 Ilha de São Miguel, where ~3% of consumers do not pay any bills, and nearly every business
pays their bills (only 0.83% are NP);
 Ilha da Madeira is the top never-payer population (~6%), essentially because of its consumers
(~7%).
 In Guarda, almost everyone pays their debts (~99%), especially in the business segment.
 In Viana do Castelo, 2.59% of the businesses do not pay their bills, while consumers are below
the average of NP likelihood.

When looking at districts with at least 4.000 accounts, the worst consumers are located in Ilha da
Madeira, Ilha de São Miguel, Setúbal and Lisboa. On the contrary, the best-behaved are from Santarém
and Aveiro. Most of the never-payer businesses are from Ilha da Madeira, Faro, Setúbal and Lisboa,
while the best are located in Viseu and Leiria.

The final case table also had attributes regarding the subscribed services (number of services) and
pricing plans, including the number of pricing plans and other hierarchy dimensions. These attributes
are related to the Service entity and, since the case table takes the account perspective, their values
had to be pivoted and summarised. For that reason, all pricing plan attributes are numeric.

Many plots were created to understand if certain changes of mean values affected the likelihood of
belonging to the NP population. Figure 24 is an example of how numeric attributes were plotted against
the average values of the consumer and business population.

[Figure 24 panels: “Pricing Plan Content - Consumer” and “Pricing Plan Content - Business”, comparing the average number of AD, FIXED, AI, GSM, ISP, MBB and M2M pricing plans for the NP_0 and NP_1 populations.]
Figure 24 – Plots comparing the content of pricing plans for consumers and businesses.

Steep slopes show that the never-payer population and the general population subscribe different
pricing plans. For instance, the NP1 population subscribes more mobile broadband (MBB) pricing plans
than the NP0 population, while the opposite holds for fixed, GSM (mobile voice) and M2M (machine-to-machine)
pricing plans. Flat slopes indicate that, probably, that attribute will not be useful for
predicting the never-payer population. Another interesting insight is that businesses can subscribe
consumer pricing plans, but consumers cannot subscribe business pricing plans. The class of the pricing
plan will therefore be relevant only for business predictive models.

Figure 25 compares the service usage type for both the NP0 and NP1 populations. The total amount of
seconds spent in calls seems to be a good indicator: the never-payer population spends almost three
times fewer seconds in calls than the NP0 population. The second plot details the lines that were
overlapped in the first one. The average number of call events per day does not seem to vary between
the populations. Nonetheless, the amount of data transferred is also correlated; never-payers use ~80%
less mobile data than the NP0 population.

[Figure 25 panels: “Average Usage by day (first month)” and “Average Usage by day (detail)”, comparing AVG_TotalCalls_Event, AVG_TotalCalls_MBytes and AVG_TotalCalls_Seconds for the NP_0 and NP_1 populations.]
Figure 25 – Plots comparing the type of service usage, for the NP0 and NP1 population.

Figure 26 shows that the risk evaluations executed before account activation are relevant. Businesses
that went through the risk evaluation process ~10.5 times on average are more likely to be never-payers,
compared with the ~9.5 evaluations of the NP0 population. In addition, the last evaluation score
immediately before activation is also relevant, wherein worse scores (higher values) belong to never-payer
businesses. However, for the consumer counterpart, these conclusions are not as clear: there is only a
weak correlation between these two attributes and the NP0/NP1 split. Nevertheless, consumer
never-payers seem to score worse but with fewer evaluations.

[Figure 26: “Risk Evaluation (first month)”, comparing the Count of Risk Evaluations and the Last Real Score for NP_0/NP_1 consumers and businesses.]
Figure 26 – Plots comparing the number of risk evaluations before activation and the latest score across datasets.

Feature Selection

During data preparation, a total of 58 attributes were loaded. After initial data exploration and cleansing,
the next step is feature selection [24] – using correlation to identify the top attributes having a strong
relationship with the target variable (FlagNP). It has been proved effective in reducing dimensionality,
improving mining efficiency and accuracy, as well as enhancing result comprehensibility [25].

Some of the previously displayed tables and charts show the correlation between potential predictive
attributes and the target attribute. Nonetheless, highly correlated predictive attributes should also be
minimised, since they do not add value to the model. Table 14 shows the correlation analysis between
numeric usage attributes. For instance, the average of actual events is the same as the average of total
call events; therefore, the model set should only have one of them.

Table 14 – Correlation analysis between usage attributes.
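As an illustration of this analysis, the query below sketches how the Pearson correlation between two candidate usage attributes could be computed directly in SQL; the table and column names are assumed for the example, and values close to 1 would indicate that one of the attributes is redundant.

-- Minimal sketch with assumed names: Pearson correlation between two usage attributes.
-- (Assumes both attributes vary; a constant column would make the denominator zero.)
SELECT (COUNT(*) * SUM(x.a * x.b) - SUM(x.a) * SUM(x.b))
       / (SQRT(COUNT(*) * SUM(x.a * x.a) - SUM(x.a) * SUM(x.a))
        * SQRT(COUNT(*) * SUM(x.b * x.b) - SUM(x.b) * SUM(x.b))) AS PearsonCorrelation
FROM  (SELECT CAST(USG_AVG_TotalCalls_Event  AS float) AS a,
              CAST(USG_AVG_ActualUnits_Event AS float) AS b
       FROM   dbo.CaseTable
       WHERE  USG_AVG_TotalCalls_Event IS NOT NULL
         AND  USG_AVG_ActualUnits_Event IS NOT NULL) AS x;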

This technique successfully eliminated redundant numeric attributes of the pricing plans, usage and risk evaluation.
The final model set for the Consumer segment included the following attributes:

 ACC_AccountID: The account being evaluated


 FlagNP: Flag indicating whether the account is a never-payer
 ACC_Postal2: Postal Code (3 digits)
 ACC_PostalCode: Postal code (4 digits)
 ACC_PostalName: Postal Name
 ACC_Town: Town
 ACC_Municipality: Municipality
 ACC_District: District
 SVC_N_ServiceID: Number of services subscribed
 PPL_Content_GSM: Number of GSM pricing plans
 PPL_Content_FIXED: Number of FIXED pricing plans
 PPL_Content_MBB: Number of MBB pricing plans
 RSK_N_RiskEval: Total number of risk evaluations
 RSK_ModifyLogin: Last Modify Login (Risk Evaluation)
 RSK_RealScore: Last Score (Risk Evaluation)
 USG_AVG_TotalCalls_Event: Average Total of Calls (Event)
 USG_AVG_TotalCalls_Seconds: Average Total of Calls (Seconds)
 USG_AVG_TotalCalls_MBytes: Average Total of Calls (Mbytes)
 USG_AVG_ActualUnits_MBytes: Average Actual Units (Event)

For the Business segment, additional attributes were considered:

 CLI_Segment: Segment (Level 3)


 CLI_Segment1: Segment (Level 1)
 CLI_Segment2: Segment (Level 2)
 PPL_Content_M2M: Number of M2M pricing plans
 PPL_Content_ISP: Number of ISP pricing plans
 PPL_Content_AI: Number of AI pricing plans
 PPL_Content_AD: Number of AD pricing plans

4.2.3. Modelling
Although the relative amount of risky customers is very small (~2%) when compared with the tens of
thousands of new subscriptions every month, they represent substantial losses that could be avoided or,
at least, mitigated. Because the dataset is highly imbalanced, particular strategies need to be followed.
Sampling is the most widespread means of overcoming the class imbalance problem [26].

A direct method to solve the imbalance problem is to artificially balance the distribution of the minority
class (NP1, never-payers) so that it is not under-represented when training the classifier [27]–[29].

There are three basic approaches to overcome the class imbalance problem and several works in the
literature that confirm the efficiency of these methods in practice [28]. These include:

 Random Oversampling (ROS), which consists of sampling the minority class (NP1) with
replacement until there are as many minority class examples as majority class (NP0) examples. This
could lead to overfitting, since it produces exact copies of the never-payer class examples.
 Random Undersampling (RUS), balancing class distribution through random elimination of
majority class (NP0) examples. The major drawback is that it can discard potentially useful data
that could be critical for the induction process.
 Hybrid Sampling (ROS/RUS), combining ROS and RUS, wherein the majority class (NP0) is
undersampled and the minority class (NP1) is over-sampled.

In these experiments, four sampling strategies were implemented, namely:

 S1: The original training dataset was not altered.
 S2: Undersampling the NP0 class.
 S3: Oversampling the NP1 class.
 S4: Undersampling the NP0 class and oversampling the NP1 class.
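A minimal sketch of the hybrid strategy S4 is given below, with assumed table names and sizes; the undersampling keeps a random subset of NP0 cases, while the oversampling crudely replicates NP1 cases so that the two classes end up with comparable sizes.

-- Minimal sketch with assumed names/sizes: undersample NP0 and oversample NP1 (S4).

-- Random undersampling of the majority class (NP0).
SELECT TOP (40000) *
INTO   dbo.Train_S4_NP0
FROM   dbo.CaseTable_Train
WHERE  FlagNP = 0
ORDER  BY NEWID();

-- Random oversampling of the minority class (NP1) by replicating each case up to 10 times.
SELECT TOP (40000) np1.*
INTO   dbo.Train_S4_NP1
FROM   dbo.CaseTable_Train AS np1
CROSS JOIN (SELECT TOP (10) object_id FROM sys.objects) AS copies
WHERE  np1.FlagNP = 1
ORDER  BY NEWID();

-- Balanced model set used to train the S4 mining models.
SELECT * INTO dbo.CaseTable_Train_S4
FROM  (SELECT * FROM dbo.Train_S4_NP0
       UNION ALL
       SELECT * FROM dbo.Train_S4_NP1) AS balanced;

Note that the oversampled set contains exact copies of the same never-payer accounts, which is precisely the behaviour that can lead to the overfitting mentioned above.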

Two moments of prediction were also configured. The first one includes only “customer data” (CD), with
attributes available during acquisition and before account approval. The second, “behavioural data”
(BD), adds the usage attributes that become accessible after the subscription.

Three classification algorithms were tested: Decision Trees, Naïve-Bayes and Logistic Regressions.
These are classification algorithms [5], [30] which were implemented by Microsoft and included in SSAS.

The Decision Tree algorithm is very popular among data miners since it can predict both discrete and
continuous variables and the generated rules are easy to understand.

The Naïve Bayes algorithm calculates probabilities for each possible state of the input attribute. It is very
simple to process and provides baseline results. Nonetheless, it does not support continuous variables,
but for the purpose of this work, they were discretised.

The Logistic Regression algorithm is a powerful and well-established statistical technique that estimates
the probabilities of the target categories [1]. It is analogous to simple linear regression but for discrete
outcomes.

Several techniques and datasets were combined to find the best approach, including:

 Different segments: consumer or business.


 Different balancing strategies: none (S1), undersampling (S2), oversampling (S3) or both (S4).
 Different data available: customer data (CD) or behavioural data (BD).
 Different algorithms: decision trees, Naïve Bayes or Logistic Regressions.

Figure 27 shows an example of how this combination was achieved using the user interface of SSAS.

Figure 27 – Mining model for Consumers using oversampling and different algorithms.

The process of modelling is interactive, allowing the developer to understand how the algorithms pick
the best attributes for prediction. Figure 28 shows an example of a Bayesian network produced by the Naïve
Bayes algorithm applied to a hybrid sampling dataset, featuring consumer accounts with behavioural data.
The Bayesian network shows that the strongest links with the target attribute (FlagNP) are the number of
subscribed services, the customer district, the average call duration and the type of pricing plans subscribed.

Figure 28 – Example of the mining model viewer for a Naïve Bayes algorithm.

When applying the Microsoft Decision Tree algorithm on the same dataset, the best attributes are similar.
Figure 29 shows that the algorithm picks first the attributes that count the number of voice and mobile
broadband pricing plans of an account, as well as the account’s district and the login name that assessed
the risk before account activation. Accounts that subscribe more than five mobile (GSM) pricing plans,
more than four mobile broadband pricing plans and are from Beja District are likely to belong to the
never-payer population.

Figure 29 – Example of the mining model viewer for a Microsoft Decision Tree algorithm.

It is also possible to visually compare different algorithms using lift charts. The lift denotes how much
better a classification data mining model performs in comparison to a random selection [1]. The X-axis
shows the percentage of customers analysed, whereas the Y-axis shows the percentage of never-
payers correctly identified; this is the percentage of the total possible never-payers. The lift chart tells
that if we analyse X% of customers then we will correctly identify Y% of the never-payer population.

Figure 30 shows the lift chart of a hybrid consumer dataset, wherein a randomly selected sample
contains about 22% of never-payers. All the mining models perform better than the random
guess model. The best is the yellow one, the decision tree algorithm using behavioural data, since it is
closest to the line of the ideal model. The ideal model would find all of the never-payers just by
looking at 22% of the total population.

Figure 30 – Example of a lift chart for a hybrid sampling data mining model.
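As a purely illustrative reading of such a chart (the 60% figure below is an assumed example, not taken from Figure 30): if a model captures 60% of all never-payers within the 22% highest-scored accounts, its lift at that point is

\[
\text{Lift}(22\%) = \frac{\text{fraction of never-payers captured}}{\text{fraction of customers analysed}} = \frac{0.60}{0.22} \approx 2.7,
\]

that is, roughly 2.7 times better than selecting the same 22% of accounts at random.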

After defining the data mining models for training and testing, the next step comprises the design of a
validation plan and results evaluation. This phase is fully described in Section 5 and aims to pick the
best models for predicting the never-payer population. After validating the results, the next section
details how the system was operationalised and deployed for the end-user.

4.2.4. Deployment
This is the stage of the methodology that ensures the data mining process is repeatable across the
enterprise [20]. The information and knowledge that were extracted from data need to be organised and
presented so the end-user can use it. This included the operationalization of all data mining steps using
the available Microsoft BI stack technologies.

Integration Services (SSIS) orchestrates the complete data mining process, wherein each step is
implemented by a package (.dtsx file) deployed to the SSIS server. Figure 31 lays out the sequence of
SSIS packages, from extracting the input files to testing new accounts. The detail of each step is described
below.

[Figure 31 steps: Extract (collect and prepare initial data from flat files), Classify (identify the never-payer population), Load (join all data in a single case table at account level), Sample (load sampling strategies), Train (train each mining model) and Test (test mining models and log results).]
Figure 31 – Data mining process orchestrated by Integration Services (SSIS).

1. Extract is the data collection step that collects initial data from CSV flat files containing business
entities with attributes that may be useful for prediction. Data is stored in staging tables, and it
is prepared, cleaned and summarised with the intended aggregation level. Besides, the loading
process is as efficient as possible, both in terms of space and time used, making use of bulk
load SQL commands, table indexes, and data compression.
2. Classify is responsible for adding an extra target attribute named FlagNP. This Boolean
attribute indicates whether an account is a never-payer and it is calculated by looking at the
billing status of each account. The classification process outputs the examples from which the supervised
data mining algorithms learn to predict future never-payers.
3. Load integrates all business entities in a single case table. Each data record is what is commonly
called a case, whose columns are the attributes with predictive potential as well as the target
attribute, FlagNP. Furthermore, it describes the attributes of a specific account that will be used
to train mining models. The final case table was the main input for exploration analysis and
feature selection.
4. Sample automates the process of randomly sampling the case table for training, putting aside
examples for subsequent testing, in order to validate the models. Additionally, because of the high
imbalance ratio between the class distributions available for training, sampling strategies had to
be implemented, producing different datasets that artificially increase the proportion of never-payers
among the account population. Those techniques included the aforesaid undersampling,
oversampling and hybrid sampling strategies, as well as no sampling, to serve as a baseline.
5. Train systematises the process of training each model with the corresponding dataset. It
combines four sampling strategies, three algorithms, two types of attributes and two degrees of
complexity, resulting in 48 models that were trained using the training examples that were put aside
and sampled during the previous step.
6. Test mechanises the classification of the account examples that were set aside for testing.
The already trained models are put into practice so that prediction results can be evaluated and
validated. Additionally, it is possible to test new accounts supplied by the user, which is, in fact, the
main purpose of this application.

There are two main folders mapped into the user’s file system. The “Input” folder is where train and test
files are dropped and processed, whereas the “Output” folder will contain the results of account
predictions. From the user’s point of view the interaction with the application begins with the input of the
data needed to train the models, as well as the new accounts that he needs to evaluate.

The “Input” folder contains a “Data” folder with folders for each business entity, where CSV flat files must
be placed, and format files3 (.fmt extension) describe the input data structures. The whole data mining
process described in Figure 31 can be started on demand by clicking a batch file that starts an SQL
Server Job that is responsible for running each one of the process packages. If the input files are faulty
or some error occurs, errors are logged, and the data is not loaded for training. Otherwise, if everything
goes as planned, the models will be updated with the most recent input data and will be ready for testing.
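For illustration, the batch file could amount to little more than a sqlcmd call that asks SQL Server Agent to start the job; the job name below is assumed, not the actual one used in the platform.

-- Minimal sketch with an assumed job name: statement executed by the batch file
-- (e.g. via sqlcmd) to start the training process on the SQL Server Agent side.
EXEC msdb.dbo.sp_start_job @job_name = N'DebtAnalytics - Train Mining Models';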

For making predictions using the already trained models, the user simply drops a CSV containing
account data inside the “Predict” folder at the root of the “Input” folder. This CSV must comply with the
structure of the model set used for training, as described in Section 4.2.3. After clicking a batch file to
start the SQL Job that tests new accounts, the file is moved to the “Processed” folder. Accounts featuring
only customer data will be classified using data mining models that are specific to customer data. On
the other hand, accounts supplied with behavioural data will be tested using both customer data and
behavioural data mining models.

The “Output” folder now contains the account prediction results, presented as CSV files that can be
viewed using Excel. The structure of the prediction output files is similar to the input files, but they also
contain additional columns that describe the prediction result:

 Model Name – Name of the data mining model used for prediction.
 Prediction – Similar to the FlagNP attribute, but it represents the predicted result assigned by
the algorithm; ‘1’ if the algorithm classifies the account as never-payer, ‘0’ otherwise.
 Probability – Also known as “risk score”, this is the probability assigned to the prediction made
by the algorithm.
 Support – Represents the count of cases that match with the itemset or rule used for prediction.
 Description – Additional details that help the user understand how the algorithm works. For
instance, the decision tree rule or association rule applied.

3 Format files: https://technet.microsoft.com/en-us/library/aa173859(v=sql.80).aspx

In addition to prediction output files, CSV reports are also produced, featuring performance measures
for each algorithm. These metrics are described in detail in the next section. Every prediction result and
statistic is accessible through a set of database views available to the end-user. Reporting tools such as
Excel, Power View and others can connect to these views and produce statistical reports.

Every step of the data mining application is logged in logging tables, including information, warnings and
possible errors. Once again, database views are available for querying and reporting the state of the
process.

The business process describing the bad payer lifecycle presented in Section 3.1 (Figure 9) can be
modified to accommodate the new data mining system that will enrich the risk evaluation step. Figure
32 introduces two different never-payer detection stages; the first one during the Acquisition phase using
customer data and the second during the Intermediate phase, after 30 days of service usage.

[Figure 32 lanes: Customer, Activations, Risk Evaluation and Never-Payer Detection. The TO-BE flow adds an “NP Detection” step on customer data right after the automatic risk evaluation and, after 30 days of usage, a second “NP Detection” step on customer and behaviour data, each followed by an approval decision and, if needed, a risk mitigation strategy.]
Figure 32 – Business process (TO-BE) describing the customer lifecycle of a never-payer.

The first “NP Detection” occurs right after the automatic credit evaluation performed by the Risk
Evaluation system. This detection only uses customer data, that is, the subscription data supplied by the
prospect when he filled in the sign-up form and the risk data from the risk assessment. After the
never-payer detection algorithm runs, risk results are returned to the activations’ specialist, who decides
whether the prospect is approved. If so, a mitigation strategy can be suggested to the prospect.

After accepting the contract, the customer’s account is activated, and service usage begins. After 30
days of usage, the second “NP Detection” takes place. This time, both customer and usage data are
used as input for the detection algorithms. Risk results are, once again, returned to the activations’
specialist, who performs an intermediate risk evaluation. If something indicates that this customer will
soon become a never-payer, a mitigation strategy is put in motion; otherwise, business proceeds as usual.

5. Validation and Results
This chapter describes the validations performed on this system as well as their results.

5.1. Validation Plan


For evaluating the system, several items must be defined, particularly the testing set (i.e., dataset), and
the evaluation metrics to assess its effectiveness.

A classifier is typically evaluated by a confusion matrix [31]. Each confusion matrix entry provides the
number of customers with the given outcome, in terms of being a never-payer or not. For instance, a
true positive (TP) is a never-payer who was correctly identified. The effectiveness measures most widely
used in data mining are set out in terms of the contingency table [5], [32].

              PREDICTED AS NP1    PREDICTED AS NP0
ACTUAL NP1    TP                  FN
ACTUAL NP0    FP                  TN
Table 15 – Confusion matrix for the never-payer classifier.

Accuracy considers the population correctly identified by the system [5]. It reflects how well the classifier
recognises customers of the two possible classes. Accuracy is the number of correct predictions and
correct non-predictions divided by the size of the whole population.

\[
\text{Accuracy} = \frac{|\{\text{correct predictions}\}|}{|\{\text{all population}\}|} = \frac{TP + TN}{TP + TN + FP + FN} \quad (2)
\]

Traditionally, accuracy is the most typically used measure for these purposes [28]. However, because
we have a class imbalance problem (only 2% are never-payers), accuracy is no longer a proper
measure. Accuracy places more weight on the common classes than on the rare classes, making it difficult
for a classifier to perform well on the uncommon classes [29]. If the system classifies every customer
as a non-never-payer, it can achieve an accuracy of 98%, which is meaningless. Because of this, additional
metrics are required.

If only the performance of the positive class is considered, precision and recall become relevant [28].

Precision (also known as positive predictive value) is the proportion of positive predictions that are
correct [26], [28]. In other words, precision defines the fraction of customers reported as never-payers
by the system that are indeed never-payers.

\[
\text{Positive Predictive Value} = \text{Precision} = \frac{|\{\text{correct NP1 predictions}\}|}{|\{\text{NP1 predictions}\}|} = \frac{TP}{TP + FP} \quad (3)
\]

Recall (also known as TP rate or sensitivity) is the proportion of positive items retrieved by the system
[26], [28]. In short, recall is the fraction of actual never-payers that are correctly reported by
the system.

\[
\text{TP Rate} = \text{Sensitivity} = \text{Recall} = \frac{|\{\text{correct NP1 predictions}\}|}{|\{\text{NP1 population}\}|} = \frac{TP}{TP + FN} \quad (4)
\]

In an ideal system, precision and recall are close to one, but enhancing one metric can hurt the other.

F-measure is a combined score for the entire system that corresponds to the harmonic mean between
precision and recall [32]. It combines recall and precision, which are effective metrics when the
imbalance problem exists [28].

\[
\text{F-measure (F1)} = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}} \quad (5)
\]
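As a hedged illustration of how these measures can be computed for one model from its logged prediction results (table, column and model names below are assumed), a single query can derive the confusion matrix and all four metrics:

-- Minimal sketch with assumed names: confusion matrix and metrics for one model.
;WITH cm AS (
    SELECT SUM(CASE WHEN FlagNP = 1 AND Prediction = 1 THEN 1 ELSE 0 END) AS TP,
           SUM(CASE WHEN FlagNP = 0 AND Prediction = 1 THEN 1 ELSE 0 END) AS FP,
           SUM(CASE WHEN FlagNP = 1 AND Prediction = 0 THEN 1 ELSE 0 END) AS FN,
           SUM(CASE WHEN FlagNP = 0 AND Prediction = 0 THEN 1 ELSE 0 END) AS TN
    FROM   dbo.PredictionResults
    WHERE  ModelName = 'DT_Cons_S4_CD')
SELECT TP, FP, FN, TN,
       1.0 * (TP + TN) / NULLIF(TP + TN + FP + FN, 0) AS Accuracy,
       1.0 * TP / NULLIF(TP + FP, 0)                  AS [Precision],
       1.0 * TP / NULLIF(TP + FN, 0)                  AS Recall,
       2.0 * TP / NULLIF(2 * TP + FP + FN, 0)         AS F1   -- algebraically equal to (5)
FROM   cm;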

Each consumer and business dataset was split according to its number of attributes [33]. Table 16
summarises the splits for each segment. The whole set is divided into a testing and a training set:
20% of the examples were saved for later testing. From the remaining 80%, 75-80% were used to train the
algorithms and the remaining 20-25% to validate them.

Consumer                 # Cases    # NP1 (estimated)   % of Total
  Total (100%)           340,390    7,158
  Testing (20%)           68,078    1,432               20%
  Training (80%)         272,312    5,726
    Validation (25%)      68,078    1,432               20%
    Training (75%)       204,234    4,295               60%

Business                 # Cases    # NP1 (estimated)   % of Total
  Total (100%)            55,797      631
  Testing (20%)           11,159      126               20%
  Training (80%)          44,638      505
    Validation (20%)       8,927      101               16%
    Training (80%)        35,710      404               64%

Table 16 – Training and testing set for Consumer and Business datasets.
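A minimal sketch of how the 20% testing hold-out could be drawn at random is shown below (table names assumed; the training/validation split can be done the same way over the remaining 80%).

-- Minimal sketch with assumed names: random 80/20 split of the consumer case table.
;WITH shuffled AS (
    SELECT ct.*, NTILE(5) OVER (ORDER BY NEWID()) AS bucket   -- 5 equal random buckets
    FROM   dbo.CaseTable_Consumer AS ct)
SELECT *,
       CASE WHEN bucket = 1 THEN 'Testing' ELSE 'Training' END AS DataSplit
INTO   dbo.CaseTable_Consumer_Split
FROM   shuffled;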

5.2. Results
The predictive validation includes performance measures that evaluate the accuracy and precision of
predicted outcomes against expected results. It compares the performance of several predictive
algorithms in typical case scenarios (detailed in Section 4.2.3), using different sets of predictive
attributes.

The complete overview of the results when assessing all combinations of models is presented in Table
17. For the sake of simplicity, combinations that performed too poorly or were similar to one another
were removed. Because we are dealing with very imbalanced datasets, the overall results are not very
good. It becomes noticeable that the number of false positives is the biggest problem.

Consumer algorithms performed consistently better than the business algorithms. The main reason is
that the never-payer population in the consumer segment (~2%) is bigger than in the business segment
(~1%). It may seem like a small difference, but consumer algorithms also train with ten times more examples
than the business ones do. This difference leads to an even smaller precision for the business segment,
since the number of never-payers correctly identified is much lower.

Overfitting happened mostly when using oversampling strategies and when an algorithm’s complexity was
increased. Even though the number of true positives increased, false alarms became too high.

When examples were oversampled to achieve a 50/50 chance of being a never-payer, more false alarms
(false positives) were introduced. This increase in false positives happens because the algorithm is being
told the world is a 50/50 chance, but when it is tested in the real world, this assumption makes it biased.
For example, the classifier may end up “remembering” a never-payer simply because it sees the same
account several times.

Increasing the complexity of algorithms can also lead to overfitting. It is possible to increase or decrease
the degree of complexity of every algorithm, for instance, by adjusting the maximum number of states
or the maximum size of the tree or Bayesian network. As an example, a decision tree may be forced to
grow a little deeper, thus creating a more specific mining classifier. The more complex the algorithm is,
the less generic and the more specific it becomes. The key is to find the “sweet spot” wherein the degree
of complexity does not hurt the precision and recall of the algorithm.

Undersampling also introduced many false alarms (false positives) and misses (false negatives).
Random undersampling has removed certain significant examples. One of the problems with random
undersampling is that we do not control which critical examples from the NP0 class are thrown away.
Valuable information about the decision boundary between the minority and majority classes may be
eliminated.

The hybrid approach of using both undersampling and oversampling turned out to perform better,
especially when used with the Microsoft Decision Trees algorithm. Nevertheless, it was necessary to
tweak the size of each class to obtain these results.

In fact, Decision Trees performed the best for the three sampling strategies, both with customer and
behavioural data. One of the goals of this thesis was to see if it is possible to predict the outcome only
with customer data, and decision trees could do so, albeit introducing many false alarms.

Logistic regressions worked better with undersampling and hybrid strategies, but they generally performed
worse, except for undersampling in the consumer dataset.

Naïve Bayes performed better for consumer datasets, but with many false alarms. It seems to work
better with oversampling strategies.

In general, algorithms using behavioural data achieved the best results, which is understandable: a
customer’s behaviour is an important indicator of how he will behave in the future. Stating that a
customer will be a never-payer just by judging where he comes from and which services he subscribes
seems to be very hard.

| Model ID | Segment | Algorithm | Balancing | Obs. | Data Type | # Cases | # NP1 | TN | FN | FP | TP | Acc. | Prec. | Recall | F-Measure |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| NB_Cons_S1_CD | Consumer | Naive Bayes | None | – | Customer | 68,078 | 1,429 | 66,118 | 1,392 | 531 | 37 | 97.18% | 6.51% | 2.59% | 3.71% |
| NB_Cons_S1_BD_Complex | Consumer | Naive Bayes | None | – | Behaviour | 68,078 | 1,429 | 65,326 | 1,330 | 1,323 | 99 | 96.10% | 6.96% | 6.93% | 6.94% |
| LR_Cons_S1_CD_Complex | Consumer | Logistic Regression | None | – | Customer | 68,078 | 1,429 | 64,212 | 1,356 | 2,437 | 73 | 94.43% | 2.91% | 5.11% | 3.71% |
| NB_Bus_S1_CD | Business | Naive Bayes | None | – | Customer | 11,160 | 111 | 10,933 | 107 | 116 | 4 | 98.00% | 3.33% | 3.60% | 3.46% |
| NB_Bus_S1_BD | Business | Naive Bayes | None | – | Behaviour | 11,160 | 111 | 10,892 | 105 | 157 | 6 | 97.65% | 3.68% | 5.41% | 4.38% |
| DT_Cons_S2_CD_Complex | Consumer | Decision Trees | Undersampling | 10% NP0 | Customer | 68,078 | 1,429 | 64,181 | 1,174 | 2,468 | 255 | 94.65% | 9.36% | 17.84% | 12.28% |
| DT_Cons_S2_BD_Complex | Consumer | Decision Trees | Undersampling | 10% NP0 | Behaviour | 68,078 | 1,429 | 64,414 | 1,154 | 2,235 | 275 | 95.02% | 10.96% | 19.24% | 13.96% |
| NB_Cons_S2_BD_Complex | Consumer | Naive Bayes | Undersampling | 10% NP0 | Behaviour | 68,078 | 1,429 | 57,099 | 855 | 9,550 | 574 | 84.72% | 5.67% | 40.17% | 9.94% |
| LR_Cons_S2_CD | Consumer | Logistic Regression | Undersampling | 10% NP0 | Customer | 68,078 | 1,429 | 63,471 | 1,210 | 3,178 | 219 | 93.55% | 6.45% | 15.33% | 9.08% |
| LR_Cons_S2_BD | Consumer | Logistic Regression | Undersampling | 10% NP0 | Behaviour | 68,078 | 1,429 | 61,663 | 1,081 | 4,986 | 348 | 91.09% | 6.52% | 24.35% | 10.29% |
| NB_Bus_S2_CD | Business | Naive Bayes | Undersampling | 10% NP0 | Customer | 11,160 | 111 | 10,303 | 96 | 746 | 15 | 92.46% | 1.97% | 13.51% | 3.44% |
| NB_Bus_S2_BD_Complex | Business | Naive Bayes | Undersampling | 10% NP0 | Behaviour | 11,160 | 111 | 10,095 | 91 | 954 | 20 | 90.64% | 2.05% | 18.02% | 3.69% |
| LR_Bus_S2_CD_Complex | Business | Logistic Regression | Undersampling | 10% NP0 | Customer | 11,160 | 111 | 10,669 | 103 | 380 | 8 | 95.67% | 2.06% | 7.21% | 3.21% |
| LR_Bus_S2_BD_Complex | Business | Logistic Regression | Undersampling | 10% NP0 | Behaviour | 11,160 | 111 | 10,686 | 101 | 363 | 10 | 95.84% | 2.68% | 9.01% | 4.13% |
| DT_Cons_S3_CD | Consumer | Decision Trees | Oversampling | 50,000 NP1 | Customer | 68,078 | 1,429 | 65,820 | 1,315 | 829 | 114 | 96.85% | 12.09% | 7.98% | 9.61% |
| DT_Cons_S3_BD | Consumer | Decision Trees | Oversampling | 50,000 NP1 | Behaviour | 68,078 | 1,429 | 65,752 | 1,293 | 897 | 136 | 96.78% | 13.17% | 9.52% | 11.05% |
| DT_Cons_S3_BD_Complex | Consumer | Decision Trees | Oversampling | 50,000 NP1 | Behaviour | 68,078 | 1,429 | 64,488 | 1,176 | 2,161 | 253 | 95.10% | 10.48% | 17.70% | 13.17% |
| NB_Cons_S3_CD | Consumer | Naive Bayes | Oversampling | 50,000 NP1 | Customer | 68,078 | 1,429 | 59,998 | 1,013 | 6,651 | 416 | 88.74% | 5.89% | 29.11% | 9.79% |
| NB_Cons_S3_BD | Consumer | Naive Bayes | Oversampling | 50,000 NP1 | Behaviour | 68,078 | 1,429 | 58,922 | 936 | 7,727 | 493 | 87.27% | 6.00% | 34.50% | 10.22% |
| DT_Bus_S3_CD | Business | Decision Trees | Oversampling | 5,000 NP1 | Customer | 11,160 | 111 | 10,924 | 106 | 125 | 5 | 97.93% | 3.85% | 4.50% | 4.15% |
| DT_Bus_S3_BD | Business | Decision Trees | Oversampling | 5,000 NP1 | Behaviour | 11,160 | 111 | 10,948 | 105 | 101 | 6 | 98.15% | 5.61% | 5.41% | 5.50% |
| DT_Bus_S3_BD_Complex | Business | Decision Trees | Oversampling | 5,000 NP1 | Behaviour | 11,160 | 111 | 10,839 | 101 | 210 | 10 | 97.21% | 4.55% | 9.01% | 6.04% |
| NB_Bus_S3_CD | Business | Naive Bayes | Oversampling | 5,000 NP1 | Customer | 11,160 | 111 | 10,269 | 94 | 780 | 17 | 92.17% | 2.13% | 15.32% | 3.74% |
| NB_Bus_S3_BD | Business | Naive Bayes | Oversampling | 5,000 NP1 | Behaviour | 11,160 | 111 | 9,885 | 85 | 1,164 | 26 | 88.81% | 2.18% | 23.42% | 4.00% |
| NB_Bus_S3_BD_Complex | Business | Naive Bayes | Oversampling | 5,000 NP1 | Behaviour | 11,160 | 111 | 10,177 | 90 | 872 | 21 | 91.38% | 2.35% | 18.92% | 4.18% |
| DT_Cons_S4_CD | Consumer | Decision Trees | Both | 40,000 NP0 / 50% NP1 | Customer | 68,078 | 1,429 | 64,024 | 1,115 | 2,625 | 314 | 94.51% | 10.68% | 21.97% | 14.38% |
| DT_Cons_S4_BD | Consumer | Decision Trees | Both | 40,000 NP0 / 50% NP1 | Behaviour | 68,078 | 1,429 | 63,973 | 1,107 | 2,676 | 322 | 94.44% | 10.74% | 22.53% | 14.55% |
| NB_Cons_S4_CD | Consumer | Naive Bayes | Both | 40,000 NP0 / 50% NP1 | Customer | 68,078 | 1,429 | 55,193 | 811 | 11,456 | 618 | 81.98% | 5.12% | 43.25% | 9.15% |
| NB_Cons_S4_BD | Consumer | Naive Bayes | Both | 40,000 NP0 / 50% NP1 | Behaviour | 68,078 | 1,429 | 54,124 | 736 | 12,525 | 693 | 80.52% | 5.24% | 48.50% | 9.46% |
| DT_Bus_S4_CD | Business | Decision Trees | Both | 10,000 NP1 / 30% NP0 | Customer | 11,160 | 111 | 10,334 | 92 | 715 | 19 | 92.77% | 2.59% | 17.12% | 4.50% |
| DT_Bus_S4_BD | Business | Decision Trees | Both | 10,000 NP1 / 30% NP0 | Behaviour | 11,160 | 111 | 10,128 | 81 | 921 | 30 | 91.02% | 3.15% | 27.03% | 5.65% |
| NB_Bus_S4_CD | Business | Naive Bayes | Both | 10,000 NP1 / 30% NP0 | Customer | 11,160 | 111 | 8,615 | 68 | 2,434 | 43 | 77.58% | 1.74% | 38.74% | 3.32% |
| NB_Bus_S4_BD | Business | Naive Bayes | Both | 10,000 NP1 / 30% NP0 | Behaviour | 11,160 | 111 | 8,243 | 57 | 2,806 | 54 | 74.35% | 1.89% | 48.65% | 3.64% |
| LR_Bus_S4_CD_Complex | Business | Logistic Regression | Both | 10,000 NP1 / 30% NP0 | Customer | 11,160 | 111 | 10,762 | 98 | 287 | 13 | 96.55% | 4.33% | 11.71% | 6.33% |
| LR_Bus_S4_BD_Complex | Business | Logistic Regression | Both | 10,000 NP1 / 30% NP0 | Behaviour | 11,160 | 111 | 10,728 | 99 | 321 | 12 | 96.24% | 3.60% | 10.81% | 5.41% |

Table 17 – Validation results for all combinations of segments, algorithms, sampling strategies and data types.

6. Conclusion
After implementing and assessing the performance of the system, this chapter presents the goals that
were met and new ideas for future developments.

6.1. Contributions
The main goal of this work was to predict if a customer will not pay any of his bills, even before the
customer’s account is activated. At that point, too little customer data is available for analysis.

So, the first challenge was to understand how debtors behave in the telecommunications industry. The
customer lifecycle was described, and the fraudster profile was introduced. Few systems can predict the
probability of a customer becoming fraudulent at the acquisition phase. Two patented systems were
described and compared; even though few implementation details were available, the types of data used
as predictive attributes were the most valuable insight.

Then, the main subject of this case study was presented. Both the business and the data model were
detailed and framed within the customer lifecycle. This business analysis was necessary to request the
appropriate data for analysis.

The solution implemented combined database, integration and analytical components. The
development was guided by the CRISP-DM methodology [20], [21]: the first step was to define the data
mining problem, and it was decided that classification algorithms would be tested. Several techniques
and exploration tools were used to get to know the data provided. The data was then loaded and cleaned
into the final set of attributes that showed the most potential for prediction.

Several combinations of balancing strategies, data types, segments and algorithms were tested to find
the best approaches for predicting the never-payer population. These test runs produced metrics that
were evaluated and discussed.

The full system can operationalise the learning and testing process, outputting the probability of a
customer being a never-payer.

6.2. Future Work


This section points out several enhancements that could be made to boost the performance measures
presented above, as well as operational improvements to the process.

The experiments performed in this thesis required between 13 and 24 features (attributes) to estimate
the probability of a customer being a never-payer. Since we are dealing with large data volumes,
computing these features turned out to be burdensome, and for that reason continuous attributes
were discretised. Also, not every combination of attributes was investigated: that would require 2^N
experiments, where N is the number of attributes, for every algorithm and balancing strategy.
It would be interesting to use smarter feature selection methods to help choose the best set of
predictive attributes, as sketched below.
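
As an illustration of one such filter method (a sketch only, assuming the candidate attributes are already discretised and numerically encoded into a DataFrame `X` with a label vector `y`; these names are hypothetical), mutual information can rank attributes without enumerating all 2^N subsets:

```python
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

def rank_features(X: pd.DataFrame, y, top_k: int = 15) -> pd.Series:
    """Rank candidate attributes by mutual information with the never-payer label."""
    scores = mutual_info_classif(X, y, discrete_features=True, random_state=0)
    return pd.Series(scores, index=X.columns).sort_values(ascending=False).head(top_k)
```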

There is much room for improvement regarding the inclusion of extra features that were not available at
the time, for instance more demographic data, the contact channel, and additional behavioural signals
such as complaints.

During data preparation, several cleansing techniques were applied, but there is room for a more careful
outlier analysis and sparseness removal. Cleansing techniques are especially important for the
continuous usage and risk-evaluation attributes.
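
One simple example of such a cleansing step (a sketch; the percentile thresholds below are an arbitrary assumption) is to winsorise the continuous usage attributes so that extreme outliers do not dominate discretisation and model training:

```python
import pandas as pd

def winsorise(df: pd.DataFrame, columns: list[str], lower: float = 0.01, upper: float = 0.99) -> pd.DataFrame:
    """Clip continuous attributes to the given percentile range to dampen extreme outliers."""
    out = df.copy()
    for col in columns:
        lo, hi = out[col].quantile([lower, upper])
        out[col] = out[col].clip(lo, hi)
    return out
```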

In addition to the three sampling strategies experimented with in this thesis – RUS, ROS and RUS/ROS
(Section 4.2.3) – the SMOTE sampling strategy could also help by synthesising new examples of the
never-payer class [26]. Another way to improve the performance of these models would be to include
more never-payer examples from past years to balance the dataset.
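
A sketch of how SMOTE could be plugged in, using the imbalanced-learn library (the `sampling_strategy` value below, i.e. the desired ratio of never-payers to the majority class after resampling, is only an illustrative choice):

```python
from imblearn.over_sampling import SMOTE

def smote_balance(X, y, minority_ratio: float = 0.10, seed: int = 42):
    """Synthesise new never-payer examples by interpolating between minority-class neighbours."""
    smote = SMOTE(sampling_strategy=minority_ratio, k_neighbors=5, random_state=seed)
    return smote.fit_resample(X, y)
```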

The set of best predictive models could also be combined to output a weighted score, a strategy that is
quite common in the credit industry (Section 2.2.2).
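
A minimal sketch of such a combination, assuming each chosen model outputs the probability of the NP1 class for the same set of customers (the function name and the example weights are illustrative):

```python
import numpy as np

def weighted_score(probabilities: list[np.ndarray], weights: list[float]) -> np.ndarray:
    """Combine per-model never-payer probabilities into one weighted score in [0, 1]."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                      # normalise weights so the score stays in [0, 1]
    stacked = np.vstack(probabilities)   # shape: (n_models, n_customers)
    return w @ stacked                   # weighted average per customer

# e.g. scores = weighted_score([p_dt, p_nb, p_lr], weights=[0.5, 0.3, 0.2])
```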

Finally, in addition to the views and tables that were provided for the user to get the prediction results, a
predefined report could also be added, helping business users to obtain KPIs regarding the results.

7. References

[1] K. Tsiptsis and A. Chorianopoulos, Data Mining Techniques in CRM. 2010.

[2] M. J. A. Berry and G. S. Linoff, Mastering Data Mining - The Art and Science of Customer
Relationship Management, 1st ed. Wiley, 1999.

[3] G. S. Linoff and M. J. A. Berry, Data Mining Techniques For Marketing, Sales, and Customer
Relationship Management, 3rd ed. Wiley, 2011.

[4] Y. Zhang, R. Liang, Y. Li, Y. Zheng, and M. Berry, “Behavior-Based Telecommunication Churn
Prediction with Neural Network Approach,” 2011 Int. Symp. Comput. Sci. Soc., pp. 307–310,
2011.

[5] J. Han and M. Kamber, Data Mining. Concepts and Techniques, 2nd ed. Morgan Kaufmann,
2006.

[6] R. A. Becker, C. Volinsky, and A. R. Wilks, “Fraud Detection in Telecommunications: History and
Lessons Learned,” Technometrics, vol. 52, no. 1, pp. 20–33, 2010.

[7] R. J. Bolton and D. J. Hand, “Statistical Fraud Detection: A Review,” Stat. Sci., vol. 17, no. 3,
pp. 235–255, 2002.

[8] M. Ghosh, “Telecoms fraud,” Comput. Fraud Secur., vol. 2010, no. 7, pp. 14–17, 2010.

[9] Communications Fraud Control Association, “CFCA 2015 Global Fraud Loss Survey,” 2015.

[10] P. Hoath, “Telecoms Fraud, The Gory Details,” Comput. Fraud Secur., vol. 1998, no. 1, pp. 10–
14, 1998.

[11] Communications Fraud Control Association, “CFCA 2003 Global Fraud Loss Survey,” 2003.

[12] Communications Fraud Control Association, “CFCA 2006 Global Fraud Loss Survey,” 2006.

[13] Communications Fraud Control Association, “CFCA 2009 Global Fraud Loss Survey,” 2009.

[14] Communications Fraud Control Association, “CFCA 2011 Global Fraud Loss Survey,” 2011.

[15] Communications Fraud Control Association, “CFCA 2013 Global Fraud Loss Survey,” 2013.

[16] P. Hoath, “What’s new in telecoms fraud?,” Comput. Fraud Secur., vol. 1999, no. 2, pp. 13–19,
1999.

[17] C. J. Celka and C. R. Rojas, “System and method for automated detection of never-pay data
sets,” US20080294540 A1, 27-Nov-2008.

[18] R. Mahdi, D. Villagomez, and C. Jones, “First party fraud detection system,” US20140279379
A1, 18-Sep-2014.

[19] CTT Correios de Portugal, “Postcode,” 2015. [Online]. Available:
https://www.ctt.pt/feapl_2/app/restricted/postalCodeSearch/postalCodeDownloadFiles.jspx?lang=01.
[Accessed: 30-Jun-2015].

[20] C. Shearer, H. J. Watson, D. G. Grecich, L. Moss, S. Adelman, K. Hammer, and S. A. Herdlein,
“The CRISP-DM Model: The New Blueprint for Data Mining,” J. Data Warehous., vol. 5, no. 4,
pp. 13–22, 2000.

[21] P. Chapman, J. Clinton, R. Kerber, T. Khabaza, T. Reinartz, C. Shearer, and W. Rudiger, “CRISP-
DM 1.0,” Cris. Consort., p. 76, 2000.

[22] Microsoft, “Data Mining Concepts,” Microsoft Developer Network. [Online]. Available:
https://msdn.microsoft.com/en-us/library/ms174949(v=sql.110).aspx. [Accessed: 01-Mar-2015].

[23] S. Rosset, U. Murad, E. Neumann, Y. Idan, and G. Pinkas, “Discovery of Fraud Rules for
Telecommunications - Challenges and Solutions,” in Proceedings of the Fifth ACM SIGKDD
International Conference on Knowledge Discovery and Data Mining, 1999, pp. 409–413.

[24] A. Rehman and A. R. Ali, “Customer Churn Prediction, Segmentation and Fraud Detection in
Telecommunication Industry,” pp. 1–9, 2014.

[25] V. Ganganwar, “An overview of classification algorithms for imbalanced datasets,” Int. J. Emerg.
Technol. Adv. Eng, vol. 2, no. 4, pp. 42–47, 2012.

[26] P. Brennan, “A comprehensive survey of methods for overcoming the class imbalance problem
in fraud detection,” pp. 1–107, 2012.

[27] M. R. C. De Leon and E. R. L. Jalao, “Prediction Model Framework for Imbalanced Datasets,”
pp. 33–41, 2014.

[28] Q. Gu, Z. Cai, L. Zhu, and B. Huang, “Data Mining on Imbalanced Data Sets,” 2008 Int. Conf.
Adv. Comput. Theory Eng., pp. 1020–1024, 2008.

[29] S. Kotsiantis, D. Kanellopoulos, and P. Pintelas, “Handling imbalanced datasets: A review,”
GESTS Int. Trans. Comput. Sci. Eng., vol. 30, no. 1, pp. 25–36, 2006.

[30] W. Habboub, “The Nine Data Mining Algorithms in SSAS,” TechNet Articles, 2012. [Online].
Available: http://social.technet.microsoft.com/wiki/contents/articles/6775.the-nine-data-mining-
algorithms-in-ssas.aspx. [Accessed: 16-Aug-2015].

[31] N. V Chawla, “Data Mining for Imbalanced Datasets: An Overview,” in Data Mining and
Knowledge Discovery Handbook, Springer-Verlag, 2005, pp. 853–867.

[32] N. Chinchor, “MUC-4 Evaluation Metrics,” in Proc. 4th Conf. Message Understanding (MUC-4 ’92),
pp. 22–29, 1992.

[33] I. Guyon, “A scaling law for the validation-set training-set size ratio,” AT&T Bell Lab., pp. 1–11,
1997.
