Article info
Keywords:
Late payment prediction system
Association rules
Clustering
Decision trees
Domain-driven data mining
Abstract
Most existing data mining algorithms apply data-driven data mining technologies. The major disadvantage of this approach is that expert analysis is required before the derived information can be used. In this paper, we thus adopt a domain-driven data mining strategy and utilize association rules, clustering, and decision trees to analyze data from fixed-line users in order to establish a late payment prediction system, namely the Combined Mining-based Customer Payment Behavior Predication System (CM-CoP). The CM-CoP can indicate potential users who may not pay their fees on time. In the implementation of the proposed system, association rules were first used to analyze customer payment behavior, and the results of the analysis were used to generate derivative attributes. Next, a clustering algorithm was used for customer segmentation. The cluster of customers who paid their bills was identified and then deleted to reduce data imbalance. Finally, a decision tree was utilized to predict and analyze the rest of the data using the derivative attributes and the attributes provided by the telecom providers. In the evaluation, the average accuracy of the CM-CoP model was 78.53%, with an average recall of 88.13% and an average gain of 11.2%, over a six-month validation. Since the prediction accuracy of the existing method used by telecom providers was 65.60%, the prediction accuracy of the proposed model was about 13 percentage points higher. In other words, the results indicate that the CM-CoP model is effective and performs better than the existing approach used by the telecom providers.
© 2013 Elsevier Ltd. All rights reserved.
1. Introduction
The telecom market has developed rapidly, and telecom providers have spared no effort to increase their revenue by winning more customers and improving performance. However, they still have to deal with late payments from customers. Most customers pay their bills on time, but some do not, either intentionally or because they forget to make the payment. These two behaviors are collectively called late payments.
There are many types of fraud, and telephone fraud is a common one (Taniguchi, Haft, Hollmen, & Tresp, 1998). The literature indicates that telephone fraud causes losses of two to three billion US dollars each year, and that losses from telephone fraud comprise 1.5% to 5% of total turnover. The traditional monitoring methods used by fixed-line providers identify abnormal situations after the expiration of a payment period, but by then the loss has already occurred. Obviously, traditional monitoring methods do not satisfy the telecom providers' needs for risk control. A number
Corresponding author.
E-mail addresses: chchen@mail.tku.edu.tw (C.-H. Chen), 081863@mail.tku.edu.tw, chiang@cs.tku.edu.tw (R.-D. Chiang), tfwu945@gmail.com (T.-F. Wu), HuanChen.Chu@gmail.com (H.-C. Chu).
0957-4174/$ - see front matter © 2013 Elsevier Ltd. All rights reserved.
http://dx.doi.org/10.1016/j.eswa.2013.06.001
records. If the accuracy rate of a rule is lower than the set threshold value, the system highlights it for review and the providers decide whether to delete it. Next, providers may use new data and the constructed model to produce a new rule. If the new rule is verified, the system adds it to the database, thus maintaining the system's predictive power. After the system was built, the provider spent six months conducting verifications. A comparison with the data from the providers indicated that, even though the testing environment was different from the conditions at the providers, the efficacy of the rules produced by CM-CoP is greater than that of the existing rules used by telecom providers. Thus, the two main contributions of this work are as follows:
1. First, we adopt a domain-driven data mining strategy and utilize data mining techniques to analyze data from fixed-line users in order to establish a late payment prediction system, namely the Combined Mining-based Customer Payment Behavior Predication System (CM-CoP).
2. Second, the average accuracy of the CM-CoP model was 78.53%, with an average recall of 88.13% and an average gain of 11.2%, over a six-month validation.
Since this paper focuses on a real-world application in the telecom market, background concepts related to association rules, clustering, and the decision tree method are introduced only briefly. This paper is organized as follows: Section 2 introduces D3M and related work; the framework of the proposed CM-CoP is described in Section 3, which includes the system flow, data preprocessing, and data mining flow; Section 4 presents cases and verification results; and conclusions are offered in Section 5.
2. Related work
This section introduces related data mining methods and domain-driven data mining concepts and structures. Data mining approaches, including association rules, clustering, and decision trees, are described in Section 2.1. The concepts of domain-driven data mining and relevant existing mining structures are introduced in Section 2.2.
2.1. Related data-mining approaches
Data mining aims to extract useful knowledge and patterns from existing data to solve a specific issue. To date, it has been used in many different fields, such as shopping cart analysis (Agrawal, Imielinski, & Swami, 1993), network intrusions (Tajbakhsh, Rahmati, & Mirzaei, 2009), and stock market analysis (Au & Chan, 2003; Hadavandi, Shavandi, & Ghanbari, 2010). One common use is the mining of association rules from transaction data, i.e. an analysis of correlations between products purchased by customers. An association rule is represented as A → B, where A and B are common products, and the rule states that if product A is purchased, product B will be purchased together with it. Two measures are used to gauge the validity of association rules: support and confidence. The earliest association rule mining was suggested by Agrawal et al. (1993), and the three main steps are: (1) produce candidate itemsets; (2) produce frequent itemsets based on minimum support; and (3) produce association rules from the frequent itemsets based on minimum confidence.
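The two measures can be illustrated with a minimal sketch. The transactions and item names below are hypothetical and are not taken from the telecom data; the sketch only shows how support and confidence of a rule A → B are computed.

```python
from typing import FrozenSet, List


def support(itemset: FrozenSet[str], transactions: List[FrozenSet[str]]) -> float:
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent: FrozenSet[str], consequent: FrozenSet[str],
               transactions: List[FrozenSet[str]]) -> float:
    """support(A and B together) / support(A); 0.0 if A never occurs."""
    base = support(antecedent, transactions)
    if base == 0:
        return 0.0
    return support(antecedent | consequent, transactions) / base


# Hypothetical shopping-cart transactions (illustration only).
transactions = [frozenset(t) for t in (
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
)]
A, B = frozenset({"bread"}), frozenset({"milk"})

rule_support = support(A | B, transactions)       # 3 of 5 transactions
rule_confidence = confidence(A, B, transactions)  # 3 of the 4 with bread
```

A rule would be kept only if both values reach the chosen minimum support and minimum confidence thresholds.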
Clustering groups data based on similarity and is an unsupervised mining technology. The k-means method has been widely used (McQueen, 1967). The inputs for k-means include the given itemset and the designated cluster number k, where k is an integer greater than 1. In k-means,
where $t_{i,j}$ and $b_{i,j}$ are the technical and business interestingness of model $m_j$, $[i_{i,j}()]$ indicates the alternative checking of unified interestingness, $\bigcup^{J} P_j$ is the merger function, and $X_m$ is the meta-knowledge consisting of meta-data about patterns, features and their relationships.
In other words, the CM-AKD consists of multiple steps for pattern extraction and refinement on the whole dataset. The process is first split into $J$ steps of mining based on business understanding, data understanding, exploratory analysis and goal definition. Then, each step $j$ is used to extract a pattern sub-set $P_j$ based on technical significance ($t_i()$). The pattern sub-set $P_j$ is then fed into step $j+1$ to guide the corresponding feature construction and the extraction of pattern set $P_{j+1}$. The derived pattern sub-sets are then merged into a final pattern set ($P$) based on the environment ($e$), domain knowledge ($X_m$) and business expectations ($b_i$). Finally, the merged pattern set $P$ is converted into business rules as final deliverables that reflect business preferences and needs. Based on the CM-AKD framework, the details of the proposed customer payment behavior prediction system are described in the next section. Note that the CM-AKD framework is one of the domain-driven data mining frameworks proposed by Cao et al. for mining actionable knowledge rules (patterns).
Based on the CM-AKD framework, we propose the CM-CoP framework for predicting telecommunications customer payment behavior. The domain-driven data mining strategy focuses on how to take objective and subjective interestingness, in terms of both technical and business goals, into consideration for deriving actionable rules (patterns). The proposed domain-driven data mining strategy is described in terms of the five items as follows. First, based on the CM-AKD framework, we propose a system, namely the Combined Mining-based Customer payment behavior Predication framework (CM-CoP), which uses heuristic methods (Step-1 to Step-3 mining, see Fig. 1) for continuous testing to solve problems. Second, the main objective of the CM-CoP is to use customer payment and communication behaviors to predict which users might not pay their bills. Third, in the proposed approach, we take the attributes provided by the telecom providers (as business interestingness) and the rules produced by association rule mining (as technique interestingness) into consideration to achieve more accurate results. For the last two items, telecommunications customer payment behavior prediction is a complex enterprise application, as mentioned in the previous section, and, after verification by the provider, the accuracy of the proposed model was higher than that of the existing model.
3. The proposed predicting customer payment behavior system
This section introduces the structure of the method. First, we describe the proposed CM-CoP system framework according to the CM-AKD framework. Next, the proposed algorithm for predicting customer payment behavior is described, including using association rules to produce payment behavior patterns for analysis and derived attributes, utilizing the clustering technique to reduce imbalance, and combining the derived attributes with the attributes provided by the telecom providers to construct decision trees that predict user payment behavior patterns.
3.1. CM-CoP system framework
Based on the CM-AKD framework, we propose a system, namely the Combined Mining-based Customer payment behavior Predication framework (CM-CoP), which combines data mining techniques, including association rules, clustering, and decision trees, with industry knowledge provided by fixed-line providers to analyze fixed-line user data. The purpose of the CM-CoP is to use customer payment and communication behaviors to predict which users might not pay their bills. The proposed CM-CoP framework is shown in Fig. 1.

Fig. 1. The proposed CM-CoP framework. [Figure labels: Telecom Provider; Payment Records; CDR/BASE Database; ETL Transformation; Extracted DB; Step-1 Mining (Association Mining, Association Patterns); Step-2 Mining (Clustering Analysis, Desired Clustering); Step-3 Mining (Decision Trees: Decision Tree 2, ..., Decision Tree n); Domain Knowledge; Meta Knowledge; ti,1(), ti,2(), bi,2(); Experts Validation; Validated Rules (R1: X1 → Y1, R2: X2 → Y2, ..., Rn: Xn → Yn); Actionable Rule Set (Model) and its updating; Predicting Late Payment Customers; Late Payment Customers List.]
The overall system execution flow is shown in Fig. 1, and it includes two parts: (1) the domain-driven mining phase and (2) the model tuning phase. In the first part, ETL (extraction, transformation, loading) is utilized to derive historical CDR data. Next, this study uses association rules to analyze data about user bills based on the telecom providers' practices to create a behavioral model of potential late-paying users. According to the derived rules, in combination with the professional knowledge of the providers, the derived attributes of payment behavior are established. The clustering technique is then used to derive the desired groups with business interestingness. Finally, decision tree algorithms are utilized to analyze the data using various attributes, and the derived rules are stored in a database for validation.
In the second part, after the model is constructed, its efficiency has to be evaluated to maintain its predictive power. Generally, when the time frame is too long or policies change, user behavior may change, and the accuracy and recall of the system's predictions will decrease. The system's design must take this into account. The system automatically verifies and compares the accuracy rate of each rule when it retrieves the monthly user payment records. If the accuracy rate of a rule is lower than the set threshold value, the system highlights it so that the providers can review the rule. Apart from this, the providers can use new data to create rules from the constructed model, and if a new rule is verified, the system adds the rule to the database, thus maintaining predictive power.
3.2. The proposed domain-driven mining approach
In this subsection, based on the proposed CM-CoP framework, the CM-CoP algorithm is proposed for predicting customer payment behavior. The details of the proposed CM-CoP algorithm are stated in Table 1.
From Table 1, the proposed CM-CoP algorithm can be divided into four parts: association pattern mining (lines 2–3), clustering analysis (lines 4–5), decision tree mining (lines 6–11), and rule evaluation (lines 12–16). In the first part, the preprocessed payment records are used to derive association patterns that represent customer behavior (see Section 3.3 for more details). Then, since only a few customers pay their bills late, the proportion between customers who pay late and those who pay on time is imbalanced. Thus, the data imbalance should be taken into consideration before building the model. Here, after consulting the experts of the telecom providers to derive appropriate attributes, the clustering technique is used to divide customers into groups. Groups that can be identified as containing customers who pay their bills on time are then removed. The remaining groups contain the customers that we need to focus on. However, it is not an easy task to find general actionable rules for predicting customer behavior, and directly using the rules produced by association rule mining cannot solve the problem efficiently. To overcome this issue, we select different sets of attributes from the derived association patterns (as technique interestingness) and the consulted attributes (as business interestingness) to construct decision trees. Each rule in the decision trees is then evaluated by the telecom providers to enhance its predictive ability. Finally, the verified rules are collected as the operable business rule set (see Section 3.4 for more details).
Furthermore, the model requires an efficiency evaluation to maintain its predictive power. Generally, when the time frame is too long or policies change, user behavior may change, and the predictive power of the system will be reduced. To avoid this, the system automatically verifies and compares the accuracy rate of each rule when it analyzes the monthly user payment records, using the operable business rule tuning procedure shown in Table 2.

Table 2
The operable business rule tuning procedure.

Procedure: The operable business rule tuning procedure
Input: The operable business rule set R′, a set of newly arrived payment records newPR, a CDR/BASE dataset CDR, meta knowledge Xm, domain knowledge Xd, an accuracy threshold k.
Output: The operable business rule set R.
Procedure RuleTuning() {
    R′ ← R′ ∪ CM-CoP(CDR, newPR, Xm, Xd);
(1) For each rule Ri in R′
(2)     If expertValidation(Ri, k, Xm, Xd) == false
(3)         Remove Ri from R′;
(4)     End If
(5) End For
(6) Output the tuned operable business rule set R′;
}

As shown in Table 2, the new payment records are first used to generate new operable business rules (line 1). Then, for each rule in the operable business rule set R′, experts verify its predictive power. If its predictive power is lower than a threshold, the rule is removed from the operable business rule set (lines 2–6). Since we focus on analyzing real data to predict the customers who may make delayed payments, the goal of this paper is to design the CM-CoP framework (Fig. 1) and its algorithms (Tables 1 and 2). Therefore, the underlying data mining algorithms are implemented with the existing tools in IBM Intelligent Miner.
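The tuning loop of Table 2 can be sketched as follows. This is only an illustration under stated assumptions: in the paper the check is an expert validation, whereas here it is replaced by an automatic accuracy test against a threshold; the rule representation, the record fields (`late_last_month`, `calls`, `late`) and all data are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical rule representation: (name, predicate over a customer record).
Rule = Tuple[str, Callable[[Dict], bool]]


def rule_accuracy(rule: Rule, records: List[Dict]) -> float:
    """Share of customers flagged by the rule who actually paid late."""
    _, predicate = rule
    flagged = [r for r in records if predicate(r)]
    if not flagged:
        return 0.0
    return sum(r["late"] for r in flagged) / len(flagged)


def tune_rules(rules: List[Rule], new_rules: List[Rule],
               records: List[Dict], threshold: float) -> List[Rule]:
    """Merge newly mined rules into the rule set, then drop weak rules.

    Stand-in for expertValidation in Table 2: a rule survives only if its
    accuracy on the new payment records reaches `threshold`.
    """
    candidates = list(rules) + list(new_rules)
    return [r for r in candidates if rule_accuracy(r, records) >= threshold]


# Hypothetical monthly payment records.
records = [
    {"late_last_month": True, "calls": 120, "late": True},
    {"late_last_month": True, "calls": 15, "late": True},
    {"late_last_month": True, "calls": 80, "late": False},
    {"late_last_month": False, "calls": 200, "late": False},
    {"late_last_month": False, "calls": 150, "late": False},
    {"late_last_month": False, "calls": 10, "late": False},
]
existing = [
    ("A", lambda r: r["late_last_month"]),  # accuracy 2/3
    ("B", lambda r: r["calls"] > 100),      # accuracy 1/3
]
mined = [("C", lambda r: r["late_last_month"] and r["calls"] < 50)]  # accuracy 1/1

kept = [name for name, _ in tune_rules(existing, mined, records, threshold=0.6)]
```

With a threshold of 0.6, rule B is removed while rules A and C remain in the rule set.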
3.3. Payment behavior pattern analysis and derived attributes
The user communication characteristics consist of CDR information. Users' telephone usage habits are expressed by the start time, the number of users to make the call, the sum, duration, call type, call variation information, and other statistical data. As specific fraudulent behavior may have a fixed behavior pattern, most fraudulent behavior can be found by examining the CDR data. In combination with the professional knowledge of the providers and user data, fraudulent behavior can then be determined. Generally, normal users sometimes delay payment because of special reasons or habitual delays. Although these delays are different from fraudulent behavior, they also have fixed behavior patterns. To identify late-paying users, this behavior must be compared with customer payment records. With the derived attributes (in the X → no format) of user payment and call behavior patterns, obtained from the CDR through discussions with the telecom providers, the proposed approach can be utilized to predict whether a user will default on a debt. As shown in Fig. 1, the system requires data on CDR, the customer base, and customer payment status. The attributes of the important data for user payment status are shown in Table 3.
As shown in Table 3, when a user's Payment Status is "00", the user paid the bill on time; when the Payment Status is "01", the user failed to pay the bill on time; and when the Payment Status is "02", the user never paid the bill. Meanwhile, there is a delay of 35 days between the time the telephone is used and the time by which a prediction is needed as to whether a user may default on their debt for more than ten days. Data acquisition and prediction can be completed in 20 days. Service personnel can then remind customers who may make delayed payments to pay the fee. Based on the definition used by telecom providers, there are six billing cycles. The data provided by the telecom providers was taken from customer data in one area and payment cycle. One billing cycle is listed in Table 4. Differences in regions and billing cycle time points are not considered.
Assuming that it is early in the seventh month, we need to predict the correlation between payment behavior patterns in the sixth month and payment habits in the first to fifth months. Since the customer payment status for the fifth month is still unknown early in the seventh month, the payment records for the fifth month are not used in the analysis of the customer's payment pattern. Thus, only the relation between the user payment behavior from the first to the fourth month and late payment behavior in the sixth month can be described.
Next, in an analysis of user payment behavior patterns, the payment records for the first to sixth month are summarized in the payment status table, as shown in Table 4, and the data for all of the months is aggregated to the original payment status. Also, attention is paid to whether payment defaults exceed ten days, but not to whether the users eventually paid their bill. When the user payment status for the sixth month in Table 4 is "00", the new payment status is "yes"; otherwise it is "no". "Yes" indicates a timely payment, and "no" indicates delinquency. For example, in Table 4, the user payment status for the sixth month is changed from "02" to the new payment status "no". The user payment status for the fourth month then changes from "01" to the new payment status "01C", where the suffix "C" marks it as the payment status of the fourth month. In this way, the limitation on repeated occurrences of the same payment status item in the association rule analysis can be overcome.
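The recoding described above can be sketched as follows, using the single customer of Table 4 as input. The function name and data layout are assumptions made for illustration; only the month codes (C, D, E, F) and the yes/no target come from the text and Table 4.

```python
from typing import Dict, List, Tuple

# Month suffixes from the note under Table 4: C = 4th month, ..., F = 1st month.
MONTH_CODES = {4: "C", 3: "D", 2: "E", 1: "F"}


def recode_payment_status(statuses: Dict[int, str]) -> Tuple[List[str], str]:
    """Turn monthly payment statuses into association-rule items and a target.

    `statuses` maps bill month (1..6) to the original status ("00", "01", "02").
    Month 5 is ignored because its status is still unknown at prediction time.
    """
    items = [statuses[m] + MONTH_CODES[m]
             for m in sorted(MONTH_CODES, reverse=True) if m in statuses]
    target = "yes" if statuses[6] == "00" else "no"
    return items, target


# Customer 001 from Table 4.
customer_001 = {6: "02", 4: "01", 3: "02", 2: "00", 1: "02"}
items, target = recode_payment_status(customer_001)
# items  -> ["01C", "02D", "00E", "02F"]
# target -> "no"
```

Because each item carries its month code, the status "02" in the third month ("02D") and in the first month ("02F") remain distinct items in the transaction.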
During the analysis using association rules, the payment status
of each customer during the period from the first to fourth month
Table 3
Data format of original CDR payment situations.
Attribute name
Description
Amount
Installed date
Installation date
Billing cycle
Payment status
Cycle number
Billing period
Payment deadline
Table 4
New payment status.

Customer ID   Bill month   Payment status   New payment status
001           06           02               No
001           04           01               01C
001           03           02               02D
001           02           00               00E
001           01           02               02F

C: the 4th month; D: the 3rd month; E: the 2nd month; F: the 1st month

Fig. 2. Data mining implementation flow: Start → CDR/BASE → Clustering → Remove customers who paid punctually → DB → Decision tree 1, ..., Decision tree n → Select high-accuracy rules → Rules in rule base → End.
where it is found that some users may indulge in numerous pay-per-calls over a short time. Since this type of customer often cannot pay their bills, if these customers had similar behavior in the past, their payment behavior is used to identify whether they usually paid the fee normally. If the customers have no similar past behavior, they are included in the forecast list.
In the second phase, the behavior rules for normal customers from the first phase are used to eliminate the customers who satisfy the rules, and the percentage of default customers in the data therefore increases. In this phase, a decision tree is formed to make predictions from and analyze the rest of the data. However, there are many different kinds of default behavior. The dates on which customers signed up are different, and some new customers have no historical data. In order to find the customers who meet the objective, different attributes are selected for analysis to produce different decision trees and rules. After verification, the rules that are more than 80% accurate are selected and stored in an SQL database in order to find the target customers.
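The effect of eliminating normal customers on the class balance can be shown with a small sketch. All numbers and field names below are hypothetical and chosen only for illustration; they are not the provider's data.

```python
from typing import Dict, List, Set


def late_share(customers: List[Dict]) -> float:
    """Fraction of customers who paid late."""
    return sum(c["late"] for c in customers) / len(customers)


def drop_normal_clusters(customers: List[Dict], normal_clusters: Set[int]) -> List[Dict]:
    """Remove customers that fall into clusters identified as paying on time."""
    return [c for c in customers if c["cluster"] not in normal_clusters]


# Hypothetical data: cluster 0 holds 60 punctual customers,
# cluster 1 holds 40 customers of whom 7 paid late.
customers = (
    [{"cluster": 0, "late": False} for _ in range(60)]
    + [{"cluster": 1, "late": i < 7} for i in range(40)]
)

before = late_share(customers)                                  # 7 / 100
after = late_share(drop_normal_clusters(customers, {0}))        # 7 / 40
```

In this toy case the share of late payers rises from 7% to 17.5% once the punctual cluster is removed, which is the kind of rebalancing the second phase relies on before the decision trees are built.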
4. Experimental results
This study used data from fixed-line providers in Taiwan for case discussions. The fixed-line providers provided CDR and payment data for customers from one base station, area, and payment cycle for a period of twelve months. The data was then used for model construction. According to statistics, 7% of the users still had not paid their bills more than ten days after they were overdue. Due to the confidentiality of the service agreements, this study only discusses the data mining process in the second phase as well as some verification results.
4.1. Training data and testing data
Since late payment behavior may change over time, the predictive power of the model may decrease. Thus, in order to overcome this problem and prevent the model from depending on historical data, the system is operated so that the datasets of the model cover different time intervals. The time window concept is used to set the datasets for the model construction (including the training data and testing data). The fixed-line provider offered twelve months of CDR data and payment data for model construction. The data was divided into six sections in month order, and a model was constructed on each in turn to determine the useful rules, so that different datasets for the model would cover different time intervals. In other words, the sliding window size was set at six, and since the accuracy of the last month could not be evaluated, six datasets were generated for model construction. During each time interval, the customer data in the sixth month was used as the training data, and the data from the first month to the fifth month were used as historical data to construct the prediction model. The data from the following months were used as testing data to verify the derived rules. Since the derived rules would become ineffective over time, less accurate rules are deleted and new rules are generated during each system test in order to prevent the model from depending on historical data.
There are many reasons why customers pay their bills late. Some fraudulent behaviors may cause heavy short-term losses to providers. We thus expect that the system can analyze and predict user behavior from a small amount of CDR data. Based on the experience of the past three months and on users who made habitual late payments, we also found that fraudulent behavior could be identified based on the behavioral difference between the current week and past weeks, and between the current week and the past several months. To prevent special fraudulent behavior from being diluted by other behavior, the training data from each month was subdivided into different datasets for one week, two weeks, three weeks and one month. Ten datasets in total were then generated: four, three, two, and one datasets for one week, two weeks, three weeks, and one month, respectively. Thus, a stable model that adapts easily over time was established, and the effectiveness of the data mining was increased.
4.2. Case discussion
In the proposed system, this study used the function fields and supplemental field shown in Table 5 to cluster and describe user behavior in the first phase, in order to find the cluster of normal customers. Next, a decision tree was used to analyze the derived clusters.

Table 5
Important related fields.

Field name        Definition   Function
MAXAMOUNT                      Function field
TOTALAMOUNT                    Function field
STDAMOUNT                      Function field
AVGAMOUNT                      Function field
NUMBERCOUNT                    Function field
PAYTYPE                        Function field
PAYLASTMONTH                   Function field
NUMBERCLRS3                    Function field
TOTALDURITIONS                 Function field
FIRSTCALL                      Function field
PAYSTATUS                      Supplemental field

After clustering, about thirty clusters were derived. However, most of them contain only a small share of users (less than 2% of all users). The two representative clusters, namely cluster [6]6 and cluster [4]3, which account for 55.92% and 31.49% of all users, respectively, were used for further analysis. The clustering results are shown in Fig. 3.
From Fig. 3, this study checked the supplemental field of the two clusters, i.e. the PAYSTATUS distribution. Cluster [4]3 has 5644 instances. According to the PAYSTATUS, its
$$\text{Prediction accuracy} = \frac{\text{number of correctly predicted late payment users}}{\text{total predicted number of late payment users}} = \frac{A}{B}$$
Table 6
Verification of statistical results (one block per validation month).

Month   Item                     Prediction results   Subjects investigated
        Late payment             1576 (A)             1837 (C)
        Normal payment           436                  1011
        Total number of users    2012 (B)             2848 (D)
        Predictive accuracy 78.33%; Recall 85.79%; Accuracy of provider 64.50%

        Late payment             1592                 1778
        Normal payment           455                  987
        Total number of users    2047                 2765
        Predictive accuracy 77.77%; Recall 89.54%; Accuracy of provider 64.30%

        Late payment             1588                 1798
        Normal payment           403                  979
        Total number of users    1991                 2777
        Predictive accuracy 79.76%; Recall 88.32%; Accuracy of provider 64.75%

        Late payment             1602                 1795
        Normal payment           424                  855
        Total number of users    2026                 2650
        Predictive accuracy 79.07%; Recall 89.25%; Accuracy of provider 67.74%

        Late payment             1588                 1788
        Normal payment           435                  964
        Total number of users    2023                 2752
        Predictive accuracy 78.50%; Recall 88.81%; Accuracy of provider 64.97%

        Late payment             1605                 1843
        Normal payment           460                  894
        Total number of users    2065                 2737
        Predictive accuracy 77.72%; Recall 87.09%; Accuracy of provider 67.34%
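The percentages in Table 6 can be recomputed from the counts with a short sketch. Prediction accuracy = A/B is the definition given above. The ratios used for recall (A/C) and for the accuracy of the provider (C/D) are assumptions: they are not stated in this section, but they reproduce every percentage in the table.

```python
from typing import Dict, List, Tuple

# (A, B, C, D) per validation month, copied from Table 6:
# A = correctly predicted late payers, B = all predicted late payers,
# C = late payers among the subjects investigated, D = all subjects investigated.
TABLE6: List[Tuple[int, int, int, int]] = [
    (1576, 2012, 1837, 2848),
    (1592, 2047, 1778, 2765),
    (1588, 1991, 1798, 2777),
    (1602, 2026, 1795, 2650),
    (1588, 2023, 1788, 2752),
    (1605, 2065, 1843, 2737),
]


def month_metrics(a: int, b: int, c: int, d: int) -> Dict[str, float]:
    """Return the three percentages reported for one month."""
    return {
        "prediction_accuracy": 100 * a / b,  # A/B, i.e. precision in standard terms
        "recall": 100 * a / c,               # assumed to be A/C
        "provider_accuracy": 100 * c / d,    # assumed to be C/D
    }


metrics = [month_metrics(*row) for row in TABLE6]
averages = {k: sum(m[k] for m in metrics) / len(metrics) for k in metrics[0]}
gap = averages["prediction_accuracy"] - averages["provider_accuracy"]
```

Under these assumptions the six-month averages come out close to the reported 78.53%, 88.13% and 65.60%, and the gap between the model and the provider's method is roughly 13 percentage points.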