TODEREAN
NEURAL NETWORKS
FOR CHURN PREDICTION
IN THE MOBILE
TELECOMMUNICATIONS INDUSTRY
Reviewers
Prof. Dr. Eng. Sergiu Nedevschi
Prof. Dr. Eng. Gabriel Oltean
ISBN 978-973-720-778-4
TABLE OF CONTENTS
INTRODUCTION
1.1. Introduction
1.4. Churn Prediction – Literature Review
1.5. Conclusions
2.3. Conclusions
3.1. Introduction
3.8.1. Learning Parameters Adaptation
3.12. Conclusions
4.3. Conclusions
5. CONCLUSIONS
BIBLIOGRAPHY
LIST OF FIGURES
LIST OF TABLES
Table 2.2. Distribution of Variable State relative to Variable Area Code
ACRONYMS, NOTATIONS, AND SYMBOLS
ACC Accuracy
BP Backpropagation
FN False Negatives
FP False Positives
PPV Positive Predictive Value
TN True Negatives
TP True Positives
VC Vapnik-Chervonenkis
ei Eigenvector i
E Error function
η Learning rate
kB Boltzmann constant
ℒ Lagrange function
λc Regularization parameter
λi Eigenvalue i
$o_n^{(l)}$ Output vector for instance n in layer l
ρ Correlation matrix
$\sigma_{ii}^2$ Covariance
$u_n^{(l)}$ Input vector for instance n in layer l
$w_i^{(l-1)}$ Weight vector from layer (l − 1) to layer l of neuron i
xi Independent variable i
yi Dependent variable i
Yi Principal component i
INTRODUCTION
As a consequence, a company active in the mobile telecommunications
industry must identify the customers that are likely to churn before they are
actually going to act and avoid contacting subscribers that will continue to use
the service in any event [34]. Thus, the need to develop a predictive model that
precisely identifies these types of customers is vital. This model must be capable
of identifying customers that tend to switch the mobile provider in the near future.
Because of the nature of the prepaid mobile market, which is not based on a
contract, this is not an easy and well-defined task, making the implementation of
such a predictive model a complex assignment.
The approach chosen in this book for customer churn prediction is based on the
Cross Industry Standard Process for Data Mining (CRISP-DM) methodology.
The data mining process presented in Chapter 1 is an interdisciplinary field
which includes machine learning, pattern recognition, statistics, and
visualization techniques to extract knowledge from large datasets [22]. By using
data mining techniques one can implement predictive models to discover trends
and past behaviors, allowing organizations to make smart decisions based on
knowledge from data.
proposed for classification. The methodology presented in this book is scalable
and can be applied to any real-world classification problem.
with various values of the independent variables. The relationships discovered
are presented in the form of a structure called classification model. The
classification models are one of the most studied models, possibly with the
highest practical relevance.
neurons during the learning process. To accelerate the training of the network,
two methods are used: learning rate adaptation, which initializes the learning
rate with a higher value and gradually decreases it as learning progresses, and
a weight initialization method based on the simulated annealing algorithm [81].
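The learning rate adaptation mentioned above can be sketched as follows. This is only an illustration: the exponential decay form, the initial rate `eta0`, and the factor `decay` are assumptions chosen for the example, since the actual schedule used in the book is not shown in this excerpt.

```python
def decayed_learning_rate(eta0: float, decay: float, epoch: int) -> float:
    """Learning rate for a given epoch under exponential decay.

    eta0  -- initial (high) learning rate; hypothetical value
    decay -- factor in (0, 1) applied once per epoch; hypothetical value
    """
    if not 0.0 < decay < 1.0:
        raise ValueError("decay must lie strictly between 0 and 1")
    return eta0 * decay ** epoch


# Learning rates for the first five epochs: start high, then shrink gradually.
schedule = [decayed_learning_rate(eta0=0.5, decay=0.9, epoch=t) for t in range(5)]
```

A large initial rate lets the weights move quickly at the start of training, while the smaller later rates limit oscillation around a minimum of the error function E.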
CHAPTER 1
1. DATA MINING PROCESS
1.1. Introduction
The data mining process can be defined in several ways which differ due
to the emphasis put by each on a different aspect of this process. One of the
earliest definitions states that the data mining process involves the nontrivial
extraction of implicit, previously unknown, and potentially useful information
from data [48].
data mining activities.
While the data mining process continued to develop, the definitions of this
process came to focus on certain aspects of the information and its source. In
1996, Fayyad proposed a definition in which KDD (Knowledge Discovery in
Databases) is the nontrivial process of identifying valid, new, and useful
patterns in data [46]. This definition focuses on the patterns in data, not just
on the information. These patterns are not easily distinguishable and can only
be identified by applying analysis algorithms that can evaluate complex
nonlinear relationships among the independent variables, and between these
independent variables and the dependent variable. This definition of KDD emerged
together with the increasing popularity of machine learning techniques and their
introduction into this process. Algorithms such as decision trees, neural
networks, support vector machines, and Bayesian networks make it much easier to
analyze nonlinear patterns in data than parametric statistical algorithms do.
This is primarily because machine learning algorithms work similarly to the way
people act, rather than by calculating metrics based on average values and data
distributions. Initially, the term KDD referred only to the process of building
the model, but as the practice grew, the process expanded and began to include
several other operations.
[22]. A last definition refers to the data mining process as the analysis of large-
scale observational datasets to discover unknown relationships and summarize
the data in an easy to understand format [62].
common and important dependencies.
[Figure: the CRISP-DM cycle, showing the phases Business Understanding, Data
Understanding, Data Preparation, Modeling, Evaluation, and Deployment arranged
around the Data.]
The outer circle symbolizes the cyclical nature of the data mining process.
This process continues even after a solution is implemented because once the
patterns are discovered and the solution presented, it can lead to new and
perhaps even more complex demands from the decision-makers of the mobile
telecommunication company or of any company in general. The data mining
processes that are implemented will always benefit from the knowledge gained
from previous experiences. Each phase is briefly presented below:
▪ Data understanding – Begins with data collection and continues with
the activities that build familiarity with the data, identify data
quality issues, reveal first insights into the data, and detect
subsets from which to form assumptions about hidden information and patterns.
▪ Data preparation – Includes all the activities required to build the final
dataset used for modeling in the next phase. These activities have no
predetermined order and can be repeated several times until
the desired outcome is reached.
▪ Deployment – During this phase the results of the analysis are organized
and presented in the proper way so that the company can benefit
completely. In most cases, the models are applied within the organization's
decision processes in real time. In certain instances, the results are
deployed throughout the organization as a simple report or as a complex
repetitive data mining process.
Segmentation and association are two of the most well-known unsupervised
learning methods [58].
Typically, the variables, also named fields or attributes, are of two types:
nominal (values belong to an unordered set) and continuous (values are real
numbers). If the variable $a_i$ is of nominal type, its definition domain is
denoted by $dom(a_i) = \{v_{i,1}, v_{i,2}, \ldots, v_{i,|dom(a_i)|}\}$, where
$|dom(a_i)|$ refers to its finite cardinality. Similarly,
$dom(y) = \{c_1, c_2, \ldots, c_{|dom(y)|}\}$ represents the definition
domain of the dependent variable. Variables of continuous type have an infinite
cardinality.
It is generally assumed that the tuples from the training dataset are
generated randomly and independently in accordance with an unknown
fixed joint probability distribution, D. This is a generalization of the
deterministic case, in which a tuple is classified using a function y = f(x).
In this book, the notation π for the projection of tuples and σ for the
selection of tuples are used [60].
1.3.2. Classification
implies inferring its definition from a set of instances. The definition can be
formulated explicitly or implicitly, but in either case it assigns, or does not
assign, each instance to the concept. In conclusion, a concept can be viewed as
a function defined on the
input space with values in a Boolean set, namely, c: X → {− 1, 1}. Alternatively,
a concept c can be defined as a subset of X, namely, {x ∈ X: c(x) = 1}. A set of
concepts is known as a concept class C.
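\[ \varepsilon(I(S), D) = \sum_{\langle x, y \rangle \in U} D(x, y)\, L(y, I(S)(x)) \tag{1.1} \]
\[ L(y, I(S)(x)) = \begin{cases} 1 & \text{if } y \neq I(S)(x) \\ 0 & \text{if } y = I(S)(x) \end{cases} \tag{1.2} \]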
based on a training dataset and is able to generalize the relationships between
the independent variables and the dependent variable. For instance, an induction
algorithm constructs a classification model with tuples and their class labels as
input training data.
more about its quality, refine its parameters during the iterative data mining
process, and select the best performing model from a set of models.
\[ \hat{\varepsilon}(I(S), S) = \sum_{\langle x, y \rangle \in S} L(y, I(S)(x)) \tag{1.3} \]
where L(y, I(S )(x)) is the cost function defined in equation (1.2).
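As a minimal sketch of equations (1.2) and (1.3), the training error can be computed by summing the zero-one loss over the training dataset. The classifier `toy_classifier` and the sample `toy_sample` below are hypothetical and serve only to illustrate the computation.

```python
from typing import Callable, Sequence, Tuple


def zero_one_loss(y_true: str, y_pred: str) -> int:
    """Equation (1.2): 1 for a misclassification, 0 otherwise."""
    return 0 if y_true == y_pred else 1


def training_error(classifier: Callable[[int], str],
                   sample: Sequence[Tuple[int, str]]) -> int:
    """Equation (1.3): sum of the zero-one loss over the training sample S."""
    return sum(zero_one_loss(y, classifier(x)) for x, y in sample)


def toy_classifier(service_calls: int) -> str:
    """Hypothetical model: predict churn for more than 3 customer service calls."""
    return "Yes" if service_calls > 3 else "No"


# Hypothetical (customer service calls, churn label) pairs.
toy_sample = [(0, "No"), (5, "Yes"), (4, "No"), (1, "No"), (2, "Yes")]

errors = training_error(toy_classifier, toy_sample)
error_rate = errors / len(toy_sample)  # share of misclassified tuples
```

Dividing the sum by the number of tuples, as in the last line, gives the training error as a proportion, which is easier to compare across datasets of different sizes.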
an estimate. The only downside is that this error is an optimistically biased
estimate, especially if the induction algorithm overfits the training data.
The literature offers two kinds of methods for estimating the generalization
error: theoretical and empirical.
generalization error. Induction algorithms that do not have enough free
parameters may generate a low training error but, on the other hand, can yield a
good generalization error. However, taking into account the characteristics and
volume of the available training data, the optimal capacity can be achieved, and
thus the best generalization error obtained.
VC Dimension
VC theory addresses the worst case: its estimates of the difference between the
training error and the generalization error are bounds that hold for any
induction algorithm and any probability distribution over the input space.
These bounds are functions of the training dataset size and the VC dimension of
the induction algorithm.
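Theorem 1. Given a hypothesis space H with a finite VC dimension d, the upper
bound on its generalization error is defined by:
\[ \left| \varepsilon(h, D) - \hat{\varepsilon}(h, S) \right| \le \sqrt{\frac{d\left(\ln(2n/d) + 1\right) - \ln(\delta/4)}{n}}, \quad \forall h \in H,\ \forall \delta > 0 \tag{1.4} \]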
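The behavior of the bound in Theorem 1 can be explored numerically. The sketch below assumes the standard square-root form of the VC bound and restricts δ to (0, 1), treating it as a confidence parameter; the values of n, d, and δ are hypothetical.

```python
import math


def vc_bound(n: int, d: int, delta: float) -> float:
    """Upper bound on |generalization error - training error| (equation 1.4).

    n     -- number of training instances
    d     -- VC dimension of the hypothesis space
    delta -- confidence parameter, assumed here to lie in (0, 1)
    """
    if n <= 0 or d <= 0:
        raise ValueError("n and d must be positive")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta is assumed to lie in (0, 1)")
    numerator = d * (math.log(2 * n / d) + 1) - math.log(delta / 4)
    return math.sqrt(numerator / n)


# The bound tightens as the training dataset grows (hypothetical settings).
gap_small = vc_bound(n=1_000, d=10, delta=0.05)
gap_large = vc_bound(n=100_000, d=10, delta=0.05)
```

For a fixed VC dimension, increasing the number of training instances shrinks the gap, which is the reason the bound is expressed as a function of both n and d.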
to the number of free parameters of this classification model. The VC dimension
of a general classification model may be different from the number of free
parameters, and in many cases, it might be extremely difficult to calculate it
precisely. In this case, it is advisable to calculate a lower and upper bound of the
VC dimension. The two VC dimension bounds for neural networks are
presented in [121].
PAC Dimension
Definition 2. Let C be a concept class defined over the input space X with m
variables. Let I be an induction algorithm that considers the hypothesis space
H. C is PAC learnable by I using H if, for all c ∈ C, all distributions D
defined over X, all ε ∈ (0, 1/2), and all δ ∈ (0, 1/2), the induction
algorithm I finds, with a probability greater than or equal to 1 − δ, a
hypothesis h ∈ H such that ε(h, D) ≤ ε. C is learnable in polynomial time if
the induction algorithm is of polynomial time complexity in 1/ε, 1/δ, m, and
size(c).
the number of training instances, which is sufficient for any consistent induction
algorithm I. In particular, the size of the training dataset must satisfy:
\[ n \ge \frac{1}{\varepsilon} \ln \frac{|H|}{\delta} \tag{1.5} \]
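Equation (1.5) can be evaluated directly for a finite hypothesis space. The sketch below reads the equation as n ≥ (1/ε) ln(|H|/δ); the hypothesis space size and the values of ε and δ are hypothetical.

```python
import math


def pac_sample_size(hypothesis_count: int, epsilon: float, delta: float) -> int:
    """Smallest integer n satisfying n >= (1 / epsilon) * ln(|H| / delta)."""
    if hypothesis_count < 1:
        raise ValueError("the hypothesis space must contain at least one hypothesis")
    if not (0.0 < epsilon < 0.5 and 0.0 < delta < 0.5):
        raise ValueError("epsilon and delta must lie in (0, 1/2), as in Definition 2")
    return math.ceil(math.log(hypothesis_count / delta) / epsilon)


# Hypothetical setting: |H| = 2**10 hypotheses, error 0.1, confidence 1 - 0.05.
n_required = pac_sample_size(hypothesis_count=2 ** 10, epsilon=0.1, delta=0.05)
```

Because |H| enters through a logarithm, even a large hypothesis space raises the required number of instances only moderately, whereas halving ε doubles it.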
Each time the algorithm is tested on one of the unseen folds and trained using
the remaining k − 1 folds.
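The fold rotation described above can be sketched with index lists. This is a minimal illustration that assumes the instances have already been shuffled; the function name `k_fold_indices` is hypothetical.

```python
from typing import Iterator, List, Tuple


def k_fold_indices(n: int, k: int) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) for each of the k folds.

    Instances are assumed to be shuffled beforehand; folds are contiguous
    blocks whose sizes differ by at most one.
    """
    if not 2 <= k <= n:
        raise ValueError("k must satisfy 2 <= k <= n")
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(start)) + list(range(start + size, n))
        yield train_idx, test_idx
        start += size


# Ten instances split into three folds: each fold is the test set exactly once.
folds = list(k_fold_indices(n=10, k=3))
```

Each instance appears in exactly one test fold, so averaging the k test errors uses every instance for evaluation without ever testing on data that was used for training in the same round.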
1.3.5. Scalability
1.3.6. Dimensionality
High-dimensional input data, i.e. datasets with a large number of variables,
cause an exponential increase in the size of the search space and consequently
increase the chance that an induction algorithm will build classification
models that are not valid in general. In [76], the authors explain that in the
case of a supervised classification model, the required number of instances
increases with the dimensionality of the dataset. Furthermore, the author shows
in [51] that the required number of instances grows linearly with the
dimensionality for a linear classification model and with the square of the
dimensionality for a quadratic classification model. For nonparametric
classification models, such as decision trees, the situation is more serious:
in [71] it was estimated that, in order to obtain an efficient estimation of the
multivariate densities, the number of instances must increase exponentially as
the number of dimensions increases.
This situation is called the curse of dimensionality, a term first used by
Bellman [7]. Algorithms, such as decision trees, which are effective in
situations of low dimensionality, do not yield significant results when the
dimensionality increases beyond a certain level. Moreover, the classification
models that are built on datasets with a small number of variables are easier to
interpret and more suitable for visualization by using different specific data
mining methods.
algorithms require the input variables to be of continuous type and the
dimensions to be representable as linear combinations of these input variables.
Each newly formed dimension is supposed to represent an unobserved factor.
Recently, the risk of customer churn has become the main concern of
companies active in all industries with a relatively low switching cost [104]. A
company that experiences this problem can see reduced profits and receive fewer
recommendations from customers who continue to use the service [113].
Considering the churn rate in different industries, the telecommunications
industry is the most exposed, because its annual churn rate varies between 20%
and 40% [8], [94]. In the mobile telecommunication sector, the term churn
refers to the transfer of subscribers from one service provider to another
[137].
implement data mining models that use machine learning algorithms and are
highly accurate.
In [4], Arthur, Harris, and Annan have used principal component analysis to
outline the main factors that influence customers in the mobile
telecommunication industry to churn. These main factors have been grouped
into principal components using the correlation measure and the descriptive
statistical analysis of the independent variables.
Bayesian networks, logistic regression, and k-nearest neighbors [15]. The
Bayesian networks have yielded an acceptable accuracy, fairly close to the
one generated by the other two algorithms. In [34], Coussement and Van den
Poel have implemented a predictive model for churn prediction using logistic
regression and have used the ROC (Receiver Operating Characteristic) curve as
an evaluation criterion.
In [83] and [88], the authors have used Bayesian networks to identify the
reasons for customer churn. To discretize the continuous variables, Kisioglu and
Topcu have used the CHAID (Chi-squared Automatic Interaction Detector)
algorithm [83]. In [82], Kirui et al. have implemented two predictive models
using the Naive Bayes algorithm and Bayesian networks, both yielding an
acceptable performance.
Wei and Chiu have used decision trees as a modeling technique and the
DET (Detection Error Tradeoff) curve as an evaluation criterion [137]. In their
study, the authors have used contract data and the changes in calling behavior.
Brandusoiu and Toderean have made a comparison between three decision tree
algorithms, namely CHAID, CART (Classification and Regression Tree), and
QUEST (Quick Unbiased Efficient Statistical Tree) [13]. These three algorithms
have been applied to the same dataset [10] as the one used in this book.
In another effort toward customer churn prediction, Hung et al. have
compared various data mining techniques [70]. In this study, the authors have
modeled decision trees and neural networks and have compared their
performance. Once again, the neural networks have achieved a better
performance than the decision trees.
In [23], Castanedo et al. have introduced for the first time the concept of
deep learning for customer prediction in the prepaid mobile telecommunications
industry. Castanedo et al. have investigated the application of a 4-layer
feedforward multilayer neural network on a large dataset. This model has had a
better performance than their previously implemented model which employed
the random forests algorithm.
Coussement and Van den Poel have compared in [35] the performance of
three classification techniques: logistic regression, support vector machines, and
random forests. The results of this study have shown that the random forests
model yields a significantly better performance than the other two models.
The support vector machines algorithm has been applied in [12] for
customer churn prediction in the prepaid mobile telecommunication industry.
Brandusoiu and Toderean have compared four different kernel functions,
namely: linear, polynomial, RBF, and sigmoid kernel. On the dataset [10], the
best predictive performance has been obtained using the polynomial kernel.
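For reference, the four kernel functions named above can be sketched in their common textbook forms. The parameters `gamma`, `coef0`, and `degree`, and their default values, are assumptions for illustration; the parametrization used in [12] is not given in this excerpt.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def dot(x: Vector, z: Vector) -> float:
    """Inner product of two vectors of equal length."""
    return sum(a * b for a, b in zip(x, z))


def linear_kernel(x: Vector, z: Vector) -> float:
    return dot(x, z)


def polynomial_kernel(x: Vector, z: Vector, gamma: float = 1.0,
                      coef0: float = 1.0, degree: int = 3) -> float:
    return (gamma * dot(x, z) + coef0) ** degree


def rbf_kernel(x: Vector, z: Vector, gamma: float = 1.0) -> float:
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)


def sigmoid_kernel(x: Vector, z: Vector, gamma: float = 1.0,
                   coef0: float = 0.0) -> float:
    return math.tanh(gamma * dot(x, z) + coef0)


# Two hypothetical standardized instances.
x, z = [1.0, 2.0], [3.0, 4.0]
k_linear = linear_kernel(x, z)
k_poly = polynomial_kernel(x, z, degree=2)
k_rbf_same = rbf_kernel(x, x)
```

Each kernel returns a similarity score between two instances; the polynomial kernel adds interaction terms up to the chosen degree, which allows a linear separator in the implicit feature space to act as a nonlinear boundary in the input space.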
The majority of the previously described data mining methods that have
been applied to datasets containing call detail records from the prepaid sector
have a predictive performance below 85% and use machine learning algorithms
with a standard architecture. Competition within this industry is intense, and
a mobile telecommunications company must implement highly accurate predictive
models in order to properly identify customers who are at risk of churning.
Markov Blanket (IPC-MB) algorithm, which is more efficient than the other
algorithms in the literature [50]. This algorithm learns the Markov
blanket and minimizes the size of the set of Pearson Chi-square conditional
independence tests [2] during the search, which accounts for its efficiency.
After the Markov blanket has been determined, the PC algorithm finishes
learning the network structure [128]. To estimate the parameters, the Bayesian
estimation algorithm with a Dirichlet prior distribution has been used [74].
Using this architecture on the dataset [10], the authors have obtained a
performance of approximately 100%. This method has been proposed by the authors
in [19] and [11].
In [19] and [11], the authors have proposed an adapted architecture for the
support vector machines algorithm with a polynomial kernel of degree 4. For
training, a divide-and-conquer approach has been used, which divides the
original problem into a set of subproblems that have been solved using the
Sequential Minimal Optimization (SMO) algorithm adapted by Chang and Lin
[28]. Once the kernel matrix of a subproblem has been stored in cache, each of
its elements has been evaluated only once and has been calculated using the fast
SVM algorithm proposed by Dong, Suen, and Krzyzak [39]. Testing this
architecture on the dataset [10] has again yielded a predictive performance of
approximately 100%.
1.5. Conclusions
learning, and statistics. Through the use of these data mining techniques,
predictive models can be implemented to discover trends and past behaviors,
allowing organizations to make intelligent decisions based on knowledge
extracted from data.
The literature overview suggests that as of now, the data mining methods
that have been applied on datasets consisting of call detail records have
employed different machine learning algorithms with standard architectures,
and therefore their predictive performance is merely acceptable.
cations company, this method uses an improved machine learning algorithm
with a proprietary architecture and yields a superior performance and is scalable
to large datasets.
CHAPTER 2
2. DATA UNDERSTANDING AND PREPARATION
In this chapter, the dataset is analyzed and prepared for the modeling
phase. The PCA algorithm and its main extensions are introduced. In the data
preparation phase, the PCA algorithm is applied to the dataset in order to reduce
its dimensionality and avoid the collinearity between the independent variables.
Also, within this chapter the dataset is partitioned into a training dataset and
a test dataset, and the distribution of the classes of the dependent variable is
balanced. This chapter represents the data understanding and the data
preparation phases of the CRISP-DM methodology.
The dataset used in this book, on which the principal component analysis
is applied, comes from the Department of Information and Computer Science of
the University of California, Irvine [10]. This dataset contains the call detail
records of 3,333 customers, each having 21 variables. Each row of this dataset
corresponds to a customer, and for each customer it provides information about
the number of incoming and outgoing calls, the number of incoming and outgoing
SMSs, and the voicemail. When implementing the predictive model, the
Churn variable will be used as the dependent variable, and the other 20
variables as independent variables.
Table 2.1 indicates the 21 variables, their type, and the range of their
values. A first analysis of this dataset draws attention to the variables State,
Area Code, and Phone. The variable Area Code has only three different values –
408, 415, and 510, all belonging to the state of California. This would not be
abnormal if the data showed that all customers are from California. However, as
illustrated in Table 2.2 (shown only up to the state of Florida), these three
area codes are approximately evenly distributed across all the states of the
USA. In this case, it is possible that the dataset contains incorrect data.
Therefore, this aspect of the variable Area Code must be kept in mind, and the
variable is not included as an independent variable when implementing the
predictive model. The variable State may contain errors too, and additional
information about this dataset is required before including either of these
variables in the data mining model. The variable Phone is also
excluded because it does not provide any relevant information for prediction
and is useful only to identify customers. Consequently, the number of
independent variables was reduced from 20 to 17. The
variables International Plan and Voice Mail Plan are both nominal with Yes or
No values, while the other 15 variables are continuous.
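The variable selection described above can be checked with a short sketch. The variable names are taken from Table 2.1; the list and set names are hypothetical.

```python
# Independent variables from Table 2.1, grouped by type (Churn is the target).
NOMINAL = ["State", "Area Code", "Phone", "International Plan", "Voice Mail Plan"]
CONTINUOUS = [
    "Account Length", "Voice Mail",
    "Day Minutes", "Day Calls", "Day Charge",
    "Evening Minutes", "Evening Calls", "Evening Charge",
    "Night Minutes", "Night Calls", "Night Charge",
    "Intl. Minutes", "Intl. Calls", "Intl. Charge",
    "Customer Service Calls",
]

# Variables left out of the model: possibly incorrect data or pure identifiers.
EXCLUDED = {"State", "Area Code", "Phone"}

predictors = [name for name in NOMINAL + CONTINUOUS if name not in EXCLUDED]
nominal_kept = [name for name in predictors if name in NOMINAL]
continuous_kept = [name for name in predictors if name in CONTINUOUS]
```

The resulting list contains the 17 independent variables used for modeling: 2 nominal and 15 continuous.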
Variable Name            Type         Values           Missing Values
State                    Nominal      AK, AL, …        0
Area Code                Nominal      408, 415, 510    0
Phone                    Nominal      N/A              0
International Plan       Nominal      Yes/No           0
Voice Mail Plan          Nominal      Yes/No           0
Account Length           Continuous   1 – 243          0
Voice Mail               Continuous   0 – 51           0
Day Minutes              Continuous   0.00 – 350.80    0
Day Calls                Continuous   0 – 165          0
Day Charge               Continuous   0.00 – 59.64     0
Evening Minutes          Continuous   0.00 – 363.70    0
Evening Calls            Continuous   0 – 170          0
Evening Charge           Continuous   0.00 – 30.91     0
Night Minutes            Continuous   23.20 – 395.00   0
Night Calls              Continuous   33 – 175         0
Night Charge             Continuous   1.04 – 17.77     0
Intl. Minutes            Continuous   0 – 20           0
Intl. Calls              Continuous   0 – 20           0
Intl. Charge             Continuous   0.00 – 5.40      0
Customer Service Calls   Continuous   0 – 9            0
Churn                    Nominal      Yes/No           0
Table 2.1. Dataset Variables [10].
modeling. The first step involves verifying the data to see if there are any
missing values and taking a first look at the statistics of
each available variable. Analyzing Table 2.1, we note that this dataset is
complete, i.e. for each customer and for each variable there are no missing
values. Had there been any missing values, they would have had to be imputed
during this data preparation phase using an appropriate method [114].
         Area Code
State    408    415    510
AK       14     24     14
AL       25     40     15
AR       13     27     15
AZ       15     36     13
CA       7      17     10
CO       25     29     12
CT       22     39     13
DC       14     27     13
DE       13     31     17
FL       12     31     20
Table 2.2. Distribution of Variable State relative to Variable Area Code.
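The inconsistency visible in Table 2.2 can be verified with a short sketch that uses only the ten rows shown in the table. The dictionary name `table_2_2` is hypothetical.

```python
# Customer counts per state for area codes (408, 415, 510), copied from Table 2.2.
table_2_2 = {
    "AK": (14, 24, 14), "AL": (25, 40, 15), "AR": (13, 27, 15),
    "AZ": (15, 36, 13), "CA": (7, 17, 10), "CO": (25, 29, 12),
    "CT": (22, 39, 13), "DC": (14, 27, 13), "DE": (13, 31, 17),
    "FL": (12, 31, 20),
}

# States in which all three Californian area codes occur.
states_with_all_codes = [
    state for state, counts in table_2_2.items() if all(c > 0 for c in counts)
]

# Share of the listed customers whose State value is actually California.
total_customers = sum(sum(counts) for counts in table_2_2.values())
california_share = sum(table_2_2["CA"]) / total_customers
```

All three area codes appear in every listed state, while California accounts for only a small share of the listed customers, which supports the decision to treat Area Code as unreliable.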
The next step in the data understanding phase consists of checking the
dataset for extreme values, which may indicate the presence of measurement or
recording errors in the dataset. A first look at the distributions of each
variable in Table 2.3 shows the presence of extreme values for some variables.
Since these values do not indicate any measurement error or unusual behavior
that would suggest an irregularity in the data, all these extreme values are
kept for each variable. This extreme value check ends the second phase of the
CRISP-DM process, namely the data understanding phase.
2.2. Data Preparation
known as principal component analysis were given by Pearson in 1901 [109]
and by Hotelling in 1933 [68]. The two papers adopt different approaches to this
technique. Pearson was concerned with finding the lines and planes that best
describe a set of points in an m-dimensional space, and the geometric
optimization problems he considered led to the principal component technique
[109].
The approach taken by Hotelling starts from factor analysis, but the PCA
technique defined by him is different from the factor analysis [68]. The main
idea of this paper is that there exists a fundamental set of smaller dimension of
independent variables that can determine the values of the original m variables.
Hotelling has mentioned that such variables have been called factors in the
psychology literature; he has therefore introduced the alternate term
components to avoid confusion with other uses of the term factor in
mathematics. These
components have been chosen to maximize their successive contributions to the
total variance of the original variables, and have been called principal
components. The analysis that leads to the discovery of these principal
components has been called the principal component method.
becoming a paper frequently quoted in later developments [3]. In [112], Rao has
proposed new ideas regarding the use, the interpretation, and the extensions of
the PCA technique. Gower has discussed the connections between the PCA
method and other statistical techniques and has offered various important
geometric perspectives [59]. Jeffers has presented the practical aspect of the
PCA technique by discussing two case studies that employ the PCA method
[73].
\[ \sum_{i=1}^{l} r_i^2 \tag{2.1} \]
[Figure: axes X1 and X2, the direction Y1, and the distance r_i.]
Prior to using this dimensionality reduction algorithm, the dataset must be
standardized so that the arithmetic mean of each variable is zero and the
standard deviation is equal to one. Each variable Xi is represented by a vector of
size n × 1 , where n is the number of instances. The standardized variable is
represented by a vector Zi of size n × 1 , where Zi = (Xi − μi )/σii , μi is the
arithmetic mean of Xi, and σii is the standard deviation of Xi.
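The standardization step can be sketched as follows. The sample values are hypothetical, and the population standard deviation (division by n) is assumed, in line with equation (2.10).

```python
from statistics import fmean, pstdev
from typing import List, Sequence


def standardize(values: Sequence[float]) -> List[float]:
    """Return Z = (X - mu) / sigma for one variable.

    sigma is the population standard deviation (division by n).
    """
    mu = fmean(values)
    sigma = pstdev(values, mu)
    if sigma == 0:
        raise ValueError("a constant variable cannot be standardized")
    return [(x - mu) / sigma for x in values]


# Hypothetical Day Minutes values for five customers.
day_minutes = [120.0, 150.5, 179.4, 210.0, 265.1]
day_minutes_z = standardize(day_minutes)
```

After this transformation every variable has mean zero and standard deviation one, so no variable dominates the principal components merely because of its measurement scale.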
[Figure: axes X1 and X2, the rotated axes Y1 and Y2, and the angle θ.]
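\[ V^{1/2} = \begin{bmatrix} \sigma_{11} & 0 & \cdots & 0 \\ 0 & \sigma_{22} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_{mm} \end{bmatrix} \tag{2.3} \]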
In the case of an m-dimensional problem, new coordinates are introduced
for $i = 1, 2, \ldots, m$:
\[ Y_i = \sum_{j=1}^{m} e_{ij} Z_j \tag{2.4} \]
Let
\[ Y_i = \sum_{j=1}^{m} e_{ij} Z_j = e_i^\top Z, \qquad Z = [Z_1, Z_2, \ldots, Z_m]^\top \tag{2.6} \]
such that by projecting the vector Z over vector ei , the distance Yi is obtained
along the direction of vector ei, and we have:
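\[ C = E[(Z - \bar{Z})(Z - \bar{Z})^\top] \tag{2.8} \]
\[ r_{ij} = \frac{\sigma_{ij}^2}{\sigma_{ii}\, \sigma_{jj}} \tag{2.9} \]
\[ \sigma_{ij}^2 = \frac{\sum_{k=1}^{n} (X_{ki} - \mu_i)(X_{kj} - \mu_j)}{n} \tag{2.10} \]
The notation $\sigma_{ii}^2$ is used to denote the variance of variable $X_i$.
If $X_i$ and $X_j$ are independent, then $\sigma_{ij}^2 = 0$, but
$\sigma_{ij}^2 = 0$ does not imply that $X_i$ and $X_j$ are independent.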
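Equations (2.9) and (2.10) can be sketched in a few lines. The data below are hypothetical: a charge variable computed as a fixed multiple of the minutes (an assumed rate of 0.17) gives a correlation of 1, and a quadratic relationship shows that zero covariance does not imply independence.

```python
from math import sqrt
from typing import Sequence


def covariance(x: Sequence[float], y: Sequence[float]) -> float:
    """Equation (2.10): covariance with division by n."""
    n = len(x)
    mu_x, mu_y = sum(x) / n, sum(y) / n
    return sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n


def correlation(x: Sequence[float], y: Sequence[float]) -> float:
    """Equation (2.9): covariance divided by the two standard deviations."""
    return covariance(x, y) / sqrt(covariance(x, x) * covariance(y, y))


# Hypothetical minutes and a charge derived from them at an assumed fixed rate.
minutes = [100.0, 150.0, 200.0, 250.0]
charge = [0.17 * m for m in minutes]
r_minutes_charge = correlation(minutes, charge)

# Dependent but uncorrelated: y is fully determined by x, yet the covariance is 0.
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [v ** 2 for v in x]
cov_xy = covariance(x, y)
```

The first pair mirrors the unit correlations between the minutes and charge variables in Table 2.4; the second pair illustrates why a zero covariance only rules out a linear relationship.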
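\[ \rho = \begin{bmatrix}
\dfrac{\sigma_{11}^2}{\sigma_{11}\sigma_{11}} & \dfrac{\sigma_{12}^2}{\sigma_{11}\sigma_{22}} & \cdots & \dfrac{\sigma_{1m}^2}{\sigma_{11}\sigma_{mm}} \\
\dfrac{\sigma_{12}^2}{\sigma_{11}\sigma_{22}} & \dfrac{\sigma_{22}^2}{\sigma_{22}\sigma_{22}} & \cdots & \dfrac{\sigma_{2m}^2}{\sigma_{22}\sigma_{mm}} \\
\vdots & \vdots & \ddots & \vdots \\
\dfrac{\sigma_{1m}^2}{\sigma_{11}\sigma_{mm}} & \dfrac{\sigma_{2m}^2}{\sigma_{22}\sigma_{mm}} & \cdots & \dfrac{\sigma_{mm}^2}{\sigma_{mm}\sigma_{mm}}
\end{bmatrix} \tag{2.11} \]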
Considering that each variable has been standardized and taking into
account the above-mentioned standardized matrix, we have E(Z) = 0 , where 0
is a vector of zeros of size n × 1 , and Z has the covariance matrix equal to the
correlation matrix.
Undoubtedly, the higher the norm of vector $e_i$, the higher $Var(Y_i)$ will
be. Thus, the normalization constraint $\|e_i\| = 1$ must be imposed while
$Var(Y_i)$ is maximized, that is, $e_i^\top e_i = 1$.
eigenvector. By multiplying equation (2.13) to the left with e⊤i we obtain
equation (2.14):
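The step can be written out as a short derivation. Equation (2.13) is not reproduced in this excerpt; the sketch below assumes it is the eigenvalue equation obtained by maximizing $Var(Y_i) = e_i^\top \rho\, e_i$ under the constraint $e_i^\top e_i = 1$ with a Lagrange function, so the intermediate lines are an assumption rather than the book's exact formulation.

```latex
\begin{align*}
\mathcal{L}(e_i, \lambda_i) &= e_i^\top \rho\, e_i - \lambda_i \left( e_i^\top e_i - 1 \right) \\
\frac{\partial \mathcal{L}}{\partial e_i} = 0 \;&\Rightarrow\; \rho\, e_i = \lambda_i e_i \\
e_i^\top \rho\, e_i &= \lambda_i\, e_i^\top e_i = \lambda_i \\
Var(Y_i) &= \lambda_i
\end{align*}
```

Under this assumption, the variance of each principal component equals the corresponding eigenvalue of the correlation matrix, which is consistent with equation (2.15).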
The total variability in the standardized dataset is equal to the sum of the
variances of the principal components, to the sum of the variances of the
standardized variables $Z_i$, to the sum of the eigenvalues, and to the number
of independent variables. Thus, we have equation (2.15):
\[ \sum_{i=1}^{m} Var(Y_i) = \sum_{i=1}^{m} Var(Z_i) = \sum_{i=1}^{m} \lambda_i = m \tag{2.15} \]
where (λ1, e1), (λ2, e2), . . . , (λm, em) represent the eigenvalue and eigenvector pairs
for the correlation matrix ρ , and λ1 ≥ λ2 ≥ . . . ≥ λm . A partial correlation
coefficient is a correlation coefficient that takes into account the effect of all
other independent variables.
The ratio of the total variability in Z that is explained by the ith principal
component is equal to the ratio between the ith eigenvalue and the number of
independent variables, that is λi /m.
After calculating the principal components, the next step is to select the
principal components. From the total number of principal components which is
equal to the number of independent variables, only a smaller number is selected
based on four criteria: the eigenvalue criterion, the proportion of the explained
variance criterion, the minimum communality criterion, and the scree plot
criterion.
\[ \frac{\lambda_1 + \lambda_2 + \cdots + \lambda_k}{\lambda_1 + \lambda_2 + \cdots + \lambda_k + \cdots + \lambda_m} \tag{2.17} \]
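The ratio in equation (2.17) can be used to choose the number of components k. The sketch below uses hypothetical eigenvalues for m = 5 standardized variables and a hypothetical threshold of 85% of the total variance.

```python
from typing import List, Sequence


def cumulative_explained_variance(eigenvalues: Sequence[float]) -> List[float]:
    """Equation (2.17) for k = 1..m, with eigenvalues sorted in decreasing order."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    shares, running = [], 0.0
    for lam in ordered:
        running += lam
        shares.append(running / total)
    return shares


def smallest_k(eigenvalues: Sequence[float], threshold: float) -> int:
    """Smallest k whose first k components explain at least `threshold`."""
    for k, share in enumerate(cumulative_explained_variance(eigenvalues), start=1):
        if share >= threshold:
            return k
    return len(eigenvalues)


# Hypothetical eigenvalues of a 5 x 5 correlation matrix (they sum to m = 5).
eigenvalues = [2.0, 1.5, 1.0, 0.4, 0.1]
shares = cumulative_explained_variance(eigenvalues)
k = smallest_k(eigenvalues, threshold=0.85)
```

With these values the first three components explain 90% of the total variance, so k = 3 satisfies the 85% threshold while the first two components do not.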
If the independent variables are highly correlated, a small number of
eigenvectors with high eigenvalues will be obtained, and k will be much smaller
than m, thus obtaining a significant dimensionality reduction.
Variable Name            Min     Max      Mean     Median   Mode     Std. Dev.
Account Length           1       243      101.06   101      105      39.82
Voice Mail               0       51       8.10     0        0        13.68
Day Minutes              0.00    350.8    179.77   179.40   154.00   54.47
Day Calls                0       165      100.44   101      102      20.07
Day Charge               0.00    59.64    30.56    30.50    26.18    9.26
Evening Minutes          0.00    363.7    200.98   201.40   169.90   50.74
Evening Calls            0       170      100.11   100      105      19.92
Evening Charge           0.00    30.91    17.08    17.12    14.25    4.31
Night Minutes            23.2    395.0    200.87   201.2    188.20   50.57
Night Calls              33      175      100.11   100      105      19.57
Night Charge             1.04    17.77    9.04     9.05     9.45     2.28
Intl. Minutes            0.00    20.00    10.24    10.3     10       2.79
Intl. Calls              0       20       4.48     4        3        2.46
Intl. Charge             0.00    5.40     2.76     2.78     2.70     0.75
Customer Service Calls   0       9        1.56     1        1        1.31
Table 2.3. Statistical Indicators of Quantitative Variables.
deviation of less than 1, whereas the variable Day Minutes has a standard
deviation greater than 54. If the principal component analysis is applied
without first standardizing these quantitative independent variables, the variable
Day Minutes will dominate the influence of the variable International Charge,
and similarly for variables across the entire range of variabilities.
Therefore, all variables are
standardized and we obtain the vectors Zi = (Xi − μi )/σii using the mean and the
standard deviation from Table 2.3.
having only the first letter of each word of each variable and the z suffix,
denoting that each variable is standardized. It can be seen in Table 2.4 that some
independent variables are strongly correlated with each other, and these
correlations could negatively influence the classification model. The principal
component analysis exploits this correlation and identifies the components that
underlie the correlated variables.
Var. al_z vm_z dm_z dc_z dch_z em_z ec_z ech_z nm_z nc_z nch_z im_z ic_z ich_z csc_z
al_z 1.000 -0.005 0.006 0.038 0.006 -0.007 0.019 -0.007 -0.009 -0.013 -0.009 0.010 0.021 0.010 -0.004
vm_z -0.005 1.000 0.001 -0.010 0.001 0.018 -0.006 0.018 0.008 0.007 0.008 0.003 0.014 0.003 -0.013
dm_z 0.006 0.001 1.000 0.007 1.000 0.007 0.016 0.007 0.004 0.023 0.004 -0.010 0.008 -0.010 -0.013
dc_z 0.038 -0.010 0.007 1.000 0.007 -0.021 0.006 -0.021 0.023 -0.020 0.023 0.022 0.005 0.022 -0.019
dch_z 0.006 0.001 1.000 0.007 1.000 0.007 0.016 0.007 0.004 0.023 0.004 -0.010 0.008 -0.010 -0.013
em_z -0.007 0.018 0.007 -0.021 0.007 1.000 -0.011 1.000 -0.013 0.008 -0.013 -0.011 0.003 -0.011 -0.013
ec_z 0.019 -0.006 0.016 0.006 0.016 -0.011 1.000 -0.011 -0.002 0.008 -0.002 0.009 0.017 0.009 0.002
ech_z -0.007 0.018 0.007 -0.021 0.007 1.000 -0.011 1.000 -0.013 0.008 -0.013 -0.011 0.003 -0.011 -0.013
nm_z -0.009 0.008 0.004 0.023 0.004 -0.013 -0.002 -0.013 1.000 0.011 1.000 -0.015 -0.012 -0.015 -0.009
nc_z -0.013 0.007 0.023 -0.020 0.023 0.008 0.008 0.008 0.011 1.000 0.011 -0.014 0.000 -0.014 -0.013
nch_z -0.009 0.008 0.004 0.023 0.004 -0.013 -0.002 -0.013 1.000 0.011 1.000 -0.015 -0.012 -0.015 -0.009
im_z 0.010 0.003 -0.010 0.022 -0.010 -0.011 0.009 -0.011 -0.015 -0.014 -0.015 1.000 0.032 1.000 -0.010
ic_z 0.021 0.014 0.008 0.005 0.008 0.003 0.017 0.003 -0.012 0.000 -0.012 0.032 1.000 0.032 -0.018
ich_z 0.010 0.003 -0.010 0.022 -0.010 -0.011 0.009 -0.011 -0.015 -0.014 -0.015 1.000 0.032 1.000 -0.010
csc_z -0.004 -0.013 -0.013 -0.019 -0.013 -0.013 0.002 -0.013 -0.009 -0.013 -0.009 -0.010 -0.018 -0.010 1.000
Table 2.4. Pearson’s Correlation Coefficient of Quantitative Variables.
Principal Component
Var. 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
al_z -0.020 0.002 0.028 -0.007 0.631 -0.099 0.081 0.073 0.289 0.472 -0.522 0.000 0.000 0.000 0.000
vm_z 0.013 0.017 -0.007 0.034 -0.054 0.537 -0.278 0.541 0.561 -0.135 0.063 0.000 0.000 0.000 0.000
dm_z 0.500 0.160 0.849 -0.052 -0.015 -0.022 -0.013 0.017 -0.004 -0.001 -0.003 0.000 0.000 0.000 0.000
dc_z -0.017 -0.058 0.040 0.029 0.570 -0.201 -0.302 -0.249 0.233 -0.065 0.646 0.000 0.000 0.000 0.000
dch_z 0.500 0.160 0.849 -0.052 -0.015 -0.022 -0.013 0.017 -0.004 -0.001 -0.003 0.000 0.000 0.000 0.000
em_z 0.286 0.738 -0.266 0.549 0.019 -0.025 0.011 -0.005 0.001 -0.002 0.011 0.000 0.000 0.000 0.000
ec_z -0.006 -0.009 0.043 -0.007 0.305 0.121 0.729 -0.033 0.159 -0.577 -0.013 0.000 0.000 0.000 0.000
ech_z 0.285 0.738 -0.266 0.549 0.019 -0.025 0.011 -0.005 0.001 -0.002 0.011 0.000 0.000 0.000 0.000
nm_z 0.434 -0.664 -0.089 0.601 -0.002 -0.005 0.014 0.016 -0.014 0.007 -0.015 0.000 -0.001 0.000 0.000
nc_z 0.055 0.005 0.021 0.005 -0.227 0.483 0.306 -0.535 0.282 0.459 0.202 0.000 0.000 0.000 0.000
nch_z 0.434 -0.664 -0.089 0.601 -0.002 -0.005 0.014 0.016 -0.015 0.007 -0.015 0.000 0.001 0.000 0.000
im_z -0.707 0.004 0.437 0.554 -0.040 -0.012 0.004 -0.010 0.014 0.003 -0.012 0.002 0.000 0.000 0.000
ic_z -0.045 0.022 0.045 0.025 0.358 0.454 0.154 0.351 -0.607 0.258 0.281 0.000 0.000 0.000 0.000
ich_z -0.707 0.004 0.437 0.554 -0.040 -0.012 0.004 -0.010 0.014 0.003 -0.012 -0.002 0.000 0.000 0.000
csc_z -0.014 -0.011 -0.025 -0.038 -0.238 -0.486 0.431 0.472 0.218 0.336 0.369 0.000 0.000 0.000 0.000
eigenvalues decrease in magnitude, i.e. λ1 ≥ λ2 ≥ . . . ≥ λ15.
After defining all the principal components, the criteria for selecting an
optimal number of principal components for the modeling phase must be
analyzed [16]. The eigenvalue criterion suggests retaining the principal
components with an eigenvalue greater than or equal to 1. Based on this
criterion, the first 7 principal components are retained, all having an
eigenvalue greater than 1. The other 4 principal components have an eigenvalue
approximately equal to 1 (Table 2.6). If the other criteria support such a
decision, these 4 principal components will be retained too.
components so that the communality of each independent variable is greater
than 50%. Thus, by selecting the first 11 principal components a communality of
100% for each variable is obtained. As a reminder, the communality of an
independent variable is equal to the sum of its squared component weights.
The scree plot criterion involves interpreting the chart of the eigenvalues
against the order number of each principal component. Figure 2.3 illustrates
this chart, having on the horizontal axis the 15 principal components and on
the vertical axis the eigenvalues. This criterion suggests retaining the
maximum number of principal components before the graph approaches zero, in
this case 11. Thus, based on this criterion the first 11 principal components
are retained [16].
[Figure: scree plot of the eigenvalues (vertical axis, 0.0 to 3.0) against the principal component number (horizontal axis, 0 to 16).]
Figure 2.3. Scree Plot Criterion.
correlated and explain 100% of the variance in the original dataset.
At this stage, the dataset must be prepared for the machine learning
algorithms used in the next phase. In the following chapter, the classification
model is implemented and in order to empirically estimate the generalization
error of this algorithm, the dataset is randomly partitioned into a training
dataset, which represents 80% of the original dataset, and a test dataset, which
represents 20% of the original dataset [129].
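A random 80%/20% partition of this kind can be sketched as follows. The function name `train_test_split_indices`, the seed, and the sample count of 1000 are illustrative assumptions, not details taken from this book.

```python
import numpy as np

def train_test_split_indices(n_samples, test_fraction=0.2, seed=0):
    """Randomly partition the row indices into a training and a test part."""
    rng = np.random.default_rng(seed)
    permuted = rng.permutation(n_samples)
    n_test = int(round(test_fraction * n_samples))
    return permuted[n_test:], permuted[:n_test]

# Hypothetical dataset with 1000 instances.
train_idx, test_idx = train_test_split_indices(1000)
```

Shuffling the indices once and slicing them guarantees that the two subsets are disjoint and together cover the whole dataset.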
2.3. Conclusions
This chapter describes the PCA algorithm and how it is applied to the
dataset [10]. By applying this algorithm, 11 principal components are obtained
which explain the entire variability in the dataset. In other words, by reducing
the dimensionality of the dataset no information is lost. The 11 principal
components selected contain the same information as the original dataset while
avoiding any collinearity between the independent variables [16].
This chapter serves as the data understanding and the data preparation
phases of the CRISP-DM methodology.
3. CHAPTER 3
3. MODELING USING NEURAL NETWORKS
3.1. Introduction
algorithm mimics the human brain, which consists of different types of neurons,
each neuron connecting to several synapses. The ability of the human brain to
perceive and memorize new information through a learning process has
motivated researchers to develop artificial systems that are capable of
performing certain functions based on a learning process [36].
In 1982, the Hopfield model marked the beginning of the current period
of neural networks research [67]. This model does not operate at a neuron level,
but at a system level based on the Hebbian rule and functions as a recurrent
neural network. This type of neural network is useful to solve different
optimization problems. In 1985, as an extension to the Hopfield neural
networks, the Boltzmann machine was proposed having its learning algorithm
based on the simulated annealing method [81] by including stochastic neurons
[1]. In 1988, the Hopfield model was developed further and the cellular neural
networks were proposed [31].
The neural network model based on the principal component analysis was
proposed in 1982 [107], and based on the independent component analysis
(ICA) in 1994 [33]. The ICA algorithm is a generalization of the PCA algorithm
and is typically used for feature extraction. In the subsequent years, multiple
research papers proposed further neural network models, such as those based on
factor analysis, canonical correlation analysis (CCA), and linear discriminant
analysis (LDA).
A neural network is defined by its architecture, by the properties of its
neurons, and by the learning rules used [36].
Neural networks operate in two phases: first the learning process takes
place and then the generalization process. The training process consists of
parsing the training dataset and adapting the parameters of the network by using
an online or an offline learning algorithm. Once the learning process is over, the
network will be able to mimic the nonlinear relationships existent between the
independent variables and the dependent variable.
3.2. Perceptron
linearly separable patterns [115].
$$u = \sum_{i=1}^{m} w_i x_i - \theta = w^\top x - \theta \qquad (3.1)$$
$$\hat{y} = \phi(u) \qquad (3.2)$$
where xi represents the ith independent variable and x = (x1, x2, ..., xm)⊤, wi
represents the weight of the ith independent variable and w = (w1, w2, ..., wm)⊤, θ
is the threshold or bias, and m is the number of independent variables. ϕ() is a
continuous or discontinuous activation function that maps the real numbers into
the [−1, 1] range.
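[Figure: the inputs x1, x2, ..., xm are multiplied by the synaptic weights w1, w2, ..., wm and fed, together with the threshold θ, to a summing block that produces u; the activation function then produces the output y.]
Figure 3.1. Mathematical Model of a Perceptron [111].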
w⊤x − θ = 0 (3.3)
where the parameter θ helps to move this hyperplane from the origin.
The most frequently used activation functions [36] are the threshold
function (3.4), the sigmoid function (3.5), and the hyperbolic tangent function
(3.6):
$$\phi(x) = \begin{cases} 1, & x \ge 0 \\ -1 \text{ or } 0, & x < 0 \end{cases} \qquad (3.4)$$
$$\phi(x) = \frac{1}{1 + e^{-\beta x}} \qquad (3.5)$$
All these activation functions are monotonically increasing and take values
within the [−1, 1] range. The sigmoid functions satisfy
$\lim_{x \to -\infty} \phi(x) = 0$ and $\lim_{x \to +\infty} \phi(x) = 1$; since
many monotonically increasing functions satisfy these conditions, they can also
be considered sigmoidal. In [41], the author has
[Figure: three panels titled threshold, logistic, and tanh, each plotting ϕ(x) against x for x between −2 and 2.]
Figure 3.2. Sigmoid Activation Functions [36].
[Figure: network diagram in which the inputs x1, x2, ..., xm are connected through synaptic weights to o neurons, each consisting of a summing block with threshold θ1, ..., θo producing u1, ..., uo and an activation function producing the outputs y1, ..., yo.]
input vector x into multiple classes.
$$u = w^\top x - \theta \qquad (3.7)$$
$$\hat{y} = \phi(u) \qquad (3.8)$$
where u = (u1, u2, ..., uo)⊤ and o is the number of neurons in the hidden layer,
θ = (θ1, θ2, ..., θo)⊤ represents the bias vector in the hidden layer, ϕ(u) = (ϕ1(u1),
ϕ2(u2), ..., ϕo(uo))⊤ represents the vector of the activation functions of the
neurons in the hidden layer, and ŷ = (ŷ1, ŷ2, ..., ŷo)⊤ represents the output vector.
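$$E(w) = \sum_{x \in X} -w^\top x \qquad (3.9)$$
$$u_{n,j} = \sum_{i=1}^{m} w_{ij} x_{n,i} - \theta_j = w_j^\top x_n - \theta_j \qquad (3.10)$$
$$\hat{y}_{n,j} = \begin{cases} 1, & u_{n,j} > 0 \\ -1, & \text{otherwise} \end{cases} \qquad (3.11)$$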
sufficiently small positive number, a typical value of this parameter η being 0.5.
The stability of the training process is not influenced by the selection of this
parameter η, only the convergence speed is affected. It is important to note that
the weights wij are randomly initialized. Finally, when the errors are small
enough the training process stops.
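The training loop described above can be sketched for a single output neuron. The forward pass follows (3.10) and (3.11); the weight update used here is the classical perceptron rule, w ← w + η(y − ŷ)x, which is an assumption because the update equations are not reproduced in this passage. The function name `train_perceptron`, the initialization range, and the AND dataset are illustrative.

```python
import numpy as np

def train_perceptron(X, y, eta=0.5, max_epochs=200, seed=0):
    """Train a single perceptron; X has shape (n, m), y holds -1 or +1."""
    rng = np.random.default_rng(seed)
    w = rng.uniform(-0.5, 0.5, size=X.shape[1])  # random initial weights
    theta = 0.0
    for _ in range(max_epochs):
        errors = 0
        for x_n, y_n in zip(X, y):
            u = w @ x_n - theta                  # equation (3.10)
            y_hat = 1.0 if u > 0 else -1.0       # equation (3.11)
            if y_hat != y_n:
                # Classical perceptron rule (assumed update).
                w += eta * (y_n - y_hat) * x_n
                theta -= eta * (y_n - y_hat)
                errors += 1
        if errors == 0:                          # no misclassified instance left
            break
    return w, theta

# Linearly separable toy problem: logical AND with targets in {-1, +1}.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([-1.0, -1.0, -1.0, 1.0])
w, theta = train_perceptron(X, y)
predictions = np.where(X @ w - theta > 0, 1.0, -1.0)
```

For linearly separable patterns such as this one, the loop stops as soon as an epoch passes without any misclassification.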
$$\hat{y}_n = o_n^{(L)}, \quad o_n^{(0)} = x_n \qquad (3.14)$$
$$u_n^{(l)} = w_n^{(l-1)} o_n^{(l-1)} + \theta^{(l)} \qquad (3.15)$$
$$o_n^{(l)} = \phi^{(l)}(u_n^{(l)}) \qquad (3.16)$$
where $u_n^{(l)} = (u_{n,1}^{(l)}, u_{n,2}^{(l)}, ..., u_{n,k_l}^{(l)})^\top$,
$w_n^{(l-1)} = (w_{n,1}^{(l-1)}, w_{n,2}^{(l-1)}, ..., w_{n,k_{l-1}}^{(l-1)})^\top$ represents the
weight vector, $o_n^{(l-1)} = (o_{n,1}^{(l-1)}, o_{n,2}^{(l-1)}, ..., o_{n,k_{l-1}}^{(l-1)})^\top$ represents
the output vector, $\theta^{(l)} = (\theta_1^{(l)}, \theta_2^{(l)}, ..., \theta_{k_l}^{(l)})^\top$ represents the
bias vector, and $\phi^{(l)}()$ applies the function $\phi_i^{(l)}()$ to the ith
component of the vector.
[Figure: MLP neural network with inputs x1, x2, ..., xm, layers 1, ..., l, ..., L with weights w(0), ..., w(l−1), ..., w(L−1), biases θ(1), ..., θ(l), ..., θ(L), and layer outputs o(1), ..., o(l), ..., o(L), producing the outputs y1, y2, ..., ykL.]
The first publication related to this algorithm appeared in 1963 [21], where it
was described as a dynamic multi-step optimization method. In the following
years, through the work of Werbos [138] and of Rumelhart, Hinton, and Williams
[119], it became recognized in the field of artificial neural networks. The BP
algorithm is a generalization of the delta rule and is therefore also known as
the generalized delta rule, a name introduced by Rumelhart and McClelland in
[117]. By employing a gradient search technique, the BP algorithm seeks to
minimize a cost function equivalent to the mean squared error (MSE) between
the expected and the current outputs of the network. As such, the MLP neural
networks can be extended to multiple layers.
The BP algorithm propagates the error between the expected and the
actual outputs of the network back through the network. After an input feature is
presented to the network, the generated output is compared to a known output
feature and the error is computed for each output neuron. These errors are then
propagated backward through the network, creating a control system this way.
The weights can be adjusted by using a gradient descent algorithm [36].
For the optimization, the MSE objective function is defined between the
current ŷn and the expected yn network outputs for all feature pairs (xn, yn) ∈ S
from the training dataset:
$$E = \frac{1}{n}\sum_{n \in S} E_n = \frac{1}{2n}\sum_{n \in S} \lVert \hat{y}_n - y_n \rVert^2 \qquad (3.17)$$
$$E_n = \frac{1}{2}\lVert \hat{y}_n - y_n \rVert^2 = \frac{1}{2} e_n^\top e_n \qquad (3.18)$$
$$e_n = \hat{y}_n - y_n \qquad (3.19)$$
error function, E or En, can be minimized using [36]:
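$$\Delta_n w = -\eta \frac{\partial E_n}{\partial w} \qquad (3.20)$$
$$\delta_{n,j}^{(l)} = -\frac{\partial E_n}{\partial u_{n,j}^{(l)}} \qquad (3.21)$$
$$\delta_{n,j}^{(l+1)} = \dot{\phi}_j^{(l+1)}(u_{n,j}^{(l+1)}) \sum_{p=1}^{k_{l+2}} \delta_{n,p}^{(l+2)} w_{jp}^{(l+1)} \qquad (3.23)$$
$$\frac{\partial E_n}{\partial w_{ij}^{(l)}} = -\delta_{n,j}^{(l+1)} o_{n,i}^{(l)} \qquad (3.24)$$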
The first order derivatives for the sigmoid and the hyperbolic tangent
activation functions are given by equation (3.25) and equation (3.26),
respectively:
The bias vector θ(l+1) from layer (l + 1) can be updated using a gradient
descent algorithm or by expanding the weight vector w(l), namely:
$$\theta^{(l+1)} = (w_{0,1}^{(l)}, w_{0,2}^{(l)}, ..., w_{0,k_{l+1}}^{(l)})^\top \qquad (3.27)$$
$$o^{(l)} = (1, o_1^{(l)}, ..., o_{k_l}^{(l)})^\top \qquad (3.28)$$
If 0 < η < 2/λmax, where λmax is the largest eigenvalue of the autocorrelation
matrix of the vector x, denoted by R, the BP algorithm converges in the mean
[142]. If η is too small, the error function is more likely to become trapped
in a local minimum. Conversely, if η is too large, the error function is more
likely to fall into oscillatory traps. By preprocessing the independent
variables to remove the collinearity between them, one can avoid large
eigenvalues of R, and by increasing the parameter η the convergence can be
accelerated. The speed of the BP algorithm is often increased when the PCA
algorithm is applied as a preprocessing task, unless the independent variables
are uncorrelated or consist of sparse vectors. In general, the learning rate is
chosen in the range 0 < η < 1, ensuring that consecutive changes in weight do
not overshoot the minimum of the error surface.
$$\Delta_n W(t) = -\eta \frac{\partial E_n(t)}{\partial W(t)} + \alpha\, \Delta W(t-1) \qquad (3.29)$$
speed and the ability to avoid local minima, and is more robust to weight
initialization, especially if the training parameters have high values.
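The momentum update in (3.29) can be illustrated on a toy quadratic error surface. The values η = 0.4 and α = 0.9 are the ones used later in this chapter; the function name `momentum_step`, the matrix `A`, and the number of iterations are hypothetical choices made only for this sketch.

```python
import numpy as np

def momentum_step(w, grad, prev_delta, eta=0.4, alpha=0.9):
    """One update of equation (3.29): delta = -eta * grad + alpha * prev_delta."""
    delta = -eta * grad + alpha * prev_delta
    return w + delta, delta

# Toy error surface E(w) = 0.5 * w^T A w, whose gradient is A w.
A = np.diag([1.0, 0.1])
w = np.array([1.0, 1.0])
delta = np.zeros_like(w)
initial_error = 0.5 * w @ A @ w
for _ in range(200):
    w, delta = momentum_step(w, A @ w, delta)
final_error = 0.5 * w @ A @ w
```

The previous weight change is carried over into the current one, so the update keeps moving along directions in which successive gradients agree.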
During training, all the instances are randomly and repeatedly presented to
the network over epochs until the convergence criteria are met. As previously
mentioned, when the objective function of the optimization problem is En, the
weights are updated after each feature is presented to the network. This
learning process is known as online learning, incremental learning, or feature
learning. When the objective is to optimize the average error E, the offline
learning, non-incremental learning, or batch learning algorithm is used, and
the weights are updated only after all the training features are presented to
the network.
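$$\Delta_n w_{ij}^{(l)} = -\eta_{on} \frac{\partial E_n}{\partial w_{ij}^{(l)}} \qquad (3.30)$$
$$\Delta w_{ij}^{(l)} = -\eta_{off} \frac{\partial E}{\partial w_{ij}^{(l)}} = \sum_{n} \Delta_n w_{ij}^{(l)} \qquad (3.31)$$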
If the learning rates are sufficiently low, the online method approximates the
offline method, and both yield similar outcomes [47].
Online learning is more effective when the training dataset is not fully
available or is high-dimensional, because offline learning requires additional
storage. Also, online learning tends to be faster than offline learning and to
yield at least the same accuracy, specifically for high-dimensional training
datasets [143].
If the learning rates are fairly low, the random character of online learning
allows the search space to be explored more widely, helping to avoid local
minima [32]. During online learning, the error surface is followed more
closely, enabling the use of higher learning rates and implicitly an
accelerated convergence due to the reduced number of iterations. For
high-dimensional training data, offline learning is frequently infeasible due
to very low values of the parameter ηoff, while in online learning, the
parameter ηon can have higher values and can therefore speed up the training
process.
Normally, in an MLP neural network all the neurons use the same sigmoid
activation function, which constrains the network output to the (−1, 1) range.
In [18], [19], [103], [118], and [11], for classification the authors have used in
the output layer of an MLP neural network the generalized sigmoid activation
function, also called softmax. By using the generalized sigmoid function, the
MLP model has a greater flexibility, because the result of each output neuron is
restricted by the results of all the output neurons. The output of the ith neuron is
given by equation (3.32):
$$o_i^{(L)} = \phi(u_i^{(L)}) = \frac{e^{u_i^{(L)}}}{\sum_{j=1}^{k_L} e^{u_j^{(L)}}} \qquad (3.32)$$
With respect to its own input, the generalized sigmoid function has a
derivative of the same form as the sigmoid function, namely
$o_i^{(L)}(1 - o_i^{(L)})$.
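A minimal sketch of equation (3.32) is given below. The shift by the maximum input is a common numerical safeguard that leaves the result unchanged and is not part of the equation; the function names and the input vector are illustrative assumptions.

```python
import numpy as np

def softmax(u):
    """Generalized sigmoid of equation (3.32) for one input vector u."""
    shifted = u - np.max(u)          # numerical safeguard, result is unchanged
    exp_u = np.exp(shifted)
    return exp_u / exp_u.sum()

def softmax_jacobian(u):
    """Matrix of partial derivatives d o_i / d u_j = o_i * (delta_ij - o_j)."""
    o = softmax(u)
    return np.diag(o) - np.outer(o, o)

u = np.array([2.0, 1.0, 0.1])        # hypothetical output-layer inputs
o = softmax(u)
J = softmax_jacobian(u)
```

The outputs sum to one, so each output is constrained by all the others, and the diagonal of the Jacobian equals o_i(1 − o_i), the same form as the derivative of the sigmoid function.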
empirically. The network pruning and growing techniques are used to determine
the number of neurons in the hidden layers.
The network pruning technique starts with a network with many hidden
neurons and then gradually eliminates redundant neurons during training.
Pruning techniques are categorized by the metric they compute: sensitivity-based
and regularization-based. The first technique calculates the sensitivity
of the error function E to the elimination of a weight or a neuron, and
eliminates the least significant one. The second technique, which is based on
regularization, adds a term to the error function E to constrain the network to
make effective decisions. The BP algorithm obtained from this new objective
function drives the insignificant weights to zero and eliminates them during
training. If the objective function contains a sensitivity term, then both
techniques yield the same result.
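$$S_w^E = \lim_{\Delta w \to 0} \frac{\Delta E / E}{\Delta w / w} = \frac{\partial \ln E}{\partial \ln w} = \frac{w}{E}\frac{\partial E}{\partial w} \qquad (3.33)$$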
Karnin has applied this pruning technique and has calculated during
training the sensitivity of each connection and has eliminated the ones with a
low sensitivity, without retraining [78]. This method has been improved by
adjusting some pruning rules to avoid eliminating an input neuron or an entire
hidden layer [56]. The authors have also proposed a fast algorithm to retrain the
neural network after eliminating a weight. Karnin's work has been further
developed and the relative local sensitivity index has been introduced for each
group of neurons or layer within the network [111].
The singular value decomposition (SVD) has been proposed as a neural
network pruning technique in [77]. In [130], the authors have applied the SVD
orthogonal transform to evaluate the importance of adding more hidden neurons
in a feedforward network. In [149], practical calculations have been presented
regarding the sensitivity of the input neurons and how to eliminate redundant
entries. For instance, if the input neurons have few dimensions with a low
sensitivity compared to the rest, those dimensions are eliminated and the neural
network retrained.
Another method for pruning the input and hidden neurons of an MLP
neural network has been proposed using the mutual information [146]. The
significant input neurons are determined first by computing their relevance and
contribution to the output and the redundant input neurons are eliminated. The
next step determines based on similar measures the hidden neurons that are
significant to the neural network and eliminates the redundant hidden neurons.
ET = E + λc Ec (3.34)
where E represents the error function, Ec represents a penalty for the complexity
of the network, and λc > 0 represents a regularization parameter. The penalty
term adds new local minima in the optimization process.
In the case of the weight decay technique, Ec is defined as a function of the
weights, namely, either as the sum of the squared weights [65] or as the sum of
the absolute values of the weights [72]. The BP algorithm derived from the
objective function ET using a weight decay term is defined in equation (3.35):
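$$\Delta w_{ij}^{(l)} = -\eta \frac{\partial E_T}{\partial w_{ij}^{(l)}} = \Delta w_{ij,BP}^{(l)} - \varepsilon \frac{\partial E_c}{\partial w_{ij}^{(l)}} \qquad (3.35)$$
$$\Delta w_{ij,BP}^{(l)} = -\eta \frac{\partial E}{\partial w_{ij}^{(l)}} \qquad (3.36)$$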
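The weight decay update in (3.35) can be sketched for both penalty terms. Combining (3.34) with (3.35) gives ε = ηλc, which is the relation used below; the function name `weight_decay_step` and the values of η and λc are illustrative assumptions.

```python
import numpy as np

def weight_decay_step(w, grad_E, eta=0.1, lambda_c=0.01, penalty="l2"):
    """One update of equation (3.35) with epsilon = eta * lambda_c."""
    if penalty == "l2":
        grad_Ec = 2.0 * w            # gradient of the sum of squared weights
    elif penalty == "l1":
        grad_Ec = np.sign(w)         # (sub)gradient of the sum of absolute weights
    else:
        raise ValueError("penalty must be 'l2' or 'l1'")
    delta_bp = -eta * grad_E         # plain BP change, equation (3.36)
    epsilon = eta * lambda_c
    return w + delta_bp - epsilon * grad_Ec

w = np.array([0.5, -2.0, 0.0])
# With a zero error gradient, only the decay term acts on the weights.
w_l2 = weight_decay_step(w, grad_E=np.zeros(3))
w_l1 = weight_decay_step(w, grad_E=np.zeros(3), penalty="l1")
```

The squared penalty shrinks each weight in proportion to its size, whereas the absolute-value penalty shrinks every nonzero weight by the same amount.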
3.7.3. Network Growing
dataset avoids the curse of dimensionality [7] and improves the neural network’s
ability to generalize. This step is effective especially for large datasets.
which avoids a local minimum at the beginning of the training process and
converges to a global minimum.
Other learning algorithms that use local learning rates have been proposed,
such as the heuristics in [125], the Quickprop algorithm [44], and the
convergence strategy [95]. In the last paper, the authors have discussed the
theoretical aspects of batch learning algorithms with local learning rates based
on the Lipschitz and Wolfe conditions.
Generally, the weights are randomly initialized with small positive values
or with small values with zero mean [119]. This helps the gradient descent
algorithm break symmetry, thus avoiding any redundancy in the neural
network. Initializing the weights with large values can saturate the neurons early
and slow down the training process. In theory, the likelihood that some neurons
are saturated early in an MLP neural network increases with the maximum value
of the weights [89]. The maximum value of the initial weights has been
computed in [37] through statistical analysis. In [131], the authors have shown
empirically that, in comparison to other techniques in the literature,
initializing the weights of an MLP neural network with values in the range
[−0.77, 0.77] attains the optimal performance.
in the output layer. In [140], based on the clustering and the k-nearest neighbors
algorithms the training instances have been grouped in a set of clusters based on
their accuracy.
In [96], [18], [19], and [11], the authors have proposed to initialize the
weights using the simulated annealing algorithm and the gradient descent
algorithm to refine the quality of the solution.
Numerous strategies have been introduced to avoid the local minima trap.
A straightforward and efficient technique consists of presenting the training
instances to the network in random order during each epoch. Another approach
involves using learning algorithms initialized in several regions of the weights
space, and then finding the optimal solution. This is a useful method for fast
converging algorithms, such as the conjugate gradient algorithm.
Another efficient way to avoid local minima is to inject noise into the
training process which leads to an increased generalization performance. In fact,
this approach has been employed by several annealing methods. The noise can
be added to the input neurons, to the output neurons, or to the weights, but once
added it should be reduced as the training process advances. Regardless of the
method selected, each will add a stochastic term to the weight vector. The
method in [105] uses an annealing average step-size. Selecting large steps
allows the algorithm to avoid the local minima, and small ones ensure
convergence in the local area. An effective method to avoid local minima is to
insert an annealing noise term in the gradient descent algorithm [18], [19], [30],
[11]. The SARprop algorithm employs this approach as well.
structure. The final properties of the solid are strongly dependent on the cooling
process: if the cooling process is carried out rapidly, the solid will degrade
with ease due to its imperfect structure, whilst if it is carried out slowly, the
solid will have a robust structure. The perfect crystalline structure
corresponds to the configuration of the global minimum energy.
Once perturbed, the ball can exhibit ascent or descent moves. Since the
temperature decreases gradually, the magnitude of the moves decreases too and
the ball can no longer pass all the peaks of the surface. By the time the
magnitude of the moves has decreased so much that the ball can only pass small
peaks, the ball should be near the global minimum and the entire algorithm near
completion. However, an excessive perturbation should be avoided because the
ball can also reach a higher position than the original one.
[Figure: temperature (°C) plotted against time.]
In the literature, there are two cooling methods: the exponential and the
linear cooling. Figure 3.5 illustrates the exponential cooling; it can be noted
that the algorithm spends less time at higher temperatures, and the time spent
increases as the temperature decreases, thus improving the solution found at
the previous iteration. In the case of linear cooling, the algorithm spends the
same time at each temperature and is consequently beneficial only if various
minima are nearby. In [86], the authors have shown that if the temperature is
decreased and increased recurrently, the solution is improved at each cycle and
it helps during training. Other cooling methods have been presented in [93].
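The difference between the two cooling methods can be made concrete with a short sketch. The geometric form T_k = T_0 · r^k for exponential cooling, the start temperature of 40, the rate of 0.9, and the function names are assumptions chosen for illustration.

```python
import numpy as np

def exponential_cooling(t0, rate, steps):
    """Temperature schedule T_k = t0 * rate**k (assumed geometric form)."""
    return t0 * rate ** np.arange(steps)

def linear_cooling(t0, t_end, steps):
    """Temperature schedule that decreases by the same amount at every step."""
    return np.linspace(t0, t_end, steps)

steps = 50
exp_schedule = exponential_cooling(40.0, 0.9, steps)
lin_schedule = linear_cooling(40.0, exp_schedule[-1], steps)

# Number of steps spent in the upper half of the temperature range.
exp_hot = int(np.sum(exp_schedule > 20.0))
lin_hot = int(np.sum(lin_schedule > 20.0))
```

With these hypothetical settings the exponential schedule spends 7 of its 50 steps above half of the start temperature, while the linear schedule spends 25, which matches the observation that exponential cooling passes less time at higher temperatures.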
For a physical system in state α and with energy Eα at temperature T, the
probability Pα of being in that state α satisfies the Boltzmann distribution:
$$P_\alpha = \frac{1}{Z} e^{-\frac{E_\alpha}{k_B T}} \qquad (3.37)$$
$$Z = \sum_{\beta} e^{-\frac{E_\beta}{k_B T}} \qquad (3.38)$$
where the sum is taken over all states β with energy Eβ at temperature T. If the
temperature T is high, irrespective of the energy, the Boltzmann distribution
shows a uniform characteristic for each state. If the temperature T decreases to
zero, only the states with minimum energy will have a probability greater than
zero.
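The two limiting cases of the Boltzmann distribution in (3.37) and (3.38) can be checked numerically. The three energy levels, the temperatures, and the choice kB = 1 are hypothetical values used only for this sketch.

```python
import numpy as np

def boltzmann_probabilities(energies, temperature, k_b=1.0):
    """State probabilities from equations (3.37) and (3.38)."""
    # Subtracting the minimum energy avoids overflow and cancels out in Z.
    scaled = -(energies - energies.min()) / (k_b * temperature)
    weights = np.exp(scaled)
    return weights / weights.sum()

energies = np.array([0.0, 1.0, 2.0])
p_hot = boltzmann_probabilities(energies, temperature=1000.0)
p_cold = boltzmann_probabilities(energies, temperature=0.01)
```

At the high temperature the three states are almost equally likely, while at the low temperature practically all the probability is concentrated on the state with minimum energy.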
equilibrium, since any change in states will not cause an increase in energy. It is
important to decrease the temperature T slowly to allow the algorithm to
conduct a comprehensive search at each temperature.
$$P = e^{-\frac{\Delta E}{T}} \qquad (3.39)$$
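A common way to use the probability in (3.39) is the Metropolis acceptance rule, sketched below under the assumption that moves which do not increase the energy are always accepted and that uphill moves are accepted with probability P. The function names are hypothetical.

```python
import math
import random

def acceptance_probability(delta_e, temperature):
    """Probability of accepting a move that changes the energy by delta_e."""
    if delta_e <= 0:
        return 1.0                       # assumed: downhill moves always accepted
    return math.exp(-delta_e / temperature)   # equation (3.39)

def accept_move(delta_e, temperature, rng):
    """Draw a uniform number and compare it with the acceptance probability."""
    return rng.random() < acceptance_probability(delta_e, temperature)

rng = random.Random(0)
accepted_downhill = accept_move(-1.0, 1.0, rng)
```

At high temperatures almost every uphill move is accepted, and at low temperatures almost none, so lowering the temperature slowly lets the search settle gradually.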
In this book, the neural network is used for classification, namely for
predicting the state of a two-class dependent variable based on the independent
variables. The architecture presented further has been proposed also in [18],
[19], and [11].
The MLP neural network training process used employs the BP algorithm
based on the generalized delta rule [117]. For each instance presented to the
network during training, the information in the form of independent variables
moves forward through the network to generate a prediction in the output layer.
This prediction is compared to the actual output value of the training instance,
and the difference between the predicted and actual values is propagated back
through the network to adjust the connections weights, thus improving the
prediction of similar features.
The activation functions of the input neurons have their values set to the
input instances. The output of each neuron in the hidden and the output layers is
calculated using equation (3.40):
is the bias in layer l, and ϕ (l)() is the activation function from layer l. The
activation function used for the hidden layers is the hyperbolic tangent function
(equation (3.41)), and the softmax function for the output layer (equation
(3.42)):
$$\phi(u_n^{(l)}) = \tanh(u_n^{(l)}) = \frac{e^{u_n^{(l)}} - e^{-u_n^{(l)}}}{e^{u_n^{(l)}} + e^{-u_n^{(l)}}} \qquad (3.41)$$
$$\phi(u_n^{(L)}) = \frac{e^{u_n^{(L)}}}{\sum_{j=1}^{k_L} e^{u_j^{(L)}}} \qquad (3.42)$$
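A forward pass that combines the hyperbolic tangent in the hidden layers (3.41) with the softmax in the output layer (3.42) can be sketched as follows. The layer sizes (11 inputs for the principal components, 5 hidden neurons, 2 output classes), the initialization range, and the function name `forward` are assumptions made for illustration; only the 11 components and the two classes come from the text.

```python
import numpy as np

def forward(x, weights, biases):
    """Propagate one instance through the network, layer by layer."""
    o = x                                    # o^(0) = x
    last = len(weights) - 1
    for l, (W, theta) in enumerate(zip(weights, biases)):
        u = W @ o + theta                    # weighted sum plus bias
        if l < last:
            o = np.tanh(u)                   # hidden layers, equation (3.41)
        else:
            e = np.exp(u - u.max())          # output layer, equation (3.42)
            o = e / e.sum()
    return o

rng = np.random.default_rng(0)
sizes = [11, 5, 2]                           # hypothetical architecture
weights = [rng.uniform(-0.5, 0.5, (sizes[i + 1], sizes[i])) for i in range(2)]
biases = [np.zeros(sizes[i + 1]) for i in range(2)]
y_hat = forward(rng.normal(size=11), weights, biases)
```

The two outputs are positive and sum to one, so they can be read as the predicted probabilities of the two classes.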
3.11.1. Weights Initialization and Learning Process
The weights of the neural network used are initialized applying the
simulated annealing algorithm [81] and the alternating training process [96].
This procedure is applied to a random subset to derive the initial weights K1 = 4
times. The simulated annealing algorithm is used to escape the local minimum
during the training process by perturbing this local minimum K2 = 4 times. If
the local minimum is passed successfully, for the next training cycle the
simulated annealing algorithm initializes the weights with more relevant values.
In order to find the global minimum, this procedure is repeated K3 = 3 times.
2. A loop counter k = 0 is initialized.
3. The network is trained using the initial weights and the trained weights
w are obtained.
4. If the training error is less than or equal to 0.05, the loop is stopped
and the weight vector w is used as the result of the loop. Otherwise, the
loop counter k is incremented by one.
w′ = w + wn, by adding random noise wn within the range
$[-(0.5)^{k+1}, (0.5)^{k+1}]$. If E(wmin) < E(w), where wmin is the perturbed
weight vector that produces the minimum training error, the initial
weights are set to wmin and the algorithm returns to step 3.
Otherwise, the loop is stopped and the w vector is considered the final
result.
If the resulting weights yield a training error greater than 0.1, the algorithm
is repeated until the training error is less than or equal to 0.1 or until it
has been repeated K3 times, and the weights that produce the minimum test error
within the k loops are selected.
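The perturbation step of the procedure above can be sketched in isolation. The code draws K2 = 4 perturbed copies of the trained weights with noise in the range [−(0.5)^(k+1), (0.5)^(k+1)], keeps the best one, and reports whether it improves on the unperturbed weights. The function name `perturb_and_select`, the uniform noise distribution, and the toy error function are assumptions; the training step itself is omitted.

```python
import numpy as np

def perturb_and_select(w, error_fn, k, n_perturbations=4, rng=None):
    """Perturb w several times and keep the best candidate if it improves."""
    if rng is None:
        rng = np.random.default_rng()
    half_width = 0.5 ** (k + 1)              # noise range shrinks as k grows
    candidates = [w + rng.uniform(-half_width, half_width, size=w.shape)
                  for _ in range(n_perturbations)]
    w_min = min(candidates, key=error_fn)    # perturbed vector with minimum error
    if error_fn(w_min) < error_fn(w):
        return w_min, True                   # retrain from w_min (back to step 3)
    return w, False                          # stop the loop and keep w

def toy_error(w):
    """Hypothetical stand-in for the training error E(w)."""
    return float(np.sum(w ** 2))

w = np.array([1.0, -1.0])
w_next, improved = perturb_and_select(w, toy_error, k=0,
                                      rng=np.random.default_rng(0))
```

Because the noise range halves with every loop, later perturbations explore an ever smaller neighborhood of the current weights.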
Within this architecture, the BP algorithm uses the cross entropy error
function, since it is a more appropriate alternative for classification problems
[97], [127]. The cross entropy is a function of the relative errors and is assumed
to estimate more precisely low probabilities [55], [65], [127]. To calculate the
first order partial derivatives of the error function in relation to the weights, the
BP algorithm with momentum [18], [19], [11] is used:
$$\frac{\partial E}{\partial w_{ij}^{(l)}} = 0 \qquad (3.43)$$
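$$\delta_{n,i}^{(l)} = -e_{n,i} \qquad (3.44)$$
$$\frac{\partial E_n}{\partial w_{ij}^{(l)}} = -\delta_{n,j}^{(l+1)} o_{n,i}^{(l)} \qquad (3.45)$$
where $\delta_{n,j}^{(l)} = -\frac{\partial E_n}{\partial u_{n,j}^{(l)}}$.
▪ Set:
$$\frac{\partial E}{\partial w_{ij}^{(l)}} = \frac{\partial E}{\partial w_{ij}^{(l)}} + \frac{\partial E_n}{\partial w_{ij}^{(l)}} \qquad (3.46)$$
$$\delta_{n,i}^{(l+1)} = \dot{\phi}_i^{(l+1)}(u_{n,i}^{(l+1)}) \sum_{q=1}^{k_{l+2}} \delta_{n,q}^{(l+2)} w_{iq}^{(l+1)} \qquad (3.47)$$
This gives a vector of size $\sum_{l=1}^{L-1} (k_l + 1) k_{l+1}$, which is the gradient ∇E(wk).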
The gradient descent method with the learning rate η0 = 0.4 and the
momentum α = 0.9 consists of the following steps [18], [19], [11]:
1. Let k = 0; the weight vector is initialized with w0, the learning rate
with η0, and Δw0 = 0.
2. The entire dataset is read and the error function E(wk) and its gradient
∇E(wk) are calculated. If $\lVert \nabla E(w_k) \rVert < 10^{-6}$, the
algorithm is stopped and the current network is reported.
The gradient descent method for online learning with learning rate
η0 = 0.4, minimum learning rate ηmin = 0.001 , momentum α = 0.9 , the learning
rate decay factor β = (1/np)ln(η0 /ηmin ) , number of training instances n, and
number of epochs p needed to reduce the initial learning rate to ηmin, consists of
the following steps [18], [19], [11]:
1. Let k = 0; the weight vector is initialized with w0, the learning rate
with η0, and Δw0 = 0.
function E(wk ) and its gradient ∇E(wk ) are calculated.
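The role of the decay factor β = (1/np) ln(η0/ηmin) can be illustrated with a short sketch. The exponential form η_k = η0 · e^(−βk), with k counting the instances presented so far, is an assumption that is consistent with the definition of β, since it reduces η0 to ηmin after exactly n · p instances. The values n = 100 and p = 10 and the function names are hypothetical.

```python
import math

def decay_factor(eta0, eta_min, n, p):
    """beta = (1 / (n * p)) * ln(eta0 / eta_min)."""
    return math.log(eta0 / eta_min) / (n * p)

def learning_rate(k, eta0, eta_min, beta):
    """Assumed schedule: exponential decay, never below eta_min."""
    return max(eta0 * math.exp(-beta * k), eta_min)

eta0, eta_min = 0.4, 0.001
n, p = 100, 10                       # hypothetical instance and epoch counts
beta = decay_factor(eta0, eta_min, n, p)
rate_start = learning_rate(0, eta0, eta_min, beta)
rate_after_p_epochs = learning_rate(n * p, eta0, eta_min, beta)
```

Under this assumption the learning rate starts at η0 = 0.4 and reaches ηmin = 0.001 once p epochs of n instances have been processed.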
The training process takes place over at least one epoch and then stops
according to the following criteria, which are checked in the following order
[18], [19], [11]:
1. During the model update, the total training error is calculated at the
end of each iteration. If during the K1 iteration, the training error does
not decrease below the current minimum error E1 over the next step,
the algorithm is stopped and the weights obtained at step K1 are
reported.
2. If the change in the training error is relatively small, the training
process is stopped and the weights obtained at step K1 are reported:
$$\frac{2\,\lvert E(w_k) - E(w_{k-1}) \rvert}{E(w_k) + E(w_{k-1}) + 10^{-10}} < 10^{-4} \qquad (3.48)$$
3. If the ratio between the current training error and the initial error is
low, the training process is stopped and the weights obtained at step K1
are reported:
$$\frac{E(w_k)}{E + 10^{-10}} < 10^{-3} \qquad (3.49)$$
▪ where E is the model error calculated using equation (3.50) in the error
function.
$$\hat{y}_l = \frac{1}{N}\sum_{l \in S} y_l \qquad (3.50)$$
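The two numerical stopping tests (3.48) and (3.49) can be written as small helper functions. The function names are hypothetical, and the absolute value in the numerator of the first test is an assumption made so that increases and decreases of the error are treated alike.

```python
def relative_change_small(e_curr, e_prev, tol=1e-4):
    """Stopping test (3.48): relative change of the training error."""
    change = 2.0 * abs(e_curr - e_prev)      # absolute value is an assumption
    return change / (e_curr + e_prev + 1e-10) < tol

def error_ratio_small(e_curr, e_reference, tol=1e-3):
    """Stopping test (3.49): current error relative to the reference error E."""
    return e_curr / (e_reference + 1e-10) < tol

stop_on_plateau = relative_change_small(0.50000, 0.50001)
stop_on_ratio = error_ratio_small(0.0005, 1.0)
```

The small constant in both denominators only prevents a division by zero when the errors vanish.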
Given that the ability to generalize is higher for smaller networks with
fewer parameters, the pruning technique based on sensitivity is applied to
eliminate the redundant neurons during training. This pruning technique has
been proposed in [146]; starting from a large neural network, it first removes
the redundant neurons in the hidden layer and then the redundant neurons in the
input layer. This process is repeated until the global convergence conditions are
met [18], [19], [11].
During the elimination phase of the redundant hidden neurons, the neural
network is trained using the entire training dataset; if any convergence
condition is met, the process advances to the second phase, which eliminates the
redundant input neurons. The elimination of hidden neurons stops if the
global convergence criteria are met, the current error is three times the error of
the most suitable network, or the persistence limit in the hidden layer is
exceeded, where persistence is defined as the number of training cycles without
any improvement. If no convergence condition is met, a sensitivity analysis is
performed to identify the redundant neurons in the hidden layer.
To perform the sensitivity analysis for the hidden neurons, the test dataset
is applied to the network and the results are recorded as a reference. Next, the
weights of the first hidden neuron are temporarily set to zero, the test dataset is
applied to the modified network, and the results are compared. For each instance,
the absolute difference between the results obtained for the entire network and
the results obtained for the modified network is calculated, along with the
standard deviation of these differences across the entire test dataset. This
process is repeated for each hidden neuron, and the neurons are ranked according
to this value. A high value indicates a significant neuron, while a low value
indicates a redundant one.
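The ranking procedure for hidden neurons can be sketched as follows. The sketch assumes a network with a single hidden layer, tanh hidden units, and a plain logistic output in place of the generalized sigmoid. It silences a neuron by zeroing its outgoing weights, which is one possible reading of "setting its weights to zero". All names and the random toy network are hypothetical.

```python
import numpy as np

def forward(X, W1, b1, W2, b2):
    """One-hidden-layer MLP: tanh hidden units, logistic output (assumed)."""
    hidden = np.tanh(X @ W1 + b1)
    return 1.0 / (1.0 + np.exp(-(hidden @ W2 + b2)))

def hidden_sensitivities(X_test, W1, b1, W2, b2):
    """Std. dev. of |reference - ablated| outputs for each hidden neuron."""
    reference = forward(X_test, W1, b1, W2, b2)
    scores = []
    for j in range(W1.shape[1]):
        W2_ablated = W2.copy()
        W2_ablated[j, :] = 0.0          # temporarily silence neuron j
        ablated = forward(X_test, W1, b1, W2_ablated, b2)
        diff = np.abs(reference - ablated).ravel()
        scores.append(diff.std())
    return np.array(scores)

rng = np.random.default_rng(0)
X_test = rng.normal(size=(50, 4))
W1, b1 = rng.normal(size=(4, 3)), np.zeros(3)
W2, b2 = rng.normal(size=(3, 1)), np.zeros(1)
W2[1, 0] = 0.0                           # neuron 1 has no effect on the output
scores = hidden_sensitivities(X_test, W1, b1, W2, b2)
ranking = np.argsort(scores)             # least sensitive (most redundant) first
```

In this toy network the neuron whose outgoing weight is already zero receives a sensitivity of zero and is ranked as the first pruning candidate.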
During the elimination phase of the redundant input neurons, the neural
network is trained using the entire training dataset; if any convergence
condition is met, the global convergence conditions are verified and, if
necessary, the two elimination phases are repeated. The elimination of
input neurons stops if the global convergence criteria are met, the current
network error is three times the error of the most suitable network, or the
persistence limit in the input layer is exceeded. If no convergence condition is
met, a sensitivity analysis is performed to identify the redundant neurons in the
input layer.
To perform the sensitivity analysis for the input neurons, the value of the
independent variable is varied for each instance in the test dataset, the maximum
and the minimum output values are recorded, and the maximum difference for
each instance is calculated along with the arithmetic mean.
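A minimal sketch of this input sensitivity measure is given below. The text does not state how the input value is varied, so the sketch assumes a sweep over evenly spaced values between the minimum and maximum observed in the test dataset. The function names, the grid size, and the toy model are hypothetical.

```python
import numpy as np

def input_sensitivities(predict, X_test, n_grid=11):
    """Mean over instances of (max output - min output) when one input varies.

    predict: function mapping an (m, d) array to m outputs (assumed to be
             supplied by the trained network).
    Assumption: each input is swept over n_grid evenly spaced values between
    its minimum and maximum observed in X_test.
    """
    m, d = X_test.shape
    scores = np.zeros(d)
    for i in range(d):
        grid = np.linspace(X_test[:, i].min(), X_test[:, i].max(), n_grid)
        outputs = np.empty((n_grid, m))
        for g, value in enumerate(grid):
            X_mod = X_test.copy()
            X_mod[:, i] = value          # vary only input i
            outputs[g] = np.ravel(predict(X_mod))
        per_instance_range = outputs.max(axis=0) - outputs.min(axis=0)
        scores[i] = per_instance_range.mean()
    return scores

# Toy model that ignores its second input entirely.
predict = lambda X: np.tanh(2.0 * X[:, 0] + 0.0 * X[:, 1])
rng = np.random.default_rng(1)
X_test = rng.uniform(-1, 1, size=(40, 2))
scores = input_sensitivities(predict, X_test)
```

The ignored input obtains a score of zero, which marks it as redundant, while the input that drives the output obtains a positive score.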
3.12. Conclusions
The first part of this chapter presents the theoretical foundations
of neural networks and reviews some noteworthy research papers
from the literature.
In this chapter, the MLP neural network architecture [18], [19], [11] is
proposed for implementing the predictive model. This MLP neural
network is trained using the backpropagation learning algorithm [138], which is
improved by applying the momentum method to adjust the weights [119]. The
weights are updated using the stochastic gradient descent optimization
method [36] after each training instance is presented to the network. The
hyperbolic tangent function is used as the activation function for the hidden
layers and the generalized sigmoid function [118] for the output layer, which
adds flexibility to the model. The structure of the MLP network is optimized
using a pruning technique based on the sensitivity of the input and hidden
neurons and on mutual information [146]. The network pruning strategy starts
from a large network and gradually removes redundant neurons during the
learning process. To accelerate the training process, the learning rate is
adapted by initializing it with a higher value and gradually decreasing it as
learning progresses, and the weights are initialized using the simulated
annealing algorithm [81].
The proposed MLP neural network architecture [18], [19], [11] is scalable
to large datasets and can be applied to any dataset from any field of activity, as
long as the problem to be solved is a classification problem.
CHAPTER 4
4. MODEL EVALUATION AND DEPLOYMENT
Reaching this phase implies that the model has been implemented and
yields a good predictive performance. Before proceeding to the next phase, it is
important to evaluate and review the steps taken to create the predictive model
to ensure that it meets the objectives accordingly. It is important to determine
whether there is an objective that has not been considered. At the end of this
phase, a decision is going to be made regarding the use of the results obtained
during the entire data mining process. At this stage, the predictive model is
assessed to decide whether or not the predictions are considered a success.
Table 4.1.
Dataset          Observed    Predicted No   Predicted Yes   % correct
Training/Test    No          TN             FP              TNR
                 Yes         FN             TP              TPR
                 Total %     NPV            PPV             ACC
The confusion matrix cells are called: true negatives (TN), false positives
(FP), false negatives (FN), and true positives (TP) (Table 4.1). The rest of the
measures from Table 4.1, such as TNR (true negatives rate or specificity), TPR
(true positives rate or sensitivity), NPV (negative predictive value), PPV
(positive predictive value or precision), and ACC (accuracy) are calculated
using the following equations:
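\[ \mathrm{TPR} = \frac{TP}{TP + FN} \tag{4.1} \]
\[ \mathrm{TNR} = \frac{TN}{TN + FP} \tag{4.2} \]
\[ \mathrm{PPV} = \frac{TP}{TP + FP} \tag{4.3} \]
\[ \mathrm{NPV} = \frac{TN}{TN + FN} \tag{4.4} \]
\[ \mathrm{ACC} = \frac{TP + TN}{P + N} \tag{4.5} \]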
where P represents the total number of positives and N is the total number of
negatives.
Based on these measures, Table 4.2 shows the confusion matrix
for the machine learning algorithm used, namely the MLP neural network, on
both the training and the test datasets.
Table 4.2.
Dataset     Observed    Predicted No   Predicted Yes   % correct
Training    No          2279           0               100.00%
            Yes         0              2294            100.00%
            Total %     49.84%         50.16%          100.00%
Test        No          569            2               99.65%
            Yes         1              94              98.95%
            Total %     85.59%         14.41%          99.55%
In the training phase, the prediction model has correctly classified all the
2,294 customers who have previously churned and stopped using the services
offered by the mobile telecommunication company, with a true positives rate of
100%. Of the 2,279 customers who continued to use the company's services, all
the customers are classified correctly, providing a specificity of 100%. In other
words, all the customers in the training dataset are classified correctly.
Within the test dataset, out of the 95 customers who stopped using the
services offered by the mobile telecommunication company, 94 customers are
classified correctly (a true positives rate of 98.95%); and of the 571 customers
who kept using the services, 569 customers are classified correctly (a specificity
of 99.65%). Overall, approximately 99.55% of the customers in the test dataset
are classified correctly and approximately 0.45% are misclassified.
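The reported test-set percentages can be reproduced from the four cells of the confusion matrix. The sketch below applies equations (4.1) to (4.5) to the test counts of Table 4.2; the function name is hypothetical.

```python
def confusion_metrics(tn, fp, fn, tp):
    """Measures (4.1)-(4.5) from the four confusion-matrix cells."""
    p, n = tp + fn, tn + fp              # observed positives and negatives
    return {
        "TPR": tp / (tp + fn),
        "TNR": tn / (tn + fp),
        "PPV": tp / (tp + fp),
        "NPV": tn / (tn + fn),
        "ACC": (tp + tn) / (p + n),
    }

# Test-dataset counts from Table 4.2.
test_metrics = confusion_metrics(tn=569, fp=2, fn=1, tp=94)
for name, value in test_metrics.items():
    print(f"{name}: {value:.2%}")
# TPR: 98.95%, TNR: 99.65%, PPV: 97.92%, NPV: 99.82%, ACC: 99.55%
```

The values of TPR, TNR, and ACC agree with Table 4.2. The Total % row of Table 4.2 (85.59% and 14.41%) appears to report the share of instances predicted in each class (570/666 and 96/666) rather than the NPV and the PPV.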
Another way of interpreting the results is the lift chart. This type of graph
sorts the predicted pseudo-probabilities [101] in descending order and displays
the corresponding curve. There are two types of lift charts: incremental and
cumulative. The incremental lift chart represented in Figure 4.1 shows the lift
factor in each percentile [43] without any accumulation for the Yes class of the
dependent variable Churn. The curve corresponding to this predictive model
falls below the gray line, which corresponds to the random expectation (RND
E), around the 16th percentile. This means that compared to the random
expectation, the model achieves its maximum performance in the first 16% of
the instances.
[Figure 4.1. Incremental lift chart for Churn = Yes: NN-MLP curve versus the random expectation (RND E); horizontal axis: Percentile, vertical axis: Lift.]
The cumulative lift chart indicates the prediction rate of the model
compared to the random expectation. Figure 4.2 illustrates the curve of the
cumulative lift chart for the Yes class of the dependent variable Churn.
Reading the chart at the 16th percentile on the horizontal axis, the model has
a lift index of approximately 7 on the vertical axis, meaning that its
predictive performance is approximately 7 times better than that of a random
model.
[Figure 4.2. Cumulative lift chart for Churn = Yes: NN-MLP curve versus the random expectation (RND E); horizontal axis: Percentile, vertical axis: Lift.]
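Both lift measures can be computed directly from the observed classes and the predicted pseudo-probabilities. The following sketch sorts the instances in descending order of score, splits them into equally sized bins, and divides the churn rate of each bin (incremental) or of all bins up to it (cumulative) by the overall churn rate. The function name, the number of bins, and the toy data are assumptions, and the dataset is assumed to contain at least as many instances as bins.

```python
import numpy as np

def lift_by_percentile(y_true, scores, n_bins=100):
    """Incremental and cumulative lift for the positive class.

    y_true: 1 for churners (Churn = Yes), 0 otherwise.
    scores: predicted pseudo-probabilities of the positive class.
    """
    y_true = np.asarray(y_true)
    order = np.argsort(-np.asarray(scores), kind="stable")   # descending
    bins = np.array_split(y_true[order], n_bins)
    base_rate = y_true.mean()            # rate expected from random selection
    hits = np.array([b.sum() for b in bins], dtype=float)
    sizes = np.array([len(b) for b in bins], dtype=float)
    incremental = (hits / sizes) / base_rate
    cumulative = (hits.cumsum() / sizes.cumsum()) / base_rate
    return incremental, cumulative

# Toy data: 10 churners out of 100 customers, ranked perfectly.
y_true = np.array([1] * 10 + [0] * 90)
scores = np.linspace(1.0, 0.0, 100)
incremental, cumulative = lift_by_percentile(y_true, scores, n_bins=10)
```

With a perfect ranking, the first bin of this toy example has a lift of 10 (the inverse of the 10% churn rate), the remaining bins have an incremental lift of 0, and the cumulative lift falls to 1 once all instances are included.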
The performance of this predictive model can also be evaluated using the
gain measure. The gain chart shows the percentage of positive responses on the
vertical axis, and the percentage of customers contacted on the horizontal axis.
The gain measure is defined as the proportion of respondents present in each
percentile relative to the total number of respondents. The cumulative gain chart
shows the prediction rate of the model compared to the random expectation.
Figure 4.3 illustrates the curve corresponding to the MLP neural network
predictive model for the Yes class of the dependent variable Churn. It can be
seen that in the 16th percentile, the predictive model has a performance of
approximately 99%.
[Figure 4.3. Cumulative gain chart for Churn = Yes: NN-MLP curve versus the random expectation (RND E); horizontal axis: Percentile, vertical axis: % Gain.]
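The cumulative gain can be sketched in a few lines: after sorting the instances by descending pseudo-probability, the gain at a given percentile is the share of all respondents found among the instances up to that percentile. The function name and the toy data are hypothetical.

```python
import numpy as np

def cumulative_gain(y_true, scores, percentiles):
    """Share of all positives captured up to each percentile of the ranking."""
    y_sorted = np.asarray(y_true)[np.argsort(-np.asarray(scores), kind="stable")]
    captured = np.cumsum(y_sorted) / y_sorted.sum()
    # Number of top-ranked instances covered by each percentile.
    cutoffs = np.ceil(len(y_sorted) * np.asarray(percentiles) / 100).astype(int)
    return captured[np.maximum(cutoffs, 1) - 1]

# Toy data: 15 churners out of 100 customers, ranked perfectly.
y_true = np.array([1] * 15 + [0] * 85)
scores = np.linspace(1.0, 0.0, 100)
gains = cumulative_gain(y_true, scores, percentiles=[10, 16, 100])
```

In this toy example the first 10% of the ranked customers contain two thirds of the churners, and the first 16% contain all of them. The random expectation corresponds to a gain equal to the percentile itself.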
Figure 4.4 shows the ROC curve of the predictive model. The ROC curve
is derived from the confusion matrix and uses only the TPR and the FPR (false
positives rate) measures, the latter being obtained as one minus the
specificity. In Figure 4.4, the curve approaches the point (0, 1) in the upper
left corner, which corresponds to a perfect prediction. Our predictive model
based on neural networks obtains a sensitivity of 99% and a specificity of
100%.
[Figure 4.4. ROC curve for Churn = Yes: NN-MLP curve versus the random expectation (RND E); horizontal axis: FPR (1-Specificity), vertical axis: TPR (Sensitivity).]
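The points of a ROC curve can be obtained by thresholding the pseudo-probabilities at each distinct value and computing one confusion matrix per threshold. The sketch below is illustrative only; the function name and the toy data are assumptions.

```python
import numpy as np

def roc_points(y_true, scores):
    """(FPR, TPR) pairs obtained by thresholding at each distinct score."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    positives = y_true.sum()
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]                # threshold above every score
    for threshold in np.unique(scores)[::-1]:
        predicted_yes = scores >= threshold
        tp = np.sum(predicted_yes & (y_true == 1))
        fp = np.sum(predicted_yes & (y_true == 0))
        points.append((fp / negatives, tp / positives))   # FPR = 1 - specificity
    return points

# Toy data: six instances with distinct scores.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
points = roc_points(y_true, scores)
```

A single confusion matrix yields a single point of the curve. For the test dataset in Table 4.2, that point is FPR = 2/571 and TPR = 94/95, which lies close to the corner (0, 1).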
organizing the results generated by the predictive model so that the mobile
telecommunication company can take specific decisions. These results can be
organized in the form of a document in which the instances are ordered based on
their pseudo-probabilities of belonging to a class of the dependent variable. If a
more complex approach to this problem is considered, then these results can be
integrated in an interactive reporting system in which the datasets used by the
data mining models are extracted from a database and automatically scored by
these models, and the results are then viewed through certain reporting tools.
If the company decides to use the lift chart, it can select the first 16% to
20% of the customers, sorted by their pseudo-probabilities, and expect to
reach about seven times as many customers who intend to churn (a lift factor
of 7) as it would by selecting customers randomly.
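This selection step can be sketched as a small helper that ranks the scored customers and keeps the top fraction. The function name, the customer identifiers, and the default fraction of 16% are illustrative assumptions.

```python
def select_top_customers(scored, fraction=0.16):
    """Return the IDs of the top `fraction` of customers by churn score.

    scored: iterable of (customer_id, pseudo_probability) pairs.
    """
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    count = max(1, round(fraction * len(ranked)))
    return [customer_id for customer_id, _ in ranked[:count]]

# Hypothetical scored customers: 25 IDs with increasing pseudo-probabilities.
scored = [(f"C{i:02d}", i / 25) for i in range(25)]
top = select_top_customers(scored, fraction=0.16)
```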
4.3. Conclusions
Using the incremental lift chart, it can be observed that the
predictive model achieves its maximum performance in the first 16% of the
instances, because its curve falls below the gray line corresponding to the
random expectation around the 16th percentile. The cumulative lift chart
indicates that at the 16th percentile the model has a lift index of
approximately 7 on the vertical axis, meaning that its predictive performance
is about 7 times better than that of a random model.
Based on the gain chart, at the 16th percentile the predictive model
implemented using the MLP neural network captures approximately 99% of the
total number of respondents.
By interpreting the chart of the ROC curve, it can be seen that the model
is very close to the perfect prediction with a sensitivity of 99% and a specificity
of 100%.
This chapter represents the evaluation and the deployment phases of the
CRISP-DM methodology.
CHAPTER 5
5. CONCLUSIONS
To precisely identify only the customers who are at risk of churning, the
companies active in this industry must implement data mining models that
employ machine learning algorithms and yield highly accurate results.
The newly formed dataset consists of the 11 principal components, the 2
nominal independent variables, and the dependent variable Churn, and is used
to train the machine learning algorithm. Prior to implementing the predictive
model, the distribution of the dependent variable is balanced using the
oversampling method [63] to ensure an optimal learning by the machine
learning algorithm.
also in [18], [19], and [11].
BIBLIOGRAPHY
1. Ackley, D.H., G.E. Hinton, and T.J. Sejnowski, A Learning Algorithm for
Boltzmann Machines. Cognitive Science, 1985. 9(2).
2. Agresti, A., Categorical Data Analysis. 2002: Wiley.
3. Anderson, T.W., Asymptotic Theory for Principal Component Analysis. The
Annals of Mathematical Statistics, 1963. 34.
4. Arthur, Y.D., E. Harris, and J. Annan, Principal Component Analysis of
Customer Churns in Ghanaian Telecommunication Industry. American
International Journal of Contemporary Research, 2012. 2(12).
5. Baesens, B., S. Viaene, D. Van den Poel, J. Vanthienen, and G. Dedene,
Bayesian Neural Network Learning for Repeat Purchase Modeling in Direct
Marketing. European Journal of Operational Research, 2002. 138(1).
6. Battiti, R., Accelerated Backpropagation Learning: Two Optimization
Methods. Complex Systems, 1989. 3.
7. Bellman, R.E., Dynamic Programming. 1957, Princeton: Princeton University
Press. xxv, 342 p.
8. Berson, A., S. Smith, and K. Thearling, Building Data Mining Applications for
CRM. 1999: McGraw-Hill.
9. Bhattacharyya, S. and P. Pendharkar, Inductive, Evolutionary and Neural
Techniques for Discrimination: A Comparative Study. Decision Sciences,
1998. 29.
10. Blake, C.L. and C.J. Merz, Churn Data Set. 1998: California, USA.
11. Brandusoiu, I.B., Methods for Predicting the Evolution of the Number of
Subscribers in the Mobile Telecommunications Industry. 2016, Technical
University of Cluj-Napoca.
12. Brandusoiu, I.B. and G. Toderean, Churn Prediction in the Telecommunications
Sector Using Support Vector Machines. Annals of the Oradea University Fascicle
of Management and Technological Engineering, 2013. 22(1).
13. Brandusoiu, I.B. and G. Toderean, Churn Prediction Modeling in Mobile
Telecommunications Industry Using Decision Trees. University of Oradea
Journal of Computer Science and Control Systems, 2013. 6(1).
14. Brandusoiu, I.B. and G. Toderean, A Neural Networks Approach for Churn
Prediction Modeling in Mobile Telecommunications Industry. University of
Pitesti Scientific Bulletin Series: Electronics and Computers Science, 2013.
13(1).
15. Brandusoiu, I.B. and G. Toderean, Predicting Churn in Mobile
Telecommunications Industry. ACTA Technica Napocensis Electronics and
Telecommunications, 2013. 54(3).
16. Brandusoiu, I.B. and G. Toderean, Applying Principal Component Analysis on
Call Detail Records. ACTA Technica Napocensis Electronics and
Telecommunications, 2014. 55(4).
17. Brandusoiu, I.B. and G. Toderean, Churn Prediction in the Telecommunications
Sector Using Bayesian Networks. University of Oradea Journal of
Computer Science and Control Systems, 2015. 8(2).
18. Brandusoiu, I.B. and G. Toderean, Churn Prediction in the Telecommunications
Sector Using Neural Networks. ACTA Technica Napocensis
Electronics and Telecommunications, 2016. 57(1).
19. Brandusoiu, I.B. and G. Toderean, Methods for Churn Prediction in the
Pre-Paid Mobile Telecommunications Industry, 11th International Conference on
Communications (COMM). 2016.
20. Broomhead, D.S. and D. Lowe, Multivariable Functional Interpolation and
Adaptive Networks. Complex Systems, 1988. 2.
21. Bryson, A.E., W.F. Denham, and E.S. Dreyfus, Optimal Programming
Problems with Inequality Constraints I: Necessary Conditions for Extremal
Solutions. AIAA Journal, 1963. 1(11).
22. Cabena, P., P. Hadjinian, R. Stadler, J. Verhees, and A. Zanasi, Discovering
Data Mining: From Concept to Implementation. 1998: Prentice Hall.
23. Castanedo, F., G. Valverde, J. Zaratiegu, and A. Vazquez, Using Deep Learning
to Predict Customer Churn in a Mobile Telecommunication Network. 2014.
24. Castellano, G., A.M. Fanelli, and M. Pelillo, An Iterative Pruning Algorithm
for Feedforward Neural Networks. IEEE Transactions on Neural Networks,
1997. 8(3).
25. Cerny, V., Thermodynamical Approach to the Traveling Salesman Problem: An
Efficient Simulation Algorithm. Journal of Optimization Theory and
Applications, 1985. 45.
26. Chakrabarti, S., M. Ester, U. Fayyad, J. Gehrke, J. Han, S. Morishita, G.
Piatetsky-Shapiro, and W. Wang, Data Mining Curriculum: A Proposal Version
1.0. ACM Digital Library, 2006.
27. Chandrasekaran, H., H.H. Chen, and M.T. Manry, Pruning of Basis Functions
in Nonlinear Approximations. Neurocomputing, 2000. 34.
28. Chang, C.C. and C.J. Lin, LIBSVM: A Library for Support Vector Machines.
2001.
29. Chapman, P., J. Clinton, R. Kerber, T. Khabaza, T. Reinartz, and C.E.A.
Shearer, CRISP-DM 1.0: Step-by-Step Data Mining Guide. 2000.
30. Choi, J.J., P. Arabshahi, R.J. Marks, and T.P. Caudell. Fuzzy Parameter
Adaptation in Neural Systems. Proceedings of International Joint Conference
on Neural Networks. 1992.
31. Chua, L.O. and L. Yang, Cellular Neural Network: I. Theory II. Applications.
IEEE Transactions on Circuits and Systems, 1988.
32. Cichocki, A. and R. Unbehauen, Neural Networks for Optimization and Signal
Processing. 1992: Wiley.
33. Comon, P., Independent Component Analysis – A New Concept. Signal
Processing, 1994. 36(3).
34. Coussement, K. and D. Van den Poel, Integrating the Voice of Customers
through Call Center Emails into a Decision Support System for Churn
Prediction. Information and Management, 2008. 45.
35. Coussement, K. and D. Van den Poel, Improving Customer Attrition Prediction
by Integrating Emotions from Client/Company Interaction Emails and
Evaluating Multiple Classifiers. Expert Systems with Applications, 2009. 36.
36. Da Silva, I.N., D.N. Spatti, R.A. Flauzino, L.H. Bartocci, and S.F. Dos Reis,
Artificial Neural Networks: A Practical Course. 2017: Springer.
37. Denoeux, T. and R.H. Lengelle, Initializing Back Propagation Networks with
Prototypes. Neural Networks, 1993. 6.
38. Dietterich, T.G., Approximate Statistical Tests for Comparing Supervised
Classification Learning Algorithms. Neural Computation, 1998. 10(7).
39. Dong, J.X., A. Krzyzak, and C.Y. Suen, Fast SVM Training Algorithm with
Decomposition on Very Large Data Sets. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 2005. 27(4).
40. Drago, G.P. and S. Ridella, Statistically Controlled Activation Weight
Initialization. IEEE Transactions on Neural Networks, 1992. 3.
41. Duch, W., Uncertainty of Data, Fuzzy Membership Functions, and Multilayer
Perceptrons. IEEE Transactions on Neural Networks, 2005. 16.
42. Dunteman, G.H., Principal Components Analysis. 1989: Sage Publications.
43. Edwards, D.I., Introduction to Graphical Modeling 2nd Edition. 2000: Springer.
44. Fahlman, S.E. Fast Learning Variations on Backpropagation: An Empirical
Study. Proceedings of 1988 Connectionist Models Summer School. 1988.
45. Fahlman, S.E. and C. Lebiere, The Cascade-Correlation Learning Architecture.
Advances in Neural Information Processing Systems, 1990. 2.
46. Fayyad, U., G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, Advances in
Knowledge Discovery and Data Mining. 1996: AAAI Press.
47. Finnoff, W., Diffusion Approximations for the Constant Learning Rate
Backpropagation Algorithm and Resistance to Local Minima. Neural
Computation, 1994. 6(2).
48. Frawley, W., G. Piatetsky-Shapiro, and C. Matheus, Knowledge Discovery in
Databases – An Overview. Knowledge Discovery in Databases, 1991.
49. Freeman, M., The 2 Customer Lifecycles. Intelligent Enterprise, 1999. 2(16).
50. Fu, S.K., Efficient Learning of Markov Blanket and Markov Blanket Classifier.
2010, University of Montreal.
51. Fukunaga, K., Introduction to Statistical Pattern Recognition. 1990: Academic
Press.
52. Gartner, G. 2015; Available from: www.gartner.com.
53. Girshick, M.A., Principal Components. Journal of the American Statistical
Association, 1936. 31.
54. Girshick, M.A., On the Sampling Theory of Roots of Determinantal Equations.
The Annals of Mathematical Statistics, 1939. 10.
55. Gish, H. A Probabilistic Approach to the Understanding and Training of Neural
Network Classifiers. Proceedings of IEEE International Conference on
Acoustics, Speech, and Signal Processing. 1990.
56. Goh, Y.S. and E.C. Tan, Pruning Neural Networks During Training by
Backpropagation, IEEE Region 10's 9th Annual International Conference.
1994.
57. Gori, M. and M. Maggini, Optimal Convergence of Online Backpropagation.
IEEE Transactions on Neural Networks, 1996. 7(1).
58. Gorunescu, F., Data Mining Concepts, Models and Techniques. 2011: Springer
Verlag.
59. Gower, J.C., Some Distance Properties of Latent Root and Vector Methods
Used in Multivariate Analysis. Biometrika, 1966. 53.
60. Grumbach, S. and T. Milo, Towards Tractable Algebras for Bags. Journal of
Computer and System Sciences, 1996. 52(3).
61. Gupta, A. and S.M. Lam, Weight Decay Backpropagation for Noisy Data.
Neural Networks, 1998. 11.
62. Hand, D., H. Mannila, and P. Smyth, Principles of Data Mining. 2001: MIT
Press.
63. He, H. and Y. Ma, Imbalanced Learning Foundations, Algorithms, and
Applications. 2013: John Wiley & Sons.
64. Hebb, D.O., The Organization of Behavior. 1949: Wiley.
65. Hinton, G.E., Connectionist Learning Procedure. Artificial Intelligence, 1989.
40.
66. Hodgkin, A.L. and A.F. Huxley, A Quantitative Description of Membrane Current
and its Application to Conduction and Excitation in Nerve. The Journal
of Physiology, 1952. 117.
67. Hopfield, J.J. Neural Networks and Physical Systems with Emergent
Collective Computational Abilities. Proceedings of National Academy of
Sciences of the USA. 1982.
68. Hotelling, H., Analysis of a Complex of Statistical Variables into Principal
Components. Journal of Educational Psychology, 1933. 24(6).
69. Hotelling, H., Simplified Calculation of Principal Components. Psychometrika,
1936. 1.
70. Hung, S., D. Yen, and H. Wang, Applying Data Mining to Telecom Churn
Management. Expert Systems with Applications, 2006. 31.
71. Hwang, J., S. Lay, and A. Lippman, Nonparametric Multivariate Density
Estimation: A Comparative Study. IEEE Transactions on Signal Processing,
1994. 42(10).
72. Ishikawa, M., Learning of Modular Structured Networks. Artificial
Intelligence, 1995. 75.
73. Jeffers, J.N.R., Two Case Studies in the Application of Principal Component
Analysis. Journal of the Royal Statistical Society, 1967. 16(3).
74. Jensen, F.V. and T.D. Nielsen, Bayesian Networks and Decision Graphs. 2007:
Springer.
75. Jiang, X., M. Chen, M.T. Manry, M.S. Dawson, and A.K. Fung, Analysis and
Optimization of Neural Networks for Remote Sensing. Remote Sensing
Reviews, 1994. 9.
76. Jimenez, L.O. and D.A. Landgrebe, Supervised Classification in High-Dimensional
Space: Geometrical, Statistical, and Asymptotical Properties of Multivariate
Data. IEEE Transactions on Systems, Man, and Cybernetics, 1998. 28.
77. Kanjilal, P.P. and D.N. Banerjee, On the Application of Orthogonal
Transformation for the Design and Analysis of Feedforward Networks. IEEE
Transactions on Neural Networks, 1995. 6(5).
78. Karnin, E.D., A Simple Procedure for Pruning Backpropagation Trained Neural
Networks. IEEE Transactions on Neural Networks, 1990. 1(2).
79. Kim, H. and C. Yoon, Determinants of Subscriber Churn and Customer
Loyalty in the Korean Mobile Telephony Market. Telecommunications Policy,
2004. 28.
80. Kim, J.O. and C.W. Mueller, Factor Analysis: Statistical Methods and Practical
Issues. 1978: Sage Publications.
81. Kirkpatrick, S., C.D. Gelatt, and M.P. Vecchi, Optimization by Simulated
Annealing. Science, 1983. 220(4598).
82. Kirui, C., L. Hong, W. Cheruiyot, and H. Kirui, Predicting Customer Churn in
Mobile Telephony Industry Using Probabilistic Classifiers in Data Mining.
International Journal of Computer Science Issues, 2013. 10(2).
83. Kisioglu, P. and Y.I. Topcu, Applying Bayesian Belief Network Approach to
Customer Churn Analysis: A Case Study on the Telecom Industry of Turkey.
Expert Systems with Applications, 2011. 38(6).
84. Kotler, P. and L. Keller, Marketing Management 12th Edition. 2006: Prentice
Hall.
85. Le Cun, Y., P.Y. Simard, and B. Pearlmutter, Automatic Learning Rate
Maximization by Online Estimation of the Hessian’s Eigenvectors. Advances
in Neural Information Processing Systems, 1993. 5.
86. Ledesma, S., M. Torres, D. Hernandez, G. Avina, and G. Garcia. Temperature
Cycling on Simulated Annealing for Neural Network Learning. Proceedings of
MICAI. 2007.
87. Lee, B.W. and B.J. Shen, Design and Analysis of Analog VLSI Neural
Networks. Neural Networks for Signal Processing, 1992.
88. Lee, K.C. and N.Y. Jo, Bayesian Network Approach to Predict Mobile Churn
Motivations: Emphasis on General Bayesian Network, Markov Blanket, and
What-If Simulation. Future Generation Information Technology, Lecture Notes
in Computer Science, 2010. 6485.
89. Lee, Y., S.H. Oh, and M.W. Kim. The Effect of Initial Weights on Premature
Saturation in Back-Propagation Training. Proceedings IEEE International Joint
Conference on Neural Networks. 1991.
90. Lehtokangas, M., P. Korpisaari, and K. Kaski, Maximum Covariance Method
for Weight Initialization of Multilayer Perceptron Networks. Proceedings of
European Symposium on Artificial Neural Networks, 1996.
91. Lehtokangas, M., J. Saarinen, P. Huuhtanen, and K. Kaski, Initializing Weights
of a Multilayer Perceptron Network by Using the Orthogonal Least Squares
Algorithm. Neural Computation, 1995. 7.
92. Loeve, M., Probability Theory 3rd Edition. 1963: Van Nostrand.
93. Luke, B.T., Simulated Annealing Cooling Schedules. 2007.
94. Madden, G., S. Savage, and G. Coble-Neal, Subscriber Churn in the Australian
ISP Market. Information Economics and Policy, 1999. 11.
95. Magoulas, G.D., V.P. Plagianakos, and M.N. Vrahatis, Globally Convergent
Algorithms with Local Learning Rates. IEEE Transactions on Neural
Networks, 2002. 13(3).
96. Masters, T., Practical Neural Network Recipes in C++. 1993: Academic Press.
97. Matsuoka, K. and J. Yi. Backpropagation Based on the Logarithmic Error
Function and Elimination of Local Minima. Proceedings of the International
Joint Conference on Neural Networks. 1991.
98. McCulloch, W.S. and W. Pitts, A Logical Calculus of the Ideas Immanent in
Nervous Activity. The Bulletin of Mathematical Biophysics, 1943. 5.
99. McLoone, S., M.D. Brown, G. Irwin, and G. Lightbody, A Hybrid Linear/
Nonlinear Training Algorithm for Feedforward Neural Networks. IEEE
Transactions on Neural Networks, 1998. 9(4).
100. Metropolis, N., A. Rosenbluth, M. Rosenbluth, A. Teller, and E. Teller,
Equation of State Calculations by Fast Computing Machines. Journal of
Chemical Physics, 1953. 21(6).
101. Minsky, M.L. and S. Papert, Perceptrons. 1969: MIT Press.
102. Mitchell, T., Machine Learning. 1997: McGraw-Hill.
103. Narayan, S., The Generalized Sigmoid Activation Function: Competitive
Supervised Learning. Information Sciences, 1997. 99.
104. Neslin, S., S. Gupta, W. Kamakura, J. Lu, and C. Mason, Defection Detection:
Measuring and Understanding the Predictive Accuracy of Customer Churn
Models. Journal of Marketing Research, 2006. 43.
105. Ng, S.C., S.H. Leung, and A. Luk, Fast Convergent Generalized
Back-Propagation Algorithm with Constant Learning Rate. Neural Processing
Letters, 1999. 9.
106. Nilsson, N.J., Introduction to Machine Learning. 1998: Stanford University.
107. Oja, E., A Simplified Neuron Model as a Principal Component Analyzer.
Journal of Mathematical Biology, 1982. 15.
108. Pearl, J., Probabilistic Reasoning in Intelligent Systems: Networks of Plausible
Inference. 1988: Morgan Kaufmann.
109. Pearson, K., On Lines and Planes of Closest Fit to Systems of Points in Space.
Philosophical Magazine, 1901. 6(2).
110. Pendharkar, P., Genetic Algorithm Based Neural Network Approaches for
Predicting Churn in Cellular Wireless Networks Service. Expert Systems with
Applications, 2009. 36.
111. Ponnapalli, P.V.S., K.C. Ho, and M. Thomson, A Formal Selection and Pruning
Algorithm for Feedforward Artificial Neural Network Optimization. IEEE
Transactions on Neural Networks, 1999. 10(4).
112. Rao, C.R., The Use and Interpretation of Principal Component Analysis in
Applied Research. The Indian Journal of Statistics, 1964. 26(4).
113. Reichheld, F. and W. Sasser, Zero Defection: Quality Comes to Services.
Harvard Business Review, 1990. 68(5).
114. Little, R.J.A. and D.B. Rubin, Statistical Analysis with Missing Data
2nd Edition. 2002: John Wiley & Sons.
115. Rosenblatt, F., The Perceptron: A Probabilistic Model for Information Storage
and Organization in the Brain. Psychological Review, 1958. 65.
116. Rosenblatt, F., Principles of Neurodynamics. 1962: Spartan Books.
117. Rumelhart, D. and J.L. McClelland, Parallel Distributed Processing: Explorations
in the Microstructure of Cognition. 1986: MIT Press.
118. Rumelhart, D.E., R. Durbin, R. Golden, and Y. Chauvin, Backpropagation: The
Basic Theory. Backpropagation: Theory, Architecture, and Applications, 1995.
119. Rumelhart, D.E., G.E. Hinton, and R.J. Williams, Learning Internal
Representations by Error Propagation. Parallel Distributed Processing:
Explorations in the Microstructure of Cognition, 1986. 1.
120. Sato, T., B.Q. Huang, Y. Huang, M.T. Kechadi, and B. Buckley. Using PCA to
Predict Customer Churn in Telecommunication Dataset. Proceedings of the 6th
International Conference on Advanced Data Mining and Applications. 2010.
121. Schmitt, M., On the Complexity of Computing and Learning with
Multiplicative Neural Networks. Neural Computation, 2002. 14(2).
122. Seo, D., C. Ranganathan, and Y. Babad, Two-Level Model of Customer
Retention in the US Mobile Telecommunications Service Market.
Telecommunications Policy, 2008. 32.
123. Shalev-Shwartz, S. and Y. Singer. A New Perspective on an Old Perceptron
Algorithm. Proceedings of the 16th Annual Conference on Computational
Learning Theory. 2005.
124. Sietsma, J. and R.J.F. Dow, Creating Artificial Neural Networks That
Generalize. Neural Networks, 1991. 4.
125. Silva, F.M. and L.B. Almeida, Speeding-up Backpropagation. Advanced
Neural Computers, 1990.
126. Smyth, S.G., Designing Multilayer Perceptrons from Nearest Neighbor
Systems. IEEE Transactions on Neural Networks, 1992. 3(2).
127. Solla, S.A., E. Levin, and M. Fleisher, Accelerated Learning in Layered Neural
Network. Complex Systems, 1988. 2.
128. Spirtes, P., C. Glymour, and R. Scheines, Causation, Prediction and Search 2nd
Edition. 2001: MIT Press.
129. Sumathi, S. and S.N. Sivanandam, Introduction to Data Mining and its
Applications. Studies in Computational Intelligence, 2006. 29.
130. Teoh, E.J., K.C. Tan, and C. Xiang, Estimating the Number of Hidden Neurons
in a Feedforward Network Using the Singular Value Decomposition. IEEE
Transactions on Neural Networks, 2006. 17(6).
131. Thimm, G. and E. Fiesler, High-Order and Multilayer Perceptron Initialization.
IEEE Transactions on Neural Networks, 1997. 8(2).
132. Valiant, L.G., A Theory of the Learnable. Communications of the ACM, 1984.
133. Vapnik, V.N., The Nature of Statistical Learning Theory. 1995, New York:
Springer. xv, 188 p.
134. Vapnik, V.N., Statistical Learning Theory. Adaptive and Learning Systems for
Signal Processing, Communications, and Control. 1998, New York: Wiley.
xxiv, 736 p.
135. Vogl, T.P., J.K. Mangis, A.K. Rigler, W.T. Zink, and D.L. Alkon, Accelerating
the Convergence of the Backpropagation Method. Biological Cybernetics,
1988. 59.
136. Wang, J., J. Yang, and W. Wu, Convergence of Cyclic and Almost-Cyclic
Learning with Momentum for Feedforward Neural Networks. IEEE
Transactions on Neural Networks, 2011. 22(8).
137. Wei, C. and I. Chiu, Turning Telecommunications Call Details to Churn
Prediction: A Data Mining Approach. Expert Systems with Applications, 2002.
23.
138. Werbos, P.J., Beyond Regressions: New Tools for Prediction and Analysis in
the Behavioral Sciences. 1974, Harvard University.
139. Wessels, L.F.A. and E. Barnard, Avoiding False Local Minima by Proper
Initialization of Connections. IEEE Transactions on Neural Networks, 1992. 3.
140. Weymaere, N. and J.P. Martens, On the Initialization and Optimization of
Multilayer Perceptron. IEEE Transactions on Neural Networks, 1994. 5.
141. Widrow, B. and M.E. Hoff, Adaptive Switching Circuits. Record of IRE
Eastern Electronic Show and Convention, 1960. 4.
142. Widrow, B. and S.D. Stearns, Adaptive Signal Processing. 1985: Prentice Hall.
143. Wilson, D.R. and T.R. Martinez, The General Inefficiency of Batch Training
for Gradient Descent Learning. Neural Networks, 2003. 16.
144. Wolpert, D.H., The Relationship between PAC, the Statistical Physics
Framework, the Bayesian Framework, and the VC Framework. The
Mathematics of Generalization the SFI Studies in the Sciences of Complexity,
1995.
145. Wong, B.K., T.A. Bodnovich, and Y. Selvi, Neural Network Applications in
Business: A Review and Analysis of the Literature (1988–1995). Decision
Support Systems, 1997. 19.
146. Xing, H.J. and B.G. Hu, Two-Phase Construction of Multilayer Perceptrons
Using Information Theory. IEEE Transactions on Neural Networks, 2009.
20(4).
147. Yam, Y.F., C.T. Leung, P.K.S. Tam, and W.C. Siu, An Independent Component
Analysis Based Weight Initialization Method for Multilayer Perceptrons.
Neurocomputing, 2002. 48.
148. Yan, L., M. Fassino, and P. Baldasare. Predicting Customer Behavior via
Calling Links. Proceedings of International Joint Conference on Neural
Networks. 2005.
149. Zurada, J.M., A. Malinowski, and S. Usui, Perturbation Method for Deleting
Redundant Inputs of Perceptron Networks. Neurocomputing, 1997. 14.