
Ionut B. BRANDUSOIU
Gavril I. TODEREAN

NEURAL NETWORKS
FOR CHURN PREDICTION
IN THE MOBILE
TELECOMMUNICATIONS INDUSTRY

GAER Publishing House


Bucharest, 2020
THE GENERAL ASSOCIATION OF ENGINEERS IN ROMANIA
Copyright © Authors, 2020

All rights on this edition are reserved to the Authors.

GAER Publishing House


118 Calea Victoriei
010093, Sector 1, Bucharest, Romania
Phone: 4021-316 89 92, 4021-316 89 93,
4021- 319 49 45 (bookshop); Fax: 4021-312 55 31
E-mail: editura@agir.ro; www.agir.ro

Reviewers
Prof. Dr. Eng. Sergiu Nedevschi
Prof. Dr. Eng. Gabriel Oltean

Description of CIP of the National Library of Romania


BRANDUSOIU, IONUT B.
    Neural networks for churn prediction in the mobile
telecommunications industry / Ionut B. Brandusoiu, Gavril I. Toderean -
Bucharest : G.A.E.R. Publishing House, 2020
    Contains bibliography
ISBN 978-973-720-778-4
621.39

Editor: Eng. Dan Bogdan


Cover: Dr. Eng. Ionut B. Brandusoiu

ISBN 978-973-720-778-4
TABLE OF CONTENTS

LIST OF FIGURES
LIST OF TABLES
ACRONYMS, NOTATIONS, AND SYMBOLS
INTRODUCTION
1. DATA MINING PROCESS
1.1. Introduction
1.2. Phases of the Data Mining Process
1.3. Supervised Learning
1.3.1. Training Dataset
1.3.2. Classification
1.3.3. Induction Algorithms
1.3.4. Performance Evaluation
1.3.5. Scalability
1.3.6. Dimensionality
1.4. Churn Prediction – Literature Review
1.5. Conclusions
2. DATA UNDERSTANDING AND PREPARATION
2.1. Data Understanding
2.2. Data Preparation
2.2.1. PCA Algorithm
2.2.2. Applying the PCA Algorithm
2.2.3. Dataset Partitioning and Distribution Balancing
2.3. Conclusions
3. MODELING USING NEURAL NETWORKS
3.1. Introduction
3.2. Perceptron
3.3. Multilayer Perceptron
3.4. Backpropagation Algorithm
3.5. Online and Offline Learning
3.6. Output Layer Activation Function
3.7. Network Structure Optimization
3.7.1. Network Pruning based on Sensitivity
3.7.2. Network Pruning based on Regularization
3.7.3. Network Growing
3.8. Accelerating the Learning Process
3.8.1. Learning Parameters Adaptation
3.8.2. Weights Initialization
3.9. Avoiding Local Minima
3.10. Simulated Annealing Algorithm
3.11. Proposed Neural Network Architecture
3.11.1. Weights Initialization and Learning Process
3.11.2. Backpropagation Algorithm
3.11.3. Gradient Descent Algorithm
3.11.4. Convergence Criteria
3.12. Conclusions
4. MODEL EVALUATION AND DEPLOYMENT
4.1. Model Evaluation
4.2. Model Deployment
4.3. Conclusions
5. CONCLUSIONS
BIBLIOGRAPHY

LIST OF FIGURES

Figure 1.1. CRISP-DM Methodology
Figure 2.1. Principle of PCA Algorithm
Figure 2.2. Axis Rotation in a Two-Dimensional Space
Figure 2.3. Scree Plot
Figure 3.1. Mathematical Model of a Perceptron
Figure 3.2. Sigmoid Activation Functions
Figure 3.3. Architecture of a Single-Layer Perceptron
Figure 3.4. Architecture of an MLP
Figure 3.4. Simulated Annealing Method
Figure 4.1. Incremental Lift Chart
Figure 4.2. Cumulative Lift Chart
Figure 4.3. Gain Chart
Figure 4.4. ROC Curve

LIST OF TABLES

Table 2.1. Dataset Variables
Table 2.2. Distribution of Variable State relative to Variable Area Code
Table 2.3. Statistical Indicators of Quantitative Variables
Table 2.4. Pearson Correlation Coefficient of Quantitative Variables
Table 2.5. Principal Components Matrix
Table 2.6. Eigenvalues and Proportion of Explained Variance
Table 4.1. Confusion Matrix
Table 4.2. Confusion Matrix – MLP

ACRONYMS, NOTATIONS, AND SYMBOLS

ACC Accuracy

BP Backpropagation

CART Classification and Regression Tree

CDR Call Details Records

CHAID Chi-squared Automatic Interaction Detector

CRISP-DM Cross Industry Standard Process for Data Mining

DET Detection Error Tradeoff

FN False Negatives

FP False Positives

KDD Knowledge Discovery in Databases

MLP Multilayer Perceptron

NPV Negative Predictive Value

PAC Probably Approximately Correct

PCA Principal Component Analysis

PPV Positive Predictive Value

QUEST Quick Unbiased Efficient Statistical Tree

RBF Radial Basis Functions

ROC Receiver Operating Characteristic

TN True Negatives

TNR True Negatives Rate

TP True Positives

TPR True Positives Rate

VC Vapnik-Chervonenkis

α Momentum term for weight adjustment

Corr(Yi, Zj) Partial correlation between principal component i and standardized independent variable j

ϕ() Activation function

ϕi(l)() Activation function of neuron i in layer l

ei Eigenvector i

E Error function

En Error function for instance n

η Learning rate

kB Boltzmann constant

kl Number of neurons in layer l

L Number of layers of the MLP neural network

ℒ Lagrange function

λc Regularization parameter

λi Eigenvalue i

λmax Maximum eigenvalue of autocorrelation of vector x

m Number of independent variables.

μi Mean of independent variable i

n Number of training instances

oi(l) Output of neuron i in layer l

on(l) Output vector for instance n in layer l

Pα Probability that a physical system is in state α

ri Minimum distance from instance i to a principal component

rij Correlation coefficient

ρ Correlation matrix

σii Standard deviation of independent variable i

σii² Covariance

θi(l) Bias of neuron i in layer l

θ(l) Bias vector from layer l

un, j Input of neuron j for instance n

un(l) Input vector for instance n in layer l

Var(Yi ) Variance of principal component i

wij Weight between neuron i and neuron j

wi(l−1) Weight vector from layer (l − 1) to layer l of neuron i

xi Independent variable i

yi Dependent variable i

yn, j Expected output of neuron j for instance n


ŷn, j Actual output of neuron j for instance n

Yi Principal component i

Z Vector of standardized independent variables

Zi Standardized independent variable i

INTRODUCTION

This book has a practical objective, namely, to identify the customers of
the prepaid segment of a mobile telecommunications company that present a
high risk of churn. This aspect is extremely important in the mobile
telecommunications industry because contracting new customers and the
retention of existing customers represent the focal concerns of those companies
active within this sector. A significant method to increase customer value is to
keep the customer for a longer period of time [49].

In the telecommunications industry, the mobile market is the segment that
sees the fastest growth. At a global level, approximately 75% of phone calls
are made using mobile phones and, as in any competitive market, attention
has shifted from acquiring customers to retaining them [79]. The loss of a
customer is the main concern of many organizations active in industries with
low switching costs. Among all the industries that face this problem, the
telecommunications industry is at the top of the list, with an annual churn
rate of approximately 30%. This implies a waste of resources, in other words,
time, effort, and money spent unwisely [84].

As a consequence, a company active in the mobile telecommunications
industry must identify the customers that are likely to churn before they
actually do so, and avoid contacting subscribers that will continue to use
the service in any event [34]. Thus, the need to develop a predictive model that
precisely identifies these types of customers is vital. This model must be capable
of identifying customers that tend to switch mobile providers in the near future.
Because the prepaid mobile market is not based on a contract, this is not an
easy or well-defined task, which makes the implementation of such a predictive
model a complex assignment.

The approach chosen in this book for churn prediction is based on the
Cross Industry Standard Process for Data Mining (CRISP-DM) methodology.
The data mining process presented in Chapter 1 is an interdisciplinary field
which includes machine learning, pattern recognition, statistics, and
visualization techniques to extract knowledge from large datasets [22]. By using
data mining techniques one can implement predictive models to discover trends
and past behaviors, allowing organizations to make smart decisions based on
knowledge from data.

The CRISP-DM methodology presented in Chapter 1 is applied using the
Python programming language on a synthetic dataset that contains call details
records (CDR) for 3,333 customers. These CDRs are available with 21
independent variables and one dependent variable which indicates the past
behavior of these customers with respect to churn. This dataset belongs to the
University of California, Irvine, the Department of Information and Computer
Science [10]. This is a generic dataset, frequently used in research as a
benchmark for testing different architectures of machine learning algorithms
proposed for classification. The methodology presented in this book is scalable
and can be applied to any real-world classification problem.

In the mobile telecommunications industry, the databases can have
hundreds of thousands of instances and hundreds or thousands of variables.
Therefore, it is highly unlikely that all these variables are independent, without
presenting a degree of correlation. Multicollinearity must be avoided because it
leads to instability in the solution space and, as a result, to incoherent results.
Even if this instability is avoided, the highly correlated independent variables
that are included in the model tend to grossly overemphasize a specific
component of the model because that component is taken into account multiple
times.

Using a high number of independent variables to model the relationships
with a dependent variable may unnecessarily complicate the interpretation of the
analysis and violate the principle of parsimony which states that the number of
independent variables should be reduced so that the result can be easily
interpreted. Likewise, keeping a high number of independent variables may lead
to overfitting.

Therefore, to avoid multicollinearity and reduce the number of independent
variables, the principal component analysis (PCA) method is applied in
Chapter 2 within the data preparation phase of the data mining process. The
PCA algorithm is a data dimensionality reduction technique which uses the
correlation structure of the independent variables.
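
As a rough illustration of how PCA uses the correlation structure, the sketch below standardizes a small synthetic matrix, eigendecomposes its correlation matrix, and projects the data onto the resulting components. The data, the function name `pca_from_correlation`, and the use of NumPy are assumptions made for this example only; this is not the implementation applied in Chapter 2.

```python
import numpy as np

def pca_from_correlation(X):
    """Correlation-based PCA: returns eigenvalues, eigenvectors, and component scores."""
    # Standardize each variable (zero mean, unit sample variance).
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # Correlation matrix of the independent variables.
    rho = np.corrcoef(X, rowvar=False)
    # eigh handles symmetric matrices; reorder eigenpairs by decreasing eigenvalue.
    eigenvalues, eigenvectors = np.linalg.eigh(rho)
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
    # Principal components are linear combinations of the standardized variables.
    scores = Z @ eigenvectors
    return eigenvalues, eigenvectors, scores

# Synthetic data (an assumption): the first two variables are almost collinear.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
X = np.hstack([base, base + 0.05 * rng.normal(size=(200, 1)), rng.normal(size=(200, 1))])

eigenvalues, eigenvectors, scores = pca_from_correlation(X)
explained = eigenvalues / eigenvalues.sum()  # proportion of variance per component
```

In this toy case the two nearly collinear variables collapse into a single dominant component, and the resulting component scores are mutually uncorrelated, which is the property that removes multicollinearity.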

The learning method used in this book is a supervised method, called
classification, whereby a mathematical machine learning algorithm identifies in
the preprocessed dataset which values of the dependent variable are associated
with various values of the independent variables. The relationships discovered
are presented in the form of a structure called a classification model.
Classification models are among the most studied models, possibly with the
highest practical relevance.

In this book, the classification model is implemented using the machine
learning algorithm called neural networks. The results presented in this
book are both theoretical and practical in nature. With respect to the chosen
machine learning algorithm, a proprietary architecture is proposed based on the
results presented and discussed in various notable research papers in the
literature.

Chapter 3 starts with the theoretical foundations of neural networks,
followed by a review of the noteworthy research in the literature and our
explanation pertaining to the neural network architecture chosen to implement
the predictive model [18], [19], [11]. In this chapter, a multilayer perceptron
(MLP) neural network is trained using the backpropagation (BP) algorithm [138].
The BP algorithm is improved by applying the momentum method to adjust the
weights [119]. For each training instance, the weights are updated using the
stochastic gradient descent optimization method [36] after each feature has
been presented to the network sequentially. The hyperbolic tangent function is
used as the activation function for the hidden layers and the generalized
sigmoid function [118] for the output layer, thereby introducing flexibility
into the MLP model. The structure of the MLP network is optimized using a
pruning technique based on the sensitivity of the input and hidden neurons,
which relies on mutual information [146]. This network pruning strategy starts
from a large network and gradually removes redundant neurons during the
learning process. To accelerate the training process of the network, two
methods are used: adapting the learning rate, by initializing it with a higher
value and gradually decreasing it as the learning progresses, and initializing
the weights using the simulated annealing algorithm [81].
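
The momentum update and the decreasing learning rate mentioned above can be sketched on a toy problem. Everything specific here is an assumption made for illustration: the quadratic error surface, the momentum value of 0.5, the decay schedule, and the function names. The architecture and parameters actually proposed are the subject of Chapter 3.

```python
import numpy as np

def momentum_step(weights, gradient, previous_delta, learning_rate, momentum):
    """Weight change = -eta * gradient + alpha * previous weight change."""
    delta = -learning_rate * gradient + momentum * previous_delta
    return weights + delta, delta

def decayed_learning_rate(initial_rate, step, decay=0.1):
    """Hypothetical schedule: start with a higher rate and decrease it gradually."""
    return initial_rate / (1.0 + decay * step)

# Toy error surface (an assumption): E(w) = 0.5 * ||w - target||^2, so dE/dw = w - target.
target = np.array([1.0, -2.0])
weights = np.zeros(2)
delta = np.zeros(2)
initial_error = 0.5 * np.sum((weights - target) ** 2)

for step in range(200):
    eta = decayed_learning_rate(0.5, step)
    gradient = weights - target
    weights, delta = momentum_step(weights, gradient, delta, eta, momentum=0.5)

final_error = 0.5 * np.sum((weights - target) ** 2)
```

The momentum term reuses part of the previous weight change, which smooths the trajectory, while the shrinking learning rate allows large steps early on and finer adjustments near the minimum.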

Chapter 4 empirically evaluates the generalization error of the predictive
model and presents the practical results. It also proposes a solution for using
the results.

Chapter 5 presents the conclusions of this book.


CHAPTER 1
1. DATA MINING PROCESS

1.1. Introduction

The data mining process can be defined in several ways, each of which
emphasizes a different aspect of this process. One of the earliest definitions
states that the data mining process involves the nontrivial extraction of
implicit, previously unknown, and potentially useful information from data [48].

Considering that the data mining process has developed as a professional
activity, it is necessary to distinguish between this activity, statistics, and
the larger Knowledge Discovery in Databases (KDD). Statistical modeling
involves the use of parametric statistical algorithms to group or predict an
outcome or event based on independent variables. The data mining process
refers to the use of machine learning algorithms to discover patterns of
relationships between elements in large and noisy datasets. Finally, KDD
represents the entire process of accessing, exploring, and preparing the data,
and of implementing and monitoring the model. This extensive process includes
the data mining activities.

While the data mining process continued to develop, the definitions of
this process focused on certain aspects of the information and its source. In
1996, Fayyad proposed the definition in which KDD is the nontrivial process of
identifying valid, new, and useful patterns in data [46]. This definition
focuses on the patterns in the data, not just on the information. These
patterns are not easily distinguishable and can only be identified by
applying analysis algorithms that can evaluate complex nonlinear relationships
between the independent variables, and between these independent variables
and the dependent variable. This definition of KDD emerged together with the
increasing popularity of machine learning techniques and their introduction into
this process. Algorithms such as decision trees, neural networks, support vector
machines, and Bayesian networks make it much easier to analyze nonlinear
patterns in data than parametric statistical algorithms do. This happens
primarily because machine learning algorithms work similarly to the way people
act, not by calculating metrics based on average values and data
distributions. Initially, the term KDD referred only to the process of building the
model, but as this practice grew, this process expanded and began to include
several other operations.

A further definition of the data mining process describes it as uncovering
new and significant correlations, patterns, and trends through the analysis of
large amounts of data using pattern recognition, mathematical, and statistical
techniques [52]. Another definition argues that this process is an
interdisciplinary field which includes machine learning, pattern recognition,
statistics, and visualization techniques to extract information from large
datasets [22]. A last definition refers to the data mining process as the
analysis of large-scale observational datasets to discover unknown
relationships and summarize the data in an easy-to-understand format [62].

Considering the above-mentioned definitions, data mining methods can be
understood to lie at the intersection of artificial intelligence, machine
learning, statistics, and database systems [26]. By applying these data mining
techniques, predictive models can be deployed in order to discover trends and
past behaviors, allowing organizations to make smart decisions based on
knowledge extracted from data.

1.2. Phases of the Data Mining Process

The data mining process has been characterized in various formats of
which the most widespread is the CRISP-DM format [29]. The CRISP-DM
methodology was developed by SPSS, Teradata, Daimler AG, NCR, and
OHRA, and represents a way to simplify the data mining process (Figure 1.1).
As a methodology, this process includes descriptions of typical phases of a
project, the tasks involved in each phase, and explanations of the relationships
between these tasks. As a process model, CRISP-DM provides an overview of
the life cycle of the data mining process.

The life cycle of a data mining process consists of six phases, as
illustrated in Figure 1.1. The succession of these phases is not mandatory
because completing a project often requires alternating frequently between
these phases. The outcome of each phase determines which task or phase is to
be executed next. The arrows between these phases indicate the most
common and important dependencies.

[Figure 1.1 diagram: the Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, and Deployment phases arranged around a central Data element]

Figure 1.1. CRISP-DM Methodology [29].

The outer circle symbolizes the cyclical nature of the data mining process.
This process continues even after a solution is implemented because once the
patterns are discovered and the solution presented, it can lead to new and
perhaps even more complex demands from the decision-makers of the mobile
telecommunication company or of any company in general. The data mining
processes that are implemented will always benefit from the knowledge gained
from previous experiences. Each phase is briefly presented below:

▪ Business understanding – Is the initial phase and involves understanding
the objectives and the requirements from the company's point of view, and
then converting this knowledge into a data mining problem and a
preliminary plan designed to meet the objectives.

▪ Data understanding – Begins with the data collection and continues with
the activities that build familiarity with the data, identify data
quality issues, provide the first insights into the data, and detect
subsets from which to form hypotheses about hidden information and patterns.

▪ Data preparation – Includes all the activities required to build the final
dataset used for modeling in the next phase. These activities have no
predetermined order and can be repeated several times until the desired
outcome is reached.

▪ Modeling – Consists of selecting different techniques to implement the
model and adjusting their parameters to achieve optimal results. In
general, several modeling techniques can be employed to solve the same
problem and some of them have specific requirements regarding the data
structure. Therefore, it is sometimes necessary to return to the data
preparation phase.

▪ Evaluation – Assumes that the model is implemented and the results
are of good quality. To ensure that the model satisfies all the business
objectives, during this phase the steps taken to build the
model are evaluated and reviewed. It is also determined whether any
important objective was omitted. Ultimately, it is defined how the results
obtained through the data mining process will be used by the decision makers.

▪ Deployment – During this phase the results of the analysis are organized
and presented in the proper way so that the company can benefit
completely. In most cases, the models are applied within the organization's
decision processes in real time. In certain instances, the results are
deployed throughout the organization as a simple report or as a complex
repetitive data mining process.

1.3. Supervised Learning

In the context of machine learning, the modeling techniques can be
categorized into supervised and unsupervised learning techniques. Supervised
learning aims to predict an event through a classification model or estimate the
value of a continuous variable through a regression model. Within these models,
there are several independent variables and a dependent variable. Classification
models map the input space into predefined classes, while regression
models map the input space into a real-valued range. There are many
alternatives for classification models, for example, decision trees, neural
networks, support vector machines, statistical methods, or algebraic functions.
Within each of these classification models the input data is analyzed with
respect to the dependent variable. In this sense, the pattern recognition is
supervised by the dependent variable. This type of predictive model establishes
relationships between the independent variables and the dependent variable, and
generates a function which associates the independent variables with the
dependent variable and allows the prediction of output values based on the
input values.

In the case of unsupervised learning, the data do not contain a dependent
variable, only independent variables. The pattern recognition is undirected, in
other words it is not guided by a specific variable, and the purpose of the data
mining mathematical algorithms is to discover patterns in the input data.
Segmentation and association are two of the most well-known unsupervised
learning methods [58].

The dataset used in this book contains a dependent variable which is
known beforehand. As such, the objective is to implement a supervised model
that predicts an event, namely the customers who are churning from the services
offered by a mobile telecommunication company. In this predictive model, the
chosen machine learning algorithm learns which values of the dependent
variable are associated with different values of the independent variables.

The following sections present the theoretical notions related to supervised
learning, the classification model, and how to evaluate such a model.

1.3.1. Training Dataset

In the case of a supervised learning model, the training dataset is known
and the goal is to implement a system that can be used to make predictions for
previously unseen instances.

The training dataset can be characterized in a multitude of ways. Most
often it is thought of as a set of instances belonging to a particular schema.
This set is a collection of tuples (also called instances, rows, or records)
that may contain duplicates. Each tuple is described by a vector which contains
the values of each variable. The schema describes the variables and their
definition domains, and is denoted by $B(A \cup y)$, where $A$ represents the
set of $m$ independent variables, $A = \{a_1, a_2, \ldots, a_m\}$, and $y$
represents the dependent variable.

Typically, the variables, also named fields or attributes, are of two types:
nominal (values belong to an unordered set) and continuous (values are real
numbers). If the variable $a_i$ is of nominal type, its definition domain is
denoted by $dom(a_i) = \{v_{i,1}, v_{i,2}, \ldots, v_{i,|dom(a_i)|}\}$, where
$|dom(a_i)|$ refers to its finite cardinality. Similarly,
$dom(y) = \{c_1, c_2, \ldots, c_{|dom(y)|}\}$ represents the definition
domain of the dependent variable. Variables of continuous type have an infinite
cardinality.

The input space is defined as the Cartesian product of the definition
domains of all the independent variables,
$X = dom(a_1) \times dom(a_2) \times \ldots \times dom(a_m)$. The universal
input space $U$ is defined as the Cartesian product of the definition domains
of all the independent and dependent variables, $U = X \times dom(y)$. The
training dataset is a set of instances which consists of $n$ tuples, and is
denoted by
$S(B) = (\langle x_1, y_1 \rangle, \langle x_2, y_2 \rangle, \ldots, \langle x_n, y_n \rangle)$,
where $x_q \in X$ and $y_q \in dom(y)$.
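
To make these definitions concrete, the sketch below enumerates the input space and the universal input space for a few nominal variables. The variable names and their domains are hypothetical and are not taken from the dataset used in this book.

```python
from itertools import product

# Hypothetical nominal independent variables a1, a2, a3 and their definition domains.
dom = {
    "plan": ["prepaid", "contract"],
    "voicemail": ["no", "yes"],
    "region": ["A", "B", "C"],
}
# Definition domain of the dependent variable y.
dom_y = ["stay", "churn"]

# Input space X = dom(a1) x dom(a2) x dom(a3).
X = list(product(*dom.values()))
# Universal input space U = X x dom(y).
U = list(product(X, dom_y))

# A training dataset S: n tuples <x_q, y_q> with x_q in X and y_q in dom(y).
# Duplicates are allowed, as noted above.
S = [
    (("prepaid", "yes", "A"), "stay"),
    (("contract", "no", "C"), "churn"),
    (("prepaid", "yes", "A"), "stay"),
]
```

The size of the input space is the product of the domain cardinalities, which is why it grows quickly as variables are added.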

It is generally assumed that the tuples from the training dataset are
generated randomly and independently in accordance with an unknown
fixed joint probability distribution, $D$. This is a generalization of the
deterministic case, in which a tuple is classified using a function $y = f(x)$.

In this book, the notation π is used for the projection of tuples and σ for
the selection of tuples [60].

1.3.2. Classification

The machine learning community was the first to introduce concept
learning. Concepts are categories built by the human brain for objects,
events, or ideas that have a common set of characteristics. Learning a concept
means inferring its definition from a set of instances. This definition can be
formulated explicitly or implicitly, but in either case it assigns each
instance to the concept or excludes it. Consequently, a concept can be viewed
as a function defined on the input space with values in a Boolean set, namely,
$c: X \to \{-1, 1\}$. Alternatively, a concept $c$ can be defined as a subset
of $X$, namely, $\{x \in X : c(x) = 1\}$. A set of concepts is known as a
concept class $C$.
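
The two equivalent views of a concept, as a function and as a subset, can be illustrated with a short sketch. The input space and the "at-risk customer" concept below are invented for this example.

```python
from itertools import product

# Hypothetical input space built from two nominal variables.
X = list(product(["prepaid", "contract"], ["low", "high"]))

def c(x):
    """Hypothetical concept 'at-risk customer': 1 if x belongs to it, -1 otherwise."""
    plan, usage = x
    return 1 if plan == "prepaid" and usage == "low" else -1

# The same concept seen as a subset of X: {x in X : c(x) = 1}.
concept_as_subset = {x for x in X if c(x) == 1}
```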

Definition 1. Given a training dataset $S$ with the set of independent variables
$A = \{a_1, a_2, \ldots, a_m\}$ and a nominal dependent variable $y$, generated
according to an unknown fixed distribution $D$ over the labeled input space, the
objective is to build a classification model that presents a minimal
generalization error.

The generalization error of a classification model is defined as the
misclassification rate over the distribution $D$. For nominal variables, the
generalization error can be defined as:

$$\varepsilon(I(S), D) = \sum_{\langle x, y \rangle \in U} D(x, y) \, L(y, I(S)(x)) \qquad (1.1)$$

where $L(y, I(S)(x))$ is the cost function, defined as:

$$L(y, I(S)(x)) = \begin{cases} 1 & \text{if } y \neq I(S)(x) \\ 0 & \text{if } y = I(S)(x) \end{cases} \qquad (1.2)$$
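
When the distribution is fully known, which in practice happens only in synthetic settings, equations (1.1) and (1.2) can be evaluated directly. The sketch below does so for a made-up joint distribution over a single nominal variable; the probabilities and the rule used as the model are assumptions chosen for illustration.

```python
def zero_one_loss(y, prediction):
    """Cost function of equation (1.2): 1 for a misclassification, 0 otherwise."""
    return 0 if y == prediction else 1

def generalization_error(model, D):
    """Equation (1.1): sum of D(x, y) * L(y, model(x)) over all <x, y> in U."""
    return sum(p * zero_one_loss(y, model(x)) for (x, y), p in D.items())

# Hypothetical joint distribution D over U (the probabilities sum to 1).
D = {
    ("prepaid", "churn"): 0.30,
    ("prepaid", "stay"): 0.20,
    ("contract", "churn"): 0.05,
    ("contract", "stay"): 0.45,
}

def model(x):
    """A fixed rule standing in for an induced model I(S)."""
    return "churn" if x == "prepaid" else "stay"

error = generalization_error(model, D)  # mass of the two misclassified cells: 0.20 + 0.05
```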

1.3.3. Induction Algorithms

An induction algorithm, also known as an inducer or learner, builds a model
based on a training dataset and is able to generalize the relationships between
the independent variables and the dependent variable. For instance, an induction
algorithm takes tuples and their class labels as training input and constructs
a classification model.

An induction algorithm is denoted by $I$, and the model induced by applying
this algorithm $I$ to the training dataset $S$ is denoted by $I(S)$. The
dependent variable of a tuple $x_q$ can then be predicted, and this prediction
is denoted by $I(S)(x_q)$.
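
The notation maps naturally onto code: an inducer is a function that takes a training dataset and returns a model, and the model is itself a function of a tuple. The sketch below uses a deliberately trivial inducer (always predict the most frequent class), chosen only to show the three levels $I$, $I(S)$, and $I(S)(x_q)$. It is not the algorithm used in this book, and the data are invented.

```python
from collections import Counter

def majority_inducer(S):
    """Inducer I: returns the model I(S) that predicts the most frequent class in S."""
    most_common_class = Counter(y for _, y in S).most_common(1)[0][0]

    def model(x):
        # The induced model ignores x; a real inducer would use it.
        return most_common_class

    return model

# Hypothetical training dataset S of <x, y> tuples.
S = [
    ((120.5, "no"), "stay"),
    ((310.0, "yes"), "churn"),
    ((95.2, "no"), "stay"),
]

I_S = majority_inducer(S)           # I(S)
prediction = I_S((200.0, "yes"))    # I(S)(x_q) for an unseen tuple x_q
```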

Depending on the induction algorithm, the classification models can be
categorized in several ways, for instance, some models are expressed as
decision trees, while others as probabilistic classification models. Additionally,
the classification models can be categorized as being deterministic, in the case
of decision trees, or stochastic, in the case of neural networks with
backpropagation of the error.

A classification model obtained from an induction algorithm can classify a
new unseen tuple either by assigning it to a particular class or by providing a
vector of conditional probabilities that the instance belongs to each class
(probabilistic model).

1.3.4. Performance Evaluation

It is fundamentally important to evaluate the performance of the induction
algorithm. As previously mentioned, an induction algorithm builds a
classification model based on a training dataset, which is then able to label new
unseen instances. By evaluating this classification model one can understand
more about its quality, refine its parameters during the iterative data mining
process, and select the best performing model from a set of models.

When evaluating a classification model, several aspects should be
considered, such as accuracy, comprehensibility, and computational complexity.
In this book, preference is given to classification models that yield a high
accuracy.

1.3.4.1 Generalization Error

As previously mentioned, I(S) denotes a classification model induced by
the induction algorithm I on the training dataset S. The generalization error of
this classification model I(S) is the probability of misclassifying an instance
selected according to the distribution D of the labeled input space. The
accuracy of this classification model is obtained by subtracting the
generalization error from one. The training error is defined as the percentage
of incorrectly classified instances from the training dataset:

$$\hat{\varepsilon}(I(S), S) = \sum_{\langle x, y \rangle \in S} L(y, I(S)(x)) \tag{1.3}$$

where L(y, I(S )(x)) is the cost function defined in equation (1.2).
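
The training error can be illustrated with a minimal Python sketch. The toy dataset and the threshold rule standing in for an induced model I(S) are hypothetical and serve only as an illustration. The sum of the cost function is divided by the size of the dataset so that the result is a rate, in line with the description of the training error as a percentage.

```python
def zero_one_loss(y_true, y_pred):
    """Cost function of equation (1.2): 1 for a misclassification, 0 otherwise."""
    return 0 if y_true == y_pred else 1


def training_error(model, dataset):
    """Fraction of the tuples in `dataset` that `model` misclassifies."""
    losses = [zero_one_loss(y, model(x)) for x, y in dataset]
    return sum(losses) / len(losses)


# Hypothetical toy training dataset S: x is a number of customer service
# calls, y is the churn label. These values are invented for illustration.
S = [(0, "No"), (1, "No"), (2, "No"), (4, "Yes"), (5, "Yes"), (1, "Yes")]


def model(x):
    """Hypothetical stand-in for an induced model I(S)."""
    return "Yes" if x >= 4 else "No"


error = training_error(model, S)   # one of the six tuples is misclassified
accuracy = 1 - error
print(f"training error = {error:.3f}, accuracy = {accuracy:.3f}")
```

Only the last tuple is misclassified by this rule, so the training error is one sixth.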

Although this type of error seems like a natural criterion, it is difficult to
compute the actual value of the generalization error because the distribution D
of the labeled input space is known only in rare situations, such as in synthetic
cases. One way to approximate the generalization error is to use the training
error as an estimate. The downside is that the training error is an
optimistically biased estimate, especially if the induction algorithm overfits
the training data.

Two families of methods for estimating the generalization error are present
in the literature: theoretical methods and empirical methods.

A. Theoretical Estimation of Generalization Error

If one decides to estimate the generalization error using the training error,
it is important to note that a low training error does not necessarily imply a
low generalization error. There is often a trade-off between the training error
and the confidence that can be placed in it as an estimate of the generalization
error; this confidence is measured by the difference between the generalization
error and the training error. The capacity of an induction algorithm determines
the level of confidence in the training error and indicates the variety of
classification models that the algorithm can induce. The capacity of an
induction algorithm can be measured using the Vapnik-Chervonenkis (VC)
dimension, which is discussed further below.

Induction algorithms that have a large capacity, in other words that
have many free parameters in comparison to the training dataset size, tend to
produce a low training error, but they are likely to overfit the
relationships present in the dataset and yield a poor generalization error. In
such a case, the training error is highly unlikely to be a good estimate of the
generalization error. Conversely, induction algorithms whose capacity is too
small in comparison to the training dataset size tend to underfit the
relationships present in the dataset and yield both a high training error and a
poor generalization error. For induction algorithms that do not have enough free
parameters, the training error may be high, but it is a good estimate of the
generalization error. Between these two extremes, taking into account the
characteristics and volume of the training data available, there is an optimal
capacity that yields the best generalization error.

In [144], the author discusses the relationships between four theoretical
frameworks, compares them, and highlights their strengths and weaknesses.
These frameworks are useful to estimate the generalization error. Among these,
the VC and the PAC frameworks are mentioned, which add a penalty function to
the training error to indicate the capacity of an induction algorithm.

VC Dimension

The Vapnik-Chervonenkis [133] theory is the most comprehensive theoretical
learning framework relevant to classification models. The VC theory offers
all the conditions needed for the consistency of the induction procedure. The
concept of consistency comes from statistics and states that both the training
and the generalization errors of the classification model must converge to a
minimal error as the training dataset size tends to infinity. The VC theory
defines the VC dimension as a capacity measure of an induction algorithm.

The VC theory describes the worst case: the estimates of the difference
between the training error and the generalization error are bounds that are
valid for any induction algorithm and any probability distribution in the input
space. These bounds are functions of the training dataset size and the VC
dimension of the induction algorithm.

Theorem 1. Given a hypothesis space H with a finite VC dimension d, the upper
bound on its generalization error is defined by:

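$$\varepsilon(h, D) - \hat{\varepsilon}(h, S) \le \sqrt{\frac{d\left(\ln\frac{2n}{d} + 1\right) - \ln\frac{\delta}{4}}{n}}, \qquad \forall h \in H,\ \forall \delta > 0 \tag{1.4}$$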

with probability 1 − δ, where ε(h, D) is the generalization error of the
classification model h over the distribution D, and ε̂(h, S) is the training
error of the same classification model h measured over the training dataset S of
cardinality n.
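
To give a feeling for how the bound of Theorem 1 behaves, the short Python sketch below evaluates its right-hand side. It assumes the standard square-root form of the VC bound; the function name `vc_bound` and the chosen values of d, n, and δ are illustrative assumptions, not values used in this book.

```python
import math


def vc_bound(d, n, delta):
    """Right-hand side of the VC bound, assuming its square-root form.

    d     -- VC dimension of the hypothesis space H
    n     -- cardinality of the training dataset S
    delta -- the bound holds with probability 1 - delta
    """
    numerator = d * (math.log(2 * n / d) + 1) - math.log(delta / 4)
    return math.sqrt(numerator / n)


# Illustrative values only: the gap between the generalization error and the
# training error is bounded more tightly as n grows and more loosely as d grows.
for n in (1_000, 10_000, 100_000):
    print(n, round(vc_bound(d=10, n=n, delta=0.05), 4))
```

The bound is distribution-free, which is why it is usually loose in practice; it is mainly useful for comparing capacities rather than for predicting the actual error.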

The VC dimension is a property of the set H composed of all the
classification models examined by the induction algorithm. In the simplest case
of a two-class model, the VC dimension is defined as the maximum
number of instances that can be shattered by the set H. By definition, a dataset
S with n instances is shattered by a set H if and only if H contains a
classification model consistent with every dichotomy of S. To express it
differently, a dataset S is shattered by H if the instances in S can be separated
into two classes in all 2^n possible ways by classification models contained in
H. One should take into account that if the VC dimension of H is denoted by d,
then there exists at least one set of d instances that can be shattered by H.
Generally speaking, it will not be true that every set of d instances can be
shattered by H.
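
The notion of shattering can be checked by brute force on a small example. The sketch below is only an illustration: the hypothesis space, a handful of one-sided threshold classifiers on the real line, is a hypothetical choice and is not taken from this book.

```python
def is_shattered(points, hypotheses):
    """Return True if `hypotheses` realize all 2^n labelings of `points`."""
    realized = {tuple(h(x) for x in points) for h in hypotheses}
    return len(realized) == 2 ** len(points)


def make_threshold(t):
    """Hypothetical two-class model: label 1 if x >= t, otherwise 0."""
    return lambda x: 1 if x >= t else 0


# Hypothetical hypothesis space H: thresholds placed on a small grid.
H = [make_threshold(t) for t in (-1.0, 0.5, 1.5, 2.5, 3.5)]

one_point = is_shattered([1.0], H)         # both labelings are realized
two_points = is_shattered([1.0, 2.0], H)   # the labeling (1, 0) is never realized
print(one_point, two_points)
```

A single instance can be labeled in both ways, but no threshold of this form labels the smaller of two instances 1 and the larger one 0, so the VC dimension of this hypothesis space is 1.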

As a condition for the consistency of the induction procedure, the VC
dimension of an induction algorithm must be finite. In the case of a linear
classification model, the VC dimension is equal to the size of the input space or
to the number of free parameters of this classification model. The VC dimension
of a general classification model may be different from the number of free
parameters, and in many cases it might be extremely difficult to calculate it
precisely. In this case, it is advisable to calculate a lower and an upper bound
of the VC dimension. Such bounds on the VC dimension of neural networks are
presented in [121].

PAC Dimension

The Probably Approximately Correct (PAC) learning model was
introduced by Valiant in 1984 [132]. This framework is useful to characterize
the concept class that can be reliably learned from a reasonable number of
randomly drawn training instances and a reasonable amount of computation
[102]. The following definition of the PAC learning model is adapted from [102]
and [106]:

Definition 2. Let C be a concept class defined over the input space X with m
variables. Let I be an induction algorithm that considers the hypothesis space
H. C is PAC learnable by I using H if, for ∀c ∈ C, ∀D defined over X,
∀ε ∈ (0,1/2) and ∀δ ∈ (0,1/2), the induction algorithm I finds, with a
probability greater than or equal to 1 − δ, a hypothesis h ∈ H such that
ε(h, D) ≤ ε. C is learnable in polynomial time if the induction algorithm is of
polynomial time complexity in 1/ε, 1/δ, m, and size(c).

For an induction algorithm that examines a hypothesis space H and, with a
probability greater than or equal to 1 − δ, finds a hypothesis h ∈ H whose error
with respect to the target concept c ∈ C ⊆ H is less than or equal to ε, the PAC
learning model offers a general bound on the number of training instances that
is sufficient for any consistent induction algorithm I. In particular, the size
n of the training dataset must satisfy:

$$n \ge \frac{1}{\varepsilon} \ln\frac{|H|}{\delta} \tag{1.5}$$
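
As a worked illustration of the PAC bound (1.5), the sketch below computes the smallest integer n that satisfies it, reading H in the bound as the number of hypotheses in a finite hypothesis space. The function name and the numerical values are assumptions chosen for the example.

```python
import math


def pac_sample_size(h_size, epsilon, delta):
    """Smallest integer n with n >= (1 / epsilon) * ln(h_size / delta).

    h_size  -- number of hypotheses in a finite hypothesis space H
    epsilon -- maximum tolerated generalization error
    delta   -- maximum tolerated probability of exceeding epsilon
    """
    return math.ceil(math.log(h_size / delta) / epsilon)


# Illustrative values: 1,000 hypotheses, error at most 0.1, confidence 95%.
n = pac_sample_size(h_size=1000, epsilon=0.1, delta=0.05)
print(n)
```

The required number of instances grows only logarithmically with the size of the hypothesis space and with 1/δ, but linearly with 1/ε.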

B. Empirical Estimation of Generalization Error

The generalization error can be estimated by dividing the available dataset
into a training dataset and a test dataset. The training dataset is used by the
induction algorithm to build the classification model, and then the
misclassification rate of this model is calculated on the test dataset. The
error obtained on the test dataset yields a better estimate of the
generalization error, because the model tends to overfit the training data, so
the training error underestimates the generalization error.

When the available data is limited, it is a well-known practice to resample
the data, meaning to partition the dataset into a training and a test dataset in
several ways. An induction algorithm is trained and tested on each partition, and
then the arithmetic mean of all the misclassification rates is computed. This way,
a more reliable estimate of the generalization error is obtained.

Random sub-sampling and k-fold cross-validation are two well-known
resampling techniques. The first technique randomly partitions the dataset
multiple times into disjoint training and test datasets, and computes the
arithmetic mean of the errors obtained from each partition. The k-fold
cross-validation technique randomly partitions the dataset into k mutually
exclusive datasets (folds), on which the induction algorithm is trained and
tested multiple times. Each time the algorithm is tested on one of the unseen
folds and trained using the remaining k − 1 folds.
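
The k-fold cross-validation procedure can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the helper names, the toy dataset, and the majority-class inducer are hypothetical, and the error is computed as the total number of misclassifications divided by the number of instances.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Randomly partition the indices 0..n-1 into k mutually exclusive folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validation_error(dataset, k, induce, seed=0):
    """Train on k - 1 folds, test on the held-out fold, and pool the mistakes."""
    folds = k_fold_indices(len(dataset), k, seed)
    mistakes = 0
    for test_idx in folds:
        held_out = set(test_idx)
        train = [row for i, row in enumerate(dataset) if i not in held_out]
        model = induce(train)
        mistakes += sum(1 for i in test_idx if model(dataset[i][0]) != dataset[i][1])
    return mistakes / len(dataset)


def induce_majority(train):
    """Hypothetical inducer: always predicts the majority class of `train`."""
    labels = [y for _, y in train]
    majority = max(set(labels), key=labels.count)
    return lambda x: majority


# Hypothetical toy dataset: 10 churners ("Yes") among 50 customers.
data = [(i, "Yes" if i % 5 == 0 else "No") for i in range(50)]
folds = k_fold_indices(len(data), k=5)
cv_error = cross_validation_error(data, k=5, induce=induce_majority)
print(cv_error)
```

Because the majority-class inducer always predicts "No" on this dataset, every churner is misclassified exactly once, and the estimate equals the share of churners.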

The estimate of the generalization error obtained through cross-validation
is equal to the ratio of the number of misclassifications to the total
number of instances in the dataset. The random sub-sampling technique has the
upside that it can be repeated an unlimited number of times, and the downside
that the test datasets are not independently selected in relation to the
distribution of the instances. Thus, using a t-test for paired differences with
the random sub-sampling technique may increase the risk of a Type I (false
positive) error, that is, identifying a significant difference when there is
none. On the other hand, using a t-test on the generalization error obtained on
each fold decreases the risk of a Type I error, but may not provide an adequate
estimate of the generalization error. To obtain a more reliable estimate, the
k-fold cross-validation technique is usually repeated k times. However, the test
datasets are then no longer independent, and the risk of a Type I error
increases again. Unfortunately, no satisfactory solution has been found to this
problem so far. Dietterich proposed in [38] alternative tests that have a low
chance of a Type I error, but a high risk of a Type II (false negative) error,
i.e. not identifying a significant difference when one exists.

While applying the random sub-sampling and k-fold cross-validation
techniques, a method called stratification is frequently used to ensure that the
distribution of the dependent variable from the initial dataset is kept within each
partition, i.e. in the training and the test datasets. This method reduces the
variance of the estimated error in particular for multi-class datasets.
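
A stratified partition can be sketched as follows. The function name, the toy dataset, and the 20% test fraction are assumptions made for the illustration; the idea is simply to split each class separately so that both partitions keep the class distribution of the initial dataset.

```python
import random
from collections import defaultdict


def stratified_split(dataset, test_fraction, seed=0):
    """Split each class separately so that both parts keep the class ratio."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, y in dataset:
        by_class[y].append((x, y))
    train, test = [], []
    for rows in by_class.values():
        rng.shuffle(rows)
        n_test = round(len(rows) * test_fraction)
        test.extend(rows[:n_test])
        train.extend(rows[n_test:])
    return train, test


# Hypothetical toy dataset: 20 churners and 80 non-churners.
data = [(i, "Yes") for i in range(20)] + [(i, "No") for i in range(20, 100)]
train, test = stratified_split(data, test_fraction=0.2)

churn_rate_train = sum(1 for _, y in train if y == "Yes") / len(train)
churn_rate_test = sum(1 for _, y in test if y == "Yes") / len(test)
print(churn_rate_train, churn_rate_test)
```

Both partitions keep the 20% share of churners of the initial dataset, which a purely random split would only achieve on average.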

1.3.5. Scalability

Induction is a central concern of many disciplines, such as pattern
recognition, machine learning, and statistics. The
data mining process is different from these traditional methods because of its
capability to scale to large datasets with various input data types. The concept of
scalability implies that the datasets have either a large number of instances, a
large dimensionality, or both.

Induction algorithms have been successfully implemented in multiple
situations to solve fairly basic problems, but with the increasing desire to
discover knowledge in large datasets, several difficulties and constraints related
to time and memory appear.

Since databases have become a standard within many domains, such as
telecommunications, finance, astronomy, biotechnology, marketing, healthcare,
and many others, the data mining process designed to discover knowledge within
these domains has become a very productive discipline. Organizations that
produce large amounts of data, such as telecommunications and financial
companies, accumulate several petabytes of data every year.

Thus, difficulties arise in implementing classification algorithms for large
datasets due to their size, i.e. the large number of instances and
variables. Several remedies exist: sampling methods that select only a part of
the instances, methods that reduce the number of instances by grouping them or
by eliminating subsets of unimportant instances, and parallel processing that
solves different aspects of the problem simultaneously.

1.3.6. Dimensionality

High dimensional input data, i.e. datasets with a large number of variables,
involve an exponential increase of the size of the search space, and
consequently increase the chance that an induction algorithm will build
classification models that are not valid in general. In [76], the authors explain
that in the case of a supervised classification model, the required number of
instances increases with the dimensionality of the dataset. Furthermore, the
author shows in [51] that the required number of instances grows linearly with
the dimensionality in the case of a linear classification model, and with
the square of the dimensionality in the case of a quadratic classification model.
Regarding the nonparametric classification models, such as decision trees, the
situation is more serious. In [71] it was estimated that, in order to obtain an
efficient estimation of the multivariate densities, the number of instances must
increase exponentially as the number of dimensions increases.
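
The exponential growth of the search space can be made concrete with a small calculation. The sketch below assumes, purely for illustration, that each variable is discretized into 10 intervals; it then counts the resulting cells and the average number of instances per cell for a dataset of 3,333 instances, the size of the dataset used in this book.

```python
def grid_cells(bins_per_variable, n_variables):
    """Number of cells when each variable is cut into the same number of bins."""
    return bins_per_variable ** n_variables


n_instances = 3333   # number of customers in the dataset used in this book
bins = 10            # illustrative assumption, not a value from this book

for m in (1, 2, 3, 17):
    cells = grid_cells(bins, m)
    print(f"{m:2d} variables: {cells:.0e} cells, "
          f"{n_instances / cells:.2e} instances per cell")
```

Already with three variables there are only a few instances per cell on average, and with 17 variables almost every cell is empty.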

This situation is called the curse of dimensionality, a term first
used by Bellman [7]. Algorithms, such as decision trees, which are effective in
situations of low dimensionality, do not yield significant results when the
dimensionality increases beyond a certain level. Moreover, the classification
models that are built on datasets with a small number of variables are easier to
interpret and more suitable for visualization by using different specific data
mining methods.

Multiple linear dimensionality reduction algorithms have been developed,
among which factor analysis [80] and principal component analysis
(PCA) [42] are mentioned. The main objective of these algorithms is to
transform the input variables into a dataset of smaller dimension. These
algorithms require the input variables to be of continuous type and the
dimensions to be representable as linear combinations of these input variables.
Each newly formed dimension is supposed to represent an unobserved factor.

1.4. Churn Prediction – Literature Review

Recently, the risk of customer churn has become the main concern of
companies active in all industries with a relatively low switching cost [104]. A
company that is experiencing this type of problem can see reduced profits and
receive fewer recommendations from the customers who continue to use its
services [113]. Considering the churn rate in different industries, one can
acknowledge that the telecommunications industry is the most exposed, because
the annual churn rate in this industry varies between 20% and 40% [8], [94]. In
the mobile telecommunication sector, the term churn refers to the transfer of
subscribers from one service provider to another [137].

To address the issue of churn, telecommunications companies must take
a proactive approach. This approach involves the identification of customers
who may churn in the near future. In order to persuade these customers
to continue to use its services, such a company offers them special programs or
incentives. This approach has some advantages: the offers cost less, and the
customers are not in a position to negotiate and benefit from better deals.
However, if these systems are not precise, large amounts of money might be
spent on offers sent to customers who would have continued to use the services
offered by that specific company in any event [34], [104]. To precisely identify
customers who present a high risk of churning, such a company needs to
implement data mining models that use machine learning algorithms and are
highly accurate.

Seo, Ranganathan, and Babad have used statistical techniques to
implement a predictive model to identify the factors that lead to customer
retention in the telecommunications industry [122]. The authors have used a
binary logistic regression and a linear model with two hierarchical levels. Their
study has been focused on understanding the factors associated with
non-churners’ behavior and on demographics.

Despite the efforts to use statistical techniques to implement customer
behavior prediction models, building such a model is strongly dependent on
machine learning techniques due to the better performance of these techniques
compared to the statistical methods for nonparametric datasets [5], [9].

In [4], Arthur, Harris, and Annan have used principal component analysis to
outline the main factors that influence customers in the mobile
telecommunication industry to churn. These main factors have been grouped
into principal components using the correlation measure and the descriptive
statistical analysis of the independent variables.

For customer churn prediction, Sato et al. have proposed a local PCA
classification model that compares the eigenvalues of the principal components
that explain most of the variability of the dataset used [120]. Sato et al. have
argued that this classification model applied to a real dataset provides a superior
performance over the Naive Bayes classification model, logistic regression, and
C4.5 decision tree.

Brandusoiu and Toderean have compared three data mining techniques:
Bayesian networks, logistic regression, and k-nearest neighbors [15]. The
Bayesian networks have yielded an acceptable accuracy, fairly close to the
one generated by the other two algorithms. In [34], Coussement and Van den
Poel have implemented a predictive model for churn prediction using logistic
regression and have used the ROC (Receiver Operating Characteristic) curve as
an evaluation criterion.

In [83] and [88], the authors have used Bayesian networks to identify the
reasons for customer churn. To discretize the continuous variables, Kisioglu and
Topcu have used the CHAID (Chi-squared Automatic Interaction Detector)
algorithm [83]. In [82], Kirui et al. have implemented two predictive models
using the Naive Bayes algorithm and Bayesian networks, both yielding an
acceptable performance.

Wei and Chiu have used decision trees as a modeling technique and the
DET (Detection Error Tradeoff) curve as an evaluation criterion [137]. In their
study, the authors have used contract data and the changes in calling behavior.
Brandusoiu and Toderean have made a comparison between three decision tree
algorithms, namely CHAID, CART (Classification and Regression Tree), and
QUEST (Quick Unbiased Efficient Statistical Tree) [13]. These three algorithms
have been applied to the same dataset [10] as the one used in this book.

As another approach, Yan et al. have built a predictive model to determine
customer behavior in the prepaid mobile telecommunications industry [148].
Due to the limited availability of data in the prepaid segment, the authors
have used call detail record (CDR) data. As machine learning techniques, they
have chosen neural networks and decision trees, obtaining a much better accuracy
with the neural networks.

In another study on customer churn prediction, Hung et al. have
compared various data mining techniques [70]. In this study, the authors have
modeled decision trees and neural networks and have compared their
performance. Once again, the neural networks have achieved a better
performance than the decision trees.

In [23], Castanedo et al. have introduced for the first time the concept of
deep learning for customer prediction in the prepaid mobile telecommunications
industry. Castanedo et al. have investigated the application of a 4-layer
feedforward multilayer neural network on a large dataset. This model has had a
better performance than their previously implemented model which employed
the random forests algorithm.

Pendharkar has conducted a study to predict customer churn in
the telecommunications industry and has used genetic algorithm-based
neural networks to implement two predictive models, one using the cross
entropy criterion and the other the direct approach [110]. The two models have
been compared to a statistical model using the z-score, with Pendharkar
concluding that the neural network models have dominated the statistical model
from every perspective.

Brandusoiu and Toderean have compared RBF (Radial Basis Function)
and MLP neural networks [14]. The two models have been applied to the same
dataset [10]. The MLP neural network has achieved a significantly better
performance. Even if the RBF neural network has achieved a poor performance
on this dataset, this algorithm should not be ignored because, compared to a
regular neural network, it deals better with extreme values and offers an
increased computational performance.

Coussement and Van den Poel have compared in [35] the performance of
three classification techniques: logistic regression, support vector machines, and
random forests. The results of this study have shown that the random forests
model yields a significantly better performance than the other two models.

The support vector machines algorithm has been applied in [12] for
customer churn prediction in the prepaid mobile telecommunication industry.
Brandusoiu and Toderean have compared four different kernel functions,
namely: linear, polynomial, RBF, and sigmoid kernel. On the dataset [10], the
best predictive performance has been obtained using the polynomial kernel.

The majority of the previously described data mining methods that have
been applied to datasets that contain call detail records from the prepaid sector
have a predictive performance below 85% and use machine learning algorithms
with a standard architecture. It should not be forgotten that within this industry
there is tremendous competition, and a mobile telecommunications company
must implement highly accurate predictive models in order to properly identify
customers who are at risk of churning.

It is important to note that a predictive model that uses an architecture
tailored for the problem to be solved will generate an increased performance.
For example, Brandusoiu and Toderean have demonstrated that implementing a
predictive model using an adapted architecture increases the predictive
performance, even reaching the ideal level of accuracy of approximately 100%
[17], [19], [11].

In [17], Brandusoiu and Toderean have proposed a Bayesian network that
learns the graph structure using the Iterative Parent-Child based learning of
Markov Blanket (IPC-MB) algorithm [50]. This algorithm learns the Markov
blanket and minimizes the size of the set of the Pearson Chi-square conditional
independence tests [2] during the search, thus providing a better efficiency than
the other algorithms present in the literature. After the Markov blanket has
been determined, the PC algorithm finishes learning the network structure [128].
To estimate the parameters, the Bayesian estimation algorithm with a Dirichlet
prior distribution has been used [74]. Using this architecture on the dataset [10],
the authors have obtained a performance of approximately 100%. This method
has also been proposed by the authors in [19] and [11].

In [19] and [11], the authors have proposed an adapted architecture for the
support vector machines algorithm with a polynomial kernel of degree 4. For
training, a divide-and-conquer approach has been used which divides the
original problem into a set of subproblems which have been resolved using the
Sequential Minimal Optimization (SMO) algorithm adapted by Chang and Lin
[28]. Once the kernel matrix of a subproblem has been stored in cache, each of
its elements has been evaluated only once and has been calculated using the fast
SVM algorithm proposed by Dong, Suen, and Krzyzak [39]. Testing this
architecture on the dataset [10] has again yielded a predictive performance of
approximately 100%.

1.5. Conclusions

Based on the definitions presented in this chapter, it is understood that the
data mining methods are at the intersection of artificial intelligence, machine
learning, and statistics. Through the use of these data mining techniques,
predictive models can be implemented to discover trends and past behaviors,
allowing organizations to make intelligent decisions based on knowledge
extracted from data.

These methods are extremely useful to the mobile telecommunications
companies that intend to identify customers who are going to churn in the near
future. If these systems are not accurate, an organization can spend considerable
amounts of money on offers sent to customers who would have continued to use
its services anyway, while competitors could take advantage of the lack of a
powerful predictive model.

The literature overview suggests that as of now, the data mining methods
that have been applied on datasets consisting of call detail records have
employed different machine learning algorithms with standard architectures,
and therefore their predictive performance is merely acceptable.

The data mining method proposed in this book achieves a superior
performance compared to the above-mentioned methods that employ a standard
architecture of the machine learning algorithms used, and a performance similar
to the one achieved by the two methods presented that use an adapted
architecture. This method is applicable to any dataset that consists of call
detail records. Initially, this method applies the PCA algorithm to the dataset
to reduce its dimensionality and the possible correlations between the
independent variables. Before applying the machine learning algorithm, to ensure
an optimal learning, the distribution of the classes of the dependent variable
is balanced using the oversampling method [63]. Unlike the existing data mining
methods used for customer churn prediction in the prepaid segment of a mobile
telecommunications company, this method uses an improved machine learning
algorithm with a proprietary architecture, yields a superior performance, and is
scalable to large datasets.
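
The class balancing step can be illustrated with a minimal sketch of random oversampling, in which instances of the minority class are duplicated at random until all classes have the same size. This simple variant is an assumption made for the illustration; it is not necessarily the oversampling procedure of [63] applied in this book, and the function name and the toy dataset are hypothetical.

```python
import random
from collections import Counter


def random_oversample(dataset, seed=0):
    """Duplicate randomly chosen minority-class rows until classes are balanced."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in dataset:
        by_class.setdefault(y, []).append((x, y))
    target = max(len(rows) for rows in by_class.values())
    balanced = []
    for rows in by_class.values():
        balanced.extend(rows)
        # Sample with replacement to fill the gap up to the largest class.
        balanced.extend(rng.choices(rows, k=target - len(rows)))
    return balanced


# Hypothetical toy training dataset: 15 churners and 85 non-churners.
training = [(i, "Yes") for i in range(15)] + [(i, "No") for i in range(15, 100)]
balanced = random_oversample(training)
class_counts = Counter(y for _, y in balanced)
print(class_counts)
```

In practice, oversampling is applied to the training dataset only, so that the test dataset keeps the original class distribution.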

The following chapter presents the principal component analysis
algorithm and the manner in which it is applied to the dataset [10].


2. DATA UNDERSTANDING AND PREPARATION

In this chapter, the dataset is analyzed and prepared for the modeling
phase. The PCA algorithm and its main extensions are introduced. In the data
preparation phase, the PCA algorithm is applied to the dataset in order to reduce
its dimensionality and avoid the collinearity between the independent variables.
Also, within this chapter the dataset is partitioned into a training dataset and
a test dataset, and the distribution of the classes of the dependent variable is
balanced. This chapter represents the data understanding and the data
preparation phases of the CRISP-DM methodology.

2.1. Data Understanding

The dataset used in this book, on which the principal component analysis
is applied, comes from the Department of Information and Computer Science of
the University of California, Irvine [10]. This dataset contains the call detail
records of 3,333 customers, each having 21 variables. Each row of this dataset
corresponds to a customer and provides information about the
number of incoming and outgoing calls, the number of incoming and outgoing
SMSs, and the voicemail. When implementing the predictive model, the
Churn variable will be used as a dependent variable, and the other 20 variables
as independent variables.

Table 2.1 indicates the 21 variables, their type, and the range of their
values. A first analysis of this dataset draws attention to the variables State, Area
Code, and Phone. The variable Area Code has only three different values (408,
415, and 510), all belonging to the state of California. This would not be
abnormal if the data showed that all customers are from California. However, as
illustrated in Table 2.2 (shown only up to the state of Florida), these three
area codes are approximately evenly distributed across all the states of the
USA. In this case, it is possible that the dataset contains incorrect data.
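
This consistency check can be sketched in a few lines of Python. The counts below are copied from four rows of Table 2.2; the function name and the data structure are assumptions made for the illustration. An area code that appears in more than one state is flagged as suspicious.

```python
# Counts copied from four rows of Table 2.2 (state -> customers per area code).
table_2_2 = {
    "AK": {408: 14, 415: 24, 510: 14},
    "AL": {408: 25, 415: 40, 510: 15},
    "CA": {408: 7, 415: 17, 510: 10},
    "FL": {408: 12, 415: 31, 510: 20},
}


def states_per_area_code(crosstab):
    """Map every area code to the set of states in which it occurs."""
    result = {}
    for state, counts in crosstab.items():
        for code, count in counts.items():
            if count > 0:
                result.setdefault(code, set()).add(state)
    return result


spread = states_per_area_code(table_2_2)
suspicious = sorted(code for code, states in spread.items() if len(states) > 1)
print(suspicious)
```

All three area codes occur in every state listed, which is the inconsistency discussed above.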

Therefore, one must keep this aspect of the variable Area Code in mind and
not include it as an independent variable when implementing the predictive
model. The variable State may contain errors too, and additional information
about this dataset would be required before including either of these two
variables in the data mining model. The variable Phone is also excluded because
it does not provide any relevant information for prediction and is useful only
to identify customers. Consequently, the number of independent variables is
reduced from 20 to 17. The variables International Plan and Voice Mail Plan are
both nominal with Yes or No values, while the other 15 variables are
continuous.

Variable Name            Type         Values            Missing Values
State                    Nominal      AK, AL, …         0
Area Code                Nominal      408, 415, 510     0
Phone                    Nominal      N/A               0
International Plan       Nominal      Yes/No            0
Voice Mail Plan          Nominal      Yes/No            0
Account Length           Continuous   1 – 243           0
Voice Mail               Continuous   0 – 51            0
Day Minutes              Continuous   0.00 – 350.80     0
Day Calls                Continuous   0 – 165           0
Day Charge               Continuous   0.00 – 59.64      0
Evening Minutes          Continuous   0.00 – 363.70     0
Evening Calls            Continuous   0 – 170           0
Evening Charge           Continuous   0.00 – 30.91      0
Night Minutes            Continuous   23.20 – 395.00    0
Night Calls              Continuous   33 – 175          0
Night Charge             Continuous   1.04 – 17.77      0
Intl. Minutes            Continuous   0 – 20            0
Intl. Calls              Continuous   0 – 20            0
Intl. Charge             Continuous   0.00 – 5.40       0
Customer Service Calls   Continuous   0 – 9             0
Churn                    Nominal      Yes/No            0
Table 2.1. Dataset Variables [10].

Before applying the PCA algorithm and implementing the predictive
model, the dataset must be prepared in an appropriate format for the analytical
modeling. The first step involves verifying the data to see if there are any
missing values and taking a first look at the summary statistics of
each available variable. Analyzing Table 2.1, we note that this dataset is a
complete set, i.e. for each customer and for each variable there are no missing
values. Had there been any missing values, they would have had to be imputed
during this data preparation phase using an appropriate method [114].
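
The missing value check can be sketched as follows. The rows below are hypothetical and deliberately contain gaps, unlike the actual dataset, which is complete; the function name and the convention that `None` or an empty string marks a missing value are assumptions made for the illustration.

```python
def count_missing(rows, missing_markers=(None, "")):
    """Count, for every variable, how many rows hold a missing-value marker."""
    counts = {}
    for row in rows:
        for name, value in row.items():
            is_missing = any(value is m or value == m for m in missing_markers)
            counts[name] = counts.get(name, 0) + (1 if is_missing else 0)
    return counts


# Hypothetical rows with gaps, invented only to exercise the check.
rows = [
    {"Day Minutes": 265.1, "Churn": "No"},
    {"Day Minutes": None, "Churn": "Yes"},
    {"Day Minutes": 0.0, "Churn": ""},
]

missing = count_missing(rows)
print(missing)
```

A legitimate value of zero, such as 0.0 day minutes, is not counted as missing, which matters for this dataset because several variables have a minimum of zero.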

State    Area Code
         408    415    510
AK        14     24     14
AL        25     40     15
AR        13     27     15
AZ        15     36     13
CA         7     17     10
CO        25     29     12
CT        22     39     13
DC        14     27     13
DE        13     31     17
FL        12     31     20
Table 2.2. Distribution of Variable State relative to Variable Area Code.

The next step in the data understanding phase consists of checking the
dataset for extreme values, which may indicate the presence of measurement or
recording errors in the dataset. A first look at the distribution of each
variable in Table 2.3 shows the presence of extreme values for some variables.
Since these values do not indicate any measurement error or any unusual
behavior that would suggest an irregularity in the data, it is decided to keep
all these extreme values for each variable. This extreme value check concludes
the second phase of the CRISP-DM process, namely the data understanding phase.

2.2. Data Preparation

In the prepaid mobile telecommunication industry, the datasets contain
variables which are correlated. This collinearity should be avoided because it
leads to instability in the solution space and yields incoherent results. Even if
this instability is avoided, the highly correlated independent variables tend to
overemphasize a particular component of the model because that component is
taken into account multiple times.

Using too many independent variables to model the relationships with a
dependent variable may unnecessarily complicate the interpretation of the
analysis and may violate the principle of parsimony, which asserts that the
number of independent variables should be reduced so that the result of the
analysis can be interpreted with ease. Also, keeping too many variables can lead
to overfitting: the model fits patterns specific to the training dataset that
do not recur in the test dataset, so the discovered relationships generalize
poorly.

Thus, in order to reduce the number of independent variables and to
ensure that these are not correlated, a dataset reduction method is applied which
uses the correlation structure of the independent variables. The dimensionality
reduction method used in this book is the principal component analysis
described below [16].

2.2.1. PCA Algorithm

Although the origins of statistical techniques are usually difficult to
trace, it is generally accepted that the earliest descriptions of the technique
known as principal component analysis were given by Pearson in 1901 [109]
and by Hotelling in 1933 [68]. The two papers adopt different approaches to this
technique. Pearson was concerned with finding the lines and planes that best
describe a set of points in an m-dimensional space, and the geometric
optimization problems that he considered led to the principal component
technique [109].

The approach taken by Hotelling starts from factor analysis, but the PCA
technique he defined differs from factor analysis [68]. The main idea of his
paper is that there exists a fundamental, smaller set of independent variables
that can determine the values of the original m variables. Hotelling noted that
such variables had been called factors in the psychology literature, and he
introduced the alternative term components to avoid confusion with other uses
of the term factor in mathematics. These components were chosen to maximize
their successive contributions to the total variance of the original variables,
and were called principal components. The analysis that leads to the discovery
of these principal components was called the principal component method.

In 1936, Hotelling proposed an accelerated method for calculating the
principal components [69], and Girshick offered some alternatives for deriving
the principal components and introduced the idea that the principal components
obtained from a sample are maximum likelihood estimates of the principal
components of the entire population [53]. In 1939, Girshick investigated the
asymptotic sampling distributions of the coefficients and variances of the
principal components [54].

In 1963, Anderson discussed the theoretical aspects of Girshick's work in a
paper that was frequently cited in later developments [3]. In [112], Rao
proposed new ideas regarding the use, the interpretation, and the extensions of
the PCA technique. Gower discussed the connections between the PCA method and
other statistical techniques and offered various important geometric
perspectives [59]. Jeffers presented the practical side of the PCA technique by
discussing two case studies that employ the PCA method [73].

The principal component analysis, also called the Karhunen-Loeve
transformation [92], is an unsupervised method that aims to represent the
correlation structure of a set of independent variables using a smaller set of
linear combinations of these independent variables. These linear combinations
are called components. The entire variability of a dataset produced by the
original variables X1, X2, ..., Xm can often be represented by a smaller set of
k linear combinations of these variables, which means that the k components
carry almost the same amount of information as the original m variables. In
this case, the m variables can be replaced by k < m components so that the
dataset consists of n instances and k components instead of n instances and m
variables.

The geometric approach to this method was proposed by Pearson [109]. To
illustrate this geometric approach, it is assumed that there are only two
variables X1 and X2, and that Y1 and Y2 are the two principal components, as
illustrated in Figure 2.1. It is evident that most of the variance lies along
the Y1 axis.

If ri is the minimum distance from a point i to the first principal
component Y1, the optimal principal component Y1 is obtained by minimizing the
sum in equation (2.1):

$$\sum_{i=1}^{l} r_i^2 \tag{2.1}$$

In an m-dimensional space, the PCA algorithm searches for the k-dimensional
hyperplane that provides the best representation. The principal components
represent a new coordinate system obtained by rotating the original system
along the directions of maximum variability.

Figure 2.1. Principle of PCA Algorithm [109]. (Axes X1 and X2; first principal
component Y1; distance ri from a point to Y1.)

Hotelling proposed a more systematic approach to this method based on
eigenvectors [68]. In this approach, the coordinate system formed by two
variables X1 and X2 is transformed into a new coordinate system Y1 and Y2
obtained by rotating the original system along the directions of maximum
variability (Figure 2.2):

$$\begin{aligned}
Y_1 &= X_1 \cos\theta + X_2 \sin\theta \\
Y_2 &= -X_1 \sin\theta + X_2 \cos\theta
\end{aligned} \tag{2.2}$$
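
As a rough illustration of equation (2.2), the sketch below rotates a small synthetic two-variable dataset and scans the angle θ for the value that maximizes the variance along Y1. The data, the one-degree grid of angles, and the function name `rotate` are assumptions made for this example only.

```python
import numpy as np

def rotate(x1, x2, theta):
    """Apply the axis rotation of equation (2.2)."""
    y1 = x1 * np.cos(theta) + x2 * np.sin(theta)
    y2 = -x1 * np.sin(theta) + x2 * np.cos(theta)
    return y1, y2

# Synthetic, centered data lying roughly along the 45-degree line (assumption).
rng = np.random.default_rng(0)
t = rng.normal(size=500)
x1 = t + 0.1 * rng.normal(size=500)
x2 = t + 0.1 * rng.normal(size=500)
x1, x2 = x1 - x1.mean(), x2 - x2.mean()

# Scan candidate angles and keep the one with the largest Var(Y1).
angles = np.linspace(0.0, np.pi, 181)  # one-degree steps
variances = [rotate(x1, x2, a)[0].var() for a in angles]
best_theta = angles[int(np.argmax(variances))]

y1, y2 = rotate(x1, x2, best_theta)
print(np.degrees(best_theta), y1.var(), y2.var())
```

The rotation only redistributes the variance between the two axes; the total variance of the two coordinates is the same before and after the rotation.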

Prior to using this dimensionality reduction algorithm, the dataset must be
standardized so that the arithmetic mean of each variable is zero and the
standard deviation is equal to one. Each variable Xi is represented by a vector of
size n × 1 , where n is the number of instances. The standardized variable is
represented by a vector Zi of size n × 1 , where Zi = (Xi − μi )/σii , μi is the
arithmetic mean of Xi, and σii is the standard deviation of Xi.

Figure 2.2. Axis Rotation in a Two-Dimensional Space [68]. (Original axes X1
and X2; rotated axes Y1 and Y2; rotation angle θ.)


In matrix notation, this standardization is expressed as
$Z = (V^{1/2})^{-1}(X - \mu)$, where the negative exponent refers to the inverse
of the matrix, and $V^{1/2}$ is the diagonal matrix of size m × m called the
standard deviation matrix:

$$V^{1/2} = \begin{bmatrix}
\sigma_{11} & 0 & \cdots & 0 \\
0 & \sigma_{22} & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \sigma_{mm}
\end{bmatrix} \tag{2.3}$$
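
The standardization can be sketched in a few lines of NumPy. The data matrix below is hypothetical (5 instances, 3 variables on different scales), and instances are stored as rows, so the matrix product is written on the right; this is an illustration of the formula, not the procedure used to produce the results in this book.

```python
import numpy as np

# Hypothetical data matrix X: n = 5 instances (rows), m = 3 variables (columns).
X = np.array([
    [101.0, 179.8, 2.7],
    [ 85.0, 210.4, 3.1],
    [137.0, 120.7, 1.9],
    [ 64.0, 250.2, 3.6],
    [118.0, 160.3, 2.4],
])

mu = X.mean(axis=0)                 # vector of arithmetic means
V_half = np.diag(X.std(axis=0))     # standard deviation matrix, equation (2.3)

# Z = (V^{1/2})^{-1} (X - mu), applied to every instance (row) at once.
Z = (X - mu) @ np.linalg.inv(V_half)

print(Z.mean(axis=0).round(12))     # approximately 0 for every variable
print(Z.std(axis=0).round(12))      # approximately 1 for every variable
```

Because the standard deviation matrix is diagonal, the same result is obtained by dividing each centered column by its own standard deviation.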

In the case of an m-dimensional problem, new coordinates are introduced
for i = 1, 2, ..., m:

$$Y_i = \sum_{j=1}^{m} e_{ij} Z_j \tag{2.4}$$

The objective of the PCA algorithm is to find the eigenvector

$$e_i = [e_{i1}, e_{i2}, \ldots, e_{im}] \tag{2.5}$$

that maximizes Var(Yi), i.e. to find the coordinate transformation that
maximizes the variation along the direction of the Yi axis.

Let

$$Y_i = \sum_{j=1}^{m} e_{ij} Z_j = e_i^\top Z, \qquad Z = [Z_1, Z_2, \ldots, Z_m] \tag{2.6}$$

such that by projecting the vector Z onto the vector ei, the distance Yi is
obtained along the direction of the vector ei, and we have:

$$\begin{aligned}
\operatorname{Var}(Y_i) &= E[(Y_i - \bar{Y}_i)(Y_i - \bar{Y}_i)^\top] = E[e_i^\top (Z - \bar{Z})(Z - \bar{Z})^\top e_i] \\
&= e_i^\top E[(Z - \bar{Z})(Z - \bar{Z})^\top] e_i = e_i^\top C e_i
\end{aligned} \tag{2.7}$$

where the covariance matrix C is given by equation (2.8):

$$C = E[(Z - \bar{Z})(Z - \bar{Z})^\top] \tag{2.8}$$

Covariance measures the degree to which two variables vary together when
there is a dependence between them. A positive covariance indicates that if one
variable increases, the other variable tends to increase. A negative covariance
indicates that if one variable increases, the other variable tends to decrease.
Covariance is not scale-invariant, so changing the units of measurement changes
its value.

The correlation coefficient rij avoids this problem by scaling the
covariance by the two standard deviations:

$$r_{ij} = \frac{\sigma_{ij}^2}{\sigma_{ii}\,\sigma_{jj}} \tag{2.9}$$

where $\sigma_{ij}^2$, i ≠ j, is the covariance between Xi and Xj:

$$\sigma_{ij}^2 = \frac{\sum_{k=1}^{n} (X_{ki} - \mu_i)(X_{kj} - \mu_j)}{n} \tag{2.10}$$

The notation $\sigma_{ii}^2$ is used to denote the variance of variable Xi. If
Xi and Xj are independent, then $\sigma_{ij}^2 = 0$, but $\sigma_{ij}^2 = 0$
does not imply that Xi and Xj are independent.
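
The effect of the units of measurement can be checked numerically. In the sketch below, the usage figures and the per-minute rate are hypothetical; the functions `covariance` and `correlation` simply transcribe equations (2.10) and (2.9).

```python
import numpy as np

def covariance(x, y):
    """Population covariance, equation (2.10)."""
    return np.mean((x - x.mean()) * (y - y.mean()))

def correlation(x, y):
    """Correlation coefficient, equation (2.9)."""
    return covariance(x, y) / (x.std() * y.std())

# Hypothetical usage figures: call minutes and a charge billed at a fixed rate.
minutes = np.array([120.0, 180.5, 210.0, 95.3, 260.8])
charge = 0.17 * minutes

# The same durations expressed in another unit.
seconds = 60.0 * minutes

# The covariance scales with the unit; the correlation does not.
print(covariance(minutes, charge), covariance(seconds, charge))
print(correlation(minutes, charge), correlation(seconds, charge))
```

A charge that is an exact multiple of the minutes has a correlation of 1 with them, whatever unit is used for the duration.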

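The correlation matrix is denoted by ρ:

$$\rho = \begin{bmatrix}
\dfrac{\sigma_{11}^2}{\sigma_{11}\sigma_{11}} & \dfrac{\sigma_{12}^2}{\sigma_{11}\sigma_{22}} & \cdots & \dfrac{\sigma_{1m}^2}{\sigma_{11}\sigma_{mm}} \\
\dfrac{\sigma_{12}^2}{\sigma_{11}\sigma_{22}} & \dfrac{\sigma_{22}^2}{\sigma_{22}\sigma_{22}} & \cdots & \dfrac{\sigma_{2m}^2}{\sigma_{22}\sigma_{mm}} \\
\vdots & \vdots & \ddots & \vdots \\
\dfrac{\sigma_{1m}^2}{\sigma_{11}\sigma_{mm}} & \dfrac{\sigma_{2m}^2}{\sigma_{22}\sigma_{mm}} & \cdots & \dfrac{\sigma_{mm}^2}{\sigma_{mm}\sigma_{mm}}
\end{bmatrix} \tag{2.11}$$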

Considering that each variable has been standardized and taking into
account the above-mentioned standardized matrix, we have E(Z) = 0 , where 0
is a vector of zeros of size n × 1 , and Z has the covariance matrix equal to the
correlation matrix.

Clearly, the larger the norm of the vector ei, the larger Var(Yi) will be.
Thus, the normalization constraint $\lVert e_i \rVert = 1$, that is
$e_i^\top e_i = 1$, must be imposed while Var(Yi) is maximized.

Therefore, the optimization problem involves finding the vector ei that
maximizes $e_i^\top C e_i$ subject to the constraint $e_i^\top e_i = 1$. By
applying the Lagrange multiplier method, the stationary points of the Lagrange
function ℒ are sought:

$$\mathcal{L} = e_i^\top C e_i - \lambda (e_i^\top e_i - 1) \tag{2.12}$$

where λ is a Lagrange multiplier. By differentiating the function ℒ with
respect to the elements of the vector ei and setting the derivatives equal to
zero, we obtain:

$$C e_i - \lambda e_i = 0 \tag{2.13}$$

which means that λ is an eigenvalue of the covariance matrix C, and ei is an
eigenvector. Left-multiplying equation (2.13) by $e_i^\top$ and using the
constraint $e_i^\top e_i = 1$, we obtain equation (2.14):

$$\lambda = e_i^\top C e_i = \operatorname{Var}(Y_i) \tag{2.14}$$

Thus, the new coordinate Yi is called a principal component and is
obtained from equation (2.6). This principal component Yi is uncorrelated with
all the previous principal components Yj, j < i, since the corresponding
eigenvectors are orthogonal ($e_i^\top e_j = 0$).

The total variability in the standardized dataset is equal to the sum of the
variances of the principal components, to the sum of the variances of the
standardized variables Zi, to the sum of the eigenvalues, and to the number of
independent variables. Thus, we have equation (2.15):

$$\sum_{i=1}^{m} \operatorname{Var}(Y_i) = \sum_{i=1}^{m} \operatorname{Var}(Z_i) = \sum_{i=1}^{m} \lambda_i = m \tag{2.15}$$
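
Equations (2.13) to (2.15) can be verified numerically on a small example. The sketch below uses synthetic data (an assumption) and NumPy's `numpy.linalg.eigh` routine for symmetric matrices; it is meant only to make the relationships concrete.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data (assumption): 4 variables, two of which are strongly correlated.
n = 1000
base = rng.normal(size=(n, 3))
X = np.column_stack([base[:, 0],
                     2.0 * base[:, 0] + 0.3 * rng.normal(size=n),
                     base[:, 1],
                     base[:, 2]])

Z = (X - X.mean(axis=0)) / X.std(axis=0)   # standardized variables
C = (Z.T @ Z) / n                          # covariance of Z, i.e. correlation matrix of X

# eigh returns ascending eigenvalues for a symmetric matrix; reorder descending.
eigenvalues, eigenvectors = np.linalg.eigh(C)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

Y = Z @ eigenvectors                       # column i holds the component Y_i

print(eigenvalues)
print(Y.var(axis=0))       # matches the eigenvalues, equation (2.14)
print(eigenvalues.sum())   # equals m = 4 up to rounding, equation (2.15)
```

The two correlated variables are absorbed into one component with a large eigenvalue, which is the situation exploited for dimensionality reduction.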

The partial correlation between a particular component and a given
variable is a function of an eigenvalue and an eigenvector. More specifically,
for i, j = 1, 2, ..., m:

$$\operatorname{Corr}(Y_i, Z_j) = e_{ij}\sqrt{\lambda_i} \tag{2.16}$$

where (λ1, e1), (λ2, e2), ..., (λm, em) represent the eigenvalue and eigenvector
pairs for the correlation matrix ρ, and λ1 ≥ λ2 ≥ ... ≥ λm. A partial
correlation coefficient is a correlation coefficient that takes into account
the effect of all other independent variables.
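
As a sanity check of equation (2.16), the sketch below computes the component weights from the eigen-decomposition and compares them with correlations computed directly from the data. The three synthetic variables are an assumption chosen so that two of them are correlated.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2000
a, b, c = rng.normal(size=(3, n))
X = np.column_stack([a, a + 0.5 * c, b])    # synthetic, partly correlated variables

Z = (X - X.mean(axis=0)) / X.std(axis=0)
rho = (Z.T @ Z) / n                          # correlation matrix

lam, E = np.linalg.eigh(rho)
lam, E = lam[::-1], E[:, ::-1]               # descending eigenvalues; column i is e_i

weights = E * np.sqrt(lam)                   # weights[j, i] = e_ij * sqrt(lambda_i)

# Cross-check against correlations computed directly from the data.
Y = Z @ E
direct = np.array([[np.corrcoef(Y[:, i], Z[:, j])[0, 1] for i in range(3)]
                   for j in range(3)])
print(np.abs(weights - direct).max())        # close to zero
```

Each row of `weights` corresponds to one variable, and its squared entries sum to 1 when all components are kept.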

The proportion of the total variability in Z that is explained by the ith
principal component is equal to the ratio between the ith eigenvalue and the
number of independent variables, that is λi/m.

After calculating the principal components, the next step is to select the
principal components. From the total number of principal components which is
equal to the number of independent variables, only a smaller number is selected
based on four criteria: the eigenvalue criterion, the proportion of the explained
variance criterion, the minimum communality criterion, and the scree plot
criterion.

The eigenvalue criterion suggests retaining only the principal components
that have an eigenvalue greater than or equal to 1. It should be recalled that the
total sum of the eigenvalues is equal to the total number of the independent
variables used, and thus an eigenvalue equal to 1 means that the principal
component explains the variability of approximately one independent variable.

The proportion of the explained variance criterion suggests retaining the
principal components that explain the desired percentage of variance. Since the
principal component analysis is a precursor to the modeling phase, the aim is
to retain as much of the variance as possible. If the eigenvalues λi are sorted
in descending order, the proportion of the variance explained by the first k
principal components is:

$$\frac{\lambda_1 + \lambda_2 + \cdots + \lambda_k}{\lambda_1 + \lambda_2 + \cdots + \lambda_k + \cdots + \lambda_m} \tag{2.17}$$

If the independent variables are highly correlated, a small number of
eigenvectors with high eigenvalues will be obtained, and k will be much smaller
than m, thus obtaining a significant dimensionality reduction.

Based on the communality criterion, enough principal components must be
retained so that the communality of each independent variable is at least 50%.
The communality represents the proportion of the variance of an independent
variable that it shares with the other independent variables and is equal to
the sum of the squared component weights. The communality values represent the
global importance of each independent variable within the principal component
analysis.

The scree plot criterion refers to interpreting the chart obtained by
plotting the eigenvalues against the order number of the principal components.
This criterion suggests retaining the maximum number of principal components
before the graph approaches zero, that is, up to the point beyond which
selecting another principal component would not increase the explained variance
significantly.

2.2.2. Applying the PCA Algorithm

Using the previously described principal component analysis, the 15
quantitative independent variables shown in Table 2.1 will be analyzed and
transformed into principal components [16]. Mapping this dataset onto the
notation of the previous subchapter, we have X1 = Account Length,
X2 = Voice Mail, ..., X15 = Customer Service Calls, so that m = 15 and n = 3,333.

Table 2.3 shows these 15 quantitative independent variables along with
their statistical indicators. The variability differs widely between variables.
For example, the variable Intl. Charge has a standard deviation of less than 1,
whereas the variable Day Minutes has a standard deviation greater than 54. If
the principal component analysis is applied without first standardizing these
quantitative independent variables, the variable Day Minutes will dominate the
influence of the variable Intl. Charge, and likewise the variables with a large
variability will dominate those with a small variability. Therefore, all
variables are standardized and we obtain the vectors Zi = (Xi − μi)/σii using
the mean and the standard deviation from Table 2.3.

Variable Name            Min     Max     Mean    Median  Mode    Std. Dev.
Account Length           1       243     101.06  101     105     39.82
Voice Mail               0       51      8.10    0       0       13.68
Day Minutes              0.00    350.8   179.77  179.40  154.00  54.47
Day Calls                0       165     100.44  101     102     20.07
Day Charge               0.00    59.64   30.56   30.50   26.18   9.26
Evening Minutes          0.00    363.7   200.98  201.40  169.90  50.74
Evening Calls            0       170     100.11  100     105     19.92
Evening Charge           0.00    30.91   17.08   17.12   14.25   4.31
Night Minutes            23.2    395.0   200.87  201.2   188.20  50.57
Night Calls              33      175     100.11  100     105     19.57
Night Charge             1.04    17.77   9.04    9.05    9.45    2.28
Intl. Minutes            0.00    20.00   10.24   10.3    10      2.79
Intl. Calls              0       20      4.48    4       3       2.46
Intl. Charge             0.00    5.40    2.76    2.78    2.70    0.75
Customer Service Calls   0       9       1.56    1       1       1.31
Table 2.3. Statistical Indicators of Quantitative Variables.

Table 2.4 illustrates the correlation matrix ρ of the quantitative
independent variables. For simplicity, the name of each variable is shortened
to the first letter of each of its words followed by the z suffix, which
denotes that the variable is standardized. It can be seen in Table 2.4 that
some independent variables are strongly correlated with each other, and these
correlations could negatively influence the classification model. The principal
component analysis exploits this correlation structure and identifies the
components that underlie the correlated variables.

The principal component analysis is performed on the 15 quantitative
independent variables [16]. The principal components matrix is illustrated in
Table 2.5. Each column in this table represents a principal component
$Y_i = e_i^\top Z$. The values are called component weights and represent the
partial correlation between the independent variables and the principal
component. As previously discussed, each weight is equal to the product of an
element of the ith eigenvector and the square root of the ith eigenvalue,
$\operatorname{Corr}(Y_i, Z_j) = e_{ij}\sqrt{\lambda_i}$. The component weights
take values between –1 and 1.

In general, the first principal component accounts for more of the
variability than any other component, since the variance
$\operatorname{Var}(Y_1) = e_1^\top C e_1$ is maximized. In Table 2.5 one can
observe that the independent variables Intl. Minutes, Intl. Charge, Day
Minutes, Day Charge, Night Minutes, and Night Charge vary together and that
their component weights have high and pairwise similar values (column
corresponding to the first component), which indicates that all these
independent variables are correlated with the first principal component.

Table 2.6 shows the eigenvalues corresponding to each component and
the percentage of the total variance explained by each component. The ratio of
the total variability in Z that is explained by the ith principal component is equal
to λi/m, i.e. the ratio between the ith eigenvalue and the total number of
Var. al_z vm_z dm_z dc_z dch_z em_z ec_z ech_z nm_z nc_z nch_z im_z ic_z ich_z csc_z
al_z 1.000 -0.005 0.006 0.038 0.006 -0.007 0.019 -0.007 -0.009 -0.013 -0.009 0.010 0.021 0.010 -0.004
vm_z -0.005 1.000 0.001 -0.010 0.001 0.018 -0.006 0.018 0.008 0.007 0.008 0.003 0.014 0.003 -0.013
dm_z 0.006 0.001 1.000 0.007 1.000 0.007 0.016 0.007 0.004 0.023 0.004 -0.010 0.008 -0.010 -0.013
dc_z 0.038 -0.010 0.007 1.000 0.007 -0.021 0.006 -0.021 0.023 -0.020 0.023 0.022 0.005 0.022 -0.019
dch_z 0.006 0.001 1.000 0.007 1.000 0.007 0.016 0.007 0.004 0.023 0.004 -0.010 0.008 -0.010 -0.013
em_z -0.007 0.018 0.007 -0.021 0.007 1.000 -0.011 1.000 -0.013 0.008 -0.013 -0.011 0.003 -0.011 -0.013
ec_z 0.019 -0.006 0.016 0.006 0.016 -0.011 1.000 -0.011 -0.002 0.008 -0.002 0.009 0.017 0.009 0.002

ech_z -0.007 0.018 0.007 -0.021 0.007 1.000 -0.011 1.000 -0.013 0.008 -0.013 -0.011 0.003 -0.011 -0.013
nm_z -0.009 0.008 0.004 0.023 0.004 -0.013 -0.002 -0.013 1.000 0.011 1.000 -0.015 -0.012 -0.015 -0.009
nc_z -0.013 0.007 0.023 -0.020 0.023 0.008 0.008 0.008 0.011 1.000 0.011 -0.014 0.000 -0.014 -0.013
nch_z -0.009 0.008 0.004 0.023 0.004 -0.013 -0.002 -0.013 1.000 0.011 1.000 -0.015 -0.012 -0.015 -0.009
im_z 0.010 0.003 -0.010 0.022 -0.010 -0.011 0.009 -0.011 -0.015 -0.014 -0.015 1.000 0.032 1.000 -0.010
ic_z 0.021 0.014 0.008 0.005 0.008 0.003 0.017 0.003 -0.012 0.000 -0.012 0.032 1.000 0.032 -0.018
ich_z 0.010 0.003 -0.010 0.022 -0.010 -0.011 0.009 -0.011 -0.015 -0.014 -0.015 1.000 0.032 1.000 -0.010
csc_z -0.004 -0.013 -0.013 -0.019 -0.013 -0.013 0.002 -0.013 -0.009 -0.013 -0.009 -0.010 -0.018 -0.010 1.000
Table 2.4. Pearson's Correlation Coefficient of Quantitative Variables.
Var. PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10 PC11 PC12 PC13 PC14 PC15
al_z -0.020 0.002 0.028 -0.007 0.631 -0.099 0.081 0.073 0.289 0.472 -0.522 0.000 0.000 0.000 0.000
vm_z 0.013 0.017 -0.007 0.034 -0.054 0.537 -0.278 0.541 0.561 -0.135 0.063 0.000 0.000 0.000 0.000
dm_z 0.500 0.160 0.849 -0.052 -0.015 -0.022 -0.013 0.017 -0.004 -0.001 -0.003 0.000 0.000 0.000 0.000
dc_z -0.017 -0.058 0.040 0.029 0.570 -0.201 -0.302 -0.249 0.233 -0.065 0.646 0.000 0.000 0.000 0.000
dch_z 0.500 0.160 0.849 -0.052 -0.015 -0.022 -0.013 0.017 -0.004 -0.001 -0.003 0.000 0.000 0.000 0.000
em_z 0.286 0.738 -0.266 0.549 0.019 -0.025 0.011 -0.005 0.001 -0.002 0.011 0.000 0.000 0.000 0.000
ec_z -0.006 -0.009 0.043 -0.007 0.305 0.121 0.729 -0.033 0.159 -0.577 -0.013 0.000 0.000 0.000 0.000

ech_z 0.285 0.738 -0.266 0.549 0.019 -0.025 0.011 -0.005 0.001 -0.002 0.011 0.000 0.000 0.000 0.000
nm_z 0.434 -0.664 -0.089 0.601 -0.002 -0.005 0.014 0.016 -0.014 0.007 -0.015 0.000 -0.001 0.000 0.000
nc_z 0.055 0.005 0.021 0.005 -0.227 0.483 0.306 -0.535 0.282 0.459 0.202 0.000 0.000 0.000 0.000
nch_z 0.434 -0.664 -0.089 0.601 -0.002 -0.005 0.014 0.016 -0.015 0.007 -0.015 0.000 0.001 0.000 0.000
im_z -0.707 0.004 0.437 0.554 -0.040 -0.012 0.004 -0.010 0.014 0.003 -0.012 0.002 0.000 0.000 0.000
ic_z -0.045 0.022 0.045 0.025 0.358 0.454 0.154 0.351 -0.607 0.258 0.281 0.000 0.000 0.000 0.000
ich_z -0.707 0.004 0.437 0.554 -0.040 -0.012 0.004 -0.010 0.014 0.003 -0.012 -0.002 0.000 0.000 0.000
csc_z -0.014 -0.011 -0.025 -0.038 -0.238 -0.486 0.431 0.472 0.218 0.336 0.369 0.000 0.000 0.000 0.000

Table 2.5. Principal Component Matrix.


PC   Eigenvalue (Total)   Variance %   Cumulative %
1 2.046 13.638 13.638
2 2.028 13.521 27.158
3 1.987 13.246 40.405
4 1.950 13.000 53.405
5 1.060 7.070 60.475
6 1.031 6.876 67.351
7 1.009 6.728 74.078
8 0.995 6.631 80.709
9 0.975 6.501 87.210
10 0.968 6.455 93.664
11 0.950 6.336 100.000
12 0.000 0.000 100.000
13 0.000 0.000 100.000
14 0.000 0.000 100.000
15 0.000 0.000 100.000
Table 2.6. Eigenvalues and Proportion of Explained Variance.

independent variables m. Thus, given that the first eigenvalue is 2.046 and
that there are 15 quantitative independent variables, the first component alone
explains 2.046/15 = 13.64% of the total variance.

Thus, a single principal component accounts for approximately one seventh
of the variability in the entire dataset of these 15 quantitative variables,
that is, it contains about one seventh of the information provided by this
dataset. Table 2.6 also shows that the eigenvalues decrease in magnitude, i.e.
λ1 ≥ λ2 ≥ ... ≥ λ15.

The second principal component represents the second-best linear
combination of the variables, with the condition that it is orthogonal to the
first principal component. Two vectors are orthogonal if they are
perpendicular; the corresponding components are then uncorrelated. The second
principal component is obtained from the variability remaining after the first
principal component has been extracted. As for the first principal component,
the independent variables that vary together, and whose component weights have
high and similar values (column corresponding to the second component), can be
identified in Table 2.5. The remaining principal components are defined
similarly.

After defining all the principal components, the criteria for selecting an
optimal number of principal components for the modeling phase must be
analyzed [16]. Based on the eigenvalue criterion, which suggests retaining
the principal components with an eigenvalue greater than or equal to 1, the first
7 principal components are retained, all having an eigenvalue greater than 1.
The next 4 principal components have an eigenvalue approximately equal to 1
(Table 2.6). If the other criteria support such a decision, these 4 principal
components will be retained too.

The proportion of the explained variance criterion suggests retaining the
principal components that explain the desired percentage of variance. Thus,
based on this criterion, the first 11 principal components are retained, which
account for the entire variability in the dataset.
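
The eigenvalue criterion and the explained variance criterion can be reproduced directly from the eigenvalues reported in Table 2.6. In the sketch below, the 99.9% target and the variable names are assumptions made for illustration.

```python
import numpy as np

# Eigenvalues as reported in Table 2.6.
eigenvalues = np.array([2.046, 2.028, 1.987, 1.950, 1.060, 1.031, 1.009,
                        0.995, 0.975, 0.968, 0.950, 0.0, 0.0, 0.0, 0.0])
m = len(eigenvalues)   # the eigenvalues sum to m = 15 up to rounding

# Eigenvalue criterion: keep the components with an eigenvalue >= 1.
n_eigenvalue_criterion = int(np.sum(eigenvalues >= 1.0))

# Explained variance, equation (2.17), as a cumulative percentage.
cumulative_pct = 100.0 * np.cumsum(eigenvalues) / eigenvalues.sum()

# Smallest k whose cumulative share reaches the chosen target (assumption).
target_pct = 99.9
k = int(np.argmax(cumulative_pct >= target_pct)) + 1

print(n_eigenvalue_criterion)        # 7
print(cumulative_pct[:4].round(1))
print(k)                             # 11
```

With a strict threshold of 1, seven components pass the eigenvalue criterion, while eleven are needed before the cumulative share reaches essentially 100%.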

Based on the communality criterion, one should retain enough principal
components so that the communality of each independent variable is greater
than 50%. By selecting the first 11 principal components, a communality of
100% is obtained for each variable. As a reminder, the communality of an
independent variable is equal to the sum of the squared component weights.
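
The communality can be computed by hand from the rows of Table 2.5. The sketch below copies the weights of two variables (al_z and dm_z) for components 1 to 11; the helper name `communality` is an assumption.

```python
# Component weights for two variables, copied from Table 2.5 (components 1 to 11).
weights = {
    "al_z": [-0.020, 0.002, 0.028, -0.007, 0.631, -0.099,
             0.081, 0.073, 0.289, 0.472, -0.522],
    "dm_z": [0.500, 0.160, 0.849, -0.052, -0.015, -0.022,
             -0.013, 0.017, -0.004, -0.001, -0.003],
}

def communality(row, k):
    """Sum of the squared component weights over the first k components."""
    return sum(w * w for w in row[:k])

for name, row in weights.items():
    print(name, round(communality(row, 7), 3), round(communality(row, 11), 3))
```

For these two rows, dm_z is already almost fully represented by the first few components, whereas al_z reaches a communality of only about 0.42 with seven components and needs the later components to approach 1. This is consistent with retaining more than the seven components suggested by the eigenvalue criterion alone.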

The scree plot criterion involves interpreting the chart obtained by
plotting the eigenvalues against the order number of each principal component.
Figure 2.3 illustrates this chart, with the 15 principal components on the
horizontal axis and the eigenvalues on the vertical axis. This criterion
suggests retaining the maximum number of principal components before the graph
approaches zero, in this case 11. Thus, based on this criterion, the first 11
principal components are retained [16].

Figure 2.3. Scree Plot Criterion. (Horizontal axis: Principal Component;
vertical axis: Eigenvalues.)

Accordingly, after analyzing all 4 selection criteria, in the modeling phase
only the first 11 principal components will be used. These components are not
correlated and explain 100% of the variance in the original dataset.

2.2.3. Dataset Partitioning and Distribution Balancing

At this stage, the dataset must be prepared for the machine learning
algorithms used in the next phase. The classification model is implemented in
the following chapter. In order to empirically estimate the generalization
error of this algorithm, the dataset is randomly partitioned into a training
dataset, which represents 80% of the original dataset, and a test dataset, which
represents 20% of the original dataset [129].
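
A random 80/20 partition can be sketched with the Python standard library alone. The function name, the seed, and the use of plain indices in place of the customer records are assumptions; the exact subset sizes depend on how the 80% boundary is rounded, so they need not match the counts reported in this chapter.

```python
import random

def train_test_split(instances, train_fraction=0.8, seed=42):
    """Randomly partition a list of instances into training and test subsets."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)
    cut = round(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

# Hypothetical stand-in for the 3,333 customer records: just their indices.
customers = list(range(3333))
train, test = train_test_split(customers)
print(len(train), len(test))
```

Fixing the seed makes the partition reproducible, which is convenient when several models are to be compared on the same test dataset.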

For optimal training, the machine learning algorithm requires a training
dataset whose dependent variable, in our case the Churn variable, has an
approximately balanced distribution of its classes. The training dataset
has 388 (14%) instances belonging to the Yes class and 2,279 (86%) instances
belonging to the No class. Such an imbalanced distribution is common in the
mobile telecommunication industry. In order to balance the distribution of the
variable Churn, the oversampling technique [63] is applied. This technique
randomly clones the instances of the Yes class of the variable Churn until they
are approximately equal in number to the instances of the No class. After
adding the new instances, the distribution of the variable Churn in the
training dataset is as follows: 2,294 (50%) instances pertaining to the Yes
class and 2,279 (50%) instances pertaining to the No class, for a total of
4,573 instances. In the test dataset, the variable Churn has 95 (15%) instances
belonging to the Yes class and 571 (85%) instances belonging to the No class.
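
Random oversampling of the minority class can be sketched as follows. The function name and the tuple representation of an instance are assumptions, and this version balances the two classes exactly, whereas the counts reported above are only approximately equal.

```python
import random

def oversample_minority(instances, label_of, minority_label, seed=0):
    """Randomly duplicate minority-class instances until both classes have equal size."""
    rng = random.Random(seed)
    minority = [x for x in instances if label_of(x) == minority_label]
    majority = [x for x in instances if label_of(x) != minority_label]
    clones = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    balanced = majority + minority + clones
    rng.shuffle(balanced)
    return balanced

# Hypothetical training set with the class counts reported above: 388 Yes, 2,279 No.
training = [("Yes", i) for i in range(388)] + [("No", i) for i in range(2279)]
balanced = oversample_minority(training, label_of=lambda row: row[0],
                               minority_label="Yes")

counts = {label: sum(1 for row in balanced if row[0] == label)
          for label in ("Yes", "No")}
print(counts)
```

Only the training dataset is oversampled; the test dataset keeps its original class distribution so that the estimated generalization error reflects realistic data.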

2.3. Conclusions

This chapter describes the PCA algorithm and how it is applied to the
dataset [10]. By applying this algorithm, 11 principal components are obtained
which explain the entire variability in the dataset. In other words, no
information is lost by reducing the dimensionality of the dataset. The 11
principal components selected contain the same information as the original
dataset, while any collinearity between the independent variables is
avoided [16].

Consequently, the final dataset to which the machine learning algorithm is
applied in Chapter 3 consists of these 11 principal components, the two nominal
independent variables International Plan and Voice Mail Plan, which were
excluded from the principal component analysis, and the dependent variable
Churn.

This dimensionality reduction method that also avoids the collinearity
between the independent variables [10] can be applied to any dataset within any
field of activity as long as there is a classification problem to be solved.

In the last part of this chapter, the distribution of the dependent
variable Churn is balanced in order to ensure optimal training of the machine
learning algorithm used for modeling in the next chapter.

This chapter covers the data understanding and the data preparation
phases of the CRISP-DM methodology.


3. MODELING USING NEURAL NETWORKS

In this chapter, an introduction to neural networks is first presented,
followed by the theoretical foundations of the multilayer perceptron and the
backpropagation learning algorithm. Various research papers from the literature
are then discussed regarding techniques for optimizing the structure of neural
networks and for accelerating the learning process. The simulated annealing
algorithm is introduced for weight initialization. The second part of this
chapter presents and explains the architecture used to implement the
predictive model for identifying customers who are going to churn [18], [19],
[11]. The conclusions regarding the architecture used are presented at the end of
the chapter.

3.1. Introduction

Neural networks are a very popular machine learning algorithm based on
modern mathematical concepts and inspired by biological neural networks. This
algorithm mimics the human brain, which consists of different types of neurons,
each neuron connecting to several synapses. The ability of the human brain to
perceive and memorize new information through a learning process has
motivated researchers to develop artificial systems that are capable of
performing certain functions based on a learning process [36].

In 1943, McCulloch and Pitts ascertained that a neuron can be modeled
as a single-threshold device that executes a logical function [98]. In 1949, the
Hebbian rule was proposed, which described the influence the learning process
has on the synapses between neurons [64]. In 1952, Hodgkin and Huxley
obtained precise results by integrating the neural aspect of the brain into a set of
equations based on the structure of the neurons and the electrical signals
between them [66]. At the end of the 1950s and the beginning of the 1960s,
Rosenblatt proposed the perceptron [116], and Widrow and Hoff advanced the
adaline model and the training algorithm based on the least mean squares
technique [141].

In 1969, Minsky and Papert demonstrated rigorously that the perceptron is
not practical when complex logical functions are to be modeled, thereby
considerably reducing researchers' interest in neural networks [101].
Concurrently, the adaline and madaline models solved multiple problems from
various domains, but because of their linear activation function, problems that
are not linearly separable remained unsolvable.

In 1982, the Hopfield model marked the beginning of the current period
of neural networks research [67]. This model does not operate at a neuron level,
but at a system level based on the Hebbian rule, and functions as a recurrent
neural network. This type of neural network is useful for solving different
optimization problems. In 1985, as an extension of the Hopfield neural
network, the Boltzmann machine was proposed, whose learning algorithm is
based on the simulated annealing method [81] and which includes stochastic
neurons [1]. In 1988, the Hopfield model was developed further and the cellular
neural networks were proposed [31].

In 1986, Rumelhart, Hinton, and Williams achieved the most significant
milestone in neural networks research by proposing the backpropagation
learning algorithm for the MLP neural network [119]. It was soon revealed
that this backpropagation learning algorithm had actually been invented by
Werbos in 1974 [138]. In 1988, the RBF neural network model was proposed [20].
The MLP and RBF neural networks are universal approximators.

The neural network model based on the principal component analysis was
proposed in 1982 [107], and the one based on the independent component analysis
(ICA) in 1994 [33]. The ICA algorithm is a generalization of the PCA algorithm
and is typically used for feature extraction. In the subsequent years, multiple
research papers proposed further neural network models, based on techniques
such as factor analysis, canonical correlation analysis (CCA), and linear
discriminant analysis (LDA).

In 1985, the Bayesian networks model was proposed [108], which is a
popular graphical model in the machine learning community. This model is a
representation formalism and belongs to a general class of models called
probabilistic graphical models. Another milestone in the machine learning
research is the support vector machine algorithm proposed by Vapnik et al.,
which implements principles from statistical learning theory [134].

A neural network is defined by its architecture, by the properties of its
neurons, and by the learning rules used [36].

The architecture of a neural network is characterized by the weight vector
w = (wij), where wij denotes the weight of the connection between neurons i and
j. If wij = 0, neurons i and j are not connected. By setting some of the weights
wij equal to zero, different neural network topologies can be obtained.
From the architectural point of view, neural networks can be categorized into
feedforward, recurrent, and hybrid neural networks [36].

In feedforward neural networks, the connections between neurons are
oriented in only one direction. Typically, a feedforward neural network has
multiple layers in which the neurons in the same layer are not connected and
the layers are not connected by feedback connections. In the MLP and RBF neural
networks, every neuron in each layer is connected to every neuron in the next
layer, so both are fully connected neural networks [36].

In contrast, the layers in a Hopfield recurrent neural network or a
Boltzmann machine have at least one feedback connection. In a cellular neural
network, also called a cellular nonlinear network, the neurons are mutually
connected only to the neurons in their immediate vicinity. The neurons in a
cellular neural network receive their own signals from nearby neurons [31].

Neural networks operate in two phases: first the learning process takes
place and then the generalization process. The training process consists of
parsing the training dataset and adapting the parameters of the network by using
an online or an offline learning algorithm. Once the learning process is over, the
network will be able to mimic the nonlinear relationships existing between the
independent variables and the dependent variable.

The training process represents the fundamental capacity of neural
networks. The training rules are algorithms used to determine the appropriate
weights w and other parameters of the network. Looking from a different
perspective, the training process can be regarded as a nonlinear optimization
problem that seeks to determine the parameters of the network that minimize a
cost function for the training dataset [36].

Neural network training is performed in epochs: during one epoch, the
entire input dataset is passed through the network exactly once and processed by
the learning algorithm. At the end of the learning
process, the neural network has captured complex relationships between the input
variables and is capable of generalizing to new unseen data. The training process
can be controlled by defining a convergence criterion.

3.2. Perceptron

In a neural network, the neuron represents a node and is considered the
base unit. A neuron processes the information received from other neurons and
generates its result according to an activation function. This linear or
nonlinear function projects the input data through the network to the output, and
is denoted by ϕ(). The synapses of the variables are modeled by weights. The
perceptron, also referred to as the McCulloch-Pitts neuron, was invented in
1957 by Frank Rosenblatt and represents the simplest neural network model,
consisting of a single neuron carrying out a weighted sum of its inputs followed
by an activation function. Rosenblatt used a single-layer perceptron to classify
linearly separable patterns [115].

In the case of a perceptron with a single neuron, the network topology is
shown in Figure 3.1, and the input and the output of the neuron are given by
equations (3.1) and (3.2):

$$ u = \sum_{i=1}^{m} w_i x_i - \theta = \mathbf{w}^\top \mathbf{x} - \theta \tag{3.1} $$

$$ \hat{y} = \phi(u) \tag{3.2} $$

where xi represents the ith independent variable and x = (x1, x2, ..., xm)⊤ , wi
represents the weight of the ith independent variable and w = (w1, w2, ..., wm)⊤, θ
is the threshold or bias, and m is the number of independent variables. ϕ() is a
continuous or discontinuous activation function and projects the real numbers in
the ( − 1, 1) range.

Figure 3.1. Mathematical Model of a Perceptron [111].

The perceptron with a single neuron using the threshold activation
function is capable of classifying the input vector x into two classes. The decision
boundary is given by the hyperplane:

w⊤x − θ = 0 (3.3)

where the parameter θ helps to move this hyperplane from the origin.

The most frequently used activation functions [36] are the threshold
function (3.4), the sigmoid function (3.5), and the hyperbolic tangent function
(3.6):

$$ \phi(x) = \begin{cases} 1, & x \geq 0 \\ -1 \text{ or } 0, & x < 0 \end{cases} \tag{3.4} $$

$$ \phi(x) = \frac{1}{1 + e^{-\beta x}} \tag{3.5} $$

$$ \phi(x) = \tanh(\beta x) \tag{3.6} $$

The parameter β in these activation functions represents a gain and
controls the slope of the activation function. Figure 3.2 shows the graph of each
of these activation functions.
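
As an illustration of equations (3.4) to (3.6), the following Python sketch implements the three activation functions and estimates their slope at the origin for several values of the gain β. The function names and the finite-difference helper `slope_at_zero` are assumptions made for this example and do not come from the original text.

```python
import math

def threshold(x, low=-1.0):
    """Threshold function (3.4); `low` selects the -1 or the 0 convention."""
    return 1.0 if x >= 0 else low

def logistic(x, beta=1.0):
    """Sigmoid (logistic) function (3.5) with gain beta."""
    return 1.0 / (1.0 + math.exp(-beta * x))

def tanh_act(x, beta=1.0):
    """Hyperbolic tangent function (3.6) with gain beta."""
    return math.tanh(beta * x)

def slope_at_zero(phi, beta, h=1e-5):
    """Central-difference estimate of the slope of phi at x = 0 (helper for this sketch)."""
    return (phi(h, beta) - phi(-h, beta)) / (2 * h)

# Print the gain next to the estimated slopes of the logistic and tanh functions.
for beta in (0.5, 1.0, 2.0):
    print(beta, slope_at_zero(logistic, beta), slope_at_zero(tanh_act, beta))
```

At the origin, the slope of the logistic function is β/4 and the slope of the hyperbolic tangent is β, so a larger gain yields a steeper transition.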

All these activation functions are monotonically increasing, with values
bounded within [−1, 1]. Since the sigmoid functions satisfy
$\lim_{x \to -\infty} \phi(x) = 0$ and $\lim_{x \to +\infty} \phi(x) = 1$, many
monotonically increasing functions satisfy these
conditions and can therefore be considered sigmoidal. In [41], the author has
presented other similar activation functions.

[Figure with three panels (threshold, logistic, tanh), each plotting ϕ(x) against x for x between −2 and 2]
Figure 3.2. Sigmoid Activation Functions [36].


If multiple neurons are in the hidden layer and the threshold activation
function is being used, they form what is called a single-layer perceptron, as
illustrated in Figure 3.3. Thus, this type of perceptron is capable of classifying the
input vector x into multiple classes.

Figure 3.3. Architecture of a Single-Layer Perceptron.

The output of this model is given by:

$$ \mathbf{u} = \mathbf{w}^\top \mathbf{x} - \boldsymbol{\theta} \tag{3.7} $$

$$ \hat{\mathbf{y}} = \boldsymbol{\phi}(\mathbf{u}) \tag{3.8} $$

where u = (u1, u2, ..., uo)⊤ and o is the number of neurons in the hidden layer,
θ = (θ1, θ2, ..., θo)⊤ represents the bias vector in the hidden layer, ϕ(u) = (ϕ1(u1),
ϕ2(u2), ..., ϕo(uo))⊤ represents the vector of the activation functions of the
neurons in the hidden layer, and ŷ = (ŷ1, ŷ2, ..., ŷo) represents the output vector.

Computing the weights of a perceptron with a sigmoid activation function
that seeks to minimize the quadratic training error is considered an NP-hard
problem [123]. Based on the training algorithm proposed by Rosenblatt for the
perceptron, the weight vector w is adapted by minimizing the error [115], [116].

In [116], Rosenblatt has demonstrated the perceptron convergence
theorem for classification problems. This theorem states that for linearly
separable data, the weights will eventually converge to a fixed point in a finite
number of updates. In 1996, this theorem was made applicable to the MLP, and
indicated that the BP algorithm with a learning rate without an upper limit will
converge to an optimal solution [57].

This convergence theorem can be demonstrated by minimizing the
criterion function of the perceptron by applying the gradient descent
optimization method [36]:

$$ E(\mathbf{w}) = -\sum_{\mathbf{x} \in \bar{X}} \mathbf{w}^\top \mathbf{x} \tag{3.9} $$

where X̄ is the set of instances misclassified by the weight vector w. Therefore,
the weights are updated so that the number of misclassified instances decreases.
The convergence theorem can be extended to a single-layer perceptron by
adapting the training algorithm used in the perceptron to work with multiple
neurons.

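The perceptron training algorithm is given by equations (3.10) to (3.13):

$$ u_{n,j} = \sum_{i=1}^{m} w_{ij} x_{n,i} - \theta_j = \mathbf{w}_j^\top \mathbf{x}_n - \theta_j \tag{3.10} $$

$$ \hat{y}_{n,j} = \begin{cases} 1, & u_{n,j} > 0 \\ -1, & \text{otherwise} \end{cases} \tag{3.11} $$

$$ e_{n,j} = y_{n,j} - \hat{y}_{n,j} \tag{3.12} $$

$$ w_{ij}(t) = w_{ij}(t-1) + \eta \, x_{n,i} \, e_{n,j} \tag{3.13} $$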

for i = 1, ..., m, j = 1, ..., o, where un,j is the input of neuron j for instance n,
wj = (w1j, w2j, ..., wmj)⊤ represents the vector of the updated weights at neuron j,
xn,i is the input i of instance n, θj is the threshold of neuron j, yn,j and ŷn,j are the
actual and the expected outputs of neuron j for instance n with a value equal to
−1 or 1 representing the class membership, and η is the learning rate, a
sufficiently small positive number, a typical value of this parameter η being 0.5.
The stability of the training process is not influenced by the selection of this
parameter η, only the convergence speed is affected. It is important to note that
the weights wij are randomly initialized. Finally, when the errors are small
enough the training process stops.
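
The training procedure of equations (3.10) to (3.13) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `train_perceptron` and `predict`, the initialization interval, and the stopping rule (no misclassified instance in an epoch) are choices made for this example. The text does not give an update rule for the threshold θj, so the sketch assumes it is treated as a weight attached to a constant input of −1.

```python
import random

def train_perceptron(X, Y, eta=0.5, max_epochs=100, seed=0):
    """Train a single-layer perceptron with o threshold neurons.

    X: list of input vectors (length m); Y: list of target vectors
    (length o, entries -1 or 1). Returns weights w[i][j] and thresholds theta[j].
    """
    rng = random.Random(seed)
    m, o = len(X[0]), len(Y[0])
    w = [[rng.uniform(-0.5, 0.5) for _ in range(o)] for _ in range(m)]  # random initialization
    theta = [rng.uniform(-0.5, 0.5) for _ in range(o)]
    for _ in range(max_epochs):
        n_errors = 0
        for x, y in zip(X, Y):
            for j in range(o):
                u = sum(w[i][j] * x[i] for i in range(m)) - theta[j]  # (3.10)
                y_hat = 1 if u > 0 else -1                            # (3.11)
                e = y[j] - y_hat                                      # (3.12)
                if e != 0:
                    n_errors += 1
                    for i in range(m):
                        w[i][j] += eta * x[i] * e                     # (3.13)
                    # Assumption: theta acts as a weight on a constant input of -1.
                    theta[j] -= eta * e
        if n_errors == 0:  # stop once every instance is classified correctly
            break
    return w, theta

def predict(w, theta, x):
    """Outputs of all neurons for input x, using (3.10) and (3.11)."""
    return [1 if sum(w[i][j] * x[i] for i in range(len(x))) - theta[j] > 0 else -1
            for j in range(len(theta))]

# Logical AND with targets in {-1, 1}: a linearly separable toy problem.
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
Y = [[-1], [-1], [-1], [1]]
w, theta = train_perceptron(X, Y)
print([predict(w, theta, x) for x in X])
```

Because the AND problem is linearly separable, the loop stops after a finite number of updates, in line with the perceptron convergence theorem.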

For classification problems, the perceptron training algorithm functions
only in the case of linearly separable data, and cannot converge in the case of
linearly inseparable data. This training method proposed by Rosenblatt is not
able to converge when trying to model linearly inseparable data because it
cannot find the minimum of the error function [57].

3.3. Multilayer Perceptron

The multilayer perceptron neural networks are feedforward networks with
one or more hidden layers of neurons between the input and the output layers.
The neurons in the output layer represent a hyperplane in the input feature
space.

Figure 3.4 shows the architecture of an MLP. To simplify this illustration,
it can be assumed that the symbol of the activation function presented in Figure
3.1 is included in the symbol of the summing block in Figure 3.4 and that there
are L layers, each having $k_l$ neurons, l = 1, ..., L. The vector $\mathbf{w}_i^{(l-1)}$ represents
the weights from layer (l − 1) to layer l of neuron i. The bias, the output, and the
activation function of neuron i from layer l are denoted by $\theta_i^{(l)}$, $o_i^{(l)}$, and $\phi_i^{(l)}()$,
respectively. An MLP trained with the BP algorithm, also known as a BP
network, can classify linearly inseparable features [36].

From Figure 3.4, the following relations for instance n and l = 1, ..., L
are obtained:

$$ \hat{\mathbf{y}}_n = \mathbf{o}_n^{(L)}, \quad \mathbf{o}_n^{(0)} = \mathbf{x}_n \tag{3.14} $$

$$ \mathbf{u}_n^{(l)} = \mathbf{w}_n^{(l-1)} \mathbf{o}_n^{(l-1)} + \boldsymbol{\theta}^{(l)} \tag{3.15} $$

$$ \mathbf{o}_n^{(l)} = \boldsymbol{\phi}^{(l)}(\mathbf{u}_n^{(l)}) \tag{3.16} $$

where $\mathbf{u}_n^{(l)} = (u_{n,1}^{(l)}, u_{n,2}^{(l)}, \ldots, u_{n,k_l}^{(l)})^\top$,
$\mathbf{w}_n^{(l-1)} = (w_{n,1}^{(l-1)}, w_{n,2}^{(l-1)}, \ldots, w_{n,k_{l-1}}^{(l-1)})^\top$ represents the
weight vector, $\mathbf{o}_n^{(l-1)} = (o_{n,1}^{(l-1)}, o_{n,2}^{(l-1)}, \ldots, o_{n,k_{l-1}}^{(l-1)})$ represents the output vector,
$\boldsymbol{\theta}^{(l)} = (\theta_1^{(l)}, \theta_2^{(l)}, \ldots, \theta_{k_l}^{(l)})^\top$ represents the
bias vector, and $\boldsymbol{\phi}^{(l)}()$ applies the function
$\phi_i^{(l)}()$ to the ith component of the vector.

Figure 3.4. Architecture of an MLP.


Generally, all the activation functions ϕi(l)() in a neural network are
selected to be the same, but in some instances the first (L − 1) layers can use, for
example, the same sigmoid function, and the last layer L can use a different
continuous and differentiable function.
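
The forward relations (3.14) to (3.16) can be illustrated with a short Python sketch. The data layout (`weights[l][i][j]` for the weight from neuron i of one layer to neuron j of the next), the function name `forward`, and the example 2-3-1 network with a linear output layer are assumptions chosen for this illustration.

```python
import math

def logistic(u, beta=1.0):
    """Sigmoid activation with gain beta."""
    return 1.0 / (1.0 + math.exp(-beta * u))

def forward(x, weights, biases, activations):
    """Forward pass through an MLP.

    weights[l][i][j]: weight from neuron i in one layer to neuron j in the next;
    biases[l][j]: bias of neuron j; activations[l]: scalar activation of that layer.
    """
    o = list(x)                                           # o^(0) = x            (3.14)
    for W, theta, phi in zip(weights, biases, activations):
        u = [sum(W[i][j] * o[i] for i in range(len(o))) + theta[j]
             for j in range(len(theta))]                  # weighted sum + bias  (3.15)
        o = [phi(u_j) for u_j in u]                       # activation           (3.16)
    return o                                              # y_hat = o^(L)        (3.14)

# A 2-3-1 network: logistic hidden layer, identity (linear) output layer.
weights = [[[0.2, -0.4, 0.1], [0.7, 0.3, -0.5]], [[0.6], [-0.1], [0.8]]]
biases = [[0.0, 0.1, -0.2], [0.05]]
activations = [logistic, lambda u: u]
print(forward([1.0, 2.0], weights, biases, activations))
```

Passing one activation per layer mirrors the remark above that the last layer may use a different continuous and differentiable function than the first (L − 1) layers.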

3.4. Backpropagation Algorithm

The BP learning algorithm is the most popular learning rule for
performing supervised learning tasks and is used to train feedforward neural
networks, such as an MLP [119], [138]. Approximately 95% of the existing
applications based on neural networks use the MLP neural network with the
backpropagation learning algorithm [145].

The first publication related to this algorithm appeared in 1963 [21], being
described as a dynamic multi-step optimization method. In the following years,
through the work of Werbos [138] and of Rumelhart, Hinton, and Williams
[119], it became recognized in the field of artificial neural networks. The BP
algorithm is a generalization of the delta rule and is therefore also known as the
generalized delta rule, a name introduced by Rumelhart and McClelland in
[117]. By employing a gradient search technique, the BP algorithm seeks to
minimize a cost function equivalent to the mean squared error (MSE) between
the expected and the current outputs of the network. As such, the MLP neural
networks can be extended to multiple layers.

The BP algorithm propagates the error between the expected and the
actual outputs of the network back through the network. After an input feature is
presented to the network, the generated output is compared to a known output
feature and the error is computed for each output neuron. These errors are then
propagated backward through the network, thus forming a closed-loop control system.
The weights can be adjusted by using a gradient descent algorithm [36].

A continuous, nonlinear, monotonically increasing, and differentiable
activation function is required to successfully implement the BP algorithm. Two
such functions that are normally used are the sigmoid and the hyperbolic tangent
functions [36]. The BP algorithm for an MLP network is described below.
It can be derived analogously for other neural network models.

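For the optimization, the MSE objective function is defined between the
expected ŷn and the current yn network outputs for all feature pairs (xn, yn) ∈ S
from the training dataset:

$$ E = \frac{1}{n} \sum_{n \in S} E_n = \frac{1}{2n} \sum_{n \in S} \left\| \hat{\mathbf{y}}_n - \mathbf{y}_n \right\|^2 \tag{3.17} $$

where n, in the factors in front of the sums, represents the number of instances in the training dataset, and:

$$ E_n = \frac{1}{2} \left\| \hat{\mathbf{y}}_n - \mathbf{y}_n \right\|^2 = \frac{1}{2} \mathbf{e}_n^\top \mathbf{e}_n \tag{3.18} $$

$$ \mathbf{e}_n = \hat{\mathbf{y}}_n - \mathbf{y}_n \tag{3.19} $$

where $e_{n,i} = \hat{y}_{n,i} - y_{n,i}$ is the ith element of $\mathbf{e}_n$.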

The network parameters $\mathbf{w}_n^{(l-1)}$ and $\boldsymbol{\theta}^{(l-1)}$, l = 1, ..., L, can be represented
by using the notation w = (wij). By applying the gradient descent algorithm, the
error function, E or En, can be minimized using [36]:

$$ \Delta_n \mathbf{w} = -\eta \frac{\partial E_n}{\partial \mathbf{w}} \tag{3.20} $$

where η is a small positive number, called the learning rate.

The delta function is defined in equation (3.21), for l = 1, ..., L and
j = 1, ..., $k_l$:

$$ \delta_{n,j}^{(l)} = -\frac{\partial E_n}{\partial u_{n,j}^{(l)}} \tag{3.21} $$

By applying the chain rule to calculate the derivative in equation (3.20),
equation (3.22) is obtained for the output neurons (l = L) and equation (3.23)
for the hidden neurons (l = 1, ..., L − 1):

$$ \delta_{n,j}^{(L)} = -e_{n,j} \, \dot{\phi}_j^{(L)}(u_{n,j}^{(L)}) \tag{3.22} $$

$$ \delta_{n,j}^{(l+1)} = \dot{\phi}_j^{(l+1)}(u_{n,j}^{(l+1)}) \sum_{p=1}^{k_{l+2}} \delta_{n,p}^{(l+2)} w_{jp}^{(l+1)} \tag{3.23} $$

Equations (3.22) and (3.23) offer a recursive solution for $\delta_{n,j}^{(l+1)}$ for the
entire neural network. Consequently, the vector w can be adjusted using:

$$ \frac{\partial E_n}{\partial w_{ij}^{(l)}} = -\delta_{n,j}^{(l+1)} o_{n,i}^{(l)} \tag{3.24} $$

The first-order derivatives for the sigmoid and the hyperbolic tangent
activation functions are given by equation (3.25) and equation (3.26),
respectively:

$$ \dot{\phi}(u) = \beta \phi(u) (1 - \phi(u)) \tag{3.25} $$

$$ \dot{\phi}(u) = \beta (1 - \phi^2(u)) \tag{3.26} $$

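The bias vector $\boldsymbol{\theta}^{(l+1)}$ from layer (l + 1) can be updated using a gradient
descent algorithm or by expanding the weight vector $\mathbf{w}^{(l)}$, namely:

$$ \boldsymbol{\theta}^{(l+1)} = (w_{0,1}^{(l)}, w_{0,2}^{(l)}, \ldots, w_{0,k_{l+1}}^{(l)}) \tag{3.27} $$

Accordingly, the output $\mathbf{o}^{(l)}$ is expanded to:

$$ \mathbf{o}^{(l)} = (1, o_1^{(l)}, \ldots, o_{k_l}^{(l)}) \tag{3.28} $$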
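
The BP derivation can be checked numerically with a small Python sketch. It assumes logistic activations with gain β = 1 in every layer, a `Ws[l][i][j]` weight layout, and biases added to the weighted sum; the function names and the finite-difference check `numeric_grad` are illustrative additions, not part of the original text.

```python
import math

def logistic(u):
    return 1.0 / (1.0 + math.exp(-u))

def forward(x, Ws, thetas):
    """Return the outputs o^(0), ..., o^(L) of every layer (logistic units, beta = 1)."""
    outs = [list(x)]
    for W, th in zip(Ws, thetas):
        o = outs[-1]
        u = [sum(W[i][j] * o[i] for i in range(len(o))) + th[j] for j in range(len(th))]
        outs.append([logistic(v) for v in u])
    return outs

def error(x, y, Ws, thetas):
    """E_n = 1/2 * ||y_hat - y||^2, equation (3.18)."""
    y_hat = forward(x, Ws, thetas)[-1]
    return 0.5 * sum((a - b) ** 2 for a, b in zip(y_hat, y))

def backprop(x, y, Ws, thetas):
    """Gradients of E_n with respect to all weights and biases."""
    outs = forward(x, Ws, thetas)
    L = len(Ws)
    # Output layer (3.22): delta = -e * phi'(u), with phi'(u) = o * (1 - o) from (3.25).
    delta = [-(o - t) * o * (1.0 - o) for o, t in zip(outs[L], y)]
    grad_W, grad_theta = [None] * L, [None] * L
    for l in range(L - 1, -1, -1):
        o_prev = outs[l]
        # (3.24): dE_n/dw_ij = -delta_j * o_i; the bias acts like a weight on o_0 = 1.
        grad_W[l] = [[-d * o_i for d in delta] for o_i in o_prev]
        grad_theta[l] = [-d for d in delta]
        if l > 0:
            # (3.23): propagate the deltas one layer back.
            delta = [o_i * (1.0 - o_i) * sum(delta[j] * Ws[l][i][j] for j in range(len(delta)))
                     for i, o_i in enumerate(o_prev)]
    return grad_W, grad_theta

def numeric_grad(x, y, Ws, thetas, l, i, j, eps=1e-6):
    """Central-difference estimate of dE_n/dw_ij in layer l (used only as a check)."""
    old = Ws[l][i][j]
    Ws[l][i][j] = old + eps
    e_plus = error(x, y, Ws, thetas)
    Ws[l][i][j] = old - eps
    e_minus = error(x, y, Ws, thetas)
    Ws[l][i][j] = old
    return (e_plus - e_minus) / (2 * eps)

Ws = [[[0.1, -0.2], [0.4, 0.3]], [[0.5], [-0.6]]]
thetas = [[0.0, 0.1], [0.2]]
x, y = [1.0, 0.5], [1.0]
grad_W, grad_theta = backprop(x, y, Ws, thetas)
print(grad_W[0][0][1], numeric_grad(x, y, Ws, thetas, 0, 0, 1))
```

The two printed values should agree closely, which is a convenient sanity check whenever the recursion is implemented by hand.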

If 0 < η < 2/λmax, where λmax is the largest eigenvalue of the autocorrelation
matrix R of the vector x, the BP algorithm is convergent in the
mean [142]. If η is too small, the likelihood that the search becomes
trapped in a local minimum increases. Conversely, if η is too large, the
likelihood of falling into oscillatory traps increases. By
preprocessing the independent variables to remove the collinearity between
them, large eigenvalues of R can be avoided, and by increasing the
parameter η the convergence can be accelerated. The speed of the BP algorithm
is often improved when the PCA algorithm is applied as a preprocessing step,
unless the independent variables are uncorrelated or consist of sparse vectors. In
general, the learning rate η is chosen in the range 0 < η < 1, ensuring that the
minimum of the error surface is not overshot by consecutive changes in weight.

The BP algorithm can be improved by applying the momentum method,
which involves adding a momentum term α, with a value in the range 0 < α ≤ 1, to
adjust the weights. This term represents the inertia and is proportional to the last
change in the weight, i.e. the value of α influences the current weight adjustment
ΔW(t) to move in the same direction as the previous adjustment ΔW(t − 1).
Thus, the current weight adjustment for updating the weights can be calculated
using equation (3.29) [119]:

$$ \Delta_n W(t) = -\eta \frac{\partial E_n(t)}{\partial W(t)} + \alpha \, \Delta W(t-1) \tag{3.29} $$

A typical value of the momentum α is 0.9 [36]. This parameter can
efficiently amplify the descent in areas of the error surface which are almost flat
by a factor of 1/(1 − α) . In the opposite case, if there are many areas with
increased fluctuations, the momentum term α has a stabilization effect,
flattening the oscillations and accelerating the convergence. In [136], the author
has analyzed the BP algorithm with momentum and has presented the
convergence criteria.
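
The amplification factor 1/(1 − α) on flat regions of the error surface can be illustrated with a short Python sketch. It assumes a constant gradient, which is an idealization of an almost flat area, and the function name `momentum_step_size` is hypothetical.

```python
def momentum_step_size(grad, eta=0.1, alpha=0.9, n_steps=200):
    """Iterate the momentum update (3.29) with a constant gradient.

    Returns the weight adjustment after n_steps iterations.
    """
    delta = 0.0
    for _ in range(n_steps):
        delta = -eta * grad + alpha * delta  # gradient term plus inertia term
    return delta

plain_step = -0.1 * 1.0                      # update without momentum
with_momentum = momentum_step_size(1.0)      # update with alpha = 0.9
print(plain_step, with_momentum, with_momentum / plain_step)
```

With α = 0.9 the adjustment settles at roughly ten times the plain gradient step, which matches the factor 1/(1 − α) = 10.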

The BP algorithm and the BP algorithm with momentum
have a similar complexity, but the latter is superior regarding the convergence
speed and the ability to avoid local minima, and is more robust to weight
initialization, especially if the training parameters have high values.

Considering that the BP algorithm is based on the gradient descent
algorithm, it is susceptible to being trapped in local minima in the cost function.
However, its performance can be increased and the number of local minima
decreased, if more hidden neurons are inserted, if the gain term is reduced, and
by using different weights initialization techniques.

During training, all the instances are randomly and repeatedly presented
to the network through epochs until the convergence criteria are met. As
previously mentioned, when the objective function of the optimization problem is En,
the weights are updated after each feature is presented to the network.
This learning process is known as online learning, incremental learning, or
feature learning. When the objective is to optimize the average error E, the
offline learning, non-incremental learning, or batch learning algorithm is used,
and the weights are updated only after all the training features are presented to
the network.

3.5. Online and Offline Learning

Online learning is a stochastic optimization technique, during which the
training features are presented to the network in sequence. For each instance in
the training dataset, the weights are updated using the gradient descent
algorithm. When $\eta_{on}$ is small enough, the BP learning algorithm minimizes the
global error E [119].

$$ \Delta_n w_{ij}^{(l)} = -\eta_{on} \frac{\partial E_n}{\partial w_{ij}^{(l)}} \tag{3.30} $$

Offline learning is a deterministic optimization technique, in which the
objective function of the optimization problem is the error E and the weights are
updated at the end of an epoch [119]. Before adjusting the weights, the
increments are summed for each instance in the training dataset, as shown
below:

$$ \Delta w_{ij}^{(l)} = -\eta_{off} \frac{\partial E}{\partial w_{ij}^{(l)}} = \sum_{n} \Delta_n w_{ij}^{(l)} \tag{3.31} $$

If the learning rates are fairly low, the online method closely approximates
the offline method, and both yield similar outcomes [47].
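
The relationship between the two learning modes can be illustrated with a Python sketch for a single linear neuron. The toy dataset, the linear model, and the function names are assumptions made for this example; the offline variant sums the per-instance increments before applying them, as in equation (3.31).

```python
def grad_n(w, x, y):
    """Gradient of E_n = 1/2 * (y_hat - y)^2 for a linear neuron y_hat = w . x."""
    e = sum(wi * xi for wi, xi in zip(w, x)) - y
    return [e * xi for xi in x]

def online_epoch(w, data, eta):
    """Update the weights after every instance, as in (3.30)."""
    w = list(w)
    for x, y in data:
        g = grad_n(w, x, y)
        w = [wi - eta * gi for wi, gi in zip(w, g)]
    return w

def offline_epoch(w, data, eta):
    """Sum the per-instance increments and apply them once per epoch, as in (3.31)."""
    total = [0.0] * len(w)
    for x, y in data:
        g = grad_n(w, x, y)  # all gradients are evaluated at the same weights
        total = [ti - eta * gi for ti, gi in zip(total, g)]
    return [wi + ti for wi, ti in zip(w, total)]

def max_gap(eta, data, w0=(0.0, 0.0)):
    """Largest componentwise difference between the two modes after one epoch."""
    a, b = online_epoch(w0, data, eta), offline_epoch(w0, data, eta)
    return max(abs(ai - bi) for ai, bi in zip(a, b))

data = [([1.0, 0.5], 1.0), ([0.5, 1.0], -1.0), ([1.0, 1.0], 0.5)]
print(max_gap(0.001, data), max_gap(0.5, data))
```

For the small learning rate the two modes end the epoch at nearly the same weights, while the larger rate produces a visible gap.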

Online learning is more effective when the training dataset is not fully
available or is high-dimensional, because offline learning requires additional
storage. Also, online learning tends to be faster than offline
learning and to yield at least the same accuracy, specifically for
high-dimensional training datasets [143].

If the learning rates are fairly low, the random character of online learning
allows the search space to be explored more widely, helping to avoid local minima
[32]. During online learning, the error surface is followed more closely,
enabling the use of higher learning rates and, implicitly, an accelerated
convergence due to the reduced number of iterations. For high-dimensional
training data, offline learning is frequently unfeasible due to very low values of
the parameter $\eta_{off}$, while in online learning, the parameter $\eta_{on}$ can have higher
values and can therefore speed up the training process.

3.6. Output Layer Activation Function

Normally, in an MLP neural network all the neurons use the same sigmoid
activation function, which constrains the network output to the (−1, 1) range.
In [18], [19], [103], [118], and [11], the authors have used the generalized
sigmoid activation function, also called softmax, in the output layer of an MLP
neural network for classification. By using the generalized sigmoid function, the
MLP model has a greater flexibility, because the result of each output neuron is
restricted by the results of all the output neurons. The output of the ith neuron is
given by equation (3.32):

$$ o_i^{(L)} = \phi(u_i^{(L)}) = \frac{e^{u_i^{(L)}}}{\sum_{j=1}^{k_L} e^{u_j^{(L)}}} \tag{3.32} $$

The generalized sigmoid and the sigmoid functions have derivatives of the same
form with respect to their own input, $\partial o_i / \partial u_i = o_i (1 - o_i)$.
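
A minimal Python sketch of equation (3.32) follows. Subtracting the maximum input before exponentiating is a common numerical safeguard added here as an assumption; it does not change the result. The helper `d_softmax_ii`, also hypothetical, estimates the derivative of each output with respect to its own input so that it can be compared with the sigmoid-style expression o(1 − o).

```python
import math

def softmax(u):
    """Generalized sigmoid (softmax) of equation (3.32)."""
    shift = max(u)  # subtracting the maximum avoids overflow and leaves the result unchanged
    exps = [math.exp(v - shift) for v in u]
    total = sum(exps)
    return [e / total for e in exps]

def d_softmax_ii(u, i, h=1e-6):
    """Central-difference estimate of d o_i / d u_i (helper for this sketch)."""
    up = list(u)
    up[i] += h
    down = list(u)
    down[i] -= h
    return (softmax(up)[i] - softmax(down)[i]) / (2 * h)

u = [1.0, 2.0, 3.0]
o = softmax(u)
print(o, sum(o))
print(d_softmax_ii(u, 0), o[0] * (1 - o[0]))
```

The outputs sum to one, which is what couples each output neuron to all the others.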

3.7. Network Structure Optimization

Generally, small neural networks with few parameters have a higher
generalization capacity. When training an MLP neural network,
the number of neurons in the hidden layers is not known in advance and is estimated
empirically. The network pruning and growing techniques are used to determine
the number of neurons in the hidden layers.

The network pruning technique starts with a network with many hidden
neurons and then gradually eliminates redundant neurons during training.
Pruning techniques fall into two categories, depending on the metric they
compute: those based on sensitivity and those based on regularization. The first
technique calculates the sensitivity
of the error function E when eliminating a weight or a neuron, and eliminates
the least significant one. The second technique, which is based on
regularization, adds a term to the error function E to constrain the network to
make effective decisions. The BP algorithm obtained from this new objective
function sets the insignificant weights to zero and eliminates them during
training. If the objective function contains a sensitivity term, then both
techniques yield the same result.

3.7.1. Network Pruning based on Sensitivity

When pruning a neural network, in order to measure the individual
contribution of a certain weight w or of a neuron in assisting the classification
task and to remove the redundant elements, the sensitivity of the error function
E is calculated using equation (3.33):

$$ S_w^E = \lim_{\Delta w \to 0} \frac{\Delta E / E}{\Delta w / w} = \frac{\partial \ln E}{\partial \ln w} = \frac{w}{E} \frac{\partial E}{\partial w} \tag{3.33} $$
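
The sensitivity in equation (3.33) can be approximated numerically, as in the Python sketch below. The function name `sensitivity`, the finite-difference step, and the two toy error functions are assumptions chosen for illustration and are not taken from the cited pruning methods.

```python
def sensitivity(E, w, dw=1e-6):
    """Approximate S = (w / E) * dE/dw from (3.33) with a central difference.

    E is a callable returning the (positive) error for a given weight value.
    """
    dE_dw = (E(w + dw) - E(w - dw)) / (2 * dw)
    return w * dE_dw / E(w)

def strong_dependence(w):
    """Toy error that depends strongly on w."""
    return w ** 2

def weak_dependence(w):
    """Toy error that barely depends on w."""
    return 3.0 + 0.01 * w

print(sensitivity(strong_dependence, 1.5), sensitivity(weak_dependence, 1.5))
```

A weight whose sensitivity is close to zero, like the one in `weak_dependence`, would be a candidate for elimination under a sensitivity-based pruning rule.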

Karnin has applied this pruning technique by calculating the sensitivity of each
connection during training and eliminating the connections with a
low sensitivity, without retraining [78]. This method has been improved by
adjusting some pruning rules to avoid eliminating an input neuron or an entire
hidden layer [56]. The authors have also proposed a fast algorithm to retrain the
neural network after eliminating a weight. Karnin's work has been further
developed and the relative local sensitivity index has been introduced for each
group of neurons or layer within the network [111].

A different pruning technique with retraining has been proposed in [124].
After the neural network has converged, the output of each hidden neuron is
analyzed and if a value does not vary too much for the entire training dataset,
the respective neuron is eliminated because it functions merely as a bias to all
subsequent neurons. Analogously, if the value of two hidden neurons is
proportional or identical for the entire training dataset, one of the neurons is
eliminated. The low value weights are considered insignificant and are
eliminated as well. Once the redundant neurons are eliminated, the neural
network is retrained.

In [75], the authors have proposed a pruning technique based on
sensitivity that employs linear models to identify the hidden neurons that can be
estimated as a linear combination of their input neurons. Once such a neuron is
identified, it is replaced in the following layers by biases and the weights past
that neuron are changed. After the redundant neurons are eliminated, retraining
is not required. In [27], an improved version of the linear
model from [75] has been proposed, which requires retraining the neural network after
eliminating the hidden neurons. Another pruning technique has been advanced
to eliminate hidden neurons and adjust the weights such that the overall
functionality of the neural network remains unchanged [24].

The singular value decomposition (SVD) has been proposed as a neural
network pruning technique in [77]. In [130], the authors have applied the SVD
orthogonal transform to evaluate the importance of adding more hidden neurons
in a feedforward network. In [149], practical calculations have been presented
regarding the sensitivity of the input neurons and how to eliminate redundant
entries. For instance, if the input neurons have few dimensions with a low
sensitivity compared to the rest, those dimensions are eliminated and the neural
network retrained.

Another method for pruning the input and hidden neurons of an MLP
neural network has been proposed using the mutual information [146]. The
significant input neurons are determined first by computing their relevance and
contribution to the output and the redundant input neurons are eliminated. The
next step determines based on similar measures the hidden neurons that are
significant to the neural network and eliminates the redundant hidden neurons.

3.7.2. Network Pruning based on Regularization

The optimization objective for the pruning technique based on
regularization is defined as:

$$ E_T = E + \lambda_c E_c \tag{3.34} $$

where E represents the error function, Ec represents a penalty for the complexity
of the network, and λc > 0 represents a regularization parameter. The penalty
term adds new local minima in the optimization process.

In the case of the weight decay technique, Ec is defined as a function of the
weights, namely, as the sum of the squared weights [65] or as the sum of the
absolute values of the weights [72]. The BP algorithm derived from the objective
function ET using a weight decay term is defined in equation (3.35):

$$ \Delta w_{ij}^{(l)} = -\eta \frac{\partial E_T}{\partial w_{ij}^{(l)}} = \Delta w_{ij,BP}^{(l)} - \varepsilon \frac{\partial E_c}{\partial w_{ij}^{(l)}} \tag{3.35} $$

where the change in weight that corresponds to the BP algorithm is given by
equation (3.36):

$$ \Delta w_{ij,BP}^{(l)} = -\eta \frac{\partial E}{\partial w_{ij}^{(l)}} \tag{3.36} $$

and ε = ηλc is the decay coefficient at each change in weight. Unless the BP
term reinforces them, the values of the weights decay toward
zero, and at the end of the training process only the significant weights will have
values different from zero. By eliminating the redundant weights, the
generalization improves while the likelihood of overfitting decreases. A neural
network trained using weight decay-based algorithms is not affected by the
initial configuration of the network. In [61], the authors have proposed a
different weight decay-based algorithm and obtained a neural network that is not
sensitive to noise.
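
The weight decay update of equation (3.35) can be sketched in Python as follows. The function name `weight_decay_step`, the `penalty` switch, and the default values of η and λc are assumptions made for this illustration; the two penalties correspond to the sum of squared weights and the sum of absolute weights mentioned above.

```python
def weight_decay_step(w, grad_E, eta=0.1, lam=0.01, penalty="l2"):
    """One update of (3.35): the BP change (3.36) minus the decay term.

    w: current weights; grad_E: dE/dw for each weight;
    penalty: "l2" for the sum of squared weights, "l1" for the sum of absolute values.
    """
    eps = eta * lam                         # decay coefficient, eps = eta * lambda_c
    new_w = []
    for wi, gi in zip(w, grad_E):
        if penalty == "l2":
            dEc = 2.0 * wi                  # derivative of w^2
        else:
            dEc = (wi > 0) - (wi < 0)       # sign(w), derivative of |w| away from zero
        new_w.append(wi - eta * gi - eps * dEc)
    return new_w

# With a zero BP gradient, only the decay term acts and the weights shrink toward zero.
print(weight_decay_step([1.0, -2.0], [0.0, 0.0]))
print(weight_decay_step([1.0, -2.0], [0.0, 0.0], penalty="l1"))
```

Weights that receive no support from the error gradient shrink at every step, which is the mechanism that leaves only the significant weights with values different from zero.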

3.7.3. Network Growing

An MLP neural network can also be trained by starting from a small
network and gradually adding hidden neurons until the preferred performance is
obtained. Using this method, the minimal neural network for our task can be
found. The network growing algorithms are less computationally intensive than
the previously described pruning algorithms. This approach can avoid a local
minimum through the addition of a new hidden neuron. For example, if the error
function E stops decreasing or decreases slowly, by adding a new hidden neuron
this function changes its shape and the local minimum is avoided.

The cascade-correlation algorithm is a popular network growing
algorithm and efficient from the computational and performance perspective
[45]. In this architecture, each added hidden neuron k is connected to the input
neurons and to every neuron that it precedes, and all the weights connected to it
and the output neurons are updated while the weights connected to the trained
neurons remain unchanged. When the training process begins, the neural
network has no hidden neurons and if after a predetermined number of cycles,
the task cannot be solved, a hidden neuron with random weights is added from a
set of candidate hidden neurons. In this neural network, the input neurons are
directly connected to the output neurons.

3.8. Accelerating the Learning Process

The BP learning algorithm is based on the gradient descent algorithm and
converges slowly. Numerous measures have been proposed to
improve the convergence speed of the BP algorithm. Preprocessing the training
dataset helps avoid the curse of dimensionality [7] and improves the neural network's
ability to generalize. This step is effective especially for large datasets.

The slow convergence is mainly attributable to the premature saturation of
the outputs of the sigmoid functions. The saturated neurons do not provide any
improvement to the connected weights, which leads to more training cycles
being required to properly adjust these weights. To avoid this situation, the slope of
the sigmoid function can be modified when it approaches zero, or the error
function E can be modified, so that the backpropagated error is finite.

3.8.1. Learning Parameters Adaptation

The efficiency of the BP algorithm and of the BP with momentum algorithm
relies on the appropriate selection of the learning rate η and momentum α. To
accelerate the convergence of these algorithms, some heuristics are required to
properly select the values of these parameters η and α. Generally, these
learning parameters are adjusted once for each epoch.

The global parameters η and α help adjust the weights between
neurons. The recommended value of parameter η is 1/λmax , where λmax is the
largest eigenvalue of the Hessian matrix H of the error function [85]. The author
has proposed in [85] an online algorithm to determine the value of λmax without
calculating the Hessian matrix.

A conventional method to accelerate the training process is to
initialize the parameter η with a large value and gradually decrease it as the
training progresses [18], [19], [11]. In [119], Rumelhart, Hinton, and Williams
have argued that this method is similar to the simulated annealing algorithm,
which avoids a local minimum at the beginning of the training process and
converges to a global minimum.

The bold-driver heuristic technique has been proposed to improve the
performance [6], [135]. The learning rate η should be increased if the error is
getting smaller because the training process is close to the minimum, and
decreased if the error is getting larger because the minimum was missed; in the
latter case, the weights are not updated. The learning rate is decreased
repeatedly until the error
decreases. In this scenario, the momentum term α can have a fixed value.
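
The bold-driver heuristic can be sketched in Python for a one-dimensional problem. The text does not specify by how much the learning rate is changed, so the increase factor `up`, the decrease factor `down`, the function name `bold_driver`, and the quadratic toy error are all assumptions made for this illustration.

```python
def bold_driver(E, grad, w, eta=0.1, up=1.1, down=0.5, n_iter=50):
    """Gradient descent with a bold-driver learning rate.

    A step that lowers the error is accepted and eta is increased;
    a step that raises the error is rejected and eta is decreased.
    """
    err = E(w)
    for _ in range(n_iter):
        w_new = w - eta * grad(w)
        err_new = E(w_new)
        if err_new < err:
            w, err = w_new, err_new   # accept the update
            eta *= up                 # error decreased: be bolder
        else:
            eta *= down               # error did not decrease: keep w, shrink eta
    return w, eta, err

def toy_error(w):
    return w ** 2

def toy_grad(w):
    return 2.0 * w

w_final, eta_final, err_final = bold_driver(toy_error, toy_grad, w=3.0)
print(w_final, eta_final, err_final)
```

Rejected steps leave the weight untouched, so the error tracked by the loop never increases from one accepted update to the next.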

Other learning algorithms that use local learning rates have been proposed,
such as the heuristics in [125], the Quickprop algorithm [44], and the
convergence strategy [95]. In the last paper, the authors have discussed the
theoretical aspects of batch learning algorithms with local learning rates based
on the Lipschitz and Wolfe conditions.

3.8.2. Weights Initialization

One of the most effective methods to accelerate the training process of a
neural network consists of weights initialization. It is essential to initialize the
weights properly because inadequate choices cause the training process to
converge slowly or to fall in a local minimum trap. The objective of this
technique is, before the training process starts, to determine the value of the
weights that lead to a global minimum and accelerate the convergence. This
way, the outputs of the hidden neurons are in the area that is not saturated.

Generally, the weights are randomly initialized with small positive values
or with small values with zero mean [119]. This helps the gradient descent
algorithm break symmetry, thus avoiding any redundancy in the neural
network. Initializing the weights with large values can saturate the neurons early
and decelerate the training process. In theory, the likelihood that some neurons
are saturated early in an MLP neural network increases with the maximum value
of the weights [89]. The maximum value of the initial weights has been
computed in [37] through statistical analysis. In [131], the authors have shown
empirically that, in comparison to other techniques from the literature,
initializing the weights of an MLP neural network with values in the range
[−0.77, 0.77] attains the best performance.
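
The effect of the weight range on saturation can be illustrated numerically. The sketch below is an assumption-laden toy experiment (20 inputs drawn uniformly from [−1, 1], a single tanh neuron, hypothetical function name): it estimates the average slope of the hyperbolic tangent at the net input of the neuron. A slope near zero means the neuron is saturated and passes almost no gradient.

```python
import math
import random


def mean_tanh_slope(weight_range, n_inputs=20, trials=2000, seed=0):
    """Average derivative of tanh at the net input of one neuron whose
    weights are drawn uniformly from [-weight_range, weight_range]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        x = [rng.uniform(-1.0, 1.0) for _ in range(n_inputs)]
        w = [rng.uniform(-weight_range, weight_range) for _ in range(n_inputs)]
        u = sum(wi * xi for wi, xi in zip(w, x))
        total += 1.0 - math.tanh(u) ** 2  # derivative of tanh at u
    return total / trials


slope_small = mean_tanh_slope(0.77)   # range reported in [131]
slope_large = mean_tanh_slope(10.0)   # deliberately large initial weights
```

With the large range, most net inputs fall far from zero and the average slope is much smaller than with the small range.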

Many other initialization techniques based on heuristics have been
proposed in the literature. In [139], the initial weights of neuron i from layer l
have been selected based on $1/q_i^{(l)}$, where $q_i^{(l)}$ is the number of weights of
neuron i from layer l. If the weights of neuron i from layer l are uniformly
distributed, then $u_i^{(l)}$ is a standard normally distributed random variable. This
is considered an optimal technique for initializing the weights [139]. In [99],
the weights have been randomly initialized with values in the interval $[-a_0, a_0]$,
where $a_0 > 0$, and then have been transformed according to the dynamic range
of the activation function so that each neuron is active over this range.

Several initialization techniques based on parametric estimation have
been proposed. For example, in [37], [40], and [91], different methods to
estimate the weights based on a nonlinear projection between the input and the
output neurons have been presented. In [37], the weights of an MLP neural
network have been initialized using a supervised clustering algorithm. In [126],
a clustering algorithm and the k-nearest neighbors algorithm have been used to
initialize the weights in the hidden layer, and the SVD transform to initialize
the weights in the output layer. In [140], based on the clustering and the
k-nearest neighbors algorithms, the training instances have been grouped in a
set of clusters based on their accuracy.

In [91], the authors have proposed to initialize the weights in an MLP
neural network using the orthogonal least squares algorithm. In [90], the weights
initialization has been based on the maximum covariance, which is comparable
to the cascade correlation algorithm [45]. The algorithm presented in [147] is
designed for MLP neural networks: it first initializes the weights in the hidden
layer so that they extract important features from the input neurons based on
the ICA algorithm, and then it determines the weights in the output layer so
that the output neurons are maintained in the active region.

In [96], [18], [19], and [11], the authors have proposed to initialize the
weights using the simulated annealing algorithm and the gradient descent
algorithm to refine the quality of the solution.

3.9. Avoiding Local Minima

The standard training methods based on the gradient descent algorithm
cannot easily avoid local minima. The surface of the error function of an MLP
neural network is a step function with many flat and abrupt regions [61]. If the
training dataset is small, there is frequently a one-to-one correspondence
between each training instance and the steps in the surface of the error function.
As the size of the training dataset increases, the surface of the error function
becomes flatter, with many flat regions that extend to infinity in each direction,
and as a consequence linear algorithms are no longer efficient.

Numerous strategies have been introduced to avoid the local minima trap.
A straightforward and efficient technique consists of presenting the training
instances to the network in random order during each epoch. Another approach
involves running the learning algorithm from several starting regions of the
weights space, and then selecting the best solution. This is a useful method for
fast converging algorithms, such as the conjugate gradient algorithm.

Another efficient way to avoid local minima is to inject noise into the
training process, which also leads to an increased generalization performance.
In fact, this approach has been employed by several annealing methods. The noise
can be added to the input neurons, to the output neurons, or to the weights, but
once added it should be reduced as the training process advances. Regardless of
the method selected, each adds a stochastic term to the weight vector. The
method in [105] uses an annealing average step-size. Selecting large steps
allows the algorithm to avoid the local minima, and small ones ensure the
convergence in the local area. An effective method to avoid local minima is to
insert an annealing noise term in the gradient descent algorithm [18], [19], [30],
[11]. The SARprop algorithm employs this approach as well.
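
A minimal sketch of gradient descent with an annealed noise term is given below. Everything specific in it is an assumption made for illustration: the one-dimensional error function with two minima, the Gaussian noise, the initial noise level 0.3, and the decay factor 0.995. It is not the method of [105] or SARprop.

```python
import random


def f(x):
    # Toy error surface with two minima; the left one is deeper.
    return x ** 4 - 3 * x ** 2 + x


def df(x):
    return 4 * x ** 3 - 6 * x + 1


def noisy_descent(x, lr=0.01, sigma0=0.3, decay=0.995, steps=2000, seed=0):
    """Gradient descent with a stochastic term whose scale is annealed."""
    rng = random.Random(seed)
    sigma = sigma0
    for _ in range(steps):
        x = x - lr * df(x) + sigma * rng.gauss(0.0, 1.0)  # stochastic term
        sigma *= decay                                     # reduce the noise
    return x


x_final = noisy_descent(1.1)  # start inside the shallower (local) basin
```

While the noise is large the iterate can cross the barrier between the two basins; once the noise has decayed, the update reduces to plain gradient descent and the iterate settles in a minimum. The sketch does not guarantee that this minimum is the global one.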

3.10. Simulated Annealing Algorithm

The simulated annealing algorithm is based on an analogy between the
annealing process in metallurgy and the solving of optimization
problems [81], [25].

In metallurgy, once a solid material is melted, the temperature is gradually
reduced in a controlled manner until it solidifies in a perfect crystalline
structure. The final properties of the solid are strongly dependent on the cooling
process, i.e. if the cooling process is carried out rapidly, the solid will degrade
with ease due to its imperfect structure, whilst if it is carried out slowly, it will
have a robust structure. The perfect crystalline structure corresponds to the
configuration of the global minimum energy.

The Metropolis algorithm simulates the states of a solid material until it
reaches thermal equilibrium at a certain temperature [100]. The simulated
annealing algorithm [87] derives from the conventional Metropolis algorithm.
It is a descent algorithm that avoids the local minima by employing random
ascent moves. The simulated annealing algorithm emulates a nonstationary
finite Markov chain having the definition domain of the objective function as its
state space. In other words, it is a Monte Carlo algorithm for
combinatorial optimization.

Within the simulated annealing algorithm, the structure of the solid
corresponds to the solution of the objective function, and the temperature
dictates how new solutions are generated. At first, the initial solution is
perturbed, then the new solution is evaluated and accepted if it yields a better
result than the previous solution. The simulated annealing algorithm is
illustrated in Figure 3.5, which depicts an irregular surface similar to the step
surface of the error function. The intention is to place the ball on or near the
global minimum of this surface. Initially, the temperature has a large value and
the ball is perturbed strongly, so it can easily pass over the peaks and approach
the base of the surface.

Once perturbed, the ball can exhibit ascent or descent moves. Since the
temperature decreases gradually, the magnitude of the moves decreases too and
the ball can no longer pass over all the peaks of the surface. By the time the
magnitude of the moves has decreased so much that the ball can only pass over
small peaks, the ball should be near the global minimum and the entire algorithm
near completion. However, an excessive perturbation should be avoided because
the ball can also reach a higher position than originally.

[Figure 3.5: plot of temperature (°C) versus time]

Figure 3.5. Simulated Annealing Method.

In the literature, there are two cooling methods: the exponential and the
linear cooling. Figure 3.5 illustrates the exponential cooling, and it can be noted
that the algorithm spends less time at higher temperatures, and the time spent
increases as the temperature decreases, improving this way the solution found at
the previous iteration. In the case of linear cooling, the algorithm spends the
same time at each temperature. Consequently, linear cooling is beneficial only
if various minima are nearby. In [86], the authors have shown that if the
temperature is decreased and increased recurrently, the solution is improved at
each cycle and it helps during training. Other cooling methods have been
presented in [93].
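
The difference between the two cooling methods can be made concrete with a short Python sketch. The start temperature 40, the end temperature 1, and the 50 steps are assumed values loosely inspired by the axes of Figure 3.5; the function names are hypothetical.

```python
def exponential_cooling(t0, t_end, steps):
    """Temperatures that shrink by a constant ratio at every step."""
    ratio = (t_end / t0) ** (1.0 / (steps - 1))
    return [t0 * ratio ** k for k in range(steps)]


def linear_cooling(t0, t_end, steps):
    """Temperatures that drop by a constant amount at every step."""
    delta = (t0 - t_end) / (steps - 1)
    return [t0 - delta * k for k in range(steps)]


def count_above(schedule, threshold):
    """Number of steps spent above a given temperature."""
    return sum(1 for t in schedule if t > threshold)


exp_schedule = exponential_cooling(40.0, 1.0, 50)
lin_schedule = linear_cooling(40.0, 1.0, 50)
```

Comparing `count_above(exp_schedule, 20.0)` with `count_above(lin_schedule, 20.0)` shows that the exponential schedule leaves the hot half of the temperature range in fewer steps and therefore spends more of its steps at low temperatures.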

For a physical system in state α and with energy $E_\alpha$ at temperature T, the
probability $P_\alpha$ of being in that state α satisfies the Boltzmann distribution:

$$P_\alpha = \frac{1}{Z} e^{-\frac{E_\alpha}{k_B T}} \qquad (3.37)$$

where $k_B$ represents the Boltzmann constant and Z is a normalization constant
called the partition function, given by equation (3.38):

$$Z = \sum_{\beta} e^{-\frac{E_\beta}{k_B T}} \qquad (3.38)$$

where the sum is taken over all states β with energy $E_\beta$ at temperature T. If the
temperature T is high, irrespective of the energy, the Boltzmann distribution
shows a uniform characteristic for each state. If the temperature T decreases to
zero, only the states with minimum energy will have a probability greater than
zero.
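
The two limiting cases can be checked numerically. The sketch below evaluates equations (3.37) and (3.38) for three hypothetical energy levels, with the Boltzmann constant set to 1 as an assumption for convenience.

```python
import math


def boltzmann(energies, temperature, k_b=1.0):
    """State probabilities according to equations (3.37) and (3.38)."""
    weights = [math.exp(-e / (k_b * temperature)) for e in energies]
    z = sum(weights)  # partition function Z
    return [w / z for w in weights]


energies = [0.0, 1.0, 2.0]                        # hypothetical energy levels
p_hot = boltzmann(energies, temperature=1000.0)   # nearly uniform
p_cold = boltzmann(energies, temperature=0.01)    # mass on the lowest energy
```

At the high temperature the three probabilities are almost equal, while at the low temperature practically all the probability is assigned to the state with minimum energy.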

In the simulated annealing algorithm, the Boltzmann constant $k_B$ is
omitted. The temperature T is a control parameter, called computational
temperature, and limits the perturbations given by the energy function E(x).
When the temperature T is high, the algorithm disregards small differences in
energy and reaches thermal equilibrium swiftly; in other words, the algorithm
conducts a coarse search to find the global minimum. As the temperature T
decreases, the algorithm reacts to small differences in energy and conducts a
comprehensive search in the vicinity of the previously detected minimum to
find a better one. If T = 0, the algorithm should be in thermal
equilibrium, since no change of state can cause an increase in energy. It is
important to decrease the temperature T slowly to allow the algorithm to
conduct a comprehensive search at each temperature.

When using the simulated annealing algorithm, a global minimum is found
with high probability. The probability of a transition between two states is
governed by the Boltzmann distribution of the difference in energy of those
two states:

$$P = e^{-\frac{\Delta E}{T}} \qquad (3.39)$$

The probability of performing ascent moves in the energy function
($\Delta E > 0$) is high when the temperature T is high, and low otherwise.
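
A minimal simulated annealing loop built around equation (3.39) is sketched below. The uniform perturbation in [−1, 1], the initial temperature 10, the cooling factor 0.95, and the toy energy function are assumptions made for illustration only.

```python
import math
import random


def accept_probability(delta_e, temperature):
    """Equation (3.39); descent moves (delta_e <= 0) are always accepted."""
    return 1.0 if delta_e <= 0 else math.exp(-delta_e / temperature)


def simulated_annealing(energy, x0, t0=10.0, cooling=0.95, steps=500, seed=0):
    rng = random.Random(seed)
    x, e = x0, energy(x0)
    best_x, best_e = x, e
    t = t0
    for _ in range(steps):
        candidate = x + rng.uniform(-1.0, 1.0)  # perturb the current state
        delta_e = energy(candidate) - e
        if rng.random() < accept_probability(delta_e, t):
            x, e = candidate, e + delta_e
            if e < best_e:
                best_x, best_e = x, e
        t *= cooling                            # exponential cooling
    return best_x, best_e


def energy(x):
    # Toy energy with a local minimum near x = 1.13 and a deeper one near -1.30.
    return x ** 4 - 3 * x ** 2 + x


best_x, best_e = simulated_annealing(energy, x0=1.1)
```

Because the best state visited is stored separately, the returned energy can never be worse than the energy of the starting point, even though individual moves may go uphill.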

3.11. Proposed Neural Network Architecture

In this book, the neural network is used for classification, namely for
predicting the state of a two-class dependent variable based on the independent
variables. The architecture presented further has also been proposed in [18],
[19], and [11].

The MLP neural network training process employs the BP algorithm
based on the generalized delta rule [117]. For each instance presented to the
network during training, the information in the form of independent variables
moves forward through the network to generate a prediction in the output layer.
This prediction is compared to the actual output value of the training instance,
and the difference between the predicted and actual values is propagated back
through the network to adjust the connection weights, thus improving the
prediction for similar instances.

The activation functions of the input neurons have their values set to the
input instances. The output of each neuron in the hidden and the output layers is
calculated using equation (3.40):

$$o_n^{(l)} = \phi^{(l)}\left(u_n^{(l)}\right) = \phi^{(l)}\left(w_n^{(l-1)} o_n^{(l-1)} + \theta^{(l)}\right) \qquad (3.40)$$

where $u_n^{(l)}$ is the input of neuron n from layer l, $w_n^{(l-1)}$ is the weight vector from
layer (l − 1) to layer l of neuron n, $o_n^{(l)}$ is the output of neuron n from layer l,
$\theta^{(l)}$ is the bias in layer l, and $\phi^{(l)}(\cdot)$ is the activation function from layer l. The
activation function used for the hidden layers is the hyperbolic tangent function
(equation (3.41)), and the softmax function for the output layer (equation
(3.42)):

$$\phi\left(u_n^{(l)}\right) = \tanh\left(u_n^{(l)}\right) = \frac{e^{u_n^{(l)}} - e^{-u_n^{(l)}}}{e^{u_n^{(l)}} + e^{-u_n^{(l)}}} \qquad (3.41)$$

$$\phi\left(u_n^{(L)}\right) = \frac{e^{u_n^{(L)}}}{\sum_{j=1}^{k_L} e^{u_j^{(L)}}} \qquad (3.42)$$

where $k_L$ is the number of neurons in the output layer L.
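
The forward pass described by equations (3.40) to (3.42) can be sketched as follows for a network with one hidden layer. The layer sizes, the weight values, and the function names are hypothetical; subtracting the maximum inside the softmax is a common numerical safeguard that does not change the result.

```python
import math


def tanh_layer(weights, biases, inputs):
    """Hidden layer: weighted sum plus bias, then tanh (equation (3.41))."""
    return [math.tanh(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


def softmax(u):
    """Output activation of equation (3.42)."""
    m = max(u)  # shift for numerical stability
    exps = [math.exp(v - m) for v in u]
    total = sum(exps)
    return [e / total for e in exps]


def forward(x, w_hidden, b_hidden, w_out, b_out):
    hidden = tanh_layer(w_hidden, b_hidden, x)
    u_out = [sum(w * h for w, h in zip(row, hidden)) + b
             for row, b in zip(w_out, b_out)]
    return softmax(u_out)


# Hypothetical network: 3 inputs, 2 hidden neurons, 2 output classes.
w_hidden = [[0.2, -0.1, 0.4], [-0.3, 0.5, 0.1]]
b_hidden = [0.0, 0.1]
w_out = [[0.7, -0.2], [-0.7, 0.2]]
b_out = [0.0, 0.0]
probs = forward([1.0, 0.5, -1.0], w_hidden, b_hidden, w_out, b_out)
```

The two outputs are positive and sum to one, so they can be read as pseudo-probabilities of the two classes.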

3.11.1. Weights Initialization and Learning Process

The weights of the neural network are initialized by applying the
simulated annealing algorithm [81] and the alternating training process [96].
This procedure is applied K1 = 4 times to a random subset to derive the initial
weights. The simulated annealing algorithm is used to escape the local minimum
during the training process by perturbing this local minimum K2 = 4 times. If
the local minimum is passed successfully, for the next training cycle the
simulated annealing algorithm initializes the weights with more relevant values.
In order to find the global minimum, this procedure is repeated K3 = 3 times.

Because this algorithm is computationally intensive for large datasets,
only a random subset of the training dataset is used to initialize the weights,
namely:

1. K2 = 4 weight vectors are randomly generated within the range
[−0.5, 0.5] and the training error is calculated for each weight vector.
The weights that yield the minimum training error are selected as
initial weights.

2. A loop counter k = 0 is initialized.

3. The network is trained using the initial weights and the trained weights
w are obtained.

4. If the training error is less than or equal to 0.05, the loop is stopped
and the weight vector w is used as the result of the loop. Otherwise, the
loop counter k is incremented by one.

5. If k < K1, the old weights are perturbed to obtain K2 new weights
$w' = w + w_n$, by adding random noise $w_n$ within the range
$[-(0.5)^{k+1}, (0.5)^{k+1}]$. If $E(w_{\min}) < E(w)$, where $w_{\min}$ is the perturbed
weight vector that produces the minimum training error, the initial
weights are set to $w_{\min}$ and the algorithm returns to step 3.
Otherwise, the loop is stopped and the w vector is considered the final
result.

If the resulting weights yield a training error greater than 0.1, the algorithm
is repeated until the training error is less than or equal to 0.1 or until it has
been repeated K3 times, and the weights that produce the minimum test error
within the k loops are selected.
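
The steps above can be sketched in Python as follows. The sketch makes several simplifying assumptions that are not in the text: `train` and `error` are hypothetical callables supplied by the caller, the same error function is used both for the training error and for the final selection (the text selects on the test error), and the toy quadratic problem only serves to make the example runnable.

```python
import random


def anneal_init(train, error, dim, k1=4, k2=4, k3=3, seed=0):
    """Alternate training and annealing-style perturbation (steps 1 to 5)."""
    rng = random.Random(seed)
    best = None
    for _ in range(k3):
        # Step 1: keep the best of K2 random vectors drawn from [-0.5, 0.5].
        starts = [[rng.uniform(-0.5, 0.5) for _ in range(dim)]
                  for _ in range(k2)]
        w0 = min(starts, key=error)
        k = 0                                        # Step 2
        while True:
            w = train(w0)                            # Step 3
            if error(w) <= 0.05:                     # Step 4
                break
            k += 1
            if k >= k1:                              # Step 5, loop exhausted
                break
            r = 0.5 ** (k + 1)                       # shrinking noise range
            perturbed = [[wi + rng.uniform(-r, r) for wi in w]
                         for _ in range(k2)]
            w_min = min(perturbed, key=error)
            if error(w_min) < error(w):
                w0 = w_min                           # retrain from w_min
            else:
                break
        if best is None or error(w) < error(best):
            best = w
        if error(best) <= 0.1:                       # outer stopping rule
            break
    return best


# Toy problem: quadratic error with its minimum at w = (1, 1).
def error(w):
    return sum((wi - 1.0) ** 2 for wi in w)


def train(w, lr=0.1, steps=20):
    for _ in range(steps):
        w = [wi - lr * 2.0 * (wi - 1.0) for wi in w]
    return w


w_init = anneal_init(train, error, dim=2)
```

On this toy problem a single training run already drives the error below 0.05, so the perturbation branch is not exercised; on a real error surface with local minima, step 5 is what allows the procedure to leave them.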

3.11.2. Backpropagation Algorithm

Within this architecture, the BP algorithm uses the cross entropy error
function, since it is a more appropriate alternative for classification problems
[97], [127]. The cross entropy is a function of the relative errors and is assumed
to estimate low probabilities more precisely [55], [65], [127]. To calculate the
first order partial derivatives of the error function with respect to the weights,
the BP algorithm with momentum [18], [19], [11] is used:

▪ For each l, i, and j, set:

$$\frac{\partial E}{\partial w_{ij}^{(l)}} = 0 \qquad (3.43)$$

▪ For each l = 1, ..., L and j = 1, ..., $k_L$:

$$\delta_{n,i}^{(l)} = -e_{n,i} \qquad (3.44)$$

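▪ For each l = L, ..., 1, i = 1, ..., $k_{L-1}$, and j = 1, ..., $k_L$:

$$\frac{\partial E_n}{\partial w_{ij}^{(l)}} = -\delta_{n,j}^{(l+1)} o_{n,i}^{(l)} \qquad (3.45)$$

where $\delta_{n,j}^{(l)} = -\dfrac{\partial E_n}{\partial u_{n,j}^{(l)}}$.

▪ Set:

$$\frac{\partial E}{\partial w_{ij}^{(l)}} = \frac{\partial E}{\partial w_{ij}^{(l)}} + \frac{\partial E_n}{\partial w_{ij}^{(l)}} \qquad (3.46)$$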

▪ If l ≥ 1 and i > 0, then:

$$\delta_{n,i}^{(l+1)} = \dot{\phi}_i^{(l+1)}\left(u_{n,i}^{(l+1)}\right) \sum_{q=1}^{k_{l+2}} \delta_{n,q}^{(l+2)} w_{iq}^{(l+1)} \qquad (3.47)$$

This gives a vector of size $\sum_{l=1}^{L-1} (k_l + 1) k_{l+1}$, which is the gradient $\nabla E(w_k)$.
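
The backward pass can be illustrated for one training instance with a small Python sketch. It assumes a single tanh hidden layer, a softmax output layer, and the cross entropy error; the weights and the instance are hypothetical. The sketch works directly with the derivatives of the error with respect to the net inputs, which correspond to the negatives of the δ terms defined above. One analytic derivative is compared against a central finite difference as a sanity check.

```python
import math


def forward(x, W1, b1, W2, b2):
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(W1, b1)]
    u = [sum(w * hi for w, hi in zip(row, h)) + b for row, b in zip(W2, b2)]
    m = max(u)
    e = [math.exp(v - m) for v in u]
    s = sum(e)
    return h, [v / s for v in e]


def cross_entropy(y, target):
    return -sum(t * math.log(p) for t, p in zip(target, y))


def gradients(x, target, W1, b1, W2, b2):
    h, y = forward(x, W1, b1, W2, b2)
    # Output layer: for softmax with cross entropy, dE/du = y - target.
    d_out = [yi - ti for yi, ti in zip(y, target)]
    gW2 = [[d * hi for hi in h] for d in d_out]
    # Hidden layer: propagate back through W2 and the tanh derivative.
    d_hid = [(1.0 - hi * hi) * sum(d_out[q] * W2[q][i] for q in range(len(W2)))
             for i, hi in enumerate(h)]
    gW1 = [[d * xi for xi in x] for d in d_hid]
    return gW1, d_hid, gW2, d_out


x, target = [0.5, -1.0], [0.0, 1.0]
W1, b1 = [[0.1, -0.2], [0.4, 0.3]], [0.0, 0.1]
W2, b2 = [[0.3, -0.5], [-0.3, 0.5]], [0.0, 0.0]
gW1, gb1, gW2, gb2 = gradients(x, target, W1, b1, W2, b2)

# Central finite difference for the weight W1[0][1].
eps = 1e-6
W1_plus = [row[:] for row in W1]
W1_plus[0][1] += eps
W1_minus = [row[:] for row in W1]
W1_minus[0][1] -= eps
numeric = (cross_entropy(forward(x, W1_plus, b1, W2, b2)[1], target)
           - cross_entropy(forward(x, W1_minus, b1, W2, b2)[1], target)) / (2 * eps)
```

Summing such per-instance gradients over all training instances gives the batch gradient accumulated in equation (3.46).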

3.11.3. Gradient Descent Algorithm

The gradient descent method with the learning rate $\eta_0 = 0.4$ and the
momentum α = 0.9 consists of the following steps [18], [19], [11]:

1. Let k = 0; the weight vector is initialized with $w_0$, the learning rate
with $\eta_0$, and $\Delta w_0 = 0$.

2. The entire dataset is read and the error function $E(w_k)$ and its gradient
$\nabla E(w_k)$ are calculated. If $\|\nabla E(w_k)\| < 10^{-6}$, the algorithm is stopped
and the current network is reported.

3. If $\eta_k \|\nabla E(w_k)\| \le \alpha \|\Delta w_k\|$, then $\alpha = 0.9\,\eta_k \|\nabla E(w_k)\| / \|\Delta w_k\|$, so that
the direction of the steepest gradient descent controls the change in
weight at the next step; otherwise, it could be in the opposite direction
and, irrespective of how small the value of $\eta_k$ is, the error will not
decrease.

4. Let $v = w_k - \eta_k \nabla E(w_k) + \alpha \Delta w_k$.

5. If $E(v) < E(w_k)$, then $w_{k+1} = v$, $\Delta w_{k+1} = w_{k+1} - w_k$, and $\eta_{k+1} = \eta_k$.
Otherwise, $\eta_k = 0.5\,\eta_k$ and the algorithm returns to step 3.

6. If any convergence condition is met, the network is reported according
to the convergence criterion; otherwise, k is incremented by one and the
algorithm returns to step 2.
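
A compact Python sketch of these batch steps is shown below. It rests on a few assumptions that the text leaves open: the momentum is reset to 0.9 before every attempt, a Euclidean norm is used, the convergence criteria of step 6 are replaced by a simple iteration limit, and the error function is a toy quadratic.

```python
import math


def norm(v):
    return math.sqrt(sum(x * x for x in v))


def batch_gd(error, grad, w, eta0=0.4, alpha0=0.9, max_iter=200, tol=1e-6):
    eta, dw = eta0, [0.0] * len(w)
    for _ in range(max_iter):
        g = grad(w)                                   # step 2
        if norm(g) < tol:
            break
        while True:
            alpha = alpha0                            # assumed reset of alpha
            if eta * norm(g) <= alpha * norm(dw):     # step 3
                alpha = 0.9 * eta * norm(g) / norm(dw)
            v = [wi - eta * gi + alpha * di           # step 4
                 for wi, gi, di in zip(w, g, dw)]
            if error(v) < error(w):                   # step 5: accept v
                dw = [vi - wi for vi, wi in zip(v, w)]
                w = v
                break
            eta *= 0.5                                # step 5: halve and retry
            if eta < 1e-12:                           # safeguard, not in text
                return w
    return w


def error(w):
    return sum((wi - 1.0) ** 2 for wi in w)


def grad(w):
    return [2.0 * (wi - 1.0) for wi in w]


w_star = batch_gd(error, grad, [3.0, -2.0])
```

The rescaling in step 3 keeps the momentum contribution smaller than the gradient step, so the combined move always has a descent component.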

The gradient descent method for online learning with learning rate
$\eta_0 = 0.4$, minimum learning rate $\eta_{\min} = 0.001$, momentum α = 0.9, the learning
rate decay factor $\beta = \frac{1}{np} \ln(\eta_0 / \eta_{\min})$, number of training instances n, and
number of epochs p needed to reduce the initial learning rate to $\eta_{\min}$, consists of
the following steps [18], [19], [11]:

1. Let k = 0; the weight vector is initialized with $w_0$, the learning rate
with $\eta_0$, and $\Delta w_0 = 0$.

2. A randomly selected subset of data is read and the error
function $E(w_k)$ and its gradient $\nabla E(w_k)$ are calculated.

3. If $\eta_k \|\nabla E(w_k)\| \le \alpha \|\Delta w_k\|$, then $\alpha = 0.9\,\eta_k \|\nabla E(w_k)\| / \|\Delta w_k\|$, so that
the direction of the steepest gradient descent controls the change in
weight at the next step; otherwise, it could be in the opposite direction
and, irrespective of how small the value of $\eta_k$ is, the error will not
decrease.

4. Let $v = w_k - \eta_k \nabla E_k(w_k) + \alpha \Delta w_k$.

5. If $E_k(v) < E(w_k)$, then $w_{k+1} = v$ and $\Delta w_{k+1} = w_{k+1} - w_k$. Otherwise,
$w_{k+1} = w_k$ and $\Delta w_{k+1} = \Delta w_k$.

6. $\eta_{k+1} = e^{-\beta} \eta_k$. If $\eta_{k+1} < \eta_{\min}$, then $\eta_{k+1} = \eta_{\min}$.

7. If any convergence condition is met, the network is reported according
to the convergence criterion; otherwise, k is incremented by one and the
algorithm returns to step 2.
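
The role of the decay factor β can be verified with a short calculation. The sketch below applies the update of step 6 repeatedly; the values n = 100 and p = 10 are hypothetical and only fix the number of updates.

```python
import math


def learning_rates(eta0=0.4, eta_min=0.001, n=100, p=10):
    """Learning rate after each of the n * p online updates (step 6)."""
    beta = math.log(eta0 / eta_min) / (n * p)  # decay factor
    eta, rates = eta0, [eta0]
    for _ in range(n * p):
        eta = max(eta * math.exp(-beta), eta_min)
        rates.append(eta)
    return rates


rates = learning_rates()
```

After exactly n · p updates the learning rate has decayed from the initial value to the minimum value, which is the purpose of defining β in this way.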

3.11.4. Convergence Criteria

The training process takes place over at least one epoch and then stops
according to the following criteria, which are checked in the following order
[18], [19], [11]:

1. During the model update, the total training error is calculated at the
end of each iteration. If, at iteration K1, the training error does
not decrease below the current minimum error E1 over the next step,
the algorithm is stopped and the weights obtained at step K1 are
reported.

2. If the change in the training error is relatively small, the training
process is stopped and the weights obtained at step K1 are reported:

$$\frac{2\,\left|E(w_k) - E(w_{k-1})\right|}{E(w_k) + E(w_{k-1}) + 10^{-10}} < 10^{-4} \qquad (3.48)$$

3. If the ratio between the current training error and the initial error is
low:

$$\frac{E(w_k)}{E + 10^{-10}} < 10^{-3} \qquad (3.49)$$

where E is the model error calculated using equation (3.50) in the error
function, the weights obtained at step K1 are reported.

$$\hat{y}_l = \frac{1}{N} \sum_{l \in S} y_l \qquad (3.50)$$
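
The two numerical criteria can be written as small predicates. This is a sketch with hypothetical function names; `e_ref` stands for the reference error E of equation (3.49), which the caller is assumed to have computed beforehand.

```python
def relative_change_small(e_curr, e_prev, tol=1e-4):
    """Criterion of equation (3.48): relative change in the training error."""
    return 2.0 * abs(e_curr - e_prev) / (e_curr + e_prev + 1e-10) < tol


def error_ratio_small(e_curr, e_ref, tol=1e-3):
    """Criterion of equation (3.49): current error relative to e_ref."""
    return e_curr / (e_ref + 1e-10) < tol


stop_on_plateau = relative_change_small(0.50000, 0.50001)
stop_on_ratio = error_ratio_small(0.0005, 1.0)
```

The constant $10^{-10}$ in both denominators only prevents a division by zero when the errors vanish.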

Given that the ability to generalize is higher for smaller networks with
fewer parameters, the pruning technique based on sensitivity is applied to
eliminate the redundant neurons during training. This pruning technique has
been proposed in [146]; starting from a large neural network, it first removes
the redundant neurons in the hidden layer and then the redundant neurons in the
input layer. This process is repeated until the global convergence conditions are
met [18], [19], [11].

During the elimination phase of the redundant hidden neurons, the neural
network is trained using the entire training dataset and, if any convergence
condition is met, it advances to the second phase to eliminate the redundant
input neurons. The elimination process of hidden neurons is stopped if the
global convergence criteria are met, the current error is three times the error of
the most suitable network, or the persistence limit in the hidden layer is
exceeded, where persistence is defined as the number of training cycles without
any improvement. If no convergence condition is met, a sensitivity analysis is
performed to identify the redundant neurons in the hidden layer.

To perform the sensitivity analysis for the hidden neurons, the test dataset
is applied to the network and the results are recorded as reference. Next, the
weights of the first hidden neuron are temporarily set to zero, the test dataset is
applied to the modified network, and the results are compared. For each instance,
the absolute difference between the results obtained for the entire network and
the results obtained for the modified network is calculated, along with the
standard deviation across the entire test dataset. This process is repeated for
each hidden neuron, which is ranked according to this value. A high value
indicates significant neurons, while a low value indicates redundant neurons.
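
The ranking of hidden neurons can be sketched as follows. The example is a simplification under stated assumptions: a network with a single linear output and no biases, hypothetical weights and data, and the mean absolute difference used as the only ranking value (the standard deviation mentioned above is omitted).

```python
import math


def predict(x, W1, W2, mask):
    """One-output network; mask[i] = 0 silences hidden neuron i."""
    h = [m * math.tanh(sum(w * xi for w, xi in zip(row, x)))
         for row, m in zip(W1, mask)]
    return sum(w * hi for w, hi in zip(W2, h))


def hidden_sensitivity(data, W1, W2):
    """Mean absolute change of the output when each hidden neuron is removed."""
    full = [predict(x, W1, W2, [1] * len(W1)) for x in data]
    scores = []
    for i in range(len(W1)):
        mask = [0 if j == i else 1 for j in range(len(W1))]
        diffs = [abs(f - predict(x, W1, W2, mask)) for f, x in zip(full, data)]
        scores.append(sum(diffs) / len(diffs))
    return scores


data = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.5]]
W1 = [[1.0, -1.0], [0.5, 0.5], [0.01, 0.0]]   # third neuron barely reacts
W2 = [1.0, -0.8, 0.001]
scores = hidden_sensitivity(data, W1, W2)
least_important = scores.index(min(scores))
```

The neuron with the smallest score changes the output the least when it is silenced and is therefore the first candidate for pruning.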

During the elimination phase of the redundant input neurons, the neural
network is trained using the entire training dataset, and if any convergence
condition is met, the global convergence conditions are verified, and if
necessary, these two elimination phases are repeated. The elimination process of
input neurons is stopped if the global convergence criteria are met, the current
network error is three times the error of the most suitable network, or the
persistence limit in the input layer is exceeded. If no convergence condition is
met, a sensitivity analysis is performed to identify the redundant neurons in the
input layer.

To perform the sensitivity analysis for the input neurons, the value of the
independent variable is varied for each instance in the test dataset, the maximum
and the minimum output values are recorded, and the maximum difference for
each instance is calculated along with the arithmetic mean.

3.12. Conclusions
The first part of this chapter presents the theoretical foundations
of neural networks and reviews some noteworthy research papers
from the literature.

In this chapter, the MLP neural network architecture [18], [19], [11] is
proposed to implement the predictive model. This MLP neural
network is trained using the backpropagation learning algorithm [138], which is
improved by applying the momentum method to adjust the weights [119]. For
each training instance, the weights are updated using the stochastic gradient
descent optimization method [36] after each feature has been presented to the
network sequentially. The hyperbolic tangent function is used as the
activation function for the hidden layers and the generalized sigmoid (softmax)
function [118] for the output layer, which introduces flexibility into the model.
The structure of the MLP network is optimized using the pruning technique based
on the sensitivity of the input and hidden neurons, which relies on the mutual
information [146]. The network pruning strategy starts from a large network and
gradually removes redundant neurons during the learning process. To accelerate
the neural network training process, two methods are used: adapting the
learning rate, by initializing it with a higher value and gradually decreasing it
as the learning progresses, and initializing the weights with the simulated
annealing algorithm [81].

The architecture of this machine learning algorithm is applied to the
dataset resulting from Chapter 2, which consists of 11 principal components
obtained by applying the PCA algorithm, 2 nominal independent variables, and
the dependent variable. The next chapter, Chapter 4, evaluates this predictive
model and presents the practical results.

The proposed MLP neural network architecture [18], [19], [11] is scalable
to large datasets and can be applied to any dataset from any field of activity, as
long as the problem to be solved is a classification problem.

This chapter represents the modeling phase of the CRISP-DM
methodology used.


4. MODEL EVALUATION AND DEPLOYMENT

4.1. Model Evaluation

Reaching this phase implies that the model has been implemented and
yields a good predictive performance. Before proceeding to the next phase, it is
important to evaluate and review the steps taken to create the predictive model
to ensure that it meets the objectives accordingly. It is important to determine
whether there is an objective that has not been considered. At the end of this
phase, a decision is going to be made regarding the use of the results obtained
during the entire data mining process. At this stage, the predictive model is
assessed to decide whether or not the predictions are considered a success.

The predictive performance of the model can be viewed using the
confusion matrix. The confusion matrix is a table with two rows and two
columns, where the cells on the main diagonal represent correctly classified
cases, and those on the opposite diagonal represent incorrectly classified cases,
from the training and the test datasets respectively, as illustrated in Table 4.1.

Dataset          Observed    Predicted
                             No      Yes     % correct
Training/Test    No          TN      FP      TNR
                 Yes         FN      TP      TPR
                 Total %     NPV     PPV     ACC

Table 4.1. Confusion Matrix.

The confusion matrix cells are called: true negatives (TN), false positives
(FP), false negatives (FN), and true positives (TP) (Table 4.1). The rest of the
measures from Table 4.1, such as TNR (true negatives rate or specificity), TPR
(true positives rate or sensitivity), NPV (negative predictive value), PPV
(positive predictive value or precision), and ACC (accuracy) are calculated
using the following equations:

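$$\mathrm{TPR} = \frac{TP}{TP + FN} \qquad (4.1)$$

$$\mathrm{TNR} = \frac{TN}{TN + FP} \qquad (4.2)$$

$$\mathrm{PPV} = \frac{TP}{TP + FP} \qquad (4.3)$$

$$\mathrm{NPV} = \frac{TN}{TN + FN} \qquad (4.4)$$

$$\mathrm{ACC} = \frac{TP + TN}{P + N} \qquad (4.5)$$

where P represents the total number of positives and N is the total number of
negatives.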

Based on these defined measures, Table 4.2 shows the confusion matrix
for the machine learning algorithm used, namely the MLP neural network, on
both the training and the test datasets.

Dataset     Observed    Predicted
                        No        Yes       % correct
Training    No          2279      0         100.00%
            Yes         0         2294      100.00%
            Total %     49.84%    50.16%    100.00%
Test        No          569       2         99.65%
            Yes         1         94        98.95%
            Total %     85.59%    14.41%    99.55%

Table 4.2. Confusion Matrix – NN-MLP.

In the training phase, the prediction model has correctly classified all the
2,294 customers who have previously churned and stopped using the services
offered by the mobile telecommunication company, with a true positives rate of
100%. Of the 2,279 customers who continued to use the company's services, all
the customers are classified correctly, providing a specificity of 100%. In other
words, all the customers in the training dataset are classified correctly.

Within the test dataset, out of the 95 customers who stopped using the
services offered by the mobile telecommunication company, 94 customers are
classified correctly (a true positives rate of 98.95%); and of the 571 customers
who kept using the services, 569 customers are classified correctly (a specificity
of 99.65%). Overall, approximately 99.55% of the customers in the test dataset
are classified correctly and approximately 0.45% are misclassified.

In other words, the information presented in Table 4.2 suggests that,
overall, the predictive model will classify correctly approximately 995 out of
1,000 customers.
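
The percentages reported for the test dataset can be re-derived from the four counts in Table 4.2 with equations (4.1) to (4.5). The following Python sketch does this; the function name and the dictionary layout are illustrative choices.

```python
def confusion_metrics(tn, fp, fn, tp):
    """Measures of equations (4.1) to (4.5), plus FPR = 1 - TNR."""
    p, n = tp + fn, tn + fp  # observed positives and negatives
    return {
        "TPR": tp / (tp + fn),
        "TNR": tn / (tn + fp),
        "PPV": tp / (tp + fp),
        "NPV": tn / (tn + fn),
        "ACC": (tp + tn) / (p + n),
        "FPR": fp / (fp + tn),
    }


# Test dataset counts from Table 4.2.
test_metrics = confusion_metrics(tn=569, fp=2, fn=1, tp=94)
```

Multiplying the TPR, TNR, and ACC entries by 100 and rounding to two decimals reproduces the values 98.95%, 99.65%, and 99.55% of the "% correct" column.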

Another way of interpreting the results is the lift chart. This type of graph
sorts the predicted pseudo-probabilities [101] in descending order and displays
the corresponding curve. There are two types of lift charts: incremental and
cumulative. The incremental lift chart represented in Figure 4.1 shows the lift
factor in each percentile [43] without any accumulation for the Yes class of the
dependent variable Churn. The curve corresponding to this predictive model
falls below the gray line, which corresponds to the random expectation (RND
E), around the 16th percentile. This means that compared to the random
expectation, the model achieves its maximum performance in the first 16% of
the instances.

[Figure 4.1: lift versus percentile for Churn = Yes; curves NN-MLP and RND E]

Figure 4.1. Incremental Lift Chart.

The cumulative lift chart indicates the prediction rate of the model
compared to the random expectation. Figure 4.2 illustrates the curve of the
cumulative lift chart for the Yes class of the dependent variable Churn. By
reading the chart on the horizontal axis, it can be seen that for the 16th
percentile, the model has a lift index of approximately 7 on the vertical axis,
meaning that this model performs approximately 7 times better than a random
model.

[Figure 4.2: cumulative lift versus percentile for Churn = Yes; curves NN-MLP and RND E]

Figure 4.2. Cumulative Lift Chart.

The performance of this predictive model can also be evaluated using the
gain measure. The gain chart shows the percentage of positive responses on the
vertical axis, and the percentage of customers contacted on the horizontal axis.
The gain measure is defined as the proportion of respondents present in each
percentile relative to the total number of respondents. The cumulative gain chart
shows the prediction rate of the model compared to the random expectation.

Figure 4.3 illustrates the curve corresponding to the MLP neural network
predictive model for the Yes class of the dependent variable Churn. It can be
seen that in the 16th percentile, the predictive model reaches a gain of
approximately 99%.

[Figure 4.3: % gain versus percentile for Churn = Yes; curves NN-MLP and RND E]

Figure 4.3. Gain Chart.
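
The cumulative gain and the cumulative lift at a given percentile can be computed from the sorted pseudo-probabilities as sketched below. The scores and labels are hypothetical toy data, not the dataset used in this book, and the function name is an illustrative choice.

```python
def cumulative_gain_and_lift(scores, labels, fraction):
    """Gain and lift when selecting the top `fraction` of instances by score."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    top = ranked[:max(1, round(fraction * len(ranked)))]
    hits = sum(label for _, label in top)
    gain = hits / sum(labels)               # share of all positives captured
    lift = gain / (len(top) / len(ranked))  # relative to random selection
    return gain, lift


# Hypothetical example: 2 churners among 10 customers, both ranked first.
scores = [0.95, 0.90, 0.40, 0.35, 0.30, 0.25, 0.20, 0.15, 0.10, 0.05]
labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
gain, lift = cumulative_gain_and_lift(scores, labels, 0.2)
```

Under the same definition, a model that ranks all 95 churners of the 666 test customers first cannot exceed a lift of 666/95 ≈ 7.01, which agrees with the lift index of approximately 7 read from the cumulative lift chart.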

Figure 4.4 shows the ROC curve of the predictive model. The ROC curve
is derived from the confusion matrix and uses only the TPR and the FPR (false
positives rate) measures, the latter being obtained by subtracting the specificity
from one. In Figure 4.4, it can be observed that the curve
approaches the point (0, 1) in the upper left corner, which corresponds to a
perfect prediction. Our predictive model based on neural networks obtains a
sensitivity of approximately 99% and a specificity of approximately 100%.

[Figure 4.4: TPR (sensitivity) versus FPR (1 − specificity) for Churn = Yes; curves NN-MLP and RND E]

Figure 4.4. ROC Curve.

4.2. Model Deployment

The model deployment phase of the CRISP-DM methodology involves
organizing the results generated by the predictive model so that the mobile
telecommunication company can make specific decisions. These results can be
organized in the form of a document in which the instances are ordered based on
their pseudo-probabilities of belonging to a class of the dependent variable. If a
more complex approach to this problem is considered, then these results can be
integrated in an interactive reporting system in which the datasets used by the
data mining models are extracted from a database and automatically scored by
these models, and the results are then viewed through certain reporting tools.

Once distributed, a mobile telecommunication company can use the
results of this predictive model if it decides to send, for example, different offers
to its customers. Based on this predictive model, the company can easily select
the first 16% to 20% (the 16th or 20th percentile) of the customers sorted by their
corresponding pseudo-probabilities, and expect that most of the contacted
customers are among those who intend to churn.

If the company decides to use the lift chart, it can select the first 16% to
20% of the customers sorted by their corresponding pseudo-probabilities, and
expect to contact about seven times (a lift factor of 7) as many customers
who intend to churn as it would by selecting them randomly.
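
The selection described above amounts to sorting the scored customers and keeping the top fraction. A minimal Python sketch follows; the record layout, the field name `p_churn`, and the scores are hypothetical.

```python
def top_fraction(customers, fraction=0.16):
    """Return the customers with the highest churn pseudo-probabilities."""
    ranked = sorted(customers, key=lambda c: c["p_churn"], reverse=True)
    cutoff = max(1, int(len(ranked) * fraction))
    return ranked[:cutoff]


# Hypothetical scored customers.
customers = [{"id": i, "p_churn": p} for i, p in
             enumerate([0.02, 0.91, 0.10, 0.75, 0.05, 0.33, 0.01,
                        0.08, 0.60, 0.04, 0.03, 0.12, 0.07])]
to_contact = top_fraction(customers)
```

In a reporting system, the same ordering would be produced by the scoring step, and the cutoff (16% here) would be a business parameter chosen from the lift or gain chart.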

4.3. Conclusions

In this chapter, the predictive model implemented based on the MLP
neural network architecture [18], [19], [11] proposed in Chapter 3 is evaluated.
To evaluate this predictive model, five different methods are used, namely: the
confusion matrix, the incremental and the cumulative lift charts, the gain chart,
and the ROC curve.

By evaluating the results obtained using the confusion matrix, it can be
observed that for the Yes class of the dependent variable Churn, the MLP neural
network model has a predictive performance of 98.95%. When considering the
results generated for each class of the dependent variable Churn, the MLP
neural network achieves a predictive performance of 99.55%.

Using the incremental lift chart method, it can be observed that the
predictive model achieves its maximum performance in the first 16% of the
instances because the corresponding curve of the predictive model falls below
the gray line corresponding to the random expectation around the 16th
percentile. The cumulative lift chart indicates that for the 16th percentile, the
model has a lift index of approximately 7 on the vertical axis, meaning that
this model performs about 7 times better than a random model.

Based on the gain chart, in the 16th percentile, the predictive model
implemented using the MLP neural network captures 99% of the total number
of respondents.

By interpreting the chart of the ROC curve, it can be seen that the model
is very close to the perfect prediction with a sensitivity of 99% and a specificity
of 100%.

This chapter represents the evaluation and the deployment phases of the
CRISP-DM methodology.


5. CONCLUSIONS

This book aims to identify the customers in the prepaid segment of a mobile
telecommunications company who present a high risk of churning, i.e. of
switching to services offered by a competitor.

To precisely identify only the customers who are at risk of churning, the
companies active in this industry must implement data mining models that
employ machine learning algorithms and yield highly accurate results.

The data mining method proposed in this book is original. It starts by applying
the PCA algorithm in Chapter 2 to reduce the dimensionality of the training
dataset and to eliminate any possible correlations between the independent
variables. At the end of this process, a total of 11 principal components are
obtained, which explain the entire variation in the dataset. It is important to
note that this method of reducing the dimensionality and avoiding the
collinearity of independent variables [16] can be used for any dataset from any
field of activity.

The newly formed dataset consists of the 11 principal components, the 2 nominal
independent variables, and the dependent variable Churn, and is used to train
the machine learning algorithm. Prior to implementing the predictive model, the
distribution of the dependent variable is balanced using the oversampling
method [63] to ensure optimal learning by the machine learning algorithm.
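
One simple way to balance the distribution of a dependent variable is random
oversampling, in which instances of the smaller class are duplicated until all
classes are equally represented. The sketch below illustrates this variant
only. Whether it matches the exact oversampling procedure applied in this book
is an assumption, and the function name, the row layout, and the toy data are
hypothetical.

```python
import random


def random_oversample(rows, label_of, seed=0):
    """Duplicate minority-class rows at random until all classes are equal in size."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(label_of(row), []).append(row)
    # Every class is grown to the size of the largest class.
    target = max(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(group)
        # Sample with replacement to fill the gap (k may be 0).
        balanced.extend(rng.choices(group, k=target - len(group)))
    rng.shuffle(balanced)
    return balanced


# Hypothetical toy dataset: 8 non-churners and 2 churners.
rows = [(i, "No") for i in range(8)] + [(i, "Yes") for i in range(8, 10)]
balanced = random_oversample(rows, label_of=lambda row: row[1])
counts = {label: sum(1 for r in balanced if r[1] == label) for label in ("No", "Yes")}
print(counts)
```

Oversampling of this kind is normally applied to the training dataset only, so
that the evaluation reflects the original class distribution.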

In the modeling phase in Chapter 3, the predictive model is implemented using
the MLP neural network algorithm. For this machine learning algorithm, an
original architecture is proposed. It combines several techniques existing in
the literature and has not been applied in other research papers to solve the
same problem, or any other classification problem.

The MLP neural network is trained using the backpropagation learning algorithm
[138], which is improved by applying the momentum method to adjust the weights
[119]. The weights are updated with the stochastic gradient descent
optimization method [36], that is, after each training instance is presented to
the network in sequence. The hyperbolic tangent function is used as the
activation function for the hidden layers and the generalized sigmoid function
[118] as the activation function for the output layer, thus introducing
flexibility in this model. The structure of the MLP neural network is optimized
using a pruning technique based on the sensitivity of the input and hidden
neurons, which is computed from mutual information [146]. To accelerate the
training process, two methods are used: the learning rate is adapted by
initializing it with a higher value and gradually decreasing it as learning
progresses, and the weights are initialized using the simulated annealing
algorithm [81]. The MLP neural network architecture described above has also
been proposed in [18], [19], and [11].
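
The interaction between per-instance weight updates, the momentum term, and a
decreasing learning rate can be illustrated on a deliberately small problem.
The sketch below fits a single linear weight rather than an MLP. The decay
schedule, the parameter values, and all names are assumptions chosen for
illustration, and the code is not the training procedure used in Chapter 3.

```python
def decayed_learning_rate(initial_rate, decay, epoch):
    """Start with a higher rate and shrink it as training progresses (assumed schedule)."""
    return initial_rate / (1.0 + decay * epoch)


def momentum_update(weight, gradient, velocity, learning_rate, momentum):
    """One gradient descent step with a momentum term."""
    # The new step blends the previous step with the current gradient.
    velocity = momentum * velocity - learning_rate * gradient
    return weight + velocity, velocity


def train(samples, epochs=50, initial_rate=0.05, decay=0.01, momentum=0.5):
    """Fit y = weight * x, updating the weight after every single instance."""
    weight, velocity = 0.0, 0.0
    for epoch in range(epochs):
        rate = decayed_learning_rate(initial_rate, decay, epoch)
        for x, y in samples:
            # Gradient of the squared error 0.5 * (weight * x - y) ** 2.
            gradient = (weight * x - y) * x
            weight, velocity = momentum_update(
                weight, gradient, velocity, rate, momentum
            )
    return weight


# Hypothetical toy data generated by y = 2 * x.
samples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
weight = train(samples)
print(round(weight, 3))
```

In a full MLP the same update would be applied to every weight, with the
gradients supplied by backpropagation.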

The evaluation of this predictive model in Chapter 4, using five different
methods, shows that this architecture achieves a predictive performance of
99.55%, very close to that of the ideal model. Based on the results generated
by this model, the mobile telecommunications company can take different
approaches to retain the customers that intend to churn, and can be confident
that most of the customers selected plan to change their service provider in
the near future.


BIBLIOGRAPHY

1. Ackley, D.H., G.E. Hinton, and T.J. Sejnowski, A Learning Algorithm for
Boltzmann Machines. Cognitive Science, 1985. 9(2).
2. Agresti, A., Categorical Data Analysis. 2002: Wiley.
3. Anderson, T.W., Asymptotic Theory for Principal Component Analysis. The
Annals of Mathematical Statistics, 1963. 34.
4. Arthur, Y.D., E. Harris, and J. Annan, Principal Component Analysis of
Customer Churns in Ghanaian Telecommunication Industry. American
International Journal of Contemporary Research, 2012. 2(12).
5. Baesens, B., S. Viaene, D. Van den Poel, J. Vanthienen, and G. Dedene,
Bayesian Neural Network Learning for Repeat Purchase Modeling in Direct
Marketing. European Journal of Operational Research, 2002. 138(1).
6. Battiti, R., Accelerated Backpropagation Learning: Two Optimization
Methods. Complex Systems, 1989. 3.
7. Bellman, R.E., Dynamic Programming. 1957, Princeton: Princeton University
Press. xxv, 342 p.
8. Berson, A., S. Smith, and K. Thearling, Building Data Mining Applications for
CRM. 1999: McGraw-Hill.
9. Bhattacharyya, S. and P. Pendharkar, Inductive, Evolutionary and Neural
Techniques for Discrimination: A Comparative Study. Decision Sciences,
1998. 29.
10. Blake, C.L. and C.J. Merz, Churn Data Set. 1998: California, USA.
11. Brandusoiu, I.B., Methods for Predicting the Evolution of the Number of
Subscribers in the Mobile Telecommunications Industry. 2016, Technical
University of Cluj-Napoca.
12. Brandusoiu, I.B. and G. Toderean, Churn Prediction in the
Telecommunications Sector Using Support Vector Machines. Annals of the Oradea
University Fascicle of Management and Technological Engineering, 2013.
22(1).
13. Brandusoiu, I.B. and G. Toderean, Churn Prediction Modeling in Mobile
Telecommunications Industry Using Decision Trees. University of Oradea
Journal of Computer Science and Control Systems, 2013. 6(1).
14. Brandusoiu, I.B. and G. Toderean, A Neural Networks Approach for Churn
Prediction Modeling in Mobile Telecommunications Industry. University of
Pitesti Scientific Bulletin Series: Electronics and Computers Science, 2013.
13(1).
15. Brandusoiu, I.B. and G. Toderean, Predicting Churn in Mobile
Telecommunications Industry. ACTA Technica Napocensis Electronics and
Telecommunications, 2013. 54(3).
16. Brandusoiu, I.B. and G. Toderean, Applying Principal Component Analysis on
Call Detail Records. ACTA Technica Napocensis Electronics and
Telecommunications, 2014. 55(4).
17. Brandusoiu, I.B. and G. Toderean, Churn Prediction in the
Telecommunications Sector Using Bayesian Networks. University of Oradea Journal
of Computer Science and Control Systems, 2015. 8(2).
18. Brandusoiu, I.B. and G. Toderean, Churn Prediction in the
Telecommunications Sector Using Neural Networks. ACTA Technica Napocensis
Electronics and Telecommunications, 2016. 57(1).
19. Brandusoiu, I.B. and G. Toderean. Methods for Churn Prediction in the
Pre-Paid Mobile Telecommunications Industry. 11th International Conference on
Communications (COMM). 2016.
20. Broomhead, D.S. and D. Lowe, Multivariable Functional Interpolation and
Adaptive Networks. Complex Systems, 1988. 2.
21. Bryson, A.E., W.F. Denham, and S.E. Dreyfus, Optimal Programming
Problems with Inequality Constraints I: Necessary Conditions for Extremal
Solutions. AIAA Journal, 1963. 1(11).
22. Cabena, P., P. Hadjinian, R. Stadler, J. Verhees, and A. Zanasi, Discovering
Data Mining: From Concept to Implementation. 1998: Prentice Hall.
23. Castanedo, F., G. Valverde, J. Zaratiegu, and A. Vazquez, Using Deep Learning
to Predict Customer Churn in a Mobile Telecommunication Network. 2014.
24. Castellano, G., A.M. Fanelli, and M. Pelillo, An Iterative Pruning Algorithm
for Feedforward Neural Networks. IEEE Transactions on Neural Networks,
1997. 8(3).
25. Cerny, V., Thermodynamical Approach to the Traveling Salesman Problem: An
Efficient Simulation Algorithm. Journal of Optimization Theory and
Applications, 1985. 45.
26. Chakrabarti, S., M. Ester, U. Fayyad, J. Gehrke, J. Han, S. Morishita, G.
Piatetsky-Shapiro, and W. Wang, Data Mining Curriculum: A Proposal Version
1.0. ACM Digital Library, 2006.
27. Chandrasekaran, H., H.H. Chen, and M.T. Manry, Pruning of Basis Functions
in Nonlinear Approximations. Neurocomputing, 2000. 34.
28. Chang, C.C. and C.J. Lin, LIBSVM: A Library for Support Vector Machines.
2001.
29. Chapman, P., J. Clinton, R. Kerber, T. Khabaza, T. Reinartz, and C.E.A.
Shearer, CRISP-DM 1.0: Step-by-Step Data Mining Guide. 2000.
30. Choi, J.J., P. Arabshahi, R.J. Marks, and T.P. Caudell. Fuzzy Parameter
Adaptation in Neural Systems. Proceedings of International Joint Conference
on Neural Networks. 1992.
31. Chua, L.O. and L. Yang, Cellular Neural Network: I. Theory II. Applications.
IEEE Transactions on Circuits and Systems, 1988.
32. Cichocki, A. and R. Unbehauen, Neural Networks for Optimization and Signal
Processing. 1992: Wiley.
33. Comon, P., Independent Component Analysis – A New Concept. Signal
Processing, 1994. 36(3).
34. Coussement, K. and D. Van den Poel, Integrating the Voice of Customers
through Call Center Emails into a Decision Support System for Churn
Prediction. Information and Management, 2008. 45.
35. Coussement, K. and D. Van den Poel, Improving Customer Attrition Prediction
by Integrating Emotions from Client/Company Interaction Emails and
Evaluating Multiple Classifiers. Expert Systems with Applications, 2009. 36.
36. Da Silva, I.N., D.N. Spatti, R.A. Flauzino, L.H. Bartocci, and S.F. Dos Reis,
Artificial Neural Networks: A Practical Course. 2017: Springer.
37. Denoeux, T. and R.H. Lengelle, Initializing Back Propagation Networks with
Prototypes. Neural Networks, 1993. 6.
38. Dietterich, T.G., Approximate Statistical Tests for Comparing Supervised
Classification Learning Algorithms. Neural Computation, 1998. 10(7).
39. Dong, J.X., A. Krzyzak, and C.Y. Suen, Fast SVM Training Algorithm with
Decomposition on Very Large Data Sets. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 2005. 27(4).
40. Drago, G.P. and S. Ridella, Statistically Controlled Activation Weight
Initialization. IEEE Transactions on Neural Networks, 1992. 3.
41. Duch, W., Uncertainty of Data, Fuzzy Membership Functions, and Multilayer
Perceptrons. IEEE Transactions on Neural Networks, 2005. 16.
42. Dunteman, G.H., Principal Components Analysis. 1989: Sage Publications.
43. Edwards, D.I., Introduction to Graphical Modeling 2nd Edition. 2000: Springer.
44. Fahlman, S.E. Fast Learning Variations on Backpropagation: An Empirical
Study. Proceedings of 1988 Connectionist Models Summer School. 1988.
45. Fahlman, S.E. and C. Lebiere, The Cascade-Correlation Learning Architecture.
Advances in Neural Information Processing Systems, 1990. 2.
46. Fayyad, U., G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, Advances in
Knowledge Discovery and Data Mining. 1996: AAAI Press.
47. Finnoff, W., Diffusion Approximations for the Constant Learning Rate
Backpropagation Algorithm and Resistance to Local Minima. Neural
Computation, 1994. 6(2).
48. Frawley, W., G. Piatetsky-Shapiro, and C. Matheus, Knowledge Discovery in
Databases – An Overview. Knowledge Discovery in Databases, 1991.
49. Freeman, M., The 2 Customer Lifecycles. Intelligent Enterprise, 1999. 2(16).
50. Fu, S.K., Efficient Learning of Markov Blanket and Markov Blanket Classifier.
2010, University of Montreal.
51. Fukunaga, K., Introduction to Statistical Pattern Recognition. 1990: Academic
Press.
52. Gartner, G. 2015; Available from: www.gartner.com.
53. Girshick, M.A., Principal Components. Journal of the American Statistical
Association, 1936. 31.
54. Girshick, M.A., On the Sampling Theory of Roots of Determinantal Equations.
The Annals of Mathematical Statistics, 1939. 10.
55. Gish, H. A Probabilistic Approach to the Understanding and Training of Neural
Network Classifiers. Proceedings of IEEE International Conference on
Acoustics, Speech, and Signal Processing. 1990.
56. Goh, Y.S. and E.C. Tan, Pruning Neural Networks During Training by
Backpropagation, IEEE Region 10's 9th Annual International Conference.
1994.
57. Gori, M. and M. Maggini, Optimal Convergence of Online Backpropagation.
IEEE Transactions on Neural Networks, 1996. 7(1).
58. Gorunescu, F., Data Mining Concepts, Models and Techniques. 2011: Springer
Verlag.
59. Gower, J.C., Some Distance Properties of Latent Root and Vector Methods
Used in Multivariate Analysis. Biometrika, 1966. 53.
60. Grumbach, S. and T. Milo, Towards Tractable Algebras for Bags. Journal of
Computer and System Sciences, 1996. 52(3).
61. Gupta, A. and S.M. Lam, Weight Decay Backpropagation for Noisy Data.
Neural Networks, 1998. 11.
62. Hand, D., H. Mannila, and P. Smyth, Principles of Data Mining. 2001: MIT
Press.
63. He, H. and Y. Ma, Imbalanced Learning Foundations, Algorithms, and
Applications. 2013: John Wiley & Sons.
64. Hebb, D.O., The Organization of Behavior. 1949: Wiley.
65. Hinton, G.E., Connectionist Learning Procedure. Artificial Intelligence, 1989.
40.
66. Hodgkin, A.L. and A.F. Huxley, A Quantitative Description of Membrane
Current and its Application to Conduction and Excitation in Nerve. Journal
of Physiology, 1952. 117.
67. Hopfield, J.J. Neural Networks and Physical Systems with Emergent
Collective Computational Abilities. Proceedings of National Academy of
Sciences of the USA. 1982.
68. Hotelling, H., Analysis of a Complex of Statistical Variables into Principal
Components. Journal of Educational Psychology, 1933. 24(6).
69. Hotelling, H., Simplified Calculation of Principal Components. Psychometrika,
1936. 1.
70. Hung, S., D. Yen, and H. Wang, Applying Data Mining to Telecom Churn
Management. Expert Systems with Applications, 2006. 31.
71. Hwang, J., S. Lay, and A. Lippman, Nonparametric Multivariate Density
Estimation: A Comparative Study. IEEE Transaction on Signal Processing,
1994. 42(10).
72. Ishikawa, M., Learning of Modular Structured Networks. Artificial
Intelligence, 1995. 75.
73. Jeffers, J.N.R., Two Case Studies in the Application of Principal Component
Analysis. Journal of the Royal Statistical Society, 1967. 16(3).
74. Jensen, F.V. and T.D. Nielsen, Bayesian Networks and Decision Graphs. 2007:
Springer.
75. Jiang, X., M. Chen, M.T. Manry, M.S. Dawson, and A.K. Fung, Analysis and
Optimization of Neural Networks for Remote Sensing. Remote Sensing
Reviews, 1994. 9.
76. Jimenez, L.O. and D.A. Landgrebe, Supervised Classification in
High-Dimensional Space: Geometrical, Statistical, and Asymptotical Properties of
Multivariate Data. IEEE Transactions on Systems, Man, and Cybernetics, 1998. 28.
77. Kanjilal, P.P. and D.N. Banerjee, On the Application of Orthogonal
Transformation for the Design and Analysis of Feedforward Networks. IEEE
Transactions on Neural Networks, 1995. 6(5).
78. Karnin, E.D., A Simple Procedure for Pruning Backpropagation Trained Neural
Networks. IEEE Transactions on Neural Networks, 1990. 1(2).
79. Kim, H. and C. Yoon, Determinants of Subscriber Churn and Customer
Loyalty in the Korean Mobile Telephony Market. Telecommunications Policy,
2004. 28.
80. Kim, J.O. and C.W. Mueller, Factor Analysis: Statistical Methods and Practical
Issues. 1978: Sage Publications.
81. Kirkpatrick, S., C.D. Gelatt, and M.P. Vecchi, Optimization by Simulated
Annealing. Science, 1983. 220(4598).
82. Kirui, C., L. Hong, W. Cheruiyot, and H. Kirui, Predicting Customer Churn in
Mobile Telephony Industry Using Probabilistic Classifiers in Data Mining.
International Journal of Computer Science Issues, 2013. 10(2).
83. Kisioglu, P. and Y.I. Topcu, Applying Bayesian Belief Network Approach to
Customer Churn Analysis: A Case Study on the Telecom Industry of Turkey.
Expert Systems with Applications, 2011. 38(6).
84. Kotler, P. and L. Keller, Marketing Management 12th Edition. 2006: Prentice
Hall.
85. Le Cun, Y., P.Y. Simard, and B. Pearlmutter, Automatic Learning Rate
Maximization by Online Estimation of the Hessian’s Eigenvectors. Advances
in Neural Information Processing Systems, 1993. 5.
86. Ledesma, S., M. Torres, D. Hernandez, G. Avina, and G. Garcia. Temperature
Cycling on Simulated Annealing for Neural Network Learning. Proceedings of
MICAI. 2007.
87. Lee, B.W. and B.J. Shen, Design and Analysis of Analog VLSI Neural
Networks. Neural Networks for Signal Processing, 1992.
88. Lee, K.C. and N.Y. Jo, Bayesian Network Approach to Predict Mobile Churn
Motivations: Emphasis on General Bayesian Network, Markov Blanket, and
What-If Simulation. Future Generation Information Technology, Lecture Notes
in Computer Science, 2010. 6485.
89. Lee, Y., S.H. Oh, and M.W. Kim. The Effect of Initial Weights on Premature
Saturation in Back-Propagation Training. Proceedings IEEE International Joint
Conference on Neural Networks. 1991.
90. Lehtokangas, M., P. Korpisaari, and K. Kaski, Maximum Covariance Method
for Weight Initialization of Multilayer Perceptron Networks. Proceedings of
European Symposium on Artificial Neural Networks, 1996.
91. Lehtokangas, M., J. Saarinen, P. Huuhtanen, and K. Kaski, Initializing Weights
of a Multilayer Perceptron Network by Using the Orthogonal Least Squares
Algorithm. Neural Computation, 1995. 7.
92. Loeve, M., Probability Theory 3rd Edition. 1963: Van Nostrand.
93. Luke, B.T., Simulated Annealing Cooling Schedules. 2007.
94. Madden, G., S. Savage, and G. Coble-Neal, Subscriber Churn in the Australian
ISP Market. Information Economics and Policy, 1999. 11.
95. Magoulas, G.D., V.P. Plagianakos, and M.N. Vrahatis, Globally Convergent
Algorithms with Local Learning Rates. IEEE Transactions on Neural
Networks, 2002. 13(3).
96. Masters, T., Practical Neural Network Recipes in C++. 1993: Academic Press.
97. Matsuoka, K. and J. Yi. Backpropagation Based on the Logarithmic Error
Function and Elimination of Local Minima. Proceedings of the International
Joint Conference on Neural Networks. 1991.
98. McCulloch, W.S. and W. Pitts, A Logical Calculus of the Ideas Immanent in
Nervous Activity. The Bulletin of Mathematical Biophysics, 1943. 5.
99. McLoone, S., M.D. Brown, G. Irwin, and G. Lightbody, A Hybrid
Linear/Nonlinear Training Algorithm for Feedforward Neural Networks. IEEE
Transactions on Neural Networks, 1998. 9(4).
100.Metropolis, N., A. Rosenbluth, M. Rosenbluth, A. Teller, and E. Teller,
Equation of State Calculations by Fast Computing Machines. Journal of
Chemical Physics, 1953. 21(6).
101.Minsky, M.L. and S. Papert, Perceptrons. 1969: MIT Press.
102.Mitchell, T., Machine Learning. 1997: McGraw-Hill.
103.Narayan, S., The Generalized Sigmoid Activation Function: Competitive
Supervised Learning. Information Sciences, 1997. 99.
104.Neslin, S., S. Gupta, W. Kamakura, J. Lu, and C. Mason, Defection Detection:
Measuring and Understanding the Predictive Accuracy of Customer Churn
Models. Journal of Marketing Research, 2006. 43.
105.Ng, S.C., S.H. Leung, and A. Luk, Fast Convergent Generalized Back-
Propagation Algorithm with Constant Learning Rate. Neural Processing
Letters, 1999. 9.
106.Nilsson, N.J., Introduction to Machine Learning. 1998: Stanford University.
107.Oja, E., A Simplified Neuron Model as a Principal Component Analyzer.
Journal of Mathematical Biology, 1982. 15.
108.Pearl, J., Probabilistic Reasoning in Intelligent Systems: Networks of Plausible
Inference. 1988: Morgan Kaufmann.
109.Pearson, K., On Lines and Planes of Closest Fit to Systems of Points in Space.
Philosophical Magazine, 1901. 6(2).
110.Pendharkar, P., Genetic Algorithm Based Neural Network Approaches for
Predicting Churn in Cellular Wireless Networks Service. Expert Systems with
Applications, 2009. 36.
111.Ponnapalli, P.V.S., K.C. Ho, and M. Thomson, A Formal Selection and Pruning
Algorithm for Feedforward Artificial Neural Network Optimization. IEEE
Transactions on Neural Networks, 1999. 10(4).
112.Rao, C.R., The Use and Interpretation of Principal Component Analysis in
Applied Research. The Indian Journal of Statistics, 1964. 26(4).
113.Reichheld, F. and W. Sasser, Zero Defection: Quality Comes to Services.
Harvard Business Review, 1990. 68(5).
114.Little, R.J.A. and D.B. Rubin, Statistical Analysis with Missing Data
2nd Edition. 2002: John Wiley & Sons.
115.Rosenblatt, F., The Perceptron: A Probabilistic Model for Information Storage
and Organization in the Brain. Psychological Review, 1958. 65.
116.Rosenblatt, F., Principles of Neurodynamics. 1962: Spartan Books.
117.Rumelhart, D.E. and J.L. McClelland, Parallel Distributed Processing:
Explorations in the Microstructure of Cognition. 1986: MIT Press.
118.Rumelhart, D.E., R. Durbin, R. Golden, and Y. Chauvin, Backpropagation: The
Basic Theory. Backpropagation: Theory, Architecture, and Applications, 1995.
119.Rumelhart, D.E., G.E. Hinton, and R.J. Williams, Learning Internal
Representations by Error Propagation. Parallel Distributed Processing:
Explorations in the Microstructure of Cognition, 1986. 1.
120.Sato, T., B.Q. Huang, Y. Huang, M.T. Kechadi, and B. Buckley. Using PCA to
Predict Customer Churn in Telecommunication Dataset. Proceedings of the 6th
International Conference on Advanced Data Mining and Applications. 2010.
121.Schmitt, M., On the Complexity of Computing and Learning with
Multiplicative Neural Networks. Neural Computation, 2002. 14(2).
122.Seo, D., C. Ranganathan, and Y. Babad, Two-Level Model of Customer
Retention in the US Mobile Telecommunications Service Market.
Telecommunications Policy, 2008. 32.
123.Shalev-Shwartz, S. and Y. Singer. A New Perspective on an Old Perceptron
Algorithm. Proceedings of the 16th Annual Conference on Computational
Learning Theory. 2005.
124.Sietsma, J. and R.J.F. Dow, Creating Artificial Neural Networks That
Generalize. Neural Networks, 1991. 4.
125.Silva, F.M. and L.B. Almeida, Speeding-up Backpropagation. Advanced
Neural Computers, 1990.
126.Smyth, S.G., Designing Multilayer Perceptrons from Nearest Neighbor
Systems. IEEE Transactions on Neural Networks, 1992. 3(2).
127.Solla, S.A., E. Levin, and M. Fleisher, Accelerated Learning in Layered Neural
Network. Complex Systems, 1988. 2.
128.Spirtes, P., C. Glymour, and R. Scheines, Causation, Prediction and Search 2nd
Edition. 2001: MIT Press.
129.Sumathi, S. and S.N. Sivanandam, Introduction to Data Mining and its
Applications. Studies in Computational Intelligence, 2006. 29.
130.Teoh, E.J., K.C. Tan, and C. Xiang, Estimating the Number of Hidden Neurons
in a Feedforward Network Using the Singular Value Decomposition. IEEE
Transactions on Neural Networks, 2006. 17(6).
131.Thimm, G. and E. Fiesler, High-Order and Multilayer Perceptron Initialization.
IEEE Transactions on Neural Networks, 1997. 8(2).
132.Valiant, L.G., A Theory of the Learnable. Communications of the ACM, 1984.
27(11).
133.Vapnik, V.N., The Nature of Statistical Learning Theory. 1995, New York:
Springer. xv, 188 p.
134.Vapnik, V.N., Statistical Learning Theory. Adaptive and Learning Systems for
Signal Processing, Communications, and Control. 1998, New York: Wiley.
xxiv, 736 p.
135.Vogl, T.P., J.K. Mangis, A.K. Rigler, W.T. Zink, and D.L. Alkon, Accelerating
the Convergence of the Backpropagation Method. Biological Cybernetics,
1988. 59.
136.Wang, J., J. Yang, and W. Wu, Convergence of Cyclic and Almost-Cyclic
Learning with Momentum for Feedforward Neural Networks. IEEE
Transactions on Neural Networks, 2011. 22(8).
137.Wei, C. and I. Chiu, Turning Telecommunications Call Details to Churn
Prediction: A Data Mining Approach. Expert Systems with Applications, 2002.
23.
138.Werbos, P.J., Beyond Regressions: New Tools for Prediction and Analysis in
the Behavioral Sciences. 1974, Harvard University.
139.Wessels, L.F.A. and E. Barnard, Avoiding False Local Minima by Proper
Initialization of Connections. IEEE Transactions on Neural Networks, 1992. 3.
140.Weymaere, N. and J.P. Martens, On the Initialization and Optimization of
Multilayer Perceptron. IEEE Transactions on Neural Networks, 1994. 5.
141.Widrow, B. and M.E. Hoff, Adaptive Switching Circuits. Record of IRE
Eastern Electronic Show and Convention, 1960. 4.
142.Widrow, B. and S.D. Stearns, Adaptive Signal Processing. 1985: Prentice Hall.
143.Wilson, D.R. and T.R. Martinez, The General Inefficiency of Batch Training
for Gradient Descent Learning. Neural Networks, 2003. 16.
144.Wolpert, D.H., The Relationship between PAC, the Statistical Physics
Framework, the Bayesian Framework, and the VC Framework. The
Mathematics of Generalization the SFI Studies in the Sciences of Complexity,
1995.
145.Wong, B.K., T.A. Bodnovich, and Y. Selvi, Neural Network Applications in
Business: A Review and Analysis of the Literature (1988–1995). Decision
Support Systems, 1997. 19.
146.Xing, H.J. and B.G. Hu, Two-Phase Construction of Multilayer Perceptrons
Using Information Theory. IEEE Transactions on Neural Networks, 2009.
20(4).
147.Yam, Y.F., C.T. Leung, P.K.S. Tam, and W.C. Siu, An Independent Component
Analysis Based Weight Initialization Method for Multilayer Perceptrons.
Neurocomputing, 2002. 48.
148.Yan, L., M. Fassino, and P. Baldasare. Predicting Customer Behavior via
Calling Links. Proceedings of International Joint Conference on Neural
Networks. 2005.
149.Zurada, J.M., A. Malinowski, and S. Usui, Perturbation Method for Deleting
Redundant Inputs of Perceptron Networks. Neurocomputing, 1997. 14.
