Are you sure?
This action might not be possible to undo. Are you sure you want to continue?
:
A Guide for Statisticians
Basilio de Braganc¸a Pereira
UFRJ  Universidade Federal do Rio de Janeiro
Calyampudi Radhakrishna Rao
PSU  Penn State University
June 2009
Contents
1 Introduction 1
2 Fundamental Concepts on Neural Networks 3
2.1 Artiﬁcial Intelligence: Symbolist & Connectionist . . . . . . . . . 3
2.2 The brain and neural networks . . . . . . . . . . . . . . . . . . . 4
2.3 Artiﬁcial Neural Networks and Diagrams . . . . . . . . . . . . . 6
2.4 Activation Functions . . . . . . . . . . . . . . . . . . . . . . . . 8
2.5 Network Architectures . . . . . . . . . . . . . . . . . . . . . . . 11
2.6 Network Training . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.7 Kolmogorov Theorem . . . . . . . . . . . . . . . . . . . . . . . . 14
2.8 Model Choice . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.8.1 Generalization . . . . . . . . . . . . . . . . . . . . . . . 16
2.8.2 Biasvariance tradeoff: Early stopping method of training 16
2.8.3 Choice of structure . . . . . . . . . . . . . . . . . . . . . 18
2.8.4 Network growing . . . . . . . . . . . . . . . . . . . . . . 18
2.8.5 Network pruning . . . . . . . . . . . . . . . . . . . . . . 19
2.9 Terminology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.10 Mccullochpitt Neuron . . . . . . . . . . . . . . . . . . . . . . . 25
2.11 Rosenblatt Perceptron . . . . . . . . . . . . . . . . . . . . . . . . 26
2.12 Widrow’s Adaline and Madaline . . . . . . . . . . . . . . . . . . 29
3 Some Common Neural Networks Models 31
3.1 Multilayer Feedforward Networks . . . . . . . . . . . . . . . . . 31
3.2 Associative and Hopﬁeld Networks . . . . . . . . . . . . . . . . . 35
3.3 Radial Basis Function Networks . . . . . . . . . . . . . . . . . . 44
3.4 Wavelet Neural Networks . . . . . . . . . . . . . . . . . . . . . . 47
3.4.1 Wavelets . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.4.2 Wavelet Networks and Radial Basis Wavelet Networks. . . 51
3.5 Mixture of Experts Networks . . . . . . . . . . . . . . . . . . . . 53
3.6 Neural Network and Statistical Models Interface . . . . . . . . . . 55
1
CONTENTS 2
4 Multivariate Statistics Neural Networks 56
4.1 Cluster and Scaling Networks . . . . . . . . . . . . . . . . . . . 56
4.1.1 Competitive networks . . . . . . . . . . . . . . . . . . . 56
4.1.2 Learning Vector Quantization  LVQ . . . . . . . . . . . . 59
4.1.3 Adaptive Resonance Theory Networks  ART . . . . . . . 61
4.1.4 Self Organizing Maps  (SOM) Networks . . . . . . . . . 68
4.2 Dimensional Reduction Networks . . . . . . . . . . . . . . . . . 78
4.2.1 Basic Structure of the Data Matrix . . . . . . . . . . . . . 80
4.2.2 Mechanics of Some Dimensional Reduction Techniques . 82
4.2.2.1 Principal Components Analysis  PCA . . . . . 82
4.2.2.2 Nonlinear Principal Components . . . . . . . . 82
4.2.2.3 Factor Analysis  FA . . . . . . . . . . . . . . . 83
4.2.2.4 Correspondence Analysis  CA . . . . . . . . . 83
4.2.2.5 Multidimensional Scaling . . . . . . . . . . . . 84
4.2.3 Independent Component Analysis  ICA . . . . . . . . . . 85
4.2.4 PCA networks . . . . . . . . . . . . . . . . . . . . . . . 91
4.2.5 Nonlinear PCA networks . . . . . . . . . . . . . . . . . 94
4.2.6 FA networks . . . . . . . . . . . . . . . . . . . . . . . . 95
4.2.7 Correspondence Analysis Networks  CA . . . . . . . . . 98
4.2.8 Independent Component Analysis Networks  ICA . . . . 99
4.3 Classiﬁcation Networks . . . . . . . . . . . . . . . . . . . . . . . 101
5 Regression Neural Network Models 118
5.1 Generalized Linear Model Networks  GLIMN . . . . . . . . . . 118
5.1.1 Logistic Regression Networks . . . . . . . . . . . . . . . 120
5.1.2 Regression Network . . . . . . . . . . . . . . . . . . . . 125
5.2 Nonparametric Regression and Classiﬁcation Networks . . . . . . 126
5.2.1 Probabilistic Neural Networks . . . . . . . . . . . . . . . 126
5.2.2 General Regression Neural Networks . . . . . . . . . . . 129
5.2.3 Generalized Additive Models Networks . . . . . . . . . . 131
5.2.4 Regression and Classiﬁcation Tree Network . . . . . . . . 132
5.2.5 Projection Pursuit and FeedForward Networks . . . . . . 134
5.2.6 Example . . . . . . . . . . . . . . . . . . . . . . . . . . 135
6 Survival Analysis and others networks 136
6.1 Survival Analysis Networks . . . . . . . . . . . . . . . . . . . . 136
6.2 Time Series Forecasting Networks . . . . . . . . . . . . . . . . . 148
6.2.1 Forecasting with Neural Networks . . . . . . . . . . . . . 150
6.3 Control Chart Networks . . . . . . . . . . . . . . . . . . . . . . . 155
6.4 Some Statistical Inference Results . . . . . . . . . . . . . . . . . 160
6.4.1 Estimation methods . . . . . . . . . . . . . . . . . . . . . 160
CONTENTS 3
6.4.2 Interval estimation . . . . . . . . . . . . . . . . . . . . . 165
6.4.3 Statistical Tests . . . . . . . . . . . . . . . . . . . . . . . 166
List of Figures
2.1 Neural network elements . . . . . . . . . . . . . . . . . . . . . . 5
2.2 Model of an artiﬁcial neuron . . . . . . . . . . . . . . . . . . . . 7
2.3 Activation functions: graphs and ANN representation . . . . . . . 8
2.4 Singlelayer network . . . . . . . . . . . . . . . . . . . . . . . . 12
2.5 Multilayer network . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.6 Recurrent network . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.7 Network ﬁts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.8 Model choice: training x validation . . . . . . . . . . . . . . . . . 18
2.9 McCullogPitt neuron . . . . . . . . . . . . . . . . . . . . . . . . 26
2.10 Two layer perceptron . . . . . . . . . . . . . . . . . . . . . . . . 27
2.11 Twodimensional example . . . . . . . . . . . . . . . . . . . . . 28
2.12 ADALINE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.1 Classes of problems . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.2 Feedforward network . . . . . . . . . . . . . . . . . . . . . . . . 32
3.3 Type of associative networks . . . . . . . . . . . . . . . . . . . . 36
3.4 Associative network . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.5 Schematic faces for Hebbian learning . . . . . . . . . . . . . . . 38
3.6 The features used to build the faces in Figure 3.5 . . . . . . . . . 39
3.7 Hopﬁeld network architecture . . . . . . . . . . . . . . . . . . . 40
3.8 Example Hopﬁeld network . . . . . . . . . . . . . . . . . . . . . 41
3.9 Classiﬁcation by alternative networks . . . . . . . . . . . . . . . 44
3.10 Radial basis functions h(.) . . . . . . . . . . . . . . . . . . . . . 45
3.11 Radial basis network . . . . . . . . . . . . . . . . . . . . . . . . 46
3.12 Some wavelets: (a) Haar (b) Morlet (c) Shannon . . . . . . . . . . 49
3.13 Wavelet operation . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.14 An Example of a Wavelet Transform. . . . . . . . . . . . . . . . 50
3.15 Radial Basis Wavelet Network . . . . . . . . . . . . . . . . . . . 52
3.16 Wavelon . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.17 Wavelet network . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.18 Mixture of experts networks . . . . . . . . . . . . . . . . . . . . 54
4
LIST OF FIGURES 5
4.1 General Winner Take All network . . . . . . . . . . . . . . . . . 57
4.2 Success of classiﬁcation for all ANNs and all tested movement trials 61
4.3 Architecture of ART1 network . . . . . . . . . . . . . . . . . . . 62
4.4 Algorithm for updating weights in ART1 . . . . . . . . . . . . . . 63
4.5 Topology of neighboring regions . . . . . . . . . . . . . . . . . . 69
4.6 Kohonen selforganizing map. . . . . . . . . . . . . . . . . . . . 70
4.7 Linear array of cluster units . . . . . . . . . . . . . . . . . . . . . 70
4.8 The selforganizing map architecture . . . . . . . . . . . . . . . . 71
4.9 Neighborhoods for rectangular grid . . . . . . . . . . . . . . . . . 71
4.10 Neighborhoods for hexagonal grid . . . . . . . . . . . . . . . . . 72
4.11 Algorithm for updating weights in SOM . . . . . . . . . . . . . . 72
4.12 Mexican Hat interconnections . . . . . . . . . . . . . . . . . . . 73
4.13 Normal density function of two clusters . . . . . . . . . . . . . . 73
4.14 Sample from normal cluster for SOM training . . . . . . . . . . . 73
4.15 Initial weights . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.16 SOM weights, 50 iterations . . . . . . . . . . . . . . . . . . . . . 74
4.17 SOM weights, 300 iterations . . . . . . . . . . . . . . . . . . . . 74
4.18 SOM weights, 4500 iterations . . . . . . . . . . . . . . . . . . . 74
4.19 Combined forecast system . . . . . . . . . . . . . . . . . . . . . 75
4.20 Week days associated with the SOM neurons . . . . . . . . . . . 75
4.21 Seasons associated with the SOM neurons . . . . . . . . . . . . . 76
4.22 The relation between ICA and some other methods . . . . . . . . 86
4.23 PCA and ICA transformation of uniform sources . . . . . . . . . 89
4.24 PCA and ICA transformation of speach signals . . . . . . . . . . 90
4.25 PCA networks  Autoassociative . . . . . . . . . . . . . . . . . . 91
4.26 PCA networks architecture  Oja rule. . . . . . . . . . . . . . . . 92
4.27 Oja´s rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
4.28 Non linear PCA network . . . . . . . . . . . . . . . . . . . . . . 95
4.29 Factor analysis APEX network . . . . . . . . . . . . . . . . . . . 96
4.30 Original variables in factor space . . . . . . . . . . . . . . . . . . 97
4.31 Students in factor space . . . . . . . . . . . . . . . . . . . . . . . 97
4.32 CA network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
4.33 ICA neural network . . . . . . . . . . . . . . . . . . . . . . . . . 99
4.34 Example of classiﬁcation of a mixture of independent components 100
4.35 Taxonomy and decision regions . . . . . . . . . . . . . . . . . . 102
4.36 Classiﬁcation by alternative networks . . . . . . . . . . . . . . . 103
4.37 A nonlinear classiﬁcation problem . . . . . . . . . . . . . . . . . 103
4.38 Boundaries for neural networks with tree abd six hidden units . . . 104
4.39 Four basic classiﬁer groups . . . . . . . . . . . . . . . . . . . . . 105
4.40 A taxonomy of classiﬁers . . . . . . . . . . . . . . . . . . . . . . 106
LIST OF FIGURES 6
5.1 GLIM network . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
5.2 Policholomous logistic network . . . . . . . . . . . . . . . . . . 122
5.3 Probabilistic Neural Network  PNN . . . . . . . . . . . . . . . . 128
5.4 General Regression Neural Network Architecture . . . . . . . . . 130
5.5 GAM neural network . . . . . . . . . . . . . . . . . . . . . . . . 131
5.6 Regression tree ˆ y . . . . . . . . . . . . . . . . . . . . . . . . . . 132
5.7 Decision Tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
5.8 Classiﬁcation Tree Network . . . . . . . . . . . . . . . . . . . . 134
5.9 GAM network . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
5.10 Network of networks . . . . . . . . . . . . . . . . . . . . . . . . 135
6.1 Classes of regression model for survival data . . . . . . . . . . . . 138
6.2 Neural network model for survival data . . . . . . . . . . . . . . 139
6.3 Log odds survival network . . . . . . . . . . . . . . . . . . . . . 142
6.4 Nonlinear survival network . . . . . . . . . . . . . . . . . . . . 143
6.5 Partial logistic regression network (PLANN) . . . . . . . . . . . . 145
6.6 Head and neck cancer trial . . . . . . . . . . . . . . . . . . . . . 146
6.7 Head and neck cancer trial . . . . . . . . . . . . . . . . . . . . . 147
6.8 Separate Architecture . . . . . . . . . . . . . . . . . . . . . . . . 153
6.9 Combined Architecture for forecasting x
t+1
. . . . . . . . . . . . 153
6.10 Graphical model of Muller and Rios Issua . . . . . . . . . . . . . 162
6.11 Graphical model of Neal . . . . . . . . . . . . . . . . . . . . . . 163
6.12 Onehidden layer feedforward neural network. . . . . . . . . . . . 167
Chapter 1
Introduction
This work is a tentative to bring the attention of statisticians the parallel work
that has been done by neural network researchers in the area of data analysis.
In this area of application (data analysis) it is important to show that most
neural network models are similar or identical to popular statistical techniques
such as generalized linear models, classiﬁcation, cluster, projection ..., generalized
additive models, principal components, factor analysis etc. On the other hand,
neural networks algorithms implementations are inefﬁcient because:
1. They are based on biological or engineering criteria such as how easy is to
ﬁt a net on a chip rather than on wellestablished statistical or optimization
criteria;
2. Neural networks are designed to be implemented on massively parallel com
puters such as PC. On a serial computer, standard statistical optimization
criteria usually will be more efﬁcient to neural networks implementation.
In an computational process we have the following hierarchical framework.
The top of the hierarchy is the computational level. This attempts to answer the
questions  what is being computed and why? The next level is the algorithm,
which describes how the computation is being carried out, and ﬁnally there is
the implementation level, which gives the details and steps of the facilities of the
algorithm.
Research in neural networks involve different group of scientists in neuro
sciences, psychology, engineering, computer science and mathematics. All these
groups are asking different questions: neuroscientists and psychologists want to
know how animal brain works, engineers and computer scientists want to build in
telligent machines and mathematicians want to understand the fundamentals prop
erties of networks as complex systems.
1
CHAPTER 1. INTRODUCTION 2
Mathematical research in neural networks is not specially concerned with
statistics but mostly with nonlinear dynamics, geometry, probability and other
areas of mathematics. On the other hand, neural network research can stimulate
statistics not only by pointing to new applications but also providing it with new
type of nonlinear modeling. We believe that neural network will become one
of the standard technique in applied statistics because of its inspiration, but also
statisticians have a range of problems in which they can contribute in the neural
network research.
Therefore, in this work we will concentrate on the top of the hierarchy of the
computational process, that is what is being done and why, and also howthe neural
network models and statistical models are related to tackle the data analysis real
problem.
The book is organized as follows: some basics on artiﬁcial neural networks
(ANN) is presented in Chapters 2 and 3. Multivariate statistical topics are pre
sented in Chapter 4. Chapter 5 deals with parametric and nonparametric regres
sion models and Chapter 6 deals with inference results, survival analysis, control
charts and time series. The references and the bibliography is referred to sections
of the chapters. To be fair authors which inspired some of our developments, we
have included them in the bibliography, even if they were not explicitly mentioned
in the text. The ﬁrst author is grateful to CAPES, an Agency of the Brazilian Min
istry of Education for a onde year support grant to visit Penn State University and
to UFRJ for a leave of absence.
Chapter 2
Fundamental Concepts on Neural
Networks
2.1 Artiﬁcial Intelligence: Symbolist & Connection
ist
Here we comment on the connection of neural network and artiﬁcial intelli
gence. We can think of artiﬁcial intelligence as intelligence behavior embodied in
humanmade machines. The concept of what is intelligence is outside the scope of
this guide. It comprises the following components: learning, reasoning, problem
solving, perception and language understanding.
From the early days of computing, there have existed two different approaches
to the problem of developing machines that might embody such behavior. One
of these tries to capture knowledge as a set of irreducible semantic objects or
symbols and to manipulate these according to a set of formal rules. The rules taken
together form a recipe or algorithm for processing the symbols. This approach is
the symbolic paradigm, which can be described as consisting of three phrases:
 choice of an intelligent activity to study;
 development of a logicsymbolic structure able to imitate (such as “if condition
1 and condition 2... then result”;)
 compare the efﬁciency of this structure with the real intelligent activity.
Note that the symbolic artiﬁcial intelligence is more concern in imitate intel
ligence and not in explaining it. Concurrent with this, has been another line of
research, which has used machines whose architecture is loosely based on the hu
man brain. These artiﬁcial neural networks are supposed to learn from examples
3
CHAPTER 2. FUNDAMENTAL CONCEPTS ON NEURAL NETWORKS 4
and their “knowledge” is stored in representations that are distributed across of a
set of weights.
The connectionist, or neural network approach, starts from the premise that
intuitive knowledge cannot be captured in a set of formalized rules. It postulated
that the physical structure of the brain is fundamental to emulate the brain func
tion and the understanding of the mind. For the connectionists, the mental process
comes as the aggregate result of the behavior of a great number of simple com
putational connected elements (neurons) that exchange signals of cooperation and
competition.
The form how these elements are interconnected are fundamental for the re
sulting mental process, and this fact is that named this approach connectionist.
Some features of this approach are as follows:
 information processing is not centralized and sequential but parallel and spa
cially distributed;
 information is stored in the connections and adapted as newinformation arrives;
 it is robust to failure of some of its computational elements.
2.2 The brain and neural networks
The research work on artiﬁcial neural networks has been inspired by our knowl
edge of biological nervous system, in particular the brain. The brain has features,
we will not be concerned with large scales features such as brain lobes, neither
with lower levels of descriptions than nerve cell, call neurons. Neurons have ir
regular forms, as illustrated in Figure 2.1.
The cell body (the central region) is called neuron. We will focus on morpho
logical characteristic which allow the neuron to function as an information pro
cessing device. This characteristic lies in the set of ﬁbers that emanate from the
cell body. One of these ﬁbers  the axon  is responsible for transmitting informa
tion to other neurons. All other are dendrites, when carry information transmitted
from other neurons. Dendrites are surrounded by the synaptic bouton of other
neurons, as in Figure 1c. Neurons are highly interconnected. Physical details are
given in Table 1 where we can see that brain is a very complicated system, and a
true model of the brain would be very complicated. Building such a model is the
task that scientists faces. Statisticians as engineers on the contrary, use simpliﬁed
models that they can actually build and manipulate usefully. As statisticians we
will take the view that brain models are used as inspiration to building artiﬁcial
neural networks as the wings of a bird were the inspiration for the wings of an
airplane.
CHAPTER 2. FUNDAMENTAL CONCEPTS ON NEURAL NETWORKS 5
Dendrites
Axon
Axonbranches
(a) Neurons
Synaptic
bouton
AxonBranch
(b) Sypatic bouton or synapse
Dendrite
Axon
branch
Synaptic
bouton
(c) Synaptic bouton making contact with a dendrite
Synapse
(d) Transmitters and receptor
in the synapse and dendrites
Figure 2.1: Neural network elements
CHAPTER 2. FUNDAMENTAL CONCEPTS ON NEURAL NETWORKS 6
Table 2.1: Some physical characteristics of the brain
Number of neurons 100 billion
Number of synapses/neurons 1000
Total number of synapses 100000 billion
(Number of processors of a computer) (100 millions)
Operation 100
(Processing speed in a computer) (1 billion)
Number of operations 10000 trillion/s
Human brain volume 300 cm
3
Dendrite length of a neuron 1 cm
Firing length of an axon 10 cm
Brain weight 1.5kg
Neuron weight 1.2 10
−9
g
Synapse Excitatory and Inhibitory
2.3 Artiﬁcial Neural Networks and Diagrams
An artiﬁcial neuron can be identiﬁed by three basic elements.
• A set of synapses, each characterized by a weight, that when positive means
that the synapse is excitated and when negative is inhibitory. A signal x
j
in
the input of synapse j connected to the neuron k is multiplied by a synaptic
weight w
jk
.
• Summation  to sumthe input signs, weighted by respective synapses (weights).
• Activation function  restrict the amplitude of the neuron output.
The Figure 2.2 presents the model of an artiﬁcial neuron:
CHAPTER 2. FUNDAMENTAL CONCEPTS ON NEURAL NETWORKS 7
Figure 2.2: Model of an artiﬁcial neuron
Models will be displayed as networks diagrams. Neurons are represented by
circles and boxes, while the connections between neurons are shown as arrows:
• Circles represent observed variables, with the name shown inside the circle.
• Boxes represent values computed as a function of one or more arguments. The
symbol inside the box indicates the type of function.
• Arrows indicate that the source of the arrow is an argument of the function
computed at the destination of the arrow. Each arrow usually has a weight
or parameter to be estimated.
• A thick line indicates that the values at each end are to be ﬁtted by same crite
rion such as least squares, maximum likelihood etc.
• An artiﬁcial neural network is formed by several artiﬁcial neurons, which are
inspired in biological neurons.
• Each artiﬁcial neuron is constituted by one or more inputs and one output.
These inputs can be outputs (or not) of other neurons and the output can
be the input to other neurons. The inputs are multiplied by weights and
summed with one constant; this total then goes through the activation func
tion. This function has the objective of activate or inhibit the next neuron.
• Mathematically, we describe the k
th
neuron with the following form:
u
k
=
p
j=1
w
kj
x
j
and y
k
= ϕ(u
k
−w
k0
), (2.3.1)
CHAPTER 2. FUNDAMENTAL CONCEPTS ON NEURAL NETWORKS 8
where x
0
, x
1
, . . . , x
p
are the inputs; w
k1
, . . . , w
kp
are the synaptic weights;
u
k
is the output of the linear combination; w
k0
is the bias; ϕ() is the activa
tion function and y
k
is the neuron output.
The ﬁrst layer of an artiﬁcial neural network, called input layer, is constituted
by the inputs x
p
and the last layer is the output layer. The intermediary layers are
called hidden layers.
The number of layers and the quantity of neurons in each one of them is de
termined by the nature of the problem.
A vector of values presented once to all the output units of an artiﬁcial neural
network is called case, example, pattern, sample, etc.
2.4 Activation Functions
An activation function for a neuron can be of several types. The most com
mons are presented in Figure 2.3:
y = ϕ(x) = x
0
f( ) x
x
a) Identity function (ramp)
Figure 2.3: Activation functions: graphs and ANN representation
CHAPTER 2. FUNDAMENTAL CONCEPTS ON NEURAL NETWORKS 9
y = ϕ(x) =
_
1 x ≥ b
−1 x < b
0 X
F( ) x
b
b) Step or threshold function
y = ϕ(x) =
_
¸
¸
¸
_
¸
¸
¸
_
1 x ≥ 1/2
z −1/2 ≤ x <
1
2
0 x ≤ −
1
2
0
X
F( ) x
½ ½
c) Piecewiselinear function
y = ϕ
1
(x) =
1
1 + exp(−ax)
0
X
F( ) x
aincreasing
d) Logistic (sigmoid)
Figure 2.3: Activation functions: graphs and ANN representation (cont.)
CHAPTER 2. FUNDAMENTAL CONCEPTS ON NEURAL NETWORKS 10
y = ϕ
2
(x) = 2ϕ
1
(x) −1
=
1 −exp(−ax)
1 + exp(−ax)
x
F( ) x
1
1
e) Symmetrical sigmoid
y = ϕ
3
(x) = tang
_
x
2
_
= ϕ
2
(x)
f) Hyperbolic tangent function
i. Normal ϕ(x) = Φ(x)
ii. Wavelet ϕ(x) = Ψ(x)
g) Radial Basis (some examples)
Figure 2.3: Activation functions: graphs and ANN representation (cont.)
CHAPTER 2. FUNDAMENTAL CONCEPTS ON NEURAL NETWORKS 11
h) Wavelet Transform
ϕ(x) =
2
j/2
Ψ
_
2
j
x −k
_
Figure 2.3: Activation functions: graphs and ANN representation (cont.)
The derivatives of the identity function is 1. The derivative of the step function
is not deﬁned, which is why it is not used.
The nice feature of the sigmoids is that they are easy to compute. The deriva
tive of the logistic is ϕ
1
(x)(1 −ϕ
1
(x)).
For the hyperbolic tangent the derivative is
1 −ϕ
2
3
(x). (2.4.1)
2.5 Network Architectures
Architecture refers to the manner in which neurons in a neural network are
organized. There are three different classes of network architectures.
1. SingleLayer Feedforward Networks
In a layered neural network the neurons are organized in layers. One input
layer of source nodes projects onto an output layer of neurons, but not vice
versa. This network is strictly of a feedforward type.
2. Multilayer Feedforward Networks
These are networks with one or more hidden layers. Neurons in each of the
layer have as their inputs the output of the neurons of preceding layer only.
This type of architecture covers most of the statistical applications
The neural network is said to be fully connected if every neuron in each
layer is connected to each node in the adjacent forward layer, otherwise it
is partially connected.
3. Recurrent Network
Is a network where at least one neuron connects with one neuron of the
preceding layer and creating a feedback loop.
This type of architecture is mostly used in optimization problems.
CHAPTER 2. FUNDAMENTAL CONCEPTS ON NEURAL NETWORKS 12
Figure 2.4: Singlelayer network
Figure 2.5: Multilayer network
CHAPTER 2. FUNDAMENTAL CONCEPTS ON NEURAL NETWORKS 13
Figure 2.6: Recurrent network
2.6 Network Training
The training of the network consists of adjusting the synapses or weight with
the objective of optimizing the performance of the network. From a statistical
point of view, the training corresponds to estimating the parameters of the model
given a set of data and an estimation criteria.
The training can be as follows:
• learning with a teacher or supervised learning: is the case in which to each
input vector the output is known, so we can modify the network synaptic
weights in an orderly way to achieve the desired objective;
• learning without a teacher: this case is subdivided into selforganized (or un
supervised) and reinforced learning. In both cases, we do not have the out
put, which corresponds to each input vector. Unsupervised training corre
sponds, for example, to the statistical methods of cluster analysis and princi
pal components. Reinforced learning training corresponds to, for example,
the creditassign problem. It is the problem of assigning a credit or a blame
for overall outcomes to each of the internal decisions made by a learning
machine. It occurs, for example, when the error is propagated back to the
neurons in the hidden layer of a feedforward network of section ??.
CHAPTER 2. FUNDAMENTAL CONCEPTS ON NEURAL NETWORKS 14
2.7 Kolmogorov Theorem
Most conventional multivariate data analysis procedures can be modeled as a
mapping
f : A →B
where A and B are ﬁnite sets. Consider the following examples (Murtagh, 1994):
(i) B is a set of discrete categories, and A a set of mdimensional vectors charac
terizing n observations. The mapping f is a clustering function and choos
ing the appropriate f is the domain of cluster analysis.
(ii) In dimensionalityreduction technique (PCA, FA, etc.), B is a space contain
ing n points, which has lower dimension than A. In these cases, the problem
of the data analyst is to ﬁnd the mapping f, without precise knowledge of
B.
(iii) When precise knowledge of B is available, B has lower dimension than A,
and f is to be determined, includes the methods under the names discrimi
nation analysis and regression.
Most of the neural network algorithm can also be described as method to de
termine a nonlinear f, where f : A → B where A is a set of n, mdimensional
descriptive vectors and where some members of B are known a priori. Also,
some of the most common neural networks are said to perform similar tasks of
statistical methods. For example, multilayer perception and regression analysis,
Kohonen network and multidimensional scaling, ART network (Adaptive Reso
nance Theory) and cluster analysis.
Therefore the determination of f is of utmost importance and made possible
due to a theorem of Andrei Kolmogorov about representation and approximation
of continuous functions, whose history and development we present shortly.
In 1900, Professor David Hilbert at the International Congress of Mathemati
cians in Paris, formulated 23 problems from the discussion of which “advance
ments of science may be expected”. Some of these problems have been related to
applications, where insights into the existence, but not the construction, of solu
tions to problems have been deduced.
The thirteenth problem of Hilbert is having some impact in the area of neural
networks, since its solution in 1957 by Kolmogorov. A modern variant of Kol
mogorov’s Theorem is given in Rojas (1996, p.265)( or Bose and Liang (1996,
p.158).
CHAPTER 2. FUNDAMENTAL CONCEPTS ON NEURAL NETWORKS 15
Theorem1. (Kolmogorov Theorem): “Any continuous function f(x
1
, x
2
, . . . , x
m
)
on n variables x
1
, x
2
, . . . , x
m
on I
n
= [0, 1]
m
can be represent in the form
f(x
1
, . . . , x
m
) =
2m+1
q=1
g
_
m
p=1
λ
p
h
q
(x
p
)
_
where g and h
q
, for q = 1, . . . , 2m + 1 are functions of one variable and λ
p
for
p = 1, . . . , m are constants, and the g
q
functions do not depend on f(x
1
, . . . , x
m
).
Kolmogorov Theorem states an exact representation for a continuous function,
but does not indicate how to obtain those functions. However if we want to ap
proximate a function, we do not demand exact reproducibility but only a bounded
approximate error. This is the approach taken in neural network.
From Kolmogorov’s work, several authors have improved the representation
to allow Sigmoids, Fourier, Radial Basis, Wavelets functions to be used in the
approximation.
Some of the adaptations of Kolmogorov to neural networks are (Mitchell,
1997, p.105):
• Boolean functions. Every boolean function can be represented exactly by
some network with two layers of units, although the number of hidden units
required grows exponentially in the worst case with the number of network
inputs.
• Continuous functions. Every bounded continuous function can be approxi
mated with arbitrary small error by a network with two layers. The result
in this case applies to networks that use sigmoid activation at the hidden
layer and (unthresholded) linear activation at the output layer. The number
of hidden units required depends on the function to be approximated.
• Arbitrary functions. Any function can be approximated to arbitrary accuracy
by a network with three layers of units. Again, the output layer uses linear
activation, the two hidden layers use sigmoid activation, and the number
of units required in each hidden layer also depends on the function to be
approximated.
2.8 Model Choice
We have seen that the data analyst’s interest is to ﬁnd or learn a function f
from A to B. In some cases (dimension reduction, cluster analysis), we don’t
have precise knowledge of B and the term “unsupervised” learning is used. On
CHAPTER 2. FUNDAMENTAL CONCEPTS ON NEURAL NETWORKS 16
the other hand, the term “supervised” is often applied to the situation where f is
to be determined, but where some knowledge of B is available (e.g. discriminant
and regression analysis).
In the case of supervised learning the algorithm to be used in the determination
of f embraces the phases: learning, where the mapping f is determined using the
training set, B
⊂ B; testing of how well f performs using a test set B
⊂ B,
B
not identical to B
; and an application. The testing and application phases are
termed generalization.
In a multilayer network, Kolmogorov’s results provide guidelines for the num
ber of layers to be used. However, some problems may be easier to solve using
two hidden layers. The number of units in each hidden layer and (the number)
of training interactions to estimate the weights of the network have also to be
speciﬁed. In this section, we will address each of these problems.
2.8.1 Generalization
By generalization we mean that the inputoutput mapping computed by the
network is correct for a test data never used. It is assumed that the training is a
representative sample from the environment where the network will be applied.
Given a large network, it is possible that loading data into the network it may
end up memorizing the training data and ﬁnding a characteristic (noise, for exam
ple) that is presented in the training data but not true in the underline function that
is to be modelled.
The training of a neural network may be view as a curve ﬁtting problem and if
we keep improving our ﬁtting, we end up with a very good ﬁt only for the training
data and the neural network will not be able to generalize (interpolate or predict
other sample values).
2.8.2 Biasvariance tradeoff: Early stopping method of train
ing
Consider the two functions in Figure 2.7 obtained from two training of a net
work.
The data was generated from function h(x), and the data from
y(x) = h(x) + . (2.8.1)
The training data are shown as circles (◦) and the new data by (+). The dashed
represents a simple model that does not do a good job in ﬁtting the new data. We
say that the model has a bias. The full line represents a more complex model
CHAPTER 2. FUNDAMENTAL CONCEPTS ON NEURAL NETWORKS 17
Figure 2.7: Network ﬁts:  high variance , . . . high bias
which does an excellent job in ﬁtting the data, as the error is close to zero. How
ever it would not do a good job for predicting the values of the test data (or new
values). We say that the model has high variance. We say in this case that the
network has memorize the data or is overﬁtting the data.
A simple procedure to avoid overﬁtting is to divide the training data in two
sets: estimation set and validation set. Every now and then, stop training and
test the network performance in the validation set, with no weight update during
this test. Network performance in the training sample will improve as long as the
network is learning the underlying structure of the data (that is h(s)). Once the
network stops learning things that are expected to be true of any data sample and
learns things that are true only of the training data (that is ), performance on the
validation set will stop improving and will get worse.
Figure 2.8 shows the errors on the training and validation sets. To stop over
ﬁtting we stop training at epoch E
0
.
Haykin (1999, p.215) suggest to take 80% of the training set for estimation
and 20% for validation.
CHAPTER 2. FUNDAMENTAL CONCEPTS ON NEURAL NETWORKS 18
Figure 2.8: Model choice: training x validation
2.8.3 Choice of structure
A large number of layers and neurons reduce bias but at the same time increase
variance. In practice it is usual to try several types of neural networks and to use
crossvalidation on the behavior in a test data, to choose the simplest network with
a good ﬁt.
We may achieve this objective in one of two ways.
• Network growing
• Network pruning
2.8.4 Network growing
One possible policy for the choice of the number of layers in a multilayer
network is to consider a nested sequence of networks of increasing sequence as
suggest by Haykin (1999, p.215).
• p networks with a single layer of increasing size h
1
< h
2
< . . . < h
p
.
• g networks with two hidden layers: the ﬁrst layer of size h
p
and the second
layer is of increasing size h
1
< h
2
< . . . < h
g
.
The crossvalidation approach similar to the early stopping method previously
presented can be used for the choice of the numbers of layers and also of neurons.
CHAPTER 2. FUNDAMENTAL CONCEPTS ON NEURAL NETWORKS 19
2.8.5 Network pruning
In designing a multilayer perception, we are building a nonlinear model of the
phenomenon responsible for the generation of the inputoutput examples used to
train the network. As in statistics, we need a measure of the ﬁt between the model
(network) and the observed data.
Therefore the performance measure should include a criterion for selecting the
model structure (number of parameter of the network). Various identiﬁcation or
selection criteria are described in the statistical literature. The names of Akaike,
Hannan, Rissanen, Schwartz are associated with these methods, which share a
common form of composition:
_
ModelComplexity
criteria
_
=
_
Performance
measure
_
+
_
ModelComplexity
penalty
_
The basic difference between the various criteria lies in the modelcomplexity
penalty term.
In the context of backpropagation learning or other supervised learning pro
cedure the learning objective is to ﬁnd a weight vector that minimizes the total
risk
R(W) = E
s
(W) + λE
c
(W) (2.8.2)
where E
s
(W) is a standard performance measure which depends on both the net
work and the input data. In backpropagation learning it is usually deﬁned as
meansquare error. E
c
(W) is a complexity penalty which depends on the model
alone.
The regularization parameter λ, which represents the relative importance of
the complexity term with respect to the performance measure. With λ = 0, the
network is completely determined by the training sample. When λ is large, it says
that the training sample is unreliable.
Note that this risk function, with the mean square error is closely related to
ridge regression in statistics and the work of Tikhonov on ill posed problems or
regularization theory (Orr, 1996, Haykin, 1999, p.219).
A complexity penalty used in neural networks is (Haykin, 1999, p.220)
E¦w¦ =
i
(w
i
/w
o
)
2
1 + (w
i
/w
o
)
2
(2.8.3)
where w
0
is a preassigned parameter. Using this criteria, weight’s w
i
’s with small
values should be eliminated.
CHAPTER 2. FUNDAMENTAL CONCEPTS ON NEURAL NETWORKS 20
Another approach to eliminate connections in the network, works directly in
the elimination of units and was proposed and applied by Park et al (1994). It
consists in applying PCA, principal components analysis as follows:
(i) initially train a network with an arbitrary large number P of hidden modes;
(ii) apply PCA to the covariance matrix of the outputs of the hidden layer;
(iii) choose the number of important components P
∗
(by one of the usual crite
rion: eigenvalues greater than one, proportion of variance explained etc.);
(iv) if P
∗
< P, pick P
∗
nodes out of P by examining the correlation between
the hidden modes and the selected principal components.
Note that the resulting network will not always have the optimal number of units.
The network may improve its performance with few more units.
2.9 Terminology
Although many neural networks are similar to statistical models, a different
terminology has been developed in neurocomputing. Table 2.2 helps to translate
from the neural network language to statistics (cf Sarle, 1996). Some further glos
sary list are given in the references for this section. Some are quite specialized
character: for the statistician, e.g.: Gurney (1997) for neuroscientists term, Addri
ans and Zantinge (1996) for data mining terminology.
Table 2.2: Terminology
Neural Network Jargon Statistical Jargon
Generalizing from noisy data and as
sessment of accuracy thereof
Statistical Inference
The set of all cases one wants to be able
to generalize to
Population
A function of the value in a population,
such as the mean or a globally optimal
synaptic weight
Parameter
Continued on next page
CHAPTER 2. FUNDAMENTAL CONCEPTS ON NEURAL NETWORKS 21
Neural Network Jargon Statistical Jargon
A function of the values in a sample,
such as the mean or a learned synaptic
weight
Statistic
Neuron, neurocode, unit, node process
ing element
A simple linear or nonlinear computing
element that accepts one or more inputs,
computes a function thereof, and may
direct the result to one or more other
neurons
Neural Networks A class of ﬂexible nonlinear regression
and discriminant models, data reduction
models, and nonlinear dynamical sys
tems consisting of an often large number
of neurons interconnected in often com
plex ways and often organized in layers
Statistical methods Linear regression and discriminant
analysis, simulated annealing, random
search
Architecture Model
Training, Learning, Adaptation Estimation, Model ﬁtting, Optimization
Classiﬁcation Discriminant analysis
Mapping, Function approximation Regression
Supervised learning Regression, Discriminant analysis
Unsupervised learning,
Selforganization
Principal components, Cluster analysis,
Data reduction
Competitive learning Cluster analysis
Hebbian learning, Cottrell/ Munzo/ Zis
per technique
Principal components
Training set Sample, Construction sample
Test set, Validation set Holdout sample
Continued on next page
CHAPTER 2. FUNDAMENTAL CONCEPTS ON NEURAL NETWORKS 22
Neural Network Jargon Statistical Jargon
Pattern, Vector, Example, Case Observation, Case
Binnary (0/1), Bivalent or Bipolar (1/1) Binary, Dichotomous
Input Independent variables, Predictors, Re
gressors, Explanatory variables, Carries
Output Predicted values
Forward propagation Prediction
Training values Dependent variables, Responses
Target values Observed values
Training par Observation containing both inputs and
target values
Shift register, (Tapped) (time) delay
(line), Input window
Lagged value
Errors Residuals
Noise Error term
Generalization Interpolation, Extrapolation, Prediction
Error bars Conﬁdence intervals
Prediction Forecasting
Adaline (ADAptive LInear NEuron) Linear twogroup discriminant analysis
(not a Fisher´s but generic)
(Nohiddenlayer) perceptron Generalized linear model (GLIM)
Activation function, Signal function,
Transfer function
Inverse link function in GLIM
Softmax Multiple logistic function
Continued on next page
CHAPTER 2. FUNDAMENTAL CONCEPTS ON NEURAL NETWORKS 23
Neural Network Jargon Statistical Jargon
Squashing function Bounded function with inﬁnitive do
main
Semilinear function Differentiable nondecreasing function
Phimachine Linear model
Linear 1hiddenlayer perceptron Maximum redundant analysis, Principal
components of instrumental variables
1hiddenlayer perceptron Projection pursuit regression
Weights Synaptic weights (Regression coefﬁcients, Parameter esti
mates)
Bias Intercept
The difference between the expected
value of a statistic and the correspond
ing true value (parameter)
Bias
OLS (Orthogonal least square) Forward stepwise regression
Probabilistic neural network Kernel discriminant analysis
General regression neural network Kernel regression
Topolically distributed enconding (Generalized) Additive model
Adaptive vector quantization Iterative algorithm of doubtful conver
gence for Kmeans cluster analysis
Adaptive Resonance Theory 2
n
d Hartigan´s leader algorithm
Learning vector quantization A form of piecewise linear discrimi
nant analysis using a preliminary cluster
analysis
Counterpropagation Regressogram based on kmeans clus
ters
Continued on next page
CHAPTER 2. FUNDAMENTAL CONCEPTS ON NEURAL NETWORKS 24
Neural Network Jargon Statistical Jargon
Encoding, Autoassociation Dimensionality reduction (Independent
and dependent variables are the same)
Heteroassociation Regression, Discriminant analysis (In
dependent and dependent variables are
different)
Epoch Iteration
Continuous training, Incremental train
ing, Online training, Instantaneous
training
Iteratively updating estimates one obser
vation at time via difference equation, as
in stochastic approximation
Batch training, Offline training Iteratively updating estimates after each
complete pass over the data as in most
nonlinear regression algorithms
Shortcuts, Jumpers, Bypass connec
tions, direct linear feedthrough (direct
connections from input to output)
Main effects
Funcional links Iteration terms or transformations
Secondorder network Quadratic regression, Responsesurface
model
Higherorder networks Polynomial regression, linear model
with iteration terms
Instar, Outstar Iterative algorithms of doubtful conver
gence for approximating an arithmetic
mean or centroid
Deltarule, adaline rule, WidrowHoff
rule, LMS (Least Mean Square) rule
Iterative algorithm of doubtful conver
gence for training a linear perceptron by
least squares, similar to stochastic ap
proximation
Training by minimizing the median of
the squared errors
LMS (Least Median of Square)
Continued on next page
CHAPTER 2. FUNDAMENTAL CONCEPTS ON NEURAL NETWORKS 25
Neural Network Jargon Statistical Jargon
Generalized delta rule Iterative algorithm of doubtful conver
gence for training a nonlinear percep
tron by least squares, similar to stochas
tic approximation
Backpropagation Computation of derivatives for a multi
layer perceptron and various algorithms
(such as generalized delta rule) based
thereon
Weight decay, Regularization Shrinkage estimation, Ridge regression
Jitter Random noise added to the inputs to
smooth the estimates
Growing, Prunning, Brain damage, self
structuring, ontogeny
Subset selection, Model selection, Pre
test estimation
Optimal brain surgeon Wald test
LMS (Least Mean Squares) OLS (Ordinary Least Squares) (see also
”LMS rule” above)
Relative entropy, Cross entropy KullbackLeibler divergence
Evidence framework Empirical Bayes estimation
2.10 Mccullochpitt Neuron
The McCullochPitt neuron is a computational unit of binary threshold, also
called Linear Threshold Unit (LTU). The neuron receive the weighted sum of
inputs and has as output the value 1 if the sum is bigger than this threshold and it
is shown in Figure 2.9.
Mathematically, we represent this model as
y
k
= ϕ
_
p
j=1
ω
kj
−ω
o
_
(2.10.1)
CHAPTER 2. FUNDAMENTAL CONCEPTS ON NEURAL NETWORKS 26
Figure 2.9: McCullogPitt neuron
where y
k
is an output of neuron k, ω
jk
is the weight of neuron j to neuron k. x
j
is
the output of neuron j, ω
o
is the threshold to the neuron k and ϕ is the activation
function deﬁned as
ϕ(input) =
_
1 if input ≥ ω
0
−1 otherwise
(2.10.2)
2.11 Rosenblatt Perceptron
The Rosenblatt perceptron is a model with several McCullochPitt neurons.
Figure 2.10 shows one such a model with two layers of neurons
The one layer perceptron is equivalent to a linear discriminant:
ω
j
x
j
−ω
0
. (2.11.1)
This perceptron creates an output indicating if it belongs to class 1 or 1 (or 0).
That is
y = Out =
_
1
ω
i
x
i
−ω
0
≥ 0
0 otherwise
(2.11.2)
The constant ω
0
is referred as threshold or bias. In the perceptron, the weights are
constraints and the variables are the inputs. Since the objective is to estimate (or
CHAPTER 2. FUNDAMENTAL CONCEPTS ON NEURAL NETWORKS 27
Figure 2.10: Two layer perceptron
learn) the optimum weights, as in other linear discriminants, the next step is to
train the perceptron.
The perceptron is trained with samples using a sequential learning procedure
to determine the weight as follows: The samples are presented sequentially, for
each wrong output the weights are adjusted to correct the error. If the output is
corrected the weights are not adjusted. The equations for the procedure are:
ω
i
(t + 1) = ω
i
(t) + ∆ω
i
(t) (2.11.3)
ω
0
(t + 1) = ω
0
(t) + ∆ω
0
(t) (2.11.4)
and
∆ω
i
(t) = (y − ˆ y)x
i
(2.11.5)
∆ω
0
(t) = (y − ˆ y) (2.11.6)
Formally, the actual weight in true t is ω
i
(t), the new weight is ω
i
(t+1), ∆ω
i
(t) is
the adjustment or correction factor, y is the correct value or true response, ˆ y is the
perceptron output, a is the activation value for the output ˆ y = f(a) (a =
ω
i
x
i
).
The initial weight values are generally random numbers in the interval 0 and
1. The sequential presentation of the samples continues indeﬁnitely until some
stop criterion is satisﬁed. For example, if 100 cases are presented and if there is
no error, the training will be completed. Nevertheless, if there is an error, all 100
cases are presented again to the perceptron. Each cycle of presentation is called
CHAPTER 2. FUNDAMENTAL CONCEPTS ON NEURAL NETWORKS 28
an epoch. To make learning and convergence faster, some modiﬁcations can be
done in the algorithm. For example,
(i) To normalize the data: all input variables must be within the interval 0 and 1.
(ii) To introduce a learning rate, α, in the weight actualization procedure to αω
i
.
The value of α is a number in the interval 0 and 1.
The perceptron converge theorem states that when two classes are linearly sep
arable, the training guarantees the converge but not how fast. Thus if a linear
discriminant exists that can separate the classes without committing an error, the
training procedure will ﬁnd the line or separator plan, but it is not known how long
it will take to ﬁnd it (Figure 2.11).
Table 2.3 illustrates the computation of results of this section for the example
of Figure 2.11. Here convergence is achieved in one epoch. Here α = 0.25.
Figure 2.11: Twodimensional example
Table 2.3: Training with the perceptron rule on a twoinput example
ω
1
ω
2
ω
0
x
1
x
2
a y ˆ y a(ˆ y −y) ∆ω
1
∆ω
2
∆ω
0
0.0 0.4 0.3 0 0 0 0 0 0 0 0 0
0.0 0.4 0.3 0 1 0.4 1 0 0.25 0 0.25 0.25
0.0 0.15 0.55 1 0 0 0 0 0 0 0 0
0.0 0.15 0.55 1 1 0.15 0 1 0.25 0.25 0.25 0.25
CHAPTER 2. FUNDAMENTAL CONCEPTS ON NEURAL NETWORKS 29
2.12 Widrow’s Adaline and Madaline
Another neuron model, using as a training method the method of Least Mean
Square (LMS) error, was developed by Widrow and Hoff and called ADALINE
(ADAptive LINear Element). A generalization which consist of a structure of
many ADALINE is the MADALINE.
The major difference between the Perceptron (or LTU) and the ADALINE is in
the activation function. While the perceptron uses the step or threshold function,
the ADALINE uses the identity function as in Figure 2.12.
Figure 2.12: ADALINE
To a set of samples, the performance measure fo the training is:
D
MLS
=
(˜ y −y)
2
(2.12.1)
The LMS training objective is to feed the weigths that minimize D
MLS
.
A comparison of training for the ADALINE and Perceptron for a two dimen
sional neuron with w
0
= −1, w
1
= 0.5, w
2
= 0.3 and x
1
= 2, x
2
= 1 would
result in an activation value of:
a = 2x0.5 + 1x3 −1 = 0.3 (2.12.2)
Table 2.4 gives the updating results for the sample case.
The technique used to update the weight is called gradient descent. The per
ceptron and the LMS search for a linear separator. The appropriate line or plane
can only be obtained if the class are linear separable. To overcome this limitation
the more general neural network model of the next chapter can be used.
CHAPTER 2. FUNDAMENTAL CONCEPTS ON NEURAL NETWORKS 30
Table 2.4: Comparison of updating weights
Perceptron Adaline
˜ y 1 0.3
w
1
(t + 1) 0.5 + (0 −1).2 = −1.5 0.5 + (0 −0.3) = −0.1
w
2
(t + 1) 0.3 + (0 −1).2 = −0.7 0.3 + (0 −0.3).1 = 0
w
0
(t + 1) −1 + (0 −1) = −2 −1 + (0 −0.3) = −1.3
Chapter 3
Some Common Neural Networks
Models
3.1 Multilayer Feedforward Networks
This section deals with networks that use more than one perceptron arranged
in layers. This model is able to solve problems that are not linear separable such
as in Figure 3.1 (it is of no help to use layers of ADALINE because the resulting
output would be linear).
In order to train a newlearning rule is needed. The original perceptron learning
cannot be extended to the multilayered network because the output function is not
differentiable. If a weight is to be adjusted anywhere in the network, its effect
in the output of the network has to be known and hence the error. For this, the
derivative of the error or criteria function with respect to that weight must be
found, as used in the delta rule for the ADALINE.
So a newactivation function is needed that is nonlinear, otherwise nonlinearly
separable functions could not be implemented, but which is differentiable. The
one that is most often used successfully in multilayered network is the sigmoid
function.
Before we give the general formulation of the generalized delta rule, called
backpropagation, we present a numerical illustration for the Exclusive OR (XOR)
problem of Figure 3.1(a) with the following inputs and initial weights, using the
network of Figure 3.2.
The neural network of Figure 3.2 has one hidden layer with 2 neurons, one
input layer with 4 samples and one neuron in the output layer.
Table 3.1 presents the forward computation for one sample (1,1) in the XOR
problem.
31
CHAPTER 3. SOME COMMON NEURAL NETWORKS MODELS 32
decision
line
0.5
0.5
W
1 0
0
x
1
x
2
1
(a) Exclusive OR (XOR)
A A
A
A
A
A
B
B
B
B
B
B
(b) Linear separable class
A
A
A
A
B
B
B
B
C
C
C
C
C
C
(c) Nonlinear separable
SeparationbyHyperplanes
Depressed
NotDepressed
ClusteringbyEllipsoids
(d) Separable by twolayer of perceptron
Depressed
NotDepressed
Mesure1
M
e
s
u
r
e
2
(e) Separable by feedforward
(sigmoid) network, psycho
logical measures
Figure 3.1: Classes of problems
Figure 3.2: Feedforward network
CHAPTER 3. SOME COMMON NEURAL NETWORKS MODELS 33
Table 3.1: Forward computation
Neuron Input Output
3 .1+.3+.2+.6 1/(1 + exp(−.6)) = .65
4 .2.2+.4.3.1 1/(1 + exp(.1)) = .48
5 .5
∗
.65 −.4
∗
.48 + .4 = .53 1/(1 + exp(−.53)) = .63
Table 3.2 gives the calculus for the backpropagation of the error for the XOR
example.
Table 3.2: Error backpropagation
Neuron Error Adjust of tendency
5 .63
∗
(1 −.63)
∗
(0 −.63) = −.147 .147
4 .48
∗
(1 −.48)
∗
(−.147)
∗
(−.4) = .015 .015
3 .65
∗
(1 −.65)
∗
(−.147)
∗
(−.4) = −.017 .017
Table 3.3 gives the weight adjustment for this step of the algorithm.
Table 3.3: Weight adjustment
Weight Value
w
45
−.4 + (−.147)
∗
.48 = −.47
w
35
.5 + (−.147)
∗
.65 = .40
w
24
.4 + (.015)
∗
1 = .42
w
23
.3 + (−017)
∗
1 = .28
w
14
−.2 + (.015)
∗
1 = .19
w
13
.1 + (−.017)
∗
1 = .08
w
05
.4 + (−.147) = .25
w
04
−.3 + (.015) = −.29
w
03
.2 + (−.017) = .18
The performance of this update procedure suffer the same drawbacks of any
gradient descent algorithm. Several heuristic rules have been suggested to im
prove its performance, in the neural network literature (such as normalize the
inputs to be in the interval 0 and 1, initial weights chosen at random in the interval
CHAPTER 3. SOME COMMON NEURAL NETWORKS MODELS 34
0.5 and 0,5; to include a momentum parameter to speed convergence etc.) Here
we only present the general formulation of the algorithm.
The backpropagation algorithm
Let L be the number of layers. The different layers are denoted by L
1
(the
input later), L
2
, . . . , L
M
(the output layer). We assume the output has M neurons.
When neurons i and j are connected a weight ω
ij
is associated with the connec
tion. In a multilayer feedforward network, only neurons in subsequent layers can
be connected:
ω
ij
= 0 ⇒iL
l+1
, j ∈ L
l
1 ≤ l ≤ L −1.
The state or input or neuron j is characterized by z
j
. The network operates as
follows. The input layer assigns z
j
to neurons j in L
1
. The output of the neurons
in L
2
are
z
k
= f
_
j∈L
1
ω
kj
z
j
_
k ∈ L
2
(3.1.1)
where f is an activation function that has derivatives for all values of the argument.
For a neuron m in an arbitrary layer we abbreviate
a
m
=
j∈L
l−1
ω
mj
z
j
2 ≤ l ≤ L (3.1.2)
and where the outputs of layer l −1 is known,
z
m
= f(a
m
) m ∈ L
l
. (3.1.3)
We concentrate on the choice of weights when one sample (or pattern) is presented
to the network. The desired output, true value, or target will be denoted by t
k
, k ∈
L
L
(k = 1, . . . , M).
We want to adapt the initial guess for the weights so that to decrease:
D
LMS
=
k∈L
L
(t
k
−f(a
k
))
2
. (3.1.4)
If the function f is nonlinear, then D is a nonlinear function of the weights, the
gradient descent algorithm for the multilayer network is as follows.
The weights are updated proportional to the gradient of D with respect to the
weights, the weights ω
mj
is changed by an amount
∆ω
mj
= −α
∂D
∂ω
mj
m ∈ L
l
, j ∈ L
l−1
(3.1.5)
CHAPTER 3. SOME COMMON NEURAL NETWORKS MODELS 35
where α is called the learning rate. For the last and next to last layer the updates
are given by substituting (3.1.3) and (3.1.4) in (3.1.5).
∆ω
mj
= −α
∂
∂ω
mj
p∈L
L
_
_
t
p
−f
_
_
q∈L
L−1
ω
pq
z
q
_
_
_
_
2
= −α2
p∈L
L
(t
p
−f(a
p
)) (−f
(a
p
))
g∈%mcl
L−1
δ
pm
δ
gj
z
g
= −α2 (t
m
−f(a
m
)) f
(a
m
)z
j
m ∈ L
L
, j ∈ L
L−1
.
(3.1.6)
We assume that the neuron is able to calculate the function f and its derivative
f
. The error 2α(t
m
−f(a
m
)) f
(a
m
) can be sent back (or propagated back) to
neuron j ∈ L
L−1
. The value z
j
is present in neuron j so that the neuron can
calculate ∆ω
mj
. In an analogous way it is possible to derive the updates of the
weights ending in L
L−1
:
∆ω
mj
= −α
∂
∂ω
mj
p∈L
L
_
_
_
t
p
−f
_
_
g∈L
L−1
ω
pg
f(ω
qr
z
r
)
_
_
_
_
_
2
= −α2
p∈L
L
[t
p
−f(a
p
)] (−f
(a
p
))
g∈%mcl
L−1
δ
gm
δ
xj
z
r
= −α2
p∈L
L
(t
p
−f(a
p
)) f
(a
p
)ω
pm
f
(a
m
)z
j
, m ∈ L
L−1
, j ∈ L
L−2
.
(3.1.7)
Now, 2α[t
p
−f(a
p
)] f
(a
p
)ω
pm
is the weighted sum of the errors sent from layer
L
L
to neuron m in layer L
L−1
. Neuron m calculate this quantity using ω
pm
, p ∈
L
p
. Multiply this quantity f
(a
m
) and send it to neuron j that can calculate ∆ω
mj
and update ω
mj
.
This weight update is done for each layer in decreasing order of the layers,
until ∆
mj
, m ∈ L
2
, j ∈ L
1
. This is the familiar backpropagation algorithm. Note
that for the sigmoid (symmetric or logistic) and hyperbolic activation function
the properties of its derivatives makes it easier to implement the backpropagation
algorithm.
3.2 Associative and Hopﬁeld Networks
In unsupervised learning we have the following groups of learning rules:
• competitive learning
CHAPTER 3. SOME COMMON NEURAL NETWORKS MODELS 36
• Hebbian learning
When the competitive learning is used, the neurons compete among them to
update the weights. These algorithms have been used in classiﬁcation, signal
extraction, cluster. Examples of networks that use these rules are ART (Adaptive
Resonance Theory) and SOM (Self Organized Maps) networks.
Hebbian learning is based on the weight actualization due to neurophisiolo
gist Hebb. This algorithm have been used in characteristics extraction, associative
memories. Example of network that use these rules are: Hopﬁeld and PCA (Prin
cipal Component Analysis) networks. The neural networks we have discussed so
far have all been trained in a supervised manner. In this section, we start with
associative networks, introducing simple rules that allow unsupervised learning.
Despite the simplicity of their rules, they form the foundation for the more pow
erful networks of latter chapters.
The function of an associative memory is to recognize previous learned input
vectors, even in the case where some noise is added. We can distinguished three
kinds of associative networks:
 Heteroassociative networks: map m input vectors x
1
, x
2
, . . . , x
m
in
pdimensional spaces to m output vector y
1
, y
2
, . . . , y
m
in kdimensional
space, so that x
i
→y
i
.
 Autoassociative networks are a special subsets of heteroassociative networks,
in which each vector is associated with itself, i.e. y
i
= x
i
for i = 1, . . . , n.
The function of the network is to correct noisy input vector.
 Pattern recognition (cluster) networks are a special type of heteroassociative
networks. Each vector is associated with the scalar (class) i. The goal of
the network is to identify the class of the input pattern.
Figure 3.3 summarizes these three networks.
Figure 3.3: Type of associative networks
CHAPTER 3. SOME COMMON NEURAL NETWORKS MODELS 37
The three kinds of associative networks can be implemented with a single
layer of neurons. Figure 3.4 shows the structure of the heteroassociative network
without feedback.
Figure 3.4: Associative network
Let W = (ω
ij
) the p k weight matrix. The row vector x = (x
1
, . . . , x
p
) and
the identity activation produces the activation
y
1×k
= x
1p
W
p×k
. (3.2.1)
The basic rule to obtain the elements of W that memorize a set of n associa
tions (x
i
y
i
) is
W
p×k
=
n
i=1
x
i
p×1
◦ y
i
1k
(3.2.2)
that is for n observations, let X be the n p matrix whose rows are the input
vector and Y be the n k matrix whose rows are the output vector. Then
W
p×k
= X
p×n
◦ Y
nk
. (3.2.3)
Therefore in matrix notation we also have
Y = XW (3.2.4)
Although without a biological appeal, one solution to obtain the weight matrix
is the OLAM  Optimal Linear Associative Memory. If n = p, X is square matrix,
if it is invertible, the solution of (3.2.1) is
W = X
−1
Y (3.2.5)
If n < p and the input vectors are linear independent the optimal solution is
CHAPTER 3. SOME COMMON NEURAL NETWORKS MODELS 38
W
p×k
=
_
X
pn
X
np
_
−1
X
pn
Y (3.2.6)
which minimizes
[[WX − Y [[
2
and (X
X)
−1
X
is the pseudoinverse of
X. If the input vectors are linear dependent we have to use regularization theory
(similar to ridge regression).
The biological appeal to update the weight is to use the Hebb rule
W = ηX
Y (3.2.7)
as each sample presentation where η is a learning parameter in 0 and 1.
The association networks are useful for face, ﬁgure etc. recognition and com
pression, and also function optimization (such as Hopﬁeld network).
We will present some of this association networks using some examples of the
literature.
Consider the example (Abi et al, 1999) where we want to store a set of schematic
faces given in Figure 3.5 in an associative memory using Hebbian learning. Each
face is represented by a 4dimensional binary vector, denoted by x
i
in which a
given element corresponds to one of the features from Figure 3.6.
Figure 3.5: Schematic faces for Hebbian learning. Each face is made of four
features (hair,eyes,nose and mouth), taking the value +1 or −1
Suppose that we start with and η = 1.
W
0
=
_
¸
¸
_
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
_
¸
¸
_
(3.2.8)
The ﬁrst face is store in the memory by modifying the weights to
CHAPTER 3. SOME COMMON NEURAL NETWORKS MODELS 39
Figure 3.6: The features used to build the faces in Figure 3.5
W
1
= W
0
+
_
¸
¸
_
1
1
1
−1
_
¸
¸
_
_
1 1 1 −1
¸
=
_
¸
¸
_
1 1 1 1
1 1 1 1
1 1 1 1
−1 −1 −1 −1
_
¸
¸
_
, (3.2.9)
the second face is store by
W
2
= W
1
+
_
¸
¸
_
−1
−1
1
1
_
¸
¸
_
_
−1 −1 1 1
¸
=
_
¸
¸
_
2 2 0 0
2 2 0 0
0 0 2 −2
0 0 −2 +2
_
¸
¸
_
, (3.2.10)
and so on until
W
10
= W
9
+
_
¸
¸
_
1
1
−1
1
_
¸
¸
_
_
1 1 −1 1
¸
=
_
¸
¸
_
10 2 2 6
2 −10 2 −2
2 2 10 −2
6 −2 −2 10
_
¸
¸
_
. (3.2.11)
Another important association network is the Hopﬁeld network. In its presen
tation we follow closely Chester (1993). Hopﬁeld network is recurrent and fully
interconnected with each neuron feeding its output in all others. The concept was
that all the neurons would transmit signals back and forth to each other in a closed
feedback loop until their state become stable. The Hopﬁeld topology is illustrated
in Figure 3.7.
Figure 3.8 provides examples of a Hopﬁled network with nine neurons. The
operation is as follows:
Assume that there are three 9dimensional vectors α, β and γ.
α = (101010101) β = (110011001) γ = (010010010). (3.2.12)
CHAPTER 3. SOME COMMON NEURAL NETWORKS MODELS 40
Figure 3.7: Hopﬁeld network architecture
The substitution of −1 for each 0 in these vectors transforms then into bipolar
form α
∗
, β
∗
, γ
∗
.
α
∗
= (1 −11 −11 −11 −11)
β
∗
= (11 −1 −111 −1 −11)
γ
∗
= (−11 −1 −11 −1 −11 −1)
.
(3.2.13)
The weight matrix is
W = α
∗
α
∗
+ β
∗
β
∗
+ γ
∗
γ
∗
(3.2.14)
when the main diagonal is zeroed (no neuron feeding into itself). This becomes
the matrix in Figure 3.8 of the example.
CHAPTER 3. SOME COMMON NEURAL NETWORKS MODELS 41
A B C D E F G H I
From A 0 1 1 1 1 1 1 3 3
B 1 0 3 1 1 1 3 1 1
C 1 3 0 1 1 1 3 1 1
D 1 1 1 0 3 1 1 1 1
E 1 1 1 3 0 1 1 1 1
F 1 1 1 1 1 0 1 1 1
G 1 3 3 1 1 1 0 1 1
H 3 1 1 1 1 1 1 0 3
I 3 1 1 1 1 1 1 3 0
Synaptic weight matrix, nodes A−I
Start a b
A = 1 A = 1 A = 1
B = 1 B → 0 B = 0
C = 1 C = 1 C = 1
D = 1 D → 0 D → 1
E = 1 E → 0 E = 0
F = 1 F → 0 F → 1
G = 1 G = 1 G = 1
H = 1 H → 0 H = 0
I = 1 I = 1 I = 1
Example 1
Start a
A = 1 A = 1
B = 1 B = 1
C = 0 C = 0
D = 0 D = 0
E = 1 E = 1
F = 1 F = 1
G = 0 G = 00
H = 0 H = 0
I = 0 I → 1
Example 2
Start a b
A = 1 A = 1 A = 1
B = 0 B = 0 B = 0
C = 0 C → 1 C = 1
D = 1 D = 1 D = 1
E = 0 E = 0 E = 0
F = 0 F → 1 F = 1
G = 1 G = 1 G = 1
H = 0 H = 0 H = 0
I = 0 I → 1 I = 1
Example 3
Figure 3.8: Example of Hopﬁeld network (The matrix shows the synaptic weights be
tween neurons. Each example shows an initial state for the neurons (representing an input
vector). State transitions are shown by arrows, as individual nodes are updated one at a
time in response to the changing states of the other nodes. In example 2, the network out
put converges to a stored vector. In examples 1 and 3, the stable output is a spontaneous
attractor, identical to none of the three stored vectors. The density of 1bits in the stored
vectors is relatively high and the vectors share enough is in common bit locations to put
them far from orthogonality  all of which may make the matrix somewhat quirky, perhaps
contributing to its convergence toward the spontaneous attractor in examples 1 and 3.)
CHAPTER 3. SOME COMMON NEURAL NETWORKS MODELS 42
#
# #
# # #
# #
# #
# #
#
#
#
# #
(−1, 1) (1, 1)
The study of the dynamics of the Hopﬁeld network and its continuous exten
sion is an important subject which includes the study of network capacity (How
many neurons are need to store a certain number of pattern? Also it shows that the
bipolar form allows greater storage) energy (or Lyapunov) function of the system.
These topics are more related to applications to Optimization than to Statistics.
We end this section with an example from Fausett (1994) on comparison of
data using a BAM (Bidirectional Associative Memory) network. The BAM net
work is used to associate letters with simple bipolar codes.
Consider the possibility of using a (discrete) BAM network (with bipolar vec
tors) to map two simple letters (given by 5 3 patterns) to the following bipolar
codes:
The weight matrices are:
(to store A →−11) (C →11) (W, to store both)
_
¸
¸
¸
¸
¸
¸
¸
¸
¸
¸
¸
¸
¸
¸
¸
¸
¸
¸
¸
¸
¸
¸
¸
¸
_
1 −1
−1 1
1 −1
−1 1
1 −1
−1 1
−1 1
−1 1
−1 1
−1 1
1 −1
−1 1
−1 1
1 −1
−1 1
_
¸
¸
¸
¸
¸
¸
¸
¸
¸
¸
¸
¸
¸
¸
¸
¸
¸
¸
¸
¸
¸
¸
¸
¸
_
_
¸
¸
¸
¸
¸
¸
¸
¸
¸
¸
¸
¸
¸
¸
¸
¸
¸
¸
¸
¸
¸
¸
¸
¸
_
−1 −1
1 1
1 1
1 1
−1 −1
−1 −1
1 1
−1 −1
−1 −1
1 1
−1 −1
−1 −1
−1 −1
1 1
1 1
_
¸
¸
¸
¸
¸
¸
¸
¸
¸
¸
¸
¸
¸
¸
¸
¸
¸
¸
¸
¸
¸
¸
¸
¸
_
_
¸
¸
¸
¸
¸
¸
¸
¸
¸
¸
¸
¸
¸
¸
¸
¸
¸
¸
¸
¸
¸
¸
¸
¸
_
0 −2
0 2
2 0
0 2
0 −2
−2 0
0 2
−2 0
−2 0
0 2
0 −2
−2 0
−2 0
2 0
0 2
_
¸
¸
¸
¸
¸
¸
¸
¸
¸
¸
¸
¸
¸
¸
¸
¸
¸
¸
¸
¸
¸
¸
¸
¸
_
(3.2.15)
To illustrate the use of a BAM, we ﬁrst demonstrate that the net gives the cor
rect Y vector when presented with the x vector for either pattern A or the pattern
CHAPTER 3. SOME COMMON NEURAL NETWORKS MODELS 43
C:
INPUT PATTERN A
_
−11 −11 −111111 −111 −11
¸
W = (−14, 16) →(−1, 1) (3.2.16)
INPUT PATTERN C
_
−1111 −1 −11 −1 −11 −1 −1 −11
¸
W = (14, 16) →(1, 1) (3.2.17)
To see the bidirectional nature of the net, observe that the Y vectors can also
be used as input. For signals sent from the Y layer to the Xlayer, the weight
matrix is the transpose of the matrix W, i.e.,
W
T
=
_
0 0 2 0 0 −2 0 −2 −2 0 0 −2 −2 2 0
0 0 2 0 0 −2 0 −2 −2 0 0 −2 −2 2 0
_
(3.2.18)
For the input vector associated with pattern A, namely, (1,1), we have
(−1, 1)W
T
=
(−1, 1)
_
0 0 2 0 0 −2 0 −2 −2 0 0 −2 −2 2 0
0 0 2 0 0 −2 0 −2 −2 0 0 −2 −2 2 0
_
=
_
−2 2 −2 2 −2 2 2 2 2 2 −2 2 2 −2 2
¸
→
_
−1 1 −1 1 −1 1 1 1 1 1 −1 1 1 −1 1
¸
(3.2.19)
This is pattern A.
Similarly, if we input the vector associated with pattern C, namely, (1,1), we
obtain
(−1, 1)W
T
=
(−1, 1)
_
0 0 2 0 0 −2 0 −2 −2 0 0 −2 −2 2 0
−2 2 0 2 −2 0 0 0 2 −2 0 0 0 2
_
=
_
−2 2 2 2 −2 −2 2 −2 −2 2 −2 −2 −2 −2 2
¸
→
_
−1 1 1 1 −1 −1 1 −1 −1 1 −1 −1 −1 1 1
¸
(3.2.20)
which is pattern C.
The net can also be used with noisy input for the x vector, the y vector, or
both.
CHAPTER 3. SOME COMMON NEURAL NETWORKS MODELS 44
3.3 Radial Basis Function Networks
The radial basis function is a single hidden layer feed forward network with
linear output transfer function and nonlinear transfer function h() in the hidden
layer. Many types of nonlinearity may be used.
The most general formula for the radial function is
h(x) = Φ
_
(x −c)R
−1
(x −c)
_
(3.3.1)
where Φ is the function used (Gaussian, multiquadratic, etc.), c is the center and
R the metric. The common functions are: Gaussian Φ(z) = e
−z
, Multiquadratic
Φ(z) = (1 + z)
1/2
, Inverse multiquadratic Φ(z) = (1 + z)
−1/2
and the Cauchy
Φ(z) = (1 + z)
−1
.
If the metric is Euclidean R = r
2
I for some scalar radius and the equation
simpliﬁes to
h(x) = Φ
_
(x −c)
T
(x −c)
r
2
_
. (3.3.2)
The essence of the difference between the operation of radial basis function and
multilayer perceptron is shown in Figure 3.9 for a hypothetical classiﬁcation of
psychiatric patients.
Figure 3.9: Classiﬁcation by Alternative Networks: multilayer perceptron (left)
and radial basis functions networks (right)
Multilayer perceptron separates the data by hyperplanes while radial basis
function cluster the data into a ﬁnite number of ellipsoids regions.
CHAPTER 3. SOME COMMON NEURAL NETWORKS MODELS 45
A simpliﬁcation is the 1dimensional input space in which case we have
h(x) = Φ
_
(x −c)
2
σ
2
_
. (3.3.3)
Figure 3.10 illustrate the alternative forms of h(x) for c = 0 and r = 1.
−4 −3 −2 −1 0 1 2 3 4
0
0.2
0.4
0.6
0.8
1
1.2
1.4
1.6
1.8
2
Multiquadratic
Gaussian
Cauchi
Inverse−quadratic
Figure 3.10: Radial basis functions h(.)
To see how the RBFN works we consider some examples (Wu and McLarty,
2000). Consider two measures (from questionnaires) of mood to diagnose de
pression. The algorithm that implemented the radial basis function application
determine that seven cluster sites were needed for this problem. A constant vari
ability term, σ
2
= 1, was used for each hidden unit. The Gaussian function was
used for activation of the hidden layers. Figure 3.11 shows the network.
Suppose we have the scores x
= (4.8; 1.4). The Euclidean distance from this
vector and the cluster mean of the ﬁrst hidden unit µ
1
= (0.36; 0.52) is
D =
_
(4.8 −.36)
2
+ (.14 −.52)
2
= .398. (3.3.4)
The output of the ﬁrst hidden layer is
h
1
(D
2
) = e
−.398
2
/2×(.1)
= .453. (3.3.5)
This calculation is repeated for each of the remaining hidden unit. The ﬁnal
output is
(.453)(2.64) + (.670)(−.64) + (543)(5.25) + . . . + (−1)(0.0092) = 0.0544. (3.3.6)
Since the output is close to 0, the patient is classiﬁed as not depressed.
CHAPTER 3. SOME COMMON NEURAL NETWORKS MODELS 46
Figure 3.11: Radial basis network
CHAPTER 3. SOME COMMON NEURAL NETWORKS MODELS 47
The general form of the output is y = f(x) =
m
i=1
w
i
h
i
(x), that is the function
f is expressed as a linear combination of m ﬁxed functions (called basis function
in analogy to the concept of a vector being composed as combination of basis
vectors).
Training radial basis function networks proceeds in two steps. First the hid
den layer parameters are determined as a function of the input data and then the
weights in between the hidden and output layer are determined from the output of
the hidden layer and the target data.
Therefore we have a combination of unsupervised training to ﬁnd the cluster
and the cluster parameters and in the second step a supervised training which
is linear in the parameters. Usually in the ﬁrst step the kmeans algorithm of
clustering is used, but other algorithms can also be used including SOM and ART
(chapter 4).
3.4 Wavelet Neural Networks
The approximation of a function in terms of a set of orthogonal basis func
tion is familiar in many areas, in particular in statistics. Some examples are: the
Fourier expansion, where the basis consists of sines and cosines of differing fre
quencies. Another is the Walsh expansion of categorical time sequence or square
wave, where the concept of frequency is replaced by that of sequence, which gives
the number of “zero crossings” of the unit interval. In recent years there has been
a considerable interest in the development and use of an alternative type of basis
function: wavelets.
The main advantage of wavelets over Fourier expansion is that wavelets is a
localized approximator in comparison to the global characteristic of the Fourier
expansion. Therefore wavelets have signiﬁcant advantages when the data is non
stationary.
In this section, we will outline why and how neural network has been used to
implement wavelet applications. Some wavelets results are ﬁrst presented.
3.4.1 Wavelets
To present the wavelet basis of functions we start with a father wavelet or
scaling function φ such that
φ(x) =
√
2
k
l
k
φ(2x −k) (3.4.1)
CHAPTER 3. SOME COMMON NEURAL NETWORKS MODELS 48
usually, normalized as
∞
_
−∞
φ(x)dx = 1. A mother wavelet is obtained through
ψ(x) =
√
2
h
k
φ(2x −k) (3.4.2)
where l
k
and h
k
are related through
h
k
= (−1)
k
l
k
. (3.4.3)
The equations 3.4.1 and 3.4.2 are called dilatation equations, the coefﬁcients
l
k
, h
k
are lowpass and highpass ﬁlters, respectively.
We assume that these functions generate an orthonormal system of L
2
(1),
denoted ¦φ
j
0
,k
(x)¦ ∪¦ψ
jk
(x)¦
j≥j
0
,k
with φ
j
0
,k
(x) = 2
j
0
/2
φ(2
j
0
x −k), ψ
j,k
(x) =
2
j/2
ψ(2
j
x −k) for j ≥ j
0
the coarse scale.
For any f ∈ L
2
(1) we may consider the expansion
f(x) =
∞
k=−∞
α
k
φ
j
0
,k
(x) +
j≥j
0
∞
k=−∞
β
j,k
ψ
j,k
(x) (3.4.4)
for some coarse scale j
0
and where the true wavelet coefﬁcients are given by
α
k
=
_
∞
−∞
f(x)φ
j
0
,k
(x)dx β
j,k
=
_
∞
−∞
f(x)ψ
j,k
(x)dx. (3.4.5)
An estimate will take the form
ˆ
f(x) =
∞
k=−∞
ˆ α
k
φ
j
0
,k
(x) +
j≥j
0
∞
k=−∞
ˆ
β
j,k
ψ(x) (3.4.6)
where ˆ α
k
and
ˆ
β
k
are estimates of α
k
and β
jk
, respectively.
Several issues are of interests in the process of obtaining the estimate
ˆ
f(x):
(i) the choice of the wavelet basis;
(ii) the choice of thresholding policy (which ˆ α
j
and
ˆ
β
jk
should enter in the ex
pression of
ˆ
f(x); the value of j
0
; for ﬁnite sample the number of parameters
in the approximation);
(iii) the choice of further parameters appearing in the thresholding scheme;
(iv) the estimation of the scale parameter (noise level).
CHAPTER 3. SOME COMMON NEURAL NETWORKS MODELS 49
These issues are discussed in Morettin (1997) and Abramovich et al (2000).
Concerning the choice of wavelet basis, some possibilities includes the Haar
wavelet which is useful for categoricaltype data. It is based on
φ(x) = 1 0 ≤ x < 1
ψ(x) =
_
1 0 ≤ x < 1/2
−1 1/2 ≤ x < 1
(3.4.7)
and the expansion is then
f(x) = α
0
+
J
j=0
,
2
J
−1
k=0
β
jk
ψ
jk
(x). (3.4.8)
Other common wavelets are: the Shannon with mother function
ψ(x) =
sin(πx/2)
πx/2
cos (3πx/2) (3.4.9)
the Mortet wavelet with mother function
ψ(x) = e
iwx
e
−x
2
/2
(3.4.10)
where i =
√
−1 and w is a ﬁxed frequency. Figure 3.12 shows the plots of this
wavelets. Figure 3.13 illustrates the localized characteristic of wavelets.
Figure 3.12: Some wavelets: (a) Haar (b) Morlet (c) Shannon
CHAPTER 3. SOME COMMON NEURAL NETWORKS MODELS 50
Figure 3.13 gives an example of a wavelet transform.
Figure 3.13: Wavelet operation
In ﬁgure 3.13 a wavelet (b) is compared successively to different sections of
a function (a). The product of the section and the wavelet is a new function; the
area delimited by that function is the wavelet coefﬁcient. In (c) the wavelet is
compared to a section of the function that looks like the wavelet. The product of
the two is always positive, giving the big coefﬁcient shown in (d). (The product
of two negative functions is positive.) In (e) the wavelet is compared to a slowly
changing section of the function, giving the small coefﬁcients shown in (f). The
signal is analyzed at different scales, using wavelets of different widths. “You
play with the width of the wavelet in order to catch the rhythm of the signal,” says
Yves Meyer.
Figure 3.14: An Example of a Wavelet Transform.
CHAPTER 3. SOME COMMON NEURAL NETWORKS MODELS 51
Figure 3.4.1 shows: (a) The original signal. (b) The wavelet transform of the
signal, over 5 scales, differing by a factor of 2. The ﬁnest resolution, giving the
smallest details is at the top. At the bottom is the graph of the remaining low
frequencies.
3.4.2 Wavelet Networks and Radial Basis Wavelet Networks.
We have seen (Section 2.7  Kolmogorov Theorem) that neural networks have
been established as a general approximation tool for ﬁtting nonlinear models
from input/output data. On the other hand, wavelets decomposition emerges as
powerful tool for approximation. Because of the similarity between the discrete
wavelet transform and a onehiddenlayer neural network, the idea of combining
both wavelets and neural networks has been proposed. This has resulted in the
wavelet network which is discussed in Iyengar et al (2002) and we will summarize
here.
There are two main approaches to form wavelet network. In the ﬁrst, the
wavelet component is decoupled from the estimation components of the percep
tron. In essence the data is decomposed on some wavelets and this is fed into
the neural network. We call this Radial Basis Wavelet Neural Networks which is
shown in Figures 3.15(a),3.15(b).
In the second approach, the wavelet theory and neural networks are combined
into a single method. In wavelet networks, both the position and dilatation of the
wavelets as well as the weights are optimized. The neuron of a wavelet network
is a multidimensional wavelet in which the dilatation and translations coefﬁcients
are considered as neuron parameters. The output of a wavelet network is a linear
combination of several multidimensional wavelets.
Figures 3.16 and 3.17 give a general representation of a wavelet neuron (wavelon)
and a wavelet network
Observation. It is useful to rewright the wavelet basis as:
Ψ
a,b
= [a[
−1
_
x −b
a
_
a > 0, −∞< b < ∞with a = 2
j
, b = k2
−j
Adetailed discussion of this networks is given in Zhang and Benvensite (1992).
See also Iyengar et al (1992) and Martin and Morris (1999) for further references.
CHAPTER 3. SOME COMMON NEURAL NETWORKS MODELS 52
(a) Function approximation
(b) Classiﬁcation
Figure 3.15: Radial Basis Wavelet Network
Figure 3.16: Wavelon
CHAPTER 3. SOME COMMON NEURAL NETWORKS MODELS 53
Figure 3.17: Wavelet network  f(x) =
N
i=1
w
i
Ψ
_
x
i
−a
i
b
i
_
3.5 Mixture of Experts Networks
Consider the problem of learning a function in which the form of the function
varies in different regions of the input space (see also Radial Basis and Wavelets
Networks Sections). These types of problems can be made easier using Mixture
of Experts or Modular Networks.
In these type of network we assigned different expert networks to table each of
the different regions, and then used an extra ”gating” network, which also sees the
input data, to deduce which of the experts should be used to determine the output.
A more general approach would allow the discover of a suitable decomposition as
part of the learning process.
Figure 3.18 shows several possible decompositions
These types of decompositions are related to mixture of models, generalized
additive and tree models in statistics (see Section 5.2)
Further details and references can be seen in Mehnotra et al (2000) and Bishop
(1995). Applications can be seen in Jordan and Jacobs (1994), OhnoMachodo
(1995), Peng et al (1996), Desai et al (1997), Cheng (1997) and Ciampi and
Lechevalier (1997).
CHAPTER 3. SOME COMMON NEURAL NETWORKS MODELS 54
(a) Mixture with gating (b) Hierarchical
(c) Sucessive reﬁnement (d) Input modularity
Figure 3.18: Mixture of experts networks
CHAPTER 3. SOME COMMON NEURAL NETWORKS MODELS 55
3.6 Neural Network and Statistical Models Interface
The remaining chapters of this book apply these networks to different prob
lems of data analysis in place of statistical models. It is therefore convenient to
review the statistical bibliography relating neural network and statistical models
and statistical problems in general.
The relationship of Multilayer Feedforward Network with several linear mod
els is presented by Arminger and Enache (1996). They shown that these net
works can perform are equivalent to perform: linear regression,linear discrim
ination, multivariate regression, probit regression, logit regression, generalized
linear model, generalized additive models with known smoothing functions, pro
jection pursuit, LISREL (with mixed dependent variables). Other related paper
are: Cheng and Titterington (1994) Sarle (1994), Stern (1996), Warner and Misha
(1996), Schumacher et al (1996), Vach et al. (1996), De Veux et al (1998).
Attempts to compare performance of neural networks were made extensively
but no conclusive results seems to be possible; some examples are Balakrishnan
et al (1994), Mangiameli et al (1996) Mingoti and Lima (2006) for cluster anal
ysis and Tu (1996), Ennis et al (1998) and Schwarzer et al (2000) for logistic
regression.
Statistical intervals for neural networks were studied in Tibshirani (1996),
Huang and Ding (1997), Chryssolouriz et al (1996) and De Veux et al (1998),
For Bayesian analysis for neural networks see MacKay (1992), Neal (1996), Lee
(1998, 2004), and RiosInsua and Muller (1996), Paige and Butle (2001) and the
review of Titterington (2004).
Theoretical statistical results are presented in Murtagh (1994), Poli and Jones
(1994), Jordan (1995), Faraggi and Simon (1995), Ripley (1996), Lowe (1999),
Martin and Morris (1999), the review papers Cheng and Titterington (1994), Tit
terington (2004). Asymptotics and further results can be seen in a series of papers
by White (1989a,b,c, 1996), Swanson and White (1995,1997), Kuan and White
(1994). Books relating neural networks and statistics are: Smith (1993), Bishop
(1996), Ripley (1996), Neal (1996) and Lee (2004).
Pitfalls on the use of neural networks versus statistical modeling is discussed
in Tu (1996), Schwarzer et al (2000) and Zhang (2007). Further references on the
interfaces: statistical tests, methods (eg survival, multivariate etc) and others are
given in the correspondingly chapters of the book.
Chapter 4
Multivariate Statistics Neural
Network Models
4.1 Cluster and Scaling Networks
4.1.1 Competitive networks
In this section, we present some network based on competitive learning used
for clustering and in a network with a similar role to that of multidimensional
scaling (Kohonen map network).
The most extreme form of a competition is the Winner Take All and is suited
to cases of competitive unsupervised learning.
Several of the network discussed in this section use the same learning algo
rithm (weight updating).
Assume that there is a single layer with M neurons and that each neuron has
its own set of weights w
j
. An input x
is applied to all neurons. The activation
function is the identity. Figure 4.1 illustrates the architecture of the Winner Take
All network.
The node with the best response to the applied input vector is declared the
winner, according to the winner criterion
O
l
= min
j=1...c
_
d
2
(y
j
, w
j
)
_
(4.1.1)
where d is a distance measure of y
j
from the cluster prototype or center. The
change of weights is calculated according to
w
i
(k + 1) = w
l
(k) + α(x
i
−w
i
(k)) (4.1.2)
56
CHAPTER 4. MULTIVARIATE STATISTICS NEURAL NETWORKS 57
Figure 4.1: General Winner Take All network
where α is the learning rate, and may be decreasing at each iteration.
We now present an example due to Mehrotra et al (2000) to illustrate this
simple competitive network.
Let the set n consists of 6 threedimensional vectors, x
1
, x
2
, . . . , x
6
,
n = ¦x
1
1
= (1.1, 1.7, 1.8), x
2
2
= (0, 0, 0), x
3
3
= (0, 0.5, 1.5),
x
4
4
= (1, 0, 0), x
5
5
= (0.5, 0.5, 0.5), x
6
6
= (1, 1, 1)¦.
(4.1.3)
We begin a network containing three input nodes and three processing units (out
put nodes), A, B, C. The connection strengths of A, B, C are initially chosen
randomly, and are given by the weight matrix
W(0) =
_
_
w
1
: 0.2 0.7 0.3
w
2
: 0.1 0.1 0.9
w
3
: 1 1 1
_
_
(4.1.4)
To simplify computations, we use a learning rate η = 0.5 and update weights
equation 4.1.2. We compare squared Euclidean distances to select the winner:
d
2
j.l
≡ d
2
(w
j
, i
l
) refers to the squared Euclidean distance between the current
position of the processing node j from the lth pattern.
t = 1 : Sample presented x
1
1
= (1.1, 1.7, 1.8). Squared Euclidean distance
between A and x
1
1
: d
2
1.1
= (1.1 − 0.2)
2
+ (1.7 − 0.7)
2
+ (1.8 − 0.3)
2
= 4.1.
Similarly, d
2
2.1
= 4.4 and d
2
3.1
= 1.1.
CHAPTER 4. MULTIVARIATE STATISTICS NEURAL NETWORKS 58
C is the “winner” since d
2
3.1
< d
2
1.1
and d
2
3.1
< d
2
2.1
. A and B are therefore
not perturbed by this sample whereas C moves halfway towards the sample (since
η = 0.5). The resulting weight matrix is
W(1) =
_
_
w
1
: 0.2 0.7 0.3
w
2
: 0.1 0.1 0.9
w
3
: 1.05 1.35 1.4
_
_
. (4.1.5)
t = 2 : Sample presented x
2
2
= (0, 0, 0), d
2
1.2
= 0.6, d
2
2.2
= 0.8, d
2
3.2
= 4.9,
hence A is the winner. The weights of A is updated. The resulting modiﬁed
weight vector is w
1
: (0.1, 0.35, 0.15).
Similarly, we have:
t = 3 : Sample presented x
3
3
= (0, 0.5, 1.5), d
2
2.3
= 0.5 is least, hence
B is the winner and is updated. The resulting modiﬁed weight vector is w
2
:
(0.05, 0.3, 1.2).
t = 4 : Sample presented x
4
4
= (1, 0, 0), d
2
1.4
= 1, d
2
2.4
= 2.4, d
2
3.4
= 3.8, hence
A is the winner and is updated: w
1
: (0.55, 0.2, 0.1).
t = 5 : x
5
5
= (0.5, 0.5, 0.5) is presented, winner A is updated: w
1
(5) =
(0.5, 0.35, 0.3).
t = 6 : x
6
6
= (1, 1, 1) is presented, winner C is updated: w
3
(6) = (1, 1.2, 1.2).
t = 7 : x
1
1
is presented, winner C is updated: w
3
(7) = (1.05, 1.45, 1.5).
t = 8 : Sample presented x
2
2
. Winner Ais updated to w
1
(8) = (0.25, 0.2, 0.15).
t = 9 : Sample presented x
3
3
. Winner B is updated to w
1
(9) = (0, 0.4, 1.35).
t = 10 : Sample presented x
4
4
. Winner Ais updated to w
1
(10) = (0.6, 0.1, 0.1).
t = 11 : Sample presented x
5
5
. Winner Ais updated to w
1
(11) = (0.55, 0.3, 0.3).
t = 12 : Sample presented x
6
6
. Winner C is updated and w
3
(12) = (1, 1.2, 1.25).
At this stage the weight matrix is
W(12) =
_
_
w
1
: 0.55 0.3 0.3
w
2
: 0 0.4 1.35
w
3
: 1 1.2 1.25
_
_
. (4.1.6)
Node A becomes repeatedly activated by the samples x
2
2
,
4
4
, and x
5
5
, node B by
x
3
3
alone, and node C by x
1
1
and x
6
6
. The centroid of x
2
2
, x
4
4
, and i
5
is (0.5, 0.2, 0.2),
and convergence of the weight vector for node Atowards this location is indicated
by the progression
(0.2, 0.7, 0.3) →(0.1, 0.35, 0.15) →(0.55, 0.2, 0.1) →(0.5, 0.35, 0.3) →
(0.25, 0.2, 0.15) →(0.6, 0.1, 0.1) →(0.55, 0.3, 0.3) . . .
(4.1.7)
To interpret the competitive learning algorithmas a clustering process is attrac
tive, but the merits of doing so are debatable: as illustrated by a simple example
CHAPTER 4. MULTIVARIATE STATISTICS NEURAL NETWORKS 59
were the procedure cluster the number 0, 2, 3, 5, 6 in groups A = (0, 2), (3, 5), (6)
instead of the more natural grouping A = (0), B = (2, 3) and C = (5, 6).
For further discussion on the relation of this procedure to ”kmeans clustering”
see Mehrotra et al (2000 pg 168).
4.1.2 Learning Vector Quantization  LVQ
Here again we present the results Mehnotra et al (2000 pg 173).
Unsupervised learning and clustering can be useful preprocessing steps for
solving classiﬁcation problems. A learning vector quantizer (LVQ) is an applica
tion of winnertakeall networks for such tasks, and illustrates how an unsuper
vised learning mechanism can be adapted to solve supervised learning tasks in
which class membership is known for every training pattern.
Each node in an LVQ is associated with an arbitrarily chosen class label. The
number of nodes chosen for each class is roughly proportional to the number
of training patterns that belong to that class, making the assumption that each
cluster has roughly the same number of patterns. The new updating rule may be
paraphrased as follows.
When pattern i from class C(i) is presented to the network, let the winner
node j
∗
belong to class C(j
∗
). The winner j
∗
is moved towards the pattern i if
C(i) = C(j
∗
) and away from i otherwise.
This algorithm is referred to as LVQ1, to distinguish it from more recent vari
ants of the algorithm. In the LVQ1 algorithm, the weight update rule uses a learn
ing rate η(t) that is a function of time t, such as η(t) = 1/t or η(t) = a[1 −(t/A)]
where a and A are positive constants and A > 1.
The following example illustrates the result of running the LVQ1 algorithm on
the input samples of the previous section.
The data are cast into a classiﬁcation problem by arbitrary associating the ﬁrst
and last samples with Class 1, and the remaining samples with Class 2. Thus the
training set is: T = ¦(x
1
; 1), (x
2
, 0), . . . , (x
6
; 1)¦ where x
1
= (1.1, 1.7, 1, 8), x
2
=
(0, 0, 0), x
3
= (0, 0.5, 1.5), x
4
= (1, 0, 0), x
5
= (0.5, 0.5, 0.5), and x
6
= (1, 1, 1).
The initial weight matrix is
W(0) =
_
_
w
1
: 0.2 0.7 0.3
w
2
: 0.1 0.1 1.9
w
3
: 1 1 1
_
_
(4.1.8)
Since there are twice as many samples in Class 2 as in Class 1, we label the ﬁrst
node (w
1
) as associated with Class 1, and the other two nodes with Class 2. Let
η(t) = 0.5 until t = 6, then η(t) = 0.25 until t = 12, and η(t) = 0.1 thereafter.
Only the change in a single weight vector in each weight update iteration of the
network, instead of writing out the entire weight matrix.
CHAPTER 4. MULTIVARIATE STATISTICS NEURAL NETWORKS 60
1. Sample x
1
1
, winner w
3
(distance 1.07), w
3
changed to (0.95, 0.65, 0.60).
2. Sample x
2
2
, winner w
1
(distance 0.79), w
1
changed to (0.30, 1.05, 0.45).
3. Sample x
3
3
, winner w
2
(distance 0.73), w
2
changed to (0.05, 0.30, 1.20).
4. Sample x
4
, winner w
3
(distance 0.89), w
3
changed to (0.97, 0.33, 0.30).
... ... ...
156. Sample x
6
, winner w
1
(distance 0.56), w
1
changed to (1.04, 1.33, 1.38).
157. Sample x
1
, winner w
1
(distance 0.57), w
1
changed to (1.05, 1.37, 1.42).
158. Sample x
2
, winner w
3
(distance 0.58), w
3
changed to (0.46, 0.17, 0.17).
159. Sample x
3
, winner w
2
(distance 0.02), w
2
changed to (0.00, 0.49, 1.48).
160. Sample x
4
, winner w
3
(distance 0.58), w
3
changed to (0.52, 0.15, 0.15).
161. Sample x
5
, winner w
3
(distance 0.50), w
3
changed to (0.52, 0.18, 0.18).
162. Sample x
6
, winner w
1
(distance 0.56), w
1
changed to (1.05, 1.33, 1.38).
Note that associations between input samples and weight vectors stabilize by
the second cycle of pattern presentations, although the weight vectors continue to
change.
Mehrotra et all also mentioned a variation called the LVQ2 learning algorithm
with a different learning rule used instead and an update rule similar to that of
LVQ1.
Some application of the procedure are: Pentapoulos et al (1998) apply a LVQ
network to discriminate benign from malignant cells on the basis of the extracted
morphometric and textural features. Another LVQ network was also applied in
an attempt to discriminate at the patient level. The data consisted of 470 samples
of voided urine from an equal number of patients with urothelial lesions. For the
purpose of the study 45.452 cells were measured. The training sample used 30%
of the patients and cells respectively. The study included 50 cases of lithiasis,
61 case of information, 99 cases of benign prostatic hyperplasia, 5 cases of city
carcinoma, 71 case of grade I transitional cell carcinoma of the bladder (TCCB)
and 184 cases of grade II and grade III TCCB.
The application enable the correct classiﬁcation of 95.42% of the benign cells
and 86.75% of the malignant cells. At the patient level the LVQ network enable
the correct classiﬁcation of 100% of benign cases and 95.6% of the malignant
cases. The overall accuracy rate was 90.63% and 97.57%, respectively. Maksi
movic and Popovic (1999) compared alternative networks to classify arm move
ments in tetraplegics. The following hand movements were studied: I  up/down
proximal to the body on the lateral side; II  left/right above the level of the shoul
der; III  internal/external rotation of the upper arm (humerus).
Figure 4.2 presents the correct classiﬁcation percentual for the neural networks
used.
Vuckovic et al (2002) use neural networks to classify alert vs drowsy states
from 1’s long sequences of full spectrum EEG in an arbitrary subject. The ex
CHAPTER 4. MULTIVARIATE STATISTICS NEURAL NETWORKS 61
Figure 4.2: Success of classiﬁcation for all ANNs and all tested movement trials
perimental data was collected on 17 subjects. Two expert in EEG interpretation
provided the expertise to train the ANN. The three ANN used in the comparison
are: one layer perceptron with identity activation (WidrowHoff optimization),
perceptron with sigmoid activation (LevenbergMarquart optimization) and LVQ
network.
The LVQ network gives the best classiﬁcation. For validation 12 volunteers
were used and the matching between the human assessment and the network was
94.37 ±1.95% using the t statistics.
4.1.3 Adaptive Resonance Theory Networks  ART
Adaptive Resonance Theory (ART) models are neural networks that perform
clustering and can allow the number of cluster to vary. The major difference
between ART and other clustering methods is that ART allows the user to control
the degree of similarity between members of the same cluster by means of a user
deﬁned constant called the vigilance parameter.
We outline in this section the simpler version of the ART network called
ART1. The architecture of an ART1 network shown in Figure 4.3 consists of
two layer of neurons. The F
1
neurons and the F
2
(cluster) neurons together with a
reset unit to control the degree of similarity of patterns or sample elements placed
on the same cluster unit.
The ART1 networks accepts only binary inputs. For continuous input the
ART2 network was developed with a more complex F
1
layer to accommodate
CHAPTER 4. MULTIVARIATE STATISTICS NEURAL NETWORKS 62
continuous input.
Figure 4.3: Architecture of ART1 network
The input layer, F
1
receives and holds the input patters, the second layer, F
2
responds with a pattern associated with the given input. If this returned pattern is
sufﬁcient similar to the input pattern then there is a match. But if the difference is
substantial, then the layers communicate until a match is discovered, otherwise a
new cluster of patterns is formed around the new input vector.
The training algorithm is shown in Figure 4.4 (Mehrotra et al (2000) or Fauzett
(1994)).
The following example of Mehnotra et al (2000) illustrates the computations.
Consider a set of vectors ¦(1, 1, 0, 0, 0, 0, 1), (0, 0, 1, 1, 1, 1, 0), (1, 0, 1, 1, 1, 1, 0),
(0, 0, 0, 1, 1, 1, 0), and (1, 1, 0, 1, 1, 1, 0)¦ to be clustered using the ART1 algo
rithm. Let the vigilance parameter be ρ = 0.7.
We begin with a single node whose topdown weights are all initialized to 1,
i.e., t
l,1
(0) = 1, and bottomup weights are set to b
1,l
(0) =
1
8
. Here n = 7 and
initially m = 1. Given the ﬁrst input vector, (1, 1, 0, 0, 0, 0, 1), we compute
y
1
=
1
8
1 +
1
8
1 +
1
8
0 + . . . +
1
8
0 +
1
8
1 =
3
8
, (4.1.9)
and y
1
is declared the uncontested winner. Since
7
l=1
t
l,1
x
l
7
l=1
x
l
=
3
3
= 1 > 0.7, (4.1.10)
CHAPTER 4. MULTIVARIATE STATISTICS NEURAL NETWORKS 63
− Initialize each topdown weight t
l,j
(0) = 1;
− Initialize each bottomup weight b
j,l
(0) =
1
n + 1
;
− while the network has not stabilized, do
1. Present a randomly chosen pattern x = (x
1
, . . . , x
n
) for learning.
2. Let the active set A contain all nodes; calculate y
j
= b
j,l
x
1
+ . . . + b
j,n
x
n
for each
node j ∈ A;
(a) Let j
∗
be a node in A with largest y
j
, with ties being broken arbitrarily;
(b) Compute s
∗
= (s
∗
1
, . . . , s
∗
n
) where s
∗
l
= t
l,j
∗x
l
;
(c) Compare similarity between s
∗
and x with the given vigilance parameter ρ:
if
n
l=1
s
∗
l
n
l=1
x
l
≤ ρ then remove j
∗
from set A
else associate x with node j
∗
and update weights:
b
j
∗
,l
(new) =
t
l,j
∗ (old) x
l
0.5 +
n
l=1
t
l,j
∗ (old) x
l
t
l,j
∗ (new) = t
l,j
∗ (old) x
l
until A is empty or x has been associated with some node j;
3. If A is empty, then create a new node whose weight vector coincides with the current
input pattern x;
endwhile
Figure 4.4: Algorithm for updating weights in ART1
CHAPTER 4. MULTIVARIATE STATISTICS NEURAL NETWORKS 64
the vigilance condition is satisﬁed and the updated weights are
b
1,l
(1) =
_
_
_
1
0.5 + 3
=
1
3.5
for l = 1, 2, 7;
0 otherwise.
(4.1.11)
Likewise,
t
l,1
(1) = t
l,1
(0)x
l
. (4.1.12)
These equations yield the following weight matrices.
B(1) =
_
1
3.5
1
3.5
0 0 0 0
1
3.5
_
T
, T(1) =
_
1 1 0 0 0 0 1
¸
T
(4.1.13)
Now we present the second sample (0,0,1,1,1,1,0). This generates y
1
= 0, but
the uncontested winner fails to satisfy the vigilance threshold since
l
t
l,1
x
l
/
l
x
l
=
0 < 0.7. Asecond node must hence be generated, with topdown weights identical
to the sample, and bottomup weights equal to 0 in the positions corresponding
to the zeroes in the sample, and remaining new bottomup weights are equal to
1/(0.5 + 0 + 0 + 1 + 1 + 1 + 1 + 0).
The new weight matrices are
B(2) =
_
¸
¸
_
1
3.5
1
3.5
0 0 0 0
1
3.5
0 0
1
4.5
1
4.5
1
4.5
1
4.5
0
_
¸
¸
_
T
and T(2) =
_
_
1 1 0 0 0 0 1
0 0 1 1 1 1 0
_
_
T
(4.1.14)
When the third vector (1, 0, 1, 1, 1, 1, 0) is presented to this network,
y
1
=
1
3.5
and y
2
=
4
4.5
(4.1.15)
are the node outputs, and the second node is the obvious winner. The vigilance test
succeeds, because
7
l=1
t
l,2
x
l
/
7
l=1
x
l
=
4
5
≥ 0.7. The second node’s weights
are hence adapted, with each topdown weight being the product of the old top
down weight and the corresponding element of the sample (1,0,1,1,1,1,0), while
each bottomup weight is obtained on dividing this quantity by 0.5 +
l
t
l,2
x
l
=
4.5. This results in no change to the weight matrices, so that B(3) = B(2) and
T(3) = T(2).
CHAPTER 4. MULTIVARIATE STATISTICS NEURAL NETWORKS 65
When the fourth vector (0,0,0,1,1,1,0) is presented to this network,
y
1
=
1
3.5
and y
2
=
3
4.5
(4.1.16)
are the node outputs, and the second node is the obvious winner. The vigilance test
succeeds, because
7
l=1
t
l,2
x
l
/
7
l=1
x
l
=
3
3
≥ 0.7. The second node’s weights
are hence adapted, with each topdown weight being the product of the old top
down weight and the corresponding element of the sample (0,0,0,1,1,1,0), while
each bottomup weight is obtained on dividing this quantity by 0.5 +
l
t
l,2
x
l
=
3.5. The resulting weight matrices are
B(4) =
_
¸
¸
_
1
3.5
1
3.5
0 0 0 0
1
3.5
0 0 0
1
3.5
1
3.5
1
3.5
0
_
¸
¸
_
T
, T(4) =
_
_
1 1 0 0 0 0 1
0 0 0 1 1 1 0
_
_
T
.
(4.1.17)
When the ﬁfth vector (1,1,0,1,1,1,0) is presented to this network,
y
1
=
2
3.5
and y
2
=
3
4.5
(4.1.18)
are the node outputs, and the second node is the obvious winner. The vigilance
test fails, because
7
l=1
t
l,2
x
l
/
7
l=1
x
l
=
3
5
< 0.7. The active set A is hence
reduced to contain only the ﬁrst node, which is the new winner (uncontested).
The vigilance test fails with this node as well, with
l
t
l,1
x
l
/
7
l=1
x
l
=
2
5
≤ 0.7.
A third node is hence created, and the resulting weight matrices are
B(5) =
_
¸
¸
¸
¸
¸
¸
¸
¸
_
1
3.5
1
3.5
0 0 0 0
1
3.5
0 0 0
1
3.5
1
3.5
1
3.5
0
1
5.5
1
5.5
0
1
5.5
1
5.5
0
_
¸
¸
¸
¸
¸
¸
¸
¸
_
T
, T(5) =
_
¸
¸
¸
¸
_
1 1 0 0 0 0 1
0 0 0 1 1 1 0
1 1 0 1 1 1 0
_
¸
¸
¸
¸
_
T
(4.1.19)
We now cycle through all the samples again. This can performed in random
order, but we opt for the same sequence as in the earlier cycle. After the third
CHAPTER 4. MULTIVARIATE STATISTICS NEURAL NETWORKS 66
vector is present, again, the weight matrices are modiﬁed to the following:
B(8) =
_
¸
¸
¸
¸
¸
¸
¸
¸
_
1
3.5
1
3.5
0 0 0 0
1
3.5
0 0 0
1
3.5
1
3.5
1
3.5
0
1
4.5
0 0
1
4.5
1
4.5
1
4.5
0
_
¸
¸
¸
¸
¸
¸
¸
¸
_
T
, T(8) =
_
¸
¸
¸
¸
_
1 1 0 0 0 0 1
0 0 0 1 1 1 0
1 0 0 1 1 1 0
_
¸
¸
¸
¸
_
T
.
(4.1.20)
Subsequent presentation of the samples do not result in further changes to the
weights, and T(8) represents the prototypes for the given samples. The network
has thus stabilized. For further details see Mehrotra et al (2000).
In that follows we describe some applications of ART neural networks.
Santos (2003) e Santos et al (2006) used neural networks and classiﬁcation
trees in the diagnosis of smear negative pulmonary tuberculosis (SPNT) which
account for 30% of the reported cases of Pulmonary Tuberculosis. The data con
sisted of 136 patients from the health care unit of the Universidade Federal do Rio
de Janeiro teaching hospital referred from 3/2001 to 9/2002.
Only symptoms and physical were used for constructing the neural networks
and classiﬁcation. The covariate vector contained 3 continuous variables and 23
binary variables.
In this application an ART neural network identiﬁed three groups of patients.
In each group the diagnostic was obtained from a one hidden layer feedforward
network. The neural networks used had sensibility of 71 to 84% and speciﬁcity
of 61 to 83% which has slight better than classiﬁcation tree. Statistical models in
literature shows sensibility of 49 to 100% and speciﬁcity of 16 to 86% but uses
laboratory results. The neural networks of Santos (2003) e Santos et al (2006)
used only clinical information.
Rozenthal (1997,see also Rozenthal et al 1998) applied an ART neural net
work to analyse a set of data from 53 schizophrenic patients (not addicted, phys
ically capable and bellow the age of 50) that answered the DSMIV (Diagnostic
and Statistical Manual for Mental Disorders) and were submitted to neuropsycho
logical tests. There seems to exit at least three functional patterns in schizophre
nia. These patterns deﬁne such group of symptom or dimension of schizophrenic
patients that are: those who have hallucination and disorderly thoughts and self
steem (negative dimension) and those with poor speach and disorderly thoughts
(disorganized dimension). This application of neural network (and a classical
statistical method of cluster, for comparison) indicate 2 important clusters. One
of which, low IQ and negative dimension, keeps itself stable when the number
of clusters is increased. It seems to act as an attractor, and also corresponds to
CHAPTER 4. MULTIVARIATE STATISTICS NEURAL NETWORKS 67
the more severe cases and more difﬁcult to respond to treatment. Tables 4.14.4
presents the results obtained.
Table 4.1: Presentation of the sample according to level of instruction, IQ and age
of onset
Level of Instruction no. Patients IQ age of onset
(n=53) (m ± dp) (m ± dp)
College 22 84.91±07.59 21.64±5.58
High school 19 80.84±0.07 19.5 ±6.13
Elementary 12 75.75±08.88 16.67±3.89
Table 4.2: Cluster patterns according to neuropsychological ﬁndings
P1 P2 P3 P4 P5 P6 P7 P8 P9 P10 P11 RAVLT
Group I 77.6 7.7 7.0 0.05 4.2 11.2 2.7 0.05 4.2 1.8 4.2 8.3
(n=20)
Group II 84 2.0 24.1 3.1 4.6 11.6 2.0 100 5.5 1.8 9.1 11.5
(n=3)
Table 4.3: Three cluster patterns according to neuropsychological ﬁndings
P1 P2 P3 P4 P5 P6 P7 P8 P9 P10 P11 RAVLT
Group I 77.6 7.7 7.0 0.05 4.2 11.2 2.7 0.05 4.2 1.8 4.2 8.3
(n=20)
Group IIa 84.7 0.5 57.1 14.1 4.5 10.9 1.9 1 5.0 1.7 9.1 11.4
(n=20)
Group IIb 84.9 3.7 29 4.0 4.7 12.1 2.1 1 5.8 1.8 20.8 11.6
(n=13)
CHAPTER 4. MULTIVARIATE STATISTICS NEURAL NETWORKS 68
Table 4.4: Four cluster patterns according to neuropsychological ﬁndings
P1 P2 P3 P4 P5 P6 P7 P8 P9 P10 P11 RAVLT
Group I 77.6 7.7 7.0 0.05 4.2 11.2 2.7 0.05 4.2 1.8 4.2 8.3
(n=20)
Group IIc 80.86 0.81 12.2 6.8 4.7 10.9 2.2 1 6.0 2.0 0.9 11.4
(n=14)
Group IId 85.1 1.4 51.90 12.0 4.9 11.6 2.0 1 4.7 1.7 21.2 11.2
(n=11)
Group IIe 90.0 7.7 43.0 6.7 4.2 12.5 1.7 1 5.6 1.6 55.4 12
(n=8)
Chiu and al (2000) developed a selforganizing cluster system for the arte
rial pulse based on ART neural network. The technique provides at least three
novel diagnostic tools in the clinical neurophysiology laboratory. First, the pieces
affected by unexpected artiﬁcial motions (ie physical disturbances) can be deter
mined easily by the ART2 neural network according to a status distribution plot.
Second, a few categories will be created after applying the ART2 network to the
input patterns (i.e minimal cardiac cycles). These pulse signal categories can be
useful to physicians for diagnosis in conventional clinical uses. Third, the status
distribution plot provides an alternative method to assist physicians in evaluat
ing the signs of abnormal and normal automatic control. The method has shown
clinical applicability for the examination of the autonomic nervous system.
Ishihara et al (1995) build a Kansey Engineering System based on ART neu
ral network to assist designers in creating products ﬁt for consumers´s underlying
needs, with minimal effort. Kansei Engineering is a technology for translating
human feeling into product design. Statistical models (linear regression, factor
analysis, multidimensional scaling, centroid clustering etc) is used to analyse the
feelingdesign relationship and building rules for Kansei Engineering Systems.
Although reliable, they are time and resource consuming and require expertise in
relation to its mathematical constraint. They found that their ART neural network
enable quick, automatic rule building in Kansei Engineering systems. The catego
rization and feature selection of their rule was also slight better than conventional
multivariate analysis.
4.1.4 Self Organizing Maps  (SOM) Networks
Self organizing maps, sometimes called topologically ordered maps or Koho
nen selforganizing feature maps is a method closely related to multidimensional
CHAPTER 4. MULTIVARIATE STATISTICS NEURAL NETWORKS 69
scaling. The objective is to represent all points in the original space by points in
a smaller space, such that distance and proximity relations are preserved as much
as possible.
There are m cluster units arranged in a one or two dimensional array.
The weight vector for a cluster neuron serves as an exemplar of the input pat
terns associated with that cluster. During the selforganizing process, the cluster
neuron whose weight vector matches the input pattern most closely (usually, the
square of the minimum Euclidean distance) is chosen as the winner. The winning
neuron and its neighboring units (in terms of the topology of the cluster neurons
update their weights the usual topology in two dimensional are shown in Figure
4.5).
Figure 4.5: Topology of neighboring regions
The architecture of the Kohonen SOMneural network for one and twodimensional
array and its topology is shown in Figures 4.64.7 and Figures 4.84.94.10 for the
one dimensional and two dimensional array. Figure 4.11 presents the weight up
dating algorithm for the Kohonen SOM neural networks (from Fausett 1994) and
Figure 4.12, the Mexican Hat interconection for neurons in the algorithm.
CHAPTER 4. MULTIVARIATE STATISTICS NEURAL NETWORKS 70
Figure 4.6: Kohonen selforganizing map.
* * * * * **
#
* * * **
{ ( [ ] ) }
{
{
} ( ) [ ] R=2 R=1 R=0
Figure 4.7: Linear array of cluster units
CHAPTER 4. MULTIVARIATE STATISTICS NEURAL NETWORKS 71
Figure 4.8: The selforganizing map architecture
Figure 4.9: Neighborhoods for rectangular grid
CHAPTER 4. MULTIVARIATE STATISTICS NEURAL NETWORKS 72
Figure 4.10: Neighborhoods for hexagonal grid
Step 0 Initialize weights w
ij
Set topological neighborhood parameters.
Set learning rate parameters.
Step 1 While stopping condition is false, do Steps 28.
Step 2 For each input vector x, do Steps 35.
Step 3 For each j, compute:
D(j) =
i
(w
ij
−x
i
)
2
Step 4 Find index J such that D(J) is minimum.
Step 5 For all units within a speciﬁed neighborhood of J and for all i:
w
ij
(new) = w
ij
(old) + α[x
i
−w
ij
(old)]
Step 6 Update learning rate.
Step 7 Reduces radius of topological neighborhood at speciﬁed times.
Step 8 Test stopping condition.
Figure 4.11: Algorithm for updating weights in SOM
CHAPTER 4. MULTIVARIATE STATISTICS NEURAL NETWORKS 73
Figure 4.12: Mexican Hat interconnections
Braga et al (2000) sampled 1000 samples with equal probability from two
bivariate normal distributions with variances 1 and mean vectors µ
1
= (4, 4),µ
2
=
(12, 12). The distribution and the data are shown in Figure 4.13 and 4.14.
0
5
10
15
20
0
10
20
0
0.1
0.2
0.3
0.4
0.5
Class 01
Class 02
Figure 4.13: Normal density function
of two clusters
0 2 4 6 8 10 12 14 16
0
2
4
6
8
10
12
14
16
X
1
X
2
Class 01
Class 02
Figure 4.14: Sample fromnormal clus
ter for SOM training
The topology was such that each neuron had 4 other neighbouring neurons.
The resulting map was expected to preserve the statistical characteristics, ie the
visual inspection of the map should indicate how the data is distributed.
Figure 4.15 gives the initial conditions (weights). Figures 4.16 4.18 presents
the map after some interactions.
CHAPTER 4. MULTIVARIATE STATISTICS NEURAL NETWORKS 74
Figure 4.15: Initial weights
Figure 4.16: SOM weights, 50 itera
tions
Figure 4.17: SOM weights, 300 iter
ations
Figure 4.18: SOM weights, 4500 it
erations
CHAPTER 4. MULTIVARIATE STATISTICS NEURAL NETWORKS 75
Lourenc¸o (1998) applied a combination of SOM networks, fuzzy set and clas
sical forecasting methods for the shorttermelectricity load prediction. The method
establish the various load proﬁles and process climatic variables in a linguistic
way, and those from the statistical models. The ﬁnal model includes a classiﬁer
scheme, a predictive scheme and a procedure to improve the estimations. The
classiﬁer is implemented via an SOM neural network. The forecaster uses statis
tical forecasting techniques (moving average, exponential smoothing and ARMA
type models). A fuzzy logic procedure uses climatic variables to improve the
forecast.
The complete system is shown in Figure 4.19.
Figure 4.19: Combined forecast system
The classiﬁer used a SOM network with 12 neurons displaced in four rows
and three columns. The classiﬁcation of patterns are shown in Figures 4.20 and
Figure 4.21 for the year of 1993.
1
Sunday
2
Sunday
3
Saturday
4
Saturday
5
Saturday
6
Monday to Friday
7
Monday to Friday
8
Monday to Friday
9
Monday to Friday
10
Monday to Friday
11
Monday to Friday
12
Monday to Friday
Figure 4.20: Week days associated with the SOM neurons
The proposed method improved the correct method used in the Brazilian Elet
ric System, which is important specially in the pick hours.
CHAPTER 4. MULTIVARIATE STATISTICS NEURAL NETWORKS 76
1 2 3
4 5 6
Winter
7
Summer
8
Spring
9
Winter
10
Summer
11
Autumn
12
Winter
Figure 4.21: Seasons associated with the SOM neurons
Draghici and Potter (2003) predict HIV drug resistance using. Since drug
resistance is a important factor inﬂuencing the failure of current HIV therapies.
The ability to predict the drug resistance of HIV protease mutants may be useful
in developing more effective and long lasting treatment regimens.
In this work the authors predicted the HIV resistance to two current protease
inhibitors. The problem was approached from two perspectives. First, a predictor
was constructed based on the structural features of the HIVproteasegrug inhibitor
complex. A particular structure was represented by its list of contacts between
the inhibitor and the protease. Next, a classiﬁer was constructed based on the
sequence data of various drug resistent mutants. In both cases SOM networks
were ﬁrst used to extract the important features and cluster the patterns in an
unsupervised manner. This was followed by subsequent labelling based on the
known patterns in the training set.
The prediction performance of the classiﬁers was measured by crossvalidation.
The classiﬁer using the structure information correctly classiﬁed previously un
seen mutants with and accuracy of between 60 and 70 %. Several architectures
were tested on the more abundant sequence of 68 % and a coverage of 69 %. Mul
tiple networks were then combined into various majority voting schemes. The best
combination yielded an average of 85% coverage and 78% accuracy on previously
unseed data. This is more than two times better than the 33 % accuracy expected
from a random classiﬁer.
Hsu et al (2003) describe an unsupervised dynamic hierarchical selforganizing
approach, which suggests an appropriate number of clusters, to perform class dis
covery and marker gene identiﬁcation in microarray data. In the process of class
discovery, the proposed algorithm identiﬁes corresponding sets of predictor genes
that best distinguish one class from other classes. The approach integrates merits
of hierarchical clustering with robustness against noise know from selforganizing
approaches.
The proposed algorithm applied to DNA microarray data sets of two types of
cancers has demonstrated its ability to produce the most suitable number of clus
ters. Further, the corresponding marker genes identiﬁed through the unsupervised
CHAPTER 4. MULTIVARIATE STATISTICS NEURAL NETWORKS 77
algorithm also have a strong biological relationship to the speciﬁc cancer class.
The algorithm tested on leukemia microarray data, which contains three leukemia
types, was able to determine three major and one minor cluster. Prediction mod
els built for the four clusters indicate that the prediction strength for the smaller
cluster is generally low, therefore labelled as uncertain cluster. Further analysis
shows that the uncertain cluster can be subdivided further, and the subdivisions
are related to two of their original clusters. Another test performed using colon
cancer microarray data has automatically derived two clusters, which is consistent
with the number of classes in data (cancerous and normal).
Mangiameh et al (1996) provide a comparison of the performance of SOM
network and seven hierarchical cluster methods: single linkage, complete linkage,
average linkage, centroid method, Ward´s method, two stage density and knearest
neighbor. A total of 252 empirical data set was constructed to simulate several
levels of imperfections that include dispersion, aberrant observations, irrelevant
variables, nonuniform cluster, more detailed description in shown in Tables 4.5
4.7.
Data set Design factors # of
data sets
2,3,4 or 5 clusters;
Base data 4,6 or 8 variables; 36
low, med or high dispersion
Irrelevant variables Basic data plus 1 or 2 irrelevant variables 72
Cluster density Basic data plus 10% or 60% density 72
Outliers Basic data plus 10% or 20% outliers 72
Table 4.5: Data set design
Number of Average
clusters distance (units)
2 5.03
3 5.31
4 5.33
5 5.78
Table 4.6: Average distance between cluster
The experimental results are unambiguous; the SOM network is superior to all
seven hierarchical clustering algorithms commonly used today. Furthermore, the
CHAPTER 4. MULTIVARIATE STATISTICS NEURAL NETWORKS 78
Level of Avg. cluster Range in
dispersion std. dev. std.dev.
(in units)
High 7.72 3
Medium 3.72 6
Low 1.91 12
Table 4.7: Basis for data set construction
performance of the SOM network is shown to be robust across all of these data
imperfections. The SOM superiority is maintained across a wide range of ”messy
data” conditions that are typical of empirical data sets. Additionally, as the level of
dispersion in the data increases, the performance advantage of the SOM network
relative to the hierarchical clustering methods increases to a dominant level.
The SOM network is the most accurate method for 191 of the 252 data sets
tested, which represents 75.8% of the data. The SOM network ranks ﬁrst or sec
ond in accuracy in 87.3% of the data sets. For the high dispersion data sets, the
SOM network is most accurate 90.2% of the time, and is ranked ﬁrst or second
91.5% of the time. The SOM network frequently has average accuracy levels of
85% or greater, while other techniques average between 35% and 70%. Of the
252 data sets investigated, only six data sets resulted in poor SOM results. These
six data sets occurred at low levels of data dispersion, with a dominant cluster
containing 60% or more of the observations and four or more total clusters. De
spite the relatively poor performance at these data conditions, the SOM network
did average 72% of observations correctly classiﬁed.
These ﬁnds seems to contradict the bad performance of SOM network com
pared to kmeans clustering method applyed in 108 data sets reported by Balakr
ishman et al (1994).
Carvalho et al (2006) applied the SOM network to gamma ray burst and sug
gested the existence of 5 clusters of burst. This result was conﬁrmed by a feed
foward network and principal component analysis.
4.2 Dimensional Reduction Networks
In this section we will present some neural networks designed to deal with the
problem of dimensional reduction of data.
In several occasions it is useful and even necessary to ﬁrst reduce the dimen
sion of the data to a manageable size, keeping as much of the original information
as possible, and then to proceed with the analysis.
CHAPTER 4. MULTIVARIATE STATISTICS NEURAL NETWORKS 79
Sometimes, a phenomenon which is in appearance highdimension is actually
governed by few variables (sometimes called ”latent variables” or ”factors”). The
redundance, presented in the data collected related to the phenomenon, can be
due, for example, to:
− Many of the variables collected may be irrelevant .
− Many of the variables will be correlated (therefore some redundant infor
mation is contained in the data), a new set of uncorrelated or independent
variables should be found.
An important reason to reduce the dimension of the data is that some authors
call: ”the curse of dimensionality and the empty space phenomenon”. The curse
of dimensionality phenomenon refers to the fact that in the absence of simplifying
assumptions, the sample size needed to make inferences with a given degree of
accuracy grows exponentially with the number of variables. The empty space
phenomenon responsible for the curse of dimensionality is that highdimensional
spaces are inherently sparse. For example, for a one dimensional standard normal
N(0, 1), about 70% of the mass is at points contained in the interval (sphere of
radius of one standard deviation around the mean zero. For a 10dimensional
N(0, I), the same (hyper) sphere contain only 0, 02% of the mass, and a radius of
more than 3 standard deviations is need to contain 70%.
A related concept in dimensional reduction methods is the intrinsic dimension
of a sample, which is deﬁned as the number of independent variables that explain
satisfactorily the phenomenon. This is a loosely deﬁne concept and a trial and
error process us usually used to obtain a satisfactory value for it.
There are several techniques to dimensional reduction and can be classiﬁed in
two types: with and without exclusion of variables. The main techniques are:
− Selection of variables:
– Expert opinion.
– Automatic methods: all possible regressions, best subset regression,
backward elimination, stepwise regression, etc.
− Using the original variables:
– Principal Components.
– Factor Analysis.
– Correspondence Analysis.
– Multidimensional Scaling.
CHAPTER 4. MULTIVARIATE STATISTICS NEURAL NETWORKS 80
– Nonlinear Principal Components
– Independent Component Analysis
– Others
We are concerned in this book with the latter group of techniques (some of
which have already being described: Projection Pursuit, GAM, Multidimensional
Scaling, etc).
Now we turn to some general features and common operation on the data
matrix to be used in these dimensional reduction methods.
4.2.1 Basic Structure of the Data Matrix
A common operation in most dimensional reduction technique is the decom
position of a matrix of data into its basic structure.
We refer to the data set, the data matrix as X, and the speciﬁc observation of
variable j in subset i, as x
ij
. The dimension of X is (n x p), corresponding to n
observations of p variables (n > p).
Any data matrix can be decomposed into its characteristic components:
X
n x p
= U
n x p
d
p x p
V
p x p
(4.2.1)
which is call ”single value decomposition”, i.e. SVD.
The U matrix ”summarizes” the information in the rows of X, the V matrix
”summarizes” information in the columns of X, the d matrix is a diagonal ma
trix, whose diagonal entries are the singular values and are weigths indicating the
relative importance of each dimension in U and V and are ordered from largest
to smallest.
If X does not contain redundant information the dimension of U, V e d will
be equal the minimum dimension p of X. The dimension is called rank.
If S is a symmetric matrix the basic structure is:
S = UdV
= UdU
= V dV
(4.2.2)
Decomposition of X, X.X
or X
.X reveals the same structure:
X
nxp
= UdV
(4.2.3)
XX
(nxn)
= Ud
2
U
(4.2.4)
XX
(pxp)
= V d
2
V
(4.2.5)
A retangular matrix X can be made into symmetric matrices (XX
and
X
X) and their eigenstructure can be used to obtain the structure of X. If
XX
= UD and X
X = V DV
then X = UD
1
2
V .
CHAPTER 4. MULTIVARIATE STATISTICS NEURAL NETWORKS 81
The columns of U and V corresponding to the eigenvalues D
i
are also related
by: If U
i
is an eigenvector of XX
corresponding to D
i
then the corresponding
eigenvector of X
X corresponding to D
i
is equal or proportional to U
i
X.
Some dimensional reduction techniques share the same decomposition algo
rithm SVD. They differ only in the predecomposition transformation of the data
and their posdecomposition transformation of the latent variables. The following
transformation will be used
− Deviation from the mean:
x
ij
= x
ij
− ¯ x
j
(4.2.6)
where x
j
is the column (variable j) mean.
− Standardization
z
ij
=
x
ij
− ¯ x
j
S
j
(4.2.7)
where ¯ x
j
and S
j
are the column (variable j) mean and standard deviation.
− Unit length
x
∗
ij
=
x
ij
(
x
2
ij
)
1
2
(4.2.8)
− Doublecentering
x
∗
ij
= x
ij
− ¯ x
i
− ¯ x
j
+ ¯ x
ij
(4.2.9)
obtained by subtracting the row and column means and adding back the
overall mean.
− For categorical and frequency data
x
∗
ij
=
x
ij
(
i
x
ij
j
x
ij
)
1
2
=
x
ij
(x
•j
x
i•
)
1
2
(4.2.10)
From these transformations the following matrices are obtained:
− Covariance Matrix
C =
1
n
(X −
¯
X)
(X −
¯
X
) (4.2.11)
from the mean corrected data.
CHAPTER 4. MULTIVARIATE STATISTICS NEURAL NETWORKS 82
− RCorrelation Matrix
R =
1
n
Z
c
Z
c
(4.2.12)
where Z
c
indicates the matrix X standardized within columns in the num
ber of observations.
− QCorrelation Matrix
Q =
1
p
Z
r
Z
r
(4.2.13)
where Z
r
indicates the matrix X standardized within rows, and p is the
number of variables.
4.2.2 Mechanics of Some Dimensional Reduction Techniques
Nowwe outline the relation of the SVDalgorithmto the dimensional reduction
algorithm, some of which can be computed using neural networks.
4.2.2.1 Principal Components Analysis  PCA
When applying PCA in a set of data the object is:
− to obtain from the original variables a new set of variables (factors, latent
variables) which are uncorrelated.
− to hope that a few of this new variables will account for the most of the
variation of the data, that is the data can be reasonably represented in lower
dimension, and keeping most of the original information
− these new variables can be reasonable interpreted.
The procedure is performed by applying a SVD on the CCorrelation Ma
trix or in the RCorrelation Matrix. Since PCA is not invariant to transformation
usually the R matrix is used.
4.2.2.2 Nonlinear Principal Components
Given the data matrix X, this procedure consists in performing the SVD al
gorithm in some function φ(X).
In the statistical literature an early reference is Gnanadesikan (1999) and is
closely related to Kernel principal components, Principal Curves and Informax
method in Independent Component Analysis. The later will be outlined later.
CHAPTER 4. MULTIVARIATE STATISTICS NEURAL NETWORKS 83
4.2.2.3 Factor Analysis  FA
Factor analysis has also the aim of reducing the dimensionality of a variable
set and it also has the objective of representing a set of variables in terms of a
smaller number of hypothetical variables.
The main differences between PCA and FA are:
− PCA decomposes the total variance. In the case of standardized variables, it
produces a decomposition of R. FAon the other hand ﬁnds a decomposition
of the reduced matrix R−U, where U is a diagonal matrix of the ”unique”
variances associated with the variables. Unique variances are that part of
each variable’s variance that has nothing in common with remaining p − 1
variables.
− PCA is a procedure to decompose the correlation matrix without regard to
an underlying model. FA, on the other hand, has an underlying model that
rest on a number of assumptions, including normality as the distribution for
the variables.
− The emphasis in FA is in explaining X
i
as a linear function of hypothetical
unobserved common factors plus a factor unique to that variable, while the
emphasis in PCAis expressing the principal components as a linear function
of the X
i
’s. Contrary to PCA, the FA model does not provide a unique
transformation form variables to factors.
4.2.2.4 Correspondence Analysis  CA
Correspondence analysis represents the rows and columns of a data matrix
as points in a space of low dimension, and it is particularly suited to (twoway)
contingency tables.
Loglinear model (LLM) is also a method widely used for analysing contin
gency tables. The choice of method, CA or LLM depends of the type of data to
be analysed and on what relations or effects are of most interest. A practical rule
for deciding when each method is to be preferred is: CA is very suitable for dis
covering the inherent structure of contingency tables with variables containing a
large number of categories. LLMis particularly suitable for analysing multivariate
contingency tables, where the variables containing few categories. LLM analyse
mainly the interrelationships between a set of variables, where as correspondence
analysis examines the relations between the categories of the variables.
In CA the data consists of an array of frequencies with entries f
ij
, or as a
matrix F.
CHAPTER 4. MULTIVARIATE STATISTICS NEURAL NETWORKS 84
− The ﬁrst step is to consider the transformation:
h
ij
=
f
ij
_
f
i•
f
•j
(4.2.14)
where h
ij
is the entry for a given cell, f
ij
the original cell frequency, f
i•
the
total for row i and f
•j
the total for column j.
− The second step is to ﬁnd the basic structure of the normalized matrix H
with elements (h
ij
) using SVD. This procedure summary row and column
vectors, U and V of H and a diagonal matrix of singular values, d, of H.
The ﬁrst singular value is always one and the successive values constitute
singular values or canonical correlations.
− The third and ﬁnal step is to rescale the row (U) and column (V ) vectors to
obtain the canonical scores. The row (X) and column (Y ) canonical scores
are obtained as:
X
i
= U
i
¸
f
••
f
i•
(4.2.15)
Y
i
= V
i
¸
f
••
f
•j
(4.2.16)
4.2.2.5 Multidimensional Scaling
Here we will describe the classical or metric method of multidimensional scal
ing which is closely related to the previous dimensional reduction method, since
it also os based on SVD algorithm. Alternative mapping and ordinal methods will
not be outlined.
In multidimensional scaling we are given the distances d
rs
between every para
of observations. Given the data matrix X, there are several ways to measure
the distance between pair of observations. Since many of them produce measure
of distances which do not satisfy the triangle inequality, the term dissimilarity is
used.
For continuous data X, the most common choice is the Euclidean distance:
d
2
= XX
This depend on the scale in which the variables are measured, One
way out is to use Mahalanobis distance with respect to the covariance matrix
ˆ
Σ of
the observations.
For categorical data a commonly used dissimilarity is based on the simple
matching coefﬁcient, that is the proportion C
rs
, of features which are common to
the two observations r and s. As this is between zero and one, the dissimilarity is
found to be d
rs
= 1 −c
rs
.
CHAPTER 4. MULTIVARIATE STATISTICS NEURAL NETWORKS 85
For ordinal data we use the rank as if they were continuous data after rescaling
to the range (0,1) so that every original feature is given equal weight.
For mixture of continuous, categorical and ordinal features a widely used pro
cedure is: for each feature deﬁne a dissimilarity d
f
rs
and an indicator I
f
rs
which is
one only if feature f is recorded for both observation. Further I
f
rs
= 0 if we have
a categorical feature and an absence absence. Then
d
rs
=
f
I
f
rs
d
f
rs
I
f
rs
(4.2.17)
In the classical or metric method of multidimensional scaling, also known as
principal coordinate analysis, we assume that the dissimilarities were derived as
Euclidean distances between n points in p dimensions. The steps are as follows:
− From the data matrix obtain the matrix of distances T (XX
or XΣX
).
− Double centering the dissimilarity matrix T, say T
∗
.
− Decompose the matrix T
∗
using the SVD algorithm.
− From the p (at most) nonzero vectors represent the observations in the
space of two (or three) dimensions.
4.2.3 Independent Component Analysis  ICA
The ICA model will be easier to explain if the mechanics of a cocktail party
are ﬁrst described. At a cocktail party there are partygoers or speaker holding con
versations while at the same time there are microphones recording or observing
the speakers also called underlying source or component.
At a cocktail party, there are p microphones that record or observe m party
gower or speaker at n instants. This notation is consistent with traditional Mul
tivariate Statistics. The observed conversation consists of mixtures of true unob
served conversations. The microphones do not observe the observed conversation
in isolation. The recorded conversation are mixed. The problem is to unmix or
recover the original conversation from the recorded mixed conversation.
The relation of ICA with other methods of dimension reduction is shown in
Figure 4.22 as discussed in Hyvarinen (1999). The lines show close connections,
and the text next to the lines show the assumptions needed for the connection.
Before we formalize the ICA method we should mention that the roots of basic
ICA can be traced back to the work of Darmois in the 1950 (Darmois 1953) and
Rao in the 1960 (Kagan et al, 1973) concerning the characterization of random
variables in linear structures. Only recently this has been recognized by ICA
CHAPTER 4. MULTIVARIATE STATISTICS NEURAL NETWORKS 86
Figure 4.22: (The lines show close connections, and the text next to the lines show
the assumptions needed for the connection.)
researchers (Common, 1994, Jutten and Taleb 2000, Fiori 2003) which regrets
that this may have delayed progress.
For the general problem of source separation we use the statistical latent vari
ables model. Suppose we have p linear mixtures of m independent components:
_
_
x
i1
. . .
x
ip
_
_
=
_
_
a
11
.s
i1
+ . . . + a
1m
.s
im
. . .
a
p1
.s
i1
+ . . . + a
pm
.s
im
_
_
=
_
_
a
11
. . . a
1m
a
p1
. . . a
pm
_
_
_
_
s
i1
. . .
s
im
_
_
x
i
= As
i
(4.2.18)
Actually the more general case of unmixing consider a nonlinear transforma
tion and an additional error term ie:
x
i
= f(s
i
) +
i
(4.2.19)
the model 4.2.18 is the base for blind source separation method. Blind means that
we know very little or anything on the mixing matrix and make few assumptions
on the sources s
i
.
The principles behind the techniques of PCA and FA is to limit the number of
components s
i
to be small that is m < p and is based on obtaining noncorrelated
sources.
ICA consider the case of p = m and obtain estimates of Aand to ﬁnd compo
nents s
i
that are independent as possible.
After estimating the matrix A, we can compute its inverse, and obtain the
independent component simply by:
s = Wx (4.2.20)
Some of the characteristics of the procedure is:
CHAPTER 4. MULTIVARIATE STATISTICS NEURAL NETWORKS 87
− We ﬁx the variances of the components to be one.
− We cannot determine the order of the components.
− To estimate A at most one of the components can be Gaussian. The inde
pendent components must be Gaussian.
The key to estimating the ICA model is nonGaussianity and independence.
There are several methods: maximum likelihood and network entropy, mutual
information and KulbackLeibler divergence, nonlinear cross correlations, non
linear PCA, higherorder cumulants, weighted covariance matrix (Hyvarinen, 1999).
Here we outline the Infomax principle of network entropy maximization which
is equivalent to maximum likelihood (see Hyvarinen and Oja, 2000). Here we fol
low Jones and Porril (1998).
Consider eq.4.2.20, s = Wx. If a signal or source s has a cdf cumulative
density function g, then the distribution of g(s) by the probability integral trans
formation, has an uniform distribution ie has maximum entropy.
The unmixing W can be found by maximizing the entropy of H(U) of the joint
distribution U = (U
1
, . . . , Up) = (g
1
(s
1
) . . . g
p
(s
p
) where s
i
= Wx
i
. The correct
g
i
have the same form as the cdf of the x
i
, and which is sufﬁcient to approximate
these cdfs by sigmoids:
U
i
= tanh(s
i
) (4.2.21)
Given that U = g(Wx), the entropy H(U) is related to H(x) by:
H(U) = H(x) + E(log[
∂U
∂x
[) (4.2.22)
Given that we wish to ﬁnd W that maximizes H(U), we can ignore H(x) in
(4.2.22). Now
∂U
∂x
= [
∂U
∂s
[[
∂s
∂x
[ =
p
i=1
g
,
(s
i
)[W[ (4.2.23)
Therefore 4.2.22 becomes:
H(U) = H(x) + E(
p
i=1
g(s
,
i
)) + log[W[ (4.2.24)
The term E = (
p
i=1
g(s
i
)), given a sample of size n, can be estimated by:
E(
p
i=1
g(s
i
)) ≈
1
n
n
j=1
p
i=1
logg
,
i
(s
(j)
i
) (4.2.25)
CHAPTER 4. MULTIVARIATE STATISTICS NEURAL NETWORKS 88
Ignoring H(x), equation 4.2.24 yield a new function which differ from H(U)
by H(x).
h(W) =
1
n
n
j
p
i
logg
,
i
(s
∗
i
) + log[W[ (4.2.26)
Deﬁning the cdf g
i
= tanh and recalling that g
,
= (1 −g
2
) we obtain:
h(W) =
1
n
j=1
n
p
i=1
log(1 −s
,2
i
) + log[W[ (4.2.27)
whose maximization with respect to W yield:
_W =
∂H(W)
∂W
=
∂h((W))
∂(W)
= (W
,
)
−1
+ 2.s.x
,
(4.2.28)
and an unmixing matrix can be found by taking small steps of size η to W
∆W = η((W
,−1
−2.s.x
,
) (4.2.29)
After reescaling, it can be shown (see Lee, 1998 p 41,45) that a more general
expression is:
∆W ≈
_
(I −tanh(s)s
,
−ss
,
)W superGaussian
(I + tanh(s)s
,
−ss
,
)W subGaussian
(4.2.30)
Supergaussian random variables have typically a ”spiky” pdf with heavy tails,
ie the pdf is relatively large at zero and at large values of the variable, while
being small for intermediate values. A typical example is the Laplace (or double
exponential) distribution. SubGaussian randomvariables, on the other hand, have
typically a ”ﬂat” pdf which is rather constant near zero, and very small for large
values of the variables. A typical example is uniform distribution.
We end this section with two illustrations of the applications of ICA and PCA
to simulated examples from Lee (1998, p 32). The ﬁrst simulated example in
volves two uniformly distributed sources s
1
e s
2
.
The sources are linearly mixed by:
x = As (4.2.31)
_
x
1
x
2
_
=
_
1 2
1 1
_ _
s
1
s
2
_
Figure 4.23 shows the results of applying of the mixing transformation and the
application of PCA and ICA. We see that PCA only sphere (decorrelate) the data,
CHAPTER 4. MULTIVARIATE STATISTICS NEURAL NETWORKS 89
−2 −1 0 1 2
−2
−1
0
1
2
Original Sources
s
1
s
2
−4 −2 0 2 4
−2
−1
0
1
2
Mixed Sources
x
1
x
2
−4 −2 0 2 4
−0.5
0
0.5
PCA Solution
u
1
u
2
−4 −2 0 2 4
−4
−2
0
2
4
ICA Solution
u
1
u
2
Figure 4.23: PCA and ICA transformation of uniform sources (Topleft:scatter
plot of the original sources, Topright: the mixtures, Bottomleft: the recovered
sources using PCA, Bottomright: the recovered sources using ICA.)
CHAPTER 4. MULTIVARIATE STATISTICS NEURAL NETWORKS 90
while ICA not only sphere the data but also rotate it such that ˆ s
1
= u
1
and ˆ s
2
= u
2
have the same directions of s
1
and s
2
.
The second example in Figure 4.24 shows the time series of two speach signals
s
1
and s
2
. The signal are linearly mixed as the previous example in eq. 4.2.31.
The solution indicates that the recovered sources are permutated and scaled.
0 0 00
0
0 0 0 8 00 0 000 0 10 0 0 6 4 2
1
1
0.5
0 2000 4000 6000 8000 10000
0
0.5
1
0
1
0 2000 4000 6000 8000 10000
5
0
5
0 2000 4000 6000 8000 10000
0.5
0
0.5
0
0.5
2000 4000 6000 8000 10000
10000 8000 6000 4000 2000 0
0.5
0
0.2
0
0.2
0 2000 4000 6000 8000 10000
5
0
5
2000 4000 6000 8000 10000


OriginalSources
MixedSources
PCA Solution
ICA Solution
channel#1 channel#2
timesamples timesamples
0
Figure 4.24: PCA and ICA transformation of speach signals (Top: the original
sources, second row: the mixtures, third row: the recovere sources using PCA,
bottom: the recovered sources using ICA.)
CHAPTER 4. MULTIVARIATE STATISTICS NEURAL NETWORKS 91
4.2.4 PCA networks
There are several neural networks architecture for performing PCA (See Dia
mantaras and Kung 1996 and Hertz, Kragh and Palmer,1991). They can be:
− Autoassociator type. The network is trained to minimize:
n
i=1
p
k=1
(y
ik
(x −x
ik
))
2
(4.2.32)
or
[[x
,
−F
1
(x)[[ = [[A
,
(x −µ)[[ (4.2.33)
The architecture with two layer and indentity activation function is as Figure
4.25.
Figure 4.25: PCA networks  Autoassociative
− Networks based on Oja (Hebbian) rule
The network is trained to ﬁnd W that minimizes
n
i=1
[[x
i
−W
,
Wx
i
[[
2
(4.2.34)
and its architecture is shown in Figure 4.26.
CHAPTER 4. MULTIVARIATE STATISTICS NEURAL NETWORKS 92
Figure 4.26: PCA networks architecture  Oja rule.
Several update rules has been suggested, all of each follows the same matrix
equation:
∆W
l
= η
l
y
l
x
,
l
−K
l
W
l
(4.2.35)
where x
l
and y
l
are the input and output vectors, η
l
the learning rate and W
l
is
the weight matrix at the lth iteration. K
l
is a matrix whose choice depends on the
speciﬁc algorithm, such as the following:
1. William´s rule: K
l
= y
l
y
,
l
2. OjaKarhunen´s rule: K
l
= 3D(y
l
y
,
l
) + 2L(y
n
y
,
n
)
3. Sanger´s rule: K
l
= L(y
l
y
,
l
)
where D((A)) is a diagonal matrix with diagonal entries equal to those of A,
L(A) is a matrix whose entries above and including the main diagonal are zero.
The other entries are the same as that of A. For example:
D
_
2 3
3 5
_
=
_
2 0
0 5
_
L
_
2 3
3 5
_
=
_
0 0
3 0
_
(4.2.36)
To illustrate the computation involved consider the n = 4 sample of a p = 3
variable data in Table 4.8.
CHAPTER 4. MULTIVARIATE STATISTICS NEURAL NETWORKS 93
Table 4.8: Data p = 3, n = 4.
n x
1
x
2
x
3
1 1.3 3.2 3.7
2 1.4 2.8 4.1
3 1.5 3.1 4.6
4 1.1 3.0 4.8
The covariance matrix is:
S
2
=
1
5
_
_
0.10 0.01 −0.11
0.01 0.10 −0.10
−0.11 −0.10 0.94
_
_
(4.2.37)
the eigenvalues and eigenvectors are shown in Table 4.9.
Table 4.9: SVD of the covariance matrix S
2
eigenvalue eigenvectors
V
1
V
2
V
3
d
1
0.965 0.823 0.553 0.126
d
2
0.090 0.542 0.832 0.115
d
3
0.084 0.169 0.026 0.985
Now we will compute the ﬁrst principal component using a neural network
architecture.
Figure 4.27: Oja´s rule
CHAPTER 4. MULTIVARIATE STATISTICS NEURAL NETWORKS 94
We choose η
l
= 1.0 for all values of l, and select the initial values w
1
=
0.3,w
2
= 0.4 and w
3
= 0.5. For the ﬁrst input x
1
= (0, 0.2, −0.7) we have:
y = (0.3, 0, 4, 0, 5).(0, 0.2, −0.7) = −0.27 (4.2.38)
´W = (−0.27(0, 0.2, −0.7) −(−0.27)
2
.(0.3, 0.4, 0.5)) (4.2.39)
W = (0.3, 0.4, 0.5) +´W = (0.278, 0.316, 0.652) (4.2.40)
For the next input x
2
= (0.1, −0.2, −0.3)
y = −0.231 (4.2.41)
´W = (−0.231(0.10, −0.20, −0.30) −(−0.231)
2
.(0.278, 0.316, 0.651))
(4.2.42)
W = (0.278, 0.316, 0.652) +´W = (0.240, 0.346, 0.687) (4.2.43)
Subsequent presentation of x
3
,x
4
and x
5
change the weight matrix to (0.272, 0.351, 0.697),
(0.238, 0.313, 0.751) and (0.172, 0.293, 0.804), respectively.
This process is repeated, cycling through the input vectors x
1
,x
2
,. . ., x
5
by
the end of the second iteration, the weights becomes (−0.008, −0.105, 0.989) at
the end of the third interaction it chances to (−0.111, −0.028, 1.004). The weight
adjustment process continues in this manner, resulting in the ﬁrst principal com
ponent.
Further applications are mentioned in Diamantaras and Kung (1996). Most of
the applications are in face, vowel, speach recognition. A comparison of the per
formance of the alternative rules are presented in Diamantaras and Kung (1996)
and Nicole (2000). The later author also compare the alternative algorithms for
neural computation of PCA with the classical batch SVD using the Fisher´s iris
data.
4.2.5 Nonlinear PCA networks
The nonlinear PCA networks is trained to minimize:
1
n
n
i=1
[[x
i
−W
,
ϕ(Wx)[[
2
(4.2.44)
where ϕ is a nonlinear function with scalar arguments.
Since from Kolmogorov theorem, a function ϕ can be approximated by a sum
of sinoidals, the architecture of the nonlinear PCA network is as in Figure 4.28.
Application on face recognition is given in Diamantaras and Kung (1996).
CHAPTER 4. MULTIVARIATE STATISTICS NEURAL NETWORKS 95
Figure 4.28: Non linear PCA network
4.2.6 FA networks
The application of artiﬁcial neural networks to factor analysis has been imple
mented using two types of architectures:
A) Using autoassociative networks
This network implemented by Sun and Lai (2002) and called SunFA.
The concept consists in representing the correlation matrix R = (r
ij
) by cross
products (or outer product) of two factor symmetric matrices F
p x m
and F
,
p x m
.
Here the network architecture is that of Figure 4.25 and when there are m
common factors the network is trained to minimize the function
E =(r
12
−
m
k=1
f
1k
f
2k
)
2
+ (r
13
−
m
k=1
f
1k
f
3k
)
2
+ . . .
+ (r
1p
−
m
k=1
f
1k
f
pk
)
2
+ (r
23
−
m
k=1
f
2k
f
3k
)
2
+ . . .
(r
2p
−
m
k=1
f
2k
f
pk
)
2
+ . . . + (r
p−1,p
−
m
k=1
f
p−1,k
f
pk
)
2
(4.2.45)
Sun and Lai (2002) report the result of an application to measure ability of 12
students (verbal,numerical,etc) and a simulation result comparing their method
to other four methods (factor analysis, principal axes, maximum likelihood and
image analysis).
CHAPTER 4. MULTIVARIATE STATISTICS NEURAL NETWORKS 96
Five correlation matrices were used for samples of size n = 82,164 and 328
with factors matrices A,B an C. Ten sets of correlation submatrices where sam
pled from each correlation matrix. They found that SunFA was the best method.
B) Using PCA (Hebbian) networks
Here an PCA network is used to extract the principal components of the cor
relation matrix and to choose the number of factors (components) m that will be
used in the factor analysis. Then a second PCA network extract mfactors choosen
in the previous analysis.
This procedure has been used by Delichere and Demmi (2002a,b) and Calvo
(1997). Delichere and Demmi use an updating algorithm call GHA (General
ized Hebbian Algorithm) and Calvo the APEX (Adaptive Principal Component
Extraction) see Diamantaras and Kung (1996).
We outline Calvo´s application. The data consisted on school grades of 8
students in four subjects (Mathematics, Biology, French and Latin). The ﬁrst
PCA network shown that the ﬁrst component explained 73% of the total variance
and the second explained 27%.
The second network used the architecture shown in Figure 4.29.
Figure 4.29: Factor analysis APEX network
Calvo obtained the results in Tables 4.10 and 4.11 and Figures 4.2.6 and 4.31,
he also reported that the eigenvalues were equal within (±0.01) to the results
obtained on commercial software (SPPS).
CHAPTER 4. MULTIVARIATE STATISTICS NEURAL NETWORKS 97
−1 −0.5 0 0.5 1
−1
−0.8
−0.6
−0.4
−0.2
0
0.2
0.4
0.6
0.8
1
French
Latin
Biology
Mathematics
F1
F2
Figure 4.30: Original variables in fac
tor space
−4 −3 −2 −1 0 1 2 3 4
−4
−3
−2
−1
0
1
2
3
4
1
2
4
5
6
8
good in sciences
good in languages
good student bad student
7
3
Figure 4.31: Students in factor space
Student Math Biology French Latin
1 13 12.5 8.5 9.5
2 14.5 14.5 15.5 15
3 5.5 7 14 11.5
4 14 14 12 12.5
5 11 10 5.5 7
6 8 8 8 8
7 6 7 11 9.5
8 6 6 5 5.5
Table 4.10: Scholl grades in four subjects
Factor 1 Factor 2
Mathematics 0.4759 0.5554
Biology 0.5284 0.4059
French 0.4477 0.6298
Latin 0.5431 0.3647
Table 4.11: Loadings for the two factors retained
CHAPTER 4. MULTIVARIATE STATISTICS NEURAL NETWORKS 98
4.2.7 Correspondence Analysis Networks  CA
The only reference outlining howneural networks can be used for CAis Lehart
(1997). The obvious architecture is the same for SVD or PCA shown in Figure
4.32. Here we take:
Z
ij
=
f
ij
−f
i•
f
•j
_
f
i•
f
•j
(4.2.46)
Figure 4.32: CA network
Unfortunately there does not seem to be any application on real data yet.
CHAPTER 4. MULTIVARIATE STATISTICS NEURAL NETWORKS 99
4.2.8 Independent Component Analysis Networks  ICA
The usual procedure in ICA follows the following steps:
− standardize the data
− whitening (decorrelate i.e apply PCA)
− obtain the independent components (eq. Informax, ie NPCA).
One possible architecture is shown in Figure 4.33 where
x = Qs +ε
and the steps are:
− whitening: v
k
= V x
k
− independent components: y
k
= Wv
k
.
Figure 4.33: ICA neural network
Now, we present some recent applications of ICA.
i) Fisher iris´s data  Lee (1998) compare classiﬁcation algorithms with a clas
siﬁcation based on ICA. The data set contains 3 classes, 4 variables and 50 ob
servations in each class, each class refers to a type of iris plant. The training set
contained 75% of the observation and the testing set 25%. The classiﬁcation er
ror in the test data was 3%. A simple classiﬁcation with boosting and a kmeans
clustering gave errors of 4,8% and 4,12% respectively. Figure 4.34 presents the
independent directions (basis vectors).
CHAPTER 4. MULTIVARIATE STATISTICS NEURAL NETWORKS 100
Figure 4.34: An example of classiﬁcation of a mixture of independent compo
nents. There are 4 different classes, each generated by two independent variables
and bias terms. The algorithm is able to ﬁnd the independent directions (basis
vectors) and bias terms for each class.
CHAPTER 4. MULTIVARIATE STATISTICS NEURAL NETWORKS 101
Draghici et al (2003) implement a data mining technique based on the method
of ICAto generate reliable independent data sets for different HIVtherapies. They
consider a coupled model that incorporate three algorithms:
i) ICA model is used for data reﬁnement and normalization. The model accepts a
mixture of CD4+, CD8+ and viral load data to a particular patient and normalizes
it with broather data sets obtained from other HIV libraries. The ICA is used to
isolate groups of independent patterns embedded in the considered libraries.
ii) Cluster (Kohonen Map) networks are used to select those patterns chosen by
ICA algorithm that are close to the considered input data.
These two mechanisms helped to select similar data and also to improve the
system by throwing away unrelated data sets.
iii) Finally a nonlinear regression model is used to predict future mutations in the
CD4+,CD8+ and viral loads.
The author points out a series of advantage of their method over the usual
mathematical modelling using a set of simultaneous differential equation which
requires an expert to incorporate the dynamic behaviour in the equations.
Financial applications: the following are some illustration mentioned by Hy
varinen et al (2001):
i) Application of ICA as a complementary tool to PCA in a study of stock
portfolio, allowing the underline structure of the data to be more readily observed.
This can be of help in minimizing the risk in the investment strategy.
ii) ICA was applied to ﬁnd the fundamental factor common to all stores that
affect the cashﬂow, in the same retail chain. The effect of managerial actions
could be analysed.
iii) Time series prediction. ICA has been used to predict time series data
by ﬁrst predicting the common components and then transforming back to the
original series.
Further applications in ﬁnance are Cha and Chan and Chan and Cha (2001).
The book of Girolami (1999) also gives an application of clustering using ICA to
the Swiss banknote data on forgeries.
4.3 Classiﬁcation Networks
As mentioned before two important tasks in pattern recognition are patter clas
siﬁcation and cluster or vector quantization. Here we deal with the classiﬁcation
problem, whose task is to allocate an object characterized by a number of mea
surements into one of several distinct classes. Let an object (process, image, in
dividual, etc) be characterized by the kdimensional vector x = (x
1
, . . . , x
k
) and
let c
1
,c
2
,. . .,c
L
be labels representing each of the classes.
CHAPTER 4. MULTIVARIATE STATISTICS NEURAL NETWORKS 102
The objective is to construct and algorithm that will use the information con
tained in the components x
1
,. . .,x
k
of the vector x to allocate this vector to one of
the L distinct categories. Typically, the pattern classiﬁer or discrimination func
tion calculates L functions (similarity measures), P
1
, P
2
,. . .,P
L
and allocates the
vectors x to the class with the highest similarity measures.
Many useful ideas and techniques have been proposed for more than 60 years
of research in classiﬁcation problems. Pioneer works are Fisher (1936) Rao (1949)
and reviews and comparisons can be seen in Randys (2001), Asparoukhov (2001)
and Bose (2003).
Neural network models that are used for classiﬁcation have already been pre
sented in previous sections. Figures 4.35 to 4.40 presents a taxonomy and exam
ples of neural classiﬁers.
Figure 4.35: Taxonomy and decision regions (Type of decisions regions that can
be formed by single and multilayer perceptrons with hard limiting nonlineari
ties. Shading denotes regions for class A. Smooth closed contours bound input
distributions.)
CHAPTER 4. MULTIVARIATE STATISTICS NEURAL NETWORKS 103
Figure 4.36: Classiﬁcation by Alternative Networks: multilayer perceptron (left)
and radial basis functions networks (right)
Figure 4.37: A nonlinear classiﬁcation problem
CHAPTER 4. MULTIVARIATE STATISTICS NEURAL NETWORKS 104
Figure 4.38: Boundaries for neural networks with tree abd six hidden units (note
how using weight decay smooths the ﬁtted surface)
CHAPTER 4. MULTIVARIATE STATISTICS NEURAL NETWORKS 105
PROBABILISTIC
RECEPTIVE
FIELDS
(KERNELS)
GROUP DECISIONREGIONS
DISTRIBUTION
DEPENDENT
KERNEL
GAUSSIAN
MIXTURE
METHODOF
POTENCIALS
FUNCTIONS,
CMAC
REPRESENTATIVE
CLASSIFIERS
EXEMPLAR
EUCLIDEANNORM
KNEAREST
NEIGHBOR,
LVQ
HYPERPLANE
SIGMOID
MULTILAYER
PERCEPTRON,
BOLTZMANN
MACHINE
A
B
A
x
x
x
x
x
x
COMPUTING
ELEMENT
Figure 4.39: Four basic classiﬁer groups
CHAPTER 4. MULTIVARIATE STATISTICS NEURAL NETWORKS 106
PROBABILISTIC
LOCAL
GROUP DECISIONREGIONS
DISTRIBUTION
DEPENDENT
KERNEL
RADIAL BASIS
FUNCTION,
KERNEL
DISCRIMINANT
REPRESENTATIVE
CLASSIFIERS
NEAREST
NEIGHBOR
EUCLIDEANNORM KNEAREST
NEIGHBOR,
LEARNINGVECTOR
QUANTIZER
GLOBAL
SIGMOID
GAUSSIAN,
GAUSSIAN
MIXTURE
A
B
A
x
x
x
x
x
x
COMPUTING
ELEMENT
RULE
FORMING
THRESHOLDLOGIC
BINARY
DECISION TREE,
HYPERSPHERE
MULTILAYER
PERCEPTRON,
HIGHORDER
POLYNOMIAL NET
Figure 4.40: A taxonomy of classiﬁers
CHAPTER 4. MULTIVARIATE STATISTICS NEURAL NETWORKS 107
Here we give some examples of classiﬁcation problems solved using neural
networks.
We start with an application of PNN probabilistic neural networks in bankruptcy
prediction. Yang at al (1999) using data from 122 companies for the period 1984
to 1989 built on warning bankrupt model for the US oil and gas industry. The
data consist of ﬁve ratios to classify oil and gas companies into bankrupt and non
bankrupt groups. The ﬁve ratios are net cash ﬂow to total assets, total debt to total
assets, exploration expenses to total reserves, current abilities to total reserves,
and the trend in total reserves calculated on the ratio of change from year to year.
The ﬁrst fou rates are deﬂated. There are two separate data sets one with deﬂated
ratios and one without deﬂation.
The 122 companies are randomly divided into three sets: the training data set
(33 nonbankrupt companies and 11 bankrupt companies), the validation data set
(26 nonbankrupt companies and 14 bankrupt companies) and the testing data (30
nonbankrupt companies and 8 bankrupt companies).
Four methods of classiﬁcation were tested: FISHER denotes Fisher discrimi
nant analysis, FF denotes feedforward network, PNN means probabilistic neural
networks, PNN* is probabilistic neural networks without data normalized. The
normalization of data is in the form: x
∗
= x/[[x[[.
Table 4.12 presents the results and we can see that the ranking of the classiﬁ
cation model is: FISHER, PNN*, PNN and EF.
Table 4.12: Number and percentage of Companies Correctly Classiﬁed
Nondeﬂated data Deﬂated data
Overall Nonbankrupt Bankrupt Overall Nonbankrupt Bankrupt
n=38 n=30 n=8 n=38 n=30 n=8
FISHER 27 (71 %) 20 (67 %) 7 (88 %) 33 (87 %) 26 (87 %) 7 (88 %)
BP 28 (74 %) 24 (80 %) 4 (50 %) 30 (79 %) 30 (100 %) 0 (0 %)
PNN 25 (66 %) 24 (80 %) 1 (13 %) 26 (68 %) 24 (80 %) 2 (25 %)
PNN* 28 (74 %) 24 (80 %) 4 (50 %) 32 (84 %) 27 (90 %) 5 (63 %)
CHAPTER 4. MULTIVARIATE STATISTICS NEURAL NETWORKS 108
Bounds and Lloyd (1990) compared feedforward network, radial basis func
tions, three group of clinicians and some statistical classiﬁcation methods in the
diagnostic of low back disorders. The data set referred to 200 patients with low
back disorders.
Low back disorder were classiﬁed in four diagnostic category: SLBP (simple
low backpain), ROOTP (nerve root compression), SPARTH (spinal pathology,
due to tumour, inﬂammation or infection), AIB (abnormal illness behaviour, with
signiﬁcant psychological overlay).
For each patient the data was collected in a tick sheet that listed all the relevant
clinical features. The pacient data was analysed using two different subset of
the total data. The ﬁrst data set (full assessment) contained all 145 tick sheet
entries. This corresponds to all information including special investigations. The
second data used a subset of 86 tick sheet entries for each patient. These entries
correspond to a very limited set of symptoms that can be collected by paramedic
personnel.
There were 50 patients for each of the four diagnostic class. Of the 50 patients
in each class 25 were selected for the training data and the remaining for the test
set.
The neural network used were: feedforward network for individual diagnostic
(y = 1) and (y = 0) for remaining three, feedforward with two output (for the four
class diagnostic), a network of individual networks with two outputs, and mean
of 10 feedforward networks runs after training. All this networks had one hidden
layer.
The radial basis networks were designed to recognize all four diagnostic classes.
The effect of the number of centers, positions of the centers and different choice
of basis functions were all explored.
Two other classiﬁcation methods and CCAD (computedaided diagnosis) sys
tem were also tested in these data. A summary of the results is shown in Table
4.13 and 4.14. The performance of the radial basis function, CCAD and MLP is
quite good.
CHAPTER 4. MULTIVARIATE STATISTICS NEURAL NETWORKS 109
Table 4.13: Symptomatic assessment: Percentage test diagnoses correct
Abnormal
Classiﬁcation Simple low Root Spinal illness
Method back pain pain pathology behaviour Mean
Clinicians Bristol
neurosurgeons 60 72 70 84 71
Glasgow orthopaedic
surgeons 80 80 68 76 76
Bristol general
pratictioners 60 60 72 80 68
CCAD 60 92 80 96 82
MLP
Best 50202 52 92 88 100 83
Mean 50202 43 91 87 98 80
Best 5002 44 92 88 96 80
Mean 5002 41 89 84 96 77
Radial basis functions 60 88 80 96 81
Closest class mean (CCM)
(Euclidean)
50 Inputs 56 92 88 96 63
86 Inputs 68 88 84 96 84
Closest class mean (CCM)
(scalar product)
50 Inputs (K=22) 16 96 44 96 63
86 Inputs (K=4) 92 84 60 84 80
K Nearest Neighboor (KNN)
(Euclidean)
50 Inputs (K=22) 88 84 76 84 83
86 Inputs (K=4) 92 80 84 80 84
K Nearest Neighboor (KNN)
(scalar product)
50 Inputs (K=3) 24 100 56 92 68
86 Inputs (K=6) 100 84 56 84 81
CHAPTER 4. MULTIVARIATE STATISTICS NEURAL NETWORKS 110
Table 4.14: Full assessment: Percentage test diagnoses correct
Abnormal
Classiﬁcation Simple low Root Spinal illness
Method back pain pain pathology behaviour Mean
Clinicians Bristol
neurosurgeons 96 92 60 80 82
Glasgow orthopaedic
surgeons 88 88 80 80 84
Bristol general
pratictioners 76 92 64 92 81
CCAD 100 92 80 88 90
MLP
Best 50202 76 96 92 96 90
Mean 50202 63 90 87 95 83
Best 5002 76 92 88 96 88
Mean 5002 59 88 85 96 82
Radial basis functions 76 92 92 96 89
Closest class mean (CCM)
(Euclidean)
85 Inputs 100 92 72 88 88
145 Inputs 100 88 76 88 88
Closet class mean (CCM)
(scalar product)
85 Inputs (K=22) 12 92 48 100 63
145 Inputs (K=4) 100 72 68 80 80
K Nearest Neighbour (KNN)
(Euclidean)
85 Inputs (K=19) 100 84 80 80 86
145 Inputs (K=4) 96 88 80 80 86
K Nearest Neighbour (KNN)
(scalar product)
85 Inputs (K=19) 48 96 52 92 68
145 Inputs (K=4) 92 92 72 76 83
CHAPTER 4. MULTIVARIATE STATISTICS NEURAL NETWORKS 111
Santos et al (2005,2006) uses feedforward network in the diagnostic of Hepati
tis Aand Pulmonary Tuberculosis and gives a comparison with logistic regression.
The hepatitis A data (Santos et al 2005), 2815 individuals from a survey sam
ple from regions in Rio de Janeiro state. The soroprevalence of hepatitis was 36,6.
From the 66 collected variables in the study, seven variables reﬂecting information
on the individuals, housing environment, and socioeconomic factors.
For a test sample of 762 individual randomly chosen. Table 4.15 presents the
results for the multilayer perceptron and a logistic regression model:
Table 4.15: Sensitivity (true positive) and speciﬁcity (true negative)
Errors Model
MLP LR
Sensitivity (%) 70 52
Speciﬁcity (%) 99 99
The Pulmonary Tuberculosis data (Santos et al 2006) consisted of data on 136
patients of the Hospital Universitario (HUCFF  Universidade Federal do Rio de
Janeiro). There were 23 variables in total. The test sample had 45 patients. A
comparison was made using multilayer perceptron and CART (Classiﬁcation and
Regression Tree) model. Table 4.16 summarizes the results in the test data.
Table 4.16: Sensitivity x Speciﬁcity
Errors Model
MLP CART
Sensitivity (%) 73 40
Speciﬁcity (%) 67 70
A general comparison of classiﬁcation methods including neural networks us
ing medical binary data is given in Asparoukhov and Krzanowski (2001). Table
4.17 reproduces one of their three tables to indicate the methods used (They also
have a similar table for large (1517) and moderate (10) number of variables).
They suggest that:
− all the effective classiﬁers for large sample are traditional ones (IBM  Inde
pendent Binary Model Behaviour, LDF  Linear Discriminating Function,
LLM  2 order loglinear model).
CHAPTER 4. MULTIVARIATE STATISTICS NEURAL NETWORKS 112
Table 4.17: Leaveoneout error rate (multiplied by 10
3
for footnotesize (6) num
ber of variables
1
Discriminant
procedure Data set
Psychology Pulmonary Thrombosis Epilepsy Aneurysm
= = = = = = = = = =
Linear
classiﬁers
IBM 239 283 147 139 228 284 168 186 23 21
LDF 244 223 147 139 272 275 147 147 53 46
LD 229 266 147 167 302 343 193 186 28 25
MIP  139 325 101 23
Quadratic
classiﬁers
Bahadur(2) 219 201 152 139 243 343 197 155 23 21
LLM (2) 206 207 151 144 250 304 153 116 23 21
QDF 215 207   265 314 153 178  
Nearest
neighboor
classiﬁers
kNNHills(L) 273(2) 266(1) 187(1) 185(1) 265(1) 294(1) 153(2) 217(1) 363(1) 306(1)
kNNHall(L) 230(2) 217(2) 166(1) 144(1) 257(1) 324(1) 102(2) 124(1) 23(1) 21(1)
Other
nonparametric
classiﬁers
Kernel 247 234 166 144 243 363 143 124 23 21
Fourier 271 223 166 154 316 373 177 147 23 21
MLP(3) 207 139 284 132 25
LVQ(c) 190(6) 146(6) 226(6) 109(6) 37(4)
1
=  equal prior probabilities;
= =  prior probabilities are equal to the class individuals’ number divided by the design set size;
kNNHills(L) and lNNHall(L)  L is order of the procedure;
MLP(h) h is hidden layer neuron number;
LVQ(c) c is number of codebooks per class.
CHAPTER 4. MULTIVARIATE STATISTICS NEURAL NETWORKS 113
− most of the effective classiﬁers for footnotesize samples are non traditional
ones (MLP  Multilayer Perceptron, LVQ  Learning Vector Quantization
Neural Network, MIP  Mixed Integer Programming).
Arminger et al (1987) compared logistic discrimination regression tree and
neural networks in the analysis of credit risk. The dependent variable is whether
the credit is paid off or not. The predictor variable are sex, true of employment,
marital status, and car and telephone holder. The result on the test data is shown
in Table 4.18.
Table 4.18: Correct Classiﬁcation  Credit data
MLP LR CART
Good Credit 53 66 65
Bad Credit 70 68 66
Overall 66 67 66
For the lender point of view MLP is more effective.
Desai et al (1997) explore the ability of feedforward network, mixtureof
expert networks, and linear discriminant analysis in building credit scoring models
in an credit union environmental. Using data from three credit union data, split
randomly in 10 datasets. The data for each credit union contain 962, 918 and
853 observation. One third of each were used as test. The results for the generic
models (all data) and the customized models for each credit union is presented in
Table 4.19 and 4.20 respectively.
Table 4.19: Generic models
Data set
Percentage correctly classiﬁed
mlp g mmn g lda g lr g
% total % bad % total % bad % total % bad % total % bad
1 79.97 28.29 80.12 28.95 79.80 27.63 80.30 33.55
2 82.96 35.06 81.80 38.31 80.60 31.82 81.40 35.06
3 81.04 36.44 79.20 47.46 82.40 37.29 82.70 40.68
4 82.88 33.80 83.33 38.03 83.49 40.85 83.49 44.37
5 79.21 32.88 78.59 43.15 80.60 37.67 80.01 39.04
6 81.04 53.69 80.73 35.57 80.58 38.93 81.35 42.95
7 80.73 44.30 79.82 44.97 80.73 36.24 82.57 41.61
8 78.89 50.00 79.82 41.96 80.58 32.88 81.35 39.73
9 81.92 43.87 80.43 32.90 81.04 30.97 81.35 33.55
10 79.51 62.50 80.73 32.64 81.35 33.33 82.42 40.28
average 80.75 42.08 80.46 38.39 81.12 34.76 81.70 39.08
pvalue
1
0.191 0.197 0.98 0.035 0.83 0.19
1
The pvalues are for onetailed paired ttest comparing mlp g results with the other three methods.
The results indicates that the mixtureofexperts performns better in classify
ing bad credits.
CHAPTER 4. MULTIVARIATE STATISTICS NEURAL NETWORKS 114
Table 4.20: Customized models
Data set Sample Percentage correctly classiﬁed
mlp c lda c lr c
% total % bad % total % bad % total % bad
Credit union L:
1 14.88 83.33 32.00 83.30 16.00 82.70 24.00
2 17.26 87.50 31.03 82.70 20.69 81.00 34.48
3 17.26 86.90 41.38 83.90 31.03 82.10 37.93
4 22.02 81.55 37.84 84.52 32.43 82.74 37.84
5 18.45 82.14 29.03 77.40 16.13 78.00 38.71
6 17.26 83.33 41.38 81.55 27.59 82.74 41.38
7 21.43 80.36 16.67 78.50 08.33 80.95 30.56
8 17.86 79.76 56.67 82.14 10.00 81.55 23.33
9 16.67 85.71 32.14 86.31 32.14 86.90 39.29
10 18.45 83.33 29.03 79.76 19.35 80.36 29.03
Credit union M:
1 26.77 85.43 79.71 86.60 70.97 87.40 73.53
2 26.77 88.58 76.47 85.40 18.64 88.20 79.41
3 20.47 85.43 80.77 87.40 31.58 85.80 75.00
4 23.63 90.94 75.00 90.20 35.14 89.00 75.00
5 24.02 89.37 88.52 89.90 22.22 86.60 81.97
6 26.38 88.58 74.63 88.19 12.96 86.22 71.64
7 26.77 85.43 74.63 85.04 28.30 86.22 77.61
8 25.59 88.19 75.38 86.61 28.89 87.40 67.69
9 27.59 87.40 72.86 85.43 31.37 86.61 67.14
10 24.41 90.55 82.26 89.37 21.05 89.37 80.65
Credit union N:
1 25.43 75.86 49.15 75.40 18.64 78.50 27.11
2 24.57 77.15 40.35 ?? 31.58 75.50 24.56
3 15.95 81.90 35.14 83.20 35.14 84.10 35.14
4 19.40 77.15 44.44 78.90 22.22 80.60 28.89
5 23.28 76.72 24.07 75.40 12.96 75.90 18.52
6 22.84 79.31 35.85 75.00 28.30 76.22 26.42
7 19.40 81.90 31.11 80.60 28.89 81.03 28.89
8 21.98 74.13 37.25 74.57 31.37 75.43 31.37
9 24.57 80.17 47.37 77.59 21.05 78.88 21.05
10 21.98 77.59 19.61 77.16 21.57 76.72 19.61
average 83.19 49.72 82.35 38.49 82.67 44.93
pvalue
1
0.018 5.7E7 0.109 0.0007
1
The pvalues are for onetailed paired ttest comparing mlp c results with the other two methods.
CHAPTER 4. MULTIVARIATE STATISTICS NEURAL NETWORKS 115
West (2000) investigate the credit scoring accuracy of ﬁve neural network
models: multilayer perceptron (MLP), mixtureofexperts, radial basis function,
learning vector quantization and fuzzy adaptive resonance. The neural networks
credit scoring models are tested using 10fold crossvalidation with two real data
sets. Results are compared with linear discriminant analysis, logistic regression,
knearest neighbour, kernel density estimation and decision trees
Results show that feedforward network may not be the most accurate neural
network model and that both mixture of experts and radial basis function should be
considered for credit scoring applications. Logistic regression is the most accurate
of the traditional methods.
Table 4.21 summarize the results.
Table 4.21: Credit scoring error, average case neural network models
German credit data
b
Australian credit data
b
Good credit Bad credit Overall Good credit Bad credit Overall
Neural models
a
MOE 0.1428 0.4775 0.2434 0.1457 0.1246 0.1332
RBF 0.1347 0.5299 0.2540 0.1315 0.1274 0.1286
MLP 0.1352 0.5753 0.2672 0.1540 0.1326 0.1416
LVQ 0.2493 0.4814 0.3163 0.1710 0.1713 0.1703
FAR 0.4039 0.4883 0.4277 0.2566 0.2388 0.2461
Parametric
models
Linear discriminant 0.2771 0.2667 0.2740 0.0782 0.1906 0.1404
Logistic regression 0.1186 0.5133 0.2370 0.1107 0.1409 0.1275
Nonparametric
models
K nearest neighbor 0.2257 0.5533 0.3240 0.1531 0.1332 0.1420
Kernel density 0.1557 0.6300 0.3080 0.1857 0.1514 0.1666
CART 0.2063 0.5457 0.3044 0.1922 0.1201 0.1562
a
Neural network results are averages of 10 repetitions.
b
Reported results are group error rates averaged across 10 independent holdout samples.
Markham and Ragsdale (1995) presents an approach to classiﬁcation based
on combination of linear discrimination function (which the label Mahalanobis
distance measure  MDM) and feedforward network. The characteristic of the
data set is shown in Table 4.22. The process of splitting the data in validation and
test set was repeated 30 times. The results are shown in Table 4.23.
CHAPTER 4. MULTIVARIATE STATISTICS NEURAL NETWORKS 116
Table 4.22: Data Characteristic
Dataset 1 2
Problem Domain Oil Quality Rating Bank Failure Prediction
Number of groups 3 2
Number of observations 56 162
Number of variables 5 19
Finally, a recent reference on comparisons of classiﬁcation models is Bose
(2003).
CHAPTER 4. MULTIVARIATE STATISTICS NEURAL NETWORKS 117
Table 4.23: Comparison of classiﬁcation methods on threegroup oil quality and
twogroup bank failure data set
Percentage of Misclassiﬁed Observations
in the Validation Sample
Threegroup oil quality Twogroup bank failure
Sample MDM NN1 NN2 MDM NN1 NN2
1 22.22 18.52 3.70 10.00 3.75 2.50
2 14.81 18.52 11.11 17.50 10.00 7.50
3 11.11 11.11 0 20.00 11.25 7.50
4 11.11 0 0 15.00 10.00 5.00
5 18.52 22.22 14.81 11.25 6.25 3.75
6 3.70 0 0 11.25 5.00 1.25
7 14.81 0 0 12.50 8.75 5.00
8 29.63 22.22 11.11 13.75 7.50 3.75
9 7.41 0 0 12.50 6.25 5.00
10 18.52 7.41 0 12.50 7.50 2.50
11 3.70 0 0 17.50 8.75 6.25
12 18.52 11.11 0 20.00 11.25 7.50
13 7.41 0 0 15.00 6.25 3.75
14 7.41 0 0 13.75 5.00 5.00
15 11.11 0 0 16.25 6.25 2.50
16 7.41 0 0 18.75 10.00 6.25
17 18.52 7.41 3.70 25.00 13.75 8.75
18 11.11 3.70 0 13.75 10.00 6.25
19 3.70 0 0 16.25 8.75 5.00
20 14.81 7.41 0 13.75 6.25 3.75
21 11.11 0 0 15.00 7.50 3.75
22 14.81 3.70 0 13.75 6.25 2.50
23 25.93 18.52 11.11 15.00 8.75 5.00
24 22.22 11.11 11.11 20.00 10.00 6.25
25 18.52 7.41 0 15.00 11.25 5.00
26 14.81 7.41 0 17.50 8.75 5.00
27 7.41 0 0 18.75 8.75 7.50
28 14.81 3.70 3.70 16.25 5.00 2.50
29 0 0 0 15.00 6.25 3.75
30 14.81 7.41 3.70 16.25 7.50 5.00
Average 13.33 6.42 2.47 15.63 8.08 4.83
Chapter 5
Regression Neural Network Models
5.1 Generalized Linear Model Networks  GLIMN
Generalized (iterative) linear models (GLIM) encompass many of the statis
tical methods most commonly employed in data analysis. Beyond linear models
with normal distributed errors, generalized linear models include logit and probit
models for binary response variable and loglinear (Poissonregression) models
for counts.
A generalized linear model consists of three components
• A random component, in the form of a response variable Y with distribution
from the exponential family which includes: the normal, Poisson, binomial,
gamma, inversenormal, negativebinomial distributions, etc.
• A linear predictor
η
i
= α
0
+ α
1
x
i1
+ . . . + α
k
x
ik
(5.1.1)
on which y
i
depends. The x
are independent (covariates, predictors) vari
ables.
• A link function L(µ
i
) which relates the expectation µ
i
= E(Y
i
) to the linear
predictor η
i
L(µ
i
) = η
i
(5.1.2)
The exponential family of distribution includes many common distributions
and its parameters have some nice statistical properties related to them (sufﬁ
ciency, attain CramerRao bound). Members of this family can be expressed in
the general form
118
CHAPTER 5. REGRESSION NEURAL NETWORK MODELS 119
f(y, θ, φ) = exp
_
yθ −b(θ)
α(φ) + C(y, φ)
_
. (5.1.3)
If φ is known, then θ is called the natural or canonical parameter. When
α(φ) = φ, φ is called the dispersion or scale parameter. It can be shown that
E(Y ) = b
(θ) = µ Var(Y ) = α(φ)b
(θ) = α(φ) ∨ (θ). (5.1.4)
Fitting a model may be regarded as a way of replacing a set of data values y
∼
=
(y
i
. . . y
n
) by a set of ﬁtted values ˆ µ
∼
i
= (ˆ µ
1
, . . . , ˆ µ
n
) derived from a model involv
ing (usually) a relative small number of parameters. One measure of discrepancy
most frequently used is that formed by the logarithm of the ratio of likelihoods,
called deviance (D).
Table 5.1 gives some usual choices of link functions for some distributions of
y
∼
for the canonical link θ = η.
Table 5.1: Generalized Linear Models: mean, variance and deviance functions
η ∨(µ) D(ˆ µ)
Normal µ 1
(y − ˆ µ)
2
Binomial ln[µ/(1 −µ)] µ(1 −µ) 2Σ¦[y ln(y/ˆ µ)]+
+(n −y) ln[(n −y)/(n − ˆ µ)]¦
Poisson log µ µ 2Σ[y ln(y/ˆ µ) −(y − ˆ µ)]
Gamma µ
−1
µ
2
2Σ[−ln(y/ˆ µ) −(y − ˆ µ)/ˆ µ]
The neural network architecture equivalent to the GLIM model is a perceptron
with the predictors as inputs and one output. There are no hidden layers and the
activation function is chosen to coincide with the inverse of the link function (L).
This network is shown in Figure 5.1.
In what follows we will deal with some particular cases: the logit regression
and the regression models neural networks.
CHAPTER 5. REGRESSION NEURAL NETWORK MODELS 120
Figure 5.1: GLIM network
5.1.1 Logistic Regression Networks
The logistic model deals with a situation in which we observe a binary re
sponse variable y and k covariates x’s. The logistic regression model is a general
ized linear model with binomial distribution for y
1
. . . y
n
and logit link function.
The equation takes the form
p = P(Y = 1/x
∼
) =
1
1 + exp(−β
0
−
k
i=1
β
i
x
i
)
= ∧
_
β
0
+
k
i=1
β
i
x
i
_
(5.1.5)
with ∧ denoting the logistic function. The model can be expressed in terms of log
odds of observing 1 given covariates x as
log tp = log
P
1 −β
= β
0
+
k
i=1
β
i
x
i
. (5.1.6)
A natural extension of the linear logistic regression model is to include quadratic
terms and multiplicative interaction terms:
P = Λ
_
β
0
+
k
i=1
β
i
x
i
+
k
i=1
γ
i
x
2
i
+
i<j
δ
ij
x
i
x
j
_
. (5.1.7)
Other approaches to model the probability p include generalized additive mod
els (Section 5.2.3) and feedforward neural networks.
CHAPTER 5. REGRESSION NEURAL NETWORK MODELS 121
For the linear logistic model (5.1.6) the neural network architecture is as in
Figure 5.1 with the sigmoid link k function as activation function of the output
neuron.
A feedforward network with hidden layer with r neurons and sigmoid activa
tion function would have the form:
P = Λ
_
w
0
+
r
j=1
w
j
Λ(w
oj
+
k
j=1
w
ij
x
i
)
_
. (5.1.8)
This representation of neural network is the basis for analytical comparison
of feedforward network and logistic regression. For a neural network without a
hidden layer and a sigmoid activation function (5.1.8) reduces to (5.1.6) and this
network is called logistic perceptron in the neural network literature. Therefore
it does not make sense to compare a neural network with hidden layer and linear
logistic regression. For a discussion of this and related measures and other com
parisons among these two models see Schwarzer et al (2000) and Tu (1996). An
attempt to improve the comparison and to use logistic regression as a staring point
to design a neural network which would be at least as good as the logistic model
is presented in Ciampi and Zhang (2002) using ten medical data sets.
The extension to the polychotomous case is obtained considering K > 1
outputs. The neural network of Figure 5.2 is characterized by weights β
ij
(j =
1, . . . , K) to output y
j
, obtained by
p
j
= Λ
_
β
oj
+
k
i=1
β
ij
x
i
_
. (5.1.9)
In statistical terms this network is equivalent to the multinomial or polychotomous
logistic model deﬁned as
p
j
= P(y
j
= 1/x
∼
) =
Λ(x, β
j
)
k
j=1
Λ(x, β
j
)
(5.1.10)
the weights can be interpreted as regression coefﬁcients and maximum likelihood
obtained by back propagation maximum likelihood.
There is a huge literature, specially in medical sciences, with comparisons
among these two models (LR and ANN). Here we will only give some useful
references mainly from the statistical literature. This section ends with two ap
plications showing some numerical aspects of the models equivalence as given in
Schumacher et al (1996).
CHAPTER 5. REGRESSION NEURAL NETWORK MODELS 122
Figure 5.2: Policholomous logistic network
Applications and comparisons of interest are: personal credit risk scores (Arminger
et al, 1997), country international credit scores (Cooper, 1999), polychotomous lo
gistic for discrete choice (Carvalho et al, 1998), scores risk of insolvency (Fletcher
and Goss, 1993, Leonard et al, 1995, Brockett et al, 1997) and some medical ap
plications (Hammad et al, 1997, Ottenbacher et al, 2001, Nguyen et al, 2002, Boll
et al, 2003).
The ﬁrst illustration is the that of Finney (1947) and the data consists N = 39
binary responses denoting the presence (y = 1) or absence (y = 0) of vasocon
striction in the ﬁnger’s skin after the inspiration of an air volume at a mean rate of
inspiration R.
Table 5.2 presents the results of the usual analysis from two logistic regression
models. The ﬁrst, considers only one covariate x
1
= log R, that is
P(y = 1/x) = Λ(β
0
+ β
1
x
1
). (5.1.11)
The second model considers the use of a second covariate, that is
P(y = /x) = Λ(β
0
+ β
1
+ β
2
x
2
). (5.1.12)
For x
1
= log and x
2
= log V , respectively,
In the ﬁrst case, the results show an effect marginally signiﬁcant of the loga
rithm of rate of inspiration in the probability to the presence of vasoconstriction
in the ﬁnger’s skin.
CHAPTER 5. REGRESSION NEURAL NETWORK MODELS 123
Table 5.2: Results of a logistic regression analysis for the vasoconstriction data
Coefﬁcient Standard Error Pvalue Neural Network
Variable of Regression
Intercept 0.470 0.439  0.469
Log R 1.336 0.665 0.045 1.335
(L = 24543) (E = 24.554)
Intercept 2.924 1.288  2.938
Log R 4.631 1.789 0.010 4.651
Log V 5.221 1.858 0.005 5.240
(L = 14.632) (E = 14, 73)
An ANN equivalent to this model would be a feedforward network with two
input neurons (x
0
= 1, x
1
= log R). Using a rate of learning η = 0, 001 to the
last interactions MLBP, we have w
0
= −0, 469 and w
1
= 1, 335 with a Kullback
Leiber measure of E
∗
= 24, 554.
For the case of the model with two covariates, the result shows a signiﬁcant
inﬂuence of the two covariates in the probability of vasoconstriction in the ﬁn
ger’s skin. One Perceptron with three input neurons (x0 = 1, x1 = log R, x2 =
log V ) should give similar results if MLBP was used. This was almost obtained
(w
0
= 2, 938, w
1
= 4, 650, w
2
= 4, 240) although the algorithm backpropaga
tion demands several changes in learning rate from an initial value η = 0, 2 to
η = 0, 000001 involving approximately 20000 interaction to obtain a value of
the distance KullbackLeibler of E
∗
= 14, 73 comparable to that of minus the
loglikelihood (L = −14, 632).
The second example is a study to examine the role of noninvasive sonographic
measurements for differentiation of benign and malignant breast tumors. Sonog
raphy measurements in N458 women and characteristics (x
1
, . . . , x
p
) were col
lected.
Cases were veriﬁed to be 325(y = 0) benign and 133 as (y = 1) malignant
tumor. A preliminary analysis with a logistic model indicates three covariates as
signiﬁcant, this indication plus problems of collinearity among the six variables
suggests the use of age, number of arteries in the tumor (AT), number of arteries
in the contralateral breast (AC). The results are shown in Table 5.3.
A comparison with others ANN is shown in the Table 5.4.
CHAPTER 5. REGRESSION NEURAL NETWORK MODELS 124
Table 5.3: Results of a logistic regression analysis of breast tumor data
Logistic
Variable Coefﬁcients Standard Error P Value Neural Network Weights
Intercept 8,178 0.924 w
0
= −8.108
Age 0.070 0.017 0.0001 w
1
= 0.069
log AT+1 5.187 0.575 0.0001 w
2
= 5.162
log AC+1 1.074 0.437 0.0014 w
3
= −1.081
L=79.99 E=80.003
Table 5.4: Results of classiﬁcation rules based on logistic regression, CART and
feedforward neural networks with j hidden units (NN(J)) for the breast tumor
data
Method Sensitivity Speciﬁcity Percentage of correct
(%) (%) classiﬁcation
Logistic regression 95.5 90.2 91.7
CART 97.9 89.5 91.9
NN(1) 95.5 92.0 93.0
NN(2) 97.0 92.0 93.4
NN(3) 96.2 92.3 93.4
NN(4) 94.7 93.5 93.9
NN(6) 97.7 94.8 95.6
NN(8) 97.7 95.4 96.1
NN(10) 98.5 98.5 98.5
NN(15) 99.2 98.2 98.5
NN(20) 99.2 99.1 99.1
NN(40) 99.2 99.7 99.6
CHAPTER 5. REGRESSION NEURAL NETWORK MODELS 125
5.1.2 Regression Network
The regression network is a GLIM model with identity link function. The
introduction of hidden layer has no effect in the architecture of the model because
of the linear properties and the architecture remains as in Figure 5.1.
Some authors have compared the prediction performance of econometric and
regression models with feedforward networks usually with one hidden layer. In
fact, they compared linear models with nonlinear models, in this case a particular
projection pursuit model with the sigmoid as the smooth function. (See Section
5.2.5). An extension for system of equations can be obtained for systems of k
equations by considering k outputs in the neural network as in Figure 5.2. Again
the outputs activation function are the identity function.
In what follows we describe some comparisons of regression models versus
neural networks.
Gorr et al (1994) compare linear regression, stepwise polynomial regression
and a feedforward neural network with three units in the hidden layer with an
index used for admissions committee for predicting students GPA (Grade Point
Average) in professional school. The dependent variable to be predicted for ad
missions decision is the total GPA of all courses taken in the last two years. The
admission/decision is based on the linear decision rule.
LDR = math + chem + eng + zool + 10Tot + 2Res + 4PTran (5.1.13)
which are the grades in Mathematics, Chemistry, English, Zoology; and total
GPA. The other variables are binary variables indicating: residence in the area,
a transfer student had partial credits in prerequisites transferred, and a transfer
student had all credits in prerequisites transfer.
Although none of the empirical methods was statistically signiﬁcant better
than the admission committee index, the analysis of the weights and output of the
hidden units, identiﬁes three different patterns in the data. The results showed
obtained very interesting comparisons by using quartiles of the predictions.
Church and Curram (1996) compared forecast of personal expenditure ob
tained by neural network and econometric models (London Business School, Na
tional Institute of Economic and Social Research, Bank of England and a Goldman
Sache equation). None of the models could explain the fall in expense growth at
the end of the eighties and beginning of the nineties.
A summary of this application of neural networks is the following:
• Neural networks use exactly the same covariates and observations used in each
econometric model. These networks produced similar results to the econo
metric models.
• The neural networks used 10 neurons and a hidden layer, in all models.
CHAPTER 5. REGRESSION NEURAL NETWORK MODELS 126
• The dependent variable and the covariates were reescalated to fall in the inter
val 0,2 to 0,8.
• A ﬁnal exercise used a neural network with inputs of all variables from all
models. This network was able to explain the fall in growth, but it is a
network with too many parameters.
• Besides the forecast comparison, a sensitivity analysis was done on each vari
able, so that the network was tested with the mean value of the data and
each variable was varied to verify the grade in which each variation affects
the forecast of the dependent variable.
Another econometric model comparison using neural network and linear re
gression was given by Qi and Maddala (1999). They relate economic factors with
the stock market and in their context conclude that neural network model can im
prove the linear regression model in terms of predictability, but not in terms of
proﬁtability.
In medicine, Lapeer et al (1994) compare neural network with linear regres
sion in predicting birthweight from nine perinatal variables which are thought to
be related. Results show, that seven of the nine variables, i.e. gestational age,
mother’s bodymass index (BMI), sex of the baby, mother’s height, smoking,
parity and gravidity, are related to birthweight. They found that neural network
performed slightly better than linear regression and found no signiﬁcant relation
between birthweight and each of the two variables: maternal age and social class.
An application of neural network in ﬁtting response surface is given by Balkin
and Lim (2000). Using simulated results for an inverse polynomial they shown
that neural network can be a useful model for response surface methodology.
5.2 Nonparametric Regression and Classiﬁcation Net
works
Here we relate some nonparametric statistical methods with the models found
in the neural network literature.
5.2.1 Probabilistic Neural Networks
Suppose we have C classes and want to classify a vector x
∼
to one of this
classes. The Bayes classiﬁer is based on p(k/x
∼
)απ
k
p
k
(x
∼
), where π
k
is the prior
probability of a observation be from class k, p
k
(x
∼
) is the probability of x
∼
being
observed among those of class C = k.
CHAPTER 5. REGRESSION NEURAL NETWORK MODELS 127
In general, the probability π
k
and the density p() have to be estimated from
the data. A powerful nonparametric technique to estimate these functions is based
on kernel methods and propose by Parzen (Ripley, 1996, p.184).
A kernel K is a bounded function with integral one. Suitable example is the
multivariate normal density function. We use K(x
∼
−y
∼
) as a measure of the prox
imity of x
∼
on y
∼
. The empirical distribution of x
∼
within a group k gives mass
1
m
k
to
each of the samples. A local estimate of the density p
k
(x
∼
) can be found summing
all these contributions with weight K(x
∼
−x
∼
2
) that is (Ripley, 1996, p.182)
ˆ p
j
(x
∼
) =
1
n
j
n
j
i=1
K(x
∼
−x
∼
i
). (5.2.1)
This is an average of the kernel functions centered on each example from the class.
Then, we have from Bayes theorem:
ˆ p(k/x
∼
) =
π
k
ˆ p
k
(x
∼
)
C
j=1
π
j
ˆ p
j
(x
∼
)
=
π
k
n
k
[i]=k
K(x −x
k
)
i
π
[i]
n
[i]
K(x −x
i
)
. (5.2.2)
When the prior probability are estimated by n
k/n
, (5.2.2) simpliﬁes to
p(k/x
∼
) =
[i]=k
K(x
∼
−x
∼
i
)
i
K(x
∼
−x
∼
i
)
. (5.2.3)
This estimate is known as Parzen estimate and probabilistic neural network (PNN)
(Patterson, 1996, p.352).
In radial basis function networks with a Gaussian basis it is assumed that the
normal distribution is a good approximation to the cluster distribution. The kernel
nonparametric method of Parzen approximates the density function of a particular
class in a pattern space by a sum of kernel functions, and one possible kernel is the
Gaussian function. As we have seen the probability density function for a class is
approximated by the following equation
p
k
(x
∼
) =
_
1
2π
n/2
σ
2
_
1
n
K
n
k
j=1
e
−(x−x
kj
)
2
/2σ
2
(5.2.4)
where σ
2
has to be decided. Choosing a large value results in overgeneraliza
tion, choosing a value too small results in overﬁtting. The value of σ should be
dependent on the number of patters in a class. One function used is
σ = an
−b
k
a > 0, 0 < b < 1. (5.2.5)
CHAPTER 5. REGRESSION NEURAL NETWORK MODELS 128
A three layer network like a RBF network is used with each unit in the hidden
layer centered on an individual item data. Each unit in the output layer has weights
of 1, and a linear output function, this layer adds all the outputs from the hidden
layer that corresponds to data from the same class together. This output represents
the probability that the input data belongs to the class represented by that unit. The
ﬁnal decision as to what class the data belongs to is simply the unit with the output
layer with the largest value. If values are normalized they lie between 0 and 1. In
some networks an additional layer is included that makes this decision as a winner
takes all, each unit with binary output.
Figure 5.3 shows a network that ﬁnds three class of data. There are two input
variables, and for the three classes of data there are four samples for class A, three
samples for class B and two samples for class C; hence the number of neurons in
the hidden layer.
The advantage of the PNN is that there is no training. The values for the
weights in the hidden units (i.e. the centers of the Gaussian functions) are just the
values themselves. If the amount of data is large, some clustering can be done to
reduce the number of units needed in the hidden k layer.
An application of PNNto classiﬁcation of electrocardiograms with a 46dimensional
vector input pattern, using data of 249 patients for training and 63 patients for test
ing is mentioned in Specht (1988).
Figure 5.3: Probabilistic Neural Network  PNN
CHAPTER 5. REGRESSION NEURAL NETWORK MODELS 129
5.2.2 General Regression Neural Networks
The general regression neural network was named by Specht (1991) and its
was already known in the nonparametric statistical literature as NadarayaWatson
estimator.
The idea of GRNN is to approximate a function given a set of n outputs
y
1
, . . . , y
n
and n vector of inputs x
∼
1
, . . . x
∼
n
using the following equation from
probability theory for the regression of y given x
∼
¯ y(¯ x
∼
) = E(y/x
∼
) =
_
yf(x
∼
, y)dy
_
f(x
∼
, y)dy
(5.2.6)
where f(x
∼
, y) is the joint probability density function of (x
∼
, y).
As in PNN we approximate the conditional density function by the sum of
Gaussian function. The regression of y on x
∼
is then given by
ˆ y(ˆ x
∼
) =
n
j=1
y
j
e
−d
2
j
/2σ
2
n
j=1
e
−d
2
j
/2σ
2
(5.2.7)
a particular form of the NadarayaWatson estimator
ˆ y(x) =
y
i
K(x
∼
−x
∼
i
)
K(x
∼
−x
∼
i
)
. (5.2.8)
The value of d
j
is the distance between the current input and the jth input in
the sample of training set. The radic σ is again given by
σ = an
−b/n
. (5.2.9)
The architecture of the GRNN is similar to PNN, except that the weights in
the output layer is not set to 1. Instead they are set to the corresponding values
of the output y in the training set. In addition, the sum of the outputs from the
Gaussian layer has to be calculated so that the ﬁnal output can be divided by the
sum. The architecture is shown in Figure 5.4 for a single output of two variables
input with a set of 10 data points.
CHAPTER 5. REGRESSION NEURAL NETWORK MODELS 130
Figure 5.4: General Regression Neural Network Architecture
CHAPTER 5. REGRESSION NEURAL NETWORK MODELS 131
5.2.3 Generalized Additive Models Networks
The generalized additive model, GAM relating an output y to a vector input
x
∼
= (x
1
, . . . , x
I
) is
E(Y ) = µ (5.2.10)
h(µ) = η =
I
i=0
f
i
(x
i
). (5.2.11)
This is a generalization of the GLIM model when the parameters β
i
are here re
placed by functions f
i
. Again h() is the link function. Nonparametric estimation
is used to obtain the unknown functions f
j
. They are obtained iteratively by ﬁrst
ﬁtting f
1
, then ﬁtting f
2
to the residual y
i
− f
1
(x
1i
), and so on. Ciampi and
Lechevalier (1997) used the model
logit(p) = f
1
(x
1
) + . . . + f
I
(x
I
) (5.2.12)
as a 2 class classiﬁer. They estimated the functions f using Bsplines.
Aneural network corresponding to (5.2.12) with two continuous input is shown
in Figure 5.5. The ﬁrst hidden layer consist of two blocks, each corresponding to
the transformation of the input with Bsplines. A second hidden layer computes
the f’s functions, all units have as activation the identity function. The output
layer has a logistic activation function and output p.
Figure 5.5: GAM neural network
CHAPTER 5. REGRESSION NEURAL NETWORK MODELS 132
5.2.4 Regression and Classiﬁcation Tree Network
Regression tree is a binning and averaging procedure. The procedure is pre
sented in Fox (2000a,b) and for a sample regression of y on x, to average the
values of y for corresponding values of x for certain conveniently chosen regions.
Some examples are shown in Figure 5.6.
(a) The binning estimator applied to the relationship be
tween infant mortality per 1000 and GDP per capita, in US
dollars. Ten bins are employed.
(b) The solid line gives the estimated regression function
relating occupational prestige to income implied by the
tree, the broken line is for a local linear regression.
Figure 5.6: Regression tree ˆ y
CHAPTER 5. REGRESSION NEURAL NETWORK MODELS 133
Closely related to regression trees are classiﬁcation trees, where the response
variable is categorical rather than quantitative.
The solid line gives the estimated regression function relating occupational
prestige to income implied by the tree, the broken line is for a local linear regres
sion.
The regression and classiﬁcation tree models can be written respectively as:
Y = α
1
I
1
(x
∼
) + . . . + I
L
(x
∼
) (5.2.13)
logit(p) = γ
1
I
1
(x
∼
) + . . . + I
L
(x
∼
) (5.2.14)
where the I are characteristics functions of Lsubsets of the predictor space which
form a partition.
The following classiﬁcation tree is given in Ciampi and Lechevalier (1997).
Figure 5.7: Decision Tree
The leaves of the tree represents the sets of the partition. Each leaf is reached
through a series of binary questions involving the classiﬁers (input); these are
determined from the data at each mode.
The trees can be represented as neural networks. Figure 5.8 shows such rep
resentation for the tree of Figure 5.7.
The hidden layers have activation function taking value 1 for negative input
and 1 for positive input. The weights linking the second hidden layer to the output
are determined by the data and are given by the logit of the class probability
corresponding to the leaves of the tree.
CHAPTER 5. REGRESSION NEURAL NETWORK MODELS 134
Figure 5.8: Classiﬁcation Tree Network
5.2.5 Projection Pursuit and FeedForward Networks
Generalized additive models ﬁts the model
y
i
= α +f
1
(x
1
) + . . . + f
I
(x
I
). (5.2.15)
Projection pursuit regression (PPR), ﬁts the model
y
i
= α + f
1
(z
i1
) + . . . + f
I
(z
iI
) (5.2.16)
where the z
1
’s are linear combinations of the x’s
z
ij
= α
j1
x
i1
+ α
j2
x
i2
+ . . . + α
αp
x
ip
= α
∼
j
x
∼
i
. (5.2.17)
The projectionpursuit regression model can therefore capture some interactions
among the x’s. The ﬁtting of this models is obtained by least squares ﬁtting of ˆ α
∼
1
,
then as in GAM ﬁtting
ˆ
f
1
(ˆ α
∼
1
x
∼
) from the residual R
1
of this model, estimate ˆ α
2
and then
ˆ
f
2
(ˆ α
∼
2
x
∼
) and so on.
The model can be written as
y
k
= f(x
∼
) = α
k
+
I
j=1
f
ik
(α
j
+ α
∼
j
x
∼
) (5.2.18)
for multiple output y
∼
.
Equation (5.2.18) is the equation for a feedforward network, if, for example,
the functions f
j
are sigmoid function and we consider the identity output.
CHAPTER 5. REGRESSION NEURAL NETWORK MODELS 135
Clearly feedforward neural network is a special case of PPRtaking the smooth
functions f
j
to be logistic functions. Conversely, in PPR we can approximate each
smooth function f
i
by a sum of logistic functions.
Another heuristic reason why feedforward network might work well with a
few ridden units, for approximation of functions, is that the ﬁrst stage allows a
projection of the x
∼
’s onto a space of much lower dimension for the z’s.
5.2.6 Example
Ciampi and Chevalier (1997) analyzed a real data set of 189 women who have
had a baby. The output variable was 1 if the mothers delivered babies of normal
weight and 0 for mothers delivered babies of abnormally low weight (< 2500 g).
Several variables were observed in the course of pregnancy. Two continuous (age
in years and weight of mother in the last menstrual period before pregnancy) and
six binary variables.
They apply a classiﬁcation tree network, a GAM network, and a combination
of the two. Figure 6 gives their classiﬁcation network. Their GAM and combined
network are shown in Figures 7 and 8. The misclassiﬁcation number resulting in
each network was: GAM : 50  Classiﬁcation tree : 48 Network of network : 43.
Figure 5.9: GAM network
Figure 5.10: Network of networks
Chapter 6
Survival Analysis, Time Series and
Control Chart Neural Network
Models and Inference
6.1 Survival Analysis Networks
If T is a nonnegative random variable representing the time to failure or death
of an individual, we may specify the distribution of T by any one of the probabil
ity density function f(t), the cumulative distribution function F(t), the survivor
function S(t), the hazard function f(t) or the cumulative hazard function H(t).
These are related by:
F(t) =
_
t
0
f(u)du
f(t) = F
(t) =
d
dt
F(t)
S(t) = 1 −F(t)
h(t) =
f(t)
S(t)
= −
d
dt
[ln S(t)]
H(t) =
_
t
0
h(u)du
h(t) = H
(t)
S(t) = exp[−H(t)].
(6.1.1)
A distinct feature of survival data is the occurrence of incomplete observa
tions. This feature is known as censoring which can arise because of time limits
and other restrictions depending of the study.
136
CHAPTER 6. SURVIVAL ANALYSIS AND OTHERS NETWORKS 137
There are different types of censoring.
• Right censoring occur if the events is not observed before the prespeciﬁed
studyterm or some competitive event (e.g. death by other cause) that causes
interruption of the followup on the individual experimental unit.
• Left censoring happens if the starting point is located before the time of the
beginning of the observation for the experimental unit (e.g. time of infection
by HIV virus in a study of survival of AIDS patients).
• Interval censoring the exact time to the event is unknown but it is known that
it falls in an interval I
i
(e.g. when observations are grouped).
The aim is to estimate the previous functions from the observed survival and
censoring times. This can be done either by assuming some parametric distribu
tion for T or by using nonparametric methods. Parametric models of survival
distributions can be ﬁtted by maximum likelihood techniques. The usual non
parametric estimator for the survival function is the KaplanMeier estimate. When
two or more group of patients are to be compared the logrank or the Mantel
Hanszel tests are used.
General class of densities and the nonparametric procedures with estimation
procedures are described in Kalbﬂeish and Prentice (2002).
Usually we have covariates related to the survival time T. The relation can be
linear (β
∼
x
∼
) or nonlinear (g(j
∼
; x
∼
)). A general class of models relating survival
time and covariates is studied in LouzadaNeto (1997, 1999). Here we describe
the three most common particular cases of the LouzadaNeto model.
The ﬁrst class of models is the accelerated failure time (AFT) models
log T = −β
∼
e
x
∼
+ W (6.1.2)
where W is a random variable. Then exponentiation gives
T = exp
_
−β
x
∼
_
e
W
or T
= e
W
= T exp
_
β
x
∼
_
(6.1.3)
where T
has hazard function h
0
that does not depend on β. If h
j
(t) is the hazard
function for the j’th patient it follows that
h
j
(t) = h
0
(t exp β
x) exp β
x. (6.1.4)
The second class is the proportional odds (PO) where the regression is on the
logodds of survival, correspondence to a linear logistic model with “death” or not
by a ﬁxed time t
0
as a binary response.
log
S
j
(t)
1 −S
j
(t)
= β
∼
x
∼
+ log
S
0
(t)
1 −S
0
(t)
(6.1.5)
CHAPTER 6. SURVIVAL ANALYSIS AND OTHERS NETWORKS 138
or
S
j
(t)
1 −S
j
(t)
=
S
0
(t)
1 −S
0
(t)
exp β
∼
x
∼
. (6.1.6)
The third class is the “proportional hazard” or Cox regression model (PH).
log h
t
(t) = β
∼
x
∼
+ log h
0
(t) (6.1.7)
h
j
(t) = h
0
(t) exp β
∼
x
∼
. (6.1.8)
Ciampi and EtezadiAmoli (1985) extended models (6.1.2) and (6.1.7) under
a mixed model and not only extend these models but also puts the three models
under a more general comprehensive model (Louzado Neto and Pereira 2000).
See Figure 6.1.
EPH/EAF
EH
EPH EAF
PH AF
PH/AF
Figure 6.1: Classes of regression model for survival data ( .... LouzadaNeto
hazard, existing models, AFaccelerated failure, PH/AM mixed model, EPH
extended PH, EAFextended AF, EPH/EAFextended mixed, EHextended haz
ard. Each model can be obtained as particular case of models on top him in the
ﬁgure)
Ripley (1998) investigated seven neural networks in modeling breast cancer
prognosis; her models were based on alternative implementation of models (6.1.2)
to (6.1.5) allowing for censoring. Here we outline the important results of the
literature.
CHAPTER 6. SURVIVAL ANALYSIS AND OTHERS NETWORKS 139
The accelerated failure time  AFT model is implemented using the architec
ture of regression network with the censored times estimated using some missing
value method as in Xiang et al (2000).
For the Cox proportional hazard model, Faraggi and Simon (1995) substitute
the linear function βx
j
by the output f(x
j
, θ) of the neural network, that is
L
c
(θ) =
i∈...
exp
_
H
h=1
α
h
/[1 + exp(−w
h
x
i
)]
_
j∈R
i
exp
_
H
h=1
α
h
/[1 + exp(−w
h
x
i
)]
_ (6.1.9)
and estimations are obtained by maximum likelihood through NewtonRaphson.
The corresponding network is shown below in Figure 6.2.
Figure 6.2: Neural network model for survival data (Single hidden layer neural
network)
As an example (Faraggi and Simon, 1995) consider the data related to 506
patient with prostatic cancer in stage 3 and 4. The covariates are: stage, age,
weight, treatment (0.2; 1 or 5 mg of DES and placebo).
The results are given in the Tables 6.1, 6.2, 6.3 below for the models:
(a) Firstorder PH model (4 parameters);
(b) Secondorder (interactions) PH model (10 parameters);
(c) Neural network model with two hidden nodes (12 parameters);
(d) Neural network model with three hidden nodes (18 parameters).
CHAPTER 6. SURVIVAL ANALYSIS AND OTHERS NETWORKS 140
Table 6.1: Summary statistics for the factors included in the models
Complete data Training set Validation set
Sample size 475 238 237
Stage 3 47.5% 47.6% 47.4%
Stage 4 52.5% 52.4% 52.6%
Median age 73 years 73 years 73 years
Median wight 980 970 990
Treatment: Low 49.9% 48.3% 51.5%
Treatment: High 50.1% 51.7% 48.5%
Median survival 33 months 33 months 34 months
% censoring 28.8% 29.8% 27.7%
Table 6.2: Loglikelihood and c statistics for ﬁrstorder, secondorder and neural network
proportional hazards models
Model Training data Test data
Number of Log lik c Log lik c
parameter s
First order PH 4 814.3 0.608 831.0 0.607
Secondorder PH 10 805.6 0.648 834.8 0.580
Neural network H = 2 12 801.2 0.646 834.5 0.6000
Neural network H = 3 18 794.9 0.661 860.0 0.582
CHAPTER 6. SURVIVAL ANALYSIS AND OTHERS NETWORKS 141
Table 6.3: Estimation of the main effects and higher order interactions using 2
4
factorial
design contrasts and the predictions obtained from the different models
Effects PH PH Neural network Neural network
1st order 2nd order H = 2 H = 3
Stage 0.300 0.325 0.451 0.450
Rx
∗
0.130 0.248 0.198 0.260
Age 0.323 0.315 0.219 0.278
Weight 0.249 0.238 0.302 0.581
Stage Rx 0 0.256 0.404 0.655
State Age 0 0.213 0.330 0.415
State Wt
∗
0 0.069 0.032 0.109
Rx Age 0 0.293 0.513 0.484
Rx Wt 0 0.195 0.025 0.051
Stage Rx Age 0 0 0.360 0.475
Stage Rx Wt 0 0 0.026 0.345
Stage Age Wt 0 0 0.024 0.271
Rx Age Wt 0 0 0.006 0.363
State Rx Age Wt 0 0 0.028 0.128
∗
Rx = Treatment, Wt = Weight
Implementation of the proportional odds and proportional hazard were imple
mented also by Liestol et al (1994) and Biganzoli at al (1998).
Liestol, Anderson (1994) used a neural network for Cox’s model with covari
ates in the form.
Let T be a random survival time variable, and I
k
the interval t
k−1
< t <
t
k
, k = 1, . . . , K where 0 < t
0
< t
1
< . . . < t
k
<
∞
.
The model can be speciﬁed by the conditional probabilities.
P(T ∈ I
k
/T > t
k−1
, x) =
1
1 + exp
_
−β
0k
−
1
i=1
β
ik
x
i
_ (6.1.10)
for K = 1, . . . , K.
The corresponding neural network is the multinomial logistic network with k
outputs.
The output 0
k
in the k
th
output neuron corresponds to the conditional proba
bility of dying in the interval I
k
.
Data for the individual n consist of the regressor x
n
and the vector (y
n
1
, . . . , y
n
)
where y
n
k
is the indicator of individual n dying in I
k
and k
n
≤ K is the number of
intervals where n is observed. Thus y
n
1
, . . . , y
n
k
n−1
are all 0 and y
n
k
n
= 1 of n
dies in I
kn
and
CHAPTER 6. SURVIVAL ANALYSIS AND OTHERS NETWORKS 142
Figure 6.3: Log odds survival network
0
k
= f(x, w) = Λ
_
β
0k
+
1
i=1
β
ik
x
i
_
(6.1.11)
and the function to optimize
E
∗
(w) =
N
h=1
K
n
k=1
−log (1 −[y
n
k
−f(x
n
, w)[) (6.1.12)
and w = (β
01
, . . . , β
0k
, β
11
, . . . , β
Ik
and under the hypothesis of proportional rates
make the restriction β
1j
= β
2
= β
3j
= β
4j
= . . . = β
j
. Other implementations
can be seen in Biganzoli et al (1998).
An immediate generalization would be to substitute the linearity for non
linearity of the regressors adding a hidden layer as in the Figure 6.4 below:
CHAPTER 6. SURVIVAL ANALYSIS AND OTHERS NETWORKS 143
Figure 6.4: Nonlinear survival network (Twolayer feedforward neural nets,
showing the notation for nodes and weights: input nodes (•), output (and hid
den) nodes (O), covariates Z
j
, connection weights w
ij
, output values 0
(2)
1
)
CHAPTER 6. SURVIVAL ANALYSIS AND OTHERS NETWORKS 144
An example from Liestol et al (1994) uses the data from 205 patients with
melanoma of which 53 died, 8 covariates were included.
Several networks were studied and the negative of the likelihood (interpreted
as prediction error) is given in the Table 6.4 below:
Table 6.4: Cross validation of models for survival with malignant melanoma (Col
umn 1. Linear model; 2. Linear model with weightdecay; 3. Linear model with a
penalty term for nonproportional hazards; 4. Nonlinear model with proportional
hazards; 5. Nonlinear model with a penalty term for nonproportional hazards;
6. Nonlinear model with proportional hazards in ﬁrst and second interval and in
third and fourth interval; 7. Nonlinear model with nonproportional hazards.)
1 2 3 4 5 6 7
Prediction error 172.6 170.7 168.6 181.3 167.0 168.0 170.2
Change 1.9 4.0 8.7 5.6 4.6 2.4
The main results for nonlinear models with two hidden nodes were:
• Proportional hazard models produced inferior predictions, decreasing the test
loglikelihood of a two hidden node model by 8.7 (column 4) when using
the standard weight decay, even more if no weight decay was used.
• Again in the test loglikelihood was obtained by using moderately nonproportional
models. Adding a penalty term to the likelihood of a nonproportional
model or assuming proportionality over the two ﬁrst and last time inter
vals improved the test loglikelihood by similar amounts (5.6 in the former
case (column 5) and 4.6 in the latter (column 6)). Using no restrictions on
the weights except weight decay gave slightly inferior results (column 7,
improvement 2.4).
In summary, for this small data set the improvements that could be obtained
compared to the simple linear models were moderate. Most of the gain could be
obtained by adding suitable penalty terms to the likelihood of a linear but non
proportional model.
An example from Biganzoli et al (1998) is the application neural networks in
the data sets of Efron’s brain and neck cancer and Kalbﬂeish and Prentice lung
cancer using the network architecture of Figure 6.5. The results of the survival
curve ﬁts follows in Figures 6.6 and 6.7:
CHAPTER 6. SURVIVAL ANALYSIS AND OTHERS NETWORKS 145
Figure 6.5: Feed forward neural network model for partial logistic regression
(PLANN) (The units (nodes) are represented by circles and the connections be
tween units are represented by dashed lines. The input layer has J units for time
a and covariates plus one ... unit (0). The hidden layer has H units plus the ...
unit (0). A single output unit (K = 1) compute conditional failure probability x
1
and x
2
are the weights for the connections of the ... unit with the hidden and out
put unit w
a
and w
...
are the weights for the connections between input and hidden
units and hidden and output unit, respectively.)
CHAPTER 6. SURVIVAL ANALYSIS AND OTHERS NETWORKS 146
(a)
(b)
Figure 6.6: Head and neck cancer trial ( (a) estimates of conditional failure proba
bility obtained with the best PLANNconﬁguration (solid line) and the cubiclinear
spline proposed by Efron
13
(dashed lines); (b) corresponding survival function
and KaplanMeyer estimates.)
CHAPTER 6. SURVIVAL ANALYSIS AND OTHERS NETWORKS 147
(a)
(b)
Figure 6.7: Head and neck cancer trial ( (a) estimates of conditional failure proba
bility obtained with a suboptional PLANN model (solid line) and the cubiclinear
spline proposed by Efron
13
(dashed lines); (b) corresponding survival function
and KaplanMeyer estimates.)
CHAPTER 6. SURVIVAL ANALYSIS AND OTHERS NETWORKS 148
A further reference is Bakker et al (2003) who have used a neuralBayesian
approach to ﬁt Cox survival model using MCMC and an exponential activation
function. Other applications can be seen in Lapuerta et al (1995), OhnoMachodo
et al (1995) and Mariani and al (1997).
6.2 Time Series Forecasting Networks
A time series is a set of data observed sequentially in time; although time can
be substituted by space, depth etc. Neighbouring observations are dependent and
the study of time series data consists in modeling this dependency.
A time series is represented by a set of observations ¦y
∼
(t
∼
), t
∼
∈ T¦ where T
is a set of index (time, space, etc.).
The nature of Y and T can be each a discrete or a continuous set. Further, y
∼
can be univariate or multivariate and T can have unidimensional or a multidimen
sional elements. For example, (Y, Z)
(t,s)
can represent the number of cases Y of
inﬂuenza and Z of meningitis in week t and state s in 2002 in the USA.
Thee are two main objectives in the study of time series. First, to understand
the underline mechanism that generates the time series. Second, to forecast future
values of it. Problems of interest are:
 description of the behavior in the data,
 to ﬁnd periodicities in the data,
 to forecast the series,
 to estimate the transfer function v(t) that connects an input series X to the
output series Y of the generating system. For example, in the linear case
Y (t) =
∞
u=0
u(t)X(t −u), (6.2.1)
 to forecast Y given v and X,
 to study the behavior of the system, by simulating X,
 to control Y by making changes in X.
There are basically two approaches for the analysis of time series. The ﬁrst,
analyzes the series in the time domain, that is we are interested in the magnitude of
the events that occur in time t. The main tool used is the autocorrelation function
CHAPTER 6. SURVIVAL ANALYSIS AND OTHERS NETWORKS 149
(and functions of it) and we use parametric models. In the second approach, the
analysis is in the frequency domain, that is we are interested in the frequency in
which the events occur in a period of time. The tool used is the spectrum (which
is a transform of the correlation function ex: Walsh, Fourier, Wavelet transforms)
and we use nonparametric methods. The two approaches are not alternatives but
complementary and are justiﬁed by representation theorems due to Wold (1938) in
time domain and Cramer (1939) in the frequency domain. We can say from these
theorems that the time domain analysis is used when the main interest is in the
analysis of nondeterministic characteristic of this data and the frequency domain
is used when the interest is in the deterministic characteristic of the data. Essen
tially Wold (1938) results in the time domain states that any stationary process Y
t
can be represented by
Y
t
= D
t
+ Z
t
(6.2.2)
where D
t
and Z
t
are uncorrelated, D
t
is deterministic that is depends only on its
past values D
t−1
, D
t−2
, . . . and Z
t
is of the form
Z
t
=
t
+ ψ
1
t−1
+ ψ
2
t−2
(6.2.3)
or
Z
t
= π
1
Z
t−1
+ π
2
Z
t−2
+ . . . +
t
(6.2.4)
where
t
is a white noise sequence (mean zero and autocorrelations zero) and
ψ
1
, π
1
are parameters. Expression (6.2.3) is an inﬁnite movingaverage process,
MA(∞) and expression (6.2.4) is an inﬁnite autoregressive process AR(∞).
The frequency domain analysis is based on the pair of results:
ρ(t) =
_
∞
−∞
e
iwt
dF(w) (6.2.5)
where ρ(t) is the autocorrelation of Y
t
and acts as the characteristic function of
the (spectral) density F(w) of the process Z(w) and
Y (t) =
_
π
−π
e
iwt
dZ(w). (6.2.6)
Most of the results in time series analysis is based on these results on its exten
sions.
We can trace in time some developments in time series in the last century:
Decomposition Methods (Economics, thirties to ﬁfties), Exponential Smooth
ing (Operations Research, sixties), BoxJenkins (Engineering, Statistics, seven
ties), StateSpace Methods  Bayesian, Structural (Statistics, Econometrics, Con
CHAPTER 6. SURVIVAL ANALYSIS AND OTHERS NETWORKS 150
trol Engineering,eighties), NonLinear Models (Statistics, Control Engineering,
nineties).
It should be pointed out that the importance of BoxJenkins methodology is
that they popularized and made accessible the difference equation representation
of Wold. Further we can make an analogy between scalar and matrices and the
difference equation models of BoxJenkins (scalar) and statespace models (ma
trices). The recent interest in nonlinear model for time series is that makes neural
networks attractive and we turn to this in the next Section.
6.2.1 Forecasting with Neural Networks
The most common used architecture is a feedforward network with p lagged
values of the time series as input and one output. As in the regression model of
Section 5.1.2, if there is no hidden layer and the output activation is the identity
function the network is equivalent to a linear AR(p) model. If there are one or
more hidden layers with sigmoid activation function in the hidden neurons the
neural network acts as nonlinear autoregressions.
The architecture and implementation of most time series applications of neu
ral networks follows closely to the regression implementation. In the regression
neural network we have the covariates as inputs to the feedforward network, here
we have the lagged values of the time series as input.
The use of artiﬁcial neural network applications in forecasting is surveyed in
Hill et al (1994) and Zang et al (1998). Here we will describe some recent and
nonstandard applications.
One of the series analyzed by Raposo (1992) was the famous Airline Passenger
Data of BoxJenkins (Monthly number of international passengers from 1949 to
1960). The model for this series becomes the standard benchmark model for
seasonal data with trend: the multiplicative seasonal integrated moving average
model, SARIMA (0, 1, 1)
12
(0, 1, 1).
Table 6.5 gives the comparison for the periods 1958, 1959, 1960 in terms of
mean average percentage error of the BoxJenkins model and some neural net
work.
The airline data was also analyzed by Faraway and Chatﬁeld (1998) using
neural network.
Another classical time series the Yearly Sunspot Data was analyzed by Park
et al (1996). Their results are presented in Tables 6.6 and 6.7 for the training data
(88 observations) and forecasting period (12 observations). Here they used the
PCA method described in Section 2.8.
A multivariate time series forecasting using neural network is presented in
Chakraborthy et al (1992). The data used is the Indices of Monthly Flour Prices
in Buffalo (x
t
), Minneapolis (y
t
) and Kansas City (z
t
) over the period from August
CHAPTER 6. SURVIVAL ANALYSIS AND OTHERS NETWORKS 151
Table 6.5: Forecasting for Airline Data
Year MAPE %
(12:2:1) (13:2:1) 14:2:1) SARIMA
1958 7.63 6.44 5.00 7.75
1959 6.32 4.90 7.75 10.86
1960 7.39 7.93 11.91 15.55
(a:b:c)  number of neurons in: a  input, b  hidden, c  output, layer
Table 6.6: Result from network structures and the AR(2) model
Network Structure Mean Square Error Mean Absolute
(MSE) Error
7
(MAE)
2:11:1 156.2792 9.8539
2:3:1 154.4879 9.4056
2:2:1
∗
151.9368 9.2865
2:1:1 230.2508 12.2454
2:0:1
··
233.8252 12.1660
AR(2) 222.9554 11.8726
∗
indicates the proposed optimal structure.
··
represents a network without hidden units.
CHAPTER 6. SURVIVAL ANALYSIS AND OTHERS NETWORKS 152
Table 6.7: Summary of forecasting errors produced by neural network models and the
AR(2) model based on untrained data
Model Mean Square Error (MSE) Mean Absolute Error (MAE)
2:11:1 160.8409 9.2631
2:3:1 158.9064 9.1710
2:2:1
∗
157.7880 9.1522
2:1:1 180.7608 11.0378
2:0:1 180.4559 12.1861
AR(2) 177.2645 11.5624
∗
represents the selected model by the PCA method.
1972 to November 1980, previously analyzed by Tiag and Tsay using an ARMA
(1,1) multivariate model. They used two classes of feedforward networks: sepa
rate modeling of each series and a combined modeling as in Figure 6.8 and Figure
6.9.
The results are shown in Tables 6.8 and 6.9.
CHAPTER 6. SURVIVAL ANALYSIS AND OTHERS NETWORKS 153
Figure 6.8: Separate Architecture
Figure 6.9: Combined Architecture for forecasting x
t+1
(and similar to y
t+1
and
z
t+1
)
CHAPTER 6. SURVIVAL ANALYSIS AND OTHERS NETWORKS 154
Table 6.8: MeanSquared Errors for Separate Modeling & Prediction 10
3
Buffalo Minneapolis Kansas City
Network MSE MSE MSE
221 Training 3.398 3.174 3.526
Onelag 4.441 4.169 4.318
Multilag 4.483 5.003 5.909
441 Training 3.352 3.076 3.383
Onelag 9.787 8.047 3.370
Multilag 10.317 9.069 6.483
661 Training 2.774 2.835 1.633
Onelag 12.102 9.655 8.872
Multilag 17.646 14.909 13.776
Table 6.9: MeanSquared Errors and Coeffs.of Variation for Combined vs. Tiao/Tsay’s
Modeling/ Prediction 10
3
Buffalo Minneapolis Kansas City
Model MSE CV MSE CV MSE CV
661 Training 1.446 7.573 1.554 7.889 1.799 8.437
Network Onelag 3.101 11.091 3.169 11.265 2.067 9.044
Multilag 3.770 12.229 3.244 11.389 2.975 10.850
881 Training 0.103 2.021 0.090 1.898 0.383 3.892
Network Onelag 0.087 1.857 0.072 1.697 1.353 7.316
Multilag 0.107 2.059 0.070 1.674 1.521 7.757
Tiao/Tsay Training 2.549 10.054 5.097 14.285 8.645 18.493
Onelag 2.373 9.701 4.168 12.917 7.497 17.222
Multilag 72.346 53.564 137.534 74.204 233.413 96.096
CHAPTER 6. SURVIVAL ANALYSIS AND OTHERS NETWORKS 155
In previous example, initial analysis using true series methods have helped to
design the architecture of the neural network. More explicit combination of time
series model and neural network are: Tseng et al (2002) that improved the ARIMA
forecasting using a feedforward network with logged values of the series, its
forecast and the residual obtained from the ARIMA ﬁtting to the series; Coloba et
al (2002) which used a neural network after decomposing the series from its law
and high frequency signals; Cigizoglu (2003) used generated data from a time
series model to train a neural network. Periodicity terms and lagged values were
used to forecast river ﬂow in Turkey.
A comparison of structural time series model and neural network is presented
in Portugal (1995). Forecasting noninversible time series using neural network
was attempted by Pino et al (2002) and combining forecasts by Donaldson and
Kamstra (1996). Further comparisons with time series models and applications
to classical time series data can be seen in Terasvirta et al (2005), Kajitoni et al
(2005) and Ghiassi et al (2005).
Modeling of nonlinear time series using neural network and an extension
called stochastic neural network can be seen in Poli and Jones (1994), Stern
(1996), Lai and Wong (2001) and Shamseldin and O’Connor (2001).
Forecasting using a combination of neural networks and fuzzy set are pre
sented by Machado et al (1992), Lourenco (1998), Li et al (1999).
Finally, an attempt to choose a classical forecast method using a neural net
work is shown in Venkatachalan and Sohl (1999).
6.3 Control Chart Networks
Control charts are used to identify variations in production processes. There
are usually two types of causes of variation: chance causes and assignable causes.
Chance causes, or natural causes, are inherent in all processes. Assignable causes
represent improperly adjustable machines, operation error, and defective raw ma
terials and it is desirable to remove these causes from the process.
The process means (µ) and variance (σ
2
) are estimated when observed data
are collected in sample of size n and the means
¯
X
1
,
¯
X
2
, . . . are calculated. It is
assumed that the sample means are mutually independent and that
¯
X
i
has distri
bution N(µ
j
,
σ
2
n
), i = 1, 2, . . . and the condition µ = µ
0
is to be maintained.
A shift in the process mean occurs when the distribution of the sample changes
(µ ,= µ
0
). The decision in the control chart is based on the sample mean with a
signal of outofcontrol being given at the ﬁrst stage N such that
CHAPTER 6. SURVIVAL ANALYSIS AND OTHERS NETWORKS 156
δ
N
=
[
¯
X
N
−µ
0
[
σ/
√
n
0
> c (6.3.1)
where usually c = 3 and the sample size of 4 or 5. This procedure is useful to
detect shifts in the process mean. However, disturbances can also cause shifts or
changes in the amount of process variability. A similar decision procedure can be
used for the behavior of ranges of the samples from the process and will signal an
outofcontrol variability. Similarly chart for σ is usually based on
[
¯
R −σ
R
[ > 3 (6.3.2)
or
¯
R ±3A
m
¯
R (6.3.3)
where
¯
R is average range of a number of sample ranges R and A
n
is a constant
which depends on n, relating σ to E(R) and V (R).
Several kinds of signals based in these statistics and charts have been pro
posed, for example: more than threes consecutive values more than 2σ from the
mean; eight consecutive values above the mean etc. This rules will detect trends,
sinoidal etc. patters in the data.
These rules can also be implemented using neural network and we will now
describe some of these work. All the them uses simulation results and make com
parison with classical statistical charts in terms of percentage of wrong decisions
and ARL (average run length) which is deﬁned to be the expected number of sam
ples that are observed from a process before outofcontrol signal is signaled by
the chart.
Prybutok et al (1994) used a feedforward network with 5 inputs (the sample
size), a single hidden layer with 5 neurons and 1 output neuron. The activation
functions were sigmoid the neural network classiﬁer the control chart as being
“in control” or “outofcontrol” respectively according if the output fall in the
intervals (0, 0.5) or (0.5, 1).
Various rules were compared (see Prybutok et al (1994)) such as C
4
=signal if
eight if the last standardized sample means are between 0 and 3 or if the eight of
the last eight are between 0 and 3. The rules were denoted by C
i
and its combina
tion by C
ijkl
.
For eight shifts given by [µ − µ
0
[ taking the values (0, 0.2, 0.4, 0.6, 0.8, 1, 2,
3) the ratio of the ARL for the neural network to the standard Shewhart
¯
X control
chart without supplemental runs rules are listed in Table 6.10. A further table with
other rules are given in Prybutok et al (1994).
CHAPTER 6. SURVIVAL ANALYSIS AND OTHERS NETWORKS 157
Table 6.10: Rations of Average Run Lengths: Neural Network Model Cn
ij
to
Shewhart Control Chart without Runs Rules (C
1
)
Control Shift (µ)
Charts 0.0 0.2 0.4 0.6 0.8 1.0 2.0 3.0
ARL
∗
370.40 308.43 200.08 119.67 71.55 43.89 6.30 2.00
C
1
1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
Cn
54
0.13 0.14 0.16 0.18 0.21 0.24 0.27 0.19
Cn
55
0.25 0.27 0.30 0.33 0.38 0.38 0.46 0.32
Cn
56
0.47 0.55 0.60 0.67 0.68 0.72 0.71 0.44
Cn
57
0.95 0.99 1.31 1.41 1.48 1.44 1.19 0.68
Cn
62
0.18 0.19 0.22 0.25 0.30 0.32 0.43 0.34
Cn
63
0.24 0.26 0.29 0.32 0.37 0.43 0.50 0.36
Cn
64
0.33 0.34 0.36 0.39 0.44 0.48 0.55 0.43
Cn
65
0.42 0.40 0.44 0.48 0.53 0.56 0.67 0.53
Cn
66
0.56 0.53 0.57 0.60 0.63 0.69 0.76 0.56
Cn
67
0.78 0.71 0.72 0.78 0.82 0.79 0.90 0.63
Cn
68
1.15 1.02 0.94 1.01 1.02 1.08 1.04 0.75
Cn
71
0.25 0.26 0.27 0.33 0.33 0.38 0.44 0.07
Cn
72
0.49 0.52 0.52 0.60 0.61 0.63 0.70 0.42
Cn
73
0.88 0.92 0.96 1.01 1.02 1.09 0.98 0.63
Cn
74
1.56 1.61 1.71 1.72 1.69 1.56 1.47 0.90
ARL = ARL
c
1
(µ) for Shewhart Control Chart.
CHAPTER 6. SURVIVAL ANALYSIS AND OTHERS NETWORKS 158
Smith (1994) extends the previous work by considering a single model of a
simultaneous Xbar and R chart, and investigating both location and variance
shifts. She applied two groups of feedforward networks.
The ﬁrst group with only one output, sigmoid activation function, two hidden
layers, 3 or 10 neurons in each hidden layer, and interpretation as: 0 to 0.3 (mean
shift), 0.3 to 0.7 (under control), 0.7 to 1 (variance shift).
The inputs were either in ten observations (n = 10) and the calculated statis
tics (sample mean, range, standard deviation), a total of 13 inputs or on the cal
culated islatistics only, i.e. 3 inputs. Table 6.11 shows the percentage of wrong
decisions for the control chart and the neural networks.
Table 6.11: Wrong Decisions of Control Charts and Neural Networks
Training/ Inputs Hidden Neural Net Shewhart Control
Test Set Layer Size Percent Wrong Percent Wrong
A All 3 16.8 27.5
A All 10 11.7 27.5
A Statistics 3 28.5 27.5
B All 3 0.8 0.5
B All 10 0.5 0.5
B Statistics 3 0.0(None) 0.5
Table 6.12: Desired Output Vectors
Pattern Desired output Note
vector
Upward trend [1,0,0,0,0] –
Downward trend [1,0,0,0,0] –
Systematic variation (I) [0,1,0,0,0] the ﬁrst observation is above
the incontrol mean
Systematic variation (II) [0,1,0,0,0] the ﬁrst observation is below
the incontrol mean
Cycle (I) [0,0,1,0,0] sine wave
Cycle (II) [0,0,1,0,0] cosine wave
Mixture (I) [0,0,0,1,0] the mean of the ﬁrst distribution
is greater than the incontrol mean
Mixture (II) [0,0,0,1,0] the mean of the ﬁrst distribution
is less than the incontrol mean
Upward shift [0,0,0,0,1] –
Downward shift [0,0,0,0,1] –
The second group of network deferred from the ﬁrst group by the input which
now is the raw observations only (n = 5, n = 10, n = 20). These neural networks
CHAPTER 6. SURVIVAL ANALYSIS AND OTHERS NETWORKS 159
were design to recognize the following shapes: ﬂat (in control), sine wave (cyclic),
slop (trend of drift) and bimodal (up and down shifts). The networks here had two
outputs used to classify patterns with the codes: ﬂat (0,0), sine wave (1,1), trend
(0,1), and bimodal (1,0). A output in (0,0.4] was considered 0 and in (0.6,1) was
considered 1. Table 6.3 gives some of the results Smith (1994) obtained.
Hwarng (2002) presents a neural network methodology for monitoring pro
cess shift in the presence of autocorrelation. The study of AR (1) (autoregressive
of order one) processes shows that the performance of neural network based mon
itoring scheme is superior to other ﬁve control charts in the cases investigated.
He used a feedforward network with 50 30 1 (inputhiddenoutput) structure of
neurons.
A more sophisticated pattern recognition process control chart using neural
networks is given in Cheng (1997). A comparison between the performance of a
feedforward network and a mixture of expert was made using simulations. The
multilayer network had 16 inputs (raw observations), 12 neurons in the hidden
layers and 5 outputs identifying the patterns as given in Table 6.13. Hyperbolic
activation function was used.
A mixture of three expert (choice based on experimentation) with the same
input as in the feedforward network. Each expert had the same structure as the
feedforward network. The gating network has 3 neurons elements in the output
layer and 4 neurons in a hidden layer.
The results seems to indicate that the mixture of experts performs better in
terms of correctly identiﬁed patterns.
Table 6.13: Pattern Recognition Networks
# Input σ # Correct # Missed of 300
Points of Noise of 300 Flat Slope Sine wave Bimodal
5 0.1 288 0 0 10 2
5 0.2 226 39 15 10 10
5 0.3 208 32 39 18 3
10 0.1 292 0 0 5 3
10 0.2 268 3 7 16 6
10 0.3 246 9 17 23 5
20 0.1 297 0 0 0 3
20 0.2 293 0 0 0 7
20 0.3 280 6 1 1 12
Over All Networks 88.8% 3.3% 2.9% 3.1% 1.9%
CHAPTER 6. SURVIVAL ANALYSIS AND OTHERS NETWORKS 160
6.4 Some Statistical Inference Results
In this section, we describe some inferential procedures that have been devel
oped for and using neural network. In particular we outline some results on point
estimation (Bayes, likelihood, robust), interval estimation (parametric, bootstrap)
and signiﬁcance tests (Ftest and nonlinearity test).
A more detailed account of prediction interval outlined here see also Hwang
and Ding (1997).
6.4.1 Estimation methods
The standard method of ﬁtting a neural network is backpropagation which is
a gradient descent algorithm to minimize the soma of squares. Here we outline
some details of other alternative methods of estimation.
Since the most used activation function is the sigmoid we present some fur
ther results. First consider, maximum likelihood estimation, here we follow Schu
macher et al (1996) and Faraggi and Simon (1995).
The logistic (sigmoid) activation relates output y to input x assuming
P
x
∼
(Y = 1/x
∼
) = ∧
_
ω
0
+
I
i=1
ω
i
x
ji
_
= p(x
∼
, ω) (6.4.1)
where ∧(u) = 1/(1 + e
u
) is the sigmoid or logistic function.
As seen in Section 5.1, this activation function as a nonlinear regression
model is a special case of the generalized linear model i.e.
E(Y/x) = p(x, ω), V (Y/x
∼
) = p(x, ω)(1 −p(x, ω)). (6.4.2)
The maximum likelihood estimator of the weights ω’s is obtained from the log
likelihood function
L(ω) =
n
j=1
_
y
j
log p(x
i
, ω
∼
) + (1 −y
j
) log(1 −p(x
j
ω)
_
(6.4.3)
where (x
j
, y
j
) is the observation for pattern j.
To maximize L(ω) is equivalent to minimize the KulbackLeiber divergence
n
j=1
_
y
j
log
y
j
p(x
j
ω)
+ (1 −y
j
) log
1 −y
j
1 −p(x
j
, ω)
_
(6.4.4)
It can be written as
CHAPTER 6. SURVIVAL ANALYSIS AND OTHERS NETWORKS 161
n
j=1
−log (1 −[y
j
−p(x, ω)[) (6.4.5)
which makes the interpretation of distance between y and p(y, ω) more clear. At
this point, for comparison, we recall the criteria function used in the backpropa
gation least square estimation method (or learning rule):
n
j=1
(y
j
−p(x
j
, ω))
2
(6.4.6)
Due to the relation to of maximum likelihood criteria and the KulbackLeiber,
the estimation based on (6.4.4) is called backpropagation of maximum likelihood.
In the neural network jargon it is also called relative entropy or cross entropy
method.
Bayesian methods for neural networks rely mostly in MCMC (Markov Chain
Monte Carlo). Here we follow Lee (1998) and review the priors used for neural
networks.
All approaches start with the same basic model for the output y
y
i
= β
0
+
k
j=1
β
j
1
1 + exp(−ω
j0
−
p
h=1
ω
jh
x
ih
)
+
i
, (6.4.7)
i
∼ N(0, σ
2
) (6.4.8)
The Bayesian models are:
− M¨ uller and Rios Insua (1998)
A direct acyclic graph (DAG) diagram of the model is:
The distributions for parameters and hyperparameters are:
CHAPTER 6. SURVIVAL ANALYSIS AND OTHERS NETWORKS 162
Figure 6.10: Graphical model of Muller and Rios Issua
y
i
∼ N
_
m
j=0
β
j
∧ (ω
j
x
i
), σ
2
_
, i = 1, . . . , N (6.4.9)
β
j
∼ N(µ
β
, σ
2
β
), j = 0, . . . , m (6.4.10)
ω
j
−ωγ
j
∼ N
p
(µ
γ
, S
γ
), j = 1, . . . , m (6.4.11)
µ
β
∼ N(a
β
, A
β
) (6.4.12)
µ
γ
∼ N
p
(a
γ
, A
γ
) (6.4.13)
σ
2
β
∼ Γ
−1
_
c
β
2
,
C
β
2
_
(6.4.14)
S
γ
∼ Wish
−1
_
c
γ
, (c
γ
C
γ
)
−1
_
(6.4.15)
σ
2
∼ Γ
−1
_
s
2
,
S
2
_
. (6.4.16)
There are ﬁxed hyperparameters that need to be speciﬁed. M¨ uller and Rios
Insua specify many of them to be of the same scale as the data. As some
ﬁtting algorithms work better when (or sometimes only when) the data have
been rescaled so that [x
ih
[, [y
i
[ ≤ 1, one choice of starting hyperparameters
is: a
β
= 0, A
β
= 1, a
γ
= 0, A
γ
= I
p
, c
β
= 1, C
β
= 1, c
γ
= p + 1,
C
γ
=
1
p + 1
I
p
, s = 1, S =
1
10
.
− Neal Model (1996)
Neal’s fourstage model contains more parameters than the previous model.
A DAG diagram of the model is
Each of the original parameters (β and γ) is treated as a univariate normal
with mean zero and its own standard deviation. These standard deviations
CHAPTER 6. SURVIVAL ANALYSIS AND OTHERS NETWORKS 163
Figure 6.11: Graphical model of Neal
are the product of two hyperparameters, one for the originating node of the
link in the graph, and one for the destination node.
There are two notes on this model that should be mentioned. First, Neal uses
hyperbolic tangent activation functions rather than logistic functions. These
are essentially equivalent in terms of the neural network, the main difference
being that their range is from 1 to 1 rather than 0 to 1. The second note is
that Neal also discusses using t distributions instead of normal distributions
for the parameters.
− MacKay Model (1992)
MacKay starts with the idea of using an improper prior, but knowing that
the posterior would be improper, he uses the data to ﬁx the hyperparameters
at a single value. Under his model, the distributions for parameters and
hyperparameters are:
CHAPTER 6. SURVIVAL ANALYSIS AND OTHERS NETWORKS 164
y
i
∼ N
_
m
j=0
β
j
∧ (ω
j
x
i
)(γ
j
x
i
),
1
ν
_
, i = 1, . . . , N (6.4.17)
β
j
∼ N
_
0,
1
α
1
_
, j = 0, . . . , m (6.4.18)
ω
j
h ∼ N
p
_
0,
1
α
2
_
, j = 1, . . . , m h = 1, . . . , p (6.4.19)
ω
j
0 ∼ N
p
_
0,
1
α
3
_
, j = 1, . . . , m (6.4.20)
α
k
∝ 1, k = 1, 2, 3 (6.4.21)
v ∝ 1. (6.4.22)
MacKay uses the data to ﬁnd the posterior mode for the hyperparametrics
α and ν and then ﬁxes them at their modes.
Each of the three models reviewed is fairly complicated and are ﬁt using
MCMC.
− Lee Model (1998)
The model for the response y is
y
i
= β
0
+
k
j=1
β
j
1
1 + exp(−γ
j0
−
p
h=1
ω
jh
x
ih
)
+
i
,
i
iid
∼ N(0, σ
2
)
(6.4.23)
Instead of having to specify a proper prior, he uses the noninformative im
proper prior
π(β, ω, σ
2
) = (2π)
−d/2
1
σ
2
, (6.4.24)
where d is the dimension of the model (number of parameters). This prior is
proportional to the standard noninformative prior for linear regression. This
model is also ﬁtted using MCMC.
This subsection ends with some references on estimation problems of in
terest. Belue and Bower (1999) give some experimental design procedures
CHAPTER 6. SURVIVAL ANALYSIS AND OTHERS NETWORKS 165
to achieve higher accuracy on the estimation of the neural network param
eters. Copobianco (2000) outlines robust estimation framework for neural
networks. Aitkin and Foxall (2003) reformulate the feedforward neural net
work as a latent variable model and construct the likelihood which is maxi
mized using ﬁnite mixture and the EM algorithm.
6.4.2 Interval estimation
Interval estimation for feedforward neural network essentially uses the stan
dard results of least squares for nonlinear regression models and ridge regression
(regularization). Here we summarize the results of Chryssoloures et al (1996) and
De Veux et al (1998).
Essentially they used the standard results for nonlinear regression least square
S(ω) =
(y
i
−f(X, ω))
2
(6.4.25)
or the regularization, when weight decay is used
S(ω) =
(y
i
−f(X, ω))
2
+ kα
ω
2
i
. (6.4.26)
The corresponding prediction intervals are obtained from the t distribution and
t
n−p
s
_
1 + g
0
(JJ)
−1
g
0
(6.4.27)
t
n−p
∗s
∗
_
1 + g
0
(J
J + kI)
−1
(J
J +kI)
−1
g
0
(6.4.28)
where s = (residual sum of squares)/(n − p) = RSS/(n − p), p is the number
of parameters, J is a matrix with entries ∂f(ω, x
i
)/∂ω
j
g
0
is a vector with entries
∂f(x
0
ω)/∂ω
j
and
k = y −f(Xω
∗
) + Jω
∗
, (6.4.29)
s
∗
= RSS/(n −tr(2H −H
2
)) (6.4.30)
H = J(J
J + αI)
−1
J (6.4.31)
p
∗
= tr(2H −H
2
). (6.4.32)
where H = J(J
J + αI)
−1
J
, ω
∗
is the true parameter value.
The author gives an application and De Veux et al (1998) also shown some
simulated results.
Tibshirani (1995) compared sevens methods to estimate the prediction inter
vals for neural networks. The method compared were: 1. Deltamethod; 2. Ap
proximate delta: Using approximate to information matrix that ignores second
CHAPTER 6. SURVIVAL ANALYSIS AND OTHERS NETWORKS 166
derivatives; 3. The regularization estimator; 4. Two approximate sandwich esti
mators and 5. Two bootstrap estimators.
In this 5 examples and simulations he found that the bootstrap methods pro
vided the most accurate estimates of the standard errors of the predicted values.
The nonsimulation methods (delta, regularization, sandwich estimators) missed
the variability due to random choice of starting values.
6.4.3 Statistical Tests
Two statistical tests when using neural networks are presented here.
Adams (1999) gives the following test procedure to test hypothesis about the
parameters (weights) of a neural network. The test can be used to test nonlinear
relations between the inputs and outputs variables. To implement the test it is ﬁrst
necessary to calculate the degree of freedom for the network.
If n is the number of observations and p the number of estimated parameters
then the degree of freedom (df) are calculated as
df = n −p. (6.4.33)
With a multilayer feedforward network with m hidden layers, the input layer de
signed by 0 and the out designed by m+ 1. For L
m+1
outputs and L
m
neurons in
the hidden layer we have
df = n −p + (L
m+1
−1)(L
n
+ 1). (6.4.34)
The number of parameters is
P =
m+1
i=1
(L
i−1
L
1
+ L
i
) . (6.4.35)
For a network with 4:6:2, i.e. 4 inputs, 6 hidden neurons and 2 outputs, the degrees
of freedom for each output is
df = n −44 + (2 −1)(6 + 1) = n −37. (6.4.36)
For a feedforward network trained by least square we can form the model sum
of squares. Therefore any submodel can be tested using the F test
F =
(SSE
0
−SSE)/(df
0
−df)
SSE/df
(6.4.37)
where SSE
0
is the sum of squares of submodel and SSE is the sum of squares of
full model.
CHAPTER 6. SURVIVAL ANALYSIS AND OTHERS NETWORKS 167
The above test is based on the likelihood ratio test. An alternative proposed
by Lee et al (1993) uses the Rao score test or Lagrange multiplier test. Blake and
Kapetanios (2003) also suggested a Wald equivalent test. See also Lee (2001).
The implementation of Rao test (see Kiani, 2003) to test nonlinearity of a time
series using a onehidden layer neural network is
(i) regress y
t
on intercept and y
t−1
, . . . , y
t−k
, save residuals (ˆ u
t
), residual sum of
squares SSE
1
and predictions ˆ y
t
,
(ii) regress ˆ u
t
on y
t−1
, . . . , y
t−k
using neural network model that nests a linear
regression y
t
= πy
∼
t
+ u
t
(y
∼
t
= (y
t−1
, . . . , y
t−k
), save matrices Ψ of outputs
of the neurons for principle components analysis.
(iii) regress ˆ e
t
(residuals of the linear regression in (ii)) in Ψ
∗
(principal compo
nents of Ψ) and X matrices, and obtain R
2
, thus
TS = nR
2
∼ χ
2
(g) (6.4.38)
where g is the number of principal components used un step (iii).
The neural network in 6.12 clariﬁes the procedure:
Figure 6.12: Onehidden layer feedforward neural network.
Simulations results on the behaviors of this test can be seen in the bibliography
on references. An application is given in Kiani (2003).
Bibliography and references
Preface
Fundamental Concepts on Neural Networks
Artiﬁcial Intelligence: Symbolist & Connectionist
[1] Carvalho, L.V. Data mining. Editora Erika, SP. Brasil, 1997.
[2] Gueney, K. An Introduction to Neural Networks. University College Lon
don Press, 1997.
[3] Wilde, D. Neural Network Models. Springer, 2nd ed., 1997.
Artiﬁcial Neural Networks and Diagrams
[1] Haykin, S. Neural Networks: A Comprehensive Foundation. Prentice Hall,
2nd ed., 1999.
[2] Sarle, W.S. Neural networks and statistical models. In Proceedings of
the Nineteenth Annual SAS Users Group International Conference. 1994.
(http//citeseer.nj.nec.com/sarle94neural.html).
Activation Functions
[1] Haykin, S. Neural Networks: A Comprehensive Foundation. Prentice Hall,
2nd ed., 1999.
[2] Igengar, S.S., Chao, E.C., and Phola, V.V. Foundations of Wavelet Networks
and Applications. Chapman Hall/CRC, 2nd ed., 2002.
[3] Orr, M.J.L. Introduction to radial basis function networks. Tech. rep., In
stitute of Adaptive and Neural Computationy, Edinburgh University, 1996.
(www.anc.ed.ac.uk/ mjo/papers/intro.ps).
168
CHAPTER 6. SURVIVAL ANALYSIS AND OTHERS NETWORKS 169
Network Architectures
[1] Haykin, S. Neural Networks: A Comprehensive Foundation. Prentice Hall,
2nd ed., 1999.
Network Training
[1] Haykin, S. Neural Networks: A Comprehensive Foundation. Prentice Hall,
2nd ed., 1999.
Kolmogorov Theorem
[1] Bose, N.K. and Liang, P. Neural Networks Fundamentals with Graphs,
Algorithms and Applications. McGrawHill Inc., 2nd ed., 1996.
[2] Mitchell, T.M. Machine Learning. McGrawHill Inc., 2nd ed., 1997.
[3] Murtagh, F. Neural networks and related “massively parallel” methods for
statistics: A short overview. International Statistical Review, 62(3):275–
288, 1994.
[4] Rojas, R. Neural Networks: A Systematic Introduction. Springer, 2nd ed.,
1996.
Model Choice
[1] Haykin, S. Neural Networks: A Comprehensive Foundation. Prentice Hall,
2nd ed., 1999.
[2] Murtagh, F. Neural networks and related “massively parallel” methods for
statistics: A short overview. International Statistical Review, 62(3):275–
288, october 1994.
[3] Orr, M.J.L. Introduction to radial basis function networks. Tech. rep., In
stitute of Adaptive and Neural Computationy, Edinburgh University, 1996.
(www.anc.ed.ac.uk/ mjo/papers/intro.ps).
[4] Park, Y.R., Murray, T.J., and Chen, C. Prediction sun spots using a lay
ered perception neural networks. IEEE Transaction on Neural Networks,
7(2):501–505, 1994.
CHAPTER 6. SURVIVAL ANALYSIS AND OTHERS NETWORKS 170
Terminology
[1] Addrians, P. and Zanringe, D. Data Mining. AddisonWesley, 2nd ed.,
1996.
[2] Arminger, G. and Polozik, W. Data analysis and information systems, sta
tistical and conceptual approaches. studies in classiﬁcation data analysis
and knowledge organization. In Statistical models and neural networks
(Eds. O.E. BarndorfNielsen, J.L. Jensen, and W.S. Kendall), vol. 7, pages
243–260. Springer, 1996.
[3] Fausett, L. Fundamental of Neural, Architecture, Algorithms and Applica
tions. Prentice Hall, 1994.
[4] Gueney, K. An Introduction to Neural Networks. University College Lon
don Press, 1997.
[5] Kasabov, N.K. Foundations of Neural Networks. Fuzzy Systems and Knowl
edge Engineering. MIT Press, 1996.
[6] Sarle, W.S. Neural network and statistical jargon. (ftp://ftp.sas.com).
[7] Sarle, W.S. Neural networks and statistical models. In Proceedings of
the Nineteenth Annual SAS Users Group International Conference. 1994.
(http//citeseer.nj.nec.com/sarle94neural.html).
[8] Stegeman, J.A. and Buenfeld, N.R. A glossary of basic neural network
terminology for regression problems. Neural Computing and Applications,
(8):290–296, 1999.
Some Common Neural Networks Models
Multilayer Feedforward Networks
[1] DeWilde, P. Neural Network Models. Springer, 2nd ed., 1997.
[2] Haykin, S. Neural Networks: A Comprehensive Foundation. Prentice Hall,
2nd ed., 1999.
[3] Picton, P. Neural Networks. Palgrave, 2nd ed., 1999.
[4] Wu, C.H. and Mcharty, J.W. Neural Networks and Genome Informatics.
Elsevier, 2000.
CHAPTER 6. SURVIVAL ANALYSIS AND OTHERS NETWORKS 171
Associative and Hopﬁeld Networks
[1] Abdi, H., Valentin, D., and Edelman, B. SAGE, Quantitative Applications
in the Social Sciences. 124. Prentice Hall, 1999.
[2] Braga, A.P., Ludemir, T.B., and Carvalho, A.C.P.L.F. Redes Neurais Artiﬁ
ciais: Teoria e Aplicac¸ ˜ oes. LTC editora, 2000.
[3] Chester, M. Neural Networks: A Tutorial. Prentice Hall, 1993.
[4] Fausett, L. Fundamentals of Neural Networks Architectures, Algorithms
and Applications. Prentice Hall, 1994.
[5] Rojas, R. Neural Networks: A Systematic Introduction. Springer, 1996.
Radial Basis Function Networks
[1] Lowe, D. Statistics and neural networks advances in the interface. In Radial
basis function networks and statistics (Eds. J.W. Kay and D.M.T. Hering
ton), pages 65–95. Oxford University Press, 1999.
[2] Orr, M.J.L. Introduction to radial basis function networks. Tech. rep., In
stitute of Adaptive and Neural Computationy, Edinburgh University, 1996.
(www.anc.ed.ac.uk/ mjo/papers/intro.ps).
[3] Wu, C.H. and McLarty, J.W. Neural Networks and Genome Informatics.
Elsevier, 2000.
Wavelet Neural Networks
[1] Abramovich, F., Barley, T.C., and Sapatinas, T. Wavelet analysis and its
statistical application. Journal of the Royal Statistical Society D, 49(1):1–
29, 2000.
[2] Hubbard, B.B. The World According to Wavelets. A.K. Peters, LTD, 1998.
[3] Iyengar, S.S., Cho, E.C., and Phola, V.V. Foundations of Wavelets Networks
and Applications. Chapman and Gall/CRC, 2002.
[4] Martin, E.B. and Morris, A.J. Statistics and neural networks. In Artiﬁ
cial neural networks and multivariate statistics (Eds. J.W. Kay and D.M.
Titterington), pages 195–258. Oxford University Press, 1999.
CHAPTER 6. SURVIVAL ANALYSIS AND OTHERS NETWORKS 172
[5] Morettin, P.A. From fourier to wavelet analysis of true series. In Proceed
ings in Computational Statistics (Ed. A. Prat), pages 111–122. Physica
Verlag, 1996.
[6] Morettin, P.A. Wavelets in statistics (text for a tutorial). In The
Third International Conference on Statistical Data Analysis Based
on L
1
Norm and Related Methods. Neuchˆ atel, Switzerland, 1997.
(www.ime.usp/br/ pam/papers.html).
[7] Zhang, Q. and Benveniste, A. Wavelets networks. IEEE Transactions on
Neural Networks, 3(6):889–898, 1992.
Mixture of Experts Networks
[1] Bishop, C.M. Neural Network for Pattern Recognition. 124. Oxford, 1997.
[2] Cheng, C. A neural network approach for the analysis of control charts
patterns. International J. of Production Research, 35:667–697, 1997.
[3] Ciampi, A. and Lechevalier, Y. Statistical models as building blocks of
neural networks. Communications in Statistics  Theory and Methods,
26(4):991–1009, 1997.
[4] Desai, V.S., Crook, J.N., and Overstreet, G.A. A comparison of neural
networks and linear scoring models in the credit union environmental. Eu
ropean Journal of Operational Research, 95:24–37, 1996.
[5] Jordan, M.I. and Jacobs, R.A. Hierarchical mixtures of experts and the em
algorithm. Neural Computation, 6:181–214, 1994.
[6] Mehnotra, K., Mohan, C.K., and Ranka, S. Elements of Artiﬁcial Neural
Networks. MIT Press, 2000.
[7] OhnoMachodo, L., Walker, M.G., and Musen, M.A. Hierarchical neural
networks for survival analysis. In The Eighth Worlds Congress on Medical
Information. Vancouver, B.C., Canada, 1995.
Neural Network Models & Statistical Models Interface: Com
ments on the Literature
[1] Arminger, .G. and Enache, D. Statistical models and artiﬁcial neural net
works. In Data Analysis and Information Systems (Eds. H.H. Bock and
W. Polasek), pages 243–260. Springer Verlag, 1996.
CHAPTER 6. SURVIVAL ANALYSIS AND OTHERS NETWORKS 173
[2] Balakrishnan, P.V.S., Cooper, M.C., Jacob, M., and Lewis, P.A. A study of
the classiﬁcation capabilities of neural networks using unsupervised learn
ing: a comparision with kmeans clustering. Psychometrika, (59):509–525,
1994.
[3] Bishop, C.M. Neural Network for Pattern Recognition. 124. Oxford, 1997.
[4] Cheng, B. and Titterington, D.M. Neural networks: A review from a sta
tistical perspective (with discussion). Statistical Sciences, 9:2–54, 1994.
[5] Chryssolouriz, G., Lee, M., and Ramsey, A. Conﬁdence interval predic
tions for neural networks models. IEEE Transactions on Neural Networks,
7:229–232, 1996.
[6] De Veaux, R.D., Schweinsberg, J., Schumi, J., and Ungar, L.H. Predic
tion intervals for neural networks via nonlinear regression. Technometrics,
40(4):273–282, 1998.
[7] Ennis, M., Hinton, G., Naylor, D., Revow, M., and Tibshirani, R. A com
parison of statistical learning methods on the GUSTO database. Statistics
In Medicine, 17:2501–2508, 1998.
[8] Hwang, J.T.G. and Ding, A.A. Prediction interval for artiﬁcial neural net
works. Journal of the American Statistical Association, 92:748–757, 1997.
[9] Jordan, M.I. Why the logistic function? a tutorial discussion on proba
bilities and neural networks. Report 9503, MIT Computational Cognitive
Science, 1995.
[10] Kuan, C.M. and White, H. Artiﬁcial neural networks: An econometric
perspective. Econometric Reviews, 13:1–91, 1994.
[11] Lee, H.K.H. Model Selection and Model Averaging for Neural Network.
Ph.d thesis., Cornegie Mellon University, Depatament of Statistics, 1998.
[12] Lee, H.K.H. Bayesian Nonparametrics via Neural Networks. SIAM: Soci
ety for Industrial and Applied Mathematics, 2004.
[13] Lee, T.H., White, H., and Granger, C.W.J. Testing for negleted nonlinearity
in time series models. Journal of Econometrics, 56:269–290, 1993.
[14] Lowe, D. Statistics and neural networks advances in the interface. In Radial
basis function networks and statistics (Eds. J.W. Kay and D.M.T. Hering
ton), pages 65–95. Oxford University Press, 1999.
CHAPTER 6. SURVIVAL ANALYSIS AND OTHERS NETWORKS 174
[15] Mackay, D.J.C. Bayesian Methods for Adaptive Methods. Ph.d thesis.,
California Institute of Technology, 1992.
[16] Mangiameli, P., Chen, S.K., and West, D. A comparison of SOM neural
networks and hierarchical clustering methods. European Journal of Oper
ations Research, (93):402–417, 1996.
[17] Martin, E.B. and Morris, A.J. Statistics and neural networks. In Artiﬁ
cial neural networks and multivariate statistics (Eds. J.W. Kay and D.M.
Titterington), pages 195–258. Oxford University Press, 1999.
[18] Mingoti, S.A. and Lima, J.O. Comparing som neural network with fuzzy
cmeans, kmeans and traditional hierarchical clustering algorithms. Euro
pean Journal of Operational Research, 174(3):1742–1759, 2006.
[19] Murtagh, F. Neural networks and related “massively parallel” methods for
statistics: A short overview. International Statistical Review, 62(3):275–
288, october 1994.
[20] Neal, R.M. Bayesian Learning for Neural Networks. NY Springer, 1996.
[21] Paige, R.L. and Butler, R.W. Bayesian inference in neural networks.
Biometrika, 88(3):623–641, 2001.
[22] Poli, I. and Jones, R.D. A neural net model for prediction. Journal of the
American Statistical Association, 89(425):117–121, 1994.
[23] RiosInsua, D. and Muller, P. Feedfoward neural networks for nonparamet
ric regression. In Pratical Nonparametric and Semiparametric Bayesian
Statistics (Eds. D. Dey, P. Mueller, and D. Sinha), pages 181–194. NY
Springer, 1996.
[24] Ripley, B.D. Pattern Recognition and Neural Networks. Cambrige Univer
sity Press, 1996.
[25] Sarle, W.S. Neural network and statistical jargon. (ftp://ftp.sas.com).
[26] Sarle, W.S. Neural networks and statistical models. In Proceedings of
the Nineteenth Annual SAS Users Group International Conference. 1994.
(http//citeseer.nj.nec.com/sarle94neural.html).
[27] Schwarzer, G., Vach, W., and Schumacher, M. On the misuses of artiﬁcial
neural networks for prognostic and diagnostic classiﬁcation in oncology.
Statistics in Medicine, 19:541–561, 2000.
CHAPTER 6. SURVIVAL ANALYSIS AND OTHERS NETWORKS 175
[28] Smith, M. Neural Network for Statistical Modelling. Van Nostrond Rein
hold, 1993.
[29] Stern, H.S. Neural network in applied statistics (with discussion). Techno
metrics, 38:205–220, 1996.
[30] Swanson, N.R. and White, H. A modelselection approach to assessing the
information in the term structure using linear models and artiﬁcial neural
networks. Journal of Business & Economic Statistics, 13(3):265–75, 1995.
[31] Swanson, N.R. and White, H. Forecasting economic time series using ﬂexi
ble versus ﬁxed speciﬁcation and linear versus nonlinear econometric mod
els. International Journal of Forecasting, 13(4):439–461, 1997.
[32] Tibshirani, R. A comparison of some error estimates for neural networks
models. Neural Computation, 8:152–163, 1996.
[33] Titterington, D.M. Bayesian methods for neural networks and related mod
els. Statistical Science, 19(1):128–139, 2004.
[34] Tu, J.V. Advantages and disadvantages of using artiﬁcal neural networks
versus logit regression for predicting medical outcomes. Journal of Clinical
Epidemiology, 49:1225–1231, 1996.
[35] Vach, W., Robner, R., and Schumacher, M. Neural networks and logistic
regression. part ii. Computational Statistics and Data Analysis, 21:683–
701, 1996.
[36] Warner, B. and Misra, M. Understanding neural networks as statistical
tools. American Statistician, 50:284–293, 1996.
[37] White, H. Neuralnetwork learning and statistics. AI Expert, 4(12):48–52,
1989.
[38] White, H. Parametric statistical estimation using artiﬁcial neural networks.
In Mathematical Perspectives on Neural Networks (Eds. P. Smolensky,
M.C. Mozer, and D.E. Rumelhart), pages 719–775. L Erlbaum Associates,
1996.
[39] White, H. Learning in artiﬁcial neural networks: A statistical perspective.
Neural Computation, 1(4):425–464, 1989.
[40] White, H. Some asymptotic results for learning in single hiddenlayer feed
forward network models. Journal of the American Statistical Association,
84(408):1003–1013, 1989.
CHAPTER 6. SURVIVAL ANALYSIS AND OTHERS NETWORKS 176
[41] Zhang, G.P. Avoiding pitfalls in neural network research. IEEE Trans
actions on Systems, Man, and Cybernetics  Part C: Applications and Re
views, 37(1):3–16, 2007.
Multivariate Statistics Neural Network Models
Cluster and Scaling Networks
[1] Balakrishnan, P.V.S., Cooper, M.C., Jacob, M., and Lewis, P.A. A study of
the classiﬁcation capabilities of neural networks using unsupervised learn
ing: a comparision with kmeans clustering. Psychometrika, (59):509–525,
1994.
[2] Braga, A.P., Ludemir, T., and Carvalho, . Redes Neurais Artiﬁcias. LTC
Editora, Rio de Janeiro, 2000.
[3] Carvalho, J., Pereira, B.B., and Rao, C.R. Combining unsupervised and su
pervised neural networks analysis in gamma ray burst patterns. Techinical
report center of multivariate analysis, Departament of Statistics, Penn State
University, 2006.
[4] Chiu, C., Yeh, S.J., and Chen, C.H. Selforganizing arterial pressure pulse
classiﬁcation using neural networks: theoretical considerations and clinical
applicability. Computers in Biology and Medicine, (30):71–78, 2000.
[5] Draghici, S. and Potter, R.B. Predicting hiv drug resistance with neural
networks. Bioinformatics, (19):98–107, 2003.
[6] Fausett, L. Fundamentals of Neural Networks Architectures, Algorithms
and Applications. Prentice Hall, 1994.
[7] Hsu, A.L., Tang, S.L., and Halgamuge, S.K. An unsupervised hierarchi
cal dynamic selforganizing approach to cancer class discovery and marker
gene identiﬁcation in microarray data. Bioinformatics, 19:2131–2140,
2003.
[8] Ishihara, S., Ishihara, K., Nagamashi, M., and Matsubara, Y. An automatic
builder for a kansei engineering expert systems using a selforganizing neu
ral networks. Intenational Journal of Industrial Ergonomics, (15):13–24,
1995.
[9] Kartalopulos, S.V. Understanding Neural Networks and Fuzzy Logic. Basic
Concepts and Applications. IEEE Press, 1996.
CHAPTER 6. SURVIVAL ANALYSIS AND OTHERS NETWORKS 177
[10] Lourenc¸o, P.M. Um Modelo de Previs˜ ao de Curto Prazo de Carga El´ etrica
Combinando M´ etodos Estat´ısticos e Inteligˆ encia Computacional. D. sc. in
eletrical engineering, Pontif´ıcia Universidade Cat ´ olica do Rio de Janeiro,
1998.
[11] Maksimovic, R. and Popovic, M. Classiﬁcation of tetraplegies through
automatic movement evaluation. Medical Engineering and Physics,
(21):313–327, 1999.
[12] Mangiameli, P., Chen, S.K., and West, D. A comparison of SOM neural
networks and hierarchical clustering methods. European Journal of Oper
ations Research, (93):402–417, 1996.
[13] Mehrotra, K., Mohan, C.K., and Ranka, S. Elements of Artiﬁcial Neural
Networks. MIT Press, 2000.
[14] Patazopoulos, D., P, P.K., Pauliakis, A., Iokimliossi, A., and Dimopoulos,
M.A. Static cytometry and neural networks in the discrimination of lower
urinary system lesions. Urology, (6):946–950, 1988.
[15] Rozenthal, M. Esquizofrenia: Correlac¸ ˜ ao Entre Achados Neuropsi
col´ ogicos e Sintomatologia Atrav´ es de Redes Neuronais Artiﬁciais. D.Sc.
in Psychiatry, UFRJ, 2001.
[16] Rozenthal, M., Engelhart, E., and Carvalho, L.A.V. Adaptive resonance
theory in the search for neuropsychological patterns of schizophrenia. Te
chinical Report in Systems and Computer Sciences, COPPE  UFRJ, May
1998.
[17] Santos, A.M. Redes neurais e
´
Arvores de Classiﬁcac¸ ˜ ao Aplicados ao Di
agn´ ostico da Tuberculose Pulmonar Paucibacilar. D. Sc. in Production
Engineering, COPPE  UFRJ, 2003.
[18] Vuckovic, A., Radivojevic, V., Chen, A.C.N., and Popovic, D. Automatic
recognition of aletness and drawsiness from eec by an artiﬁcial neural net
works. Medical Engineering and Physics, (24):349–360, 2002.
Dimensional Reduction Networks
[1] Cha, S.M. and Chan, L.E. Applying independent component analysis to
factor model in ﬁnance. In Intelligent Data Engineering and Automated
Learning, IDEAL 2000, Data Mining, Finacncial Enginerring and In
teligent Agents (Eds. K. Leung, L.W. Chan, and H. Meng), pages 538–544.
Springer, Bruges, april 2002.
CHAPTER 6. SURVIVAL ANALYSIS AND OTHERS NETWORKS 178
[2] Chan, L.W. and Cha, S.M. Selection of independent factor model in ﬁ
nance. In Proceeding of the 3th International Conference on Independent
Component Analyses (Eds. K. Leung, L.W. Chan, and H. Meng), pages
161–166. december 2001.
[3] Clausen, S.E. Applied correspondence analysis: An introduction. SAGE,
Quantitative Applications in the Social Sciences, (121), 1998.
[4] Comon, P. Independent component analysis, a new concept? Signal Pro
cessing, 3:287–314, 1994.
[5] CorreiaPerpinan, M.A. A review of dimensional reduction techniques.
Techinical report cs9609, University of Shefﬁeld, 1997.
[6] Darmois, G. Analise g´ en´ erale des liasons stochastiques. International Sta
tistical Review, 21:2–8, 1953.
[7] Delichere, M. and Memmi, D. Neural dimensionality reduction for docu
ment processing. In European Sympsium on Artiﬁcial Neural Networks 
ESANN. april 2002a.
[8] Delichere, M. and Memmi, D. Analyse factorielle neuronalle pour docu
ments textuels. In Traiment Automatique des Languages Naturelles  TALN.
june 2002b.
[9] Diamantaras, K.I. and Kung, S.Y. Principal Components Neural Networks.
Theory and Applications. Wiley, 1996.
[10] Draghich, S., Graziano, F., Kettoola, S., Sethi, I., and Towﬁc, G. Min
ing HIV dynamics using independent component analysis. Bioinformatics,
19:931–986, 2003.
[11] Duda, R.O., Hart, P.E., and Stock, D.G. Pattern Classiﬁcation. Wiley, 2nd
ed., 2001.
[12] Dunteman, G.H. Principal component analysis. SAGE,Quantitative Appli
cations in the Social Sciences, (69), 1989.
[13] Fiori, S. Overview of independent component analysis technique with ap
plication to synthetic aperture radar (SAR) imagery processing. Neural
Networks, 16:453–467, 2003.
[14] Giannakopoulos, X., Karhunen, J., and Oja, E. An experimental compar
ision of neural algorithms for independent component analysis and blind
separation. International Journal of Neural Systems, 9:99–114, 1999.
CHAPTER 6. SURVIVAL ANALYSIS AND OTHERS NETWORKS 179
[15] Girolami, M. SelfOrganizing Neural Networks: Independent Component
Analysis and Blind Source Separation. Springer, 1999.
[16] Gnandesikan, R. Methods of Statistical Data Analysis of Multivariate Ob
servations. Wiley, 2nd ed., 1991.
[17] Hertz, J., Krogh, A., and Palmer, R.G. Introduction to the Theory of Neural
Computation. AddisonWestey, 1991.
[18] Hyvarinen, A. Survey on independent component analysis. Neural Com
puting Surveys, 2:94–128, 1999.
[19] Hyvarinen, A. and Oja, E. Independent component analysis: algorithm and
applications. Neural Networks, 13:411–430, 2000.
[20] Hyv¨ arinen, A., Karhunen, J., and Oja, E. Independent Component Analysis.
Wiley, 2001.
[21] Jackson, J.E. A User´s Guide to Principal Components. Wiley, 1991.
[22] Jutten, C. and Taleb, A. Source separation: from dust till down. In Proced
ings ICA 2000. 2000.
[23] Kagan, A.M., Linnik, Y., and Rao, C.R. Characterization Problems in
Mathematical Statistics. Wiley, 1973.
[24] Karhunen, J., Oja, E., Wang, L., and Joutsensalo, J. A class of neural net
works for independent component analysis. IEEE Transactions on Neural
Networks, 8:486–504, 1997.
[25] Kin, J.O. and Mueller, C.W. Introduction to factor analysis. SAGE, Quan
titative Applications in the Social Sciences, (13), 1978.
[26] Lebart, L. Correspondence analysis and neural networks. In IV Internation
Meeting in Multidimensional Data Analysis  NGUS97, pages 47–58. K.
Fernandes Aquirre (Ed), 1997.
[27] Lee, T.W. Independent Component Analysis Theory and Applications.
Kluwer Academic Publishers, 1998.
[28] Lee, T.W., Girolami, M., and Sejnowski, T.J. Independent component anal
ysis using an extended informax algorithm for mixed subgaussian and su
pergaussian sources. Neural Computation, 11:417–441, 1999.
[29] Mehrotra, K., Mohan, C.K., and Ranka, S. Elements of Artiﬁcial Neural
Networks. MIT Press, 2001.
CHAPTER 6. SURVIVAL ANALYSIS AND OTHERS NETWORKS 180
[30] Nicole, S. Feedfoward neural networks for principal components extrac
tion. Computational Statistics and Data Analysis, 33:425–437, 2000.
[31] Pielou, E.C. The Interpretation of Ecological Data: A primer on Classiﬁ
cation and Ordination. Willey, 1984.
[32] Ripley, B.D. Pattern Recognition and Neural Networks. Cambrige Univer
sity Press, 1996.
[33] Rowe, D.B. Multivariate Bayesian Statistics Models for Source Separation
and Signal Unmixing. Chapman & Hall/CRC, 2003.
[34] Stone, J.V. and Porril, J. Independent component analysis and projection
pursuit: a tutorial introduction. Techinical Report, Psychology Departa
ment, Shefﬁeld University, 1998.
[35] Sun, K.T. and Lai, Y.S. Applying neural netorks technologies to factor
analysis. In Procedings of the National Science Council, ROD(D), vol. 12,
pages 19–30. K. Fernandes Aquirre (Ed), 2002.
[36] Tura, B.R. Aplicac¸ ˜ ao do Data Mining em Medicina. M. sc. thesis, Facul
dade de Medicina, Universidade Federal do Rio de Janeiro, 2001.
[37] Vatentin, J.L. Ecologia Num´ erica: Uma Introduc¸ ˜ ao ` a An´ alise Multivariada
de Dados Ecol´ ogicos. Editora Interciˆ encia, 2000.
[38] Weller, S. and Rommey, A.K. Metric scalling: Correspondence analysis.
SAGE Quantitative Applications in the Social Sciences, (75), 1990.
Classiﬁcation Networks
[1] Arminger, G., Enache, D., and Bonne, T. Analyzing credit risk data: a com
parision of logistic discrimination, classiﬁcation, tree analysis and feed
foward networks. Computational Statistics, 12:293–310, 1997.
[2] Asparoukhov, O.K. and Krzanowski, W.J. A comparison of discriminant
procedures for binary variables. Computational Statistics and Data Analy
sis, 38:139–160, 2001.
[3] Bose, S. Multilayer statistical classiﬁers. Computational Statistics and
Data Analysis, 42:685–701, 2003.
[4] Bounds, D.G., Lloyd, P.J., and Mathew, B.G. A comparision of neural
network and other pattern recognition aproaches to the diagnosis of low
back disorders. Neural Networks, 3:583–591, 1990.
CHAPTER 6. SURVIVAL ANALYSIS AND OTHERS NETWORKS 181
[5] Desai, V.S., Crook, J.N., and Overstreet, G.A. A comparision of neural net
works and linear scoring models in the credit union environment. European
Journal of Operational Research, 95:24–37, 1996.
[6] Fisher, R.A. The use of multiples measurements in taxonomic problems.
Annals of Eugenics, 7:179–188, 1936.
[7] Lippmann, R.P. Neural nets for computing acoustic speech and signal pro
cessing. In ICASSP, vol. 1, pages 1–6. 1988.
[8] Lippmann, R.P. Pattern classiﬁcation using neural networks. IEEE Comu
nications Magazine, 11:47–64, 1989.
[9] Lippmann, R.P. A critical overview of neural networks pattern classiﬁers.
Neural Networks for Signal Processing, 30:266–275, 1991.
[10] Markham, I.S. and Ragsdale, C.T. Combining neural networks and statis
tical predictors to solve the classiﬁcation problem in discriminant analysis.
Decision Sciences, 26:229–242, 1995.
[11] Randys, S. Statistical and Neural Classiﬁers. Springer, 1949.
[12] Rao, C.R. On some problems arising out of discrimination with multiple
characters. Sankya, 9:343–365, 1949.
[13] Ripley, B.D. Neural networks and related methods for classiﬁcation (with
discussion). Journal of the Royal Statistical Society B, 56:409–456, 1994.
[14] Santos, A.M. Redes Neurais e
´
Arvores de Classiﬁcac¸ ˜ ao Aplicadas ao Di
agn´ ostico de Tuberculose Pulmonar. Tese D.Sc. Engenharia de Produc¸ ˜ ao,
UFRJ, 2003.
[15] Santos, A.M., Seixas, J.M., Pereira, B.B., Medronho, R.A., Campos, M.R.,
and Cal ˆ oba, L.P. Usando redes neurais artiﬁciais e regress˜ ao log´ıstica na
predic¸ ˜ ao de hepatite A. Revista Brasileira de Epidemologia, (8):117–126,
2005.
[16] Santos, A.M., Seixas, J.M., Pereira, B.B., Mello, F.C.Q., and Kritski, A.
Neural networks: an application for predicting smear negative pulmonary
tuberculosis. In Advances in Statistical Methods for the Health Sciences:
Applications to Cancer and AIDS Studies, Genome Sequence and Survival
Analysis (Eds. N. Balakrishman, J.H. Auget, M. Mesbab, and G. Molen
berghs), pages 279–292. Springer, 2006.
[17] West, D. Neural networks credit scoring models. Computers and Opera
tions Research, 27:1131–1152, 2000.
CHAPTER 6. SURVIVAL ANALYSIS AND OTHERS NETWORKS 182
Regression Neural Network Models
Generalized Linear Model Networks
[1] Nelder, J.A. and McCullagh, P. Generalized Linear Model. CRC/Chapman
and Hall, 2 nd ed., 1989.
[2] Sarle, W.S. Neural networks and statistical models. In Proceedings of
the Nineteenth Annual SAS Users Group International Conference. 1994.
(http//citeseer.nj.nec.com/sarle94neural.html).
[3] Warner, B. and Misra, M. Understanding neural networks as statistical
tools. American Statistician, 50:284–293, 1996.
Logistic Regression Networks
[1] Arminger, G., Enache, D., and Bonne, T. Analyzing credit risk data: a com
parision of logistic discrimination, classiﬁcation, tree analysis and feed
foward networks. Computational Statistics, 12:293–310, 1997.
[2] Boll, D., Geomini, P.A.M.J., Br¨ olmann, H.A.M., Sijmons, E.A., Heintz,
P.M., and Mol, B.W.J. The preoperative assessment of the adnexall mass:
The accuracy of clinical estimates versus clinical prediction rules. Interna
tional Journal of Obstetrics and Gynecology, 110:519–523, 2003.
[3] Brockett, P.H., Cooper, W.W., Golden, L.L., and Xia, X. A case study in
applying neural networks to predicting insolvency for property and causal
ity insurers. Journal of the Operational Research Society, 48:1153–1162,
1997.
[4] Carvalho, M.C.M., Dougherty, M.S., Fowkes, A.S., and Wardman, M.R.
Forecasting travel demand: A comparison of logit and artiﬁcial neural net
work methods. Journal of the Operational Research Society, 49:717–722,
1998.
[5] Ciampi, A. and Zhang, F. A new approach to training backpropagation ar
tiﬁcial neural networks: Empirical evaluation on ten data sets from clinical
studies. Statistics in Medicine, 21:1309–1330, 2002.
[6] Cooper, J.C.B. Artiﬁcial neural networks versus multivariate statistics: An
application from economics. Journal of Applied Statistics, 26:909–921,
1999.
CHAPTER 6. SURVIVAL ANALYSIS AND OTHERS NETWORKS 183
[7] Faraggi, D. and Simon, R. The maximum likelihood neural network as a
statistical classiﬁcation model. Journal of Statistical Planning and Infer
ence, 46:93–104, 1995.
[8] Fletcher, D. and Goss, E. Forecasting with neural networks: An application
using bankruptcy data. Information and Management, 24:159–167, 1993.
[9] Hammad, T.A., AbdelWahab, M.F., DeClaris, N., ElSahly, A., ElKady,
N., and Strickland, G.T. Comparative evaluation of the use of neural net
works for modeling the epidemiology of schistosomosis mansoni. Trans
action of the Royal Society of Tropical Medicine and Hygiene, 90:372–376,
1996.
[10] Lenard, M.J., Alam, P., and Madey, G.R. The application of neural net
works and a qualitative response model to the auditor’s going concern un
certainty decision. Decision Sciences, 26:209–226, 1995.
[11] Nguyen, T., Malley, R., Inkelis, S.K., and Kupperman, N. Comparison of
prediction models for adverse outcome in pediatric meningococcal disease
using artiﬁcial neural network and logistic regression analysis. Journal of
Clinical Epidemiology, 55:687–695, 2002.
[12] Ottenbacher, K.J., Smith, P.M., Illig, S.B., Limm, E.T., Fiedler, R.C., and
Granger, C.V. Comparison of logistic regression and neural networks to
predict rehospitalization in patients with stroke. Journal of Clinical Epi
demiology, 54:1159–1165, 2001.
[13] Schumacher, M., Roßner, R., and Vach, W. Neural networks and logistic
regression. part i. Computational Statistics and Data Analysis, 21:661–682,
1996.
[14] Schwarzer, G., Vach, W., and Schumacher, M. On the misuses of artiﬁcial
neural networks for prognostic and diagnostic classiﬁcation in oncology.
Statistics in Medicine, 19:541–561, 2000.
[15] Tu, J.V. Advantages and disadvantages of using artiﬁcal neural networks
versus logit regression for predicting medical outcomes. Journal of Clinical
Epidemiology, 49:1225–1231, 1996.
[16] Vach, W., Robner, R., and Schumacher, M. Neural networks and logistic
regression. part ii. Computational Statistics and Data Analysis, 21:683–
701, 1996.
CHAPTER 6. SURVIVAL ANALYSIS AND OTHERS NETWORKS 184
Regression Networks
[1] Balkin, S.D.J. and Lim, D.K.J. A neural network approach to response
surface methodology. Communications in Statistics  Theory and Methods,
29(9/10):2215–2227, 2000.
[2] Church, K.B. and Curram, S.P. Forecasting consumer’s expenditure: A
comparison between econometric and neural networks models. Interna
tional Journal of Forecasting, 12:255–267, 1996.
[3] Gorr, W.L., Nagin, D., and Szczypula, J. Comparative study of artiﬁcial
neural networks and statistical models for predicting student grade point
average. International Journal of Forecasting, 10:17–34, 1994.
[4] Laper, R.J.A., Dalton, K.J., Prager, R.W., Forsstrom, J.J., Selbmann, H.K.,
and Derom, R. Application of neural networks to the ranking of perinatal
variables inﬂuencing birthweight. Scandinavian Journal of Clinical and
Laboratory Investigation, 55 (suppl 22):83–93, 1995.
[5] Qi, M. and Maddala, G.S. Economic factors and the stock market: A new
perspective. Journal of Forecasting, 18:151–166, 1999.
[6] Silva, A.B.M., Portugal, M.S., and Cechim, A.L. Redes neurais artiﬁci
ais e analise de sensibilidade: Uma aplicac¸ ˜ ao ` a demanda de importac¸ ˜ oes
brasileiras. Economia Aplicada, 5:645–693, 2001.
[7] Tkacz, G. Neural network forecasting of canadian gdp growth. Interna
tional Journal of Forecasting, 17:57–69, 2002.
Nonparametric Regression and Classiﬁcation Networks
[1] Ciampi, A. and Lechevalier, Y. Statistical models as building blocks
of neural networks. Communications in Statistics Theory and Methods,
26(4):991–1009, 1997.
[2] Fox, J. Nonparametric simple regression. Quantitative Applications in the
Social Sciences, 130, 2000a.
[3] Fox, J. Multiple and generalized nonparametric regression. Quantitative
Applications in the Social Sciences, 131, 2000b.
[4] Kim, S.H. and Chun, S.H. Grading forecasting using an array of bipolar
predictions: Application of probabilistic networks to a stock market index.
International Journal of Forecasting, 14:323–337, 1998.
CHAPTER 6. SURVIVAL ANALYSIS AND OTHERS NETWORKS 185
[5] Peng, F., Jacobs, R.A., and Tanner, M.A. Bayesian inference in mixture
ofexperts and hierarchical mixturesofexperts models with an application
to speech recognition. Journal of the American Statistical Association,
91:953–960, 1996.
[6] Picton, P. Neural Networks. PARLGRAVE Ed., 2 nd ed., 2000.
[7] Ripley, B.D. Pattern Recognition and Neural Networks. Cambridge Uni
versity Press, 1996.
[8] Specht, D.F. A general regression neural networks. IEEE Transactions on
Neural Networks, 2(6):568–576, 1991.
[9] Specht, D.F. Probabilistic neural networks for classiﬁcation, mapping, or
associative memories. In Proceedings International Conference on Neural
Networks, vol. 1, pages 525–532. 1998.
Survival Analysis, Time Series and Control Chart Neu
ral Network Models and Inference
Survival Analysis Networks
[1] Bakker, B., Heskes, T., and Kappen, B. Improving lox survival analysis
with a neural bayesian approach in medicine, 2003. (in press).
[2] Biganzoli, E., Baracchi, P., and Marubini, E. Feedforward neural networks
for the analysis of censored survival data: A partial logistic regression ap
proach. Statistics in Medicine, 17:1169–1186, 1998.
[3] Biganzoli, E., Baracchi, P., and Marubini, E. A general framework for neu
ral network models on censored survival data. Neural Networks, 15:209–
218, 2002.
[4] Ciampi, A. and EtezadiAmoli, J. A general model for testing the propor
tional harzards and the accelerated failure time hypothesis in the analysis
of censored survival data with covariates. Communications in Statistics A,
14:651–667, 1985.
[5] Faraggi, D., LeBlanc, M., and Crowley, J. Understanding neural networks
using regression trees: An application to multiply myeloma survival data.
Statistics in Medicine, 20:2965–2976, 2001.
CHAPTER 6. SURVIVAL ANALYSIS AND OTHERS NETWORKS 186
[6] Faraggi, D. and Simon, R. A neural network model for survival data. Statis
tics in Medicine, 14:73–82, 1995a.
[7] Faraggi, D. and Simon, R. The maximum likelihood neural network as a
statistical classiﬁcation model. Journal of Statistical Planning and Infer
ence, 46:93–104, 1995b.
[8] Faraggi, D., Simon, R., Yaskily, E., and Kramar, A. Bayesian neural net
work models for censored data. Biometrical Journal, 39:519–532, 1997.
[9] Groves, D.J., Smye, S.W., Kinsey, S.E., Richards, S.M., Chessells, J.M.,
Eden, O.B., and Basley, C.C. A comparison of cox regression and neural
networks for risk stratiﬁcation in cases of acute lymphoblastic leukemia in
children. Neural Computing and Applications, 8:257–264, 1999.
[10] Kalbﬂeish, J.D. and Prentice, R.L. The Statistical Analysis of Failure Data.
Wisley, 2 nd ed., 2002.
[11] Lapuerta, P., Azen, S.P., and LaBree, L. Use of neural networks in pre
dicting the risk of coronary arthery disease. Computer and Biomedical
Research, 28:38–52, 1995.
[12] Liestol, K., Andersen, P.K., and Andersen, U. Survival analysis and neural
nets. Statistics in Medicine, 13:1189–1200, 1994.
[13] LouzadaNeto, F. Extended hazard regression model for rehability and
survival analysis. Lifetime Data Analysis, 3:367–381, 1997.
[14] LouzadaNeto, F. Modeling life data: A graphical approach. Applied
Stochastic Models in Business and Industry, 15:123–129, 1999.
[15] LouzadaNeto, F. and Pereira, B.B. Modelos em an´ alise de sobrevivˆ encia.
Cadernos Sa´ ude Coletiva, 8(1):9–26, 2000.
[16] Mariani, L., Coradin, D., Bigangoli, E., Boracchi, P., Marubini, E., Pillot,
S., Salvadori, R., Verones, V., Zucali, R., and Rilke, F. Prognostic factors
for metachronous contralateral breast cancer: a comparision of the linear
Cox regression model and its artiﬁcial neural networks extension. Breast
Cancer Research and Threament, 44:167–178, 1997.
[17] OhnoMachado, L. A comparison of cox proportional hazard and artiﬁ
cial network models for medical prognoses. Computers in Biology and
Medicine, 27:55–65, 1997.
CHAPTER 6. SURVIVAL ANALYSIS AND OTHERS NETWORKS 187
[18] OhnoMachado, L., Walker, M.G., and Musen, M.A. Hierarchical neural
networks for survival analysis. In The Eighth World Congress on Medical
Information. Vancouver, BC, Canada, 1995.
[19] Ripley, B.D. and Ripley, R.M. Artiﬁcial neural networks: Prospects for
medicine. In Neural networks as statistical methods in survival analysis
(Eds. R. Dybowski and C. Gant). Landes Biosciences Publishing, 1998.
[20] Ripley, R.M. Neural network models for breast cancer prognoses. D. Phill.
Thesis. Department of Engineering Science, University of Oxford, 1998.
(www.stats.ox.ac.uk/ ruth/index.html).
[21] Xiang, A., Lapuerta, P., Ryutov, Buckley, J., and Azen, S. Comparison of
the performance of neural networks methods and cox regression for cen
sored survival data. Computational Statistics and Data Analysis, 34:243–
257, 2000.
Time Series Forecasting Networks
[1] Balkin, S.D. and Ord, J.K. Automatic neural network modeling for univari
ate time series. International Journal of Forecasting, 16:509–515, 2000.
[2] Caloba, G.M., Caloba, L.P., and Saliby, E. Cooperac¸ ˜ ao entre redes neurais
artiﬁciais e t´ ecnicas classicas para previs¯ ao de demanda de uma s´ erie de
cerveja na australia. Pesquisa Operacional, 22:347–358, 2002.
[3] Chakraborthy, K., Mehrotra, K., Mohan, C.K., and Ranka, S. Forecast
ing the behavior of multivariate time series using neural networks. Neural
Networks, 5:961–970, 1992.
[4] Cramer, H. On the representation of functions by certain fourier integrals.
Transactions of the American Math. Society, 46:191–201, 1939.
[5] Donaldson, R.G. and Kamstra, M. Forecasting combining with neural net
works. Journal of Forecasting, 15:49–61, 1996.
[6] Faraway, J. and Chatﬁeld, C. Time series forecasting with neural networks:
A comparative study using the airline data. Applied Statistics, 47:231–250,
1998.
[7] Ghiassi, M., Saidane, H., and Zimbra, D.K. A dynamic artiﬁcial neural
network for forecasting series events. International Journal of Forecasting,
21:341–362, 2005.
CHAPTER 6. SURVIVAL ANALYSIS AND OTHERS NETWORKS 188
[8] Ghiassi, M., Saidane, H., and Zimbra, D.K. A dynamic artiﬁcial neural
network model for forecasting series events. International Journal of Fore
casting, 21:341–362, 2005.
[9] Hill, T., Marquez, L., O’Connor, M., and Remus, W. Artiﬁcial neural net
work models for forecasting and decision making. International Journal of
Forecasting, 10:5–15, 1994.
[10] Kajitomi, Y., Hipel, K.W., and Mclead, A.I. Forecasting nonlinear time
series with feedfoward neural networks: a case of study of canadian lynx
data. Journal of Forecasting, 24:105–117, 2005.
[11] Kajitoni, Y., Hipel, K.W., and Mclead, A.I. Forecasting nonlinear time
series with feedfoward neural networks: a case study of canadian lynx
data. Journal of Forecasting, 24:105–117, 2005.
[12] Lai, T.L. and Wong, S.P.S. Stochastic neural networks with applications
to nonlinear time series. Journal of the American Statistical Association,
96:968–981, 2001.
[13] Li, X., Ang, C.L., and Gray, R. Stochastic neural networks with applica
tions to nonlinear time series. Journal of Forecasting, 18:181–204, 1999.
[14] Lourenco, P.M. Um Modelo de Previs¯ ao de Curto Prazo de Carga El´ etrica
Combinando M´ etodos de Estat´ıstica e Inteligˆ encia Computacional. D.Sc
Thesis, PUCRJ, 1998.
[15] Machado, M.A.S., Souza, R.C., Caldeva, A.M., and Braga, M.J.F. Previs¯ ao
de energia el´ etrica usando redes neurais nebulosas. Pesquisa Naval, 15:99–
108, 2002.
[16] Park, Y.R., Murray, T.J., and Chen, C. Prediction sun spots using a layered
perception IEEE Transactions on Neural Networks, 7(2):501–505, 1994.
[17] Pereira, B.B. Series temporais multivariadas. Lectures Notes of SINAPE,
15:99–108, 1984.
[18] Pino, R., de la Fuente, D., Parreno, J., and Pviore, P. Aplicacion de redes
neuronales artiﬁciales a la prevision de series temporales no estacionarias
a no invertibles. Q¨ uestiio, 26:461–482, 2002.
[19] Poli, I. and Jones, R.D. A neural net model for prediction. Journal of the
American Statistical Association, 89(425):117–121, 1994.
CHAPTER 6. SURVIVAL ANALYSIS AND OTHERS NETWORKS 189
[20] Portugal, M.S. Neural networks versus true series methods: A forecasting
exercise. Revista Brasileira de Economia, 49:611–629, 1995.
[21] Raposo, C.M. Redes Neurais na Previs˜ ao em S´ eries Temporais. D.Sc
Thesis, COPPE/UFRJ, 1992.
[22] Shamseldin, D.Y. and O’Connor, K.M. A nonlinear neural network tech
nique for updating of river ﬂow forecasts. Hydrology and Earth System
Sciences, 5:577–597, 2001.
[23] Stern, H.S. Neural network in applied statistics (with discussion). Techno
metrics, 38:205–220, 1996.
[24] Stern, H.S. Neural network in applied statistics (with discussion). Techno
metrics, 38:205–220, 1996.
[25] Terasvirta, T., Dijk, D.V., and Medeiros, M.C. Linear models, smooth
transistions autoregressions and neural networks for forecasting macroeco
nomic time series: a reexamination. International Journal of Forecasting,
21:755–774, 2005.
[26] Terasvirta, T., Dijk, D.V., and Medeiros, M.C. Linear models, smooth tran
sitions autoregressions and neural networks for forecasting macroeconomic
time series: a reexamination. Internation Journal of Forecasting, 21:755–
774, 2005.
[27] Tseng, F.M., Yu, H.C., and Tzeng, G.H. Combining neural network model
with seasonal time series. ARIMA model Technological Forecasting and
S... Change, 69:71–87, 2002.
[28] Venkatachalam, A.R. and Sohl, J.E. An intelligent model selection and
forecasting system. Journal of Forecasting, 18:167–180, 1999.
[29] Zhang, G., Patuwo, B.E., and Hu, M.Y. Forecasting with artiﬁcial neural
networks: the state of the art. International Journal of Forecasting, 14:35–
62, 1998.
Control Chart Networks
[1] Cheng, C.S. A neural network approach for the analysis of control charts
patterns. International Journal of Production Research, 35:667–697, 1997.
[2] Hwarng, H.B. Detecting mean shift in AR(1) process. Decision Sciences
Institute 2002 Annual Meeting Proceedings, pages 2395–2400, 2002.
CHAPTER 6. SURVIVAL ANALYSIS AND OTHERS NETWORKS 190
[3] Prybutok, V.R., Sanford, C.C., and Nam, K.T. Comparison of neural net
work to Shewhart
¯
X control chart applications. Economic Quality Control,
9:143–164, 1994.
[4] Smith, A.E. Xbar and R control chart using neural computing. Interna
tional Journal of Production Research, 32:277–286, 1994.
Some Statistical Inference Results
Estimation Methods
[1] Aitkin, M. and Foxall, R. Statistical modelling of artiﬁcial neural networks
using the multilayer perceptron. Statistics and Computing, 13:227–239,
2003.
[2] Belue, L.M. and Bauer, K.W. Designing experiments for single output
multilayer perceptrons. Journal of Statistical Computation and Simulation,
64:301–332, 1999.
[3] Copobianco, E. Neural networks and statistical inference: seeking robust
and efﬁcient learning. Journal of Statistics and Data Analysis, 32:443–445,
2000.
[4] Faraggi, D. and Simon, R. The maximum likelihood neural network as a
statistical classiﬁcation model. Journal of Statistical Planning and Infer
ence, 46:93–104, 1995.
[5] Lee, H.K.H. Model Selection and Model Averaging for Neural Network.
Ph.d thesis., Cornegie Mellon University, Depatament of Statistics, 1998.
[6] Mackay, D.J.C. Bayesian Methods for Adaptive Methods. Ph.d thesis.,
California Institute of Technology, 1992.
[7] Neal, R.M. Bayesian Learning for Neural Networks. NY Springer, 1996.
[8] RiosInsua, D. and Muller, P. Feedfoward neural networks for nonparamet
ric regression. In Pratical Nonparametric and Semiparametric Bayesian
Statistics (Eds. D. Dey, P. Mueller, and D. Sinha), pages 181–194. NY
Springer, 1996.
[9] Schumacher, M., Robner, and Varch, W. Neural networks and logistic re
gression: Part i. Computational Statistics and Data Analysis, 21:661–682,
1996.
CHAPTER 6. SURVIVAL ANALYSIS AND OTHERS NETWORKS 191
Interval Estimation
[1] Chryssolouriz, G., Lee, M., and Ramsey, A. Conﬁdence interval predic
tions for neural networks models. IEEE Transactions on Neural Networks,
7:229–232, 1996.
[2] De Veaux, R.D., Schweinsberg, J., Schumi, J., and Ungar, L.H. Predic
tion intervals for neural networks via nonlinear regression. Technometrics,
40(4):273–282, 1998.
[3] Hwang, J.T.G. and Ding, A.A. Prediction interval for artiﬁcial neural net
works. Journal of the American Statistical Association, 92:748–757, 1997.
[4] Tibshirani, R. A comparison of some error estimates for neural networks
models. Neural Computation, 8:152–163, 1996.
Statistical Test
[1] Adams, J.B. Predicting pickle harvests using parametric feedfoward net
work. Journal of Applied Statistics, 26:165–176, 1999.
[2] Blake, A.P. and Kapetanious, G. A radial basis function artiﬁcial neural
network test for negleted. The Econometrics Journal, 6:357–373, 2003.
[3] Kiani, K. On performance of neural networks nonlinearity tests: Evidence
from G5 countries. Departament of Economics  Kansas State University,
2003.
[4] Lee, T.H. Neural network test and nonparametric kernel test for negleted
nonlinearity in regression models. Studies in Nonlinear Dynamics and
Econometrics, v4 issue 4, Article 1, 2001.
[5] Lee, T.H., White, H., and Granger, C.W.J. Testing for negleted nonlinearity
in time series models. Journal of Econometrics, 56:269–290, 1993.
[6] Terasvirta, T., Lin, C.F., and Granger, C.W.J. Power properties of the neural
network linearity test. Journal of Time Series Analysis, 14:209–220, 1993.
Contents
1 Introduction 2 Fundamental Concepts on Neural Networks 2.1 Artiﬁcial Intelligence: Symbolist & Connectionist . . . . . . . . . 2.2 The brain and neural networks . . . . . . . . . . . . . . . . . . . 2.3 Artiﬁcial Neural Networks and Diagrams . . . . . . . . . . . . . 2.4 Activation Functions . . . . . . . . . . . . . . . . . . . . . . . . 2.5 Network Architectures . . . . . . . . . . . . . . . . . . . . . . . 2.6 Network Training . . . . . . . . . . . . . . . . . . . . . . . . . . 2.7 Kolmogorov Theorem . . . . . . . . . . . . . . . . . . . . . . . . 2.8 Model Choice . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.8.1 Generalization . . . . . . . . . . . . . . . . . . . . . . . 2.8.2 Biasvariance tradeoff: Early stopping method of training 2.8.3 Choice of structure . . . . . . . . . . . . . . . . . . . . . 2.8.4 Network growing . . . . . . . . . . . . . . . . . . . . . . 2.8.5 Network pruning . . . . . . . . . . . . . . . . . . . . . . 2.9 Terminology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.10 Mccullochpitt Neuron . . . . . . . . . . . . . . . . . . . . . . . 2.11 Rosenblatt Perceptron . . . . . . . . . . . . . . . . . . . . . . . . 2.12 Widrow’s Adaline and Madaline . . . . . . . . . . . . . . . . . . 3 Some Common Neural Networks Models 3.1 Multilayer Feedforward Networks . . . . . . . . . . . . . . . 3.2 Associative and Hopﬁeld Networks . . . . . . . . . . . . . . . 3.3 Radial Basis Function Networks . . . . . . . . . . . . . . . . 3.4 Wavelet Neural Networks . . . . . . . . . . . . . . . . . . . . 3.4.1 Wavelets . . . . . . . . . . . . . . . . . . . . . . . . 3.4.2 Wavelet Networks and Radial Basis Wavelet Networks. 3.5 Mixture of Experts Networks . . . . . . . . . . . . . . . . . . 3.6 Neural Network and Statistical Models Interface . . . . . . . . . . . . . . . . . . . . . . . . 1 3 3 4 6 8 11 13 14 15 16 16 18 18 19 20 25 26 29 31 31 35 44 47 47 51 53 55
1
CONTENTS
4 Multivariate Statistics Neural Networks 4.1 Cluster and Scaling Networks . . . . . . . . . . . . . . . . . . 4.1.1 Competitive networks . . . . . . . . . . . . . . . . . . 4.1.2 Learning Vector Quantization  LVQ . . . . . . . . . . . 4.1.3 Adaptive Resonance Theory Networks  ART . . . . . . 4.1.4 Self Organizing Maps  (SOM) Networks . . . . . . . . 4.2 Dimensional Reduction Networks . . . . . . . . . . . . . . . . 4.2.1 Basic Structure of the Data Matrix . . . . . . . . . . . . 4.2.2 Mechanics of Some Dimensional Reduction Techniques 4.2.2.1 Principal Components Analysis  PCA . . . . 4.2.2.2 Nonlinear Principal Components . . . . . . . 4.2.2.3 Factor Analysis  FA . . . . . . . . . . . . . . 4.2.2.4 Correspondence Analysis  CA . . . . . . . . 4.2.2.5 Multidimensional Scaling . . . . . . . . . . . 4.2.3 Independent Component Analysis  ICA . . . . . . . . . 4.2.4 PCA networks . . . . . . . . . . . . . . . . . . . . . . 4.2.5 Nonlinear PCA networks . . . . . . . . . . . . . . . . 4.2.6 FA networks . . . . . . . . . . . . . . . . . . . . . . . 4.2.7 Correspondence Analysis Networks  CA . . . . . . . . 4.2.8 Independent Component Analysis Networks  ICA . . . 4.3 Classiﬁcation Networks . . . . . . . . . . . . . . . . . . . . . . 5 Regression Neural Network Models 5.1 Generalized Linear Model Networks  GLIMN . . . . 5.1.1 Logistic Regression Networks . . . . . . . . . 5.1.2 Regression Network . . . . . . . . . . . . . . 5.2 Nonparametric Regression and Classiﬁcation Networks 5.2.1 Probabilistic Neural Networks . . . . . . . . . 5.2.2 General Regression Neural Networks . . . . . 5.2.3 Generalized Additive Models Networks . . . . 5.2.4 Regression and Classiﬁcation Tree Network . . 5.2.5 Projection Pursuit and FeedForward Networks 5.2.6 Example . . . . . . . . . . . . . . . . . . . . 6 Survival Analysis and others networks 6.1 Survival Analysis Networks . . . . . . . 6.2 Time Series Forecasting Networks . . . . 6.2.1 Forecasting with Neural Networks 6.3 Control Chart Networks . . . . . . . . . . 6.4 Some Statistical Inference Results . . . . 6.4.1 Estimation methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2 56 56 56 59 61 68 78 80 82 82 82 83 83 84 85 91 94 95 98 99 101 118 118 120 125 126 126 129 131 132 134 135 136 136 148 150 155 160 160
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . .3 3 Interval estimation . . . .CONTENTS 6. . . 166 . . . . . . . . .4. .4. . . . . . . . . . . . . .2 6. . . . . . . 165 Statistical Tests . . .
. . . . . . . . . . . . . . . . . . . . . Some wavelets: (a) Haar (b) Morlet (c) Shannon . . . . Network ﬁts . . . . Mixture of experts networks . . . . .5 3. . . . . . . . . . . . . . . . . . . . Example Hopﬁeld network . . . . . . . . . . . Model choice: training x validation . . . . . . . . . . . . . .8 3. . .3 2. .17 3. . Multilayer network . . 4 . . . . . . . . . . .15 3. ADALINE . . . . . . . . . . . . . . . . . . . . .4 2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Feedforward network . . . . . .7 3. . . . . . . Radial basis network . .1 2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Schematic faces for Hebbian learning . . . . . . . . . . . Radial Basis Wavelet Network . . Wavelet network . . . . . . . . . . . . . . . . . .3 3. Type of associative networks . . . . . . . . . . . . .9 3. . . . . . . . . . . . . . . Wavelon . .10 3. .6 2. . . . . 5 7 8 12 12 13 17 18 26 27 28 29 32 32 36 37 38 39 40 41 44 45 46 49 50 50 52 52 53 54 . . . . . . . . . . .11 2. .List of Figures 2. . . . . . .) . . .2 2. . . . . . .16 3. . . .6 3. . . . . . . . . . . . . . . . . . . . . . . Radial basis functions h(. . . . . . Two layer perceptron . . . . . . . . . . . . . . . .18 Neural network elements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .2 3. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11 3. . . . . . . . .9 2. . . . . . . . . . . . . . . .1 3. . . Model of an artiﬁcial neuron .7 2. . .8 2. . Recurrent network . . . . . . . . . . . . . . . . . . . . . . . . . . . Wavelet operation . . . . . . . . . .5 2. . .10 2. . . .12 3. . . . . . . . . . . . . . . .14 3. . . . Activation functions: graphs and ANN representation Singlelayer network . . . . . .5 Hopﬁeld network architecture . . . . . . . . . . . . The features used to build the faces in Figure 3. . . . . .4 3. . . . . . . . . . . . . . . Twodimensional example . McCullogPitt neuron . . . . Associative network . . .12 3. . . . . . . . . . . . . . . . . .13 3. . . . . . . . . Classiﬁcation by alternative networks . . Classes of problems . . . . . . . . . . . An Example of a Wavelet Transform. . . . . . . . . . . . . . . . .
32 4. . . . . . . . . . . . . . . 92 Oja´s rule .2 4. . . . .26 4. . . . . . . . . . . . . .35 4. . . . . . 71 Neighborhoods for rectangular grid . . . . 97 Students in factor space .37 4. . . . .40 5 General Winner Take All network . . . . . . . . . . . . . . . . . .LIST OF FIGURES 4. . . . . . 57 Success of classiﬁcation for all ANNs and all tested movement trials 61 Architecture of ART1 network . . . . . .Autoassociative . . . . . .36 4. . .39 4. . . . . . . . . . . . . . 98 ICA neural network . . . . . . . . . . . 72 Mexican Hat interconnections . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93 Non linear PCA network . . . .16 4. . . . . . . . 91 PCA networks architecture . .14 4. . . . . . . . . . . .29 4. . . . . . . . . . .6 4. . . . . . . . . .20 4. . . . . . . . . .1 4. . . . 95 Factor analysis APEX network . . . . . . . . . . . . .8 4. . . . . . . . . . . . . .5 4. . . . . . . .31 4. .21 4. . 76 The relation between ICA and some other methods . . . . . . .30 4. . . . . . . . . . .38 4. . . . . . . . . .10 4. . . . 99 Example of classiﬁcation of a mixture of independent components 100 Taxonomy and decision regions . . . . . . . 74 Combined forecast system . . . . 97 CA network . . . . . . . . . .7 4. .13 4. . . . . . . . . . . . . .19 4. 4500 iterations . 75 Week days associated with the SOM neurons . . . . . 73 Normal density function of two clusters . . . . . . . . . . . . . 74 SOM weights. 50 iterations . . . 105 A taxonomy of classiﬁers . . . . . . . . . 90 PCA networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 Neighborhoods for hexagonal grid . . . . . . . . . . . . . . . . 104 Four basic classiﬁer groups . . . . .4 4. . . . . . . . . . . . . . . . . . . . . . .23 4. . .17 4. . .33 4.11 4. . . . . . .27 4. . . . . 73 Sample from normal cluster for SOM training . . 63 Topology of neighboring regions .34 4. . . . . . . . . . . . . . . . . . 70 The selforganizing map architecture . . . . . . . . . . . . . . . 103 A nonlinear classiﬁcation problem . . . .9 4. . . . . . 74 SOM weights. . . . . . . . . . . .25 4. . 69 Kohonen selforganizing map. . . . 62 Algorithm for updating weights in ART1 . . . . . . . . . . . . . . . . . .Oja rule. . . . . 300 iterations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102 Classiﬁcation by alternative networks . . . . . . . . . . . . . . . . 89 PCA and ICA transformation of speach signals .24 4. . . . . . . . . . .12 4. .15 4. 73 Initial weights . . . . . . 70 Linear array of cluster units . . . . . . . . . . . . . .18 4. . . . . . . . . . . . . . . . 74 SOM weights.3 4. . . . . . . . . . . . . . . . .28 4. . . . 75 Seasons associated with the SOM neurons . . . . . . . . 86 PCA and ICA transformation of uniform sources . . 106 . . . . . . . . . . . . . . . . 103 Boundaries for neural networks with tree abd six hidden units . 72 Algorithm for updating weights in SOM . . . . . . . . . . . . .22 4. . . . . . . . . . . . . . . . . . . . 96 Original variables in factor space . . . . .
. . . . . . . . . . . . . . . . . . . . Partial logistic regression network (PLANN) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Probabilistic Neural Network . Policholomous logistic network . . . . . . . . . .7 6. . . . . . . . . . . . Classiﬁcation Tree Network . . . . . . . . . . . . . . . . . . . . . . . .1 6. 6 120 122 128 130 131 132 133 134 135 135 138 139 142 143 145 146 147 153 153 162 163 167 . . . . . . . . . . . . . . . . . . . . . .2 5. . . Separate Architecture . . . . . .3 5. . . . . . . . .9 6. .11 6. . . . .PNN . . . . . . . . . . . . . . .6 5. . . .9 5. Nonlinear survival network . . . . . Graphical model of Muller and Rios Issua . . . . . . . . . . . . . . . .2 6.12 GLIM network . . . . . . . . . . . . . . . .10 6. . Head and neck cancer trial . . . . . General Regression Neural Network Architecture GAM neural network . . . . . . . . . . . . . . . . .5 6. . . . . . . .4 5. Onehidden layer feedforward neural network. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Network of networks .8 6. . . . . . Log odds survival network . . . . . . . . Graphical model of Neal . .LIST OF FIGURES 5. .1 5. . . . .3 6. . . . . . . . . . . . . . . . GAM network .6 6. . . .8 5. . . . . . .10 6. . Classes of regression model for survival data . . . . . . Combined Architecture for forecasting xt+1 . . Neural network model for survival data . . . .7 5.4 6. . . . . . . . . . . . . . . . . . . . . . ˆ Decision Tree . . . . . . . . . . . . . .5 5. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Head and neck cancer trial . . . . . . . Regression tree y . . . . . . . . .
factor analysis etc. Neural networks are designed to be implemented on massively parallel computers such as PC. This attempts to answer the questions . engineering. principal components. engineers and computer scientists want to build intelligent machines and mathematicians want to understand the fundamentals properties of networks as complex systems. They are based on biological or engineering criteria such as how easy is to ﬁt a net on a chip rather than on wellestablished statistical or optimization criteria.Chapter 1 Introduction This work is a tentative to bring the attention of statisticians the parallel work that has been done by neural network researchers in the area of data analysis. standard statistical optimization criteria usually will be more efﬁcient to neural networks implementation. which gives the details and steps of the facilities of the algorithm.. neural networks algorithms implementations are inefﬁcient because: 1. which describes how the computation is being carried out. cluster. generalized additive models. classiﬁcation. On a serial computer. In an computational process we have the following hierarchical framework. computer science and mathematics.. In this area of application (data analysis) it is important to show that most neural network models are similar or identical to popular statistical techniques such as generalized linear models. 2. The top of the hierarchy is the computational level. All these groups are asking different questions: neuroscientists and psychologists want to know how animal brain works..what is being computed and why? The next level is the algorithm. projection . Research in neural networks involve different group of scientists in neurosciences. 1 . and ﬁnally there is the implementation level. On the other hand. psychology.
neural network research can stimulate statistics not only by pointing to new applications but also providing it with new type of nonlinear modeling. To be fair authors which inspired some of our developments. and also how the neural network models and statistical models are related to tackle the data analysis real problem. in this work we will concentrate on the top of the hierarchy of the computational process. Chapter 5 deals with parametric and nonparametric regression models and Chapter 6 deals with inference results. The references and the bibliography is referred to sections of the chapters. On the other hand.CHAPTER 1. The book is organized as follows: some basics on artiﬁcial neural networks (ANN) is presented in Chapters 2 and 3. but also statisticians have a range of problems in which they can contribute in the neural network research. that is what is being done and why. survival analysis. INTRODUCTION 2 Mathematical research in neural networks is not specially concerned with statistics but mostly with nonlinear dynamics. Multivariate statistical topics are presented in Chapter 4. we have included them in the bibliography. geometry. Therefore. The ﬁrst author is grateful to CAPES. We believe that neural network will become one of the standard technique in applied statistics because of its inspiration. even if they were not explicitly mentioned in the text. control charts and time series. probability and other areas of mathematics. . an Agency of the Brazilian Ministry of Education for a onde year support grant to visit Penn State University and to UFRJ for a leave of absence.
The concept of what is intelligence is outside the scope of this guide.choice of an intelligent activity to study. .Chapter 2 Fundamental Concepts on Neural Networks 2. These artiﬁcial neural networks are supposed to learn from examples 3 . which has used machines whose architecture is loosely based on the human brain. Concurrent with this. Note that the symbolic artiﬁcial intelligence is more concern in imitate intelligence and not in explaining it. perception and language understanding. problemsolving. The rules taken together form a recipe or algorithm for processing the symbols.development of a logicsymbolic structure able to imitate (such as “if condition 1 and condition 2.1 Artiﬁcial Intelligence: Symbolist & Connectionist Here we comment on the connection of neural network and artiﬁcial intelligence. which can be described as consisting of three phrases: . One of these tries to capture knowledge as a set of irreducible semantic objects or symbols and to manipulate these according to a set of formal rules. has been another line of research. This approach is the symbolic paradigm. there have existed two different approaches to the problem of developing machines that might embody such behavior.) . reasoning.. then result”..compare the efﬁciency of this structure with the real intelligent activity. It comprises the following components: learning. From the early days of computing. We can think of artiﬁcial intelligence as intelligence behavior embodied in humanmade machines.
For the connectionists. 2. Building such a model is the task that scientists faces. The connectionist. Statisticians as engineers on the contrary.information processing is not centralized and sequential but parallel and spacially distributed. As statisticians we will take the view that brain models are used as inspiration to building artiﬁcial neural networks as the wings of a bird were the inspiration for the wings of an airplane. The cell body (the central region) is called neuron. as in Figure 1c. . This characteristic lies in the set of ﬁbers that emanate from the cell body. It postulated that the physical structure of the brain is fundamental to emulate the brain function and the understanding of the mind. FUNDAMENTAL CONCEPTS ON NEURAL NETWORKS 4 and their “knowledge” is stored in representations that are distributed across of a set of weights. The form how these elements are interconnected are fundamental for the resulting mental process. We will focus on morphological characteristic which allow the neuron to function as an information processing device.CHAPTER 2.information is stored in the connections and adapted as new information arrives.it is robust to failure of some of its computational elements. and this fact is that named this approach connectionist. or neural network approach.1. call neurons. . Dendrites are surrounded by the synaptic bouton of other neurons.the axon . Some features of this approach are as follows: . the mental process comes as the aggregate result of the behavior of a great number of simple computational connected elements (neurons) that exchange signals of cooperation and competition. Neurons have irregular forms. when carry information transmitted from other neurons. .is responsible for transmitting information to other neurons. All other are dendrites. The brain has features. neither with lower levels of descriptions than nerve cell. Physical details are given in Table 1 where we can see that brain is a very complicated system. One of these ﬁbers . as illustrated in Figure 2. and a true model of the brain would be very complicated. Neurons are highly interconnected.2 The brain and neural networks The research work on artiﬁcial neural networks has been inspired by our knowledge of biological nervous system. starts from the premise that intuitive knowledge cannot be captured in a set of formalized rules. in particular the brain. we will not be concerned with large scales features such as brain lobes. use simpliﬁed models that they can actually build and manipulate usefully.
FUNDAMENTAL CONCEPTS ON NEURAL NETWORKS 5 Dendrites Synaptic bouton Axon Branch Axon Axon branches (a) Neurons (b) Sypatic bouton or synapse Dendrite Synaptic bouton Axon branch Synapse (c) Synaptic bouton making contact with a dendrite (d) Transmitters and receptor in the synapse and dendrites Figure 2.CHAPTER 2.1: Neural network elements .
The Figure 2.5kg 1. • A set of synapses.CHAPTER 2. weighted by respective synapses (weights).1: Some physical characteristics of the brain Number of neurons Number of synapses/neurons Total number of synapses (Number of processors of a computer) Operation (Processing speed in a computer) Number of operations Human brain volume Dendrite length of a neuron Firing length of an axon Brain weight Neuron weight Synapse 100 billion 1000 100000 billion (100 millions) 100 (1 billion) 10000 trillion/s 300 cm3 1 cm 10 cm 1.2 × 10−9 g Excitatory and Inhibitory 6 2. each characterized by a weight.to sum the input signs. A signal xj in the input of synapse j connected to the neuron k is multiplied by a synaptic weight wjk . FUNDAMENTAL CONCEPTS ON NEURAL NETWORKS Table 2. that when positive means that the synapse is excitated and when negative is inhibitory. • Summation .2 presents the model of an artiﬁcial neuron: .restrict the amplitude of the neuron output. • Activation function .3 Artiﬁcial Neural Networks and Diagrams An artiﬁcial neuron can be identiﬁed by three basic elements.
The symbol inside the box indicates the type of function. maximum likelihood etc. This function has the objective of activate or inhibit the next neuron. with the name shown inside the circle.2: Model of an artiﬁcial neuron Models will be displayed as networks diagrams. • Mathematically. Each arrow usually has a weight or parameter to be estimated. FUNDAMENTAL CONCEPTS ON NEURAL NETWORKS 7 Figure 2.CHAPTER 2. Neurons are represented by circles and boxes. which are inspired in biological neurons. These inputs can be outputs (or not) of other neurons and the output can be the input to other neurons. • Boxes represent values computed as a function of one or more arguments. • A thick line indicates that the values at each end are to be ﬁtted by same criterion such as least squares. we describe the k th neuron with the following form: p uk = j=1 wkj xj and yk = ϕ(uk − wk0 ). this total then goes through the activation function. The inputs are multiplied by weights and summed with one constant. • An artiﬁcial neural network is formed by several artiﬁcial neurons. (2. while the connections between neurons are shown as arrows: • Circles represent observed variables. • Arrows indicate that the source of the arrow is an argument of the function computed at the destination of the arrow.3.1) . • Each artiﬁcial neuron is constituted by one or more inputs and one output.
A vector of values presented once to all the output units of an artiﬁcial neural network is called case. The number of layers and the quantity of neurons in each one of them is determined by the nature of the problem. The intermediary layers are called hidden layers. The most commons are presented in Figure 2. ϕ(·) is the activation function and yk is the neuron output. FUNDAMENTAL CONCEPTS ON NEURAL NETWORKS 8 where x0 . wkp are the synaptic weights. is constituted by the inputs xp and the last layer is the output layer. etc. sample. x1 .3: f(x) y = ϕ(x) = x 0 x a) Identity function (ramp) Figure 2. wk0 is the bias. pattern. uk is the output of the linear combination.4 Activation Functions An activation function for a neuron can be of several types. xp are the inputs. . . example. called input layer. . . . . 2. wk1 . . The ﬁrst layer of an artiﬁcial neural network.CHAPTER 2.3: Activation functions: graphs and ANN representation . .
CHAPTER 2. FUNDAMENTAL CONCEPTS ON NEURAL NETWORKS F(x) 9 y = ϕ(x) = 1 x≥b −1 x < b 0 b X b) Step or threshold function F(x) 1 y = ϕ(x) = z 0 x ≥ 1/2 1 −1/2 ≤ x < 2 1 x≤− 2 ½ 0 ½ X c) Piecewiselinear function F(x) y = ϕ1 (x) = 1 1 + exp(−ax) a increasing 0 X d) Logistic (sigmoid) Figure 2.) .3: Activation functions: graphs and ANN representation (cont.
FUNDAMENTAL CONCEPTS ON NEURAL NETWORKS 10 F(x) y = ϕ2 (x) = 2ϕ1 (x) − 1 = 1 − exp(−ax) 1 + exp(−ax) 1 x 1 e) Symmetrical sigmoid y = ϕ3 (x) = tang x = ϕ2 (x) 2 f) Hyperbolic tangent function i. Wavelet ϕ(x) = Ψ(x) g) Radial Basis (some examples) Figure 2.) .3: Activation functions: graphs and ANN representation (cont. Normal ϕ(x) = Φ(x) ii.CHAPTER 2.
1. SingleLayer Feedforward Networks In a layered neural network the neurons are organized in layers. but not viceversa. The derivative of the step function is not deﬁned. One input layer of source nodes projects onto an output layer of neurons. The nice feature of the sigmoids is that they are easy to compute. 2. . There are three different classes of network architectures. The derivative of the logistic is ϕ1 (x)(1 − ϕ1 (x)). Neurons in each of the layer have as their inputs the output of the neurons of preceding layer only.CHAPTER 2. This type of architecture covers most of the statistical applications The neural network is said to be fully connected if every neuron in each layer is connected to each node in the adjacent forward layer. 3 (2. Multilayer Feedforward Networks These are networks with one or more hidden layers.3: Activation functions: graphs and ANN representation (cont.) The derivatives of the identity function is 1. Recurrent Network Is a network where at least one neuron connects with one neuron of the preceding layer and creating a feedback loop.4. FUNDAMENTAL CONCEPTS ON NEURAL NETWORKS 11 h) Wavelet Transform ϕ(x) = 2j/2 Ψ 2j x − k Figure 2. 3. This type of architecture is mostly used in optimization problems. For the hyperbolic tangent the derivative is 1 − ϕ2 (x). which is why it is not used.1) 2. otherwise it is partially connected.5 Network Architectures Architecture refers to the manner in which neurons in a neural network are organized. This network is strictly of a feedforward type.
4: Singlelayer network Figure 2.CHAPTER 2.5: Multilayer network . FUNDAMENTAL CONCEPTS ON NEURAL NETWORKS 12 Figure 2.
when the error is propagated back to the neurons in the hidden layer of a feedforward network of section ??.6 Network Training The training of the network consists of adjusting the synapses or weight with the objective of optimizing the performance of the network. we do not have the output. to the statistical methods of cluster analysis and principal components. . Unsupervised training corresponds. for example. From a statistical point of view. the creditassign problem. so we can modify the network synaptic weights in an orderly way to achieve the desired objective. • learning without a teacher: this case is subdivided into selforganized (or unsupervised) and reinforced learning. It is the problem of assigning a credit or a blame for overall outcomes to each of the internal decisions made by a learning machine. It occurs. FUNDAMENTAL CONCEPTS ON NEURAL NETWORKS 13 Figure 2. In both cases. the training corresponds to estimating the parameters of the model given a set of data and an estimation criteria. which corresponds to each input vector. for example.CHAPTER 2. The training can be as follows: • learning with a teacher or supervised learning: is the case in which to each input vector the output is known.6: Recurrent network 2. Reinforced learning training corresponds to. for example.
. mdimensional descriptive vectors and where some members of B are known a priori. where insights into the existence. In these cases. but not the construction. B has lower dimension than A. The thirteenth problem of Hilbert is having some impact in the area of neural networks. 1994): (i) B is a set of discrete categories. Most of the neural network algorithm can also be described as method to determine a nonlinear f .CHAPTER 2. etc. Therefore the determination of f is of utmost importance and made possible due to a theorem of Andrei Kolmogorov about representation and approximation of continuous functions.265)( or Bose and Liang (1996. where f : A → B where A is a set of n. and f is to be determined. A modern variant of Kolmogorov’s Theorem is given in Rojas (1996. the problem of the data analyst is to ﬁnd the mapping f .7 Kolmogorov Theorem Most conventional multivariate data analysis procedures can be modeled as a mapping f :A→B where A and B are ﬁnite sets. In 1900. includes the methods under the names discrimination analysis and regression. Some of these problems have been related to applications. p. some of the most common neural networks are said to perform similar tasks of statistical methods. whose history and development we present shortly. ART network (Adaptive Resonance Theory) and cluster analysis. FUNDAMENTAL CONCEPTS ON NEURAL NETWORKS 14 2. The mapping f is a clustering function and choosing the appropriate f is the domain of cluster analysis. without precise knowledge of B. of solutions to problems have been deduced. Also.). multilayer perception and regression analysis.158). Consider the following examples (Murtagh. since its solution in 1957 by Kolmogorov. and A a set of mdimensional vectors characterizing n observations. Professor David Hilbert at the International Congress of Mathematicians in Paris. (ii) In dimensionalityreduction technique (PCA. For example. formulated 23 problems from the discussion of which “advancements of science may be expected”. Kohonen network and multidimensional scaling. B is a space containing n points. which has lower dimension than A. p. (iii) When precise knowledge of B is available. FA.
p. xm ). 1997. and the gq functions do not depend on f (x1 . . . On . . . . . Wavelets functions to be used in the approximation. . . for q = 1. although the number of hidden units required grows exponentially in the worst case with the number of network inputs.8 Model Choice We have seen that the data analyst’s interest is to ﬁnd or learn a function f from A to B. we do not demand exact reproducibility but only a bounded approximate error. From Kolmogorov’s work. Fourier. The number of hidden units required depends on the function to be approximated. . .105): • Boolean functions. Every boolean function can be represented exactly by some network with two layers of units. 1]m can be represent in the form 2m+1 m f (x1 . xm ) on n variables x1 . but does not indicate how to obtain those functions. m are constants. . . cluster analysis). xm on In = [0. Every bounded continuous function can be approximated with arbitrary small error by a network with two layers. the output layer uses linear activation. we don’t have precise knowledge of B and the term “unsupervised” learning is used. x2 . Radial Basis. . However if we want to approximate a function. . FUNDAMENTAL CONCEPTS ON NEURAL NETWORKS 15 Theorem 1. . • Continuous functions.CHAPTER 2. several authors have improved the representation to allow Sigmoids. This is the approach taken in neural network. Again. and the number of units required in each hidden layer also depends on the function to be approximated. . . Some of the adaptations of Kolmogorov to neural networks are (Mitchell. The result in this case applies to networks that use sigmoid activation at the hidden layer and (unthresholded) linear activation at the output layer. Kolmogorov Theorem states an exact representation for a continuous function. the two hidden layers use sigmoid activation. Any function can be approximated to arbitrary accuracy by a network with three layers of units. . . . . . (Kolmogorov Theorem): “Any continuous function f (x1 . xm ) = q=1 g p=1 λp hq (xp ) where g and hq . • Arbitrary functions. 2m + 1 are functions of one variable and λp for p = 1. 2. . x2 . In some cases (dimension reduction. .
(2. In this section. The data was generated from function h(x). we will address each of these problems. The full line represents a more complex model . where the mapping f is determined using the training set. 2.8.7 obtained from two training of a network. testing of how well f performs using a test set B ⊂ B. Given a large network.g.CHAPTER 2. B ⊂ B. B not identical to B . We say that the model has a bias. In a multilayer network.1 Generalization By generalization we mean that the inputoutput mapping computed by the network is correct for a test data never used. FUNDAMENTAL CONCEPTS ON NEURAL NETWORKS 16 the other hand.8. discriminant and regression analysis). The dashed represents a simple model that does not do a good job in ﬁtting the new data.1) The training data are shown as circles (◦) and the new data by (+). and the data from y(x) = h(x) + . Kolmogorov’s results provide guidelines for the number of layers to be used. the term “supervised” is often applied to the situation where f is to be determined. for example) that is presented in the training data but not true in the underline function that is to be modelled. The testing and application phases are termed generalization. it is possible that loading data into the network it may end up memorizing the training data and ﬁnding a characteristic (noise. In the case of supervised learning the algorithm to be used in the determination of f embraces the phases: learning. but where some knowledge of B is available (e. 2. However. It is assumed that the training is a representative sample from the environment where the network will be applied.2 Biasvariance tradeoff: Early stopping method of training Consider the two functions in Figure 2. The training of a neural network may be view as a curve ﬁtting problem and if we keep improving our ﬁtting. some problems may be easier to solve using two hidden layers. The number of units in each hidden layer and (the number) of training interactions to estimate the weights of the network have also to be speciﬁed. and an application. we end up with a very good ﬁt only for the training data and the neural network will not be able to generalize (interpolate or predict other sample values).8.
Network performance in the training sample will improve as long as the network is learning the underlying structure of the data (that is h(s)). FUNDAMENTAL CONCEPTS ON NEURAL NETWORKS 17 Figure 2. Every now and then. We say that the model has high variance.high variance . . high bias which does an excellent job in ﬁtting the data. To stop overﬁtting we stop training at epoch E0 .7: Network ﬁts: .8 shows the errors on the training and validation sets. as the error is close to zero. Once the network stops learning things that are expected to be true of any data sample and learns things that are true only of the training data (that is ). . performance on the validation set will stop improving and will get worse. stop training and test the network performance in the validation set. . A simple procedure to avoid overﬁtting is to divide the training data in two sets: estimation set and validation set. with no weight update during this test. p. Haykin (1999. Figure 2.CHAPTER 2. We say in this case that the network has memorize the data or is overﬁtting the data.215) suggest to take 80% of the training set for estimation and 20% for validation. However it would not do a good job for predicting the values of the test data (or new values). .
.4 Network growing One possible policy for the choice of the number of layers in a multilayer network is to consider a nested sequence of networks of increasing sequence as suggest by Haykin (1999. < hg . In practice it is usual to try several types of neural networks and to use crossvalidation on the behavior in a test data. . • p networks with a single layer of increasing size h1 < h2 < .8: Model choice: training x validation 2. . p. • g networks with two hidden layers: the ﬁrst layer of size hp and the second layer is of increasing size h1 < h2 < . The crossvalidation approach similar to the early stopping method previously presented can be used for the choice of the numbers of layers and also of neurons. . to choose the simplest network with a good ﬁt. . FUNDAMENTAL CONCEPTS ON NEURAL NETWORKS 18 Figure 2.8. We may achieve this objective in one of two ways. < hp .8.3 Choice of structure A large number of layers and neurons reduce bias but at the same time increase variance. • Network growing • Network pruning 2.CHAPTER 2.215).
. p. As in statistics. we need a measure of the ﬁt between the model (network) and the observed data.CHAPTER 2. with the mean square error is closely related to ridge regression in statistics and the work of Tikhonov on ill posed problems or regularization theory (Orr. When λ is large. it says that the training sample is unreliable. 1999. 1996.2) where Es (W ) is a standard performance measure which depends on both the network and the input data. Rissanen. A complexity penalty used in neural networks is (Haykin.5 Network pruning In designing a multilayer perception. Using this criteria.220) (wi /wo )2 1 + (wi /wo )2 E{w} = i (2.8. Schwartz are associated with these methods. we are building a nonlinear model of the phenomenon responsible for the generation of the inputoutput examples used to train the network. Ec (W ) is a complexity penalty which depends on the model alone. p.8. which share a common form of composition: ModelComplexity criteria = Performance + measure ModelComplexity penalty The basic difference between the various criteria lies in the modelcomplexity penalty term. FUNDAMENTAL CONCEPTS ON NEURAL NETWORKS 19 2. The regularization parameter λ.219). which represents the relative importance of the complexity term with respect to the performance measure. 1999. Note that this risk function.8. The names of Akaike. In the context of backpropagation learning or other supervised learning procedure the learning objective is to ﬁnd a weight vector that minimizes the total risk R(W ) = Es (W ) + λEc (W ) (2. Therefore the performance measure should include a criterion for selecting the model structure (number of parameter of the network). Haykin. In backpropagation learning it is usually deﬁned as meansquare error. the network is completely determined by the training sample.3) where w0 is a preassigned parameter. Various identiﬁcation or selection criteria are described in the statistical literature. weight’s wi ’s with small values should be eliminated. With λ = 0. Hannan.
(ii) apply PCA to the covariance matrix of the outputs of the hidden layer. Table 2. FUNDAMENTAL CONCEPTS ON NEURAL NETWORKS 20 Another approach to eliminate connections in the network. such as the mean or a globally optimal synaptic weight Statistical Jargon Statistical Inference Population Parameter Continued on next page .2: Terminology Neural Network Jargon Generalizing from noisy data and assessment of accuracy thereof The set of all cases one wants to be able to generalize to A function of the value in a population.). proportion of variance explained etc. (iii) choose the number of important components P ∗ (by one of the usual criterion: eigenvalues greater than one. principal components analysis as follows: (i) initially train a network with an arbitrary large number P of hidden modes. It consists in applying PCA. Addrians and Zantinge (1996) for data mining terminology. pick P ∗ nodes out of P by examining the correlation between the hidden modes and the selected principal components. Note that the resulting network will not always have the optimal number of units.2 helps to translate from the neural network language to statistics (cf Sarle. e.g. (iv) if P ∗ < P .: Gurney (1997) for neuroscientists term. The network may improve its performance with few more units. Table 2.CHAPTER 2. Some are quite specialized character: for the statistician. works directly in the elimination of units and was proposed and applied by Park et al (1994). 1996).9 Terminology Although many neural networks are similar to statistical models. 2. a different terminology has been developed in neurocomputing. Some further glossary list are given in the references for this section.
Model ﬁtting. random search Model Estimation. and nonlinear dynamical systems consisting of an often large number of neurons interconnected in often complex ways and often organized in layers Linear regression and discriminant analysis. and may direct the result to one or more other neurons A class of ﬂexible nonlinear regression and discriminant models. Cluster analysis. simulated annealing. data reduction models. such as the mean or a learned synaptic weight Neuron. FUNDAMENTAL CONCEPTS ON NEURAL NETWORKS 21 Neural Network Jargon A function of the values in a sample. unit. Discriminant analysis Principal components. Data reduction Cluster analysis Principal components Neural Networks Statistical methods Architecture Training. Validation set Sample. Optimization Discriminant analysis Regression Regression. computes a function thereof.CHAPTER 2. Function approximation Supervised learning Unsupervised learning. Cottrell/ Munzo/ Zisper technique Training set Test set. Adaptation Classiﬁcation Mapping. node processing element Statistical Jargon Statistic A simple linear or nonlinear computing element that accepts one or more inputs. neurocode. Selforganization Competitive learning Hebbian learning. Construction sample Holdout sample Continued on next page . Learning.
Example. Predictors. Case Binnary (0/1). Bivalent or Bipolar (1/1) Input Statistical Jargon Observation. FUNDAMENTAL CONCEPTS ON NEURAL NETWORKS 22 Neural Network Jargon Pattern. Signal function. Explanatory variables. Carries Predicted values Prediction Dependent variables. Extrapolation. Vector. Transfer function Softmax Multiple logistic function Continued on next page . Prediction Conﬁdence intervals Forecasting Linear twogroup discriminant analysis (not a Fisher´s but generic) Generalized linear model (GLIM) Inverse link function in GLIM (Nohiddenlayer) perceptron Activation function. Regressors. Responses Observed values Observation containing both inputs and target values Lagged value Output Forward propagation Training values Target values Training par Shift register.CHAPTER 2. (Tapped) (time) delay (line). Dichotomous Independent variables. Input window Errors Noise Generalization Error bars Prediction Adaline (ADAptive LInear NEuron) Residuals Error term Interpolation. Case Binary.
FUNDAMENTAL CONCEPTS ON NEURAL NETWORKS 23 Neural Network Jargon Squashing function Statistical Jargon Bounded function with inﬁnitive domain Differentiable nondecreasing function Linear model Maximum redundant analysis.CHAPTER 2. Principal components of instrumental variables Projection pursuit regression (Regression coefﬁcients. Parameter estimates) Intercept Bias Semilinear function Phimachine Linear 1hiddenlayer perceptron 1hiddenlayer perceptron Weights Synaptic weights Bias The difference between the expected value of a statistic and the corresponding true value (parameter) OLS (Orthogonal least square) Probabilistic neural network General regression neural network Topolically distributed enconding Adaptive vector quantization Forward stepwise regression Kernel discriminant analysis Kernel regression (Generalized) Additive model Iterative algorithm of doubtful convergence for Kmeans cluster analysis Hartigan´s leader algorithm A form of piecewise linear discriminant analysis using a preliminary cluster analysis Regressogram based on kmeans clusters Continued on next page Adaptive Resonance Theory 2n d Learning vector quantization Counterpropagation .
Jumpers. Bypass connections. as in stochastic approximation Iteratively updating estimates after each complete pass over the data as in most nonlinear regression algorithms Main effects Heteroassociation Epoch Continuous training. linear model with iteration terms Iterative algorithms of doubtful convergence for approximating an arithmetic mean or centroid Iterative algorithm of doubtful convergence for training a linear perceptron by least squares.CHAPTER 2. Online training. LMS (Least Mean Square) rule Training by minimizing the median of the squared errors Continued on next page . Incremental training. WidrowHoff rule. Discriminant analysis (Independent and dependent variables are different) Iteration Iteratively updating estimates one observation at time via difference equation. Instantaneous training Batch training. FUNDAMENTAL CONCEPTS ON NEURAL NETWORKS 24 Neural Network Jargon Encoding. Offline training Shortcuts. Outstar Deltarule. direct linear feedthrough (direct connections from input to output) Funcional links Secondorder network Iteration terms or transformations Quadratic regression. Responsesurface model Polynomial regression. Autoassociation Statistical Jargon Dimensionality reduction (Independent and dependent variables are the same) Regression. adaline rule. similar to stochastic approximation LMS (Least Median of Square) Higherorder networks Instar.
selfstructuring. Prunning. Model selection. Brain damage. FUNDAMENTAL CONCEPTS ON NEURAL NETWORKS 25 Neural Network Jargon Generalized delta rule Statistical Jargon Iterative algorithm of doubtful convergence for training a nonlinear perceptron by least squares. also called Linear Threshold Unit (LTU). Cross entropy Evidence framework 2. Ridge regression Random noise added to the inputs to smooth the estimates Subset selection.1) . ontogeny Optimal brain surgeon LMS (Least Mean Squares) Relative entropy. Regularization Jitter Growing. The neuron receive the weighted sum of inputs and has as output the value 1 if the sum is bigger than this threshold and it is shown in Figure 2. Mathematically.10 Mccullochpitt Neuron The McCullochPitt neuron is a computational unit of binary threshold.CHAPTER 2. similar to stochastic approximation Computation of derivatives for a multilayer perceptron and various algorithms (such as generalized delta rule) based thereon Shrinkage estimation. Pretest estimation Wald test OLS (Ordinary Least Squares) (see also ”LMS rule” above) KullbackLeibler divergence Empirical Bayes estimation Backpropagation Weight decay. we represent this model as p yk = ϕ j=1 ωkj − ωo (2.9.10.
Figure 2.1) This perceptron creates an output indicating if it belongs to class 1 or 1 (or 0). the weights are constraints and the variables are the inputs. That is 1 0 ω i xi − ω 0 ≥ 0 otherwise y = Out = (2. ωjk is the weight of neuron j to neuron k. (2.11.11.11 Rosenblatt Perceptron The Rosenblatt perceptron is a model with several McCullochPitt neurons.9: McCullogPitt neuron where yk is an output of neuron k.2) The constant ω0 is referred as threshold or bias.CHAPTER 2.10.2) 2.10 shows one such a model with two layers of neurons The one layer perceptron is equivalent to a linear discriminant: ω j xj − ω 0 . In the perceptron. FUNDAMENTAL CONCEPTS ON NEURAL NETWORKS 26 Figure 2. Since the objective is to estimate (or . ωo is the threshold to the neuron k and ϕ is the activation function deﬁned as 1 if input ≥ ω0 −1 otherwise ϕ(input) = (2. xj is the output of neuron j.
FUNDAMENTAL CONCEPTS ON NEURAL NETWORKS 27 Figure 2.11. ∆ωi (t) is the adjustment or correction factor. y is the ˆ perceptron output. for each wrong output the weights are adjusted to correct the error. as in other linear discriminants. The sequential presentation of the samples continues indeﬁnitely until some stop criterion is satisﬁed. y is the correct value or true response.4) Formally. if there is an error. the new weight is ωi (t + 1). a is the activation value for the output y = f (a) (a = ˆ ωi xi ).5) (2. the actual weight in true t is ωi (t).11. the next step is to train the perceptron.10: Two layer perceptron learn) the optimum weights. The perceptron is trained with samples using a sequential learning procedure to determine the weight as follows: The samples are presented sequentially. Nevertheless.11. If the output is corrected the weights are not adjusted.11. all 100 cases are presented again to the perceptron.CHAPTER 2. Each cycle of presentation is called . the training will be completed. For example. The equations for the procedure are: ωi (t + 1) = ωi (t) + ∆ωi (t) ω0 (t + 1) = ω0 (t) + ∆ω0 (t) and ∆ωi (t) = (y − y )xi ˆ ∆ω0 (t) = (y − y ) ˆ (2.6) (2. if 100 cases are presented and if there is no error.3) (2. The initial weight values are generally random numbers in the interval 0 and 1.
0 0.0 0.3: Training with the perceptron rule on a twoinput example ω1 0.11: Twodimensional example Table 2.3 0. some modiﬁcations can be done in the algorithm.25 . the training guarantees the converge but not how fast.15 ω0 0. Thus if a linear discriminant exists that can separate the classes without committing an error. To make learning and convergence faster. The value of α is a number in the interval 0 and 1. Table 2.4 0.15 0.55 0. FUNDAMENTAL CONCEPTS ON NEURAL NETWORKS 28 an epoch. For example.55 x1 0 0 1 1 x2 0 1 0 1 a 0 0.25 ∆ω0 0 0.25 ∆ω1 0 0 0 0.4 0 0. in the weight actualization procedure to αωi .11. The perceptron converge theorem states that when two classes are linearly separable.25 0 0.11). Here convergence is achieved in one epoch. α. but it is not known how long it will take to ﬁnd it (Figure 2. Figure 2.CHAPTER 2. the training procedure will ﬁnd the line or separator plan.4 0.0 ω2 0.0 0.3 illustrates the computation of results of this section for the example of Figure 2.25 0 0.25 0 0. Here α = 0.3 0.25 ∆ω2 0 0. (ii) To introduce a learning rate.15 y 0 1 0 0 y ˆ 0 0 0 1 a(ˆ − y) y 0 0. (i) To normalize the data: all input variables must be within the interval 0 and 1.25.
w1 = 0.12: ADALINE To a set of samples. the performance measure fo the training is: DM LS = (˜ − y)2 y (2. was developed by Widrow and Hoff and called ADALINE (ADAptive LINear Element).12. w2 = 0.3 and x1 = 2.4 gives the updating results for the sample case. A generalization which consist of a structure of many ADALINE is the MADALINE. . x2 = 1 would result in an activation value of: a = 2x0. While the perceptron uses the step or threshold function. The perceptron and the LMS search for a linear separator. Figure 2.12 Widrow’s Adaline and Madaline Another neuron model.12. The major difference between the Perceptron (or LTU) and the ADALINE is in the activation function. using as a training method the method of Least Mean Square (LMS) error. the ADALINE uses the identity function as in Figure 2. A comparison of training for the ADALINE and Perceptron for a two dimensional neuron with w0 = −1. The technique used to update the weight is called gradient descent.CHAPTER 2.1) The LMS training objective is to feed the weigths that minimize DM LS .5.2) Table 2.12. To overcome this limitation the more general neural network model of the next chapter can be used. The appropriate line or plane can only be obtained if the class are linear separable.5 + 1x3 − 1 = 0.3 (2. FUNDAMENTAL CONCEPTS ON NEURAL NETWORKS 29 2.
2 = −0.3 .1 = 0 −1 + (0 − 0.5 0.1 0.4: Comparison of updating weights y ˜ w1 (t + 1) w2 (t + 1) w0 (t + 1) Perceptron 1 0.5 + (0 − 1). FUNDAMENTAL CONCEPTS ON NEURAL NETWORKS 30 Table 2.3 + (0 − 0.3).5 + (0 − 0.3 + (0 − 1).2 = −1.7 −1 + (0 − 1) = −2 Adaline 0.3 0.CHAPTER 2.3) = −1.3) = −0.
If a weight is to be adjusted anywhere in the network. Before we give the general formulation of the generalized delta rule. we present a numerical illustration for the Exclusive OR (XOR) problem of Figure 3. the derivative of the error or criteria function with respect to that weight must be found.1) in the XOR problem. called backpropagation. The neural network of Figure 3. The original perceptron learning cannot be extended to the multilayered network because the output function is not differentiable.1 (it is of no help to use layers of ADALINE because the resulting output would be linear).2. Table 3. but which is differentiable.1 presents the forward computation for one sample (1. The one that is most often used successfully in multilayered network is the sigmoid function. In order to train a new learning rule is needed. For this. as used in the delta rule for the ADALINE. 31 .1 Multilayer Feedforward Networks This section deals with networks that use more than one perceptron arranged in layers. using the network of Figure 3.Chapter 3 Some Common Neural Networks Models 3. So a new activation function is needed that is nonlinear.1(a) with the following inputs and initial weights. one input layer with 4 samples and one neuron in the output layer.2 has one hidden layer with 2 neurons. otherwise nonlinearly separable functions could not be implemented. its effect in the output of the network has to be known and hence the error. This model is able to solve problems that are not linear separable such as in Figure 3.
5 x1 0 1 (a) Exclusive OR (XOR) (b) Linear separable class (c) Nonlinear separable Depressed Not Depressed Depressed Not Depressed Separation by Hyperplanes Clustering by Ellipsoids Mesure 2 Mesure 1 (d) Separable by twolayer of perceptron (e) Separable by feedforward (sigmoid) network.2: Feedforward network . psychological measures Figure 3.5 B B B B C B 0.CHAPTER 3.1: Classes of problems Figure 3. SOME COMMON NEURAL NETWORKS MODELS x2 0 1 A A A A B W decision line 32 A B C C C A A C C A B B A B A 0.
08 .53 Output 1/(1 + exp(−.48)∗ (−.40 .2+.19 .65)∗ (−.017) = .4.5∗ .147)∗ .1 + (−.48 1/(1 + exp(−.63) (0 − .65∗ (1 − .147)∗ .65 − .4 + (.28 −.63 (1 − . in the neural network literature (such as normalize the inputs to be in the interval 0 and 1.29 .47 .147 .1)) = .5 + (−.48 + .1: Forward computation Neuron 3 4 5 Input . Several heuristic rules have been suggested to improve its performance.CHAPTER 3.42 .2 + (. Table 3.3.3 gives the weight adjustment for this step of the algorithm.63) = −.6)) = .6 .2: Error backpropagation Neuron 5 4 3 Error .147) = .015) = −.25 −.147 .4 = .015)∗ 1 = .017 Table 3.1+.2 gives the calculus for the backpropagation of the error for the XOR example.2 + (−.48 = −.63 33 Table 3.48∗ (1 − .2+.65 = .015)∗ 1 = .65 1/(1 + exp(.147)∗ (−. SOME COMMON NEURAL NETWORKS MODELS Table 3.147)∗ (−.4) = .4 + (−.2.017)∗ 1 = .3: Weight adjustment Weight w45 w35 w24 w23 w14 w13 w05 w04 w03 Value −.4∗ .015 .3+.18 The performance of this update procedure suffer the same drawbacks of any gradient descent algorithm.4 + (−. initial weights chosen at random in the interval .3 + (. Table 3.015 .017 ∗ ∗ Adjust of tendency .1 .53)) = .4) = −.3 + (−017)∗ 1 = .
to include a momentum parameter to speed convergence etc. The weights are updated proportional to the gradient of D with respect to the weights.3) We concentrate on the choice of weights when one sample (or pattern) is presented to the network. zm = f (am ) m ∈ Ll . (3.CHAPTER 3.1. .1. The backpropagation algorithm Let L be the number of layers. SOME COMMON NEURAL NETWORKS MODELS 34 0. The different layers are denoted by L1 (the input later).4) If the function f is nonlinear.5.1. The network operates as follows.1. . The state or input or neuron j is characterized by zj . .1) where f is an activation function that has derivatives for all values of the argument. . When neurons i and j are connected a weight ωij is associated with the connection. M ). true value. In a multilayer feedforward network. The desired output. k ∈ LL (k = 1. the gradient descent algorithm for the multilayer network is as follows. L2 . (3.5) .2) and where the outputs of layer l − 1 is known. We want to adapt the initial guess for the weights so that to decrease: DLM S = k∈LL (tk − f (ak ))2 . The input layer assigns zj to neurons j in L1 . j ∈ Ll−1 ∂ωmj (3. j ∈ Ll 1 ≤ l ≤ L − 1. or target will be denoted by tk .1. . the weights ωmj is changed by an amount ∆ωmj = −α ∂D m ∈ Ll . then D is a nonlinear function of the weights. .5 and 0. For a neuron m in an arbitrary layer we abbreviate am = j∈Ll−1 ωmj zj 2 ≤ l ≤ L (3. . The output of the neurons in L2 are zk = f j∈L1 ωkj zj k ∈ L2 (3.) Here we only present the general formulation of the algorithm. . only neurons in subsequent layers can be connected: ωij = 0 ⇒ i Ll+1 . LM (the output layer). We assume the output has M neurons.
j ∈ LL−1 .CHAPTER 3. Neuron m calculate this quantity using ωpm . 2α [tp − f (ap )] f (ap )ωpm is the weighted sum of the errors sent from layer LL to neuron m in layer LL−1 . j ∈ LL−2 . In an analogous way it is possible to derive the updates of the weights ending in LL−1 : ∂ ∂ωmj tp − f 2 ωpg f (ωqr zr ) δgm δxj zr g∈%mclL−1 ∆ωmj = −α p∈LL g∈LL−1 = −α2 p∈LL [tp − f (ap )] (−f (ap )) (3. This weight update is done for each layer in decreasing order of the layers.6) = −α2 (tm − f (am )) f (am )zj m ∈ LL .7) = −α2 p∈LL (tp − f (ap )) f (ap )ωpm f (am )zj . Note that for the sigmoid (symmetric or logistic) and hyperbolic activation function the properties of its derivatives makes it easier to implement the backpropagation algorithm. j ∈ L1 .1.1.3) and (3. Multiply this quantity f (am ) and send it to neuron j that can calculate ∆ωmj and update ωmj . ∆ωmj = −α ∂ ∂ωmj p∈LL q∈LL−1 2 ωpq zq δpm δgj zg g∈%mclL−1 tp − f = −α2 p∈LL (tp − f (ap )) (−f (ap )) (3. This is the familiar backpropagation algorithm. m ∈ LL−1 . SOME COMMON NEURAL NETWORKS MODELS 35 where α is called the learning rate.2 Associative and Hopﬁeld Networks In unsupervised learning we have the following groups of learning rules: • competitive learning . m ∈ L2 . We assume that the neuron is able to calculate the function f and its derivative f .5).1.4) in (3. For the last and next to last layer the updates are given by substituting (3. The error 2α (tm − f (am )) f (am ) can be sent back (or propagated back) to neuron j ∈ LL−1 . The value zj is present in neuron j so that the neuron can calculate ∆ωmj . p ∈ Lp .1. 3. Now.1. until ∆mj .
associative memories. Figure 3. . Each vector is associated with the scalar (class) i. . The function of an associative memory is to recognize previous learned input vectors. in which each vector is associated with itself. . . Hebbian learning is based on the weight actualization due to neurophisiologist Hebb. This algorithm have been used in characteristics extraction.e. so that xi → y i . i. .CHAPTER 3. . x2 .3 summarizes these three networks.Autoassociative networks are a special subsets of heteroassociative networks. y i = xi for i = 1. n. Example of network that use these rules are: Hopﬁeld and PCA (Principal Component Analysis) networks. . cluster. In this section. introducing simple rules that allow unsupervised learning. The goal of the network is to identify the class of the input pattern. y 2 . Despite the simplicity of their rules. The function of the network is to correct noisy input vector. . .Pattern recognition (cluster) networks are a special type of heteroassociative networks. they form the foundation for the more powerful networks of latter chapters.Heteroassociative networks: map m input vectors x1 . Figure 3. the neurons compete among them to update the weights. y m in kdimensional space. Examples of networks that use these rules are ART (Adaptive Resonance Theory) and SOM (Self Organized Maps) networks. The neural networks we have discussed so far have all been trained in a supervised manner. We can distinguished three kinds of associative networks: . xm in pdimensional spaces to m output vector y 1 . .3: Type of associative networks . . signal extraction. even in the case where some noise is added. we start with associative networks. . . SOME COMMON NEURAL NETWORKS MODELS • Hebbian learning 36 When the competitive learning is used. . These algorithms have been used in classiﬁcation.
1) is W = X −1 Y (3. xp ) and the identity activation produces the activation y1×k = x1p Wp×k .2) that is for n observations.CHAPTER 3. .4: Associative network Let W = (ωij ) the p × k weight matrix.5) If n < p and the input vectors are linear independent the optimal solution is . one solution to obtain the weight matrix is the OLAM . Then Wp×k = Xp×n ◦ Ynk . . . The row vector x = (x1 . Figure 3.1) The basic rule to obtain the elements of W that memorize a set of n associations (xi y i ) is n Wp×k = i=1 i xi ◦ y1k p×1 (3.Optimal Linear Associative Memory.3) Although without a biological appeal.2. Figure 3.2. let X be the n × p matrix whose rows are the input vector and Y be the n × k matrix whose rows are the output vector.4) (3. if it is invertible. (3.2. X is square matrix.2. Therefore in matrix notation we also have Y = XW (3. SOME COMMON NEURAL NETWORKS MODELS 37 The three kinds of associative networks can be implemented with a single layer of neurons.2. the solution of (3. If n = p. .2.4 shows the structure of the heteroassociative network without feedback.
recognition and compression.2.eyes.5: Schematic faces for Hebbian learning. SOME COMMON NEURAL NETWORKS MODELS −1 38 Wp×k = Xpn Xnp Xpn Y (3. denoted by xi in which a given element corresponds to one of the features from Figure 3. The biological appeal to update the weight is to use the Hebb rule W = ηX Y (3.6.5 in an associative memory using Hebbian learning. and also function optimization (such as Hopﬁeld network).6) which minimizes W X − Y 2 and (X X)−1 X is the pseudoinverse of X. Each face is made of four features (hair. Consider the example (Abi et al. We will present some of this association networks using some examples of the literature. Each face is represented by a 4dimensional binary vector.nose and mouth).7) as each sample presentation where η is a learning parameter in 0 and 1.2. Figure 3. The association networks are useful for face. If the input vectors are linear dependent we have to use regularization theory (similar to ridge regression). 1999) where we want to store a set of schematic faces given in Figure 3.8) The ﬁrst face is store in the memory by modifying the weights to .2. taking the value +1 or −1 Suppose that we start with and η = 1. ﬁgure etc.CHAPTER 3. 0 0 0 0 W0 = 0 0 0 0 0 0 0 0 0 0 0 0 (3.
The concept was that all the neurons would transmit signals back and forth to each other in a closed feedback loop until their state become stable. Hopﬁeld network is recurrent and fully interconnected with each neuron feeding its output in all others.8 provides examples of a Hopﬁled network with nine neurons.12) .6: The features used to build the faces in Figure 3. = W9 + 1 1 −1 1 = −1 2 2 10 −2 1 6 −2 −2 10 2 0 0 2 0 0 . 0 2 −2 0 −2 +2 (3. (3.9) (3.5 1 1 1 1 1 1 1 1 1 1 .2. α = (101010101) β = (110011001) γ = (010010010). Figure 3. In its presentation we follow closely Chester (1993). SOME COMMON NEURAL NETWORKS MODELS 39 Figure 3. W1 = W0 + 1 1 1 −1 = 1 1 1 1 1 −1 −1 −1 −1 −1 the second face is store by −1 2 −1 2 W2 = W1 + −1 −1 1 1 = 1 0 1 0 and so on until 1 10 2 2 6 1 2 −10 2 −2 . The Hopﬁeld topology is illustrated in Figure 3.CHAPTER 3. The operation is as follows: Assume that there are three 9dimensional vectors α.2.10) W10 (3.11) Another important association network is the Hopﬁeld network.2.2. β and γ.7.
CHAPTER 3.2. β ∗ . The weight matrix is W = α∗ α∗ + β ∗ β ∗ + γ ∗ γ ∗ (3. γ ∗ .7: Hopﬁeld network architecture The substitution of −1 for each 0 in these vectors transforms then into bipolar form α∗ . α∗ = (1 − 11 − 11 − 11 − 11) β ∗ = (11 − 1 − 111 − 1 − 11) γ ∗ = (−11 − 1 − 11 − 1 − 11 − 1) .13) when the main diagonal is zeroed (no neuron feeding into itself).2.8 of the example. . This becomes the matrix in Figure 3.14) (3. SOME COMMON NEURAL NETWORKS MODELS 40 Figure 3.
CHAPTER 3. the network output converges to a stored vector. perhaps contributing to its convergence toward the spontaneous attractor in examples 1 and 3.) . identical to none of the three stored vectors. SOME COMMON NEURAL NETWORKS MODELS 41 From A B C D E F G H I A 0 1 1 1 1 1 1 3 3 B 1 0 3 1 1 1 3 1 1 C 1 3 0 1 1 1 3 1 1 D 1 1 1 0 3 1 1 1 1 E 1 1 1 3 0 1 1 1 1 F 1 1 1 1 1 0 1 1 1 G 1 3 3 1 1 1 0 1 1 H 3 1 1 1 1 1 1 0 3 I 3 1 1 1 1 1 1 3 0 Synaptic weight matrix. as individual nodes are updated one at a time in response to the changing states of the other nodes. the stable output is a spontaneous attractor.all of which may make the matrix somewhat quirky.8: Example of Hopﬁeld network (The matrix shows the synaptic weights between neurons. In example 2. State transitions are shown by arrows. In examples 1 and 3. Each example shows an initial state for the neurons (representing an input vector). nodes A − I Start A=1 B=1 C=1 D=1 E=1 F =1 G=1 H=1 I=1 a A=1 B→0 C=1 D→0 E→0 F →0 G=1 H→0 I=1 b A=1 B=0 C=1 D→1 E=0 F →1 G=1 H=0 I=1 Start A=1 B=1 C=0 D=0 E=1 F =1 G=0 H=0 I=0 a A=1 B=1 C=0 D=0 E=1 F =1 G = 00 H=0 I→1 Start A=1 B=0 C=0 D=1 E=0 F =0 G=1 H=0 I=0 a A=1 B=0 C→1 D=1 E=0 F →1 G=1 H=0 I→1 b A=1 B=0 C=1 D=1 E=0 F =1 G=1 H=0 I=1 Example 1 Example 2 Example 3 Figure 3. The density of 1bits in the stored vectors is relatively high and the vectors share enough is in common bit locations to put them far from orthogonality .
1) 42 The study of the dynamics of the Hopﬁeld network and its continuous extension is an important subject which includes the study of network capacity (How many neurons are need to store a certain number of pattern? Also it shows that the bipolar form allows greater storage) energy (or Lyapunov) function of the system. we ﬁrst demonstrate that the net gives the correct Y vector when presented with the x vector for either pattern A or the pattern .CHAPTER 3.15) To illustrate the use of a BAM. 1) · # # # · · # · · # · · · # # (1. We end this section with an example from Fausett (1994) on comparison of data using a BAM (Bidirectional Associative Memory) network. Consider the possibility of using a (discrete) BAM network (with bipolar vectors) to map two simple letters (given by 5 × 3 patterns) to the following bipolar codes: The weight matrices are: (to store A → −11) 1 −1 −1 1 1 −1 −1 1 1 −1 −1 1 −1 1 −1 1 −1 1 −1 1 1 −1 −1 1 −1 1 1 −1 −1 1 (C → 11) (W . The BAM network is used to associate letters with simple bipolar codes. to store both) −1 −1 0 −2 1 0 1 2 1 2 1 0 1 0 1 2 −1 −1 0 −2 −1 −1 −2 0 1 0 1 2 −1 −1 −2 0 −1 −1 −2 0 1 0 1 2 −1 −1 0 −2 −1 −1 −2 0 −1 −1 −2 0 1 2 1 0 1 1 0 2 (3. SOME COMMON NEURAL NETWORKS MODELS · # · # · # # # # # · # # · # (−1. These topics are more related to applications to Optimization than to Statistics.2.
the weight matrix is the transpose of the matrix W . namely. 16) → (−1.e. For signals sent from the Y layer to the Xlayer.19) = −2 2 −2 2 −2 2 2 2 2 2 −2 2 2 −2 2 → −1 1 −1 1 −1 1 1 1 1 1 −1 1 1 −1 1 This is pattern A.2. 1)W T = (−1.CHAPTER 3.16) INPUT PATTERN C −1111 −1 −11 −1 −11 −1 −1 −11 W = (14.2.. i. 0 0 2 0 0 −2 0 −2 −2 0 0 −2 −2 2 0 0 0 2 0 0 −2 0 −2 −2 0 0 −2 −2 2 0 WT = (3. 1) 0 0 2 0 0 −2 0 −2 −2 0 0 −2 −2 2 0 −2 2 0 2 −2 0 0 0 2 −2 0 0 0 2 (3. 1) 43 (3. we obtain (−1.20) = −2 2 2 2 −2 −2 2 −2 −2 2 −2 −2 −2 −2 2 → −1 1 1 1 −1 −1 1 −1 −1 1 −1 −1 −1 1 1 which is pattern C. (1.18) For the input vector associated with pattern A. The net can also be used with noisy input for the x vector. observe that the Y vectors can also be used as input. or both. we have (−1.2. 1) (3.2.1). namely. (1. SOME COMMON NEURAL NETWORKS MODELS C: INPUT PATTERN A −11 −11 −111111 −111 −11 W = (−14.17) To see the bidirectional nature of the net. if we input the vector associated with pattern C. the y vector. Similarly. 16) → (1. 1) 0 0 2 0 0 −2 0 −2 −2 0 0 −2 −2 2 0 0 0 2 0 0 −2 0 −2 −2 0 0 −2 −2 2 0 (3. 1)W T = (−1. .2.1).
c is the center and R the metric. . If the metric is Euclidean R = r2 I for some scalar radius and the equation simpliﬁes to h(x) = Φ (x − c)T (x − c) r2 . multiquadratic. The common functions are: Gaussian Φ(z) = e−z .9: Classiﬁcation by Alternative Networks: multilayer perceptron (left) and radial basis functions networks (right) Multilayer perceptron separates the data by hyperplanes while radial basis function cluster the data into a ﬁnite number of ellipsoids regions.). Figure 3. Multiquadratic Φ(z) = (1 + z)1/2 .9 for a hypothetical classiﬁcation of psychiatric patients.3 Radial Basis Function Networks The radial basis function is a single hidden layer feed forward network with linear output transfer function and nonlinear transfer function h(·) in the hidden layer.CHAPTER 3. Inverse multiquadratic Φ(z) = (1 + z)−1/2 and the Cauchy Φ(z) = (1 + z)−1 .3.1) where Φ is the function used (Gaussian. Many types of nonlinearity may be used. The most general formula for the radial function is h(x) = Φ (x − c)R−1 (x − c) (3.2) The essence of the difference between the operation of radial basis function and multilayer perceptron is shown in Figure 3. (3. SOME COMMON NEURAL NETWORKS MODELS 44 3.3. etc.
The ﬁnal output is (.8 0. Consider two measures (from questionnaires) of mood to diagnose depression.8 1.52) is D= (4.398 2 /2×(.2 0 −4 −3 −2 −1 0 1 2 3 4 Multiquadratic Gaussian Cauchi Inverse−quadratic Figure 3.64) + (.3.2 1 0. 2000).1) = .4 1.3. The Euclidean distance from this vector and the cluster mean of the ﬁrst hidden unit µ1 = (0.52)2 = . σ 2 = 1.0092) = 0. (3.10 illustrate the alternative forms of h(x) for c = 0 and r = 1.36)2 + (. .36. The Gaussian function was used for activation of the hidden layers.11 shows the network. (3.0544. A constant variability term. Suppose we have the scores x = (4. .) To see how the RBFN works we consider some examples (Wu and McLarty.14 − .6 0. the patient is classiﬁed as not depressed. 1. + (−1)(0.8.6) Since the output is close to 0. SOME COMMON NEURAL NETWORKS MODELS A simpliﬁcation is the 1dimensional input space in which case we have h(x) = Φ (x − c)2 σ2 .398.CHAPTER 3. (3.6 1.4) The output of the ﬁrst hidden layer is h1 (D2 ) = e−.8 − .25) + . . Figure 3. was used for each hidden unit.5) This calculation is repeated for each of the remaining hidden unit.4 0.3.64) + (543)(5.3.453. 0. 2 1. The algorithm that implemented the radial basis function application determine that seven cluster sites were needed for this problem.3) Figure 3.4). 45 (3.670)(−.453)(2.10: Radial basis functions h(.
CHAPTER 3.11: Radial basis network . SOME COMMON NEURAL NETWORKS MODELS 46 Figure 3.
In recent years there has been a considerable interest in the development and use of an alternative type of basis function: wavelets. where the basis consists of sines and cosines of differing frequencies. that is the function f is expressed as a linear combination of m ﬁxed functions (called basis function in analogy to the concept of a vector being composed as combination of basis vectors).1 Wavelets To present the wavelet basis of functions we start with a father wavelet or scaling function φ such that φ(x) = √ 2 k lk φ(2x − k) (3. Another is the Walsh expansion of categorical time sequence or square wave.4. SOME COMMON NEURAL NETWORKS MODELS The general form of the output is y = f (x) = m 47 wi hi (x). where the concept of frequency is replaced by that of sequence. Some wavelets results are ﬁrst presented. in particular in statistics. Some examples are: the Fourier expansion. Training radial basis function networks proceeds in two steps.4.4 Wavelet Neural Networks The approximation of a function in terms of a set of orthogonal basis function is familiar in many areas.CHAPTER 3. Therefore wavelets have signiﬁcant advantages when the data is nonstationary. First the hidden layer parameters are determined as a function of the input data and then the weights in between the hidden and output layer are determined from the output of the hidden layer and the target data. we will outline why and how neural network has been used to implement wavelet applications. The main advantage of wavelets over Fourier expansion is that wavelets is a localized approximator in comparison to the global characteristic of the Fourier expansion. Therefore we have a combination of unsupervised training to ﬁnd the cluster and the cluster parameters and in the second step a supervised training which is linear in the parameters. but other algorithms can also be used including SOM and ART (chapter 4). i=1 3. In this section. 3. which gives the number of “zero crossings” of the unit interval. Usually in the ﬁrst step the kmeans algorithm of clustering is used.1) .
(iii) the choice of further parameters appearing in the thresholding scheme.4. −∞ (3.6) ˆ where αk and βk are estimates of αk and βjk .4) for some coarse scale j0 and where the true wavelet coefﬁcients are given by ∞ ∞ αk = −∞ f (x)φj0 .k (x)dx βj.5) An estimate will take the form ∞ ∞ ˆ f (x) = k=−∞ αk φj0 .4. A mother wavelet is obtained through −∞ ψ(x) = √ 2 hk φ(2x − k) (3.1 and 3.3) The equations 3. for ﬁnite sample the number of parameters in the approximation). respectively.k (x) + j≥j0 k=−∞ βj.k (x) (3.k (x)dx.CHAPTER 3.k = f (x)ψj. (3. .k ψj. ˆ (ii) the choice of thresholding policy (which αj and βjk should enter in the exˆ ˆ pression of f (x).4.4. (iv) the estimation of the scale parameter (noise level). ψj. the coefﬁcients lk . the value of j0 . hk are lowpass and highpass ﬁlters. denoted {φj0 .2 are called dilatation equations.k ψ(x) (3.k (x) + ˆ j≥j0 k=−∞ ˆ βj.k (x)} ∪ {ψjk (x)}j≥j0 .k (x) = 2j0 /2 φ(2j0 x − k). We assume that these functions generate an orthonormal system of L2 (R). ˆ ˆ Several issues are of interests in the process of obtaining the estimate f (x): (i) the choice of the wavelet basis.4.4.4.2) where lk and hk are related through hk = (−1)k lk . SOME COMMON NEURAL NETWORKS MODELS ∞ 48 usually. For any f ∈ L2 (R) we may consider the expansion ∞ ∞ f (x) = k=−∞ αk φj0 . respectively.k (x) = 2j/2 ψ(2j x − k) for j ≥ j0 the coarse scale. normalized as φ(x)dx = 1.k with φj0 .
13 illustrates the localized characteristic of wavelets.9) the Mortet wavelet with mother function ψ(x) = eiwx e−x 2 /2 (3.4. (3.CHAPTER 3.8) Other common wavelets are: the Shannon with mother function sin(πx/2) cos (3πx/2) πx/2 ψ(x) = (3. some possibilities includes the Haar wavelet which is useful for categoricaltype data.4.12 shows the plots of this wavelets.4. k=0 βjk ψjk (x).4. Concerning the choice of wavelet basis.10) √ where i = −1 and w is a ﬁxed frequency. Figure 3. Figure 3. SOME COMMON NEURAL NETWORKS MODELS 49 These issues are discussed in Morettin (1997) and Abramovich et al (2000). It is based on φ(x) = 1 ψ(x) = and the expansion is then J 2J −1 0≤x<1 1 0 ≤ x < 1/2 −1 1/2 ≤ x < 1 (3.12: Some wavelets: (a) Haar (b) Morlet (c) Shannon .7) f (x) = α0 + j=0 . Figure 3.
SOME COMMON NEURAL NETWORKS MODELS Figure 3.) In (e) the wavelet is compared to a slowly changing section of the function.” says Yves Meyer. “You play with the width of the wavelet in order to catch the rhythm of the signal.CHAPTER 3. Figure 3. The product of the two is always positive.13: Wavelet operation In ﬁgure 3. giving the big coefﬁcient shown in (d). . In (c) the wavelet is compared to a section of the function that looks like the wavelet. giving the small coefﬁcients shown in (f). the area delimited by that function is the wavelet coefﬁcient. (The product of two negative functions is positive.13 a wavelet (b) is compared successively to different sections of a function (a). 50 Figure 3. The signal is analyzed at different scales. The product of the section and the wavelet is a new function.14: An Example of a Wavelet Transform. using wavelets of different widths.13 gives an example of a wavelet transform.
wavelets decomposition emerges as powerful tool for approximation. The ﬁnest resolution.CHAPTER 3. the wavelet component is decoupled from the estimation components of the perceptron. SOME COMMON NEURAL NETWORKS MODELS 51 Figure 3. the idea of combining both wavelets and neural networks has been proposed. b = k2−j A detailed discussion of this networks is given in Zhang and Benvensite (1992).7 . The neuron of a wavelet network is a multidimensional wavelet in which the dilatation and translations coefﬁcients are considered as neuron parameters. The output of a wavelet network is a linear combination of several multidimensional wavelets. See also Iyengar et al (1992) and Martin and Morris (1999) for further references. Because of the similarity between the discrete wavelet transform and a onehiddenlayer neural network. the wavelet theory and neural networks are combined into a single method.4.17 give a general representation of a wavelet neuron (wavelon) and a wavelet network Observation. In wavelet networks. differing by a factor of 2. We call this Radial Basis Wavelet Neural Networks which is shown in Figures 3. We have seen (Section 2. It is useful to rewright the wavelet basis as: Ψa.3. giving the smallest details is at the top.15(b).4. over 5 scales. In the second approach. In the ﬁrst. There are two main approaches to form wavelet network. . This has resulted in the wavelet network which is discussed in Iyengar et al (2002) and we will summarize here. (b) The wavelet transform of the signal. both the position and dilatation of the wavelets as well as the weights are optimized. In essence the data is decomposed on some wavelets and this is fed into the neural network.16 and 3. 3.15(a). On the other hand. At the bottom is the graph of the remaining low frequencies.1 shows: (a) The original signal. −∞ < b < ∞ with a = 2j .2 Wavelet Networks and Radial Basis Wavelet Networks.Kolmogorov Theorem) that neural networks have been established as a general approximation tool for ﬁtting nonlinear models from input/output data.b = a−1 x−b a a > 0. Figures 3.
CHAPTER 3. SOME COMMON NEURAL NETWORKS MODELS 52 (a) Function approximation (b) Classiﬁcation Figure 3.15: Radial Basis Wavelet Network Figure 3.16: Wavelon .
f (x) = N i=1 wi Ψ xi −ai bi 3. Cheng (1997) and Ciampi and Lechevalier (1997). Peng et al (1996). These types of problems can be made easier using Mixture of Experts or Modular Networks. Applications can be seen in Jordan and Jacobs (1994). SOME COMMON NEURAL NETWORKS MODELS 53 Figure 3. Figure 3. . In these type of network we assigned different expert networks to table each of the different regions.CHAPTER 3.5 Mixture of Experts Networks Consider the problem of learning a function in which the form of the function varies in different regions of the input space (see also Radial Basis and Wavelets Networks Sections). generalized additive and tree models in statistics (see Section 5. which also sees the input data.18 shows several possible decompositions These types of decompositions are related to mixture of models.2) Further details and references can be seen in Mehnotra et al (2000) and Bishop (1995). Desai et al (1997). to deduce which of the experts should be used to determine the output.17: Wavelet network . A more general approach would allow the discover of a suitable decomposition as part of the learning process. and then used an extra ”gating” network. OhnoMachodo (1995).
SOME COMMON NEURAL NETWORKS MODELS 54 (a) Mixture with gating (b) Hierarchical (c) Sucessive reﬁnement (d) Input modularity Figure 3.CHAPTER 3.18: Mixture of experts networks .
some examples are Balakrishnan et al (1994). They shown that these networks can perform are equivalent to perform: linear regression. Titterington (2004). Ennis et al (1998) and Schwarzer et al (2000) for logistic regression. projection pursuit.b. multivariate regression. Stern (1996). Neal (1996) and Lee (2004).1997). Books relating neural networks and statistics are: Smith (1993). De Veux et al (1998). Swanson and White (1995. Chryssolouriz et al (1996) and De Veux et al (1998). logit regression. Poli and Jones (1994). Asymptotics and further results can be seen in a series of papers by White (1989a.6 Neural Network and Statistical Models Interface The remaining chapters of this book apply these networks to different problems of data analysis in place of statistical models. Mangiameli et al (1996) Mingoti and Lima (2006) for cluster analysis and Tu (1996). probit regression. Warner and Misha (1996). multivariate etc) and others are given in the correspondingly chapters of the book. Further references on the interfaces: statistical tests. generalized linear model. (1996). LISREL (with mixed dependent variables). For Bayesian analysis for neural networks see MacKay (1992). the review papers Cheng and Titterington (1994). Vach et al. . and RiosInsua and Muller (1996).c. Schwarzer et al (2000) and Zhang (2007). Ripley (1996). Martin and Morris (1999). methods (eg survival. Other related paper are: Cheng and Titterington (1994) Sarle (1994). Lowe (1999).linear discrimination. Faraggi and Simon (1995). Theoretical statistical results are presented in Murtagh (1994). Kuan and White (1994). Paige and Butle (2001) and the review of Titterington (2004). Neal (1996). Bishop (1996). Attempts to compare performance of neural networks were made extensively but no conclusive results seems to be possible. Huang and Ding (1997). It is therefore convenient to review the statistical bibliography relating neural network and statistical models and statistical problems in general. SOME COMMON NEURAL NETWORKS MODELS 55 3. Statistical intervals for neural networks were studied in Tibshirani (1996). Schumacher et al (1996). 2004).CHAPTER 3. Jordan (1995). Ripley (1996). Pitfalls on the use of neural networks versus statistical modeling is discussed in Tu (1996). The relationship of Multilayer Feedforward Network with several linear models is presented by Arminger and Enache (1996). Lee (1998. generalized additive models with known smoothing functions. 1996).
c (4. The most extreme form of a competition is the Winner Take All and is suited to cases of competitive unsupervised learning. The change of weights is calculated according to wi (k + 1) = wl (k) + α(xi − wi (k)) 56 (4. Several of the network discussed in this section use the same learning algorithm (weight updating). we present some network based on competitive learning used for clustering and in a network with a similar role to that of multidimensional scaling (Kohonen map network). An input x is applied to all neurons.Chapter 4 Multivariate Statistics Neural Network Models 4.1 Competitive networks In this section.1 illustrates the architecture of the Winner Take All network.1) where d is a distance measure of yj from the cluster prototype or center. Assume that there is a single layer with M neurons and that each neuron has its own set of weights wj .1 Cluster and Scaling Networks 4. The node with the best response to the applied input vector is declared the winner... wj ) j=1.1.1.2) . The activation function is the identity. Figure 4. according to the winner criterion Ol = min d2 (yj .1.
1. d2 = 4.4 and d2 = 1. 1.8). 1.5.l position of the processing node j from the lth pattern. .5. 1 2 3 x4 = (1. C are initially chosen randomly.2)2 + (1. .1. 4 5 6 (4.8 − 0.2.1.1.1 .1: General Winner Take All network where α is the learning rate. x2 . We compare squared Euclidean distances to select the winner: d2 ≡ d2 (wj . 1)}. The connection strengths of A. x1 . 1.7. A. Squared Euclidean distance 1 between A and x1 : d2 = (1. t = 1 : Sample presented x1 = (1.7. B. We now present an example due to Mehrotra et al (2000) to illustrate this simple competitive network.CHAPTER 4. x5 = (0. 0.7 − 0.3 W (0) = w2 : 0.9 (4. x2 = (0.5). 0.1 − 0.3)2 = 4. . and may be decreasing at each iteration. 0).1 1 Similarly. 0. x6 . Let the set n consists of 6 threedimensional vectors. . 1.7 0.5).1 0. and are given by the weight matrix w1 : 0.1. 0.3) We begin a network containing three input nodes and three processing units (output nodes).5 and update weights equation 4. x6 = (1.4) w3 : 1 1 1 To simplify computations. 0.1 2.1 0. 1. il ) refers to the squared Euclidean distance between the current j. 0).8).1.5.2 0. B. MULTIVARIATE STATISTICS NEURAL NETWORKS 57 Figure 4. C.7)2 + (1.1. 3. x3 = (0. we use a learning rate η = 0. n = {x1 = (1. 1. 1.
1). d2 = 0.2.35. 0.2). 0.3). Similarly. d2 = 4. 0.7) (0.4 2. Winner A is updated to w1 (10) = (0.6.6. 0. d2 = 1. 0.8. (4.2. Winner A is updated to w1 (8) = (0.15).2.1.15) → (0. (4.9.3.1) → (0. hence 3 2.5) is presented. Winner B is updated to w1 (9) = (0.2.3).35. 6 At this stage the weight matrix is w1 : 0.3 W (12) = w2 : 0 0. d2 = 3.5. we have: t = 3 : Sample presented x3 = (0. 1.7.5. 0. 1.1 2.35. 0. 0. t = 5 : x5 = (0.3 0.35). 0.1) → (0.1 0.3) → (0.1 not perturbed by this sample whereas C moves halfway towards the sample (since η = 0. A and B are therefore 3.35.15).15) → (0.4 t = 2 : Sample presented x2 = (0. 0. 1. 0.55.05.35 .25 Node A becomes repeatedly activated by the samples x2 . 1. 0. Winner A is updated to w1 (11) = (0.2. t = 6 : x6 = (1. 0).3.5) w3 : 1. and x5 . 0.1 3.25). 0.5). winner A is updated: w1 (5) = 5 (0.5). 0.3.1). 0.3 W (1) = w2 : 0.5.8.2 0.55.3 B is the winner and is updated. 0.2.5). 1.3) . 5 t = 12 : Sample presented x6 . t = 4 : Sample presented x4 = (1.55.05 1.3) → (4.25.4 .05. . 0. 0. The resulting weight matrix is w1 : 0. 3. 0.1. 1 t = 8 : Sample presented x2 . 1. Winner C is updated and w3 (12) = (1. 1) is presented.2 1. The resulting modiﬁed weight vector is w2 : (0.1. x4 .5.2. . 6 t = 7 : x1 is presented. To interpret the competitive learning algorithm as a clustering process is attractive. 0.6) w3 : 1 1. 0. 1.45. node B by 2 4 5 x3 alone. 0.1 1.4.1. MULTIVARIATE STATISTICS NEURAL NETWORKS 58 C is the “winner” since d2 < d2 and d2 < d2 .5. 2 t = 9 : Sample presented x3 . 0. 0. d2 = 2. winner C is updated: w3 (6) = (1. hence 4 1. 0. 0.2).4.1.4 1.2 2 hence A is the winner.5. 0.CHAPTER 4. d2 = 0. 1. 0. 1.55. 0. 0).1. 3 t = 10 : Sample presented x4 . 0.2. winner C is updated: w3 (7) = (1. and i5 is (0. The weights of A is updated. 4 2 6 1 3 and convergence of the weight vector for node A towards this location is indicated by the progression (0. 1.4 3. 0. 0. 0. 4 t = 11 : Sample presented x5 . The centroid of x2 . but the merits of doing so are debatable: as illustrated by a simple example .25.2).7 0.1.2 2.2 1.5 is least.1 0.9 . and node C by x1 and x6 .35 1. 0.55 0. The resulting modiﬁed weight vector is w1 : (0.4 A is the winner and is updated: w1 : (0. d2 = 0.6.
1 thereafter. (3. Unsupervised learning and clustering can be useful preprocessing steps for solving classiﬁcation problems. x3 = (0. In the LVQ1 algorithm. 2). 3) and C = (5. 0). 1). 2.9 (4. B = (2. to distinguish it from more recent variants of the algorithm. 5. instead of writing out the entire weight matrix. making the assumption that each cluster has roughly the same number of patterns. This algorithm is referred to as LVQ1. 1. let the winner node j ∗ belong to class C(j ∗ ). the weight update rule uses a learning rate η(t) that is a function of time t. and illustrates how an unsupervised learning mechanism can be adapted to solve supervised learning tasks in which class membership is known for every training pattern. 0.8) w3 : 1 1 1 Since there are twice as many samples in Class 2 as in Class 1. and x6 = (1.5. 4. . and the remaining samples with Class 2. 1. Let η(t) = 0. . The number of nodes chosen for each class is roughly proportional to the number of training patterns that belong to that class. Only the change in a single weight vector in each weight update iteration of the network. 1. 1).5). 3. 5).5. (6) instead of the more natural grouping A = (0). The initial weight matrix is w1 : 0. and the other two nodes with Class 2.1. A learning vector quantizer (LVQ) is an application of winnertakeall networks for such tasks. 8).LVQ Here again we present the results Mehnotra et al (2000 pg 173).1 0.CHAPTER 4.25 until t = 12.5 until t = 6.2 0. Each node in an LVQ is associated with an arbitrarily chosen class label. The following example illustrates the result of running the LVQ1 algorithm on the input samples of the previous section.1. The winner j ∗ is moved towards the pattern i if C(i) = C(j ∗ ) and away from i otherwise.7 0.7.3 W (0) = w2 : 0. x2 = (0. 0. The data are cast into a classiﬁcation problem by arbitrary associating the ﬁrst and last samples with Class 1.1 1. and η(t) = 0. 1)} where x1 = (1. (x6 . 6). then η(t) = 0. we label the ﬁrst node (w1 ) as associated with Class 1. (x2 . 0. MULTIVARIATE STATISTICS NEURAL NETWORKS 59 were the procedure cluster the number 0. 0). 1. 0. . x5 = (0.2 Learning Vector Quantization . For further discussion on the relation of this procedure to ”kmeans clustering” see Mehrotra et al (2000 pg 168). 0). x4 = (1. . such as η(t) = 1/t or η(t) = a[1 − (t/A)] where a and A are positive constants and A > 1. When pattern i from class C(i) is presented to the network. The new updating rule may be paraphrased as follows.1. Thus the training set is: T = {(x1 .5). 0. . 6 in groups A = (0.5.
Sample x2 . Sample x6 .02).00. The following hand movements were studied: I . 0. 71 case of grade I transitional cell carcinoma of the bladder (TCCB) and 184 cases of grade II and grade III TCCB.37.42). Note that associations between input samples and weight vectors stabilize by the second cycle of pattern presentations.56).. winner w3 (distance 0. 161.05.57%. Sample x2 .52..30. w1 changed to (0.58).17.15. II .42% of the benign cells and 86. Figure 4. w1 changed to (1. 99 cases of benign prostatic hyperplasia. 162. 3 4.left/right above the level of the shoulder.58). III . winner w1 (distance 0.89).. 1 2. 0.20). 1. 0.CHAPTER 4.18. . w2 changed to (0. 1. Sample x1 . 1. Sample x4 .15). 0..internal/external rotation of the upper arm (humerus). although the weight vectors continue to change. . The data consisted of 470 samples of voided urine from an equal number of patients with urothelial lesions. winner w3 (distance 0.49. For the purpose of the study 45. 160. 1. 1. w3 changed to (0. At the patient level the LVQ network enable the correct classiﬁcation of 100% of benign cases and 95.52. 0. 0. Vuckovic et al (2002) use neural networks to classify alert vs drowsy states from 1’s long sequences of full spectrum EEG in an arbitrary subject. winner w3 (distance 0. w2 changed to (0.48). winner w2 (distance 0. 0.04. The ex . 5 cases of city carcinoma. 1. 0.2 presents the correct classiﬁcation percentual for the neural networks used.. 1. Maksimovic and Popovic (1999) compared alternative networks to classify arm movements in tetraplegics. 1.45). The training sample used 30% of the patients and cells respectively.56). w1 changed to (1.97. Sample x3 . w3 changed to (0. winner w1 (distance 0.33. The overall accuracy rate was 90.30). w3 changed to (0.33.17). 158. respectively.50). .18). 0. Sample x1 . 157.05.65.46.07).75% of the malignant cells. 0.73). The application enable the correct classiﬁcation of 95. Mehrotra et all also mentioned a variation called the LVQ2 learning algorithm with a different learning rule used instead and an update rule similar to that of LVQ1. winner w2 (distance 0. 156.60).95. MULTIVARIATE STATISTICS NEURAL NETWORKS 60 1. winner w3 (distance 0. 61 case of information. 2 3.05. winner w1 (distance 0.38).452 cells were measured. 159.38). 0.79).63% and 97. Some application of the procedure are: Pentapoulos et al (1998) apply a LVQ network to discriminate benign from malignant cells on the basis of the extracted morphometric and textural features. The study included 50 cases of lithiasis. w3 changed to (0. 0.6% of the malignant cases. 0. Another LVQ network was also applied in an attempt to discriminate at the patient level. Sample x4 .up/down proximal to the body on the lateral side. Sample x5 . Sample x6 .. 1.30. w1 changed to (1.57).33. winner w3 (distance 1. winner w1 (distance 0. Sample x3 .05. w3 changed to (0.
3 Adaptive Resonance Theory Networks . Two expert in EEG interpretation provided the expertise to train the ANN. The major difference between ART and other clustering methods is that ART allows the user to control the degree of similarity between members of the same cluster by means of a userdeﬁned constant called the vigilance parameter. For continuous input the ART2 network was developed with a more complex F1 layer to accommodate . 4. The three ANN used in the comparison are: one layer perceptron with identity activation (WidrowHoff optimization).1.CHAPTER 4. The LVQ network gives the best classiﬁcation.2: Success of classiﬁcation for all ANNs and all tested movement trials perimental data was collected on 17 subjects. The architecture of an ART1 network shown in Figure 4.37 ± 1.ART Adaptive Resonance Theory (ART) models are neural networks that perform clustering and can allow the number of cluster to vary. The F1 neurons and the F2 (cluster) neurons together with a reset unit to control the degree of similarity of patterns or sample elements placed on the same cluster unit.95% using the t statistics. MULTIVARIATE STATISTICS NEURAL NETWORKS 61 Figure 4.3 consists of two layer of neurons. We outline in this section the simpler version of the ART network called ART1. perceptron with sigmoid activation (LevenbergMarquart optimization) and LVQ network. The ART1 networks accepts only binary inputs. For validation 12 volunteers were used and the matching between the human assessment and the network was 94.
MULTIVARIATE STATISTICS NEURAL NETWORKS continuous input. F1 receives and holds the input patters. 3 (4. Let the vigilance parameter be ρ = 0. Consider a set of vectors {(1. 1. 0). 0.. otherwise a new cluster of patterns is formed around the new input vector. 1. 0. and bottomup weights are set to b1. 0. (1. 1. 0.. (1. 1. 0). (0.7. 0. 0.1 (0) = 1. (0. 0. 8 8 8 8 8 8 (4. 1. 0. 0. 1). 0). Given the ﬁrst input vector. The following example of Mehnotra et al (2000) illustrates the computations. 0. 0)} to be clustered using the ART1 algorithm. 1. We begin with a single node whose topdown weights are all initialized to 1. 1. 0. Since 7 l=1 tl.1. + × 0 + × 1 = . 1.1. 62 Figure 4. 1. 1).. The training algorithm is shown in Figure 4. 0. F2 responds with a pattern associated with the given input. 1. 1.9) and y1 is declared the uncontested winner. 1 i. the second layer. But if the difference is substantial.10) . 1. 1.3: Architecture of ART1 network The input layer. tl. 1. 1. Here n = 7 and 8 initially m = 1.e. and (1. 0. If this returned pattern is sufﬁcient similar to the input pattern then there is a match.7.1 xl 7 l=1 xl = 3 = 1 > 0. 1. then the layers communicate until a match is discovered.4 (Mehrotra et al (2000) or Fauzett (1994)).l (0) = . we compute y1 = 1 1 1 1 1 3 × 1 + × 1 + × 0 + .CHAPTER 4. 1.
.j (old) xl 1 . xn ) for learning. . Let the active set A contain all nodes.j ∗ xl . .5 + tl. + bj. (b) Compute s∗ = (s∗ .j ∗ (old) xl until A is empty or x has been associated with some node j. If A is empty. n+1 bj ∗ .l x1 + . . calculate yj = bj.CHAPTER 4. 2. then create a new node whose weight vector coincides with the current input pattern x.4: Algorithm for updating weights in ART1 .j ∗ (old) xl n ∗ l=1 tl. . − Initialize each bottomup weight bj. do 1. endwhile Figure 4. (a) Let j ∗ be a node in A with largest yj .j ∗ (new) = tl.l (0) = − while the network has not stabilized. . MULTIVARIATE STATISTICS NEURAL NETWORKS 63 − Initialize each topdown weight tl. .n xn for each node j ∈ A.j (0) = 1. with ties being broken arbitrarily. Present a randomly chosen pattern x = (x1 . s∗ ) where s∗ = tl. 3. . . .l (new) = 0. n 1 l (c) Compare similarity between s∗ and x with the given vigilance parameter ρ: n s∗ if l=1 l ≤ ρ then remove j ∗ from set A n l=1 xl else associate x with node j ∗ and update weights: tl.
5.5 + l tl.5 + 3 0 otherwise.5 0 1 4. Likewise.1 (0)xl . T (1) = 1 1 0 0 0 0 1 (4.5 0 1 4.1. 3.0.CHAPTER 4. because 7 tl. so that B(3) = B(2) and T (3) = T (2).l (1) = 0.5 0 0 1 4. This results in no change to the weight matrices.5 1 T 1 3. 1 1 1 B(1) = 0000 3.7. and bottomup weights equal to 0 in the positions corresponding to the zeroes in the sample. The new weight matrices are 1 3.14) When the third vector (1. A second node must hence be generated.5 3.1.1. 1. 7.1. and remaining new bottomup weights are equal to 1/(0.1 xl / l xl = 0 < 0.5 4. 2.2 xl / 7 xl = ≥ 0. and the second node is the obvious winner.2 xl = 4. MULTIVARIATE STATISTICS NEURAL NETWORKS the vigilance condition is satisﬁed and the updated weights are 1 1 = for l = 1.1. while each bottomup weight is obtained on dividing this quantity by 0.5 0 1 4. .15) are the node outputs.1. The second node’s weights l=1 l=1 5 are hence adapted.1.1. This generates y1 = 0.0). with each topdown weight being the product of the old topdown weight and the corresponding element of the sample (1. These equations yield the following weight matrices.1 (1) = tl.5 + 0 + 0 + 1 + 1 + 1 + 1 + 0). 1. but the uncontested winner fails to satisfy the vigilance threshold since l tl.5 and T (2) = 0 0 1 0 0 1 0 1 0 1 0 1 1 0 T (4.5 (4.5 b1.13) Now we present the second sample (0.12) .1. tl.1.0.1.1.1.5 B(2) = 0 1 3.11) (4. 0) is presented to this network.5 T T 64 (4. with topdown weights identical to the sample. 1. The vigilance test 4 succeeds. 1. y1 = 1 4 and y2 = 3. 0.7.0).5 3.
19) We now cycle through all the samples again. y1 = 2 3 and y2 = 3. with l tl.1.1.0.1.5 4.5 0 1 3. The resulting weight matrices are 1 3. The second node’s weights l=1 l=1 3 are hence adapted.5 .CHAPTER 4.2 xl / 7 xl = ≥ 0.0.0).1.5 B(5) = 0 1 5.18) are the node outputs.1. y1 = 1 3 and y2 = 3.1.5.1.17) When the ﬁfth vector (1. and the second node is the obvious winner.0. MULTIVARIATE STATISTICS NEURAL NETWORKS When the fourth vector (0. while each bottomup weight is obtained on dividing this quantity by 0.5 0 0 0 0 1 3. T (5) = 0 0 0 1 1 1 0 0 1 1 0 1 1 1 0 (4. The vigilance 3 test fails.5 1 T 1 3.2 xl / 7 xl = < 0.5 0 0 0 0 1 3.1. and the resulting weight matrices are 1 3.1.1.1. with each topdown weight being the product of the old topdown weight and the corresponding element of the sample (0.5 1 1 0 0 0 0 1 .5 0 1 3. The vigilance test 3 succeeds.5 0 1 T T 3.5 (4.5 4. T (4) = 0 0 T .5 0 1 3.5 + l tl.5 B(4) = 0 1 3. which is the new winner (uncontested).5 1 5.2 xl = 3.0. because 7 tl.1. l=1 5 A third node is hence created.1 xl / 7 xl = ≤ 0.7.0.0) is presented to this network. The active set A is hence l=1 l=1 5 reduced to contain only the ﬁrst node. because 7 tl.0) is presented to this network.7.5 1 3.16) are the node outputs.7.5 1 5.1. 2 The vigilance test fails with this node as well. but we opt for the same sequence as in the earlier cycle.5 65 (4. 1 0 0 0 0 1 0 1 0 1 1 0 (4.5 0 1 5. This can performed in random order.5 0 1 3.1. and the second node is the obvious winner. After the third .
5 0 0 0 0 0 0 1 3. In each group the diagnostic was obtained from a one hidden layer feedforward network. physically capable and bellow the age of 50) that answered the DSMIV (Diagnostic and Statistical Manual for Mental Disorders) and were submitted to neuropsychological tests. Rozenthal (1997. 1 0 0 1 1 1 0 0 (4. MULTIVARIATE STATISTICS NEURAL NETWORKS vector is present.5 1 1 0 0 0 0 1 0 .5 66 1 T T 3. Only symptoms and physical were used for constructing the neural networks and classiﬁcation. for comparison) indicate 2 important clusters. These patterns deﬁne such group of symptom or dimension of schizophrenic patients that are: those who have hallucination and disorderly thoughts and selfsteem (negative dimension) and those with poor speach and disorderly thoughts (disorganized dimension). T (8) = 0 0 0 1 1 1 0 . the weight matrices are modiﬁed to the following: 1 3. One of which.see also Rozenthal et al 1998) applied an ART neural network to analyse a set of data from 53 schizophrenic patients (not addicted.5 0 1 3.5 1 3.CHAPTER 4.1. Statistical models in literature shows sensibility of 49 to 100% and speciﬁcity of 16 to 86% but uses laboratory results. The neural networks of Santos (2003) e Santos et al (2006) used only clinical information.5 B(8) = 0 1 4. In this application an ART neural network identiﬁed three groups of patients. It seems to act as an attractor. and T (8) represents the prototypes for the given samples. This application of neural network (and a classical statistical method of cluster. There seems to exit at least three functional patterns in schizophrenia. low IQ and negative dimension.5 1 4.5 0 1 3.20) Subsequent presentation of the samples do not result in further changes to the weights. The network has thus stabilized. The neural networks used had sensibility of 71 to 84% and speciﬁcity of 61 to 83% which has slight better than classiﬁcation tree. For further details see Mehrotra et al (2000). The data consisted of 136 patients from the health care unit of the Universidade Federal do Rio de Janeiro teaching hospital referred from 3/2001 to 9/2002. Santos (2003) e Santos et al (2006) used neural networks and classiﬁcation trees in the diagnosis of smear negative pulmonary tuberculosis (SPNT) which account for 30% of the reported cases of Pulmonary Tuberculosis. The covariate vector contained 3 continuous variables and 23 binary variables. keeps itself stable when the number of clusters is increased. again.5 1 4. In that follows we describe some applications of ART neural networks. and also corresponds to .5 1 4.
5 P10 1.84±0.9 2.05 14.8 1.7 0.2 9.4 presents the results obtained.6 84 P2 7.1: Presentation of the sample according to level of instruction.9 12.59 80.6 P7 2.5 3. Patients (n=53) 22 19 12 IQ (m ± dp) 84.2 11.1 4.0 P5 4.91±07.3 11.05 1 1 P9 4.0 P8 0.1 RAVLT 8.1 P8 0.8 P11 4. IQ and age of onset Level of Instruction College High school Elementary no.8 RAVLT 8.13 16.6 84.67±3.2 5.64±5.8 P11 4.2 4.1 20.2 10.7 P3 7.8 P10 1.2 5.7 1.0 57.1 P4 0.75±08. MULTIVARIATE STATISTICS NEURAL NETWORKS 67 the more severe cases and more difﬁcult to respond to treatment.07 75.3 11.9 P2 7.CHAPTER 4.1 P5 4.5 Group I (n=20) Group II (n=3) Table 4.58 19.0 5.5 ±6.7 P6 11.7 84.6 Group I (n=20) Group IIa (n=20) Group IIb (n=13) .3: Three cluster patterns according to neuropsychological ﬁndings P1 77.7 2.05 3.2: Cluster patterns according to neuropsychological ﬁndings P1 77.05 100 P9 4.14.0 P3 7.7 2.6 P6 11. Table 4.4 11.88 age of onset (m ± dp) 21.0 24.2 4.1 P7 2.2 9.8 1.7 1.89 Table 4.1 29 P4 0.5 4. Tables 4.
CHAPTER 4. MULTIVARIATE STATISTICS NEURAL NETWORKS
Table 4.4: Four cluster patterns according to neuropsychological ﬁndings
P1 77.6 80.86 85.1 90.0 P2 7.7 0.81 1.4 7.7 P3 7.0 12.2 51.90 43.0 P4 0.05 6.8 12.0 6.7 P5 4.2 4.7 4.9 4.2 P6 11.2 10.9 11.6 12.5 P7 2.7 2.2 2.0 1.7 P8 0.05 1 1 1 P9 4.2 6.0 4.7 5.6 P10 1.8 2.0 1.7 1.6
68
Group I (n=20) Group IIc (n=14) Group IId (n=11) Group IIe (n=8)
P11 4.2 0.9 21.2 55.4
RAVLT 8.3 11.4 11.2 12
Chiu and al (2000) developed a selforganizing cluster system for the arterial pulse based on ART neural network. The technique provides at least three novel diagnostic tools in the clinical neurophysiology laboratory. First, the pieces affected by unexpected artiﬁcial motions (ie physical disturbances) can be determined easily by the ART2 neural network according to a status distribution plot. Second, a few categories will be created after applying the ART2 network to the input patterns (i.e minimal cardiac cycles). These pulse signal categories can be useful to physicians for diagnosis in conventional clinical uses. Third, the status distribution plot provides an alternative method to assist physicians in evaluating the signs of abnormal and normal automatic control. The method has shown clinical applicability for the examination of the autonomic nervous system. Ishihara et al (1995) build a Kansey Engineering System based on ART neural network to assist designers in creating products ﬁt for consumers´s underlying needs, with minimal effort. Kansei Engineering is a technology for translating human feeling into product design. Statistical models (linear regression, factor analysis, multidimensional scaling, centroid clustering etc) is used to analyse the feelingdesign relationship and building rules for Kansei Engineering Systems. Although reliable, they are time and resource consuming and require expertise in relation to its mathematical constraint. They found that their ART neural network enable quick, automatic rule building in Kansei Engineering systems. The categorization and feature selection of their rule was also slight better than conventional multivariate analysis.
4.1.4 Self Organizing Maps  (SOM) Networks
Self organizing maps, sometimes called topologically ordered maps or Kohonen selforganizing feature maps is a method closely related to multidimensional
CHAPTER 4. MULTIVARIATE STATISTICS NEURAL NETWORKS
69
scaling. The objective is to represent all points in the original space by points in a smaller space, such that distance and proximity relations are preserved as much as possible. There are m cluster units arranged in a one or two dimensional array. The weight vector for a cluster neuron serves as an exemplar of the input patterns associated with that cluster. During the selforganizing process, the cluster neuron whose weight vector matches the input pattern most closely (usually, the square of the minimum Euclidean distance) is chosen as the winner. The winning neuron and its neighboring units (in terms of the topology of the cluster neurons update their weights the usual topology in two dimensional are shown in Figure 4.5).
Figure 4.5: Topology of neighboring regions
The architecture of the Kohonen SOM neural network for one and twodimensional array and its topology is shown in Figures 4.64.7 and Figures 4.84.94.10 for the one dimensional and two dimensional array. Figure 4.11 presents the weight updating algorithm for the Kohonen SOM neural networks (from Fausett 1994) and Figure 4.12, the Mexican Hat interconection for neurons in the algorithm.
CHAPTER 4. MULTIVARIATE STATISTICS NEURAL NETWORKS
70
Figure 4.6: Kohonen selforganizing map.
* * * {* (*
{ } R=2
[#]
*) *} * *
[ ] R=0
( ) R=1
Figure 4.7: Linear array of cluster units
CHAPTER 4. MULTIVARIATE STATISTICS NEURAL NETWORKS
71
Figure 4.8: The selforganizing map architecture
Figure 4.9: Neighborhoods for rectangular grid
CHAPTER 4. MULTIVARIATE STATISTICS NEURAL NETWORKS
72
Figure 4.10: Neighborhoods for hexagonal grid
Step 0
Step 1 Step 2 Step 3 Step 4 Step 5 Step 6 Step 7 Step 8
Initialize weights wij Set topological neighborhood parameters. Set learning rate parameters. While stopping condition is false, do Steps 28. For each input vector x, do Steps 35. For each j, compute: D(j) = i (wij − xi )2 Find index J such that D(J) is minimum. For all units within a speciﬁed neighborhood of J and for all i: wij (new) = wij (old) + α[xi − wij (old)] Update learning rate. Reduces radius of topological neighborhood at speciﬁed times. Test stopping condition.
Figure 4.11: Algorithm for updating weights in SOM
CHAPTER 4. MULTIVARIATE STATISTICS NEURAL NETWORKS
73
Figure 4.12: Mexican Hat interconnections
Braga et al (2000) sampled 1000 samples with equal probability from two bivariate normal distributions with variances 1 and mean vectors µ1 = (4, 4),µ2 = (12, 12). The distribution and the data are shown in Figure 4.13 and 4.14.
16 14 Class 01 12 10 X2 Class 02 8 6 4 2 10 5 10 15 20 0 0 2 4 6 8 X1 10 12 14 16 Class 01
0.5 0.4 0.3 0.2 0.1 0
20
Class 02
0
Figure 4.13: Normal density function of two clusters
Figure 4.14: Sample from normal cluster for SOM training
The topology was such that each neuron had 4 other neighbouring neurons. The resulting map was expected to preserve the statistical characteristics, ie the visual inspection of the map should indicate how the data is distributed. Figure 4.15 gives the initial conditions (weights). Figures 4.16 4.18 presents the map after some interactions.
18: SOM weights. MULTIVARIATE STATISTICS NEURAL NETWORKS 74 Figure 4.17: SOM weights.16: SOM weights. 4500 iterations .15: Initial weights Figure 4. 300 iterations Figure 4. 50 iterations Figure 4.CHAPTER 4.
The classiﬁcation of patterns are shown in Figures 4.CHAPTER 4. 1 Sunday 4 Saturday 7 Monday to Friday 10 Monday to Friday 2 Sunday 5 Saturday 8 Monday to Friday 11 Monday to Friday 3 6 Saturday Monday to Friday 9 Monday to Friday 12 Monday to Friday Figure 4. The complete system is shown in Figure 4.19: Combined forecast system The classiﬁer used a SOM network with 12 neurons displaced in four rows and three columns. and those from the statistical models.20 and Figure 4. MULTIVARIATE STATISTICS NEURAL NETWORKS 75 Lourenco (1998) applied a combination of SOM networks.19. a predictive scheme and a procedure to improve the estimations. which is important specially in the pick hours. A fuzzy logic procedure uses climatic variables to improve the forecast.21 for the year of 1993. . The method establish the various load proﬁles and process climatic variables in a linguistic way.20: Week days associated with the SOM neurons The proposed method improved the correct method used in the Brazilian Eletric System. Figure 4. The ﬁnal model includes a classiﬁer scheme. The classiﬁer is implemented via an SOM neural network. The forecaster uses statistical forecasting techniques (moving average. exponential smoothing and ARMA type models). fuzzy set and clas¸ sical forecasting methods for the shortterm electricity load prediction.
In both cases SOM networks were ﬁrst used to extract the important features and cluster the patterns in an unsupervised manner. MULTIVARIATE STATISTICS NEURAL NETWORKS 76 1 4 7 2 5 3 6 10 Summer Summer 8 11 Spring Autumn Winter Winter 12 Winter 9 Figure 4. to perform class discovery and marker gene identiﬁcation in microarray data. In the process of class discovery. The classiﬁer using the structure information correctly classiﬁed previously unseen mutants with and accuracy of between 60 and 70 %. In this work the authors predicted the HIV resistance to two current protease inhibitors. which suggests an appropriate number of clusters. The approach integrates merits of hierarchical clustering with robustness against noise know from selforganizing approaches. The ability to predict the drug resistance of HIV protease mutants may be useful in developing more effective and long lasting treatment regimens. a predictor was constructed based on the structural features of the HIV proteasegrug inhibitor complex. The proposed algorithm applied to DNA microarray data sets of two types of cancers has demonstrated its ability to produce the most suitable number of clusters. a classiﬁer was constructed based on the sequence data of various drug resistent mutants. The prediction performance of the classiﬁers was measured by crossvalidation. Hsu et al (2003) describe an unsupervised dynamic hierarchical selforganizing approach. Several architectures were tested on the more abundant sequence of 68 % and a coverage of 69 %. the corresponding marker genes identiﬁed through the unsupervised . The problem was approached from two perspectives. A particular structure was represented by its list of contacts between the inhibitor and the protease.21: Seasons associated with the SOM neurons Draghici and Potter (2003) predict HIV drug resistance using. Further. This was followed by subsequent labelling based on the known patterns in the training set. the proposed algorithm identiﬁes corresponding sets of predictor genes that best distinguish one class from other classes. Next. This is more than two times better than the 33 % accuracy expected from a random classiﬁer. The best combination yielded an average of 85% coverage and 78% accuracy on previously unseed data. First. Multiple networks were then combined into various majority voting schemes.CHAPTER 4. Since drug resistance is a important factor inﬂuencing the failure of current HIV therapies.
3.31 5. two stage density and knearest neighbor. MULTIVARIATE STATISTICS NEURAL NETWORKS 77 algorithm also have a strong biological relationship to the speciﬁc cancer class. Prediction models built for the four clusters indicate that the prediction strength for the smaller cluster is generally low. Data set Design factors # of data sets 36 72 72 72 2. low. and the subdivisions are related to two of their original clusters. was able to determine three major and one minor cluster. average linkage.03 5. Ward´s method.6: Average distance between cluster The experimental results are unambiguous.54. the SOM network is superior to all seven hierarchical clustering algorithms commonly used today.6 or 8 variables. aberrant observations. A total of 252 empirical data set was constructed to simulate several levels of imperfections that include dispersion.7. 4. which contains three leukemia types.CHAPTER 4. Another test performed using colon cancer microarray data has automatically derived two clusters.33 5. med or high dispersion Irrelevant variables Basic data plus 1 or 2 irrelevant variables Cluster density Basic data plus 10% or 60% density Outliers Basic data plus 10% or 20% outliers Base data Table 4.5: Data set design Number of clusters 2 3 4 5 Average distance (units) 5. the .4 or 5 clusters. The algorithm tested on leukemia microarray data. nonuniform cluster. more detailed description in shown in Tables 4.78 Table 4. centroid method. therefore labelled as uncertain cluster. Further analysis shows that the uncertain cluster can be subdivided further. complete linkage. which is consistent with the number of classes in data (cancerous and normal). Mangiameh et al (1996) provide a comparison of the performance of SOM network and seven hierarchical cluster methods: single linkage. irrelevant variables. Furthermore.
2 Dimensional Reduction Networks In this section we will present some neural networks designed to deal with the problem of dimensional reduction of data. as the level of dispersion in the data increases. The SOM network frequently has average accuracy levels of 85% or greater.5% of the time. The SOM network ranks ﬁrst or second in accuracy in 87. For the high dispersion data sets. Of the 252 data sets investigated.2% of the time. Carvalho et al (2006) applied the SOM network to gamma ray burst and suggested the existence of 5 clusters of burst.dev. Additionally. cluster Range in dispersion std. Despite the relatively poor performance at these data conditions. and is ranked ﬁrst or second 91. the performance advantage of the SOM network relative to the hierarchical clustering methods increases to a dominant level.91 12 Table 4. This result was conﬁrmed by a feedfoward network and principal component analysis.8% of the data. 4. and then to proceed with the analysis. the SOM network did average 72% of observations correctly classiﬁed. while other techniques average between 35% and 70%. std. These ﬁnds seems to contradict the bad performance of SOM network compared to kmeans clustering method applyed in 108 data sets reported by Balakrishman et al (1994). These six data sets occurred at low levels of data dispersion.72 3 Medium 3. The SOM network is the most accurate method for 191 of the 252 data sets tested.3% of the data sets. which represents 75. only six data sets resulted in poor SOM results. In several occasions it is useful and even necessary to ﬁrst reduce the dimension of the data to a manageable size. MULTIVARIATE STATISTICS NEURAL NETWORKS Level of Avg. dev. the SOM network is most accurate 90. (in units) High 7.7: Basis for data set construction 78 performance of the SOM network is shown to be robust across all of these data imperfections.72 6 Low 1. . keeping as much of the original information as possible. The SOM superiority is maintained across a wide range of ”messy data” conditions that are typical of empirical data sets. with a dominant cluster containing 60% or more of the observations and four or more total clusters.CHAPTER 4.
1). For example. The redundance. The empty space phenomenon responsible for the curse of dimensionality is that highdimensional spaces are inherently sparse. the same (hyper) sphere contain only 0. etc. A related concept in dimensional reduction methods is the intrinsic dimension of a sample. – Correspondence Analysis. a new set of uncorrelated or independent variables should be found. MULTIVARIATE STATISTICS NEURAL NETWORKS 79 Sometimes. This is a loosely deﬁne concept and a trial and error process us usually used to obtain a satisfactory value for it. An important reason to reduce the dimension of the data is that some authors call: ”the curse of dimensionality and the empty space phenomenon”. to: − Many of the variables collected may be irrelevant . backward elimination. – Multidimensional Scaling. for a one dimensional standard normal N (0. 02% of the mass. The main techniques are: − Selection of variables: – Expert opinion. The curse of dimensionality phenomenon refers to the fact that in the absence of simplifying assumptions. I).CHAPTER 4. about 70% of the mass is at points contained in the interval (sphere of radius of one standard deviation around the mean zero. . the sample size needed to make inferences with a given degree of accuracy grows exponentially with the number of variables. which is deﬁned as the number of independent variables that explain satisfactorily the phenomenon. For a 10dimensional N (0. for example. a phenomenon which is in appearance highdimension is actually governed by few variables (sometimes called ”latent variables” or ”factors”). and a radius of more than 3 standard deviations is need to contain 70%. presented in the data collected related to the phenomenon. – Factor Analysis. − Many of the variables will be correlated (therefore some redundant information is contained in the data). best subset regression. can be due. − Using the original variables: – Principal Components. There are several techniques to dimensional reduction and can be classiﬁed in two types: with and without exclusion of variables. stepwise regression. – Automatic methods: all possible regressions.
GAM. MULTIVARIATE STATISTICS NEURAL NETWORKS – Nonlinear Principal Components – Independent Component Analysis – Others 80 We are concerned in this book with the latter group of techniques (some of which have already being described: Projection Pursuit. If X does not contain redundant information the dimension of U .1) which is call ”single value decomposition”. i.2. The dimension is called rank.4) (4. as xij . Multidimensional Scaling. .2. The dimension of X is (n x p).e.2.2. 4.5) (4. corresponding to n observations of p variables (n > p). whose diagonal entries are the singular values and are weigths indicating the relative importance of each dimension in U and V and are ordered from largest to smallest. the d matrix is a diagonal matrix. We refer to the data set. V e d will be equal the minimum dimension p of X.2) A retangular matrix X can be made into symmetric matrices (XX and X X) and their eigenstructure can be used to obtain the structure of X.3) (4.2. Any data matrix can be decomposed into its characteristic components: Xn x p = Un x p dp x p Vp x p (4.2. X. the V matrix ”summarizes” information in the columns of X. the data matrix as X. If 1 XX = U D and X X = V DV then X = U D 2 V . and the speciﬁc observation of variable j in subset i.1 Basic Structure of the Data Matrix A common operation in most dimensional reduction technique is the decomposition of a matrix of data into its basic structure. etc). If S is a symmetric matrix the basic structure is: S = U dV = U dU = V dV Decomposition of X.X or X . SVD.CHAPTER 4.X reveals the same structure: Xnxp = U dV XX(nxn) = U d2 U XX(pxp) = V d2 V (4. Now we turn to some general features and common operation on the data matrix to be used in these dimensional reduction methods. The U matrix ”summarizes” the information in the rows of X.
.2. Some dimensional reduction techniques share the same decomposition algorithm SVD.2. − Standardization zij = xij − xj ¯ Sj (4.2. − For categorical and frequency data x∗ = ij xij ( i xij j xij ) 1 2 = xij (x•j xi• ) 2 1 (4.CHAPTER 4.9) xij ( x2 ) 2 ij 1 (4.8) obtained by subtracting the row and column means and adding back the overall mean. MULTIVARIATE STATISTICS NEURAL NETWORKS 81 The columns of U and V corresponding to the eigenvalues Di are also related by: If Ui is an eigenvector of XX corresponding to Di then the corresponding eigenvector of X X corresponding to Di is equal or proportional to Ui X.2.11) from the mean corrected data.2. The following transformation will be used − Deviation from the mean: xij = xij − xj ¯ where xj is the column (variable j) mean.10) From these transformations the following matrices are obtained: − Covariance Matrix C= 1 ¯ ¯ (X − X) (X − X ) n (4.7) (4. ¯ − Unit length x∗ = ij − Doublecentering x∗ = xij − xi − xj + xij ¯ ¯ ¯ ij (4. They differ only in the predecomposition transformation of the data and their posdecomposition transformation of the latent variables.6) where xj and Sj are the column (variable j) mean and standard deviation.2.
In the statistical literature an early reference is Gnanadesikan (1999) and is closely related to Kernel principal components.2. .CHAPTER 4. 4.2. some of which can be computed using neural networks. R= − QCorrelation Matrix 1 Q = Zr Zr (4. The procedure is performed by applying a SVD on the CCorrelation Matrix or in the RCorrelation Matrix. latent variables) which are uncorrelated.2. and keeping most of the original information − these new variables can be reasonable interpreted. 4.2 Mechanics of Some Dimensional Reduction Techniques Now we outline the relation of the SVD algorithm to the dimensional reduction algorithm.2. 4.2 Nonlinear Principal Components Given the data matrix X.13) p where Zr indicates the matrix X standardized within rows. MULTIVARIATE STATISTICS NEURAL NETWORKS − RCorrelation Matrix 82 1 Z Zc (4. that is the data can be reasonably represented in lower dimension.12) n c where Zc indicates the matrix X standardized within columns in the number of observations.2. − to hope that a few of this new variables will account for the most of the variation of the data. Principal Curves and Informax method in Independent Component Analysis. The later will be outlined later.PCA When applying PCA in a set of data the object is: − to obtain from the original variables a new set of variables (factors. and p is the number of variables.2.1 Principal Components Analysis . Since PCA is not invariant to transformation usually the R matrix is used.2. this procedure consists in performing the SVD algorithm in some function φ(X).
− PCA is a procedure to decompose the correlation matrix without regard to an underlying model. where U is a diagonal matrix of the ”unique” variances associated with the variables. − The emphasis in FA is in explaining Xi as a linear function of hypothetical unobserved common factors plus a factor unique to that variable. The choice of method. A practical rule for deciding when each method is to be preferred is: CA is very suitable for discovering the inherent structure of contingency tables with variables containing a large number of categories. In the case of standardized variables. the FA model does not provide a unique transformation form variables to factors. has an underlying model that rest on a number of assumptions. LLM is particularly suitable for analysing multivariate contingency tables. Unique variances are that part of each variable’s variance that has nothing in common with remaining p − 1 variables.3 Factor Analysis . or as a matrix F . where the variables containing few categories. LLM analyse mainly the interrelationships between a set of variables. .2. In CA the data consists of an array of frequencies with entries fij . FA. including normality as the distribution for the variables.CA Correspondence analysis represents the rows and columns of a data matrix as points in a space of low dimension. CA or LLM depends of the type of data to be analysed and on what relations or effects are of most interest. Contrary to PCA.2. FA on the other hand ﬁnds a decomposition of the reduced matrix R − U .4 Correspondence Analysis . while the emphasis in PCA is expressing the principal components as a linear function of the Xi ’s. The main differences between PCA and FA are: − PCA decomposes the total variance. and it is particularly suited to (twoway) contingency tables. MULTIVARIATE STATISTICS NEURAL NETWORKS 4. it produces a decomposition of R. Loglinear model (LLM) is also a method widely used for analysing contingency tables.2.FA 83 Factor analysis has also the aim of reducing the dimensionality of a variable set and it also has the objective of representing a set of variables in terms of a smaller number of hypothetical variables.2.CHAPTER 4. on the other hand. where as correspondence analysis examines the relations between the categories of the variables. 4.
The row (X) and column (Y ) canonical scores are obtained as: Xi = Ui Y i = Vi 4.CHAPTER 4. This procedure summary row and column vectors. As this is between zero and one. fij the original cell frequency.2. For categorical data a commonly used dissimilarity is based on the simple matching coefﬁcient. d.15) (4. Alternative mapping and ordinal methods will not be outlined.2. of H. Given the data matrix X. The ﬁrst singular value is always one and the successive values constitute singular values or canonical correlations. − The third and ﬁnal step is to rescale the row (U ) and column (V ) vectors to obtain the canonical scores. In multidimensional scaling we are given the distances drs between every para of observations. One ˆ way out is to use Mahalanobis distance with respect to the covariance matrix Σ of the observations. the most common choice is the Euclidean distance: 2 d = XX This depend on the scale in which the variables are measured. fi• the total for row i and f•j the total for column j.5 Multidimensional Scaling Here we will describe the classical or metric method of multidimensional scaling which is closely related to the previous dimensional reduction method. the dissimilarity is found to be drs = 1 − crs . Since many of them produce measure of distances which do not satisfy the triangle inequality. U and V of H and a diagonal matrix of singular values. that is the proportion Crs . f•• fi• f•• f•j (4. − The second step is to ﬁnd the basic structure of the normalized matrix H with elements (hij ) using SVD.2. of features which are common to the two observations r and s.2.14) where hij is the entry for a given cell. there are several ways to measure the distance between pair of observations. For continuous data X. since it also os based on SVD algorithm.16) . the term dissimilarity is used. MULTIVARIATE STATISTICS NEURAL NETWORKS − The ﬁrst step is to consider the transformation: hij = fij fi• f•j 84 (4.2.
The steps are as follows: − From the data matrix obtain the matrix of distances T (XX or XΣX ). At a cocktail party.2. For mixture of continuous.3 Independent Component Analysis . The problem is to unmix or recover the original conversation from the recorded mixed conversation.2. This notation is consistent with traditional Multivariate Statistics.22 as discussed in Hyvarinen (1999). and the text next to the lines show the assumptions needed for the connection. The observed conversation consists of mixtures of true unobserved conversations.ICA The ICA model will be easier to explain if the mechanics of a cocktail party are ﬁrst described. say T ∗ .17) In the classical or metric method of multidimensional scaling. − Double centering the dissimilarity matrix T . there are p microphones that record or observe m partygower or speaker at n instants. At a cocktail party there are partygoers or speaker holding conversations while at the same time there are microphones recording or observing the speakers also called underlying source or component. Then drs = f f Irs df rs f Irs (4. The microphones do not observe the observed conversation in isolation. − Decompose the matrix T ∗ using the SVD algorithm. also known as principal coordinate analysis. The lines show close connections.CHAPTER 4. Only recently this has been recognized by ICA .1) so that every original feature is given equal weight. categorical and ordinal features a widely used prof cedure is: for each feature deﬁne a dissimilarity df and an indicator Irs which is rs f one only if feature f is recorded for both observation. 4. Further Irs = 0 if we have a categorical feature and an absence.absence. Before we formalize the ICA method we should mention that the roots of basic ICA can be traced back to the work of Darmois in the 1950 (Darmois 1953) and Rao in the 1960 (Kagan et al. we assume that the dissimilarities were derived as Euclidean distances between n points in p dimensions. The recorded conversation are mixed. 1973) concerning the characterization of random variables in linear structures. − From the p (at most) nonzero vectors represent the observations in the space of two (or three) dimensions. MULTIVARIATE STATISTICS NEURAL NETWORKS 85 For ordinal data we use the rank as if they were continuous data after rescaling to the range (0. The relation of ICA with other methods of dimension reduction is shown in Figure 4.
CHAPTER 4. MULTIVARIATE STATISTICS NEURAL NETWORKS
86
Figure 4.22: (The lines show close connections, and the text next to the lines show the assumptions needed for the connection.)
researchers (Common, 1994, Jutten and Taleb 2000, Fiori 2003) which regrets that this may have delayed progress. For the general problem of source separation we use the statistical latent variables model. Suppose we have p linear mixtures of m independent components: xi1 a11 .si1 + . . . + a1m .sim a11 . . . a1m si1 . . . = = . . . ... xip ap1 .si1 + . . . + apm .sim ap1 . . . apm sim xi = Asi (4.2.18) Actually the more general case of unmixing consider a nonlinear transformation and an additional error term ie: xi = f (si ) +
i
(4.2.19)
the model 4.2.18 is the base for blind source separation method. Blind means that we know very little or anything on the mixing matrix and make few assumptions on the sources si . The principles behind the techniques of PCA and FA is to limit the number of components si to be small that is m < p and is based on obtaining noncorrelated sources. ICA consider the case of p = m and obtain estimates of A and to ﬁnd components si that are independent as possible. After estimating the matrix A, we can compute its inverse, and obtain the independent component simply by: s = Wx Some of the characteristics of the procedure is: (4.2.20)
CHAPTER 4. MULTIVARIATE STATISTICS NEURAL NETWORKS
− We ﬁx the variances of the components to be one. − We cannot determine the order of the components.
87
− To estimate A at most one of the components can be Gaussian. The independent components must be Gaussian. The key to estimating the ICA model is nonGaussianity and independence. There are several methods: maximum likelihood and network entropy, mutual information and KulbackLeibler divergence, nonlinear cross correlations, nonlinear PCA, higherorder cumulants, weighted covariance matrix (Hyvarinen, 1999). Here we outline the Infomax principle of network entropy maximization which is equivalent to maximum likelihood (see Hyvarinen and Oja, 2000). Here we follow Jones and Porril (1998). Consider eq.4.2.20, s = W x. If a signal or source s has a cdf cumulative density function g, then the distribution of g(s) by the probability integral transformation, has an uniform distribution ie has maximum entropy. The unmixing W can be found by maximizing the entropy of H(U ) of the joint distribution U = (U1 , . . . , U p) = (g1 (s1 ) . . . gp (sp ) where si = W xi . The correct gi have the same form as the cdf of the xi , and which is sufﬁcient to approximate these cdfs by sigmoids: Ui = tanh(si ) Given that U = g(W x), the entropy H(U ) is related to H(x) by: H(U ) = H(x) + E(log ∂U ) ∂x (4.2.22) (4.2.21)
Given that we wish to ﬁnd W that maximizes H(U ), we can ignore H(x) in (4.2.22). Now ∂U ∂U ∂s =   = ∂x ∂s ∂x Therefore 4.2.22 becomes:
p p
g , (si )W 
i=1
(4.2.23)
H(U ) = H(x) + E(
i=1
g(s,i )) + logW 
(4.2.24)
The term E = (
p i=1 p
g(si )), given a sample of size n, can be estimated by: 1 g(si )) ≈ n
n p , loggi (si ) j=1 i=1 (j)
E(
i=1
(4.2.25)
CHAPTER 4. MULTIVARIATE STATISTICS NEURAL NETWORKS
88
Ignoring H(x), equation 4.2.24 yield a new function which differ from H(U ) by H(x). 1 h(W ) = n
n p , loggi (s∗ ) + logW  i j i
(4.2.26)
Deﬁning the cdf gi = tanh and recalling that g , = (1 − g 2 ) we obtain: 1 h(W ) = n
p
n
j=1 i=1
log(1 − s,2 ) + logW  i
(4.2.27)
whose maximization with respect to W yield: W = ∂H(W ) ∂h((W )) = = (W , )−1 + 2.s.x, ∂W ∂(W ) (4.2.28)
and an unmixing matrix can be found by taking small steps of size η to W ∆W = η((W ,−1 − 2.s.x, ) (4.2.29)
After reescaling, it can be shown (see Lee, 1998 p 41,45) that a more general expression is: ∆W ≈ (I − tanh(s)s, − ss, )W superGaussian (I + tanh(s)s, − ss, )W subGaussian (4.2.30)
Supergaussian random variables have typically a ”spiky” pdf with heavy tails, ie the pdf is relatively large at zero and at large values of the variable, while being small for intermediate values. A typical example is the Laplace (or doubleexponential) distribution. SubGaussian random variables, on the other hand, have typically a ”ﬂat” pdf which is rather constant near zero, and very small for large values of the variables. A typical example is uniform distribution. We end this section with two illustrations of the applications of ICA and PCA to simulated examples from Lee (1998, p 32). The ﬁrst simulated example involves two uniformly distributed sources s1 e s2 . The sources are linearly mixed by: x = As 1 2 x1 = x2 1 1 s1 s2 (4.2.31)
Figure 4.23 shows the results of applying of the mixing transformation and the application of PCA and ICA. We see that PCA only sphere (decorrelate) the data,
CHAPTER 4.) .23: PCA and ICA transformation of uniform sources (Topleft:scatter plot of the original sources. Topright: the mixtures. MULTIVARIATE STATISTICS NEURAL NETWORKS 89 Original Sources 2 1 2 1 Mixed Sources s2 0 −1 −2 −2 x2 −1 0 s1 PCA Solution 1 2 0 −1 −2 −4 −2 0 x1 ICA Solution 2 4 4 0. Bottomright: the recovered sources using ICA. Bottomleft: the recovered sources using PCA.5 −4 −2 0 u1 2 4 u2 0 −2 −4 −4 −2 0 u1 2 4 Figure 4.5 2 u2 0 −0.
2. third row: the recovere sources using PCA.0.2 2000 4000 6000 8000 10000 PCA Solution 0 0 1 0 5 2000 4000 6000 8000 10000 0.2 0 5 2000 4000 6000 8000 10000 ICA Solution 0 0 5 5 0 2000 4000 6000 8000 10000 0 2000 4000 6000 8000 10000 time samples time samples Figure 4.5 0.24: PCA and ICA transformation of speach signals (Top: the original sources.5 0 0. MULTIVARIATE STATISTICS NEURAL NETWORKS 90 while ICA not only sphere the data but also rotate it such that s1 = u1 and s2 = u2 ˆ ˆ have the same directions of s1 and s2 . bottom: the recovered sources using ICA.5 0 1 2000 4000 6000 8000 10000 0.5 Original Sources 0 0 .5 2000 4000 6000 8000 10000 Mixed Sources 0 0 1 0 1 2000 4000 6000 8000 10000 0. The signal are linearly mixed as the previous example in eq. channel # 1 channel # 2 0. The second example in Figure 4.) . second row: the mixtures.24 shows the time series of two speach signals s1 and s2 .5 0 0. 4.31.CHAPTER 4. The solution indicates that the recovered sources are permutated and scaled.
1991).4 PCA networks There are several neural networks architecture for performing PCA (See Diamantaras and Kung 1996 and Hertz. − F1 (x) = A.32) or x.33) The architecture with two layer and indentity activation function is as Figure 4. Figure 4.2.26. They can be: − Autoassociator type. The network is trained to minimize: n p (yik (x − xik ))2 i=1 k=1 (4.2.34) and its architecture is shown in Figure 4.2. MULTIVARIATE STATISTICS NEURAL NETWORKS 91 4.25.CHAPTER 4. .Autoassociative − Networks based on Oja (Hebbian) rule The network is trained to ﬁnd W that minimizes n xi − W .2.25: PCA networks . Kragh and Palmer. W xi 2 i=1 (4. (x − µ) (4.
) + 2L(yn yn ) 3. OjaKarhunen´s rule: Kl = 3D(yl yl.26: PCA networks architecture . Several update rules has been suggested.35) where xl and yl are the input and output vectors. . Sanger´s rule: Kl = L(yl yl. such as the following: 1.2. 2.36) To illustrate the computation involved consider the n = 4 sample of a p = 3 variable data in Table 4. all of each follows the same matrix equation: ∆Wl = ηl yl x. ηl the learning rate and Wl is the weight matrix at the lth iteration. MULTIVARIATE STATISTICS NEURAL NETWORKS 92 Figure 4.CHAPTER 4. L(A) is a matrix whose entries above and including the main diagonal are zero. . The other entries are the same as that of A.2. William´s rule: Kl = yl yl.l − Kl Wl (4. Kl is a matrix whose choice depends on the speciﬁc algorithm. ) where D((A)) is a diagonal matrix with diagonal entries equal to those of A.8.Oja rule. For example: D 2 3 2 0 = 3 5 0 5 L 2 3 0 0 = 3 5 3 0 (4.
9.8 93 The covariance matrix is: 0.11 −0.4 1.27: Oja´s rule . Table 4.084 eigenvectors V1 V2 V3 0.10 0.2 2.542 0.090 0.8: Data p = 3. n 1 2 3 4 x1 1.10 S 2 = 0. n = 4.965 0.7 4.169 0.0 x3 3.1 x2 3.01 5 −0.823 0.2.1 3.985 (4.10 0.832 0.01 −0.94 the eigenvalues and eigenvectors are shown in Table 4.9: SVD of the covariance matrix S 2 eigenvalue d1 d2 d3 0.8 3.CHAPTER 4. MULTIVARIATE STATISTICS NEURAL NETWORKS Table 4.126 0.115 0.11 1 0.6 4. Figure 4.3 1.026 0.1 4.10 −0.5 1.37) Now we will compute the ﬁrst principal component using a neural network architecture.553 0.
−0. 0.2. Since from Kolmogorov theorem.CHAPTER 4.651)) (4. −0.111.240.x2 . ϕ(W x)2 i=1 (4. 1.2. the weights becomes (−0. 5). The weight adjustment process continues in this manner.20.278. −0. 0. 0.1.5.27 W = (−0.4.231) .238.751) and (0. x5 by the end of the second iteration. 0. cycling through the input vectors x1 .2.28. 0.652) For the next input x2 = (0.0 for all values of l..7) we have: y = (0. 0.687) (4.2.3.231(0.30) − (−0.27)2 .43) (4. 0.3.028.44) where ϕ is a nonlinear function with scalar arguments. resulting in the ﬁrst principal component.278. . 0.2. −0.3) y = −0.7) = −0.2.(0.10. (0. 0.27(0.5 Nonlinear PCA networks The nonlinear PCA networks is trained to minimize: 1 n n xi − W .42) W = (0. 0. MULTIVARIATE STATISTICS NEURAL NETWORKS 94 We choose ηl = 1.41) 2 W = (−0.2. vowel. 0.40) Subsequent presentation of x3 .38) (4.5)) W = (0.293. −0.(0.3.351.346.008. 4.2.2. 0.316. a function ϕ can be approximated by a sum of sinoidals.313. −0. 0. Most of the applications are in face. 0.2.316.989) at the end of the third interaction it chances to (−0.004). . .w2 = 0. 0. 0. 0.804). 0. 4.697). the architecture of the nonlinear PCA network is as in Figure 4.105. −0. respectively. For the ﬁrst input x1 = (0. 0. 0.(0.7) − (−0. Application on face recognition is given in Diamantaras and Kung (1996). The later author also compare the alternative algorithms for neural computation of PCA with the classical batch SVD using the Fisher´s iris data.x4 and x5 change the weight matrix to (0.. 0.2. and select the initial values w1 = 0. Further applications are mentioned in Diamantaras and Kung (1996).39) (4.652) + W = (0. A comparison of the performance of the alternative rules are presented in Diamantaras and Kung (1996) and Nicole (2000).4.316.4 and w3 = 0.172. −0. 0.272. This process is repeated.2. 0. 0. −0.5) + W = (0.3.231 (4. speach recognition.278.
maximum likelihood and image analysis).CHAPTER 4. m (4. Here the network architecture is that of Figure 4.28: Non linear PCA network 4. x m .2. .etc) and a simulation result comparing their method to other four methods (factor analysis.p − k=1 fp−1. . .2. + (rp−1. MULTIVARIATE STATISTICS NEURAL NETWORKS 95 Figure 4. .k fpk )2 Sun and Lai (2002) report the result of an application to measure ability of 12 students (verbal. m + (r1p − k=1 m f1k fpk ) + (r23 − k=1 2 f2k f3k )2 + . The concept consists in representing the correlation matrix R = (rij ) by crossproducts (or outer product) of two factor symmetric matrices Fp x m and Fp. principal axes. . .numerical. .45) (r2p − k=1 f2k fpk )2 + .6 FA networks The application of artiﬁcial neural networks to factor analysis has been implemented using two types of architectures: A) Using autoassociative networks This network implemented by Sun and Lai (2002) and called SunFA.25 and when there are m common factors the network is trained to minimize the function m m E =(r12 − k=1 f1k f2k ) + (r13 − k=1 m 2 f1k f3k )2 + .
The data consisted on school grades of 8 students in four subjects (Mathematics. French and Latin). he also reported that the eigenvalues were equal within (±0. The second network used the architecture shown in Figure 4.29: Factor analysis APEX network Calvo obtained the results in Tables 4. .164 and 328 with factors matrices A.2.31. MULTIVARIATE STATISTICS NEURAL NETWORKS 96 Five correlation matrices were used for samples of size n = 82. Delichere and Demmi use an updating algorithm call GHA (Generalized Hebbian Algorithm) and Calvo the APEX (Adaptive Principal Component Extraction) see Diamantaras and Kung (1996). Ten sets of correlation submatrices where sampled from each correlation matrix.11 and Figures 4.b) and Calvo (1997).29. This procedure has been used by Delichere and Demmi (2002a.CHAPTER 4. They found that SunFA was the best method. Figure 4.01) to the results obtained on commercial software (SPPS). We outline Calvo´s application. B) Using PCA (Hebbian) networks Here an PCA network is used to extract the principal components of the correlation matrix and to choose the number of factors (components) m that will be used in the factor analysis. Then a second PCA network extract m factors choosen in the previous analysis.B an C.10 and 4. Biology. The ﬁrst PCA network shown that the ﬁrst component explained 73% of the total variance and the second explained 27%.6 and 4.
5 7 14 14 11 10 8 8 6 7 6 6 French 8.30: Original variables in factor space Figure 4.5 15 11.4059 0.2 0 −0.8 French 0.5 Table 4.5 14 12 5.2 −0.10: Scholl grades in four subjects Mathematics Biology French Latin Factor 1 0.5 5.5 1 good in languages −4 −4 −3 −2 −1 0 1 2 3 4 Figure 4.4759 0.5 15.5 7 8 9.5 12.6298 0.5 0 0.5 5.5554 0.3647 Table 4.5 8 11 5 Latin 9. MULTIVARIATE STATISTICS NEURAL NETWORKS 97 1 0.11: Loadings for the two factors retained .5284 0.6 −0.4 −0.5 14.CHAPTER 4.5431 Factor 2 0.4 0.8 −1 −1 Biology Latin 4 3 2 1 good in sciences F2 5 8 6 7 bad student 3 1 4 2 good student F1 0 −1 Mathematics −2 −3 −0.6 0.4477 0.31: Students in factor space Student 1 2 3 4 5 6 7 8 Math Biology 13 12.5 14.
CHAPTER 4.32.7 Correspondence Analysis Networks .46) Figure 4. . The obvious architecture is the same for SVD or PCA shown in Figure 4.2. MULTIVARIATE STATISTICS NEURAL NETWORKS 98 4.32: CA network Unfortunately there does not seem to be any application on real data yet.CA The only reference outlining how neural networks can be used for CA is Lehart (1997).2. Here we take: Zij = fij − fi• f•j fi• f•j (4.
e apply PCA) − obtain the independent components (eq. The training set contained 75% of the observation and the testing set 25%. ie NPCA).Lee (1998) compare classiﬁcation algorithms with a classiﬁcation based on ICA. Figure 4.34 presents the independent directions (basis vectors). 4 variables and 50 observations in each class.ICA The usual procedure in ICA follows the following steps: − standardize the data − whitening (decorrelate i. Figure 4. . each class refers to a type of iris plant.33: ICA neural network Now. i) Fisher iris´s data . we present some recent applications of ICA.8% and 4. MULTIVARIATE STATISTICS NEURAL NETWORKS 99 4. The classiﬁcation error in the test data was 3%. A simple classiﬁcation with boosting and a kmeans clustering gave errors of 4. The data set contains 3 classes.CHAPTER 4. Informax.12% respectively.33 where x = Qs + ε and the steps are: − whitening: vk = V xk − independent components: yk = W vk .8 Independent Component Analysis Networks . One possible architecture is shown in Figure 4.2.
34: An example of classiﬁcation of a mixture of independent components. There are 4 different classes. MULTIVARIATE STATISTICS NEURAL NETWORKS 100 Figure 4. The algorithm is able to ﬁnd the independent directions (basis vectors) and bias terms for each class. each generated by two independent variables and bias terms.CHAPTER 4. .
They consider a coupled model that incorporate three algorithms: i) ICA model is used for data reﬁnement and normalization.c2 . whose task is to allocate an object characterized by a number of measurements into one of several distinct classes. . . ii) Cluster (Kohonen Map) networks are used to select those patterns chosen by ICA algorithm that are close to the considered input data. These two mechanisms helped to select similar data and also to improve the system by throwing away unrelated data sets.CHAPTER 4.cL be labels representing each of the classes. iii) Finally a nonlinear regression model is used to predict future mutations in the CD4+. Here we deal with the classiﬁcation problem. CD8+ and viral load data to a particular patient and normalizes it with broather data sets obtained from other HIV libraries. . The effect of managerial actions could be analysed. xk ) and let c1 . . 4. . ii) ICA was applied to ﬁnd the fundamental factor common to all stores that affect the cashﬂow.. The author points out a series of advantage of their method over the usual mathematical modelling using a set of simultaneous differential equation which requires an expert to incorporate the dynamic behaviour in the equations. iii) Time series prediction. MULTIVARIATE STATISTICS NEURAL NETWORKS 101 Draghici et al (2003) implement a data mining technique based on the method of ICA to generate reliable independent data sets for different HIV therapies. The model accepts a mixture of CD4+. This can be of help in minimizing the risk in the investment strategy.CD8+ and viral loads. Further applications in ﬁnance are Cha and Chan and Chan and Cha (2001). ICA has been used to predict time series data by ﬁrst predicting the common components and then transforming back to the original series. Financial applications: the following are some illustration mentioned by Hyvarinen et al (2001): i) Application of ICA as a complementary tool to PCA in a study of stock portfolio. The ICA is used to isolate groups of independent patterns embedded in the considered libraries. image. individual. .. allowing the underline structure of the data to be more readily observed. Let an object (process. in the same retail chain. .3 Classiﬁcation Networks As mentioned before two important tasks in pattern recognition are patter classiﬁcation and cluster or vector quantization. etc) be characterized by the kdimensional vector x = (x1 . The book of Girolami (1999) also gives an application of clustering using ICA to the Swiss banknote data on forgeries.
CHAPTER 4. MULTIVARIATE STATISTICS NEURAL NETWORKS
102
The objective is to construct and algorithm that will use the information contained in the components x1 ,. . .,xk of the vector x to allocate this vector to one of the L distinct categories. Typically, the pattern classiﬁer or discrimination function calculates L functions (similarity measures), P1 , P2 ,. . .,PL and allocates the vectors x to the class with the highest similarity measures. Many useful ideas and techniques have been proposed for more than 60 years of research in classiﬁcation problems. Pioneer works are Fisher (1936) Rao (1949) and reviews and comparisons can be seen in Randys (2001), Asparoukhov (2001) and Bose (2003). Neural network models that are used for classiﬁcation have already been presented in previous sections. Figures 4.35 to 4.40 presents a taxonomy and examples of neural classiﬁers.
Figure 4.35: Taxonomy and decision regions (Type of decisions regions that can be formed by single and multilayer perceptrons with hard limiting nonlinearities. Shading denotes regions for class A. Smooth closed contours bound input distributions.)
CHAPTER 4. MULTIVARIATE STATISTICS NEURAL NETWORKS
103
Figure 4.36: Classiﬁcation by Alternative Networks: multilayer perceptron (left) and radial basis functions networks (right)
Figure 4.37: A nonlinear classiﬁcation problem
CHAPTER 4. MULTIVARIATE STATISTICS NEURAL NETWORKS
104
Figure 4.38: Boundaries for neural networks with tree abd six hidden units (note how using weight decay smooths the ﬁtted surface)
MULTIVARIATE STATISTICS NEURAL NETWORKS 105 GROUP DECISION REGIONS COMPUTING ELEMENT REPRESENTATIVE CLASSIFIERS B PROBABILISTIC A A DISTRIBUTION DEPENDENT GAUSSIAN MIXTURE SIGMOID HYPERPLANE MULTILAYER PERCEPTRON. CMAC x x EXEMPLAR EUCLIDEAN NORM KNEAREST NEIGHBOR.39: Four basic classiﬁer groups . BOLTZMANN MACHINE RECEPTIVE FIELDS (KERNELS) KERNEL METHOD OF POTENCIALS FUNCTIONS.CHAPTER 4. LVQ x x x x Figure 4.
KERNEL DISCRIMINANT LOCAL x x NEAREST NEIGHBOR EUCLIDEAN NORM KNEAREST NEIGHBOR.40: A taxonomy of classiﬁers . MULTIVARIATE STATISTICS NEURAL NETWORKS 106 GROUP DECISION REGIONS COMPUTING ELEMENT REPRESENTATIVE CLASSIFIERS B PROBABILISTIC A A DISTRIBUTION DEPENDENT GAUSSIAN. LEARNING VECTOR QUANTIZER x x x x THRESHOLD LOGIC RULE FORMING BINARY DECISION TREE. HIGHORDER POLYNOMIAL NET GLOBAL KERNEL RADIAL BASIS FUNCTION. GAUSSIAN MIXTURE SIGMOID MULTILAYER PERCEPTRON. HYPERSPHERE Figure 4.CHAPTER 4.
The ﬁve ratios are net cash ﬂow to total assets. The ﬁrst fou rates are deﬂated. Table 4. The normalization of data is in the form: x∗ = x/x. the validation data set (26 nonbankrupt companies and 14 bankrupt companies) and the testing data (30 nonbankrupt companies and 8 bankrupt companies). PNN and EF. FF denotes feedforward network.probabilistic neural networks in bankruptcy prediction. PNN*. PNN* is probabilistic neural networks without data normalized. and the trend in total reserves calculated on the ratio of change from year to year. Four methods of classiﬁcation were tested: FISHER denotes Fisher discriminant analysis.12: Number and percentage of Companies Correctly Classiﬁed Overall n=38 27 (71 %) 28 (74 %) 25 (66 %) 28 (74 %) Nondeﬂated data Nonbankrupt n=30 20 (67 %) 24 (80 %) 24 (80 %) 24 (80 %) Bankrupt n=8 7 (88 %) 4 (50 %) 1 (13 %) 4 (50 %) Overall n=38 33 (87 %) 30 (79 %) 26 (68 %) 32 (84 %) Deﬂated data Nonbankrupt n=30 26 (87 %) 30 (100 %) 24 (80 %) 27 (90 %) Bankrupt n=8 7 (88 %) 0 (0 %) 2 (25 %) 5 (63 %) FISHER BP PNN PNN* . The data consist of ﬁve ratios to classify oil and gas companies into bankrupt and nonbankrupt groups. current abilities to total reserves.CHAPTER 4. PNN means probabilistic neural networks. MULTIVARIATE STATISTICS NEURAL NETWORKS 107 Here we give some examples of classiﬁcation problems solved using neural networks. We start with an application of PNN . There are two separate data sets one with deﬂated ratios and one without deﬂation. total debt to total assets. Table 4. Yang at al (1999) using data from 122 companies for the period 1984 to 1989 built on warning bankrupt model for the US oil and gas industry. The 122 companies are randomly divided into three sets: the training data set (33 nonbankrupt companies and 11 bankrupt companies). exploration expenses to total reserves.12 presents the results and we can see that the ranking of the classiﬁcation model is: FISHER.
a network of individual networks with two outputs. with signiﬁcant psychological overlay). Of the 50 patients in each class 25 were selected for the training data and the remaining for the test set.13 and 4. The pacient data was analysed using two different subset of the total data. The ﬁrst data set (full assessment) contained all 145 tick sheet entries. due to tumour. feedforward with two output (for the four class diagnostic). The data set referred to 200 patients with low back disorders. The radial basis networks were designed to recognize all four diagnostic classes.14. The second data used a subset of 86 tick sheet entries for each patient. Two other classiﬁcation methods and CCAD (computedaided diagnosis) system were also tested in these data. SPARTH (spinal pathology. inﬂammation or infection). AIB (abnormal illness behaviour. All this networks had one hidden layer. three group of clinicians and some statistical classiﬁcation methods in the diagnostic of low back disorders. and mean of 10 feedforward networks runs after training. CCAD and MLP is quite good. There were 50 patients for each of the four diagnostic class. Low back disorder were classiﬁed in four diagnostic category: SLBP (simple low backpain). radial basis functions. ROOTP (nerve root compression). . positions of the centers and different choice of basis functions were all explored. MULTIVARIATE STATISTICS NEURAL NETWORKS 108 Bounds and Lloyd (1990) compared feedforward network.CHAPTER 4. This corresponds to all information including special investigations. The effect of the number of centers. These entries correspond to a very limited set of symptoms that can be collected by paramedic personnel. A summary of the results is shown in Table 4. For each patient the data was collected in a tick sheet that listed all the relevant clinical features. The neural network used were: feedforward network for individual diagnostic (y = 1) and (y = 0) for remaining three. The performance of the radial basis function.
MULTIVARIATE STATISTICS NEURAL NETWORKS 109 Table 4.13: Symptomatic assessment: Percentage test diagnoses correct Classiﬁcation Method Clinicians Bristol neurosurgeons Glasgow orthopaedic surgeons Bristol general pratictioners CCAD MLP Best 50202 Mean 50202 Best 5002 Mean 5002 Radial basis functions Closest class mean (CCM) (Euclidean) 50 Inputs 86 Inputs Closest class mean (CCM) (scalar product) 50 Inputs (K=22) 86 Inputs (K=4) K Nearest Neighboor (KNN) (Euclidean) 50 Inputs (K=22) 86 Inputs (K=4) K Nearest Neighboor (KNN) (scalar product) 50 Inputs (K=3) 86 Inputs (K=6) Simple low back pain 60 80 60 60 52 43 44 41 60 Root pain 72 80 60 92 92 91 92 89 88 Spinal pathology 70 68 72 80 88 87 88 84 80 Abnormal illness behaviour 84 76 80 96 100 98 96 96 96 Mean 71 76 68 82 83 80 80 77 81 56 68 92 88 88 84 96 96 63 84 16 92 96 84 44 60 96 84 63 80 88 92 84 80 76 84 84 80 83 84 24 100 100 84 56 56 92 84 68 81 .CHAPTER 4.
14: Full assessment: Percentage test diagnoses correct Classiﬁcation Method Clinicians Bristol neurosurgeons Glasgow orthopaedic surgeons Bristol general pratictioners CCAD MLP Best 50202 Mean 50202 Best 5002 Mean 5002 Radial basis functions Closest class mean (CCM) (Euclidean) 85 Inputs 145 Inputs Closet class mean (CCM) (scalar product) 85 Inputs (K=22) 145 Inputs (K=4) K Nearest Neighbour (KNN) (Euclidean) 85 Inputs (K=19) 145 Inputs (K=4) K Nearest Neighbour (KNN) (scalar product) 85 Inputs (K=19) 145 Inputs (K=4) Simple low back pain 96 88 76 100 76 63 76 59 76 Root pain 92 88 92 92 96 90 92 88 92 Spinal pathology 60 80 64 80 92 87 88 85 92 Abnormal illness behaviour 80 80 92 88 96 95 96 96 96 Mean 82 84 81 90 90 83 88 82 89 100 100 92 88 72 76 88 88 88 88 12 100 92 72 48 68 100 80 63 80 100 96 84 88 80 80 80 80 86 86 48 92 96 92 52 72 92 76 68 83 . MULTIVARIATE STATISTICS NEURAL NETWORKS 110 Table 4.CHAPTER 4.
2006) uses feedforward network in the diagnostic of Hepatitis A and Pulmonary Tuberculosis and gives a comparison with logistic regression. MULTIVARIATE STATISTICS NEURAL NETWORKS 111 Santos et al (2005.2 order loglinear model). 2815 individuals from a survey sample from regions in Rio de Janeiro state. LDF .CHAPTER 4. A comparison was made using multilayer perceptron and CART (Classiﬁcation and Regression Tree) model. housing environment. From the 66 collected variables in the study. Table 4.6. . For a test sample of 762 individual randomly chosen. Table 4. There were 23 variables in total. seven variables reﬂecting information on the individuals. The test sample had 45 patients.Independent Binary Model Behaviour. The hepatitis A data (Santos et al 2005).Universidade Federal do Rio de Janeiro).17 reproduces one of their three tables to indicate the methods used (They also have a similar table for large (1517) and moderate (10) number of variables).15 presents the results for the multilayer perceptron and a logistic regression model: Table 4. The soroprevalence of hepatitis was 36.16 summarizes the results in the test data.15: Sensitivity (true positive) and speciﬁcity (true negative) Errors Sensitivity (%) Speciﬁcity (%) Model MLP LR 70 52 99 99 The Pulmonary Tuberculosis data (Santos et al 2006) consisted of data on 136 patients of the Hospital Universitario (HUCFF . They suggest that: − all the effective classiﬁers for large sample are traditional ones (IBM . Table 4. Table 4.16: Sensitivity x Speciﬁcity Errors Sensitivity (%) Speciﬁcity (%) Model MLP CART 73 40 67 70 A general comparison of classiﬁcation methods including neural networks using medical binary data is given in Asparoukhov and Krzanowski (2001). LLM .Linear Discriminating Function. and socioeconomic factors.
prior probabilities are equal to the class individuals’ number divided by the design set size.CHAPTER 4. MULTIVARIATE STATISTICS NEURAL NETWORKS 112 Table 4.L is order of the procedure. kNNHills(L) and lNNHall(L) .equal prior probabilities. LVQ(c). = = . MLP(h).c is number of codebooks per class.h is hidden layer neuron number. .17: Leaveoneout error rate (multiplied by 103 for footnotesize (6) number of variables1 Discriminant procedure Data set Psychology = = Pulmonary = = Thrombosis = = Epilepsy = = Aneurysm = = Linear classiﬁers IBM LDF LD MIP Quadratic classiﬁers Bahadur(2) LLM (2) QDF Nearest neighboor classiﬁers kNNHills(L) kNNHall(L) Other nonparametric classiﬁers Kernel Fourier MLP(3) LVQ(c) 1= 239 244 229 283 223 266  147 147 147 139 139 167 139 228 272 302 284 275 343 325 168 147 193 186 147 186 101 23 53 28 21 46 25 23 219 206 215 201 207 207 152 151  139 144  243 250 265 343 304 314 197 153 153 155 116 178 23 23  21 21  273(2) 230(2) 266(1) 217(2) 187(1) 166(1) 185(1) 144(1) 265(1) 257(1) 294(1) 324(1) 153(2) 102(2) 217(1) 124(1) 363(1) 23(1) 306(1) 21(1) 247 271 234 223 207 190(6) 166 166 144 154 139 146(6) 243 316 363 373 284 226(6) 143 177 124 147 132 109(6) 23 23 21 21 25 37(4) .
03 43.57 44.18: Correct Classiﬁcation .88 78.88 33. Table 4. The result on the test data is shown in Table 4. The dependent variable is whether the credit is paid off or not.88 30.96 35.197 lda g % total 79.95 38.83 % bad 33. Using data from three credit union data.58 81. The predictor variable are sex.68 44.73 44.57 81.49 80.89 50. MULTIVARIATE STATISTICS NEURAL NETWORKS 113 − most of the effective classiﬁers for footnotesize samples are non traditional ones (MLP . Desai et al (1997) explore the ability of feedforward network.08 80.01 81.70 83. The results indicates that the mixtureofexperts performns better in classifying bad credits.29 40.191 % bad 28.97 28.42 81.92 43.40 83.19 1 2 3 4 5 6 7 8 9 10 average pvalue1 1 The pvalues are for onetailed paired ttest comparing mlp g results with the other three methods.20 82. 918 and 853 observation.46 0.60 80.80 83. and linear discriminant analysis in building credit scoring models in an credit union environmental.80 80.20 respectively.15 35.35 82.82 78.04 53.04 36.58 80. MIP .31 47.96 32.97 33. Arminger et al (1987) compared logistic discrimination regression tree and neural networks in the analysis of credit risk.39 0.98 % bad 27.CHAPTER 4. The data for each credit union contain 962. Table 4.44 79.55 40.93 36.30 79. marital status.21 32. The results for the generic models (all data) and the customized models for each credit union is presented in Table 4.18.49 80.19: Generic models Data set Percentage correctly classiﬁed mlp g mmn g % total % bad % total 79. and car and telephone holder.Mixed Integer Programming).00 79.12 82.29 80.60 82.35 81.95 41.80 81. LVQ .73 33.24 32.33 79.73 80.76 0.82 81.46 38.Learning Vector Quantization Neural Network.51 62.04 42.37 39.Credit data Good Credit Bad Credit Overall MLP 53 70 66 LR 66 68 67 CART 65 66 66 For the lender point of view MLP is more effective.90 32.35 81.73 80.67 38.55 35. . One third of each were used as test.30 81.85 37.08 0.70 0. split randomly in 10 datasets.61 39.12 0.63 31. true of employment.69 80.04 81.33 34.64 38.75 42.35 82.43 79.50 80.19 and 4.87 80.97 41.06 40.035 lr g % total 80.40 82.59 81.28 39.73 80.82 37.Multilayer Perceptron.06 81. mixtureofexpert networks.
57 21.76 86.05 18.00 34.30 82.33 85.97 71.56 23.57 77.43 90.93 37.13 27.60 86.20: Customized models Data set Sample Percentage correctly classiﬁed mlp c lda c % total % bad % total % bad 83.86 16.80 89.36 87.7E7 lr c % total 82.00 80.14 22.26 21.10 80.61 67.77 26.29 29.53 79.33 80.37 21.22 12.50 84.40 81.00 75.40 ?? 83.84 38.74 80.03 79.55 75.40 21.40 88.40 90.43 88.89 31.60 75.37 78.CHAPTER 4.88 76.37 19.20 89.69 67.0007 Credit union L: 1 2 3 4 5 6 7 8 9 10 Credit union M: 1 2 3 4 5 6 7 8 9 10 Credit union N: 1 2 3 4 5 6 7 8 9 10 average pvalue1 1 14.37 21.45 26.15 81.95 19.47 23.59 24.17 77.90 88.22 81.64 31.60 85.90 74.37 21.20 85.84 19.85 31.25 47.40 86.00 88.41 75.40 87.84 29.05 19.61 49.98 The pvalues are for onetailed paired ttest comparing mlp c results with the other two methods.59 77.38 16.77 75.35 0.45 17. .76 85.31 81.26 17.50 86.86 82.14 86.13 80.00 82.14 22.43 78.59 27.33 39.58 85.03 75.63 75.71 76.40 75.90 81.14 80.49 5.37 88.52 74.43 16.89 31.71 83.41 25.59 08.61 89.72 79.74 78.38 72.018 16.58 85.22 12.72 83.64 31.38 37.90 84.71 41.70 81.90 80.88 17.14 19.14 29.04 86.50 82.03 41.14 83.63 74.33 87.03 32.00 32.03 73.43 88.61 85.90 77.55 82.40 90.65 27.11 37. MULTIVARIATE STATISTICS NEURAL NETWORKS 114 Table 4.58 35.90 76.02 18.30 28.19 87.35 70.77 25.14 44.10 82.14 28.55 78.50 75.00 31.90 75.43 89.36 79.59 83.28 22.26 49.33 10.67 18.98 24.00 81.11 24.05 21.19 32.43 24.16 82.89 31.22 87.19 85.55 86.31 79.47 80.72 82.86 77.30 28.96 28.64 77.61 44.63 24.20 78.07 35.43 17.57 15.00 20.58 35.93 0.44 24.67 56.60 74.77 20.69 31.67 32.89 18.56 35.48 37.26 22.97 18.95 81.57 38.40 23.67 0.96 28.35 35.02 26.03 41.15 40.52 77.00 82.00 86.94 89.22 86.38 26.37 75.109 % bad 24.42 28.52 26.15 76.38 30.70 83.
1457 0.21 summarize the results. . average case neural network models German credit datab Good credit Bad credit Overall Neural modelsa MOE RBF MLP LVQ FAR Parametric models Linear discriminant Logistic regression Nonparametric models K nearest neighbor Kernel density CART a b Australian credit datab Good credit Bad credit Overall 0.1420 0.3163 0.1315 0.MDM) and feedforward network.1186 0. The process of splitting the data in validation and test set was repeated 30 times.1713 0.1201 0.4814 0.1540 0.1404 0.1274 0.1409 0. mixtureofexperts.1428 0. learning vector quantization and fuzzy adaptive resonance.2388 0.3044 0. Logistic regression is the most accurate of the traditional methods.1562 Neural network results are averages of 10 repetitions.1416 0.1922 0. Table 4.5533 0.2566 0.2461 0.4039 0. Table 4.1857 0.1557 0. radial basis function.1246 0.1275 0.2667 0. The characteristic of the data set is shown in Table 4.1514 0.1703 0.4277 0.5753 0.3240 0.1352 0.1710 0.1347 0.2257 0. kernel density estimation and decision trees Results show that feedforward network may not be the most accurate neural network model and that both mixture of experts and radial basis function should be considered for credit scoring applications. The results are shown in Table 4.4775 0.5299 0.2540 0. MULTIVARIATE STATISTICS NEURAL NETWORKS 115 West (2000) investigate the credit scoring accuracy of ﬁve neural network models: multilayer perceptron (MLP).2771 0. Markham and Ragsdale (1995) presents an approach to classiﬁcation based on combination of linear discrimination function (which the label Mahalanobis distance measure .2493 0.1531 0.21: Credit scoring error.1906 0.2370 0.0782 0.2434 0.1332 0.5133 0.2063 0.1332 0.4883 0.1666 0. Reported results are group error rates averaged across 10 independent holdout samples. logistic regression.3080 0. knearest neighbour.22. The neural networks credit scoring models are tested using 10fold crossvalidation with two real data sets.5457 0.6300 0.CHAPTER 4.1107 0.1326 0.1286 0. Results are compared with linear discriminant analysis.2740 0.23.2672 0.
a recent reference on comparisons of classiﬁcation models is Bose (2003).CHAPTER 4.22: Data Characteristic Dataset Problem Domain Number of groups Number of observations Number of variables 1 Oil Quality Rating 3 56 5 2 Bank Failure Prediction 2 162 19 116 Finally. . MULTIVARIATE STATISTICS NEURAL NETWORKS Table 4.
11 20.25 5.81 18.23: Comparison of classiﬁcation methods on threegroup oil quality and twogroup bank failure data set Percentage of Misclassiﬁed Observations in the Validation Sample Threegroup oil quality Twogroup bank failure MDM NN1 NN2 MDM NN1 NN2 22.00 7.50 7.41 3.11 0 20.50 3.81 7.00 6.25 5.00 10.83 Sample 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 Average .70 0 0 17.00 13.50 7.25 2.25 3.50 14.00 3.25 3.75 10.00 10.00 2.11 15.75 10.41 0 17.11 11.41 3.70 16.00 6.75 8.11 11.00 13.00 7.11 0 0 15.25 7.70 3.52 11.50 11.50 5.00 18.00 14.25 2.22 18. MULTIVARIATE STATISTICS NEURAL NETWORKS 117 Table 4.75 5.50 6.41 0 0 13.75 7.75 5.00 11.75 14.50 14.00 8.33 6.75 5.81 7.11 0 20.00 14.00 5.52 7.81 0 0 12.00 6.75 11.70 0 0 11.50 8.22 11.75 5.00 18.25 7.75 14.75 7.70 25.00 29.52 3.81 3.41 0 0 18.25 3.81 7.50 3.63 22.11 0 0 15.93 18.75 8.00 6.75 5.22 11.41 0 13.52 7.00 11.47 15.50 3.42 2.CHAPTER 4.41 0 0 15.00 22.75 6.52 11.75 2.00 11.25 14.50 11.50 0 0 0 15.25 5.41 0 15.52 7.00 5.11 13.11 17.50 7.50 25.52 22.25 8.22 14.75 7.00 11.70 0 0 16.25 18.25 18.08 4.70 0 13.75 6.70 0 13.52 11.25 18.75 7.50 8.50 2.00 1.75 6.25 7.63 8.50 8.41 0 12.25 3.75 11.25 3.11 0 0 16.41 0 0 12.41 0 0 18.70 16.25 6.25 5.25 6.50 10.00 7.70 10.81 11.11 3.75 3.00 6.81 3.
2) The exponential family of distribution includes many common distributions and its parameters have some nice statistical properties related to them (sufﬁciency.Chapter 5 Regression Neural Network Models 5. Members of this family can be expressed in the general form 118 . inversenormal. predictors) variables.1. negativebinomial distributions. • A link function L(µi ) which relates the expectation µi = E(Yi ) to the linear predictor ηi L(µi ) = ηi (5. . Beyond linear models with normal distributed errors. generalized linear models include logit and probit models for binary response variable and loglinear (Poissonregression) models for counts. The x are independent (covariates.1 Generalized Linear Model Networks . gamma.1) on which yi depends. . Poisson. • A linear predictor ηi = α0 + α1 xi1 + . etc. attain CramerRao bound). + αk xik (5. binomial. A generalized linear model consists of three components • A random component.1. in the form of a response variable Y with distribution from the exponential family which includes: the normal.GLIMN Generalized (iterative) linear models (GLIM) encompass many of the statistical methods most commonly employed in data analysis.
called deviance (D).CHAPTER 5.1: Generalized Linear Models: mean. θ.3) α(φ) + C(y. φ is called the dispersion or scale parameter. variance and deviance functions η ∨(µ) D(ˆ) µ Normal Binomial µ ln[µ/(1 − µ)] 1 µ(1 − µ) (y − µ)2 ˆ 2Σ {[y ln(y/ˆ)]+ µ +(n − y) ln[(n − y)/(n − µ)]} ˆ 2Σ[y ln(y/ˆ) − (y − µ)] µ ˆ 2Σ [− ln(y/ˆ) − (y − µ)/ˆ] µ ˆ µ Poisson Gamma log µ µ−1 µ µ2 The neural network architecture equivalent to the GLIM model is a perceptron with the predictors as inputs and one output. .1. (5. One measure of discrepancy most frequently used is that formed by the logarithm of the ratio of likelihoods. then θ is called the natural or canonical parameter.1. It can be shown that f (y. µn ) derived from a model involvˆ µ ˆ ing (usually) a relative small number of parameters. φ) = exp E(Y ) = b (θ) = µ Var(Y ) = α(φ)b (θ) = α(φ) ∨ (θ). φ) If φ is known. ∼ ∼i Table 5.1. (5. yn ) by a set of ﬁtted values µ = (ˆ1 . In what follows we will deal with some particular cases: the logit regression and the regression models neural networks. Table 5. . This network is shown in Figure 5. There are no hidden layers and the activation function is chosen to coincide with the inverse of the link function (L).4) ∼ Fitting a model may be regarded as a way of replacing a set of data values y = (yi . . REGRESSION NEURAL NETWORK MODELS 119 yθ − b(θ) .1 gives some usual choices of link functions for some distributions of y for the canonical link θ = η. When α(φ) = φ. . . . .
The logistic regression model is a generalized linear model with binomial distribution for y1 . REGRESSION NEURAL NETWORK MODELS 120 Figure 5.CHAPTER 5. i=1 (5. . .1.1.5) with ∧ denoting the logistic function. The equation takes the form 1 1 + exp(−β0 − k i=1 k p = P (Y = 1/x) = ∼ β i xi ) = ∧ β0 + i=1 βi xi (5.7) Other approaches to model the probability p include generalized additive models (Section 5. .1: GLIM network 5.1. yn and logit link function. The model can be expressed in terms of log odds of observing 1 given covariates x as log tp = log P = β0 + 1−β k β i xi .6) A natural extension of the linear logistic regression model is to include quadratic terms and multiplicative interaction terms: k k P = Λ β0 + i=1 β i xi + i=1 γi x 2 + i i<j δij xi xj . (5.2.1.1 Logistic Regression Networks The logistic model deals with a situation in which we observe a binary response variable y and k covariates x’s.3) and feedforward neural networks.
. An attempt to improve the comparison and to use logistic regression as a staring point to design a neural network which would be at least as good as the logistic model is presented in Ciampi and Zhang (2002) using ten medical data sets.10) the weights can be interpreted as regression coefﬁcients and maximum likelihood obtained by back propagation maximum likelihood.CHAPTER 5. obtained by k pj = Λ βoj + i=1 βij xi .8) reduces to (5. with comparisons among these two models (LR and ANN).2 is characterized by weights βij (j = 1. The neural network of Figure 5. K) to output yj . βj ) (5. .1.1. This section ends with two applications showing some numerical aspects of the models equivalence as given in Schumacher et al (1996). (5. (5. For a discussion of this and related measures and other comparisons among these two models see Schwarzer et al (2000) and Tu (1996). . .1 with the sigmoid link k function as activation function of the output neuron. βj ) k j=1 pj = P (yj = 1/x) = ∼ Λ(x. Therefore it does not make sense to compare a neural network with hidden layer and linear logistic regression.1. Here we will only give some useful references mainly from the statistical literature.8) This representation of neural network is the basis for analytical comparison of feedforward network and logistic regression.9) In statistical terms this network is equivalent to the multinomial or polychotomous logistic model deﬁned as Λ(x.6) the neural network architecture is as in Figure 5. specially in medical sciences. . For a neural network without a hidden layer and a sigmoid activation function (5. The extension to the polychotomous case is obtained considering K > 1 outputs. A feedforward network with hidden layer with r neurons and sigmoid activation function would have the form: r k P = Λ w0 + j=1 wj Λ(woj + j=1 wij xi ) .1. REGRESSION NEURAL NETWORK MODELS 121 For the linear logistic model (5.1.6) and this network is called logistic perceptron in the neural network literature. There is a huge literature.1.
Ottenbacher et al. Nguyen et al. that is P (y = /x) = Λ(β0 + β1 + β2 x2 ). polychotomous logistic for discrete choice (Carvalho et al. . The ﬁrst. Leonard et al. 2003). country international credit scores (Cooper.12) (5. the results show an effect marginally signiﬁcant of the logarithm of rate of inspiration in the probability to the presence of vasoconstriction in the ﬁnger’s skin.CHAPTER 5. 2001. The second model considers the use of a second covariate. 1995.1. In the ﬁrst case. that is P (y = 1/x) = Λ(β0 + β1 x1 ). Boll et al. respectively. 1997) and some medical applications (Hammad et al. 1999). Brockett et al. scores risk of insolvency (Fletcher and Goss. 1998). 1993.2: Policholomous logistic network Applications and comparisons of interest are: personal credit risk scores (Arminger et al.1. considers only one covariate x1 = log R.2 presents the results of the usual analysis from two logistic regression models. 1997). (5. 2002. The ﬁrst illustration is the that of Finney (1947) and the data consists N = 39 binary responses denoting the presence (y = 1) or absence (y = 0) of vasoconstriction in the ﬁnger’s skin after the inspiration of an air volume at a mean rate of inspiration R. REGRESSION NEURAL NETWORK MODELS 122 Figure 5.11) For x1 = log and x2 = log V . 1997. Table 5.
335 with a KullbackLeiber measure of E ∗ = 24. . w1 = 4. Sonography measurements in N458 women and characteristics (x1 .4.336 0. The results are shown in Table 5. number of arteries in the contralateral breast (AC). 000001 involving approximately 20000 interaction to obtain a value of the distance KullbackLeibler of E ∗ = 14. 469 and w1 = 1. we have w0 = −0. 73) An ANN equivalent to this model would be a feedforward network with two input neurons (x0 = 1. xp ) were collected.439 0. . 001 to the last interactions MLBP.005 5.938 Log R 4. 632).665 0.2: Results of a logistic regression analysis for the vasoconstriction data Coefﬁcient Standard Error P value Neural Network Variable of Regression Intercept 0. 240) although the algorithm backpropagation demands several changes in learning rate from an initial value η = 0. A preliminary analysis with a logistic model indicates three covariates as signiﬁcant. REGRESSION NEURAL NETWORK MODELS 123 Table 5.651 Log V 5.CHAPTER 5.924 1.240 (L = 14.858 0.470 0.335 (L = 24543) (E = 24. For the case of the model with two covariates.010 4. .221 1. x1 = log R. this indication plus problems of collinearity among the six variables suggests the use of age. the result shows a signiﬁcant inﬂuence of the two covariates in the probability of vasoconstriction in the ﬁnger’s skin. One Perceptron with three input neurons (x0 = 1. A comparison with others ANN is shown in the Table 5. 2 to η = 0.288 2. This was almost obtained (w0 = 2. . 73 comparable to that of minus the loglikelihood (L = −14. w2 = 4. 650.3.631 1.469 Log R 1.554) Intercept 2.045 1. Using a rate of learning η = 0.789 0.632) (E = 14. x1 = log R). 554. The second example is a study to examine the role of noninvasive sonographic measurements for differentiation of benign and malignant breast tumors. x2 = log V ) should give similar results if MLBP was used. number of arteries in the tumor (AT). 938. Cases were veriﬁed to be 325(y = 0) benign and 133 as (y = 1) malignant tumor. .
7 98.9 93.187 1.178 0.7 97.069 w2 = 5.5 99.2 89.0 96.4 93.5 97.108 w1 = 0.0001 0.5 99.4 93.162 w3 = −1.0 93.7 Percentage of correct classiﬁcation 91.7 97.1 98.99 P Value 0.6 Logistic regression CART NN(1) NN(2) NN(3) NN(4) NN(6) NN(8) NN(10) NN(15) NN(20) NN(40) .9 95.437 L=79.3: Results of a logistic regression analysis of breast tumor data Variable Intercept Age log AT+1 log AC+1 Coefﬁcients 8.575 0.0001 0.5 92.3 93.081 E=80.2 Speciﬁcity (%) 90.2 94.0 92.5 98.4 98.1 99.CHAPTER 5.003 124 Table 5.7 91.5 97.924 0. CART and feedforward neural networks with j hidden units (NN(J)) for the breast tumor data Method Sensitivity (%) 95.0 92.2 99.2 99. REGRESSION NEURAL NETWORK MODELS Table 5.9 95.6 96.2 99.070 5.074 Logistic Standard Error 0.5 94.4: Results of classiﬁcation rules based on logistic regression.1 99.0014 Neural Network Weights w0 = −8.5 98.8 95.017 0.
• The neural networks used 10 neurons and a hidden layer. In what follows we describe some comparisons of regression models versus neural networks.1. Zoology. they compared linear models with nonlinear models. Church and Curram (1996) compared forecast of personal expenditure obtained by neural network and econometric models (London Business School. Gorr et al (1994) compare linear regression. in this case a particular projection pursuit model with the sigmoid as the smooth function.2 Regression Network The regression network is a GLIM model with identity link function. Some authors have compared the prediction performance of econometric and regression models with feedforward networks usually with one hidden layer.13) which are the grades in Mathematics. An extension for system of equations can be obtained for systems of k equations by considering k outputs in the neural network as in Figure 5.1. LDR = math + chem + eng + zool + 10T ot + 2Res + 4P T ran (5. The introduction of hidden layer has no effect in the architecture of the model because of the linear properties and the architecture remains as in Figure 5. National Institute of Economic and Social Research.1. . (See Section 5. and total GPA.2.CHAPTER 5. The dependent variable to be predicted for admissions decision is the total GPA of all courses taken in the last two years. Bank of England and a Goldman Sache equation). In fact. None of the models could explain the fall in expense growth at the end of the eighties and beginning of the nineties. and a transfer student had all credits in prerequisites transfer. These networks produced similar results to the econometric models. The other variables are binary variables indicating: residence in the area. a transfer student had partial credits in prerequisites transferred. English. stepwise polynomial regression and a feedforward neural network with three units in the hidden layer with an index used for admissions committee for predicting students GPA (Grade Point Average) in professional school. the analysis of the weights and output of the hidden units. identiﬁes three different patterns in the data.2. in all models. A summary of this application of neural networks is the following: • Neural networks use exactly the same covariates and observations used in each econometric model. REGRESSION NEURAL NETWORK MODELS 125 5. Although none of the empirical methods was statistically signiﬁcant better than the admission committee index.5). Chemistry. The admission/decision is based on the linear decision rule. Again the outputs activation function are the identity function. The results showed obtained very interesting comparisons by using quartiles of the predictions.
a sensitivity analysis was done on each variable.CHAPTER 5. This network was able to explain the fall in growth. sex of the baby. mother’s height. They relate economic factors with the stock market and in their context conclude that neural network model can improve the linear regression model in terms of predictability. An application of neural network in ﬁtting response surface is given by Balkin and Lim (2000). but not in terms of proﬁtability. • A ﬁnal exercise used a neural network with inputs of all variables from all models. parity and gravidity. smoking. 5.e. are related to birthweight. • Besides the forecast comparison.2. They found that neural network performed slightly better than linear regression and found no signiﬁcant relation between birthweight and each of the two variables: maternal age and social class. so that the network was tested with the mean value of the data and each variable was varied to verify the grade in which each variation affects the forecast of the dependent variable. i. pk (x) is the probability of x being ∼ ∼ observed among those of class C = k.2 Nonparametric Regression and Classiﬁcation Networks Here we relate some nonparametric statistical methods with the models found in the neural network literature.8. Another econometric model comparison using neural network and linear regression was given by Qi and Maddala (1999). Using simulated results for an inverse polynomial they shown that neural network can be a useful model for response surface methodology. but it is a network with too many parameters. ∼ ∼ ∼ .1 Probabilistic Neural Networks Suppose we have C classes and want to classify a vector x to one of this classes.2 to 0. Results show. Lapeer et al (1994) compare neural network with linear regression in predicting birthweight from nine perinatal variables which are thought to be related. where πk is the prior probability of a observation be from class k. mother’s bodymass index (BMI). gestational age. REGRESSION NEURAL NETWORK MODELS 126 • The dependent variable and the covariates were reescalated to fall in the interval 0. The Bayes classiﬁer is based on p(k/x)απk pk (x). In medicine. 5. that seven of the nine variables.
We use K(x − y ) as a measure of the prox∼ imity of x on y . 1996. p. (5.CHAPTER 5. we have from Bayes theorem: p(k/x) = ˆ ∼ πk pk (x) ˆ ∼ C j=1 πj pj (x) ˆ ∼ = πk nk [i]=k K(x − xk ) . The value of σ should be dependent on the number of patters in a class.184). Choosing a large value results in overgeneralization. (5.2) When the prior probability are estimated by nk/n . p. p.2) simpliﬁes to p(k/x) = ∼ [i]=k i K(x − x ) ∼ ∼i K(x − x ) ∼ ∼i .1) This is an average of the kernel functions centered on each example from the class.3) This estimate is known as Parzen estimate and probabilistic neural network (PNN) (Patterson. the probability πk and the density p(·) have to be estimated from the data. 0 < b < 1.2. As we have seen the probability density function for a class is approximated by the following equation pk (x) = ∼ 1 2π n/2 σ 2 1 nK nk j=1 e−(x−xkj ) 2 /2σ 2 (5. A kernel K is a bounded function with integral one. π[i] i n[i] K(x − xi ) (5.182) ∼ ∼2 nj ∼ pj (x) = ˆ ∼ 1 nj K(x − x ). Then.2.2. choosing a value too small results in overﬁtting. k (5. A powerful nonparametric technique to estimate these functions is based on kernel methods and propose by Parzen (Ripley.352). One function used is σ = an−b a > 0.2. and one possible kernel is the Gaussian function. A local estimate of the density pk (x) can be found summing all these contributions with weight K(x − x ) that is (Ripley. 1996. The kernel nonparametric method of Parzen approximates the density function of a particular class in a pattern space by a sum of kernel functions.5) . i=1 ∼ ∼i (5. REGRESSION NEURAL NETWORK MODELS 127 In general.4) where σ 2 has to be decided. 1996.2. Suitable example is the multivariate normal density function. The empirical distribution of x within a group k gives mass ∼ 1 to ∼ ∼ mk ∼ each of the samples.2. In radial basis function networks with a Gaussian basis it is assumed that the normal distribution is a good approximation to the cluster distribution.
If the amount of data is large. REGRESSION NEURAL NETWORK MODELS 128 A three layer network like a RBF network is used with each unit in the hidden layer centered on an individual item data.3 shows a network that ﬁnds three class of data. There are two input variables. The values for the weights in the hidden units (i. and for the three classes of data there are four samples for class A. This output represents the probability that the input data belongs to the class represented by that unit. this layer adds all the outputs from the hidden layer that corresponds to data from the same class together. Figure 5. each unit with binary output. and a linear output function. The ﬁnal decision as to what class the data belongs to is simply the unit with the output layer with the largest value.e. The advantage of the PNN is that there is no training. If values are normalized they lie between 0 and 1. Figure 5.3: Probabilistic Neural Network . the centers of the Gaussian functions) are just the values themselves. hence the number of neurons in the hidden layer. using data of 249 patients for training and 63 patients for testing is mentioned in Specht (1988). Each unit in the output layer has weights of 1. three samples for class B and two samples for class C. some clustering can be done to reduce the number of units needed in the hidden k layer.CHAPTER 5. An application of PNN to classiﬁcation of electrocardiograms with a 46dimensional vector input pattern. In some networks an additional layer is included that makes this decision as a winner takes all.PNN .
(5.2 General Regression Neural Networks The general regression neural network was named by Specht (1991) and its was already known in the nonparametric statistical literature as NadarayaWatson estimator.7) a particular form of the NadarayaWatson estimator y (x) = ˆ yi K(x − x ) ∼ ∼i K(x − x ) ∼ ∼i .8) The value of dj is the distance between the current input and the jth input in the sample of training set.2. x using the following equation from probability theory for the regression of y given x ∼ ∼1 ∼n yf (x.6) where f (x. . The architecture is shown in Figure 5. yn and n vector of inputs x .4 for a single output of two variables input with a set of 10 data points. . In addition. The radic σ is again given by σ = an−b/n . y)dy ∼ (5.9) The architecture of the GRNN is similar to PNN. y)dy y (¯ ) = E(y/x) = ¯x ∼ ∼ ∼ f (x. (5.2. . The idea of GRNN is to approximate a function given a set of n outputs y1 .2. except that the weights in the output layer is not set to 1.2. ∼ ∼ As in PNN we approximate the conditional density function by the sum of Gaussian function. . .CHAPTER 5. The regression of y on x is then given by ∼ n −d2 /2σ 2 j j=1 yj e n −d2 /2σ 2 j j=1 e y (ˆ ) = ˆx ∼ (5. y). y) is the joint probability density function of (x.2. . the sum of the outputs from the Gaussian layer has to be calculated so that the ﬁnal output can be divided by the sum. . . Instead they are set to the corresponding values of the output y in the training set. REGRESSION NEURAL NETWORK MODELS 129 5.
4: General Regression Neural Network Architecture . REGRESSION NEURAL NETWORK MODELS 130 Figure 5.CHAPTER 5.
A second hidden layer computes the f ’s functions. i=0 h(µ) = η = (5. .12) with two continuous input is shown in Figure 5.10) fi (xi ). They estimated the functions f using Bsplines. The ﬁrst hidden layer consist of two blocks. . + fI (xI ) (5. then ﬁtting f2 to the residual yi − f1 (x1i ). The output layer has a logistic activation function and output p. Nonparametric estimation is used to obtain the unknown functions fj . xI ) is ∼ E(Y ) = µ I (5. all units have as activation the identity function. . . Figure 5.12) as a 2 class classiﬁer.2.11) This is a generalization of the GLIM model when the parameters βi are here replaced by functions fi . A neural network corresponding to (5.5. Ciampi and Lechevalier (1997) used the model logit(p) = f1 (x1 ) + . They are obtained iteratively by ﬁrst ﬁtting f1 .2. each corresponding to the transformation of the input with Bsplines.2.3 Generalized Additive Models Networks The generalized additive model. GAM relating an output y to a vector input x = (x1 . .5: GAM neural network .CHAPTER 5. Again h(·) is the link function. . REGRESSION NEURAL NETWORK MODELS 131 5.2.2. and so on.
Some examples are shown in Figure 5. The procedure is presented in Fox (2000a. Figure 5. the broken line is for a local linear regression. in US dollars.4 Regression and Classiﬁcation Tree Network Regression tree is a binning and averaging procedure.2.6: Regression tree y ˆ .CHAPTER 5. REGRESSION NEURAL NETWORK MODELS 132 5.b) and for a sample regression of y on x. (a) The binning estimator applied to the relationship between infant mortality per 1000 and GDP per capita. to average the values of y for corresponding values of x for certain conveniently chosen regions.6. Ten bins are employed. (b) The solid line gives the estimated regression function relating occupational prestige to income implied by the tree.
The trees can be represented as neural networks.2. REGRESSION NEURAL NETWORK MODELS 133 Closely related to regression trees are classiﬁcation trees.8 shows such representation for the tree of Figure 5. The solid line gives the estimated regression function relating occupational prestige to income implied by the tree. Each leaf is reached through a series of binary questions involving the classiﬁers (input). . The weights linking the second hidden layer to the output are determined by the data and are given by the logit of the class probability corresponding to the leaves of the tree. .CHAPTER 5. . The hidden layers have activation function taking value 1 for negative input and 1 for positive input. where the response variable is categorical rather than quantitative.2. The regression and classiﬁcation tree models can be written respectively as: Y = α1 I1 (x) + . .13) (5. Figure 5. + IL (x) ∼ ∼ (5. the broken line is for a local linear regression.14) logit(p) = γ1 I1 (x) + . Figure 5.7. The following classiﬁcation tree is given in Ciampi and Lechevalier (1997). + IL (x) ∼ ∼ where the I are characteristics functions of Lsubsets of the predictor space which form a partition. . these are determined from the data at each mode.7: Decision Tree The leaves of the tree represents the sets of the partition.
17) The projectionpursuit regression model can therefore capture some interactions among the x’s. . if. + fI (ziI ) where the z1 ’s are linear combinations of the x’s zij = αj1 xi1 + αj2 xi2 + .2.5 Projection Pursuit and FeedForward Networks Generalized additive models ﬁts the model yi = α + f1 (x1 ) + .2. . .2. ﬁts the model yi = α + f1 (zi1 ) + .2. .CHAPTER 5. ∼ j ∼i (5. ∼ . Equation (5.2. the functions fj are sigmoid function and we consider the identity output.15) (5. for example.18) for multiple output y .8: Classiﬁcation Tree Network 5. estimate α2 ˆ ˆ α and then f2 (ˆ x) and so on. . + fI (xI ). Projection pursuit regression (PPR).18) is the equation for a feedforward network. REGRESSION NEURAL NETWORK MODELS 134 Figure 5.2. ∼2∼ ∼1∼ ∼1 The model can be written as I yk = f (x) = αk + ∼ j=1 fik (αj + α x) ∼j ∼ (5. The ﬁtting of this models is obtained by least squares ﬁtting of α . + ααp xip = α x . . ˆ ˆ α then as in GAM ﬁtting f1 (ˆ x) from the residual R1 of this model.16) (5.
6 Example Ciampi and Chevalier (1997) analyzed a real data set of 189 women who have had a baby. They apply a classiﬁcation tree network. in PPR we can approximate each smooth function fi by a sum of logistic functions. a GAM network. The misclassiﬁcation number resulting in each network was: GAM : 50 . Figure 5.CHAPTER 5.10: Network of networks . Several variables were observed in the course of pregnancy. REGRESSION NEURAL NETWORK MODELS 135 Clearly feedforward neural network is a special case of PPR taking the smooth functions fj to be logistic functions. Two continuous (age in years and weight of mother in the last menstrual period before pregnancy) and six binary variables.Classiﬁcation tree : 48 Network of network : 43. for approximation of functions. Conversely.2. ∼ 5.9: GAM network Figure 5. and a combination of the two. Their GAM and combined network are shown in Figures 7 and 8. is that the ﬁrst stage allows a projection of the x’s onto a space of much lower dimension for the z’s. The output variable was 1 if the mothers delivered babies of normal weight and 0 for mothers delivered babies of abnormally low weight (< 2500 g). Another heuristic reason why feedforward network might work well with a few ridden units. Figure 6 gives their classiﬁcation network.
the survivor function S(t).1. A distinct feature of survival data is the occurrence of incomplete observations. 136 . This feature is known as censoring which can arise because of time limits and other restrictions depending of the study. the hazard function f (t) or the cumulative hazard function H(t).1) H(t) = 0 h(u)du h(t) = H (t) S(t) = exp[−H(t)]. Time Series and Control Chart Neural Network Models and Inference 6. the cumulative distribution function F (t). These are related by: t F (t) = 0 f (u)du d F (t) dt S(t) = 1 − F (t) f (t) d h(t) = = − [ln S(t)] S(t) dt f (t) = F (t) = t (6. we may specify the distribution of T by any one of the probability density function f (t).1 Survival Analysis Networks If T is a nonnegative random variable representing the time to failure or death of an individual.Chapter 6 Survival Analysis.
x)). The relation can be linear (β x) or nonlinear (g(j .g. Then exponentiation gives T = exp −β x eW or T = eW = T exp β x ∼ ∼ (6. Parametric models of survival distributions can be ﬁtted by maximum likelihood techniques. time of infection by HIV virus in a study of survival of AIDS patients). death by other cause) that causes interruption of the followup on the individual experimental unit. A general class of models relating survival time and covariates is studied in LouzadaNeto (1997. Here we describe the three most common particular cases of the LouzadaNeto model. • Interval censoring the exact time to the event is unknown but it is known that it falls in an interval Ii (e. (6.CHAPTER 6. The aim is to estimate the previous functions from the observed survival and censoring times.3) where T has hazard function h0 that does not depend on β.g.1.2) where W is a random variable. correspondence to a linear logistic model with “death” or not by a ﬁxed time t0 as a binary response. If hj (t) is the hazard function for the j’th patient it follows that hj (t) = h0 (t exp β x) exp β x. This can be done either by assuming some parametric distribution for T or by using nonparametric methods. General class of densities and the nonparametric procedures with estimation procedures are described in Kalbﬂeish and Prentice (2002). 137 • Right censoring occur if the events is not observed before the prespeciﬁed studyterm or some competitive event (e.1. • Left censoring happens if the starting point is located before the time of the beginning of the observation for the experimental unit (e. Usually we have covariates related to the survival time T .1.5) . when observations are grouped). 1999). When two or more group of patients are to be compared the logrank or the MantelHanszel tests are used. SURVIVAL ANALYSIS AND OTHERS NETWORKS There are different types of censoring.1. log Sj (t) S0 (t) = β x + log 1 − Sj (t) ∼ ∼ 1 − S0 (t) (6. The ﬁrst class of models is the accelerated failure time (AFT) models log T = −β e x + W ∼ ∼ ∼∼ ∼ ∼ (6.4) The second class is the proportional odds (PO) where the regression is on the logodds of survival. The usual nonparametric estimator for the survival function is the KaplanMeier estimate.g.
her models were based on alternative implementation of models (6.7) under a mixed model and not only extend these models but also puts the three models under a more general comprehensive model (Louzado Neto and Pereira 2000).5) allowing for censoring.1.7) (6.1.2) to (6.1. AFaccelerated failure. PH/AM mixed model.. 1 − Sj (t) 1 − S0 (t) ∼∼ The third class is the “proportional hazard” or Cox regression model (PH). Each model can be obtained as particular case of models on top him in the ﬁgure) Ripley (1998) investigated seven neural networks in modeling breast cancer prognosis. existing models.. EAFextended AF.1. LouzadaNeto hazard. EH EPH/EAF EPH EAF PH/AF PH AF Figure 6. log ht (t) = β x + log h0 (t) ∼∼ 138 (6. Here we outline the important results of the literature.8) hj (t) = h0 (t) exp β x.1. SURVIVAL ANALYSIS AND OTHERS NETWORKS or Sj (t) S0 (t) = exp β x.1: Classes of regression model for survival data ( . ∼∼ Ciampi and EtezadiAmoli (1985) extended models (6. EHextended hazard.CHAPTER 6. .1. EPH/EAFextended mixed.2) and (6. EPHextended PH.6) (6..1. See Figure 6.1.
The corresponding network is shown below in Figure 6. j∈Ri exp H h=1 Lc (θ) = αh /[1 + exp(−wh xi )] H h=1 αh /[1 + exp(−wh xi )] (6.2. For the Cox proportional hazard model. SURVIVAL ANALYSIS AND OTHERS NETWORKS 139 The accelerated failure time . 6. treatment (0.9) and estimations are obtained by maximum likelihood through NewtonRaphson.2: Neural network model for survival data (Single hidden layer neural network) As an example (Faraggi and Simon. that is exp i∈. (c) Neural network model with two hidden nodes (12 parameters). The covariates are: stage. 1995) consider the data related to 506 patient with prostatic cancer in stage 3 and 4. (b) Secondorder (interactions) PH model (10 parameters). Figure 6.1.2.2. . 6... 1 or 5 mg of DES and placebo). weight. Faraggi and Simon (1995) substitute the linear function βxj by the output f (xj . (d) Neural network model with three hidden nodes (18 parameters). age.AFT model is implemented using the architecture of regression network with the censored times estimated using some missing value method as in Xiang et al (2000). θ) of the neural network.1. The results are given in the Tables 6.3 below for the models: (a) Firstorder PH model (4 parameters).CHAPTER 6.
0 0.607 0.0 834. secondorder and neural network proportional hazards models Model Number of parameter s First order PH Secondorder PH Neural network H = 2 Neural network H = 3 4 10 12 18 Training data Log lik c Test data Log lik c 814.5% 52.2: Loglikelihood and c statistics for ﬁrstorder.3 805.3% 51.6 801.2 794.4% 52.4% 73 years 970 48.7% Table 6.608 0.5% 73 years 980 49.5 860.8% Training set 238 47.8 834.582 .6% 52.6000 0.5% 34 months 27.9% 50.1: Summary statistics for the factors included in the models Complete data Sample size Stage 3 Stage 4 Median age Median wight Treatment: Low Treatment: High Median survival % censoring 475 47.CHAPTER 6.7% 33 months 29.9 0.646 0.6% 73 years 990 51.8% Validation set 237 47.580 0.1% 33 months 28. SURVIVAL ANALYSIS AND OTHERS NETWORKS 140 Table 6.648 0.5% 48.661 831.
238 0. . Let T be a random survival time variable. . y n ) n where yk is the indicator of individual n dying in Ik and kn ≤ K is the number of n intervals where n is observed.655 0. K.130 0.198 0. . . .109 0.051 0. .256 0. y n kn−1 are all 0 and y n kn = 1 of n dies in Ikn and . < tk < ∞ .345 0.249 0 0 0 0 0 0 0 0 0 0 ∗ PH 2nd order 0.CHAPTER 6.363 0.024 0.475 0.1. x) = 1 1 + exp −β0k − (6. n Data for the individual n consist of the regressor xn and the vector (y1 . k = 1.404 0. . . SURVIVAL ANALYSIS AND OTHERS NETWORKS 141 Table 6. .323 0.069 0. .293 0.006 0. K where 0 < t0 < t1 < . P (T ∈ Ik /T > tk−1 . .451 0.213 0. Liestol. The model can be speciﬁed by the conditional probabilities.260 0.248 0. The output 0k in the k th output neuron corresponds to the conditional probability of dying in the interval Ik . .128 Stage Rx∗ Age Weight Stage ×Rx State × Age State ×W t∗ Rx× Age Rx × W t Stage ×Rx× Age Stage ×Rx × W t Stage × Age ×W t Rx× Age ×W t State ×Rx× Age ×W t Rx = Treatment.032 0. Thus y1 .028 Neural network H=3 0.581 0.3: Estimation of the main effects and higher order interactions using 24 factorial design contrasts and the predictions obtained from the different models Effects PH 1st order 0.513 0. W t = Weight Implementation of the proportional odds and proportional hazard were implemented also by Liestol et al (1994) and Biganzoli at al (1998). Anderson (1994) used a neural network for Cox’s model with covariates in the form.025 0. .195 0 0 0 0 0 Neural network H=2 0.450 0.219 0. and Ik the interval tk−1 < t < tk .325 0. .026 0.484 0. . The corresponding neural network is the multinomial logistic network with k outputs. .278 0.330 0.271 0. . .415 0.360 0.10) 1 i=1 βik xi for K = 1.315 0.300 0.302 0.
12) and w = (β01 . . β11 . .3: Log odds survival network 1 0k = f (x. SURVIVAL ANALYSIS AND OTHERS NETWORKS 142 Figure 6. w) = Λ β0k + i=1 βik xi (6. An immediate generalization would be to substitute the linearity for nonlinearity of the regressors adding a hidden layer as in the Figure 6. βIk and under the hypothesis of proportional rates make the restriction β1j = β2 = β3j = β4j = . . Other implementations can be seen in Biganzoli et al (1998).1.11) and the function to optimize N Kn n − log (1 − yk − f (xn . β0k . = βj .1. .4 below: . . . . . . .CHAPTER 6. w)) h=1 k=1 E (w) = ∗ (6.
connection weights wij . output values 01 ) . showing the notation for nodes and weights: input nodes (•).CHAPTER 6. SURVIVAL ANALYSIS AND OTHERS NETWORKS 143 Figure 6. output (and hid(2) den) nodes (O). covariates Zj .4: Nonlinear survival network (Twolayer feedforward neural nets.
Several networks were studied and the negative of the likelihood (interpreted as prediction error) is given in the Table 6. 4.4 below: Table 6.0 8.6 2. The results of the survival curve ﬁts follows in Figures 6. 8 covariates were included.CHAPTER 6.3 167. Linear model with a penalty term for nonproportional hazards. Nonlinear model with nonproportional hazards. 3.6 in the former case (column 5) and 4. Linear model. 5. Adding a penalty term to the likelihood of a nonproportional model or assuming proportionality over the two ﬁrst and last time intervals improved the test loglikelihood by similar amounts (5.5.6 and 6.0 170. Nonlinear model with a penalty term for nonproportional hazards.) 1 2 3 4 5 6 7 Prediction error 172.9 4. 2. • A gain in the test loglikelihood was obtained by using moderately nonproportional models.2 Change 1. 7. Nonlinear model with proportional hazards in ﬁrst and second interval and in third and fourth interval.6 4. Linear model with weightdecay.6 181.6 in the latter (column 6)).0 168.4).6 170. improvement 2. Most of the gain could be obtained by adding suitable penalty terms to the likelihood of a linear but nonproportional model. Nonlinear model with proportional hazards.4 The main results for nonlinear models with two hidden nodes were: • Proportional hazard models produced inferior predictions.7 168. 6. In summary. even more if no weight decay was used. Using no restrictions on the weights except weight decay gave slightly inferior results (column 7.4: Cross validation of models for survival with malignant melanoma (Column 1.7 (column 4) when using the standard weight decay. decreasing the test loglikelihood of a two hidden node model by 8. SURVIVAL ANALYSIS AND OTHERS NETWORKS 144 An example from Liestol et al (1994) uses the data from 205 patients with melanoma of which 53 died.7 5. for this small data set the improvements that could be obtained compared to the simple linear models were moderate.7: . An example from Biganzoli et al (1998) is the application neural networks in the data sets of Efron’s brain and neck cancer and Kalbﬂeish and Prentice lung cancer using the network architecture of Figure 6.
respectively.. unit with the hidden and output unit wa and w. A single output unit (K = 1) compute conditional failure probability x1 and x2 are the weights for the connections of the .CHAPTER 6. unit (0)..5: Feed forward neural network model for partial logistic regression (PLANN) (The units (nodes) are represented by circles and the connections between units are represented by dashed lines..) . The hidden layer has H units plus the ... unit (0). are the weights for the connections between input and hidden units and hidden and output unit.. SURVIVAL ANALYSIS AND OTHERS NETWORKS 145 Figure 6... The input layer has J units for time a and covariates plus one .
6: Head and neck cancer trial ( (a) estimates of conditional failure probability obtained with the best PLANN conﬁguration (solid line) and the cubiclinear spline proposed by Efron 13 (dashed lines). SURVIVAL ANALYSIS AND OTHERS NETWORKS 146 (a) (b) Figure 6. (b) corresponding survival function and KaplanMeyer estimates.) .CHAPTER 6.
SURVIVAL ANALYSIS AND OTHERS NETWORKS 147 (a) (b) Figure 6. (b) corresponding survival function and KaplanMeyer estimates.7: Head and neck cancer trial ( (a) estimates of conditional failure probability obtained with a suboptional PLANN model (solid line) and the cubiclinear spline proposed by Efron 13 (dashed lines).CHAPTER 6.) .
2. 6. Other applications can be seen in Lapuerta et al (1995).to study the behavior of the system. Neighbouring observations are dependent and the study of time series data consists in modeling this dependency. OhnoMachodo et al (1995) and Mariani and al (1997). depth etc. There are basically two approaches for the analysis of time series.to control Y by making changes in X. t ∈ T } where T is a set of index (time. in the linear case ∞ ∼ ∼ ∼ ∼ Y (t) = u=0 u(t)X(t − u). analyzes the series in the time domain. . to forecast future values of it.). Problems of interest are: . by simulating X.1) . .to estimate the transfer function v(t) that connects an input series X to the output series Y of the generating system. space. . (Y. .description of the behavior in the data. For example. (6. For example.CHAPTER 6. etc. A time series is represented by a set of observations {y (t ). SURVIVAL ANALYSIS AND OTHERS NETWORKS 148 A further reference is Bakker et al (2003) who have used a neuralBayesian approach to ﬁt Cox survival model using MCMC and an exponential activation function. y can be univariate or multivariate and T can have unidimensional or a multidimensional elements.to ﬁnd periodicities in the data. The ﬁrst. Thee are two main objectives in the study of time series. although time can be substituted by space. . Second. The nature of Y and T can be each a discrete or a continuous set. First. that is we are interested in the magnitude of the events that occur in time t.2 Time Series Forecasting Networks A time series is a set of data observed sequentially in time. Z)(t.s) can represent the number of cases Y of inﬂuenza and Z of meningitis in week t and state s in 2002 in the USA. The main tool used is the autocorrelation function .to forecast Y given v and X. to understand the underline mechanism that generates the time series. Further.to forecast the series.
. Dt is deterministic that is depends only on its past values Dt−1 . Essentially Wold (1938) results in the time domain states that any stationary process Yt can be represented by Yt = D t + Zt (6. Statistics. .2. that is we are interested in the frequency in which the events occur in a period of time.2.2. seventies). (6.3) is an inﬁnite movingaverage process. + (6. Con .4) where t is a white noise sequence (mean zero and autocorrelations zero) and ψ1 . .2.4) is an inﬁnite autoregressive process AR(∞).2. . thirties to ﬁfties).CHAPTER 6.2. Econometrics. BoxJenkins (Engineering.2.6) Most of the results in time series analysis is based on these results on its extensions. In the second approach. Expression (6. Exponential Smoothing (Operations Research. We can say from these theorems that the time domain analysis is used when the main interest is in the analysis of nondeterministic characteristic of this data and the frequency domain is used when the interest is in the deterministic characteristic of the data. Fourier. SURVIVAL ANALYSIS AND OTHERS NETWORKS 149 (and functions of it) and we use parametric models. Dt−2 .Bayesian. the analysis is in the frequency domain. M A(∞) and expression (6. StateSpace Methods . and Zt is of the form Zt = t + ψ1 t−1 + ψ2 t−2 or Zt = π1 Zt−1 + π2 Zt−2 + . Structural (Statistics.3) t (6. π1 are parameters.5) where ρ(t) is the autocorrelation of Yt and acts as the characteristic function of the (spectral) density F (w) of the process Z(w) and π Y (t) = −π eiwt dZ(w). The two approaches are not alternatives but complementary and are justiﬁed by representation theorems due to Wold (1938) in time domain and Cramer (1939) in the frequency domain. The frequency domain analysis is based on the pair of results: ∞ ρ(t) = −∞ eiwt dF (w) (6. . Wavelet transforms) and we use nonparametric methods. The tool used is the spectrum (which is a transform of the correlation function ex: Walsh. We can trace in time some developments in time series in the last century: Decomposition Methods (Economics.2) where Dt and Zt are uncorrelated. sixties).
8. Further we can make an analogy between scalar and matrices and the difference equation models of BoxJenkins (scalar) and statespace models (matrices). if there is no hidden layer and the output activation is the identity function the network is equivalent to a linear AR(p) model. The architecture and implementation of most time series applications of neural networks follows closely to the regression implementation. Another classical time series the Yearly Sunspot Data was analyzed by Park et al (1996).7 for the training data (88 observations) and forecasting period (12 observations). nineties). 1)12 (0.eighties). 1959. A multivariate time series forecasting using neural network is presented in Chakraborthy et al (1992). Here they used the PCA method described in Section 2. 1. The use of artiﬁcial neural network applications in forecasting is surveyed in Hill et al (1994) and Zang et al (1998). Minneapolis (yt ) and Kansas City (zt ) over the period from August . Their results are presented in Tables 6. SURVIVAL ANALYSIS AND OTHERS NETWORKS 150 trol Engineering.1 Forecasting with Neural Networks The most common used architecture is a feedforward network with p lagged values of the time series as input and one output. The model for this series becomes the standard benchmark model for seasonal data with trend: the multiplicative seasonal integrated moving average model. Control Engineering. The airline data was also analyzed by Faraway and Chatﬁeld (1998) using neural network.1. As in the regression model of Section 5. In the regression neural network we have the covariates as inputs to the feedforward network.2. 6. If there are one or more hidden layers with sigmoid activation function in the hidden neurons the neural network acts as nonlinear autoregressions. Table 6.6 and 6. here we have the lagged values of the time series as input. Here we will describe some recent and nonstandard applications. 1). SARIMA (0.2. 1960 in terms of mean average percentage error of the BoxJenkins model and some neural network. NonLinear Models (Statistics.CHAPTER 6. The recent interest in nonlinear model for time series is that makes neural networks attractive and we turn to this in the next Section. 1. One of the series analyzed by Raposo (1992) was the famous Airline Passenger Data of BoxJenkins (Monthly number of international passengers from 1949 to 1960). It should be pointed out that the importance of BoxJenkins methodology is that they popularized and made accessible the difference equation representation of Wold.5 gives the comparison for the periods 1958. The data used is the Indices of Monthly Flour Prices in Buffalo (xt ).
1660 11.9368 230.93 11.hidden.00 4.8539 9.75 10.4879 151.4056 9.6: Result from network structures and the AR(2) model Network Structure Mean Square Error (MSE) 156.5: Forecasting for Airline Data Year 1958 1959 1960 (12:2:1) 7.number of neurons in: a . SURVIVAL ANALYSIS AND OTHERS NETWORKS 151 Table 6.55 (a:b:c) .32 7.input.44 5.75 7.91 SARIMA 7. represents a network without hidden units. .2454 12. b .9554 Mean Absolute Error7 (MAE) 9.8726 2:11:1 2:3:1 2:2:1∗ 2:1:1 2:0:1·· AR(2) ∗ ·· indicates the proposed optimal structure.CHAPTER 6.2792 154. c . layer Table 6.2865 12.90 7.63 6.8252 222.output.86 15.2508 233.39 MAPE % (13:2:1) 14:2:1) 6.
1861 11. 1972 to November 1980.8 and 6.7: Summary of forecasting errors produced by neural network models and the AR(2) model based on untrained data Model 2:11:1 2:3:1 2:2:1∗ 2:1:1 2:0:1 AR(2) ∗ Mean Square Error (MSE) Mean Absolute Error (MAE) 160.CHAPTER 6.2645 9.8 and Figure 6.9064 157.1) multivariate model.1522 11.4559 177.8409 158. They used two classes of feedforward networks: separate modeling of each series and a combined modeling as in Figure 6.7880 180.9.9. The results are shown in Tables 6.0378 12.7608 180. .1710 9. SURVIVAL ANALYSIS AND OTHERS NETWORKS 152 Table 6.5624 represents the selected model by the PCA method.2631 9. previously analyzed by Tiag and Tsay using an ARMA (1.
CHAPTER 6.9: Combined Architecture for forecasting xt+1 (and similar to yt+1 and zt+1 ) . SURVIVAL ANALYSIS AND OTHERS NETWORKS 153 Figure 6.8: Separate Architecture Figure 6.
373 9.549 10.of Variation for Combined vs.096 .370 6.497 17.072 0.102 17.169 5.398 4.069 2.564 137.265 3.059 Minneapolis MSE CV 1.353 1.229 0.285 8.222 Multilag 72.103 0.070 1.090 0.898 1.383 1.168 12.892 7.770 12.9: MeanSquared Errors and Coeffs.383 3.352 9.107 2.526 4.799 8.437 2.087 0.493 Onelag 2. SURVIVAL ANALYSIS AND OTHERS NETWORKS Table 6.483 3.413 96.101 11.909 Kansas City MSE 3.835 9.021 1.044 2.633 8.318 5.909 3.091 3.674 Kansas City MSE CV 1.097 14.534 74.697 1.169 11.317 2. Tiao/Tsay’s Modeling/ Prediction ×103 Model 661 Network Training Onelag Multilag Training Onelag Multilag Buffalo MSE CV 1.076 8.483 1.975 10.047 9.655 14.003 3.857 2.917 7.774 12.441 4.316 7.645 18.244 11.872 13.646 3.067 9.CHAPTER 6.776 154 Network 221 441 661 Table 6.757 881 Network Tiao/Tsay Training 2.446 7.8: MeanSquared Errors for Separate Modeling & Prediction ×103 Buffalo Minneapolis MSE MSE Training Onelag Multilag Training Onelag Multilag Training Onelag Multilag 3.054 5.787 10.573 3.701 4.521 3.850 0.554 7.889 3.174 4.204 233.346 53.389 0.
3 Control Chart Networks Control charts are used to identify variations in production processes. 6. Cigizoglu (2003) used generated data from a time series model to train a neural network. and defective raw materials and it is desirable to remove these causes from the process. . The decision in the control chart is based on the sample mean with a signal of outofcontrol being given at the ﬁrst stage N such that . and the condition µ = µ0 is to be maintained. . . operation error. . It is ¯ assumed that the sample means are mutually independent and that Xi has distri2 σ bution N (µj . Modeling of nonlinear time series using neural network and an extension called stochastic neural network can be seen in Poli and Jones (1994). Stern (1996). ). are calculated. i = 1. Chance causes. Forecasting noninversible time series using neural network was attempted by Pino et al (2002) and combining forecasts by Donaldson and Kamstra (1996). 2. There are usually two types of causes of variation: chance causes and assignable causes. are inherent in all processes. Forecasting using a combination of neural networks and fuzzy set are presented by Machado et al (1992).CHAPTER 6. SURVIVAL ANALYSIS AND OTHERS NETWORKS 155 In previous example. More explicit combination of time series model and neural network are: Tseng et al (2002) that improved the ARIMA forecasting using a feedforward network with logged values of the series. A comparison of structural time series model and neural network is presented in Portugal (1995). Li et al (1999). initial analysis using true series methods have helped to design the architecture of the neural network. Further comparisons with time series models and applications to classical time series data can be seen in Terasvirta et al (2005). . Finally. The process means (µ) and variance (σ 2 ) are estimated when observed data ¯ ¯ are collected in sample of size n and the means X1 . . X2 . Lourenco (1998). an attempt to choose a classical forecast method using a neural network is shown in Venkatachalan and Sohl (1999). Assignable causes represent improperly adjustable machines. Kajitoni et al (2005) and Ghiassi et al (2005). its forecast and the residual obtained from the ARIMA ﬁtting to the series. or natural causes. Periodicity terms and lagged values were used to forecast river ﬂow in Turkey. Lai and Wong (2001) and Shamseldin and O’Connor (2001). Coloba et al (2002) which used a neural network after decomposing the series from its law and high frequency signals. n A shift in the process mean occurs when the distribution of the sample changes (µ = µ0 ).
2.CHAPTER 6.4. a single hidden layer with 5 neurons and 1 output neuron.2) ¯ where R is average range of a number of sample ranges R and An is a constant which depends on n. All the them uses simulation results and make comparison with classical statistical charts in terms of percentage of wrong decisions and ARL (average run length) which is deﬁned to be the expected number of samples that are observed from a process before outofcontrol signal is signaled by the chart. A further table with other rules are given in Prybutok et al (1994). 0. for example: more than threes consecutive values more than 2σ from the mean. The activation functions were sigmoid the neural network classiﬁer the control chart as being “in control” or “outofcontrol” respectively according if the output fall in the intervals (0. relating σ to E(R) and V (R).8. However. 1). sinoidal etc. 1. For eight shifts given by µ − µ0  taking the values (0. A similar decision procedure can be used for the behavior of ranges of the samples from the process and will signal an outofcontrol variability. Similarly chart for σ is usually based on ¯ R − σR  > 3 or ¯ ¯ R ± 3Am R (6.10. eight consecutive values above the mean etc. Prybutok et al (1994) used a feedforward network with 5 inputs (the sample size). patters in the data. . 0.6. SURVIVAL ANALYSIS AND OTHERS NETWORKS 156 δN = ¯ XN − µ0  >c √ σ/ n0 (6. Several kinds of signals based in these statistics and charts have been proposed. The rules were denoted by Ci and its combination by Cijkl . 2. 0.1) where usually c = 3 and the sample size of 4 or 5. disturbances can also cause shifts or changes in the amount of process variability. This rules will detect trends.3. 0. This procedure is useful to detect shifts in the process mean. 0. Various rules were compared (see Prybutok et al (1994)) such as C4 =signal if eight if the last standardized sample means are between 0 and 3 or if the eight of the last eight are between 0 and 3.3) (6. ¯ 3) the ratio of the ARL for the neural network to the standard Shewhart X control chart without supplemental runs rules are listed in Table 6.5.5) or (0.3.3. These rules can also be implemented using neural network and we will now describe some of these work.
43 0.61 1.75 0.63 0.CHAPTER 6.33 0.60 0.53 0.49 0.2 0.0 6.48 0.39 0.43 200.69 1.71 1.41 1.44 0.79 1.56 1.56 ARL = ARLc1 (µ) for Shewhart Control Chart.00 1.0 2.69 0.71 1.00 1.4 0.48 0.56 0.72 0.08 0.32 0.27 0.33 0.90 .08 119.55 0.56 0.60 0.0 0.24 0.32 0.61 0.36 0. 2.63 0.07 0.00 1.18 0.44 0.19 0.34 0.72 1.57 0.27 0.18 0.8 1.53 0.00 0.68 0.48 1.21 0.09 1. SURVIVAL ANALYSIS AND OTHERS NETWORKS 157 Table 6.63 0.78 0.72 0.25 0.02 1.16 0.27 0.56 0.40 308.43 0.34 0.46 0.0 370.44 0.00 1.99 1.19 0.53 0.88 0.95 0.98 1.42 0.33 0.36 0.31 1.33 0.32 0.94 1.37 0.40 0.25 0.60 0.55 0.70 0.10: Rations of Average Run Lengths: Neural Network Model Cnij to Shewhart Control Chart without Runs Rules (C1 ) Control Charts ARL∗ C1 Cn54 Cn55 Cn56 Cn57 Cn62 Cn63 Cn64 Cn65 Cn66 Cn67 Cn68 Cn71 Cn72 Cn73 Cn74 Shift (µ) 0.96 1.00 1.44 0.26 0.01 1.29 0.76 0.04 0.50 0.71 0.82 0.44 0.30 0.78 0.13 0.67 71.63 0.42 0.90 1.55 43.30 0.52 0.68 0.67 0.38 0.00 1.38 0.89 1.67 0.15 1.19 0.02 1.6 0.22 0.24 0.00 0.00 0.47 0.47 3.92 0.30 1.01 1.43 0.52 0.26 0.14 0.02 0.25 0.38 0.
She applied two groups of feedforward networks. two hidden layers.0(None) Shewhart Control Percent Wrong 27.0.0.0] [0.3 (mean shift). Table 6.0.1] [0.0.11 shows the percentage of wrong decisions for the control chart and the neural networks.8 11.e.1] Note Upward trend Downward trend Systematic variation (I) Systematic variation (II) Cycle (I) Cycle (II) Mixture (I) Mixture (II) Upward shift Downward shift – – the ﬁrst observation is above the incontrol mean the ﬁrst observation is below the incontrol mean sine wave cosine wave the mean of the ﬁrst distribution is greater than the incontrol mean the mean of the ﬁrst distribution is less than the incontrol mean – – The second group of network deferred from the ﬁrst group by the input which now is the raw observations only (n = 5.0. a total of 13 inputs or on the calculated islatistics only.0.0.3 to 0.0.0] [1. 3 inputs.0.5 0. The ﬁrst group with only one output.7 28.0] [0.0.0.0.0. i.5 0.7 (under control).1.0. n = 10. standard deviation). and interpretation as: 0 to 0.0.0] [0.0.0.5 0. sigmoid activation function.1. Table 6. 0. SURVIVAL ANALYSIS AND OTHERS NETWORKS 158 Smith (1994) extends the previous work by considering a single model of a simultaneous Xbar and R chart.5 27.0.0.0] [0.0. The inputs were either in ten observations (n = 10) and the calculated statistics (sample mean.0] [0. 0.CHAPTER 6.5 27. and investigating both location and variance shifts.5 0.11: Wrong Decisions of Control Charts and Neural Networks Training/ Test Set A A A B B B Inputs All All Statistics All All Statistics Hidden Layer Size 3 10 3 3 10 3 Neural Net Percent Wrong 16. These neural networks .1.0] [0.0.0.0.12: Desired Output Vectors Pattern Desired output vector [1. n = 20).8 0. range.1.1.5 0.5 Table 6.1.0. 3 or 10 neurons in each hidden layer.7 to 1 (variance shift).0] [0.
The gating network has 3 neurons elements in the output layer and 4 neurons in a hidden layer. and bimodal (1.1). Hyperbolic activation function was used. Table 6.6.1) was considered 1.3% # Missed of 300 Slope Sine wave 0 10 15 10 39 18 0 5 7 16 17 23 0 0 0 0 1 1 2.0). sine wave (cyclic). The networks here had two outputs used to classify patterns with the codes: ﬂat (0.3 Over All Networks # Correct of 300 288 226 208 292 268 246 297 293 280 88.0. A mixture of three expert (choice based on experimentation) with the same input as in the feedforward network.9% 3. sine wave (1. trend (0. A comparison between the performance of a feedforward network and a mixture of expert was made using simulations.13: Pattern Recognition Networks # Input σ Points of Noise 5 0.1% Bimodal 2 10 3 3 6 5 3 7 12 1. The multilayer network had 16 inputs (raw observations). He used a feedforward network with 50 · 30 · 1 (input·hidden·output) structure of neurons. A output in (0.1 10 0.1 20 0.4] was considered 0 and in (0. 12 neurons in the hidden layers and 5 outputs identifying the patterns as given in Table 6.3 20 0. SURVIVAL ANALYSIS AND OTHERS NETWORKS 159 were design to recognize the following shapes: ﬂat (in control).13.2 20 0. A more sophisticated pattern recognition process control chart using neural networks is given in Cheng (1997). Hwarng (2002) presents a neural network methodology for monitoring process shift in the presence of autocorrelation.9% .3 10 0.2 10 0.3 gives some of the results Smith (1994) obtained.1 5 0. Table 6. The study of AR (1) (autoregressive of order one) processes shows that the performance of neural network based monitoring scheme is superior to other ﬁve control charts in the cases investigated. slop (trend of drift) and bimodal (up and down shifts).1).8% Flat 0 39 32 0 3 9 0 0 6 3. The results seems to indicate that the mixture of experts performs better in terms of correctly identiﬁed patterns.CHAPTER 6.0).2 5 0. Each expert had the same structure as the feedforward network.
likelihood. First consider.2) The maximum likelihood estimator of the weights ω’s is obtained from the log likelihood function n L(ω) = j=1 yj log p(xi .4) It can be written as .4.4 Some Statistical Inference Results In this section. A more detailed account of prediction interval outlined here see also Hwang and Ding (1997). robust). Since the most used activation function is the sigmoid we present some further results. ∼ u (6. ω)(1 − p(x. maximum likelihood estimation.4.CHAPTER 6. ω). The logistic (sigmoid) activation relates output y to input x assuming I P x (Y = 1/x) = ∧ ω0 + ∼ ∼ ωi xji i=1 = p(x. bootstrap) and signiﬁcance tests (F test and nonlinearity test). yj ) is the observation for pattern j. ω)).4. ω ) + (1 − yj ) log(1 − p(xj ω) ∼ (6.4.1 Estimation methods The standard method of ﬁtting a neural network is backpropagation which is a gradient descent algorithm to minimize the soma of squares.4. this activation function as a nonlinear regression model is a special case of the generalized linear model i.1) where ∧(u) = 1/(1 + e ) is the sigmoid or logistic function. 6. ω) (6. we describe some inferential procedures that have been developed for and using neural network.e. Here we outline some details of other alternative methods of estimation. V (Y /x) = p(x. E(Y /x) = p(x.1.3) where (xj . interval estimation (parametric. As seen in Section 5. SURVIVAL ANALYSIS AND OTHERS NETWORKS 160 6. In particular we outline some results on point estimation (Bayes. here we follow Schumacher et al (1996) and Faraggi and Simon (1995). ω) ∼ (6. To maximize L(ω) is equivalent to minimize the KulbackLeiber divergence n yj log j=1 1 − yj yj + (1 − yj ) log p(xj ω) 1 − p(xj .
4. (6. At this point.5) which makes the interpretation of distance between y and p(y. we recall the criteria function used in the backpropagation least square estimation method (or learning rule): n (yj − p(xj .8) ∼ N (0.6) Due to the relation to of maximum likelihood criteria and the KulbackLeiber.7) (6. Here we follow Lee (1998) and review the priors used for neural networks.4) is called backpropagation of maximum likelihood. ω))2 j=1 (6. In the neural network jargon it is also called relative entropy or cross entropy method. the estimation based on (6. ω) more clear.4.4.CHAPTER 6. All approaches start with the same basic model for the output y k yi = β0 + i βj j=1 2 1 1 + exp(−ωj0 − p h=1 ωjh xih ) + i. σ ) The Bayesian models are: ¨ − Muller and Rios Insua (1998) A direct acyclic graph (DAG) diagram of the model is: The distributions for parameters and hyperparameters are: . for comparison. SURVIVAL ANALYSIS AND OTHERS NETWORKS n 161 − log (1 − yj − p(x. Bayesian methods for neural networks rely mostly in MCMC (Markov Chain Monte Carlo).4.4. ω)) j=1 (6.
m ωj − ωγj ∼ Np (µγ . . Aγ = Ip .4. σ 2 .4. s = 1.4. . .4. Aγ ) cβ Cβ 2 σβ ∼ Γ−1 . m µβ ∼ N (aβ . .13) (6. . These standard deviations . cγ = p + 1. yi  ≤ 1. . . σβ ).4. aγ = 0. . Aβ ) µγ ∼ Np (aγ .4. Cγ = p+1 10 − Neal Model (1996) Neal’s fourstage model contains more parameters than the previous model. SURVIVAL ANALYSIS AND OTHERS NETWORKS 162 Figure 6. one choice of starting hyperparameters is: aβ = 0. .15) (6. j = 1. A DAG diagram of the model is Each of the original parameters (β and γ) is treated as a univariate normal with mean zero and its own standard deviation.14) (6.16) 2 βj ∼ N (µβ . S = .4.CHAPTER 6.10) (6. 2 2 Sγ ∼ W ish−1 cγ . (cγ Cγ )−1 s S σ 2 ∼ Γ−1 . 1 1 Ip . 2 2 There are ﬁxed hyperparameters that need to be speciﬁed. . Sγ ). i = 1. .4. N (6.12) (6. .10: Graphical model of Muller and Rios Issua m yi ∼ N j=0 βj ∧ (ωj xi ). cβ = 1.11) (6. Aβ = 1. j = 0. As some ﬁtting algorithms work better when (or sometimes only when) the data have been rescaled so that xih . M¨ ller and Rios u Insua specify many of them to be of the same scale as the data. Cβ = 1.9) (6. .
but knowing that the posterior would be improper. and one for the destination node. Neal uses hyperbolic tangent activation functions rather than logistic functions. Under his model. − MacKay Model (1992) MacKay starts with the idea of using an improper prior. he uses the data to ﬁx the hyperparameters at a single value. There are two notes on this model that should be mentioned.CHAPTER 6. SURVIVAL ANALYSIS AND OTHERS NETWORKS 163 Figure 6. the main difference being that their range is from 1 to 1 rather than 0 to 1. First. one for the originating node of the link in the graph. the distributions for parameters and hyperparameters are: . The second note is that Neal also discusses using t distributions instead of normal distributions for the parameters.11: Graphical model of Neal are the product of two hyperparameters. These are essentially equivalent in terms of the neural network.
he uses the noninformative improper prior 1 . σ2 π(β. m α1 1 ωj h ∼ Np 0. j = 0.19) (6. . i = 1. − Lee Model (1998) The model for the response y is k yi = β0 + j=1 βj 1 1 + exp(−γj0 − p h=1 ωjh xih ) + i. . . Each of the three models reviewed is fairly complicated and are ﬁt using MCMC.4. . This model is also ﬁtted using MCMC. 0. . . N (6.4. . j = 1. .17) (6. This prior is proportional to the standard noninformative prior for linear regression. .24) where d is the dimension of the model (number of parameters). . m α3 αk ∝ 1. σ 2 ) Instead of having to specify a proper prior. .4. .18) (6. m h = 1.20) (6. .4. βj ∼ N MacKay uses the data to ﬁnd the posterior mode for the hyperparametrics α and ν and then ﬁxes them at their modes. σ 2 ) = (2π)−d/2 (6. . 3 v ∝ 1.23) ∼ N (0. Belue and Bower (1999) give some experimental design procedures .21) (6.4. .CHAPTER 6. . 2. j = 1. . 1 ν . k = 1. ω.4. . . .22) 1 . SURVIVAL ANALYSIS AND OTHERS NETWORKS 164 m yi ∼ N j=0 βj ∧ (ωj xi )(γj xi ). . This subsection ends with some references on estimation problems of interest.4.4. iid i (6. p α2 1 ωj 0 ∼ Np 0. .
CHAPTER 6. 2.4. (6.4. Essentially they used the standard results for nonlinear regression least square S(ω) = (yi − f (X.25) or the regularization. The author gives an application and De Veux et al (1998) also shown some simulated results. Here we summarize the results of Chryssoloures et al (1996) and De Veux et al (1998). xi )/∂ωj g0 is a vector with entries ∂f (x0 ω)/∂ωj and k s∗ H p∗ = y − f (Xω ∗ ) + Jω ∗ .2 Interval estimation Interval estimation for feedforward neural network essentially uses the standard results of least squares for nonlinear regression models and ridge regression (regularization). Tibshirani (1995) compared sevens methods to estimate the prediction intervals for neural networks.4. Approximate delta: Using approximate to information matrix that ignores second .4.4.32) where H = J(J J + αI)−1 J . ω ∗ is the true parameter value. J is a matrix with entries ∂f (ω.29) (6. ω))2 (6.30) (6. (6. ω))2 + kα 2 ωi . Copobianco (2000) outlines robust estimation framework for neural networks. Aitkin and Foxall (2003) reformulate the feedforward neural network as a latent variable model and construct the likelihood which is maximized using ﬁnite mixture and the EM algorithm. = RSS/(n − tr(2H − H 2 )) = J(J J + αI)−1 J = tr(2H − H 2 ).4.28) where s = (residual sum of squares)/(n − p) = RSS/(n − p).4. SURVIVAL ANALYSIS AND OTHERS NETWORKS 165 to achieve higher accuracy on the estimation of the neural network parameters.27) (6.4.31) (6. p is the number of parameters. The method compared were: 1. Deltamethod. 6. when weight decay is used S(ω) = (yi − f (X.4.26) The corresponding prediction intervals are obtained from the t distribution and tn−p s 1 + g0 (JJ)−1 g0 tn−p∗ s∗ 1 + g0 (J J + kI)−1 (J J + kI)−1 g0 (6.
Adams (1999) gives the following test procedure to test hypothesis about the parameters (weights) of a neural network. SURVIVAL ANALYSIS AND OTHERS NETWORKS 166 derivatives. .4. Two bootstrap estimators. To implement the test it is ﬁrst necessary to calculate the degree of freedom for the network.33) With a multilayer feedforward network with m hidden layers. The test can be used to test nonlinear relations between the inputs and outputs variables. (6. the degrees of freedom for each output is df = n − 44 + (2 − 1)(6 + 1) = n − 37. Two approximate sandwich estimators and 5.36) For a feedforward network trained by least square we can form the model sum of squares. The regularization estimator.4.4.4. 4. 6.34) P = i=1 (Li−1 L1 + Li ) . sandwich estimators) missed the variability due to random choice of starting values. (6. In this 5 examples and simulations he found that the bootstrap methods provided the most accurate estimates of the standard errors of the predicted values.37) where SSE0 is the sum of squares of submodel and SSE is the sum of squares of full model.35) For a network with 4:6:2.4. For Lm+1 outputs and Lm neurons in the hidden layer we have df = n − p + (Lm+1 − 1)(Ln + 1). (6.e. 3. The nonsimulation methods (delta. regularization.3 Statistical Tests Two statistical tests when using neural networks are presented here. the input layer designed by 0 and the out designed by m + 1. Therefore any submodel can be tested using the F test F = (SSE0 − SSE)/(df0 − df ) SSE/df (6. The number of parameters is m+1 (6. If n is the number of observations and p the number of estimated parameters then the degree of freedom (df) are calculated as df = n − p.4.CHAPTER 6. i. 4 inputs. 6 hidden neurons and 2 outputs.
yt−k ). An application is given in Kiani (2003). yt−k . save matrices Ψ of outputs of the neurons for principle components analysis. Simulations results on the behaviors of this test can be seen in the bibliography on references. save residuals (ˆt ). . . thus T S = nR2 ∼ χ2 (g) where g is the number of principal components used un step (iii). SURVIVAL ANALYSIS AND OTHERS NETWORKS 167 The above test is based on the likelihood ratio test.CHAPTER 6. . ˆ (ii) regress ut on yt−1 .12 clariﬁes the procedure: (6. See also Lee (2001). Blake and Kapetanios (2003) also suggested a Wald equivalent test. . . The implementation of Rao test (see Kiani. The neural network in 6. An alternative proposed by Lee et al (1993) uses the Rao score test or Lagrange multiplier test. residual sum of u squares SSE1 and predictions yt . yt−k using neural network model that nests a linear ˆ regression yt = πy + ut (y = (yt−1 . .4.38) ∼t ∼t Figure 6. . . 2003) to test nonlinearity of a time series using a onehidden layer neural network is (i) regress yt on intercept and yt−1 . . . . . and obtain R2 . . (iii) regress et (residuals of the linear regression in (ii)) in Ψ∗ (principal compoˆ nents of Ψ) and X matrices.12: Onehidden layer feedforward neural network.
Neural Network Models. Introduction to radial basis function networks. Neural networks and statistical models. 1997.C. [3] Orr. [3] Wilde.. V. 1997.Bibliography and references Preface Fundamental Concepts on Neural Networks Artiﬁcial Intelligence: Symbolist & Connectionist [1] Carvalho. Editora Erika. 1999.. Chao. L. [2] Gueney.nj. Foundations of Wavelet Networks and Applications. Neural Networks: A Comprehensive Foundation.J. Prentice Hall. Tech.S.V.com/sarle94neural. SP.uk/ mjo/papers/intro. rep.. 2nd ed..html). S. Data mining. 2nd ed. Activation Functions [1] Haykin. W. and Phola. An Introduction to Neural Networks.. D. Springer. E. University College London Press. 2002.nec.L. Neural Networks: A Comprehensive Foundation. 2nd ed.. Chapman Hall/CRC. K. S. (http//citeseer. Institute of Adaptive and Neural Computationy.ps). 1999.ac. 1994. M. Artiﬁcial Neural Networks and Diagrams [1] Haykin.ed. In Proceedings of the Nineteenth Annual SAS Users Group International Conference. S. (www. [2] Igengar. 1996. Prentice Hall.. 1997. [2] Sarle. Brasil.S. 2nd ed.anc.V. Edinburgh University. 168 .
P. 62(3):275– 288. S. Network Training [1] Haykin. International Statistical Review.anc.. 62(3):275– 288. McGrawHill Inc. [2] Mitchell. Neural Networks: A Comprehensive Foundation. Y. Springer. T. Neural networks and related “massively parallel” methods for statistics: A short overview. [3] Orr. Introduction to radial basis function networks. Murray. 1996. 7(2):501–505.. Prentice Hall. Prentice Hall.ps). [4] Rojas. . Algorithms and Applications.. N.M. T.ac. Machine Learning. 1999. 1996..R.K. Edinburgh University. 1994. Neural Networks Fundamentals with Graphs... (www. 1994. Neural Networks: A Comprehensive Foundation. F.ed. 2nd ed. 2nd ed. M.CHAPTER 6. F. S. [3] Murtagh. 1997. Prediction sun spots using a layered perception neural networks. 1999. 2nd ed. [4] Park. october 1994.uk/ mjo/papers/intro. 2nd ed. Tech.. 2nd ed. S. IEEE Transaction on Neural Networks. rep. C.. 1999.J. Neural networks and related “massively parallel” methods for statistics: A short overview..J. 1996. International Statistical Review. Kolmogorov Theorem [1] Bose. Model Choice [1] Haykin. Prentice Hall.L. 2nd ed. SURVIVAL ANALYSIS AND OTHERS NETWORKS 169 Network Architectures [1] Haykin.. R. Neural Networks: A Comprehensive Foundation. and Liang. Neural Networks: A Systematic Introduction. McGrawHill Inc. [2] Murtagh. and Chen.. Institute of Adaptive and Neural Computationy.
. (http//citeseer. J. 2nd ed. [8] Stegeman. In Statistical models and neural networks (Eds.nj. K.S.E. N. Fuzzy Systems and Knowledge Engineering. . W.html). Neural Computing and Applications. University College London Press. 1999.S. D. and Mcharty. G. 1999.nec. 2000. and Zanringe. 1996.R. MIT Press. Algorithms and Applications. Data analysis and information systems. statistical and conceptual approaches. Neural networks and statistical models.com/sarle94neural. [6] Sarle. Prentice Hall. (8):290–296. pages 243–260.. 1994. Palgrave. 1997.S. An Introduction to Neural Networks. J. [4] Gueney. P. [3] Picton. Fundamental of Neural. A glossary of basic neural network terminology for regression problems. C. Neural Network Models. Springer. SURVIVAL ANALYSIS AND OTHERS NETWORKS 170 Terminology [1] Addrians. Prentice Hall.. [5] Kasabov. [7] Sarle. Kendall). L. Foundations of Neural Networks. J. W. 2nd ed.A. N. BarndorfNielsen. 1994. vol. 2nd ed. [4] Wu. Springer. (ftp://ftp. Elsevier. Neural Networks and Genome Informatics. AddisonWesley. O. Neural network and statistical jargon. P. studies in classiﬁcation data analysis and knowledge organization. and Buenfeld. Some Common Neural Networks Models Multilayer Feedforward Networks [1] DeWilde. P.CHAPTER 6.H. 2nd ed. 1997.W. and Polozik. Neural Networks. Architecture. 1999. 1996. In Proceedings of the Nineteenth Annual SAS Users Group International Conference. Data Mining. 7.. [3] Fausett. and W.com). S. 1996. [2] Arminger.sas.K. [2] Haykin. Jensen. W. Neural Networks: A Comprehensive Foundation.L.
B. Oxford University Press. 2000. Cho. Statistics and neural networks advances in the interface.CHAPTER 6.. 2000. Redes Neurais Artiﬁciais: Teoria e Aplicacoes. [4] Fausett.S.M.C. H.P. Radial Basis Function Networks [1] Lowe. Peters. In Artiﬁcial neural networks and multivariate statistics (Eds. Neural Networks: A Systematic Introduction. and Sapatinas. rep. Prentice Hall. 1993.T.. and McLarty. L. 49(1):1– 29. Foundations of Wavelets Networks and Applications.K. LTC editora. and Carvalho. M.M. Edinburgh University. 1998.C. Fundamentals of Neural Networks Architectures.. In Radial basis function networks and statistics (Eds.F.H. 1999. . J. Prentice Hall. 124.ac. E. Oxford University Press. [3] Iyengar. A. Herington).B. Ludemir. [2] Hubbard. Institute of Adaptive and Neural Computationy. LTD. Barley. ¸˜ [3] Chester.. [3] Wu.L. B.C.V.L. 1996.J. 1994. Chapman and Gall/CRC. [2] Braga. pages 65–95.J. Journal of the Royal Statistical Society D. D.. 2000. A. E. [5] Rojas.W. 2002. 1999.anc.. 1999.uk/ mjo/papers/intro.. and Edelman. Wavelet analysis and its statistical application. J. Valentin. T. Statistics and neural networks. F. Neural Networks: A Tutorial.. M. (www. and Morris. R. The World According to Wavelets.. [2] Orr.B.W. A. and Phola. C. Introduction to radial basis function networks. T. SURVIVAL ANALYSIS AND OTHERS NETWORKS 171 Associative and Hopﬁeld Networks [1] Abdi.B. Quantitative Applications in the Social Sciences. Algorithms and Applications.P.W. Neural Networks and Genome Informatics. V. J. Elsevier. A. pages 195–258. SAGE. T.ed. Springer.ps). Tech. Wavelet Neural Networks [1] Abramovich. S. Kay and D. 1996. D. [4] Martin. Titterington). Prentice Hall. Kay and D.
1997. 1996. 26(4):991–1009.G.. P. Q. 1996.A. V. Oxford. Switzerland. .A. In Data Analysis and Information Systems (Eds. [6] Mehnotra.K. Springer Verlag.A. A. [5] Jordan.A. Bock and W. [2] Cheng. Neuchˆ tel. .ime.. Prat). and Lechevalier. A. M. Hierarchical neural networks for survival analysis. A. L. 1997. Crook. 1997. P. C. Neural Computation. and Jacobs. Mohan. pages 243–260.H.A. IEEE Transactions on Neural Networks.S. [6] Morettin. M. MIT Press. M. Wavelets in statistics (text for a tutorial). From fourier to wavelet analysis of true series.Theory and Methods. D. A neural network approach for the analysis of control charts patterns. B. Statistical models and artiﬁcial neural networks. Neural Network for Pattern Recognition. Walker. In The Eighth Worlds Congress on Medical Information.. [3] Ciampi. 1992. a (www.html). C.CHAPTER 6. pages 111–122. and Overstreet. H. S. In Proceedings in Computational Statistics (Ed.. Statistical models as building blocks of neural networks. R. International J. [7] Zhang. and Ranka.N. C.G. European Journal of Operational Research. [4] Desai. Communications in Statistics . 3(6):889–898. Y. 1995.. 2000. 124. Mixture of Experts Networks [1] Bishop. Hierarchical mixtures of experts and the em algorithm.C. SURVIVAL ANALYSIS AND OTHERS NETWORKS 172 [5] Morettin. 1996. and Enache. and Musen. J.. 1997.I. G. 95:24–37. of Production Research. A comparison of neural networks and linear scoring models in the credit union environmental. Polasek). K. Wavelets networks. 35:667–697.M. and Benveniste. Neural Network Models & Statistical Models Interface: Comments on the Literature [1] Arminger. Elements of Artiﬁcial Neural Networks. 6:181–214. 1994. [7] OhnoMachodo.. Vancouver.usp/br/ pam/papers. PhysicaVerlag. Canada. In The Third International Conference on Statistical Data Analysis Based on L1 Norm and Related Methods.
Journal of the American Statistical Association. 1994. Depatament of Statistics.. IEEE Transactions on Neural Networks. SURVIVAL ANALYSIS AND OTHERS NETWORKS 173 [2] Balakrishnan. 2004. 1998. T.D.W. and Ding. . D. Statistical Sciences.. C. 1995. A. [6] De Veaux. [11] Lee. [12] Lee.. In Radial basis function networks and statistics (Eds. 1998. A study of the classiﬁcation capabilities of neural networks using unsupervised learning: a comparision with kmeans clustering.K. C. MIT Computational Cognitive Science.. M. Neural Network for Pattern Recognition. L. R.A. Statistics and neural networks advances in the interface. D. 1997. 124. [4] Cheng... Psychometrika.H. Cornegie Mellon University.W.d thesis. J. [14] Lowe.S. Prediction intervals for neural networks via nonlinear regression. H. Neural networks: A review from a statistical perspective (with discussion). 1994.H. 1993. [10] Kuan. 1996. Cooper.M. Kay and D. J. H.H.CHAPTER 6. M. pages 65–95. 7:229–232. Report 9503. and Lewis.V. Prediction interval for artiﬁcial neural networks. Artiﬁcial neural networks: An econometric perspective. and Ramsey. Why the logistic function? a tutorial discussion on probabilities and neural networks. 1998.I. 17:2501–2508.C.M. Ph. [13] Lee.. C. H. Model Selection and Model Averaging for Neural Network. [8] Hwang. Herington).J. R. Hinton. Statistics In Medicine.. and Granger. and Tibshirani..M. B. and Titterington. J.T. (59):509–525..G. M.. J. Journal of Econometrics. 1994.. G. M. Naylor. Conﬁdence interval predictions for neural networks models. P. D.H. A. Jacob. 1997.. and White. Schumi.A. 13:1–91. G. [9] Jordan. P. [7] Ennis. SIAM: Society for Industrial and Applied Mathematics. M. 1999. and Ungar. Lee.T. White. Testing for negleted nonlinearity in time series models. A comparison of statistical learning methods on the GUSTO database.M. Technometrics. Oxford University Press. 40(4):273–282. Revow. H.. [5] Chryssolouriz. [3] Bishop. 92:748–757. M. 9:2–54. Econometric Reviews. Bayesian Nonparametrics via Neural Networks. Oxford. Schweinsberg. 56:269–290..K.
2000.. Feedfoward neural networks for nonparametric regression.J. 1994. R.html). D. [22] Poli. 1994. [24] Ripley. E. Oxford University Press.. Bayesian inference in neural networks.M.com). [21] Paige. F. A comparison of SOM neural networks and hierarchical clustering methods. (http//citeseer. Statistics in Medicine. Titterington).d thesis. W. SURVIVAL ANALYSIS AND OTHERS NETWORKS 174 [15] Mackay. Journal of the American Statistical Association. A neural net model for prediction. European Journal of Operations Research. 1996. California Institute of Technology. G. Comparing som neural network with fuzzy cmeans. [23] RiosInsua. [18] Mingoti.K.S. and Muller. 1992. Neural network and statistical jargon.O.S. pages 195–258.W. 88(3):623–641. International Statistical Review. and Morris.nec. S. In Proceedings of the Nineteenth Annual SAS Users Group International Conference. D. In Pratical Nonparametric and Semiparametric Bayesian Statistics (Eds. Neural networks and related “massively parallel” methods for statistics: A short overview.com/sarle94neural. European Journal of Operational Research. Mueller. [20] Neal. 1996. 1999. 19:541–561. and D. Cambrige University Press.J. 2001. D. Vach. M. and Lima. 1996. 62(3):275– 288. J. [16] Mangiameli. D.CHAPTER 6.. Bayesian Methods for Adaptive Methods. Dey. P.. . P. R. W. Chen.A.D. On the misuses of artiﬁcial neural networks for prognostic and diagnostic classiﬁcation in oncology. W. Ph. Kay and D. R. P. In Artiﬁcial neural networks and multivariate statistics (Eds.nj. Pattern Recognition and Neural Networks. Biometrika.L. NY Springer. kmeans and traditional hierarchical clustering algorithms. J.W. [26] Sarle.D. [19] Murtagh. and Butler.sas. A. and Schumacher.B. 174(3):1742–1759. Sinha). (93):402–417. pages 181–194. and West. [27] Schwarzer. 2006. Neural networks and statistical models. B. [17] Martin. Statistics and neural networks. (ftp://ftp.. S.M. [25] Sarle.C. 1996. october 1994. 89(425):117–121. R. Bayesian Learning for Neural Networks. I. and Jones. NY Springer.
1996. M.. N. . and White. Neuralnetwork learning and statistics. pages 719–775. P. American Statistician. A modelselection approach to assessing the information in the term structure using linear models and artiﬁcial neural networks. 19(1):128–139.S.E. 1989.R. H.R. and Schumacher. Van Nostrond Reinhold. 1997. Computational Statistics and Data Analysis. Parametric statistical estimation using artiﬁcial neural networks. 84(408):1003–1013. Journal of the American Statistical Association. 1989. [32] Tibshirani. 13(3):265–75. H. Journal of Business & Economic Statistics. [39] White. 13(4):439–461. H. [40] White. 21:683– 701. [35] Vach. Robner. B. Mozer. J. A comparison of some error estimates for neural networks models. 2004. In Mathematical Perspectives on Neural Networks (Eds. H. [29] Stern. 4(12):48–52. Neural Computation. H. Forecasting economic time series using ﬂexible versus ﬁxed speciﬁcation and linear versus nonlinear econometric models. N. AI Expert. Understanding neural networks as statistical tools. SURVIVAL ANALYSIS AND OTHERS NETWORKS 175 [28] Smith. 1996. Technometrics. Statistical Science.CHAPTER 6. Neural networks and logistic regression. 1(4):425–464. 1995. 1996. Neural network in applied statistics (with discussion). part ii. [36] Warner. Journal of Clinical Epidemiology. R. 1996. [30] Swanson. D.C. Rumelhart). [31] Swanson. Neural Computation. M. M. and White.. 38:205–220. Some asymptotic results for learning in single hiddenlayer feedforward network models. Learning in artiﬁcial neural networks: A statistical perspective. L Erlbaum Associates. H. and D. 1989. [37] White. Smolensky. H. and Misra. 1996. Advantages and disadvantages of using artiﬁcal neural networks versus logit regression for predicting medical outcomes. International Journal of Forecasting. 1996. M. 50:284–293. [38] White. [33] Titterington. R. 8:152–163.V. W. 1993. Neural Network for Statistical Modelling.M. 49:1225–1231. [34] Tu. Bayesian methods for neural networks and related models.
Redes Neurais Artiﬁcias. 2003. 2000. Nagamashi. [5] Draghici. [8] Ishihara. Combining unsupervised and supervised neural networks analysis in gamma ray burst patterns.C.R. Selforganizing arterial pressure pulse classiﬁcation using neural networks: theoretical considerations and clinical applicability. Predicting hiv drug resistance with neural networks. J. (19):98–107. and Chen. Fundamentals of Neural Networks Architectures. C. Cooper. S. S. Pereira. and Cybernetics . Multivariate Statistics Neural Network Models Cluster and Scaling Networks [1] Balakrishnan. C.. Understanding Neural Networks and Fuzzy Logic...B. and Halgamuge. Y..CHAPTER 6. 19:2131–2140. Basic Concepts and Applications.B. Penn State University. and Matsubara. K. S. R. [9] Kartalopulos.Part C: Applications and Reviews. Bioinformatics.. IEEE Transactions on Systems. (59):509–525. LTC Editora. Psychometrika. . P. Departament of Statistics. Tang. B. 1995. Avoiding pitfalls in neural network research. 2007. (30):71–78..L. 2000. An unsupervised hierarchical dynamic selforganizing approach to cancer class discovery and marker gene identiﬁcation in microarray data. M. A study of the classiﬁcation capabilities of neural networks using unsupervised learning: a comparision with kmeans clustering. Rio de Janeiro. 2006. .K. 1996.V. Man.S. Intenational Journal of Industrial Ergonomics. 2003. IEEE Press. [4] Chiu.. [3] Carvalho. Ludemir.L. A. M. and Rao.J. 37(1):3–16. Prentice Hall.. SURVIVAL ANALYSIS AND OTHERS NETWORKS 176 [41] Zhang.. A. S.. and Potter. [2] Braga.V. 1994. C. 1994. (15):13–24... P. Computers in Biology and Medicine.P. and Carvalho. M. [6] Fausett. Techinical report center of multivariate analysis.. S. Yeh. S. An automatic builder for a kansei engineering expert systems using a selforganizing neural networks.A.H. Ishihara. G. Jacob.P. L. and Lewis.. Algorithms and Applications. [7] Hsu. T. Bioinformatics.
and West. [15] Rozenthal. [14] Patazopoulos.K. May 1998. A.A. L. M. Chen. S. ı o 1998. Dimensional Reduction Networks [1] Cha.UFRJ.. . SURVIVAL ANALYSIS AND OTHERS NETWORKS 177 [10] Lourenco. L.M. P. A. Medical Engineering and Physics.N. Redes neurais e Arvores de Classiﬁcacao Aplicados ao Di¸˜ agn´ stico da Tuberculose Pulmonar Paucibacilar. and Popovic.. A. and H. S. V. D. Chan. P.M. 2003. Pontif´cia Universidade Cat´ lica do Rio de Janeiro. Elements of Artiﬁcial Neural Networks. Adaptive resonance theory in the search for neuropsychological patterns of schizophrenia. (21):313–327. Iokimliossi.. M. Techinical Report in Systems and Computer Sciences. D.V. Classiﬁcation of tetraplegies through automatic movement evaluation.. E. Radivojevic.Sc. Finacncial Enginerring and Inteligent Agents (Eds. C. [18] Vuckovic.. Leung. 1999. Pauliakis. (6):946–950. 2002. sc. Data Mining. [16] Rozenthal. A. M. K. M. (93):402–417.A. Automatic recognition of aletness and drawsiness from eec by an artiﬁcial neural networks. UFRJ.. in e ı e eletrical engineering. Esquizofrenia: Correlacao Entre Achados Neuropsi¸˜ col´ gicos e Sintomatologia Atrav´ s de Redes Neuronais Artiﬁciais. MIT Press. 2000. april 2002. Applying independent component analysis to factor model in ﬁnance. Springer. 1996. IDEAL 2000. Static cytometry and neural networks in the discrimination of lower urinary system lesions. pages 538–544.M. and Dimopoulos.UFRJ. R. and Popovic. Chen. COPPE . in Production o Engineering. P. [12] Mangiameli. [11] Maksimovic... D.W..E. [13] Mehrotra. D. Urology. Medical Engineering and Physics. S. A comparison of SOM neural networks and hierarchical clustering methods. and Carvalho.C. Engelhart. European Journal of Operations Research.. ´ [17] Santos. A. P. 2001. Um Modelo de Previs˜ o de Curto Prazo de Carga El´ trica ¸ a e Combinando M´ todos Estat´sticos e Inteligˆ ncia Computacional. In Intelligent Data Engineering and Automated Learning. Sc.. and Chan.K.K. D. Meng). Bruges. o e in Psychiatry. Mohan. L. COPPE . (24):349–360.CHAPTER 6... 1988. D. K. and Ranka.
S. Overview of independent component analysis technique with application to synthetic aperture radar (SAR) imagery processing. G. [14] Giannakopoulos.. G. 1989. SURVIVAL ANALYSIS AND OTHERS NETWORKS 178 [2] Chan. J. 1998. Principal Components Neural Networks. R. [12] Dunteman. april 2002a. Wiley.M. [6] Darmois. Principal component analysis. [13] Fiori. University of Shefﬁeld. Pattern Classiﬁcation. [10] Draghich.Y. 1999. E. Independent component analysis. K.E. M. 1994.TALN. [11] Duda. 16:453–467. Graziano. and Cha. 1996. and Memmi. M. F. S. (69). (121). Meng).G. 19:931–986. and Stock.Quantitative Applications in the Social Sciences. december 2001. [3] Clausen.. 21:2–8.. [9] Diamantaras. . Wiley. In Proceeding of the 3th International Conference on Independent Component Analyses (Eds. P. X. S. 2001. [4] Comon... S. 1953. K. Kettoola. [5] CorreiaPerpinan. SAGE.E. Analyse factorielle neuronalle pour documents textuels. SAGE. An experimental comparision of neural algorithms for independent component analysis and blind separation. A review of dimensional reduction techniques. and H. In European Sympsium on Artiﬁcial Neural Networks ESANN.. L. 2nd ed. Hart. D.. International Journal of Neural Systems. Karhunen. Neural Networks. M. Applied correspondence analysis: An introduction.. 1997.W. Leung. pages 161–166. P. International Stae e tistical Review. D. [7] Delichere. Techinical report cs9609. Neural dimensionality reduction for document processing. L. and Memmi. Theory and Applications. Sethi. Chan. I. [8] Delichere. Mining HIV dynamics using independent component analysis. Analise g´ n´ rale des liasons stochastiques. Selection of independent factor model in ﬁnance. june 2002b. S. S.. Bioinformatics. and Oja. Quantitative Applications in the Social Sciences. a new concept? Signal Processing. In Traiment Automatique des Languages Naturelles .H. 9:99–114. 2003. D. and Towﬁc. and Kung.W. G.O.CHAPTER 6.I. 3:287–314.A. 2003.
W. Independent component analysis using an extended informax algorithm for mixed subgaussian and supergaussian sources. 1999. Krogh. A User´s Guide to Principal Components. 2001. 1997. SURVIVAL ANALYSIS AND OTHERS NETWORKS 179 [15] Girolami.G.R.. Wiley. [21] Jackson.. Independent Component Analysis. a Wiley.NGUS97. Methods of Statistical Data Analysis of Multivariate Observations. E.K. [22] Jutten. Y. J. Wiley. 1998.. M. 1999. 1991. L. 11:417–441..CHAPTER 6. K. [29] Mehrotra. (13).M. SelfOrganizing Neural Networks: Independent Component Analysis and Blind Source Separation. A. and Sejnowski.. 2000. and Rao. T. C. Kluwer Academic Publishers. Elements of Artiﬁcial Neural Networks. C. [26] Lebart. MIT Press. Mohan. Quantitative Applications in the Social Sciences. Fernandes Aquirre (Ed). Characterization Problems in Mathematical Statistics. J. and Joutsensalo. 1997. and Oja. 2001. 8:486–504. T. Correspondence analysis and neural networks. A.. A. 1991.. J. Linnik. [19] Hyvarinen. [16] Gnandesikan. [23] Kagan. [24] Karhunen. 2000.J.. K. S. [17] Hertz. E. and Oja.. and Ranka. pages 47–58.O. J. C. Survey on independent component analysis. [27] Lee. R. . 1973. A. L... 2nd ed. and Mueller. AddisonWestey. T. In IV Internation Meeting in Multidimensional Data Analysis . SAGE. Introduction to factor analysis. [25] Kin. and Palmer. [18] Hyvarinen. [28] Lee. 1978. A. E. C. IEEE Transactions on Neural Networks.. M.. R. In Procedings ICA 2000. 13:411–430. 1991. and Taleb. Springer. [20] Hyv¨ rinen. Neural Computing Surveys. Independent component analysis: algorithm and applications. Source separation: from dust till down. Karhunen. Girolami. 1999. 2:94–128. Introduction to the Theory of Neural Computation. Wiley.W. Neural Networks.E. A. J. Neural Computation. Oja. J. A class of neural networks for independent component analysis. Wang. Independent Component Analysis Theory and Applications..W.
. Enache.K. In Procedings of the National Science Council. [36] Tura.L. and Mathew.K. Multilayer statistical classiﬁers. D. The Interpretation of Ecological Data: A primer on Classiﬁcation and Ordination. 1997. O. M. Classiﬁcation Networks [1] Arminger.CHAPTER 6. 12:293–310. Shefﬁeld University. Applying neural netorks technologies to factor analysis.V. P. 2002.. and Porril. A comparision of neural network and other pattern recognition aproaches to the diagnosis of low back disorders. Aplicacao do Data Mining em Medicina. [32] Ripley. .B. Neural Networks. W. Computational Statistics and Data Analysis. A. Facul¸˜ dade de Medicina. E. Computational Statistics and Data Analysis. J. Y. Feedfoward neural networks for principal components extraction. and Krzanowski. SURVIVAL ANALYSIS AND OTHERS NETWORKS 180 [30] Nicole.R. [33] Rowe. pages 19–30. 12. B.S. S. Computational Statistics and Data Analysis. 2001. 2000. 3:583–591. Pattern Recognition and Neural Networks.J. 2003. Lloyd. o e [38] Weller. [31] Pielou. Cambrige University Press. B. Independent component analysis and projection pursuit: a tutorial introduction. and Bonne. K. [2] Asparoukhov. Multivariate Bayesian Statistics Models for Source Separation and Signal Unmixing. Chapman & Hall/CRC. K.G. Willey. (75).D. D.C. Universidade Federal do Rio de Janeiro. [35] Sun. J. and Rommey. Editora Interciˆ ncia. and Lai. tree analysis and feedfoward networks. 1984. S. G. vol. 2000. [3] Bose. [37] Vatentin.G. Analyzing credit risk data: a comparision of logistic discrimination. A comparison of discriminant procedures for binary variables. J. 1996. [4] Bounds. [34] Stone. Ecologia Num´ rica: Uma Introducao a An´ lise Multivariada e ¸˜ ` a de Dados Ecol´ gicos. B. Computational Statistics.J. 38:139–160.. D. classiﬁcation. SAGE Quantitative Applications in the Social Sciences. Fernandes Aquirre (Ed). S. Psychology Departament. thesis. 1990. 33:425–437. 1990.T. 42:685–701. sc. ROD(D). 1998. Techinical Report. Metric scalling: Correspondence analysis.. 2003. T. 2001.
M. B. 27:1131–1152.M.H. 2003. Genome Sequence and Survival Analysis (Eds. ´ [14] Santos. Pattern classiﬁcation using neural networks. IEEE Comunications Magazine. Engenharia de Producao.R. [12] Rao. B..A. 1949. R. Medronho. . (8):117–126. 1988. I.P.P. 56:409–456.M. Pereira. Molenberghs).Sc.. R. [16] Santos.. Statistical and Neural Classiﬁers. and Kritski.M.Q. [8] Lippmann. SURVIVAL ANALYSIS AND OTHERS NETWORKS 181 [5] Desai. Revista Brasileira de Epidemologia. and Overstreet. A. S. [11] Randys. B. C. Computers and Operations Research. Pereira. Seixas. [13] Ripley. R. A comparision of neural networks and linear scoring models in the credit union environment.A.CHAPTER 6. and Ragsdale. [17] West.B. In Advances in Statistical Methods for the Health Sciences: Applications to Cancer and AIDS Studies. Decision Sciences. Campos. and G.. V. J. ¸˜ 2005. 1991.M. N. Balakrishman. The use of multiples measurements in taxonomic problems. Annals of Eugenics. 1.S. J. 2000.. 2006. Mello. 26:229–242. o ¸˜ UFRJ.N.A. Crook. Sankya.R. 1936. [10] Markham. 30:266–275. 9:343–365. 7:179–188. R. D.. [15] Santos. On some problems arising out of discrimination with multiple characters. G.. and Calˆ ba.. [6] Fisher. Springer. J. Auget.T. Seixas. M. L. 11:47–64. Neural networks: an application for predicting smear negative pulmonary tuberculosis. R. pages 279–292.P. 1949. A. A.B. Neural Networks for Signal Processing.. Neural networks credit scoring models. Tese D. Mesbab. In ICASSP. Usando redes neurais artiﬁciais e regress˜ o log´stica na o a ı predicao de hepatite A.. J. A critical overview of neural networks pattern classiﬁers.D. Neural networks and related methods for classiﬁcation (with discussion). 1995. [7] Lippmann.C. Redes Neurais e Arvores de Classiﬁcacao Aplicadas ao Di¸˜ agn´ stico de Tuberculose Pulmonar. 1994.S. 1989. Springer. 95:24–37.. Neural nets for computing acoustic speech and signal processing. F. vol. Journal of the Royal Statistical Society B. Combining neural networks and statistical predictors to solve the classiﬁcation problem in discriminant analysis. C. pages 1–6. 1996. [9] Lippmann. A.M.P. European Journal of Operational Research.
G. Understanding neural networks as statistical tools. W. 2 nd ed..nj.A. P. Analyzing credit risk data: a comparision of logistic discrimination. Br¨ lmann. 110:519–523. W.C...A.H. T. Heintz. F.S.. 1996.. [3] Warner.S. Dougherty. 1997. 49:717–722.html).M. tree analysis and feedfoward networks. and Xia. Journal of the Operational Research Society. D.. J. Forecasting travel demand: A comparison of logit and artiﬁcial neural network methods. Computational Statistics. and Mol. Logistic Regression Networks [1] Arminger. and McCullagh. M. (http//citeseer. [4] Carvalho. Generalized Linear Model. A.. Journal of the Operational Research Society. H. SURVIVAL ANALYSIS AND OTHERS NETWORKS 182 Regression Neural Network Models Generalized Linear Model Networks [1] Nelder.. and Zhang.. J.B. 1997. A case study in applying neural networks to predicting insolvency for property and causality insurers. Cooper. and Wardman.M. L. International Journal of Obstetrics and Gynecology.J. and Misra..M..S. Fowkes.J. E. [5] Ciampi. [2] Sarle. [2] Boll. X. 1994.A. 50:284–293. Statistics in Medicine. 1998. 1989. M. P. 2003. classiﬁcation. 12:293–310..C. Geomini. [3] Brockett. D.com/sarle94neural. American Statistician.W. . M. B. In Proceedings of the Nineteenth Annual SAS Users Group International Conference. Sijmons. A new approach to training backpropagation artiﬁcial neural networks: Empirical evaluation on ten data sets from clinical studies.CHAPTER 6.R. 1999. The preoperative assessment of the adnexall mass: The accuracy of clinical estimates versus clinical prediction rules. Golden.. o P. 2002. B. and Bonne. Enache. CRC/Chapman and Hall. P. [6] Cooper. 26:909–921.M. M. 48:1153–1162.L. 21:1309–1330. A.W. Artiﬁcial neural networks versus multivariate statistics: An application from economics. Journal of Applied Statistics.nec.. Neural networks and statistical models.A.
ElKady. Comparative evaluation of the use of neural networks for modeling the epidemiology of schistosomosis mansoni. Vach. T. [15] Tu. 46:93–104. part i.. DeClaris. Inkelis. On the misuses of artiﬁcial neural networks for prognostic and diagnostic classiﬁcation in oncology.T. N.. [10] Lenard.CHAPTER 6. 54:1159–1165. 1996. Limm.. Computational Statistics and Data Analysis. T. ElSahly.. Transaction of the Royal Society of Tropical Medicine and Hygiene.. [9] Hammad.. M. Computational Statistics and Data Analysis. Journal of Clinical Epidemiology.. E. Robner. Journal of Statistical Planning and Inference.. 90:372–376. W. E.. Roßner. P. [14] Schwarzer. G. N.V. and Goss. [11] Nguyen. S. and Kupperman. M. The application of neural networks and a qualitative response model to the auditor’s going concern uncertainty decision. Forecasting with neural networks: An application using bankruptcy data.K. 2000. R. 21:661–682. G. 1995.C. W.V. and Strickland.. C. AbdelWahab. 49:1225–1231. [8] Fletcher..A.M. 1996.. and Granger. Advantages and disadvantages of using artiﬁcal neural networks versus logit regression for predicting medical outcomes. 19:541–561. 24:159–167. Decision Sciences. R. 26:209–226. D..J... J. K. [13] Schumacher. Neural networks and logistic regression. and Madey.B. SURVIVAL ANALYSIS AND OTHERS NETWORKS 183 [7] Faraggi. 55:687–695. W.. M.R. Malley. P. Comparison of prediction models for adverse outcome in pediatric meningococcal disease using artiﬁcial neural network and logistic regression analysis. Comparison of logistic regression and neural networks to predict rehospitalization in patients with stroke. M. Journal of Clinical Epidemiology. and Schumacher. Statistics in Medicine. N.J. R. Fiedler. R. and Schumacher. part ii. and Simon. D.. 1995. [16] Vach. 2001. Neural networks and logistic regression. A. Illig. G. and Vach. Information and Management.. [12] Ottenbacher... The maximum likelihood neural network as a statistical classiﬁcation model. 21:683– 701. 2002. Smith.T. M. 1996. S. R.. .F. 1996. 1993. Journal of Clinical Epidemiology. Alam.
Theory and Methods.K. and Cechim.K. Forsstrom. Forecasting consumer’s expenditure: A comparison between econometric and neural networks models. R. Neural network forecasting of canadian gdp growth.CHAPTER 6.. [5] Qi. 1995. K. J. 2001. Scandinavian Journal of Clinical and Laboratory Investigation. and Szczypula. 18:151–166.B. 55 (suppl 22):83–93. International Journal of Forecasting. S. SURVIVAL ANALYSIS AND OTHERS NETWORKS Regression Networks 184 [1] Balkin. Journal of Forecasting. 2000. International Journal of Forecasting. International Journal of Forecasting. 2000b. 17:57–69. G. and Lim. Prager. A neural network approach to response surface methodology.J. Redes neurais artiﬁciais e analise de sensibilidade: Uma aplicacao a demanda de importacoes ¸˜ ` ¸˜ brasileiras. and Maddala. G. Nonparametric simple regression. D.S. Dalton.W. [4] Kim. W. Quantitative Applications in the Social Sciences. Economic factors and the stock market: A new perspective. Nonparametric Regression and Classiﬁcation Networks [1] Ciampi. [4] Laper. J. 5:645–693. 2000a. [7] Tkacz. and Curram.L..M. Quantitative Applications in the Social Sciences. 26(4):991–1009. Y. K. Selbmann. Grading forecasting using an array of bipolar predictions: Application of probabilistic networks to a stock market index. Statistical models as building blocks of neural networks.J.H.. M.. S.D.J.P. J. A. S. Communications in Statistics Theory and Methods. .J.B. A. Application of neural networks to the ranking of perinatal variables inﬂuencing birthweight.H. Multiple and generalized nonparametric regression. Comparative study of artiﬁcial neural networks and statistical models for predicting student grade point average. M. 29(9/10):2215–2227. 1999. Economia Aplicada. 1997. Portugal.. 1994. International Journal of Forecasting. and Derom. and Chun. 130. 1996. 14:323–337.A. [2] Fox. R. A. 12:255–267.. 131. S.. [3] Gorr. [2] Church.L. Nagin. J. [3] Fox.. and Lechevalier.. 2002. R. Communications in Statistics . [6] Silva.J. 10:17–34. 1998. H. D.S.
2(6):568–576. Cambridge University Press. 15:209– 218. A.. P.F. Statistics in Medicine. and Tanner. 2000.A. [5] Faraggi. LeBlanc. E. pages 525–532. [3] Biganzoli.. 2002. 1991. Pattern Recognition and Neural Networks. P. vol. SURVIVAL ANALYSIS AND OTHERS NETWORKS 185 [5] Peng. P. Baracchi. A general model for testing the proportional harzards and the accelerated failure time hypothesis in the analysis of censored survival data with covariates. [7] Ripley. PARLGRAVE Ed. B.. E. E. and Marubini. and Marubini.. or associative memories. B.. and Crowley.F. J. 1985. 1998. A general framework for neural network models on censored survival data. M. F. A general regression neural networks. Survival Analysis.D. Probabilistic neural networks for classiﬁcation. Journal of the American Statistical Association. and EtezadiAmoli. Feedforward neural networks for the analysis of censored survival data: A partial logistic regression approach. and Kappen. E..A. [6] Picton. T. 20:2965–2976. 14:651–667. M. Understanding neural networks using regression trees: An application to multiply myeloma survival data.. [8] Specht. Bayesian inference in mixtureofexperts and hierarchical mixturesofexperts models with an application to speech recognition. R.. Heskes. Time Series and Control Chart Neural Network Models and Inference Survival Analysis Networks [1] Bakker.. Jacobs. 1998. 17:1169–1186. (in press). 2001. [4] Ciampi. Neural Networks. Neural Networks. Improving lox survival analysis with a neural bayesian approach in medicine. 1996. Baracchi. 91:953–960. D. B. Communications in Statistics A.CHAPTER 6. IEEE Transactions on Neural Networks. . mapping. 1996. 2 nd ed. In Proceedings International Conference on Neural Networks.. 1.. 2003. [2] Biganzoli. Statistics in Medicine.. D. J. D. [9] Specht.
Modeling life data: A graphical approach. and Simon. and Kramar. 1997. Neural Computing and Applications.W. F. R.. S. L. V.. and Pereira. [10] Kalbﬂeish. u [16] Mariani. D. 1995a. The maximum likelihood neural network as a statistical classiﬁcation model. 1999. [9] Groves. R... D. and LaBree. Eden. S.CHAPTER 6. SURVIVAL ANALYSIS AND OTHERS NETWORKS 186 [6] Faraggi. Breast Cancer Research and Threament.. U. 2002. L. [8] Faraggi. and Andersen. E. A neural network model for survival data.J. Marubini. 1997. 2 nd ed. Bigangoli. R... ... and Basley. F. Salvadori.. Extended hazard regression model for rehability and survival analysis. 1995b. Applied Stochastic Models in Business and Industry. Survival analysis and neural nets. [7] Faraggi. 3:367–381. Statistics in Medicine. a e Cadernos Sa´ de Coletiva. 1999. R.E. Azen.. S. L. A. A comparison of cox proportional hazard and artiﬁcial network models for medical prognoses.L. D. 8(1):9–26... Chessells. 27:55–65. D. and Simon. 1994. 14:73–82....P.. 46:93–104.. Yaskily. 13:1189–1200. P. F.K. 1995. [15] LouzadaNeto. [13] LouzadaNeto. Zucali. 8:257–264. R.. The Statistical Analysis of Failure Data. Computer and Biomedical Research.D. Computers in Biology and Medicine. F. J. Prognostic factors for metachronous contralateral breast cancer: a comparision of the linear Cox regression model and its artiﬁcial neural networks extension. R. 2000. S. and Rilke.. Bayesian neural network models for censored data. P. Wisley. Lifetime Data Analysis. K.. 1997. Use of neural networks in predicting the risk of coronary arthery disease.. [14] LouzadaNeto. Pillot. 1997.B. B. O. D. Kinsey. 28:38–52. P. Boracchi. Richards.. S. 15:123–129. [12] Liestol. E. Coradin. A comparison of cox regression and neural networks for risk stratiﬁcation in cases of acute lymphoblastic leukemia in children. E. 44:167–178. [11] Lapuerta.M. Andersen. Simon. Biometrical Journal. Modelos em an´ lise de sobrevivˆ ncia.C. 39:519–532. Smye. [17] OhnoMachado. Statistics in Medicine. Journal of Statistical Planning and Inference. Verones.B. J.M. C. and Prentice.
16:509–515.M.. Neural Networks. and Ranka. SURVIVAL ANALYSIS AND OTHERS NETWORKS 187 [18] OhnoMachado. In Neural networks as statistical methods in survival analysis (Eds. 1998..G. Mehrotra. S. [2] Caloba. 15:49–61. H. Buckley.CHAPTER 6. Thesis. 1992.. A dynamic artiﬁcial neural network for forecasting series events. D. Applied Statistics. and Azen. Time series forecasting with neural networks: A comparative study using the airline data. On the representation of functions by certain fourier integrals.K.A. S. International Journal of Forecasting. K.D. Vancouver. Automatic neural network modeling for univariate time series. Gant). Hierarchical neural networks for survival analysis. 1998. Forecasting combining with neural networks. B.html). and Musen. K. P.P. M. Saidane.. Artiﬁcial neural networks: Prospects for medicine. M. Walker. Landes Biosciences Publishing. R. J. .. 47:231–250. R. 22:347–358.K.. C. International Journal of Forecasting. Transactions of the American Math.. 2005. [5] Donaldson. 1996. [6] Faraway. M. Neural network models for breast cancer prognoses. H.. 2002.. Lapuerta. E.M. BC.. Computational Statistics and Data Analysis. Comparison of the performance of neural networks methods and cox regression for censored survival data. [19] Ripley. and Saliby. J. [20] Ripley.stats. Journal of Forecasting. J. 1998. R. L. R. and Ord. 1995. (www. 1939. [4] Cramer. Pesquisa Operacional. D. and Chatﬁeld. and Ripley. [21] Xiang. 21:341–362. University of Oxford. Caloba. 46:191–201.. C. Forecasting the behavior of multivariate time series using neural networks.K.uk/ ruth/index. Dybowski and C. [7] Ghiassi.ox.D. 2000.. Mohan. Cooperacao entre redes neurais ¸˜ artiﬁciais e t´ cnicas classicas para previs¯ o de demanda de uma s´ rie de e a e cerveja na australia. 34:243– 257. In The Eighth World Congress on Medical Information. and Kamstra.G. S.M. L. 2000. Department of Engineering Science. G. M. Canada. A. Ryutov. and Zimbra. Society. [3] Chakraborthy. Time Series Forecasting Networks [1] Balkin.ac. Phill. 5:961–970.
Parreno.. Y.. [16] Park.. 1994..I. Artiﬁcial neural network models for forecasting and decision making. 1994. M.A. D. Saidane.. 96:968–981. Journal of Forecasting. Hipel. P..M. Stochastic neural networks with applications to nonlinear time series. 15:99– e 108.L. 15:99–108.I. de la Fuente. Um Modelo de Previs¯ o de Curto Prazo de Carga El´ trica a e Combinando M´ todos de Estat´stica e Inteligˆ ncia Computacional. S.C. Stochastic neural networks with applications to nonlinear time series.. D. 89(425):117–121. M. 26:461–482. Marquez. A. R.M. [10] Kajitomi. B.. H. 10:5–15. T. [9] Hill. Prediction sun spots using a layered perception IEEE Transactions on Neural Networks. R. K.J. and Chen.R. K. and Braga. and Wong. International Journal of Forecasting. D. International Journal of Forecasting. and Gray. Journal of Forecasting. O’Connor.. R. Hipel. 24:105–117. [12] Lai. SURVIVAL ANALYSIS AND OTHERS NETWORKS 188 [8] Ghiassi. L. Previs¯ o a de energia el´ trica usando redes neurais nebulosas. [11] Kajitoni.F. and Mclead. R. C.CHAPTER 6. Forecasting nonlinear time series with feedfoward neural networks: a case of study of canadian lynx data.. Y. 2002. [13] Li. M. Murray. and Remus. and Jones. 1998. T..Sc e ı e Thesis.L.K.J. [15] Machado. M.. 1999. I..P. and Pviore. C. . 24:105–117. A. X. Journal of the American Statistical Association. Y. 2005.S. Journal of the American Statistical Association.W. A neural net model for prediction.. Souza. T.B. 2005. Series temporais multivariadas.S. Q¨ estiio. A. Ang. 21:341–362. Caldeva. Aplicacion de redes neuronales artiﬁciales a la prevision de series temporales no estacionarias a no invertibles.... PUCRJ.D. 2001. Journal of Forecasting. [18] Pino. A dynamic artiﬁcial neural network model for forecasting series events. Pesquisa Naval. J. Forecasting nonlinear time series with feedfoward neural networks: a case study of canadian lynx data. Lectures Notes of SINAPE. [14] Lourenco. and Mclead. [17] Pereira. 2005.. P. 1994. 1984..W. and Zimbra. u [19] Poli. 2002. 7(2):501–505. 18:181–204. W.
49:611–629. [27] Tseng. International Journal of Forecasting. 18:167–180. and Sohl. [22] Shamseldin. International Journal of Production Research.C. and Medeiros. Revista Brasileira de Economia. [24] Stern. [26] Terasvirta.H. 1999. C. A. 35:667–697. M. Patuwo. K. Combining neural network model with seasonal time series.. Neural network in applied statistics (with discussion).. M. smooth transistions autoregressions and neural networks for forecasting macroeconomic time series: a reexamination. and Tzeng. SURVIVAL ANALYSIS AND OTHERS NETWORKS 189 [20] Portugal.. M.. 69:71–87.S..M. and Medeiros. [23] Stern. T. G.S. ARIMA model Technological Forecasting and S. G. H.Sc a e Thesis.Y. [2] Hwarng. 5:577–597. C. Linear models. D. Forecasting with artiﬁcial neural networks: the state of the art.E. Redes Neurais na Previs˜ o em S´ ries Temporais. pages 2395–2400. M.M. [21] Raposo. 21:755– 774.V. 14:35– 62. D. Journal of Forecasting. . smooth transitions autoregressions and neural networks for forecasting macroeconomic time series: a reexamination. Change. H. Internation Journal of Forecasting.V. [29] Zhang. COPPE/UFRJ. Dijk. 38:205–220. 38:205–220. 2005. 1998.Y. and Hu.C. An intelligent model selection and forecasting system. 2001. Neural network in applied statistics (with discussion). 2002. D. F. Technometrics. A nonlinear neural network technique for updating of river ﬂow forecasts. H. 1995. 1992. Yu. Neural networks versus true series methods: A forecasting exercise. A neural network approach for the analysis of control charts patterns. Technometrics. Dijk.R.M. Decision Sciences Institute 2002 Annual Meeting Proceedings. Detecting mean shift in AR(1) process. D. and O’Connor. [25] Terasvirta. International Journal of Forecasting.B.. B. [28] Venkatachalam.. 1996. Hydrology and Earth System Sciences.E.. T. 1996.. J. 2002. 1997. H.S.S. 21:755–774. Linear models.CHAPTER 6.C. 2005.. Control Chart Networks [1] Cheng.
A. SURVIVAL ANALYSIS AND OTHERS NETWORKS 190 [3] Prybutok.M. [4] Smith. and Varch. Model Selection and Model Averaging for Neural Network. L. Bayesian Learning for Neural Networks. 21:661–682. 32:277–286.W. Depatament of Statistics.. and D. V. D. 2003. NY Springer. 1998. and Foxall. Dey. K. Sinha).K. M.. and Muller. Statistical modelling of artiﬁcial neural networks using the multilayer perceptron. K. . Xbar and R control chart using neural computing. Neural networks and logistic regression: Part i. 1994. D. 32:443–445. California Institute of Technology. Ph. R. Economic Quality Control.E. The maximum likelihood neural network as a statistical classiﬁcation model. E.T. In Pratical Nonparametric and Semiparametric Bayesian Statistics (Eds. and Simon. 2000. Statistics and Computing. NY Springer. [7] Neal.. Some Statistical Inference Results Estimation Methods [1] Aitkin. 64:301–332. [6] Mackay.M.R. M. 1992. Mueller. 1996. Computational Statistics and Data Analysis. R.. [8] RiosInsua. [2] Belue. 1999. Designing experiments for single output multilayer perceptrons. 46:93–104. R. P. H. Robner. P. Bayesian Methods for Adaptive Methods. 1995. Cornegie Mellon University. C. Journal of Statistical Planning and Inference. W. [3] Copobianco. Comparison of neural net¯ work to Shewhart X control chart applications. 1994. [4] Faraggi. 9:143–164. D. [9] Schumacher. Journal of Statistics and Data Analysis. Journal of Statistical Computation and Simulation. [5] Lee.C.H. D. 1996.. 13:227–239. International Journal of Production Research. and Nam.J.CHAPTER 6.C.d thesis. 1996. Ph. pages 181–194. and Bauer. Neural networks and statistical inference: seeking robust and efﬁcient learning. Sanford.d thesis. Feedfoward neural networks for nonparametric regression.
Power properties of the neural network linearity test. [5] Lee. J. J. J. J. C.G. Neural Computation.W. C. Journal of Applied Statistics. 1996. Prediction intervals for neural networks via nonlinear regression. [6] Terasvirta. H. [4] Tibshirani.D. Article 1. Testing for negleted nonlinearity in time series models.H.. A comparison of some error estimates for neural networks models. T. [2] De Veaux. White.. and Granger.CHAPTER 6.. Prediction interval for artiﬁcial neural networks. [3] Hwang..J. 6:357–373. Journal of Time Series Analysis. 14:209–220. Schweinsberg. IEEE Transactions on Neural Networks.. Journal of the American Statistical Association.. and Ramsey. Neural network test and nonparametric kernel test for negleted nonlinearity in regression models. 1993. and Ding. 1998. 2001. R. and Granger.. 2003. A. T. 1993. Departament of Economics .Kansas State University. C. Schumi. Journal of Econometrics. 7:229–232. R. Technometrics. 2003. v4 issue 4.F. L.T. 40(4):273–282. 8:152–163. G. [2] Blake. 1999.. Lin. 1996.A. A radial basis function artiﬁcial neural network test for negleted.H. A. SURVIVAL ANALYSIS AND OTHERS NETWORKS Interval Estimation 191 [1] Chryssolouriz. .P. T. On performance of neural networks nonlinearity tests: Evidence from G5 countries.B. [3] Kiani. M.W. Conﬁdence interval predictions for neural networks models. and Ungar. Predicting pickle harvests using parametric feedfoward network.H. 56:269–290. The Econometrics Journal.J. A. and Kapetanious. 1997. Lee. Studies in Nonlinear Dynamics and Econometrics. 26:165–176. [4] Lee.. Statistical Test [1] Adams. G. 92:748–757. K.
This action might not be possible to undo. Are you sure you want to continue?