Artificial Neural Networks

António Eduardo de Barros Ruano

Centre for Intelligent Systems University of Algarve Faro, Portugal aruano@ualg.pt

Contents

1 - Introduction to Artificial Neural Networks.............................................. 1
1.1 - An Historical Perspective........................................................................................ 2 1.2 - Biological Inspiration ........................................................................................... 4 1.3 - Characteristics of Neural Networks...................................................................... 7 1.3.1 - Models of a Neuron......................................................................................... 7
1.3.1.1 - Input Synapses..................................................................................................... 8 1.3.1.2 - Activation Functions............................................................................... 11
1.3.1.2.1 - Sign, or Threshold Function ............................................................. 11 1.3.1.2.2 - Piecewise-linear Function................................................................. 11 1.3.1.2.3 - Linear Function................................................................................. 12 1.3.1.2.4 - Sigmoid Function ............................................................................. 13 1.3.1.2.5 - Hyperbolic tangent ........................................................................... 14 1.3.1.2.6 - Gaussian functions............................................................................ 16 1.3.1.2.7 - Spline functions ................................................................................ 16

1.3.2 - Interconnecting neurons ................................................................................ 19 1.3.2.1 - Singlelayer feedforward network ........................................................... 19 1.3.2.2 - Multilayer feedforward network............................................................. 20 1.3.2.3 - Recurrent networks................................................................................. 20 1.3.2.4 - Lattice-based associative memories ....................................................... 21 1.3.3 - Learning ........................................................................................................ 21 1.3.3.1 - Learning mechanism............................................................................... 22 1.3.3.2 - Off-line and on-line learning .................................................................. 23 1.3.3.3 - Deterministic or stochastic learning ....................................................... 23 1.4 - Applications of neural networks ......................................................................... 23 1.4.1 - Control systems............................................................................................. 23 1.4.1.1 - Nonlinear identification .......................................................................... 24 1.4.1.1.1 - Forward models ................................................................................ 25 1.4.1.1.2 - Inverse models.................................................................................. 27 1.4.1.2 - Control .................................................................................................... 30 1.4.1.2.1 - Model reference adaptive control ..................................................... 30 1.4.1.2.2 - Predictive control.............................................................................. 33 1.4.2 - Combinatorial problems................................................................................ 34 1.4.3 - Content addressable memories...................................................................... 34 1.4.4 - Data compression .......................................................................................... 35

Artificial Neural Networks

i

1.4.5 - Diagnostics.................................................................................................... 35 1.4.6 - Forecasting .................................................................................................... 35 1.4.7 - Pattern recognition ........................................................................................ 35 1.4.8 - Multisensor data fusion ................................................................................. 35 1.5 - Taxonomy of Neural Networks ............................................................................ 36 1.5.1 - Neuron models .............................................................................................. 36 1.5.2 - Topology........................................................................................................ 37 1.5.3 - Learning Mechanism..................................................................................... 37 1.5.4 - Application Type ........................................................................................... 38

2 - Supervised Feedforward Neural Networks .............................................39
2.1 - Multilayer Perceptrons.......................................................................................... 39 2.1.1 - The Perceptron .............................................................................................. 40 2.1.1.1 - Perceptron Convergence Theorem ......................................................... 41 2.1.1.2 - Examples .......................................................................................................... 44 2.1.2 - Adalines and Madalines ................................................................................ 50 2.1.2.1 - The LMS Algorithm ............................................................................... 54 2.1.2.2 - Justification of the LMS Algorithm........................................................ 54 2.1.3 - Multilayer perceptrons .................................................................................. 61 2.1.3.1 - The Error Back-Propagation Algorithm ................................................. 63 2.1.3.2 - The Jacobean matrix............................................................................... 64
2.1.3.2.1 - The Gradient Vector as computed by the BP Algorithm ........................................70

2.1.3.3 - Analysis of the error back-propagation algorithm ..................................... 70
2.1.3.3.1 - Normal equations and least-squares solution ..........................................................75 2.1.3.3.2 - The error backpropagation algorithm as a discrete dynamic system ......................76 2.1.3.3.3 - Performance surface and the normal equations matrix ...........................................80

2.1.3.4 - Alternatives to the error back-propagation algorithm ....................................... 81
2.1.3.4.1 - Quasi-Newton method.............................................................................................84 2.1.3.4.2 - Gauss-Newton method ............................................................................................84 2.1.3.4.3 - Levenberg-Marquardt method.................................................................................86

2.1.3.5 - Examples ................................................................................................ 87 2.1.3.6 - A new learning criterion ................................................................................ 93 2.1.3.7 - Practicalities .................................................................................................... 102
2.1.3.7.1 - Initial weight values ..............................................................................................102 2.1.3.7.2 - When do we stop training?....................................................................................105

2.1.3.8 - On-line learning methods ............................................................................ 109
2.1.3.8.1 - Adapting just the linear weights............................................................................110 2.1.3.8.2 - Adapting all the weights........................................................................................119

2.2 - Radial Basis Functions ....................................................................................... 123 2.2.1 - Training schemes......................................................................................... 125 2.2.1.1 - Fixed centres selected at random ................................................................ 125 2.2.1.2 - Self-organized selection of centres............................................................. 125 2.2.1.3 - Supervised selection of centres and spreads ............................................. 127 2.2.1.4 - Regularization ............................................................................................... 128 2.2.2 - On-line learning algorithms ........................................................................ 128 2.2.3 - Examples ..................................................................................................... 129 2.3 - Lattice-based Associative Memory Networks ..................................................... 134 2.3.1 - Structure of a lattice-based AMN ............................................................... 135

ii

Artificial Neural Networks

2.3.1.1 - Normalized input space layer ...................................................................... 135 2.3.1.2 - The basis functions ....................................................................................... 137
2.3.1.2.1 - Partitions

of the unity ..................................................................... 138

2.3.1.3 - Output layer ................................................................................................... 138

2.3.2 - CMAC networks.......................................................................................... 139 2.3.2.1 - Overlay displacement strategies ................................................................. 140
2.3.2.2 - Basis functions................................................................................................. 141 2.3.2.3 - Training and adaptation strategies .............................................................. 141 2.3.2.4 - Examples ........................................................................................................ 142

2.3.3 - B-spline networks........................................................................................ 145 2.3.3.1 - Basic structure ............................................................................................... 145 2.3.3.1.1 - Univariate basis functions .............................................................. 148 2.3.3.1.2 - Multivariate basis functions............................................................ 151 2.3.3.2 - Learning rules ................................................................................................ 152 2.3.3.3 - Examples ........................................................................................................ 152 2.3.3.4 - Structure Selection - The ASMOD algorithm .......................................... 153 2.3.3.4.1 - The algorithm ................................................................................. 154
2.3.3.5 - Completely supervised methods...................................................................... 162

2.4 - Model Comparison ............................................................................................. 162 2.4.1 - Inverse of the pH nonlinearity..................................................................... 163 2.4.1.1 - Multilayer perceptrons ................................................................................. 163 2.4.1.2 - Radial Basis functions .................................................................................. 164 2.4.1.3 - CMACs........................................................................................................... 165 2.4.1.4 - B-Splines ........................................................................................................ 165 2.4.2 - Inverse coordinates transformation ............................................................. 166 2.4.2.1 - Multilayer perceptrons ................................................................................. 166 2.4.2.2 - Radial Basis functions .................................................................................. 167 2.4.2.3 - CMACs........................................................................................................... 168 2.4.2.4 - B-Splines ........................................................................................................ 168 2.4.3 - Conclusions ................................................................................................. 169 2.5 - Classification with Direct Supervised Neural Networks..................................... 170 2.5.1 - Pattern recognition ability of the perceptron............................................... 170 2.5.2 - Large-margin perceptrons ........................................................................... 173
2.5.2.1 - Quadratic optimization .................................................................................... 174
2.5.2.1.1 - Linearly separable patterns....................................................................................175 2.5.2.1.2 - Non linearly separable patterns .............................................................................179

2.5.3 - Classification with multilayer feedforward networks ................................. 181

3 - Unsupervised Learning .........................................................................189
3.1 - Non-competitive learning ................................................................................... 189 3.1.1 - Hebbian rule ................................................................................................ 189
3.1.1.1 - Hebbian rule and autocorrelation .................................................................... 190 3.1.1.2 - Hebbian rule and gradient search .................................................................... 191 3.1.1.3 - Oja’s rule ......................................................................................................... 191

3.1.2 - Principal Component Analysis.................................................................... 192 3.1.3 - Anti-Hebbian rule........................................................................................ 201 3.1.4 - Linear Associative Memories...................................................................... 202
3.1.4.1 - Nonlinear Associative Memories .................................................................... 204

Artificial Neural Networks

iii

... 210 3......2....Cooperative process..........241 2........Learning Vector Quantization ...................................2 ....................Solutions of a linear system with A square ......................................3 .............3.2 ...........................Recurrent Networks...............................LMS for LAMs........Grossberg’s instar-outsar networks ..227 1...........1 ........... 227 1.............................Competitive Learning ................. 219 4................Adaptation .......................... 228 1................. 213 4....................1 ..2............................Statistical Formulation of Classifiers....3 ........2.................................1 .....6 ........2............Operation of the Hopfield network ......2...................... 204 3........................................Linear Equations.......................Counterpropagation networks ......... 224 Appendix 1 ........................... 207 3..........Inconsistent equations and least squares solutions ...................................................Self-organizing maps...........Test for consistency .................................1 ........................................2......2..........4 .........................................2.2............................................... 208 3....................................................................................7 ................... 205 3....3................................. 223 4.. 204 3.......Classifiers from competitive networks................................. 236 Appendix 2 ..................................1 .... 206 3................ 222 4...........................2 ...................Properties of the feature map............................................................... 229 1..................................3 ................................2..............1.......................2... 218 4... 219 4............................1........................2....................................................4 ...........Brain-State-in-a-Box models .............. 232 1.7.....................2.............Discrete Hopfield model .... 241 2.... 209 3...Problems with Hopfield networks.............................2.........................5 ............................ 208 3........... 224 4.......LMS as a combination of Hebbian rules............2 .................................2 ........................2...................................2.... 220 4......Storage phase........4 .............................. 210 3.............................2................1 ............. 229 1........................3 ...............................1 .........The Cohen-Grossberg theorem .........1 .......................... 207 3.........Lyapunov function of the BSB model............................1.................................................2......................................................Competition ................. 227 1.......................................2 ..................................Neurodynamic Models .................4 .........Minimum Error Rate and Mahalanobis Distance........................................................Hopfield models ...................Adaptive Ressonance Theory ................1.....................................................Training and applications of BSB models.............Introduction.......... 242 2........................ 205 3....Existence of solutions.....2....Retrieval phase ..1........1 ...........The Pattern-Recognition Problem..................... 245 iv Artificial Neural Networks ......................... Generalized Inverses and Least Squares Solutions.....Generalized Inverses...............1.....2............... 209 3............. 210 4 ...2 ..............................1 . 217 4......2 ...........................................2.........1.........................Least squares solutions using a QR decomposition: .......3 ......................................2.............................5 ............................................................Elements of Pattern Recognition ......5.....................5 ..2......................................................................213 4.Solving linear equations using generalized inverses .....2 ..............Ordering and Convergence..................................................................2...............1 ............ 234 1.........3.............2 ........ 214 4...................1 ...................................................

258 3..............................Equilibrium states ............................................Discriminant Functions ................................................................................................................... 252 Appendix 3 ....Stability ............................255 3..........................2 ................................................................................................... 269 Chapter 4..Methods of training parametric classifiers ............................................................................................... 250 2.............................................. 271 Appendix 1. 261 Chapter 1................................................ 271 Bibliography...............................................................................................................Decision Surfaces ......... 251 2.....................281 Artificial Neural Networks v ................................................................A Two-dimensional Example........................................1 .............................................................................................................................5 ................................................................ 246 2...3....................................................................................................................... 259 Exercises .............Elements of Nonlinear Dynamical Systems .................................................................................................. 246 2.......3 ....................................................................................Introduction............................................... 255 3..Attractors ..........................................................................................1 ..............2.........4 ..........6 .................................273 Index ................................4 .............. 263 Chapter 3.............................................................................................Linear and Nonlinear Classifiers....................................... 261 Chapter 2........ 256 3....................................3 ....

vi Artificial Neural Networks .

These will be briefly indicated in this Chapter and more clearly identified in the following Chapters. like neurophysiology. a large number of neural networks (including their variants) have been proposed. electronics. robotics [11] [12] and control systems [13] [14]. for the large majority of researchers. the outline of this Chapter is as follows: Artificial Neural Networks 1 . The most important ANN models are described in this introduction. Emphasis is given to supervised feedforward neural networks. speech recognition and synthesis [7] [8]. Neural networks research has spread through almost every field of science. which is the class of neural networks most employed in control systems. In an attempt to introduce some systematisation in this field. medicine [4] [5] [6]. which is the research field of the author. Although work on artificial neural networks began some fifty years ago. they will classified according to some important characteristics. covering areas as different as character recognition [1] [2] [3]. or precisely because of that. widespread interest in this area has only taken place in the last years of the past decade. Some were introduced with a specific application in mind. control systems and mathematics.CHAPTER 1 Introduction to Artificial Neural Networks In the past fifteen years a phenomenal growth in the development and application of artificial neural networks (ANN) technology has been observed. specifically to the type of the ANN discussed. artificial neural networks are yet a new and unfamiliar field of science. Writing a syllabus of artificial neural networks. is not an easy task. physics. Although active research in this area is still recent. this syllabus is confined to a broad view of the field of artificial neural networks. image processing [9] [10]. Another objective of this Chapter is to give ideas of where and how neural networks can be applied. and there are different learning rules around. Therefore. and applications are targeted for an even wider range of disciplines. This task becomes more complicated if we consider that contributions to this field come from a broad range of areas of science. This means that. For these reasons. as well as to summarise the main advantages that this new technology might offer over conventional techniques. psychology.

and the learning mechanism associated with the network.1 An Historical Perspective The field of artificial neural networks (they also go by the names of connectionist models. to a great extent. parallel distributed processing systems and neuromorphic systems) is very active nowadays. in his book The organization of behaviour [18] postulated a plausible qualitative mechanism for learning at the cellular level in brains.1. how they are interconnected. the perceptron. Primarily.3. As artificial neural networks are.1 and Section 1. which developed in the late eighties. where it will be shown that the basic concepts behind the brain mechanism are present in artificial neural networks. etc. To characterise a neural network.3 introduces important characteristics of ANNs.3. Building up on the modelling concepts introduced.2. work in this field can be traced back to more than fifty years ago [15] [16].3 Section 1. As already mentioned. inspired by the current understanding of how the brain works. and in the application fields. Principles of neurodynamics [19].4. back in 1943. 2 Artificial Neural Networks . In 1957 Rosenblatt developed the first neurocomputer.2. The most common types of interconnection are addressed in Section 1.5 will introduce general classifications or taxonomies.4.1. In 1949 Donald Hebb. Section 1.2. given linearly separable classes.3. His results were summarised in a very interesting book. the neural models.4 presents an overview of the applications of artificial neural networks to different fields of Science. neural control schemes are discussed in Section 1. such as optimization. Neuron models are discussed in Section 1. Section 1.1. data fusion. and these are summarised in Section 1.An introduction to artificial neural networks is given in Section 1. He proposed a learning rule for this first artificial neural network and proved that. are briefly summarized in the sections below. Different types of learning methods are available. It will be shown that neural networks have been extensively used to approximate forward and inverse models of nonlinear dynamical systems. An extension of his proposals is widely known nowadays as the Hebbian learning rule. develop a weight vector that would separate the classes (the famous Perceptron convergence theorem). Additional applications to other areas. the network topology. Finally. 1. Section 1.1 will discuss the role of ANN for nonlinear systems identification. a special emphasis will be given to the field of Control Systems. in a finite number of training trials.2.. pointed out the possibility of applying Boolean algebra to nerve net behaviour. McCulloch and Pitts [17].1. in Section 1. four different taxonomies of ANNs are presented. it is necessary to specify the neurons employed. pattern recognition.4. Although for many people artificial neural networks can be considered a new area of research. They essentially originated neurophysiological automata theory and developed a theory of nets. a brief introduction to the basic brain mechanisms is given in Section 1. a chronological perspective of ANN research is presented. Based on the learning paradigm. a perceptron would.

retrieving contextually appropriate information from memory. Another is the benefit that neuroscience can obtain from ANN research. probably the best known being the control of an inverted pendulum [22].One is the desire is to build a new breed of powerful computers. Although people like Bemard Widrow approached this field from an analytical point of view. where it was mathematically proved that these neural networks were not able to compute certain essential computer predicates like the EXCLUSIVE OR boolean function. New artificial neural network architectures are constantly being developed and new concepts and theories being proposed to explain the operation of these architectures. image processing. Grossberg [27] and Kohonen [28]. Adalines and the Widrow-Hoff learning rule were applied to a large number of problems.The dramatic improvement in processing power observed in the last few years makes it possible to perform computationally intensive training tasks which would. This discredited the research on artificial neural networks. Perceptrons [23]. and the desired output vector over the set of patterns [21]. making ANN one of the most active current areas of research. that can solve problems that are proving to be extremely difficult for current digital computers (algorithmically or rulebased inspired) and yet are easily done by humans in everyday life [32].Also the advances of VLSI technology in recent years. Fukushima [26]. with older technology. the ADALINE [20]. with commercial optical or electrooptical implementations being expected in the future. . Until the 80's. Then in the middle 80's interest in artificial neural networks started to rise substantially. Many of these developments can be used by neuroscientists as new paradigms for building functional concepts and models of elements of the brain. The death sentence to ANN was given in a book written by these researchers. are all examples of such tasks. required an unaffordable time.About the same time Bemard Widrow modelled learning from the point of view of minimizing the mean-square error between the output of a different type of ANN processing element. About the same time Minsky and Papert began promoting the field of artificial intelligence at the expense of neural networks research. Analog. In the sixties many sensational promises were made which were not fulfilled. other reasons can be identified for this recent renewal of interest: . research on neural networks was almost null. Although this persuasive and popular work made a very important contribution to the success of ANNS. most of the research on this field was done from an experimental point of view. This work has led to modern adaptive filters. Cognitive tasks like understanding spoken and written language. With the publication of the PDP books [31] the field exploded. Anderson [25]. Notable exceptions from this period are the works of Amari [24]. . - Artificial Neural Networks 3 . turned the possibility of implementing ANN in hardware into a reality. The work and charisma of John Hopfield [29][30] has made a large contribution to the credibility of ANN. digital and hybrid electronic implementations [33][34][35][36] are available today.

Information is coded on the frequency of the signal. The nerve terminal is close to the dendrites or body cells of other neurons. a neuron consists of a cell body with finger-like projections called dendrites and a long cablelike extension called the axon. • groups of a few hundreds neurons which form modules that may have highly specific functional processing capabilities. each branch terminating in a structure called the nerve terminal.A biological neuron Neurons are electrically excitable and the cell body can generate electrical signals. Circuits can therefore be formed by a number of neurons. The electrical signal propagates only in this direction and it is an all-or-none event. Before describing the structure of artificial neural networks we should give a brief description of the structure of the archetypal neural network. to exploit the available knowledge of the mechanisms of the human brain. 1. more than one nerve terminal can form synapses with a single neuron. then. very briefly consider the structure of the human brain. The axon may branch.1 . On the other end. regions of the brain containing millions of neurons or groups of neurons. • at a more macroscopic level. All of them try. called action potentials. forming special junctions called synapses. each region having assigned some discrete overall function. The basic component of brain circuitry is a specialized cell called the neuron.2 Biological Inspiration One point that is common to the various ANN models is their biological inspiration. It is usually recognized that the human brain can be viewed at three different levels [37]: • the neurons . 4 Artificial Neural Networks . As shown.the individual components in the brain circuitry. to a greater or lesser extent.Research into ANNs has lead to several different architectures being proposed over the years. Branching of the axon allows a neuron to form synapses on to several other neurons. Let us. which are propagated down the axis towards the nerve terminal. FIGURE 1.

The input strengths are not fixed. through at least two cortical regions. each one with different strengths. The amount of neurotransmitter released and its postsynaptic effects are. and linked together in some functional manner. These spatially different inputs can have different consequences. but vary with use. as described above. In certain parts of the brain groups of hundreds of neurons have been identified and are denoted as modules. where areas responsible for the primary and secondary processing of sensory information have been identified. and especially. Finally. This leads to a profusion of proposals for the model of the neuron.When the action potential reaches the nerve terminal it does not cross the gap. one neuron receiving inputs from many neurons and sending its output to many neurons. which is assumed to be achieved by modifying the strengths of the existing connections. Any particular neuron has many inputs (some receive nerve terminals from hundreds or thousands of other neurons). These have been associated with the processing of specific functions. The neuron integrates the strengths and fires action potentials accordingly. The brain is capable of learning. which alter the effective strengths of inputs to a neuron. The mechanisms behind this modification are now beginning to be understood. Of particular interest is the cerebral cortex. rather a chemical neurotransmitter is released from the nerve terminal. the neurons. Inputs to one neuron can occur in the dendrites or in the cell body. These changes in the input strengths are thought to be particularly relevant to learning and memory. These neurotransmitters can have an excitatory or inhibitory effect on the postsynaptic neuron. graded events on the frequency of the action potential on the presynaptic neuron. in parallel. In these areas. and converges into areas where some association occurs. From the brief discussion on the structure of the brain. any particular neuron always releases the same type from its nerve terminals. Artificial Neural Networks 5 . However a detailed description of the structure and the mechanisms of the human brain is currently not known. Although there are different types of neurotransmitters. These broad aspects are also present in all ANN models. These neurons are densely interconnected. but not both. some representation of the world around us is coded in electrical form. it is possible to identify different regions of the human brain. Therefore our sensory information is processed. some broad aspects can be highlighted: the brain is composed of a large number of small processing elements. to a first approximation. all working in parallel. at a higher level. acting in parallel. The combination of the neurotransmitter with the receptor changes the electrical activity on the receiving neuron. It may well be that other parts of the brain are composed of millions of these modules. the pattern of interconnectivity. This chemical crosses the synapse and interacts on the postsynaptic side with specific sites called receptors.

oi can then be given as: o i = g –1 X i. however. usually called the output function or activation function. Some ANN models are interested in the transient behaviour of (1.4) j∈i The terms inside brackets are usually denoted as the net input: net i = o j W j. which can be obtained by equating (1.j(.1) j∈i where Xi. i + b i = o i W i. the following differential equation can be considered for oi: do i = dt X i. If the simple approximation (1. and also independent of the activity of the neuron. j ( o j ) = o j W j. as well as the influence of the excitatory and inhibitory inputs.1). To derive function Xi. j ( o j ) + b i (1. If the effect of every synapse is assumed independent of the other synapses. we then have: 6 Artificial Neural Networks .2) is considered: X i. the principle of spatial summation of the effects of the synapses can be used.1) to 0.) is a loss term. the spatial location of the synapses should be taken into account. The majority. i + b i (1. i (1.j(. As it is often thought that the neuron is some kind of leaky integrator of the presynaptic signals.2) j∈i where bi is the bias or threshold for the ith neuron.3) where W denotes the weight matrix.for the learning mechanisms. Taking into account the brief description of the biological neuron given above. Assuming that the inverse of g(. consider only the stationary input-output relationship.) by f.5) j∈i Denoting g-1(. j ( o j ) – g ( o i ) (1. a model of a neuron can be obtained [39] as follows. In the neuron model the frequency of oscillation of the ith neuron will be denoted by oi (we shall consider that o is a row vector). then the most usual model of a single neuron model is obtained: oi = g –1 o j W j. Artificial neural networks can be loosely characterized by these three fundamental aspects [38].) exists.).) is a function denoting the strength of the jth synapse out of the set i and g(. which must be nonlinear in oi to take the saturation effects into account. i + b i (1.

f(. This can easily be done by defining: o' i = o i 1 (1. 1.6) Sometimes it is useful to incorporate the bias in the weight matrix. the way that these neurons are interconnected between themselves. 1. i1 Wi.6).2 . This Section will try to address these important topics. other types of activation functions can be employed. . the most common model of a neuron was introduced in Section 1.3.2 neti oi f(.2.5) and (1. . as we shall see in future sections.7) and (1.8).Model b) External view of a typical neuron Artificial Neural Networks 7 .7) W' = W bT so that the net input is simply given as: net i = o'i W' i.8) Typically. which has low and high saturation limits and a proportional range between. alternatively. consisting of eq. neural network models can be somehow characterized by the models employed for the individual neurons. (1.) is a sigmoid function.1 θi ?i θi ?i i1 i2 Wi.p a) Internal view FIGURE 1. in the following three subsections.3 Characteristics of Neural Networks As referred in the last Section. However.1 Models of a Neuron Inspired by biological fundamentals. of (1. i (1.o i = f ( net i ) (1. or. ip i S oi ip Wi. It is illustrated in Fig.2. and by the learning mechanism(s) employed. 1.) i2 .

As described above. but can be defined at an earlier stage of network design. the influence of the threshold appears in the equation of the net input as negative. as the neuron net input. which multiplies the associated input signal. specially the first and the last blocks differ according to each ANN considered. 2. but instead a product of the outputs of the activation function is employed. We denote this weight by Wk. as already introduced. CMACs and B-splines. In some networks. are not just determined by the learning process.1. In three networks.3. usually nonlinear. CMACs and B-splines.1 Input Synapses In Multilayer Perceptrons (MLPs) [31]. Bidirectional Associative Memories (BAM).i. without limitations.1]. historically. The weights can then. where they were used as functional approximators and for data modelling. or [-1. Kohohen Self-Organized Feature Maps (SOFM). and the first element in subscript refers to which one of the inputs the signal comes from (in this case the kth input to neuron i). Usually an adder is employed for computing a weighted summation of the input signals. Adaptive Resonance Theory (ART) networks. The hidden layer of a RBF network consists of a certain number of neurons (more about this number later) where the ‘weight’ associated with each link can be viewed as the squared distance between the centre associated with the neuron. An activation or output function. during the learning process. The input signals are integrated in the neuron. the weights related with the neuron(s) in the output layer are simply linear combiner(s) and therefore can be viewed as normal synapses encountered in MLPs. RBFs. Radial Basis Functions Networks (RBFs) were first introduced in the framework of ANNs by Broomhead and Lowe [40]. The following Sections will introduce the major differences found in the ANNs considered in this text. for the considered input. which has as input the neuron net input. The weights of some neural networks. and the actual input value. The output of this adder is denoted. Typically. considered as another (fixed. or. This can be associated with the adder. Since then. each link connecting each input to the summation block.Focusing our attention in fig. 8 Artificial Neural Networks . 3. 1. Brain-State-in-a-Box (BSB) networks and Boltzmann Machines. each synapse is associated with a real weight. there is a strength or weight. with the value of 1) input. these three major blocks are all present in ANNs. [0. Note also the existence of a threshold or bias term associated with the neuron. due to the faster learning associated with these neural networks (relatively to MLPs) they were more and more used in different applications. A set of synapses or connecting links. In these networks. Such are the cases of RBFs. assume any value. Associated with each synapse. where the second element in sub- script refers to the neuron in question.2 we can identify three major blocks: 1. The same happens with Hopfield networks. the integration of the different input signals is not additive.1]. the ith neuron. the range of the output lies within a finite interval. However. The positive or negative value of the weight is not important for these ANNs. however. 1.

is the order (o) of the spline. as function of the input vector (centre 0) As we can see. the weight value assumes a null value at the centre. 25 20 15 W(i. in this case. m. for this dimension is given by: m = n+o The active region for each neuron is given by: (1.3 .x max] (1. which is a parameter associated with the activation function form.Suppose. and increases when we depart from it. In these networks. i = ( C i.10) and that we introduce n interior knots (more about this later). and represented as λj . the weight Wk. k – x k ) 2 (1. Assuming that the input range of the kth dimension of the input is given by: ϑ = [x min.9) Another network that has embedded the concept of distance from centres is the B-Spline neural network.Weight value of the hidden neuron of a RBF network. but also with the region where the neuron is active or inactive. equally spaced. which. for instance. The following figure represents the weight value. such as robotics and adaptive modelling and control. B-splines are best known in graphical applications [41]. The number of hidden neurons (more commonly known as basis functions).k) 10 5 0 -5 -4 -3 -2 -1 0 x(k) 1 2 3 4 5 FIGURE 1. the neurons in the hidden layer have also associated with them a concept similar to the centres in RBFs.11) Artificial Neural Networks 9 . New to these neural networks. as a function of the input values. Mathematically. in this range.i can be described as: W k. that the centre of the ith neuron associated with the kth input has been set at 0. although they are now employed for other purposes. are denoted as the knots.

3 0.11) 10 Artificial Neural Networks .1 0 -1 -0. Associated with each hidden neuron (basis function) there is also a centre. as function of the distance of the input from the neuron centre.7 0.8 0.5 1 FIGURE 1. a set of n knots is associated with each dimension of the input space. Equations (1.5 0 x(k) 0. x ∈ [ c i – ( δ ⁄ 2 ).4 0. c i + δ ⁄ 2 [ . its active region is within the limits [ c i – δ ⁄ 2 . It is assumed.6 W(i. A generalization parameter. i = (1.12) Assuming that the ith neuron has associated with it a centre ci.k) 0. the weight Wi. These networks were introduced in the mid-seventies by Albus [42]. signal processing tasks and classification problems. x – ci 1 – -----------.. that the centre is 0. for simplicity.k can be given by. This network was originally developed for adaptively controlling robotic manipulators. 1 0.9 0. ρ .ϑδ = o ⋅ ----------n+1 (1. but has in the last twenty years been employed for adaptive modelling and control.2 0.Weight value of the hidden neuron of a B-spline network.4 . Therefore.13) The following figure illustrates the weight value. as function of the input value (centre 0) The concept of locality is taken to an extreme in Cerebellar Model Articulation Controller (CMAC) networks. must be established by the designer. other values W k. and that the spline order and the number of knots is such that its active range is 2. In the same way as B-spline neural networks. c i + ( δ ⁄ 2 ) [ δ -2 0 .5 0.

2.1. by the generalization parameter. o.2 Activation Functions The different neural networks use a considerable number of different activation functions. The weight equation is. or Threshold Function For this type of activation function. 1. ρ .2.3.1. i = . and can be represented as: 1. 1. and indicate which neural networks employ them. x ∈ [ c i – ( δ ⁄ 2 ).5 1 f(x) 0.5 .15) This activation function is represented in fig. different.1. we have: f(x) = 1 if x ≥ 0 0 if x < 0 (1.and (1.2Piecewise-linear Function This activation function is described by: Artificial Neural Networks 11 .5 . Here we shall introduce some of the most common ones.1Sign. (1. 1. all the values within the active region of the neuron contribute equally to the output value of the neuron.Threshold activation function 1.3. 1.5 0 -0.12) are also valid for this neural network. however.3. c i + ( δ ⁄ 2 ) [ 0. other values W k.14) this is.5 -3 -2 -1 0 x 1 2 3 FIGURE 1. replacing the spline order.

5 0 x 0.5 1 FIGURE 1.5 1 f(x) 0.1.2.6 .3Linear Function This is the simplest activation function.if x ≥ -2 1 1 1 f ( x ) = x + -. The form of this function is represented in fig. if – -.16) Please note that different limits for the input variable and for the output variable can be used.3.5 0 -0.< x < -2 2 2 0 .6 .17) 12 Artificial Neural Networks .Piecewise-linear function 1.if x ≤ – 1 -2 (1.. It is given simply by: f( x) = x (1.5 -1 -0.1 1 . 1. 1.

1.1.5 0 -0.5 (1.5 f(x) 0 -0.19) 1 f(x) 0. 1.2. is just f '(x) = 1 1.5 1 FIGURE 1.7 . with respect to the input.A linear function is illustrated in fig.5 -1 -0. 1.7 .Linear Activation function Its derivative. It is given by: 1 f ( x ) = --------------1 + e –x It is represented in the following figure.5 -5 -4 -3 -2 -1 0 x 1 2 3 4 5 Artificial Neural Networks 13 .3.5 0 x 0.5 1 0.5 -1 -1.4Sigmoid Function This is by far the most common activation function.18) (1.

It is defined as: x 1 – e –x f ( x ) = tanh -.8 . 1.4 -0.= f ( x ) ( 1 – f ( x ) ) ( 1 + e –x ) 2 –x (1.2 -0.5 0.21) 1.4 0. is represented in fig.5 -5 -4 -3 -2 -1 0 x 1 2 3 4 5 FIGURE 1. 0.2 0.20) that makes this function more attractive than the others. as function of the net input.9 .1.FIGURE 1.3.3 0.2.1 f´(x) 0 -0.Derivative of a sigmoid function 1.Sigmoid function Its derivative is given by: –1 e f ' ( x ) = ( ( 1 + e – x ) )' = ----------------------.= --------------.1 2 1 + e –x It is represented in the following figure. (1.9 .1 -0.5Hyperbolic tangent Sometimes this function is used instead of the original sigmoid function.20) It is the last equality in (1. 14 Artificial Neural Networks . The derivative. Sometimes the function 1 – e –2x f ( x ) = tanh ( x ) = -----------------1 + e –2x is also employed.3 -0.

4 0.5 1 0.05 0 -5 -4 -3 -2 -1 0 x 1 2 3 4 5 FIGURE 1.25 0.1.5 0.35 0.( 1 – f 2 ( x )) 2 (1.11 .1 0.10 .3 f´(x) 0.Hyperbolic tangent function The derivative of this function is: 1 x f ' ( x ) = -.45 0.2 0.5 f(x) 0 -0.Derivative of an hyperbolic tangent function Artificial Neural Networks 15 .5 -1 -1.1 – tanh -2 2 2 1 = -.5 -5 -4 -3 -2 -1 0 x 1 2 3 4 5 FIGURE 1.22) 0.15 0.

12 .0]. each node is characterised by a centre vector and a variance.1.i=[0.3.1. both in the range]-2. k – x k ) 2 = e – ----------------------------------2σ i2 k=1 (1. and a unitary variance 1.. as it can be seen. The jth univariate basis j function of order k is denoted by N ( x ) .9).2[. C i.3. We assume that there are 2 inputs.Activation values of the ith neuron. and σ i = 1 . with centre C. FIGURE 1. The following figure illustrates the activation function values of one neuron. i ( x k. as the name indicates. the activation function can be denoted as: n n W k.σ i ) = e – ----------------------------------------2σ i2 k=1 = k=1 ∏e n W k. k ) – -------------------------------2σ i2 ( C i. Using this notation.23) Please note that we are assuming just one input pattern. and it is defined by the following relationships: k 16 Artificial Neural Networks . This function is usually called localised Gaussian function. C i. Remember that the weight associated with a neuron in the RBF hidden layer can be described by (1. k ) f i ( C i.2.0]. k . are found in B-spline networks.6Gaussian functions This type of activation function is commonly employed in RBFs. considering its centre at [0. i ( x k. and.7Spline functions These functions. and that we have n different inputs.1.2.

1 0 0. The curse of dimension- Artificial Neural Networks 17 .x – λj – k λj – x j j–1 N k ( x ) = ---------------------------. λ j ] is the jth interval.7 0.1 0 0 1 2 x 3 4 5 a) Order 1 b) Order 2 0. so that it is clear how many functions are active for that particular point.24).2 0. depending on the order of the spline.4 0.6 0.5 0.6 0.8 0.1 0 0 1 2 x 3 4 5 output 1 0. λ j is the jth knot.3 output 0. where for each input dimension 2 interior knots have been assigned.7 0.8 0.4 0.5 is marked.3 0.9 0.N kj – 1 ( x ) λj – 1 – λ j – k λ j – λj – k + 1 N1 ( x ) = j 1 if x ∈ I j 0 otherwise (1.8 0. In the following figures we give examples of 2 dimensional basis functions. For all graphs.2 0.2 0.3 0. and I j = [ λ j – 1.7 0.13 . 1 0.5 0. The following figure illustrates univariate basis functions of orders o=1…4.4 0.Univariate d) Order 4 B-splines of orders 1-4 Multivariate B-splines basis functions are formed taking the tensor product of n univariate basis functions.4 0.9 0. the same point x=2.1 0 0 1 2 x 3 4 5 0 1 2 x 3 4 5 c) Order 3 FIGURE 1.2 0.24) In (1.5 output 0.5 0.6 0.7 0.6 output 0.3 0.N k – 1 ( x ) + ---------------------------.

order 1.5.Two-dimensional multivariable basis functions formed with 2. The point [1. 32=9 basis functions are active each time. FIGURE 1.14 . univariate basis functions.ality is clear from an analysis of the figures. as in the case of splines of order 3 for each dimension. The point [1.6] is marked 18 Artificial Neural Networks . 1.Two-dimensional multivariable basis functions formed with order 2 univariate basis functions. 1.15 .5] is marked FIGURE 1.

1.2. we denote the network as a multilayer NN. In the left. The point [1. there is the input layer.Two-dimensional multivariable basis functions formed with order 3 univariate basis functions. Finally. if loops are allowed. which is nothing but a buffer. neurons which are not input nor output neurons. and therefore does not implement any processing. Another possible classification is dependent on the existence of hidden neurons. 1. biological neural networks are densely interconnected. if every neuron in one layer is connected with the layer immediately above.3. This also happens with artificial neural networks. i.3. a latticebased structure. if the signals flow just from input to output. If not.17 .4] is marked 1. we can divide the architectures into feedforward networks. 1.16 .FIGURE 1. Some networks are easier described if we envisage them with a special structure.4. we speak about a partially connected network. If there are hidden neurons.2 Interconnecting neurons As already mentioned. otherwise the network can be called a singlelayer NN.. According to the flow of the signals within an ANN. or recurrent networks.1 Singlelayer feedforward network The simplest form of an ANN is represented in fig.e. The signals Artificial Neural Networks 19 . the network is called fully connected.

arriving to the output layer. two hidden layers with 4 neurons in each one. 2. as represented in fig. 4. 1. x1 y1 x2 y2 z-1 FIGURE 1. a ANN [5. For instance.2. and feeding back this delay signal to one input. 1] has 5 neurons in the input layer.3. x1 x2 x3 FIGURE 1. 1] 1. with topology [3.3.18 - y A multi-layer feedforward network. where n denotes the number of the neurons in a layer. Usually the topology of a multi layer network is represented as [ni. 4. where computation is performed.19 .19 .2 Multilayer feedforward network In this case there is one or more hidden layers. x1 y1 x2 x3 y2 FIGURE 1. nh1.3 Recurrent networks Recurrent networks are those where there are feedback loops.flow to the right through the synapses or weights.A recurrent network 20 Artificial Neural Networks . ….2. and one neuron in the output layer. no].17 - A single layer feedforward network 1. Notice that any feedforward network can be transformed into a recurrent network just by introducing a delay. The output of each layer constitutes the input to the layer immediately above.

1. and B-splines networks are best described as lattice-based associative memories.3. Artificial Neural Networks 21 .A lattice in two dimensions 1. and according to the manner how the adjustment takes place.4 Lattice-based associative memories Some types of neural networks.20 - Basic structure of a lattice associative memory The input layer can take different forms but is usually a lattice on which the basis functions are defined.3. fig.21 . RBFs. a1 x(t) a2 a3 ω1 ω2 ω3 ωp − 2 ∑ y(t) network output a p−2 a p −1 ap ωp −1 ωp normalized input space basis functions weight vector FIGURE 1. a basis functions layer and a linear weight layer. like CMACs. We can somehow classify learning with respect to the learning mechanism.21 shows a lattice in two dimensions x2 x1 FIGURE 1. with respect to when this modification takes place. All of these AMNs have a structure composed of three layers: a normalized input space layer. learning is envisaged in ANNs as the method of modifying the weights of the NN such that a desired goal is reached.2. 1.3 Learning As we have seen.

1.) denotes the trace operator and E is the error matrix defined as: E = T–O There are two major classes within supervised learning: • gradient descent learning These methods reduce a cost criterion such as (1.25) through the use of gradient methods. . the weight Wi. . In contrast with supervised learning.27) (1. the corresponding output values. The corresponding target output pattern is Tp. where m is the number of patterns in the training set. The most commonly used minimization criterion is: Ω = tr ( E E ) Γ (1. The aim of the learning phase is to find values of the weights and biases in the network such that. In this variant..this kind of learning also involves minimisation of some cost function. or teacher signals. or LMS (least-mean-square) rule [21] and the error back-propagation algorithm [31]. there is a matrix of desired outputs. it can be divided broadly into three major classes: 1. Ip. are as close as possible to T. The dimensions of these two matrices are m*ni and m*no. and consider the pth input pattern. ni is the number of inputs in the network and no is the number of outputs. Typically. suppose that you have m input patterns..1 Learning mechanism With respect to the learning mechanism. 2. in each iteration.28) reinforcement learning . Tp.3.j experiences a modification which is proportional to the negative of the gradient: δΩ ∆W i. j = – η -----------δW i. where each is computed as a correlation matrix: W = I p. I. Of course in this case the activation functions must be differentiable. O. T. respectively. using I as input data. the network does not receive a teaching signal at every training pattern but only a score that tells it how it performed over a 22 Artificial Neural Networks .25) where tr(. Then the weight matrix W is computed as the sum of m superimposed pattern matrices Wp. p T (1. Associated with some input matrix.26) Examples of supervised learning rules based on gradient descent are the Widrow-Hoff rule. In other words. • forced Hebbian or correlative learning The basic theory behind this class of learning paradigms was introduced by Donald Hebb [18]. j (1..this learning scheme assumes that the network is used as an inputoutput system.3. supervised learning .. however. this cost function is only given to the network from time to time.

1. the learning process is stochastic. batch learning or training. due to Donald Hebb [18]. 3.this type of learning mechanisms. 1. 1.3. like the Boltzmann Machine. The processing element (or elements) that emerge as winners of the competition are allowed to modify their weights (or modify their weights in a different way from those of the non-winning neurons). over a wide range of uncertainty. if both coincide.2 Off-line and on-line learning If the weight adjustment is only performed after a complete set of patterns has been presented to the network.4. The two most important classes of unsupervised learning are Hebbian learning and competitive learning: • Hebbian learning .training sequence. covering almost every field of science. Typically. epoch learning. the weight adjustment is deterministic. These cost functions are. if upon the presentation of each pattern there is always a weight update.3. otherwise it is decreased.3 Deterministic or stochastic learning In all the techniques described so far. according to states of activation of its input and output. updates the weights locally. 1. A more detailed look of control applications of ANNs is covered in Section 1. forecasting. for this reason. pattern learning. highly dependent on the application. A well known example of reinforcement learning can be found in [56]. data compression. Unsupervised learning . the states of the units are determined by a probabilistic distribution.1 Control systems Today. content addressable memories.3. Examples of this type of learning are Kohonen learning [38] and the Adaptive Resonance Theory (ART) [44]. On the other hand. data fusion and pattern recognition are briefly described in the following sections. A good survey on recent on-line learning algorithms can be found in [121].4.in this case there is no teaching signal from the environment. Artificial neural networks offer a large number of attractive features for the area of control systems [45][38]: • The ability to perform arbitrary nonlinear mappings makes them a cost efficient tool to synthesise accurate forward and inverse models of nonlinear dynamical systems. then the strength of the weight is increased.1. Other fields of applications such as combinatorial problems. there is a constant need to provide better control of more complex (and probably nonlinear) systems.3.4 Applications of neural networks During the past 15 years neural networks were applied to a large class of problems. instantaneous learning. • competitive learning . diagnosis.in this type of learning an input pattern is presented to the network and the neurons compete among themselves. In some networks. we can say that we are in the presence of on-line learning. we denote this process as off-line learning. allow- Artificial Neural Networks 23 . As learning depends on the states of the units. or adaptation.

Kuperstein and Rubinstein [11] have implemented a neural controller. as demonstrated in [46]. models are extensively used to infer primary values. which learns hand-eye coordination from its own experience. the ability of neural networks for creating arbitrary decision regions is exploited. This allows calculations to be performed at a high speed. when compared to traditional methods. Neural networks can also provide. 24 Artificial Neural Networks . It is therefore not surprising that robotics was one of the first fields where ANNs were applied. [50] have employed MLPs for the sensor failure detection in process control systems. This can be done without the need for detailed knowledge of the plant [46]. In process control. pattern recognisers and controllers. The ability to create arbitrary decision regions means that they have the potential to be applied to fault detection problems. Only a brief mention of the first two areas will be made in this Section. Fukuda et al. Development of fast architectures further reduces computation time. significant fault tolerance. [12] employed MLPs with time delay elements in the hidden layer for impact control of robotic manipulators. As examples. During the last years. In these applications. Neural networks are massive parallel computation structures. As their training can be effected on-line or off-line. from more accessible secondary variables. Robots are nonlinear. a possible use of ANNs is as control managers. if so desired. the main effort being devoted to a description of applications for the last two areas. to detect. Exploiting this property. Kawato et al. self-tuning and model reference adaptive control employ some form of plant model. making real-time implementations feasible. classify and recover from faults in control systems. complicated structures. concluding that under certain conditions the latter has advantages over the former. Elsley [47] applied MLPs to the kinematic control of a robot arm.• • • • ing traditional control schemes to be extended to the control of nonlinear plants. Artificial neural networks have also been applied to sensor failure detection and diagnosis. which may be difficult to obtain. the applications considered in the last two points can be designed off-line and afterwards used in an adaptive scheme. due to the lack of reliable process data. Predictive control. Naidu et al.1 Nonlinear identification The provision of an accurate model of the plant can be of great benefit for control. Leonard and Kramer [51] compared the performance of MLPs against RBF networks. nonlinear systems identification and control of nonlinear dynamical systems have been proposed. 1. Narenda [52] proposes the use of neural networks. Guo and Cherkassky [49] used Hopfield networks to solve the inverse kinematics problem.1. A very brief list of important work on this area can be given. failure detection systems. a large number of applications of ANN to robotics.4. in the three different capacities of identifiers. since damage to a few weights need not significantly impair the overall performance. obtaining promising results. [48] performed feedforward control in such a way that the inverse system would be built up by neural networks in trajectory control. It is still one of the most active areas for applications of artificial neural networks. deciding which control algorithm to employ based on current operational conditions.

yn[k+1] ANN yp/n[k-ny+1] .29) can be considered as a nonlinear mapping between a nu+ny dimensional space to a one-dimensional space.. in the real world. .1Forward models Consider the SISO.29) where ny and nu are the corresponding lags in the output and input and f(. y p [ k – n y + 1 ]. making use of their ability to approximate large classes of nonlinear functions accurately. …. Good surveys in neural networks for systems identification can be found in [122] and [123]. discrete-time. nonlinear plant. . most of the processes are inherently nonlinear. These applications are discussed in the following sections.Most of the models currently employed are linear in the parameters. yp/n[k] FIGURE 1. This mapping can then be approximated by a multilayer perceptron or a CMAC network (a radial basis function network can also be applied) with nu+ny inputs and one output. Equation (1. 1. (1.. have been used to approximate forward as well as inverse models of nonlinear dynamical plants. However... . up[k] plant yp[k+1] up[k] up[k-1] up[k-nu+1] . u [ k ]. are now a recognised tool for nonlinear systems identification.1. described by the following equation: y p [ k + 1 ] = f ( y p [ k ].4.22 . RBF networks and specially MLPs.) is a nonlinear continuous function.Forward plant modelling with ANNs Artificial Neural Networks 25 .S . . u [ k – n u + 1 ] ) . . Artificial neural networks. CMAC networks.1. ….

For MLPs. …. the recursiveprediction-error (RPE) algorithm.fig. due to Chen et al.22 illustrates the modelling the plant described by (1. denoted as series-parallel model. with good results. which acts as a model for the plant. [53] can be employed. Lapedes and Farber [57] have obtained good results in the approximation of chaotic time series.31) 26 Artificial Neural Networks . These exploit the cases where eq. as usual. but not for function extrapolation purposes. Training is achieved. Lant. The neural model can be trained off-line by gathering vectors of plant data and using them for training by means of one of the methods described later. y p [ k – n y + 1 ] ) + g ( u [ k ]. by passing the outputs of the neuron through a filter with transfer function: N(s) G ( s ) = e –Td s ----------D(s) (1. for forward plant modelling purposes.30) h(. or alternative rules. In the case of CMAC networks. can be used. This will be explained in Chapter 2. 1. within the training range. A large number of researchers have been applying MLPs to forward nonlinear identification. Morris et al. as ANNs can be successfully employed for nonlinear function interpolation. with promising results. Alternatively. the shaded rectangle denotes a delay. it should be ensured that the input data used should adequately span the entire input space that will be used in the recall operation. to ensure convergence of the neural model. essentially a recursive version of the batch Gauss-Newton method. This may be explained by the fact that. achieving superior convergence rate. This last configuration is called the parallel model. Bhat et al. the output of the model rapidly follows the output of the plant. (1. In this figure. at the beginning of the training process. The neural network can also be trained on-line. the WidrowHoff rule. When the neural network is being trained to identify the plant. the error back-propagation algorithm. but as soon as the learning mechanism is turned off. In [60] they have proposed. to incorporate dynamics in an MLP. In certain cases. …. in pattern mode. switch S can be connected to the output of the neural network. Narenda and Parthasarathy [13] have proposed different types of neural network models of SISO systems. As learning proceeds (using the error back-propagation algorithm. switch S is connected to the output of the plant. (1. This algorithm was also introduced by Ruano [54] based on a recursive algorithm proposed by Ljung [55]. the neural network model does not incorporate feedback. the model and the plant diverge.) possibly being linear functions.29). when performing on-line training of MLPs. the network is continuously converging to a small range of training data represented by a small number of the latest training samples. u [ k – nu + 1 ] ) . After the neural network has completed the training. is employed.) or u(. by minimizing the sum of the squares of the errors between the output of the plant and the output of the neural network. [59][60] have been applying MLPs to model ill-defined plants commonly found in industrial applications. Willis. [58] have applied MLPs to model chemical processes. In this approach.29) can be recast as: y p [ k + 1 ] = h ( yp [ k ]. As examples. In this case. One fact was observed by several researchers [13][56]. usually for a very large number of samples) the approximation becomes better within the complete input training range.

Training of the ANN involves minimizing the cost function: 1 J = -2 over a set of patterns j. depending on the architecture used for learning.23 .4. ( r [ j ] – yp [ j ] ) 2 j (1. Examples of their application can be found. for instance. u [ k – n u + 1 ] ) (1.33) (1. Psaltis et al. ….2Inverse models The use of neural networks for producing inverse models of nonlinear dynamical systems seems to be finding favour in control systems applications. The arrow denotes that the error ‘e’ is used to adjust the neural network parameters. … .34) Artificial Neural Networks 27 . have been working since 1982 in the application of CMAC networks for the control of nonlinear plants. u [ k – n u + 1 ] ) as shown in fig. since it depends on yp[k+1]. in [62][63]. The ANN will then approximate the mapping: u [ k ] = f – 1 ( r [ k + 1 ]. denoted as generalised learning architecture.1. with the use of the ANN.24 .as an alternative to the scheme shown in Fig. as shown in fig. 1. [64] proposed two different architectures. Ersü and others.1.23 . interpret and improve network performance. [61] have reminded the neural networks community that concepts from estimation theory can be employed to measure. y p [ k ]. In this application the neural network is placed in series with the plant. Assuming that the plant is described by (1.4. 1. y p [ k ]. The first one. this value is replaced by the control value r[k+1].32) is not realisable. is shown in fig.32) As (1. to obtain yp[k]=r[k]. at the Technical University of Darmstadt. …. CMAC networks have also been applied to forward modelling of nonlinear plants. 1. … .29) the ANN would then implement the mapping: u [ k ] = f –1 ( y p [ k + 1 ]. y p [ k – n y + 1 ]. u [ k – 1 ]. u [ k – 1 ]. There are several possibilities for training the network. 2. y p [ k – n y + 1 ]. 1. Recently. The aim here is. Billings et al.

after having been trained. the network may have to be trained over an extended operational range. Another drawback is that it is impossible to adapt the network on-line.Generalised learning architecture Using this scheme. yp[k] FIGURE 1. This procedure works well but has the drawback of not training the network over the range where it will operate.24 .Inverse modelling with ANNs r Plant + e - y ANN FIGURE 1. 28 Artificial Neural Networks ..23 .. up[k-nu+1] r[k+1] up[k] ANN yp[k-ny+1] yp[k+1] plant .up[k-1] .. training of the neural network involves supplying different inputs to the plant and teaching the network to map the corresponding outputs back to the plant inputs.. Consequently.

The problem with this architecture lies in the fact that. unmodifiable. and can also be trained on-line. although the error r-y is available. there is no explicit target for the control input u. Artificial Neural Networks 29 . Applications of ANN in nonlinear identification seem very promising. was introduced by Nguyen and Widrow in their famous example of the ‘truck backer-upper’ [66]. layer of the neural network to backpropagate the error. [60] achieve good results when employing MLPs as nonlinear estimators of bioprocesses. Psaltis proposed that the plant be considered as an additional. r ANN u Plant y e + FIGURE 1. Examples of MLPs used for storing the inverse modelling of nonlinear dynamical plant can be found.To overcome these disadvantages. in [56][58]. Willis et al. Saerens and Soquet [65] proposed either replacing the Jacobian elements by their sign. 1. or performing a linear identification of the plant by an adaptive least-squares algorithm. which seems to be widely used nowadays. This method of training addresses the two criticisms made of the previous architecture because the ANN is now trained in the region where it will operate. To overcome these difficulties. r. as another available tool for nonlinear systems identification. y. This depends upon having a prior knowledge of the Jacobean of the plant. during the past ten years. Harris and Brown [68] showed examples of CMAC networks for the same application. for instance. Artificial neural networks have been emerging. u.Specialised learning architecture In this case the network learns how to find the inputs. Here they employ another MLP. that drive the system outputs. the same authors proposed another architecture. An alternative. the specialised learning architecture. previously trained as a forward plant model. and open questions like determining the best topology of the network to use will be answered.25 . towards the reference signal.25 . Lapedes and Farber [57] point out that the predictive performance of MLPs exceeds the known conventional methods of prediction. to be applied in the training of the ANN. or determining it by finite difference gradients. shown in fig. It is expected that in the next few years this role will be consolidated. where the unknown Jacobian of the plant is approximated by the Jacobian of this additional MLP.

37) 30 Artificial Neural Networks . Assume. These schemes have been studied for over 30 years and have been successful for the control of linear.) are unknown nonlinear functions.36) where the error e[k] is defined as: e [ k ] = yp [ k ] – yr [ k ] and yr[k] denotes the output of the reference model at time k.4. Exploiting the nonlinear mapping capabilities of ANNs. MRAC systems based on Lyapunov or Popov stability theories were later proposed [69].) and g(. The aim of MRAC is to choose the control such that k→∞ lim e [ k ] = 0 (1.1Model reference adaptive control The largest area of application seems to be centred on model reference adaptive control (MRAC).4. adaptive elements. immediate scope for application in nonlinear. time-invariant plants with unknown parameters. therefore. traditional MRAC schemes initially used gradient descent techniques.1. using both the direct and the indirect approaches. 1. To adapt the controller gains. that the plant is given by (1. such as the MIT rule [69]. the control is calculated. references being given to additional important work. where these parameters are first estimated. several researchers have proposed extensions of MRAC to the control of nonlinear systems. With the need to develop stable adaptation schemes.30).2 Control Neural networks are nonlinear.2. During the past ten years ANNs have been integrated into a large number of control schemes currently in use. and that the reference model can be described as: yr[ k + 1 ] = n–1 i=0 ai yr [ k – i ] + r [ k ] (1. without determining the plant parameters. There are two different approaches to MRAC [13]: the direct approach. (1. MRAC systems are designed so that the output of the system being controlled follows the output of a pre-specified system with desirable characteristics. and then. in which the controller parameters are updated to reduce some norm of the output error. and the indirect approach. assuming that these estimates represent the true plant values. where h(. selected examples of applications of neural networks for control will be pointed out. Narenda and Parthasarathy [13] exploited the capability of MLPs to accurately derive forward and inverse plant models in order to develop different indirect MRAC structures.1. for instance. Following the overall philosophy behind this Chapter.1. They have found.35) where r[k] is some bounded reference input. adaptive control.

1. we then have n–1 i=0 (1..36) holds.41) can be approximated as: ˆ u [ k ] = g –1 n–1 i=0 ˆ ai yr [ k – i ] + r [ k ] – h [ k ] (1.26 suggests.40) and u[k] is then given as: u [ k ] = g –1 n–1 i=0 ai yr [ k – i ] + r [ k ] – h [ k ] (1.38) aiyr [ k – i ] + r [ k ] = g [ k ] + h [ k ] (1. y p [ k – n y + 1 ] ) and assuming that (1..40) is obtained: g[k] = n–1 i=0 ai yr [ k – i ] + r [ k ] – h [ k ] (1.. (1. y[k-ny+1] 1 y[k+1} u[k] u[k-nu+1] . … . ANNg ˆ 1 Artificial Neural Networks 31 . (1.39) Rearranging this last equation.41) If good estimates of h[k] and g-1[k] are available. interconnected as fig.Defining g [ k ] = g ( u [ k ].42) The functions h[k] and g[k] are approximated by two ANNs. …. u [ k – n u + 1 ] ) h [ k ] = h ( y p [ k ].. y[k] ANNh ˆ .

26 . ANN + un[k] + u[k] yp[k] xp[k] fixed gain .Direct MRAC controller similar scheme was proposed by Kraft and Campagna [70] using CMAC networks. which is used to adapt the weights of the ANN. Consequently. This ˆ ˆ does not involve any problem for training and any of the methods discussed in “Forward models” on page 25 can be used. is run in parallel with the controlled plant. For direct neural MRAC.Forward model Notice that this structure may be viewed as a forward model. The plant is situated between this error and the output of the MLP. The controller is a hybrid controller. Lightbody and Irwin [56] propose the structure shown in fig. composed of a fixed gain. they propose either to use Saerens and Soquet’s technique [65] of approximating the elements of the Jacobian of the plant by their sign.FIGURE 1. reference model xm[k] r[k] yr[k] + plant . linear controller in parallel with a MLP controller. with the difference that the weights at the outputs of the neural networks ANN h and ANN g are fixed and unitary. which approximates the forward model of the plant. or to use a previously trained network. 32 Artificial Neural Networks . 1. with state vector xm[k]. in order to feed back this error to the network. Its output is compared with the output of the plant in order to derive an error.27 . Examples of this approach can be seen in [13][67][56]. A suitable reference model. controller FIGURE 1.27 .

Two CMAC networks (referred by Ersü as AMS) are used in their approach. At each time interval. w [ k ]. u [ k – l + 1 ] ) (1.43) where xm[k] and v[k] denote the state of the process and the disturbance at time k. ˜ iv) finally. before applying a control to the plant. w [ k + 1 ]. w [ k ]. the following sequence of operations is performed: i) the predictive model is updated by the prediction error e [ k ] = y p [ k ] – y m [ k ] . and u[k] is the control input to the plant.1. respectively. To speed up the optimization process. The first one stores a predictive model of the process: ( x m [ k ].46) ˆ iii) the control decision u [ k ] is stored in the controller ANN to be used. v [ k ] ) → y m [ k + 1 ] (1. v [ k ] ) → u [ k ] . such that: ˆ ˜ u[k] – u[k] < ε ˜ u [ k ] ∉ GT (1. if necessary.45) constrained in the trained region GT of the input space of the predictive model of the ˆ plant. whether as a future initial guess for an optimization. a sub-optimal control input u [ k ] is sought. can be obtained from past decisions: ( x c [ k ].2Predictive control Artificial neural networks have also been proposed for predictive control. an approximation of u [ k ] . u [ k ]. (1. This process enlarges the trained region GT and speeds up the 33 Artificial Neural Networks . v [ k ] ) → u [ k ] (1. or directly as the control input upon user decision. ii) an optimization scheme is activated.4.44) where xc[k] and w[k] denote the state of the controller and the value of the set-point at time k.1. to calculate the optimal control ˆ value u [ k ] that minimizes a l-step ahead subgoal of the type: l i=1 J[ k] = L s ( y m [ k + i ]. since 1982. coined LERNAS.47) for some specified ε .2. The second ANN stores the control strategy: ( x c [ k ]. denoted as u [ k ] . a learning control structure. Ersü and others [71][72][62] have developed and applied. which is essentially a predictive control approach using CMAC networks.

The following cost function is used in their approach: J[ k] = N2 i = N1 ( w [ k + i ] – ym [ k + i])2 Nu + j=0 λu [ k + j ] 2 (1. Typically one pattern is presented to the NN. ( i = k. since each hole must be visited just once. u [ k ] . [79] have employed MLPs for autotuning PID controllers. the output of the plant is sampled.48) with respect to ˆ the sequence { u [ i ]. and the difference between the output of the plant and the predicted value is used to compute a correction factor. but specially Hopfield type networks have been used to solve this kind of problems. At each time step. is then applied to the plant and the sequence is repeated. which emulates the forward model of the plant. k + N u ) } . More recently. Different neural networks. i = N 1 …N 2 are obtained from a previously trained MLP. …. since this determines the total distance travelled in positioning the board. 1. and employed to recall patterns. Hernandez and Arkun [76] have employed MLPs for nonlinear dynamic matrix control. The same cost function (1. [78] have employed MLPs to add nonlinear effects to a linear optimal regulator. [73] developed a nonlinear extension to Generalized Predictive Control (GPC) [74].2 Combinatorial problems Within this large class of problems. Iiguni et al. This is a combinatorial problem with constraint satisfaction. Montague et al. [75] who employ radial basis functions to implement an approach similar to the LERNAS concept. The total time of drilling is a function of the order in which the holes are drilled. 1. Artificial neural networks have been proposed for other adaptive control schemes. It is usually necessary to drill a large number of accurately positioned holes in the board.48) where y m [ k + i ]. For instance. These values are later corrected and employed for the minimization of (1.Ruano et al.learning.4. This is a similar problem to the well known Travelling Salesman (TSP) problem. consider for instance the manufacturing of printed circuits. and the most similar pattern within the memories already stored 34 Artificial Neural Networks .48) is used by Sbarbaro et al. Chen [77] has introduced a neural self-tuner.3 Content addressable memories Some types of NNs can act as memories. The MLP is again employed to predict the outputs over the horizon n = N 1 …N 2 . The first of these values.4.

which has advantages for instance in data transmission applications and memory storage requirements.4. should be classified differently from patterns associated with normal behaviour. These networks are known as content addressable memories. ANNs offer great promises in this field for the same reasons biological systems are so successful at these tasks. which are trained with patterns of past/future pairs. faults. 1. 1. and power electric companies need to predict the future demands of electrical power during the day or in a near future. Patterns are transformed from a mdimensional space to a n-dimensional space. and therefore perform data compression. by assignment of “codes” to groups or clusters of similar patterns.6 Forecasting This type of application occurs also in several different areas. visual images of objects. smell and taste) to gain a meaningful perception of the environment. Different types of ANNs have been applied for this purpose. etc. malfunction. Therefore. Typical applications of ANNs compare favourably with standard symbolic processing methods or statistical pattern recognition methods such as Baysean classifiers. Artificial Neural Networks 35 . shorter patters can be dealt with. touch.7 Pattern recognition This involves areas such as recognition of printed or handwritten characters. sound.5 Diagnostics Diagnosis is a problem found in several fields of science: medicine. A bank wants to predict the creditworthiness of companies as a basis for granting loans.4. The most typical networks used for this role are Hopfield nets. etc.4. The most powerful examples of large scale multisensor fusion is found in human and other biological systems. This is one of the most used applications of ANNs. where m>n. BSBs and BAMs.8 Multisensor data fusion Sensor data fusion is the process of combining data from different sources in order to extract more information than the individual sources considered separately. where patterns associated with diseases.4 Data compression Some networks are used to perform a mapping that is a reduction of the input pattern space dimension. speech recognition. manufacturing.4. 1. engineering. Humans apply different sensory data (sight.are retrieved from the NNs. 1..4. 1. The problem is essentially a classification problem. etc. The airport management unit needs to predict the growth of passenger arrivals in the airports.

3.1 Neuron models This classifies the different ANNs according to the dependence or independence of the weight values from the distance of the pattern to the centre of the neuron. Linear Gaussian Linear Spline (order 0) Linear Spline (variable order) Linear Threshold. the present Section presents different taxonomies of the most important ANN models. The following table summarizes the classification : Table 1.Taxonomy according to the neural model Activation Function Hidden Layers: Sigmoid. Hyperbolic Tangent Output Layer: Sigmoid. the most typical learning mechanism.4.1. Sigmoid Probabilistic Piecewise-linear Piecewise-linear Mexican hat Maximum signal Neural Networks MLP Weight value unconstrained RBF CMAC B-Spline Hopfield Boltzmann BSB BAM SOFM ART Hidden Layer: dependent Output Layer: unconstrained Hidden Layer: dependent Output Layer: unconstrained Hidden Layer: dependent Output Layer: unconstrained unconstrained unconstrained unconstrained unconstrained unconstrained unconstrained 36 Artificial Neural Networks . and the most relevant applications introduced in Section 1. Each classification is based on a different perspective. 1.5. Hyperbolic Tangent. and finally.1 . starting with the neuron model employed.5 Taxonomy of Neural Networks Based on the characteristics of neural networks discussed in Section 1. and the activation function most employed. the most typical application type. the network topology.

2 summarizes this taxonomy: Table 1.3 Learning Mechanism According to the most typical learning mechanism associated with the different ANNs. Table 1.2 . or Unsupervised (Hebbian learning) Stochastic Supervised. we can categorize the different ANNs the following way: Table 1.5. gradient type Supervised (gradient type) or hybrid Supervised (gradient type) or hybrid Supervised (gradient type) or hybrid Supervised.2 Topology This perspective classifies the neural networks according to how the neurons are interconnected.5. Unsupervised (Hebbian learning) Supervised (Forced Hebbian learning) Neural Networks MLP RBF CMAC B-Spline Hopfield Boltzmann BSB BAM Artificial Neural Networks 37 .3 .Taxonomy according to the learning mechanism Learning Mechanism Supervised.1.Taxonomy according to the topology Interconnection Multilayer feedforward Multilayer feedforward Multilayer feedforward Multilayer feedforward Recurrent Recurrent Recurrent Recurrent Single layer feedforward Recurrent Neural Networks MLP RBF CMAC B-Spline Hopfield Boltzmann BSB BAM SOFM ART 1.

forecast. Please note that this should be taken as a guideline. since almost every ANN has been employed in almost every application type. pattern recognition.3 . classification.Taxonomy according to the learning mechanism Learning Mechanism Competitive Competitive Neural Networks SOFM ART 1.4 . optimization Optimization Associative memory Associative memory Optimization Pattern recognition 38 Artificial Neural Networks . forecast Associative memory. this section will classify the ANNs by the most typically found application type. forecast Control.4 Application Type Finally.Table 1.5. optimization Control. forecast Control.Taxonomy according to the application type Neural Networks MLP RBF CMAC B-Spline Hopfield Boltzmann BSB BAM SOFM ART Application Control. Table 1.

and one output layer (we shall assume here that MLPs are used for function approximation purposes). and described more extensively that the other network types. the Multilayer Perceptrons.3. will address the application of these networks for classification purposes. the inverse of a pH nonlinearity and the inverse of a coordinates transformation problem. The last section. the considerations about MLP Artificial Neural Networks 39 .CHAPTER 2 Supervised Feedforward Neural Networks Strictly speaking. and. Nevertheless. It has been shown that they constitute universal approximators [130]. respectively. The final results are summarized in Section 2. For their popularity.2 describes Radial Basis Function networks. The modelling performance of these four types of ANNs is investigated using two common examples. This common topology can be exploited. The basic theory behind Lattice-based Associative Memory networks is introduced in Section 2.3. and that is the main reason that they appear together in one Chapter. is that they share a common topology: one (or more in the case of MLPs) hidden layers. as some neural network types described in this Chapter can also be trained with competitive learning methods.1. the most common type of learning mechanism associated with all the artificial neural networks described here is. they are introduced first. the most common type of artificial neural network.5. as it will become clear later on. at least with hybrid training methods. Section 2. and the supervised learning mechanism can exploit this fact. Section 2. are introduced. there are no artificial neural networks that can be classified as supervised feedforward neural networks. In Section 2.3. both with 1 hidden layer [128] and with 2 hidden layers [129]. The role of these two blocks is different. by far. based on that CMAC networks and B-spline networks are described in Section 2.1 Multilayer Perceptrons Multilayer perceptrons (MLPs. Additionally. or.4. 2. These sections describe the use of supervised feedforward networks for approximation purposes. for short) are the most widely known type of ANNs. the supervised paradigm. Another reason to study these networks together.3.2 and in Section 2.

1 θi ?i θi ?i i1 i2 Wi.2.training methods are also of interest to the others.1.1 The Perceptron Perceptrons were first introduced by Frank Rosenblatt [19]. usually the output layer has a linear activation function. grouped in three layers: a sensory layer. and only when an error exists between the network output and the desired output. Bernard Widrow’s Adalines and Madalines are described in Section 2.7.3.4. The inputs. working at Cornell Aeronautical Labs. Problems occurring in practice in off-line training are discussed in Section 2.) i2 .1. Although there are different versions of perceptrons and corresponding learning rules. and the Least-Mean-Square (LMS) algorithm in Section 2.2.1. intended to be computational models of the retina. methods amenable to be employed for on-line learning are discussed in Section 2. we shall adopt the term Rosenblatt’s perceptron to identify this neuron model. the activation function is a threshold function.1.6 introduces a new learning criterion.1. ip i S oi ip Wi.1.1. an association layer and a classification layer.1.p a) Internal view FIGURE 2.1.1.3. . and alternative algorithms are described in Section 2. Rosenblatt’s perceptron was a network of neurons. The model for a perceptron1 was already introduced in Section 1. To be precise. The major limitations of BP are pointed out in Section 2. and its associated theorem.1.1. 40 Artificial Neural Networks . the basic rule is to alter the value of the weights only of active lines (the ones corresponding to a 1 input).Model b) External view of a perceptron The typical application of Rosenblatt was to activate an appropriate response unit for a given input patterns or a class of patterns.3. Section 2. and the "famous" Error BackPropagation algorithm (the BP algorithm) in Section 2.3.1.3.2. The neurons were of the McCulloch and Pitts’ type.3. The original Perceptron is introduced in Section 2.3. introduced in Section 1. As for modelling purposes. and this is the reason that they are described in some detail.8. .1. outputs and training patterns were binary valued (0 or 1). as the most typical model of a neuron i1 Wi.1.1. which is the model presented in the figure. the Perceptron Convergence Theorem. Nevertheless. 2. which decreases the computational training time.3. in Section 2. More formally: 1.1.3. Finally.1.1.1.3. Multilayer Perceptrons are formally introduced in Section 2.2 neti oi f(. For this reason.1 .

3) Some variations have been made to this simple perceptron model: First. If the output is 1 and should be 0. Algorithm 2. as: ∆w i = α ( t [ k ] – y [ k ] )x i [ k ] (2. increment all the weights in all active lines. and incorporated in the weight vector. in a finite number of iterations. and that the threshold is negative. at time k. if the output is correct (2. 3.1.4) Artificial Neural Networks 41 . at the k+1th iteration. as well as binary. w. do nothing. the inputs to the net may be real valued.1. The input vector is.1) In (2. If the output is 0 and should be 1. bipolar (+1.1 Perceptron Convergence Theorem One of the most important contribution of Rosenblatt was his demonstration that.2). decrement the weights in all active lines. some models do not employ a bias. …. -1).2) (2. Thus. respectively. if the output is 0 and should be 1 wi [ k + 1 ] = w i [ k ] – αx i . where ∆w is the change made to the weight vector. α is the learning rate. 2.1 - Perceptron learning rule Considering a perceptron with just 1 output. and xi[k] is the ith element of the input vector at time k. if the output is 1 and should be 0 wi [ k ]i .1. then the learning algorithm can determine. If the output is 1 and should be 1. x 1 [ k ]. is updated as: w [ k + 1 ] = w [ k ] + ∆w . Let us assume that the initial weight vector is null (w[0]=0). given a linearly separable problem. t[k] and y[k] are the desired and actual output. x n [ k ] ] T (2. the outputs may be bipolar. is: w i [ k ] + αx i . 2. one set of weights to accomplish the task. the weight vector. of if the output is 0 and should be 0. This is the famous Perceptron convergence theorem. in this case: x [ k ] = [ –1. the ith component of the weight vector.

α = 1 .7) T Since we have assumed the classes to be linearly separable.for all x[k] belonging to C1. defines an hyperplane as the decision surface between two different classes of inputs. in two dimensions Suppose that the input data of the perceptron originate from 2 linearly separable classes. wo. and unitary.5) (2.8) (2. Please note that set of weights is not unique. in such a way that the two classes C1 and C2 are separable. We can iteratively solve (2. a fixed k).6) For a specific iteration of the learning algorithm (i. and the subset of vectors that belong to class C2 as X2.. according to (2.and the weight vector is: w [ k ] = [ θ [ k ]. the perceptron incorrectly classifies all the patterns belonging to C1.7). plotted in a n-dimensional space. there is a set of weights. such that: 42 Artificial Neural Networks . to obtain the result (do not forget that w[0]=0): w[n + 1] = x[1] + x[2] + … + x[n] (2. that fall on the opposite sides of some hyperplane. …. Then training means adjust the weight vector w.6). the update equation can be represented as: w [ k + 1 ] = w [ k ] + x [ k ] . w 1 [ k ]. the equation wTx=0. Class C1 Class C2 w1 x1 + w2 x2 – θ = 0 FIGURE 2.e. Let us assume that the learning rate is constant. this is. The linear combiner output is therefore: y [ k ] = w [ k ]x [ k ] T T (2. Let us denote the subset of training vectors that belong to class C1 as X1. w n [ k ] ] assuming that we have n inputs. These two classes are said to be linearly separable if such a set of weights exist. Suppose that w [ k ]x [ k ] < 0 for all the patterns belonging to class C1.2 - Two classes separated by a hyperplane. Then.

all the terms in the right-hand-side of the last equation have a lower bound. we stay with: w[k + 1] 2 = w[k ] 2 + 2w [ k ]x [ k ] + x [ k ] T 2 . (2.16) can therefore be transformed in: w[ k + 1 ] 2 ≤ w[ k] 2 2 + x[k] 2 2 . for all x[k] belonging to C1. or equivalently: 2 2 (2.14) w[ n + 1] ( nη ) ≥ ------------2 w0 2 (2. given two vectors a and b. we have: w0 2 w[n + 1 ] 2 ≥ ( nη ) . according to (2. wT[k]x[k]<0. (2.15) If another route is followed. such that: η = min w 0 x [ i ] . from k=0 to n. for all x[k] belonging to C1.17) (2. we finally stay with: w[n + 1] 2 n ≤ k=0 x[ k ] 2 = nβ . and knowing that w[0]=0. x[i] ∈ X 1 T (2.11) But.8) by woT. which states that.10) Multiplying now both sides of (2.9) T (2.7).12) We now must make use of the Cauchy-Schwartz inequality.16) Under the assumption that the perceptron incorrectly classifies all the patterns belonging to the first class.12). and therefore: w 0 w [ n + 1 ] ≥ nη T (2.19) Artificial Neural Networks 43 . We may then define a fixed number. or equivalently. and we square (2. and (2. η .18) w[k + 1] – w[ k] ≤ x[k] . Adding these inequalities. T T T (2.w 0 x > 0 . we then have: w0 w[ n + 1 ] = T w 0 x [ 1 ] + w 0 x [ 2 ] + … + w0 x [ n ] .10). we have: a 2 b 2 ≥ (a b) T 2 (2.13) Applying this inequality to (2. 2 (2.

nmax.05 -0. Table 2. and given that an optimal value of the weights exists.21) We have thus proved that. Use the input and target data found in the MATLAB file AND. as long as the learning rate is positive.05 -0. the evolution of the weights is summarized in Table 2.05 44 Artificial Neural Networks . This value is given by: nmax β w0 = ---------------2 η 2 (2.20) Eq. Clearly this last equation is in conflict with (2.05 0. Resolution Starting with a null initial vector.1.15). a different value of η .05 0 0 -0.05 Iteration 1 2 3 4 5 6 7 W1 0 0 0 0.05 0. A different value of the initial weight vector will just decrease or increase the number of iterations. (2. There will be a maximum value of n.1 .1 . such that these equations will be satisfied as equalities.05 -0. the initial value of the weights is null.05 0.mat. It can be easily proved that.Evolution of the weights W2 0 -0.AND Function Design a perceptron to implement the logical AND function.1. will also terminate the learning algorithm in a finite number of iterations. the perceptron learning algorithm will terminate after at most nmax iterations. not necessarily constant.2 Examples Example 2.05 0 0 0 0 W3 0 -0. 2. provided the learning rate is constant and unitary.19) states that the Euclidean norm of the weight vector grows at most linearly with the number of iterations. and employing a learning rate of 0.where β is a positive number such that: β = max x [ k ] x [ k] ∈ X 1 2 (2.05.1. n.

05 0.1 -0.05 0.05 0.05 0.1 0.1 -0.05 0.05 0.1 Iteration 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 W1 0.05 0.15 -0.05 0.05 0 0 0 0 0 0 0 0 0.05 0.1 -0.05 0.15 -0.1 0.1 0.1 0.05 0.05 0.Table 2.1 0.1 0.1 -0.05 -0.05 -0.05 0.05 0.1 -0.1 Artificial Neural Networks 45 .1 -0.1 0.15 -0.1 0.1 0.05 0 0.05 0.1 -0.1 -0.1 0.Evolution of the weights W2 0 0 0 0.1 -0.05 -0.05 0.1 -0.1 0.1 -0.05 0.05 -0.05 W3 -0.1 0.1 0.1 -0.1 0.05 -0.1 0.1 -0.05 0.1 0.1 -0.05 0.15 -0.1 -0.1 0.1 -0.1 0.05 0.1 0.15 -0.05 0.05 0 0.05 -0.05 0.05 0.1 -0.1 0.05 0.05 0.1 -0.05 0.05 0.05 0.1 .1 -0.1 -0.1 -0.

Example 2. The next figure shows the changes in the decision boundaries as learning proceeds.05 0.5 15 0 -0.05 0.2 .5 2 0 0.05 46 Artificial Neural Networks .05. the evolution of the weights is summarized in Table 2.05 0.05 -0.05 0.05 Iteration 1 2 3 4 5 6 7 8 W1 0 0 0 0.3 .Evolution of the decision boundaries For a Matlab resolution.05 0 0 -0.5 1 11 0.m.mat.5 24 2 1.OR Function Design a perceptron to implement the logical OR function.8 1 -1 FIGURE 2. after iteration 24. please see the data file AND. the weights remain fixed.05 0.m and the script plot1. 2.05 0.Evolution of the weights W2 0 0 0 0.05 0. Table 2.6 0.2 .05 W3 0 0 -0.mat.4 0. Resolution Starting with a null initial vector.05 -0.We can therefore see that.05 0.2 0. Use the input and target data found in the MATLAB file OR.2. the function perlearn. and employing a learning rate of 0.

Evolution of the weights W2 0.05 0.05 -0.05 -0.05 -0.05 0.05 0.05 -0.05 -0.05 0.05 -0.05 -0.05 Iteration 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 W1 0.05 0.05 0.05 -0.05 -0.05 0.05 0.05 0.05 0.05 0.05 0.05 -0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 W3 -0.05 0.05 0.05 0.05 0.05 0.05 -0.05 -0.05 -0.05 -0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 -0.05 -0.05 0.05 -0.05 -0.05 0.05 0.05 -0.05 0.05 0.05 0.05 -0.05 0.05 0.05 0.05 0.05 0.05 Artificial Neural Networks 47 .05 -0.05 0.05 -0.05 0.2 .05 0.05 0.05 0.05 0.05 -0.05 -0.05 -0.05 0.05 -0.05 0.05 0.05 0.Table 2.05 0.05 -0.05 -0.05 -0.05 0.05 0.05 0.05 -0.05 0.05 0.05 0.05 0.05 -0.05 0.05 0.

1 0.6 -0.mat.4 .3 . The next figure shows the changes in the decision boundaries as learning proceeds.05 -0.05 -0.mat. Example 2.4 -0.8 4 1 6 FIGURE 2.6 0.05 -0.05 Iteration 1 2 3 4 5 6 7 8 9 W1 0 0 0 0 0 0 0 0 0 48 Artificial Neural Networks .05.05 -0. the function perlearn. the weights remain fixed.2 0.6 0.4 0.8 0.m and the script plot1.3. Table 2.8 -1 0 0.4 0. the evolution of the weights is summarized in Table 2.Evolution of the weights W2 W3 0 0 0 0 0 0 0 0 0 0 0 -0. please see the data file OR. Why do the weights not converge? Resolution Starting with a null initial vector.Evolution of the decision boundaries For a Matlab resolution.05 -0. and employing a learning rate of 0.We can therefore see that. Use the input and target data found in the MATLAB file XOR.05 -0.2 0 -0.3 .m.2 -0. after iteration 6.XOR Function Try to design one perceptron to implement the logical Exclusive-OR function.

05 0 0 0 0 0 0 0.05 -0.1 -0.1 -0.05 -0.05 -0.05 -0.Evolution of the weights W2 0 -0.05 -0.05 0 0 0 0 0 -0.05 Artificial Neural Networks 49 .05 -0.05 0 0 0 0 0 -0.05 0 0 0 0.05 -0.05 -0.05 Iteration 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 W1 0.05 0.1 -0.05 0 0 0 0 0 0 0 0 0 0 0.05 0 -0.05 -0.05 0.Table 2.05 -0.05 0 -0.05 -0.3 .05 -0.05 -0.05 0 -0.05 0 0 0 0 -0.05 0 0 -0.05 0 -0.05 -0.05 0 0 0 W3 0 -0.05 0 0.1 -0.1 -0.05 -0.1 -0.05 -0.1 -0.05 0 -0.05 -0.05 -0.05 -0.05 0 -0.05 -0.

We can therefore see that the weights do not converge. The next figure shows the changes in the decision boundaries as learning proceeds.
1 36 23

0.5

0

34 22

-0.5 11

-1

-1.5 15 -2 0 0.2 0.4 0.6 0.8 1

FIGURE 2.5 - Evolution

of the decision boundaries

As we know, there is no linear decision boundary that can separate the XOR classes. It is simple unsolvable by a perceptron. For a Matlab resolution, please see the data file XOR.mat, the function perlearn.m and the script plot1.m. 2.1.2 Adalines and Madalines

The Adaline (ADAptive LINear Element) was originally introduced by Widrow and Hoff [21], as an adaptive pattern-classification element. Its associated learning rule is the LMS (LeastMean-Square) algorithm, which will be detailed later. The block diagram of an Adaline is shown in fig. 2.6 . As it can be seen, it has the same form of the perceptron introduced before.
i1 Wi,1 net i
θi ?
i

i2

Wi,2

oi

S

ip

Wi,p

FIGURE 2.6 - An Adaline element

50

Artificial Neural Networks

The only difference is that the input and output variables are bipolar {-1,+1}. The major difference lies in the learning algorithm, which in this case is the LMS algorithm, and the point where the error is computed. For an Adaline the error is computed at the neti point, and not at the output, as it is the case for the perceptron learning rule. By combining several Adaline elements into a network, a Madaline (Many ADALINEs) neural network is obtained.

x1 x2 x3

FIGURE 2.7 - A Madaline

(each circle denotes an Adaline element)

Employing Madalines solves the problem associated with computing decision boundaries that require nonlinear separability. Example 2.4 - XOR Function with Madalines Employing the perceptrons designed in Ex. 2.1 and Ex. 2.2, design a network structure capable of implementing the Exclusive-OR function. Resolution As it is well known, the XOR function can be implemented as: X ⊕ Y = XY + XY
(2.22)

Therefore, we need one perceptron that implements the OR function, designed in Ex. 2.2, and we can adapt the perceptron that implements the AND function to compute each one of the two terms in eq. (2.22). The net input of a perceptron is: net = w 1 x + w 2 y + θ
(2.23)

If we substitute the term x in the last equation by (1-x), it is obvious that the perceptron will implement the boolean function xy . This involves the following change of weights: net = w 1 ( 1 – x ) + w 2 y + θ = – w 1 x + w 2 y + ( θ + w 1 )
(2.24)

Artificial Neural Networks

51

Therefore, if the sign of weight 1 is changed, and the 3rd weight is changed to θ + w 1 , then the original AND function implements the function xy . Using the same reasoning, if the sign of weight 2 is changed, and the 3rd weight is changed to θ + w 2 , then the original AND function implements the function xy . Finally, if the perceptron implementing the OR function is employed, with the outputs of the previous perceptrons as inputs, the XOR problem is solved. The weights of the first perceptron, implementing the xy function, are: [-0.1 0.05 0]. The weights of the second perceptron, implementing the xy function, are: [0.1 -0.05 –0.05]. The weights of the third perceptron, implementing the OR function, are: [0.05 0.05 –0.05]. The following table illustrates the results of the network structure related to the training patterns. In the table, y1 implements the xy function and y2 is the output of the xy function. As it can be seen, the madaline implements the XOR function. For a Matlab resolution, please see the data files AND.mat and OR.mat, the functions perlearn.m, perrecal.m and XOR.m.
Table 2.4 - XOR

problem solved with a Madaline inp2 y1 1 0 1 1 0 1 1 0 0 1 0 1 1 1 1 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 y2 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 y 1 0 0 0 0 0 0 0 1 0 0 0 1 0 1 1 1 0

Iteration 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18

inp1 0 0 1 1 0 1 1 0 1 1 0 1 0 1 0 1 1 1

52

Artificial Neural Networks

Table 2.4 - XOR

problem solved with a Madaline inp2 y1 0 0 1 0 1 1 0 1 1 0 0 1 0 1 1 1 1 1 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 y2 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 y 0 0 1 0 0 0 0 0 0 0 1 0 0 0 1 0 1 0 0 1 0 0

Iteration 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40

inp1 0 0 0 0 1 1 0 1 1 0 1 1 0 1 0 1 0 1 1 1 0 0

Artificial Neural Networks

53

2.1.2.1 The LMS Algorithm One of the important contributions of Bernard Widrow and his co-workers was the introduction of the Least-Mean-Square (LMS) algorithm. This learning algorithm also goes by the name of Widrow-Hoff learning algorithm, adaline learning rule or delta rule.
θi ?

x1

Wi ,1

i

x2

Wi ,2

neti

oi f(.)

S
xp Wi ,p +

t

LMS

FIGURE 2.8 - Application

of the LMS algorithm

fig. 2.8 illustrates the application of the LMS algorithm. Opposed to the perceptron learning rule, the error is not defined at the output of the nonlinearity, but just before it. Therefore, the error is not limited to the discrete values {-1, 0, 1} as in the normal perceptron but can take any real value. The LMS rule is just: w [ k + 1 ] = w [ k ] + α ( t [ k ] – net [ k ] )x [ k ]
(2.25)

Comparing this equation with the perceptron learning algorithm (eq. (2.1) and (2.2)), we can see that, apparently, the only difference lies in the replacement of the output by the net input. However, the major difference lies in the justification of the introduction of the algorithm, which follows a completely different route. 2.1.2.2 Justification of the LMS Algorithm Assume that, associated with the error vector in the training set (composed of N patterns), we define a cost function:
2 1 T 1 E = -- e e = -e [i], 2 2i=1
N

(2.26)

where e[k] is defined as:

54

Artificial Neural Networks

5 0.6 1.15 0. and perform a step in the opposite direction.8 2 FIGURE 2.4 0. If the values of (2.Error surface We wish to find the value of the weight that corresponds to the minimum value of the error function.45 0.4 1. compute the gradient of the function. Eq. we assume that the error is defined at the output of a linear combiner.4 0.9 might appear. and so our linear model has just one variable weight. Widrow formulated the Adaline learning process as finding the minimum point of the error function.26).9 . 0.2 0. (2.05 0 minimum value of E E -dE/dw 0 0.28) and the LMS rule can be given as: 1. This last expression is the mean-square error. 2.35 0. and because of this its learning algorithm is called LMS. The gradient of the error is denoted as: ∂E --------∂w 1 g = … ∂E --------∂w p (2.3 0.2 1. One way to accomplish this task is.26).]. using an instantaneous version of the steepest descent technique2.6 0. can be rewritten as 1/2E[e2].25 0. it should be net[k]. where E[.1 0. a error-performance surface such as fig. Widrow introduced this in a stochastic framework.2 0. For the case of an Adaline.] is the expectation operator.27) Suppose that y is a linear function of just one input variable. For simplicity of notation. as functions of the weight are plotted. Artificial Neural Networks 55 .e [ k ] = t [ k ] – y [ k ] 1. Actually. (2.8 1 w 1. at each point. and denote this value by y[. 2. This is called the method of steepest descent.

31) e [ i ]x 1 [ i ] i=1 T … ∂y [ i ] e [ i ] -----------. we stay with: ∂y [ i ] e [ i ] -----------.∂y [ j ].29) 1 i=1 1 ∂e [ i ] -.26).= ----------------2 ∂w 1 2 i = 1 ∂w 1 g = N ∂ e [i] 2 N 2 … i=1 1 -------------------------.34) is the Jacobean matrix.… -------------∂w 1 ∂w p (2. and e[ N] ∂y [ 1 ] ------------.32) e = e[ 1] … is the error vector.33) (2.= ∂w p i=1 N = – ( e J ) . we have: N (2.30) ∂ e [i] 2 N 2 Let us consider one of the terms in (2..w [ k + 1 ] = w [ k ] – ηg [ k ] Substituting (2.… ∂y [ 1 ] ------------∂w 1 ∂w p J = … … … ∂y [ N ] ∂y [ N ] -------------.-------------------------. where: e [ i ]x p [ i ] i=1 T (2. 56 Artificial Neural Networks . the matrix of the derivatives of the output vector with respect to the weights.= ∂w i i=1 g = – N N N 2 2 (2.30).1 ∂e [ i ] -= ----------------2 ∂w p 2 i = 1 ∂w p (2. we have: ] ∂e [ j ] = ∂e [ j .------------i ∂w i ∂y [ j ] ∂w i Replacing this equation back in (2. i. By employing the chain rule. in (2.28).= – 2e [ j ]x [ j ] ------------------------------.e.30).

If. Repeat the same problem with the LMS algorithm. This is called batch update or epoch mode. 1]. in each pass through all the training set. as the output is just a linear combination of the inputs. an instantaneous version of the gradient is computed for each pattern. Employ a learning rate of 0. Determine the first update of the weight vector using the steepest descent method.5 . over the interval [-1.38 Artificial Neural Networks 57 .9]. 1.In this case.5 .Error vector Pattern 1 2 3 4 5 6 7 Error 1 0.35) (2. and we have the Widrow-Hoff algorithm.1.29). this is called pattern mode. x[N] Replacing now (2. Suppose that we use this model to approximate the nonlinear function y=x2.32) in (2. 2.1. on the other hand.x2 function Consider a simple Adaline with just one input and a bias. The initial weight vector is [0.2 -0. Example 2. the Jacobean is just the input matrix X: x[1] J = … . and the weights are updated pattern by pattern. in each iteration of the algorithm.46 0. with a sampling interval of 0. Compute all the relevant vectors and matrices. The error vector is: Table 2.36) Please note that in the last equation all the errors must be computed for all the training set before the weights are updated.72 0. we have the steepest descent update: w [ k + 1 ] = w [ k ] + ηe [ k ]X T (2. Resolution 1.9 0.22 0 -0.

08 -1.9 -0.8 -0.8 -0.Table 2.6 .98 -0.4 -0.1 0.68 -0.1 -1.98 -1.6 -0.2 -0.6 58 Artificial Neural Networks .8 Table 2.54 -0.9 -0.5 0.7 -0.1 -1.04 -1.5 .Jacobean matrix Input 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 Pattern 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 Input 1 -1 -0.3 0.2 0.1 0 0.3 -0.5 -0.9 -0.04 -0.4 0.08 -1.Error vector Pattern 8 9 10 11 12 13 14 15 16 17 18 19 20 21 The Jacobean matrix is: Error -0.

7211 -0.Jacobean matrix Input 2 1 1 1 1 Pattern 18 19 20 21 The gradient vector is therefore: Input 1 0.7436 -0.6135 -0. 2.Error vector Pattern 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 Error 1 0.4106 -0.6778 -0.527 -0.7 .207 – 0.m.53 0.6804 -0.0658 -0.6158 -0.2602 -0.7228 -0.7 0. With the LMS algorithm.9 1 T T g = – ( e J ) = 6.8 0.5295 Artificial Neural Networks 59 .6 .7431 -0.1888 -0.Table 2. please see the function x_sq_std. we have the following error vector: Table 2.93 11.2 and the weight vector for next iteration: w [ 2 ] = 0.22 For a Matlab resolution.

7211 0.6135 0.2053 0.8 .0476 0.1361 0.2108 0.3116 -0.4278 0.1442 0.6778 0.0529 -0.53 0.Gradient vector Pattern 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 Input 1 Input 2 1 -1 0.2567 0.527 -0.6804 -0.7228 0 0.Error vector Pattern 17 18 19 20 21 Error -0.4106 -0.1847 0.1561 0.0789 -0.1888 -0.477 -0.5295 0.7436 0.0789 60 Artificial Neural Networks .7 .2181 0.0461 0.0723 0.7431 0.0658 -0.1479 0.2602 -0.1849 -0.0743 0.2454 0.2648 0.6158 -0.4278 -0.0789 The evolution of the (instantaneous) gradient vector is: Table 2.1511 -0.0529 0.2033 0.Table 2.1849 0.3116 0.

it is not difficult to design a network structure with simple perceptrons.8383 0.7451 0.3408 0.6977 0.2668 0. Artificial Neural Networks 61 .5949 0. as long as each one is individually trained to implement a specific function.3937 0.8309 0.8839 0.0653 1.9 .7961 0.4.686 Notice that both the gradient and the weight vectors.9455 0.Evolution of the weight vector W2 0.8175 0.8311 0. However. after the presentation of all the training set. it is not possible to employ the derivative chain rule of the steepest descent algorithm.3 Multilayer perceptrons As we have seen in Ex. as because of the hard nonlinearity of the perceptron.0719 1. are different from the values computed in 1.0393 0.m.And the evolution of the weight vector is: Table 2. 2.9 0.8165 0.7372 0. to design the whole structure globally to implement the XOR is impossible. For a Matlab resolution.7418 0.4551 0.799 0.6781 0.298 0.1.2509 Iteration 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 W1 0.7779 0.2483 0.6693 0.7436 0.0393 1.9 1 1. 2.7574 0.5228 0.053 1.7716 0.8383 0.6829 0.8159 0.243 0.7195 0. please see the function x_sq_lms.8 0.7523 0.

..10 illustrates a typical multilayer perceptron. O (2) (1) (1) FIGURE 2... 62 Artificial Neural Networks .) Σ . .. • Net(z) is a m*kz matrix of the net inputs of the neurons in layer z. 2. this is not defined for q=1..This problem is solved if the hard nonlinearity is replaced by a differentiable nonlinearity..) Net Σ W .) Σ .) Σ .. and the output layer is the qth).. As the name indicates. f(. .. the network is structured in layers. the network input matrix is therefore O(1). o ( q) (q) f(. .. the zth element denoting the number of neurons in layer z. • O(z)is the m*kz matrix of the outputs of the neurons in layer z.) Σ . .... f(.10 . . net W (q – 1) O Net ( q – 1) (q – 1) f(. . • T is the m*kq matrix of desired outputs or targets. f(... • E is m*kq matrix of the errors (E=T-O(q)). and therefore the following notation is introduced: • q is the number of layers in the MLP (the input layer is the 1st layer.. All the important variables should reflect the layered structured... • k is a vector with q elements.....A multilayer perceptron fig. such as a sigmoid or an hyperbolic tangent.. .

MLPs and the BP algorithm are so intimately related that it is usual to find in the literature that this type of artificial neural network is referred to as a `back-propagation neural network'. rediscovered by Parker [81] and widely broadcast throughout the scientific community by Rumelhart and the PDP group [31]. In each iteration. In fact.1 The Error Back-Propagation Algorithm As in the case of the previous models. 2i = 1 2 N (2. The threshold associated with each neuron of the (z+1)th layer can be envisaged as a normal weight that connects an additional neuron (the (kz+1)th neuron) in the zth layer.e e = -e [i] . the weights of the multilayer perceptron are updated by a fixed percentage in the negative gradient direction: Artificial Neural Networks 63 . defined as w ( z) … (z) W . Several people from a wide-range of disciplines have independently developed this algorithm. Therefore. it is applied in an elementby-element basis.3. The original error back-propagation algorithm implements a steepest descent method. Apparently it was introduced by Paul Werbos [80] in his doctoral thesis. In this case: W .1. the aim of training the MLPs is to find the values of the weights and biases of the network that minimize the sum of the square of the errors between the target and the actual output: 2 1 1 T E = -. • W(z) is the (kz+1)*kz+1weight matrix between the zth layer and the (z+1)th layer.j is the weight that connects neuron i in the zth layer to neuron j in the (z+1)th layer .1 = (z) • w(z) is the vector representation of W(z). which has a fixed output of 1..kz + 1 w • w is the entire weight vector of the network w = ( q – 1) … w (1) 2.Net (z + 1) = O W (z) (z) In some cases it is convenient to represent these weight matrices as vectors. W (z)i.) is the matrix output function for the neurons in layer z..37) The error back-propagation (BP) algorithm is the best known learning algorithm for performing this operation.• F(z)(.

40) where J is the Jacobean matrix. allow the use of a faster learning rate. 2. in the weights update equation: w [ k + 1 ] = w [ k ] – ηg [ k ] + α ( w [ k ] – w [ k – 1 ] ) (2.3. the limitations of this algorithm are pointed out. is developed below.r. called the momentum term.41) In eq. Sandom and Uhr [85] have developed a set of heuristics for escaping local error minima. One of them is to perform the update of the weights each time a pattern is presented. The chain rule can be employed to compute J(z): 64 Artificial Neural Networks . Watrous [84] has employed a line search algorithm. In this section we shall compute the gradient vector following this approach. (2. To name but a few. without leading to oscillations. a fact that was also observed in several numerical simulations performed by us. The reasoning behind pattern mode update is that. Following this approach.2 The Jacobean matrix As we have already seen (Section 2. (2. Several other modifications to the BP algorithm were proposed.w [ k + 1 ] = w [ k ] – ηg [ k ] (2. Subsequently. in its original version.2). reflecting the topology of the MLP: J = J(q – 1) … J( 1) (2. if η is small. the departure from true gradient descent will be small and the algorithm will carry out a very close approximation to gradient descent in sum-squared error [31]. Tesauro et al. in principle. [82] suggest that the rate of convergence is essentially unmodified by the existence of the momentum term. It is however questionable whether in practice this modification increases the rate of convergence.1. which would be filtered out by the momentum term.41). Another modification. J(z) denotes the derivatives of the output vector of the MLP w.t the weight vector w(z). it is advisable to compute J in a partitioned fashion.38) Several modifications of this algorithm have been proposed.39) The use of this term would.2. as the Jacobean will be used later on with alternative training algorithms. also introduced by Rumelhart and the PDP group [31].1. the gradient vector can be computed as: gT = –eT J . The error back-propagation algorithm. Jacobs [83] proposed an adaptive method for obtaining the learning rate parameter. is the inclusion of a portion of the last weight change.

as the net input of one neuron only affects its output. ∂Net i. k ) . k j (j) ∂ Net i. .can be obtained as: (j – 1) ∂Net ∂Net ( j ) ∂O ( j – 1 ) ∂Net ----------------------. k j (j) ∂O i.------------------. .∂o ∂o J ( z ) = -----------. for presentation i. it is just a scalar f' ( net i ) . 1 … … … ∂O i. with q<j<1 is a diagonal matrix. 1 (j) ∂ Net i.-----------------------(q – 1) (z + 1) ∂w ( z ) ∂O ∂Net (q) (z + 1) (2. additional conventions should have to be defined. . k j (j) = … 0 . Their use does not clarify the text. which means they do not exist for z=q-1. ∂Net i.44) Detailing now each constituent element of the right hand-side of (2. it is easy to see that: ∂O i. ∂o i ∂o i = -----------.-------------------------(z) ( q) (q – 1) (z + 1) (z) ∂w ∂net i ∂O i. To avoid this additional complexity. and not the others.44).42) is followed.= ----------------.… -----------------------. the computation of J(z) will be developed in a pattern basis. 1 ( j) ∂ Net i. (j) (j) (2. 1 (j) … ∂O i.= ---------------(q) (z) ∂w ∂net (q) (q) ∂net ∂O ∂Net ( z + 1 ) ------------------. ∂w (q) (q) (q) (z + 1) (z + 1) (2. kj ∂ Net i. the kth diagonal element is f' ( Net i. (j) ( j) (j) ∂ Net i. kj ( j) ( j) ∂ Net i. the Jacobean of layer z. .= -----------------. and if introduced. If (2. . since the derivatives of ∂Net ----------------------.43) ∂net i ∂O i.42) In the last eq. since we are considering just one output. and since there is no dependence from presentation to presentation in the operation of the MLPs. = … ∂O i. kj (j) ∂ Net i. rather than in epoch basis. is just: (z) J i. . the terms within curly brackets occur in pairs. three-dimensional quantities must be employed. 1 (j) … … … 0 … ∂O i. (q) (j) Artificial Neural Networks 65 .for the case of the output layer (z=q). 1 a) (j) ∂O i.… -----------------------. So.----------------------(j – 1) ∂O ( j – 1 ) ∂Net ( j – 1 ) ∂Net The number of these pairs is q-z-1. ..

. ..44) can be further manipulated to yield a more compact form for the computation of Ji. eq. Using relationships (a) to (c). 1 (z) ( z + 1) ∂ W . respectively.. ) ( j – 1) T (j – 1) is the matrix of weights that connect the ith neuron in layer (j+1)th to the jth neuron in the layer immediately below. kz + 1 (z) ( z + 1) ∂w can be best viewed as a partitioned matrix. 1 (z) (z + 1) (z) … … … ∂Net i.46) (z) (z) . the sub-partition of J(z) corresponding to the weights that connect the neurons in the zth layer with the pth neuron in layer z+1.p ( k z + 1 ) (2. . 1 = … ∂Net i. where p = ( ( p – 1 ) ( k z + 1 ) + 1 ). 1 (j – 1) ( j) = ( W 1. c) -----------------------(z) (z + 1) ∂ W . k j. 1 … … … ∂Neti. with kz+1 column partitions. kj ∂ O i.. -----------------(j – 1) ∂O i. kz + 1 ∂ W .. k j – 1 ( j) (j – 1) = … ∂Net i. (2..45) 66 Artificial Neural Networks . should be denoted as J i. .∂Net i. another simplification can be achieved if we further partition Ji.. 1 … ( j) ∂ O i. with a format that reflects the organization of one layer into different neurons. ( j) (j – 1) ∂ O i. k j – 1 ∂Neti. and therefore is the transpose of the weight matrix without considering the last line. This way. kz + 1 ∂ W .. 1 (z + 1) ∂Net i. 1 0 where the top and bottom matrices have dimensions of (j-1)*(kz+1) and (kz+1-j)*(kz+1).. kj ∂ O i. its jth partition (Aj) has the form: 0 Aj = O i. without considering the bias term. p . among the different neurons within the layer z+1. 1 b) (j) ∂Net i. for the special case of the output layer it is just a row vector. . The matrices described in a) and c) are sparse matrices. Using the same type of conventions used so far. ∂Net i. . (2. kz + 1 … ∂Net i.

. Following (2.= -----------------------(z + 1) (z) (z + 1) ∂Net i. ∂o i = -----------------------.. p can be given (z) ∂w as: th (z) J i. . it is clear that no dupli(j) ∂Net i.. to compute -----------------------.47) Note that the first term in the last equality is just the pth element of a row vector.. must be calculated beforehand. -----------------------. (q) (q) (q) Artificial Neural Networks 67 . This way. ∂O i.44). ) .45)). J i. . . the matrices (z + 1) ∂Net i.has just the pth row non-null (see (2. Further.must be obtained. Using this property. ( j) (j) (2. . with q ≥ j > z + 1 . The diagonallity of matrix ---------------. . it is easy to see that the product of each pair of matrices inside the curly brackets of (2.-------------------------. because of the layered structure of the MLPs.48) where the symbol × has the meaning that each element of each column of the pre-multiplying matrix is multiplied by the corresponding element of the post-multiplying vector. ] × f ' ( Net i. . ( z + 1) ∂Neti. . ∂o i ---------------. ∂o i these derivatives are obtained in a recursive fashion. we note that.. kj – 1. p (z + 1) ∂o i ∂Net i. i. cate calculations will be made if Ji..can also be exploited. .Net i. . for every ∂o i layer z. is computed starting from the output layer towards the input layer. (j) ∂Net i..e. . The final consideration before presenting an efficient algorithm to compute Ji. . concerns the order of the computations of its different partitions. (z) Since the p partition of --------------------. (q) (z + 1) (q) p O i.44) can be given as: (j – 1) T ( j – 1) ∂Net ----------------------.= [ W 1. ∂w p ∂Net i. to simplify further the calculations. (j – 1) ∂Net . 1 (z) (2.

j ). roughly. which simplifies the calculations. and O (ij. ) . also O (. end if z>2 d := d ( W ( z – 1 ) ) T d := d × f ' ( Net i. Also for output functions like the sigmoid or the hyperbolic tangent functions. .).2 illustrates an efficient way to compute the Jacobean matrix of an MLP. must be know beforehand. for q ≥ j > 1 . If this operation is performed before the computation of the Jacobean matrix takes place. an analysis of the operations involved in Alg. a) Point b) implies that a recall operation must be executed. in conjunction with (2. end z := z-1 end Algorithm 2. b) The error vector. then Net (ij. Using this algorithm. . j (z – 1) := d j O ( z – 1 ) 1 i. and consequently the output vector. can be obtained. must be available.40). z := q d := f '( net ( q ) ) i while z>1 for j=1 to kz J i. requires that two other points be addressed: Net (i. and f' ( Net (ij. mn additions and km derivatives are needed to obtain the Jacobean matrix. 2. i . the computation of the derivative can be obtained from the output. Denoting by k the number of neurons in the MLP (without considering the input layer) and by n the total number of weights in the network.Keeping this preceding points in mind. 68 Artificial Neural Networks . must be available. Alg. for q > j ≥ 1 .. 2mn multiplications.2 ( z – 1) ) Computation of Ji.).2 shows that. 2. to compute the gradient vector.j ) .).

we obtain the gradient vector as computed by the BP algorithm. 2.. is also derived on a presentation mode. with the same computational complexity as the error back-propagation algorithm.4 - Computation of g using (2. 2.4: g T := 0 i := 1 while i ≤ m .3) -.3. compute o(q)i e i := t i – o ( q ) i . Employing Alg.2.The recall operation can be implemented using Alg. z := 1 while z<q Net i. for consistency with Alg. .. However. end -. 2. 2. := f ( z + 1 ) ( Net z + 1 ) z := z+1 end (z + 1) Algorithm 2. mk output function evaluations and mn additions and multiplications are needed to compute o(q).. 2. Artificial Neural Networks 69 . . [31]. By doing that.(Alg.. 2.40) This last algorithm computes the gradient vector.3 - Recall Operation Using this algorithm. compute Ji. which.2) Algorithm 2.2 and Alg. . 2. 2. O i..(Alg.3. it does not share one of its advantages because not all the computations are performed locally in the neurons. the gradient vector can be obtained using Alg.4. . This property can be provided by slightly modifying Alg. g T := g T – e i J i. as presented by Rumelhart et al. (z + 1) := O ( z ) 1 W ( z ) i.

(Alg. It is common to find in the specialized literature that hundreds of thousands of iterations are needed to train MLPs.2.3 Analysis of the error back-propagation algorithm It was shown above that each iteration of the error back-propagation algorithm is computationally efficient. This fact is related to the underlying minimization technique that this algorithm implements. end z := z-1 end end (z – 1) ) Algorithm 2. compute o(q)i e i := t i – o i z := q ( q) -. However. its performance with regard to the total process of training is very poor. To show the problems associated with this method.5 - Computation of g using BP The main difference between this algorithm and Alg. an example will be employed.1The Gradient Vector as computed by the BP Algorithm The following algorithm computes the gradient vector in the manner introduced by Rumelhart and co-workers [31]. end if z>2 d := d ( W ( z – 1 ) ) T d := d × f' ( Net i.3) d := [ f' ( net i )e i ] while z>1 for j=1 to kz gj ( z – 1 )T (q) := g j ( z – 1)T – dj O (z – 1 )T 1 i. . Notice that the symbol × has the meaning that each element of each column of the pre-multiplying matrix is multiplied by the corresponding element of the postmultiplying vector. in this case. all the calculations are made locally (in each neuron).1.3.. the steepest descent method. 2. g T := 0 i := 1 while i ≤ m . 2.3.4 is that.1. .2. 2. 70 Artificial Neural Networks ..

11 - Simplest possible MLP The performance of the algorithm will be evaluated using four different graphs: illustrates the evolution of w. and separated by ∆x = 0. The simplest possible MLP will be used. will be employed. 2. ˆ 1. for the continuous mapping.11 . fig.It can be shown that. Contour lines are superimposed on the graph.x2 mapping The aim of this example is to train one MLP to approximate the mapping y = x 2 in a specified range of x. b) illustrates the evolution of the criterion. 2. The same starting point (w = 0.6 .9 ) will be used for all the examples. as shown in fig. with different learning rates. for the first ten iterations.9 0.1 is used. namely a perceptron with a linear output T function. the training procedure diverges when a learning rate of 0. 2.1 . y w1 1 w2 x FIGURE 2. c) shows the convergence of the threshold (w1). As shown in fig. w = 1 0 -3 T Artificial Neural Networks 71 .12 to fig. The error back-propagation algorithm. the evolution of the bias is clearly an unstable process. which is the usual sum of the square of the errors ( e 2 ). 2. to illustrate the behaviour of the training criterion.12 .Example 2.14 illustrate the performance achieved by the algorithm for x ∈ – 1 1 1. Although w2 converges to its optimum value (0). starting from x=-1. The training range was discretized into 21 evenly spaced points. a) d) shows the convergence of the weight (w2).

x ∈ – 1 1 . 2.13 illustrates the performance of the training procedure for the same problem. with slower convergence rate than the evolution of the weight. w1 w2 Iteration number Iteration number c) Convergence of w1 FIGURE 2. 72 Artificial Neural Networks .e. The convergence of this last parameter (w2) is now slightly slower than was obtained with η = 0. η = 0.1 .w2 w1 a) Convergence of w e 2 Iteration number b) evolution of s. The evolution of the bias is an oscillatory process. ∆x = 0. With this learning rate the training procedure is now stable.09.12 - d) Convergence of w2 x2 problem.1 .1 fig.s. this time using a learning rate of 0.

.

w2

w1 a) Convergence of w

e

2

Iteration number

b) evolution of s.s.e.

w1

w2

Iteration number

Iteration number

c) Convergence of w1

d) Convergence of w2

FIGURE 2.13 -

x2 problem; x ∈ – 1 1 , ∆x = 0.1 , η = 0.09

fig. 2.14 shows what happens when the learning rate is further reduced to 0.05. The training process is perfectly stable. Although the learning rate used is smaller than the one employed in the last example, the evolution of the training criteria is faster. The convergence rate for w1 is now better than for w2. This latter parameter converges slower than with η = 0.09 and clearly dominates the convergence rate of the overall training process.

Artificial Neural Networks

73

w2

w1 a) Convergence of w

e

2

Iteration number

b) evolution of s.s.e.

w1

w2

Iteration number

Iteration number

c) Convergence of w1
FIGURE 2.14 - x2

d) Convergence of w2

problem; x ∈ – 1 1 , ∆x = 0.1 , η = 0.05

As a final example, the same range of x will be used, this time with a separation between samples of ∆x = 0.2 . fig. 2.15 illustrates the results obtained when η = 0.1 is used. Although the learning rate employed is the same as in the example depicted in fig. 2.12 , the results presented in fig. 2.15 resemble more closely the ones shown in fig. 2.14 .

74

Artificial Neural Networks

.

w2

w1 a) Convergence of w

e

2

Iteration number

b) evolution of s.s.e.

w1

Iteration number

w2

Iteration number

c) Convergence of w1
FIGURE 2.15 -

d) Convergence of w2

x2 problem; x ∈ – 1 1 , ∆x = 0.2 , η = 0.1

From the four graphs shown, two conclusions can be drawn: the error back-propagation algorithm is not a reliable algorithm; the training procedure can diverge; b) the convergence rate obtained depends on the learning rate used and on the characteristics of the training data.
a)

2.1.3.3.1Normal equations and least-squares solution This example was deliberately chosen as it is amenable for an analytical solution. Since the output layer has a linear function, the output of the network is:

Artificial Neural Networks

75

o(q) = x 1 w , or in the more general case: o ( q ) = Aw

(2.49)

(2.50)

where A denotes the input matrix, augmented by a column vector of ones to account for the threshold. In this special (linear) case, the training criteria, given by (2.37), becomes: t – Aw 2 Ω l = ---------------------2 where the delimiters denote the 2-norm.
(2.51)

Derivating (2.51) w.r.t. w, the gradient can then be expressed as: g l = – A T t + ( A T A )w
(2.52)

The minimal solution, or, as it is usually called, the least-squares solution of (2.51) is then obtained by first equating (2.52) to 0, giving then rise to the normal equations [92]: ( A T A )w = A T t which, when A is not rank-deficient, it has a unique solution, given by (2.54): ˆ w = ( A T A ) -1 A T t
(2.54) (2.53)

where it is assumed that A has dimensions m*n, m ≥ n 1.The matrix ATA is called the normal equation matrix. In the case where A is rank-deficient, there is an infinite number of solutions. The one with minimum 2-norm can be given by: ˆ w = A+t where A+ denotes the pseudo-inverse of A [93]. 2.1.3.3.2The error backpropagation algorithm as a discrete dynamic system If (2.52) is substituted into the back-propagation update equation (2.38), we obtain:
(2.55)

1.Throughout this work, it will always be assumed that the number of presentations is always greater than or equal to the number of parameters.

76

Artificial Neural Networks

w [ k + 1 ] = w [ k ] + η ( A T t – ( A T A )w [ k ] ) = ( I – η ( A T A ) )w [ k ] + ηA T t

(2.56)

This last equation explicitly describes the training procedure as a discrete MIMO system, which can be represented as [94]:

ATt

+

- g[k]

η H ( z ) = ---------z–1

w[k+1]

-

w[k]

(ATA)

z-1

FIGURE 2.16 - Closed-loop representation

of (2.56)

The rate of convergence and the stability of the closed loop system are governed by the choice of the learning rate parameter. In the present case, as x is symmetric, the matrix ATA is a diagonal matrix. In the general case, this is not verified, and to examine (2.56), it is then convenient to decouple the k1+1 simultaneous linear difference equations by performing a linear transformation of w. Since A T A is symmetric, it can be decomposed [95] as: A T A = USU T
(2.57)

where U is an orthonormal square matrix (the modal matrix or matrix of the eigenvectors of A T A ) and S is a diagonal matrix where the diagonal elements are the eigenvalues ( λ ) of AT A . Replacing (2.57) in (2.56), its jth equation becomes: v j [ k + 1 ] = ( 1 – ηλ j )v j [ k ] + ηz j where v = UT w and
(2.59)

j = 1, …, k 1 + 1

(2.58)

Artificial Neural Networks

77

z = UT ATt

(2.60)

It is clear that (2.58) converges exponentially to its steady-state value if the following condition is met: 1 – ηλ j < 1 which means that the maximum admissible learning rate is governed by: 2 η < ---------λ max where λ max denotes the maximum eigenvalue of A T A . From (2.58) it can also be concluded that the convergence rate for the jth weight is related to 1 – ηλ j , faster convergence rates being obtained as the modulus becomes smaller. The desirable condition of having fast convergence rates for all the variables cannot be obtained when there is a large difference between the largest and the smallest of the eigenvalues of A T A . To see this, suppose that η is chosen close to half of the maximum possible learning rate. In this condition, the variable associated with the maximum eigenvalue achieves its optimum value in the second iteration. If m denotes the equation related with the smallest eigenvalue of A T A , this equation may be represented as: λ min zm v m [ k + 1 ] = 1 – ---------- v m [ k ] + ---------λ max λ max which has a very slow convergence if λ min « λ max It is then clear that the convergence rate ultimately depends on the condition number of A T A , defined as: λ max κ ( A T A ) = ---------λ min
(2.64) (2.63) (2.62) (2.61)

We can then state that the maximum learning rate is related with the largest eigenvalue, and the convergence rate is related with the smallest eigenvalue. Referring back to Ex. 2.6, A T A is a diagonal matrix with eigenvalues {21, 7.7}, for the cases where ∆x = 0.1 and {11, 4.4}, when ∆x = 0.2 . Table 2.10 illustrates the values of ( 1 – ηλ i ) , for the cases shown in fig. 2.12 to fig. 2.15 .

78

Artificial Neural Networks

Fig 2.12 w1 w2 -1.1 0.23

Fig 2.13 -0.89 0.31

Fig 2.14 -0.05 0.62

Fig 2.15 -0.1 0.56

Table 2.10 -

Values of ( 1 – ηλ i ) for fig. 2.12 to fig. 2.15

Notice that the difference equation (2.58) can be expressed in terms of Z-transforms as a sum of two terms: the forced response and the natural response with non-null initial conditions: zv j [ 0 ] zηz j V j ( z ) = ------------------------------ + --------------------------------------------------z – ( 1 – ηλ j ) ( z – 1 ) ( z – ( 1 – ηλ j ) )
(2.65)

The values of ( 1 – ηλ i ) represent the poles of the Z-transform. As we know, poles in the negative real axis correspond to oscillatory sequences, divergent if outside the unitary circle, and convergent if inside, and poles in the positive real axis correspond to decreasing exponentials. Poles near the origin correspond to fast decay, while near the unit circle correspond to slow decay. The evolution of the transformed weight vector can be expressed, in closed form: v j [ k + 1 ] = v [ 0 ] ( 1 – ηλ j )
k+1

z k+1 + ---j ( 1 – ( 1 – ηλ j ) ) λj

(2.66)

z ˆ Notice that ---j = v j . Replacing this value in (2.66), we finally have: λj
k+1 ˆ ˆ v j [ k + 1 ] = v j + ( v [ 0 ] – v j ) ( 1 – ηλ j )

(2.67)

Eq. (2.67) clearly shows that, during the training process, the weights evolve from their starting point to their optimal values in an exponential fashion, provided (2.62) is satisfied. As we know, the concept of time constant is related with the exponent of decaying exponentials. For a transient with the form Ae , the time constant is τ = 1 ⁄ a . If we expand the exponential in Taylor series, and equate this approximation with the value of the pole in (2.65):
–1 -τ – at

e

1 1 ≈ 1 – -- + --------- – … = 1 – ηλ j , τ 2!τ 2

(2.68)

Artificial Neural Networks

79

The smallest axis is the one related with the eigenvector corresponding to the largest eigenvalue. (2. the training algorithm can be envisaged as a joint adaptation of several unidimensional algorithms.54) in (2. gives the stability condition. in this coordinate system of the eigenvectors of ATA.51) can be expressed in another form: l t t – 2t Aw + w A Aw Ω = --------------------------------------------------------2 T T T T (2.we can therefore approximate the training time constant as: 1τ ≈ -----.= -----------------------------.73) Notice that the distance of any point in the performance surface to its minimum. ∆Ω = Ω – Ω min .70). The points with the same value (the contour plots introduced earlier) have 80 Artificial Neural Networks .70) Employing (2. In this coordinate system. The performance surface is a paraboloid.1. we have its minimum value: T + T T ˆ + t ( I – AA )t l t t – t Aw t – AA Ω min = -----------------------. ηλ (2. The largest eigenvalue. v. on the other hand.3. and whose lengths are given by 1 ⁄ λ i .71) With some simple calculations.+ … + ---------------------------------------2 2 ( 1 ⁄ λ1 ) ( 1 ⁄ ( λk1 + 1 ) ) l l l l 2 (2. has now a form typical of an ellipsoid. centred in the optimal weight values.3Performance surface and the normal equations matrix Eq.= -------------------2 2 2 (2.72) Expressing the last equation in terms of the transformed variables. the performance criterion can be given as: l l ˆ T T ˆ Ω = Ω min + ( w – w ) A A ( w – w ) (2. 2.3.69) This last equation clearly shows that the time constant of the error backpropagation algorithm is related with the smallest eigenvalue of the normal equation matrix.74) This is an equation of an ellipsoid.. with axes coincident with the eigenvectors of ATA. we have: l l ˆ ˆ T Ω = Ω min + ( v – v ) S ( v – v ) (2. in k1+1 dimensions: ˆ ˆ 2 ( v k1 + 1 – v k1 + 1 ) ( v1 – v1 ) ∆Ω = ----------------------.

can be eliminated by incorporating a line search algorithm. namely search procedures (Fibonnaci. compute a step length: α [ k ] = η . it is unreliable and can have a very slow rate of convergence. in the framework of unconstrained deterministic optimization. namely unreliability and slow convergence..6 - This algorithm has now the usual structure of a step-length unconstrained minimization procedure. since it implements a steepest descent method. golden section) or polynomial methods (quadratic. Other approaches could be taken. Several strategies could be used.. Ω ( w [ k ] + α [ k ]p [ k ] ) < Ω ( w [ k ] ) (2.. step 2 would implement what is usually called a line search algorithm. This will be done in the following section.. compute a search direction: p [ k ] = – g [ k ] . can be extrapolated for the case of interest. a step length α [ k ] such that (2. whose axes coincide with the eigenvectors of the normal equation matrix. such that it minimizes (2. the unreliability of the method. For the purpose of the interpretation of the error back-propagation algorithm in the framework of unconstrained optimization.1. but. 2. it is advisable to divide the update equation (2. beyond the scope of this course. Also it has been shown that it is difficult to select appropriate values of the learning parameter. in each iteration. for the case of a linear perceptron..76) Artificial Neural Networks 81 . however. will be introduced. . partial or inexact if it does not.38) Algorithm 2.75) Hence. update the weights of the MLP: w [ k + 1 ] = w [ k ] + α [ k ]p [ k ] Sub-division of (2.75) would be verified in every iteration. This approach is.a shape of an ellipsoid. In this section alternatives to the error back-propagation are given.76).38) into three different steps: . cubic) involving interpolation or extrapolation [99]. to compute.. the nonlinear multilayer perceptron.3.4 Alternatives to the error back-propagation algorithm In last section it has been shown that the error back-propagation is a computationally efficient algorithm. The second step could then be modified. The above considerations. which do not suffer from its main limitations. The line search is denoted exact if it finds α [ k ] . Ω ( w [ k ] + α [ k ]p [ k ] ) (2. The first limitation of the error back-propagation algorithm. and whose lengths are the inverse of the square root of the corresponding eigenvalues. genetic algorithms [96] [97] offers potential for the training of neural networks [98]. where alternatives to back-propagation. For instance.

p T [ k ]G [ k ]p [ k ] 2 (2. and are responsible for the very large numbers of iterations typically needed to converge. the Jacobean matrix at the solution point. must be computed in step 1 of Alg. simple calculations show that the asymptotic error constant is 0. the convergence ˆ rate can be approximated by (2. is very small.The inclusion of a line search procedure in the back-propagation algorithm guarantees global convergence to a stationary point [99]. where several iterations (depending on the η used and κ ( A T A ) ) are needed to find the minimum. with the matrix A replaced by J ( w ) .38) is: p [ k ] = –g [ k ] . One way to express how the two different updates appear is to consider that two different approximations are used for Ω . employing an exact line search algorithm to determine the step length.1. Some guidelines for this task can be taken from Section 2. the rate of convergence is linear. For the case of (2.80) (2. which means that the gain of accuracy.3.≈ ( κ ( A A ) – 1 ) ˆ Ω(w[k]) – Ω(w) ( κ( AT A ) + 1 )2 (2. the optimum value of the weights of a linear perceptron. discussed in the last section. in just one iteration. as reported in the literature. clearly the steepest-descent direction. for the linear perceptron case.54) minimize these approximations. For the case of nonlinear multilayer perceptrons. It can be shown [100] that. Equation (2. a first-order approximation of Ω is assumed: Ω ( w [ k ] + p [ k ] ) ≈ Ω ( w [ k ] ) + g T [ k ]p [ k ] (2. and that (2. To speed up the training phase of MLPs. as will be shown afterwards.38). This solves the first limitation of the BP algorithm.38).38) and (2. in each iteration.77).3.996. even with this modification.54) appears when a second order approximation is assumed for Ω : 1 Ω ( w [ k ] + p [ k ] ) ≈ Ω ( w [ k ] ) + g T [ k ]p [ k ] + -.77) Even with a mild (in terms of MLPs) condition number of 1000. other than the steepest descent.78) It can be shown [101] that the normalized (Euclidean norm) p[k] that minimizes (2. However. This contrasts with the use of the normal error back-propagation update (2.79) 82 Artificial Neural Networks .54) gives us a way to determine. 2.6. the error back-propagation algorithm is still far from attractive. and given by: 2 T ˆ Ω ( w [ k + 1 ] ) – Ω ( w ) -----------------------------------------------------------------------------------. Equation (2. it is clear that some other search directions p[k]. large condition numbers appear in the process of training of MLPs (which seems to be a perfect example of an ill-conditioned problem). In practice.

in nonlinear least-square problems.84) Gi[k] denoting the second matrix of the second-order derivatives of o(q)i at iteration k. m is 0. (2.82). the quadratic model may be a poor approximation of Ω and G[k] may not be positive definite. and G must be inverted) stimulated the development of alternatives to Newton’s method.82) directly. For the linear case.82) The Hessian matrix has a special structure when the function to minimize is. Formulating the quadratic function in terms of p[k]: 1 Φ ( p [ k ] ) = g T [ k ]p [ k ] + -. at points far from the solution. i = 1. or an approximation to it. so that (2.81) is given by (2. together with the high computational costs of the method (the second order derivatives are needed. These methods are general unconstrained optimization methods.82) becomes (2. and therefore do not make use of the special structure of nonlinear least square problems. Depending on whether the actual Hessian matrix is used. ˆ p [ k ] = – G –1 [ k ]g [ k ] (2. provided ˆ ˆ w[k] is sufficiently close to w and G is positive definite. Second order methods usually rely in eq. Gi[k].82) may not possess a minimum.82) (or similar) to determine the search direction. These will be introduced in section Section 2. the Quasi-Newton methods. then the Newton method converges at a second-order rate. nor even a stationary point. Thus. Two other Artificial Neural Networks 83 . second-order methods are categorized into different classes. the Hessian can be expressed [102] as: G [ k ] = J T [ k ]J [ k ] – Q [ k ] where Q[k] is: m (2.3.4.54).p T [ k ]G [ k ]p [ k ] 2 (2. For the nonlinear case. and on the way that this approximation is performed. A Newton method implements (2.1. as in the case in consideration. This serious limitation. which means that (2. a sum of the square of the errors. As in the linear case.83) Q[k ] = i=1 e i [ k ]Gi [ k ] (2.1. This matrix is usually called the Hessian matrix of Ω . a better rate of convergence is usually obtained using second order information. However. the Hessian matrix is composed of first and second order derivatives.81) if G[k] is positive definite (all its eigenvalues are greater than 0) then the unique minimum of (2.The matrix G[k] denotes the matrix of the second order derivatives of Ω at the kth iteration. …. It can be proved [102] that.

2. i.1.87) Traditionally these methods update the inverse of the Hessian to avoid the costly inverse operation. In practice.4. where the performance of these two algorithms..– ------------------------------------s T [ k ]q [ k ] s T [ k ]q [ k ] s [ k ]q T [ k ]H [ k ] + H [ k ]q [ k ]s T [ k ] ------------------------------------------------------------------------------T [ k ]q [ k ] s (2. which will be presented in sections Section 2.86) (2. Nowadays this approach is questionable and.1.3.methods that exploit this structure are the Gauss-Newton and the Levenberg-Marquardt methods. The first important method was introduced by Davidon [103] in 1959 and later presented by Fletcher and Powell [104].85) It is now usually recognised [99] that the formula of Broyden [105]. so is G[k+1].3. 2.19 and in fig.4. as well as the Gauss-Newton and the Levenberg-Marquardt methods is compared on common examples.3 respectively.e. for instance. 2. if G[k] is positive definite. Goldfarb [107] and Shanno [108] is the most effective for a general unconstrained method. The update formulae used possess the property of hereditary positive definiteness.1.4.1.3. A large number of Hessian updating techniques have been proposed. Gill and Murray [101] propose a Cholesky factorization of G and the iterative update of the factors.3.4. the application of the BFGS version of a Quasi-Newton (QN) method always delivers better results than the BP algorithm in the training of MLPs. being then known as the DFP method.23 . Two examples of this are given in fig. Its update equation [103] is: s [ k ]s T [ k ] H [ k ]q [ k ]q T [ k ]H [ k ] H DFP [ k + 1 ] = H [ k ] + ----------------------.2Gauss-Newton method While the quasi-Newton methods are usually considered ‘state of the art’ in general unconstrained minimization. in order to make an approximation of G (or of H=G-1) using an appropriate updating technique. Fletcher [106].– ------------------------------------------------s T [ k ]q [ k ] q T [ k ]H [ k ]q [ k ] where: s[ k] = w[ k + 1 ] – w[ k] q[k] = g[k + 1 ] – g[k] (2. the nonlinear least-squares structure of the problem. 84 Artificial Neural Networks .1Quasi-Newton method This class of methods employs the observed behaviour of Ω and g to build up curvature information. The BFGS update [102] is given by: T s [ k ]s T [ k ] H BFGS [ k + 1 ] = H [ k ] + 1 + q [ k ]H [ k ]q [ k ] ----------------------. Further comparisons with BP can be found in [84]. 2. they do not exploit the special nature of the problem at hand.2 and Section 2.

A point is then reached where the search direction is almost orthogonal to the gradient direction. The Gauss-Newton method is therefore not recommended for the training of MLPs. is very small compared with J T J . is the ratio between the largest and the smallest singular values of J. The validity of this approximation is more questionable in the training of MLPs. at the optimum. thus preventing any further progress by the line search routine.84).89).89) is solved by first computing the product J T J . in terms of the Jacobean matrix.90) where κ ( J ) . This problem is exacerbated [102] if (2. given by (2.40). which is unique if J is full column rank.88) The reasoning behind this approximation lies in the belief that. as J is a rectangular matrix. or even whether the chosen topology is a suitable model for the underlying function in the range employed. Even using a QR factorization and taking some precautions to start the training with a small condition number of J. Artificial Neural Networks 85 . and then inverting the resultant matrix. since κ( JTJ ) = ( κ( J ) )2 (2. This assumption may not be (and usually is not) true at points far from the local minimum. one part of the Hessian matrix is exactly known. almost in every case. by a product of the Jacobean matrix and the error vector. in most of the situations where this method is applied. the Levenberg-Marquardt method. QR or SVD factorizations of J should be used to compute (2. the error will ˆ ˆ ˆ be small. as in (2. this system of equations always has a solution. So. so that Q . in this type of problem the gradient vector can be expressed. The search direction obtained with the Gauss-Newton method is then the solution of: J T [ k ]J [ k ]p GN [ k ] = – J T [ k ]e [ k ] (2.83) is approximated as: G [ k ] ≈ J T [ k ]J [ k ] (2. an infinite number of solutions exist if J is rank-deficient. The basis of the Gauss-Newton (GN) method lies in dropping the use of second order derivatives in the Hessian. since. if first order information is available. The problem of applying the Gauss-Newton method to the training of MLPs lies in the typical high degree of ill-conditioning of J. a better alternative exists.As noted previously. it is not known whether the particular MLP topology being used is the best for the function underlying the training data. so that (2.89) As already mentioned. and the Hessian as a combination of a product of first derivatives (Jacobean) and second order derivatives. but fortunately. the conditioning becomes worse as learning proceeds.

p[k] tends to a vector of zeros. In contrast with the Gauss-Newton and quasi-Newton methods.4. usually denoted the regularization factor. p[k] is identical to the Gauss-Newton direction. is the Levenberg-Marquardt (LM) method.95) (2. one way of envisaging the approximation of the Hessian employed in GN and LM methods is that these methods.92). even when J is rank-deficient. which belong to a step-length class of methods (Alg. consider a linear model for generating the data: o ( nl ) [ k ] = J [ k ]w [ k ] Using (2. and a steepest descent direction. then the ratio r[k] (2.94) (2. 2. Basically this type of method attempts to define a neighbourhood where the quadratic function model agrees with the actual function in some sense. As υ [ k ] tends to infinity.91): ( J T [ k ]J [ k ] + υ [ k ]I )p LM [ k ] = – J T [ k ]e [ k ] (2. the LM method is of the ‘trust-region’ or ‘restricted step’ type. then the test point is accepted and becomes a new point in the optimization.91) where the scalar υ [ k ] controls both the magnitude and the direction of p[k]. If there is good agreement. after taking a step p[k] is: e p [ k ] = e [ k ] – J [ k ]p [ k ] . When υ [ k ] is zero.2.6). otherwise it may be rejected and the neighbourhood is constricted.93) (2.1. Many different algorithms of this type have subsequently been suggested. This method uses a search direction which is the solution of the system (2. and overcomes the problems arising when Q[k] is significant. so that the predicted reduction of Ω is: (ep[k ])T(ep[ k]) ∆Ω p [ k ] = Ω ( w [ k ] ) – -------------------------------------2 As the actual reduction is given by: ∆Ω [ k ] = Ω ( w [ k ] ) – Ω ( w [ k ] + p [ k ] ) . the predicted error vector. Levenberg [109] and Marquardt [110] were the first to suggest this type of method in the context of nonlinear least-squares optimization. at every kth iteration.3. To introduce the algorithm actually employed [102]. The radius of this neighbourhood is controlled by the parameter υ .3Levenberg-Marquardt method A method which has global convergence property [102].92) 86 Artificial Neural Networks .

7 . It is usually agreed that the Levenberg-Marquardt method is the best method for nonlinear least-squares problems [102].25 υ [ k + 1 ] := 4υ [ k ] elseif r[k]>0. evaluate Ω ( w [ k ] + p [ k ] ) and hence r[k] if r[k]<0.96) measures the accuracy to which the quadratic function approximates the actual function.2) and e[k] (using Alg. 2.. This type of nonlinearity relates the pH (a measure of the activity of the hydrogen ions in a solution) with Artificial Neural Networks 87 . 2.91) (if ( J T [ k ]J [ k ] + υ [ k ]I ) is not positive definite then ( υ [ k ] := 4υ [ k ] ) and repeat) . the better the agreement. two examples will now be presented. for iteration k: .5 Examples Example 2.Inverse of a titration curve The aim of this example is to approximate the inverse of a tritation-like curve. To illustrate the performance of the different methods. 0.. [102].. 2..∆Ω [ k ]r [ k ] = -----------------∆Ω p [ k ] (2.75. obtain J[k] (Alg. etc.75 υ[k ] υ [ k + 1 ] := ---------2 else υ [ k + 1 ] := υ [ k ] end if r [ k ] ≤ 0 w [ k + 1 ] := w [ k ] else w [ k + 1 ] := w [ k ] + p [ k ] end Algorithm 2.1.3) .25. in the sense that the closer that r[k] is to unity.. in a large number of training examples performed. This was confirmed in this study. compute p[k] using (2.. Then the LM algorithm can be stated.7 - Levenberg-Marquardt algorithm The algorithm is usually initiated with υ [ 1 ] = 1 and is not very sensitive to the change of the parameters 0.3.

2 0. 88 Artificial Neural Networks .9 0.6 0.4 0.Titration-like b) inverse nonlinearity curves One strategy used in some pH control methods.7 0.8 2 (2.17 b) illustrates the inverse of the nonlinearity. to overcome the problems caused by the titration nonlinearity.5 x 0.8 0. Four different algorithms were used to train the MLP.+ 6 y ---4 2 –3 –3 pH' = – -----------------------------------------------------------.3 0.2 0. were used as x and the corresponding pH´ values computed using.97) 0 0.7 0.1 0.mat (x).6 0.4 0.1 0 0.+ 10 –14 – -. rather than the pH. 2 hidden layers with 4 nonlinear neurons in each.1 0.4 0. and with an even spacing of 0. y=2*10 x – 10 26 log 0.9 1 a) forward nonlinearity FIGURE 2.17 a). These data is stored in the files input. the error norm for the first 100 iterations of three of those algorithms is shown in the next figures.97).6 0.3 0. The initial value of the weights is stored in the file wini. To compare their performance. In this example a MLP with 1 input neuron. 2.8 0.. and in target.17 .7 0.1 0 0.5 ph´ 0.mat.8 0. and one linear neuron as output layer was trained to approximate the inverse of (2.9 1 0 0. An example of a normalized titration curve is shown in fig. fig. by linearizing the control [86].the concentration (x) of chemical substances.5 ph´ 0. The equation used to generate the data for this graph was: y .01.2 0. 2.6 0.mat (the pH’).3 x 1 0.3 0. because the presence of this static nonlinearity causes large variations in the process dynamics [86]. is to use the concentration as output.7 0.4 0. Then the x values were used as target output data and the pH´ values as input training data.2 0.5 0. pH control is a difficult control problem. 101 values ranging from 0 to 1.

m of the Optimization Toolbox).BP algorithms 14 12 10 error norm 8 neta=0.001 . Next.01.m (it employs the Matlab functions LM. The Matlab file which generates the results is LM_mat. 1.7.m (it uses the function funls.005. the Levenberg-Marquardt methods achieve a better rate of convergence than the quasi-Newton method.19 c). Iterations 1 to 20 are shown in fig. 2. As it is evident from these figures.01. a). while iterations 21 to 50 are represented in fig.005 neta=0. A Levenberg-Marquardt algorithm. The Matlab file employed is BP. employing a linesearch algorithm (using the lsqnonlin.18 - Performance of the BP algorithm η = ( 0. as it incorporates a line search algorithm.m (it uses Threelay. 2. The Matlab file which generates the results is Leve_Mar. WrapSpar. A quasi-Newton algorithm. with η = 0. A similar rate of convergence can be observed. 2.0. A different implementation of the Levenberg-Marquardt algorithm.0. fig. which shows the last 50 iterations of both methods.m is more computational intensive. with the remark that the Matlab function lsqnonlin.01 6 neta=0.m (it uses the function fun_lm.05 or greater causes the algorithm to diverge. described in Alg.0. implemented with a BFGS method and a mixed quadratic/ cubic line search (using the function fminunc.0. A learning rate of 0. b). three second-order algorithms are compared. The Matlab file which generates the results is quasi_n.m). The two LM methods are compared in fig.m.19 a) to c) illustrate the performance of these algorithms. 3.m). Artificial Neural Networks 89 .m).001 4 2 0 0 10 20 30 40 50 60 Iterations 70 80 90 100 FIGURE 2. 2.m of the Optimization Toolbox).005.001 ) The first figure denotes the performance of the error back-propagation.m and UnWrapSp.m).

8 0.20 .2 2 0. is related to the Cartesian coordinates of the end-effector (x.2 1 0.Inverse coordinate transformation This example illustrates an inverse kinematic transformation between Cartesian coordinates and one of the angles of a two-links manipulator.7 Quasi-Newton and Levenberg-Marquardt algorithms 12 0. the function changed to the LM algorithm in a few number of iterations.8 .y) by expression (2.4 Levenberg-Marquardt 0. the angle of the second joint ( θ 2 ) . 2. but due to conditioning problems.19 .6 Levenberg-Marquardt 1.Quasi-Newton and Levenberg-Marquardt algorithms 14 0.3 Levenberg-Marquardt (Matlab) 0.6 0.Quasi-Newton and Levenberg-Marquardt methods The Gauss-Newton version of the lsnonlin function was also tried.1 0 0 2 4 6 8 10 12 Iterations 14 16 18 20 0 20 30 40 50 60 Iterations 70 80 90 100 b) first 20 iterations 1.4 Levenberg-Marquardt (Matlab) error norm 1.8 x 10 -3 c) next 30 iterations Quasi-Newton and Levenberg-Marquardt algorithms 1.98): 90 Artificial Neural Networks . Example 2.4 50 55 60 65 70 75 80 Iterations 85 90 95 100 a) last 50 iterations FIGURE 2.6 10 0.5 Quasi-Newton error norm Levenberg-Marquardt 6 Levenberg-Marquardt (Matlab) 4 error norm 8 Quasi-Newton 0. Referring to fig.

1m until 1. y ) → +θ 2 . l2 l1 θ2 ? θ1 ? FIGURE 2. The length of each segment of the arm is assumed to be 0. To obtain the training data.8 0.= cos ( θ 2 ) 2l 1 l 2 s = 1 – c = sen ( θ 2 ) 2 where l1 and l2 represent the lengths of the arm segments.20 .Input training data Artificial Neural Networks 91 .9 0.6 0.Two-links robotic manipulator The aim of this example is to approximate the mapping: ( x. with an even separation of 9º).6 0.5 0.98) 2 2 x + y + l 1 + l2 .21 . with an even separation of 0.1 0 0 0.2 0.0m. 1 0.1m) with 11 radial lines (angles ranging from 0 to 90º. 110 pairs of x and y variables were generated as the intersection of 10 arcs of circle centred in the origin (with radius ranging from 0.8 1 y FIGURE 2.2 0.3 0.θ 2 = ± tg 2 2 –1 s c (2.7 0.4 x 0.5m. c = ------------------------------------.4 0. in the first quadrant.

5 0 1 0.For each of the 110 examples. quasi_n. quasi-Newton and Levenberg-Marquardt 92 Artificial Neural Networks . The corresponding functions are: BP.5 0. These data can be found in the file InpData.m) and Leve_Mar.m (uses fun.Performance of BP. the corresponding positive solution of (2.5 1 0.8 0.6 1 FIGURE 2.98) was determined and used as the target output.m (uses LM. the quasi-Newton and the Levenberg-Marquardt method. 2.m). one hidden layer with 5 nonlinear neurons and one linear neuron as output layer was employed to approximate the mapping illustrated in fig.23 illustrates the performance of the BP algorithm with η = 0.ls. 60 50 40 error norm BP Levenberg-Marquardt quasi-Newton 30 20 10 0 0 10 20 30 40 50 60 Iterations 70 80 90 100 FIGURE 2. 3 2. The initial values are stored in the file W.ini.5 2 teta2 1.23 .4 x 0.01 (higher values of the learning rate cause the training process to diverge). y ) → + θ 2 fig. An MLP with 2 input neurons.22 . respectively.mat.2 y 0 0 0. 2.22 .m).Mapping ( x.m (uses TwoLayer.

3. Consequently. 1 .. which is linear.24 . which exploits the least-squares nature of the criterion emerges as the best method. The LM method.. . ..6 A new learning criterion In the preceding section alternatives to the error back-propagation algorithm have been presented. . and indeed. in most of the applications of MLPs to control systems. 2. but the LM algorithm is undoubtedly the best of all.Most important topology of MLPs for control systems applications The only difference between this topology and the standard one lies in the activation function of the output neuron. .. seems to be the one depicted in fig.24 .... . 1 .... It was shown that quasi-Newton algorithms have a more robust performance and achieve a faster rate of convergence than the error back-propagation algorithm.. Output 1 ˆ w A { 1 . the most important topology of this type of artificial neural network. The convergence of the learning algorithms can be further improved if the topology of the MLPs is further exploited.. let us partition the weight vector as: Artificial Neural Networks 93 . . This simple fact.1. as will be shown in the following treatment.... ...As in the previous examples.. 0 or more hidden layers { Inputs FIGURE 2.. the quasi-Newton method achieves a much better rate of convergence than the BP algorithm. .. . For the purposes of explanation. In all the examples shown so far. ... can be exploited to decrease the number of iterations needed for convergence [87]. 2. the networks act as function approximators.. as far as control systems applications are concerned.

r. and although different from the standard criterion (2. it should be noticed that O ( q – 1 ) 1 is assumed to be full-column rank. i.102).. This new criterion (2. Besides reducing the dimensionality of the problem. and then. their minima are the same.e. creating therefore a new criterion: 2 2 P A⊥ t 2 t – AA + t 2 ψ = --------------------------.w( q – 1) w = w(q – 2) … w(1) where u denote the weights that connect to the output neuron. however.100) (2. for simplicity.101) w.103) where the dependence of A on v has been omitted. the last equation can then be replaced into (2.103) P A⊥ is the orthogonal projection matrix to the complementary space spanned by the columns of A. In (2.55). Denoting this matrix by A. Using this convention. the minimum of (2. and v denotes all the other weights. 2 2 (2.t.103).103) depends only on the nonlinear weights. + ˆ u( v) = ( O(q – 1)( v ) 1 ) t (2. the resulting criterion is: t – O(q – 1) 1 u 2 Ω = --------------------------------------------2 2 = u v (2. using v in eq. (2. obtain the complete optimal weight vector w . For any value of v.101). the output vector can be given as: o( q) = O(q – 1) 1 u When (2. instead of determining the optimum of (2. u can be found using (2. Ω is linear in the variables u .= ----------------. The optimum value of the linear parameters is therefore conditional upon the value taken by the nonlinear variables.102). a property which seems to be inde- 94 Artificial Neural Networks . the weights in criterion (2.101) The dependence of this criterion on v is nonlinear and appears only through O ( q – 1 ) . The proof of this statement can be found in [89].99) (2. a pseudo-inverse is used for the sake of simplicity. the main advantage of using this new criterion is a faster convergence of the training algorithm.101) can be separated into two classes: nonlinear ( v ) and linear ( u ) weights.101).102) In (2. We can therefore. on the other hand.100) is substituted in (2.37).101). first minimize ˆ ˆ (2.

• the initial value of the criterion (2.: Jψ = ( A )v A + t eψ = P A⊥ t The simplest way to compute g ψ is expressed in Alg.103) must be obtained. For this purpose the derivative of A w. One problem that may occur using this formulation is a poor conditioning of the A matrix during the training process. the gradient of ψ must be obtained.107) Artificial Neural Networks 95 . This is a three-dimensional quantity and will be denoted as: ( A ) v = ∂A ∂v (2. It can be proved (see [89]) that the gradient vector of ψ can be given as: ( o( q – 1) )T = – eΨ 1 T JΨ 0 = gΩ gψ where g Ω ˆ u=u (2.104) Considering first the use of the error back-propagation algorithm or the quasi-Newton with this new criterion.e. which results in large values for the linear parameters. in comparison with the situation where random numbers are used as starting values for these weights. the derivatives of (2.8: (2.103) is usually much smaller than the initial value of (2. conditioned by the value taken by the nonlinear parameters.105) ˆ u=u . including the first. This is because at each iteration. As the criterion is very sensitive to the value of the linear parameters. This reduction in the number of iterations needed to find a local minimum may be explained by two factors: • a better convergence rate is normally observed with this criterion.t. a large reduction is obtained. 2. In order to perform the training.pendent of the actual training method employed.37). for the same initial value of the nonlinear parameters. i. This problem will be treated afterwards. compared with the standard one. v must be introduced.106) (2. the value of the linear parameters is the optimal. the partition of the Jacobean matrix associated with the weights v and the error vector of Ω obtained when the values of linear weights are their conditional optimal values. J ψ and e ψ denote respectively the gradient.r.

To reduce the computational complexity of Golub and Pereyra’s approach. 2... 2. If the Gauss-Newton or Levenberg-Marquardt methods are employed with this new formulation.3 . Several algorithms can be used to compute these optimal values... there is also some gain at each iteration that must be taken into account...103) must be obtained...9 - Computation of J ψ ˆ As in the case of Alg.103) is used as the training criterion. compute O ( q – 1 ) -. then the Jacobean of (2. 2. apart from the computation of u . Ruano et al. a large reduction in the number of iterations needed for convergence is obtained if (2. Kaufman [90] proposed a simpler Jacobean matrix. replace u in the linear parameters.3 ˆ .106).4. Although there is an additional cost in complexity per iteration to pay using this approach. since the dimensionality of the problem has been reduced. and complete Alg. because of possible ill-conditioning of A. the cheapest way to obtain J ψ is using Alg.using Alg. [91] proposed to employ the Jacobean matrix J ψ (2. It is advisable to introduce a procedure to obtain matrix J ψ (2. Algorithm 2.. However.106)..8. replace u in the linear parameters.8 - Computation of g ψ ˆ The computation of g ψ has the additional burden of determining u . obtain u ˆ .8. 2. 2. the computation of J ψ does not involve more costs than the computation of J. compute Jψ -. Three different Jacobean matrices have been proposed: the first was introduced by Golub and Pereyra [88].. Following the same line of reasoning that led to Alg. but QR or SVD factorization are advisable. 2.using Alg.. who were also the first to introduce the reformulated criterion.using Alg. 96 Artificial Neural Networks . compute gψ -..3 . 2.3 ˆ .9: .. which further reduces the computational complexity of each training iteration.2 Algorithm 2. obtain u ˆ .using Alg. 2... 2. compute O ( q – 1 ) -.. and complete Alg.

The initial weights are stored in w_ini_optm.53). 0. Ex. 2. To perform a fair comparison. Remember that in the case of Ex.m (it also employs TwoLay2. but with the value of the linear parameters initialized with their optimal values.5. 2.35 0. fig.1 and 0.Ph example revisited To illustrate the application of the new criterion.Evolution of the error norm using the BP algorithm minimizing the new criterion for different learning rates If now the standard training criterion is employed.m). the initial values of the nonlinear parameters were the ones used in Ex. with respect to the initial values of the nonlinear parameters. It should be noted that the initial norm of the error vector (0.7 the maximum rate allowed was 0.25 .55 0.05. The learning rates employed are 0. The initial weights are stored in file wini.2) obtained after 100 iterations of the standard criterion.103) is used as the learning criterion.7.5 Erro r vecto r n o rm 0. 2. The matlab file that implements this method is BPopt.25 illustrates the performance of the error back-propagation when criterion (2.m and TL2.m.45 0. Artificial Neural Networks 97 . The standard BP technique is stored in file BP. is less than the value (1.mat.9 .3 0 20 40 Iterations 60 80 100 FIGURE 2.Example 2.01.7 will be used.4 0. 0.mat. in the reformulated case. we obtain the following performance of the training algorithm. 2.

with the value of the linear parameters initialized with their optimal values. illustrates the performance of the quasi-Newton method minimizing the new criterion (dotted line) and minimizing the standard criterion. Notice also that the new starting point has a error norm equivalently with the one obtained after more than 20 iterations of the standard quasi-Newton method (see fig.0001 are employed. the next fig.0005 and 0. In fig. 2. with respect to the initial values of the nonlinear parameters. The Matlab functions that implements the quasi- 98 Artificial Neural Networks . 2.180 160 140 120 100 80 60 40 20 0 0 Erro r vecto r n o rm 20 40 Iterations 60 80 100 FIGURE 2. Larger values makes training diverge.001.26 .19 ).Evolution of the error norm using the BP algorithm minimizing the standard criterion for different learning rates It should be noted that only with very small values of the learning rate the training converges. Concentrating now on quasi-Newton methods.26 . 0. learning rates of 0.

In what concerns Levenberg-Marquardt methods.7 of quasi-Newton methods 0.28 illustrates their performance. the initial norm of the error vector is reduced significantly. As it is easy to see.mat 0.2 0.5 Erro r Vecto r N o rm 0. it achieves a better rate of convergence.103) has two main advantages: firstly. minimizing criterion (2. Artificial Neural Networks 99 . The initial values of the weights are found in wini_red.Newton method with the new criterion is quasi-n_opt.m.5 Erro r Ve cto r No rm 0. secondly. The legend is the same as the last figure. the new criterion is actually less complex to implement. The Matlab function that implements the Levenberg-Marquardt method with the new criterion is Leve_Mar_opt.3 0.6 0. 2.Performance 0.3 0.1 0 0 20 40 Iterations 60 80 100 FIGURE 2.Performance of Levenberg-Marquardt methods With respect to Levenberg-Marquardt methods.28 .6 0.4 0.1 0 0 20 40 Iterations 60 80 100 FIGURE 2.27 .4 0.7 0.m (it also employs fun_optm. fig.m).2 0.

29 .005. The initial weights are stored in w.37.45 1.6 times smaller than the error norm with 100 iteration of the standard algorithm.3 1. Considering now the use of the standard criterion.Example 2.25 1. dotted line) 100 Artificial Neural Networks . We shall consider first the BP algorithm.8. 2. 2. 1.4 1. 0.Performance of the BP algorithm (new criterion) The learning rates experimented were η = 0. dash-dotted lines).2 1. we have the following performance.Inverse coordinates transformation revisited In this case we shall employ data from Ex.m and OL2.m (uses also onelay2.23 . more than 20 times smaller than the initial norm in fig.0001. but with the initial linear optimal weights.00005 (solid line.m) was employed. It should be noted that the initial norm of the error vector is 1.001. 0.ini.15 1.0005 (solid.10 . dotted. and 2.35 Erro r Ve cto r No rm 1. The next figures illustrates the performance of the BP algorithm with the new criterion. The Matlab function Bpoptm. 0.1 0 20 40 Iterations 60 80 100 FIGURE 2. for η = 0.

m).6 0.3695 0 20 40 Iterations 60 80 100 FIGURE 2. and the initial weight vectors are w.ini..4 0 20 40 Iterations 60 80 100 FIGURE 2.ini.3735 1. The Matlab functions employed are quasi_n_optm.3725 Erro r Vecto r N o rm 1.31 .Performance of the BP algorithm (standard criterion) The Matlab files that implement this training algorithm are BP.m) and quasi_n.m). 1.37 1.Performance of quasi-Newton methods Artificial Neural Networks 101 . the following figure illustrates the performance of the standard criterion (dotted line) against the standard criterion (solid line).2 1. for the new criterion and the standard criterion.7 0.m (employs fun.30 .3 1.9 0. Addressing now quasi-Newton methods. respectively.m (uses TwoLayer. and the initial values of the weights are stored in wtotal.ini and wtotal.8 0.371 1.372 1.373 1.m (employs fun_optm. 1.3705 1.1 Erro r Vecto r N o rm 1 0.5 0.3715 1.4 1.

7 Practicalities We have introduced in the previous sections different training algorithms for multilayer perceptrons. or zero values.3.Finally. the same conclusions can be taken.32 .3.Performance of Levenberg-Marquardt methods Again.6 1.1. this means we should supply as initial values model parameters close to the optimal ones. in the context of nonlinear optimization.1Initial weight values For any optimization problem.4 0. Levenberg-Marquardt methods are evaluated. and therefore there are no guidelines for determining good initial parameter values.2 0 20 40 Iterations 60 80 100 FIGURE 2. For this reason. This assumes that the model you are fitting has a structure which is close to the process originating the data. In terms of nonlinear least-squares problems. The problems we shall discuss now are: where should we start the optimization.7.1. a good (close to the optimum) initial value of the design variables is important for the decrease of the number of iterations towards convergence. as the initial parameter values. 2. 1.2 Error Ve cto r No rm 1 0. We can take another route and employ initial parameter values that do not exacerbate the condition number of the Jacobean matrix. we have seen that the rate of convergence of a linear model was related with the condition number of the Jacobean matrix (see Table 2. 102 Artificial Neural Networks . 2. and when should we stop it.8 0. Unfortunately in every case MLPs do not have a structure close to the data generating function (unless the data generating function in itself is a MLP). faster convergence being obtained for smaller condition numbers.1. The legends are as the ones of last figure. it is a common practice to employ random values.4 1.3.10). In Section 2.3.6 0.

m) implements one possible solution to this. the following equation is used: Artificial Neural Networks 103 .t. To summarize.2. . fne_ar_initial_par. This affects all the weights related with the neurons below the one we are considering the derivative. The derivative of the neuron net input w.t. So.t. but for the network input data. if the neuron net input is outside its active region (around [-4. . The Matlab function MLP_init_par. . for each weight. its inputs: this turns out to be the transpose of the partial weight matrix. This is translated into a higher condition number. Make_Pol_E. Then.m. The derivative of the neuron net input w. the initial weights should be chosen such that all the hidden layers lie within their active region.r. 1]. the weights. Related with the first term. (z – 1) should be chosen such that the columns of Net should be linearly independA sufficient condition for them to be linear dependent is that ( z – 1) ( z) W i. this means that the target data should lie between. -1 and 1. ParamPolExp. 4] see fig. (z – 1) Make sure that there is no linear dependencies in the weight matrices W ( i) . 1.m. let’s say.3. it should be scaled to lie within that range. for all the training patterns.Remembering now Section 2. = kW j. Focusing on the last layer. its net input. this is the set of weights connecting the lower layer (including the bias) with one neuron of the upper layer is multiple of any other set. the computation of the Jacobean matrix involves three terms: The derivative of the neuron output w.m.m and Make_Powe. YPolExp.m. assuming that it is linear. Looking now at the second term. the layer inputs should also lie between a range of -1 and 1. 2. 3. the corresponding derivative is very small.m (it also employs scale.9 ). we should: Scale the input and target data to lie within a range [-1 1]. This is no problem for the layers above the first one. 1. Determine a initial set of weights such that all the nonlinear neurons are in an active region. 2. as the layer inputs are limited between 0 and 1. it is convenient to scale the target data to this range. two observations can be made: a) W ent. b) The value of the weights should not be too big or too small. i ≠ j .r. Therefore. First of all.r. the input and target patterns are scaled to lie in a range [-1. Therefore. Looking now to the last term. which is related with the neuron inputs. 1. 3.1. which makes them more linear dependent on the column associated with the output bias (see Exercise 22). all the columns associated with those neurons will have a small norm.

although in the neuron active region. It should not be forgotten that. to derive the initial parameters. j (z – 1) xj = k j ⋅ 1 + ----.111) (2. and x j ∼ N ( 0. 2.is the constant needed to be multiplied by the jth input so that Max ( I j ) – min ( I j ) its range lies in a range of 2π .11 .110) (z – 1) (2.8.1)).15 1011 It is evident that a better initial point was obtained using the above considerations. It should be mentioned that the above equations were derived empirically. the following mean results were obtained (see Exercise 23): Table 2. This way. All the initial values for the examples in Section 2.1. 1 ) is a random variable. Employing Ex. kz (2.6 were obtained using MLP_init_par. 1 ) and: Li = π – -. comparing 100 different executions of MLP_init_par.m.5 105 1.109) π θ i = – -. After training has been accomplished.112) The reasoning behind the above equations for the bias is that each net input will be centred around a different value.. m Random 9.m with 100 different random initializations (N(0. the input and target patterns were scaled. the weight matrix related with the first hid- 104 Artificial Neural Networks .1.3.5 and Section 2.9 Jacobean condition number MLP_init_par. The weights related with the same input are nevertheless different due to the random variable x. k z – 1 + 1 = L i + θ i + x i ∆θ i . 10 (2.. each input produces a contribution for the net inputs of the above layer with a range approximately of 2π .3.8 11.– k j min ( I j ) 2 (2. where x j ∼ N ( 0.108) 2π where k j = ------------------------------------------.W i. Each bias term (corresponding to each one of the k z neurons in the upper layer is obtained as: W i.Comparison of initialization methods Initial error norm 1.+ ( i – 1 )∆θi 2 π ∆θ i = ---.

7 and Ex.10). The following tables show. for each case.0024 0. which is a measure of the desired number of the corrected figures in the objective function.011 0. two different initializations for the linear parameters were compared: one with the optimal leastsquares values (optimal) and the other one with random initial values (random). τ f . In the case of the standard criterion. The two case studies are revisited using the Levenberg-Marquardt method.1.den layer and the weight vector related with the output layer must be linearly transformed in order to operate with the original data. A possible set of termination criteria for smooth problems can be formulated as [101]: Assume that we specify a parameter. employing the standard criterion (Ex.9 and Ex. The Matlab function Train_MLPs. and therefore the termination criteria used for unconstrained optimization can be applied to decide if the training has converged or not.113) Then a sensible set of termination criteria indicates a successful termination if the following three conditions are met: Ω[ k – 1 ] – Ω[k ] < θ[k ] w [ k – 1 ] – w[ k ] < τf ⋅ ( 1 + w [ k ] ) g [ k ] ≤ 3 τf ⋅ ( 1 + Ω [ k ] ) (2.0024 Condition number 150 695 380 Artificial Neural Networks 105 . The same topologies are employed for the two cases.7.115) are employed to test if the training is converging.m. * τ = 10 –6 Criterion Standard (random) Standard (optimal) Reformulated Number of iterations 100 150 107 Error Norm 0.Ph Example.2When do we stop training? Off-line training of MLPs can be envisaged as a nonlinear least-squares optimization problem. 2.116) The conditions (2. Please see also Exercise 22.3. We convert this parameter into a measure of absolute accuracy.m was employed. 2. This can be accomplished by the Matlab function Restore_Scale. θ [ k ] : θ [ k ] = τf ⋅ ( 1 + Ω [ k ] ) (2.115) (2. 2.116) is based on the optimality condition g ( w ) = 0 .12 .8) and the reformulated criterion (Ex. 2.114) (2. 2. while condition (2. the mean values obtained with 20 different initializations The column condition number denotes the condition number of the input matrix of the output layer.114) and (2. Table 2.

while the fit in the training data usually decreases continuously. fig. and increases afterwards.13 . two other schemes should be employed: the early-stopping method (also called implicit regularization or stopped search) and the regularization method (also called explicit regularization. a minimum is found at iteration 28. never seen by the network. This. as the network was ’generalizing’ what it is learning to fresh data. is a bad method. if the performance of the model is evaluated in a different data set.33 illustrates the evolution of the error norm for the validation set. as we can end whether with a bad conditioned model. unfortunately. when training a network.m.52 Condition number 62 92 114 As it was already referred.32 115 0. An obvious scheme is therefore to train the network until it reaches the minimum in terms of fit of the validation set. employing the file gen_set.Coordinate transformation Example. The technique that is. The first technique is due to the observation that. called the validation set (also called test set). unfortunately. If we observe the evolu- 106 Artificial Neural Networks .32 75 0. standard termination criteria are not usually employed. As it can be observed. For this reason. it decreases at first. Instead. τ = 10 –4 Criterion Standard (random) Standard (optimal) Reformulated Number of iterations Error Norm 70 0. or with a model that could deliver better results.Table 2. MLPs do not have a similar structure to the data generating function corresponding to the training data. most used. An example of this technique is shown below. is to stop training once a specified accuracy value is obtained. 2. The pH problem was employed. This phenomenon is called overtraining. and the training data previously employed was divided between training (70%) and validation (30%). At this point it is also said that the network deteriorates its property of generalization.

Evolution of the error norm for the validation set Training Data 1. Va lid a tio n D a ta 0 . In terms of conditioning.4 0 .6 0 .12 .Evolution of the error norm for the training set The final error norm obtained was 0. we can see that this method produces more conservative results than standard termination criteria. If we compare this result (in terms of MSE)-8.2 Erro r N o rm 1 0.2 0 0 5 10 15 20 Iterations 25 30 35 FIGURE 2. we can see that the error norm is still decreasing.1 0 0 5 10 15 20 Ite ra tio n s 25 30 35 FIGURE 2.5. the con- Artificial Neural Networks 107 .34 . however.3 0 .7 0 .6 0. and will go on decreasing afterwards.33 .2 10-6 with the result shown in Table 2.6 1.8 1.8 0.8 0 .7 10-8.4 1.tion of the error norm for the validation set.024.5 Erro r N o rm 0 .2 0 .4 0.

117) should be replaced by: t–y 2+λ u φ = ------------------------------------. with different roles in the network.118) (2. (2. We have seen. with as consequence the decrease of the weight norm. employs usually the following training criterion: t–y 2+λ w φ = ------------------------------------2 2 (2. this results in a trade-off between accuracy within the training set (measured by the first term in (2.121) where the partitions below in the two last equations have dimensions (m*kq-1). (2. which is bounded between 0 and 1 (if they connect the first to the second hidden layer).1) are followed. Usually.3. regularization. while in the case of standard termination is 380.119) becomes: 108 Artificial Neural Networks . The latter perform the role of suitably transform the input domain in a such way that the nonlinearly transformed data can better linearly reproduce the training data. if the advices in terms of initialization (see Section 2. 2 where u stands for the linear weights (see (2..119) y = . resulting in overtraining and bad generalization. This problem usually only occurs with the linear weights.dition number of the input matrix of the last hidden layer in this case has the value of 24.118) can be transformed into (see Exercise 15): t–y 2 φ = ---------------.6. the criterion incorporates a term which measures the value of the weights. as the range of the nonlinear weights is related with the range of their input data. the norm of the linear weights can increase substantially.120) 2 (2. Eq. 2 where: t = t 0 y λu (2.117) ) and the absolute value of the weights.7.99)).1. If we now express y as O ( q – 1 ) 1 u = Au .1.. if care is not taken. The second technique. (2.117) As we can see. and between -1 and 1. However. Therefore.3. that the weights in a MLP can be divided into linear weights and nonlinear weights. (2. and consequently a better conditioned model is obtained. in Section 2.

that is. However. we could still employ the learning methods described afterwards.pH Example.8 On-line learning methods In this section.9 Number of iterations 147 146 133 34 Analysing these two tables. t and y replaced by A .41 0. it is clear that an increase in τ is translated in a decrease in the norm of the linear parameters and in the condition number of the last hidden layer.14 . as the methods described Artificial Neural Networks 109 .2 14.4 104 485 161 20 Number of iterations 61 63 199 199 Table 2.= -----------------------------------..123) Therefore. with A.39 0. τ = 10 –6 λ 0 10-6 10-4 10-2 e 2.1.2 t – t – A u 0 0 λu λ t – Au 2 φ = -----------------------------------. when we do not have a batch of training data to employ for network training. τ = 10 u Condition number 46 46 44 21 –4 λ 0 10-6 10-4 10-2 e 0.9 5.0029 0.38 0.m is used in these examples. 2.123) to compute the value of the linear weights. The function Train_MLPs.6 can be used here.3 10-4 0.1. and using (2.15 - Coordinate Transformation Example.= --------------------. t and y . This is usually translated in a worse fitting on the training data. we discuss learning methods suitable to be used when the training data appears in a pattern-by-pattern basis. 2 2 2 The global optimum value of u is (see Exercise 15): + T T ˆ u = ( A A + λI ) A t = A t –1 Au 2 (2.3.0045 u 36 28 14 2 Condition number 1.0055 0.94 15. the training algorithms introduced in Section 2.3.2 15.122) (2. Table 2. there is no point in using them in this situation. If we did.

127) This is the LMS rule. 2.. We shall address the first strategy first. which updates the weights in the direction of the transformed input. The mean-square error criterion is defined as: E(e ) ζ = ------------. usually of higher dimensionality than the original input.124). at time k. in such a way that this transformed data.. The instantaneous estimate of the MSE. This is an unbiased estimate of the gradient vector of (2.3. and e=t-y is the error vector. in the last sections.1Adapting just the linear weights In Section 2.before for off-line learning have far better performance than those that will be described in the following lines. produces the best linear model of the target data. When addressing on-line learning methods.126) where a represents the output of the last hidden layer. (2.125) where e[k] is the error measured at time k. multiplied by a scalar.124) where E(. we can assume that this nonlinear transformation has been discovered before by an off-line learning method such as the ones introduced before.) is the expectation operator. is: e [k] ζ [ k ] = -----------. at time k.8. plus a value of one.2 we have justified its learning rule. at time k. Instantaneous gradient descent algorithms of the first order (we shall not address higher-order on-line methods here) update the weight vector proportionally to the negative of the instantaneous gradient: u [ k ] = u [ k – 1 ] + δe [ k ]a [ k ] (2. that the layers in a MLP have two different roles: the hidden layers perform a nonlinear transformation of the input data. and we continue this line of thought here.1. The instantaneous estimate of the gradient vector. for the output bias.1. the LMS algorithm.2.1. although in the context of its batch variant. and just adapt the linear weights. At that time we mentioned that the LMS algorithm was an instantaneous version of the steepest descent technique. and determine the values of all the weights in the network during the on-line learning process. 2 2 (2. or we can start from scratch. is: g [ k ] = – e [ k ]a [ k ] . 110 Artificial Neural Networks . We have seen. 2 2 (2. in the linear output layer. and then discuss the second case.2 we have introduced Widrow’s Adalines and in Section 2. the steepest descent method.

Let us consider the x2 problem of Ex. We have 2 weights.5.5 Artificial Neural Networks 111 . with information supplied by x=-0. at time k.Notice that.125)) as function of the weights.35 . with the target 0. Therefore a [ 1 ] = – 0. At time 1.128) FIGURE 2. the first one related with x.25 . it must satisfy: a [ k ]u = t [ k ] An infinite number of weights satisfy this single equation. and the line which corresponds to the solution of (2.5 1 and t [ 1 ] = 0. and the second one related with the bias (1).6.Instantaneous performance surface of the x2 problem. for the network to reproduce the kth pattern exactly.25. the network is supplied with x=-0.1 .128). 2. with ∆x = 0. (2. The following figure shows the instantaneous MSE estimate ((2.

it should be divided by the training size.5. the product ATA is denominated the autocorrelation matrix (to be precise. This is the product of a column vector by one row vector. the global minimum of successive performance surfaces will intersect at the global minimum of the true MSE performance function (2. and x=+0. with only one eigenvalue different from zero. and the corresponding eigenvector is a[k] (see Exercise 27).124). Consider now the analytical derivation of the least squares solution conducted in Section 2. conducted in Section 2.3. the condition that was obtained was that the learning rate should satisfy (2.1.5.1. Remembering now the stability analysis of the steepest descent algorithm. we shall. In this context. and therefore its size is (p*p). and.5 The two lines intersect at the point (w1=0. if we only consider these two patterns. we would have the following instantaneous performance surfaces (the one corresponding to the 2nd pattern has been shifted by 10): FIGURE 2.36 . w2=0.124).Instantaneous performance surfaces of the x2 problem. this time x=+0.If another pattern is input to the network. which is the global minimum of the true MSE function (2.3. 2 112 Artificial Neural Networks . however. call it the autocorrelation matrix from now on).3. if the length of the vector is p.62).53)1. consequently the product a[k]Ta[k] is denominated the instantaneous autocorrelation matrix. If this condition is adapted to this situation. specially the left-hand-side of the normal equations (2. Therefore. This matrix has only one non-zero eigenvalue ( a [ k ] ). and if this pattern is also stored exactly. we have: 1. the normal equations are called the Wiener-Hopf equations.25). represented as R[k].3. with information supplied by x=-0. when the information is rich enough (sufficiently exciting). Here.

the error is not increasing in the long run.130). With η = 1 . Then we employed the Matlab function MLPs_on_line_lin. are represented in fig. The values of the nonlinear and linear weights are stored in initial_on_pH. 2 ) ⁄ a [ k ] ] ) 2 if η = 1 ⁄ a [ k ] 2 (2. the values obtained with η = 2 are also represented. if that pattern would be presented again to the network. The next figure illustrates the evolution of the MSE obtained using the learning rate of 2. Artificial Neural Networks 113 . for different learning rates.37 (the left graph for the first pass trough the training set . for a single example. we trained a MLP with the used topology ([4 4 1]). in 1937. Primarily..001.1. With this last value. as indicated by (2.m (it also employs TL2. and the right graph the other iterations). and the a posteriori error vanishes if the learning rate is the reciprocal of the 2 a[k] squared 2-norm of the transformed input vector.130).129) Let us consider now that at time k. as with values of η between 0 and 2. The nonlinear weights were fixed. It is easy to confirm the validity of (2. stable learning was achieved.the first 101 iterations.130) Analysing (2. using the NLMS rule. The LMS rule is employed.131) Let us see the application of these two rules (LMS and NLMS) to a common problem.7. The opposite happens when we slightly increase the learning rate. we can see that stable learning. 2 ) ⁄ a [ k ] ] ) 2 if η = 0 or η = 2 ⁄ a [ k ] 2 if ( η ∈ [ ( 0. resulting in a transformed pattern a[k].3. is achieved if 2 0 ≤ η ≤ ----------------.m). the minimum value of the MSE is 3 10-4. the pH example.mat.130). For comparison. just outside the stable interval predicted by (2. which was first proposed by Kaczmarz. Depending on the learning rate. 4. e[k ] e[k] e[k] e[k] > e[k ] = e[k ] < e[k ] = 0 if ( η ∉ [ ( 0. the network is presented with pattern I[k]. for 10 passes over the complete training data. and the linear weights were initialized to zero. minimizing the new training criterion. The NLMS rule is therefore: δe [ k ]a [ k ] u [ k ] = u [ k – 1 ] + -----------------------2 a[k] (2. in order to get a good conditioned model. we would have the following output errors (see Exercise 28). a bit far from the minimum obtained off-line. and using a regularization factor of 10-2 (see Section 2.e [ k ] ). although the learning process is completely oscillatory.4 10-6.2 η < ----------------2 a[k] (2.m and TL3. and let us evaluate what happens to the output error (a posteriori output error . It is therefore interesting to introduce a new LMS rule (coined for this reason NLMS rule in [124]). with the LevenbergMarquardt. and that is widely known in adaptive control. The values of the evolution of the MSE. 2.2). after 10 passes through all the training set (1010 iterations).130).

3 0.4 0.37 .2 0.1 0.7 0.15 0.5 MSE (neta=2) 0.2 0.Evolution η = 2.The instantaneous values of the MSE.06 0 50 Iterations 100 150 0 0 200 400 600 800 1000 FIGURE 2.14 0.7 of the MSE .NLMS (stable on-line learning) 0.5 0.5 0.39 . are presented in fig. pass through pass.6 0.3 0.26 neta= 1.9 neta= 1.38 .1 0.001) 0 500 Iterations 1000 1500 0.5 0.8 0.NLMS (unstable on-line learning) 114 Artificial Neural Networks .05 0.18 MSE 0.08 0.2 0.6 0. As it can be seen. 2.1 0.35 0.25 0.12 0.16 neta= 0. the squared errors for each presentation are continuously reduced.001 of the MSE .3 0.4 MSE (neta=2.1 0.Evolution 0. for the case with η = 1 .24 neta= 1 neta= 0.1 0 0 0 500 Iterations 1000 1500 a) η = 2 FIGURE 2.2 0.22 0. 0.4 0.

1 0.LMS (stable on-line learning) Artificial Neural Networks 115 .Evolution of the MSE .0.3 0.1 0.15 MSE 0.02 0. for η = 1 Let us now see what happens if the standard LMS rule is applied.07 0.2 0. when stable learning is obtained.4 Instantaneous MSE 0 50 100 150 200 Number of iterations 250 0.5 0.40 .1 0.05 0 0 200 400 600 Iterations 800 1000 1200 FIGURE 2.05 Instantaneous MSE 0. The following figures show the evolution of the MSE.08 0.01 0 0 0 200 400 600 800 Number of iterations 1000 a) First 2 passes FIGURE 2.6 0.2 neta=0.06 0.Evolution b) Last 8 passes of the instantaneous MSE .03 0.39 .5 0.7 0.04 0.25 neta=0.NLMS. 0.

as we do not allow any increase in the weight error norm from iteration to iteration. in a cyclic fashion. this corresponds to η < 0. if the LMS rule is employed to a set of consistent data (there is no modelling error nor measurement noise). we can see that with learning rates greater than 0. T max ( a a ) (2.57.5 0 0 200 400 600 Iterations 800 1000 1200 FIGURE 2.3 of [124] (and see Exercise 29) that.41 . and we can see that this is a conservative condition.5 (0. 2. the onlearning process is unstable. as the overall learning process is stable. if the following condition is satisfied: η ( ηa a – 2 ) < 0 T (2. Parameter convergence It is proved in section 5.133) Notice that this a more restrictive condition than the overall stability of the learning process.Employing values which are slightly larger than 0. as it can be seen in fig.LMS (unstable on-line learning) Analysing the last two figures. and converges to the optimal value. if we apply the LMS rule with 0.57) 1.132) This indicates that. For the case treated here. the norm of the weight error always decreases. η > 0 . unstable learning is obtained.57).56 as the learning parameter.5 . the following condition should be satisfied: 2 η < ------------------------.41 2.5 1 0.5 2 MSE(neta=0. to have the weight error norm always decreasing. 116 Artificial Neural Networks .Evolution of the MSE .

(2. and the second to the optimal modelling error.3 of [124]).The optimum value of the linear weights is. 2. and this introduces noise in the weights updates.135) where the first term is due to the difference between the current and the optimal network. if the weights are initialized with zero (for a proof. (2. see section 5.6 0. η = 1 ) Comparing fig. and that can cause the MSE to increase. while we observe a constant decrease in the weight error norm.2 0. 0.5 Square of the weight error norm 0. if we compute the weight error norm evolution.3 0. from iteration to iteration. and when there is no modelling error or measurement noise. with η = 1 .42 . When A is rank-deficient. as we know.1 0 0 200 400 600 Iterations 800 1000 1200 FIGURE 2.42 and fig. Modelling error and output dead-zones When there is modelling error.7 0. 2. the same does not happen to the MSE evolution. given by: + ˆ u = a t. We can see that the error is always decreasing. the application of the NLMS rule (with η > 0 and η < 2 ). 2. If the original output error is used in the LMS train- Artificial Neural Networks 117 . forces the weights to converge to a value which exactly reproduces the training set and has minimum Euclidean norm. The NLMS (and the LMS) employ instantaneous values of the gradient to update the weights.Squared weight error norm evolution (NLMS.134) and.4 0.42 . for the NLMS rule. we obtain fig.37 . the instantaneous output error can be decomposed into: ˆ ˆ e [ k ] = a [ k ] ( w – w [ k ] ) + ( t [ k ] – a [ k ]w ) o n = e [k] + e [k] .

and the weights have converged to a value w. as in the batch case (see Section 2.≤ C ( R ) ⋅ -------.137) where is any natural norm and C(R) is the condition number of R.3. the error in the LMS training rules is changed to: 0 e[ k] + ς e[ k] – ς if ( e [ k ] ≤ ς ) if ( e [ k ] < – ς ) if ( e [ k ] > ς ) e [k] = d (2. Usually it is used: ς = E[ e [ k] ] In order to evaluate the applicability of this technique.3). w t (2. The magnitude of the weight error is related with the magnitude of the output error as: ew ey --------.. Parameter convergence Suppose that the output error lies within the dead-zone. we can have a large relative weight error.138) n (2. output dead-zones are employed. we can have: 118 Artificial Neural Networks . It can be proved (see Exercise 29).ing rules. by defining: V [ k ] = ew [ k ] 2 –1 ˆ ˆ = (w – w[ k] ) ( w – w[k ]) . To solve this problem. the error is considered to be zero. computed as R R . ς . the network is unable to determine when the weight vector is optimal. In this technique. This will have as consequence a bad generalization. Rate of convergence The condition of the autocorrelation matrix also dictates the rate of convergence. please see Exercise 30.1. if the model is ill-conditioned. T (2.139) (2.140) η' = η ( 2 – η ) . if the absolute value of the error is within the dead-zone.136) The main problem is to define the dead-zone parameter. as the second term causes the gradient estimate to be non-zero. that. This means that even with a small relative output error. More formally.

2 a[1] where θ ( 1 ) is the angle between a[1] and ew[0].1. 2 a[1] and.1) we have described the way that the gra- Artificial Neural Networks 119 .142) (2.2) a way to compute the true Jacobean matrix and gradient vector.142) can be given as: ( a [ 1 ]e w [ 0 ] ) 2 V [ 1 ] = V [ 0 ] – η' ---------------------------------.145) Therefore. The effects of instantaneous estimates As we have already referred.143) (2. in Alg. the conditioning of the auto-correlation matrix appears as an important parameter for the learning procedure.. After L iterations. whose condition number is proportional to the condition number of the autocorrelation matrix.2.3. the instantaneous gradient estimates.2Adapting all the weights We have described. LMS type of rules converge not to a unique value. Again. but to a bounded area known as the minimal capture zone. introduce errors or noise.8.5 (Section 2.3.3. we shall have: V [ L ] = V [ 0 ] ∏ ( cos ( θ ( k ) ) ) k=1 L T 2 (2. is: e[ 1] V [ 1 ] = V [ 0 ] – η' ----------------. whose size and shape depend on the condition of the autocorrelation matrix. as e[k] can be given as: T T T ˆ e [ k ] = a [ k ]w – a [ k ]w [ k – 1 ] = a [ k ]e w [ k – 1 ] . 2.144) 2 (2. and in Alg. This occurs when the transformed input vectors are highly correlated (nearly parallel). and there is model mismatch.= V [ 0 ] ( 1 – η' ( cos ( θ ( 1 ) ) ) ) . 2. The weights never settle once they enter in this zone (see Exercise 32). convergence is very slow.1. after one iteration. (2. convergence is very fast (See Exercise 31). if the weight update (which has the same direction as the transformed input vector) is nearly perpendicular to the weight error vector.141) The size of the parameter error. 2. If output dead-zones are not employed.V[L] = V[0] – e [k] η' ----------------2 a[k] k=1 L 2 (2. although they are a unbiased estimate of the true gradient vector. while if the transformed input vectors are nearly orthogonal.1.4 (Section 2.

varying the learning rate.m (it also uses ThreeLay. Considering that the optimum was obtained as the results of applying the Levenberg-Marquardt. 0. with the reformulated criterion. only with values of the learning rate smaller than 0.43 .m and TwoLayer. 2. using the instantaneous (noisy) estimates of the gradient vector.1 0.62) is derived for the batch update.18 neta=0. According to (2. and computing the maximum eigenvalue of the autocorrelation matrix at that point.0093 ≈ 0. as in on-line learning the weights are updated in every iteration. and (2. they are just applied to 1 presentation.12 MSE 0. the learning process converges.146) It should be noticed that this is just an indication.08 0.16 neta=0.001 0. which will be denoted as the BP algorithm or delta rule. and the Matlab function employed is MLPs_on_line_all.01 0. the instantaneous gradient vector estimate is easily obtained in every iteration.Application of the delta rule Analysing fig. The delta update of the weight vector is replaced by: 120 Artificial Neural Networks . An insight into this value can be obtained (see Section 2. as coined by their rediscovers [31]. This way. learning converges if: 2 η < ---------.43 .06 0.3) by doing an eigenvalue analysis to the autocorrelation matrix at the optimum. A technique which is often used (and abused) to accelerate convergence is the inclusion of a momentum term.1 neta=0. Both of them can be slightly changed. so that.3. to the same problem.= 0. instead of being applied to a set of m presentations. it is clear that.m).01 λ max (2.01.dient vector is computed by the BP algorithm. The next figure presents the results obtained by the LMS algorithm.62). The weights were initialized to zero.14 0.1. we find that its value is 214.04 0 200 400 600 Iterations 800 1000 1200 FIGURE 2.

therefore Artificial Neural Networks 121 . On the other hand. (2.14 alfa=0 alfa=0. A simple algorithm which enables an adaptive step size.∆w [ k ] = – ηg [ k ] + α∆w [ k – 1 ] (2. and that again is related with the eigenvalue for that particular direction.18 0. for the case where η = 0. However.16 0.9.9 0. According to this rule.5 0.148) where si measures the average of the previous gradients.1 0. should be set between 0. The learning rate could be changed from iteration to iteration if we performed an eigenanalysis of the autocorrelation matrix. the update of the weight vector is given by: k ∆η i [ k ] = – βη i [ k – 1 ] 0 if ( s i [ k – 1 ]gi [ k ] > 0 ) if ( s i [ k – 1 ]g i [ k ] < 0 ) otherwise . the step size is increased by a constant (k).08 0. and to stabilize the search in narrow valleys.001 .5 and 0. it is not necessary that all the weights have the same learning rate. 0. the momentum parameter. that would be too costly to perform. When si has the same signal of the current gradient estimate. individually assigned to each weight is the delta-bar-delta rule [83].Application of the delta rule. for weight i.147) The reasoning behind the inclusion of this term is to increase the update in flat areas of the performance surface (where g[k] is almost 0).06 0. The next figure shows the effects of the inclusion of a momentum term.44 . with a momentum term It is clear that the inclusion of a momentum term ( α > 0 ) increases the convergence rate.12 MSE alfa=0. making it possible (theoretically) to escape from local minima. α . Typically.04 0 200 400 600 Iterations 800 1000 1200 FIGURE 2.

we have obtained in 10 epochs (1010 iterations) of on-line learning a final value of the MSE of 0. 2. with η = 0. and a value of 4. as each weight is multiplied by its own learning rate. β = γ = 0. a final value of 3 10-4.001 and α = 0. which is the need to fine tune the algorithm. It should be noted that there are several other on-line learning algorithms for MLPs. obtained with off-line learning. the step size is decreased geometrically ( –βη i [ k – 1 ] ). The parameters of the delta-bar-delta rule applied were k = 0. The measure of the average of the gradients (si) is an exponentially weighted sum of the current and past gradients. the direction of the update is no longer in the negative of the instantaneous gradient.00001 .Application of the delta-bar-delta rule Slightly better results were obtained. although with more experimentation with the parameters. When there is a change in the gradient.18 0.06 0.1 0.12 MSE 0. As a summary. Only the most standard have been explained. This way. or to the delta rule with momentum. this is. with the delta-bardelta rule.45 . 2. This is also one of the weak points in this rule.14 delta-bar-delta 0. better results could possibly be achieved. fig. for the best case found in fig.16 momentum 0. it should be clear that our initial 122 Artificial Neural Networks .04 0 200 400 600 Iterations 800 1000 1200 FIGURE 2.accelerating convergence.7 . obtained with the NLMS rule by adapting only the linear weights. adapting all the weights and employing the delta-bar-delta rule. computed as: s i [ k ] = ( 1 – γ )g i [ k ] + γs i [ k – 1 ] (2.149) The delta-bar-delta rule can be employed to the standard delta rule. It should be also referred that.44 .04.08 0.9 .4 10-6. 0.45 illustrates the comparison of this rule and the momentum rule.

the number and placement of the centres. they posses the best approximation property [132]. Let r = c–x 2 (2. 2 is the Euclidean norm. which states that. the width of the function. They were introduced in the framework of ANNs by Broomhead and Lowe [40].151) describe the basis function. for a given topology. and. 2 (2. k = c i – x k 2. 2. Since then. RBFs are a three-layer network. and the actual input value: W i. Several different choices of functions are available: Artificial Neural Networks 123 . if a set of training data is available. RBFs are also universal approximators [131]. in contrast with MLPs. there is 1 set of unknown coefficients that approximates the data better than any other set of coefficients. the hidden layer consists of a certain number of neurons where the ‘weight’ associated with each link can be viewed as the squared distance between the centre associated with the neuron.2 Radial Basis Functions Radial Basis Functions Networks (RBFs) were first used for high-dimensional interpolation by the functional approximation community and their excellent numerical properties were extensively investigated by Powell [111].statement that. where they were used as functional approximators and for data modelling. in same cases. for the considered input. where the first layer is composed by buffers. The output The modelling capabilities of this network are determined by the shape of the radial function.150) where ci is the centre of the ith hidden layer node. and layer is just a linear combiner. and. there is no point in employing on-line learning techniques. due to the faster learning associated with these neural networks (relatively to MLPs) they were more and more used in different applications.

0]. considering its centre at [0. FIGURE 2.46 - Activation values of the ith neuron. This is a unique function.152) f ( r ) = log ( r + σ ) shifted logarithm function Among those. the most widely used is the Gaussian function. both in the range]-2..153) The following figure illustrates the activation function values of one neuron. and a unitary variance The output of an RBF can be expressed as: 124 Artificial Neural Networks . and σ i = 1 .0].f( r )= r f( r )= r 3 r – -------2 2σ 2 radial linear function radial cubic function Gaussian function thin plate spline function multi-quadratic function inverse multi-quadratic function 2 f( r )= e f( r )= f ( r ) = r log ( r ) r +σ 1 f ( r ) = -------------------2 2 r +σ 2 2 2 2 (2.2[.=[0. as the multivariate Gaussian function can be written as a product of univariate Gaussian functions: ci – x 2 – ------------------2 2 σi 2 fi ( x ) = e = ∏e j=1 n ( c ij – x i ) – --------------------2 2 σi 2 (2. with centre Ci. We assume that there are 2 inputs.

155) (2. the Gaussian standard deviation is usually taken as: d max σ = ------------. 2. the centres do not coincide any more with the data points. consisting of: Artificial Neural Networks 125 . 2m 1 where dmax is the maximum distance between centres and m1 is the number of centres.. the output weights.6. as it is evident from Section 2. the output weights can be obtained by a pseudo-inverse of the output matrix (G) of the basis functions: + ˆ w = G t (2. due to the problems associated with this scheme (increased complexity with increasing data size and increasing conditioning problems). Now. With respect to the value of the linear parameters. This is an interpolation scheme.1. There are four classes of major methods for RBF training: 2. and the centres of the RBFs were chosen to be the input vectors at these data points.3. where the goal is that the function passes exactly through all of those data points. An alternative is to use an hybrid learning process [112].154) where wi is the weight associated with the ith hidden layer node. The learning problem is therefore associated with the determination of the centres and spreads of the hidden neurons.1 Training schemes As we have seen in Section 2. In this case.3.1.1. with number of basis functions strictly less than the number of data points. and one way of determining them is to by choosing them randomly.156) 2.6. as they act as a linear combiner. However. RBFs were soon applied as approximators.p y = i=1 wi fi ⋅ ( ci – x 2 ) (2.2 Self-organized selection of centres The main problem with the last scheme is that it may require a large training set for a satisfactory level of performance.1.2.1 Fixed centres selected at random The first practical applications of RBFs assumed that there were as many basis functions as data points. can be determined in just one iteration. within the range of the training set.2.2.

apart from the one expressed in (2. that can be employed: 126 Artificial Neural Networks .to determine the locations of the centres of the radial basis functions.2.to determine the linear output weights. With respect to the computation of the spreads. In (2. To solve this problem several other algorithms have been proposed. which places the centres in regions where a significant number of examples is presented. k ( x ) = arg min i x ( j ) – c i 2. j = j+1 end Algorithm 2. ….4.157) .3. The k-means clustering algorithm is presented below: Initialization .10. 0 < η < 1 .Updating (2. i = k ( x ) c i [ j ]. and n the number of radial basis functions.1.Adjust the centres of the radial basis functions according to: c i [ j ] + η ( x ( k ) – c i ). For the first step. i = 1. Let m denote the number of patterns in the training set. is that the optimum will depend on the initial values of the centres.Similarity matching . n 2. we need an algorithm that groups the input data.Find the centre closest to x(j). 2. they must be all different 2. otherwise ci[ j + 1 ] = (2.Find a sample vector x(j) from the input matrix 2.158).155). and a • Supervised learning step .Choose random values for the centres. One problem related with Alg.158) 2.10 - k-means clustering Step 2 in the last algorithm continues until there is no change in the weights.• Self-organized learning step . in an homogeneous way.Sampling . One algorithm for this task is the k-means clustering algorithm[112]. One of them is the kmeans adaptive clustering [117]. While go_on 2. there are other heuristics. Let its index be k(x): 1. and consequently it is not a reliable algorithm.

160) 3. Then the spread is: Ck – C i σ = Q --------------------. the weight vector.1. is: max i. i ∂ σ i ∂ wi parameters. which is the same through all the centres. but also the centres and the spread of the basis functions are computed using error-correction algorithms.1. j = 1 … m x i – x j σ = -------------------------------------------------2 2 (2.3 Supervised selection of centres and spreads In this class of methods. we can use: n σ = j=1 C j. i. It is easy to determine (please see Exercise 4) that the derivatives are: able. . spreads and linear ∂ C j. are all parameters to be determined by one of the training methods introduced previously for MLPs. and the spread vector. σ .e. C.2. Maximum distance between patterns: Assume that m is the total number of patterns. i – x j ) ∂y = – w i f i ----------------------2 ∂ C j.162) 2.163) Artificial Neural Networks 127 . the spread associated with this centre is: k Ci – Cj j=1 σ = -----------------------------k 2 (2. (2. Nearest neighbour: Assume that k is the nearest neighbour of centre i. The empirical standard deviation: if n denotes the number of patterns assigned to cluster i. the centre matrix. i – xj ------------------------n 2 (2. 2 where Q is an application-defined parameter.. i σ i (2. and . to compute the Jacobean. w. If the standard training criterion is employed. not only the linear weights.159) 2. The k-nearest neighbours heuristic: considering the k (a user-defined percentage of the total number of centres) centres nearest the centre Ci. the derivatives with respect to the centres. we need to have avail∂y ∂y ∂y .161) 4. This way.. Then the spread. ( C j.

In this case. 2.164). In terms of termination criteria.7. with the difference that the identity matrix is replaced by G0. … … … G ( C p.3. quasi-Newton.165) Then.2 On-line learning algorithms There are a variety of on-line learning algorithms for RBFs proposed.1. please see Section 5. The resulting network is coined the Generalized Radial Basis Function network [125].6 can also be employed. the criterion introduced in Section 2. the ones introduced in Section 2.163). C p ) (2. C 1 ) … G ( C 1. As the output layer is a linear layer.2.166) where G0 is defined as: G ( C 1. any method (steepest descent.123).5 of [125]): T T ˆ w = ( G G + λG o ) G t .1.117).n k=1 ∂y = w i f i ⋅ ------------------------------------3 ∂ σi σ i ( C k. and specifically the criterion (2. i – x k ) 2 (2. (2. 128 Artificial Neural Networks . the optimal value of the linear weights is defined as (for the theory related with this subject. we can employ the Jacobean composed of (2. If we replace the second-term in the right-hand-side with a linear differential operator. with wi ˆ replaced by w i (see [133]).3.2.1. 2. C p ) G0 = .7.167) Notice that (2.166) is obviously similar to (2. This method was proposed by Tikhonov in 1963 in the context of approximation theory and it has been introduced to the neural network community by Poggio and Girosi [118]. –1 (2.4 Regularization A technique also introduced for training BBFs comes from regularization theory.2.1.2 also apply here. Levenberg-Marquardt) can be employed.164) ∂y = fi ∂ wi (2. C 1 ) … G ( C p. Recall the regularization method introduced in Section 2.3. related with the basis functions used. Some will be briefly described here.

8. the linear output parameters are in every case determined by a pseudo-inverse. with the standard or the reformulated criterion.6 1017 Neurons 10 20 30 40 Error norm 0. centres and spreads are updated using an Extended-Kalman filter algorithm. to the results obtained with the techniques introduced afterwards Table 2.One approach is to employ the optimal k-means adaptive clustering [117] to update the centres and spreads on-line. 2. are summarized in Table 2. which adds or subtracts units from the network. to a smaller extent. 2. The same note is valid. that the results obtained should be interprets as such.16 . If no unit is added or subtracted. by solving equation (2.1. An example of a constructive method which can be applied on-line is the M_RAN [126] algorithm.11 .002 2 10-4 As a problem associated with RBFs is the condition number of the basis function output matrix.167) recursively.pH example Again Ex. The results obtained. It employs also rbf. It should be noted that as the centres are obtained randomly. and to use a version of a recursive least-squares algorithm [69] to compute the linear weights. as the number of basis functions is increased.Results of RBF learning (fixed centres) Condition number 7.m implements a training method based on fixed centres. these results are also shown. all the free parameters in the network (weights.006 0. by determining the centres and spreads off-line and adapting on-line the linear weights. Yee and Haykin [127] followed this approach.m. Artificial Neural Networks 129 .2.1.7 will be employed to assess the modelling capabilities of RBFs. and the performance of different training algorithms.5 107 5 1016 1 1017 3.05 0.3 Examples Example 2. based on its performance on the current sample and also in a sliding window of previous samples. Other approaches involve a strategy such as the one indicated in Section 2.16. to a sliding moving window (see [134]). In common with the different methods below described.3. The function rbf_fc. Another uses the application of the Levenberg-Marquardt method.

m.03 0.m. I 0.Input patterns and final selection of centres (n=10) Finally.Implementing the k-means clustering algorithm. optkmeansr.1 0 20 40 60 80 1 00 1 20 FIGURE 2.2) Error norm 0.47 .1 0 -0.8 1016 1 1017 1.2 0. both in terms of error norm (but not in terms of condition number) were obtained by determining the position of the centres using the clustering algorithm.4 0.2 10-4 2.2 106 6. the following results have been obtained: Table 2.7 0.05 0.m.m diffrows. in the Matlab function rbf_clust.Results of RBF learning (self-organized centres) adaptive clustering Neurons 10 20 30 40 Error norm 0.Results of RBF learning (self-organized centres) k-means clustering (alfa=0. with Matlab function rbf_ad_clust.7 1017 Neurons 10 20 30 40 Comparing both tables. better results. and its results are expressed in the next table.18 . the adaptive clustering algorithm was experimented.8 1017 130 Artificial Neural Networks . 2. The final distribution of the centres obtained is shown in fig.3 0.002 5.002 2.4 1017 2.3 10-5 3. initoakmr.m.4 10-4 Condition number 1 1010 2 1016 1.2 10-4 Condition number 5.47 .8 0.m Table 2.6 0.17 .m and lexsort.5 0. The following functions are employed: clust.

49 . The first 100 iterations are shown. These data can be found in winiph. 25 20 alfa=0.mat. The linear weights were initialised with random values.0005 alfa=0.m and Jac_mat. The initial values of the centres were found as the ones obtained by rbf_fc.001 15 Error norm alfa=0. The following figure is obtained: 20 18 16 14 Error Norm 12 10 8 6 4 2 0 0 20 40 60 Iterations 80 100 120 FIGURE 2. The termination criteria (2.m.Supervised methods are now evaluated. and the initial values for the spreads were all equal to the one obtained using this function.0001 10 5 0 0 10 20 30 40 50 Iteration 60 70 80 90 100 FIGURE 2.Steepest descent results (standard criterion) The Levenberg-Marquardt method is employed afterwards. –6 The following figure illustrates the results of a steepest descent method.m).Levenberg-Marquardt results (standard criterion) Artificial Neural Networks 131 . within a reasonable number of iterations. First the standard criterion will be employed.114) to (2.m (it also employs RBF_der. The method does not converge. for a RBF with 10 neurons. The following results were obtained with Train_RBFs.48 . It converges after 119 iterations.116) with τ = 10 was employed.

Convergence was obtained. for η = 0. The initial nonlinear parameters are the same as in the previous examples.025 0.01 0 0 10 20 30 40 50 Iterations 60 70 80 90 100 FIGURE 2.57 0.035 0.014.02 0.Applying now the gradient techniques employed before. at iteration 76.055 alfa=0. 0.50 .01 0 10 20 30 40 50 Iterations 60 70 80 90 100 FIGURE 2.001) Levenberg-Marquardt (standard criterion) 132 Artificial Neural Networks .03 0.06 0.mat. but this time with the reformulated criterion. alfa=0.50 . In fig.19.51 illustrates the performance of the Levenberg-Marquardt algorithm minimizing the new criterion.04 Error norm 0.Levenberg-Marquardt results (reformulated criterion) The best results of the supervised methods are summarized in Table 2.05 0. 2.Results of supervised learning (10 neurons) Error norm 0. as usually.02 0.1 .045 alfa=0. Table 2. we can find the results of the steepest descent method.03 0.1 0. It achieves convergence at iteration 115.05 alfa=0.01 0. 2.06 0.Steepest descent results (reformulated criterion) fig.015 0.02 Method BP (standard criterion. 0. and can be found in winiph_opt. for the first 100 iterations.51 .05 0. much better results are obtained.04 Error norm 0. with an error norm of 0.19 .

The results obtained with the fixed random centres algorithm are shown in Table 2.20 . gives the best results. Table 2. we obtain: Table 2.12 1 0.2) Error norm 5 2.5 103 2.8 1. Example 2.8 2.22 illustrates the results obtained with the adaptive clustering algorithm.Coordinate transformation Ex. 2.12 .Results of supervised learning (10 neurons) Error norm 0.3 104 5. minimizing the reformulated criterion.9 103 6.6 3.8 was also experimented with RBFs.22.12 Condition number 94 1.21 .88 1.2 Employing the k-means clustering algorithm.2 104 Artificial Neural Networks 133 .1) Levenberg-Marquardt (new criterion) Comparing these results with the results obtained with the previous methods.1 104 Neurons 10 20 30 40 50 Table 2.Table 2.Results of RBF learning (fixed centres) Condition number 278 128 348 1. Table 2.2 1.22 .9 3.014 6 10-4 Method BP (new criterion.Results of RBF learning (self-organized centres) (alfa=0.19 .Results of RBF learning (self-organized centres) (adaptive algorithm) Neurons 10 20 30 Error norm 3 2. alfa=0.54 103 7. at the expense of a more intensive computational effort. it is clear that the use of the Levenberg-Marquardt method.89 Condition number 30 532 2.4 103 Neurons 10 20 30 40 50 Error norm 13.

respectively). the model trained with this algorithm/criterion achieves a better performance than models with 50 neurons trained with other algorithms.18 Method BP (standard criterion.3 Lattice-based Associative Memory Networks Within the class of lattice-based networks.Table 2.m.23 .23: Table 2.57 106 With respect to supervised learning.18 105 1.mat (standard and reformulated criterion. specially minimizing the new criterion. converged after 28 iterations. and common features.46 0. it can be seen that.22 .73 0. found in winict. the steepest descent method with the standard criterion did not converge after 200 iterations. The same method. alfa=0. we shall describe the CMAC network and the Bspline network. 134 Artificial Neural Networks .1) Levenberg-Marquardt (new criterion) –3 With τ = 10 . with the new criterion. The LM methods converged in 103 (18) for the standard (new) criterion. Both share a common structure. employing again as initial values the results of rbf_fc. with only 10 neurons. which are described in the first section.Results of supervised learning (10 neurons) Error norm 2 0.16 Condition number 2.mat and winict_opt. alfa=0. These examples show clearly the excellent performance of the Levenberg-Marquardt. 2.001) Levenberg-Marquardt (standard criterion) BP (new criterion.Results of RBF learning (self-organized centres) (adaptive algorithm) Neurons 40 50 Error norm 1 1. the results with 10 neurons can be found in Table 2. Comparing the last 4 tables.

fig. a basis functions layer and a linear weight layer. a1 x(t) a2 a3 ω1 ω2 ω3 ωp − 2 ∑ y(t) network output a p−2 a p −1 ap ωp −1 ωp normalized input space basis functions weight vector FIGURE 2.53 shows a lattice in two dimensions. as the address of the non-zero basis functions can be explicitly computed. and the procedure to find in which cell an input lies can be formu- Artificial Neural Networks 135 .3.53 .A lattice in two dimensions One particular input lies within one of the cells in the lattice. This cell will activate some. 2. and also as the transformed input vector is generally sparse.1 Structure of a lattice-based AMN AMNs have a structure composed of three layers: a normalized input space layer. Additionally. knowledge is stored and adapted locally.1. This scheme offers some advantages.1 Normalized input space layer This can take different forms but is usually a lattice on which the basis functions are defined.Associative memory network 2. knowledge about the variation of the desired function with respect to each dimension can be incorporated into the lattice.2. but not all of the of the basis functions of the upper layer.3.52 . x2 x1 FIGURE 2.

. and they are generally placed at different positions. x 1 ] × … × [ x n . are λ i. and so the exterior knots are only used for defining these basis functions at the extreme of the lattice.170) These exterior knots are needed to generate the basis functions which are close to the boundaries. r i λ i. max λ i.lated as n univariate search algorithms. In order to define a lattice in the input space. a set of ki exterior knots must be given which satisfy: λ i. ri + 1 ≤ … ≤ λ i. When two or more knots occur at the same position. 1 λ i. ri < x i max . These knots are usually coincident with the extreme of the input axes. 2 ≤ … ≤ λ i. 0 λ i..A knot vector on one dimension. for the ith axis (out of n axis). j = ( j = 1. i = 1. – 1 λ i. They are arranged in such a way that: xi min max min < λ i.54 . r i + 1 λ i. and this feature can be used when attempting to model discontinuities. xi min xi . this is termed a coincident knot. 1 ≤ λ i. r i + 2 Exterior knots Interior knots FIGURE 2. (2. ….j and is defined as: min max min max 136 Artificial Neural Networks . The network input space is [ x 1 . At each extremity of each axis. 0 = x i xi max min (2. The jth interval of the ith input is denoted as Ii. – ( k i – 1 ) ≤ … ≤ λ i. n ) . …. ri + k i (2. vectors of knots must be defined. one for each input axis. r i. and therefore these networks are only suitable for small or medium-sized input space problems. The main disadvantage is that the memory requirements increase exponentially with the dimension of the input space. or are equidistant. There are usually a different number of knots for each dimension. x n ] . 3 Exterior knots λ i. respectively. 2 λ i.168) where x i and x i are the minimum and maximum values of the ith input.169) = λ i. with ri interior knots and ki=2 The interior knots.

λ i. A basis function has compact support. the union of the supports forms a complete and non-overlapping n-dimensional overlay. r i [ λ i. or bounded support. j ]if j=r i + 1 (2. where. for each set. and ri=5. ρ . within the range of the ith input. In each set.A two dimensional AMN with ρ =3.2 The basis functions The output of the hidden layer is determined by a set of p basis functions defied on the ndimensional lattice. and so they can be organized into ρ sets. j = [ λ i. and the support of each basis function is bounded. size and distribution of the basis functions are characteristics of the particular AMN employed. The support of a basis function is the domain in the input space for which the output is nonzero. j – 1. for any input. 137 Artificial Neural Networks . i=1 n 2. there are ri+1 intervals (possibly empty if the knots are coincident. the number of non-zero basis functions is a constant. if its support is smaller than the input domain.…. 3rd overlay 2nd overlay 1st overlay Input lattice FIGURE 2. the number of non-zero basis functions is ρ .I i. For any input.171) This way. For a particular AMN.55 .3. denoted as generalization parameter.1. which means that there are p' = ∏ ( ri + 1 ) n-dimensional cells in the lattice. j )for j=1. one and only one basis functions is non-zero. λ i. j – 1. The shape.

The total number of basis functions is the sum of basis functions in each overlay. This number. and therefore there are at least two defined on each axis. is the product of the number of univariate basis functions on each axis. 2. If a network has this property. The linear coefficients are the adjustable weights. 2.2.174) The dependence of the basis functions is nonlinear on the inputs. and so (2. for the AMN.In fig.1. The output is therefore: p T y = i=1 aiwi = a w (2. where the sum of all active basis functions are a constant: ρ a ad ( i ) ( x k ) = constant i=1 (2.1Partitions of the unity It is convenient to employ normalized basis functions. and does not vary according to the position of the input.173) can be reduced to: ρ y = i=1 a ad ( i ) w ad ( i ) (2.3.55 . any input within the pink region can be envisaged as lying in the intersection of three basis functions. a lower bound for the number of basis functions for each overlay. finding the weights is just a linear optimization problem. i. and subsequently. 2. Therefore. This is a desirable property. and this dependence can be made explicit as: 138 Artificial Neural Networks .172) where ad ( i ) is the address of the non-zero basis function in overlay i. since any variation in the network surface is only due to the weights in the network. is 2n. p denotes the total number of basis functions in the network.e.3. These networks suffer therefore from the curse of dimensionality [114]. green and blue rectangles in the three overlays. it is said to form a partition of the unity. in turn. The decomposition of the basis functions into ρ overlays demonstrates that the number of basis functions increases exponentially with the input dimension.173) In (2.173). only ρ basis function are non-zero. whose support is represented as the red. and because the mapping is linear. These have a bounded support.3 Output layer The output of an AMN is a linear combination of the outputs of the basis functions.1. As we know..

and introduced to.3. In the scheme devised by Albus to ensure this property. the supports are distributed on the lattice. In CMACs. A network satisfies this principle if the following condition holds: As the input moves along the lattice one cell parallel to an input axis. in other words. the CMAC is a look-up table.3) a18 a15 2nd overlay d2=(2.2) a9 1st overlay d1=(1. Its main use has been for modelling and controlling high-dimensional non-linear plants such as robotic manipulators.175) 2. For on-line modelling and control. which. this constant is 1 for all the axis. it is important that the nonlinear mapping achieved by the basis functions posses the property of conserving the topology. A CMAC is said to be well defined if the generalization parameter satisfies: 1 ≤ ρ ≤ max i ( r i + 1 ) (2. the output calculation is a constant and does not depend on the input.176) a21 a19 a16 a13 a10 a7 a4 a1 a2 a5 a3 a11 a8 a6 a20 a17 a14 a12 a22 3rd overlay d3=(3.ρ y = i=1 a i ( x i )w i (2.1) Input lattice Artificial Neural Networks 139 . where the basis functions generalize locally. In its simplest form.2 CMAC networks The CMAC network was first proposed by Albus [42]. This is called the uniform projection principle. the number of basis functions dropped from. such that the projection of the supports onto each of the axes is uniform. means that similar inputs will produce similar outputs.

and repeating this process through every dimension. The overlays are successively displaced relative to each other.1 Overlay displacement strategies The problem of distributing the overlays in the lattice has been considered by a number of researchers. then by adding ρ units in the 2nd dimension. As the input moves one cell apart in any dimension. the black cell shows the input. In fig.3. T x2 x1 FIGURE 2.57 . corresponding to a displacement matrix D = 123 . with the first dimension reset. 123 To understand how a CMAC generalizes. one basis function is dropped and a new one is incorporated. the grey cells show those that have 2 active basis functions in common with the input. The generalization parameter is 3. there are 5 interior knots in each axis.FIGURE 2. the overlay placement is uniquely defined.56 .2. it is interesting to consider the relationship between neighbouring sets of active basis functions. It is clear that. in a 2dimensional overlay) of the overlay is determined.2-dimensional CMAC overlays with ρ =3 Consider the two-dimensional CMAC shown in fig.57 . and the displacements of the overlays were given as: 140 Artificial Neural Networks . one interval along each axis. and the coloured cells the ones that have 1 active basis function in common with the input. and repeating this process until the maximum number of knots has been reached. 2. 2. The original Albus scheme placed the overlay displacement vector on the main diagonal of the lattice. If a LMS type of learning rule is used. This is why the CMAC stores and learns information locally. and the rest is left unchanged.56 . 2. which corresponds to 36 lattice cells. as the subsequent supports are obtained by adding ρ units in the first dimension. The network uses 22 basis functions to cover the input space. Cells which are more than ρ -1 cells apart have no basis function in common. once the corner (the "top-right" corner of the first support.CMAC generalization. only those weights which correspond to a non-white area are updated.

otherwise (2. …. … d ρ – 1 = ( ( ρ – 1 )d 1. ρ ) where all the calculations are performed using modulo ρ arithmetic. To 1 make a partition of the unity. CMACs can only exactly reproduce piecewise constant functions. and. 2d n ) . ρ. 1 ) d 2 = ( 2. …. In [115].d 1 = ( 1. if the input lies within its supρ port.3 Training and adaptation strategies The original training algorithm proposed by Albus is an instantaneous learning algorithm. Therefore. Tables are given by these authors for 1 ≤ n ≤ 15 and 1 ≤ ρ ≤ 100 . and the network might not be so smooth as using other strategies. Its iteration k can be expressed as: ∆w [ k ] = δe [ k ]a [ k ].177) (2. ρ – 1. 1. …. 2d 2.179) Artificial Neural Networks 141 . …. ( ρ – 1 )d2. The vectors are defined to be: d 1 = ( d 1. using an exhaustive search which maximizes the minimum distance between the supports. as proposed by Albus. and in fact it is equal (the output is simply the weight stored in the active basis function) when ρ = 1 . d2. ρ.2 Basis functions The original basis functions. the output of a basis function is -. 2 ) … d ρ – 1 = ( ρ – 1. ( ρ – 1 )d n ) d ρ = ( ρ. d 2. higher order basis functions have been subsequently proposed. where 1 ≤ d i < ρ .178) (2. …..2. di and ρ coprime.2. 2. …. 2. and has discontinuities along each (n-1) dimensional hyperplane which passes through a knot lying on one of the axes. ρ ) This distribution obeys the uniform projection principle.3. 2. and 0 otherwise. …. a vector of elements ( d 1. ρ – 1 ) d ρ = ( ρ. as a result of that. d n ) was computed. but we will not mention them in this text. are binary: whether they are on or off.3. Because of this. if e [ k ] > ζ 0. the output is piecewise constant. but lies only on the main diagonal. …. …. d n ) d 2 = ( 2d 1. The CMAC is therefore similar to a look-up-table.

1): δe [ k ]a [ k ] ------------------------. One most frequently used is the NLMS learning algorithm (see Section 2. a[k]= ρ -1.1.13 .2. respectively.75.m and ad.3.75 as the limit values for our network. Of course if we remember that the output of the CMAC is a linear layer.3. Please note that in the case of the binary CMAC. with an increasing number of interior knots. 142 Artificial Neural Networks .m.3. which employs also the files out_basis. the following table illustrates the results obtained.035 and 0. The CMACs were created with the Matlab file Create_CMAC.1.m. and a generalization parameter of 3. the pseudo-inverse can be employed to obtain: + ˆ w = A t (2. This is clearly a LMS update.1). if e [ k ] > ζ 2 a[k] 2 0.180) (2.m.181) can be applied. Other instantaneous learning algorithms can be used with the CMAC. In terms of training (off-line) algorithms. 2.8. with a dead-zone (see Section 2. otherwise where 0 < δ < 2 .Only those weights that contribute to the output are updated. They were trained with the function train_CMAC_opt. the batch version of the NLMS algorithm δ ∆w [ i ] = --m e [ k ]a [ k ] -------------------2 a[k] 2 k=1 m ∆w [ k ] = (2.182) It should be mentioned that matrix a is usually not a full-rank matrix. Considering 0 and 0.pH nonlinearity with CMACs The minimum and maximum values of the input vector are 0.4 Examples Example 2.8.

9 1018 3. 5 Interior knots and ρ =3 If now 10 interior knots are used for each dimension.Error norm.5 2 Ta rg e t.Table 2. and 22 basis functions.182) is used. respectively. we stay with 36 cells.88 0.Coordinate Transformation with CMACs Consider again the coordinate transformation mapping. using (2.4 0.34 0. The following figure illustrates the output obtained.64 0. 5 interior knots.8 1034 ∞ ∞ ∞ ri 10 20 30 40 50 Error norm 0.y 1. the number of cells raise to 121.24 . when (2. for different numbers of interior knots Condition number 1.27 Example 2.5 0 0 20 40 60 Pa tte rn nu mb er 80 1 00 12 0 FIGURE 2.0) and (1. Artificial Neural Networks 143 . and the number of basis functions to 57. Considering these as minimum and maximum values of our network.14 .Output and Target pattern. The minimum and maximum values of the input are (0.58 . The next figure illustrates the results obtained. 3 2.182).1). and a generalization parameter of 3.5 1 0.

3 2. for different numbers of interior knots ri 5 e 2. Even with this large number.60 .25 .5 0 20 40 60 Pattern number 80 100 120 FIGURE 2.59 . which shows that the model will generalize well. Table 2. 10 Interior knots and ρ =3 If we increase now the number of knots to 20.5 2 Ta rg e t. in each dimension.y 1.y 1.Output and Target pattern.Output and Target pattern. The results obtained are shown in fig. the norm of the weight vector is 20. we stay with 441 cells.5 1 0.60 .5 1 0.5 2 Ta rg e t. 20 Interior knots and ρ =3 The above results can be summarized in the following table. and 177 basis functions.5 0 0 20 40 60 Pattern number 80 100 120 FIGURE 2.Error norm. 3 2. 2.18 144 Artificial Neural Networks .5 0 -0.

it is necessary to specify the shape (order) of the univariate basis functions and the number of knots for each dimension.3. achieved an error norm of 0. with MLPs.8 0.3 0.4 0. 2. The univariate Bspline basis function of order k has a support which is k intervals wide.25 .7 0.13 It should be remembered that the best result obtained for this example. One additional reason to study these kind of networks is that they provide a link between neural networks and fuzzy systems.2 0. 2.05 0.3. 1 0.2 0.5 0. Hence.3 B-spline networks B-splines have been employed as surface-fitting algorithms for the past twenty years.1 Basic structure When the B-spline is designed. The order of the spline implicitly sets the size of the basis functions support and the generalization parameter. An important achievement was obtained in 1972 when Cox [116] derived an efficient recurrent relationship to evaluate the basis functions.5 0. for different numbers of interior knots ri 10 20 e 1.7 0.8 0. B-splines can be considered AMN networks. these functions being defined on a lattice.3 0.6 0.61 . Thus. A B-spline network is just a piecewise polynomial mapping. each input is assigned to k basis functions and its generalization parameter is k.Table 2.1 0 0 1 2 x 3 4 5 Artificial Neural Networks 145 .9 0. and their structure is very similar to the CMAC networks studied before.6 output 0.3. In fact. a multivariate B-spline of order 2 can be interpreted also as a set of fuzzy rules. A set of univariate basis functions is shown in fig.9 0.1 0 0 1 2 x 3 4 5 output 1 0. 2.Error norm.17.4 0. which is formed from a linear combination of basis functions.

and every possible combination of univariate basis function is taken.3 0. as in the case of splines of order 3 for each dimension.3 0 0 1 2 x 3 4 5 0 1 2 x 3 4 5 c) Order 3 FIGURE 2.2 0. The curse of dimensionality is clear from an analysis of the figures.5 0. 146 Artificial Neural Networks . 32=9 basis functions are active each time.6 0.4 0.4 0. each multivariable basis function is formed from the product of n univariate basis functions.1 0 0. In the following figures we give examples of 2 dimensional basis functions.61 - d) Order 4 Univariate B-splines of orders 1-4 Multivariate basis functions are formed by taking the tensor product of the univariate basis functions.1 output 0.7 b) Order 2 0.8 0.6 0. where for each input dimension 2 interior knots have been assigned. one from each input axis.7 0.5 output 0.2 0. Therefore.a) Order 1 0.

defined on a ndimensional lattice.63 - Two-dimensional multivariable basis functions formed with order 2 univariate basis functions. in B-splines networks. 1. The point [1. univariate basis functions.FIGURE 2. the generalization parameter is therefore: Artificial Neural Networks 147 . 1.5.62 - Two-dimensional multivariable basis functions formed with 2. The point [1.4.6] is marked FIGURE 2. the support of each multivariate basis function is an hyperrectangle of size k1 × k2 × … × k n .5] is marked FIGURE 2. The point [1.64 - Two-dimensional multivariable basis functions formed with order 3 univariate basis functions. 1. order 1.4] is marked Denoting the order of the univariate basis functions on the ith input as ki. Therefore.

it is clear that the basis function output and consequently the network output will become smoother as the spline order increases.3.185) In (2. and that splines of order 3 (k=3) are employed.183) This is exponential with n. λ j is the jth knot.ρ = ∏ ki i=1 n (2. The application of (2. or when the desired mapping can be additively decomposed into a number of simpler relationships.61 .N k – 1 ( x ) + ---------------------------. p is the total number of lattice cells.1. On each interval the basis function is a polynomial of order k. 2. the topology conservation principle is obeyed. If we consider the 3rd interval (j=3).N k – 1 ( x ) λ j – 1 – λj – k λj – λ j – k + 1 N1 ( x ) = j j 1 if x ∈ Ij 0 otherwise (2.185). Thus. (2. B-spline networks should only be used when the number of relevant inputs is small and the desired function is nonlinear. and therefore the cost of implementing the algorithm is also exponential with n. that 4 interior knots are employed. Let us assume that we are considering modelling a function of one input. and ad(i) the address of the active basis function in the ith overlay.1Univariate basis functions The jth univariate basis function of order k is denoted by N k ( x ) . Analysing fig.185) is therefore translated into: 148 Artificial Neural Networks . As in CMACs.3.184) where a is the vector of the output of the basis functions. and the output of the network can be given as: p ρ y = i=1 ai wi = i=1 aad ( i ) w ad ( i ) . we know that 3 basis functions have non-zero values for x values within this interval. and I j = [ λ j – 1. and it is defined by the following relationships: x – λj – k λj – x j j–1 . λ j ] is the jth interval. 2.j N k ( x ) = ---------------------------. varying from 0 to 5. and the knots are joined smoothly.

1 ( x ) λ3 – λ2 λ4 – λ3 (2.N 2 ( x ) λ2 – λ0 λ3 – λ1 x – λ1 2 λ3 – x 3 3 -N N 2 ( x ) = ---------------.2 ( x ) -N λ3 – λ1 λ4 – λ2 4 N2 ( x ) x – λ2 3 λ4 – x 4 -N = ---------------.⋅ ---------------.192) (2.186) .193) Artificial Neural Networks 149 .= -----------------------------------------------.x – λ0 2 λ3 – x 3 3 -N N 3 ( x ) = ---------------. (2.189) As: N1 ( x ) = 0 we then have: x – λ1 λ3 – x λ4 – x x – λ 2 4 ( x – 1)( 3 – x) (4 – x )(x – 2 ) N 3 ( x ) = ---------------.⋅ ---------------.= -------------------------------.= -----------------λ 3 – λ 1 λ3 – λ2 ( λ 3 – λ1 ) ⋅ ( λ 3 – λ 2 ) 2 4 2 (2.+ -------------------------------λ 3 – λ 1 λ 3 – λ 2 λ4 – λ 2 λ 3 – λ 2 2 2 Finally.+ ---------------. N 3 ( x ) .191) (2.⋅ ---------------.1 ( x ) λ2 – λ1 λ3 – λ2 As: N2( x ) = N1( x ) = 0 3 N1( x ) 2 2 (2.N 2 ( x ) + ---------------.N 2 ( x ) + ---------------. is: x – λ1 3 λ4 – x 4 4 N 3 ( x ) = ---------------.2 x + ---------------.190) (2. for the 3rd active basis function. N 3 ( x ) : x – λ2 4 λ5 – x 5 5 -N N 3 ( x ) = ---------------.N 1 ( x ) + ---------------.188) The second active basis function.2 ( x ) λ4 – λ 2 λ5 – λ3 As N2 ( x ) = 0 5 5 4 (2.N 2 ( x ) + ---------------.187) =1 we then have: 3 N3( x ) 2 λ3 – x λ3 – x ( λ3 – x ) (3 – x) = ---------------.

1 -6 0 0 2 2.8 3 a) All x domain FIGURE 2.we then have: 5 N3( x ) 2 x – λ2 x – λ2 ( x – λ2 ) (x – 2) = ---------------. p2. p3 1 2 x 3 4 5 0 0. which means that the basis function make a partition of unit. 2.5 p1.⋅ ---------------. p3 p1.4 0. dash-p2. 6 0.194) The polynomial functions active for interval I3 are shown in fig.= -----------------------------------------------.6 2 0.6 2. It can be seen that at the knots (2 and 3) the polynomials are joined smoothly.65 .65 .3 -2 0. and that the sum of the three polynomials is constant and equal to 1.2 -4 0.7 4 0.4 x 2.= -----------------λ 4 – λ 2 λ3 – λ2 ( λ 4 – λ2 ) ⋅ ( λ 3 – λ 2 ) 2 2 (2.2 2.Polynomial b) Only I3 functions of the active basis functions for I3 (solid-p1.8 0. dot-p3) 150 Artificial Neural Networks . p2.

2 multiplication and 2 divisions are performed for each basis function. i ( xi ) j i n (2. The number of basis functions to evaluate can be obtained from fig.3.Triangular array used for evaluating the k-non-zero B-spline functions of order k The basis functions of order k are evaluated using basis function of order k-1.3.The computational complexity involved in the calculation of N k ( x ) can be estimated analysing the following figure: Nk Nk – 1 j+2 N3 j+1 N2 j+k–2 j+k–1 j Nk Nk – 1 j+k–3 j+k–2 N1 N2 j j N3 j+1 Nk j+k–3 N3 Nk Nk – 1 Nk j j j+1 j FIGURE 2. summing the nodes till layer k-1.1. 2 multiplications and 2 divisions are needed for each k level. the total number of basis functions for a multivariate B-spline is: Artificial Neural Networks 151 . Therefore. and so the total cost is k ( k – 1 ) multiplications and divisions.195) i=1 The number of basis functions of order ki defined on an axis with ri interior knots is ri+ki. for k ≥ 2 . So.66 . multivariate basis functions are formed by multiplying n univariate basis functions: j Nk ( x ) = ∏ Nk .66 . 2. In this evaluation. 2.2Multivariate basis functions As already referred. It is easy to see that we have k ( ( k – 1 ) ⁄ 2 ) basis functions.

17 0.3.7.8. and determining the linear weights 152 Artificial Neural Networks .m. For this reason.04 0.2 Learning rules We shall defer the problem of determining the knot vector for the next Section.m.3. respectively.0) and (1. considering splines of order 3 and 4. the least-squares solution (whether standard or with regularization (see Section 2. by obvious reasons.pH with B-splines The minimum and maximum values of the input are 0 and 0.6 1 1018 117 2 105 k 3 3 4 4 r 4 9 4 9 Example 2.m. Employing the Matlab functions create_nn.m.1).3. respectively. the NMLS rule (see Section 2. The output layer of B-splines is a linear combiner.2)) can be used to compute them. basis_num.26 .m.m (this functions employs also basis_out. train_B_splines.p = ∏ ( ri + ki ) i=1 n (2.1) is the one most often employed.m. and 4 and 9 interior knots: Table 2.196) This number depends exponentially on the size of the input and because of this B-splines can only be applicable for problems where the input dimension is small (typically ≤ 5 ).B-spline modelling the inverse of the pH nonlinearity norm(e) 0. index.3. Employing the Matlab functions create_nn. 2.3.1.m. and just concentrate in the determination of the output weights.007 cond(a) 13.75. In terms of on-line adaptation.Coordinate transformation The minimum and maximum values of the input are (0.3 Examples Example 2.3.16 . njk_multi. and train_B_splines.m) and determining the linear weights using a pseudo-inverse.1. the following results were obtained. 2.15 .m and njk_uni. gen_ele.05 0.

for each dimension. a 5-inputs B-spline network.198) (2.3] [3. to explicitly express that the network inputs belong to the set u of input vector x. order 2. Consider.8 819 2.4] [9. ) . In this section we introduce one algorithm that. a saving of more than 150 times would have been achieved in storage and we would require one third of active basis functions.using a pseudo-inverse.1.2] [2.The ASMOD algorithm We have seen. 7+72+72 = 105 storage locations. inf.1 0.4] [9. and the model produced would have the form: y = f ( x 1. for this function. considering splines of order 2 and 3. we would require.3] r [4. inf. x 2. its output is given by: Artificial Neural Networks 153 . the following results were obtained. therefore restricting the use of B-spline networks to problems with up to 4 or 5 input variables. This network would require 75=16. the size of the multivariate basis functions increases exponentially. each one with the same knot vector and order as the one described above.B-spline modelling the coordinate transformation cond(a) inf.) could be well described by a function: ˆ y = f 1 ( x 1 ) + f 2 ( x 2.3. and it would have 2+22+22=10 active functions at any one time. we have assumed until now.0021 2. and on the other hand it is a construction procedure for the network. as the input dimension increases. or s u ( x u ) .27 . that their structure remains fixed and the training process only changes the network parameters according to the objective function to be minimized.4 Structure Selection .9 106 k [2.2] [3. and 4 and 9 interior knots: Table 2. This way. and we have nu sub-models in our network.3.3. if a sub-model network is denoted as s u ( . enables the use of B-splines to higher dimensional problems. for all the neural models considered. for instance.9] [4.9] norm(e) 0. in Section 2.807 storage locations. Each input would activate 25=32 basis functions. More formally. norm(w) 9. x 5 ) (2.8 16. assuming that each additive function would be modelled by a smaller sub-network. On the other hand. that.197) Then. inf. x 3. x 3 ) + f 3 ( x 4. x 5 ) Let us assume that function f(. x 4.52 0. on one hand.3.017 0. with seven univariate sets.

iteratively. As referred before.201) It is not obvious. usually constructed of univariate sub-models for each input dimension.3. 154 Artificial Neural Networks .199) The sets of input variables satisfy: u=1 ∪ xu ⊆ x nu (2.3.1The algorithm The ASMOD (Adaptive Spline Modelling of Observed Data) [119] [120] is an iterative algorithm that can be employed to construct the network structure from the observed data. the associated knot vector and spline order. If the designer has a clear idea of the function decomposition into univariate and multivariate sub-models. the network structure can be found analytically (see Exercise 38 and Exercise 39). to infer what is the best structure for a network. 2.nu y( x) = u=1 su ( xu ) (2. which must be a user-defined parameter. All those parameters are discovered by the algorithm. with the exception of the spline order. and. then it can use this structure as an initial model. the structure of a B-spline network is characterized by the number of additive sub-modules.199) can be transformed in: nu pi nu y = u = 1j = 1 a uj w u j (2. algorithms have been proposed to. determine the best network structure for a given data. each one with 0 or 1 interior knots is employed. Otherwise. the input set of variables for each one. w = u=1 ∪ wu nu and a = u=1 ∪ au . then (2. for each input in this set. For this reason. if we have an idea about the data generation function. However.200) If accordingly the weight vector and the basis function vector are partitioned. an initial model. and use the ASMOD to refine it.4. from a data set.

or just applied to the optimal model resulting from the ASMOD algorithm. in the majority of the cases.11 - ASMOD algorithm Each main part of the algorithm will be detailed below. The spline order and the number of interior knots is specified by the user. For every multivariate (n inputs) sub-models in the current network. 2. END Algorithm 2. Artificial Neural Networks 155 . split them into n sub-models with n-1 inputs. ˆ Determine the best candidate. 1. as it is just a constant. • Candidate models are generated by the applications of a refinement step. m i . remove this sub-model from the network. with just refinement steps. in an attempt to determine a simpler method that performs as well as the current model. combine them in a multivariate network with the same knot vector and spline order. also with no interior knots. 1. this set is often applied after a certain number of refinement steps. ˆ m i – 1 = Initial Model termination criterion = FALSE. where the current model is simplified. Estimate the parameters for each candidate network. and a pruning set. 3. For every sub-model in the current network. If k-1=1. also three possibilities are considered: For all univariate sub-models with no knots in the interior. Care must be taken in this step to ensure that the complexity (in terms of weights) of the final model does not overpass the size of the training set. Three methods are considered for model growing: For every input variable not presented in the current network. Because of this. replace them by a spline of order k-1. WHILE NOT(termination criterion) Generate a set M i of candidate networks. Note that. the latter step does not generate candidates that are eager to proceed to the next iteration.The ASMOD algorithm can be described as: i = 1. For every combination of sub-models presented in the current model. i=i+1. and usually 0 or 1 interior knots are applied. ˆ ˆ IF J ( m i ) ≥ J ( m i – 1 ) termination criterion = TRUE END. 2. a new univariate submodel is introduced in the network. For network pruning. according to some criterion J. creating therefore candidate models with a complexity higher of 1. for every dimension in each sub-model split every interval in two. where the complexity of the current model is increased.

with zero initial interior knots. Example 2. creating therefore candidate models with a complexity smaller of 1.204). and L is the number of training patterns. p is the size of the network (in terms of weights). although a pseudo-inverse implemented with a USV factorization is advisable. For every candidate. the networks must be trained to determine the optimal value of the linear weights. As referred before any learning rule such as the ones introduced before can be applied. • All candidate models must have their performance evaluated.3. and therefore the performance criterion to employ should balance accuracy and complexity. or similar would not be enough.203) (2. (2. and a spline order of 2.204) Akaike´s information criterion K ( φ ) = L ln ( J ) + pφ L+p Final prediction error K = L ln ( J ) + L ⋅ ----------L–p In eqs. The next figure shows the evolution of the performance criterion with the ASMOD iterations.zip) was applied to the pH example. we are looking for the smallest model with a satisfactory performance. 156 Artificial Neural Networks . A criterion such as the MSE. K is the performance criterion.202) to (2.17 . for every dimension in each sub-model. Three criteria are usually employed: Baysean information criterion K = L ln ( J ) + p ln ( L ) (2. as we are comparing models with different complexities. As it was referred before. • The application of the previous steps generates a certain number of candidates.pH example The ASMOD algorithm (a large number of functions compound this algorithm. For every sub-model in the current network. remove each interior knot. therefore the files are zipped in asmod.202) (2.

which measures the complexity of the network. This translates into an increasing number of basis functions. the ASMOD introduces.Evolution of the performance criterion The evolution of the MSE is show in fig.005 0 0 10 20 30 Iteration number 40 50 60 FIGURE 2. new interior knots.68 .01 MSE 0.68 . 2.69 Artificial Neural Networks 157 .67 . is shown in fig.015 0. in the refinement step. The evolution of this parameter.Evolution of the MSE As it is known. 2.-400 -600 -800 Pe rforman ce criterio n -1000 -1200 -1400 -1600 -1800 0 10 20 30 Iteration number 40 50 60 FIGURE 2. 0.

8 0.1 0 0 20 40 60 Pattern number 80 100 120 158 Artificial Neural Networks .3 0.2.69 .60 50 40 C omp le xity 30 20 10 0 0 10 20 30 Iteration number 40 50 60 FIGURE 2. which is an excellent value.70 . 0.5 10-4. The final placement of the interior knots is shown in fig.6 In pPat.2 0.4 0. and less where it remains constant. and it can be concluded that the ASMOD algorithm places more knots in the range where the function changes more rapidly. 1274 candidate models were generated and evaluated. This way. The condition of the matrix of the basis functions outputs is 7. and achieves a MSE of 0.Evolution of complexity The final structure is the best one in the iteration before the last one. The norm of the weight vector is 4.02. which is translated into an error norm of 9. ce n te rs 0. the best network has 51 basis functions. 2.5 0.7 0.9 10-8.

70 . the following figure shows the evolution of the performance criterion: -600 -800 -1000 Performance criterion -1200 -1400 -1600 -1800 0 5 10 Iterations 15 20 25 FIGURE 2.6 1. 2.72 .2 MSE 1 0.8 1.4 1.Placement of interior knots Considering now the use of B-splines of order 3. 2 x 10 -3 1.4 0.FIGURE 2.2 0 0 5 10 Iterations 15 20 25 Artificial Neural Networks 159 .71 .6 0.8 0.Evolution of the performance criterion The evolution of the MSE is presented in fig.

Coordinate transformation Employing splines of order 2.6 0.FIGURE 2.7 0.18 .4 0.3 0. and starting the ASMOD algorithm with 2 splines with 0 interior knots. It achieves an MSE of 1.1 0 0 20 40 60 Input 1 80 100 120 FIGURE 2.2 0. The final knot placement can be seen in the next figure.Final knot placement Example 2.Evolution of the MSE The final model has a complexity of 24.8 0.73 . 2. the evolution of the performance criterion can be seen in fig.74 . 160 Artificial Neural Networks .5 Input 2 0.72 .5 10-8. 0.

-200 -300 -400 Performance criterion -500 -600 -700 -800 -900 -1000 0 5 10 15 Iterations 20 25 30 35 FIGURE 2.04 0.06 0. Artificial Neural Networks 161 .09 0. the ’*’ symbols the knots of the bi-dimensional splines.Evolution of the MSE The final knot placement can be seen in the next figure. The small circles denote the training patterns.08 0.74 .02 0.07 0.05 MSE 0. and the vertical and horizontal lines are located at the knots of the univariate splines.03 0. 0.01 0 0 5 10 15 Iterations 20 25 30 35 FIGURE 2.Evolution of the performance criterion The evolution of the MSE is expressed in the following figure.75 .

and the different off-line training methods employed. envolving genetic programming. in terms of the accuracy obtained (2-norm of the error vector). the condition of the output matrix of the last nonlinear layer.6 Input 2 0.3 0.3 0.4 0.4 Model Comparison In the last sections we have introduced four major neural network models (Multilayer Perceptrons.5 Completely supervised methods If the structure of the network is determined by a constructive method.2 0. with a B-spline network with 62 weights.7 0.3.5 Input 1 0. two univariate sub-models and one bivariate sub-model.3.1 0. Radial Basis functions. CMACs and B-splines) and we have applied them to two different examples.Final knot placement The final value of the MSE is 1.5 0.2 0.zip.flops). 2. can be employed by downloading Gen_b_spline. 2.9 0.4 0.9 1 FIGURE 2. and in terms of training effort (time spent in training. In this section we shall compare them. the model complexity (how many neurons and model parameters). the position of the knots and the value of the linear parameters can be determined by any of the gradient methods introduced for MLPs and RBFs. A different approach.6 10-5 (corresponding to a error norm of 0. minimizing whether the standard or the reformulated criterion can be found in [142].1 0 0 0.1 0. and floating point operations per second .8 0.6 0.8 0. performing single or multi-objective optimization [143]. The final norm of the weight vector is 8.04).3. The application of the BP and the LM methods.7 0.76 . 162 Artificial Neural Networks .

in terms of accuracy.005 ) BP ( η = 0.41 0. and so the number of neurons is 9. with a computation time around three times larger than using the back-propagation algorithm.28 .0. achieves the best results. Table 2.1.45 0. 2.05 ) BP ( η = 0.2 1.Results of MLPs. and the number of parameters is 33.3 61 3.02 8.01 ) BP ( η = 0.Results of MLPs.4 1. with the new criterion Error norm 0.1 Multilayer perceptrons In this case.2 Flops (106) 3.4.3 1. running Windows NT 4.5 93 23 Algorithm BP ( η = 0.3 41. The accuracy obtained with this algorithm is equivalent to the Matlab implementation.5 ) quasi_Newton Levenberg-Marquardt Analysing the last two tables.3 Flops (106) 2.5 2.4.1 3.5 2.29 .3 1.2.9 10-4 Condition number 59 63 63 69 106 112 Time (sec) 1. but it saves more than twice the computational time. We considered the case of a MLP with [4 4 1].5 90 29 40 Algorithm BP ( η = 0.7 7.5 3. as the topology is fixed. with the standard criterion Error norm 1.1 Inverse of the pH nonlinearity The following results were obtained in a 200 MHz Pentium Pro Processor. with 64 MB of RAM.3 1.001 ) quasi_Newton Levenberg-Marquardt Levenberg-Marquardt (Matlab) Table 2.33 0.1 ) BP ( η = 0.2 10-4 Condition number 56 59 189 201 101 Time (sec) 1.3 0.2 1.5 3. minimizing the new criterion. we can see that the Levenberg-Marquardt algorithm. It should be noted that all the figures relate to a constant number of 100 iterations.2 10-4 5.005 5. the model complexity is also fixed. Artificial Neural Networks 163 .

2 106 6.5 0.8 1017 5.2 106 Number of Neurons 10 20 30 40 10 20 30 40 10 20 30 40 10 10 10 10 Time (sec) 0.2 0.7 1.8 1018 ∞ 2.5 0.014 6 10-4 Condition number 7.05 0.2 10-4 2.05 0.57 0.02 0.2.4 0.1 LM(NC) Error norm 0.7 0.8 1016 5 1017 5.1. at the expense of a larger computational time.2 .2 106 2. we can see that both clustering algorithms give slightly better results than the training algorithm with random fixed centres.2 Radial Basis functions Table 2.7 1.5 1 2.9 46 48 28 45 Flops (106) .6 1017 1 1010 2 1016 8. the adaptive one is typically 40 times faster than the standard k-means clustering algorithm.Results of RBFs Number of Parameter s 30 50 70 90 30 50 70 90 30 50 70 90 30 30 30 30 Algorithm Fixed centres Fixed centres Fixed centres Fixed centres Clustering Clustering Clustering Clustering Adaptive clustering Adaptive clustering Adaptive clustering Adaptive clustering BP (SC) alfa=0.2 0. 164 Artificial Neural Networks . specially the LM algorithm with the new criterion.3 0.4 2.4.001 LM (SC) BP (NC) alfa=0.006 0.30 .2 3. produces results typical of larger networks.5 107 5 1016 1 1017 3.7 12 21 29 38 0.03 0. Supervised methods.6 5.5 0.3 10-4 2. All algorithms obtain a very poor conditioning.5 8042 33737 8194 23696 Analysing the last table.4 2.002 5. Between the clustering algorithms.002 2 10-4 0.002 2.4 10-4 0.2 10-4 0.8 1016 1 1017 1.

9 Flops (106) .1.7 The first results are obtained without model selection.86 1.17 0.73 0.1.05 0.1 . A generalization parameter of 3 was employed.32 .4.007 Condition number 13.4 0.Results of CMACs Error norm 0.9 1018 3.2.08 0. Table 2. with a varying number of interior knots.4.8 2.8 1034 ∞ ∞ ∞ Number of interior knots 10 20 30 40 50 Number of Parameter s 13 23 33 43 53 Time (sec) 0.34 0.27 2.65 1.2 .34 The next table illustrates the results with the ASMOD algorithm.6 1 1.6 1 1018 117 2 105 Number of interior knots 4 9 4 9 Spline order 3 3 4 4 Time (sec) 0. Artificial Neural Networks 165 . using interior knots equally spaced.15 Flops (106) .2 .64 0.12 0.31 .88 0. Table 2.12 0.3 CMACs The following table illustrates the results obtained with CMACs.2 .12 0.4 B-Splines Condition number 1.04 0.Results of B-splines (without using ASMOD) Number of Parameter s (Weight vector) 7 12 8 13 Error norm 0.

1 Multilayer perceptrons In this case.0005 ) quasi_Newton Levenberg-Marquardt 166 Artificial Neural Networks .Table 2.1 1.49 0.4. the model complexity is also fixed. It should be noted that all the figures relate to a constant number of 100 iterations.9 0. as the topology is fixed.2 239 Number of interior knots 49 19 Spline order 2 3 Time (sec) 781 367 Flops (106) 2950 282 Inverse coordinates transformation The following results were obtained in a 266 MHz Pentium II Notebook.9 23 2 Flops (106) 2.33 .2 Condition number 7. with 64 MB of RAM.6 0.3 10-3 2.9 0. We considered the case of a MLP with [5 1].5 10-4 1.5 35 13 Algorithm BP ( η = 0.3 0.47 0.2.1 Flops (106) 1. 2.Results of MLPs. with the new criterion Error norm 1.9 Algorithm BP ( η = 0.35 . and therefore the number of neurons is 6.34 .3 1. running Windows 98. 005 ) BP ( η = 0.01 ) quasi_Newton Levenberg-Marquardt Table 2.9 2.17 Condition number 121 102 250 Time (sec) 1 28 2. and the number of parameters is 21.35 Condition number 381 189 190 3. Table 2. 001 ) BP ( η = 0.4.Results of B-splines (using ASMOD) Number of Parameter s (Weight vector) 51 22 Error norm 9.9 2.Results of MLPs. with the standard criterion Error norm 3.9 47 8.7 103 167 Time (sec) 0.

54 72.25 0.6 Flops (106) .5 .9 3.9 3.8 1.3 106 ∞ 3 1014 ∞ 2.7 103 3.46 0.3 104 5.9 3.12 1 0.2 5 2.001 LM (SC) BP (NC) alfa=0.2 Radial Basis functions Table 2.15 3.4 0.15 1 2 0.4 5.6 0.3 22207 55382 3899 6690 Artificial Neural Networks 167 .9 1.9 1.36 .3 30 44 58 84 0.73 0.8 2.5 1016 Number of Neurons 10 20 30 40 50 10 20 30 40 50 10 20 30 40 50 10 10 10 10 Time (sec) 0.2 10.1 104 94 4.Results of RBFs Number of Parameter s 40 70 100 130 160 40 70 100 130 160 30 70 100 130 160 30 30 30 30 Algorithm Fixed centres Fixed centres Fixed centres Fixed centres Fixed centres Clustering Clustering Clustering Clustering Clustering Adaptive clustering Adaptive clustering Adaptive clustering Adaptive clustering Adaptive clustering BP (SC) alfa=0.4 5.9 103 6.26 1.9 1.6 4.6 3.8 1.7 104 6.8 71 18 12.4 7.4.35 1.4 14 0.2 1.2 .2.1 LM (NC) Error norm 13.2 104 1.5 103 2.77 0.4 0.2.89 3 2 1.8 0.4 103 30 532 2.18 Condition number 278 128 348 1.

with a varying number of interior knots.5 10. using interior knots equally spaced.8 1016 ∞ ∞ Number of interior knots 5.13 2.4.27 0.2.4 B-Splines Condition number 5.6 The next table illustrates the results with the ASMOD algorithm.1 5.37 .Analysing the last table. Table 2. 2. A generalization parameter of 3 was employed.2 are also valid here.83 Flops (106) 0.8 26. the comments introduced in Section 2.4.20 Number of Parameter s 22 57 177 Time (sec) 0.3 Time (sec) 1. 168 Artificial Neural Networks .43 1. with all methods (with the smallest network) producing better results than the other methods with larger networks.05 0.4 9.2. Table 2.81 4.3 CMACs The following table illustrates the results obtained with CMACs.52 0.4 9.2 3.6 10-4 Condition number ∞ ∞ ∞ ∞ Number of interior knots 4.3 3.18 1.Results of B-splines (without using ASMOD) Number of Parameter s (Weight vector) 36 121 49 144 Error norm 0.5 5.9 4.9 Spline order 2.38 .1.4 Flops (106) 2. Supervised methods produce excellent results.10 20.4.27 0.89 6 39 The first results are obtained without model selection.017 0.2 2.4 5 35.Results of CMACs Error norm 2.

8 1017 2.(4. In the algorithm column. In terms of computational time. Table 2.2 2.3 Time (sec) 2354 1479 Flops (106) 4923 977 The final model has two univariate sub-models and one multivariate model.Best results (Example 1) Number of Parameters (Weight vector) 33 33 90 70 90 30 51 ANN/ Algorithm MLP (LM/Sc) MLP (LM/Nc) RBF (Fc) RBF (C) RBF (Ac) RBF(LM/Nc) B-splines (A) Error norm 8.01 Condition number 2. Nc New criterion.40.2 Time (sec) 3. The MLPs and the RBF with the LM method obtain a good accuracy with a small number of parameters. In terms of accuracy obtained.40 .2 3. In the last table. Table 2.2 10-4 2 10-4 2.3 10-4 2. C Clustering.5 1016 Number of interior knots 9.4.6 1017 8.6 2.2 10-4 5. 2. and the supervised RBF) are equivalent. Fc Fixed centers. Sc denotes Standard criterion. the former with a reasonable conditioning.5 10-4 Condition number 106 101 3.2 10-4 6 10-4 9.(5.Results of B-splines (using ASMOD) Number of Parameter s (Weight vector) 61 61 Error norm 0.5 45 781 The results obtained with CMACs were very poor and are not shown. all values (with the exception of B-splines using the Asmod algorithm.7 3.3 Conclusions The best results obtained for the first example are summarized in Table 2.8 1016 1. all models obtain similar values.045 0.6) 0. the number of interior knots is specified in this order.1.2 106 7.5 3.3.5) Spline order 2.Table 2. Focusing our attention now in the 2nd example.39 . Artificial Neural Networks 169 .5 1016 2. Ac Adaptive clustering and A Asmod.41illustrates the best results obtained.

045 0.1.1 Pattern recognition ability of the perceptron We already know that the output of a perceptron is given by: 170 Artificial Neural Networks .2 Widrow’s Adalines. In terms of accuracy. 1 } for a perceptron).1. which does not have a lot to do with the classification error. the MLPs with an acceptable condition number.Table 2. however. It is not easy to generalize the scheme to more than 2 classes (we know that we need a discriminant function for each class).1 2 12. there is only the need to have one discriminant function.205) 2. (2. in this special case where there are only two classes.1 we have introduced Rosenblatt’s perceptron and in Section 2.35 0.5 1016 Time (sec) 2. 1 } for an Adaline. MLPs and RBFs obtain a reasonable accuracy with a smaller number of parameters.18 0. Classn . B-splines.Best results (Example 2) Number of Parameters (Weight vector) 21 21 30 49 61 61 ANN/ Algorithm MLP (LM/Sc) MLP (LM/Nc) RBF (LM/Nc) B-splines B-splines (A2) B-splines (A3) Error norm 0. Notice that. 1. namely: There is no rigorous way to determine the values of the desired response (they need not be { – 1. or { 0.5 1016 2.41 . whose discriminant function is given by: gClassn ( x ) = g Class1 ( x ) – g Class2 ( x ) The scheme used has. and used them (somehow naively) for classification purposes. can be introduced. because of overfitting problems.01 Condition number 250 167 2. what we minimize is the (squared) difference between the target and the net input.5 1016 ∞ 2. In terms of Adalines. with the expense of a very large computational time. with spline orders of 2 and 3 achieve very good values. this will have impacts in the regression line determined by the algorithm. 3. several problems.5.1 0. 2. 2.17 0. The condition number is also very high. as a new class.6 5 2354 1479 We did not consider models with a number of parameters larger than the number of training data.5 Classification with Direct Supervised Neural Networks In Section 2. employing the Asmod algorithm.

It is obvious that in this case the discriminant function is: g( x) = w x + b = 0 For the particular case where there are only two inputs. is shown in fig. 1 ] . (2.77 . 2. (2.207) can be recast as: w1 b x 2 = – -----.1.2.1 (see Section 2.y = f ( net ) = f ( w x + b ) .209) Artificial Neural Networks 171 . in the domain [ – 1. while "o" denotes the 1 output.208) is given as: x 2 = – 2x 1 + 2 + ε (2.207) FIGURE 2.AND function revisited Using the weights generated in Ex.206) where f ( . 1 ].2).Perceptron output Using the weights generated by the application of the Perceptron learning rule. T (2.77 (the symbols "*" represent the output 0. 2.x 1 – ----w2 w2 Example 2.19 .1.1). ) is the sign function (see Section 1. (2. the plot of the function generated by the perceptron.1.3. [ – 1.208) T (2.

more generally.5 1 0. but this time with two outputs.210) which shows that g and v are orthogonal. with b = 0 .4 -0. Notice that we can have: g g(x) g ( x ) = -.Discriminant function Notice that a different interpretation can be given to (2. we shall use the same data.8 1 FIGURE 2.5 0 -1 Class 0 Class 1 -0. The perceptron with only one output can separate the data in two classes.6 -0.78 .6 0.2 0 x1 0.205).5 x2 2 1. or can be represented in a 2-dimensional graph as: 4 3. A perceptron with M outputs can separate the data in M classes. The weights control the slope of the decision surface (line in this case).209). or. If we denote by g the points that satisfy (2.4 0.207).207). then the equation of the discriminant can be seen as: g(x) = w x = 0 T (2.211) 172 Artificial Neural Networks .5 3 2.This can be seen in the above figure as the intersection of the surface with the x 1 – x 2 plane. to (2. and by v the vector of the weights.8 -0. and the bias the intersection with x 2 .( x ) – – ---------2 2 (2.209) or (2.2 0. To see how this works. using (2.

79 .Please notice that there are an infinite number of ways to achieve this. 2. The next figure illustrates the concept. 2. as we could create a decision boundary which is located in the middle of the class boundaries.78 .2 Large-margin perceptrons If we focus our attention in fig.78 . 2. FIGURE 2.Hyperplane with largest margin Artificial Neural Networks 173 . it is clear that we could do much better than this. we see that the application of the Perceptron learning rule will determine a decision boundary which is close to the sample in the boundary.80 .5. Class 1 b' h b' l Class 0 FIGURE 2. from a visual inspection of fig. The next figure illustrates this option. But.Two discriminant functions The intersection of these two discriminant functions defines the decision boundary for this problem. This is easy to understand as the rule does not update the weights as long as all the patterns are classified correctly.

which minimizes the squared error between the targets and the outputs. b' H – b' l . additionally. This should be done by ensuring that the (optimal) hyperplane correctly classifies the samples. 1 ) (the boundary point of class 1). only the samples that define the support vectors are considered. b' l = – ----w2 (2. 2. w + b > 0 ) (2. or.215) 174 Artificial Neural Networks . while maintaining a correct classification.m. 0 ) (the boundary points of class 0).212) and (2. and situated half-way between this line and a parallel line passing trough the point ( 1. which will enable. b' H = – ----w2 bl T w x l + b l = 0.214) This can be done by maximizing the sum of the two margins (each one for its class). accordingly to fig. this can be formulated as: Max ( b l – b H ) st w xH + bH ≥ 0 w xl + bl < 0 b l – bH > 0 This is implemented in the function large_mar_per.80 . According to (2. T T (2.2.1 Quadratic optimization The same solution can be found using quadratic optimization. the optimal hyperplane is obtained as 0. The margin of an hyperplane to the sample set S is defined as: ϒ = minx ∈ S ( x. It can easily be seen that this is a line parallel to the line passing through the points ( 0. 1 ). all samples contribute to this error.5x 2 – 0. so that the two lines can be written as: bH T w x H + b H = 0. to treat the case where patterns are not linearly separable. It should be clear to the students that this approach is substantially different from the Perceptron learning rule (which stops updating when all the parameters are correctly classified) and from the delta rule. In this latter case.213). while in the large-margin perceptrons. 2.5x 1 + 0.The blue lines are the hyperplanes which are close to the class boundaries (they are called the support vectors).212) (2. ( 1. Notice that their slope is the same.75 = 0 .213) We want to determine the vector w and the bias bH and bl so that we determine the hyperplane that generates the largest margins. Using the example of the AND function.5.

2.216) (2. assume first that the patterns are linearly separable. x is on the positive side of the optimal hyperplane.81 .81 . b o ): g ( x ) = wo x + bo .5.is an unitary vector. g ( xp ) = 0 .218) wo Looking at fig. we have: w x i + b ≥ 0. replacing (2.2..1Linearly separable patterns To introduce it. in the same direction as w 0 .218) T T T (2. we have: Artificial Neural Networks 175 .Interpretation of (2. we can see that ---------. Perhaps the easiest was to see this is to express x as: wo x = x p + r ---------. wo As. wo (2.218). Admitting that the desired output belongs to the set { – 1. and negative if it is on the negative side). 1 } .217) in (2. d i = – 1 The discriminant function gives a measure of the distance of x to the optimal hyperplane (denoted by w o. by definition. 2.217) where x p is the normal projection of x onto the optimal hyperplane and r is the desired algebraical distance (if r is positive. d i = 1 w x i + b < 0. g(x) = 0 x wo r ---------wo xp FIGURE 2.1.

then we have: 2ρ = 2r = ---------. and is: bo r = ---------wo (2. d = ±1 .220) In particular.225) The last equation means that maximizing the margin of separation is equivalent to minimizing the weight vector norm. the support vectors are the ones that satisfy (2. d =-1 wo 1 .219) This means that: r = g(x ) ---------wo (2.221) The issue here is to find w o. (2.( s) – ---------. di = 1 T wo xi T (2. b o by a constant) as: w o x i + bo ≥ 1. the distance of the optimal hyperplane to the origin is given by (2.222) as equalities. wo (2.T wo g ( x ) = w o r ---------.= r w o . (s) ) = wo x + b o = ± 1 ). for those vectors.. Consider a support vector x (g(x Then.. b o for the optimal hyperplane. wo (2. we can determine the optimal hyperplane by solving the following quadratic problem: 176 Artificial Neural Networks .216) can be recast (we can always multiply w o.. according to what has been said above. we have: 1 .223) r = . (2.(s) ---------.222) + b o ≤ – 1. Therefore. Note that (2. d i = – 1 (s) Notice that with this formulation. We have.220) with x = 0 .224) If we denote by ρ the optimal margin of separation. d =1 wo (s) (s) for which d T (s) = ± 1 .

228) (2.227) term by term: Artificial Neural Networks 177 . only for those constraints that are met as equalities (the support vectors). but where the Lagrange multipliers provide the optimal solution. α ) = 0 ∂b Application of the 1st condition to (2. α i ≥ 0. the multipliers αi are non-zero.2.t. b. b . we first expand (2. According to optimization theory. we can construct a dual problem. constructing the Lagrangian function: Min w w – ---------2 T N αi ( di ( w xi + b ) – 1 ) i=1 T (2.w w Min ---------2 s. i=1.…N The solution for this problem is obtained by differentiating the last function with respect to w.226) This is called the primal problem. i=1.…N T T (2. with the same optimal values of the primal problem.231) Notice that. at the saddle point. b. d i ( w x i + b ) ≥ 1. i=1.232) Therefore.2.227). α ) = 0 ∂w ∂ J ( w.227) s. We can solve this by using the method of Lagrange multipliers. gives: N (2. The primal problem deals with a convex function and linear constraints.2.t.229) w = i=1 αi di xi (2. We therefore get 2 optimality conditions: ∂ J ( w. at the optimum.…N T (2. To postulate the dual problem.230) and the 2nd condition: N αi di = 0 i=1 (2. we have: α i ( d i ( w xi + b ) – 1 ) = 0.

we get: 2x 1 + 2x 2 – 3 = 0 (2. Once the optimal α oi have been found.…N The cost function (2.s.223) for a support vector: ( bo = 1 – wo x T (s) ). b . Notice that the problem has been reformulated only in terms of the training data.235) must be maximized w.238) An implementation of this algorithm is given in lagrange.w w ---------. the 3rd term in the r.233) is 0.h.230): N wo = i=1 α oi d i x i (2.234) 1 and the subtraction of the first two terms in the r. of (2. in the optimum.230).236) α i ≥ 0. according to (2. of (2. Also.h.233) can be recast as: N N N α i αj d i d j x i x j and i = 1j = 1 T Q(α ) = i=1 1 αi – -2 N N αi αj di dj xi xj i = 1j = 1 T (2.2.– 2 T N i=1 w w α i ( d i ( w x i + b ) – 1 ) = ---------.231).m. due to (2.237) and using (2. Applying it to the AND problem.239) Notice that in the implementation of this function. we have: N N N N N w w = i=1 T αi d i w xi = i=1 T α i di j=1 αj dj xj xi = i = 1j = 1 T αi α j di d j xi x j T (2.235) is formulated as a minimization of a quadratic function: 178 Artificial Neural Networks .r.– 2 T T N N N αi di w x i – b i=1 i=1 T αi di + i=1 αi (2. the maximization of (2.233) is – -2 therefore (2. and does not depend on the classifier parameters w.s.t.233) Notice that. d (s) = 1 (2.235) subject to the constraints: N αi di = 0 i=1 (2. i=1. the classifier parameters can be obtained using (2. the Lagrange multipliers.

we introduce a new set of non negative scalar variables.243) Artificial Neural Networks 179 . if a data point violates the following condition: d i ( w x i + b ) ≥ 1.241) The problem here is that it is not possible to construct a separating hyperplane that correctly classifies all data. ….222) and (2.240) (2. in this case we have a missclassification.2Non linearly separable patterns T T T (2.Q' ( α ) = α Hα – f α .2.242) The violating data point ( x i.1. Nevertheless. i=1. we would like to construct an hyperplane which would minimize the probability of the classification error.2. averaged over the training set. N T (2. 2. i = 1. a) Right-side of the decision surface FIGURE 2. { ε i } i = 1 .Different b) wrong side of the decision surface types of violations N For a treatment of this case. The violating data point ( x i. into the definition of the separating hyperplanes ( see (2. d i ) falls on the wrong side of the decision surface. but on the right side of the decision surface. d i ) falls inside the region of separation. 2. T (2. in this case we still have a correct classification. 2.82 .…N This violation can occur in one of two ways: 1. The margin of separation between classes is said to be soft. where H = [ didj xi xj ] and f is a row vector of ones.223) ): d i ( w x i + b ) ≥ 1 – ε i.5.

Our goal is to find a separating hyperplane for which the miss classification error. One way to do this is to minimize: N Φ( ε) = i=1 I ( εi – 1 ) .243).245) The minimization of (2. we simplify the computation by formulating the problem as: 1 T Φ ( w. Unfortunately this is a non-convex non-polynomial (NP) problem. The parameter c has a role similar to the regularization parameter (see Section 2. (2. while for ε i > 1 .82 b) ). and by also minimizing w . but in the correct side of the decision surface (see fig.3.244) where I ( .245) must be done ensuring that (2. 2. the primal problem for the case of nonseparable patterns can be formulated as: N 180 Artificial Neural Networks . defined as: I( ε) = 0. even with ε i > 0 . while the 2nd term minimizes the occurrences of miss classifications. The objective function (2.The ε i are called slack variables.247) Notice that.246) Moreover. ε ) = -.244) by: N 2 Φ(ε) = i=1 εi (2. Notice that this optimization problem includes the case of all linearly separable problem (see (2.247) must be minimized subject to the constraint (2. 2. is minimized.226) ) as a special case. The support vectors are those data points for which (2. setting { ε i } i = 1 = 0 . the point falls into the region of separation. the first term minimizes the Euclidean norm of the weight vector (which is related with minimizing the complexity of the model. in (2.243) is satisfied exactly. to make the problem tractable. ) is an indicator function. ε ≤ 0 1.247). and. Therefore. or yet making it better conditioned).82 a) ). there is miss classification (see fig.243) is satisfied. and it must be selected by the user.1. averaged on the training set. and they measure the deviation of the ith data point from the ideal condition of pattern separability. For 0 ≤ ε i ≤ 1 .7). we can approximate (2. ε > 0 (2.w w + c 2 N εi i=1 (2.

3 Classification with multilayer feedforward networks In the same way as in nonlinear regression problems.5. we can formulate the dual problem for the case of nonseparable patterns.…N Notice that. the topology of feedforward networks can be seen as a nonlinear mapping.…N εi ≥ 0 Using the Lagrange multipliers and proceeding in a manner similar to that described in Section 2.250) 0 ≤ α i ≤ c. Artificial Neural Networks 181 . d i ( w x i + b ) ≥ 1 – ε i. followed by a linear mapping.5.2. 2. where an upper limit (c) is put to the Lagrange multipliers. For classification purposes. b o are obtained as in the separable case using (2.t.Min w w + c ---------εi 2 T i=1 s.2.248) Q(α ) = i=1 1 αi – -2 N N αi αj di dj xi xj i = 1j = 1 T (2.249) subject to the constraints: N αi di = 0 i=1 (2.237) and (2. followed by the construction of an optimal hyperplane for separating the features discovered in the nonlinear layers.1. The optimal solution w o. We must minimize: N T N (2. the first hidden layer(s) can be seen as performing a nonlinear mapping between the input space and a feature space.2.238).1. in this formulation the case of nonseparable patterns only differs from the separable patterns case in the last constraint. i=1. i=1.

249) is updated as: N Q( α) = i=1 1 α i – -2 N N α i α j d i d j K ( x i.253) 182 Artificial Neural Networks . x j ) .Non-linear In this case.) xi ϕ ( xi ) Input space Feature space mapping FIGURE 2. (2. Only a fraction of the data points (the support vectors) are used in this construction. it must satisfy Mercer’s theorem (see [125]. is used: k ( x.5. Here.ϕ( . x j ) is denoted the inner-product kernel: K ( x i. and that ϕ 0 ( x ) = 1 for all x.251) where K ( x i. i = 1j = 1 (2. The one-hidden layer perceptron only satisfies it for some values of the weight parameters.252) It is assumed that the dimension of the input space is m l . provided that in the equations the input data is replaced with their image in the feature space. the complexity of the model is controlled by maintaining the number of neurons in the hidden layer small. the dimensionality of the feature space is on purpose made large. T 2 (2. Example 2. we can use the solution presented in Section 2.2.20 . In the latter case. For the function ϕ ( x ) to be an inner-product kernel. It happens that inner-product kernels for polynomial and radial-basis functions always satisfy Mercer’s theorem.XOR Problem ([125]) Let’s assume that a polynomial inner-product kernel. to obtain a good generalization. x j ) = ϕ ( x i )ϕ ( x j ) = l=0 T ml ϕ l ( x i )ϕ l ( x j ) (2.83 . Chapter 6). x i ) = ( 1 + x x ) .1. Notice that the procedure to build a support vector machine is fundamentally different from the one usually employed in the design of neural networks for nonlinear regression. Therefore.2. so that a decision surface in the form of an hyperplane can be constructed in this space.

257) w. the Lagrange multipliers. we have: Artificial Neural Networks 183 . As ( k ( x.255) then. x i ) = ϕ ( x )ϕ ( x i ) ) T ϕ (x) = 1 T 2x 1 2x 2 x 1 x 2 2 2 2x 1 x 2 (2.254) = 1 + 2x 1 x i 1 + 2x 2 x i 2 + ( x 1 x i 1 ) + ( x 2 x i 2 ) + 2x 1 x i 1 x 2 x i 2 The dimensionality of the feature space is.257) 2 2 2 2 1 – -. the input dimension m i = 2 and the number of patterns M = 4 .-1) (-1. Expanding (2.The input data and the target vector are: Table 2.XOR data x (-1. we have: k ( x. in dual form (2.t.r.+1) t -1 +1 +1 -1 Therefore. is given as: 4 Q(α ) = i=1 1 αi – -2 4 4 α i α j d i d j K ( x i.-1) (+1. the matrix of inner product kernels is: 9 = 1 1 1 1 9 1 1 1 1 9 1 1 1 1 9 K = { k ( x i.42 .251). in this case. x i ) = 1 + x 1 x 2 xi1 xi2 2 = ( 1 + x1 xi1 + x2 xi2 ) = 2 2 2 (2. j = 1 … N (2.253).256) The objective function. x j ) = α1 + α 2 + α 3 + α4 i = 1j = 1 (2. x j ) } i = 1 … N.( 9α1 + 9α 2 + 9α3 + 9α 4 – 2α 1 α 2 – 2α1 α 3 + 2α 1 α4 + 2α 2 α 3 – 2α 2 α 4 – 2α 3 α 4 ) 2 Optimizing (2.+1) (+1. 6.

259) This means that all 4 input vectors are support vectors.[ – ϕ ( x 1 ) + ϕ ( x 2 i ) + ϕ ( x 3 ) – ϕ ( x4 ) ] 8 .260) 1 – 2 1 = -. (2. The optimal value of the criterion is 1 ˆ Q ( α ) = -.258) (2.261) 184 Artificial Neural Networks .9α 1 – α 2 – α 3 + α4 = 1 – α 1 + 9α 2 + α 3 – α 4 = 1 – α 1 + α 2 + 9α3 – α 4 = 1 α 1 – α 2 – α 3 + 9α 4 = 1 The solution of this system of equations gives: 1 ˆ α i.– – 2 + 8 1 1 2 we finally have: 1 – 2 1 2 1 2 2 1 1 2 2 + – 2 – 1 1 1 1 – 2 – 2 0 0 0 1 = -= 0 8 0 –1 -----–4 2 2 0 0 0 0 0 y = 1 2x 1 2x 2 x 1 x 2 2 2 2x 1 x 2 0 0 0 0 = –x1 x2 0 –1 -----2 (2. i = 1 … 4 = -8 (2.. 4 As 4 ˆ w = i=1 1 ˆ α i d i ϕ ( x i ) = -.

1.m will be used. 2. with the theoretical decision boundary (in green) and the estimated one (in black). 2. as expected. the probability of correctly classifying the data.21 . with the same parameters employed before.84 : x2 2 x1 x2 x1 – 2 0 2 Decision boundary FIGURE 2. Artificial Neural Networks 185 . and that the optimal decision line (point) is equidistant to the transformed points.86 shows the test data. we would obtain 83%. Therefore we can see this problem as in fig. Example 2. this dimensionality is reduced to 1. The Matlab file height_weight_gen.85 shows the examples employed for training). using the theoretical discriminant functions is 83. however.The XOR problem in the input and in the feature spaces Notice that in feature space all points are support vectors (points. If we employ a Baysean classifier. 2. For this particular problem. If the discriminant functions estimated with the training data would be used. in this case).2%.mat) is divided equally into training and test data (fig.1 in Appendix 2. This data (which can be found in the file dataex221.84 . to generate 1000 samples. fig. the feature space has a dimensionality of 6.Classification of humans Let us revisit the example of classification of males and females described in Section 2.Notice that using this inner-product kernel.

8 Height (m) 1. 186 Artificial Neural Networks .8 Height (m) 1.Test data and decision boundaries If now support vector machines are employed.6 1.m.3 40 50 60 70 80 Weight (Kg) 90 100 110 FIGURE 2.5 1.9 1.2 1.9 1.86 . using sup_vec_mac.Training 2 data 1. and therefore smaller training data is employed.7 1.85 .4 1. The algorithm does not converge with 500 training data.7 1.3 40 50 60 70 80 Weight (Kg) 90 100 110 FIGURE 2. the following tables illustrate the results obtained.6 1.5 1.4 1.

Results obtained with the support vector machines . Class. even with a small network with 3 hidden neurons.71 0. we obtain results equivalent to the ones obtained with the Baysean classifier Table 2.46 illustrates the results obtained. As we can see. τ = 10 Neurons 2 4 Final Iteration 39 15 Prob.Results obtained with the support vector machines . was employed.71 0.2 0. 0.1 0.Table 2. was employed.200 Training Data.1 0.45 illustrates the results obtained.71 N. Afterwards. As we can see.100 Training Data C 1 10 10 1 σ 0. minimizing the new criterion.44 . The Levenberg-Marquardt method.836 0. 0.71 0. The training stopped as a result of early-stopping.C. Table 2.70 Table 2. The initial nonlinear parameters were obtained with the function rbf_fc.m was used.83 –6 Finally. the Matlab function Train_MLPs. Cor.45 . Artificial Neural Networks 187 . with 3 and 4 hidden neurons.1 0. The 200 training data employed in the last example was used.2 Number of Centers 100 100 93 95 Prob. Table 2. Cor. Class.43 . 0. The same 200 training data was used. Prob.71 0. the Matlab function Train_RBFs.Results obtained with MLPs. Class. even with a small network with 2 hidden neurons. together with the validation data. The training stopped as a result of early-stopping. minimizing the new criterion.2 0.C.1 0. together with the validation data. with 2 and 4 hidden neurons. we obtain results equivalent to the ones obtained with the Baysean classifier.m was used. Cor. The Levenberg-Marquardt method.200 Training Data C 1 10 10 1 σ 0.2 Number of Centers 198 196 178 N.71 0.m.

the performance of the support vector machines is disappointing. It was impossible to obtain probabilities of correct classification greater than 71%. while in all the other cases. the best nonlinear mapping is searched for. With this classifier. 12 parameters (4 means and 8 covariance values) must be estimated.832 0. with large networks (the smallest has 93 linear parameters). With MLPs.Table 2. Cor.46 . 188 Artificial Neural Networks . as the data was generated with a bivariate Gaussian distribution. Class. 0. we know that the Baysean classifier is really the best method for this particular problem. This is an example of the flexibility of the mapping capabilities of the MLP. This is somehow expected as a fixed nonlinear mapping is used. therefore 9 in total.838 –6 Neurons 3 4 Summarizing. therefore in total 12 parameters as in the case of the Baysean classifier. τ = 10 Final Iteration 70 17 Prob. Finally.200 Training Data. The same can be said about RBFs. we obtain essentially the same results with a 2-hidden neurons network. In this case we must estimate 6 nonlinear parameters + 3 linear parameters. where we obtain equivalent results with a network with 9 nonlinear parameters + 3 linear parameters.Results obtained with RBFs.

the threshold of excitation of the latter neuron decreased. The next section will introduce the basic Hebbian learning rule.CHAPTER 3 Unsupervised Learning Chapter 2 was devoted to supervised learning.1. This principle has become very important in neurocomputing.1. Hebb postulated that whenever one neuron excited another. In several situations.2. data reduction. However. and is actually the basis for all mechanisms of unsupervised learning. These techniques will be covered in Section 3.1. This section ends by revisiting the LMS rule. this process facilitated communication between neurons upon Artificial Neural Networks 189 .2). Principal component analysis. A range of these techniques will be described in Section 3. we can devise mechanisms that are able to form internal representations for encoding features of the input data. and linear associative memories in Section 3. we have already mentioned the Hebbian rule (actually two types of Hebbian rules).3.1).1. The anti-Hebbian rule is discussed in Section 3. We can also use competitive learning paradigms. an relate it with autocorrelation of the input data and gradient search. 3. 3. however. a technique used in several different areas of science.3. due to Donald Hebb [18].4. unsupervised rule. relating the increase in the neuronal excitation effect with the frequency that the neurons are excited. that do not imply competition among neurons (those will be studied in Section 3. is discussed in Section 3.3. there is no teacher that can help in the learning process.1 Non-competitive learning In 1949 Donald Hebb proposed a plausible mechanism for learning.1. and express it in terms of Hebbian learning. data reconstruction. by allowing neurons to compete with each other for the opportunity to be tuned to features found in the input data. In this section we shall start with the original. that can be afterwards used for classification purposes.2.1 Hebbian rule When we have described the most important learning mechanisms (see Section 1. etc.

The Hebbian neuron therefore produces a similarity measure of the input w. Assuming that the neuron is linear. the update of the weight vector can be described as: ∆w [ k ] = ηx [ k ]y [ k ] = ηx [ k ]x [ k ]w [ k ] T (3. the learning process is always unstable. there is no target.1.. 3. In terms of artificial neural networks. Now consider that there is more than one input to our neuron.repeated excitation.6) 190 Artificial Neural Networks . we have: ∆w ij = ηx i y j (3. we can modulate this process by increasing the weight between a pair of neurons whenever there is activity between the two neurons. the weights stored in the neuron. the weight norm will always increase. the weights are only updated after a complete set of presentations is presented.2).1.2) It can be easily seen that.1) Notice that. i.e. a small y means that the weight vector is nearly orthogonal to the input vector.5) We can see that the expression within brackets in (3. on the other hand. and therefore it implements a type of memory which is called associative memory.3) If.3). and its input as x. a large y means that θ is small. in (3. provided that the learning rate is positive.1 Hebbian rule and autocorrelation According to (3. Assume also that we employ off-line learning.t.5) is the sample approximation of the autocorrelation of the input data (apart from the factor 1/N) R x = E [ xx ] T (3. or. If we denote the output of each neuron by y. starting with a initial weight vector w [ 0 ] . in other words. we assume that the weights and inputs are normalized. the output is given as: y = w x = x w = w x cos θ T T (3. in contrast to the other learning rules introduced before. the Hebbian rule is: w [ k + 1 ] = w [ k ] + ηx [ k ]y [ k ] = w [ k ] ( 1 + ηx [ k ] ) 2 (3. We therefore have: N ∆w = η k=1 x [ k ]x [ k ] w [ 0 ] T (3. y = wx .4) Assume that we have n patterns and we apply this rule.r. In this case. which means that learning is unsupervised.

3. If (3. as a performance surface.r.t w.7) k=1 1 y [ k ] = --N 2 N k=1 1 w x [ k ]x [ k ]w = w --N T T T N x [ k ]x [ k ] w = w Rw k=1 T T (3. we must have a stable version. Hebbian learning is an unstable process. we have: δV -----.. Any scaling factor could be used.= 2Rw δw (3.1. As V is unbounded upwards. proportional to the square of the output. V is a paraboloid facing upwards and passing through the origin of the weight space.1.8).3 Oja’s rule As it was seen above. To make it useful. 2 (3. and this we can achieve by normalizing the weights. 2 ( w i [ k ] + ηx i [ k ]y [ k ] ) i (3. the Hebbian rule will diverge.10) This can be approximated by: w i [ k + 1 ] = w i [ k ] + ηy [ k ] ( x i [ k ] – y [ k ]w i [ k ] ) = w i [ k ] ( 1 – ηy [ k ] ) + ηy [ k ]x i [ k ] This update is known as Oja’s rule [138].4) to: w i [ k ] + ηx i [ k ]y [ k ] w i [ k + 1 ] = ---------------------------------------------------------------.2 Hebbian rule and gradient search The power at the output of an Hebbian neuron is defined as: 1 V = --N N (3.9) Notice that this equation and (3.: ˆ ∆w = ηR x w [ 0 ] 3. the difference is that the latter employs a forgetting factor. As R is positive definite.11) Artificial Neural Networks 191 . comparing this last equation with (3.The Hebbian algorithm is effectively updating the weights with a sample estimate of the autocorrelation function of the input data. This can be implemented by changing (3.4). but we shall use a unitary norm of the weight vector.1.8) This is a quadratic equation in w.8) is differentiated w. defined in (3. and enables us to interpret V.1. Notice that.7) indicate that the Hebbian rule performs gradient ascent.

3.13) The matrix A A is square and symmetric. Using (3.12) is the defining equation of the eigenvalues/eigenvectors of R. the autocorrelation matrix. with dimensions D*M. Let us now consider that we have a random vector X ( K × D ) = x1 … x D . Let us now have a matrix W. The goal here is to find other directions where the data has also important variance. with mean zero and covariance R = E [ X X ] . and finds a weight vector w. The projection of the data X onto the subspace spanned by the columns of W is given as: Y = XW The reconstructed data from the projections is given as: T T ˆ X = YW = XWW T T (3.16) If we compute the sum of the square of the errors between the projected data and the actual data: 192 Artificial Neural Networks . as A AV = VS SV V = VS S .2 Principal Component Analysis Oja’s rule. whose columns are a subset of the columns of V. (3. as it only normalizes the weight vector to 1. We shall denote it as λ . Oja’s rule (3.Hebbian learning (3. discovers a weight vector which is colinear with the principal component of the input data. we have: A A = VS U USV = VS SV T T T T T T T T (3.1.13). It can be proved that λ 1 is the largest eigenvalue of R. As we already know. and therefore a neuron trained with Oja’s rule is usually called the maximum eigenfilter.14) T T T It is easy to see that V are the eigenvectors of A A . M<D.15) (3. We know. that satisfies the relation: Rw = λ 1 w (3.4) finds the weight vector such that the variance of the neuron output (the projection of the input onto the weight vector) is maximum.11). does not change its direction.12) As it is known. S S is a diagonal matrix of the eigenvalues. We know that R is a symmetric matrix of size D. from Section 5 in Appendix 1 that a matrix A (K observations *D dimensions) can be decomposed as: A = USV T T (3. introduced in the last section.

(3. Notice that as W V has a form I 0 . Thus. This basis is the space spanned by the eigenvectors of the covariance matrix.19) Using similar arguments.18) The trace operation is equal to: T tr ( V D – M λ D – M V D – M ) D = i = M+1 λi . Therefore. in terms of neural networks.21) Therefore.17) can be expressed as: J = tr ( X X ) – tr ( Y Y ) T T (3.17) = tr ( ( I – WW ) X X ( I – WW ) ) = tr ( ( I – WW ) VS SV ( I – WW ) ) where tr() denotes the trace operation.22) Minimizing J is equivalent to maximize (3. and. tr ( Y Y ) = i=1 T λi . (3. (3.T T T ˆ T ˆ J = tr ( ( X – X ) ( X – X ) ) = tr ( ( X ( I – WW ) ) ( X ( I – WW ) ) ) .20) and that M . Artificial Neural Networks 193 . ( I – WW ) V is a matrix whose first M columns are null and the last D-M columns are identical to the corresponding columns of V. T (3. the argument of (3. PCA is used for data compression. criterion (3. the projections of the data onto this space are called the principal components. and the method is called principal component analysis. PCA is doing nothing more than actually operating with the eigenstructure of the covariance matrix of the data.22). it can be shown that: D tr ( X X ) = i=1 T λi . which means that we are looking for a basis which maximizes the variance of the projected data.17) can be seen as: λM 0 = VD – M λD – M V D – M T T T T T T T T T T T T 0 VD – M λD – M VT – M D (3.

1 0.1.5 x2 1 0.Example 3.5 3 3. 3 2.5 -1 -0.Original data 194 Artificial Neural Networks .2 0.Data reconstruction Let us use the function used to generate the data of the example of Section 2.5 C2 0.1 .1 .5 2 1.5 0 0.5 0 -0.01 0.1 .Parameters Mean Covariance Matrices C1 0.1.5 1 1.5 x1 2 2. this time with the values: Table 3.01 0.5 0.2 0.01 0. 3.1 2 2 The data used is plotted in fig.01 0. in Appendix 2.1 .5 4 FIGURE 3.

5 2 1.5 0 -0.5 0 0.5 3 3.5 -1 -0.5 -1 -1. new basis Projecting the data onto the new axis.2 .5 0 0. The Baysean decision boundary is displayed in black.5 1 x1 1. in green.5 Artificial Neural Networks 195 . the eigenvectors of the covariance matrix are computed.5 y2 0 -0.5 2 y1 2. and the following figure incorporates them.5 FIGURE 3.5 3 3.5 x2 1 0.5 2 2.5 4 4.5 1 0.5 1 1.If now PCA is used. we have the transformed data: 1.5 -0.Original data. 3 2.

y1. If we want to reconstruct. Notice however that this is not always the case.5 . 3.3 . 3. than along axis y2. corresponding to the largest eigenvalue (variance).5 1 0. then the transformation matrix would only incorporate one eigenvector. If it was meant to reduce the dimensionality of the problem.5 y1 2 1. as PCA chooses the projection that best reconstructs the data.Transformed data Notice that the classes are more easily separable if the decision surface is taken along the axis y1. the original data. applying (3. Doing that.5 0 -0. 4. which may or may not be the optimal projection for discrimination between classes. Obviously.5 3 2. the reconstructed data only contains information along the first eigenvector.5 0 50 100 150 200 250 300 Patterns 350 400 450 500 196 Artificial Neural Networks . The transformed data.4 . is shown in fig. in this case. in this case. our transformed data would only incorporate one dimension.16) would produce fig.5 4 3.FIGURE 3. as the other component is lost.

5 2 1.Transformed 3 data (with only 1 dimension) 2.FIGURE 3.5 .5 xhat1 2 2.5 1 1.5 -0.4 .5 0 -0.5 xhat2 1 0.Reconstructed data Artificial Neural Networks 197 .5 3 3.5 0 0.5 FIGURE 3.

FIGURE 3. we stay with fig.Original picture of João This. each one with 480 dimensions. can be interpreted as a matrix. To display it.8 . Applying PCA. 3. The following picture displays a vga (640*480) in gray scale. it can be seen as 640 samples. If we transpose it.2 . where the decrease in detail is rather obvious 198 Artificial Neural Networks .Data Compression Image processing is an area where data compression is used significantly.Example 3.6 . of course. we end up with the following figure: If we reduce the number of dimensions to only 100. and reconstructing the image. the image processing toolbox function imshow is employed. using the first 261 components.

8 .Picture of João.FIGURE 3.7 .Picture of João. with 261 components FIGURE 3. with only 100 components Artificial Neural Networks 199 .

the first two related with the original input data.3 . and our fictious input would be considered. which does not take into account the relations between inputs and output.5 0. Let us consider that we did not know this. the model output(s) depend on a large number of variables. In some cases.9 . The next figure illustrates the network output after the first iteration of the Levenberg-Marquardt method. not a nonlinear model.8.8 0. Let us in this example the data used in Ex. and we shall PCA with the original 2 inputs. which is a mistake. a subset of the most "important" inputs for the model.5 18. before actually training the model. input x2 would be discarded. known as input selection is performed. which is the case of neural models. If these results were used to select 2 out of the 3 inputs.6 FIGURE 3.Input Selection It is often the case. This is usually done with PCA. care must be taken when applying this technique. we stay with the eigenvalues 9. from the whole set of available inputs. We must remember that PCA is an unsupervised method. specially in bioengineering applications.3 7. that neural networks are used to model complex processes. and the third one with the fictious data.9 .5 -2 1 1 0. a pre-processing phase. and assumes a linear. 200 Artificial Neural Networks . As this is an unsupervised process.4 x2 0 0.Example 3.2 0 x1 0. plus an additional one which is taken from a uniform distribution. In these cases. in order to select. but the degree of relationship is not known a priori. 2. 0 -0. If we perform an eigenanalysis of the covariance matrix of this enlarged input data.Inverse Coordinate problem This means that the output is exactly reproduced by a neural model with this topology and with the specific set of parameters.5 y -1 -1.

the application of (3. which means that will always produce zero output.Sum of the square of the errors Artificial Neural Networks 201 . so that we have: ∆w ij = –ηx i y j (3. and that it can be seen as performing gradient ascent in the output power field. the only way to achieve zero output is with null weights. otherwise it will find a weight vector which is orthogonal to the input. Using 10 random patterns.10 . and assuming that x 3 = x 1 + x 2 .3 Anti-Hebbian rule We have seen.23). in opposite with the original will perform gradient descent in the output power field.1.262).4 . gives the following results: 18 16 14 12 10 sse 8 6 4 2 0 0 10 20 30 40 50 60 iterations 70 80 90 100 FIGURE 3. in Section 3. that Hebbian rule discovers the directions in space where the input data has largest variance. In this case.3.23) This is called the anti-Hebbian rule. applied to a random initial weight vector. we can always find a direction which is orthogonal to the space spanned by the input patterns.1. Let us know change slightly (3. Example 3.1. but where the third input is a linear combination of the other two. This rule will find weights corresponding to the null-space (or orthogonal space) of the data. and.Application of anti-Hebbian rule Suppose that we consider a model with three inputs. If the data spans the whole input space.

where data patterns are stored in several neurons. in a distributed fashion. it is known that the brain stores information in a more global way. through knowledge of cognitive psychology.1.24) We are used to computer memory.11 .8. which can be retrieved or stored by means of a specific address. in Section 2. and the LMS rule. 202 Artificial Neural Networks ..3.1. which finds the correlation coefficients from the inputs to the output. In the same way as we have analysed the steepest descent algorithm.4 Linear Associative Memories (3. λ max where λ max is the largest eigenvalue of the autocorrelation matrix of the input data.Final results We can understand the behaviour of the anti-Hebbian rule as decorrelating the output from the input. in an opposite way as the LMS rule.3. in Section 2. and confirms that the final weight vector (after 10 passes over the data) is orthogonal to the space of the input data: FIGURE 3. where information is stored as a sequence of bits.3. 3.The next figure illustrates shows the space spanned by the input data. we can show that the rule is stable if: 2 η < ---------.1. as. This process is not the way that our brain stores information.

This was already introduced in Section 1. We can introduce. j = η k=1 X ki D kj. we can see that the output is composed of two terms: the first one. 0 < i < d.26) W i. Let us suppose that we have 2 data sets. and has closer ties to how it is thought that the brain processes information. j . and a second term. If. yj is replaced with dj.25) A high value of this index indicates that the corresponding vectors are similar. and this information. as: n W = η k=1 xk dk . and (3.1) is called forced Hebbian learning. which measures how similar the current input is to all the other inputs. Networks trained with this rule are called heteroassociators. is R i. for this purpose. apart from a scaling factor. and present it with a specific input (out of the ones already stored). If in this equation. let us formulate (3. j 1 = -n n X ki D kj.29) If ηx l x l = 1 . T (3. the cross-correlation matrix: R i. It is interesting that this can be implemented with a variation of the Hebbian learning rule (3. is stored in the network weights. called crosstalk. which is the desired output. k ≠ l x k xl d k T (3.3. 0 < j < d. (3. then the second term is null. then the crosstalk term can have magnitudes similar to the desired response. each one with n patterns of dimensionality d.3. X and D. and perfect recall is achieved. 0 < j < d k=1 (3. after n iterations we shall have: n (3.The Linear Assocative Memory (LAM) is an alternative to conventional digital memory.27) which. and the LAM does not perform how it is expected.262). The goal is to estimate how similar these sets are. as we can see. are we able to reproduce it exactly? To answer it. as they associate input patterns with different output patterns. on the other hand. The question is when we store different patterns in the LAM.28) When a pattern x l is presented to the network. 0 < i < d. the stored pattern are similar. Artificial Neural Networks 203 . we stay with: ∆w ij = ηx i d j If this process is started with W(0)=0. while a low value indicates dissimilarity. If all the stored patterns are orthogonal. the output is: n y = Wxl = η k=1 T T x k dk x l = T ηxl x l d l n +η k = 1.27) in matrix terms.

1.5 LMS as a combination of Hebbian rules If we recall the LMS learning rule.We know that in an d-dimensional space the largest set of orthogonal directions has size d. and suffer from the problem of crosstalk. LAMs are content addressable (in opposition with digital memory). and actually larger than the number of the model parameters. for constructing LAMs. as the information is stored globally in the network weights. and the influence of the crosstalk term is less important. The difference lies in the amount of data and the intended use of the model. 3. while the second term is clearly anti-Hebbian learning.1. For approximation purposes.8. Notice that.1. instead of forced Hebbian learning. for instance. the main goal is to use the network as a model to store patterns.30) The first term in (3.3. the weight update was given as: ∆w ij = ηe i x j As the error is given as e i = d i – y i . In LAMs. while the second one decorrelates the input and the system output.1.3. The first term performs gradient ascent and is unstable. whose number is less than the number of inputs. allowing therefore a range of learning rates to produce convergence to a local minimum. with PCA. while the second one tries to drive the output to zero. as the methods used for training them is the same. orthogonalizing the input data. the application of LMS reduces crosstalk for correlated input patterns. nonlinear associative memories are usually more robust to noise. the storage capacity of a LAM is equal to the dimensions of the input. For function approximation purposes. 3. these patterns must be orthogonal if a perfect recall is sought.1. so we can store at most d patterns.1 LMS for LAMs The interpretation that we gave for the LMS enables us to use this rule. even if this condition is fulfilled. have limited capacity. In practice. a pre-processing phase is usually performed.4. 3.8) we may ask ourselves what is the difference between LAMs and the models introduced in Section 2. Of course. In fact. an error in a weight does not produce a catastrophic failure. as the output is nonlinear. the vector space interpretation of the output is lost. the role of the first is to make the output similar to the desired response. A LAM trained with LMS is called an Optimal Linear Associative Memory (OLAM). studied in Section 2. a major goal for the network is 204 Artificial Neural Networks . As we have used LMS already for function approximation purposes (see Section 2. the LMS rule can be recast as: ∆w ij = ηd i x j – ηy i x j (3.5.31) can be seen to be forced Hebbian learning. Therefore. Therefore. However.31) (3. As we have seen. also in contrast with digital memory. the training data is larger than the number of inputs.1 Nonlinear Associative Memories There is nothing that prevents the use of the forced Hebbian learning when the neurons have a nonlinear activation function.

2. for this particular example. in a distributed manner. in biology. In terms of artificial neural networks. This is not one goal required for LAMs. The typical layout of a competitive network consists of a layer of neurons.32) provides the solution with the smallest 2-norm. Artificial Neural Networks 205 . but needs a global coordination for performing this action. The brain is organized in such a way that different sensory inputs area mapped into different areas of the cerebral cortex in a topologically ordered manner. what we want is memorization. and show how competitive networks can be used for classification purposes. For the case of LAMs. which may be interpreted as inputs not yet supplied to the network producing outputs with the smallest possible error. The use of a pseudo-inverse W = A D * + (3. by using competitive learning. usually inconsistent. We shall describe in more detail this last work in Section 3. 3. in Section 3. This can be more emphasized if we recall Appendix 1. this is easy to perform. the system is unconstrained (more equations than data).its capacity of generalization.1 Self-organizing maps These networks are motivated by the organization of the human brain. where. and also in the brain. with its self-organizing map. for the case of multiple outputs) that provides the least sum of the square of the errors. In biological systems. all receiving the same input. without global control. which introduced this idea and a dense mathematical treatment. 3.2 Competitive Learning Competition is present in our day-to-day life. among the ones stored. most of the work done on this subject is due to Stephen Grossberg [139]. and to Tuevo Kohonen [39]. making use of excitatory and inhibitory feedback among neurons. that is the recall of the stored pattern more similar to the particular input. The key fact of competition is that it leads to optimization without the need for a global control. who took a more pragmatical view. Different neurons (or groups of neurons) are differently tuned to sensory signals.2. and the use of the pseudo-inverse gives the weight vector (or matrix. this task is performed in a distributed manner. For the case of function approximation. and they become specialized to particular features without any form of global control. Self-organizing maps try to mimify this behaviour.2.2. we have more equations than data. so that we have an infinite number of optimal solutions. In digital computers.1. The neuron with the highest output will be declared the winner..

Cooperation... This provides cooperation between neurons... For each neuron. all neurons compute the value of a discrimant function.2. The excited neurons have their weights changed so that their discrimant function is increased for the particular input. (3.2-dimensional SOM All neurons in fig.. Inputs .34) 206 Artificial Neural Networks . We shall describe each of these phases in more detail 3. 3.. T T . An input vector can therefore be represented as: x = x1 … xn The weight vector for neuron j is represented as: w j = w j 1 … w jn .. j = 1…l . there are 3 steps to be performed: Competition. Afterwards.The typical goal of the Self-Organizing Map (SOM) is to transform an incoming signal of arbitrary dimension into a one or two-dimensional discrete map. where neurons are arranged into a single row or column.. A neighbourhood around the winning neuron is chosen. This transformation is performed adaptively in a topologically ordered manner. .33) (3..12 . A onedimensional map is a special case of this configuration.. . 3. 2.1. .12 receives all inputs and are arranged into rows and columns. so that the neurons inside this neighbourhood are excited by this particular input. The neuron with the highest value is declared the winner of the competition. Small random numbers are used for this phase.1 Competition Let us assume that the dimension of the input space is n. The algorithm starts by initializing the weights connecting all sources to all neurons. This makes them more fine tuned for it. FIGURE 3. 1. Adaptation.

This is equivalent to maximize the Euclidean distance between x and w j . a typical choice is the Gaussian function: d j. this represents i – j . Using again an exponential decay.To compare the best match of the input vector x with the weight vector w j we compute the inner products w j x for all neurons and select the one with the highest value.1.3 Adaptation For the network to be self-organizing. (3. There is evidence that a neuron that is firing excites more the neurons that are close to it than the ones that are more far apart.38) which is applied to all neurons inside the neighborhood. A popular choice is: σ ( k ) = σ0 e k – ---τ1 . If we denote the topological neighborhood around neuron i by h ji ( x ) .2. so that the winning neuron (and to a less degree the ones in its neighborhood) has its weights updated towards the input. 3. the weights must be adapted according to the input.1. the learning rate parameter. (3.35) hji ( x ) = e . so that if we denote the winning neuron by i ( x ) . (3.36) where d j. i – -------2 2σ 2 (3. This neighborhood decreases with time. while for the case of a two-dimensional map this can be given as r j – r i .37) where σ 0 is the initial value of the width and τ 1 is the time constant. In the same manner as the width parameter. it can be determined by: i ( x ) = arg min x – w j . where r denotes the position in the lattice. The parameter σ determines the width of the neighbourhood. Therefore the neighborhood should decay with lateral distance. η . it can be given as: Artificial Neural Networks 207 . i denotes the distance between the winning neuron i and the excited neuron j. j = 1…l 3. we must define a neighbourhood that is neurobiologically plausible. i ( x ) ( x – w j ) . For the case of a one-dimensional map. should also decrease with time.2. An update that does this is: ∆w j = ηh j.2 Cooperative process Having found the winner neuron.

10. 2. Actually their difference is in the definition of neighborhood.1. or even by reduced to just the winning neuron. 1. Therefore. the K-means clustering algorithm. the feature map computed by the algorithm has some nice properties which will be pointed out.1 τ 2 = 1000 (3. Density matching. 3.4 Ordering and Convergence k – ---τ2 . while the neighborhood should contain only the nearest neighbours of the winning neuron. the various parameters should have two different behaviours. which for the k-means just includes the winning neuron. where the topological ordering of the vectors take place. Topological ordering.2.01. and a convergence phase. 2. The learning parameter should remain at around 0. The feature map reflects variations in the statistics of the input distribution: regions in the input space from which samples are drawn with a higher probability are mapped onto larger domains of the output space. each one centred around one neuron. As the basic goal of the SOM algorithm is to perform clusters of data. resulting in two phases: ordering. (3.01. and the learning rate parameter should be decreasing from around 0.39) For the SOM algorithm to converge.η ( k ) = η0 e 3. and decrease to a couple of neurons around the winning neuron. The first phase can take 1000 iterations or more. The goal of the SOM algorithm is to store a large set of vectors by finding a smaller number of prototypes that provide a good approximation to the original input space.1 to 0.40) The neighborhood function should at first include all the neurons in the network (let us call this σ 0 . A more detailed discussion can be found in [125].5 Properties of the feature map Once the SOM has converged. 208 Artificial Neural Networks . 3. Approximation of the input space. to fine tune the map. displaying the weights in the input space and connecting neighbouring neurons by a line indicates the ordering obtained.41) The convergence phase usually takes 500 times the number of neurons in the network. it is not surprising that the SOM algorithm is not very different from Alg.1. This can be cone if: 1000 τ 1 = ----------σ0 (3. and the two phases involved in the SOM algorithm.2. The spacial location of each neuron in the lattice corresponds to a particular feature or domain of the input data. This can be achieved if: η 0 = 0.

2 Competitive learning is a form of unsupervised learning. The SOM is able to select a set of best features for approximating the underlying distribution of data. But it assumes a linear model. and we can profit of this to adapt this for classification purposes. data compression is achieved.2. As we saw in Section 3. picking input vectors x at random.Voronoi diagram It is clear from the discussion of SOM that this algorithm is capable of approximating the Voronoi vectors in an unsupervised manner. Selforganized feature maps gives us a discrete version of the principal curves and therefore can be seen as a nonlinear generalization of PCA.1. counterpropagation networks and Grossberg’s instar-outsar networks.1 Learning Vector Quantization Vector quantization is a technique that exploits the structure of the data for the purpose of data compression. This is actually the first stage of Learning Vector Quantization (LVQ).13 . The basic idea is to divide the input space into a number of distinct regions (called Voronoi cells) and to define for each cell a reconstruction vector.4.2. 3. This second phase involves assigning class labels to the input data and the Voronoi vectors. and adapting the Voronoi vector according to the following algorithm: Artificial Neural Networks 209 .2. Classifiers from competitive networks 3.2. which for classification is supplied in terms of class labels. But we have seen that SOM performs data clustering. FIGURE 3. As the number of cells is usually much smaller than the input data. The second stage involves fine tuning of the Voronoi vectors in a supervised manner. Feature selection. We shall describe briefly three different techniques: Learning vector quantization. PCA gives us the best set of features (the eigenvectors) for reconstructing data. Therefore it does not use any form of teacher.

∆w j = ηx ( d – w j ) 3. or the least-squares solution can be implemented. The first layer is trained first. w c ( k + 1 ) = wc ( k ) –αn ( x – wc ( k ) ) 4. All other Voronoi vectors are not modified Algorithm 3. (3.43) 210 Artificial Neural Networks .2.3 Grossberg’s instar-outsar networks Grossberg proposed a network very similar to the counterpropagation network described above (all developed in continuous time).1. 2. IF ξ wc = ξ x 2. Let ξ wc denote the class associated with vector wc and ξ x denote the class associated with x.2 Counterpropagation networks Hecht-Nielsen [36] proposed a classifier composed of a hidden competitive layer. (3.4 Adaptive Ressonance Theory The main problem associated with the instar-outsar network is that the application of the learning rule has the effect that new associations override the old associations. which is a problem common to networks of fixed topology. as training data. in just one step.2. The main difference lies in the input layer. The competitive layer (the first layer) uses the instar rule: ∆w j = ηx ( y – w j ) . w c ( k + 1 ) = w c ( k ) + αn ( x – wc ( k ) ) 3. called the stability-plasticity dilemma. For that purpose.42) and the classification layer the same rule. The way that Grossberg solved the problem was to increase the number of neurons in the competitive layer. in an iterative fashion. 3. connected to a linear output layer. which does normalization. ELSE 3. with y replaced by d. as it is a supervised learning.2.1.2. Suppose that the Voronoi vector wc is the closest to the input vector x. and afterwards the linear classifier is trained using the class labels. in situations where the input data differed from the nearest cluster centre more than . 3.1) and decrease monotonically with k.1. the LMS rule can be used.2.2.1 - LVQ algorithm The learning rate should start from a small value (typically 0.

a predefined value. it is said that there is no "ressonance" between the existing network and the data. Artificial Neural Networks 211 . and this is the reason that this network has been called Adaptive Ressonance Theory. In these situations.

212 Artificial Neural Networks .

2) j∈i Artificial Neural Networks 213 . and recover the dynamic model (1. when the biological inspiration of artificial neural networks was discusses. We now focus our attention to transient behaviour. i – g ( o i ) (4.1. a brief description of Brain-State-in-a-Box is given in Section 4.3. in Section 4. the most simplified model (1. but models where feedback is allowed.3) (4. j ( o j ) – g ( o i ) (4. This chapter is devoted to structures where the signal does not propagate only from input to output. Finally.1): do i = dt X i. and used intensively throughout the book. the most common neurodynamic models are introduced.2. supervised and unsupervised.5)(1. the simple approximation will be used: X i.2. j ( oj ) = o j W j. Hopfield models are described in Section 4. where the Cohen-Grossberg model is also introduced. Having stated that most of the artificial models was only interested in stationary behaviour.1 Neurodynamic Models In Section 1.1) j∈i Again.CHAPTER 4 Recurrent Networks Chapter 2 and 3 covered the field of direct networks.6) was developed. a simplified neural model was introduced. 4. First. i = net i and therefore do i = dt o j W j.

and denoted as the induced local field.+ I i Ri j∈i (4.3): • • • • we shall assume that the dynamics apply to the net input. o i . Using this. i – ---.3) is changed to: Ci dnet i = dt net oj W j. is that the network posses fixed point attractors (see Appendix 3). interconnected in different ways.2 Hopfield models This network consists of a series of neurons interconnected between themselves (no self-feedback is allowed in the discrete model) through delay elements. not to the output. The attractors are the places where this energy function has minima. to differentiate it from a multiplicative model. the concept of energy minimization is usually applied.5). such as memories. however we shall make the following modifications to (4. specifically introduce a bias term for the purpose of a possible circuit implementation.5).4) This is called the additive model. (4. but where feedback among neurons is allowed. introduce a capacitance and a resistance terms Implementing these modifications. so that the attractors can be used as computational objects. We shall consider that we have a set of neurons such as (4. 4. for simplicity. that the neuron output. net is represented as v. the most common form of this model is: Ci dv i = dt vi x j W j. the loss term is proportional to the net input. 214 Artificial Neural Networks .We shall assume. to be useful. and denoted as the neuron state. In these type of networks. where the weights are dependent on the output. as previously. is a nonlinear function of the net input net i . Each neuron has a dynamic represented in (4. the output o is represented as x. so that when a network is left to evolve from a certain initial condition. introduced later on. one of these attractors will be attained. and possibly also of their basins of attraction. The aim of these algorithms is actually controlling the location of these attractors. Usually. which for convenience is reproduced here. introduced by Grossberg.5) This is the form that will be used throughout this chapter. i – --------i + I i Ri j∈i (4. A necessary condition for the learning algorithms.

ϕ ( x ) ai –1 (4. and for this reason a i is called the gain of neuron i. 3. Using this equation in (4. Each neuron has a nonlinear activation function of its own.= – dt N N j=1 i=1 dx j vj W ji xi – ---.log ----------ai 1+x (4.t.12) The terms inside parenthesis are equal to C j dE = – -----dt N (4.10) The energy (Lyapunov) function of the Hopfield network is defined as: 1 E = – -2 N N N W ji x i x j + i = 1j = 1 j=1 –1 1 ---.7) Assuming an hyperbolic tangent function.9) Denoting by ϕ ( x ) the inverse output-input relation with unit gain.5)) and therefore we have: dt Cj dv j dx j ------d t dt (4. The inverse of this function exists: v = oi –1 = ϕi ( x ) –1 (4. The weight matrix is a symmetric matrix ( W ij = W ji ). 2.Ci dv i = dt vi x j W j. we have: dE -----.11) If we differentiate (4. i – ---.r.13) j=1 Artificial Neural Networks 215 .+ I j ------dt Rj dv j (see (4.= ------------------– ai v 2 1+e (4. we have: –a i v ai v 1–e x = ϕ i ( v ) = tanh -----. time.7).+ I i Ri j∈i (4.ϕ j ( x ) dx – Rj 0 xj N Ijxj j=1 (4.11) w. we have: –1 1 1–x v = ϕ i ( x ) = – --.6) The following assumptions will be made: 1.8) This function has a slope of ai ⁄ 2 ) at the origin. we have: –1 1 –1 ϕ i ( x ) = --.

= – dt N –1 –1 j=1 dx j C j ------dt –1 2 dϕ j ( x j ) -------------------dx –1 (4.2 -0..------. dϕ j ( xj ) dx j dv j dϕ j ( x j ) -----.13) can be expressed as: dt dx dt dt dE -----.plot of ϕ ( x ) 216 Artificial Neural Networks .4 -0.As v j = –1 ϕj ( xj ) .= -------------------.6 -0.14) Let us focus our attention in the function ϕ ( x ) .6 0. and therefore (4.4 0.8 -1 -3 -2 -1 0 x 1 2 3 FIGURE 4.1 .8 0.2 tanh(x/2) 0 -0. 1 0.= -------------------. The following figures show the behaviour of the direct and the inverse nonlinearity.

dE -----.6 4 2 0 -2 -4 -6 -1 -0. which will be described now. The stable points dx j dE are those that satisfy -----. and that it is a stable model.6 -0.8 -0. which was introduced in [30].4 -0. and compared with the continuous version.= 0 .1 Discrete Hopfield model The model described above is the continuous version of the Hopfield model.14) is also always non negative. The discrete version is based on McCulloch-Pitts model. The asymptotic values of the output are: Artificial Neural Networks 217 . As the square in (4.4 0.2.6 0.= 0 .plot of –1 ϕ (x) –1 It is clear that the derivative of ϕ ( x ) will always be non negative.8 1 FIGURE 4. John Hopfield introduced a discrete version [29]. Previously.2 0 x 0.≤ 0 dt (4.15) This result and the fact that (4. allows us to say that the energy function is a Lyapunov function for the Hopfield model.2 . We can build a bridge between the two models if we redefine the neuron activation function such that: 1.11) is positive definite. which corresponds to the points where ------. dt dt 4.2 0.

and so are their stable values. when this is not true ( a j is small). where self-loops are not allowed.2. 218 Artificial Neural Networks .16) 2.18) makes a significant contribution to the energy function.2 Operation of the Hopfield network The main application of the Hopfield network is as a content-addressable memory. the network evolves in such a way that the fundamental memory is retrieved.ϕ ( x ) dx a j Rj 0 xj (4. The midpoint of the activation function lies at the origin.10) in (4.ϕ j ( x ) dx Rj 0 xj (4. we may set the bias. 4. we have: 1 E = – -2 N N N W ji x i xj + i = 1j = 1 i≠j j=1 –1 1 --------. The energy function of a network with neurons with these characteristics.18) The integral of the second term in this equation is always positive and assumes the value 0 when x = 0 . The stable points.17). In view of this. By this. i. specially in regions where the neuron outputs is near their limiting values.17) Making use of (4. corrupted with noise) is presented to the network. If a j has a large value (the sigmoid function approaches a hard non-linearity). in this case. when a pattern (possibly one of these memories. depart from the corners of the hypercube where the neuron states lie. we mean that the network stores in its weights fundamental memories. and. Of course these fundamental memories represent stable points in the energy landscape of the net. the second term is negligible and (4. ϕ j ( 0 ) = 0 . v j = – ∞ (4. v j = ∞ – 1.18) can be expressed as: 1 E ≈ – -2 N N W ji x i x j i = 1j = 1 i≠j (4. the energy function of the continuous model is identical to the discrete mode. However.19) This last equation tells us that in the limit case ( a j = ∞ ). is given as: 1 E = – -2 N N N W ji x i x j + i = 1j = 1 i≠j j=1 –1 1 ---. the second term in (4..e.xj = 1. I j to 0 for all neurons.

The discrete version of the Hopfield network requires that there is no self-feedback. of the Hebbian type (3. the weight matrix is obtained as: 1 W = --N M ξξ – MI . in vector terms.e.21) where the last term is introduced so that W ii = 0 . The stability of a state can be inspected by verifying if the following system of equations. an input a is applied to the network.22) Artificial Neural Networks 219 . and is achieved by setting the initial network state to a . Therefore. there are two main phases in the operation of the Hopfield network: a storage phase and a retrieval phase. 4. the application of the hebbian rule to the ξ i.2. Notice that the weight matrix is symmetric. if it is not a stable point. and updates its state according to the following rule: 1.1 Storage phase Hopfield networks use unsupervised learning.Let us denote by ξ i.262). i = 1…M the set of M fundamental memories that we want to store in the net. The term 1 ⁄ N is employed for mathematical simplicity.2 Retrieval phase In the retrieval phase. Therefore. k=1 (4. v j = 0 This procedure is repeated until there is no change in the network state.2. This input is usually called a probe. Then. i. is satisfied: (4. the network will evolve to the stable point corresponding to the basin of attraction where a lies.2. Each one of them will be encoded onto a stable point xi in the net. Assuming that the weights are initialized to 0. k=1 T (4. v j > 0 xj = – 1. 4.20) where ξ kj is the jth value of pattern k. denoted as the alignment condition. until a stable state is reached. and N is the number of neurons. each neuron checks its induced vector field.2. v j < 0 x j. i = 1…M fundamental memories leads to W ji 1 = --N M ξ kj ξ ki . By presenting an input pattern a . randomly.

A 3-neurons model 0 2 2 1 Suppose that the weight matrix of a network is W = -. These other stable points are called spurious states.2. for simplic- ity. ξ v .21) with this data. and these are undesirable memories.2 0 2 3 2 2 0 3 (4. – 1 – 1 – 1 and 1 1 1 . is applied to the network.23). the network recalls the fundamental memory that is closer in terms of Hamming distance. Let one input pattern be equal to one of the stored memories.y = sgn ( Wy + b ) Example 4. that the bias is 0. These points. crosstalk between patterns implies that a perfect recall is not possible. Here. that we did not want the network to posses. and therefore. we can see that there are only 2 stable states. as the network is discrete.4). we have: – 1 – 1 – 1 → – 1 – 1 –1 –1 –1 1 → –1 –1 –1 –1 1 –1 → –1 –1 –1 –1 1 1 → 1 1 1 1 –1 –1 → –1 –1 –1 1 –1 1 → 1 1 1 1 1 –1 → 1 1 1 1 1 1 → 1 1 1 Analysing the above results. the local induced field for neuron i is given as: 220 Artificial Neural Networks . that is not a stable pattern. and allowing self-feedback for simplicity.1. Some of the other corners might or might not be stable points. Employing (4.1 . Another problem related with these networks is their limited storage capacity. Assume also. they can not be recalled. There are 2 = 8 states in the network. Let us see the alignment condition for all possible states.23) . the effect is that fundamental memories might not be stable memories. 4. The weight vector employed was actually the result of applying (4. Assuming zero bias. lie in corners of an hypercube. In linear associative memories (see Section 3.3 Problems with Hopfield networks The fundamental memories are stored in stable points of the energy surface. Notice also that when an input pattern.

24). The data can be found in numeros. we have: N vi = j=1 1 --N M ξ kj ξ ki ξ vj k=1 1 = --N M N ξ ki k=1 j=1 ξ kj ξ vj 1 = ξ vj + --N M N ξ ki k = 1 k≠v j=1 ξ kj ξ vj (4.3 . 4.mat.3 . the recall of m patterns is guaranteed (see [125]) only if: m 1 --. Typically.Recognition of digits The aim of this example is to design an Hopfield network to recognize digits such as the ones show in fig.24) Using (4.10 digits Each digit is drawn in a 4*5 matrix of pixels.N vi = j=1 W ji ξ vj (4. (4.20) in (4. Artificial Neural Networks 221 . Therefore each digit can be represented as a vector of 20 elements. The second term is the crosstalk between the fundamental memory v. for small n.15n. while for large n. and the other k fundamental memories.2 .< ----------n 2 ln n Example 4. each one with two different values. This means that we can employ an Hopfield network with 20 neurons.25) The first term in (4. -1 (white) and +1 (black).26) FIGURE 4.25) is just the jth element of the fundamental memory. the number of random patterns that can be stored is smaller than 0.

after 101 iterations we recover a spurious state. one and two. 91 and 101.m. at iterations 1.Evolution of the ’1’ distorted pattern If we try to store more than 3 patterns. according to what has been said before. If we change each bit of one the 1 stored pattern with a probability of 25%. Using retrieve_hop. 61.2. (4. Notice that. the weight matrix is computed. 4. we shall notice that all fundamental memories can not be stored with this network size.4 The Cohen-Grossberg theorem One of the most well known model in neurodynamics has been introduced by Cohen and Grossberg [140]. Stored pattern Iteration 1 Iteration 31 Iteration 61 Iteration 91 Iteration 101 FIGURE 4. 20 neurons are able to store 3 patterns.27) They have proved that this network admits a Lyapunov function 1 E = -2 N N N uj C ji ϕ i ( u i )ϕ j ( u j ) + i = 1j = 1 i≠j j=10 b j ( λ )ϕ' j ( λ ) dλ .28) where: 222 Artificial Neural Networks . 31.We shall start by memorizing just the 3 first digits: zero.m. Using store_hop.4 . we can conclude that all fundamental memories have been stored as stable memories. The following figure illustrates a spanshot of this evolution. and let the network to evolve. They proposed a network described by the following system of nonlinear differential equations: du i = ai ( ui ) b i ( u i ) – dt N C ji ϕ i ( u i ) i=1 (4.

31) In these equations. By applying positive feedback. with positive eigenvalues. Comparing (4. 3. Its main use is as a clustering device. then through application of (4.3 Brain-State-in-a-Box models The Brain-State-in-a-Box (BSB) model was first introduced by Anderson [16]. if the starting point is not a stable point.1.31).Correspondence between Cohen-Grossberg and Hopfield models Hopfield Cj vj 1 vj – ---. The use of the limiting function.5) and (4. This means that the trajectories will be driven to a "wall" of the hypercube and will travel through this wall until a stable point is reached. 2. and ϕ ( y ) is a piecewise linear function (see Section 1. W is an n*n symmetric matrix. however.ϕ' j ( λ ) = d ϕ j ( λ ) . the activation function ϕ i ( u i ) is monotonically increasing. dλ provided that: 1. Artificial Neural Networks 223 . x [ k ] is the state vector at time k. together with a limiting activation function. we can conclude that the Hopfield network is a special case of Cohen-Grossberg model: Table 4. (4. β is a small positive constant called the feedback factor.28). the function a i ( u i ) is non-negative. The network is called BSB as the trajectories are contained in a "box".1 . centred at the origin. the Euclidean norm of the state vector will increase with time.+ I j Rj – W ji ϕj ( vj ) Cohen-Grossberg uj aj ( uj ) bj ( uj ) C ji ϕj ( uj ) 4.3.2).30) (4. to drive the network states to a corner of an hypercube.30) and (4.2. The BSB algorithm is as follows: y [ k ] = x [ k ] + βWx [ k ] x[ k + 1 ] = ϕ( y[ k ] ) (4. The BSB uses positive feedback. constrains the state trajectories to lie inside an hypercube.29) the weight matrix is symmetric.

2 .4. The weight update is therefore: ∆W = η ( x [ k ] – Wx [ k ] )x [ k ] (4.2 Training and applications of BSB models T (4. Both generate basis of attraction but in the BSB these are more regular than in Hopfield networks.33) The BSB is usually employed as a clustering algorithm. BSBs may be reformulated as a special case of the Cohen-Grossberg model. It can be proved (see [141] and [125]) that using the correspondence expressed in for the two models: Table 4. where the target data is replaced by the input training data.1 Lyapunov function of the BSB model As Hopfield networks. creating a cluster function. as Hopfield networks.32) BSB models are usually trained with the LMS algorithm.Correspondence between Cohen-Grossberg and BSB models BSB vj 1 –vj – C ji ϕj ( vj ) Cohen-Grossberg uj aj ( uj ) bj ( uj ) C ji ϕj ( uj ) the energy function of a BSB model Is: E = – x Wx 4. 224 Artificial Neural Networks .3.3. rather than an associative memory. The BSB divides the input space into regions that are attracted to corners of an hypercube.

6 0. with two neurons. and a weight matrix of 0. move towards a side of the hypercube.5 illustrates the trajectories for 4 different probes: T 0.3 T – 0.Example 4. Artificial Neural Networks 225 . 4.2 0.005 is considered here. the states in a preliminary phase.3 .1 .8 – 0.5 . -0. fig. with β = 0.005 0.4 T 0.2 T – 0.m.035 -0.1 0.A two-dimensional example (taken from [125]) A BSB model.1 FIGURE 4. and afterwards move along this side towards the stable state.Different trajectories for a BSB model As it can be seen.035 Using retrieve_bsb.

226 Artificial Neural Networks .

A1. If certain conditions are satisfied.. Consider this system: Artificial Neural Networks 227 . very close. no solution at all. or when it is square but not of full rank. In this case.1) has the familiar solution: w = A t –1 –1 (A1. in some sense.APPENDIX 1 Linear Equations. A does not exist. Therefore. such that y is. This output layer can be described as a set of linear equations.e. when A is not square. and the problem of training means to determine the coefficients of these equations in such a way that a suitable criterion is minimized. involves a linear output layer. or many solutions. a solution (or infinite solutions) exist. Generalized Inverses and Least Squares Solutions The topology of supervised direct neural networks. A1. to t.2) However. in order to have a solution. the number of patterns is equal to the number of linear parameters. can be described as: y m × 1 = Am × p w p × 1 (A1. When A is square. In this appendix some concepts related with linear equations are reviewed and some other are introduced.1) must be consistent. and full rank.1) might have a single solution.1) We know that the problem of training is to determine w. with a linear activation function. whose main application is approximation. i. the equations in (A1. system (A1.2 Solutions of a linear system with A square Let us start with the case where p=m. (A1.1 Introduction The output layer of a direct supervised neural network. or equal.

5) A1.3). and that the system is consistent. (A1.6) i CA Cy Assume now that we interchange the columns of A and the corresponding elements of w. the first r rows of (A1. in such a way that the first r columns of A are linear independent.1). Therefore.3).4) Comparing (A1. we can see that they cannot be true simmultaneously. Therefore. (A1. and the other p-r rows are dependent. The linear relationship existing between the rows of A must also be present in the solution vector. but assume that the rank of A is r. there are infinite solutions. we had: 2 3 1 w1 14 w2 = 6 .3 Existence of solutions Consider again (A1.7) 228 Artificial Neural Networks . and therefore (A1. 1 11 3 5 1 w3 22 then.3) has no solution.4) with the last row of (A1. If.2 3 1 w1 14 = 6 1 1 1 w2 3 5 1 w3 19 Twice the 1st equation minus the 2nd gives: w1 3 5 1 w 2 = 22 w3 (A1.3) (A1. as we shall see below. Assume that we arrange the equations such that the first r rows of A are linearly independent.6) can be expressed as: w1 w2 i i i A1 A2 = y (A1. The system has inconsistent equations. instead of (A1.1) could be expressed as: A i w = i y i (A1.

10) + + + (A1. 1 1 1 –1 i –1 Using Matlab. depending on the value given to w2 . if A is full-rank and square. which satisfies the following four conditions: AA A = A A AA = A + T + (A1. A1. the solutions of 2 –1 (A1.11) (A1.5).5) have the form: w = w1 w2 = 4 – 2 w 2 2 –1 w2 (A1.5 Generalized Inverses + Penrose showed that.5).Now. then all linear dependencies in the rows of A appear also in the elements of y. w 2 is a scalar. we see that A 1 y = 4 and A 1 A 2 = 2 . This can be tested by using an augmented matrix A y and computing its rank. as A 1 is non-singular. we can pre-multiply (A1. If the rank of this augmented matrix is the same of the rank of the original matrix. for any matrix A. A1.7) with A 1 .8) Let us apply this to system (A1. Therefore. we have an infinite number of solutions for (A1. and have: w1 w2 = A1 y – A1 A2 w 2 w2 –1 i –1 –1 w = (A1.12) (A A) = A A + Artificial Neural Networks 229 . In this case.9) Therefore. then there are no linear dependencies to test for and the system is also consistent.4 Test for consistency We have assumed in the previous section that the system of equations is consistent. there is a unique matrix A . A 1 = 2 3 and A 2 = 1 . On the other hand.

we T + T (A1. It is now common to use these names for matrices satisfying only the first (A1. We shall use this here also. A pseudo-inverse can be obtained using a SVD decomposition: In this case the matrix A is decomposed as: A = USV .17) (A1. and P and Q are obtained by elementary operations (so that they can be inverted) can be applied. and S is a diagonal matrix containing the singular values of A.5) again.19) where ∆ is a diagonal matrix.18) It should be noted that.14) where U and V are orthonormal matrices.( AA ) = AA + + T + (A1. (A1. (A1.10) of Penrose’ conditions. latter on other authors also suggested the name Pseudo-Inverse. T (A1. to obtain a generalized inverse.16) –1 = U . Actually. Let’s look at system (A1. One possible decomposition is: 230 Artificial Neural Networks .V T –1 = V . we must have: –1 + T T + D r 0 D –1 0 D r 0 I 0 Dr 0 r A = V Dr 0 U ⇔ SV A US = = r = S 0 0 0 0 0 0 0 0 0 0 0 0 + T T T (A1. and denote the matrix satisfying this condition as G. As U can therefore have: AA A = A ⇔ USV A USV = A T + T T SV A US = U USV V = S For this equality. to hold.13) He called A the Generalized Inverse. Orthonormal matrices means that they satisfy: U U = UU = I The S matrix has the form: Dr 0 0 0 where D r means that this is a diagonal matrix with rank r ≤ p . any decomposition that generates: A = P∆Q .15) S = . there are other decompositions that can be applied besides SVD.

–1 3 0 1 –2 0 0 0 0 (A1. we have: 0.24) (A1. 25 0. 41 7. 22 – 0. 41 0.1 0 2 3 -.22) A = Q ∆P + – 1 . 39 – 0.12) and (A1.23) Notice that this pseudo-inverse satisfies conditions (A1. the pseudo-inverse is: Artificial Neural Networks 231 .13). 83 0.25) S = (A1. 88 – 0. 25 – 0. 22 – 0. 84 0 0 0 0 (A1. If we compute the pseudo-inverse using a SVD decomposition.20) (A1. 39 0. but does not satisfy conditions (A1. 41 0.26) Therefore. 16 0 0 0 0. 82 U = 0. 82 0.–1 = (A1. 41 0. 88 0.1 0 0 1 P = -.10) and (A1.21) 2 0 0 ∆ = 0 –1 0 -2 0 0 0 Therefore. 52 – 0.– 1 1 2 3 1 -2 Q = 0 1 0 0 1 -2 –1 1 (A1. 82 V = 0. 52 – 0.11).

33) 232 Artificial Neural Networks . using the pseudo-inverse (A1. 11 0. 94 – 0. In this case. and 0 0 0 0 0 2 H – I = 0 0 –1 0 0 –1 Therefore.30) Notice that. 05 – 0. (A1.5).9).29) (A1.31) (A1. 4 + 2w 3 4 0 0 2 ˆ w = 2 + 0 0 –1 w3 = 2 – w3 0 0 0 –1 –w3 Notice that this is the same solution as (A1. when the rank of A is less than the number of unknowns and the system is consistent. 38 0.32) (A1. 38 Notice that this pseudo-inverse satisfies all of Penrose’s conditions. H=I. 27 0 0 0.0. 27 0. when A is full-rank. we have: 1 0 2 H = 0 1 – 1 . (A1. we have: + ˆ w = A y + ( H – I )z + (A1. Let us consider again system (A1.6 Solving linear equations using generalized inverses A solution of system (A1.27) A1. 27 – 0. 05 –1 + T A = V D r 0 U = – 0. there are many solutions. Denoting as H = A A.28) As we know.23).1). is just: + ˆ w = A y (A1.

11 0. 27 0.37) Notice that matrix H has rank 1. 38 1 -3 H = 1 -3 1 -3 1 -3 5 -6 1 – -6 1 -3 1 – -. among the rows and columns of A. 05 + A = – 0.Using the pseudo-inverse (A1. we see that. In the latter method.33). 27 – 0. if the matrix is large.30). corresponding to ( H – I )z . and rank r. we must choose a non-singular matrix.36) –2 -3 2. 7 ˆ w = – 0. assuming that the system is consistent..34) (A1. Artificial Neural Networks 233 . which is a more complex task. and 6 5 -6 1 -3 1 – -6 1 – -6 1 -3 1 – -6 1 – -6 (A1. 6 + 1 -3 6. Comparing the use of the generalized inverse with the method expressed in Section 3. 38 0. 94 – 0. 05 – 0. in the former. 27 0. we shall have 1 linear independent solution (corresponding to z = 0 in (A1. and p-r independent solutions. 1 1 -3 1 -3 –1 -6 1 – -6 1 -3 –1 w -6 1 – -6 (A1.27). If matrix A has p columns. and perform some calculations. we only need to find one generalized inverse. and therefore we obtain the same solutions as (A1.35) 2 – -3 H–I = 1 -3 1 -3 and therefore: (A1. we have 0.

Graphical representation of (A1. FIGURE A1. in the normal case where there are as many patterns as parameters. lies in the space spanned by the columns of matrix A (in red). in black.7 Inconsistent equations and least squares solutions Looking at a graphical representation of system (A1. + (A1.5) When a system is inconsistent. it will no longer remain in the column space of A.39) 234 Artificial Neural Networks . in this case. it has no solutions.1 . as the right-hand-side (our t vector). Of course we shall have solutions with zero error for all target vectors that lie in that space. and therefore there will be no solution for it.A1. we could have solutions with zero error for every vector y.38) If we change the 3rd component of vector t to 30. Notice also that. . we can see that there is a solution with zero error. is to derive a solution that minimizes the norm of the errors obtained: m Ω = i=1 ( t i – A i. and the vectors are linear independent. w ) = ( t – Aw ) ( t – Aw ) 2 T (A1.5). What we can do is project vector t onto the column space of A using: P A = AA . What we can do.

the error is great with this pseudo-inverse. FIGURE A1.41) The next figure shows the effect of computing the pseudo-inverse with (A1. As it can be seen.38) can be obtained as: T T T ˆ ˆTˆ Ω = E E = t P A⊥ PA⊥ t = t P A⊥ t (A1. 2 .19).40) where PA⊥ is the projection operator onto the space orthogonal to the column space of A. with t 3 = 30 2 Notice that PA and P A⊥ are idempotent matrices. + (A1.2 . as it is no longer orthogonal with the column space of A. Therefore. which means that P A⊥ = PA⊥ . and the projection on the orthogonal space in magenta. as the distance from the extreme of this projection to the actual point: + + ˆ E = t – AA t = ( I – AA )t = P A⊥ t . In fig.5).Graphical representation of (A1. Artificial Neural Networks 235 . the projection of t onto the column space of A is shown in blue. the minimum of (A1.The minimum error is obtained when A is computed using SVD.

Graphical representation of (A1.44) (A1.FIGURE A1.1 Least squares solutions using a QR decomposition: In this case. a pseudo-inverse can be obtained as: (A1. with t 3 = 30 .7. Q 1 with size (m*r) and Q 2 with size (m*p-r).3 .5). where Q is an orthonormal matrix. (A1. Using this decomposition. and R is: R1 S 0 0 where R 1 is a upper-triangular matrix with size r*r. the matrix A is decomposed as: A = QR . with different generalized inverses A1.43) 236 Artificial Neural Networks . and Q = Q1 Q2 .42) R = .

82 Q = – 0. 6 0 0. 41 – 3. we have: – 0. 44 0.47) T P A⊥ = QQ – Q 1 Q 2 I 0 Q 1 Q 2 0 0 = Q1 Q2 0 0 Q1 Q2 0 I T = Q2 Q2 T (A1. the graphical representation is exactly as shown in fig. 81 0. 88 0.+ A = R1 0 Q1 Q2 0 0 –1 T = R1 Q1 0 –1 T (A1.46) Q Q 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Determining now PA and P A⊥ with this decomposition.48) This means that the training criterion (A1. we have: + AA = Q R 1 S R–1 0 T 1 Q = Q1 Q2 I 0 Q1 Q2 0 0 0 0 0 0 T T = Q 1 Q1 T (A1. as: + AA A = Q R 1 S R–1 0 T R 1 S R S R –1 0 R 1 S R S 1 1 = Q 1 = Q 1 = A (A1. 27 – 0. 65 – 0. 88 – 1. 53 – 0. 2 .50) R = (A1. 22 – 0.m function of Matlab.45) We can easily prove that this is a pseudo-inverse. 41 – 0.49) Example A1. 74 – 5.41) can be expressed as: T T T ˆ ˆTˆ Ω = E E = t Q2 Q2 t = Q2 t 2 (A1.51) The projection matrix P A is given as: Artificial Neural Networks 237 .1 .Least-squares solution with QR and SVD In this case. 65 0 0 0 (A1. Using the qr.

28 – 0. 39 and the coefficients are: (A1. in this case: –8 -3 + ˆ w = A t = 22 ----3 0 13 ----6 –4 -3 0 –5 -6 2 -3 0 (A1. 28 . 28 0. 0.56) 238 Artificial Neural Networks .55) If we use the Matlab function pinv. obtained with SVD.52) and (A1.53).54) (A1.m. but the pseudo-inverse is: 0.53) The pseudo-inverse is: 2 -3 + A = –1 -3 0 The coefficients are. 05 + A = – 0.52) The projection matrix P A⊥ is given as: 2 -3 = –1 -3 –1 -3 1 – -3 1 -6 1 -6 1 – -3 1 -6 1 -6 P A⊥ (A1. 94 – 0. 11 0. 039 0. the projection operators are the same (A1. 05 – 0.PA 1 -3 = 1 -3 1 -3 1 -3 5 -6 –1 -6 1 -3 1 – -6 5 -6 (A1.

Artificial Neural Networks 239 .+ ˆ w = A t = 1. 22 – 2.57) Notice that the norm of the weight vector is smaller using a pseudo-inverse computed with SVD than with QR. 11 (A1. 55 5.

240 Artificial Neural Networks .

Obviously.4 0. by definition.5 37 37.2 1 0. Let us suppose that we want to classify people’s health (sick or health). hour of day. It is more difficult to assign a value for the mean temperature of “sick” people. 2 1.6 1. we shall assume a value of 39º for the mean temperature of sick people.5 40 40.8 1. needed for understanding the role of neural networks as classifiers. intuitively. state. its degree and phase where it is at the moment.5 38 38. A2.1 The Pattern-Recognition Problem. since. As we have to assign a value. We will assume that 36.APPENDIX 2 Elements of Pattern Recognition This appendix will introduce some fundamental concepts from pattern recognition theory. i. it will vary with the disease. it is the mean temperature of “healthy” people.5 41 FIGURE A2.8 0.4 1.2 0 36 36. the temperature of sick people will have a larger variance than the temperature of healthy people. as typically it will vary more with the disease. and so on. according to the body’s temperature. the temperature of healthy people varies with the particular person.Data used in the “sick-healthy” problem Artificial Neural Networks 241 .e.5º is the usual “normal temperature”. But..1 .5 39 Temperature (ºC) 39.6 0. This appendix follows closely the material presented in [135].

This can be estimated by the relative frequency of the classes.2) In (A2. classify its state into 1 of 2 classes. which can be estimated from the data assuming a pdf (in our case. to define the shape and the placement of the boundary is not an easy task. we can construct the optimal classifier. while the blue crosses denote temperatures assigned for healthy people. Fisher showed that the optimum classifier chooses the class ci such that the a posterior probability P ( c i ⁄ x ) that the sample x belongs to the class ci is maximized. using Bayes’ rule: P ( x ⁄ ci )P ( c i ) P ( c i ⁄ x ) = -------------------------------P( x) (A2. The task that we have at hand is. the two classes overlap.1 .. A2. Therefore. As we can imagine (and see if we inspect carefully fig. i. generated by two different phenomena.89 µ 36. we can estimate the distribution parameters. health and sickness. The red balls show temperatures for sick people. in fact it is the most important problem in pattern recognition problems. this is. Table A2.The above figure shows the data (100 samples) used in this problem.1) There is only one slight problem: this probability can not be measured. and in the general cases. in this case only the mean and the variance. for all j ≠ i (A2. P ( x ⁄ c i ) is the likelihood that x was produced by the class ci.2).2 Statistical Formulation of Classifiers We shall assume that the patient temperature is a random variable.3) 242 Artificial Neural Networks . A natural way to perform this task is to define the boundary temperature between the 2 classes.Sample Temperature mean and std σ 0. we have to define the decision surface. P ( c i ) is the prior probability of class ci. P ( x ) is just a normalizing factor which is usually left out. each one with a probability density function (pdf). in this case a Gaussian distribution. In our case. Using statistical decision theory.14 0. 1 ). given the people’s temperature. From the measurements of the temperature. we used: 1 µ = --N N xi i=1 (A2. the sample mean and standard deviation are depicted in Table 1.89 Healthy Sick To estimate these values. a Gaussian function). x belongs to the class ci if: P ( c i ⁄ x ) > P ( c j ⁄ x ).e.51 38. It must be obtained indirectly.

5 2 1. 3 . the smaller the error. We can use the Gaussian pdf for estimating P ( x ⁄ c i ) . there are errors in this classification.3) and (A2. In this figure. otherwise it is more likely to be sick. If x < T then the patient is more likely to be healthy. with the parameters estimated by (A2. the error is affected by a combination of the cluster mean difference and the cluster variances.3). Likewise. T.4): (x – µ) – -----------------2 2σ 2 1 p ( x ) = --------------e 2πσ (A2. Intuitively.2 1 σ = -----------N–1 N ( xi – µ ) i=1 2 (A2. we arrive to fig. 2 shows the histograms and the pdfs for the data belonging to both classes. the smaller the variances the smaller the error. each multiplied by P ( c i ) (here we assume that the probability of being sick is 0. Of course.5 0 35 36 37 38 39 T em perature (ºC ) 40 41 42 FIGURE A2. and vice-versa.5 1 0. the larger the difference between the cluster centres.2 .5) 3. Therefore. which enables us to classify the patient’s health. The classification error depends on the classes overlap. for a fixed cluster centre distance. fixed variances. for. as there are healthy people whose temperature is higher than T. Artificial Neural Networks 243 .5 3 2. we have plot the separation boundary.4) fig.Sampled data distributions If we plot the pdfs.

2 P(x/c1)P(c1) 1.8

1.6

s1

1.4

1.2

1

0.8

0.6

0.4

0.2 T 0 35

P(x/c2)P(c2) s2

36

m1 x<T

37 x>T

38

39 m2

40

41

42

FIGURE A2.3 - Bayes

threshold

Our next step is to compute the Baysean threshold, T. It is defined as the value of x for which the a posteriori probabilities are equal. For Gaussian pdfs, as x appears in the exponent, we must take the natural logarithms of each side of (A2.6)
( x – µ1 ) – -------------------2 2 σ1
2

P ( c 1 )σ 2 e Taking the logaritms

= P ( c 2 )σ 1 e

( x – µ2 ) – -------------------2 2 σ2
2

(A2.6)

( x – µ1 ) ( x – µ2 ) ln ( P ( c 1 ) ) + ln ( σ 2 ) – --------------------- = ln ( P ( c 2 ) ) + ln ( σ 1 ) – --------------------2 2 2σ 1 2σ 2 This is clearly a quadratic equation in x, which can be given as:
2 µ1 µ2 P ( c1 ) σ2 µ µ2 x- 1 1 ---- ----- – ----- + x ----- – ----- + ln ------------- + ln ----- – 1 ----- – ----- -- 1 - 2 2 2 σ2 σ2 P ( c2 ) σ1 2 σ2 σ2 σ1 σ2 2 1 1 2 2 2

2

2

(A2.7)

= 0

(A2.8)

When σ 1 = σ 2 , we get a linear equation, whose solution is:

244

Artificial Neural Networks

µ1 + µ2 T = ----------------- + k , 2

(A2.9)

where k depends on the ratio of the a priori probabilities. For our case, considering c1 the healthy case, and c2 the sick case, we have the quadratic polynomial – 21.9 x + 1597x – 29100 . The roots of this equation are x = 36.98, 35.9 , which implies that T=36.98. Notice that, if both variances would be equal, T would be larger. The threshold moves towards the class with smallest variance. A2.2.1 Minimum Error Rate and Mahalanobis Distance The probability of error is computed by adding the area under the likelihood of class c1 in the decision region of class c2, with the area under the likelihood of class c2 in the decision region of class c1. As we know, the decision region is a function of the threshold chosen, and so is the probability of error. It can be shown that the minimum of this probability of error is achieved determining the threshold using Bayes’ rule, and the minimum error is a function not only of the cluster means, but also of the cluster variances. We can express this fact by saying that the metric for classification is not Euclidean, but the probability depends on distance of x from the mean, normalized by the variance. This distance is called the Mahalanobis distance. It can be expressed, for the case of multivariate data, as the exponent of a multivariate Gaussian distribution: 1 p ( x ) = -------------------e 2π Σ
D --2

2

(x – µ) Σ (x – µ) – -------------------------------------------2
T –1

1 -2

(A2.10)

Notice that the data has dimension D, and Σ denotes the determinant of Σ , the covariance matrix: σ 11 … σ 1 D … … … . σ D 1 … σ DD
(A2.11)

Σ =

In (A2.11), each element is the product of the dispersions among sample pairs in two coordinates: 1 σ ij = -----------N–1
N N

( x i, k – µ i ) ( x j, m – µ j ) .
k = 1m = 1

(A2.12)

Artificial Neural Networks

245

Notice that the covariance matrix measures the density of samples of the data cluster in the radial direction from the cluster centre in each dimension of the input space, and therefore it quantifies the shape of the data cluster. The covariance matrix is also positive semi-definite and symmetric. When the data in each dimension is independent, the covariance matrix is a diagonal matrix. A2.2.1 Discriminant Functions Let us assume that we have N measurements, each one with D components. These can be seen as N different points in a D-dimensional input space. Any one of these points will be assigned to class ci if (A2.1) is satisfied. Each scaled likelihood can be seen as a function that assigns each point one score for each class. These functions are called discriminant functions. These functions intersect in the input space defining a decision surface, which partitions the input space into regions where one of the discriminant functions (the one assigned to the class) is always greater than the others. Using this view, the optimal classifier just computes the discriminant functions, and chooses the class that provides the largest value for the specific measurement. This can be seen as the following figure: xk1 xk2 g1(x) g2(x) Maximum i

xkD

gn(x)

FIGURE A2.4 - An

optimal classifier for n classes

The classifiers we have described previously are called parametric classifiers, since their discriminant functions depend on specific parameters (the mean and the variance). There are classifiers whose discriminant function does not have a specific functional form and therefore are only driven by data. These are called non-parametric classifiers. A2.2.1.1 A Two-dimensional Example Let us now assume that we want to classify humans as male or female according to the weight and height of the individuals. We shall assume that we have 1000 samples of weights and heights of males, and another 1000 samples of weights and heights of females. Those were random numbers from a bivariate Gaussian distribution, and are shown in fig. 5 . The sample mean and covariance matrices are shown in Table 1. We also assume equal a priori probabilities, i.e., 0.5.

246

Artificial Neural Networks

Table A2.1 - Data Statistics
Mean Covariance Matrices

Male

79.9 1.75
Female

100 0 0 0.01 51.3 0 0 0.01

65 1.65

2.1 2 1.9 1.8 Height (m) 1.7 1.6 1.5 1.4 1.3 40

50

60

70

80 90 Weight (Kg)

100

110

120

FIGURE A2.5 - A

scatter plot of weight and height of males ‘o’ and females ‘x’

Our goal here is to compute the optimal decision surface. Using as discriminant functions the logaritm of the scaled likelihood, we have:
T –1 1 1 g i ( x ) = – -- ( x – µ i ) Σ i ( x – µ i ) – log ( 2π ) – -- ln Σ i + ln ( P ( c i ) ) 2 2

(A2.13)

In order to compute the decision surface, we can remove the ln ( 2π ) and the ln ( P ( c i ) ) terms, as they appear in the discriminant functions of both classes. For women we have: Σ = 0.56, Σ
–1

= 0.019 0 0 92

(A2.14)

Artificial Neural Networks

247

1 g 1 ( x ) = – -- x 1 – 65 x 2 – 1.65 2
2

T

0.019 0 x 1 – 65 x 2 – 1.65 + 0.29 0 92

(A2.15)

= – 0.0095x 1 + 1.1x 1 – 94.5x 2 + 312x 2 – 29 For men we have: Σ = 1.07, Σ
–1

2

= 0.01 0 0 100

(A2.16)

1 g2 ( x ) = – -- x 1 – 80.3 x 2 – 1.75 2
2 2

T

0.01 0 x 1 – 80.3 x 2 – 1.75 – 0.03 0 100

(A2.17)

= – 0.005x 1 + 0.8x 1 – 50x 2 + 175x 2 – 185 Equating (A2.15) and (A2.17), we have: 0.005x 1 – 5.4 x 2 – 2x 1 – 332x 2 + 22 = 0
2 2

(A2.18)

This is the equation of the decision surface. If we superimpose the solutions of this equation with the scatter plot, we end up with:
120 110 100 90 Weight (kg) 80 70 60 50 40 30 20 1.2

1.3

1.4

1.5

1.6 1.7 Height (m)

1.8

1.9

2

2.1

FIGURE A2.6 - Scatter

plot with decision boundary

248

Artificial Neural Networks

7 1.3D plot of the modelled pdfs Producing a contour plot of the pdfs.8 .As we can see. we have: 100 95 90 85 Weight (kg) 80 75 70 65 60 1.75 Height (m) 1.Contour plot of the modelled pdfs Artificial Neural Networks 249 . It only means that it is the best we can do with the data.85 1. many errors will be made using this classification. Optimal does not mean good performance. If we produce a 3-dimensional plot of the modelled pdf. we get FIGURE A2.7 . and superimposing the decision curve.65 1.9 FIGURE A2.8 1.6 1.

23) Σi –1 1 T –1 1 where W i = – ------. the Mahalanobis distance defaults to the Euclidean distance: (x – µ) (x – µ) = x – µ T 2 (A2.17) are special cases of (A2. µi 2 µi where w i = ---. we have assumed that the weight was uncorrelated with the height (which as we know is not a realistic assumption). but the covariance matrices are identical.ln ( Σ i ) . in generating the data.and b i = – ----------.. then the covariance matrix is diagonal: Σ = σ I In this case. There are some special cases where the decision surfaces of the optimal classifiers are linear functions.22) (A2. If the input variables are uncorrelated and with the same variance. as. but with w i = Σ µ i and b i = – -.3 Decision Surfaces In the last example. 2 σi 2σ i In this case the samples define circular clusters and the decision surface is a line perpendicular to the line joining the two cluster centres: w1 x + b1 = w 2 x + b2 ( w1 – w2 ) x = b2 – b 1 When the data is correlated.23). Notice that (A2. the discriminant is still –1 1 T –1 linear (A2. the decision surface is given as a solution of quadratic function. but in this case not perpendicular to it.15) and 2 2 2 (A2. The samples define in this case 2 ellipses (with equal form) and the decision surface is a line.A2.19) 2 2 (A2. –1 T T T T T T (A2.µ i Σ µ i .. The discriminant functions in this case are linear functions: gi ( x ) = wi x + bi . and 250 Artificial Neural Networks .21).µ i Σ µ i – -.21) 2 (A2. w i = Σ µ i and b i = – -. intersecting the line joining the two cluster centres.20) and the classifier is called a minimum distance classifier. In the general case (different covariance matrices) the discriminant functions are quadratic functions: gi ( x ) = x Wi x + wi x + bi .

Therefore. and that there is one summation for each class.therefore there are no cross terms in the covariance matrices. In addition to the problem of computing more parameters. A more sophisticated approach involves using a linear discriminator.9 . The linear classifier is expressed as: D g ( x ) = w 1 x1 + w2 x2 + … + wD x D + b = i=1 wi xi + b (A2. a circle. which is a nonlinear mapping of the original input space. but to an intermediate space (usually of greater dimensions). but this time applied not the measurements. an ellipse or a parabola. and is depicted in the next figure. in contrast with fig. xk1 xk2 Σ Σ Maximum i xkD Σ FIGURE A2. The system that implements this discriminant is called a linear machine. with limitations which have been pointed out. the major problem is that more data is needed to have a good estimation of the parameters of the discriminant functions. A2. The decision region can be a line. Notice that this a simple scheme. by applying logaritms). and not in the clusters shapes. Sometimes.A linear classifier Notice that. there are always more parameters to estimate. defines an hyperplane in D dimensions. Also that a linear classifier (one where the discriminant functions are linear) exploits only differences in the cluster means. It should also be stressed that. because lack of data or when the input dimension becomes large. 4 the functions become a sum of products. it is better to use a sub-optimal classifier than the optimal (quadratic) one. Essentially Artificial Neural Networks 251 . independently of the fact that quadratic classifiers are optimal classifiers when data follows a Gaussian distribution. we have encountered two types of classifiers: linear and quadratic (the Gaussian classifier can be expressed as a quadratic one. as we know. Notice that for Gaussian distributed classes the optimal classifier is a quadratic classifier given by (A2.24) This equation.23). This is due to Cover’s theorem.4 Linear and Nonlinear Classifiers Until now. they perform badly for situations where the classes have the same mean.

the first D components of (A2. which is mapped into a feature space Φ .25) could be { x i } . For instance. The dimension of 2 D( D + 3) the feature space in this case is ---------------------.5 Methods of training parametric classifiers In a parametric classifier. The deci- 252 Artificial Neural Networks . This leads to the new field of Support Vector Machines [137]. Recently. Vapnik [136] demonstrated that if k ( x ) are symmetric functions that represent inner products in the feature space.A Kernel-based Classifier This feature space is a higher-dimensional space. and 2 trigonometric polynomials.10 . A2.what must be done is to map the input space to another space. Other popular choices are Gaussian functions. a linear discriminant function is constructed as: g ( x ) = w1 k1 ( x ) + w 2 k2 ( x ) + … + wD k D ( x ) + b (A2.25) There is a great flexibility in choosing the family of kernel functions to employ. the next D( D – 1 ) ---------------------.components to { x i x j. In the Φ space. where the problem is linearly separable. by a kernel x1 … xD function: k ( x ) = k1 ( x ) … kM ( x ) (A2. The accuracy of the classifier is dictated by the location and the shape of the decision boundary in the pattern space.26) 2 The great advantage of using this scheme is that the number of free parameters is decoupled from the size of the input space. This is called training the classifier.. i ≠ j } . and the last D components to { x i } . the solution for the discriminant function is greatly simplified. called the feature space. Let us assume that the original input space has D dimensions x = . we need to estimate the parameters of the discriminant functions from the data collected. xk1 xk2 k1 k2 Σ Σ Maximum i xkD kM 1 Σ FIGURE A2. if a quadratic mapping is chosen.

There are. the Gaussian is used). and therefore the functional form of the discriminant functions and their placement are fundamental issues in designing a classifier. some problems related with these classifiers: Parametric training assumes a specific distribution (usually Gaussian) which sometimes is not true. In principle. the covariance and the probabilities to compute (A2. Usually there are a significant number of parameters to estimate. where. we need to estimate the mean. the parameters of the discriminant functions are estimated directly from the data. which limits its use to a small numbers of distributions (typically. as already referred. parametric training seems to be the best choice. This needs a corresponding amount of data (typically the number of samples should be more than 10 times the number of parameters) which sometimes simply is not available 1. if Gaussian-distributed pattern categories are used. There are no assumptions about the data distribution. 2. is the way that class likelihoods are related to the discriminants. Artificial Neural Networks 253 . There are two different ways for training a parametric classifier: parametric and non-parametric training.13). however. the functional form of the discriminant function is known. In parametric training. and the problem lies in estimating its parameters from the data. The problem.sion boundary is the intersection of the discriminant functions. In nonparametric training. An example are statistical classifiers.

254 Artificial Neural Networks .

It is assumed that the student has already knowledge of a basic course in control systems. A3. needed to understand Chapter 4. which covers basic state space.APPENDIX 3 Elements of Nonlinear Dynamical Systems This appendix will introduce some fundamental concepts from nonlinear dynamical systems.1) can be viewed as describing the motion of a point.1) · where x is a n-vector of state variables. (A3. x2 t2 t1 t0 x1 Artificial Neural Networks 255 . The curve drawn in a graph representing x ( t ) as t varies is a trajectory. x ( t ) = dx ( t ) and F is a vector-valued function. The instantaneous velocity is the tangent to the curve at the specified time instant.1 Introduction The dynamics of a large class of nonlinear dynamical systems can be cast in the form of a system of first-order differential equations: · x(t ) = F( x(t )) . Accordingly. or simply a vector · field. · as a function of time. and for this reason F ( x ( t ) ) is called a velocity vector field. we can think of x ( t ) as describing the velocity of that point at time t. -----------dt Equation (A3. in a n-dimensional space.

3 0.4 0. for all x and u in a vector space.1 0. (A3.8 0.1) if: F(x) = 0 (A3. is referred to as a space portrait of the system.2) A3.9 1 FIGURE A3. which states that.1 0 0 0.5) 256 Artificial Neural Networks .A sample trajectory A family of trajectories.2 Equilibrium states A vector x is an equilibrium state of (A3.5 0.6 0.7 0.1 .9 0.3 0. such that: F( x) – F( u) ≤ k x – u (A3.FIGURE A3.6 x2 0. for different time instants and for different initial conditions. and in order for it to be unique.8 0.Phase portrait and vector field of a linear dynamical system Until now.3) Let us suppose that F sufficiently smooth for (A3.1) to be linearized in a neighborhood around x .4 0.2 .7 0. Another view of this system is representing the instantaneous velocities. it must satisfy the Lipschtiz condition. Let: x = x + ∆x F ( x ) ≈ x + A∆x . For a solution of (A3. we have not said anything about the vector function F.5 x1 0.1) to exist. there is a constant k.2 0. 1 0. for different initial conditions.4) (A3.2 0.

5 -1 -0.2 0.8 1 1.2 0 x1 0. we have: ∂ ∆x ( t ) A∆x ≈ .6 -0.5 -1 -1.4 -0.4 -0.8 1 8 6 4 2 x2 0 -2 Unstable node -4 -6 -8 -25 -20 -15 -10 -5 0 x1 5 10 15 20 25 FIGURE A3.1).6 0. 1.Equilibrium points Artificial Neural Networks 257 .3 .5 1 0.where A is the Jacobean matrix of F evaluated at x : A = ∂F ∂x (A3.7) Provided that A is non-singular. ∂t (A3.5 -1 -0.4 0.2 0.5 Stable focus -1 -1.6 0.5 x2 0 -0.4) and (A3.4 0.6) x=x Replacing (A3.8 -0.5 x2 0 Stable node -0.8 -0.5) in (A3.6 -0.2 0 x1 0. the local behaviour of the trajectories in the neighbourhood of the equilibrium state is determined by the eigenvalues of A.5 1 0.

A3.5 1 0.3 Stability The technique described above.4 3 2 1 x2 0 -1 -2 Unstable focus -3 -4 -10 -8 -6 -4 -2 0 x1 2 4 6 8 10 3 2 1 x2 0 -1 Saddle point -2 -3 -3 -2 -1 0 x1 1 2 3 1.Equilibrium points For the case of a second-order system. A3.5 -1 -0.5 Center -1 -1.4 . the equilibrium states can be classified as shown in fig.5 x2 0 -0.3 and fig.5 1 1.5 0 x1 0. Some definitions are in order:~ 258 Artificial Neural Networks .5 -1. linearization.5 FIGURE A3.4 A3. allows inspection of the local stability properties at an equilibrium state.

The problem is that there is not a particular method for doing that. if it satisfies the following requirements: 1.4 ). 2. there is a positive δ . A function that satisfies these requirements is called a Lyapunov function for the equilibrium state x . and it is matter of ingenuity. A function is positive definite in a space ℘ . The equilibrium state x is stable if in a small neighbourhood of x there exists a positive definite function V ( x ) such that its derivative with respect to time is negative semi-definite in that region. The equilibrium state x is convergent if.9) 3. such that: x(0) – x < δ x ( t ) → x. Another definition is needed here. The equilibrium state x is asymptotically stable if it is both stable and convergent. allt > 0 (A3. The equilibrium state x is globally asymptotically stable if it is stable and all trajectories converge to x as t approaches infinity. for any positive ε . can have the form of a periodic orbit (see last case of fig. Having defined stability we shall now introduce the Lyapunov method for determining stability. 3.3 ). and in this case we speak of a limit cycle. Its partial derivatives with respect to the state vector are continuous. and in this case we speak of point attractors (see fig. 2. or can have other forms. A3. V(x ) = 0 V ( x ) > 0. which are bounded subsets of a larger space to which regions of non zero initial conditions converge as time increases. such that: x(0) – x < δ x ( t ) – x < ε. Artificial Neural Networks 259 .1.4 Attractors Many systems are characterized by the presence of attractors. A3.8) 2. Attractors can be a single point in state space. The equilibrium state x is asymptotically stable if in a small neighbourhood of x there exists a positive definite function V ( x ) such that its derivative with respect to time is negative definite in that region. A3. Lyapunov introduced two important theorems: 1. x ≠ x If we can find a Lyapunov function for the particular system at hand we prove global stability of the system. ast → ∞ (A3. The equilibrium state x is uniformly stable if. 4. there is a positive δ .

Related with each attractor. there is a region in state space where the trajectories whose initial conditions lie within that region will converge to the specified attractor. 260 Artificial Neural Networks . This region is called the domain of attraction or basis of attraction for that specified attractor.

1. Draw and identify each major block of the most used artificial neural network. a) Describe and discuss the characteristics found in a biological neural network and that are present in an artificial neural network. Justify your answer. 3. inspired by biological neural networks. It is tedious to answer a) for all patterns. and employ it to compute the net input vector and the output vector for the 10 input patterns. The neuron has two inputs. w2=-1 and θ =0. and employes a sigmoid transfer function. Assume that we are considering an artificial neuron. b) Consider now one neuron.5. Artificial Neural Networks 261 . Artificial neural network were. a) Assume that w1=1. explain the meaning of each term in the above equations. What are the values of the net input and the output. The equations that describe the most common artificial neural network are: net i = o j W j. i + b i (1) j∈i o i = f ( net i ) (2) Under a biological point of view. such as the one shown in fig. without doubts. 2.Exercises Chapter 1 1.2. if the input is [1 1]? b) Consider now that you have 10 input random patterns. Write a Matlab function. i1 and i2.

262 Artificial Neural Networks . Determine the standard equations for a B-spline of: a) Order 1 b) Order 2 c) Order 3 6. ∂f i . – 0. { – 1. 2 7. tion. such as the one in the figure below. ∂ C i. within the range – 1 1 . 0. the B-spline function of order k. Assume that you want to train a simple linear neuron. such as the one illustrated in (1. of this activation function. in a recursive way. 1 } . x k . with respect to the standard devia∂ σi b) The derivative. to approximate the mapping y = x .5. k – x k ) 2 f i ( C i. C i.σ i ) = e – ----------------------------------2 σ i2 k=1 . Equation (1.5. Determine: a) The derivative.23) n ( C i. Identify the principal learning mechanisms and compare them. with respect to the kth centre. of this activation function.4. k c) The derivative. For this. k .24) defines. ∂f i . with respect to the kth input. Consider a Gaussian activation function. k . 0. σ i . you sample the input at 5 different points. How is learning envisaged in artificial neural networks. ∂ xk 5. ∂f i . of this activation function.

introduced by Bernard Widrow? Suppose that you want to to design a perceptron to implement the logical AND function. 11. a single layer network.y w1 1 w2 x a) Show explicitly the training criterion. how you could solve it with Madalines (Multiple Adalines). Artificial Neural Networks 263 . Explain. the concept of linearly separability. Assume that you have already trained 2 Adalines to implement the logic functions OR and AND. Consider a Multilayer Perceptron with 1 or more hidden layers.0]. 13. introduced by Rosenblatt. x 2. (3) 9. 14. using a diagram. 10. Demonstrate. b) Determine the weight vector employing the Hebbian learning rule. t is the desired output. x 3 ) = x 1 ∧ ( x 2 ⊕ x 3 ) . Consider the boolean function f ( x 1. All the neuron activation functions are linear. Which are the essential differences between the perceptron. justifying. where w is the weight vector. Explain. Determine the value of its weights by hand. The LMS (Least-Means-Square) rule is given by: w [ k + 1 ] = w [ k ] + α ( t [ k ] – net [ k ] )x [ k ] . justifying. how you could implement the last functions using Madalines (Multiple Adalines). and compute the gradient vector for point [0. Explain. net is the net input and x is the input vector. Chapter 2 8. using the diagram. Justify this learning rule. Prove analytically that this neural network is equivalent to a neural network with no hidden layers. that the exclusive OR problem is not a linearly separable problem. The exclusive OR problem is very well known in artificial neural networks. α is the learning rate. 12. and the Adaline.

. Comment the results obtained. there are several activation functions that can be employed in MLPs. Repeat a) and b) for 2 this new criterion. justifying your comments.. For each one of these methods. repeat the training now with an hyperbolic tangent function. different training algorithms were considered.15. and explain how those can be minored. instead of a sigmoid transfer function. 2 training criteria were introduced: 1 Ω = -2 1 Ω = -2 N 19. As it is known. 2 16. i=1 + 2 (5) 264 Artificial Neural Networks . For the pH and the coordinate transformation examples.is the most well known. The most well known method of training Multilayer Perceptrons (MLPs) is the error back-propagation (BP) algorithm. Compare them in terms of advantadges and disadvantadges. For this class of MLPs. 17. justifying. Consider the BP algorithm and the Levenberg-Marquardt algorithm. Determine: 2 a) The gradient vector. b) Identify its major limitations. (t – y) i=1 2 (4) N ( t – AA t ) . the advantadges and disadvantadges of one compared with the other. b) The minimum of the criterion. but the hyperbolic –x 1+e 1–e tangent function f ( x ) = -----------------. this method. Considering the training –2x 1+e of a MLP with the BP algorithm.is also often employed. Consider MLPs where the last layer has a linear activation function. a) Describe. Different training methods of MLPs were discussed in this course. t – Aw 2 Consider the training criteron Ω l = ---------------------. 20. t – Aw 2 + λ w c) Consider now a training criterion φ l = ------------------------------------------. identify. –2 x 18. 1 The sigmoid function f ( x ) = --------------.

W(1) denotes the weight matrix related with the first hidden layer and w(z) is the weight vector related with the output layer. 26. 21. where IP and tp are the original input and target patterns. 23.m (found in Section 2_1_3_7.zip). We are going to try one. After training has finished. IPs and tps are the scaled patterns.where t is the desired output vector. and try to justify your thoughts. and A is the matrix of the outputs of the last hidden layer. and li and lo are the ordinate diagonal matrices of the input and output.). the input and target patterns are scaled. construct first a training set and a validation set using the Matlab function gen_set.9]). Prove. For the pH and the coordinate transform examples. Consider that. that when the norm of all the column vectors of a rectangular matrix is reduced of a factor α . with the results obtained using the early-stopping method. except for one column vector whic remains unaltered. Using Ex. 24. Artificial Neural Networks 265 . Compare their results. Compare these 2 criteria in terms of advantadges and disadvantadges for off-line training. the condition number of this matrix is increased by a factor of approximately α . before training a MLP. 25. ki and ko are the diagonal gain matrices. N is the number of training patterns. Transform these matrices so that they produce the same results using the original data. In that case. for a one hidden layer network.8. 0. Explain why the Gauss-Newton method does not obtain good results when applied to neural networks training. in terms of accuracy and conditioning. respectively. I is the unitary matrix of suitable dimensions. pass this scaled target data ( y )through the inverse of the sigmoid nonlinearity 1–y ( net = – ln ---------. and determine the nonlinear weights by the least squares y 22. Consider that you start with just 1 neuron.m. Use for both cases 100 different initializations. you can pass the scale the target pattern to a range where the sigmoid is active (say [-0. There are several constructive methods for MLPs. employing random initial values for the weights and employing the Matlab function MLP_initial_par. compare the mean values of the intial error norm and of the initial condition number of the Jacobean matrix. y is the network output vector. 2. experimentally. Then train the networks with different values of the regularization parameter. in such a way that: IPs = IP ⋅ ki + I ⋅ l i tp s = tp ⋅ k o + I ⋅ l o . and a column of ones to account for the output bias.9.

with a regularization factor of 10-2. Assume that at time k. with a number of neurons up to 10. Consider that you have an input matrix X. the final weight values of previous off-line training sessions. The results shown with the LMS and the NLMS rules. Is there any effect of overmodelling visible? 27. What do you conclude. Consider the x2 problem used as an example in the start of Section 2. You want to design a Radial Basis Function network to approximate this input-output mapping. as a function of the learning rate employed. There is of course a model mismatch (the model y = w 1 x + w 2 is not a good model for x2).1. i=1).3. Relax this condition (employ a smaller regularization factor) so that you get a worse model. and the 2 28. For this new neuron. a network is presented with pattern I[k]. Apply the LMS and the NLMS rules to the pH and the coordinates transformation problems. corresponding to 100 patterns with 3 different inputs. Prove that the instantaneous autocorrelation matrix. The desired outputs corresponding to this input matrix are stores in vector t.1 were obtained using. Use this strategy for the pH and the Coordinate Transformation examples.. Apply the NLMS rule to this problem and identify its minimal capture zone.net i – I 1 v i solution of --------------------------------------. in the previous exercises and in Section 2.e [ k ] . 33.mat.4 of [124] to have a more clear insight of this problem. employing the reformulated criterion. Compare its performance against the original function. in terms of convergence? Modify the Matlab function MLPs_on_line_lin. and add another neuron. 32. You then can train the network until convergence. has only one non-zero eigenvalue. for both cases. 266 Artificial Neural Networks . the target pattern is just the error vector obtained after convergence with the last model. Employ the Matlab function MLPs_on_line_lin. Now see 5. and vi is the nonlinear 2 weight vector associated with the ith neuron (in this case. 31. for the pH and the coordinate transformation problem.1. for the better and worse conditioned models. resulting in a transformed pattern a[k]. with dimensions 100*3. as initial nonlinear weight values. with different learning rates. The LMS rule is afterwards employed. Use the LMS and the NLMS rules. in order to have a well conditioned model. indicate the topology of the network. Compare the results obtained with training from the scratch MLPs with 10 neurons.m in order to incorporate a deadzone. 29. and its corresponding eigenvector is a[k].m to the coordinate transformation problem. R[k]. starting with the initial nonlinear weight values stored in initial_on_CT. where I is the input matrix.8. in terms of conditioning.3. and compare the convergence rates obtained (use the same learning rates). Assuming that you are going to design the network using an interpolation scheme. 30. and all the steps indicated before can be employed. Determine what happens to the a posteriori output error . using the LMS and the NLMS rule.8.

1.5. using a supervised algorithm for the linear layer and heuristics for the determination of the centre and spread values of the basis functions. The most important class of training algorithms for Radial Basis Function networks employees an hybrid scheme. x4. 5. How can this problem be alleviated in B-spline networks? Consider a B-spline network. Repeat their application to the two examples mentioned. 4 and 3 interior knots are used in the univariate sub-networks. A displacement matrix 1 2 was used.5). 1 2 a) Consider the point (1. x2. illustrating the active cells for that point. The generalization parameter is 2. with splines of order 2. x2. Name the problems related with this scheme. the k-means clustering algorithm is the one more often used. 36. Draw an overlay diagram. and 1 output. a) Assume that the network is composed of 3 univariate sub-networks. and 1 output.values of all the relevant network parameters. 4 interior knots were employed for each input dimension. x4). c) explicit regularization. 34. Among the heuristics employed. using: a) the standard termination criterion. explain the uniform projection principle. x3. with range [0. with the G0 matrix. with 2 inputs. What is the reasoning behind this algorithm? Expose it. b) the early-stopping technique. with an identity matrix. as no stopping criterion was employed. d) explicit regularization. Consider a CMAC network. associated with the variables x3. in an analytical form. Artificial Neural Networks 267 . y. related with the variables x1. 38.What is the total number of basis functions for the overall network? How many basis functions are active in any moment? 35. The bivariate sub-network employees splines of order 3. with 4 interior knots in each dimension. Justify this problem. The problem known as curse of dimensionality is associated with CMAC and Bspline networks. c) How is a well-defined CMAC characterized? 37. 5]. The models obtained with RBFs using complete supervised methods are illconditioned. equally spaced of 1 unit. x3. with 4 inputs (x1. and 1 bivariate sub-network. How many cells exist in the lattice? How many basis functions are employed in the network? How many basis functions are active in any moment? b) Based on the overlay diagram drawn in a).

b) What are discriminant functions? And decision surfaces? The neural classifiers are parametric classifiers or non-parametric classifiers? c) Explain. using the MSE and the MSRE criterion. Compare the results obtained using the standard termination criterion. in the intervals x 1 ∈ 0 1 and x 2 ∈ – 1 1 . for a particular application. sometimes the ASMOD algorithm produces badconditioned models. and compare the results: a) Change the performance criterion employed to evaluate the candidates to an earlystopping philosophy (evaluate the performance on a validation set). briefly. b) Employ a regularization factor in the computation of the linear weights.b) The ASMOD algorithm is one of the most used algorithms for determining the structure of a B-spline algorithm. 41. b) the OR problem. 39.2. a) Explain the main differences betweeb the use of neural networks for classification and approximation. d) Compare the results obtained using the standard termination criterion with the early-stopping method. 40. an algorithm minimizing the SSE.5x 1 x 2 . employing different regularization factors. This can be solved in different ways.O que são funções discriminantes? E superfícies de decisão? d) What is the difference between standard perceptrons and large-margin perceptrons? 268 Artificial Neural Networks . Describe the most important steps of this algorithm. Consider now the application of neural networks for classification. The goal is to emulate this function using a B-spline neural network. Compare the results with the MSE criterion. c) Change the training criterion in order to minimize the MSRE (mean-square of relative errors. Determine this network analytically. in terms of the margins obtained. and in this exercise you try them. for: a) the AND problem. Depending on the training data. Assume that. and the process indicated in Section 2. Compare the results obtained with the perceptron learning rule.5. x 2 ) = 3 + 4x 1 – 2x 2 + 0. 42. and the SSE for all patterns. the data generation function is known: f ( x 1. the training process of support vector machines.

2 1⁄2 4. y 2 = 3. 15.mat. np T pi σ 44. 5. Describe how it can be done. noise removal and data reduction.9 y 3 = 5. This is usefull for input reconstruction. and this is called the Generalized Hebbian Algorithm (GHA). x 3 = – 1 ⁄ 14 .0 . y 1 = 1. W [ 0 ] = 0 . 45.6 . 2. 0. and produces at its outputs the largest eigenvalue. F ( σ k.8 – 11 ⁄ 14 c) Given x 1 = 1. Another paradigm is the auto-associator. x 2 = – 2. calculate W [ 3 ] using the forced Hebbian rule 6. where a set of data is linked to itself. We have described in our study of LAMs the hetero-associator. 2 ) are to be used. 10.0 . Identify. + ˆ W = X d. Artificial Neural Networks 269 . providing a link between two distinctive sets of data. i. Use the Matlab function F_ana.e. for a perfect recall. T p ) = ∏ nz np nz TT = L + i=1 Tpi – i=1 T zi ∏ i=1 1 + ---------TT Assume that two identification measures ( k = 1.5 2. L. T z. Consider that a set of plants are identified using the identification measures: T zi σ 1 + --------Lσ – -----TT TT i = 1 e ----------------------------------. 1.1.5 2. in the network weights. 20 } .5.5 4. b) Why must be the training patterns orthogonal? Show that.Chapter 3 43.m to compute the identification measures. out of the set { 0. 1.1 0. Prove that the Oja’s rule determines the eigenvector.7 0. using PCA.3 47. Consider now linear associative memories. 2. 46. Oja’s rule extracts the first principal component of the input data. PCA can also help in system identification applications.. A repeated application of Oja’s rule is able to extract all principal components. and collect the planst under consideration from plants. a) Describe the structure and the operation of a LAM.5 0.5 .4 . the best set of two σ k values.

54 1. 48.2. T 49.9 0.1.92 1. Using this function generate 100 bi-dimensional patterns. A more powerfull structure. an autoencoder. find the winning node. The following figure shows a simple 2*2 self-organizing map. in Matlab. 1 2 3 4 FIGURE 1 . Given an input vector x = 1 2 . State and explain the self-organizing map weight update equation. using the Euclidean distance between the input vector and the weight vectors.a) Prove that the application of forced Hebbian learning performs an eigendecomposition of the autocorrelation function. the SOM algorithm as described in Section 3.87 50. 51.71 1. if the weight matrices between layers are equal and transposed. b) Typically an autoassociator has no hidden layer.85 0. 1. to implement a) A two-dimensional SOM with 5*5 neurons 270 Artificial Neural Networks .SOM weights Inputs Node 1 2 1 2 3 4 0. is achieved if an hidden layer (where the number of neurons is equal or less than the number of neurons in the input/output layer) is added to the network. covering the space – 1 1 × – 1 1 .A 2*2 SOM Table 1 .74 1. Implement. The SOM has twodimensional inputs and the current weights are given in fig. Prove that.64 0. Explain why the LMS rule can be seen as a combination of forced Hebbian and antiHebbian rules. the application of forced Hebbian learning implements principal component analysis.

T T T T T T Demonstrate that the Lyapunov function of the BSB model is given by (4.32). a) Draw a circuit that implements this model. b) Draw a block diagram describing this equation. Consider a Hopfield network made up of five neurons. ξ6 = 1 – 1 1 –1 – 1 54. ξ3 = – 1 1 – 1 1 1 a) Determine the weight matrix b) Demonstrate that all three fundamental memories satisfy the alignement condition c) Investigate the retrieval performance of the network when the second element in ξ 1 is reversed in polarity d) Show that the following patterns are also fundamental memories of the network. designed to store three fundamental memories ξ1 = 1 1 1 1 1 . b 1 18 –2 0 55. 53. Appendix 1 2 1 Compute the solution of 3 4 1 3 14 1 6 a = 5 22 . Chapter 4 52.b) A one-dimensional SOM with 25 neurons. Artificial Neural Networks 271 .5) is ammenable to be implemented analogically. How are these related with the trained patterns? ξ4 = – 1 –1 – 1 – 1 – 1 . ξ 5 = – 1 1 1 – 1 1 . The model described by (4. ξ2 = 1 – 1 –1 1 – 1 .

2 9 10 5 5 272 Artificial Neural Networks .56. 2 6 4 2 1 See if the system 4 15 14 7 w = 6 is consistent.

pp. force and impact control for robotic manipulators'. A.. Mitsuoka. Vol. 'Neural net application to optical character recognition'. 9. 'A neural network approach to character recognition'. Langenderfer. 1987 Shea.. R. Vol. Shibata. K. Vol. Brighton. Musavi.. Vol. 355-365 Bounds. Kosuge.. ICS Report 8702. A. Nº 5. K.. P. Nº 3. pp. 9. IEEE 1st International Conference on Neural Networks. Rosenberg.. J. Richfield. Mathew. 1. Vol. P.. 'Implementation of an adaptive neural controller for sensory-motor coordination'. Vol. Nº 3. T. Nasser. IEEE INNS International Conference on Neural Networks. Neural Computation. M. pp. Pilla.. Chiabrera. IEEE Transactions on Neural Networks.. pp. 'Review of neural networks for speech recognition'. R. Munro. 1. Nº 5. Wake. Vol. pp. Nº 3. P. 'Parallel networks that learn to pronounce English text'. B. Arai. 1989. Lattuga. Hakim. M. 'A neural network approach for bone fracture healing assessment'. Liu. 'Operational experience with a neural network in the detection of explosives in checked airline luggage'.... 'Use of neural networks in detecting cardiac diseases from echocardiographic images'. Chen.. 1990. N. IEEE Engineering in Medicine and Biology. Tokita. 'Neuromorphic sensing and control .. D. C. S. Rubinstein. 1989. 1-38 Sejnowsky.. F. 1990..Bibliography [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] Mehr. 1990. R. Hatem.. 1991. 583-591 Kaufman. Lloyd. M.. 175-178 Kuperstein. pp..application to position. IEEE Engineering in Medicine and Biology. IEEE Control Systems Magazine. T. M. V. J. Zipser. Shirvaikar. Vol. K. Siffert. M. 2.. 30th IEEE Conference on Decision and Control. 145-168 Cottrel. 162-167 Artificial Neural Networks 273 . 1989. pp. 25-30 Fukuda..... R.. pp... 1990. 4. 3.. Nº 3. 'Handwritten alphanumeric character recognition by the neocognitron'. Vol. A.. Nº 1. 771-777 Rajavelu.. 23-30 Cios. Univ.2. 9. 387-393 Fukushima. T.. D. D. pp. 1987.. Neural Networks. pp. 1987. N.. Complex Systems. G. F. Figueiredo. pp. England. of California at San Diego.. 'A comparison of neural networks and other pattern recognition approaches to the diagnosis of low back disorders'. 58-60 Lippmann. 1991.. Neural Networks.. K. 2. S. 'Image compression by back propagation: an example of an extensional programming'. T.

.. pp. Nº 1. 1949 Rosenblatt. 1-11 Hecht-Nielsen.. E. S. R. Stanford Electronics Labs. B. 'A logical calculus of the ideas immanent in nervous activity'.. Nº 3. IEE Colloquium on 'Current issues in neural network research'. Vol. V. 1990. 1. J.. 1975. F. Acad. D.. Proc. pp. Proc. pp. W. Vol. 'An introduction to computing with neural nets'. 1969. 'Principles of neurodynamics'. pp. Hoff. Math. 'Embedding fields: a theory of learning with physiological implications'. 'Silicon implementations of neural networks'. 10. 'Correlation matrix memories'. pp.. M. 25. Biological Cybernetics. 'Neural networks for self-leaming control systems'. 'Neural networks in control systems'.. 'Neurocomputing'. MIT Press. 299-307 Anderson.. 1967. neuron using chemical "memistors" '. Vol. MIT Press. B.. 1990.. 353-359 Hopfield. 1943. Pitts. W. F. Rosenfeld. 'Neurocomputing: picking the human brain'. IEEE Spectrum. pp. 1-5 Antsaklis. Vols.. IEEE Control Systems Magazine. Papert. U. 'A memory storage model utilizing spatial correlation functions'. 121-136 Grossberg. 'The organization of behaviour'. 10. 1988. Vol. Stanford. D. 1986 Vemuri. pp. pp. Nº 3. 4. IEEE Control Systems. 4. Kybernetik. Smith. 1990 Strange. Bulletin of Mathematical Biophysics. Nº 3. IEEE ASSP Magazine. B. pp. IEEE Technology Series Artificial Neural Networks: Theoretical Concepts. 16. 'Adaptive switching circuits'. pp. K... 1990. Vol.. Vol. 3-5 274 Artificial Neural Networks . Nº 4. Vol. Symposium Proceedings. P.. IEEE Transactions on Neural Networks. & the PDP Research Group. USA. USA. T. 113-119 Fukushima. S. Parthasarathy.. Digest Nº 1989/83. Report 1553-2. 1968. Nat. pp. First IEE International Conference on Artificial Neural Networks'. Nº 20. 1982. S. Nº 6. 'An adaptive "Adaline". B. 1960. 'A theory of adaptive patterns classifiers'.. 1988 McCulloch. Addison-Wesley.. 18-23 Lippmann.. A. Spartan Books.. 1989 Murray. R. J. 4-27 Nguyen.. Widrow. pp. R. Nº 5.27-32 Hecht-Nielsen.. 96-104 Widrow. Vol.. 'Cognitron: a self-organizing multilayered neural network'. D. 4-22 Anderson. Nº 2. McClelland. J. Sci. 21. 'Basic brain mechanisms: a biologist's view of neural networks'. 2554-2558 Hopfield. 'Analog VLSI and neural systems'.[13] [14] [15] [16] [17] [18] [19] [20] [21] [22] [23] [24] [25] [26] [27] [28] [29] [30] [31] [32] [33] [34] [35] [36] [37] [38] Narendra. MIT Press. pp. pp. 1989. pp.K. 1962 Widrow.. C. 1972.. Nat. J. K. J. 'Pattern recognizing control systems'. pp. 3088-3092 Rumelhart... 1984. 81. Wiley.. 'Artificial neural networks: an introduction'. 36-41 Mead. Psych... Nº 3... Sci.. Vol. Acad. Addison Wesley. 1987. Tech. 115-133 Hebb.. Spartan Books.. M. 1963 Minsky. 'Parallel distributed processing'. 1989. 1960 WESCON Convention Record. 'Neurocomputing: foundations of research'. 1969 Amari. pp. 'Neurons with graded responses have collective computational properties like those of two-state neurons'. 'Neural networks and physical systems with emergent collective computational abilities'. K. USA. 'Identification and control of dynamical systems using neural networks'. IEEE Transactions on Computers. 'Perceptrons'. 209-239 Kohonen. R. J.. 1988. 5. IEEE Transactions Electronic Computers. 1960 Widrow. 1 and 2. 79. Computer and Information Sciences.

Zafiriou. 'Analysis of a general recursive prediction error identification algorithm'. Vol.N. A. Automatica. pp. ‘Neuron-like adaptive systems that can solve difficult learning control problems’.. Proceedings of the 1st IFAC Workshop on 'Algorithms and Architectures for RealTime Control'. Y. 77-88 Bavarian. Automatica.G. NPL Report DITC 166/90. Complex Systems. M. Cherkassky. Springer-Verlag.. pp. 'Nonlinear systems identification using neural networks'... Ljung.E. R. 2. IEEE International Conference on Neural Networks. pp. A. Man & Cybernetics. ‘The ART of adaptive pattern recognition by a self-organizing neural network’. 21. T. IEEE Control Systems Magazine. IEEE Automatic Control Conference. Nº 2. pp.. 8. 49-55 Leonard. 'Neural networks for nonlinear adaptive control'. Delyon. 3-7 Sjoberg. Sutton.. 1989. 'A learning architecture for control based on back-propagation neural networks'.. ‘Algorithms for spline curves and surfaces’.. M. B. T. Juditsky. R.. N2 3. G. Brighton. S.. 321-355 Cox. N. pp. Wang... pp. England.. Lowe. 1989 Ljung. Tech. G. 1995. H... 1983.. T. Suzuki. R. McAvoy. A. 1990. Isobe. Vol 31. series G.. Irwin. Benveniste.. ’Nonlinear Black-box Modelling in Systems Identification: a Unified Overview’. Minden-nan. Vol. 2. U. Nº 2. 587594 Kawato. IEEE Control Systems Magazine'. 89-99 Lightbody.. Vol. No 3. Uno. pp. 24- Artificial Neural Networks 275 . S. 3. 1988. 1988 Broomhead. S. Vol. S.. P. J. 'Use of neural networks for sensor failure detection in a control system'.. 834-846 Carpenter.. 'Nonlinear signal processing using neural networks: prediction and system modelling'. Hjalmarsson.. 1988. 'Classifying process behaviour with neural networks: strategies for improved training and generalization'... Vol.. Nº 3. No 5. 1-13 Lapedes. 1691-1724 Elsley.S..[39] [40] [41] [42] [43] [44] [45] [46] [47] [48] [49] [50] [51] [52] [53] [54] [55] [56] [57] [58] Kohonen. M. Vol. Kramer... 8-16 Guo. 1990 Albus. Vol.W. pp. IEEE Control Systems Magazine. 1981. Nº 12. C. Grant. Transactions ASME. J. 170-171 Chen. N. September 1991. J. 13.. 1988. K. Vol. 1975 Barto. Mukhopadhyay.. 'Multilevel control of dynamical systems using neural networks'.. McAvoy. E. 'Hierarchical neural network model for voluntary movement with application to robotics'. 1988.'Self-organization and associative memory'. pp.. A. V. 30th IEEE Conference on Decision and Control... pp. pp. S. 97. 17.. IEEE Transactions on Systems.. 1991. 299-304 Naidu. International Journal of Control. 1990. Nº 3. M. 10..... pp. 'A solution to the inverse kinematic problem in robotics using neural network processing'. D. A. pp. Grossberg. 1990. L.. Sept. IEEE Computer. pp. Los Alamos National Lab. 1988. 1191-1214 Ruano. Anderson..E. B. S. Nº 1. IEEE Control Systems Magazine. 2nd ed.. 1987 Bhat. 'First year report'. Zhang. Billings.C. Vol. 10. pp. L. J.. Vol. 2478-2483 Narenda. 'Modelling chemical process systems via neural computation'.. 'Introduction to neural networks for intelligent control'. Glorennec.. 'Multivadable functional interpolation and adaptive networks'. G. 51. San Diego.. R. P. Q. 'The Cerebellar model articulation controller'. 1990. Vol. D. IEEE International Conference on Neural Networks. Nº 2. Vol. 8. P. Farber.. Rept.

Yamamura.. 1991. IEE Proceedings. Proceedings of the 9th IFAC World Congress. Willis. 'A comparison of adaptive estimation with neural based techniques for bioprocess application'.. IEE Colloquium on Neural Networks for Systems: Principles and Applications'. Hungary. 1984. C. K.. G. 30th IEEE Conference on Decision and Control. Tham. 'Artificial neural network based control'. pp. Morris. Vol. 'An associative memory based learning control scheme with PIcontroller for SISO-nonlinear processes'. Proceedings of the 1989 International Joint Conference on Neural Networks. pp. 1991. D. E. A. E.. 'A nonlinear receding horizon controller based on connectionist models'. Vol. 55. 'Adaptive Control'. A. Proceedings of the IFAC Microcomputer Application in Process Control. control and piloting through a new class of associative memory neural networks'. 'Real-time implementation of an associative memory-based learning control scheme for nonlinear multivariable processes'. Nº 1. M. U. Digest Nº 1991/019 Astrom. 551-558 Saerens. Nº 1. A. 1987. Chen. Militzer. 256266 Billings. pp.. Vol.. U. M. Edinburgh. 138. 2... pp. Jamaluddin.. 1990. E. Nº 2. 36-43 Ersu. IEEE Control Systems Magazine. Tuffs. Brighton. Willis.. Wienand. Plymouth. pp. 1989 Kraft.... A. 172-173 Hernandez. Vol. Vol. 2173-2178 Willis. Tolle.. Sbarbaro. pp. Vol. 'Neural networks for nonlinear internal control'. 23. pp. San Diego. Y. Proceedings of the IEE International Conference Control 91. Montague. 'The truck backer-upper: an example of self-leaming in neural networks'. Vol.. Mothadi. 431-438 Harris. 1990. Soquet. D. 'Neural network modelling and an extended DMC algorithm 276 Artificial Neural Networks . 'Software implementation of a neuron-like associative memory for control applications'. P. pp. S.. 55-62 Nguyen. 'Generalized predictive control . International Journal of Control.. 357-363 Hunt. D. 2. Tham. S. Istanbul.... Brown.. 'Artificial neural networks in process engineering'. Arkun.. Automatica. 6165. Nº 3.part 1: the basic algorithm'. 99-105 Psaltis. J.. Massimo.. 109-119 Ersil.. Budapest... 1992. 'Autonomous vehicle identification. 1986. 266-271 Clarke. pp.. 1987... Part F. 1982 ErsU. 137-148 Sbarbaro. Sideris. pp. IEEE Automatic Control Conference. 1039-1044 Montague. E. K. 'Neural Controllers'. M. D.. 138. Militzer. C. 'Neural-controller based on back-propagation algorithm'. 1991. 'A new concept for learning control inspired by brain theory'. IEEE First Conference on Neural Networks. Montague.. Switzerland. pp.K. G. 193-224 Ersil. 1. pp. D. C. 'A comparison between CMAC neural networks and two traditional adaptive control systems'.. E. Widrow. 'Properties of neural networks to modelling nonlinear dynamical systems'. 1991. January 1991.. England. Morris. A.. D. M.. M. Vol. K. 1991. A. Part D. Morris. P. G.. Nº 5.. pp. Vol. Campagna. M. pp.[59] [60] [61] [62] [63] [64] [65] [66] [67] [68] [69] [70] [71] [72] [73] [74] [75] [76] 29 Lant. M. Tham.. B. IEE Proceedings Part D. S. H.. Nº 3. Addison-Wesley. B. 1989.. M. J. IEE Proceedings.. Hunt. London. Turkey.. pp. Vol. 4. L. Proceedings of the ISMM 8th Symposium MIMI.. Proceedings of the Symposium 'Application of multivariable system technique'... Wittenmark.K.. 1984. 138.. 10.

. IEEE Automatic Control Conference. of Michigan Press. Y. 10. 2. 1. Vol. G.. D. 1990. F. Fleming. I.. Vol. 'Beyond regression: New tools for prediction and analysis in the behavioral sciences'. 279-285 Golub. pp. England.. 1989. He. 1988. 1973... ‘Digital spectral analysis with applications’. Ann Arbor: The Univ. Nº 4.. V.. 1990. 1st IEEE International Conference on Neural Networks. 1975 Goldberg. D.49-57 Ruano. UCNW.. pp.. G.. Thesis. IEEE Control Systems.. 'A local interaction heuristic for adaptive networks'.[77] [78] [79] [80] [81] [82] [83] [84] [85] [86] [87] [88] [89] [90] [91] [92] [93] [94] [95] [96] [97] [98] [99] to control nonlinear systems'. 2. S. Office of Technology Licensing. 15. Edinburgh. 'A nonlinear regulator design in the presence of system uncertainties using multilayered neural networks'. P. R.382-391 Jacobs. 1992 Kaufman. Part D... pp. Vol. ‘The differentiation of pseudo-inverses and nonlinear least squares problems whose variables separate’. IEE Internacional Conference Control 91. 410-417 Ruano. Nº 3. pp. P. ‘Genetic programming: modular neural evolution for Darwin machines’. Pereyra.S. Neural Computation.. 2454-2459 Chen. A. D. Nº 3. Tokumaru.. Fleming. H. Macmillan Publishing Company. D. 'Back-propagation neural networks for nonlinear self-tuning adaptive control'. J. pp. pp.A. ‘Linear algebra and its applications’. A. 1980 Golub. C. 1991. R. Vol. 3. ‘Introduction to digital signal processing’. Vol.. San Diego. pp. optimization and machine learning’.. 'Increased rates of convergence through learning rate adaptation'. J. 1989 Ruano. and Fleming P. P. L. Jones. 1974 Parker. Vol. 413432 Ruano. Artificial Neural Networks 277 . 44-48 Iiguni. K. G. 'Learning logic'. UK. pp. Invention report.’Genetic algorithms in search. 865-866 Strang.. A. Nº 3. 2. 199 1. File 1. 2nd IEEE International Conference on Neural Networks. Vol. pp. 1987.. Math. 'A connectionist approach to PID autotuning'. PhD. North Oxford Academic. Yu and Ahmad. D. BIT. International Joint Conference on Neural Networks. L. PhD. Harvard University. 1988....K. Doctoral Dissertation.. Nº 3. pp. ‘A variable projection method for solving separable nonlinear least squares problems’. Uhr. 1988 Marple. Vol. IEE Proceedings.. Stanford University Tesauro.. Vol.. Neural Networks. Prentice-Hall International. 10. IEEE Transactions on Neural Networks'. pp. 'Learning algorithms for connectionist networks: applied gradient methods of nonlinear optimization'. A.. 1991. Nº 2. H.. Addison-Wesley. ‘Adaptation in natural and artificial systems’. 762-767 Werbos. D. 1987 Holland. A. 295-307 Watrous. pp. SIAM JNA. G.. S. 1. Manolakis.. Vol. ‘Matrix computations’. 1. P. 'Asymptotic convergence of backpropagation'. 317-324 Astrom. Sakai. S81-64. 1992. H. B. ‘A new formulation of the learning problem for a neural network controller’ 30th IEEE Conference on Decision and Control. 'Adaptive Control'. 511 Grace.. Jones. 'A connectionist approach to PID autotuning'.. Addison-Wesley.. 3. 1983 Proakis. Vol. Wittenmark.. ‘Applications of Neural Networks to Control Systems’. 619-627 Sandom. Van Loan. 1975. ‘Computer-aided control system design using optimization methods’. Vol. Academic Press.. 1989 de Garis. Appl. pp. U. 139. Brighton. U.H. Jones.

pp. Q. ‘An algorithm for least-squares estimation of nonlinear parameters’. J. pp. K.. R.C. pp. Fukuoka. 1963.. ‘Fast learning in networks of locally-tuned processing units’. 1995. on Design Methods for Control Systems. Nº 1.. Proc. 24. Vol. ‘A new approach to variable metric algorithms’. Wright.. T. ‘Adaptive control processes’. 6. John Wiley & Sons.. 8-11 July 1997 [123] Sjoberg. Math.. E. Academic Press.. Princeton University Press.-Y-.. Militzer.. A. M. Vol.Thesis. Murray. pp. Quart. P. 24. 1993 [120] Weyer. ‘Variable metric method for minimization’. 163-168 [105] Broyden. 134139.H. SIAM J. M. Appl. A. 947-67. D. 11. Technical Report. Vol. Girosi.O.E. pp.5990.. 1. 1st IFAC Symp.K.. ’Optimal adaptive k-means algorithm with dynamic adjustment of learning rate’. 1981 [102] Fletcher.76-90 [106] Fletcher. Wiley.C.. 777-82 [116] Cox.. F. 1970. Mathematics of Computation. Macedonia. Zurich. Ljung. M. Math. Proc. Appl. U.431-441 [111] Powell. Research and Development Report. pp. M. Vol. J. C. ’Identification with dynamic neural networks architectures. Journal of Control. Séquin.. 23-26 [108] Shanno. Darken. ‘A family of variable metrics updates derived by variational means’.. R... ’Learning in Intelligent Control’. R. Eds. University College of North Wales. IEEE Transactions on Neural Networks... Mason J.. 157-169 [118] Poggio. 1972 [117] Chinrungrueng. pp. R. ‘The numerical evaluation of B-splines’. Delyon.. Powell. Int.. ‘A method for the solution of certain problems in least squares’. D. ‘Conditioning of quasi-Newton methods for function minimization’. ‘Pattern classification and scene analysis’. Hart.. ‘Practical optimization’. M. pp. 78. Cox.C. 6... Neural Computation. 647-656 [109] Levenberg.. 1481-1497.. ‘The convergence of a class of double-rank minimization algorithms’.. 1989 [100] Luenberger.. 1963.E. 10. 9-10 October 1998 [122] Isermann. 1990 [119] Kavli. pp.J. ‘Improved allocation of weights for associative memory storage in learning control systems’. C. ANL . Vol. R.. 4. Japan. O. 1944. ‘Introduction to linear and nonlinear programming’. pp. ‘A rapidly convergent descent method for minimization’.. Benveniste. of the IEEE. 1973 [101] Gill. T. C. Mathematics of Computation.. P. 1970.. 143-167 [112] Moody.. D. in ‘Algorithms for approximation of functions and data’...D. 1970. P. 317-322 [107] Goldfarb. 1961 [115] Parks. Computer Journal. ‘Radial basis functions for multivariable interpolation: a review’. 164-168 [110] Marquardt. Journal of the Institute of Mathematics and its Applications. Oxford University Press. ‘Practical methods of optimization’. W.. Vol. Zhang. Inst. Math.. J. pp. W.G. ’The ASMOD algorithm: some new theoretical and experimental results’. Cosy Annual Meeting. pp. March 1995 [121] Dourado. Ernst. Nelles. pp. IFAC Symposium on System Identification. R. S. applications’. ’Networks for approximation and learning’. Ohrid. Hjal- 278 Artificial Neural Networks . 1970. 1959 [104] Fletcher. J.G.. P.. D.E. comparisons. Vol. 2.. Vol. 58. Computer Journal. Glorennec. Addison-Wesley Publishing. pp. ’ASMOD-an algorithm for adaptive spline modelling of observation data’. Apl. L. I. Vol. 281-294 [113] Duda... 6. T. C. Kavli. B. 1991.J. A. 13. 1973 [114] Bellman. 1980 [103] Davidon.

Stinchcombe.. ’Exploiting the Separability of Linear and Nonlinear Parameters in Radial Basis Function Networks’.C. ’A Simplified Neuron Model as a Principal Component Analyzer’.. 2000 Ferreira. V. Valencia. Automatica.M. ’On the Approximate Realization of Continuous Mappings by Neural Newtorks’.R. K. V. 39. Springer Verlag. E.. 183-192.. International Journal of Systems Science. C. ’Content-Addressable Memory Storage by Neural Networks: a General Model and Global Lyapunov Method’.[124] [125] [126] [127] [128] [129] [130] [131] [132] [133] [134] [135] [136] [137] [138] [139] [140] [141] [142] [143] marsson. 1994 Haykin. 169-176. F. A. 1989 Park. P. Prentice-Hall.. ’Universal Approximation using Radial-Basis Function Network’.V. 2000 Vapnik. A. 239-245. S. 1691-1724. 1990 Ruano. Poggio. Spain. in ’Computational Neuroscience’.T.C.M. 63.. 31. John Wiley & Sons..). 33. Schwartz (Ed. MIT Press. 1995 Vapnik.. A.. 359-366. 8. A. ’Neurofuzzy Adaptive Modelling and Control’. ’Singel and Multiobjective Programming Design for B-spline Neural Networks and Fuzzy Systems’. W. L.E. 93-98.. IEEE Transactions on Signal Processing’... AFNC’01. 1998 Yee. L. ’Absolute Stability of Global Pattern Formation and Parallel Memory Storage by Competitive Neural Networks’. Symposium 2000: Adaptive Systems for Signal Processing.L. and Lefebvre. P. 9 (2). Neural Networks. ’Nonlinear Black-box Modelling in System Identification: a Unified Overview’. 2001 Artificial Neural Networks 279 .. ’A Dynamic Regularized Radial Basis Function Network for Nonlinear... Neural Networks. 12. 1989 Hornik. Grossberg. 2. Haykin. K. H. 1991 Girosi. S. Lake Louise..E. C.E. Oliveira... M. White. 3. 47 (9). E. IEEE Trans. Ruano. Euliano.. 1993 Funahashi. ’Neural Networks: a comprehensive foundation (2nd ed. 2002 Cabrita. ’Multilayer Feedforward Networks are Universal Approximators’. S. Prentice Hall..Wiley. M. ’Performance Evaluation of a Sequantial Minimal Radial Basis Function Neural Network Learning Algorithm’. Cabrita. MIT Press. Kóczy. 1988 Cohen. 689-711. Systems. P.. Saratchandran. NeuroComputing (to appear) Principe. pp. 1999 Yingwei..W. 1990 Ferreira.. Nonstationary Time Series Prediction’. Juditsky..E... S.. N. 56-65. A. 2.. 13. ’Universal Approximation Bounds for Superpositions of a Sigmoidal Transactions on Information Theory’. J. Ruano. Neural Networks.. M. Fonseca. Ruano. ’Statistical Learning Theory’. J. 1995 Brown. S. 815-826. Canada.. Sandberg. 246-257. Man and Cybernetics. 1998 Oja.. Journal of Mathematical Biology. ’Neural and Adaptive Systems: Fundamentals through Simulations’. R. Communications and Control (AS-SPPC).. ’Networks and the Best Approximation Property’. ’The Nature of Statistical Learning Theory’. 1982 Grossberg. 15. 308-318. 930-945. J. C. 1983 Grossberg. T. P. IFAC Workshop on Advanced Fuzzy/Neural Control. ’Neural Networks and Natural Intelligence’. N. A.)’.A.. C.. Harris. 321-326. Biological Cybernetics. ’Neural Network Models in Greenhouse Environmental Control’. Sundararajan. 2503-2521. ’Supervised Training Algorithms for B-Spline Neural Networks and Neuro-Fuzzy Systems’. IEEE Transactions on Neural Networks’. 1999 Barron.. I.M.. H..

280 Artificial Neural Networks .

145 C Cauchy-Schwartz inequality 43 cell body 4 Cerebellar Model Articulation Controller 10 coincident knot 136 compact support 137 competitive learning 23 condition number 78 consistent equations 227 content addressable memories 35 cross-correlation matrix 203 curse of dimensionality 138 Artificial Neural Networks 281 . 50 adaline learning rule 54 adaptation 23 Adaptive Resonance Theory 8 anti-Hebbian rule 201 approximators 125 ASMOD 154 Associative memory 190 Auto-associator 269 Autocorrelation 190 autocorrelation matrix 112 Autoencoder 270 axon 4 B basis functions 135 basis functions layer 21 batch learning 23 batch update 57 Bayesian threshold 244 Bemard Widrow 3 best approximation property 123 BFGS update 84 Bidirectional Associative Memories 8 Boltzmann Machines 8 Brain-State-in-a-Box 8. 223 B-Spline 9. 8 active lines 40 Adaline 3.Index A a posterior probability 242 a posteriori output error 113 action potentials 4 activation function 6.

120 delta-bar-delta rule 121 deterministic learning 23 DFP method 84 direct approach 30 discriminant functions. 189 Hessian matrix 83 Hetero-associator 269 hidden layer 20 hidden neurons 19 Hopfield 8 hybrid 125 I Idempotent matrices 235 implicit regularization 106 Indicator function 180 indirect approach 30 Inner-product kernel 182 input layer 19 input selection 200 instantaneous autocorrelation matrix 112 instantaneous learning 23 interior knots 136 interpolation 125 J Jacobean matrix 56 282 Artificial Neural Networks .D decision surface 242. 246 delta rule 54. 23 Hebbian or correlative learning 22 Hebbian rule 2. 145 Generalized Hebbian Algorithm 269 Generalized Inverse 230 Generalized Radial Basis Function 128 Gradient ascent. 191 gradient descent learning 22 Gradient vector 55 H Hebbian learning 22. 246 displacement matrix 140 Donald Hebb 2 Dual problem 177 E early-stopping 106 energy minimization 214 epoch learning 23 epoch mode 57 Error back-propagation 63 explicit regularization 106 exterior knots 136 F Feature space 252 feedback factor 224 feedforward networks 19 forced Hebbian learning 22 fully connected network 19 functional form of discriminant functions 253 G Gauss-Newton 84 generalised learning architecture 27 generalization 106 generalization parameter 137.

John Hopfield 3 K k-means clustering 126 Kohohen Self-Organized Feature Maps 8 L Lagrangian function 177 lattice-based networks 134 learning 21 Least-Mean-Square 54 Least-Means-Square 50 least-squares solution 76 Levenberg-Marquardt methods 84 likelihood 242 linear 21 Linear Assocative Memory 203 linear weight layer 21 lineraly separable classes 42 LMS rule 110 localised Gaussian function 16 look-up-table. 141 M Madaline 51 Maximum eigenfilter 192 McCulloch 2 mean 242 mean-square error 110 Mercer’s theorem 182 minimal capture zone 119 Minimum distance classifier 250 modal matrix 77 modules 5 momentum term 64. 135 null-space 201 O off-line learning 23 Oja’s rule 191 on-line learning 23 optimal classifier 242 Orthonomal matrices 230 output dead-zones 118 output function 6. 8 output layer 20 overlay 137 overtraining 106 P parallel model. 120 multilayer network 19 N nerve terminal 4 net input 6. 26 parametric classifiers 246 Parametric training of classifiers 253 partially connected network 19 partition of the unity 138 Artificial Neural Networks 283 . 8 neuron 4 NLMS rule 113 non-parametric classifiers 246 nonparametric training for classifiers 253 normal equation matrix 76 normal equations 76 normalized input space layer 21.

123 Unsupervised learning 23.pattern learning 23 pattern mode 57 PDP 3 perceptron 2 Perceptron convergence theorem 41 Perceptrons 3 Pitts 2 placement of discriminant functions 253 Power 191 principal component analysis 193 principal components 193 probability density function 242 pruning set 155 Pseudo-Inverse 230 Q QR decomposition 236 Quasi-Newton methods 83 R Radial Basis Functions Networks 8. 190 V validation set 106 284 Artificial Neural Networks . 8 T tensor product 146 test set 106 time constant 79 topology conserving mapping 139 training 23 training the classifier 252 U uniform projection principle 139 universal approximators 39. 123 random variable 242 receptors 5 recurrent networks 19 refinement step 155 regions 5 regularization 106. 108 regularization factor 86 regularization theory 128 reinforcement learning 22 Rosenblatt 2 S series-parallel model 26 Similarity measure 190 singlelayer network 19 Slack variables 180 Soft margin 179 specialised learning architecture 29 spread heuristics 126 spurious states 220 steepest descent 55 stochastic learning 23 stopped search 106 supervised learning 22 support 137 support vector machine 182 Support Vector Machines 252 Support vectors 174 SVD decomposition 230 synapses 4.

variance 242 W weight 8 Widrow-Hoff learning algorithm 54 Widrow-Hoff learning rule 3 Wiener-Hopf equations 112 Artificial Neural Networks 285 .

Sign up to vote on this title
UsefulNot useful