Published by mundosss on Nov 02, 2010
Copyright: Attribution Non-commercial

Artificial Neural Networks

António Eduardo de Barros Ruano

Centre for Intelligent Systems
University of Algarve
Faro, Portugal
aruano@ualg.pt


1 - Introduction to Artificial Neural Networks.............................................. 1
1.1 - An Historical Perspective .......... 2
1.2 - Biological Inspiration .......... 4
1.3 - Characteristics of Neural Networks .......... 7
1.3.1 - Models of a Neuron .......... 7
    - Input Synapses .......... 8
    - Activation Functions .......... 11
        - Sign, or Threshold Function .......... 11
        - Piecewise-linear Function .......... 11
        - Linear Function .......... 12
        - Sigmoid Function .......... 13
        - Hyperbolic tangent .......... 14
        - Gaussian functions .......... 16
        - Spline functions .......... 16
1.3.2 - Interconnecting neurons .......... 19
    - Single-layer feedforward network .......... 19
    - Multilayer feedforward network .......... 20
    - Recurrent networks .......... 20
    - Lattice-based associative memories .......... 21
1.3.3 - Learning .......... 21
    - Learning mechanism .......... 22
    - Off-line and on-line learning .......... 23
    - Deterministic or stochastic learning .......... 23
1.4 - Applications of neural networks .......... 23
1.4.1 - Control systems .......... 23
    - Nonlinear identification .......... 24
        - Forward models .......... 25
        - Inverse models .......... 27
    - Control .......... 30
        - Model reference adaptive control .......... 30
        - Predictive control .......... 33
1.4.2 - Combinatorial problems .......... 34
1.4.3 - Content addressable memories .......... 34
1.4.4 - Data compression .......... 35
1.4.5 - Diagnostics .......... 35
1.4.6 - Forecasting .......... 35
1.4.7 - Pattern recognition .......... 35
1.4.8 - Multisensor data fusion .......... 35
1.5 - Taxonomy of Neural Networks .......... 36
1.5.1 - Neuron models .......... 36
1.5.2 - Topology .......... 37
1.5.3 - Learning Mechanism .......... 37
1.5.4 - Application Type .......... 38

2 - Supervised Feedforward Neural Networks .............................................39
2.1 - Multilayer Perceptrons .......... 39
2.1.1 - The Perceptron .......... 40
    - Perceptron Convergence Theorem .......... 41
    - Examples .......... 44
2.1.2 - Adalines and Madalines .......... 50
    - The LMS Algorithm .......... 54
    - Justification of the LMS Algorithm .......... 54
2.1.3 - Multilayer perceptrons .......... 61
    - The Error Back-Propagation Algorithm .......... 63
        - The Jacobian matrix .......... 64
        - The Gradient Vector as computed by the BP Algorithm .......... 70
    - Analysis of the error back-propagation algorithm .......... 70
        - Normal equations and least-squares solution .......... 75
        - The error back-propagation algorithm as a discrete dynamic system .......... 76
        - Performance surface and the normal equations matrix .......... 80
    - Alternatives to the error back-propagation algorithm .......... 81
        - Quasi-Newton method .......... 84
        - Gauss-Newton method .......... 84
        - Levenberg-Marquardt method .......... 86
    - Examples .......... 87
    - A new learning criterion .......... 93
    - Practicalities .......... 102
        - Initial weight values .......... 102
        - When do we stop training? .......... 105
    - On-line learning methods .......... 109
        - Adapting just the linear weights .......... 110
        - Adapting all the weights .......... 119
2.2 - Radial Basis Functions .......... 123
2.2.1 - Training schemes .......... 125
    - Fixed centres selected at random .......... 125
    - Self-organized selection of centres .......... 125
    - Supervised selection of centres and spreads .......... 127
    - Regularization .......... 128
2.2.2 - On-line learning algorithms .......... 128
2.2.3 - Examples .......... 129
2.3 - Lattice-based Associative Memory Networks .......... 134
2.3.1 - Structure of a lattice-based AMN .......... 135
    - Normalized input space layer .......... 135
    - The basis functions .......... 137
    - Partitions of the unity .......... 138
    - Output layer .......... 138
2.3.2 - CMAC networks .......... 139
    - Overlay displacement strategies .......... 140
    - Basis functions .......... 141
    - Training and adaptation strategies .......... 141
    - Examples .......... 142
2.3.3 - B-spline networks .......... 145
    - Basic structure .......... 145
        - Univariate basis functions .......... 148
        - Multivariate basis functions .......... 151
    - Learning rules .......... 152
    - Examples .......... 152
    - Structure Selection - The ASMOD algorithm .......... 153
        - The algorithm .......... 154
    - Completely supervised methods .......... 162
2.4 - Model Comparison .......... 162
2.4.1 - Inverse of the pH nonlinearity .......... 163
    - Multilayer perceptrons .......... 163
    - Radial Basis functions .......... 164
    - CMACs .......... 165
    - B-Splines .......... 165
2.4.2 - Inverse coordinates transformation .......... 166
    - Multilayer perceptrons .......... 166
    - Radial Basis functions .......... 167
    - CMACs .......... 168
    - B-Splines .......... 168
2.4.3 - Conclusions .......... 169
2.5 - Classification with Direct Supervised Neural Networks .......... 170
2.5.1 - Pattern recognition ability of the perceptron .......... 170
2.5.2 - Large-margin perceptrons .......... 173
    - Quadratic optimization .......... 174
    - Linearly separable patterns .......... 175
    - Non linearly separable patterns .......... 179
2.5.3 - Classification with multilayer feedforward networks .......... 181

3 - Unsupervised Learning .........................................................................189
3.1 - Non-competitive learning .......... 189
3.1.1 - Hebbian rule .......... 189
    - Hebbian rule and autocorrelation .......... 190
    - Hebbian rule and gradient search .......... 191
    - Oja's rule .......... 191
3.1.2 - Principal Component Analysis .......... 192
3.1.3 - Anti-Hebbian rule .......... 201
3.1.4 - Linear Associative Memories .......... 202
    - Nonlinear Associative Memories .......... 204



3.1.5 - LMS for LAMs .......... 204
3.1.6 - LMS as a combination of Hebbian rules .......... 205
3.2 - Competitive Learning .......... 205
3.2.1 - Grossberg's instar-outstar networks .......... 206
3.2.2 - Self-organizing maps .......... 207
    - Competition .......... 207
    - Cooperative process .......... 208
    - Adaptation .......... 208
    - Ordering and Convergence .......... 209
    - Properties of the feature map .......... 209
3.2.3 - Learning Vector Quantization .......... 210
3.2.4 - Counterpropagation networks .......... 210
3.2.5 - Classifiers from competitive networks .......... 210
3.2.6 - Adaptive Resonance Theory .......... 210

4 - Recurrent Networks .......... 213
4.1 - Neurodynamic Models .......... 214
4.2 - Hopfield models .......... 217
4.2.1 - Discrete Hopfield model .......... 218
    - Storage phase .......... 219
    - Retrieval phase .......... 219
4.2.2 - Operation of the Hopfield network .......... 220
4.2.3 - Problems with Hopfield networks .......... 222
4.2.4 - The Cohen-Grossberg theorem .......... 223
4.3 - Brain-State-in-a-Box models .......... 224
    - Lyapunov function of the BSB model .......... 224
    - Training and applications of BSB models .......... 224

Appendix 1 - Linear Equations, Generalized Inverses and Least Squares Solutions .......... 227
1.1 - Introduction .......... 227
1.2 - Existence of solutions .......... 227
1.2.1 - Test for consistency .......... 228
1.2.2 - Solutions of a linear system with A square .......... 229
1.3 - Generalized Inverses .......... 229
    - Solving linear equations using generalized inverses .......... 232
1.4 - Inconsistent equations and least squares solutions .......... 234
    - Least squares solutions using a QR decomposition .......... 236

Appendix 2 - Elements of Pattern Recognition .......... 241
2.1 - Introduction .......... 241
2.2 - The Pattern-Recognition Problem .......... 242
2.3 - Statistical Formulation of Classifiers .......... 245
    - Minimum Error Rate and Mahalanobis Distance .......... 246
2.4 - Discriminant Functions .......... 246
    - Decision Surfaces .......... 250
2.5 - Linear and Nonlinear Classifiers .......... 251
2.6 - Methods of training parametric classifiers .......... 251
2.7 - A Two-dimensional Example .......... 252

Appendix 3 - Elements of Nonlinear Dynamical Systems .......... 255
3.1 - Introduction .......... 255
3.2 - Equilibrium states .......... 256
3.3 - Attractors .......... 258
3.4 - Stability .......... 259

Exercises .......... 261
Chapter 1 .......... 261
Chapter 2 .......... 263
Chapter 3 .......... 269
Chapter 4 .......... 271
Appendix 1 .......... 271

Bibliography .......... 273
Index .......... 281

CHAPTER 1
Introduction to Artificial Neural Networks

In the past fifteen years a phenomenal growth in the development and application of artificial neural networks (ANN) technology has been observed. Although work on artificial neural networks began some fifty years ago, widespread interest in this area has only taken place in the last years of the past decade. This means that, for the large majority of researchers, artificial neural networks are yet a new and unfamiliar field of science.

Because of that, or precisely because of that, writing a syllabus of artificial neural networks is not an easy task. This task becomes more complicated if we consider that contributions to this field come from a broad range of areas of science, like neurophysiology, psychology, physics, electronics, control systems and mathematics, and applications are targeted for an even wider range of disciplines.

Although active research in this area is still recent, a large number of neural networks (including their variants) have been proposed, and there are different learning rules around. Some were introduced with a specific application in mind. Neural networks research has spread through almost every field of science, covering areas as different as character recognition [1] [2] [3], medicine [4] [5] [6], speech recognition and synthesis [7] [8], image processing [9] [10], robotics [11] [12] and control systems [13] [14].

For these reasons, this syllabus is confined to a broad view of the field of artificial neural networks. The most important ANN models are described in this introduction, and, in an attempt to introduce some systematisation in this field, they will be classified according to some important characteristics. Emphasis is given to supervised feedforward neural networks, which is the class of neural networks most employed in control systems, the research field of the author.

Another objective of this Chapter is to give ideas of where and how neural networks can be applied, as well as to summarise the main advantages that this new technology might offer over conventional techniques, specifically for the type of ANN discussed. These will be briefly indicated in this Chapter and more clearly identified in the following Chapters. Therefore, the outline of this Chapter is as follows:

In Section 1.1, a chronological perspective of ANN research is presented. As artificial neural networks are, to a great extent, inspired by the current understanding of how the brain works, a brief introduction to the basic brain mechanisms is given in Section 1.2, where it will be shown that the basic concepts behind the brain mechanism are present in artificial neural networks.

Section 1.3 introduces important characteristics of ANNs. To characterise a neural network, it is necessary to specify the neurons employed, how they are interconnected, and the learning mechanism associated with the network. Neuron models are discussed in Section 1.3.1. The most common types of interconnection are addressed in Section 1.3.2. Different types of learning methods are available, and these are summarised in Section 1.3.3.

Section 1.4 presents an overview of the applications of artificial neural networks to different fields of Science. As already mentioned, a special emphasis will be given to the field of Control Systems. Section 1.4.1 will discuss the role of ANNs in nonlinear systems identification; it will be shown that neural networks have been extensively used to approximate forward and inverse models of nonlinear dynamical systems. Building up on the modelling concepts introduced, neural control schemes are then discussed. Additional applications to other areas, such as optimization, pattern recognition, data fusion, etc., are briefly summarized in the sections below.

Finally, Section 1.5 will introduce general classifications or taxonomies. Based on the neuron models, the network topology, the learning mechanism and the application fields, four different taxonomies of ANNs are presented.

1.1 An Historical Perspective

The field of artificial neural networks (they also go by the names of connectionist models, parallel distributed processing systems and neuromorphic systems) is very active nowadays. Although for many people artificial neural networks can be considered a new area of research, which developed in the late eighties, work in this field can be traced back to more than fifty years ago [15] [16].

Primarily, McCulloch and Pitts [17], back in 1943, pointed out the possibility of applying Boolean algebra to nerve net behaviour. They essentially originated neurophysiological automata theory and developed a theory of nets.

In 1949 Donald Hebb, in his book The organization of behaviour [18], postulated a plausible qualitative mechanism for learning at the cellular level in brains. An extension of his proposals is widely known nowadays as the Hebbian learning rule.

In 1957 Rosenblatt developed the first neurocomputer, the perceptron. He proposed a learning rule for this first artificial neural network and proved that, given linearly separable classes, a perceptron would, in a finite number of training trials, develop a weight vector that would separate the classes (the famous Perceptron convergence theorem). His results were summarised in a very interesting book, Principles of neurodynamics [19].
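Rosenblatt's learning rule mentioned above can be sketched in a few lines. This is a minimal illustration, not the book's own presentation: the bipolar AND data, the unit learning rate and the epoch limit are assumptions made for this sketch. For linearly separable classes, the loop stops after an error-free pass, as the Perceptron convergence theorem guarantees.

```python
import numpy as np

def train_perceptron(X, y, epochs=100):
    """X: inputs (n_samples, n_features); y: targets in {-1, +1}."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])  # append a constant bias input
    w = np.zeros(Xb.shape[1])                      # weights plus bias term
    for _ in range(epochs):
        errors = 0
        for xi, target in zip(Xb, y):
            pred = 1 if xi @ w >= 0 else -1
            if pred != target:                     # update only on a misclassification
                w += target * xi
                errors += 1
        if errors == 0:                            # converged: all patterns separated
            break
    return w

# Two linearly separable classes: logical AND with bipolar targets
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([-1, -1, -1, 1])
w = train_perceptron(X, y)
Xb = np.hstack([X, np.ones((4, 1))])
print(all((1 if xi @ w >= 0 else -1) == t for xi, t in zip(Xb, y)))  # True
```

The update adds (or subtracts) the misclassified pattern itself to the weight vector, which is exactly the "finite number of training trials" the theorem bounds.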

About the same time Bernard Widrow modelled learning from the point of view of minimizing the mean-square error between the output of a different type of ANN processing element, the ADALINE [20], and the desired output vector over the set of patterns [21]. This work has led to modern adaptive filters. Adalines and the Widrow-Hoff learning rule were applied to a large number of problems, probably the best known being the control of an inverted pendulum [22].

In the sixties many sensational promises were made which were not fulfilled. About the same time Minsky and Papert began promoting the field of artificial intelligence at the expense of neural networks research. The death sentence to ANN was given in a book written by these researchers, Perceptrons [23], where it was mathematically proved that these neural networks were not able to compute certain essential computer predicates like the EXCLUSIVE OR boolean function. This discredited the research on artificial neural networks.

Until the 80's, research on neural networks was almost null. Notable exceptions from this period are the works of Amari [24], Anderson [25], Fukushima [26], Grossberg [27] and Kohonen [28]. Although people like Bernard Widrow approached this field from an analytical point of view, most of the research on this field was done from an experimental point of view.

Then in the middle 80's interest in artificial neural networks started to rise substantially, making ANN one of the most active current areas of research. The work and charisma of John Hopfield [29][30] has made a large contribution to the credibility of ANN. With the publication of the PDP books [31] the field exploded. Although this persuasive and popular work made a very important contribution to the success of ANNs, other reasons can be identified for this recent renewal of interest:

- One is the desire to build a new breed of powerful computers that can solve problems that are proving to be extremely difficult for current digital computers (algorithmically or rule-based inspired) and yet are easily done by humans in everyday life [32]. Cognitive tasks like understanding spoken and written language, image processing, and retrieving contextually appropriate information from memory are all examples of such tasks.

- Another is the benefit that neuroscience can obtain from ANN research. New artificial neural network architectures are constantly being developed, and new concepts and theories are being proposed to explain the operation of these architectures. Many of these developments can be used by neuroscientists as new paradigms for building functional concepts and models of elements of the brain.

- Also the advances of VLSI technology in recent years turned the possibility of implementing ANN in hardware into a reality. Analog, digital and hybrid electronic implementations [33][34][35][36] are available today, with commercial optical or electro-optical implementations being expected in the future.

- The dramatic improvement in processing power observed in the last few years makes it possible to perform computationally intensive training tasks which would, with older technology, have required an unaffordable time.
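Widrow's idea of minimizing the mean-square error between the ADALINE's output and the desired output can be sketched with the LMS (Widrow-Hoff) rule. The synthetic data, the target weights and the step size below are assumptions chosen for illustration only; on noiseless data generated by a linear map, the weights converge to that map.

```python
import numpy as np

def lms(X, d, mu=0.05, epochs=200):
    """LMS rule: X inputs (n_samples, n_features), d desired outputs."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xi, di in zip(X, d):
            e = di - xi @ w        # instantaneous error of the linear output
            w += mu * e * xi       # gradient-descent step on the squared error
    return w

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(50, 2))
d = X @ np.array([2.0, -1.0])      # desired outputs from a known linear map
w = lms(X, d)
print(np.allclose(w, [2.0, -1.0], atol=1e-3))
```

Unlike the perceptron rule, LMS adapts the weights on every pattern, in proportion to a graded error, which is why it leads directly to the modern adaptive filters mentioned above.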

1.2 - Biological Inspiration

One point that is common to the various ANN models is their biological inspiration. All of them try, to a greater or lesser extent, to exploit the available knowledge of the mechanisms of the human brain. Before describing the structure of artificial neural networks we should give a brief description of the structure of the archetypal neural network. Let us, then, very briefly consider the structure of the human brain.

It is usually recognized that the human brain can be viewed at three different levels [37]:
• the neurons - the individual components in the brain circuitry;
• groups of a few hundred neurons, which form modules that may have highly specific functional processing capabilities;
• at a more macroscopic level, regions of the brain containing millions of neurons or groups of neurons, each region having assigned some discrete overall function.

The basic component of brain circuitry is a specialized cell called the neuron. As shown in fig. 1.1, a neuron consists of a cell body with finger-like projections called dendrites and a long cable-like extension called the axon. The axon may branch, each branch terminating in a structure called the nerve terminal. Neurons are electrically excitable, and the cell body can generate electrical signals, called action potentials, which are propagated down the axon towards the nerve terminal. The electrical signal propagates only in this direction, and it is an all-or-none event. Information is coded in the frequency of the signal. The nerve terminal is close to the dendrites or cell bodies of other neurons, forming special junctions called synapses. Branching of the axon allows a neuron to form synapses on to several other neurons. On the other hand, more than one nerve terminal can form synapses with a single neuron. Circuits can therefore be formed by a number of neurons.

FIGURE 1.1 - A biological neuron

When the action potential reaches the nerve terminal it does not cross the gap; rather, a chemical neurotransmitter is released from the nerve terminal. This chemical crosses the synapse and interacts on the postsynaptic side with specific sites called receptors. The combination of the neurotransmitter with the receptor changes the electrical activity of the receiving neuron. Although there are different types of neurotransmitters, any particular neuron always releases the same type from its nerve terminals. These neurotransmitters can have an excitatory or an inhibitory effect on the postsynaptic neuron, but not both. The amount of neurotransmitter released and its postsynaptic effects are, to a first approximation, graded events that depend on the frequency of the action potentials of the presynaptic neuron. The input strengths are not fixed, but vary with use. These changes in the input strengths are thought to be particularly relevant to learning and memory.

Any particular neuron has many inputs (some receive nerve terminals from hundreds or thousands of other neurons), each one with a different strength. Inputs to one neuron can occur in the dendrites or in the cell body, and these spatially different inputs can have different consequences. The neuron integrates the input strengths and fires action potentials accordingly.

At a higher level, it is possible to identify different regions of the human brain, which have been associated with the processing of specific functions. Of particular interest is the cerebral cortex, where areas responsible for the primary and secondary processing of sensory information have been identified. In these areas some representation of the world around us is coded in electrical form. Therefore our sensory information is processed, in parallel, through at least two cortical regions, and converges into areas where some association occurs. In certain parts of the brain, groups of hundreds of neurons have been identified and are denoted as modules. It may well be that other parts of the brain are composed of millions of these modules, all working in parallel.

The brain is capable of learning, which is assumed to be achieved by modifying the strengths of the existing connections. The mechanisms behind this modification, which alter the effective strengths of the inputs to a neuron, are now beginning to be understood. However, a detailed description of the structure and the mechanisms of the human brain is currently not known.

From this brief discussion on the structure of the brain, some broad aspects can be highlighted: the brain is composed of a large number of small processing elements, the neurons, acting in parallel; these neurons are densely interconnected, one neuron receiving inputs from many neurons and sending its output to many neurons; and the brain is capable of learning, by modifying the strengths of the existing connections. These broad aspects are also present in all ANN models: in the models employed for the individual neurons, in the pattern of interconnectivity, and, especially, in the learning mechanisms.

Artificial neural networks can be loosely characterized by these three fundamental aspects [38].

Taking into account the brief description of the biological neuron given above, a model of a neuron can be obtained [39] as follows. In the neuron model, the frequency of oscillation of the ith neuron will be denoted by o_i (we shall consider that o is a row vector). As it is often thought that the neuron is some kind of leaky integrator of the presynaptic signals, the following differential equation can be considered for o_i:

do_i/dt = Σ_{j ∈ i} X_{i,j}(o_j) + b_i − g(o_i)    (1.1)

where X_{i,j}(.) is a function denoting the strength of the jth synapse out of the set i, b_i is the bias or threshold for the ith neuron, and g(.) is a loss term, which must be nonlinear in o_i to take the saturation effects into account, and also independent of the activity of the neuron.

To derive the function X_{i,j}(.), the spatial location of the synapses should be taken into account, as well as the influence of the excitatory and inhibitory inputs. This leads to a profusion of proposals for the model of the neuron. If the effect of every synapse is assumed independent of the other synapses, the principle of spatial summation of the effects of the synapses can be used. The majority, however, consider only the stationary input-output relationship, which can be obtained by equating (1.1) to 0. Assuming that the inverse of g(.) exists, o_i can then be given as:

o_i = g⁻¹( Σ_{j ∈ i} X_{i,j}(o_j) + b_i )    (1.2)

If the simple approximation

X_{i,j}(o_j) = o_j W_{j,i}    (1.3)

is considered, where W denotes the weight matrix, then the most usual model of a single neuron is obtained:

o_i = g⁻¹( Σ_{j ∈ i} o_j W_{j,i} + b_i ) = g⁻¹( o W_i + b_i )    (1.4)

where W_i denotes the ith column of W. The terms inside brackets are usually denoted as the net input:

net_i = Σ_{j ∈ i} o_j W_{j,i} + b_i = o W_i + b_i    (1.5)

Denoting g⁻¹(.) by f, usually called the output function or activation function, we then have:

o_i = f(net_i)    (1.6)

Sometimes it is useful to incorporate the bias in the weight matrix. This can easily be done by defining:

o'_i = [o_i  1],    W' = [W; bᵀ]    (1.7)

so that the net input is simply given as:

net_i = o'_i W'_i    (1.8)

Typically, f(.) is a sigmoid function, which has low and high saturation limits and a proportional range in between. However, other types of activation functions can be employed, as we shall see in future sections. The most common model of a neuron, consisting of eqs. (1.5) and (1.6), or alternatively of (1.7) and (1.8), is illustrated in fig. 1.2.

FIGURE 1.2 - Model of a neuron: a) Internal view; b) External view of a typical neuron

1.3 - Characteristics of Neural Networks

As referred in the last Section, neural network models can be somehow characterized by the models employed for the individual neurons, by the way that these neurons are interconnected between themselves, and by the learning mechanism(s) employed. This Section will try to address these important topics in the following three subsections.

1.3.1 - Models of a Neuron

Inspired by biological fundamentals, the most common model of a neuron was introduced in Section 1.2.
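The single-neuron model of eqs. (1.5)-(1.8) can be sketched in code as follows. The sigmoid choice for f and the numeric values are illustrative assumptions, not taken from the text:

```python
import numpy as np

def neuron_output(o, W_col, b, f):
    """Compute o_i = f(net_i), with net_i = o . W_i + b_i (eqs. 1.5-1.6)."""
    net = float(np.dot(o, W_col)) + b
    return f(net)

def neuron_output_bias_folded(o, W_col, b, f):
    """Same neuron, with the bias folded into the weights (eqs. 1.7-1.8):
    o' = [o 1] and the ith column of W' is [W_i; b_i]."""
    o_aug = np.append(o, 1.0)      # o'_i = [o_i 1]
    W_aug = np.append(W_col, b)    # column of W' = [W; b^T]
    return f(float(np.dot(o_aug, W_aug)))

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

o = np.array([0.5, -1.0, 2.0])      # outputs of the neurons feeding neuron i
W_col = np.array([0.1, 0.4, -0.2])  # weights W_{j,i} into neuron i
b = 0.3                             # bias/threshold b_i

# Both formulations give the same output
assert abs(neuron_output(o, W_col, b, sigmoid)
           - neuron_output_bias_folded(o, W_col, b, sigmoid)) < 1e-12
```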

Focusing our attention on fig. 1.2, we can identify three major blocks:

1. A set of synapses or connecting links, each link connecting one input to the summation block. Associated with each synapse there is a strength or weight, which multiplies the associated input signal. We denote this weight by W_{k,i}, where the second element in the subscript refers to the neuron in question (the ith neuron), and the first element refers to which one of the inputs the signal comes from (in this case the kth input to neuron i).

2. An adder, usually employed for computing a weighted summation of the input signals. The output of this adder is denoted, as already introduced, as the neuron net input. Note also the existence of a threshold or bias term associated with the neuron, which can be associated with the adder; when interpreted as a threshold, its influence appears in the equation of the net input with a negative sign. In some networks, the integration of the different input signals is not additive; instead, a product of the outputs of the activation functions is employed.

3. An activation or output function, usually nonlinear, which has as input the neuron net input. Typically, the range of the output lies within a finite interval, [0,1] or [-1,1].

These three major blocks are all present in ANNs; however, especially the first and the last blocks differ according to each ANN considered. The following Sections will introduce the major differences found in the ANNs considered in this text. - Input Synapses

In Multilayer Perceptrons (MLPs) [31], as already introduced, each synapse is associated with a real weight, which multiplies the associated input signal. The weights are determined during the learning process, and can assume any value, without limitations. The same happens with Hopfield networks, Bidirectional Associative Memories (BAM), Adaptive Resonance Theory (ART) networks, Kohonen Self-Organized Feature Maps (SOFM), Brain-State-in-a-Box (BSB) networks and Boltzmann Machines.

The weights of some neural networks, however, are not just determined by the learning process, but can be defined at an earlier stage of network design. Such are the cases of RBFs, CMACs and B-splines. The positive or negative value of the weight is not important for these ANNs. In these three networks, the weights related with the neuron(s) in the output layer are simply linear combiners, and can therefore be viewed as normal synapses, as encountered in MLPs.

Radial Basis Function Networks (RBFs) were first introduced in the framework of ANNs by Broomhead and Lowe [40], where they were used as functional approximators and for data modelling. Since then, due to the faster learning associated with these neural networks (relative to MLPs), they have been used more and more in different applications. The hidden layer of a RBF network consists of a certain number of neurons (more about this number later), where the 'weight' associated with each link can be viewed as the squared distance between the centre associated with the neuron, for the considered input, and the actual input value.
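The squared-distance 'weight' of an RBF hidden-layer link described above can be sketched as follows; the centre and input values are illustrative:

```python
import numpy as np

def rbf_weight(center, x):
    """'Weight' of the link from input x_k to RBF hidden neuron i:
    the squared distance (C_{i,k} - x_k)^2, which is zero at the centre
    and grows as the input departs from it."""
    return (center - x) ** 2

# Neuron with centre 0 for this input dimension (illustrative values)
xs = np.array([-2.0, 0.0, 3.0])
print(rbf_weight(0.0, xs))   # -> [4. 0. 9.]
```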

In these networks, the weight W_{k,i} can be described as:

W_{k,i} = (C_{i,k} − x_k)²    (1.9)

The following figure represents the weight value, as a function of the input values, assuming that the centre of the ith neuron associated with the kth input has been set at 0.

FIGURE 1.3 - Weight value of the hidden neuron of a RBF network, as a function of the input vector (centre 0)

As we can see, the weight value assumes a null value at the centre, and increases as we depart from it.

Another network that has embedded the concept of distance from centres is the B-spline neural network. B-splines are best known in graphical applications [41], although they are now employed for other purposes, such as robotics and adaptive modelling and control. In these networks, the neurons in the hidden layer also have associated with them a concept similar to the centres in RBFs, but also a region where the neuron is active or inactive. New to these neural networks is the order (o) of the spline, which is a parameter associated with the form of the activation function. Assuming that the input range of the kth dimension of the input is given by:

ϑ = [x_min, x_max]    (1.10)

and that we introduce n interior knots (more about this later), equally spaced in this range and denoted as λ_j, the number of hidden neurons (more commonly known as basis functions) for this dimension is given by:

m = n + o    (1.11)

The active region for each neuron is given by:

δ = o ⋅ (x_max − x_min) / (n + 1)    (1.12)

Assuming that the ith neuron has associated with it a centre c_i, its active region lies within the limits [c_i − δ/2, c_i + δ/2[. In this region the weight W_{k,i} can be given by:

W_{k,i} = 1 − |x − c_i| / (δ/2),  x ∈ [c_i − δ/2, c_i + δ/2[ ;  0 otherwise    (1.13)

The following figure illustrates the weight value, as a function of the distance of the input from the neuron centre. It is assumed, for simplicity, that the centre is 0, and that the spline order and the number of knots are such that its active range is 2.

FIGURE 1.4 - Weight value of the hidden neuron of a B-spline network, as a function of the input value (centre 0)

The concept of locality is taken to an extreme in Cerebellar Model Articulation Controller (CMAC) networks. These networks were introduced in the mid-seventies by Albus [42]. This network was originally developed for adaptively controlling robotic manipulators, but has in the last twenty years been employed for adaptive modelling and control, signal processing tasks and classification problems. In the same way as in B-spline neural networks, a set of n knots is associated with each dimension of the input space, and associated with each hidden neuron (basis function) there is also a centre. A generalization parameter, ρ, must be established by the designer. Equations (1.11) and (1.12) are also valid for this neural network.
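For equally spaced interior knots, the basis-function count and active-region width just given can be computed as follows; the numeric values are illustrative assumptions:

```python
def bspline_dimension(n_interior_knots, order):
    """Number of univariate basis functions for one input dimension: m = n + o."""
    return n_interior_knots + order

def active_width(x_min, x_max, n_interior_knots, order):
    """Width of the active region of each neuron for equally spaced knots:
    delta = o * (x_max - x_min) / (n + 1)."""
    return order * (x_max - x_min) / (n_interior_knots + 1)

# Example: input range [-1, 1], 3 interior knots, splines of order 2
assert bspline_dimension(3, 2) == 5
assert abs(active_width(-1.0, 1.0, 3, 2) - 1.0) < 1e-12
```

For a CMAC, the same expressions apply with the spline order o replaced by the generalization parameter ρ.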

The weight equation is, however, different, replacing the spline order, o, by the generalization parameter, ρ, and can be represented as:

W_{k,i} = 1,  x ∈ [c_i − δ/2, c_i + δ/2[ ;  0 otherwise    (1.14)

that is, all the values within the active region of the neuron contribute equally to the output value of the neuron. - Activation Functions

The different neural networks use a considerable number of different activation functions. Here we shall introduce some of the most common ones, and indicate which neural networks employ them. - Sign, or Threshold Function

For this type of activation function we have:

f(x) = 1 if x ≥ 0 ;  0 if x < 0    (1.15)

This activation function is represented in fig. 1.5.

FIGURE 1.5 - Threshold activation function - Piecewise-linear Function

This activation function is described by:

f(x) = 1 if x ≥ 1/2 ;  x + 1/2 if −1/2 < x < 1/2 ;  0 if x ≤ −1/2    (1.16)

Please note that different limits for the input variable and for the output variable can be used. The form of this function is represented in fig. 1.6.

FIGURE 1.6 - Piecewise-linear function - Linear Function

This is the simplest activation function. It is given simply by:

f(x) = x    (1.17)

A linear function is illustrated in fig. 1.7.

FIGURE 1.7 - Linear activation function

Its derivative, with respect to the input, is just:

f′(x) = 1    (1.18) - Sigmoid Function

This is by far the most common activation function. It is given by:

f(x) = 1 / (1 + e⁻ˣ)    (1.19)

It is represented in the following figure.

FIGURE 1.8 - Sigmoid function

Its derivative is given by:

f′(x) = ( (1 + e⁻ˣ)⁻¹ )′ = e⁻ˣ / (1 + e⁻ˣ)² = f(x)(1 − f(x))    (1.20)

It is the last equality in (1.20) that makes this function more attractive than the others. The derivative, as a function of the net input, is represented in fig. 1.9.

FIGURE 1.9 - Derivative of a sigmoid function - Hyperbolic tangent

Sometimes this function is used instead of the original sigmoid function. It is defined as:

f(x) = tanh(x/2) = (1 − e⁻ˣ) / (1 + e⁻ˣ)    (1.21)

Sometimes the function f(x) = tanh(x) = (1 − e⁻²ˣ) / (1 + e⁻²ˣ) is also employed.
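The convenient identity of eq. (1.20), which lets the derivative be obtained from the function value without re-evaluating the exponential, can be checked numerically; the sample points are illustrative:

```python
import math

def sigmoid(x):
    """Logistic activation, f(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_deriv(x):
    """Derivative via the identity f'(x) = f(x) * (1 - f(x))."""
    fx = sigmoid(x)
    return fx * (1.0 - fx)

# Compare the identity against a central-difference numerical derivative
for x in (-2.0, 0.0, 1.5):
    h = 1e-6
    num = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
    assert abs(num - sigmoid_deriv(x)) < 1e-8
```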

FIGURE 1.10 - Hyperbolic tangent function

The derivative of this function is:

f′(x) = (1/2)( 1 − tanh²(x/2) ) = (1/2)( 1 − f²(x) )    (1.22)

FIGURE 1.11 - Derivative of an hyperbolic tangent function

 - Gaussian functions

This type of activation function is commonly employed in RBFs, where each node is characterised by a centre vector and a variance. Remember that the weight associated with a neuron in the RBF hidden layer can be described by (1.9). Using this notation, and assuming just one input pattern with n different inputs, the activation function can be denoted as:

f_i(C_i, x, σ_i) = exp( − Σ_{k=1..n} (C_{i,k} − x_k)² / (2σ_i²) ) = Π_{k=1..n} exp( − (C_{i,k} − x_k)² / (2σ_i²) )    (1.23)

This function is usually called a localised Gaussian function. The following figure illustrates the activation function values of one neuron, considering its centre at [0,0] and a unitary variance, and assuming that there are 2 inputs, both in the range ]−2,2[.

FIGURE 1.12 - Activation values of the ith neuron, with centre C_i = [0,0] and σ_i = 1 - Spline functions

These functions, as the name indicates, are found in B-spline networks. The jth univariate basis function of order k is denoted by N_k^j(x), and it is defined by the following relationships:

N_k^j(x) = ( (x − λ_{j−k}) / (λ_{j−1} − λ_{j−k}) ) N_{k−1}^{j−1}(x) + ( (λ_j − x) / (λ_j − λ_{j−k+1}) ) N_{k−1}^j(x)

N_1^j(x) = 1 if x ∈ I_j ;  0 otherwise    (1.24)

In (1.24), λ_j is the jth knot, and I_j = [λ_{j−1}, λ_j] is the jth interval. The following figure illustrates univariate basis functions of orders o = 1…4, where for each input dimension 2 interior knots have been assigned. For all graphs, the same point x = 2.5 is marked, so that it is clear how many functions are active for that particular point.

FIGURE 1.13 - Univariate B-splines of orders 1-4: a) Order 1; b) Order 2; c) Order 3; d) Order 4

Multivariate B-spline basis functions are formed by taking the tensor product of n univariate basis functions. In the following figures we give examples of 2-dimensional basis functions, depending on the order of the spline.

FIGURE 1.14 - Two-dimensional multivariate basis functions formed with order 1 univariate basis functions; the sample point is marked

FIGURE 1.15 - Two-dimensional multivariate basis functions formed with order 2 univariate basis functions; the sample point is marked

The curse of dimensionality is clear from an analysis of the figures: as in the case of splines of order 3 for each dimension, 3² = 9 basis functions are active each time.
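The recursion of eq. (1.24) can be sketched as follows. Note this uses the standard Cox-de Boor indexing, in which the order-1 function N_{j,1} is 1 on [t_j, t_{j+1}); this may differ from the text's index convention by an offset, and the knot vector is an illustrative assumption:

```python
def bspline_basis(j, k, x, t):
    """Value at x of the jth univariate B-spline basis function of order k
    (degree k-1) over knot vector t, via the Cox-de Boor recursion."""
    if k == 1:
        return 1.0 if t[j] <= x < t[j + 1] else 0.0
    left = right = 0.0
    if t[j + k - 1] != t[j]:       # guard against repeated knots
        left = (x - t[j]) / (t[j + k - 1] - t[j]) * bspline_basis(j, k - 1, x, t)
    if t[j + k] != t[j + 1]:
        right = (t[j + k] - x) / (t[j + k] - t[j + 1]) * bspline_basis(j + 1, k - 1, x, t)
    return left + right

t = [0, 1, 2, 3, 4, 5]   # uniform knot vector (illustrative)
# Order-2 (triangular) basis function peaks with value 1 at its middle knot
assert abs(bspline_basis(1, 2, 2.0, t) - 1.0) < 1e-12
# Active basis functions form a partition of unity inside the valid span
assert abs(sum(bspline_basis(j, 3, 2.5, t) for j in range(3)) - 1.0) < 1e-12
```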

FIGURE 1.16 - Two-dimensional multivariate basis functions formed with order 3 univariate basis functions; the sample point is marked

1.3.2 - Interconnecting neurons

As already mentioned, biological neural networks are densely interconnected. This also happens with artificial neural networks. According to the flow of the signals within an ANN, we can divide the architectures into feedforward networks, if the signals flow just from input to output, or recurrent networks, if loops are allowed. Another possible classification depends on the existence of hidden neurons, i.e., neurons which are neither input nor output neurons. If there are hidden neurons, we denote the network as a multilayer NN; otherwise the network can be called a singlelayer NN. Also, if every neuron in one layer is connected with the layer immediately above, the network is called fully connected; if not, we speak about a partially connected network. Finally, some networks are more easily described if we envisage them with a special structure, a lattice-based structure. - Singlelayer feedforward network

The simplest form of an ANN is represented in fig. 1.17. On the left there is the input layer, which is nothing but a buffer, and therefore does not implement any processing. The signals

flow to the right through the synapses or weights, arriving at the output layer, where computation is performed.

FIGURE 1.17 - A single layer feedforward network - Multilayer feedforward network

In this case there are one or more hidden layers. The output of each layer constitutes the input to the layer immediately above. Usually the topology of a multilayer network is represented as [n_i, n_h1, …, n_o], where n denotes the number of neurons in a layer. For instance, an ANN [5, 4, 4, 1] has 5 neurons in the input layer, two hidden layers with 4 neurons in each one, and one neuron in the output layer. A multilayer feedforward network is represented in fig. 1.18.

FIGURE 1.18 - A multilayer feedforward network - Recurrent networks

Recurrent networks are those where there are feedback loops. Notice that any feedforward network can be transformed into a recurrent network just by introducing a delay, and feeding back this delayed signal to one input, as represented in fig. 1.19.

FIGURE 1.19 - A recurrent network
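The layer-by-layer signal flow of a multilayer feedforward network can be sketched as follows; the [3, 2, 2] topology, random weights and sigmoid activations are illustrative assumptions:

```python
import numpy as np

def forward(x, layers):
    """Propagate an input vector through a fully connected feedforward
    network. `layers` is a list of (W, b) pairs, one per computing layer;
    the input layer is just a buffer and performs no processing."""
    sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))
    o = np.asarray(x, dtype=float)
    for W, b in layers:
        o = sigmoid(o @ W + b)   # net input, then activation, per layer
    return o

rng = np.random.default_rng(0)
# Topology [3, 2, 2]: 3 inputs, one hidden layer of 2 neurons, 2 outputs
layers = [(rng.standard_normal((3, 2)), rng.standard_normal(2)),
          (rng.standard_normal((2, 2)), rng.standard_normal(2))]
y = forward([0.5, -1.0, 2.0], layers)
assert y.shape == (2,) and np.all((y > 0) & (y < 1))
```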

 - Lattice-based associative memories

Some types of neural networks, like CMACs, RBFs and B-spline networks, are best described as lattice-based associative memories (AMNs). All of these AMNs have a structure composed of three layers: a normalized input space layer, a basis functions layer and a linear weight layer.

FIGURE 1.20 - Basic structure of a lattice associative memory

The input layer can take different forms, but is usually a lattice on which the basis functions are defined; fig. 1.21 shows a lattice in two dimensions.

FIGURE 1.21 - A lattice in two dimensions

1.3.3 - Learning

As we have seen, learning is envisaged in ANNs as the method of modifying the weights of the NN such that a desired goal is reached. We can somehow classify learning with respect to the learning mechanism, with respect to when this modification takes place, and according to the manner in which the adjustment takes place.

 - Learning mechanism

With respect to the learning mechanism, learning can be divided broadly into three major classes:

1. supervised learning - this learning scheme assumes that the network is used as an input-output system. Associated with some input matrix, I, there is a matrix of desired outputs, or teacher signals, T. The dimensions of these two matrices are m*ni and m*no, respectively, where m is the number of patterns in the training set, ni is the number of inputs in the network and no is the number of outputs. The aim of the learning phase is to find values of the weights and biases in the network such that, using I as input data, the corresponding output values, O, are as close as possible to T. The most commonly used minimization criterion is:

Ω = tr(EᵀE)    (1.25)

where tr(.) denotes the trace operator and E is the error matrix, defined as:

E = T − O    (1.26)

There are two major classes within supervised learning:

• gradient descent learning - these methods reduce a cost criterion such as (1.25) through the use of gradient methods. Of course, in this case the activation functions must be differentiable. In each iteration, the weight W_{i,j} experiences a modification which is proportional to the negative of the gradient:

ΔW_{i,j} = −η ∂Ω/∂W_{i,j}    (1.27)

Examples of supervised learning rules based on gradient descent are the Widrow-Hoff rule, or LMS (least-mean-square) rule [21], and the error back-propagation algorithm [31].

• forced Hebbian or correlative learning - the basic theory behind this class of learning paradigms was introduced by Donald Hebb [18]. Suppose that you have m input patterns, and consider the pth input pattern, I_p; the corresponding target output pattern is T_p. Then the weight matrix W is computed as the sum of m superimposed pattern matrices W_p, where each is computed as a correlation matrix:

W = Σ_p I_pᵀ T_p    (1.28)

2. reinforcement learning - this kind of learning also involves minimisation of some cost function. In contrast with supervised learning, however, this cost function is only given to the network from time to time. In other words, the network does not receive a teaching signal at every training pattern, but only a score that tells it how it performed over a training sequence.
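The correlative (forced Hebbian) rule just described can be sketched as follows; the stored patterns are illustrative. With orthonormal input patterns, each stored input recalls its target exactly:

```python
import numpy as np

def hebbian_weights(I, T):
    """Forced Hebbian / correlative learning: W is the sum over patterns of
    the correlation matrices W_p = I_p^T T_p. I is m x ni (input patterns,
    one per row), T is m x no (target patterns)."""
    return I.T @ T   # equals sum_p outer(I_p, T_p)

# Two orthonormal input patterns stored with their targets
I = np.array([[1.0, 0.0], [0.0, 1.0]])
T = np.array([[0.5, -1.0], [2.0, 0.25]])
W = hebbian_weights(I, T)
assert np.allclose(I @ W, T)   # recall of each stored pattern is exact
```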

These cost functions are, for this reason, highly dependent on the application. A well known example of reinforcement learning can be found in [56].

3. unsupervised learning - in this case there is no teaching signal from the environment. The two most important classes of unsupervised learning are Hebbian learning and competitive learning:

• Hebbian learning - this type of learning mechanism, due to Donald Hebb [18], updates the weights locally, according to the states of activation of the input and output of each neuron: if both coincide, the strength of the weight is increased; otherwise it is decreased.

• competitive learning - in this type of learning an input pattern is presented to the network and the neurons compete among themselves. The processing element (or elements) that emerge as winners of the competition are allowed to modify their weights (or to modify their weights in a different way from those of the non-winning neurons). Examples of this type of learning are Kohonen learning [38] and the Adaptive Resonance Theory (ART) [44]. - Off-line and on-line learning

If the weight adjustment is only performed after a complete set of patterns has been presented to the network, we denote this process as off-line learning, batch learning, or epoch learning. On the other hand, if upon the presentation of each pattern there is always a weight update, we can say that we are in the presence of on-line learning, pattern learning, or instantaneous learning. A good survey of recent on-line learning algorithms can be found in [121]. - Deterministic or stochastic learning

In all the techniques described so far, the weight adjustment is deterministic. In some networks, like the Boltzmann Machine, the states of the units are determined by a probabilistic distribution. As learning depends on the states of the units, the learning process is stochastic.

1.4 - Applications of neural networks

During the past 15 years neural networks have been applied to a large class of problems, covering almost every field of science. A more detailed look at control applications of ANNs is given in Section 1.4.1. Other fields of application, such as combinatorial problems, content addressable memories, data compression, diagnosis, forecasting, data fusion and pattern recognition, are briefly described in the following sections.

1.4.1 - Control systems

Today, there is a constant need to provide better control of more complex (and probably nonlinear) systems, over a wide range of uncertainty. Artificial neural networks offer a large number of attractive features for the area of control systems [45][38]:

• The ability to perform arbitrary nonlinear mappings makes them a cost-efficient tool to synthesise accurate forward and inverse models of nonlinear dynamical systems, allowing traditional control schemes to be extended to the control of nonlinear plants.

This can be done without the need for detailed knowledge of the plant [46].

• Neural networks are massively parallel computation structures. This allows calculations to be performed at high speed, making real-time implementations feasible. The development of fast architectures further reduces computation time.

• The ability to create arbitrary decision regions means that they have the potential to be applied to fault detection problems.

• Neural networks can also provide significant fault tolerance, since damage to a few weights need not significantly impair the overall performance.

• As their training can be effected on-line or off-line, the applications considered in the last two points can be designed off-line and afterwards used in an adaptive scheme, if so desired.

Robots are nonlinear, complicated structures. It is therefore not surprising that robotics was one of the first fields where ANNs were applied, and it is still one of the most active areas for applications of artificial neural networks. A very brief list of important work in this area can be given. Kuperstein and Rubinstein [11] have implemented a neural controller which learns hand-eye coordination from its own experience. Elsley [47] applied MLPs to the kinematic control of a robot arm. Kawato et al. [48] performed feedforward control in such a way that the inverse system would be built up by neural networks in trajectory control. Guo and Cherkassky [49] used Hopfield networks to solve the inverse kinematics problem. Fukuda et al. [12] employed MLPs with time delay elements in the hidden layer for impact control of robotic manipulators.

Artificial neural networks have also been applied to sensor failure detection and diagnosis. In these applications, the ability of neural networks to create arbitrary decision regions is exploited, to detect, classify and recover from faults in control systems. As examples, Naidu et al. [50] have employed MLPs for sensor failure detection in process control systems. Leonard and Kramer [51] compared the performance of MLPs against RBF networks, concluding that under certain conditions the latter have advantages over the former. In process control, due to the lack of reliable process data, models are extensively used to infer primary values, which may be difficult to obtain, from more accessible secondary variables. Exploiting these properties, a possible use of ANNs is as control managers, deciding which control algorithm to employ based on current operational conditions. Narendra [52] proposes the use of neural networks in the three different capacities of identifiers, pattern recognisers and controllers.

During the last years, a large number of applications of ANNs to robotics, failure detection systems, nonlinear systems identification and control of nonlinear dynamical systems have been proposed. Only a brief mention of the first two areas has been made in this Section, the main effort being devoted to a description of applications in the last two areas. - Nonlinear identification

The provision of an accurate model of the plant can be of great benefit for control. Predictive control, self-tuning and model reference adaptive control all employ some form of plant model.

Most of the models currently employed are linear in the parameters. However, in the real world, most processes are inherently nonlinear. Artificial neural networks, making use of their ability to approximate large classes of nonlinear functions accurately, are now a recognised tool for nonlinear systems identification. RBF networks, CMAC networks and especially MLPs have been used to approximate forward as well as inverse models of nonlinear dynamical plants. Good surveys on neural networks for systems identification can be found in [122] and [123]. These applications are discussed in the following sections. - Forward models

Consider the SISO, discrete-time, nonlinear plant described by the following equation:

y_p[k+1] = f( y_p[k], …, y_p[k−n_y+1], u[k], …, u[k−n_u+1] )    (1.29)

where n_y and n_u are the corresponding lags in the output and input and f(.) is a nonlinear continuous function. Equation (1.29) can be considered as a nonlinear mapping from an (n_u + n_y)-dimensional space to a one-dimensional space. This mapping can then be approximated by a multilayer perceptron or a CMAC network (a radial basis function network can also be applied) with n_u + n_y inputs and one output.

FIGURE 1.22 - Forward plant modelling with ANNs

switch S can be connected to the output of the neural network, which then acts as a model for the plant; this last configuration is called the parallel model.

The neural model can be trained off-line, by gathering vectors of plant data and using them for training by means of one of the methods described later. For MLPs, the error back-propagation algorithm, usually in pattern mode, is employed; in the case of CMAC networks, the Widrow-Hoff rule, or alternative rules due to Chen et al. [53], can be employed. The neural network can also be trained on-line. One fact was observed by several researchers [13][56] when performing on-line training of MLPs with the error back-propagation algorithm: at the beginning of the training process, the output of the model rapidly follows the output of the plant, but as soon as the learning mechanism is turned off, the model and the plant diverge. This may be explained by the fact that the network is continuously converging to a small range of training data, represented by a small number of the latest training samples. Alternatively, the recursive-prediction-error (RPE) algorithm, essentially a recursive version of the batch Gauss-Newton method, can be used, achieving a superior convergence rate; this algorithm was introduced by Ruano [54], based on a recursive algorithm proposed by Ljung [55].

A large number of researchers have been applying MLPs to forward nonlinear identification. Lapedes and Farber [57] have obtained good results in the approximation of chaotic time series. Bhat et al. [58] have applied MLPs to model chemical processes. Narendra and Parthasarathy [13] have proposed different types of neural network models of SISO systems; these exploit the cases where eq. (1.29) can be recast as:

yp[k+1] = h(yp[k], ..., yp[k-ny+1]) + g(u[k], ..., u[k-nu+1])   (1.30)

with h(.) or g(.) possibly being linear functions. Willis, Morris, Lant et al. [59][60] have been applying MLPs to model ill-defined plants commonly found in industrial applications, with promising results. In [60] they have proposed, in order to incorporate dynamics in an MLP, to pass the outputs of the neurons through a filter with transfer function:

G(s) = e^(-Td*s) * N(s)/D(s)   (1.31)
When training off-line, it should be ensured that the input data adequately spans the entire input space that will be used in the recall operation, since ANNs can be successfully employed for nonlinear function interpolation, but not for function extrapolation purposes.

Fig. 1.22 illustrates the modelling of the plant described by (1.29). In this figure, the shaded rectangle denotes a delay. When the neural network is being trained to identify the plant, switch S is connected to the output of the plant; this configuration, denoted the series-parallel model, does not incorporate feedback in the neural network model, which helps to ensure the convergence of the neural model. After the neural network has completed the training, switch S can be connected to the output of the neural network (the parallel model). Training is achieved, as usual, by minimizing the sum of the squares of the errors between the output of the plant and the output of the neural network. This will be explained in Chapter 2.
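The difference between the two configurations can be sketched as follows (the plant and the slightly mismatched identified model below are illustrative stand-ins, not taken from the text):

```python
import numpy as np

def f(y, u):      # "true" plant, hypothetical, eq. (1.29) form with ny = nu = 1
    return 0.8 * y + 0.4 * u

def fhat(y, u):   # identified model with a small parameter error
    return 0.79 * y + 0.41 * u

u = np.sin(0.1 * np.arange(100))
yp = np.zeros(101)
for k in range(100):
    yp[k + 1] = f(yp[k], u[k])

# Series-parallel: the regressor is the measured plant output (switch S on plant)
y_sp = np.array([fhat(yp[k], u[k]) for k in range(100)])

# Parallel: the regressor is the model's own fed-back output (switch S on model)
y_par = np.zeros(101)
for k in range(100):
    y_par[k + 1] = fhat(y_par[k], u[k])
```

The series-parallel predictions stay close to the plant at every step, while the parallel (free-run) simulation accumulates the model error through the feedback loop — the behaviour that the on-line training remark above refers to.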

CMAC networks have also been applied to forward modelling of nonlinear plants: Ersü and others, at the Technical University of Darmstadt, have been working since 1982 on the application of CMAC networks to the control of nonlinear plants, and examples of their application can be found, for instance, in [62][63]. Recently, Billings et al. [61] have reminded the neural networks community that concepts from estimation theory can be employed to measure, interpret and improve network performance. - Inverse models

The use of neural networks for producing inverse models of nonlinear dynamical systems seems to be finding favour in control systems applications. In this application the neural network is placed in series with the plant, the aim being, with the use of the ANN, to obtain yp[k] = r[k], where r denotes the reference. Assuming that the plant is described by (1.29), the ANN would then implement the mapping:

u[k] = f^(-1)(yp[k+1], yp[k], ..., yp[k-ny+1], u[k-1], ..., u[k-nu+1])   (1.32)

As (1.32) is not realisable, since it depends on yp[k+1], this value is replaced by the reference value r[k+1]. The ANN will then approximate the mapping:

u[k] = f^(-1)(r[k+1], yp[k], ..., yp[k-ny+1], u[k-1], ..., u[k-nu+1])   (1.33)

as shown in fig. 1.23. Training of the ANN involves minimizing the cost function:

J = (1/2) Σj (r[j] - yp[j])²   (1.34)

over a set of patterns j. There are several possibilities for training the network, depending on the architecture used for learning. Psaltis et al. [64] proposed two different architectures. The first one, denoted the generalised learning architecture, is shown in fig. 1.24.

Generalised learning architecture. Using this scheme, training of the neural network involves supplying different inputs to the plant and teaching the network to map the corresponding outputs back to the plant inputs. This procedure works well, but it has the drawback of not training the network over the range where it will operate; consequently, the network may have to be trained over an extended operational range. Another drawback is that it is impossible to adapt the network on-line.

FIGURE 1.23 - Inverse modelling with ANNs (the ANN receives r[k+1], the delayed plant outputs yp[k], ..., yp[k-ny+1] and the past controls up[k-1], ..., up[k-nu+1], and produces up[k], which drives the plant)

FIGURE 1.24 - Generalised learning architecture (the plant output y is fed to the ANN, whose output is compared with the plant input to form the training error e)
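A sketch of generalised inverse learning (with a memoryless, invertible plant standing in for the dynamical case, and a polynomial least-squares fit standing in for the ANN — all of these are illustrative assumptions, not details from the text):

```python
import numpy as np

rng = np.random.default_rng(1)

def plant(u):                 # hypothetical invertible static nonlinearity
    return u ** 3 + u

u_train = rng.uniform(-1, 1, 200)      # inputs supplied to the plant
y_train = plant(u_train)               # corresponding plant outputs

# "Network": polynomial trained to map plant outputs back to plant inputs
coeffs = np.polyfit(y_train, u_train, deg=7)
inverse = np.poly1d(coeffs)

# The cascade plant(inverse(r)) should now reproduce the reference r,
# but only inside the range covered by the training data
r = np.linspace(-1.5, 1.5, 50)
err = np.max(np.abs(plant(inverse(r)) - r))
```

Note that the references r are kept inside the trained output range; outside it the fit extrapolates and degrades, which is exactly the operational-range drawback pointed out above.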

To overcome these difficulties, the same authors proposed another architecture, the specialised learning architecture, shown in fig. 1.25.

FIGURE 1.25 - Specialised learning architecture (r → ANN → u → plant → y; the error e = r - y is used to adjust the ANN)

In this case the network learns how to find the inputs, u, that drive the system outputs, y, towards the reference signal, r. This method of training addresses the two criticisms made of the previous architecture, because the ANN is now trained in the region where it will operate, and it can also be trained on-line. The problem with this architecture lies in the fact that, although the error r-y is available, there is no explicit target for the control input, u, to be applied in the training of the ANN. Psaltis proposed that the plant be considered as an additional, unmodifiable, layer of the neural network, through which the error is back-propagated; this, however, depends upon having prior knowledge of the Jacobian of the plant.

Examples of MLPs used for storing the inverse model of nonlinear dynamical plants can be found, for instance, in [56][58]. Harris and Brown [68] showed examples of CMAC networks for the same application, and Willis et al. [60] achieve good results when employing MLPs as nonlinear estimators of bioprocesses.

Applications of ANNs in nonlinear identification seem very promising; Lapedes and Farber [57] point out that the predictive performance of MLPs exceeds the known conventional methods of prediction. Artificial neural networks have been emerging, during the past ten years, as another available tool for nonlinear systems identification. It is expected that in the next few years this role will be consolidated, and that open questions, like determining the best topology of the network to use, will be answered.
To overcome this dependence, Saerens and Soquet [65] proposed either replacing the Jacobian elements by their sign, determining the Jacobian by finite-difference gradients, or performing a linear identification of the plant by an adaptive least-squares algorithm. An alternative, which seems to be widely used nowadays, was introduced by Nguyen and Widrow in their famous example of the 'truck backer-upper' [66]: another MLP, previously trained as a forward plant model, is employed, and the unknown Jacobian of the plant is approximated by the Jacobian of this additional MLP.
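The sign-of-Jacobian idea can be sketched on a trivial static example (a hypothetical plant with known positive gain and a one-weight controller; these specifics are illustrative, not from the text):

```python
def plant(u):
    return 2.0 * u + 0.1 * u ** 3   # hypothetical plant, dy/du > 0 everywhere

w = 0.0                              # single controller weight: u = w * r
lr = 0.05
for _ in range(300):
    r = 1.0                          # constant reference
    u = w * r
    y = plant(u)
    e = r - y                        # output error
    # Gradient of 0.5*e^2 wrt w is -e * (dy/du) * r;
    # dy/du is replaced by its sign (+1), as proposed by Saerens and Soquet
    w += lr * e * 1.0 * r

err = abs(plant(w * 1.0) - 1.0)
```

Replacing the Jacobian by its sign changes the step size but not the descent direction, so the output error is still driven towards zero for a suitably small learning rate.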

 - Control

Neural networks are nonlinear, adaptive elements. During the past ten years, ANNs have been integrated into a large number of control schemes currently in use, and they have found immediate scope for application in nonlinear, adaptive control. Following the overall philosophy behind this Chapter, selected examples of applications of neural networks for control will be pointed out, references being given to additional important work. - Model reference adaptive control

The largest area of application seems to be centred on model reference adaptive control (MRAC). MRAC systems are designed so that the output of the system being controlled follows the output of a pre-specified system with desirable characteristics. These schemes have been studied for over 30 years, and have been successful for the control of linear, time-invariant plants with unknown parameters. To adapt the controller gains, traditional MRAC schemes initially used gradient descent techniques, such as the MIT rule [69]; with the need to develop stable adaptation schemes, MRAC systems based on Lyapunov or Popov stability theories were later proposed [69].
There are two different approaches to MRAC [13]: the direct approach, in which the controller parameters are updated to reduce some norm of the output error, without determining the plant parameters; and the indirect approach, where the plant parameters are first estimated, and the control is then calculated assuming that these estimates represent the true plant values. Exploiting the nonlinear mapping capabilities of ANNs, several researchers have proposed extensions of MRAC to the control of nonlinear systems. Narendra and Parthasarathy [13] exploited the capability of MLPs to accurately derive forward and inverse plant models in order to develop different MRAC structures, using both the direct and the indirect approaches.

Assume, for instance, that the plant is given by (1.30), where h(.) and g(.) are unknown nonlinear functions, and that the reference model can be described as:

yr[k+1] = Σ_{i=0}^{n-1} ai yr[k-i] + r[k]   (1.35)

where r[k] is some bounded reference input. The aim of MRAC is to choose the control such that

lim_{k→∞} e[k] = 0   (1.36)

where the error e[k] is defined as:

e[k] = yp[k] - yr[k]   (1.37)

and yr[k] denotes the output of the reference model at time k.

Defining

g[k] = g(u[k], ..., u[k-nu+1]),  h[k] = h(yp[k], ..., yp[k-ny+1])   (1.38)

and assuming that (1.36) holds, we then have:

Σ_{i=0}^{n-1} ai yr[k-i] + r[k] = g[k] + h[k]   (1.39)

Rearranging this last equation, (1.40) is obtained:

g[k] = Σ_{i=0}^{n-1} ai yr[k-i] + r[k] - h[k]   (1.40)

and u[k] is then given as:

u[k] = g^(-1)( Σ_{i=0}^{n-1} ai yr[k-i] + r[k] - h[k] )   (1.41)

If good estimates of h[k] and g^(-1)[k] are available, (1.41) can be approximated as:

u[k] = ĝ^(-1)( Σ_{i=0}^{n-1} ai yr[k-i] + r[k] - ĥ[k] )   (1.42)

The functions h[k] and g[k] are approximated by two ANNs, ANNh and ANNg, interconnected as fig. 1.26 suggests.
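A sketch of the resulting control law, for an assumed first-order plant of the form (1.30) with perfect estimates of h and g (all specifics below are illustrative, not taken from the text):

```python
import math

# Plant: yp[k+1] = h(yp[k]) + g(u[k]); reference model: yr[k+1] = a0*yr[k] + r[k]
def h(y):        return 0.3 * math.sin(y)
def g(u):        return 2.0 * u
def hhat(y):     return 0.3 * math.sin(y)    # estimate of h (here exact)
def ghat_inv(v): return v / 2.0              # estimate of g^-1 (here exact)

a0 = 0.6
yp, yr = 0.5, 0.0                            # plant and reference model outputs
errs = []
for k in range(50):
    r = 1.0
    u = ghat_inv(a0 * yr + r - hhat(yp))     # control law, eq. (1.42) with n = 1
    yp = h(yp) + g(u)                        # plant update, eq. (1.30)
    yr = a0 * yr + r                         # reference model, eq. (1.35)
    errs.append(abs(yp - yr))
```

With exact estimates the control cancels h and inverts g, so the plant output matches the reference model output from the first step onwards; with neural estimates, the residual error (1.37) is what drives the adaptation.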

FIGURE 1.26 - Forward model

Notice that this structure may be viewed as a forward model, with the difference that the weights at the outputs of the neural networks ANNh and ANNg are fixed and unitary. This does not involve any problem for training, and any of the methods discussed in "Forward models" on page 25 can be used. Examples of this approach can be seen in [13][67][56].

For direct neural MRAC, the plant is situated between the output error and the output of the MLP; in order to feed back this error to the network, it is proposed either to use Saerens and Soquet's technique [65] of approximating the elements of the Jacobian of the plant by their sign, or to use a previously trained network.

can be obtained from past decisions: ( x c [ k ]. v [ k ] ) → u [ k ] (1. or directly as the control input upon user decision.2Predictive control Artificial neural networks have also been proposed for predictive control. ˜ iv) finally. the following sequence of operations is performed: i) the predictive model is updated by the prediction error e [ k ] = y p [ k ] – y m [ k ] . an approximation of u [ k ] .46) ˆ iii) the control decision u [ k ] is stored in the controller ANN to be used. ii) an optimization scheme is activated. To speed up the optimization process. w [ k + 1 ]. coined LERNAS. if necessary.43) where xm[k] and v[k] denote the state of the process and the disturbance at time k. to calculate the optimal control ˆ value u [ k ] that minimizes a l-step ahead subgoal of the type: l i=1 J[ k] = L s ( y m [ k + i ]. v [ k ] ) → y m [ k + 1 ] (1. At each time interval. since 1982. a sub-optimal control input u [ k ] is sought. a learning control structure. and u[k] is the control input to the plant. denoted as u [ k ] . Two CMAC networks (referred by Ersü as AMS) are used in their approach. w [ k ]. before applying a control to the plant.47) for some specified ε . The second ANN stores the control strategy: ( x c [ k ].44) where xc[k] and w[k] denote the state of the controller and the value of the set-point at time k. This process enlarges the trained region GT and speeds up the 33 Artificial Neural Networks . respectively.1. Ersü and others [71][72][62] have developed and applied. (1.1. u [ k – l + 1 ] ) (1.45) constrained in the trained region GT of the input space of the predictive model of the ˆ plant.2. whether as a future initial guess for an optimization. v [ k ] ) → u [ k ] . The first one stores a predictive model of the process: ( x m [ k ]. which is essentially a predictive control approach using CMAC networks.4. such that: ˆ ˜ u[k] – u[k] < ε ˜ u [ k ] ∉ GT (1. w [ k ]. u [ k ].

( i = k.2 Combinatorial problems Within this large class of problems. The MLP is again employed to predict the outputs over the horizon n = N 1 …N 2 . i = N 1 …N 2 are obtained from a previously trained MLP. For instance. This is a combinatorial problem with constraint satisfaction. Chen [77] has introduced a neural self-tuner.48) where y m [ k + i ]. The same cost function (1. which emulates the forward model of the plant.Ruano et al. The total time of drilling is a function of the order in which the holes are drilled. Hernandez and Arkun [76] have employed MLPs for nonlinear dynamic matrix control. Artificial neural networks have been proposed for other adaptive control schemes. [79] have employed MLPs for autotuning PID controllers.4. …. u [ k ] .48) is used by Sbarbaro et al. 1. but specially Hopfield type networks have been used to solve this kind of problems.4. Iiguni et al. Typically one pattern is presented to the NN. is then applied to the plant and the sequence is repeated. The following cost function is used in their approach: J[ k] = N2 i = N1 ( w [ k + i ] – ym [ k + i])2 Nu + j=0 λu [ k + j ] 2 (1. [78] have employed MLPs to add nonlinear effects to a linear optimal regulator.3 Content addressable memories Some types of NNs can act as memories. It is usually necessary to drill a large number of accurately positioned holes in the board. Different neural networks. and the difference between the output of the plant and the predicted value is used to compute a correction factor. consider for instance the manufacturing of printed circuits. Montague et al. since this determines the total distance travelled in positioning the board.learning. and employed to recall patterns. [75] who employ radial basis functions to implement an approach similar to the LERNAS concept. [73] developed a nonlinear extension to Generalized Predictive Control (GPC) [74]. At each time step. The first of these values. since each hole must be visited just once. 
1.48) with respect to ˆ the sequence { u [ i ]. the output of the plant is sampled. This is a similar problem to the well known Travelling Salesman (TSP) problem. More recently. and the most similar pattern within the memories already stored 34 Artificial Neural Networks . k + N u ) } . These values are later corrected and employed for the minimization of (1.

These networks are known as content addressable memories. The most typical networks used for this role are Hopfield nets, BSBs and BAMs.

1.4.4 - Data compression

Some networks are used to perform a mapping that is a reduction of the input pattern space dimension,
by assignment of "codes" to groups or clusters of similar patterns. Patterns are transformed from an m-dimensional space to an n-dimensional space, where m>n, and the networks therefore perform data compression: shorter patterns can be dealt with, which has advantages, for instance, in data transmission applications and in memory storage requirements.

1.4.5 - Diagnostics

Diagnosis is a problem found in several fields of science: medicine, engineering, manufacturing, etc. The problem is essentially a classification problem, where patterns associated with diseases, faults, malfunctions, etc., should be classified differently from patterns associated with normal behaviour.

1.4.6 - Forecasting

This type of application also occurs in several different areas: a bank wants to predict the creditworthiness of companies as a basis for granting loans; an airport management unit needs to predict the growth of passenger arrivals; electric power companies need to predict the future demand for electrical power during the day, or in the near future.

1.4.7 - Pattern recognition

This is one of the most used applications of ANNs. It involves areas such as the recognition of printed or handwritten characters, visual images of objects, speech recognition, etc. Typical applications of ANNs compare favourably with standard symbolic processing methods or with statistical pattern recognition methods such as Bayesian classifiers.

1.4.8 - Multisensor data fusion

Sensor data fusion is the process of combining data from different sources in order to extract more information than the individual sources considered separately. The most powerful examples of large-scale multisensor fusion are found in humans and other biological systems: humans combine different sensory data (sight, touch, sound, smell and taste) to gain a meaningful perception of the environment. ANNs offer great promise in this field, for the same reasons that biological systems are so successful at these tasks.

1.5 - Taxonomy of Neural Networks

Based on the characteristics of neural networks discussed in Section 1.3, and on the most relevant applications introduced in Section 1.4, the present Section presents different taxonomies of the most important ANN models. Each classification is based on a different perspective: the neuron model employed, the network topology, the most typical learning mechanism and, finally, the most typical application type.

1.5.1 - Neuron models

This taxonomy classifies the different ANNs according to the dependence or independence of the weight values on the distance of the pattern to the centre of the neuron, and according to the activation function most employed. The following table summarizes the classification:

Table 1.1 - Taxonomy according to the neural model

  Neural network   Weight values                                          Activation functions
  MLP              unconstrained                                          hidden layers: sigmoid, hyperbolic tangent; output layer: sigmoid, hyperbolic tangent, linear
  RBF              hidden layer: distance-dependent; output layer:        Gaussian (hidden), linear (output)
                   unconstrained
  CMAC             hidden layer: distance-dependent; output layer:        spline of order 0 (hidden), linear (output)
                   unconstrained
  B-Spline         hidden layer: distance-dependent; output layer:        spline of variable order (hidden), linear (output)
                   unconstrained
  Hopfield         unconstrained                                          threshold, sigmoid
  Boltzmann        unconstrained                                          probabilistic
  BSB              unconstrained                                          piecewise-linear
  BAM              unconstrained                                          piecewise-linear
  SOFM             unconstrained                                          Mexican hat
  ART              unconstrained                                          maximum signal

1.5.2 - Topology

This perspective classifies the neural networks according to how the neurons are interconnected. Table 1.2 summarizes this taxonomy:

Table 1.2 - Taxonomy according to the topology

  Neural network   Interconnection
  MLP              multilayer feedforward
  RBF              multilayer feedforward
  CMAC             multilayer feedforward
  B-Spline         multilayer feedforward
  Hopfield         recurrent
  Boltzmann        recurrent
  BSB              recurrent
  BAM              recurrent
  SOFM             single-layer feedforward
  ART              recurrent

1.5.3 - Learning Mechanism

According to the most typical learning mechanism associated with the different ANNs, we can categorize the different ANNs in the following way:

Table 1.3 - Taxonomy according to the learning mechanism

  Neural network   Learning mechanism
  MLP              supervised, gradient type
  RBF              supervised (gradient type) or hybrid
  CMAC             supervised (gradient type) or hybrid
  B-Spline         supervised (gradient type) or hybrid
  Hopfield         supervised, or unsupervised (Hebbian learning)
  Boltzmann        stochastic
  BSB              supervised, unsupervised (Hebbian learning)
  BAM              supervised (forced Hebbian learning)

Table 1.3 (continued) - Taxonomy according to the learning mechanism

  Neural network   Learning mechanism
  SOFM             competitive
  ART              competitive

1.5.4 - Application Type

Finally, this section classifies the ANNs by the most typically found application type. Please note that this should be taken as a guideline, since almost every ANN has been employed in almost every application type.

Table 1.4 - Taxonomy according to the application type

  Neural network   Application
  MLP              control, forecast, pattern recognition, classification, optimization
  RBF              control, forecast
  CMAC             control, forecast
  B-Spline         control, forecast
  Hopfield         associative memory, optimization
  Boltzmann        optimization
  BSB              associative memory
  BAM              associative memory
  SOFM             optimization
  ART              pattern recognition

CHAPTER 2 - Supervised Feedforward Neural Networks

Strictly speaking, there are no artificial neural networks that can be classified as supervised feedforward neural networks, as some of the neural network types described in this Chapter can also be trained with competitive learning methods. Nevertheless, the most common type of learning mechanism associated with all the artificial neural networks described here is, by far, the supervised paradigm (at least with hybrid training methods), and that is the main reason that they appear together in one Chapter. Another reason to study these networks together is that they share a common topology: one or more hidden layers (more than one only in the case of MLPs) and one output layer (we shall assume here that MLPs are used for function approximation purposes). The role of these two blocks is different, as will become clear later on, and the supervised learning mechanism can exploit this common topology.
In Section 2.1, the Multilayer Perceptrons, the most common type of artificial neural network, are introduced. Because of their popularity, they are introduced first and described more extensively than the other network types; additionally, the considerations about MLP training methods are also of interest to the other networks. Section 2.2 describes Radial Basis Function networks. The basic theory behind lattice-based associative memory networks is introduced in Section 2.3; based on that, CMAC networks and B-spline networks are described in Section 2.3.2 and in Section 2.3.3. These sections describe the use of supervised feedforward networks for approximation purposes. In Section 2.4, the modelling performance of these four types of ANNs is investigated using two common examples: the inverse of a pH nonlinearity and the inverse of a coordinates transformation problem, and the final results are summarized. The last section will address the application of these networks for classification purposes.

2.1 - Multilayer Perceptrons

Multilayer perceptrons (MLPs, for short) are the most widely known type of ANNs. It has been shown that they constitute universal approximators [130], both with 1 hidden layer [128] and with 2 hidden layers [129]. For this reason, they are described in some detail.

The original Perceptron, and its associated theorem, the Perceptron Convergence Theorem, are introduced first. Bernard Widrow's Adalines and Madalines are described next, together with the Least-Mean-Square (LMS) algorithm, which introduces a new learning criterion and decreases the computational training time.

2.1.1 - The Perceptron

Perceptrons were first introduced by Frank Rosenblatt [19], working at Cornell Aeronautical Labs, and were intended to be computational models of the retina. Rosenblatt's perceptron was a network of neurons, grouped in three layers: a sensory layer, an association layer and a classification layer. The neurons were of the McCulloch and Pitts type: the inputs, outputs and training patterns were binary valued (0 or 1). The model for a perceptron was already introduced in Section 1.3.1, as the most typical model of a neuron; in Rosenblatt's perceptron, which is the model presented in the figure, the activation function is a threshold function (for modelling purposes, usually the output layer has a linear activation function instead), and for this reason we shall adopt the term Rosenblatt's perceptron to identify this neuron model.

FIGURE 2.1 - a) Internal view of a perceptron (inputs i1...ip, weights Wi,1...Wi,p, sum S, threshold θi, output oi); b) External view of a perceptron

The typical application of Rosenblatt's perceptron was to activate an appropriate response unit for a given input pattern, or a class of patterns.
Multilayer Perceptrons are then formally introduced, followed by the "famous" error back-propagation algorithm (the BP algorithm). The major limitations of BP are pointed out afterwards, and alternative algorithms are described; problems occurring in practice in off-line training are then discussed and, finally, methods amenable to be employed for on-line learning are presented.

Although there are different versions of perceptrons and corresponding learning rules, the basic rule is to alter the values of the weights only of the active lines (the ones corresponding to a 1 input), and only when an error exists between the network output and the desired output.

Algorithm 2.1 - Perceptron learning rule

Considering a perceptron with just one output:

1. If the output is 1 and should be 1, or if the output is 0 and should be 0, do nothing.
2. If the output is 0 and should be 1, increment the weights in all active lines.
3. If the output is 1 and should be 0, decrement the weights in all active lines.

More formally, the change made to the ith component of the weight vector can be expressed as:

Δwi = α (t[k] - y[k]) xi[k]   (2.1)

In (2.1), α is the learning rate, t[k] and y[k] are the desired and actual outputs, respectively, at time k, and xi[k] is the ith element of the input vector at time k. Thus, the ith component of the weight vector at the (k+1)th iteration is:

           wi[k] + α xi , if the output is 0 and should be 1
wi[k+1] =  wi[k] - α xi , if the output is 1 and should be 0   (2.2)
           wi[k]        , if the output is correct

so that the weight vector, w, is updated as:

w[k+1] = w[k] + Δw   (2.3)

where Δw is the change made to the weight vector. Some variations have been made to this simple perceptron model: first, the inputs to the net may be real valued, as well as binary; the outputs may also be bipolar (+1, -1); finally, some models do not employ a bias. - Perceptron Convergence Theorem

One of the most important contributions of Rosenblatt was his demonstration that, given a linearly separable problem, the learning algorithm can determine, in a finite number of iterations, one set of weights that accomplishes the task. This is the famous Perceptron convergence theorem. Let us assume that the initial weight vector is null (w[0] = 0), and that the threshold is negative and incorporated in the weight vector. The input vector is, in this case:

x[k] = [-1, x1[k], ..., xn[k]]^T   (2.4)
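The three cases of the rule can be sketched directly from eq. (2.1) (the helper names and the tiny two-input example are illustrative, not from the text):

```python
def perceptron_output(w, x):
    # Threshold activation: fire only for a strictly positive net input
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0

def update(w, x, t, alpha):
    y = perceptron_output(w, x)
    # eq. (2.1): only active lines (x_i != 0) change, and only on error
    return [wi + alpha * (t - y) * xi for wi, xi in zip(w, x)]

w = [0.0, 0.0]
w = update(w, [1, 1], 1, 0.05)   # output 0, should be 1: increment active lines
w_inc = list(w)
w = update(w, [1, 0], 0, 0.05)   # output 1, should be 0: decrement active lines
```

After the first call both (active) weights are incremented to 0.05; the second call decrements only the first weight, since the second line is inactive.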

and the weight vector is:

w[k] = [θ[k], w1[k], ..., wn[k]]^T   (2.5)

assuming that we have n inputs. The linear combiner output is therefore:

y[k] = w^T[k] x[k]   (2.6)

The equation w^T x = 0, plotted in an n-dimensional space, defines a hyperplane as the decision surface between two different classes of inputs.

FIGURE 2.2 - Two classes, C1 and C2, separated by the hyperplane w1 x1 + w2 x2 - θ = 0, in two dimensions

Suppose that the input data of the perceptron originate from two linearly separable classes. Let us denote the subset of training vectors that belong to class C1 as X1, and the subset of vectors that belong to class C2 as X2. Training then means adjusting the weight vector w in such a way that the two classes, C1 and C2, are separated; the two classes are said to be linearly separable if such a set of weights exists. Please note that this set of weights is not unique.

Let us assume that the learning rate is constant and unitary (α = 1). Suppose that w^T[k] x[k] < 0 for all the patterns belonging to class C1, that is, that the perceptron incorrectly classifies all the patterns belonging to C1. Then, according to (2.2), the update equation can be represented as:

w[k+1] = w[k] + x[k]   (2.7)

for all x[k] belonging to C1. We can iteratively solve (2.7), from k = 0 to n, to obtain the result (do not forget that w[0] = 0):

w[n+1] = x[1] + x[2] + ... + x[n]   (2.8)

Since we have assumed the classes to be linearly separable, there is a set of weights, w0, such that:

w0^T x > 0 , for all x belonging to class C1   (2.9)

We may then define a fixed number, η, such that:

η = min_{x[i] ∈ X1} w0^T x[i]   (2.10)

Multiplying now both sides of (2.8) by w0^T, we then have:

w0^T w[n+1] = w0^T x[1] + w0^T x[2] + ... + w0^T x[n]   (2.11)

According to (2.10), all the terms in the right-hand side of the last equation have the lower bound η, and therefore:

w0^T w[n+1] ≥ nη   (2.12)

We must now make use of the Cauchy-Schwartz inequality, which states that, given two vectors a and b:

||a||² ||b||² ≥ (a^T b)²   (2.13)

Applying this inequality to (2.12), we have:

||w0||² ||w[n+1]||² ≥ (nη)²   (2.14)

or, equivalently:

||w[n+1]||² ≥ (nη)² / ||w0||²   (2.15)

If another route is followed, and we square (2.7), we obtain:

||w[k+1]||² = ||w[k]||² + 2 w^T[k] x[k] + ||x[k]||²   (2.16)

But, under the assumption that the perceptron incorrectly classifies all the patterns belonging to the first class, w^T[k] x[k] < 0, and (2.16) can therefore be transformed into:

||w[k+1]||² ≤ ||w[k]||² + ||x[k]||²   (2.17)

or, equivalently:

||w[k+1]||² - ||w[k]||² ≤ ||x[k]||²   (2.18)

Adding these inequalities, from k = 0 to n, and knowing that w[0] = 0, we finally obtain:

||w[n+1]||² ≤ Σ_{k=0}^{n} ||x[k]||² = nβ   (2.19)

where β is a positive number such that:

β = max_{x[k] ∈ X1} ||x[k]||^2   (2.20)

Eq. (2.19) states that the squared Euclidean norm of the weight vector grows at most linearly with the number of iterations n, while (2.15) states that it grows at least quadratically. For large n these two statements are in conflict, so there must be a maximum value of n, nmax, for which both are satisfied as equalities. This value is given by:

nmax = β ||w0||^2 / η^2   (2.21)

We have thus proved that, provided the learning rate is constant and unitary, and given that an optimal set of weights exists, the perceptron learning algorithm terminates after at most nmax iterations. It can easily be proved that a different value of the learning rate, not necessarily constant, will also terminate the learning algorithm in a finite number of iterations, as long as it is positive. The proof assumed that the initial value of the weights is null; a different initial weight vector will just decrease or increase the number of iterations.

2.1.1.2 Examples

Example 2.1 - AND Function

Design a perceptron to implement the logical AND function. Use the input and target data found in the MATLAB file AND.mat.

Resolution

Starting with a null initial vector, and employing a learning rate of 0.05, the evolution of the weights is summarized in Table 2.1.

[Table 2.1 - Evolution of the weights: iterations 1-7, columns Iteration, W1, W2, W3]

[Table 2.1, continued - Evolution of the weights: iterations 8-40, columns Iteration, W1, W2, W3]

We can therefore see that, after iteration 24, the weights remain fixed. The next figure shows the changes in the decision boundaries as learning proceeds.

[FIGURE 2.3 - Evolution of the decision boundaries]

For a MATLAB resolution, please see the data file AND.mat, the function perlearn.m and the script plot1.m.

Example 2.2 - OR Function

Design a perceptron to implement the logical OR function. Use the input and target data found in the MATLAB file OR.mat.

Resolution

Starting with a null initial vector, and employing a learning rate of 0.05, the evolution of the weights is summarized in Table 2.2.

[Table 2.2 - Evolution of the weights: iterations 1-8, columns Iteration, W1, W2, W3]

[Table 2.2, continued - Evolution of the weights: iterations 9-40, columns Iteration, W1, W2, W3]

We can therefore see that, after iteration 6, the weights remain fixed. The next figure shows the changes in the decision boundaries as learning proceeds.

[FIGURE 2.4 - Evolution of the decision boundaries]

For a MATLAB resolution, please see the data file OR.mat, the function perlearn.m and the script plot1.m.

Example 2.3 - XOR Function

Try to design one perceptron to implement the logical Exclusive-OR function. Use the input and target data found in the MATLAB file XOR.mat. Why do the weights not converge?

Resolution

Starting with a null initial vector, and employing a learning rate of 0.05, the evolution of the weights is summarized in Table 2.3.

[Table 2.3 - Evolution of the weights: iterations 1-9, columns Iteration, W1, W2, W3]

[Table 2.3, continued - Evolution of the weights: iterations 10-40, columns Iteration, W1, W2, W3]

We can therefore see that the weights do not converge. The next figure shows the changes in the decision boundaries as learning proceeds.
[FIGURE 2.5 - Evolution of the decision boundaries]
As we know, there is no linear decision boundary that can separate the XOR classes: the problem is simply unsolvable by a single perceptron. For a MATLAB resolution, please see the data file XOR.mat, the function perlearn.m and the script plot1.m.

2.1.2 Adalines and Madalines

The Adaline (ADAptive LINear Element) was originally introduced by Widrow and Hoff [21] as an adaptive pattern-classification element. Its associated learning rule is the LMS (Least-Mean-Square) algorithm, which will be detailed later. The block diagram of an Adaline is shown in fig. 2.6. As can be seen, it has the same form as the perceptron introduced before.
[FIGURE 2.6 - An Adaline element]

One difference is that the input and output variables are bipolar {-1, +1}. The major differences, however, lie in the learning algorithm, which in this case is the LMS algorithm, and in the point where the error is computed: for an Adaline the error is computed at the net_i point, and not at the output, as is the case for the perceptron learning rule. By combining several Adaline elements into a network, a Madaline (Many ADAlines) neural network is obtained.

[FIGURE 2.7 - A Madaline (each circle denotes an Adaline element)]

Employing Madalines solves the problem of computing decision boundaries for classes that are not linearly separable.

Example 2.4 - XOR Function with Madalines

Employing the perceptrons designed in Ex. 2.1 and Ex. 2.2, design a network structure capable of implementing the Exclusive-OR function.

Resolution

As is well known, the XOR function can be implemented as (the prime denoting logical complement):

X ⊕ Y = X'Y + XY'   (2.22)

Therefore, we need one perceptron that implements the OR function, designed in Ex. 2.2, and we can adapt the perceptron that implements the AND function to compute each one of the two terms in eq. (2.22). The net input of a perceptron is:

net = w1 x + w2 y + θ

If we substitute the term x in the last equation by (1 - x), it is obvious that the perceptron will implement the boolean function x'y. This involves the following change of weights:

net = w1 (1 - x) + w2 y + θ = -w1 x + w2 y + (θ + w1)



Therefore, if the sign of weight 1 is changed, and the 3rd weight is changed to θ + w1, then the original AND function implements the function x'y. By the same reasoning, if the sign of weight 2 is changed, and the 3rd weight is changed to θ + w2, then the original AND function implements the function xy'. Finally, if the perceptron implementing the OR function is employed, with the outputs of the previous two perceptrons as inputs, the XOR problem is solved. The weights of the first perceptron, implementing the x'y function, are [-0.1 0.05 0]. The weights of the second perceptron, implementing the xy' function, are [0.1 -0.05 -0.05]. The weights of the third perceptron, implementing the OR function, are [0.05 0.05 -0.05]. The following table illustrates the results of the network structure on the training patterns; in the table, y1 is the output of the x'y perceptron and y2 the output of the xy' perceptron. As can be seen, the Madaline implements the XOR function. For a MATLAB resolution, please see the data files AND.mat and OR.mat, and the functions perlearn.m, perrecal.m and XOR.m.
[Table 2.4 - XOR problem solved with a Madaline: iterations 1-40, columns Iteration, inp1, inp2, y1, y2, y]
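The layered construction above can be checked with a short sketch. Note that the weights used here are illustrative round values chosen for this sketch, so that each unit computes the intended Boolean function with a strict threshold; they are not the trained weights quoted above.

```python
def step(net):
    # Hard threshold: fires only for a strictly positive net input.
    return 1 if net > 0 else 0

def madaline_xor(x, y):
    # Hidden layer: two AND-like units computing x'y and xy'.
    h1 = step(-1.0 * x + 1.0 * y - 0.5)   # implements x'y
    h2 = step( 1.0 * x - 1.0 * y - 0.5)   # implements xy'
    # Output layer: an OR unit combining the two hidden outputs.
    return step(h1 + h2 - 0.5)

for x in (0, 1):
    for y in (0, 1):
        print(x, y, madaline_xor(x, y))   # prints the XOR truth table
```

Each hidden unit carves out one half-plane, and the output unit ORs the two, which is exactly why a two-layer structure succeeds where a single perceptron cannot.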

The LMS Algorithm

One of the important contributions of Bernard Widrow and his co-workers was the introduction of the Least-Mean-Square (LMS) algorithm. This learning algorithm also goes by the names of Widrow-Hoff learning rule, Adaline learning rule, and delta rule.
[FIGURE 2.8 - Application of the LMS algorithm]

fig. 2.8 illustrates the application of the LMS algorithm. In contrast to the perceptron learning rule, the error is not defined at the output of the nonlinearity, but just before it. Therefore, the error is not limited to the discrete values {-1, 0, 1}, as in the normal perceptron, but can take any real value. The LMS rule is just:

w[k+1] = w[k] + α (t[k] - net[k]) x[k]
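A single LMS update, following the rule above, can be sketched as follows. The function and variable names are illustrative; the input vector is assumed to carry the constant bias input as its last component.

```python
def lms_update(w, x, t, alpha):
    # Net input of the linear combiner: the dot product w . x.
    net = sum(wi * xi for wi, xi in zip(w, x))
    e = t - net                      # error measured before the nonlinearity
    # w[k+1] = w[k] + alpha * e * x[k]
    return [wi + alpha * e * xi for wi, xi in zip(w, x)]

# One step on a single pattern: w = [0.9, 0.9], input [x, 1] with x = -1, target 1.
w = lms_update([0.9, 0.9], [-1.0, 1.0], 1.0, 0.1)
print(w)   # close to [0.8, 1.0]
```

Because the error here can take any real value, the step size scales with how wrong the linear combiner is, unlike the fixed-size perceptron corrections.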

Comparing the LMS rule with the perceptron learning algorithm (eqs. (2.1) and (2.2)), we can see that, apparently, the only difference lies in the replacement of the output by the net input. The major difference, however, lies in the justification of the algorithm, which follows a completely different route.

Justification of the LMS Algorithm

Assume that, associated with the error vector over the training set (composed of N patterns), we define a cost function:
E = (1/2) e^T e = (1/2) sum_{i=1}^{N} e^2[i]   (2.26)


where e[k] is defined as:



e[k] = t[k] - y[k]   (2.27)

Widrow formulated the Adaline learning process as that of finding the minimum point of the error function. Actually, Widrow introduced this in a stochastic framework: eq. (2.26) can be rewritten as (1/2) E[e^2], where E[.] is the expectation operator. This last expression is the mean-square error, and because of this the learning algorithm is called LMS. For the case of an Adaline, we assume that the error is defined at the output of a linear combiner, and, for simplicity of notation, denote this value by y[.] (strictly, it should be net[k]).

Suppose that y is a linear function of just one input variable, so that our linear model has just one variable weight. If the values of (2.26), as a function of that weight, are plotted, an error-performance surface such as the one in fig. 2.9 might appear.

[FIGURE 2.9 - Error surface: E and -dE/dw plotted against w, with the minimum value of E marked]

We wish to find the value of the weight that corresponds to the minimum value of the error function. One way to accomplish this task is, at each point, to compute the gradient of the function and perform a step in the opposite direction; this is called the method of steepest descent. Widrow actually employed an instantaneous version of the steepest descent technique, as detailed below. The gradient of the error is denoted as:

g = [∂E/∂w1 ... ∂E/∂wp]^T   (2.28)

and the update rule can be given as:

w[k+1] = w[k] - η g[k]   (2.29)

Substituting (2.26) into (2.28), the gradient vector is:

g = [ (1/2) sum_{i=1}^{N} ∂e^2[i]/∂w1 ... (1/2) sum_{i=1}^{N} ∂e^2[i]/∂wp ]^T   (2.30)

Let us consider one of the terms in (2.30). Employing the chain rule, and noting that e[j] = t[j] - y[j], we have:

∂e[j]/∂wi = (∂e[j]/∂y[j]) (∂y[j]/∂wi) = -∂y[j]/∂wi   (2.31)

Replacing this equation back in (2.30), we obtain:

g = -[ sum_{i=1}^{N} e[i] ∂y[i]/∂w1 ... sum_{i=1}^{N} e[i] ∂y[i]/∂wp ]^T = -(e^T J)^T   (2.32)

where

e = [e[1] ... e[N]]^T   (2.33)

is the error vector, and

J = [ ∂y[1]/∂w1 ... ∂y[1]/∂wp ; ... ; ∂y[N]/∂w1 ... ∂y[N]/∂wp ]   (2.34)

is the Jacobian matrix, i.e., the matrix of the derivatives of the output vector with respect to the weights.

In this case, as the output is just a linear combination of the inputs, the Jacobian is simply the input matrix X:

J = X = [x[1] ; ... ; x[N]]   (2.35)

Replacing (2.32) and (2.35) in (2.29), we have the steepest descent update:

w[k+1] = w[k] + η X^T e   (2.36)

Please note that in the last equation all the errors must be computed, over the whole training set, before the weights are updated; this is called batch update, or epoch mode. If, on the other hand, an instantaneous version of the gradient is computed for each pattern, and the weights are updated pattern by pattern, this is called pattern mode, and we have the Widrow-Hoff algorithm.

Example 2.5 - x^2 function

Consider a simple Adaline with just one input and a bias. Suppose that we use this model to approximate the nonlinear function y = x^2, over the interval [-1, 1], with a sampling interval of 0.1. The initial weight vector is [0.9 0.9]. Employ a learning rate of 0.1.

1. Determine the first update of the weight vector using the steepest descent method. Compute all the relevant vectors and matrices.
2. Repeat the same problem with the LMS algorithm.

Resolution

1. The error vector is:

[Table 2.5 - Error vector: patterns 1-7, errors 1, 0.72, 0.46, 0.22, 0, -0.2, -0.38]

[Table 2.5, continued - Error vector: patterns 8-21, errors -0.54, -0.68, -0.8, -0.9, -0.98, -1.04, -1.08, -1.1, -1.1, -1.08, -1.04, -0.98, -0.9, -0.8]

The Jacobian matrix is:

[Table 2.6 - Jacobian matrix: patterns 1-17, Input 1 = -1, -0.9, ..., 0.6 and Input 2 = 1]

[Table 2.6, continued - Jacobian matrix: patterns 18-21, Input 1 = 0.7, 0.8, 0.9, 1 and Input 2 = 1]

The gradient vector is therefore:

g = -(e^T J)^T = [6.93  11.2]^T

and the weight vector for the next iteration is:

w[2] = [0.207  -0.22]^T

For a MATLAB resolution, please see the function x_sq_std.m.

2. With the LMS algorithm, we have the following error vector:

[Table 2.7 - Error vector: patterns 1-16, starting 1, 0.53, 0.1888, -0.0658, -0.2602, ...]

[Table 2.7, continued - Error vector: patterns 17-21]

The evolution of the (instantaneous) gradient vector is:

[Table 2.8 - Gradient vector: patterns 1-21, components along Input 1 and Input 2, starting (1, -1), (0.477, -0.53), ...]

And the evolution of the weight vector is:

[Table 2.9 - Evolution of the weight vector: iterations 1-21, columns W1, W2, starting (0.8, 1), (0.7523, 1.053), (0.7372, 1.0719), ...]

Notice that both the gradient and the weight vectors, after the presentation of the whole training set, are different from the values computed in 1. For a MATLAB resolution, please see the function x_sq_lms.m.

2.1.3 Multilayer perceptrons

As we have seen in Ex. 2.4, it is not difficult to design a network structure of simple perceptrons, as long as each one is individually trained to implement a specific function. However, training the whole structure globally to implement the XOR function is impossible: because of the hard nonlinearity of the perceptron, it is not possible to employ the derivative chain rule of the steepest descent algorithm.
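The figures in Example 2.5 can be reproduced numerically. The sketch below assumes the conventions used in that example: model output w1·x + w2, batch gradient g = -(e^T J)^T with Jacobian rows [x 1], and pattern-by-pattern LMS updates.

```python
xs = [round(-1 + 0.1 * i, 1) for i in range(21)]    # -1, -0.9, ..., 1
ts = [x * x for x in xs]                             # targets y = x^2

# 1. Batch steepest descent: errors for the whole set, then one update.
w = [0.9, 0.9]                                       # [weight, bias]
e = [t - (w[0] * x + w[1]) for x, t in zip(xs, ts)]  # error vector
g = [-sum(ei * x for ei, x in zip(e, xs)),           # g = -(e^T J)^T
     -sum(e)]
w_next = [w[0] - 0.1 * g[0], w[1] - 0.1 * g[1]]
print(g)       # close to [6.93, 11.2]
print(w_next)  # close to [0.207, -0.22]

# 2. LMS (pattern mode): weights updated after every pattern.
w = [0.9, 0.9]
errors = []
for x, t in zip(xs, ts):
    e_k = t - (w[0] * x + w[1])
    errors.append(e_k)
    w = [w[0] + 0.1 * e_k * x, w[1] + 0.1 * e_k]
print(errors[:4])  # close to [1, 0.53, 0.1888, -0.0658]
```

The two passes visit the same data but end at different weights, which is the point made above about batch and pattern modes.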

The global training problem is solved if the hard nonlinearity of the perceptron is replaced by a differentiable nonlinearity, such as a sigmoid or a hyperbolic tangent.

[FIGURE 2.10 - A multilayer perceptron]

fig. 2.10 illustrates a typical multilayer perceptron. As the name indicates, the network is structured in layers. All the important variables should reflect this layered structure, and therefore the following notation is introduced (m denoting the number of training patterns):

• q is the number of layers in the MLP (the input layer is the 1st layer, and the output layer is the qth);
• k is a vector with q elements, the zth element denoting the number of neurons in layer z;
• O(z) is the m*kz matrix of the outputs of the neurons in layer z; the network input matrix is therefore O(1);
• Net(z) is the m*kz matrix of the net inputs of the neurons in layer z; this is not defined for z = 1;
• T is the m*kq matrix of desired outputs, or targets;
• E is the m*kq matrix of the errors (E = T - O(q));

• F(z)(.) is the matrix output function for the neurons in layer z; it is applied on an element-by-element basis;
• W(z) is the (kz+1)*kz+1 weight matrix between the zth layer and the (z+1)th layer; W(z)i,j is the weight that connects neuron i in the zth layer to neuron j in the (z+1)th layer. The threshold associated with each neuron of the (z+1)th layer can be envisaged as a normal weight connecting an additional neuron in the zth layer (the (kz+1)th neuron), which has a fixed output of 1. In this case:

Net(z+1) = [O(z) 1] W(z)

• w(z) is the vector representation of W(z), defined by stacking the columns W(z)·,1, ..., W(z)·,kz+1;
• w is the entire weight vector of the network, w = [w(q-1); ...; w(1)].

2.1.3.1 The Error Back-Propagation Algorithm

As in the case of the previous models, the aim of training the MLP is to find the values of the weights and biases of the network that minimize the sum of the squares of the errors between the targets and the actual outputs:

E = (1/2) e^T e = (1/2) sum_{i=1}^{N} e^2[i]   (2.37)

The error back-propagation (BP) algorithm is the best known learning algorithm for performing this operation. Several people from a wide range of disciplines have independently developed this algorithm: apparently it was introduced by Paul Werbos [80] in his doctoral thesis, rediscovered by Parker [81], and widely broadcast throughout the scientific community by Rumelhart and the PDP group [31]. In fact, MLPs and the BP algorithm are so intimately related that in the literature this type of artificial neural network is often referred to as a 'back-propagation neural network'.

The original error back-propagation algorithm implements a steepest descent method. In each iteration, the weights of the multilayer perceptron are updated by a fixed percentage in the negative gradient direction:

w[k+1] = w[k] - η g[k]   (2.38)

Several modifications of this algorithm have been proposed. One of them is to perform the update of the weights each time a pattern is presented. The reasoning behind this pattern mode update is that, if η is small, the departure from true gradient descent will be small, and the algorithm will carry out a very close approximation to gradient descent in sum-squared error [31].

Another modification, also introduced by Rumelhart and the PDP group [31], is the inclusion in the weight update equation of a portion of the last weight change, called the momentum term:

w[k+1] = w[k] - η g[k] + α (w[k] - w[k-1])   (2.39)

The use of this term would, in principle, allow the use of a faster learning rate without leading to oscillations, which would be filtered out by the momentum term. It is, however, questionable whether in practice this modification increases the rate of convergence: Tesauro et al. [82] suggest that the rate of convergence is essentially unmodified by the existence of the momentum term, a fact that was also observed in several numerical simulations performed by us.

Several other modifications to the BP algorithm have been proposed. To name but a few, Jacobs [83] proposed an adaptive method for obtaining the learning rate parameter, Watrous [84] has employed a line search algorithm, and Sandom and Uhr [85] have developed a set of heuristics for escaping local error minima.

The error back-propagation algorithm, in its original version, is developed below. Subsequently, the limitations of this algorithm are pointed out.

2.1.3.2 The Jacobian Matrix

As we have already seen, the gradient vector can be computed as:

g^T = -e^T J   (2.40)

where J is the Jacobian matrix. In this section we shall compute the gradient vector following this approach. As the Jacobian will be used later on with alternative training algorithms, it is advisable to compute J in a partitioned fashion, reflecting the topology of the MLP:

J = [J(q-1) ... J(1)]   (2.41)

In eq. (2.41), J(z) denotes the derivatives of the output vector of the MLP w.r.t. the weight vector w(z). The chain rule can be employed to compute J(z):

J(z) = ∂o/∂w(z) = (∂o/∂net(q)) (∂net(q)/∂O(q-1)) { (∂O(j)/∂Net(j)) (∂Net(j)/∂O(j-1)) } (∂O(z+1)/∂Net(z+1)) (∂Net(z+1)/∂w(z))   (2.42)

In the last equation, the terms within curly brackets occur in pairs, one pair per intermediate layer; the number of these pairs is q-z-1, which means they do not exist for z = q-1.

Since there is no dependence from presentation to presentation in the operation of the MLP, the computation of J(z) will be developed on a pattern basis (for a generic presentation i), rather than on an epoch basis. If (2.42) were followed for all patterns at once, three-dimensional quantities would have to be employed, and additional conventions would have to be defined; their use would not clarify the text.

Detailing now each constituent element of the right-hand side of (2.42), for presentation i:

a) ∂O(j)i,·/∂Net(j)i,· is a diagonal matrix, since the net input of one neuron affects only the output of that neuron, and not the others; its kth diagonal element is f'(Net(j)i,k). For the case of the output layer, since we are considering just one output, it is simply the scalar f'(net(q)i).

b) ∂Net(j)i,·/∂O(j-1)i,· = (W(j-1)1:kj-1,·)^T, i.e., the transpose of the weight matrix W(j-1) without its last row (the row associated with the bias input);

c) ∂Net(z+1)i,·/∂w(z) is best viewed as a partitioned matrix, with kz+1 column partitions, with a format that reflects the organization of layer z+1 into different neurons; its jth partition, Aj, has the form:

Aj = [ 0 ; [O(z)i,· 1] ; 0 ]   (2.45)

where the top and bottom zero matrices have dimensions (j-1)*(kz+1) and (kz+1-j)*(kz+1), respectively.

The matrices described in a) and c) are sparse, and (2.42) can be further manipulated to yield a more compact form for the computation of Ji,·. Using the same type of conventions used so far, the sub-partition of J(z) corresponding to the weights that connect the neurons in the zth layer to the pth neuron in layer z+1 will be denoted as J(z)i,p~, where p~ stands for the column range (p-1)(kz+1)+1, ..., p(kz+1)   (2.46)

Since the pth partition of ∂Net(z+1)i,·/∂w(z) has just its pth row non-null (see (2.45)), and the matrix ∂O(j)/∂Net(j) is diagonal, the product of each pair of matrices inside the curly brackets of (2.42) can be given as:

∂Net(j)i,·/∂Net(j-1)i,· = (W(j-1)1:kj-1,·)^T × f'(Net(j-1)i,·)   (2.47)

where the symbol × means that each element of each column of the pre-multiplying matrix is multiplied by the corresponding element of the post-multiplying vector. This way, J(z)i,p~ can be given as:

J(z)i,p~ = (∂oi/∂Net(z+1)i,p) [O(z)i,· 1]   (2.48)

The derivatives ∂oi/∂Net(z+1)i,· are obtained in a recursive fashion:

∂oi/∂Net(z+1)i,· = (∂oi/∂Net(z+2)i,·) (∂Net(z+2)i,·/∂Net(z+1)i,·)

so that, to compute J(z)i,p~ for layer z, the matrices ∂Net(j)i,·/∂Net(j-1)i,·, with q ≥ j > z+1, must be calculated beforehand.

The final consideration, before presenting an efficient algorithm to compute Ji,·, concerns the order of computation of its different partitions. Because of the layered structure of the MLP, it is clear that no duplicate calculations are made if J(z)i,· is computed starting from the output layer and moving towards the input layer. Keeping these points in mind, Alg. 2.2 illustrates an efficient way to compute the Jacobian matrix of an MLP:

z := q
d := f'(net(q)i)
while z > 1
  for j = 1 to kz
    J(z-1)i,j~ := dj [O(z-1)i,· 1]
  end
  if z > 2
    d := d (W(z-1))^T      (W(z-1) taken without its bias row)
    d := d × f'(Net(z-1)i,·)
  end
  z := z - 1
end

Algorithm 2.2 - Computation of Ji,·

Using this algorithm, in conjunction with (2.40), the gradient vector can be obtained. This, however, requires that two other points be addressed:

a) Net(j)i,·, for q ≥ j > 1, and O(j)i,·, for q > j ≥ 1, must be known beforehand;
b) the error vector, and consequently the output vector, must be available.

Point b) implies that a recall operation must be executed. If this operation is performed before the computation of the Jacobian matrix takes place, the quantities mentioned in a) become available as a by-product. Also, for output functions like the sigmoid or the hyperbolic tangent, the derivative can be obtained directly from the output, which simplifies the calculations.

An analysis of the operations involved in Alg. 2.2 shows that, denoting by k the total number of neurons in the MLP (without considering the input layer) and by n the total number of weights in the network, roughly 2mn multiplications, mn additions and km derivative evaluations are needed to obtain the Jacobian matrix.

The recall operation can be implemented using Alg. 2.3:

z := 1
while z < q
  Net(z+1)i,· := [O(z)i,· 1] W(z)
  O(z+1)i,· := F(z+1)(Net(z+1)i,·)
  z := z + 1
end

Algorithm 2.3 - Recall operation

Using this algorithm, mk output function evaluations and mn additions and multiplications are needed to compute o(q). Employing Alg. 2.2 and Alg. 2.3, the gradient vector can be obtained using Alg. 2.4:

g^T := 0
i := 1
while i ≤ m
  compute o(q)i            -- (Alg. 2.3)
  ei := ti - o(q)i
  compute Ji,·             -- (Alg. 2.2)
  g^T := g^T - ei Ji,·
  i := i + 1
end

Algorithm 2.4 - Computation of g using (2.40)

This last algorithm computes the gradient vector with the same computational complexity as the error back-propagation algorithm (which, for consistency with Alg. 2.4, is presented below in presentation mode). However, it does not share one of the advantages of the BP algorithm, because not all of its computations are performed locally in the neurons. This property can be recovered by slightly modifying Alg. 2.4; by doing that, we obtain the gradient vector as computed by the BP algorithm, as presented by Rumelhart et al. [31].
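The recall/Jacobian machinery of Algs. 2.3 and 2.4 can be sanity-checked on a tiny MLP (one input, two sigmoid hidden neurons, one linear output) by comparing the analytically propagated derivatives with finite differences. All names and weight values below are illustrative choices for this sketch, not the text's MATLAB code.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def forward(w, x):
    # Recall operation (Alg. 2.3 style): hidden layer, then linear output.
    w10, w11, w20, w21, v0, v1, v2 = w
    h1 = sigmoid(w11 * x + w10)   # hidden neuron 1 (bias w10)
    h2 = sigmoid(w21 * x + w20)   # hidden neuron 2 (bias w20)
    o = v1 * h1 + v2 * h2 + v0    # linear output neuron (bias v0)
    return h1, h2, o

def gradient(w, x, t):
    # g = -e * do/dw for one pattern, derivatives propagated backwards
    # from the output layer (Alg. 2.4 style).
    w10, w11, w20, w21, v0, v1, v2 = w
    h1, h2, o = forward(w, x)
    e = t - o
    d1 = v1 * h1 * (1 - h1)       # do/dnet of hidden neuron 1
    d2 = v2 * h2 * (1 - h2)       # do/dnet of hidden neuron 2
    do_dw = [d1, d1 * x, d2, d2 * x, 1.0, h1, h2]
    return [-e * d for d in do_dw]

w = [0.1, -0.2, 0.3, 0.4, -0.1, 0.5, -0.3]
x, t = 0.7, 0.49
g = gradient(w, x, t)

# Central finite-difference check of every component of the gradient.
eps = 1e-6
for i in range(len(w)):
    wp = list(w); wp[i] += eps
    wm = list(w); wm[i] -= eps
    Ep = 0.5 * (t - forward(wp, x)[2]) ** 2   # E = 0.5 * e^2
    Em = 0.5 * (t - forward(wm, x)[2]) ** 2
    assert abs(g[i] - (Ep - Em) / (2 * eps)) < 1e-6
print("gradient matches finite differences")
```

Note how the sigmoid derivative is obtained from the output itself, h(1-h), which is the simplification mentioned above.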

2.1.3.2.1 The Gradient Vector as computed by the BP Algorithm

The following algorithm computes the gradient vector in the manner introduced by Rumelhart and co-workers [31]:

g^T := 0
i := 1
while i ≤ m
  compute o(q)i            -- (Alg. 2.3)
  ei := ti - o(q)i
  z := q
  d := [f'(net(q)i) ei]
  while z > 1
    for j = 1 to kz
      g(z-1)j^T := g(z-1)j^T - dj [O(z-1)i,· 1]^T
    end
    if z > 2
      d := d (W(z-1))^T
      d := d × f'(Net(z-1)i,·)
    end
    z := z - 1
  end
  i := i + 1
end

Algorithm 2.5 - Computation of g using BP

The main difference between this algorithm and Alg. 2.4 is that, in this case, all the calculations are made locally (in each neuron). Notice that the symbol × again means that each element of each column of the pre-multiplying matrix is multiplied by the corresponding element of the post-multiplying vector.

2.1.3.3 Analysis of the error back-propagation algorithm

It was shown above that each iteration of the error back-propagation algorithm is computationally efficient. However, its performance with regard to the total process of training is very poor: it is common to find in the specialized literature that hundreds of thousands of iterations are needed to train MLPs. This fact is related to the underlying minimization technique that this algorithm implements, the steepest descent method. To show the problems associated with this method, an example will be employed.

Example 2.6 - x^2 mapping

The aim of this example is to train an MLP to approximate the mapping y = x^2 in a specified range of x. The simplest possible MLP, namely a perceptron with a linear output function, will be employed.

[FIGURE 2.11 - Simplest possible MLP: inputs 1 and x, weights w1 and w2, linear output y]

The performance of the algorithm will be evaluated, for the first ten iterations, using four different graphs: a) illustrates the evolution of the weight vector w, with contour lines of the training criterion superimposed; b) illustrates the evolution of the criterion, which is the usual sum of the squares of the errors (||e||^2); c) shows the convergence of the threshold (w1); and d) shows the convergence of the weight (w2). The training range was discretized into 21 evenly spaced points, starting from x = -1 and separated by ∆x = 0.1, and the same starting point (w = [1 0]^T × 10^-3) will be used for all the examples.

fig. 2.12 illustrates the performance achieved by the algorithm for x ∈ [-1, 1]. As shown there, the training procedure diverges when a learning rate of 0.1 is used: although w2 converges to its optimum value (0), the evolution of the bias is clearly an unstable process.

FIGURE 2.12 - x² problem; x ∈ [−1, 1], Δx = 0.1, η = 0.1. (Panels: a) convergence of w; b) evolution of the s.s.e.; c) convergence of w1; d) convergence of w2.)

fig. 2.13 illustrates the performance of the training procedure for the same problem, this time using a learning rate of 0.09. With this learning rate the training procedure is now stable. The evolution of the bias is an oscillatory process, with a slower convergence rate than the evolution of the weight. The convergence of this last parameter (w2) is now slightly slower than the one obtained with η = 0.1.



FIGURE 2.13 - x² problem; x ∈ [−1, 1], Δx = 0.1, η = 0.09. (Panels: a) convergence of w; b) evolution of the s.s.e.; c) convergence of w1; d) convergence of w2.)

fig. 2.14 shows what happens when the learning rate is further reduced to 0.05. The training process is perfectly stable. Although the learning rate used is smaller than the one employed in the last example, the evolution of the training criterion is faster. The convergence rate for w1 is now better than for w2. This latter parameter converges more slowly than with η = 0.09 and clearly dominates the convergence rate of the overall training process.

Artificial Neural Networks



FIGURE 2.14 - x² problem; x ∈ [−1, 1], Δx = 0.1, η = 0.05. (Panels: a) convergence of w; b) evolution of the s.s.e.; c) convergence of w1; d) convergence of w2.)

As a final example, the same range of x will be used, this time with a separation between samples of ∆x = 0.2 . fig. 2.15 illustrates the results obtained when η = 0.1 is used. Although the learning rate employed is the same as in the example depicted in fig. 2.12 , the results presented in fig. 2.15 resemble more closely the ones shown in fig. 2.14 .





FIGURE 2.15 - x² problem; x ∈ [−1, 1], Δx = 0.2, η = 0.1. (Panels: a) convergence of w; b) evolution of the s.s.e.; c) convergence of w1; d) convergence of w2.)

From the four graphs shown, two conclusions can be drawn: a) the error back-propagation algorithm is not a reliable algorithm, since the training procedure can diverge; b) the convergence rate obtained depends on the learning rate used and on the characteristics of the training data.
Normal equations and least-squares solution

This example was deliberately chosen as it is amenable to an analytical solution. Since the output layer has a linear activation function, the output of the network is:



o^(q) = [x 1] w, or, in the more general case: o^(q) = A w



where A denotes the input matrix, augmented by a column vector of ones to account for the threshold. In this special (linear) case, the training criterion, given by (2.37), becomes:

Ω_l = ||t − Aw||² / 2 (2.51)

where the delimiters denote the 2-norm.

Differentiating (2.51) w.r.t. w, the gradient can then be expressed as:

g_l = −A^T t + (A^T A) w (2.52)

The minimal solution or, as it is usually called, the least-squares solution of (2.51) is then obtained by first equating (2.52) to 0, giving rise to the normal equations [92]:

(A^T A) w = A^T t (2.53)

which, when A is not rank-deficient, has a unique solution, given by (2.54):

ŵ = (A^T A)⁻¹ A^T t (2.54)

where it is assumed that A has dimensions m×n, with m ≥ n¹. The matrix A^T A is called the normal equation matrix. In the case where A is rank-deficient, there is an infinite number of solutions. The one with minimum 2-norm is given by:

ŵ = A⁺ t (2.55)

where A⁺ denotes the pseudo-inverse of A [93].

The error back-propagation algorithm as a discrete dynamic system

If (2.52) is substituted into the back-propagation update equation (2.38), we obtain:

1. Throughout this work, it is assumed that the number of presentations is greater than or equal to the number of parameters.



w[k+1] = w[k] + η(A^T t − (A^T A) w[k]) = (I − η(A^T A)) w[k] + η A^T t (2.56)


This last equation explicitly describes the training procedure as a discrete MIMO system, which can be represented as [94]:



FIGURE 2.16 - Closed-loop representation of (2.56). (Feedback loop with transfer function H(z) = η/(z − 1), driven by −g[k].)
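The least-squares solution (2.54) is the fixed point of the iteration (2.56). The following NumPy sketch (illustrative; the data and names are invented for the check) verifies this, and that the iteration converges to it when η is below the stability limit discussed next:

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.column_stack([rng.normal(size=(30, 2)), np.ones(30)])  # inputs + bias column
t = rng.normal(size=30)

w_hat = np.linalg.solve(A.T @ A, A.T @ t)    # least-squares solution (2.54)
lam_max = np.linalg.eigvalsh(A.T @ A)[-1]
eta = 1.0 / lam_max                          # safely below the limit 2/lam_max

w = np.zeros(3)
for _ in range(5000):
    w = w + eta * (A.T @ t - (A.T @ A) @ w)  # update (2.56)

assert np.allclose(w, w_hat)                 # converged to the fixed point
assert np.allclose(w_hat, np.linalg.pinv(A) @ t)  # same as the pseudo-inverse solution
```

At the fixed point the gradient (2.52) vanishes, so the update leaves w unchanged; the number of iterations needed to get there is what the eigenvalue analysis below quantifies.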

The rate of convergence and the stability of the closed-loop system are governed by the choice of the learning rate parameter. In the present case, as x is symmetric, the matrix A^T A is a diagonal matrix. In the general case this is not verified, and to examine (2.56) it is then convenient to decouple the k1+1 simultaneous linear difference equations by performing a linear transformation of w. Since A^T A is symmetric, it can be decomposed [95] as:

A^T A = U S U^T (2.57)

where U is an orthonormal square matrix (the modal matrix, or matrix of the eigenvectors of A^T A) and S is a diagonal matrix whose diagonal elements are the eigenvalues (λ) of A^T A. Replacing (2.57) in (2.56), its jth equation becomes:

v_j[k+1] = (1 − ηλ_j) v_j[k] + η z_j,  j = 1, …, k1+1 (2.58)

where v = U^T w and




z = U^T A^T t (2.59)


It is clear that (2.58) converges exponentially to its steady-state value if the following condition is met:

|1 − ηλ_j| < 1 (2.61)

which means that the maximum admissible learning rate is governed by:

η < 2/λ_max (2.62)

where λ_max denotes the maximum eigenvalue of A^T A. From (2.58) it can also be concluded that the convergence rate for the jth weight is related to |1 − ηλ_j|, faster convergence rates being obtained as the modulus becomes smaller. The desirable condition of having fast convergence rates for all the variables cannot be obtained when there is a large difference between the largest and the smallest of the eigenvalues of A^T A. To see this, suppose that η is chosen close to half of the maximum possible learning rate, i.e., η ≈ 1/λ_max. In this condition, the variable associated with the maximum eigenvalue achieves its optimum value in the second iteration. If m denotes the equation related with the smallest eigenvalue of A^T A, this equation may be represented as:

v_m[k+1] = (1 − λ_min/λ_max) v_m[k] + z_m/λ_max (2.63)

which has a very slow convergence if λ_min ≪ λ_max. It is then clear that the convergence rate ultimately depends on the condition number of A^T A, defined as:

κ(A^T A) = λ_max/λ_min (2.64)

We can then state that the maximum learning rate is related with the largest eigenvalue, and the convergence rate is related with the smallest eigenvalue. Referring back to Ex. 2.6, A^T A is a diagonal matrix with eigenvalues {21, 7.7} for the cases where Δx = 0.1, and {11, 4.4} when Δx = 0.2. Table 2.10 illustrates the values of (1 − ηλ_i) for the cases shown in fig. 2.12 to fig. 2.15.



Table 2.10 - Values of (1 − ηλ_i) for fig. 2.12 to fig. 2.15

        Fig 2.12   Fig 2.13   Fig 2.14   Fig 2.15
w1      -1.1       -0.89      -0.05      -0.1
w2       0.23       0.31       0.62       0.56
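The entries of Table 2.10 can be reproduced numerically. A short NumPy check (illustrative names):

```python
import numpy as np

def poles(dx, eta):
    x = np.arange(-1.0, 1.0 + dx / 2, dx)
    A = np.column_stack([x, np.ones_like(x)])
    lam = np.linalg.eigvalsh(A.T @ A)   # ascending: [weight (w2) mode, threshold (w1) mode]
    return 1.0 - eta * lam              # poles (1 - eta*lambda_i) of (2.58)

print(poles(0.1, 0.1))    # approx [ 0.23, -1.1 ]: |pole| > 1, divergence (fig. 2.12)
print(poles(0.1, 0.05))   # approx [ 0.615, -0.05 ]: stable (fig. 2.14)
print(poles(0.2, 0.1))    # approx [ 0.56, -0.1 ]: stable (fig. 2.15)
```

Note that the second entry of each vector, corresponding to the threshold w1 and to the largest eigenvalue, is the one that reaches −1.1 for η = 0.1 and Δx = 0.1, exactly the divergent mode seen in fig. 2.12.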

Notice that the difference equation (2.58) can be expressed in terms of Z-transforms as a sum of two terms: the forced response, and the natural response with non-null initial conditions:

V_j(z) = z v_j[0] / (z − (1 − ηλ_j)) + z η z_j / ((z − 1)(z − (1 − ηλ_j))) (2.65)

The values of (1 − ηλ_i) represent the poles of the Z-transform. As we know, poles on the negative real axis correspond to oscillatory sequences, divergent if outside the unit circle and convergent if inside, while poles on the positive real axis correspond to decreasing exponentials. Poles near the origin correspond to a fast decay, while poles near the unit circle correspond to a slow decay. The evolution of the transformed weight vector can be expressed in closed form:

v_j[k+1] = v_j[0] (1 − ηλ_j)^(k+1) + (z_j/λ_j) (1 − (1 − ηλ_j)^(k+1)) (2.66)


Notice that z_j/λ_j = v̂_j. Replacing this value in (2.66), we finally have:

v_j[k+1] = v̂_j + (v_j[0] − v̂_j) (1 − ηλ_j)^(k+1) (2.67)
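The closed form (2.67) can be verified against the recursion (2.58) directly; the values below are arbitrary illustrative numbers, not taken from the text:

```python
import numpy as np

eta, lam, z = 0.05, 7.7, 2.31          # illustrative values; v_hat = z/lam = 0.3
v_hat = z / lam
v = 0.9                                 # v[0]
for k in range(30):
    v = (1 - eta * lam) * v + eta * z   # recursion (2.58)
# closed form (2.67) after 30 iterations:
v_closed = v_hat + (0.9 - v_hat) * (1 - eta * lam)**30
assert abs(v - v_closed) < 1e-12
```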


Eq. (2.67) clearly shows that, during the training process, the weights evolve from their starting point to their optimal values in an exponential fashion, provided (2.62) is satisfied. As we know, the concept of time constant is related with the exponent of decaying exponentials: for a transient of the form A e^(−at), the time constant is τ = 1/a. If we expand the exponential e^(−1/τ) in a Taylor series, and equate this approximation with the value of the pole in (2.65):

e^(−1/τ) ≈ 1 − 1/τ + 1/(2!τ²) − … = 1 − ηλ_j (2.68)




we can therefore approximate the training time constant as:

τ ≈ 1/(ηλ) (2.69)

This last equation clearly shows that the time constant of the error back-propagation algorithm is related with the smallest eigenvalue of the normal equation matrix, while the largest eigenvalue, as seen before, gives the stability condition. In this coordinate system of the eigenvectors of A^T A, the training algorithm can be envisaged as a joint adaptation of several unidimensional algorithms.

Performance surface and the normal equations matrix

Eq. (2.51) can be expressed in another form:

Ω_l = (t^T t − 2 t^T A w + w^T A^T A w) / 2 (2.70)

Employing (2.54) in (2.70), we obtain its minimum value:

Ω_l,min = (t^T t − t^T A ŵ) / 2 = t^T (I − A A⁺) t / 2 (2.71)

With some simple calculations, the performance criterion can then be given as:

Ω_l = Ω_l,min + (w − ŵ)^T A^T A (w − ŵ) / 2 (2.72)

Expressing the last equation in terms of the transformed variables v, we have:

Ω_l = Ω_l,min + (v − v̂)^T S (v − v̂) / 2 (2.73)

The performance surface is a paraboloid, centred on the optimal weight values. Notice that the distance of any point in the performance surface to its minimum, ΔΩ = Ω − Ω_min, expressed in the coordinate system of the eigenvectors of A^T A, has now a form typical of an ellipsoid in k1+1 dimensions:

ΔΩ = (v_1 − v̂_1)² / (2/λ_1) + … + (v_{k1+1} − v̂_{k1+1})² / (2/λ_{k1+1}) (2.74)

The smallest axis is the one related with the eigenvector corresponding to the largest eigenvalue. The points with the same value of the criterion (the contour plots introduced earlier) have

3.. will be introduced. golden section) or polynomial methods (quadratic. 2. the unreliability of the method. such that it minimizes (2. however.1. partial or inexact if it does not. the nonlinear multilayer perceptron. genetic algorithms [96] [97] offers potential for the training of neural networks [98].38) into three different steps: .4 Alternatives to the error back-propagation algorithm In last section it has been shown that the error back-propagation is a computationally efficient algorithm. and whose lengths are the inverse of the square root of the corresponding eigenvalues.6 - This algorithm has now the usual structure of a step-length unconstrained minimization procedure.. compute a search direction: p [ k ] = – g [ k ] .75) would be verified in every iteration. namely search procedures (Fibonnaci. to compute. but.a shape of an ellipsoid. For the purpose of the interpretation of the error back-propagation algorithm in the framework of unconstrained optimization. namely unreliability and slow convergence. For instance. The above considerations.. a step length α [ k ] such that (2.76) Artificial Neural Networks 81 .. Also it has been shown that it is difficult to select appropriate values of the learning parameter. can be eliminated by incorporating a line search algorithm. update the weights of the MLP: w [ k + 1 ] = w [ k ] + α [ k ]p [ k ] Sub-division of (2. Several strategies could be used. cubic) involving interpolation or extrapolation [99]. it is advisable to divide the update equation (2. Ω ( w [ k ] + α [ k ]p [ k ] ) < Ω ( w [ k ] ) (2. This approach is.75) Hence. whose axes coincide with the eigenvectors of the normal equation matrix. The first limitation of the error back-propagation algorithm. The second step could then be modified. can be extrapolated for the case of interest. . it is unreliable and can have a very slow rate of convergence. where alternatives to back-propagation. Other approaches could be taken.. 
compute a step length: α [ k ] = η . since it implements a steepest descent method. The line search is denoted exact if it finds α [ k ] . Ω ( w [ k ] + α [ k ]p [ k ] ) (2.76). which do not suffer from its main limitations. step 2 would implement what is usually called a line search algorithm. This will be done in the following section.. In this section alternatives to the error back-propagation are given. in each iteration. beyond the scope of this course. in the framework of unconstrained deterministic optimization.38) Algorithm 2. for the case of a linear perceptron.
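A minimal inexact line search of the backtracking kind can be sketched as follows (a NumPy illustration of Alg. 2.6 with condition (2.75); the problem data are invented for the demonstration):

```python
import numpy as np

def backtracking_step(omega, grad, w, alpha0=1.0, beta=0.5):
    """One iteration of Alg. 2.6 with an inexact (backtracking) line search:
    shrink alpha until the descent condition (2.75) holds."""
    p = -grad(w)                              # step 1: steepest-descent direction
    alpha = alpha0
    while omega(w + alpha * p) >= omega(w) and alpha > 1e-12:
        alpha *= beta                         # step 2: reduce the step length
    return w + alpha * p                      # step 3: update the weights

# quadratic test problem Omega(w) = 0.5*||t - A w||^2
A = np.array([[1.0, 1.0], [1.0, -1.0], [2.0, 0.5]])
t = np.array([1.0, 0.0, 1.5])
omega = lambda w: 0.5 * np.sum((t - A @ w) ** 2)
grad = lambda w: -A.T @ (t - A @ w)

w = np.array([5.0, -3.0])
for _ in range(200):
    w = backtracking_step(omega, grad, w)
# w is now close to the least-squares solution, with no learning rate to hand-tune
```

Unlike the fixed step α[k] = η of back-propagation, the step length here adapts automatically, so the iteration cannot diverge, which is exactly the reliability argument made in the text.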

The inclusion of a line search procedure in the back-propagation algorithm guarantees global convergence to a stationary point [99]. However, even with this modification, the error back-propagation algorithm is still far from attractive, since it still employs the steepest-descent direction. It can be shown [100] that, for the linear perceptron case, employing an exact line search algorithm to determine the step length, the rate of convergence is linear and given by:

(Ω(w[k+1]) − Ω(ŵ)) / (Ω(w[k]) − Ω(ŵ)) ≈ (κ(A^T A) − 1)² / (κ(A^T A) + 1)² (2.77)

For the case of nonlinear multilayer perceptrons, the convergence rate can be approximated by (2.77), with the matrix A replaced by J(ŵ), the Jacobean matrix at the solution point. Even with a mild (in terms of MLPs) condition number of 1000, simple calculations show that the asymptotic error constant is 0.996, which means that the gain in accuracy achieved in each iteration is very small. In practice, large condition numbers appear in the process of training MLPs (which seems to be a perfect example of an ill-conditioned problem), as reported in the literature, and are responsible for the very large numbers of iterations typically needed to converge.

To speed up the training phase of MLPs, it is clear that some other search direction p[k], other than the steepest descent, must be computed in step 1 of Alg. 2.6. Some guidelines for this task can be taken from the linear case discussed above. Equation (2.54) gives us a way to determine, in just one iteration, the optimum value of the weights of a linear perceptron. This contrasts with the use of the normal error back-propagation update (2.38), where several iterations (depending on the η used and on κ(A^T A)) are needed to find the minimum. This solves the first limitation of the BP algorithm, as will be shown afterwards.

One way to see how the two different updates arise is to consider that two different approximations are used for Ω, and that (2.38) and (2.54) minimize these approximations. For the case of (2.38), a first-order approximation of Ω is assumed:

Ω(w[k] + p[k]) ≈ Ω(w[k]) + g^T[k] p[k] (2.78)

It can be shown [101] that the normalized (Euclidean norm) p[k] that minimizes (2.78) is:

p[k] = −g[k] (2.79)

clearly the steepest-descent direction. Equation (2.54) appears when a second-order approximation is assumed for Ω:

Ω(w[k] + p[k]) ≈ Ω(w[k]) + g^T[k] p[k] + (1/2) p^T[k] G[k] p[k] (2.80)
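The asymptotic error constant in (2.77) is easy to evaluate for a given condition number; the fragment below reproduces the value 0.996 quoted above and estimates how many iterations such a rate implies:

```python
import math

kappa = 1000.0                                  # condition number of A^T A
rate = ((kappa - 1) / (kappa + 1)) ** 2         # asymptotic error constant, (2.77)
print(round(rate, 3))                           # 0.996
# iterations needed to reduce the criterion error by a factor of 10^6 at this rate
iters = math.ceil(math.log(1e-6) / math.log(rate))
print(iters)                                    # several thousand iterations
```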

The matrix G[k] denotes the matrix of the second-order derivatives of Ω at the kth iteration; it is usually called the Hessian matrix of Ω. Formulating the quadratic function in terms of p[k]:

Φ(p[k]) = g^T[k] p[k] + (1/2) p^T[k] G[k] p[k] (2.81)

If G[k] is positive definite (all its eigenvalues are greater than 0), then the unique minimum of (2.81) is given by (2.82):

p̂[k] = −G⁻¹[k] g[k] (2.82)

A Newton method implements (2.82) directly. It can be proved [102] that, provided w[k] is sufficiently close to ŵ and G is positive definite, the Newton method converges at a second-order rate. However, at points far from the solution, the quadratic model may be a poor approximation of Ω, and G[k] may not be positive definite, which means that (2.81) may not possess a minimum, nor even a stationary point. This serious limitation, together with the high computational costs of the method (the second-order derivatives are needed, and G must be inverted), stimulated the development of alternatives to Newton's method. Depending on whether the actual Hessian matrix, or an approximation to it, is used, and on the way this approximation is performed, second-order methods are categorized into different classes.

The Hessian matrix has a special structure when the function to minimize is, as in the case in consideration, a sum of the squares of the errors. In nonlinear least-squares problems the Hessian can be expressed [102] as:

G[k] = J^T[k] J[k] − Q[k] (2.83)

where Q[k] is:

Q[k] = Σ_{i=1}^{m} e_i[k] G_i[k] (2.84)

G_i[k] denoting the matrix of the second-order derivatives of o(q)_i at iteration k. For the linear case, G_i[k], i = 1, …, m, is 0, and (2.82) reduces to the least-squares solution (2.54). Quasi-Newton methods are general unconstrained optimization methods, and therefore do not make use of this special structure of nonlinear least-squares problems; they will be introduced next. Two other

methods that exploit this structure, the Gauss-Newton and the Levenberg-Marquardt methods, will be presented afterwards.

Quasi-Newton method

This class of methods employs the observed behaviour of Ω and g to build up curvature information, in order to make an approximation of G (or of H = G⁻¹) using an appropriate updating technique. A large number of Hessian updating techniques have been proposed. The first important method was introduced by Davidon [103] in 1959 and later presented by Fletcher and Powell [104], being then known as the DFP method. Its update equation [103] is:

H_DFP[k+1] = H[k] + (s[k] s^T[k]) / (s^T[k] q[k]) − (H[k] q[k] q^T[k] H[k]) / (q^T[k] H[k] q[k]) (2.85)

where:

s[k] = w[k+1] − w[k] (2.86)
q[k] = g[k+1] − g[k] (2.87)

It is now usually recognised [99] that the formula of Broyden [105], Fletcher [106], Goldfarb [107] and Shanno [108] is the most effective for a general unconstrained method. The BFGS update [102] is given by:

H_BFGS[k+1] = H[k] + (1 + (q^T[k] H[k] q[k]) / (s^T[k] q[k])) (s[k] s^T[k]) / (s^T[k] q[k]) − (s[k] q^T[k] H[k] + H[k] q[k] s^T[k]) / (s^T[k] q[k])

Traditionally these methods update the inverse of the Hessian, H, to avoid the costly inverse operation. Nowadays this approach is questionable and, for instance, Gill and Murray [101] propose a Cholesky factorization of G and the iterative update of its factors. The update formulae used possess the property of hereditary positive definiteness, i.e., if G[k] is positive definite, so is G[k+1]. In practice, as reported in the literature, the application of the BFGS version of a quasi-Newton (QN) method almost always delivers better results than the BP algorithm in the training of MLPs. Two examples of this are given in fig. 2.19 and in fig. 2.23, where the performance of these two algorithms, as well as of the Gauss-Newton and the Levenberg-Marquardt methods, is compared on common examples. Further comparisons with BP can be found in [84].
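Both updates can be checked against the defining secant (quasi-Newton) condition H[k+1] q[k] = s[k]. A small NumPy verification (random s and q, purely illustrative):

```python
import numpy as np

def dfp_update(H, s, q):
    # (2.85): H+ = H + s s^T/(s^T q) - H q q^T H/(q^T H q)
    Hq = H @ q
    return H + np.outer(s, s) / (s @ q) - np.outer(Hq, Hq) / (q @ Hq)

def bfgs_update(H, s, q):
    # BFGS inverse-Hessian update, as written above
    sq, Hq = s @ q, H @ q
    return (H + (1 + (q @ Hq) / sq) * np.outer(s, s) / sq
              - (np.outer(s, Hq) + np.outer(Hq, s)) / sq)

rng = np.random.default_rng(3)
H = np.eye(4)
s, q = rng.normal(size=4), rng.normal(size=4)
for upd in (dfp_update, bfgs_update):
    assert np.allclose(upd(H, s, q) @ q, s)   # secant condition holds
```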

Gauss-Newton method

While the quasi-Newton methods are usually considered 'state of the art' in general unconstrained minimization, they do not exploit the special nature of the problem at hand, its nonlinear least-squares structure. As noted previously, in this type of problem the gradient vector can be expressed, in terms of the Jacobean matrix, as a product of the Jacobean matrix and the error vector, as in (2.40), and the Hessian as a combination of a product of first derivatives (the Jacobean) and second-order derivatives, as in (2.83). So, if first-order information is available, one part of the Hessian matrix is exactly known. The basis of the Gauss-Newton (GN) method lies in dropping the use of second-order derivatives in the Hessian, so that (2.83) is approximated as:

G[k] ≈ J^T[k] J[k] (2.88)

The reasoning behind this approximation lies in the belief that, at the optimum, the error will be small, so that Q(ŵ), given by (2.84), is very small compared with J^T(ŵ) J(ŵ). This assumption may not be (and usually is not) true at points far from the local minimum. The validity of this approximation is even more questionable in the training of MLPs since, in most of the situations where this method is applied, it is not known whether the particular MLP topology being used is the best for the function underlying the training data, or even whether the chosen topology is a suitable model for the underlying function in the range employed.

The search direction obtained with the Gauss-Newton method is then the solution of:

J^T[k] J[k] p_GN[k] = −J^T[k] e[k] (2.89)

As J is a rectangular matrix, an infinite number of solutions exists if J is rank-deficient but, fortunately, this system of equations always has a solution, which is unique if J has full column rank.

The problem of applying the Gauss-Newton method to the training of MLPs lies in the typically high degree of ill-conditioning of J. This problem is exacerbated [102] if (2.89) is solved by first computing the product J^T J and then inverting the resultant matrix, since

κ(J^T J) = (κ(J))² (2.90)

where the condition number κ(J) is the ratio between the largest and the smallest singular values of J. QR or SVD factorizations of J should therefore be used to compute (2.89). Even using a QR factorization and taking some precautions to start the training with a small condition number of J, in almost every case the conditioning becomes worse as learning proceeds. A point is then reached where the search direction is almost orthogonal to the gradient direction, thus preventing any further progress by the line search routine. The Gauss-Newton method is therefore not recommended for the training of MLPs but, fortunately, a better alternative exists: the Levenberg-Marquardt method.
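The squaring of the condition number in (2.90), and the advantage of factorizing J instead of forming J^T J, can be demonstrated numerically (the Jacobean below is synthetic, built to be ill-conditioned on purpose):

```python
import numpy as np

rng = np.random.default_rng(4)
# build a deliberately ill-conditioned "Jacobean" with singular values 1, 1e-2, 1e-4
U, _ = np.linalg.qr(rng.normal(size=(40, 3)))
V, _ = np.linalg.qr(rng.normal(size=(3, 3)))
J = U @ np.diag([1.0, 1e-2, 1e-4]) @ V.T
e = rng.normal(size=40)

print(np.linalg.cond(J))          # about 1e4
print(np.linalg.cond(J.T @ J))    # about 1e8, the square, as in (2.90)

# least-squares solution of J p = -e via SVD/QR avoids forming J^T J ...
p_good = np.linalg.lstsq(J, -e, rcond=None)[0]
# ... while solving the normal equations squares the conditioning
p_bad = np.linalg.solve(J.T @ J, -J.T @ e)
print(np.max(np.abs(p_good - p_bad)))   # the two answers already differ visibly
```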

Levenberg-Marquardt method

A method which has the global convergence property [102], and overcomes the problems arising when Q[k] is significant, is the Levenberg-Marquardt (LM) method. Levenberg [109] and Marquardt [110] were the first to suggest this type of method in the context of nonlinear least-squares optimization, and many different algorithms of this type have subsequently been suggested. This method uses a search direction which is the solution of the system (2.91):

(J^T[k] J[k] + υ[k] I) p_LM[k] = −J^T[k] e[k] (2.91)

where the scalar υ[k], usually denoted the regularization factor, controls both the magnitude and the direction of p[k]. When υ[k] is zero, p[k] is identical to the Gauss-Newton direction; as υ[k] tends to infinity, p[k] tends to a vector of zeros, and to a steepest-descent direction.

In contrast with the Gauss-Newton and quasi-Newton methods, which belong to the step-length class of methods (Alg. 2.6), the LM method is of the 'trust-region' or 'restricted step' type. Basically, this type of method attempts to define a neighbourhood where the quadratic function model agrees with the actual function in some sense. If there is good agreement, the test point is accepted and becomes a new point in the optimization; otherwise it may be rejected and the neighbourhood is constricted. The radius of this neighbourhood is controlled by the parameter υ[k].

To introduce the algorithm actually employed [102], consider a linear model for generating the data:

o^(nl)[k] = J[k] w[k] (2.92)

Using (2.92), the predicted error vector, after taking a step p[k], is:

e_p[k] = e[k] − J[k] p[k] (2.93)

so that the predicted reduction of Ω is:

ΔΩ_p[k] = Ω(w[k]) − (e_p^T[k] e_p[k]) / 2 (2.94)

As the actual reduction is given by:

ΔΩ[k] = Ω(w[k]) − Ω(w[k] + p[k]) (2.95)

the ratio

r[k] = ΔΩ[k] / ΔΩ_p[k] (2.96)

measures the accuracy with which the quadratic function approximates the actual function, in the sense that the closer r[k] is to unity, the better the agreement. The LM algorithm can then be stated, for iteration k:

- obtain J[k] (Alg. 2.2) and e[k] (using Alg. 2.3);
- compute p[k] using (2.91) (if (J^T[k]J[k] + υ[k]I) is not positive definite, then υ[k] := 4υ[k] and repeat);
- evaluate Ω(w[k] + p[k]) and hence r[k];
- if r[k] < 0.25, υ[k+1] := 4υ[k]; elseif r[k] > 0.75, υ[k+1] := υ[k]/2; else υ[k+1] := υ[k];
- if r[k] ≤ 0, w[k+1] := w[k]; else w[k+1] := w[k] + p[k].

Algorithm 2.7 - Levenberg-Marquardt algorithm

The algorithm is usually initiated with υ[1] = 1 and is not very sensitive to the change of the parameters 0.25, 0.75, 2, 4, etc. [102]. It is usually agreed that the Levenberg-Marquardt method is the best method for nonlinear least-squares problems [102]. This was confirmed in this study, in a large number of training examples performed.

2.3.5 - Examples

To illustrate the performance of the different methods, two examples will now be presented.

Example 2.7 - Inverse of a titration curve

The aim of this example is to approximate the inverse of a titration-like curve. This type of nonlinearity relates the pH (a measure of the activity of the hydrogen ions in a solution) with

7 0. The initial value of the weights is stored in the file wini. 88 Artificial Neural Networks .5 ph´ 0.5 0. and with an even spacing of 0.2 0.5 x 0.+ 10 –14 – -. These data is stored in the files input. An example of a normalized titration curve is shown in fig.8 2 (2.1 0.8 0. 2.mat (x). because the presence of this static nonlinearity causes large variations in the process dynamics [86].6 0. and in target.4 0. were used as x and the corresponding pH´ values computed using.2 0.the concentration (x) of chemical substances.+ 6 y ---4 2 –3 –3 pH' = – -----------------------------------------------------------. to overcome the problems caused by the titration nonlinearity.01. 2..9 1 0 0. In this example a MLP with 1 input neuron.8 0.1 0.9 0. rather than the pH. is to use the concentration as output.mat (the pH’).1 0 0.3 0.4 0.mat.6 0.6 0.9 1 a) forward nonlinearity FIGURE 2.7 0.1 0 0.3 0. fig. Four different algorithms were used to train the MLP.4 0.3 x 1 0. 2 hidden layers with 4 nonlinear neurons in each.Titration-like b) inverse nonlinearity curves One strategy used in some pH control methods.17 a).7 0. the error norm for the first 100 iterations of three of those algorithms is shown in the next figures.4 0. Then the x values were used as target output data and the pH´ values as input training data. To compare their performance.17 b) illustrates the inverse of the nonlinearity. The equation used to generate the data for this graph was: y .7 0.17 .3 0. by linearizing the control [86].5 ph´ 0.6 0. y=2*10 x – 10 26 log 0.97) 0 0. pH control is a difficult control problem. 101 values ranging from 0 to 1.2 0.8 0.2 0. and one linear neuron as output layer was trained to approximate the inverse of (2.97).

The first figure shows the performance of the error back-propagation algorithm, with η = 0.01, 0.005 and 0.001. A learning rate of 0.05 or greater causes the algorithm to diverge. The Matlab file employed is BP.m (it uses Threelay.m).

FIGURE 2.18 - Performance of the BP algorithm, η = (0.01, 0.005, 0.001)

Next, three second-order algorithms are compared:

1. A quasi-Newton algorithm, implemented with a BFGS method and a mixed quadratic/cubic line search (using the function fminunc.m of the Optimization Toolbox). The Matlab file which generates the results is quasi_n.m (it uses the function funls.m).
2. A Levenberg-Marquardt algorithm, described in Alg. 2.7. The Matlab file which generates the results is Leve_Mar.m (it employs the Matlab functions LM.m, WrapSpar.m and UnWrapSp.m).
3. A different implementation of the Levenberg-Marquardt algorithm, employing a line search algorithm (using the lsqnonlin.m function of the Optimization Toolbox). The Matlab file which generates the results is LM_mat.m (it uses the function fun_lm.m).

fig. 2.19 a) to c) illustrate the performance of these algorithms. As is evident from these figures, the Levenberg-Marquardt methods achieve a better rate of convergence than the quasi-Newton method. The two LM methods are compared in fig. 2.19 a), which shows the last 50 iterations of both methods; iterations 1 to 20 are shown in fig. 2.19 b), while the following iterations are represented in fig. 2.19 c). A similar rate of convergence can be observed, with the remark that the Matlab function lsqnonlin.m is more computationally intensive, as it incorporates a line search algorithm.

FIGURE 2.19 - Quasi-Newton and Levenberg-Marquardt algorithms: a) last 50 iterations; b) first 20 iterations; c) next 30 iterations

The Gauss-Newton version of the lsqnonlin function was also tried but, due to conditioning problems, the function switched to the LM algorithm in a small number of iterations.

Example 2.8 - Inverse coordinate transformation

This example illustrates an inverse kinematic transformation between Cartesian coordinates and one of the angles of a two-links manipulator. Referring to fig. 2.20, the angle of the second joint (θ2) is related to the Cartesian coordinates of the end-effector (x, y) by expression (2.98):

θ2 = ± tan⁻¹(s/c) (2.98)

with

c = (x² + y² − l1² − l2²) / (2 l1 l2) = cos(θ2)
s = √(1 − c²) = sin(θ2)

where l1 and l2 represent the lengths of the arm segments.

FIGURE 2.20 - Two-links robotic manipulator

The aim of this example is to approximate the mapping (x, y) → +θ2 in the first quadrant. The length of each segment of the arm is assumed to be 0.5 m. To obtain the training data, 110 pairs of the x and y variables were generated as the intersections of 10 arcs of circle centred in the origin (with radii ranging from 0.1 m to 1.0 m, with an even separation of 0.1 m) with 11 radial lines (angles ranging from 0 to 90º, with an even separation of 9º).

FIGURE 2.21 - Input training data
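Expression (2.98) is easy to check with a forward-kinematics round trip (a NumPy sketch; the joint angles chosen are arbitrary illustrative values):

```python
import numpy as np

l1 = l2 = 0.5     # arm segment lengths used in the example

def theta2(x, y):
    # (2.98): c = (x^2 + y^2 - l1^2 - l2^2) / (2*l1*l2), s = sqrt(1 - c^2)
    c = (x**2 + y**2 - l1**2 - l2**2) / (2 * l1 * l2)
    s = np.sqrt(1 - c**2)
    return np.arctan2(s, c)          # the positive solution +theta2

# round trip: place the end-effector with known joint angles and recover theta2
th1, th2 = 0.3, 1.1
x = l1 * np.cos(th1) + l2 * np.cos(th1 + th2)
y = l1 * np.sin(th1) + l2 * np.sin(th1 + th2)
assert abs(theta2(x, y) - th2) < 1e-9
```

Note that θ2 depends only on the distance x² + y² from the origin (the law of cosines), which is why the training grid of arcs and radial lines described above covers the input space evenly.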

For each of the 110 examples, the corresponding positive solution of (2.98) was determined and used as the target output. These data can be found in the file InpData.mat.

[Figure: surface of θ₂ as a function of x and y]
FIGURE 2.22 - Mapping (x, y) → +θ₂

An MLP with 2 input neurons, one hidden layer with 5 nonlinear neurons and one linear neuron as output layer was employed to approximate the mapping illustrated in fig. 2.22. The initial values are stored in the file W.ini. fig. 2.23 illustrates the performance of the BP algorithm with η = 0.01 (higher values of the learning rate cause the training process to diverge), the quasi-Newton method and the Levenberg-Marquardt method. The corresponding functions are, respectively: BP.m (uses TwoLayer.m), quasi_n.m (uses fun.m) and Leve_Mar.m (uses LM.ls.m).

[Figure: error norm vs. iterations for BP, quasi-Newton and Levenberg-Marquardt]
FIGURE 2.23 - Performance of BP, quasi-Newton and Levenberg-Marquardt
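The topology just described - 2 inputs, 5 sigmoidal hidden neurons and one linear output neuron - is small enough to sketch directly. The weights below are random placeholders, not the values stored in W.ini:

```python
import math, random

def forward(x, W_hidden, w_out):
    """One forward pass: sigmoidal hidden layer, *linear* output neuron."""
    h = [1.0 / (1.0 + math.exp(-(w[0] * x[0] + w[1] * x[1] + w[2])))
         for w in W_hidden]                       # 5 hidden outputs, each in (0, 1)
    return sum(wi * hi for wi, hi in zip(w_out, h)) + w_out[5]  # linear output

random.seed(0)
W_hidden = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(5)]  # 5 x (2 inputs + bias)
w_out = [random.uniform(-1, 1) for _ in range(6)]                         # 5 hidden + bias
print(forward([0.5, 0.5], W_hidden, w_out))
```

Because the hidden outputs are bounded in (0, 1) and the output neuron is linear, the network output is simply a weighted sum of bounded terms - the structural fact exploited by the reformulated criterion introduced below.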

In all the examples shown so far, the quasi-Newton method achieves a much better rate of convergence than the BP algorithm, but the LM algorithm is undoubtedly the best of all.

2.3.6 - A new learning criterion

In the preceding sections alternatives to the error back-propagation algorithm have been presented. It was shown that quasi-Newton algorithms have a more robust performance and achieve a faster rate of convergence than the error back-propagation algorithm. The LM method, which exploits the least-squares nature of the criterion, emerges as the best method. The convergence of the learning algorithms can be further improved if the topology of the MLPs is further exploited. In most of the applications of MLPs to control systems the networks act as function approximators and, as far as control systems applications are concerned, the most important topology of this type of artificial neural network seems to be the one depicted in fig. 2.24.

[Figure: MLP with inputs, 0 or more hidden layers, and one linear output neuron]
FIGURE 2.24 - Most important topology of MLPs for control systems applications

The only difference between this topology and the standard one lies in the activation function of the output neuron, which is linear. This simple fact, as will be shown in the following treatment, can be exploited to decrease the number of iterations needed for convergence [87]. For the purposes of explanation, let us partition the weight vector as:

    w = [ w(q−1); w(q−2); … ; w(1) ]                           (2.99)

where u denotes the weights that connect to the output neuron (w(q−1)), and v denotes all the other weights. The weights can therefore be separated into two classes: nonlinear (v) and linear (u) weights. Using this convention, the output vector can be given as:

    o(q) = [ O(q−1) 1 ] u                                      (2.100)

When (2.100) is substituted in the standard criterion (2.55), the resulting criterion is:

    Ω = || t − [ O(q−1) 1 ] u ||² ⁄ 2                          (2.101)

The dependence of this criterion on v is nonlinear and appears only through O(q−1); the model output is, on the other hand, linear in the variables u. The optimum value of the linear parameters is therefore conditional upon the value taken by the nonlinear variables. For any value of v:

    û(v) = [ O(q−1)(v) 1 ]⁺ t                                  (2.102)

In (2.102) a pseudo-inverse is used for the sake of simplicity; it should be noticed that [ O(q−1) 1 ] is assumed to be full-column rank. We can therefore, instead of determining the optimum of (2.101) directly, first minimize (2.101) w.r.t. u using (2.102), and then replace the last equation into (2.101), creating a new criterion. Denoting the matrix [ O(q−1) 1 ] by A, where the dependence of A on v has been omitted for simplicity, we have:

    ψ = || P_A⊥ t ||² ⁄ 2 = || t − A A⁺ t ||² ⁄ 2              (2.103)

where P_A⊥ is the orthogonal projection matrix onto the space complementary to the one spanned by the columns of A. The minimum of (2.103) depends only on the nonlinear weights; afterwards, using the optimal v in (2.102), u can be found, to obtain the complete optimal weight vector ŵ. Although this new criterion is different from the standard criterion (2.37), their minima are the same. The proof of this statement can be found in [89]. Besides reducing the dimensionality of the problem, the main advantage of using this new criterion is a faster convergence of the training algorithm, a property which seems to be independent of the actual training method employed.

This reduction in the number of iterations needed to find a local minimum may be explained by two factors:
• a better convergence rate is normally observed with this criterion, compared with the standard one;
• for the same initial value of the nonlinear parameters, the initial value of the criterion (2.103) is usually much smaller than the initial value of (2.37), in comparison with the situation where random numbers are used as starting values for the linear weights. This is because at each iteration, including the first, the value of the linear parameters is the optimal one, conditioned by the value taken by the nonlinear parameters. As the criterion is very sensitive to the value of the linear parameters, a large reduction is obtained.

One problem that may occur using this formulation is a poor conditioning of the A matrix during the training process, which results in large values for the linear parameters. This problem will be treated afterwards.

In order to perform the training, the gradient of ψ must be obtained. Considering first the use of the error back-propagation algorithm or of the quasi-Newton methods with this new criterion, the derivatives of (2.103) must be obtained. For this purpose the derivative of A w.r.t. v must be introduced. This is a three-dimensional quantity, which will be denoted as:

    (A)_v = ∂A ⁄ ∂v                                            (2.104)

It can be proved (see [89]) that the gradient vector of ψ is the partition, associated with the weights v, of the gradient of Ω evaluated at the conditionally optimal linear weights:

    g_ψ = g_Ω |_{u = û} = −J_ψᵀ e_ψ                            (2.105)

(the partition associated with u vanishes, since [ O(q−1) 1 ]ᵀ e_ψ = 0), where J_ψ and e_ψ denote, respectively, the partition of the Jacobean matrix associated with the weights v and the error vector of Ω, obtained when the values of the linear weights are their conditional optimal values, i.e.:

    J_ψ = (A)_v A⁺ t                                           (2.106)
    e_ψ = P_A⊥ t                                               (2.107)

The simplest way to compute g_ψ is expressed in Alg. 2.8.
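As a small numerical illustration of this separation (pure Python; a hand-rolled two-column least-squares solve stands in for the pseudo-inverse, and the matrix entries are made up):

```python
def lstsq_2col(A, t):
    """Solve the 2x2 normal equations (A'A)u = A't for a two-column A."""
    a11 = sum(r[0] * r[0] for r in A)
    a12 = sum(r[0] * r[1] for r in A)
    a22 = sum(r[1] * r[1] for r in A)
    b1 = sum(r[0] * ti for r, ti in zip(A, t))
    b2 = sum(r[1] * ti for r, ti in zip(A, t))
    det = a11 * a22 - a12 * a12
    return ((a22 * b1 - a12 * b2) / det, (a11 * b2 - a12 * b1) / det)

def psi(A, t):
    """Reformulated criterion: psi = ||t - A u||^2 / 2 at the conditionally optimal u."""
    u = lstsq_2col(A, t)
    e = [ti - (r[0] * u[0] + r[1] * u[1]) for r, ti in zip(A, t)]  # e = P_A-perp t
    return 0.5 * sum(ei * ei for ei in e)

# A = [O(q-1) 1]: one hidden-output column plus the bias column
A = [(0.2, 1.0), (0.5, 1.0), (0.9, 1.0)]
t = [0.1, 0.5, 0.8]
print(psi(A, t))
```

Note that ψ is a function of the nonlinear weights only: changing v changes the entries of A, and the linear weights are always re-solved optimally inside the criterion.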

Algorithm 2.8 - Computation of g_ψ
- compute O(q−1), using Alg. 2.3;
- obtain û;
- replace û in the linear parameters, and complete Alg. 2.3;
- compute g_ψ, using Alg. 2.2.

The computation of g_ψ therefore has the additional burden of determining û. Several algorithms can be used to compute these optimal values; here a pseudo-inverse is assumed, but, because of possible ill-conditioning of A, QR or SVD factorizations are advisable.

If the Gauss-Newton or Levenberg-Marquardt methods are employed with this new formulation, then the Jacobean of (2.103) must be obtained. Three different Jacobean matrices have been proposed. The first was introduced by Golub and Pereyra [88], who were also the first to introduce the reformulated criterion. To reduce the computational complexity of Golub and Pereyra's approach, Kaufman [90] proposed a simpler Jacobean matrix. Ruano et al. [91] proposed to employ the Jacobean matrix J_ψ (2.106), which further reduces the computational complexity of each training iteration. Following the same line of reasoning that led to Alg. 2.8, the cheapest way to obtain J_ψ is using Alg. 2.4:

Algorithm 2.9 - Computation of J_ψ
- compute O(q−1), using Alg. 2.3;
- obtain û;
- replace û in the linear parameters, and complete Alg. 2.3;
- compute J_ψ, using Alg. 2.4.

As in the case of Alg. 2.8, apart from the computation of û, the computation of J_ψ does not involve more costs than the computation of J. Although there is an additional cost in complexity per iteration using this approach, there is also a gain at each iteration that must be taken into account, since the dimensionality of the problem has been reduced. Moreover, a large reduction in the number of iterations needed for convergence is obtained if (2.103) is used as the training criterion.

Example 2.9 - pH example revisited

To illustrate the application of the new criterion, Ex. 2.7 will be used. fig. 2.25 illustrates the performance of the error back-propagation algorithm when criterion (2.103) is used as the learning criterion. To perform a fair comparison, the initial values of the nonlinear parameters were the ones used in Ex. 2.7, but with the value of the linear parameters initialized with their optimal values, with respect to the initial values of the nonlinear parameters. The Matlab file that implements this method is BPopt.m (it also employs TwoLay2.m and TL2.m); the initial weights are stored in the file w_ini_optm.mat. The standard BP technique is stored in file BP.m, with its initial weights in wini.mat. The learning rates employed are 0.01, 0.05, 0.1 and 0.5 (remember that in the case of Ex. 2.7 the maximum rate allowed was much smaller). It should be noted that the initial norm of the error vector (0.53) is less than the value (1.2) obtained after 100 iterations with the standard criterion.

[Figure: error vector norm vs. iterations, for the four learning rates]
FIGURE 2.25 - Evolution of the error norm using the BP algorithm minimizing the new criterion for different learning rates

If now the standard training criterion is employed, with the value of the linear parameters initialized with their optimal values, with respect to the initial values of the nonlinear parameters, we obtain the following performance of the training algorithm.

[Figure: error vector norm vs. iterations]
FIGURE 2.26 - Evolution of the error norm using the BP algorithm minimizing the standard criterion for different learning rates

It should be noted that only with very small values of the learning rate does the training converge: learning rates of 0.001, 0.0005 and 0.0001 are employed, and larger values make training diverge.

Concentrating now on quasi-Newton methods, fig. 2.27 illustrates the performance of the quasi-Newton method minimizing the new criterion (dotted line) and minimizing the standard criterion, with the value of the linear parameters initialized with their optimal values. As it is easy to see, the initial norm of the error vector is reduced significantly; notice also that the new starting point has an error norm equivalent to the one obtained after more than 20 iterations of the standard quasi-Newton method (see fig. 2.19). The Matlab function that implements the quasi-Newton method with the new criterion is quasi_n_opt.m. The initial values of the weights are found in wini_red.mat.

[Figure: error vector norm vs. iterations]
FIGURE 2.27 - Performance of quasi-Newton methods

In what concerns Levenberg-Marquardt methods, fig. 2.28 illustrates their performance; the legend is the same as in the last figure. The Matlab function that implements the Levenberg-Marquardt method with the new criterion is Leve_Mar_opt.m (it also employs fun_optm.m).

[Figure: error vector norm vs. iterations]
FIGURE 2.28 - Performance of Levenberg-Marquardt methods

With respect to Levenberg-Marquardt methods, minimizing criterion (2.103) has two main advantages: firstly, the initial norm of the error vector is reduced significantly; secondly, it achieves a better rate of convergence. The new criterion is actually less complex to implement.

Example 2.10 - Inverse coordinates transformation revisited

In this case we shall employ the data from Ex. 2.8. We shall consider first the BP algorithm. The next figure illustrates the performance of the BP algorithm with the new criterion. The Matlab function Bpoptm.m (uses also onelay2.m and OL2.m) was employed, and the initial weights are stored in w.ini. The learning rates experimented were η = 0.01, 0.005 and 0.001 (solid, dotted and dash-dotted lines, respectively). It should be noted that the initial norm of the error vector is more than 20 times smaller than the initial norm in fig. 2.23, and 1.6 times smaller than the error norm obtained with 100 iterations of the standard algorithm.

[Figure: error vector norm vs. iterations]
FIGURE 2.29 - Performance of the BP algorithm (new criterion)

Considering now the use of the standard criterion, but with the initial linear optimal weights, we have the following performance, for η = 0.0001 and 0.00005 (solid and dotted lines, respectively).

[Figure: error vector norm vs. iterations]
FIGURE 2.30 - Performance of the BP algorithm (standard criterion)

The Matlab file that implements this training algorithm is BP.m (uses TwoLayer.m), and the initial weight vectors are stored in w.ini and wtotal.ini, for the new criterion and the standard criterion, respectively.

Addressing now quasi-Newton methods, the following figure illustrates the performance of the new criterion (dotted line) against the standard criterion (solid line). The Matlab functions employed are quasi_n_optm.m (employs fun_optm.m) and quasi_n.m (employs fun.m), and the initial values of the weights are stored in wtotal.ini.

[Figure: error vector norm vs. iterations]
FIGURE 2.31 - Performance of quasi-Newton methods

The legends are the same as in the last figure. Finally, Levenberg-Marquardt methods are evaluated.

[Figure: error vector norm vs. iterations]
FIGURE 2.32 - Performance of Levenberg-Marquardt methods

Again, the same conclusions can be taken.

2.3.7 - Practicalities

We have introduced in the previous sections different training algorithms for multilayer perceptrons. The problems we shall discuss now are: where should we start the optimization, and when should we stop it.

2.3.7.1 - Initial weight values

For any optimization problem, a good (close to the optimum) initial value of the design variables is important to decrease the number of iterations towards convergence. In terms of nonlinear least-squares problems, this means we should supply, as initial values, model parameters close to the optimal ones. This assumes that the model being fitted has a structure close to the process originating the data. Unfortunately, in almost every case MLPs do not have a structure close to the data generating function (unless the data generating function is itself a MLP), and therefore there are no guidelines for determining good initial parameter values. For this reason, it is a common practice to employ random values, or zero values, as the initial parameter values.

We can take another route, and employ initial parameter values that do not exacerbate the condition number of the Jacobean matrix. We have seen that the rate of convergence of a linear model was related with the condition number of the Jacobean matrix (see Table 2.10), faster convergence being obtained for smaller condition numbers. In the context of nonlinear optimization, the same conclusions can be taken.

Remembering now how the Jacobean matrix is computed, three terms are involved, for each weight:

1. the derivative of the neuron output w.r.t. its net input;
2. the derivative of the neuron net input w.r.t. the weights, which is related with the neuron inputs;
3. the derivative of the neuron output w.r.t. its inputs, which turns out to be the transpose of the partial weight matrix.

Related with the first term, if the neuron net input is outside its active region (around [−4, 4]; see fig. 2.9), the corresponding derivative is very small. If this happens for all the training patterns, all the columns associated with those neurons will have a small norm, and this affects all the weights related with the neurons below the one we are considering. This is translated into a higher condition number. So, the initial weights should be chosen such that all the hidden layers lie within their active region.

Looking now at the second term, which is related with the neuron inputs, the layer inputs should also lie within a range of −1 and 1. This is no problem for the layers above the first one, as the layer inputs are limited between 0 and 1, but the network input data should be scaled to lie within that range. Focusing on the last layer, and assuming that it is linear, it is convenient to scale the target data to this range as well: this means that the target data should lie between, let's say, −1 and 1.

Looking now to the last term, two observations can be made:
a) W(z−1) should be chosen such that the columns of Net(z) are linearly independent. A sufficient condition for them to be linearly dependent is that W(z−1)ᵢ,. = k W(z−1)ⱼ,., i ≠ j, that is, the set of weights connecting the lower layer (including the bias) with one neuron of the upper layer is a multiple of any other such set.
b) The value of the weights should not be too big or too small, which would make the corresponding columns more linearly dependent on the column associated with the output bias (see Exercise 22).

To summarize, we should:
1. Scale the input and target data to lie within a range [−1, 1].
2. Determine an initial set of weights such that all the nonlinear neurons are in an active region.
3. Make sure that there are no linear dependencies in the weight matrices W(i).

The Matlab function MLP_init_par.m (it also employs scale.m) implements one possible solution to this. First of all, the input and target patterns are scaled to lie in a range [−1, 1].
j (z – 1) xj = k j ⋅ 1 + ----. 1 ) and: Li = π – -.+ ( i – 1 )∆θi 2 π ∆θ i = ---.m.8.108) 2π where k j = ------------------------------------------. All the initial values for the examples in Section 2. the weight matrix related with the first hid- 104 Artificial Neural Networks .9 Jacobean condition number MLP_init_par.is the constant needed to be multiplied by the jth input so that Max ( I j ) – min ( I j ) its range lies in a range of 2π . comparing 100 different executions of MLP_init_par.109) π θ i = – -. 1 ) is a random variable.111) (2. to derive the initial parameters. each input produces a contribution for the net inputs of the above layer with a range approximately of 2π . It should be mentioned that the above equations were derived empirically.W i..110) (z – 1) (2. the following mean results were obtained (see Exercise 23): Table 2. The weights related with the same input are nevertheless different due to the random variable x.1. kz (2.1)). After training has been accomplished.6 were obtained using MLP_init_par.11 . 2.m with 100 different random initializations (N(0. m Random 9. Employing Ex.– k j min ( I j ) 2 (2. k z – 1 + 1 = L i + θ i + x i ∆θ i .1.8 11. It should not be forgotten that. and x j ∼ N ( 0.112) The reasoning behind the above equations for the bias is that each net input will be centred around a different value. Each bias term (corresponding to each one of the k z neurons in the upper layer is obtained as: W i.5 105 1.3. 10 (2. although in the neuron active region.15 1011 It is evident that a better initial point was obtained using the above considerations.Comparison of initialization methods Initial error norm 1. This way. the input and target patterns were scaled..3. where x j ∼ N ( 0.5 and Section 2.

115) (2.Ph Example.2When do we stop training? Off-line training of MLPs can be envisaged as a nonlinear least-squares optimization problem.116) The conditions (2.114) and (2.m was employed. * τ = 10 –6 Criterion Standard (random) Standard (optimal) Reformulated Number of iterations 100 150 107 Error Norm 0.011 0. 2.den layer and the weight vector related with the output layer must be linearly transformed in order to operate with the original data. 2.115) are employed to test if the training is converging.113) Then a sensible set of termination criteria indicates a successful termination if the following three conditions are met: Ω[ k – 1 ] – Ω[k ] < θ[k ] w [ k – 1 ] – w[ k ] < τf ⋅ ( 1 + w [ k ] ) g [ k ] ≤ 3 τf ⋅ ( 1 + Ω [ k ] ) (2.114) (2. We convert this parameter into a measure of absolute accuracy. The same topologies are employed for the two cases.8) and the reformulated criterion (Ex. In the case of the standard criterion. 2. A possible set of termination criteria for smooth problems can be formulated as [101]: Assume that we specify a parameter. θ [ k ] : θ [ k ] = τf ⋅ ( 1 + Ω [ k ] ) (2. while condition (2. The two case studies are revisited using the Levenberg-Marquardt method.12 . τ f .1. 2. Please see also Exercise 22.7.3. This can be accomplished by the Matlab function Restore_Scale. which is a measure of the desired number of the corrected figures in the objective function.116) is based on the optimality condition g ( w ) = 0 .0024 Condition number 150 695 380 Artificial Neural Networks 105 . The Matlab function Train_MLPs.m. two different initializations for the linear parameters were compared: one with the optimal leastsquares values (optimal) and the other one with random initial values (random).7 and Ex. 2. Table 2. the mean values obtained with 20 different initializations The column condition number denotes the condition number of the input matrix of the output layer. The following tables show. 
and therefore the termination criteria used for unconstrained optimization can be applied to decide if the training has converged or not.0024 0.9 and Ex. employing the standard criterion (Ex.10). for each case.

As it can be observed, standard termination criteria are not usually employed when training a network. The technique that is, unfortunately, most used is to stop training once a specified accuracy value is obtained. This, unfortunately, is a bad method, as we can end either with a badly conditioned model, or with a model that could deliver better results.

Table 2.13 - Coordinate Transformation Example, τ = 10⁻⁴

  Criterion            Number of iterations    Error norm    Condition number
  Standard (random)    70                      0.32          62
  Standard (optimal)   115                     0.32          92
  Reformulated         75                      0.52          114

As it was already referred, MLPs do not have a structure similar to the data generating function corresponding to the training data. For this reason, while the fit to the training data usually decreases continuously, if the performance of the model is evaluated in a different data set never seen by the network - called the validation set (also called test set) - it decreases at first, and increases afterwards. This phenomenon is called overtraining; at this point it is also said that the network deteriorates its property of generalization, as the network was 'generalizing' what it was learning to fresh data. Because of this, two other schemes should be employed: the early-stopping method (also called implicit regularization or stopped search) and the regularization method (also called explicit regularization).

The first technique is due to the observation just made about the evolution of the validation error. An obvious scheme is therefore to train the network until it reaches the minimum in terms of fit of the validation set. An example of this technique is shown below. The pH problem was employed, and the training data previously employed was divided between training (70%) and validation (30%), employing the file gen_set.m. fig. 2.33 illustrates the evolution of the error norm for the validation set: a minimum is found at iteration 28.

[Figure: validation error norm vs. iterations]
FIGURE 2.33 - Evolution of the error norm for the validation set

[Figure: training error norm vs. iterations]
FIGURE 2.34 - Evolution of the error norm for the training set

If we observe the evolution of the error norm for the training set, we can see that the error norm is still decreasing at that point, and will go on decreasing afterwards. The final error norm obtained was 0.024. If we compare this result (in terms of MSE) with the result shown in Table 2.12, we can see that this method produces more conservative results than standard termination criteria. In terms of conditioning, however, the condition number of the input matrix of the last hidden layer has in this case the value of 24, while in the case of standard termination it is 380, and consequently a better conditioned model is obtained.

The second technique, regularization, employs usually the following training criterion:

    φ = ( ||t − y||² + λ ||w||² ) ⁄ 2                          (2.117)

As we can see, the criterion incorporates a term which measures the value of the weights; this results in a trade-off between accuracy within the training set (measured by the first term in (2.117)) and the absolute value of the weights, with as consequence the decrease of the weight norm. We have seen that the weights in a MLP can be divided into linear weights and nonlinear weights, with different roles in the network. The latter perform the role of suitably transforming the input domain in such a way that the nonlinearly transformed data can better linearly reproduce the training data. The range of the nonlinear weights is related with the range of their input data, which is bounded between 0 and 1 (if they connect the first to the second hidden layer), and between −1 and 1 if the advices in terms of initialization (see Section 2.3.7.1) are followed. However, if care is not taken, the norm of the linear weights can increase substantially, resulting in overtraining and bad generalization. As this problem usually only occurs with the linear weights, (2.117) should be replaced by:

    φ = ( ||t − y||² + λ ||u||² ) ⁄ 2                          (2.118)

where u stands for the linear weights (see (2.99)). Eq. (2.118) can be transformed into (see Exercise 15):

    φ = || t̄ − ȳ ||² ⁄ 2                                       (2.119)

where:

    t̄ = [ t ; 0 ]                                              (2.120)
    ȳ = [ y ; √λ u ]                                           (2.121)

and the partitions in the two last equations have dimensions m and k(q−1), respectively.
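The conditionally optimal linear weights under this augmented criterion have the closed form u = (AᵀA + λI)⁻¹Aᵀt, derived below as (2.123). A self-contained sketch for a two-column A (made-up numbers), showing the shrinking weight norm:

```python
def ridge_2col(A, t, lam):
    """u = (A'A + lam*I)^-1 A't for a two-column A."""
    a11 = sum(r[0] * r[0] for r in A) + lam
    a12 = sum(r[0] * r[1] for r in A)
    a22 = sum(r[1] * r[1] for r in A) + lam
    b1 = sum(r[0] * ti for r, ti in zip(A, t))
    b2 = sum(r[1] * ti for r, ti in zip(A, t))
    det = a11 * a22 - a12 * a12
    return ((a22 * b1 - a12 * b2) / det, (a11 * b2 - a12 * b1) / det)

A = [(0.2, 1.0), (0.5, 1.0), (0.9, 1.0)]   # hidden outputs plus bias column
t = [0.1, 0.5, 0.8]
for lam in (0.0, 1e-2, 1.0):
    u = ridge_2col(A, t, lam)
    print(lam, (u[0] ** 2 + u[1] ** 2) ** 0.5)  # ||u|| decreases as lambda grows
```

Increasing λ trades some training-set accuracy for a smaller weight norm, which is exactly the behaviour reported in the tables below.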

If we now express y as [ O(q−1) 1 ] u = A u, (2.119) becomes:

    φ = || [ t ; 0 ] − [ A ; √λ I ] u ||² ⁄ 2 = || t̄ − Ā u ||² ⁄ 2      (2.122)

The global optimum value of u is (see Exercise 15):

    û = ( Aᵀ A + λ I )⁻¹ Aᵀ t = Ā⁺ t̄                           (2.123)

Therefore, the training algorithms introduced before can be used here, with A, t and y replaced by Ā, t̄ and ȳ, and using (2.123) to compute the value of the linear weights. Penalizing the norm of the linear weights is usually translated in a worse fitting on the training data. The function Train_MLPs.m is also used in these examples.

Table 2.14 - pH Example, τ = 10⁻⁶

  λ        e           ||u||    Condition number    Number of iterations
  0        2.3 10⁻⁴    36       1.4 10⁴             61
  10⁻⁶     0.0055      28       485                 63
  10⁻⁴     0.0029      14       161                 199
  10⁻²     0.0045      2        20                  199

Table 2.15 - Coordinate Transformation Example, τ = 10⁻⁴

  λ        e        ||u||    Condition number    Number of iterations
  0        0.39     15.9     46                  147
  10⁻⁶     0.41     15.2     46                  146
  10⁻⁴     0.38     14.2     44                  133
  10⁻²     0.94     5.9      21                  34

Analysing these two tables, it is clear that an increase in λ is translated into a decrease in the norm of the linear parameters and in the condition number of the last hidden layer.

2.3.8 - On-line learning methods

In this section, we discuss learning methods suitable to be used when the training data appears in a pattern-by-pattern basis, that is, when we do not have a batch of training data to employ for network training. If we did, we could still employ the learning methods described afterwards; however, there is no point in using them in this situation, as the methods described before for off-line learning have far better performance than those that will be described in the following lines.

We have seen, in the last sections, that the layers in a MLP have two different roles: the hidden layers perform a nonlinear transformation of the input data, usually into a space of higher dimensionality than the original input, in such a way that this transformed data produces, in the linear output layer, the best linear model of the target data. When addressing on-line learning methods, we can assume that this nonlinear transformation has been discovered before by an off-line learning method such as the ones introduced before, and just adapt the linear weights, or we can start from scratch, and determine the values of all the weights in the network during the on-line learning process. We shall address the first strategy first, and then discuss the second case.

2.3.8.1 - Adapting just the linear weights

Earlier in this chapter we have introduced Widrow's Adalines and justified their learning rule, the LMS algorithm, in the context of its batch variant, the steepest descent method. At that time we mentioned that the LMS algorithm was an instantaneous version of the steepest descent technique, and we continue this line of thought here. The mean-square error criterion is defined as:

    ζ = E(e²) ⁄ 2                                              (2.124)

where E(.) is the expectation operator, and e = t − y is the error. The instantaneous estimate of the MSE, at time k, is:

    ζ[k] = e²[k] ⁄ 2                                           (2.125)

where e[k] is the error measured at time k. The instantaneous estimate of the gradient vector, at time k, is:

    g[k] = −e[k] a[k]                                          (2.126)

where a represents the output of the last hidden layer, plus a value of one, for the output bias. This is an unbiased estimate of the gradient vector of (2.124). Instantaneous gradient descent algorithms of the first order (we shall not address higher-order on-line methods here) update the weight vector proportionally to the negative of the instantaneous gradient:

    u[k] = u[k−1] + δ e[k] a[k]                                (2.127)

This is the LMS rule, which updates the weights in the direction of the transformed input, multiplied by a scalar.
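A minimal LMS sketch (Python; the linear model and the data are made up for illustration):

```python
def lms_step(u, a, target, delta):
    """One LMS update (2.127): u <- u + delta * e * a."""
    e = target - sum(ui * ai for ui, ai in zip(u, a))  # instantaneous error
    return [ui + delta * e * ai for ui, ai in zip(u, a)], e

# learn t = 2x + 1 from transformed patterns a = (x, 1)
u = [0.0, 0.0]
for k in range(500):
    x = (k % 3) * 0.5                 # cycle through x = 0, 0.5, 1
    u, e = lms_step(u, [x, 1.0], 2 * x + 1, delta=0.5)
print(u)  # approaches (2, 1)
```

With consistent data and a sufficiently small step, the weights converge to the exact linear model; the stability limit on the step size is quantified below.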

Notice that, at time k, for the network to reproduce the kth pattern exactly, it must satisfy:

    a[k] u = t[k]                                              (2.128)

An infinite number of weight vectors satisfy this single equation. Let us consider the x² problem of Ex. 2.6, with Δx = 0.1. We have 2 weights, the first one related with x, and the second one related with the bias (1). At time 1, the network is supplied with x = −0.5, with the target 0.25. Therefore a[1] = [−0.5 1] and t[1] = 0.25. The following figure shows the instantaneous MSE estimate (2.125) as a function of the weights, together with the line which corresponds to the solution of (2.128).

[Figure: quadratic valley over the weight plane, with the line of exact solutions]
FIGURE 2.35 - Instantaneous performance surface of the x² problem, with information supplied by x = −0.5

If another pattern is input to the network, this time x = +0.5, and if this pattern is also stored exactly, we would have the following instantaneous performance surfaces (the one corresponding to the 2nd pattern has been shifted by 10):

[Figure: two quadratic valleys over the weight plane]
FIGURE 2.36 - Instantaneous performance surfaces of the x² problem, with information supplied by x = −0.5 and x = +0.5

The two lines intersect at the point (w₁ = 0, w₂ = 0.25), which, if we only consider these two patterns, is the global minimum of the true MSE function (2.124). This illustrates what happens in general: when the information is rich enough (sufficiently exciting), the global minima of successive instantaneous performance surfaces intersect at the global minimum of the true MSE performance function (2.124).

Consider now the analytical derivation of the least-squares solution conducted before, specially the left-hand side of the normal equations (2.62). In that context, the product AᵀA is denominated the autocorrelation matrix (to be precise, it should be divided by the training size; we shall, however, call it the autocorrelation matrix from now on), and the normal equations are called the Wiener-Hopf equations. Here, consequently, the product aᵀ[k] a[k] is denominated the instantaneous autocorrelation matrix, represented as R[k]. This is the product of a column vector by a row vector: if the length of the vector is p, its size is (p×p). This matrix has only one eigenvalue different from zero, ||a[k]||², and the corresponding eigenvector is a[k] (see Exercise 27).

Remembering now the stability analysis of the steepest descent algorithm, the condition that was obtained was that the learning rate should satisfy (2.53).
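A quick numerical check of this rank-one property (Python; the pattern value is taken from the x² example above):

```python
# R[k] = a[k]' a[k] is rank one: R a = ||a||^2 a, and det(R) = 0
a = [-0.5, 1.0]
R = [[ai * aj for aj in a] for ai in a]                      # outer product
Ra = [sum(R[i][j] * a[j] for j in range(2)) for i in range(2)]
lam = sum(x * x for x in a)                                  # ||a||^2 = 1.25
print(Ra, [lam * x for x in a])                              # identical vectors
```

Since R[k] has a single non-zero eigenvalue, the stability of the instantaneous update is governed entirely by ||a[k]||², as used next.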

η < 2 / ||a[k]||^2   (2.129)

Let us consider now that, at time k, the network is presented with pattern I[k], resulting in a transformed pattern a[k], and let us evaluate what happens to the output error if that pattern would be presented again to the network (the a posteriori output error, e'[k]). Depending on the learning rate, we would have (see Exercise 28):

|e'[k]| > |e[k]|, if η ∉ [0, 2/||a[k]||^2]
|e'[k]| = |e[k]|, if η = 0 or η = 2/||a[k]||^2
|e'[k]| < |e[k]|, if η ∈ ]0, 2/||a[k]||^2[
e'[k] = 0, if η = 1/||a[k]||^2
(2.130)

Analysing (2.130), we can see that stable learning, for a single example, is achieved if 0 ≤ η ≤ 2/||a[k]||^2, and that the a posteriori error vanishes if the learning rate is the reciprocal of the squared 2-norm of the transformed input vector. It is therefore interesting to introduce a new LMS rule, normalized by this quantity (coined for this reason NLMS rule in [124]), which was first proposed by Kaczmarz, in 1937, and which is widely known in adaptive control. The NLMS rule is therefore:

u[k] = u[k-1] + η e[k] a[k] / ||a[k]||^2   (2.131)

Let us see the application of these two rules (LMS and NLMS) to a common problem, the pH example. Primarily, in order to get a well conditioned model, we trained a MLP with the used topology ([4 4 1]) with the Levenberg-Marquardt algorithm, minimizing the new training criterion and using a regularization factor of 10^-2. The values of the nonlinear and linear weights are stored in initial_on_pH.mat. Then we employed the Matlab function MLPs_on_line_lin.m (it also employs TL2.m and TL3.m). The nonlinear weights were fixed, and the linear weights were initialized to zero. The NLMS rule was employed, with different learning rates, for 10 passes over the complete training data. The evolution of the MSE is represented in fig. 2.37 (the left graph for the first pass through the training set, the first 101 iterations, and the right graph for the other iterations). With η = 1, after 10 passes through all the training set (1010 iterations), the minimum value of the MSE is 3 10^-4, a bit far from the minimum obtained off-line, 3.4 10^-6. For comparison, the values obtained with η = 2 are also represented; with this last value, although the learning process is completely oscillatory, the error is not increasing in the long run. The opposite happens when we slightly increase the learning rate: the next figure illustrates the evolution of the MSE obtained using a learning rate of 2.001, just outside the stable interval predicted by (2.130). It is easy to confirm the validity of (2.130): with values of η between 0 and 2, stable learning was achieved.
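The behaviour stated in (2.130) and the NLMS update (2.131) can be illustrated with a minimal sketch (plain Python, with invented variable names, rather than the Matlab functions used in this text): with η = 1, one NLMS step makes the a posteriori error vanish, while an LMS step with a learning rate outside the stable interval increases the error.

```python
def lms_update(u, a, t, eta):
    # LMS: u <- u + eta * e[k] * a[k], with e[k] = t[k] - a[k]^T u
    e = t - sum(ai * ui for ai, ui in zip(a, u))
    return [ui + eta * e * ai for ui, ai in zip(u, a)]

def nlms_update(u, a, t, eta):
    # NLMS (2.131): u <- u + eta * e[k] * a[k] / ||a[k]||^2
    e = t - sum(ai * ui for ai, ui in zip(a, u))
    n2 = sum(ai * ai for ai in a)
    return [ui + eta * e * ai / n2 for ui, ai in zip(u, a)]

def output_error(u, a, t):
    return t - sum(ai * ui for ai, ui in zip(a, u))

# one pattern of the x^2 problem: x = -0.5, target 0.25, a = [x, 1]
a, t = [-0.5, 1.0], 0.25
u = [0.0, 0.0]
u_nlms = nlms_update(u, a, t, eta=1.0)
print(output_error(u_nlms, a, t))  # 0.0: the pattern is stored exactly
```

With η = 1 the NLMS step stores the single pattern exactly, as predicted by the last branch of (2.130); repeating `lms_update` with an η larger than 2/||a||^2 makes the a posteriori error grow instead.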

FIGURE 2.37 - Evolution of the MSE - NLMS (stable on-line learning)

FIGURE 2.38 - Evolution of the MSE (η = 2.001) - NLMS (unstable on-line learning)

As it can be seen, when stable learning is obtained, the squared errors for each presentation are continuously reduced, pass after pass. The instantaneous values of the MSE, for the case with η = 1, are presented in fig. 2.39.

FIGURE 2.39 - Evolution of the instantaneous MSE - NLMS, for η = 1 (a: first 2 passes; b: last 8 passes)

Let us now see what happens if the standard LMS rule is applied. The following figures show the evolution of the MSE, for stable and unstable learning rates.

FIGURE 2.40 - Evolution of the MSE - LMS (stable on-line learning)

FIGURE 2.41 - Evolution of the MSE - LMS (unstable on-line learning)

Analysing the last two figures, we can see that stable learning is obtained with learning rates up to 0.5; if the LMS rule is applied with a slightly larger value (0.57 was used in fig. 2.41), the on-line learning process is unstable, with the error evolving in a cyclic fashion.

Parameter convergence

It is proved in section 5.3 of [124] (see also Exercise 29) that, if the LMS rule is applied to a set of consistent data (there is no modelling error nor measurement noise), the norm of the weight error always decreases, and the weights converge to the optimal value, if the following condition is satisfied:

η (η a^T a - 2) < 0, η > 0   (2.132)

This indicates that, to have the weight error norm always decreasing, the following condition should be satisfied:

η < 2 / max(a^T a)   (2.133)

Notice that this is a more restrictive condition than the one for the overall stability of the learning process, as we do not allow any increase in the weight error norm from iteration to iteration. For the case treated here, this corresponds to η < 0.56, and we can see, from the figures above, that this is a conservative condition.
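The bound (2.133) can be checked numerically. The sketch below (a small set of assumed consistent patterns, not the pH data of the experiments) verifies that, for a learning rate below 2/max(a^T a), the weight error norm never increases from iteration to iteration.

```python
def lms_step(u, a, t, eta):
    # one LMS update for pattern (a, t)
    e = t - sum(x * w for x, w in zip(a, u))
    return [w + eta * e * x for w, x in zip(u, a)]

def err_norm2(u, w_opt):
    # squared weight error norm ||w_opt - u||^2
    return sum((wo - w) ** 2 for wo, w in zip(w_opt, u))

# consistent data generated by a known optimal weight vector (an assumption)
w_opt = [0.5, 0.25]
patterns = [[-0.5, 1.0], [0.5, 1.0], [1.0, 1.0]]
targets = [sum(x * w for x, w in zip(a, w_opt)) for a in patterns]

# (2.133): eta below 2 / max(a^T a) guarantees a non-increasing error norm
bound = 2.0 / max(sum(x * x for x in a) for a in patterns)
u = [0.0, 0.0]
for a, t in zip(patterns, targets):
    v_before = err_norm2(u, w_opt)
    u = lms_step(u, a, t, 0.9 * bound)
    assert err_norm2(u, w_opt) <= v_before  # never increases below the bound
```

Raising the learning rate above the bound lets individual steps increase the weight error norm, even though the overall process may still be stable, which is exactly the distinction drawn in the text.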

The optimum value of the linear weights is, as we know, given by:

û = A⁺ t   (2.134)

It is proved in section 5.3 of [124] that, when A is rank-deficient, and when there is no modelling error or measurement noise, the application of the NLMS rule (with η > 0 and η < 2), if the weights are initialized with zero, forces the weights to converge to a value which exactly reproduces the training set and has minimum Euclidean norm. If we compute the weight error norm evolution, for the NLMS rule with η = 1, we obtain fig. 2.42.

FIGURE 2.42 - Squared weight error norm evolution (NLMS, η = 1)

Comparing fig. 2.42 and fig. 2.37, we can see that, while we observe a constant decrease in the weight error norm, from iteration to iteration, the same does not happen to the MSE evolution. The NLMS (and the LMS) rules employ instantaneous values of the gradient to update the weights; this introduces noise in the weight updates, and that can cause the MSE to increase.

Modelling error and output dead-zones

When there is modelling error, the instantaneous output error can be decomposed into:

e[k] = a[k] (ŵ - w[k]) + (t[k] - a[k] ŵ) = e^o[k] + e^n[k]   (2.135)

where the first term is due to the difference between the current and the optimal network, and the second to the optimal modelling error. If the original output error is used in the LMS training rules, the network is unable to determine when the weight vector is optimal, as the second term causes the gradient estimate to be non-zero. This will have as consequence a bad generalization. To solve this problem, output dead-zones are employed: if the absolute value of the error is within the dead-zone, ς, the error is considered to be zero. More formally, the error in the LMS training rules is changed to:

e^d[k] = 0, if |e[k]| ≤ ς
e^d[k] = e[k] + ς, if e[k] < -ς
e^d[k] = e[k] - ς, if e[k] > ς
(2.136)

The main problem is to define the dead-zone parameter, ς. The magnitude of the weight error is related with the magnitude of the output error as:

||e_w|| / ||w|| ≤ C(R) ||e_y|| / ||t||   (2.137)

where ||.|| is any natural norm and C(R) is the condition number of R, computed as ||R|| ||R^-1||. This means that, if the model is ill-conditioned, even with a small relative output error, we can have a large relative weight error. Usually it is used:

ς = E[ |e^n[k]| ]   (2.138)

In order to evaluate the applicability of this technique, please see Exercise 30.
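The dead-zone modification (2.136) is simple to state in code; a sketch (Python, with a hypothetical function name):

```python
def dead_zone_error(e, zeta):
    # (2.136): errors inside the dead-zone are treated as zero;
    # outside, the error is shrunk towards zero by zeta
    if abs(e) <= zeta:
        return 0.0
    return e - zeta if e > zeta else e + zeta

print(dead_zone_error(0.05, 0.1))   # 0.0: inside the dead-zone
print(dead_zone_error(0.25, 0.1))   # shrunk towards zero (about 0.15)
print(dead_zone_error(-0.25, 0.1))  # symmetric for negative errors
```

Feeding `dead_zone_error(e, zeta)` instead of the raw error into the LMS or NLMS update stops the weights from wandering once the residual error is of the size of the modelling error.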

Rate of convergence

The condition of the autocorrelation matrix also dictates the rate of convergence. By defining:

V[k] = ||e_w[k]||^2 = (ŵ - w[k])^T (ŵ - w[k])   (2.139)

η' = η (2 - η)   (2.140)

it can be proved (see Exercise 29) that, for the NLMS rule:

V[L] = V[0] - η' Σ_{k=1..L} e[k]^2 / ||a[k]||^2   (2.141)

The size of the parameter error, after one iteration, is:

V[1] = V[0] - η' e[1]^2 / ||a[1]||^2   (2.142)

As e[k] can be given as:

e[k] = a[k]^T ŵ - a[k]^T w[k-1] = a[k]^T e_w[k-1],   (2.143)

(2.142) can be given as:

V[1] = V[0] - η' (a[1]^T e_w[0])^2 / ||a[1]||^2 = V[0] (1 - η' (cos θ(1))^2)   (2.144)

where θ(1) is the angle between a[1] and e_w[0]. After L iterations, we shall have:

V[L] = V[0] Π_{k=1..L} (1 - η' (cos θ(k))^2)   (2.145)

Therefore, if the weight update (which has the same direction as the transformed input vector) is nearly perpendicular to the weight error vector, convergence is very slow. This occurs when the transformed input vectors are highly correlated (nearly parallel); if, on the other hand, the transformed input vectors are nearly orthogonal, convergence is very fast (see Exercise 31). Again, the conditioning of the autocorrelation matrix appears as an important parameter for the learning procedure.

The effects of instantaneous estimates

As we have already referred, the instantaneous gradient estimates, although they are an unbiased estimate of the true gradient vector, introduce errors, or noise. If output dead-zones are not employed, and there is model mismatch, LMS-type rules converge not to a unique value, but to a bounded area known as the minimal capture zone, whose size and shape depend on the condition of the autocorrelation matrix. The weights never settle once they enter this zone (see Exercise 32).

Adapting all the weights

We have described, in Alg. 2.4, the way that the gradient vector is computed by the BP algorithm and, in Alg. 2.5, a way to compute the true Jacobean matrix and gradient vector. Both of them can be slightly changed so that, instead of being applied to a set of m presentations, they are applied to just 1 presentation. This way, an instantaneous (noisy) estimate of the gradient vector is easily obtained in every iteration, and the weights can be updated in every iteration, as in the LMS rule. The resulting scheme will be denoted as the BP algorithm, or delta rule, as coined by their rediscoverers [31].

The next figure presents the results obtained by this algorithm, applied to the same problem, varying the learning rate. The weights were initialized to zero, and the Matlab function employed is MLPs_on_line_all.m (it also uses ThreeLay.m and TwoLayer.m).

FIGURE 2.43 - Application of the delta rule

Analysing fig. 2.43, it is clear that the learning process converges only with values of the learning rate smaller than 0.01. An insight into this value can be obtained by performing an eigenvalue analysis of the autocorrelation matrix at the optimum. Considering that the optimum was obtained as the result of applying the Levenberg-Marquardt method, with the reformulated criterion, and computing the maximum eigenvalue of the autocorrelation matrix at that point, we find that its value is 214. According to (2.62), learning converges if:

η < 2 / λ_max = 2/214 ≈ 0.0093 ≈ 0.01   (2.146)

It should be noticed that this is just an indication, as (2.62) was derived for the batch update.

A technique which is often used (and abused) to accelerate convergence is the inclusion of a momentum term. The delta update of the weight vector is replaced by:

∆w[k] = -η g[k] + α ∆w[k-1]   (2.147)

The reasoning behind the inclusion of this term is to increase the update in flat areas of the performance surface (where g[k] is almost 0), and to stabilize the search in narrow valleys, making it possible (theoretically) to escape from local minima. Typically, the momentum parameter, α, should be set between 0.5 and 0.9. The next figure shows the effects of the inclusion of a momentum term, for the case where η = 0.001.

FIGURE 2.44 - Application of the delta rule, with a momentum term

It is clear that the inclusion of a momentum term (α > 0) increases the convergence rate. The learning rate could also be changed from iteration to iteration, if we performed an eigenanalysis of the autocorrelation matrix in every iteration; that, however, would be too costly to perform. On the other hand, it is not necessary that all the weights have the same learning rate, and the most suitable rate for each weight is again related with the eigenvalue associated with that particular direction. A simple algorithm which enables an adaptive step size, individually assigned to each weight, is the delta-bar-delta rule [83]. According to this rule, the update of the step size of weight i is given by:

∆η_i[k] = k, if s_i[k-1] g_i[k] > 0
∆η_i[k] = -β η_i[k-1], if s_i[k-1] g_i[k] < 0
∆η_i[k] = 0, otherwise
(2.148)

where s_i measures the average of the previous gradients. When s_i has the same signal as the current gradient estimate, the step size is increased by a constant (k); when there is a change in the sign of the gradient, the step size is decreased geometrically (-β η_i[k-1]).
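The momentum update (2.147) and the delta-bar-delta step-size adaptation (2.148)-(2.149) can both be sketched in a few lines (Python, with hypothetical parameter names; the experiments in this text use Matlab):

```python
def momentum_step(delta_prev, grad, eta, alpha):
    # (2.147): delta_w[k] = -eta * g[k] + alpha * delta_w[k-1]
    return [-eta * g + alpha * d for g, d in zip(grad, delta_prev)]

def delta_bar_delta(eta_i, s_prev, g, k_inc, beta, gamma):
    # (2.148): increase the step size additively while the averaged gradient
    # s_i agrees in sign with the current gradient; decrease it geometrically
    # on a sign change; (2.149): update the exponential gradient average
    if s_prev * g > 0:
        eta_new = eta_i + k_inc
    elif s_prev * g < 0:
        eta_new = eta_i - beta * eta_i
    else:
        eta_new = eta_i
    s_new = (1 - gamma) * g + gamma * s_prev
    return eta_new, s_new

# hypothetical values: one weight, one gradient estimate agreeing with s_i
eta_i, s_i = 0.01, 1.0
eta_i, s_i = delta_bar_delta(eta_i, s_i, 2.0, k_inc=0.001, beta=0.5, gamma=0.9)
step = momentum_step([0.0], [2.0], eta_i, 0.9)
```

Note that each weight carries its own `eta_i` and `s_i`, which is precisely what makes the rule sensitive to the eigenvalue of its particular direction, and also what creates the tuning burden mentioned below.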

The measure of the average of the gradients, s_i, is an exponentially weighted sum of the current and past gradients:

s_i[k] = (1 - γ) g_i[k] + γ s_i[k-1]   (2.149)

The delta-bar-delta rule can be applied to the standard delta rule, or to the delta rule with momentum; in the latter case, the direction of the update is no longer the negative of the instantaneous gradient. fig. 2.45 illustrates the comparison of this rule with the momentum rule, for the best case found in fig. 2.44 (η = 0.001, α = 0.9).

FIGURE 2.45 - Application of the delta-bar-delta rule

Slightly better results were obtained and, with more experimentation with the parameters, better results could possibly be achieved, as each weight is updated with its own learning rate. This is also one of the weak points of this rule: the need to fine-tune the algorithm. It should also be referred that, adapting all the weights and employing the delta-bar-delta rule, we have obtained, in 10 epochs (1010 iterations) of on-line learning, a final value of the MSE of 4.4 10^-6, against the value of 3.4 10^-6 obtained with off-line learning, and the final value of 3 10^-4 obtained with the NLMS rule, by adapting only the linear weights. It should be noted that there are several other on-line learning algorithms for MLPs; only the most standard have been explained here. As a summary, it should be clear that our initial statement, that there is no point in employing on-line learning techniques if a set of training data is available, must be weighted against these results: carefully tuned on-line algorithms can approach the quality of off-line learning.

2.2 Radial Basis Functions

Radial Basis Function networks (RBFs) were first used for high-dimensional interpolation by the functional approximation community, and their excellent numerical properties were extensively investigated by Powell [111]. They were introduced in the framework of ANNs by Broomhead and Lowe [40], where they were used as functional approximators and for data modelling. Since then, due to the faster learning associated with these neural networks (relatively to MLPs), they were more and more used in different applications. RBFs are also universal approximators [131] and, in contrast with MLPs, they possess the best approximation property [132]: for a given topology, there is 1 set of unknown coefficients that approximates the data better than any other set of coefficients.

RBFs are a three-layer network: the first layer is composed of buffers; the hidden layer consists of a certain number of neurons; and the output layer is just a linear combiner. Let

r = ||c - x||_2   (2.150)

denote the distance between a centre c and an input x. The 'weight' associated with each link between the input layer and the ith hidden neuron can then be viewed as the squared distance between the centre associated with that neuron and the actual input value:

W_{i,k} = ||c_i - x_k||_2^2   (2.151)

where c_i is the centre of the ith hidden layer node, and ||.||_2 is the Euclidean norm. The modelling capabilities of this network are determined by the shape of the radial function, the number and placement of the centres and, in some cases, the width of the function. Several different choices of functions are available:

considering its centre at [0.153) The following figure illustrates the activation function values of one neuron.f( r )= r f( r )= r 3 r – -------2 2σ 2 radial linear function radial cubic function Gaussian function thin plate spline function multi-quadratic function inverse multi-quadratic function 2 f( r )= e f( r )= f ( r ) = r log ( r ) r +σ 1 f ( r ) = -------------------2 2 r +σ 2 2 2 2 (2. and σ i = 1 . with centre Ci. the most widely used is the Gaussian function. We assume that there are 2 inputs. as the multivariate Gaussian function can be written as a product of univariate Gaussian functions: ci – x 2 – ------------------2 2 σi 2 fi ( x ) = e = ∏e j=1 n ( c ij – x i ) – --------------------2 2 σi 2 (2. both in the range]-2.0].=[0.2[. and a unitary variance The output of an RBF can be expressed as: 124 Artificial Neural Networks .0].152) f ( r ) = log ( r + σ ) shifted logarithm function Among those.46 - Activation values of the ith neuron. This is a unique function. FIGURE 2..

The output of an RBF can be expressed as:

y = Σ_{i=1..p} w_i f_i(||c_i - x||_2)   (2.154)

where w_i is the weight associated with the ith hidden layer node.

2.2.1 Training schemes

As the output layer acts as a linear combiner, the output weights can be determined in just one iteration, as we have seen before. The learning problem is therefore associated with the determination of the centres and spreads of the hidden neurons. There are four major classes of methods for RBF training.

2.2.1.1 Fixed centres selected at random

The first practical applications of RBFs assumed that there were as many basis functions as data points, and the centres of the RBFs were chosen to be the input vectors at those data points. This is an interpolation scheme, where the goal is that the function passes exactly through all of the data points. However, due to the problems associated with this scheme (increased complexity with increasing data size, and increasing conditioning problems), RBFs were soon applied as approximators, with the number of basis functions strictly less than the number of data points. Now the centres do not coincide any more with the data points, and one way of determining them is by choosing them randomly, within the range of the training set. In this case, the Gaussian standard deviation is usually taken as:

σ = d_max / sqrt(2 m_1)   (2.155)

where d_max is the maximum distance between centres and m_1 is the number of centres. With respect to the value of the linear parameters, the output weights can be obtained by a pseudo-inverse of the output matrix (G) of the basis functions:

ŵ = G⁺ t   (2.156)
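A sketch of the spread heuristic (2.155), with assumed centres and a hypothetical function name:

```python
import math

def spread_fixed_centres(centres):
    # (2.155): sigma = d_max / sqrt(2 * m1), where d_max is the maximum
    # distance between centres and m1 is the number of centres
    m1 = len(centres)
    d_max = max(
        math.dist(ci, cj)
        for i, ci in enumerate(centres)
        for cj in centres[i + 1:]
    )
    return d_max / math.sqrt(2 * m1)

centres = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [3.0, 4.0]]
print(spread_fixed_centres(centres))  # d_max = 5.0, so 5 / sqrt(8)
```

With the spread fixed this way, only the linear weights remain unknown, and they follow from the pseudo-inverse solution (2.156).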

2.2.1.2 Self-organized selection of centres

The main problem with the last scheme is that it may require a large training set for a satisfactory level of performance. An alternative is to use a hybrid learning process [112], consisting of:

• a self-organized learning step, to determine the locations of the centres of the radial basis functions;
• a supervised learning step, to determine the linear output weights.

For the first step, we need an algorithm that groups the input data in a homogeneous way. One algorithm for this task is the k-means clustering algorithm [112], which places the centres in regions where a significant number of examples is presented. Let m denote the number of patterns in the training set, and n the number of radial basis functions. The k-means clustering algorithm is presented below:

Initialization - Choose random values for the centres c_i, i = 1, ..., n; they must be all different.
While go_on
1. Sampling - Find a sample vector x(j) from the input matrix.
2. Similarity matching - Find the centre closest to x(j). Let its index be k(x):
   k(x) = arg min_i ||x(j) - c_i||_2, i = 1, ..., n   (2.157)
3. Updating - Adjust the centres of the radial basis functions according to:
   c_i[j+1] = c_i[j] + η (x(j) - c_i[j]), if i = k(x)
   c_i[j+1] = c_i[j], otherwise
   (2.158)
   where 0 < η < 1.
4. j = j + 1
end

Algorithm 2.10 - k-means clustering

The loop in the last algorithm continues until there is no change in the centres. One problem related with Alg. 2.10 is that the optimum found depends on the initial values of the centres, and consequently it is not a reliable algorithm. To solve this problem, several other algorithms have been proposed; one of them is the k-means adaptive clustering algorithm [117].
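Algorithm 2.10 can be sketched as follows (Python rather than the Matlab functions used in the examples; the toy data are assumed, and the initialization is made deterministic here for reproducibility, whereas the algorithm proper chooses distinct random centres):

```python
def kmeans_online(data, n_centres, eta=0.1, sweeps=50):
    # Sketch of Algorithm 2.10 (sequential k-means clustering).
    centres = [list(p) for p in data[:n_centres]]  # deterministic start
    for _ in range(sweeps):
        for x in data:                              # 1. sampling
            k = min(range(n_centres),               # 2. similarity matching (2.157)
                    key=lambda i: sum((xi - ci) ** 2
                                      for xi, ci in zip(x, centres[i])))
            centres[k] = [ci + eta * (xi - ci)      # 3. updating (2.158)
                          for ci, xi in zip(centres[k], x)]
    return centres

# two well-separated toy clusters (assumed data for this sketch)
data = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
        [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]]
centres = kmeans_online(data, 2)
```

On this data one centre settles near each cluster; starting from different (random) centres can lead to different final positions, which is exactly the reliability problem noted above.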

With respect to the computation of the spreads, apart from the one expressed in (2.155), there are other heuristics that can be employed:

1. Maximum distance between patterns: assume that m is the total number of patterns. Then the spread, which is the same for all the centres, is given by:
σ^2 = max_{i,j = 1..m} ||x_i - x_j||^2 / 2   (2.159)

2. The empirical standard deviation: if n denotes the number of patterns assigned to cluster i, we can use:
σ_i^2 = Σ_{j=1..n} ||C_i - x_j||^2 / n   (2.160)

3. Nearest neighbour: assume that centre k is the nearest neighbour of centre i. Then the spread is:
σ_i = Q ||C_k - C_i||_2   (2.161)
where Q is an application-defined parameter.

4. The k-nearest neighbours heuristic: considering the k (a user-defined percentage of the total number of centres) centres nearest to the centre C_i, the spread associated with this centre is:
σ_i = (1/k) Σ_{j=1..k} ||C_i - C_j||_2   (2.162)
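Heuristic 4, for instance, can be sketched as follows (assumed centres; for simplicity, k is given directly as a count rather than as a percentage of the number of centres):

```python
import math

def knn_spread(centres, i, k):
    # (2.162): mean distance from centre i to its k nearest centres
    dists = sorted(math.dist(centres[i], c)
                   for j, c in enumerate(centres) if j != i)
    return sum(dists[:k]) / k

centres = [[0.0], [1.0], [2.0], [10.0]]
print(knn_spread(centres, 1, 2))  # neighbours at distance 1 and 1 -> 1.0
```

Note how the distant centre at 10 does not influence the spread of centre 1, which keeps the basis functions local; heuristic 1, by contrast, would widen every spread because of that outlier.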

2.2.1.3 Supervised selection of centres and spreads

In this class of methods, not only the linear weights, but also the centres and the spreads of the basis functions, are computed using error-correction algorithms. The weight vector, w, the centre matrix, C, and the spread vector, σ, are all parameters to be determined by one of the training methods introduced previously for MLPs. If the standard training criterion is employed, to compute the Jacobean we need to have available the derivatives with respect to the centres, spreads and linear parameters, i.e., ∂y/∂C_{j,i}, ∂y/∂σ_i and ∂y/∂w_i. It is easy to determine (please see Exercise 4) that, for Gaussian functions, these derivatives are:

∂y/∂C_{j,i} = -w_i f_i (C_{j,i} - x_j) / σ_i^2   (2.163)

∂y/∂σ_i = w_i f_i Σ_{k=1..n} (C_{k,i} - x_k)^2 / σ_i^3   (2.164)

∂y/∂w_i = f_i   (2.165)

Then, any method (steepest descent, quasi-Newton, Levenberg-Marquardt) can be employed. As the output layer is a linear layer, the reformulated criterion can also be employed here; in that case, we use the Jacobean composed of (2.163) and (2.164), with w_i replaced by ŵ_i (see [133]). In terms of termination criteria, the ones introduced before also apply here.

2.2.1.4 Regularization

A technique also used for training RBFs comes from regularization theory. This method was proposed by Tikhonov, in 1963, in the context of approximation theory, and it was introduced to the neural network community by Poggio and Girosi [118]. Recall the regularization method and criterion introduced before. If we replace the second term in the right-hand-side of that criterion with a linear differential operator, related with the basis functions used, the optimal value of the linear weights is defined as (for the theory related with this subject, please see [125]):

ŵ = (G^T G + λ G_0)^{-1} G^T t   (2.166)

where G_0 is defined as:

G_0 = | G(C_1, C_1) ... G(C_1, C_p) |
      | ...         ...  ...        |
      | G(C_p, C_1) ... G(C_p, C_p) |
(2.167)

Notice that (2.166) is obviously similar to (2.123), with the difference that the identity matrix is replaced by G_0. The resulting network is coined the Generalized Radial Basis Function network [125].
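The supervised-selection derivatives above are easy to check against finite differences; a sketch for (2.163), for a single Gaussian unit with assumed values:

```python
import math

def rbf_output(w, C, sigma, x):
    # y = sum_i w_i exp(-||c_i - x||^2 / (2 sigma_i^2))
    y = 0.0
    for wi, ci, si in zip(w, C, sigma):
        r2 = sum((cj - xj) ** 2 for cj, xj in zip(ci, x))
        y += wi * math.exp(-r2 / (2 * si ** 2))
    return y

def dy_dc(w, C, sigma, x, i, j):
    # (2.163): dy/dC_{j,i} = -w_i f_i (C_{j,i} - x_j) / sigma_i^2
    r2 = sum((cj, xj) is None or (cj - xj) ** 2 for cj, xj in zip(C[i], x))
    r2 = sum((cj - xj) ** 2 for cj, xj in zip(C[i], x))
    fi = math.exp(-r2 / (2 * sigma[i] ** 2))
    return -w[i] * fi * (C[i][j] - x[j]) / sigma[i] ** 2

# finite-difference check of (2.163) on assumed values
w, C, sigma, x = [0.7], [[0.2, -0.3]], [0.9], [0.5, 0.1]
h = 1e-6
C_plus = [[0.2 + h, -0.3]]
numeric = (rbf_output(w, C_plus, sigma, x) - rbf_output(w, C, sigma, x)) / h
print(abs(numeric - dy_dc(w, C, sigma, x, 0, 0)) < 1e-5)  # True
```

The same finite-difference test applies to (2.164) and (2.165); running it once before handing the Jacobean to a Levenberg-Marquardt routine catches most sign and indexing mistakes.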

2.2.2 On-line learning algorithms

There is a variety of on-line learning algorithms proposed for RBFs; some will be briefly described here. One approach is to employ the optimal k-means adaptive clustering algorithm [117] to update the centres and spreads on-line, and to use a version of a recursive least-squares algorithm [69] to compute the linear weights. Yee and Haykin [127] followed this approach, determining the centres and spreads off-line and adapting on-line the linear weights, by solving equation (2.166) recursively. An example of a constructive method which can be applied on-line is the M_RAN algorithm [126], which adds or subtracts units from the network, based on its performance on the current sample, and also in a sliding window of previous samples; if no unit is added or subtracted, all the free parameters in the network (weights, centres and spreads) are updated using an Extended Kalman filter algorithm. Other approaches involve a strategy such as the one indicated in Section 2.2.1.2, where the linear output parameters are in every case determined by a pseudo-inverse, and the application of the Levenberg-Marquardt method, with the standard or the reformulated criterion, to a sliding moving window (see [134]).

2.2.3 Examples

Example 2.16 - pH example

Again, Ex. 2.7 will be employed, to assess the modelling capabilities of RBFs, and the performance of different training algorithms. The function rbf_fc.m (it employs also rbf.m) implements a training method based on fixed centres, selected at random. It should be noted that, as the centres are obtained randomly, the results obtained should be interpreted as such; the same note is valid, to a smaller extent, for the results obtained with the techniques introduced afterwards. The results obtained, as the number of basis functions is increased, are summarized in Table 2.16. As a problem associated with RBFs is the condition number of the basis function output matrix, these values are also shown.

Table 2.16 - Results of RBF learning (fixed centres)

Neurons   Error norm   Condition number
10        0.05         7.5 10^7
20        0.006        5 10^16
30        0.002        1 10^17
40        2 10^-4      3.6 10^17

Implementing the k-means clustering algorithm, in the Matlab function rbf_clust.m (the following functions are also employed: clust.m, diffrows.m and lexsort.m), the following results have been obtained:

Table 2.17 - Results of RBF learning (self-organized centres), k-means clustering (alfa=0.2)

Neurons   Error norm   Condition number
10        0.05         1 10^10
20        0.03         2 10^16
30        0.002        1.4 10^17
40        2.4 10^-4    2.8 10^17

The final distribution of the centres obtained is shown in fig. 2.47.

FIGURE 2.47 - Input patterns and final selection of centres (n=10)

Finally, the adaptive clustering algorithm was experimented, with the Matlab function rbf_ad_clust.m (it also employs optkmeansr.m and initoakmr.m), and its results are expressed in the next table.

Table 2.18 - Results of RBF learning (self-organized centres), adaptive clustering

Neurons   Error norm   Condition number
10        0.05         5.2 10^6
20        0.002        6.8 10^16
30        3.2 10^-4    1 10^17
40        3 10^-5      1.7 10^17

Comparing both tables with Table 2.16, better results, both in terms of error norm (but not in terms of condition number), were obtained by determining the position of the centres using the clustering algorithms.

Supervised methods are now evaluated. First, the standard criterion will be employed. The following results were obtained with Train_RBFs.m (it also employs RBF_der.m and Jac_mat.m), for a RBF with 10 neurons. The initial values of the centres were the ones obtained by rbf_fc.m, and the initial values for the spreads were all equal to the one obtained using this function. The linear weights were initialised with random values. These data can be found in winiph.mat.

The following figure illustrates the results of a steepest descent method. The method does not converge, within a reasonable number of iterations. The first 100 iterations are shown.

FIGURE 2.48 - Steepest descent results (standard criterion)

The Levenberg-Marquardt method is employed afterwards. The termination criteria (2.114) to (2.116), with τ = 10^-6, were employed. It converges after 119 iterations. The following figure is obtained:

FIGURE 2.49 - Levenberg-Marquardt results (standard criterion)

Applying now the gradient techniques employed before, but this time with the reformulated criterion, much better results are obtained. The initial nonlinear parameters are the same as in the previous examples, and can be found in winiph_opt.mat. In fig. 2.50, we can find the results of the steepest descent method, for the first 100 iterations. Convergence was obtained at iteration 76, with an error norm of 0.014.

FIGURE 2.50 - Steepest descent results (reformulated criterion)

fig. 2.51 illustrates the performance of the Levenberg-Marquardt algorithm minimizing the new criterion. It achieves convergence, as usually, at iteration 115.

FIGURE 2.51 - Levenberg-Marquardt results (reformulated criterion)

The best results of the supervised methods are summarized in Table 2.19.

Table 2.19 - Results of supervised learning (10 neurons)

Method                                      Error norm
BP (standard criterion, alfa=0.001)         0.57
Levenberg-Marquardt (standard criterion)    0.02

Table 2.19 (continued) - Results of supervised learning (10 neurons)

Method                                      Error norm
BP (new criterion, alfa=0.1)                0.014
Levenberg-Marquardt (new criterion)         6 10^-4

Comparing these results with the results obtained with the previous methods, it is clear that the use of the Levenberg-Marquardt method, minimizing the reformulated criterion, gives the best results, at the expense of a more intensive computational effort.

Example 2.17 - Coordinate transformation

Ex. 2.8 was also experimented with RBFs. The results obtained with the fixed random centres algorithm are shown in Table 2.20.

Table 2.20 - Results of RBF learning (fixed centres)

Neurons   Error norm   Condition number
10        13.9         278
20        3.89         128
30        3.6          348
40        3.12         1.4 10^3
50        2.88         5.4 10^3

Employing the k-means clustering algorithm, we obtain:

Table 2.21 - Results of RBF learning (self-organized centres) (alfa=0.2)

Neurons   Error norm   Condition number
10        5            94
20        2.9          1.9 10^3
30        2.8          6.5 10^3
40        2.12         2.2 10^4
50        1.88         5.1 10^4

Table 2.22 illustrates the results obtained with the adaptive clustering algorithm.

Table 2.22 - Results of RBF learning (self-organized centres) (adaptive algorithm)

Neurons   Error norm   Condition number
10        3            30
20        2.9          532
30        2.12         3.2 10^4

With respect to supervised learning, the results with 10 neurons can be found in Table 2.23, employing again as initial values the results of rbf_fc.m. With τ = 10⁻³, the steepest descent method with the standard criterion did not converge after 200 iterations. The same method, with the new criterion, converged after 28 iterations. The LM methods converged in 103 (18) iterations for the standard (new) criterion. The optimal values can be found in winict.mat and winict_opt.mat (standard and reformulated criterion, respectively).

Table 2.23 - Results of supervised learning (10 neurons): error norm and condition number for BP and Levenberg-Marquardt, with the standard and the new criterion

Comparing the last 4 tables, it can be seen that, with only 10 neurons, the model trained with this algorithm/criterion achieves a better performance than models with 50 neurons trained with other algorithms. These examples show clearly the excellent performance of the Levenberg-Marquardt method, especially when minimizing the new criterion.

2.3 Lattice-based Associative Memory Networks

Within the class of lattice-based networks, we shall describe the CMAC network and the B-spline network. Both share a common structure and common features, which are described in the first section.

2.3.1 Structure of a lattice-based AMN

AMNs have a structure composed of three layers: a normalized input space layer, a basis functions layer and a linear weight layer.

FIGURE 2.52 - Associative memory network (input x(t) → normalized input space → basis functions a_1 … a_p → weight vector ω_1 … ω_p → network output y(t)) Normalized input space layer

This layer can take different forms, but is usually a lattice on which the basis functions are defined. fig. 2.53 shows a lattice in two dimensions.

FIGURE 2.53 - A lattice in two dimensions

One particular input lies within one of the cells in the lattice. This cell will activate some, but not all, of the basis functions of the upper layer. This scheme offers some advantages: knowledge is stored and adapted locally, the address of the non-zero basis functions can be explicitly computed, and the transformed input vector is generally sparse. Additionally, knowledge about the variation of the desired function with respect to each dimension can be incorporated into the lattice. The main disadvantage is that the memory requirements increase exponentially with the dimension of the input space, and therefore these networks are only suitable for small or medium-sized input space problems.

In order to define a lattice in the input space, vectors of knots must be defined, one for each input axis. There are usually a different number of knots for each dimension, and they are generally placed at different positions. The network input space is

[x_1^min, x_1^max] × … × [x_n^min, x_n^max]    (2.168)

where x_i^min and x_i^max are the minimum and maximum values of the ith input. The interior knots for the ith axis (out of n axes), λ_{i,1}, …, λ_{i,ri}, are arranged in such a way that:

x_i^min < λ_{i,1} ≤ λ_{i,2} ≤ … ≤ λ_{i,ri} < x_i^max

The procedure to find in which cell an input lies can therefore be formulated as n univariate search algorithms. At each extremity of each axis, a set of k_i exterior knots must also be given, which satisfy:

λ_{i,-(ki-1)} ≤ … ≤ λ_{i,0} = x_i^min    (2.169)

x_i^max = λ_{i,ri+1} ≤ … ≤ λ_{i,ri+ki}    (2.170)

These exterior knots are needed to generate the basis functions which are close to the boundaries, and so they are only used for defining these basis functions at the extremes of the lattice. They are usually coincident with the extremes of the input axes, or equidistant. When two or more knots occur at the same position, this is termed a coincident knot, and this feature can be used when attempting to model discontinuities.

FIGURE 2.54 - A knot vector on one dimension, with r_i interior knots and k_i = 2 (exterior knots at each extremity, interior knots λ_{i,1} … λ_{i,ri})

The jth interval of the ith input is denoted as I_{i,j} and is defined as:

I_{i,j} = [λ_{i,j-1}, λ_{i,j})  for j = 1, …, r_i
I_{i,j} = [λ_{i,j-1}, λ_{i,j}]  if j = r_i + 1    (2.171)

This way, within the range of the ith input, there are r_i + 1 intervals (possibly empty, if the knots are coincident), which means that there are p' = ∏_{i=1}^{n} (r_i + 1) n-dimensional cells in the lattice. The basis functions

The output of the hidden layer is determined by a set of p basis functions defined on the n-dimensional lattice. The shape, size and distribution of the basis functions are characteristics of the particular AMN employed. A basis function has compact, or bounded, support if its support is smaller than the input domain; the support of a basis function is the domain in the input space for which its output is non-zero.

For a particular AMN, the number of non-zero basis functions is a constant, ρ, denoted as the generalization parameter, and does not vary according to the position of the input. For any input, the number of non-zero basis functions is ρ, and so the basis functions can be organized into ρ sets. In each set, for any input, one and only one basis function is non-zero, and the union of the supports forms a complete and non-overlapping n-dimensional overlay.

FIGURE 2.55 - A two dimensional AMN with ρ = 3 and r_i = 5 (1st, 2nd and 3rd overlays over the input lattice)
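As noted above, finding the cell in which an input lies can be formulated as n univariate search algorithms; a minimal sketch (Python, with illustrative knot values):

```python
from bisect import bisect_right

def cell_index(x, interior_knots):
    """Locate the lattice cell of input x with one univariate search per
    axis. interior_knots[i] holds the sorted interior knots of axis i;
    an axis with r_i interior knots has r_i + 1 intervals, indexed 0..r_i."""
    return tuple(bisect_right(knots, xi)
                 for xi, knots in zip(x, interior_knots))

# Two-dimensional lattice with 5 interior knots per axis (36 cells).
knots = [[1, 2, 3, 4, 5], [1, 2, 3, 4, 5]]
print(cell_index((2.5, 0.3), knots))   # -> (2, 0)
```

Each axis costs one binary search, so the lookup is O(n log r) rather than a scan over the exponentially many cells.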

The decomposition of the basis functions into ρ overlays demonstrates that the number of basis functions increases exponentially with the input dimension. In fig. 2.55, any input within the pink region can be envisaged as lying in the intersection of three basis functions, whose supports are represented as the red, green and blue rectangles in the three overlays. The total number of basis functions is the sum of the basis functions in each overlay which, in turn, is the product of the number of univariate basis functions on each axis. These have a bounded support, and therefore there are at least two defined on each axis; a lower bound for the number of basis functions in each overlay is therefore 2^n. These networks suffer, therefore, from the curse of dimensionality [114]. Partitions of the unity

It is convenient to employ normalized basis functions, where the sum of all active basis functions is a constant:

Σ_{i=1}^{ρ} a_{ad(i)}(x_k) = constant    (2.172)

where ad(i) is the address of the non-zero basis function in overlay i. If a network has this property, it is said to form a partition of the unity. This is a desirable property: since the sum of the active basis functions is a constant and does not depend on the input, any variation in the network surface is only due to the weights in the network. Output layer

The output of an AMN is a linear combination of the outputs of the basis functions. The linear coefficients are the adjustable weights. The output is therefore:

y = Σ_{i=1}^{p} a_i w_i = a^T w    (2.173)

In (2.173), p denotes the total number of basis functions in the network. As we know, only ρ basis functions are non-zero, and so (2.173) can be reduced to:

y = Σ_{i=1}^{ρ} a_{ad(i)} w_{ad(i)}    (2.174)

The dependence of the basis functions on the inputs is nonlinear but, because the mapping from the basis function outputs to the network output is linear, finding the weights is just a linear optimization problem. This dependence can be made explicit as:

y = Σ_{i=1}^{ρ} a_i(x_i) w_i    (2.175)

2.3.2 CMAC networks

The CMAC network was first proposed by Albus [42]. Its main use has been for modelling and controlling high-dimensional non-linear plants such as robotic manipulators. In its simplest form, the CMAC is a look-up table, where the basis functions generalize locally. For on-line modelling and control, it is important that the nonlinear mapping achieved by the basis functions possesses the property of conserving the topology which, in other words, means that similar inputs will produce similar outputs. In CMACs, the supports are distributed on the lattice in such a way that the projection of the supports onto each of the axes is uniform. This is called the uniform projection principle. A network satisfies this principle if the following condition holds: as the input moves along the lattice one cell parallel to an input axis, a constant number of basis functions is dropped from, and introduced to, the set of active basis functions. In the scheme devised by Albus to ensure this property, this constant is 1 for all the axes. A CMAC is said to be well defined if the generalization parameter satisfies:

1 ≤ ρ ≤ max_i(r_i + 1)    (2.176)
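The output computation (2.173)-(2.174) is just a sparse dot product over the ρ active basis functions; a minimal sketch (Python, with made-up addresses, outputs and weights):

```python
def amn_output(active_addr, active_out, w):
    """y = sum of a_ad(i) * w_ad(i) over the rho active basis functions
    (eq. 2.174); the p - rho inactive basis functions are never touched."""
    return sum(a * w[ad] for ad, a in zip(active_addr, active_out))

# rho = 3 active functions out of p = 22. The normalized outputs sum to 1,
# so the network forms a partition of the unity (eq. 2.172).
w = [0.0] * 22
w[3], w[10], w[17] = 1.0, 2.0, 4.0
active_addr = [3, 10, 17]
active_out = [0.25, 0.5, 0.25]
assert abs(sum(active_out) - 1.0) < 1e-12
print(amn_output(active_addr, active_out, w))   # 0.25*1 + 0.5*2 + 0.25*4 = 2.25
```

Only ρ multiplications are needed per output, regardless of the total number of basis functions p.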

FIGURE 2.56 - 2-dimensional CMAC overlays with ρ = 3 (overlay displacements d1 = (1,1), d2 = (2,2), d3 = (3,3); basis functions a1 … a22 over the input lattice)

Consider the two-dimensional CMAC shown in fig. 2.56. There are 5 interior knots in each axis, which corresponds to 36 lattice cells, and the generalization parameter is 3. The network uses 22 basis functions to cover the input space. The overlays are successively displaced relative to each other, corresponding to a displacement matrix D = [1 2 3; 1 2 3]^T. Once the corner (the "top-right" corner of the first support, in a 2-dimensional overlay) of an overlay is determined, the overlay placement is uniquely defined: the subsequent supports are obtained by adding ρ units in the first dimension, repeating this process until the maximum number of knots has been reached, then adding ρ units in the 2nd dimension, with the first dimension reset, and repeating this process through every dimension. As the input moves one cell apart in any dimension, one basis function is dropped and a new one is incorporated.

To understand how a CMAC generalizes, it is interesting to consider the relationship between neighbouring sets of active basis functions. In fig. 2.57, the black cell shows the input, the grey cells show the cells that have 2 active basis functions in common with the input, and the coloured cells the ones that have 1 active basis function in common with the input. Cells which are more than ρ−1 cells apart have no basis function in common. If an LMS type of learning rule is used, only those weights which correspond to a non-white area are updated, and the rest is left unchanged. This is why the CMAC stores and learns information locally.

FIGURE 2.57 - CMAC generalization Overlay displacement strategies

The problem of distributing the overlays in the lattice has been considered by a number of researchers. The original Albus scheme placed the overlay displacement vector on the main diagonal of the lattice, with the displacements of the overlays given as:

d_1 = (1, 1, …, 1)
d_2 = (2, 2, …, 2)
…
d_{ρ-1} = (ρ−1, ρ−1, …, ρ−1)
d_ρ = (ρ, ρ, …, ρ)    (2.177)

This distribution obeys the uniform projection principle, but lies only on the main diagonal, and the resulting network might not be as smooth as with other strategies. In [115], a vector of elements (d_1, d_2, …, d_n), where 1 ≤ d_i < ρ, with d_i and ρ coprime, was computed using an exhaustive search which maximizes the minimum distance between the supports. The displacement vectors are then defined to be:

d_1 = (d_1, d_2, …, d_n)
d_2 = (2d_1, 2d_2, …, 2d_n)
…
d_{ρ-1} = ((ρ−1)d_1, (ρ−1)d_2, …, (ρ−1)d_n)
d_ρ = (ρ, ρ, …, ρ)    (2.178)

where all the calculations are performed using modulo ρ arithmetic. Tables are given by these authors for 1 ≤ n ≤ 15 and 1 ≤ ρ ≤ 100. Basis functions

The original basis functions, as proposed by Albus, are binary: they are either on or off. To make a partition of the unity, the output of a basis function is 1/ρ if the input lies within its support, and 0 otherwise. Because of this, the network output is piecewise constant, and has discontinuities along each (n−1)-dimensional hyperplane which passes through a knot lying on one of the axes. Therefore, CMACs can only exactly reproduce piecewise constant functions. The CMAC is thus similar to a look-up table and is, in fact, equal to one (the output is simply the weight stored in the active basis function) when ρ = 1. Higher order basis functions have subsequently been proposed, but we will not mention them in this text. Training and adaptation strategies

The original training algorithm proposed by Albus is an instantaneous learning algorithm. Its iteration k can be expressed as:

Δw[k] = δ e[k] a[k],  if |e[k]| > ζ
Δw[k] = 0,  otherwise    (2.179)
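The displacement schemes (2.177) and (2.178) can be generated mechanically; a sketch (Python). The coprime vector d used below is illustrative, not one of the tabulated vectors of [115], and the convention of mapping modulo-ρ results into 1…ρ is one reasonable reading of the scheme:

```python
def albus_displacements(n, rho):
    """Original Albus scheme (eq. 2.177): overlays displaced along the
    main diagonal of the lattice."""
    return [tuple(j for _ in range(n)) for j in range(1, rho + 1)]

def general_displacements(d, rho):
    """General scheme (eq. 2.178): multiples of a coprime vector d, taken
    with modulo-rho arithmetic (results mapped into the range 1..rho)."""
    overlays = [tuple((j * di - 1) % rho + 1 for di in d)
                for j in range(1, rho)]
    overlays.append(tuple(rho for _ in d))      # d_rho = (rho, ..., rho)
    return overlays

print(albus_displacements(2, 3))    # -> [(1, 1), (2, 2), (3, 3)]
print(general_displacements((1, 2), 5))
# -> [(1, 2), (2, 4), (3, 1), (4, 3), (5, 5)]
```

With d = (1, 1, …, 1), the general scheme reduces to the Albus diagonal placement.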

Only those weights that contribute to the output are updated. This is clearly an LMS update, with a dead-zone. Other instantaneous learning algorithms can be used with the CMAC; the one most frequently used is the NLMS learning algorithm, introduced before, again with a dead-zone:

Δw[k] = δ e[k] a[k] / ‖a[k]‖₂²,  if |e[k]| > ζ
Δw[k] = 0,  otherwise    (2.180)

where 0 < δ < 2. Please note that, in the case of the binary CMAC, ‖a[k]‖₂² = ρ⁻¹.

In terms of training (off-line) algorithms, the batch version of the NLMS algorithm can be applied:

Δw = (δ/m) Σ_{k=1}^{m} e[k] a[k] / ‖a[k]‖₂²    (2.181)

Of course, if we remember that the output of the CMAC is a linear layer, the pseudo-inverse can be employed to obtain:

ŵ = A⁺ t    (2.182)

It should be mentioned that the matrix A is usually not a full-rank matrix. Examples

Example 2.13 - pH nonlinearity with CMACs

The minimum and maximum values of the input vector are 0.035 and 0.75, respectively. Considering 0 and 0.75 as the limit values for our network, with an increasing number of interior knots and a generalization parameter of 3, Table 2.24 illustrates the results obtained. The CMACs were created with the Matlab file Create_CMAC.m, and trained with the function train_CMAC_opt.m, which also employs the files out_basis.m and ad.m.
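The instantaneous NLMS update with a dead-zone (2.180) can be sketched as follows (Python; the learning rate, dead-zone and data are illustrative):

```python
def nlms_deadzone(w, a, target, delta=0.5, zeta=1e-6):
    """One instantaneous NLMS step with a dead-zone (eq. 2.180).
    w: weight list; a: dict {address: output} of the active basis functions."""
    y = sum(out * w[ad] for ad, out in a.items())
    e = target - y
    if abs(e) <= zeta:                    # dead-zone: ignore tiny errors
        return e
    norm2 = sum(out * out for out in a.values())
    for ad, out in a.items():             # only the active weights move
        w[ad] += delta * e * out / norm2
    return e

w = [0.0] * 8
a = {1: 1/3, 4: 1/3, 6: 1/3}   # binary CMAC, rho = 3: ||a||^2 = 1/rho = 1/3
for _ in range(200):
    nlms_deadzone(w, a, target=1.5)
y = sum(out * w[ad] for ad, out in a.items())
print(y)
```

With a fixed input, each step shrinks the error by the factor (1 − δ), until the dead-zone freezes the weights.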

Table 2.24 - Error norm, for different numbers of interior knots (r_i = 10, 20, 30, 40, 50), when (2.182) is used: error norms between 0.64 and 0.27, with condition numbers growing from 1.9×10¹⁸ up to ∞

Example 2.14 - Coordinate Transformation with CMACs

Consider again the coordinate transformation mapping. The minimum and maximum values of the input are (0,0) and (1,1), respectively. Considering these as the minimum and maximum values of our network, with 5 interior knots in each dimension and a generalization parameter of 3, we stay with 36 cells and 22 basis functions. fig. 2.58 illustrates the output obtained, using (2.182).

FIGURE 2.58 - Output and Target pattern, 5 Interior knots and ρ = 3

If now 10 interior knots are used for each dimension, the number of cells rises to 121, and the number of basis functions to 57.
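The weight computation (2.182) is a linear least-squares problem; a tiny self-contained sketch using the normal equations (adequate only for the small, well-conditioned toy system below, with made-up data; for the rank-deficient matrices mentioned above, the pseudo-inverse via an SVD is the right tool):

```python
def solve(A, b):
    """Gauss-Jordan elimination with partial pivoting for a small system."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [x - f * y for x, y in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def fit_weights(A, t):
    """Normal equations w = (A^T A)^{-1} A^T t, the full-rank case of
    w = A+ t in eq. (2.182)."""
    m, n = len(A), len(A[0])
    AtA = [[sum(A[k][i] * A[k][j] for k in range(m)) for j in range(n)]
           for i in range(n)]
    Att = [sum(A[k][i] * t[k] for k in range(m)) for i in range(n)]
    return solve(AtA, Att)

# Rows: basis-function outputs per pattern; targets generated by w = [1, 2].
A = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [0.25, 0.75]]
t = [1.0, 1.5, 2.0, 1.75]
print(fit_weights(A, t))   # recovers [1.0, 2.0]
```
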

The results obtained are shown in fig. 2.59.

FIGURE 2.59 - Output and Target pattern, 10 Interior knots and ρ = 3

If we now increase the number of knots to 20 in each dimension, we stay with 441 cells and 177 basis functions. The results obtained are shown in fig. 2.60. Even with this large number of basis functions, the norm of the weight vector is small, which shows that the model will generalize well.

FIGURE 2.60 - Output and Target pattern, 20 Interior knots and ρ = 3

The above results can be summarized in the following table.

Table 2.25 - Error norm, for different numbers of interior knots (r_i = 5, 10, 20)

It should be remembered that the best result obtained for this example, with MLPs, achieved an error norm of 0.13.

2.3.3 B-spline networks

B-splines have been employed as surface-fitting algorithms for the past twenty years. An important achievement was obtained in 1972, when Cox [116] derived an efficient recurrent relationship to evaluate the basis functions. B-splines can be considered AMN networks, and their structure is very similar to that of the CMAC networks studied before. One additional reason to study this kind of network is that it provides a link between neural networks and fuzzy systems: in fact, a multivariate B-spline of order 2 can also be interpreted as a set of fuzzy rules. Basic structure

A B-spline network is just a piecewise polynomial mapping, formed from a linear combination of basis functions, these functions being defined on a lattice. When the B-spline network is designed, it is necessary to specify the shape (order) of the univariate basis functions and the number of knots for each dimension. The order of the spline implicitly sets the size of the basis functions' support and the generalization parameter: the univariate B-spline basis function of order k has a support which is k intervals wide. Thus, each input is assigned to k basis functions, and the generalization parameter is k. A set of univariate basis functions is shown in fig. 2.61.

FIGURE 2.61 - Univariate B-splines of orders 1-4 (a: order 1, b: order 2, c: order 3, d: order 4)

Multivariate basis functions are formed by taking the tensor product of the univariate basis functions: each multivariable basis function is formed from the product of n univariate basis functions, one from each input axis, and every possible combination of univariate basis functions is taken. In the following figures we give examples of 2-dimensional basis functions, where 2 interior knots have been assigned to each input dimension. The curse of dimensionality is clear from an analysis of the figures: in the case of splines of order 3 for each dimension, for instance, 3² = 9 basis functions are active at any time.

FIGURE 2.62 - Two-dimensional multivariable basis functions formed with order 1 univariate basis functions

FIGURE 2.63 - Two-dimensional multivariable basis functions formed with order 2 univariate basis functions

FIGURE 2.64 - Two-dimensional multivariable basis functions formed with order 3 univariate basis functions

Denoting the order of the univariate basis functions on the ith input as k_i, the support of each multivariate basis function is a hyperrectangle of size k_1 × k_2 × … × k_n. Therefore, in B-spline networks defined on an n-dimensional lattice, the generalization parameter is:

ρ = ∏_{i=1}^{n} k_i    (2.183)

This is exponential with n, and therefore the cost of implementing the algorithm is also exponential with n. Thus, B-spline networks should only be used when the number of relevant inputs is small, or when the desired mapping can be additively decomposed into a number of simpler relationships. As in CMACs, the topology conservation principle is obeyed, and the output of the network can be given as:

y = Σ_{i=1}^{p} a_i w_i = Σ_{i=1}^{ρ} a_{ad(i)} w_{ad(i)}    (2.184)

where a is the vector of the outputs of the basis functions, p is the total number of basis functions, and ad(i) is the address of the active basis function in the ith overlay. Univariate basis functions

The jth univariate basis function of order k is denoted by N_k^j(x), and is defined by the following relationships:

N_k^j(x) = ((x − λ_{j−k}) / (λ_{j−1} − λ_{j−k})) N_{k−1}^{j−1}(x) + ((λ_j − x) / (λ_j − λ_{j−k+1})) N_{k−1}^j(x)
N_1^j(x) = 1 if x ∈ I_j, 0 otherwise    (2.185)

In (2.185), λ_j is the jth knot, and I_j = [λ_{j−1}, λ_j) is the jth interval. On each interval the basis function is a polynomial of order k, and at the knots the polynomial pieces are joined smoothly.

Let us assume that we are modelling a function of one input, varying from 0 to 5, that 4 interior knots are employed, and that splines of order 3 (k = 3) are employed. Analysing fig. 2.61, it is clear that the basis function output, and consequently the network output, will become smoother as the spline order increases. If we consider the 3rd interval (j = 3), we know that 3 basis functions have non-zero values for x values within this interval. The application of (2.185) is therefore translated into:

N_3^3(x) = ((x − λ_0)/(λ_2 − λ_0)) N_2^2(x) + ((λ_3 − x)/(λ_3 − λ_1)) N_2^3(x)    (2.186)

N_2^3(x) = ((x − λ_1)/(λ_2 − λ_1)) N_1^2(x) + ((λ_3 − x)/(λ_3 − λ_2)) N_1^3(x)    (2.187)

As N_2^2(x) = N_1^2(x) = 0 and N_1^3(x) = 1 for x within I_3, we then have:

N_3^3(x) = ((λ_3 − x)/(λ_3 − λ_1)) ((λ_3 − x)/(λ_3 − λ_2)) = (λ_3 − x)² / ((λ_3 − λ_1)(λ_3 − λ_2)) = (3 − x)²/2    (2.188)

The second active basis function, N_3^4(x), is:

N_3^4(x) = ((x − λ_1)/(λ_3 − λ_1)) N_2^3(x) + ((λ_4 − x)/(λ_4 − λ_2)) N_2^4(x)    (2.189)

N_2^4(x) = ((x − λ_2)/(λ_3 − λ_2)) N_1^3(x) + ((λ_4 − x)/(λ_4 − λ_3)) N_1^4(x)    (2.190)

As:

N_1^4(x) = 0    (2.191)

we then have:

N_3^4(x) = ((x − λ_1)/(λ_3 − λ_1))((λ_3 − x)/(λ_3 − λ_2)) + ((λ_4 − x)/(λ_4 − λ_2))((x − λ_2)/(λ_3 − λ_2)) = (x − 1)(3 − x)/2 + (4 − x)(x − 2)/2    (2.192)

Finally, for the 3rd active basis function, N_3^5(x):

N_3^5(x) = ((x − λ_2)/(λ_4 − λ_2)) N_2^4(x) + ((λ_5 − x)/(λ_5 − λ_3)) N_2^5(x)    (2.193)

As N_2^5(x) = 0, we then have:

N_3^5(x) = ((x − λ_2)/(λ_4 − λ_2))((x − λ_2)/(λ_3 − λ_2)) = (x − λ_2)² / ((λ_4 − λ_2)(λ_3 − λ_2)) = (x − 2)²/2    (2.194)

The polynomial functions active for interval I_3 are shown in fig. 2.65. It can be seen that, at the knots (2 and 3), the polynomials are joined smoothly, and that the sum of the three polynomials is constant and equal to 1, which means that the basis functions make a partition of the unity.

FIGURE 2.65 - Polynomial functions of the active basis functions for I_3 (solid-p1, dash-p2, dot-p3): a) all x domain; b) only I_3
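Recurrence (2.185) maps directly onto code, and the closed forms (2.188), (2.192) and (2.194) can be checked numerically against it; a sketch (Python) for the running example (domain [0, 5], interior knots 1-4, order 3, exterior knots coincident with the boundaries):

```python
def lam(j, interior=(1, 2, 3, 4), xmin=0.0, xmax=5.0):
    """Knot lambda_j for the running example; all exterior knots are
    coincident with the boundaries."""
    if j <= 0:
        return xmin
    if j >= len(interior) + 1:
        return xmax
    return float(interior[j - 1])

def bspline(j, k, x):
    """Cox-de Boor recurrence, eq. (2.185); terms whose denominator is a
    zero-width knot span are dropped."""
    if k == 1:
        return 1.0 if lam(j - 1) <= x < lam(j) else 0.0
    out = 0.0
    if lam(j - 1) != lam(j - k):
        out += (x - lam(j - k)) / (lam(j - 1) - lam(j - k)) * bspline(j - 1, k - 1, x)
    if lam(j) != lam(j - k + 1):
        out += (lam(j) - x) / (lam(j) - lam(j - k + 1)) * bspline(j, k - 1, x)
    return out

# Closed forms derived above for I3 = [2, 3).
p1 = lambda x: (3 - x) ** 2 / 2                              # N_3^3, (2.188)
p2 = lambda x: (x - 1) * (3 - x) / 2 + (4 - x) * (x - 2) / 2  # N_3^4, (2.192)
p3 = lambda x: (x - 2) ** 2 / 2                              # N_3^5, (2.194)

for i in range(10):
    x = 2.0 + i / 10.0
    vals = [bspline(j, 3, x) for j in (3, 4, 5)]
    # The recursion reproduces the closed forms, and the three active
    # functions sum to 1: a partition of the unity on I3.
    assert abs(vals[0] - p1(x)) < 1e-12
    assert abs(vals[1] - p2(x)) < 1e-12
    assert abs(vals[2] - p3(x)) < 1e-12
    assert abs(sum(vals) - 1.0) < 1e-12
print(bspline(3, 3, 2.5), bspline(4, 3, 2.5), bspline(5, 3, 2.5))
# -> 0.125 0.75 0.125
```
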

The computational complexity involved in the calculation of N_k^j(x) can be estimated by analysing the following figure:

FIGURE 2.66 - Triangular array used for evaluating the k non-zero B-spline functions of order k

The basis functions of order k are evaluated using basis functions of order k−1. In this evaluation, 2 multiplications and 2 divisions are performed for each basis function. The number of basis functions to evaluate can be obtained from fig. 2.66: summing the nodes up to layer k−1, it is easy to see that we have k(k−1)/2 basis functions. So, for k ≥ 2, the total cost is k(k−1) multiplications and divisions. Multivariate basis functions

As already referred, multivariate basis functions are formed by multiplying n univariate basis functions:

N_k^j(x) = ∏_{i=1}^{n} N_{k_i}^{j_i}(x_i)    (2.195)

The number of basis functions of order k_i defined on an axis with r_i interior knots is r_i + k_i. Therefore, the total number of basis functions for a multivariate B-spline is:

p = ∏_{i=1}^{n} (r_i + k_i)    (2.196)

This number depends exponentially on the size of the input and, because of this, B-splines are only applicable to problems where the input dimension is small (typically ≤ 5). Learning rules

We shall defer the problem of determining the knot vector to the next Section, and just concentrate on the determination of the output weights. The output layer of B-splines is a linear combiner, and therefore, for off-line training, the least-squares solution (whether standard or with regularization) is the one most often employed. In terms of on-line adaptation, the NLMS rule introduced before is, for obvious reasons, the one most often employed. Examples

Example 2.15 - pH with B-splines

The minimum and maximum values of the input are 0 and 0.75, respectively. Employing the Matlab functions create_nn.m and train_B_splines.m (this function also employs basis_out.m, basis_num.m, gen_ele.m, index.m, njk_multi.m and njk_uni.m), and determining the linear weights using a pseudo-inverse, the following results were obtained, considering splines of order 3 and 4, and 4 and 9 interior knots:

Table 2.26 - B-spline modelling the inverse of the pH nonlinearity: error norm and condition number for (k, r) = (3, 4), (3, 9), (4, 4) and (4, 9)

Example 2.16 - Coordinate transformation

The minimum and maximum values of the input are (0,0) and (1,1), respectively. Employing the same Matlab functions, and determining the linear weights using a pseudo-inverse, the following results were obtained, considering splines of order 2 and 3, and 4 and 9 interior knots in each dimension:

Table 2.27 - B-spline modelling the coordinate transformation: error norm, condition number and weight-vector norm for splines of order 2 and 3, with 4 and 9 interior knots per dimension

2.3.4 Structure Selection - The ASMOD algorithm

We have seen, in Section 2.3.3.1, that the size of the multivariate basis functions increases exponentially as the input dimension increases, therefore restricting the use of B-spline networks to problems with up to 4 or 5 input variables. In this section we introduce one algorithm that, on one hand, enables the use of B-splines in higher dimensional problems and, on the other hand, is a construction procedure for the network.

For all the neural models considered, we have assumed until now that their structure remains fixed, and that the training process only changes the network parameters according to the objective function to be minimized. Consider, for instance, a 5-input B-spline network, with seven univariate basis functions (order 2, 5 interior knots) for each dimension. The model produced would have the form:

y = f(x_1, x_2, x_3, x_4, x_5)    (2.197)

This network would require 7⁵ = 16807 storage locations, and each input would activate 2⁵ = 32 basis functions. Let us assume that the function f(.) could be well described by:

ŷ = f_1(x_1) + f_2(x_2, x_3) + f_3(x_4, x_5)    (2.198)

Then, assuming that each additive function would be modelled by a smaller sub-network, each one with the same knot vector and order as the one described above, we would require, for this function, 7 + 7² + 7² = 105 storage locations, and there would be 2 + 2² + 2² = 10 active basis functions at any one time. A saving of more than 150 times would thus have been achieved in storage, and we would require about one third of the active basis functions.

More formally, if a sub-model network is denoted as s_u(.), or s_u(x_u), to explicitly express that the network inputs belong to the subset u of the input vector x, and we have n_u sub-models in our network, the network output is given by:

y(x) = Σ_{u=1}^{n_u} s_u(x_u)    (2.199)

The sets of input variables satisfy:

∪_{u=1}^{n_u} x_u ⊆ x    (2.200)

If the weight vector and the basis function vector are partitioned accordingly, w = ∪_{u=1}^{n_u} w_u and a = ∪_{u=1}^{n_u} a_u, then (2.199) can be transformed into:

y = Σ_{u=1}^{n_u} Σ_{j=1}^{p_u} a_{u_j} w_{u_j}    (2.201)

It is not obvious, from a data set, how to infer the best structure for a network, although, if we have an idea about the data generation function, the network structure can sometimes be found analytically (see Exercise 38 and Exercise 39). As referred before, the structure of a B-spline network is characterized by the number of additive sub-models, the input set of variables for each one, and, for each input in this set, the associated knot vector and spline order. For this reason, algorithms have been proposed to iteratively determine, from the observed data, the best network structure. The algorithm

The ASMOD (Adaptive Spline Modelling of Observed Data) algorithm [119][120] is an iterative algorithm that can be employed to construct the network structure from the observed data. All those structural parameters are discovered by the algorithm, with the exception of the spline order, which must be a user-defined parameter. If the designer has a clear idea of the decomposition of the function into univariate and multivariate sub-models, the algorithm can use this structure as an initial model, and then refine it. Otherwise, an initial model usually constructed of univariate sub-models for each input dimension, each one with 0 or 1 interior knots, is employed.
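Returning to the decomposition (2.197)-(2.198): with 7 univariate basis functions per axis (order 2) the storage and activation arithmetic can be checked directly:

```python
# Full 5-input tensor-product network vs. the additive decomposition
# y = f1(x1) + f2(x2, x3) + f3(x4, x5), with 7 basis functions per axis
# (r = 5 interior knots, order k = 2) and k = 2 active functions per axis.
bf_per_axis, k = 7, 2

full_storage = bf_per_axis ** 5
additive_storage = bf_per_axis + bf_per_axis ** 2 + bf_per_axis ** 2
full_active = k ** 5
additive_active = k + k ** 2 + k ** 2

print(full_storage, additive_storage, full_storage // additive_storage)
# -> 16807 105 160
print(full_active, additive_active)
# -> 32 10
```
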

The ASMOD algorithm can be described as:

i = 1;
m̂(i−1) = Initial Model;
termination criterion = FALSE;
WHILE NOT(termination criterion)
    Generate a set M(i) of candidate networks;
    Estimate the parameters for each candidate network;
    Determine the best candidate, m̂(i), according to some criterion J;
    IF J(m̂(i)) ≥ J(m̂(i−1))
        termination criterion = TRUE;
    END
    i = i + 1;
END

Algorithm 2.11 - ASMOD algorithm

Each main part of the algorithm is detailed below.

• Candidate models are generated by the application of a refinement step, where the complexity of the current model is increased, and of a pruning step, where the current model is simplified, in an attempt to determine a simpler model that performs as well as the current one. Since, in the majority of the cases, the pruning step does not generate candidates that are eager to proceed to the next iteration, it is often applied only after a certain number of refinement steps, or just applied to the optimal model resulting from the ASMOD algorithm with refinement steps alone. Care must be taken in the refinement step to ensure that the complexity (in terms of weights) of the final model does not overpass the size of the training set.

Three methods are considered for model growing:

1. For every input variable not present in the current network, a new univariate sub-model is introduced in the network. The spline order and the number of interior knots are specified by the user, and usually 0 or 1 interior knots are applied.
2. For every combination of sub-models present in the current model, combine them into a multivariate sub-network with the same knot vectors and spline orders.
3. For every sub-model in the current network, and for every dimension in each sub-model, split every interval in two, creating therefore candidate models with a complexity higher by 1.

For network pruning, three possibilities are also considered:

1. For every sub-model in the current network, and for every dimension in each sub-model, remove each interior knot, creating therefore candidate models with a complexity smaller by 1. Univariate sub-models with no interior knots are removed from the network altogether.
2. For every multivariate (n-input) sub-model in the current network, split it into n sub-models with n−1 inputs each.
3. For every sub-model in the current network, replace its splines of order k by splines of order k−1. If k−1 = 1 and the sub-model has no interior knots, it is removed, as it is just a constant.

• For every candidate, the network must be trained to determine the optimal values of the linear weights. As referred before, any of the learning rules introduced earlier can be applied, although a pseudo-inverse implemented with an USV factorization is advisable.

• All candidate models must then have their performance evaluated. As we are comparing models with different complexities, a criterion such as the MSE would not be enough: we are looking for the smallest model with a satisfactory performance, and therefore the performance criterion should balance accuracy and complexity. Three criteria are usually employed:

Bayesian information criterion:

K = L ln(J) + p ln(L)    (2.202)

Akaike's information criterion:

K(φ) = L ln(J) + p φ    (2.203)

Final prediction error:

K = L ln(J) + L (L + p)/(L − p)    (2.204)

In eqs. (2.202) to (2.204), J is the performance criterion (typically the MSE), p is the size of the network (in terms of weights), and L is the number of training patterns.

Example 2.17 - pH example

The ASMOD algorithm (a large number of functions compound this algorithm; the files are zipped in was applied to the pH example, with zero initial interior knots and a spline order of 2. fig. 2.67 shows the evolution of the performance criterion with the ASMOD iterations.
For every sub-model in the current network.3.203) (2.zip) was applied to the pH example. • All candidate models must have their performance evaluated.202) (2. or similar would not be enough. K is the performance criterion. and L is the number of training patterns. Three criteria are usually employed: Baysean information criterion K = L ln ( J ) + p ln ( L ) (2. the networks must be trained to determine the optimal value of the linear weights. As referred before any learning rule such as the ones introduced before can be applied.204) Akaike´s information criterion K ( φ ) = L ln ( J ) + pφ L+p Final prediction error K = L ln ( J ) + L ⋅ ----------L–p In eqs. The next figure shows the evolution of the performance criterion with the ASMOD iterations.17 .pH example The ASMOD algorithm (a large number of functions compound this algorithm.202) to (2. we are looking for the smallest model with a satisfactory performance. for every dimension in each sub-model. 156 Artificial Neural Networks . As it was referred before. (2. as we are comparing models with different complexities. For every candidate. Example 2. • The application of the previous steps generates a certain number of candidates. p is the size of the network (in terms of weights).204). and therefore the performance criterion to employ should balance accuracy and complexity. and a spline order of 2. remove each interior knot. therefore the files are zipped in asmod. although a pseudo-inverse implemented with a USV factorization is advisable. A criterion such as the MSE. with zero initial interior knots. creating therefore candidate models with a complexity smaller of 1.

FIGURE 2.67 - Evolution of the performance criterion

The evolution of the MSE is shown in fig. 2.68.

FIGURE 2.68 - Evolution of the MSE

As it is known, the ASMOD algorithm introduces new interior knots in the refinement step. This translates into an increasing number of basis functions. The evolution of this parameter, which measures the complexity of the network, is shown in fig. 2.69.

FIGURE 2.69 - Evolution of complexity

The final structure is the best one in the iteration before the last one. In all, 1274 candidate models were generated and evaluated. The best network has 51 basis functions, and achieves an MSE of 0.9 10-8, which is translated into an error norm of 9.5 10-4. The norm of the weight vector is 4.02, and the condition of the matrix of the basis function outputs is 7, which is an excellent value. The final placement of the interior knots is shown in fig. 2.70, and it can be concluded that the ASMOD algorithm places more knots in the range where the function changes more rapidly, and fewer where it remains constant.

FIGURE 2.70 - Placement of interior knots

Considering now the use of B-splines of order 3, the following figure shows the evolution of the performance criterion:

FIGURE 2.71 - Evolution of the performance criterion

The evolution of the MSE is presented in fig. 2.72.

FIGURE 2.72 - Evolution of the MSE

The final model has a complexity of 24, and achieves an MSE of 1.5 10-8. The final knot placement can be seen in the next figure.

FIGURE 2.73 - Final knot placement

Example 2.18 - Coordinate transformation

Employing splines of order 2, and starting the ASMOD algorithm with 2 splines with 0 interior knots, the evolution of the performance criterion can be seen in fig. 2.74.

FIGURE 2.74 - Evolution of the performance criterion

The evolution of the MSE is expressed in the following figure.

FIGURE 2.75 - Evolution of the MSE

The final knot placement can be seen in the next figure. The small circles denote the training patterns, the '*' symbols denote the knots of the bi-dimensional splines, and the vertical and horizontal lines are located at the knots of the univariate splines.

FIGURE 2.76 - Final knot placement

The final value of the MSE is 1.6 10-5 (corresponding to an error norm of 0.04), obtained with a B-spline network with 62 weights, composed of two univariate sub-models and one bivariate sub-model. The final norm of the weight vector is 8.

2.3.3.5 - Completely supervised methods

If the structure of the network is determined by a constructive method, the position of the knots and the values of the linear parameters can be determined by any of the gradient methods introduced for MLPs and RBFs, minimizing either the standard or the reformulated criterion. The application of the BP and the LM methods can be found in [142], and can be employed by downloading Gen_b_spline.zip. A different approach, involving genetic programming and performing single- or multi-objective optimization, can be found in [143].

2.3.4 - Model Comparison

In the last sections we have introduced four major neural network models (Multilayer Perceptrons, Radial Basis Functions, CMACs and B-splines), and we have applied them to two different examples. In this section we shall compare them in terms of the accuracy obtained (2-norm of the error vector), the model complexity (number of neurons and model parameters), and the training effort (time spent in training, floating point operations - flops - and the condition of the output matrix of the last nonlinear layer), for the different off-line training methods employed.

3 0.1 3. minimizing the new criterion. Table 2.7 7. running Windows NT 4.41 0. The accuracy obtained with this algorithm is equivalent to the Matlab implementation.29 . We considered the case of a MLP with [4 4 1].1 Multilayer perceptrons In this case. as the topology is fixed. with the standard criterion Error norm 1.4. we can see that the Levenberg-Marquardt algorithm.33 0.3 1. in terms of accuracy.02 8.5 2.2 1.5 93 23 Algorithm BP ( η = 0.3 61 3. and the number of parameters is 33.2 Flops (106) 3.5 ) quasi_Newton Levenberg-Marquardt Analysing the last two tables. It should be noted that all the figures relate to a constant number of 100 iterations.05 ) BP ( η = 0.2 10-4 5.0. with the new criterion Error norm 0. with a computation time around three times larger than using the back-propagation algorithm.2 10-4 Condition number 56 59 189 201 101 Time (sec) 1.3 1. 2.3 41.9 10-4 Condition number 59 63 63 69 106 112 Time (sec) 1.1 ) BP ( η = 0.2 1.5 2. achieves the best results. the model complexity is also fixed. and so the number of neurons is 9.01 ) BP ( η = 0.Results of MLPs.28 .005 5.005 ) BP ( η = 0.4.5 3. Artificial Neural Networks 163 .2.1 Inverse of the pH nonlinearity The following results were obtained in a 200 MHz Pentium Pro Processor.3 1.45 0.5 3. but it saves more than twice the computational time.001 ) quasi_Newton Levenberg-Marquardt Levenberg-Marquardt (Matlab) Table 2.3 Flops (106) 2.5 90 29 40 Algorithm BP ( η = 0.4 1. with 64 MB of RAM.Results of MLPs.1.

5 0.002 2 10-4 0.57 0.4 2.2 . specially the LM algorithm with the new criterion.05 0.2 Radial Basis functions Table 2.8 1017 5. produces results typical of larger networks.8 1016 5 1017 5.3 10-4 2.7 1.1 LM(NC) Error norm 0.Results of RBFs Number of Parameter s 30 50 70 90 30 50 70 90 30 50 70 90 30 30 30 30 Algorithm Fixed centres Fixed centres Fixed centres Fixed centres Clustering Clustering Clustering Clustering Adaptive clustering Adaptive clustering Adaptive clustering Adaptive clustering BP (SC) alfa=0.02 0.6 5.2 0.4 0.8 1016 1 1017 1.006 0. at the expense of a larger computational time.7 1.001 LM (SC) BP (NC) alfa=0.3 0.4 10-4 0.2 106 6. we can see that both clustering algorithms give slightly better results than the training algorithm with random fixed centres.5 8042 33737 8194 23696 Analysing the last table.30 .5 0.7 0. Supervised methods.002 2.6 1017 1 1010 2 1016 8. 164 Artificial Neural Networks .014 6 10-4 Condition number 7.2 106 Number of Neurons 10 20 30 40 10 20 30 40 10 20 30 40 10 10 10 10 Time (sec) 0.5 1 2.2 106 2.8 1018 ∞ 2.7 12 21 29 38 0.4 2. All algorithms obtain a very poor conditioning. Between the clustering algorithms.9 46 48 28 45 Flops (106) .2.002 5. the adaptive one is typically 40 times faster than the standard k-means clustering algorithm.2 3.03 0.5 107 5 1016 1 1017 0.2 10-4 2.05 0.2 0.2 10-4 0.

12 0.05 0.1.Results of CMACs Error norm 0.6 1 1018 117 2 105 Number of interior knots 4 9 4 9 Spline order 3 3 4 4 Time (sec) 0.007 Condition number 13.4. Table 2.9 Flops (106) .73 0.2.9 1018 3.65 1. using interior knots equally spaced.12 0.04 0.2 .7 The first results are obtained without model selection.12 0.15 Flops (106) .3 CMACs The following table illustrates the results obtained with CMACs. Table 2.Results of B-splines (without using ASMOD) Number of Parameter s (Weight vector) 7 12 8 13 Error norm 0. A generalization parameter of 3 was employed.86 1.31 .4.08 0.8 2. with a varying number of interior knots.27 2.6 1 1.34 The next table illustrates the results with the ASMOD algorithm.1 .8 1034 ∞ ∞ ∞ Number of interior knots 10 20 30 40 50 Number of Parameter s 13 23 33 43 53 Time (sec) 0.2 .88 0.17 0.34 0.32 .1.4 0.2 .4 B-Splines Condition number 1. Artificial Neural Networks 165 .64 0.

2 Condition number 7.Results of MLPs.35 Condition number 381 189 190 3.7 103 167 Time (sec) 0.Table 2.01 ) quasi_Newton Levenberg-Marquardt Table 2.1 1.3 10-3 2. with the standard criterion Error norm 3.0005 ) quasi_Newton Levenberg-Marquardt 166 Artificial Neural Networks .9 0.4.5 10-4 1.9 Algorithm BP ( η = 0.Results of B-splines (using ASMOD) Number of Parameter s (Weight vector) 51 22 Error norm 9.1 Flops (106) 1.3 0.5 35 13 Algorithm BP ( η = 0. Table 2.47 0. running Windows 98. 005 ) BP ( η = 0.4. the model complexity is also fixed.17 Condition number 121 102 250 Time (sec) 1 28 2.33 .1 Multilayer perceptrons In this case.49 0. with the new criterion Error norm 1.2 239 Number of interior knots 49 19 Spline order 2 3 Time (sec) 781 367 Flops (106) 2950 282 Inverse coordinates transformation The following results were obtained in a 266 MHz Pentium II Notebook. It should be noted that all the figures relate to a constant number of 100 iterations. We considered the case of a MLP with [5 1].9 0. 001 ) BP ( η = 0.9 47 8. as the topology is fixed.9 2.Results of MLPs. 2.3 1.34 . with 64 MB of RAM.35 . and therefore the number of neurons is 6.2.9 23 2 Flops (106) 2.6 0. and the number of parameters is 21.9 2.

3 106 ∞ 3 1014 ∞ 2.89 3 2 1.8 0.4 5.1 104 94 4.001 LM (SC) BP (NC) alfa=0.2 5 2.73 0.18 Condition number 278 128 348 1.15 3.6 Flops (106) .77 0.9 103 6.2.6 4.9 3.54 72.2 104 1.5 103 2.46 0.4 5.8 1.15 1 2 0.2.3 30 44 58 84 0.2 .8 71 18 12.26 1.7 103 3.9 1.4 14 0.5 .9 3.8 1.36 .25 0.2 Radial Basis functions Table 2.7 104 6.9 3.3 22207 55382 3899 6690 Artificial Neural Networks 167 .9 1.9 1.4 0.4 103 30 532 2.5 1016 Number of Neurons 10 20 30 40 50 10 20 30 40 50 10 20 30 40 50 10 10 10 10 Time (sec) 0.2 10.6 3.4.8 2.35 1.1 LM (NC) Error norm 13.3 104 5.4 0.4 7.12 1 0.2 1.6 0.Results of RBFs Number of Parameter s 40 70 100 130 160 40 70 100 130 160 30 70 100 130 160 30 30 30 30 Algorithm Fixed centres Fixed centres Fixed centres Fixed centres Fixed centres Clustering Clustering Clustering Clustering Clustering Adaptive clustering Adaptive clustering Adaptive clustering Adaptive clustering Adaptive clustering BP (SC) alfa=0.

4.27 0.6 10-4 Condition number ∞ ∞ ∞ ∞ Number of interior knots 4.1 5.3 CMACs The following table illustrates the results obtained with CMACs.9 4.10 20.2.43 1.6 The next table illustrates the results with the ASMOD algorithm.05 0.18 1.4.3 Time (sec) 1.017 0. with a varying number of interior knots.4 Flops (106) 2.5 10.81 4. 2.2 are also valid here.37 .4 5 35.52 0.2.8 26.2 3. Table 2.4 B-Splines Condition number 5.9 Spline order 2. using interior knots equally spaced.Results of B-splines (without using ASMOD) Number of Parameter s (Weight vector) 36 121 49 144 Error norm 0.Analysing the last table.89 6 39 The first results are obtained without model selection.38 .83 Flops (106) 0.1. Supervised methods produce excellent results.Results of CMACs Error norm 2.4.20 Number of Parameter s 22 57 177 Time (sec) 0.5 5.27 0. the comments introduced in Section 2.4 9.8 1016 ∞ ∞ Number of interior knots 5.4 9. with all methods (with the smallest network) producing better results than the other methods with larger networks.2 2. Table 2.3 3. A generalization parameter of 3 was employed. 168 Artificial Neural Networks .13 2.

2.4.2 10-4 6 10-4 9.01 Condition number 2. Table 2.045 0. Artificial Neural Networks 169 .Table 2.5 1016 Number of interior knots 9. all models obtain similar values.2 3. The MLPs and the RBF with the LM method obtain a good accuracy with a small number of parameters.(4.2 106 7. and the supervised RBF) are equivalent. Nc New criterion.3 Conclusions The best results obtained for the first example are summarized in Table 2. Sc denotes Standard criterion.5 3.5) Spline order 2. In the algorithm column.3. Focusing our attention now in the 2nd example.40 . In terms of computational time. C Clustering. all values (with the exception of B-splines using the Asmod algorithm. Ac Adaptive clustering and A Asmod.3 10-4 2.Best results (Example 1) Number of Parameters (Weight vector) 33 33 90 70 90 30 51 ANN/ Algorithm MLP (LM/Sc) MLP (LM/Nc) RBF (Fc) RBF (C) RBF (Ac) RBF(LM/Nc) B-splines (A) Error norm 8.8 1017 2.6 2.6 1017 8.5 45 781 The results obtained with CMACs were very poor and are not shown.2 10-4 2 10-4 2.5 1016 2.41illustrates the best results obtained. the number of interior knots is specified in this order. Fc Fixed centers.39 .Results of B-splines (using ASMOD) Number of Parameter s (Weight vector) 61 61 Error norm 0.1.7 3.2 2.(5.2 Time (sec) 3.5 10-4 Condition number 106 101 3. the former with a reasonable conditioning.3 Time (sec) 2354 1479 Flops (106) 4923 977 The final model has two univariate sub-models and one multivariate model.2 10-4 5. Table 2.8 1016 1.40.6) 0. In terms of accuracy obtained. In the last table.

Table 2.41 - Best results (Example 2)

ANN/Algorithm   | Number of parameters (weight vector) | Error norm | Condition number | Time (sec)
MLP (LM/Sc)     | 21 | 0.35  | 250      | 2.1
MLP (LM/Nc)     | 21 | 0.17  | 167      | 2
RBF (LM/Nc)     | 30 | 0.18  | 2.5 1016 | 12.6
B-splines       | 49 | 0.2   | ∞        | 5
B-splines (A2)  | 61 | 0.045 | 2.5 1016 | 2354
B-splines (A3)  | 61 | 0.01  | 2.5 1016 | 1479

We did not consider models with a number of parameters larger than the number of training data, because of overfitting problems. In terms of accuracy, MLPs and RBFs obtain a reasonable accuracy with a smaller number of parameters, the MLPs with an acceptable condition number. B-splines, employing the ASMOD algorithm with spline orders of 2 and 3, achieve very good values, at the expense of a very large computational time; the condition number is also very high.

2.5 - Classification with Direct Supervised Neural Networks

In Section 2.1.1 we have introduced Rosenblatt's perceptron, and in Section 2.1.2 Widrow's Adalines, and used them (somewhat naively) for classification purposes. The scheme used has, however, several problems, namely:

1. There is no rigorous way to determine the values of the desired response (they need not be { -1, 1 } for a perceptron, or { 0, 1 } for an Adaline).
2. In terms of Adalines, what we minimize is the (squared) difference between the target and the net input, which does not have a lot to do with the classification error; this will have impacts on the regression line determined by the algorithm.
3. It is not easy to generalize the scheme to more than 2 classes (we know that we need a discriminant function for each class). Notice that, in this special case where there are only two classes, there is only the need to have one discriminant function, as a new class, Classn, can be introduced, whose discriminant function is given by:

g_Classn(x) = g_Class1(x) - g_Class2(x)    (2.205)

2.5.1 - Pattern recognition ability of the perceptron

We already know that the output of a perceptron is given by:

170 Artificial Neural Networks

y = f(net) = f(wᵀx + b)    (2.206)

where f(.) is the sign function (see Section 1.2). It is obvious that in this case the discriminant function is:

g(x) = wᵀx + b = 0    (2.207)

For the particular case where there are only two inputs, (2.207) can be recast as:

x2 = -(w1/w2) x1 - b/w2    (2.208)

Example 2.19 - AND function revisited

Using the weights generated in Ex. 2.1 (see Section 2.1.1), the plot of the function generated by the perceptron, in the domain [-1, 1] x [-1, 1], is shown in fig. 2.77 (the '*' symbols represent the output 0, while 'o' denotes the output 1).

FIGURE 2.77 - Perceptron output

Using the weights generated by the application of the Perceptron learning rule, (2.208) is given as:

x2 = -2x1 + 2 + ε    (2.209)
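The perceptron of (2.206) is straightforward to sketch in code. The weights below (w = [2, 1], b = -2) are illustrative values chosen here because they reproduce the boundary x2 = -2x1 + 2 of (2.209), ignoring the small ε; they are not the trained weights listed in the text.

```python
def perceptron_output(w, b, x):
    """y = f(w'x + b), with f the sign (threshold) function mapped to {0, 1}."""
    net = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if net >= 0 else 0

# Illustrative weights (assumption, not from the text): boundary x2 = -2*x1 + 2
w, b = [2.0, 1.0], -2.0
print(perceptron_output(w, b, [1.0, 1.0]))  # above the boundary -> 1
print(perceptron_output(w, b, [0.0, 0.0]))  # below the boundary -> 0
```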

This can be seen in the above figure as the intersection of the surface with the x1-x2 plane, or it can be represented in a 2-dimensional graph as:

FIGURE 2.78 - Discriminant function

Notice that a different interpretation can be given to (2.209), or, more generally, to (2.207): the weights control the slope of the decision surface (a line, in this case), and the bias its intersection with the x2 axis.

If we denote by g the points that satisfy (2.207), with b = 0, and by v the vector of the weights, then the equation of the discriminant can be seen as:

g(x) = vᵀx = 0    (2.210)

which shows that g and v are orthogonal.

The perceptron with only one output can separate the data into two classes; a perceptron with M outputs can separate the data into M classes. To see how this works, we shall use the same data, but this time with two outputs. Notice that we can have:

g(x) = g(x)/2 - (-g(x)/2)    (2.211)

that is, the single discriminant can be split into two discriminant functions, g(x)/2 and -g(x)/2, one for each output.

172 Artificial Neural Networks

FIGURE 2.79 - Two discriminant functions

The intersection of these two discriminant functions defines the decision boundary for this problem.

2.5.2 - Large-margin perceptrons

If we focus our attention on fig. 2.78, we see that the application of the Perceptron learning rule will determine a decision boundary which is close to the samples at the class boundary. This is easy to understand, as the rule does not update the weights as long as all the patterns are classified correctly; please notice that there are an infinite number of ways to achieve this. But, from a visual inspection of fig. 2.78, it is clear that we could do much better than this, as we could create a decision boundary which is located in the middle of the class boundaries. The next figure illustrates this option.

FIGURE 2.80 - Hyperplane with largest margin

The blue lines are the hyperplanes which are closest to the class boundaries (they are called the support vectors). Notice that their slope is the same, so that the two lines can be written as:

wᵀxH + bH = 0,  b'H = -bH/w2    (2.212)
wᵀxl + bl = 0,  b'l = -bl/w2    (2.213)

It should be clear to the students that this approach is substantially different from the Perceptron learning rule (which stops updating when all the patterns are correctly classified) and from the delta rule, which minimizes the squared error between the targets and the outputs. In this latter case, all samples contribute to the error, while in the large-margin perceptron only the samples that define the support vectors are considered.

The margin of a hyperplane to the sample set S is defined as:

ϒ = min_{x ∈ S} ρ(x)    (2.214)

where ρ(x) denotes the distance of x to the hyperplane. We want to determine the vector w and the biases bH and bl that produce the hyperplane with the largest margin, while maintaining a correct classification. This can be done by maximizing the sum of the two margins (each one for its class), or, according to fig. 2.80, b'H - b'l. Following (2.212) and (2.213), this can be formulated as:

Max (bl - bH)
s.t. wᵀxH + bH >= 0
     wᵀxl + bl < 0
     bl - bH > 0    (2.215)

This is implemented in the function large_mar_per.m. Using the example of the AND function, the optimal hyperplane is obtained as 0.5x1 + 0.5x2 - 0.75 = 0. It can easily be seen that this is a line parallel to the line passing through the points (0, 1) and (1, 0) (the boundary points of class 0), and situated half-way between this line and a parallel line passing through the point (1, 1) (the boundary point of class 1).

2.5.2.1 - Quadratic optimization

The same solution can be found using quadratic optimization, which will additionally enable the treatment of the case where the patterns are not linearly separable. This should be done by ensuring that the (optimal) hyperplane correctly classifies the samples.

174 Artificial Neural Networks
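Returning to the AND example, the half-way construction described above can be checked numerically. A sketch in pure Python (the direction w = (0.5, 0.5) is taken from the solution 0.5x1 + 0.5x2 - 0.75 = 0 quoted in the text; the sketch only verifies the half-way property, it does not re-run large_mar_per.m):

```python
# Boundary samples of the AND data (see Example 2.19): class 0 touches the
# line x1 + x2 = 1, class 1 the parallel line x1 + x2 = 2.
class0 = [(0.0, 1.0), (1.0, 0.0)]
class1 = [(1.0, 1.0)]

w = (0.5, 0.5)                            # slope shared by the two support lines
offset = lambda p: w[0] * p[0] + w[1] * p[1]

b0 = max(offset(p) for p in class0)       # support line touching class 0 -> 0.5
b1 = min(offset(p) for p in class1)       # support line touching class 1 -> 1.0
b_mid = (b0 + b1) / 2                     # half-way: 0.5*x1 + 0.5*x2 - 0.75 = 0

# Both classes end up at the same distance from the separating line:
m0 = min(abs(offset(p) - b_mid) for p in class0)
m1 = min(abs(offset(p) - b_mid) for p in class1)
```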

2.5.2.1.1 - Linearly separable patterns

To introduce the method, assume first that the patterns are linearly separable. Admitting that the desired output belongs to the set { -1, 1 }, we have:

wᵀxi + b >= 0, di = 1
wᵀxi + b < 0, di = -1    (2.216)

The discriminant function gives a measure of the distance of x to the optimal hyperplane (denoted by wo, bo):

g(x) = woᵀx + bo    (2.217)

Perhaps the easiest way to see this is to express x as:

x = xp + r wo/||wo||    (2.218)

where xp is the normal projection of x onto the optimal hyperplane, and r is the desired algebraic distance (r is positive if x is on the positive side of the optimal hyperplane, and negative if it is on the negative side).

FIGURE 2.81 - Interpretation of (2.218)

Looking at fig. 2.81, we can see that wo/||wo|| is a unitary vector in the same direction as wo. As, by definition, g(xp) = 0, replacing (2.218) in (2.217), we have:

Artificial Neural Networks 175

g(x) = woᵀxp + bo + r woᵀwo/||wo|| = r ||wo||    (2.219)

since g(xp) = 0. This means that:

r = g(x)/||wo||    (2.220)

In particular, the distance of the optimal hyperplane to the origin is given by (2.220) with x = 0, and is:

r = bo/||wo||    (2.221)

The issue here is to find wo, bo for the optimal hyperplane. Note that (2.216) can be recast (we can always multiply wo, bo by a constant) as:

woᵀxi + bo >= 1, di = 1
woᵀxi + bo <= -1, di = -1    (2.222)

Notice that, with this formulation, the support vectors are the ones that satisfy (2.222) as equalities. Consider a support vector x(s), for which d(s) = ±1. Then:

g(x(s)) = woᵀx(s) + bo = ±1    (2.223)

We have, for those vectors:

r = 1/||wo||, d(s) = 1
r = -1/||wo||, d(s) = -1    (2.224)

If we denote by ρ the optimal margin of separation, then we have:

2ρ = 2r = 2/||wo||    (2.225)

The last equation means that maximizing the margin of separation is equivalent to minimizing the norm of the weight vector. Therefore, we can determine the optimal hyperplane by solving the following quadratic problem:

176 Artificial Neural Networks

Min wᵀw/2
s.t. di(wᵀxi + b) >= 1, i = 1, 2, ..., N    (2.226)

This is called the primal problem. The primal problem deals with a convex function and linear constraints; according to optimization theory, we can construct a dual problem, with the same optimal values as the primal problem, but where the Lagrange multipliers provide the optimal solution.

We can solve this by using the method of Lagrange multipliers, constructing the Lagrangian function:

J(w, b, α) = wᵀw/2 - Σ_{i=1}^{N} αi (di (wᵀxi + b) - 1),  αi >= 0, i = 1, 2, ..., N    (2.227)

The solution of this problem is obtained by differentiating the last function with respect to w and b. We therefore get 2 optimality conditions:

∂J(w, b, α)/∂w = 0    (2.228)
∂J(w, b, α)/∂b = 0    (2.229)

Application of the 1st condition to (2.227) gives:

w = Σ_{i=1}^{N} αi di xi    (2.230)

and of the 2nd condition:

Σ_{i=1}^{N} αi di = 0    (2.231)

Notice that, at the saddle point, the multipliers αi are non-zero only for those constraints that are met as equalities (the support vectors):

αi (di (wᵀxi + b) - 1) = 0, i = 1, 2, ..., N    (2.232)

To postulate the dual problem, we first expand (2.227) term by term:

Artificial Neural Networks 177

J(w, b, α) = wᵀw/2 - Σ_{i=1}^{N} αi di wᵀxi - b Σ_{i=1}^{N} αi di + Σ_{i=1}^{N} αi    (2.233)

Due to (2.231), the 3rd term on the r.h.s. of (2.233) is zero. Also, using (2.230), we have:

wᵀw = Σ_{i=1}^{N} αi di wᵀxi = Σ_{i=1}^{N} Σ_{j=1}^{N} αi αj di dj xiᵀxj    (2.234)

so that the subtraction of the first two terms on the r.h.s. of (2.233) is -wᵀw/2, and therefore (2.233) can be recast as:

Q(α) = Σ_{i=1}^{N} αi - (1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} αi αj di dj xiᵀxj    (2.235)

(2.235) must be maximized w.r.t. the Lagrange multipliers, subject to the constraints:

Σ_{i=1}^{N} αi di = 0    (2.236)

αi >= 0, i = 1, 2, ..., N    (2.237)

Notice that the problem has been reformulated only in terms of the training data, and does not depend on the classifier parameters w, b.

Once the optimal αoi have been found, the classifier parameters can be obtained using (2.230):

wo = Σ_{i=1}^{N} αoi di xi    (2.238)

and bo using (2.223) for a support vector (bo = 1 - woᵀx(s), with d(s) = 1).

An implementation of this algorithm is given in lagrange.m. Applying it to the AND problem, we get:

2x1 + 2x2 - 3 = 0    (2.239)

Notice that in the implementation of this function, the maximization of (2.235) is formulated as the minimization of a quadratic function:

178 Artificial Neural Networks

Q'(α) = αᵀHα - fᵀα    (2.240)

where H = [di dj xiᵀxj] and f is a row vector of ones.

2.5.2.1.2 - Non-linearly separable patterns

The problem here is that it is not possible to construct a separating hyperplane that correctly classifies all the data. Nevertheless, we would like to construct a hyperplane which minimizes the probability of classification error, averaged over the training set. The margin of separation between classes is said to be soft if a data point violates the following condition:

di(wᵀxi + b) >= 1, i = 1, 2, ..., N    (2.241)

This violation can occur in one of two ways:

1. The violating data point (xi, di) falls inside the region of separation, but on the right side of the decision surface; in this case we still have a correct classification.
2. The violating data point (xi, di) falls on the wrong side of the decision surface; in this case we have a misclassification.

FIGURE 2.82 - Different types of violations: a) right side of the decision surface; b) wrong side of the decision surface

For a treatment of this case, we introduce a new set of non-negative scalar variables, {εi}, i = 1, ..., N, into the definition of the separating hyperplanes (see (2.222) and (2.223)):

di(wᵀxi + b) >= 1 - εi, i = 1, ..., N    (2.243)

Artificial Neural Networks 179
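The slack constraint (2.243) can be sketched numerically. The following pure-Python sketch uses the AND hyperplane (2.239), w = (2, 2), b = -3; the two extra points are hypothetical, added here only to illustrate the two kinds of violation:

```python
def g(w, b, x):
    """Discriminant g(x) = w'x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def slack(w, b, x, d):
    """epsilon_i = max(0, 1 - d_i * g(x_i)): the violation of d_i*g(x_i) >= 1."""
    return max(0.0, 1.0 - d * g(w, b, x))

w, b = [2.0, 2.0], -3.0   # AND hyperplane of eq. (2.239)

# The AND patterns themselves are separable: every slack is zero.
and_data = [([0, 0], -1), ([0, 1], -1), ([1, 0], -1), ([1, 1], 1)]
eps = [slack(w, b, x, d) for x, d in and_data]

# Hypothetical extra points (not from the text), one for each violation type:
eps_margin = slack(w, b, [0.8, 0.8], 1)   # 0 < eps <= 1: inside margin, correct side
eps_wrong  = slack(w, b, [0.6, 0.6], 1)   # eps > 1: wrong side, misclassification
```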

The εi are called slack variables; they measure the deviation of the ith data point from the ideal condition of pattern separability. For 0 <= εi <= 1, the data point falls inside the region of separation, but on the correct side of the decision surface (see fig. 2.82 a)), while for εi > 1 there is a misclassification (see fig. 2.82 b)). The support vectors are those data points for which (2.243) is satisfied exactly, even with εi > 0.

Our goal is to find a separating hyperplane for which the misclassification error, averaged over the training set, is minimized. One way to do this is to minimize:

Φ(ε) = Σ_{i=1}^{N} I(εi - 1)    (2.244)

where I(.) is an indicator function, defined as:

I(ε) = 0, ε <= 0
I(ε) = 1, ε > 0    (2.246)

Unfortunately this is a non-convex, NP-hard problem. To make it tractable, we approximate (2.244) by:

Φ(ε) = Σ_{i=1}^{N} εi    (2.245)

The minimization of (2.245) must be done ensuring that (2.243) is satisfied, and by also minimizing ||w||. Therefore, we simplify the computation by formulating the problem as:

Φ(w, ε) = wᵀw/2 + c Σ_{i=1}^{N} εi    (2.247)

In (2.247), the first term minimizes the Euclidean norm of the weight vector (which is related to minimizing the complexity of the model, or yet to making it better conditioned), while the 2nd term minimizes the occurrences of misclassifications. (2.247) must be minimized subject to the constraint (2.243). The parameter c has a role similar to the regularization parameter (see Section 2.1.7), and it must be selected by the user. Notice that this optimization problem includes the linearly separable problem (see (2.226)) as a special case, setting {εi} = 0.

Therefore, the primal problem for the case of non-separable patterns can be formulated as:

180 Artificial Neural Networks

Min wᵀw/2 + c Σ_{i=1}^{N} εi
s.t. di(wᵀxi + b) >= 1 - εi, i = 1, 2, ..., N
     εi >= 0    (2.248)

Using the Lagrange multipliers, and proceeding in a manner similar to that described previously, we can formulate the dual problem for the case of non-separable patterns. We must maximize:

Q(α) = Σ_{i=1}^{N} αi - (1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} αi αj di dj xiᵀxj    (2.249)

subject to the constraints:

Σ_{i=1}^{N} αi di = 0    (2.250)

0 <= αi <= c, i = 1, 2, ..., N

Notice that, in this formulation, the case of non-separable patterns only differs from the separable case in the last constraint, where an upper limit (c) is put on the Lagrange multipliers. The optimal solution wo, bo is obtained as in the separable case, using (2.237) and (2.238).

2.5.3 - Classification with multilayer feedforward networks

In the same way as in nonlinear regression problems, the topology of feedforward networks can be seen as a nonlinear mapping: the first hidden layer(s) perform a nonlinear mapping between the input space and a feature space, followed by a linear mapping. For classification purposes, this can be seen as a nonlinear mapping, followed by the construction of an optimal hyperplane for separating the features discovered in the nonlinear layers.

Artificial Neural Networks 181

In order to obtain a good generalization, the dimensionality of the feature space is made large on purpose, so that a decision surface in the form of a hyperplane can be constructed in this space.

FIGURE 2.83 - Non-linear mapping between the input space and the feature space

Notice that the procedure used to build a support vector machine is fundamentally different from the one usually employed in the design of neural networks for nonlinear regression. In the latter case, the complexity of the model is controlled by keeping the number of neurons in the hidden layer small; here, only a fraction of the data points (the support vectors) are used in the construction of the decision surface.

In this case, we can use the solution presented in the previous sections, provided that, in the equations, the input data are replaced with their image in the feature space. Therefore, (2.249) is updated as:

Q(α) = Σ_{i=1}^{N} αi - (1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} αi αj di dj K(xi, xj)    (2.251)

where K(xi, xj) is denoted the inner-product kernel:

K(xi, xj) = φ(xi)ᵀφ(xj) = Σ_{l=0}^{ml} φl(xi) φl(xj)    (2.252)

It is assumed that the dimension of the feature space is ml, and that φ0(x) = 1 for all x. For φ(x) to yield an inner-product kernel, it must satisfy Mercer's theorem (see [125], Chapter 6). It happens that the inner-product kernels for polynomial and radial-basis functions always satisfy Mercer's theorem; the one-hidden-layer perceptron only satisfies it for some values of the weight parameters.

Example 2.20 - XOR problem ([125])

Let us assume that a polynomial inner-product kernel is used:

k(x, xi) = (1 + xᵀxi)²    (2.253)

182 Artificial Neural Networks
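The identity (2.252) can be checked numerically for the kernel (2.253), using the 6-dimensional feature map φ(x) = [1, √2 x1, √2 x2, x1², x2², √2 x1 x2] derived in the example below (eq. (2.255)). A pure-Python sketch:

```python
import math

def kernel(x, z):
    """Polynomial inner-product kernel k(x, z) = (1 + x'z)^2, eq. (2.253)."""
    return (1.0 + sum(a * b for a, b in zip(x, z))) ** 2

def phi(x):
    """Feature map of eq. (2.255); note phi_0(x) = 1 for all x."""
    r2 = math.sqrt(2.0)
    return [1.0, r2 * x[0], r2 * x[1], x[0] ** 2, x[1] ** 2, r2 * x[0] * x[1]]

x, z = [-1.0, 1.0], [1.0, 1.0]
lhs = kernel(x, z)                                # (1 + x'z)^2
rhs = sum(a * b for a, b in zip(phi(x), phi(z)))  # phi(x)'phi(z): same value
```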

The input data and the target vector are:

Table 2.42 - XOR data

x: (-1,-1)  (-1,+1)  (+1,-1)  (+1,+1)
t:   -1       +1       +1       -1

Therefore, in this case, the input dimension is mi = 2 and the number of patterns is M = 4. Expanding (2.253), we have:

k(x, xi) = (1 + x1 xi1 + x2 xi2)² = 1 + 2x1 xi1 + 2x2 xi2 + (x1 xi1)² + (x2 xi2)² + 2x1 xi1 x2 xi2    (2.254)

The dimensionality of the feature space is, in this case, 6. As k(x, xi) = φ(x)ᵀφ(xi),

φ(x) = [1  √2 x1  √2 x2  x1²  x2²  √2 x1 x2]ᵀ    (2.255)

The matrix of inner-product kernels is then:

K = {k(xi, xj)}, i, j = 1, ..., N =

9 1 1 1
1 9 1 1
1 1 9 1
1 1 1 9    (2.256)

The objective function, in dual form (2.251), is given as:

Q(α) = α1 + α2 + α3 + α4 - (1/2)(9α1² + 9α2² + 9α3² + 9α4² - 2α1α2 - 2α1α3 + 2α1α4 + 2α2α3 - 2α2α4 - 2α3α4)    (2.257)

Optimizing (2.257) w.r.t. the Lagrange multipliers, we have:

Artificial Neural Networks 183

9α1 - α2 - α3 + α4 = 1
-α1 + 9α2 + α3 - α4 = 1
-α1 + α2 + 9α3 - α4 = 1
α1 - α2 - α3 + 9α4 = 1    (2.258)

The solution of this system of equations gives:

α̂i = 1/8, i = 1, ..., 4    (2.259)

This means that all 4 input vectors are support vectors. The optimal value of the criterion is Q(α̂) = 1/4. As

ŵ = Σ_{i=1}^{4} α̂i di φ(xi) = (1/8)[-φ(x1) + φ(x2) + φ(x3) - φ(x4)] = [0  0  0  0  0  -1/√2]ᵀ    (2.260)

we finally have:

y = φ(x)ᵀŵ = [1  √2 x1  √2 x2  x1²  x2²  √2 x1 x2] [0  0  0  0  0  -1/√2]ᵀ = -x1 x2    (2.261)

184 Artificial Neural Networks
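Both steps of this example can be verified in a few lines of pure Python (a sketch, not the lagrange.m implementation mentioned earlier): solving the linear system (2.258) recovers α̂i = 1/8, and the resulting decision function (2.261) reproduces the XOR targets of Table 2.42.

```python
def solve(A, b):
    """Gaussian elimination with partial pivoting (enough for this 4x4 system)."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for k in range(n):
        p = max(range(k, n), key=lambda r: abs(M[r][k]))
        M[k], M[p] = M[p], M[k]
        for r in range(k + 1, n):
            f = M[r][k] / M[k][k]
            for c in range(k, n + 1):
                M[r][c] -= f * M[k][c]
    x = [0.0] * n
    for k in range(n - 1, -1, -1):
        x[k] = (M[k][n] - sum(M[k][c] * x[c] for c in range(k + 1, n))) / M[k][k]
    return x

# Optimality conditions (2.258) of the XOR problem
A = [[ 9, -1, -1,  1],
     [-1,  9,  1, -1],
     [-1,  1,  9, -1],
     [ 1, -1, -1,  9]]
alphas = solve(A, [1.0] * 4)   # all equal to 1/8, eq. (2.259)

# The decision function (2.261), y = -x1*x2, reproduces the XOR targets
xor = [([-1, -1], -1), ([-1, 1], 1), ([1, -1], 1), ([1, 1], -1)]
ok = all(-x[0] * x[1] == t for x, t in xor)
```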

Notice that, using this inner-product kernel, the feature space has a dimensionality of 6. For this particular problem, however, this dimensionality is reduced to 1, and we can see the problem as in fig. 2.84:

[FIGURE 2.84 - The XOR problem in the input and in the feature spaces]

Notice that in the feature space all points are support vectors (points, in this case), and that the optimal decision line (a point) is equidistant to the transformed points.

Example 2.21 - Classification of humans

Let us revisit the example of classification of males and females described in Section 2.1.1 of Appendix 2. The Matlab file height_weight_gen.m will be used, with the same parameters employed before, to generate 1000 samples. This data (which can be found in the file dataex221.mat) is divided equally into training and test data (fig. 2.85 shows the examples employed for training). Fig. 2.86 shows the test data, with the theoretical decision boundary (in green) and the estimated one (in black). If we employ a Bayesian classifier, the probability of correctly classifying the data using the theoretical discriminant functions is 83.2%; if the discriminant functions estimated with the training data were used, we would obtain 83%.

[FIGURE 2.85 - Training data (Height (m) vs. Weight (Kg))]

[FIGURE 2.86 - Test data and decision boundaries (Height (m) vs. Weight (Kg))]

If support vector machines are now employed, using sup_vec_mac.m, the following tables illustrate the results obtained. The algorithm does not converge with 500 training data, and therefore smaller training sets are employed.

Table 2.43 - Results obtained with the support vector machines - 100 training data

C                        1      10     10     1
σ                        0.1    0.1    0.2    0.2
Number of centers        100    100    93     95
Prob. correct class.     0.71   0.71   0.71   0.70

Table 2.44 - Results obtained with the support vector machines - 200 training data (N.C.: not converged)

C                        1      10     10     1
σ                        0.1    0.1    0.2    0.2
Number of centers        198    196    178    N.C.
Prob. correct class.     0.71   0.71   0.71   N.C.

Finally, the Matlab function Train_MLPs.m was used, with 2 and 4 hidden neurons. The Levenberg-Marquardt method, minimizing the new criterion, was employed, and the training stopped as a result of early-stopping, together with the validation data. Table 2.45 illustrates the results obtained. As we can see, even with a small network with 2 hidden neurons, we obtain results equivalent to the ones obtained with the Bayesian classifier.

Table 2.45 - Results obtained with MLPs - 200 training data, τ = 10^-6

Neurons                  2      4
Final iteration          39     15
Prob. correct class.     0.836  0.83

Afterwards, the Matlab function Train_RBFs.m was used, with 3 and 4 hidden neurons. The initial nonlinear parameters were obtained with the function rbf_fc.m. The Levenberg-Marquardt method, minimizing the new criterion, was employed; the same 200 training data was used, and the training stopped as a result of early-stopping, together with the validation data. Table 2.46 illustrates the results obtained. As we can see, even with a small network with 3 hidden neurons, we obtain results equivalent to the ones obtained with the Bayesian classifier.

Table 2.46 - Results obtained with RBFs - 200 training data, τ = 10^-6

Neurons                  3      4
Final iteration          70     17
Prob. correct class.     0.832  0.838

Summarizing: as the data was generated with a bivariate Gaussian distribution, we know that the Bayesian classifier is really the best method for this particular problem. With this classifier, 12 parameters (4 means and 8 covariance values) must be estimated. With MLPs, we obtain essentially the same results with a 2-hidden-neuron network; in this case we must estimate 6 nonlinear parameters + 3 linear parameters, therefore 9 in total. This is an example of the flexibility of the mapping capabilities of the MLP. The same can be said about RBFs, where we obtain equivalent results with a network with 9 nonlinear parameters + 3 linear parameters, therefore 12 parameters in total, as in the case of the Bayesian classifier. Finally, the performance of the support vector machines is disappointing: it was impossible to obtain probabilities of correct classification greater than 71%, even with large networks (the smallest has 93 linear parameters). This is somewhat expected, as a fixed nonlinear mapping is used, while in all the other cases the best nonlinear mapping is searched for.

CHAPTER 3 - Unsupervised Learning

Chapter 2 was devoted to supervised learning. In several situations, however, there is no teacher that can help in the learning process. We can, nevertheless, devise mechanisms that are able to form internal representations for encoding features of the input data, which can afterwards be used for classification purposes. A range of these techniques, which do not imply competition among neurons, will be described in Section 3.1. We can also use competitive learning paradigms, by allowing neurons to compete with each other for the opportunity to be tuned to features found in the input data; those will be studied in Section 3.2.

3.1 Non-competitive learning

In 1949 Donald Hebb proposed a plausible mechanism for learning, relating the increase in the neuronal excitation effect with the frequency with which the neurons are excited. Hebb postulated that whenever one neuron excited another, the threshold of excitation of the latter neuron decreased, so that this process facilitated communication between neurons upon repeated excitation. This principle has become very important in neurocomputing, and is actually the basis for all mechanisms of unsupervised learning.

The next section will introduce the basic Hebbian learning rule. Principal component analysis, a technique used in several different areas of science for purposes such as data reduction and data reconstruction, is discussed in Section 3.1.2. The anti-Hebbian rule is discussed in Section 3.1.3, and linear associative memories in Section 3.1.4. This section ends by revisiting the LMS rule, expressing it in terms of Hebbian learning, and relating it with the autocorrelation of the input data and gradient search.

3.1.1 Hebbian rule

When the most important learning mechanisms were described (see Section 1.3.3.1), the Hebbian rule was already mentioned (actually, two types of Hebbian rules). In this section we shall start with the original, unsupervised rule, due to Donald Hebb [18].

In terms of artificial neural networks, we can model this process by increasing the weight between a pair of neurons whenever there is activity between the two neurons. If we denote the output of each neuron by y, and its input by x, we have:

Δw_ij = η x_i y_j   (3.1)

Notice that, in contrast with the other learning rules introduced before, there is no target, which means that learning is unsupervised.

Assume that the neuron is linear, y = w^T x, and that the weights and inputs are normalized. The output is then given as:

y = w^T x = x^T w = ||w|| ||x|| cos θ   (3.2)

A large y means that θ is small, or, in other words, that the input is close to the weight vector; a small y, on the other hand, means that the weight vector is nearly orthogonal to the input vector. The Hebbian neuron therefore produces a similarity measure of the input with respect to the weights stored in the neuron, and thus it implements a type of memory which is called associative memory.

In this case, the Hebbian rule is:

w[k+1] = w[k] + η x[k] y[k] = w[k] (1 + η x^2[k])   (3.3)

It can easily be seen that, provided that the learning rate is positive, the weight norm will always increase: the learning process is always unstable.

Now consider that there is more than one input to our neuron. The update of the weight vector can be described as:

Δw[k] = η x[k] y[k] = η x[k] x^T[k] w[k]   (3.4)

Assume that we have N patterns and that we apply this rule, employing off-line learning, i.e., the weights are only updated after a complete set of presentations is presented. Starting with an initial weight vector w[0], we therefore have:

Δw = η ( Σ_{k=1}^{N} x[k] x^T[k] ) w[0]   (3.5)

We can see that the expression within brackets in (3.5) is the sample approximation (apart from the factor 1/N) of the autocorrelation of the input data:

R_x = E[x x^T]   (3.6)
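The instability of the plain Hebbian update can be seen in a few lines. The following NumPy sketch (illustrative, with assumed synthetic data) tracks the weight norm under repeated applications of (3.1):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))        # zero-mean input patterns (assumed data)
w = 0.01 * rng.normal(size=3)
eta = 0.01

norms = []
for x in X:
    y = w @ x                        # linear neuron output
    w = w + eta * x * y              # plain Hebbian update, per (3.1)
    norms.append(np.linalg.norm(w))

print(norms[0], norms[-1])           # the weight norm only grows: learning is unstable
```

Each update multiplies w by (I + η x x^T), whose eigenvalues are never smaller than 1, so the norm is non-decreasing at every step and diverges in the long run.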

3.1.1.1 Hebbian rule and autocorrelation

According to (3.5), the Hebbian algorithm is effectively updating the weights with a sample estimate of the autocorrelation function of the input data:

Δw = η R̂_x w[0]   (3.7)

3.1.1.2 Hebbian rule and gradient search

The power at the output of a Hebbian neuron is defined as:

V = (1/N) Σ_{k=1}^{N} y^2[k] = (1/N) Σ_{k=1}^{N} w^T x[k] x^T[k] w = w^T ( (1/N) Σ_{k=1}^{N} x[k] x^T[k] ) w = w^T R w   (3.8)

This is a quadratic equation in w, and it enables us to interpret V as a performance surface. As R is positive definite, V is a paraboloid facing upwards and passing through the origin of the weight space. If (3.8) is differentiated with respect to w, we have:

δV/δw = 2 R w   (3.9)

Notice that this equation and (3.4) indicate that the Hebbian rule performs gradient ascent. As V is unbounded upwards, the Hebbian rule will diverge. To make it useful, we must have a stable version.

3.1.1.3 Oja's rule

As was seen above, Hebbian learning is an unstable process. To make it stable, we can normalize the weights; any scaling factor could be used, but we shall use a unitary norm of the weight vector. This can be implemented by changing (3.4) to:

w_i[k+1] = ( w_i[k] + η x_i[k] y[k] ) / sqrt( Σ_i ( w_i[k] + η x_i[k] y[k] )^2 )   (3.10)

This can be approximated by:

w_i[k+1] = w_i[k] + η y[k] ( x_i[k] − y[k] w_i[k] ) = w_i[k] (1 − η y^2[k]) + η y[k] x_i[k]   (3.11)

This update is known as Oja's rule [138]. Notice that, comparing this last equation with (3.4), the difference is the forgetting factor (1 − η y^2[k]) applied to the weights, which keeps the weight norm bounded.
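Oja's rule (3.11) can be verified numerically: starting from a random vector, the weights converge to a unit-norm vector aligned with the principal eigenvector of the input autocorrelation matrix. A NumPy sketch, with assumed synthetic data:

```python
import numpy as np

rng = np.random.default_rng(1)
# Correlated 2-D data with a dominant direction (assumed covariance)
C = np.array([[3.0, 1.0], [1.0, 1.0]])
X = rng.multivariate_normal([0, 0], C, size=5000)

w = rng.normal(size=2)
eta = 0.005
for _ in range(10):                  # a few passes over the data
    for x in X:
        y = w @ x
        w += eta * y * (x - y * w)   # Oja's rule (3.11)

R = X.T @ X / len(X)                 # sample autocorrelation matrix
lam, V = np.linalg.eigh(R)
v1 = V[:, -1]                        # eigenvector of the largest eigenvalue
print(np.linalg.norm(w), abs(w @ v1))   # both approach 1
```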

3.1.2 Principal Component Analysis

Oja's rule, introduced in the last section, discovers a weight vector which is colinear with the principal component of the input data. Hebbian learning (3.4) finds the weight vector such that the variance of the neuron output (the projection of the input onto the weight vector) is maximum, and Oja's rule (3.11), as it only normalizes the weight vector to 1, does not change its direction. Oja's rule therefore finds a weight vector w that satisfies the relation:

R w = λ_1 w   (3.12)

As is known, (3.12) is the defining equation of the eigenvalues/eigenvectors of R, the autocorrelation matrix. It can be proved that λ_1 is the largest eigenvalue of R, and therefore a neuron trained with Oja's rule is usually called the maximum eigenfilter. The goal now is to find other directions where the data also has important variance.

We know, from Section 5 in Appendix 1, that a matrix A (K observations × D dimensions) can be decomposed as:

A = U S V^T   (3.13)

Using (3.13), we have:

A^T A = V S^T U^T U S V^T = V S^T S V^T   (3.14)

The matrix A^T A is square and symmetric. It is easy to see that the columns of V are the eigenvectors of A^T A, as A^T A V = V S^T S V^T V = V S^T S, and S^T S is a diagonal matrix of the eigenvalues.

Let us now consider that we have a random vector X (K × D) = [x_1 … x_D], with mean zero and covariance R = E[X^T X]. We know that R is a symmetric matrix of size D. Let us now have a matrix W, with dimensions D × M, M < D, whose columns are a subset of the columns of V. The projection of the data X onto the subspace spanned by the columns of W is given as:

Y = X W   (3.15)

The reconstructed data from the projections is given as:

X̂ = Y W^T = X W W^T   (3.16)

If we compute the sum of the squares of the errors between the reconstructed data and the actual data:

J = tr( (X − X̂)^T (X − X̂) ) = tr( (X (I − W W^T))^T (X (I − W W^T)) )
  = tr( (I − W W^T) X^T X (I − W W^T) ) = tr( (I − W W^T) V S^T S V^T (I − W W^T) )   (3.17)

where tr() denotes the trace operation. Notice that, as W^T V has the form [I 0], (I − W W^T) V is a matrix whose first M columns are null and whose last D − M columns are identical to the corresponding columns of V. Therefore, the argument of (3.17) can be seen as:

V | 0    0        | V^T = V_{D−M} λ_{D−M} V_{D−M}^T   (3.18)
  | 0  λ_{D−M}   |

The trace operation is equal to:

tr( V_{D−M} λ_{D−M} V_{D−M}^T ) = Σ_{i=M+1}^{D} λ_i   (3.19)

Using similar arguments, it can be shown that:

tr( X^T X ) = Σ_{i=1}^{D} λ_i   (3.20)

and that:

tr( Y^T Y ) = Σ_{i=1}^{M} λ_i   (3.21)

Therefore, (3.17) can be expressed as:

J = tr( X^T X ) − tr( Y^T Y )   (3.22)

Minimizing J is equivalent to maximizing (3.21), which means that we are looking for a basis which maximizes the variance of the projected data. This basis is the space spanned by the eigenvectors of the covariance matrix; the projections of the data onto this space are called the principal components, and the method is called principal component analysis. PCA is doing nothing more than operating with the eigenstructure of the covariance matrix of the data. In terms of neural networks, PCA is used for data compression.
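Equations (3.15)-(3.22) can be checked numerically: the mean reconstruction error equals the sum of the discarded eigenvalues. A NumPy sketch, with assumed synthetic data:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 4)) @ np.diag([3.0, 2.0, 1.0, 0.5])
X = X - X.mean(axis=0)                   # zero-mean data matrix, K x D

R = X.T @ X / len(X)                     # sample covariance
lam, V = np.linalg.eigh(R)               # returned in ascending order
lam, V = lam[::-1], V[:, ::-1]           # sort eigenpairs in descending order

M = 2
W = V[:, :M]                             # D x M matrix of the first M eigenvectors
Y = X @ W                                # projections (3.15)
Xhat = Y @ W.T                           # reconstructions (3.16)

J = np.sum((X - Xhat) ** 2) / len(X)     # mean squared reconstruction error
print(J, lam[M:].sum())                  # J equals the sum of the discarded eigenvalues
```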

Example 3.1 - Data reconstruction

Let us use the function used to generate the data of the example of Section 2.1.1 of Appendix 2, this time with the values:

Table 3.1 - Parameters

                      C1                      C2
Mean                  (0.2, 0.2)              (2, 2)
Covariance matrix     [0.1 0.01; 0.01 0.1]    [0.1 0.01; 0.01 0.1]

The data used is plotted in fig. 3.1.

[FIGURE 3.1 - Original data (x_2 vs. x_1)]

If PCA is now used, the eigenvectors of the covariance matrix are computed; the following figure incorporates them, in green. The Bayesian decision boundary is displayed in black.

[FIGURE 3.2 - Original data, with the new basis]

Projecting the data onto the new axes, we obtain the transformed data:

[FIGURE 3.3 - Transformed data (y_2 vs. y_1)]

Notice that the classes are more easily separable if the decision surface is taken along the axis y_1 than along the axis y_2. Notice, however, that this is not always the case, as PCA chooses the projection that best reconstructs the data, which may or may not be the optimal projection for discrimination between classes. If it was meant to reduce the dimensionality of the problem, the transformation matrix would only incorporate one eigenvector, the one corresponding to the largest eigenvalue (variance). Doing that, our transformed data would only incorporate one dimension, y_1; the result is shown in fig. 3.4. If we then want to reconstruct the original data, applying (3.16) would produce fig. 3.5. Obviously, in this case, the reconstructed data only contains information along the first eigenvector, as the other component is lost.

[FIGURE 3.4 - Transformed data (with only 1 dimension)]

[FIGURE 3.5 - Reconstructed data]

Example 3.2 - Data Compression

Image processing is an area where data compression is used significantly. The following picture displays a VGA (640×480) image, in gray scale. To display it, the Image Processing Toolbox function imshow is employed.

[FIGURE 3.6 - Original picture of João]

This, of course, can be interpreted as a matrix. If we transpose it, it can be seen as 640 samples, each one with 480 dimensions. Applying PCA, and reconstructing the image using the first 261 components, we end up with fig. 3.7. If we reduce the number of dimensions to only 100, we stay with fig. 3.8, where the decrease in detail is rather obvious.

[FIGURE 3.7 - Picture of João, with 261 components]

[FIGURE 3.8 - Picture of João, with only 100 components]

Example 3.3 - Input Selection

It is often the case, specially in bioengineering applications, that neural networks are used to model complex processes, where the model output(s) depend on a large number of variables, but the degree of relationship is not known a priori. In these cases, a pre-processing phase, known as input selection, is performed before actually training the model, in order to select, from the whole set of available inputs, a subset of the most "important" inputs for the model. This is usually done with PCA. As this is an unsupervised process, which does not take into account the relations between inputs and output, and which assumes a linear model, not a nonlinear one (which is the case of neural models), care must be taken when applying this technique.

Let us use in this example the data of Ex. 2.9. The next figure illustrates the network output after the first iteration of the Levenberg-Marquardt method.

[FIGURE 3.9 - Inverse Coordinate problem]

This means that the output is exactly reproduced by a neural model with this topology and with the specific set of parameters. Let us consider that we did not know this, and that we apply PCA to the original 2 inputs, plus an additional one which is taken from a uniform distribution. If we perform an eigenanalysis of the covariance matrix of this enlarged input data, we obtain three eigenvalues, the first two related with the original input data, and the third one with the fictitious data. If these results were used to select 2 out of the 3 inputs, input x_2 would be discarded and our fictitious input would be considered, which is a mistake. We must remember that PCA is an unsupervised method.

3.1.3 Anti-Hebbian rule

We have seen, in Section 3.1.1, that the Hebbian rule discovers the directions in space where the input data has the largest variance, and that it can be seen as performing gradient ascent in the output power field. Let us now change (3.1) slightly, so that we have:

Δw_ij = −η x_i y_j   (3.23)

This is called the anti-Hebbian rule and, in opposition to the original rule, it performs gradient descent in the output power field, which means that it will always try to produce zero output. This rule will find weights corresponding to the null-space (or orthogonal space) of the data: if the data spans the whole input space, the only way to achieve zero output is with null weights; otherwise, it will find a weight vector which is orthogonal to the input.

Example 3.4 - Application of the anti-Hebbian rule

Suppose that we consider a model with three inputs, but where the third input is a linear combination of the other two: x_3 = x_1 + x_2. In this case, we can always find a direction which is orthogonal to the space spanned by the input patterns. Using 10 random patterns, the application of (3.23), starting from a random initial weight vector, gives the following results:

[FIGURE 3.10 - Sum of the squares of the errors]
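Example 3.4 can be reproduced in a few lines (a NumPy sketch; the original uses Matlab, and the random data here is assumed):

```python
import numpy as np

rng = np.random.default_rng(4)
# Three inputs, the third being x3 = x1 + x2, as in Example 3.4
X12 = rng.normal(size=(10, 2))
X = np.column_stack([X12, X12.sum(axis=1)])

w = rng.normal(size=3)
eta = 0.02
for _ in range(200):                 # passes over the 10 patterns
    for x in X:
        w -= eta * x * (w @ x)       # anti-Hebbian update (3.23)

print(np.abs(X @ w).max())           # outputs driven to (numerically) zero
print(w)                             # w ends up orthogonal to every stored pattern
```

The component of w inside the space spanned by the patterns is contracted at every update, while the component in the orthogonal direction (here, along (1, 1, -1)) is left untouched.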

The next figure shows the space spanned by the input data, and confirms that the final weight vector (after 10 passes over the data) is orthogonal to the space of the input data:

[FIGURE 3.11 - Final results]

We can understand the behaviour of the anti-Hebbian rule as decorrelating the output from the input, in an opposite way to the LMS rule, which finds the correlation coefficients from the inputs to the output. In the same way as we analysed the steepest descent algorithm, in Section 2.3, and the LMS rule, in Section 2.8, we can show that the rule is stable if:

η < 2 / λ_max   (3.24)

where λ_max is the largest eigenvalue of the autocorrelation matrix of the input data.

3.1.4 Linear Associative Memories

We are used to computer memory, where information is stored as a sequence of bits, which can be retrieved or stored by means of a specific address. This is not the way our brain stores information: through knowledge of cognitive psychology, it is known that the brain stores information in a more global way, in a distributed fashion, where data patterns are stored in several neurons.

The Linear Associative Memory (LAM) is an alternative to conventional digital memory, and has closer ties to how it is thought that the brain processes information. The goal is to store patterns in the network and, when presenting it with a specific input (out of the ones already stored), to retrieve the associated information, which is stored in the network weights. This was already introduced in Section 1.3.

Let us suppose that we have 2 data sets, X and D, each one with n patterns of dimensionality d, and that the goal is to estimate how similar these sets are. We can introduce, for this purpose, the cross-correlation matrix:

R_ij = (1/n) Σ_{k=1}^{n} X_ki D_kj, 0 < i < d, 0 < j < d   (3.25)

A high value of this index indicates that the corresponding vectors are similar, while a low value indicates dissimilarity.

It is interesting that a LAM can be implemented with a variation of the Hebbian learning rule (3.1), called forced Hebbian learning. If, in that equation, y_j is replaced with d_j, the desired output, we stay with:

Δw_ij = η x_i d_j   (3.26)

If this process is started with W(0) = 0, after n iterations we shall have:

W_ij = η Σ_{k=1}^{n} X_ki D_kj, 0 < i < d, 0 < j < d   (3.27)

which, apart from a scaling factor, is the cross-correlation matrix (3.25). In matrix terms, (3.27) can be expressed as:

W = η Σ_{k=1}^{n} d_k x_k^T   (3.28)

The question is: when we store different patterns in the LAM, are we able to reproduce them exactly? To answer it, notice that, when a pattern x_l is presented to the network, the output is:

y = W x_l = η Σ_{k=1}^{n} d_k x_k^T x_l = η (x_l^T x_l) d_l + η Σ_{k=1, k≠l}^{n} (x_k^T x_l) d_k   (3.29)

If η x_l^T x_l = 1, the first term is d_l, the desired output. We can thus see that the output is composed of two terms: the first one, which is the desired output, and a second term, called crosstalk, which measures how similar the current input is to all the other inputs. If all the stored patterns are orthogonal, the crosstalk term is null, and perfect recall is achieved. If, on the other hand, the stored patterns are similar, the crosstalk term can have magnitudes similar to the desired response, and the LAM does not perform as expected. Networks trained with this rule are called heteroassociators.
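The recall expression (3.29) can be illustrated numerically: with orthonormal stored patterns the crosstalk term vanishes, while correlated patterns corrupt the recall. A NumPy sketch, with assumed random data:

```python
import numpy as np

rng = np.random.default_rng(5)
d_dim, n = 8, 3

# Orthonormal input patterns: the crosstalk term of (3.29) vanishes
Q, _ = np.linalg.qr(rng.normal(size=(d_dim, d_dim)))
X = Q[:, :n].T                       # n unit-norm, mutually orthogonal patterns
D = rng.normal(size=(n, 4))          # arbitrary associated output patterns

eta = 1.0                            # eta * x_l^T x_l = 1 for unit-norm patterns
W = eta * sum(np.outer(D[k], X[k]) for k in range(n))   # forced Hebbian storage (3.28)

for k in range(n):
    assert np.allclose(W @ X[k], D[k])   # perfect recall

# Correlated (non-orthogonal) patterns produce crosstalk
Xc = X + 0.5
Wc = sum(np.outer(D[k], Xc[k]) for k in range(n))
print(Wc @ Xc[0] - D[0])             # recall error due to the crosstalk term
```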
They are so called as they associate input patterns with different output patterns.

LAMs are content addressable (in opposition to digital memory) and, also in contrast with digital memory, an error in a weight does not produce a catastrophic failure, as the information is stored globally in the network weights. LAMs, however, have limited capacity, and suffer from the problem of crosstalk. We know that in a d-dimensional space the largest set of orthogonal directions has size d, so we can store at most d patterns: the storage capacity of a LAM is equal to the dimension of the input. In practice, even if this condition is fulfilled, the stored patterns must be orthogonal if perfect recall is sought; for this reason, a pre-processing phase is usually performed, orthogonalizing the input data, for instance with PCA.

3.1.4.1 Nonlinear Associative Memories

There is nothing that prevents the use of forced Hebbian learning when the neurons have a nonlinear activation function. However, as the output is nonlinear, the vector space interpretation of the output is lost. Nonlinear associative memories are usually more robust to noise, and the influence of the crosstalk term is less important.

As we have already used the LMS rule for function approximation purposes, we may ask ourselves what is the difference between LAMs and the models introduced in Chapter 2, as the methods used for training them are the same. The difference lies in the amount of data and the intended use of the model. For function approximation purposes, the training data is larger than the number of inputs, and actually larger than the number of model parameters. In LAMs, the main goal is to use the network as a model to store patterns, whose number is less than the number of inputs.

3.1.5 LMS as a combination of Hebbian rules

If we recall the LMS learning rule, the weight update was given as:

Δw_ij = η e_i x_j   (3.30)

As the error is given as e_i = d_i − y_i, the LMS rule can be recast as:

Δw_ij = η d_i x_j − η y_i x_j   (3.31)

The first term in (3.31) can be seen to be forced Hebbian learning, while the second term is clearly anti-Hebbian learning. The role of the first is to make the output similar to the desired response, while the second one tries to drive the output to zero. The first term performs gradient ascent and is unstable; the second one decorrelates the input and the system output, allowing therefore a range of learning rates to produce convergence to a local minimum.

3.1.5.1 LMS for LAMs

The interpretation that we gave for the LMS rule enables us to use it, instead of forced Hebbian learning, for constructing LAMs. A LAM trained with LMS is called an Optimal Linear Associative Memory (OLAM). In practice, the application of LMS reduces the crosstalk for correlated input patterns.
and the influence of the crosstalk term is less important. 3. allowing therefore a range of learning rates to produce convergence to a local minimum. whose number is less than the number of inputs. for constructing LAMs.5.1. a major goal for the network is 204 Artificial Neural Networks .1.5 LMS as a combination of Hebbian rules If we recall the LMS learning rule. the main goal is to use the network as a model to store patterns. as the methods used for training them is the same. while the second one tries to drive the output to zero. as the output is nonlinear. 3. Therefore. as the information is stored globally in the network weights.We know that in an d-dimensional space the largest set of orthogonal directions has size d.3.3. while the second one decorrelates the input and the system output. for instance. In LAMs. For approximation purposes. However. nonlinear associative memories are usually more robust to noise.

For the case of function approximation, we usually have more equations than unknowns, generally inconsistent, and the use of the pseudo-inverse gives the weight vector (or matrix, for the case of multiple outputs) that provides the least sum of the squares of the errors. A major goal for the network is, in that case, its capacity of generalization, which may be interpreted as inputs not yet supplied to the network producing outputs with the smallest possible error. This is not a goal required for LAMs: what we want there is memorization, that is, the recall of the stored pattern, among the ones stored, most similar to the particular input. For the case of LAMs, the system is underdetermined (fewer equations than unknowns), so that we have an infinite number of optimal solutions. The use of the pseudo-inverse,

W = A^+ D   (3.32)

provides the solution with the smallest 2-norm. This can be better appreciated if we recall Appendix 1.

3.2 Competitive Learning

Competition is present in our day-to-day life, and also in the brain. The key fact about competition is that it leads to optimization without the need for a global control. In digital computers optimization is easy to perform, but it needs a global coordination; in biological systems, this task is performed in a distributed manner, making use of excitatory and inhibitory feedback among neurons, and neurons become specialized to particular features without any form of global control.

In terms of artificial neural networks, most of the work done on this subject is due to Stephen Grossberg [139], who introduced this idea and a dense mathematical treatment, and to Teuvo Kohonen [39], with his self-organizing map, who took a more pragmatical view. We shall describe this last work in more detail in Section 3.2.1, and show how competitive networks can be used for classification purposes in Section 3.2.2.

The typical layout of a competitive network consists of a layer of neurons, all receiving the same input. The neuron with the highest output will be declared the winner of the competition.

3.2.1 Self-organizing maps

These networks are motivated by the organization of the human brain. The brain is organized in such a way that different sensory inputs are mapped into different areas of the cerebral cortex in a topologically ordered manner, and different neurons (or groups of neurons) are differently tuned to sensory signals. Self-organizing maps try to mimic this behaviour, where, as in biology, the task is performed in a distributed manner, without global control.
The brain is organized in such a way that different sensory inputs area mapped into different areas of the cerebral cortex in a topologically ordered manner. among the ones stored.32) provides the solution with the smallest 2-norm.2. The typical layout of a competitive network consists of a layer of neurons. without global control. for this particular example.. the system is unconstrained (more equations than data). in a distributed manner. Different neurons (or groups of neurons) are differently tuned to sensory signals. This is not one goal required for LAMs. The use of a pseudo-inverse W = A D * + (3. so that we have an infinite number of optimal solutions.

Cooperation.12 .2.1 Competition Let us assume that the dimension of the input space is n. An input vector can therefore be represented as: x = x1 … xn The weight vector for neuron j is represented as: w j = w j 1 … w jn . Small random numbers are used for this phase. 3. so that the neurons inside this neighbourhood are excited by this particular input. 1.. For each neuron.. FIGURE 3. .1.. 2. Afterwards. A neighbourhood around the winning neuron is chosen. Inputs .. there are 3 steps to be performed: Competition.. T T . The excited neurons have their weights changed so that their discrimant function is increased for the particular input... all neurons compute the value of a discrimant function. The algorithm starts by initializing the weights connecting all sources to all neurons. . This transformation is performed adaptively in a topologically ordered manner.34) 206 Artificial Neural Networks .12 receives all inputs and are arranged into rows and columns. (3. A onedimensional map is a special case of this configuration. j = 1…l . The neuron with the highest value is declared the winner of the competition. Adaptation. This provides cooperation between neurons. where neurons are arranged into a single row or column. .2-dimensional SOM All neurons in fig..The typical goal of the Self-Organizing Map (SOM) is to transform an incoming signal of arbitrary dimension into a one or two-dimensional discrete map.. 3.33) (3.. This makes them more fine tuned for it. We shall describe each of these phases in more detail 3.

the weights must be adapted according to the input. should also decrease with time. a typical choice is the Gaussian function: d j. i – -------2 2σ 2 (3. while for the case of a two-dimensional map this can be given as r j – r i .3 Adaptation For the network to be self-organizing. There is evidence that a neuron that is firing excites more the neurons that are close to it than the ones that are more far apart. so that the winning neuron (and to a less degree the ones in its neighborhood) has its weights updated towards the input. where r denotes the position in the lattice. (3. This is equivalent to maximize the Euclidean distance between x and w j .35) hji ( x ) = e . it can be determined by: i ( x ) = arg min x – w j .36) where d j. i ( x ) ( x – w j ) . For the case of a one-dimensional map. Therefore the neighborhood should decay with lateral distance. 3. If we denote the topological neighborhood around neuron i by h ji ( x ) . (3. Using again an exponential decay. this represents i – j .To compare the best match of the input vector x with the weight vector w j we compute the inner products w j x for all neurons and select the one with the highest value. An update that does this is: ∆w j = ηh j. η . i denotes the distance between the winning neuron i and the excited neuron j. In the same manner as the width parameter. This neighborhood decreases with time. The parameter σ determines the width of the neighbourhood. j = 1…l 3. so that if we denote the winning neuron by i ( x ) . (3. A popular choice is: σ ( k ) = σ0 e k – ---τ1 .37) where σ 0 is the initial value of the width and τ 1 is the time constant. the learning rate parameter. we must define a neighbourhood that is neurobiologically plausible.2 Cooperative process Having found the winner neuron.2.38) which is applied to all neurons inside the neighborhood. it can be given as: Artificial Neural Networks 207 .2.1.1.

40) The neighborhood function should at first include all the neurons in the network (let us call this σ 0 . Density matching. The goal of the SOM algorithm is to store a large set of vectors by finding a smaller number of prototypes that provide a good approximation to the original input space.1 to 0. (3.39) For the SOM algorithm to converge. each one centred around one neuron. and a convergence phase.η ( k ) = η0 e 3. 3. the various parameters should have two different behaviours. where the topological ordering of the vectors take place.01. Topological ordering.2. resulting in two phases: ordering. As the basic goal of the SOM algorithm is to perform clusters of data. and the two phases involved in the SOM algorithm. while the neighborhood should contain only the nearest neighbours of the winning neuron. 1.4 Ordering and Convergence k – ---τ2 . and the learning rate parameter should be decreasing from around 0. Approximation of the input space.10. 3. it is not surprising that the SOM algorithm is not very different from Alg. or even by reduced to just the winning neuron. The feature map reflects variations in the statistics of the input distribution: regions in the input space from which samples are drawn with a higher probability are mapped onto larger domains of the output space.1. This can be achieved if: η 0 = 0. to fine tune the map. This can be cone if: 1000 τ 1 = ----------σ0 (3. 208 Artificial Neural Networks .5 Properties of the feature map Once the SOM has converged. A more detailed discussion can be found in [125].1 τ 2 = 1000 (3. the feature map computed by the algorithm has some nice properties which will be pointed out.1. Therefore.41) The convergence phase usually takes 500 times the number of neurons in the network. Actually their difference is in the definition of neighborhood.01. and decrease to a couple of neurons around the winning neuron. 
displaying the weights in the input space and connecting neighbouring neurons by a line indicates the ordering obtained. which for the k-means just includes the winning neuron. The spacial location of each neuron in the lattice corresponds to a particular feature or domain of the input data. the K-means clustering algorithm. 2. The learning parameter should remain at around 0. The first phase can take 1000 iterations or more. 2.2.
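The competitive, cooperative and adaptive steps described above can be sketched as follows. This is a minimal one-dimensional illustration, not the book's code: the lattice size, the toy two-cluster data and the parameter values are assumptions, and NumPy is assumed to be available; the updates follow (3.35)-(3.39) and (3.41).

```python
import numpy as np

def som_train(data, n_neurons=10, n_iter=2000,
              eta0=0.1, tau2=1000.0, sigma0=None, tau1=None):
    """Train a 1-D SOM on scalar data with the updates (3.35)-(3.39)."""
    rng = np.random.default_rng(0)
    if sigma0 is None:
        sigma0 = n_neurons / 2.0          # start by covering the whole lattice
    if tau1 is None:
        tau1 = 1000.0 / sigma0            # (3.41)
    w = rng.uniform(data.min(), data.max(), n_neurons)  # one weight per neuron
    for k in range(n_iter):
        x = rng.choice(data)
        i = np.argmin(np.abs(x - w))                 # competitive step: winner
        sigma = sigma0 * np.exp(-k / tau1)           # shrinking width (3.37)
        eta = eta0 * np.exp(-k / tau2)               # decaying rate (3.39)
        d = np.abs(np.arange(n_neurons) - i)         # lattice distance |i - j|
        h = np.exp(-d**2 / (2.0 * sigma**2))         # Gaussian neighborhood (3.35)
        w += eta * h * (x - w)                       # adaptation (3.38)
    return w

# toy data drawn from two well-separated clusters (an assumption for the demo)
data = np.concatenate([np.random.default_rng(1).uniform(0, 1, 200),
                       np.random.default_rng(2).uniform(4, 5, 200)])
weights = som_train(data)
```

Because every update moves a weight towards a sample, the prototypes end up inside the range spanned by the data.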

4. Feature selection. The SOM is able to select a set of best features for approximating the underlying distribution of the data. As we saw earlier, PCA gives us the best set of features (the eigenvectors) for reconstructing the data, but it assumes a linear model. Self-organized feature maps give us a discrete version of the principal curves, and therefore can be seen as a nonlinear generalization of PCA.

3.2.2 - Classifiers from competitive networks

Competitive learning is a form of unsupervised learning; therefore it does not use any form of teacher, which for classification is supplied in terms of class labels. But we have seen that the SOM performs data clustering, and we can profit from this to adapt it for classification purposes. We shall describe briefly three different techniques: learning vector quantization, counterpropagation networks and Grossberg's instar-outstar networks.

3.2.2.1 - Learning Vector Quantization

Vector quantization is a technique that exploits the structure of the data for the purpose of data compression. The basic idea is to divide the input space into a number of distinct regions (called Voronoi cells) and to define, for each cell, a reconstruction vector. As the number of cells is usually much smaller than the number of input data, data compression is achieved.

FIGURE 3.13 - Voronoi diagram

It is clear from the discussion of the SOM that this algorithm is capable of approximating the Voronoi vectors in an unsupervised manner. This is actually the first stage of Learning Vector Quantization (LVQ). The second stage involves fine-tuning of the Voronoi vectors in a supervised manner. This second phase involves assigning class labels to the input data and to the Voronoi vectors, picking input vectors x at random, and adapting the Voronoi vectors according to the following algorithm:

1. Suppose that the Voronoi vector w_c is the closest to the input vector x. Let ξ_{wc} denote the class associated with the vector w_c, and ξ_x the class associated with x.
2. IF ξ_{wc} = ξ_x, then w_c(k+1) = w_c(k) + α_n ( x − w_c(k) )
3. ELSE, w_c(k+1) = w_c(k) − α_n ( x − w_c(k) )
4. All other Voronoi vectors are not modified.

Algorithm 3.1 - LVQ algorithm

The learning rate α_n should start from a small value (typically 0.1) and decrease monotonically with k.

3.2.2.2 - Counterpropagation networks

Hecht-Nielsen [36] proposed a classifier composed of a hidden competitive layer, connected to a linear output layer. The first layer is trained first, and afterwards the linear classifier is trained using the class labels as training data. As this is a case of supervised learning, the LMS rule can be used, or the least-squares solution can be implemented in just one step.

3.2.2.3 - Grossberg's instar-outstar networks

Grossberg proposed a network very similar to the counterpropagation network described above (all developed in continuous time). The main difference lies in the input layer, which performs normalization. The competitive layer (the first layer) uses the instar rule:

Δw_j = η x ( y − w_j )   (3.42)

and the classification layer the same rule, with y replaced by d:

Δw_j = η x ( d − w_j )   (3.43)

3.2.2.4 - Adaptive Resonance Theory

The main problem associated with the instar-outstar network is that the application of the learning rule has the effect that new associations override the old ones; this is a problem common to networks of fixed topology, and is called the stability-plasticity dilemma. The way that Grossberg solved the problem was to increase the number of neurons in the competitive layer, in an iterative fashion, in situations where the input data differed from the nearest cluster centre by more than a predefined value. In these situations, it is said that there is no "resonance" between the existing network and the data, and this is the reason why this network has been called Adaptive Resonance Theory.
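The supervised update of Algorithm 3.1 can be sketched in a few lines. This is an illustrative snippet, not library code: the prototypes, class labels and learning rate below are arbitrary assumptions.

```python
import numpy as np

def lvq_step(prototypes, proto_labels, x, x_label, alpha):
    """One LVQ iteration: move the closest Voronoi vector towards x if the
    classes agree, and away from x otherwise; all other vectors are kept."""
    c = np.argmin(np.linalg.norm(prototypes - x, axis=1))   # nearest prototype
    sign = 1.0 if proto_labels[c] == x_label else -1.0
    prototypes = prototypes.copy()
    prototypes[c] += sign * alpha * (x - prototypes[c])
    return prototypes

protos = np.array([[0.0, 0.0], [5.0, 5.0]])
labels = np.array([0, 1])
# same class: prototype 0 moves towards the sample
p_same = lvq_step(protos, labels, np.array([1.0, 0.0]), 0, alpha=0.1)
# different class: prototype 0 moves away from the sample
p_diff = lvq_step(protos, labels, np.array([1.0, 0.0]), 1, alpha=0.1)
```

In a full run, alpha would start near 0.1 and decrease monotonically with the iteration count, as stated above.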


CHAPTER 4 - Recurrent Networks

Chapters 2 and 3 covered the field of direct networks, supervised and unsupervised. This chapter is devoted to structures where the signal does not propagate only from input to output, but where feedback is allowed. First, in Section 4.1, the most common neurodynamic models are introduced. Hopfield models are described in Section 4.2, where the Cohen-Grossberg model is also introduced. Finally, a brief description of the Brain-State-in-a-Box is given in Section 4.3.

4.1 - Neurodynamic Models

In Section 1.2, when the biological inspiration of artificial neural networks was discussed, a simplified neural model was introduced, and a dynamic model (1.5)-(1.6) was developed. Having stated that most of the artificial models are only interested in stationary behaviour, the most simplified model was derived there, and used intensively throughout the book. We now focus our attention on transient behaviour, and recover the dynamic model (1.5)-(1.6):

do_i/dt = Σ_{j∈i} X_{i,j}(o_j) − g(o_i)   (4.1)

Again, the simple approximation will be used:

X_{i,j}(o_j) = o_j W_{j,i}   (4.2)

and therefore:

do_i/dt = Σ_{j∈i} o_j W_{j,i} − g(o_i) = net_i − g(o_i)   (4.3)

We shall, however, make the following modifications to (4.3):

- we shall assume that the dynamics apply to the net input, and not to the output;
- the net input net is represented as v, and denoted as the induced local field; the output o is represented as x, and denoted as the neuron state;
- the loss term is proportional to the net input;
- we specifically introduce a bias term, for the purpose of a possible circuit implementation, together with capacitance and resistance terms.

Implementing these modifications, (4.3) is changed to:

C_i dnet_i/dt = Σ_{j∈i} o_j W_{j,i} − net_i/R_i + I_i   (4.4)

Using the notation above, the most common form of this model is:

C_i dv_i/dt = Σ_{j∈i} x_j W_{j,i} − v_i/R_i + I_i   (4.5)

This is called the additive model, introduced by Grossberg, to differentiate it from a multiplicative model, where the weights depend on the output. This is the form that will be used throughout this chapter.

4.2 - Hopfield models

This network consists of a series of neurons interconnected between themselves (no self-feedback is allowed in the discrete model) through delay elements. We shall consider that we have a set of neurons such as (4.5), interconnected in different ways, but where feedback among neurons is allowed. We shall assume, as previously, that the neuron output, o_i, is a nonlinear function of the net input net_i.

In this type of networks, the concept of energy minimization is usually applied. A necessary condition for the learning algorithms, introduced later on, to be useful, is that the network possesses fixed-point attractors (see Appendix 3), so that the attractors can be used as computational objects, such as memories. The attractors are the places where this energy function has minima, so that, when a network is left to evolve from a certain initial condition, one of these attractors will be attained. The aim of these algorithms is actually to control the location of these attractors, and possibly also of their basins of attraction. Each neuron has a dynamic represented by (4.5).
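The additive model (4.5) can be simulated directly by Euler integration. The sketch below is a toy two-neuron illustration under assumed values (unit capacitances and resistances, zero bias, symmetric coupling of 0.5 and a tanh nonlinearity); with a contractive coupling the state is driven to the fixed point at the origin.

```python
import numpy as np

def simulate_additive(W, C, R, I, v0, phi=np.tanh, dt=0.01, steps=20000):
    """Euler integration of C_i dv_i/dt = sum_j x_j W_ji - v_i/R_i + I_i,
    with the neuron states given by x = phi(v)."""
    v = v0.astype(float).copy()
    for _ in range(steps):
        x = phi(v)
        dv = (W.T @ x - v / R + I) / C
        v = v + dt * dv
    return v, phi(v)

W = np.array([[0.0, 0.5], [0.5, 0.0]])   # symmetric weights, no self-feedback
C = np.ones(2)
R = np.ones(2)
I = np.zeros(2)
v_final, x_final = simulate_additive(W, C, R, I, np.array([0.5, -0.2]))
```

Because the coupling gain (0.5) is smaller than the leakage gain, the only attractor of this particular toy network is v = 0, which the trajectory reaches.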

For convenience, (4.5) is reproduced here:

C_i dv_i/dt = Σ_{j∈i} x_j W_{j,i} − v_i/R_i + I_i   (4.6)

The following assumptions will be made:

1. The weight matrix is a symmetric matrix ( W_ij = W_ji ).
2. Each neuron has a nonlinear activation function of its own.
3. The inverse of this function exists: v = φ_i⁻¹(x)   (4.7)

Assuming a hyperbolic tangent function, we have:

x = φ_i(v) = tanh( a_i v / 2 ) = ( 1 − e^{−a_i v} ) / ( 1 + e^{−a_i v} )   (4.8)

This function has a slope of a_i/2 at the origin, and for this reason a_i is called the gain of neuron i. The inverse of this function is:

v = φ_i⁻¹(x) = −(1/a_i) log( (1 − x) / (1 + x) )   (4.9)

Denoting by φ⁻¹(x) the inverse output-input relation with unit gain, we have:

φ_i⁻¹(x) = (1/a_i) φ⁻¹(x)   (4.10)

The energy (Lyapunov) function of the Hopfield network is defined as:

E = −(1/2) Σ_{i=1}^N Σ_{j=1}^N W_ji x_i x_j + Σ_{j=1}^N (1/R_j) ∫_0^{x_j} φ_j⁻¹(x) dx − Σ_{j=1}^N I_j x_j   (4.11)

If we differentiate (4.11) w.r.t. time, we have:

dE/dt = − Σ_{j=1}^N ( Σ_{i=1}^N W_ji x_i − v_j/R_j + I_j ) dx_j/dt   (4.12)

The terms inside parentheses are equal to C_j dv_j/dt (see (4.6)), and therefore we have:

dE/dt = − Σ_{j=1}^N C_j (dv_j/dt) (dx_j/dt)   (4.13)

As v_j = φ_j⁻¹(x_j), we have dv_j/dt = ( dφ_j⁻¹(x_j)/dx ) (dx_j/dt), and therefore (4.13) can be expressed as:

dE/dt = − Σ_{j=1}^N C_j ( dφ_j⁻¹(x_j)/dx ) ( dx_j/dt )²   (4.14)

Let us focus our attention on the function φ(x). The following figures show the behaviour of the direct and the inverse nonlinearity.

FIGURE 4.1 - Plot of φ(x) = tanh(x/2)

FIGURE 4.2 - Plot of φ⁻¹(x)

It is clear that the derivative of φ⁻¹(x) will always be non-negative. As the square in (4.14) is also always non-negative, we can conclude that:

dE/dt ≤ 0   (4.15)

This result allows us to say that the energy function is a Lyapunov function for the Hopfield model, and that it is a stable model. The stable points are those that satisfy dE/dt = 0, which corresponds to the points where dx_j/dt = 0.

4.2.1 - Discrete Hopfield model

The model described above is the continuous version of the Hopfield model, which was introduced in [30]. Previously, John Hopfield had introduced a discrete version [29], which will be described now and compared with the continuous version. The discrete version is based on the McCulloch-Pitts model. We can build a bridge between the two models if we redefine the neuron activation function such that:

1. The asymptotic values of the output are:

x_j = 1 for v_j = ∞, x_j = −1 for v_j = −∞   (4.16)

2. The midpoint of the activation function lies at the origin:

φ_j(0) = 0   (4.17)

In view of this, we may set the bias I_j to 0 for all neurons.

The energy function of a network with neurons with these characteristics, where self-loops are not allowed, is given as:

E = −(1/2) Σ_{i=1}^N Σ_{j=1, i≠j}^N W_ji x_i x_j + Σ_{j=1}^N (1/R_j) ∫_0^{x_j} φ_j⁻¹(x) dx   (4.18)

Making use of (4.10) in (4.18), we have:

E = −(1/2) Σ_{i=1}^N Σ_{j=1, i≠j}^N W_ji x_i x_j + Σ_{j=1}^N ( 1/(a_j R_j) ) ∫_0^{x_j} φ⁻¹(x) dx

The integral in the second term of this equation is always positive, and assumes the value 0 when x = 0. If a_j has a large value (the sigmoid function approaches a hard non-linearity), the second term is negligible, and (4.18) can be expressed as:

E ≈ −(1/2) Σ_{i=1}^N Σ_{j=1, i≠j}^N W_ji x_i x_j   (4.19)

This last equation tells us that, in the limit case ( a_j = ∞ ), the energy function of the continuous model is identical to that of the discrete model. However, when this is not true ( a_j is small), the second term in (4.18) makes a significant contribution to the energy function, specially in regions where the neuron outputs are near their limiting values. The stable points, in this case, depart from the corners of the hypercube where the neuron states lie.

4.2.2 - Operation of the Hopfield network

The main application of the Hopfield network is as a content-addressable memory. By this, we mean that the network stores fundamental memories in its weights and, when a pattern (possibly one of these memories, corrupted with noise) is presented to the network, the network evolves in such a way that the fundamental memory is retrieved. Of course, these fundamental memories represent stable points in the energy landscape of the net.
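For a symmetric weight matrix with zero diagonal, the discrete energy (4.19) never increases under the one-neuron-at-a-time sign updates used in the retrieval phase described next. The sketch below checks this numerically; the network size and the random symmetric weight matrix are arbitrary choices for the demonstration, not values from the text.

```python
import numpy as np

def energy(W, x):
    """Discrete Hopfield energy (4.19), E = -1/2 x^T W x (W has zero diagonal)."""
    return -0.5 * x @ W @ x

def async_step(W, x, j):
    """Update neuron j to the sign of its induced local field,
    keeping the previous state when the field is exactly zero."""
    v = W[j] @ x
    x = x.copy()
    if v > 0:
        x[j] = 1.0
    elif v < 0:
        x[j] = -1.0
    return x

rng = np.random.default_rng(0)
N = 8
M = rng.normal(size=(N, N))
W = (M + M.T) / 2.0            # symmetric weights
np.fill_diagonal(W, 0.0)       # no self-feedback
x = rng.choice([-1.0, 1.0], size=N)
energies = [energy(W, x)]
for k in range(100):
    x = async_step(W, x, k % N)
    energies.append(energy(W, x))
```

Each accepted flip changes the energy by −(x_j_new − x_j_old) v_j ≤ 0, so the recorded sequence is monotonically non-increasing.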

Let us denote by ξ_i, i = 1…M, the set of M fundamental memories that we want to store in the net. Each one of them will be encoded onto a stable point x_i in the net. There are two main phases in the operation of the Hopfield network: a storage phase and a retrieval phase.

4.2.2.1 - Storage phase

Hopfield networks use unsupervised learning, of the Hebbian type (3.262). Therefore, the application of the Hebbian rule to the ξ_i, i = 1…M, fundamental memories leads to:

W_ji = (1/N) Σ_{k=1}^M ξ_kj ξ_ki   (4.20)

where ξ_kj is the jth value of pattern k, and N is the number of neurons. The term 1/N is employed for mathematical simplicity. The discrete version of the Hopfield network requires that there is no self-feedback. Therefore, in vector terms, and assuming that the weights are initialized to 0, the weight matrix is obtained as:

W = (1/N) Σ_{k=1}^M ξ_k ξ_k^T − (M/N) I   (4.21)

where the last term is introduced so that W_ii = 0. Notice that the weight matrix is symmetric.

4.2.2.2 - Retrieval phase

In the retrieval phase, an input a is applied to the network. This input is usually called a probe, and is applied by setting the initial network state to a. Then each neuron, picked randomly, checks its induced local field v_j and updates its state according to the following rule:

x_j = 1 if v_j > 0; x_j = −1 if v_j < 0; x_j unchanged if v_j = 0

This procedure is repeated until there is no change in the network state, i.e., until a stable state is reached. By presenting an input pattern a, if it is not a stable point, the network will evolve to the stable point corresponding to the basin of attraction where a lies. The stability of a state can be inspected by verifying if the following system of equations, denoted as the alignment condition, is satisfied:

y = sgn( Wy + b )   (4.22)

Example 4.1 - A 3-neuron model

Suppose that the weight matrix of a network is:

W = (1/3) [ 0 2 2; 2 0 2; 2 2 0 ]   (4.23)

This weight matrix is actually the result of applying (4.21) to the patterns [−1 −1 −1] and [1 1 1]. Assume also, for simplicity, that the bias is 0. As the network is discrete, there are 2³ = 8 states in the network. Employing (4.23) and the update rule, let us see the alignment condition for all possible states:

[−1 −1 −1] → [−1 −1 −1]
[−1 −1  1] → [−1 −1 −1]
[−1  1 −1] → [−1 −1 −1]
[−1  1  1] → [ 1  1  1]
[ 1 −1 −1] → [−1 −1 −1]
[ 1 −1  1] → [ 1  1  1]
[ 1  1 −1] → [ 1  1  1]
[ 1  1  1] → [ 1  1  1]

Analysing the above results, we can see that there are only 2 stable states: [−1 −1 −1] and [1 1 1]. Notice also that, when an input pattern that is not a stable pattern is applied to the network, the network recalls the fundamental memory that is closer in terms of Hamming distance.

4.2.3 - Problems with Hopfield networks

The fundamental memories are stored in stable points of the energy surface. These points, as the network is discrete, lie in corners of an hypercube. Some of the other corners might or might not be stable points. These other stable points are called spurious states: memories that we did not want the network to possess, and which are therefore undesirable.

In linear associative memories (see Section 3.4), crosstalk between patterns implies that a perfect recall is not possible. Here, the effect is that fundamental memories might not be stable memories, and therefore cannot be recalled. This is related to another problem of these networks: their limited storage capacity.
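The storage rule (4.21) and the alignment check (4.22) can be sketched as follows. This is an illustration assuming NumPy, in the spirit of (but not reproducing) the book's store_hop.m; it rebuilds the weight matrix (4.23) of Example 4.1 from the two fundamental memories and confirms that exactly those two states are stable.

```python
import numpy as np
from itertools import product

def store(patterns):
    """Hebbian storage (4.21): W = (1/N) sum_k xi_k xi_k^T - (M/N) I."""
    P = np.asarray(patterns, dtype=float)
    M, N = P.shape
    return (P.T @ P) / N - (M / N) * np.eye(N)

def update(W, x):
    """One evaluation of the retrieval rule: the sign of the local field,
    with the state kept unchanged when the field is zero."""
    v = W @ x
    y = x.copy()
    y[v > 0] = 1.0
    y[v < 0] = -1.0
    return y

W = store([[-1, -1, -1], [1, 1, 1]])       # reproduces (4.23)
states = [np.array(s) for s in product([-1.0, 1.0], repeat=3)]
stable = [tuple(s) for s in states if np.array_equal(update(W, s), s)]
```

Exhaustively checking all 2^3 states reproduces the table of Example 4.1: only the two fundamental memories satisfy the alignment condition.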

Let one input pattern be equal to one of the stored memories, ξ_v. Assuming zero bias, the induced local field for neuron i is given as:

v_i = Σ_{j=1}^N W_ji ξ_vj   (4.24)

Using (4.20) in (4.24), we have:

v_i = Σ_{j=1}^N (1/N) Σ_{k=1}^M ξ_kj ξ_ki ξ_vj = ξ_vi + (1/N) Σ_{k=1, k≠v}^M ξ_ki Σ_{j=1}^N ξ_kj ξ_vj   (4.25)

The first term in (4.25) is just the ith element of the fundamental memory v. The second term is the crosstalk between the fundamental memory v and the other fundamental memories. Typically, for small n, the number of random patterns that can be stored is smaller than 0.15n, while for large n the recall of m patterns is guaranteed (see [125]) only if:

m/n < 1 / (2 ln n)   (4.26)

Example 4.2 - Recognition of digits

The aim of this example is to design a Hopfield network to recognize digits such as the ones shown in fig. 4.3. Each digit is drawn in a 4*5 matrix of pixels, each one with two different values: −1 (white) and +1 (black). Therefore, each digit can be represented as a vector of 20 elements. This means that we can employ a Hopfield network with 20 neurons. The data can be found in numeros.mat.

FIGURE 4.3 - 10 digits

We shall start by memorizing just the 3 first digits: zero, one and two. Using store_hop.m, the weight matrix is computed. Using retrieve_hop.m, we can conclude that all fundamental memories have been stored as stable memories. If we change each bit of the stored pattern '1' with a probability of 25%, and let the network evolve, we can follow the retrieval process. The following figure illustrates snapshots of this evolution, at iterations 1, 31, 61, 91 and 101. Notice that, after 101 iterations, we recover a spurious state.

FIGURE 4.4 - Evolution of the distorted '1' pattern (stored pattern; iterations 1, 31, 61, 91 and 101)

If we try to store more than 3 patterns, we shall notice that not all fundamental memories can be stored with this network size. This is in accordance with what has been said before: 20 neurons are able to store 3 patterns.

4.2.4 - The Cohen-Grossberg theorem

One of the most well-known models in neurodynamics has been introduced by Cohen and Grossberg [140]. They proposed a network described by the following system of nonlinear differential equations:

du_i/dt = a_i(u_i) [ b_i(u_i) − Σ_{j=1}^N C_ji φ_j(u_j) ]   (4.27)

They have proved that this network admits a Lyapunov function:

E = (1/2) Σ_{i=1}^N Σ_{j=1}^N C_ji φ_i(u_i) φ_j(u_j) − Σ_{j=1}^N ∫_0^{u_j} b_j(λ) φ'_j(λ) dλ   (4.28)

where:

φ'_j(λ) = dφ_j(λ)/dλ, provided that:

1. the weight matrix is symmetric: C_ji = C_ij   (4.29)
2. the function a_i(u_i) is non-negative;
3. the activation function φ_i(u_i) is monotonically increasing.

Comparing (4.27) with (4.5), we can conclude that the Hopfield network is a special case of the Cohen-Grossberg model:

Table 4.1 - Correspondence between the Cohen-Grossberg and Hopfield models

Cohen-Grossberg:  u_j      | a_j(u_j) | b_j(u_j)          | C_ji   | φ_j(u_j)
Hopfield:         C_j v_j  | 1        | −v_j/R_j + I_j    | −W_ji  | φ_j(v_j)

4.3 - Brain-State-in-a-Box models

The Brain-State-in-a-Box (BSB) model was first introduced by Anderson [16]. The BSB uses positive feedback, together with a limiting activation function, to drive the network states to a corner of an hypercube. Its main use is as a clustering device. The BSB algorithm is as follows:

y[k] = x[k] + β W x[k]   (4.30)
x[k+1] = φ( y[k] )   (4.31)

In these equations, x[k] is the state vector at time k, W is an n*n symmetric matrix with positive eigenvalues, β is a small positive constant called the feedback factor, and φ(y) is a piecewise linear function (see Section 1.2).

By applying positive feedback, if the starting point is not a stable point, the Euclidean norm of the state vector will increase with time. The use of the limiting function, however, constrains the state trajectories to lie inside an hypercube centred at the origin. This means that the trajectories will be driven to a "wall" of the hypercube, and will travel along this wall until a stable point is reached. The network is called BSB as the trajectories are contained in a "box".

4.3.1 - Lyapunov function of the BSB model

As Hopfield networks, BSBs may be reformulated as a special case of the Cohen-Grossberg model. It can be proved (see [141] and [125]) that, using the correspondence expressed in Table 4.2 for the two models:

Table 4.2 - Correspondence between the Cohen-Grossberg and BSB models

Cohen-Grossberg:  u_j  | a_j(u_j) | b_j(u_j) | C_ji   | φ_j(u_j)
BSB:              v_j  | 1        | −v_j     | −C_ji  | φ_j(v_j)

the energy function of a BSB model is:

E = −x^T W x   (4.32)

4.3.2 - Training and applications of BSB models

BSB models are usually trained with the LMS algorithm, where the target data is replaced by the input training data. The weight update is therefore:

ΔW = η ( x[k] − W x[k] ) x[k]^T   (4.33)

The BSB is usually employed as a clustering algorithm, rather than as an associative memory. The BSB divides the input space into regions that are attracted to corners of an hypercube, creating a clustering function. Both BSB and Hopfield networks generate basins of attraction, but in the BSB these are more regular than in Hopfield networks.

Example 4.3 - A two-dimensional example (taken from [125])

A BSB model with two neurons, with a small feedback factor β and a weight matrix of

W = [ 0.035 −0.005; −0.005 0.035 ]

is considered here. Using retrieve_bsb.m, fig. 4.5 illustrates the trajectories obtained for 4 different probes. As it can be seen, the states, in a preliminary phase, move towards a side of the hypercube, and afterwards move along this side towards the stable state.

FIGURE 4.5 - Different trajectories for a BSB model
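The BSB recursion (4.30)-(4.31) is straightforward to simulate. The sketch below uses the weight matrix of Example 4.3 with an assumed feedback factor β = 0.1 (chosen here only to keep the run short; it is not a value taken from the example) and drives an interior probe to a corner of the box.

```python
import numpy as np

def bsb(x0, W, beta=0.1, steps=5000):
    """Iterate y = x + beta*W*x, then clip y to the box [-1, 1]^n (4.30)-(4.31)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        y = x + beta * (W @ x)        # positive feedback (4.30)
        x = np.clip(y, -1.0, 1.0)     # piecewise-linear limiter (4.31)
    return x

W = np.array([[0.035, -0.005], [-0.005, 0.035]])
x_final = bsb([0.1, 0.2], W)
```

The state first grows slowly under the positive feedback, hits the wall x2 = 1, and then slides along that wall to the corner [1, 1], in agreement with the behaviour described for fig. 4.5.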


APPENDIX 1 - Linear Equations, Generalized Inverses and Least Squares Solutions

A1.1 - Introduction

The topology of supervised direct neural networks, whose main application is approximation, involves a linear output layer. This output layer can be described as a set of linear equations, and the problem of training means determining the coefficients of these equations in such a way that a suitable criterion is minimized. In this appendix, some concepts related with linear equations are reviewed, and some others are introduced.

The output layer of a direct supervised neural network, with a linear activation function, can be described as:

y_{m×1} = A_{m×p} w_{p×1}   (A1.1)

We know that the problem of training is to determine w such that y is, in some sense, very close, or equal, to t. If certain conditions are satisfied, a solution (or infinite solutions) exists; in order to have a solution, the equations in (A1.1) must be consistent. In general, system (A1.1) might have a single solution, many solutions, or no solution at all.

A1.2 - Solutions of a linear system with A square

Let us start with the case where p = m, i.e., the number of patterns is equal to the number of linear parameters. When A is square, and of full rank, system (A1.1) has the familiar solution:

w = A⁻¹ t   (A1.2)

However, when A is not square, or when it is square but not of full rank, A⁻¹ does not exist. Consider this system:

[ 2 3 1; 1 1 1; 3 5 1 ] [ w1; w2; w3 ] = [ 14; 6; 19 ]   (A1.3)

Twice the 1st equation minus the 2nd gives:

[ 3 5 1 ] [ w1; w2; w3 ] = 22   (A1.4)

Comparing (A1.4) with the last row of (A1.3), we can see that they cannot be true simultaneously. The system has inconsistent equations, and therefore (A1.3) has no solution. If, instead of (A1.3), we had:

[ 2 3 1; 1 1 1; 3 5 1 ] [ w1; w2; w3 ] = [ 14; 6; 22 ]   (A1.5)

then, as we shall see below, there are infinite solutions.

A1.3 - Existence of solutions

Consider again (A1.1), but assume that the rank of A is r. Assume that we arrange the equations such that the first r rows of A are linearly independent, and the other rows are linearly dependent on them. The linear relationships existing between the rows of A must also be present in the right-hand-side vector, and therefore (A1.1) could be expressed as:

[ A^i; C A^i ] w = [ y^i; C y^i ]   (A1.6)

Assume now that we interchange the columns of A, and the corresponding elements of w, in such a way that the first r columns of A are linearly independent. Then, the first r rows of (A1.6) can be expressed as:

[ A_1^i  A_2^i ] [ w_1; w_2 ] = y^i   (A1.7)

Now, as A_1^i is non-singular, we can pre-multiply (A1.7) by (A_1^i)⁻¹ and obtain:

w = [ w_1; w_2 ] = [ (A_1^i)⁻¹ y^i − (A_1^i)⁻¹ A_2^i w_2; w_2 ]   (A1.8)

Let us apply this to system (A1.5). In this case, A_1 = [ 2 3; 1 1 ], A_2 = [ 1; 1 ], and w_2 is a scalar (w3). Using Matlab, we see that A_1⁻¹ y = [ 4; 2 ] and A_1⁻¹ A_2 = [ 2; −1 ]. Therefore, the solutions of (A1.5) have the form:

[ w1; w2 ] = [ 4 − 2w3; 2 + w3 ]   (A1.9)

Therefore, depending on the value given to w3, we have an infinite number of solutions for (A1.5).

A1.4 - Test for consistency

We have assumed, in the previous section, that the system of equations is consistent. This can be tested by building the augmented matrix [ A y ] and computing its rank. If the rank of this augmented matrix is the same as the rank of the original matrix, then all linear dependencies in the rows of A appear also in the elements of y, and the system is consistent. On the other hand, if A is full-rank and square, then there are no linear dependencies to test for, and the system is also consistent.

A1.5 - Generalized Inverses

Penrose showed that, for any matrix A, there is a unique matrix A⁺ which satisfies the following four conditions:

A A⁺ A = A   (A1.10)
A⁺ A A⁺ = A⁺   (A1.11)
( A⁺ A )^T = A⁺ A   (A1.12)
( A A⁺ )^T = A A⁺   (A1.13)

He called A⁺ the Generalized Inverse; later on, other authors also suggested the name Pseudo-Inverse. It is now common to use these names for matrices satisfying only the first (A1.10) of Penrose's conditions, and to denote a matrix satisfying this condition as G. We shall use this convention here also.
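Going back to system (A1.5), the consistency test of Section A1.4 and the family of solutions (A1.9) are easy to verify numerically (a sketch assuming NumPy is available):

```python
import numpy as np

A = np.array([[2.0, 3.0, 1.0],
              [1.0, 1.0, 1.0],
              [3.0, 5.0, 1.0]])
t = np.array([14.0, 6.0, 22.0])       # the consistent right-hand side of (A1.5)

# consistency test (A1.4): rank of A equals rank of the augmented matrix
rank_A = np.linalg.matrix_rank(A)
rank_aug = np.linalg.matrix_rank(np.column_stack([A, t]))

def solution(w3):
    """One member of the solution family (A1.9), parameterized by w3."""
    return np.array([4.0 - 2.0 * w3, 2.0 + w3, w3])

residuals = [np.linalg.norm(A @ solution(w3) - t) for w3 in (-1.0, 0.0, 2.5)]
```

Every choice of w3 produces an exact solution, confirming that the consistent, rank-deficient system has infinitely many solutions.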

A pseudo-inverse can be obtained using a SVD (singular value decomposition). In this case, the matrix A is decomposed as:

A = U S V^T   (A1.14)

where U and V are orthonormal matrices, and S is a diagonal matrix containing the singular values of A. Orthonormal matrices satisfy:

U^T U = U U^T = I   (A1.15)

The S matrix has the form:

S = [ D_r 0; 0 0 ]   (A1.16)

where D_r is a diagonal matrix with rank r ≤ p. As U⁻¹ = U^T and V⁻¹ = V^T, we have:

A A⁺ A = A ⇔ U S V^T A⁺ U S V^T = U S V^T ⇔ S V^T A⁺ U S = S   (A1.17)

For this equality to hold, we must have:

A⁺ = V [ D_r⁻¹ 0; 0 0 ] U^T   (A1.18)

since then S V^T A⁺ U S = [ D_r 0; 0 0 ] [ D_r⁻¹ 0; 0 0 ] [ D_r 0; 0 0 ] = [ D_r 0; 0 0 ] = S.

It should be noted that there are other decompositions, besides the SVD, that can be applied to obtain a generalized inverse. Actually, any decomposition that generates:

A = P ∆ Q   (A1.19)

where ∆ is a diagonal matrix, and P and Q are obtained by elementary operations (so that they can be inverted), can be applied. Let us look at system (A1.5) again. One possible decomposition, obtained by elementary row and column operations, is:

P = [ 1 0 0; 1/2 1 0; 3/2 −1 1 ],  Q = [ 1 3/2 1/2; 0 1 −1; 0 0 1 ]   (A1.20)

∆ = [ 2 0 0; 0 −1/2 0; 0 0 0 ]   (A1.21)

Therefore:

G = Q⁻¹ ∆⁻ P⁻¹   (A1.22)

where ∆⁻ inverts the non-zero diagonal entries of ∆, which gives:

G = [ −1 3 0; 1 −2 0; 0 0 0 ]   (A1.23)

Notice that this pseudo-inverse satisfies conditions (A1.10) and (A1.11), but does not satisfy conditions (A1.12) and (A1.13). If we compute the pseudo-inverse using a SVD decomposition, we have (up to the signs of the singular vectors):

U ≈ [ 0.52 −0.24 −0.82; 0.22 −0.88 0.41; 0.83 0.40 0.41 ]   (A1.24)

S ≈ diag( 7.16, 0.84, 0 )   (A1.25)

V ≈ [ 0.52 −0.25 0.82; 0.82 0.39 −0.41; 0.22 −0.89 −0.41 ]   (A1.26)
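Building A⁺ from the SVD factors as in (A1.18) takes only a few lines; the sketch below treats singular values under an assumed tolerance as zero, and checks the four Penrose conditions (note that numpy.linalg.svd returns V^T, not V):

```python
import numpy as np

def pinv_svd(A, tol=1e-10):
    """Pseudo-inverse A+ = V [Dr^-1 0; 0 0] U^T built from the SVD (A1.14)."""
    U, s, Vt = np.linalg.svd(A)
    safe = np.where(s > tol, s, 1.0)           # avoid dividing by zero
    s_inv = np.where(s > tol, 1.0 / safe, 0.0)
    return (Vt.T * s_inv) @ U.T                # V diag(s_inv) U^T

A = np.array([[2.0, 3.0, 1.0], [1.0, 1.0, 1.0], [3.0, 5.0, 1.0]])
Ap = pinv_svd(A)
```

Unlike the elementary-operations inverse (A1.23), this construction satisfies all four of Penrose's conditions.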

Therefore, the pseudo-inverse is:

A⁺ = V [ D_r⁻¹ 0; 0 0 ] U^T = (1/36) [ 4 10 −2; −2 −14 10; 10 34 −14 ] ≈ [ 0.11 0.28 −0.06; −0.06 −0.39 0.28; 0.28 0.94 −0.39 ]   (A1.27)

Notice that this pseudo-inverse satisfies all of Penrose's conditions.

A1.6 - Solving linear equations using generalized inverses

A solution of system (A1.1), assuming that the system is consistent, is just:

ŵ = A⁺ y   (A1.28)

As we know, when the rank of A is less than the number of unknowns and the system is consistent, there are many solutions. Denoting H = A⁺ A, in this case we have:

ŵ = A⁺ y + ( H − I ) z   (A1.29)

where z is an arbitrary vector. Notice that, when A is full-rank, H = I, and (A1.29) reduces to a single solution. Let us consider again system (A1.5). Using the pseudo-inverse (A1.23), we have:

H = [ 1 0 2; 0 1 −1; 0 0 0 ]   (A1.30)

H − I = [ 0 0 2; 0 0 −1; 0 0 −1 ]   (A1.31)

and therefore, with z = [ 0; 0; w3 ]:

ŵ = [ 4; 2; 0 ] + [ 2w3; −w3; −w3 ] = [ 4 + 2w3; 2 − w3; −w3 ]   (A1.32)

Notice that this is the same set of solutions as (A1.9). Using the pseudo-inverse (A1.27), we have:

H = [ 1/3 1/3 1/3; 1/3 5/6 −1/6; 1/3 −1/6 5/6 ]   (A1.33)

H − I = [ −2/3 1/3 1/3; 1/3 −1/6 −1/6; 1/3 −1/6 −1/6 ]   (A1.34)

and therefore:

ŵ = [ 2; 3; 1 ] + ( H − I ) z   (A1.35)

Notice that the matrix H − I has rank 1. If matrix A has p columns and rank r, we shall have one particular solution (corresponding to z = 0 in (A1.29)) and p − r independent homogeneous solutions, generated by ( H − I ) z.

Comparing the use of the generalized inverse with the method expressed in Section A1.3, we see that, with the former, we only need to find one generalized inverse and perform some matrix calculations, while in the latter method we must choose a non-singular submatrix among the rows and columns of A, which is a more complex task if the matrix is large.

A1.7 - Inconsistent equations and least squares solutions

Looking at a graphical representation of system (A1.5), we can see that there is a solution with zero error, as the right-hand side (our t vector, in black) lies in the space spanned by the columns of matrix A (in red).

FIGURE A1.1 - Graphical representation of (A1.5)

Of course, we shall have solutions with zero error for all target vectors that lie in that space. Notice also that, in the normal case, where there are as many patterns as parameters and the columns are linearly independent, we would have solutions with zero error for every vector t.

If we change the 3rd component of vector t to 30, it will no longer remain in the column space of A, and therefore there will be no solution for it. When a system is inconsistent, it has no exact solution. What we can do, in this case, is derive a solution that minimizes the norm of the errors obtained:

Ω = Σ_{i=1}^m ( t_i − A_{i,.} w )² = ( t − Aw )^T ( t − Aw )   (A1.38)

For that, we project vector t onto the column space of A, using:

P_A = A A⁺   (A1.39)

The minimum of (A1.38) can then be obtained as the distance from the extreme of this projection to the actual point:

Ê = t − A A⁺ t = ( I − A A⁺ ) t = P_{A⊥} t   (A1.40)

where P_{A⊥} is the projection operator onto the space orthogonal to the column space of A. Therefore:

Ω = Ê^T Ê = t^T P_{A⊥}^T P_{A⊥} t = t^T P_{A⊥} t   (A1.41)

Notice that P_A and P_{A⊥} are idempotent matrices, which means that P_{A⊥} P_{A⊥} = P_{A⊥}. The minimum error is obtained when A⁺ is computed using the SVD. In fig. A1.2, the projection of t onto the column space of A is shown in blue, and the projection onto the orthogonal space in magenta.

FIGURE A1.2 - Graphical representation of (A1.5), with t_3 = 30

The next figure shows the effect of computing the pseudo-inverse with (A1.19). As it can be seen, the error is greater with this pseudo-inverse, as it is no longer orthogonal to the column space of A.
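The projection argument (A1.39)-(A1.41) can be checked numerically for the inconsistent right-hand side t = [14, 6, 30]^T (a sketch; numpy.linalg.pinv computes the SVD-based pseudo-inverse):

```python
import numpy as np

A = np.array([[2.0, 3.0, 1.0], [1.0, 1.0, 1.0], [3.0, 5.0, 1.0]])
t = np.array([14.0, 6.0, 30.0])          # inconsistent: t3 changed to 30

Ap = np.linalg.pinv(A)
PA = A @ Ap                              # projection onto the column space
PAperp = np.eye(3) - PA                  # projection onto its orthogonal complement

w_hat = Ap @ t                           # least-squares solution
omega = t @ PAperp @ t                   # (A1.41): minimal value of (A1.38)
```

Both projectors are idempotent, and the quadratic form t^T P_{A⊥} t coincides with the residual sum of squares of the least-squares solution.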

FIGURE A1.3 - Graphical representation of (A1.5), with t_3 = 30, for different generalized inverses

A1.7.1 - Least squares solutions using a QR decomposition

In this case, the matrix A is decomposed as:

A = Q R   (A1.42)

where Q is an orthonormal matrix, and R has the form:

R = [ R_1 S; 0 0 ]   (A1.43)

where R_1 is an upper-triangular matrix of size r*r. Partitioning Q accordingly:

Q = [ Q_1 Q_2 ]   (A1.44)

with Q_1 of size m*r and Q_2 of size m*(p − r). Using this decomposition, a pseudo-inverse can be obtained as:

A⁺ = [ R_1⁻¹ 0; 0 0 ] Q^T = [ R_1⁻¹ Q_1^T; 0 ]   (A1.45)

We can easily prove that this is a pseudo-inverse, as:

A A⁺ A = Q [ R_1 S; 0 0 ] [ R_1⁻¹ 0; 0 0 ] Q^T Q [ R_1 S; 0 0 ] = Q [ I 0; 0 0 ] [ R_1 S; 0 0 ] = Q [ R_1 S; 0 0 ] = A   (A1.46)

Determining now P_A and P_{A⊥} with this decomposition, we have:

P_A = A A⁺ = Q [ I 0; 0 0 ] Q^T = Q_1 Q_1^T   (A1.47)

P_{A⊥} = Q Q^T − Q_1 Q_1^T = Q_2 Q_2^T   (A1.48)

This means that the training criterion (A1.41) can be expressed as:

Ω = Ê^T Ê = t^T Q_2 Q_2^T t = ‖ Q_2^T t ‖²   (A1.49)

Example A1.1 - Least-squares solution with QR and SVD

Using the qr.m function of Matlab, we have (up to the signs of the columns of Q and rows of R):

Q ≈ [ −0.53 −0.22 −0.82; −0.27 −0.87 0.41; −0.80 0.44 0.41 ]   (A1.50)

R ≈ [ −3.74 −5.88 −1.60; 0 0.65 −0.65; 0 0 0 ]   (A1.51)

The projection matrix P_A is given as:

P_A = [ 1/3   1/3   1/3
        1/3   5/6  −1/6
        1/3  −1/6   5/6 ]  (A1.52)

The projection matrix P_A⊥ is given as:

P_A⊥ = [  2/3  −1/3  −1/3
         −1/3   1/6   1/6
         −1/3   1/6   1/6 ]  (A1.53)

The pseudo-inverse computed with (A1.45) is given by (A1.54), and the corresponding coefficients are, in this case, ŵ = A⁺t (A1.55). If we use the Matlab function pinv.m, the projection operators obtained are the same, (A1.52) and (A1.53), but the pseudo-inverse, obtained with the SVD, is different (A1.56),

and so are the coefficients ŵ = A⁺t (A1.57). Notice that the norm of the weight vector is smaller using a pseudo-inverse computed with the SVD than with the QR decomposition.
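The comparison made in Example A1.1 can be sketched numerically. Since the numerical A and t of the example are not reproduced here, the rank-deficient matrix below is a hypothetical stand-in; the sketch shows that the QR-based generalized inverse (A1.45) and the SVD pseudo-inverse give the same projection onto the column space of A, while the SVD solution has the smaller weight norm:

```python
# Hypothetical rank-deficient example (not the A and t of Example A1.1).
import numpy as np

A = np.array([[1.0, 1.0, 2.0],
              [1.0, 2.0, 3.0],
              [1.0, 3.0, 4.0]])   # third column = first + second -> rank 2
t = np.array([1.0, 2.0, 30.0])

# QR-based generalized inverse: A+ = [R1^{-1} Q1^T; 0]  (A1.45)
Q, R = np.linalg.qr(A)
r = 2                              # rank of A
R1, Q1 = R[:r, :r], Q[:, :r]
A_plus_qr = np.vstack([np.linalg.solve(R1, Q1.T), np.zeros((1, 3))])

A_plus_svd = np.linalg.pinv(A)     # SVD-based pseudo-inverse

w_qr = A_plus_qr @ t
w_svd = A_plus_svd @ t

# Both generalized inverses reproduce the same projection P_A = A A+ ...
print(np.allclose(A @ A_plus_qr, A @ A_plus_svd))
# ... but the SVD solution is the minimum-norm one:
print(np.linalg.norm(w_svd) <= np.linalg.norm(w_qr) + 1e-9)
```

Both printed values are True: the fitted part of the solution is identical, and only the component in the null space of A (which does not affect the residual) differs between the two pseudo-inverses.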


APPENDIX 2 Elements of Pattern Recognition

This appendix introduces some fundamental concepts from pattern recognition theory, needed for understanding the role of neural networks as classifiers. This appendix follows closely the material presented in [135].

A2.1 The Pattern-Recognition Problem

Let us suppose that we want to classify people's health (sick or healthy) according to the body's temperature. We will assume that 36.5 ºC is the usual "normal temperature", since, by definition, it is the mean temperature of "healthy" people. Obviously, the temperature of healthy people varies with the particular person, the hour of the day, and so on. It is more difficult to assign a value for the mean temperature of "sick" people, as it will vary with the disease, its degree, and the phase it is at. As we have to assign a value, we shall assume a value of 39 ºC for the mean temperature of sick people. Intuitively, the temperature of sick people will have a larger variance than the temperature of healthy people, since, typically, it will vary more with the disease.

FIGURE A2.1 - Data used in the "sick-healthy" problem

The above figure shows the data (100 samples) used in this problem. The red balls show temperatures of sick people, while the blue crosses denote temperatures assigned to healthy people. The task that we have at hand is to classify a person's state into one of two classes, health and sickness, given the person's temperature. A natural way to perform this task is to define a boundary temperature between the two classes. As we can imagine (and see if we inspect carefully fig. 1), the two classes overlap, and in the general case, defining the shape and the placement of the boundary is not an easy task; in fact it is the most important problem in pattern recognition.

A2.2 Statistical Formulation of Classifiers

We shall assume that the patient temperature is a random variable, generated by two different phenomena, each one with a probability density function (pdf), in this case a Gaussian distribution. From the measurements of the temperature, we can estimate the distribution parameters, in this case only the mean and the variance. The sample mean and standard deviation are depicted in Table A2.1.

Table A2.1 - Sample temperature mean and std

            µ       σ
Healthy    36.51   0.14
Sick       38.89   0.89

Using statistical decision theory, we can construct the optimal classifier. Fisher showed that the optimum classifier chooses the class ci such that the a posteriori probability P(ci|x) that the sample x belongs to class ci is maximized, i.e., x belongs to class ci if:

P(ci|x) > P(cj|x), for all j ≠ i  (A2.1)

There is only one slight problem: this probability cannot be measured directly. It must be obtained indirectly, using Bayes' rule:

P(ci|x) = P(x|ci)P(ci) / P(x)  (A2.2)

In (A2.2), P(ci) is the prior probability of class ci, which can be estimated by the relative frequency of the classes; P(x|ci) is the likelihood that x was produced by class ci, which can be estimated from the data assuming a pdf (in our case, a Gaussian function); P(x) is just a normalizing factor, which is usually left out.

To estimate these values, we used:

µ = (1/N) Σ_{i=1}^{N} xi  (A2.3)

σ² = (1/(N−1)) Σ_{i=1}^{N} (xi − µ)²  (A2.4)

We can use the Gaussian pdf for estimating P(x|ci):

p(x) = (1/(√(2π)σ)) e^(−(x−µ)²/(2σ²))  (A2.5)

fig. 2 shows the histograms and the pdfs for the data belonging to both classes, with the parameters estimated by (A2.3) and (A2.4).

FIGURE A2.2 - Sampled data distributions

If we plot the pdfs, each multiplied by P(ci) (here we assume that the probability of being sick is 0.5), we arrive at fig. 3, in which we have plotted the separation boundary, T, which enables us to classify the patient's health. If x < T then the patient is more likely to be healthy; otherwise the patient is more likely to be sick. Of course, there are errors in this classification, as there are healthy people whose temperature is higher than T, and vice-versa. The classification error depends on the classes' overlap. Intuitively, for fixed variances, the larger the difference between the cluster centres, the smaller the error. Likewise, for a fixed cluster-centre distance, the smaller the variances, the smaller the error. Therefore, the error is affected by a combination of the cluster mean difference and the cluster variances.
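The parameter estimation of (A2.3)-(A2.5) can be sketched as follows; the temperature values below are illustrative, not the 100 samples of fig. A2.1:

```python
# Sketch of (A2.3)-(A2.5): estimate the sample mean and standard deviation
# of one class and evaluate its Gaussian likelihood.
import math

def gaussian_pdf(x, mu, sigma):
    """p(x) of (A2.5)."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

healthy = [36.3, 36.4, 36.5, 36.5, 36.6, 36.7]   # hypothetical samples
N = len(healthy)
mu = sum(healthy) / N                                          # (A2.3)
sigma = math.sqrt(sum((x - mu) ** 2 for x in healthy) / (N - 1))  # (A2.4)

# the likelihood peaks at the sample mean
assert gaussian_pdf(mu, mu, sigma) > gaussian_pdf(mu + 1.0, mu, sigma)
print(round(mu, 2))   # 36.5
```

With the likelihoods of both classes estimated this way, the scaled curves P(x|ci)P(ci) of fig. 3 can be drawn and compared point by point.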

FIGURE A2.3 - Bayes threshold T between the scaled likelihoods P(x|c1)P(c1) (mean m1) and P(x|c2)P(c2) (mean m2): x < T decides c1, x > T decides c2


Our next step is to compute the Bayesian threshold, T. It is defined as the value of x for which the a posteriori probabilities are equal:

P(c1)σ2 e^(−(x−µ1)²/(2σ1²)) = P(c2)σ1 e^(−(x−µ2)²/(2σ2²))  (A2.6)

For Gaussian pdfs, as x appears in the exponent, we must take the natural logarithm of each side of (A2.6):

ln(P(c1)) + ln(σ2) − (x−µ1)²/(2σ1²) = ln(P(c2)) + ln(σ1) − (x−µ2)²/(2σ2²)  (A2.7)

This is clearly a quadratic equation in x, which can be given as:

(x²/2)(1/σ2² − 1/σ1²) + x(µ1/σ1² − µ2/σ2²) + ln(P(c1)/P(c2)) + ln(σ2/σ1) + µ2²/(2σ2²) − µ1²/(2σ1²) = 0  (A2.8)

When σ1 = σ2, we get a linear equation, whose solution is:

T = (µ1 + µ2)/2 + k,  (A2.9)
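The general (quadratic) case can be solved numerically. The sketch below uses the sample statistics of Table A2.1 (healthy: µ = 36.51, σ = 0.14; sick: µ = 38.89, σ = 0.89) with equal priors; because of rounding in the recovered table values, the threshold obtained differs slightly from the value quoted in the text:

```python
# Numerical sketch of the Bayes threshold (A2.6)-(A2.8), equal priors.
import math

mu1, s1 = 36.51, 0.14     # healthy (c1)
mu2, s2 = 38.89, 0.89     # sick (c2)

# quadratic (A2.8): a*x^2 + b*x + c = 0, with P(c1) = P(c2)
a = 0.5 * (1 / s2 ** 2 - 1 / s1 ** 2)
b = mu1 / s1 ** 2 - mu2 / s2 ** 2
c = math.log(s2 / s1) + mu2 ** 2 / (2 * s2 ** 2) - mu1 ** 2 / (2 * s1 ** 2)

disc = math.sqrt(b * b - 4 * a * c)
roots = [(-b - disc) / (2 * a), (-b + disc) / (2 * a)]

# the Bayes threshold is the root lying between the two class means
T = next(r for r in roots if mu1 < r < mu2)
print(round(T, 2))
```

Note that T lies much closer to the healthy mean than to the sick mean, illustrating the remark that the threshold moves towards the class with the smallest variance.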


where k depends on the ratio of the a priori probabilities. For our case, considering c1 the healthy class and c2 the sick class, we have the quadratic polynomial −21.9x² + 1597x − 29100. The roots of this equation are x = 36.98, 35.9, which implies that T = 36.98. Notice that, if both variances were equal, T would be larger: the threshold moves towards the class with the smallest variance.

A2.2.1 Minimum Error Rate and Mahalanobis Distance

The probability of error is computed by adding the area under the likelihood of class c1 in the decision region of class c2 to the area under the likelihood of class c2 in the decision region of class c1. As we know, the decision region is a function of the threshold chosen, and so is the probability of error. It can be shown that the minimum of this probability of error is achieved by determining the threshold using Bayes' rule, and that the minimum error is a function not only of the cluster means, but also of the cluster variances. We can express this fact by saying that the metric for classification is not Euclidean; the probability depends on the distance of x from the mean, normalized by the variance. This distance is called the Mahalanobis distance. It can be expressed, for the case of multivariate data, as the exponent of a multivariate Gaussian distribution:

p(x) = (1/((2π)^(D/2) |Σ|^(1/2))) e^(−(x−µ)ᵀ Σ⁻¹ (x−µ)/2)  (A2.10)

Notice that the data has dimension D, and |Σ| denotes the determinant of Σ, the covariance matrix:

Σ = [ σ11 … σ1D
      …   …  …
      σD1 … σDD ]  (A2.11)

In (A2.11), each element is the averaged product of the dispersions of the samples in two coordinates:

σij = (1/(N−1)) Σ_{k=1}^{N} (x_{i,k} − µi)(x_{j,k} − µj)  (A2.12)
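For a diagonal covariance matrix, the Mahalanobis distance of (A2.10) reduces to a variance-normalized Euclidean distance, which the following sketch illustrates (centre and variances are hypothetical, loosely in the units of the weight/height example that follows):

```python
# Sketch of the Mahalanobis distance for 2-dimensional data with a
# diagonal covariance matrix: each squared deviation is divided by the
# variance of that coordinate before summing.
import math

mu = (65.0, 1.65)      # cluster centre (weight in kg, height in m)
var = (51.3, 0.01)     # diagonal of Sigma

def mahalanobis(x):
    return math.sqrt(sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mu, var)))

# 5 kg away along the high-variance axis vs 0.1 m along the low-variance axis
d_weight = mahalanobis((70.0, 1.65))
d_height = mahalanobis((65.0, 1.75))
assert d_weight < d_height     # 0.1 m is "further" than 5 kg under this metric
print(round(d_weight, 2), round(d_height, 2))   # 0.7 1.0
```

A Euclidean metric would rank these two points the other way around; normalizing by the variances is exactly what makes the classifier account for the shape of the data cluster.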
Notice that the covariance matrix measures the dispersion of the samples of the data cluster, in the radial direction from the cluster centre, in each dimension of the input space; therefore it quantifies the shape of the data cluster. The covariance matrix is positive semi-definite and symmetric. When the data in each dimension are independent, the covariance matrix is a diagonal matrix.

A2.2.2 Discriminant Functions

Let us assume that we have N measurements, each one with D components. These can be seen as N different points in a D-dimensional input space. Any one of these points will be assigned to class ci if (A2.1) is satisfied. Each scaled likelihood can be seen as a function that assigns each point one score for each class. These functions are called discriminant functions. They intersect in the input space, defining a decision surface, which partitions the input space into regions where one of the discriminant functions (the one assigned to the class) is always greater than the others. Under this view, the optimal classifier just computes the discriminant functions g1(x), …, gn(x) and chooses the class that provides the largest value for the specific measurement, as shown in the following figure:

FIGURE A2.4 - An optimal classifier for n classes

The classifiers we have described previously are called parametric classifiers, since their discriminant functions depend on specific parameters (the mean and the variance). There are classifiers whose discriminant functions do not have a specific functional form and are therefore only driven by data. These are called non-parametric classifiers.

A2.2.2.1 A Two-dimensional Example

Let us now assume that we want to classify humans as male or female according to the weight and height of the individuals. We shall assume that we have 1000 samples of weights and heights of males, and another 1000 samples of weights and heights of females. These were random numbers drawn from bivariate Gaussian distributions, and are shown in fig. 5. The sample means and covariance matrices are shown in Table A2.2. We also assume equal a priori probabilities, i.e., 0.5.


Table A2.2 - Data Statistics

            Mean           Covariance matrix
Male        (79.9, 1.75)   [100  0; 0 0.01]
Female      (65, 1.65)     [51.3 0; 0 0.01]

FIGURE A2.5 - Scatter plot of weight and height of males 'o' and females 'x'

Our goal here is to compute the optimal decision surface. Using as discriminant functions the logarithm of the scaled likelihood, we have:

gi(x) = −(1/2)(x − µi)ᵀ Σi⁻¹ (x − µi) − ln(2π) − (1/2)ln|Σi| + ln(P(ci))  (A2.13)

In order to compute the decision surface, we can remove the ln(2π) and ln(P(ci)) terms, as they appear in the discriminant functions of both classes. For women we have:

|Σ1| = 0.56,  Σ1⁻¹ = [0.019 0; 0 92]  (A2.14)

g1(x) = −(1/2)[x1−65  x2−1.65] Σ1⁻¹ [x1−65; x2−1.65] + 0.29
      = −0.0095x1² + 1.1x1 − 94.5x2² + 312x2 − 29  (A2.15)

For men we have:

|Σ2| = 1.07,  Σ2⁻¹ = [0.01 0; 0 100]  (A2.16)

g2(x) = −(1/2)[x1−80.3  x2−1.75] Σ2⁻¹ [x1−80.3; x2−1.75] − 0.03
      = −0.005x1² + 0.8x1 − 50x2² + 175x2 − 185  (A2.17)

Equating (A2.15) and (A2.17), we have:

0.0045x1² − 0.3x1 + 44.5x2² − 137x2 − 156 = 0  (A2.18)

This is the equation of the decision surface. If we superimpose the solutions of this equation on the scatter plot, we end up with:

FIGURE A2.6 - Scatter plot with decision boundary


Artificial Neural Networks

As we can see, many errors will be made using this classification. Optimal does not mean good performance; it only means that it is the best we can do with the data. If we produce a 3-dimensional plot of the modelled pdfs, we get:

FIGURE A2.7 - 3D plot of the modelled pdfs

Producing a contour plot of the pdfs, and superimposing the decision curve, we have:

FIGURE A2.8 - Contour plot of the modelled pdfs
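The two-dimensional classifier of this example can be sketched directly from (A2.13), dropping the common ln(2π) term. The means and diagonal covariances below are the values recovered for Table A2.2, and equal priors are assumed:

```python
# Sketch of the two-class quadratic discriminant of (A2.13) for diagonal
# covariance matrices: evaluate both discriminants and pick the larger one.
import math

classes = {
    "female": ((65.0, 1.65), (51.3, 0.01)),
    "male":   ((79.9, 1.75), (100.0, 0.01)),
}

def g(x, mu, var):
    quad = sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mu, var))
    logdet = math.log(var[0] * var[1])
    return -0.5 * quad - 0.5 * logdet     # equal priors, ln(2*pi) dropped

def classify(x):
    return max(classes, key=lambda c: g(x, *classes[c]))

print(classify((60.0, 1.60)))   # female
print(classify((90.0, 1.80)))   # male
```

Points on which g1(x) = g2(x) are exactly the decision surface (A2.18) superimposed on the scatter plot of fig. A2.6.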

A2.3 Decision Surfaces

In the last example, in generating the data, we assumed that the weight was uncorrelated with the height (which, as we know, is not a realistic assumption), and therefore there are no cross terms in the covariance matrices. There are some special cases where the decision surfaces of the optimal classifiers are linear functions.

If the input variables are uncorrelated and with the same variance, then the covariance matrix is diagonal:

Σi = σi²I  (A2.19)

In this case, the Mahalanobis distance defaults to the Euclidean distance:

(x − µ)ᵀ(x − µ) = ||x − µ||²  (A2.20)

and the classifier is called a minimum distance classifier. The discriminant functions in this case are linear functions:

gi(x) = wiᵀx + bi,  (A2.21)

where wi = µi/σi² and bi = −µiᵀµi/(2σi²). In this case the samples define circular clusters, and the decision surface is a line perpendicular to the line joining the two cluster centres:

w1ᵀx + b1 = w2ᵀx + b2  ⇔  (w1 − w2)ᵀx = b2 − b1  (A2.22)

When the data is correlated, but the covariance matrices are identical, the discriminant is still linear (A2.21), but with wi = Σ⁻¹µi and bi = −(1/2)µiᵀΣ⁻¹µi. The samples define in this case two ellipses (with equal form) and the decision surface is still a line intersecting the line joining the two cluster centres, but in this case not perpendicular to it.

In the general case (different covariance matrices) the discriminant functions are quadratic functions:

gi(x) = xᵀWix + wiᵀx + bi,  (A2.23)

where Wi = −(1/2)Σi⁻¹, wi = Σi⁻¹µi and bi = −(1/2)µiᵀΣi⁻¹µi − (1/2)ln|Σi|. The decision surface is given as a solution of a quadratic equation, and can be a line, a circle, an ellipse or a parabola. Notice that (A2.15) and (A2.17) are special cases of (A2.23).
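The minimum distance classifier of (A2.19)-(A2.22) can be sketched as follows; the two class means are hypothetical, and the sketch checks that the linear discriminant picks the same class as the nearest-mean rule:

```python
# Sketch of the minimum distance classifier: with Sigma_i = sigma^2 * I and
# equal priors, g_i(x) = w_i'x + b_i with w_i = mu_i/sigma^2 and
# b_i = -mu_i'mu_i/(2 sigma^2) chooses the class with the closest mean.
mus = {"c1": (0.0, 0.0), "c2": (4.0, 4.0)}
sigma2 = 1.0

def g(x, mu):
    w = [m / sigma2 for m in mu]                       # (A2.21)
    b = -sum(m * m for m in mu) / (2 * sigma2)
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def nearest_mean(x):
    return min(mus, key=lambda c: sum((xi - mi) ** 2
                                      for xi, mi in zip(x, mus[c])))

for x in [(1.0, 1.0), (3.0, 3.5), (0.5, 2.0)]:
    assert max(mus, key=lambda c: g(x, mus[c])) == nearest_mean(x)
print("linear discriminant agrees with the nearest-mean rule")
```

The set of points where the two discriminants tie is the perpendicular bisector of the segment joining the two means, i.e. the line of (A2.22).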

Sometimes, because of lack of data, or because the input dimension becomes large, it is better to use a sub-optimal classifier than the optimal (quadratic) one. In addition to the problem of computing more parameters, the major problem is that more data is needed to have a good estimation of the parameters of the discriminant functions (typically the number of samples should be more than 10 times the number of parameters), and this amount of data sometimes simply is not available.

A2.4 Linear and Nonlinear Classifiers

Until now, we have encountered two types of classifiers: linear and quadratic (the Gaussian classifier can be expressed as a quadratic one, by applying logarithms). Notice that for Gaussian-distributed classes the optimal classifier is a quadratic classifier, given by (A2.23). Notice also that a linear classifier (one where the discriminant functions are linear) exploits only differences in the cluster means, and not in the clusters' shapes; therefore, linear classifiers perform badly in situations where the classes have the same mean. The linear classifier is expressed as:

g(x) = w1x1 + w2x2 + … + wDxD + b = Σ_{i=1}^{D} wixi + b  (A2.24)

This equation, as we know, defines a hyperplane in D dimensions. The system that implements this discriminant is called a linear machine, with the limitations which have been pointed out, and is depicted in the next figure. Notice that there is one summation for each class.

FIGURE A2.9 - A linear classifier

A more sophisticated approach involves using a linear discriminator, but this time applied not to the measurements, but to an intermediate space (usually of greater dimension), which is a nonlinear mapping of the original input space, and where the problem is linearly separable. This is due to Cover's theorem. Essentially, what must be done is to map the input space to another space, called the feature space. Notice that this is a simple scheme: in contrast with fig. 4, the functions become a sum of products. Let us assume that the original input space has D dimensions, x = [x1 … xD]ᵀ, which is mapped into a feature space Φ by a kernel function:

k(x) = [k1(x) … kM(x)]ᵀ  (A2.25)

In the Φ space, a linear discriminant function is constructed as:

g(x) = w1k1(x) + w2k2(x) + … + wMkM(x) + b  (A2.26)

FIGURE A2.10 - A kernel-based classifier

There is great flexibility in choosing the family of kernel functions to employ. For instance, if a quadratic mapping is chosen, the first D components of (A2.25) could be {xi²}, the next D(D−1)/2 components {xixj, i ≠ j}, and the last D components {xi}. The dimension of the feature space in this case is D(D+3)/2. Other popular choices are Gaussian functions and trigonometric polynomials. The great advantage of using this scheme is that the number of free parameters is decoupled from the size of the input space. Recently, Vapnik [136] demonstrated that if the k(x) are symmetric functions that represent inner products in the feature space, the solution for the discriminant function is greatly simplified. This led to the new field of Support Vector Machines [137].
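The quadratic mapping just described can be sketched for D = 2, where the feature space has D(D+3)/2 = 5 components. The "inside vs outside a circle" problem below is a standard illustration (not taken from the text) of a problem that is not linearly separable in the original space but becomes linearly separable after the mapping:

```python
# Sketch of the quadratic feature mapping: k(x) = [x1^2, x2^2, x1*x2, x1, x2].
# In this feature space the discriminant g = 1 - x1^2 - x2^2 is LINEAR in the
# features and separates points inside the unit circle from points outside.
def k(x1, x2):
    return [x1 * x1, x2 * x2, x1 * x2, x1, x2]

def g(feat, w=(-1.0, -1.0, 0.0, 0.0, 0.0), b=1.0):
    return sum(wi * fi for wi, fi in zip(w, feat)) + b

inside  = [(0.0, 0.0), (0.5, 0.5), (-0.4, 0.3)]   # class 1: inside the circle
outside = [(2.0, 0.0), (0.0, -1.5), (1.2, 1.2)]   # class 2: outside

assert all(g(k(*p)) > 0 for p in inside)
assert all(g(k(*p)) < 0 for p in outside)
print(len(k(1.0, 2.0)))   # 5 features: D*(D+3)/2 with D = 2
```

No hyperplane in the original (x1, x2) plane can separate these two classes; a single hyperplane in the 5-dimensional feature space does, which is exactly the point of Cover's theorem.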

A2.5 Methods of Training Parametric Classifiers

In a parametric classifier, we need to estimate the parameters of the discriminant functions from the data collected. This is called training the classifier. The decision boundary is the intersection of the discriminant functions, and the accuracy of the classifier is dictated by the location and the shape of the decision boundary in the pattern space; therefore, the functional form of the discriminant functions and their placement are fundamental issues in designing a classifier.

There are two different ways of training a parametric classifier: parametric and non-parametric training. In parametric training, the functional form of the discriminant function is known, and the problem lies in estimating its parameters from the data. An example are statistical classifiers where, if Gaussian-distributed pattern categories are used, we need to estimate the mean, the covariance and the prior probabilities to compute (A2.13). In principle, parametric training seems to be the best choice. There are, however, some problems related with these classifiers:

1. Parametric training assumes a specific distribution (usually Gaussian), which sometimes is not true; this limits its use to a small number of distributions (typically, the Gaussian is used).
2. Usually there is a significant number of parameters to estimate. This needs a corresponding amount of data (typically the number of samples should be more than 10 times the number of parameters), which sometimes simply is not available.

In non-parametric training, the parameters of the discriminant functions are estimated directly from the data; there are no assumptions about the data distribution. The problem, however, is the way that class likelihoods are related to the discriminants.


APPENDIX 3 Elements of Nonlinear Dynamical Systems

This appendix introduces some fundamental concepts from nonlinear dynamical systems, needed to understand Chapter 4. It is assumed that the student already has knowledge of a basic course in control systems, covering basic state-space concepts.

A3.1 Introduction

The dynamics of a large class of nonlinear dynamical systems can be cast in the form of a system of first-order differential equations:

ẋ(t) = F(x(t)),  (A3.1)

where x is an n-vector of state variables, ẋ(t) = dx(t)/dt, and F is a vector-valued function. Equation (A3.1) can be viewed as describing the motion of a point in an n-dimensional space. Accordingly, we can think of ẋ(t) as describing the velocity of that point at time t, and for this reason F(x(t)) is called a velocity vector field, or simply a vector field. The instantaneous velocity is the tangent to the curve at the specified time instant. The curve drawn in a graph representing x(t) as t varies is a trajectory.

FIGURE A3.1 - A sample trajectory

A family of trajectories, for different initial conditions, is referred to as a phase portrait of the system. Another view of this system is obtained by representing the instantaneous velocities, for different time instants and for different initial conditions.

FIGURE A3.2 - Phase portrait and vector field of a linear dynamical system

Until now, we have not said anything about the vector function F. For a solution of (A3.1) to exist, and in order for it to be unique, F must satisfy the Lipschitz condition, which states that there is a constant k such that, for all x and u in a vector space:

||F(x) − F(u)|| ≤ k ||x − u||  (A3.2)

A3.2 Equilibrium States

A vector x̄ is an equilibrium state of (A3.1) if:

F(x̄) = 0  (A3.3)

Let us suppose that F is sufficiently smooth for (A3.1) to be linearized in a neighbourhood around x̄. Let:

x = x̄ + Δx  (A3.4)

F(x) ≈ F(x̄) + AΔx,  (A3.5)

where A is the Jacobian matrix of F evaluated at x̄:

A = ∂F/∂x |_{x = x̄}  (A3.6)

Replacing (A3.4) and (A3.5) in (A3.1), we have:

∂Δx(t)/∂t ≈ AΔx  (A3.7)

Provided that A is non-singular, the local behaviour of the trajectories in the neighbourhood of the equilibrium state is determined by the eigenvalues of A.

FIGURE A3.3 - Equilibrium points (stable node, stable focus and unstable node)
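The classification by eigenvalues can be sketched for a second-order system; the Jacobian matrices below are illustrative examples, not taken from the text:

```python
# Sketch of the linearization result (A3.6)-(A3.7): the type of an
# equilibrium of a second-order system follows from the eigenvalues of the
# 2x2 Jacobian A, computed here from its trace and determinant.
import cmath

def eigenvalues_2x2(a, b, c, d):
    tr, det = a + d, a * d - b * c
    s = cmath.sqrt(tr * tr / 4 - det)
    return tr / 2 + s, tr / 2 - s

def classify(A):
    l1, l2 = eigenvalues_2x2(*A)
    re1, re2 = l1.real, l2.real
    if abs(re1) < 1e-12 and abs(re2) < 1e-12:
        return "centre"
    if re1 < 0 and re2 < 0:
        return "stable (node or focus)"
    if re1 > 0 and re2 > 0:
        return "unstable (node or focus)"
    return "saddle point"

print(classify((-1.0, 0.0, 0.0, -2.0)))   # stable (node or focus)
print(classify((0.0, 1.0, -1.0, 0.0)))    # centre
print(classify((1.0, 0.0, 0.0, -1.0)))    # saddle point
```

Real eigenvalues of one sign give nodes, complex-conjugate eigenvalues give foci (or a centre when the real parts vanish), and real eigenvalues of opposite signs give a saddle point, matching the portraits of fig. A3.3 and fig. A3.4.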

FIGURE A3.4 - Equilibrium points (unstable focus, saddle point and centre)

For the case of a second-order system, the equilibrium states can be classified as shown in fig. A3.3 and fig. A3.4.

A3.3 Stability

The technique described above, linearization, allows inspection of the local stability properties at an equilibrium state. Some definitions are in order:

1. The equilibrium state x̄ is uniformly stable if, for any positive ε, there is a positive δ such that:

x(0) − x̄ < δ  ⇒  x(t) − x̄ < ε, for all t > 0  (A3.8)

2. The equilibrium state x̄ is convergent if there is a positive δ such that:

x(0) − x̄ < δ  ⇒  x(t) → x̄, as t → ∞  (A3.9)

3. The equilibrium state x̄ is asymptotically stable if it is both stable and convergent.
4. The equilibrium state x̄ is globally asymptotically stable if it is stable and all trajectories converge to x̄ as t approaches infinity.

Having defined stability, we shall now introduce the Lyapunov method for determining stability. Another definition is needed here: a function V is positive definite in a space ℘ if it satisfies the following requirements:

1. Its partial derivatives with respect to the state vector are continuous.
2. V(x̄) = 0.
3. V(x) > 0, x ≠ x̄.

Lyapunov introduced two important theorems:

1. The equilibrium state x̄ is stable if, in a small neighbourhood of x̄, there exists a positive definite function V(x) such that its derivative with respect to time is negative semi-definite in that region.
2. The equilibrium state x̄ is asymptotically stable if, in a small neighbourhood of x̄, there exists a positive definite function V(x) such that its derivative with respect to time is negative definite in that region.

A function that satisfies these requirements is called a Lyapunov function for the equilibrium state x̄. If we can find a Lyapunov function for the particular system at hand, we prove global stability of the system. The problem is that there is no general method for finding such a function; it is a matter of ingenuity.

A3.4 Attractors

Many systems are characterized by the presence of attractors, which are bounded subsets of a larger space to which regions of non-zero initial conditions converge as time increases. Attractors can be a single point in state space, in which case we speak of point attractors (see fig. A3.3),
or can have other forms; for instance, they can have the form of a periodic orbit (see the last case of fig. A3.4), in which case we speak of a limit cycle.

Related with each attractor, there is a region in state space such that trajectories whose initial conditions lie within that region converge to the specified attractor. This region is called the domain of attraction, or basin of attraction, of that attractor.
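A point attractor and its Lyapunov function can be sketched for the simple linear system ẋ = −x (a choice of this appendix, not an example from the text): V(x) = x1² + x2² is positive definite and decreases along every trajectory, so every initial condition lies in the basin of attraction of the origin.

```python
# Sketch: forward-Euler simulation of x' = -x from several initial
# conditions.  V(x) = ||x||^2 is a Lyapunov function, and all trajectories
# converge to the point attractor at the origin.
def simulate(x, steps=200, h=0.05):
    for _ in range(steps):
        x = [xi + h * (-xi) for xi in x]   # x' = F(x) = -x
    return x

def V(x):
    return sum(xi * xi for xi in x)

for x0 in [[1.0, 0.5], [-2.0, 3.0], [0.1, -0.1]]:
    xT = simulate(x0)
    assert V(xT) < 1e-3 < V(x0) + 1.0      # V decayed towards zero
print("all trajectories converge to the origin")
```

For this system the basin of attraction is the whole state space; for nonlinear systems with several attractors, repeating such simulations from a grid of initial conditions is a simple way to map out the individual basins.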

Exercises

Chapter 1

1. Artificial neural networks were, without doubt, inspired by biological neural networks.
a) Describe and discuss the characteristics found in a biological neural network that are also present in an artificial neural network.
b) Draw and identify each major block of the most used artificial neural network.

2. The equations that describe the most common artificial neural network are:

neti = Σ_{j∈i} ojWj,i + bi  (1)
oi = f(neti)  (2)

From a biological point of view, explain the meaning of each term in the above equations.

3. Consider now one neuron, such as the one shown in fig. 1.2. The neuron has two inputs, i1 and i2, and employs a sigmoid transfer function.
a) Assume that w1 = 1, w2 = −1 and θ = 0.5. What are the values of the net input and the output, if the input is [1 1]? Justify your answer.
b) Consider now that you have 10 random input patterns. It is tedious to answer a) for all the patterns. Write a Matlab function, and employ it to compute the net input vector and the output vector for the 10 input patterns.
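The computation asked for in exercise 3 can be sketched as follows (in Python rather than Matlab; the sign convention of subtracting the threshold θ from the weighted sum is an assumption about fig. 1.2):

```python
# Sketch for exercise 3: net input and sigmoid output of a two-input neuron
# with w1 = 1, w2 = -1, threshold theta = 0.5 (assumed subtracted).
import math

def neuron(i1, i2, w1=1.0, w2=-1.0, theta=0.5):
    net = w1 * i1 + w2 * i2 - theta
    return net, 1.0 / (1.0 + math.exp(-net))

net, o = neuron(1.0, 1.0)
print(net)          # -0.5
print(round(o, 3))  # 0.378
```

Vectorizing this function over a matrix of 10 random input patterns is exactly what part b) asks the student to do in Matlab.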

4. Consider a Gaussian activation function:

fi(Ci, σi) = e^(−Σ_{k=1}^{n} (Ci,k − xk)² / (2σi²))

Determine:
a) The derivative, ∂fi/∂σi, of this activation function with respect to the standard deviation, σi.
b) The derivative, ∂fi/∂Ci,k, of this activation function with respect to the kth centre, Ci,k.
c) The derivative, ∂fi/∂xk, of this activation function with respect to the kth input, xk.

5. Equation (1.24) defines, in a recursive way, the B-spline function of order k, such as the one illustrated in (1.23). Determine the standard equations for a B-spline of:
a) Order 1
b) Order 2
c) Order 3

6. How is learning envisaged in artificial neural networks? Identify the principal learning mechanisms and compare them.

7. Assume that you want to train a simple linear neuron, such as the one in the figure below, to approximate the mapping y = x². For this, you sample the input at 5 different points within the range [−1, 1]: {−1, −0.5, 0, 0.5, 1}.

(figure: linear neuron with input x and bias input 1, weights w1 and w2, output y)

a) Show explicitly the training criterion, and compute the gradient vector for the point [0, 0].
b) Determine the weight vector employing the Hebbian learning rule.

Chapter 2

8. What are the essential differences between the perceptron, introduced by Rosenblatt, and the Adaline, introduced by Bernard Widrow?

9. Suppose that you want to design a perceptron to implement the logical AND function. Determine the value of its weights by hand.

10. Explain, using a diagram, the concept of linear separability. The exclusive OR problem is very well known in artificial neural networks. Demonstrate, using the diagram, that the exclusive OR problem is not a linearly separable problem.

11. The LMS (Least-Mean-Square) rule is given by:

w[k+1] = w[k] + α(t[k] − net[k])x[k],  (3)

where w is the weight vector, t is the desired output, net is the net input, x is the input vector and α is the learning rate. Justify this learning rule.

12. Assume that you have already trained 2 Adalines to implement the logic functions OR and AND. Explain, justifying, how you could solve the exclusive OR problem with Madalines (Multiple Adalines).

13. Consider the boolean function f(x1, x2, x3) = x1 ∧ (x2 ⊕ x3). Explain, justifying, how you could implement this function using Madalines (Multiple Adalines).

14. Consider a Multilayer Perceptron with 1 or more hidden layers, in which all the neuron activation functions are linear. Prove analytically that this neural network is equivalent to a neural network with no hidden layers.

15. As is known, there are several activation functions that can be employed in MLPs. The sigmoid function f(x) = 1/(1+e^(−x)) is the best known, but the hyperbolic tangent function f(x) = (1−e^(−2x))/(1+e^(−2x)) is also often employed. For the pH and the coordinate transformation examples, repeat the training, now with a hyperbolic tangent function instead of a sigmoid transfer function. Comment on the results obtained.

16. The best-known method of training Multilayer Perceptrons (MLPs) is the error back-propagation (BP) algorithm.
a) Describe, justifying, this method.
b) Identify its major limitations, and explain how these can be reduced.

17. Different training methods for MLPs were discussed in this course. For each one of these methods, identify, justifying your comments, its advantages and disadvantages.

18. Consider the BP algorithm and the Levenberg-Marquardt algorithm. Compare them in terms of advantages and disadvantages.

19. Considering the training of a MLP with the BP algorithm, 2 training criteria were introduced:

Ω = (1/2) Σ_{i=1}^{N} (t − y)²  (4)

Ω = (1/2) Σ_{i=1}^{N} (t − AA⁺t)²  (5)

20. Consider MLPs where the last layer has a linear activation function. Consider the training criterion Ωl = ||t − Aw||²/2, where t is the desired output vector, y is the network output vector, N is the number of training patterns, and A is the matrix of the outputs of the last hidden layer. Determine:
a) The gradient vector.
b) The minimum of the criterion.
c) Consider now a training criterion φl = (||t − Aw||² + λ||w||²)/2. Repeat a) and b) for this new criterion.

Compare the 2 criteria of exercise 19 in terms of advantages and disadvantages for off-line training.

21. For the pH and the coordinate transform examples, compare the mean values of the initial error norm and of the initial condition number of the Jacobian matrix, employing random initial values for the weights and employing the Matlab function MLP_initial_par.m. Use for both cases 100 different initializations, and try to justify your findings.

22. Consider that, before training a MLP, you scale the target pattern to a range where the sigmoid is active (say [−0.9, 0.9]), pass this scaled target data (y) through the inverse of the sigmoid nonlinearity (net = −ln((1−y)/y)), and determine the nonlinear weights by least squares.

23. Explain why the Gauss-Newton method does not obtain good results when applied to neural network training.

24. Prove, experimentally, that when the norm of all the column vectors of a rectangular matrix is reduced by a factor α, except for one column vector which remains unaltered, the condition number of the matrix is increased by a factor of approximately α.

25. Using Ex. 2.8, construct first a training set and a validation set using the Matlab function gen_set.m (found in Section 2_1_3_7.zip). Then train the networks with different values of the regularization parameter. Compare their results, in terms of accuracy and conditioning, with the results obtained using the early-stopping method.

26. Consider that, before training a MLP, the input and target patterns are scaled, in such a way that:

IPs = IP·ki + I·li
tps = tp·ko + I·lo,

where IP and tp are the original input and target patterns, IPs and tps are the scaled patterns, ki and ko are the diagonal gain matrices, li and lo are the diagonal offset matrices of the input and output, respectively, and I is the unitary matrix of suitable dimensions. Consider that, for a one-hidden-layer network, W(1) denotes the weight matrix related with the first hidden layer and w(z) the weight vector related with the output layer. After training has finished, transform these matrices so that they produce the same results using the original data.

There are several constructive methods for MLPs. We are going to try one.
the input and target patterns are scaled. before training a MLP. Consider that.where t is the desired output vector. and li and lo are the ordinate diagonal matrices of the input and output. In that case. 24.zip). y is the network output vector. 26. that when the norm of all the column vectors of a rectangular matrix is reduced of a factor α . ki and ko are the diagonal gain matrices. respectively.9. Transform these matrices so that they produce the same results using the original data. for a one hidden layer network.9]).
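The weight transformation asked for in the scaling exercise above can be illustrated in one dimension. The sketch below (plain Python; all gains, offsets and weights are made-up scalars, and a single linear layer stands in for the full network) converts a model trained on scaled data into an equivalent model for the original data.

```python
# Scalar illustration of undoing input/target scaling (made-up constants).
ki, li = 2.0, 1.0      # input scaling:  xs = ki*x + li
ko, lo = 0.5, 0.1      # target scaling: ts = ko*t + lo

ws, bs = 0.7, -0.2     # weights "trained" on the scaled data: ys = ws*xs + bs

# Transformed weights for the original data:
# t_hat = (ys - lo)/ko = (ws*ki/ko)*x + (ws*li + bs - lo)/ko
wo = ws * ki / ko
bo = (ws * li + bs - lo) / ko

def net_scaled(x):     # predict via the scaled model, then de-scale the output
    ys = ws * (ki * x + li) + bs
    return (ys - lo) / ko

def net_original(x):   # predict directly with the transformed weights
    return wo * x + bo

diffs = [abs(net_scaled(x) - net_original(x)) for x in (-3.0, 0.0, 1.5, 10.0)]
```

Both models produce the same predictions, which is exactly the property the exercise asks you to establish for the matrix case.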

You can then train the network until convergence, and add another neuron. For this new neuron, the target pattern is just the error vector obtained after convergence with the last model; v_i, the nonlinear weight vector associated with the ith neuron (in this case, i = 1), is the least-squares solution of net_i = I v_i, where I is the input matrix. All the steps indicated before can then be employed. Use this strategy, with a number of neurons up to 10, for the pH and the Coordinate Transformation examples. Compare the results obtained with training, from scratch, MLPs with 10 neurons. Is there any effect of overmodelling visible?

27. The results shown with the LMS and the NLMS rules, in the previous exercises and in Section 2.3.8.1, were obtained using, as initial nonlinear weight values, the final weight values of previous off-line training sessions, with a regularization factor of 10^-2, in order to have a well-conditioned model. Relax this condition (employ a smaller regularization factor) so that you get a worse model, and compare the convergence rates obtained (use the same learning rates), for the better and worse conditioned models. What do you conclude, in terms of conditioning?

28. Apply the LMS and the NLMS rules to the pH and the coordinate transformation problems, employing the reformulated criterion, and compare, for both cases, the convergence rates obtained.

29. Employ the Matlab function MLPs_on_line_lin.m to the coordinate transformation problem, starting with the initial nonlinear weight values stored in initial_on_CT.mat. Use the LMS and the NLMS rules, with different learning rates. What do you conclude, in terms of convergence?

30. Consider the x^2 problem used as an example in the start of Section 2.3.8.1. There is of course a model mismatch (the model y = w1*x + w2 is not a good model for x^2). Apply the NLMS rule to this problem and identify its minimal capture zone. Now see 5.4 of [124] to have a clearer insight of this problem.

31. Modify the Matlab function MLPs_on_line_lin.m in order to incorporate a deadzone. Compare its performance against the original function.

32. Assume that at time k, a network is presented with pattern I[k], where I is the input matrix, resulting in a transformed pattern a[k]. The LMS rule is afterwards employed.
a) Determine what happens to the a posteriori output error e[k], as a function of the learning rate employed.
b) Prove that the instantaneous autocorrelation matrix, R[k], has only one non-zero eigenvalue, and that its corresponding eigenvector is a[k].

33. Consider that you have an input matrix X, with dimensions 100*3, corresponding to 100 patterns with 3 different inputs. The desired outputs corresponding to this input matrix are stored in vector t. You want to design a Radial Basis Function network to approximate this input-output mapping. Assuming that you are going to design the network using an interpolation scheme, indicate the topology of the network.
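The a posteriori error behaviour asked about in the on-line learning exercises above can be verified in a few lines. With the LMS rule the a posteriori error is e_post = (1 - eta*||a||^2) * e, and with the NLMS rule (step size mu) it is e_post = (1 - mu) * e. The sketch below (plain Python; the pattern, target and rates are made-up values) checks both relations for the linear model y = w1*x + w2.

```python
# One training pair of the x^2 problem (d = x^2); pattern a = [x, 1].
x, d = 2.0, 4.0
w = [0.5, 0.0]
a = [x, 1.0]
norm2 = sum(ai * ai for ai in a)

e_prior = d - (w[0] * a[0] + w[1] * a[1])

eta = 0.05                      # LMS learning rate
w_lms = [wi + eta * e_prior * ai for wi, ai in zip(w, a)]
e_post_lms = d - (w_lms[0] * a[0] + w_lms[1] * a[1])

mu = 0.5                        # NLMS step (effective rate mu/||a||^2)
w_nlms = [wi + (mu / norm2) * e_prior * ai for wi, ai in zip(w, a)]
e_post_nlms = d - (w_nlms[0] * a[0] + w_nlms[1] * a[1])
```

Note that the NLMS a posteriori error no longer depends on the pattern norm, which is the point of the normalization.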

Indicate also, in an analytical form (using the G0 matrix), the values of all the relevant network parameters. Name the problems related with this scheme.

34. The most important class of training algorithms for Radial Basis Function networks employs a hybrid scheme, using a supervised algorithm for the linear layer and heuristics for the determination of the centre and spread values of the basis functions. Among the heuristics employed, the k-means clustering algorithm is the one most often used. What is the reasoning behind this algorithm? Expose it.

35. The models obtained with RBFs using complete supervised methods are ill-conditioned. Justify this problem.

36. Consider a CMAC network, with 2 inputs, with range [0, 5] and interior knots equally spaced 1 unit apart in each dimension. The generalization parameter is 2, and a displacement matrix [1; 2] was used.
a) Consider the point (1.5, 2.5). Draw an overlay diagram, illustrating the active cells for that point. How many cells exist in the lattice? How many basis functions are employed in the network? How many basis functions are active in any moment?
b) Based on the overlay diagram drawn in a), explain the uniform projection principle.
c) How is a well-defined CMAC characterized?

37. Consider a B-spline network, with 4 inputs (x1, x2, x3, x4) and 1 output, y. Assume that the network is composed of 3 univariate sub-networks, related with the variables x1, x2 and x3, and 1 bivariate sub-network, associated with the variables x3 and x4. The univariate sub-networks employ splines of order 2, with 5, 4 and 3 interior knots, respectively. The bivariate sub-network employs splines of order 3, with 4 interior knots in each dimension. What is the total number of basis functions for the overall network? How many basis functions are active in any moment?

38. The problem known as curse of dimensionality is associated with CMAC and B-spline networks. How can this problem be alleviated in B-spline networks?

39. Repeat the application of the training methods above to the two examples mentioned, comparing the results obtained using: a) the standard termination criterion; b) the early-stopping technique; c) explicit regularization.
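The reasoning behind the k-means heuristic mentioned above can be made concrete with a minimal sketch (plain Python; the one-dimensional data and initial centres are made up): alternately assign each pattern to its nearest centre, then move each centre to the mean of its assigned patterns, which monotonically decreases the within-cluster distortion.

```python
# Minimal 1-D k-means (Lloyd's algorithm); illustrative data only.
data = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
centres = [0.0, 10.0]

for _ in range(10):                       # Lloyd iterations
    clusters = [[] for _ in centres]
    for p in data:
        # assignment step: nearest centre (squared Euclidean distance)
        i = min(range(len(centres)), key=lambda i: (p - centres[i]) ** 2)
        clusters[i].append(p)
    # update step: each centre becomes the mean of its cluster
    centres = [sum(c) / len(c) for c in clusters]
```

In an RBF network the converged centres would then be used as the basis function centres, with the spreads set by another heuristic.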

b) The ASMOD algorithm is one of the most used algorithms for determining the structure of a B-spline network. Describe the most important steps of this algorithm.

40. Depending on the training data, sometimes the ASMOD algorithm produces bad-conditioned models. This can be solved in different ways, and in this exercise you try them, and compare the results:
a) Change the performance criterion employed to evaluate the candidates to an early-stopping philosophy (evaluate the performance on a validation set).
b) Employ a regularization factor in the computation of the linear weights, employing different regularization factors.
c) Change the training criterion in order to minimize the MSRE (mean-square of relative errors), and compare the results obtained using the MSE and the MSRE criteria.
d) Compare the results obtained using the standard termination criterion with the early-stopping method.

41. Assume that, for a particular application, the data generation function is known: f(x1, x2) = 3 + 4*x1 - 2*x2 + 0.5*x1*x2, in the intervals x1 in [0, 1] and x2 in [-1, 1]. The goal is to emulate this function using a B-spline neural network. Determine this network analytically.

42. Consider now the application of neural networks for classification.
a) Explain the main differences between the use of neural networks for classification and for approximation.
b) What are discriminant functions? And decision surfaces? Are neural classifiers parametric or non-parametric classifiers?
c) Explain, briefly, the training process of support vector machines.
d) What is the difference between standard perceptrons and large-margin perceptrons? Compare the results obtained with the perceptron learning rule, an algorithm minimizing the SSE for all patterns, and the process indicated in Section 2.2, in terms of the margins obtained, for: a) the AND problem; b) the OR problem.
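The perceptron learning rule referred to in the classification exercise above can be sketched on the AND problem (plain Python; the learning rate and zero initial weights are arbitrary choices). Since AND is linearly separable, the rule is guaranteed to converge in a finite number of updates.

```python
# Perceptron learning rule on the AND problem (illustrative settings).
patterns = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
w = [0.0, 0.0]
b = 0.0
eta = 1.0

for _ in range(100):                      # epochs (converges much earlier)
    errors = 0
    for x, t in patterns:
        y = 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0
        if y != t:                        # update only on misclassification
            w = [wi + eta * (t - y) * xi for wi, xi in zip(w, x)]
            b += eta * (t - y)
            errors += 1
    if errors == 0:                       # one clean epoch: training is done
        break

outputs = [1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0 for x, _ in patterns]
```

The rule stops at the first separating hyperplane it finds; it makes no attempt to maximize the margin, which is the contrast with large-margin perceptrons the exercise asks about.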

Chapter 3

43. Consider that a set of plants is identified using the identification measures:

   F(sigma_k, T_z, T_p) = e^(-L*sigma_k/TT) * [prod_{i=1}^{nz} (1 + T_zi*sigma_k/TT)] / [prod_{i=1}^{np} (1 + T_pi*sigma_k/TT)],

   with TT = L + sum_{i=1}^{np} T_pi - sum_{i=1}^{nz} T_zi.

Assume that two identification measures (k = 1, 2) are to be used. Identify the best set of two sigma_k values, out of the set {0.5, 1, 2, 5, 10, 15, 20}. Use the Matlab function F_ana.m to compute the identification measures, and collect the plants under consideration from plants.mat.

44. Consider now linear associative memories (LAMs).
a) Describe the structure and the operation of a LAM.
b) Why must the training patterns be orthogonal? Show that, for a perfect recall, the weight matrix is given by W = D X+, where X stores the input patterns, D the desired outputs, and X+ denotes the pseudo-inverse.
c) Given the training pairs (x1, y1), (x2, y2), (x3, y3) indicated, and W[0] = 0, calculate W[3] using the forced Hebbian rule.

45. We have described, in our study of LAMs, the hetero-associator, providing a link between two distinct sets of data. Another paradigm is the auto-associator, where a set of data is linked to itself. This is useful for input reconstruction, noise removal and data reduction. Describe how it can be done.

46. Oja's rule extracts the first principal component of the input data, i.e., it determines, in the network weights, the corresponding eigenvector, and produces at its outputs the largest eigenvalue. Prove that Oja's rule determines this eigenvector. A repeated application of Oja's rule is able to extract all principal components; this is called the Generalized Hebbian Algorithm (GHA).

47. PCA can also help in system identification applications. Describe how.
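Oja's rule, discussed in the exercises above, can be watched converging on a toy data set (plain Python; the samples are made-up points lying exactly on the direction (3, 1), so the first principal component is known in advance).

```python
import math

# Oja's rule: w <- w + eta * y * (x - y * w), with y = w . x
samples = [(3.0, 1.0), (-3.0, -1.0), (1.5, 0.5), (-1.5, -0.5)]
w = [1.0, 0.0]
eta = 0.02

for _ in range(2000):
    for x in samples:
        y = w[0] * x[0] + w[1] * x[1]
        w = [wi + eta * y * (xi - y * wi) for wi, xi in zip(w, x)]

# The weights converge to the unit-norm eigenvector of the autocorrelation
# matrix associated with the largest eigenvalue, here +-(3, 1)/sqrt(10).
u = (3 / math.sqrt(10), 1 / math.sqrt(10))
alignment = abs(w[0] * u[0] + w[1] * u[1])
norm = math.sqrt(w[0] ** 2 + w[1] ** 2)
```

The extra term -eta*y^2*w is what keeps the weight norm at one; plain forced Hebbian learning would let it grow without bound.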

48. a) Prove that the application of forced Hebbian learning performs an eigendecomposition of the autocorrelation function.
b) Typically an auto-associator has no hidden layer. A more powerful structure, an autoencoder, is achieved if a hidden layer (where the number of neurons is equal or less than the number of neurons in the input/output layer) is added to the network. Prove that, if the weight matrices between layers are equal and transposed, the application of forced Hebbian learning implements principal component analysis.

49. Explain why the LMS rule can be seen as a combination of forced Hebbian and anti-Hebbian rules.

50. The following figure shows a simple 2*2 self-organizing map. The SOM has two-dimensional inputs, and the current weights are given in Table 1.

FIGURE 1 - A 2*2 SOM (nodes numbered 1 to 4)

Table 1 - SOM weights

   Node    Input 1    Input 2
   1       0.9        0.87
   2       0.85       0.92
   3       1.54       1.64
   4       0.71       1.74

Given an input vector x = [1 2]', find the winning node, using the Euclidean distance between the input vector and the weight vectors. State and explain the self-organizing map weight update equation.

51. Implement, in Matlab, the SOM algorithm as described in Section 3.2.1. Generate 100 bi-dimensional patterns, covering the space [-1 1] x [-1 1], and use this function to implement:
a) A two-dimensional SOM with 5*5 neurons
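The winner-selection and weight-update mechanics asked for in the SOM exercises can be sketched as below (plain Python; the node weights here are made-up values, not the ones of Table 1, and only the winning node is updated, i.e., the neighbourhood function is 1 for the winner and 0 elsewhere).

```python
# SOM winner selection and weight update (illustrative weights).
weights = {1: [0.2, 0.3], 2: [1.1, 0.8], 3: [0.9, 1.9], 4: [1.8, 1.6]}
x = [1.0, 2.0]

def dist2(w, x):                       # squared Euclidean distance
    return sum((wi - xi) ** 2 for wi, xi in zip(w, x))

winner = min(weights, key=lambda n: dist2(weights[n], x))

# Update rule: w <- w + alpha * h(winner, n) * (x - w), here h = 1 for the winner
alpha = 0.3
before = dist2(weights[winner], x)
weights[winner] = [wi + alpha * (xi - wi) for wi, xi in zip(weights[winner], x)]
after = dist2(weights[winner], x)
```

In a full SOM the neighbourhood function h also pulls the winner's lattice neighbours towards x, with a radius that shrinks over time.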

b) A one-dimensional SOM with 25 neurons.

Chapter 4

52. Consider a Hopfield network made up of five neurons, designed to store the three fundamental memories

   xi1 = [ 1  1  1  1  1]'
   xi2 = [ 1 -1 -1  1 -1]'
   xi3 = [-1  1 -1  1  1]'

a) Determine the weight matrix.
b) Demonstrate that all three fundamental memories satisfy the alignment condition.
c) Investigate the retrieval performance of the network when the second element in xi1 is reversed in polarity.
d) Show that the following patterns are also fundamental memories of the network:

   xi4 = [-1 -1 -1 -1 -1]'
   xi5 = [-1  1  1 -1  1]'
   xi6 = [ 1 -1  1 -1 -1]'

How are these related with the trained patterns?

53. The model described by (4.5) is amenable to analog implementation.
a) Draw a circuit that implements this model.
b) Draw a block diagram describing this equation.

54. Demonstrate that the Lyapunov function of the BSB model is given by (4.32).

Appendix

55. Compute the solution of

   [1  2  1]        [ 6]
   [3  4  1]  a  =  [ 5]
   [3 14  1]        [22]
   [1 18 -2]        [ 0]
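Parts b) and d) of the Hopfield exercise can be verified numerically. The sketch below (plain Python; memories as reconstructed above) builds the outer-product, forced-Hebbian weight matrix with zero self-connections, and checks that each stored pattern, and its negative, is a fixed point of the update rule (with the usual convention that a zero local field leaves the state unchanged).

```python
# Fixed-point check for a 5-neuron Hopfield network.
memories = [
    [ 1,  1,  1,  1,  1],   # xi1
    [ 1, -1, -1,  1, -1],   # xi2
    [-1,  1, -1,  1,  1],   # xi3
]
n = 5

# Outer-product storage, zero diagonal; the 1/n scaling is omitted since
# only the sign of the local field matters.
W = [[0] * n for _ in range(n)]
for xi in memories:
    for i in range(n):
        for j in range(n):
            if i != j:
                W[i][j] += xi[i] * xi[j]

def is_fixed_point(s):
    for i in range(n):
        h = sum(W[i][j] * s[j] for j in range(n))
        new = s[i] if h == 0 else (1 if h > 0 else -1)  # sgn(0) keeps the state
        if new != s[i]:
            return False
    return True

negatives = [[-v for v in xi] for xi in memories]   # xi4, xi5, xi6
stable = all(is_fixed_point(p) for p in memories + negatives)
```

The negatives being stable is a general property of Hopfield networks: the energy function is even in the state, so every fundamental memory is stored together with its polarity-reversed twin.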

56. See if the system

   [2  6  4  2]        [1]
   [4 15 14  7]  w  =  [6]
   [2  9 10  5]        [5]

is consistent.
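A system A w = b is consistent exactly when rank(A) = rank([A | b]). The sketch below (plain Python, using the matrix of exercise 56 as reconstructed here) computes both ranks by Gaussian elimination; here the two ranks agree, so the system is consistent even though A has more columns than independent rows.

```python
# Consistency check via rank comparison.
A = [[2, 6, 4, 2],
     [4, 15, 14, 7],
     [2, 9, 10, 5]]
b = [1, 6, 5]

def rank(M, tol=1e-9):
    M = [row[:] for row in M]          # work on a copy
    r = 0
    for c in range(len(M[0])):
        # find a usable pivot row for this column
        p = next((i for i in range(r, len(M)) if abs(M[i][c]) > tol), None)
        if p is None:
            continue
        M[r], M[p] = M[p], M[r]
        for i in range(r + 1, len(M)):
            f = M[i][c] / M[r][c]
            M[i] = [a - f * x for a, x in zip(M[i], M[r])]
        r += 1
    return r

augmented = [row + [bi] for row, bi in zip(A, b)]
consistent = rank(A) == rank(augmented)
```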

Siffert. 58-60 Lippmann. 30th IEEE Conference on Decision and Control. 9... Shirvaikar. pp. F. Kosuge. 2. pp. 771-777 Rajavelu. 'A neural network approach to character recognition'. 1990. T. T. force and impact control for robotic manipulators'. Neural Networks.. 1. K. 9. 'A neural network approach for bone fracture healing assessment'. pp. England. B. pp. Hakim. IEEE Transactions on Neural Networks.. R. Shibata. Rubinstein. 'Image compression by back propagation: an example of an extensional programming'. Zipser.. pp... 162-167 Artificial Neural Networks 273 .. 1-38 Sejnowsky. G. pp. T. 'Review of neural networks for speech recognition'. K. Wake. Hatem. Nasser. 'A comparison of neural networks and other pattern recognition approaches to the diagnosis of low back disorders'. Vol. R. 355-365 Bounds. M. P. 1987. K.. Nº 3.. J. F. S.. 3. 23-30 Cios. 'Parallel networks that learn to pronounce English text'. M. 145-168 Cottrel.. Vol. IEEE Engineering in Medicine and Biology.. S. 'Handwritten alphanumeric character recognition by the neocognitron'. 4. Munro. Tokita. IEEE Control Systems Magazine. Vol. 1990. IEEE Engineering in Medicine and Biology... P. Musavi.. of California at San Diego. Nº 3. P. 25-30 Fukuda. T. Vol. A.. Richfield. N. Univ. Lattuga.... Rosenberg. 175-178 Kuperstein.Bibliography [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] Mehr.. pp. 'Neuromorphic sensing and control . 'Operational experience with a neural network in the detection of explosives in checked airline luggage'. C. D. Chen. 1990. pp. 1991. 1991.. Nº 5.. 'Use of neural networks in detecting cardiac diseases from echocardiographic images'. 583-591 Kaufman. 1987 Shea. D. M. Neural Networks. M. J. R... Pilla.. A. Nº 3. K. Lloyd. Vol. Chiabrera.application to position. Liu. 387-393 Fukushima. V. Mathew. 2. Figueiredo. Complex Systems.2. Langenderfer. R. Mitsuoka. pp. Brighton.. IEEE 1st International Conference on Neural Networks. A. 1990. 1989. Vol. Nº 5. 9. 1989. Arai. D. pp... 
ICS Report 8702. 'Implementation of an adaptive neural controller for sensory-motor coordination'. 1987. 1989. 1.. pp... Vol. 'Neural net application to optical character recognition'. Vol. Nº 1. IEEE INNS International Conference on Neural Networks. Nº 3. Vol... N. M... Neural Computation.

353-359 Hopfield. Nº 3. Computer and Information Sciences. 'An introduction to computing with neural nets'. Vol. Vol.. K. V. Wiley. IEEE Transactions Electronic Computers. 115-133 Hebb.. neuron using chemical "memistors" '. Vol. 'Correlation matrix memories'.27-32 Hecht-Nielsen. 'Neurons with graded responses have collective computational properties like those of two-state neurons'. 'Principles of neurodynamics'. J. pp. IEEE Control Systems Magazine. 1960. IEEE ASSP Magazine.. 1963 Minsky. MIT Press. 'A logical calculus of the ideas immanent in nervous activity'... 'Neural networks in control systems'. 1987. 1968. M. Symposium Proceedings. J. 'Neurocomputing'.. Psych. W. USA.. R. 'A memory storage model utilizing spatial correlation functions'.. Vol.. J. USA.. W. 1960 Widrow. S. F. 1960 WESCON Convention Record.. 1986 Vemuri. 81. 1943. Widrow. 'Pattern recognizing control systems'. Vols. T. D. 'Adaptive switching circuits'. Biological Cybernetics. Pitts. 113-119 Fukushima. 1990. 1962 Widrow.K. Smith. 1 and 2.. 'Basic brain mechanisms: a biologist's view of neural networks'.. Proc. Rosenfeld. 1989. USA.. Nat.. Math. 18-23 Lippmann.. 1969 Amari. pp. Nº 3. 21. Spartan Books. Nº 6. MIT Press. Sci. Addison Wesley. J.. 'Analog VLSI and neural systems'. pp.. 10. IEEE Technology Series Artificial Neural Networks: Theoretical Concepts.. pp. 1984.. B. 1988.. First IEE International Conference on Artificial Neural Networks'. Sci. D. U. pp. 36-41 Mead. B.. E. 1-11 Hecht-Nielsen. 4. Vol. 'Identification and control of dynamical systems using neural networks'. pp. C.. pp. Report 1553-2.. 'An adaptive "Adaline". Digest Nº 1989/83. IEEE Transactions on Neural Networks. pp. 1989 Murray. Bulletin of Mathematical Biophysics. pp. 1-5 Antsaklis. pp.. 5. S.. Tech. 1975. 'Artificial neural networks: an introduction'.. J. J. pp. IEEE Transactions on Computers. & the PDP Research Group. M. Stanford Electronics Labs. IEE Colloquium on 'Current issues in neural network research'. Nº 5. R. 
'Cognitron: a self-organizing multilayered neural network'. Nº 3. 1988. 1990.[13] [14] [15] [16] [17] [18] [19] [20] [21] [22] [23] [24] [25] [26] [27] [28] [29] [30] [31] [32] [33] [34] [35] [36] [37] [38] Narendra. 1988 McCulloch. 1982. 209-239 Kohonen.. 4-22 Anderson. Nº 2. pp. 'The organization of behaviour'.. Nº 20. 1969. 1. 1972. K. Nº 3. 121-136 Grossberg. Papert. B. 'Neural networks and physical systems with emergent collective computational abilities'. Stanford. Acad. K. Acad. 'Parallel distributed processing'.. McClelland. 299-307 Anderson. IEEE Spectrum. Nº 4. Vol. Vol. 'Embedding fields: a theory of learning with physiological implications'. pp. 1967. B. pp. 'Neurocomputing: picking the human brain'. pp. 'Perceptrons'. 25.. Nº 1. D. Kybernetik. 2554-2558 Hopfield. pp. 'Neural networks for self-leaming control systems'. 10. P. A... 4. Spartan Books. 16.. MIT Press. Hoff. Parthasarathy. Vol. R. 1949 Rosenblatt. S. 'A theory of adaptive patterns classifiers'. 4-27 Nguyen. 'Silicon implementations of neural networks'. F. R. 'Neurocomputing: foundations of research'.. Vol.. 1990. Addison-Wesley. IEEE Control Systems. 3088-3092 Rumelhart. 1990 Strange. Nat. 3-5 274 Artificial Neural Networks . pp.. 1989. 79. 96-104 Widrow. Proc.

A. D. 587594 Kawato. Isobe.. Minden-nan. Vol... 13. Nº 12. 'Use of neural networks for sensor failure detection in a control system'. L. IEEE Automatic Control Conference. H. U. Vol.'Self-organization and associative memory'. 97. September 1991. 1988. P. 170-171 Chen. Anderson. pp... San Diego.. Nº 2. Proceedings of the 1st IFAC Workshop on 'Algorithms and Architectures for RealTime Control'. 1988. C. 8. Vol. J... B. Vol. 'First year report'. 10. M. R. P. A. K. 3-7 Sjoberg. R. Vol. S... IEEE Control Systems Magazine.. Ljung. 1989. Q. series G. IEEE International Conference on Neural Networks. V.. Wang. ‘Neuron-like adaptive systems that can solve difficult learning control problems’. 1990.. 321-355 Cox. McAvoy. Vol. 2. 'Nonlinear systems identification using neural networks'.. Nº 2. G... Vol. Zhang. 'Nonlinear signal processing using neural networks: prediction and system modelling'... Glorennec. 1981. S. J. 1990. Mukhopadhyay. 'Hierarchical neural network model for voluntary movement with application to robotics'. 1983. 1990. M. 'The Cerebellar model articulation controller'. Grant. pp. D.G.. Transactions ASME. R. 'Introduction to neural networks for intelligent control'. Springer-Verlag. 2nd ed. 1-13 Lapedes. 77-88 Bavarian.. J.. A. S. 51. Nº 2. Grossberg.. 49-55 Leonard... E. Juditsky.. 'Multilevel control of dynamical systems using neural networks'. 17.. Cherkassky. IEEE Control Systems Magazine.. 1989 Ljung. 1990. Y. pp. pp. 3. Kramer. Vol. T. S. Suzuki. pp. Billings.. Sutton. 299-304 Naidu. No 3. No 5. T. Automatica.. 2478-2483 Narenda. Nº 3. pp. pp... Man & Cybernetics. Brighton. pp. R... Vol. Vol. 30th IEEE Conference on Decision and Control. Uno. A. ‘Algorithms for spline curves and surfaces’. T. pp. Tech... International Journal of Control. 1691-1724 Elsley.. McAvoy. IEEE Control Systems Magazine. Vol 31.S. pp. P. IEEE Transactions on Systems. A. pp. Lowe..N. N2 3.. Complex Systems. Zafiriou.. 1975 Barto. G. L. Delyon. 
‘The ART of adaptive pattern recognition by a self-organizing neural network’.C. pp.. Sept. IEEE Computer.W. 1988. 834-846 Carpenter. IEEE International Conference on Neural Networks. Benveniste.. G. M. pp. 'Classifying process behaviour with neural networks: strategies for improved training and generalization'. M. 'A learning architecture for control based on back-propagation neural networks'. 21. 1988 Broomhead. Automatica. Nº 1. 'Neural networks for nonlinear adaptive control'. 'Modelling chemical process systems via neural computation'. England. 'Analysis of a general recursive prediction error identification algorithm'.E.E. Nº 3. 89-99 Lightbody. 1191-1214 Ruano. N.[39] [40] [41] [42] [43] [44] [45] [46] [47] [48] [49] [50] [51] [52] [53] [54] [55] [56] [57] [58] Kohonen... 2. 8. J. pp. 1988. 24- Artificial Neural Networks 275 . N. 1991. B. 10. IEEE Control Systems Magazine'. Los Alamos National Lab. S. Hjalmarsson.. Rept.. Farber. Irwin. 'Multivadable functional interpolation and adaptive networks'. 1990 Albus. 'A solution to the inverse kinematic problem in robotics using neural network processing'. Vol. NPL Report DITC 166/90... S. pp. 1995. ’Nonlinear Black-box Modelling in Systems Identification: a Unified Overview’. Vol. 1988.. 8-16 Guo.. 1987 Bhat.

International Journal of Control. D. 1989 Kraft. Morris. A. Vol.. 138.. pp.. D. January 1991. Sbarbaro.. U. Brighton..... E. Wienand.. 'Software implementation of a neuron-like associative memory for control applications'.. Proceedings of the ISMM 8th Symposium MIMI. D... England. 138. Campagna. Tham. Proceedings of the 9th IFAC World Congress. Automatica. Vol. C. pp. Vol. Proceedings of the IFAC Microcomputer Application in Process Control. K. Willis. 'Adaptive Control'. D. pp. M. 23. Brown. Y. 'A nonlinear receding horizon controller based on connectionist models'. 357-363 Hunt. B. pp.. 'Autonomous vehicle identification. 'A comparison of adaptive estimation with neural based techniques for bioprocess application'.. Soquet. Militzer. G. 551-558 Saerens. Vol. Sideris. 1984.. 1982 ErsU.[59] [60] [61] [62] [63] [64] [65] [66] [67] [68] [69] [70] [71] [72] [73] [74] [75] [76] 29 Lant.. 1.. 'Neural Controllers'. 1984. 30th IEEE Conference on Decision and Control.. 1991. Yamamura.. P. San Diego. pp. 138. Istanbul. Digest Nº 1991/019 Astrom. London.. pp. pp. Proceedings of the 1989 International Joint Conference on Neural Networks. Montague. control and piloting through a new class of associative memory neural networks'.. Massimo... 'Real-time implementation of an associative memory-based learning control scheme for nonlinear multivariable processes'.. IEE Proceedings. 1990. 4. 'Artificial neural network based control'. 'Generalized predictive control . pp. A. Vol. 1992. 1991. M. Nº 2. A. Turkey.. Morris. IEE Proceedings. 193-224 Ersil. Addison-Wesley.. Tham. S. P. M. G.. J. Proceedings of the Symposium 'Application of multivariable system technique'. Arkun.K. S.. Edinburgh. Mothadi. D. pp. 6165. IEEE Automatic Control Conference... 109-119 Ersil. 137-148 Sbarbaro. pp. 99-105 Psaltis. G.. Hunt. S. M... Part D. A. Nº 1.. C. D. Jamaluddin. Hungary. Vol. E. M. Plymouth. 55-62 Nguyen. IEE Proceedings Part D.part 1: the basic algorithm'. IEEE Control Systems Magazine. 
Morris. Vol. Tham. A. Budapest. 'Properties of neural networks to modelling nonlinear dynamical systems'. 1987. pp. 'A new concept for learning control inspired by brain theory'. 172-173 Hernandez. E. Nº 5. M.. 'Artificial neural networks in process engineering'. 'A comparison between CMAC neural networks and two traditional adaptive control systems'.. Vol. Montague. H.. 'Neural networks for nonlinear internal control'. Part F.. Vol. pp.. 55. Willis. Chen. 2. 'Neural network modelling and an extended DMC algorithm 276 Artificial Neural Networks . 36-43 Ersu.. 'The truck backer-upper: an example of self-leaming in neural networks'. pp. Tuffs. C. 1990. Wittenmark. 'An associative memory based learning control scheme with PIcontroller for SISO-nonlinear processes'. 1986. Widrow. E. 1039-1044 Montague. Tolle. U. L.. IEE Colloquium on Neural Networks for Systems: Principles and Applications'. 10.. 1991. pp. Vol. IEEE First Conference on Neural Networks. B. Militzer. M. Nº 3. 2173-2178 Willis. K. 1991.. 266-271 Clarke.. 'Neural-controller based on back-propagation algorithm'. Nº 3. M. Proceedings of the IEE International Conference Control 91. K. pp...K. 1989. J. Nº 1. 1991.. E. 2. 1987.. A. 256266 Billings. Switzerland. 431-438 Harris.

. File 1. Manolakis. 865-866 Strang.. 619-627 Sandom.. A. C. PhD. G. UCNW.. IEEE Control Systems. Macmillan Publishing Company. 'Back-propagation neural networks for nonlinear self-tuning adaptive control'. 2. Van Loan. D. Yu and Ahmad. ‘The differentiation of pseudo-inverses and nonlinear least squares problems whose variables separate’. 1989 de Garis. S81-64. 'Learning algorithms for connectionist networks: applied gradient methods of nonlinear optimization'. pp. Prentice-Hall International.H. IEE Internacional Conference Control 91. San Diego. 1974 Parker. IEEE Transactions on Neural Networks'. 10. 1990. 2454-2459 Chen. A. Harvard University. IEE Proceedings. V. Vol. 'A local interaction heuristic for adaptive networks'.. Appl.. G. L.. ‘Computer-aided control system design using optimization methods’. Y. Pereyra. D. Fleming.K. Jones. 10. R. Nº 3. pp. ‘Introduction to digital signal processing’. Addison-Wesley. 'A connectionist approach to PID autotuning'.. Neural Computation. U. ‘Genetic programming: modular neural evolution for Darwin machines’. L. Math. 'Adaptive Control'.’Genetic algorithms in search. Vol.. A. 410-417 Ruano. UK. 2. Part D. 2nd IEEE International Conference on Neural Networks. Brighton. SIAM JNA. pp. Vol. 'A nonlinear regulator design in the presence of system uncertainties using multilayered neural networks'. Vol. 1989 Ruano. pp. A. Vol. 1980 Golub. 'Asymptotic convergence of backpropagation'.49-57 Ruano. ‘Adaptation in natural and artificial systems’. ‘Digital spectral analysis with applications’. 1988. 1987 Holland. Fleming. 1. pp. 1988. 1973. Tokumaru. P.. Sakai. H. 'Learning logic'. International Joint Conference on Neural Networks. 'Beyond regression: New tools for prediction and analysis in the behavioral sciences'. 1992. Academic Press. pp. 3. J. England. Jones. Vol. 1. 1975. 511 Grace. optimization and machine learning’. P.382-391 Jacobs. Neural Networks. H. Vol. 2. Addison-Wesley. ‘Linear algebra and its applications’. D.. 
317-324 Astrom. 1988 Marple... Vol. of Michigan Press. Artificial Neural Networks 277 . 279-285 Golub. 1st IEEE International Conference on Neural Networks. D. 1.. I. Stanford University Tesauro.A. 1990. P.. D.[77] [78] [79] [80] [81] [82] [83] [84] [85] [86] [87] [88] [89] [90] [91] [92] [93] [94] [95] [96] [97] [98] [99] to control nonlinear systems'... Ann Arbor: The Univ. Edinburgh.. IEEE Automatic Control Conference.... 15. 295-307 Watrous. D.. pp... B. Vol. pp. He. 1991.. Nº 4. Nº 3. S.. BIT. 139. pp.. Vol.. and Fleming P. 762-767 Werbos.. Vol.. Uhr. 413432 Ruano. pp. 1989. G. PhD. 3. 'A connectionist approach to PID autotuning'. 44-48 Iiguni. H. ‘Matrix computations’. ‘Applications of Neural Networks to Control Systems’. F. pp.. Vol. Jones.. G. Thesis. K. S. Wittenmark. 1991.. A. pp. J. North Oxford Academic. ‘A variable projection method for solving separable nonlinear least squares problems’. R. 1987. 1983 Proakis. U. Nº 2. 1975 Goldberg. 1992 Kaufman.. 'Increased rates of convergence through learning rate adaptation'. P.. pp. Invention report. Office of Technology Licensing. Nº 3..S. Doctoral Dissertation. Nº 3. 199 1. ‘A new formulation of the learning problem for a neural network controller’ 30th IEEE Conference on Decision and Control.

76-90 [106] Fletcher. P. M. IFAC Symposium on System Identification.. Wright.E. Journal of Control. pp.. 1973 [114] Bellman. E. pp. 1995... Mathematics of Computation. R. 317-322 [107] Goldfarb. P.G.. of the IEEE.H... 1990 [119] Kavli. pp.C. 157-169 [118] Poggio. ‘Improved allocation of weights for associative memory storage in learning control systems’. ‘A method for the solution of certain problems in least squares’. in ‘Algorithms for approximation of functions and data’.. 1970. R. comparisons. D. J. ’Identification with dynamic neural networks architectures. 10.. Computer Journal. 58. Ernst.. C. A. Academic Press.Thesis. Vol.. ’Optimal adaptive k-means algorithm with dynamic adjustment of learning rate’.C. J. pp. Journal of the Institute of Mathematics and its Applications. Inst. U. 134139. Japan. Vol. University College of North Wales.. Zurich. L... ‘Pattern classification and scene analysis’. ’Networks for approximation and learning’. ‘Variable metric method for minimization’. ’Learning in Intelligent Control’. Darken. ‘Adaptive control processes’. W.. applications’.431-441 [111] Powell. M. Delyon. Research and Development Report. SIAM J. 9-10 October 1998 [122] Isermann. I.. ’The ASMOD algorithm: some new theoretical and experimental results’. Mason J. Girosi. ‘A new approach to variable metric algorithms’. A. Addison-Wesley Publishing. 1st IFAC Symp. Glorennec. ‘An algorithm for least-squares estimation of nonlinear parameters’. T.-Y-. 11. Cosy Annual Meeting. 6.. Kavli. Mathematics of Computation. 24.C.O. M. C. Benveniste. Vol. 1991. ANL . 1972 [117] Chinrungrueng.K. 1963. Vol.. on Design Methods for Control Systems. A. Appl. B. D. pp. D. 1980 [103] Davidon. ‘Practical optimization’. F. pp. 24. ‘A family of variable metrics updates derived by variational means’. ‘Fast learning in networks of locally-tuned processing units’. pp. Technical Report. 2.. ‘Introduction to linear and nonlinear programming’. 1963. March 1995 [121] Dourado.. K. M. 

Index

A
a posteriori probability 242
a posteriori output error 113
action potentials 4
activation function 6, 8
active lines 40
Adaline 3, 50
adaline learning rule 54
adaptation 23
Adaptive Resonance Theory 8
anti-Hebbian rule 201
approximators 125
ASMOD 154
Associative memory 190
Auto-associator 269
Autocorrelation 190
autocorrelation matrix 112
Autoencoder 270
axon 4

B
basis functions 135
basis functions layer 21
batch learning 23
batch update 57
Bayesian threshold 244
Bernard Widrow 3
best approximation property 123
BFGS update 84
Bidirectional Associative Memories 8
Boltzmann Machines 8
Brain-State-in-a-Box 8, 223
B-Spline 9, 145

C
Cauchy-Schwartz inequality 43
cell body 4
Cerebellar Model Articulation Controller 10
coincident knot 136
compact support 137
competitive learning 23
condition number 78
consistent equations 227
content addressable memories 35
cross-correlation matrix 203
curse of dimensionality 138

D
decision surface 242, 246
delta rule 54, 120
delta-bar-delta rule 121
deterministic learning 23
DFP method 84
direct approach 30
discriminant functions 246
displacement matrix 140
Donald Hebb 2
Dual problem 177

E
early-stopping 106
energy minimization 214
epoch learning 23
epoch mode 57
Error back-propagation 63
explicit regularization 106
exterior knots 136

F
Feature space 252
feedback factor 224
feedforward networks 19
forced Hebbian learning 22
fully connected network 19
functional form of discriminant functions 253

G
Gauss-Newton 84
generalised learning architecture 27
generalization 106
generalization parameter 137, 145
Generalized Hebbian Algorithm 269
Generalized Inverse 230
Generalized Radial Basis Function 128
Gradient ascent 191
gradient descent learning 22
Gradient vector 55

H
Hebbian learning 22, 23
Hebbian or correlative learning 22
Hebbian rule 2, 189
Hessian matrix 83
Hetero-associator 269
hidden layer 20
hidden neurons 19
Hopfield 8
hybrid 125

I
Idempotent matrices 235
implicit regularization 106
Indicator function 180
indirect approach 30
Inner-product kernel 182
input layer 19
input selection 200
instantaneous autocorrelation matrix 112
instantaneous learning 23
interior knots 136
interpolation 125

J
Jacobian matrix 56
John Hopfield 3

K
k-means clustering 126
Kohonen Self-Organized Feature Maps 8

L
Lagrangian function 177
lattice-based networks 134
learning 21
Least-Mean-Square 50, 54
least-squares solution 76
Levenberg-Marquardt methods 84
likelihood 242
linear 21
Linear Associative Memory 203
linear weight layer 21
linearly separable classes 42
LMS rule 110
localised Gaussian function 16
look-up-table 141

M
Madaline 51
Maximum eigenfilter 192
McCulloch 2
mean 242
mean-square error 110
Mercer's theorem 182
minimal capture zone 119
Minimum distance classifier 250
modal matrix 77
modules 5
momentum term 64, 120
multilayer network 19

N
nerve terminal 4
net input 6, 8
neuron 4
NLMS rule 113
non-parametric classifiers 246
nonparametric training for classifiers 253
normal equation matrix 76
normal equations 76
normalized input space layer 21, 135
null-space 201

O
off-line learning 23
Oja's rule 191
on-line learning 23
optimal classifier 242
Orthonormal matrices 230
output dead-zones 118
output function 6, 8
output layer 20
overlay 137
overtraining 106

P
parallel model 26
parametric classifiers 246
Parametric training of classifiers 253
partially connected network 19
partition of the unity 138
pattern learning 23
pattern mode 57
PDP 3
perceptron 2
Perceptron convergence theorem 41
Perceptrons 3
Pitts 2
placement of discriminant functions 253
Power 191
principal component analysis 193
principal components 193
probability density function 242
pruning set 155
Pseudo-Inverse 230

Q
QR decomposition 236
Quasi-Newton methods 83

R
Radial Basis Functions Networks 8, 123
random variable 242
receptors 5
recurrent networks 19
refinement step 155
regions 5
regularization 106, 108
regularization factor 86
regularization theory 128
reinforcement learning 22
Rosenblatt 2

S
series-parallel model 26
Similarity measure 190
singlelayer network 19
Slack variables 180
Soft margin 179
specialised learning architecture 29
spread heuristics 126
spurious states 220
steepest descent 55
stochastic learning 23
stopped search 106
supervised learning 22
support 137
support vector machine 182
Support Vector Machines 252
Support vectors 174
SVD decomposition 230
synapses 4, 8

T
tensor product 146
test set 106
time constant 79
topology conserving mapping 139
training 23
training the classifier 252

U
uniform projection principle 139
universal approximators 39, 123
Unsupervised learning 23, 190

V
validation set 106
variance 242

W
weight 8
Widrow-Hoff learning algorithm 54
Widrow-Hoff learning rule 3
Wiener-Hopf equations 112
